WorldWideScience

Sample records for geographic imputation approaches

  1. Evaluating geographic imputation approaches for zip code level data: an application to a study of pediatric diabetes

    Directory of Open Access Journals (Sweden)

    Puett Robin C

    2009-10-01

    Full Text Available Abstract Background There is increasing interest in the study of place effects on health, facilitated in part by geographic information systems. Incomplete or missing address information reduces geocoding success. Several geographic imputation methods have been suggested to overcome this limitation. Accuracy evaluation of these methods can be focused at the level of individuals and at higher group levels (e.g., spatial distribution). Methods We evaluated the accuracy of eight geo-imputation methods for address allocation from ZIP codes to census tracts at the individual and group level. The spatial apportioning approaches underlying the imputation methods included four fixed (deterministic) and four random (stochastic) allocation methods using land area, total population, population under age 20, and race/ethnicity as weighting factors. Data included more than 2,000 geocoded cases of diabetes mellitus among youth aged 0-19 in four U.S. regions. The imputed distribution of cases across tracts was compared to the true distribution using a chi-squared statistic. Results At the individual level, population-weighted (total or under age 20) fixed allocation showed the greatest accuracy, with correct census tract assignments averaging 30.01% across all regions, followed by the race/ethnicity-weighted random method (23.83%). The true distribution of cases across census tracts was that 58.2% of tracts exhibited no cases, 26.2% had one case, 9.5% had two cases, and less than 3% had three or more. This distribution was best captured by the random allocation methods, with no significant differences (p-value > 0.90), whereas distributions based on the fixed allocation methods differed significantly from the true distribution. Conclusion Fixed imputation methods seemed to yield the greatest accuracy at the individual level, suggesting their use for studies of area-level environmental exposures; however, fixed methods result in artificial clusters in single census tracts.
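
    The fixed versus random allocation schemes compared above can be illustrated with a short sketch; the tract identifiers, population counts, and weighting choice below are hypothetical, not the study's data or code.

      # Sketch of ZIP-to-tract geo-imputation: fixed (deterministic) vs. random
      # (stochastic) allocation using population weights. Hypothetical data.
      import numpy as np

      rng = np.random.default_rng(0)

      # Census tracts inside one ZIP code and their under-20 population counts.
      tracts = ["tract_A", "tract_B", "tract_C"]
      pop_under_20 = np.array([1200.0, 300.0, 500.0])
      weights = pop_under_20 / pop_under_20.sum()

      def impute_fixed():
          """Fixed allocation: every case goes to the highest-weight tract."""
          return tracts[int(np.argmax(weights))]

      def impute_random():
          """Random allocation: draw a tract with probability equal to its weight."""
          return rng.choice(tracts, p=weights)

      cases_in_zip = 10
      print([impute_fixed() for _ in range(cases_in_zip)])   # all 'tract_A'
      print([impute_random() for _ in range(cases_in_zip)])  # spread across tracts

    In expectation, the random rule reproduces the tract-level case distribution, while the fixed rule concentrates every case in one tract, which is the artificial clustering noted in the conclusion above.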

  2. Estimating the accuracy of geographical imputation

    Directory of Open Access Journals (Sweden)

    Boscoe Francis P

    2008-01-01

    Full Text Available Abstract Background To reduce the number of non-geocoded cases, researchers and organizations sometimes include cases geocoded to postal code centroids along with cases geocoded with the greater precision of a full street address. Some analysts then use the postal code to assign information to the cases from finer-level geographies such as a census tract. Assignment is commonly completed using either a postal centroid or a geographical imputation method that assigns a location by using both the demographic characteristics of the case and the population characteristics of the postal delivery area. To date no systematic evaluation of geographical imputation methods ("geo-imputation") has been completed. The objective of this study was to determine the accuracy of census tract assignment using geo-imputation. Methods Using a large dataset of breast, prostate and colorectal cancer cases reported to the New Jersey Cancer Registry, we determined how often cases were assigned to the correct census tract using alternate strategies of demographic-based geo-imputation, and using assignments obtained from postal code centroids. Assignment accuracy was measured by comparing the tract assigned with the tract originally identified from the full street address. Results Assigning cases to census tracts using the race/ethnicity population distribution within a postal code resulted in more correctly assigned cases than using postal code centroids. The addition of age characteristics increased the match rates even further. Match rates were highly dependent on both the geographic distribution of race/ethnicity groups and population density. Conclusion Geo-imputation appears to offer some advantages and no serious drawbacks as compared with the alternative of assigning cases to census tracts based on postal code centroids. For a specific analysis, researchers will still need to consider the potential impact of geocoding quality on their results.

  3. An imputation approach for oligonucleotide microarrays.

    Directory of Open Access Journals (Sweden)

    Ming Li

    Full Text Available Oligonucleotide microarrays are commonly adopted for detecting and quantifying the abundance of molecules in biological samples. Analysis of microarray data starts with recording and interpreting hybridization signals from CEL images. However, many CEL images may be blemished by noise from various sources, observed as "bright spots", "dark clouds", and "shadowy circles", etc. It is crucial that these image defects are correctly identified and properly processed. Existing approaches mainly focus on detecting defect areas and removing affected intensities. In this article, we propose to use a mixed effect model for imputing the affected intensities. The proposed imputation procedure is a single-array-based approach which does not require any biological replicate or between-array normalization. We further examine its performance by using Affymetrix high-density SNP arrays. The results show that this imputation procedure significantly reduces genotyping error rates. We also discuss the necessary adjustments for its potential extension to other oligonucleotide microarrays, such as gene expression profiling. The R source code for the implementation of the approach is freely available upon request.

  4. A web-based approach to data imputation

    KAUST Repository

    Li, Zhixu

    2013-10-24

    In this paper, we present WebPut, a prototype system that adopts a novel web-based approach to the data imputation problem. Towards this, WebPut utilizes the available information in an incomplete database in conjunction with the data consistency principle. Moreover, WebPut extends effective Information Extraction (IE) methods for the purpose of formulating web search queries that are capable of effectively retrieving missing values with high accuracy. WebPut employs a confidence-based scheme that efficiently leverages our suite of data imputation queries to automatically select the most effective imputation query for each missing value. A greedy iterative algorithm is proposed to schedule the imputation order of the different missing values in a database, and in turn the issuing of their corresponding imputation queries, for improving the accuracy and efficiency of WebPut. Several optimization techniques are also proposed to reduce the cost of estimating the confidence of imputation queries at both the tuple level and the database level. Experiments based on several real-world data collections demonstrate not only the effectiveness of WebPut compared to existing approaches, but also the efficiency of our proposed algorithms and optimization techniques. © 2013 Springer Science+Business Media New York.

  5. A SPATIOTEMPORAL APPROACH FOR HIGH RESOLUTION TRAFFIC FLOW IMPUTATION

    Energy Technology Data Exchange (ETDEWEB)

    Han, Lee [University of Tennessee, Knoxville (UTK)]; Chin, Shih-Miao [ORNL]; Hwang, Ho-Ling [ORNL]

    2016-01-01

    Along with the rapid development of Intelligent Transportation Systems (ITS), traffic data collection technologies have been evolving dramatically. The emergence of innovative data collection technologies such as Remote Traffic Microwave Sensor (RTMS), Bluetooth sensor, GPS-based Floating Car method, automated license plate recognition (ALPR) (1), etc., creates an explosion of traffic data, which brings transportation engineering into the new era of Big Data. However, despite the advance of technologies, the missing data issue is still inevitable and has posed great challenges for research such as traffic forecasting, real-time incident detection and management, dynamic route guidance, and massive evacuation optimization, because the degree of success of these endeavors depends on the timely availability of relatively complete and reasonably accurate traffic data. A thorough literature review suggests most current imputation models, if not all, focus largely on the temporal nature of the traffic data and fail to consider that traffic stream characteristics at a certain location are closely related to those at neighboring locations, and thus fail to utilize these correlations for data imputation. To this end, this paper presents a Kriging-based spatiotemporal data imputation approach that is able to fully utilize the spatiotemporal information underlying traffic data. Imputation performance of the proposed approach was tested using simulated scenarios and achieved stable imputation accuracy. Moreover, the proposed Kriging imputation model is more flexible compared to current models.
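
    A minimal ordinary-kriging sketch of the spatiotemporal idea is shown below; the exponential variogram, its parameters, and the space-time distance scaling are illustrative assumptions rather than the calibrated model of the paper.

      # Minimal ordinary kriging of a missing traffic count from spatiotemporal
      # neighbours. Variogram model and parameters are illustrative assumptions.
      import numpy as np

      def variogram(h, sill=1.0, range_=5.0):
          """Exponential semivariogram."""
          return sill * (1.0 - np.exp(-h / range_))

      def st_distance(p, q, time_scale=0.5):
          """Combine spatial (km) and temporal (min) separation into one distance."""
          dx, dy, dt = p[0] - q[0], p[1] - q[1], p[2] - q[2]
          return np.sqrt(dx**2 + dy**2 + (time_scale * dt)**2)

      def krige(obs_pts, obs_vals, target):
          n = len(obs_pts)
          G = np.ones((n + 1, n + 1))      # kriging system with unbiasedness row
          G[n, n] = 0.0
          for i in range(n):
              for j in range(n):
                  G[i, j] = variogram(st_distance(obs_pts[i], obs_pts[j]))
          g = np.ones(n + 1)
          g[:n] = [variogram(st_distance(p, target)) for p in obs_pts]
          w = np.linalg.solve(G, g)[:n]    # kriging weights (sum to 1)
          return float(w @ obs_vals)

      # (x km, y km, t minutes) for neighbouring detectors and time slices.
      pts = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 0.0, 5.0), (1.0, 0.0, 5.0)]
      vals = np.array([410.0, 390.0, 430.0, 405.0])
      print(krige(pts, vals, target=(0.5, 0.0, 2.5)))  # imputed flow at the gap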

  6. TRIP: An interactive retrieving-inferring data imputation approach

    KAUST Repository

    Li, Zhixu

    2016-06-25

    Data imputation aims at filling in missing attribute values in databases. Existing imputation approaches to non-quantitative string data can be roughly put into two categories: (1) inferring-based approaches [2], and (2) retrieving-based approaches [1]. Specifically, the inferring-based approaches find substitutes or estimations for the missing values from the complete part of the data set. However, they typically fall short in filling in unique missing attribute values which do not exist in the complete part of the data set [1]. The retrieving-based approaches resort to external resources for help by formulating proper web search queries to retrieve web pages containing the missing values from the Web, and then extracting the missing values from the retrieved web pages [1]. This web-based retrieving approach achieves high imputation precision and recall, but on the other hand issues a large number of web search queries, which brings a large overhead [1]. © 2016 IEEE.

  7. Improving accuracy of rare variant imputation with a two-step imputation approach

    DEFF Research Database (Denmark)

    Kreiner-Møller, Eskil; Medina-Gomez, Carolina; Uitterlinden, André G;

    2015-01-01

    Genotype imputation has been the pillar of the success of genome-wide association studies (GWAS) for identifying common variants associated with common diseases. However, most GWAS have been run using only 60 HapMap samples as reference for imputation, meaning less frequent and rare variants not ...... in the low-frequency spectrum and is a cost-effective strategy in large epidemiological studies....

  8. Is missing geographic positioning system data in accelerometry studies a problem, and is imputation the solution?

    Directory of Open Access Journals (Sweden)

    Kristin Meseck

    2016-05-01

    Full Text Available The main purpose of the present study was to assess the impact of global positioning system (GPS) signal lapse on physical activity analyses, discover any existing associations between missing GPS data and environmental and demographic attributes, and to determine whether imputation is an accurate and viable method for correcting GPS data loss. Accelerometer and GPS data of 782 participants from 8 studies were pooled to represent a range of lifestyles and interactions with the built environment. Periods of GPS signal lapse were identified and extracted. Generalised linear mixed models were run with the number of lapses and the length of lapses as outcomes. The signal lapses were imputed using a simple ruleset, and imputation was validated against person-worn camera imagery. A final generalised linear mixed model was used to identify the difference between the amount of GPS minutes pre- and post-imputation for the activity categories of sedentary, light, and moderate-to-vigorous physical activity. Over 17% of the dataset consisted of GPS data lapses. No strong associations were found between increasing lapse length and number of lapses and the demographic and built environment variables. A significant difference was found between the pre- and post-imputation minutes for each activity category. No demographic or environmental bias was found for length or number of lapses, but imputation of GPS data may make a significant difference for inclusion of physical activity data that occurred during a lapse. Imputing GPS data lapses is a viable technique for returning spatial context to accelerometer data and improving the completeness of the dataset.

  9. Missing Data and Multiple Imputation: An Unbiased Approach

    Science.gov (United States)

    Foy, M.; VanBaalen, M.; Wear, M.; Mendez, C.; Mason, S.; Meyers, V.; Alexander, D.; Law, J.

    2014-01-01

    The default method of dealing with missing data in statistical analyses is to only use the complete observations (complete case analysis), which can lead to unexpected bias when data do not meet the assumption of missing completely at random (MCAR). For the assumption of MCAR to be met, missingness cannot be related to either the observed or unobserved variables. A less stringent assumption, missing at random (MAR), requires that missingness not be associated with the value of the missing variable itself, but can be associated with the other observed variables. When data are truly MAR as opposed to MCAR, the default complete case analysis method can lead to biased results. There are statistical options available to adjust for data that are MAR, including multiple imputation (MI) which is consistent and efficient at estimating effects. Multiple imputation uses informing variables to determine statistical distributions for each piece of missing data. Then multiple datasets are created by randomly drawing on the distributions for each piece of missing data. Since MI is efficient, only a limited number, usually less than 20, of imputed datasets are required to get stable estimates. Each imputed dataset is analyzed using standard statistical techniques, and then results are combined to get overall estimates of effect. A simulation study will be demonstrated to show the results of using the default complete case analysis, and MI in a linear regression of MCAR and MAR simulated data. Further, MI was successfully applied to the association study of CO2 levels and headaches when initial analysis showed there may be an underlying association between missing CO2 levels and reported headaches. Through MI, we were able to show that there is a strong association between average CO2 levels and the risk of headaches. Each unit increase in CO2 (mmHg) resulted in a doubling in the odds of reported headaches.
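
    As a hedged sketch of the workflow described above (impute several times, analyse each completed dataset, combine with Rubin's rules), the following uses simulated data and a simple normal regression imputation model; it illustrates the general MI recipe, not the analysis reported in this record.

      # Sketch of multiple imputation with Rubin's-rules pooling on simulated data.
      import numpy as np

      rng = np.random.default_rng(1)
      n, m = 200, 10                       # sample size, number of imputations

      # Simulate (x, y) and make y MAR: y more likely missing when x is large.
      x = rng.normal(size=n)
      y = 2.0 + 0.5 * x + rng.normal(scale=1.0, size=n)
      miss = rng.random(n) < 1 / (1 + np.exp(-(x - 0.5)))
      obs = ~miss

      def fit_slope(x, y):
          """OLS slope and its variance for y = a + b*x."""
          X = np.column_stack([np.ones_like(x), x])
          beta, res, *_ = np.linalg.lstsq(X, y, rcond=None)
          sigma2 = res[0] / (len(y) - 2)
          var_b = sigma2 * np.linalg.inv(X.T @ X)[1, 1]
          return beta[1], var_b

      estimates, variances = [], []
      for _ in range(m):
          # Imputation model: regress y on x among observed cases, then draw each
          # missing y from the fitted normal (imputation-model parameters treated
          # as fixed here, for brevity).
          b_obs = np.polyfit(x[obs], y[obs], 1)
          resid_sd = np.std(y[obs] - np.polyval(b_obs, x[obs]), ddof=2)
          y_imp = y.copy()
          y_imp[miss] = np.polyval(b_obs, x[miss]) + rng.normal(scale=resid_sd,
                                                                size=miss.sum())
          b, v = fit_slope(x, y_imp)
          estimates.append(b)
          variances.append(v)

      # Rubin's rules: pooled estimate and total variance.
      q_bar = np.mean(estimates)
      u_bar = np.mean(variances)              # within-imputation variance
      b_var = np.var(estimates, ddof=1)       # between-imputation variance
      total_var = u_bar + (1 + 1 / m) * b_var
      print(q_bar, np.sqrt(total_var))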

  10. A Fully Conditional Specification Approach to Multilevel Imputation of Categorical and Continuous Variables.

    Science.gov (United States)

    Enders, Craig K; Keller, Brian T; Levy, Roy

    2017-05-29

    Specialized imputation routines for multilevel data are widely available in software packages, but these methods are generally not equipped to handle a wide range of complexities that are typical of behavioral science data. In particular, existing imputation schemes differ in their ability to handle random slopes, categorical variables, differential relations at Level-1 and Level-2, and incomplete Level-2 variables. Given the limitations of existing imputation tools, the purpose of this manuscript is to describe a flexible imputation approach that can accommodate a diverse set of 2-level analysis problems that includes any of the aforementioned features. The procedure employs a fully conditional specification (also known as chained equations) approach with a latent variable formulation for handling incomplete categorical variables. Computer simulations suggest that the proposed procedure works quite well, with trivial biases in most cases. We provide a software program that implements the imputation strategy, and we use an artificial data set to illustrate its use. (PsycINFO Database Record (c) 2017 APA, all rights reserved).

  11. Dealing with missing covariates in epidemiologic studies: a comparison between multiple imputation and a full Bayesian approach.

    Science.gov (United States)

    Erler, Nicole S; Rizopoulos, Dimitris; Rosmalen, Joost van; Jaddoe, Vincent W V; Franco, Oscar H; Lesaffre, Emmanuel M E H

    2016-07-30

    Incomplete data are generally a challenge to the analysis of most large studies. The current gold standard to account for missing data is multiple imputation, and more specifically multiple imputation with chained equations (MICE). Numerous studies have been conducted to illustrate the performance of MICE for missing covariate data. The results show that the method works well in various situations. However, less is known about its performance in more complex models, specifically when the outcome is multivariate as in longitudinal studies. In current practice, the multivariate nature of the longitudinal outcome is often neglected in the imputation procedure, or only the baseline outcome is used to impute missing covariates. In this work, we evaluate the performance of MICE using different strategies to include a longitudinal outcome into the imputation models and compare it with a fully Bayesian approach that jointly imputes missing values and estimates the parameters of the longitudinal model. Results from simulation and a real data example show that MICE requires the analyst to correctly specify which components of the longitudinal process need to be included in the imputation models in order to obtain unbiased results. The full Bayesian approach, on the other hand, does not require the analyst to explicitly specify how the longitudinal outcome enters the imputation models. It performed well under different scenarios. Copyright © 2016 John Wiley & Sons, Ltd.

  12. Imputation And Classification Of Missing Data Using Least Square Support Vector Machines – A New Approach In Dementia Diagnosis

    Directory of Open Access Journals (Sweden)

    T R Sivapriya

    2012-07-01

    Full Text Available This paper presents a comparison of different data imputation approaches used in filling missing data and proposes a combined approach to estimate missing attribute values accurately in a patient database. The present study suggests a more robust technique that is likely to supply a value closer to the one that is missing for effective classification and diagnosis. Initially, data are clustered and the z-score method is used to select possible values of an instance with missing attribute values. Then a multiple imputation method using LSSVM (Least Squares Support Vector Machine) is applied to select the most appropriate values for the missing attributes. Five imputed datasets have been used to demonstrate the performance of the proposed method. Experimental results show that our method outperforms conventional methods of multiple imputation and mean substitution. Moreover, the proposed method, CZLSSVM (Clustered Z-score Least Square Support Vector Machine), has been evaluated in two classification problems for incomplete data. The efficacy of the imputation methods has been evaluated using an LSSVM classifier. Experimental results indicate that classification accuracy increases with CZLSSVM when missing attribute values are estimated. It is found that CZLSSVM outperforms other data imputation approaches such as decision trees, rough sets, artificial neural networks, K-NN (K-Nearest Neighbour) and SVM. Further, it is observed that CZLSSVM yields 95 per cent accuracy and better prediction capability than the other methods tested in the study.

  13. Random property allocation: A novel geographic imputation procedure based on a complete geocoded address file.

    Science.gov (United States)

    Walter, Scott R; Rose, Nectarios

    2013-09-01

    Allocating an incomplete address to randomly selected property coordinates within a locality, known as random property allocation, has many advantages over other geoimputation techniques. We compared the performance of random property allocation to four other methods under various conditions using a simulation approach. All methods performed well for large spatial units, but random property allocation was the least prone to bias and error under volatile scenarios with small units and low prevalence. Both its coordinate-based approach and the random process of assignment contribute to its increased accuracy and reduced bias in many scenarios. Hence it is preferable to fixed or areal geoimputation for many epidemiological and surveillance applications.
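
    In code, the random property allocation idea reduces to sampling one geocoded property from the address file of the known locality; the locality names and coordinates below are invented for illustration.

      # Random property allocation: impute coordinates for an address known only
      # to locality level by sampling a geocoded property within that locality.
      import random

      # Hypothetical complete geocoded address file, keyed by locality.
      property_coords = {
          "Newtown": [(-33.897, 151.179), (-33.898, 151.182), (-33.896, 151.185)],
          "Glebe":   [(-33.879, 151.186), (-33.881, 151.190)],
      }

      def random_property_allocation(locality, rng=random):
          """Return the coordinates of a randomly selected property."""
          return rng.choice(property_coords[locality])

      random.seed(42)
      print(random_property_allocation("Newtown"))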

  14. Identifying Geographic Clusters: A Network Analytic Approach

    CERN Document Server

    Catini, Roberto; Penner, Orion; Riccaboni, Massimo

    2015-01-01

    In recent years there has been a growing interest in the role of networks and clusters in the global economy. Despite being a popular research topic in economics, sociology and urban studies, geographical clustering of human activity has often been studied by means of predetermined geographical units such as administrative divisions and metropolitan areas. This approach is intrinsically time invariant and does not allow one to differentiate between different activities. Our goal in this paper is to present a new methodology for identifying clusters that can be applied to different empirical settings. We use a graph approach based on k-shell decomposition to analyze world biomedical research clusters based on PubMed scientific publications. We identify research institutions and locate their activities in geographical clusters. Leading areas of scientific production and their top performing research institutions are consistently identified at different geographic scales.

  15. A suggested approach for imputation of missing dietary data for young children in daycare

    Directory of Open Access Journals (Sweden)

    June Stevens

    2015-12-01

    Full Text Available Background: Parent-reported 24-h diet recalls are an accepted method of estimating intake in young children. However, many children eat while at childcare, making accurate proxy reports by parents difficult. Objective: The goal of this study was to demonstrate a method to impute missing weekday lunch and daytime snack nutrient data for daycare children and to explore the concurrent predictive and criterion validity of the method. Design: Data were from children aged 2-5 years in the My Parenting SOS project (n=308; 870 24-h diet recalls). Mixed models were used to simultaneously predict breakfast, dinner, and evening snacks (B+D+ES); lunch; and daytime snacks for all children after adjusting for age, sex, and body mass index (BMI). From these models, we imputed the missing weekday daycare lunches by interpolation using the mean lunch to B+D+ES [L/(B+D+ES)] ratio among non-daycare children on weekdays and the L/(B+D+ES) ratio for all children on weekends. Daytime snack data were used to impute snacks. Results: The reported mean (± standard deviation) weekday intake was lower for daycare children [725 (±324) kcal] compared to non-daycare children [1,048 (±463) kcal]. Weekend intake for all children was 1,173 (±427) kcal. After imputation, weekday caloric intake for daycare children was 1,230 (±409) kcal. Daily intakes that included imputed data were associated with age and sex but not with BMI. Conclusion: This work indicates that imputation is a promising method for improving the precision of daily nutrient data from young children.
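
    As a worked illustration of the ratio interpolation described above: if a daycare child's reported breakfast, dinner and evening snack sum to 725 kcal, and the mean L/(B+D+ES) ratio estimated from children with complete weekday data were, say, 0.45 (an assumed value, not the study's estimate), the missing lunch would be imputed as 0.45 x 725, roughly 326 kcal.

      # Worked example of the ratio-based imputation of a missing daycare lunch.
      # The mean ratio below is an assumed value, not the study's estimate.
      mean_lunch_ratio = 0.45          # lunch kcal per kcal of B+D+ES

      # A daycare child's reported weekday intake (lunch missing).
      breakfast, dinner, evening_snack = 250.0, 350.0, 125.0
      b_d_es = breakfast + dinner + evening_snack          # 725 kcal

      imputed_lunch = mean_lunch_ratio * b_d_es            # 326.25 kcal
      total_with_imputation = b_d_es + imputed_lunch       # 1051.25 kcal
      print(imputed_lunch, total_with_imputation)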

  16. Imputation of missing genotypes: an empirical evaluation of IMPUTE

    Directory of Open Access Journals (Sweden)

    Steinberg Martin H

    2008-12-01

    Full Text Available Abstract Background Imputation of missing genotypes is becoming a very popular solution for synchronizing genotype data collected with different microarray platforms, but the effects of ethnic background, subject ascertainment, and amount of missing data on the accuracy of imputation are not well understood. Results We evaluated the accuracy of the program IMPUTE to generate the genotype data of partially or fully untyped single nucleotide polymorphisms (SNPs). The program uses a model-based approach to imputation that reconstructs the genotype distribution given a set of referent haplotypes and the observed data, and uses this distribution to compute the marginal probability of each missing genotype for each individual subject, which is used to impute the missing data. We assembled genome-wide data from five different studies and three different ethnic groups comprising Caucasians, African Americans and Asians. We randomly removed genotype data and then compared the observed genotypes with those generated by IMPUTE. Our analysis shows 97% median accuracy in Caucasian subjects when less than 10% of the SNPs are untyped and missing genotypes are accepted regardless of their posterior probability. The median accuracy increases to 99% when we require 0.95 minimum posterior probability for an imputed genotype to be acceptable. The accuracy decreases to 86% or 94% when subjects are African Americans or Asians. We propose a strategy to improve the accuracy by leveraging the level of admixture in African Americans. Conclusion Our analysis suggests that IMPUTE is very accurate in samples of Caucasian origin, slightly less accurate in samples of Asian background, but substantially less accurate in samples of admixed background such as African Americans. Sample size and ascertainment do not seem to affect the accuracy of imputation.

  17. Secure Geographic Routing Protocols: Issues and Approaches

    CERN Document Server

    sookhak, Mehdi; Haghparast, Mahboobeh; ISnin, Ismail Fauzi

    2011-01-01

    In recent years, routing protocols in wireless sensor networks (WSNs) have been substantially investigated by researchers. Most state-of-the-art surveys have focused on reviewing wireless sensor networks in general. In this paper we review the existing secure geographic routing protocols for wireless sensor networks (WSNs) and also provide a qualitative comparison of them.

  18. Secure Geographic Routing Protocols: Issues and Approaches

    Directory of Open Access Journals (Sweden)

    Mehdi sookhak

    2011-09-01

    Full Text Available In recent years, routing protocols in wireless sensor networks (WSNs) have been substantially investigated by researchers. Most state-of-the-art surveys have focused on reviewing wireless sensor networks in general. In this paper we review the existing secure geographic routing protocols for wireless sensor networks (WSNs) and also provide a qualitative comparison of them.

  19. Calibrated hot deck imputation for numerical data under edit restrictions

    NARCIS (Netherlands)

    de Waal, A.G.; Coutinho, Wieger; Shlomo, Natalie

    2017-01-01

    We develop a non-parametric imputation method for item non-response based on the well-known hot-deck approach. The proposed imputation method is developed for imputing numerical data and ensures that all record-level edit rules are satisfied and that previously estimated or known totals are exactly preserved.
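
    A much-simplified sketch of donor-based (hot-deck) imputation under an edit rule is shown below; the calibration to previously estimated totals developed in the paper is not reproduced, and the records and edit rule are invented.

      # Simplified hot-deck imputation under an edit rule: a missing item value is
      # copied from the nearest complete donor whose value keeps the record valid.
      import numpy as np

      # Records: columns are (turnover, costs); costs is missing (nan) in record 2.
      data = np.array([
          [100.0,  60.0],
          [ 80.0,  55.0],
          [ 90.0, np.nan],
          [200.0, 120.0],
      ])

      def edit_rule_ok(record):
          """Edit rule: costs must not exceed turnover."""
          return record[1] <= record[0]

      recipient = 2
      donors = [i for i in range(len(data)) if not np.isnan(data[i, 1])]
      # Rank donors by distance on the observed variable (turnover).
      donors.sort(key=lambda i: abs(data[i, 0] - data[recipient, 0]))

      for d in donors:
          candidate = data[recipient].copy()
          candidate[1] = data[d, 1]          # copy the donor's value
          if edit_rule_ok(candidate):
              data[recipient, 1] = candidate[1]
              break

      print(data[recipient])                 # costs imputed from nearest valid donor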

  20. Multiple imputation using chained equations: Issues and guidance for practice.

    Science.gov (United States)

    White, Ian R; Royston, Patrick; Wood, Angela M

    2011-02-20

    Multiple imputation by chained equations is a flexible and practical approach to handling missing data. We describe the principles of the method and show how to impute categorical and quantitative variables, including skewed variables. We give guidance on how to specify the imputation model and how many imputations are needed. We describe the practical analysis of multiply imputed data, including model building and model checking. We stress the limitations of the method and discuss the possible pitfalls. We illustrate the ideas using a data set in mental health, giving Stata code fragments.
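
    A minimal sketch of the chained-equations cycle described here is given below: each incomplete variable is regressed in turn on the others and its missing entries are redrawn from the fitted model. The simulated data and the use of plain linear imputation models are assumptions for illustration; in practice packages such as R's mice or Stata's mi impute chained implement the full method.

      # Minimal chained-equations (FCS) sketch: iterate over incomplete variables,
      # regress each on the others, and redraw its missing entries. Simulated data
      # and linear imputation models only, for illustration.
      import numpy as np

      rng = np.random.default_rng(3)
      n = 300
      x1 = rng.normal(size=n)
      x2 = 0.7 * x1 + rng.normal(scale=0.7, size=n)
      x3 = 0.4 * x1 - 0.3 * x2 + rng.normal(scale=0.8, size=n)
      X = np.column_stack([x1, x2, x3])

      # Make x2 and x3 partly missing.
      mask = np.zeros_like(X, dtype=bool)
      mask[rng.random(n) < 0.3, 1] = True
      mask[rng.random(n) < 0.3, 2] = True
      X_imp = X.copy()
      X_imp[mask] = np.nan

      # Start from mean imputation, then cycle.
      col_means = np.nanmean(X_imp, axis=0)
      for j in range(X.shape[1]):
          X_imp[mask[:, j], j] = col_means[j]

      for _ in range(10):                      # FCS cycles
          for j in range(X.shape[1]):
              if not mask[:, j].any():
                  continue
              others = [k for k in range(X.shape[1]) if k != j]
              A = np.column_stack([np.ones(n), X_imp[:, others]])
              obs = ~mask[:, j]
              beta, *_ = np.linalg.lstsq(A[obs], X_imp[obs, j], rcond=None)
              resid_sd = np.std(X_imp[obs, j] - A[obs] @ beta, ddof=len(beta))
              X_imp[mask[:, j], j] = (A[mask[:, j]] @ beta
                                      + rng.normal(scale=resid_sd,
                                                   size=mask[:, j].sum()))

      print(np.corrcoef(X_imp, rowvar=False).round(2))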

  1. Evaluation of the imputation performance of the program IMPUTE in an admixed sample from Mexico City using several model designs

    Directory of Open Access Journals (Sweden)

    Krithika S

    2012-05-01

    Full Text Available Abstract Background We explored the imputation performance of the program IMPUTE in an admixed sample from Mexico City. The following issues were evaluated: (a) the impact of different reference panels (HapMap vs. 1000 Genomes) on imputation; (b) potential differences in imputation performance between single-step vs. two-step (phasing and imputation) approaches; (c) the effect of different INFO score thresholds on imputation performance; and (d) imputation performance in common vs. rare markers. Methods The sample from Mexico City comprised 1,310 individuals genotyped with the Affymetrix 5.0 array. We randomly masked 5% of the markers directly genotyped on chromosome 12 (n = 1,046) and compared the imputed genotypes with the microarray genotype calls. Imputation was carried out with the program IMPUTE. The concordance rates between the imputed and observed genotypes were used as a measure of imputation accuracy, and the proportion of non-missing genotypes as a measure of imputation efficacy. Results The single-step imputation approach produced slightly higher concordance rates than the two-step strategy (99.1% vs. 98.4% when using the HapMap phase II combined panel), but at the expense of a lower proportion of non-missing genotypes (85.5% vs. 90.1%). The 1000 Genomes reference sample produced concordance rates similar to the HapMap phase II panel (98.4% for both datasets, using the two-step strategy). However, the 1000 Genomes reference sample substantially increased the proportion of non-missing genotypes (94.7% vs. 90.1%). Conclusions The program IMPUTE had an excellent imputation performance for common alleles in an admixed sample from Mexico City, which has primarily Native American (62%) and European (33%) contributions. Genotype concordances were higher than 98.4% using all the imputation strategies, in spite of the fact that no Native American samples are present in the HapMap and 1000 Genomes reference panels.
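
    The mask-and-compare evaluation design used in this study can be sketched generically as follows; the imputation step here is a trivial per-marker mode rule standing in for IMPUTE, and all genotype data are simulated.

      # Toy version of the evaluation design: mask 5% of known genotypes, impute
      # them (with a per-marker mode rule, not IMPUTE), and report the concordance
      # between imputed and original calls.
      import numpy as np

      rng = np.random.default_rng(7)
      n_ind, n_snp = 500, 200
      # Genotypes coded 0/1/2 copies of the minor allele; frequencies are made up.
      freqs = rng.uniform(0.05, 0.5, size=n_snp)
      geno = rng.binomial(2, freqs, size=(n_ind, n_snp))

      # Randomly mask 5% of the calls.
      mask = rng.random(geno.shape) < 0.05
      observed = np.where(mask, -1, geno)            # -1 marks a masked call

      # Stand-in imputation: per-marker most common observed genotype.
      imputed = observed.copy()
      for j in range(n_snp):
          obs_col = observed[observed[:, j] >= 0, j]
          mode = np.bincount(obs_col, minlength=3).argmax()
          imputed[observed[:, j] < 0, j] = mode

      concordance = (imputed[mask] == geno[mask]).mean()
      print(f"concordance on masked calls: {concordance:.3f}")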

  2. Data driven estimation of imputation error-a strategy for imputation with a reject option

    DEFF Research Database (Denmark)

    Bak, Nikolaj; Hansen, Lars Kai

    2016-01-01

    with missing values by weighing the "true errors" by similarity. The method can also be used to test the performance of different imputation methods. A universal numerical threshold of acceptable error cannot be set since this will differ according to the data, research question, and analysis method...... indiscriminately. We note that the effects of imputation can be strongly dependent on what is missing. To help make decisions about which records should be imputed, we propose to use a machine learning approach to estimate the imputation error for each case with missing data. The method is thought....... The effect of threshold can be estimated using the complete cases. The user can set an a priori relevant threshold for what is acceptable or use cross validation with the final analysis to choose the threshold. The choice can be presented along with argumentation for the choice rather than holding...
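
    A rough sketch of the general recipe described here (learn, from complete cases with values artificially deleted, how large the imputation error tends to be for a record like this one, and reject imputation when the predicted error exceeds a chosen threshold) is given below; the kNN imputer, kNN error model, and threshold are illustrative stand-ins, not the authors' implementation.

      # Sketch of a reject-option scheme: estimate the likely imputation error for
      # each incomplete record from complete cases with the value artificially
      # deleted. kNN stand-ins are used for both the imputer and the error model.
      import numpy as np

      rng = np.random.default_rng(11)
      n = 400
      x = rng.normal(size=(n, 2))
      y = x[:, 0] + 0.5 * x[:, 1] + rng.normal(scale=0.3, size=n)

      def knn_impute(x_query, x_ref, y_ref, k=10):
          """Impute y for x_query as the mean y of its k nearest reference rows."""
          d = np.linalg.norm(x_ref - x_query, axis=1)
          return y_ref[np.argsort(d)[:k]].mean()

      # 1) On complete cases, delete y, re-impute it, and record the error.
      errors = np.array([
          knn_impute(x[i], np.delete(x, i, 0), np.delete(y, i, 0)) - y[i]
          for i in range(n)
      ])

      # 2) For a new record with y missing, predict its expected error from the
      #    errors of its nearest complete neighbours; impute only if acceptable.
      x_new = np.array([0.2, -1.5])
      d = np.linalg.norm(x - x_new, axis=1)
      expected_abs_error = np.abs(errors)[np.argsort(d)[:10]].mean()

      threshold = 0.5                               # user-chosen, data-dependent
      if expected_abs_error <= threshold:
          print("impute:", knn_impute(x_new, x, y))
      else:
          print("reject: expected error", expected_abs_error, "exceeds threshold")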

  3. On combining reference data to improve imputation accuracy.

    Directory of Open Access Journals (Sweden)

    Jun Chen

    Full Text Available Genotype imputation is an important tool in human genetics studies, which uses reference sets with known genotypes and prior knowledge on linkage disequilibrium and recombination rates to infer untyped alleles for human genetic variations at a low cost. The reference sets used by current imputation approaches are based on HapMap data, and/or on recently available next-generation sequencing (NGS) data such as data generated by the 1000 Genomes Project. However, with different coverage and call rates for different NGS data sets, how to integrate NGS data sets of different accuracy as well as previously available reference data as references in imputation is not an easy task and has not been systematically investigated. In this study, we performed a comprehensive assessment of three strategies for using NGS data and previously available reference data in genotype imputation, for both simulated and empirical data, in order to obtain guidelines for optimal reference set construction. Briefly, we considered three strategies: strategy 1 uses a single NGS data set as the reference; strategy 2 imputes samples by using multiple individual data sets of different accuracy as independent references and then combines the imputed samples, with the sample based on the higher-accuracy reference selected when overlapping occurs; and strategy 3 combines multiple available data sets into a single reference after imputing each other. We used three software packages (MACH, IMPUTE2 and BEAGLE) to assess the performance of these three strategies. Our results show that strategy 2 and strategy 3 have higher imputation accuracy than strategy 1. In particular, strategy 2 is the best strategy across all the conditions that we investigated, producing the best imputation accuracy for rare variants. Our study is helpful in guiding application of imputation methods in next generation association analyses.

  4. Missing value imputation for epistatic MAPs

    LENUS (Irish Health Repository)

    Ryan, Colm

    2010-04-20

    Abstract Background Epistatic miniarray profiling (E-MAP) is a high-throughput approach capable of quantifying aggravating or alleviating genetic interactions between gene pairs. The datasets resulting from E-MAP experiments typically take the form of a symmetric pairwise matrix of interaction scores. These datasets have a significant number of missing values - up to 35% - that can reduce the effectiveness of some data analysis techniques and prevent the use of others. An effective method for imputing interactions would therefore increase the types of possible analysis, as well as increase the potential to identify novel functional interactions between gene pairs. Several methods have been developed to handle missing values in microarray data, but it is unclear how applicable these methods are to E-MAP data because of their pairwise nature and the significantly larger number of missing values. Here we evaluate four alternative imputation strategies, three local (nearest neighbor-based) and one global (PCA-based), that have been modified to work with symmetric pairwise data. Results We identify different categories for the missing data based on their underlying cause, and show that values from the largest category can be imputed effectively. We compare local and global imputation approaches across a variety of distinct E-MAP datasets, showing that both are competitive and preferable to filling in with zeros. In addition we show that these methods are effective in an E-MAP from a different species, suggesting that pairwise imputation techniques will be increasingly useful as analogous epistasis mapping techniques are developed in different species. We show that strongly alleviating interactions are significantly more difficult to predict than strongly aggravating interactions. Finally we show that imputed interactions, generated using nearest neighbor methods, are enriched for annotations in the same manner as measured interactions.
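
    A simplified nearest-neighbour fill for a symmetric pairwise matrix, in the spirit of the local methods evaluated here but not their exact modified algorithms, might look as follows (the interaction scores are simulated).

      # Nearest-neighbour fill for a symmetric pairwise interaction matrix: a
      # missing score M[i, j] is averaged from the rows most similar to row i,
      # then the matrix is re-symmetrised. A simplified stand-in, not the paper's
      # exact modified algorithms.
      import numpy as np

      rng = np.random.default_rng(5)
      n = 30
      latent = rng.normal(size=(n, 3))
      M = latent @ latent.T + rng.normal(scale=0.2, size=(n, n))
      M = (M + M.T) / 2                              # symmetric interaction scores
      mask = rng.random((n, n)) < 0.3
      mask = mask | mask.T                           # missingness is symmetric too
      M_obs = np.where(mask, np.nan, M)

      def impute_symmetric(M_obs, k=5):
          M_imp = M_obs.copy()
          for i in range(len(M_obs)):
              for j in np.flatnonzero(np.isnan(M_obs[i])):
                  # Similarity of rows i and r over entries observed in both.
                  donors = []
                  for r in range(len(M_obs)):
                      if r == i or np.isnan(M_obs[r, j]):
                          continue
                      both = ~np.isnan(M_obs[i]) & ~np.isnan(M_obs[r])
                      if both.sum() >= 3:
                          dist = np.mean((M_obs[i, both] - M_obs[r, both]) ** 2)
                          donors.append((dist, M_obs[r, j]))
                  if donors:
                      donors.sort()
                      M_imp[i, j] = np.mean([v for _, v in donors[:k]])
          out = np.where(np.isnan(M_imp), M_imp.T, M_imp)
          return (out + out.T) / 2                   # keep the result symmetric

      M_filled = impute_symmetric(M_obs)
      print(np.nanmean(np.abs(M_filled - M)[mask])) # mean absolute imputation error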

  5. Missing Data Imputation Approach Based on Incomplete Data Clustering

    Institute of Scientific and Technical Information of China (English)

    武森; 冯小东; 单志广

    2012-01-01

    Missing data processing is an important problem of data pre-processing in the data mining field. Traditional missing data filling methods are mostly based on statistical hypotheses, such as an assumed probability distribution, which might not be the most applicable approaches for data mining of large data sets. Inspired by ROUSTIDA, an incomplete data analysis approach that does not use probability statistical methods, MIBOI is proposed for missing data imputation based on incomplete data clustering. A constraint tolerance set dissimilarity is defined for incomplete data sets of categorical variables, so that the overall dissimilarity of all the incomplete data objects in a set can be computed directly, and missing data are imputed according to the incomplete data clustering results. Empirical tests using UCI machine learning benchmark data sets show that MIBOI is effective and feasible.
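
    A toy dissimilarity for categorical records with missing entries, loosely in the spirit of the constraint tolerance set dissimilarity defined in the paper (but not its exact definition), is sketched below; the attribute values are invented.

      # Simplified dissimilarity for categorical records with missing values: a
      # missing entry is treated as compatible with any value, so only observed,
      # conflicting attributes contribute to the dissimilarity.
      def dissimilarity(a, b, missing=None):
          observed = [(x, y) for x, y in zip(a, b)
                      if x is not missing and y is not missing]
          if not observed:
              return 0.0
          return sum(x != y for x, y in observed) / len(observed)

      r1 = ("red",  "small", None)     # None marks a missing attribute value
      r2 = ("red",  "large", "round")
      r3 = ("blue", "large", "round")
      print(dissimilarity(r1, r2))     # 0.5 -> closer; clustering with r2 could
      print(dissimilarity(r1, r3))     # 1.0    lend "round" to fill r1's gap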

  6. Geographical and environmental approaches to urban malaria in Antananarivo (Madagascar

    Directory of Open Access Journals (Sweden)

    Rudant Jean-Paul

    2010-06-01

    Full Text Available Abstract Background Previous studies conducted in the urban area of Antananarivo showed a low rate of confirmed malaria cases. We used a geographical and environmental approach to investigate the contribution of environmental factors to urban malaria in Antananarivo. Methods Remote sensing data were used to locate rice fields, which were considered to be the principal mosquito breeding sites. We carried out supervised classification by the maximum likelihood method. An entomological study allowed determination of vector species from collected larval and adult mosquitoes. Mosquito infectivity was studied to assess the risk of transmission, and the type of mosquito breeding site was determined. Epidemiological data were collected from November 2006 to December 2007 from public health centres to determine malaria incidence. Polymerase chain reaction was carried out on dried blood spots from patients to detect cases of malaria. Rapid diagnostic tests were used to confirm malaria cases among febrile school children in a school survey. A geographical information system was constructed for data integration. Altitude, temperature, rainfall, population density and rice field surface area were analysed, and the effects of these factors on the occurrence of confirmed malaria cases were studied. Results Polymerase chain reaction confirmed malaria in 5.1% of the presumed cases. Entomological studies identified An. arabiensis as a potential vector. Rice fields remained the principal breeding sites. Travel reports were considered to be related to the occurrence of P. falciparum malaria cases. Conclusion Geographical and environmental factors did not show a direct relationship with malaria incidence, but they seem to ensure conditions suitable for vector development. The absence of a relationship may be due to a lack of statistical power. Despite the presence of An. arabiensis, the scarce parasite reservoir and rapid access to health care do not constitute conditions under which malaria becomes a threat.

  7. Multiple imputation for handling missing outcome data when estimating the relative risk.

    Science.gov (United States)

    Sullivan, Thomas R; Lee, Katherine J; Ryan, Philip; Salter, Amy B

    2017-09-06

    Multiple imputation is a popular approach to handling missing data in medical research, yet little is known about its applicability for estimating the relative risk. Standard methods for imputing incomplete binary outcomes involve logistic regression or an assumption of multivariate normality, whereas relative risks are typically estimated using log binomial models. It is unclear whether misspecification of the imputation model in this setting could lead to biased parameter estimates. Using simulated data, we evaluated the performance of multiple imputation for handling missing data prior to estimating adjusted relative risks from a correctly specified multivariable log binomial model. We considered an arbitrary pattern of missing data in both outcome and exposure variables, with missing data induced under missing at random mechanisms. Focusing on standard model-based methods of multiple imputation, missing data were imputed using multivariate normal imputation or fully conditional specification with a logistic imputation model for the outcome. Multivariate normal imputation performed poorly in the simulation study, consistently producing estimates of the relative risk that were biased towards the null. Despite outperforming multivariate normal imputation, fully conditional specification also produced somewhat biased estimates, with greater bias observed for higher outcome prevalences and larger relative risks. Deleting imputed outcomes from analysis datasets did not improve the performance of fully conditional specification. Both multivariate normal imputation and fully conditional specification produced biased estimates of the relative risk, presumably since both use a misspecified imputation model. Based on simulation results, we recommend researchers use fully conditional specification rather than multivariate normal imputation and retain imputed outcomes in the analysis when estimating relative risks. However, fully conditional specification is not without its limitations.

  8. Short communication: Imputation of markers on the bovine X chromosome.

    Science.gov (United States)

    Mao, Xiaowei; Johansson, Anna Maria; Sahana, Goutam; Guldbrandtsen, Bernt; De Koning, Dirk-Jan

    2016-09-01

    Imputation is a cost-effective approach to augment marker data for genomic selection and genome-wide association studies. However, most imputation studies have focused on autosomes. Here, we assessed the imputation of markers on the X chromosome in Holstein cattle for nongenotyped animals and animals genotyped with low-density (Illumina BovineLD, Illumina Inc., San Diego, CA) chips, using animals genotyped with medium-density (Illumina BovineSNP50) chips. A total of 26,884 Holstein individuals genotyped with medium-density chips were used in this study. Imputation was carried out using FImpute V2.2. The following parameters were examined: treating the pseudoautosomal region as autosomal or as X specific, different sizes of reference groups, different male/female proportions in the reference group, and cumulated degree of relationship between the reference group and target group. The imputation accuracy of markers on the X chromosome was improved if the pseudoautosomal region was treated as autosomal. Increasing the proportion of females in the reference group improved the imputation accuracy for the X chromosome. Imputation for nongenotyped animals in general had lower accuracy compared with animals genotyped with the low-density single nucleotide polymorphism array. In addition, higher cumulative pedigree relationships between the reference group and the target animal led to higher imputation accuracy. In the future, better marker coverage of the X chromosome should be developed to facilitate genomic studies involving the X chromosome.

  9. Molgenis-impute: imputation pipeline in a box.

    Science.gov (United States)

    Kanterakis, Alexandros; Deelen, Patrick; van Dijk, Freerk; Byelas, Heorhiy; Dijkstra, Martijn; Swertz, Morris A

    2015-08-19

    Genotype imputation is an important procedure in current genomic analysis such as genome-wide association studies, meta-analyses and fine mapping. Although high quality tools are available that perform the steps of this process, considerable effort and expertise is required to set up and run a best practice imputation pipeline, particularly for larger genotype datasets, where imputation has to scale out in parallel on computer clusters. Here we present MOLGENIS-impute, an 'imputation in a box' solution that seamlessly and transparently automates the set up and running of all the steps of the imputation process. These steps include genome build liftover, genotype phasing with SHAPEIT2, quality control, sample and chromosomal chunking/merging, and imputation with IMPUTE2. MOLGENIS-impute builds on MOLGENIS-compute, a simple pipeline management platform for submission and monitoring of bioinformatics tasks in High Performance Computing (HPC) environments like local/cloud servers, clusters and grids. All the required tools, data and scripts are downloaded and installed in a single step. Researchers with diverse backgrounds and expertise have tested MOLGENIS-impute at different locations and have imputed over 30,000 samples so far using the 1000 Genomes Project and new Genome of the Netherlands data as the imputation reference. The tests have been performed on PBS/SGE clusters, cloud VMs and in a grid HPC environment. MOLGENIS-impute gives priority to the ease of setting up, configuring and running an imputation. It has minimal dependencies and wraps the pipeline in a simple command line interface, without sacrificing flexibility to adapt or limiting the options of underlying imputation tools. It does not require knowledge of a workflow system or programming, and is targeted at researchers who just want to apply best practices in imputation via simple commands. It is built on the MOLGENIS compute workflow framework to enable customization.

  10. Is there a role for expectation maximization imputation in addressing missing data in research using WOMAC questionnaire? Comparison to the standard mean approach and a tutorial

    Directory of Open Access Journals (Sweden)

    Rutledge John

    2011-05-01

    Full Text Available Abstract Background Standard mean imputation for missing values in the Western Ontario and McMaster (WOMAC) Osteoarthritis Index limits the use of collected data and may lead to bias. Probability model-based imputation methods overcome such limitations but had never before been applied to the WOMAC. In this study, we compare imputation results for the expectation maximization (EM) method and the mean imputation method for the WOMAC in a cohort of total hip replacement patients. Methods WOMAC data on a consecutive cohort of 2062 patients scheduled for surgery were analyzed. Rates of missing values in each of the WOMAC items from this large cohort were used to create missing patterns in the subset of patients with complete data. EM and the WOMAC's method of imputation were then applied to fill the missing values. Summary score statistics for both methods are described through box plots and contrasted with the complete case (CC) analysis and the true score (TS). This process was repeated using a smaller sample of 200 randomly drawn patients with a higher missing rate (5 times the rates of missing values observed in the 2062 patients, capped at 45%). Results The rate of missing values per item ranged from 2.9% to 14.5%, and 1339 patients had complete data. The probability model-based EM method imputed a score for all subjects while the WOMAC's imputation method did not. Mean subscale scores were very similar for both imputation methods and were similar to the true score; however, the EM method results were more consistent with the TS after simulation. This difference became more pronounced as the number of items in a subscale increased and the sample size decreased. Conclusions The EM method provides a better alternative to the WOMAC imputation method. The EM method is more accurate and imputes data to create a complete data set. These features are very valuable for patient-reported outcomes research in which resources are limited and the WOMAC score is used in a multivariate analysis.
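
    For readers unfamiliar with the EM approach referenced above, the sketch below runs EM for a multivariate normal model on simulated item scores with missing entries and then fills the gaps with conditional means; the data, the three-item structure, and the iteration settings are assumptions for illustration, not the WOMAC analysis itself.

      # Sketch of EM imputation under a multivariate normal model, followed by
      # conditional-mean imputation. Simulated 3-item data stand in for WOMAC items.
      import numpy as np

      rng = np.random.default_rng(13)
      n, p = 500, 3
      true_cov = np.array([[1.0, 0.6, 0.5],
                           [0.6, 1.0, 0.4],
                           [0.5, 0.4, 1.0]])
      X = rng.multivariate_normal(np.array([2.0, 1.0, 3.0]), true_cov, size=n)
      X[rng.random((n, p)) < 0.15] = np.nan          # ~15% missing per item

      mu = np.nanmean(X, axis=0)
      sigma = np.diag(np.nanvar(X, axis=0))

      for _ in range(50):                            # EM iterations
          sum_x = np.zeros(p)
          sum_xx = np.zeros((p, p))
          for row in X:
              m = np.isnan(row)
              o = ~m
              x_hat = row.copy()
              cov_add = np.zeros((p, p))
              if m.any():
                  s_oo_inv = np.linalg.inv(sigma[np.ix_(o, o)])
                  s_mo = sigma[np.ix_(m, o)]
                  # E-step: conditional mean and covariance of the missing block.
                  x_hat[m] = mu[m] + s_mo @ s_oo_inv @ (row[o] - mu[o])
                  cov_add[np.ix_(m, m)] = (sigma[np.ix_(m, m)]
                                           - s_mo @ s_oo_inv @ s_mo.T)
              sum_x += x_hat
              sum_xx += np.outer(x_hat, x_hat) + cov_add
          # M-step: update mean and covariance from expected sufficient statistics.
          mu = sum_x / n
          sigma = sum_xx / n - np.outer(mu, mu)

      # Single (conditional-mean) imputation with the converged parameters.
      X_imp = X.copy()
      for i, row in enumerate(X):
          m = np.isnan(row)
          if m.any():
              o = ~m
              s_oo_inv = np.linalg.inv(sigma[np.ix_(o, o)])
              X_imp[i, m] = mu[m] + sigma[np.ix_(m, o)] @ s_oo_inv @ (row[o] - mu[o])

      print(mu.round(2))                             # close to the true item means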

  11. Public Undertakings and Imputability

    DEFF Research Database (Denmark)

    Ølykke, Grith Skovgaard

    2013-01-01

    in Article 107(1) TFEU is analysed. It is concluded that where the public undertaking transgresses the control system put in place by the State, conditions for imputability are not fulfilled, and it is argued that in the current state of law, there is no conditional link between the level of control...... that this is not the case. Lastly, it is discussed whether other legal instruments, namely competition law, public procurement law, or the Transparency Directive, regulate public undertakings’ market behaviour. It is found that those rules are not sufficient to mend the gap created by the imputability requirement. Legal...... In this article, the issue of imputability to the State of public undertakings’ decision-making is analysed and discussed in the context of the DSBFirst case. DSBFirst is owned by the independent public undertaking DSB and the private undertaking FirstGroup plc and won the contracts in the 2008

  12. Alcohol outlet density and violence: A geographically weighted regression approach.

    Science.gov (United States)

    Cameron, Michael P; Cochrane, William; Gordon, Craig; Livingston, Michael

    2016-05-01

    We investigate the relationship between outlet density (of different types) and violence (as measured by police activity) across the North Island of New Zealand, specifically looking at whether the relationships vary spatially. We use New Zealand data at the census area unit (approximately suburb) level, on police-attended violent incidents and outlet density (by type of outlet), controlling for population density and local social deprivation. We employed geographically weighted regression to obtain both global average and locally specific estimates of the relationships between alcohol outlet density and violence. We find that bar and night club density, and licensed club density (e.g. sports clubs), have statistically significant and positive relationships with violence, with an additional bar or night club associated with nearly 5.3 additional violent events per year, and an additional licensed club associated with 0.8 additional violent events per year. These relationships do not show significant spatial variation. In contrast, the effects of off-licence density and restaurant/café density do exhibit significant spatial variation. However, the non-varying effects of bar and night club density are larger than the locally specific effects of other outlet types. The relationships between outlet density and violence vary significantly across space for off-licences and restaurants/cafés. These results suggest that in order to minimise alcohol-related harms, such as violence, locally specific policy interventions are likely to be necessary. [Cameron MP, Cochrane W, Gordon C, Livingston M. Alcohol outlet density and violence: A geographically weighted regression approach. Drug Alcohol Rev 2016;35:280-288]. © 2015 Australasian Professional Society on Alcohol and other Drugs.
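
    A bare-bones geographically weighted regression, of the kind referenced above, fits a weighted least-squares model at each location with a distance-decay kernel; the simulated data, Gaussian kernel, and bandwidth below are illustrative assumptions.

      # Minimal geographically weighted regression: at each location, fit weighted
      # least squares with a Gaussian distance kernel so coefficients can vary over
      # space. Simulated outlet-density and violence data; bandwidth is assumed.
      import numpy as np

      rng = np.random.default_rng(17)
      n = 200
      coords = rng.uniform(0, 100, size=(n, 2))          # area-unit centroids (km)
      outlet_density = rng.uniform(0, 5, size=n)
      # True effect of outlet density grows from west to east, plus noise.
      true_beta = 0.5 + 0.02 * coords[:, 0]
      violence = 2.0 + true_beta * outlet_density + rng.normal(scale=0.5, size=n)

      X = np.column_stack([np.ones(n), outlet_density])

      def gwr_coefficients(target_idx, bandwidth=20.0):
          d = np.linalg.norm(coords - coords[target_idx], axis=1)
          w = np.exp(-(d ** 2) / (2 * bandwidth ** 2))    # Gaussian kernel weights
          W = np.diag(w)
          return np.linalg.solve(X.T @ W @ X, X.T @ W @ violence)

      west = int(np.argmin(coords[:, 0]))
      east = int(np.argmax(coords[:, 0]))
      print("local slope (west):", gwr_coefficients(west)[1].round(2))
      print("local slope (east):", gwr_coefficients(east)[1].round(2))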

  13. A fast and practical approach to genotype phasing and imputation on a pedigree with erroneous and incomplete information.

    Science.gov (United States)

    Pirola, Yuri; Della Vedova, Gianluca; Biffani, Stefano; Stella, Alessandra; Bonizzoni, Paola

    2012-01-01

    The MINIMUM-RECOMBINANT HAPLOTYPE CONFIGURATION problem (MRHC) has been highly successful in providing a sound combinatorial formulation for the important problem of genotype phasing on pedigrees. Despite several algorithmic advances that have improved the efficiency, its applicability to real data sets has been limited since it does not take into account some important phenomena such as mutations, genotyping errors, and missing data. In this work, we propose the MINIMUM-RECOMBINANT HAPLOTYPE CONFIGURATION WITH BOUNDED ERRORS problem (MRHCE), which extends the original MRHC formulation by incorporating the two most common characteristics of real data: errors and missing genotypes (including untyped individuals). We describe a practical algorithm for MRHCE that is based on a reduction to the well-known Satisfiability problem (SAT) and exploits recent advances in the constraint programming literature. An experimental analysis demonstrates the biological soundness of the phasing model and the effectiveness (on both accuracy and performance) of the algorithm under several scenarios. The analysis on real data and the comparison with state-of-the-art programs reveals that our approach couples better scalability to large and complex pedigrees with the explicit inclusion of genotyping errors into the model.

  14. Restrictive Imputation of Incomplete Survey Data

    NARCIS (Netherlands)

    Vink, G.

    2015-01-01

    This dissertation focuses on finding plausible imputations when there is some restriction posed on the imputation model. In these restrictive situations, current imputation methodology does not lead to satisfactory imputations. The restrictions, and the resulting missing data problems are real-life

  15. Towards a Geographic Information Systems (GIS) Approach in ...

    African Journals Online (AJOL)

    Indeed, GIS is to geographical analysis what the microscope has been to ... In relation to housing and building research, GIS can be an investigative tool, design ... By this process, roads and rivers are represented as simple lines, ...

  16. Using imputation to provide location information for nongeocoded addresses.

    Directory of Open Access Journals (Sweden)

    Frank C Curriero

    Full Text Available BACKGROUND: The importance of geography as a source of variation in health research continues to receive sustained attention in the literature. The inclusion of geographic information in such research often begins by adding data to a map, which is predicated on some knowledge of location. A precise level of spatial information is conventionally achieved through geocoding, the geographic information system (GIS) process of translating mailing address information to coordinates on a map. The geocoding process is not without its limitations, though, since there is always a percentage of addresses which cannot be converted successfully (nongeocodable). This raises concerns regarding bias, since traditionally the practice has been to exclude nongeocoded data records from analysis. METHODOLOGY/PRINCIPAL FINDINGS: In this manuscript we develop and evaluate a set of imputation strategies for dealing with missing spatial information from nongeocoded addresses. The strategies are developed assuming a known zip code, with increasing use of collateral information, namely the spatial distribution of the population at risk. Strategies are evaluated using prostate cancer data obtained from the Maryland Cancer Registry. We consider total case enumerations at the Census county, tract, and block group level as the outcome of interest when applying and evaluating the methods. Multiple imputation is used to provide estimated total case counts based on complete data (geocodes plus imputed nongeocodes) with a measure of uncertainty. Results indicate that the imputation strategy based on using available population-based age, gender, and race information performed the best overall at the county, tract, and block group levels. CONCLUSIONS/SIGNIFICANCE: The procedure allows the potentially biased and likely under-reported outcome, case enumerations based on only the geocoded records, to be presented with a statistically adjusted count (imputed count) with a measure of uncertainty.

  17. Place Branding – Geographical Approach. Case Study: Waterloo

    Directory of Open Access Journals (Sweden)

    Marius-Cristian Neacşu

    2016-11-01

    Full Text Available This study represents an exploratory analysis of the evolution of the place branding concept, with an important focus on the geographical perspective. How has this notion, a newcomer into the geographers' analysis, changed over time, and what role does it have in the decision-making process of intervening in the way a certain place is organised, or as an instrument of economic revival and territorial development? At least from the perspective of Romanian geographical literature, the originality and novelty of this study is obvious. An element of the originality of this research is the attempt to redefine the concept of place branding so that it is more meaningful from the perspective of spatial analyses. The reason for which Waterloo was chosen as a case study is multi-dimensional: the case studies so far have mainly focused on large cities (urban branding instead of place branding), and this site has all the theoretical elements needed to create a stand-alone brand.

  18. A statistical approach to latitude measurements: Ptolemy's and Riccioli's geographical works as case studies

    Science.gov (United States)

    Santoro, Luca

    2017-08-01

    The aim of this work is to analyze latitude measurements typically used in historical geographical works through a statistical approach. We use two data sets of different ages as case studies: Ptolemy's Geography and Riccioli's work on geography. A statistical approach to historical latitude and longitude databases can reveal systematic errors in geographical georeferencing processes. On the other hand, once the right statistical analysis is applied, this approach can also lead to new information about ancient city locations.

  19. East Europeans on the Spanish Job Market: A Geographical Approach

    Directory of Open Access Journals (Sweden)

    Rafael Viruela Martínez

    2009-01-01

    Full Text Available This study presents some of the socio-employment characteristics of East European workers in Spain, particularly Romanians and Bulgarians. Firstly (and following a brief commentary on statistical sources), the author analyses the evolution and geographical distribution of these workers, highlighting their high representation in inland provinces of the peninsula and in rural municipalities. Subsequently, an examination is carried out of their participation in the job market, according to payments made to the Social Security system and the workers' main sectors of activity, which shows the differences according to sex, nationality and place of residence. Finally, the statistical information is supplemented by the results from different empirical research studies.

  20. Improvement of the F-Perceptory Approach Through Management of Fuzzy Complex Geographic Objects

    Science.gov (United States)

    Khalfi, B.; de Runz, C.; Faiz, S.; Akdag, H.

    2015-08-01

    In the real world, data is imperfect in various ways, such as imprecision, vagueness, uncertainty, ambiguity and inconsistency. For geographic data, the fuzzy aspect is mainly manifested in the time, space and function of objects, and is due to a lack of precision. Researchers in the domain therefore emphasize the importance of modeling data structures in GIS, but also their lack of adaptation to fuzzy data. The F-Perceptory approach manages the modeling of imperfect geographic information with UML. This management is essential to maintain faithfulness to reality and to better guide the user in decision-making. However, this approach does not manage fuzzy complex geographic objects. The latter represent multiple objects with similar or different geographic shapes. In this paper, we therefore propose to improve the F-Perceptory approach so that it handles the modeling of fuzzy complex geographic objects. In a second step, we propose its transformation into UML modeling.

  1. IMPROVEMENT OF THE F-PERCEPTORY APPROACH THROUGH MANAGEMENT OF FUZZY COMPLEX GEOGRAPHIC OBJECTS

    Directory of Open Access Journals (Sweden)

    B. Khalfi

    2015-08-01

    Full Text Available In the real world, data is imperfect in various ways, such as imprecision, vagueness, uncertainty, ambiguity and inconsistency. For geographic data, the fuzzy aspect is mainly manifested in the time, space and function of objects, and is due to a lack of precision. Researchers in the domain therefore emphasize the importance of modeling data structures in GIS, but also their lack of adaptation to fuzzy data. The F-Perceptory approach manages the modeling of imperfect geographic information with UML. This management is essential to maintain faithfulness to reality and to better guide the user in decision-making. However, this approach does not manage fuzzy complex geographic objects. The latter represent multiple objects with similar or different geographic shapes. In this paper, we therefore propose to improve the F-Perceptory approach so that it handles the modeling of fuzzy complex geographic objects. In a second step, we propose its transformation into UML modeling.

  2. Using a 'value-added' approach for contextual design of geographic information.

    Science.gov (United States)

    May, Andrew J

    2013-11-01

    The aim of this article is to demonstrate how a 'value-added' approach can be used for user-centred design of geographic information. An information science perspective was used, with value being the difference in outcomes arising from alternative information sets. Sixteen drivers navigated a complex, unfamiliar urban route, using visual and verbal instructions representing the distance-to-turn and junction layout information presented by typical satellite navigation systems. Data measuring driving errors, navigation errors and driver confidence were collected throughout the trial. The results show how driver performance varied considerably according to the geographic context at specific locations, and that there are specific opportunities to add value with enhanced geographical information. The conclusions are that a value-added approach facilitates a more explicit focus on 'desired' (and feasible) levels of end user performance with different information sets, and is a potentially effective approach to user-centred design of geographic information.

  3. Multiple imputation and its application

    CERN Document Server

    Carpenter, James

    2013-01-01

    A practical guide to analysing partially observed data. Collecting, analysing and drawing inferences from data is central to research in the medical and social sciences. Unfortunately, it is rarely possible to collect all the intended data. The literature on inference from the resulting incomplete data is now huge, and continues to grow both as methods are developed for large and complex data structures, and as increasing computer power and suitable software enable researchers to apply these methods. This book focuses on a particular statistical method for analysing and drawing inferences from incomplete data, called Multiple Imputation (MI). MI is attractive because it is both practical and widely applicable. The authors' aim is to clarify the issues raised by missing data, describing the rationale for MI, the relationship between the various imputation models and associated algorithms, and its application to increasingly complex data structures. Multiple Imputation and its Application: Discusses the issues ...

  4. Performance of genotype imputations using data from the 1000 Genomes Project.

    Science.gov (United States)

    Sung, Yun Ju; Wang, Lihua; Rankinen, Tuomo; Bouchard, Claude; Rao, D C

    2012-01-01

    Genotype imputations based on 1000 Genomes (1KG) Project data have the advantage of imputing many more SNPs than imputations based on HapMap data. They also provide an opportunity to discover associations with relatively rare variants. Recent investigations are increasingly using 1KG data for genotype imputations, but only limited evaluations of the performance of this approach are available. In this paper, we empirically evaluated imputation performance using 1KG data by comparing imputation results to those using the HapMap Phase II data that have been widely used. We used three reference panels: the CEU panel consisting of 120 haplotypes from HapMap II, a CEU panel from the 1KG data (June 2010 release), and the EUR panel consisting of 566 haplotypes from the 1KG data (August 2010 release). We used 324,607 autosomal Illumina SNPs genotyped in 501 individuals of European ancestry. Our most important finding was that both 1KG reference panels provided much higher imputation yield than the HapMap II panel. There were more than twice as many successfully imputed SNPs as there were using the HapMap II panel (6.7 million vs. 2.5 million). Our second most important finding was that accuracy using both 1KG panels was high and almost identical to accuracy using the HapMap II panel. Furthermore, poorly imputed SNPs can be screened out using the MACH Rsq quality metric. As the 1KG Project is still underway, we expect that later versions will provide even better imputation performance.
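
    The accuracy bookkeeping referred to above (comparing imputed genotypes with the truth) can be sketched generically. The Python example below uses simulated genotypes and dosages rather than the MACH/1KG pipeline itself; it computes the concordance of best-guess calls on masked genotypes and a per-SNP dosage r-squared, the quantity that Rsq-style quality metrics approximate.

        import numpy as np

        rng = np.random.default_rng(0)

        # Simulated "true" genotypes (0/1/2 allele counts) for 500 samples x 200 SNPs.
        true_geno = rng.binomial(2, 0.3, size=(500, 200)).astype(float)

        # Pretend an imputation program returned noisy dosages for a masked 10% of entries.
        mask = rng.random(true_geno.shape) < 0.10
        imputed_dosage = np.clip(true_geno + rng.normal(0.0, 0.4, size=true_geno.shape), 0.0, 2.0)

        # Concordance: fraction of masked genotypes whose best-guess call matches the truth.
        best_guess = np.rint(imputed_dosage)
        concordance = (best_guess[mask] == true_geno[mask]).mean()

        # Dosage r^2 per SNP over the masked entries (analogous in spirit to Rsq-type metrics).
        r2_per_snp = []
        for j in range(true_geno.shape[1]):
            m = mask[:, j]
            if m.sum() > 2 and np.var(true_geno[m, j]) > 0:
                r = np.corrcoef(true_geno[m, j], imputed_dosage[m, j])[0, 1]
                r2_per_snp.append(r ** 2)

        print(f"masked-genotype concordance: {concordance:.3f}")
        print(f"mean dosage r^2 across SNPs: {np.mean(r2_per_snp):.3f}")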

  5. Whole-Genome Sequencing Coupled to Imputation Discovers Genetic Signals for Anthropometric Traits

    DEFF Research Database (Denmark)

    Tachmazidou, Ioanna; Süveges, Dániel; Min, Josine L

    2017-01-01

    Deep sequence-based imputation can enhance the discovery power of genome-wide association studies by assessing previously unexplored variation across the common- and low-frequency spectra. We applied a hybrid whole-genome sequencing (WGS) and deep imputation approach to examine the broader alleli...

  6. Recursive partitioning for missing data imputation in the presence of interaction effects

    NARCIS (Netherlands)

    Doove, L. L.; Van Buuren, S.; Dusseldorp, E.

    2014-01-01

    Standard approaches to implement multiple imputation do not automatically incorporate nonlinear relations like interaction effects. This leads to biased parameter estimates when interactions are present in a dataset. With the aim of providing an imputation method which preserves interactions in the

  7. Predictive mean matching imputation of semicontinuous variables

    NARCIS (Netherlands)

    Vink, G.; Frank, L.E.; Pannekoek, J.; Buuren, S. van

    2014-01-01

    Multiple imputation methods properly account for the uncertainty of missing data. One of those methods for creating multiple imputations is predictive mean matching (PMM), a general purpose method. Little is known about the performance of PMM in imputing non-normal semicontinuous data (skewed data w
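
    Predictive mean matching is easy to sketch. The Python example below uses simulated, skewed semicontinuous data and is not the implementation evaluated in the paper: a linear model is fitted on the complete cases, and each missing value is imputed with the observed value of a donor whose predicted mean is among the closest, which is why imputations never leave the support of the observed data.

        import numpy as np

        rng = np.random.default_rng(1)

        # Simulated semicontinuous outcome: a spike at zero plus a right-skewed tail.
        n = 300
        x = rng.normal(size=n)
        y = np.where(rng.random(n) < 0.3, 0.0, np.exp(0.5 * x + rng.normal(0, 0.5, n)))
        missing = rng.random(n) < 0.25          # y is missing for ~25% of cases
        y_obs = np.where(missing, np.nan, y)

        # 1. Fit a linear regression of y on x using the observed cases only.
        obs = ~missing
        X_obs = np.column_stack([np.ones(obs.sum()), x[obs]])
        beta = np.linalg.lstsq(X_obs, y_obs[obs], rcond=None)[0]

        # 2. Predicted means for all cases.
        yhat = beta[0] + beta[1] * x

        # 3. For each missing case, pick a donor among the k observed cases with the
        #    closest predicted mean and copy the donor's *observed* value.
        k = 5
        donor_idx = np.where(obs)[0]
        imputed = y_obs.copy()
        for i in np.where(missing)[0]:
            dist = np.abs(yhat[donor_idx] - yhat[i])
            candidates = donor_idx[np.argsort(dist)[:k]]
            imputed[i] = y_obs[rng.choice(candidates)]

        # Every imputed value is an observed value, so the semicontinuous shape is preserved.
        print(set(np.round(imputed[missing], 3)) <= set(np.round(y_obs[obs], 3)))

    A full multiple-imputation PMM would additionally perturb the regression coefficients (for example by a posterior draw or bootstrap) before matching, so that successive imputations differ.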

  8. Genetic diversity analysis of highly incomplete SNP genotype data with imputations: an empirical assessment.

    Science.gov (United States)

    Fu, Yong-Bi

    2014-03-13

    Genotyping by sequencing (GBS) has recently emerged as a promising genomic approach for assessing genetic diversity on a genome-wide scale. However, there are persistent concerns about the uniquely large imbalance in GBS genotype data. Although genotype imputation has been proposed to infer missing observations, little is known about the reliability of a genetic diversity analysis of GBS data with up to 90% of observations missing. Here we performed an empirical assessment of accuracy in the genetic diversity analysis of highly incomplete single nucleotide polymorphism genotypes with imputations. Three large single nucleotide polymorphism genotype data sets for corn, wheat, and rice were acquired, missing data with up to 90% of observations missing were randomly generated, and the missing genotypes were then imputed with three map-independent imputation methods. Estimating heterozygosity and the inbreeding coefficient from the original, missing, and imputed data revealed variable patterns of bias across the assessed levels of missingness and genotype imputation, but the estimation biases were smaller for missing data without genotype imputation. The estimates of genetic differentiation were rather robust up to 90% of missing observations, but became substantially biased when missing genotypes were imputed. The estimates of topology accuracy for four representative samples of interested groups generally decreased with increasing levels of missing genotypes. Probabilistic principal component analysis based imputation performed better in terms of topology accuracy than analyses of missing data without genotype imputation. These findings are not only significant for understanding the reliability of genetic diversity analysis with respect to large amounts of missing data and genotype imputation, but are also instructive for performing a proper genetic diversity analysis of highly incomplete GBS or other genotype data.

  9. A comparison of selected parametric and imputation methods for estimating snag density and snag quality attributes

    Science.gov (United States)

    Eskelson, Bianca N.I.; Hagar, Joan; Temesgen, Hailemariam

    2012-01-01

    of large snags than the RF imputation approach. Adjusting the decision threshold to account for unequal size for presence and absence classes is more straightforward for the logistic regression than for the RF imputation approach. Overall, model accuracies were poor in this study, which can be attributed to the poor predictive quality of the explanatory variables and the large range of forest types and geographic conditions observed in the data.

  10. TR01: Time-continuous Sparse Imputation

    CERN Document Server

    Gemmeke, J F

    2009-01-01

    An effective way to increase the noise robustness of automatic speech recognition is to label noisy speech features as either reliable or unreliable (missing) prior to decoding, and to replace the missing ones by clean speech estimates. We present a novel method to obtain such clean speech estimates. Unlike previous imputation frameworks which work on a frame-by-frame basis, our method focuses on exploiting information from a large time-context. Using a sliding window approach, denoised speech representations are constructed using a sparse representation of the reliable features in an overcomplete basis of fixed-length exemplar fragments. We demonstrate the potential of our approach with experiments on the AURORA-2 connected digit database.
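
    The reconstruction step, finding a sparse combination of exemplars that explains only the reliable feature entries and then reading the unreliable entries off that combination, can be illustrated with a toy sketch. In the Python example below the exemplar dictionary is random and an L1-penalised least-squares solver stands in for the solver used in the paper; only the masking-and-resynthesis logic is meant to carry over.

        import numpy as np
        from sklearn.linear_model import Lasso

        rng = np.random.default_rng(2)

        # Overcomplete dictionary of exemplar fragments (columns); random here for illustration.
        dim, n_exemplars = 120, 400
        D = np.abs(rng.normal(size=(dim, n_exemplars)))

        # A "clean" observation built from a few exemplars, then partially masked as unreliable.
        true_weights = np.zeros(n_exemplars)
        true_weights[rng.choice(n_exemplars, size=5, replace=False)] = rng.uniform(0.5, 1.5, 5)
        clean = D @ true_weights
        reliable = rng.random(dim) > 0.4                # ~60% of entries deemed reliable

        # Fit sparse weights using only the reliable entries...
        lasso = Lasso(alpha=0.01, positive=True, fit_intercept=False, max_iter=10000)
        lasso.fit(D[reliable], clean[reliable])

        # ...and re-synthesise the masked (unreliable) entries from the full dictionary.
        reconstruction = D @ lasso.coef_
        err = np.abs(reconstruction[~reliable] - clean[~reliable]).mean()
        print(f"mean absolute error on masked entries: {err:.3f}")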

  11. Binary variable multiple-model multiple imputation to address missing data mechanism uncertainty: application to a smoking cessation trial.

    Science.gov (United States)

    Siddique, Juned; Harel, Ofer; Crespi, Catherine M; Hedeker, Donald

    2014-07-30

    The true missing data mechanism is never known in practice. We present a method for generating multiple imputations for binary variables, which formally incorporates missing data mechanism uncertainty. Imputations are generated from a distribution of imputation models rather than a single model, with the distribution reflecting subjective notions of missing data mechanism uncertainty. Parameter estimates and standard errors are obtained using rules for nested multiple imputation. Using simulation, we investigate the impact of missing data mechanism uncertainty on post-imputation inferences and show that incorporating this uncertainty can increase the coverage of parameter estimates. We apply our method to a longitudinal smoking cessation trial where nonignorably missing data were a concern. Our method provides a simple approach for formalizing subjective notions regarding nonresponse and can be implemented using existing imputation software.

  12. Addressing Missing Data Mechanism Uncertainty using Multiple-Model Multiple Imputation: Application to a Longitudinal Clinical Trial.

    Science.gov (United States)

    Siddique, Juned; Harel, Ofer; Crespi, Catherine M

    2012-12-01

    We present a framework for generating multiple imputations for continuous data when the missing data mechanism is unknown. Imputations are generated from more than one imputation model in order to incorporate uncertainty regarding the missing data mechanism. Parameter estimates based on the different imputation models are combined using rules for nested multiple imputation. Through the use of simulation, we investigate the impact of missing data mechanism uncertainty on post-imputation inferences and show that incorporating this uncertainty can increase the coverage of parameter estimates. We apply our method to a longitudinal clinical trial of low-income women with depression where nonignorably missing data were a concern. We show that different assumptions regarding the missing data mechanism can have a substantial impact on inferences. Our method provides a simple approach for formalizing subjective notions regarding nonresponse so that they can be easily stated, communicated, and compared.

  13. Missing Data Imputation for Supervised Learning

    OpenAIRE

    Poulos, Jason; Valle, Rafael

    2016-01-01

    This paper compares methods for imputing missing categorical data for supervised learning tasks. The ability of researchers to accurately fit a model and yield unbiased estimates may be compromised by missing data, which are prevalent in survey-based social science research. We experiment on two machine learning benchmark datasets with missing categorical data, comparing classifiers trained on non-imputed (i.e., one-hot encoded) or imputed data with different degrees of missing-data perturbat...

  14. A two-step semiparametric method to accommodate sampling weights in multiple imputation.

    Science.gov (United States)

    Zhou, Hanzhi; Elliott, Michael R; Raghunathan, Trivellore E

    2016-03-01

    Multiple imputation (MI) is a well-established method to handle item-nonresponse in sample surveys. Survey data obtained from complex sampling designs often involve features that include unequal probability of selection. MI requires imputation to be congenial, that is, for the imputations to come from a Bayesian predictive distribution and for the observed and complete data estimator to equal the posterior mean given the observed or complete data, and similarly for the observed and complete variance estimator to equal the posterior variance given the observed or complete data; more colloquially, the analyst and imputer make similar modeling assumptions. Yet multiply imputed data sets from complex sample designs with unequal sampling weights are typically imputed under simple random sampling assumptions and then analyzed using methods that account for the sampling weights. This is a setting in which the analyst assumes more than the imputer, which can lead to biased estimates and anti-conservative inference. Less commonly used alternatives such as including case weights as predictors in the imputation model typically require interaction terms for more complex estimators such as regression coefficients, and can be vulnerable to model misspecification and difficult to implement. We develop a simple two-step MI framework that accounts for sampling weights using a weighted finite population Bayesian bootstrap method to validly impute the whole population (including item nonresponse) from the observed data. In the second step, having generated posterior predictive distributions of the entire population, we use standard IID imputation to handle the item nonresponse. Simulation results show that the proposed method has good frequentist properties and is robust to model misspecification compared to alternative approaches. We apply the proposed method to accommodate missing data in the Behavioral Risk Factor Surveillance System when estimating means and parameters of
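
    A deliberately simplified sketch of the two-step idea, for illustration only: step 1 below expands the sample into a synthetic population by resampling with probabilities proportional to the sampling weights (the authors' method uses a weighted finite population Bayesian bootstrap rather than this simple draw), and step 2 applies an ordinary, weight-free imputation to the item nonresponse within that synthetic population. All variable names and sizes are hypothetical.

        import numpy as np

        rng = np.random.default_rng(7)

        # Hypothetical complex-survey sample: outcome y with item nonresponse,
        # covariate x fully observed, and unequal sampling weights w.
        n, N = 400, 20000                      # sample size and (known) population size
        x = rng.normal(size=n)
        w = rng.uniform(1, 200, size=n)        # sampling weights (1/selection probability)
        y = 2.0 + 1.5 * x + rng.normal(0, 1, n)
        y[rng.random(n) < 0.3] = np.nan        # ~30% item nonresponse

        # Step 1 (simplified): build a synthetic population by resampling sample units
        # with probability proportional to their weights.
        idx = rng.choice(n, size=N, replace=True, p=w / w.sum())
        pop_x, pop_y = x[idx], y[idx]

        # Step 2: standard imputation inside the synthetic population -- here a normal
        # linear-model draw for the missing y values, ignoring the weights (as intended).
        obs = ~np.isnan(pop_y)
        X = np.column_stack([np.ones(obs.sum()), pop_x[obs]])
        beta, res, *_ = np.linalg.lstsq(X, pop_y[obs], rcond=None)
        sigma = np.sqrt(res[0] / (obs.sum() - 2))
        miss = ~obs
        pop_y[miss] = beta[0] + beta[1] * pop_x[miss] + rng.normal(0, sigma, miss.sum())

        print(f"completed-population mean of y: {pop_y.mean():.3f}")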

  15. 16 CFR 1115.11 - Imputed knowledge.

    Science.gov (United States)

    2010-01-01

    ... 16 Commercial Practices 2 2010-01-01 2010-01-01 false Imputed knowledge. 1115.11 Section 1115.11... PRODUCT HAZARD REPORTS General Interpretation § 1115.11 Imputed knowledge. (a) In evaluating whether or... care to ascertain the truth of complaints or other representations. This includes the knowledge a...

  16. Dual imputation model for incomplete longitudinal data

    NARCIS (Netherlands)

    Jolani, S.; Frank, L.E.; Buuren, S. van

    2014-01-01

    Missing values are a practical issue in the analysis of longitudinal data. Multiple imputation (MI) is a well-known likelihood-based method that has optimal properties in terms of efficiency and consistency if the imputation model is correctly specified. Doubly robust (DR) weighting-based methods pro

  17. Imputing amino acid polymorphisms in human leukocyte antigens.

    Directory of Open Access Journals (Sweden)

    Xiaoming Jia

    Full Text Available DNA sequence variation within human leukocyte antigen (HLA) genes mediates susceptibility to a wide range of human diseases. The complex genetic structure of the major histocompatibility complex (MHC) makes it difficult, however, to collect genotyping data in large cohorts. Long-range linkage disequilibrium between HLA loci and SNP markers across the major histocompatibility complex (MHC) region offers an alternative approach through imputation to interrogate HLA variation in existing GWAS data sets. Here we describe a computational strategy, SNP2HLA, to impute classical alleles and amino acid polymorphisms at class I (HLA-A, -B, -C) and class II (-DPA1, -DPB1, -DQA1, -DQB1, and -DRB1) loci. To characterize the performance of SNP2HLA, we constructed two European ancestry reference panels, one based on data collected in HapMap-CEPH pedigrees (90 individuals) and another based on data collected by the Type 1 Diabetes Genetics Consortium (T1DGC, 5,225 individuals). We imputed HLA alleles in an independent data set from the British 1958 Birth Cohort (N = 918) with gold standard four-digit HLA types and SNPs genotyped using the Affymetrix GeneChip 500 K and Illumina Immunochip microarrays. We demonstrate that the sample size of the reference panel, rather than the SNP density of the genotyping platform, is critical to achieve high imputation accuracy. Using the larger T1DGC reference panel, the average accuracy at four-digit resolution is 94.7% using the low-density Affymetrix GeneChip 500 K, and 96.7% using the high-density Illumina Immunochip. For amino acid polymorphisms within HLA genes, we achieve 98.6% and 99.3% accuracy using the Affymetrix GeneChip 500 K and Illumina Immunochip, respectively. Finally, we demonstrate how imputation and association testing at amino acid resolution can facilitate fine-mapping of primary MHC association signals, giving a specific example from type 1 diabetes.

  18. Joint multiple imputation for longitudinal outcomes and clinical events that truncate longitudinal follow-up.

    Science.gov (United States)

    Hu, Bo; Li, Liang; Greene, Tom

    2016-07-30

    Longitudinal cohort studies often collect both repeated measurements of longitudinal outcomes and times to clinical events whose occurrence precludes further longitudinal measurements. Although joint modeling of the clinical events and the longitudinal data can be used to provide valid statistical inference for target estimands in certain contexts, the application of joint models in medical literature is currently rather restricted because of the complexity of the joint models and the intensive computation involved. We propose a multiple imputation approach to jointly impute missing data of both the longitudinal and clinical event outcomes. With complete imputed datasets, analysts are then able to use simple and transparent statistical methods and standard statistical software to perform various analyses without dealing with the complications of missing data and joint modeling. We show that the proposed multiple imputation approach is flexible and easy to implement in practice. Numerical results are also provided to demonstrate its performance. Copyright © 2015 John Wiley & Sons, Ltd.

  19. A flexible and accurate genotype imputation method for the next generation of genome-wide association studies.

    Directory of Open Access Journals (Sweden)

    Bryan N Howie

    2009-06-01

    Full Text Available Genotype imputation methods are now being widely used in the analysis of genome-wide association studies. Most imputation analyses to date have used the HapMap as a reference dataset, but new reference panels (such as controls genotyped on multiple SNP chips and densely typed samples from the 1,000 Genomes Project) will soon allow a broader range of SNPs to be imputed with higher accuracy, thereby increasing power. We describe a genotype imputation method (IMPUTE version 2) that is designed to address the challenges presented by these new datasets. The main innovation of our approach is a flexible modelling framework that increases accuracy and combines information across multiple reference panels while remaining computationally feasible. We find that IMPUTE v2 attains higher accuracy than other methods when the HapMap provides the sole reference panel, but that the size of the panel constrains the improvements that can be made. We also find that imputation accuracy can be greatly enhanced by expanding the reference panel to contain thousands of chromosomes and that IMPUTE v2 outperforms other methods in this setting at both rare and common SNPs, with overall error rates that are 15%-20% lower than those of the closest competing method. One particularly challenging aspect of next-generation association studies is to integrate information across multiple reference panels genotyped on different sets of SNPs; we show that our approach to this problem has practical advantages over other suggested solutions.

  20. Socioeconomic determinants of geographic disparities in campylobacteriosis risk: a comparison of global and local modeling approaches

    Directory of Open Access Journals (Sweden)

    Weisent Jennifer

    2012-10-01

    Full Text Available Abstract Background Socioeconomic factors play a complex role in determining the risk of campylobacteriosis. Understanding the spatial interplay between these factors and disease risk can guide disease control programs. Historically, Poisson and negative binomial models have been used to investigate determinants of geographic disparities in risk. Spatial regression models, which allow modeling of spatial effects, have been used to improve these modeling efforts. Geographically weighted regression (GWR) takes this a step further by estimating local regression coefficients, thereby allowing estimations of associations that vary in space. These recent approaches increase our understanding of how geography influences the associations between determinants and disease. Therefore the objectives of this study were to: (i) identify socioeconomic determinants of the geographic disparities of campylobacteriosis risk; (ii) investigate if regression coefficients for the associations between socioeconomic factors and campylobacteriosis risk demonstrate spatial variability; and (iii) compare the performance of four modeling approaches: negative binomial, spatial lag, and global and local Poisson GWR. Methods Negative binomial, spatial lag, and global and local Poisson GWR modeling techniques were used to investigate associations between socioeconomic factors and geographic disparities in campylobacteriosis risk. The best fitting models were identified and compared. Results Two competing four-variable models (Models 1 & 2) were identified. Significant variables included race, unemployment rate, education attainment, urbanicity, and divorce rate. Local Poisson GWR had the best fit and showed evidence of spatially varying regression coefficients. Conclusions The international significance of this work is that it highlights the inadequacy of global regression strategies that estimate one parameter per independent variable, and therefore mask the true relationships between
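
    The local-coefficient idea behind geographically weighted regression can be made concrete with a small, generic sketch. The Python example below simulates areal centroids and counts (it is not the models fitted in this study): it computes Gaussian kernel weights around a focal location and maximises a weighted Poisson log-likelihood there, so that repeating the fit at every location produces a surface of local coefficients.

        import numpy as np
        from scipy.optimize import minimize

        rng = np.random.default_rng(3)

        # Simulated areal units: centroid coordinates, one socioeconomic covariate,
        # and disease counts whose true effect of x increases from west to east.
        m = 200
        coords = rng.uniform(0, 100, size=(m, 2))
        x = rng.normal(size=m)
        beta1_true = 0.2 + 0.01 * coords[:, 0]          # spatially varying coefficient
        y = rng.poisson(np.exp(0.5 + beta1_true * x))

        def local_poisson_fit(focal, bandwidth=25.0):
            """Locally weighted Poisson fit at one focal point (Gaussian kernel)."""
            d = np.linalg.norm(coords - focal, axis=1)
            w = np.exp(-0.5 * (d / bandwidth) ** 2)     # geographic kernel weights

            def negloglik(beta):
                eta = beta[0] + beta[1] * x
                return -np.sum(w * (y * eta - np.exp(eta)))   # Poisson log-lik, no constant

            return minimize(negloglik, x0=np.zeros(2), method="BFGS").x

        # Local coefficients at two focal points differ when the association varies in space.
        west = local_poisson_fit(np.array([10.0, 50.0]))
        east = local_poisson_fit(np.array([90.0, 50.0]))
        print(f"local beta1 west: {west[1]:.2f}   local beta1 east: {east[1]:.2f}")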

  1. The utility of low-density genotyping for imputation in the Thoroughbred horse.

    Science.gov (United States)

    Corbin, Laura J; Kranis, Andreas; Blott, Sarah C; Swinburne, June E; Vaudin, Mark; Bishop, Stephen C; Woolliams, John A

    2014-02-04

    Despite the dramatic reduction in the cost of high-density genotyping that has occurred over the last decade, it remains one of the limiting factors for obtaining the large datasets required for genomic studies of disease in the horse. In this study, we investigated the potential for low-density genotyping and subsequent imputation to address this problem. Using the haplotype phasing and imputation program, BEAGLE, it is possible to impute genotypes from low- to high-density (50K) in the Thoroughbred horse with reasonable to high accuracy. Analysis of the sources of variation in imputation accuracy revealed dependence both on the minor allele frequency of the single nucleotide polymorphisms (SNPs) being imputed and on the underlying linkage disequilibrium structure. Whereas equidistant spacing of the SNPs on the low-density panel worked well, optimising SNP selection to increase their minor allele frequency was advantageous, even when the panel was subsequently used in a population of different geographical origin. Replacing base pair position with linkage disequilibrium map distance reduced the variation in imputation accuracy across SNPs. Whereas a 1K SNP panel was generally sufficient to ensure that more than 80% of genotypes were correctly imputed, other studies suggest that a 2K to 3K panel is more efficient to minimize the subsequent loss of accuracy in genomic prediction analyses. The relationship between accuracy and genotyping costs for the different low-density panels, suggests that a 2K SNP panel would represent good value for money. Low-density genotyping with a 2K SNP panel followed by imputation provides a compromise between cost and accuracy that could promote more widespread genotyping, and hence the use of genomic information in horses. In addition to offering a low cost alternative to high-density genotyping, imputation provides a means to combine datasets from different genotyping platforms, which is becoming necessary since researchers are

  2. A comparison of multiple imputation methods for incomplete longitudinal binary data.

    Science.gov (United States)

    Yamaguchi, Yusuke; Misumi, Toshihiro; Maruo, Kazushi

    2017-09-08

    Longitudinal binary data are commonly encountered in clinical trials. Multiple imputation is an approach for obtaining valid estimates of treatment effects under an assumption of a missing at random mechanism. Although there are a variety of multiple imputation methods for longitudinal binary data, few studies have reported on the relative performance of these methods. Moreover, when focusing on the treatment effect throughout a period, which has often been used in clinical evaluations of specific disease areas, no definitive investigations comparing the methods have been available. We conducted an extensive simulation study to examine the comparative performance of six multiple imputation methods available in the SAS MI procedure for longitudinal binary data, where two endpoints, the responder rates at a specified time point and throughout a period, were assessed. The simulation study suggested that results from naïve approaches, such as single imputation of non-responders and complete case analysis, could be very sensitive to missing data. The multiple imputation methods using a monotone method and a full conditional specification with a logistic regression imputation model were recommended for obtaining unbiased and robust estimates of the treatment effect. The methods are illustrated with data from a mental health study.

  3. Hospital distribution in a metropolitan city: assessment by a geographical information system grid modelling approach

    Directory of Open Access Journals (Sweden)

    Kwang-Soo Lee

    2014-05-01

    Full Text Available Grid models were used to assess urban hospital distribution in Seoul, the capital of South Korea. A geographical information system (GIS) based analytical model was developed and applied to assess the situation in a metropolitan area with a population exceeding 10 million. Secondary data for this analysis were obtained from multiple sources: the Korean Statistical Information Service, the Korean Hospital Association and the Statistical Geographical Information System. A grid of cells measuring 1 × 1 km was superimposed on the city map and a set of variables related to population, economy, mobility and housing were identified and measured for each cell. Socio-demographic variables were included to reflect the characteristics of each area. Analytical models were then developed using GIS software with the number of hospitals as the dependent variable. Applying multiple linear regression and geographically weighted regression models, three factors (highway and major arterial road areas; number of subway entrances; and row house areas) were statistically significant in explaining the variance of hospital distribution for each cell. The overall results show that GIS is a useful tool for analysing and understanding location strategies. This approach appears a useful source of information for decision-makers concerned with the distribution of hospitals and other health care centres in a city.

  4. Hospital distribution in a metropolitan city: assessment by a geographic information system grid modelling approach.

    Science.gov (United States)

    Lee, Kwang-Soo; Moon, Kyeong-Jun

    2014-05-01

    Grid models were used to assess urban hospital distribution in Seoul, the capital of South Korea. A geographical information system (GIS) based analytical model was developed and applied to assess the situation in a metropolitan area with a population exceeding 10 million. Secondary data for this analysis were obtained from multiple sources: the Korean Statistical Information Service, the Korean Hospital Association and the Statistical Geographical Information System. A grid of cells measuring 1 × 1 km was superimposed on the city map and a set of variables related to population, economy, mobility and housing were identified and measured for each cell. Socio-demographic variables were included to reflect the characteristics of each area. Analytical models were then developed using GIS software with the number of hospitals as the dependent variable. Applying multiple linear regression and geographically weighted regression models, three factors (highway and major arterial road areas; number of subway entrances; and row house areas) were statistically significant in explaining the variance of hospital distribution for each cell. The overall results show that GIS is a useful tool for analysing and understanding location strategies. This approach appears a useful source of information for decision-makers concerned with the distribution of hospitals and other health care centres in a city.

  5. An Imputation Model for Dropouts in Unemployment Data

    Directory of Open Access Journals (Sweden)

    Nilsson Petra

    2016-09-01

    Full Text Available Incomplete unemployment data is a fundamental problem when evaluating labour market policies in several countries. Many unemployment spells end for unknown reasons; in the Swedish Public Employment Service's register, as many as 20 percent. This leads to ambiguity regarding destination states (employment, unemployment, retired, etc.). According to complete combined administrative data, the employment rate among dropouts was close to 50 percent for the years 1992 to 2006, but from 2007 the employment rate has dropped to 40 percent or less. This article explores an imputation approach. We investigate imputation models estimated both on survey data from 2005/2006 and on complete combined administrative data from 2005/2006 and 2011/2012. The models are evaluated in terms of their ability to make correct predictions. The models have relatively high predictive power.

  6. A Geographical-Based Multi-Criteria Approach for Marine Energy Farm Planning

    Directory of Open Access Journals (Sweden)

    Nicolas Maslov

    2014-05-01

    Full Text Available The objective of this paper is to devise a strategy for developing a flexible tool to efficiently install a marine energy farm in a suitable area. The current methodology is applied to marine tidal currents, although it can be extended to other energy contexts with some adaptations. We introduce a three-step approach that searches for marine farm sites and technological solutions. The methodology applied is based on a combination of Geographic Information Systems (GIS), multi-criteria analysis (MCA) and an optimization algorithm. The integration of GIS and MCA is at the core of the search process for the best-suited marine areas, taking into account geographical constraints, such as human activity, pressure on the environment and technological opportunities. The optimization step of the approach evaluates the most appropriate technologies and farm configurations in order to maximize the quantity of energy produced while minimizing the cost of the farm. Three main criteria are applied to finally characterize a location for a marine energy farm: the global cost of the project, the quantity of energy produced and social acceptance. The social acceptance criterion is evaluated by the MCA method, Electre III, while the optimization of the energy cost is approximated by a genetic algorithm. The whole approach is illustrated by a case study applied to a maritime area in North-West France.

  7. Assessment of imputation methods using varying ecological information to fill the gaps in a tree functional trait database

    Science.gov (United States)

    Poyatos, Rafael; Sus, Oliver; Vilà-Cabrera, Albert; Vayreda, Jordi; Badiella, Llorenç; Mencuccini, Maurizio; Martínez-Vilalta, Jordi

    2016-04-01

    Plant functional traits are increasingly being used in ecosystem ecology thanks to the growing availability of large ecological databases. However, these databases usually contain a large fraction of missing data because measuring plant functional traits systematically is labour-intensive and because most databases are compilations of datasets with different sampling designs. As a result, within a given database, there is an inevitable variability in the number of traits available for each data entry and/or the species coverage in a given geographical area. The presence of missing data may severely bias trait-based analyses, such as the quantification of trait covariation or trait-environment relationships and may hamper efforts towards trait-based modelling of ecosystem biogeochemical cycles. Several data imputation (i.e. gap-filling) methods have been recently tested on compiled functional trait databases, but the performance of imputation methods applied to a functional trait database with a regular spatial sampling has not been thoroughly studied. Here, we assess the effects of data imputation on five tree functional traits (leaf biomass to sapwood area ratio, foliar nitrogen, maximum height, specific leaf area and wood density) in the Ecological and Forest Inventory of Catalonia, an extensive spatial database (covering 31900 km2). We tested the performance of species mean imputation, single imputation by the k-nearest neighbors algorithm (kNN) and a multiple imputation method, Multivariate Imputation with Chained Equations (MICE) at different levels of missing data (10%, 30%, 50%, and 80%). We also assessed the changes in imputation performance when additional predictors (species identity, climate, forest structure, spatial structure) were added in kNN and MICE imputations. We evaluated the imputed datasets using a battery of indexes describing departure from the complete dataset in trait distribution, in the mean prediction error, in the correlation matrix
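
    The evaluation logic used for the trait database, deleting known values, imputing them back, and scoring the departure from the complete data, is easy to reproduce on toy data. The sketch below is a generic Python illustration with scikit-learn's KNNImputer standing in for the kNN step (the trait matrix is simulated, and the MICE and species-mean strategies are not shown); it reports the normalised root mean squared error on the artificially deleted cells at several missingness levels.

        import numpy as np
        from sklearn.impute import KNNImputer

        rng = np.random.default_rng(5)

        # Simulated "complete" trait matrix: 1000 plots x 5 correlated functional traits.
        n, p = 1000, 5
        latent = rng.normal(size=(n, 1))
        traits = latent @ rng.normal(size=(1, p)) + rng.normal(0, 0.5, size=(n, p))

        def nrmse(truth, imputed, cells):
            """Root mean squared error on the deleted cells, scaled by their spread."""
            err = imputed[cells] - truth[cells]
            return np.sqrt(np.mean(err ** 2)) / np.std(truth[cells])

        for frac in (0.1, 0.3, 0.5, 0.8):
            with_gaps = traits.copy()
            cells = rng.random(traits.shape) < frac     # delete a known fraction of cells
            with_gaps[cells] = np.nan
            imputed = KNNImputer(n_neighbors=5).fit_transform(with_gaps)
            print(f"missing {int(frac * 100):2d}%  kNN NRMSE = {nrmse(traits, imputed, cells):.3f}")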

  8. Comparison of HLA allelic imputation programs

    Science.gov (United States)

    Shaffer, Christian M.; Bastarache, Lisa; Gaudieri, Silvana; Glazer, Andrew M.; Steiner, Heidi E.; Mosley, Jonathan D.; Mallal, Simon; Denny, Joshua C.; Phillips, Elizabeth J.; Roden, Dan M.

    2017-01-01

    Imputation of human leukocyte antigen (HLA) alleles from SNP-level data is attractive due to importance of HLA alleles in human disease, widespread availability of genome-wide association study (GWAS) data, and expertise required for HLA sequencing. However, comprehensive evaluations of HLA imputations programs are limited. We compared HLA imputation results of HIBAG, SNP2HLA, and HLA*IMP:02 to sequenced HLA alleles in 3,265 samples from BioVU, a de-identified electronic health record database coupled to a DNA biorepository. We performed four-digit HLA sequencing for HLA-A, -B, -C, -DRB1, -DPB1, and -DQB1 using long-read 454 FLX sequencing. All samples were genotyped using both the Illumina HumanExome BeadChip platform and a GWAS platform. Call rates and concordance rates were compared by platform, frequency of allele, and race/ethnicity. Overall concordance rates were similar between programs in European Americans (EA) (0.975 [SNP2HLA]; 0.939 [HLA*IMP:02]; 0.976 [HIBAG]). SNP2HLA provided a significant advantage in terms of call rate and the number of alleles imputed. Concordance rates were lower overall for African Americans (AAs). These observations were consistent when accuracy was compared across HLA loci. All imputation programs performed similarly for low frequency HLA alleles. Higher concordance rates were observed when HLA alleles were imputed from GWAS platforms versus the HumanExome BeadChip, suggesting that high genomic coverage is preferred as input for HLA allelic imputation. These findings provide guidance on the best use of HLA imputation methods and elucidate their limitations. PMID:28207879

  9. Multi-population classical HLA type imputation.

    Directory of Open Access Journals (Sweden)

    Alexander Dilthey

    Full Text Available Statistical imputation of classical HLA alleles in case-control studies has become established as a valuable tool for identifying and fine-mapping signals of disease association in the MHC. Imputation into diverse populations has, however, remained challenging, mainly because of the additional haplotypic heterogeneity introduced by combining reference panels of different sources. We present an HLA type imputation model, HLA*IMP:02, designed to operate on a multi-population reference panel. HLA*IMP:02 is based on a graphical representation of haplotype structure. We present a probabilistic algorithm to build such models for the HLA region, accommodating genotyping error, haplotypic heterogeneity and the need for maximum accuracy at the HLA loci, generalizing the work of Browning and Browning (2007) and Ron et al. (1998). HLA*IMP:02 achieves an average 4-digit imputation accuracy on diverse European panels of 97% (call rate 97%). On non-European samples, 2-digit performance is over 90% for most loci and ethnicities where data are available. HLA*IMP:02 supports imputation of HLA-DPB1 and HLA-DRB3-5, is highly tolerant of missing data in the imputation panel and works on standard genotype data from popular genotyping chips. It is publicly available in source code and as a user-friendly web service framework.

  10. Missing value imputation for microarray gene expression data using histone acetylation information

    Directory of Open Access Journals (Sweden)

    Feng Jihua

    2008-05-01

    Full Text Available Abstract Background Accurately estimating missing values in microarray data is an important pre-processing step, because complete datasets are required in numerous expression profile analyses in bioinformatics. Although several methods have been suggested, their performance is not satisfactory for datasets with high missing percentages. Results This paper explores the feasibility of imputing missing values with the help of gene regulatory mechanisms. An imputation framework called the histone acetylation information aided imputation method (HAIimpute) is presented. It incorporates histone acetylation information into the conventional KNN (k-nearest neighbor) and LLS (local least square) imputation algorithms for the final prediction of the missing values. The experimental results indicated that the use of acetylation information can provide significant improvements in microarray imputation accuracy. The HAIimpute methods consistently improve the widely used methods such as KNN and LLS in terms of normalized root mean squared error (NRMSE). Meanwhile, the genes imputed by the HAIimpute methods are more correlated with the original complete genes in terms of Pearson correlation coefficients. Furthermore, the proposed methods also outperform GOimpute, which is one of the existing related methods that use functional similarity as the external information. Conclusion We demonstrated that the use of histone acetylation information could greatly improve the performance of the imputation, especially at high missing percentages. This idea can be generalized to various imputation methods to improve their performance. Moreover, with more knowledge accumulated on gene regulatory mechanisms in addition to histone acetylation, the performance of our approach can be further improved and verified.

  11. Across-Platform Imputation of DNA Methylation Levels Incorporating Nonlocal Information Using Penalized Functional Regression.

    Science.gov (United States)

    Zhang, Guosheng; Huang, Kuan-Chieh; Xu, Zheng; Tzeng, Jung-Ying; Conneely, Karen N; Guan, Weihua; Kang, Jian; Li, Yun

    2016-05-01

    DNA methylation is a key epigenetic mark involved in both normal development and disease progression. Recent advances in high-throughput technologies have enabled genome-wide profiling of DNA methylation. However, DNA methylation profiling often employs different designs and platforms with varying resolution, which hinders joint analysis of methylation data from multiple platforms. In this study, we propose a penalized functional regression model to impute missing methylation data. By incorporating functional predictors, our model utilizes information from nonlocal probes to improve imputation quality. Here, we compared the performance of our functional model to linear regression and the best single probe surrogate in real data and via simulations. Specifically, we applied different imputation approaches to an acute myeloid leukemia dataset consisting of 194 samples and our method showed higher imputation accuracy, manifested, for example, by a 94% relative increase in information content and up to 86% more CpG sites passing post-imputation filtering. Our simulated association study further demonstrated that our method substantially improves the statistical power to identify trait-associated methylation loci. These findings indicate that the penalized functional regression model is a convenient and valuable imputation tool for methylation data, and it can boost statistical power in downstream epigenome-wide association study (EWAS).

  12. High Degree Spherical Harmonic Synthesis Over Geographic Rectangles: A Simple Approach

    Science.gov (United States)

    Holmes, S. A.; Featherstone, W. E.; Kuhn, M.

    Future spherical harmonic models of the geopotential and other quantities, such as digital elevation models, are likely to extend to degree 2160 (corresponding to 5' by 5' geographic rectangles), and beyond. Simple techniques have been developed by the first two authors (Journal of Geodesy, in press) for high-degree (2700) `point' synthesis of gravimetric quantities, in IEEE double precision, pole to pole. Numerical underflows are avoided by modifying existing recursive algorithms to generate scaled, fully normalised, associated Legendre functions [ALFs] and their first and second derivatives. Final point-synthesis and rescaling was achieved using Horner's scheme. This simple approach has now been extended to stabilise high-degree (2700) `integral' synthesis over geographic rectangles (bound by meridians and parallels). Existing recursive routines compute definite integrals of ALFs for constant orders (`column-wise'). New routines have been designed to compute definite integrals of ALFs for constant degrees (`row-wise'). Both routines have been modified to generate scaled integrals. Final synthesis and rescaling is achieved using Horner's scheme. Preliminary tests indicate that this approach allows, in IEEE double precision, integral synthesis to degree and order 2700, pole to pole, without underflow or overflow errors. Numerical tests suggest the new row-wise routines to be more precise than the column-wise routines, especially in polar regions.
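
    Horner's scheme, mentioned above for the final synthesis and rescaling step, is simply nested evaluation of a truncated series. The Python fragment below is a generic illustration of the rule rather than the authors' ALF-scaling implementation: a degree-n polynomial is evaluated with n multiplications and additions and without forming explicit powers, which is the same nesting that allows a very small global scale factor to be reintroduced gradually instead of all at once.

        def horner(coeffs, x):
            """Evaluate c[0] + c[1]*x + ... + c[n]*x**n by nested multiplication
            (Horner's rule), avoiding explicit powers of x."""
            result = 0.0
            for c in reversed(coeffs):
                result = result * x + c
            return result

        # Example: p(x) = 1 + 2x + 3x^2 + 4x^3 evaluated at x = 0.5
        coeffs = [1.0, 2.0, 3.0, 4.0]
        print(horner(coeffs, 0.5))                               # 3.25
        print(sum(c * 0.5 ** k for k, c in enumerate(coeffs)))   # same value, naive evaluation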

  13. Imputation of missing data in time series for air pollutants

    Science.gov (United States)

    Junger, W. L.; Ponce de Leon, A.

    2015-02-01

    Missing data are a major concern in epidemiological studies of the health effects of environmental air pollutants. This article presents an imputation-based method that is suitable for multivariate time series data, which uses the EM algorithm under the assumption of a normal distribution. Different approaches are considered for filtering the temporal component. A simulation study was performed to assess the validity and performance of the proposed method in comparison with some frequently used methods. Simulations showed that, when the amount of missing data was as low as 5%, the complete data analysis yielded satisfactory results regardless of the generating mechanism of the missing data, whereas the validity began to degenerate when the proportion of missing values exceeded 10%. The proposed imputation method exhibited good accuracy and precision in different settings with respect to the patterns of missing observations. Most of the imputations obtained valid results, even under missing not at random. The methods proposed in this study are implemented as a package called mtsdi for the statistical software system R.
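
    The EM step underlying the method can be sketched compactly for the multivariate normal case. The Python example below is a generic illustration on simulated data; it omits the temporal-filtering component and is not the mtsdi implementation. The E-step replaces missing entries by their conditional expectations given the observed entries and the current mean/covariance, and the M-step re-estimates the mean and covariance, including the conditional covariance of the missing block.

        import numpy as np

        rng = np.random.default_rng(11)

        # Simulated multivariate "pollutant" series: 365 days x 3 correlated variables,
        # with ~15% of the observations deleted at random.
        n, p = 365, 3
        cov_true = np.array([[1.0, 0.6, 0.3],
                             [0.6, 1.0, 0.5],
                             [0.3, 0.5, 1.0]])
        data = rng.multivariate_normal(np.zeros(p), cov_true, size=n)
        data[rng.random((n, p)) < 0.15] = np.nan

        def em_normal_impute(x, n_iter=50):
            """EM for a multivariate normal mean/covariance with missing values;
            returns conditional-mean imputations plus the ML mean and covariance."""
            miss = np.isnan(x)
            mu = np.nanmean(x, axis=0)
            x_filled = np.where(miss, mu, x)
            sigma = np.cov(x_filled, rowvar=False)
            for _ in range(n_iter):
                ss = np.zeros((x.shape[1], x.shape[1]))         # accumulates E[x x^T]
                for i in range(x.shape[0]):
                    m, o = miss[i], ~miss[i]
                    if m.any() and o.any():
                        reg = sigma[np.ix_(m, o)] @ np.linalg.inv(sigma[np.ix_(o, o)])
                        x_filled[i, m] = mu[m] + reg @ (x_filled[i, o] - mu[o])
                        cond_cov = sigma[np.ix_(m, m)] - reg @ sigma[np.ix_(o, m)]
                    elif m.any():                               # row entirely missing
                        x_filled[i, m] = mu[m]
                        cond_cov = sigma[np.ix_(m, m)]
                    ss += np.outer(x_filled[i], x_filled[i])
                    if m.any():
                        ss[np.ix_(m, m)] += cond_cov            # E-step second-moment correction
                mu = x_filled.mean(axis=0)                      # M-step
                sigma = ss / x.shape[0] - np.outer(mu, mu)
            return x_filled, mu, sigma

        completed, mu_hat, sigma_hat = em_normal_impute(data)
        print(sigma_hat.round(2))    # ML covariance estimate, close to cov_true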

  14. Genotype Imputation To Improve the Cost-Efficiency of Genomic Selection in Farmed Atlantic Salmon

    Directory of Open Access Journals (Sweden)

    Hsin-Yuan Tsai

    2017-04-01

    Full Text Available Genomic selection uses genome-wide marker information to predict breeding values for traits of economic interest, and is more accurate than pedigree-based methods. The development of high density SNP arrays for Atlantic salmon has enabled genomic selection in selective breeding programs, alongside high-resolution association mapping of the genetic basis of complex traits. However, in sibling testing schemes typical of salmon breeding programs, trait records are available on many thousands of fish with close relationships to the selection candidates. Therefore, routine high density SNP genotyping may be prohibitively expensive. One means to reducing genotyping cost is the use of genotype imputation, where selected key animals (e.g., breeding program parents) are genotyped at high density, and the majority of individuals (e.g., performance tested fish and selection candidates) are genotyped at much lower density, followed by imputation to high density. The main objectives of the current study were to assess the feasibility and accuracy of genotype imputation in the context of a salmon breeding program. The specific aims were: (i) to measure the accuracy of genotype imputation using medium (25 K) and high (78 K) density mapped SNP panels, by masking varying proportions of the genotypes and assessing the correlation between the imputed genotypes and the true genotypes; and (ii) to assess the efficacy of imputed genotype data in genomic prediction of key performance traits (sea lice resistance and body weight). Imputation accuracies of up to 0.90 were observed using the simple two-generation pedigree dataset, and moderately high accuracy (0.83) was possible even with very low density SNP data (∼250 SNPs). The performance of genomic prediction using imputed genotype data was comparable to using true genotype data, and both were superior to pedigree-based prediction. These results demonstrate that the genotype imputation approach used in this study can

  15. Genotype Imputation To Improve the Cost-Efficiency of Genomic Selection in Farmed Atlantic Salmon

    Science.gov (United States)

    Tsai, Hsin-Yuan; Matika, Oswald; Edwards, Stefan McKinnon; Antolín–Sánchez, Roberto; Hamilton, Alastair; Guy, Derrick R.; Tinch, Alan E.; Gharbi, Karim; Stear, Michael J.; Taggart, John B.; Bron, James E.; Hickey, John M.; Houston, Ross D.

    2017-01-01

    Genomic selection uses genome-wide marker information to predict breeding values for traits of economic interest, and is more accurate than pedigree-based methods. The development of high density SNP arrays for Atlantic salmon has enabled genomic selection in selective breeding programs, alongside high-resolution association mapping of the genetic basis of complex traits. However, in sibling testing schemes typical of salmon breeding programs, trait records are available on many thousands of fish with close relationships to the selection candidates. Therefore, routine high density SNP genotyping may be prohibitively expensive. One means to reducing genotyping cost is the use of genotype imputation, where selected key animals (e.g., breeding program parents) are genotyped at high density, and the majority of individuals (e.g., performance tested fish and selection candidates) are genotyped at much lower density, followed by imputation to high density. The main objectives of the current study were to assess the feasibility and accuracy of genotype imputation in the context of a salmon breeding program. The specific aims were: (i) to measure the accuracy of genotype imputation using medium (25 K) and high (78 K) density mapped SNP panels, by masking varying proportions of the genotypes and assessing the correlation between the imputed genotypes and the true genotypes; and (ii) to assess the efficacy of imputed genotype data in genomic prediction of key performance traits (sea lice resistance and body weight). Imputation accuracies of up to 0.90 were observed using the simple two-generation pedigree dataset, and moderately high accuracy (0.83) was possible even with very low density SNP data (∼250 SNPs). The performance of genomic prediction using imputed genotype data was comparable to using true genotype data, and both were superior to pedigree-based prediction. These results demonstrate that the genotype imputation approach used in this study can provide a cost

  16. Teaching Introductory GIS Programming to Geographers Using an Open Source Python Approach

    Science.gov (United States)

    Etherington, Thomas R.

    2016-01-01

    Computer programming is not commonly taught to geographers as a part of geographic information system (GIS) courses, but the advent of NeoGeography, big data and open GIS means that programming skills are becoming more important. To encourage the teaching of programming to geographers, this paper outlines a course based around a series of…

  18. Multiple imputation: dealing with missing data.

    Science.gov (United States)

    de Goeij, Moniek C M; van Diepen, Merel; Jager, Kitty J; Tripepi, Giovanni; Zoccali, Carmine; Dekker, Friedo W

    2013-10-01

    In many fields, including the field of nephrology, missing data are unfortunately an unavoidable problem in clinical/epidemiological research. The most common methods for dealing with missing data are complete case analysis (excluding patients with missing data), mean substitution (replacing missing values of a variable with the average of known values for that variable) and last observation carried forward. However, these methods have severe drawbacks, potentially resulting in biased estimates and/or standard errors. In recent years, a new method has arisen for dealing with missing data called multiple imputation. This method predicts missing values based on other data present in the same patient. This procedure is repeated several times, resulting in multiple imputed data sets. Thereafter, estimates and standard errors are calculated in each imputation set and pooled into one overall estimate and standard error. The main advantage of this method is that missing data uncertainty is taken into account. Another advantage is that the method of multiple imputation gives unbiased results when data are missing at random, which is the most common type of missing data in clinical practice, whereas conventional methods do not. However, the method of multiple imputation has scarcely been used in the medical literature. We, therefore, encourage authors to do so in the future when possible.
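
    The pooling step that distinguishes multiple imputation from single imputation follows Rubin's rules: the overall estimate is the average of the per-imputation estimates, and its variance combines the within-imputation and between-imputation variances. The short Python sketch below shows the arithmetic on hypothetical per-imputation estimates and standard errors.

        import numpy as np

        # Hypothetical estimates and standard errors from M = 5 imputed data sets
        estimates = np.array([1.82, 1.95, 1.78, 2.01, 1.88])
        std_errors = np.array([0.21, 0.23, 0.20, 0.22, 0.21])
        M = len(estimates)

        pooled_estimate = estimates.mean()
        within_var = (std_errors ** 2).mean()                 # average within-imputation variance
        between_var = estimates.var(ddof=1)                   # between-imputation variance
        total_var = within_var + (1 + 1 / M) * between_var    # Rubin's total variance
        pooled_se = np.sqrt(total_var)

        print(f"pooled estimate = {pooled_estimate:.3f}, pooled SE = {pooled_se:.3f}")

    The (1 + 1/M) factor inflates the between-imputation component to account for using a finite number of imputations, which is how the uncertainty due to the missing data enters the pooled standard error.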

  19. A geographic approach to modelling human exposure to traffic air pollution using GIS

    Energy Technology Data Exchange (ETDEWEB)

    Solvang Jensen, S.

    1998-10-01

    A new exposure model has been developed that is based on a physical, single media (air) and single source (traffic) micro environmental approach that estimates traffic related exposures geographically with the postal address as exposure indicator. The micro environments: residence, workplace and street (road user exposure) may be considered. The model estimates outdoor levels for selected ambient air pollutants (benzene, CO, NO{sub 2} and O{sub 3}). The influence of outdoor air pollution on indoor levels can be estimated using average (I/O) ratios. The model has a very high spatial resolution (the address), a high temporal resolution (one hour) and may be used to predict past, present and future exposures. The model may be used for impact assessment of control measures provided that the changes to the model inputs are obtained. The exposure model takes advantage of a standard Geographic Information System (GIS) (ArcView and Avenue) for generation of inputs, for visualisation of input and output, and uses available digital maps, national administrative registers and a local traffic database, and the Danish Operational Street Pollution Model (OSPM). The exposure model presents a new approach to exposure determination by integration of digital maps, administrative registers, a street pollution model and GIS. New methods have been developed to generate the required input parameters for the OSPM model: to geocode buildings using cadastral maps and address points, to automatically generate street configuration data based on digital maps, the BBR and GIS; to predict the temporal variation in traffic and related parameters; and to provide hourly background levels for the OSPM model. (EG) 109 refs.

  20. A geographic approach to modelling human exposure to traffic air pollution using GIS. Separate appendix report

    Energy Technology Data Exchange (ETDEWEB)

    Solvang Jensen, S.

    1998-10-01

    A new exposure model has been developed that is based on a physical, single-medium (air) and single-source (traffic) microenvironmental approach and estimates traffic-related exposures geographically, with the postal address as the exposure indicator. The micro environments residence, workplace and street (road user exposure) may be considered. The model estimates outdoor levels for selected ambient air pollutants (benzene, CO, NO{sub 2} and O{sub 3}). The influence of outdoor air pollution on indoor levels can be estimated using average indoor/outdoor (I/O) ratios. The model has a very high spatial resolution (the address), a high temporal resolution (one hour) and may be used to predict past, present and future exposures. The model may be used for impact assessment of control measures provided that the changes to the model inputs are obtained. The exposure model takes advantage of a standard Geographic Information System (GIS) (ArcView and Avenue) for generation of inputs and for visualisation of input and output, and uses available digital maps, national administrative registers, a local traffic database, and the Danish Operational Street Pollution Model (OSPM). The exposure model presents a new approach to exposure determination by integration of digital maps, administrative registers, a street pollution model and GIS. New methods have been developed to generate the required input parameters for the OSPM model: to geocode buildings using cadastral maps and address points; to automatically generate street configuration data based on digital maps, the Building and Dwelling Register (BBR) and GIS; to predict the temporal variation in traffic and related parameters; and to provide hourly background levels for the OSPM model. (EG)

  1. Working with Missing Data: Imputation of Nonresponse Items in Categorical Survey Data with a Non-Monotone Missing Pattern

    OpenAIRE

    Wilson, Machelle D; Kerstin Lueck

    2014-01-01

    The imputation of missing data is often a crucial step in the analysis of survey data. This study reviews typical problems with missing data and discusses a method for the imputation of missing survey data with a large number of categorical variables which do not have a monotone missing pattern. We develop a method for constructing a monotone missing pattern that allows for imputation of categorical data in data sets with a large number of variables using a model-based MCMC approach. We repor...

  2. Accuracy of genome-wide imputation of untyped markers and impacts on statistical power for association studies

    Directory of Open Access Journals (Sweden)

    McElwee Joshua

    2009-06-01

    -eQTL discoveries detected by various methods can be interpreted as their relative statistical power in the GWAS. In this study, we find that imputation offers modest additional power (by 4%) on top of either Ilmn317K or Ilmn650Y, much less than the power gain from Ilmn317K to Ilmn650Y (13%). Conclusion Current algorithms can accurately impute genotypes for untyped markers, which enables researchers to pool data between studies conducted using different SNP sets. While genotyping itself results in a small error rate (e.g. 0.5%), imputing genotypes is surprisingly accurate. We found that dense marker sets (e.g. Ilmn650Y) outperform sparser ones (e.g. Ilmn317K) in terms of imputation yield and accuracy. We also noticed that it was harder to impute genotypes for African American samples, partially due to population admixture, although using a pooled reference boosts performance. Interestingly, GWAS carried out using imputed genotypes only slightly increased power on top of assayed SNPs. The likely reason is that adding more markers via imputation yields only a modest gain in genetic coverage while worsening the multiple testing penalty. Furthermore, cis-eQTL mapping using the dense SNP set derived from imputation achieves greater resolution and locates association peaks closer to causal variants than the conventional approach.

  3. A nearest neighbour approach by genetic distance to the assignment of individual trees to geographic origin.

    Science.gov (United States)

    Degen, Bernd; Blanc-Jolivet, Céline; Stierand, Katrin; Gillet, Elizabeth

    2017-03-01

    During the past decade, the use of DNA for forensic applications has been extensively implemented for plant and animal species, as well as in humans. Tracing back the geographical origin of an individual usually requires genetic assignment analysis. These approaches are based on reference samples that are grouped into populations or other aggregates and intend to identify the most likely group of origin. Often this grouping does not have a biological but rather a historical or political justification, such as "country of origin". In this paper, we present a new nearest neighbour approach to individual assignment or classification within a given but potentially imperfect grouping of reference samples. This method, which is based on the genetic distance between individuals, functions better in many cases than commonly used methods. We demonstrate the operation of our assignment method using two data sets. One set is simulated for a large number of trees distributed in a 120 km by 120 km landscape with individual genotypes at 150 SNPs, and the other set comprises experimental data of 1221 individuals of the African tropical tree species Entandrophragma cylindricum (Sapelli) genotyped at 61 SNPs. Judging by the level of correct self-assignment, our approach outperformed the commonly used frequency and Bayesian approaches by 15% for the simulated data set and by 5-7% for the Sapelli data set. Our new approach is less sensitive to overlapping sources of genetic differentiation, such as genetic differences among closely related species, phylogeographic lineages and isolation by distance, and thus operates better even for suboptimal grouping of individuals. Copyright © 2016 Elsevier Ireland Ltd. All rights reserved.
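
    The leave-one-out self-assignment idea used to compare the methods above can be sketched in a few lines. The grouping, allele frequencies and the simple allele-sharing distance below are synthetic stand-ins, not the authors' data or their exact distance measure.

      import numpy as np

      rng = np.random.default_rng(1)

      # Toy reference set: 0/1/2 genotype counts at 60 SNPs for individuals from
      # two illustrative groups with slightly different allele frequencies.
      n_per_group, n_snps = 50, 60
      freqs_a = rng.uniform(0.2, 0.8, n_snps)
      freqs_b = np.clip(freqs_a + rng.normal(0, 0.15, n_snps), 0.05, 0.95)
      geno = np.vstack([rng.binomial(2, freqs_a, (n_per_group, n_snps)),
                        rng.binomial(2, freqs_b, (n_per_group, n_snps))]).astype(float)
      groups = np.array(["A"] * n_per_group + ["B"] * n_per_group)

      def pairwise_distance(x, y):
          # Simple allele-sharing distance between two genotype vectors
          # (a stand-in for the genetic distance used in the paper).
          return np.abs(x - y).sum() / (2 * len(x))

      # Leave-one-out self-assignment: assign each individual to the group of its
      # nearest reference neighbour, then measure the rate of correct assignment.
      correct = 0
      for i in range(len(geno)):
          d = np.array([pairwise_distance(geno[i], geno[j]) if j != i else np.inf
                        for j in range(len(geno))])
          correct += groups[d.argmin()] == groups[i]
      print(f"correct self-assignment: {correct / len(geno):.1%}")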

  4. Combining estimates of interest in prognostic modelling studies after multiple imputation: current practice and guidelines

    Directory of Open Access Journals (Sweden)

    Holder Roger L

    2009-07-01

    Full Text Available Abstract Background Multiple imputation (MI) provides an effective approach to handle missing covariate data within prognostic modelling studies, as it can properly account for the missing data uncertainty. The multiply imputed datasets are each analysed using standard prognostic modelling techniques to obtain the estimates of interest. The estimates from each imputed dataset are then combined into one overall estimate and variance, incorporating both the within- and between-imputation variability. Rubin's rules for combining these multiply imputed estimates are based on asymptotic theory. The resulting combined estimates may be more accurate if the posterior distribution of the population parameter of interest is better approximated by the normal distribution. However, the normality assumption may not be appropriate for all the parameters of interest when analysing prognostic modelling studies, such as predicted survival probabilities and model performance measures. Methods Guidelines for combining the estimates of interest when analysing prognostic modelling studies are provided. A literature review is performed to identify current practice for combining such estimates in prognostic modelling studies. Results Methods for combining all reported estimates after MI were not well reported in the current literature. Rubin's rules without applying any transformations were the standard approach used, when any method was stated. Conclusion The proposed simple guidelines for combining estimates after MI may lead to a wider and more appropriate use of MI in future prognostic modelling studies.
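
    For reference, the combination rules referred to above (Rubin's rules) take the following form for m imputed data sets with per-imputation estimates \hat{Q}_i and variances \hat{W}_i:

      \bar{Q} = \frac{1}{m}\sum_{i=1}^{m}\hat{Q}_i, \qquad
      \bar{W} = \frac{1}{m}\sum_{i=1}^{m}\hat{W}_i, \qquad
      B = \frac{1}{m-1}\sum_{i=1}^{m}\bigl(\hat{Q}_i - \bar{Q}\bigr)^2,

      T = \bar{W} + \Bigl(1 + \frac{1}{m}\Bigr) B

    where \bar{Q} is the pooled estimate and T its total variance, combining the within-imputation variance \bar{W} and the between-imputation variance B. The normality concern raised in the abstract applies to the sampling distribution assumed when forming confidence intervals from \bar{Q} and T, which is why transformations are recommended for quantities such as survival probabilities.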

  5. Working with Missing Data: Imputation of Nonresponse Items in Categorical Survey Data with a Non-Monotone Missing Pattern

    Directory of Open Access Journals (Sweden)

    Machelle D. Wilson

    2014-01-01

    Full Text Available The imputation of missing data is often a crucial step in the analysis of survey data. This study reviews typical problems with missing data and discusses a method for the imputation of missing survey data with a large number of categorical variables which do not have a monotone missing pattern. We develop a method for constructing a monotone missing pattern that allows for imputation of categorical data in data sets with a large number of variables using a model-based MCMC approach. We report the results of imputing the missing data from a case study, using educational, sociopsychological, and socioeconomic data from the National Latino and Asian American Study (NLAAS). We report the results of multiply imputed data on a substantive logistic regression analysis predicting socioeconomic success from several educational, sociopsychological, and familial variables. We compare the results of conducting inference using a single imputed data set to those using a combined test over several imputations. Findings indicate that, for all variables in the model, all of the single tests were consistent with the combined test.
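
    A common first step toward an (approximately) monotone missing pattern is simply ordering the variables by how much data they are missing before sequential imputation. The sketch below shows only that ordering step on invented survey items; it is not the paper's full construction, and the column names and missingness rates are assumptions.

      import numpy as np
      import pandas as pd

      rng = np.random.default_rng(2)

      # Synthetic categorical survey data with a non-monotone missing pattern.
      df = pd.DataFrame({f"q{i}": rng.integers(1, 6, 300).astype(float) for i in range(1, 6)})
      for col, frac in zip(df.columns, [0.05, 0.30, 0.15, 0.40, 0.10]):
          df.loc[rng.random(len(df)) < frac, col] = np.nan

      # Reorder columns from least to most missing: the imputation then proceeds
      # column by column, so earlier (more complete) items can inform later ones.
      order = df.isna().mean().sort_values().index.tolist()
      print("imputation order:", order)
      print(df[order].isna().mean().round(2))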

  6. Sources of endocrine-disrupting compounds in North Carolina waterways: a geographic information systems approach.

    Science.gov (United States)

    Sackett, Dana K; Pow, Crystal Lee; Rubino, Matthew J; Aday, D Derek; Cope, W Gregory; Kullman, Seth; Rice, James A; Kwak, Thomas J; Law, Mac

    2015-02-01

    The presence of endocrine-disrupting compounds (EDCs), particularly estrogenic compounds, in the environment has drawn public attention across the globe, yet a clear understanding of the extent and distribution of estrogenic EDCs in surface waters and their relationship to potential sources is lacking. The objective of the present study was to identify and examine the potential input of estrogenic EDC sources in North Carolina water bodies using a geographic information system (GIS) mapping and analysis approach. Existing data from state and federal agencies were used to create point and nonpoint source maps depicting the cumulative contribution of potential sources of estrogenic EDCs to North Carolina surface waters. Water was collected from 33 sites (12 associated with potential point sources, 12 associated with potential nonpoint sources, and 9 reference), to validate the predictive results of the GIS analysis. Estrogenicity (measured as 17β-estradiol equivalence) ranged from 0.06 ng/L to 56.9 ng/L. However, the majority of sites (88%) had water 17β-estradiol concentrations below 1 ng/L. Sites associated with point and nonpoint sources had significantly higher 17β-estradiol levels than reference sites. The results suggested that water 17β-estradiol was reflective of GIS predictions, confirming the relevance of landscape-level influences on water quality and validating the GIS approach to characterize such relationships. © 2014 SETAC.

  7. Update on tick-borne rickettsioses around the world: a geographic approach.

    Science.gov (United States)

    Parola, Philippe; Paddock, Christopher D; Socolovschi, Cristina; Labruna, Marcelo B; Mediannikov, Oleg; Kernif, Tahar; Abdad, Mohammad Yazid; Stenos, John; Bitam, Idir; Fournier, Pierre-Edouard; Raoult, Didier

    2013-10-01

    Tick-borne rickettsioses are caused by obligate intracellular bacteria belonging to the spotted fever group of the genus Rickettsia. These zoonoses are among the oldest known vector-borne diseases. However, in the past 25 years, the scope and importance of the recognized tick-associated rickettsial pathogens have increased dramatically, making this complex of diseases an ideal paradigm for the understanding of emerging and reemerging infections. Several species of tick-borne rickettsiae that were considered nonpathogenic for decades are now associated with human infections, and novel Rickettsia species of undetermined pathogenicity continue to be detected in or isolated from ticks around the world. This remarkable expansion of information has been driven largely by the use of molecular techniques that have facilitated the identification of novel and previously recognized rickettsiae in ticks. New approaches, such as swabbing of eschars to obtain material to be tested by PCR, have emerged in recent years and have played a role in describing emerging tick-borne rickettsioses. Here, we present the current knowledge on tick-borne rickettsiae and rickettsioses using a geographic approach toward the epidemiology of these diseases.

  8. Sources of endocrine-disrupting compounds in North Carolina waterways: a geographic information systems approach

    Science.gov (United States)

    Sackett, Dana K.; Pow, Crystal Lee; Rubino, Matthew; Aday, D.D.; Cope, W. Gregory; Kullman, Seth W.; Rice, J.A.; Kwak, Thomas J.; Law, L.M.

    2015-01-01

    The presence of endocrine-disrupting compounds (EDCs), particularly estrogenic compounds, in the environment has drawn public attention across the globe, yet a clear understanding of the extent and distribution of estrogenic EDCs in surface waters and their relationship to potential sources is lacking. The objective of the present study was to identify and examine the potential input of estrogenic EDC sources in North Carolina water bodies using a geographic information system (GIS) mapping and analysis approach. Existing data from state and federal agencies were used to create point and nonpoint source maps depicting the cumulative contribution of potential sources of estrogenic EDCs to North Carolina surface waters. Water was collected from 33 sites (12 associated with potential point sources, 12 associated with potential nonpoint sources, and 9 reference), to validate the predictive results of the GIS analysis. Estrogenicity (measured as 17β-estradiol equivalence) ranged from 0.06 ng/L to 56.9 ng/L. However, the majority of sites (88%) had water 17β-estradiol concentrations below 1 ng/L. Sites associated with point and nonpoint sources had significantly higher 17β-estradiol levels than reference sites. The results suggested that water 17β-estradiol was reflective of GIS predictions, confirming the relevance of landscape-level influences on water quality and validating the GIS approach to characterize such relationships.

  9. Imputation strategies for missing binary outcomes in cluster randomized trials

    Directory of Open Access Journals (Sweden)

    Akhtar-Danesh Noori

    2011-02-01

    Full Text Available Abstract Background Attrition, which leads to missing data, is a common problem in cluster randomized trials (CRTs), where groups of patients rather than individuals are randomized. Standard multiple imputation (MI) strategies may not be appropriate to impute missing data from CRTs since they assume independent data. In this paper, under the assumptions of missing completely at random and covariate-dependent missingness, we compared six MI strategies which account for the intra-cluster correlation for missing binary outcomes in CRTs with the standard imputation strategies and complete case analysis approach using a simulation study. Method We considered three within-cluster and three across-cluster MI strategies for missing binary outcomes in CRTs. The three within-cluster MI strategies are the logistic regression method, the propensity score method, and the Markov chain Monte Carlo (MCMC) method, which apply standard MI strategies within each cluster. The three across-cluster MI strategies are the propensity score method, the random-effects (RE) logistic regression approach, and logistic regression with cluster as a fixed effect. Based on the community hypertension assessment trial (CHAT), which has complete data, we designed a simulation study to investigate the performance of the above MI strategies. Results The estimated treatment effect and its 95% confidence interval (CI) from the generalized estimating equations (GEE) model based on the CHAT complete dataset are 1.14 (0.76, 1.70). When 30% of the binary outcome is missing completely at random, the simulation study shows that the estimated treatment effects and the corresponding 95% CIs from the GEE model are 1.15 (0.76, 1.75) if complete case analysis is used, 1.12 (0.72, 1.73) if the within-cluster MCMC method is used, 1.21 (0.80, 1.81) if across-cluster RE logistic regression is used, and 1.16 (0.82, 1.64) if standard logistic regression which does not account for clustering is used. Conclusion When the percentage of missing data is low or intra
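
    To make the within-cluster idea concrete, the sketch below imputes a missing binary outcome separately inside each cluster with a logistic model, which is the spirit of the within-cluster logistic regression strategy compared above. The data, effect sizes and the fallback for degenerate clusters are assumptions of the sketch, not the CHAT trial or the paper's simulation design.

      import numpy as np
      from sklearn.linear_model import LogisticRegression

      rng = np.random.default_rng(3)

      # Synthetic CRT-like data: 20 clusters, one covariate, binary outcome with a
      # cluster-specific effect; ~30% of outcomes missing completely at random.
      n_clusters, n_per = 20, 40
      cluster = np.repeat(np.arange(n_clusters), n_per)
      x = rng.normal(size=cluster.size)
      u = rng.normal(scale=0.8, size=n_clusters)[cluster]        # cluster effects
      p = 1 / (1 + np.exp(-(0.5 * x + u)))
      y = rng.binomial(1, p).astype(float)
      y[rng.random(y.size) < 0.3] = np.nan

      # Within-cluster imputation: fit a logistic model inside each cluster on the
      # observed rows and draw imputed outcomes from the predicted probabilities.
      y_imp = y.copy()
      for c in range(n_clusters):
          idx = cluster == c
          obs = idx & ~np.isnan(y)
          mis = idx & np.isnan(y)
          if mis.sum() == 0:
              continue
          if np.unique(y[obs]).size < 2:          # degenerate cluster: use its mean
              prob = np.full(mis.sum(), y[obs].mean())
          else:
              model = LogisticRegression().fit(x[obs].reshape(-1, 1), y[obs])
              prob = model.predict_proba(x[mis].reshape(-1, 1))[:, 1]
          y_imp[mis] = rng.binomial(1, prob)

      print("imputed outcome prevalence:", round(float(y_imp.mean()), 3))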

  10. Multiple imputation for harmonizing longitudinal non-commensurate measures in individual participant data meta-analysis.

    Science.gov (United States)

    Siddique, Juned; Reiter, Jerome P; Brincks, Ahnalee; Gibbons, Robert D; Crespi, Catherine M; Brown, C Hendricks

    2015-11-20

    There are many advantages to individual participant data meta-analysis for combining data from multiple studies. These advantages include greater power to detect effects, increased sample heterogeneity, and the ability to perform more sophisticated analyses than meta-analyses that rely on published results. However, a fundamental challenge is that it is unlikely that variables of interest are measured the same way in all of the studies to be combined. We propose that this situation can be viewed as a missing data problem in which some outcomes are entirely missing within some trials and use multiple imputation to fill in missing measurements. We apply our method to five longitudinal adolescent depression trials where four studies used one depression measure and the fifth study used a different depression measure. None of the five studies contained both depression measures. We describe a multiple imputation approach for filling in missing depression measures that makes use of external calibration studies in which both depression measures were used. We discuss some practical issues in developing the imputation model including taking into account treatment group and study. We present diagnostics for checking the fit of the imputation model and investigate whether external information is appropriately incorporated into the imputed values.

  11. Multiple imputation as a flexible tool for missing data handling in clinical research.

    Science.gov (United States)

    Enders, Craig K

    2016-11-18

    The last 20 years has seen an uptick in research on missing data problems, and most software applications now implement one or more sophisticated missing data handling routines (e.g., multiple imputation or maximum likelihood estimation). Despite their superior statistical properties (e.g., less stringent assumptions, greater accuracy and power), the adoption of these modern analytic approaches is not uniform in psychology and related disciplines. Thus, the primary goal of this manuscript is to describe and illustrate the application of multiple imputation. Although maximum likelihood estimation is perhaps the easiest method to use in practice, psychological data sets often feature complexities that are currently difficult to handle appropriately in the likelihood framework (e.g., mixtures of categorical and continuous variables), but relatively simple to treat with imputation. The paper describes a number of practical issues that clinical researchers are likely to encounter when applying multiple imputation, including mixtures of categorical and continuous variables, item-level missing data in questionnaires, significance testing, interaction effects, and multilevel missing data. Analysis examples illustrate imputation with software packages that are freely available on the internet.

  12. Multiple imputation and analysis for high-dimensional incomplete proteomics data.

    Science.gov (United States)

    Yin, Xiaoyan; Levy, Daniel; Willinger, Christine; Adourian, Aram; Larson, Martin G

    2016-04-15

    Multivariable analysis of proteomics data using standard statistical models is hindered by the presence of incomplete data. We faced this issue in a nested case-control study of 135 incident cases of myocardial infarction and 135 pair-matched controls from the Framingham Heart Study Offspring cohort. Plasma protein markers (K = 861) were measured on the case-control pairs (N = 135), and the majority of proteins had missing expression values for a subset of samples. In the setting of many more variables than observations (K ≫ N), we explored and documented the feasibility of multiple imputation approaches along with subsequent analysis of the imputed data sets. Initially, we selected proteins with complete expression data (K = 261) and randomly masked some values as the basis of simulation to tune the imputation and analysis process. We randomly shuffled proteins into several bins, performed multiple imputation within each bin, and followed up with stepwise selection using conditional logistic regression within each bin. This process was repeated hundreds of times. We determined the optimal method of multiple imputation, number of proteins per bin, and number of random shuffles using several performance statistics. We then applied this method to 544 proteins with incomplete expression data (≤ 40% missing values), from which we identified a panel of seven proteins that were jointly associated with myocardial infarction.
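
    The bin-then-impute device described above can be sketched as follows. This is a single imputation pass with scikit-learn's IterativeImputer on synthetic data, used here as a stand-in for the paper's imputation engine; the paper repeats the random binning and produces multiple imputations per bin, but the structural idea of keeping each imputation model to a manageable number of variables is the same.

      import numpy as np
      from sklearn.experimental import enable_iterative_imputer  # noqa: F401
      from sklearn.impute import IterativeImputer

      rng = np.random.default_rng(4)

      # Synthetic "many proteins, few samples" matrix with scattered missing values.
      n_samples, n_proteins, bin_size = 60, 300, 30
      data = rng.normal(size=(n_samples, n_proteins))
      data[rng.random(data.shape) < 0.15] = np.nan

      # Randomly shuffle proteins into bins and impute within each bin, so the
      # imputation model never faces far more predictors than observations.
      order = rng.permutation(n_proteins)
      imputed = np.empty_like(data)
      for start in range(0, n_proteins, bin_size):
          cols = order[start:start + bin_size]
          imputed[:, cols] = IterativeImputer(random_state=0).fit_transform(data[:, cols])

      print("remaining missing values:", int(np.isnan(imputed).sum()))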

  13. Short communication: imputing genotypes using PedImpute fast algorithm combining pedigree and population information.

    Science.gov (United States)

    Nicolazzi, E L; Biffani, S; Jansen, G

    2013-04-01

    Routine genomic evaluations frequently include a preliminary imputation step, requiring high accuracy and reduced computing time. A new algorithm, PedImpute (http://dekoppel.eu/pedimpute/), was developed and compared with findhap (http://aipl.arsusda.gov/software/findhap/) and BEAGLE (http://faculty.washington.edu/browning/beagle/beagle.html), using 19,904 Holstein genotypes from a 4-country international collaboration (United States, Canada, UK, and Italy). Different scenarios were evaluated on a sample subset that included only single nucleotide polymorphisms from the Bovine low-density (LD) Illumina BeadChip (Illumina Inc., San Diego, CA). Comparative criteria were computing time, percentage of missing alleles, percentage of wrongly imputed alleles, and the allelic squared correlation. Imputation accuracy on ungenotyped animals was also analyzed. The algorithm PedImpute was slightly more accurate and faster than findhap and BEAGLE when sire, dam, and maternal grandsire were genotyped at high density. On the other hand, BEAGLE performed better than both PedImpute and findhap for animals with at least one close relative not genotyped or genotyped at low density. However, computing time and resources using BEAGLE were incompatible with routine genomic evaluations in Italy. Error rate and allelic squared correlation attained by PedImpute ranged from 0.2 to 1.1% and from 96.6 to 99.3%, respectively. When complete genomic information on sire, dam, and maternal grandsire is available, as expected to be the case in the near future in (at least) dairy cattle, and considering the accuracies obtained and the computation time required, PedImpute represents a valuable choice in routine evaluations among the algorithms tested.

  14. Geographic Information System (GIS) modeling approach to determine the fastest delivery routes.

    Science.gov (United States)

    Abousaeidi, Mohammad; Fauzi, Rosmadi; Muhamad, Rusnah

    2016-09-01

    This study involves the adoption of the Geographic Information System (GIS) modeling approach to determine the quickest routes for fresh vegetable delivery. During transport, fresh vegetables deteriorate mainly on account of temperature and delivery time. Nonetheless, little attention has been directed to transportation issues in most areas within Kuala Lumpur. In addition, perishable food normally has a short shelf life, so timely delivery significantly affects delivery costs. Therefore, selecting efficient routes would consequently reduce the total transportation costs. A regression model is applied in this study to determine the parameters that affect route selection with respect to the fastest delivery of fresh vegetables. For the purpose of this research, ArcGIS software with the network analyst extension is adopted to solve the problem of complex networks. The final output of this research is a map of the quickest routes with the best delivery times based on all variables. The variables identified by the regression analysis are the parameters that most slow the flow of the road network. The objective is to improve delivery services by achieving the least drive time. The main finding of this research is that land use (such as residential area) and population are the parameters with the greatest effect on drive time.
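
    Underneath a network-analyst "quickest route" query is a shortest-path search on a graph whose edge weights are drive times. A self-contained sketch of that search is given below; the road segments, node names and drive times are hypothetical, and ArcGIS is not required for the illustration.

      import heapq

      def quickest_route(graph, origin, destination):
          """Dijkstra's algorithm on a directed graph whose edge weights are
          drive times in minutes: graph[node] -> list of (neighbour, minutes)."""
          best = {origin: 0.0}
          previous = {}
          heap = [(0.0, origin)]
          while heap:
              time, node = heapq.heappop(heap)
              if node == destination:
                  break
              if time > best.get(node, float("inf")):
                  continue
              for neighbour, minutes in graph.get(node, []):
                  candidate = time + minutes
                  if candidate < best.get(neighbour, float("inf")):
                      best[neighbour] = candidate
                      previous[neighbour] = node
                      heapq.heappush(heap, (candidate, neighbour))
          # Reconstruct the route from destination back to origin.
          path, node = [destination], destination
          while node != origin:
              node = previous[node]
              path.append(node)
          return list(reversed(path)), best[destination]

      # Hypothetical road segments between a depot and a market, weighted by drive time.
      roads = {
          "depot":    [("junction", 7.0), ("highway", 4.0)],
          "highway":  [("junction", 2.5), ("market", 9.0)],
          "junction": [("market", 5.0)],
      }
      print(quickest_route(roads, "depot", "market"))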

  15. Geographic Information System (GIS modeling approach to determine the fastest delivery routes

    Directory of Open Access Journals (Sweden)

    Mohammad Abousaeidi

    2016-09-01

    Full Text Available This study involves the adoption of the Geographic Information System (GIS) modeling approach to determine the quickest routes for fresh vegetable delivery. During transport, fresh vegetables deteriorate mainly on account of temperature and delivery time. Nonetheless, little attention has been directed to transportation issues in most areas within Kuala Lumpur. In addition, perishable food normally has a short shelf life, so timely delivery significantly affects delivery costs. Therefore, selecting efficient routes would consequently reduce the total transportation costs. A regression model is applied in this study to determine the parameters that affect route selection with respect to the fastest delivery of fresh vegetables. For the purpose of this research, ArcGIS software with the network analyst extension is adopted to solve the problem of complex networks. The final output of this research is a map of the quickest routes with the best delivery times based on all variables. The variables identified by the regression analysis are the parameters that most slow the flow of the road network. The objective is to improve delivery services by achieving the least drive time. The main finding of this research is that land use (such as residential area) and population are the parameters with the greatest effect on drive time.

  16. Geographical information system approaches for hazard mapping of dilute lahars on Montserrat, West Indies

    Science.gov (United States)

    Darnell, A. R.; Barclay, J.; Herd, R. A.; Phillips, J. C.; Lovett, A. A.; Cole, P.

    2012-08-01

    Many research tools for lahar hazard assessment have proved wholly unsuitable for practical application to an active volcanic system where field measurements are challenging to obtain. Two simple routing models, with minimal data demands and implemented in a geographical information system (GIS), were applied to dilute lahars originating from Soufrière Hills Volcano, Montserrat. Single-direction flow routing by path of steepest descent, commonly used for simulating normal stream-flow, was tested against LAHARZ, an established lahar model calibrated for debris flows, for their ability to replicate the main flow routes. Comparing the ways in which these models capture observed changes, and how the different modelled paths deviate, can also provide an indication of where dilute lahars do not follow the behaviour expected from single-phase flow models. Data were collected over two field seasons and provide (1) an overview of gross morphological change after one rainy season, (2) details of dominant channels at the time of measurement, and (3) order of magnitude estimates of individual flow volumes. Modelling results suggested both GIS-based predictive tools had associated benefits. Dominant flow routes observed in the field were generally well-predicted using the hydrological approach with a consideration of elevation error, while LAHARZ was comparatively more successful at mapping lahar dispersion and was better suited to long-term hazard assessment. This research suggests that end-member models can have utility for first-order dilute lahar hazard mapping.
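
    Single-direction routing by path of steepest descent, as used above, repeatedly moves a flow to the neighbouring cell with the greatest drop per unit distance. A toy sketch on a small synthetic elevation grid follows; the DEM values are invented, and a real application would read the Montserrat DEM from the GIS.

      import numpy as np

      # Toy digital elevation model (rows x cols, metres).
      dem = np.array([[120., 118., 117., 116.],
                      [119., 115., 113., 112.],
                      [118., 114., 110., 108.],
                      [117., 113., 109., 104.]])

      def steepest_descent_path(dem, start):
          """Route a flow cell-to-cell along the direction of steepest descent
          (the single-direction routing used above for dominant flow paths)."""
          rows, cols = dem.shape
          path = [start]
          r, c = start
          while True:
              best_drop, best_cell = 0.0, None
              for dr in (-1, 0, 1):
                  for dc in (-1, 0, 1):
                      if dr == dc == 0:
                          continue
                      rr, cc = r + dr, c + dc
                      if 0 <= rr < rows and 0 <= cc < cols:
                          distance = np.hypot(dr, dc)          # diagonal steps are longer
                          drop = (dem[r, c] - dem[rr, cc]) / distance
                          if drop > best_drop:
                              best_drop, best_cell = drop, (rr, cc)
              if best_cell is None:                            # local minimum: flow stops
                  return path
              r, c = best_cell
              path.append(best_cell)

      print(steepest_descent_path(dem, (0, 0)))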

  17. An analytical approach to Sr isotope ratio determination in Lambrusco wines for geographical traceability purposes.

    Science.gov (United States)

    Durante, Caterina; Baschieri, Carlo; Bertacchini, Lucia; Bertelli, Davide; Cocchi, Marina; Marchetti, Andrea; Manzini, Daniela; Papotti, Giulia; Sighinolfi, Simona

    2015-04-15

    Geographical origin and authenticity of food are topics of interest for both consumers and producers. Among the different indicators used for traceability studies, the (87)Sr/(86)Sr isotopic ratio has provided excellent results. In this study, two analytical approaches for wine sample pre-treatment, microwave and low temperature mineralisation, were investigated to develop an accurate and precise analytical method for (87)Sr/(86)Sr determination. The two procedures led to comparable results (paired t-test), with precision monitored through a wine sample processed during each sample batch (calculated Relative Standard Deviation, RSD%, equal to 0.002%). Lambrusco PDO (Protected Designation of Origin) wines coming from four different vintages (2009, 2010, 2011 and 2012) were pre-treated according to the best procedure and their isotopic values were compared with isotopic data coming from (i) soils of their territory of origin and (ii) wines obtained from the same grape varieties cultivated in different districts. The results show no significant variability among the different vintages of wines, and perfect agreement between the isotopic range of the soils and the wines was observed. Nevertheless, the investigated indicator was not powerful enough to discriminate between similar products. In this regard, it is worth noting that more soil samples, as well as wines coming from different districts, will be considered to obtain more trustworthy results.

  18. Organ-to-Cell-Scale Health Assessment Using Geographical Information System Approaches with Multibeam Scanning Electron Microscopy.

    Science.gov (United States)

    Knothe Tate, Melissa L; Zeidler, Dirk; Pereira, André F; Hageman, Daniel; Garbowski, Tomasz; Mishra, Sanjay; Gardner, Lauren; Knothe, Ulf R

    2016-07-01

    This study combines novel multibeam electron microscopy with a geographical information system approach to create a first, seamless, navigable anatomic map of the human hip and its cellular inhabitants. Using spatial information acquired by localizing relevant map landmarks (e.g. cells, blood vessels), network modeling will enable disease epidemiology studies in populations of cells inhabiting tissues and organs.

  19. A poisson regression approach for modelling spatial autocorrelation between geographically referenced observations

    Directory of Open Access Journals (Sweden)

    Jolley Damien

    2011-10-01

    Full Text Available Abstract Background Analytic methods commonly used in epidemiology do not account for spatial correlation between observations. In regression analyses, omission of that autocorrelation can bias parameter estimates and yield incorrect standard error estimates. Methods We used age standardised incidence ratios (SIRs) of esophageal cancer (EC) from the Babol cancer registry from 2001 to 2005, and extracted socioeconomic indices from the Statistical Centre of Iran. The following models for SIR were used: (1) Poisson regression with agglomeration-specific nonspatial random effects; (2) Poisson regression with agglomeration-specific spatial random effects. Distance-based and neighbourhood-based autocorrelation structures were used for defining the spatial random effects and a pseudolikelihood approach was applied to estimate model parameters. The Bayesian information criterion (BIC), Akaike's information criterion (AIC) and the adjusted pseudo-R2 were used for model comparison. Results A Gaussian semivariogram with an effective range of 225 km best fit spatial autocorrelation in agglomeration-level EC incidence. The Moran's I index was greater than its expected value, indicating systematic geographical clustering of EC. The distance-based and neighbourhood-based Poisson regression estimates were generally similar. When residual spatial dependence was modelled, point and interval estimates of covariate effects were different to those obtained from the nonspatial Poisson model. Conclusions The spatial pattern evident in the EC SIR and the observation that point estimates and standard errors differed depending on the modelling approach indicate the importance of accounting for residual spatial correlation in analyses of EC incidence in the Caspian region of Iran. Our results also illustrate that spatial smoothing must be applied with care.
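
    The Moran's I statistic mentioned above measures whether neighbouring areas carry more similar values than expected by chance. A small self-contained sketch with invented SIR values and a simple contiguity weight matrix:

      import numpy as np

      def morans_i(values, weights):
          """Moran's I for a vector of area-level values and a spatial weights
          matrix (weights[i, j] > 0 when areas i and j are neighbours)."""
          values = np.asarray(values, dtype=float)
          z = values - values.mean()
          numerator = (weights * np.outer(z, z)).sum()
          denominator = (z ** 2).sum()
          return len(values) / weights.sum() * numerator / denominator

      # Four areas along a line: SIR-like values and a rook-contiguity weight matrix.
      sir = [1.4, 1.3, 0.8, 0.7]
      w = np.array([[0, 1, 0, 0],
                    [1, 0, 1, 0],
                    [0, 1, 0, 1],
                    [0, 0, 1, 0]], dtype=float)
      print(f"Moran's I = {morans_i(sir, w):.3f}")   # above E[I] = -1/(n-1) suggests clustering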

  20. Developing a simplified geographical information system approach to dilute lahar modelling for rapid hazard assessment

    Science.gov (United States)

    Darnell, A. R.; Phillips, J. C.; Barclay, J.; Herd, R. A.; Lovett, A. A.; Cole, P. D.

    2013-04-01

    In this study, we present a geographical information system (GIS)-based approach to enable the estimation of lahar features important to rapid hazard assessment (including flow routes, velocities and travel times). Our method represents a simplified first stage in extending the utility of widely used existing GIS-based inundation models, such as LAHARZ, to provide estimates of flow speeds. LAHARZ is used to determine the spatial distribution of a lahar of constant volume, and for a given cell in a GIS grid, a single-direction flow routing technique incorporating the effect of surface roughness directs the flow according to steepest descent. The speed of flow passing through a cell is determined from coupling the flow depth, change in elevation and roughness using Manning's formula, and in areas where there is little elevation difference, flow is routed toward the locally maximum increase in velocity. Application of this methodology to lahars on Montserrat, West Indies, yielded support for this GIS-based approach as a hazard assessment tool through tests on small-volume (5,000-125,000 m³) dilute lahars (consistent with application of Manning's law). Dominant flow paths were mapped, and for the first time in this study area, velocities (magnitudes and spatial distribution) and average travel times were estimated for a range of lahar volumes. Flow depth approximations were also made using (modified) LAHARZ, and these refined the input to Manning's formula. Flow depths were verified within an order of magnitude by field observations, and velocity predictions were broadly consistent with proxy measurements and published data. Forecasts from this coupled method can operate on short to mid-term timescales for hazard management. The methodology has the potential to provide a rapid preliminary hazard assessment in similar systems where data acquisition may be difficult.
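
    The velocity coupling described above rests on Manning's formula. In standard notation, and approximating the hydraulic radius by the flow depth for a wide, shallow flow (an assumption of this note rather than a statement of the authors' exact implementation):

      v = \frac{1}{n} R^{2/3} S^{1/2}, \qquad t_{\mathrm{cell}} = \frac{\Delta x}{v}

    where v is the flow velocity, n the Manning roughness coefficient, R the hydraulic radius (approximately the flow depth h), S the slope obtained from the cell-to-cell elevation change, and t_cell the time to cross a cell of length Δx; summing t_cell along the routed path gives the average travel times reported above.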

  1. Comparing performance of modern genotype imputation methods in different ethnicities

    Science.gov (United States)

    Roshyara, Nab Raj; Horn, Katrin; Kirsten, Holger; Ahnert, Peter; Scholz, Markus

    2016-10-01

    A variety of modern software packages are available for genotype imputation relying on advanced concepts such as pre-phasing of the target dataset or utilization of admixed reference panels. In this study, we performed a comprehensive evaluation of the accuracy of modern imputation methods on the basis of the publicly available POPRES samples. Good quality genotypes were masked and re-imputed by different imputation frameworks: namely MaCH, IMPUTE2, MaCH-Minimac, SHAPEIT-IMPUTE2 and MaCH-Admix. Results were compared to evaluate the relative merit of pre-phasing and the usage of admixed references. We showed that the pre-phasing framework SHAPEIT-IMPUTE2 can overestimate the certainty of genotype distributions, resulting in the lowest percentage of correctly imputed genotypes in our case. MaCH-Minimac performed better than SHAPEIT-IMPUTE2. Pre-phasing always reduced imputation accuracy. IMPUTE2 and MaCH-Admix, both relying on admixed-reference panels, showed comparable results. MaCH showed superior results if well-matched references were available (Nei’s GST ≤ 0.010). For small to medium datasets, frameworks using the genetically closest reference panel are recommended if the genetic distance between target and reference data set is small. Our results are valid for small to medium data sets. As shown on a larger data set of population-based German samples, the disadvantage of pre-phasing decreases for larger sample sizes.

  2. Handling Out-of-Sequence Data: Kalman Filter Methods or Statistical Imputation?

    Directory of Open Access Journals (Sweden)

    Bhekisipho Twala

    2010-01-01

    Full Text Available The issue of handling sensor measurement data subject to single and multiple lag delays, also known as out-of-sequence measurements (OOSM), is considered. It is argued that this problem can also be addressed using model-based imputation strategies, and their application is demonstrated in comparison to Kalman filter (KF)-based approaches for a multi-sensor tracking prediction problem. The effectiveness of two model-based imputation procedures against five OOSM methods was investigated in Monte Carlo simulation experiments. The delayed measurements were either incorporated (or fused) at the time they finally became available (using OOSM methods) or imputed in a random way, with a higher probability of delays for multiple lags and a lower probability of delays for a single lag (using single or multiple imputation). For a single lag, estimates of target tracking computed from the observed data and those based on a data set in which the delayed measurements were imputed were equally unbiased; however, the KF estimates obtained using the Bayesian framework (BF-KF) were more precise. When the measurements were delayed in a multiple-lag fashion, there were significant differences in bias or precision between multiple imputation (MI) and OOSM methods, with the former exhibiting superior performance at nearly all levels of probability of measurement delay and range of manoeuvring indices. Researchers working on sensor data are encouraged to take advantage of software to implement delayed measurements using MI, as estimates of tracking are more precise and less biased in the presence of delayed multi-sensor data than those derived from an observed data analysis approach. Defence Science Journal, 2010, 60(1), pp. 87-99, DOI: http://dx.doi.org/10.14429/dsj.60.115

  3. Geographic Clustering and Productivity: An Instrumental Variable Approach for Classical Composers

    DEFF Research Database (Denmark)

    Borowiecki, Karol

    2013-01-01

    It is difficult to estimate the impact of geographic clustering on productivity because of endogeneity issues. I use birthplace-cluster distance as an instrumental variable for the incidence of clustering of prominent classical composers born between 1750 and 1899. I find that geographic clustering strongly impacts the productivity of the clustering individuals: composers were approx. 33 percentage points more productive while they remained in a geographic cluster. Top composers and composers who migrated to the cluster are the greatest beneficiaries of clustering. The benefit depends...

  4. Geographic Clustering and Productivity: An Instrumental Variable Approach for Classical Composers

    DEFF Research Database (Denmark)

    Borowiecki, Karol

    2013-01-01

    It is difficult to estimate the impact of geographic clustering on productivity because of endogeneity issues. I use birthplace-cluster distance as an instrumental variable for the incidence of clustering of prominent classical composers born between 1750 and 1899. I find that geographic clustering strongly impacts the productivity of the clustering individuals: composers were approx. 33 percentage points more productive while they remained in a geographic cluster. Top composers and composers who migrated to the cluster are the greatest beneficiaries of clustering. The benefit depends...

  5. Accessibility patterns and community integration among previously homeless adults: a Geographic Information Systems (GIS) approach.

    Science.gov (United States)

    Chan, Dara V; Gopal, Sucharita; Helfrich, Christine A

    2014-11-01

    Although a desired rehabilitation goal, research continues to document that community integration significantly lags behind housing stability success rates for people of a variety of ages who used to be homeless. While accessibility to resources is an environmental factor that may promote or impede integration activity, there has been little empirical investigation into the impact of proximity of community features on resource use and integration. Using a Geographic Information Systems (GIS) approach, the current study examines how accessibility or proximity to community features in Boston, United States related to the types of locations used and the size of an individual's "activity space," or spatial presence in the community. Significant findings include an inverse relationship between activity space size and proximity to the number and type of community features in one's immediate area. Specifically, larger activity spaces were associated with neighborhoods with less community features, and smaller activity spaces corresponded with greater availability of resources within one's immediate area. Activity space size also varied, however, based on proximity to different types of resources, namely transportation and health care. Greater community function, or the ability to navigate and use community resources, was associated with better accessibility and feeling part of the community. Finally, proximity to a greater number of individual identified preferred community features was associated with better social integration. The current study suggests the ongoing challenges of successful integration may vary not just based on accessibility to, but relative importance of, specific community features and affinity with one's surroundings. Community integration researchers and housing providers may need to attend to the meaning attached to resources, not just presence or use in the community.

  6. IMPROVEMENT EVALUATION ON CERAMIC ROOF EXTRACTION USING WORLDVIEW-2 IMAGERY AND GEOGRAPHIC DATA MINING APPROACH

    Directory of Open Access Journals (Sweden)

    V. S. Brum-Bastos

    2016-06-01

    Full Text Available Advances in geotechnologies and in remote sensing have improved analysis of urban environments. The new sensors are increasingly suited to urban studies, due to the enhancement in spatial, spectral and radiometric resolutions. Urban environments present high heterogeneity, which cannot be tackled using pixel–based approaches on high resolution images. Geographic Object–Based Image Analysis (GEOBIA) has been consolidated as a methodology for urban land use and cover monitoring; however, classification of high resolution images is still troublesome. This study aims to assess the improvement on ceramic roof classification using WorldView-2 images due to the increase of 4 new bands besides the standard “Blue-Green-Red-Near Infrared” bands. Our methodology combines GEOBIA, the C4.5 classification tree algorithm, Monte Carlo simulation and statistical tests for classification accuracy. Two sample groups were considered: 1) eight multispectral and panchromatic bands, and 2) four multispectral and panchromatic bands, representing previous high-resolution sensors. The C4.5 algorithm generates a decision tree that can be used for classification; smaller decision trees are closer to the semantic networks produced by experts on GEOBIA, while bigger trees are not straightforward to implement manually, but are more accurate. The choice for a big or small tree relies on the user’s skills to implement it. This study aims to determine for what kind of user the addition of the 4 new bands might be beneficial: 1) the common user (smaller trees) or 2) a more skilled user with coding and/or data mining abilities (bigger trees). Overall, the classification was improved by the addition of the four new bands for both types of users.

  7. Improvement Evaluation on Ceramic Roof Extraction Using WORLDVIEW-2 Imagery and Geographic Data Mining Approach

    Science.gov (United States)

    Brum-Bastos, V. S.; Ribeiro, B. M. G.; Pinho, C. M. D.; Korting, T. S.; Fonseca, L. M. G.

    2016-06-01

    Advances in geotechnologies and in remote sensing have improved analysis of urban environments. The new sensors are increasingly suited to urban studies, due to the enhancement in spatial, spectral and radiometric resolutions. Urban environments present high heterogeneity, which cannot be tackled using pixel-based approaches on high resolution images. Geographic Object-Based Image Analysis (GEOBIA) has been consolidated as a methodology for urban land use and cover monitoring; however, classification of high resolution images is still troublesome. This study aims to assess the improvement on ceramic roof classification using WorldView-2 images due to the increase of 4 new bands besides the standard "Blue-Green-Red-Near Infrared" bands. Our methodology combines GEOBIA, the C4.5 classification tree algorithm, Monte Carlo simulation and statistical tests for classification accuracy. Two sample groups were considered: 1) eight multispectral and panchromatic bands, and 2) four multispectral and panchromatic bands, representing previous high-resolution sensors. The C4.5 algorithm generates a decision tree that can be used for classification; smaller decision trees are closer to the semantic networks produced by experts on GEOBIA, while bigger trees are not straightforward to implement manually, but are more accurate. The choice for a big or small tree relies on the user's skills to implement it. This study aims to determine for what kind of user the addition of the 4 new bands might be beneficial: 1) the common user (smaller trees) or 2) a more skilled user with coding and/or data mining abilities (bigger trees). Overall, the classification was improved by the addition of the four new bands for both types of users.
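
    The role of the decision tree can be illustrated with scikit-learn's entropy-criterion tree, used here as a stand-in for C4.5 (which scikit-learn does not implement exactly); the per-object band statistics and labels below are synthetic, not WorldView-2 data. Limiting the depth mimics the "small tree" a common user might implement by hand, while an unconstrained depth mimics the larger, more accurate tree.

      import numpy as np
      from sklearn.model_selection import train_test_split
      from sklearn.tree import DecisionTreeClassifier

      rng = np.random.default_rng(5)

      # Synthetic per-object spectral features: 8 band means per segmented object,
      # labelled ceramic roof versus other cover (both entirely illustrative).
      n_objects = 500
      bands = rng.normal(size=(n_objects, 8))
      is_roof = (bands[:, 3] + 0.8 * bands[:, 6] + rng.normal(0, 0.5, n_objects)) > 0.5

      x_train, x_test, y_train, y_test = train_test_split(bands, is_roof, random_state=0)

      # The entropy criterion gives an information-gain split rule in the spirit of C4.5;
      # max_depth controls the small-tree versus big-tree trade-off discussed above.
      for depth in (3, None):
          tree = DecisionTreeClassifier(criterion="entropy", max_depth=depth, random_state=0)
          tree.fit(x_train, y_train)
          print(f"max_depth={depth}: leaves={tree.get_n_leaves()}, "
                f"test accuracy={tree.score(x_test, y_test):.2f}")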

  8. Investigating human geographic origins using dual-isotope (87Sr/86Sr, δ18O) assignment approaches

    Science.gov (United States)

    Sonnemann, Till F.; Shafie, Termeh; Hofman, Corinne L.; Brandes, Ulrik; Davies, Gareth R.

    2017-01-01

    Substantial progress in the application of multiple isotope analyses has greatly improved the ability to identify nonlocal individuals amongst archaeological populations over the past decades. More recently the development of large scale models of spatial isotopic variation (isoscapes) has contributed to improved geographic assignments of human and animal origins. Persistent challenges remain, however, in the accurate identification of individual geographic origins from skeletal isotope data in studies of human (and animal) migration and provenance. In an attempt to develop and test more standardized and quantitative approaches to geographic assignment of individual origins using isotopic data two methods, combining 87Sr/86Sr and δ18O isoscapes, are examined for the Circum-Caribbean region: 1) an Interval approach using a defined range of fixed isotopic variation per location; and 2) a Likelihood assignment approach using univariate and bivariate probability density functions. These two methods are tested with enamel isotope data from a modern sample of known origin from Caracas, Venezuela and further explored with two archaeological samples of unknown origin recovered from Cuba and Trinidad. The results emphasize both the potential and limitation of the different approaches. Validation tests on the known origin sample exclude most areas of the Circum-Caribbean region and correctly highlight Caracas as a possible place of origin with both approaches. The positive validation results clearly demonstrate the overall efficacy of a dual-isotope approach to geoprovenance. The accuracy and precision of geographic assignments may be further improved by better understanding of the relationships between environmental and biological isotope variation; continued development and refinement of relevant isoscapes; and the eventual incorporation of a broader array of isotope proxy data. PMID:28222163
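
    The likelihood assignment approach evaluates, for each candidate origin, the probability density of the measured isotope pair under that origin's isoscape distribution and compares the results. A minimal sketch follows, with invented regional means and standard deviations and an assumed independence between the two isotope systems; the real analysis uses continuous isoscape surfaces rather than three discrete regions.

      import numpy as np
      from scipy.stats import multivariate_normal

      # Hypothetical isoscape summaries: mean and standard deviation of 87Sr/86Sr
      # and delta-18O for three candidate regions (values are illustrative only).
      regions = {
          "region_A": {"mean": [0.7090, -3.0], "sd": [0.0008, 0.6]},
          "region_B": {"mean": [0.7075, -4.5], "sd": [0.0006, 0.5]},
          "region_C": {"mean": [0.7102, -2.2], "sd": [0.0010, 0.7]},
      }

      def assign(sr, o18, regions):
          """Likelihood approach: evaluate a bivariate normal density for each
          candidate region (independent isotope systems assumed) and normalise."""
          scores = {}
          for name, r in regions.items():
              cov = np.diag(np.square(r["sd"]))
              scores[name] = multivariate_normal(mean=r["mean"], cov=cov).pdf([sr, o18])
          total = sum(scores.values())
          return {name: s / total for name, s in scores.items()}

      sample = (0.7078, -4.2)        # enamel values for one individual (hypothetical)
      for name, p in sorted(assign(*sample, regions).items(), key=lambda kv: -kv[1]):
          print(f"{name}: relative support {p:.2f}")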

  9. GEOGRAPHIC INFORMATION SYSTEM APPROACH FOR PLAY PORTFOLIOS TO IMPROVE OIL PRODUCTION IN THE ILLINOIS BASIN

    Energy Technology Data Exchange (ETDEWEB)

    Beverly Seyler; John Grube

    2004-12-10

    Oil and gas have been commercially produced in Illinois for over 100 years. Existing commercial production is from more than fifty-two named pay horizons in Paleozoic rocks ranging in age from Middle Ordovician to Pennsylvanian. Over 3.2 billion barrels of oil have been produced. Recent calculations indicate that remaining mobile resources in the Illinois Basin may be on the order of several billion barrels. Thus, large quantities of oil, potentially recoverable using current technology, remain in Illinois oil fields despite a century of development. Many opportunities for increased production may have been missed due to complex development histories, multiple stacked pays, and commingled production which makes thorough exploitation of pays and the application of secondary or improved/enhanced recovery strategies difficult. Access to data, and the techniques required to evaluate and manage large amounts of diverse data are major barriers to increased production of critical reserves in the Illinois Basin. These constraints are being alleviated by the development of a database access system using a Geographic Information System (GIS) approach for evaluation and identification of underdeveloped pays. The Illinois State Geological Survey has developed a methodology that is being used by industry to identify underdeveloped areas (UDAs) in and around petroleum reservoirs in Illinois using a GIS approach. This project utilizes a statewide oil and gas Oracle® database to develop a series of Oil and Gas Base Maps with well location symbols that are color-coded by producing horizon. Producing horizons are displayed as layers and can be selected as separate or combined layers that can be turned on and off. Map views can be customized to serve individual needs and page size maps can be printed. A core analysis database with over 168,000 entries has been compiled and assimilated into the ISGS Enterprise Oracle database. Maps of wells with core data have been generated

  10. Clustering with Missing Values: No Imputation Required

    Science.gov (United States)

    Wagstaff, Kiri

    2004-01-01

    Clustering algorithms can identify groups in large data sets, such as star catalogs and hyperspectral images. In general, clustering methods cannot analyze items that have missing data values. Common solutions either fill in the missing values (imputation) or ignore the missing data (marginalization). Imputed values are treated as just as reliable as the truly observed data, but they are only as good as the assumptions used to create them. In contrast, we present a method for encoding partially observed features as a set of supplemental soft constraints and introduce the KSC algorithm, which incorporates constraints into the clustering process. In experiments on artificial data and data from the Sloan Digital Sky Survey, we show that soft constraints are an effective way to enable clustering with missing values.
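
    For contrast with the constraint-based KSC algorithm, the marginalization baseline mentioned above can be sketched as a clustering pass that measures distance only over the dimensions observed in both points. Everything below (the data, the fixed seed points, the rescaled partial distance) is illustrative, and it shows the marginalization baseline, not the paper's KSC method.

      import numpy as np

      rng = np.random.default_rng(6)

      # Two Gaussian clusters in 3-D with roughly 20% of entries missing.
      data = np.vstack([rng.normal(0, 1, (50, 3)), rng.normal(4, 1, (50, 3))])
      data[rng.random(data.shape) < 0.2] = np.nan

      def partial_distance(a, b):
          """Squared Euclidean distance over the dimensions observed in both points,
          rescaled to the full dimensionality (marginalising the missing entries)."""
          mask = ~np.isnan(a) & ~np.isnan(b)
          if not mask.any():
              return np.inf
          return ((a[mask] - b[mask]) ** 2).sum() * len(a) / mask.sum()

      # One k-means-style assignment pass against two fixed seed points.
      seeds = np.array([[0.0, 0.0, 0.0], [4.0, 4.0, 4.0]])
      labels = np.array([np.argmin([partial_distance(x, s) for s in seeds]) for x in data])
      print("cluster sizes:", np.bincount(labels))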

  11. A Comparison of Imputation Strategies for Ordinal Missing Data on Likert Scale Variables.

    Science.gov (United States)

    Wu, Wei; Jia, Fan; Enders, Craig

    2015-01-01

    This article compares a variety of imputation strategies for ordinal missing data on Likert scale variables (number of categories = 2, 3, 5, or 7) in recovering reliability coefficients, mean scale scores, and regression coefficients of predicting one scale score from another. The examined strategies include imputing using normal data models with naïve rounding/without rounding, using latent variable models, and using categorical data models such as discriminant analysis and binary logistic regression (for dichotomous data only), multinomial and proportional odds logistic regression (for polytomous data only). The result suggests that both the normal model approach without rounding and the latent variable model approach perform well for either dichotomous or polytomous data regardless of sample size, missing data proportion, and asymmetry of item distributions. The discriminant analysis approach also performs well for dichotomous data. Naïvely rounding normal imputations or using logistic regression models to impute ordinal data are not recommended as they can potentially lead to substantial bias in all or some of the parameters.

  12. A probabilistic approach to evaluate the exploitation of the geographic situation of hydroelectric plants

    Energy Technology Data Exchange (ETDEWEB)

    Sant' Anna, Leonardo A.F.P. [EPE (Brazil)], E-mail: leonardo.santanna@epe.gov.br; Sant' Anna, Annibal Parracho [Escola de Engenharia - UFF, Rua Passo da Patria, 156 bl. D sl. 309 Niteroi, RJ 24210-240 (Brazil)], E-mail: tppaps@vm.uff.br

    2008-07-15

    A procedure to evaluate efficiency in the exploitation of the geographic situation of hydroelectric plants is developed here. It is based on the probabilistic composition of criteria. A comparison of 80 plants is carried out, with volume of water flow at the location and transmission rates paid measuring the potential of the geographic situation and installed power and assured energy measuring the employment of such potential. An analysis based on a new index of quality of approximation and a new measure of importance derived from Shapley value is used to select the criteria that enter a second stage of the efficiency evaluation.

  13. A probabilistic approach to evaluate the exploitation of the geographic situation of hydroelectric plants

    Energy Technology Data Exchange (ETDEWEB)

    Sant' Anna, Leonardo A.F.P.; Sant' Anna, Annibal Parracho [Escola de Engenharia - UFF, Rua Passo da Patria, 156 bl. D sl. 309 Niteroi, RJ 24210-240 (Brazil)

    2008-07-15

    A procedure to evaluate efficiency in the exploitation of the geographic situation of hydroelectric plants is developed here. It is based on the probabilistic composition of criteria. A comparison of 80 plants is carried out, with volume of water flow at the location and transmission rates paid measuring the potential of the geographic situation and installed power and assured energy measuring the employment of such potential. An analysis based on a new index of quality of approximation and a new measure of importance derived from Shapley value is used to select the criteria that enter a second stage of the efficiency evaluation. (author)

  14. An Activity-Based Learning Approach for Key Geographical Information Systems (GIS) Concepts

    Science.gov (United States)

    Srivastava, Sanjeev Kumar; Tait, Cynthia

    2012-01-01

    This study presents the effect of active learning methods of concepts in geographical information systems where students participated in a series of interlocked learning experiences. These activities spanned several teaching weeks and involved the creation of a hand drawn map that was scanned and geo-referenced with locations' coordinates derived…

  15. Comprehensive evaluation of imputation performance in African Americans.

    Science.gov (United States)

    Chanda, Pritam; Yuhki, Naoya; Li, Man; Bader, Joel S; Hartz, Alex; Boerwinkle, Eric; Kao, W H Linda; Arking, Dan E

    2012-07-01

    Imputation of genome-wide single-nucleotide polymorphism (SNP) arrays to a larger known reference panel of SNPs has become a standard and an essential part of genome-wide association studies. However, little is known about the behavior of imputation in African Americans with respect to the different imputation algorithms, the reference population(s) and the reference SNP panels used. Genome-wide SNP data (Affymetrix 6.0) from 3207 African American samples in the Atherosclerosis Risk in Communities Study (ARIC) was used to systematically evaluate imputation quality and yield. Imputation was performed with the imputation algorithms MACH, IMPUTE and BEAGLE using several combinations of three reference panels of HapMap III (ASW, YRI and CEU) and 1000 Genomes Project (pilot 1 YRI June 2010 release, EUR and AFR August 2010 and June 2011 releases) panels with SNP data on chromosomes 18, 20 and 22. About 10% of the directly genotyped SNPs from each chromosome were masked, and SNPs common between the reference panels were used for evaluating the imputation quality using two statistical metrics: concordance accuracy and Cohen's kappa (κ) coefficient. The dependencies of these metrics on the minor allele frequencies (MAF) and specific genotype categories (minor allele homozygotes, heterozygotes and major allele homozygotes) were thoroughly investigated to determine the best panel and method for imputation in African Americans. In addition, the power to detect imputed SNPs associated with simulated phenotypes was studied using the mean genotype of each masked SNP in the imputed data. Our results indicate that the genotype concordances after stratification into each genotype category and Cohen's κ coefficient are considerably better equipped to differentiate imputation performance compared with the traditionally used total concordance statistic, and both statistics improved with increasing MAF irrespective of the imputation method. We also find that both MACH and IMPUTE
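
    The two evaluation metrics named above are straightforward to compute once imputed best-guess genotypes are compared with the masked true genotypes. A sketch with synthetic genotype calls (the genotype frequencies and error pattern are invented):

      import numpy as np
      from sklearn.metrics import cohen_kappa_score

      rng = np.random.default_rng(7)

      # Masked "true" genotypes (0/1/2 minor-allele counts) versus imputed best-guess
      # genotypes for one SNP, with a small fraction of calls perturbed at random.
      true = rng.choice([0, 1, 2], size=1000, p=[0.64, 0.32, 0.04])
      imputed = true.copy()
      flip = rng.random(true.size) < 0.05
      imputed[flip] = rng.choice([0, 1, 2], size=flip.sum())

      concordance = (true == imputed).mean()
      kappa = cohen_kappa_score(true, imputed)
      print(f"overall concordance = {concordance:.3f}, Cohen's kappa = {kappa:.3f}")
      # Kappa corrects for the agreement expected by chance when one genotype class
      # dominates, which is why it differentiates performance better at low MAF.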

  16. Development and implementation of an HIV/AIDS trials management system: a geographical information systems approach

    CSIR Research Space (South Africa)

    Busgeeth, K

    2008-01-01

    Trials assessing interventions against HIV transmission (e.g. condom use) are included, but those assessing interventions specific to other sexually transmitted infections are excluded, as are trials which only assess safety (so-called phase I trials). All identified trials are imported into the management system, which records fields such as study status (closed, open/ongoing, planned, stopped early), design (RCT, CCT, systematic review), register status (pending, accepted, rejected), intervention (treatment, prevention), outcomes (morbidity, mortality/survival, transmission (MTCT)) and geographical co...

  17. Combining Fourier and lagged k-nearest neighbor imputation for biomedical time series data.

    Science.gov (United States)

    Rahman, Shah Atiqur; Huang, Yuxiao; Claassen, Jan; Heintzman, Nathaniel; Kleinberg, Samantha

    2015-12-01

    Most clinical and biomedical data contain missing values. A patient's record may be split across multiple institutions, devices may fail, and sensors may not be worn at all times. While these missing values are often ignored, this can lead to bias and error when the data are mined. Further, the data are not simply missing at random. Instead the measurement of a variable such as blood glucose may depend on its prior values as well as that of other variables. These dependencies exist across time as well, but current methods have yet to incorporate these temporal relationships as well as multiple types of missingness. To address this, we propose an imputation method (FLk-NN) that incorporates time lagged correlations both within and across variables by combining two imputation methods, based on an extension to k-NN and the Fourier transform. This enables imputation of missing values even when all data at a time point is missing and when there are different types of missingness both within and across variables. In comparison to other approaches on three biological datasets (simulated and actual Type 1 diabetes datasets, and multi-modality neurological ICU monitoring) the proposed method has the highest imputation accuracy. This was true for up to half the data being missing and when consecutive missing values are a significant fraction of the overall time series length. Copyright © 2015 Elsevier Inc. All rights reserved.
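
    A much-simplified, univariate sketch of the idea behind FLk-NN (not the authors' implementation): one estimate comes from a low-order Fourier reconstruction of the interpolated series, another from a lagged nearest-neighbour search, and the two are averaged. The signal, gap location, lags and k below are arbitrary choices for illustration.

        import numpy as np

        def fourier_estimate(y, keep=5):
            """Reconstruct a series from its dominant Fourier components after
            linearly interpolating across the gaps (a rough frequency-domain fill)."""
            idx = np.arange(len(y))
            obs = ~np.isnan(y)
            filled = np.interp(idx, idx[obs], y[obs])
            spec = np.fft.rfft(filled)
            order = np.argsort(np.abs(spec))[::-1]
            spec[order[keep:]] = 0                 # keep only the strongest components
            return np.fft.irfft(spec, n=len(y))

        def lagged_knn_estimate(y, t, lags=(1, 2, 3), k=3):
            """Average of the k observed points whose lagged neighbourhoods are most
            similar to the neighbourhood of the missing point t."""
            def window(i):
                return np.array([y[i - l] if i - l >= 0 else np.nan for l in lags])
            target = window(t)
            if np.isnan(target).any():
                return np.nan                      # not enough history before the gap
            cands = [i for i in range(len(y))
                     if i != t and not np.isnan(y[i]) and not np.isnan(window(i)).any()]
            if not cands:
                return np.nan
            nearest = sorted(cands, key=lambda i: np.sum((window(i) - target) ** 2))
            return float(np.mean([y[i] for i in nearest[:k]]))

        rng = np.random.default_rng(0)
        y = np.sin(np.linspace(0, 6 * np.pi, 120)) + 0.1 * rng.normal(size=120)
        y[40:44] = np.nan                          # a short gap to fill
        four = fourier_estimate(y)
        for t in range(40, 44):
            knn = lagged_knn_estimate(y, t)
            print(t, round(float(np.nanmean([four[t], knn])), 3))   # blended estimate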

  18. A hot-deck multiple imputation procedure for gaps in longitudinal recurrent event histories.

    Science.gov (United States)

    Wang, Chia-Ning; Little, Roderick; Nan, Bin; Harlow, Siobán D

    2011-12-01

    We propose a regression-based hot-deck multiple imputation method for gaps of missing data in longitudinal studies, where subjects experience a recurrent event process and a terminal event. Examples are repeated asthma episodes and death, or menstrual periods and menopause, as in our motivating application. Research interest concerns the onset time of a marker event, defined by the recurrent event process, or the duration from this marker event to the final event. Gaps in the recorded event history make it difficult to determine the onset time of the marker event, and hence, the duration from onset to the final event. Simple approaches such as jumping gap times or dropping cases with gaps have obvious limitations. We propose a procedure for imputing information in the gaps by substituting information in the gap from a matched individual with a completely recorded history in the corresponding interval. Predictive mean matching is used to incorporate information on longitudinal characteristics of the repeated process and the final event time. Multiple imputation is used to propagate imputation uncertainty. The procedure is applied to an important data set for assessing the timing and duration of the menopausal transition. The performance of the proposed method is assessed by a simulation study. © 2011, The International Biometric Society.
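
    A generic, single-draw predictive mean matching sketch, assuming a simple linear predictor; it is not the authors' gap-imputation procedure for recurrent event histories, but it shows the hot-deck matching step that the procedure builds on.

        import numpy as np
        from sklearn.linear_model import LinearRegression

        def pmm_impute(X, y, n_donors=5, rng=None):
            """Predictive mean matching: fill each missing y by copying the observed
            value of a donor whose predicted mean is closest to the recipient's."""
            if rng is None:
                rng = np.random.default_rng()
            obs = ~np.isnan(y)
            model = LinearRegression().fit(X[obs], y[obs])
            pred = model.predict(X)
            y_out = y.copy()
            for i in np.where(~obs)[0]:
                gaps = np.abs(pred[obs] - pred[i])
                donors = np.argsort(gaps)[:n_donors]      # the closest observed cases
                y_out[i] = y[obs][rng.choice(donors)]     # hot deck: copy a real value
            return y_out

        rng = np.random.default_rng(1)
        X = rng.normal(size=(200, 2))
        y = X @ np.array([1.5, -0.7]) + rng.normal(scale=0.5, size=200)
        y[rng.random(200) < 0.2] = np.nan                 # ~20% missing at random
        print(np.isnan(pmm_impute(X, y, rng=rng)).sum())  # -> 0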

  19. Analysis of an incomplete longitudinal composite variable using a marginalized random effects model and multiple imputation.

    Science.gov (United States)

    Gosho, Masahiko; Maruo, Kazushi; Ishii, Ryota; Hirakawa, Akihiro

    2016-11-16

    The total score, which is calculated as the sum of scores in multiple items or questions, is repeatedly measured in longitudinal clinical studies. A mixed effects model for repeated measures method is often used to analyze these data; however, if one or more individual items are not measured, the method cannot be directly applied to the total score. We develop two simple and interpretable procedures that infer fixed effects for a longitudinal continuous composite variable. These procedures consider that the items that compose the total score are multivariate longitudinal continuous data and, simultaneously, handle subject-level and item-level missing data. One procedure is based on a multivariate marginalized random effects model with a multiple of Kronecker product covariance matrices for serial time dependence and correlation among items. The other procedure is based on a multiple imputation approach with a multivariate normal model. In terms of the type-1 error rate and the bias of treatment effect in total score, the marginalized random effects model and multiple imputation procedures performed better than the standard mixed effects model for repeated measures analysis with listwise deletion and single imputations for handling item-level missing data. In particular, the mixed effects model for repeated measures with listwise deletion resulted in substantial inflation of the type-1 error rate. The marginalized random effects model and multiple imputation methods provide for a more efficient analysis by fully utilizing the partially available data, compared to the mixed effects model for repeated measures method with listwise deletion.

  20. Limitations in Using Multiple Imputation to Harmonize Individual Participant Data for Meta-Analysis.

    Science.gov (United States)

    Siddique, Juned; de Chavez, Peter J; Howe, George; Cruden, Gracelyn; Brown, C Hendricks

    2017-02-27

    Individual participant data (IPD) meta-analysis is a meta-analysis in which the individual-level data for each study are obtained and used for synthesis. A common challenge in IPD meta-analysis is when variables of interest are measured differently in different studies. The term harmonization has been coined to describe the procedure of placing variables on the same scale in order to permit pooling of data from a large number of studies. Using data from an IPD meta-analysis of 19 adolescent depression trials, we describe a multiple imputation approach for harmonizing 10 depression measures across the 19 trials by treating those depression measures that were not used in a study as missing data. We then apply diagnostics to address the fit of our imputation model. Even after reducing the scale of our application, we were still unable to produce accurate imputations of the missing values. We describe those features of the data that made it difficult to harmonize the depression measures and provide some guidelines for using multiple imputation for harmonization in IPD meta-analysis.

  1. El tratamiento penal del delincuente imputable peligroso [The penal treatment of the dangerous, criminally liable offender]

    OpenAIRE

    Armaza Armaza, Emilio José

    2011-01-01

    XI, 529 p. There is no doubt that one of the questions which has gained the most importance in the field of criminal policy over recent decades is the penal treatment that the State should apply to the dangerous, criminally liable offender involved in serious crime. In this sense, although throughout history the various human societies have had to contend with the actions of such persons, it was not until well into the second half of the ...

  2. A Geographic Information System approach to modeling nutrient and sediment transport

    Energy Technology Data Exchange (ETDEWEB)

    Levine, D.A. [Automated Sciences Group, Inc., Oak Ridge, TN (United States); Hunsaker, C.T.; Beauchamp, J.J. [Oak Ridge National Lab., TN (United States); Timmins, S.P. [Analysas Corp., Oak Ridge, TN (United States)

    1993-02-01

    The objective of this study was to develop a water quality model to quantify nonpoint-source (NPS) pollution that uses a geographic information system (GIS) to link statistical modeling of nutrient and sediment delivery with the spatial arrangement of the parameters that drive the model. The model predicts annual nutrient and sediment loading and was developed, calibrated, and tested on 12 watersheds within the Lake Ray Roberts drainage basin in north Texas. Three physiographic regions are represented by these watersheds, and model success, as measured by the accuracy of load estimates, was compared within and across these regions.

  3. An innovative geographical approach: health promotion and empowerment in a context of extreme urban poverty.

    Science.gov (United States)

    Becker, Daniel; Edmundo, Kátia; Nunes, Nilza Rogéria; Bonatto, Daniella; de Souza, Rosane

    2005-01-01

    This article describes and analyses a territorial intervention, the Vila Paciencia Initiative--a local development/health promotion programme implemented in a context of extreme poverty in the western district of Rio de Janeiro. The main goal of the programme was to empower individuals and communities. We emphasise the lessons learned and the potential for integrating them into local and regional health services, which could strengthen community participation and capacity-building and improve the effectiveness and community orientation of primary health care and other public policies directed to geographical development.

  4. A Geographic Approach to Modelling Human Exposure to Traffic Air Pollution using GIS

    DEFF Research Database (Denmark)

    Jensen, S. S.

    at the address all the time, and an exposure estimate is also defined that takes into account the time the person spends at the address assuming standardised time-profiles depending on age groups. The exposure model takes advantage of a standard Geographic Information System (GIS) (ArcView and Avenue...... the exposure model. Input requirements are: digital maps including buildings, geocoded addresses, geocoded roads, geocoded cadastres; data from the Building and Dwelling Register (BBR); traffic data (ADT of passenger cars, van, lorries and busses) for linking to a segmented road network; population data...

  6. Cost reduction for web-based data imputation

    KAUST Repository

    Li, Zhixu

    2014-01-01

    Web-based Data Imputation enables the completion of incomplete data sets by retrieving absent field values from the Web. In particular, complete fields can be used as keywords in imputation queries for absent fields. However, due to the ambiguity of these keywords and the data complexity on the Web, different queries may retrieve different answers to the same absent field value. To decide the most probable right answer to each absent field value, existing methods issue many of the available imputation queries for each absent value and then vote to decide the most probable right answer. As a result, a large number of imputation queries has to be issued to fill all absent values in an incomplete data set, which brings a large overhead. In this paper, we work on reducing the cost of Web-based Data Imputation in two aspects: First, we propose a query execution scheme which can secure the most probable right answer to an absent field value by issuing as few imputation queries as possible. Second, we recognize and prune, a priori, queries that will probably fail to return any answers. Our extensive experimental evaluation shows that our proposed techniques substantially reduce the cost of Web-based Imputation without hurting its high imputation accuracy. © 2014 Springer International Publishing Switzerland.
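
    A hedged sketch of the voting idea discussed above: issue imputation queries one at a time and stop as soon as the leading answer can no longer be overtaken by the remaining queries. The query function and the answers are stand-ins; the actual system's query generation and pruning are not reproduced here.

        from collections import Counter

        def impute_by_voting(queries, issue_query, max_queries=None):
            """Issue imputation queries one at a time and stop as soon as the
            leading answer cannot be overtaken by the remaining queries."""
            budget = max_queries or len(queries)
            votes = Counter()
            for issued, q in enumerate(queries[:budget], start=1):
                answer = issue_query(q)                # stand-in for a Web lookup
                if answer is not None:
                    votes[answer] += 1
                remaining = budget - issued
                if votes:
                    (top, top_n), *rest = votes.most_common(2) + [(None, 0)]
                    runner_up = rest[0][1]
                    if top_n > runner_up + remaining:  # lead is already unassailable
                        return top, issued
            return (votes.most_common(1)[0][0] if votes else None), budget

        # Hypothetical lookups for one absent field value
        fake_answers = iter(["Paris", "Paris", "Paris", "Lyon", "Paris"])
        result = impute_by_voting(list(range(5)), lambda q: next(fake_answers))
        print(result)   # -> ('Paris', 3): stops before issuing all five queries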

  7. 12 CFR 367.9 - Imputation of causes.

    Science.gov (United States)

    2010-01-01

    Title 12 (Banks and Banking), Suspension and Exclusion of Contractor and Termination of Contracts, § 367.9 Imputation of causes: (a) Where there is cause to suspend and/or exclude any affiliated business entity of the contractor, that...

  8. mice: Multivariate Imputation by Chained Equations in R

    NARCIS (Netherlands)

    van Buuren, Stef; Groothuis-Oudshoorn, Catharina Gerarda Maria

    2011-01-01

    The R package mice imputes incomplete multivariate data by chained equations. The software mice 1.0 appeared in the year 2000 as an S-PLUS library, and in 2001 as an R package. mice 1.0 introduced predictor selection, passive imputation and automatic pooling. This article documents mice, which
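
    mice itself is an R package; as a rough Python analogue of the chained-equations idea it implements, scikit-learn's IterativeImputer can be used, as sketched below on simulated data.

        import numpy as np
        from sklearn.experimental import enable_iterative_imputer  # noqa: F401
        from sklearn.impute import IterativeImputer

        rng = np.random.default_rng(0)
        X = rng.normal(size=(300, 4))
        X[:, 3] = 0.8 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.3, size=300)
        X_miss = X.copy()
        X_miss[rng.random(X.shape) < 0.15] = np.nan   # ~15% values missing at random

        # Chained equations: each incomplete column is regressed on the others,
        # cycling until the imputations stabilise (the core idea behind mice).
        imputer = IterativeImputer(max_iter=10, sample_posterior=True, random_state=0)
        X_filled = imputer.fit_transform(X_miss)
        print(np.isnan(X_filled).sum())               # -> 0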

  9. MULTIPLE IMPUTATION OF MISSING DATA IN SUSTAINABLE DEVELOPMENT MODELLING

    OpenAIRE

    Roberto Benedetti; Rita Lima; Alessandro Pandimiglio

    2006-01-01

    A multiple imputation technique is proposed for the treatment of missing data when measuring sustainable development with structural equation models (LISREL). The reliability of the technique is verified by comparing the model estimated with missing data to the model estimated with imputed data. The results show that the missing data problem significantly affects the estimation.

  10. A Comparison of Imputation Methods for Bayesian Factor Analysis Models

    Science.gov (United States)

    Merkle, Edgar C.

    2011-01-01

    Imputation methods are popular for the handling of missing data in psychology. The methods generally consist of predicting missing data based on observed data, yielding a complete data set that is amenable to standard statistical analyses. In the context of Bayesian factor analysis, this article compares imputation under an unrestricted…

  11. New approach for the study of paleofloras using geographical information systems applied to Glossopteris Flora

    Directory of Open Access Journals (Sweden)

    Isabel Cortez Christiano-de-Souza

    This paper introduces a methodology which makes possible the visualization of the spatial distribution of plant fossils and applies it to the occurrences of the Gondwana Floristic Province present on the eastern border of the Brazilian portion of the Paraná Basin during the Neopaleozoic. This province was chosen due to the existence of a large number of publications referring to its occurrences, so that a meta-analysis of their distribution could be based on ample information. The first step was the construction of a composite database including the geographical location, geology, and botanical systematics of each relevant fossil. The geographical locations were then georeferenced and translated into a series of maps showing different aspects of the distribution of the fossils. The spatial distribution of the fossil-bearing outcrops shows that these are spread along the area of deposition studied. Although some genera persisted for long periods of time, others lasted for only short intervals. As time passed, the fossil composition underwent a gradual change from the Late Carboniferous (Itararé Group) to the Late Permian (Rio do Rasto Formation), with the number of genera represented decreasing from 45 in the Itararé Group to 11 in the Rio do Rasto Formation.

  12. A geographic information systems (GIS) and spatial modeling approach to assessing indoor radon potential at local level

    Energy Technology Data Exchange (ETDEWEB)

    Lacan, Igor [California Department of Health Services, Environmental Health Laboratory Branch, 850 Marina Bay Pkwy, Mailstop G365/EHLB, Richmond, CA 94804 (United States)]. E-mail: ilacan@nature.Berkeley.edu; Zhou, Joey Y. [California Department of Health Services, Environmental Health Laboratory Branch, 850 Marina Bay Pkwy, Mailstop G365/EHLB, Richmond, CA 94804 (United States); Liu, Kai-Shen [California Department of Health Services, Environmental Health Laboratory Branch, 850 Marina Bay Pkwy, Mailstop G365/EHLB, Richmond, CA 94804 (United States); Waldman, Jed [California Department of Health Services, Environmental Health Laboratory Branch, 850 Marina Bay Pkwy, Mailstop G365/EHLB, Richmond, CA 94804 (United States)

    2006-04-15

    This study integrates residential radon data from previous studies in Southern California (USA), into a geographic information system (GIS) linked with statistical techniques. A difference (p<0.05) is found in the indoor radon in residences grouped by radon-potential zones. Using a novel Monte Carlo approach, we found that the mean distance from elevated-radon residences (concentration>74Bqm{sup -3}) to epicenters of large (> 4 Richter) earthquakes was smaller (p<0.0001) than the average residence-to-epicenter distance, suggesting an association between the elevated indoor-radon and seismic activities.

  13. A Geographic Information Science (GISc) Approach to Characterizing Spatiotemporal Patterns of Terrorist Incidents in Iraq, 2004-2009

    Energy Technology Data Exchange (ETDEWEB)

    Medina, Richard M [ORNL; Siebeneck, Laura K. [University of Utah; Hepner, George F. [University of Utah

    2011-01-01

    As terrorism on all scales continues, it is necessary to improve understanding of terrorist and insurgent activities. This article takes a Geographic Information Systems (GIS) approach to advance the understanding of spatial, social, political, and cultural triggers that influence terrorism incidents. Spatial, temporal, and spatiotemporal patterns of terrorist attacks are examined to improve knowledge about terrorist systems of training, planning, and actions. The results of this study aim to provide a foundation for understanding attack patterns and tactics in emerging havens as well as inform the creation and implementation of various counterterrorism measures.

  14. Comparison of random forest and parametric imputation models for imputing missing data using MICE: a CALIBER study.

    Science.gov (United States)

    Shah, Anoop D; Bartlett, Jonathan W; Carpenter, James; Nicholas, Owen; Hemingway, Harry

    2014-03-15

    Multivariate imputation by chained equations (MICE) is commonly used for imputing missing data in epidemiologic research. The "true" imputation model may contain nonlinearities which are not included in default imputation models. Random forest imputation is a machine learning technique which can accommodate nonlinearities and interactions and does not require a particular regression model to be specified. We compared parametric MICE with a random forest-based MICE algorithm in 2 simulation studies. The first study used 1,000 random samples of 2,000 persons drawn from the 10,128 stable angina patients in the CALIBER database (Cardiovascular Disease Research using Linked Bespoke Studies and Electronic Records; 2001-2010) with complete data on all covariates. Variables were artificially made "missing at random," and the bias and efficiency of parameter estimates obtained using different imputation methods were compared. Both MICE methods produced unbiased estimates of (log) hazard ratios, but random forest was more efficient and produced narrower confidence intervals. The second study used simulated data in which the partially observed variable depended on the fully observed variables in a nonlinear way. Parameter estimates were less biased using random forest MICE, and confidence interval coverage was better. This suggests that random forest imputation may be useful for imputing complex epidemiologic data sets in which some patients have missing data.
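
    A sketch of the comparison described above, transplanted to Python (the study itself used MICE in R): scikit-learn's IterativeImputer is run once with a parametric estimator and once with a random forest estimator on data with a deliberately nonlinear dependence. Sample sizes and noise levels are arbitrary.

        import numpy as np
        from sklearn.experimental import enable_iterative_imputer  # noqa: F401
        from sklearn.impute import IterativeImputer
        from sklearn.ensemble import RandomForestRegressor
        from sklearn.linear_model import BayesianRidge

        rng = np.random.default_rng(42)
        x1 = rng.normal(size=500)
        x2 = np.sin(x1) + rng.normal(scale=0.2, size=500)   # nonlinear dependence
        X = np.column_stack([x1, x2])
        miss = rng.random(500) < 0.3
        X[miss, 1] = np.nan                                  # x2 partially observed

        imputers = {
            "parametric (BayesianRidge)": IterativeImputer(
                estimator=BayesianRidge(), random_state=0),
            "random forest": IterativeImputer(
                estimator=RandomForestRegressor(n_estimators=50, random_state=0),
                random_state=0),
        }
        for name, imp in imputers.items():
            filled = imp.fit_transform(X)
            rmse = np.sqrt(np.mean((filled[miss, 1] - np.sin(x1[miss])) ** 2))
            print(name, round(float(rmse), 3))   # the forest usually tracks sin(x1) better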

  15. Imputation of variants from the 1000 Genomes Project modestly improves known associations and can identify low-frequency variant-phenotype associations undetected by HapMap based imputation.

    Science.gov (United States)

    Wood, Andrew R; Perry, John R B; Tanaka, Toshiko; Hernandez, Dena G; Zheng, Hou-Feng; Melzer, David; Gibbs, J Raphael; Nalls, Michael A; Weedon, Michael N; Spector, Tim D; Richards, J Brent; Bandinelli, Stefania; Ferrucci, Luigi; Singleton, Andrew B; Frayling, Timothy M

    2013-01-01

    Genome-wide association (GWA) studies have been limited by the reliance on common variants present on microarrays or imputable from the HapMap Project data. More recently, the completion of the 1000 Genomes Project has provided variant and haplotype information for several million variants derived from sequencing over 1,000 individuals. To help understand the extent to which more variants (including low frequency (1% ≤ MAF HapMap and 1000 Genomes imputation, respectively, and 9 and 11 that reached a stricter, likely conservative, threshold of PHapMap imputed data. We also detected an association between a low frequency variant and phenotype that was previously missed by HapMap based imputation approaches. An association between rs112635299 and alpha-1 globulin near the SERPINA gene represented the known association between rs28929474 (MAF = 0.007) and alpha1-antitrypsin that predisposes to emphysema (P = 2.5×10(-12)). Our data provide important proof of principle that 1000 Genomes imputation will detect novel, low frequency-large effect associations.

  16. Geographic information analysis: An ecological approach for the management of wildlife on the forest landscape

    Science.gov (United States)

    Ripple, William J.

    1995-01-01

    This document is a summary of the project funded by NAGw-1460 as part of the Earth Observation Commericalization/Applications Program (EOCAP) directed by NASA's Earth Science and Applications Division. The goal was to work with several agencies to focus on forest structure and landscape characterizations for wildlife habitat applications. New analysis techniques were used in remote sensing and landscape ecology with geographic information systems (GIS). The development of GIS and the emergence of the discipline of landscape ecology provided us with an opportunity to study forest and wildlife habitat resources from a new perspective. New techniques were developed to measure forest structure across scales from the canopy to the regional level. This paper describes the project team, technical advances, and technology adoption process that was used. Reprints of related refereed journal articles are in the Appendix.

  17. A reappraisal of the geographic distribution of Bokermannohyla sazimai (Anura: Hylidae) through morphological and bioacoustic approaches.

    Directory of Open Access Journals (Sweden)

    Thiago Ribeiro de Carvalho

    2013-06-01

    The type locality of Bokermannohyla sazimai is in the municipality of São Roque de Minas, state of Minas Gerais, Brazil. In this paper, we reassess the geographic distribution of B. sazimai and provide additional information on the variation of several non-topotypic populations compared with topotypic populations (São Roque de Minas and Vargem Bonita), on the basis of three lines of evidence (color pattern, morphometry and vocalizations). The differences obtained among all populations with respect to color pattern, morphometry and advertisement calls were attributed to interpopulational variation, so that this variation was not enough to recognize any population as a distinct lineage in comparison with the topotypic information available on B. sazimai.

  18. Biodiversity, land use and ecosystem services—An organismic and comparative approach to different geographical regions

    Directory of Open Access Journals (Sweden)

    Ulrich Zeller

    2017-04-01

    The approach further focuses on human–wildlife interactions. The re-emergence of top predators in Europe reveals the value of the experience from Africa, where pastoralists have managed to coexist with large predators for millennia. The investigation of African grasslands enables a critical reflection on, and a thorough understanding of, processes that occurred in Europe a long time ago. Our approach leads to a re-evaluation of the significance of Africa as a conservative, relict case scenario that can provide essential insights into the original state of ecosystems, especially in view of “rewilding” approaches in Europe. It thereby also leads to a careful consideration of the term “wilderness”.

  19. Imputing forest carbon stock estimates from inventory plots to a nationally continuous coverage

    Directory of Open Access Journals (Sweden)

    Wilson Barry Tyler

    2013-01-01

    The U.S. has been providing national-scale estimates of forest carbon (C) stocks and stock change to meet United Nations Framework Convention on Climate Change (UNFCCC) reporting requirements for years. Although these currently are provided as national estimates by pool and year to meet greenhouse gas monitoring requirements, there is growing need to disaggregate these estimates to finer scales to enable strategic forest management and monitoring activities focused on various ecosystem services such as C storage enhancement. Through application of a nearest-neighbor imputation approach, spatially extant estimates of forest C density were developed for the conterminous U.S. using the U.S.’s annual forest inventory. Results suggest that an existing forest inventory plot imputation approach can be readily modified to provide raster maps of C density across a range of pools (e.g., live tree to soil organic carbon) and spatial scales (e.g., sub-county to biome). Comparisons among imputed maps indicate strong regional differences across C pools. The C density of pools closely related to detrital input (e.g., dead wood) is often highest in forests suffering from recent mortality events such as those in the northern Rocky Mountains (e.g., beetle infestations). In contrast, live tree carbon density is often highest on the highest quality forest sites such as those found in the Pacific Northwest. Validation results suggest strong agreement between the estimates produced from the forest inventory plots and those from the imputed maps, particularly when the C pool is closely associated with the imputation model (e.g., aboveground live biomass and live tree basal area), with weaker agreement for detrital pools (e.g., standing dead trees). Forest inventory imputed plot maps provide an efficient and flexible approach to monitoring diverse C pools at national (e.g., UNFCCC) and regional scales (e.g., Reducing Emissions from Deforestation and Forest
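
    A generic nearest-neighbour imputation sketch, assuming hypothetical plot predictors and carbon densities; it is not the inventory programme's implementation, but it shows how plot-level measurements can be carried to a wall-to-wall grid through similarity in predictor space.

        import numpy as np
        from sklearn.neighbors import KNeighborsRegressor

        rng = np.random.default_rng(7)

        # Hypothetical training data: inventory plots with auxiliary predictors
        # (e.g., spectral bands, climate, terrain) and measured carbon density.
        coef = rng.normal(size=5)
        plot_predictors = rng.normal(size=(500, 5))
        plot_carbon = 50 + 8 * (plot_predictors @ coef) + rng.normal(size=500)

        # Hypothetical wall-to-wall grid of pixels with the same predictors observed
        pixel_predictors = rng.normal(size=(10_000, 5))

        # k-nearest-neighbour imputation: each pixel receives the (distance-weighted)
        # carbon density of the most similar inventory plots in predictor space.
        knn = KNeighborsRegressor(n_neighbors=7, weights="distance")
        knn.fit(plot_predictors, plot_carbon)
        carbon_map = knn.predict(pixel_predictors)
        print(carbon_map.shape, round(float(carbon_map.mean()), 1))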

  20. Imputing Gene Expression in Uncollected Tissues Within and Beyond GTEx

    Science.gov (United States)

    Wang, Jiebiao; Gamazon, Eric R.; Pierce, Brandon L.; Stranger, Barbara E.; Im, Hae Kyung; Gibbons, Robert D.; Cox, Nancy J.; Nicolae, Dan L.; Chen, Lin S.

    2016-01-01

    Gene expression and its regulation can vary substantially across tissue types. In order to generate knowledge about gene expression in human tissues, the Genotype-Tissue Expression (GTEx) program has collected transcriptome data in a wide variety of tissue types from post-mortem donors. However, many tissue types are difficult to access and are not collected in every GTEx individual. Furthermore, in non-GTEx studies, the accessibility of certain tissue types greatly limits the feasibility and scale of studies of multi-tissue expression. In this work, we developed multi-tissue imputation methods to impute gene expression in uncollected or inaccessible tissues. Via simulation studies, we showed that the proposed methods outperform existing imputation methods in multi-tissue expression imputation and that incorporating imputed expression data can improve power to detect phenotype-expression correlations. By analyzing data from nine selected tissue types in the GTEx pilot project, we demonstrated that harnessing expression quantitative trait loci (eQTLs) and tissue-tissue expression-level correlations can aid imputation of transcriptome data from uncollected GTEx tissues. More importantly, we showed that by using GTEx data as a reference, one can impute expression levels in inaccessible tissues in non-GTEx expression studies. PMID:27040689

  1. Phenetic and geographic pattern of Aconitum sect. Napellus (Ranunculaceae in the Eastern Carpathians - a numerical approach

    Directory of Open Access Journals (Sweden)

    Józef Mitka

    2014-01-01

    Aconitum sect. Napellus in the Eastern Carpathians was explored with the use of methods of numerical taxonomy. The taxon consists of A. bucovinense Zapał. pro hybr., A. firmum Rchb. subsp. firmum, A. firmum subsp. fissurae Nyarady, A. firmum nothosubsp. fussianum Starmuhl (A. firmum subsp. firmum x subsp. fissurae), A. x nanum (Baumg.) Simonk. (A. bucovinense x A. firmum) and the hybrid A. firmum x A. x nanum. The taxa form a phenetic continuum in a character hyperspace, and their delimitation is based on a few traits hitherto neglected, e.g. the type of hairiness and flower morphology. A key is provided to identify taxa at all ranks within the supplemented sect. Napellus. There is a regional pattern in the distribution of particular OTUs, which show local morphological uniqueness within a taxon. The phenomenon was investigated using the concept of "centers of phenetic coherence" (CPC), based on overall morphological similarity. The CPC may be interpreted as regions of neoendemism and/or may reflect a post-glacial migratory route. The high-mountain flora of the Western Bieszczady Mts. (with sect. Napellus as its example) has features of neoendemism (schizoendemism), being most probably a result of geographical vicarism.

  2. Data Editing and Imputation in Business Surveys Using “R”

    Directory of Open Access Journals (Sweden)

    Elena Romascanu

    2014-06-01

    Purpose – Missing data are a recurring problem that can cause bias or lead to inefficient analyses. The objective of this paper is a direct comparison between the two statistical software packages R and SPSS, in order to take full advantage of existing automated methods for the data editing process and imputation in business surveys (with a proper design of consistency rules as a partial alternative to the manual editing of data). Approach – Different methods for editing survey data are compared in R using the ‘editrules’ and ‘survey’ packages, which contain transformations commonly used in official statistics, together with visualization of missing-value patterns using the ‘Amelia’ and ‘VIM’ packages and imputation approaches for longitudinal data using ‘VIMGUI’; the performance of another statistical package, SPSS, is compared on the same features. Findings – Data on business statistics received by NISs (National Institutes of Statistics) are not ready to be used for direct analysis due to in-record inconsistencies, errors and missing values in the collected data sets. The appropriate automatic methods from the R packages offer the ability to locate the erroneous fields in edit-violating records and to verify the results after the imputation of missing values, providing users with a flexible, less time-consuming approach; such automation is easier to perform in R than with SPSS macro syntax, although macros remain very handy.

  3. A Comparison of Joint Model and Fully Conditional Specification Imputation for Multilevel Missing Data

    Science.gov (United States)

    Mistler, Stephen A.; Enders, Craig K.

    2017-01-01

    Multiple imputation methods can generally be divided into two broad frameworks: joint model (JM) imputation and fully conditional specification (FCS) imputation. JM draws missing values simultaneously for all incomplete variables using a multivariate distribution, whereas FCS imputes variables one at a time from a series of univariate conditional…

  4. Geographic Information System (GIS) capabilities in traffic accident information management: a qualitative approach.

    Science.gov (United States)

    Ahmadi, Maryam; Valinejadi, Ali; Goodarzi, Afshin; Safari, Ameneh; Hemmat, Morteza; Majdabadi, Hesamedin Askari; Mohammadi, Ali

    2017-06-01

    Traffic accidents are one of the more important national and international issues, and their consequences are important for the political, economical, and social level in a country. Management of traffic accident information requires information systems with analytical and accessibility capabilities to spatial and descriptive data. The aim of this study was to determine the capabilities of a Geographic Information System (GIS) in management of traffic accident information. This qualitative cross-sectional study was performed in 2016. In the first step, GIS capabilities were identified via literature retrieved from the Internet and based on the included criteria. Review of the literature was performed until data saturation was reached; a form was used to extract the capabilities. In the second step, study population were hospital managers, police, emergency, statisticians, and IT experts in trauma, emergency and police centers. Sampling was purposive. Data was collected using a questionnaire based on the first step data; validity and reliability were determined by content validity and Cronbach's alpha of 75%. Data was analyzed using the decision Delphi technique. GIS capabilities were identified in ten categories and 64 sub-categories. Import and process of spatial and descriptive data and so, analysis of this data were the most important capabilities of GIS in traffic accident information management. Storing and retrieving of descriptive and spatial data, providing statistical analysis in table, chart and zoning format, management of bad structure issues, determining the cost effectiveness of the decisions and prioritizing their implementation were the most important capabilities of GIS which can be efficient in the management of traffic accident information.

  5. Multiple imputation of completely missing repeated measures data within person from a complex sample: application to accelerometer data in the National Health and Nutrition Examination Survey.

    Science.gov (United States)

    Liu, Benmei; Yu, Mandi; Graubard, Barry I; Troiano, Richard P; Schenker, Nathaniel

    2016-12-10

    The Physical Activity Monitor component was introduced into the 2003-2004 National Health and Nutrition Examination Survey (NHANES) to collect objective information on physical activity including both movement intensity counts and ambulatory steps. Because of an error in the accelerometer device initialization process, the steps data were missing for all participants in several primary sampling units, typically a single county or group of contiguous counties, who had intensity count data from their accelerometers. To avoid potential bias and loss in efficiency in estimation and inference involving the steps data, we considered methods to accurately impute the missing values for steps collected in the 2003-2004 NHANES. The objective was to come up with an efficient imputation method that minimized model-based assumptions. We adopted a multiple imputation approach based on additive regression, bootstrapping and predictive mean matching methods. This method fits alternative conditional expectation (ace) models, which use an automated procedure to estimate optimal transformations for both the predictor and response variables. This paper describes the approaches used in this imputation and evaluates the methods by comparing the distributions of the original and the imputed data. A simulation study using the observed data is also conducted as part of the model diagnostics. Finally, some real data analyses are performed to compare the before and after imputation results. Published 2016. This article is a U.S. Government work and is in the public domain in the USA.
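
    A simplified multiple-imputation sketch in the spirit of the approach above: each imputed copy refits a model on a bootstrap sample (a plain linear regression stands in here for the additive 'ace' models), fills gaps by predictive mean matching, and the m analyses are pooled with Rubin's rules. The data, the value of m and the variance formula for the coefficient are illustrative simplifications.

        import numpy as np
        from sklearn.linear_model import LinearRegression

        def pmm_draw(X, y, rng, n_donors=5):
            """One imputed copy of y: refit on a bootstrap sample (to propagate
            parameter uncertainty), then predictive mean matching for the gaps."""
            obs = np.where(~np.isnan(y))[0]
            boot = rng.choice(obs, size=obs.size, replace=True)
            model = LinearRegression().fit(X[boot], y[boot])
            pred_obs, pred_all = model.predict(X[obs]), model.predict(X)
            y_imp = y.copy()
            for i in np.where(np.isnan(y))[0]:
                donors = obs[np.argsort(np.abs(pred_obs - pred_all[i]))[:n_donors]]
                y_imp[i] = y[rng.choice(donors)]       # hot deck from the nearest donors
            return y_imp

        rng = np.random.default_rng(3)
        X = rng.normal(size=(400, 3))
        y = X @ np.array([2.0, 0.0, -1.0]) + rng.normal(size=400)
        y[rng.random(400) < 0.25] = np.nan

        # Multiple imputation: m completed data sets, analysed separately,
        # then combined with Rubin's rules.
        m, ests, variances = 20, [], []
        for _ in range(m):
            y_imp = pmm_draw(X, y, rng)
            fit = LinearRegression().fit(X, y_imp)
            ests.append(fit.coef_[0])
            resid = y_imp - fit.predict(X)
            se2 = resid.var() / (X[:, 0].var() * len(y))   # rough SE^2 of coef_[0]
            variances.append(se2)

        qbar = np.mean(ests)                        # pooled point estimate
        u = np.mean(variances)                      # within-imputation variance
        b = np.var(ests, ddof=1)                    # between-imputation variance
        total_var = u + (1 + 1 / m) * b             # Rubin's total variance
        print(round(float(qbar), 3), round(float(np.sqrt(total_var)), 3))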

  6. A New Approach of Fuzzy Theory with Uncertainties in Geographic Information Systems

    Directory of Open Access Journals (Sweden)

    Mohammad Bazmara

    2013-01-01

    Full Text Available Until now, fuzzy logic has been extensively used to better analyze and design controllers for chemical processes. It has also been used for other applications like parameter estimation of nonlinear continuous-time systems but in general fuzzy logic has been intensively used for heuristics based system. Recently, fuzzy logic has been applied successfully in many areas where conventional model based approaches are difficult or not cost effective to implement. Mechanistic modeling of physical systems is often complicated by the presence of uncertainties. When models are used as purely predictive tools, uncertainty and variability lead to the need for assessment of the plausible range of model outcomes. A systematic uncertainty analysis provides insight into the level of confidence in model estimates, and can aid in assessing how various possible model estimates should be weighed. In this paper, generalized fuzzy α-cut is used to show the utility of fuzzy approach in uncertainty analysis of pollutant transport in ground water. Based on the concept of transformation method which is an extension of α-cuts, the approach shows superiority over conventional methods of uncertainty modeling. A 2-D groundwater transport model has been used to show the utility of this approach. Results are compared with commonly used probabilistic method and normal Fuzzy alpha-cut technique. In order to provide a basis for comparison between the two approaches, the shape of the membership functions used in the fuzzy methods are the same as the shape of the probability density function used in the Monte-Carlo method. The extended fuzzy α-cut technique presents a strong alternative to the conventional approach.

  7. InfoXtract Location Normalization: A Hybrid Approach to Geographic References in Information Extraction

    Science.gov (United States)

    2003-01-01

    corpus to perform WSD. Recent work emphasizes a corpus-based unsupervised approach [Dagan and Itai 1994; Yarowsky 1992; Yarowsky 1995] that avoids...

  8. An integrated geographic information system approach for modeling the suitability of conifer habitat in an alpine environment

    Science.gov (United States)

    McGregor, Stephen J.

    1998-01-01

    Alpine periglacial environments within the forest-alpine tundra ecotone (FATE) may be among the first to reflect changes in habitat characteristics as a consequence of climatic change. Previous FATE studies used Integrated Geographic Information System (IGIS) techniques to collect and model biophysical data but lacked the necessary detail to model the micro-scale patterns and compositions of habitat within alpine periglacial environments. This paper describes several promising data collection, integration, and cartographic modeling techniques used in an IGIS approach to model alpine periglacial environments in Glacier National Park (GNP), Montana, USA. High-resolution (1 × 1 m) multi-spectral remote sensing data and differentially corrected Global Positioning System (DGPS) data were integrated with other biophysical data using a raster-based IGIS approach. Biophysical factors, hypothesized to influence the pattern and composition of the FATE and the alpine tundra ecosystem, were derived from the high-resolution remote sensing data, in-situ GPS data, high-resolution models of digital elevation, and other thematic data using image processing techniques and cartographic modeling. Suitability models of conifer habitat were created using indices generated from the IGIS database. This IGIS approach identified suitable conifer habitat within the FATE and permitted the modeling of micro-scale periglacial features and alpine tundra communities that are absent from traditional approaches of landscape-scale (30 × 30 m) modeling.

  9. missForest: Nonparametric missing value imputation using random forest

    Science.gov (United States)

    Stekhoven, Daniel J.

    2015-05-01

    missForest imputes missing values particularly in the case of mixed-type data. It uses a random forest trained on the observed values of a data matrix to predict the missing values. It can be used to impute continuous and/or categorical data including complex interactions and non-linear relations. It yields an out-of-bag (OOB) imputation error estimate without the need of a test set or elaborate cross-validation and can be run in parallel to save computation time. missForest has been used to, among other things, impute variable star colors in an All-Sky Automated Survey (ASAS) dataset of variable stars with no NOMAD match.
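
    missForest is an R package; the following is a from-scratch, numeric-only sketch of the same iterative idea in Python (initialise with column means, then cycle random forests over the incomplete columns). The fixed iteration count is a simplification of missForest's OOB-based stopping rule.

        import numpy as np
        from sklearn.ensemble import RandomForestRegressor

        def miss_forest_numeric(X, n_iter=5, random_state=0):
            """missForest-style imputation for a numeric matrix: start from column
            means, then repeatedly re-predict each incomplete column with a random
            forest trained on the currently completed remaining columns."""
            X = np.asarray(X, dtype=float)
            miss = np.isnan(X)
            filled = np.where(miss, np.nanmean(X, axis=0), X)     # mean start values
            cols = np.argsort(miss.sum(axis=0))                   # least missing first
            for _ in range(n_iter):
                for j in cols:
                    if not miss[:, j].any():
                        continue
                    rf = RandomForestRegressor(n_estimators=100,
                                               random_state=random_state)
                    others = np.delete(filled, j, axis=1)
                    rf.fit(others[~miss[:, j]], filled[~miss[:, j], j])
                    filled[miss[:, j], j] = rf.predict(others[miss[:, j]])
            return filled

        rng = np.random.default_rng(5)
        X = rng.normal(size=(300, 4))
        X[:, 2] = X[:, 0] ** 2 + 0.3 * rng.normal(size=300)       # nonlinear relation
        X[rng.random(X.shape) < 0.1] = np.nan
        print(np.isnan(miss_forest_numeric(X)).sum())             # -> 0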

  10. A New Approach of Fuzzy Theory with Uncertainties in Geographic Information Systems

    OpenAIRE

    Mohammad Bazmara; Fereshteh Mohammadi

    2013-01-01

    Until now, fuzzy logic has been extensively used to better analyze and design controllers for chemical processes. It has also been used for other applications like parameter estimation of nonlinear continuous-time systems but in general fuzzy logic has been intensively used for heuristics based system. Recently, fuzzy logic has been applied successfully in many areas where conventional model based approaches are difficult or not cost effective to implement. Mechanistic modeling of physical sy...

  11. Ecological study and risk mapping of leishmaniasis in an endemic area of Brazil based on a geographical information systems approach

    Directory of Open Access Journals (Sweden)

    Alba Valéria Machado da Silva

    2011-11-01

    Visceral leishmaniasis is a vector-borne disease highly influenced by eco-epidemiological factors. Geographical information systems (GIS) have proved to be a suitable approach for the analysis of environmental components that affect the spatial distribution of diseases. Exploiting this methodology, a model was developed for the mapping of the distribution and incidence of canine leishmaniasis in an endemic area of Brazil. Local variations were observed with respect to infection incidence and distribution of serological titers, i.e. high titers were noted close to areas with preserved vegetation, while low titers were more frequent in areas where people kept chickens. Based on these results, we conclude that the environment plays an important role in generating relatively protected areas within larger endemic regions, but that it can also contribute to the creation of hotspots with clusters of comparatively high serological titers indicating a high level of transmission compared with neighbouring areas.

  12. Ecological study and risk mapping of leishmaniasis in an endemic area of Brazil based on a geographical information systems approach.

    Science.gov (United States)

    Machado da Silva, Alba Valéria; Magalhães, Monica de Avelar Figueiredo Mafra; Peçanha Brazil, Reginaldo; Carreira, João Carlos Araujo

    2011-11-01

    Visceral leishmaniasis is a vector-borne disease highly influenced by eco-epidemiological factors. Geographical information systems (GIS) have proved to be a suitable approach for the analysis of environmental components that affect the spatial distribution of diseases. Exploiting this methodology, a model was developed for the mapping of the distribution and incidence of canine leishmaniasis in an endemic area of Brazil. Local variations were observed with respect to infection incidence and distribution of serological titers, i.e. high titers were noted close to areas with preserved vegetation, while low titers were more frequent in areas where people kept chickens. Based on these results, we conclude that the environment plays an important role in generating relatively protected areas within larger endemic regions, but that it can also contribute to the creation of hotspots with clusters of comparatively high serological titers indicating a high level of transmission compared with neighbouring areas.

  13. WIMP: web server tool for missing data imputation.

    Science.gov (United States)

    Urda, D; Subirats, J L; García-Laencina, P J; Franco, L; Sancho-Gómez, J L; Jerez, J M

    2012-12-01

    The imputation of unknown or missing data is a crucial task in the analysis of biomedical datasets. There are several situations where it is necessary to classify or identify instances given incomplete vectors, and the existence of missing values can greatly degrade the performance of the algorithms used for classification/recognition. The task of learning accurately from incomplete data raises a number of issues, some of which have not been completely solved in machine learning applications. In this sense, effective missing value estimation methods are required. Different methods for missing data imputation exist, but most of the time selecting the appropriate technique involves testing several methods, comparing them and choosing the right one. Furthermore, applying these methods is in most cases not straightforward, as they involve several technical details; in particular, when dealing with microarray datasets, applying the methods requires huge computational resources. As far as we know, there is no public software application that provides the computing capabilities required for carrying out the task of data imputation. This paper presents a new public tool for missing data imputation that is attached to a computer cluster in order to execute computationally demanding tasks. The software WIMP (Web IMPutation) is a publicly available web site where registered users can create, execute, analyze and store their simulations related to missing data imputation.

  14. MartiTracks: a geometrical approach for identifying geographical patterns of distribution.

    Directory of Open Access Journals (Sweden)

    Susy Echeverría-Londoño

    Panbiogeography represents an evolutionary approach to biogeography, using rational cost-efficient methods to reduce the initial complexity of locality data and depict general distribution patterns. However, few quantitative and automated panbiogeographic methods exist. In this study, we propose a new algorithm, within a quantitative, geometrical framework, to perform panbiogeographical analyses as an alternative to more traditional methods. The algorithm first calculates a minimum spanning tree (an individual track for each species in a panbiogeographic context). Then the spatial congruence among segments of the minimum spanning trees is calculated using five congruence parameters, producing a general distribution pattern. In addition, the algorithm removes the ambiguity and subjectivity often present in a manual panbiogeographic analysis. Results from two empirical examples, using 61 species of the genus Bomarea (2,340 records) and 1,031 genera of both plants and animals (100,118 records) distributed across the Northern Andes, demonstrated that a geometrical approach to panbiogeography is a feasible quantitative method to determine general distribution patterns for taxa, reducing complexity and the time needed for managing large data sets.
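
    A sketch of the algorithm's first step only (the individual track as a minimum spanning tree over one species' localities), using SciPy; the congruence parameters and the rest of MartiTracks are not reproduced, the coordinates below are invented, and plain Euclidean distance on decimal degrees is used for simplicity.

        import numpy as np
        from scipy.spatial.distance import pdist, squareform
        from scipy.sparse.csgraph import minimum_spanning_tree

        def individual_track(localities):
            """Minimum spanning tree over a species' occurrence localities
            (lon/lat pairs); returns the list of connected locality pairs."""
            d = squareform(pdist(localities))        # pairwise distance matrix
            mst = minimum_spanning_tree(d).toarray()
            return [(int(i), int(j)) for i, j in zip(*np.nonzero(mst))]

        # Hypothetical occurrence records for one species (decimal degrees)
        records = np.array([[-75.2, 4.6], [-74.1, 4.7], [-73.9, 5.9],
                            [-76.5, 3.4], [-75.8, 6.2]])
        print(individual_track(records))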

  15. The HCUP SID Imputation Project: Improving Statistical Inferences for Health Disparities Research by Imputing Missing Race Data.

    Science.gov (United States)

    Ma, Yan; Zhang, Wei; Lyman, Stephen; Huang, Yihe

    2017-05-04

    To identify the most appropriate imputation method for missing data in the HCUP State Inpatient Databases (SID) and assess the impact of different missing data methods on racial disparities research. HCUP SID. A novel simulation study compared four imputation methods (random draw, hot deck, joint multiple imputation [MI], conditional MI) for missing values for multiple variables, including race, gender, admission source, median household income, and total charges. The simulation was built on real data from the SID to retain their hierarchical data structures and missing data patterns. Additional predictive information from the U.S. Census and American Hospital Association (AHA) database was incorporated into the imputation. Conditional MI prediction was equivalent or superior to the best performing alternatives for all missing data structures and substantially outperformed each of the alternatives in various scenarios. Conditional MI substantially improved statistical inferences for racial health disparities research with the SID. © Health Research and Educational Trust.

  16. Geographical approaches the bluefin Tuna (thunnus thynnus farmers and farm of Musellim passage

    Directory of Open Access Journals (Sweden)

    Rüştü Ilgar

    2006-08-01

    The study area is an important strait of our country, lying where the Biga Peninsula and Lesvos Island approach each other. Attempts to address the issue were made by Greenpeace, ANSE (Asociacion de Naturalistas del Sureste) and the WWF (World Wildlife Fund). As a result of these attempts, on April 15-19, 2002, ICCAT (International Commission for the Conservation of Atlantic Tuna) and the GFCM (General Fisheries Commission for the Mediterranean) gathered in Malta and studied fish stocks and statistical values. Finally, on April 29, 2002, they published in Cartagena a moratorium aimed at European Community ministries. The objective of this work is to examine the processes, including tourism, affecting the life and safety of the aquatic biosphere in this area of intensive interactions, and to discuss optimization proposals for Thunnus thynnus farmers.

  17. Statistical approaches for farm and parasitic risk profiling in geographical veterinary epidemiology.

    Science.gov (United States)

    Catelan, Dolores; Rinaldi, Laura; Musella, Vincenzo; Cringoli, Giuseppe; Biggeri, Annibale

    2012-10-01

    We address the problem of farm and parasitic risk profiling in the context of veterinary epidemiology. We take advantage of a cross-sectional study carried out in the Campania Region to study the spatial distribution of 16 parasites across 121 ovine farms. We propose a tri-level hierarchical Bayesian model, which accounts for multivariate spatially structured overdispersion, to obtain estimates of posterior classification probabilities, that is, for each parasite and farm, the probability of belonging to the set of the null hypothesis. We explore four decision rules based on either posterior probabilities or posterior means and compare the results in terms of the number of false discoveries/non-discoveries or the rate of false discovery/non-discovery. Our approach proved useful for parasitological risk profiling, and we show that the decision rules can be easily handled.

  18. Investigating the Pathogenesis of Severe Malaria: A Multidisciplinary and Cross-Geographical Approach.

    Science.gov (United States)

    Wassmer, Samuel C; Taylor, Terrie E; Rathod, Pradipsinh K; Mishra, Saroj K; Mohanty, Sanjib; Arevalo-Herrera, Myriam; Duraisingh, Manoj T; Smith, Joseph D

    2015-09-01

    More than a century after the discovery of Plasmodium spp. parasites, the pathogenesis of severe malaria is still not well understood. The majority of malaria cases are caused by Plasmodium falciparum and Plasmodium vivax, which differ in virulence, red blood cell tropism, cytoadhesion of infected erythrocytes, and dormant liver hypnozoite stages. Cerebral malaria coma is one of the most severe manifestations of P. falciparum infection. Insights into its complex pathophysiology are emerging through a combination of autopsy, neuroimaging, parasite binding, and endothelial characterizations. Nevertheless, important questions remain regarding why some patients develop life-threatening conditions while the majority of P. falciparum-infected individuals do not, and why clinical presentations differ between children and adults. For P. vivax, there is renewed recognition of severe malaria, but an understanding of the factors influencing disease severity is limited and remains an important research topic. Shedding light on the underlying disease mechanisms will be necessary to implement effective diagnostic tools for identifying and classifying severe malaria syndromes and developing new therapeutic approaches for severe disease. This review highlights progress and outstanding questions in severe malaria pathophysiology and summarizes key areas of pathogenesis research within the International Centers of Excellence for Malaria Research program.

  19. An Intelligent Approach For Mining Frequent Spatial Objects In Geographic Information System

    Directory of Open Access Journals (Sweden)

    Animesh Tripathy

    2010-11-01

    Spatial data mining is based on the correlation of spatial objects in space. Mining frequent patterns from spatial database systems has always remained a challenge for researchers. The first law of geography, “everything is related to everything else, but nearby things are more related than distant things”, suggests that values taken from samples of spatial data near to each other tend to be more similar than those taken farther apart. This tendency is termed spatial autocorrelation or spatial dependence. It is natural that most spatial data are not independent; they have high autocorrelation. In this paper, we propose an enhancement of an existing mining algorithm for efficiently mining frequent patterns for spatial objects occurring in space, such as a city located near a river. The frequency of each spatial object in relation to other objects tends to determine multiple occurrences of the same object. We further enhance the proposed approach by using a numerical method. This method uses a tree-structure-based methodology for mining frequent patterns, considering the frequency of each object stored at each node of the tree. Experimental results suggest a significant improvement in finding valid frequent patterns over existing methods.

  20. Geomorphology and Ecology of Mountain Landscapes: an interdisciplinary approach to problem-based learning in a particular geographical setting

    Science.gov (United States)

    Wemple, B.; Thomas, E. P.; Shanley, J.

    2006-12-01

    Mountain settings provide some unique conditions for the instruction of earth surface processes and ecology. Recent attention has also highlighted certain risks to mountain environments posed by development pressures and climate change scenarios. We describe a course developed for senior undergraduate students that focuses on an integrated, interdisciplinary view of ecological, geophysical, and socio-political processes in mountain settings. We use a problem-based learning approach where students first learn to collect and analyze data around a set of field problems tackled during a one-week field intensive. Next, students explore a range of research problems from mountain settings through a semester-long seminar focusing on current scholarly readings and visits with resource managers, policy makers and stakeholders. Finally, students craft and execute a research project and present results in a symposium setting. Our course builds on the traditional model of the Geoscience field camp, employs a geographical perspective to think synthetically about the nature of mountain landscapes, uses an interdisciplinary approach to study processes and process- interactions of the mountain setting, and explores some of the unique challenges facing mountain regions.

  1. Geographic Names

    Data.gov (United States)

    Minnesota Department of Natural Resources — The Geographic Names Information System (GNIS), developed by the United States Geological Survey in cooperation with the U.S. Board of Geographic Names, provides...

  2. Quick, “Imputation-free” meta-analysis with proxy-SNPs

    Directory of Open Access Journals (Sweden)

    Meesters Christian

    2012-09-01

    Full Text Available Abstract Background Meta-analysis (MA) is widely used to pool genome-wide association studies (GWASes) in order to (a) increase the power to detect strong or weak genotype effects or (b) serve as a result-verification method. As a consequence of differing SNP panels among genotyping chips, imputation is the method of choice within GWAS consortia to avoid losing too many SNPs in a MA. YAMAS (Yet Another Meta Analysis Software), however, enables cross-GWAS conclusions prior to finished and polished imputation runs, which can be time-consuming. Results Here we present a fast method to avoid forfeiting SNPs present in only a subset of studies, without relying on imputation. This is accomplished by using reference linkage disequilibrium data from the 1,000 Genomes/HapMap projects to find proxy-SNPs together with in-phase alleles for SNPs missing in at least one study. MA is conducted by combining association effect estimates of a SNP and those of its proxy-SNPs. Our algorithm is implemented in the MA software YAMAS. Association results from GWAS analysis applications can be used as input files for MA, tremendously speeding up MA compared to the conventional imputation approach. We show that our proxy algorithm is well-powered and yields valuable ad hoc results, possibly providing an incentive for follow-up studies. We propose our method as a quick screening step prior to imputation-based MA, as well as an additional main approach for studies without available reference data matching the ethnicities of study participants. As a proof of principle, we analyzed six dbGaP Type II Diabetes GWAS and found that the proxy algorithm clearly outperforms naïve MA on the p-value level: for 17 out of 23 we observe an improvement on the p-value level by a factor of more than two, and a maximum improvement by a factor of 2127. Conclusions YAMAS is an efficient and fast meta-analysis program which offers various methods, including conventional MA as well as inserting proxy
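
    As a rough illustration of the proxy-SNP idea (not the YAMAS implementation), the sketch below combines per-study effect estimates by fixed-effect inverse-variance weighting, with one study contributing a proxy-SNP estimate in place of the missing index SNP; all numbers are invented.

```python
# Illustrative sketch only (not YAMAS): fixed-effect inverse-variance
# meta-analysis in which a study missing the index SNP contributes the
# effect estimate of an LD proxy instead. Betas/SEs below are made up.
import math

# (beta, se) per study; the third study lacks the index SNP, so a proxy-SNP
# estimate (aligned to the same effect allele via in-phase alleles) is used.
estimates = [(0.12, 0.05), (0.09, 0.06), (0.15, 0.07)]  # last one is the proxy

def fixed_effect_meta(estimates):
    weights = [1.0 / se**2 for _, se in estimates]
    beta = sum(w * b for (b, _), w in zip(estimates, weights)) / sum(weights)
    se = math.sqrt(1.0 / sum(weights))
    return beta, se, beta / se

beta, se, z = fixed_effect_meta(estimates)
print(f"pooled beta={beta:.3f}, se={se:.3f}, z={z:.2f}")
```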

  3. The Utility of Nonparametric Transformations for Imputation of Survey Data

    Directory of Open Access Journals (Sweden)

    Robbins Michael W.

    2014-12-01

    Full Text Available Missing values present a prevalent problem in the analysis of establishment survey data. Multivariate imputation algorithms (which are used to fill in missing observations) tend to have the common limitation that imputations for continuous variables are sampled from Gaussian distributions. This limitation is addressed here through the use of robust marginal transformations. Specifically, kernel-density and empirical distribution-type transformations are discussed and are shown to have favorable properties when used for imputation of complex survey data. Although such techniques have wide applicability (i.e., they may be easily applied in conjunction with a wide array of imputation techniques), the proposed methodology is applied here with an algorithm for imputation in the USDA’s Agricultural Resource Management Survey. Data analysis and simulation results are used to illustrate the specific advantages of the robust methods when compared to the fully parametric techniques and to other relevant techniques such as predictive mean matching. To summarize, transformations based upon parametric densities are shown to distort several data characteristics in circumstances where the parametric model is ill fit; however, no circumstances are found in which the transformations based upon parametric models outperform the nonparametric transformations. As a result, the transformation based upon the empirical distribution (which is the most computationally efficient is recommended over the other transformation procedures in practice.
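
    A minimal sketch of the transformation idea, assuming imputation is carried out on a Gaussian "normal scores" scale and back-transformed with empirical quantiles; this is not the survey's imputation algorithm, the helper names are invented, and the Gaussian draws stand in for a full imputation model.

```python
# Rough sketch of an empirical-distribution transformation for imputation:
# map observed values to Gaussian scores via the empirical CDF, impute on
# the transformed scale, then back-transform with empirical quantiles.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.lognormal(size=200)            # skewed, survey-like variable
obs = x[:150]                          # observed part

def to_normal_scores(values, reference):
    ranks = np.searchsorted(np.sort(reference), values, side="right")
    p = np.clip(ranks / (len(reference) + 1), 1e-6, 1 - 1e-6)
    return stats.norm.ppf(p)

def from_normal_scores(z, reference):
    return np.quantile(reference, stats.norm.cdf(z))

z_obs = to_normal_scores(obs, obs)                 # roughly N(0, 1)
z_imp = rng.normal(z_obs.mean(), z_obs.std(), 50)  # placeholder Gaussian draws
x_imp = from_normal_scores(z_imp, obs)             # back on the original scale
print(x_imp[:5])
```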

  4. Doubly robust and multiple-imputation-based generalized estimating equations.

    Science.gov (United States)

    Birhanu, Teshome; Molenberghs, Geert; Sotto, Cristina; Kenward, Michael G

    2011-03-01

    Generalized estimating equations (GEE), proposed by Liang and Zeger (1986), provide a popular method to analyze correlated non-Gaussian data. When data are incomplete, the GEE method suffers from its frequentist nature and inferences under this method are valid only under the strong assumption that the missing data are missing completely at random. When response data are missing at random, two modifications of GEE can be considered, based on inverse-probability weighting or on multiple imputation. The weighted GEE (WGEE) method involves weighting observations by the inverse of their probability of being observed. Imputation methods involve filling in missing observations with values predicted by an assumed imputation model, multiple times. The so-called doubly robust (DR) methods involve both a model for the weights and a predictive model for the missing observations given the observed ones. To yield consistent estimates, WGEE needs correct specification of the dropout model while imputation-based methodology needs a correctly specified imputation model. DR methods need correct specification of either the weight or the predictive model, but not necessarily both. Focusing on incomplete binary repeated measures, we study the relative performance of the singly robust and doubly robust versions of GEE in a variety of correctly and incorrectly specified models using simulation studies. Data from a clinical trial in onychomycosis further illustrate the method.
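
    A hedged sketch of the weighting step that WGEE relies on: model the probability that an observation is observed and weight complete records by its inverse. The variable names are hypothetical and the GEE fit itself is omitted.

```python
# Sketch of inverse-probability weights for WGEE (not a full GEE analysis):
# fit a dropout/observation model, then weight observed records by the
# inverse of their predicted probability of being observed.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
df = pd.DataFrame({
    "prev_response": rng.integers(0, 2, 500),
    "treatment": rng.integers(0, 2, 500),
})
# simulate "still observed" status depending on history and treatment
p_obs = 1 / (1 + np.exp(-(0.5 + 0.8 * df.prev_response - 0.4 * df.treatment)))
df["observed"] = rng.binomial(1, p_obs)

obs_model = LogisticRegression().fit(df[["prev_response", "treatment"]], df["observed"])
df["w"] = 1.0 / obs_model.predict_proba(df[["prev_response", "treatment"]])[:, 1]

# observed rows with weight w would then enter a weighted GEE fit
print(df.loc[df.observed == 1, "w"].describe())
```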

  5. Multiple imputation for threshold-crossing data with interval censoring.

    Science.gov (United States)

    Dorey, F J; Little, R J; Schenker, N

    1993-09-15

    Medical statistics often involve measurements of the time when a variable crosses a threshold value. The time to threshold crossing may be the outcome variable in a survival analysis, or a time-dependent covariate in the analysis of a subsequent event. This paper presents new methods for analysing threshold-crossing data that are interval censored in that the time of threshold crossing is known only within a specified interval. Such data typically arise in event-history studies when the threshold is crossed at some time between data-collection points, such as visits to a clinic. We propose methods based on multiple imputation of the threshold-crossing time with use of models that take into account values recorded at the times of visits. We apply the methods to two real data sets, one involving hip replacements and the other on the prostate specific antigen (PSA) assay for prostate cancer. In addition, we compare our methods with the common practice of imputing the threshold-crossing time as the right endpoint of the interval. The two examples require different imputation models, but both lead to simple analyses of the multiply imputed data that automatically take into account variability due to imputation.
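
    A minimal sketch of the imputation idea described above, with crossing times drawn uniformly within each censoring interval purely as a placeholder for the paper's model-based draws; the intervals are invented.

```python
# Minimal sketch (not the paper's imputation models): the true
# threshold-crossing time is only known to lie between the last visit below
# the threshold and the first visit at or above it, so draw it repeatedly
# from within that interval and analyze each completed dataset.
import numpy as np

rng = np.random.default_rng(2)
# (last visit before crossing, first visit at/after crossing), in months
intervals = np.array([(6.0, 12.0), (3.0, 9.0), (12.0, 18.0)])

M = 5  # number of imputations
imputed_times = [
    rng.uniform(intervals[:, 0], intervals[:, 1])  # one draw per subject
    for _ in range(M)
]
# Each of the M completed datasets would then be analyzed (e.g., with a
# survival model) and the results combined with Rubin's rules.
for m, t in enumerate(imputed_times, 1):
    print(f"imputation {m}: {np.round(t, 1)}")
```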

  6. The multiple imputation method: a case study involving secondary data analysis.

    Science.gov (United States)

    Walani, Salimah R; Cleland, Charles M

    2015-05-01

    This study illustrates the use of the multiple imputation method to replace missing data, using a secondary data analysis study as an example. Most large public datasets have missing data, which need to be handled by researchers conducting secondary data analysis studies. Multiple imputation is a technique widely used to replace missing values while preserving the sample size and sampling variability of the data. Data were drawn from the 2004 National Sample Survey of Registered Nurses. The authors created a model to impute missing values using the chained equation method. They used imputation diagnostics procedures and conducted regression analysis of imputed data to determine the differences between the log hourly wages of internationally educated and US-educated registered nurses. The authors used multiple imputation procedures to replace missing values in a large dataset with 29,059 observations. Five multiply imputed datasets were created. Imputation diagnostics using time series and density plots showed that imputation was successful. The authors also present an example of the use of multiply imputed datasets to conduct regression analysis to answer a substantive research question. Multiple imputation is a powerful technique for imputing missing values in large datasets while preserving the sample size and variance of the data. Even though the chained equation method involves complex statistical computations, recent innovations in software and computation have made it possible for researchers to conduct this technique on large datasets. The authors recommend nurse researchers use multiple imputation methods for handling missing data to improve the statistical power and external validity of their studies.
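
    A generic chained-equations sketch, with scikit-learn's IterativeImputer standing in for the software used in the study; the wage and education fields are invented for illustration, and the m completed datasets would be analyzed separately and pooled with Rubin's rules.

```python
# Generic chained-equations (MICE-style) sketch on invented survey-like data.
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(3)
df = pd.DataFrame({
    "log_wage": rng.normal(3.2, 0.3, 1000),
    "years_experience": rng.integers(0, 40, 1000).astype(float),
    "intl_educated": rng.integers(0, 2, 1000).astype(float),
})
df.loc[rng.random(1000) < 0.15, "log_wage"] = np.nan  # item nonresponse

# m separate imputed datasets, each analyzed and then pooled (Rubin's rules)
imputed_sets = [
    pd.DataFrame(
        IterativeImputer(sample_posterior=True, random_state=m).fit_transform(df),
        columns=df.columns,
    )
    for m in range(5)
]
print(imputed_sets[0].isna().sum().sum())  # 0 missing values after imputation
```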

  7. Disease maps as context for community mapping: a methodological approach for linking confidential health information with local geographical knowledge for community health research.

    Science.gov (United States)

    Beyer, Kirsten M M; Comstock, Sara; Seagren, Renea

    2010-12-01

    Health is increasingly understood as a product of multiple levels of influence, from individual biological and behavioral influences to community and societal level contextual influences. In understanding these contextual influences, community health researchers have increasingly employed both geographic methodologies, including Geographic Information Systems (GIS), and community participatory approaches. However, despite growing interest in the role for community participation and local knowledge in community health investigations, and the use of geographical methods and datasets in characterizing community environments, there exist few examples of research projects that incorporate both geographical and participatory approaches in addressing health questions. This is likely due in part to concerns and restrictions regarding community access to confidential health data. In order to overcome this barrier, we present a method for linking confidential, geocoded health information with community-generated experiential geographical information in a GIS environment. We use sophisticated disease mapping methodologies to create continuously defined maps of colorectal cancer in Iowa, then incorporate these layers in an open source GIS application as the context for a participatory community mapping exercise with participants from a rural Iowa town. Our method allows participants to interact directly with health information at a fine geographical scale, facilitating hypothesis generation regarding contextual influences on health, while simultaneously protecting data confidentiality. Participants are able to use their local, geographical knowledge to generate hypotheses about factors influencing colorectal cancer risk in the community and opportunities for risk reduction. This work opens the door for future efforts to integrate empirical epidemiological data with community generated experiential information to inform community health research and practice.

  8. Association between Floods and Acute Cardiovascular Diseases: A Population-Based Cohort Study Using a Geographic Information System Approach

    Directory of Open Access Journals (Sweden)

    Alain Vanasse

    2016-01-01

    Full Text Available Background: Floods represent a serious threat to human health beyond the immediate risk of drowning. There are few data on the potential link between floods and direct consequences on health, such as cardiovascular health. This study aimed to explore the impact of one of the worst floods in the history of Quebec, Canada on acute cardiovascular diseases (CVD). Methods: A cohort study with a time series design with multiple control groups was built with the adult population identified in the Quebec Integrated Chronic Disease Surveillance System. A geographic information system approach was used to define the study areas. Logistic regressions were performed to compare the occurrence of CVD between groups. Results: The results showed a 25%–27% increase in the odds in the flooded population in spring 2011 when compared with the population in the same area in springs 2010 and 2012. Moreover, an increase of up to 69% was observed in individuals with a medical history of CVD. Conclusion: Despite interesting results, the association was not statistically significant. A possible explanation for this result is that the population affected by the flood was probably too small to provide the statistical power to answer the question, which leaves open a substantial possibility for a real and large effect.

  9. Association between Floods and Acute Cardiovascular Diseases: A Population-Based Cohort Study Using a Geographic Information System Approach.

    Science.gov (United States)

    Vanasse, Alain; Cohen, Alan; Courteau, Josiane; Bergeron, Patrick; Dault, Roxanne; Gosselin, Pierre; Blais, Claudia; Bélanger, Diane; Rochette, Louis; Chebana, Fateh

    2016-01-28

    Floods represent a serious threat to human health beyond the immediate risk of drowning. There are few data on the potential link between floods and direct consequences on health, such as cardiovascular health. This study aimed to explore the impact of one of the worst floods in the history of Quebec, Canada on acute cardiovascular diseases (CVD). A cohort study with a time series design with multiple control groups was built with the adult population identified in the Quebec Integrated Chronic Disease Surveillance System. A geographic information system approach was used to define the study areas. Logistic regressions were performed to compare the occurrence of CVD between groups. The results showed a 25%-27% increase in the odds in the flooded population in spring 2011 when compared with the population in the same area in springs 2010 and 2012. Moreover, an increase of up to 69% was observed in individuals with a medical history of CVD. Despite interesting results, the association was not statistically significant. A possible explanation for this result is that the population affected by the flood was probably too small to provide the statistical power to answer the question, which leaves open a substantial possibility for a real and large effect.
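
    A small sketch, on simulated data, of the kind of logistic comparison described in the two records above; it is not the authors' analysis, and the variable names and coefficients are arbitrary.

```python
# Hedged sketch of a flooded-vs-control logistic comparison on simulated data.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(4)
n = 4000
df = pd.DataFrame({
    "flooded_2011": rng.integers(0, 2, n),   # exposure group
    "history_cvd": rng.integers(0, 2, n),    # prior CVD
})
logit_p = -3.0 + 0.22 * df.flooded_2011 + 1.0 * df.history_cvd
df["acute_cvd"] = rng.binomial(1, 1 / (1 + np.exp(-logit_p)))

fit = smf.logit("acute_cvd ~ flooded_2011 + history_cvd", data=df).fit(disp=0)
print(np.exp(fit.params))   # odds ratios for exposure and CVD history
```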

  10. Geographical orientation. An integral geoperspective

    Directory of Open Access Journals (Sweden)

    Cristóbal Cobo Arízaga

    2013-12-01

    This approach seeks to open a new line of discussion and to launch a proposal that scientifically challenges the hegemony of geographical thought and provides new structures of geographical rationality.

  11. An Efficient Approach for Historical Storage and Retrieval of Segmented Road Data in Geographic Information System for Transportation

    Institute of Scientific and Technical Information of China (English)

    Mohammad Reza Jelokhani-Niaraki; Ali Asghar Alesheikh; Abolghasem Sadeghi-Niaraki

    2010-01-01

    One of the most powerful functions of a Geographic Information System for Transportation (GIS-T) is Dynamic Segmentation (DS), which is used to increase the efficiency and precision of road management by generating segments based on attributes. The road segments describing transportation data are both spatially and temporally referenced. For a variety of transportation applications, historical road segments must be preserved. This study presents an appropriate approach to preserve and retrieve the historical road segments efficiently. In the proposed method, only the portions of segments of a time stamp that have been changed into new segments are recorded, rather than storing the entire segments for every old time stamp. The storage of these portions is based on the type of changes. A recursive algorithm is developed to retrieve all segments for every old time stamp. Experimental results using real data of Tehran City, Iran justify the strength of the proposed approach in many aspects. An important achievement of the results is that the database volume for 2006, 2007 and 2008 within the Historical Line Event Table (HLET) is reduced by 70%, 80% and 78%, respectively. The proposed method has the potential to prevent vast data redundancy and the unnecessary storage of entire segments for each time stamp. Since the present technique is performed on ordinary plain tables that are readable by all GIS software, special software platforms to manage the storage and retrieval of historical segments are not needed. In addition, this method simplifies spatio-temporal queries.
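
    A loose sketch, with invented field names, of the general idea of storing only changed segment portions with validity periods and reconstructing the segments valid in a given year; it does not reproduce the HLET design or the authors' recursive retrieval algorithm.

```python
# Toy illustration of temporally versioned road segments: each attribute
# segment carries a validity period, only changed portions get new rows,
# and a query reconstructs the segments valid at a given year.
segments = [
    # (route, from_km, to_km, attribute, valid_from, valid_to); None = still current
    ("R1", 0.0, 5.0, "asphalt", 2006, None),
    ("R1", 5.0, 9.0, "gravel",  2006, 2007),
    ("R1", 5.0, 9.0, "asphalt", 2008, None),   # only the changed portion is re-stored
]

def segments_at(year, rows):
    return [r for r in rows
            if r[4] <= year and (r[5] is None or year <= r[5])]

print(segments_at(2006, segments))   # both 2006 segments
print(segments_at(2008, segments))   # the re-surfaced portion replaces the old one
```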

  12. Land Suitability Modeling using a Geographic Socio-Environmental Niche-Based Approach: A Case Study from Northeastern Thailand.

    Science.gov (United States)

    Heumann, Benjamin W; Walsh, Stephen J; Verdery, Ashton M; McDaniel, Phillip M; Rindfuss, Ronald R

    2013-01-01

    Understanding the pattern-process relations of land use/land cover change is an important area of research that provides key insights into human-environment interactions. The suitability or likelihood of occurrence of land use such as agricultural crop types across a human-managed landscape is a central consideration. Recent advances in niche-based, geographic species distribution modeling (SDM) offer a novel approach to understanding land suitability and land use decisions. SDM links species presence-location data with geospatial information and uses machine learning algorithms to develop non-linear and discontinuous species-environment relationships. Here, we apply the MaxEnt (Maximum Entropy) model for land suitability modeling by adapting niche theory to a human-managed landscape. In this article, we use data from an agricultural district in Northeastern Thailand as a case study for examining the relationships between the natural, built, and social environments and the likelihood of crop choice for the commonly grown crops that occur in the Nang Rong District - cassava, heavy rice, and jasmine rice, as well as an emerging crop, fruit trees. Our results indicate that while the natural environment (e.g., elevation and soils) is often the dominant factor in crop likelihood, the likelihood is also influenced by household characteristics, such as household assets and conditions of the neighborhood or built environment. Furthermore, the shape of the land use-environment curves illustrates the non-continuous and non-linear nature of these relationships. This approach demonstrates a novel method of understanding non-linear relationships between land and people. The article concludes with a proposed method for integrating the niche-based rules of land use allocation into a dynamic land use model that can address both allocation and quantity of agricultural crops.
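
    MaxEnt itself is not reproduced here; the sketch below uses a plain presence/background logistic classifier on simulated covariates only to illustrate the idea of scoring land suitability for a crop from environmental and household variables. All fields are hypothetical.

```python
# Crude presence/background classifier as a stand-in for niche-based
# land suitability modeling (not MaxEnt); data are simulated.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(5)
n = 2000
land = pd.DataFrame({
    "elevation": rng.normal(200, 50, n),
    "soil_quality": rng.uniform(0, 1, n),
    "household_assets": rng.uniform(0, 1, n),
})
# 1 = parcel observed planted with cassava (presence), 0 = background parcel
logit_p = -2 + 0.01 * (land.elevation - 200) + 1.5 * land.soil_quality
land["cassava"] = rng.binomial(1, 1 / (1 + np.exp(-logit_p)))

features = ["elevation", "soil_quality", "household_assets"]
model = LogisticRegression(max_iter=1000).fit(land[features], land["cassava"])
land["suitability"] = model.predict_proba(land[features])[:, 1]
print(land["suitability"].describe())
```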

  13. Imputation of variants from the 1000 Genomes Project modestly improves known associations and can identify low-frequency variant-phenotype associations undetected by HapMap based imputation.

    Directory of Open Access Journals (Sweden)

    Andrew R Wood

    Full Text Available Genome-wide association (GWA) studies have been limited by the reliance on common variants present on microarrays or imputable from the HapMap Project data. More recently, the completion of the 1000 Genomes Project has provided variant and haplotype information for several million variants derived from sequencing over 1,000 individuals. To help understand the extent to which more variants (including low-frequency (1% ≤ MAF < 5%) and rare (MAF < 1%) variants) can enhance previously identified associations and identify novel loci, we selected 93 quantitative circulating factors where data was available from the InCHIANTI population study. These phenotypes included cytokines, binding proteins, hormones, vitamins and ions. We selected these phenotypes because many have known strong genetic associations and are potentially important to help understand disease processes. We performed a genome-wide scan for these 93 phenotypes in InCHIANTI. We identified 21 signals and 33 signals that reached P<5×10−8 based on HapMap and 1000 Genomes imputation, respectively, and 9 and 11 that reached a stricter, likely conservative, threshold of P<5×10−11 respectively. Imputation of 1000 Genomes genotype data modestly improved the strength of known associations. Of 20 associations detected at P<5×10−8 in both analyses (17 of which represent well replicated signals in the NHGRI catalogue), six were captured by the same index SNP, five were nominally more strongly associated in 1000 Genomes imputed data and one was nominally more strongly associated in HapMap imputed data. We also detected an association between a low frequency variant and phenotype that was previously missed by HapMap based imputation approaches. An association between rs112635299 and alpha-1 globulin near the SERPINA gene represented the known association between rs28929474 (MAF = 0.007) and alpha1-antitrypsin that predisposes to emphysema (P = 2.5×10−12). Our data provide important proof of

  14. Variable selection under multiple imputation using the bootstrap in a prognostic study

    NARCIS (Netherlands)

    Heymans, M.W.; Buuren, S. van; Knol, D.L.; Mechelen, W. van; Vet, H.C.W. de

    2007-01-01

    Background. Missing data is a challenging problem in many prognostic studies. Multiple imputation (MI) accounts for imputation uncertainty that allows for adequate statistical testing. We developed and tested a methodology combining MI with bootstrapping techniques for studying prognostic variable s

  16. Handling missing data in cluster randomized trials: A demonstration of multiple imputation with PAN through SAS

    Directory of Open Access Journals (Sweden)

    Jiangxiu Zhou

    2014-09-01

    Full Text Available The purpose of this study is to demonstrate a way of dealing with missing data in clustered randomized trials by doing multiple imputation (MI) with the PAN package in R through SAS. The procedure for doing MI with PAN through SAS is demonstrated in detail in order for researchers to be able to use this procedure with their own data. An illustration of the technique with empirical data was also included. In this illustration the PAN results were compared with pairwise deletion and three types of MI: (1) Normal Model (NM-MI) ignoring the cluster structure; (2) NM-MI with dummy-coded cluster variables (fixed cluster structure); and (3) a hybrid NM-MI which imputes half the time ignoring the cluster structure, and the other half including the dummy-coded cluster variables. The empirical analysis showed that using PAN and the other strategies produced comparable parameter estimates. However, the dummy-coded MI overestimated the intraclass correlation, whereas MI ignoring the cluster structure and the hybrid MI underestimated the intraclass correlation. When compared with PAN, the p-value and standard error for the treatment effect were higher with dummy-coded MI, and lower with MI ignoring the cluster structure, the hybrid MI approach, and pairwise deletion. Previous studies have shown that NM-MI is not appropriate for handling missing data in clustered randomized trials. This approach, in addition to the pairwise deletion approach, leads to a biased intraclass correlation and faulty statistical conclusions. Imputation in clustered randomized trials should be performed with PAN. We have demonstrated an easy way of using PAN through SAS.

  17. Accounting for uncertainty due to 'last observation carried forward' outcome imputation in a meta-analysis model.

    Science.gov (United States)

    Dimitrakopoulou, Vasiliki; Efthimiou, Orestis; Leucht, Stefan; Salanti, Georgia

    2015-02-28

    Missing outcome data are a problem commonly observed in randomized control trials that occurs as a result of participants leaving the study before its end. Missing such important information can bias the study estimates of the relative treatment effect and consequently affect the meta-analytic results. Therefore, methods on manipulating data sets with missing participants, with regard to incorporating the missing information in the analysis so as to avoid the loss of power and minimize the bias, are of interest. We propose a meta-analytic model that accounts for possible error in the effect sizes estimated in studies with last observation carried forward (LOCF) imputed patients. Assuming a dichotomous outcome, we decompose the probability of a successful unobserved outcome taking into account the sensitivity and specificity of the LOCF imputation process for the missing participants. We fit the proposed model within a Bayesian framework, exploring different prior formulations for sensitivity and specificity. We illustrate our methods by performing a meta-analysis of five studies comparing the efficacy of amisulpride versus conventional drugs (flupenthixol and haloperidol) on patients diagnosed with schizophrenia. Our meta-analytic models yield estimates similar to meta-analysis with LOCF-imputed patients. Allowing for uncertainty in the imputation process, precision is decreased depending on the priors used for sensitivity and specificity. Results on the significance of amisulpride versus conventional drugs differ between the standard LOCF approach and our model depending on prior beliefs on the imputation process. Our method can be regarded as a useful sensitivity analysis that can be used in the presence of concerns about the LOCF process.
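
    One standard way to write such a sensitivity/specificity decomposition is shown below; the authors' exact parameterization and priors may differ, so this is only an illustration of the general misclassification correction.

```latex
% One standard misclassification decomposition (the paper's exact
% parameterization may differ): q is the probability that LOCF records a
% success, \pi the probability of a true (unobserved) success,
% Se = P(LOCF success | true success), Sp = P(LOCF failure | true failure).
\[
q = \mathrm{Se}\,\pi + (1 - \mathrm{Sp})(1 - \pi)
\qquad\Longrightarrow\qquad
\pi = \frac{q + \mathrm{Sp} - 1}{\mathrm{Se} + \mathrm{Sp} - 1}.
\]
```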

  18. GACT: a Genome build and Allele definition Conversion Tool for SNP imputation and meta-analysis in genetic association studies.

    Science.gov (United States)

    Sulovari, Arvis; Li, Dawei

    2014-07-19

    Genome-wide association studies (GWAS) have successfully identified genes associated with complex human diseases. Although much of the heritability remains unexplained, combining single nucleotide polymorphism (SNP) genotypes from multiple studies for meta-analysis will increase the statistical power to identify new disease-associated variants. Meta-analysis requires same allele definition (nomenclature) and genome build among individual studies. Similarly, imputation, commonly-used prior to meta-analysis, requires the same consistency. However, the genotypes from various GWAS are generated using different genotyping platforms, arrays or SNP-calling approaches, resulting in use of different genome builds and allele definitions. Incorrect assumptions of identical allele definition among combined GWAS lead to a large portion of discarded genotypes or incorrect association findings. There is no published tool that predicts and converts among all major allele definitions. In this study, we have developed a tool, GACT, which stands for Genome build and Allele definition Conversion Tool, that predicts and inter-converts between any of the common SNP allele definitions and between the major genome builds. In addition, we assessed several factors that may affect imputation quality, and our results indicated that inclusion of singletons in the reference had detrimental effects while ambiguous SNPs had no measurable effect. Unexpectedly, exclusion of genotypes with missing rate > 0.001 (40% of study SNPs) showed no significant decrease of imputation quality (even significantly higher when compared to the imputation with singletons in the reference), especially for rare SNPs. GACT is a new, powerful, and user-friendly tool with both command-line and interactive online versions that can accurately predict, and convert between any of the common allele definitions and between genome builds for genome-wide meta-analysis and imputation of genotypes from SNP-arrays or deep
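
    One of the conversions a tool like GACT automates is strand flipping between allele definitions; the toy sketch below shows only that single step, not GACT's prediction of which definition a dataset uses.

```python
# Tiny sketch of strand flipping between forward- and reverse-strand allele
# definitions (illustrative only; not GACT).
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def flip_strand(a1, a2):
    """Return the alleles re-expressed on the opposite strand."""
    return COMPLEMENT[a1], COMPLEMENT[a2]

print(flip_strand("A", "G"))   # ('T', 'C')
# A/T and C/G ("ambiguous") SNPs cannot be resolved by flipping alone,
# which is why conversion tools compare allele frequencies or drop them.
```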

  19. A Comparison of Item-Level and Scale-Level Multiple Imputation for Questionnaire Batteries

    Science.gov (United States)

    Gottschall, Amanda C.; West, Stephen G.; Enders, Craig K.

    2012-01-01

    Behavioral science researchers routinely use scale scores that sum or average a set of questionnaire items to address their substantive questions. A researcher applying multiple imputation to incomplete questionnaire data can either impute the incomplete items prior to computing scale scores or impute the scale scores directly from other scale…

  20. Water quality and health in a Sahelian semi-arid urban context: an integrated geographical approach in Nouakchott, Mauritania

    Directory of Open Access Journals (Sweden)

    Doulo Traoré

    2013-11-01

    Full Text Available Access to sufficient quantities of safe drinking water is a human right. Moreover, access to clean water is of public health relevance, particularly in semi-arid and Sahelian cities due to the risks of water contamination and transmission of water-borne diseases. We conducted a study in Nouakchott, the capital of Mauritania, to deepen the understanding of diarrhoeal incidence in space and time. We used an integrated geographical approach, combining socio-environmental, microbiological and epidemiological data from various sources, including spatially explicit surveys, laboratory analysis of water samples and reported diarrhoeal episodes. A geospatial technique was applied to determine the environmental and microbiological risk factors that govern diarrhoeal transmission. Statistical and cartographic analyses revealed concentration of unimproved sources of drinking water in the most densely populated areas of the city, coupled with a daily water allocation below the recommended standard of 20 l per person. Bacteriological analysis indicated that 93% of the non-piped water sources supplied at water points were contaminated with 10-80 coliform bacteria per 100 ml. Diarrhoea was the second most important disease reported at health centres, accounting for 12.8% of health care service consultations on average. Diarrhoeal episodes were concentrated in municipalities with the largest number of contaminated water sources. Environmental factors (e.g. lack of improved water sources) and bacteriological aspects (e.g. water contamination with coliform bacteria) are the main drivers explaining the spatio-temporal distribution of diarrhoea. We conclude that integrating environmental, microbiological and epidemiological variables with statistical regression models facilitates risk profiling of diarrhoeal diseases. Modes of water supply and water contamination were the main drivers of diarrhoea in this semi-arid urban context of Nouakchott, and hence require a

  1. Water quality and health in a Sahelian semi-arid urban context: an integrated geographical approach in Nouakchott, Mauritania.

    Science.gov (United States)

    Traoré, Doulo; Sy, Ibrahima; Utzinger, Jürg; Epprecht, Michael; Kengne, Ives M; Lô, Baidy; Odermatt, Peter; Faye, Ousmane; Cissé, Guéladio; Tanner, Marcel

    2013-11-01

    Access to sufficient quantities of safe drinking water is a human right. Moreover, access to clean water is of public health relevance, particularly in semi-arid and Sahelian cities due to the risks of water contamination and transmission of water-borne diseases. We conducted a study in Nouakchott, the capital of Mauritania, to deepen the understanding of diarrhoeal incidence in space and time. We used an integrated geographical approach, combining socio-environmental, microbiological and epidemiological data from various sources, including spatially explicit surveys, laboratory analysis of water samples and reported diarrhoeal episodes. A geospatial technique was applied to determine the environmental and microbiological risk factors that govern diarrhoeal transmission. Statistical and cartographic analyses revealed concentration of unimproved sources of drinking water in the most densely populated areas of the city, coupled with a daily water allocation below the recommended standard of 20 l per person. Bacteriological analysis indicated that 93% of the non-piped water sources supplied at water points were contaminated with 10-80 coliform bacteria per 100 ml. Diarrhoea was the second most important disease reported at health centres, accounting for 12.8% of health care service consultations on average. Diarrhoeal episodes were concentrated in municipalities with the largest number of contaminated water sources. Environmental factors (e.g. lack of improved water sources) and bacteriological aspects (e.g. water contamination with coliform bacteria) are the main drivers explaining the spatio-temporal distribution of diarrhoea. We conclude that integrating environmental, microbiological and epidemiological variables with statistical regression models facilitates risk profiling of diarrhoeal diseases. Modes of water supply and water contamination were the main drivers of diarrhoea in this semi-arid urban context of Nouakchott, and hence require a strategy to improve

  2. Multiple Imputation Strategies for Multiple Group Structural Equation Models

    Science.gov (United States)

    Enders, Craig K.; Gottschall, Amanda C.

    2011-01-01

    Although structural equation modeling software packages use maximum likelihood estimation by default, there are situations where one might prefer to use multiple imputation to handle missing data rather than maximum likelihood estimation (e.g., when incorporating auxiliary variables). The selection of variables is one of the nuances associated…

  3. Synthetic Multiple-Imputation Procedure for Multistage Complex Samples

    Directory of Open Access Journals (Sweden)

    Zhou Hanzhi

    2016-03-01

    Full Text Available Multiple imputation (MI) is commonly used when item-level missing data are present. However, MI requires that survey design information be built into the imputation models. For multistage stratified clustered designs, this requires dummy variables to represent strata as well as primary sampling units (PSUs) nested within each stratum in the imputation model. Such a modeling strategy is not only operationally burdensome but also inferentially inefficient when there are many strata in the sample design. Complexity only increases when sampling weights need to be modeled. This article develops a general-purpose analytic strategy for population inference from complex sample designs with item-level missingness. In a simulation study, the proposed procedures demonstrate efficient estimation and good coverage properties. We also consider an application to accommodate missing body mass index (BMI) data in the analysis of BMI percentiles using National Health and Nutrition Examination Survey (NHANES III) data. We argue that the proposed methods offer an easy-to-implement solution to problems that are not well-handled by current MI techniques. Note that, while the proposed method borrows from the MI framework to develop its inferential methods, it is not designed as an alternative strategy to release multiply imputed datasets for complex sample design data, but rather as an analytic strategy in and of itself.

  4. Multiple Imputation of Predictor Variables Using Generalized Additive Models

    NARCIS (Netherlands)

    de Jong, Roel; van Buuren, Stef; Spiess, Martin

    2016-01-01

    The sensitivity of multiple imputation methods to deviations from their distributional assumptions is investigated using simulations, where the parameters of scientific interest are the coefficients of a linear regression model, and values in predictor variables are missing at random. The performanc

  5. Imputation of the rare HOXB13 G84E mutation and cancer risk in a large population-based cohort.

    Directory of Open Access Journals (Sweden)

    Thomas J Hoffmann

    2015-01-01

    Full Text Available An efficient approach to characterizing the disease burden of rare genetic variants is to impute them into large well-phenotyped cohorts with existing genome-wide genotype data using large sequenced referenced panels. The success of this approach hinges on the accuracy of rare variant imputation, which remains controversial. For example, a recent study suggested that one cannot adequately impute the HOXB13 G84E mutation associated with prostate cancer risk (carrier frequency of 0.0034 in European ancestry participants in the 1000 Genomes Project). We show that by utilizing the 1000 Genomes Project data plus an enriched reference panel of mutation carriers we were able to accurately impute the G84E mutation into a large cohort of 83,285 non-Hispanic White participants from the Kaiser Permanente Research Program on Genes, Environment and Health Genetic Epidemiology Research on Adult Health and Aging cohort. Imputation authenticity was confirmed via a novel classification and regression tree method, and then empirically validated analyzing a subset of these subjects plus an additional 1,789 men from Kaiser specifically genotyped for the G84E mutation (r2 = 0.57, 95% CI = 0.37–0.77). We then show the value of this approach by using the imputed data to investigate the impact of the G84E mutation on age-specific prostate cancer risk and on risk of fourteen other cancers in the cohort. The age-specific risk of prostate cancer among G84E mutation carriers was higher than among non-carriers. Risk estimates from Kaplan-Meier curves were 36.7% versus 13.6% by age 72, and 64.2% versus 24.2% by age 80, for G84E mutation carriers and non-carriers, respectively (p = 3.4×10−12). The G84E mutation was also associated with an increase in risk for the fourteen other most common cancers considered collectively (p = 5.8×10−4) and more so in cases diagnosed with multiple cancer types, both those including and not including prostate cancer, strongly suggesting

  6. Imputation of the Rare HOXB13 G84E Mutation and Cancer Risk in a Large Population-Based Cohort

    Science.gov (United States)

    Hoffmann, Thomas J.; Sakoda, Lori C.; Shen, Ling; Jorgenson, Eric; Habel, Laurel A.; Liu, Jinghua; Kvale, Mark N.; Asgari, Maryam M.; Banda, Yambazi; Corley, Douglas; Kushi, Lawrence H.; Quesenberry, Charles P.; Schaefer, Catherine; Van Den Eeden, Stephen K.; Risch, Neil; Witte, John S.

    2015-01-01

    An efficient approach to characterizing the disease burden of rare genetic variants is to impute them into large well-phenotyped cohorts with existing genome-wide genotype data using large sequenced referenced panels. The success of this approach hinges on the accuracy of rare variant imputation, which remains controversial. For example, a recent study suggested that one cannot adequately impute the HOXB13 G84E mutation associated with prostate cancer risk (carrier frequency of 0.0034 in European ancestry participants in the 1000 Genomes Project). We show that by utilizing the 1000 Genomes Project data plus an enriched reference panel of mutation carriers we were able to accurately impute the G84E mutation into a large cohort of 83,285 non-Hispanic White participants from the Kaiser Permanente Research Program on Genes, Environment and Health Genetic Epidemiology Research on Adult Health and Aging cohort. Imputation authenticity was confirmed via a novel classification and regression tree method, and then empirically validated analyzing a subset of these subjects plus an additional 1,789 men from Kaiser specifically genotyped for the G84E mutation (r2 = 0.57, 95% CI = 0.37−0.77). We then show the value of this approach by using the imputed data to investigate the impact of the G84E mutation on age-specific prostate cancer risk and on risk of fourteen other cancers in the cohort. The age-specific risk of prostate cancer among G84E mutation carriers was higher than among non-carriers. Risk estimates from Kaplan-Meier curves were 36.7% versus 13.6% by age 72, and 64.2% versus 24.2% by age 80, for G84E mutation carriers and non-carriers, respectively (p = 3.4×10−12). The G84E mutation was also associated with an increase in risk for the fourteen other most common cancers considered collectively (p = 5.8×10−4) and more so in cases diagnosed with multiple cancer types, both those including and not including prostate cancer, strongly suggesting pleiotropic effects

  7. Sequence imputation of HPV16 genomes for genetic association studies.

    Directory of Open Access Journals (Sweden)

    Benjamin Smith

    Full Text Available BACKGROUND: Human Papillomavirus type 16 (HPV16) causes over half of all cervical cancer and some HPV16 variants are more oncogenic than others. The genetic basis for the extraordinary oncogenic properties of HPV16 compared to other HPVs is unknown. In addition, we neither know which nucleotides vary across and within HPV types and lineages, nor which of the single nucleotide polymorphisms (SNPs) determine oncogenicity. METHODS: A reference set of 62 HPV16 complete genome sequences was established and used to examine patterns of evolutionary relatedness amongst variants using a pairwise identity heatmap and HPV16 phylogeny. A BLAST-based algorithm was developed to impute complete genome data from partial sequence information using the reference database. To interrogate the oncogenic risk of determined and imputed HPV16 SNPs, odds-ratios for each SNP were calculated in a case-control viral genome-wide association study (VWAS) using biopsy-confirmed high-grade cervix neoplasia and self-limited HPV16 infections from Guanacaste, Costa Rica. RESULTS: HPV16 variants display evolutionarily stable lineages that contain conserved diagnostic SNPs. The imputation algorithm indicated that an average of 97.5±1.03% of SNPs could be accurately imputed. The VWAS revealed specific HPV16 viral SNPs associated with variant lineages and elevated odds ratios; however, individual causal SNPs could not be distinguished with certainty due to the nature of HPV evolution. CONCLUSIONS: Conserved and lineage-specific SNPs can be imputed with a high degree of accuracy from limited viral polymorphic data due to the lack of recombination and the stochastic mechanism of variation accumulation in the HPV genome. However, to determine the role of novel variants or non-lineage-specific SNPs by VWAS will require direct sequence analysis. The investigation of patterns of genetic variation and the identification of diagnostic SNPs for lineages of HPV16 variants provides a valuable
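
    A simplified, hypothetical illustration of imputing missing genome positions from the closest reference sequence; the authors' BLAST-based algorithm is not reproduced, and the position names and alleles below are invented.

```python
# Toy reference-panel imputation: pick the reference genome most similar to
# the partial sequence at observed positions and copy its alleles elsewhere.
partial = {"p100": "A", "p350": None, "p720": "G", "p901": None}
reference_panel = {
    "ref1": {"p100": "A", "p350": "C", "p720": "G", "p901": "T"},
    "ref2": {"p100": "G", "p350": "T", "p720": "G", "p901": "C"},
}

def impute_from_panel(partial_seq, panel):
    observed = {k: v for k, v in partial_seq.items() if v is not None}
    # best reference = most matches at the observed positions
    best = max(panel, key=lambda r: sum(panel[r][k] == v for k, v in observed.items()))
    return {k: (v if v is not None else panel[best][k]) for k, v in partial_seq.items()}

print(impute_from_panel(partial, reference_panel))
# {'p100': 'A', 'p350': 'C', 'p720': 'G', 'p901': 'T'}
```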

  8. Application of imputation methods to genomic selection in Chinese Holstein cattle

    Directory of Open Access Journals (Sweden)

    Weng Ziqing

    2012-02-01

    Full Text Available Abstract Missing genotypes are a common feature of high density SNP datasets obtained using SNP chip technology, and this is likely to decrease the accuracy of genomic selection. This problem can be circumvented by imputing the missing genotypes with estimated genotypes. When implementing imputation, the criteria used for SNP data quality control and whether to perform imputation before or after data quality control need to be considered. In this paper, we compared six strategies of imputation and quality control using different imputation methods, different quality control criteria and by changing the order of imputation and quality control, using a real dataset of milk production traits in Chinese Holstein cattle. The results demonstrated that, no matter what imputation method and quality control criteria were used, strategies with imputation before quality control performed better than strategies with imputation after quality control in terms of accuracy of genomic selection. The different imputation methods and quality control criteria did not significantly influence the accuracy of genomic selection. We concluded that performing imputation before quality control could increase the accuracy of genomic selection, especially when the rate of missing genotypes is high and the reference population is small.

  9. Multiple Imputation with Diagnostics (mi) in R: Opening Windows into the Black Box

    Directory of Open Access Journals (Sweden)

    Yu-Sung Su

    2011-12-01

    Full Text Available Our mi package in R has several features that allow the user to get inside the imputation process and evaluate the reasonableness of the resulting models and imputations. These features include: choice of predictors, models, and transformations for chained imputation models; standard and binned residual plots for checking the fit of the conditional distributions used for imputation; and plots for comparing the distributions of observed and imputed data. In addition, we use Bayesian models and weakly informative prior distributions to construct more stable estimates of imputation models. Our goal is to have a demonstration package that (a) avoids many of the practical problems that arise with existing multivariate imputation programs, and (b) demonstrates state-of-the-art diagnostics that can be applied more generally and can be incorporated into the software of others.

  10. Geographic Tongue

    Science.gov (United States)

    ... cases, most often related to eating hot, spicy, salty or acidic foods Many people with geographic tongue ... sensitive oral tissues, including: Hot, spicy, acidic or salty foods Tobacco products Toothpaste that contains tartar-control ...

  11. Bridging a Survey Redesign Using Multiple Imputation: An Application to the 2014 CPS ASEC

    Directory of Open Access Journals (Sweden)

    Rothbaum Jonathan

    2017-03-01

    Full Text Available The Current Population Survey Annual Social and Economic Supplement (CPS ASEC) serves as the data source for official income, poverty, and inequality statistics in the United States. In 2014, the CPS ASEC questionnaire was redesigned to improve data quality and to reduce misreporting, item nonresponse, and errors resulting from respondent fatigue. The sample was split into two groups, with nearly 70% receiving the traditional instrument and 30% receiving the redesigned instrument. Due to the relatively small redesign sample, analyses of changes in income and poverty between this and future years may lack sufficient power, especially for subgroups. The traditional sample is treated as if the responses were missing for income sources targeted by the redesign, and multiple imputation is used to generate plausible responses. A flexible imputation technique is used to place individuals into strata along two dimensions: (1) their probability of income recipiency and (2) their expected income conditional on recipiency for each income source. By matching on these two dimensions, this approach combines the ideas of propensity score matching and predictive means matching. In this article, this approach is implemented, the matching models are evaluated using diagnostics, and the results are analyzed.
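
    A rough sketch, on simulated data, of the two-dimensional stratification described above (not the Census Bureau's implementation): score each person on predicted recipiency probability and on predicted income given recipiency, bin both scores, and look for donors within cells. All variables are invented.

```python
# Toy two-dimensional stratification combining propensity and predictive
# mean matching ideas; illustrative only.
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(9)
n = 5000
df = pd.DataFrame({"age": rng.integers(18, 80, n), "educ": rng.integers(8, 20, n)})
df["recip"] = rng.binomial(1, 1 / (1 + np.exp(-(-4 + 0.03 * df.age + 0.15 * df.educ))))
df["income"] = np.where(df.recip == 1,
                        15000 + 2500 * df.educ + rng.normal(0, 8000, n), 0)
df["redesign"] = rng.random(n) < 0.3          # 30% got the redesigned instrument

X = df[["age", "educ"]]
p_recip = LogisticRegression(max_iter=1000).fit(X, df.recip).predict_proba(X)[:, 1]
recipients = df.recip == 1
mu_income = LinearRegression().fit(X[recipients], df.income[recipients]).predict(X)

df["p_bin"] = pd.qcut(p_recip, 5, labels=False)
df["mu_bin"] = pd.qcut(mu_income, 5, labels=False)
df["cell"] = df.p_bin.astype(str) + "_" + df.mu_bin.astype(str)

# within each cell, traditional-sample values would be replaced by draws
# from redesign-sample donors; shown here for one cell only
donors = df[(df.cell == df.cell.iloc[0]) & df.redesign]
print(len(donors), "potential donors in the first cell")
```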

  12. A new strategy for enhancing imputation quality of rare variants from next-generation sequencing data via combining SNP and exome chip data

    NARCIS (Netherlands)

    Y.J. Kim (Young Jin); J. Lee (Juyoung); B.-J. Kim (Bong-Jo); T. Park (Taesung); G.R. Abecasis (Gonçalo); M. Almeida (Marcio); D. Altshuler (David); J.L. Asimit (Jennifer L.); G. Atzmon (Gil); M. Barber (Mathew); A. Barzilai (Ari); N.L. Beer (Nicola L.); G.I. Bell (Graeme I.); J. Below (Jennifer); T. Blackwell (Tom); J. Blangero (John); M. Boehnke (Michael); D.W. Bowden (Donald W.); N.P. Burtt (Noël); J.C. Chambers (John); H. Chen (Han); P. Chen (Ping); P.S. Chines (Peter); S. Choi (Sungkyoung); C. Churchhouse (Claire); P. Cingolani (Pablo); B.K. Cornes (Belinda); N.J. Cox (Nancy); A.G. Day-Williams (Aaron); A. Duggirala (Aparna); J. Dupuis (Josée); T. Dyer (Thomas); S. Feng (Shuang); J. Fernandez-Tajes (Juan); T. Ferreira (Teresa); T.E. Fingerlin (Tasha E.); J. Flannick (Jason); J.C. Florez (Jose); P. Fontanillas (Pierre); T.M. Frayling (Timothy); C. Fuchsberger (Christian); E. Gamazon (Eric); K. Gaulton (Kyle); S. Ghosh (Saurabh); B. Glaser (Benjamin); A.L. Gloyn (Anna); R.L. Grossman (Robert L.); J. Grundstad (Jason); C. Hanis (Craig); A. Heath (Allison); H. Highland (Heather); M. Horikoshi (Momoko); I.-S. Huh (Ik-Soo); J.R. Huyghe (Jeroen R.); M.K. Ikram (Kamran); K.A. Jablonski (Kathleen); Y. Jun (Yang); N. Kato (Norihiro); J. Kim (Jayoun); Y.J. Kim (Young Jin); B.-J. Kim (Bong-Jo); J. Lee (Juyoung); C.R. King (C. Ryan); J.S. Kooner (Jaspal S.); M.-S. Kwon (Min-Seok); H.K. Im (Hae Kyung); M. Laakso (Markku); K.K.-Y. Lam (Kevin Koi-Yau); J. Lee (Jaehoon); S. Lee (Selyeong); S. Lee (Sungyoung); D.M. Lehman (Donna M.); H. Li (Heng); C.M. Lindgren (Cecilia); X. Liu (Xuanyao); O.E. Livne (Oren E.); A.E. Locke (Adam E.); A. Mahajan (Anubha); J.B. Maller (Julian B.); A.K. Manning (Alisa K.); T.J. Maxwell (Taylor J.); A. Mazoure (Alexander); M.I. McCarthy (Mark); J.B. Meigs (James B.); B. Min (Byungju); K.L. Mohlke (Karen); A.P. Morris (Andrew); S. Musani (Solomon); Y. Nagai (Yoshihiko); M.C.Y. Ng (Maggie C.Y.); D. Nicolae (Dan); S. Oh (Sohee); N.D. Palmer (Nicholette); T. Park (Taesung); T.I. Pollin (Toni I.); I. Prokopenko (Inga); D. Reich (David); M.A. Rivas (Manuel); L.J. Scott (Laura); M. Seielstad (Mark); Y.S. Cho (Yoon Shin); X. Sim (Xueling); R. Sladek (Rob); P. Smith (Philip); I. Tachmazidou (Ioanna); E.S. Tai (Shyong); Y.Y. Teo (Yik Ying); T.M. Teslovich (Tanya M.); J. Torres (Jason); V. Trubetskoy (Vasily); S.M. Willems (Sara); A.L. Williams (Amy L.); J.G. Wilson (James); S. Wiltshire (Steven); S. Won (Sungho); A.R. Wood (Andrew); W. Xu (Wang); J. Yoon (Joon); M. Zawistowski (Matthew); E. Zeggini (Eleftheria); W. Zhang (Weihua); S. Zöllner (Sebastian)

    2015-01-01

    textabstractBackground: Rare variants have gathered increasing attention as a possible alternative source of missing heritability. Since next generation sequencing technology is not yet cost-effective for large-scale genomic studies, a widely used alternative approach is imputation. However, the imp

  13. Bayesian Approaches to Imputation, Hypothesis Testing, and Parameter Estimation

    Science.gov (United States)

    Ross, Steven J.; Mackey, Beth

    2015-01-01

    This chapter introduces three applications of Bayesian inference to common and novel issues in second language research. After a review of the critiques of conventional hypothesis testing, our focus centers on ways Bayesian inference can be used for dealing with missing data, for testing theory-driven substantive hypotheses without a default null…

  14. Pitfalls of the Geographic Population Structure (GPS) Approach Applied to Human Genetic History: A Case Study of Ashkenazi Jews.

    Science.gov (United States)

    Flegontov, Pavel; Kassian, Alexei; Thomas, Mark G; Fedchenko, Valentina; Changmai, Piya; Starostin, George

    2016-08-16

    In a recent interdisciplinary study, Das et al. have attempted to trace the homeland of Ashkenazi Jews and of their historical language, Yiddish (Das et al. 2016 Localizing Ashkenazic Jews to Primeval Villages in the Ancient Iranian Lands of Ashkenaz. Genome Biol Evol. 8:1132-1149). Das et al. applied the geographic population structure (GPS) method to autosomal genotyping data and inferred geographic coordinates of populations supposedly ancestral to Ashkenazi Jews, placing them in Eastern Turkey. They argued that this unexpected genetic result goes against the widely accepted notion of Ashkenazi origin in the Levant, and speculated that Yiddish was originally a Slavic language strongly influenced by Iranian and Turkic languages, and later remodeled completely under Germanic influence. In our view, there are major conceptual problems with both the genetic and linguistic parts of the work. We argue that GPS is a provenancing tool suited to inferring the geographic region where a modern and recently unadmixed genome is most likely to arise, but is hardly suitable for admixed populations and for tracing ancestry up to 1,000 years before present, as its authors have previously claimed. Moreover, all methods of historical linguistics concur that Yiddish is a Germanic language, with no reliable evidence for Slavic, Iranian, or Turkic substrata.

  15. Accuracy of Range Restriction Correction with Multiple Imputation in Small and Moderate Samples: A Simulation Study

    Directory of Open Access Journals (Sweden)

    Andreas Pfaffel

    2016-09-01

    Full Text Available Approaches to correcting correlation coefficients for range restriction have been developed under the framework of large sample theory. The accuracy of missing data techniques for correcting correlation coefficients for range restriction has thus far only been investigated with relatively large samples. However, researchers and evaluators are often faced with a small or moderate number of applicants but must still attempt to estimate the population correlation between predictor and criterion. Therefore, in the present study we investigated the accuracy of population correlation estimates and their associated standard error in terms of small and moderate sample sizes. We applied multiple imputation by chained equations for continuous and naturally dichotomous criterion variables. The results show that multiple imputation by chained equations is accurate for a continuous criterion variable, even for a small number of applicants when the selection ratio is not too small. In the case of a naturally dichotomous criterion variable, a small or moderate number of applicants leads to biased estimates when the selection ratio is small. In contrast, the standard error of the population correlation estimate is accurate over a wide range of conditions of sample size, selection ratio, true population correlation, for continuous and naturally dichotomous criterion variables, and for direct and indirect range restriction scenarios. The findings of this study provide empirical evidence about the accuracy of the correction, and support researchers and evaluators in their assessment of conditions under which correlation coefficients corrected for range restriction can be trusted.
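
    A hedged sketch of the setup studied above: under direct range restriction the criterion is missing for non-selected applicants, and chained-equations imputation is used to estimate the applicant-pool correlation. A full analysis would create and pool several imputations; one is shown here for brevity, and the data are simulated.

```python
# Range restriction treated as a missing-data problem, with a single
# chained-equations imputation as an illustration.
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(7)
n, rho = 300, 0.5
predictor = rng.normal(size=n)
criterion = rho * predictor + np.sqrt(1 - rho**2) * rng.normal(size=n)

df = pd.DataFrame({"predictor": predictor, "criterion": criterion})
selected = df.predictor > np.quantile(df.predictor, 0.7)   # selection ratio 0.3
df.loc[~selected, "criterion"] = np.nan                    # unobserved for rejects

restricted_r = df.loc[selected].corr().iloc[0, 1]
imp = pd.DataFrame(IterativeImputer(sample_posterior=True, random_state=0)
                   .fit_transform(df), columns=df.columns)
corrected_r = imp.corr().iloc[0, 1]
print(f"restricted r={restricted_r:.2f}, corrected r={corrected_r:.2f}")
```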

  16. FCMPSO: An Imputation for Missing Data Features in Heart Disease Classification

    Science.gov (United States)

    Salleh, Mohd Najib Mohd; Ashikin Samat, Nurul

    2017-08-01

    The application of data mining and machine learning to uncover hidden knowledge in clinical research is becoming greatly influential in medicine. Heart disease is a major cause of death around the world, and early prevention through efficient methods can help to reduce mortality. Medical data may contain many uncertainties, as they are fuzzy and vague in nature. Imprecise feature data, such as null or missing values, can degrade the quality of classification results, although the remaining complete features can still provide useful information. Therefore, an imputation approach based on Fuzzy C-Means and Particle Swarm Optimization (FCMPSO) is developed in the preprocessing stage to help fill in the missing values. The completed dataset is then used to train a Decision Tree classifier. The experiment is conducted on a heart disease dataset and performance is analysed using accuracy, precision, and ROC values. Results show that the performance of the Decision Tree is improved after applying FCMPSO for imputation.
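
    FCMPSO itself is not reproduced here; the sketch below uses a crude k-means centroid fill, only to illustrate the general idea of cluster-driven imputation of missing feature values before classification. The data are simulated.

```python
# Cluster-based fill as a simplified stand-in for fuzzy-clustering imputation:
# fit k-means on complete rows, then replace each missing entry with the
# value from the nearest centroid (distance computed on observed features).
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(8)
X = rng.normal(size=(200, 4))
X[rng.random((200, 4)) < 0.1] = np.nan            # scatter some missing values

complete = X[~np.isnan(X).any(axis=1)]
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(complete)

X_filled = X.copy()
for row in X_filled:
    miss = np.isnan(row)
    if miss.any():
        observed = ~miss
        # nearest centroid using only the observed coordinates
        d = np.linalg.norm(km.cluster_centers_[:, observed] - row[observed], axis=1)
        row[miss] = km.cluster_centers_[d.argmin(), miss]
print(np.isnan(X_filled).sum())  # 0 missing values remain
```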

  17. Whole-Genome Sequencing Coupled to Imputation Discovers Genetic Signals for Anthropometric Traits.

    Science.gov (United States)

    Tachmazidou, Ioanna; Süveges, Dániel; Min, Josine L; Ritchie, Graham R S; Steinberg, Julia; Walter, Klaudia; Iotchkova, Valentina; Schwartzentruber, Jeremy; Huang, Jie; Memari, Yasin; McCarthy, Shane; Crawford, Andrew A; Bombieri, Cristina; Cocca, Massimiliano; Farmaki, Aliki-Eleni; Gaunt, Tom R; Jousilahti, Pekka; Kooijman, Marjolein N; Lehne, Benjamin; Malerba, Giovanni; Männistö, Satu; Matchan, Angela; Medina-Gomez, Carolina; Metrustry, Sarah J; Nag, Abhishek; Ntalla, Ioanna; Paternoster, Lavinia; Rayner, Nigel W; Sala, Cinzia; Scott, William R; Shihab, Hashem A; Southam, Lorraine; St Pourcain, Beate; Traglia, Michela; Trajanoska, Katerina; Zaza, Gialuigi; Zhang, Weihua; Artigas, María S; Bansal, Narinder; Benn, Marianne; Chen, Zhongsheng; Danecek, Petr; Lin, Wei-Yu; Locke, Adam; Luan, Jian'an; Manning, Alisa K; Mulas, Antonella; Sidore, Carlo; Tybjaerg-Hansen, Anne; Varbo, Anette; Zoledziewska, Magdalena; Finan, Chris; Hatzikotoulas, Konstantinos; Hendricks, Audrey E; Kemp, John P; Moayyeri, Alireza; Panoutsopoulou, Kalliope; Szpak, Michal; Wilson, Scott G; Boehnke, Michael; Cucca, Francesco; Di Angelantonio, Emanuele; Langenberg, Claudia; Lindgren, Cecilia; McCarthy, Mark I; Morris, Andrew P; Nordestgaard, Børge G; Scott, Robert A; Tobin, Martin D; Wareham, Nicholas J; Burton, Paul; Chambers, John C; Smith, George Davey; Dedoussis, George; Felix, Janine F; Franco, Oscar H; Gambaro, Giovanni; Gasparini, Paolo; Hammond, Christopher J; Hofman, Albert; Jaddoe, Vincent W V; Kleber, Marcus; Kooner, Jaspal S; Perola, Markus; Relton, Caroline; Ring, Susan M; Rivadeneira, Fernando; Salomaa, Veikko; Spector, Timothy D; Stegle, Oliver; Toniolo, Daniela; Uitterlinden, André G; Barroso, Inês; Greenwood, Celia M T; Perry, John R B; Walker, Brian R; Butterworth, Adam S; Xue, Yali; Durbin, Richard; Small, Kerrin S; Soranzo, Nicole; Timpson, Nicholas J; Zeggini, Eleftheria

    2017-06-01

    Deep sequence-based imputation can enhance the discovery power of genome-wide association studies by assessing previously unexplored variation across the common- and low-frequency spectra. We applied a hybrid whole-genome sequencing (WGS) and deep imputation approach to examine the broader allelic architecture of 12 anthropometric traits associated with height, body mass, and fat distribution in up to 267,616 individuals. We report 106 genome-wide significant signals that have not been previously identified, including 9 low-frequency variants pointing to functional candidates. Of the 106 signals, 6 are in genomic regions that have not been implicated with related traits before, 28 are independent signals at previously reported regions, and 72 represent previously reported signals for a different anthropometric trait. 71% of signals reside within genes and fine mapping resolves 23 signals to one or two likely causal variants. We confirm genetic overlap between human monogenic and polygenic anthropometric traits and find signal enrichment in cis expression QTLs in relevant tissues. Our results highlight the potential of WGS strategies to enhance biologically relevant discoveries across the frequency spectrum. Copyright © 2017 The Authors. Published by Elsevier Inc. All rights reserved.

  18. Genome-wide association analysis of imputed rare variants: application to seven common complex diseases.

    Science.gov (United States)

    Mägi, Reedik; Asimit, Jennifer L; Day-Williams, Aaron G; Zeggini, Eleftheria; Morris, Andrew P

    2012-12-01

    Genome-wide association studies have been successful in identifying loci contributing effects to a range of complex human traits. The majority of reproducible associations within these loci are with common variants, each of modest effect, which together explain only a small proportion of heritability. It has been suggested that much of the unexplained genetic component of complex traits can thus be attributed to rare variation. However, genome-wide association study genotyping chips have been designed primarily to capture common variation, and thus are underpowered to detect the effects of rare variants. Nevertheless, we demonstrate here, by simulation, that imputation from an existing scaffold of genome-wide genotype data up to high-density reference panels has the potential to identify rare variant associations with complex traits, without the need for costly re-sequencing experiments. By application of this approach to genome-wide association studies of seven common complex diseases, imputed up to publicly available reference panels, we identify genome-wide significant evidence of rare variant association in PRDM10 with coronary artery disease and multiple genes in the major histocompatibility complex (MHC) with type 1 diabetes. The results of our analyses highlight that genome-wide association studies have the potential to offer an exciting opportunity for gene discovery through association with rare variants, conceivably leading to substantial advancements in our understanding of the genetic architecture underlying complex human traits.

  19. Weigh-In-Motion Data Checking and Imputation

    OpenAIRE

    Wei, Ting; Fricker, Jon D.

    2003-01-01

    There are about 46 weigh-in-motion (WIM) stations in Indiana. When operating properly, they provide valuable information on traffic volumes, vehicle classifications, and axle weights. Because large amounts of WIM data are collected every day, their quality should be monitored without further delay. The first objective of this study is to develop effective and efficient methods to identify missing or erroneous WIM data. The second objective is to develop a data imputation method...

  20. Is missing geographic positioning system data in accelerometry studies a problem, and is imputation the solution?

    DEFF Research Database (Denmark)

    Meseck, Kristin; Jankowska, Marta M; Schipperijn, Jasper

    2016-01-01

    and viable method for correcting GPS data loss. Accelerometer and GPS data of 782 participants from 8 studies were pooled to represent a range of lifestyles and interactions with the built environment. Periods of GPS signal lapse were identified and extracted. Generalised linear mixed models were run...

  1. Evaluation of Multiple Imputation in Missing Data Analysis: An Application on Repeated Measurement Data in Animal Science

    Directory of Open Access Journals (Sweden)

    Gazel Ser

    2015-12-01

    Full Text Available The purpose of this study was to evaluate the performance of the multiple imputation method, within a general linear mixed model framework, when the missing-observation structure is missing at random or missing completely at random. The data comprised 77 Norduz ram lambs at 7 months of age. The pH values measured at five time points after slaughter were taken as the dependent variable, and hot carcass weight, muscle glycogen level and fasting duration were included as independent variables in the model. Starting from the dependent variable without missing observations, two missing-observation structures, Missing Completely at Random (MCAR) and Missing at Random (MAR), were created by deleting observations at rates of 10% and 25%. Complete data sets were then obtained from the data sets with missing observations using multiple imputation (MI). The results of fitting the general linear mixed model to the MI-completed data sets were compared with the results for the complete data. The mixed models fitted to the complete data and to the MI data sets selected the same covariance structures, and the parameter estimates and their standard errors were very close to those from the complete data. In conclusion, choosing MI as the imputation method for these missing-observation structures and rates yielded reliable mixed-model results.
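
    The two deletion mechanisms used above can be illustrated with a small sketch: values of a simulated repeated-measures outcome are removed either completely at random (MCAR) or with a probability that depends on an observed covariate (MAR). The variable names, rates and distributions are illustrative assumptions only.

        import numpy as np

        rng = np.random.default_rng(1)
        n = 77                                              # sample size taken from the abstract
        hot_carcass_weight = rng.normal(18, 2, n)           # invented covariate values
        ph = rng.normal(6.5, 0.2, n) - 0.02 * (hot_carcass_weight - 18)  # invented outcome

        def delete_mcar(y, rate, rng):
            y = y.copy()
            y[rng.random(len(y)) < rate] = np.nan           # every value equally likely to drop
            return y

        def delete_mar(y, covariate, rate, rng):
            y = y.copy()
            ranks = covariate.argsort().argsort() + 1       # 1..n, higher rank = larger covariate
            p = rate * 2 * ranks / (len(y) + 1)             # loss probability rises with the covariate
            y[rng.random(len(y)) < p] = np.nan
            return y

        ph_mcar10 = delete_mcar(ph, 0.10, rng)
        ph_mar25 = delete_mar(ph, hot_carcass_weight, 0.25, rng)
        print(np.isnan(ph_mcar10).mean(), np.isnan(ph_mar25).mean())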

  2. Explicating the Conditions Under Which Multilevel Multiple Imputation Mitigates Bias Resulting from Random Coefficient-Dependent Missing Longitudinal Data.

    Science.gov (United States)

    Gottfredson, Nisha C; Sterba, Sonya K; Jackson, Kristina M

    2017-01-01

    Random coefficient-dependent (RCD) missingness is a non-ignorable mechanism through which missing data can arise in longitudinal designs. RCD missingness, which cannot be tested for, is a problematic form of missingness that occurs if subject-specific random effects correlate with the propensity for missingness or dropout. Particularly when covariate missingness is a problem, investigators typically handle missing longitudinal data by using single-level multiple imputation procedures implemented with long-format data, which ignores within-person dependency entirely, or implemented with wide-format (i.e., multivariate) data, which ignores some aspects of within-person dependency. When either of these standard approaches to handling missing longitudinal data is used, RCD missingness leads to parameter bias and incorrect inference. We explain why multilevel multiple imputation (MMI) should alleviate bias induced by an RCD missing data mechanism under conditions that contribute to stronger determinacy of random coefficients. We evaluate our hypothesis with a simulation study. Three design factors are considered: intraclass correlation (ICC; ranging from .25 to .75), number of waves (ranging from 4 to 8), and percent of missing data (ranging from 20 to 50%). We find that MMI greatly outperforms the single-level wide-format (multivariate) method for imputation under an RCD mechanism. For the MMI analyses, bias was most alleviated when the ICC was high, when there were more waves of data, and when there was less missing data. Practical recommendations for handling longitudinal missing data are suggested.

  3. A fuzzy approach to a multiple criteria and geographical information system for decision support on suitable locations for biogas plants

    DEFF Research Database (Denmark)

    Franco de los Rios, Camilo Andres; Bojesen, Mikkel; Hougaard, Jens Leth

    The purpose of this paper is to model the multi-criteria decision problem of identifying the most suitable facility locations for biogas plants under an integrated decision support methodology. Here the Geographical Information System (GIS) is used for measuring the attributes of the alternatives...... can also be successfully applied over the outcomes of different decision makers, in case a unique social solution is required to exist. The proposed methodology can be used under an integrated decision support frame for identifying the most suitable locations for biogas facilities, taking into account...

  4. Missing value imputation in multi-environment trials: Reconsidering the Krzanowski method

    Directory of Open Access Journals (Sweden)

    Sergio Arciniegas-Alarcón

    2016-07-01

    Full Text Available We propose a new methodology for multiple imputation when faced with missing data in multi-environment trials with genotype-by-environment interaction, based on the imputation system developed by Krzanowski that uses the singular value decomposition (SVD) of a matrix. Several different iterative variants are described; differential weights can also be included in each variant to represent the influence of different components of the SVD in the imputation process. The methods are compared through a simulation study based on three real data matrices that have values deleted randomly at different percentages, using as a measure of overall accuracy a combination of the variance between imputations and their mean squared deviations relative to the deleted values. The best results are shown by two of the iterative schemes that use weights in the interval [0.75, 1]. These schemes provide imputations of higher quality than other multiple imputation methods based on the Krzanowski method.
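
    A minimal sketch of the iterative SVD imputation that the Krzanowski-based schemes build on is given below: missing cells start at their column means and are repeatedly overwritten with a low-rank SVD reconstruction. The weighting variants and the multiple-imputation layer described in the abstract are not reproduced, and the genotype-by-environment table is invented.

        import numpy as np

        def svd_impute(X, n_components=2, n_iter=100, tol=1e-6):
            X = X.astype(float)
            miss = np.isnan(X)
            filled = np.where(miss, np.nanmean(X, axis=0), X)    # start from column means
            for _ in range(n_iter):
                U, s, Vt = np.linalg.svd(filled, full_matrices=False)
                approx = (U[:, :n_components] * s[:n_components]) @ Vt[:n_components]
                new = np.where(miss, approx, X)                  # observed cells stay fixed
                if np.max(np.abs(new - filled)) < tol:
                    return new
                filled = new
            return filled

        # invented genotype-by-environment yield table with two cells deleted
        Y = np.array([[4.1, 3.8, np.nan, 5.0],
                      [3.5, np.nan, 4.4, 4.6],
                      [5.2, 4.9, 5.6, 6.1],
                      [2.8, 2.5, 3.1, 3.3]])
        print(svd_impute(Y).round(2))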

  5. Mapping and modelling the geographical distribution of soil-transmitted helminthiases in Peninsular Malaysia: implications for control approaches

    Directory of Open Access Journals (Sweden)

    Romano Ngui

    2014-05-01

    Full Text Available Soil-transmitted helminth (STH) infections in Malaysia are still highly prevalent, especially in rural and remote communities. A complete estimate of the total disease burden in the country has not been produced, since the available data are not easily accessible in the public domain. The current study used a geographical information system (GIS) to collate and map the distribution of STH infections from available empirical survey data in Peninsular Malaysia, highlighting areas where information is lacking. The assembled database, comprising surveys conducted between 1970 and 2012 in 99 different locations, represents one of the most comprehensive compilations of STH infections in the country. The geographical distribution of STH was found to vary considerably, with no clear pattern across the surveyed locations. Our attempt to generate predictive risk maps of STH infections on the basis of ecological limits such as climate and other environmental factors shows that the prevalence of Ascaris lumbricoides is low along the western coast and in the southern part of the country, whilst prevalence is high in the central plains and in the north. In the present study, we demonstrate that GIS can play an important role in providing data for the implementation of sustainable and effective STH control programmes to policy-makers and authorities in charge.

  6. Socio-environmental determinants of the leptospirosis outbreak of 1996 in western Rio de Janeiro: a geographical approach.

    Science.gov (United States)

    Barcellos, C; Sabroza, P C

    2000-12-01

    The environmental and social context in which a leptospirosis outbreak took place during the summer of 1996 in the Rio de Janeiro Western Region was examined by using spatial analysis of leptospirosis cases merged with population and environmental data in a Geographical Information System (GIS). Important differences were observed between places where residences of leptospirosis cases are concentrated and other places in the region. Water supply coverage, solid waste collection, sewerage system coverage and flood risk area were the main determining variables from an initial list of ten. The influence of these unfavorable social and environmental factors is verified hundreds of meters distant from the leptospirosis case residences, demonstrating a necessity to broaden the area of health surveillance practices. The geocoding indicated that some cases did not report contact with flood water, even though they were geographically adjacent to cases who did report this contact. Cases may only report exposures they believe are related to the disease. Geocoding is a useful tool for evaluating such bias in the exposure recall.

  7. Methods and Strategies to Impute Missing Genotypes for Improving Genomic Prediction

    DEFF Research Database (Denmark)

    Ma, Peipei

    Genomic prediction has been widely used in dairy cattle breeding. Genotype imputation is a key procedure to efficently utilize marker data from different chips and obtain high density marker data with minimizing cost. This thesis investigated methods and strategies to genotype imputation for impr......Genomic prediction has been widely used in dairy cattle breeding. Genotype imputation is a key procedure to efficently utilize marker data from different chips and obtain high density marker data with minimizing cost. This thesis investigated methods and strategies to genotype imputation...

  8. Assessing and comparison of different machine learning methods in parent-offspring trios for genotype imputation.

    Science.gov (United States)

    Mikhchi, Abbas; Honarvar, Mahmood; Kashan, Nasser Emam Jomeh; Aminafshar, Mehdi

    2016-06-21

    Genotype imputation is an important tool for predicting unknown genotypes for both unrelated individuals and parent-offspring trios. Several imputation methods are available, which either employ general-purpose machine learning methods or deploy algorithms dedicated to inferring missing genotypes. In this research the performance of eight machine learning methods (Support Vector Machine, K-Nearest Neighbors, Extreme Learning Machine, Radial Basis Function, Random Forest, AdaBoost, LogitBoost, and TotalBoost) was compared in terms of imputation accuracy, computation time, and the factors affecting imputation accuracy. The methods were applied to real and simulated datasets to impute un-typed SNPs in parent-offspring trios. The results show that imputation of parent-offspring trios can be accurate. Random Forest and Support Vector Machine were more accurate than the other machine learning methods, while TotalBoost performed slightly worse than the others. Running times differed between methods: ELM was always the fastest algorithm, whereas RBF required long imputation times as the sample size increased. The tested methods can be an alternative for imputing un-typed SNPs when the rate of missing data is low; however, it is recommended that other machine learning methods also be evaluated for imputation.
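
    One way to picture the machine-learning formulation compared above is to treat a single un-typed SNP as a prediction target; the sketch below trains a Random Forest on neighbouring typed SNPs in a reference set and predicts the missing genotypes in a study set. The toy genotype matrix and the simple two-SNP dependence are assumptions for illustration, not the paper's trio data.

        import numpy as np
        from sklearn.ensemble import RandomForestClassifier

        rng = np.random.default_rng(2)
        n_ref, n_study, n_flanking = 500, 50, 20
        flank_ref = rng.integers(0, 3, size=(n_ref, n_flanking))    # genotypes coded 0/1/2 (toy)
        # the un-typed SNP depends on two flanking SNPs, a crude stand-in for LD
        target_snp = np.clip(flank_ref[:, 9] + flank_ref[:, 10] - 1, 0, 2)
        flank_study = rng.integers(0, 3, size=(n_study, n_flanking))

        model = RandomForestClassifier(n_estimators=200, random_state=0)
        model.fit(flank_ref, target_snp)                             # train on the typed reference set
        imputed = model.predict(flank_study)                         # impute the un-typed SNP
        print(imputed[:10])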

  9. Imputation and quality control steps for combining multiple genome-wide datasets

    Directory of Open Access Journals (Sweden)

    Shefali S Verma

    2014-12-01

    Full Text Available The electronic MEdical Records and GEnomics (eMERGE) network brings together DNA biobanks linked to electronic health records (EHRs) from multiple institutions. Approximately 52,000 DNA samples from distinct individuals have been genotyped using genome-wide SNP arrays across the nine sites of the network. The eMERGE Coordinating Center and the Genomics Workgroup developed a pipeline to impute and merge genomic data across the different SNP arrays to maximize sample size and power to detect associations with a variety of clinical endpoints. The 1000 Genomes cosmopolitan reference panel was used for imputation. Imputation results were evaluated using the following metrics: accuracy of imputation, allelic R2 (estimated correlation between the imputed and true genotypes), and the relationship between allelic R2 and minor allele frequency. Computation time and memory resources required by two different software packages (BEAGLE and IMPUTE2) were also evaluated. A number of challenges were encountered due to the complexity of using two different imputation software packages, multiple ancestral populations, and many different genotyping platforms. We present lessons learned and describe the pipeline implemented here to impute and merge genomic data sets. The eMERGE imputed dataset will serve as a valuable resource for discovery, leveraging the clinical data that can be mined from the EHR.
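
    The evaluation metrics named above can be computed directly from masked genotypes; the short sketch below shows hard-call concordance and allelic R2 (the squared correlation between imputed dosages and true genotypes) on invented toy values.

        import numpy as np

        true_genotypes = np.array([0, 1, 2, 1, 0, 2, 1, 0, 1, 2])                      # masked truth (toy)
        imputed_dosage = np.array([0.1, 1.2, 1.8, 0.9, 0.0, 2.0, 1.4, 0.2, 0.8, 1.9])  # imputed dosages (toy)

        best_guess = np.rint(imputed_dosage)                   # hard-call genotypes
        concordance = np.mean(best_guess == true_genotypes)

        r = np.corrcoef(imputed_dosage, true_genotypes)[0, 1]
        allelic_r2 = r ** 2                                    # squared imputed-vs-true correlation

        print(f"concordance = {concordance:.2f}, allelic R^2 = {allelic_r2:.2f}")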

  10. A fuzzy approach to a multiple criteria and geographical information system for decision support on suitable locations for biogas plants

    DEFF Research Database (Denmark)

    Franco de los Rios, Camilo Andres; Bojesen, Mikkel; Hougaard, Jens Leth

    The purpose of this paper is to model the multi-criteria decision problem of identifying the most suitable facility locations for biogas plants under an integrated decision support methodology. Here the Geographical Information System (GIS) is used for measuring the attributes of the alternatives....... The estimation of criteria weights, which is necessary for applying the FWOD procedure, is done by means of the Analytical Hierarchy Process (AHP), such that a combined AHP-FWOD methodology allows identifying the more suitable sites for building biogas plants. We show that the FWOD relevance-ranking procedure...... can also be successfully applied over the outcomes of different decision makers, in case a unique social solution is required to exist. The proposed methodology can be used under an integrated decision support frame for identifying the most suitable locations for biogas facilities, taking into account...

  11. A fuzzy approach to a multiple criteria and Geographical Information System for decision support on suitable locations for biogas plants

    DEFF Research Database (Denmark)

    Franco, Camilo; Bojesen, Mikkel; Hougaard, Jens Leth

    2015-01-01

    The purpose of this paper is to model the multi-criteria decision problem of identifying the most suitable facility locations for biogas plants under an integrated decision support methodology. Here the Geographical Information System (GIS) is used for measuring the attributes of the alternatives...... suitable sites for building biogas plants. We show that the FWOD relevance-ranking procedure can also be successfully applied over the outcomes of different decision makers, in case a unique social solution is required to exist. The proposed methodology can be used under an integrated decision support...... frame for identifying the most suitable locations for biogas facilities, taking into account the most relevant criteria for the social, economic and political dimensions....

  12. Geographic variation in species richness, rarity, and the selection of areas for conservation: An integrative approach with Brazilian estuarine fishes

    Science.gov (United States)

    Vilar, Ciro C.; Joyeux, Jean-Christophe; Spach, Henry L.

    2017-09-01

    While the number of species is a key indicator of ecological assemblages, spatial conservation priorities solely identified from species richness are not necessarily efficient to protect other important biological assets. Hence, the results of spatial prioritization analysis would be greatly enhanced if richness were used in association to complementary biodiversity measures. In this study, geographic patterns in estuarine fish species rarity (i.e. the average range size in the study area), endemism and richness, were mapped and integrated to identify regions important for biodiversity conservation along the Brazilian coast. Furthermore, we analyzed the effectiveness of the national system of protected areas to represent these regions. Analyses were performed on presence/absence data of 412 fish species in 0.25° latitudinal bands covering the entire Brazilian biogeographical province. Species richness, rarity and endemism patterns differed and strongly reflected biogeographical limits and regions. However, among the existing 154 latitudinal bands, 48 were recognized as conservation priorities by concomitantly harboring high estuarine fish species richness and assemblages of geographically rare species. Priority areas identified for all estuarine fish species largely differed from those identified for Brazilian endemics. Moreover, there was no significant correlation between the different aspects of the fish assemblages considered (i.e. species richness, endemism or rarity), suggesting that designating reserves based on a single variable may lead to large gaps in the overall protection of biodiversity. Our results further revealed that the existing system of protected areas is insufficient for representing the priority bands we identified. This highlights the urgent need for expanding the national network of protected areas to maintain estuarine ecosystems with high conservation value.

  13. GST M1-T1 null allele frequency patterns in geographically assorted human populations: a phylogenetic approach.

    Science.gov (United States)

    Kasthurinaidu, Senthilkumar Pitchalu; Ramasamy, Thirumurugan; Ayyavoo, Jayachitra; Dave, Dhvani Kirtikumar; Adroja, Divya Anantray

    2015-01-01

    Genetic diversity in drug metabolism and disposition is mainly considered an outcome of inter-individual genetic variation in polymorphisms of drug- and xenobiotic-metabolizing enzymes (XMEs). Among the XMEs, the glutathione-S-transferase (GST) gene loci are important candidates for investigating diversity in allele frequency, as deletion mutations in the GST M1 and T1 genotypes are associated with various cancers and genetic disorders across all major Population Affiliations (PAs). The present population-based phylogenetic study therefore aimed to uncover the frequency distribution patterns of GST M1 and T1 null genotypes among 45 Geographically Assorted Human Populations (GAHPs). The frequency distribution patterns for GST M1 and T1 null alleles were determined using data derived from the literature for 44 populations affiliated to Africa, Asia, Europe and South America, together with the genome of a PA from Gujarat, a region in western India. Allele frequency counting for the Gujarat PA and scatter plot analysis of the geographical distribution among the PAs were performed in SPSS-21. The GST M1 and GST T1 null allele frequency patterns of the PAs were computed with the Seqboot and Gendist programs of the Phylip software package (version 3.69) and the Unweighted Pair Group method with Arithmetic Mean in Mega-6 software. Allele frequencies from the South African Xhosa tribe, East African Zimbabwe, East African Ethiopia, North African Egypt, Caucasian, South Asian Afghanistan and South Indian Andhra Pradesh populations were identified as the probable seven patterns among the 45 GAHPs investigated for the GST M1-T1 null genotypes. The patternized null allele frequencies demonstrated in this study for the first time address the missing link in GST M1-T1 null allele frequencies among GAHPs.

  14. GST M1-T1 null allele frequency patterns in geographically assorted human populations: a phylogenetic approach.

    Directory of Open Access Journals (Sweden)

    Senthilkumar Pitchalu Kasthurinaidu

    Full Text Available Genetic diversity in drug metabolism and disposition is mainly considered an outcome of inter-individual genetic variation in polymorphisms of drug- and xenobiotic-metabolizing enzymes (XMEs). Among the XMEs, the glutathione-S-transferase (GST) gene loci are important candidates for investigating diversity in allele frequency, as deletion mutations in the GST M1 and T1 genotypes are associated with various cancers and genetic disorders across all major Population Affiliations (PAs). The present population-based phylogenetic study therefore aimed to uncover the frequency distribution patterns of GST M1 and T1 null genotypes among 45 Geographically Assorted Human Populations (GAHPs). The frequency distribution patterns for GST M1 and T1 null alleles were determined using data derived from the literature for 44 populations affiliated to Africa, Asia, Europe and South America, together with the genome of a PA from Gujarat, a region in western India. Allele frequency counting for the Gujarat PA and scatter plot analysis of the geographical distribution among the PAs were performed in SPSS-21. The GST M1 and GST T1 null allele frequency patterns of the PAs were computed with the Seqboot and Gendist programs of the Phylip software package (version 3.69) and the Unweighted Pair Group method with Arithmetic Mean in Mega-6 software. Allele frequencies from the South African Xhosa tribe, East African Zimbabwe, East African Ethiopia, North African Egypt, Caucasian, South Asian Afghanistan and South Indian Andhra Pradesh populations were identified as the probable seven patterns among the 45 GAHPs investigated for the GST M1-T1 null genotypes. The patternized null allele frequencies demonstrated in this study for the first time address the missing link in GST M1-T1 null allele frequencies among GAHPs.

  15. Assessment of genotype imputation performance using 1000 Genomes in African American studies.

    Directory of Open Access Journals (Sweden)

    Dana B Hancock

    Full Text Available Genotype imputation, used in genome-wide association studies to expand coverage of single nucleotide polymorphisms (SNPs), has performed poorly in African Americans compared to less admixed populations. Overall, imputation has typically relied on HapMap reference haplotype panels from Africans (YRI), European Americans (CEU), and Asians (CHB/JPT). The 1000 Genomes project offers a wider range of reference populations, such as African Americans (ASW), but their imputation performance has had limited evaluation. Using 595 African Americans genotyped on Illumina's HumanHap550v3 BeadChip, we compared imputation results from four software programs (IMPUTE2, BEAGLE, MaCH, and MaCH-Admix) and three reference panels consisting of different combinations of 1000 Genomes populations (February 2012 release): (1) 3 specifically selected populations (YRI, CEU, and ASW); (2) 8 populations of diverse African (AFR) or European (EUR) descent; and (3) all 14 available populations (ALL). Based on chromosome 22, we calculated three performance metrics: (1) concordance (percentage of masked genotyped SNPs with imputed and true genotype agreement); (2) imputation quality score (IQS; concordance adjusted for chance agreement, which is particularly informative for low minor allele frequency [MAF] SNPs); and (3) average r2hat (estimated correlation between the imputed and true genotypes) for all imputed SNPs. Across the reference panels, IMPUTE2 and MaCH had the highest concordance (91%-93%), but IMPUTE2 had the highest IQS (81%-83%) and average r2hat (0.68 using YRI+ASW+CEU, 0.62 using AFR+EUR, and 0.55 using ALL). Imputation quality for most programs was reduced by the addition of more distantly related reference populations, due entirely to the introduction of low frequency SNPs (MAF≤2%) that are monomorphic in the more closely related panels. While imputation was optimized by using IMPUTE2 with reference to the ALL panel (average r2hat = 0.86 for SNPs with MAF>2%), use of the ALL
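
    The imputation quality score used above is essentially concordance adjusted for chance agreement (a kappa-type statistic); a minimal sketch with invented genotype vectors follows, showing how a low-MAF SNP can have high raw concordance but a much lower IQS.

        import numpy as np

        def iqs(true_g, imputed_g, n_classes=3):
            true_g, imputed_g = np.asarray(true_g), np.asarray(imputed_g)
            p_obs = np.mean(true_g == imputed_g)               # raw concordance
            p_chance = sum(np.mean(true_g == k) * np.mean(imputed_g == k)
                           for k in range(n_classes))          # agreement expected by chance
            return (p_obs - p_chance) / (1 - p_chance)

        # toy low-MAF SNP: mostly homozygous reference, so chance agreement is already high
        true_g    = [0, 0, 0, 0, 0, 0, 0, 1, 1, 2]
        imputed_g = [0, 0, 0, 0, 0, 0, 0, 0, 1, 2]
        print(f"concordance = {np.mean(np.array(true_g) == np.array(imputed_g)):.2f}")
        print(f"IQS         = {iqs(true_g, imputed_g):.2f}")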

  16. Distribution based nearest neighbor imputation for truncated high dimensional data with applications to pre-clinical and clinical metabolomics studies.

    Science.gov (United States)

    Shah, Jasmit S; Rai, Shesh N; DeFilippis, Andrew P; Hill, Bradford G; Bhatnagar, Aruni; Brock, Guy N

    2017-02-20

    High throughput metabolomics makes it possible to measure the relative abundances of numerous metabolites in biological samples, which is useful to many areas of biomedical research. However, missing values (MVs) in metabolomics datasets are common and can arise due to both technical and biological reasons. Typically, such MVs are substituted by a minimum value, which may lead to different results in downstream analyses. Here we present a modified version of the K-nearest neighbor (KNN) approach which accounts for truncation at the minimum value, i.e., KNN truncation (KNN-TN). We compare imputation results based on KNN-TN with results from other KNN approaches such as KNN based on correlation (KNN-CR) and KNN based on Euclidean distance (KNN-EU). Our approach assumes that the data follow a truncated normal distribution with the truncation point at the detection limit (LOD). The effectiveness of each approach was analyzed by the root mean square error (RMSE) measure as well as the metabolite list concordance index (MLCI) for influence on downstream statistical testing. Through extensive simulation studies and application to three real data sets, we show that KNN-TN has lower RMSE values compared to the other two KNN procedures as well as simpler imputation methods based on substituting missing values with the metabolite mean, zero values, or the LOD. MLCI values between KNN-TN and KNN-EU were roughly equivalent, and superior to the other four methods in most cases. Our findings demonstrate that KNN-TN generally has improved performance in imputing the missing values of the different datasets compared to KNN-CR and KNN-EU when there is missingness due to missing at random combined with an LOD. The results shown in this study are in the field of metabolomics but this method could be applicable with any high throughput technology which has missing due to LOD.
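
    For orientation, the sketch below implements the plain Euclidean-distance KNN imputation (the KNN-EU baseline above): each missing abundance is replaced by the average of that metabolite in the K samples closest over the shared observed metabolites. The truncated-normal standardisation that distinguishes KNN-TN is not reproduced, and the matrix and K are toy assumptions.

        import numpy as np

        def knn_impute(X, k=2):
            X = X.astype(float)
            out = X.copy()
            for i in range(X.shape[0]):
                miss = np.isnan(X[i])
                if not miss.any():
                    continue
                dists = []
                for j in range(X.shape[0]):
                    if j == i:
                        continue
                    shared = ~np.isnan(X[i]) & ~np.isnan(X[j])   # compare on shared observed metabolites
                    if shared.any():
                        dists.append((np.linalg.norm(X[i, shared] - X[j, shared]), j))
                neighbours = [j for _, j in sorted(dists)[:k]]
                for col in np.where(miss)[0]:
                    vals = X[neighbours, col]
                    vals = vals[~np.isnan(vals)]
                    if vals.size:
                        out[i, col] = vals.mean()                # neighbour average
            return out

        # toy log-abundance matrix (samples x metabolites) with two missing values
        X = np.array([[10.2,  5.1, np.nan],
                      [ 9.8,  5.0,  3.9],
                      [10.5,  5.3,  4.2],
                      [ 2.1, np.nan, 1.0]])
        print(knn_impute(X, k=2).round(2))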

  17. Geographical Tattoos

    Directory of Open Access Journals (Sweden)

    Valéria Cazetta

    2014-08-01

    Full Text Available The article deals with maps tattooed on bodies. My interest in studying corporeality is part of a broader project entitled Geographies and (in) Bodies. There is a considerable body of published research on tattoos, but none specifically about tattooed maps. Some of these works nevertheless interested me because they present important contemporary discussions of body modification, which helped me situate body modification within culture rather than nature. For this article, I examined pictures of geographical tattoos available on several internet sites.

  18. Water quality analysis of the commercial boreholes in Mubi Metropolis, Adamawa State, Nigeria: geographic information system approach.

    Science.gov (United States)

    Mayomi, Ikusemoran; Elisha, Ibrahim

    2011-12-01

    It is observed that most of the commercial boreholes in Mubi Metropolis are located along River Yedzeram, the main river that runs across the town. Unfortunately, owing to the town's location in the savanna region with minimal water supply, water-related small-scale industries such as sachet water production, block making, irrigation agriculture, cloth dyeing and car washing, as well as other polluting activities such as mechanical workshops and public toilets, are also located along the River Yedzeram. Moreover, the inhabitants of the town either dump their refuse in the river or spread it on their farmlands, as the government provides no refuse dump site. Therefore, five parameters (nitrate, magnesium, copper, calcium and iron) were used to test the water quality of samples collected from twenty-two commercial boreholes along the river, using the World Health Organization standard examination of water and wastewater to determine the water quality of the boreholes. The study revealed that only eight of the twenty-two boreholes are of good quality, while the others are either of bad quality or not potable. ArcGIS 9.2 and ILWIS 3.3 software were used to analyze the laboratory results through SQL queries. It is recommended that the government provide potable water, establish a water quality control board and make use of GIS for database creation and analysis.

  19. Geographic information systems: a useful tool to approach African swine fever surveillance management of wild pig populations.

    Science.gov (United States)

    Rolesu, Sandro; Aloi, Daniela; Ghironi, Annalisa; Oggiano, Nicolino; Oggiano, Annalisa; Puggioni, Giantonella; Patta, Cristiana; Farina, Salvatore; Montinaro, Salvatore

    2007-01-01

    The epidemiological surveillance of African swine fever in wild pig populations requires the collection of numerous samples of biological material for virological and serological testing from each animal killed during the hunting season. The number of samples must be sufficient to demonstrate the absence of the disease at a prevalence level of 5% (with a confidence level of 95%) in the area under observation. Since the type of territory suitable for maintaining wild pig populations and its precise location can be identified, it is possible to pinpoint specific areas within Sardinia where organised sampling is undertaken. The test results are used to estimate the prevalence of the disease in the wild pig population in the place of origin. Areas were identified using geographic information system technology with support from field maps. The correct localisation of seropositivity has led to the redefinition of high-risk areas for African swine fever. Results from the outbreaks and from the surveillance of the wild pig population have confirmed the decreasing role of the wild boar in maintaining the disease.

  20. Estimating Lifetime Costs of Social Care: A Bayesian Approach Using Linked Administrative Datasets from Three Geographical Areas.

    Science.gov (United States)

    Steventon, Adam; Roberts, Adam

    2015-12-01

    We estimated lifetime costs of publicly funded social care, covering services such as residential and nursing care homes, domiciliary care and meals. Like previous studies, we constructed microsimulation models. However, our transition probabilities were estimated from longitudinal, linked administrative health and social care datasets, rather than from survey data. Administrative data were obtained from three geographical areas of England, and we estimated transition probabilities in each of these sites flexibly using Bayesian methods. This allowed us to quantify regional variation as well as the impact of structural and parameter uncertainty regarding the transition probabilities. Expected lifetime costs at age 65 were £20,200-27,000 for men and £38,700-49,000 for women, depending on which of the three areas was used to calibrate the model. Thus, patterns of social care spending differed markedly between areas, with mean costs varying by almost £10,000 (25%) across the lifetime for people of the same age and gender. Allowing for structural and parameter uncertainty had little impact on expected lifetime costs, but slightly increased the risk of very high costs, which will have implications for insurance products for social care through increasing requirements for capital reserves.

  1. Development of an Antarctic digital elevation model by integrating cartographic and remotely sensed data: A geographic information system based approach

    Science.gov (United States)

    Liu, Hongxing; Jezek, Kenneth C.; Li, Biyan

    1999-10-01

    We present a high-resolution digital elevation model (DEM) of the Antarctic. It was created in a geographic information system (GIS) environment by integrating the best available topographic data from a variety of sources. Extensive GIS-based error detection and correction operations ensured that our DEM is free of gross errors. The carefully designed interpolation algorithms for different types of source data and incorporation of surface morphologic information preserved and enhanced the fine surface structures present in the source data. The effective control of adverse edge effects and the use of the Hermite blending weight function in data merging minimized the discontinuities between different types of data, leading to a seamless and topographically consistent DEM throughout the Antarctic. This new DEM provides exceptional topographical details and represents a substantial improvement in horizontal resolution and vertical accuracy over the earlier, continental-scale renditions, particularly in mountainous and coastal regions. It has a horizontal resolution of 200 m over the rugged mountains, 400 m in the coastal regions, and approximately 5 km in the interior. The vertical accuracy of the DEM is estimated at about 100-130 m over the rugged mountainous area, better than 2 m for the ice shelves, better than 15 m for the interior ice sheet, and about 35 m for the steeper ice sheet perimeter. The Antarctic DEM can be obtained from the authors.

  2. The Local Food Environment and Fruit and Vegetable Intake: A Geographically Weighted Regression Approach in the ORiEL Study.

    Science.gov (United States)

    Clary, Christelle; Lewis, Daniel J; Flint, Ellen; Smith, Neil R; Kestens, Yan; Cummins, Steven

    2016-12-01

    Studies that explore associations between the local food environment and diet routinely use global regression models, which assume that relationships are invariant across space, yet such stationarity assumptions have been little tested. We used global and geographically weighted regression models to explore associations between the residential food environment and fruit and vegetable intake. Analyses were performed in 4 boroughs of London, United Kingdom, using data collected between April 2012 and July 2012 from 969 adults in the Olympic Regeneration in East London Study. Exposures were assessed both as absolute densities of healthy and unhealthy outlets, taken separately, and as a relative measure (proportion of total outlets classified as healthy). Overall, local models performed better than global models (lower Akaike information criterion). Locally estimated coefficients varied across space, regardless of the type of exposure measure, although changes of sign were observed only when absolute measures were used. Although global models showed significant associations between the relative measure and fruit and vegetable intake (β = 0.022; P < 0.05), the locally varying coefficients point to a spatially non-stationary relationship between the food environment and diet. This further challenges the idea that a single measure of exposure, whether relative or absolute, can reflect the many ways the food environment may shape health behaviors. © The Author 2016. Published by Oxford University Press on behalf of the Johns Hopkins Bloomberg School of Public Health. All rights reserved. For permissions, please e-mail: journals.permissions@oup.com.
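
    A bare-bones version of a geographically weighted regression can be written as a locally weighted least-squares fit at each location, with Gaussian kernel weights that decay with distance, so the exposure coefficient is allowed to vary over space. The coordinates, bandwidth and simulated intake data below are illustrative assumptions; published analyses typically rely on dedicated GWR software with calibrated bandwidths.

        import numpy as np

        rng = np.random.default_rng(3)
        n = 200
        coords = rng.uniform(0, 10, size=(n, 2))                   # participant locations (toy)
        healthy_share = rng.uniform(0, 1, n)                       # relative exposure measure (toy)
        local_beta = 0.5 + 0.1 * coords[:, 0]                      # true effect varies west to east
        intake = 2 + local_beta * healthy_share + rng.normal(0, 0.3, n)

        X = np.column_stack([np.ones(n), healthy_share])
        bandwidth = 2.0                                            # assumed kernel bandwidth
        betas = np.empty((n, 2))
        for i in range(n):
            d = np.linalg.norm(coords - coords[i], axis=1)
            w = np.exp(-(d / bandwidth) ** 2)                      # Gaussian kernel weights
            Xw = X * w[:, None]
            betas[i] = np.linalg.solve(X.T @ Xw, Xw.T @ intake)    # local weighted least squares

        print(f"local slope range: {betas[:, 1].min():.2f} to {betas[:, 1].max():.2f}")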

  3. Global and Mexican analytical review of the state of the art on Ecosystem and Environmental services: A geographical approach

    Directory of Open Access Journals (Sweden)

    Maria Perevochtchikova

    2013-12-01

    Full Text Available The term Ecosystem Services (ES) was introduced in the Rio Declaration in 1992, within a strong international movement for sustainable natural resource management. The innovative principle at the time concerned the environmental functions that maintain life-support systems, such as pollination, oxygen production, temperature regulation, and water storage, filtering and distribution, which had previously been taken for granted until human action put them at risk. The first compensation schemes for Environmental Services were proposed in 1997 as one of the tools of the new environmental policy directed towards the principles of sustainable development. Since then, the topic of ES has received a remarkable global response, reflected in the implementation of payment programmes and in the development of research in many countries worldwide. This paper analyses the state of the art of research carried out so far on ES and Environmental Services from global and Mexican perspectives. It is based on a review of 1,781 scientific papers published in international peer-reviewed journals between 1992 and 2012. Furthermore, the present study provides a geographical overview of the main ES topics studied and of the relative output of papers per region, country or state. Results are finally presented and discussed in the light of their shortcomings and of the challenges ahead.

  4. Adaptive Cartography and Geographical Education

    Science.gov (United States)

    Konecny, Milan; Stanek, Karel

    2010-01-01

    The article focuses on adaptive cartography and its potential for geographical education. After briefly describing the wider context of adaptive cartography, it is suggested that this new cartographic approach establishes new demands and benefits for geographical education, especially in offering the possibility for broader individual…

  5. Comparison of different methods for imputing genome-wide marker genotypes in Swedish and Finnish Red Cattle

    DEFF Research Database (Denmark)

    Ma, Peipei; Brøndum, Rasmus Froberg; Qin, Zahng

    2013-01-01

    This study investigated the imputation accuracy of different methods, considering both the minor allele frequency and relatedness between individuals in the reference and test data sets. Two data sets from the combined population of Swedish and Finnish Red Cattle were used to test the influence...... of these factors on the accuracy of imputation. Data set 1 consisted of 2,931 reference bulls and 971 test bulls, and was used for validation of imputation from 3,000 markers (3K) to 54,000 markers (54K). Data set 2 contained 341 bulls in the reference set and 117 in the test set, and was used for validation...... of imputation from 54K to high density [777,000 markers (777K)]. Both test sets were divided into 4 groups according to their relationship to the reference population. Five imputation methods (Beagle, IMPUTE2, findhap, AlphaImpute, and FImpute) were used in this study. Imputation accuracy was measured...

  6. Consequences of Splitting Sequencing Effort over Multiple Breeds on Imputation Accuracy

    NARCIS (Netherlands)

    Bouwman, A.C.; Veerkamp, R.F.

    2014-01-01

    Imputation from a high-density SNP panel (777k) to whole-genome sequence with a reference population of 20 Holstein resulted in an average imputation accuracy of 0.70, and increased to 0.83 when the reference population was increased by including 3 other dairy breeds with 20 animals each. When the

  7. Variable selection for multiply-imputed data with application to dioxin exposure study.

    Science.gov (United States)

    Chen, Qixuan; Wang, Sijian

    2013-09-20

    Multiple imputation (MI) is a commonly used technique for handling missing data in large-scale medical and public health studies. However, variable selection on multiply-imputed data remains an important and longstanding statistical problem. If a variable selection method is applied to each imputed dataset separately, it may select different variables for different imputed datasets, which makes it difficult to interpret the final model or draw scientific conclusions. In this paper, we propose a novel multiple imputation-least absolute shrinkage and selection operator (MI-LASSO) variable selection method as an extension of the least absolute shrinkage and selection operator (LASSO) method to multiply-imputed data. The MI-LASSO method treats the estimated regression coefficients of the same variable across all imputed datasets as a group and applies the group LASSO penalty to yield a consistent variable selection across multiple-imputed datasets. We use a simulation study to demonstrate the advantage of the MI-LASSO method compared with the alternatives. We also apply the MI-LASSO method to the University of Michigan Dioxin Exposure Study to identify important circumstances and exposure factors that are associated with human serum dioxin concentration in Midland, Michigan.
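
    The group-penalty idea behind MI-LASSO can be sketched with a bare proximal-gradient group lasso in which the coefficients of the same covariate across all M imputed datasets form one group, so a covariate is kept or dropped in every dataset at once. The toy datasets, penalty value and optimiser below are illustrative assumptions, not the authors' implementation.

        import numpy as np

        def mi_group_lasso(Xs, ys, lam=60.0, n_iter=2000):
            M = len(Xs)
            n, p = Xs[0].shape
            B = np.zeros((p, M))                                   # column m = coefficients for imputation m
            step = 1.0 / max(np.linalg.norm(X, 2) ** 2 for X in Xs)   # safe gradient step size
            for _ in range(n_iter):
                grad = np.column_stack([Xs[m].T @ (Xs[m] @ B[:, m] - ys[m]) for m in range(M)])
                B = B - step * grad
                norms = np.linalg.norm(B, axis=1, keepdims=True)       # one norm per covariate group
                B = np.maximum(0.0, 1.0 - step * lam / np.maximum(norms, 1e-12)) * B  # block soft-threshold
            return B

        rng = np.random.default_rng(4)
        n, p, M = 100, 6, 5
        true_beta = np.array([1.5, 0.0, -1.0, 0.0, 0.0, 0.0])
        Xs, ys = [], []
        for _ in range(M):                                         # M "imputed" datasets (toy perturbations)
            X = rng.normal(size=(n, p))
            Xs.append(X)
            ys.append(X @ true_beta + rng.normal(0, 0.5, n))

        B = mi_group_lasso(Xs, ys)
        kept = np.where(np.linalg.norm(B, axis=1) > 1e-6)[0]
        print("covariates kept in every imputed dataset:", kept)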

  8. Taking don't knows as valid responses: a multiple complete random imputation of missing data

    NARCIS (Netherlands)

    Kroh, Martin

    2006-01-01

    Incomplete data is a common problem in survey research. Recent work on multiple imputation techniques has increased analysts' awareness of the biasing effects of missing data and has also provided a convenient solution. Imputation methods replace non-response with estimates of the unobserved scores.

  9. A Method for Imputing Response Options for Missing Data on Multiple-Choice Assessments

    Science.gov (United States)

    Wolkowitz, Amanda A.; Skorupski, William P.

    2013-01-01

    When missing values are present in item response data, there are a number of ways one might impute a correct or incorrect response to a multiple-choice item. There are significantly fewer methods for imputing the actual response option an examinee may have provided if he or she had not omitted the item either purposely or accidentally. This…

  10. Estimation of missing rainfall data using spatial interpolation and imputation methods

    Science.gov (United States)

    Radi, Noor Fadhilah Ahmad; Zakaria, Roslinazairimah; Azman, Muhammad Az-zuhri

    2015-02-01

    This study aims to estimate missing rainfall data, with the analysis divided into three different percentages of missing values, namely 5%, 10% and 20%, to represent various cases of missing data. In practice, spatial interpolation methods are usually the first choice for estimating missing data. These methods include the normal ratio (NR), arithmetic average (AA), coefficient of correlation (CC) and inverse distance (ID) weighting methods, which consider the distances between the target and the neighbouring stations as well as the correlations between them. An alternative for handling missing data is an imputation method, i.e., a process of replacing missing data with substituted values. A once-common approach is single imputation, which allows parameter estimation but ignores the estimation of variability, leading to underestimated standard errors and confidence intervals. To overcome this problem, multiple imputation is used, where each missing value is estimated with a distribution of imputations that reflects the uncertainty about the missing data. In this study, spatial interpolation methods and the multiple imputation method are compared for estimating missing rainfall data. The performance of the estimation methods is assessed using the similarity index (S-index), mean absolute error (MAE) and coefficient of correlation (R).
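
    As a small illustration of one of the spatial interpolation estimators listed above, the sketch below applies inverse distance weighting: a missing monthly rainfall value is estimated from neighbouring stations, weighted by inverse squared distance. The station coordinates and rainfall figures are invented.

        import numpy as np

        def idw_estimate(target_xy, neighbour_xy, neighbour_rain, power=2):
            d = np.linalg.norm(neighbour_xy - target_xy, axis=1)
            w = 1.0 / d ** power                                   # nearer stations get more weight
            return np.sum(w * neighbour_rain) / np.sum(w)

        target_xy = np.array([103.5, 3.8])                         # station with the missing value (toy)
        neighbour_xy = np.array([[103.2, 3.9],
                                 [103.9, 3.7],
                                 [103.6, 4.2]])
        neighbour_rain = np.array([210.0, 185.0, 240.0])           # same-month totals at neighbours (toy)

        print(f"estimated rainfall: {idw_estimate(target_xy, neighbour_xy, neighbour_rain):.1f} mm")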

  11. Performance of genotype imputation for rare variants identified in exons and flanking regions of genes.

    Directory of Open Access Journals (Sweden)

    Li Li

    Full Text Available Genotype imputation has the potential to assess human genetic variation at a lower cost than assaying the variants using laboratory techniques. The performance of imputation for rare variants has not been comprehensively studied. We utilized 8865 human samples with high-depth resequencing data for the exons and flanking regions of 202 genes, together with Genome-Wide Association Study (GWAS) data, to characterize the performance of genotype imputation for rare variants. We evaluated reference sets ranging from 100 to 3713 subjects for imputing into samples typed for the Affymetrix (500K and 6.0) and Illumina 550K GWAS panels. The proportion of variants that could be well imputed (true r2 > 0.7) with a reference panel of 3713 individuals was 31% (Illumina 550K) or 25% (Affymetrix 500K) for variants with MAF (minor allele frequency) of 0.001 or less, and 48% or 35% for variants in the 0.001-0.05 MAF range. The performance for common SNPs (MAF > 0.05) within exons and flanking regions is comparable to imputation of more uniformly distributed SNPs, whereas performance for rarer SNPs (MAF between 0.01 and 0.05) is lower. These results support the use of imputation for extending the assessment of common variants identified in humans via targeted exon resequencing into additional samples with GWAS data, but imputation of very rare variants (MAF ≤ 0.005) will require reference panels with thousands of subjects.

  12. 48 CFR 1830.7002-4 - Determining imputed cost of money.

    Science.gov (United States)

    2010-10-01

    ... money. 1830.7002-4 Section 1830.7002-4 Federal Acquisition Regulations System NATIONAL AERONAUTICS AND... Determining imputed cost of money. (a) Determine the imputed cost of money for an asset under construction, fabrication, or development by applying a cost of money rate (see 1830.7002-2) to the...

  13. 5 CFR 919.630 - May the OPM impute conduct of one person to another?

    Science.gov (United States)

    2010-01-01

    ... 5 Administrative Personnel 2 2010-01-01 2010-01-01 false May the OPM impute conduct of one person to another? 919.630 Section 919.630 Administrative Personnel OFFICE OF PERSONNEL MANAGEMENT...) General Principles Relating to Suspension and Debarment Actions § 919.630 May the OPM impute conduct...

  14. A novel approach to find and optimize bin locations and collection routes using a geographic information system.

    Science.gov (United States)

    Erfani, Seyed Mohammad Hassan; Danesh, Shahnaz; Karrabi, Seyed Mohsen; Shad, Rouzbeh

    2017-07-01

    One of the major challenges in big cities is planning and implementation of an optimized, integrated solid waste management system. This optimization is crucial if environmental problems are to be prevented and the expenses to be reduced. A solid waste management system consists of many stages including collection, transfer and disposal. In this research, an integrated model was proposed and used to optimize two functional elements of municipal solid waste management (storage and collection systems) in the Ahmadabad neighbourhood located in the City of Mashhad - Iran. The integrated model was performed by modelling and solving the location allocation problem and capacitated vehicle routing problem (CVRP) through Geographic Information Systems (GIS). The results showed that the current collection system is not efficient owing to its incompatibility with the existing urban structure and population distribution. Application of the proposed model could significantly improve the storage and collection system. Based on the results of minimizing facilities analyses, scenarios with 100, 150 and 180 m walking distance were considered to find optimal bin locations for Alamdasht, C-metri and Koohsangi. The total number of daily collection tours was reduced to seven as compared to the eight tours carried out in the current system (12.50% reduction). In addition, the total number of required crews was minimized and reduced by 41.70% (24 crews in the current collection system vs 14 in the system provided by the model). The total collection vehicle routing was also optimized such that the total travelled distances during night and day working shifts was cut back by 53%.

  15. An approach for land suitability evaluation using geostatistics, remote sensing, and geographic information system in arid and semiarid ecosystems.

    Science.gov (United States)

    Emadi, Mostafa; Baghernejad, Majid; Pakparvar, Mojtaba; Kowsar, Sayyed Ahang

    2010-05-01

    This study was undertaken to combine geostatistics, remote sensing, and geographic information system (GIS) technologies to improve qualitative land suitability assessment in the arid and semiarid ecosystems of the Arsanjan plain, southern Iran. The primary data were obtained from 85 soil samples collected at three depths (0-30, 30-60, and 60-90 cm); the secondary information was acquired from remotely sensed data from the linear imaging self-scanner (LISS-III) receiver of the IRS-P6 satellite. Ordinary kriging and simple kriging with varying local means (SKVLM) were used to identify the spatial dependency of important soil parameters. Using the spectral values of band 1 of the LISS-III receiver as the secondary variable, the SKVLM method gave the lowest mean square error for mapping pH and electrical conductivity (ECe) in the 0-30-cm depth. For the other soil properties, which showed moderate to strong spatial dependency in the study area, ordinary kriging gave reliable accuracy for interpolation at unsampled points. The parametric land suitability evaluation was then applied to a dense grid of points (150 x 150 m) rather than, as is conventional, to a limited number of representative profiles, using the values obtained by the kriging or SKVLM methods. The information layers were overlaid in the GIS to prepare the final land suitability evaluation, so that changes in land characteristics could be identified within the same uniform soil mapping unit over very short distances. In general, this new method can readily present the areas and limiting factors of the different land suitability classes with considerable accuracy using arbitrary land indices.

  16. Evaluation of Multi-parameter Test Statistics for Multiple Imputation.

    Science.gov (United States)

    Liu, Yu; Enders, Craig K

    2017-01-01

    In ordinary least squares regression, researchers often want to know whether a set of parameters is different from zero. With complete data, this can be achieved using the gain-in-prediction test, hierarchical multiple regression, or an omnibus F test. However, in substantive research scenarios missing data often exist. In the context of multiple imputation, one of the current state-of-the-art missing data strategies, there are several analogous multi-parameter tests of the joint significance of a set of parameters, and these multi-parameter test statistics can be referenced to various distributions to make statistical inferences. However, little is known about the performance of these tests, and virtually no research has compared their Type 1 error rates and statistical power in scenarios typical of behavioral science data (e.g., small to moderate samples). This paper uses Monte Carlo simulation techniques to examine the performance of these multi-parameter test statistics for multiple imputation under a variety of realistic conditions. We provide a number of practical recommendations for substantive researchers based on the simulation results, and illustrate the calculation of these test statistics with an empirical example.

  17. Imputation of KIR Types from SNP Variation Data.

    Science.gov (United States)

    Vukcevic, Damjan; Traherne, James A; Næss, Sigrid; Ellinghaus, Eva; Kamatani, Yoichiro; Dilthey, Alexander; Lathrop, Mark; Karlsen, Tom H; Franke, Andre; Moffatt, Miriam; Cookson, William; Trowsdale, John; McVean, Gil; Sawcer, Stephen; Leslie, Stephen

    2015-10-01

    Large population studies of immune system genes are essential for characterizing their role in diseases, including autoimmune conditions. Of key interest are a group of genes encoding the killer cell immunoglobulin-like receptors (KIRs), which have known and hypothesized roles in autoimmune diseases, resistance to viruses, reproductive conditions, and cancer. These genes are highly polymorphic, which makes typing expensive and time consuming. Consequently, despite their importance, KIRs have been little studied in large cohorts. Statistical imputation methods developed for other complex loci (e.g., human leukocyte antigen [HLA]) on the basis of SNP data provide an inexpensive high-throughput alternative to direct laboratory typing of these loci and have enabled important findings and insights for many diseases. We present KIR∗IMP, a method for imputation of KIR copy number. We show that KIR∗IMP is highly accurate and thus allows the study of KIRs in large cohorts and enables detailed investigation of the role of KIRs in human disease.

  18. [Imputing missing data in public health: general concepts and application to dichotomous variables].

    Science.gov (United States)

    Hernández, Gilma; Moriña, David; Navarro, Albert

    The presence of missing data in collected variables is common in health surveys, but imputing them at the time of analysis is not. Working with imputed data can improve the precision of estimators and the unbiased identification of associations between variables. The imputation process is probably still poorly understood by many non-statisticians, who view it as highly complex and with an uncertain goal. To clarify these questions, this note provides a straightforward, non-exhaustive overview of the imputation process, in the context of the dichotomous variables that are commonplace in public health, to enable public health researchers to appreciate its strengths. To illustrate these concepts, an example in which missing data are handled by means of simple and multiple imputation is introduced. Copyright © 2017 SESPAS. Published by Elsevier España, S.L.U. All rights reserved.
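
    The contrast between single and multiple imputation of a dichotomous variable can be sketched as follows: a logistic model fitted to the complete cases supplies predicted probabilities, and each of M imputations draws the missing 0/1 values from those probabilities rather than fixing a single value. The simulated survey variables are assumptions, and drawing only the outcomes (without redrawing model parameters) makes this a simplified, improper form of multiple imputation.

        import numpy as np
        from sklearn.linear_model import LogisticRegression

        rng = np.random.default_rng(5)
        n = 1000
        age = rng.uniform(18, 80, n)                           # invented survey covariate
        p_true = 1 / (1 + np.exp(-(-3 + 0.04 * age)))          # smoking probability rises with age
        smoker = (rng.random(n) < p_true).astype(int)
        observed = rng.random(n) > 0.3                         # ~30% of smoking status missing (toy MCAR)

        model = LogisticRegression().fit(age[observed].reshape(-1, 1), smoker[observed])
        p_miss = model.predict_proba(age[~observed].reshape(-1, 1))[:, 1]

        M = 20
        estimates = []
        for _ in range(M):
            completed = smoker.astype(float).copy()
            completed[~observed] = (rng.random((~observed).sum()) < p_miss).astype(float)  # stochastic draw
            estimates.append(completed.mean())                 # prevalence in the completed data

        print(f"complete-case prevalence: {smoker[observed].mean():.3f}")
        print(f"multiply imputed        : {np.mean(estimates):.3f} (SD across imputations {np.std(estimates):.3f})")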

  19. Data supporting the high-accuracy haplotype imputation using unphased genotype data as the references

    Directory of Open Access Journals (Sweden)

    Wenzhi Li

    2016-09-01

    Full Text Available The data presented in this article are related to the research article entitled “High-accuracy haplotype imputation using unphased genotype data as the references”, which reports that unphased genotype data can be used as a reference for haplotype imputation [1]. This article reports the pipeline used to generate the different implementations, the results of performance comparisons between the implementations (A, B, and C) and between HiFi and three major imputation software tools. Our data showed that the performances of the three implementations are similar in accuracy, with implementation B slightly but consistently more accurate than A and C. HiFi performed better on haplotype imputation accuracy, while the three other software tools performed slightly better on genotype imputation accuracy. These data may provide a strategy for choosing an optimal phasing pipeline and software for different studies.

  20. Sustainable Growth in Urbanised Delta Areas: the Opportunities of a Geographical Approach to the Pearl River Delta

    NARCIS (Netherlands)

    Van Rens, G.; Nillisen, A.L.; Schamhart, C.; Lugt, N.

    2006-01-01

    The attractions of delta areas have boosted economies and given rise to major cities, but the threats of the adjacent water have persisted and natural resources have declined. The objective of facilitating sustainable urban growth in delta areas can only be met by a simultaneous approach involving all the stakeholders.

  1. Combining multiple imputation and meta-analysis with individual participant data.

    Science.gov (United States)

    Burgess, Stephen; White, Ian R; Resche-Rigon, Matthieu; Wood, Angela M

    2013-11-20

    Multiple imputation is a strategy for the analysis of incomplete data such that the impact of the missingness on the power and bias of estimates is mitigated. When data from multiple studies are collated, we can propose both within-study and multilevel imputation models to impute missing data on covariates. It is not clear how to choose between imputation models or how to combine imputation and inverse-variance weighted meta-analysis methods. This is especially important as often different studies measure data on different variables, meaning that we may need to impute data on a variable which is systematically missing in a particular study. In this paper, we consider a simulation analysis of sporadically missing data in a single covariate with a linear analysis model and discuss how the results would be applicable to the case of systematically missing data. We find in this context that ensuring the congeniality of the imputation and analysis models is important to give correct standard errors and confidence intervals. For example, if the analysis model allows between-study heterogeneity of a parameter, then we should incorporate this heterogeneity into the imputation model to maintain the congeniality of the two models. In an inverse-variance weighted meta-analysis, we should impute missing data and apply Rubin's rules at the study level prior to meta-analysis, rather than meta-analyzing each of the multiple imputations and then combining the meta-analysis estimates using Rubin's rules. We illustrate the results using data from the Emerging Risk Factors Collaboration.
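
    The ordering recommended above (apply Rubin's rules within each study, then combine the study-level results by inverse-variance weighting) can be sketched as follows; the functions are generic illustrations, not the authors' code.

```python
import numpy as np

def rubin_pool(estimates, variances):
    """Rubin's rules for a scalar parameter across m imputations."""
    m = len(estimates)
    q_bar = np.mean(estimates)
    u_bar = np.mean(variances)                      # within-imputation variance
    b = np.var(estimates, ddof=1)                   # between-imputation variance
    t = u_bar + (1 + 1 / m) * b                     # total variance
    return q_bar, t

def fixed_effect_meta(study_estimates, study_variances):
    """Inverse-variance weighted fixed-effect meta-analysis."""
    w = 1 / np.asarray(study_variances, dtype=float)
    est = np.sum(w * np.asarray(study_estimates, dtype=float)) / np.sum(w)
    var = 1 / np.sum(w)
    return est, var

# Recommended order: pool the m imputations *within* each study first,
# then meta-analyse the study-level results.
# per_study = [(estimates_study1, variances_study1), ...]   # from the MI analyses
# pooled = [rubin_pool(e, v) for e, v in per_study]
# meta_est, meta_var = fixed_effect_meta([p[0] for p in pooled], [p[1] for p in pooled])
```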

  2. Effect of Genome-Wide Genotyping and Reference Panels on Rare Variants Imputation

    Institute of Scientific and Technical Information of China (English)

    Hou-Feng Zheng; Martin Ladouceur; Celia M.T. Greenwood; J.Brent Richards

    2012-01-01

    Common variants explain little of the variance of most common diseases, prompting large-scale sequencing studies to understand the contribution of rare variants to these diseases. Imputation of rare variants from genome-wide genotypic arrays offers a cost-efficient strategy to achieve the necessary sample sizes required for adequate statistical power. To estimate the performance of imputation of rare variants, we imputed 153 individuals, each of whom was genotyped on 3 different genotype arrays including 317k, 610k and 1 million single nucleotide polymorphisms (SNPs), to two different reference panels: HapMap2 and the 1000 Genomes pilot March 2010 release (1KGpilot), using IMPUTE version 2. We found that more than 94% and 84% of all SNPs yield acceptable accuracy (info > 0.4) in HapMap2- and 1KGpilot-based imputation, respectively. For rare variants (minor allele frequency (MAF) ≤ 5%), the proportion of well-imputed SNPs increased as the MAF increased from 0.3% to 5% across all 3 genome-wide association study (GWAS) datasets. The proportion of well-imputed SNPs was 69%, 60% and 49% for SNPs with a MAF from 0.3% to 5% for the 1M, 610k and 317k arrays, respectively. None of the very rare variants (MAF ≤ 0.3%) were well imputed. We conclude that the imputation accuracy of rare variants increases with higher density of genome-wide genotyping arrays when the size of the reference panel is small. Variants with lower MAF are more difficult to impute. These findings have important implications for the design and replication of large-scale sequencing studies.
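
    As a small illustration of the post-imputation screening described above, the sketch below tabulates the proportion of variants exceeding the info > 0.4 threshold within MAF bins; the data frame contents and column names are invented for illustration (in practice they would be parsed from the per-SNP info files produced by the imputation software).

```python
import pandas as pd

# Hypothetical per-SNP imputation summary; columns are assumptions for illustration.
snps = pd.DataFrame({
    "maf":  [0.002, 0.01, 0.03, 0.08, 0.25],
    "info": [0.21, 0.45, 0.62, 0.88, 0.97],
})

well_imputed = snps["info"] > 0.4            # the acceptability threshold used above
bins = pd.cut(snps["maf"], [0, 0.003, 0.05, 0.5],
              labels=["very rare (<=0.3%)", "rare (0.3-5%)", "common (>5%)"])
summary = well_imputed.groupby(bins).mean()  # proportion of well-imputed SNPs per MAF bin
print(summary)
```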

  3. Insights into Diversity and Imputed Metabolic Potential of Bacterial Communities in the Continental Shelf of Agatti Island.

    Science.gov (United States)

    Kumbhare, Shreyas V; Dhotre, Dhiraj P; Dhar, Sunil Kumar; Jani, Kunal; Apte, Deepak A; Shouche, Yogesh S; Sharma, Avinash

    2015-01-01

    Marine microbes play a key role in, and contribute largely to, the global biogeochemical cycles. This study aims to explore microbial diversity from one such ecological hotspot, the continental shelf of Agatti Island. Sediment samples from various depths of the continental shelf were analyzed for bacterial diversity using deep sequencing technology along with a culturable approach. Additionally, an imputed metagenomic approach was carried out to understand the functional aspects of the microbial community, especially microbial genes important in nutrient uptake, survival and biogeochemical cycling in the marine environment. Using the culturable approach, 28 bacterial strains representing 9 genera were isolated from various depths of the continental shelf. The microbial community structure throughout the samples was dominated by the phylum Proteobacteria and harbored various bacterioplankton as well. Significant differences were observed in bacterial diversity within a short region of the continental shelf (1-40 meters), i.e. between upper continental shelf samples (UCS) at lesser depths (1-20 meters) and lower continental shelf samples (LCS) at greater depths (25-40 meters). Using the imputed metagenomic approach, this study also discusses several adaptive mechanisms which enable microbes to survive in nutritionally deprived conditions, and which help to understand the influence of nutrient availability on bacterial diversity.

  4. Imputation-based meta-analysis of severe malaria in three African populations.

    Directory of Open Access Journals (Sweden)

    Gavin Band

    2013-05-01

    Full Text Available Combining data from genome-wide association studies (GWAS) conducted at different locations, using genotype imputation and fixed-effects meta-analysis, has been a powerful approach for dissecting complex disease genetics in populations of European ancestry. Here we investigate the feasibility of applying the same approach in Africa, where genetic diversity, both within and between populations, is far more extensive. We analyse genome-wide data from approximately 5,000 individuals with severe malaria and 7,000 population controls from three different locations in Africa. Our results show that the standard approach is well powered to detect known malaria susceptibility loci when sample sizes are large, and that modern methods for association analysis can control the potential confounding effects of population structure. We show that the pattern of association around the haemoglobin S allele differs substantially across populations due to differences in haplotype structure. Motivated by these observations we consider new approaches to association analysis that might prove valuable for multicentre GWAS in Africa: we relax the assumptions of SNP-based fixed effect analysis; we apply Bayesian approaches to allow for heterogeneity in the effect of an allele on risk across studies; and we introduce a region-based test to allow for heterogeneity in the location of causal alleles.

  5. Imputation-Based Meta-Analysis of Severe Malaria in Three African Populations

    Science.gov (United States)

    Band, Gavin; Le, Quang Si; Jostins, Luke; Pirinen, Matti; Kivinen, Katja; Jallow, Muminatou; Sisay-Joof, Fatoumatta; Bojang, Kalifa; Pinder, Margaret; Sirugo, Giorgio; Conway, David J.; Nyirongo, Vysaul; Kachala, David; Molyneux, Malcolm; Taylor, Terrie; Ndila, Carolyne; Peshu, Norbert; Marsh, Kevin; Williams, Thomas N.; Alcock, Daniel; Andrews, Robert; Edkins, Sarah; Gray, Emma; Hubbart, Christina; Jeffreys, Anna; Rowlands, Kate; Schuldt, Kathrin; Clark, Taane G.; Small, Kerrin S.; Teo, Yik Ying; Kwiatkowski, Dominic P.; Rockett, Kirk A.; Barrett, Jeffrey C.; Spencer, Chris C. A.

    2013-01-01

    Combining data from genome-wide association studies (GWAS) conducted at different locations, using genotype imputation and fixed-effects meta-analysis, has been a powerful approach for dissecting complex disease genetics in populations of European ancestry. Here we investigate the feasibility of applying the same approach in Africa, where genetic diversity, both within and between populations, is far more extensive. We analyse genome-wide data from approximately 5,000 individuals with severe malaria and 7,000 population controls from three different locations in Africa. Our results show that the standard approach is well powered to detect known malaria susceptibility loci when sample sizes are large, and that modern methods for association analysis can control the potential confounding effects of population structure. We show that the pattern of association around the haemoglobin S allele differs substantially across populations due to differences in haplotype structure. Motivated by these observations we consider new approaches to association analysis that might prove valuable for multicentre GWAS in Africa: we relax the assumptions of SNP-based fixed effect analysis; we apply Bayesian approaches to allow for heterogeneity in the effect of an allele on risk across studies; and we introduce a region-based test to allow for heterogeneity in the location of causal alleles. PMID:23717212

  6. An Overview and Evaluation of Recent Machine Learning Imputation Methods Using Cardiac Imaging Data.

    Science.gov (United States)

    Liu, Yuzhe; Gopalakrishnan, Vanathi

    2017-03-01

    Many clinical research datasets have a large percentage of missing values that directly impacts their usefulness in yielding high accuracy classifiers when used for training in supervised machine learning. While missing value imputation methods have been shown to work well with smaller percentages of missing values, their ability to impute sparse clinical research data can be problem specific. We previously attempted to learn quantitative guidelines for ordering cardiac magnetic resonance imaging during the evaluation for pediatric cardiomyopathy, but missing data significantly reduced our usable sample size. In this work, we sought to determine if increasing the usable sample size through imputation would allow us to learn better guidelines. We first review several machine learning methods for estimating missing data. Then, we apply four popular methods (mean imputation, decision tree, k-nearest neighbors, and self-organizing maps) to a clinical research dataset of pediatric patients undergoing evaluation for cardiomyopathy. Using Bayesian Rule Learning (BRL) to learn ruleset models, we compared the performance of imputation-augmented models versus unaugmented models. We found that all four imputation-augmented models performed similarly to unaugmented models. While imputation did not improve performance, it did provide evidence for the robustness of our learned models.
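
    Two of the four imputation methods compared above, mean imputation and k-nearest neighbors, are readily available in scikit-learn, as sketched below; the toy matrix is invented, and the decision-tree and self-organizing-map imputers used in the study are not part of scikit-learn and are therefore omitted.

```python
import numpy as np
from sklearn.impute import KNNImputer, SimpleImputer

# Toy clinical feature matrix with missing entries (values are illustrative only).
X = np.array([[38.0, 5.2, np.nan],
              [41.0, np.nan, 1.1],
              [np.nan, 4.8, 0.9],
              [36.0, 5.0, 1.3]])

mean_imputed = SimpleImputer(strategy="mean").fit_transform(X)   # column-mean imputation
knn_imputed = KNNImputer(n_neighbors=2).fit_transform(X)         # k-nearest-neighbor imputation

# In a study like the one above, each completed matrix would then be used to train
# the classifier of interest and compared against the model trained on unimputed data.
```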

  7. Multiple imputation to evaluate the impact of an assay change in national surveys.

    Science.gov (United States)

    Sternberg, Maya

    2017-07-30

    National health surveys, such as the National Health and Nutrition Examination Survey, are used to monitor trends of nutritional biomarkers. These surveys try to maintain the same biomarker assay over time, but there are a variety of reasons why the assay may change. In these cases, it is important to evaluate the potential impact of a change so that any observed fluctuations in concentrations over time are not confounded by changes in the assay. To this end, a subset of stored specimens previously analyzed with the old assay is retested using the new assay. These paired data are used to estimate an adjustment equation, which is then used to 'adjust' all the old assay results and convert them into 'equivalent' units of the new assay. In this paper, we present a new way of approaching this problem using modern statistical methods designed for missing data. Using simulations, we compare the proposed multiple imputation approach with the adjustment equation approach currently in use. We also compare these approaches using real National Health and Nutrition Examination Survey data for 25-hydroxyvitamin D. Published 2017. This article is a U.S. Government work and is in the public domain in the USA.
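
    The contrast between the two approaches can be sketched as follows, under the simplifying assumption of a linear relationship between assays; all numbers are simulated, and a full multiple-imputation analysis would additionally draw the regression coefficients from their posterior rather than only adding residual noise.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Bridging subset: stored specimens measured with both the old and the new assay.
old_bridge = rng.normal(60, 15, 200)
new_bridge = 1.05 * old_bridge - 2 + rng.normal(0, 4, 200)

# Remaining survey samples measured only with the old assay.
old_only = rng.normal(60, 15, 2000)

# (a) Adjustment-equation approach: deterministic conversion via the fitted line.
fit = LinearRegression().fit(old_bridge.reshape(-1, 1), new_bridge)
adjusted = fit.predict(old_only.reshape(-1, 1))

# (b) MI-style approach: add residual noise to propagate conversion uncertainty,
# producing m completed datasets to be analysed and pooled with Rubin's rules.
resid_sd = np.std(new_bridge - fit.predict(old_bridge.reshape(-1, 1)), ddof=2)
imputations = [fit.predict(old_only.reshape(-1, 1)) + rng.normal(0, resid_sd, old_only.size)
               for _ in range(10)]
```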

  8. Differential network analysis with multiply imputed lipidomic data.

    Directory of Open Access Journals (Sweden)

    Maiju Kujala

    Full Text Available The importance of lipids for cell function and health has been widely recognized; for example, a disorder in the lipid composition of cells has been related to atherosclerosis-caused cardiovascular disease (CVD). Lipidomics analyses are characterized by a large, yet not huge, number of mutually correlated measured variables, and their associations with outcomes are potentially of a complex nature. Differential network analysis provides a formal statistical method capable of inferential analysis to examine differences in network structures of the lipids under two biological conditions. It also guides us to identify potential relationships requiring further biological investigation. We provide a recipe to conduct a permutation test on association scores resulting from partial least squares regression with multiply imputed lipidomic data from the LUdwigshafen RIsk and Cardiovascular Health (LURIC) study, paying particular attention to the left-censored missing values typical for a wide range of data sets in the life sciences. Left-censored missing values are low-level concentrations that are known to exist somewhere between zero and a lower limit of quantification. To make full use of the LURIC data with the missing values, we utilize state-of-the-art multiple imputation techniques and propose solutions to the challenges that incomplete data sets bring to differential network analysis. The customized network analysis helps us to understand the complexities of the underlying biological processes by identifying lipids and lipid classes that interact with each other, and by recognizing the most important differentially expressed lipids between two subgroups of coronary artery disease (CAD) patients: the patients that had a fatal CVD event and the ones who remained stable during two-year follow-up.

  9. Genotype Imputation for Latinos Using the HapMap and 1000 Genomes Project Reference Panels

    Directory of Open Access Journals (Sweden)

    Xiaoyi eGao

    2012-06-01

    Full Text Available Genotype imputation is a vital tool in genome-wide association studies (GWAS) and meta-analyses of multiple GWAS results. Imputation enables researchers to increase genomic coverage and to pool data generated using different genotyping platforms. HapMap samples are often employed as the reference panel. More recently, the 1000 Genomes Project resource is becoming the primary source for reference panels. Multiple GWAS and meta-analyses are targeting Latinos, the most populous and fastest growing minority group in the US. However, genotype imputation resources for Latinos are rather limited compared to individuals of European ancestry at present, largely because of the lack of good reference data. One choice of reference panel for Latinos is one derived from the population of Mexican individuals in Los Angeles contained in the HapMap Phase 3 project and the 1000 Genomes Project. However, a detailed evaluation of the quality of the imputed genotypes derived from the public reference panels has not yet been reported. Using simulation studies, the Illumina OmniExpress GWAS data from the Los Angeles Latino Eye Study and the MACH software package, we evaluated the accuracy of genotype imputation in Latinos. Our results show that the 1000 Genomes Project AMR+CEU+YRI reference panel provides the highest imputation accuracy for Latinos, and that also including Asian samples in the panel can reduce imputation accuracy. We also provide the imputation accuracy for each autosomal chromosome using the 1000 Genomes Project panel for Latinos. Our results serve as a guide to future imputation-based analysis in Latinos.

  10. Genotype Imputation for Latinos Using the HapMap and 1000 Genomes Project Reference Panels.

    Science.gov (United States)

    Gao, Xiaoyi; Haritunians, Talin; Marjoram, Paul; McKean-Cowdin, Roberta; Torres, Mina; Taylor, Kent D; Rotter, Jerome I; Gauderman, William J; Varma, Rohit

    2012-01-01

    Genotype imputation is a vital tool in genome-wide association studies (GWAS) and meta-analyses of multiple GWAS results. Imputation enables researchers to increase genomic coverage and to pool data generated using different genotyping platforms. HapMap samples are often employed as the reference panel. More recently, the 1000 Genomes Project resource is becoming the primary source for reference panels. Multiple GWAS and meta-analyses are targeting Latinos, the most populous and fastest growing minority group in the US. However, genotype imputation resources for Latinos are rather limited compared to individuals of European ancestry at present, largely because of the lack of good reference data. One choice of reference panel for Latinos is one derived from the population of Mexican individuals in Los Angeles contained in the HapMap Phase 3 project and the 1000 Genomes Project. However, a detailed evaluation of the quality of the imputed genotypes derived from the public reference panels has not yet been reported. Using simulation studies, the Illumina OmniExpress GWAS data from the Los Angeles Latino Eye Study and the MACH software package, we evaluated the accuracy of genotype imputation in Latinos. Our results show that the 1000 Genomes Project AMR + CEU + YRI reference panel provides the highest imputation accuracy for Latinos, and that also including Asian samples in the panel can reduce imputation accuracy. We also provide the imputation accuracy for each autosomal chromosome using the 1000 Genomes Project panel for Latinos. Our results serve as a guide to future imputation-based analysis in Latinos.

  11. Association studies with imputed variants using expectation-maximization likelihood-ratio tests.

    Directory of Open Access Journals (Sweden)

    Kuan-Chieh Huang

    Full Text Available Genotype imputation has become standard practice in modern genetic studies. As sequencing-based reference panels continue to grow, increasingly more markers are being well or better imputed, but at the same time even more markers with relatively low minor allele frequency are being imputed with low imputation quality. Here, we propose new methods that incorporate imputation uncertainty for downstream association analysis, with improved power and/or computational efficiency. We consider two scenarios: (I) when posterior probabilities of all potential genotypes are estimated; and (II) when only the one-dimensional summary statistic, imputed dosage, is available. For scenario I, we have developed an expectation-maximization likelihood-ratio test (EM-LRT) for association based on posterior probabilities. When only imputed dosages are available (scenario II), we first sample the genotype probabilities from their posterior distribution given the dosages, and then apply the EM-LRT on the sampled probabilities. Our simulations show that the type I error of the proposed EM-LRT methods under both scenarios is protected. Compared with existing methods, EM-LRT-Prob (for scenario I) offers optimal statistical power across a wide spectrum of MAF and imputation quality. EM-LRT-Dose (for scenario II) achieves a similar level of statistical power as EM-LRT-Prob and outperforms the standard Dosage method, especially for markers with relatively low MAF or imputation quality. Applications to two real data sets, the Cebu Longitudinal Health and Nutrition Survey study and the Women's Health Initiative Study, provide further support for the validity and efficiency of our proposed methods.

  12. A comparison of imputation procedures and statistical tests for the analysis of two-dimensional electrophoresis data.

    Science.gov (United States)

    Miecznikowski, Jeffrey C; Damodaran, Senthilkumar; Sellers, Kimberly F; Rabin, Richard A

    2010-12-15

    Numerous gel-based software packages exist to detect protein changes potentially associated with disease. The data, however, are abundant with technical and structural complexities, making statistical analysis a difficult task. A particularly important topic is how the various software packages handle missing data. To date, no one has extensively studied the impact that interpolating missing data has on subsequent analysis of protein spots. This work highlights the existing algorithms for handling missing data in two-dimensional gel analysis and performs a thorough comparison of the various algorithms and statistical tests on simulated and real datasets. For imputation methods, the best results in terms of root mean squared error are obtained using the least squares method of imputation along with the expectation maximization (EM) algorithm approach to estimate missing values with an array covariance structure. The bootstrapped versions of the statistical tests offer the most liberal option for determining protein spot significance while the generalized family wise error rate (gFWER) should be considered for controlling the multiple testing error. In summary, we advocate for a three-step statistical analysis of two-dimensional gel electrophoresis (2-DE) data with a data imputation step, choice of statistical test, and lastly an error control method in light of multiple testing. When determining the choice of statistical test, it is worth considering whether the protein spots will be subjected to mass spectrometry. If this is the case, a more liberal test such as the percentile-based bootstrap t can be employed. For error control in electrophoresis experiments, we advocate that gFWER be controlled for multiple testing rather than the false discovery rate.

  13. A comparison of imputation procedures and statistical tests for the analysis of two-dimensional electrophoresis data

    Directory of Open Access Journals (Sweden)

    Sellers Kimberly F

    2010-12-01

    Full Text Available Abstract Background Numerous gel-based software packages exist to detect protein changes potentially associated with disease. The data, however, are abundant with technical and structural complexities, making statistical analysis a difficult task. A particularly important topic is how the various software packages handle missing data. To date, no one has extensively studied the impact that interpolating missing data has on subsequent analysis of protein spots. Results This work highlights the existing algorithms for handling missing data in two-dimensional gel analysis and performs a thorough comparison of the various algorithms and statistical tests on simulated and real datasets. For imputation methods, the best results in terms of root mean squared error are obtained using the least squares method of imputation along with the expectation maximization (EM) algorithm approach to estimate missing values with an array covariance structure. The bootstrapped versions of the statistical tests offer the most liberal option for determining protein spot significance while the generalized family wise error rate (gFWER) should be considered for controlling the multiple testing error. Conclusions In summary, we advocate for a three-step statistical analysis of two-dimensional gel electrophoresis (2-DE) data with a data imputation step, choice of statistical test, and lastly an error control method in light of multiple testing. When determining the choice of statistical test, it is worth considering whether the protein spots will be subjected to mass spectrometry. If this is the case, a more liberal test such as the percentile-based bootstrap t can be employed. For error control in electrophoresis experiments, we advocate that gFWER be controlled for multiple testing rather than the false discovery rate.

  14. Geographic variations in cervical cancer risk in San Luis Potosí state, Mexico: A spatial statistical approach.

    Science.gov (United States)

    Terán-Hernández, Mónica; Ramis-Prieto, Rebeca; Calderón-Hernández, Jaqueline; Garrocho-Rangel, Carlos Félix; Campos-Alanís, Juan; Ávalos-Lozano, José Antonio; Aguilar-Robledo, Miguel

    2016-09-29

    Worldwide, Cervical Cancer (CC) is the fourth most common type of cancer and cause of death in women. It is a significant public health problem, especially in low- and middle-income/Gross Domestic Product (GDP) countries. In the past decade, several studies of CC have been published that identify the main modifiable and non-modifiable CC risk factors for Mexican women. However, there are no studies that attempt to explain the residual spatial variation in CC incidence in Mexico, i.e. spatial variation that cannot be ascribed to known, spatially varying risk factors. This paper uses a spatial statistical methodology that takes into account spatial variation in socio-economic factors and accessibility to health services, whilst allowing for residual, unexplained spatial variation in risk. To describe residual spatial variations in CC risk, we used generalised linear mixed models (GLMM) with both spatially structured and unstructured random effects, using a Bayesian approach to inference. The highest risk is concentrated in the southeast, where the Matlapa and Aquismón municipalities register excessive risk, with posterior probabilities greater than 0.8. The lack of coverage of the Cervical Cancer-Screening Programme (CCSP) (RR 1.17, 95 % CI 1.12-1.22), Marginalisation Index (RR 1.05, 95 % CI 1.03-1.08), and lack of accessibility to health services (RR 1.01, 95 % CI 1.00-1.03) were significant covariates. There are substantial differences between municipalities, with high-risk areas mainly in low-resource areas lacking accessibility to health services for CC. Our results clearly indicate the presence of spatial patterns, and the relevance of the spatial analysis for public health intervention. Ignoring the spatial variability means continuing a public policy that does not tackle deficiencies in the national CCSP and that keeps disadvantaging and disempowering Mexican women with regard to their health care.

  15. Volunteered Geographic Information in Natural Hazard Analysis: A Systematic Literature Review of Current Approaches with a Focus on Preparedness and Mitigation

    Directory of Open Access Journals (Sweden)

    Carolin Klonner

    2016-06-01

    Full Text Available With the rise of new technologies, citizens can contribute to scientific research via Web 2.0 applications for collecting and distributing geospatial data. Integrating local knowledge, personal experience and up-to-date geoinformation indicates a promising approach for the theoretical framework and the methods of natural hazard analysis. Our systematic literature review aims at identifying current research and directions for future research in terms of Volunteered Geographic Information (VGI) within natural hazard analysis. Focusing on both the preparedness and mitigation phases yields eleven articles from two literature databases. A qualitative analysis for in-depth information extraction reveals auspicious approaches regarding community engagement and data fusion, but also important research gaps. Mainly based in Europe and North America, the analysed studies deal primarily with floods and forest fires, applying geodata collected by trained citizens who are improving their knowledge and making their own interpretations. Yet, there is still a lack of common scientific terms and concepts. Future research can use these findings for the adaptation of scientific models of natural hazard analysis in order to enable the fusion of data from technical sensors and VGI. The development of such general methods shall contribute to establishing user integration in various contexts, such as natural hazard analysis.

  16. Imputation methods for filling missing data in urban air pollution data for Malaysia

    Directory of Open Access Journals (Sweden)

    Nur Afiqah Zakaria

    2018-06-01

    Full Text Available The air quality measurement data obtained from a continuous ambient air quality monitoring (CAAQM) station usually contain missing data. The missing observations usually occur due to machine failure, routine maintenance and human error. In this study, the hourly monitoring data of CO, O3, PM10, SO2, NOx, NO2, ambient temperature and humidity were used to evaluate four imputation methods (Mean Top Bottom, Linear Regression, Multiple Imputation and Nearest Neighbour). The air pollutant observations were simulated at four percentages of missing data, i.e. 5%, 10%, 15% and 20%. Performance measures, namely the Mean Absolute Error, Root Mean Squared Error, Coefficient of Determination and Index of Agreement, were used to describe the goodness of fit of the imputation methods. From the results of the performance measures, the Mean Top Bottom method was selected as the most appropriate imputation method for filling in the missing values in the air pollutant data.
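
    The four goodness-of-fit measures used to rank the imputation methods can be computed as in the sketch below; treating the coefficient of determination as the squared Pearson correlation and using Willmott's form of the index of agreement are assumptions about the exact definitions applied in the study.

```python
import numpy as np

def imputation_scores(observed, imputed):
    """Goodness-of-fit measures comparing imputed values against the true observations."""
    observed = np.asarray(observed, dtype=float)
    imputed = np.asarray(imputed, dtype=float)
    err = imputed - observed
    mae = np.mean(np.abs(err))                          # Mean Absolute Error
    rmse = np.sqrt(np.mean(err ** 2))                   # Root Mean Squared Error
    r2 = np.corrcoef(observed, imputed)[0, 1] ** 2      # Coefficient of Determination
    obs_mean = observed.mean()
    ia = 1 - np.sum(err ** 2) / np.sum(                 # Willmott's Index of Agreement
        (np.abs(imputed - obs_mean) + np.abs(observed - obs_mean)) ** 2)
    return {"MAE": mae, "RMSE": rmse, "R2": r2, "IA": ia}
```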

  17. Understanding Africa: A Geographic Approach

    Science.gov (United States)

    2009-01-01

    Nairobi, work in the informal sector of the economy doing jobs such as brewing beer in their houses from maize, at times supplementing their incomes from...plastics. A few larger factories manufacture cement, rolled steel, corrugated iron, aluminum sheets, cigarettes, beer and other beverages, raw... beer sales with sexual liaisons with the men they meet as customers (Stock 2004). Medical Geography Malaria is endemic in the Shebeelle

  18. Identification of environmental parameters and risk mapping of visceral leishmaniasis in Ethiopia by using geographical information systems and a statistical approach

    Directory of Open Access Journals (Sweden)

    Teshome Tsegaw

    2013-05-01

    Full Text Available Visceral leishmaniasis (VL), a vector-borne disease strongly influenced by environmental factors, has (re)-emerged in Ethiopia during the last two decades and is currently of increasing public health concern. Based on VL incidence in each locality (kebele) documented from federal or regional health bureaus and/or hospital records in the country, geographical information systems (GIS), coupled with binary and multivariate logistic regression methods, were employed to develop a risk map for Ethiopia with respect to VL based on soil type, altitude, rainfall, slope and temperature. The risk model was subsequently validated in selected sites. This environmental VL risk model provided an overall prediction accuracy of 86% with mean land surface temperature and soil type found to be the best predictors of VL. The total population at risk was estimated at 3.2 million according to the national population census in 2007. The approach presented here should facilitate the identification of priority areas for intervention and the monitoring of trends as well as providing input for further epidemiological and applied research with regard to this disease in Ethiopia.

  19. Identification of environmental parameters and risk mapping of visceral leishmaniasis in Ethiopia by using geographical information systems and a statistical approach.

    Science.gov (United States)

    Tsegaw, Teshome; Gadisa, Endalamaw; Seid, Ahmed; Abera, Adugna; Teshome, Aklilu; Mulugeta, Abate; Herrero, Merce; Argaw, Daniel; Jorge, Alvar; Aseffa, Abraham

    2013-05-01

    Visceral leishmaniasis (VL), a vector-borne disease strongly influenced by environmental factors, has (re)-emerged in Ethiopia during the last two decades and is currently of increasing public health concern. Based on VL incidence in each locality (kebele) documented from federal or regional health bureaus and/or hospital records in the country, geographical information systems (GIS), coupled with binary and multivariate logistic regression methods, were employed to develop a risk map for Ethiopia with respect to VL based on soil type, altitude, rainfall, slope and temperature. The risk model was subsequently validated in selected sites. This environmental VL risk model provided an overall prediction accuracy of 86% with mean land surface temperature and soil type found to be the best predictors of VL. The total population at risk was estimated at 3.2 million according to the national population census in 2007. The approach presented here should facilitate the identification of priority areas for intervention and the monitoring of trends as well as providing input for further epidemiological and applied research with regard to this disease in Ethiopia.

  20. A hybrid segmentation approach for geographic atrophy in fundus auto-fluorescence images for diagnosis of age-related macular degeneration.

    Science.gov (United States)

    Lee, Noah; Laine, Andrew F; Smith, R Theodore

    2007-01-01

    Fundus auto-fluorescence (FAF) images with hypo-fluorescence indicate geographic atrophy (GA) of the retinal pigment epithelium (RPE) in age-related macular degeneration (AMD). Manual quantification of GA is time-consuming and prone to inter- and intra-observer variability. Automatic quantification is important for determining disease progression and facilitating clinical diagnosis of AMD. In this paper we describe a hybrid segmentation method for GA quantification that distinguishes hypo-fluorescent GA regions from other interfering retinal vessel structures. First, we employ background illumination correction exploiting a non-linear adaptive smoothing operator. Then, we use the level set framework to perform segmentation of hypo-fluorescent areas. Finally, we present an energy function combining morphological scale-space analysis with a geometric model-based approach to perform segmentation refinement of false positive hypo-fluorescent areas due to interfering retinal structures. The clinically apparent areas of hypo-fluorescence were drawn by an expert grader and compared on a pixel-by-pixel basis to our segmentation results. The mean sensitivity and specificity of the ROC analysis were 0.89 and 0.98, respectively.

  1. genipe: an automated genome-wide imputation pipeline with automatic reporting and statistical tools.

    Science.gov (United States)

    Lemieux Perreault, Louis-Philippe; Legault, Marc-André; Asselin, Géraldine; Dubé, Marie-Pierre

    2016-12-01

    Genotype imputation is now commonly performed following genome-wide genotyping experiments. Imputation increases the density of analyzed genotypes in the dataset, enabling fine-mapping across the genome. However, the process of imputation using the most recent publicly available reference datasets can require considerable computation power and the management of hundreds of large intermediate files. We have developed genipe, a complete genome-wide imputation pipeline which includes automatic reporting, imputed data indexing and management, and a suite of statistical tests for imputed data commonly used in genetic epidemiology (Sequence Kernel Association Test, Cox proportional hazards for survival analysis, and linear mixed models for repeated measurements in longitudinal studies). The genipe package is an open source Python software and is freely available for non-commercial use (CC BY-NC 4.0) at https://github.com/pgxcentre/genipe. Documentation and tutorials are available at http://pgxcentre.github.io/genipe. Contact: louis-philippe.lemieux.perreault@statgen.org or marie-pierre.dube@statgen.org. Supplementary information: Supplementary data are available at Bioinformatics online. © The Author 2016. Published by Oxford University Press.

  2. First Use of Multiple Imputation with the National Tuberculosis Surveillance System

    Directory of Open Access Journals (Sweden)

    Christopher Vinnard

    2013-01-01

    Full Text Available Aims. The purpose of this study was to compare methods for handling missing data in analysis of the National Tuberculosis Surveillance System of the Centers for Disease Control and Prevention. Because of the high rate of missing human immunodeficiency virus (HIV) infection status in this dataset, we used multiple imputation methods to minimize the bias that may result from less sophisticated methods. Methods. We compared analysis based on multiple imputation methods with analysis based on deleting subjects with missing covariate data from regression analysis (case exclusion), and determined whether the use of increasing numbers of imputed datasets would lead to changes in the estimated association between isoniazid resistance and death. Results. Following multiple imputation, the odds ratio for initial isoniazid resistance and death was 2.07 (95% CI 1.30, 3.29); with case exclusion, this odds ratio decreased to 1.53 (95% CI 0.83, 2.83). The use of more than 5 imputed datasets did not substantively change the results. Conclusions. Our experience with the National Tuberculosis Surveillance System dataset supports the use of multiple imputation methods in epidemiologic analysis, but also demonstrates that close attention should be paid to the potential impact of missing covariates at each step of the analysis.

  3. Measuring segregation: an activity space approach.

    Science.gov (United States)

    Wong, David W S; Shaw, Shih-Lung

    2011-06-01

    While the literature clearly acknowledges that individuals may experience different levels of segregation across their various socio-geographical spaces, most measures of segregation are intended to be used in the residential space. Using spatially aggregated data to evaluate segregation in the residential space has been the norm, and thus individuals' segregation experiences in other socio-geographical spaces are often de-emphasized or ignored. This paper attempts to provide a more comprehensive approach to evaluating segregation beyond the residential space. The entire activity spaces of individuals are taken into account, with individuals serving as the building blocks of the analysis. The measurement principle is based upon the exposure dimension of segregation. The proposed measure reflects the exposure of individuals of a referenced group in a neighborhood to the populations of other groups that are found within the activity spaces of individuals in the referenced group. Using the travel diary data collected from the tri-county area in southeast Florida and the imputed racial-ethnic data, this paper demonstrates how the proposed segregation measurement approach goes beyond just measuring population distribution patterns in the residential space and can provide a more comprehensive evaluation of segregation by considering various socio-geographical spaces.
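
    A minimal sketch of the exposure logic described above: each referenced-group individual's exposure is an activity-time-weighted average of the other group's population share across the zones in their activity space, and the group-level measure averages these over individuals. The data structures and weighting are assumptions for illustration, not the authors' exact formulation.

```python
import numpy as np

def activity_space_exposure(other_counts, total_counts, activity_weights):
    """Exposure of a referenced group to another population group.

    other_counts[i]        : other-group population of zone i
    total_counts[i]        : total population of zone i
    activity_weights[p, i] : share of person p's activity time spent in zone i
                             (each row sums to 1); persons are referenced-group members
    """
    other_share = np.asarray(other_counts, float) / np.asarray(total_counts, float)
    person_exposure = np.asarray(activity_weights, float) @ other_share
    return person_exposure.mean()

# Example: three zones, two referenced-group individuals with different activity spaces.
print(activity_space_exposure([200, 50, 400], [1000, 800, 600],
                              [[0.6, 0.3, 0.1], [0.2, 0.2, 0.6]]))
```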

  4. Nearest neighbor imputation using spatial-temporal correlations in wireless sensor networks.

    Science.gov (United States)

    Li, YuanYuan; Parker, Lynne E

    2014-01-01

    Missing data is common in Wireless Sensor Networks (WSNs), especially with multi-hop communications. There are many reasons for this phenomenon, such as unstable wireless communications, synchronization issues, and unreliable sensors. Unfortunately, missing data creates a number of problems for WSNs. First, since most sensor nodes in the network are battery-powered, it is too expensive to have the nodes retransmit missing data across the network. Data re-transmission may also cause time delays when detecting abnormal changes in an environment. Furthermore, localized reasoning techniques on sensor nodes (such as machine learning algorithms to classify states of the environment) are generally not robust enough to handle missing data. Since sensor data collected by a WSN is generally correlated in time and space, we illustrate how replacing missing sensor values with spatially and temporally correlated sensor values can significantly improve the network's performance. However, our studies show that it is important to determine which nodes are spatially and temporally correlated with each other. Simple techniques based on Euclidean distance are not sufficient for complex environmental deployments. Thus, we have developed a novel Nearest Neighbor (NN) imputation method that estimates missing data in WSNs by learning spatial and temporal correlations between sensor nodes. To improve the search time, we utilize a kd-tree data structure, which is a non-parametric, data-driven binary search tree. Instead of using traditional mean and variance of each dimension for kd-tree construction, and Euclidean distance for kd-tree search, we use weighted variances and weighted Euclidean distances based on measured percentages of missing data. We have evaluated this approach through experiments on sensor data from a volcano dataset collected by a network of Crossbow motes, as well as experiments using sensor data from a highway traffic monitoring application. Our experimental results
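
    The core idea, a distance between sensor nodes' readings that is down-weighted for nodes with much missing data, followed by copying the nearest neighbour's value, can be sketched as below. This brute-force version omits the kd-tree indexing described in the abstract, and the exact weighting formula is an assumption inspired by, not identical to, the paper's scheme.

```python
import numpy as np

def nn_impute(readings, missing_pct):
    """Impute missing sensor values from the most similar node.

    readings    : (n_nodes x n_times) array with np.nan marking missing values
    missing_pct : per-node fraction of missing data, used to down-weight
                  unreliable nodes in the distance computation (an assumption)
    """
    readings = readings.astype(float).copy()
    weights = 1.0 - np.asarray(missing_pct, dtype=float)
    n_nodes = readings.shape[0]
    for i in range(n_nodes):
        for t in np.where(np.isnan(readings[i]))[0]:
            best, best_d = None, np.inf
            for j in range(n_nodes):
                if j == i or np.isnan(readings[j, t]):
                    continue
                common = ~np.isnan(readings[i]) & ~np.isnan(readings[j])
                if not common.any():
                    continue
                # Weighted Euclidean distance over the time points both nodes observed
                d = np.sqrt(np.sum(weights[j] * (readings[i, common] - readings[j, common]) ** 2))
                if d < best_d:
                    best, best_d = j, d
            if best is not None:
                readings[i, t] = readings[best, t]   # copy the nearest neighbour's value
    return readings
```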

  5. Using family-based imputation in genome-wide association studies with large complex pedigrees: the Framingham Heart Study.

    Directory of Open Access Journals (Sweden)

    Ming-Huei Chen

    Full Text Available Imputation has been widely used in genome-wide association studies (GWAS) to infer genotypes of un-genotyped variants based on the linkage disequilibrium in external reference panels such as the HapMap and 1000 Genomes. However, imputation has only rarely been performed based on family relationships to infer genotypes of un-genotyped individuals. Using 8998 Framingham Heart Study (FHS) participants genotyped with Affymetrix 550K SNPs, we imputed genotypes of the same set of SNPs for an additional 3121 participants, most of whom were never genotyped due to lack of DNA sample. Prior to imputation, 122 pedigrees were too large to be handled by the imputation software Merlin. Therefore, we developed a novel pedigree splitting algorithm that can maximize the number of genotyped relatives for imputing each un-genotyped individual, while keeping new sub-pedigrees under a pre-specified size. In GWAS of four phenotypes available in FHS (Alzheimer disease, circulating levels of fibrinogen, high-density lipoprotein cholesterol, and uric acid), we compared results using genotyped individuals only with results using both genotyped and imputed individuals. We studied the impact of applying different imputation quality filtering thresholds on the association results and did not find a universal threshold that always resulted in a more significant p-value for previously identified loci. However, most of these loci had a lower p-value when we only included imputed genotypes with ≥60% SNP- and ≥50% person-specific imputation certainty. In summary, we developed a novel algorithm for splitting large pedigrees for imputation and found a plausible imputation quality filtering threshold based on FHS. Further examination may be required to generalize this threshold to other studies.

  6. Body size and geographic range do not explain long term variation in fish populations: a Bayesian phylogenetic approach to testing assembly processes in stream fish assemblages.

    Directory of Open Access Journals (Sweden)

    Stephen J Jacquemin

    Full Text Available We combine evolutionary biology and community ecology to test whether two species traits, body size and geographic range, explain long term variation in local scale freshwater stream fish assemblages. Body size and geographic range are expected to influence several aspects of fish ecology, via relationships with niche breadth, dispersal, and abundance. These traits are expected to scale inversely with niche breadth or current abundance, and to scale directly with dispersal potential. However, their utility to explain long term temporal patterns in local scale abundance is not known. Comparative methods employing an existing molecular phylogeny were used to incorporate evolutionary relatedness in a test for covariation of body size and geographic range with long term (1983-2010) local scale population variation of fishes in West Fork White River (Indiana, USA). The Bayesian model incorporating phylogenetic uncertainty and correlated predictors indicated that neither body size nor geographic range explained significant variation in population fluctuations over a 28 year period. Phylogenetic signal data indicated that body size and geographic range were less similar among taxa than expected if trait evolution followed a purely random walk. We interpret this as evidence that local scale population variation may be influenced less by species-level traits such as body size or geographic range, and instead may be influenced more strongly by a taxon's local scale habitat and biotic assemblages.

  7. Imputation-based analysis of association studies: candidate regions and quantitative traits.

    Directory of Open Access Journals (Sweden)

    Bertrand Servin

    2007-07-01

    Full Text Available We introduce a new framework for the analysis of association studies, designed to allow untyped variants to be more effectively and directly tested for association with a phenotype. The idea is to combine knowledge on patterns of correlation among SNPs (e.g., from the International HapMap project or resequencing data) in a candidate region of interest with genotype data at tag SNPs collected on a phenotyped study sample, to estimate ("impute") unmeasured genotypes, and then assess association between the phenotype and these estimated genotypes. Compared with standard single-SNP tests, this approach results in increased power to detect association, even in cases in which the causal variant is typed, with the greatest gain occurring when multiple causal variants are present. It also provides more interpretable explanations for observed associations, including assessing, for each SNP, the strength of the evidence that it (rather than another correlated SNP) is causal. Although we focus on association studies with quantitative phenotype and a relatively restricted region (e.g., a candidate gene), the framework is applicable and computationally practical for whole genome association studies. Methods described here are implemented in a software package, Bim-Bam, available from the Stephens Lab website http://stephenslab.uchicago.edu/software.html.

  8. Analysis of Case-Control Association Studies: SNPs, Imputation and Haplotypes

    KAUST Repository

    Chatterjee, Nilanjan

    2009-11-01

    Although prospective logistic regression is the standard method of analysis for case-control data, it has been recently noted that in genetic epidemiologic studies one can use the "retrospective" likelihood to gain major power by incorporating various population genetics model assumptions such as Hardy-Weinberg Equilibrium (HWE), gene-gene and gene-environment independence. In this article we review these modern methods and contrast them with the more classical approaches through two types of applications: (i) association tests for typed and untyped single nucleotide polymorphisms (SNPs) and (ii) estimation of haplotype effects and haplotype-environment interactions in the presence of haplotype-phase ambiguity. We provide novel insights into existing methods by construction of various score-tests and pseudo-likelihoods. In addition, we describe a novel two-stage method for analysis of untyped SNPs that can use any flexible external algorithm for genotype imputation followed by a powerful association test based on the retrospective likelihood. We illustrate applications of the methods using simulated and real data. © Institute of Mathematical Statistics, 2009.

  9. Multi-approaches analysis reveals local adaptation in the emmer wheat (Triticum dicoccoides) at macro- but not micro-geographical scale.

    Science.gov (United States)

    Volis, Sergei; Ormanbekova, Danara; Yermekbayev, Kanat; Song, Minshu; Shulgina, Irina

    2015-01-01

    Detecting local adaptation and its spatial scale is one of the most important questions of evolutionary biology. However, recognition of the effect of local selection can be challenging when there is considerable environmental variation across the distances spanned by the whole species range. We analyzed patterns of local adaptation in emmer wheat, Triticum dicoccoides, at two spatial scales, small (inter-population distance less than one km) and large (inter-population distance more than 50 km), using several approaches. Plants originating from four distinct habitats at two geographic scales (cold edge, arid edge and two topographically dissimilar core locations) were reciprocally transplanted and their success over time was measured as (1) lifetime fitness in the year of planting, and (2) population growth four years after planting. In addition, we analyzed molecular (SSR) and quantitative trait variation and calculated the QST/FST ratio. No home advantage was detected at the small spatial scale. At the large spatial scale, home advantage was detected for the core population and the cold edge population in the year of introduction via measuring life-time plant performance. However, superior performance of the arid edge population in its own environment was evident only after several generations, via measuring experimental population growth rate with SSR genotyping used to count the number of plants and seeds per introduced genotype per site. These results highlight the importance of multi-generation surveys of population growth rate in local adaptation testing. Despite predominant self-fertilization of T. dicoccoides and the associated high degree of structuring of genetic variation, the results of the QST - FST comparison were in general agreement with the pattern of local adaptation at the two spatial scales detected by reciprocal transplanting.

  10. Geographical networks: geographical effects on network properties

    Institute of Scientific and Technical Information of China (English)

    Kong-qing YANG; Lei YANG; Bai-hua GONG; Zhong-cai LIN; Hong-sheng HE; Liang HUANG

    2008-01-01

    Complex networks describe a wide range of systems in nature and society. Since most real systems exist in a certain physical space and the distance between the nodes has influence on the connections, it is helpful to study geographical complex networks and to investigate how the geographical constraints on the connections affect the network properties. In this paper, we briefly review our recent progress on geographical complex networks with respect to statistics, modelling, robustness, and synchronizability. It has been shown that the geographical constraints tend to make the network less robust and less synchronizable. Synchronization on random networks and clustered networks is also studied.

  11. Sensitivity analysis in multiple imputation in effectiveness studies of psychotherapy

    Science.gov (United States)

    Crameri, Aureliano; von Wyl, Agnes; Koemeda, Margit; Schulthess, Peter; Tschuschke, Volker

    2015-01-01

    The importance of preventing and handling incomplete data in effectiveness studies is nowadays widely emphasized. However, most of the publications focus on randomized clinical trials (RCTs). One flexible technique for statistical inference with missing data is multiple imputation (MI). Since methods such as MI rely on the assumption that data are missing at random (MAR), a sensitivity analysis for testing the robustness against departures from this assumption is required. In this paper we present a sensitivity analysis technique based on posterior predictive checking, which takes into consideration the concept of clinical significance used in the evaluation of intra-individual changes. We demonstrate the possibilities this technique can offer with the example of irregular longitudinal data collected with the Outcome Questionnaire-45 (OQ-45) and the Helping Alliance Questionnaire (HAQ) in a sample of 260 outpatients. The sensitivity analysis can be used to (1) quantify the degree of bias introduced by data that are missing not at random (MNAR) in a worst reasonable case scenario, (2) compare the performance of different analysis methods for dealing with missing data, or (3) detect the influence of possible violations of the model assumptions (e.g., lack of normality). Moreover, our analysis showed that ratings from the patient's and therapist's version of the HAQ could significantly improve the predictive value of the routine outcome monitoring based on the OQ-45. Since analysis dropouts always occur, repeated measurements with the OQ-45 and the HAQ analyzed with MI are useful to improve the accuracy of outcome estimates in quality assurance assessments and non-randomized effectiveness studies in the field of outpatient psychotherapy. PMID:26283989

  12. Dealing with missing data in a multi-question depression scale: a comparison of imputation methods

    Directory of Open Access Journals (Sweden)

    Stuart Heather

    2006-12-01

    Full Text Available Abstract Background Missing data present a challenge to many research projects. The problem is often pronounced in studies utilizing self-report scales, and literature addressing different strategies for dealing with missing data in such circumstances is scarce. The objective of this study was to compare six different imputation techniques for dealing with missing data in the Zung Self-reported Depression scale (SDS). Methods 1580 participants from a surgical outcomes study completed the SDS. The SDS is a 20-question scale that respondents complete by circling a value of 1 to 4 for each question. The sum of the responses is calculated and respondents are classified as exhibiting depressive symptoms when their total score is over 40. Missing values were simulated by randomly selecting questions whose values were then deleted (a missing completely at random simulation). Additionally, a missing at random and a missing not at random simulation were completed. Six imputation methods were then considered: (1) multiple imputation, (2) single regression, (3) individual mean, (4) overall mean, (5) participant's preceding response, and (6) random selection of a value from 1 to 4. For each method, the imputed mean SDS score and standard deviation were compared to the population statistics. The Spearman correlation coefficient, percent misclassified and the Kappa statistic were also calculated. Results When 10% of values are missing, all the imputation methods except random selection produce Kappa statistics greater than 0.80, indicating 'near perfect' agreement. MI produces the most valid imputed values with a high Kappa statistic (0.89), although both single regression and individual mean imputation also produced favorable results. As the percent of missing information increased to 30%, or when unbalanced missing data were introduced, MI maintained a high Kappa statistic. The individual mean and single regression method produced Kappas in the 'substantial agreement' range
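
    Three of the six strategies compared above (individual mean, overall item mean, and preceding response) are simple enough to sketch directly; the functions and fallback choices below are illustrative assumptions, and the multiple imputation and single regression strategies are omitted.

```python
import numpy as np

def impute_individual_mean(responses):
    """Replace a respondent's missing items with that respondent's own mean response."""
    out = responses.astype(float).copy()
    for row in out:
        row[np.isnan(row)] = np.nanmean(row)
    return out

def impute_item_mean(responses):
    """Replace missing items with the overall mean of that item across respondents."""
    out = responses.astype(float).copy()
    item_means = np.nanmean(out, axis=0)
    idx = np.where(np.isnan(out))
    out[idx] = item_means[idx[1]]
    return out

def impute_preceding(responses):
    """Carry the respondent's preceding answer forward (a missing first item falls back
    to the item mean, an assumption made here to keep the sketch self-contained)."""
    responses = responses.astype(float)
    out = impute_item_mean(responses)
    for i, row in enumerate(responses):
        for j in range(1, row.size):
            if np.isnan(row[j]):
                out[i, j] = out[i, j - 1]
    return out

# After imputation, summing each row and comparing the total against 40 reproduces the
# depressive-symptom classification described above.
```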

  13. PRIMAL: Fast and accurate pedigree-based imputation from sequence data in a founder population.

    Directory of Open Access Journals (Sweden)

    Oren E Livne

    2015-03-01

    Full Text Available Founder populations and large pedigrees offer many well-known advantages for genetic mapping studies, including cost-efficient study designs. Here, we describe PRIMAL (PedigRee IMputation ALgorithm), a fast and accurate pedigree-based phasing and imputation algorithm for founder populations. PRIMAL incorporates both existing and original ideas, such as a novel indexing strategy of Identity-By-Descent (IBD) segments based on clique graphs. We were able to impute the genomes of 1,317 South Dakota Hutterites, who had genome-wide genotypes for ~300,000 common single nucleotide variants (SNVs), from 98 whole genome sequences. Using a combination of pedigree-based and LD-based imputation, we were able to assign 87% of genotypes with >99% accuracy over the full range of allele frequencies. Using the IBD cliques we were also able to infer the parental origin of 83% of alleles, and genotypes of deceased recent ancestors for whom no genotype information was available. This imputed data set will enable us to better study the relative contribution of rare and common variants on human phenotypes, as well as parental origin effect of disease risk alleles in >1,000 individuals at minimal cost.

  14. SparRec: An effective matrix completion framework of missing data imputation for GWAS

    Science.gov (United States)

    Jiang, Bo; Ma, Shiqian; Causey, Jason; Qiao, Linbo; Hardin, Matthew Price; Bitts, Ian; Johnson, Daniel; Zhang, Shuzhong; Huang, Xiuzhen

    2016-10-01

    Genome-wide association studies present computational challenges for missing data imputation, while the advances of genotype technologies are generating datasets of large sample sizes with sample sets genotyped on multiple SNP chips. We present a new framework SparRec (Sparse Recovery) for imputation, with the following properties: (1) The optimization models of SparRec, based on low-rank and low number of co-clusters of matrices, are different from current statistics methods. While our low-rank matrix completion (LRMC) model is similar to Mendel-Impute, our matrix co-clustering factorization (MCCF) model is completely new. (2) SparRec, as other matrix completion methods, is flexible to be applied to missing data imputation for large meta-analysis with different cohorts genotyped on different sets of SNPs, even when there is no reference panel. This kind of meta-analysis is very challenging for current statistics based methods. (3) SparRec has consistent performance and achieves high recovery accuracy even when the missing data rate is as high as 90%. Compared with Mendel-Impute, our low-rank based method achieves similar accuracy and efficiency, while the co-clustering based method has advantages in running time. The testing results show that SparRec has significant advantages and competitive performance over other state-of-the-art existing statistics methods including Beagle and fastPhase.
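
    SparRec itself is not reproduced here; as a rough illustration of the low-rank matrix-completion idea it builds on, the soft-impute-style sketch below alternates between filling missing entries from the current reconstruction and re-truncating the SVD. The rank, iteration count, and dosage coding in [0, 2] are assumptions for illustration, not SparRec's actual settings.

    ```python
    import numpy as np

    def lowrank_complete(genotypes, rank=10, n_iter=50):
        """Fill NaN entries of a samples-x-SNPs dosage matrix with a rank-r SVD fit."""
        X = genotypes.astype(float).copy()
        missing = np.isnan(X)
        col_means = np.nanmean(genotypes, axis=0)
        X[missing] = np.take(col_means, np.nonzero(missing)[1])   # crude starting values
        for _ in range(n_iter):
            U, s, Vt = np.linalg.svd(X, full_matrices=False)
            s[rank:] = 0.0                            # keep only the leading rank components
            X_hat = (U * s) @ Vt
            X[missing] = X_hat[missing]               # update the missing cells only
        return np.clip(X, 0, 2)                       # allele dosages lie in [0, 2]
    ```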

  15. Assessing the Fit of Structural Equation Models With Multiply Imputed Data.

    Science.gov (United States)

    Enders, Craig K; Mansolf, Maxwell

    2016-11-28

    Multiple imputation has enjoyed widespread use in social science applications, yet the application of imputation-based inference to structural equation modeling has received virtually no attention in the literature. Thus, this study has 2 overarching goals: evaluate the application of Meng and Rubin's (1992) pooling procedure for likelihood ratio statistic to the SEM test of model fit, and explore the possibility of using this test statistic to define imputation-based versions of common fit indices such as the TLI, CFI, and RMSEA. Computer simulation results suggested that, when applied to a correctly specified model, the pooled likelihood ratio statistic performed well as a global test of model fit and was closely calibrated to the corresponding full information maximum likelihood (FIML) test statistic. However, when applied to misspecified models with high rates of missingness (30%-40%), the imputation-based test statistic generally exhibited lower power than that of FIML. Using the pooled test statistic to construct imputation-based versions of the TLI, CFI, and RMSEA worked well and produced indices that were well-calibrated with those of full information maximum likelihood estimation. This article gives Mplus and R code to implement the pooled test statistic, and it offers a number of recommendations for future research. (PsycINFO Database Record (c) 2016 APA, all rights reserved).

  16. Gaussianization-based quasi-imputation and expansion strategies for incomplete correlated binary responses.

    Science.gov (United States)

    Demirtas, Hakan; Hedeker, Donald

    2007-02-20

    New quasi-imputation and expansion strategies for correlated binary responses are proposed by borrowing ideas from random number generation. The core idea is to convert correlated binary outcomes to multivariate normal outcomes in a sensible way so that re-conversion to the binary scale, after performing multiple imputation, yields the original specified marginal expectations and correlations. This conversion process ensures that the correlations are transformed reasonably which in turn allows us to take advantage of well-developed imputation techniques for Gaussian outcomes. We use the phrase 'quasi' because the original observations are not guaranteed to be preserved. We argue that if the inferential goals are well-defined, it is not necessary to strictly adhere to the established definition of multiple imputation. Our expansion scheme employs a similar strategy where imputation is used as an intermediate step. It leads to proportionally inflated observed patterns, forcing the data set to a complete rectangular format. The plausibility of the proposed methodology is examined by applying it to a wide range of simulated data sets that reflect alternative assumptions on complete data populations and missing-data mechanisms. We also present an application using a data set from obesity research. We conclude that the proposed method is a promising tool for handling incomplete longitudinal or clustered binary outcomes under ignorable non-response mechanisms. Copyright 2006 John Wiley & Sons, Ltd.
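
    The conversion step described above (binary outcomes tied to a latent multivariate normal through thresholds chosen to match the marginal probabilities) can be illustrated with a toy sketch; the correlation value and sample size are arbitrary, and this shows only the thresholding idea, not the authors' full quasi-imputation algorithm.

    ```python
    import numpy as np
    from scipy.stats import norm

    def binary_from_latent(z, p):
        """Dichotomize latent normal scores so column j has marginal probability p[j] of being 1."""
        thresholds = norm.ppf(1.0 - np.asarray(p))    # P(Z > t_j) = p_j
        return (z > thresholds).astype(int)

    # toy use: two correlated latent normals -> two correlated binary outcomes
    rng = np.random.default_rng(0)
    cov = np.array([[1.0, 0.6], [0.6, 1.0]])
    z = rng.multivariate_normal(mean=[0.0, 0.0], cov=cov, size=1000)
    y = binary_from_latent(z, p=[0.3, 0.5])
    print(y.mean(axis=0))                             # roughly [0.3, 0.5]
    ```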

  17. Multiple imputation by chained equations for systematically and sporadically missing multilevel data.

    Science.gov (United States)

    Resche-Rigon, Matthieu; White, Ian R

    2016-09-19

    In multilevel settings such as individual participant data meta-analysis, a variable is 'systematically missing' if it is wholly missing in some clusters and 'sporadically missing' if it is partly missing in some clusters. Previously proposed methods to impute incomplete multilevel data handle either systematically or sporadically missing data, but frequently both patterns are observed. We describe a new multiple imputation by chained equations (MICE) algorithm for multilevel data with arbitrary patterns of systematically and sporadically missing variables. The algorithm is described for multilevel normal data but can easily be extended for other variable types. We first propose two methods for imputing a single incomplete variable: an extension of an existing method and a new two-stage method which conveniently allows for heteroscedastic data. We then discuss the difficulties of imputing missing values in several variables in multilevel data using MICE, and show that even the simplest joint multilevel model implies conditional models which involve cluster means and heteroscedasticity. However, a simulation study finds that the proposed methods can be successfully combined in a multilevel MICE procedure, even when cluster means are not included in the imputation models.
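
    As a generic reminder of the chained-equations idea (deliberately single-level, so it ignores the cluster structure and heteroscedasticity that this record is actually about), one MICE cycle regresses each incomplete variable on the others and redraws its missing values. The use of ordinary least squares with normal noise is an assumption for illustration.

    ```python
    import numpy as np

    def mice_cycles(X, n_cycles=10, rng=None):
        """Very small single-level MICE sketch for a numeric matrix X containing NaNs."""
        rng = np.random.default_rng(rng)
        X = X.astype(float).copy()
        miss = np.isnan(X)
        col_means = np.nanmean(X, axis=0)
        X[miss] = np.take(col_means, np.nonzero(miss)[1])     # crude starting values
        for _ in range(n_cycles):
            for j in range(X.shape[1]):
                m = miss[:, j]
                if not m.any():
                    continue
                A = np.column_stack([np.ones(len(X)), np.delete(X, j, axis=1)])
                beta, *_ = np.linalg.lstsq(A[~m], X[~m, j], rcond=None)
                sigma = (X[~m, j] - A[~m] @ beta).std(ddof=1)
                X[m, j] = A[m] @ beta + rng.normal(0.0, sigma, m.sum())  # draw, don't just predict
        return X
    ```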

  18. PRIMAL: Fast and accurate pedigree-based imputation from sequence data in a founder population.

    Science.gov (United States)

    Livne, Oren E; Han, Lide; Alkorta-Aranburu, Gorka; Wentworth-Sheilds, William; Abney, Mark; Ober, Carole; Nicolae, Dan L

    2015-03-01

    Founder populations and large pedigrees offer many well-known advantages for genetic mapping studies, including cost-efficient study designs. Here, we describe PRIMAL (PedigRee IMputation ALgorithm), a fast and accurate pedigree-based phasing and imputation algorithm for founder populations. PRIMAL incorporates both existing and original ideas, such as a novel indexing strategy of Identity-By-Descent (IBD) segments based on clique graphs. We were able to impute the genomes of 1,317 South Dakota Hutterites, who had genome-wide genotypes for ~300,000 common single nucleotide variants (SNVs), from 98 whole genome sequences. Using a combination of pedigree-based and LD-based imputation, we were able to assign 87% of genotypes with >99% accuracy over the full range of allele frequencies. Using the IBD cliques we were also able to infer the parental origin of 83% of alleles, and genotypes of deceased recent ancestors for whom no genotype information was available. This imputed data set will enable us to better study the relative contribution of rare and common variants on human phenotypes, as well as parental origin effect of disease risk alleles in >1,000 individuals at minimal cost.

  19. Genotype imputation for African Americans using data from HapMap phase II versus 1000 genomes projects.

    Science.gov (United States)

    Sung, Yun J; Gu, C Charles; Tiwari, Hemant K; Arnett, Donna K; Broeckel, Ulrich; Rao, Dabeeru C

    2012-07-01

    Genotype imputation provides imputation of untyped single nucleotide polymorphisms (SNPs) that are present on a reference panel such as those from the HapMap Project. It is popular for increasing statistical power and comparing results across studies using different platforms. Imputation for African American populations is challenging because their linkage disequilibrium blocks are shorter and also because no ideal reference panel is available due to admixture. In this paper, we evaluated three imputation strategies for African Americans. The intersection strategy used a combined panel consisting of SNPs polymorphic in both CEU and YRI. The union strategy used a panel consisting of SNPs polymorphic in either CEU or YRI. The merge strategy merged results from two separate imputations, one using CEU and the other using YRI. Because recent investigators are increasingly using the data from the 1000 Genomes (1KG) Project for genotype imputation, we evaluated both 1KG-based imputations and HapMap-based imputations. We used 23,707 SNPs from chromosomes 21 and 22 on Affymetrix SNP Array 6.0 genotyped for 1,075 HyperGEN African Americans. We found that 1KG-based imputations provided a substantially larger number of variants than HapMap-based imputations, about three times as many common variants and eight times as many rare and low-frequency variants. This higher yield is expected because the 1KG panel includes more SNPs. Accuracy rates using 1KG data were slightly lower than those using HapMap data before filtering, but slightly higher after filtering. The union strategy provided the highest imputation yield with next highest accuracy. The intersection strategy provided the lowest imputation yield but the highest accuracy. The merge strategy provided the lowest imputation accuracy. We observed that SNPs polymorphic only in CEU had much lower accuracy, reducing the accuracy of the union strategy. Our findings suggest that 1KG-based imputations can facilitate discovery of
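
    At their core, the intersection and union strategies compared above are set operations on the SNPs polymorphic in each reference population; the merge strategy instead combines two separate imputation runs. A toy sketch (population labels as in the record, SNP identifiers invented) is:

    ```python
    ceu_polymorphic = {"rs1", "rs2", "rs3"}           # SNPs polymorphic in CEU (toy IDs)
    yri_polymorphic = {"rs2", "rs3", "rs4", "rs5"}    # SNPs polymorphic in YRI (toy IDs)

    intersection_panel = ceu_polymorphic & yri_polymorphic   # lowest yield, highest accuracy
    union_panel = ceu_polymorphic | yri_polymorphic          # highest yield in the study
    # the "merge" strategy runs two separate imputations (CEU-only and YRI-only)
    # and combines their outputs afterwards
    ```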

  20. A New Missing Data Imputation Algorithm Applied to Electrical Data Loggers

    Directory of Open Access Journals (Sweden)

    Concepción Crespo Turrado

    2015-12-01

    Full Text Available Nowadays, data collection is a key process in the study of electrical power networks when searching for harmonics and a lack of balance among phases. In this context, the lack of data for any of the main electrical variables (phase-to-neutral voltage, phase-to-phase voltage, current in each phase, and power factor) adversely affects any time series study performed. When this occurs, a data imputation process must be accomplished in order to substitute the data that is missing for estimated values. This paper presents a novel missing data imputation method based on multivariate adaptive regression splines (MARS) and compares it with the well-known technique called multivariate imputation by chained equations (MICE). The results obtained demonstrate how the proposed method outperforms the MICE algorithm.
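
    Neither the authors' MARS implementation nor their logger data are reproduced here; as a rough point of comparison, scikit-learn's IterativeImputer gives a MICE-style baseline, and a MARS-type regressor (for example from the third-party py-earth package, if available) could in principle be plugged in as its estimator. The column layout below is invented for illustration.

    ```python
    import numpy as np
    from sklearn.experimental import enable_iterative_imputer  # noqa: F401  (enables the class below)
    from sklearn.impute import IterativeImputer

    # toy logger matrix: four correlated electrical readings with ~10% of values knocked out
    rng = np.random.default_rng(1)
    X = rng.normal(size=(500, 4))
    X[rng.random(X.shape) < 0.10] = np.nan

    mice_like = IterativeImputer(max_iter=10, sample_posterior=True, random_state=0)
    X_filled = mice_like.fit_transform(X)
    print(np.isnan(X_filled).any())   # False: all gaps filled
    ```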

  1. Exact Inference for Hardy-Weinberg Proportions with Missing Genotypes: Single and Multiple Imputation

    Science.gov (United States)

    Graffelman, Jan; Nelson, S.; Gogarten, S. M.; Weir, B. S.

    2015-01-01

    This paper addresses the issue of exact-test based statistical inference for Hardy−Weinberg equilibrium in the presence of missing genotype data. Missing genotypes often are discarded when markers are tested for Hardy−Weinberg equilibrium, which can lead to bias in the statistical inference about equilibrium. Single and multiple imputation can improve inference on equilibrium. We develop tests for equilibrium in the presence of missingness by using both inbreeding coefficients (or, equivalently, χ2 statistics) and exact p-values. The analysis of a set of markers with a high missing rate from the GENEVA project on prematurity shows that exact inference on equilibrium can be altered considerably when missingness is taken into account. For markers with a high missing rate (>5%), we found that both single and multiple imputation tend to diminish evidence for Hardy−Weinberg disequilibrium. Depending on the imputation method used, 6−13% of the test results changed qualitatively at the 5% level. PMID:26377959
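
    For a biallelic marker, the inbreeding-coefficient and χ² formulations mentioned above are linked by a standard identity (stated here from general population-genetics usage, not quoted from the paper):

    ```latex
    \hat{f} \;=\; 1 - \frac{O(\text{het})}{E(\text{het})}
            \;=\; 1 - \frac{n_{AB}/n}{2\,\hat{p}\,\hat{q}},
    \qquad
    X^{2} \;=\; n\,\hat{f}^{\,2} \;\sim\; \chi^{2}_{1} \ \text{under Hardy–Weinberg equilibrium},
    ```

    where n is the number of genotyped individuals, n_AB the observed heterozygote count, and p̂, q̂ = 1 − p̂ the sample allele frequencies.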

  2. Saturated linkage map construction in Rubus idaeus using genotyping by sequencing and genome-independent imputation

    Directory of Open Access Journals (Sweden)

    Ward Judson A

    2013-01-01

    Full Text Available Abstract Background Rapid development of highly saturated genetic maps aids molecular breeding, which can accelerate gain per breeding cycle in woody perennial plants such as Rubus idaeus (red raspberry). Recently, robust genotyping methods based on high-throughput sequencing were developed, which provide high marker density, but result in some genotype errors and a large number of missing genotype values. Imputation can reduce the number of missing values and can correct genotyping errors, but current methods of imputation require a reference genome and thus are not an option for most species. Results Genotyping by Sequencing (GBS) was used to produce highly saturated maps for a R. idaeus pseudo-testcross progeny. While low coverage and high variance in sequencing resulted in a large number of missing values for some individuals, a novel method of imputation based on maximum likelihood marker ordering from initial marker segregation overcame the challenge of missing values, and made map construction computationally tractable. The two resulting parental maps contained 4521 and 2391 molecular markers spanning 462.7 and 376.6 cM respectively over seven linkage groups. Detection of precise genomic regions with segregation distortion was possible because of map saturation. Microsatellites (SSRs) linked these results to published maps for cross-validation and map comparison. Conclusions GBS together with genome-independent imputation provides a rapid method for genetic map construction in any pseudo-testcross progeny. Our method of imputation estimates the correct genotype call of missing values and corrects genotyping errors that lead to inflated map size and reduced precision in marker placement. Comparison of SSRs to published R. idaeus maps showed that the linkage maps constructed with GBS and our method of imputation were robust, and marker positioning reliable. The high marker density allowed identification of genomic regions with segregation distortion.

  3. Imputation-based population genetics analysis of Plasmodium falciparum malaria parasites.

    Science.gov (United States)

    Samad, Hanif; Coll, Francesc; Preston, Mark D; Ocholla, Harold; Fairhurst, Rick M; Clark, Taane G

    2015-04-01

    Whole-genome sequencing technologies are being increasingly applied to Plasmodium falciparum clinical isolates to identify genetic determinants of malaria pathogenesis. However, genome-wide discovery methods, such as haplotype scans for signatures of natural selection, are hindered by missing genotypes in sequence data. Poor correlation between single nucleotide polymorphisms (SNPs) in the P. falciparum genome complicates efforts to apply established missing-genotype imputation methods that leverage off patterns of linkage disequilibrium (LD). The accuracy of state-of-the-art, LD-based imputation methods (IMPUTE, Beagle) was assessed by measuring allelic r2 for 459 P. falciparum samples from malaria patients in 4 countries: Thailand, Cambodia, Gambia, and Malawi. In restricting our analysis to 86 k high-quality SNPs across the populations, we found that the complete-case analysis was restricted to 21k SNPs (24.5%), despite no single SNP having more than 10% missing genotypes. The accuracy of Beagle in filling in missing genotypes was consistently high across all populations (allelic r2, 0.87-0.96), but the performance of IMPUTE was mixed (allelic r2, 0.34-0.99) depending on reference haplotypes and population. Positive selection analysis using Beagle-imputed haplotypes identified loci involved in resistance to chloroquine (crt) in Thailand, Cambodia, and Gambia, sulfadoxine-pyrimethamine (dhfr, dhps) in Cambodia, and artemisinin (kelch13) in Cambodia. Tajima's D-based analysis identified genes under balancing selection that encode well-characterized vaccine candidates: apical merozoite antigen 1 (ama1) and merozoite surface protein 1 (msp1). In contrast, the complete-case analysis failed to identify any well-validated drug resistance or candidate vaccine loci, except kelch13. In a setting of low LD and modest levels of missing genotypes, using Beagle to impute P. falciparum genotypes is a viable strategy for conducting accurate large-scale population genetics and
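
    The accuracy metric used above, allelic r², is simply the squared Pearson correlation between imputed dosages and the true genotypes at a SNP; a minimal sketch is shown below. It assumes dosage-coded vectors and is not tied to Beagle's or IMPUTE's own output formats.

    ```python
    import numpy as np

    def allelic_r2(true_dosage, imputed_dosage):
        """Squared Pearson correlation between true and imputed allele dosages at one SNP."""
        t = np.asarray(true_dosage, dtype=float)
        d = np.asarray(imputed_dosage, dtype=float)
        if t.std() == 0.0 or d.std() == 0.0:          # monomorphic site or constant imputation
            return np.nan
        return np.corrcoef(t, d)[0, 1] ** 2
    ```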

  4. Geographical information systems

    DEFF Research Database (Denmark)

    Möller, Bernd

    2004-01-01

    The chapter gives an introduction to Geographical Information Systems (GIS) with particular focus on their application within environmental management.

  5. Geographical information systems

    DEFF Research Database (Denmark)

    Möller, Bernd

    2004-01-01

    The chapter gives an introduction to Geographical Information Systems (GIS) with particular focus on their application within environmental management.

  6. Genotyping-by-sequencing approach indicates geographic distance as the main factor affecting genetic structure and gene flow in Brazilian populations of Grapholita molesta (Lepidoptera, Tortricidae).

    Science.gov (United States)

    Silva-Brandão, Karina Lucas; Silva, Oscar Arnaldo Batista Neto E; Brandão, Marcelo Mendes; Omoto, Celso; Sperling, Felix A H

    2015-06-01

    The oriental fruit moth Grapholita molesta is one of the major pests of stone and pome fruit species in Brazil. Here, we applied 1226 SNPs obtained by genotyping-by-sequencing to test whether host species associations or other factors such as geographic distance structured populations of this pest. Populations from the main areas of occurrence of G. molesta were sampled principally from peach and apple orchards. Three main clusters were recovered by neighbor-joining analysis, all defined by geographic proximity between sampling localities. Overall genetic structure inferred by a nonhierarchical AMOVA resulted in a significant ΦST value = 0.19109. Here, we demonstrate for the first time that SNPs gathered by genotyping-by-sequencing can be used to infer genetic structure of a pest insect in Brazil; moreover, our results indicate that those markers are very informative even over a restricted geographic scale. We also demonstrate that host plant association has little effect on genetic structure among Brazilian populations of G. molesta; on the other hand, reduced gene flow promoted by geographic isolation has a stronger impact on population differentiation.

  7. Imputation of genotypes in Danish purebred and two-way crossbred pigs using low-density panels

    DEFF Research Database (Denmark)

    Xiang, Tao; Ma, Peipei; Ostersen, Tage;

    2015-01-01

    in crossbred animals and, in particular, in pigs. The extent and pattern of linkage disequilibrium differ in crossbred versus purebred animals, which may impact the performance of imputation. In this study, first we compared different scenarios of imputation from 5 K to 8 K single nucleotide polymorphisms...

  8. Effect of imputing markers from a low-density chip on the reliability of genomic breeding values in Holstein populations

    DEFF Research Database (Denmark)

    Dassonneville, R; Brøndum, Rasmus Froberg; Druet, T

    2011-01-01

    The purpose of this study was to investigate the imputation error and loss of reliability of direct genomic values (DGV) or genomically enhanced breeding values (GEBV) when using genotypes imputed from a 3,000-marker single nucleotide polymorphism (SNP) panel to a 50,000-marker SNP panel. Data co...

  9. 21 CFR 1404.630 - May the Office of National Drug Control Policy impute conduct of one person to another?

    Science.gov (United States)

    2010-04-01

    ...'s knowledge, approval or acquiescence. The organization's acceptance of the benefits derived from...: (a) Conduct imputed from an individual to an organization. We may impute the fraudulent, criminal, or... associated with an organization, to that organization when the improper conduct occurred in connection...

  10. 31 CFR 19.630 - May the Department of the Treasury impute conduct of one person to another?

    Science.gov (United States)

    2010-07-01

    ...'s knowledge, approval or acquiescence. The organization's acceptance of the benefits derived from...: (a) Conduct imputed from an individual to an organization. We may impute the fraudulent, criminal, or... associated with an organization, to that organization when the improper conduct occurred in connection...

  11. Strategies for imputation to whole genome sequence using a single or multi-breed reference population in cattle

    DEFF Research Database (Denmark)

    Brøndum, Rasmus Froberg; Guldbrandtsen, Bernt; Sahana, Goutam

    2014-01-01

    Background The advent of low cost next generation sequencing has made it possible to sequence a large number of dairy and beef bulls which can be used as a reference for imputation of whole genome sequence data. The aim of this study was to investigate the accuracy and speed of imputation from...

  12. A reference panel of 64,976 haplotypes for genotype imputation.

    Science.gov (United States)

    McCarthy, Shane; Das, Sayantan; Kretzschmar, Warren; Delaneau, Olivier; Wood, Andrew R; Teumer, Alexander; Kang, Hyun Min; Fuchsberger, Christian; Danecek, Petr; Sharp, Kevin; Luo, Yang; Sidore, Carlo; Kwong, Alan; Timpson, Nicholas; Koskinen, Seppo; Vrieze, Scott; Scott, Laura J; Zhang, He; Mahajan, Anubha; Veldink, Jan; Peters, Ulrike; Pato, Carlos; van Duijn, Cornelia M; Gillies, Christopher E; Gandin, Ilaria; Mezzavilla, Massimo; Gilly, Arthur; Cocca, Massimiliano; Traglia, Michela; Angius, Andrea; Barrett, Jeffrey C; Boomsma, Dorrett; Branham, Kari; Breen, Gerome; Brummett, Chad M; Busonero, Fabio; Campbell, Harry; Chan, Andrew; Chen, Sai; Chew, Emily; Collins, Francis S; Corbin, Laura J; Smith, George Davey; Dedoussis, George; Dorr, Marcus; Farmaki, Aliki-Eleni; Ferrucci, Luigi; Forer, Lukas; Fraser, Ross M; Gabriel, Stacey; Levy, Shawn; Groop, Leif; Harrison, Tabitha; Hattersley, Andrew; Holmen, Oddgeir L; Hveem, Kristian; Kretzler, Matthias; Lee, James C; McGue, Matt; Meitinger, Thomas; Melzer, David; Min, Josine L; Mohlke, Karen L; Vincent, John B; Nauck, Matthias; Nickerson, Deborah; Palotie, Aarno; Pato, Michele; Pirastu, Nicola; McInnis, Melvin; Richards, J Brent; Sala, Cinzia; Salomaa, Veikko; Schlessinger, David; Schoenherr, Sebastian; Slagboom, P Eline; Small, Kerrin; Spector, Timothy; Stambolian, Dwight; Tuke, Marcus; Tuomilehto, Jaakko; Van den Berg, Leonard H; Van Rheenen, Wouter; Volker, Uwe; Wijmenga, Cisca; Toniolo, Daniela; Zeggini, Eleftheria; Gasparini, Paolo; Sampson, Matthew G; Wilson, James F; Frayling, Timothy; de Bakker, Paul I W; Swertz, Morris A; McCarroll, Steven; Kooperberg, Charles; Dekker, Annelot; Altshuler, David; Willer, Cristen; Iacono, William; Ripatti, Samuli; Soranzo, Nicole; Walter, Klaudia; Swaroop, Anand; Cucca, Francesco; Anderson, Carl A; Myers, Richard M; Boehnke, Michael; McCarthy, Mark I; Durbin, Richard

    2016-10-01

    We describe a reference panel of 64,976 human haplotypes at 39,235,157 SNPs constructed using whole-genome sequence data from 20 studies of predominantly European ancestry. Using this resource leads to accurate genotype imputation at minor allele frequencies as low as 0.1% and a large increase in the number of SNPs tested in association studies, and it can help to discover and refine causal loci. We describe remote server resources that allow researchers to carry out imputation and phasing consistently and efficiently.

  13. Analyzing geographic clustered response

    Energy Technology Data Exchange (ETDEWEB)

    Merrill, D.W.; Selvin, S.; Mohr, M.S.

    1991-08-01

    In the study of geographic disease clusters, an alternative to traditional methods based on rates is to analyze case locations on a transformed map in which population density is everywhere equal. Although the analyst's task is thereby simplified, the specification of the density equalizing map projection (DEMP) itself is not simple and continues to be the subject of considerable research. Here a new DEMP algorithm is described, which avoids some of the difficulties of earlier approaches. The new algorithm (a) avoids illegal overlapping of transformed polygons; (b) finds the unique solution that minimizes map distortion; (c) provides constant magnification over each map polygon; (d) defines a continuous transformation over the entire map domain; (e) defines an inverse transformation; (f) can accept optional constraints such as fixed boundaries; and (g) can use commercially supported minimization software. Work is continuing to improve computing efficiency and improve the algorithm. 21 refs., 15 figs., 2 tabs.

  14. Geographic Media Literacy

    Science.gov (United States)

    Lukinbeal, Chris

    2014-01-01

    While the use of media permeates geographic research and pedagogic practice, the underlying literacies that link geography and media remain uncharted. This article argues that geographic media literacy incorporates visual literacy, information technology literacy, information literacy, and media literacy. Geographic media literacy is the ability…

  15. Geographic information system for Long Island: An epidemiologic systems approach to identify environmental breast cancer risks on Long Island. Phase 1

    Energy Technology Data Exchange (ETDEWEB)

    Barancik, J.I.; Kramer, C.F.; Thode, H.C. Jr.

    1995-12-01

    BNL is developing and implementing the project "Geographic Information System (GIS) for Long Island" to address the potential relationship of environmental and occupational exposures to breast cancer etiology on Long Island. The project is divided into two major phases: the four-month feasibility project (Phase 1), and the major development and implementation project (Phase 2). This report summarizes the work completed in the four-month Phase 1 Project, "Feasibility of a Geographic Information System for Long Island." It provides the baseline information needed to further define and prioritize the scope of work for subsequent tasks. Phase 2 will build upon this foundation to develop an operational GIS for the Long Island Breast Cancer Study Project (LIBCSP).

  16. Genotyping-by-sequencing approach indicates geographic distance as the main factor affecting genetic structure and gene flow in Brazilian populations of Grapholita molesta (Lepidoptera, Tortricidae)

    OpenAIRE

    Silva-Brandão, Karina Lucas; Oscar Arnaldo Batista Neto E Silva; Brandão, Marcelo Mendes; Omoto, Celso; Sperling, Felix A. H.

    2015-01-01

    The oriental fruit moth Grapholita molesta is one of the major pests of stone and pome fruit species in Brazil. Here, we applied 1226 SNPs obtained by genotyping-by-sequencing to test whether host species associations or other factors such as geographic distance structured populations of this pest. Populations from the main areas of occurrence of G. molesta were sampled principally from peach and apple orchards. Three main clusters were recovered by neighbor-joining analysis, all defined by g...

  17. Candidate gene analysis using imputed genotypes: cell cycle single-nucleotide polymorphisms and ovarian cancer risk

    DEFF Research Database (Denmark)

    Goode, Ellen L; Fridley, Brooke L; Vierkant, Robert A

    2009-01-01

    , CDK4, RB1, CDKN2D, and CCNE1) and one gene region (CDKN2A-CDKN2B). Because of the semi-overlapping nature of the 123 assayed tagging SNPs, we performed multiple imputation based on fastPHASE using data from White non-Hispanic study participants and participants in the international HapMap Consortium...... and National Institute of Environmental Health Sciences SNPs Program. Logistic regression assuming a log-additive model was done on combined and imputed data. We observed strengthened signals in imputation-based analyses at several SNPs, particularly CDKN2A-CDKN2B rs3731239; CCND1 rs602652, rs3212879, rs649392......, and rs3212891; CDK2 rs2069391, rs2069414, and rs17528736; and CCNE1 rs3218036. These results exemplify the utility of imputation in candidate gene studies and lend evidence to a role of cell cycle genes in ovarian cancer etiology, suggest a reduced set of SNPs to target in additional cases and controls....

  18. Consequences of splitting whole-genome sequencing effort over multiple breeds on imputation accuracy

    NARCIS (Netherlands)

    Bouwman, A.C.; Veerkamp, R.F.

    2014-01-01

    The aim of this study was to determine the consequences of splitting sequencing effort over multiple breeds for imputation accuracy from a high-density SNP chip towards whole-genome sequence. Such information would assist for instance numerical smaller cattle breeds, but also pig and chicken

  19. Mapping change of older forest with nearest-neighbor imputation and Landsat time-series

    Science.gov (United States)

    Janet L. Ohmann; Matthew J. Gregory; Heather M. Roberts; Warren B. Cohen; Robert E. Kennedy; Zhiqiang Yang

    2012-01-01

    The Northwest Forest Plan (NWFP), which aims to conserve late-successional and old-growth forests (older forests) and associated species, established new policies on federal lands in the Pacific Northwest USA. As part of monitoring for the NWFP, we tested nearest-neighbor imputation for mapping change in older forest, defined by threshold values for forest attributes...

  20. Multiple imputation strategies for zero-inflated cost data in economic evaluations : which method works best?

    NARCIS (Netherlands)

    MacNeil Vroomen, Janet; Eekhout, Iris; Dijkgraaf, Marcel G; van Hout, Hein; de Rooij, Sophia E; Heymans, Martijn W; Bosmans, Judith E

    2015-01-01

    Cost and effect data often have missing data because economic evaluations are frequently added onto clinical studies where cost data are rarely the primary outcome. The objective of this article was to investigate which multiple imputation strategy is most appropriate to use for missing cost-effectiveness data.

  1. Multiple Imputation for Multivariate Missing-Data Problems: A Data Analyst's Perspective.

    Science.gov (United States)

    Schafer, Joseph L.; Olsen, Maren K.

    1998-01-01

    The key ideas of multiple imputation for multivariate missing data problems are reviewed. Software programs available for this analysis are described, and their use is illustrated with data from the Adolescent Alcohol Prevention Trial (W. Hansen and J. Graham, 1991). (SLD)
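
    The pooling step at the heart of the MI workflow reviewed in this record is usually stated as Rubin's rules; in their standard form (not quoted from the article), the combined estimate and its total variance over m imputations are

    ```latex
    \bar{Q} = \frac{1}{m}\sum_{j=1}^{m}\hat{Q}_{j}, \qquad
    \bar{U} = \frac{1}{m}\sum_{j=1}^{m}U_{j}, \qquad
    B = \frac{1}{m-1}\sum_{j=1}^{m}\bigl(\hat{Q}_{j}-\bar{Q}\bigr)^{2}, \qquad
    T = \bar{U} + \Bigl(1+\frac{1}{m}\Bigr)B,
    ```

    where Q̂_j and U_j are the point estimate and its variance from the j-th completed data set.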

  2. Reporting the Use of Multiple Imputation for Missing Data in Higher Education Research

    Science.gov (United States)

    Manly, Catherine A.; Wells, Ryan S.

    2015-01-01

    Higher education researchers using survey data often face decisions about handling missing data. Multiple imputation (MI) is considered by many statisticians to be the most appropriate technique for addressing missing data in many circumstances. In particular, it has been shown to be preferable to listwise deletion, which has historically been a…

  3. Handling Missing Data: Analysis of a Challenging Data Set Using Multiple Imputation

    Science.gov (United States)

    Pampaka, Maria; Hutcheson, Graeme; Williams, Julian

    2016-01-01

    Missing data is endemic in much educational research. However, practices such as step-wise regression common in the educational research literature have been shown to be dangerous when significant data are missing, and multiple imputation (MI) is generally recommended by statisticians. In this paper, we provide a review of these advances and their…

  4. Missing Data and Multiple Imputation in the Context of Multivariate Analysis of Variance

    Science.gov (United States)

    Finch, W. Holmes

    2016-01-01

    Multivariate analysis of variance (MANOVA) is widely used in educational research to compare means on multiple dependent variables across groups. Researchers faced with the problem of missing data often use multiple imputation of values in place of the missing observations. This study compares the performance of 2 methods for combining p values in…

  5. Multiple imputation strategies for zero-inflated cost data in economic evaluations : which method works best?

    NARCIS (Netherlands)

    Vroomen, Janet MacNeil; Eekhout, Iris; Dijkgraaf, Marcel G.; van Hout, Hein; de Rooij, Sophia E.; Heymans, Martijn W.; Bosmans, Judith E.

    2016-01-01

    Cost and effect data often have missing data because economic evaluations are frequently added onto clinical studies where cost data are rarely the primary outcome. The objective of this article was to investigate which multiple imputation strategy is most appropriate to use for missing cost-effectiveness data.

  6. Effect of reference population size and available ancestor genotypes on imputation of Mexican Holstein genotypes

    Science.gov (United States)

    The effects of reference population size and the availability of information from genotyped ancestors on the accuracy of imputation of single nucleotide polymorphisms (SNPs) were investigated for Mexican Holstein cattle. Three scenarios for reference population size were examined: (1) a local popula...

  7. The roles of nearest neighbor methods in imputing missing data in forest inventory and monitoring databases

    Science.gov (United States)

    Bianca N. I. Eskelson; Hailemariam Temesgen; Valerie Lemay; Tara M. Barrett; Nicholas L. Crookston; Andrew T. Hudak

    2009-01-01

    Almost universally, forest inventory and monitoring databases are incomplete, ranging from missing data for only a few records and a few variables, common for small land areas, to missing data for many observations and many variables, common for large land areas. For a wide variety of applications, nearest neighbor (NN) imputation methods have been developed to fill in...

  8. Discovery and refinement of genetic loci associated with cardiometabolic risk using dense imputation maps

    NARCIS (Netherlands)

    Iotchkova, Valentina; Huang, Jie; Morris, John A.; Jain, Deepti; Barbieri, Caterina; Walter, Klaudia; Min, Josine L.; Chen, Lu; Astle, William; Cocca, Massimilian; Deelen, Patrick; Elding, Heather; Farmaki, Aliki-Eleni; Franklin, Christopher S.; Franberg, Mattias; Gaunt, Tom R.; Hofman, Albert; Jiang, Tao; Kleber, Marcus E.; Lachance, Genevieve; Luan, Jianan; Malerba, Giovanni; Matchan, Angela; Mead, Daniel; Memari, Yasin; Ntalla, Ioanna; Panoutsopoulou, Kalliope; Pazoki, Raha; Perry, John R. B.; Rivadeneira, Fernando; Sabater-Lleal, Maria; Sennblad, Bengt; Shin, So-Youn; Southam, Lorraine; Traglia, Michela; van Dijk, Freerk; van Leeuwen, Elisabeth M.; Zaza, Gianluigi; Zhang, Weihua; Amin, Najaf; Butterworth, Adam; Chambers, John C.; Dedoussis, George; Dehghan, Abbas; Franco, Oscar H.; Franke, Lude; Frontini, Mattia; Gambaro, Giovanni; Gasparini, Paolo; Hamsten, Anders; Issacs, Aaron; Kooner, Jaspal S.; Kooperberg, Charles; Langenberg, Claudia; Marz, Winfried; Scott, Robert A.; Swertz, Morris A.; Toniolo, Daniela; Uitterlinden, Andre G.; van Duijn, Cornelia M.; Watkins, Hugh; Zeggini, Eleftheria; Maurano, Mathew T.; Timpson, Nicholas J.; Reiner, Alexander P.; Auer, Paul L.; Soranzo, Nicole

    2016-01-01

    Large-scale whole-genome sequence data sets offer novel opportunities to identify genetic variation underlying human traits. Here we apply genotype imputation based on whole-genome sequence data from the UK10K and 1000 Genomes Project into 35,981 study participants of European ancestry, followed by

  9. Discovery and refinement of genetic loci associated with cardiometabolic risk using dense imputation maps

    NARCIS (Netherlands)

    V. Iotchkova (Valentina); J. Huang (Jian); Morris, J.A. (John A); Jain, D. (Deepti); C. Barbieri (Caterina); K. Walter (Klaudia); J. Min (Josine); L. Chen (Lu); Astle, W. (William); M. Cocca (Massimiliano); P. Deelen (Patrick); Elding, H. (Heather); A.-E. Farmaki (Aliki-Eleni); C.S. Franklin (Christopher); M. Frånberg (Mattias); T.R. Gaunt (Tom); Hofman, A. (Albert); Jiang, T. (Tao); M.E. Kleber (Marcus); G. Lachance (Genevieve); J. Luan (Jian'An); G. Malerba (Giovanni); A. Matchan (Angela); Mead, D. (Daniel); Y. Memari (Yasin); I. Ntalla (Ioanna); Panoutsopoulou, K. (Kalliope); R. Pazoki (Raha); J.R.B. Perry (John); F. Rivadeneira Ramirez (Fernando); M. Sabater-Lleal (Maria); B. Sennblad (Bengt); S.-Y. Shin; L. Southam (Lorraine); M. Traglia (Michela); F. van Dijk (Freerk); E.M. van Leeuwen (Elisa); G. Zaza (Gianluigi); W. Zhang (Weihua); N. Amin (Najaf); A.S. Butterworth (Adam); J.C. Chambers (John); G.V. Dedoussis (George); A. Dehghan (Abbas); O.H. Franco (Oscar); L. Franke (Lude); Frontini, M. (Mattia); Gambaro, G. (Giovanni); P. Gasparini (Paolo); A. Hamsten (Anders); Issacs, A. (Aaron); J.S. Kooner (Jaspal S.); C. Kooperberg (Charles); C. Langenberg (Claudia); W. März (Winfried); R.A. Scott (Robert); Swertz, M.A. (Morris A); D. Toniolo (Daniela); A.G. Uitterlinden (André); C.M. van Duijn (Cock); H. Watkins (Hugh); E. Zeggini (Eleftheria); M.T. Maurano (Matthew T.); N. Timpson (Nicholas); A. Reiner (Alexander); P. Auer (Paul); N. Soranzo (Nicole)

    2016-01-01

    Large-scale whole-genome sequence data sets offer novel opportunities to identify genetic variation underlying human traits. Here we apply genotype imputation based on whole-genome sequence data from the UK10K and 1000 Genomes Project into 35,981 study participants of European ancestry,

  10. Sixteen new lung function signals identified through 1000 Genomes Project reference panel imputation

    NARCIS (Netherlands)

    Artigas, Maria Soler; Wain, Louise V.; Miller, Suzanne; Kheirallah, Abdul Kader; Huffman, Jennifer E.; Ntalla, Ioanna; Shrine, Nick; Obeidat, Ma'en; Trochet, Holly; McArdle, Wendy L.; Alves, Alexessander Couto; Hui, Jennie; Zhao, Jing Hua; Joshi, Peter K.; Teumer, Alexander; Albrecht, Eva; Imboden, Medea; Rawal, Rajesh; Lopez, Lorna M.; Marten, Jonathan; Enroth, Stefan; Surakka, Ida; Polasek, Ozren; Lyytikainen, Leo-Pekka; Granell, Raquel; Hysi, Pirro G.; Flexeder, Claudia; Mahajan, Anubha; Beilby, John; Bosse, Yohan; Brandsma, Corry-Anke; Campbell, Harry; Gieger, Christian; Glaeser, Sven; Gonzalez, Juan R.; Grallert, Harald; Hammond, Chris J.; Harris, Sarah E.; Hartikainen, Anna-Liisa; Heliovaara, Markku; Henderson, John; Hocking, Lynne; Horikoshi, Momoko; Hutri-Kahonen, Nina; Ingelsson, Erik; Johansson, Asa; Kemp, John P.; Kolcic, Ivana; Kumar, Ashish; Lind, Lars; Melen, Erik; Musk, Arthur W.; Navarro, Pau; Nickle, David C.; Padmanabhan, Sandosh; Raitakari, Olli T.; Ried, Janina S.; Ripatti, Samuli; Schulz, Holger; Scott, Robert A.; Sin, Don D.; Starr, John M.; Vinuela, Ana; Voelzke, Henry; Wild, Sarah H.; Wright, Alan F.; Zemunik, Tatijana; Jarvis, Deborah L.; Spector, Tim D.; Evans, David M.; Lehtimaki, Terho; Vitart, Veronique; Kahonen, Mika; Gyllensten, Ulf; Rudan, Igor; Deary, Ian J.; Karrasch, Stefan; Probst-Hensch, Nicole M.; Heinrich, Joachim; Stubbe, Beate; Wilson, James F.; Wareham, Nicholas J.; James, Alan L.; Morris, Andrew P.; Jarvelin, Marjo-Riitta; Hayward, Caroline; Sayers, Ian; Strachan, David P.; Hall, Ian P.; Tobin, Martin D.; Deloukas, Panos; Hansell, Anna L.; Hubbard, Richard; Jackson, Victoria E.; Marchini, Jonathan; Pavord, Ian; Thomson, Neil C.; Zeggini, Eleftheria

    2015-01-01

    Lung function measures are used in the diagnosis of chronic obstructive pulmonary disease. In 38,199 European ancestry individuals, we studied genome-wide association of forced expiratory volume in 1 s (FEV1), forced vital capacity (FVC) and FEV1/FVC with 1000 Genomes Project (phase 1)-imputed genotypes

  11. Multiple imputation of missing values was not necessary before performing a longitudinal mixed-model analysis

    NARCIS (Netherlands)

    Twisk, J.; de Boer, M.; de Vente, W.; Heymans, M.

    2013-01-01

    Background and Objectives: As a result of the development of sophisticated techniques, such as multiple imputation, the interest in handling missing data in longitudinal studies has increased enormously in past years. Within the field of longitudinal data analysis, there is a current debate on wheth

  12. Missing Data and Multiple Imputation in the Context of Multivariate Analysis of Variance

    Science.gov (United States)

    Finch, W. Holmes

    2016-01-01

    Multivariate analysis of variance (MANOVA) is widely used in educational research to compare means on multiple dependent variables across groups. Researchers faced with the problem of missing data often use multiple imputation of values in place of the missing observations. This study compares the performance of 2 methods for combining p values in…

  13. Dwelling Price Ranking versus Socioeconomic Clustering: Possibility of Imputation

    Directory of Open Access Journals (Sweden)

    Fleishman Larisa

    2015-06-01

    Full Text Available In order to characterize the socioeconomic profile of various geographic units, it is common practice to use aggregated indices. However, the process of calculating such indices requires a wide variety of variables from various data sources available concurrently. Using a number of administrative databases for 2001 and 2003, this study examines the question of whether dwelling prices in a given locality can serve as a proxy for its socioeconomic level. Based on statistical and geographic criteria, we developed a Dwelling Price Ranking (DPR) methodology. Our findings show that the DPR can serve as a good approximation for the socioeconomic cluster (SEC) calculated by the Israel Central Bureau of Statistics for years when the required data was available. As opposed to the SEC, the suggested DPR indicator can easily be calculated, thus ensuring a continuum of socioeconomic index series. Both parametric and nonparametric statistical analyses have been carried out in order to examine the additional social, demographic, location, crime and security effects that are exogenous to SEC. Complementary analysis on recently published SEC series for 2006 and 2008 show that our conclusions remain valid. The proposed methodology and the obtained findings may be applicable for different statistical purposes in other countries which possess dwelling transactions data.

  14. Multiple imputation for assessment of exposures to drinking water contaminants: evaluation with the Atrazine Monitoring Program.

    Science.gov (United States)

    Jones, Rachael M; Stayner, Leslie T; Demirtas, Hakan

    2014-10-01

    Drinking water may contain pollutants that harm human health. The frequency of pollutant monitoring may occur quarterly, annually, or less frequently, depending upon the pollutant, the pollutant concentration, and community water system. However, birth and other health outcomes are associated with narrow time-windows of exposure. Infrequent monitoring impedes linkage between water quality and health outcomes for epidemiological analyses. The objective was to evaluate the performance of multiple imputation for filling in water quality values between measurements in community water systems (CWSs). The multiple imputation method was implemented in a simulated setting using data from the Atrazine Monitoring Program (AMP, 2006-2009 in five Midwestern states). Values were deleted from the AMP data to leave one measurement per month. Four patterns reflecting drinking water monitoring regulations were used to delete months of data in each CWS: three patterns were missing at random and one pattern was missing not at random. Synthetic health outcome data were created using a linear and a Poisson exposure-response relationship with five levels of hypothesized association, respectively. The multiple imputation method was evaluated by comparing the exposure-response relationships estimated based on multiply imputed data with the hypothesized association. The four patterns deleted 65-92% of the months of atrazine observations in the AMP data. Even with these high rates of missing information, our procedure was able to recover most of the missing information when the synthetic health outcome was included for missing at random patterns and for missing not at random patterns with low-to-moderate exposure-response relationships. Multiple imputation appears to be an effective method for filling in water quality values between measurements. Copyright © 2014 Elsevier Inc. All rights reserved.

  15. Impact of Missing Value Imputation on Classification for DNA Microarray Gene Expression Data—A Model-Based Study

    Directory of Open Access Journals (Sweden)

    Sun Youting

    2009-01-01

    Full Text Available Many missing-value (MV) imputation methods have been developed for microarray data, but only a few studies have investigated the relationship between MV imputation and classification accuracy. Furthermore, these studies are problematic in fundamental steps such as MV generation and classifier error estimation. In this work, we carry out a model-based study that addresses some of the issues in previous studies. Six popular imputation algorithms, two feature selection methods, and three classification rules are considered. The results suggest that it is beneficial to apply MV imputation when the noise level is high, variance is small, or gene-cluster correlation is strong, under small to moderate MV rates. In these cases, if data quality metrics are available, then it may be helpful to consider the data point with poor quality as missing and apply one of the most robust imputation algorithms to estimate the true signal based on the available high-quality data points. However, at large MV rates, we conclude that imputation methods are not recommended. Regarding the MV rate, our results indicate the presence of a peaking phenomenon: performance of imputation methods actually improves initially as the MV rate increases, but after an optimum point, performance quickly deteriorates with increasing MV rates.

  16. Accuracy of hemoglobin A1c imputation using fasting plasma glucose in diabetes research using electronic health records data

    Directory of Open Access Journals (Sweden)

    Stanley Xu

    2014-05-01

    Full Text Available In studies that use electronic health record data, imputation of important data elements such as Glycated hemoglobin (A1c) has become common. However, few studies have systematically examined the validity of various imputation strategies for missing A1c values. We derived a complete dataset using an incident diabetes population that has no missing values in A1c, fasting and random plasma glucose (FPG and RPG), age, and gender. We then created missing A1c values under two assumptions: missing completely at random (MCAR) and missing at random (MAR). We then imputed A1c values, compared the imputed values to the true A1c values, and used these data to assess the impact of A1c on initiation of antihyperglycemic therapy. Under MCAR, imputation of A1c based on FPG (1) estimated a continuous A1c within ± 1.88% of the true A1c 68.3% of the time; (2) estimated a categorical A1c within ± one category from the true A1c about 50% of the time. Including RPG in imputation slightly improved the precision but did not improve the accuracy. Under MAR, including gender and age in addition to FPG improved the accuracy of imputed continuous A1c but not categorical A1c. Moreover, imputation of up to 33% of missing A1c values did not change the accuracy and precision and did not alter the impact of A1c on initiation of antihyperglycemic therapy. When using A1c values as a predictor variable, a simple imputation algorithm based only on age, sex, and fasting plasma glucose gave acceptable results.
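
    A minimal version of the FPG-based imputation evaluated above is an ordinary least-squares regression of A1c on fasting glucose (plus age and sex, as in the MAR scenario), fit on complete records and used to fill the gaps. Variable names and the choice of OLS are assumptions for illustration; the study's own algorithm is not reproduced here.

    ```python
    import numpy as np

    def impute_a1c(a1c, fpg, age, sex):
        """Fill NaN A1c values from FPG, age and sex using OLS fit on complete cases."""
        a1c = np.asarray(a1c, dtype=float).copy()
        X = np.column_stack([np.ones(len(a1c)), fpg, age, sex])
        obs = ~np.isnan(a1c)
        beta, *_ = np.linalg.lstsq(X[obs], a1c[obs], rcond=None)
        a1c[~obs] = X[~obs] @ beta
        return a1c
    ```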

  17. Airports Geographic Information System -

    Data.gov (United States)

    Department of Transportation — The Airports Geographic Information System maintains the airport and aeronautical data required to meet the demands of the Next Generation National Airspace System....

  18. An R function for imputation of missing cells in two-way data sets by EM-AMMI algorithm

    Directory of Open Access Journals (Sweden)

    Jakub Paderewski

    2014-06-01

    Full Text Available Various statistical methods for two-way classification data sets (including AMMI or GGE analyses), used in crop science for interpreting genotype-by-environment interaction, require the data to be complete, that is, not to have missing cells. If there are such, however, one might impute the missing cells. The paper offers R code for imputing missing values by the EM-AMMI algorithm. In addition, a function to check the repeatability of this algorithm is proposed. This function could be used to evaluate if the missing data were imputed reliably (unambiguously), which is important especially for small data sets.
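
    The R function itself is not reproduced here; a rough sketch of the same EM-AMMI idea, written in Python, fills the missing cells, fits additive genotype and environment effects plus a low-rank multiplicative term via an SVD of the residuals, updates the missing cells from the fit, and repeats. The number of multiplicative terms and the convergence rule are arbitrary choices.

    ```python
    import numpy as np

    def em_ammi_impute(Y, n_pc=2, n_iter=200, tol=1e-6):
        """Impute NaN cells of a genotype-by-environment matrix with an AMMI-style model."""
        Y = Y.astype(float).copy()
        miss = np.isnan(Y)
        Y[miss] = np.nanmean(Y)                           # start from the grand mean
        for _ in range(n_iter):
            mu = Y.mean()
            g = Y.mean(axis=1) - mu                       # genotype main effects
            e = Y.mean(axis=0) - mu                       # environment main effects
            additive = mu + g[:, None] + e[None, :]
            U, s, Vt = np.linalg.svd(Y - additive, full_matrices=False)
            s[n_pc:] = 0.0                                # keep the first n_pc multiplicative terms
            fitted = additive + (U * s) @ Vt
            change = np.abs(fitted[miss] - Y[miss]).max() if miss.any() else 0.0
            Y[miss] = fitted[miss]
            if change < tol:
                break
        return Y
    ```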

  19. Assessing Geographic Information Enhancement

    NARCIS (Netherlands)

    Van Loenen, B.; Zevenbergen, J.

    2010-01-01

    Assessment of geographic information infrastructures (or spatial data infrastructures) is increasingly attracting the attention of researchers in the Geographic information (GI) domain. Especially the assessment of value added GI appears to be complex. By applying the concept of value chain analysis

  20. Environmental geographic information system.

    Energy Technology Data Exchange (ETDEWEB)

    Peek, Dennis W; Helfrich, Donald Alan; Gorman, Susan

    2010-08-01

    This document describes how the Environmental Geographic Information System (EGIS) was used, along with externally received data, to create maps for the Site-Wide Environmental Impact Statement (SWEIS) Source Document project. Data quality among the various classes of geographic information system (GIS) data is addressed. A complete listing of map layers used is provided.

  1. Application of a geographical information system approach for risk analysis of fascioliasis in southern Espírito Santo state, Brazil.

    Science.gov (United States)

    Martins, Isabella Vilhena Freire; de Avelar, Barbara Rauta; Pereira, Maria Julia Salim; da Fonseca, Adevair Henrique

    2012-09-01

    A model based on geographical information systems for mapping the risk of fascioliasis was developed for the southern part of Espírito Santo state, Brazil. The determinants investigated were precipitation, temperature, elevation, slope, soil type and land use. Weightings and grades were assigned to determinants and their categories according to their relevance with respect to fascioliasis. Theme maps depicting the spatial distribution of risk areas indicate that over 50% of southern Espírito Santo is either at high or at very high risk for fascioliasis. These areas were found to be characterized by comparatively high temperature but relatively low slope, low precipitation and low elevation corresponding to periodically flooded grasslands or soils that promote water retention.

  2. Analyses of Sensitivity to the Missing-at-Random Assumption Using Multiple Imputation With Delta Adjustment: Application to a Tuberculosis/HIV Prevalence Survey With Incomplete HIV-Status Data.

    Science.gov (United States)

    Leacy, Finbarr P; Floyd, Sian; Yates, Tom A; White, Ian R

    2017-01-10

    Multiple imputation with delta adjustment provides a flexible and transparent means to impute univariate missing data under general missing-not-at-random mechanisms. This facilitates the conduct of analyses assessing sensitivity to the missing-at-random (MAR) assumption. We review the delta-adjustment procedure and demonstrate how it can be used to assess sensitivity to departures from MAR, both when estimating the prevalence of a partially observed outcome and when performing parametric causal mediation analyses with a partially observed mediator. We illustrate the approach using data from 34,446 respondents to a tuberculosis and human immunodeficiency virus (HIV) prevalence survey that was conducted as part of the Zambia-South Africa TB and AIDS Reduction Study (2006-2010). In this study, information on partially observed HIV serological values was supplemented by additional information on self-reported HIV status. We present results from 2 types of sensitivity analysis: The first assumed that the degree of departure from MAR was the same for all individuals with missing HIV serological values; the second assumed that the degree of departure from MAR varied according to an individual's self-reported HIV status. Our analyses demonstrate that multiple imputation offers a principled approach by which to incorporate auxiliary information on self-reported HIV status into analyses based on partially observed HIV serological values.
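
    A stripped-down version of the delta-adjustment step described above: impute a binary indicator under a MAR model, then tilt the imputation log-odds by a sensitivity parameter δ before drawing, and repeat the whole analysis over a grid of δ values. The logistic imputation model, variable names, and the omission of proper between-imputation parameter draws are simplifications for illustration, not the survey's actual procedure.

    ```python
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def impute_with_delta(y, X, delta, rng=None):
        """Impute missing binary y under MAR, then shift the imputation log-odds by delta (MNAR)."""
        rng = np.random.default_rng(rng)
        y = np.asarray(y, dtype=float).copy()
        obs = ~np.isnan(y)
        model = LogisticRegression(max_iter=1000).fit(X[obs], y[obs].astype(int))
        logit = X[~obs] @ model.coef_.ravel() + model.intercept_[0]
        p = 1.0 / (1.0 + np.exp(-(logit + delta)))    # delta = 0 recovers the MAR imputation
        y[~obs] = rng.binomial(1, p)
        return y

    # sensitivity analysis: re-run the imputation and the downstream analysis for each delta
    # for delta in (-2, -1, 0, 1, 2): ...
    ```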

  3. Using mi impute chained to fit ANCOVA models in randomized trials with censored dependent and independent variables

    DEFF Research Database (Denmark)

    Andersen, Andreas; Rieckmann, Andreas

    2016-01-01

    In this article, we illustrate how to use mi impute chained with intreg to fit an analysis of covariance model to censored and nondetectable immunological concentrations measured in a randomized pretest–posttest design.

  4. Geographical National Condition and Complex System

    Directory of Open Access Journals (Sweden)

    WANG Jiayao

    2016-01-01

    Full Text Available The significance of studying the complex system of geographical national conditions lies in rationally expressing the complex relationships of the “resources-environment-ecology-economy-society” system. Addressing the problems faced by the statistical analysis of geographical national conditions, including disunity of research contents, inconsistency of scope and uncertainty of goals, the present paper discusses the topic from the perspectives of concept, theory and method, and designs solutions based on complex system theory and coordination-degree analysis. By analyzing the concepts of geographical national conditions, the geographical national conditions survey and geographical national conditions statistical analysis, as well as the relationships between them, the statistical contents and analytical scope of geographical national conditions are clarified and defined. The investigation also clarifies the goals of the statistical analysis by examining the basic characteristics of geographical national conditions and of complex systems, and the consistency between coordination-degree analysis and statistical analysis. It proposes and describes a concept for the complex system of geographical national conditions. Complex system theory provides new theoretical guidance for the statistical analysis of geographical national conditions, while the degree of coordination offers new approaches to carrying out the analysis, based on the measurement method and decision-making analysis scheme on which the complex system of geographical national conditions rests: the overall trend is analyzed via the coordination degree of the complex system at the macro level, and the direction of remediation is determined at the micro level from the coordination degree among subsystems and within single systems. These results establish

  5. Sensitivity to imputation models and assumptions in receiver operating characteristic analysis with incomplete data.

    Science.gov (United States)

    Karakaya, Jale; Karabulut, Erdem; Yucel, Recai M

    Modern statistical methods using incomplete data have been increasingly applied in a wide variety of substantive problems. Similarly, receiver operating characteristic (ROC) analysis, a method used in evaluating diagnostic tests or biomarkers in medical research, has also become increasingly popular, both in its development and in its application. While missing-data methods have been applied in ROC analysis, the impact of model mis-specification and/or assumptions (e.g. missing at random) underlying the missing data has not been thoroughly studied. In this work, we study the performance of multiple imputation (MI) inference in ROC analysis. Particularly, we investigate parametric and non-parametric techniques for MI inference under common missingness mechanisms. Depending on the coherency of the imputation model with the underlying data generation mechanism, our results show that MI generally leads to well-calibrated inferences under ignorable missingness mechanisms.

  6. MULTIMEDIA ON GEOGRAPHIC NETWORK

    OpenAIRE

    Merlanti, Danilo

    2012-01-01

    In this thesis we investigate the topic of the multimedia contents distribution on a geographic network which is a rarefied and huge field. First of all we have to classify the main parts necessary in the multimedia distribution on a geographic network. The main aspects of a geographic network that will be highlighted in this thesis are: the mechanism used to retrieve the sources of the multimedia content; in the case of the peer-to-peer network on geographic network one of t...

  7. [Multiple imputation and complete case analysis in logistic regression models: a practical assessment of the impact of incomplete covariate data].

    Science.gov (United States)

    Camargos, Vitor Passos; César, Cibele Comini; Caiaffa, Waleska Teixeira; Xavier, Cesar Coelho; Proietti, Fernando Augusto

    2011-12-01

    Researchers in the health field often deal with the problem of incomplete databases. Complete Case Analysis (CCA), which restricts the analysis to subjects with complete data, reduces the sample size and may result in biased estimates. Based on statistical grounds, Multiple Imputation (MI) uses all collected data and is recommended as an alternative to CCA. Data from the study Saúde em Beagá, attended by 4,048 adults from two of nine health districts in the city of Belo Horizonte, Minas Gerais State, Brazil, in 2008-2009, were used to evaluate CCA and different MI approaches in the context of logistic models with incomplete covariate data. Peculiarities in some variables in this study allowed analyzing a situation in which the missing covariate data are recovered and thus the results before and after recovery are compared. Based on the analysis, even the more simplistic MI approach performed better than CCA, since it was closer to the post-recovery results.

  8. Imputating missing values in diary records of sun-exposure study

    DEFF Research Database (Denmark)

    Have, Anna Szynkowiak; Philipsen, Peter Alshede; Larsen, Jan

    2001-01-01

    In a sun-exposure study, questionnaires concerning sun-habits were collected from 195 subjects. This paper focuses on the general problem of missing data values, which occurs when some, or even all, of the questions have not been answered in a questionnaire. Here, only missing values of low concentration are investigated. We consider and compare two different models for imputing missing values: the Gaussian model and the non-parametric K-nearest neighbor model.
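
    A minimal sketch of the two model families named above, applied to made-up questionnaire data (the sample sizes and variable names are assumptions, not the study's): scikit-learn's IterativeImputer with its default linear-Gaussian model stands in for the Gaussian model, and KNNImputer for the non-parametric K-nearest neighbour model.

      # Illustrative comparison of a Gaussian (conditional-mean) imputation and a
      # K-nearest-neighbour imputation on simulated questionnaire-style data.
      import numpy as np
      from sklearn.experimental import enable_iterative_imputer  # noqa: F401
      from sklearn.impute import IterativeImputer, KNNImputer

      rng = np.random.default_rng(1)
      answers = rng.normal(size=(195, 6))          # 195 subjects, 6 questionnaire items
      mask = rng.random(answers.shape) < 0.05      # a low concentration of missing values
      incomplete = np.where(mask, np.nan, answers)

      gaussian_fill = IterativeImputer(random_state=0).fit_transform(incomplete)  # linear-Gaussian model
      knn_fill = KNNImputer(n_neighbors=5).fit_transform(incomplete)              # non-parametric KNN

      for name, filled in [("Gaussian", gaussian_fill), ("KNN", knn_fill)]:
          rmse = np.sqrt(np.mean((filled[mask] - answers[mask]) ** 2))
          print(f"{name} imputation RMSE: {rmse:.3f}")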

  9. Semi-empirical Likelihood Confidence Intervals for the Differences of Two Populations Based on Fractional Imputation

    Institute of Scientific and Technical Information of China (English)

    BAI YUN-XIA; QIN YONG-SONG; WANG LI-RONG; LI LING

    2009-01-01

    Suppose that there are two populations x and y with missing data on both of them, where x has a distribution function F(.) which is unknown and y has a distribution depending on some unknown parameter θ. Fractional imputation is used to fill in the missing data. The asymptotic distributions of the semi-empirical likelihood ratio statistic are obtained under some mild conditions. Then, empirical likelihood confidence intervals on the differences of x and y are constructed.

  10. Effects of height and live crown ratio imputation strategies on stand biomass estimation

    Science.gov (United States)

    Elijah J. Allensworth; Temesgen. Hailemariam

    2015-01-01

    The effects of subsample design and imputation of total height (ht) and live crown ratio (cr) on the accuracy of stand-level estimates of component and total aboveground biomass are not well investigated in the current body of literature. To assess this gap in research, this study uses a data set of 3,454 Douglas-fir trees obtained from 102 stands in southwestern...

  11. Bootstrap imputation with a disease probability model minimized bias from misclassification due to administrative database codes.

    Science.gov (United States)

    van Walraven, Carl

    2017-04-01

    Diagnostic codes used in administrative databases cause bias due to misclassification of patient disease status. It is unclear which methods minimize this bias. Serum creatinine measures were used to determine severe renal failure status in 50,074 hospitalized patients. The true prevalence of severe renal failure and its association with covariates were measured. These were compared to results for which renal failure status was determined using surrogate measures including the following: (1) diagnostic codes; (2) categorization of probability estimates of renal failure determined from a previously validated model; or (3) bootstrap imputation of disease status using model-derived probability estimates. Bias in estimates of severe renal failure prevalence and its association with covariates was minimal when bootstrap methods were used to impute renal failure status from model-based probability estimates. In contrast, biases were extensive when renal failure status was determined using codes or methods in which model-based condition probability was categorized. Bias due to misclassification from inaccurate diagnostic codes can be minimized using bootstrap methods to impute condition status using multivariable model-derived probability estimates. Copyright © 2017 Elsevier Inc. All rights reserved.
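
    A minimal sketch of the bootstrap-style imputation idea described above, on simulated data with a hypothetical age covariate (nothing here reproduces the study's validated model or database): condition status is drawn repeatedly from model-based probabilities, rather than taken from codes or a thresholded probability, and the analysis is repeated over the draws.

      # Illustrative only: impute a binary condition from model-based probabilities.
      import numpy as np

      rng = np.random.default_rng(2)
      n = 50_074
      age = rng.normal(65, 10, size=n)
      true_p = 1 / (1 + np.exp(-(0.05 * (age - 65) - 2.5)))
      true_status = rng.random(n) < true_p               # ground-truth status, for reference only

      # Probabilities from a previously validated disease-probability model
      # (simulated here with a little noise around the true model).
      p_model = 1 / (1 + np.exp(-(0.05 * (age - 65) - 2.5 + rng.normal(scale=0.2, size=n))))

      prevalence, age_gap = [], []
      for b in range(200):                               # bootstrap-style imputation draws
          imputed = rng.random(n) < p_model              # draw condition status from the model
          prevalence.append(imputed.mean())
          age_gap.append(age[imputed].mean() - age[~imputed].mean())

      print("imputed prevalence:", np.mean(prevalence), "vs true:", true_status.mean())
      print("age difference (cases minus non-cases):", np.mean(age_gap))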

  12. TRANSPOSABLE REGULARIZED COVARIANCE MODELS WITH AN APPLICATION TO MISSING DATA IMPUTATION.

    Science.gov (United States)

    Allen, Genevera I; Tibshirani, Robert

    2010-06-01

    Missing data estimation is an important challenge with high-dimensional data arranged in the form of a matrix. Typically this data matrix is transposable, meaning that either the rows, columns or both can be treated as features. To model transposable data, we present a modification of the matrix-variate normal, the mean-restricted matrix-variate normal, in which the rows and columns each have a separate mean vector and covariance matrix. By placing additive penalties on the inverse covariance matrices of the rows and columns, these so called transposable regularized covariance models allow for maximum likelihood estimation of the mean and non-singular covariance matrices. Using these models, we formulate EM-type algorithms for missing data imputation in both the multivariate and transposable frameworks. We present theoretical results exploiting the structure of our transposable models that allow these models and imputation methods to be applied to high-dimensional data. Simulations and results on microarray data and the Netflix data show that these imputation techniques often outperform existing methods and offer a greater degree of flexibility.

  13. Missing Data Imputation of Solar Radiation Data under Different Atmospheric Conditions

    Directory of Open Access Journals (Sweden)

    Concepción Crespo Turrado

    2014-10-01

    Full Text Available Global solar broadband irradiance on a planar surface is measured at weather stations by pyranometers. In the case of the present research, solar radiation values from nine meteorological stations of the MeteoGalicia real-time observational network, captured and stored every ten minutes, are considered. In this kind of record, the lack of data and/or the presence of wrong values adversely affects any time series study. Consequently, when this occurs, a data imputation process must be performed in order to replace missing data with estimated values. This paper aims to evaluate the multivariate imputation of ten-minute scale data by means of the chained equations method (MICE). This method allows the network itself to impute the missing or wrong data of a solar radiation sensor, by using either all or just a group of the measurements of the remaining sensors. Very good results have been obtained with the MICE method in comparison with other methods employed in this field such as Inverse Distance Weighting (IDW) and Multiple Linear Regression (MLR). The average RMSE value of the predictions for the MICE algorithm was 13.37%, while it was 28.19% for the MLR and 31.68% for the IDW.

  14. Missing data imputation of solar radiation data under different atmospheric conditions.

    Science.gov (United States)

    Turrado, Concepción Crespo; López, María Del Carmen Meizoso; Lasheras, Fernando Sánchez; Gómez, Benigno Antonio Rodríguez; Rollé, José Luis Calvo; Juez, Francisco Javier de Cos

    2014-10-29

    Global solar broadband irradiance on a planar surface is measured at weather stations by pyranometers. In the case of the present research, solar radiation values from nine meteorological stations of the MeteoGalicia real-time observational network, captured and stored every ten minutes, are considered. In this kind of record, the lack of data and/or the presence of wrong values adversely affects any time series study. Consequently, when this occurs, a data imputation process must be performed in order to replace missing data with estimated values. This paper aims to evaluate the multivariate imputation of ten-minute scale data by means of the chained equations method (MICE). This method allows the network itself to impute the missing or wrong data of a solar radiation sensor, by using either all or just a group of the measurements of the remaining sensors. Very good results have been obtained with the MICE method in comparison with other methods employed in this field such as Inverse Distance Weighting (IDW) and Multiple Linear Regression (MLR). The average RMSE value of the predictions for the MICE algorithm was 13.37%, while it was 28.19% for the MLR and 31.68% for the IDW.
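
    A hedged sketch of the MICE idea described in the two records above, using simulated ten-minute irradiance series for nine correlated stations (the data generator and scikit-learn's IterativeImputer are stand-ins, not the authors' MICE implementation or the MeteoGalicia data):

      # Each station's gaps are filled from the other stations' simultaneous readings
      # by chained regressions, in the spirit of MICE.
      import numpy as np
      from sklearn.experimental import enable_iterative_imputer  # noqa: F401
      from sklearn.impute import IterativeImputer

      rng = np.random.default_rng(3)
      t = np.arange(0, 24 * 6) / 6.0                          # ten-minute time steps over one day
      base = np.clip(np.sin((t - 6) / 12 * np.pi), 0, None)   # shared diurnal irradiance shape
      stations = base[:, None] * rng.uniform(700, 1000, size=9) \
                 + rng.normal(scale=20, size=(t.size, 9))     # nine correlated stations

      mask = rng.random(stations.shape) < 0.1                 # 10% missing or flagged-wrong values
      observed = np.where(mask, np.nan, stations)

      mice = IterativeImputer(max_iter=20, random_state=0).fit_transform(observed)
      rmse = np.sqrt(np.mean((mice[mask] - stations[mask]) ** 2))
      print(f"MICE-style reconstruction RMSE: {rmse:.1f} W/m2")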

  15. Impacts of the Nakhodka heavy-oil spill on an intertidal ecosystem: an approach to impact evaluation using geographical information system.

    Science.gov (United States)

    Teruhisa, Komatsu; Masahiro, Nakaoka; Hiroshi, Kawai; Tomoko, Yamamoto; Kouichi, Ohwada

    2003-01-01

    A major heavy-oil spill from the Russian tanker Nakhodka occurred in the Sea of Japan on 2 January 1997. We investigated the impacts of this spill on a rocky intertidal ecosystem along the southern coast of the Sea of Japan. We selected Imago-Ura Cove as our study site to observe temporal changes along the oiled shore, because minimal cleaning effort was made in this area. Field surveys were conducted every autumn and spring from 1997 to 2000. We measured coverage by macroalgae in 1 x 1-m(2) quadrats and counted the animals in 5 x 5-m(2) quadrats along the intertidal zone. Changes in the ecosystem caused by the oil spill were analyzed by applying a geographical information system (GIS) to the Sea of Japan for the first time. The GIS showed that following the accident there were heavily oiled areas in sheltered regions, but these decreased over the three years. It also showed that coverage by macroalgae and the number of animals increased, although some species of algae with microscopic sporophyte generations, and some populations of perennial shellfish, remained stable or decreased during the study period. GIS was able to trace temporal changes in intertidal communities resulting from the impacts of heavy oil on flora and fauna at a spatial scale of 10-100 m. GIS is thus a practical tool for visualizing, analyzing, and monitoring changes in an ecosystem polluted by oil, taking into account topographic differences along the coastline.

  16. Potential of LC-MS phenolic profiling combined with multivariate analysis as an approach for the determination of the geographical origin of north Moroccan virgin olive oils.

    Science.gov (United States)

    Bajoub, Aadil; Carrasco-Pancorbo, Alegría; Ajal, El Amine; Ouazzani, Noureddine; Fernández-Gutiérrez, Alberto

    2015-01-01

    The applicability of two different platforms (LC-ESI-TOF MS and LC-ESI-IT MS) as powerful tools for the characterisation and subsequent quantification of the phenolic compounds present in north Moroccan virgin olive oils was assessed in this study. 156 olive samples of the "Picholine Marocaine" cultivar grown in 7 Moroccan regions were collected and their olive oils extracted. The phenolic profiles of these olive oils were studied using a resolutive chromatographic method coupled to ESI-TOF MS (for initial characterisation purposes) and coupled to ESI-IT MS (for further identification and quantification). 25 phenolic compounds belonging to different chemical families were identified and quantified. Secoiridoids were the most abundant phenols in all the samples, followed by phenolic alcohols, lignans and flavonoids, respectively. To test the ability of phenolic profiles to trace the geographical origin of the investigated oils, multivariate analysis tools were used, obtaining a good rate of correct classification and prediction using a cross-validation procedure. Copyright © 2014 Elsevier Ltd. All rights reserved.

  17. A novel approach to parasite population genetics: experimental infection reveals geographic differentiation, recombination and host-mediated population structure in Pasteuria ramosa, a bacterial parasite of Daphnia.

    Science.gov (United States)

    Andras, J P; Ebert, D

    2013-02-01

    The population structure of parasites is central to the ecology and evolution of host-parasite systems. Here, we investigate the population genetics of Pasteuria ramosa, a bacterial parasite of Daphnia. We used natural P. ramosa spore banks from the sediments of two geographically well-separated ponds to experimentally infect a panel of Daphnia magna host clones whose resistance phenotypes were previously known. In this way, we were able to assess the population structure of P. ramosa based on geography, host resistance phenotype and host genotype. Overall, genetic diversity of P. ramosa was high, and nearly all infected D. magna hosted more than one parasite haplotype. On the basis of the observation of recombinant haplotypes and relatively low levels of linkage disequilibrium, we conclude that P. ramosa engages in substantial recombination. Isolates were strongly differentiated by pond, indicating that gene flow is spatially restricted. Pasteuria ramosa isolates within one pond were segregated completely based on the resistance phenotype of the host-a result that, to our knowledge, has not been previously reported for a nonhuman parasite. To assess the comparability of experimental infections with natural P. ramosa isolates, we examined the population structure of naturally infected D. magna native to one of the two source ponds. We found that experimental and natural infections of the same host resistance phenotype from the same source pond were indistinguishable, indicating that experimental infections provide a means to representatively sample the diversity of P. ramosa while reducing the sampling bias often associated with studies of parasite epidemics. These results expand our knowledge of this model parasite, provide important context for the large existing body of research on this system and will guide the design of future studies of this host-parasite system.

  18. Effects of Different Missing Data Imputation Techniques on the Performance of Undiagnosed Diabetes Risk Prediction Models in a Mixed-Ancestry Population of South Africa.

    Directory of Open Access Journals (Sweden)

    Katya L Masconi

    Full Text Available Imputation techniques used to handle missing data are based on the principle of replacement. It is widely advocated that multiple imputation is superior to other imputation methods, however studies have suggested that simple methods for filling missing data can be just as accurate as complex methods. The objective of this study was to implement a number of simple and more complex imputation methods, and assess the effect of these techniques on the performance of undiagnosed diabetes risk prediction models during external validation.Data from the Cape Town Bellville-South cohort served as the basis for this study. Imputation methods and models were identified via recent systematic reviews. Models' discrimination was assessed and compared using C-statistic and non-parametric methods, before and after recalibration through simple intercept adjustment.The study sample consisted of 1256 individuals, of whom 173 were excluded due to previously diagnosed diabetes. Of the final 1083 individuals, 329 (30.4% had missing data. Family history had the highest proportion of missing data (25%. Imputation of the outcome, undiagnosed diabetes, was highest in stochastic regression imputation (163 individuals. Overall, deletion resulted in the lowest model performances while simple imputation yielded the highest C-statistic for the Cambridge Diabetes Risk model, Kuwaiti Risk model, Omani Diabetes Risk model and Rotterdam Predictive model. Multiple imputation only yielded the highest C-statistic for the Rotterdam Predictive model, which were matched by simpler imputation methods.Deletion was confirmed as a poor technique for handling missing data. However, despite the emphasized disadvantages of simpler imputation methods, this study showed that implementing these methods results in similar predictive utility for undiagnosed diabetes when compared to multiple imputation.

  19. The relationship between geographical and social space and approaches to care among rural and urban caregivers caring for a family member with Dementia: a qualitative study.

    Science.gov (United States)

    Ehrlich, Kethy; Emami, Azita; Heikkilä, Kristiina

    2017-01-17

    Knowledge about family caregivers in rural areas remains sparse. No studies to date have addressed the sociocultural aspects in caregiving, thus neglecting potentially significant data. This study aimed to explore and better understand family caregivers' experiences in rural and urban areas and the sociocultural spheres that these two areas represent. How do family caregivers approach their caregiving situation? A hermeneutical approach was chosen to uncover the underlying meanings of experiences. Open-ended in-depth interviews were conducted. The ontological and epistemological roots are based on hermeneutic philosophy, where a human being's existence is viewed as socially constructed. The study followed a purposeful sampling strategy. Semi-structured in-depth interviews were conducted with 12 rural and 11 urban family caregivers to persons with dementia. These were then analyzed in accordance with the hermeneutical process. The findings provide insight into the variations of family caregiver approaches to caregiving in rural and urban areas of Sweden. There seemed to be a prevalence of a more accepting and maintaining approach in the rural areas as compared to the urban areas, where caregiving was more often viewed as an obligation and something that limited one's space. Differences in the construction of family identity seemed to influence the participants' approach to family caregiving. Therefore, community-based caregiving for the elderly needs to become aware of how living within a family differs and how this affects their views on being a caregiver. Thus, support systems must be individually adjusted to each family's lifestyles so that they are more in tune with their everyday lives.

  20. Geographic Information Systems.

    Science.gov (United States)

    Wieczorek, William F; Delmerico, Alan M

    2009-01-01

    This chapter presents an overview of the development, capabilities, and utilization of geographic information systems (GIS). There are nearly an unlimited number of applications that are relevant to GIS because virtually all human interactions, natural and man-made features, resources, and populations have a geographic component. Everything happens somewhere and the location often has a role that affects what occurs. This role is often called spatial dependence or spatial autocorrelation, which exists when a phenomenon is not randomly geographically distributed. GIS has a number of key capabilities that are needed to conduct a spatial analysis to assess this spatial dependence. This chapter presents these capabilities (e.g., georeferencing, adjacency/distance measures, overlays) and provides a case study to illustrate how GIS can be used for both research and planning. Although GIS has developed into a relatively mature application for basic functions, development is needed to more seamlessly integrate spatial statistics and models.The issue of location, especially the geography of human activities, interactions between humanity and nature, and the distribution and location of natural resources and features, is one of the most basic elements of scientific inquiry. Conceptualizations and physical maps of geographic space have existed since the beginning of time because all human activity takes place in a geographic context. Representing objects in space, basically where things are located, is a critical aspect of the natural, social, and applied sciences. Throughout history there have been many methods of characterizing geographic space, especially maps created by artists, mariners, and others eventually leading to the development of the field of cartography. It is no surprise that the digital age has launched a major effort to utilize geographic data, but not just as maps. A geographic information system (GIS) facilitates the collection, analysis, and reporting of

  1. Symposium on Geographic Information Systems.

    Science.gov (United States)

    Felleman, John, Ed.

    1990-01-01

    Six papers on geographic information systems cover the future of geographic information systems, land information systems modernization in Wisconsin, the Topologically Integrated Geographic Encoding and Referencing (TIGER) System of the U.S. Bureau of the Census, satellite remote sensing, geographic information systems and sustainable development,…

  2. Imputation for transcription factor binding predictions based on deep learning

    Science.gov (United States)

    Qin, Qian

    2017-01-01

    Understanding the cell-specific binding patterns of transcription factors (TFs) is fundamental to studying gene regulatory networks in biological systems, for which ChIP-seq not only provides valuable data but is also considered as the gold standard. Despite tremendous efforts from the scientific community to conduct TF ChIP-seq experiments, the available data represent only a limited percentage of ChIP-seq experiments, considering all possible combinations of TFs and cell lines. In this study, we demonstrate a method for accurately predicting cell-specific TF binding for TF-cell line combinations based on only a small fraction (4%) of the combinations using available ChIP-seq data. The proposed model, termed TFImpute, is based on a deep neural network with a multi-task learning setting to borrow information across transcription factors and cell lines. Compared with existing methods, TFImpute achieves comparable accuracy on TF-cell line combinations with ChIP-seq data; moreover, TFImpute achieves better accuracy on TF-cell line combinations without ChIP-seq data. This approach can predict cell line specific enhancer activities in K562 and HepG2 cell lines, as measured by massively parallel reporter assays, and predicts the impact of SNPs on TF binding. PMID:28234893

  3. A Framework for Sustainable Tourism Planning in Johor Ramsar Sites, Malaysia: A Geographic Information System (GIS Based Analytic Network Process (ANP Approach

    Directory of Open Access Journals (Sweden)

    Mansir Aminu

    2013-06-01

    Full Text Available This study presents an approach based on an integrated use of GIS, ANP and Water Quality Index (WQI for sustainable tourism planning in a wetland environment (Ramsar site. ANP will be utilized to evaluate the relative priorities for the conservation, tourism and economic development of the Ramsar sites based on chosen criteria and indicators (elements. Pair wise comparison technique will be used in order to evaluate possible alternatives from different perspectives. To reflect the interdependencies in the network, pair wise comparisons will be conducted among all the elements. As different elements are usually characterized by different importance levels, the subsequent step will be the prioritization of the elements, which allows for a comparison among the elements using expert opinion as input and the results transferred into GIS environment. Elements to be evaluated and ranked will be represented by criterion maps. The criterion maps will be evaluated by reclassifying the data layers, to represent different needs for conservation and development of the Ramsar sites. To determine the water quality of the river, parameters of the sampling stations will be used to calculate the sub-indices. Consequently surface data of water quality will be generated from the points of the sampling stations and decisions taken appropriately. Map layers reflecting the opinion of different experts involved will be compared using the Boolean overlay approach of GIS. Subsequently conservation, tourism and economic development models will be generated, which will ensure that tourism maintain the viability of the study area for an indefinite period of time.

  4. Accounting for the Multiple Natures of Missing Values in Label-Free Quantitative Proteomics Data Sets to Compare Imputation Strategies.

    Science.gov (United States)

    Lazar, Cosmin; Gatto, Laurent; Ferro, Myriam; Bruley, Christophe; Burger, Thomas

    2016-04-01

    Missing values are a genuine issue in label-free quantitative proteomics. Recent works have surveyed the different statistical methods to conduct imputation and have compared them on real or simulated data sets and recommended a list of missing value imputation methods for proteomics application. Although insightful, these comparisons do not account for two important facts: (i) depending on the proteomics data set, the missingness mechanism may be of different natures and (ii) each imputation method is devoted to a specific type of missingness mechanism. As a result, we believe that the question at stake is not to find the most accurate imputation method in general but instead the most appropriate one. We describe a series of comparisons that support our views: For instance, we show that a supposedly "under-performing" method (i.e., giving baseline average results), if applied at the "appropriate" time in the data-processing pipeline (before or after peptide aggregation) on a data set with the "appropriate" nature of missing values, can outperform a blindly applied, supposedly "better-performing" method (i.e., the reference method from the state-of-the-art). This leads us to formulate few practical guidelines regarding the choice and the application of an imputation method in a proteomics context.

  5. Randomly and Non-Randomly Missing Renal Function Data in the Strong Heart Study: A Comparison of Imputation Methods.

    Science.gov (United States)

    Shara, Nawar; Yassin, Sayf A; Valaitis, Eduardas; Wang, Hong; Howard, Barbara V; Wang, Wenyu; Lee, Elisa T; Umans, Jason G

    2015-01-01

    Kidney and cardiovascular disease are widespread among populations with high prevalence of diabetes, such as American Indians participating in the Strong Heart Study (SHS). Studying these conditions simultaneously in longitudinal studies is challenging, because the morbidity and mortality associated with these diseases result in missing data, and these data are likely not missing at random. When such data are merely excluded, study findings may be compromised. In this article, a subset of 2264 participants with complete renal function data from Strong Heart Exams 1 (1989-1991), 2 (1993-1995), and 3 (1998-1999) was used to examine the performance of five methods used to impute missing data: listwise deletion, mean of serial measures, adjacent value, multiple imputation, and pattern-mixture. Three missing at random models and one non-missing at random model were used to compare the performance of the imputation techniques on randomly and non-randomly missing data. The pattern-mixture method was found to perform best for imputing renal function data that were not missing at random. Determining whether data are missing at random or not can help in choosing the imputation method that will provide the most accurate results.

  6. Randomly and Non-Randomly Missing Renal Function Data in the Strong Heart Study: A Comparison of Imputation Methods.

    Directory of Open Access Journals (Sweden)

    Nawar Shara

    Full Text Available Kidney and cardiovascular disease are widespread among populations with high prevalence of diabetes, such as American Indians participating in the Strong Heart Study (SHS. Studying these conditions simultaneously in longitudinal studies is challenging, because the morbidity and mortality associated with these diseases result in missing data, and these data are likely not missing at random. When such data are merely excluded, study findings may be compromised. In this article, a subset of 2264 participants with complete renal function data from Strong Heart Exams 1 (1989-1991, 2 (1993-1995, and 3 (1998-1999 was used to examine the performance of five methods used to impute missing data: listwise deletion, mean of serial measures, adjacent value, multiple imputation, and pattern-mixture. Three missing at random models and one non-missing at random model were used to compare the performance of the imputation techniques on randomly and non-randomly missing data. The pattern-mixture method was found to perform best for imputing renal function data that were not missing at random. Determining whether data are missing at random or not can help in choosing the imputation method that will provide the most accurate results.

  7. Geographical Income Polarization

    DEFF Research Database (Denmark)

    Azhar, Hussain; Jonassen, Anders Bruun

    In this paper we estimate the degree, composition and development of geographical income polarization based on data at the individual and municipal level in Denmark from 1984 to 2002. Rising income polarization is reconfirmed when applying new polarization measures, the driving force being greater...

  8. Making Geographical Futures

    Science.gov (United States)

    Morgan, John

    2015-01-01

    Although there are surprisingly few academic books about geography with the term "future" or "futures" in their titles, this paper indicates that for much of the twentieth century geographers contributed to important discussions about the shape of worlds to come. The paper offers a review of these debates within Anglo-American…

  9. Geographic profiling survey

    NARCIS (Netherlands)

    Emeno, Karla; Bennell, Craig; Snook, Brent; Taylor, Paul Jonathon

    Geographic profiling (GP) is an investigative technique that involves predicting a serial offender's home location (or some other anchor point) based on where he or she committed a crime. Although the use of GP in police investigations appears to be on the rise, little is known about the procedure

  10. Using the Superpopulation Model for Imputations and Variance Computation in Survey Sampling

    Directory of Open Access Journals (Sweden)

    Petr Novák

    2012-03-01

    Full Text Available This study is aimed at variance computation techniques for estimates of population characteristics based on survey sampling and imputation. We use the superpopulation regression model, which means that the target variable values for each statistical unit are treated as random realizations of a linear regression model with weighted variance. We focus on regression models with one auxiliary variable and no intercept, which have many applications and straightforward interpretation in business statistics. Furthermore, we deal with cases where the estimates are not independent and thus the covariance must be computed. We also consider chained regression models with auxiliary variables as random variables instead of constants.
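
    A small sketch, under the stated model assumptions only (not the article's code or notation), of imputation and a model-based variance for the total under the superpopulation model y_i = beta*x_i + e_i with Var(e_i) = sigma^2*x_i and a single auxiliary variable:

      # Illustrative ratio-type imputation from one auxiliary variable, no intercept.
      import numpy as np

      rng = np.random.default_rng(4)
      x = rng.uniform(10, 100, size=200)                 # auxiliary variable, always observed
      y = 2.5 * x + rng.normal(scale=np.sqrt(x))         # target variable
      resp = rng.random(x.size) > 0.3                    # roughly 70% respondents
      y_obs = np.where(resp, y, np.nan)

      beta_hat = np.nansum(y_obs) / x[resp].sum()        # WLS slope when Var(e_i) is proportional to x_i
      y_imp = np.where(resp, y_obs, beta_hat * x)        # impute nonrespondents from the model

      sigma2_hat = np.sum((y_obs[resp] - beta_hat * x[resp]) ** 2 / x[resp]) / (resp.sum() - 1)
      x_m, x_r = x[~resp].sum(), x[resp].sum()
      # Model variance of the estimated total: the slope error contributes x_m^2 / x_r,
      # and the unobserved residuals contribute an extra x_m, both scaled by sigma^2.
      var_total = sigma2_hat * (x_m ** 2 / x_r + x_m)
      print("estimated total:", round(y_imp.sum(), 1), "model variance:", round(var_total, 1))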

  11. A Framework for Geographic Object-Based Image Analysis (GEOBIA) based on geographic ontology

    Science.gov (United States)

    Gu, H. Y.; Li, H. T.; Yan, L.; Lu, X. J.

    2015-06-01

    GEOBIA (Geographic Object-Based Image Analysis) is not only a hot topic of current remote sensing and geographical research; it is also considered a paradigm in remote sensing and GIScience. The lack of a systematic approach designed to conceptualize and formalize the class definitions makes GEOBIA a highly subjective and difficult method to reproduce. This paper aims to put forward a framework for GEOBIA based on geographic ontology theory, which could implement a true representation of the chain "Geographic entities - Image objects - Geographic objects". It consists of three steps: first, geographical entities are described by a geographic ontology; second, a semantic network model is built based on OWL (ontology web language); finally, geographical objects are classified with decision rules or other classifiers. A case study of a farmland ontology was conducted to illustrate the framework. The strength of this framework is that it provides interpretation strategies and a global framework for GEOBIA that is objective, comprehensive and universal, which avoids inconsistencies caused by different experts' experience and provides an objective model for image analysis.

  12. Characterising an intense PM pollution episode in March 2015 in France from multi-site approach and near real time data: Climatology, variabilities, geographical origins and model evaluation

    Science.gov (United States)

    Petit, J.-E.; Amodeo, T.; Meleux, F.; Bessagnet, B.; Menut, L.; Grenier, D.; Pellan, Y.; Ockler, A.; Rocq, B.; Gros, V.; Sciare, J.; Favez, O.

    2017-04-01

    During March 2015, a severe and large-scale particulate matter (PM) pollution episode occurred in France. Measurements in near real-time of the major chemical composition at four different urban background sites across the country (Paris, Creil, Metz and Lyon) allowed the investigation of spatiotemporal variabilities during this episode. A climatology approach showed that all sites experienced a clear and unusual rain shortage, a pattern that is also found on a longer timescale, highlighting the role of synoptic conditions over Western Europe. This episode is characterized by a strong predominance of secondary pollution, and more particularly of ammonium nitrate, which accounted for more than 50% of submicron aerosols at all sites during the most intense period of the episode. Pollution advection is illustrated by similar variabilities in Paris and Creil (around 100 km apart), as well as trajectory analyses applied to nitrate and sulphate. Local sources, especially wood burning, are however found to contribute to local/regional sub-episodes, notably in Metz. Finally, simulated concentrations from the Chemistry-Transport model CHIMERE were compared to observed ones. Results highlighted different patterns depending on the chemical components and the measuring site, reinforcing the need for such exercises over other pollution episodes and sites.

  13. Acesso aos serviços de saúde: uma abordagem de geografia em saúde pública Access to health services: a geographical approach to public health

    Directory of Open Access Journals (Sweden)

    Carmen Vieira de Sousa Unglert

    1987-10-01

    Full Text Available The access of the population to the health services is a requirement of basic importance for the efficiency of health assistance. The geographical localization of the services is one of the factors that interfere with this accessibility. It is intended to make a contribution to the study of the localization of health services. The basic proposal introduces a method which takes into account the relationships between geographical, demographical and social variables. Emphasis is placed on community participation in the process. The study of the adequacy of this method was undertaken under the regional characteristics of Santo Amaro, a suburb of the city of S. Paulo, Brazil. The contribution furnished by the geographical approach in this work opens up a broad perspective for the setting up of new lines of research, planning and administration resulting from the interaction between human geography and public health, within a common field for which the name Geography of Public Health is suggested.

  14. Remote sensing research in geographic education: An alternative view

    Science.gov (United States)

    Wilson, H.; Cary, T. K.; Goward, S. N.

    1981-01-01

    It is noted that within many geography departments remote sensing is viewed as a mere technique a student should learn in order to carry out true geographic research. This view inhibits both students and faculty from investigation of remotely sensed data as a new source of geographic knowledge that may alter our understanding of the Earth. The tendency is for geographers to accept these new data and analysis techniques from engineers and mathematicians without questioning the accompanying premises. This black-box approach hinders geographic applications of the new remotely sensed data and limits the geographer's contribution to further development of remote sensing observation systems. It is suggested that geographers contribute to the development of remote sensing through pursuit of basic research. This research can be encouraged, particularly among students, by demonstrating the links between geographic theory and remotely sensed observations, encouraging a healthy skepticism concerning the current understanding of these data.

  15. Imputation of microsatellite alleles from dense SNP genotypes for parentage verification across multiple Bos taurus and Bos indicus breeds

    Science.gov (United States)

    McClure, Matthew C.; Sonstegard, Tad S.; Wiggans, George R.; Van Eenennaam, Alison L.; Weber, Kristina L.; Penedo, Cecilia T.; Berry, Donagh P.; Flynn, John; Garcia, Jose F.; Carmo, Adriana S.; Regitano, Luciana C. A.; Albuquerque, Milla; Silva, Marcos V. G. B.; Machado, Marco A.; Coffey, Mike; Moore, Kirsty; Boscher, Marie-Yvonne; Genestout, Lucie; Mazza, Raffaele; Taylor, Jeremy F.; Schnabel, Robert D.; Simpson, Barry; Marques, Elisa; McEwan, John C.; Cromie, Andrew; Coutinho, Luiz L.; Kuehn, Larry A.; Keele, John W.; Piper, Emily K.; Cook, Jim; Williams, Robert; Van Tassell, Curtis P.

    2013-01-01

    To assist cattle producers transition from microsatellite (MS) to single nucleotide polymorphism (SNP) genotyping for parental verification we previously devised an effective and inexpensive method to impute MS alleles from SNP haplotypes. While the reported method was verified with only a limited data set (N = 479) from Brown Swiss, Guernsey, Holstein, and Jersey cattle, some of the MS-SNP haplotype associations were concordant across these phylogenetically diverse breeds. This implied that some haplotypes predate modern breed formation and remain in strong linkage disequilibrium. To expand the utility of MS allele imputation across breeds, MS and SNP data from more than 8000 animals representing 39 breeds (Bos taurus and B. indicus) were used to predict 9410 SNP haplotypes, incorporating an average of 73 SNPs per haplotype, for which alleles from 12 MS markers could be accurately imputed. Approximately 25% of the MS-SNP haplotypes were present in multiple breeds (N = 2 to 36 breeds). These shared haplotypes allowed for MS imputation in breeds that were not represented in the reference population with only a small increase in Mendelian inheritance inconsistencies. Our reported reference haplotypes can be used for any cattle breed and the reported methods can be applied to any species to aid the transition from MS to SNP genetic markers. While ~91% of the animals with imputed alleles for 12 MS markers had ≤1 Mendelian inheritance conflicts with their parents' reported MS genotypes, this figure was 96% for our reference animals, indicating potential errors in the reported MS genotypes. The workflow we suggest autocorrects for genotyping errors and rare haplotypes, by MS genotyping animals whose imputed MS alleles fail parentage verification, and then incorporating those animals into the reference dataset. PMID:24065982

  16. Imputation of Microsatellite Alleles from Dense SNP Genotypes for Parentage Verification Across Multiple Bos taurus and Bos indicus breeds

    Directory of Open Access Journals (Sweden)

    Matthew Charles Mcclure

    2013-09-01

    Full Text Available To assist cattle producers transition from microsatellite (MS to single nucleotide polymorphism (SNP genotyping for parental verification we previously devised an effective and inexpensive method to impute MS alleles from SNP haplotypes. While the reported method was verified with only a limited data set (N=479 from Brown Swiss, Guernsey, Holstein, and Jersey cattle, some of the MS-SNP haplotype associations were concordant across these phylogenetically diverse breeds. This implied that some haplotypes predate modern breed formation and remain in strong linkage disequilibrium. To expand the utility of MS allele imputation across breeds, MS and SNP data from more than 8,000 animals representing 39 breeds (Bos taurus and B. indicus were used to predict 9,410 SNP haplotypes, incorporating an average of 73 SNPs per haplotype, for which alleles for 12 MS markers could be accurately imputed. Approximately 25% of the MS-SNP haplotypes were present in multiple breeds (N=2 to 36 breeds. These shared haplotypes allowed for MS imputation in breeds that were not represented in the reference population with only a small increase in Mendelian inheritance inconsistencies. Our reported reference haplotypes can be used for any cattle breed and the reported methods can be applied to any species to aid the transition from MS to SNP genetic markers. While ~91% of the animals with imputed alleles for 12 MS markers had ≤1 Mendelian inheritance conflicts with their parents’ reported MS genotypes, this figure was 96% for our reference animals, indicating potential errors in the reported MS genotypes. The workflow we suggest autocorrects for genotyping errors and rare haplotypes, by MS genotyping animals whose imputed MS alleles fail parentage verification, and then incorporating those animals into the reference dataset.

  17. Using full-cohort data in nested case-control and case-cohort studies by multiple imputation.

    Science.gov (United States)

    Keogh, Ruth H; White, Ian R

    2013-10-15

    In many large prospective cohorts, expensive exposure measurements cannot be obtained for all individuals. Exposure-disease association studies are therefore often based on nested case-control or case-cohort studies in which complete information is obtained only for sampled individuals. However, in the full cohort, there may be a large amount of information on cheaply available covariates and possibly a surrogate of the main exposure(s), which typically goes unused. We view the nested case-control or case-cohort study plus the remainder of the cohort as a full-cohort study with missing data. Hence, we propose using multiple imputation (MI) to utilise information in the full cohort when data from the sub-studies are analysed. We use the fully observed data to fit the imputation models. We consider using approximate imputation models and also using rejection sampling to draw imputed values from the true distribution of the missing values given the observed data. Simulation studies show that using MI to utilise full-cohort information in the analysis of nested case-control and case-cohort studies can result in important gains in efficiency, particularly when a surrogate of the main exposure is available in the full cohort. In simulations, this method outperforms counter-matching in nested case-control studies and a weighted analysis for case-cohort studies, both of which use some full-cohort information. Approximate imputation models perform well except when there are interactions or non-linear terms in the outcome model, where imputation using rejection sampling works well. Copyright © 2013 John Wiley & Sons, Ltd.

  18. A review of RCTs in four medical journals to assess the use of imputation to overcome missing data in quality of life outcomes

    Directory of Open Access Journals (Sweden)

    Cook Jonathan A

    2008-08-01

    Full Text Available Abstract Background Randomised controlled trials (RCTs are perceived as the gold-standard method for evaluating healthcare interventions, and increasingly include quality of life (QoL measures. The observed results are susceptible to bias if a substantial proportion of outcome data are missing. The review aimed to determine whether imputation was used to deal with missing QoL outcomes. Methods A random selection of 285 RCTs published during 2005/6 in the British Medical Journal, Lancet, New England Journal of Medicine and Journal of American Medical Association were identified. Results QoL outcomes were reported in 61 (21% trials. Six (10% reported having no missing data, 20 (33% reported ≤ 10% missing, eleven (18% 11%–20% missing, and eleven (18% reported >20% missing. Missingness was unclear in 13 (21%. Missing data were imputed in 19 (31% of the 61 trials. Imputation was part of the primary analysis in 13 trials, but a sensitivity analysis in six. Last value carried forward was used in 12 trials and multiple imputation in two. Following imputation, the most common analysis method was analysis of covariance (10 trials. Conclusion The majority of studies did not impute missing data and carried out a complete-case analysis. For those studies that did impute missing data, researchers tended to prefer simpler methods of imputation, despite more sophisticated methods being available.
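
    For concreteness, a minimal pandas sketch of last value carried forward (LOCF), the simple imputation most often reported in the reviewed trials, applied to hypothetical quality-of-life scores rather than any data from the review:

      # LOCF: each patient's last observed score is carried forward to later visits.
      import numpy as np
      import pandas as pd

      rng = np.random.default_rng(5)
      visits = ["baseline", "month3", "month6", "month12"]
      qol = pd.DataFrame(rng.normal(60, 10, size=(8, 4)).round(1), columns=visits)
      qol.iloc[2, 2:] = np.nan          # patient 2 drops out after month 3
      qol.iloc[5, 3] = np.nan           # patient 5 misses the final visit

      locf = qol.ffill(axis=1)          # carry each patient's last observed score forward
      print(locf)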

  19. Random Forest as an Imputation Method for Education and Psychology Research: Its Impact on Item Fit and Difficulty of the Rasch Model

    Science.gov (United States)

    Golino, Hudson F.; Gomes, Cristiano M. A.

    2016-01-01

    This paper presents a non-parametric imputation technique, named random forest, from the machine learning field. The random forest procedure has two main tuning parameters: the number of trees grown in the prediction and the number of predictors used. Fifty experimental conditions were created in the imputation procedure, with different…
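
    A hedged sketch of random-forest imputation in this spirit (illustrative data, not the paper's experimental conditions): a RandomForestRegressor is plugged into scikit-learn's IterativeImputer as the per-variable prediction model, in the style of missForest.

      # Random-forest imputation via an iterative, chained-prediction scheme.
      import numpy as np
      from sklearn.ensemble import RandomForestRegressor
      from sklearn.experimental import enable_iterative_imputer  # noqa: F401
      from sklearn.impute import IterativeImputer

      rng = np.random.default_rng(6)
      ability = rng.normal(size=300)
      items = ability[:, None] + rng.normal(scale=0.7, size=(300, 5))   # 5 correlated item scores
      incomplete = np.where(rng.random(items.shape) < 0.15, np.nan, items)

      rf_imputer = IterativeImputer(
          estimator=RandomForestRegressor(n_estimators=100, random_state=0),
          max_iter=5,
          random_state=0,
      )
      completed = rf_imputer.fit_transform(incomplete)
      print("cells imputed:", int(np.isnan(incomplete).sum()),
            "remaining NaNs:", int(np.isnan(completed).sum()))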

  20. 22 CFR 208.630 - May the U.S. Agency for International Development impute conduct of one person to another?

    Science.gov (United States)

    2010-04-01

    ..., or with the organization's knowledge, approval or acquiescence. The organization's acceptance of the... conduct as follows: (a) Conduct imputed from an individual to an organization. We may impute the... other individual associated with an organization, to that organization when the improper...

  1. Which missing value imputation method to use in expression profiles: a comparative study and two selection schemes

    Directory of Open Access Journals (Sweden)

    Lotz Meredith J

    2008-01-01

    Full Text Available Abstract Background Gene expression data frequently contain missing values, however, most down-stream analyses for microarray experiments require complete data. In the literature many methods have been proposed to estimate missing values via information of the correlation patterns within the gene expression matrix. Each method has its own advantages, but the specific conditions for which each method is preferred remains largely unclear. In this report we describe an extensive evaluation of eight current imputation methods on multiple types of microarray experiments, including time series, multiple exposures, and multiple exposures × time series data. We then introduce two complementary selection schemes for determining the most appropriate imputation method for any given data set. Results We found that the optimal imputation algorithms (LSA, LLS, and BPCA are all highly competitive with each other, and that no method is uniformly superior in all the data sets we examined. The success of each method can also depend on the underlying "complexity" of the expression data, where we take complexity to indicate the difficulty in mapping the gene expression matrix to a lower-dimensional subspace. We developed an entropy measure to quantify the complexity of expression matrixes and found that, by incorporating this information, the entropy-based selection (EBS scheme is useful for selecting an appropriate imputation algorithm. We further propose a simulation-based self-training selection (STS scheme. This technique has been used previously for microarray data imputation, but for different purposes. The scheme selects the optimal or near-optimal method with high accuracy but at an increased computational cost. Conclusion Our findings provide insight into the problem of which imputation method is optimal for a given data set. Three top-performing methods (LSA, LLS and BPCA are competitive with each other. Global-based imputation methods (PLS, SVD, BPCA

  2. A multi breed reference improves genotype imputation accuracy in Nordic Red cattle

    DEFF Research Database (Denmark)

    Brøndum, Rasmus Froberg; Ma, Peipei; Lund, Mogens Sandø;

    2012-01-01

    The objective of this study was to investigate if a multi breed reference would improve genotype imputation accuracy from 50K to high density (HD) single nucleotide polymorphism (SNP) marker data in Nordic Red Dairy Cattle, compared to using only a single breed reference, and to check… 612,615 SNPs on chromosome 1-29 remained for analysis. Validation was done by masking markers in true HD data and imputing them using Beagle v. 3.3 and a reference group of either national Red, combined Red or combined Red and Holstein bulls. Results show a decrease in allele error rate from 2.64, 1.39 and 0.87 percent to 1.75, 0.59 and 0.54 percent for Danish, Swedish and Finnish Red, respectively, when going from a single national reference to a combined Red reference. The larger error rate in the Danish population was caused by a subgroup of 10 animals showing a large proportion of Holstein genetics…

  3. A multi breed reference improves genotype imputation accuracy in Nordic Red cattle

    DEFF Research Database (Denmark)

    Brøndum, Rasmus Froberg; Ma, Peipei; Lund, Mogens Sandø;

    The objective of this study was to investigate if a multi breed reference would improve genotype imputation accuracy from 50K to high density (HD) single nucleotide polymorphism (SNP) marker data in Nordic Red Dairy Cattle, compared to using only a single breed reference, and to check… 612,615 SNPs on chromosome 1-29 remained for analysis. Validation was done by masking markers in true HD data and imputing them using Beagle v. 3.3 and a reference group of either national Red, combined Red or combined Red and Holstein bulls. Results show a decrease in allele error rate from 2.64, 1.39 and 0.87 percent to 1.75, 0.59 and 0.54 percent for Danish, Swedish and Finnish Red, respectively, when going from a single national reference to a combined Red reference. The larger error rate in the Danish population was caused by a subgroup of 10 animals showing a large proportion of Holstein genetics…

  4. Application of the Single Imputation Method to Estimate Missing Wind Speed Data in Malaysia

    Directory of Open Access Journals (Sweden)

    Nurulkamal Masseran

    2013-07-01

    Full Text Available In almost all research fields, the procedure for handling missing values must be addressed before a detailed analysis can be made. Thus, a suitable method of imputation should be chosen to address the missing value problem. Wind speed has been found in engineering practice to be the most significant parameter in wind power. However, researchers are sometimes faced with the problem of missing wind speed data caused by equipment failure. In this study, we attempt to implement four types of single imputation methods to estimate the wind speed data from three adjacent stations in Malaysia. The methods, known as the site-dependent effect method, the hour mean method, the last and next method, and the row mean method, are compared based on the index of agreement to identify the best method for estimating the missing values. The results indicate that the last and next method is the best of these methods for estimating the missing data for the wind stations considered.
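
    An illustrative pandas sketch of three of the single-imputation rules named above (hour mean, row mean across adjacent stations, and last-and-next), run on hypothetical hourly wind-speed series rather than the study's Malaysian data:

      # Three simple single-imputation rules for a station with missing readings.
      import numpy as np
      import pandas as pd

      rng = np.random.default_rng(7)
      idx = pd.date_range("2013-07-01", periods=24 * 7, freq="h")
      wind = pd.DataFrame(
          rng.gamma(shape=2.0, scale=2.0, size=(idx.size, 3)),
          index=idx, columns=["station_A", "station_B", "station_C"],
      )
      wind.loc[wind.sample(frac=0.1, random_state=0).index, "station_A"] = np.nan

      target = wind["station_A"]
      hour_mean = target.fillna(target.groupby(target.index.hour).transform("mean"))
      row_mean = target.fillna(wind[["station_B", "station_C"]].mean(axis=1))
      last_next = target.fillna((target.ffill() + target.bfill()) / 2)

      print(pd.concat({"hour_mean": hour_mean, "row_mean": row_mean,
                       "last_next": last_next}, axis=1).head())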

  5. Spatial Copula Model for Imputing Traffic Flow Data from Remote Microwave Sensors.

    Science.gov (United States)

    Ma, Xiaolei; Luan, Sen; Du, Bowen; Yu, Bin

    2017-09-21

    Issues of missing data have become increasingly serious with the rapid increase in usage of traffic sensors. Analyses of the Beijing ring expressway have shown that up to 50% of microwave sensors have missing values. The imputation of missing traffic data must be urgently addressed, although a precise solution cannot be easily achieved due to the significant number of missing portions. In this study, copula-based models are proposed for the spatial interpolation of traffic flow from remote traffic microwave sensors. Most existing interpolation methods only rely on covariance functions to depict spatial correlation and are unsuitable for coping with anomalies due to the Gaussian assumption. Copula theory overcomes this issue and provides a connection between the correlation function and the marginal distribution function of traffic flow. To validate copula-based models, a comparison with three kriging methods is conducted. Results indicate that copula-based models outperform kriging methods, especially on roads with irregular traffic patterns. Copula-based models demonstrate significant potential to impute missing data in large-scale transportation networks.
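
    A rough sketch of a Gaussian-copula imputation in the spirit described above (simulated detector flows and several simplifications; it does not reproduce the paper's copula models or the kriging comparison): marginals are handled by rank transforms and dependence by a latent multivariate normal.

      import numpy as np
      from scipy import stats

      rng = np.random.default_rng(8)
      n, k = 2000, 4
      cov = 0.2 * np.eye(k) + 0.8 * np.ones((k, k))                  # strongly dependent detectors
      latent = rng.multivariate_normal(np.zeros(k), cov, size=n)
      flow = stats.gamma.ppf(stats.norm.cdf(latent), a=3.0) * 100    # skewed traffic flows

      miss = rng.random(n) < 0.3                                     # detector 0 missing 30% of the time
      obs = flow.copy()
      obs[miss, 0] = np.nan

      def normal_scores(col):
          """Rank-based transform of the observed part of a column to N(0,1) scores."""
          z = np.full(col.shape, np.nan)
          ok = ~np.isnan(col)
          z[ok] = stats.norm.ppf(stats.rankdata(col[ok]) / (ok.sum() + 1))
          return z

      z = np.column_stack([normal_scores(obs[:, j]) for j in range(k)])
      corr = np.corrcoef(z[~miss].T)                                 # copula correlation from complete rows

      # Conditional mean of detector 0's latent score given detectors 1..3, then
      # mapped back through the empirical quantiles of detector 0's observed flows.
      w = np.linalg.solve(corr[1:, 1:], corr[1:, 0])
      u_hat = stats.norm.cdf(z[miss, 1:] @ w)
      imputed = np.quantile(obs[~miss, 0], u_hat)

      rmse = np.sqrt(np.mean((imputed - flow[miss, 0]) ** 2))
      print(f"copula imputation RMSE: {rmse:.1f}")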

  6. An intelligent method for geographic Web search

    Science.gov (United States)

    Mei, Kun; Yuan, Ying

    2008-10-01

    As the electronically available information on the World-Wide Web grows explosively, the difficulty of finding relevant information also increases for search engine users. In this paper we discuss how to constrain web queries geographically. A number of search queries are associated with geographical locations, either explicitly or implicitly. Accurately and effectively detecting the locations that search queries are truly about has a huge potential impact on increasing search relevance, bringing better targeted search results, and improving search user satisfaction. Our approach focuses both on the way geographic information is extracted from the web and, as far as we can tell, on the way it is integrated into query processing. This paper gives an overview of a spatially aware search engine for semantic querying of web documents. It also illustrates algorithms for extracting locations from web documents and query requests, using location ontologies to encode and reason about the formal semantics of geographic web search. Based on a real-world scenario of tourism guide search, the application of our approach shows that geographic information retrieval can be efficiently supported.

  7. Teaching Geographic Field Methods Using Paleoecology

    Science.gov (United States)

    Walsh, Megan K.

    2014-01-01

    Field-based undergraduate geography courses provide numerous pedagogical benefits including an opportunity for students to acquire employable skills in an applied context. This article presents one unique approach to teaching geographic field methods using paleoecological research. The goals of this course are to teach students key geographic…

  8. Optimizing Synchronizability of Scale-Free Networks in Geographical Space

    Institute of Scientific and Technical Information of China (English)

    WANG Bing; TANG Huan-Wen; XIU Zhi-Long; GUO Chong-Hui

    2006-01-01

    We investigate the relationship between the structure and the synchronizability of scale-free networks in geographical space. With an optimization approach, the numerical results indicate that when the network synchronizability is improved, the geographical distance becomes larger while the maximal load decreases. Thus the maximal betweenness can be a candidate factor that affects the network synchronizability both in topological space and in geographical space.
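
    As a rough illustration of the quantities this record discusses (maximal betweenness, geographical edge length, maximal load), the snippet below computes them with networkx on a Barabási-Albert graph given a random planar embedding; the paper's own optimization of synchronizability is not reproduced, and the choice of graph model and of edge betweenness as the "load" proxy are assumptions.

```python
import random
import networkx as nx

# A stand-in for a geographical scale-free network: scale-free topology, random positions.
random.seed(0)
G = nx.barabasi_albert_graph(200, 3, seed=0)
pos = {v: (random.random(), random.random()) for v in G}

# Maximal node betweenness, the candidate factor named in the abstract.
betweenness = nx.betweenness_centrality(G)
max_betweenness = max(betweenness.values())

# Total geographical length of the edges under the random embedding.
def dist(u, v):
    (x1, y1), (x2, y2) = pos[u], pos[v]
    return ((x1 - x2) ** 2 + (y1 - y2) ** 2) ** 0.5

total_length = sum(dist(u, v) for u, v in G.edges())

# Maximal load, here taken as the largest edge betweenness (a common proxy for traffic load).
edge_bc = nx.edge_betweenness_centrality(G)
max_load = max(edge_bc.values())

print(f"maximal node betweenness:        {max_betweenness:.3f}")
print(f"total geographical edge length:  {total_length:.1f}")
print(f"maximal load (edge betweenness): {max_load:.3f}")
```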

  9. Imputation by the mean score should be avoided when validating a Patient Reported Outcomes questionnaire by a Rasch model in presence of informative missing data

    LENUS (Irish Health Repository)

    Hardouin, Jean-Benoit

    2011-07-14

    Abstract Background Nowadays, more and more clinical scales consisting of responses given by patients to a set of items (Patient Reported Outcomes - PRO) are validated with models based on Item Response Theory, and more specifically with a Rasch model. In the validation sample, the presence of missing data is frequent. The aim of this paper is to compare sixteen methods for handling missing data (mainly based on simple imputation) in the context of psychometric validation of PRO by a Rasch model. The main indexes used for validation by a Rasch model are compared. Methods A simulation study was performed, allowing several cases to be considered, notably whether the missing values are informative or not and the rate of missing data. Results Several imputation methods produce bias in psychometric indexes (generally, the imputation methods artificially improve the psychometric qualities of the scale). In particular, this is the case with the method based on the Personal Mean Score (PMS), which is the most commonly used imputation method in practice. Conclusions Several imputation methods should be avoided, in particular PMS imputation. From a general point of view, it is important to use an imputation method that considers both the ability of the patient (measured for example by his/her score) and the difficulty of the item (measured for example by its rate of favourable responses). Another recommendation is to always consider the addition of a random process in the imputation method, because such a process allows the bias to be reduced. Last, the analysis performed without imputation of the missing data (available case analysis) is an interesting alternative to simple imputation in this context.

  10. Imputation by the mean score should be avoided when validating a Patient Reported Outcomes questionnaire by a Rasch model in presence of informative missing data

    Directory of Open Access Journals (Sweden)

    Sébille Véronique

    2011-07-01

    Full Text Available Abstract Background Nowadays, more and more clinical scales consisting of responses given by patients to a set of items (Patient Reported Outcomes - PRO) are validated with models based on Item Response Theory, and more specifically with a Rasch model. In the validation sample, the presence of missing data is frequent. The aim of this paper is to compare sixteen methods for handling missing data (mainly based on simple imputation) in the context of psychometric validation of PRO by a Rasch model. The main indexes used for validation by a Rasch model are compared. Methods A simulation study was performed, allowing several cases to be considered, notably whether the missing values are informative or not and the rate of missing data. Results Several imputation methods produce bias in psychometric indexes (generally, the imputation methods artificially improve the psychometric qualities of the scale). In particular, this is the case with the method based on the Personal Mean Score (PMS), which is the most commonly used imputation method in practice. Conclusions Several imputation methods should be avoided, in particular PMS imputation. From a general point of view, it is important to use an imputation method that considers both the ability of the patient (measured for example by his/her score) and the difficulty of the item (measured for example by its rate of favourable responses). Another recommendation is to always consider the addition of a random process in the imputation method, because such a process allows the bias to be reduced. Last, the analysis performed without imputation of the missing data (available case analysis) is an interesting alternative to simple imputation in this context.
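
    The contrast drawn in these two records, between deterministic Personal Mean Score imputation and an imputation that uses both person ability and item difficulty plus a random draw, can be sketched on simulated dichotomous Rasch data. The data, the missingness rate and the two imputation rules below are illustrative stand-ins, not the sixteen methods compared in the paper.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical dichotomous PRO data: 300 patients x 10 items, Rasch-like generation.
n_persons, n_items = 300, 10
theta = rng.normal(0, 1, n_persons)            # person abilities
beta = np.linspace(-1.5, 1.5, n_items)         # item difficulties
p = 1 / (1 + np.exp(-(theta[:, None] - beta[None, :])))
responses = (rng.random((n_persons, n_items)) < p).astype(float)

data = responses.copy()
data[rng.random(data.shape) < 0.15] = np.nan   # 15% missing, for illustration

def impute_pms(x):
    """Personal Mean Score: replace a missing item with the (rounded) mean of the person's
    observed items -- the common practice the paper recommends avoiding."""
    out = x.copy()
    person_mean = np.nanmean(out, axis=1)
    r, c = np.where(np.isnan(out))
    out[r, c] = np.round(person_mean[r])
    return out

def impute_person_item_random(x, rng):
    """Use both person ability (row mean) and item difficulty (column mean), then draw a
    Bernoulli response -- in the spirit of the methods the paper finds less biased."""
    out = x.copy()
    person_mean = np.nanmean(out, axis=1)
    item_mean = np.nanmean(out, axis=0)
    r, c = np.where(np.isnan(out))
    prob = np.clip((person_mean[r] + item_mean[c]) / 2, 0.0, 1.0)
    out[r, c] = (rng.random(len(r)) < prob).astype(float)
    return out

for name, completed in {
    "PMS (deterministic)": impute_pms(data),
    "person x item + random draw": impute_person_item_random(data, rng),
}.items():
    # Crude check: how well the completed matrix reproduces the true item means.
    bias = np.abs(completed.mean(axis=0) - responses.mean(axis=0)).mean()
    print(f"{name:>28}: mean absolute bias in item means = {bias:.3f}")
```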

  11. The Middle East, and Her Geographic Approaches.

    Science.gov (United States)

    2014-09-26

    [Fragmentary bibliographic index entries only; the recoverable citations concern operations in southern Kurdistan (1923), Kuwait, Jordan and the Trans-Jordan Frontier Force, the Kurdistan mountain ranges, and insurgency in Iraqi Kurdistan.]

  12. Understanding Amphibian Declines Through Geographic Approaches

    Science.gov (United States)

    Gallant, Alisa

    2006-01-01

    Growing concern over worldwide amphibian declines warrants serious examination. Amphibians are important to the proper functioning of ecosystems and provide many direct benefits to humans in the form of pest and disease control, pharmaceutical compounds, and even food. Amphibians have permeable skin and rely on both aquatic and terrestrial ecosystems during different seasons and stages of their lives. Their association with these ecosystems renders them likely to serve as sensitive indicators of environmental change. While much research on amphibian declines has centered on mysterious causes, or on causes that directly affect humans (global warming, chemical pollution, ultraviolet-B radiation), most declines are the result of habitat loss and habitat alteration. Improving our ability to characterize, model, and monitor the interactions between environmental variables and amphibian habitats is key to addressing amphibian conservation. In 2000, the U.S. Geological Survey (USGS) initiated the Amphibian Research and Monitoring Initiative (ARMI) to address issues surrounding amphibian declines.

  13. Geographical Income Polarization

    DEFF Research Database (Denmark)

    Azhar, Hussain; Jonassen, Anders Bruun

    In this paper we estimate the degree, composition and development of geographical income polarization based on data at the individual and municipal level in Denmark from 1984 to 2002. Rising income polarization is reconfirmed when applying new polarization measures, the driving force being greater inter-municipal income inequality. Counterfactual simulations show that rising property prices to a large part explain the rise in polarization. One side-effect of polarization is a tendency towards a parallel polarization of residence location patterns, where low skilled individuals tend to live...

  14. 29 CFR 1471.630 - May the Federal Mediation and Conciliation Service impute conduct of one person to another?

    Science.gov (United States)

    2010-07-01

    ... 29 Labor 4 2010-07-01 2010-07-01 false May the Federal Mediation and Conciliation Service impute...) FEDERAL MEDIATION AND CONCILIATION SERVICE GOVERNMENTWIDE DEBARMENT AND SUSPENSION (NONPROCUREMENT) General Principles Relating to Suspension and Debarment Actions § 1471.630 May the Federal Mediation...

  15. Genome of the Netherlands population-specific imputations identify an ABCA6 variant associated with cholesterol levels

    NARCIS (Netherlands)

    van Leeuwen, Elisabeth M.; Karssen, Lennart C.; Deelen, Joris; Isaacs, Aaron; Medina-Gomez, Carolina; Mbarek, Hamdi; Kanterakis, Alexandros; Trompet, Stella; Postmus, Iris; Verweij, Niek; van Enckevort, David J.; Huffman, Jennifer E.; White, Charles C.; Feitosa, Mary F.; Bartz, Traci M.; Manichaikul, Ani; Joshi, Peter K.; Peloso, Gina M.; Deelen, Patrick; van Dijk, Freerk; Willemsen, Gonneke; de Geus, Eco J.; Milaneschi, Yuri; Penninx, Brenda W. J. H.; Francioli, Laurent C.; Menelaou, Androniki; Pulit, Sara L.; Rivadeneira, Fernando; Hofman, Albert; Oostra, Ben A.; Franco, Oscar H.; Leach, Irene Mateo; Beekman, Marian; de Craen, Anton J. M.; Uh, Hae-Won; Trochet, Holly; Hocking, Lynne J.; Porteous, David J.; Sattar, Naveed; Packard, Chris J.; Buckley, Brendan M.; Brody, Jennifer A.; Bis, Joshua C.; Rotter, Jerome I.; Mychaleckyj, Josyf C.; Campbell, Harry; Duan, Qing; Lange, Leslie A.; Wilson, James F.; Hayward, Caroline; Polasek, Ozren; Vitart, Veronique; Rudan, Igor; Wright, Alan F.; Rich, Stephen S.; Psaty, Bruce M.; Borecki, Ingrid B.; Kearney, Patricia M.; Stott, David J.; Cupples, L. Adrienne; Jukema, J. Wouter; van der Harst, Pim; Sijbrands, Eric J.; Hottenga, Jouke-Jan; Uitterlinden, Andre G.; Swertz, Morris A.; van Ommen, Gert-Jan B.; de Bakker, Paul I. W.; Slagboom, P. Eline; Boomsma, Dorret I.; Wijmenga, Cisca; van Duijn, Cornelia M.

    2015-01-01

    Variants associated with blood lipid levels may be population-specific. To identify low-frequency variants associated with this phenotype, population-specific reference panels may be used. Here we impute nine large Dutch biobanks (similar to 35,000 samples) with the population-specific reference pan

  16. Genome of the Netherlands population-specific imputations identify an ABCA6 variant associated with cholesterol levels

    NARCIS (Netherlands)

    Van Leeuwen, Elisabeth M.; Karssen, Lennart C.; Deelen, Joris; Isaacs, Aaron; Medina-Gomez, Carolina; Mbarek, Hamdi; Kanterakis, Alexandros; Trompet, Stella; Postmus, Iris; Verweij, Niek; Van Enckevort, David J.; Huffman, Jennifer E.; White, Charles C.; Feitosa, Mary F.; Bartz, Traci M.; Manichaikul, Ani; Joshi, Peter K.; Peloso, Gina M.; Deelen, Patrick; Van Dijk, Freerk; Willemsen, Gonneke; De Geus, Eco J.; Milaneschi, Yuri; Penninx, Brenda W J H; Francioli, Laurent C.; Menelaou, Androniki; Pulit, Sara L.; Rivadeneira, Fernando; Hofman, Albert; Oostra, Ben A.; Franco, Oscar H.; Leach, Irene Mateo; Beekman, Marian; De Craen, Anton J M; Uh, Hae Won; Trochet, Holly; Hocking, Lynne J.; Porteous, David J.; Sattar, Naveed; Packard, Chris J.; Buckley, Brendan M.; Brody, Jennifer A.; Bis, Joshua C.; Rotter, Jerome I.; Mychaleckyj, Josyf C.; Campbell, Harry; Duan, Qing; Lange, Leslie A.; Wilson, James F.; Hayward, Caroline; Polasek, Ozren; Vitart, Veronique; Rudan, Igor; Wright, Alan F.; Rich, Stephen S.; Psaty, Bruce M.; Borecki, Ingrid B.; Kearney, Patricia M.; Stott, David J.; Cupples, L. Adrienne; Jukema, J. Wouter; Van Der Harst, Pim; Sijbrands, Eric J.; Hottenga, Jouke Jan; Uitterlinden, Andre G.; Swertz, Morris A.; Van Ommen, Gert Jan B; De Bakker, Paul I W; Eline Slagboom, P.; Boomsma, Dorret I.; Wijmenga, Cisca; Van Duijn, Cornelia M.; Neerincx, Pieter B T; Elbers, Clara C.; Palamara, Pier Francesco; Peer, Itsik; Abdellaoui, Abdel; Kloosterman, Wigard P.; Van Oven, Mannis; Vermaat, Martijn; Li, Mingkun; Laros, Jeroen F J; Stoneking, Mark; De Knijff, Peter; Kayser, Manfred; Veldink, Jan H.; Van Den Berg, Leonard H.; Byelas, Heorhiy; Den Dunnen, Johan T.; Dijkstra, Martijn; Amin, Najaf; Van Der Velde, K. Joeri; Van Setten, Jessica; Kattenberg, Mathijs; Van Schaik, Barbera D C; Bot, Jan; Nijman, Isaäc J.; Mei, Hailiang; Koval, Vyacheslav; Ye, Kai; Lameijer, Eric Wubbo; Moed, Matthijs H.; Hehir-Kwa, Jayne Y.; Handsaker, Robert E.; Sunyaev, Shamil R.; Sohail, Mashaal; Hormozdiari, Fereydoun; Marschall, Tobias; Schönhuth, Alexander; Guryev, Victor; Suchiman, H. Eka D; Wolffenbuttel, Bruce H.; Platteel, Mathieu; Pitts, Steven J.; Potluri, Shobha; Cox, David R.; Li, Qibin; Li, Yingrui; Du, Yuanping; Chen, Ruoyan; Cao, Hongzhi; Li, Ning; Cao, Sujie; Wang, Jun; Bovenberg, Jasper A.; de Bakker, Paul I W

    2015-01-01

    Variants associated with blood lipid levels may be population-specific. To identify low-frequency variants associated with this phenotype, population-specific reference panels may be used. Here we impute nine large Dutch biobanks (∼35,000 samples) with the population-specific reference panel created

  17. Genome of the Netherlands population-specific imputations identify an ABCA6 variant associated with cholesterol levels

    NARCIS (Netherlands)

    E.M. van Leeuwen (Elisa); L.C. Karssen (Lennart); J. Deelen (Joris); A. Isaacs (Aaron); M.C. Medina-Gomez (Carolina); H. Mbarek; A. Kanterakis (Alexandros); S. Trompet (Stella); D. Postmus (Douwe); N. Verweij (Niek); D. van Enckevort (David); J.E. Huffman (Jennifer); C.C. White (Charles); M.F. Feitosa (Mary Furlan); T.M. Bartz (Traci M.); A. Manichaikul (Ani); P.K. Joshi (Peter); G.M. Peloso (Gina); P. Deelen (Patrick); F. van Dijk (F.); G.A.H.M. Willemsen (Gonneke); E.J.C. de Geus (Eco); Y. Milaneschi (Yuri); B.W.J.H. Penninx (Brenda); L.C. Francioli (Laurent); A. Menelaou (Androniki); S.L. Pulit (Sara); F. Rivadeneira Ramirez (Fernando); A. Hofman (Albert); B.A. Oostra (Ben); O.H. Franco (Oscar); I.M. Leach (Irene Mateo); M. Beekman (Marian); A.J. de Craen (Anton); H.-W. Uh (Hae-Won); H. Trochet (Holly); L.J. Hocking (Lynne); D.J. Porteous (David J.); N. Sattar (Naveed); C.J. Packard (Chris J.); B.M. Buckley (Brendan M.); J. Brody (Jennifer); J.C. Bis (Joshua); J.I. Rotter (Jerome I.); J.C. Mychaleckyj (Josyf); H. Campbell (Harry); Q. Duan (Qing); L.A. Lange (Leslie); J.F. Wilson (James F); C. Hayward (Caroline); O. Polasek (Ozren); V. Vitart (Veronique); I. Rudan (Igor); A. Wright (Alan); S.S. Rich (Stephen S.); B.M. Psaty (Bruce); I.B. Borecki (Ingrid); P.M. Kearney (Patricia M.); D.J. Stott (David. J.); L.A. Cupples (Adrienne); J.W. Jukema (Jan Wouter); P. van der Harst (Pim); E.J.G. Sijbrands (Eric); J.J. Hottenga (Jouke Jan); A.G. Uitterlinden (André); M. Swertz (Morris); G.-J.B. Van Ommen (Gert-Jan B.); P.I.W. de Bakker (Paul); P. Eline Slagboom; D.I. Boomsma (Dorret); C. Wijmenga (Cisca); C.M. van Duijn (Cock); P.B.T. Neerincx (Pieter B T); C.C. Elbers (Clara); P.F. Palamara (Pier Francesco); I. Peer (Itsik); M. Abdellaoui (Mohammed); W.P. Kloosterman (Wigard); M. van Oven (Mannis); M. Vermaat (Martijn); M. Li (Mingkun); J.F.J. Laros (Jeroen F.); M. Stoneking (Mark); P. de Knijff (Peter); M.H. Kayser (Manfred); J.H. Veldink (Jan); L.H. van den Berg (Leonard); H. Byelas (Heorhiy); J.T. den Dunnen (Johan); M.K. Dijkstra; N. Amin (Najaf); K.J. Van Der Velde (K. Joeri); J. van Setten (Jessica); V.M. Kattenberg (Mathijs); F.D.M. Van Schaik (Fiona D.M.); J.J. Bot (Jan); I.J. Nijman (Isaac ); H. Mei (Hailiang); V. Koval (Vyacheslav); K. Ye (Kai); E.-W. Lameijer (Eric-Wubbo); H. Moed (Heleen); J. Hehir-Kwa (Jayne); R.E. Handsaker (Robert); S.R. Sunyaev (Shamil); M. Sohail (Mashaal); F. Hormozdiari (Fereydoun); T. Marschall (Tanja); A. Schönhuth (Alexander); V. Guryev (Victor); H.E.D. Suchiman (Eka); B.H.R. Wolffenbuttel (Bruce); I. Platteel (Inge); S.J. Pitts (Steven); S. Potluri (Shobha); D.R. Cox (David R.); Q. Li (Qibin); Y. Li (Yingrui); Y. Du (Yuanping); R. Chen (Ruoyan); H. Cao (Hongzhi); N. Li (Ning); S. Cao (Sujie); J. Wang (Jun); J.A. Bovenberg (Jasper)

    2015-01-01

    textabstractVariants associated with blood lipid levels may be population-specific. To identify low-frequency variants associated with this phenotype, population-specific reference panels may be used. Here we impute nine large Dutch biobanks (∼35,000 samples) with the population-specific reference p

  18. Improved imputation of low-frequency and rare variants using the UK10K haplotype reference panel

    NARCIS (Netherlands)

    J. Huang (Jie); B. Howie (Bryan); S. McCarthy (Shane); Y. Memari (Yasin); K. Walter (Klaudia); J.L. Min (Josine L.); P. Danecek (Petr); G. Malerba (Giovanni); E. Trabetti (Elisabetta); H.-F. Zheng (Hou-Feng); G. Gambaro (Giovanni); J.B. Richards (J. Brent); R. Durbin (Richard); N. Timpson (Nicholas); J. Marchini (Jonathan); N. Soranzo (Nicole); S. Al Turki (Saeed); A. Amuzu (Antoinette); C. Anderson (Carl); R. Anney (Richard); D. Antony (Dinu); M.S. Artigas; M. Ayub (Muhammad); S. Bala (Senduran); J. Barrett (Jeffrey); I. Barroso (Inês); P.L. Beales (Philip); M. Benn (Marianne); J. Bentham (Jamie); S. Bhattacharya (Shoumo); E. Birney (Ewan); D.H.R. Blackwood (Douglas); M. Bobrow (Martin); E. Bochukova (Elena); P.F. Bolton (Patrick F.); R. Bounds (Rebecca); C. Boustred (Chris); G. Breen (Gerome); M. Calissano (Mattia); K. Carss (Keren); J.P. Casas (Juan Pablo); J.C. Chambers (John C.); R. Charlton (Ruth); K. Chatterjee (Krishna); L. Chen (Lu); A. Ciampi (Antonio); S. Cirak (Sebahattin); P. Clapham (Peter); G. Clement (Gail); G. Coates (Guy); M. Cocca (Massimiliano); D.A. Collier (David); C. Cosgrove (Catherine); T. Cox (Tony); N.J. Craddock (Nick); L. Crooks (Lucy); S. Curran (Sarah); D. Curtis (David); A. Daly (Allan); I.N.M. Day (Ian N.M.); A.G. Day-Williams (Aaron); G.V. Dedoussis (George); T. Down (Thomas); Y. Du (Yuanping); C.M. van Duijn (Cock); I. Dunham (Ian); T. Edkins (Ted); R. Ekong (Rosemary); P. Ellis (Peter); D.M. Evans (David); I.S. Farooqi (I. Sadaf); D.R. Fitzpatrick (David R.); P. Flicek (Paul); J. Floyd (James); A.R. Foley (A. Reghan); C.S. Franklin (Christopher S.); M. Futema (Marta); L. Gallagher (Louise); P. Gasparini (Paolo); T.R. Gaunt (Tom); M. Geihs (Matthias); D. Geschwind (Daniel); C.M.T. Greenwood (Celia); H. Griffin (Heather); D. Grozeva (Detelina); X. Guo (Xiaosen); X. Guo (Xueqin); H. Gurling (Hugh); D. Hart (Deborah); A.E. Hendricks (Audrey E.); P.A. Holmans (Peter A.); L. Huang (Liren); T. Hubbard (Tim); S.E. Humphries (Steve E.); M.E. Hurles (Matthew); P.G. Hysi (Pirro); V. Iotchkova (Valentina); A. Isaacs (Aaron); D.K. Jackson (David K.); Y. Jamshidi (Yalda); J. Johnson (Jon); C. Joyce (Chris); K.J. Karczewski (Konrad); J. Kaye (Jane); T. Keane (Thomas); J.P. Kemp (John); K. Kennedy (Karen); A. Kent (Alastair); J. Keogh (Julia); F. Khawaja (Farrah); M.E. Kleber (Marcus E.); M. Van Kogelenberg (Margriet); A. Kolb-Kokocinski (Anja); J.S. Kooner (Jaspal S.); G. Lachance (Genevieve); C. Langenberg (Claudia); C. Langford (Cordelia); D. Lawson (Daniel); I. Lee (Irene); E.M. van Leeuwen (Elisa); M. Lek (Monkol); R. Li (Rui); Y. Li (Yingrui); J. Liang (Jieqin); H. Lin (Hong); R. Liu (Ryan); J. Lönnqvist (Jouko); L.R. Lopes (Luis R.); M.C. Lopes (Margarida); J. Luan; D.G. MacArthur (Daniel G.); M. Mangino (Massimo); G. Marenne (Gaëlle); W. März (Winfried); J. Maslen (John); A. Matchan (Angela); I. Mathieson (Iain); P. McGuffin (Peter); A.M. McIntosh (Andrew); A.G. McKechanie (Andrew G.); A. McQuillin (Andrew); S. Metrustry (Sarah); N. Migone (Nicola); H.M. Mitchison (Hannah M.); A. Moayyeri (Alireza); J. Morris (James); R. Morris (Richard); D. Muddyman (Dawn); F. Muntoni; B.G. Nordestgaard (Børge G.); K. Northstone (Kate); M.C. O'donovan (Michael); S. O'Rahilly (Stephen); A. Onoufriadis (Alexandros); K. Oualkacha (Karim); M.J. Owen (Michael J.); A. Palotie (Aarno); K. Panoutsopoulou (Kalliope); V. Parker (Victoria); J.R. Parr (Jeremy R.); L. Paternoster (Lavinia); T. Paunio (Tiina); F. Payne (Felicity); S.J. Payne (Stewart J.); J.R.B. Perry (John); O.P.H. 
Pietiläinen (Olli); V. Plagnol (Vincent); R.C. Pollitt (Rebecca C.); S. Povey (Sue); M.A. Quail (Michael A.); L. Quaye (Lydia); L. Raymond (Lucy); K. Rehnström (Karola); C.K. Ridout (Cheryl K.); S.M. Ring (Susan); G.R.S. Ritchie (Graham R.S.); N. Roberts (Nicola); R.L. Robinson (Rachel L.); D.B. Savage (David); P.J. Scambler (Peter); S. Schiffels (Stephan); M. Schmidts (Miriam); N. Schoenmakers (Nadia); R.H. Scott (Richard H.); R.A. Scott (Robert); R.K. Semple (Robert K.); E. Serra (Eva); S.I. Sharp (Sally I.); A.C. Shaw (Adam C.); H.A. Shihab (Hashem A.); S.-Y. Shin (So-Youn); D. Skuse (David); K.S. Small (Kerrin); C. Smee (Carol); G.D. Smith; L. Southam (Lorraine); O. Spasic-Boskovic (Olivera); T.D. Spector (Timothy); D. St. Clair (David); B. St Pourcain (Beate); J. Stalker (Jim); E. Stevens (Elizabeth); J. Sun (Jianping); G. Surdulescu (Gabriela); J. Suvisaari (Jaana); P. Syrris (Petros); I. Tachmazidou (Ioanna); R. Taylor (Rohan); J. Tian (Jing); M.D. Tobin (Martin); D. Toniolo (Daniela); M. Traglia (Michela); A. Tybjaerg-Hansen; A.M. Valdes; A.M. Vandersteen (Anthony M.); A. Varbo (Anette); P. Vijayarangakannan (Parthiban); P.M. Visscher (Peter); L.V. Wain (Louise); J.T. Walters (James); G. Wang (Guangbiao); J. Wang (Jun); Y. Wang (Yu); K. Ward (Kirsten); E. Wheeler (Eleanor); P.H. Whincup (Peter); T. Whyte (Tamieka); H.J. Williams (Hywel J.); K.A. Williamson (Kathleen); C. Wilson (Crispian); S.G. Wilson (Scott); K. Wong (Kim); C. Xu (Changjiang); J. Yang (Jian); G. Zaza (Gianluigi); E. Zeggini (Eleftheria); F. Zhang (Feng); P. Zhang (Pingbo); W. Zhang (Weihua)

    2015-01-01

    textabstractImputing genotypes from reference panels created by whole-genome sequencing (WGS) provides a cost-effective strategy for augmenting the single-nucleotide polymorphism (SNP) content of genome-wide arrays. The UK10K Cohorts project has generated a data set of 3,781 whole genomes sequenced

  19. 41 CFR 105-68.630 - May the General Services Administration impute conduct of one person to another?

    Science.gov (United States)

    2010-07-01

    ... 41 Public Contracts and Property Management 3 2010-07-01 2010-07-01 false May the General Services Administration impute conduct of one person to another? 105-68.630 Section 105-68.630 Public Contracts and Property Management Federal Property Management Regulations System (Continued) GENERAL...

  20. THE PENAL IMPUTATION IN THE ENVIRONMENT OF THE COMPANY AND THE STRUCTURES OMISSION: BASE FOR THEIR ANALYSIS

    OpenAIRE

    Cesano, Jose Daniel

    2009-01-01

    The purpose of this paper is to analyse the scope of the omission structure as an instrument of personal imputation within the company.

  1. The use of imputed sibling genotypes in sibship-based association analysis: On modeling alternatives, power and model misspecification

    NARCIS (Netherlands)

    Minica, C.C.; Dolan, C.V.; Hottenga, J.J.; Willemsen, G.; Vink, J.M.; Boomsma, D.I.

    2013-01-01

    When phenotypic, but no genotypic data are available for relatives of participants in genetic association studies, previous research has shown that family-based imputed genotypes can boost the statistical power when included in such studies. Here, using simulations, we compared the performance of tw

  2. Genome of the Netherlands population-specific imputations identify an ABCA6 variant associated with cholesterol levels

    NARCIS (Netherlands)

    Van Leeuwen, Elisabeth M.; Karssen, Lennart C.; Deelen, Joris; Isaacs, Aaron; Medina-Gomez, Carolina; Mbarek, Hamdi; Kanterakis, Alexandros; Trompet, Stella; Postmus, Iris; Verweij, Niek; Van Enckevort, David J.; Huffman, Jennifer E.; White, Charles C.; Feitosa, Mary F.; Bartz, Traci M.; Manichaikul, Ani; Joshi, Peter K.; Peloso, Gina M.; Deelen, Patrick; Van Dijk, Freerk; Willemsen, Gonneke; De Geus, Eco J.; Milaneschi, Yuri; Penninx, Brenda W J H; Francioli, Laurent C.; Menelaou, Androniki; Pulit, Sara L.; Rivadeneira, Fernando; Hofman, Albert; Oostra, Ben A.; Franco, Oscar H.; Leach, Irene Mateo; Beekman, Marian; De Craen, Anton J M; Uh, Hae Won; Trochet, Holly; Hocking, Lynne J.; Porteous, David J.; Sattar, Naveed; Packard, Chris J.; Buckley, Brendan M.; Brody, Jennifer A.; Bis, Joshua C.; Rotter, Jerome I.; Mychaleckyj, Josyf C.; Campbell, Harry; Duan, Qing; Lange, Leslie A.; Wilson, James F.; Hayward, Caroline; Polasek, Ozren; Vitart, Veronique; Rudan, Igor; Wright, Alan F.; Rich, Stephen S.; Psaty, Bruce M.; Borecki, Ingrid B.; Kearney, Patricia M.; Stott, David J.; Cupples, L. Adrienne; Jukema, J. Wouter; Van Der Harst, Pim; Sijbrands, Eric J.; Hottenga, Jouke Jan; Uitterlinden, Andre G.; Swertz, Morris A.; Van Ommen, Gert Jan B; De Bakker, Paul I W; Eline Slagboom, P.; Boomsma, Dorret I.; Wijmenga, Cisca; Van Duijn, Cornelia M.; Neerincx, Pieter B T; Elbers, Clara C.; Palamara, Pier Francesco; Peer, Itsik; Abdellaoui, Abdel; Kloosterman, Wigard P.|info:eu-repo/dai/nl/304076953; Van Oven, Mannis; Vermaat, Martijn; Li, Mingkun; Laros, Jeroen F J; Stoneking, Mark; De Knijff, Peter; Kayser, Manfred; Veldink, Jan H.|info:eu-repo/dai/nl/266575722; Van Den Berg, Leonard H.|info:eu-repo/dai/nl/288255216; Byelas, Heorhiy; Den Dunnen, Johan T.; Dijkstra, Martijn; Amin, Najaf; Van Der Velde, K. Joeri; Van Setten, Jessica|info:eu-repo/dai/nl/345493990; Kattenberg, Mathijs; Van Schaik, Barbera D C; Bot, Jan; Nijman, Isaäc J.|info:eu-repo/dai/nl/185967833; Mei, Hailiang; Koval, Vyacheslav; Ye, Kai; Lameijer, Eric Wubbo; Moed, Matthijs H.; Hehir-Kwa, Jayne Y.; Handsaker, Robert E.; Sunyaev, Shamil R.; Sohail, Mashaal; Hormozdiari, Fereydoun; Marschall, Tobias; Schönhuth, Alexander; Guryev, Victor|info:eu-repo/dai/nl/343083132; Suchiman, H. Eka D; Wolffenbuttel, Bruce H.; Platteel, Mathieu; Pitts, Steven J.; Potluri, Shobha; Cox, David R.; Li, Qibin; Li, Yingrui; Du, Yuanping; Chen, Ruoyan; Cao, Hongzhi; Li, Ning; Cao, Sujie; Wang, Jun; Bovenberg, Jasper A.; de Bakker, Paul I W|info:eu-repo/dai/nl/342957082

    2015-01-01

    Variants associated with blood lipid levels may be population-specific. To identify low-frequency variants associated with this phenotype, population-specific reference panels may be used. Here we impute nine large Dutch biobanks (∼35,000 samples) with the population-specific reference panel created

  3. Auxiliary variables in multiple imputation in regression with missing X: a warning against including too many in small sample research

    Science.gov (United States)

    2012-01-01

    Background Multiple imputation is becoming increasingly popular. Theoretical considerations as well as simulation studies have shown that the inclusion of auxiliary variables is generally of benefit. Methods A simulation study of a linear regression with a response Y and two predictors X1 and X2 was performed on data with n = 50, 100 and 200 using complete cases or multiple imputation with 0, 10, 20, 40 and 80 auxiliary variables. Mechanisms of missingness were either 100% MCAR or 50% MAR + 50% MCAR. Auxiliary variables had low (r=.10) vs. moderate correlations (r=.50) with X’s and Y. Results The inclusion of auxiliary variables can improve a multiple imputation model. However, inclusion of too many variables leads to downward bias of regression coefficients and decreases precision. When the correlations are low, inclusion of auxiliary variables is not useful. Conclusion More research on auxiliary variables in multiple imputation should be performed. A preliminary rule of thumb could be that the ratio of variables to cases with complete data should not go below 1 : 3. PMID:23216665
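
    A minimal version of the scenario described, multiple imputation of a missing predictor with and without a block of auxiliary variables in a small sample, can be sketched with scikit-learn's IterativeImputer used as an approximate MI engine. The sample sizes, correlations and the simple coefficient-pooling shortcut below are illustrative assumptions, not the study's exact simulation design.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(3)
n, n_aux = 100, 40                                   # small sample, many auxiliaries

x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
aux = 0.5 * x1[:, None] + np.sqrt(1 - 0.25) * rng.normal(size=(n, n_aux))  # r ~ .5 with X1
y = 1.0 * x1 + 1.0 * x2 + rng.normal(size=n)

x1_mis = x1.copy()
x1_mis[rng.random(n) < 0.4] = np.nan                 # 40% of X1 missing (MCAR, for simplicity)

def pooled_coef(design_cols, m=20):
    """Multiply impute X1, refit the regression of Y on X1 and X2 each time, average the coefficients."""
    coefs = []
    for k in range(m):
        imp = IterativeImputer(sample_posterior=True, random_state=k, max_iter=10)
        completed = imp.fit_transform(np.column_stack(design_cols))
        X = completed[:, :2]                          # X1 (imputed) and X2
        coefs.append(LinearRegression().fit(X, y).coef_)
    return np.mean(coefs, axis=0)

no_aux = pooled_coef([x1_mis, x2, y])                 # outcome included in the imputation model
with_aux = pooled_coef([x1_mis, x2, y, aux])
print("pooled (b1, b2) without auxiliaries:", np.round(no_aux, 2))
print("pooled (b1, b2) with 40 auxiliaries:", np.round(with_aux, 2))
```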

  4. Propensity Scoring after Multiple Imputation in a Retrospective Study on Adjuvant Radiation Therapy in Lymph-Node Positive Vulvar Cancer

    NARCIS (Netherlands)

    Eulenburg, Christine; Suling, Anna; Neuser, Petra; Reuss, Alexander; Canzler, Ulrich; Fehm, Tanja; Luyten, Alexander; Hellriegel, Martin; Woelber, Linn; Mahner, Sven

    2016-01-01

    Propensity scoring (PS) is an established tool to account for measured confounding in non-randomized studies. These methods are sensitive to missing values, which are a common problem in observational data. The combination of multiple imputation of missing values and different propensity scoring

  5. Missing data in a multi-item instrument were best handled by multiple imputation at the item score level

    NARCIS (Netherlands)

    Eekhout, Iris; de Vet, Henrica C. W.; Twisk, Jos W. R.; Brand, Jaap P. L.; de Boer, Michiel R.; Heymans, Martijn W.

    2014-01-01

    Objectives: Regardless of the proportion of missing values, complete-case analysis is most frequently applied, although advanced techniques such as multiple imputation (MI) are available. The objective of this study was to explore the performance of simple and more advanced methods for handling miss

  6. Application of a novel hybrid method for spatiotemporal data imputation: A case study of the Minqin County groundwater level

    Science.gov (United States)

    Zhang, Zhongrong; Yang, Xuan; Li, Hao; Li, Weide; Yan, Haowen; Shi, Fei

    2017-10-01

    The techniques for data analysis have been widely developed in past years; however, missing data still represent a ubiquitous problem in many scientific fields. In particular, dealing with missing spatiotemporal data presents an enormous challenge. Nonetheless, in recent years, a considerable amount of research has focused on spatiotemporal problems, making spatiotemporal missing data imputation methods increasingly indispensable. In this paper, a novel spatiotemporal hybrid method is proposed to verify and impute spatiotemporal missing values. This new method, termed SOM-FLSSVM, flexibly combines three advanced techniques: self-organizing feature map (SOM) clustering, the fruit fly optimization algorithm (FOA) and the least squares support vector machine (LSSVM). We employ a cross-validation (CV) procedure and an FOA swarm intelligence optimization strategy that can search the available parameters and determine the optimal imputation model. The spatiotemporal groundwater data for Minqin County, China, were selected to test the reliability and imputation ability of SOM-FLSSVM. We carried out a validation experiment and compared three well-studied models with SOM-FLSSVM using missing data ratios from 0.1 to 0.8 in the same data set. The results demonstrate that the new hybrid method performs well in terms of both robustness and accuracy for spatiotemporal missing data.

  7. Improved imputation of low-frequency and rare variants using the UK10K haplotype reference panel

    NARCIS (Netherlands)

    J. Huang (Jie); B. Howie (Bryan); S. McCarthy (Shane); Y. Memari (Yasin); K. Walter (Klaudia); J.L. Min (Josine L.); P. Danecek (Petr); G. Malerba (Giovanni); E. Trabetti (Elisabetta); H.-F. Zheng (Hou-Feng); G. Gambaro (Giovanni); J.B. Richards (Brent); R. Durbin (Richard); N.J. Timpson (Nicholas); J. Marchini (Jonathan); N. Soranzo (Nicole); S. Al Turki (Saeed); A. Amuzu (Antoinette); C. Anderson (Carl); R. Anney (Richard); D. Antony (Dinu); M.S. Artigas; M. Ayub (Muhammad); S. Bala (Senduran); J. Barrett (Jeffrey); I. Barroso (Inês); P.L. Beales (Philip); M. Benn (Marianne); J. Bentham (Jamie); S. Bhattacharya (Shoumo); E. Birney (Ewan); D.H.R. Blackwood (Douglas); M. Bobrow (Martin); E. Bochukova (Elena); P.F. Bolton (Patrick F.); R. Bounds (Rebecca); C. Boustred (Chris); G. Breen (Gerome); M. Calissano (Mattia); K. Carss (Keren); J.P. Casas (Juan Pablo); J.C. Chambers (John C.); R. Charlton (Ruth); K. Chatterjee (Krishna); L. Chen (Lu); A. Ciampi (Antonio); S. Cirak (Sebahattin); P. Clapham (Peter); G. Clement (Gail); G. Coates (Guy); M. Cocca (Massimiliano); D.A. Collier (David); C. Cosgrove (Catherine); T. Cox (Tony); N.J. Craddock (Nick); L. Crooks (Lucy); S. Curran (Sarah); D. Curtis (David); A. Daly (Allan); I.N.M. Day (Ian N.M.); A.G. Day-Williams (Aaron); G.V. Dedoussis (George); T. Down (Thomas); Y. Du (Yuanping); C.M. van Duijn (Cock); I. Dunham (Ian); T. Edkins (Ted); R. Ekong (Rosemary); P. Ellis (Peter); D.M. Evans (David); I.S. Farooqi (I. Sadaf); D.R. Fitzpatrick (David R.); P. Flicek (Paul); J. Floyd (James); A.R. Foley (A. Reghan); C.S. Franklin (Christopher S.); M. Futema (Marta); L. Gallagher (Louise); P. Gasparini (Paolo); T.R. Gaunt (Tom); M. Geihs (Matthias); D. Geschwind (Daniel); C.M.T. Greenwood (Celia); H. Griffin (Heather); D. Grozeva (Detelina); X. Guo (Xiaosen); X. Guo (Xueqin); H. Gurling (Hugh); D. Hart (Deborah); A.E. Hendricks (Audrey E.); P.A. Holmans (Peter A.); L. Huang (Liren); T. Hubbard (Tim); S.E. Humphries (Steve E.); M.E. Hurles (Matthew); P.G. Hysi (Pirro); V. Iotchkova (Valentina); A. Isaacs (Aaron); D.K. Jackson (David K.); Y. Jamshidi (Yalda); J. Johnson (Jon); C. Joyce (Chris); K.J. Karczewski (Konrad); J. Kaye (Jane); T. Keane (Thomas); J.P. Kemp (John); K. Kennedy (Karen); A. Kent (Alastair); J. Keogh (Julia); F. Khawaja (Farrah); M.E. Kleber (Marcus E.); M. Van Kogelenberg (Margriet); A. Kolb-Kokocinski (Anja); J.S. Kooner (Jaspal S.); G. Lachance (Genevieve); C. Langenberg (Claudia); C. Langford (Cordelia); D. Lawson (Daniel); I. Lee (Irene); E.M. van Leeuwen (Elisa); M. Lek (Monkol); R. Li (Rui); Y. Li (Yingrui); J. Liang (Jieqin); H. Lin (Hong); R. Liu (Ryan); J. Lönnqvist (Jouko); L.R. Lopes (Luis R.); M.C. Lopes (Margarida); J. Luan; D.G. MacArthur (Daniel G.); M. Mangino (Massimo); G. Marenne (Gaëlle); W. März (Winfried); J. Maslen (John); A. Matchan (Angela); I. Mathieson (Iain); P. McGuffin (Peter); A.M. McIntosh (Andrew); A.G. McKechanie (Andrew G.); A. McQuillin (Andrew); S. Metrustry (Sarah); N. Migone (Nicola); H.M. Mitchison (Hannah M.); A. Moayyeri (Alireza); J. Morris (James); R. Morris (Richard); D. Muddyman (Dawn); F. Muntoni; B.G. Nordestgaard (Børge G.); K. Northstone (Kate); M.C. O'donovan (Michael); S. O'Rahilly (Stephen); A. Onoufriadis (Alexandros); K. Oualkacha (Karim); M.J. Owen (Michael J.); A. Palotie (Aarno); K. Panoutsopoulou (Kalliope); V. Parker (Victoria); J.R. Parr (Jeremy R.); L. Paternoster (Lavinia); T. Paunio (Tiina); F. Payne (Felicity); S.J. Payne (Stewart J.); J.R.B. Perry (John); O.P.H. 
Pietiläinen (Olli); V. Plagnol (Vincent); R.C. Pollitt (Rebecca C.); S. Povey (Sue); M.A. Quail (Michael A.); L. Quaye (Lydia); L. Raymond (Lucy); K. Rehnström (Karola); C.K. Ridout (Cheryl K.); S.M. Ring (Susan); G.R.S. Ritchie (Graham R.S.); N. Roberts (Nicola); R.L. Robinson (Rachel L.); D.B. Savage (David); P.J. Scambler (Peter); S. Schiffels (Stephan); M. Schmidts (Miriam); N. Schoenmakers (Nadia); R.H. Scott (Richard H.); R.A. Scott (Robert); R.K. Semple (Robert K.); E. Serra (Eva); S.I. Sharp (Sally I.); A.C. Shaw (Adam C.); H.A. Shihab (Hashem A.); S.-Y. Shin (So-Youn); D. Skuse (David); K.S. Small (Kerrin); C. Smee (Carol); G.D. Smith; L. Southam (Lorraine); O. Spasic-Boskovic (Olivera); T.D. Spector (Timothy); D. St. Clair (David); B. St Pourcain (Beate); J. Stalker (Jim); E. Stevens (Elizabeth); J. Sun (Jianping); G. Surdulescu (Gabriela); J. Suvisaari (Jaana); P. Syrris (Petros); I. Tachmazidou (Ioanna); R. Taylor (Rohan)

    2015-01-01

    textabstractImputing genotypes from reference panels created by whole-genome sequencing (WGS) provides a cost-effective strategy for augmenting the single-nucleotide polymorphism (SNP) content of genome-wide arrays. The UK10K Cohorts project has generated a data set of 3,781 whole genomes sequenced

  8. Improved imputation of low-frequency and rare variants using the UK10K haplotype reference panel

    DEFF Research Database (Denmark)

    Huang, Jie; Howie, Bryan; Mccarthy, Shane

    2015-01-01

    Imputing genotypes from reference panels created by whole-genome sequencing (WGS) provides a cost-effective strategy for augmenting the single-nucleotide polymorphism (SNP) content of genome-wide arrays. The UK10K Cohorts project has generated a data set of 3,781 whole genomes sequenced at low de...

  9. Age at menopause: imputing age at menopause for women with a hysterectomy with application to risk of postmenopausal breast cancer

    Science.gov (United States)

    Rosner, Bernard; Colditz, Graham A.

    2011-01-01

    Purpose Age at menopause, a major marker in reproductive life, may bias results in the evaluation of breast cancer risk after menopause. Methods We follow 38,948 premenopausal women in 1980 and identify 2,586 who reported hysterectomy without bilateral oophorectomy, and 31,626 who reported natural menopause during 22 years of follow-up. We evaluate risk factors for natural menopause, impute age at natural menopause for women reporting hysterectomy without bilateral oophorectomy and estimate the hazard of reaching natural menopause in the next 2 years. We apply this imputed age at menopause both to increase sample size and to evaluate the relation between postmenopausal exposures and risk of breast cancer. Results Age, cigarette smoking, age at menarche, pregnancy history, body mass index, history of benign breast disease, and history of breast cancer were each significantly related to age at natural menopause; duration of oral contraceptive use and family history of breast cancer were not. The imputation increased sample size substantially, and although some risk factors after menopause were weaker in the expanded model (height and alcohol use), the estimate for use of hormone therapy is less biased. Conclusions Imputing age at menopause increases sample size, broadens generalizability by making results applicable to women with hysterectomy, and reduces bias. PMID:21441037

  10. A Multi-Faceted Approach to Analyse the Effects of Environmental Variables on Geographic Range and Genetic Structure of a Perennial Psammophilous Geophyte: The Case of the Sea Daffodil Pancratium maritimum L. in the Mediterranean Basin.

    Science.gov (United States)

    De Castro, Olga; Di Maio, Antonietta; Di Febbraro, Mirko; Imparato, Gennaro; Innangi, Michele; Véla, Errol; Menale, Bruno

    2016-01-01

    The Mediterranean coastline is a dynamic and complex system which owes its complexity to its past and present vicissitudes, e.g. a complex tectonic history, climatic fluctuations, and prolonged coexistence with human activities. A plant species that is widespread in this habitat is the sea daffodil, Pancratium maritimum (Amaryllidaceae), a perennial clonal geophyte of the coastal sands of the Mediterranean and neighbouring areas, well adapted to the stressful conditions of sand dune environments. In this study, an integrated approach was used, combining genetic and environmental data with a niche modelling approach, to investigate: (1) the effect of climate change on the geographic range of this species at different times (past: last inter-glacial, LIG, and last glacial maximum, LGM; present: CURR; near-future: FUT), and (2) the possible influence of environmental variables on the genetic structure of this species in the current period. The genetic results show that 48 sea daffodil populations (867 specimens) display good genetic diversity, with the marginal populations (i.e. the Atlantic populations) presenting lower values. A recent genetic signature of a bottleneck was detected in a few populations (8%). The molecular variation was higher within populations (77%) and two genetic pools were well represented. Comparing the climatic simulations across time periods, the global range of this plant has increased, and a further extension is foreseen in the near future, as the projections forecast a climate more similar to that of the Mediterranean coast for areas that are currently more temperate. A significant positive correlation was observed between genetic distance and the Precipitation of Coldest Quarter variable in the current period. Our analyses support the hypothesis that the geomorphology of the Mediterranean coasts, sea currents, and climate have played significant roles in shaping the current genetic structure of the sea

  11. The Geographical Information System

    Directory of Open Access Journals (Sweden)

    Jürgen Schweikart

    2008-12-01

    Full Text Available The Geographical Information System, normally called GIS, is a tool for representing spatial relationships and real processes with the help of a model. A GIS is a system of hardware, software and staff for collecting, managing, analysing and representing geospatial information. For example, we can study the evolution of an infectious disease in a certain territory, perform market analysis, or identify the best location for a new industrial site. In essence, it is data-manipulation software that gives us both the graphic component, that is, a territorial representation of the reality we want to represent, and the data component in the form of a database or, more commonly, calculation sheets. Geographical data are divided into spatial data and attribute data. Spatial data are recorded as points, lines and polygons (vector structure), or acquired over elementary cells corresponding to a territorial grid (raster structure), which also includes remote sensing data.

  12. ParaHaplo 3.0: A program package for imputation and a haplotype-based whole-genome association study using hybrid parallel computing

    Directory of Open Access Journals (Sweden)

    Kamatani Naoyuki

    2011-05-01

    Full Text Available Abstract Background The use of missing genotype imputation and haplotype reconstruction is valuable in genome-wide association studies (GWASs). By modeling the patterns of linkage disequilibrium in a reference panel, genotypes not directly measured in the study samples can be imputed and used for GWASs. Since millions of single nucleotide polymorphisms need to be imputed in a GWAS, faster methods for genotype imputation and haplotype reconstruction are required. Results We developed a program package for parallel computation of genotype imputation and haplotype reconstruction. Our program package, ParaHaplo 3.0, is intended for use on workstation clusters using the Intel Message Passing Interface. We compared the performance of ParaHaplo 3.0 on the Japanese in Tokyo, Japan and Han Chinese in Beijing, China samples from the HapMap dataset. A parallel version of ParaHaplo 3.0 can conduct genotype imputation 20 times faster than a non-parallel version of ParaHaplo. Conclusions ParaHaplo 3.0 is an invaluable tool for conducting haplotype-based GWASs. The need for faster genotype imputation and haplotype reconstruction using parallel computing will become increasingly important as the data sizes of such projects continue to increase. ParaHaplo executable binaries and program sources are available at http://en.sourceforge.jp/projects/parallelgwas/releases/.

  13. Mendel-GPU: haplotyping and genotype imputation on graphics processing units.

    Science.gov (United States)

    Chen, Gary K; Wang, Kai; Stram, Alex H; Sobel, Eric M; Lange, Kenneth

    2012-11-15

    In modern sequencing studies, one can improve the confidence of genotype calls by phasing haplotypes using information from an external reference panel of fully typed unrelated individuals. However, the computational demands are so high that they prohibit researchers with limited computational resources from haplotyping large-scale sequence data. Our graphics processing unit based software delivers haplotyping and imputation accuracies comparable to competing programs at a fraction of the computational cost and peak memory demand. Mendel-GPU, our OpenCL software, runs on Linux platforms and is portable across AMD and nVidia GPUs. Users can download both code and documentation at http://code.google.com/p/mendel-gpu/. gary.k.chen@usc.edu. Supplementary data are available at Bioinformatics online.

  14. Imputed DC link (IDCL) cell based power converters and control thereof

    Energy Technology Data Exchange (ETDEWEB)

    Divan, Deepakraj M.; Prasai, Anish; Hernendez, Jorge; Moghe, Rohit; Iyer, Amrit; Kandula, Rajendra Prasad

    2016-04-26

    Power flow controllers based on Imputed DC Link (IDCL) cells are provided. The IDCL cell is a self-contained power electronic building block (PEBB). The IDCL cell may be stacked in series and parallel to achieve power flow control at higher voltage and current levels. Each IDCL cell may comprise a gate drive, a voltage sharing module, and a thermal management component in order to facilitate easy integration of the cell into a variety of applications. By providing direct AC conversion, the IDCL cell based AC/AC converters reduce device count, eliminate the use of electrolytic capacitors that have life and reliability issues, and improve system efficiency compared with similarly rated back-to-back inverter system.

  15. Comparison of Imputation Methods for Handling Missing Categorical Data with Univariate Pattern|| Una comparación de métodos de imputación de variables categóricas con patrón univariado

    Directory of Open Access Journals (Sweden)

    Torres Munguía, Juan Armando

    2014-06-01

    Full Text Available This paper examines sample proportion estimates in the presence of univariate missing categorical data. A database about smoking habits (2011 National Addiction Survey of Mexico) was used to create simulated yet realistic datasets with missingness rates of 5% and 15%, each under the MCAR, MAR and MNAR mechanisms. The performance of six methods for addressing missingness is then evaluated: listwise deletion, mode imputation, random imputation, hot-deck, imputation by polytomous regression, and random forests. Results showed that the most effective methods for dealing with missing categorical data in most of the scenarios assessed in this paper were the hot-deck and polytomous regression approaches.
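
    Of the six methods compared, the hot-deck approach is the easiest to sketch: each nonrespondent receives the observed value of a randomly drawn donor from the same adjustment cell. The survey variables, cell definitions and missingness rate below are invented, not taken from the Mexican addiction survey.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(4)

# Hypothetical survey: smoking status (categorical) with fully observed covariates sex and age group.
n = 2000
df = pd.DataFrame({
    "sex": rng.choice(["M", "F"], n),
    "age_group": rng.choice(["15-29", "30-49", "50+"], n),
})
p_smoker = np.where(df["sex"].eq("M"), 0.30, 0.15)
df["smoker"] = np.where(rng.random(n) < p_smoker, "yes", "no")

true_prop = (df["smoker"] == "yes").mean()
df.loc[rng.random(n) < 0.15, "smoker"] = np.nan      # 15% nonresponse

def hot_deck(data, target, by):
    """Random hot-deck: each nonrespondent receives the value of a randomly chosen
    respondent ("donor") from the same adjustment cell defined by `by`."""
    out = data[target].copy()
    for _, cell in data.groupby(by):
        donors = cell[target].dropna()
        recipients = cell.index[cell[target].isna()]
        if len(donors) and len(recipients):
            out.loc[recipients] = rng.choice(donors.to_numpy(), size=len(recipients))
    return out

df["smoker_imputed"] = hot_deck(df, "smoker", by=["sex", "age_group"])
print(f"true proportion of smokers:         {true_prop:.3f}")
print(f"hot-deck estimate after imputation: {(df['smoker_imputed'] == 'yes').mean():.3f}")
```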

  16. Application of an imputation method for geospatial inventory of forest structural attributes across multiple spatial scales in the Lake States, U.S.A

    Science.gov (United States)

    Deo, Ram K.

    Credible spatial information characterizing the structure and site quality of forests is critical to sustainable forest management and planning, especially given the increasing demands on and threats to forest products and services. Forest managers and planners are required to evaluate forest conditions over a broad range of scales, contingent on operational or reporting requirements. Traditionally, forest inventory estimates are generated via a design-based approach that involves generalizing sample plot measurements to characterize an unknown population across a larger area of interest. However, field plot measurements are costly and, as a consequence, spatial coverage is limited. Remote sensing technologies have shown remarkable success in augmenting limited sample plot data to generate stand- and landscape-level spatial predictions of forest inventory attributes. Further enhancement of forest inventory approaches that couple field measurements with cutting-edge remotely sensed and geospatial datasets is essential to sustainable forest management. We evaluated a novel Random Forest based k Nearest Neighbors (RF-kNN) imputation approach to couple remote sensing and geospatial data with field inventory collected by different sampling methods to generate forest inventory information across large spatial extents. The forest inventory data collected by the FIA program of the US Forest Service were integrated with optical remote sensing and other geospatial datasets to produce biomass distribution maps for a part of the Lake States and species-specific site index maps for the entire Lake States. Targeting small-area application of state-of-the-art remote sensing, LiDAR (light detection and ranging) data were integrated with field data collected by an inexpensive method, called variable plot sampling, in the Ford Forest of Michigan Tech to derive a standing volume map in a cost-effective way. The outputs of the RF-kNN imputation were compared with independent validation
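
    The RF-kNN idea, using a random forest fitted to reference plot data as a similarity metric and then imputing each map unit with the attribute of its most similar plot, can be sketched as follows. The plot counts, predictor metrics and biomass values are synthetic, and the single-neighbour proximity rule is a simplification of the study's pipeline.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(5)

# Hypothetical data: spectral/LiDAR metrics for FIA-style reference plots and for map pixels.
n_plots, n_pixels, n_metrics = 300, 1000, 6
X_plots = rng.normal(size=(n_plots, n_metrics))
biomass = 80 + 25 * X_plots[:, 0] - 15 * X_plots[:, 1] + rng.normal(0, 10, n_plots)  # Mg/ha
X_pixels = rng.normal(size=(n_pixels, n_metrics))

# Step 1: a random forest relating the plot attribute to the remotely sensed metrics.
rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_plots, biomass)

# Step 2: RF proximity -- the share of trees in which two observations fall in the same leaf.
leaves_plots = rf.apply(X_plots)        # (n_plots, n_trees) terminal-node indices
leaves_pixels = rf.apply(X_pixels)      # (n_pixels, n_trees)

def impute_k_nearest(leaves_targets, leaves_refs, reference_values, k=1):
    """k-NN imputation in random-forest proximity space (k=1 imputes the single
    most similar reference plot's measured value)."""
    out = np.empty(len(leaves_targets))
    for i, row in enumerate(leaves_targets):
        proximity = (row == leaves_refs).mean(axis=1)       # similarity to every plot
        nearest = np.argsort(proximity)[::-1][:k]
        out[i] = reference_values[nearest].mean()
    return out

pixel_biomass = impute_k_nearest(leaves_pixels, leaves_plots, biomass, k=1)
print("first five imputed pixel biomass values (Mg/ha):", np.round(pixel_biomass[:5], 1))
```

    Imputing a measured plot value, rather than a model prediction, is what keeps the mapped attribute within the range of observed field data, which is the usual argument for kNN-style imputation in forest inventory.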

  17. GEOGRAPHIC NAMES INFORMATION SYSTEM (GNIS) ...

    Science.gov (United States)

    The Geographic Names Information System (GNIS), developed by the U.S. Geological Survey in cooperation with the U.S. Board on Geographic Names (BGN), contains information about physical and cultural geographic features in the United States and associated areas, both current and historical, but not including roads and highways. The database also contains geographic names in Antarctica. The database holds the Federally recognized name of each feature and defines the location of the feature by state, county, USGS topographic map, and geographic coordinates. Other feature attributes include names or spellings other than the official name, feature designations, feature class, historical and descriptive information, and for some categories of features the geometric boundaries. The database assigns a unique feature identifier, a random number, that is a key for accessing, integrating, or reconciling GNIS data with other data sets. The GNIS is our Nation's official repository of domestic geographic feature names information.

  18. Geographic Versus Industry Diversification: Constraints Matter

    OpenAIRE

    Ehling, Paul; Sofia B. Ramos

    2004-01-01

    This research addresses whether geographic diversification provides benefits over industry diversification in a sample of European country and industry indexes. The methodology allows performance comparisons with short-selling constraints, upper and lower bounds, and many benchmarks. In the absence of constraints, no empirical evidence is found to support the argument that country diversification is a superior approach. In the case of realistic weights on portfolios such as short-selling, and...

  19. Missing Data in Substance Abuse Treatment Research: Current Methods and Modern Approaches

    Science.gov (United States)

    McPherson, Sterling; Barbosa-Leiker, Celestina; Burns, G. Leonard; Howell, Donelle; Roll, John

    2013-01-01

    Two common procedures for the treatment of missing information, listwise deletion and positive urine analysis (UA) imputation (e.g., if the participant fails to provide urine for analysis, then score the UA positive), may result in significant biases during the interpretation of treatment effects. To compare these approaches and to offer a possible alternative, these two procedures were compared to the multiple imputation (MI) procedure with publicly available data from a recent clinical trial. Listwise deletion, single imputation (i.e., positive UA imputation), and MI missing data procedures were used to comparatively examine the effect of two different buprenorphine/naloxone tapering schedules (7- or 28-days) for opioid addiction on the likelihood of a positive UA (Clinical Trial Network 0003; Ling et al., 2009). The listwise deletion of missing data resulted in a nonsignificant effect for the taper while the positive UA imputation procedure resulted in a significant effect, replicating the original findings by Ling et al. (2009). Although the MI procedure also resulted in a significant effect, the effect size was meaningfully smaller and the standard errors meaningfully larger when compared to the positive UA procedure. This study demonstrates that the researcher can obtain markedly different results depending on how the missing data are handled. Missing data theory suggests that listwise deletion and single imputation procedures should not be used to account for missing information, and that MI has advantages with respect to internal and external validity when the assumption of missing at random can be reasonably supported. PMID:22329556
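
    The three missing-data rules contrasted in this record can be compared on a small synthetic trial: listwise deletion, scoring every missing urine sample positive, and drawing missing outcomes from a model given arm and a baseline covariate. The arm labels, covariate and dropout mechanism below are invented, and the MI step is simplified (it ignores parameter uncertainty, which proper multiple imputation would propagate).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(6)

# Hypothetical trial: 400 participants, a baseline severity score, and a binary end-of-taper UA.
n = 400
arm = rng.integers(0, 2, n)                       # 0 = 7-day taper, 1 = 28-day taper (labels assumed)
severity = rng.normal(0, 1, n)
p_pos = 1 / (1 + np.exp(-(-0.2 + 0.5 * severity - 0.6 * arm)))
ua_pos = (rng.random(n) < p_pos).astype(float)

# Dropout related to severity (MAR given severity), so the three rules can disagree.
missing = rng.random(n) < 1 / (1 + np.exp(-(-1.0 + 1.0 * severity)))
ua_obs = ua_pos.copy()
ua_obs[missing] = np.nan

def arm_rates(values, keep):
    return [values[keep & (arm == a)].mean() for a in (0, 1)]

# 1) Listwise deletion: analyse completers only.
listwise = arm_rates(ua_obs, ~missing)

# 2) Positive-UA single imputation: a missing sample is scored positive.
single = arm_rates(np.where(missing, 1.0, ua_obs), np.ones(n, bool))

# 3) Simplified multiple imputation: draw missing UAs from a logistic model given arm and severity.
model = LogisticRegression().fit(np.column_stack([arm, severity])[~missing], ua_obs[~missing])
p_missing = model.predict_proba(np.column_stack([arm, severity])[missing])[:, 1]
mi_rates = []
for _ in range(50):
    filled = ua_obs.copy()
    filled[missing] = (rng.random(missing.sum()) < p_missing).astype(float)
    mi_rates.append(arm_rates(filled, np.ones(n, bool)))
multiple = np.mean(mi_rates, axis=0)

for name, est in [("listwise deletion", listwise),
                  ("positive-UA imputation", single),
                  ("multiple imputation", multiple)]:
    print(f"{name:>24}: positive-UA rate by arm = {np.round(est, 2)}")
```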

  20. Partial imputation to improve predictive modelling in insurance risk classification using a hybrid positive selection algorithm and correlation-based feature selection

    CSIR Research Space (South Africa)

    Duma, M

    2013-09-01

    Full Text Available We propose a hybrid missing data imputation technique using positive selection and correlation-based feature selection for insurance data. The hybrid is used to help supervised learning methods improve their classification accuracy and resilience...

  1. Coloring geographical threshold graphs

    Energy Technology Data Exchange (ETDEWEB)

    Bradonjic, Milan [Los Alamos National Laboratory; Percus, Allon [Los Alamos National Laboratory; Muller, Tobias [EINDHOVEN UNIV. OF TECH

    2008-01-01

    We propose a coloring algorithm for sparse random graphs generated by the geographical threshold graph (GTG) model, a generalization of random geometric graphs (RGG). In a GTG, nodes are distributed in a Euclidean space, and edges are assigned according to a threshold function involving the distance between nodes as well as randomly chosen node weights. The motivation for analyzing this model is that many real networks (e.g., wireless networks, the Internet, etc.) need to be studied by using a 'richer' stochastic model (which in this case includes both a distance between nodes and weights on the nodes). Here, we analyze the GTG coloring algorithm together with the graph's clique number, showing formally that in spite of the differences in structure between GTG and RGG, the asymptotic behavior of the chromatic number is identical: chi ~ (ln n / ln ln n)(1 + o(1)). Finally, we consider the leading corrections to this expression, again using the coloring algorithm and clique number to provide bounds on the chromatic number. We show that the gap between the lower and upper bound is within C ln n / (ln ln n)^2, and specify the constant C.
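
    The objects this record reasons about, a geographical threshold graph, a greedy colouring as an upper bound on the chromatic number, and the clique number as a lower bound, are all available in networkx, so a small empirical check can be sketched. The choice of n and theta below is arbitrary and does not match the sparse-regime scaling that the asymptotic result requires.

```python
import networkx as nx

# A geographical threshold graph: nodes get random positions and weights; an edge joins
# u and v when the combined weights, discounted by a function of the distance, exceed theta.
n, theta = 500, 60
G = nx.geographical_threshold_graph(n, theta, dim=2, seed=42)

# Greedy colouring (largest-degree-first), an upper bound on the chromatic number.
colouring = nx.coloring.greedy_color(G, strategy="largest_first")
n_colours = max(colouring.values()) + 1

# The clique number is a lower bound on the chromatic number.
clique_number = max(len(c) for c in nx.find_cliques(G))

print(f"nodes: {n}, edges: {G.number_of_edges()}")
print(f"greedy colours used (upper bound): {n_colours}")
print(f"clique number (lower bound):       {clique_number}")
```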

  2. Propensity Scoring after Multiple Imputation in a Retrospective Study on Adjuvant Radiation Therapy in Lymph-Node Positive Vulvar Cancer.

    Science.gov (United States)

    Eulenburg, Christine; Suling, Anna; Neuser, Petra; Reuss, Alexander; Canzler, Ulrich; Fehm, Tanja; Luyten, Alexander; Hellriegel, Martin; Woelber, Linn; Mahner, Sven

    2016-01-01

    Propensity scoring (PS) is an established tool to account for measured confounding in non-randomized studies. These methods are sensitive to missing values, which are a common problem in observational data. The combination of multiple imputation of missing values and different propensity scoring techniques is addressed in this work. For a sample of lymph node-positive vulvar cancer patients, we re-analyze associations between the application of radiotherapy and disease-related and non-related survival. Inverse-probability-of-treatment-weighting (IPTW) and PS stratification are applied after multiple imputation by chained equation (MICE). Methodological issues are described in detail. Interpretation of the results and methodological limitations are discussed.
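
    A compact sketch of the workflow described, multiple imputation of a partially missing confounder followed by inverse-probability-of-treatment weighting in each completed data set, is given below on synthetic data. The cohort, covariates and effect sizes are invented, scikit-learn's IterativeImputer stands in for MICE, and a binary outcome replaces the survival endpoints analysed in the study.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(7)

# Hypothetical cohort: age and a tumour-burden score confound the choice of radiotherapy.
n = 600
age = rng.normal(60, 10, n)
burden = rng.normal(0, 1, n)
p_rt = 1 / (1 + np.exp(-(-0.5 + 0.05 * (age - 60) + 0.8 * burden)))
rt = (rng.random(n) < p_rt).astype(int)
p_event = 1 / (1 + np.exp(-(-0.5 + 0.03 * (age - 60) + 0.7 * burden - 0.5 * rt)))
event = (rng.random(n) < p_event).astype(int)

burden_obs = burden.copy()
burden_obs[rng.random(n) < 0.25] = np.nan            # 25% of the confounder is missing

estimates = []
for m in range(20):                                  # 20 imputed data sets, MICE-style
    imp = IterativeImputer(sample_posterior=True, random_state=m)
    completed = imp.fit_transform(np.column_stack([age, burden_obs, rt, event]))
    X = completed[:, :2]                             # age and the imputed burden score

    ps = LogisticRegression(max_iter=1000).fit(X, rt).predict_proba(X)[:, 1]
    w = np.where(rt == 1, 1 / ps, 1 / (1 - ps))      # inverse-probability-of-treatment weights

    treated = np.average(event[rt == 1], weights=w[rt == 1])
    control = np.average(event[rt == 0], weights=w[rt == 0])
    estimates.append(treated - control)

print(f"pooled IPTW risk difference (radiotherapy vs none): {np.mean(estimates):+.3f}")
```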

  3. International Refugees: A Geographical Perspective.

    Science.gov (United States)

    Demko, George J.; Wood, William B.

    1987-01-01

    Examines the problem of international refugees from a geographical perspective. Focuses on sub-saharan Africa, Afghanistan, Central America, and southeast Asia. Concludes that geographers can and should use their skills and intellectual tools to address and help resolve this global problem. (JDH)

  4. On the performance of multiple imputation based on chained equations in tackling missing data of the African α3.7 -globin deletion in a malaria association study.

    Science.gov (United States)

    Sepúlveda, Nuno; Manjurano, Alphaxard; Drakeley, Chris; Clark, Taane G

    2014-07-01

    Multiple imputation based on chained equations (MICE) is an alternative missing genotype method that can use genetic and nongenetic auxiliary data to inform the imputation process. Previously, MICE was successfully tested on strongly linked genetic data. We have now tested it on data of the HBA2 gene which, by the experimental design used in a malaria association study in Tanzania, shows a high missing data percentage and is weakly linked with the remaining genetic markers in the data set. We constructed different imputation models and studied their performance under different missing data conditions. Overall, MICE failed to accurately predict the true genotypes. However, using the best imputation model for the data, we obtained unbiased estimates for the genetic effects, and association signals of the HBA2 gene on malaria positivity. When the whole data set was analyzed with the same imputation model, the association signal increased from 0.80 to 2.70 before and after imputation, respectively. Conversely, postimputation estimates for the genetic effects remained the same in relation to the complete case analysis but showed increased precision. We argue that these postimputation estimates are reasonably unbiased, as a result of a good study design based on matching key socio-environmental factors.

  5. Assessment of Consequences of Replacement of the System of the Uniform Tax on Imputed Income with the Patent System of Taxation

    Directory of Open Access Journals (Sweden)

    Galina A. Manokhina

    2012-11-01

    Full Text Available The article highlights the main questions concerning the possible consequences of replacing the currently operating single tax on imputed income with a patent system of taxation. The main advantages and drawbacks of the new taxation system are shown, including the opinion that it would be more effective not to replace one special taxation mode with another, but to introduce the patent taxation system as an auxiliary system.

  6. Theoretical-methodological aspects of systemic geographical investigation into landscape degradation

    Directory of Open Access Journals (Sweden)

    Dušan Plut

    1995-12-01

    Full Text Available The following methodological approaches have been developed in the systemically planned investigation into the degradation of geographical environment: the physico-geographical, the ecosystemic, the socio-ecological, the landscape-ecological, and the functional regional-geographical ones.

  7. Meta-analysis and imputation refines the association of 15q25 with smoking quantity

    Science.gov (United States)

    Liu, Jason Z.; Tozzi, Federica; Waterworth, Dawn M.; Pillai, Sreekumar G.; Muglia, Pierandrea; Middleton, Lefkos; Berrettini, Wade; Knouff, Christopher W.; Yuan, Xin; Waeber, Gérard; Vollenweider, Peter; Preisig, Martin; Wareham, Nicholas J; Zhao, Jing Hua; Loos, Ruth J.F.; Barroso, Inês; Khaw, Kay-Tee; Grundy, Scott; Barter, Philip; Mahley, Robert; Kesaniemi, Antero; McPherson, Ruth; Vincent, John B.; Strauss, John; Kennedy, James L.; Farmer, Anne; McGuffin, Peter; Day, Richard; Matthews, Keith; Bakke, Per; Gulsvik, Amund; Lucae, Susanne; Ising, Marcus; Brueckl, Tanja; Horstmann, Sonja; Wichmann, H.-Erich; Rawal, Rajesh; Dahmen, Norbert; Lamina, Claudia; Polasek, Ozren; Zgaga, Lina; Huffman, Jennifer; Campbell, Susan; Kooner, Jaspal; Chambers, John C; Burnett, Mary Susan; Devaney, Joseph M.; Pichard, Augusto D.; Kent, Kenneth M.; Satler, Lowell; Lindsay, Joseph M.; Waksman, Ron; Epstein, Stephen; Wilson, James F.; Wild, Sarah H.; Campbell, Harry; Vitart, Veronique; Reilly, Muredach P.; Li, Mingyao; Qu, Liming; Wilensky, Robert; Matthai, William; Hakonarson, Hakon H.; Rader, Daniel J.; Franke, Andre; Wittig, Michael; Schäfer, Arne; Uda, Manuela; Terracciano, Antonio; Xiao, Xiangjun; Busonero, Fabio; Scheet, Paul; Schlessinger, David; St Clair, David; Rujescu, Dan; Abecasis, Gonçalo R.; Grabe, Hans Jörgen; Teumer, Alexander; Völzke, Henry; Petersmann, Astrid; John, Ulrich; Rudan, Igor; Hayward, Caroline; Wright, Alan F.; Kolcic, Ivana; Wright, Benjamin J; Thompson, John R; Balmforth, Anthony J.; Hall, Alistair S.; Samani, Nilesh J.; Anderson, Carl A.; Ahmad, Tariq; Mathew, Christopher G.; Parkes, Miles; Satsangi, Jack; Caulfield, Mark; Munroe, Patricia B.; Farrall, Martin; Dominiczak, Anna; Worthington, Jane; Thomson, Wendy; Eyre, Steve; Barton, Anne; Mooser, Vincent; Francks, Clyde; Marchini, Jonathan

    2013-01-01

    Smoking is a leading global cause of disease and mortality. We performed a genome-wide meta-analytic association study of smoking-related behavioral traits in a total sample of 41,150 individuals drawn from 20 disease, population, and control cohorts. Our analysis confirmed an effect on smoking quantity (SQ) at a locus on 15q25 (P=9.45e-19) that includes three genes encoding neuronal nicotinic acetylcholine receptor subunits (CHRNA5, CHRNA3, CHRNB4). We used data from the 1000 Genomes Project to investigate the region using imputation, which allowed analysis of virtually all common variants in the region and offered a five-fold increase in coverage over the HapMap. This increased the spectrum of potentially causal single nucleotide polymorphisms (SNPs), which included a novel SNP that showed the highest significance, rs55853698, located within the promoter region of CHRNA5. Conditional analysis also identified a secondary locus (rs6495308) in CHRNA3. PMID:20418889

  8. The search for stable prognostic models in multiple imputed data sets

    Directory of Open Access Journals (Sweden)

    de Vet Henrica CW

    2010-09-01

    Full Text Available Abstract Background In prognostic studies model instability and missing data can be troubling factors. Proposed methods for handling these situations are bootstrapping (B) and multiple imputation (MI). The authors examined the influence of these methods on model composition. Methods Models were constructed using a cohort of 587 patients consulting between January 2001 and January 2003 with a shoulder problem in general practice in the Netherlands (the Dutch Shoulder Study). Outcome measures were persistent shoulder disability and persistent shoulder pain. Potential predictors included socio-demographic variables, characteristics of the pain problem, physical activity and psychosocial factors. Model composition and performance (calibration and discrimination) were assessed for models using a complete case analysis, MI, bootstrapping or both MI and bootstrapping. Results Results showed that model composition varied between models as a result of how missing data was handled and that bootstrapping provided additional information on the stability of the selected prognostic model. Conclusion In prognostic modeling missing data needs to be handled by MI and bootstrap model selection is advised in order to provide information on model stability.
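
    The combination the authors recommend (multiple imputation plus bootstrap-based assessment of model stability) can be sketched roughly as follows: repeat variable selection on bootstrap resamples of each imputed data set and report how often each predictor is retained. The selection rule, data-set names, and thresholds below are illustrative assumptions, not the study's procedure.

      # Sketch: bootstrap inclusion frequencies across multiply imputed data sets.
      # `imputed_datasets` is assumed to be a list of complete pandas DataFrames
      # (one per imputation), each with predictor columns and a binary outcome `y`.
      import numpy as np
      from collections import Counter
      from sklearn.feature_selection import SelectFromModel
      from sklearn.linear_model import LogisticRegression

      def inclusion_frequencies(imputed_datasets, predictors, n_boot=200, seed=0):
          rng = np.random.default_rng(seed)
          counts = Counter()
          total = 0
          for data in imputed_datasets:
              for _ in range(n_boot):
                  boot = data.sample(n=len(data), replace=True,
                                     random_state=int(rng.integers(1 << 31)))
                  selector = SelectFromModel(
                      LogisticRegression(penalty="l1", solver="liblinear", C=1.0)
                  ).fit(boot[predictors], boot["y"])
                  for name, keep in zip(predictors, selector.get_support()):
                      counts[name] += int(keep)
                  total += 1
          # A predictor retained in most resamples of most imputations is "stable".
          return {name: counts[name] / total for name in predictors}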

  9. Recovering incomplete data using Statistical Multiple Imputations (SMI): a case study in environmental chemistry.

    Science.gov (United States)

    Mercer, Theresa G; Frostick, Lynne E; Walmsley, Anthony D

    2011-10-15

    This paper presents a statistical technique that can be applied to environmental chemistry data where missing values and limit of detection levels prevent the application of statistics. A working example is taken from an environmental leaching study that was set up to determine if there were significant differences in levels of leached arsenic (As), chromium (Cr) and copper (Cu) between lysimeters containing preservative treated wood waste and those containing untreated wood. Fourteen lysimeters were set up and left in natural conditions for 21 weeks. The resultant leachate was analysed by ICP-OES to determine the As, Cr and Cu concentrations. However, due to the variation inherent in each lysimeter combined with the limits of detection offered by ICP-OES, the collected quantitative data was somewhat incomplete. Initial data analysis was hampered by the number of 'missing values' in the data. To recover the dataset, the statistical tool of Statistical Multiple Imputation (SMI) was applied, and the data was re-analysed successfully. It was demonstrated that using SMI did not affect the variance in the data, but facilitated analysis of the complete dataset.

  10. Machine Learning Data Imputation and Classification in a Multicohort Hypertension Clinical Study.

    Science.gov (United States)

    Seffens, William; Evans, Chad; Taylor, Herman

    2015-01-01

    Health-care initiatives are pushing the development and utilization of clinical data for medical discovery and translational research studies. Machine learning tools implemented for Big Data have been applied to detect patterns in complex diseases. This study focuses on hypertension and examines phenotype data across a major clinical study, the Minority Health Genomics and Translational Research Repository Database, composed of self-reported African American (AA) participants combined with related cohorts. Prior genome-wide association studies for hypertension in AAs presumed that an increase of disease burden in susceptible populations is due to rare variants. But genomic analyses of hypertension, even those designed to focus on rare variants, have yielded marginal genome-wide results over many studies. Machine learning and other nonparametric statistical methods have recently been shown to uncover relationships in complex phenotypes, genotypes, and clinical data. We trained neural networks with phenotype data for missing-data imputation to increase the usable size of a clinical data set. Validity was established by showing performance effects using the expanded data set for the association of phenotype variables with case/control status of patients. Data mining classification tools were used to generate association rules.

  11. Discovery and refinement of genetic loci associated with cardiometabolic risk using dense imputation maps.

    Science.gov (United States)

    Iotchkova, Valentina; Huang, Jie; Morris, John A; Jain, Deepti; Barbieri, Caterina; Walter, Klaudia; Min, Josine L; Chen, Lu; Astle, William; Cocca, Massimilian; Deelen, Patrick; Elding, Heather; Farmaki, Aliki-Eleni; Franklin, Christopher S; Franberg, Mattias; Gaunt, Tom R; Hofman, Albert; Jiang, Tao; Kleber, Marcus E; Lachance, Genevieve; Luan, Jian'an; Malerba, Giovanni; Matchan, Angela; Mead, Daniel; Memari, Yasin; Ntalla, Ioanna; Panoutsopoulou, Kalliope; Pazoki, Raha; Perry, John R B; Rivadeneira, Fernando; Sabater-Lleal, Maria; Sennblad, Bengt; Shin, So-Youn; Southam, Lorraine; Traglia, Michela; van Dijk, Freerk; van Leeuwen, Elisabeth M; Zaza, Gianluigi; Zhang, Weihua; Amin, Najaf; Butterworth, Adam; Chambers, John C; Dedoussis, George; Dehghan, Abbas; Franco, Oscar H; Franke, Lude; Frontini, Mattia; Gambaro, Giovanni; Gasparini, Paolo; Hamsten, Anders; Issacs, Aaron; Kooner, Jaspal S; Kooperberg, Charles; Langenberg, Claudia; Marz, Winfried; Scott, Robert A; Swertz, Morris A; Toniolo, Daniela; Uitterlinden, Andre G; van Duijn, Cornelia M; Watkins, Hugh; Zeggini, Eleftheria; Maurano, Mathew T; Timpson, Nicholas J; Reiner, Alexander P; Auer, Paul L; Soranzo, Nicole

    2016-11-01

    Large-scale whole-genome sequence data sets offer novel opportunities to identify genetic variation underlying human traits. Here we apply genotype imputation based on whole-genome sequence data from the UK10K and 1000 Genomes Project into 35,981 study participants of European ancestry, followed by association analysis with 20 quantitative cardiometabolic and hematological traits. We describe 17 new associations, including 6 rare (minor allele frequency (MAF) < 1%) or low-frequency (1% < MAF < 5%) variants with platelet count (PLT), red blood cell indices (MCH and MCV) and HDL cholesterol. Applying fine-mapping analysis to 233 known and new loci associated with the 20 traits, we resolve the associations of 59 loci to credible sets of 20 or fewer variants and describe trait enrichments within regions of predicted regulatory function. These findings improve understanding of the allelic architecture of risk factors for cardiometabolic and hematological diseases and provide additional functional insights with the identification of potentially novel biological targets.

  12. Geographical Effects on Complex Networks

    Institute of Scientific and Technical Information of China (English)

    LIN Zhong-Cai; YANG Lei; YANG Kong-Qing

    2005-01-01

    We investigate how the geographical structure of a complex network affects its network topology, synchronization and the average spatial length of edges. The geographical structure means that the connecting probability of two nodes is related to the spatial distance between them. Our simulation results show that the geographical structure changes the network topology. The synchronization tendency is enhanced and the average spatial length of edges increases when nodes are allowed to connect randomly to more distant ones. Analytic results support our understanding of these phenomena.
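
    A minimal simulation of the kind of geographically structured network described here (connection probability decaying with Euclidean distance) might look like the sketch below; the decay form and all parameters are assumptions for illustration only, not the authors' model.

      # Sketch: a spatially embedded random network in which the probability of an
      # edge between two nodes decays with their Euclidean distance.
      import numpy as np
      import networkx as nx

      def spatial_network(n=200, alpha=2.0, scale=0.1, seed=0):
          rng = np.random.default_rng(seed)
          pos = rng.random((n, 2))                    # nodes scattered in the unit square
          g = nx.Graph()
          g.add_nodes_from(range(n))
          for i in range(n):
              for j in range(i + 1, n):
                  d = np.linalg.norm(pos[i] - pos[j])
                  p = np.exp(-(d / scale) ** alpha)   # connection probability vs. distance
                  if rng.random() < p:
                      g.add_edge(i, j, length=float(d))
          return g, pos

      g, pos = spatial_network()
      lengths = [data["length"] for _, _, data in g.edges(data=True)]
      print(g.number_of_edges(), float(np.mean(lengths)))  # average spatial length of edges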

  13. 33 CFR 165.8 - Geographic coordinates.

    Science.gov (United States)

    2010-07-01

    Geographic coordinates expressed in terms of latitude or longitude, or both, are not intended for plotting on maps or charts whose referenced horizontal datum is the North American Datum of 1983 (NAD 83), unless such geographic coordinates are expressly labeled NAD 83.

  14. Change of Geographic Information Service Model in Mobile Context

    Institute of Scientific and Technical Information of China (English)

    REN Fu; DU Qingyun

    2005-01-01

    Research on how mobility, a topic distinct from but tightly related to space, provides new approaches and methods for advancing geographic information services will accumulate basic experience for related information systems across the wide field of location-based services. This paper analyzes the meaning of mobility and how it changes the geographic information service model, and it describes the differences and correlations between mobile GIS (M-GIS) and traditional GIS. It then sets out a technical framework for geographic information services in a mobile context and provides a case study.

  15. NEPR Geographic Zone Map 2015

    Data.gov (United States)

    National Oceanic and Atmospheric Administration, Department of Commerce — This geographic zone map was created by interpreting satellite and aerial imagery, seafloor topography (bathymetry model), and the new NEPR Benthic Habitat Map...

  16. Ecoscapes: Geographical Patternings of Relations

    Directory of Open Access Journals (Sweden)

    Aimar Ventsel

    2012-06-01

    Full Text Available Book review of the publication Ecoscapes: Geographical Patternings of Relations. Edited by Gary Backhaus and John Murungi. Lanham, Boulder, New York, Toronto, Oxford, Lexington Books, 2006, xxxiii+241 pp.

  17. Geographic Tongue in Monozygotic Twins

    OpenAIRE

    Shekhar M, Guna

    2014-01-01

    This article discusses the case of 5-year-old monozygotic twin girls suffering from geographic tongue (GT), a benign inflammatory disorder of the tongue characterized by circinate, irregular erythematous lesions on the dorsum and lateral borders of the tongue, caused by loss of filiform papillae of the tongue epithelium. Whilst geographic tongue is a common entity, reports on this condition are uncommon in the literature. To the best of our knowledge, this is the first report which...

  18. The National Map - geographic names

    Science.gov (United States)

    Yost, Lou; Carswell, William J.

    2009-01-01

    The Geographic Names Information System (GNIS), developed by the U.S. Geological Survey (USGS) in cooperation with the U.S. Board on Geographic Names (BGN), contains information about the official names for places, features, and areas in the 50 States, the District of Columbia, the territories and outlying areas of the United States, including Antarctica. It is the geographic names component of The National Map. The BGN maintains working relationships with State names authorities to cooperate in achieving the standardization of geographic names. The GNIS contains records on more than 2 million geographic names in the United States - from populated places, schools, reservoirs, and parks to streams, valleys, springs, ridges, and every feature type except roads and highways. Entries include information such as the federally-recognized name and variant names and spellings for the feature; former names; the status of the name as determined by the BGN; county or counties in which each named feature is located; geographic coordinates that locate the approximate center of an areal feature or the mouth and source of a linear feature, such as a stream; name of the cell of the USGS topographic map or maps on which the feature may appear; elevation figures derived from the National Elevation Dataset; bibliographic code for the source of the name; BGN decision dates and historical information are available for some features. Data from the GNIS are used for emergency preparedness, mapmaking, local and regional planning, service delivery routing, marketing, site selection, environmental analysis, genealogical research, and other applications.

  19. Geographically isolated wetlands: Rethinking a misnomer

    Science.gov (United States)

    Mushet, David M.; Calhoun, Aram J. K.; Alexander, Laurie C.; Cohen, Matthew J.; DeKeyser, Edward S.; Fowler, Laurie G.; Lane, Charles R.; Lang, Megan W.; Rains, Mark C.; Walls, Susan

    2015-01-01

    We explore the category “geographically isolated wetlands” (GIWs; i.e., wetlands completely surrounded by uplands at the local scale) as used in the wetland sciences. As currently used, the GIW category (1) hampers scientific efforts by obscuring important hydrological and ecological differences among multiple wetland functional types, (2) aggregates wetlands in a manner not reflective of regulatory and management information needs, (3) implies wetlands so described are in some way “isolated,” an often incorrect implication, (4) is inconsistent with more broadly used and accepted concepts of “geographic isolation,” and (5) has injected unnecessary confusion into scientific investigations and discussions. Instead, we suggest other wetland classification systems offer more informative alternatives. For example, hydrogeomorphic (HGM) classes based on well-established scientific definitions account for wetland functional diversity thereby facilitating explorations into questions of connectivity without an a priori designation of “isolation.” Additionally, an HGM-type approach could be used in combination with terms reflective of current regulatory or policymaking needs. For those rare cases in which the condition of being surrounded by uplands is the relevant distinguishing characteristic, use of terminology that does not unnecessarily imply isolation (e.g., “upland embedded wetlands”) would help alleviate much confusion caused by the “geographically isolated wetlands” misnomer.

  20. Incomplete Big Data Imputation Algorithm Based on Deep Learning

    Institute of Scientific and Technical Information of China (English)

    卜范玉; 陈志奎; 张清辰

    2014-01-01

    This paper presents an imputation algorithm based on deep learning for incomplete big data. The proposed algorithm first builds an imputation auto-encoder on the basis of the standard auto-encoder. On this foundation, a deep imputation network model is constructed to analyze the deep features of incomplete big data, and the network parameters are computed using layer-wise training and the back-propagation algorithm. Finally, the deep imputation network is used to reconstruct the incomplete big data and fill in the missing values. Experimental results show that the proposed algorithm can effectively improve imputation accuracy for incomplete big data.
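
    A rough sketch of this kind of auto-encoder-based imputation is shown below: missing entries are filled with initial guesses, an auto-encoder is trained to reconstruct the observed entries, and its reconstructions replace the missing values. The architecture, loss masking, and library choice are assumptions for illustration, not the authors' implementation.

      # Sketch: auto-encoder imputation. X is a 2-D numpy array with np.nan marking
      # missing entries; layer sizes and hyperparameters are illustrative.
      import numpy as np
      import torch
      from torch import nn

      def autoencoder_impute(X, hidden=16, epochs=200, lr=1e-2):
          mask = ~np.isnan(X)                        # True where a value was observed
          col_means = np.nanmean(X, axis=0)
          X_filled = np.where(mask, X, col_means)    # initial guess: column means
          x = torch.tensor(X_filled, dtype=torch.float32)
          m = torch.tensor(mask, dtype=torch.float32)

          d = X.shape[1]
          model = nn.Sequential(nn.Linear(d, hidden), nn.ReLU(), nn.Linear(hidden, d))
          opt = torch.optim.Adam(model.parameters(), lr=lr)

          for _ in range(epochs):
              opt.zero_grad()
              recon = model(x)
              # Train only on observed entries so the missing ones do not drive the loss.
              loss = ((recon - x) ** 2 * m).sum() / m.sum()
              loss.backward()
              opt.step()

          with torch.no_grad():
              recon = model(x).numpy()
          return np.where(mask, X, recon)            # keep observed values, impute the rest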

  1. High-accuracy imputation for HLA class I and II genes based on high-resolution SNP data of population-specific references.

    Science.gov (United States)

    Khor, S-S; Yang, W; Kawashima, M; Kamitsuji, S; Zheng, X; Nishida, N; Sawai, H; Toyoda, H; Miyagawa, T; Honda, M; Kamatani, N; Tokunaga, K

    2015-12-01

    Statistical imputation of classical human leukocyte antigen (HLA) alleles is becoming an indispensable tool for fine-mapping of disease association signals from case-control genome-wide association studies. However, most currently available HLA imputation tools are based on European reference populations and are not suitable for direct application to non-European populations. Among the available HLA imputation tools, the HIBAG R package is flexible and is equipped with a wide range of population-based classifiers; moreover, HIBAG enables individual researchers to build custom classifiers. Here, two data sets, each comprising data from healthy Japanese individuals of different sample sizes, were used to build custom classifiers. HLA imputation accuracy for five HLA genes (HLA-A, HLA-B, HLA-DRB1, HLA-DQB1 and HLA-DPB1) increased from the 82.5-98.8% obtained with the original HIBAG references to 95.2-99.5% with our custom classifiers. A call threshold (CT) of 0.4 is recommended for our Japanese classifiers; in contrast, HIBAG references recommend a CT of 0.5. Finally, our classifiers could be used to identify the risk haplotypes for Japanese narcolepsy with cataplexy, HLA-DRB1*15:01 and HLA-DQB1*06:02, with 100% and 99.7% accuracy, respectively; therefore, these classifiers can be used to supplement the current lack of HLA genotyping data in widely available genome-wide association study data sets.

  2. Harvesting geographic features from heterogeneous raster maps

    Science.gov (United States)

    Chiang, Yao-Yi

    2010-11-01

    Raster maps offer a great deal of geospatial information and are easily accessible compared to other geospatial data. However, harvesting geographic features locked in heterogeneous raster maps to obtain the geospatial information is challenging. This is because of the varying image quality of raster maps (e.g., scanned maps with poor image quality and computer-generated maps with good image quality), the overlapping geographic features in maps, and the typical lack of metadata (e.g., map geocoordinates, map source, and original vector data). Previous work on map processing is typically limited to a specific type of map and often relies on intensive manual work. In contrast, this thesis investigates a general approach that does not rely on any prior knowledge and requires minimal user effort to process heterogeneous raster maps. This approach includes automatic and supervised techniques to process raster maps for separating individual layers of geographic features from the maps and recognizing geographic features in the separated layers (i.e., detecting road intersections, generating and vectorizing road geometry, and recognizing text labels). The automatic technique eliminates user intervention by exploiting common map properties of how road lines and text labels are drawn in raster maps. For example, the road lines are elongated linear objects and the characters are small connected-objects. The supervised technique utilizes labels of road and text areas to handle complex raster maps, or maps with poor image quality, and can process a variety of raster maps with minimal user input. The results show that the general approach can handle raster maps with varying map complexity, color usage, and image quality. By matching extracted road intersections to another geospatial dataset, we can identify the geocoordinates of a raster map and further align the raster map, separated feature layers from the map, and recognized features from the layers with the geospatial

  3. A comparison of genomic selection models across time in interior spruce (Picea engelmannii × glauca) using unordered SNP imputation methods.

    Science.gov (United States)

    Ratcliffe, B; El-Dien, O G; Klápště, J; Porth, I; Chen, C; Jaquish, B; El-Kassaby, Y A

    2015-12-01

    Genomic selection (GS) potentially offers an unparalleled advantage over traditional pedigree-based selection (TS) methods by reducing the time commitment required to carry out a single cycle of tree improvement. This quality is particularly appealing to tree breeders, for whom lengthy improvement cycles are the norm. We explored the prospect of implementing GS for interior spruce (Picea engelmannii × glauca) utilizing a genotyped population of 769 trees belonging to 25 open-pollinated families. A series of repeated tree height measurements through ages 3-40 years permitted the testing of GS methods temporally. The genotyping-by-sequencing (GBS) platform was used for single nucleotide polymorphism (SNP) discovery in conjunction with three unordered imputation methods applied to a data set with 60% missing information. Further, three diverse GS models were evaluated based on predictive accuracy (PA) and their marker effects. Moderate levels of PA (0.31-0.55) were observed and were of sufficient capacity to deliver improved selection response over TS. Additionally, PA varied substantially through time in accordance with spatial competition among trees. As expected, temporal PA was well correlated with age-age genetic correlation (r=0.99), and decreased substantially with increasing difference in age between the training and validation populations (0.04-0.47). Moreover, our imputation comparisons indicate that k-nearest neighbor and singular value decomposition yielded a greater number of SNPs and gave higher predictive accuracies than imputing with the mean. Furthermore, the ridge regression (rrBLUP) and BayesCπ (BCπ) models both yielded equal, and better, PA than the generalized ridge regression heteroscedastic effect model for the traits evaluated.
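
    The imputation comparison the authors describe (k-nearest-neighbour and singular-value-decomposition imputation versus imputing with the mean, on a genotype matrix with roughly 60% missing calls) can be approximated with off-the-shelf tools; the sketch below, with an illustrative simulated marker matrix and masking rate, is not the authors' pipeline.

      # Sketch: compare mean imputation with k-nearest-neighbour imputation on a
      # genotype-like matrix (rows = trees, columns = SNPs coded 0/1/2) with 60%
      # of the entries masked at random. All sizes and rates are illustrative.
      import numpy as np
      from sklearn.impute import SimpleImputer, KNNImputer

      rng = np.random.default_rng(0)
      truth = rng.integers(0, 3, size=(300, 500)).astype(float)   # simulated genotypes
      mask = rng.random(truth.shape) < 0.60                        # 60% missing, as in the study
      observed = np.where(mask, np.nan, truth)

      for name, imputer in [("mean", SimpleImputer(strategy="mean")),
                            ("knn", KNNImputer(n_neighbors=5))]:
          filled = imputer.fit_transform(observed)
          rmse = float(np.sqrt(np.mean((filled[mask] - truth[mask]) ** 2)))
          print(name, round(rmse, 3))   # lower RMSE on the masked entries is better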

  4. hsphase: an R package for pedigree reconstruction, detection of recombination events, phasing and imputation of half-sib family groups

    Science.gov (United States)

    2014-01-01

    Background Identification of recombination events and which chromosomal segments contributed to an individual is useful for a number of applications in genomic analyses including haplotyping, imputation, signatures of selection, and improved estimates of relationship and probability of identity by descent. Genotypic data on half-sib family groups are widely available in livestock genomics. This structure makes it possible to identify recombination events accurately even with only a few individuals and it lends itself well to a range of applications such as parentage assignment and pedigree verification. Results Here we present hsphase, an R package that exploits the genetic structure found in half-sib livestock data to identify and count recombination events, impute and phase un-genotyped sires and phase its offspring. The package also allows reconstruction of family groups (pedigree inference), identification of pedigree errors and parentage assignment. Additional functions in the package allow identification of genomic mapping errors, imputation of paternal high density genotypes from low density genotypes, evaluation of phasing results either from hsphase or from other phasing programs. Various diagnostic plotting functions permit rapid visual inspection of results and evaluation of datasets. Conclusion The hsphase package provides a suite of functions for analysis and visualization of genomic structures in half-sib family groups implemented in the widely used R programming environment. Low level functions were implemented in C++ and parallelized to improve performance. hsphase was primarily designed for use with high density SNP array data but it is fast enough to run directly on sequence data once they become more widely available. The package is available (GPL 3) from the Comprehensive R Archive Network (CRAN) or from http://www-personal.une.edu.au/~cgondro2/hsphase.htm. PMID:24906803

  5. Changes at the National Geographic Society

    Science.gov (United States)

    Schwille, Kathleen

    2016-01-01

    For more than 125 years, National Geographic has explored the planet, unlocking its secrets and sharing them with the world. For almost thirty of those years, National Geographic has been committed to K-12 educators and geographic education through its Network of Alliances. As National Geographic begins a new chapter, they remain committed to the…

  7. Autonomous gliding entry guidance with geographic constraints

    Institute of Scientific and Technical Information of China (English)

    Guo Jie; Wu Xuzhong; Tang Shengjing

    2015-01-01

    This paper presents a novel three-dimensional autonomous entry guidance for relatively high lift-to-drag ratio vehicles satisfying geographic constraints and other path constraints. The guidance is composed of onboard trajectory planning and robust trajectory tracking. For trajectory planning, a longitudinal sub-planner is introduced to generate a feasible drag-versus-energy profile by using the interpolation between upper boundary and lower boundary of entry corridor to get the desired trajectory length. The associated magnitude of the bank angle can be specified by drag profile, while the sign of bank angle is determined by lateral sub-planner. Two-reverse mode is utilized to satisfy waypoint constraints and dynamic heading error corridor is utilized to satisfy no-fly zone constraints. The longitudinal and lateral sub-planners are iteratively employed until all of the path constraints are satisfied. For trajectory tracking, a novel tracking law based on the active disturbance rejection control is introduced. Finally, adaptability tests and Monte Carlo simulations of the entry guidance approach are performed. Results show that the proposed entry guidance approach can adapt to different entry missions and is able to make the vehicle reach the prescribed target point precisely in spite of geographic constraints.

  8. The Andes: A Geographical Portrait

    Directory of Open Access Journals (Sweden)

    Anthony Bebbington

    2016-05-01

    Full Text Available Reviewed: The Andes: A Geographical Portrait. By Axel Borsdorf and Christoph Stadel. Translated by Brigitte Scott and Christoph Stadel. Cham, Switzerland: Springer International Publishing, 2015. xiv + 368 pp. US$ 139.00. Also available as an e-book. ISBN 978-3-319-03529-1.

  9. Geographic Projection of Cluster Composites

    NARCIS (Netherlands)

    Nerbonne, J.; Bosveld-de Smet, L.M.; Kleiweg, P.; Blackwell, A.; Marriott, K.; Shimojima, A.

    2004-01-01

    A composite cluster map displays a fuzzy categorisation of geographic areas. It combines information from several sources to provide a visualisation of the significance of cluster borders. The basic technique renders the chance that two neighbouring locations are members of different clusters as the

  10. Geographical Concepts in Turkish Lullabys

    Science.gov (United States)

    Çifçi, Taner

    2016-01-01

    In this study, a collection of lullabies, which have an important place in Turkish culture and form an important genre in folk literature, is examined to find out the distribution and presentation of geographical terms in the lullabies in this collection. In the study, 2480 lullabies in Turkish Lullabies, which is one of the leading collections in…

  11. Territorial Decentration and Geographic Learning.

    Science.gov (United States)

    Stoltman, Joseph P.

    Territorial decentration is a question of major significance to geographic educators. This paper reports the findings of a research project designed to determine the territorial decentration of an American sample of children. The primary purpose of the research was to determine if Piaget's territorial decentration stages are appropriate for…

  12. Imputation of the Date of HIV Seroconversion in a Cohort of Seroprevalent Subjects: Implications for Analysis of Late HIV Diagnosis

    Directory of Open Access Journals (Sweden)

    Paz Sobrino-Vegas

    2012-01-01

    Full Text Available Objectives. Since subjects may have been diagnosed before cohort entry, analysis of late HIV diagnosis (LD) is usually restricted to the newly diagnosed. We estimate the magnitude and risk factors of LD in a cohort of seroprevalent individuals by imputing seroconversion dates. Methods. Multicenter cohort of HIV-positive subjects who were treatment naive at entry, in Spain, 2004–2008. Multiple-imputation techniques were used. Subjects with times to HIV diagnosis longer than 4.19 years were considered LD. Results. Median time to HIV diagnosis was 2.8 years in the whole cohort of 3,667 subjects. Factors significantly associated with LD were: male sex; Sub-Saharan African, Latin-American origin compared to Spaniards; and older age. In 2,928 newly diagnosed subjects, median time to diagnosis was 3.3 years, and LD was more common in injecting drug users. Conclusions. Estimates of the magnitude and risk factors of LD for the whole cohort differ from those obtained for new HIV diagnoses.

  13. Multiple Imputation based Clustering Validation (MIV) for Big Longitudinal Trial Data with Missing Values in eHealth.

    Science.gov (United States)

    Zhang, Zhaoyang; Fang, Hua; Wang, Honggang

    2016-06-01

    Web-delivered trials are an important component in eHealth services. These trials, mostly behavior-based, generate big heterogeneous data that are longitudinal and high-dimensional, with missing values. Unsupervised learning methods have been widely applied in this area; however, validating the optimal number of clusters has been challenging. Building upon our multiple imputation (MI) based fuzzy clustering method, MIfuzzy, we proposed a new multiple imputation based validation (MIV) framework and corresponding MIV algorithms for clustering big longitudinal eHealth data with missing values, and more generally for fuzzy-logic based clustering methods. Specifically, we detect the optimal number of clusters by auto-searching and -synthesizing a suite of MI-based validation methods and indices, including conventional (bootstrap or cross-validation based) and emerging (modularity-based) validation indices for general clustering methods as well as the specific one (Xie and Beni) for fuzzy clustering. The MIV performance was demonstrated on a big longitudinal dataset from a real web-delivered trial and using simulation. The results indicate that the MI-based Xie-Beni index for fuzzy clustering is more appropriate for detecting the optimal number of clusters for such complex data. The MIV concept and algorithms could be easily adapted to different types of clustering that could process big incomplete longitudinal trial data in eHealth services.
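
    For reference, the Xie-Beni validity index mentioned here measures the ratio of fuzzy within-cluster compactness to between-centroid separation, with lower values indicating a better partition; a minimal sketch for fuzzy c-means output is given below (the membership matrix, centroids, and fuzzifier m are assumed inputs, not part of the MIV framework itself).

      # Sketch: Xie-Beni index for a fuzzy clustering.
      #   X: (n, d) data matrix; U: (c, n) membership matrix; V: (c, d) centroids;
      #   m: fuzzifier (commonly 2). Lower values indicate a better partition.
      import numpy as np

      def xie_beni(X, U, V, m=2.0):
          # Compactness: fuzzy-weighted squared distances of points to their centroids.
          dist_sq = ((X[None, :, :] - V[:, None, :]) ** 2).sum(axis=2)   # shape (c, n)
          compactness = ((U ** m) * dist_sq).sum()
          # Separation: smallest squared distance between any two distinct centroids.
          centroid_d = ((V[None, :, :] - V[:, None, :]) ** 2).sum(axis=2)
          np.fill_diagonal(centroid_d, np.inf)
          separation = centroid_d.min()
          return compactness / (X.shape[0] * separation)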

  14. A situated knowledge representation of geographical information

    Energy Technology Data Exchange (ETDEWEB)

    Gahegan, Mark N.; Pike, William A.

    2006-11-01

    In this paper we present an approach to conceiving of, constructing and comparing the concepts developed and used by geographers, environmental scientists and other earth science researchers to help describe, analyze and ultimately understand their subject of study. Our approach is informed by the situations under which concepts are conceived and applied, captures details of their construction, use and evolution and supports their ultimate sharing along with the means for deep exploration of conceptual similarities and differences that may arise among a distributed network of researchers. The intent here is to support different perspectives onto GIS resources that researchers may legitimately take, and to capture and compute with aspects of epistemology, to complement the ontologies that are currently receiving much attention in the GIScience community.

  15. a Conceptual Framework for Virtual Geographic Environments Knowledge Engineering

    Science.gov (United States)

    You, Lan; Lin, Hui

    2016-06-01

    VGE geographic knowledge refers to the abstract and repeatable geo-information related to the geoscience problems, geographical phenomena and geographical laws supported by VGE. It includes expert experience, evolution rules, simulation processes and prediction results in VGE. This paper proposes a conceptual framework for VGE knowledge engineering in order to effectively manage and use geographic knowledge in VGE. Our approach relies on previously well-established theories of knowledge engineering and VGE. The main contributions of this report are the following: (1) clear definitions of the concepts of VGE knowledge and VGE knowledge engineering; (2) the features that distinguish VGE knowledge from common knowledge; (3) a geographic knowledge evolution process that helps users rapidly acquire knowledge in VGE; and (4) a conceptual framework for VGE knowledge engineering that provides a supporting methodological system for building an intelligent VGE. This conceptual framework systematically describes the related VGE knowledge theories and key technologies. It will promote the rapid transformation from geodata to geographic knowledge and further reduce the gap between data explosion and knowledge absence.

  16. A CONCEPTUAL FRAMEWORK FOR VIRTUAL GEOGRAPHIC ENVIRONMENTS KNOWLEDGE ENGINEERING

    Directory of Open Access Journals (Sweden)

    L. You

    2016-06-01

    Full Text Available VGE geographic knowledge refers to the abstract and repeatable geo-information related to the geoscience problems, geographical phenomena and geographical laws supported by VGE. It includes expert experience, evolution rules, simulation processes and prediction results in VGE. This paper proposes a conceptual framework for VGE knowledge engineering in order to effectively manage and use geographic knowledge in VGE. Our approach relies on previously well-established theories of knowledge engineering and VGE. The main contributions of this report are the following: (1) clear definitions of the concepts of VGE knowledge and VGE knowledge engineering; (2) the features that distinguish VGE knowledge from common knowledge; (3) a geographic knowledge evolution process that helps users rapidly acquire knowledge in VGE; and (4) a conceptual framework for VGE knowledge engineering that provides a supporting methodological system for building an intelligent VGE. This conceptual framework systematically describes the related VGE knowledge theories and key technologies. It will promote the rapid transformation from geodata to geographic knowledge and further reduce the gap between data explosion and knowledge absence.

  17. Imputation of orofacial clefting data identifies novel risk loci and sheds light on the genetic background of cleft lip ± cleft palate and cleft palate only

    Science.gov (United States)

    Böhmer, Anne C.; Bowes, John; Nikolić, Miloš; Ishorst, Nina; Wyatt, Niki; Hammond, Nigel L.; Gölz, Lina; Thieme, Frederic; Barth, Sandra; Schuenke, Hannah; Klamt, Johanna; Spielmann, Malte; Aldhorae, Khalid; Rojas-Martinez, Augusto; Nöthen, Markus M.; Rada-Iglesias, Alvaro; Dixon, Michael J.; Knapp, Michael; Mangold, Elisabeth

    2017-01-01

    Abstract Nonsyndromic cleft lip with or without cleft palate (nsCL/P) is among the most common human birth defects with multifactorial etiology. Here, we present results from a genome-wide imputation study of nsCL/P in which, after adding replication cohort data, four novel risk loci for nsCL/P are identified (at chromosomal regions 2p21, 14q22, 15q24 and 19p13). On a systematic level, we show that the association signals within this high-density dataset are enriched in functionally-relevant genomic regions that are active in both human neural crest cells (hNCC) and mouse embryonic craniofacial tissue. This enrichment is also detectable in hNCC regions primed for later activity. Using GCTA analyses, we suggest that 30% of the estimated variance in risk for nsCL/P in the European population can be attributed to common variants, with 25.5% contributed to by the 24 risk loci known to date. For each of these, we identify credible SNPs using a Bayesian refinement approach, with two loci harbouring only one probable causal variant. Finally, we demonstrate that there is no polygenic component of nsCL/P detectable that is shared with nonsyndromic cleft palate only (nsCPO). Our data suggest that, while common variants are strongly contributing to risk for nsCL/P, they do not seem to be involved in nsCPO which might be more often caused by rare deleterious variants. Our study generates novel insights into both nsCL/P and nsCPO etiology and provides a systematic framework for research into craniofacial development and malformation. PMID:28087736
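
    As a generic illustration of the Bayesian refinement step mentioned here (identifying a credible set of SNPs per locus), the sketch below ranks SNPs by approximate posterior probability derived from per-SNP Bayes factors and keeps the smallest set reaching a chosen cumulative probability; the 95% threshold and input names are assumptions, not the study's exact procedure.

      # Sketch: a 95% credible set from per-SNP log Bayes factors at one locus.
      # `log_bf` maps SNP id -> log Bayes factor against the null; values are illustrative.
      import numpy as np

      def credible_set(log_bf, coverage=0.95):
          ids = list(log_bf)
          bf = np.exp(np.array([log_bf[s] for s in ids]) - max(log_bf.values()))  # stabilized
          posterior = bf / bf.sum()            # posterior prob. that each SNP is the causal one
          order = np.argsort(posterior)[::-1]  # most probable first
          kept, total = [], 0.0
          for idx in order:
              kept.append((ids[idx], float(posterior[idx])))
              total += posterior[idx]
              if total >= coverage:
                  break
          return kept

      print(credible_set({"rs1": 4.2, "rs2": 3.9, "rs3": 0.5}))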

  18. Annotation Bibliography for Geographical Science Field

    Directory of Open Access Journals (Sweden)

    Sukendra Martha

    2014-12-01

    Full Text Available This annotated bibliography is gathered specially for the field of geography from various scientific articles (on basic concepts in geography) in different geographical journals. It aims to present information particularly for geographers who will undertake research and who need geographical references covering spatial concepts. Another reason is the rapid development of technical branches of geography such as geographical information systems (GIS) and remote sensing. It is hoped that this bibliography can contribute to remotivating geographers to learn and review their original geographical thought.

  19. [The Volunteered Geographic Information Phenomenon]

    Directory of Open Access Journals (Sweden)

    Flavio Lupia

    2014-12-01

    Full Text Available The contribution addresses the phenomenon of Volunteered Geographic Information, explaining how these new and burgeoning sources of information offer multidisciplinary scientists an unprecedented opportunity to conduct research on a variety of topics at multiple spatial and temporal scales. In particular, the contribution refers to two COST Actions recently activated on the subject, which are particularly relevant for the growth of the European scientific community.

  20. Geographic Luck and Dependency Theory

    Institute of Scientific and Technical Information of China (English)

    Ziheng Liu

    2015-01-01

    Economic disparity is a huge global issue nowadays which threatens economic justice and thus in some sense decides the fate of human beings, especially those from developing countries. This essay analyzes economic disparity by using Geographic Luck and Dependency Theory. Neither of the two theories could explain economic disparity individually. China, Britain and Iran are used as three examples to support the thesis.

  1. Geographic Information System Data Analysis

    Science.gov (United States)

    Billings, Chad; Casad, Christopher; Floriano, Luis G.; Hill, Tracie; Johnson, Rashida K.; Locklear, J. Mark; Penn, Stephen; Rhoulac, Tori; Shay, Adam H.; Taylor, Antone

    1995-01-01

    Data was collected in order to further NASA Langley Research Center's Geographic Information System (GIS). Information on LaRC's communication, electrical, and facility configurations was collected. Existing data was corrected through verification, resulting in more accurate databases. In addition, Global Positioning System (GPS) points were used in order to accurately impose buildings on digitized images. Overall, this project will help the Imaging and CADD Technology Team (ICTT) prove GIS to be a valuable resource for LaRC.

  2. Geographic names of the Antarctic

    Science.gov (United States)

    Alberts, Fred G.

    1995-01-01

    This gazetteer contains 12,710 names approved by the United States Board on Geographic Names and the Secretary of the Interior for features in Antarctica and the area extending northward to the Antarctic Convergence. Included in this geographic area, the Antarctic region, are the off-lying South Shetland Islands, the South Orkney Islands, the South Sandwich Islands, South Georgia, Bouvetøya, Heard Island, and the Balleny Islands. These names have been approved for use by U.S. Government agencies. Their use by the Antarctic specialist and the public is highly recommended for the sake of accuracy and uniformity. This publication, which supersedes previous Board gazetteers or lists for the area, contains names approved as recently as December 1994. The basic name coverage of this gazetteer corresponds to that of maps at the scale of 1:250,000 or larger for coastal Antarctica, the off-lying islands, and isolated mountains and ranges of the continent. Much of the interior of Antarctica is a featureless ice plateau. That area has been mapped at a smaller scale and is nearly devoid of toponyms. All of the names are for natural features, such as mountains, glaciers, peninsulas, capes, bays, islands, and subglacial entities. The names of scientific stations have not been listed alphabetically, but they may appear in the texts of some decisions. For the names of submarine features, reference should be made to the Gazetteer of Undersea Features, 4th edition, U.S. Board on Geographic Names, 1990.

  3. Geographic Object-Based Image Analysis: Towards a new paradigm

    NARCIS (Netherlands)

    Blaschke, T.; Hay, G.J.; Kelly, M.; Lang, S.; Hofmann, P.; Addink, E.A.; Queiroz Feitosa, R.; van der Meer, F.D.; van der Werff, H.M.A.; van Coillie, F.; Tiede, A.

    2014-01-01

    The amount of scientific literature on (Geographic) Object-based Image Analysis – GEOBIA has been and still is sharply increasing. These approaches to analysing imagery have antecedents in earlier research on image segmentation and use GIS-like spatial analysis within classification and feature extr

  4. Spatial variation of vulnerability in geographic areas of North Lebanon

    NARCIS (Netherlands)

    Issa, Sahar; van der Molen, Peterdina; Nader, M.R.; Lovett, Jonathan Cranidge

    2014-01-01

    This paper examines the spatial variation in vulnerability between different geographical areas of the northern coastal region of Lebanon within the context of armed conflict. The study is based on the ‘vulnerability of space’ approach and will be positioned in the academic debate on vulnerability

  5. Fungi identify the geographic origin of dust samples.

    Directory of Open Access Journals (Sweden)

    Neal S Grantham

    Full Text Available There is a long history of archaeologists and forensic scientists using pollen found in a dust sample to identify its geographic origin or history. Such palynological approaches have important limitations as they require time-consuming identification of pollen grains, a priori knowledge of plant species distributions, and a sufficient diversity of pollen types to permit spatial or temporal identification. We demonstrate an alternative approach based on DNA sequencing analyses of the fungal diversity found in dust samples. Using nearly 1,000 dust samples collected from across the continental U.S., our analyses identify up to 40,000 fungal taxa from these samples, many of which exhibit a high degree of geographic endemism. We develop a statistical learning algorithm via discriminant analysis that exploits this geographic endemicity in the fungal diversity to correctly identify samples to within a few hundred kilometers of their geographic origin with high probability. In addition, our statistical approach provides a measure of certainty for each prediction, in contrast with current palynology methods that are almost always based on expert opinion and devoid of statistical inference. Fungal taxa found in dust samples can therefore be used to identify the origin of that dust and, more importantly, we can quantify our degree of certainty that a sample originated in a particular place. This work opens up a new approach to forensic biology that could be used by scientists to identify the origin of dust or soil samples found on objects, clothing, or archaeological artifacts.
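
    The statistical-learning step described here (a discriminant-analysis classifier that predicts a sample's geographic origin from its fungal community, with a per-prediction measure of certainty) can be sketched generically as follows; the feature matrix, region labels, and classifier choice are illustrative assumptions rather than the authors' exact model.

      # Sketch: predict the geographic region of a dust sample from fungal taxa
      # abundances and report the classifier's confidence for each prediction.
      # `taxa` is an (n_samples, n_taxa) abundance matrix, `region` the labels.
      import numpy as np
      from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
      from sklearn.model_selection import train_test_split

      rng = np.random.default_rng(0)
      taxa = rng.random((300, 50))                       # placeholder abundance data
      region = rng.choice(["NE", "SE", "MW", "W"], 300)  # placeholder region labels

      X_tr, X_te, y_tr, y_te = train_test_split(taxa, region, test_size=0.25, random_state=0)
      clf = LinearDiscriminantAnalysis().fit(X_tr, y_tr)

      pred = clf.predict(X_te)
      certainty = clf.predict_proba(X_te).max(axis=1)    # posterior prob. of predicted region
      print(f"accuracy={np.mean(pred == y_te):.2f}, mean certainty={certainty.mean():.2f}")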

  6. Genome of the Netherlands population-specific imputations identify an ABCA6 variant associated with cholesterol levels

    DEFF Research Database (Denmark)

    Van Leeuwen, Elisabeth M.; Karssen, Lennart C.; Deelen, Joris;

    2015-01-01

    Variants associated with blood lipid levels may be population-specific. To identify low-frequency variants associated with this phenotype, population-specific reference panels may be used. Here we impute nine large Dutch biobanks (∼35,000 samples) with the population-specific reference panel crea...

  7. On Matrix Sampling and Imputation of Context Questionnaires with Implications for the Generation of Plausible Values in Large-Scale Assessments

    Science.gov (United States)

    Kaplan, David; Su, Dan

    2016-01-01

    This article presents findings on the consequences of matrix sampling of context questionnaires for the generation of plausible values in large-scale assessments. Three studies are conducted. Study 1 uses data from PISA 2012 to examine several different forms of missing data imputation within the chained equations framework: predictive mean…

  8. Plants and geographical names in Croatia.

    Science.gov (United States)

    Cargonja, Hrvoje; Daković, Branko; Alegro, Antun

    2008-09-01

    The main purpose of this paper is to present some general observations, regularities and insights into the complex relationship between plants and people through symbolic systems such as geographical names on the territory of Croatia. The basic sources of data for this research were maps from the atlas of Croatia at a scale of 1:100,000. Five groups of maps, or areas, were selected to represent the main Croatian phytogeographic regions. From each map, a selection of toponyms was made in which the name of a plant in the Croatian language could be recognized (phytotoponyms). Results showed that, of all plant names recognized in geographical names, trees are the most represented, with birch and oak the most frequent among them. Furthermore, an attempt was made to explain the presence of the most represented plant species in the phytotoponyms in the light of the general phytogeographical and sociocultural differences and similarities of the compared areas. The findings confirm the expectation that the genera of the climazonal vegetation of a particular area are the most represented among the phytotoponyms. Nevertheless, there are ample examples where the representation of a plant name in the names of the human environment can only be ascribed to ethno-linguistic and socio-cultural motives. Despite the reductionist character of the applied methodology, this research also points out some advantages of this approach for ethnobotanic and ethnolinguistic studies of larger areas of the human environment.

  9. An Ontology-Based Framework for Geographic Data Integration

    Science.gov (United States)

    Vidal, Vânia M. P.; Sacramento, Eveline R.; de Macêdo, José Antonio Fernandes; Casanova, Marco Antonio

    Ontologies have been extensively used to model domain-specific knowledge. Recent research has applied ontologies to enhance the discovery and retrieval of geographic data in Spatial Data Infrastructures (SDIs). However, in those approaches it is assumed that all the data required for answering a query can be obtained from a single data source. In this work, we propose an ontology-based framework for the integration of geographic data. In our approach, a query posed on a domain ontology is rewritten into sub-queries submitted over multiple data sources, and the query result is obtained by the proper combination of data resulting from these sub-queries. We illustrate how our framework allows the combination of data from different sources, thus overcoming some limitations of other ontology-based approaches. Our approach is illustrated by an example from the domain of aeronautical flights.

  10. Imputation of systematically missing predictors in an individual participant data meta-analysis: A generalized approach using MICE

    NARCIS (Netherlands)

    Jolani, S.; Debray, T.P.A.; Koffijberg, H.; Buuren, S. van; Moons, K.G.M.

    2015-01-01

    Individual participant data meta-analyses (IPD-MA) are increasingly used for developing and validating multivariable (diagnostic or prognostic) risk prediction models. Unfortunately, some predictors or even outcomes may not have been measured in each study and are thus systematically missing in some

  12. Geographic Names Information System (GNIS) Admin Features

    Data.gov (United States)

    Department of Homeland Security — The Geographic Names Information System (GNIS) is the Federal standard for geographic nomenclature. The U.S. Geological Survey developed the GNIS for the U.S. Board...

  13. Geographic Names Information System (GNIS) Antarctica Features

    Data.gov (United States)

    Department of Homeland Security — The Geographic Names Information System (GNIS) is the Federal standard for geographic nomenclature. The U.S. Geological Survey developed the GNIS for the U.S. Board...

  14. Geographic Names Information System (GNIS) Hydrography Points

    Data.gov (United States)

    Department of Homeland Security — The Geographic Names Information System (GNIS) is the Federal standard for geographic nomenclature. The U.S. Geological Survey developed the GNIS for the U.S. Board...

  15. Geographic Names Information System (GNIS) Hydrography Lines

    Data.gov (United States)

    Department of Homeland Security — The Geographic Names Information System (GNIS) is the Federal standard for geographic nomenclature. The U.S. Geological Survey developed the GNIS for the U.S. Board...

  16. Geographic Names Information System (GNIS) Community Features

    Data.gov (United States)

    Department of Homeland Security — The Geographic Names Information System (GNIS) is the Federal standard for geographic nomenclature. The U.S. Geological Survey developed the GNIS for the U.S. Board...

  17. Geographic Place Names, Published in unknown, SWGRC.

    Data.gov (United States)

    NSGIC GIS Inventory (aka Ramona) — This Geographic Place Names dataset is as of an unknown date. Data by this publisher are often provided in a Geographic coordinate system, in a "Not Sure" projection; the extent...

  18. Geographic Names Information System (GNIS) Landform Features

    Data.gov (United States)

    Department of Homeland Security — The Geographic Names Information System (GNIS) is the Federal standard for geographic nomenclature. The U.S. Geological Survey developed the GNIS for the U.S. Board...

  19. Geographic Names Information System (GNIS) Historical Features

    Data.gov (United States)

    Department of Homeland Security — The Geographic Names Information System (GNIS) is the Federal standard for geographic nomenclature. The U.S. Geological Survey developed the GNIS for the U.S. Board...

  20. Geographic Names Information System (GNIS) Transportation Features

    Data.gov (United States)

    Department of Homeland Security — The Geographic Names Information System (GNIS) is the Federal standard for geographic nomenclature. The U.S. Geological Survey developed the GNIS for the U.S. Board...

  1. Geographic Names Information System (GNIS) Cultural Features

    Data.gov (United States)

    Department of Homeland Security — The Geographic Names Information System (GNIS) is the Federal standard for geographic nomenclature. The U.S. Geological Survey developed the GNIS for the U.S. Board...

  2. Geographic Names Information System (GNIS) Structures

    Data.gov (United States)

    Department of Homeland Security — The Geographic Names Information System (GNIS) is the Federal standard for geographic nomenclature. The U.S. Geological Survey developed the GNIS for the U.S. Board...

  3. Influence of Pattern of Missing Data on Performance of Imputation Methods: An Example from National Data on Drug Injection in Prisons

    Directory of Open Access Journals (Sweden)

    Mohammad Reza Baneshi

    2013-05-01

    Full Text Available Background Policy makers need models to be able to detect groups at high risk of HIV infection. Incomplete records and dirty data are frequently seen in national data sets. The presence of missing data challenges the practice of model development. Several studies have suggested that the performance of imputation methods is acceptable when the missing rate is moderate. One issue that has received less attention, and is addressed here, is the role of the pattern of missing data. Methods We used information on 2720 prisoners. Results derived from fitting a regression model to the whole data set served as the gold standard. Missing data were then generated so that 10%, 20% and 50% of data were lost. In scenario 1, we generated missing values, at the above rates, in one variable which was significant in the gold model (age). In scenario 2, a small proportion of each independent variable was dropped. Four imputation methods, under different Events Per Variable (EPV) values, were compared in terms of selection of important variables and parameter estimation. Results In scenario 2, bias in the estimates was low and the performance of all methods for handling missing data was similar. All methods at all missing rates were able to detect the significance of age. In scenario 1, biases in the estimates increased, in particular at the 50% missing rate. Here, at EPVs of 10 and 5, the imputation methods failed to capture the effect of age. Conclusion In scenario 2, all imputation methods at all missing rates were able to detect age as being significant. This was not the case in scenario 1. Our results showed that the performance of imputation methods depends on the pattern of missing data.
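
    The two missingness scenarios compared in this abstract (dropping values from a single influential variable versus spreading a small amount of missingness over every predictor) are straightforward to reproduce in simulation; the sketch below, with illustrative column names and rates, shows only how such patterns might be generated, not the study's actual code.

      # Sketch: generate the two missing-data patterns described in the abstract.
      #   scenario 1 - remove a given fraction of one key variable (e.g. "age")
      #   scenario 2 - remove a small fraction of every independent variable
      import numpy as np
      import pandas as pd

      def scenario_1(df, column="age", rate=0.5, seed=0):
          out = df.copy()
          rng = np.random.default_rng(seed)
          drop = rng.random(len(out)) < rate
          out.loc[drop, column] = np.nan
          return out

      def scenario_2(df, predictors, rate=0.1, seed=0):
          out = df.copy()
          rng = np.random.default_rng(seed)
          for col in predictors:
              drop = rng.random(len(out)) < rate
              out.loc[drop, col] = np.nan
          return out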

  4. [Differentiation of geographic biovariants of smallpox virus by PCR].

    Science.gov (United States)

    Babkin, I V; Babkina, I N

    2010-01-01

    Comparative analysis of amino acid and nucleotide sequences of ORFs located in extended segments of the terminal variable regions of the variola virus genome identified a promising locus, ORF O1L of VARV, for genotyping the virus according to geographic origin. Primers were designed for PCR amplification of a fragment of this ORF, which makes it possible to distinguish the South America-Western Africa genotype from other VARV strains. Subsequent RFLP analysis reliably differentiated Asian strains from African strains (except Western Africa isolates). The method was tested on 16 VARV strains from various geographic regions. The developed approach is simple, fast and reliable.

  5. How a Geographer Looks at Globalism.

    Science.gov (United States)

    Natoli, Salvatore J.

    1990-01-01

    Argues a global perspective is inherent to all geographic research and education. Quotes several influential geographers concerning their views on globalism and geography as a discipline. Examines geography's five fundamental themes and their applicability to a global perspective. Considers roles geographers can play in solving world environmental…

  6. Relevance Measures Using Geographic Scopes and Types

    NARCIS (Netherlands)

    Andogah, Geoffrey; Bouma, Gosse; Peters, C; Jikoun; Mandl, T; Muller, H; Oard, DW; Penas, A; Petras; Santos, D

    2008-01-01

    This paper proposes two kinds of relevance measures to rank documents by geographic restriction: scope-based and type-based. The non-geographic and geographic relevance scores are combined using a weighted harmonic mean. The proposed relevance measures and weighting schemes are evaluated on GeoCLEF
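
    For illustration only, the sketch below shows one plausible way to combine a non-geographic (textual) relevance score with a geographic relevance score using a weighted harmonic mean, as the abstract describes; the weight parameter, the score ranges and the function name are assumptions, not details taken from the paper.

        def weighted_harmonic_mean(text_score: float, geo_score: float, beta: float = 0.5) -> float:
            """Combine textual and geographic relevance with a weighted harmonic mean.

            beta controls the emphasis: values near 1 favour the geographic score,
            values near 0 favour the textual score. Scores are assumed to lie in (0, 1].
            """
            if text_score <= 0.0 or geo_score <= 0.0:
                return 0.0
            return 1.0 / ((1.0 - beta) / text_score + beta / geo_score)

        # A document with a strong textual match but a weak geographic match is pulled down.
        print(weighted_harmonic_mean(0.9, 0.3, beta=0.5))  # 0.45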

  7. 33 CFR 166.103 - Geographic coordinates.

    Science.gov (United States)

    2010-07-01

    ... 33 Navigation and Navigable Waters 2 2010-07-01 2010-07-01 false Geographic coordinates. 166.103...) PORTS AND WATERWAYS SAFETY SHIPPING SAFETY FAIRWAYS General § 166.103 Geographic coordinates. Geographic coordinates expressed in terms of latitude or longitude, or both, are not intended for plotting on maps...

  8. 33 CFR 167.3 - Geographic coordinates.

    Science.gov (United States)

    2010-07-01

    ... 33 Navigation and Navigable Waters 2 2010-07-01 2010-07-01 false Geographic coordinates. 167.3...) PORTS AND WATERWAYS SAFETY OFFSHORE TRAFFIC SEPARATION SCHEMES General § 167.3 Geographic coordinates. Geographic coordinates are defined using North American 1927 Datum (NAD 27) unless indicated otherwise....

  9. Estimation of Tree Lists from Airborne Laser Scanning Using Tree Model Clustering and k-MSN Imputation

    Directory of Open Access Journals (Sweden)

    Jörgen Wallerman

    2013-04-01

    Full Text Available Individual tree crowns may be delineated from airborne laser scanning (ALS data by segmentation of surface models or by 3D analysis. Segmentation of surface models benefits from using a priori knowledge about the proportions of tree crowns, which has not yet been utilized for 3D analysis to any great extent. In this study, an existing surface segmentation method was used as a basis for a new tree model 3D clustering method applied to ALS returns in 104 circular field plots with 12 m radius in pine-dominated boreal forest (64°14'N, 19°50'E. For each cluster below the tallest canopy layer, a parabolic surface was fitted to model a tree crown. The tree model clustering identified more trees than segmentation of the surface model, especially smaller trees below the tallest canopy layer. Stem attributes were estimated with k-Most Similar Neighbours (k-MSN imputation of the clusters based on field-measured trees. The accuracy at plot level from the k-MSN imputation (stem density root mean square error or RMSE 32.7%; stem volume RMSE 28.3% was similar to the corresponding results from the surface model (stem density RMSE 33.6%; stem volume RMSE 26.1% with leave-one-out cross-validation for one field plot at a time. Three-dimensional analysis of ALS data should also be evaluated in multi-layered forests since it identified a larger number of small trees below the tallest canopy layer.
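
    The imputation step can be pictured with the simplified nearest-neighbour sketch below (Python with NumPy). The feature names, the plain Euclidean distance and the averaging of donor attributes are simplifying assumptions; the actual k-MSN method weights the feature space using canonical correlation analysis, which is omitted here.

        import numpy as np

        def knn_impute_stem_attributes(cluster_features, reference_features, reference_attributes, k=3):
            """Assign stem attributes to ALS-derived crown clusters from field-measured reference trees.

            cluster_features:     (n_clusters, n_features) ALS metrics for each detected crown cluster
            reference_features:   (n_refs, n_features) the same metrics for field-measured trees
            reference_attributes: (n_refs, n_attrs) measured stem attributes (e.g. diameter, volume)
            Returns (n_clusters, n_attrs) attributes imputed as the mean of the k nearest donors.
            """
            # Standardise features so that no single metric dominates the distance.
            mean = reference_features.mean(axis=0)
            std = reference_features.std(axis=0) + 1e-9
            x = (cluster_features - mean) / std
            r = (reference_features - mean) / std

            imputed = np.empty((x.shape[0], reference_attributes.shape[1]))
            for i, row in enumerate(x):
                distances = np.linalg.norm(r - row, axis=1)
                nearest = np.argsort(distances)[:k]
                imputed[i] = reference_attributes[nearest].mean(axis=0)
            return imputed

        # Tiny synthetic example: two ALS metrics (crown height, crown area), one attribute (stem volume).
        refs = np.array([[18.0, 12.0], [22.0, 15.0], [9.0, 5.0], [11.0, 6.0]])
        vols = np.array([[0.35], [0.55], [0.08], [0.12]])
        clusters = np.array([[20.0, 14.0], [10.0, 5.5]])
        print(knn_impute_stem_attributes(clusters, refs, vols, k=2))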

  10. A new GWAS and meta-analysis with 1000Genomes imputation identifies novel risk variants for colorectal cancer

    Science.gov (United States)

    Al-Tassan, Nada A.; Whiffin, Nicola; Hosking, Fay J.; Palles, Claire; Farrington, Susan M.; Dobbins, Sara E.; Harris, Rebecca; Gorman, Maggie; Tenesa, Albert; Meyer, Brian F.; Wakil, Salma M.; Kinnersley, Ben; Campbell, Harry; Martin, Lynn; Smith, Christopher G.; Idziaszczyk, Shelley; Barclay, Ella; Maughan, Timothy S.; Kaplan, Richard; Kerr, Rachel; Kerr, David; Buchannan, Daniel D.; Ko Win, Aung; Hopper, John; Jenkins, Mark; Lindor, Noralane M.; Newcomb, Polly A.; Gallinger, Steve; Conti, David; Schumacher, Fred; Casey, Graham; Dunlop, Malcolm G.; Tomlinson, Ian P.; Cheadle, Jeremy P.; Houlston, Richard S.

    2015-01-01

    Genome-wide association studies (GWAS) of colorectal cancer (CRC) have identified 23 susceptibility loci thus far. Analyses of previously conducted GWAS indicate additional risk loci are yet to be discovered. To identify novel CRC susceptibility loci, we conducted a new GWAS and performed a meta-analysis with five published GWAS (totalling 7,577 cases and 9,979 controls of European ancestry), imputing genotypes utilising the 1000 Genomes Project. The combined analysis identified new, significant associations with CRC at 1p36.2 marked by rs72647484 (minor allele frequency [MAF] = 0.09) near CDC42 and WNT4 (P = 1.21 × 10−8, odds ratio [OR] = 1.21 ) and at 16q24.1 marked by rs16941835 (MAF = 0.21, P = 5.06 × 10−8; OR = 1.15) within the long non-coding RNA (lncRNA) RP11-58A18.1 and ~500 kb from the nearest coding gene FOXL1. Additionally we identified a promising association at 10p13 with rs10904849 intronic to CUBN (MAF = 0.32, P = 7.01 × 10-8; OR = 1.14). These findings provide further insights into the genetic and biological basis of inherited genetic susceptibility to CRC. Additionally, our analysis further demonstrates that imputation can be used to exploit GWAS data to identify novel disease-causing variants. PMID:25990418

  11. Discovery and Fine-Mapping of Glycaemic and Obesity-Related Trait Loci Using High-Density Imputation.

    Directory of Open Access Journals (Sweden)

    Momoko Horikoshi

    2015-07-01

    Full Text Available Reference panels from the 1000 Genomes (1000G Project Consortium provide near complete coverage of common and low-frequency genetic variation with minor allele frequency ≥0.5% across European ancestry populations. Within the European Network for Genetic and Genomic Epidemiology (ENGAGE Consortium, we have undertaken the first large-scale meta-analysis of genome-wide association studies (GWAS, supplemented by 1000G imputation, for four quantitative glycaemic and obesity-related traits, in up to 87,048 individuals of European ancestry. We identified two loci for body mass index (BMI at genome-wide significance, and two for fasting glucose (FG, none of which has been previously reported in larger meta-analysis efforts to combine GWAS of European ancestry. Through conditional analysis, we also detected multiple distinct signals of association mapping to established loci for waist-hip ratio adjusted for BMI (RSPO3 and FG (GCK and G6PC2. The index variant for one association signal at the G6PC2 locus is a low-frequency coding allele, H177Y, which has recently been demonstrated to have a functional role in glucose regulation. Fine-mapping analyses revealed that the non-coding variants most likely to drive association signals at established and novel loci were enriched for overlap with enhancer elements, which for FG mapped to promoter and transcription factor binding sites in pancreatic islets, in particular. Our study demonstrates that 1000G imputation and genetic fine-mapping of common and low-frequency variant association signals at GWAS loci, integrated with genomic annotation in relevant tissues, can provide insight into the functional and regulatory mechanisms through which their effects on glycaemic and obesity-related traits are mediated.

  12. Geographic Load Balanced Routing in Wireless Sensor Networks

    Directory of Open Access Journals (Sweden)

    Robin Guleria

    2013-06-01

    Full Text Available Recently the application domains of wireless sensor networks have grown exponentially. Traditional routing algorithms generate traffic related to route discovery towards the destination. Geographic routing algorithms exploit location information well, but congestion and collisions limit their full use in resource-constrained wireless sensor networks. In this paper we present Geographic Load Balanced Routing (GLBR), which explores load balancing as a viable answer to the challenges of geographic routing in WSNs. Load balancing can be realized through two approaches. GLBR defines parameters based on the communication overhead at sensor nodes and the wireless link status, through which load can be balanced across the whole network. GLBR builds on the existing geographic routing technique of greedy forwarding by considering not only the distance between the next hop and the destination but also the overhead at the node when choosing where to forward a packet. When the load at a node is high, GLBR looks for an alternative forwarding option. GLBR thus diverts traffic to relieve congestion and hence avoids disconnections in the network.
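
    A minimal sketch of such a load-aware greedy forwarding decision is given below in Python; the node structure, the queue-length load metric and the weighting between distance and load are illustrative assumptions, not the GLBR specification.

        import math
        from dataclasses import dataclass

        @dataclass
        class Node:
            node_id: int
            x: float
            y: float
            queue_length: int  # stand-in for the communication overhead at the node

        def distance(a, b):
            return math.hypot(a[0] - b[0], a[1] - b[1])

        def choose_next_hop(neighbours, destination, alpha=0.7, max_queue=20):
            """Pick the neighbour that best trades progress towards the destination against load.

            Pure greedy forwarding would minimise distance to the destination alone;
            here the cost also penalises heavily loaded neighbours (alpha weights distance).
            """
            best, best_cost = None, float("inf")
            for n in neighbours:
                remaining = distance((n.x, n.y), destination)
                load = n.queue_length / max_queue
                cost = alpha * remaining + (1.0 - alpha) * load * remaining
                if cost < best_cost:
                    best, best_cost = n, cost
            return best

        # A lightly loaded neighbour slightly farther from the sink can still win.
        neighbours = [Node(1, 5.0, 5.0, 18), Node(2, 4.0, 6.0, 2)]
        print(choose_next_hop(neighbours, destination=(10.0, 10.0)).node_id)  # 2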

  13. Network sensitivity to geographical configuration

    CERN Document Server

    Searle, Antony C; Scott, Susan M; McClelland, David E

    2002-01-01

    Gravitational wave astronomy will require the coordinated analysis of data from the global network of gravitational wave observatories. Questions of how to optimally configure the global network naturally arise in this context. We propose a formalism to compare different configurations of the network, using both the coincident network analysis method and the coherent network analysis method, and construct a model to compute a figure-of-merit based on the detection rate for a population of standard-candle binary inspirals. We find that this measure of network quality is very sensitive to the geographic location of component detectors under a coincident network analysis, but comparatively insensitive under a coherent network analysis.

  14. Energy Efficient Geographical Load Balancing via Dynamic Deferral of Workload

    CERN Document Server

    Adnan, Muhammad Abdullah; Gupta, Rajesh

    2012-01-01

    With the increasing popularity of cloud computing and mobile computing, individuals, enterprises and research centers have started outsourcing their IT and computational needs to on-demand cloud services. Recently, geographical load balancing techniques have been suggested for data centers hosting cloud computation in order to reduce energy cost by exploiting the electricity price differences across regions. However, these algorithms do not distinguish among the diverse responsiveness requirements of different workloads. In this paper, we use the flexibility offered by Service Level Agreements (SLAs) to differentiate among workloads under bounded latency requirements and propose a novel approach to cost savings in geographical load balancing. We investigate how much workload should be executed in each data center and how much should be delayed and migrated to other data centers to save energy while meeting deadlines. We present an offline formulation for the geographical load balancing problem with dyna...
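
    The core idea of deferring delay-tolerant workload to cheaper slots or regions can be sketched as a single greedy scheduling step, as below; the job structure, the per-slot price and capacity dictionaries, and the tie-breaking rules are all assumptions made for illustration rather than the paper's formulation.

        from dataclasses import dataclass

        @dataclass
        class Job:
            job_id: int
            size: float  # compute units required
            slack: int   # time slots the job may still be deferred under its SLA

        def schedule_slot(jobs, prices, capacity):
            """One greedy scheduling slot: run what fits in the cheapest regions, defer the rest.

            prices:   {region: electricity price for this slot}
            capacity: {region: remaining compute units in this slot}
            Returns (assignments, deferred), where assignments maps job_id -> region.
            """
            assignments, deferred = {}, []
            regions_by_price = sorted(prices, key=prices.get)
            # Jobs with the least remaining slack are placed first.
            for job in sorted(jobs, key=lambda j: j.slack):
                placed = False
                for region in regions_by_price:
                    if capacity[region] >= job.size:
                        capacity[region] -= job.size
                        assignments[job.job_id] = region
                        placed = True
                        break
                if not placed:
                    if job.slack > 0:
                        job.slack -= 1  # defer the job to the next slot
                        deferred.append(job)
                    else:
                        # Deadline reached: force placement in the region with the most room.
                        region = max(capacity, key=capacity.get)
                        capacity[region] -= job.size
                        assignments[job.job_id] = region
            return assignments, deferred

        jobs = [Job(1, 4.0, slack=2), Job(2, 3.0, slack=0), Job(3, 5.0, slack=1)]
        prices = {"us-west": 0.08, "us-east": 0.12}
        capacity = {"us-west": 6.0, "us-east": 3.0}
        print(schedule_slot(jobs, prices, capacity))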

  15. Component 1: Current and Future Methods for Representing and Interacting with Qualitative Geographic Information

    Science.gov (United States)

    2011-10-26

    A variety of approaches exist for visual exploration and analysis of text media; this report examines methods to process and represent such information and to connect it with more traditional geographic data organized within GIS and related technologies, in order to effectively leverage geographically grounded text. Subject terms: geovisualization, visual analytics, social media, microblogs, cartography.

  16. Research on Geographical Urban Conditions Monitoring

    Institute of Scientific and Technical Information of China (English)

    2012-01-01

    by LUO A1inghai. Abstract: Geographical national conditions monitoring has become an important task of the surveying and geographical information industry, and will have a profound influence on the development of surveying and geographical information. This paper introduces the basic concept of geographical national conditions monitoring, discusses its main tasks, including complete surveying, dynamic monitoring, statistical analysis and regular release, expounds the main content of geographical urban conditions monitoring, including urbanization monitoring, socio-economic development monitoring, transportation foundation monitoring and natural ecological environment monitoring, and puts forward a framework system for geographical urban conditions monitoring. Key words: surveying and mapping, geographical national conditions, monitoring.

  17. OUTDOOR EDUCATION AND GEOGRAPHICAL EDUCATION

    Directory of Open Access Journals (Sweden)

    ANDREA GUARAN

    2016-01-01

    Full Text Available This paper reflects on the relationship between the values and methodological principles of Outdoor Education and perspectives on spatial and geographical education, especially in pre-school and primary school, i.e. for children between 3 and 10 years of age. Outdoor Education is an educational practice rooted in the philosophical thought of the 16th and 17th centuries, from John Locke to Jean-Jacques Rousseau, and in pedagogical thought, in particular that of Friedrich Fröbel, and it now has a fairly stable tradition in the countries of Northern Europe. In Italy, however, there are still few experiences, and they usually do not have a systematic and structural character, but rather a temporary and experimental outdoor organization. The first part of the paper focuses on the reasons that justify particular attention to educational paths favouring outdoor activities, providing a definition of outdoor education and highlighting its values. It is also essential to understand that educational programs in open spaces, such as a forest or simply the schoolyard, offer the possibility to learn about geographical situations. The question that arises, therefore, is how best to exploit the stimuli that the spatial setting provides for the acquisition of knowledge, skills and abilities about space and geography.

  18. Natural Scales in Geographical Patterns

    Science.gov (United States)

    Menezes, Telmo; Roth, Camille

    2017-04-01

    Human mobility is known to be distributed across several orders of magnitude of physical distances, which makes it generally difficult to endogenously find or define typical and meaningful scales. Relevant analyses, from movements to geographical partitions, seem to be relative to some ad-hoc scale, or no scale at all. Relying on geotagged data collected from photo-sharing social media, we apply community detection to movement networks constrained by increasing percentiles of the distance distribution. Using a simple parameter-free discontinuity detection algorithm, we discover clear phase transitions in the community partition space. The detection of these phases constitutes the first objective method of characterising endogenous, natural scales of human movement. Our study covers nine regions, ranging from cities to countries of various sizes and a transnational area. For all regions, the number of natural scales is remarkably low (2 or 3). Further, our results hint at scale-related behaviours rather than scale-related users. The partitions of the natural scales allow us to draw discrete multi-scale geographical boundaries, potentially capable of providing key insights in fields such as epidemiology or cultural contagion where the introduction of spatial boundaries is pivotal.
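
    As a rough illustration of the pipeline the abstract describes (movement networks constrained by increasing distance percentiles, community detection, and discontinuity detection), the Python sketch below uses networkx's greedy modularity communities in place of whatever community-detection algorithm the authors used; the toy movement data and thresholds are invented for the example.

        import numpy as np
        import networkx as nx
        from networkx.algorithms.community import greedy_modularity_communities

        def natural_scale_profile(movements, percentiles=range(20, 100, 20)):
            """Community structure of a movement network as the distance cut-off grows.

            movements: list of (origin_id, destination_id, distance_km) tuples.
            Returns a list of (percentile, number_of_communities); sharp jumps between
            consecutive entries hint at the phase transitions described in the paper.
            """
            distances = np.array([d for _, _, d in movements])
            profile = []
            for p in percentiles:
                cutoff = np.percentile(distances, p)
                g = nx.Graph()
                for origin, dest, d in movements:
                    if d <= cutoff:
                        g.add_edge(origin, dest)
                if g.number_of_edges() == 0:
                    continue
                communities = greedy_modularity_communities(g)
                profile.append((p, len(communities)))
            return profile

        def detect_discontinuities(profile, threshold=0.5):
            """Flag percentile steps where the community count changes sharply."""
            jumps = []
            for (p1, c1), (p2, c2) in zip(profile, profile[1:]):
                if abs(c2 - c1) / max(c1, 1) > threshold:
                    jumps.append((p1, p2))
            return jumps

        # Toy movement data: two tight local clusters plus a few long-range trips.
        movements = [(1, 2, 3.0), (2, 3, 4.0), (1, 3, 5.0),
                     (10, 11, 2.0), (11, 12, 3.5), (10, 12, 4.5),
                     (3, 10, 120.0), (2, 12, 150.0)]
        profile = natural_scale_profile(movements)
        print(profile, detect_discontinuities(profile))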

  19. Natural Scales in Geographical Patterns

    Science.gov (United States)

    Menezes, Telmo; Roth, Camille

    2017-01-01

    Human mobility is known to be distributed across several orders of magnitude of physical distances, which makes it generally difficult to endogenously find or define typical and meaningful scales. Relevant analyses, from movements to geographical partitions, seem to be relative to some ad-hoc scale, or no scale at all. Relying on geotagged data collected from photo-sharing social media, we apply community detection to movement networks constrained by increasing percentiles of the distance distribution. Using a simple parameter-free discontinuity detection algorithm, we discover clear phase transitions in the community partition space. The detection of these phases constitutes the first objective method of characterising endogenous, natural scales of human movement. Our study covers nine regions, ranging from cities to countries of various sizes and a transnational area. For all regions, the number of natural scales is remarkably low (2 or 3). Further, our results hint at scale-related behaviours rather than scale-related users. The partitions of the natural scales allow us to draw discrete multi-scale geographical boundaries, potentially capable of providing key insights in fields such as epidemiology or cultural contagion where the introduction of spatial boundaries is pivotal. PMID:28374825

  1. Geographic profiling and animal foraging.

    Science.gov (United States)

    Le Comber, Steven C; Nicholls, Barry; Rossmo, D Kim; Racey, Paul A

    2006-05-21

    Geographic profiling was originally developed as a statistical tool for use in criminal cases, particularly those involving serial killers and rapists. It is designed to help police forces prioritize lists of suspects by using the location of crime scenes to identify the areas in which the criminal is most likely to live. Two important concepts are the buffer zone (criminals are less likely to commit crimes in the immediate vicinity of their home) and distance decay (criminals commit fewer crimes as the distance from their home increases). In this study, we show how the techniques of geographic profiling may be applied to animal data, using as an example foraging patterns in two sympatric colonies of pipistrelle bats, Pipistrellus pipistrellus and P. pygmaeus, in the northeast of Scotland. We show that if model variables are fitted to known roost locations, these variables may be used as numerical descriptors of foraging patterns. We go on to show that these variables can be used to differentiate patterns of foraging in these two species.
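
    A toy version of such a profile surface can be sketched as follows (Python with NumPy); the particular decay function, the buffer radius and the parameter values are illustrative assumptions rather than the model actually fitted in the study.

        import numpy as np

        def profile_score(grid_points, site_locations, buffer_radius=1.0, decay=1.5):
            """Score candidate anchor points (e.g. roosts or residences) from observed sites.

            Scores are low immediately around each site (the buffer zone), rise towards
            the buffer edge, and then fall off with distance (distance decay), so the
            highest-scoring cells lie at a moderate distance from many sites.
            """
            scores = np.zeros(len(grid_points))
            for sx, sy in site_locations:
                d = np.hypot(grid_points[:, 0] - sx, grid_points[:, 1] - sy)
                inside = d < buffer_radius
                contribution = np.where(
                    inside,
                    d / buffer_radius,                               # rises towards the buffer edge
                    (buffer_radius / np.maximum(d, 1e-9)) ** decay)  # decays beyond it
                scores += contribution
            return scores

        # Evaluate a small grid around three observed foraging sites.
        xs, ys = np.meshgrid(np.linspace(0, 10, 21), np.linspace(0, 10, 21))
        grid = np.column_stack([xs.ravel(), ys.ravel()])
        sites = [(2.0, 3.0), (6.0, 4.0), (5.0, 8.0)]
        scores = profile_score(grid, sites)
        print(grid[np.argmax(scores)])  # highest-scoring anchor location on this toy surface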

  2. The Geographical Dimension of Terrorism

    Science.gov (United States)

    Hawkins, Houston T.

    The events of September 11 ushered us all into a world in which our security and sense of invulnerability were savagely replaced by vulnerability and irrational fear. To the delight of our adversaries who planned these attacks, we often responded in ways that furthered their agenda by weakening the cultural colossus that we call home. Normally terrorism is viewed as intense but localized violence. Seldom is terrorism viewed in its more expansive dimensions. It is burned into our collective memories as a collapsed building, a shattered bus, an incinerated nightclub, or facilities closed by a few anthrax-laced letters. However, terrorism must be studied in dimensions larger than the view from a news camera. This conclusion forms the intellectual basis for The Geographical Dimension of Terrorism.

  3. Network sensitivity to geographical configuration

    Energy Technology Data Exchange (ETDEWEB)

    Searle, Antony C; Scott, Susan M; McClelland, David E [Department of Physics and Theoretical Physics, Faculty of Science, Australian National University, Canberra, ACT 0200 (Australia)

    2002-04-07

    Gravitational wave astronomy will require the coordinated analysis of data from the global network of gravitational wave observatories. Questions of how to optimally configure the global network arise in this context. We have elsewhere proposed a formalism which is employed here to compare different configurations of the network, using both the coincident network analysis method and the coherent network analysis method. We have constructed a network model to compute a figure-of-merit based on the detection rate for a population of standard-candle binary inspirals. We find that this measure of network quality is very sensitive to the geographic location of component detectors under a coincident network analysis, but comparatively insensitive under a coherent network analysis.

  4. A Geographic Information Systems approach for classifying and mapping forest management category in Baihe Forestry Bureau, Northeast China

    Institute of Scientific and Technical Information of China (English)

    王顺忠; 邵国凡; 谷会岩; 王庆礼; 代力民

    2006-01-01

    This paper demonstrates a Geographic Information Systems (GIS) procedure for classifying and mapping forest management categories in Baihe Forestry Bureau, Jilin Province, China. Within the study area, Baihe Forestry Bureau land was classified into a two-level hierarchy. The top-level classes were non-forest and forest; over 96% of the land area is forest. Forest was further divided into key ecological service forest (KES), general ecological service forest (GES) and commodity forest (COM). COM covered 45.0% of the total land area and was the major forest management type in Baihe Forestry Bureau; KES and GES accounted for 21.2% and 29.9% of the total land area, respectively. The forest management zones delineated with GIS in this study were then compared with the zones established by hand-drawn mapping by the local agency. There were obvious differences between the two products: in the GIS-based zoning, patches of each management type were more numerous and smaller. These differences appear to stem from the data sources, basic mapping units and mapping procedures used. The results suggest that the GIS method is a useful tool for integrating forest inventory data with other data to classify and map forest zones, meeting the needs of a classified forest management system.

  5. A comparison of multiple imputation methods for handling missing values in longitudinal data in the presence of a time-varying covariate with a non-linear association with time: a simulation study

    National Research Council Canada - National Science Library

    Anurika Priyanjali De Silva; Margarita Moreno-Betancur; Alysha Madhu De Livera; Katherine Jane Lee; Julie Anne Simpson

    2017-01-01

    ...)) treat repeated measurements of the same time-dependent variable as just another ‘distinct’ variable for imputation and therefore do not make the most of the longitudinal structure of the data...

  6. Geographical variation in cardiovascular incidence: results from the British Women's Heart and Health Study

    Directory of Open Access Journals (Sweden)

    Ebrahim Shah

    2010-11-01

    Full Text Available Abstract Background Prevalence of cardiovascular disease (CVD) in women shows regional variations not explained by common risk factors. Analysis of CVD incidence will provide insight into whether there is further divergence between regions with increasing age. Methods Seven-year follow-up data on 2685 women aged 59-80 (mean 69) at baseline from 23 towns in the UK were available from the British Women's Heart and Health Study. Time to fatal or non-fatal CVD was analyzed using Cox regression with adjustment for risk factors, using multiple imputation for missing values. Results Compared to South England, CVD incidence is similar in North England (HR 1.05 (95% CI 0.84, 1.31)) and Scotland (0.93 (0.68, 1.27)), but lower in Midlands/Wales (0.85 (0.64, 1.12)). Event severity influenced regional variation, with South England showing lower fatal incident CVD than other regions, but higher non-fatal incident CVD. Kaplan-Meier plots suggested that regional divergence in CVD occurred before baseline (before the mean baseline age of 69). Conclusions In women, regional differences in CVD early in adult life do not further diverge in later life. This may be due to regional differences in early detection, survivorship of women entering the study, or event severity. Targeting health care resources for CVD by geographic variation may not be appropriate for older age-groups.

  7. Prospects for Formation and Development of the Geographical (Territorial) Industrial Clusters in West Kazakhstan Region of the Republic of Kazakhstan

    Science.gov (United States)

    Imashev, Eduard Zh.

    2016-01-01

    The purpose of this research is to develop and implement an economic and geographic approach to forming and developing geographic (territorial) industrial clusters in regions of Kazakhstan. The purpose necessitates the accomplishment of the following scientific objectives: to investigate scientific approaches and experience of territorial economic…

  8. Geographic versus industry diversification: constraints matter

    OpenAIRE

    Ehling, Paul; Ramos, Sofia Brito

    2005-01-01

    This research addresses whether geographic diversification provides benefits over industry diversification. In the absence of constraints, no empirical evidence is found to support the argument that country diversification is superior. With short-selling constraints, however, the geographic tangency portfolio is not attainable by industry portfolios. Results with upper and lower constraints on portfolio weights as well as an out-of-sample analysis show that geographic diversification almost c...

  9. CYBERNETICS AND GEOGRAPHICAL EDUCATION: CYBERNETICS OF LEARNING AND LEARNING OF CYBERNETICS

    Directory of Open Access Journals (Sweden)

    M. R. Arpentieva

    2017-01-01

    Full Text Available Modern geographical education implies a broad implementation of innovative technologies, allowing students to fully and deeply understand the subject and methods of professional activity, and to act effectively and productively on that understanding. Computer and media technologies therefore occupy a significant place in the work of the modern geographer, and geographic education occupies an important place in learning cybernetic disciplines: computer technologies act both as an important condition for obtaining a high-quality professional education and as an important tool of the professional activity of the modern specialist geographer. The article compares three modern approaches to the study and optimization of teaching cybernetics and programming within geographical education: an approach devoted to the study of "learning styles"; the metacognitive approach to learning computer science and programming; and the intersubjective, evergetic or properly cybernetic approach. It describes their advantages and limitations in the context of geographical education, as well as their internal unity as different forms of studying the productivity and conditions of dialogical interaction between teacher and student in the context of obtaining a high-quality geographical education.

  10. The Oklahoma Geographic Information Retrieval System

    Science.gov (United States)

    Blanchard, W. A.

    1982-01-01

    The Oklahoma Geographic Information Retrieval System (OGIRS) is a highly interactive data entry, storage, manipulation, and display software system for use with geographically referenced data. Although originally developed for a project concerned with coal strip mine reclamation, OGIRS is capable of handling any geographically referenced data for a variety of natural resource management applications. A special effort has been made to integrate remotely sensed data into the information system. The timeliness and synoptic coverage of satellite data are particularly useful attributes for inclusion into the geographic information system.

  11. Geographic Distribution of VA Expenditures Report (GDX)

    Data.gov (United States)

    Department of Veterans Affairs — Geographic Distribution of VA Expenditures Report (GDX) located on the Expenditures page in the Expenditure Tables category. This report details VA expenditures at...

  12. Research Data Management Training for Geographers: First Impressions

    Directory of Open Access Journals (Sweden)

    Kerstin Helbig

    2016-03-01

    Full Text Available Sharing and secondary analysis of data have become increasingly important for research. Especially in geography, the collection of digital data has grown due to technological changes. Responsible handling and proper documentation of research data have therefore become essential for funders, publishers and higher education institutions. To achieve this goal, universities offer support and training in research data management. This article presents the experiences of a pilot workshop in research data management, especially for geographers. A discipline-specific approach to research data management training is recommended. The focus of this approach increases researchers’ interest and allows for more specific guidance. The instructors identified problems and challenges of research data management for geographers. In regards to training, the communication of benefits and reaching the target groups seem to be the biggest challenges. Consequently, better incentive structures as well as communication channels have to be established.

  13. A comparison of model-based imputation methods for handling missing predictor values in a linear regression model: A simulation study

    Science.gov (United States)

    Hasan, Haliza; Ahmad, Sanizah; Osman, Balkish Mohd; Sapri, Shamsiah; Othman, Nadirah

    2017-08-01

    In regression analysis, missing covariate data is a common problem. Many researchers use ad hoc methods to overcome it because of their ease of implementation; however, these methods require assumptions about the data that rarely hold in practice. Model-based methods such as Maximum Likelihood (ML) using the expectation-maximization (EM) algorithm and Multiple Imputation (MI) are more promising for dealing with the difficulties caused by missing data. Even so, inappropriate imputation of missing values can lead to serious bias that severely affects the parameter estimates. The main objective of this study is to provide a better understanding of missing-data concepts that can help researchers select appropriate imputation methods. A simulation study was performed to assess the effects of different missing-data techniques on the performance of a regression model. The covariate data were generated from an underlying multivariate normal distribution and the dependent variable was generated as a combination of the explanatory variables. Missing values in a covariate were simulated under a missing at random (MAR) mechanism, with four levels of missingness (10%, 20%, 30% and 40%) imposed. The ML and MI techniques available within SAS software were investigated. A linear regression model was fitted and the performance measures MSE and R-squared were obtained. The results showed that MI is superior in handling missing data, with the highest R-squared and lowest MSE, when the percentage of missingness is below 30%. Neither method handled missingness levels above 30% well.
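
    As a rough, self-contained analogue of this kind of simulation (written in Python rather than SAS, with mean imputation and scikit-learn's IterativeImputer standing in for the ML and MI procedures compared in the paper, and with an invented missing-at-random mechanism), one might run something like the following.

        import numpy as np
        from sklearn.experimental import enable_iterative_imputer  # noqa: F401
        from sklearn.impute import SimpleImputer, IterativeImputer
        from sklearn.linear_model import LinearRegression
        from sklearn.metrics import mean_squared_error, r2_score

        rng = np.random.default_rng(42)
        n = 1000

        # Multivariate-normal covariates and an outcome built from them, as in the abstract.
        X = rng.multivariate_normal([0, 0, 0], [[1, .5, .3], [.5, 1, .4], [.3, .4, 1]], size=n)
        y = 1.0 + 2.0 * X[:, 0] - 1.5 * X[:, 1] + 0.5 * X[:, 2] + rng.normal(0, 1, n)

        def make_mar(X, y, rate, rng):
            """Set X[:, 0] missing with a probability that depends on the observed outcome (MAR)."""
            X_miss = X.copy()
            prob = rate * 2 * (y > np.median(y))
            X_miss[rng.random(len(y)) < prob, 0] = np.nan
            return X_miss

        for rate in (0.1, 0.2, 0.3, 0.4):
            X_miss = make_mar(X, y, rate, rng)
            for name, imputer in (("mean", SimpleImputer(strategy="mean")),
                                  ("iterative", IterativeImputer(random_state=0))):
                X_imp = imputer.fit_transform(X_miss)
                pred = LinearRegression().fit(X_imp, y).predict(X_imp)
                print(f"rate={rate:.0%} {name:9s} "
                      f"MSE={mean_squared_error(y, pred):.3f} R2={r2_score(y, pred):.3f}")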

  14. A Missing Data Imputation Method Based on Neighbor Rules

    Institute of Scientific and Technical Information of China (English)

    王凤梅; 胡丽霞

    2012-01-01

    Data missing is a common problem in data mining and data analysis; directly deleting cases that contain missing values can lead to unreliable decision-making. An imputation method for missing data based on association (neighbor) rules is therefore proposed. In this method, rules are grouped by their consequent item; the similarity between the items of each candidate rule and the items of the incomplete case is then computed, and the missing value is filled with the value from the most similar rule. Experimental results show that this method achieves higher imputation accuracy.
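
    A highly simplified sketch of this idea is given below; the rule representation, the overlap-based similarity measure and the example attributes are guesses made for illustration, not the algorithm from the paper.

        def impute_with_rules(record, rules, target):
            """Fill record[target] using the rule whose antecedent best matches the record.

            rules: list of (antecedent_dict, consequent_value) pairs whose consequent
                   refers to the attribute being imputed. Similarity is the fraction of
                   antecedent attribute-value pairs the incomplete record already satisfies.
            """
            best_value, best_similarity = None, -1.0
            for antecedent, value in rules:
                matches = sum(1 for attr, val in antecedent.items() if record.get(attr) == val)
                similarity = matches / len(antecedent)
                if similarity > best_similarity:
                    best_value, best_similarity = value, similarity
            completed = dict(record)
            completed[target] = best_value
            return completed

        # Rules mined elsewhere, all with "income_band" as the consequent.
        rules = [
            ({"education": "university", "region": "urban"}, "high"),
            ({"education": "secondary", "region": "rural"}, "low"),
            ({"education": "secondary", "region": "urban"}, "medium"),
        ]

        incomplete = {"education": "secondary", "region": "urban", "income_band": None}
        print(impute_with_rules(incomplete, rules, target="income_band"))
        # {'education': 'secondary', 'region': 'urban', 'income_band': 'medium'}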

  15. Rule-guided human classification of Volunteered Geographic Information

    Science.gov (United States)

    Ali, Ahmed Loai; Falomir, Zoe; Schmid, Falko; Freksa, Christian

    2017-05-01

    During the last decade, web technologies and location sensing devices have evolved generating a form of crowdsourcing known as Volunteered Geographic Information (VGI). VGI acted as a platform of spatial data collection, in particular, when a group of public participants are involved in collaborative mapping activities: they work together to collect, share, and use information about geographic features. VGI exploits participants' local knowledge to produce rich data sources. However, the resulting data inherits problematic data classification. In VGI projects, the challenges of data classification are due to the following: (i) data is likely prone to subjective classification, (ii) remote contributions and flexible contribution mechanisms in most projects, and (iii) the uncertainty of spatial data and non-strict definitions of geographic features. These factors lead to various forms of problematic classification: inconsistent, incomplete, and imprecise data classification. This research addresses classification appropriateness. Whether the classification of an entity is appropriate or inappropriate is related to quantitative and/or qualitative observations. Small differences between observations may be not recognizable particularly for non-expert participants. Hence, in this paper, the problem is tackled by developing a rule-guided classification approach. This approach exploits data mining techniques of Association Classification (AC) to extract descriptive (qualitative) rules of specific geographic features. The rules are extracted based on the investigation of qualitative topological relations between target features and their context. Afterwards, the extracted rules are used to develop a recommendation system able to guide participants to the most appropriate classification. The approach proposes two scenarios to guide participants towards enhancing the quality of data classification. An empirical study is conducted to investigate the classification of grass

  16. Testing for localization using micro-geographic data

    OpenAIRE

    Duranton, Gilles; Overman, Henry G.

    2005-01-01

    To study the detailed location patterns of industries, and particularly the tendency for industries to cluster relative to overall manufacturing, we develop distance-based tests of localization. In contrast to previous studies, our approach allows us to assess the statistical significance of departures from randomness. In addition, we treat space as continuous instead of using an arbitrary collection of geographical units. This avoids problems relating to scale and borders. We apply these tes...

  17. Geographic Determinants of Chinese Urbanization

    Science.gov (United States)

    Mccord, G. C.; Christensen, P.

    2011-12-01

    In the first years of the 21st century, the human race became primarily urban for the first time in history. With countries like India and China rapidly undergoing structural change from rural agricultural-based economies to urbanized manufacturing- and service-based economies, knowing where the coming waves of urbanization will occur would be of interest for infrastructure planning and for modeling consequences for ecological systems. We employ spatial econometric methods (geographically weighted regression, spatial lag models, and spatial errors models) to estimate two determinants of urbanization in China. The first is the role of physical geography, measured as topography-adjusted distance to major ports and suitability of land for agriculture. The second is the spatial agglomeration effect, which we estimate with a spatial lag model. We find that Chinese urbanization between 1990 and 2000 exhibited important spatial agglomeration effects, as well as significant explanatory power of nearby agricultural suitability and distance to ports, both in a nationwide model and in a model of local regression estimates. These results can help predict the location of new Chinese urbanization, and imply that climate change-induced changes in agricultural potential can affect the spatial distribution of urban areas.

  18. The population genomics of archaeological transition in west Iberia: Investigation of ancient substructure using imputation and haplotype-based methods.

    Science.gov (United States)

    Martiniano, Rui; Cassidy, Lara M; Ó'Maoldúin, Ros; McLaughlin, Russell; Silva, Nuno M; Manco, Licinio; Fidalgo, Daniel; Pereira, Tania; Coelho, Maria J; Serra, Miguel; Burger, Joachim; Parreira, Rui; Moran, Elena; Valera, Antonio C; Porfirio, Eduardo; Boaventura, Rui; Silva, Ana M; Bradley, Daniel G

    2017-07-01

    We analyse new genomic data (0.05-2.95x) from 14 ancient individuals from Portugal distributed from the Middle Neolithic (4200-3500 BC) to the Middle Bronze Age (1740-1430 BC) and impute genomewide diploid genotypes in these together with published ancient Eurasians. While discontinuity is evident in the transition to agriculture across the region, sensitive haplotype-based analyses suggest a significant degree of local hunter-gatherer contribution to later Iberian Neolithic populations. A more subtle genetic influx is also apparent in the Bronze Age, detectable from analyses including haplotype sharing with both ancient and modern genomes, D-statistics and Y-chromosome lineages. However, the limited nature of this introgression contrasts with the major Steppe migration turnovers within third Millennium northern Europe and echoes the survival of non-Indo-European language in Iberia. Changes in genomic estimates of individual height across Europe are also associated with these major cultural transitions, and ancestral components continue to correlate with modern differences in stature.

  19. Genome-wide association study with 1000 genomes imputation identifies signals for nine sex hormone-related phenotypes.

    Science.gov (United States)

    Ruth, Katherine S; Campbell, Purdey J; Chew, Shelby; Lim, Ee Mun; Hadlow, Narelle; Stuckey, Bronwyn G A; Brown, Suzanne J; Feenstra, Bjarke; Joseph, John; Surdulescu, Gabriela L; Zheng, Hou Feng; Richards, J Brent; Murray, Anna; Spector, Tim D; Wilson, Scott G; Perry, John R B

    2016-02-01

    Genetic factors contribute strongly to sex hormone levels, yet knowledge of the regulatory mechanisms remains incomplete. Genome-wide association studies (GWAS) have identified only a small number of loci associated with sex hormone levels, with several reproductive hormones yet to be assessed. The aim of the study was to identify novel genetic variants contributing to the regulation of sex hormones. We performed GWAS using genotypes imputed from the 1000 Genomes reference panel. The study used genotype and phenotype data from a UK twin register. We included 2913 individuals (up to 294 males) from the Twins UK study, excluding individuals receiving hormone treatment. Phenotypes were standardised for age, sex, BMI, stage of menstrual cycle and menopausal status. We tested 7,879,351 autosomal SNPs for association with levels of dehydroepiandrosterone sulphate (DHEAS), oestradiol, free androgen index (FAI), follicle-stimulating hormone (FSH), luteinizing hormone (LH), prolactin, progesterone, sex hormone-binding globulin and testosterone. Eight independent genetic variants reached genome-wide significance (P < 5 × 10-8), providing new insights into sex hormone regulation.

  20. Conceptual Model of Dynamic Geographic Environment

    Directory of Open Access Journals (Sweden)

    Martínez-Rosales Miguel Alejandro

    2014-04-01

    Full Text Available In geographic environments there are many different types of geographic entities, such as automobiles, trees, persons, buildings, storms and hurricanes. These entities can be classified into two groups: geographic objects and geographic phenomena. By its nature, a geographic environment is dynamic, so static modeling of it is not sufficient. To account for this dynamics, a new type of geographic entity called an event is introduced. The primary target is to model the geographic environment as an event sequence, because in this case the semantic relations are much richer than in static modeling. In this work, the conceptualization of this model is proposed. It is based on the idea of processing each entity separately instead of processing the environment as a whole. The so-called history of each entity and its spatial relations to other entities are then defined to describe the whole environment. The main goal is to model, at a conceptual level, systems that make use of spatial and temporal information, so that the model can later serve as the semantic engine for such systems.