WorldWideScience

Sample records for biomedical text mining

  1. Text mining patents for biomedical knowledge.

    Science.gov (United States)

    Rodriguez-Esteban, Raul; Bundschus, Markus

    2016-06-01

    Biomedical text mining of scientific knowledge bases, such as Medline, has received much attention in recent years. Given that text mining is able to automatically extract biomedical facts that revolve around entities such as genes, proteins, and drugs, from unstructured text sources, it is seen as a major enabler to foster biomedical research and drug discovery. In contrast to the biomedical literature, research into the mining of biomedical patents has not reached the same level of maturity. Here, we review existing work and highlight the associated technical challenges that emerge from automatically extracting facts from patents. We conclude by outlining potential future directions in this domain that could help drive biomedical research and drug discovery.

  2. CONAN: Text Mining in the Biomedical Domain

    NARCIS (Netherlands)

    Malik, R.

    2006-01-01

    This thesis is about text mining: extracting important information from literature. In recent years, the number of biomedical articles and journals has been growing exponentially. Scientists might not find the information they want because of the large number of publications. Therefore a system was cons

  3. Biomedical text mining and its applications in cancer research.

    Science.gov (United States)

    Zhu, Fei; Patumcharoenpol, Preecha; Zhang, Cheng; Yang, Yang; Chan, Jonathan; Meechai, Asawin; Vongsangnak, Wanwipa; Shen, Bairong

    2013-04-01

    Cancer is a malignant disease that has caused millions of human deaths. Its study has a long history of well over 100 years. There have been an enormous number of publications on cancer research. This integrated but unstructured biomedical text is of great value for cancer diagnostics, treatment, and prevention. The immense body and rapid growth of biomedical text on cancer have led to the appearance of a large number of text mining techniques aimed at extracting novel knowledge from scientific text. Biomedical text mining on cancer research is computationally automatic and high-throughput in nature. However, it is error-prone due to the complexity of natural language processing. In this review, we introduce the basic concepts underlying text mining and examine some frequently used algorithms, tools, and data sets, as well as assessing how widely these algorithms have been utilized. We then discuss the current state-of-the-art text mining applications in cancer research and also provide some resources for cancer text mining. With the development of systems biology, researchers tend to understand complex biomedical systems from a systems biology viewpoint. Thus, the full utilization of text mining to facilitate cancer systems biology research is fast becoming a major concern. To address this issue, we describe the general workflow of text mining in cancer systems biology and each phase of the workflow. We hope that this review can (i) provide a useful overview of the current work in this field; (ii) help researchers to choose text mining tools and datasets; and (iii) highlight how to apply text mining to assist cancer systems biology research.

  4. Application of text mining in the biomedical domain.

    Science.gov (United States)

    Fleuren, Wilco W M; Alkema, Wynand

    2015-03-01

    In recent years the amount of experimental data produced in biomedical research and the number of papers published in this field have grown rapidly. In order to keep up to date with developments in their field of interest and to interpret the outcome of experiments in light of all available literature, researchers turn more and more to automated literature mining. As a consequence, text mining tools have evolved considerably in number and quality, and nowadays can be used to address a variety of research questions ranging from de novo drug target discovery to enhanced biological interpretation of the results from high-throughput experiments. In this paper we introduce the most important techniques used for text mining and give an overview of the text mining tools that are currently being used and the types of problems they are typically applied to.

  5. Recent Advances and Emerging Applications in Text and Data Mining for Biomedical Discovery.

    Science.gov (United States)

    Gonzalez, Graciela H; Tahsin, Tasnia; Goodale, Britton C; Greene, Anna C; Greene, Casey S

    2016-01-01

    Precision medicine will revolutionize the way we treat and prevent disease. A major barrier to the implementation of precision medicine that clinicians and translational scientists face is understanding the underlying mechanisms of disease. We are starting to address this challenge through automatic approaches for information extraction, representation and analysis. Recent advances in text and data mining have been applied to a broad spectrum of key biomedical questions in genomics, pharmacogenomics and other fields. We present an overview of the fundamental methods for text and data mining, as well as recent advances and emerging applications toward precision medicine.

  6. Knowledge acquisition, semantic text mining, and security risks in health and biomedical informatics.

    Science.gov (United States)

    Huang, Jingshan; Dou, Dejing; Dang, Jiangbo; Pardue, J Harold; Qin, Xiao; Huan, Jun; Gerthoffer, William T; Tan, Ming

    2012-02-26

    Computational techniques have been adopted in medical and biological systems for a long time. There is no doubt that the development and application of computational methods will greatly help in better understanding biomedical and biological functions. Large amounts of data have been produced by biomedical and biological experiments and simulations. In order for researchers to gain knowledge from original data, nontrivial transformation is necessary, which is regarded as a critical link in the chain of knowledge acquisition, sharing, and reuse. Challenges that have been encountered include: how to efficiently and effectively represent human knowledge in formal computing models, how to take advantage of semantic text mining techniques rather than traditional syntactic text mining, and how to handle security issues during knowledge sharing and reuse. This paper summarizes the state of the art in these research directions. We aim to provide readers with an introduction to the major computing themes to be applied to medical and biological research.

  7. An Unsupervised Graph Based Continuous Word Representation Method for Biomedical Text Mining.

    Science.gov (United States)

    Jiang, Zhenchao; Li, Lishuang; Huang, Degen

    2016-01-01

    In biomedical text mining tasks, distributed word representations have succeeded in capturing semantic regularities, but most of them are shallow-window-based models, which are not sufficient for expressing the meaning of words. To represent words using deeper information, we make explicit the semantic regularities that emerge in word relations, including dependency relations and context relations, and propose a novel architecture for computing continuous vector representations by leveraging those relations. The performance of our model is measured on a word analogy task and a Protein-Protein Interaction Extraction (PPIE) task. Experimental results show that our method performs better overall than other word representation models on the word analogy task and has many advantages for biomedical text mining.

  8. Knowledge acquisition, semantic text mining, and security risks in health and biomedical informatics

    Institute of Scientific and Technical Information of China (English)

    Pardue, J Harold; Gerthoffer, William T

    2012-01-01

    Computational techniques have been adopted in medical and biological systems for a long time. There is no doubt that the development and application of computational methods will greatly help in better understanding biomedical and biological functions. Large amounts of data have been produced by biomedical and biological experiments and simulations. In order for researchers to gain knowledge from original data, nontrivial transformation is necessary, which is regarded as a critical link in the chain of knowledge acquisition, sharing, and reuse. Challenges that have been encountered include: how to efficiently and effectively represent human knowledge in formal computing models, how to take advantage of semantic text mining techniques rather than traditional syntactic text mining, and how to handle security issues during knowledge sharing and reuse. This paper summarizes the state of the art in these research directions. We aim to provide readers with an introduction to the major computing themes to be applied to medical and biological research.

  9. Knowledge based word-concept model estimation and refinement for biomedical text mining.

    Science.gov (United States)

    Jimeno Yepes, Antonio; Berlanga, Rafael

    2015-02-01

    Text mining of scientific literature has been essential for setting up large public biomedical databases, which are widely used by the research community. In the biomedical domain, the existence of a large number of terminological resources and knowledge bases (KBs) has enabled a myriad of machine learning methods for different text mining related tasks. Unfortunately, KBs have been devised not for text mining tasks but for human interpretation, so the performance of KB-based methods is usually lower than that of supervised machine learning methods. The disadvantage of supervised methods, though, is that they require labeled training data and are therefore not useful for large-scale biomedical text mining systems. KB-based methods do not have this limitation. In this paper, we describe a novel method to generate word-concept probabilities from a KB, which can serve as a basis for several text mining tasks. This method takes into account not only the underlying patterns within the descriptions contained in the KB but also those in texts available from large unlabeled corpora such as MEDLINE. The parameters of the model have been estimated without training data. Patterns from MEDLINE have been built using MetaMap for entity recognition and related using co-occurrences. The word-concept probabilities were evaluated on the task of word sense disambiguation (WSD). The results showed that our method obtained a higher degree of accuracy than other state-of-the-art approaches when evaluated on the MSH WSD data set. We also evaluated our method on the task of document ranking using MEDLINE citations. These results also showed an increase in performance over existing baseline retrieval approaches.
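
    The word-concept probabilities described above can be illustrated with a toy model. The sketch below estimates smoothed P(word | concept) distributions from short concept descriptions and uses them for word sense disambiguation; the concept names, descriptions, and smoothing scheme are invented for illustration and are far simpler than the paper's KB- and MEDLINE-derived estimates.

```python
import math
from collections import Counter

# Toy concept profiles standing in for KB descriptions (e.g., UMLS entries);
# the paper estimates these probabilities from the KB plus MEDLINE patterns.
concept_texts = {
    "cold_temperature": "low temperature cold weather freezing chill",
    "common_cold": "cold virus infection cough fever rhinovirus",
}

def word_concept_probs(texts, smoothing=1.0):
    """Estimate P(word | concept) with additive smoothing over each concept's text."""
    vocab = {w for t in texts.values() for w in t.split()}
    probs = {}
    for concept, text in texts.items():
        counts = Counter(text.split())
        total = sum(counts.values()) + smoothing * len(vocab)
        probs[concept] = {w: (counts[w] + smoothing) / total for w in vocab}
    return probs

def disambiguate(context_words, probs):
    """Pick the concept maximizing the log-probability of the ambiguous word's context."""
    def score(concept):
        return sum(math.log(probs[concept][w]) for w in context_words if w in probs[concept])
    return max(probs, key=score)

probs = word_concept_probs(concept_texts)
print(disambiguate(["virus", "cough"], probs))  # common_cold
```

    The same probability tables could in principle feed other tasks the paper mentions, such as ranking documents by how well their words fit a query concept.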

  10. A text mining approach to detect mentions of protein glycosylation in biomedical text.

    Science.gov (United States)

    Shukla, Daksha; Jayaraman, Valadi K

    2012-01-01

    Protein glycosylation is an important post-translational event that plays a pivotal role in protein folding and protein trafficking. We describe a dictionary-based and a rule-based approach to mine 'mentions' of protein glycosylation in text. The dictionary-based approach relies on a set of manually curated dictionaries specially constructed to address this task. Abstracts are screened for 'mentions' of words from these dictionaries, which are then scored and classified on the basis of a threshold. The rule-based approach also relies on the words in the dictionary to arrive at the features used for classification. The performance of the system using both approaches has been evaluated on a manually curated corpus of 3133 abstracts. The evaluation suggests that the performance of the rule-based approach surpasses that of the dictionary-based approach.
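
    As a rough illustration of the dictionary-based approach, the sketch below screens an abstract against a tiny weighted term dictionary and classifies it by a score threshold. The terms, weights, and threshold are invented; the paper's curated dictionaries and scoring are far more elaborate.

```python
import re

# Hypothetical mini-dictionary; the paper's curated dictionaries are much larger.
GLYCO_TERMS = {"glycosylation": 2.0, "n-linked": 1.5, "o-linked": 1.5,
               "glycan": 1.0, "glycoprotein": 1.0}

def score_abstract(text, terms=GLYCO_TERMS):
    """Sum dictionary-term weights over every token hit in the abstract."""
    tokens = re.findall(r"[a-z0-9-]+", text.lower())
    return sum(terms.get(tok, 0.0) for tok in tokens)

def mentions_glycosylation(text, threshold=2.0):
    """Classify an abstract as a glycosylation 'mention' if its score clears the threshold."""
    return score_abstract(text) >= threshold

abstract = "N-linked glycosylation of the glycoprotein alters folding."
print(mentions_glycosylation(abstract))  # True
```

    The rule-based variant described in the abstract would instead turn such dictionary hits into features for a trained classifier rather than thresholding a raw score.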

  11. Text Mining.

    Science.gov (United States)

    Trybula, Walter J.

    1999-01-01

    Reviews the state of research in text mining, focusing on newer developments. The intent is to describe the disparate investigations currently included under the term text mining and provide a cohesive structure for these efforts. A summary of research identifies key organizations responsible for pushing the development of text mining. A section…

  12. Community challenges in biomedical text mining over 10 years: success, failure and the future.

    Science.gov (United States)

    Huang, Chung-Chi; Lu, Zhiyong

    2016-01-01

    One effective way to improve the state of the art is through competitions. Following the success of the Critical Assessment of protein Structure Prediction (CASP) in bioinformatics research, a number of challenge evaluations have been organized by the text-mining research community to assess and advance natural language processing (NLP) research for biomedicine. In this article, we review the different community challenge evaluations held from 2002 to 2014 and their respective tasks. Furthermore, we examine these challenge tasks through their targeted problems in NLP research and biomedical applications, respectively. Next, we describe the general workflow of organizing a Biomedical NLP (BioNLP) challenge and involved stakeholders (task organizers, task data producers, task participants and end users). Finally, we summarize the impact and contributions by taking into account different BioNLP challenges as a whole, followed by a discussion of their limitations and difficulties. We conclude with future trends in BioNLP challenge evaluations.

  13. The BioLexicon: a large-scale terminological resource for biomedical text mining

    Directory of Open Access Journals (Sweden)

    Thompson Paul

    2011-10-01

    Background: Due to the rapidly expanding body of biomedical literature, biologists require increasingly sophisticated and efficient systems to help them to search for relevant information. Such systems should account for the multiple written variants used to represent biomedical concepts, and allow the user to search for specific pieces of knowledge (or events) involving these concepts, e.g., protein-protein interactions. Such functionality requires access to detailed information about words used in the biomedical literature. Existing databases and ontologies often have a specific focus and are oriented towards human use. Consequently, biological knowledge is dispersed amongst many resources, which often do not attempt to account for the large and frequently changing set of variants that appear in the literature. Additionally, such resources typically do not provide information about how terms relate to each other in texts to describe events. Results: This article provides an overview of the design, construction and evaluation of a large-scale lexical and conceptual resource for the biomedical domain, the BioLexicon. The resource can be exploited by text mining tools at several levels, e.g., part-of-speech tagging, recognition of biomedical entities, and the extraction of events in which they are involved. As such, the BioLexicon must account for real usage of words in biomedical texts. In particular, the BioLexicon gathers together different types of terms from several existing data resources into a single, unified repository, and augments them with new term variants automatically extracted from biomedical literature. Extraction of events is facilitated through the inclusion of biologically pertinent verbs (around which events are typically organized), together with information about typical patterns of grammatical and semantic behaviour, which are acquired from domain-specific texts.
In order to foster interoperability, the BioLexicon is
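
    The kind of variant normalization a lexical resource like the BioLexicon supports can be sketched as a simple lookup from written variants to a canonical entry. The variants and concept IDs below are invented for illustration; the real resource also carries part-of-speech, verb-pattern, and semantic information.

```python
# Hypothetical mini-lexicon: several surface variants map to one canonical ID.
LEXICON = {
    "il-2": "CONCEPT:0001", "interleukin-2": "CONCEPT:0001",
    "interleukin 2": "CONCEPT:0001",
    "p53": "CONCEPT:0002", "tp53": "CONCEPT:0002",
}

def normalize(term):
    """Resolve a written variant to its canonical concept ID, if known."""
    return LEXICON.get(term.lower())

print(normalize("Interleukin-2"))  # CONCEPT:0001
```

    A text mining pipeline would apply such a lookup after entity recognition, so that differently written mentions of the same concept are counted together.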

  14. An unsupervised text mining method for relation extraction from biomedical literature.

    Directory of Open Access Journals (Sweden)

    Changqin Quan

    The wealth of interaction information provided in biomedical articles motivated the implementation of text mining approaches to automatically extract biomedical relations. This paper presents an unsupervised method based on pattern clustering and sentence parsing to deal with biomedical relation extraction. The pattern clustering algorithm is based on the Polynomial Kernel method, which identifies interaction words from unlabeled data; these interaction words are then used in relation extraction between entity pairs. Dependency parsing and phrase structure parsing are combined for relation extraction. Based on the semi-supervised KNN algorithm, we extend the proposed unsupervised approach to a semi-supervised approach by combining pattern clustering, dependency parsing and phrase structure parsing rules. We evaluated the approaches on two different tasks: (1) protein-protein interaction extraction, and (2) gene-suicide association extraction. The evaluation of task (1) on the benchmark dataset (AImed corpus) showed that our proposed unsupervised approach outperformed three supervised methods, which are rule-based, SVM-based, and kernel-based, respectively. The proposed semi-supervised approach is superior to the existing semi-supervised methods. The evaluation of gene-suicide association extraction on a smaller dataset from the Genetic Association Database and a larger dataset from publicly available PubMed showed that the proposed unsupervised and semi-supervised methods achieved much higher F-scores than the co-occurrence-based method.

  15. An unsupervised text mining method for relation extraction from biomedical literature.

    Science.gov (United States)

    Quan, Changqin; Wang, Meng; Ren, Fuji

    2014-01-01

    The wealth of interaction information provided in biomedical articles motivated the implementation of text mining approaches to automatically extract biomedical relations. This paper presents an unsupervised method based on pattern clustering and sentence parsing to deal with biomedical relation extraction. The pattern clustering algorithm is based on the Polynomial Kernel method, which identifies interaction words from unlabeled data; these interaction words are then used in relation extraction between entity pairs. Dependency parsing and phrase structure parsing are combined for relation extraction. Based on the semi-supervised KNN algorithm, we extend the proposed unsupervised approach to a semi-supervised approach by combining pattern clustering, dependency parsing and phrase structure parsing rules. We evaluated the approaches on two different tasks: (1) protein-protein interaction extraction, and (2) gene-suicide association extraction. The evaluation of task (1) on the benchmark dataset (AImed corpus) showed that our proposed unsupervised approach outperformed three supervised methods, which are rule-based, SVM-based, and kernel-based, respectively. The proposed semi-supervised approach is superior to the existing semi-supervised methods. The evaluation of gene-suicide association extraction on a smaller dataset from the Genetic Association Database and a larger dataset from publicly available PubMed showed that the proposed unsupervised and semi-supervised methods achieved much higher F-scores than the co-occurrence-based method.
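
    The co-occurrence baseline this paper compares against can be sketched in a few lines: two entities are predicted to be related whenever they appear in the same sentence. The entity list and example text below are illustrative, not taken from the AImed corpus.

```python
import re
from collections import Counter
from itertools import combinations

# Hypothetical entity dictionary; a real system would use NER output instead.
ENTITIES = {"BRCA1", "TP53", "MDM2"}

def cooccurrence_pairs(text):
    """Count, per sentence, every unordered pair of known entities that co-occurs."""
    pairs = Counter()
    for sentence in re.split(r"[.!?]", text):
        found = sorted(e for e in ENTITIES if e in sentence)
        for a, b in combinations(found, 2):
            pairs[(a, b)] += 1
    return pairs

text = "BRCA1 binds TP53 in vitro. MDM2 regulates TP53. BRCA1 was purified."
print(cooccurrence_pairs(text))
```

    Because the baseline fires on any sentence-level co-occurrence, it over-predicts relations; the parsing- and clustering-based methods in the paper aim to recover precision that this heuristic gives up.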

  16. Mining text data

    CERN Document Server

    Aggarwal, Charu C

    2012-01-01

    Text mining applications have experienced tremendous advances because of Web 2.0 and social networking applications. Recent advances in hardware and software technology have led to a number of unique scenarios where text mining algorithms are learned. "Mining Text Data" introduces an important niche in the text analytics field, and is an edited volume contributed by leading international researchers and practitioners focused on social networks and data mining. This book covers a wide swath of topics across social networks and data mining. Each chapter contains a comprehensive survey including

  17. Text mining for systems biology.

    Science.gov (United States)

    Fluck, Juliane; Hofmann-Apitius, Martin

    2014-02-01

    Scientific communication in biomedicine is, by and large, still text based. Text mining technologies for the automated extraction of useful biomedical information from unstructured text that can be directly used for systems biology modelling have been substantially improved over the past few years. In this review, we underline the importance of named entity recognition and relationship extraction as fundamental approaches that are relevant to systems biology. Furthermore, we emphasize the role of publicly organized scientific benchmarking challenges that reflect the current status of text-mining technology and are important in moving the entire field forward. Given further interdisciplinary development of systems biology-orientated ontologies and training corpora, we expect a steadily increasing impact of text-mining technology on systems biology in the future.

  18. Integrating image data into biomedical text categorization.

    Science.gov (United States)

    Shatkay, Hagit; Chen, Nawei; Blostein, Dorothea

    2006-07-15

    Categorization of biomedical articles is a central task for supporting various curation efforts. It can also form the basis for effective biomedical text mining. Automatic text classification in the biomedical domain is thus an active research area. Contests organized by the KDD Cup (2002) and the TREC Genomics track (since 2003) defined several annotation tasks that involved document classification, and provided training and test data sets. So far, these efforts focused on analyzing only the text content of documents. However, as was noted in the KDD'02 text mining contest, where figure captions proved to be an invaluable feature for identifying documents of interest, images often provide curators with critical information. We examine the possibility of using information derived directly from image data, and of integrating it with text-based classification, for biomedical document categorization. We present a method for obtaining features from images and for using them, both alone and in combination with text, to perform the triage task introduced in the TREC Genomics track 2004. The task was to determine which documents are relevant to a given annotation task performed by the Mouse Genome Database curators. We show preliminary results, demonstrating that the method has a strong potential to enhance and complement traditional text-based categorization methods.

  19. Hotspots in text mining of the biomedical field

    Institute of Scientific and Technical Information of China (English)

    史航; 高雯珺; 崔雷

    2016-01-01

    High-frequency subject terms were extracted from PubMed-covered papers on text mining in the biomedical field published from January 2000 to March 2015, to generate a matrix of high-frequency subject terms and their source papers. The co-occurrence of high-frequency subject terms in the same paper was examined by clustering analysis. The hotspots in text mining of the biomedical field were identified from the clustering of high-frequency subject terms and their corresponding class labels. The analysis showed that the current hotspots are the basic technologies of text mining, the application of text mining in biomedical informatics, and the application of text mining in the extraction of drug-related facts.

  20. Contextual Text Mining

    Science.gov (United States)

    Mei, Qiaozhu

    2009-01-01

    With the dramatic growth of text information, there is an increasing need for powerful text mining systems that can automatically discover useful knowledge from text. Text is generally associated with all kinds of contextual information. Those contexts can be explicit, such as the time and the location where a blog article is written, and the…

  1. Mining Molecular Pharmacological Effects from Biomedical Text: a Case Study for Eliciting Anti-Obesity/Diabetes Effects of Chemical Compounds.

    Science.gov (United States)

    Dura, Elzbieta; Muresan, Sorel; Engkvist, Ola; Blomberg, Niklas; Chen, Hongming

    2014-05-01

    In the pharmaceutical industry, efficiently mining pharmacological data from the rapidly increasing scientific literature is crucial for many aspects of the drug discovery process, such as target validation and tool compound selection. A quick and reliable way is needed to collect literature assertions about selected compounds' biological and pharmacological effects in order to assist the hypothesis generation and decision-making of drug developers. INFUSIS, the text mining system presented here, extracts data on chemical compounds from PubMed abstracts. It involves an extensive use of customized natural language processing in addition to a co-occurrence analysis. As a proof-of-concept study, INFUSIS was used to search abstract texts for several obesity/diabetes-related pharmacological effects of the compounds included in a compound dictionary. The system extracts assertions regarding the pharmacological effects of each given compound and scores them by relevance. For each selected pharmacological effect, the highest-scoring assertions in 100 abstracts were manually evaluated, i.e., 800 abstracts in total. The overall accuracy for the inferred assertions was over 90 percent.
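
    The co-occurrence component of a system like INFUSIS can be sketched as follows: a sentence yields a candidate assertion when a dictionary compound and an effect phrase co-occur, scored here by character proximity. The compound and effect dictionaries and the scoring function are invented for illustration; the actual system's NLP pipeline and relevance scoring are considerably richer.

```python
import re

# Hypothetical dictionaries standing in for the system's compound and effect lists.
COMPOUNDS = {"metformin", "orlistat"}
EFFECTS = {"weight loss", "glucose lowering"}

def score_assertions(abstract):
    """Return (compound, effect, score) triples; score decays with character distance."""
    triples = []
    for sentence in re.split(r"(?<=[.!?])\s+", abstract.lower()):
        for compound in COMPOUNDS:
            if compound not in sentence:
                continue
            for effect in EFFECTS:
                if effect in sentence:
                    distance = abs(sentence.index(compound) - sentence.index(effect))
                    triples.append((compound, effect, round(1.0 / (1 + distance), 3)))
    return triples

text = "Metformin treatment produced glucose lowering in obese rats."
print(score_assertions(text))
```

    Ranking such triples by score and reviewing the top hits per effect mirrors, in miniature, the manual evaluation protocol the abstract describes.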

  2. Text Mining: (Asynchronous Sequences)

    Directory of Open Access Journals (Sweden)

    Sheema Khan

    2014-12-01

    In this paper we try to correlate text sequences that provide common topics for semantic clues. We propose a two-step method for asynchronous text mining. Step one checks for the common topics in the sequences and isolates these with their timestamps. Step two takes the topic and tries to give the timestamp of the text document. After multiple repetitions of step two, we can give an optimum result.

  3. A Customizable Text Classifier for Text Mining

    Directory of Open Access Journals (Sweden)

    Yun-liang Zhang

    2007-12-01

    Text mining deals with complex and unstructured texts. Usually a particular collection of texts specific to one or more domains is necessary. We have developed a customizable text classifier for users to mine such a collection automatically. It derives from the sentence category of the HNC theory and corresponding techniques. It can start with a few texts, and it can adjust automatically or be adjusted by the user. The user can also control the number of domains chosen and decide the standard with which to choose the texts, based on demand and abundance of materials. The performance of the classifier varies with the user's choice.

  4. Chapter 16: text mining for translational bioinformatics.

    Directory of Open Access Journals (Sweden)

    K Bretonnel Cohen

    2013-04-01

    Text mining for translational bioinformatics is a new field with tremendous research potential. It is a subfield of biomedical natural language processing that concerns itself directly with the problem of relating basic biomedical research to clinical practice, and vice versa. Applications of text mining fall both into the category of T1 translational research (translating basic science results into new interventions) and T2 translational research, or translational research for public health. Potential use cases include better phenotyping of research subjects, and pharmacogenomic research. A variety of methods for evaluating text mining applications exist, including corpora, structured test suites, and post hoc judging. Two basic principles of linguistic structure are relevant for building text mining applications. One is that linguistic structure consists of multiple levels. The other is that every level of linguistic structure is characterized by ambiguity. There are two basic approaches to text mining: rule-based, also known as knowledge-based; and machine-learning-based, also known as statistical. Many systems are hybrids of the two approaches. Shared tasks have had a strong effect on the direction of the field. Like all translational bioinformatics software, text mining software for translational bioinformatics can be considered health-critical and should be subject to the strictest standards of quality assurance and software testing.

  5. Text Mining the Biomedical Literature

    Science.gov (United States)

    2007-11-05


  6. Chapter 16: text mining for translational bioinformatics.

    Science.gov (United States)

    Cohen, K Bretonnel; Hunter, Lawrence E

    2013-04-01

    Text mining for translational bioinformatics is a new field with tremendous research potential. It is a subfield of biomedical natural language processing that concerns itself directly with the problem of relating basic biomedical research to clinical practice, and vice versa. Applications of text mining fall both into the category of T1 translational research (translating basic science results into new interventions) and T2 translational research, or translational research for public health. Potential use cases include better phenotyping of research subjects, and pharmacogenomic research. A variety of methods for evaluating text mining applications exist, including corpora, structured test suites, and post hoc judging. Two basic principles of linguistic structure are relevant for building text mining applications. One is that linguistic structure consists of multiple levels. The other is that every level of linguistic structure is characterized by ambiguity. There are two basic approaches to text mining: rule-based, also known as knowledge-based; and machine-learning-based, also known as statistical. Many systems are hybrids of the two approaches. Shared tasks have had a strong effect on the direction of the field. Like all translational bioinformatics software, text mining software for translational bioinformatics can be considered health-critical and should be subject to the strictest standards of quality assurance and software testing.

  7. Text Mining for Protein Docking.

    Directory of Open Access Journals (Sweden)

    Varsha D Badal

    2015-12-01

    The rapidly growing amount of publicly available information from biomedical research is readily accessible on the Internet, providing a powerful resource for predictive biomolecular modeling. The accumulated data on experimentally determined structures transformed structure prediction of proteins and protein complexes. Instead of exploring the enormous search space, predictive tools can simply proceed to the solution based on similarity to the existing, previously determined structures. A similar major paradigm shift is emerging due to the rapidly expanding amount of information, other than experimentally determined structures, which still can be used as constraints in biomolecular structure prediction. Automated text mining has been widely used in recreating protein interaction networks, as well as in detecting small ligand binding sites on protein structures. Combining and expanding these two well-developed areas of research, we applied text mining to structural modeling of protein-protein complexes (protein docking). Protein docking can be significantly improved when constraints on the docking mode are available. We developed a procedure that retrieves published abstracts on a specific protein-protein interaction and extracts information relevant to docking. The procedure was assessed on protein complexes from Dockground (http://dockground.compbio.ku.edu). The results show that correct information on binding residues can be extracted for about half of the complexes. The amount of irrelevant information was reduced by conceptual analysis of a subset of the retrieved abstracts, based on the bag-of-words (features) approach. Support Vector Machine models were trained and validated on the subset. The remaining abstracts were filtered by the best-performing models, which decreased the irrelevant information for ~25% of the complexes in the dataset.
The extracted constraints were incorporated in the docking protocol and tested on the Dockground unbound
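The filtering step described above — bag-of-words features feeding a trained classifier that separates docking-relevant abstracts from irrelevant ones — can be sketched in miniature. The sketch below substitutes a simple perceptron for the paper's Support Vector Machine, and the labelled "abstracts" are invented for illustration:

```python
# Toy relevance filter: bag-of-words features + perceptron classifier.
# A perceptron stands in for the SVM used in the paper; the data is invented.
docs = [
    ("mutation of residue arg45 abolishes binding to the partner protein", 1),
    ("the interface residues were mapped by alanine scanning mutagenesis", 1),
    ("the gene is broadly expressed in developing mouse tissue", 0),
    ("clinical outcomes were recorded over a five year follow up", 0),
]
vocab = sorted({w for text, _ in docs for w in text.lower().split()})

def featurize(text):
    tokens = text.lower().split()
    return [tokens.count(w) for w in vocab]  # bag-of-words counts

weights, bias = [0.0] * len(vocab), 0.0
for _ in range(20):  # perceptron training epochs
    for text, label in docs:
        x = featurize(text)
        pred = 1 if sum(w * xi for w, xi in zip(weights, x)) + bias > 0 else 0
        if pred != label:  # update weights only on mistakes
            weights = [w + (label - pred) * xi for w, xi in zip(weights, x)]
            bias += label - pred

def is_relevant(text):
    x = featurize(text)
    return sum(w * xi for w, xi in zip(weights, x)) + bias > 0

print(is_relevant("binding residue interactions at the interface"))  # → True
print(is_relevant("clinical follow up data were recorded"))          # → False
```

On real data, the feature vectors would be built from the retrieved abstracts and the labels would come from the manually annotated subset described in the abstract.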

  8. Research on Biomedical Text Mining Based on Knowledge Organization Systems (基于知识组织系统的生物医学文本挖掘研究)

    Institute of Scientific and Technical Information of China (English)

    钱庆

    2016-01-01

    With the rapid development of biomedical information technology, biomedical literature is growing exponentially. It has become extremely difficult to acquire and understand the required knowledge through manual reading alone, so how to integrate existing knowledge and mine new knowledge from huge amounts of biomedical literature has become a current research hotspot. Knowledge organization system construction in the biomedical field is more normative and complete than in other fields, which lays the foundation for biomedical text mining; a large number of text mining methods and systems based on knowledge organization systems have developed rapidly. This paper surveys the existing medical knowledge organization systems, summarizes the main process of biomedical text mining, reviews the main research and recent progress by mining task, and further analyzes the characteristics of biomedical text mining based on knowledge organization systems. The main roles that knowledge organization systems play in biomedical text mining and the challenges facing current research are summarized, so as to provide a reference for biomedical researchers.

  9. Text mining for the biocuration workflow.

    Science.gov (United States)

    Hirschman, Lynette; Burns, Gully A P C; Krallinger, Martin; Arighi, Cecilia; Cohen, K Bretonnel; Valencia, Alfonso; Wu, Cathy H; Chatr-Aryamontri, Andrew; Dowell, Karen G; Huala, Eva; Lourenço, Anália; Nash, Robert; Veuthey, Anne-Lise; Wiegers, Thomas; Winter, Andrew G

    2012-01-01

    Molecular biology has become heavily dependent on biological knowledge encoded in expert curated biological databases. As the volume of biological literature increases, biocurators need help in keeping up with the literature; (semi-) automated aids for biocuration would seem to be an ideal application for natural language processing and text mining. However, to date, there have been few documented successes for improving biocuration throughput using text mining. Our initial investigations took place for the workshop on 'Text Mining for the BioCuration Workflow' at the third International Biocuration Conference (Berlin, 2009). We interviewed biocurators to obtain workflows from eight biological databases. This initial study revealed high-level commonalities, including (i) selection of documents for curation; (ii) indexing of documents with biologically relevant entities (e.g. genes); and (iii) detailed curation of specific relations (e.g. interactions); however, the detailed workflows also showed many variabilities. Following the workshop, we conducted a survey of biocurators. The survey identified biocurator priorities, including the handling of full text indexed with biological entities and support for the identification and prioritization of documents for curation. It also indicated that two-thirds of the biocuration teams had experimented with text mining and almost half were using text mining at that time. Analysis of our interviews and survey provide a set of requirements for the integration of text mining into the biocuration workflow. These can guide the identification of common needs across curated databases and encourage joint experimentation involving biocurators, text mining developers and the larger biomedical research community.

  10. Text Mining Applications and Theory

    CERN Document Server

    Berry, Michael W

    2010-01-01

    Text Mining: Applications and Theory presents the state-of-the-art algorithms for text mining from both the academic and industrial perspectives.  The contributors span several countries and scientific domains: universities, industrial corporations, and government laboratories, and demonstrate the use of techniques from machine learning, knowledge discovery, natural language processing and information retrieval to design computational models for automated text analysis and mining. This volume demonstrates how advancements in the fields of applied mathematics, computer science, machine learning

  11. Typesafe Modeling in Text Mining

    CERN Document Server

    Steeg, Fabian

    2011-01-01

    Based on the concept of annotation-based agents, this report introduces tools and a formal notation for defining and running text mining experiments using a statically typed domain-specific language embedded in Scala. Using machine learning for classification as an example, the framework is used to develop and document text mining experiments, and to show how the concept of generic, typesafe annotation corresponds to a general information model that goes beyond text processing.

  12. Text mining: A Brief survey

    Directory of Open Access Journals (Sweden)

    Falguni N. Patel, Neha R. Soni

    2012-12-01

    Full Text Available The unstructured texts which contain massive amounts of information cannot simply be used for further processing by computers. Therefore, specific processing methods and algorithms are required in order to extract useful patterns. The process of extracting interesting information and knowledge from unstructured text is accomplished using text mining. In this paper, we discuss text mining as a recent and interesting field, detailing the steps involved in the overall process. We also discuss different technologies that teach computers natural language so that they may analyze, understand, and even generate text. In addition, we briefly discuss a number of successful applications of text mining in current use and planned for the future.

  13. Towards Effective Sentence Simplification for Automatic Processing of Biomedical Text

    CERN Document Server

    Jonnalagadda, Siddhartha; Hakenberg, Jorg; Baral, Chitta; Gonzalez, Graciela

    2010-01-01

    The complexity of sentences characteristic to biomedical articles poses a challenge to natural language parsers, which are typically trained on large-scale corpora of non-technical text. We propose a text simplification process, bioSimplify, that seeks to reduce the complexity of sentences in biomedical abstracts in order to improve the performance of syntactic parsers on the processed sentences. Syntactic parsing is typically one of the first steps in a text mining pipeline. Thus, any improvement in performance would have a ripple effect over all processing steps. We evaluated our method using a corpus of biomedical sentences annotated with syntactic links. Our empirical results show an improvement of 2.90% for the Charniak-McClosky parser and of 4.23% for the Link Grammar parser when processing simplified sentences rather than the original sentences in the corpus.

  14. Biomarker Identification Using Text Mining

    Directory of Open Access Journals (Sweden)

    Hui Li

    2012-01-01

    Full Text Available Identifying molecular biomarkers has become one of the important tasks for scientists to assess the different phenotypic states of cells or organisms correlated to the genotypes of diseases from large-scale biological data. In this paper, we propose a text-mining-based method to discover biomarkers from PubMed. First, we construct a database based on a dictionary, and then we use a finite state machine to identify the biomarkers. Our text mining method provides a highly reliable approach to discovering biomarkers in the PubMed database.
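The dictionary-plus-finite-state-machine idea can be illustrated with a token-level trie matcher: the trie plays the role of the state machine, and scanning a sentence advances through its states until the longest dictionary entry is consumed. The marker dictionary below is hypothetical, not the paper's:

```python
def build_trie(terms):
    """Build a token-level trie; '$end' marks a complete dictionary entry."""
    trie = {}
    for term in terms:
        node = trie
        for tok in term.lower().split():
            node = node.setdefault(tok, {})
        node["$end"] = term
    return trie

def find_biomarkers(text, trie):
    """Scan tokens left to right, emitting the longest dictionary match."""
    tokens = text.lower().replace(",", " ").split()
    hits, i = [], 0
    while i < len(tokens):
        node, j, last = trie, i, None
        while j < len(tokens) and tokens[j] in node:
            node = node[tokens[j]]
            j += 1
            if "$end" in node:
                last = (node["$end"], j)  # longest match so far
        if last:
            hits.append(last[0])
            i = last[1]   # resume scanning after the matched span
        else:
            i += 1
    return hits

markers = ["psa", "her2", "c reactive protein"]   # hypothetical dictionary
trie = build_trie(markers)
print(find_biomarkers("Serum PSA and C reactive protein were elevated", trie))
# → ['psa', 'c reactive protein']
```

A production system would populate the dictionary from curated biomarker databases and add normalization for synonyms and abbreviations.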

  15. Using natural language processing to improve biomedical concept normalization and relation mining

    NARCIS (Netherlands)

    N. Kang (Ning)

    2013-01-01

    textabstractThis thesis concerns the use of natural language processing for improving biomedical concept normalization and relation mining. We begin with introducing the background of biomedical text mining, and subsequently we will continue by describing a typical text mining pipeline, some key iss

  16. DeTEXT: A Database for Evaluating Text Extraction from Biomedical Literature Figures.

    Science.gov (United States)

    Yin, Xu-Cheng; Yang, Chun; Pei, Wei-Yi; Man, Haixia; Zhang, Jun; Learned-Miller, Erik; Yu, Hong

    2015-01-01

    Hundreds of millions of figures are available in biomedical literature, representing important biomedical experimental evidence. Since text is a rich source of information in figures, automatically extracting such text may assist in the task of mining figure information. A high-quality ground truth standard can greatly facilitate the development of an automated system. This article describes DeTEXT: A database for evaluating text extraction from biomedical literature figures. It is the first publicly available, human-annotated, high quality, and large-scale figure-text dataset with 288 full-text articles, 500 biomedical figures, and 9308 text regions. This article describes how figures were selected from open-access full-text biomedical articles and how annotation guidelines and annotation tools were developed. We also discuss the inter-annotator agreement and the reliability of the annotations. We summarize the statistics of the DeTEXT data and make available evaluation protocols for DeTEXT. Finally we lay out challenges we observed in the automated detection and recognition of figure text and discuss research directions in this area. DeTEXT is publicly available for downloading at http://prir.ustb.edu.cn/DeTEXT/.

  17. Reviving "Walden": Mining the Text.

    Science.gov (United States)

    Hewitt, Julia

    2000-01-01

    Describes how the author and her high school English students begin their study of Thoreau's "Walden" by mining the text for quotations to inspire their own writing and discussion on the topic, "How does Thoreau speak to you or how could he speak to someone you know?" (SR)

  18. SIAM 2007 Text Mining Competition dataset

    Data.gov (United States)

    National Aeronautics and Space Administration — Subject Area: Text Mining Description: This is the dataset used for the SIAM 2007 Text Mining competition. This competition focused on developing text mining...

  19. Text Classification using Data Mining

    CERN Document Server

    Kamruzzaman, S M; Hasan, Ahmed Ryadh

    2010-01-01

    Text classification is the process of classifying documents into predefined categories based on their content. It is the automated assignment of natural language texts to predefined categories. Text classification is the primary requirement of text retrieval systems, which retrieve texts in response to a user query, and text understanding systems, which transform text in some way such as producing summaries, answering questions or extracting data. Existing supervised learning algorithms to automatically classify text need sufficient documents to learn accurately. This paper presents a new algorithm for text classification using data mining that requires fewer documents for training. Instead of using words, word relation i.e. association rules from these words is used to derive feature set from pre-classified text documents. The concept of Naive Bayes classifier is then used on derived features and finally only a single concept of Genetic Algorithm has been added for final classification. A system based on the...
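The Naive Bayes stage of such a classifier can be sketched as follows. This minimal multinomial Naive Bayes with add-one smoothing works on raw words rather than the association-rule features the paper derives, and the training documents are toy examples:

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    """Count words per class for multinomial Naive Bayes."""
    word_counts = defaultdict(Counter)
    class_counts = Counter()
    vocab = set()
    for text, label in docs:
        class_counts[label] += 1
        for w in text.lower().split():
            word_counts[label][w] += 1
            vocab.add(w)
    return word_counts, class_counts, vocab

def classify(text, word_counts, class_counts, vocab):
    """Pick the class with the highest smoothed log-posterior."""
    total_docs = sum(class_counts.values())
    best, best_lp = None, float("-inf")
    for c in class_counts:
        lp = math.log(class_counts[c] / total_docs)          # prior
        denom = sum(word_counts[c].values()) + len(vocab)    # add-one smoothing
        for w in text.lower().split():
            lp += math.log((word_counts[c][w] + 1) / denom)  # likelihood
        if lp > best_lp:
            best, best_lp = c, lp
    return best

docs = [  # toy pre-classified documents
    ("the striker scored a late goal", "sport"),
    ("the team won the match in extra time", "sport"),
    ("the central bank raised interest rates", "finance"),
    ("stocks fell as interest rates climbed", "finance"),
]
model = train_nb(docs)
print(classify("the goal decided the match", *model))   # → sport
print(classify("interest rates rose again", *model))    # → finance
```

In the paper's pipeline, the word features here would be replaced by mined association rules, with a genetic algorithm applied as a final step.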

  20. A REVIEW ON TEXT MINING IN DATA MINING

    OpenAIRE

    2016-01-01

    Data mining is knowledge discovery in databases; the goal is to extract patterns and knowledge from large amounts of data. An important branch of data mining is text mining. Text mining extracts high-quality information from text. Statistical pattern learning is used to derive this high-quality information. High quality in text mining refers to a combination of relevance, novelty and interestingness. Tasks in text mining are text categorization, text clustering, entity extraction and sentim...

  1. Mining biomedical images towards valuable information retrieval in biomedical and life sciences.

    Science.gov (United States)

    Ahmed, Zeeshan; Zeeshan, Saman; Dandekar, Thomas

    2016-01-01

    Biomedical images are helpful sources for scientists and practitioners in drawing significant hypotheses, exemplifying approaches and describing experimental results in published biomedical literature. In recent decades, there has been an enormous increase in the amount of heterogeneous biomedical image production and publication, which has resulted in a need for bioimaging platforms for feature extraction and analysis of text and content in biomedical images, in order to build effective information retrieval systems. In this review, we summarize technologies related to data mining of figures. We describe and compare the potential of different approaches in terms of their developmental aspects, used methodologies, produced results, achieved accuracies and limitations. Our comparative conclusions include current challenges for bioimaging software with selective image mining, embedded text extraction and processing of complex natural language queries.

  2. Analysis of biological processes and diseases using text mining approaches.

    Science.gov (United States)

    Krallinger, Martin; Leitner, Florian; Valencia, Alfonso

    2010-01-01

    A number of biomedical text mining systems have been developed to extract biologically relevant information directly from the literature, complementing bioinformatics methods in the analysis of experimentally generated data. We provide a short overview of the general characteristics of natural language data, existing biomedical literature databases, and lexical resources relevant in the context of biomedical text mining. A selected number of practically useful systems are introduced together with the type of user queries supported and the results they generate. The extraction of biological relationships, such as protein-protein interactions as well as metabolic and signaling pathways using information extraction systems, will be discussed through example cases of cancer-relevant proteins. Basic strategies for detecting associations of genes to diseases together with literature mining of mutations, SNPs, and epigenetic information (methylation) are described. We provide an overview of disease-centric and gene-centric literature mining methods for linking genes to phenotypic and genotypic aspects. Moreover, we discuss recent efforts for finding biomarkers through text mining and for gene list analysis and prioritization. Some relevant issues for implementing a customized biomedical text mining system will be pointed out. To demonstrate the usefulness of literature mining for the molecular oncology domain, we implemented two cancer-related applications. The first tool consists of a literature mining system for retrieving human mutations together with supporting articles. Specific gene mutations are linked to a set of predefined cancer types. The second application consists of a text categorization system supporting breast cancer-specific literature search and document-based breast cancer gene ranking. 
Future trends in text mining emphasize the importance of community efforts such as the BioCreative challenge for the development and integration of multiple systems into

  3. Document Exploration and Automatic Knowledge Extraction for Unstructured Biomedical Text

    Science.gov (United States)

    Chu, S.; Totaro, G.; Doshi, N.; Thapar, S.; Mattmann, C. A.; Ramirez, P.

    2015-12-01

    We describe our work on building a web-browser based document reader with a built-in exploration tool and automatic concept extraction of medical entities for biomedical text. Vast amounts of biomedical information are offered in unstructured text form through scientific publications and R&D reports. Utilizing text mining can help us to mine information and extract relevant knowledge from a plethora of biomedical text. The ability to employ such technologies to aid researchers in coping with information overload is greatly desirable. In recent years, there has been an increased interest in automatic biomedical concept extraction [1, 2] and intelligent PDF reader tools with the ability to search on content and find related articles [3]. Such reader tools are typically desktop applications and are limited to specific platforms. Our goal is to provide researchers with a simple tool to aid them in finding, reading, and exploring documents. Thus, we propose a web-based document explorer, which we called Shangri-Docs, which combines a document reader with automatic concept extraction and highlighting of relevant terms. Shangri-Docs also provides the ability to evaluate a wide variety of document formats (e.g. PDF, Word, PPT, text, etc.) and to exploit the linked nature of the Web and personal content by performing searches on content from public sites (e.g. Wikipedia, PubMed) and private cataloged databases simultaneously. Shangri-Docs utilizes Apache cTAKES (clinical Text Analysis and Knowledge Extraction System) [4] and the Unified Medical Language System (UMLS) to automatically identify and highlight terms and concepts, such as specific symptoms, diseases, drugs, and anatomical sites, mentioned in the text. cTAKES was originally designed specifically to extract information from clinical medical records.
    Our investigation leads us to extend the automatic knowledge extraction process of cTAKES for the biomedical research domain by improving the ontology-guided information extraction
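A drastically simplified version of the dictionary-driven highlighting step can be sketched as follows. A real deployment resolves terms against UMLS via cTAKES; here, a three-entry vocabulary (invented for illustration) is matched with a case-insensitive regular expression, longest term first:

```python
import re

# Tiny stand-in vocabulary; a real system maps terms to UMLS concepts
CONCEPTS = {
    "myocardial infarction": "Disease",
    "chest pain": "Symptom",
    "aspirin": "Drug",
}

# Longer terms first so multi-word concepts win over any embedded substrings
pattern = re.compile(
    "|".join(re.escape(t) for t in sorted(CONCEPTS, key=len, reverse=True)),
    re.IGNORECASE,
)

def highlight(text):
    """Wrap each recognised concept in a <mark> tag carrying its type."""
    return pattern.sub(
        lambda m: '<mark class="%s">%s</mark>'
        % (CONCEPTS[m.group(0).lower()], m.group(0)),
        text,
    )

print(highlight("Patient with chest pain was given aspirin after myocardial infarction."))
```

The resulting marked-up text can be rendered directly in the browser-based reader, with CSS styling each concept class differently.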

  4. Efficient Retrieval of Text for Biomedical Domain using Expectation Maximization Algorithm

    Directory of Open Access Journals (Sweden)

    Sumit Vashishtha

    2011-11-01

    Full Text Available Data mining, a branch of computer science [1], is the process of extracting patterns from large data sets by combining methods from statistics and artificial intelligence with database management. Data mining is seen as an increasingly important tool by modern business to transform data into business intelligence, giving an informational advantage. Biomedical text retrieval refers to text retrieval techniques applied to biomedical resources and literature available in the biomedical and molecular biology domain. The volume of published biomedical research, and therefore the underlying biomedical knowledge base, is expanding at an increasing rate. Biomedical text retrieval is a way to aid researchers in coping with information overload. By discovering predictive relationships between different pieces of extracted data, data-mining algorithms can be used to improve the accuracy of information extraction. However, textual variation due to typos, abbreviations, and other sources can prevent the productive discovery and utilization of hard-matching rules. Recent methods of soft clustering can exploit predictive relationships in textual data. This paper presents a technique for using a soft clustering data mining algorithm to increase the accuracy of biomedical text extraction. Experimental results demonstrate that this approach improves text extraction more effectively than hard keyword matching rules.
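The soft-clustering idea — every document holds fractional membership in every cluster, refined by expectation-maximization — can be sketched with a small mixture-of-multinomials model. This is a generic EM illustration rather than the paper's algorithm; the four documents are toy examples, and the first document's responsibilities are nudged at initialization so the run is deterministic:

```python
import math
from collections import Counter

def em_soft_cluster(docs, k=2, iters=40):
    """EM for a mixture of multinomials: returns soft cluster memberships."""
    bows = [Counter(d.lower().split()) for d in docs]
    vocab = sorted({w for b in bows for w in b})
    # deterministic initialisation: uniform, with doc 0 nudged toward cluster 0
    resp = [[1.0 / k] * k for _ in docs]
    resp[0] = [0.9, 0.1] + [0.0] * (k - 2)
    for _ in range(iters):
        # M-step: mixture weights and smoothed per-cluster word probabilities
        pi = [sum(row[c] for row in resp) / len(docs) for c in range(k)]
        theta = []
        for c in range(k):
            counts = {w: 1.0 for w in vocab}   # add-one smoothing
            for row, bow in zip(resp, bows):
                for w, n in bow.items():
                    counts[w] += row[c] * n
            total = sum(counts.values())
            theta.append({w: counts[w] / total for w in vocab})
        # E-step: responsibilities proportional to cluster likelihoods
        for d, bow in enumerate(bows):
            logs = []
            for c in range(k):
                lp = math.log(pi[c])
                for w, n in bow.items():
                    lp += n * math.log(theta[c][w])
                logs.append(lp)
            m = max(logs)
            exps = [math.exp(l - m) for l in logs]
            s = sum(exps)
            resp[d] = [e / s for e in exps]
    return resp

docs = [
    "protein binding assay protein",
    "protein interaction binding site",
    "stock market interest rates",
    "market rates stock prices",
]
resp = em_soft_cluster(docs)
for d, row in zip(docs, resp):
    print([round(r, 3) for r in row], d)
```

The fractional memberships, rather than hard cluster labels, are what allow variant spellings and abbreviations of the same concept to be grouped despite failing a hard keyword match.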

  5. Text mining from ontology learning to automated text processing applications

    CERN Document Server

    Biemann, Chris

    2014-01-01

    This book comprises a set of articles that specify the methodology of text mining, describe the creation of lexical resources in the framework of text mining and use text mining for various tasks in natural language processing (NLP). The analysis of large amounts of textual data is a prerequisite to build lexical resources such as dictionaries and ontologies and also has direct applications in automated text processing in fields such as history, healthcare and mobile applications, just to name a few. This volume gives an update in terms of the recent gains in text mining methods and reflects

  6. Working with text tools, techniques and approaches for text mining

    CERN Document Server

    Tourte, Gregory J L

    2016-01-01

    Text mining tools and technologies have long been a part of the repository world, where they have been applied to a variety of purposes, from pragmatic aims to support tools. Research areas as diverse as biology, chemistry, sociology and criminology have seen effective use made of text mining technologies. Working With Text collects a subset of the best contributions from the 'Working with text: Tools, techniques and approaches for text mining' workshop, alongside contributions from experts in the area. Text mining tools and technologies in support of academic research include supporting research on the basis of a large body of documents, facilitating access to and reuse of extant work, and bridging between the formal academic world and areas such as traditional and social media. Jisc have funded a number of projects, including NaCTem (the National Centre for Text Mining) and the ResDis programme. Contents are developed from workshop submissions and invited contributions, including: Legal considerations in te...

  7. Text Association Analysis and Ambiguity in Text Mining

    Science.gov (United States)

    Bhonde, S. B.; Paikrao, R. L.; Rahane, K. U.

    2010-11-01

    Text Mining is the process of analyzing a semantically rich document or set of documents to understand the content and meaning of the information they contain. Research in Text Mining will enhance humans' ability to process massive quantities of information, and it has high commercial value. Firstly, the paper introduces TM and its definition, and then gives an overview of the process of text mining and its applications. Up to now, not much research in text mining, especially in concept/entity extraction, has focused on the ambiguity problem. This paper addresses ambiguity issues in natural language texts, and presents a new technique for resolving the ambiguity problem in extracting concepts/entities from texts. In the end, it shows the importance of TM in knowledge discovery and highlights the upcoming challenges of document mining and the opportunities it offers.

  8. Text mining for traditional Chinese medical knowledge discovery: a survey.

    Science.gov (United States)

    Zhou, Xuezhong; Peng, Yonghong; Liu, Baoyan

    2010-08-01

    Extracting meaningful information and knowledge from free text is the subject of considerable research interest in the machine learning and data mining fields. Text data mining (or text mining) has become one of the most active research sub-fields in data mining. Significant developments in the area of biomedical text mining during the past years have demonstrated its great promise for supporting scientists in developing novel hypotheses and new knowledge from the biomedical literature. Traditional Chinese medicine (TCM) provides a distinct methodology with which to view human life. It is one of the most complete and distinguished traditional medicines with a history of several thousand years of studying and practicing the diagnosis and treatment of human disease. It has been shown that the TCM knowledge obtained from clinical practice has become a significant complementary source of information for modern biomedical sciences. TCM literature obtained from the historical period and from modern clinical studies has recently been transformed into digital data in the form of relational databases or text documents, which provide an effective platform for information sharing and retrieval. This motivates and facilitates research and development into knowledge discovery approaches and to modernize TCM. In order to contribute to this still growing field, this paper presents (1) a comparative introduction to TCM and modern biomedicine, (2) a survey of the related information sources of TCM, (3) a review and discussion of the state of the art and the development of text mining techniques with applications to TCM, (4) a discussion of the research issues around TCM text mining and its future directions.

  9. Discovering gene annotations in biomedical text databases

    Directory of Open Access Journals (Sweden)

    Ozsoyoglu Gultekin

    2008-03-01

    Full Text Available Abstract Background Genes and gene products are frequently annotated with Gene Ontology concepts based on the evidence provided in genomics articles. Manually locating and curating information about a genomic entity from the biomedical literature requires vast amounts of human effort. Hence, there is clearly a need for automated computational tools to annotate the genes and gene products with Gene Ontology concepts by computationally capturing the related knowledge embedded in textual data. Results In this article, we present an automated genomic entity annotation system, GEANN, which extracts information about the characteristics of genes and gene products in article abstracts from PubMed, and translates the discovered knowledge into Gene Ontology (GO) concepts, a widely used standardized vocabulary of genomic traits. GEANN utilizes textual "extraction patterns", and a semantic matching framework to locate phrases matching a pattern and produce Gene Ontology annotations for genes and gene products. In our experiments, GEANN reached a precision level of 78% at a recall level of 61%. On a select set of Gene Ontology concepts, GEANN either outperforms or is comparable to two other automated annotation studies. Use of WordNet for semantic pattern matching improves the precision and recall by 24% and 15%, respectively, and the improvement due to semantic pattern matching becomes more apparent as the Gene Ontology terms become more general. Conclusion GEANN is useful for two distinct purposes: (i) automating the annotation of genomic entities with Gene Ontology concepts, and (ii) providing existing annotations with additional "evidence articles" from the literature. The use of textual extraction patterns that are constructed based on the existing annotations achieves high precision.
    The semantic pattern matching framework provides a more flexible pattern matching scheme with respect to "exact matching", with the advantage of locating approximate
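The idea of textual extraction patterns can be illustrated with two hypothetical regular-expression patterns (GEANN's real patterns are derived from existing GO annotations and use a semantic matching framework, which is not reproduced here):

```python
import re

# Illustrative extraction patterns; GEANN's actual patterns are constructed
# from existing annotations, these two are invented stand-ins
PATTERNS = [
    re.compile(r"(?P<gene>\w+) is involved in (?P<process>[\w\s]+?)(?:\.|,)"),
    re.compile(r"(?P<gene>\w+) plays a role in (?P<process>[\w\s]+?)(?:\.|,)"),
]

def extract_annotations(abstract):
    """Return (gene, process) pairs matched by any extraction pattern."""
    hits = []
    for pat in PATTERNS:
        for m in pat.finditer(abstract):
            hits.append((m.group("gene"), m.group("process").strip()))
    return hits

text = "BRCA1 is involved in DNA repair. TP53 plays a role in apoptosis."
print(extract_annotations(text))
# → [('BRCA1', 'DNA repair'), ('TP53', 'apoptosis')]
```

In the full system, the matched process phrase would then be mapped to a GO concept, with WordNet-based semantic matching tolerating paraphrases that these literal patterns would miss.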

  10. MBA: a literature mining system for extracting biomedical abbreviations

    Directory of Open Access Journals (Sweden)

    Lei YiMing

    2009-01-01

    Full Text Available Abstract Background The exploding growth of the biomedical literature presents many challenges for biological researchers. One such challenge is the use of a great number of abbreviations. Extracting abbreviations and their definitions accurately is very helpful to biologists and also facilitates biomedical text analysis. Existing approaches fall into four broad categories: rule based, machine learning based, text alignment based and statistically based. State of the art methods either focus exclusively on acronym-type abbreviations, or cannot recognize rare abbreviations. We propose a systematic method to extract abbreviations effectively. First, a scoring method is used to classify the abbreviations into acronym-type and non-acronym-type abbreviations, and then their corresponding definitions are identified by two different methods: a text alignment algorithm for the former, and a statistical method for the latter. Results A literature mining system, MBA, was constructed to extract both acronym-type and non-acronym-type abbreviations. An abbreviation-tagged literature corpus, called the Medstract gold standard corpus, was used to evaluate the system. MBA achieved a recall of 88% at a precision of 91% on the Medstract gold-standard evaluation corpus. Conclusion We present a new literature mining system, MBA, for extracting biomedical abbreviations. Our evaluation demonstrates that the MBA system performs better than the others. It can identify the definitions not only of acronym-type abbreviations, including slightly irregular acronym-type abbreviations, but also of non-acronym-type abbreviations.
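For acronym-type abbreviations, the text-alignment idea can be illustrated with the well-known Schwartz-Hearst matching scheme (a stand-in here, not the MBA system itself): each character of the short form is aligned right to left against the text preceding the parenthesized abbreviation:

```python
import re

def find_definition(short, candidate):
    """Align the short form right-to-left against the candidate long form."""
    s, c = short.lower(), candidate.lower()
    si, ci = len(s) - 1, len(c) - 1
    while si >= 0:
        ch = s[si]
        if not ch.isalnum():
            si -= 1          # skip punctuation in the short form
            continue
        # the first character must additionally start a word in the long form
        while ci >= 0 and (c[ci] != ch or
                           (si == 0 and ci > 0 and c[ci - 1].isalnum())):
            ci -= 1
        if ci < 0:
            return None      # alignment failed
        si -= 1
        ci -= 1
    return candidate[ci + 1:]

def extract_abbreviations(text):
    """Find (short form, long form) pairs around parenthesized abbreviations."""
    pairs = []
    for m in re.finditer(r"\(([^()]+)\)", text):
        short = m.group(1).strip()
        if not (1 < len(short) <= 10 and short[0].isalnum()):
            continue
        # candidate long form: a window of words just before the '('
        before = text[:m.start()].rstrip().split()
        window = min(len(short) + 5, len(short) * 2)
        definition = find_definition(short, " ".join(before[-window:]))
        if definition:
            pairs.append((short, definition))
    return pairs

print(extract_abbreviations(
    "We trained a hidden Markov model (HMM) on protein sequences."))
# → [('HMM', 'hidden Markov model')]
```

Non-acronym-type abbreviations, whose letters do not align with their definitions, are exactly the cases this alignment fails on, which is why MBA falls back to a statistical method for them.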

  11. A Survey on Web Text Information Retrieval in Text Mining

    Directory of Open Access Journals (Sweden)

    Tapaswini Nayak

    2015-08-01

    Full Text Available In this study we have analyzed different techniques for information retrieval in text mining. The aim of the study is to survey web text information retrieval. Text mining is similar to analytics, in that it is a process of deriving high-quality information from text. High-quality information is typically derived through the devising of patterns and trends by means such as statistical pattern learning. Typical text mining tasks include text categorization, text clustering, concept/entity extraction, creation of coarse taxonomies, sentiment analysis, document summarization and entity relation modeling. It is used to mine hidden information from unstructured or semi-structured data. This feature is necessary because a large amount of Web information is semi-structured due to the nested structure of HTML code, is linked and is redundant. Web content categorization with a content database is the most important tool for the efficient use of search engines. A customer requesting information on a particular subject or item would otherwise have to search through hundreds of results to find the most relevant information for his query. Text mining reduces the number of results that must be examined. This eliminates the aggravation and improves the navigation of information on the Web.

  12. Anomaly Detection with Text Mining

    Data.gov (United States)

    National Aeronautics and Space Administration — Many existing complex space systems have a significant amount of historical maintenance and problem data bases that are stored in unstructured text forms. The...

  13. Preprocessing and Morphological Analysis in Text Mining

    Directory of Open Access Journals (Sweden)

    Krishna Kumar Mohbey, Sachin Tiwari

    2011-12-01

    Full Text Available This paper is based on the preprocessing activities that are performed by software or language translators before applying mining algorithms on huge data. Text mining is an important area of data mining, and it plays a vital role in extracting useful information from huge databases or data warehouses. But before applying the text mining or information extraction process, preprocessing is a must, because the given data or dataset may contain noisy, incomplete, inconsistent, dirty and unformatted data. In this paper we try to collect the necessary requirements for preprocessing. When the preprocessing task is complete, we can easily extract useful information using a mining strategy. This paper also provides information about the analysis of data, like tokenization and stemming, and semantic analysis, like phrase recognition and parsing. It also collects the procedures for preprocessing data, i.e., it describes how stemming, tokenization or parsing are applied.
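The preprocessing steps named above (tokenization, stop-word removal, stemming) can be sketched in a few lines; the stemmer below strips only a handful of suffixes and is a toy stand-in for a full Porter stemmer, and the stop-word list is a small illustrative sample:

```python
import re

def tokenize(text):
    """Lowercase and split on runs of non-alphanumeric characters."""
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]

def stem(token):
    """Toy suffix-stripping stemmer (a real Porter stemmer has many more rules)."""
    for suffix in ("ational", "ization", "ingly", "ing", "edly", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

STOPWORDS = {"the", "a", "an", "of", "and", "is", "are", "to", "in"}

def preprocess(text):
    """Tokenize, drop stop words, and stem what remains."""
    return [stem(t) for t in tokenize(text) if t not in STOPWORDS]

print(preprocess("Mining algorithms are applied to the preprocessed texts"))
# → ['min', 'algorithm', 'appli', 'preprocess', 'text']
```

The output terms are what a downstream mining algorithm would actually consume; note that stems such as 'appli' are not dictionary words, which is acceptable because stemming only needs to map variants of a word to the same key.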

  14. Database Citation in Full Text Biomedical Articles

    OpenAIRE

    Şenay Kafkas; Jee-Hyub Kim; Johanna R. McEntyre

    2013-01-01

    Molecular biology and literature databases represent essential infrastructure for life science research. Effective integration of these data resources requires that there are structured cross-references at the level of individual articles and biological records. Here, we describe the current patterns of how database entries are cited in research articles, based on analysis of the full text Open Access articles available from Europe PMC. Focusing on citation of entries in the European Nucleoti...

  15. Financial Statement Fraud Detection using Text Mining

    Directory of Open Access Journals (Sweden)

    Rajan Gupta

    2013-01-01

    Full Text Available Data mining techniques have been used extensively by the research community in detecting financial statement fraud. Most of the research in this direction has used the numbers (quantitative information, i.e., financial ratios) present in the financial statements for detecting fraud. There is very little or no research on the analysis of text such as auditor’s comments or notes present in published reports. In this study we propose a text mining approach for detecting financial statement fraud by analyzing the hidden clues in the qualitative information (text) present in financial statements.

  16. Text mining and visualization using VOSviewer

    CERN Document Server

    van Eck, Nees Jan

    2011-01-01

    VOSviewer is a computer program for creating, visualizing, and exploring bibliometric maps of science. In this report, the new text mining functionality of VOSviewer is presented. A number of examples are given of applications in which VOSviewer is used for analyzing large amounts of text data.

  17. DrugQuest - a text mining workflow for drug association discovery

    OpenAIRE

    Papanikolaou, Nikolas; Pavlopoulos, Georgios A.; Theodosiou ,Theodosios; Vizirianakis, Ioannis S.; Iliopoulos, Ioannis

    2016-01-01

    Background Text mining and data integration methods are gaining ground in the field of health sciences due to the exponential growth of bio-medical literature and information stored in biological databases. While such methods mostly try to extract bioentity associations from PubMed, very few of them are dedicated to mining other types of repositories such as chemical databases. Results Herein, we apply a text mining approach on the DrugBank database in order to explore drug associations based...

  18. Mining Texts in Reading to Write.

    Science.gov (United States)

    Greene, Stuart

    1992-01-01

    Proposes a set of strategies for connecting reading and writing, placing the discussion in the context of other pedagogical approaches designed to exploit the relationship between reading and writing. Explores ways in which students employ the strategies involved in "mining" a text--reconstructing context, inferring or imposing structure, and…

  19. A Relation Extraction Framework for Biomedical Text Using Hybrid Feature Set

    Directory of Open Access Journals (Sweden)

    Abdul Wahab Muzaffar

    2015-01-01

    Full Text Available The information extraction from unstructured text segments is a complex task. Although manual information extraction often produces the best results, it is harder to manage biomedical data extraction manually because of the exponential increase in data size. Thus, there is a need for automatic tools and techniques for information extraction in biomedical text mining. Relation extraction is a significant area under biomedical information extraction that has gained much importance in the last two decades. A lot of work has been done on biomedical relation extraction focusing on rule-based and machine learning techniques. In the last decade, the focus has changed to hybrid approaches showing better results. This research presents a hybrid feature set for classification of relations between biomedical entities. The main contribution of this research is done in the semantic feature set where verb phrases are ranked using the Unified Medical Language System (UMLS) and a ranking algorithm. Support Vector Machine and Naïve Bayes, the two effective machine learning techniques, are used to classify these relations. Our approach has been validated on the standard biomedical text corpus obtained from MEDLINE 2001. Conclusively, it can be articulated that our framework outperforms all state-of-the-art approaches used for relation extraction on the same corpus.

  20. Enhancing biomedical text summarization using semantic relation extraction.

    Directory of Open Access Journals (Sweden)

    Yue Shang

    Full Text Available Automatic text summarization for a biomedical concept can help researchers to get the key points of a certain topic from a large amount of biomedical literature efficiently. In this paper, we present a method for generating a text summary for a given biomedical concept, e.g., H1N1 disease, from multiple documents based on semantic relation extraction. Our approach includes three stages: (1) we extract semantic relations in each sentence using the semantic knowledge representation tool SemRep; (2) we develop a relation-level retrieval method to select the relations most relevant to each query concept and visualize them in a graphic representation; (3) for relations in the relevant set, we extract informative sentences that can interpret them from the document collection to generate a text summary using an information retrieval based method. Our major focus in this work is to investigate the contribution of semantic relation extraction to the task of biomedical text summarization. The experimental results on summarization for a set of diseases show that the introduction of semantic knowledge improves the performance, and our results are better than those of the MEAD system, a well-known tool for text summarization.
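
The third stage, selecting sentences that interpret a relation, can be illustrated with a minimal overlap-based scorer. The relation terms and sentences below are invented for illustration; the actual system uses an information-retrieval method over SemRep output:

```python
# Score candidate sentences by how many relation terms they contain,
# then keep the best-scoring one as a summary sentence.
def score(sentence, relation_terms):
    words = set(sentence.lower().split())
    return len(words & relation_terms)

relation = {"h1n1", "causes", "fever"}          # invented relation terms
sentences = [
    "H1N1 infection frequently causes high fever in patients.",
    "Vaccination campaigns started in 2009.",
]
best = max(sentences, key=lambda s: score(s, relation))
```

A real implementation would normalize punctuation and weight terms (e.g., TF-IDF) rather than count raw overlaps.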

  1. A Relation Extraction Framework for Biomedical Text Using Hybrid Feature Set.

    Science.gov (United States)

    Muzaffar, Abdul Wahab; Azam, Farooque; Qamar, Usman

    2015-01-01

    The information extraction from unstructured text segments is a complex task. Although manual information extraction often produces the best results, it is harder to manage biomedical data extraction manually because of the exponential increase in data size. Thus, there is a need for automatic tools and techniques for information extraction in biomedical text mining. Relation extraction is a significant area under biomedical information extraction that has gained much importance in the last two decades. A lot of work has been done on biomedical relation extraction focusing on rule-based and machine learning techniques. In the last decade, the focus has changed to hybrid approaches showing better results. This research presents a hybrid feature set for classification of relations between biomedical entities. The main contribution of this research is done in the semantic feature set where verb phrases are ranked using Unified Medical Language System (UMLS) and a ranking algorithm. Support Vector Machine and Naïve Bayes, the two effective machine learning techniques, are used to classify these relations. Our approach has been validated on the standard biomedical text corpus obtained from MEDLINE 2001. Conclusively, it can be articulated that our framework outperforms all state-of-the-art approaches used for relation extraction on the same corpus.
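
As a rough illustration of the Naïve Bayes side of such a classifier, the toy example below trains a multinomial model over bag-of-words features and predicts a relation label. The training sentences and labels are invented; the real system uses the hybrid UMLS-ranked feature set described above:

```python
import math
from collections import Counter, defaultdict

# Invented training data: (relation label, sentence) pairs.
train = [
    ("treats", "drug inhibits tumor growth"),
    ("treats", "compound reduces disease symptoms"),
    ("causes", "mutation induces tumor formation"),
    ("causes", "exposure triggers disease onset"),
]

class_counts = Counter(lbl for lbl, _ in train)
word_counts = defaultdict(Counter)
vocab = set()
for lbl, text in train:
    for w in text.split():
        word_counts[lbl][w] += 1
        vocab.add(w)

def predict(text):
    # Multinomial Naive Bayes with Laplace smoothing, in log space.
    best, best_lp = None, -math.inf
    for lbl in class_counts:
        lp = math.log(class_counts[lbl] / len(train))
        total = sum(word_counts[lbl].values())
        for w in text.split():
            lp += math.log((word_counts[lbl][w] + 1) / (total + len(vocab)))
        if lp > best_lp:
            best, best_lp = lbl, lp
    return best
```

For example, `predict("drug reduces tumor")` favors the "treats" class because more of its words were seen under that label.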

  2. Semantator: semantic annotator for converting biomedical text to linked data.

    Science.gov (United States)

    Tao, Cui; Song, Dezhao; Sharma, Deepak; Chute, Christopher G

    2013-10-01

    More than 80% of biomedical data is embedded in plain text. The unstructured nature of these text-based documents makes it challenging to easily browse and query the data of interest in them. One approach to facilitate browsing and querying biomedical text is to convert the plain text to a linked web of data, i.e., converting data originally in free text to structured formats with defined meta-level semantics. In this paper, we introduce Semantator (Semantic Annotator), a semantic-web-based environment for annotating data of interest in biomedical documents, browsing and querying the annotated data, and interactively refining annotation results if needed. Through Semantator, information of interest can be either annotated manually or semi-automatically using plug-in information extraction tools. The annotated results will be stored in RDF and can be queried using the SPARQL query language. In addition, semantic reasoners can be directly applied to the annotated data for consistency checking and knowledge inference. Semantator has been released online and was used by the biomedical ontology community, who provided positive feedback. Our evaluation results indicated that (1) Semantator can perform the annotation functionalities as designed; (2) Semantator can be adopted in real applications in clinical and translational research; and (3) the annotated results using Semantator can be easily used in Semantic-web-based reasoning tools for further inference.
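
A query over annotations stored in RDF might look like the following sketch. The `ann:` namespace and property names are hypothetical, invented for illustration rather than taken from Semantator's actual schema:

```sparql
PREFIX ann: <http://example.org/annotation#>   # hypothetical namespace

# Find documents containing disease annotations mentioning "carcinoma".
SELECT ?doc ?disease
WHERE {
  ?a a ann:DiseaseAnnotation ;                 # assumed annotation class
     ann:inDocument ?doc ;
     ann:hasText ?disease .
  FILTER regex(?disease, "carcinoma", "i")
}
```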

  3. Science and Technology Text Mining: Management Decision Aids

    Science.gov (United States)

    2007-11-02

    Keywords: review; data mining; text mining; bibliometrics; scientometrics; resource allocation; project selection; operations research; management science. The support techniques include roadmaps, metrics, peer review, data and text mining, information retrieval, bibliometrics, and retrospective studies.

  4. Demo: Using RapidMiner for Text Mining

    OpenAIRE

    2013-01-01

    In this demo, the basic text mining technologies available in RapidMiner are reviewed. RapidMiner's basic characteristics and text mining operators are described, and a text mining example using the Naïve Bayes algorithm and process modeling is presented.

  5. Imitating manual curation of text-mined facts in biomedicine.

    Directory of Open Access Journals (Sweden)

    Raul Rodriguez-Esteban

    2006-09-01

    Full Text Available Text-mining algorithms make mistakes in extracting facts from natural-language texts. In biomedical applications, which rely on use of text-mined data, it is critical to assess the quality (the probability that the message is correctly extracted) of individual facts--to resolve data conflicts and inconsistencies. Using a large set of almost 100,000 manually produced evaluations (most facts were independently reviewed more than once, producing independent evaluations), we implemented and tested a collection of algorithms that mimic human evaluation of facts provided by an automated information-extraction system. The performance of our best automated classifiers closely approached that of our human evaluators (ROC score close to 0.95). Our hypothesis is that, were we to use a larger number of human experts to evaluate any given sentence, we could implement an artificial-intelligence curator that would perform the classification job at least as accurately as an average individual human evaluator. We illustrated our analysis by visualizing the predicted accuracy of the text-mined relations involving the term cocaine.

  6. Negotiating a Text Mining License for Faculty Researchers

    Directory of Open Access Journals (Sweden)

    Leslie A. Williams

    2014-09-01

    Full Text Available This case study examines strategies used to leverage the library’s existing journal licenses to obtain a large collection of full-text journal articles in extensible markup language (XML) format; the right to text mine the collection; and the right to use the collection and the data mined from it for grant-funded research to develop biomedical natural language processing (BNLP) tools. Researchers attempted to obtain content directly from PubMed Central (PMC). This attempt failed due to limits on use of content in PMC. Next, researchers and their library liaison attempted to obtain content from contacts in the technical divisions of the publishing industry. This resulted in an incomplete research data set. Then researchers, the library liaison, and the acquisitions librarian collaborated with the sales and technical staff of a major science, technology, engineering, and medical (STEM) publisher to successfully create a method for obtaining XML content as an extension of the library’s typical acquisition process for electronic resources. Our experience led us to realize that text mining rights of full-text articles in XML format should routinely be included in the negotiation of the library’s licenses.

  7. Full text clustering and relationship network analysis of biomedical publications.

    Directory of Open Access Journals (Sweden)

    Renchu Guan

    Full Text Available Rapid developments in the biomedical sciences have increased the demand for automatic clustering of biomedical publications. In contrast to current approaches to text clustering, which focus exclusively on the contents of abstracts, a novel method is proposed for clustering and analysis of complete biomedical article texts. To reduce dimensionality, the Cosine Coefficient is used on a sub-space of only two vectors, instead of computing the Euclidean distance within the space of all vectors. Then a strategy and an algorithm are introduced for Semi-supervised Affinity Propagation (SSAP) to improve analysis efficiency, using biomedical journal names as an evaluation background. Experimental results show that by avoiding high-dimensional sparse matrix computations, SSAP outperforms conventional k-means methods and improves upon the standard Affinity Propagation algorithm. In constructing a directed relationship network and distribution matrix for the clustering results, it can be noted that overlaps in scope and interests among BioMed publications can be easily identified, providing a valuable analytical tool for editors, authors and readers.
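
The cosine coefficient mentioned above is straightforward to compute over bag-of-words vectors; a minimal sketch (with invented example documents):

```python
import math
from collections import Counter

def cosine(a, b):
    # Cosine coefficient between two bag-of-words frequency vectors.
    common = set(a) & set(b)
    dot = sum(a[t] * b[t] for t in common)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

d1 = Counter("clustering of biomedical publications".split())
d2 = Counter("clustering biomedical articles".split())
sim = cosine(d1, d2)   # shared terms: "clustering", "biomedical"
```

Because only terms present in both vectors contribute to the dot product, this avoids materializing the full high-dimensional sparse matrix, which is the efficiency point the abstract makes.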

  8. Text mining applications in psychiatry: a systematic literature review.

    Science.gov (United States)

    Abbe, Adeline; Grouin, Cyril; Zweigenbaum, Pierre; Falissard, Bruno

    2016-06-01

    The expansion of biomedical literature is creating the need for efficient tools to keep pace with increasing volumes of information. Text mining (TM) approaches are becoming essential to facilitate the automated extraction of useful biomedical information from unstructured text. We reviewed the applications of TM in psychiatry, and explored its advantages and limitations. A systematic review of the literature was carried out using the CINAHL, Medline, EMBASE, PsycINFO and Cochrane databases. In this review, 1103 papers were screened, and 38 were included as applications of TM in psychiatric research. Using TM and content analysis, we identified four major areas of application: (1) Psychopathology (i.e. observational studies focusing on mental illnesses), (2) the Patient perspective (i.e. patients' thoughts and opinions), (3) Medical records (i.e. safety issues, quality of care and description of treatments), and (4) Medical literature (i.e. identification of new scientific information in the literature). The information sources were qualitative studies, Internet postings, medical records and biomedical literature. Our work demonstrates that TM can contribute to complex research tasks in psychiatry. We discuss the benefits, limits, and further applications of this tool in the future. Copyright © 2015 John Wiley & Sons, Ltd.

  9. Mining Causality for Explanation Knowledge from Text

    Institute of Scientific and Technical Information of China (English)

    Chaveevan Pechsiri; Asanee Kawtrakul

    2007-01-01

    Mining causality is essential to provide a diagnosis. This research aims at extracting the causality existing within multiple sentences or EDUs (Elementary Discourse Units). The research emphasizes the use of causality verbs because they make explicit in a certain way the consequent events of a cause, e.g., "Aphids suck the sap from rice leaves. Then leaves will shrink. Later, they will become yellow and dry.". A verb can also be the causal-verb link between cause and effect within EDU(s), e.g., "Aphids suck the sap from rice leaves causing leaves to be shrunk" ("causing" is equivalent to a causal-verb link in Thai). The research confronts two main problems: identifying the interesting causality events from documents and identifying their boundaries. Then, we propose mining on verbs by using two different machine learning techniques, the Naive Bayes classifier and the Support Vector Machine. The resulting mining rules will be used for the identification and the causality extraction of the multiple EDUs from text. Our multiple EDU extraction shows 0.88 precision with 0.75 recall from the Naive Bayes classifier and 0.89 precision with 0.76 recall from the Support Vector Machine.

  10. Methods for Mining and Summarizing Text Conversations

    CERN Document Server

    Carenini, Giuseppe; Murray, Gabriel

    2011-01-01

    Due to the Internet Revolution, human conversational data -- in written forms -- are accumulating at a phenomenal rate. At the same time, improvements in speech technology enable many spoken conversations to be transcribed. Individuals and organizations engage in email exchanges, face-to-face meetings, blogging, texting and other social media activities. The advances in natural language processing provide ample opportunities for these "informal documents" to be analyzed and mined, thus creating numerous new and valuable applications. This book presents a set of computational methods

  11. Corpus annotation for mining biomedical events from literature

    Directory of Open Access Journals (Sweden)

    Tsujii Jun'ichi

    2008-01-01

    Full Text Available Abstract Background Advanced Text Mining (TM) such as semantic enrichment of papers, event or relation extraction, and intelligent Question Answering have increasingly attracted attention in the bio-medical domain. For such attempts to succeed, text annotation from the biological point of view is indispensable. However, due to the complexity of the task, semantic annotation has never been tried on a large scale, apart from relatively simple term annotation. Results We have completed a new type of semantic annotation, event annotation, which is an addition to the existing annotations in the GENIA corpus. The corpus has already been annotated with POS (Parts of Speech), syntactic trees, terms, etc. The new annotation was made on half of the GENIA corpus, consisting of 1,000 Medline abstracts. It contains 9,372 sentences in which 36,114 events are identified. The major challenges during event annotation were (1) to design a scheme of annotation which meets specific requirements of text annotation, (2) to achieve biology-oriented annotation which reflects biologists' interpretation of text, and (3) to ensure the homogeneity of annotation quality across annotators. To meet these challenges, we introduced new concepts such as Single-facet Annotation and Semantic Typing, which have collectively contributed to successful completion of a large scale annotation. Conclusion The resulting event-annotated corpus is the largest and one of the best in quality among similar annotation efforts. We expect it to become a valuable resource for NLP (Natural Language Processing)-based TM in the bio-medical domain.

  12. Text mining for adverse drug events: the promise, challenges, and state of the art.

    Science.gov (United States)

    Harpaz, Rave; Callahan, Alison; Tamang, Suzanne; Low, Yen; Odgers, David; Finlayson, Sam; Jung, Kenneth; LePendu, Paea; Shah, Nigam H

    2014-10-01

    Text mining is the computational process of extracting meaningful information from large amounts of unstructured text. It is emerging as a tool to leverage underutilized data sources that can improve pharmacovigilance, including the objective of adverse drug event (ADE) detection and assessment. This article provides an overview of recent advances in pharmacovigilance driven by the application of text mining, and discusses several data sources-such as biomedical literature, clinical narratives, product labeling, social media, and Web search logs-that are amenable to text mining for pharmacovigilance. Given the state of the art, it appears text mining can be applied to extract useful ADE-related information from multiple textual sources. Nonetheless, further research is required to address remaining technical challenges associated with the text mining methodologies, and to conclusively determine the relative contribution of each textual source to improving pharmacovigilance.

  13. Unsupervised text mining for assessing and augmenting GWAS results.

    Science.gov (United States)

    Ailem, Melissa; Role, François; Nadif, Mohamed; Demenais, Florence

    2016-04-01

    Text mining can assist in the analysis and interpretation of large-scale biomedical data, helping biologists to quickly and cheaply gain confirmation of hypothesized relationships between biological entities. We set this question in the context of genome-wide association studies (GWAS), an actively emerging field that has contributed to the identification of many genes associated with multifactorial diseases. These studies allow the identification of groups of genes associated with the same phenotype, but provide no information about the relationships between these genes. Therefore, our objective is to leverage unsupervised text mining techniques, using text-based cosine similarity comparisons and clustering applied to candidate and random gene vectors, in order to augment the GWAS results. We propose a generic framework which we used to characterize the relationships between 10 genes reported as associated with asthma by a previous GWAS. The results of this experiment showed that the similarities between these 10 genes were significantly stronger than would be expected by chance (one-sided p-value<0.01). The clustering of observed and randomly selected genes also allowed us to generate hypotheses about potential functional relationships between these genes and thus contributed to the discovery of new candidate genes for asthma.

  14. Spectral signature verification using statistical analysis and text mining

    Science.gov (United States)

    DeCoster, Mallory E.; Firpi, Alexe H.; Jacobs, Samantha K.; Cone, Shelli R.; Tzeng, Nigel H.; Rodriguez, Benjamin M.

    2016-05-01

    In the spectral science community, numerous spectral signatures are stored in databases representative of many sample materials collected from a variety of spectrometers and spectroscopists. Due to the variety and variability of the spectra that comprise many spectral databases, it is necessary to establish a metric for validating the quality of spectral signatures. This has been an area of great discussion and debate in the spectral science community. This paper discusses a method that independently validates two different aspects of a spectral signature to arrive at a final qualitative assessment: the textual meta-data and numerical spectral data. Results associated with the spectral data stored in the Signature Database1 (SigDB) are proposed. The numerical data comprising a sample material's spectrum is validated based on statistical properties derived from an ideal population set. The quality of the test spectrum is ranked based on a spectral angle mapper (SAM) comparison to the mean spectrum derived from the population set. Additionally, the contextual data of a test spectrum is qualitatively analyzed using lexical analysis text mining. This technique analyzes the syntax of the meta-data to uncover local learning patterns and trends within the spectral data, indicative of the test spectrum's quality. Text mining applications have successfully been implemented for security2 (text encryption/decryption), biomedical3, and marketing4 applications. The text mining lexical analysis algorithm is trained on the meta-data patterns of a subset of high and low quality spectra, in order to have a model to apply to the entire SigDB data set. The statistical and textual methods combine to assess the quality of a test spectrum existing in a database without the need of an expert user. This method has been compared to other validation methods accepted by the spectral science community, and has provided promising results when a baseline spectral signature is

  15. Text Mining the History of Medicine.

    Directory of Open Access Journals (Sweden)

    Paul Thompson

    Full Text Available Historical text archives constitute a rich and diverse source of information, which is becoming increasingly readily accessible, due to large-scale digitisation efforts. However, it can be difficult for researchers to explore and search such large volumes of data in an efficient manner. Text mining (TM) methods can help, through their ability to recognise various types of semantic information automatically, e.g., instances of concepts (places, medical conditions, drugs, etc.), synonyms/variant forms of concepts, and relationships holding between concepts (which drugs are used to treat which medical conditions, etc.). TM analysis allows search systems to incorporate functionality such as automatic suggestions of synonyms of user-entered query terms, exploration of different concepts mentioned within search results or isolation of documents in which concepts are related in specific ways. However, applying TM methods to historical text can be challenging, according to differences and evolutions in vocabulary, terminology, language structure and style, compared to more modern text. In this article, we present our efforts to overcome the various challenges faced in the semantic analysis of published historical medical text dating back to the mid 19th century. Firstly, we used evidence from diverse historical medical documents from different periods to develop new resources that provide accounts of the multiple, evolving ways in which concepts, their variants and relationships amongst them may be expressed. These resources were employed to support the development of a modular processing pipeline of TM tools for the robust detection of semantic information in historical medical documents with varying characteristics. We applied the pipeline to two large-scale medical document archives covering wide temporal ranges as the basis for the development of a publicly accessible semantically-oriented search system. The novel resources are available for research

  16. Techniques, Applications and Challenging Issue in Text Mining

    Directory of Open Access Journals (Sweden)

    Shaidah Jusoh

    2012-11-01

    Full Text Available Text mining is a very exciting research area as it tries to discover knowledge from unstructured texts. These texts can be found on a desktop, intranets and the internet. The aim of this paper is to give an overview of text mining in the contexts of its techniques, application domains and the most challenging issue. The focus is given to fundamental methods of text mining, which include natural language processing and information extraction. This paper also gives a short review of domains which have employed text mining. The challenging issue in text mining, which is caused by the complexity of natural language, is also addressed in this paper.

  17. Text Mining the History of Medicine.

    Science.gov (United States)

    Thompson, Paul; Batista-Navarro, Riza Theresa; Kontonatsios, Georgios; Carter, Jacob; Toon, Elizabeth; McNaught, John; Timmermann, Carsten; Worboys, Michael; Ananiadou, Sophia

    2016-01-01

    Historical text archives constitute a rich and diverse source of information, which is becoming increasingly readily accessible, due to large-scale digitisation efforts. However, it can be difficult for researchers to explore and search such large volumes of data in an efficient manner. Text mining (TM) methods can help, through their ability to recognise various types of semantic information automatically, e.g., instances of concepts (places, medical conditions, drugs, etc.), synonyms/variant forms of concepts, and relationships holding between concepts (which drugs are used to treat which medical conditions, etc.). TM analysis allows search systems to incorporate functionality such as automatic suggestions of synonyms of user-entered query terms, exploration of different concepts mentioned within search results or isolation of documents in which concepts are related in specific ways. However, applying TM methods to historical text can be challenging, according to differences and evolutions in vocabulary, terminology, language structure and style, compared to more modern text. In this article, we present our efforts to overcome the various challenges faced in the semantic analysis of published historical medical text dating back to the mid 19th century. Firstly, we used evidence from diverse historical medical documents from different periods to develop new resources that provide accounts of the multiple, evolving ways in which concepts, their variants and relationships amongst them may be expressed. These resources were employed to support the development of a modular processing pipeline of TM tools for the robust detection of semantic information in historical medical documents with varying characteristics. We applied the pipeline to two large-scale medical document archives covering wide temporal ranges as the basis for the development of a publicly accessible semantically-oriented search system. The novel resources are available for research purposes, while

  18. Data, Text and Web Mining for Business Intelligence : A Survey

    Directory of Open Access Journals (Sweden)

    Abdul-Aziz Rashid Al-Azmi

    2013-04-01

    Full Text Available The Information and Communication Technologies revolution brought a digital world with huge amounts of data available. Enterprises use mining technologies to search vast amounts of data for vital insight and knowledge. Mining tools such as data mining, text mining, and web mining are used to find hidden knowledge in large databases or the Internet. Mining tools are automated software tools used to achieve business intelligence by finding hidden relations, and predicting future events from vast amounts of data. This uncovered knowledge helps in gaining competitive advantages, better customers’ relationships, and even fraud detection. In this survey, we’ll describe how these techniques work and how they are implemented. Furthermore, we shall discuss how business intelligence is achieved using these mining tools, then look into some case studies of success stories using mining tools. Finally, we shall demonstrate some of the main challenges to the mining technologies that limit their potential.

  19. DATA, TEXT, AND WEB MINING FOR BUSINESS INTELLIGENCE: A SURVEY

    Directory of Open Access Journals (Sweden)

    Abdul-Aziz Rashid

    2013-03-01

    Full Text Available The Information and Communication Technologies revolution brought a digital world with huge amounts of data available. Enterprises use mining technologies to search vast amounts of data for vital insight and knowledge. Mining tools such as data mining, text mining, and web mining are used to find hidden knowledge in large databases or the Internet. Mining tools are automated software tools used to achieve business intelligence by finding hidden relations, and predicting future events from vast amounts of data. This uncovered knowledge helps in gaining competitive advantages, better customers’ relationships, and even fraud detection. In this survey, we’ll describe how these techniques work and how they are implemented. Furthermore, we shall discuss how business intelligence is achieved using these mining tools, then look into some case studies of success stories using mining tools. Finally, we shall demonstrate some of the main challenges to the mining technologies that limit their potential.

  20. A comparison study on algorithms of detecting long forms for short forms in biomedical text

    Directory of Open Access Journals (Sweden)

    Wu Cathy H

    2007-11-01

    Full Text Available Abstract Motivation With more and more research dedicated to literature mining in the biomedical domain, more and more systems are available for people to choose from when building literature mining applications. In this study, we focus on one specific kind of literature mining task, i.e., detecting definitions of acronyms, abbreviations, and symbols in biomedical text. We denote acronyms, abbreviations, and symbols as short forms (SFs) and their corresponding definitions as long forms (LFs). The study was designed to answer the following questions: (i) how well a system performs in detecting LFs from novel text, (ii) what the coverage is for various terminological knowledge bases in including SFs as synonyms of their LFs, and (iii) how to combine results from various SF knowledge bases. Method We evaluated the following three publicly available detection systems in detecting LFs for SFs: (i) a handcrafted pattern/rule based system by Ao and Takagi, ALICE, (ii) a machine learning system by Chang et al., and (iii) a simple alignment-based program by Schwartz and Hearst. In addition, we investigated the conceptual coverage of two terminological knowledge bases: (i) the UMLS (the Unified Medical Language System), and (ii) the BioThesaurus (a thesaurus of names for all UniProt protein records). We also implemented a web interface that provides a virtual integration of various SF knowledge bases. Results We found that detection systems agree with each other on most cases, and the existing terminological knowledge bases have a good coverage of synonymous relationship for frequently defined LFs. The web interface allows people to detect SF definitions from text and to search several SF knowledge bases. Availability The web site is http://gauss.dbb.georgetown.edu/liblab/SFThesaurus.
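
The alignment-based approach of Schwartz and Hearst can be sketched as a right-to-left character match between the short form and the text preceding its parenthesis. This is a simplified rendering of the idea, not their full implementation:

```python
def find_long_form(short_form, preceding_text):
    # Right-to-left character alignment in the spirit of Schwartz & Hearst:
    # each short-form character must appear, in order, in the preceding
    # text, and the first character must begin a word.
    s_idx = len(short_form) - 1
    l_idx = len(preceding_text) - 1
    while s_idx >= 0:
        c = short_form[s_idx].lower()
        if not c.isalnum():          # skip punctuation in the short form
            s_idx -= 1
            continue
        while l_idx >= 0 and (preceding_text[l_idx].lower() != c or
                (s_idx == 0 and l_idx > 0 and
                 preceding_text[l_idx - 1].isalnum())):
            l_idx -= 1
        if l_idx < 0:
            return None              # no candidate long form found
        l_idx -= 1
        s_idx -= 1
    return preceding_text[l_idx + 1:]
```

For input like "we apply natural language processing (NLP)", passing the short form and the text before the parenthesis recovers "natural language processing".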

  1. BioLemmatizer: a lemmatization tool for morphological processing of biomedical text

    Directory of Open Access Journals (Sweden)

    Liu Haibin

    2012-04-01

    Full Text Available Abstract Background The wide variety of morphological variants of domain-specific technical terms contributes to the complexity of performing natural language processing of the scientific literature related to molecular biology. For morphological analysis of these texts, lemmatization has been actively applied in recent biomedical research. Results In this work, we developed a domain-specific lemmatization tool, BioLemmatizer, for the morphological analysis of biomedical literature. The tool focuses on the inflectional morphology of English and is based on the general English lemmatization tool MorphAdorner. The BioLemmatizer is further tailored to the biological domain through the incorporation of several published lexical resources. It retrieves lemmas based on the use of a word lexicon, and defines a set of rules that transform a word into a lemma if it is not encountered in the lexicon. An innovative aspect of the BioLemmatizer is the use of a hierarchical strategy for searching the lexicon, which enables the discovery of the correct lemma even if the input Part-of-Speech information is inaccurate. The BioLemmatizer achieves an accuracy of 97.5% in lemmatizing an evaluation set prepared from the CRAFT corpus, a collection of full-text biomedical articles, and an accuracy of 97.6% on the LLL05 corpus. The contribution of the BioLemmatizer to the accuracy improvement of a practical information extraction task is further demonstrated when it is used as a component in a biomedical text mining system. Conclusions The BioLemmatizer outperforms other tools when compared with eight existing lemmatizers. The BioLemmatizer is released as open source software and can be downloaded from http://biolemmatizer.sourceforge.net.
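
    The lexicon-plus-rules design described above can be caricatured in a few lines. The toy lexicon and suffix rules below are invented for illustration (they are not BioLemmatizer's actual resources); the point is the order of operations: lexicon lookup first, transformation rules as a fallback:

    ```python
    # Toy lexicon and ordered suffix-transformation rules (illustrative only)
    LEXICON = {"mice": "mouse", "analyses": "analysis", "genes": "gene"}
    RULES = [("ies", "y"), ("ses", "s"), ("s", "")]

    def lemmatize(word):
        w = word.lower()
        if w in LEXICON:                       # step 1: lexicon lookup
            return LEXICON[w]
        for suffix, repl in RULES:             # step 2: first matching rule wins
            if w.endswith(suffix) and len(w) > len(suffix):
                return w[: -len(suffix)] + repl
        return w                               # step 3: return unchanged
    ```

    For example, `lemmatize("analyses")` is resolved by the lexicon, while `lemmatize("proteins")` falls through to the `("s", "")` rule.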

  2. Using text mining techniques to extract phenotypic information from the PhenoCHF corpus

    OpenAIRE

    Alnazzawi, Noha; Thompson, Paul; Batista-Navarro, Riza; Ananiadou, Sophia

    2015-01-01

    Background Phenotypic information locked away in unstructured narrative text presents significant barriers to information accessibility, both for clinical practitioners and for computerised applications used for clinical research purposes. Text mining (TM) techniques have previously been applied successfully to extract different types of information from text in the biomedical domain. They have the potential to be extended to allow the extraction of information relating to phenotypes from fre...

  3. Text Mining Approaches To Extract Interesting Association Rules from Text Documents

    Directory of Open Access Journals (Sweden)

    Vishwadeepak Singh Baghela

    2012-05-01

    Full Text Available A handful of text data mining approaches are available to extract potentially useful information and associations from large amounts of text data. The term data mining is used for methods that analyze data with the objective of finding rules and patterns describing the characteristic properties of the data. The mined information is typically represented as a model of the semantic structure of the dataset, where the model may be used on new data for prediction or classification. In general, data mining deals with structured data (for example, relational databases), whereas text presents special characteristics and is unstructured. Unstructured data are totally different from databases, where mining techniques are usually applied and structured data are managed. Text mining can work with unstructured or semi-structured data sets. A brief review of some recent research related to mining associations from text documents is presented in this paper.
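
    The association-rule idea the abstract refers to can be illustrated at toy scale: represent each document as a set of terms, count co-occurring pairs, and keep rules that meet minimum support and confidence thresholds. Documents and threshold values below are invented for illustration:

    ```python
    from itertools import combinations
    from collections import Counter

    # Toy corpus: each document reduced to its set of terms
    docs = [
        {"gene", "protein", "cancer"},
        {"gene", "protein"},
        {"protein", "pathway"},
        {"gene", "cancer"},
    ]

    def association_rules(docs, min_support=0.5, min_confidence=0.6):
        n = len(docs)
        single = Counter(t for d in docs for t in d)
        pair = Counter(p for d in docs for p in combinations(sorted(d), 2))
        rules = []
        for (a, b), count in pair.items():
            if count / n >= min_support:            # frequent pair
                for x, y in ((a, b), (b, a)):       # try both rule directions
                    conf = count / single[x]        # P(y | x)
                    if conf >= min_confidence:
                        rules.append((x, y, count / n, conf))
        return rules
    ```

    On this corpus the rule "cancer → gene" comes out with support 0.5 and confidence 1.0, since every document mentioning cancer also mentions gene.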

  4. Text mining in livestock animal science: introducing the potential of text mining to animal sciences.

    Science.gov (United States)

    Sahadevan, S; Hofmann-Apitius, M; Schellander, K; Tesfaye, D; Fluck, J; Friedrich, C M

    2012-10-01

    In biological research, establishing the prior art by searching and collecting information already present in the domain is as important as the experiments themselves. To obtain a complete overview of the relevant knowledge, researchers mainly rely on 2 major information sources: i) various biological databases and ii) scientific publications in the field. The major difference between the 2 information sources is that information from databases is readily available, typically well structured and condensed. The information content in scientific literature is largely unstructured; that is, dispersed among the many different sections of scientific text. The traditional method of information extraction from scientific literature occurs by generating a list of relevant publications in the field of interest and manually scanning these texts for relevant information, which is very time consuming. It is more than likely that in using this "classical" approach the researcher misses some relevant information mentioned in the literature or has to go through biological databases to extract further information. Text mining and named entity recognition methods have already been used in human genomics and related fields as a solution to this problem. These methods can process and extract information from large volumes of scientific text. Text mining is defined as the automatic extraction of previously unknown and potentially useful information from text. Named entity recognition (NER) is defined as the method of identifying named entities (names of real-world objects; for example, gene/protein names, drugs, enzymes) in text. In animal sciences, text mining and related methods have only briefly been used in murine genomics and associated fields, leaving behind other fields of animal sciences, such as livestock genomics. The aim of this work was to develop an information retrieval platform in the livestock domain focusing on livestock publications and the recognition of relevant data from

  5. Analysing Customer Opinions with Text Mining Algorithms

    Science.gov (United States)

    Consoli, Domenico

    2009-08-01

    Knowing what the customer thinks of a particular product/service helps top management to introduce improvements in processes and products, thus differentiating the company from its competitors and gaining competitive advantage. The customers, with their preferences, determine the success or failure of a company. In order to know the opinions of customers, we can use technologies available from Web 2.0 (blogs, wikis, forums, chat, social networking, social commerce). From these web sites, useful information must be extracted for strategic purposes, using techniques of sentiment analysis or opinion mining.
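
    The simplest form of the opinion mining mentioned above is lexicon-based polarity scoring: count positive and negative cue words in a post and report the sign of the balance. The cue lexicon below is a tiny invented sample, not a published resource:

    ```python
    # Invented toy opinion lexicon (illustrative only)
    POSITIVE = {"great", "love", "excellent", "fast"}
    NEGATIVE = {"slow", "broken", "terrible", "hate"}

    def polarity(post):
        """Classify a customer post by counting lexicon cue words."""
        words = post.lower().split()
        score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
        return "positive" if score > 0 else "negative" if score < 0 else "neutral"
    ```

    Real systems add negation handling, weighting, and machine-learned models, but this captures the basic signal extraction step.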

  6. Biocuration workflows and text mining: overview of the BioCreative 2012 Workshop Track II.

    Science.gov (United States)

    Lu, Zhiyong; Hirschman, Lynette

    2012-01-01

    Manual curation of data from the biomedical literature is a rate-limiting factor for many expert curated databases. Despite the continuing advances in biomedical text mining and the pressing needs of biocurators for better tools, few existing text-mining tools have been successfully integrated into production literature curation systems such as those used by the expert curated databases. To close this gap and better understand all aspects of literature curation, we invited submissions of written descriptions of curation workflows from expert curated databases for the BioCreative 2012 Workshop Track II. We received seven qualified contributions, primarily from model organism databases. Based on these descriptions, we identified commonalities and differences across the workflows, the common ontologies and controlled vocabularies used and the current and desired uses of text mining for biocuration. Compared to a survey done in 2009, our 2012 results show that many more databases are now using text mining in parts of their curation workflows. In addition, the workshop participants identified text-mining aids for finding gene names and symbols (gene indexing), prioritization of documents for curation (document triage) and ontology concept assignment as those most desired by the biocurators. DATABASE URL: http://www.biocreative.org/tasks/bc-workshop-2012/workflow/.

  7. Event-based text mining for biology and functional genomics.

    Science.gov (United States)

    Ananiadou, Sophia; Thompson, Paul; Nawaz, Raheel; McNaught, John; Kell, Douglas B

    2015-05-01

    The assessment of genome function requires a mapping between genome-derived entities and biochemical reactions, and the biomedical literature represents a rich source of information about reactions between biological components. However, the increasingly rapid growth in the volume of literature provides both a challenge and an opportunity for researchers to isolate information about reactions of interest in a timely and efficient manner. In response, recent text mining research in the biology domain has been largely focused on the identification and extraction of 'events', i.e. categorised, structured representations of relationships between biochemical entities, from the literature. Functional genomics analyses necessarily encompass events as so defined. Automatic event extraction systems facilitate the development of sophisticated semantic search applications, allowing researchers to formulate structured queries over extracted events, so as to specify the exact types of reactions to be retrieved. This article provides an overview of recent research into event extraction. We cover annotated corpora on which systems are trained, systems that achieve state-of-the-art performance and details of the community shared tasks that have been instrumental in increasing the quality, coverage and scalability of recent systems. Finally, several concrete applications of event extraction are covered, together with emerging directions of research.

  8. Mining knowledge from text repositories using information extraction: A review

    Indian Academy of Sciences (India)

    Sandeep R Sirsat; Dr Vinay Chavan; Dr Shrinivas P Deshpande

    2014-02-01

    There are two approaches to mining text from online repositories. First, when the knowledge to be discovered is expressed directly in the documents to be mined, Information Extraction (IE) alone can serve as an effective tool for such text mining. Second, when the documents contain concrete data in unstructured form rather than abstract knowledge, Information Extraction (IE) can be used to first transform the unstructured data in the document corpus into a structured database, and then some state-of-the-art data mining algorithms/tools can be used to identify abstract patterns in this extracted data. This paper presents a review of several methods related to these two approaches.

  9. MINING TEXTS TO UNDERSTAND CUSTOMERS' IMAGE OF BRANDS

    Directory of Open Access Journals (Sweden)

    Hyung Jun Ahn

    2013-06-01

    Full Text Available Text mining is becoming increasingly important in understanding customers and markets these days. This paper presents a method of mining texts about customer sentiments using a network analysis technique. A data set collected about two global mobile device manufacturers was used to test the method. The analysis results show that the method can be effectively used to extract key sentiments in the customers' texts.

  10. Cultural text mining: using text mining to map the emergence of transnational reference cultures in public media repositories

    NARCIS (Netherlands)

    Pieters, Toine; Verheul, Jaap

    2014-01-01

    This paper discusses the research project Translantis, which uses innovative technologies for cultural text mining to analyze large repositories of digitized public media, such as newspapers and journals.1 The Translantis research team uses and develops the text mining tool Texcavator, which is base

  11. Text mining of web-based medical content

    CERN Document Server

    Neustein, Amy

    2014-01-01

    Text Mining of Web-Based Medical Content examines web mining for extracting useful information that can be used for treating and monitoring the healthcare of patients. This work provides methodological approaches to designing mapping tools that exploit data found in social media postings. Specific linguistic features of medical postings are analyzed vis-a-vis available data extraction tools for culling useful information.

  12. HPIminer: A text mining system for building and visualizing human protein interaction networks and pathways.

    Science.gov (United States)

    Subramani, Suresh; Kalpana, Raja; Monickaraj, Pankaj Moses; Natarajan, Jeyakumar

    2015-04-01

    The knowledge of protein-protein interactions (PPI) and their related pathways is equally important for understanding the biological functions of the living cell. Such information on human proteins is highly desirable for understanding the mechanisms of several diseases such as cancer, diabetes, and Alzheimer's disease. Because much of that information is buried in biomedical literature, an automated text mining system for visualizing human PPI and pathways is highly desirable. In this paper, we present HPIminer, a text mining system for visualizing human protein interactions and pathways from biomedical literature. HPIminer extracts human PPI information and PPI pairs from biomedical literature, and visualizes their associated interactions, networks and pathways using two curated databases, HPRD and KEGG. To our knowledge, HPIminer is the first system to build interaction networks from literature as well as curated databases. Further, new interactions mined only from literature and not reported earlier in databases are highlighted as new. A comparative study with other similar tools shows that the resultant network is more informative and provides additional information on interacting proteins and their associated networks.

  13. pubmed.mineR: an R package with text-mining algorithms to analyse PubMed abstracts.

    Science.gov (United States)

    Rani, Jyoti; Shah, A B Rauf; Ramachandran, Srinivasan

    2015-10-01

    The PubMed literature database is a valuable source of information for scientific research. It is rich in biomedical literature, with more than 24 million citations. Data-mining of voluminous literature is a challenging task. Although several text-mining algorithms have been developed in recent years with a focus on data visualization, they have limitations: they can be slow, are rigid, and are not available as open source. We have developed an R package, pubmed.mineR, wherein we have combined the advantages of existing algorithms, overcome their limitations, and offer user flexibility and links to other packages in Bioconductor and the Comprehensive R Archive Network (CRAN) in order to expand the user's capabilities for executing multifaceted approaches. Three case studies are presented, namely, 'Evolving role of diabetes educators', 'Cancer risk assessment' and 'Dynamic concepts on disease and comorbidity', to illustrate the use of pubmed.mineR. The package generally runs fast, with small elapsed times in regular workstations, even on large corpus sizes and with compute-intensive functions. The pubmed.mineR package is available at http://cran.r-project.org/web/packages/pubmed.mineR.

  14. Text mining for improved exposure assessment

    Science.gov (United States)

    Baker, Simon; Silins, Ilona; Guo, Yufan; Stenius, Ulla; Korhonen, Anna; Berglund, Marika

    2017-01-01

    Chemical exposure assessments are based on information collected via different methods, such as biomonitoring, personal monitoring, environmental monitoring and questionnaires. The vast amount of chemical-specific exposure information available from web-based databases, such as PubMed, is undoubtedly a great asset to the scientific community. However, manual retrieval of relevant published information is an extremely time-consuming task, and overviewing the data is nearly impossible. Here, we present the development of an automatic classifier for chemical exposure information. First, nearly 3700 abstracts were manually annotated by an expert in exposure sciences according to a taxonomy exclusively created for exposure information. Natural Language Processing (NLP) techniques were used to extract semantic and syntactic features relevant to chemical exposure text. Using these features, we trained a supervised machine learning algorithm to automatically classify PubMed abstracts according to the exposure taxonomy. The resulting classifier demonstrates good performance in the intrinsic evaluation. We also show that the classifier improves information retrieval of chemical exposure data compared to keyword-based PubMed searches. Case studies demonstrate that the classifier can be used to assist researchers by facilitating information retrieval and classification, enabling data gap recognition and overviewing available scientific literature using chemical-specific publication profiles. Finally, we identify challenges to be addressed in future development of the system. PMID:28257498
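
    The pipeline described (annotated abstracts → features → supervised classifier) can be sketched at toy scale. Below, a tiny bag-of-words Naive Bayes stands in for the paper's richer NLP-feature model; the training abstracts and category labels are invented examples, not the authors' taxonomy:

    ```python
    import math
    from collections import Counter, defaultdict

    # Invented toy training data: (abstract snippet, exposure category)
    train = [
        ("blood lead levels measured in children", "biomonitoring"),
        ("urinary cadmium measured in workers", "biomonitoring"),
        ("air sampling of benzene at the workplace", "environmental monitoring"),
        ("outdoor air concentrations of particulate matter", "environmental monitoring"),
    ]

    class NaiveBayes:
        def fit(self, data):
            self.class_counts = Counter(label for _, label in data)
            self.word_counts = defaultdict(Counter)
            for text, label in data:
                self.word_counts[label].update(text.split())
            self.vocab = {w for c in self.word_counts.values() for w in c}
            return self

        def predict(self, text):
            def log_prob(label):
                total = sum(self.word_counts[label].values())
                lp = math.log(self.class_counts[label] / sum(self.class_counts.values()))
                for w in text.split():
                    # Laplace smoothing over the shared vocabulary
                    lp += math.log((self.word_counts[label][w] + 1) / (total + len(self.vocab)))
                return lp
            return max(self.class_counts, key=log_prob)

    clf = NaiveBayes().fit(train)
    ```

    A held-out snippet such as "cadmium levels in children" is then routed to the class whose word distribution explains it best.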

  15. Text mining in cancer gene and pathway prioritization.

    Science.gov (United States)

    Luo, Yuan; Riedlinger, Gregory; Szolovits, Peter

    2014-01-01

    Prioritization of cancer implicated genes has received growing attention as an effective way to reduce wet lab cost by computational analysis that ranks candidate genes according to the likelihood that experimental verifications will succeed. A multitude of gene prioritization tools have been developed, each integrating different data sources covering gene sequences, differential expressions, function annotations, gene regulations, protein domains, protein interactions, and pathways. This review places existing gene prioritization tools against the backdrop of an integrative Omic hierarchy view toward cancer and focuses on the analysis of their text mining components. We explain the relatively slow progress of text mining in gene prioritization, identify several challenges to current text mining methods, and highlight a few directions where more effective text mining algorithms may improve the overall prioritization task and where prioritizing the pathways may be more desirable than prioritizing only genes.

  16. Application of text mining for customer evaluations in commercial banking

    Science.gov (United States)

    Tan, Jing; Du, Xiaojiang; Hao, Pengpeng; Wang, Yanbo J.

    2015-07-01

    Nowadays customer attrition is increasingly serious in commercial banks. To combat this problem comprehensively, mining customer evaluation texts is as important as mining customer structured data. In order to extract hidden information from customer evaluations, Textual Feature Selection, Classification and Association Rule Mining are necessary techniques. This paper presents all three techniques by using Chinese Word Segmentation, C5.0 and Apriori, and a set of experiments was run on a collection of real textual data comprising 823 customer evaluations taken from a Chinese commercial bank. Results, consequent solutions, and some advice for the commercial bank are given in this paper.

  17. Managing biological networks by using text mining and computer-aided curation

    Science.gov (United States)

    Yu, Seok Jong; Cho, Yongseong; Lee, Min-Ho; Lim, Jongtae; Yoo, Jaesoo

    2015-11-01

    In order to understand a biological mechanism in a cell, a researcher should collect a huge number of protein interactions with experimental data from experiments and the literature. Text mining systems that extract biological interactions from papers have been used to construct biological networks for a few decades. Even though text mining of the literature is necessary to construct a biological network, few systems with a text mining tool are available for biologists who want to construct their own biological networks. We have developed a biological network construction system called BioKnowledge Viewer that can generate a biological interaction network by using a text mining tool and biological taggers. It also provides Boolean simulation software, offering a biological modeling system to simulate the model that is made with the text mining tool. A user can download PubMed articles and construct a biological network by using the Multi-level Knowledge Emergence Model (KMEM), MetaMap, and A Biomedical Named Entity Recognizer (ABNER) as text mining tools. To evaluate the system, we constructed an aging-related biological network that consists of 9,415 nodes (genes) by using manual curation. With network analysis, we found that several genes, including JNK, AP-1, and BCL-2, were highly connected in the aging biological network. We provide a semi-automatic curation environment so that users can obtain a graph database for managing text mining results that are generated in the server system and can navigate the network with BioKnowledge Viewer, which is freely available at http://bioknowledgeviewer.kisti.re.kr.

  18. Science and Technology Text Mining: Cross-Disciplinary Innovation

    Science.gov (United States)

    2007-11-02


  19. Grid-based Support for Different Text Mining Tasks

    Directory of Open Access Journals (Sweden)

    Martin Sarnovský

    2009-12-01

    Full Text Available This paper provides an overview of our research activities aimed at efficient use of Grid infrastructure to solve various text mining tasks. Grid-enabling of various text mining tasks was mainly driven by the increasing volume of processed data. Utilizing the Grid services approach therefore enables various text mining scenarios to be performed and also opens ways to design distributed modifications of existing methods. In particular, some parts of the mining process can significantly benefit from the decomposition paradigm; in this study we present our approach to data-driven decomposition of a decision tree building algorithm, a clustering algorithm based on self-organizing maps, and its application in a conceptual model building task using the FCA-based algorithm. The work presented in this paper is rather to be considered a 'proof of concept' for the design and implementation of decomposition methods, as we performed the experiments mostly on standard textual databases.

  20. Approximate subgraph matching-based literature mining for biomedical events and relations.

    Directory of Open Access Journals (Sweden)

    Haibin Liu

    Full Text Available The biomedical text mining community has focused on developing techniques to automatically extract important relations between biological components and semantic events involving genes or proteins from literature. In this paper, we propose a novel approach for mining relations and events in the biomedical literature using approximate subgraph matching. Extraction of such knowledge is performed by searching for an approximate subgraph isomorphism between key contextual dependencies and input sentence graphs. Our approach significantly increases the chance of retrieving relations or events encoded within complex dependency contexts by introducing error tolerance into the graph matching process, while maintaining the extraction precision at a high level. When evaluated on practical tasks, it achieves a 51.12% F-score in extracting nine types of biological events on the GE task of the BioNLP-ST 2011 and an 84.22% F-score in detecting protein-residue associations. The performance is comparable to the reported systems across these tasks, and thus demonstrates the generalizability of our proposed approach.
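
    The core idea above (tolerate a bounded number of mismatches when embedding a pattern graph of key dependencies into a sentence graph) can be shown with a brute-force toy version. The published algorithm is far more efficient and uses a richer cost model; node labels and the tolerance below are invented for illustration:

    ```python
    from itertools import permutations

    def approx_match(pattern_edges, pattern_labels, graph_edges, graph_labels, tol=1):
        """Find an injective, label-preserving mapping of pattern nodes onto
        graph nodes with at most `tol` pattern edges missing in the graph."""
        p_nodes = sorted(pattern_labels)
        candidates = list(graph_labels)
        for mapping in permutations(candidates, len(p_nodes)):
            m = dict(zip(p_nodes, mapping))
            if any(pattern_labels[p] != graph_labels[m[p]] for p in p_nodes):
                continue                      # node labels must agree
            missing = sum((m[a], m[b]) not in graph_edges and
                          (m[b], m[a]) not in graph_edges
                          for a, b in pattern_edges)
            if missing <= tol:                # error-tolerant acceptance
                return m, missing
        return None

    # Toy event pattern: a trigger verb linked to two protein arguments;
    # the sentence graph is missing one of the two dependency edges.
    pattern_edges = {(0, 1), (0, 2)}
    pattern_labels = {0: "VERB", 1: "PROT", 2: "PROT"}
    graph_edges = {(10, 11)}
    graph_labels = {10: "VERB", 11: "PROT", 12: "PROT"}
    result = approx_match(pattern_edges, pattern_labels, graph_edges, graph_labels, tol=1)
    ```

    With `tol=0` this example would fail; allowing one missing edge is precisely what lets the match survive a complex or slightly divergent dependency context.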

  1. Research on Online Topic Evolutionary Pattern Mining in Text Streams

    Directory of Open Access Journals (Sweden)

    Qian Chen

    2014-06-01

    Full Text Available Text streams are a class of ubiquitous data that come in over time and are so extraordinarily large in scale that we often lose track of them. Basically, text streams form the fundamental source of information that can be used to detect the semantic topics that individuals and organizations are interested in, as well as to detect burst events within communities. Thus, an intelligent system that can automatically extract interesting temporal patterns from text streams is sorely needed; however, evolutionary pattern mining is not well addressed in previous work. In this paper, we start a tentative study of a topic evolutionary pattern mining system by fully discussing the properties of a topic after formally defining it, and by proposing a common, formal framework for analyzing text streams. We also define three basic tasks, namely (1) online topic detection, (2) event evolution extraction, and (3) topic property life cycle, and propose three corresponding mining algorithms. Finally, we exemplify the application of evolutionary pattern mining and show that interesting patterns can be extracted from a newswire dataset.

  2. Mining the Text: 34 Text Features that Can Ease or Obstruct Text Comprehension and Use

    Science.gov (United States)

    White, Sheida

    2012-01-01

    This article presents 34 characteristics of texts and tasks ("text features") that can make continuous (prose), noncontinuous (document), and quantitative texts easier or more difficult for adolescents and adults to comprehend and use. The text features were identified by examining the assessment tasks and associated texts in the national…

  3. Negation scope and spelling variation for text-mining of Danish electronic patient records

    DEFF Research Database (Denmark)

    Thomas, Cecilia Engel; Jensen, Peter Bjødstrup; Werge, Thomas;

    2014-01-01

    Electronic patient records are a potentially rich data source for knowledge extraction in biomedical research. Here we present a method based on the ICD10 system for text-mining of Danish health records. We have evaluated how adding functionalities to a baseline text-mining tool affected...... and negation scope. The most important functionality of the tool was handling of spelling variation, which greatly increased the number of phenotypes that could be identified in the records, without noticeably decreasing the precision. Further, our results show that different negations have different optimal...... the overall performance. The purpose of the tool was to create enriched phenotypic profiles for each patient in a corpus consisting of records from 5,543 patients at a Danish psychiatric hospital, by assigning each patient additional ICD10 codes based on freetext parts of these records. The tool...
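
    The two functionalities the record highlights, spelling variation and negation scope, can be sketched together: match phenotype terms fuzzily (here, edit distance at most 1) and mark a match as negated when a cue word appears within a fixed token window before it. The terms, cues, and window size below are invented for illustration, not the tool's actual resources:

    ```python
    import re

    NEGATION_CUES = {"no", "not", "without"}       # invented cue list
    TERMS = {"depression", "psychosis"}            # invented phenotype terms
    SCOPE = 3                                      # negation scope in tokens

    def edit_distance_leq1(a, b):
        """True if a and b are identical or one edit apart."""
        if a == b:
            return True
        if abs(len(a) - len(b)) > 1:
            return False
        if len(a) == len(b):                       # one substitution
            return sum(x != y for x, y in zip(a, b)) == 1
        longer, shorter = (a, b) if len(a) > len(b) else (b, a)
        return any(longer[:i] + longer[i + 1:] == shorter
                   for i in range(len(longer)))    # one insertion/deletion

    def find_phenotypes(text):
        tokens = re.findall(r"[a-z]+", text.lower())
        hits = []
        for i, tok in enumerate(tokens):
            for term in TERMS:
                if edit_distance_leq1(tok, term):  # absorbs misspellings
                    window = tokens[max(0, i - SCOPE):i]
                    hits.append((term, any(t in NEGATION_CUES for t in window)))
        return hits
    ```

    A record phrase like "has depresion but no psychosis" then yields a positive depression hit despite the misspelling, and a negated psychosis hit.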

  4. DISEASES: text mining and data integration of disease-gene associations.

    Science.gov (United States)

    Pletscher-Frankild, Sune; Pallejà, Albert; Tsafou, Kalliopi; Binder, Janos X; Jensen, Lars Juhl

    2015-03-01

    Text mining is a flexible technology that can be applied to numerous different tasks in biology and medicine. We present a system for extracting disease-gene associations from biomedical abstracts. The system consists of a highly efficient dictionary-based tagger for named entity recognition of human genes and diseases, which we combine with a scoring scheme that takes into account co-occurrences both within and between sentences. We show that this approach is able to extract half of all manually curated associations with a false positive rate of only 0.16%. Nonetheless, text mining should not stand alone, but be combined with other types of evidence. For this reason, we have developed the DISEASES resource, which integrates the results from text mining with manually curated disease-gene associations, cancer mutation data, and genome-wide association studies from existing databases. The DISEASES resource is accessible through a web interface at http://diseases.jensenlab.org/, where the text-mining software and all associations are also freely available for download.
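
    The two-level co-occurrence scoring described above (within-sentence mentions weighted higher than within-abstract mentions) can be caricatured as follows. The entity lists, weights, and naive sentence splitter are invented for illustration; the published scheme also normalizes against random expectation:

    ```python
    from collections import Counter

    GENES = {"BRCA1", "TP53"}                      # toy dictionary tagger
    DISEASES = {"breast cancer", "glioma"}
    W_SENTENCE, W_ABSTRACT = 1.0, 0.3              # invented weights

    def score_pairs(abstracts):
        """Score gene-disease pairs by weighted co-occurrence counts."""
        scores = Counter()
        for abstract in abstracts:
            seen_g, seen_d = set(), set()
            for sentence in abstract.split("."):   # naive splitter
                g = {x for x in GENES if x in sentence}
                d = {x for x in DISEASES if x in sentence}
                for gene in g:
                    for disease in d:
                        scores[(gene, disease)] += W_SENTENCE
                seen_g |= g
                seen_d |= d
            for gene in seen_g:                    # abstract-level bonus
                for disease in seen_d:
                    scores[(gene, disease)] += W_ABSTRACT
        return scores
    ```

    A pair mentioned in the same sentence thus outranks one whose mentions are merely scattered across the same abstract.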

  5. Text mining for biology--the way forward

    DEFF Research Database (Denmark)

    Altman, Russ B; Bergman, Casey M; Blake, Judith;

    2008-01-01

    This article collects opinions from leading scientists about how text mining can provide better access to the biological literature, how the scientific community can help with this process, what the next steps are, and what role future BioCreative evaluations can play. The responses identify...

  6. Using Text Mining to Characterize Online Discussion Facilitation

    Science.gov (United States)

    Ming, Norma; Baumer, Eric

    2011-01-01

    Facilitating class discussions effectively is a critical yet challenging component of instruction, particularly in online environments where student and faculty interaction is limited. Our goals in this research were to identify facilitation strategies that encourage productive discussion, and to explore text mining techniques that can help…

  7. Text Mining of Journal Articles for Sleep Disorder Terminologies.

    Directory of Open Access Journals (Sweden)

    Calvin Lam

    Full Text Available Research on publication trends in journal articles on sleep disorders (SDs) and the associated methodologies by using text mining has been limited. The present study involved text mining for terms to determine the publication trends in sleep-related journal articles published during 2000-2013 and to identify associations between SD and methodology terms as well as conducting statistical analyses of the text mining findings. SD and methodology terms were extracted from 3,720 sleep-related journal articles in the PubMed database by using MetaMap. The extracted data set was analyzed using hierarchical cluster analyses and adjusted logistic regression models to investigate publication trends and associations between SD and methodology terms. MetaMap had a text mining precision, recall, and false positive rate of 0.70, 0.77, and 11.51%, respectively. The most common SD term was breathing-related sleep disorder, whereas narcolepsy was the least common. Cluster analyses showed similar methodology clusters for each SD term, except narcolepsy. The logistic regression models showed an increasing prevalence of insomnia, parasomnia, and other sleep disorders but a decreasing prevalence of breathing-related sleep disorder during 2000-2013. Different SD terms were positively associated with different methodology terms regarding research design terms, measure terms, and analysis terms. Insomnia-, parasomnia-, and other sleep disorder-related articles showed an increasing publication trend, whereas those related to breathing-related sleep disorder showed a decreasing trend. Furthermore, experimental studies more commonly focused on hypersomnia and other SDs and less commonly on insomnia, breathing-related sleep disorder, narcolepsy, and parasomnia. Thus, text mining may facilitate the exploration of the publication trends in SDs and the associated methodologies.

  8. Citation Mining: Integrating Text Mining and Bibliometrics for Research User Profiling.

    Science.gov (United States)

    Kostoff, Ronald N.; del Rio, J. Antonio; Humenik, James A.; Garcia, Esther Ofilia; Ramirez, Ana Maria

    2001-01-01

    Discusses the importance of identifying the users and impact of research, and describes an approach for identifying the pathways through which research can impact other research, technology development, and applications. Describes a study that used citation mining, an integration of citation bibliometrics and text mining, on articles from the…

  9. Text mining improves prediction of protein functional sites.

    Directory of Open Access Journals (Sweden)

    Karin M Verspoor

    Full Text Available We present an approach that integrates protein structure analysis and text mining for protein functional site prediction, called LEAP-FS (Literature Enhanced Automated Prediction of Functional Sites). The structure analysis was carried out using Dynamics Perturbation Analysis (DPA), which predicts functional sites at control points where interactions greatly perturb protein vibrations. The text mining extracts mentions of residues in the literature, and predicts that residues mentioned are functionally important. We assessed the significance of each of these methods by analyzing their performance in finding known functional sites (specifically, small-molecule binding sites and catalytic sites) in about 100,000 publicly available protein structures. The DPA predictions recapitulated many of the functional site annotations and preferentially recovered binding sites annotated as biologically relevant vs. those annotated as potentially spurious. The text-based predictions were also substantially supported by the functional site annotations: compared to other residues, residues mentioned in text were roughly six times more likely to be found in a functional site. The overlap of predictions with annotations improved when the text-based and structure-based methods agreed. Our analysis also yielded new high-quality predictions of many functional site residues that were not catalogued in the curated data sources we inspected. We conclude that both DPA and text mining independently provide valuable high-throughput protein functional site predictions, and that integrating the two methods using LEAP-FS further improves the quality of these predictions.

  10. Collaborative mining and interpretation of large-scale data for biomedical research insights.

    Directory of Open Access Journals (Sweden)

    Georgia Tsiliki

    Full Text Available Biomedical research is becoming increasingly interdisciplinary and collaborative in nature. Researchers need to collaborate efficiently and effectively and make decisions by meaningfully assembling, mining, and analyzing the available large-scale volumes of complex, multi-faceted data residing in different sources. In line with related research showing that, in spite of recent advances in data mining and computational analysis, humans can easily detect patterns that computer algorithms may have difficulty finding, this paper reports on the practical use of an innovative web-based collaboration support platform in a biomedical research context. Arguing that dealing with data-intensive and cognitively complex settings is not a technical problem alone, the proposed platform adopts a hybrid approach that builds on the synergy between machine and human intelligence to facilitate the underlying sense-making and decision-making processes. User experience shows that the platform enables more informed and quicker decisions by displaying the aggregated information according to users' needs, while also exploiting the associated human intelligence.

  11. Text mining improves prediction of protein functional sites.

    Science.gov (United States)

    Verspoor, Karin M; Cohn, Judith D; Ravikumar, Komandur E; Wall, Michael E

    2012-01-01

    We present an approach that integrates protein structure analysis and text mining for protein functional site prediction, called LEAP-FS (Literature Enhanced Automated Prediction of Functional Sites). The structure analysis was carried out using Dynamics Perturbation Analysis (DPA), which predicts functional sites at control points where interactions greatly perturb protein vibrations. The text mining extracts mentions of residues in the literature, and predicts that residues mentioned are functionally important. We assessed the significance of each of these methods by analyzing their performance in finding known functional sites (specifically, small-molecule binding sites and catalytic sites) in about 100,000 publicly available protein structures. The DPA predictions recapitulated many of the functional site annotations and preferentially recovered binding sites annotated as biologically relevant vs. those annotated as potentially spurious. The text-based predictions were also substantially supported by the functional site annotations: compared to other residues, residues mentioned in text were roughly six times more likely to be found in a functional site. The overlap of predictions with annotations improved when the text-based and structure-based methods agreed. Our analysis also yielded new high-quality predictions of many functional site residues that were not catalogued in the curated data sources we inspected. We conclude that both DPA and text mining independently provide valuable high-throughput protein functional site predictions, and that integrating the two methods using LEAP-FS further improves the quality of these predictions.
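The reported six-fold enrichment of literature-mentioned residues in functional sites can be expressed as a ratio of conditional probabilities; a minimal sketch with hypothetical counts (invented for illustration, not the study's data):

```python
def enrichment(mentioned_in_site, mentioned_total, site_residues, total_residues):
    """Ratio of P(functional site | mentioned in text) to the
    background rate P(functional site) over all residues."""
    # Cross-multiplied form avoids intermediate floating-point division
    return (mentioned_in_site * total_residues) / (mentioned_total * site_residues)

# Hypothetical counts: 100 mentioned residues, 30 of them in sites;
# 5,000 site residues among 100,000 total.
print(enrichment(30, 100, 5_000, 100_000))  # 6.0
```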

  12. Unsupervised mining of frequent tags for clinical eligibility text indexing.

    Science.gov (United States)

    Miotto, Riccardo; Weng, Chunhua

    2013-12-01

    Clinical text, such as clinical trial eligibility criteria, is largely underused in state-of-the-art medical search engines due to difficulties of accurate parsing. This paper proposes a novel methodology to derive a semantic index for clinical eligibility documents based on a controlled vocabulary of frequent tags, which are automatically mined from the text. We applied this method to eligibility criteria on ClinicalTrials.gov and report that frequent tags (1) define an effective and efficient index of clinical trials and (2) are unlikely to grow radically when the repository increases. We proposed to apply the semantic index to filter clinical trial search results and we concluded that frequent tags reduce the result space more efficiently than an uncontrolled set of UMLS concepts. Overall, unsupervised mining of frequent tags from clinical text leads to an effective semantic index for the clinical eligibility documents and promotes their computational reuse.
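Frequent-tag mining of the kind described can be sketched as simple n-gram counting over eligibility texts; a toy example (the criteria below are invented, and a real system would normalize terms, e.g. against a controlled vocabulary such as UMLS):

```python
from collections import Counter
import re

# Invented eligibility criteria, for illustration only
criteria = [
    "Age 18 years or older with type 2 diabetes",
    "Pregnant women excluded; age 18 years or older",
    "Type 2 diabetes diagnosed within 5 years",
]

def mine_frequent_tags(docs, min_freq=2):
    """Count word bigrams and keep those occurring frequently enough
    to serve as index tags; min_freq is a hypothetical cutoff."""
    counts = Counter()
    for doc in docs:
        tokens = re.findall(r"[a-z0-9]+", doc.lower())
        counts.update(" ".join(tokens[i:i + 2]) for i in range(len(tokens) - 1))
    return {tag for tag, c in counts.items() if c >= min_freq}

tags = mine_frequent_tags(criteria)
print("type 2" in tags)  # True
```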

  13. Text Mining to Support Gene Ontology Curation and Vice Versa.

    Science.gov (United States)

    Ruch, Patrick

    2017-01-01

    In this chapter, we explain how text mining can support the curation of molecular biology databases dealing with protein functions. We also show how curated data can play a disruptive role in the development of text mining methods. We review a decade of efforts to improve the automatic assignment of Gene Ontology (GO) descriptors, the reference ontology for the characterization of genes and gene products. To illustrate the high potential of this approach, we compare the performances of automatic text categorizers and show a large improvement of +225% in both precision and recall on benchmarked data. We argue that automatic text categorization functions can ultimately be embedded into a Question-Answering (QA) system to answer questions related to protein functions. Because GO descriptors can be relatively long and specific, traditional QA systems cannot answer such questions. A new type of QA system, so-called Deep QA, which uses machine learning methods trained with curated contents, is thus emerging. Finally, future advances in text mining instruments are directly dependent on the availability of high-quality annotated contents at every curation step. Database workflows must start recording explicitly all the data they curate and ideally also some of the data they do not curate.

  14. Decision Support for E-Governance: A Text Mining Approach

    Directory of Open Access Journals (Sweden)

    G. Koteswara Rao

    2011-09-01

    Full Text Available Information and communication technology has the capability to improve the process by which governments involve citizens in formulating public policy and public projects. Even though much of government regulations may now be in digital form (and often available online), due to their complexity and diversity, identifying the ones relevant to a particular context is a non-trivial task. Similarly, with the advent of a number of electronic online forums, social networking sites and blogs, the opportunity of gathering citizens' petitions and stakeholders' views on government policy and proposals has increased greatly, but the volume and the complexity of analyzing unstructured data makes this difficult. On the other hand, text mining has come a long way from simple keyword search, and matured into a discipline capable of dealing with much more complex tasks. In this paper we discuss how text-mining techniques can help in retrieval of information and relationships from textual data sources, thereby assisting policy makers in discovering associations between policies and citizens' opinions expressed in electronic public forums and blogs etc. We also present an integrated text mining based architecture for e-governance decision support along with a discussion on the Indian scenario.

  15. Decision Support for e-Governance: A Text Mining Approach

    CERN Document Server

    Rao, G Koteswara

    2011-01-01

    Information and communication technology has the capability to improve the process by which governments involve citizens in formulating public policy and public projects. Even though much of government regulations may now be in digital form (and often available online), due to their complexity and diversity, identifying the ones relevant to a particular context is a non-trivial task. Similarly, with the advent of a number of electronic online forums, social networking sites and blogs, the opportunity of gathering citizens' petitions and stakeholders' views on government policy and proposals has increased greatly, but the volume and the complexity of analyzing unstructured data makes this difficult. On the other hand, text mining has come a long way from simple keyword search, and matured into a discipline capable of dealing with much more complex tasks. In this paper we discuss how text-mining techniques can help in retrieval of information and relationships from textual data sources, thereby assisting policy...

  16. Text mining and medicine: usefulness in respiratory diseases.

    Science.gov (United States)

    Piedra, David; Ferrer, Antoni; Gea, Joaquim

    2014-03-01

    It is increasingly common to have medical information in electronic format. This includes scientific articles as well as clinical management reviews, and even records from health institutions with patient data. However, traditional instruments, both individual and institutional, are of little use for selecting the most appropriate information in each case, either in the clinical or research field. So-called text or data «mining» enables this huge amount of information to be managed, extracting it from various sources using processing systems (filtration and curation), integrating it and permitting the generation of new knowledge. This review aims to provide an overview of text and data mining, and of the potential usefulness of this bioinformatic technique in the exercise of care in respiratory medicine and in research in the same field.

  17. The Role of Text Mining in Export Control

    Energy Technology Data Exchange (ETDEWEB)

    Tae, Jae-woong; Son, Choul-woong; Shin, Dong-hoon [Korea Institute of Nuclear Nonproliferation and Control, Daejeon (Korea, Republic of)

    2015-10-15

    The Korean government provides classification services to exporters. It is simple to copy technology such as documents and drawings. Moreover, it is also easy to derive new technology from existing technology. The diversity of technology makes classification difficult because the boundary between strategic and nonstrategic technology is unclear and ambiguous. Reviewers should sufficiently consider previous classification cases. However, the increase in classification cases prevents consistent classifications. This makes other innovative and effective approaches necessary. IXCRS (Intelligent Export Control Review System) is proposed to meet these demands. IXCRS consists of an expert system, a semantic searching system, a full text retrieval system, an image retrieval system, and a document retrieval system. It is the aim of the present paper to observe the document retrieval system based on text mining and to discuss how to utilize the system. This study has demonstrated how text mining techniques can be applied to export control. The document retrieval system supports reviewers in treating previous classification cases effectively. Especially, it is highly probable that similarity data will contribute to specifying classification criteria. However, an analysis of the system revealed a number of problems that remain to be explored, such as a multilanguage problem and an inclusion relationship problem. Further research should be directed to solving these problems and to applying more data mining techniques so that the system can be used as one of the useful tools for export control.

  18. Text mining a self-report back-translation.

    Science.gov (United States)

    Blanch, Angel; Aluja, Anton

    2016-06-01

    There are several recommendations about the routine to undertake when back translating self-report instruments in cross-cultural research. However, text mining methods have generally been ignored within this field. This work describes an innovative application of text mining useful for adapting a personality questionnaire to 12 different languages. The method is divided into three stages: a descriptive analysis of the available back-translated instrument versions, a dissimilarity assessment between the source language instrument and the 12 back-translations, and an assessment of item meaning equivalence. The suggested method contributes to improving the back-translation process of self-report instruments for cross-cultural research in two significant, intertwined ways. First, it defines a systematic approach to the back-translation issue, allowing for a more orderly and informed evaluation concerning the equivalence of different versions of the same instrument in different languages. Second, it provides more accurate instrument back-translations, which has direct implications for the reliability and validity of the instrument's test scores when used in different cultures/languages. In addition, this procedure can be extended to the back-translation of self-reports measuring psychological constructs in clinical assessment. Future research could refine the suggested methodology and use additional available text mining tools.
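The dissimilarity assessment between source items and their back-translations can be approximated with a string-similarity measure; a minimal sketch using Python's difflib (the item texts are invented, and the paper's actual distance measure may differ):

```python
from difflib import SequenceMatcher

def dissimilarity(source_item, back_translated_item):
    """1 minus the string similarity ratio: 0.0 for identical items,
    approaching 1.0 as the back-translation diverges."""
    return 1 - SequenceMatcher(None, source_item.lower(),
                               back_translated_item.lower()).ratio()

# Invented questionnaire item and back-translation
src = "I enjoy meeting new people"
back = "I like to meet new people"
d = dissimilarity(src, back)
print(0 < d < 0.5)  # True: similar wording but not identical
```

Items whose dissimilarity exceeds a chosen threshold could then be flagged for manual review of meaning equivalence.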

  19. Intrinsic evaluation of text mining tools may not predict performance on realistic tasks.

    Science.gov (United States)

    Caporaso, J Gregory; Deshpande, Nita; Fink, J Lynn; Bourne, Philip E; Cohen, K Bretonnel; Hunter, Lawrence

    2008-01-01

    Biomedical text mining and other automated techniques are beginning to achieve performance which suggests that they could be applied to aid database curators. However, few studies have evaluated how these systems might work in practice. In this article we focus on the problem of annotating mutations in Protein Data Bank (PDB) entries, and evaluate the relationship between performance of two automated techniques, a text-mining-based approach (MutationFinder) and an alignment-based approach, in intrinsic versus extrinsic evaluations. We find that high performance on gold standard data (an intrinsic evaluation) does not necessarily translate to high performance for database annotation (an extrinsic evaluation). We show that this is in part a result of lack of access to the full text of journal articles, which appears to be critical for comprehensive database annotation by text mining. Additionally, we evaluate the accuracy and completeness of manually annotated mutation data in the PDB, and find that it is far from perfect. We conclude that currently the most cost-effective and reliable approach for database annotation might incorporate manual and automatic annotation methods.

  20. A Semi-Structured Document Model for Text Mining

    Institute of Scientific and Technical Information of China (English)

    杨建武; 陈晓鸥

    2002-01-01

    A semi-structured document has more structured information compared to an ordinary document, and the relations among semi-structured documents can be fully utilized. In order to take advantage of the structure and link information in a semi-structured document for better mining, a structured link vector model (SLVM) is presented in this paper, where a vector represents a document, and the vector's elements are determined by terms, document structure and neighboring documents. Text mining based on SLVM is described in the procedure of K-means for briefness and clarity: calculating document similarity and calculating cluster centers. The clustering based on SLVM performs significantly better than that based on a conventional vector space model in the experiments, and its F value increases from 0.65-0.73 to 0.82-0.86.
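The idea of a structured link vector, where a document's representation mixes its own term vector with those of its linked neighbors, can be sketched as a weighted combination; the mixing weight alpha and the vectors below are illustrative, not taken from the paper:

```python
import numpy as np

def slvm_vector(term_vec, neighbor_vecs, alpha=0.7):
    """Combine a document's own term vector with the mean of its
    linked neighbors' vectors; alpha is a hypothetical weight."""
    own = np.asarray(term_vec, dtype=float)
    if neighbor_vecs:
        nb = np.mean(np.asarray(neighbor_vecs, dtype=float), axis=0)
    else:
        nb = np.zeros_like(own)  # no links: fall back to the term vector
    return alpha * own + (1 - alpha) * nb

doc = [1.0, 0.0, 2.0]
neighbors = [[0.0, 1.0, 1.0], [2.0, 1.0, 1.0]]
print(slvm_vector(doc, neighbors))  # approx. [1.0, 0.3, 1.7]
```

The combined vectors could then be fed to K-means exactly as ordinary term vectors would be.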

  1. UMLS content views appropriate for NLP processing of the biomedical literature vs. clinical text.

    Science.gov (United States)

    Demner-Fushman, Dina; Mork, James G; Shooshan, Sonya E; Aronson, Alan R

    2010-08-01

    Identification of medical terms in free text is a first step in such Natural Language Processing (NLP) tasks as automatic indexing of biomedical literature and extraction of patients' problem lists from the text of clinical notes. Many tools developed to perform these tasks use biomedical knowledge encoded in the Unified Medical Language System (UMLS) Metathesaurus. We continue our exploration of automatic approaches to creation of subsets (UMLS content views) which can support NLP processing of either the biomedical literature or clinical text. We found that suppression of highly ambiguous terms in the conservative AutoFilter content view can partially replace manual filtering for literature applications, and suppression of two character mappings in the same content view achieves 89.5% precision at 78.6% recall for clinical applications.
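The content-view filtering described, suppressing highly ambiguous terms and two-character mappings, can be sketched as a simple dictionary filter; the term-to-concept mappings below are invented for illustration:

```python
def build_content_view(term_mappings, ambiguous_terms):
    """Suppress terms flagged as highly ambiguous and mappings of
    only two characters, the two heuristics described above."""
    return {term: cui for term, cui in term_mappings.items()
            if term not in ambiguous_terms and len(term) > 2}

# Illustrative term-to-concept mappings (concept IDs invented)
mappings = {"ms": "C0000001", "cold": "C0000002", "aspirin": "C0000003"}
view = build_content_view(mappings, ambiguous_terms={"cold"})
print(sorted(view))  # ['aspirin']
```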

  2. Facilitating Full-text Access to Biomedical Literature Using Open Access Resources.

    Science.gov (United States)

    Kang, Hongyu; Hou, Zhen; Li, Jiao

    2015-01-01

    Open access (OA) resources and local libraries often have their own literature databases, especially in the field of biomedicine. We have developed a method of linking a local library to a biomedical OA resource, facilitating researchers' full-text article access. The method uses a vector space model to measure similarities between articles in the local library and OA resources. The method achieved an F-score of 99.61%. This method of article linkage and mapping between local libraries and OA resources is available for use. Through this work, we have improved full-text access to biomedical OA resources.
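Vector-space linkage of this kind typically reduces to a cosine similarity between bag-of-words vectors; a minimal sketch (the article titles are invented, and the authors' actual weighting scheme is not specified here):

```python
import math
from collections import Counter

def bow(text):
    # Bag-of-words term counts; a real linker would likely use tf-idf
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in set(a) | set(b))
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Invented article titles, for illustration only
local = "gene methylation profiles in human cancer"
oa = "methylation profiles of genes in human cancer tissue"
print(round(cosine(bow(local), bow(oa)), 2))  # 0.72
```

A pair of articles whose similarity exceeds a chosen threshold would be linked as the same work.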

  3. Mining consumer health vocabulary from community-generated text.

    Science.gov (United States)

    Vydiswaran, V G Vinod; Mei, Qiaozhu; Hanauer, David A; Zheng, Kai

    2014-01-01

    Community-generated text corpora can be a valuable resource for extracting consumer health vocabulary (CHV) and linking it to professional terminologies and alternative variants. In this research, we propose a pattern-based text-mining approach to identify pairs of CHV and professional terms from Wikipedia, a large text corpus created and maintained by the community. A novel measure, leveraging the ratio of frequency of occurrence, was used to differentiate consumer terms from professional terms. We empirically evaluated the applicability of this approach using a large data sample consisting of MedLine abstracts and all posts from an online health forum, MedHelp. The results show that the proposed approach is able to identify synonymous pairs and label the terms as either consumer or professional with high accuracy. We conclude that the proposed approach has great potential to produce a high-quality CHV to improve the performance of computational applications in processing consumer-generated health text.
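A frequency-of-occurrence ratio for separating consumer from professional terms can be sketched as follows; the counts, smoothing, and threshold are hypothetical and do not reproduce the measure defined in the paper:

```python
def term_label(freq_consumer, freq_professional, threshold=2.0):
    """Label a term by the ratio of its frequency in consumer text
    (e.g. forum posts) vs. professional text (e.g. abstracts)."""
    ratio = (freq_consumer + 1) / (freq_professional + 1)  # add-one smoothing
    if ratio >= threshold:
        return "consumer"
    if ratio <= 1 / threshold:
        return "professional"
    return "ambiguous"

# Hypothetical corpus counts for a synonymous term pair
print(term_label(120, 3))   # consumer (lay variant)
print(term_label(2, 150))   # professional (technical variant)
```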

  4. A Fuzzy Similarity Based Concept Mining Model for Text Classification

    CERN Document Server

    Puri, Shalini

    2012-01-01

    Text classification is a challenging and active research field with great importance in text categorization applications. A lot of research work has been done in this field, but there is a need to categorize a collection of text documents into mutually exclusive categories by extracting the concepts or features using a supervised learning paradigm and different classification algorithms. In this paper, a new Fuzzy Similarity Based Concept Mining Model (FSCMM) is proposed to classify a set of text documents into pre-defined Category Groups (CG) by training on the sentence, document and integrated corpora levels, along with feature reduction and ambiguity removal at each level to achieve high system performance. A Fuzzy Feature Category Similarity Analyzer (FFCSA) is used to analyze each extracted feature of the Integrated Corpora Feature Vector (ICFV) with the corresponding categories or classes. This model uses a Support Vector Machine Classifier (SVMC) to classify correct...

  5. A TEXT MINING RESEARCH BASED ON LDA TOPIC MODELLING

    Directory of Open Access Journals (Sweden)

    Zhou Tong

    2016-05-01

    Full Text Available A large amount of digital text information is generated every day. Effectively searching, managing and exploring the text data has become a main task. In this paper, we first present an introduction to text mining and a probabilistic topic model, Latent Dirichlet allocation. Then two experiments are proposed - Wikipedia articles and users' tweets topic modelling. The former builds up a document topic model, aiming at a topic-perspective solution for searching, exploring and recommending articles. The latter sets up a user topic model, providing a full research and analysis over Twitter users' interests. The experimental process, including data collection, data pre-processing and model training, is fully documented and commented. Furthermore, the approach described in this paper could serve as a useful computational tool for social and business research.
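An LDA topic model of the kind used in these experiments can be fit in a few lines with scikit-learn; a minimal sketch on a toy corpus (the documents and the choice of two topics are illustrative, and the paper's own implementation may differ):

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Invented toy corpus: two medical and two social-media documents
docs = [
    "sleep apnea airway breathing treatment",
    "breathing disorder airway sleep study",
    "tweet election vote campaign policy",
    "campaign policy vote twitter users",
]

vec = CountVectorizer()
X = vec.fit_transform(docs)  # document-term count matrix

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(X)  # per-document topic mixtures; rows sum to 1

print(doc_topics.shape)  # (4, 2)
```

Each row of `doc_topics` can then be inspected to see which topic dominates a document.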

  6. MeInfoText: associated gene methylation and cancer information from text mining

    Directory of Open Access Journals (Sweden)

    Juan Hsueh-Fen

    2008-01-01

    Full Text Available Abstract Background DNA methylation is an important epigenetic modification of the genome. Abnormal DNA methylation may result in silencing of tumor suppressor genes and is common in a variety of human cancer cells. As more epigenetics research is published electronically, it is desirable to extract relevant information from biological literature. To facilitate epigenetics research, we have developed a database called MeInfoText to provide gene methylation information from text mining. Description MeInfoText presents comprehensive association information about gene methylation and cancer, the profile of gene methylation among human cancer types and the gene methylation profile of a specific cancer type, based on association mining from large amounts of literature. In addition, MeInfoText offers integrated protein-protein interaction and biological pathway information collected from the Internet. MeInfoText also provides pathway cluster information regarding a set of genes that may contribute to the development of cancer due to aberrant methylation. The extracted evidence with highlighted keywords and the gene names identified from each methylation-related abstract are also retrieved. The database is now available at http://mit.lifescience.ntu.edu.tw/. Conclusion MeInfoText is a unique database that provides comprehensive gene methylation and cancer association information. It will complement existing DNA methylation information and will be useful in epigenetics research and the prevention of cancer.

  7. Enhancing Text Clustering Using Concept-based Mining Model

    Directory of Open Access Journals (Sweden)

    Lincy Liptha R.

    2012-03-01

    Full Text Available Text mining techniques are mostly based on statistical analysis of a word or phrase. The statistical analysis of a term frequency captures the importance of the term within a document only. But two terms can have the same frequency in the same document, while the meaning that one term contributes might be more appropriate than the meaning contributed by the other term. Hence, the terms that capture the semantics of the text should be given more importance. Here, a new concept-based mining model is introduced. It analyses the terms on the sentence, document and corpus levels. The model consists of sentence-based concept analysis, which calculates the conceptual term frequency (ctf), document-based concept analysis, which finds the term frequency (tf), corpus-based concept analysis, which determines the document frequency (df), and a concept-based similarity measure. The process of calculating the ctf, tf and df measures in a corpus is attained by the proposed algorithm, which is called the Concept-Based Analysis Algorithm. By doing so we cluster the web documents in an efficient way, and the quality of the clusters achieved by this model significantly surpasses that of traditional single-term-based approaches.
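The ctf/tf/df measures can be illustrated with plain counting, treating ctf here as the number of sentences in a document that mention a term; this simplifies the paper's concept-level analysis, and the corpus below is invented:

```python
from collections import Counter

# Invented two-document corpus; each document is a list of sentences
corpus = [
    ["the drug reduced tumor growth", "tumor cells died after treatment"],
    ["patients reported better sleep", "sleep quality improved with the drug"],
]

def concept_stats(term):
    """tf: term count per document; ctf (simplified): sentences per
    document mentioning the term; df: documents containing the term."""
    tf, ctf, df = [], [], 0
    for doc in corpus:
        words = Counter(w for s in doc for w in s.split())
        tf.append(words[term])
        ctf.append(sum(1 for s in doc if term in s.split()))
        df += 1 if words[term] else 0
    return tf, ctf, df

print(concept_stats("drug"))   # ([1, 1], [1, 1], 2)
print(concept_stats("tumor"))  # ([2, 0], [2, 0], 1)
```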

  8. Practical text mining and statistical analysis for non-structured text data applications

    CERN Document Server

    Miner, Gary; Hill, Thomas; Nisbet, Robert; Delen, Dursun

    2012-01-01

    The world contains an unimaginably vast amount of digital information which is getting ever vaster ever more rapidly. This makes it possible to do many things that previously could not be done: spot business trends, prevent diseases, combat crime and so on. Managed well, the textual data can be used to unlock new sources of economic value, provide fresh insights into science and hold governments to account. As the Internet expands and our natural capacity to process the unstructured text that it contains diminishes, the value of text mining for information retrieval and search will increase d

  9. tmBioC: improving interoperability of text-mining tools with BioC.

    Science.gov (United States)

    Khare, Ritu; Wei, Chih-Hsuan; Mao, Yuqing; Leaman, Robert; Lu, Zhiyong

    2014-01-01

    The lack of interoperability among biomedical text-mining tools is a major bottleneck in creating more complex applications. Despite the availability of numerous methods and techniques for various text-mining tasks, combining different tools requires substantial effort and time owing to heterogeneity and variety in data formats. In response, BioC is a recent proposal that offers a minimalistic approach to tool interoperability by stipulating minimal changes to existing tools and applications. BioC is a family of XML formats that define how to present text documents and annotations, and also provides easy-to-use functions to read/write documents in the BioC format. In this study, we introduce our text-mining toolkit, which is designed to perform several challenging and significant tasks in the biomedical domain, and repackage the toolkit into BioC to enhance its interoperability. Our toolkit consists of six state-of-the-art tools for named-entity recognition, normalization and annotation (PubTator) of genes (GenNorm), diseases (DNorm), mutations (tmVar), species (SR4GN) and chemicals (tmChem). Although developed within the same group, each tool is designed to process input articles and output annotations in a different format. We modify these tools and enable them to read/write data in the proposed BioC format. We find that, using the BioC family of formats and functions, only minimal changes were required to build the newer versions of the tools. The resulting BioC-wrapped toolkit, which we have named tmBioC, consists of our tools in BioC, an annotated full-text corpus in BioC, and a format detection and conversion tool. Furthermore, through participation in the 2013 BioCreative IV Interoperability Track, we empirically demonstrate that the tools in tmBioC can be more efficiently integrated with each other as well as with external tools: Our experimental results show that using BioC reduces the lines of code needed for text-mining tool integration by more than 60%.
The tmBioC toolkit
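A BioC collection is plain XML built from collection, document, passage, and annotation elements; a much-simplified reading sketch using Python's standard library (real BioC files follow a DTD with more required fields, and the record below is invented):

```python
import xml.etree.ElementTree as ET

# A minimal BioC-style record (fields simplified; content invented)
bioc_xml = """<collection>
  <source>PubMed</source>
  <document>
    <id>12345</id>
    <passage>
      <offset>0</offset>
      <text>BRCA1 mutations increase cancer risk.</text>
      <annotation id="T1">
        <infon key="type">Gene</infon>
        <text>BRCA1</text>
      </annotation>
    </passage>
  </document>
</collection>"""

root = ET.fromstring(bioc_xml)
annotations = []
for doc in root.iter("document"):
    doc_id = doc.findtext("id")
    for ann in doc.iter("annotation"):
        # findtext returns the first matching child's text
        annotations.append((doc_id, ann.findtext("infon"), ann.findtext("text")))
print(annotations)  # [('12345', 'Gene', 'BRCA1')]
```

A shared format like this is what lets one tool's output annotations become another tool's input without custom glue code.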

  10. Extraction of semantic biomedical relations from text using conditional random fields

    Directory of Open Access Journals (Sweden)

    Stetter Martin

    2008-04-01

    text and apply it to the biomedical domain. Our approach is based on a rich set of textual features and achieves a performance that is competitive to leading approaches. The model is quite general and can be extended to handle arbitrary biological entities and relation types. The resulting gene-disease network shows that the GeneRIF database provides a rich knowledge source for text mining. Current work is focused on improving the accuracy of detection of entities as well as entity boundaries, which will also greatly improve the relation extraction performance.

  11. Semantic-based image retrieval by text mining on environmental texts

    Science.gov (United States)

    Yang, Hsin-Chang; Lee, Chung-Hong

    2003-01-01

    In this paper we propose a novel method to bridge the 'semantic gap' between a user's information need and the image content. The semantic gap describes the major deficiency of content-based image retrieval (CBIR) systems, which use visual features extracted from images to describe the images. We address this deficiency by extracting the semantics of an image from the environmental texts around it. Since an image generally co-exists with accompanying texts in various formats, we may rely on such environmental texts to discover the semantics of the image. A text mining approach based on self-organizing maps is used to extract the semantics of an image from its environmental texts. We performed experiments on a small set of images and obtained promising results.

  12. Mining Sequential Update Summarization with Hierarchical Text Analysis

    Directory of Open Access Journals (Sweden)

    Chunyun Zhang

    2016-01-01

    Full Text Available The outbreak of unexpected news events such as a large-scale accident or natural disaster brings about a new information access problem where traditional approaches fail. News of these events is typically sparse early on and redundant later. Hence, it is very important to get updates and provide individuals with timely and important information about these incidents during their development, especially when applied in the wireless and mobile Internet of Things (IoT). In this paper, we define the problem of sequential update summarization extraction and present a new hierarchical update mining system which can broadcast useful, new, and timely sentence-length updates about a developing event. The new system proposes a novel method which incorporates techniques from topic-level and sentence-level summarization. To evaluate the performance of the proposed system, we apply it to the sequential update summarization task of the temporal summarization (TS) track at the Text Retrieval Conference (TREC) 2013 and compute four measurements of the update mining system: expected gain, expected latency gain, comprehensiveness, and latency comprehensiveness. Experimental results show that our proposed method has good performance.
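The core of sequential update summarization, emitting only sentences that add new information, can be sketched as a word-overlap novelty filter; the threshold and the event stream below are illustrative, not the paper's method:

```python
def select_updates(sentences, overlap_threshold=0.6):
    """Emit a sentence as an update only if it overlaps little with
    the words of previously emitted updates."""
    seen, updates = set(), []
    for s in sentences:
        words = set(s.lower().split())
        if not words:
            continue
        if len(words & seen) / len(words) < overlap_threshold:
            updates.append(s)  # novel enough: broadcast as an update
            seen |= words
    return updates

# Invented stream of candidate sentences for a developing event
stream = [
    "Earthquake strikes coastal city",
    "Earthquake strikes coastal city again tonight",
    "Rescue teams arrive at the scene",
]
print(select_updates(stream))  # first and third sentences only
```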

  13. Data Mining Algorithms for Classification of Complex Biomedical Data

    Science.gov (United States)

    Lan, Liang

    2012-01-01

    In my dissertation, I will present my research which contributes to solve the following three open problems from biomedical informatics: (1) Multi-task approaches for microarray classification; (2) Multi-label classification of gene and protein prediction from multi-source biological data; (3) Spatial scan for movement data. In microarray…

  14. CrossRef text and data mining services

    Directory of Open Access Journals (Sweden)

    Rachael Lammey

    2015-02-01

    Full Text Available CrossRef is an association of scholarly publishers that develops shared infrastructure to support more effective scholarly communications. It is a registration agency for the digital object identifier (DOI), and has built additional services for CrossRef members around the DOI and the bibliographic metadata that publishers deposit in order to register DOIs for their publications. Among these services are CrossCheck, powered by iThenticate, which helps publishers screen submitted manuscripts for plagiarism, and FundRef, which gives publishers a standard way to report funding sources for published scholarly research. To add to these services, CrossRef launched its text and data mining services in May 2014. This article explains the thinking behind CrossRef launching this new service, what it offers to publishers and researchers alike, how publishers can participate in it, and the uptake of the service so far.

  15. Data mining of text as a tool in authorship attribution

    Science.gov (United States)

    Visa, Ari J. E.; Toivonen, Jarmo; Autio, Sami; Maekinen, Jarno; Back, Barbro; Vanharanta, Hannu

    2001-03-01

    It is common for text documents to be characterized and classified by keywords that their authors assign to them. Visa et al. have developed a new methodology based on prototype matching. The prototype is an interesting document or part of an extracted, interesting text. This prototype is matched against the document database of the monitored document flow. The new methodology is capable of extracting the meaning of a document to a certain degree. Our claim is that it is also capable of authenticating authorship. To verify this claim, two tests were designed. The test hypothesis was that the words and the word order in the sentences could authenticate the author. In the first test, three authors were selected: William Shakespeare, Edgar Allan Poe, and George Bernard Shaw. Three texts from each author were examined. Each text was in turn used as a prototype, and the two nearest matches to the prototype were noted. The second test used the Reuters-21578 financial news database: a group of 25 short financial news reports from five different authors was examined. Our new methodology and the interesting results from the two tests are reported in this paper. In the first test, all cases were successful for Shakespeare and Poe; for Shaw, one text was confused with Poe. In the second test, the authors of the Reuters-21578 financial news reports were identified relatively well. The conclusion is that our text mining methodology seems to be capable of authorship attribution.
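
    The prototype-matching idea can be sketched as nearest-neighbour attribution over word-frequency profiles. This is a generic illustration, not Visa et al.'s actual matching method; the cosine measure and the toy prototype texts are our own assumptions.

```python
import math
from collections import Counter

def profile(text):
    """Word-frequency profile of a text."""
    return Counter(text.lower().split())

def cosine(p, q):
    """Cosine similarity between two frequency profiles."""
    dot = sum(p[w] * q[w] for w in p)
    norm = (math.sqrt(sum(v * v for v in p.values()))
            * math.sqrt(sum(v * v for v in q.values())))
    return dot / norm if norm else 0.0

def nearest_author(prototypes, unknown):
    """Attribute `unknown` to the author whose prototype text matches best."""
    return max(prototypes,
               key=lambda a: cosine(profile(prototypes[a]), profile(unknown)))
```
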

  16. EnvMine: A text-mining system for the automatic extraction of contextual information

    Directory of Open Access Journals (Sweden)

    de Lorenzo Victor

    2010-06-01

    Full Text Available Abstract Background For ecological studies, it is crucial to have adequate descriptions of the environments and samples being studied. Such a description must be given in terms of their physicochemical characteristics, allowing a direct comparison between different environments that would otherwise be difficult. The characterization must also include the precise geographical location, to make possible the study of geographical distributions and biogeographical patterns. Currently, there is no schema for annotating these environmental features, and these data have to be extracted from textual sources (published articles). So far, this had to be performed by manual inspection of the corresponding documents. To facilitate this task, we have developed EnvMine, a set of text-mining tools devoted to retrieving contextual information (physicochemical variables and geographical locations) from textual sources of any kind. Results EnvMine is capable of retrieving the physicochemical variables cited in the text by means of the accurate identification of their associated units of measurement. In this task, the system achieves a recall (percentage of items retrieved) of 92% with less than 1% error. A Bayesian classifier was also tested for distinguishing parts of the text describing environmental characteristics from others dealing with, for instance, experimental settings. Regarding the identification of geographical locations, the system takes advantage of existing databases such as GeoNames to achieve 86% recall with 92% precision. The identification of a location also includes the determination of its exact coordinates (latitude and longitude), thus allowing the calculation of distances between individual locations. Conclusion EnvMine is a very efficient method for extracting contextual information from different text sources, like published articles or web pages. This tool can help in determining the precise location and physicochemical
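
    The two core steps can be sketched in miniature: spotting variable mentions by their units of measurement, and computing distances from extracted coordinates. The unit pattern below is a toy illustration, far simpler than EnvMine's actual recognizer; the haversine formula for great-circle distance is standard.

```python
import math
import re

# toy pattern for a physicochemical variable with its unit
# (an illustration, not EnvMine's grammar)
VAR_UNIT = re.compile(r"(pH\s*\d+(?:\.\d+)?|\d+(?:\.\d+)?\s*(?:°C|mg/L|µM))")

def extract_variables(text):
    """Return variable-with-unit mentions found in free text."""
    return VAR_UNIT.findall(text)

def distance_km(lat1, lon1, lat2, lon2):
    """Great-circle (haversine) distance between two coordinates, in km."""
    r = 6371.0                                   # mean Earth radius
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))
```
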

  17. Text-mining-assisted biocuration workflows in Argo.

    Science.gov (United States)

    Rak, Rafal; Batista-Navarro, Riza Theresa; Rowley, Andrew; Carter, Jacob; Ananiadou, Sophia

    2014-01-01

    Biocuration activities have been broadly categorized into the selection of relevant documents, the annotation of biological concepts of interest and the identification of interactions between the concepts. Text mining has been shown to have the potential to significantly reduce the effort of biocurators in all three activities, and various semi-automatic methodologies have been integrated into curation pipelines to support them. We investigate the suitability of Argo, a workbench for building text-mining solutions with the use of a rich graphical user interface, for the process of biocuration. Central to Argo are customizable workflows that users compose by arranging available elementary analytics to form task-specific processing units. A built-in manual annotation editor is the single most used biocuration tool of the workbench, as it allows users to create annotations directly in text, as well as modify or delete annotations created by automatic processing components. Apart from syntactic and semantic analytics, the ever-growing library of components includes several data readers and consumers that support well-established as well as emerging data interchange formats such as XMI, RDF and BioC, which facilitate the interoperability of Argo with other platforms and resources. To validate the suitability of Argo for curation activities, we participated in the BioCreative IV challenge, whose purpose was to evaluate Web-based systems addressing user-defined biocuration tasks. Argo proved to have the edge over other systems in terms of flexibility in defining biocuration tasks. As expected, the versatility of the workbench inevitably lengthened the time the curators spent on learning the system before taking on the task, which may have affected the usability of Argo. The participation in the challenge gave us an opportunity to gather valuable feedback and identify areas of improvement, some of which have already been introduced. Database URL: http://argo.nactem.ac.uk.

  18. A Fuzzy Similarity Based Concept Mining Model for Text Classification

    Directory of Open Access Journals (Sweden)

    Shalini Puri

    2011-11-01

    Full Text Available Text classification is a challenging and very active field with great importance in text categorization applications. A lot of research has been done in this field, but there is still a need to categorize collections of text documents into mutually exclusive categories by extracting concepts or features using a supervised learning paradigm and different classification algorithms. In this paper, a new Fuzzy Similarity Based Concept Mining Model (FSCMM) is proposed to classify a set of text documents into pre-defined Category Groups (CG) by training at the sentence, document and integrated-corpora levels, with feature reduction and ambiguity removal at each level to achieve high system performance. A Fuzzy Feature Category Similarity Analyzer (FFCSA) is used to analyze each extracted feature of the Integrated Corpora Feature Vector (ICFV) against the corresponding categories or classes. The model uses a Support Vector Machine Classifier (SVMC) to classify the training data patterns into two groups, i.e., +1 and -1, thereby producing accurate and correct results. The proposed model works efficiently and effectively, with strong performance and high-accuracy results.
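
    The feature-to-category similarity analysis at the heart of such models can be illustrated with a bare-bones centroid classifier. This reproduces neither the fuzzy analyzers nor the SVM stage of FSCMM; the category names and training snippets are invented for illustration.

```python
from collections import Counter

def centroid(texts):
    """Normalised term-frequency vector of a category's training texts."""
    c = Counter()
    for t in texts:
        c.update(t.lower().split())
    total = sum(c.values())
    return {w: n / total for w, n in c.items()}

def similarity(doc, cat):
    """Average weight, under the category centroid, of the document's words."""
    words = doc.lower().split()
    return sum(cat.get(w, 0.0) for w in words) / max(len(words), 1)

def classify(doc, categories):
    """Assign the document to the most similar category."""
    return max(categories, key=lambda c: similarity(doc, categories[c]))
```
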

  19. Sentiment analysis of Arabic tweets using text mining techniques

    Science.gov (United States)

    Al-Horaibi, Lamia; Khan, Muhammad Badruddin

    2016-07-01

    Sentiment analysis has become a flourishing field of text mining and natural language processing. Sentiment analysis aims to determine whether a text is written to express positive, negative, or neutral emotions about a certain domain. Most sentiment analysis researchers focus on English texts, with very limited resources available for other complex languages, such as Arabic. In this study, the target was to develop an initial model that performs satisfactorily and measures Arabic Twitter sentiment using a machine learning approach, with Naïve Bayes and Decision Tree as classification algorithms. The dataset used contains more than 2,000 Arabic tweets collected from Twitter. We performed several experiments to check the performance of the two classifiers using different combinations of text-processing functions. We found that available facilities for Arabic text processing need to be built from scratch or improved to develop accurate classifiers. The small functionalities we developed in a Python environment helped improve the results and proved that sentiment analysis in the Arabic domain needs a lot of work on the lexicon side.
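
    A minimal multinomial Naïve Bayes classifier of the kind used in such studies can be sketched in a few lines. This is a generic sketch with add-one smoothing, not the paper's Arabic-specific pipeline; the toy training tweets are invented.

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, docs, labels):
        self.counts = defaultdict(Counter)   # per-class word counts
        self.priors = Counter(labels)        # class frequencies
        self.vocab = set()
        for doc, y in zip(docs, labels):
            words = doc.lower().split()
            self.counts[y].update(words)
            self.vocab.update(words)
        return self

    def predict(self, doc):
        def log_prob(y):
            total = sum(self.counts[y].values())
            lp = math.log(self.priors[y] / sum(self.priors.values()))
            for w in doc.lower().split():
                lp += math.log((self.counts[y][w] + 1) / (total + len(self.vocab)))
            return lp
        return max(self.priors, key=log_prob)
```
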

  20. Protein-protein interaction predictions using text mining methods.

    Science.gov (United States)

    Papanikolaou, Nikolas; Pavlopoulos, Georgios A; Theodosiou, Theodosios; Iliopoulos, Ioannis

    2015-03-01

    It is beyond any doubt that proteins and their interactions play an essential role in most complex biological processes. Understanding their function individually, but also in the form of protein complexes, is of great importance. Nowadays, despite the plethora of high-throughput experimental approaches for detecting protein-protein interactions, many computational methods aiming to predict new interactions have appeared and gained interest. In this review, we focus on text-mining-based computational methodologies that aim to extract information about proteins and their interactions from public repositories such as the literature and various biological databases. We discuss their strengths and weaknesses and how they complement existing experimental techniques, while also commenting on the biological databases which hold such information and the benchmark datasets that can be used for evaluating new tools.

  1. A concept-driven biomedical knowledge extraction and visualization framework for conceptualization of text corpora.

    Science.gov (United States)

    Jahiruddin; Abulaish, Muhammad; Dey, Lipika

    2010-12-01

    A number of techniques such as information extraction, document classification, document clustering and information visualization have been developed to ease the extraction and understanding of information embedded within text documents. However, knowledge embedded in natural language texts is difficult to extract using simple pattern-matching techniques, and most of these methods do not directly help users understand key concepts and their semantic relationships in document corpora, which are critical for capturing their conceptual structures. The problem arises because most of the information is embedded within unstructured or semi-structured texts that computers cannot interpret easily. In this paper, we present a novel Biomedical Knowledge Extraction and Visualization framework, BioKEVis, to identify key information components from biomedical text documents. The information components are centered on key concepts. BioKEVis applies linguistic analysis and Latent Semantic Analysis (LSA) to identify key concepts. The information component extraction principle is based on natural language processing techniques and semantic-based analysis. The system is also integrated with a biomedical named entity recognizer, ABNER, to tag genes, proteins and other entity names in the text. We also present a method for collating information extracted from multiple sources to generate a semantic network. The network provides distinct user perspectives, allows navigation over documents with similar information components, and provides a comprehensive view of the collection. The system stores the extracted information components in a structured repository which is integrated with a query-processing module to handle biomedical queries over text documents. We also propose a document ranking mechanism to present retrieved documents in order of their relevance to the user query.

  2. Unsupervised biomedical named entity recognition: experiments with clinical and biological texts.

    Science.gov (United States)

    Zhang, Shaodian; Elhadad, Noémie

    2013-12-01

    Named entity recognition is a crucial component of biomedical natural language processing, enabling information extraction and ultimately reasoning over and knowledge discovery from text. Much progress has been made in the design of rule-based and supervised tools, but they are often genre and task dependent. As such, adapting them to different genres of text or identifying new types of entities requires major effort in re-annotation or rule development. In this paper, we propose an unsupervised approach to extracting named entities from biomedical text. We describe a stepwise solution to tackle the challenges of entity boundary detection and entity type classification without relying on any handcrafted rules, heuristics, or annotated data. A noun phrase chunker followed by a filter based on inverse document frequency extracts candidate entities from free text. Classification of candidate entities into categories of interest is carried out by leveraging principles from distributional semantics. Experiments show that our system, especially the entity classification step, yields competitive results on two popular biomedical datasets of clinical notes and biological literature, and outperforms a baseline dictionary match approach. Detailed error analysis provides a road map for future work.
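
    The inverse-document-frequency filter over candidate entities can be sketched simply: phrases that appear in most documents are too common to be entities and are dropped. This is a rough stand-in for the paper's filter, not its actual implementation; the threshold and toy corpus are assumptions.

```python
import math

def idf(term, docs):
    """Inverse document frequency of a term over a document collection."""
    df = sum(1 for d in docs if term in d.lower())
    return math.log(len(docs) / (1 + df))

def filter_candidates(candidates, docs, threshold=0.0):
    """Keep candidate phrases whose IDF exceeds a threshold, i.e. drop
    phrases that occur in most documents."""
    return [c for c in candidates if idf(c.lower(), docs) > threshold]
```

    In the full pipeline the surviving candidates would then be typed by comparing their contexts to those of known entity categories (distributional semantics).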

  3. Drug name recognition in biomedical texts: a machine-learning-based method.

    Science.gov (United States)

    He, Linna; Yang, Zhihao; Lin, Hongfei; Li, Yanpeng

    2014-05-01

    Currently, there is an urgent need for technology that extracts drug information automatically from biomedical texts, and drug name recognition is an essential prerequisite for extracting drug information. This article presents a machine-learning-based approach to recognize drug names in biomedical texts. In this approach, a drug name dictionary is first constructed from the external resources DrugBank and PubMed. Then a semi-supervised learning method, feature coupling generalization, is used to filter this dictionary. Finally, dictionary look-up and the conditional random field method are combined to recognize drug names. Experimental results show that our approach achieves an F-score of 92.54% on the test set of DDIExtraction2011.
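
    The dictionary look-up half of such a hybrid can be sketched as greedy longest-match tagging; the dictionary filtering and the conditional random field stage the paper combines it with are omitted here, and the toy dictionary is invented.

```python
def tag_drugs(text, dictionary):
    """Greedy longest-match dictionary look-up for drug names."""
    tokens = text.lower().split()
    spans, i = [], 0
    while i < len(tokens):
        match = None
        for j in range(len(tokens), i, -1):     # try the longest span first
            if " ".join(tokens[i:j]) in dictionary:
                match = j
                break
        if match:
            spans.append(" ".join(tokens[i:match]))
            i = match                           # continue after the match
        else:
            i += 1
    return spans
```
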

  4. Bio-SCoRes: A Smorgasbord Architecture for Coreference Resolution in Biomedical Text.

    Directory of Open Access Journals (Sweden)

    Halil Kilicoglu

    Full Text Available Coreference resolution is one of the fundamental and challenging tasks in natural language processing. Resolving coreference successfully can have a significant positive effect on downstream natural language processing tasks, such as information extraction and question answering. The importance of coreference resolution for biomedical text analysis applications has increasingly been acknowledged. One of the difficulties in coreference resolution stems from the fact that distinct types of coreference (e.g., anaphora, appositive) are expressed with a variety of lexical and syntactic means (e.g., personal pronouns, definite noun phrases), and that resolution of each combination often requires a different approach. In the biomedical domain, it is common for coreference annotation and resolution efforts to focus on specific subcategories of coreference deemed important for the downstream task. In the current work, we aim to address some of these concerns regarding coreference resolution in biomedical text. We propose a general, modular framework underpinned by a smorgasbord architecture (Bio-SCoRes), which incorporates a variety of coreference types and their mentions and allows fine-grained specification of resolution strategies to resolve coreference of distinct coreference type-mention pairs. For development and evaluation, we used a corpus of structured drug labels annotated with fine-grained coreference information. In addition, we evaluated our approach on two other corpora (i2b2/VA discharge summaries and a protein coreference dataset) to investigate its generality and ease of adaptation to other biomedical text types. Our results demonstrate the usefulness of our novel smorgasbord architecture. The specific pipelines based on the architecture perform successfully in linking coreferential mention pairs, while we find that recognition of full mention clusters is more challenging. The corpus of structured drug labels (SPL) as well as the components of Bio

  5. Bio-SCoRes: A Smorgasbord Architecture for Coreference Resolution in Biomedical Text.

    Science.gov (United States)

    Kilicoglu, Halil; Demner-Fushman, Dina

    2016-01-01

    Coreference resolution is one of the fundamental and challenging tasks in natural language processing. Resolving coreference successfully can have a significant positive effect on downstream natural language processing tasks, such as information extraction and question answering. The importance of coreference resolution for biomedical text analysis applications has increasingly been acknowledged. One of the difficulties in coreference resolution stems from the fact that distinct types of coreference (e.g., anaphora, appositive) are expressed with a variety of lexical and syntactic means (e.g., personal pronouns, definite noun phrases), and that resolution of each combination often requires a different approach. In the biomedical domain, it is common for coreference annotation and resolution efforts to focus on specific subcategories of coreference deemed important for the downstream task. In the current work, we aim to address some of these concerns regarding coreference resolution in biomedical text. We propose a general, modular framework underpinned by a smorgasbord architecture (Bio-SCoRes), which incorporates a variety of coreference types, their mentions and allows fine-grained specification of resolution strategies to resolve coreference of distinct coreference type-mention pairs. For development and evaluation, we used a corpus of structured drug labels annotated with fine-grained coreference information. In addition, we evaluated our approach on two other corpora (i2b2/VA discharge summaries and protein coreference dataset) to investigate its generality and ease of adaptation to other biomedical text types. Our results demonstrate the usefulness of our novel smorgasbord architecture. The specific pipelines based on the architecture perform successfully in linking coreferential mention pairs, while we find that recognition of full mention clusters is more challenging. The corpus of structured drug labels (SPL) as well as the components of Bio-SCoRes and

  6. Using Collaborative Tagging for Text Classification: From Text Classification to Opinion Mining

    Directory of Open Access Journals (Sweden)

    Eric Charton

    2013-11-01

    Full Text Available Numerous initiatives have allowed users to share knowledge or opinions using collaborative platforms. In most cases, the users provide a textual description of their knowledge, following very limited or no constraints. Here, we tackle the classification of documents written in such an environment. As a use case, our study is made in the context of text mining evaluation campaign material, related to the classification of cooking recipes tagged by users from a collaborative website. This context makes some of the corpus specificities difficult to model for machine-learning-based systems and keyword or lexical-based systems. In particular, different authors might have different opinions on how to classify a given document. The systems presented hereafter were submitted to the Défi Fouille de Textes 2013 evaluation campaign, where they obtained the best overall results, ranking first on task 1 and second on task 2. In this paper, we explain our approach for building relevant and effective systems dealing with such a corpus.

  7. Improving named entity recognition accuracy for gene and protein in biomedical text literature.

    Science.gov (United States)

    Tohidi, Hossein; Ibrahim, Hamidah; Murad, Masrah Azrifah Azmi

    2014-01-01

    The task of recognising biomedical named entities in natural language documents, called biomedical Named Entity Recognition (NER), is the focus of many researchers due to the complex nature of such texts. This complexity includes the issues of character-level, word-level and word-order variations. In this study, an approach for recognising gene and protein names that handles the above issues is proposed. Similar to previous related works, our approach is based on the assumption that a named entity occurs within a noun group. The strength of our proposed approach lies in a Statistical Character-based Syntax Similarity (SCSS) algorithm which measures similarity between the extracted candidates and the well-known biomedical named entities from the GENIA V3.0 corpus. The proposed approach was evaluated and the results are satisfactory. For recognition of both gene and protein names, we achieved 97.2% precision (P), 95.2% recall (R), and 96.1% F-measure, while for protein name recognition we obtained 98.1% P, 97.5% R and 97.7% F-measure.
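
    The evaluation measures quoted above follow the usual definitions over sets of recognised entities (the SCSS algorithm itself is not reproduced here):

```python
def precision_recall_f1(predicted, gold):
    """Precision, recall and F-measure over sets of recognised entities."""
    tp = len(predicted & gold)                      # true positives
    p = tp / len(predicted) if predicted else 0.0
    r = tp / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0       # harmonic mean of P and R
    return p, r, f
```
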

  8. PubMedPortable: A Framework for Supporting the Development of Text Mining Applications

    Science.gov (United States)

    Döring, Kersten; Grüning, Björn A.; Telukunta, Kiran K.; Thomas, Philippe; Günther, Stefan

    2016-01-01

    Information extraction from biomedical literature is continuously growing in scope and importance. Many tools exist that perform named entity recognition, e.g. of proteins, chemical compounds, and diseases. Furthermore, several approaches deal with the extraction of relations between identified entities. The BioCreative community supports these developments with yearly open challenges, which led to a standardised XML text annotation format called BioC. PubMed provides access to the largest open biomedical literature repository, but there is no unified way of connecting its data to natural language processing tools. Therefore, an appropriate data environment is needed as a basis to combine different software solutions and to develop customised text mining applications. PubMedPortable builds a relational database and a full text index on PubMed citations. It can be applied either to the complete PubMed data set or an arbitrary subset of downloaded PubMed XML files. The software provides the infrastructure to combine stand-alone applications by exporting different data formats, e.g. BioC. The presented workflows show how to use PubMedPortable to retrieve, store, and analyse a disease-specific data set. The provided use cases are well documented in the PubMedPortable wiki. The open-source software library is small, easy to use, and scalable to the user’s system requirements. It is freely available for Linux on the web at https://github.com/KerstenDoering/PubMedPortable and for other operating systems as a virtual container. The approach was tested extensively and applied successfully in several projects. PMID:27706202
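
    The idea of a full-text index over citations can be illustrated with a toy in-memory inverted index; PubMedPortable itself builds a relational database and a proper full-text index over PubMed XML, so everything below (class name, toy citations) is only an illustrative assumption.

```python
from collections import defaultdict

class CitationIndex:
    """Toy in-memory full-text index over (pmid, title, abstract) records."""

    def __init__(self):
        self.docs = {}
        self.index = defaultdict(set)   # word -> set of PMIDs

    def add(self, pmid, title, abstract):
        self.docs[pmid] = (title, abstract)
        for word in (title + " " + abstract).lower().split():
            self.index[word].add(pmid)

    def search(self, *terms):
        """PMIDs whose title or abstract contains all query terms."""
        sets = [self.index.get(t.lower(), set()) for t in terms]
        return set.intersection(*sets) if sets else set()
```
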

  9. Seqenv: linking sequences to environments through text mining

    Science.gov (United States)

    Jensen, Lars Juhl; Coolen, Marco J.L.; Gubry-Rangin, Cecile; Chroňáková, Alica; Oulas, Anastasis; Pavloudi, Christina; Schnetzer, Julia; Weimann, Aaron; Ijaz, Ali; Eiler, Alexander

    2016-01-01

    Understanding the distribution of taxa and associated traits across different environments is one of the central questions in microbial ecology. High-throughput sequencing (HTS) studies are presently generating huge volumes of data to address this biogeographical topic. However, these studies are often focused on specific environment types or processes leading to the production of individual, unconnected datasets. The large amounts of legacy sequence data with associated metadata that exist can be harnessed to better place the genetic information found in these surveys into a wider environmental context. Here we introduce a software program, seqenv, to carry out precisely such a task. It automatically performs similarity searches of short sequences against the “nt” nucleotide database provided by NCBI and, out of every hit, extracts–if it is available–the textual metadata field. After collecting all the isolation sources from all the search results, we run a text mining algorithm to identify and parse words that are associated with the Environmental Ontology (EnvO) controlled vocabulary. This, in turn, enables us to determine both in which environments individual sequences or taxa have previously been observed and, by weighted summation of those results, to summarize complete samples. We present two demonstrative applications of seqenv to a survey of ammonia oxidizing archaea as well as to a plankton paleome dataset from the Black Sea. These demonstrate the ability of the tool to reveal novel patterns in HTS and its utility in the fields of environmental source tracking, paleontology, and studies of microbial biogeography. To install seqenv, go to: https://github.com/xapple/seqenv. PMID:28028456
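
    The summarization step, tallying controlled-vocabulary environment terms across the free-text isolation sources of a sample's hits, can be sketched as follows. The term set is a tiny invented stand-in for the EnvO vocabulary, and seqenv's actual text mining is considerably more sophisticated.

```python
from collections import Counter

# tiny invented stand-in for the EnvO controlled vocabulary
ENVO_TERMS = {"soil", "seawater", "sediment", "freshwater"}

def summarize_sample(isolation_sources):
    """Tally environment-vocabulary terms over the isolation-source
    strings of all database hits for one sample."""
    tally = Counter()
    for source in isolation_sources:
        for word in source.lower().split():
            if word in ENVO_TERMS:
                tally[word] += 1
    return tally
```

    Normalising such a tally gives the weighted environment summary of a sample that the abstract describes.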

  10. An integrated text mining framework for metabolic interaction network reconstruction

    Directory of Open Access Journals (Sweden)

    Preecha Patumcharoenpol

    2016-03-01

    Full Text Available Text mining (TM) in the field of biology is fast becoming a routine analysis for the extraction and curation of biological entities (e.g., genes, proteins, simple chemicals) as well as their relationships. Due to the wide applicability of TM in situations involving complex relationships, it is valuable to apply TM to the extraction of metabolic interactions (i.e., enzyme and metabolite interactions) through metabolic events. Here we present an integrated TM framework containing two modules for the extraction of metabolic events (Metabolic Event Extraction module—MEE) and for the construction of a metabolic interaction network (Metabolic Interaction Network Reconstruction module—MINR). The proposed integrated TM framework performed well based on standard measures of recall, precision and F-score. Evaluation of the MEE module using the constructed Metabolic Entities (ME) corpus yielded F-scores of 59.15% and 48.59% for the detection of metabolic events for production and consumption, respectively. As for the testing of the entity tagger for Gene and Protein (GP) and metabolite with the test corpus, the obtained F-score was greater than 80% for the Superpathway of leucine, valine, and isoleucine biosynthesis. Mapping of enzyme and metabolite interactions through network reconstruction showed a fair performance for the MINR module on the test corpus, with F-score >70%. Finally, an application of our integrated TM framework on big-scale data (i.e., EcoCyc extraction data) for reconstructing a metabolic interaction network showed reasonable precisions of 69.93%, 70.63% and 46.71% for enzyme, metabolite and enzyme–metabolite interactions, respectively. This study presents the first open-source integrated TM framework for reconstructing a metabolic interaction network. This framework can be a powerful tool that helps biologists to extract metabolic events for further reconstruction of a metabolic interaction network. The ME corpus, test corpus, source

  11. Mining Texts in Reading to Write. Occasional Paper No. 29.

    Science.gov (United States)

    Greene, Stuart

    Reading and writing are commonly seen as parallel processes of composing meaning, employing similar cognitive and linguistic strategies. Research has begun to examine ways in which knowledge of content and strategies contribute to the construction of meaning in reading and writing. The metaphor of mining can provide a useful and descriptive means…

  12. An integrated text mining framework for metabolic interaction network reconstruction.

    Science.gov (United States)

    Patumcharoenpol, Preecha; Doungpan, Narumol; Meechai, Asawin; Shen, Bairong; Chan, Jonathan H; Vongsangnak, Wanwipa

    2016-01-01

    Text mining (TM) in the field of biology is fast becoming a routine analysis for the extraction and curation of biological entities (e.g., genes, proteins, simple chemicals) as well as their relationships. Due to the wide applicability of TM in situations involving complex relationships, it is valuable to apply TM to the extraction of metabolic interactions (i.e., enzyme and metabolite interactions) through metabolic events. Here we present an integrated TM framework containing two modules for the extraction of metabolic events (Metabolic Event Extraction module-MEE) and for the construction of a metabolic interaction network (Metabolic Interaction Network Reconstruction module-MINR). The proposed integrated TM framework performed well based on standard measures of recall, precision and F-score. Evaluation of the MEE module using the constructed Metabolic Entities (ME) corpus yielded F-scores of 59.15% and 48.59% for the detection of metabolic events for production and consumption, respectively. As for the testing of the entity tagger for Gene and Protein (GP) and metabolite with the test corpus, the obtained F-score was greater than 80% for the Superpathway of leucine, valine, and isoleucine biosynthesis. Mapping of enzyme and metabolite interactions through network reconstruction showed a fair performance for the MINR module on the test corpus with F-score >70%. Finally, an application of our integrated TM framework on a big-scale data (i.e., EcoCyc extraction data) for reconstructing a metabolic interaction network showed reasonable precisions at 69.93%, 70.63% and 46.71% for enzyme, metabolite and enzyme-metabolite interaction, respectively. This study presents the first open-source integrated TM framework for reconstructing a metabolic interaction network. This framework can be a powerful tool that helps biologists to extract metabolic events for further reconstruction of a metabolic interaction network. The ME corpus, test corpus, source code, and virtual
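
    The two-stage pipeline, extracting production/consumption events and then collecting them into an interaction network, can be sketched with naive verb patterns. These regular expressions are a toy illustration, far simpler than the MEE module, and the example enzyme and metabolite names are invented.

```python
import re

# naive verb patterns for metabolic events (illustrative only)
PRODUCTION = re.compile(r"(\w+) (?:produces|synthesizes) (\w+)")
CONSUMPTION = re.compile(r"(\w+) (?:consumes|degrades) (\w+)")

def extract_events(sentence):
    """(event type, enzyme, metabolite) triples found in one sentence."""
    events = [("production", e, m) for e, m in PRODUCTION.findall(sentence)]
    events += [("consumption", e, m) for e, m in CONSUMPTION.findall(sentence)]
    return events

def build_network(sentences):
    """Collect enzyme-metabolite edges over a corpus of sentences."""
    return {(e, m, t) for s in sentences for t, e, m in extract_events(s)}
```
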

  13. Significant Term List Based Metadata Conceptual Mining Model for Effective Text Clustering

    Directory of Open Access Journals (Sweden)

    J. Janet

    2012-01-01

    Full Text Available As the engineering world grows fast, the use of data in the day-to-day activity of the engineering industry is also growing rapidly. Data mining is very helpful for handling huge data stores and finding the hidden knowledge in them. Text mining, network mining, multimedia mining and trend analysis are a few applications of data mining. In text mining, a variety of methods have been proposed by many researchers, yet high precision and good recall remain critical issues. This study focuses on text mining and applies a conceptual mining model for improved clustering. The proposed work, termed the Metadata Conceptual Mining Model (MCMM), is validated with a few of the world's leading technical digital library data sets, such as IEEE, ACM and Scopus. Performance, in terms of precision and recall, is described by Entropy and F-Measure, which are calculated and compared with an existing term-based model and a concept-based mining model.
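
    The entropy measure used to evaluate clusterings follows the standard definition: the entropy of the class labels inside each cluster, averaged over clusters weighted by size (0 means every cluster is pure).

```python
import math
from collections import Counter

def cluster_entropy(labels_in_cluster):
    """Shannon entropy of the class labels inside one cluster (0 = pure)."""
    counts = Counter(labels_in_cluster)
    n = len(labels_in_cluster)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def total_entropy(clusters):
    """Size-weighted average entropy over all clusters of a clustering."""
    n = sum(len(c) for c in clusters)
    return sum(len(c) / n * cluster_entropy(c) for c in clusters)
```
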

  14. Role of text mining in early identification of potential drug safety issues.

    Science.gov (United States)

    Liu, Mei; Hu, Yong; Tang, Buzhou

    2014-01-01

    Drugs are an important part of today's medicine, designed to treat, control, and prevent diseases; however, besides their therapeutic effects, drugs may also cause adverse effects that range from cosmetic to severe morbidity and mortality. To identify these potential drug safety issues early, surveillance must be conducted for each drug throughout its life cycle, from drug development to different phases of clinical trials, and continued after market approval. A major aim of pharmacovigilance is to identify the potential drug-event associations that may be novel in nature, severity, and/or frequency. Currently, the state-of-the-art approach for signal detection is through automated procedures by analyzing vast quantities of data for clinical knowledge. There exists a variety of resources for the task, and many of them are textual data that require text analytics and natural language processing to derive high-quality information. This chapter focuses on the utilization of text mining techniques in identifying potential safety issues of drugs from textual sources such as biomedical literature, consumer posts in social media, and narrative electronic medical records.
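    Once drug-event pairs have been mined from text, signal detection commonly applies a disproportionality statistic. As a hedged illustration (the chapter's own methods are not detailed here), the proportional reporting ratio (PRR) can be computed from a 2x2 contingency table of report counts; the counts below are invented.

```python
def proportional_reporting_ratio(a, b, c, d):
    """PRR from a 2x2 contingency table of spontaneous reports.

    a: reports with the drug and the event
    b: reports with the drug and other events
    c: reports with other drugs and the event
    d: reports with other drugs and other events
    """
    rate_drug = a / (a + b)    # event rate among reports for the drug
    rate_other = c / (c + d)   # event rate among all other reports
    return rate_drug / rate_other

# Hypothetical counts: 20 of 200 reports for the drug mention the event,
# versus 40 of 3800 reports for all other drugs.
print(round(proportional_reporting_ratio(20, 180, 40, 3760), 2))  # → 9.5
```

    A PRR well above 1 flags a drug-event pair for expert review; thresholds such as PRR >= 2 with a minimum report count are typical in practice.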

  15. A collaborative biomedical image mining framework: application on the image analysis of microscopic kidney biopsies.

    Science.gov (United States)

    Goudas, T; Doukas, C; Chatziioannou, A; Maglogiannis, I

    2013-01-01

    The analysis and characterization of biomedical image data is a complex procedure involving several processing phases, like data acquisition, preprocessing, segmentation, feature extraction and classification. The proper combination and parameterization of the utilized methods depend heavily on the given image data set and experiment type. They may thus require advanced image processing and classification knowledge and skills on the part of the biomedical expert. In this work, an application exploiting web services and applying ontological modeling is presented to enable the intelligent creation of image mining workflows. The described tool can be directly integrated into RapidMiner, Taverna or similar workflow management platforms. A case study dealing with the creation of a sample workflow for the analysis of kidney biopsy microscopy images is presented to demonstrate the functionality of the proposed framework.

  16. A feature representation method for biomedical scientific data based on composite text description

    Institute of Scientific and Technical Information of China (English)

    SUN; Wei

    2009-01-01

    Feature representation is one of the key issues in data clustering. The existing feature representation of scientific data is not sufficient, which to some extent affects the results of scientific data clustering. Therefore, this paper proposes the concept of a composite text description (CTD) and a CTD-based feature representation method for biomedical scientific data. The method mainly uses different feature weighting algorithms to represent candidate features based on two types of data sources respectively, then combines and finally strengthens the two feature sets. Experiments show that the feature representation method is more effective than traditional methods and can significantly improve the performance of biomedical data clustering.

  17. Opinion Mining in Latvian Text Using Semantic Polarity Analysis and Machine Learning Approach

    Directory of Open Access Journals (Sweden)

    Gatis Špats

    2016-07-01

    Full Text Available In this paper we demonstrate approaches for opinion mining in Latvian text. The authors have applied, combined and extended the results of several previous studies and public resources to perform opinion mining in Latvian text using two approaches, namely semantic polarity analysis and machine learning. One of the most significant constraints that makes opinion mining for written content classification in Latvian challenging is the limited publicly available text corpora for classifier training. We have joined several sources and created a publicly available extended lexicon. Our results are comparable to or outperform current achievements in opinion mining in Latvian. Experiments show that lexicon-based methods provide more accurate opinion mining than the application of a Naive Bayes machine learning classifier on Latvian tweets. The methods used in this study could be further extended using human annotators, unsupervised machine learning and bootstrapping to create larger corpora of classified text.
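    The lexicon-based semantic polarity approach mentioned above can be sketched as follows; the tiny English lexicon is a stand-in for the extended Latvian lexicon the authors built, and real systems also handle negation and morphology.

```python
# Toy polarity lexicon; a real (Latvian) lexicon would be far larger.
LEXICON = {"good": 1, "great": 1, "happy": 1, "bad": -1, "awful": -1, "sad": -1}

def polarity(text):
    """Classify text as 'positive', 'negative' or 'neutral' by summing word scores."""
    score = sum(LEXICON.get(word, 0) for word in text.lower().split())
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(polarity("What a great and happy day"))  # → positive
print(polarity("an awful result"))             # → negative
```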

  18. Generation of silver standard concept annotations from biomedical texts with special relevance to phenotypes.

    Directory of Open Access Journals (Sweden)

    Anika Oellrich

    Full Text Available Electronic health records and scientific articles possess differing linguistic characteristics that may impact the performance of natural language processing tools developed for one or the other. In this paper, we investigate the performance of four extant concept recognition tools: the clinical Text Analysis and Knowledge Extraction System (cTAKES), the National Center for Biomedical Ontology (NCBO) Annotator, the Biomedical Concept Annotation System (BeCAS) and MetaMap. Each of the four concept recognition systems is applied to four different corpora: the i2b2 corpus of clinical documents, a PubMed corpus of Medline abstracts, a clinical trials corpus and the ShARe/CLEF corpus. In addition, we assess the individual system performances with respect to one gold standard annotation set, available for the ShARe/CLEF corpus. Furthermore, we built a silver standard annotation set from the individual systems' output and assess its quality as well as the contribution of individual systems to the quality of the silver standard. Our results demonstrate that mainly the NCBO Annotator and cTAKES contribute to the silver standard corpora (F1-measures in the range of 21% to 74%) and their quality (best F1-measure of 33%), independent of the type of text investigated. While BeCAS and MetaMap can contribute to the precision of silver standard annotations (precision of up to 42%), the F1-measure drops when they are combined with the NCBO Annotator and cTAKES due to a low recall. In conclusion, the performances of individual systems need to be improved independently of the text types, and the leveraging strategies to best take advantage of individual systems' annotations need to be revised. The textual content of the PubMed corpus, accession numbers for the clinical trials corpus, and assigned annotations of the four concept recognition systems as well as the generated silver standard annotation sets are available from http://purl.org/phenotype/resources.
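    A silver standard of the kind described above is typically built by voting over the individual systems' annotations. The sketch below assumes annotations are (start, end, concept) tuples and uses an invented two-vote threshold; the paper's exact harmonisation procedure may differ.

```python
from collections import Counter

def silver_standard(system_annotations, min_votes=2):
    """Keep annotations proposed by at least `min_votes` of the systems.

    system_annotations: list of sets of (start, end, concept_id) tuples,
    one set per concept recognition system.
    """
    votes = Counter()
    for annotations in system_annotations:
        votes.update(annotations)
    return {ann for ann, n in votes.items() if n >= min_votes}

# Invented outputs from three hypothetical systems
ctakes  = {(0, 5, "C001"), (10, 18, "C002")}
ncbo    = {(0, 5, "C001"), (20, 25, "C003")}
metamap = {(0, 5, "C001"), (10, 18, "C002")}
print(sorted(silver_standard([ctakes, ncbo, metamap])))
# → [(0, 5, 'C001'), (10, 18, 'C002')]
```

    Raising `min_votes` trades recall for precision, which mirrors the precision/recall trade-offs the paper reports when adding or dropping systems.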

  19. Parts-of-Speech Tagger Errors Do Not Necessarily Degrade Accuracy in Extracting Information from Biomedical Text

    CERN Document Server

    Ling, Maurice HT; Nicholas, Kevin R

    2008-01-01

    A recent study reported the development of Muscorian, a generic text processing tool for extracting protein-protein interactions from text that achieved performance comparable to biomedical-specific text processing tools. This result was unexpected, since potential errors from a series of text analysis processes are likely to adversely affect the outcome of the entire process. Most biomedical entity relationship extraction tools have used a biomedical-specific parts-of-speech (POS) tagger, as errors in POS tagging are likely to affect subsequent semantic analysis of the text, such as shallow parsing. This study aims to evaluate POS tagging accuracy and explores whether comparable performance is obtained when a generic POS tagger, MontyTagger, is used in place of MedPost, a tagger trained on biomedical text. Our results demonstrated that MontyTagger, Muscorian's POS tagger, has a POS tagging accuracy of 83.1% when tested on biomedical text. Replacing MontyTagger with MedPost did ...
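    The POS tagging accuracy figure above (83.1%) is token-level accuracy against a gold standard, which can be computed as in this minimal sketch (the Penn Treebank-style tags below are illustrative, not data from the study):

```python
def tagging_accuracy(predicted, gold):
    """Token-level POS tagging accuracy: fraction of tokens tagged as in the gold standard."""
    assert len(predicted) == len(gold), "sequences must be token-aligned"
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)

gold      = ["NN", "VBZ", "DT", "NN", "IN", "NN"]
predicted = ["NN", "VBZ", "DT", "JJ", "IN", "NN"]  # one tagging error
print(round(tagging_accuracy(predicted, gold), 3))  # → 0.833
```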

  20. BICEPP: an example-based statistical text mining method for predicting the binary characteristics of drugs

    Directory of Open Access Journals (Sweden)

    Tsafnat Guy

    2011-04-01

    Full Text Available Abstract Background The identification of drug characteristics is a clinically important task, but it requires much expert knowledge and consumes substantial resources. We have developed a statistical text-mining approach (BInary Characteristics Extractor and biomedical Properties Predictor: BICEPP) to help experts screen drugs that may have important clinical characteristics of interest. Results BICEPP first retrieves MEDLINE abstracts containing drug names, then selects tokens that best predict the list of drugs which represents the characteristic of interest. Machine learning is then used to classify drugs using a document frequency-based measure. Evaluation experiments were performed to validate BICEPP's performance on 484 characteristics of 857 drugs, identified from the Australian Medicines Handbook (AMH) and the PharmacoKinetic Interaction Screening (PKIS) database. Stratified cross-validations revealed that BICEPP was able to classify drugs into all 20 major therapeutic classes (100%) and 157 (of 197) minor drug classes (80%) with areas under the receiver operating characteristic curve (AUC) > 0.80. Similarly, AUC > 0.80 could be obtained in the classification of 173 (of 238) adverse events (73%), up to 12 (of 15) groups of clinically significant cytochrome P450 enzyme (CYP) inducers or inhibitors (80%), and up to 11 (of 14) groups of narrow therapeutic index drugs (79%). Interestingly, it was observed that the keywords used to describe a drug characteristic were not necessarily the most predictive ones for the classification task. Conclusions BICEPP has sufficient classification power to automatically distinguish a wide range of clinical properties of drugs. This may be used in pharmacovigilance applications to assist with rapid screening of large drug databases to identify important characteristics for further evaluation.

  1. A Two Step Data Mining Approach for Amharic Text Classification

    Directory of Open Access Journals (Sweden)

    Seffi Gebeyehu

    2016-08-01

    Full Text Available Traditionally, text classifiers are built from labeled training examples (supervised learning). Labeling is usually done manually by human experts (or the users), which is a labor-intensive and time-consuming process. In the past few years, researchers have investigated various forms of semi-supervised learning to reduce the burden of manual labeling. This paper aims to show that the accuracy of learned text classifiers can be improved by augmenting a small number of labeled training documents with a large pool of unlabeled documents. This is important because in many text classification problems obtaining training labels is expensive, while large quantities of unlabeled documents are readily available. In this paper, we implement an algorithm for learning from labeled and unlabeled documents based on the combination of Expectation-Maximization (EM) and two classifiers: Naive Bayes (NB) and locally weighted learning (LWL). NB first trains a classifier using the available labeled documents and probabilistically labels the unlabeled documents, while LWL uses a class of function approximation to build a model around the current point of interest. An experiment conducted on a mixture of labeled and unlabeled Amharic text documents showed that the new method achieved a significant performance improvement in comparison with supervised LWL and NB. The results also pointed out that the use of unlabeled data with EM reduces the absolute classification error by 27.6%. In general, since unlabeled documents are much less expensive and easier to collect than labeled documents, this method will be useful for text categorization tasks including online data sources such as web pages, e-mails and newsgroup postings. If one uses this method, building text categorization systems will be significantly faster and less expensive than with the supervised learning approach.
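    The EM-with-Naive-Bayes procedure described above can be sketched in a minimal form: train NB on the labeled documents, probabilistically label the unlabeled ones (E-step), and retrain on the combined, weighted set (M-step). The toy English documents and smoothing choices below are assumptions, and the paper's LWL component is omitted.

```python
import math
from collections import Counter

def train(docs, vocab):
    """docs: list of (tokens, {cls: weight}) pairs; returns (log_prior, log_like)."""
    classes = sorted({c for _, weights in docs for c in weights})
    prior = {c: 1.0 for c in classes}                 # smoothed class priors
    counts = {c: Counter() for c in classes}
    for tokens, weights in docs:
        for c, w in weights.items():
            prior[c] += w
            for t in tokens:
                counts[c][t] += w
    total = sum(prior.values())
    log_prior = {c: math.log(prior[c] / total) for c in classes}
    log_like = {}
    for c in classes:
        denom = sum(counts[c].values()) + len(vocab)  # Laplace smoothing
        log_like[c] = {t: math.log((counts[c][t] + 1) / denom) for t in vocab}
    return log_prior, log_like

def posterior(tokens, log_prior, log_like):
    """P(class | tokens) under the NB model; tokens outside the vocabulary are ignored."""
    scores = {c: log_prior[c] + sum(log_like[c].get(t, 0.0) for t in tokens)
              for c in log_prior}
    m = max(scores.values())
    exp_scores = {c: math.exp(s - m) for c, s in scores.items()}
    z = sum(exp_scores.values())
    return {c: v / z for c, v in exp_scores.items()}

def em_nb(labeled, unlabeled, iterations=5):
    """labeled: list of (tokens, cls) pairs; unlabeled: list of token lists."""
    vocab = {t for tokens, _ in labeled for t in tokens}
    vocab |= {t for tokens in unlabeled for t in tokens}
    hard = [(tokens, {cls: 1.0}) for tokens, cls in labeled]
    model = train(hard, vocab)
    for _ in range(iterations):
        # E-step: probabilistically label the unlabeled documents
        soft = [(tokens, posterior(tokens, *model)) for tokens in unlabeled]
        # M-step: retrain on labeled plus probabilistically labeled documents
        model = train(hard + soft, vocab)
    return model

labeled = [(["cheap", "viagra", "offer"], "spam"),
           (["meeting", "agenda", "notes"], "ham")]
unlabeled = [["cheap", "offer", "now"], ["project", "meeting", "tomorrow"]]
log_prior, log_like = em_nb(labeled, unlabeled)
probs = posterior(["cheap", "offer"], log_prior, log_like)
print(max(probs, key=probs.get))  # → spam
```

    The unlabeled documents sharpen the word likelihoods (e.g. "now" and "tomorrow" acquire class associations), which is the effect the paper measures as reduced classification error.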

  2. Getting more out of biomedical documents with GATE's full lifecycle open source text analytics.

    Directory of Open Access Journals (Sweden)

    Hamish Cunningham

    Full Text Available This software article describes the GATE family of open source text analysis tools and processes. GATE is one of the most widely used systems of its type with yearly download rates of tens of thousands and many active users in both academic and industrial contexts. In this paper we report three examples of GATE-based systems operating in the life sciences and in medicine. First, in genome-wide association studies which have contributed to discovery of a head and neck cancer mutation association. Second, medical records analysis which has significantly increased the statistical power of treatment/outcome models in the UK's largest psychiatric patient cohort. Third, richer constructs in drug-related searching. We also explore the ways in which the GATE family supports the various stages of the lifecycle present in our examples. We conclude that the deployment of text mining for document abstraction or rich search and navigation is best thought of as a process, and that with the right computational tools and data collection strategies this process can be made defined and repeatable. The GATE research programme is now 20 years old and has grown from its roots as a specialist development tool for text processing to become a rather comprehensive ecosystem, bringing together software developers, language engineers and research staff from diverse fields. GATE now has a strong claim to cover a uniquely wide range of the lifecycle of text analysis systems. It forms a focal point for the integration and reuse of advances that have been made by many people (the majority outside of the authors' own group who work in text processing for biomedicine and other areas. GATE is available online under GNU open source licences and runs on all major operating systems. Support is available from an active user and developer community and also on a commercial basis.

  3. Using Text Mining to Uncover Students' Technology-Related Problems in Live Video Streaming

    Science.gov (United States)

    Abdous, M'hammed; He, Wu

    2011-01-01

    Because of their capacity to sift through large amounts of data, text mining and data mining are enabling higher education institutions to reveal valuable patterns in students' learning behaviours without having to resort to traditional survey methods. In an effort to uncover live video streaming (LVS) students' technology-related problems and to…

  4. An Evaluation of Text Mining Tools as Applied to Selected Scientific and Engineering Literature.

    Science.gov (United States)

    Trybula, Walter J.; Wyllys, Ronald E.

    2000-01-01

    Addresses an approach to the discovery of scientific knowledge through an examination of data mining and text mining techniques. Presents the results of experiments that investigated knowledge acquisition from a selected set of technical documents by domain experts. (Contains 15 references.) (Author/LRW)

  5. Signal Detection Framework Using Semantic Text Mining Techniques

    Science.gov (United States)

    Sudarsan, Sithu D.

    2009-01-01

    Signal detection is a challenging task for regulatory and intelligence agencies. Subject matter experts in those agencies analyze documents, generally containing narrative text in a time bound manner for signals by identification, evaluation and confirmation, leading to follow-up action e.g., recalling a defective product or public advisory for…

  6. On Utilization and Importance of Perl Status Reporter (SRr) in Text Mining

    CERN Document Server

    Sharma, Sugam; Cohly, Hari

    2010-01-01

    In bioinformatics, text mining (sometimes used interchangeably with text data mining) is a process to derive high-quality information from text. Perl Status Reporter (SRr) is a data fetching tool for flat text files, and in this research paper we illustrate the use of SRr in text or data mining. SRr needs a flat text input file on which the mining process is to be performed. SRr reads the input file and derives the high-quality information from it. Typical text mining tasks are text categorization, text clustering, concept and entity extraction, and document summarization. SRr can be utilized for any of these tasks with little or no customizing effort. In our implementation we perform a text categorization mining operation on the input file. The input file has two parameters of interest (firstKey and secondKey). The composition of these two parameters describes the uniqueness of entries in that file in a similar manner to a composite key in a database. SRr reads the input file line by line and extracts the parameter...
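    Based on the description above, composite-key categorization over a flat file can be sketched as follows. The whitespace-separated format and the field positions of firstKey and secondKey are assumptions, since SRr's actual input format is not specified here.

```python
from collections import defaultdict

def categorize(lines):
    """Group records from a flat file by a composite (firstKey, secondKey) key.

    Each line is assumed to hold whitespace-separated fields, with the first
    two fields acting as firstKey and secondKey (a hypothetical layout).
    """
    categories = defaultdict(list)
    for line in lines:
        fields = line.split()
        if len(fields) < 2:      # skip malformed lines
            continue
        first_key, second_key = fields[0], fields[1]
        categories[(first_key, second_key)].append(line.strip())
    return dict(categories)

flat_file = [
    "geneA tissue1 expression=5.2",
    "geneA tissue2 expression=1.1",
    "geneA tissue1 expression=4.9",
]
groups = categorize(flat_file)
print(len(groups[("geneA", "tissue1")]))  # → 2
```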

  7. Getting more out of biomedical documents with GATE's full lifecycle open source text analytics.

    Science.gov (United States)

    Cunningham, Hamish; Tablan, Valentin; Roberts, Angus; Bontcheva, Kalina

    2013-01-01

    This software article describes the GATE family of open source text analysis tools and processes. GATE is one of the most widely used systems of its type with yearly download rates of tens of thousands and many active users in both academic and industrial contexts. In this paper we report three examples of GATE-based systems operating in the life sciences and in medicine. First, in genome-wide association studies which have contributed to discovery of a head and neck cancer mutation association. Second, medical records analysis which has significantly increased the statistical power of treatment/outcome models in the UK's largest psychiatric patient cohort. Third, richer constructs in drug-related searching. We also explore the ways in which the GATE family supports the various stages of the lifecycle present in our examples. We conclude that the deployment of text mining for document abstraction or rich search and navigation is best thought of as a process, and that with the right computational tools and data collection strategies this process can be made defined and repeatable. The GATE research programme is now 20 years old and has grown from its roots as a specialist development tool for text processing to become a rather comprehensive ecosystem, bringing together software developers, language engineers and research staff from diverse fields. GATE now has a strong claim to cover a uniquely wide range of the lifecycle of text analysis systems. It forms a focal point for the integration and reuse of advances that have been made by many people (the majority outside of the authors' own group) who work in text processing for biomedicine and other areas. GATE is available online under GNU open source licences and runs on all major operating systems. Support is available from an active user and developer community and also on a commercial basis.

  8. A Survey of Topic Modeling in Text Mining

    Directory of Open Access Journals (Sweden)

    Rubayyi Alghamdi

    2015-01-01

    Full Text Available Topic models provide a convenient way to analyze large volumes of unclassified text. A topic contains a cluster of words that frequently occur together. Topic modeling can connect words with similar meanings and distinguish between uses of words with multiple meanings. This paper covers two categories within the field of topic modeling. The first concerns methods of topic modeling, of which four are considered: Latent semantic analysis (LSA), Probabilistic latent semantic analysis (PLSA), Latent Dirichlet allocation (LDA), and the Correlated topic model (CTM). The second category is called topic evolution models, which model topics by considering an important factor, time. In the second category, different models are discussed, such as topic over time (TOT), dynamic topic models (DTM), multiscale topic tomography, dynamic topic correlation detection, detecting topic evolution in scientific literature, etc.
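    Of the methods listed, LDA is the most widely used; a toy collapsed Gibbs sampler conveys the idea of assigning each token a topic and resampling it from counts. This is a from-scratch illustration with invented hyperparameters and corpus, not production code.

```python
import random
from collections import defaultdict

def lda_gibbs(docs, num_topics, vocab_size, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA on tokenized documents (toy scale)."""
    rng = random.Random(seed)
    doc_topic = [[0] * num_topics for _ in docs]                # doc -> topic counts
    topic_word = [defaultdict(int) for _ in range(num_topics)]  # topic -> word counts
    topic_total = [0] * num_topics
    z = []                                  # z[d][i]: topic of token i in doc d
    for d, doc in enumerate(docs):          # random initial assignments
        assignment = []
        for w in doc:
            t = rng.randrange(num_topics)
            assignment.append(t)
            doc_topic[d][t] += 1
            topic_word[t][w] += 1
            topic_total[t] += 1
        z.append(assignment)
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t = z[d][i]                 # remove the current assignment
                doc_topic[d][t] -= 1
                topic_word[t][w] -= 1
                topic_total[t] -= 1
                # resample proportional to P(topic | doc) * P(word | topic)
                weights = [(doc_topic[d][k] + alpha) *
                           (topic_word[k][w] + beta) / (topic_total[k] + vocab_size * beta)
                           for k in range(num_topics)]
                t = rng.choices(range(num_topics), weights=weights)[0]
                z[d][i] = t
                doc_topic[d][t] += 1
                topic_word[t][w] += 1
                topic_total[t] += 1
    return doc_topic, topic_word

docs = [["gene", "protein", "gene", "cell"],
        ["stock", "market", "stock", "price"],
        ["gene", "cell", "protein"],
        ["market", "price", "stock"]]
vocab = {w for doc in docs for w in doc}
doc_topic, topic_word = lda_gibbs(docs, 2, len(vocab))
print(doc_topic[0])  # e.g. [4, 0] once one topic has absorbed the biology words
```

    On this tiny corpus the sampler typically separates the "biology" and "finance" vocabularies into the two topics, which is the clustering behaviour the survey describes.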

  9. Text and Structural Data Mining of Influenza Mentions in Web and Social Media

    Directory of Open Access Journals (Sweden)

    Karan P. Singh

    2010-02-01

    Full Text Available Text and structural data mining of web and social media (WSM provides a novel disease surveillance resource and can identify online communities for targeted public health communications (PHC to assure wide dissemination of pertinent information. WSM that mention influenza are harvested over a 24-week period, 5 October 2008 to 21 March 2009. Link analysis reveals communities for targeted PHC. Text mining is shown to identify trends in flu posts that correlate to real-world influenza-like illness patient report data. We also bring to bear a graph-based data mining technique to detect anomalies among flu blogs connected by publisher type, links, and user-tags.

  10. Comparative Study of Clustering Algorithms in Text Mining Context

    Directory of Open Access Journals (Sweden)

    Abdennour Mohamed Jalil

    2016-06-01

    Full Text Available The spectacular increase of data is due to the appearance of networks and smartphones. Around 42% of the world population uses the internet [1], which has created a problem of processing the exchanged data, which is rising exponentially and should be treated automatically. This paper presents a classical process of knowledge discovery in databases, in order to treat textual data. This process is divided into three parts: preprocessing, processing and post-processing. In the processing step, we present a comparative study between several clustering algorithms such as KMeans, Global KMeans, Fast Global KMeans, Two Level KMeans and FWKmeans. The comparison between these algorithms is made on real textual data from the web using RSS feeds. Experimental results identified two problems: the first concerns the quality of results, which remains an issue for algorithms that converge rapidly; the second concerns the execution time, which needs to decrease for some algorithms.
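    Plain KMeans, the baseline in the comparison above, can be sketched on toy term-frequency vectors as follows (the vectors are invented; real text vectors are high-dimensional and sparse, and the variants compared in the paper differ mainly in how centroids are initialized):

```python
import random

def kmeans(vectors, k, iterations=100, seed=0):
    """Plain KMeans on dense tuples (e.g. bag-of-words counts)."""
    rng = random.Random(seed)
    centroids = rng.sample(vectors, k)           # random initial centroids
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for v in vectors:                        # assign to nearest centroid
            distances = [sum((a - b) ** 2 for a, b in zip(v, c)) for c in centroids]
            clusters[distances.index(min(distances))].append(v)
        new_centroids = []
        for c, cluster in zip(centroids, clusters):
            if cluster:                          # recompute centroid as the mean
                dim = len(cluster[0])
                new_centroids.append(tuple(sum(v[i] for v in cluster) / len(cluster)
                                           for i in range(dim)))
            else:
                new_centroids.append(c)          # keep old centroid for empty clusters
        if new_centroids == centroids:           # converged
            break
        centroids = new_centroids
    return clusters

# Two obvious groups of toy term-frequency vectors
vectors = [(5, 0), (4, 1), (0, 5), (1, 4)]
clusters = kmeans(vectors, 2)
print(sorted(len(c) for c in clusters))  # → [2, 2]
```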

  11. Text Mining in Python through the HTRC Feature Reader

    Directory of Open Access Journals (Sweden)

    Peter Organisciak

    2016-11-01

    Full Text Available We introduce a toolkit for working with the 13.6 million volume Extracted Features Dataset from the HathiTrust Research Center. You will learn how to peer at the words and trends of any book in the collection, while developing broadly useful Python data analysis skills. The HathiTrust holds nearly 15 million digitized volumes from libraries around the world. In addition to their individual value, these works in aggregate are extremely valuable for historians. Spanning many centuries and genres, they offer a way to learn about large-scale trends in history and culture, as well as evidence for changes in language or even the structure of the book. To simplify access to this collection the HathiTrust Research Center (HTRC) has released the Extracted Features dataset (Capitanu et al. 2015): a dataset that provides quantitative information describing every page of every volume in the collection. In this lesson, we introduce the HTRC Feature Reader, a library for working with the HTRC Extracted Features dataset using the Python programming language. The HTRC Feature Reader is structured to support work using popular data science libraries, particularly Pandas. Pandas provides simple structures for holding data and powerful ways to interact with it. The HTRC Feature Reader uses these data structures, so learning how to use it will also cover general data analysis skills in Python.

  12. A review of the applications of data mining and machine learning for the prediction of biomedical properties of nanoparticles.

    Science.gov (United States)

    Jones, David E; Ghandehari, Hamidreza; Facelli, Julio C

    2016-08-01

    This article presents a comprehensive review of applications of data mining and machine learning for the prediction of biomedical properties of nanoparticles of medical interest. The papers reviewed here present the results of research using these techniques to predict the biological fate and properties of a variety of nanoparticles relevant to their biomedical applications. These include the influence of particle physicochemical properties on cellular uptake, cytotoxicity, molecular loading, and molecular release in addition to manufacturing properties like nanoparticle size, and polydispersity. Overall, the results are encouraging and suggest that as more systematic data from nanoparticles becomes available, machine learning and data mining would become a powerful aid in the design of nanoparticles for biomedical applications. There is however the challenge of great heterogeneity in nanoparticles, which will make these discoveries more challenging than for traditional small molecule drug design.

  13. Text Mining on the Internet%Internet 上的文本数据挖掘

    Institute of Scientific and Technical Information of China (English)

    王伟强; 高文; 段立娟

    2000-01-01

    The booming growth of the Internet has made text mining on it a promising research field in practice. The paper briefly introduces several aspects of this field, covering some potential applications, some techniques used and some existing systems.

  14. Visualization Model for Chinese Text Mining%可视化中文文本挖掘模型

    Institute of Scientific and Technical Information of China (English)

    林鸿飞; 贡大跃; 张跃; 姚天顺

    2000-01-01

    This paper briefly describes the background of text mining and the main difficulties in Chinese text mining, presents a visual model for Chinese text mining, and puts forward a concept-based method for text categorization, a statistics-based method for text summarization and a method for identifying Chinese names.

  15. Text mining and visualization case studies using open-source tools

    CERN Document Server

    Chisholm, Andrew

    2016-01-01

    Text Mining and Visualization: Case Studies Using Open-Source Tools provides an introduction to text mining using some of the most popular and powerful open-source tools: KNIME, RapidMiner, Weka, R, and Python. The contributors, all highly experienced with text mining and open-source software, explain how text data are gathered and processed from a wide variety of sources, including books, server access logs, websites, social media sites, and message boards. Each chapter presents a case study that you can follow as part of a step-by-step, reproducible example. You can also easily apply and extend the techniques to other problems. All the examples are available on a supplementary website. The book shows you how to exploit your text data, offering successful application examples and blueprints for you to tackle your text mining tasks and benefit from open and freely available tools. It gets you up to date on the latest and most powerful tools, the data mining process, and specific text mining activities.

  16. Using text-mining techniques in electronic patient records to identify ADRs from medicine use.

    Science.gov (United States)

    Warrer, Pernille; Hansen, Ebba Holme; Juhl-Jensen, Lars; Aagaard, Lise

    2012-05-01

    This literature review included studies that use text-mining techniques in narrative documents stored in electronic patient records (EPRs) to investigate ADRs. We searched PubMed, Embase, Web of Science and International Pharmaceutical Abstracts without restrictions from origin until July 2011. We included empirically based studies on text mining of electronic patient records (EPRs) that focused on detecting ADRs, excluding those that investigated adverse events not related to medicine use. We extracted information on study populations, EPR data sources, frequencies and types of the identified ADRs, medicines associated with ADRs, text-mining algorithms used and their performance. Seven studies, all from the United States, were eligible for inclusion in the review. Studies were published from 2001, the majority between 2009 and 2010. Text-mining techniques varied over time from simple free text searching of outpatient visit notes and inpatient discharge summaries to more advanced techniques involving natural language processing (NLP) of inpatient discharge summaries. Performance appeared to increase with the use of NLP, although many ADRs were still missed. Due to differences in study design and populations, various types of ADRs were identified and thus we could not make comparisons across studies. The review underscores the feasibility and potential of text mining to investigate narrative documents in EPRs for ADRs. However, more empirical studies are needed to evaluate whether text mining of EPRs can be used systematically to collect new information about ADRs.
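    The simple free-text searching that the earliest reviewed studies used can be sketched as sentence-level co-occurrence of drug and event terms. The mini-lexicons below are invented; real systems draw on large terminologies (e.g. MedDRA), and the NLP-based approaches the review describes go well beyond naive substring matching.

```python
import re

# Hypothetical mini-lexicons for illustration only.
DRUGS = ["warfarin", "amoxicillin"]
EVENTS = ["rash", "bleeding", "nausea"]

def find_candidate_adrs(note):
    """Flag drug-event term pairs co-occurring in the same sentence of a clinical note."""
    pairs = []
    for sentence in re.split(r"(?<=[.!?])\s+", note.lower()):
        drugs = [d for d in DRUGS if d in sentence]    # naive substring matching
        events = [e for e in EVENTS if e in sentence]
        pairs.extend((d, e) for d in drugs for e in events)
    return pairs

note = ("Patient started on warfarin last month. "
        "Developed bleeding and nausea after warfarin dose increase. "
        "No rash observed.")
print(find_candidate_adrs(note))
# → [('warfarin', 'bleeding'), ('warfarin', 'nausea')]
```

    The third sentence illustrates why NLP matters: "No rash observed" is correctly skipped here only because no drug term co-occurs, whereas real systems must handle negation explicitly.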

  17. Text and structural data mining of influenza mentions in Web and social media.

    Science.gov (United States)

    Corley, Courtney D; Cook, Diane J; Mikler, Armin R; Singh, Karan P

    2010-02-01

    Text and structural data mining of web and social media (WSM) provides a novel disease surveillance resource and can identify online communities for targeted public health communications (PHC) to assure wide dissemination of pertinent information. WSM that mention influenza are harvested over a 24-week period, 5 October 2008 to 21 March 2009. Link analysis reveals communities for targeted PHC. Text mining is shown to identify trends in flu posts that correlate to real-world influenza-like illness patient report data. We also bring to bear a graph-based data mining technique to detect anomalies among flu blogs connected by publisher type, links, and user-tags.

  18. Aspects of Text Mining From Computational Semiotics to Systemic Functional Hypertexts

    Directory of Open Access Journals (Sweden)

    Alexander Mehler

    2001-05-01

    Full Text Available The significance of natural language texts as the prime information structure for the management and dissemination of knowledge in organisations is still increasing. Making relevant documents available depending on varying tasks in different contexts is of primary importance for any efficient task completion. Implementing this demand requires the content-based processing of texts, which makes it possible to reconstruct or, if necessary, to explore the relationship of task, context and document. Text mining is a technology that is suitable for solving problems of this kind. In the following, semiotic aspects of text mining are investigated. Based on the primary object of text mining, natural language lexis, the specific complexity of this class of signs is outlined and requirements for the implementation of text mining procedures are derived. This is done with reference to text linkage, introduced as a special task in text mining. Text linkage refers to the exploration of implicit, content-based relations of texts (and their annotation as typed links) in corpora possibly organised as hypertexts. In this context, the term systemic functional hypertext is introduced, which distinguishes genre and register layers for the management of links in a poly-level hypertext system.

  19. Feature Engineering for Drug Name Recognition in Biomedical Texts: Feature Conjunction and Feature Selection

    Directory of Open Access Journals (Sweden)

    Shengyu Liu

    2015-01-01

    Full Text Available Drug name recognition (DNR) is a critical step for drug information extraction. Machine learning-based methods have been widely used for DNR with various types of features such as part-of-speech, word shape, and dictionary features. Features used in current machine learning-based methods are usually singleton features, perhaps because combining singleton features into conjunction features can cause a feature explosion and introduce a large number of noisy features. However, singleton features, which can only capture one linguistic characteristic of a word, are not sufficient to describe the information needed for DNR when multiple characteristics should be considered. In this study, we explore feature conjunction and feature selection for DNR, which have never been reported. We intuitively select 8 types of singleton features and combine them into conjunction features in two ways. Then, Chi-square, mutual information, and information gain are used to mine effective features. Experimental results show that feature conjunction and feature selection can improve the performance of the DNR system with a moderate number of features, and our DNR system significantly outperforms the best system in the DDIExtraction 2013 challenge.
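    The Chi-square feature selection mentioned above scores each feature/class pair from a 2x2 contingency table of document (or token) counts; higher scores indicate stronger association, and only the top-scoring features are kept. The counts in this sketch are invented.

```python
def chi_square(n11, n10, n01, n00):
    """Chi-square score of a feature/class pair from a 2x2 contingency table.

    n11: instances in the class with the feature, n10: instances outside
    the class with the feature, n01/n00: the same for instances lacking it.
    """
    n = n11 + n10 + n01 + n00
    num = n * (n11 * n00 - n10 * n01) ** 2
    den = (n11 + n01) * (n11 + n10) * (n10 + n00) * (n01 + n00)
    return num / den if den else 0.0

# Hypothetical counts over 100 instances for two candidate features
print(round(chi_square(20, 5, 5, 70), 2))    # → 53.78 (strongly class-associated)
print(round(chi_square(10, 10, 15, 65), 2))  # → 8.33 (weaker association)
```

    Ranking conjunction features by such a score and truncating the list is what keeps the feature set "moderate" in size, as the paper reports.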

  20. Logical implications for regulatory relations represented by verbs in biomedical texts

    DEFF Research Database (Denmark)

    Zambach, Sine

    Relations used in biomedical ontologies can be very general or very specific with respect to the domain. However, some relations are used widely, for example in regulatory networks. This work focuses on positive and negative regulatory relations, in particular their usage expressed as verbs in diffe...

  1. Ontology-based retrieval of bio-medical information based on microarray text corpora

    DEFF Research Database (Denmark)

    Hansen, Kim Allan; Zambach, Sine; Have, Christian Theil

    Microarray technology is often used in gene expression experiments. Information retrieval in the context of microarrays has mainly been concerned with the analysis of the numeric data produced; however, the experiments are often annotated with textual metadata. Although biomedical resources...... degree. We explore the possibilities of retrieving biomedical information from microarrays in Gene Expression Omnibus (GEO), of which we have indexed a sample semantically, as a first step towards ontology-based searches. Through an example we argue that it is possible to improve the retrieval...

  2. Text and Structural Data Mining of Influenza Mentions in Web and Social Media

    OpenAIRE

    Karan P Singh; Mikler, Armin R.; Cook, Diane J.; Courtney D. Corley

    2010-01-01

    Text and structural data mining of web and social media (WSM) provides a novel disease surveillance resource and can identify online communities for targeted public health communications (PHC) to assure wide dissemination of pertinent information. WSM that mention influenza are harvested over a 24-week period, 5 October 2008 to 21 March 2009. Link analysis reveals communities for targeted PHC. Text mining is shown to identify trends in flu posts that correlate to real-world influenza-like ill...

  3. A Formal Framework on the Semantics of Regulatory Relations and Their Presence as Verbs in Biomedical Texts

    DEFF Research Database (Denmark)

    Zambach, Sine

    2009-01-01

    on the logical properties of positive and negative regulations, both as formal relations and the frequency of their usage as verbs in texts. The paper discusses whether there exists a weak transitivity-like property for the relations. Our corpora consist of biomedical patents, Medline abstracts and the British...

  4. A framework of Chinese semantic text mining based on ontology learning

    Science.gov (United States)

    Zhang, Yu-feng; Hu, Feng

    2012-01-01

    Text mining and ontology learning can be effectively employed to acquire Chinese semantic information. This paper explores a framework of semantic text mining based on ontology learning to find potential semantic knowledge in the immense amount of text information on the Internet. The framework consists of four parts: Data Acquisition, Feature Extraction, Ontology Construction, and Text Knowledge Pattern Discovery. The framework is then applied to an actual case to find valuable information and even to assist consumers with selecting proper products. The results show that this framework is reasonable and effective.

  5. Annotating image ROIs with text descriptions for multimodal biomedical document retrieval

    Science.gov (United States)

    You, Daekeun; Simpson, Matthew; Antani, Sameer; Demner-Fushman, Dina; Thoma, George R.

    2013-01-01

    Regions of interest (ROIs) that are pointed to by overlaid markers (arrows, asterisks, etc.) in biomedical images are expected to contain more important and relevant information than other regions for biomedical article indexing and retrieval. We have developed several algorithms that localize and extract the ROIs by recognizing markers on images. Cropped ROIs then need to be annotated with contents describing them best. In most cases accurate textual descriptions of the ROIs can be found from figure captions, and these need to be combined with image ROIs for annotation. The annotated ROIs can then be used to, for example, train classifiers that separate ROIs into known categories (medical concepts), or to build visual ontologies, for indexing and retrieval of biomedical articles. We propose an algorithm that pairs visual and textual ROIs that are extracted from images and figure captions, respectively. This algorithm based on dynamic time warping (DTW) clusters recognized pointers into groups, each of which contains pointers with identical visual properties (shape, size, color, etc.). Then a rule-based matching algorithm finds the best matching group for each textual ROI mention. Our method yields a precision and recall of 96% and 79%, respectively, when ground truth textual ROI data is used.
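
The DTW core of the pointer-clustering step can be illustrated with a minimal pure-Python distance function. The numeric sequences below merely stand in for pointer shape profiles and are invented for illustration; the paper's actual features (shape, size, color) and clustering rules are not reproduced here.

```python
def dtw(a, b):
    """Dynamic time warping distance between two 1-D sequences."""
    inf = float("inf")
    n, m = len(a), len(b)
    # d[i][j] = best cost of aligning a[:i] with b[:j]
    d = [[inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            d[i][j] = cost + min(d[i - 1][j],      # skip a point in a
                                 d[i][j - 1],      # skip a point in b
                                 d[i - 1][j - 1])  # match
    return d[n][m]
```

Because DTW tolerates stretching along the sequence, two pointer contours of different lengths but identical shape get distance 0, which is what makes it suitable for grouping markers with the same visual form.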

  6. Compatibility between Text Mining and Qualitative Research in the Perspectives of Grounded Theory, Content Analysis, and Reliability

    Science.gov (United States)

    Yu, Chong Ho; Jannasch-Pennell, Angel; DiGangi, Samuel

    2011-01-01

    The objective of this article is to illustrate that text mining and qualitative research are epistemologically compatible. First, like many qualitative research approaches, such as grounded theory, text mining encourages open-mindedness and discourages preconceptions. Contrary to the popular belief that text mining is a linear and fully automated…

  7. Using text-mining techniques in electronic patient records to identify ADRs from medicine use

    DEFF Research Database (Denmark)

    Warrer, Pernille; Hansen, Ebba Holme; Jensen, Lars Juhl

    2012-01-01

    This literature review included studies that use text-mining techniques in narrative documents stored in electronic patient records (EPRs) to investigate ADRs. We searched PubMed, Embase, Web of Science and International Pharmaceutical Abstracts without restrictions from origin until July 2011. We......, medicines associated with ADRs, text-mining algorithms used and their performance. Seven studies, all from the United States, were eligible for inclusion in the review. Studies were published from 2001, the majority between 2009 and 2010. Text-mining techniques varied over time from simple free text...... searching of outpatient visit notes and inpatient discharge summaries to more advanced techniques involving natural language processing (NLP) of inpatient discharge summaries. Performance appeared to increase with the use of NLP, although many ADRs were still missed. Due to differences in study design...

  8. Text Matching and Categorization: Mining Implicit Semantic Knowledge from Tree-Shape Structures

    Directory of Open Access Journals (Sweden)

    Lin Guo

    2015-01-01

    Full Text Available The diversity of large-scale semistructured data makes the extraction of implicit semantic information enormously difficult. This paper proposes an automatic and unsupervised method of text categorization, in which tree-shape structures are used to represent semantic knowledge and to explore implicit information by mining hidden structures without cumbersome lexical analysis. Mining implicit frequent structures in trees can discover both direct and indirect semantic relations, which largely enhances the accuracy of matching and classifying texts. The experimental results show that the proposed algorithm remarkably reduces the time and effort spent in training and classifying, and outperforms established competitors in correctness and effectiveness.

  9. Rewriting and suppressing UMLS terms for improved biomedical term identification

    NARCIS (Netherlands)

    K.M. Hettne (Kristina); E.M. van Mulligen (Erik); M.J. Schuemie (Martijn); R.J.A. Schijvenaars (Bob); J.A. Kors (Jan)

    2010-01-01

    Background: Identification of terms is essential for biomedical text mining. We concentrate here on the use of vocabularies for term identification, specifically the Unified Medical Language System (UMLS). To make the UMLS more suitable for biomedical text mining we implemented and evalu

  10. How to learn about gene function: text-mining or ontologies?

    Science.gov (United States)

    Soldatos, Theodoros G; Perdigão, Nelson; Brown, Nigel P; Sabir, Kenneth S; O'Donoghue, Seán I

    2015-03-01

    As the amount of genome information increases rapidly, there is a correspondingly greater need for methods that provide accurate and automated annotation of gene function. For example, many high-throughput technologies--e.g., next-generation sequencing--are being used today to generate lists of genes associated with specific conditions. However, their functional interpretation remains a challenge, and many tools exist that try to characterize the function of gene lists. Such systems typically rely on enrichment analysis and aim to give a quick insight into the underlying biology by presenting it in the form of a summary report. While the load of annotation may be alleviated by such computational approaches, the main challenge in modern annotation remains to develop a systems form of analysis in which a pipeline can effectively analyze gene lists quickly and identify aggregated annotations through computerized resources. In this article we survey some of the many such tools and methods that have been developed to automatically interpret the biological functions underlying gene lists. We review current functional annotation aspects from the perspective of their epistemology (i.e., the underlying theories used to organize information about gene function into a body of verified and documented knowledge) and find that most of the currently used functional annotation methods fall broadly into one of two categories: they are based either on 'known' formally-structured ontology annotations created by 'experts' (e.g., the GO terms used to describe the function of Entrez Gene entries), or--perhaps more adventurously--on annotations inferred from literature (e.g., many text-mining methods use computer-aided reasoning to acquire knowledge represented in natural languages). Overall, however, deriving detailed and accurate insight from such gene lists remains a challenging task, and improved methods are called for. In particular, future methods need to (1) provide more holistic

  11. The interaction of domain knowledge and linguistic structure in natural language processing: interpreting hypernymic propositions in biomedical text.

    Science.gov (United States)

    Rindflesch, Thomas C; Fiszman, Marcelo

    2003-12-01

    Interpretation of semantic propositions in free-text documents such as MEDLINE citations would provide valuable support for biomedical applications, and several approaches to semantic interpretation are being pursued in the biomedical informatics community. In this paper, we describe a methodology for interpreting linguistic structures that encode hypernymic propositions, in which a more specific concept is in a taxonomic relationship with a more general concept. In order to effectively process these constructions, we exploit underspecified syntactic analysis and structured domain knowledge from the Unified Medical Language System (UMLS). After introducing the syntactic processing on which our system depends, we focus on the UMLS knowledge that supports interpretation of hypernymic propositions. We first use semantic groups from the Semantic Network to ensure that the two concepts involved are compatible; hierarchical information in the Metathesaurus then determines which concept is more general and which more specific. A preliminary evaluation of a sample based on the semantic group Chemicals and Drugs provides 83% precision. An error analysis was conducted and potential solutions to the problems encountered are presented. The research discussed here serves as a paradigm for investigating the interaction between domain knowledge and linguistic structure in natural language processing, and could also make a contribution to research on automatic processing of discourse structure. Additional implications of the system we present include its integration in advanced semantic interpretation processors for biomedical text and its use for information extraction in specific domains. The approach has the potential to support a range of applications, including information retrieval and ontology engineering.
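
As a deliberately simplified illustration of how hypernymic propositions surface in text, the sketch below matches one "such as" pattern with a regular expression. The actual system described above uses underspecified syntactic analysis plus UMLS semantic groups and Metathesaurus hierarchy, not regexes; the last-token head-noun heuristic here is likewise only an assumption for the sketch.

```python
import re

# One surface pattern for hypernymic propositions:
# "GENERAL such as SPECIFIC" (specific restricted to a single token).
PATTERN = re.compile(
    r"(?P<general>\w[\w\s-]*?)\s+such as\s+(?P<specific>\w[\w-]*)")

def hypernym_pairs(sentence):
    """Return (specific, general) pairs found by the pattern."""
    pairs = []
    for m in PATTERN.finditer(sentence):
        general = m.group("general").split()[-1]  # crude head-noun heuristic
        pairs.append((m.group("specific"), general))
    return pairs
```

In a real pipeline the two extracted concepts would then be checked for semantic-group compatibility before the taxonomic direction is assigned, as the abstract describes.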

  12. BioCreative Workshops for DOE Genome Sciences: Text Mining for Metagenomics

    Energy Technology Data Exchange (ETDEWEB)

    Wu, Cathy H. [Univ. of Delaware, Newark, DE (United States). Center for Bioinformatics and Computational Biology; Hirschman, Lynette [The MITRE Corporation, Bedford, MA (United States)

    2016-10-29

    The objective of this project was to host BioCreative workshops to define and develop text mining tasks to meet the needs of the Genome Sciences community, focusing on metadata information extraction in metagenomics. Following the successful introduction of metagenomics at the BioCreative IV workshop, members of the metagenomics community and BioCreative communities continued discussion to identify candidate topics for a BioCreative metagenomics track for BioCreative V. Of particular interest was the capture of environmental and isolation source information from text. The outcome was to form a “community of interest” around work on the interactive EXTRACT system, which supported interactive tagging of environmental and species data. This experiment is included in the BioCreative V virtual issue of Database. In addition, there was broad participation by members of the metagenomics community in the panels held at BioCreative V, leading to valuable exchanges between the text mining developers and members of the metagenomics research community. These exchanges are reflected in a number of the overview and perspective pieces also being captured in the BioCreative V virtual issue. Overall, this conversation has exposed the metagenomics researchers to the possibilities of text mining, and educated the text mining developers to the specific needs of the metagenomics community.

  13. Mining for associations between text and brain activation in a functional neuroimaging database

    DEFF Research Database (Denmark)

    Nielsen, Finn Årup; Hansen, Lars Kai; Balslev, D.

    2004-01-01

    We describe a method for mining a neuroimaging database for associations between text and brain locations. The objective is to discover association rules between words indicative of cognitive function as described in abstracts of neuroscience papers and sets of reported stereotactic Talairach...... that the statistically motivated associations are well aligned with general neuroscientific knowledge....
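
A minimal version of this kind of association mining can be sketched as one-to-one rules with support and confidence thresholds. The toy "papers" below (bags of abstract words plus invented region labels) are purely illustrative and not drawn from the neuroimaging database.

```python
def mine_rules(transactions, min_support=0.4, min_conf=0.9):
    """Mine one-to-one association rules a -> b from sets of items."""
    n = len(transactions)
    items = sorted({i for t in transactions for i in t})
    count = {i: sum(1 for t in transactions if i in t) for i in items}
    rules = []
    for a in items:
        for b in items:
            if a == b:
                continue
            both = sum(1 for t in transactions if a in t and b in t)
            support, conf = both / n, both / count[a]
            if support >= min_support and conf >= min_conf:
                rules.append((a, b, support, conf))
    return rules

# Each "transaction" pairs words from an abstract with reported regions.
papers = [{"memory", "hippocampus"},
          {"memory", "recall", "hippocampus"},
          {"vision", "occipital"},
          {"memory", "amygdala"}]
```

On this toy data the rule hippocampus -> memory holds with full confidence, while the reverse direction falls below the confidence threshold, showing why rules are directional.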

  14. Trends of E-Learning Research from 2000 to 2008: Use of Text Mining and Bibliometrics

    Science.gov (United States)

    Hung, Jui-long

    2012-01-01

    This study investigated the longitudinal trends of e-learning research using text mining techniques. Six hundred and eighty-nine (689) refereed journal articles and proceedings were retrieved from the Science Citation Index/Social Science Citation Index database in the period from 2000 to 2008. All e-learning publications were grouped into two…

  15. Complementing the Numbers: A Text Mining Analysis of College Course Withdrawals

    Science.gov (United States)

    Michalski, Greg V.

    2011-01-01

    Excessive college course withdrawals are costly to the student and the institution in terms of time to degree completion, available classroom space, and other resources. Although generally well quantified, detailed analysis of the reasons given by students for course withdrawal is less common. To address this, a text mining analysis was performed…

  16. Mining for associations between text and brain activation in a functional neuroimaging database

    DEFF Research Database (Denmark)

    Nielsen, Finn Arup; Hansen, Lars Kai; Balslev, Daniela

    2004-01-01

    We describe a method for mining a neuroimaging database for associations between text and brain locations. The objective is to discover association rules between words indicative of cognitive function as described in abstracts of neuroscience papers and sets of reported stereotactic Talairach...

  17. Analysis of Nature of Science Included in Recent Popular Writing Using Text Mining Techniques

    Science.gov (United States)

    Jiang, Feng; McComas, William F.

    2014-01-01

    This study examined the inclusion of nature of science (NOS) in popular science writing to determine whether it could serve as a supplementary resource for teaching NOS and to evaluate the accuracy of text mining and classification as a viable research tool in science education research. Four groups of documents published from 2001 to 2010 were…

  18. A Feature Mining Based Approach for the Classification of Text Documents into Disjoint Classes.

    Science.gov (United States)

    Nieto Sanchez, Salvador; Triantaphyllou, Evangelos; Kraft, Donald

    2002-01-01

    Proposes a new approach for classifying text documents into two disjoint classes. Highlights include a brief overview of document clustering; a data mining approach called the One Clause at a Time (OCAT) algorithm which is based on mathematical logic; vector space model (VSM); and comparing the OCAT to the VSM. (Author/LRW)

  19. The Determination of Children's Knowledge of Global Lunar Patterns from Online Essays Using Text Mining Analysis

    Science.gov (United States)

    Cheon, Jongpil; Lee, Sangno; Smith, Walter; Song, Jaeki; Kim, Yongjin

    2013-01-01

    The purpose of this study was to use text mining analysis of early adolescents' online essays to determine their knowledge of global lunar patterns. Australian and American students in grades five to seven wrote about global lunar patterns they had discovered by sharing observations with each other via the Internet. These essays were analyzed for…

  20. SimConcept: a hybrid approach for simplifying composite named entities in biomedical text.

    Science.gov (United States)

    Wei, Chih-Hsuan; Leaman, Robert; Lu, Zhiyong

    2015-07-01

    One particular challenge in biomedical named entity recognition (NER) and normalization is the identification and resolution of composite named entities, where a single span refers to more than one concept (e.g., BRCA1/2). Previous NER and normalization studies have either ignored composite mentions, used simple ad hoc rules, or only handled coordination ellipsis, making a robust approach for handling multitype composite mentions greatly needed. To this end, we propose a hybrid method integrating a machine-learning model with a pattern identification strategy to identify the individual components of each composite mention. Our method, which we have named SimConcept, is the first to systematically handle many types of composite mentions. The technique achieves high performance in identifying and resolving composite mentions for three key biological entities: genes (90.42% in F-measure), diseases (86.47% in F-measure), and chemicals (86.05% in F-measure). Furthermore, our results show that using our SimConcept method can subsequently improve the performance of gene and disease concept recognition and normalization. SimConcept is available for download at: http://www.ncbi.nlm.nih.gov/CBBresearch/Lu/Demo/SimConcept/.
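
The core idea of resolving a composite mention such as "BRCA1/2" can be illustrated with a single rule. SimConcept itself integrates a machine-learning model with pattern identification, so the regex below is only a hypothetical stand-in for one such pattern, not the published method.

```python
import re

def expand_composite(mention):
    """Expand slash-numbered composite mentions like 'BRCA1/2' into
    individual names; anything else is returned unchanged."""
    m = re.fullmatch(r"([A-Za-z]+)(\d+)((?:/\d+)+)", mention)
    if not m:
        return [mention]
    stem, first, rest = m.groups()
    numbers = [first] + rest.strip("/").split("/")
    return [stem + n for n in numbers]
```

Feeding the expanded names into a downstream normalizer is what lets composite handling improve concept recognition, as the abstract reports for genes and diseases.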

  1. LiverCancerMarkerRIF: a liver cancer biomarker interactive curation system combining text mining and expert annotations

    Science.gov (United States)

    Dai, Hong-Jie; Wu, Johnny Chi-Yang; Lin, Wei-San; Reyes, Aaron James F.; dela Rosa, Mira Anne C.; Syed-Abdul, Shabbir; Tsai, Richard Tzong-Han; Hsu, Wen-Lian

    2014-01-01

    Biomarkers are biomolecules in the human body that can indicate disease states and abnormal biological processes. Biomarkers are often used during clinical trials to identify patients with cancers. Although biomedical research related to biomarkers has increased over the years and substantial effort has been expended to obtain results in these studies, the specific results obtained often contain ambiguities, and the results might contradict each other. Therefore, the information gathered from these studies must be appropriately integrated and organized to facilitate experimentation on biomarkers. In this study, we used liver cancer as the target and developed a text-mining–based curation system named LiverCancerMarkerRIF, which allows users to retrieve biomarker-related narrations and curators to curate supporting evidence on liver cancer biomarkers directly while browsing PubMed. In contrast to most of the other curation tools that require curators to navigate away from PubMed and accommodate distinct user interfaces or Web sites to complete the curation process, our system provides a user-friendly method for accessing text-mining–aided information and a concise interface to assist curators while they remain at the PubMed Web site. Biomedical text-mining techniques are applied to automatically recognize biomedical concepts such as genes, microRNA, diseases and investigative technologies, which can be used to evaluate the potential of a certain gene as a biomarker. Through the participation in the BioCreative IV user-interactive task, we examined the feasibility of using this novel type of augmented browsing-based curation method, and collaborated with curators to curate biomarker evidential sentences related to liver cancer. The positive feedback received from curators indicates that the proposed method can be effectively used for curation. A publicly available online database containing all the aforementioned information has been constructed at http

  2. From university research to innovation Detecting knowledge transfer via text mining

    DEFF Research Database (Denmark)

    Woltmann, Sabrina; Clemmensen, Line Katrine Harder; Alkærsig, Lars

    2016-01-01

    recognition. Text samples for this purpose can include files containing social media contents, company websites and annual reports. The empirical focus in the present study is on the technical sciences and in particular on the case of the Technical University of Denmark (DTU). We generated two independent...... and indicators such as patents, collaborative publications and license agreements, to assess the contribution to the socioeconomic surrounding of universities. In this study, we present an extension of the current empirical framework by applying new computational methods, namely text mining and pattern...... associated the former with the latter to obtain insights into possible text and semantic relatedness. The text mining methods are extrapolating the correlations, semantic patterns and content comparison of the two corpora to define the document relatedness. We expect the development of a novel tool using...

  3. Exploring the potential of Social Media Data using Text Mining to augment Business Intelligence

    Directory of Open Access Journals (Sweden)

    Dr. Ananthi Sheshasaayee

    2014-03-01

    Full Text Available In recent years, social media has become popular worldwide and important for content sharing, social networking, and more. The content generated from these websites remains largely unused. Social media contains text, images, audio, video, and so on, and largely consists of unstructured text. The foremost task is to extract the information in this unstructured text. This paper presents the value of social media data for research and shows how the content can be used to predict real-world decisions that enhance business intelligence, by applying text mining methods.

  4. The nuclear power debate after Fukushima : a text-mining analysis of Japanese newspapers

    OpenAIRE

    Abe, Yuki; アベ, ユウキ; 阿部 , 悠貴

    2015-01-01

    This paper analyzes the debate on nuclear power after the Fukushima accident by using a text-mining approach. Texts are taken from the editorial articles of five major Japanese newspapers, Asahi Shinbun, Mainichi Shinbun, Nikkei Shinbun, Sankei Shinbun and Yomiuri Shinbun. After elucidating their different views on nuclear power policy, including general issues such as radiation risks, renewable energy and lessons from the meltdown, the paper reveals two main strands of arguments. Newspapers ...

  5. Feature engineering for drug name recognition in biomedical texts: feature conjunction and feature selection.

    Science.gov (United States)

    Liu, Shengyu; Tang, Buzhou; Chen, Qingcai; Wang, Xiaolong; Fan, Xiaoming

    2015-01-01

    Drug name recognition (DNR) is a critical step for drug information extraction. Machine learning-based methods have been widely used for DNR with various types of features such as part-of-speech, word shape, and dictionary features. Current machine learning-based methods usually use singleton features, possibly because combining singleton features into conjunction features can cause a feature explosion and introduce a large number of noisy features. However, singleton features, each of which captures only one linguistic characteristic of a word, are not sufficient to describe the information needed for DNR when multiple characteristics should be considered. In this study, we explore feature conjunction and feature selection for DNR, which have never been reported. We intuitively select 8 types of singleton features and combine them into conjunction features in two ways. Then, chi-square, mutual information, and information gain are used to mine effective features. Experimental results show that feature conjunction and feature selection can improve the performance of the DNR system with a moderate number of features, and our DNR system significantly outperforms the best system in the DDIExtraction 2013 challenge.

  6. A tm Plug-In for Distributed Text Mining in R

    Directory of Open Access Journals (Sweden)

    Stefan Theussl

    2012-11-01

    Full Text Available R has gained explicit text mining support with the tm package, enabling statisticians to answer many interesting research questions via statistical analysis or modeling of (text) corpora. However, we typically face two challenges when analyzing large corpora: (1) the amount of data to be processed in a single machine is usually limited by the available main memory (i.e., RAM), and (2) the more data to be analyzed, the higher the need for efficient procedures for calculating valuable results. Fortunately, adequate programming models like MapReduce facilitate parallelization of text mining tasks and allow for processing data sets beyond what would fit into memory by using a distributed file system possibly spanning over several machines, e.g., in a cluster of workstations. In this paper we present a plug-in package to tm called tm.plugin.dc implementing a distributed corpus class which can take advantage of the Hadoop MapReduce library for large-scale text mining tasks. We show on the basis of an application in culturomics that we can efficiently handle data sets of significant size.
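
The map/reduce split that tm.plugin.dc delegates to Hadoop can be mimicked in a few lines of stdlib Python. This word-count sketch only illustrates the programming model (per-document map, pairwise merge in the reduce); the package itself is R running over a distributed file system, and the two-document corpus here is invented.

```python
from collections import Counter
from functools import reduce

def map_counts(doc):
    """Map step: per-document term counts."""
    return Counter(doc.lower().split())

def reduce_counts(acc, part):
    """Reduce step: merge partial counts into the accumulator."""
    acc.update(part)
    return acc

corpus = ["Text mining with map and reduce",
          "Distributed text mining scales out"]

totals = reduce(reduce_counts, map(map_counts, corpus), Counter())
```

Because the map step is independent per document and the reduce step is associative, the same two functions could be shipped to worker nodes and the partial `Counter`s merged anywhere, which is exactly what makes the model attractive for corpora that exceed one machine's RAM.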

  7. SOME APPROACHES TO TEXT MINING AND THEIR POTENTIAL FOR SEMANTIC WEB APPLICATIONS

    Directory of Open Access Journals (Sweden)

    Jan Paralič

    2007-06-01

    Full Text Available In this paper we describe some approaches to text mining, which are supported by an original software system developed in Java for support of information retrieval and text mining (JBowl), as well as its possible use in a distributed environment. The JBowl system is being developed as open source software with the intention to provide an easily extensible, modular framework for pre-processing, indexing and further exploration of large text collections. The overall architecture of the system is described, followed by some typical use case scenarios, which have been used in some previous projects. Then, basic principles and technologies used for service-oriented computing, web services and semantic web services are presented. We further discuss how the JBowl system can be adopted into a distributed environment via technologies already available and what benefits such an adaptation can bring. This is particularly important in the context of a new integrated EU-funded project, KP-Lab (Knowledge Practices Laboratory), which is briefly presented, as well as the role of the proposed text mining services, which are currently being designed and developed there.

  8. An overview of the BioCreative 2012 Workshop Track III: interactive text mining task.

    Science.gov (United States)

    Arighi, Cecilia N; Carterette, Ben; Cohen, K Bretonnel; Krallinger, Martin; Wilbur, W John; Fey, Petra; Dodson, Robert; Cooper, Laurel; Van Slyke, Ceri E; Dahdul, Wasila; Mabee, Paula; Li, Donghui; Harris, Bethany; Gillespie, Marc; Jimenez, Silvia; Roberts, Phoebe; Matthews, Lisa; Becker, Kevin; Drabkin, Harold; Bello, Susan; Licata, Luana; Chatr-aryamontri, Andrew; Schaeffer, Mary L; Park, Julie; Haendel, Melissa; Van Auken, Kimberly; Li, Yuling; Chan, Juancarlos; Muller, Hans-Michael; Cui, Hong; Balhoff, James P; Chi-Yang Wu, Johnny; Lu, Zhiyong; Wei, Chih-Hsuan; Tudor, Catalina O; Raja, Kalpana; Subramani, Suresh; Natarajan, Jeyakumar; Cejuela, Juan Miguel; Dubey, Pratibha; Wu, Cathy

    2013-01-01

    In many databases, biocuration primarily involves literature curation, which usually involves retrieving relevant articles, extracting information that will translate into annotations and identifying new incoming literature. As the volume of biological literature increases, the use of text mining to assist in biocuration becomes increasingly relevant. A number of groups have developed tools for text mining from a computer science/linguistics perspective, and there are many initiatives to curate some aspect of biology from the literature. Some biocuration efforts already make use of a text mining tool, but there have not been many broad-based systematic efforts to study which aspects of a text mining tool contribute to its usefulness for a curation task. Here, we report on an effort to bring together text mining tool developers and database biocurators to test the utility and usability of tools. Six text mining systems presenting diverse biocuration tasks participated in a formal evaluation, and appropriate biocurators were recruited for testing. The performance results from this evaluation indicate that some of the systems were able to improve efficiency of curation by speeding up the curation task significantly (∼1.7- to 2.5-fold) over manual curation. In addition, some of the systems were able to improve annotation accuracy when compared with the performance on the manually curated set. In terms of inter-annotator agreement, the factors that contributed to significant differences for some of the systems included the expertise of the biocurator on the given curation task, the inherent difficulty of the curation and attention to annotation guidelines. After the task, annotators were asked to complete a survey to help identify strengths and weaknesses of the various systems. The analysis of this survey highlights how important task completion is to the biocurators' overall experience of a system, regardless of the system's high score on design, learnability and

  9. Review of Text Mining Tools

    Institute of Scientific and Technical Information of China (English)

    张雯雯; 许鑫

    2012-01-01

    The authors first briefly describe some commercial and open source text mining tools, with detailed comparisons of four typical open source tools covering three aspects: data format, functional modules and user experience. They then test the text classification function of three tools with distinctive designs. Finally, the authors offer some suggestions regarding the current state of open source text mining tools.

  10. Discovering low-rank shared concept space for adapting text mining models.

    Science.gov (United States)

    Chen, Bo; Lam, Wai; Tsang, Ivor W; Wong, Tak-Lam

    2013-06-01

    We propose a framework for adapting text mining models that discovers a low-rank shared concept space. The major characteristic of this concept space is that it explicitly minimizes the distribution gap between the source domain, which has sufficient labeled data, and the target domain, which has only unlabeled data, while at the same time minimizing the empirical loss on the labeled data in the source domain. Our method can conduct the domain adaptation task both in the original feature space and in a transformed Reproducing Kernel Hilbert Space (RKHS) using kernel tricks. Theoretical analysis guarantees that the error of our adaptation model can be bounded with respect to the embedded distribution gap and the empirical loss in the source domain. We have conducted extensive experiments on two common text mining problems, namely document classification and information extraction, to demonstrate the efficacy of the proposed framework.
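
The distribution gap that such adaptation frameworks minimize is commonly instantiated as a Maximum Mean Discrepancy (MMD) between source and target samples. As a hedged illustration only (the paper's actual objective couples a gap term like this with the empirical source loss under a low-rank constraint; `linear_mmd2` and `mean_vec` are hypothetical helper names), a linear-kernel MMD² reduces to the squared distance between the sample means:

```python
def mean_vec(rows):
    """Component-wise mean of a list of equal-length feature vectors."""
    n = len(rows)
    return [sum(r[i] for r in rows) / n for i in range(len(rows[0]))]

def linear_mmd2(source, target):
    """Squared Maximum Mean Discrepancy with a linear kernel, i.e. the
    squared Euclidean distance between the two sample means."""
    ms, mt = mean_vec(source), mean_vec(target)
    return sum((a - b) ** 2 for a, b in zip(ms, mt))
```

A gap of zero means the projected source and target samples share the same mean; adaptation methods search for a projection that drives this term down while keeping source classification loss low.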

  11. Agile text mining for the 2014 i2b2/UTHealth Cardiac risk factors challenge.

    Science.gov (United States)

    Cormack, James; Nath, Chinmoy; Milward, David; Raja, Kalpana; Jonnalagadda, Siddhartha R

    2015-12-01

    This paper describes the use of an agile text mining platform (Linguamatics' Interactive Information Extraction Platform, I2E) to extract document-level cardiac risk factors in patient records as defined in the i2b2/UTHealth 2014 challenge. The approach uses a data-driven rule-based methodology with the addition of a simple supervised classifier. We demonstrate that agile text mining allows for rapid optimization of extraction strategies, while post-processing can leverage annotation guidelines, corpus statistics and logic inferred from the gold standard data. We also show how data imbalance in a training set affects performance. Evaluation of this approach on the test data gave an F-Score of 91.7%, one percent behind the top performing system.
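
The F-score quoted above is the harmonic mean of precision and recall. A minimal sketch of how it is computed from true-positive, false-positive and false-negative counts (the function name is ours, not from the paper):

```python
def precision_recall_f1(tp, fp, fn):
    """Standard evaluation metrics for an extraction system."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```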

  12. Experiences with Text Mining Large Collections of Unstructured Systems Development Artifacts at JPL

    Science.gov (United States)

    Port, Dan; Nikora, Allen; Hihn, Jairus; Huang, LiGuo

    2011-01-01

    Repositories of systems engineering artifacts at NASA's Jet Propulsion Laboratory (JPL) are often so large and poorly structured that they have outgrown our capability to manually process their contents and extract useful information. Sophisticated text mining methods and tools seem a quick, low-effort approach to automating our limited manual efforts. Our experience exploring such methods in three areas, historical risk analysis, defect identification based on requirements analysis, and over-time analysis of system anomalies at JPL, has shown that obtaining useful results requires substantial unanticipated effort, from preprocessing the data to transforming the output for practical applications. We have not observed any quick 'wins' or realized benefit from short-term effort avoidance through automation in this area. Surprisingly, we have realized a number of unexpected long-term benefits from the process of applying text mining to our repositories. This paper elaborates on some of these benefits and the important lessons learned from preparing and applying text mining to large unstructured system artifacts at JPL, aiming to benefit future text mining applications in similar problem domains and, we hope, in broader areas of application.

  13. Coronary artery disease risk assessment from unstructured electronic health records using text mining.

    Science.gov (United States)

    Jonnagaddala, Jitendra; Liaw, Siaw-Teng; Ray, Pradeep; Kumar, Manish; Chang, Nai-Wen; Dai, Hong-Jie

    2015-12-01

    Coronary artery disease (CAD) often leads to myocardial infarction, which may be fatal. Risk factors can be used to predict CAD, which may subsequently lead to prevention or early intervention. Patient data such as co-morbidities, medication history, social history and family history are required to determine the risk factors for a disease. However, risk factor data are usually embedded in unstructured clinical narratives if the data is not collected specifically for risk assessment purposes. Clinical text mining can be used to extract data related to risk factors from unstructured clinical notes. This study presents methods to extract Framingham risk factors from unstructured electronic health records using clinical text mining and to calculate 10-year coronary artery disease risk scores in a cohort of diabetic patients. We developed a rule-based system to extract risk factors: age, gender, total cholesterol, HDL-C, blood pressure, diabetes history and smoking history. The results showed that the output from the text mining system was reliable, but there was a significant amount of missing data to calculate the Framingham risk score. A systematic approach for understanding missing data was followed by implementation of imputation strategies. An analysis of the 10-year Framingham risk scores for coronary artery disease in this cohort has shown that the majority of the diabetic patients are at moderate risk of CAD.
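
Rule-based extraction of this kind is often bootstrapped with regular expressions keyed to each risk factor. The patterns below are illustrative toys (all names are ours; production clinical NLP adds section logic, unit normalization and negation detection):

```python
import re

# Hypothetical patterns for a few Framingham risk factors.
PATTERNS = {
    "age": re.compile(r"\b(\d{1,3})[- ]year[- ]old\b", re.I),
    "total_cholesterol": re.compile(r"\btotal cholesterol[:\s]+(\d+(?:\.\d+)?)", re.I),
    "blood_pressure": re.compile(r"\bBP[:\s]+(\d{2,3})/(\d{2,3})\b", re.I),
    "smoking": re.compile(r"\b(current|former|never) smoker\b", re.I),
}

def extract_risk_factors(note):
    """Return the first match for each risk factor found in a clinical note."""
    found = {}
    for name, pat in PATTERNS.items():
        m = pat.search(note)
        if m:
            found[name] = m.groups() if len(m.groups()) > 1 else m.group(1)
    return found
```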

  14. miRTex: A Text Mining System for miRNA-Gene Relation Extraction.

    Science.gov (United States)

    Li, Gang; Ross, Karen E; Arighi, Cecilia N; Peng, Yifan; Wu, Cathy H; Vijay-Shanker, K

    2015-01-01

    MicroRNAs (miRNAs) regulate a wide range of cellular and developmental processes through gene expression suppression or mRNA degradation. Experimentally validated miRNA gene targets are often reported in the literature. In this paper, we describe miRTex, a text mining system that extracts miRNA-target relations, as well as miRNA-gene and gene-miRNA regulation relations. The system achieves good precision and recall when evaluated on a literature corpus of 150 abstracts with F-scores close to 0.90 on the three different types of relations. We conducted full-scale text mining using miRTex to process all the Medline abstracts and all the full-length articles in the PubMed Central Open Access Subset. The results for all the Medline abstracts are stored in a database for interactive query and file download via the website at http://proteininformationresource.org/mirtex. Using miRTex, we identified genes potentially regulated by miRNAs in Triple Negative Breast Cancer, as well as miRNA-gene relations that, in conjunction with kinase-substrate relations, regulate the response to abiotic stress in Arabidopsis thaliana. These two use cases demonstrate the usefulness of miRTex text mining in the analysis of miRNA-regulated biological processes.

  15. Automatic extraction of reference gene from literature in plants based on text mining.

    Science.gov (United States)

    He, Lin; Shen, Gengyu; Li, Fei; Huang, Shuiqing

    2015-01-01

    Real-Time Quantitative Polymerase Chain Reaction (qRT-PCR) is widely used in biological research. Selecting a stable reference gene is key to the validity of a qRT-PCR experiment. However, verifying an appropriate reference gene usually requires rigorous and costly biological experiments. The scientific literature has accumulated many results on reference gene selection. Therefore, mining reference genes and their specific experimental environments from the literature can provide reliable reference genes for similar qRT-PCR experiments, with the advantages of reliability, economy and efficiency. This paper proposes an auxiliary method for discovering reference genes from the literature that integrates machine learning, natural language processing and text mining approaches. Validity tests showed that the new method achieves better precision and recall in the extraction of reference genes and their environments.

  16. Mining Health-Related Issues in Consumer Product Reviews by Using Scalable Text Analytics.

    Science.gov (United States)

    Torii, Manabu; Tilak, Sameer S; Doan, Son; Zisook, Daniel S; Fan, Jung-Wei

    2016-01-01

    In an era when most of our life activities are digitized and recorded, opportunities abound to gain insights about population health. Online product reviews present a unique data source that is currently underexplored. Health-related information, although scarce, can be systematically mined in online product reviews. Leveraging natural language processing and machine learning tools, we were able to mine 1.3 million grocery product reviews for health-related information. The objectives of the study were as follows: (1) conduct quantitative and qualitative analysis on the types of health issues found in consumer product reviews; (2) develop a machine learning classifier to detect reviews that contain health-related issues; and (3) gain insights about the task characteristics and challenges for text analytics to guide future research.

  17. Cluo: Web-Scale Text Mining System For Open Source Intelligence Purposes

    Directory of Open Access Journals (Sweden)

    Przemyslaw Maciolek

    2013-01-01

    Full Text Available The amount of textual information published on the Internet is considered to be in the billions of web pages, blog posts, comments, social media updates and others. Analyzing such quantities of data requires a high level of distribution of both data and computing. This is especially true for the complex algorithms often used in text mining tasks. The paper presents a prototype implementation of CLUO, an Open Source Intelligence (OSINT) system which extracts and analyzes significant quantities of openly available information.

  18. Zsyntax: a formal language for molecular biology with projected applications in text mining and biological prediction.

    Directory of Open Access Journals (Sweden)

    Giovanni Boniolo

    Full Text Available We propose a formal language that allows for transposing biological information precisely and rigorously into machine-readable information. This language, which we call Zsyntax (where Z stands for the Greek word ζωή, life), is grounded in a particular type of non-classical logic, and it can be used to write algorithms and computer programs. We present it as a first step towards a comprehensive formal language for molecular biology in which any biological process can be written and analyzed as a sort of logical "deduction". Moreover, we illustrate the potential value of this language, both in the field of text mining and in that of biological prediction.

  19. A corpus of full-text journal articles is a robust evaluation tool for revealing differences in performance of biomedical natural language processing tools

    OpenAIRE

    Verspoor Karin; Cohen Kevin; Lanfranchi Arrick; Warner Colin; Johnson Helen L; Roeder Christophe; Choi Jinho D; Funk Christopher; Malenkiy Yuriy; Eckert Miriam; Xue Nianwen; Baumgartner William A; Bada Michael; Palmer Martha; Hunter Lawrence E

    2012-01-01

    Abstract Background We introduce the linguistic annotation of a corpus of 97 full-text biomedical publications, known as the Colorado Richly Annotated Full Text (CRAFT) corpus. We further assess the performance of existing tools for performing sentence splitting, tokenization, syntactic parsing, and named entity recognition on this corpus. Results Many biomedical natural language processing systems demonstrated large differences between their previously published results and their performance...

  20. Using a Text-Mining Approach to Evaluate the Quality of Nursing Records.

    Science.gov (United States)

    Chang, Hsiu-Mei; Chiou, Shwu-Fen; Liu, Hsiu-Yun; Yu, Hui-Chu

    2016-01-01

    Nursing records in Taiwan have been computerized, but their quality has rarely been examined. Therefore, this study employed a text-mining approach and a cross-sectional retrospective research design to evaluate the quality of electronic nursing records at a medical center in northern Taiwan. SAS Text Miner software, Version 13.2, was employed to analyze unstructured nursing event records. The results show that SAS Text Miner is suitable for developing a text-mining model for validating nursing records. The sensitivity of SAS Text Miner was approximately 0.94, and the specificity and accuracy were 0.99. Thus, SAS Text Miner is an effective tool for auditing unstructured electronic nursing records.

  1. Mining free-text medical records for companion animal enteric syndrome surveillance.

    Science.gov (United States)

    Anholt, R M; Berezowski, J; Jamal, I; Ribble, C; Stephen, C

    2014-03-01

    Large amounts of animal health care data are present in veterinary electronic medical records (EMRs), and they present an opportunity for companion animal disease surveillance. Veterinary patient records are largely free text without clinical coding or a fixed vocabulary. Text mining, a computer and information technology application, is needed to identify cases of interest and to add structure to the otherwise unstructured data. In this study, EMRs were extracted from the veterinary management programs of 12 participating veterinary practices and stored in a data warehouse. Using commercially available text-mining software (WordStat™), we developed a categorization dictionary that could be used to automatically classify and extract enteric syndrome cases from the warehoused electronic medical records. The diagnostic accuracy of the text miner for retrieving cases of enteric syndrome was measured against human reviewers who independently categorized a random sample of 2500 cases as enteric syndrome positive or negative. Compared to the reviewers, the text miner retrieved cases with enteric signs with a sensitivity of 87.6% (95%CI, 80.4-92.9%) and a specificity of 99.3% (95%CI, 98.9-99.6%). Automatic and accurate detection of enteric syndrome cases provides an opportunity for community surveillance of enteric pathogens in companion animals.
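
A categorization dictionary such as the one built in WordStat can be approximated by a term list with substring matching, scored against human review with sensitivity and specificity. A toy sketch (the term list and function names are ours; the study's dictionary was far larger and curated):

```python
# Toy dictionary of enteric-syndrome terms.
ENTERIC_TERMS = {"diarrhea", "diarrhoea", "vomiting", "loose stool", "hematochezia"}

def is_enteric(record_text):
    """Classify a free-text record as enteric-syndrome positive if any
    dictionary term occurs in it."""
    text = record_text.lower()
    return any(term in text for term in ENTERIC_TERMS)

def sensitivity_specificity(predictions, truth):
    """Sensitivity and specificity of boolean predictions vs. reviewer labels."""
    tp = sum(1 for p, t in zip(predictions, truth) if p and t)
    tn = sum(1 for p, t in zip(predictions, truth) if not p and not t)
    fp = sum(1 for p, t in zip(predictions, truth) if p and not t)
    fn = sum(1 for p, t in zip(predictions, truth) if not p and t)
    return tp / (tp + fn), tn / (tn + fp)
```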

  2. Studying the correlation between different word sense disambiguation methods and summarization effectiveness in biomedical texts

    Directory of Open Access Journals (Sweden)

    Díaz Alberto

    2011-08-01

    Full Text Available Background: Word sense disambiguation (WSD) attempts to solve lexical ambiguities by identifying the correct meaning of a word based on its context. WSD has been demonstrated to be an important step in knowledge-based approaches to automatic summarization. However, the correlation between the accuracy of WSD methods and summarization performance has never been studied. Results: We present three existing knowledge-based WSD approaches and a graph-based summarizer. Both the WSD approaches and the summarizer employ the Unified Medical Language System (UMLS) Metathesaurus as the knowledge source. We first evaluate WSD directly, by comparing the predictions of the WSD methods to two reference sets: the NLM WSD dataset and the MSH WSD collection. We next apply the different WSD methods as part of the summarizer, to map documents onto concepts in the UMLS Metathesaurus, and evaluate the summaries that are generated. The results obtained by the different methods in both evaluations are studied and compared. Conclusions: The use of WSD techniques has a positive impact on the results of our graph-based summarizer, and, when both the WSD and summarization tasks are assessed over large and homogeneous evaluation collections, there exists a correlation between the overall results of the WSD and summarization tasks. Furthermore, the best WSD algorithm in the first task tends to also be the best one in the second. However, we also found that the improvement achieved by the summarizer is not directly correlated with WSD performance. The most likely reason is that disambiguation errors are not equally important but depend on the relative salience of the different concepts in the document to be summarized.
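
Knowledge-based WSD methods like those compared here typically score each candidate sense by the overlap between the ambiguous word's context and knowledge-source material for that sense. A simplified Lesk overlap serves as a generic stand-in (the study's methods draw on the UMLS Metathesaurus, not toy glosses; names here are ours):

```python
def simplified_lesk(context_words, sense_glosses):
    """Pick the sense whose gloss shares the most words with the context.
    sense_glosses maps sense_id -> gloss string."""
    context = set(w.lower() for w in context_words)
    best, best_overlap = None, -1
    for sense, gloss in sense_glosses.items():
        overlap = len(context & set(gloss.lower().split()))
        if overlap > best_overlap:
            best, best_overlap = sense, overlap
    return best
```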

  3. BSQA: integrated text mining using entity relation semantics extracted from biological literature of insects.

    Science.gov (United States)

    He, Xin; Li, Yanen; Khetani, Radhika; Sanders, Barry; Lu, Yue; Ling, Xu; Zhai, Chengxiang; Schatz, Bruce

    2010-07-01

    Text mining is one promising way of extracting information automatically from the vast biological literature. To maximize its potential, the knowledge encoded in the text should be translated to some semantic representation such as entities and relations, which could be analyzed by machines. But large-scale practical systems for this purpose are rare. We present BeeSpace question/answering (BSQA) system that performs integrated text mining for insect biology, covering diverse aspects from molecular interactions of genes to insect behavior. BSQA recognizes a number of entities and relations in Medline documents about the model insect, Drosophila melanogaster. For any text query, BSQA exploits entity annotation of retrieved documents to identify important concepts in different categories. By utilizing the extracted relations, BSQA is also able to answer many biologically motivated questions, from simple ones such as, which anatomical part is a gene expressed in, to more complex ones involving multiple types of relations. BSQA is freely available at http://www.beespace.uiuc.edu/QuestionAnswer.

  4. PolySearch2: a significantly improved text-mining system for discovering associations between human diseases, genes, drugs, metabolites, toxins and more.

    Science.gov (United States)

    Liu, Yifeng; Liang, Yongjie; Wishart, David

    2015-07-01

    PolySearch2 (http://polysearch.ca) is an online text-mining system for identifying relationships between biomedical entities such as human diseases, genes, SNPs, proteins, drugs, metabolites, toxins, metabolic pathways, organs, tissues, subcellular organelles, positive health effects, negative health effects, drug actions, Gene Ontology terms, MeSH terms, ICD-10 medical codes, biological taxonomies and chemical taxonomies. PolySearch2 supports a generalized 'Given X, find all associated Ys' query, where X and Y can be selected from the aforementioned biomedical entities. An example query might be: 'Find all diseases associated with Bisphenol A'. To find its answers, PolySearch2 searches for associations against comprehensive free-text collections, including local versions of MEDLINE abstracts, PubMed Central full-text articles, Wikipedia full-text articles and US Patent application abstracts. PolySearch2 also searches 14 widely used, text-rich biological databases such as UniProt, DrugBank and the Human Metabolome Database to improve its accuracy and coverage. PolySearch2 maintains an extensive thesaurus of biological terms and exploits the latest search engine technology to rapidly retrieve relevant articles and database records. PolySearch2 also generates, ranks and annotates associative candidates and presents results with relevancy statistics and highlighted key sentences to facilitate user interpretation.
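
At its core, a 'Given X, find all associated Ys' query can be served by ranking candidate entities by document-level co-occurrence with the query term. A deliberately simplified sketch (not PolySearch2's actual scoring, which uses an extensive thesaurus and relevancy statistics; the function name is ours):

```python
from collections import Counter

def find_associated(query, documents, entity_lexicon):
    """Rank lexicon entities by how many documents mention both them
    and the query term (case-insensitive substring matching)."""
    counts = Counter()
    q = query.lower()
    for doc in documents:
        text = doc.lower()
        if q in text:
            for entity in entity_lexicon:
                if entity.lower() in text and entity.lower() != q:
                    counts[entity] += 1
    return counts.most_common()
```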

  5. Web services-based text-mining demonstrates broad impacts for interoperability and process simplification.

    Science.gov (United States)

    Wiegers, Thomas C; Davis, Allan Peter; Mattingly, Carolyn J

    2014-01-01

    The Critical Assessment of Information Extraction systems in Biology (BioCreAtIvE) challenge evaluation tasks collectively represent a community-wide effort to evaluate a variety of text-mining and information extraction systems applied to the biological domain. The BioCreative IV Workshop included five independent subject areas, including Track 3, which focused on named-entity recognition (NER) for the Comparative Toxicogenomics Database (CTD; http://ctdbase.org). Previously, CTD had organized document ranking and NER-related tasks for the BioCreative Workshop 2012; a key finding of that effort was that interoperability and integration complexity were major impediments to the direct application of the systems to CTD's text-mining pipeline. This underscored a prevailing problem with software integration efforts. Major interoperability-related issues included lack of process modularity, operating system incompatibility, tool configuration complexity and lack of standardization of high-level inter-process communications. One approach to potentially mitigate interoperability and general integration issues is the use of Web services to abstract implementation details; rather than integrating NER tools directly, HTTP-based calls from CTD's asynchronous, batch-oriented text-mining pipeline could be made to remote NER Web services for recognition of specific biological terms using BioC (an emerging family of XML formats) for inter-process communications. To test this concept, participating groups developed Representational State Transfer /BioC-compliant Web services tailored to CTD's NER requirements. Participants were provided with a comprehensive set of training materials. CTD evaluated results obtained from the remote Web service-based URLs against a test data set of 510 manually curated scientific articles. Twelve groups participated in the challenge. Recall, precision, balanced F-scores and response times were calculated. Top balanced F-scores for gene, chemical and
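
BioC, the interchange format mentioned above, nests annotations inside passage, document and collection elements. A minimal sketch of emitting BioC-style XML with the standard library (the element layout follows the general BioC pattern, but the infon keys and attribute details of the real DTD may differ):

```python
import xml.etree.ElementTree as ET

def bioc_collection(doc_id, passage_text, annotations):
    """Build a minimal BioC-style XML collection string.
    annotations: list of (ann_id, entity_type, offset, ann_text)."""
    coll = ET.Element("collection")
    doc = ET.SubElement(coll, "document")
    ET.SubElement(doc, "id").text = doc_id
    passage = ET.SubElement(doc, "passage")
    ET.SubElement(passage, "offset").text = "0"
    ET.SubElement(passage, "text").text = passage_text
    for ann_id, etype, offset, text in annotations:
        ann = ET.SubElement(passage, "annotation", id=ann_id)
        ET.SubElement(ann, "infon", key="type").text = etype
        ET.SubElement(ann, "location", offset=str(offset), length=str(len(text)))
        ET.SubElement(ann, "text").text = text
    return ET.tostring(coll, encoding="unicode")
```

A string like this could be exchanged with a remote NER web service and the annotated collection parsed back the same way.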

  6. Text Mining of the Classical Medical Literature for Medicines That Show Potential in Diabetic Nephropathy

    Directory of Open Access Journals (Sweden)

    Lei Zhang

    2014-01-01

    Full Text Available Objectives. To apply modern text-mining methods to identify candidate herbs and formulae for the treatment of diabetic nephropathy. Methods. The method we developed includes three steps: (1) identification of candidate ancient terms; (2) systematic search and assessment of medical records written in classical Chinese; (3) preliminary evaluation of the effect and safety of candidates. Results. The ancient terms Xia Xiao, Shen Xiao, and Xiao Shen were determined as the most likely to correspond with diabetic nephropathy and were used in text mining. A total of 80 Chinese formulae for treating conditions congruent with diabetic nephropathy, recorded in medical books from the Tang Dynasty to the Qing Dynasty, were collected. Sao si tang (also called Reeling Silk Decoction) was chosen to illustrate the process of preliminary evaluation of the candidates. It showed promising potential for development as a new agent for the treatment of diabetic nephropathy. However, further investigations of its safety in patients with renal insufficiency are still needed. Conclusions. The methods developed in this study offer a targeted approach to identifying traditional herbs and/or formulae as candidates for further investigation in the search for new drugs for modern diseases. However, more effort is still required to improve our techniques, especially with regard to compound formulae.

  7. Comparison between BIDE, PrefixSpan, and TRuleGrowth for Mining of Indonesian Text

    Science.gov (United States)

    Sa’adillah Maylawati, Dian; Irfan, Mohamad; Budiawan Zulfikar, Wildan

    2017-01-01

    Mining processes for the Indonesian language are still an interesting research topic. Representing text as multiple-word sequences has been claimed to preserve the meaning of text better than a bag of words. In this paper, we compare several sequential pattern algorithms: BIDE (BIDirectional Extension), PrefixSpan, and TRuleGrowth. All of these algorithms produce frequent word sequences that preserve the meaning of the text. Experiments on 14,006 Indonesian tweets from Twitter show that BIDE produces more efficient frequent word sequences than PrefixSpan and TRuleGrowth without losing the meaning of the text. The average processing time of PrefixSpan is faster than that of BIDE and TRuleGrowth, while PrefixSpan and TRuleGrowth use memory more efficiently than BIDE.
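
Sequential pattern miners such as PrefixSpan grow frequent word sequences by recursively projecting the database on each frequent prefix. A minimal PrefixSpan-style sketch over token lists (illustration only; BIDE adds closed-pattern checking and TRuleGrowth mines rules under a window constraint):

```python
def prefixspan(sequences, min_support):
    """Minimal PrefixSpan-style frequent-sequence miner over token lists.
    Returns {pattern_tuple: support_count}."""
    results = {}

    def mine(prefix, projected):
        # Count, per sequence, the items that can extend the current prefix.
        support = {}
        for suffix in projected:
            for item in set(suffix):
                support[item] = support.get(item, 0) + 1
        for item, count in support.items():
            if count >= min_support:
                pattern = prefix + (item,)
                results[pattern] = count
                # Project each containing sequence past the first occurrence of `item`.
                new_proj = [s[s.index(item) + 1:] for s in projected if item in s]
                mine(pattern, new_proj)

    mine((), [list(s) for s in sequences])
    return results
```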

  8. The eFIP system for text mining of protein interaction networks of phosphorylated proteins.

    Science.gov (United States)

    Tudor, Catalina O; Arighi, Cecilia N; Wang, Qinghua; Wu, Cathy H; Vijay-Shanker, K

    2012-01-01

    Protein phosphorylation is a central regulatory mechanism in signal transduction involved in most biological processes. Phosphorylation of a protein may lead to activation or repression of its activity, alternative subcellular location and interaction with different binding partners. Extracting this type of information from scientific literature is critical for connecting phosphorylated proteins with kinases and interaction partners, along with their functional outcomes, for knowledge discovery from phosphorylation protein networks. We have developed the Extracting Functional Impact of Phosphorylation (eFIP) text mining system, which combines several natural language processing techniques to find relevant abstracts mentioning phosphorylation of a given protein together with indications of protein-protein interactions (PPIs) and potential evidences for impact of phosphorylation on the PPIs. eFIP integrates our previously developed tools, Extracting Gene Related ABstracts (eGRAB) for document retrieval and name disambiguation, Rule-based LIterature Mining System (RLIMS-P) for Protein Phosphorylation for extraction of phosphorylation information, a PPI module to detect PPIs involving phosphorylated proteins and an impact module for relation extraction. The text mining system has been integrated into the curation workflow of the Protein Ontology (PRO) to capture knowledge about phosphorylated proteins. The eFIP web interface accepts gene/protein names or identifiers, or PubMed identifiers as input, and displays results as a ranked list of abstracts with sentence evidence and summary table, which can be exported in a spreadsheet upon result validation. As a participant in the BioCreative-2012 Interactive Text Mining track, the performance of eFIP was evaluated on document retrieval (F-measures of 78-100%), sentence-level information extraction (F-measures of 70-80%) and document ranking (normalized discounted cumulative gain measures of 93-100% and mean average
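
The normalized discounted cumulative gain (nDCG) used above to score document ranking discounts each abstract's graded relevance by the log of its rank and normalizes by the ideal ordering. A minimal sketch:

```python
import math

def dcg(relevances):
    """Discounted cumulative gain of graded relevances in rank order."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances))

def ndcg(ranked_relevances):
    """DCG of the actual ranking divided by the DCG of the ideal ranking."""
    ideal = dcg(sorted(ranked_relevances, reverse=True))
    return dcg(ranked_relevances) / ideal if ideal else 0.0
```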

  9. DDMGD: the database of text-mined associations between genes methylated in diseases from different species

    KAUST Repository

    Raies, A. B.

    2014-11-14

    Gathering information about associations between methylated genes and diseases is important for disease diagnosis and treatment decisions. Recent advancements in epigenetics research allow for large-scale discoveries of associations of genes methylated in diseases in different species. Searching manually for such information is not easy, as it is scattered across a large number of electronic publications and repositories. Therefore, we developed the DDMGD database (http://www.cbrc.kaust.edu.sa/ddmgd/) to provide a comprehensive repository of information related to genes methylated in diseases that can be found through text mining. DDMGD's scope is not limited to a particular group of genes, diseases or species. Using the text mining system DEMGD we developed earlier, together with additional post-processing, we extracted associations of genes methylated in different diseases from PubMed Central articles and PubMed abstracts. The accuracy of the extracted associations is 82%, as estimated on 2500 hand-curated entries. DDMGD provides a user-friendly interface facilitating retrieval of these associations ranked according to confidence scores. Submission of new associations to DDMGD is supported. A comparison of DDMGD with several other databases focused on genes methylated in diseases shows that DDMGD is comprehensive and includes most of the recent information on genes methylated in diseases.

  10. Identifying Understudied Nuclear Reactions by Text-mining the EXFOR Experimental Nuclear Reaction Library

    Science.gov (United States)

    Hirdt, J. A.; Brown, D. A.

    2016-01-01

    The EXFOR library contains the largest collection of experimental nuclear reaction data available as well as the data's bibliographic information and experimental details. We text-mined the REACTION and MONITOR fields of the ENTRYs in the EXFOR library in order to identify understudied reactions and quantities. Using the results of the text-mining, we created an undirected graph from the EXFOR datasets with each graph node representing a single reaction and quantity and graph links representing the various types of connections between these reactions and quantities. This graph is an abstract representation of the connections in EXFOR, similar to graphs of social networks, authorship networks, etc. We use various graph theoretical tools to identify important yet understudied reactions and quantities in EXFOR. Although we identified a few cross sections relevant for shielding applications and isotope production, mostly we identified charged particle fluence monitor cross sections. As a side effect of this work, we learn that our abstract graph is typical of other real-world graphs.
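
Once the REACTION and MONITOR fields are text-mined into an undirected graph, simple graph statistics already surface candidates: nodes with unusually few links are natural 'understudied' suspects. A toy sketch using degree centrality (the paper's analysis applies richer graph-theoretic tools; names here are ours):

```python
from collections import defaultdict

def degree_centrality(edges):
    """Node degrees of an undirected graph given as (u, v) pairs."""
    deg = defaultdict(int)
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return dict(deg)

def least_connected(edges, k=3):
    """Flag the k lowest-degree nodes as candidate understudied
    reactions/quantities."""
    deg = degree_centrality(edges)
    return sorted(deg, key=deg.get)[:k]
```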

  11. Beyond the biomedical and behavioural: towards an integrated approach to HIV prevention in the southern African mining industry.

    Science.gov (United States)

    Campbell, C; Williams, B

    1999-06-01

    While migrant labour is believed to play an important role in the dynamics of HIV-transmission in many of the countries of southern Africa, little has been written about the way in which HIV/AIDS has been dealt with in the industrial settings in which many migrant workers are employed. This paper takes the gold mining industry in the countries of the Southern African Development Community (SADC) as a case study. While many mines made substantial efforts to establish HIV-prevention programmes relatively early on in the epidemic, these appear to have had little impact. The paper analyses the response of key players in the mining industry, in the interests of highlighting the limitations of the way in which both managements and trade unions have responded to HIV. It will be argued that the energy that has been devoted either to biomedical or behavioural prevention programmes or to human rights issues has served to obscure the social and developmental dimensions of HIV-transmission. This argument is supported by means of a case study which seeks to highlight the complexity of the dynamics of disease transmission in this context, a complexity which is not reflected in individualistic responses. An account is given of a new intervention which seeks to develop a more integrated approach to HIV management in an industrial setting.

  12. The potential of text mining in data integration and network biology for plant research: a case study on Arabidopsis.

    Science.gov (United States)

    Van Landeghem, Sofie; De Bodt, Stefanie; Drebert, Zuzanna J; Inzé, Dirk; Van de Peer, Yves

    2013-03-01

    Despite the availability of various data repositories for plant research, a wealth of information currently remains hidden within the biomolecular literature. Text mining provides the necessary means to retrieve these data through automated processing of texts. However, only recently has advanced text mining methodology been implemented with sufficient computational power to process texts at a large scale. In this study, we assess the potential of large-scale text mining for plant biology research in general and for network biology in particular using a state-of-the-art text mining system applied to all PubMed abstracts and PubMed Central full texts. We present extensive evaluation of the textual data for Arabidopsis thaliana, assessing the overall accuracy of this new resource for usage in plant network analyses. Furthermore, we combine text mining information with both protein-protein and regulatory interactions from experimental databases. Clusters of tightly connected genes are delineated from the resulting network, illustrating how such an integrative approach is essential to grasp the current knowledge available for Arabidopsis and to uncover gene information through guilt by association. All large-scale data sets, as well as the manually curated textual data, are made publicly available, hereby stimulating the application of text mining data in future plant biology studies.
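
Guilt-by-association analyses of this kind start from groups of tightly connected genes in the merged network. A minimal union-find sketch that splits an undirected edge list into connected components (the study's clustering is more sophisticated; names here are ours):

```python
def connected_components(edges):
    """Group nodes of an undirected network into connected components
    using a union-find structure with path halving."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        parent[find(a)] = find(b)

    for u, v in edges:
        union(u, v)
    comps = {}
    for node in parent:
        comps.setdefault(find(node), set()).add(node)
    return list(comps.values())
```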

  13. Automated extraction of precise protein expression patterns in lymphoma by text mining abstracts of immunohistochemical studies

    Directory of Open Access Journals (Sweden)

    Jia-Fu Chang

    2013-01-01

    Full Text Available Background: In general, surgical pathology reviews report protein expression by tumors in a semi-quantitative manner, that is, -, -/+, +/-, +. At the same time, the experimental pathology literature provides multiple examples of precise expression levels determined by immunohistochemical (IHC) tissue examination of populations of tumors. Natural language processing (NLP) techniques enable the automated extraction of such information through text mining. We propose establishing a database linking quantitative protein expression levels with specific tumor classifications through NLP. Materials and Methods: Our method takes advantage of typical forms of representing experimental findings in terms of percentages of protein expression manifest by the tumor population under study. Characteristically, percentages are represented straightforwardly with the % symbol or as the number of positive findings out of the total population. Such text is readily recognized using regular expressions and templates, permitting extraction of sentences containing these forms for further analysis using grammatical structures and rule-based algorithms. Results: Our pilot study is limited to the extraction of such information related to lymphomas. We achieved a satisfactory level of retrieval, as reflected in scores of 69.91% precision and 57.25% recall with an F-score of 62.95%. In addition, we demonstrate the utility of a web-based curation tool for confirming and correcting our findings. Conclusions: The experimental pathology literature represents a rich source of pathobiological information which has been relatively underutilized. There has been a combinatorial explosion of knowledge within the pathology domain, as represented by increasing numbers of immunophenotypes and disease subclassifications. NLP techniques support practical text mining techniques for extracting this knowledge and organizing it in forms appropriate for pathology decision support systems.
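
    The regular-expression step described above can be sketched in a few lines. The two patterns below are illustrative guesses at the surface forms the abstract mentions (a bare percentage, and "N of M cases"), not the paper's actual templates.

```python
import re

# Two common surface forms for expression levels in IHC abstracts
# (illustrative patterns, not the paper's actual templates):
#   "CD30 was expressed in 70.8% of cases"
#   "positive in 17 of 24 cases"
PERCENT = re.compile(r"(\d{1,3}(?:\.\d+)?)\s*%")
RATIO = re.compile(r"(\d+)\s+of\s+(\d+)\s+(?:cases|tumors|patients)")

def expression_percentages(sentence):
    """Extract protein-expression percentages from one sentence."""
    values = [float(m.group(1)) for m in PERCENT.finditer(sentence)]
    for m in RATIO.finditer(sentence):
        positive, total = int(m.group(1)), int(m.group(2))
        if total:
            values.append(round(100.0 * positive / total, 2))
    return values

print(expression_percentages("CD30 was positive in 17 of 24 cases (70.8%)"))
# [70.8, 70.83]
```

    Sentences matched by these patterns would then be passed on to the grammatical and rule-based analysis the authors describe.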

  14. Facilitating the development of controlled vocabularies for metabolomics technologies with text mining

    Directory of Open Access Journals (Sweden)

    Rebholz-Schuhmann Dietrich

    2008-04-01

    Full Text Available Abstract Background Many bioinformatics applications rely on controlled vocabularies or ontologies to consistently interpret and seamlessly integrate information scattered across public resources. Experimental data sets from metabolomics studies need to be integrated with one another, but also with data produced by other types of omics studies in the spirit of systems biology, hence the pressing need for vocabularies and ontologies in metabolomics. However, it is time-consuming and nontrivial to construct these resources manually. Results We describe a methodology for rapid development of controlled vocabularies, a study originally motivated by the needs for vocabularies describing metabolomics technologies. We present case studies involving two controlled vocabularies (for nuclear magnetic resonance spectroscopy and gas chromatography) whose development is currently underway as part of the Metabolomics Standards Initiative. The initial vocabularies were compiled manually, providing a total of 243 and 152 terms. A total of 5,699 and 2,612 new terms were acquired automatically from the literature. The analysis of the results showed that full-text articles (especially the Materials and Methods sections) are the major source of technology-specific terms, as opposed to paper abstracts. Conclusions We suggest a text mining method for efficient corpus-based term acquisition as a way of rapidly expanding a set of controlled vocabularies with the terms used in the scientific literature. We adopted an integrative approach, combining relatively generic software and data resources for time- and cost-effective development of a text mining tool for expansion of controlled vocabularies across various domains, as a practical alternative to both manual term collection and tailor-made named entity recognition methods.
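
    The corpus-based term acquisition idea can be illustrated with a few lines of stdlib Python: propose frequent bigrams from Methods-style text that the current vocabulary does not yet cover. The tokenisation, bigram choice and frequency threshold below are our simplifications, not the paper's actual pipeline.

```python
import re
from collections import Counter

def candidate_terms(texts, vocabulary, min_count=2):
    """Propose new vocabulary terms: bigrams that recur in the corpus
    but are not already covered by the controlled vocabulary."""
    vocab = {t.lower() for t in vocabulary}
    counts = Counter()
    for text in texts:
        tokens = re.findall(r"[a-z][a-z-]+", text.lower())
        counts.update(" ".join(pair) for pair in zip(tokens, tokens[1:]))
    return [(t, n) for t, n in counts.most_common()
            if n >= min_count and t not in vocab]

methods_sections = [
    "Spectra were acquired on a Bruker spectrometer with water suppression.",
    "Water suppression was applied before Fourier transformation.",
]
print(candidate_terms(methods_sections, ["fourier transformation"]))
# [('water suppression', 2)]
```

    A real implementation would of course filter candidates linguistically and rank them against a background corpus, as the paper's integrative approach does.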

  15. DiMeX: A Text Mining System for Mutation-Disease Association Extraction.

    Science.gov (United States)

    Mahmood, A S M Ashique; Wu, Tsung-Jung; Mazumder, Raja; Vijay-Shanker, K

    2016-01-01

    The number of published articles describing associations between mutations and diseases is increasing at a fast pace. There is a pressing need to gather such mutation-disease associations into public knowledge bases, but manual curation slows down the growth of such databases. We have addressed this problem by developing a text-mining system (DiMeX) to extract mutation-disease associations from publication abstracts. DiMeX consists of a series of natural language processing modules that preprocess input text and apply syntactic and semantic patterns to extract mutation-disease associations. DiMeX achieves high precision and recall, with F-scores of 0.88, 0.91 and 0.89 when evaluated on three different datasets for mutation-disease associations. DiMeX includes a separate component that extracts mutation mentions in text and associates them with genes. This component has also been evaluated on different datasets and shown to achieve state-of-the-art performance. The results indicate that our system outperforms existing mutation-disease association tools, addressing the low-precision problems suffered by most approaches. DiMeX was applied to a large set of abstracts from Medline to extract mutation-disease associations, as well as other relevant information including patient/cohort size and population data. The results are stored in a database that can be queried and downloaded at http://biotm.cis.udel.edu/dimex/. We conclude that this high-throughput text-mining approach has the potential to significantly assist researchers and curators in enriching mutation databases.
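
    As a rough illustration of the mutation-mention component (these are toy patterns, not DiMeX's actual rules), a regular expression can already catch common surface forms such as V600E, p.Arg123Cys and c.35G>T:

```python
import re

# Toy patterns for common mutation notations (not DiMeX's actual rules):
# protein substitutions such as V600E or p.Arg123Cys, and coding-DNA
# changes such as c.35G>T.
AA3 = "Ala|Arg|Asn|Asp|Cys|Gln|Glu|Gly|His|Ile|Leu|Lys|Met|Phe|Pro|Ser|Thr|Trp|Tyr|Val"
MUTATION = re.compile(
    rf"\b(?:p\.(?:{AA3})\d+(?:{AA3})"
    rf"|[ACDEFGHIKLMNPQRSTVWY]\d+[ACDEFGHIKLMNPQRSTVWY]"
    rf"|c\.\d+[ACGT]>[ACGT])\b"
)

def mutation_mentions(text):
    """Return all mutation-like strings found in a sentence."""
    return [m.group(0) for m in MUTATION.finditer(text)]

print(mutation_mentions("The BRAF V600E and KRAS c.35G>T variants were studied."))
# ['V600E', 'c.35G>T']
```

    The hard part, which DiMeX's syntactic and semantic patterns address, is then linking each mention to the right gene and disease in the sentence.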

  16. Online Discourse on Fibromyalgia: Text-Mining to Identify Clinical Distinction and Patient Concerns

    Science.gov (United States)

    Park, Jungsik; Ryu, Young Uk

    2014-01-01

    Background The purpose of this study was to evaluate the possibility of using text-mining to identify clinical distinctions and patient concerns in online memoirs posted by patients with fibromyalgia (FM). Material/Methods A total of 399 memoirs were collected from an FM group website. The unstructured data of the memoirs were collected through a crawling process and converted into structured data with a concordance, part-of-speech tagging, and word frequency. We also conducted a lexical analysis and phrase pattern identification. After examining the data, a set of FM-related keywords was obtained and phrase net relationships were set through a web-based visualization tool. Results The clinical distinction of FM was verified. Pain is the biggest issue for FM patients. The pain affected body parts including ‘muscles,’ ‘leg,’ ‘neck,’ ‘back,’ ‘joints,’ and ‘shoulders,’ with accompanying symptoms such as ‘spasms,’ ‘stiffness,’ and ‘aching,’ and was described as ‘severe,’ ‘chronic,’ and ‘constant.’ This study also demonstrated that it is possible to understand the interests and concerns of FM patients through text-mining. FM patients wanted to escape from their pain and symptoms, so they were interested in medical treatment and help. They also showed interest in their work and occupation, and hoped to continue living life through relationships with the people around them. Conclusions This research shows the potential of extracting keywords to confirm the clinical distinction of a certain disease, and text-mining can help objectively understand the concerns of patients by generalizing their large number of subjective illness experiences. However, there are limitations to the processes and methods for organizing and classifying large amounts of text, and these limits have to be considered when analyzing the results. The development of research methodology to overcome

  17. In search of new product ideas: Identifying ideas in online communities by machine learning and text mining

    DEFF Research Database (Denmark)

    Christensen, Kasper; Frederiksen, Lars; Nørskov, Sladjana

    2016-01-01

    contains an idea or not. 137 idea texts and 2666 non-idea texts were identified. The human raters could not agree on the remaining 197 texts and therefore those texts were omitted from the analysis. In a second step, the remaining 2803 texts were processed by means of text mining techniques and then used as input to train a classification model. We describe how to tune the model and which text mining steps to perform. We conclude that machine learning and text mining can be useful for detecting ideas in online communities. The method can help researchers and firms in idea identification when dealing with large amounts of unstructured text data. Also, it is interesting in its own right that machine learning can be used to detect ideas.
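
    In the same spirit as the idea/non-idea classification step, a tiny multinomial Naive Bayes trained on toy community posts (the paper's model, features and data were of course different) looks like this:

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(text):
    return re.findall(r"[a-z']+", text.lower())

class NaiveBayesIdeaClassifier:
    """Minimal multinomial Naive Bayes with Laplace smoothing, as a
    sketch of idea/non-idea text classification on toy data."""

    def fit(self, texts, labels):
        self.priors = Counter(labels)
        self.words = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.words[label].update(tokenize(text))
        self.vocab = {w for counts in self.words.values() for w in counts}
        return self

    def predict(self, text):
        def log_score(label):
            counts = self.words[label]
            total = sum(counts.values())
            s = math.log(self.priors[label] / sum(self.priors.values()))
            for w in tokenize(text):
                s += math.log((counts[w] + 1) / (total + len(self.vocab)))
            return s
        return max(self.priors, key=log_score)

train_texts = [
    "it would be great if the camera had a timelapse mode",
    "you should add support for external microphones",
    "my order arrived late and the box was damaged",
    "how do i reset the firmware on this device",
]
train_labels = ["idea", "idea", "non-idea", "non-idea"]
clf = NaiveBayesIdeaClassifier().fit(train_texts, train_labels)
print(clf.predict("please add a timelapse recording mode"))
# idea
```

    In practice one would use a library classifier, proper train/test splits and the tuning steps the paper describes; this only shows the shape of the approach.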

  18. MET network in PubMed: a text-mined network visualization and curation system.

    Science.gov (United States)

    Dai, Hong-Jie; Su, Chu-Hsien; Lai, Po-Ting; Huang, Ming-Siang; Jonnagaddala, Jitendra; Rose Jue, Toni; Rao, Shruti; Chou, Hui-Jou; Milacic, Marija; Singh, Onkar; Syed-Abdul, Shabbir; Hsu, Wen-Lian

    2016-01-01

    Metastasis is the dissemination of a cancer/tumor from one organ to another, and it is the most dangerous stage of cancer progression, causing more than 90% of cancer deaths. Improving the understanding of the complicated cellular mechanisms underlying metastasis requires investigation of the signaling pathways. To this end, we developed a METastasis (MET) network visualization and curation tool to assist metastasis researchers in retrieving network information of interest while browsing through the large volume of studies in PubMed. MET can recognize relations among genes, cancers, tissues and organs of metastasis mentioned in the literature through text-mining techniques, and then produce a visualization of all mined relations in a metastasis network. To facilitate the curation process, MET is developed as a browser extension that allows curators to review and edit concepts and relations related to metastasis directly in PubMed. PubMed users can also view the metastatic networks integrated from the large collection of research papers directly through MET. For the BioCreative 2015 interactive track (IAT), a curation task was proposed to curate metastatic networks among PubMed abstracts. Six curators participated in the proposed task and a post-IAT task, curating 963 unique metastatic relations from 174 PubMed abstracts using MET. Database URL: http://btm.tmu.edu.tw/metastasisway

  20. A Distributed Look-up Architecture for Text Mining Applications using MapReduce.

    Science.gov (United States)

    Balkir, Atilla Soner; Foster, Ian; Rzhetsky, Andrey

    2011-11-01

    Text mining applications typically involve statistical models that require accessing and updating model parameters in an iterative fashion. With the growing size of the data, such models become extremely parameter rich, and naive parallel implementations fail to address the scalability problem of maintaining a distributed look-up table that maps model parameters to their values. We evaluate several existing alternatives to provide coordination among worker nodes in Hadoop [11] clusters, and suggest a new multi-layered look-up architecture that is specifically optimized for certain problem domains. Our solution exploits the power-law distribution characteristics of the phrase or n-gram counts in large corpora while utilizing a Bloom Filter [2], in-memory cache, and an HBase [12] cluster at varying levels of abstraction.
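
    The layering idea above (a Bloom filter in front of an in-memory cache in front of HBase) can be sketched in a few lines. Everything below is illustrative: the md5-based hash scheme is chosen for brevity, and a plain dict stands in for the HBase cluster.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: set membership with no false negatives."""

    def __init__(self, size=8192, hashes=3):
        self.size, self.hashes, self.bits = size, hashes, bytearray(size)

    def _positions(self, key):
        for i in range(self.hashes):
            digest = hashlib.md5(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:4], "big") % self.size

    def add(self, key):
        for p in self._positions(key):
            self.bits[p] = 1

    def might_contain(self, key):
        return all(self.bits[p] for p in self._positions(key))

class LayeredLookup:
    """Multi-layer n-gram count look-up in the spirit of the paper:
    the Bloom filter rejects most misses cheaply, the cache holds hot
    keys, and a dict stands in for the distributed HBase layer."""

    def __init__(self, backing_store):
        self.store = backing_store      # stand-in for HBase
        self.cache = {}                 # hot in-memory layer
        self.bloom = BloomFilter()
        for key in backing_store:
            self.bloom.add(key)

    def get(self, ngram):
        if not self.bloom.might_contain(ngram):
            return None                 # definite miss: skip the store
        if ngram in self.cache:
            return self.cache[ngram]
        value = self.store.get(ngram)
        if value is not None:
            self.cache[ngram] = value   # promote hot keys
        return value

counts = {"text mining": 1042, "bloom filter": 77}
lookup = LayeredLookup(counts)
print(lookup.get("text mining"), lookup.get("unseen phrase"))
# 1042 None
```

    Because n-gram counts follow a power-law distribution, a small cache of frequent phrases absorbs most look-ups, which is the effect the paper exploits at cluster scale.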

  1. Natural products for chronic cough: Text mining the East Asian historical literature for future therapeutics.

    Science.gov (United States)

    Shergis, Johannah Linda; Wu, Lei; May, Brian H; Zhang, Anthony Lin; Guo, Xinfeng; Lu, Chuanjian; Xue, Charlie Changli

    2015-08-01

    Chronic cough is a significant health burden. Patients experience variable benefits from over-the-counter and prescribed products, but there is an unmet need for more effective treatments. Natural products have been used to treat cough, and some plant compounds such as pseudoephedrine from ephedra and codeine from opium poppy have been developed into drugs. Text mining the historical literature may offer new insights for future therapeutic development. We identified natural products used in the East Asian historical literature to treat chronic cough. Evaluation of the historical literature revealed 331 natural products used to treat chronic cough. Products included plants, minerals and animal substances. These natural products were found in 75 different books published between AD 363 and 1911. Of the 331 products, the 10 most frequently and continually used products were examined, taking into consideration findings from contemporary experimental studies. The natural products identified are promising and offer new directions in therapeutic development for treating chronic cough.

  2. Finding falls in ambulatory care clinical documents using statistical text mining

    Science.gov (United States)

    McCart, James A; Berndt, Donald J; Jarman, Jay; Finch, Dezon K; Luther, Stephen L

    2013-01-01

    Objective To determine how well statistical text mining (STM) models can identify falls within clinical text associated with an ambulatory encounter. Materials and Methods 2241 patients were selected with a fall-related ICD-9-CM E-code or matched injury diagnosis code while being treated as an outpatient at one of four sites within the Veterans Health Administration. All clinical documents within a 48-h window of the recorded E-code or injury diagnosis code for each patient were obtained (n=26 010; 611 distinct document titles) and annotated for falls. Logistic regression, support vector machine, and cost-sensitive support vector machine (SVM-cost) models were trained on a stratified sample of 70% of documents from one location (dataset Atrain) and then applied to the remaining unseen documents (datasets Atest–D). Results All three STM models obtained area under the receiver operating characteristic curve (AUC) scores above 0.950 on the four test datasets (Atest–D). The SVM-cost model obtained the highest AUC scores, ranging from 0.953 to 0.978. The SVM-cost model also achieved F-measure values ranging from 0.745 to 0.853, sensitivity from 0.890 to 0.931, and specificity from 0.877 to 0.944. Discussion The STM models performed well across a large heterogeneous collection of document titles. In addition, the models also generalized across other sites, including a traditionally bilingual site that had distinctly different grammatical patterns. Conclusions The results of this study suggest STM-based models have the potential to improve surveillance of falls. Furthermore, the encouraging evidence shown here that STM is a robust technique for mining clinical documents bodes well for other surveillance-related topics. PMID:23242765
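
    The headline metric above, AUC, can be computed directly from model scores via the rank-sum identity. The function below is a generic stdlib sketch of that evaluation step, not the study's actual pipeline.

```python
def auc(labels, scores):
    """AUC via the Mann-Whitney identity: the probability that a
    randomly chosen positive (fall-related) document outscores a
    randomly chosen negative one, counting ties as half.
    labels: 1 = fall-related document, 0 = not; scores: model outputs."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

print(auc([1, 1, 0, 0], [0.9, 0.4, 0.6, 0.1]))
# 0.75
```

    An AUC of 0.95+, as reported for all three models, means a fall document is ranked above a non-fall document in at least 95% of such pairs.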

  3. CTSS: A Tool for Efficient Information Extraction with Soft Matching Rules for Text Mining

    Directory of Open Access Journals (Sweden)

    A. Christy

    2008-01-01

    Full Text Available The abundance of digitally available information in the modern world has created a demand for structured information. The problem of text mining, which deals with discovering useful information in unstructured text, has therefore attracted the attention of researchers. The role of Information Extraction (IE) software is to identify relevant information in texts, extracting it from a variety of sources and aggregating it to create a single view. Information extraction systems depend on particular corpora and tend to have poor recall. Making such systems domain-independent while improving recall is therefore an important challenge for IE. In this research, the authors propose a domain-independent algorithm for information extraction, called SOFTRULEMINING, for extracting the aim, methodology and conclusion from technical abstracts. The algorithm combines a trigram model with soft-matching rules. A tool, CTSS, was constructed using SOFTRULEMINING and tested with technical abstracts from www.computer.org and www.ansinet.org; the tool improved recall and therefore precision in comparison with other search engines.

  4. Mining

    Directory of Open Access Journals (Sweden)

    Khairullah Khan

    2014-09-01

    Full Text Available Opinion mining is an interesting area of research because of its applications in various fields. Collecting opinions of people about products and about social and political events and problems through the Web is becoming increasingly popular every day. The opinions of users are helpful for the public and for stakeholders when making certain decisions. Opinion mining is a way to retrieve information through search engines, Web blogs and social networks. Because of the huge number of reviews in the form of unstructured text, it is impossible to summarize the information manually. Accordingly, efficient computational methods are needed for mining and summarizing the reviews from corpuses and Web documents. This study presents a systematic literature survey regarding the computational techniques, models and algorithms for mining opinion components from unstructured reviews.

  5. How to link ontologies and protein-protein interactions to literature: text-mining approaches and the BioCreative experience.

    Science.gov (United States)

    Krallinger, Martin; Leitner, Florian; Vazquez, Miguel; Salgado, David; Marcelle, Christophe; Tyers, Mike; Valencia, Alfonso; Chatr-aryamontri, Andrew

    2012-01-01

    There is an increasing interest in developing ontologies and controlled vocabularies to improve the efficiency and consistency of manual literature curation, to enable more formal biocuration workflow results and ultimately to improve analysis of biological data. Two ontologies that have been successfully used for this purpose are the Gene Ontology (GO) for annotating aspects of gene products and the Molecular Interaction ontology (PSI-MI) used by databases that archive protein-protein interactions. The examination of protein interactions has proven to be extremely promising for the understanding of cellular processes. Manual mapping of information from the biomedical literature to bio-ontology terms is one of the most challenging components in the curation pipeline. It requires that expert curators interpret the natural language descriptions contained in articles and infer their semantic equivalents in the ontology (controlled vocabulary). Since manual curation is a time-consuming process, there is strong motivation to implement text-mining techniques to automatically extract annotations from free text. A range of text mining strategies has been devised to assist in the automated extraction of biological data. These strategies either recognize technical terms used recurrently in the literature and propose them as candidates for inclusion in ontologies, or retrieve passages that serve as evidential support for annotating an ontology term, e.g. from the PSI-MI or GO controlled vocabularies. Here, we provide a general overview of current text-mining methods to automatically extract annotations of GO and PSI-MI ontology terms in the context of the BioCreative (Critical Assessment of Information Extraction Systems in Biology) challenge. Special emphasis is given to protein-protein interaction data and PSI-MI terms referring to interaction detection methods.

  6. Development of Workshops on Biodiversity and Evaluation of the Educational Effect by Text Mining Analysis

    Science.gov (United States)

    Baba, R.; Iijima, A.

    2014-12-01

    Conservation of biodiversity is one of the key issues in environmental studies, and education is becoming increasingly important as a means to address it. In previous work, we developed a course of workshops on the conservation of biodiversity. To disseminate the course as a tool for environmental education, determining its educational effect is essential. Text mining enables analyses of the frequency and co-occurrence of words in freely described texts. This study evaluates the effect of the workshop using text mining techniques. We hosted the originally developed workshop on the conservation of biodiversity for 22 college students. The aim of the workshop was to convey the definition of biodiversity. Generally, biodiversity refers to the diversity of ecosystems, diversity between species, and diversity within species. To facilitate discussion, supplementary materials were used. For instance, field guides of wildlife species were used to discuss the diversity of ecosystems, and a hierarchical framework in an ecological pyramid was shown to aid understanding of the role of diversity between species. In addition, we offered a document on the historical case of the Potato Famine in Ireland to discuss diversity within species from the genetic viewpoint. Before and after the workshop, we asked students for free descriptions of the definition of biodiversity and analyzed them using Tiny Text Miner, which enables Japanese-language morphological analysis. Frequently used words were sorted into categories, and a principal component analysis was carried out. After the workshop, the frequency of words tagged to diversity between species and diversity within species had significantly increased. In the principal component analysis, the first component consists of words such as producer, consumer, decomposer, and food chain. This indicates that the students have comprehended the close relationship between

  7. Practice-based evidence: profiling the safety of cilostazol by text-mining of clinical notes.

    Directory of Open Access Journals (Sweden)

    Nicholas J Leeper

    Full Text Available BACKGROUND: Peripheral arterial disease (PAD) is a growing problem with few available therapies. Cilostazol is the only FDA-approved medication with a class I indication for intermittent claudication, but carries a black box warning due to concerns for increased cardiovascular mortality. To assess the validity of this black box warning, we employed a novel text-analytics pipeline to quantify the adverse events associated with Cilostazol use in a clinical setting, including patients with congestive heart failure (CHF). METHODS AND RESULTS: We analyzed the electronic medical records of 1.8 million subjects from the Stanford clinical data warehouse spanning 18 years using a novel text-mining/statistical analytics pipeline. We identified 232 PAD patients taking Cilostazol and created a control group of 1,160 PAD patients not taking this drug using 1:5 propensity-score matching. Over a mean follow-up of 4.2 years, we observed no association between Cilostazol use and any major adverse cardiovascular event, including stroke (OR = 1.13, CI [0.82, 1.55]), myocardial infarction (OR = 1.00, CI [0.71, 1.39]), or death (OR = 0.86, CI [0.63, 1.18]). Cilostazol was not associated with an increase in any arrhythmic complication. We also identified a subset of CHF patients who were prescribed Cilostazol despite its black box warning, and found that it did not increase mortality in this high-risk group of patients. CONCLUSIONS: This proof-of-principle study shows the potential of text-analytics to mine clinical data warehouses to uncover 'natural experiments' such as the use of Cilostazol in CHF patients. We envision this method will have broad applications for examining difficult-to-test clinical hypotheses and to aid in post-marketing drug safety surveillance. Moreover, our observations argue for a prospective study to examine the validity of a drug safety warning that may be unnecessarily limiting the use of an efficacious therapy.
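
    The 1:5 propensity-score matching mentioned above can be sketched as a greedy nearest-neighbour routine over precomputed scores. This is one illustrative way to do it; the caliper value and tie handling below are our choices, not the study's.

```python
def match_one_to_k(treated, controls, k=5, caliper=0.05):
    """Greedy 1:k nearest-neighbour matching on precomputed propensity
    scores (illustrative sketch; the study's procedure may differ).
    treated/controls map patient id -> propensity score; each control
    is used at most once."""
    available = dict(controls)
    matches = {}
    for pid, score in sorted(treated.items(), key=lambda kv: kv[1]):
        ranked = sorted(available, key=lambda c: abs(available[c] - score))
        chosen = [c for c in ranked[:k] if abs(available[c] - score) <= caliper]
        for c in chosen:
            del available[c]            # sample without replacement
        matches[pid] = chosen
    return matches

treated = {"T1": 0.30}
controls = {f"C{i}": s for i, s in enumerate([0.28, 0.29, 0.31, 0.33, 0.34, 0.60])}
print(match_one_to_k(treated, controls))
```

    Here T1 is matched to the five controls with scores near 0.30, while the distant control at 0.60 is excluded by the caliper.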

  8. A corpus of full-text journal articles is a robust evaluation tool for revealing differences in performance of biomedical natural language processing tools

    Directory of Open Access Journals (Sweden)

    Verspoor Karin

    2012-08-01

    Full Text Available Abstract Background We introduce the linguistic annotation of a corpus of 97 full-text biomedical publications, known as the Colorado Richly Annotated Full Text (CRAFT) corpus. We further assess the performance of existing tools for performing sentence splitting, tokenization, syntactic parsing, and named entity recognition on this corpus. Results Many biomedical natural language processing systems demonstrated large differences between their previously published results and their performance on the CRAFT corpus when tested with the publicly available models or rule sets. Trainable systems differed widely with respect to their ability to build high-performing models based on this data. Conclusions The finding that some systems were able to train high-performing models based on this corpus is additional evidence, beyond high inter-annotator agreement, that the quality of the CRAFT corpus is high. The overall poor performance of various systems indicates that considerable work needs to be done to enable natural language processing systems to work well when the input is full-text journal articles. The CRAFT corpus provides a valuable resource to the biomedical natural language processing community for evaluation and training of new models for biomedical full-text publications.

  9. SYRIAC: The systematic review information automated collection system, a data warehouse for facilitating automated biomedical text classification.

    Science.gov (United States)

    Yang, Jianji J; Cohen, Aaron M; Cohen, Aaron; McDonagh, Marian S

    2008-11-06

    Automatic document classification can be valuable in increasing the efficiency of updating systematic reviews (SR). In order for the machine learning process to work well, it is critical to create and maintain high-quality training datasets consisting of expert SR inclusion/exclusion decisions. This task can be laborious, especially when the number of topics is large and the source data format is inconsistent. To approach this problem, we built an automated system to streamline the required steps, from initial notification of an update in the source annotation files to loading the data warehouse, along with a web interface to monitor the status of each topic. In our current collection of 26 SR topics, we were able to standardize almost all of the relevance judgments and recover PMIDs for over 80% of all articles. Of those PMIDs, over 99% were correct in a manual random sample study. Our system performs an essential function in creating training and evaluation data sets for SR text mining research.

  10. Using a Web Crawler to Collect Tweets with a Pre-Processing Text Mining Method

    Directory of Open Access Journals (Sweden)

    Bayu Rima Aditya

    2015-11-01

    Full Text Available The amount of data on social media is now very large, but little of it has been exploited or processed into something of practical value; one such source is tweets on the social media platform Twitter. This paper describes the use of a web crawler engine with a pre-processing text mining method. The web crawler engine collects tweets through the Twitter API as unstructured text data, which is then presented again in web form, while the pre-processing method filters the tweets in three stages: cleansing, case folding, and parsing. The application designed in this study was developed with the waterfall software development model and implemented in the PHP programming language. Black-box testing was used to check whether the design works as expected. The result of this research is an application that turns the collected tweets into data ready for further processing according to the user's needs, based on search keywords and dates. This matters because related studies show that data on social media, particularly Twitter, has become a target for companies and institutions seeking to understand public opinion.
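
    The three filtering stages named in the abstract (cleansing, case folding, parsing) can be sketched as follows. The paper's implementation is in PHP; this is a minimal Python equivalent with our own choice of cleansing rules.

```python
import re

def cleanse(tweet):
    """Cleansing: strip URLs, @mentions, the RT marker, and symbols."""
    tweet = re.sub(r"https?://\S+|@\w+|\bRT\b", " ", tweet)
    return re.sub(r"[^\w\s]", " ", tweet)

def case_fold(tweet):
    """Case folding: normalise everything to lower case."""
    return tweet.lower()

def parse(tweet):
    """Parsing: split the cleaned text into tokens."""
    return tweet.split()

def preprocess(tweet):
    return parse(case_fold(cleanse(tweet)))

print(preprocess("RT @user: Check out #TextMining at https://example.com!"))
# ['check', 'out', 'textmining', 'at']
```

    The tokens produced by the final stage are what downstream analyses (keyword and date search in the application) operate on.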

  11. Identifying the Uncertainty in Physician Practice Location through Spatial Analytics and Text Mining

    Directory of Open Access Journals (Sweden)

    Xuan Shi

    2016-09-01

    Full Text Available In response to the widespread concern about the adequacy, distribution, and disparity of access to a health care workforce, the correct identification of physicians’ practice locations is critical to access public health services. In prior literature, little effort has been made to detect and resolve the uncertainty about whether the address provided by a physician in the survey is a practice address or a home address. This paper introduces how to identify the uncertainty in a physician’s practice location through spatial analytics, text mining, and visual examination. While land use and zoning code, embedded within the parcel datasets, help to differentiate resident areas from other types, spatial analytics may have certain limitations in matching and comparing physician and parcel datasets with different uncertainty issues, which may lead to unforeseen results. Handling and matching the string components between physicians’ addresses and the addresses of the parcels could identify the spatial uncertainty and instability to derive a more reasonable relationship between different datasets. Visual analytics and examination further help to clarify the undetectable patterns. This research will have a broader impact over federal and state initiatives and policies to address both insufficiency and maldistribution of a health care workforce to improve the accessibility to public health services.
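
    One way to handle and match the string components of physician and parcel addresses, as the abstract describes, is normalisation followed by a similarity ratio. The abbreviation table below is a toy example, not the paper's actual rules.

```python
import re
from difflib import SequenceMatcher

def normalize(address):
    """Expand common abbreviations and normalise case/punctuation
    before comparison (a toy normalisation table)."""
    abbrev = {"st": "street", "ave": "avenue", "dr": "drive", "ste": "suite"}
    tokens = re.findall(r"\w+", address.lower())
    return " ".join(abbrev.get(t, t) for t in tokens)

def address_similarity(physician_addr, parcel_addr):
    """Similarity in [0, 1] between two normalised address strings."""
    return SequenceMatcher(None, normalize(physician_addr),
                           normalize(parcel_addr)).ratio()

score = address_similarity("123 N Main St, Ste 4", "123 North Main Street Suite 4")
print(round(score, 2))
# 0.93
```

    High-similarity pairs whose parcels carry a residential land-use code would then flag the uncertain cases the paper describes, where a survey address may be a home rather than a practice location.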

  12. Cardiac data mining (CDM); organization and predictive analytics on biomedical (cardiac) data

    Science.gov (United States)

    Bilal, M. Musa; Hussain, Masood; Basharat, Iqra; Fatima, Mamuna

    2013-10-01

Data mining and data analytics have been of immense importance to many different fields as data science has evolved over recent years. Biostatistics and medical informatics have proved to be the foundation of many modern biological theories and analysis techniques. These are the fields that apply data mining practices, along with statistical models, to discover hidden trends in data from biological experiments or procedures. The objective of this research study is to develop a system for the efficient extraction, transformation, and loading of such data from cardiology procedure reports provided by the Armed Forces Institute of Cardiology. It also aims to devise a model for the predictive analysis and classification of this data into classes of importance to cardiologists worldwide, including the prediction of patient impressions and other key features.

  13. Mining texts to efficiently generate global data on political regime types

    Directory of Open Access Journals (Sweden)

    Shahryar Minhas

    2015-07-01

Full Text Available We describe the design and results of an experiment in using text-mining and machine-learning techniques to generate annual measures of national political regime types. Valid and reliable measures of countries’ forms of national government are essential to cross-national and dynamic analysis of many phenomena of great interest to political scientists, including civil war, interstate war, democratization, and coups d’état. Unfortunately, traditional measures of regime type are very expensive to produce, and observations for ambiguous cases are often sharply contested. In this project, we train a series of support vector machine (SVM) classifiers to infer regime type from textual data sources. To train the classifiers, we used vectorized textual reports from Freedom House and the State Department as features for a training set of prelabeled regime type data. To validate our SVM classifiers, we compare their predictions in an out-of-sample context, and the performance results across a variety of metrics (accuracy, precision, recall) are very high. The results of this project highlight the ability of these techniques to contribute to producing real-time data sources for use in political science that can also be routinely updated at much lower cost than human-coded data. To this end, we set up a text-processing pipeline that pulls updated textual data from selected sources, conducts feature extraction, and applies supervised machine learning methods to produce measures of regime type. This pipeline, written in Python, can be pulled from the Github repository associated with this project and easily extended as more data becomes available.
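As a toy illustration of the classification step, the sketch below trains a linear classifier on invented "country report" snippets. A perceptron stands in for the paper's SVM (both learn a linear decision boundary over bag-of-words features), and the reports and labels are made up:

```python
from collections import Counter

# Invented toy reports; labels: +1 = democracy, -1 = autocracy.
REPORTS = [
    ("free elections independent press", 1),
    ("free press peaceful vote", 1),
    ("opposition banned dissent crushed", -1),
    ("censorship banned dissent jailed", -1),
]

VOCAB = sorted({w for text, _ in REPORTS for w in text.split()})

def featurize(text):
    """Bag-of-words count vector over the fixed vocabulary."""
    counts = Counter(text.lower().split())
    return [counts[w] for w in VOCAB]

def train_perceptron(data, epochs=20):
    """Mistake-driven perceptron updates; stands in for an SVM here."""
    w, b = [0.0] * len(VOCAB), 0.0
    for _ in range(epochs):
        for text, y in data:
            x = featurize(text)
            pred = 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1
            if pred != y:
                w = [wi + y * xi for wi, xi in zip(w, x)]
                b += y
    return w, b

def classify(w, b, text):
    x = featurize(text)
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1

w, b = train_perceptron(REPORTS)
```

In the actual project the "documents" are full Freedom House and State Department reports and the features are TF-IDF-style vectors, but the train-then-score shape is the same.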

  14. Integrating protein-protein interactions and text mining for protein function prediction

    Directory of Open Access Journals (Sweden)

    Leser Ulf

    2008-07-01

Full Text Available Abstract Background Functional annotation of proteins remains a challenging task. Currently the scientific literature serves as the main source for yet uncurated functional annotations, but curation work is slow and expensive. Automatic techniques that support this work are still lacking reliability. We developed a method to identify conserved protein interaction graphs and to predict missing protein functions from orthologs in these graphs. To enhance the precision of the results, we furthermore implemented a procedure that validates all predictions based on findings reported in the literature. Results Using this procedure, more than 80% of the GO annotations for proteins with highly conserved orthologs that are available in UniProtKb/Swiss-Prot could be verified automatically. For a subset of proteins we predicted new GO annotations that were not available in UniProtKb/Swiss-Prot. All predictions were correct (100% precision) according to the verifications from a trained curator. Conclusion Our method of integrating CCSs and literature mining is thus a highly reliable approach to predict GO annotations for weakly characterized proteins with orthologs.

  15. Discovery and explanation of drug-drug interactions via text mining.

    Science.gov (United States)

    Percha, Bethany; Garten, Yael; Altman, Russ B

    2012-01-01

    Drug-drug interactions (DDIs) can occur when two drugs interact with the same gene product. Most available information about gene-drug relationships is contained within the scientific literature, but is dispersed over a large number of publications, with thousands of new publications added each month. In this setting, automated text mining is an attractive solution for identifying gene-drug relationships and aggregating them to predict novel DDIs. In previous work, we have shown that gene-drug interactions can be extracted from Medline abstracts with high fidelity - we extract not only the genes and drugs, but also the type of relationship expressed in individual sentences (e.g. metabolize, inhibit, activate and many others). We normalize these relationships and map them to a standardized ontology. In this work, we hypothesize that we can combine these normalized gene-drug relationships, drawn from a very broad and diverse literature, to infer DDIs. Using a training set of established DDIs, we have trained a random forest classifier to score potential DDIs based on the features of the normalized assertions extracted from the literature that relate two drugs to a gene product. The classifier recognizes the combinations of relationships, drugs and genes that are most associated with the gold standard DDIs, correctly identifying 79.8% of assertions relating interacting drug pairs and 78.9% of assertions relating noninteracting drug pairs. Most significantly, because our text processing method captures the semantics of individual gene-drug relationships, we can construct mechanistic pharmacological explanations for the newly-proposed DDIs. We show how our classifier can be used to explain known DDIs and to uncover new DDIs that have not yet been reported.
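The feature construction step for the random forest (turning normalized gene-drug assertions about two drugs into a vector describing the pair) might look like the following sketch; the relation vocabulary and the example assertions are invented, not the paper's actual schema:

```python
from collections import Counter

RELATIONS = ["metabolizes", "inhibits", "activates", "transports"]

# Invented (drug, relation, gene) assertions, as text mining might extract them.
ASSERTIONS = [
    ("warfarin", "metabolizes", "CYP2C9"),
    ("warfarin", "inhibits", "CYP2C9"),
    ("fluconazole", "inhibits", "CYP2C9"),
    ("fluconazole", "inhibits", "CYP3A4"),
]

def pair_features(assertions, drug_a, drug_b):
    """Per relation type, count assertions that link either drug to a gene
    shared by both drugs; the counts form the pair's feature vector."""
    rels_a = {(g, r) for d, r, g in assertions if d == drug_a}
    rels_b = {(g, r) for d, r, g in assertions if d == drug_b}
    shared_genes = {g for g, _ in rels_a} & {g for g, _ in rels_b}
    feats = Counter(r for g, r in rels_a | rels_b if g in shared_genes)
    return [feats[r] for r in RELATIONS]
```

Because the features record *which* relationships connect the two drugs to a shared gene product, a pair flagged by the classifier comes with a mechanistic story (e.g. both drugs inhibit CYP2C9), which is what enables the pharmacological explanations the abstract describes.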

  16. The Feasibility of Using Large-Scale Text Mining to Detect Adverse Childhood Experiences in a VA-Treated Population.

    Science.gov (United States)

    Hammond, Kenric W; Ben-Ari, Alon Y; Laundry, Ryan J; Boyko, Edward J; Samore, Matthew H

    2015-12-01

    Free text in electronic health records resists large-scale analysis. Text records facts of interest not found in encoded data, and text mining enables their retrieval and quantification. The U.S. Department of Veterans Affairs (VA) clinical data repository affords an opportunity to apply text-mining methodology to study clinical questions in large populations. To assess the feasibility of text mining, investigation of the relationship between exposure to adverse childhood experiences (ACEs) and recorded diagnoses was conducted among all VA-treated Gulf war veterans, utilizing all progress notes recorded from 2000-2011. Text processing extracted ACE exposures recorded among 44.7 million clinical notes belonging to 243,973 veterans. The relationship of ACE exposure to adult illnesses was analyzed using logistic regression. Bias considerations were assessed. ACE score was strongly associated with suicide attempts and serious mental disorders (ORs = 1.84 to 1.97), and less so with behaviorally mediated and somatic conditions (ORs = 1.02 to 1.36) per unit. Bias adjustments did not remove persistent associations between ACE score and most illnesses. Text mining to detect ACE exposure in a large population was feasible. Analysis of the relationship between ACE score and adult health conditions yielded patterns of association consistent with prior research.
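The odds ratios above were estimated with logistic regression; as a minimal illustration of what an OR expresses, an unadjusted ratio can be computed directly from a 2×2 exposure-by-outcome table (the counts below are invented, not the study's data):

```python
def odds_ratio(exposed_cases, exposed_noncases,
               unexposed_cases, unexposed_noncases):
    """Unadjusted odds ratio from a 2x2 exposure/outcome table:
    odds of the outcome among the exposed divided by odds among the unexposed."""
    return ((exposed_cases / exposed_noncases)
            / (unexposed_cases / unexposed_noncases))

# Hypothetical counts: 20/80 cases vs non-cases among exposed,
# 10/90 among unexposed.
or_example = odds_ratio(20, 80, 10, 90)  # (20/80) / (10/90) = 2.25
```

A regression model, as used in the study, yields the same quantity per unit of ACE score while also adjusting for covariates.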

  17. Text mining in bioinformatics

    Institute of Scientific and Technical Information of China (English)

    邹权; 林琛; 刘晓燕; 郭茂祖

    2011-01-01

Text mining methods in bioinformatics are discussed from two perspectives. First, text mining is used to search for biological knowledge in the literature and to build databases from it; for example, protein-protein interactions and gene-disease relationships can be mined from PubMed. Second, bioinformatics problems to which text mining techniques can be applied, such as the analysis of protein structure and function, are summarized. Areas of bioinformatics that text mining researchers could explore are also discussed, so that more of them can bring their methods to bear on bioinformatics research.

  18. METSP: a maximum-entropy classifier based text mining tool for transporter-substrate identification with semistructured text.

    Science.gov (United States)

    Zhao, Min; Chen, Yanming; Qu, Dacheng; Qu, Hong

    2015-01-01

The substrates of a transporter are not only useful for inferring the function of the transporter, but also important for discovering compound-compound interactions and for reconstructing metabolic pathways. Though plenty of data has accumulated with the development of new technologies such as in vitro transporter assays, the search for substrates of transporters is far from complete. In this article, we introduce METSP, a maximum-entropy classifier devoted to retrieving transporter-substrate pairs (TSPs) from semistructured text. Based on the high-quality annotation in UniProt, METSP achieves high precision and recall in cross-validation experiments. When METSP is applied to 182,829 human transporter annotation sentences in UniProt, it identifies 3942 sentences with transporter and compound information. Finally, 1547 high-confidence human TSPs are identified for further manual curation, of which 58.37% involve novel substrates not annotated in public transporter databases. METSP is the first efficient tool to extract TSPs from semistructured annotation text in UniProt. This tool can help determine the precise substrates and drugs of transporters, thus facilitating drug-target prediction, metabolic network reconstruction, and literature classification.
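A binary maximum-entropy classifier is equivalent to logistic regression; the sketch below fits one by gradient ascent over bag-of-words features on invented example sentences (METSP's real features over semistructured UniProt text are richer than this):

```python
import math
from collections import Counter

# Invented annotation sentences; 1 = expresses a transporter-substrate pair.
SENTENCES = [
    ("SLC6A4 transports serotonin across the membrane", 1),
    ("ABCB1 exports digoxin from intestinal cells", 1),
    ("The protein localizes to the nuclear envelope", 0),
    ("Expression increases under oxidative stress", 0),
]

def featurize(sentence):
    return Counter(sentence.lower().split())

def train_maxent(data, lr=0.5, epochs=200):
    """Binary maximum-entropy model (logistic regression), fit by
    stochastic gradient ascent on the log-likelihood."""
    w, b = {}, 0.0
    for _ in range(epochs):
        for sentence, y in data:
            feats = featurize(sentence)
            z = b + sum(w.get(t, 0.0) * c for t, c in feats.items())
            p = 1.0 / (1.0 + math.exp(-z))  # P(pair | sentence)
            g = y - p                        # log-likelihood gradient
            b += lr * g
            for t, c in feats.items():
                w[t] = w.get(t, 0.0) + lr * g * c
    return w, b

def predict(w, b, sentence):
    feats = featurize(sentence)
    return 1 if b + sum(w.get(t, 0.0) * c for t, c in feats.items()) > 0 else 0

w, b = train_maxent(SENTENCES)
```

Sentences the model flags would then go to curators, matching the triage role METSP plays in the described pipeline.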

  19. Statistical modeling of biomedical corpora: mining the Caenorhabditis Genetic Center Bibliography for genes related to life span

    Directory of Open Access Journals (Sweden)

    Jordan MI

    2006-05-01

Full Text Available Abstract Background The statistical modeling of biomedical corpora could yield integrated, coarse-to-fine views of biological phenomena that complement discoveries made from analysis of molecular sequence and profiling data. Here, the potential of such modeling is demonstrated by examining the 5,225 free-text items in the Caenorhabditis Genetic Center (CGC) Bibliography using techniques from statistical information retrieval. Items in the CGC biomedical text corpus were modeled using the Latent Dirichlet Allocation (LDA) model. LDA is a hierarchical Bayesian model which represents a document as a random mixture over latent topics; each topic is characterized by a distribution over words. Results An LDA model estimated from CGC items had better predictive performance than two standard models (unigram and mixture of unigrams) trained using the same data. To illustrate the practical utility of LDA models of biomedical corpora, a trained CGC LDA model was used for a retrospective study of nematode genes known to be associated with life span modification. Corpus-, document-, and word-level LDA parameters were combined with terms from the Gene Ontology to enhance the explanatory value of the CGC LDA model, and to suggest additional candidates for age-related genes. A novel, pairwise document similarity measure based on the posterior distribution on the topic simplex was formulated and used to search the CGC database for "homologs" of a "query" document discussing the life span-modifying clk-2 gene. Inspection of these document homologs enabled and facilitated the production of hypotheses about the function and role of clk-2. Conclusion Like other graphical models for genetic, genomic and other types of biological data, LDA provides a method for extracting unanticipated insights and generating predictions amenable to subsequent experimental validation.
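The pairwise document similarity above compares posterior distributions on the topic simplex. One standard choice of distance for that purpose (not necessarily the authors' exact measure) is the Hellinger distance between two topic mixtures:

```python
import math

def hellinger(p, q):
    """Hellinger distance between two topic distributions on the simplex:
    0 for identical distributions, at most 1 for disjoint support."""
    return math.sqrt(0.5 * sum((math.sqrt(a) - math.sqrt(b)) ** 2
                               for a, b in zip(p, q)))

# Hypothetical 3-topic posteriors for a "query" document and two candidates.
doc_query = [0.70, 0.20, 0.10]
doc_close = [0.60, 0.30, 0.10]
doc_far   = [0.05, 0.15, 0.80]
```

Ranking the corpus by distance to the query's topic mixture is what surfaces document "homologs" of the kind used in the clk-2 study.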

  20. Text Mining to inform construction of Earth and Environmental Science Ontologies

    Science.gov (United States)

    Schildhauer, M.; Adams, B.; Rebich Hespanha, S.

    2013-12-01

    There is a clear need for better semantic representation of Earth and environmental concepts, to facilitate more effective discovery and re-use of information resources relevant to scientists doing integrative research. In order to develop general-purpose Earth and environmental science ontologies, however, it is necessary to represent concepts and relationships that span usage across multiple disciplines and scientific specialties. Traditional knowledge modeling through ontologies utilizes expert knowledge but inevitably favors the particular perspectives of the ontology engineers, as well as the domain experts who interacted with them. This often leads to ontologies that lack robust coverage of synonymy, while also missing important relationships among concepts that can be extremely useful for working scientists to be aware of. In this presentation we will discuss methods we have developed that utilize statistical topic modeling on a large corpus of Earth and environmental science articles, to expand coverage and disclose relationships among concepts in the Earth sciences. For our work we collected a corpus of over 121,000 abstracts from many of the top Earth and environmental science journals. We performed latent Dirichlet allocation topic modeling on this corpus to discover a set of latent topics, which consist of terms that commonly co-occur in abstracts. We match terms in the topics to concept labels in existing ontologies to reveal gaps, and we examine which terms are commonly associated in natural language discourse, to identify relationships that are important to formally model in ontologies. Our text mining methodology uncovers significant gaps in the content of some popular existing ontologies, and we show how, through a workflow involving human interpretation of topic models, we can bootstrap ontologies to have much better coverage and richer semantics. Because we base our methods directly on what working scientists are communicating about their

  1. Text Mining and Its Key Techniques and Methods

    Institute of Scientific and Technical Information of China (English)

    王丽坤; 王宏; 陆玉昌

    2002-01-01

With the dramatic development of the Internet, information processing and management technology on the WWW has become an important branch of data mining and data warehousing. Text mining in particular is emerging rapidly and plays an important role in related fields, so it is worth surveying, from its definition to its methods and techniques. In this paper, drawing on comparatively mature data mining technology, we present a definition of text mining and a multi-stage text mining process model. The paper also introduces the key areas of text mining and some of the principal text analysis techniques, including automatic word segmentation, feature representation, feature extraction, text categorization, text clustering, text summarization, information extraction, and pattern quality evaluation. These techniques cover the whole process from information preprocessing to knowledge acquisition.
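Feature representation in such a pipeline commonly starts from TF-IDF weighting, which scores a term by its frequency in a document discounted by how many documents contain it. A minimal sketch over toy tokenized documents:

```python
import math
from collections import Counter

def tf_idf(docs):
    """TF-IDF weights for tokenized documents:
    tf = term count / document length, idf = log(N / document frequency)."""
    n = len(docs)
    df = Counter(t for d in docs for t in set(d))
    weights = []
    for d in docs:
        tf = Counter(d)
        weights.append({t: (c / len(d)) * math.log(n / df[t])
                        for t, c in tf.items()})
    return weights

docs = [["text", "mining", "process"],
        ["text", "categorization"],
        ["text", "clustering"]]
weights = tf_idf(docs)
```

A term that appears in every document (here "text") gets weight zero, which is the intended effect: ubiquitous terms carry no discriminative information for categorization or clustering.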

  2. Text mining with emergent self organizing maps and multi-dimensional scaling: a comparative study on domestic violence

    NARCIS (Netherlands)

    J. Poelmans; M.M. van Hulle; S. Viaene; P. Elzinga; G. Dedene

    2011-01-01

In this paper we compare the usability of ESOM and MDS as text exploration instruments in police investigations. We combine them with traditional classification instruments such as SVM and Naïve Bayes, and perform a real-life data mining case study on a dataset of police reports describing domestic violence incidents.

  3. Examining Mobile Learning Trends 2003-2008: A Categorical Meta-Trend Analysis Using Text Mining Techniques

    Science.gov (United States)

    Hung, Jui-Long; Zhang, Ke

    2012-01-01

    This study investigated the longitudinal trends of academic articles in Mobile Learning (ML) using text mining techniques. One hundred and nineteen (119) refereed journal articles and proceedings papers from the SCI/SSCI database were retrieved and analyzed. The taxonomies of ML publications were grouped into twelve clusters (topics) and four…

  4. Impact of Text-Mining and Imitating Strategies on Lexical Richness, Lexical Diversity and General Success in Second Language Writing

    Science.gov (United States)

    Çepni, Sevcan Bayraktar; Demirel, Elif Tokdemir

    2016-01-01

    This study aimed to find out the impact of "text mining and imitating" strategies on lexical richness, lexical diversity and general success of students in their compositions in second language writing. The participants were 98 students studying their first year in Karadeniz Technical University in English Language and Literature…

  5. The first step in the development of text mining technology for cancer risk assessment: identifying and organizing scientific evidence in risk assessment literature

    Directory of Open Access Journals (Sweden)

    Sun Lin

    2009-09-01

Full Text Available Abstract Background One of the most neglected areas of biomedical Text Mining (TM) is the development of systems based on carefully assessed user needs. We have recently investigated the user needs of an important task yet to be tackled by TM -- Cancer Risk Assessment (CRA). Here we take the first step towards the development of TM technology for the task: identifying and organizing the scientific evidence required for CRA in a taxonomy which is capable of supporting extensive data gathering from biomedical literature. Results The taxonomy is based on expert annotation of 1297 abstracts downloaded from relevant PubMed journals. It classifies the 1742 unique keywords found in the corpus into 48 classes which specify core evidence required for CRA. We report promising results with inter-annotator agreement tests and automatic classification of PubMed abstracts to taxonomy classes. A simple user test is also reported in a near real-world CRA scenario which demonstrates along with other evaluation that the resources we have built are well-defined, accurate, and applicable in practice. Conclusion We present our annotation guidelines and a tool which we have designed for expert annotation of PubMed abstracts. A corpus annotated for keywords and document relevance is also presented, along with the taxonomy which organizes the keywords into classes defining core evidence for CRA. As demonstrated by the evaluation, the materials we have constructed provide a good basis for classification of CRA literature along multiple dimensions. They can support current manual CRA as well as facilitate the development of an approach based on TM. We discuss extending the taxonomy further via manual and machine learning approaches and the subsequent steps required to develop TM technology for the needs of CRA.

  6. Mining Tasks from the Web Anchor Text Graph: MSR Notebook Paper for the TREC 2015 Tasks Track

    Science.gov (United States)

    2015-11-20

investigated the effectiveness of mining session co-occurrence data. For a search engine log, session boundaries can be defined in the typical way but to ... matching seed candidates (link text from the web graph or queries over search logs) and expand to related candidate key phrases via this session as ... Given a query, we find matching seed candidates (link text from the web graph or queries over search logs) using a soft matching. These seed

  7. An Integrated Suite of Text and Data Mining Tools - Phase II

    Science.gov (United States)

    2007-11-02


  8. Text mining effectively scores and ranks the literature for improving chemical-gene-disease curation at the comparative toxicogenomics database.

    Directory of Open Access Journals (Sweden)

    Allan Peter Davis

Full Text Available The Comparative Toxicogenomics Database (CTD; http://ctdbase.org/) is a public resource that curates interactions between environmental chemicals and gene products, and their relationships to diseases, as a means of understanding the effects of environmental chemicals on human health. CTD provides a triad of core information in the form of chemical-gene, chemical-disease, and gene-disease interactions that are manually curated from scientific articles. To increase the efficiency, productivity, and data coverage of manual curation, we have leveraged text mining to help rank and prioritize the triaged literature. Here, we describe our text-mining process that computes and assigns each article a document relevancy score (DRS), wherein a high DRS suggests that an article is more likely to be relevant for curation at CTD. We evaluated our process by first text mining a corpus of 14,904 articles triaged for seven heavy metals (cadmium, cobalt, copper, lead, manganese, mercury, and nickel). Based upon initial analysis, a representative subset corpus of 3,583 articles was then selected from the 14,904 articles and sent to five CTD biocurators for review. The resulting curation of these 3,583 articles was analyzed for a variety of parameters, including article relevancy, novel data content, interaction yield rate, mean average precision, and biological and toxicological interpretability. We show that for all measured parameters, the DRS is an effective indicator for scoring and improving the ranking of literature for the curation of chemical-gene-disease information at CTD. Here, we demonstrate how fully incorporating text mining-based DRS scoring into our curation pipeline enhances manual curation by prioritizing more relevant articles, thereby increasing data content, productivity, and efficiency.

  9. Towards Evidence-based Precision Medicine: Extracting Population Information from Biomedical Text using Binary Classifiers and Syntactic Patterns

    Science.gov (United States)

    Raja, Kalpana; Dasot, Naman; Goyal, Pawan; Jonnalagadda, Siddhartha R

    2016-01-01

Precision Medicine is an emerging approach for prevention and treatment of disease that considers individual variability in genes, environment, and lifestyle for each person. The dissemination of individualized evidence by automatically identifying population information in literature is a key for evidence-based precision medicine at the point-of-care. We propose a hybrid approach using natural language processing techniques to automatically extract the population information from biomedical literature. Our approach first implements a binary classifier to classify sentences with or without population information. A rule-based system based on syntactic-tree regular expressions is then applied to sentences containing population information to extract the population named entities. The proposed two-stage approach achieved an F-score of 0.81 using a MaxEnt classifier with the rule-based system and an F-score of 0.87 using a Naïve Bayes classifier with the rule-based system, and performed relatively well compared to many existing systems. The system and evaluation dataset are being released as open source. PMID:27570671

  10. Text mining effectively scores and ranks the literature for improving chemical-gene-disease curation at the comparative toxicogenomics database.

    Science.gov (United States)

    Davis, Allan Peter; Wiegers, Thomas C; Johnson, Robin J; Lay, Jean M; Lennon-Hopkins, Kelley; Saraceni-Richards, Cynthia; Sciaky, Daniela; Murphy, Cynthia Grondin; Mattingly, Carolyn J

    2013-01-01

The Comparative Toxicogenomics Database (CTD; http://ctdbase.org/) is a public resource that curates interactions between environmental chemicals and gene products, and their relationships to diseases, as a means of understanding the effects of environmental chemicals on human health. CTD provides a triad of core information in the form of chemical-gene, chemical-disease, and gene-disease interactions that are manually curated from scientific articles. To increase the efficiency, productivity, and data coverage of manual curation, we have leveraged text mining to help rank and prioritize the triaged literature. Here, we describe our text-mining process that computes and assigns each article a document relevancy score (DRS), wherein a high DRS suggests that an article is more likely to be relevant for curation at CTD. We evaluated our process by first text mining a corpus of 14,904 articles triaged for seven heavy metals (cadmium, cobalt, copper, lead, manganese, mercury, and nickel). Based upon initial analysis, a representative subset corpus of 3,583 articles was then selected from the 14,904 articles and sent to five CTD biocurators for review. The resulting curation of these 3,583 articles was analyzed for a variety of parameters, including article relevancy, novel data content, interaction yield rate, mean average precision, and biological and toxicological interpretability. We show that for all measured parameters, the DRS is an effective indicator for scoring and improving the ranking of literature for the curation of chemical-gene-disease information at CTD. Here, we demonstrate how fully incorporating text mining-based DRS scoring into our curation pipeline enhances manual curation by prioritizing more relevant articles, thereby increasing data content, productivity, and efficiency.
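The score-then-rank idea can be illustrated with a toy relevancy score; the term weights below are invented, and CTD's actual DRS is computed quite differently:

```python
def relevancy_score(text, weighted_terms):
    """Toy document relevancy score: sum of weights of terms found in the text."""
    lowered = text.lower()
    return sum(w for term, w in weighted_terms.items() if term in lowered)

def rank_articles(articles, weighted_terms):
    """Order articles so the highest-scoring (most curation-worthy) come first."""
    return sorted(articles, key=lambda a: relevancy_score(a, weighted_terms),
                  reverse=True)

# Hypothetical curation-relevant terms and weights.
WEIGHTS = {"cadmium": 2.0, "gene expression": 1.5, "exposure": 1.0}
ARTICLES = [
    "A review of mining industry economics",
    "Cadmium exposure alters gene expression in rat liver",
]
ranked = rank_articles(ARTICLES, WEIGHTS)
```

Handing curators the list in `ranked` order, rather than arbitrary order, is the mechanism by which a relevancy score raises interaction yield per article reviewed.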

  11. DESTAF: A database of text-mined associations for reproductive toxins potentially affecting human fertility

    KAUST Repository

    Dawe, Adam Sean

    2012-01-01

The Dragon Exploration System for Toxicants and Fertility (DESTAF) is a publicly available resource which enables researchers to efficiently explore both known and potentially novel information and associations in the field of reproductive toxicology. To create DESTAF we used data from the literature (including over 10,500 PubMed abstracts), several publicly available biomedical repositories, and specialized, curated dictionaries. DESTAF has an interface designed to facilitate rapid assessment of the key associations between relevant concepts, allowing for a more in-depth exploration of information based on different gene/protein-, enzyme/metabolite-, toxin/chemical-, disease- or anatomically centric perspectives. As a special feature, DESTAF allows for the creation and initial testing of potentially new association hypotheses that suggest links between biological entities identified through the database. DESTAF, along with a PDF manual, can be found at http://cbrc.kaust.edu.sa/destaf. It is free to academic and non-commercial users and will be updated quarterly. © 2011 Elsevier Inc.

  12. Two Similarity Metrics for Medical Subject Headings (MeSH): An Aid to Biomedical Text Mining and Author Name Disambiguation.

    Science.gov (United States)

    Smalheiser, Neil R; Bonifield, Gary

    2016-01-01

    In the present paper, we have created and characterized several similarity metrics for relating any two Medical Subject Headings (MeSH terms) to each other. The article-based metric measures the tendency of two MeSH terms to appear in the MEDLINE record of the same article. The author-based metric measures the tendency of two MeSH terms to appear in the body of articles written by the same individual (using the 2009 Author-ity author name disambiguation dataset as a gold standard). The two metrics are only modestly correlated with each other (r = 0.50), indicating that they capture different aspects of term usage. The article-based metric provides a measure of semantic relatedness, and MeSH term pairs that co-occur more often than expected by chance may reflect relations between the two terms. In contrast, the author metric is indicative of how individuals practice science, and may have value for author name disambiguation and studies of scientific discovery. We have calculated article metrics for all MeSH terms appearing in at least 25 articles in MEDLINE (as of 2014) and author metrics for MeSH terms published as of 2009. The dataset is freely available for download and can be queried at http://arrowsmith.psych.uic.edu/arrowsmith_uic/mesh_pair_metrics.html. Handling editor: Elizabeth Workman, MLIS, PhD.
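The article-based metric measures how often two MeSH terms co-occur relative to chance. A minimal sketch in that spirit, comparing observed pair counts with the count expected if terms were assigned independently (this is an assumed form for illustration, not the authors' exact formula):

```python
from collections import Counter
from itertools import combinations

def cooccurrence_scores(articles):
    """articles: iterable of sets of MeSH terms on one article.
    Returns {(term_a, term_b): observed / chance-expected co-occurrence}."""
    arts = [set(a) for a in articles]
    n = len(arts)
    term = Counter(t for a in arts for t in a)
    pair = Counter(p for a in arts for p in combinations(sorted(a), 2))
    return {p: obs / (term[p[0]] * term[p[1]] / n)
            for p, obs in pair.items()}

# Toy corpus of per-article MeSH term sets.
MESH_ARTICLES = [
    {"Neoplasms", "Text Mining"},
    {"Neoplasms", "Text Mining"},
    {"Neoplasms", "Genes"},
    {"Genes"},
]
scores = cooccurrence_scores(MESH_ARTICLES)
```

Ratios above 1 indicate pairs that co-occur more often than expected by chance, which is the class of pairs the article-based metric flags as potentially semantically related.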

  13. Licence to Mine? An Overview of the Framework Conditions for Text and Data Mining and the Current State of the Discussion

    Directory of Open Access Journals (Sweden)

    Christian Winterhalter

    2016-11-01

Full Text Available The article gives a survey of the scope for applying text and data mining (TDM) and similar techniques under existing licence agreements for subscription-based electronic resources. It also summarizes the debate about supplemental licences for TDM that has arisen over the introduction of Elsevier’s TDM Policy. Finally, it describes the current discussion about the possible implementation of copyright exceptions for TDM for non-commercial scientific research.

  14. Exploring the potential of Social Media Data using Text Mining to augment Business Intelligence

    OpenAIRE

    Dr. Ananthi Sheshasaayee; Jayanthi, R

    2014-01-01

In recent years, social media has become popular worldwide for content sharing, social networking, and more, yet the content generated on these websites remains largely unused. Social media data includes text, images, audio, and video, and consists largely of unstructured text; the foremost task is therefore to extract the information in that unstructured text. This paper presents the influence of social media data on research and how the content can be used to predict...

  15. Motif-Based Text Mining of Microbial Metagenome Redundancy Profiling Data for Disease Classification

    Directory of Open Access Journals (Sweden)

    Yin Wang

    2016-01-01

Full Text Available Background. Text data of 16S rRNA are informative for classifications of microbiota-associated diseases. However, the raw text data need to be systematically processed so that features for classification can be defined/extracted; moreover, the high-dimension feature spaces generated by the text data also pose an additional difficulty. Results. Here we present a Phylogenetic Tree-Based Motif Finding algorithm (PMF) to analyze 16S rRNA text data. By integrating phylogenetic rules and other statistical indexes for classification, we can effectively reduce the dimension of the large feature spaces generated by the text datasets. Using the retrieved motifs in combination with common classification methods, we can discriminate different samples of both pneumonia and dental caries better than other existing methods. Conclusions. We extend the phylogenetic approaches to perform supervised learning on microbiota text data to discriminate the pathological states for pneumonia and dental caries. The results have shown that PMF may enhance the efficiency and reliability in analyzing high-dimension text data.

  16. Text Mining for Information Systems Researchers: An Annotated Topic Modeling Tutorial

    DEFF Research Database (Denmark)

    Debortoli, Stefan; Müller, Oliver; Junglas, Iris

    2016-01-01

    topic modeling via Latent Dirichlet Allocation, an unsupervised text mining technique, in combination with a LASSO multinomial logistic regression to explain user satisfaction with an IT artifact by automatically analyzing more than 12,000 online customer reviews. For fellow information systems...
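The unsupervised step described above can be illustrated at a small scale. The sketch below is a minimal collapsed Gibbs sampler for LDA on toy token lists; it is not the tutorial's actual pipeline (which applies purpose-built libraries to 12,000 reviews), but it shows how per-document topic proportions arise, which could then feed a downstream regression:

```python
import random
from collections import defaultdict

def lda_gibbs(docs, n_topics=2, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Collapsed Gibbs sampler for LDA on tokenized documents.
    Returns per-document topic proportions (theta)."""
    rng = random.Random(seed)
    vocab_size = len({w for d in docs for w in d})
    ndk = [[0] * n_topics for _ in docs]               # doc-topic counts
    nkw = [defaultdict(int) for _ in range(n_topics)]  # topic-word counts
    nk = [0] * n_topics                                # tokens per topic
    z = []                                             # topic of each token
    for d, doc in enumerate(docs):                     # random initialization
        zd = []
        for w in doc:
            t = rng.randrange(n_topics)
            zd.append(t)
            ndk[d][t] += 1; nkw[t][w] += 1; nk[t] += 1
        z.append(zd)
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t = z[d][i]                            # remove current assignment
                ndk[d][t] -= 1; nkw[t][w] -= 1; nk[t] -= 1
                # full conditional over topics for this token
                weights = [(ndk[d][k] + alpha) * (nkw[k][w] + beta)
                           / (nk[k] + vocab_size * beta)
                           for k in range(n_topics)]
                r = rng.random() * sum(weights)
                t = n_topics - 1
                acc = 0.0
                for k in range(n_topics):
                    acc += weights[k]
                    if r <= acc:
                        t = k
                        break
                z[d][i] = t                            # resample assignment
                ndk[d][t] += 1; nkw[t][w] += 1; nk[t] += 1
    return [[(ndk[d][k] + alpha) / (len(docs[d]) + n_topics * alpha)
             for k in range(n_topics)] for d in range(len(docs))]

# Invented toy "reviews"; each inner list is one tokenized document
docs = [["gene", "protein", "gene", "cell"],
        ["film", "movie", "actor", "film"],
        ["protein", "cell"], ["movie", "actor"]]
print([[round(x, 2) for x in row] for row in lda_gibbs(docs)])
```

Each returned row sums to 1 and gives the document's mixture over topics, exactly the kind of feature a LASSO regression could take as input.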

  17. Evaluation of carcinogenic modes of action for pesticides in fruit on the Swedish market using a text-mining tool.

    Science.gov (United States)

    Silins, Ilona; Korhonen, Anna; Stenius, Ulla

    2014-01-01

    Toxicity caused by chemical mixtures has emerged as a significant challenge for toxicologists and risk assessors. Information on individual chemicals' modes of action is an important part of the hazard identification step. In this study, an automatic text mining-based tool was employed as a method to identify the carcinogenic modes of action of pesticides frequently found in fruit on the Swedish market. The current available scientific literature on the 26 most common pesticides found in apples and oranges was evaluated. The literature was classified according to a taxonomy that specifies the main type of scientific evidence used for determining carcinogenic properties of chemicals. The publication profiles of many pesticides were similar, containing evidence for both genotoxic and non-genotoxic modes of action, including effects such as oxidative stress, chromosomal changes and cell proliferation. We also found that 18 of the 26 pesticides studied here had previously caused tumors in at least one animal species, findings which support the mode of action data. This study shows how a text-mining tool could be used to identify carcinogenic modes of action for a group of chemicals in large quantities of text. This strategy could support the risk assessment process of chemical mixtures.

  18. Combining QSAR modeling and text-mining techniques to link chemical structures and carcinogenic modes of action

    Directory of Open Access Journals (Sweden)

    Georgios Papamokos

    2016-08-01

    Full Text Available There is an increasing need for new reliable non-animal based methods to predict and test toxicity of chemicals. QSAR, a computer-based method linking chemical structures with biological activities, is used in predictive toxicology. In this study we tested the approach to combine QSAR data with literature profiles of carcinogenic modes of action automatically generated by a text-mining tool. The aim was to generate data patterns to identify associations between chemical structures and biological mechanisms related to carcinogenesis. Using these two methods, individually and combined, we evaluated 96 rat carcinogens of the hematopoietic system, liver, lung and skin. We found that skin and lung rat carcinogens were mainly mutagenic, while the group of carcinogens affecting the hematopoietic system and the liver also included a large proportion of non-mutagens. The automatic literature analysis showed that mutagenicity was a frequently reported endpoint in the literature of these carcinogens, however less common endpoints such as immunosuppression and hormonal receptor-mediated effects were also found in connection with some of the carcinogens, results of potential importance for certain target organs. The combined approach, using QSAR and text-mining techniques, could be useful for identifying more detailed information on biological mechanisms and the relation with chemical structures. The method can be particularly useful in increasing the understanding of structure and activity relationships for non-mutagens.

  19. Automatic Entity Recognition and Typing from Massive Text Corpora: A Phrase and Network Mining Approach.

    Science.gov (United States)

    Ren, Xiang; El-Kishky, Ahmed; Wang, Chi; Han, Jiawei

    2015-08-01

    In today's computerized and information-based society, we are soaked with vast amounts of text data, ranging from news articles, scientific publications, product reviews, to a wide range of textual information from social media. To unlock the value of these unstructured text data from various domains, it is of great importance to gain an understanding of entities and their relationships. In this tutorial, we introduce data-driven methods to recognize typed entities of interest in massive, domain-specific text corpora. These methods can automatically identify token spans as entity mentions in documents and label their types (e.g., people, product, food) in a scalable way. We demonstrate on real datasets including news articles and tweets how these typed entities aid in knowledge discovery and management.

  20. Ask and Ye Shall Receive? Automated Text Mining of Michigan Capital Facility Finance Bond Election Proposals to Identify Which Topics Are Associated with Bond Passage and Voter Turnout

    Science.gov (United States)

    Bowers, Alex J.; Chen, Jingjing

    2015-01-01

    The purpose of this study is to bring together recent innovations in the research literature around school district capital facility finance, municipal bond elections, statistical models of conditional time-varying outcomes, and data mining algorithms for automated text mining of election ballot proposals to examine the factors that influence the…

  1. Automatic vs. manual curation of a multi-source chemical dictionary: The impact on text mining

    NARCIS (Netherlands)

    K.M. Hettne (Kristina); A.J. Williams (Antony); E.M. van Mulligen (Erik); J. Kleinjans (Jos); V. Tkachenko (Valery); J.A. Kors (Jan)

    2010-01-01

    Background. Previously, we developed a combined dictionary dubbed Chemlist for the identification of small molecules and drugs in text based on a number of publicly available databases and tested it on an annotated corpus. To achieve an acceptable recall and precision we used a number of

  2. Automatic vs. manual curation of a multi-source chemical dictionary: the impact on text mining

    NARCIS (Netherlands)

    Hettne, K.M.; Williams, A.J.; van Mulligen, E.M.; Kleinjans, J.C.S.; Tkachenko, V.; Kors, J.A.

    2010-01-01

    ABSTRACT: BACKGROUND: Previously, we developed a combined dictionary dubbed Chemlist for the identification of small molecules and drugs in text based on a number of publicly available databases and tested it on an annotated corpus. To achieve an acceptable recall and precision we used a number of a

  3. Combining Natural Language Processing and Statistical Text Mining: A Study of Specialized versus Common Languages

    Science.gov (United States)

    Jarman, Jay

    2011-01-01

    This dissertation focuses on developing and evaluating hybrid approaches for analyzing free-form text in the medical domain. This research draws on natural language processing (NLP) techniques that are used to parse and extract concepts based on a controlled vocabulary. Once important concepts are extracted, additional machine learning algorithms,…

  4. PreBIND and Textomy – mining the biomedical literature for protein-protein interactions using a support vector machine

    Directory of Open Access Journals (Sweden)

    Baskin Berivan

    2003-03-01

    Full Text Available Abstract Background The majority of experimentally verified molecular interaction and biological pathway data are present in the unstructured text of biomedical journal articles, where they are inaccessible to computational methods. The Biomolecular Interaction Network Database (BIND) seeks to capture these data in a machine-readable format. We hypothesized that the formidable task size of backfilling the database could be reduced by using support vector machine technology to first locate interaction information in the literature. We present an information extraction system that was designed to locate protein-protein interaction data in the literature and present these data to curators and the public for review and entry into BIND. Results Cross-validation estimated that the support vector machine's test-set precision, accuracy and recall for classifying abstracts describing interaction information were 92%, 90% and 92%, respectively. We estimated that the system would be able to recall up to 60% of all non-high throughput interactions present in another yeast-protein interaction database. Finally, this system was applied to a real-world curation problem and its use was found to reduce the task duration by 70%, thus saving 176 days. Conclusions Machine learning methods are useful as tools to direct interaction and pathway database back-filling; however, this potential can only be realized if these techniques are coupled with human review and entry into a factual database such as BIND. The PreBIND system described here is available to the public at http://bind.ca. Current capabilities allow searching for human, mouse and yeast protein-interaction information.
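The reported precision, accuracy and recall follow directly from confusion-matrix counts. A minimal sketch of the arithmetic, using made-up counts rather than PreBIND's actual ones:

```python
def classification_metrics(tp, fp, fn, tn):
    """Precision, recall and accuracy from confusion-matrix counts,
    the three figures reported for the abstract classifier."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, accuracy

# Hypothetical counts chosen only to illustrate the arithmetic
p, r, a = classification_metrics(tp=92, fp=8, fn=8, tn=92)
print(round(p, 2), round(r, 2), round(a, 2))  # 0.92 0.92 0.92
```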

  5. Combining QSAR Modeling and Text-Mining Techniques to Link Chemical Structures and Carcinogenic Modes of Action.

    Science.gov (United States)

    Papamokos, George; Silins, Ilona

    2016-01-01

    There is an increasing need for new reliable non-animal based methods to predict and test toxicity of chemicals. Quantitative structure-activity relationship (QSAR), a computer-based method linking chemical structures with biological activities, is used in predictive toxicology. In this study, we tested the approach to combine QSAR data with literature profiles of carcinogenic modes of action automatically generated by a text-mining tool. The aim was to generate data patterns to identify associations between chemical structures and biological mechanisms related to carcinogenesis. Using these two methods, individually and combined, we evaluated 96 rat carcinogens of the hematopoietic system, liver, lung, and skin. We found that skin and lung rat carcinogens were mainly mutagenic, while the group of carcinogens affecting the hematopoietic system and the liver also included a large proportion of non-mutagens. The automatic literature analysis showed that mutagenicity was a frequently reported endpoint in the literature of these carcinogens, however, less common endpoints such as immunosuppression and hormonal receptor-mediated effects were also found in connection with some of the carcinogens, results of potential importance for certain target organs. The combined approach, using QSAR and text-mining techniques, could be useful for identifying more detailed information on biological mechanisms and the relation with chemical structures. The method can be particularly useful in increasing the understanding of structure and activity relationships for non-mutagens.

  6. Newspaper archives + text mining = rich sources of historical geo-spatial data

    Science.gov (United States)

    Yzaguirre, A.; Smit, M.; Warren, R.

    2016-04-01

    Newspaper archives are rich sources of cultural, social, and historical information. These archives, even when digitized, are typically unstructured and organized by date rather than by subject or location, and require substantial manual effort to analyze. The effort of journalists to be accurate and precise means that there is often rich geo-spatial data embedded in the text, alongside text describing events that editors considered to be of sufficient importance to the region or the world to merit column inches. A regional newspaper can add over 100,000 articles to its database each year, and extracting information from this data for even a single country would pose a substantial Big Data challenge. In this paper, we describe a pilot study on the construction of a database of historical flood events (location(s), date, cause, magnitude) to be used in flood assessment projects, for example to calibrate models, estimate frequency, establish high water marks, or plan for future events in contexts ranging from urban planning to climate change adaptation. We then present a vision for extracting and using the rich geospatial data available in unstructured text archives, and suggest future avenues of research.

  7. ChemicalTagger: A tool for semantic text-mining in chemistry

    Directory of Open Access Journals (Sweden)

    Hawizy Lezan

    2011-05-01

    Full Text Available Abstract Background The primary method for scientific communication is in the form of published scientific articles and theses, which use natural language combined with domain-specific terminology. As such, they contain free-flowing unstructured text. Given the usefulness of data extraction from unstructured literature, we aim to show how this can be achieved for the discipline of chemistry. The highly formulaic style of writing most chemists adopt makes their contributions well suited to high-throughput Natural Language Processing (NLP) approaches. Results We have developed the ChemicalTagger parser as a medium-depth, phrase-based semantic NLP tool for the language of chemical experiments. Tagging is based on a modular architecture and uses a combination of OSCAR, domain-specific regexes and English taggers to identify parts-of-speech. The ANTLR grammar is used to structure this into tree-based phrases. Using a metric that allows for overlapping annotations, we achieved machine-annotator agreements of 88.9% for phrase recognition and 91.9% for phrase-type identification (Action names). Conclusions It is possible to parse chemical experimental text using rule-based techniques in conjunction with a formal grammar parser. ChemicalTagger has been deployed for over 10,000 patents and has identified solvents from their linguistic context with >99.5% precision.
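The rule-based tagging stage can be caricatured in a few lines. The sketch below is a toy phrase tagger in the spirit of ChemicalTagger's regex layer, with invented patterns and labels; the real tool uses OSCAR, far richer taggers, and an ANTLR grammar on top:

```python
import re

# Invented mini-patterns standing in for the domain-specific regex taggers;
# each assigns a coarse tag that a grammar could later group into phrases.
PATTERNS = [
    ("ACTION",  re.compile(r"\b(add|stir|heat|dissolv\w*|wash)\w*\b", re.I)),
    ("AMOUNT",  re.compile(r"\b\d+(\.\d+)?\s*(ml|mg|g|mmol)\b", re.I)),
    ("SOLVENT", re.compile(r"\b(ethanol|methanol|water|dmso|thf)\b", re.I)),
]

def tag(sentence):
    """Return (matched text, tag) pairs in order of appearance."""
    tags = []
    for label, pat in PATTERNS:
        for m in pat.finditer(sentence):
            tags.append((m.group(0), label))
    return sorted(tags, key=lambda t: sentence.lower().index(t[0].lower()))

print(tag("Dissolve 25 mg of the product in 10 ml ethanol and stir."))
```

Identifying a solvent from context, as the Conclusions describe, amounts to combining such tags: a SOLVENT token adjacent to an AMOUNT inside a dissolution ACTION phrase.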

  8. E-Cigarette Social Media Messages: A Text Mining Analysis of Marketing and Consumer Conversations on Twitter

    Science.gov (United States)

    2016-01-01

    Background As the use of electronic cigarettes (e-cigarettes) rises, social media likely influences public awareness and perception of this emerging tobacco product. Objective This study examined the public conversation on Twitter to determine overarching themes and insights for trending topics from commercial and consumer users. Methods Text mining uncovered key patterns and important topics for e-cigarettes on Twitter. SAS Text Miner 12.1 software (SAS Institute Inc) was used for descriptive text mining to reveal the primary topics from tweets collected from March 24, 2015, to July 3, 2015, using a Python script in conjunction with Twitter’s streaming application programming interface. A total of 18 keywords related to e-cigarettes were used and resulted in a total of 872,544 tweets that were sorted into overarching themes through a text topic node for tweets (126,127) and retweets (114,451) that represented more than 1% of the conversation. Results While some of the final themes were marketing-focused, many topics represented diverse proponent and user conversations that included discussion of policies, personal experiences, and the differentiation of e-cigarettes from traditional tobacco, often by pointing to the lack of evidence for the harm or risks of e-cigarettes or taking the position that e-cigarettes should be promoted as smoking cessation devices. Conclusions These findings reveal that unique, large-scale public conversations are occurring on Twitter alongside e-cigarette advertising and promotion. Proponents and users are turning to social media to share knowledge, experience, and questions about e-cigarette use. Future research should focus on these unique conversations to understand how they influence attitudes towards and use of e-cigarettes. PMID:27956376

  9. Automatic vs. manual curation of a multi-source chemical dictionary: the impact on text mining

    Directory of Open Access Journals (Sweden)

    Hettne Kristina M

    2010-03-01

    Full Text Available Abstract Background Previously, we developed a combined dictionary dubbed Chemlist for the identification of small molecules and drugs in text based on a number of publicly available databases and tested it on an annotated corpus. To achieve an acceptable recall and precision we used a number of automatic and semi-automatic processing steps together with disambiguation rules. However, it remained to be investigated what impact an extensive manual curation of a multi-source chemical dictionary would have on chemical term identification in text. ChemSpider is a chemical database that has undergone extensive manual curation aimed at establishing valid chemical name-to-structure relationships. Results We acquired the component of ChemSpider containing only manually curated names and synonyms. Rule-based term filtering, semi-automatic manual curation, and disambiguation rules were applied. We tested the dictionary from ChemSpider on an annotated corpus and compared the results with those for the Chemlist dictionary. The ChemSpider dictionary of ca. 80 k names was only 1/3 to 1/4 the size of Chemlist at around 300 k. The ChemSpider dictionary had a precision of 0.43 and a recall of 0.19 before the application of filtering and disambiguation and a precision of 0.87 and a recall of 0.19 after filtering and disambiguation. The Chemlist dictionary had a precision of 0.20 and a recall of 0.47 before the application of filtering and disambiguation and a precision of 0.67 and a recall of 0.40 after filtering and disambiguation. Conclusions We conclude the following: (1) The ChemSpider dictionary achieved the best precision, but the Chemlist dictionary had a higher recall and the best F-score; (2) Rule-based filtering and disambiguation is necessary to achieve a high precision for both the automatically generated and the manually curated dictionary.
ChemSpider is available as a web service at http://www.chemspider.com/ and the Chemlist dictionary is freely
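The evaluation described above, matching a dictionary against an annotated corpus and scoring precision, recall and F-score, can be sketched as follows. This is a simplified illustration with exact lowercase matching, not the paper's filtering and disambiguation pipeline; all names below are invented:

```python
def evaluate_dictionary(dictionary, corpus_terms, gold_terms):
    """Score dictionary-based term recognition against a gold annotation.
    dictionary: set of lowercase names; corpus_terms: candidate mentions
    found in text; gold_terms: the annotated true mentions."""
    found = {t for t in corpus_terms if t.lower() in dictionary}
    gold = set(gold_terms)
    tp = len(found & gold)
    precision = tp / len(found) if found else 0.0
    recall = tp / len(gold) if gold else 0.0
    f_score = (2 * precision * recall / (precision + recall)) if tp else 0.0
    return precision, recall, f_score

# Invented toy dictionary, corpus mentions, and gold annotations
p, r, f = evaluate_dictionary(
    {"aspirin", "caffeine", "taxol"},
    ["aspirin", "Caffeine", "water", "gene"],
    ["aspirin", "Caffeine", "ibuprofen"])
print(round(p, 2), round(r, 2), round(f, 2))  # 1.0 0.67 0.8
```

The trade-off the paper reports falls straight out of this arithmetic: a small curated dictionary misses gold terms (low recall), while a large automatically merged one matches spurious strings (low precision) until filtering removes them.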

  10. Analyzing Self-Help Forums with Ontology-Based Text Mining: An Exploration in Kidney Space.

    Science.gov (United States)

    Burckhardt, Philipp; Padman, Rema

    2015-01-01

    The Internet has emerged as a popular source for health-related information. More than eighty percent of American Internet users have searched for health topics online. Millions of patients use self-help online forums to exchange information and support. In parallel, the increasing prevalence of chronic diseases has become a financial burden for the healthcare system demanding new, cost-effective interventions. To provide such interventions, it is necessary to understand patients' preferences of treatment options and to gain insights into their experiences as patients. We introduce a text-processing algorithm based on semantic ontologies to allow for finer-grained analyses of online forums compared to standard methods. We have applied our method in an analysis of two major Chronic Kidney Disease (CKD) forums. Our results suggest that the analysis of forums may provide valuable insights on daily issues patients face, their choice of different treatment options and interactions between patients, their relatives and clinicians.

  11. Parallel Strands A Preliminary Investigation into Mining the Web for Bilingual Text

    CERN Document Server

    Resnik, P

    1998-01-01

    Parallel corpora are a valuable resource for machine translation, but at present their availability and utility is limited by genre- and domain-specificity, licensing restrictions, and the basic difficulty of locating parallel texts in all but the most dominant of the world's languages. A parallel corpus resource not yet explored is the World Wide Web, which hosts an abundance of pages in parallel translation, offering a potential solution to some of these problems and unique opportunities of its own. This paper presents the necessary first step in that exploration: a method for automatically finding parallel translated documents on the Web. The technique is conceptually simple, fully language independent, and scalable, and preliminary evaluation results indicate that the method may be accurate enough to apply without human intervention.

  12. Automated Personal Email Organizer with Information Management and Text Mining Application

    Directory of Open Access Journals (Sweden)

    Dr. Sanjay Tanwani

    2012-04-01

    Full Text Available Email is one of the most ubiquitous applications, used regularly by millions of people worldwide. Professionals have to manage hundreds of emails on a daily basis, sometimes leading to overload and stress. Many emails go unanswered and sometimes remain unattended as time passes by. Managing every single email takes a lot of effort, especially when the email transaction log is very large. This work is focused on creating better ways of automatically organizing personal email messages. In this paper, a methodology for automated event information extraction from incoming email messages is proposed. The proposed methodology/algorithm, and the software based on it, has helped to improve email management, leading to reduced stress and timely responses to emails.

  13. A Review on Subjectivity Analysis through Text Classification Using Mining Techniques

    Directory of Open Access Journals (Sweden)

    Ashwin Shinde

    2017-03-01

    Full Text Available The increased use of the web for expressing one's opinions has resulted in an enhanced amount of subjective content available on the Web. This content can often be categorized as social content such as movie or product reviews, customer feedback, blogs, communication exchanges in discussion forums, etc. Accurate recognition of subjective or sentimental web content has a number of benefits. Understanding the sentiments of human masses towards different entities and products enables better services for contextual advertisements, recommendation systems and analysis of market trends. The objective of this paper is to analyze various sentiment-based classification techniques which can be utilized for quick estimation of the subjective content of political reviews based on politicians' speeches. The paper elaborately discusses the supervised machine learning algorithm Naïve Bayes classification and compares its overall accuracy, precision and recall values.
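A minimal version of the Naïve Bayes classifier discussed in the review can be written from scratch. The sketch below implements multinomial Naïve Bayes with Laplace smoothing on a few invented "speech review" snippets; a real evaluation of accuracy, precision and recall would of course use a labeled corpus:

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    """Minimal multinomial Naive Bayes with Laplace (add-one) smoothing."""

    def fit(self, texts, labels):
        self.word_counts = defaultdict(Counter)  # per-class word counts
        self.class_counts = Counter(labels)      # class priors come from here
        for text, label in zip(texts, labels):
            self.word_counts[label].update(text.lower().split())
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, text):
        best, best_lp = None, float("-inf")
        n = sum(self.class_counts.values())
        for label in self.class_counts:
            # log prior + sum of smoothed log likelihoods
            lp = math.log(self.class_counts[label] / n)
            total = sum(self.word_counts[label].values())
            for w in text.lower().split():
                lp += math.log((self.word_counts[label][w] + 1)
                               / (total + len(self.vocab)))
            if lp > best_lp:
                best, best_lp = label, lp
        return best

# Invented toy training snippets standing in for political speech reviews
nb = NaiveBayes().fit(
    ["great speech inspiring vision", "honest clear plan",
     "terrible speech boring", "weak dishonest plan"],
    ["pos", "pos", "neg", "neg"])
print(nb.predict("inspiring honest speech"))  # pos
```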

  14. Detection of drug adverse effects by text-mining

    Institute of Scientific and Technical Information of China (English)

    隋明爽; 崔雷

    2015-01-01

    After analyzing the necessity and feasibility of detecting drug adverse effects by text mining, current research on the detection of drug adverse effects by text mining, unsolved problems and future development trends are summarized in terms of the text-mining process, mining/extraction methods, result assessment, and current tool software.

  15. Cloud Based Metalearning System for Predictive Modeling of Biomedical Data

    Directory of Open Access Journals (Sweden)

    Milan Vukićević

    2014-01-01

    Full Text Available Rapid growth and storage of biomedical data have enabled many opportunities for predictive modeling and improvement of healthcare processes. On the other hand, the analysis of such large amounts of data is a difficult and computationally intensive task for most existing data mining algorithms. This problem is addressed by proposing a cloud-based system that integrates a metalearning framework for ranking and selecting the best predictive algorithms for the data at hand, together with open-source big data technologies for the analysis of biomedical data.

  16. The biomedical discourse relation bank

    Directory of Open Access Journals (Sweden)

    Joshi Aravind

    2011-05-01

    Full Text Available Abstract Background Identification of discourse relations, such as causal and contrastive relations, between situations mentioned in text is an important task for biomedical text-mining. A biomedical text corpus annotated with discourse relations would be very useful for developing and evaluating methods for biomedical discourse processing. However, little effort has been made to develop such an annotated resource. Results We have developed the Biomedical Discourse Relation Bank (BioDRB), in which we have annotated explicit and implicit discourse relations in 24 open-access full-text biomedical articles from the GENIA corpus. Guidelines for the annotation were adapted from the Penn Discourse TreeBank (PDTB), which has discourse relations annotated over open-domain news articles. We introduced new conventions and modifications to the sense classification. We report reliable inter-annotator agreement of over 80% for all sub-tasks. Experiments for identifying the sense of explicit discourse connectives show the connective itself as a highly reliable indicator for coarse sense classification (accuracy 90.9% and F1 score 0.89). These results are comparable to results obtained with the same classifier on the PDTB data. With more refined sense classification, there is degradation in performance (accuracy 69.2% and F1 score 0.28), mainly due to sparsity in the data. The size of the corpus was found to be sufficient for identifying the sense of explicit connectives, with classifier performance stabilizing at about 1900 training instances. Finally, the classifier performs poorly when trained on PDTB and tested on BioDRB (accuracy 54.5% and F1 score 0.57). Conclusion Our work shows that discourse relations can be reliably annotated in biomedical text. Coarse sense disambiguation of explicit connectives can be done with high reliability by using just the connective as a feature, but more refined sense classification requires either richer features or more

  17. Applying a text mining framework to the extraction of numerical parameters from scientific literature in the biotechnology domain

    Directory of Open Access Journals (Sweden)

    André SANTOS

    2012-07-01

    Full Text Available Scientific publications are the main vehicle to disseminate information in the field of biotechnology for wastewater treatment. Indeed, the new research paradigms and the application of high-throughput technologies have increased the rate of publication considerably. The problem is that manual curation becomes harder, prone-to-errors and time-consuming, leading to a probable loss of information and inefficient knowledge acquisition. As a result, research outputs are hardly reaching engineers, hampering the calibration of mathematical models used to optimize the stability and performance of biotechnological systems. In this context, we have developed a data curation workflow, based on text mining techniques, to extract numerical parameters from scientific literature, and applied it to the biotechnology domain. A workflow was built to process wastewater-related articles with the main goal of identifying physico-chemical parameters mentioned in the text. This work describes the implementation of the workflow, identifies achievements and current limitations in the overall process, and presents the results obtained for a corpus of 50 full-text documents.

  18. Applying a text mining framework to the extraction of numerical parameters from scientific literature in the biotechnology domain

    Directory of Open Access Journals (Sweden)

    Anália LOURENÇO

    2013-07-01

    Full Text Available Scientific publications are the main vehicle to disseminate information in the field of biotechnology for wastewater treatment. Indeed, the new research paradigms and the application of high-throughput technologies have increased the rate of publication considerably. The problem is that manual curation becomes harder, prone-to-errors and time-consuming, leading to a probable loss of information and inefficient knowledge acquisition. As a result, research outputs are hardly reaching engineers, hampering the calibration of mathematical models used to optimize the stability and performance of biotechnological systems. In this context, we have developed a data curation workflow, based on text mining techniques, to extract numerical parameters from scientific literature, and applied it to the biotechnology domain. A workflow was built to process wastewater-related articles with the main goal of identifying physico-chemical parameters mentioned in the text. This work describes the implementation of the workflow, identifies achievements and current limitations in the overall process, and presents the results obtained for a corpus of 50 full-text documents.

  19. Semi-automated curation of protein subcellular localization: a text mining-based approach to Gene Ontology (GO) Cellular Component curation

    Directory of Open Access Journals (Sweden)

    Chan Juancarlos

    2009-07-01

    Full Text Available Abstract Background Manual curation of experimental data from the biomedical literature is an expensive and time-consuming endeavor. Nevertheless, most biological knowledge bases still rely heavily on manual curation for data extraction and entry. Text mining software that can semi- or fully automate information retrieval from the literature would thus provide a significant boost to manual curation efforts. Results We employ the Textpresso category-based information retrieval and extraction system http://www.textpresso.org, developed by WormBase, to explore how Textpresso might improve the efficiency with which we manually curate C. elegans proteins to the Gene Ontology's Cellular Component Ontology. Using a training set of sentences that describe results of localization experiments in the published literature, we generated three new curation task-specific categories (Cellular Components, Assay Terms, and Verbs) containing words and phrases associated with reports of experimentally determined subcellular localization. We compared the results of manual curation to that of Textpresso queries that searched the full text of articles for sentences containing terms from each of the three new categories plus the name of a previously uncurated C. elegans protein, and found that Textpresso searches identified curatable papers with recall and precision rates of 79.1% and 61.8%, respectively (F-score of 69.5%), when compared to manual curation. Within those documents, Textpresso identified relevant sentences with recall and precision rates of 30.3% and 80.1% (F-score of 44.0%). From returned sentences, curators were able to make 66.2% of all possible experimentally supported GO Cellular Component annotations with 97.3% precision (F-score of 78.8%). Measuring the relative efficiencies of Textpresso-based versus manual curation we find that Textpresso has the potential to increase curation efficiency by at least 8-fold, and perhaps as much as 15-fold, given
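The query pattern described above, a sentence must mention the protein plus a term from each curation category, can be sketched as follows. The mini-lexicons here are invented stand-ins for Textpresso's much larger categories:

```python
import re

# Hypothetical mini-categories in the style of the three curation
# task-specific categories; real category lexicons are far larger.
CATEGORIES = {
    "component": {"nucleus", "membrane", "mitochondria"},
    "assay": {"gfp", "antibody", "immunostaining"},
    "verb": {"localizes", "expressed", "detected"},
}

def curatable_sentences(text, protein):
    """Return sentences mentioning the protein plus at least one term
    from every category, the pattern used to flag curatable evidence."""
    hits = []
    for sent in re.split(r"(?<=[.!?])\s+", text):
        low = sent.lower()
        words = set(re.findall(r"[a-z0-9]+", low))
        if protein.lower() in low and all(words & terms
                                          for terms in CATEGORIES.values()):
            hits.append(sent.strip())
    return hits

text = ("GFP-tagged UNC-13 localizes to the membrane. "
        "UNC-13 is an interesting protein.")
print(curatable_sentences(text, "UNC-13"))
```

Only the first sentence is returned: it names the protein, an assay term (GFP), a verb (localizes) and a component (membrane), while the second sentence matches no category and is skipped, which is how precision stays high.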

  20. MegaMiner: A Tool for Lead Identification Through Text Mining Using Chemoinformatics Tools and Cloud Computing Environment.

    Science.gov (United States)

    Karthikeyan, Muthukumarasamy; Pandit, Yogesh; Pandit, Deepak; Vyas, Renu

    2015-01-01

    Virtual screening is an indispensable tool to cope with the massive amount of data generated by high-throughput omics technologies. With the objective of enhancing the automation capability of the virtual screening process, a robust portal termed MegaMiner has been built on a cloud computing platform, wherein the user submits a text query and directly accesses the proposed lead molecules along with their drug-like, lead-like and docking scores. Textual representation of chemical structural data is fraught with ambiguity in the absence of a global identifier. We have used a combination of statistical models, a chemical dictionary and regular expressions for building a disease-specific dictionary. To demonstrate the effectiveness of this approach, a case study on malaria has been carried out in the present work. MegaMiner offered superior results compared to other text mining search engines, as established by F-score analysis. A single query term 'malaria' in the portlet led to retrieval of related PubMed records, protein classes, drug classes and 8000 scaffolds, which were internally processed and filtered to suggest new molecules as potential anti-malarials. The results obtained were validated by docking the virtual molecules into relevant protein targets. It is hoped that MegaMiner will serve as an indispensable tool not only for identifying hidden relationships between various biological and chemical entities but also for building better corpora and ontologies.

  1. Preprocessing Techniques for Image Mining on Biopsy Images

    Directory of Open Access Journals (Sweden)

    Ms. Nikita Ramrakhiani

    2015-08-01

    Full Text Available Biomedical imaging has been undergoing rapid technological advancements over the last several decades and has seen the development of many new applications. A single image can give all the details about an organ, from the cellular level to the whole-organ level. Biomedical imaging is becoming increasingly important as an approach to synthesize, extract and translate useful information from the large multidimensional databases accumulated in research frontiers such as functional genomics, proteomics, and functional imaging. To fulfill this approach, Image Mining can be used. Image Mining will bridge this gap to extract and translate semantically meaningful information from biomedical images and apply it to testing for and detecting any anomaly in the target organ. The essential component in image mining is identifying similar objects in different images and finding correlations in them. Integration of Image Mining and the biomedical field can result in many real-world applications.

  2. Analysis of US underground thin seam mining potential. Volume 1. Text. Final technical report, December 1978.

    Energy Technology Data Exchange (ETDEWEB)

    Pimental, R. A.; Barell, D.; Fine, R. J.; Douglas, W. J.

    1979-06-01

    An analysis of the potential for US underground thin seam (< 28'') coal mining is undertaken to provide basic information for use in making a decision on further thin seam mining equipment development. The characteristics of the present low seam mines and their mining methods are determined, in order to establish baseline data against which changes in mine characteristics can be monitored as a function of time. A detailed data base of thin seam coal resources is developed through a quantitative and qualitative analysis at the bed, county and state level. By establishing present and future coal demand and relating demand to production and resources, the market for thin seam coal has been identified. No thin seam coal demand of significance is forecast before the year 2000. Current uncertainty as to coal's future does not permit market forecasts beyond the year 2000 with a sufficient level of reliability.

  3. 基于Web的文本挖掘系统的研究与实现%The Research and Development of Text Mining System Based on Web

    Institute of Scientific and Technical Information of China (English)

    唐菁; 沈记全; 杨炳儒

    2003-01-01

    With the development of network technology, information on the Internet spreads more and more quickly, and the information ocean contains many types of complicated data; acquiring useful knowledge from it quickly is very difficult. Web-based text mining is a new research field that can solve this problem effectively. In this paper, we present a structural model of text mining and study its core algorithm, the classification algorithm. We have developed a Web-based text mining system and applied it in modern distance education. The system automatically classifies educational text information collected from education sites on the Internet and helps people browse important information quickly and acquire knowledge.
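    The record does not name its classification algorithm; as an illustrative sketch of the kind of supervised text classifier such a system could use, here is a multinomial naive Bayes classifier with add-one smoothing, using only the standard library (the topic labels and training snippets are invented):

    ```python
    import math
    from collections import Counter

    class NaiveBayesTextClassifier:
        """Multinomial naive Bayes over word counts, with add-one smoothing."""

        def fit(self, docs, labels):
            self.class_counts = Counter(labels)
            self.word_counts = {c: Counter() for c in self.class_counts}
            for doc, label in zip(docs, labels):
                self.word_counts[label].update(doc.lower().split())
            self.vocab = {w for c in self.word_counts for w in self.word_counts[c]}

        def predict(self, doc):
            def log_prob(c):
                total = sum(self.word_counts[c].values())
                lp = math.log(self.class_counts[c] / sum(self.class_counts.values()))
                for w in doc.lower().split():
                    lp += math.log((self.word_counts[c][w] + 1) / (total + len(self.vocab)))
                return lp
            return max(self.class_counts, key=log_prob)

    clf = NaiveBayesTextClassifier()
    clf.fit(["math algebra geometry lesson", "history ancient rome empire",
             "calculus math derivative lesson", "rome history medieval war"],
            ["math", "history", "math", "history"])
    print(clf.predict("algebra derivative lesson"))  # -> math
    ```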

  4. Unblocking Blockbusters: Using Boolean Text-Mining to Optimise Clinical Trial Design and Timeline for Novel Anticancer Drugs

    Directory of Open Access Journals (Sweden)

    Richard J. Epstein

    2009-01-01

    Full Text Available Two problems now threaten the future of anticancer drug development: (i) the information explosion has made research into new target-specific drugs more duplication-prone, and hence less cost-efficient; and (ii) high-throughput genomic technologies have failed to deliver the anticipated early windfall of novel first-in-class drugs. Here it is argued that the resulting crisis of blockbuster drug development may be remedied in part by innovative exploitation of informatic power. Using scenarios relating to oncology, it is shown that rapid data-mining of the scientific literature can refine therapeutic hypotheses and thus reduce empirical reliance on preclinical model development and early-phase clinical trials. Moreover, as personalised medicine evolves, this approach may inform biomarker-guided phase III trial strategies for noncytotoxic (antimetastatic) drugs that prolong patient survival without necessarily inducing tumor shrinkage. Though not replacing conventional gold standards, these findings suggest that this computational research approach could reduce costly ‘blue skies’ R&D investment and time to market for new biological drugs, thereby helping to reverse unsustainable drug price inflation.

  6. Text mining for pharmacovigilance: Using machine learning for drug name recognition and drug-drug interaction extraction and classification.

    Science.gov (United States)

    Ben Abacha, Asma; Chowdhury, Md Faisal Mahbub; Karanasiou, Aikaterini; Mrabet, Yassine; Lavelli, Alberto; Zweigenbaum, Pierre

    2015-12-01

    Pharmacovigilance (PV) is defined by the World Health Organization as the science and activities related to the detection, assessment, understanding and prevention of adverse effects or any other drug-related problem. An essential aspect in PV is to acquire knowledge about Drug-Drug Interactions (DDIs). The shared tasks on DDI-Extraction organized in 2011 and 2013 have pointed out the importance of this issue and provided benchmarks for: Drug Name Recognition, DDI extraction and DDI classification. In this paper, we present our text mining systems for these tasks and evaluate their results on the DDI-Extraction benchmarks. Our systems rely on machine learning techniques using both feature-based and kernel-based methods. The obtained results for drug name recognition are encouraging. For DDI-Extraction, our hybrid system combining a feature-based method and a kernel-based method was ranked second in the DDI-Extraction-2011 challenge, and our two-step system for DDI detection and classification was ranked first in the DDI-Extraction-2013 task at SemEval. We discuss our methods and results and give pointers to future work.
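    Feature-based drug name recognition of the kind these systems use typically combines dictionary lookup with orthographic and suffix cues; a simplified sketch (the dictionary entries are illustrative, and a real system feeds such features to a CRF or SVM sequence labeler rather than using them directly):

    ```python
    def drug_name_features(token, drug_dictionary):
        """Orthographic/dictionary features for one token (a simplified sketch;
        real systems feed such features to a trained sequence labeler)."""
        return {
            "in_dictionary": token.lower() in drug_dictionary,
            "suffix_mab": token.lower().endswith("mab"),      # monoclonal antibody suffix
            "suffix_azole": token.lower().endswith("azole"),  # antifungal suffix
            "capitalized": token[:1].isupper(),
        }

    # Illustrative dictionary entries; real systems use DrugBank-scale resources
    drug_dictionary = {"warfarin", "fluconazole", "rituximab"}
    feats = drug_name_features("Fluconazole", drug_dictionary)
    print(feats)
    ```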

  7. Grouping chemicals for health risk assessment: A text mining-based case study of polychlorinated biphenyls (PCBs).

    Science.gov (United States)

    Ali, Imran; Guo, Yufan; Silins, Ilona; Högberg, Johan; Stenius, Ulla; Korhonen, Anna

    2016-01-22

    As many chemicals act as carcinogens, chemical health risk assessment is critically important. A notoriously time-consuming process, risk assessment could be greatly supported by classifying chemicals with similar toxicological profiles so that they can be assessed in groups rather than individually. We have previously developed a text mining (TM)-based tool that can automatically identify the mode of action (MOA) of a carcinogen based on the scientific evidence in the literature, and it can measure the MOA similarity between chemicals on the basis of their literature profiles (Korhonen et al., 2009, 2012). A new version of the tool (2.0) was recently released, and here we apply this tool for the first time to investigate and identify meaningful groups of chemicals for risk assessment. We used published literature on polychlorinated biphenyls (PCBs): persistent, widely spread toxic organic compounds comprising 209 different congeners. Although chemically similar, these compounds are heterogeneous in terms of MOA. We show that our TM tool, when applied to 1648 PubMed abstracts, produces a MOA profile for a subgroup of dioxin-like PCBs (DL-PCBs) which differs clearly from that for the rest of the PCBs. This suggests that the tool could be used to effectively identify homogeneous groups of chemicals and, when integrated into real-life risk assessment, could significantly improve the efficiency of the process.
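    Measuring MOA similarity between chemicals from their literature profiles can be illustrated, in deliberately simplified form, as cosine similarity between term-frequency vectors (the MOA term list and abstract snippets below are invented; the tool's actual profile representation is richer):

    ```python
    import math
    from collections import Counter

    MOA_TERMS = ["receptor", "oxidative", "mutagenic", "apoptosis", "ahr"]  # illustrative

    def moa_profile(abstracts):
        """Frequency profile of mode-of-action terms in a chemical's literature."""
        counts = Counter()
        for text in abstracts:
            for term in MOA_TERMS:
                counts[term] += text.lower().count(term)
        return [counts[t] for t in MOA_TERMS]

    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
        return dot / norm if norm else 0.0

    dl_pcb = moa_profile(["ahr receptor activation by dioxin-like congeners",
                          "receptor-mediated ahr signaling"])
    other_pcb = moa_profile(["oxidative stress and mutagenic dna damage",
                             "oxidative dna damage in liver"])
    print(cosine(dl_pcb, other_pcb))  # disjoint MOA profiles -> 0.0
    ```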

  8. Dropping down the Maximum Item Set: Improving the Stylometric Authorship Attribution Algorithm in the Text Mining for Authorship Investigation

    Directory of Open Access Journals (Sweden)

    Tareef K. Mustafa

    2010-01-01

    Full Text Available Problem statement: Stylometric authorship attribution is an approach, in text mining, to analyzing texts such as novels and plays written by famous authors, attempting to measure each author's style through attributes that reflect his way of writing, on the assumption that a writer has a way of writing shared by no other writer; thus, authorship attribution is the task of identifying the author of a given text. In this study, we propose an authorship attribution algorithm that improves the accuracy of stylometric features so that different professionals can be discriminated nearly as well as different persons are by fingerprints. Approach: The main target of this study is to build an algorithm supporting decision-making systems that enable users to predict and choose the right author for a specific anonymous novel under consideration, using a learning procedure to teach the system the stylometric map of the author so that it behaves as an expert opinion. Stylometric Authorship Attribution (AA) usually depends on frequent words as its best attributes; many studies have sought other beneficial attributes, but the frequent word remains ahead of the alternatives in research and experimental results, and the best technique used so far is counting the bag-of-words with the maximum item set. Results: To improve AA techniques, we need a new pack of attributes with a new measurement tool. The first attribute pack used in this study is the frequent pair, meaning a pair of words that always appear together; this attribute is clearly not new, but it has not been as successful as the frequent word under maximum-item-set counters, and word pairs made some mistakes, as the experimental results show. We improve the winnow algorithm by combining it with the computational
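    The "frequent pair" attribute described above is, at its simplest, an adjacent word-pair (bigram) count; a minimal sketch using the standard library (the sample sentence is invented):

    ```python
    from collections import Counter

    def frequent_pairs(text, top_n=3):
        """Count adjacent word pairs; frequent pairs serve as stylometric attributes."""
        words = text.lower().split()
        return Counter(zip(words, words[1:])).most_common(top_n)

    sample = "it was the best of times it was the worst of times"
    print(frequent_pairs(sample))
    ```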

  9. 基于高维聚类的探索性文本挖掘算法%Exploratory text mining algorithm based on high-dimensional clustering

    Institute of Scientific and Technical Information of China (English)

    张爱科; 符保龙

    2013-01-01

    Because of the unstructured nature of free text, text mining has become an important branch of data mining, and many text mining algorithms have emerged in recent years. This paper proposes an exploratory text mining algorithm based on high-dimensional clustering, which uses the guiding role of text mining to carry out data mining within data-like text. The algorithm requires only a small number of iterations to produce good clusters from very large text collections; mapping to other recorded data and assigning texts to user groups further improves its results. The feasibility and validity of the proposed method are verified through tests on related data and analysis of the experimental results.

  10. Biomedical Science, Unit II: Nutrition in Health and Medicine. Digestion of Foods; Organic Chemistry of Nutrients; Energy and Cell Respiration; The Optimal Diet; Foodborne Diseases; Food Technology; Dental Science and Nutrition. Student Text. Revised Version, 1975.

    Science.gov (United States)

    Biomedical Interdisciplinary Curriculum Project, Berkeley, CA.

    This student text presents instructional materials for a unit of science within the Biomedical Interdisciplinary Curriculum Project (BICP), a two-year interdisciplinary precollege curriculum aimed at preparing high school students for entry into college and vocational programs leading to a career in the health field. Lessons concentrate on…

  11. Research on Fuzzy Clustering Validity in Web Text Mining%Web文本挖掘中模糊聚类的有效性评价研究

    Institute of Scientific and Technical Information of China (English)

    罗琪

    2012-01-01

    This paper studies Web text mining based on fuzzy clustering and a validity evaluation function for fuzzy clustering, and applies the function to evaluating the validity of fuzzy clustering in Web text mining. Simulation experiments indicate that FKCM can effectively improve the precision of Web text clustering and that the method is feasible and reasonably accurate.
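    One classical validity evaluation function for fuzzy clustering is Bezdek's partition coefficient, computed directly from the membership matrix (illustrative only; the record does not specify which index it uses):

    ```python
    def partition_coefficient(U):
        """Bezdek's partition coefficient, a classical fuzzy-clustering validity
        index: the mean over items of the sum of squared memberships. Values near
        1 indicate a crisp partition; 1/c (here 0.5) a maximally fuzzy one."""
        n = len(U[0])
        return sum(u ** 2 for row in U for u in row) / n

    # U[i][k] = membership of document k in cluster i (each column sums to 1)
    crisp = [[0.9, 0.1, 0.8],
             [0.1, 0.9, 0.2]]
    fuzzy = [[0.5, 0.5, 0.5],
             [0.5, 0.5, 0.5]]
    print(partition_coefficient(crisp), partition_coefficient(fuzzy))
    ```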

  12. Quantifying the impact and extent of undocumented biomedical synonymy.

    Directory of Open Access Journals (Sweden)

    David R Blair

    2014-09-01

    Full Text Available Synonymous relationships among biomedical terms are extensively annotated within specialized terminologies, implying that synonymy is important for practical computational applications within this field. It remains unclear, however, whether text mining actually benefits from documented synonymy and whether existing biomedical thesauri provide adequate coverage of these linguistic relationships. In this study, we examine the impact and extent of undocumented synonymy within a very large compendium of biomedical thesauri. First, we demonstrate that missing synonymy has a significant negative impact on named entity normalization, an important problem within the field of biomedical text mining. To estimate the amount of synonymy currently missing from thesauri, we develop a probabilistic model for the construction of synonym terminologies that is capable of handling a wide range of potential biases, and we evaluate its performance using the broader domain of near-synonymy among general English words. Our model predicts that over 90% of these relationships are currently undocumented, a result that we support experimentally through "crowd-sourcing." Finally, we apply our model to biomedical terminologies and predict that they are missing the vast majority (>90%) of the synonymous relationships they intend to document. Overall, our results expose the dramatic incompleteness of current biomedical thesauri and suggest the need for "next-generation," high-coverage lexical terminologies.
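    Named entity normalization, the task the study evaluates, reduces in its simplest form to looking mentions up against documented synonyms; a toy sketch of why undocumented synonymy hurts (the thesaurus entries are illustrative, not drawn from any particular terminology):

    ```python
    def normalize_entity(mention, thesaurus):
        """Map a textual mention to a canonical identifier via documented synonyms;
        returns None when the synonym is undocumented."""
        return thesaurus.get(mention.lower())

    # Illustrative entries only; real thesauri are vastly larger.
    thesaurus = {
        "p53": "TP53", "tp53": "TP53", "tumor protein p53": "TP53",
        "il-2": "IL2", "interleukin 2": "IL2",
    }
    print(normalize_entity("Tumor Protein p53", thesaurus))  # documented -> TP53
    print(normalize_entity("interleukin-2", thesaurus))      # undocumented variant -> None
    ```

    The second lookup fails only because the hyphenated variant is missing from the thesaurus, which is exactly the kind of gap the study quantifies.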

  13. Research on Web Text Mining Based on Ontology%基于领域本体实现Web文本挖掘研究

    Institute of Scientific and Technical Information of China (English)

    阮光册

    2011-01-01

    To remedy the traditional Web text mining methods' lack of semantic understanding of text, this paper combines ontology with Web text mining and explores a Web text mining method based on domain ontology. It first creates an ontology structure for Web text, then introduces a domain-ontology "concept-concept" similarity matrix and describes how relations between concepts are identified, and finally presents an implementation of Web text mining that uncovers the implicit content of Web text information. In an experiment taking online media reports as an example, relevant conclusions were drawn through text mining.

  14. Standardisation in the area of innovation and technological development, notably in the field of Text and Data Mining: report from the expert group

    NARCIS (Netherlands)

    I. Hargreaves; L. Guibault; C. Handke; P. Valcke; B. Martens

    2014-01-01

    Text and data mining (TDM) is an important technique for analysing and extracting new insights and knowledge from the exponentially increasing store of digital data (‘Big Data’). TDM is useful to researchers of all kinds, from historians to medical experts, and its methods are relevant to organisati

  15. Next-generation text-mining mediated generation of chemical response-specific gene sets for interpretation of gene expression data

    NARCIS (Netherlands)

    Hettne, K.M.; Boorsma, A.; Dartel, D.A. van; Goeman, J.J.; Jong, Esther de; Piersma, A.H.; Stierum, R.H.; Kleinjans, J.C.; Kors, J.A.

    2013-01-01

    BACKGROUND: Availability of chemical response-specific lists of genes (gene sets) for pharmacological and/or toxic effect prediction for compounds is limited. We hypothesize that more gene sets can be created by next-generation text mining (next-gen TM), and that these can be used with gene set anal

  18. 基于领域本体的语义文本挖掘研究%Research on Semantic Text Mining Based on Domain Ontology

    Institute of Scientific and Technical Information of China (English)

    张玉峰; 何超

    2011-01-01

    To improve the depth and accuracy of text mining, this paper proposes a semantic text mining model based on domain ontology. In this model, semantic role labeling is applied to semantic analysis so that concepts and the semantic relations between them can be extracted accurately, improving the accuracy of text representation. Because traditional knowledge mining algorithms cannot effectively mine a semantic metadata base, a semantics-based association pattern mining algorithm is designed to acquire deep semantic association patterns from it. Experimental results show that the model can mine deep semantic knowledge from text databases, that the patterns obtained have strong potential application value, and that the algorithm is highly adaptable and scalable.

  19. The Mining Methods of Multimedia Text Data Pattern%多媒体文本数据的模式挖掘方法

    Institute of Scientific and Technical Information of China (English)

    刘茂福; 曹加恒; 彭敏; 叶可; 林芝

    2001-01-01

    Multimedia text data mining (MTM) is a new research field within data mining. This paper gives the definition and classifications of MTM, proposes a multimedia text data mining model (MTMM) together with its feature representation, and discusses methods for multimedia text classification mining as well as the differences and relationships between MTM and Web mining. The goal of MTM is to discover useful knowledge or patterns and to promote the development and application of MTM.

  20. Research on Methods of Text Mining and Its Application%文本挖掘的方法及应用研究

    Institute of Scientific and Technical Information of China (English)

    张晓艳; 华英

    2011-01-01

    The rise of the Internet has brought a vast amount of text information. Text mining technology is used to extract information useful to users from semi-structured and unstructured texts. This article compares and analyzes several common text mining methods and summarizes the current main application areas of text mining.

  1. What Online Communities Can Tell Us About Electronic Cigarettes and Hookah Use: A Study Using Text Mining and Visualization Techniques

    OpenAIRE

    Chen, AT; Zhu, SH; Conway, M

    2015-01-01

    © 2015 Journal of Medical Internet Research. Background: The rise in popularity of electronic cigarettes (e-cigarettes) and hookah over recent years has been accompanied by some confusion and uncertainty regarding the development of an appropriate regulatory response towards these emerging products. Mining online discussion content can lead to insights into people's experiences, which can in turn further our knowledge of how to address potential health implications. In this work, we take a no...

  2. Application of Text Mining to Extract Hotel Attributes and Construct Perceptual Map of Five Star Hotels from Online Review: Study of Jakarta and Singapore Five-Star Hotels

    Directory of Open Access Journals (Sweden)

    Arga Hananto

    2015-12-01

    Full Text Available The use of post-purchase online consumer reviews in hotel attribute studies is still scarce in the literature. Arguably, post-purchase online review data yield more accurate attributes that consumers actually consider in their purchase decisions. This study aims to extract attributes from two samples of five-star hotel reviews (Jakarta and Singapore) with a text mining methodology. In addition, this study also aims to describe the positioning of five-star hotels in Jakarta and Singapore based on the extracted attributes using Correspondence Analysis. This study finds that reviewers of five-star hotels in both cities mention similar attributes, such as service, staff, club, location, pool and food. Attributes derived from text mining appear to be viable input for building a fairly accurate positioning map of hotels. This study has demonstrated the viability of online reviews as a source of data for hotel attribute and positioning studies.

  3. Working with Data: Discovering Knowledge through Mining and Analysis; Systematic Knowledge Management and Knowledge Discovery; Text Mining; Methodological Approach in Discovering User Search Patterns through Web Log Analysis; Knowledge Discovery in Databases Using Formal Concept Analysis; Knowledge Discovery with a Little Perspective.

    Science.gov (United States)

    Qin, Jian; Jurisica, Igor; Liddy, Elizabeth D.; Jansen, Bernard J; Spink, Amanda; Priss, Uta; Norton, Melanie J.

    2000-01-01

    These six articles discuss knowledge discovery in databases (KDD). Topics include data mining; knowledge management systems; applications of knowledge discovery; text and Web mining; text mining and information retrieval; user search patterns through Web log analysis; concept analysis; data collection; and data structure inconsistency. (LRW)

  4. TML:A General High-Performance Text Mining Language%TML:一种通用高效的文本挖掘语言

    Institute of Scientific and Technical Information of China (English)

    李佳静; 李晓明; 孟涛

    2015-01-01

    实现了一种通用高效的文本挖掘编程语言,包括其编译器、运行虚拟机和图形开发环境。其工作方式是用户通过编写该语言的代码以定制抽取目标和抽取手段,然后将用户代码编译成字节码并进行优化,再将其与输入文本语义结构做匹配。该语言具有如下特点:1)提供了一种描述文本挖掘的范围、目标和手段的形式化方法,从而能通过编写该语言的代码来在不同应用领域做声明式文本挖掘;2)运行虚拟机以信息抽取技术为核心,高效地实现了多种常用文本挖掘技术,并将其组成一个文本分析流水线;3)通过一系列编译优化技术使得大量匹配指令能够充分并发执行,从而解决了该语言在处理海量规则和海量数据上的执行效率问题。实用案例说明了TML语言的描述能力以及它的实际应用情况。%This paper proposes a general‐purpose programming language named TML for text mining . TML is the abbreviation of “text mining language” ,and it aims at turning complicated text mining tasks into easy jobs . The implementation of TML includes a compiler ,a runtime virtual machine (interpreter ) , and an IDE . TML has supplied most usual text mining techniques , which are implemented as grammars and reserved words .Users can use TML to program ,and the code will be compiled into bytecodes ,which will be next interpreted in the virual runtime machine .TML has the following characteristics :1) It supplies a formal way to model the searching area ,object definition and mining methods of text mining jobs ,so users can program with it to make a declarative text mining easily ;2) The TML runtime machine implements usual text mining techniques ,and organizes them into an efficient text analysis pipeline ;3) The TML compiler fully explores the possibility of concurrently executing its byte codes , and the execution has good performance on very large collections of

  5. 基于文本挖掘的网络新闻报道差异分析%Analysis on Web Media Report Differences Based on Text Mining

    Institute of Scientific and Technical Information of China (English)

    阮光册

    2012-01-01

    Using text mining technology to discover potential, valuable information in Web news reports is a new attempt in intelligence research. This paper discusses text mining methods for Web news and, taking the Web editions' coverage of the Shanghai Expo as a case, conducts an empirical study and a comparative analysis of reporting differences. The paper selects Expo-related reports from Hong Kong, Taiwan, overseas Chinese-language media, and Shanghai local media, describes the differences in report content based on text mining and feature extraction, and draws conclusions.

  6. The Application of the Web Text Mining in the Druggist Interest Extraction%Web文本挖掘在药商兴趣提取中的应用

    Institute of Scientific and Technical Information of China (English)

    孙士新

    2014-01-01

    Information acquisition has become an important component of druggists' business operations and a basis for their market judgments, and the appearance of large amounts of unstructured and semi-structured information on the Web has provided the technical space and empirical basis for personalized services for druggists. This paper designs the key text mining techniques for personalized service, applies a text mining workflow to a Traditional Chinese Medicinal Materials information website, and uses text mining to automatically acquire user interests from that website.

  7. Key Issues in Morphology Analysis Based on Text Mining%基于文本挖掘的形态分析方法的关键问题

    Institute of Scientific and Technical Information of China (English)

    冷伏海; 王林; 王立学

    2012-01-01

    Morphological analysis based on text mining integrates text mining methods into the traditional approach, a useful exploration and improvement of morphological analysis by researchers at home and abroad. The improved method reduces reliance on domain experts and adds objective data support to the analysis process, raising the method's efficiency and rigor. Morphological analysis based on text mining involves four key issues: definition of the morphological structure, feature word selection, morphology representation, and morphology analysis. Optimizing the solutions to these four issues plays a key role in improving the analysis efficiency and quality of the whole method.

  8. Benchmarking of the 2010 BioCreative Challenge III text-mining competition by the BioGRID and MINT interaction databases

    Directory of Open Access Journals (Sweden)

    Cesareni Gianni

    2011-10-01

    Full Text Available Abstract Background The vast amount of data published in the primary biomedical literature represents a challenge for the automated extraction and codification of individual data elements. Biological databases that rely solely on manual extraction by expert curators are unable to comprehensively annotate the information dispersed across the entire biomedical literature. The development of efficient tools based on natural language processing (NLP) systems is essential for the selection of relevant publications, identification of data attributes and partially automated annotation. One of the tasks of the BioCreative 2010 Challenge III was devoted to the evaluation of NLP systems developed to identify articles for curation and extraction of protein-protein interaction (PPI) data. Results The BioCreative 2010 competition addressed three tasks: gene normalization, article classification and interaction method identification. The BioGRID and MINT protein interaction databases both participated in the generation of the test publication set for gene normalization, annotated the development and test sets for article classification, and curated the test set for interaction method classification. These test datasets served as a gold standard for the evaluation of data extraction algorithms. Conclusion The development of efficient tools for extraction of PPI data is a necessary step to achieve full curation of the biomedical literature. NLP systems can in the first instance facilitate expert curation by refining the list of candidate publications that contain PPI data; more ambitiously, NLP approaches may be able to directly extract relevant information from full-text articles for rapid inspection by expert curators. Close collaboration between biological databases and NLP systems developers will continue to facilitate the long-term objectives of both disciplines.

  9. Fundamental of biomedical engineering

    CERN Document Server

    Sawhney, GS

    2007-01-01

    About the Book: A well set out textbook explains the fundamentals of biomedical engineering in the areas of biomechanics, biofluid flow, biomaterials, bioinstrumentation and use of computing in biomedical engineering. All these subjects form a basic part of an engineer's education. The text is admirably suited to meet the needs of the students of mechanical engineering, opting for the elective of Biomedical Engineering. Coverage of bioinstrumentation, biomaterials and computing for biomedical engineers can meet the needs of the students of Electronic & Communication, Electronic & Instrumenta

  10. WSAM: AN INTERNET TEXT UGC SUBJECTIVE ATTITUDE MINING SYSTEM%WSAM:互联网 UGC 文本主观观点挖掘系统

    Institute of Scientific and Technical Information of China (English)

    费仲超; 朱鲲鹏; 魏芳

    2012-01-01

    The information about users' subjective attitudes contained in Internet UGC (User Generated Content) is valuable for analyzing user behavior and user needs. This paper designs WSAM, an Internet text UGC subjective attitude analysis system based on natural language understanding, which can mine the attended objects and subjective components contained in users' subjective attitudes. The paper analyzes the UGC phenomenon on the Internet and the reasons for its emergence, and identifies four main types of users' subjective attitudes in text UGC. In mining users' subjective attitudes, the mining procedure is converted into recognizing, within a sentence, the object the subjective attitude attends to and determining its subjective components. The algorithm uses a maximum entropy classifier combined with lexical and structural features to mine users' subjective attitudes. Experiments verify that the algorithm adopted by the WSAM system performs well and that the system can be easily extended to related applications such as opinion mining, with good results.
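    For two classes, a maximum entropy classifier over binary lexical features is equivalent to logistic regression; a self-contained sketch with invented toy features (the record's actual feature set, covering lexical and structural cues, is much richer):

    ```python
    import math

    def maxent_train(samples, labels, epochs=200, lr=0.5):
        """Binary maximum-entropy model (logistic regression) over set-valued
        binary features, trained by stochastic gradient ascent."""
        w = {}
        for _ in range(epochs):
            for feats, y in zip(samples, labels):
                z = sum(w.get(f, 0.0) for f in feats)
                p = 1.0 / (1.0 + math.exp(-z))
                for f in feats:
                    w[f] = w.get(f, 0.0) + lr * (y - p)
        return w

    def predict(w, feats):
        return 1 if sum(w.get(f, 0.0) for f in feats) > 0 else 0

    # Toy lexical features: label 1 = subjective attitude, 0 = objective statement
    train = [({"love", "this", "phone"}, 1), ({"battery", "lasts", "hours"}, 0),
             ({"hate", "the", "screen"}, 1), ({"screen", "measures", "inches"}, 0)]
    w = maxent_train([f for f, _ in train], [y for _, y in train])
    print(predict(w, {"love", "phone"}), predict(w, {"battery", "hours"}))
    ```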

  11. A Text Mining Method for Massive Network Text Based on Correlation Mining

    Institute of Scientific and Technical Information of China (English)

    彭其华

    2013-01-01

    This paper studies a correlation-based mining method for massive network text. With the rapid development of computer and network technology, the volume of text on the network is growing massively. Traditional network text mining is implemented by feature extraction and can handle small data volumes, but it cannot keep pace with the rapid growth of information. We therefore propose a correlation-based method for mining massive network text: features are first extracted from the mass of text for initial classification and characteristic identification; the correlation between the features of the various texts is then computed; and the texts are finally grouped according to the magnitude of the correlation, which captures the relationships between text features well. A test on a set of random popular network terms shows that the algorithm adapts well to mining massive text and has good practical value.
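
The abstract does not define its correlation measure. One common choice, used here purely as an illustrative stand-in, is the cosine similarity of term-frequency vectors (the documents are invented):

```python
import math
from collections import Counter

def tf_vector(text):
    # Term-frequency vector of a document.
    return Counter(text.lower().split())

def cosine(a, b):
    # Correlation between two texts as the cosine of their TF vectors.
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

docs = ["mining massive network text",
        "text mining of network data",
        "cooking pasta at home"]
vecs = [tf_vector(d) for d in docs]
print(round(cosine(vecs[0], vecs[1]), 2), round(cosine(vecs[0], vecs[2]), 2))  # -> 0.67 0.0
```

Grouping texts whose pairwise similarity exceeds a threshold then yields the final clusters.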

  12. Text-mining of PubMed abstracts by natural language processing to create a public knowledge base on molecular mechanisms of bacterial enteropathogens

    Directory of Open Access Journals (Sweden)

    Perna Nicole T

    2009-06-01

    Full Text Available Abstract Background The Enteropathogen Resource Integration Center (ERIC; http://www.ericbrc.org) has a goal of providing bioinformatics support for the scientific community researching enteropathogenic bacteria such as Escherichia coli and Salmonella spp. Rapid and accurate identification of experimental conclusions from the scientific literature is critical to support research in this field. Natural Language Processing (NLP), and in particular Information Extraction (IE) technology, can be a significant aid to this process. Description We have trained a powerful, state-of-the-art IE technology on a corpus of abstracts from the microbial literature in PubMed to automatically identify and categorize biologically relevant entities and predicative relations. These relations include: Genes/Gene Products and their Roles; Gene Mutations and the resulting Phenotypes; and Organisms and their associated Pathogenicity. Evaluations on blind datasets show an F-measure average of greater than 90% for entities (genes, operons, etc.) and over 70% for relations (gene/gene product to role, etc.). This IE capability, combined with text indexing and relational database technologies, constitutes the core of our recently deployed text mining application. Conclusion Our Text Mining application is available online on the ERIC website http://www.ericbrc.org/portal/eric/articles. The information retrieval interface displays a list of recently published enteropathogen literature abstracts, and also provides a search interface to execute custom queries by keyword, date range, etc. Upon selection, processed abstracts and the entities and relations extracted from them are retrieved from a relational database and marked up to highlight the entities and relations. 
The abstract also provides links from extracted genes and gene products to the ERIC Annotations database, thus providing access to comprehensive genomic annotations and adding value to both the text-mining and annotations

  13. Using Google blogs and discussions to recommend biomedical resources: a case study.

    Science.gov (United States)

    Reed, Robyn B; Chattopadhyay, Ansuman; Iwema, Carrie L

    2013-01-01

    This case study investigated whether data gathered from discussions within social media provide a reliable basis for a biomedical resources recommendation system. Using a search query to mine text from Google Blogs and Discussions, a ranking of biomedical resources was determined based on those most frequently mentioned. To establish quality, these results were compared with rankings by subject experts. Overall agreement between the frequency of social media discussions and subject expert recommendations was observed when identifying key bioinformatics and consumer health resources. Testing the method in more than one biomedical area implies this procedure could be employed across different subjects.

  14. Text Mining, Data Mining and Knowledge Management: Intelligent Information Processing in the 21st Century

    Institute of Scientific and Technical Information of China (English)

    韩客松; 王永成

    2001-01-01

    Based on an introduction to the concepts of data mining, text mining and knowledge management, this paper divides knowledge management, from the viewpoint of technical development, into three phases: knowledge repository, knowledge sharing and knowledge discovery. It analyses the key technologies and significance of knowledge discovery as the highest phase, and finally points out that knowledge discovery in text is an important direction for intelligent information processing in the new century.

  15. Ranking Biomedical Annotations with Annotator’s Semantic Relevancy

    Directory of Open Access Journals (Sweden)

    Aihua Wu

    2014-01-01

    Full Text Available Biomedical annotation is a common and effective artifact for researchers to discuss, express opinions, and share discoveries. It is becoming increasingly popular in many online research communities and carries much useful information. Ranking biomedical annotations is a critical problem for data users seeking information efficiently. As the annotator's knowledge about the annotated entity normally determines the quality of the annotations, we evaluate that knowledge, that is, the semantic relationship between annotator and entity, in two ways. The first is extracting relational information from credible websites by mining association rules between an annotator and a biomedical entity. The second is frequent pattern mining from historical annotations, which reveals common features of biomedical entities that an annotator can annotate with high quality. We propose a weighted and concept-extended RDF model to represent an annotator, a biomedical entity, and their background attributes, and we merge the information from the two ways as the context of an annotator. Based on that, we present a method to rank annotations by evaluating their correctness according to users' votes and the semantic relevancy between the annotator and the annotated entity. The experimental results show that the approach is applicable and efficient even when the data set is large.

  16. Analysis of the Regularity of Clinical Medication for Migraine with a Text Mining Approach

    Institute of Scientific and Technical Information of China (English)

    杨静; 蔡峰; 谭勇; 郑光; 李立; 姜淼; 吕爱平

    2013-01-01

    Objective To analyze the regularity of clinical medication in the treatment of migraine with a text mining approach. Methods The data set on migraine was downloaded from the Chinese BioMedical Literature Database (CBM). Rules relating TCM patterns, symptoms, Chinese herbal medicines (CHM), Chinese patent medicines (CPM) and western medicines to migraine were mined with a data slicing algorithm, supplemented by tracing results back to the original articles and manual reading, and the results were presented in one-dimensional frequency tables and two-dimensional networks. Results A total of 7,921 articles were retrieved. The main TCM syndromes of migraine included liver-yang hyperactivity, stagnation of liver qi, and qi deficiency with blood stasis. The core symptoms included headache, vomiting, nausea, vertigo and photophobia. Frequently used Chinese herbal medicines included Chuanxiong, Tianma, Danshen, Chaihu, Danggui, Baishao and Baizhi; frequently used Chinese patent medicines included Yangxueqingnao granule, Toutongning capsule and Zhengtian pill; and frequently used western medicines included flunarizine, nimodipine and aspirin. For integrated treatment with TCM and western medicine, the combination of Yangxueqingnao granule and nimodipine was the most common. Conclusion Text mining provides a novel way to summarize the treatment rules for migraine in both TCM and western medicine, and to some extent the results have significance for clinical practice.
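
The data slicing algorithm itself is not described in the abstract, but the one-dimensional frequency tables and two-dimensional co-occurrence networks it feeds can be sketched with simple counting (the records below are invented examples, not CBM data):

```python
from collections import Counter
from itertools import combinations

# Each record lists the terms extracted from one article (toy data).
records = [
    ["Chuanxiong", "Tianma", "nimodipine"],
    ["Chuanxiong", "Danshen", "flunarizine"],
    ["Chuanxiong", "Tianma", "aspirin"],
]

# One-dimensional frequency table of terms.
freq = Counter(t for r in records for t in r)

# Two-dimensional network: edge weight = number of articles in which
# a pair of terms co-occurs.
edges = Counter(tuple(sorted(p)) for r in records
                for p in combinations(set(r), 2))

print(freq["Chuanxiong"], edges[("Chuanxiong", "Tianma")])  # -> 3 2
```

The weighted edge list is exactly what network-visualization tools consume to draw the co-occurrence graph.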

  17. A study on application of the text mining technology to the adverse drug reaction signal detection system

    OpenAIRE

    村永, 文学; ムラナガ, フミノリ; MURANAGA, Fuminori

    2012-01-01

    Research report for a Grant-in-Aid for Scientific Research (C), fiscal years 2009-2011 (project number 21590571; principal investigator: Fuminori Muranaga, lecturer, Kagoshima University Medical and Dental Hospital). This study technically examined a method for discovering leukopenia caused by drug interactions from comprehensive hospital information system data using an association analysis algorithm. The subjects were patients admitted to our hospital in 2008-2009 who received chemotherapy for prostate cancer. Of the IF-THEN rules discovered in the 2009 cases, 350,000 records were not contained in the knowledge database built from the 2008 cases. Investigation of the drugs with large lift values showed that most were already known. For critical adverse events of low incidence, a well-maintained knowledge dictionary was essential. This study was performed to evaluate the utility of a data mining algorithm for detection of adverse drug events. We used exp...

  18. Mining Related Articles for Automatic Journal Cataloging

    Directory of Open Access Journals (Sweden)

    Yuqing Mao

    2016-06-01

    Full Text Available Purpose: This paper is an investigation of the effectiveness of the method of clustering biomedical journals through mining the content similarity of journal articles. Design/methodology/approach: 3,265 journals in PubMed are analyzed based on article content similarity and Web usage, respectively. Comparisons of the two analysis approaches and a citation-based approach are given. Findings: Our results suggest that article content similarity is useful for clustering biomedical journals, and the content-similarity-based journal clustering method is more robust and less subject to human factors compared with the usage-based approach and the citation-based approach. Research limitations: Our paper currently focuses on clustering journals in the biomedical domain because there is a large volume of freely available resources such as PubMed and MeSH in this field. Further investigation is needed to improve this approach to fit journals in other domains. Practical implications: Our results show that it is feasible to catalog biomedical journals by mining the article content similarity. This work is also significant in serving practical needs in research portfolio analysis. Originality/value: To the best of our knowledge, we are among the first to report on clustering journals in the biomedical field through mining the article content similarity. This method can be integrated with existing approaches to create a new paradigm for future studies of journal clustering.
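
As a toy illustration of content-similarity-based journal clustering (the journals and their characteristic term sets are invented; the paper's actual method works on full article content from PubMed), journals can be grouped greedily whenever the Jaccard similarity of their term sets exceeds a threshold:

```python
def jaccard(a, b):
    # Overlap of two term sets.
    return len(a & b) / len(a | b)

def cluster(journals, threshold=0.3):
    # Greedy single-link clustering: put a journal into the first cluster
    # containing a sufficiently similar member, else start a new cluster.
    clusters = []
    for name, terms in journals.items():
        for c in clusters:
            if any(jaccard(terms, journals[m]) >= threshold for m in c):
                c.append(name)
                break
        else:
            clusters.append([name])
    return clusters

journals = {
    "J Cardiol A": {"heart", "artery", "infarction", "pressure"},
    "J Cardiol B": {"heart", "artery", "valve", "pressure"},
    "J Neurosci":  {"neuron", "synapse", "cortex", "axon"},
}
print(cluster(journals))  # -> [['J Cardiol A', 'J Cardiol B'], ['J Neurosci']]
```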

  19. Application of the Lanczos Bidiagonalization Algorithm in Text Mining

    Institute of Scientific and Technical Information of China (English)

    范伟鹏

    2012-01-01

    Text mining plays an important role in data mining, and classical text mining is based on latent semantic analysis. Traditionally, latent semantic analysis obtains a low-rank approximation through the singular value decomposition. The singular value decomposition requires cubic time, so it is costly, particularly when the matrix is large and sparse, as term-document matrices usually are. To solve this problem, this paper applies the Lanczos bidiagonalization algorithm and the extended Lanczos bidiagonalization algorithm, both of which are efficient and effective methods for computing low-rank approximations of large sparse matrices.
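
Lanczos bidiagonalization itself is too long to reproduce here. The sketch below instead uses plain power iteration on AᵀA to recover the dominant singular value of a small matrix; this is the same quantity that Lanczos bidiagonalization (the engine inside truncated-SVD routines such as SciPy's `svds`) computes far more efficiently for large sparse matrices. The matrix is an invented toy term-document matrix.

```python
import math

def matvec(A, x):
    # A x for a dense row-major matrix.
    return [sum(a * b for a, b in zip(row, x)) for row in A]

def rmatvec(A, y):
    # A^T y.
    return [sum(A[i][j] * y[i] for i in range(len(A))) for j in range(len(A[0]))]

def top_singular(A, iters=100):
    # Power iteration on A^T A converges to the dominant right singular
    # vector v; sigma = ||A v|| is then the largest singular value.
    v = [1.0] * len(A[0])
    for _ in range(iters):
        w = rmatvec(A, matvec(A, v))
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    sigma = math.sqrt(sum(x * x for x in matvec(A, v)))
    return sigma, v

# Toy 3x2 term-document matrix with known largest singular value 3.
A = [[3.0, 0.0],
     [0.0, 1.0],
     [0.0, 0.0]]
sigma, v = top_singular(A)
print(round(sigma, 6))  # -> 3.0
```

Deflating and repeating yields further singular triples; Lanczos methods build the whole leading subspace in one pass instead.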

  20. Semantic Relation Annotation for Biomedical Text Mining Based on Recursive Directed Graphs

    Institute of Scientific and Technical Information of China (English)

    陈波; 吕晨; 魏小梅

    2015-01-01

    This paper proposes a novel model, the feature-structure-based recursive directed graph, and uses it to describe the semantic relations of postpositive attributives in English biomedical text. The usage of postpositive attributives is complex and variable, falling mainly into three categories: present participles, past participles, and prepositional phrases serving as postpositive attributives; this creates many difficulties for automatic analysis. We summarize and annotate the semantic information of these three categories of postpositive attributives. Compared with dependency structures, feature structures can be formalized as recursive directed graphs, and the annotation results show that recursive directed graphs are better suited to extracting the complex semantic relations found in biomedical text mining.
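
As a minimal sketch of the underlying data structure (the sentence fragment and role labels below are invented, not the paper's annotation scheme), a feature structure can be encoded as a directed graph whose nodes may have several parents. This re-entrancy, which a dependency tree cannot express, lets one noun serve both as an argument of the main verb and as the word modified by a postpositive attributive:

```python
# Nodes are ids; edges carry semantic-role labels. Re-entrant edges (two
# parents pointing at one node) make this a graph rather than a tree.
# Toy annotation of: "the protein encoded by the gene binds ..."
nodes = {
    1: "protein",   # head noun
    2: "encoded",   # past participle as postpositive attributive
    3: "gene",
    4: "binds",     # main verb
}
edges = [
    (4, "agent", 1),      # "the protein ... binds"
    (2, "modifies", 1),   # "protein encoded ..."  <- node 1 re-entered
    (2, "by", 3),         # "... encoded by the gene"
]

def parents(node):
    # All nodes with an edge into `node`.
    return [src for src, _, dst in edges if dst == node]

print(parents(1))  # -> [4, 2]  (two incoming edges: re-entrancy)
```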

  1. Biomedical Science, Unit I: Respiration in Health and Medicine. Respiratory Anatomy, Physiology and Pathology; The Behavior of Gases; Introductory Chemistry; and Air Pollution. Student Text. Revised Version, 1975.

    Science.gov (United States)

    Biomedical Interdisciplinary Curriculum Project, Berkeley, CA.

    This student text deals with the human respiratory system and its relation to the environment. Topics include the process of respiration, the relationship of air to diseases of the respiratory system, the chemical and physical properties of gases, the impact on air quality of human activities and the effect of this air pollution on health.…

  2. Simulation Research of Text Categorization Based on Data Mining

    Institute of Scientific and Technical Information of China (English)

    赖娟

    2011-01-01

    This paper studies the text classification problem. Text is semi-structured: its feature dimension often reaches tens of thousands, and the features are interrelated and seriously redundant, which lowers the accuracy of traditional text classification. To improve the accuracy of automatic text categorization, we propose a method based on data mining technology. Exploiting the insensitivity of support vector machines to feature correlation and sparseness, and their strength on high-dimensional problems, the contribution of each word to classification is computed, words with similar contribution values are merged into a single feature of the text vector, and a support vector machine then learns from and classifies on these features. Tests on a text classification corpus show that the method not only speeds up text classification but also improves classification accuracy and recall.
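
The merging step described above (grouping words with similar contribution values into one feature) can be sketched as follows. The per-word contribution scores are assumed inputs here, since the abstract computes them with an SVM-based procedure it does not detail:

```python
def merge_by_contribution(contrib, tol=0.05):
    # Sort words by contribution and sweep, starting a new merged feature
    # whenever the gap to the group's first member exceeds tol.
    groups, current = [], []
    for word, score in sorted(contrib.items(), key=lambda kv: kv[1]):
        if current and score - contrib[current[0]] > tol:
            groups.append(current)
            current = []
        current.append(word)
    if current:
        groups.append(current)
    return groups

# Toy per-word contribution scores (assumed, not from the paper).
contrib = {"ball": 0.90, "goal": 0.88, "match": 0.87,
           "stock": 0.30, "bond": 0.28}
print(merge_by_contribution(contrib))  # -> [['bond', 'stock'], ['match', 'goal', 'ball']]
```

Each group then contributes one dimension to the text vector, shrinking the feature space before SVM training.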

  3. Study of Cloud-Based ERP Services for Small and Medium Enterprises (Data Processed by a Text Mining Technique)

    Directory of Open Access Journals (Sweden)

    SHARMA, R.

    2014-06-01

    Full Text Available The purpose of this research paper is to explore the knowledge of existing studies related to current trends in cloud computing. The outcome of the research is demonstrated in a diagram that simplifies the ERP integration process for in-house and cloud eco-systems. It provides a conceptual view to new clients or entrepreneurs using ERP services, explains how to deal with the two stages of ERP systems (cloud and in-house), and suggests how to improve knowledge about ERP services and the implementation process for both stages. The work recommends which ERP services can be outsourced over the cloud. Cloud ERP is a mix of standard ERP services with the flexibility of the cloud and the low cost of affording these services. This is a recent phenomenon in enterprise service offerings. For most entrepreneurs without an IT background it is an unclear and broad concept, since all the related research has been done within the last couple of years. Most cloud ERP vendors describe their products as straightforward, yet the process of selecting cloud ERP services and vendors is not clear. This research work draws a framework for selecting non-core business processes from preferred ERP service partners. It also recommends which ERP services to outsource first over the cloud, and discusses the security issues related to data or information moved from company premises to the cloud eco-system.

  4. Applied Research on Text Data Mining Technology in the Web Knowledge Base

    Institute of Scientific and Technical Information of China (English)

    蔡立斌

    2012-01-01

    This article first briefly describes the basic theory of text data mining and knowledge extraction, and then analyzes the characteristics of network information retrieval and mining, especially text mining, Web data mining, content-based data mining, and the series of problems associated with them. On this basis, we analyze the theory and technology required for designing and building a Web knowledge base with text data mining and knowledge discovery, analyze and design the architecture and function modules of the Web knowledge base system, and build a model of a Web knowledge base based on text data mining.

  5. Associative Text Classification Based on Mining Significant Itemsets

    Institute of Scientific and Technical Information of China (English)

    蔡金凤; 白清源

    2011-01-01

    Text classification is an important basis of information retrieval and text mining; its task is to assign documents to categories from a given category set, and it is widely applied in natural language processing and understanding, information organization and management, and information filtering. Current methods fall mainly into three groups: statistical methods, connectionist methods, and rule-based methods. The basic idea of the traditional associative text classification algorithm ARC-BC (associative rule-based classifier by category) is to use the Apriori association rule mining algorithm to generate frequent items or itemsets from feature terms, take these frequent items as rule antecedents and categories as rule consequents, and let the resulting rules constitute a classifier. When classifying a test sample, every rule whose antecedent matches the sample adds its confidence to the counter of the rule's category, and the sample is assigned the category whose cumulative confidence is largest. However, in the classifier-construction stage ARC-BC considers only whether a feature word is present and ignores the weights of text features when mining frequent itemsets and generating association rules. Building on ARC-BC, this paper proposes ISARC (ItemSet Significance-based ARC), which can improve the accuracy of associative text classification: it uses feature weights to define the significance of a k-itemset, generates association rules by mining significant itemsets, and takes the lift of a rule into account when classifying a document. Experimental results show that ISARC, which mines significant itemsets, can improve the accuracy of associative text classification.
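
The level-wise Apriori mining that ARC-BC (and, with weighted itemsets, ISARC) builds on can be sketched in a few lines; the transactions below are invented toy documents reduced to feature-word sets:

```python
from itertools import combinations

def apriori(transactions, min_support):
    # Level-wise frequent itemset mining: count k-itemsets, keep the
    # frequent ones, and build (k+1)-candidates whose k-subsets are all
    # frequent (the Apriori pruning rule).
    items = sorted({i for t in transactions for i in t})
    frequent, k = {}, 1
    current = [frozenset([i]) for i in items]
    while current:
        counts = {c: sum(c <= t for t in transactions) for c in current}
        level = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(level)
        keep = sorted({i for c in level for i in c})
        current = [frozenset(c) for c in combinations(keep, k + 1)
                   if all(frozenset(s) in level for s in combinations(c, k))]
        k += 1
    return frequent  # itemset -> support count

docs = [{"gene", "protein", "cell"},
        {"gene", "protein"},
        {"gene", "cell"},
        {"protein", "market"}]
freq = apriori(docs, min_support=2)
print(freq[frozenset({"gene", "protein"})])  # -> 2
```

A classifier in the ARC-BC family would then turn each frequent itemset co-occurring with a category into a rule "itemset → category" and score rules by confidence (and, in ISARC, lift).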

  6. Biomedical signal processing

    CERN Document Server

    Akay, Metin

    1994-01-01

    Sophisticated techniques for signal processing are now available to the biomedical specialist! Written in an easy-to-read, straightforward style, Biomedical Signal Processing presents techniques to eliminate background noise, enhance signal detection, and analyze computer data, making results easy to comprehend and apply. In addition to examining techniques for electrical signal analysis, filtering, and transforms, the author supplies an extensive appendix with several computer programs that demonstrate techniques presented in the text.

  7. Visualization and analysis of a cardio vascular disease- and MUPP1-related biological network combining text mining and data warehouse approaches.

    Science.gov (United States)

    Sommer, Björn; Tiys, Evgeny S; Kormeier, Benjamin; Hippe, Klaus; Janowski, Sebastian J; Ivanisenko, Timofey V; Bragin, Anatoly O; Arrigo, Patrizio; Demenkov, Pavel S; Kochetov, Alexey V; Ivanisenko, Vladimir A; Kolchanov, Nikolay A; Hofestädt, Ralf

    2010-11-11

    Detailed investigation of socially important diseases with modern experimental methods has resulted in the generation of a large volume of valuable data. However, analysis and interpretation of this data require the application of efficient computational techniques and systems biology approaches. In particular, techniques allowing the reconstruction of associative networks of various biological objects and events can be useful. In this publication, the combination of different techniques to create such a network associated with an abstract cell environment is discussed in order to gain insights into the functional as well as spatial interrelationships. It is shown that experimentally gained knowledge enriched with data warehouse content and text mining data can be used for the reconstruction and localization of a cardiovascular disease network beginning with MUPP1/MPDZ (multi-PDZ domain protein).

  8. Analysis of ingredient lists of commercially available gluten-free and gluten-containing food products using the text mining technique.

    Science.gov (United States)

    do Nascimento, Amanda Bagolin; Fiates, Giovanna Medeiros Rataichesck; Dos Anjos, Adilson; Teixeira, Evanilda

    2013-03-01

    Ingredients mentioned on the labels of commercially available packaged gluten-free and similar gluten-containing food products were analyzed and compared, using the text mining technique. A total of 324 products' labels were analyzed for content (162 from gluten-free products), and ingredient diversity in gluten-free products was 28% lower. Raw materials used as ingredients of gluten-free products were limited to five varieties: rice, cassava, corn, soy, and potato. Sugar was the most frequently mentioned ingredient on both types of products' labels. Salt and sodium also were among these ingredients. Presence of hydrocolloids, enzymes or raw materials of high nutritional content such as pseudocereals, suggested by academic studies as alternatives to improve nutritional and sensorial quality of gluten-free food products, was not identified in the present study. Nutritional quality of gluten-free diets and health of celiac patients may be compromised.

  9. A CTD-Pfizer collaboration: manual curation of 88,000 scientific articles text mined for drug-disease and drug-phenotype interactions.

    Science.gov (United States)

    Davis, Allan Peter; Wiegers, Thomas C; Roberts, Phoebe M; King, Benjamin L; Lay, Jean M; Lennon-Hopkins, Kelley; Sciaky, Daniela; Johnson, Robin; Keating, Heather; Greene, Nigel; Hernandez, Robert; McConnell, Kevin J; Enayetallah, Ahmed E; Mattingly, Carolyn J

    2013-01-01

    Improving the prediction of chemical toxicity is a goal common to both environmental health research and pharmaceutical drug development. To improve safety detection assays, it is critical to have a reference set of molecules with well-defined toxicity annotations for training and validation purposes. Here, we describe a collaboration between safety researchers at Pfizer and the research team at the Comparative Toxicogenomics Database (CTD) to text mine and manually review a collection of 88,629 articles relating over 1,200 pharmaceutical drugs to their potential involvement in cardiovascular, neurological, renal and hepatic toxicity. In 1 year, CTD biocurators curated 254,173 toxicogenomic interactions (152,173 chemical-disease, 58,572 chemical-gene, 5,345 gene-disease and 38,083 phenotype interactions). All chemical-gene-disease interactions are fully integrated with public CTD, and phenotype interactions can be downloaded. We describe Pfizer's text-mining process to collate the articles, and CTD's curation strategy, performance metrics, enhanced data content and new module to curate phenotype information. As well, we show how data integration can connect phenotypes to diseases. This curation can be leveraged for information about toxic endpoints important to drug safety and help develop testable hypotheses for drug-disease events. The availability of these detailed, contextualized, high-quality annotations curated from seven decades' worth of the scientific literature should help facilitate new mechanistic screening assays for pharmaceutical compound survival. This unique partnership demonstrates the importance of resource sharing and collaboration between public and private entities and underscores the complementary needs of the environmental health science and pharmaceutical communities. Database URL: http://ctdbase.org/

  10. LINNAEUS: A species name identification system for biomedical literature

    Directory of Open Access Journals (Sweden)

    Nenadic Goran

    2010-02-01

    Full Text Available Abstract Background The task of recognizing and identifying species names in biomedical literature has recently been regarded as critical for a number of applications in text and data mining, including gene name recognition, species-specific document retrieval, and semantic enrichment of biomedical articles. Results In this paper we describe an open-source species name recognition and normalization software system, LINNAEUS, and evaluate its performance relative to several automatically generated biomedical corpora, as well as a novel corpus of full-text documents manually annotated for species mentions. LINNAEUS uses a dictionary-based approach (implemented as an efficient deterministic finite-state automaton) to identify species names and a set of heuristics to resolve ambiguous mentions. When compared against our manually annotated corpus, LINNAEUS performs with 94% recall and 97% precision at the mention level, and 98% recall and 90% precision at the document level. Our system successfully solves the problem of disambiguating uncertain species mentions, with 97% of all mentions in PubMed Central full-text documents resolved to unambiguous NCBI taxonomy identifiers. Conclusions LINNAEUS is an open source, stand-alone software system capable of recognizing and normalizing species name mentions with speed and accuracy, and can therefore be integrated into a range of bioinformatics and text-mining applications. The software and manually annotated corpus can be downloaded freely at http://linnaeus.sourceforge.net/.
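
A dictionary-based tagger of the kind LINNAEUS implements can be sketched with a greedy longest-match scan over the token stream (LINNAEUS compiles its dictionary into a deterministic finite-state automaton, which is far faster but equivalent in spirit; the two-entry mini-dictionary below is an invented illustration):

```python
# Invented mini-dictionary: multiword name -> taxonomy identifier.
species = {
    ("escherichia", "coli"): "NCBI:562",
    ("salmonella",): "NCBI:590",
}
MAX_LEN = max(len(k) for k in species)

def tag(text):
    # Greedy longest-match scan: at each position, try the longest
    # dictionary entry first, then shrink.
    tokens = text.lower().replace(".", "").split()
    hits, i = [], 0
    while i < len(tokens):
        for n in range(min(MAX_LEN, len(tokens) - i), 0, -1):
            key = tuple(tokens[i:i + n])
            if key in species:
                hits.append((" ".join(key), species[key]))
                i += n
                break
        else:
            i += 1
    return hits

print(tag("Escherichia coli and Salmonella were studied."))
# -> [('escherichia coli', 'NCBI:562'), ('salmonella', 'NCBI:590')]
```

A real system adds the disambiguation heuristics the abstract mentions, e.g. resolving "E. coli"-style abbreviations against names seen earlier in the document.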

  11. Data Mining Technology Based on Text Intelligence Data

    Institute of Scientific and Technical Information of China (English)

    吕曹芳; 侯智斌

    2012-01-01

    This paper introduces data mining methods suitable for intelligence data in the military domain and establishes a processing method for unstructured text intelligence. Combining the characteristics of military intelligence, a framework model for data mining in military intelligence is proposed, and methods for mining Chinese text in military intelligence are discussed. Chinese word segmentation, keyword extraction, word frequency analysis and association analysis of the intelligence text data are implemented.

  12. Problems in using p-curve analysis and text-mining to detect rate of p-hacking and evidential value

    Directory of Open Access Journals (Sweden)

    Dorothy V.M. Bishop

    2016-02-01

    Full Text Available Background. The p-curve is a plot of the distribution of p-values reported in a set of scientific studies. Comparisons between ranges of p-values have been used to evaluate fields of research in terms of the extent to which studies have genuine evidential value, and the extent to which they suffer from bias in the selection of variables and analyses for publication, p-hacking. Methods. p-hacking can take various forms. Here we used R code to simulate the use of ghost variables, where an experimenter gathers data on several dependent variables but reports only those with statistically significant effects. We also examined a text-mined dataset used by Head et al. (2015) and assessed its suitability for investigating p-hacking. Results. We show that when there is ghost p-hacking, the shape of the p-curve depends on whether dependent variables are intercorrelated. For uncorrelated variables, simulated p-hacked data do not give the “p-hacking bump” just below .05 that is regarded as evidence of p-hacking, though there is a negative skew when simulated variables are inter-correlated. The way p-curves vary according to features of the underlying data poses problems when automated text mining is used to detect p-values in heterogeneous sets of published papers. Conclusions. The absence of a bump in the p-curve is not indicative of lack of p-hacking. Furthermore, while studies with evidential value will usually generate a right-skewed p-curve, we cannot treat a right-skewed p-curve as an indicator of the extent of evidential value, unless we have a model specific to the type of p-values entered into the analysis. We conclude that it is not feasible to use the p-curve to estimate the extent of p-hacking and evidential value unless there is considerable control over the type of data entered into the analysis. In particular, p-hacking with ghost variables is likely to be missed.
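
The ghost-variable simulation (done in R in the paper) can be paraphrased in Python for the uncorrelated case. The z-test below is a simplification of the t-tests a real analysis would run, used only to show the core effect: reporting the minimum p across several null dependent variables inflates the false-positive rate:

```python
import random
from statistics import NormalDist, mean, stdev

def p_value(sample):
    # Two-sided one-sample z-test of mean 0 (normal approximation,
    # a simplification of a proper t-test).
    z = mean(sample) / (stdev(sample) / len(sample) ** 0.5)
    return 2 * (1 - NormalDist().cdf(abs(z)))

def experiment(n_dvs, n=30):
    # Ghost-variable p-hacking: gather n_dvs uncorrelated null DVs,
    # report only the smallest p.
    return min(p_value([random.gauss(0, 1) for _ in range(n)])
               for _ in range(n_dvs))

random.seed(7)
honest = sum(experiment(1) < 0.05 for _ in range(2000)) / 2000
hacked = sum(experiment(5) < 0.05 for _ in range(2000)) / 2000
print(honest, hacked)  # hacked false-positive rate is several times higher
```

Collecting the minimum p-values across many simulated experiments and histogramming those below .05 reproduces the paper's point: the hacked p-curve for uncorrelated DVs need not show a bump just below .05.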

  13. Spatial Patterns of the Indications of Acupoints Using Data Mining in Classic Medical Text: A Possible Visualization of the Meridian System

    Directory of Open Access Journals (Sweden)

    Won-Mo Jung

    2015-01-01

    Full Text Available The indications of acupoints are thought to be highly associated with the lines of the meridian systems. The present study used data mining methods to analyze the characteristics of the indications of each acupoint and to visualize the relationships between the acupoints and disease sites in the classic Korean medical text Chimgoogyeongheombang. Using a term frequency-inverse document frequency (tf-idf) scheme, the present study extracted valuable data regarding the indications of each acupoint according to the frequency of the co-occurrences of eight Source points and eighteen disease sites. Furthermore, the spatial patterns of the indications of each acupoint on a body map were visualized according to the tf-idf values. Each acupoint along the different meridians exhibited different constellation patterns at various disease sites. Additionally, the spatial patterns of the indications of each acupoint were highly associated with the route of the corresponding meridian. The present findings demonstrate that the indications of each acupoint were primarily associated with the corresponding meridian system. Furthermore, these findings suggest that the routes of the meridians may have clinical implications in terms of identifying the constellations of the indications of acupoints.
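
The tf-idf weighting used to isolate characteristic indications can be sketched directly from its definition (the documents below are invented toy term lists; the study's actual unit of analysis is acupoint-by-disease-site co-occurrence):

```python
import math
from collections import Counter

def tf_idf(docs):
    # tf-idf(t, d) = (count of t in d / |d|) * log(N / document frequency of t).
    n = len(docs)
    df = Counter(t for d in docs for t in set(d))
    return [{t: (c / len(d)) * math.log(n / df[t])
             for t, c in Counter(d).items()} for d in docs]

docs = [["headache", "vertigo", "headache"],
        ["headache", "nausea"],
        ["headache", "cough"]]
weights = tf_idf(docs)
print(weights[0]["headache"], round(weights[0]["vertigo"], 3))  # -> 0.0 0.366
```

A term appearing in every document ("headache") gets weight zero, while a term concentrated in one document ("vertigo") is promoted; this is exactly why tf-idf highlights the indications specific to each acupoint.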

  14. CGMIM: Automated text-mining of Online Mendelian Inheritance in Man (OMIM to identify genetically-associated cancers and candidate genes

    Directory of Open Access Journals (Sweden)

    Jones Steven

    2005-03-01

    Full Text Available Abstract Background Online Mendelian Inheritance in Man (OMIM) is a computerized database of information about genes and heritable traits in human populations, based on information reported in the scientific literature. Our objective was to establish an automated text-mining system for OMIM that will identify genetically-related cancers and cancer-related genes. We developed the computer program CGMIM to search for entries in OMIM that are related to one or more cancer types. We performed manual searches of OMIM to verify the program results. Results In the OMIM database on September 30, 2004, CGMIM identified 1943 genes related to cancer. BRCA2 (OMIM *164757), BRAF (OMIM *164757) and CDKN2A (OMIM *600160) were each related to 14 types of cancer. There were 45 genes related to cancer of the esophagus, 121 genes related to cancer of the stomach, and 21 genes related to both. Analysis of CGMIM results indicates that fewer than three gene entries in OMIM should mention both, and the more than seven-fold discrepancy suggests that cancers of the esophagus and stomach are more genetically related than the current literature suggests. Conclusion CGMIM identifies genetically-related cancers and cancer-related genes. In several ways, cancers with shared genetic etiology are anticipated to lead to further etiologic hypotheses and advances regarding environmental agents. CGMIM results are posted monthly and the source code can be obtained free of charge from the BC Cancer Research Centre website http://www.bccrc.ca/ccr/CGMIM.

  15. Construction of an annotated corpus to support biomedical information extraction

    Directory of Open Access Journals (Sweden)

    McNaught John

    2009-10-01

    Full Text Available Abstract Background Information Extraction (IE) is a component of text mining that facilitates knowledge discovery by automatically locating instances of interesting biomedical events from huge document collections. As events are usually centred on verbs and nominalised verbs, understanding the syntactic and semantic behaviour of these words is highly important. Corpora annotated with information concerning this behaviour can constitute a valuable resource in the training of IE components and resources. Results We have defined a new scheme for annotating sentence-bound gene regulation events, centred on both verbs and nominalised verbs. For each event instance, all participants (arguments) in the same sentence are identified and assigned a semantic role from a rich set of 13 roles tailored to biomedical research articles, together with a biological concept type linked to the Gene Regulation Ontology. To our knowledge, our scheme is unique within the biomedical field in terms of the range of event arguments identified. Using the scheme, we have created the Gene Regulation Event Corpus (GREC), consisting of 240 MEDLINE abstracts, in which events relating to gene regulation and expression have been annotated by biologists. A novel method of evaluating various different facets of the annotation task showed that average inter-annotator agreement rates fall within the range of 66% - 90%. Conclusion The GREC is a unique resource within the biomedical field, in that it annotates not only core relationships between entities, but also a range of other important details about these relationships, e.g., location, temporal, manner and environmental conditions. As such, it is specifically designed to support bio-specific tool and resource development. It has already been used to acquire semantic frames for inclusion within the BioLexicon (a lexical, terminological resource to aid biomedical text mining). Initial experiments have also shown that the corpus may

  16. Problems in using p-curve analysis and text-mining to detect rate of p-hacking and evidential value.

    Science.gov (United States)

    Bishop, Dorothy V M; Thompson, Paul A

    2016-01-01

    Background. The p-curve is a plot of the distribution of p-values reported in a set of scientific studies. Comparisons between ranges of p-values have been used to evaluate fields of research in terms of the extent to which studies have genuine evidential value, and the extent to which they suffer from bias in the selection of variables and analyses for publication (p-hacking). Methods. p-hacking can take various forms. Here we used R code to simulate the use of ghost variables, where an experimenter gathers data on several dependent variables but reports only those with statistically significant effects. We also examined a text-mined dataset used by Head et al. (2015) and assessed its suitability for investigating p-hacking. Results. We show that when there is ghost p-hacking, the shape of the p-curve depends on whether dependent variables are intercorrelated. For uncorrelated variables, simulated p-hacked data do not give the "p-hacking bump" just below .05 that is regarded as evidence of p-hacking, though there is a negative skew when simulated variables are intercorrelated. The way p-curves vary according to features of the underlying data poses problems when automated text mining is used to detect p-values in heterogeneous sets of published papers. Conclusions. The absence of a bump in the p-curve is not indicative of a lack of p-hacking. Furthermore, while studies with evidential value will usually generate a right-skewed p-curve, we cannot treat a right-skewed p-curve as an indicator of the extent of evidential value, unless we have a model specific to the type of p-values entered into the analysis. We conclude that it is not feasible to use the p-curve to estimate the extent of p-hacking and evidential value unless there is considerable control over the type of data entered into the analysis. In particular, p-hacking with ghost variables is likely to be missed.
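
The ghost-variable form of p-hacking described above can be simulated in a few lines. The sketch below uses plain Python rather than the authors' R code, with an illustrative count of five uncorrelated dependent variables tested under the null; reporting only the smallest p-value inflates the nominal 5% false-positive rate to roughly 1 − 0.95⁵ ≈ 23%:

```python
import math
import random

def two_sided_p(z):
    # p-value of a standard-normal test statistic (uniform under the null)
    return 2.0 * (1.0 - 0.5 * (1.0 + math.erf(abs(z) / math.sqrt(2.0))))

def ghost_hacked_p(n_dvs, rng):
    # the "experimenter" tests n_dvs uncorrelated DVs under the null
    # but reports only the smallest p-value (ghost variables)
    return min(two_sided_p(rng.gauss(0.0, 1.0)) for _ in range(n_dvs))

rng = random.Random(42)
ps = [ghost_hacked_p(5, rng) for _ in range(5000)]
frac_sig = sum(p < 0.05 for p in ps) / len(ps)
print(round(frac_sig, 2))   # ≈ 0.23 rather than the nominal 0.05
```

Binning `ps` just below .05 would also reproduce the abstract's point that uncorrelated ghost variables yield no "p-hacking bump".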

  17. Biomedical photonics handbook biomedical diagnostics

    CERN Document Server

    Vo-Dinh, Tuan

    2014-01-01

    Shaped by Quantum Theory, Technology, and the Genomics Revolution. The integration of photonics, electronics, biomaterials, and nanotechnology holds great promise for the future of medicine. This topic has recently experienced an explosive growth due to the noninvasive or minimally invasive nature and the cost-effectiveness of photonic modalities in medical diagnostics and therapy. The second edition of the Biomedical Photonics Handbook presents fundamental developments as well as important applications of biomedical photonics of interest to scientists, engineers, manufacturers, teachers, and students.

  18. RESEARCH ON TEXT MINING BASED ON BACKGROUND KNOWLEDGE AND ACTIVE LEARNING

    Institute of Scientific and Technical Information of China (English)

    符保龙

    2013-01-01

    In order to achieve good results in text classification and text mining, a large amount of labelled data is often needed. However, labelling data is both operationally complex and expensive. This paper therefore introduces unlabelled data into text classification and text mining within a support-vector-machine classification framework, implemented through two methods: one based on background knowledge and one based on active learning. Experimental results show that the background-knowledge method delivers excellent text mining performance when the baseline classifier is strong, while the active-learning method improves text mining performance in the general case.

  19. Spatial Patterns of the Indications of Acupoints Using Data Mining in Classic Medical Text: A Possible Visualization of the Meridian System.

    Science.gov (United States)

    Jung, Won-Mo; Lee, Taehyung; Lee, In-Seon; Kim, Sanghyun; Jang, Hyunchul; Kim, Song-Yi; Park, Hi-Joon; Chae, Younbyoung

    2015-01-01

    The indications of acupoints are thought to be highly associated with the lines of the meridian systems. The present study used data mining methods to analyze the characteristics of the indications of each acupoint and to visualize the relationships between the acupoints and disease sites in the classic Korean medical text Chimgoogyeongheombang. Using a term frequency-inverse document frequency (tf-idf) scheme, the present study extracted valuable data regarding the indications of each acupoint according to the frequency of the co-occurrences of eight Source points and eighteen disease sites. Furthermore, the spatial patterns of the indications of each acupoint on a body map were visualized according to the tf-idf values. Each acupoint along the different meridians exhibited different constellation patterns at various disease sites. Additionally, the spatial patterns of the indications of each acupoint were highly associated with the route of the corresponding meridian. The present findings demonstrate that the indications of each acupoint were primarily associated with the corresponding meridian system. Furthermore, these findings suggest that the routes of the meridians may have clinical implications in terms of identifying the constellations of the indications of acupoints.
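
The tf-idf weighting used in the study can be sketched as follows. The acupoint "documents" below are hypothetical stand-ins for the co-occurrence data extracted from Chimgoogyeongheombang; a disease-site term that co-occurs with only one acupoint is weighted higher than one shared across acupoints:

```python
import math
from collections import Counter

# hypothetical acupoint "documents": lists of co-occurring disease-site terms
docs = {
    "HT7": ["heart", "chest", "insomnia"],
    "LU9": ["cough", "throat", "chest"],
    "LR3": ["eye", "headache"],
}

def tf_idf(term, acupoint):
    tf = Counter(docs[acupoint])[term] / len(docs[acupoint])   # term frequency
    df = sum(term in terms for terms in docs.values())         # document frequency
    idf = math.log(len(docs) / df)                             # inverse document frequency
    return tf * idf

# "insomnia" occurs only with HT7, so it outweighs the shared term "chest"
print(tf_idf("insomnia", "HT7") > tf_idf("chest", "HT7"))   # → True
```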

  20. [Studies Using Text Mining on the Differences in Learning Effects between the KJ and World Café Method as Learning Strategies].

    Science.gov (United States)

    Yasuhara, Tomohisa; Sone, Tomomichi; Konishi, Motomi; Kushihata, Taro; Nishikawa, Tomoe; Yamamoto, Yumi; Kurio, Wasako; Kohno, Takeyuki

    2015-01-01

    The KJ method (named for developer Jiro Kawakita; also known as affinity diagramming) is widely used in participatory learning as a means to collect and organize information. In addition, the World Café (WC) has recently become popular. However, differences in the information obtained using each method have not been studied comprehensively. To determine the appropriate information selection criteria, we analyzed differences in the information generated by the WC and KJ methods. Two groups engaged in sessions to collect and organize information using either the WC or KJ method, and small group discussions were held to create "proposals to improve first-year education". Both groups answered two pre- and post-session questionnaires that asked for free descriptions. Key words were extracted from the results of the two questionnaires and categorized using text mining. In the responses to questionnaire 1, which was directly related to the session theme, a significant increase in the number of key words was observed in the WC group (p=0.0050, Fisher's exact test). However, there was no significant increase in the number of key words in the responses to questionnaire 2, which was not directly related to the session theme (p=0.8347, Fisher's exact test). In the KJ method, participants extracted the most notable issues and progressed to a detailed discussion, whereas in the WC method, various information and problems were spread among the participants. The choice between the WC and KJ method should be made to reflect the educational objective and desired direction of discussion.

  1. A Novel Scheme for Mining Implicit Knowledge Fragments from Abstracts

    Institute of Scientific and Technical Information of China (English)

    戴璐; 丁立新; 薛兵

    2013-01-01

    This paper extracts high-frequency keywords from the literature, locates them in abstracts through an inverted index, mines the implicit semantic phrases that form fixed collocations with the keywords in the abstracts, and tracks the dynamic changes of these phrases in recent years through text bibliometrics. An association network is built from a relatedness-impact matrix to analyse and map the associations between the semantic phrases. The experimental results show that the knowledge fragments implicit in literature abstracts better reflect the development trends of a discipline.
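
The keyword-location step in this scheme rests on an inverted index mapping each term to the abstracts that contain it; a minimal sketch with toy abstracts (the ids and texts are invented):

```python
from collections import defaultdict

abstracts = {  # hypothetical abstract id -> text
    101: "text mining of biomedical literature abstracts",
    102: "mining association rules from text corpora",
    103: "gene expression analysis",
}

# build the inverted index: term -> set of abstract ids containing it
index = defaultdict(set)
for doc_id, text in abstracts.items():
    for term in text.split():
        index[term].add(doc_id)

# locate every abstract in which the keyword "mining" occurs
print(sorted(index["mining"]))   # → [101, 102]
```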

  2. Next-generation text-mining mediated generation of chemical response-specific gene sets for interpretation of gene expression data

    Directory of Open Access Journals (Sweden)

    Hettne Kristina M

    2013-01-01

    Full Text Available Abstract Background Availability of chemical response-specific lists of genes (gene sets) for pharmacological and/or toxic effect prediction for compounds is limited. We hypothesize that more gene sets can be created by next-generation text mining (next-gen TM), and that these can be used with gene set analysis (GSA) methods for chemical treatment identification, for pharmacological mechanism elucidation, and for comparing compound toxicity profiles. Methods We created 30,211 chemical response-specific gene sets for human and mouse by next-gen TM, and derived 1,189 (human) and 588 (mouse) gene sets from the Comparative Toxicogenomics Database (CTD). We tested the gene sets for significant differential expression (SDE; false discovery rate-corrected p-values) in gene expression (GE) data sets. Results Next-gen TM-derived gene sets matching the chemical treatment were significantly altered in three GE data sets, and the corresponding CTD-derived gene sets were significantly altered in five GE data sets. Six next-gen TM-derived and four CTD-derived fibrate gene sets were significantly altered in the PPARA knock-out GE data set. None of the fibrate signatures in cMap scored significant against the PPARA GE signature. 33 environmental toxicant gene sets were significantly altered in the triazole GE data sets. 21 of these toxicants had a toxicity pattern similar to that of the triazoles. We confirmed embryotoxic effects, and discriminated triazoles from other chemicals. Conclusions Gene set analysis with next-gen TM-derived chemical response-specific gene sets is a scalable method for identifying similarities in gene responses to other chemicals, from which one may infer potential mode of action and/or toxic effect.
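
Gene set analysis of the kind described here asks whether a chemical response-specific gene set overlaps the differentially expressed genes more than chance would predict. One common formulation (a sketch only; not necessarily the statistic used in the paper, and the gene counts are invented) is a one-sided hypergeometric enrichment test:

```python
from math import comb

def enrichment_p(N, K, n, k):
    """P(overlap >= k) when a gene set of K genes and a list of n
    differentially expressed genes are drawn from N genes in total."""
    return sum(
        comb(K, i) * comb(N - K, n - i)
        for i in range(k, min(K, n) + 1)
    ) / comb(N, n)

# toy numbers: a 50-gene set, 40 DE genes out of 1000, observed overlap 8
p = enrichment_p(N=1000, K=50, n=40, k=8)
print(p < 0.05)   # the expected overlap is only 2, so the set is enriched
```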

  3. Constructing a semantic predication gold standard from the biomedical literature

    Directory of Open Access Journals (Sweden)

    Kilicoglu Halil

    2011-12-01

    Full Text Available Abstract Background Semantic relations increasingly underpin biomedical text mining and knowledge discovery applications. The success of such practical applications crucially depends on the quality of extracted relations, which can be assessed against a gold standard reference. Most such references in biomedical text mining focus on narrow subdomains and adopt different semantic representations, rendering them difficult to use for benchmarking independently developed relation extraction systems. In this article, we present a multi-phase gold standard annotation study, in which we annotated 500 sentences randomly selected from MEDLINE abstracts on a wide range of biomedical topics with 1371 semantic predications. The UMLS Metathesaurus served as the main source for conceptual information and the UMLS Semantic Network for relational information. We measured interannotator agreement and analyzed the annotations closely to identify some of the challenges in annotating biomedical text with relations based on an ontology or a terminology. Results We obtain fair to moderate interannotator agreement in the practice phase (0.378-0.475). With improved guidelines and additional semantic equivalence criteria, the agreement increases by 12% (0.415 to 0.536) in the main annotation phase. In addition, we find that agreement increases to 0.688 when the agreement calculation is limited to those predications that are based only on the explicitly provided UMLS concepts and relations. Conclusions While interannotator agreement in the practice phase confirms that conceptual annotation is a challenging task, the increasing agreement in the main annotation phase points out that an acceptable level of agreement can be achieved in multiple iterations, by setting stricter guidelines and establishing semantic equivalence criteria. Mapping text to ontological concepts emerges as the main challenge in conceptual annotation. Annotating predications involving biomolecular

  4. Biomedical nanotechnology.

    Science.gov (United States)

    Hurst, Sarah J

    2011-01-01

    This chapter summarizes the roles of nanomaterials in biomedical applications, focusing on those highlighted in this volume. A brief history of nanoscience and technology and a general introduction to the field are presented. Then, the chemical and physical properties of nanostructures that make them ideal for use in biomedical applications are highlighted. Examples of common applications, including sensing, imaging, and therapeutics, are given. Finally, the challenges associated with translating this field from the research laboratory to the clinic setting, in terms of the larger societal implications, are discussed.

  5. A CONDITIONAL RANDOM FIELDS APPROACH TO BIOMEDICAL NAMED ENTITY RECOGNITION

    Institute of Scientific and Technical Information of China (English)

    2007-01-01

    Named entity recognition is a fundamental task in biomedical data mining. In this letter, a named entity recognition system based on CRFs (Conditional Random Fields) for biomedical texts is presented. The system makes extensive use of a diverse set of features, including local features, full-text features and external resource features. All features incorporated in this system are described in detail, and the impacts of different feature sets on the performance of the system are evaluated. In order to improve the performance of the system, post-processing modules are exploited to deal with abbreviation phenomena, cascaded named entities and boundary-error identification. Evaluation of the system showed that feature selection has an important impact on system performance, and that the post-processing explored makes an important contribution to achieving better results.
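
The "local features" such a system extracts per token can be illustrated with a small function; the feature names below are illustrative, and in a full system these dictionaries (together with full-text and external-resource features) would be fed to a CRF toolkit for sequence labelling:

```python
def token_features(tokens, i):
    # local features for token i, in the style of CRF-based biomedical NER:
    # lowercase form, affixes, digit/case cues, word shape, and context words
    w = tokens[i]
    return {
        "word": w.lower(),
        "prefix3": w[:3].lower(),
        "suffix3": w[-3:].lower(),
        "has_digit": any(c.isdigit() for c in w),
        "has_hyphen": "-" in w,
        "shape": "".join("X" if c.isupper() else "x" if c.islower()
                         else "d" if c.isdigit() else c for c in w),
        "prev": tokens[i - 1].lower() if i > 0 else "<BOS>",
        "next": tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>",
    }

tokens = "The p53 protein regulates IL-2".split()
feats = token_features(tokens, 1)
print(feats["shape"], feats["prev"], feats["next"])   # → xdd the protein
```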

  6. Progress of Text Mining Applications in Humanities and Social Science Research

    Institute of Scientific and Technical Information of China (English)

    郭金龙; 许鑫; 陆宇杰

    2012-01-01

    As an effective method of handling the data deluge, text mining has earned widespread attention in the humanities and social sciences in recent years. This paper first summarizes the relevant techniques of text mining and the current state of research, then introduces specific applications of frequently used text mining methods, such as information extraction, text classification, text clustering, and association rules and pattern discovery, in humanities and social science research, so as to expand the application domains of text mining and provide new ideas for methodological innovation in the humanities and social sciences.

  7. Biomedical Engineering

    CERN Document Server

    Suh, Sang C; Tanik, Murat M

    2011-01-01

    Biomedical Engineering: Health Care Systems, Technology and Techniques is an edited volume with contributions from world experts. It provides readers with unique contributions related to current research and future healthcare systems. Practitioners and researchers focused on computer science, bioinformatics, engineering and medicine will find this book a valuable reference.

  8. Biomedical Libraries

    Science.gov (United States)

    Pizer, Irwin H.

    1978-01-01

    Biomedical libraries are discussed as a distinct and specialized group of special libraries and their unique services and user interactions are described. The move toward professional standards, as evidenced by the Medical Library Association's new certification program, and the current state of development for a new section of IFLA established…

  9. A re-evaluation of biomedical named entity-term relations.

    Science.gov (United States)

    Ohta, Tomoko; Pyysalo, Sampo; Kim, Jin-Dong; Tsujii, Jun'ichi

    2010-10-01

    Text mining can support the interpretation of the enormous quantity of textual data produced in biomedical field. Recent developments in biomedical text mining include advances in the reliability of the recognition of named entities (NEs) such as specific genes and proteins, as well as movement toward richer representations of the associations of NEs. We argue that this shift in representation should be accompanied by the adoption of a more detailed model of the relations holding between NEs and other relevant domain terms. As a step toward this goal, we study NE-term relations with the aim of defining a detailed, broadly applicable set of relation types based on accepted domain standard concepts for use in corpus annotation and domain information extraction approaches.

  10. The MONK Project and Its Lessons for Text Mining in the Humanities in China

    Institute of Scientific and Technical Information of China (English)

    许鑫; 郭金龙; 蔚海燕

    2012-01-01

    MONK is a cross-disciplinary text mining project in the humanities undertaken by several universities and research institutes in America and Canada. This paper discusses the text mining process of MONK, together with the relevant tools, techniques and algorithms. Two case studies based on MONK are introduced to detail the application of text mining to the humanities, showing how the tools provided by MONK support literary text mining research. Finally, the authors summarize several typical applications of text mining in the humanities and discuss what China can learn from the MONK project.

  11. Effective use of Latent Semantic Indexing and Computational Linguistics in Biological and Biomedical Applications

    Directory of Open Access Journals (Sweden)

    Hongyu Chen

    2013-01-01

    Full Text Available Text mining is rapidly becoming an essential technique for the annotation and analysis of large biological data sets. Biomedical literature currently increases at a rate of several thousand papers per week, making automated information retrieval methods the only feasible method of managing this expanding corpus. With the increasing prevalence of open-access journals and constant growth of publicly-available repositories of biomedical literature, literature mining has become much more effective with respect to the extraction of biomedically-relevant data. In recent years, text mining of popular databases such as MEDLINE has evolved from basic term-searches to more sophisticated natural language processing techniques, indexing and retrieval methods, structural analysis and integration of literature with associated metadata. In this review, we will focus on Latent Semantic Indexing (LSI), a computational linguistics technique increasingly used for a variety of biological purposes. It is noted for its ability to consistently outperform benchmark Boolean text searches and co-occurrence models at information retrieval and its power to extract indirect relationships within a data set. LSI has been used successfully to formulate new hypotheses, generate novel connections from existing data, and validate empirical data.

  12. Using a search engine-based mutually reinforcing approach to assess the semantic relatedness of biomedical terms.

    Directory of Open Access Journals (Sweden)

    Yi-Yu Hsu

    Full Text Available BACKGROUND: Determining the semantic relatedness of two biomedical terms is an important task for many text-mining applications in the biomedical field. Previous studies, such as those using ontology-based and corpus-based approaches, measured semantic relatedness by using information from the structure of biomedical literature, but these methods are limited by the small size of training resources. To increase the size of training datasets, the outputs of search engines have been used extensively to analyze the lexical patterns of biomedical terms. METHODOLOGY/PRINCIPAL FINDINGS: In this work, we propose the Mutually Reinforcing Lexical Pattern Ranking (ReLPR) algorithm for learning and exploring the lexical patterns of synonym pairs in biomedical text. ReLPR employs lexical patterns and their pattern containers to assess the semantic relatedness of biomedical terms. By combining sentence structures and the linking activities between containers and lexical patterns, our algorithm can explore the correlation between two biomedical terms. CONCLUSIONS/SIGNIFICANCE: The average correlation coefficient of the ReLPR algorithm was 0.82 for various datasets. The results of the ReLPR algorithm were significantly superior to those of previous methods.

  13. Big Data Knowledge Mining

    Directory of Open Access Journals (Sweden)

    Huda Umar Banuqitah

    2016-11-01

    Full Text Available The Big Data (BD) era has arrived: big data applications have risen where information accumulation grows beyond the ability of present software tools to capture, manage and process it within a tolerably short time. Volume is not the only characteristic that defines big data; so do velocity, variety, and value. Many resources contain BD that should be processed. The biomedical research literature is one among many domains that hides rich knowledge. MEDLINE is a huge biomedical research database which remains a significantly underutilized source of biological information. Discovering useful knowledge from such a huge corpus raises many problems related to the type of information, such as the related concepts of the domain of the texts and the semantic relationships associated with them. In this paper, a two-level agent-based system for self-supervised relation extraction from MEDLINE using the Unified Medical Language System (UMLS) knowledge base is proposed. The model uses a self-supervised approach for relation extraction (RE) by constructing enhanced training examples using information from UMLS with hybrid text features. The model incorporates the Apache Spark and HBase BD technologies with multiple data mining and machine learning techniques within a Multi-Agent System (MAS). The system shows better results in comparison with the current state of the art and the naïve approach in terms of accuracy, precision, recall and F-score.

  14. Biomedical Materials

    Institute of Scientific and Technical Information of China (English)

    CHANG Jiang; ZHOU Yanling

    2011-01-01

    Biomedical materials, biomaterials for short, are regarded as "any substance or combination of substances, synthetic or natural in origin, which can be used for any period of time, as a whole or as part of a system which treats, augments, or replaces any tissue, organ or function of the body" (Vonrecum & Laberge, 1995). Biomaterials can save lives, relieve suffering and enhance the quality of life for human beings.

  15. A Text Sentiment Analysis Algorithm Incorporating Semantic Association Mining

    Institute of Scientific and Technical Information of China (English)

    明均仁

    2012-01-01

    Facing the ever-richer textual sentiment information resources on the network, using association mining technology to mine and analyse them automatically and intelligently, so as to obtain user sentiment knowledge at the semantic level, has important potential value for enterprises in formulating competitive strategies and maintaining competitive advantage. This paper integrates association mining technology into text sentiment analysis, and researches and designs a text sentiment analysis algorithm incorporating semantic association mining to realize sentiment analysis and user sentiment knowledge mining at the semantic level. Experimental results demonstrate that the algorithm achieves the expected effect, dramatically improving the accuracy and efficiency of sentiment analysis as well as the depth and breadth of association mining.
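
Association mining over sentiment-bearing text typically reduces to support and confidence computations over term "transactions"; a toy sketch, where the vocabulary and the review transactions are invented for illustration:

```python
transactions = [   # hypothetical reviews as sets of co-occurring terms
    {"battery", "poor", "negative"},
    {"battery", "short", "negative"},
    {"screen", "bright", "positive"},
    {"battery", "poor", "negative"},
]

def support(itemset):
    # fraction of transactions containing every item in the set
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs):
    # of the transactions matching lhs, the fraction that also match rhs
    return support(lhs | rhs) / support(lhs)

print(support({"battery", "negative"}))        # → 0.75
print(confidence({"battery"}, {"negative"}))   # → 1.0
```

A rule such as {battery} → {negative} with high support and confidence is the kind of semantic-level sentiment association the abstract describes.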

  16. Effective use of latent semantic indexing and computational linguistics in biological and biomedical applications.

    Science.gov (United States)

    Chen, Hongyu; Martin, Bronwen; Daimon, Caitlin M; Maudsley, Stuart

    2013-01-01

    Text mining is rapidly becoming an essential technique for the annotation and analysis of large biological data sets. Biomedical literature currently increases at a rate of several thousand papers per week, making automated information retrieval methods the only feasible method of managing this expanding corpus. With the increasing prevalence of open-access journals and constant growth of publicly-available repositories of biomedical literature, literature mining has become much more effective with respect to the extraction of biomedically-relevant data. In recent years, text mining of popular databases such as MEDLINE has evolved from basic term-searches to more sophisticated natural language processing techniques, indexing and retrieval methods, structural analysis and integration of literature with associated metadata. In this review, we will focus on Latent Semantic Indexing (LSI), a computational linguistics technique increasingly used for a variety of biological purposes. It is noted for its ability to consistently outperform benchmark Boolean text searches and co-occurrence models at information retrieval and its power to extract indirect relationships within a data set. LSI has been used successfully to formulate new hypotheses, generate novel connections from existing data, and validate empirical data.
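
At the core of LSI is a truncated singular value decomposition of the term-document matrix, which projects documents into a low-rank "latent" space where indirect associations emerge. A minimal sketch with a toy matrix (NumPy assumed available; the terms and counts are invented):

```python
import numpy as np

# toy term-document count matrix (rows = terms, columns = documents)
A = np.array([
    [2, 1, 0, 0],   # gene
    [1, 2, 0, 0],   # protein
    [1, 1, 0, 0],   # expression
    [0, 0, 2, 1],   # car
    [0, 0, 1, 2],   # engine
], dtype=float)

# truncated SVD: keep only the k strongest latent dimensions
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
docs_k = (np.diag(s[:k]) @ Vt[:k]).T    # document coordinates in LSI space

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# the two biology documents are near-identical in the latent space,
# while a biology document and a car document are nearly orthogonal
print(cosine(docs_k[0], docs_k[1]) > 0.9)        # → True
print(abs(cosine(docs_k[0], docs_k[2])) < 0.3)   # → True
```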

  17. A critical review of PASBio's argument structures for biomedical verbs

    Directory of Open Access Journals (Sweden)

    Cohen K Bretonnel

    2006-11-01

    Full Text Available Abstract Background Propositional representations of biomedical knowledge are a critical component of most aspects of semantic mining in biomedicine. However, the proper set of propositions has yet to be determined. Recently, the PASBio project proposed a set of propositions and argument structures for biomedical verbs. This initial set of representations presents an opportunity for evaluating the suitability of predicate-argument structures as a scheme for representing verbal semantics in the biomedical domain. Here, we quantitatively evaluate several dimensions of the initial PASBio propositional structure repository. Results We propose a number of metrics and heuristics related to arity, role labelling, argument realization, and corpus coverage for evaluating large-scale predicate-argument structure proposals. We evaluate the metrics and heuristics by applying them to PASBio 1.0. Conclusion PASBio demonstrates the suitability of predicate-argument structures for representing aspects of the semantics of biomedical verbs. Metrics related to theta-criterion violations and to the distribution of arguments are able to detect flaws in semantic representations, given a set of predicate-argument structures and a relatively small corpus annotated with them.

  18. Application of Text Mining Technology in Traditional Chinese Medicine Literature Research

    Institute of Scientific and Technical Information of China (English)

    郭洪涛

    2013-01-01

    Objective: To investigate the application of text mining technology in traditional Chinese medicine (TCM) literature research. Methods: Recent literature on the application of text mining technology to TCM was reviewed, and the achievements of text mining in TCM were summarized. Results: Text mining technology can parse data in both linear and nonlinear ways, performs high-level knowledge integration, and is good at handling fuzzy and non-quantitative data. In the future, text mining may integrate TCM data with protein and metabolomics data and analyse combinations of active ingredients of Chinese medicines, building a research and development platform for new drug discovery and the formation of drug combinations. Conclusion: Using text mining technology to research and analyse traditional Chinese medicine is a promising approach.

  19. A realistic assessment of methods for extracting gene/protein interactions from free text

    Directory of Open Access Journals (Sweden)

    Shepherd Adrian J

    2009-07-01

    Full Text Available Abstract Background The automated extraction of gene and/or protein interactions from the literature is one of the most important targets of biomedical text mining research. In this paper we present a realistic evaluation of gene/protein interaction mining relevant to potential non-specialist users. Hence we have specifically avoided methods that are complex to install or require reimplementation, and we coupled our chosen extraction methods with a state-of-the-art biomedical named entity tagger. Results Our results show: that performance across different evaluation corpora is extremely variable; that the use of tagged (as opposed to gold standard) gene and protein names has a significant impact on performance, with a drop in F-score of over 20 percentage points being commonplace; and that a simple keyword-based benchmark algorithm, when coupled with a named entity tagger, outperforms two of the tools most widely used to extract gene/protein interactions. Conclusion In terms of availability, ease of use and performance, the potential non-specialist user community interested in automatically extracting gene and/or protein interactions from free text is poorly served by current tools and systems. The public release of extraction tools that are easy to install and use, and that achieve state-of-the-art levels of performance, should be treated as a high priority by the biomedical text mining community.

  20. A Review of Patent Technical Topic Analysis Based on Text Mining

    Institute of Scientific and Technical Information of China (English)

    胡阿沛; 张静; 雷孝平; 张晓宇

    2013-01-01

    To cope with the challenges that the huge number of patent documents and increasingly complex technology pose to patent technical topic analysis, text-mining-based approaches to this task have recently become a research hotspot. This paper first introduces the concept of text mining and its development history. It then summarizes the current text-mining-based methods for patent technical topic analysis, including term frequency analysis, co-word analysis, text clustering analysis, and analysis combined with citation clustering; it also surveys the commonly used analytical tools and introduces SciMAT, a new science mapping analysis software tool. Finally, the paper points out the advantages and deficiencies of text-mining-based patent technical topic analysis and offers suggestions for future research.

  1. A Review of Typical Text Mining Applications in Humanities and Social Science Research

    Institute of Scientific and Technical Information of China (English)

    陆宇杰; 许鑫; 郭金龙

    2012-01-01

    This paper investigates the current status of text mining applications in the humanities and social sciences, introduces successful international cases and experience of applying text mining in these fields, presents the latest research progress of text mining in the humanities and social sciences, and offers some inspiration for carrying out related research in China.

  2. The Design and Implementation of a Text Mining Algorithm Based on the MapReduce Framework

    Institute of Scientific and Technical Information of China (English)

    朱蔷蔷; 张桂芸; 刘文龙

    2012-01-01

    As text mining is increasingly applied in active information services, analyzing the inherent characteristics of text data has become a current research trend. This paper designs and implements a text mining algorithm on the Hadoop platform. Using the MapReduce framework, the algorithm outputs adjacent word pairs from natural-language corpora in descending order of frequency, helping users mine the associations between itemsets in large volumes of data. The experimental results demonstrate the algorithm's effectiveness and good speedup on the distributed Hadoop platform.
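
    The abstract describes a MapReduce job that counts adjacent word pairs and emits them in descending order of frequency. A minimal single-process sketch of that map/shuffle/reduce flow (the function names and toy corpus are illustrative, not from the paper):

```python
from collections import defaultdict

def map_phase(document):
    """Emit ((adjacent word pair), 1) for every bigram in a document."""
    words = document.split()
    return [((words[i], words[i + 1]), 1) for i in range(len(words) - 1)]

def shuffle(mapped):
    """Group intermediate pairs by key, as the MapReduce framework would."""
    groups = defaultdict(list)
    for key, value in mapped:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Sum the counts per pair and sort in descending order of frequency."""
    counts = {pair: sum(vals) for pair, vals in groups.items()}
    return sorted(counts.items(), key=lambda kv: kv[1], reverse=True)

corpus = ["text mining on big data", "text mining with hadoop", "big data text mining"]
mapped = [pair for doc in corpus for pair in map_phase(doc)]
result = reduce_phase(shuffle(mapped))
```

    On a real Hadoop cluster the shuffle step is performed by the framework between the map and reduce tasks; here it is simulated with a dictionary.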

  3. Application of Text Mining Technology in Electric Power Work Order Data Analysis

    Institute of Scientific and Technical Information of China (English)

    邹云峰; 何维民; 赵洪莹; 程雅梦; 杨红

    2016-01-01

    Text mining provides methods and technical support for text analysis. Based on the text classification techniques of text mining, this paper briefly introduces the construction methods and processes of text preprocessing and text classification models. Taking the hot issues reported by customers to the power supply service center as an example, an automatic text classification model for 95598 work orders was established. Verification shows that the model classifies 95598 work-order text rapidly and accurately, mines hidden important information in a timely and accurate way, and provides a foundation and data basis for analyzing the influence of power supply service on customers' electricity demands.
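
    The 95598 work-order corpus and the paper's exact classifier are not public, so the following is only a toy sketch of automatic work-order classification: a multinomial Naive Bayes model with add-one smoothing, over invented category labels and training snippets:

```python
import math
from collections import Counter, defaultdict

def train_nb(labeled_docs):
    """Train multinomial Naive Bayes: log priors plus per-class word counts."""
    class_docs = defaultdict(int)
    word_counts = defaultdict(Counter)
    vocab = set()
    for words, label in labeled_docs:
        class_docs[label] += 1
        word_counts[label].update(words)
        vocab.update(words)
    total = sum(class_docs.values())
    priors = {c: math.log(n / total) for c, n in class_docs.items()}
    return priors, word_counts, vocab

def classify(words, priors, word_counts, vocab):
    """Pick the class maximizing log prior + smoothed log likelihoods."""
    scores = {}
    for c in priors:
        denom = sum(word_counts[c].values()) + len(vocab)  # add-one smoothing
        scores[c] = priors[c] + sum(
            math.log((word_counts[c][w] + 1) / denom) for w in words if w in vocab
        )
    return max(scores, key=scores.get)

training = [  # invented examples standing in for 95598 work-order text
    (["power", "outage", "street"], "fault"),
    (["no", "power", "home"], "fault"),
    (["bill", "payment", "question"], "billing"),
    (["meter", "reading", "bill"], "billing"),
]
model = train_nb(training)
label = classify(["power", "outage"], *model)
```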

  4. The structural and content aspects of abstracts versus bodies of full text journal articles are different

    Directory of Open Access Journals (Sweden)

    Roeder Christophe

    2010-09-01

    Full Text Available Abstract Background The increase in work on the full text of journal articles and the growth of PubMed Central offer the opportunity to create a major paradigm shift in how biomedical text mining is done. However, until now there has been no comprehensive characterization of how the bodies of full text journal articles differ from the abstracts that have so far been the subject of most biomedical text mining research. Results We examined the structural and linguistic aspects of abstracts and bodies of full text articles, the performance of text mining tools on both, and the distribution of a variety of semantic classes of named entities between them. We found marked structural differences, with longer sentences in the article bodies and much heavier use of parenthesized material in the bodies than in the abstracts. We found content differences with respect to linguistic features. Three out of four of the linguistic features that we examined were statistically significantly differently distributed between the two genres. We also found content differences with respect to the distribution of semantic features. There were significantly different densities per thousand words for three out of four semantic classes, and clear differences in the extent to which they appeared in the two genres. With respect to the performance of text mining tools, we found that a mutation finder performed equally well in both genres, but that a wide variety of gene mention systems performed much worse on article bodies than they did on abstracts. POS tagging was also more accurate in abstracts than in article bodies. Conclusions Aspects of structure and content differ markedly between article abstracts and article bodies. A number of these differences may pose problems as the text mining field moves more into the area of processing full-text articles. However, these differences also present a number of opportunities for the extraction of data types, particularly that

  5. A Text Vector Feature Mining Algorithm Based on Multi-Factor Analysis of Variance

    Institute of Scientific and Technical Information of China (English)

    谭海中; 何波

    2015-01-01

    Text vector feature mining is applied in the field of information resource organization and management and has great application value in large-scale data mining, but traditional text feature mining algorithms based on K-means clustering achieve poor accuracy. This paper proposes a text vector feature mining algorithm based on multi-factor analysis of variance (ANOVA). Multi-factor ANOVA is used to obtain feature mining regularities from multiple corpora. Combined with an ant colony algorithm, a fitness-probability transfer rule is trained to obtain the maximum probability of effective features in the most recently evolved data sets. The K-means initial cluster centers are then selected through an optimized partitioning: the sample data are first partitioned, and the initial cluster centers are determined according to the distribution characteristics of the samples, which improves text feature mining performance. Simulation results show that the algorithm improves the clustering of text feature vectors and hence feature mining performance, achieving higher recall and detection rates with less time consumption, and has great application value in data mining and related fields.
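
    The refinement described above seeds K-means from an optimized partition of the samples instead of random points. A simplified one-dimensional illustration of that general idea (partition the sorted data into k slices and seed each center from a slice mean; this is a sketch of the technique, not the paper's exact algorithm, and the ant colony component is omitted):

```python
def partition_centers(data, k):
    """Seed centers from the means of k contiguous slices of the sorted data."""
    ordered = sorted(data)
    size = len(ordered) // k  # any remainder is ignored for seeding only
    return [sum(ordered[i * size:(i + 1) * size]) / size for i in range(k)]

def kmeans_1d(data, k, iters=20):
    """Plain 1-D K-means, started from partition-based initial centers."""
    centers = partition_centers(data, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for x in data:
            idx = min(range(k), key=lambda i: abs(x - centers[i]))
            clusters[idx].append(x)
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return sorted(centers)

data = [1.0, 1.2, 0.8, 10.0, 10.5, 9.5]  # two obvious clusters
centers = kmeans_1d(data, 2)
```

    With well-separated data the partition-based seeds land one per true cluster, so the iterations converge immediately instead of depending on a lucky random start.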

  6. Optical Polarization in Biomedical Applications

    CERN Document Server

    Tuchin, Valery V; Zimnyakov, Dmitry A

    2006-01-01

    Optical Polarization in Biomedical Applications introduces key developments in optical polarization methods for quantitative studies of tissues, while presenting the theory of polarization transfer in a random medium as a basis for the quantitative description of polarized light interaction with tissues. This theory uses the modified transfer equation for Stokes parameters and predicts the polarization structure of multiple scattered optical fields. The backscattering polarization matrices (Jones matrix and Mueller matrix), important for noninvasive medical diagnostics, are introduced. The text also describes a number of diagnostic techniques such as CW polarization imaging and spectroscopy, polarization microscopy and cytometry. As a new tool for medical diagnosis, optical coherent polarization tomography is analyzed. The monograph also covers a range of biomedical applications, among them cataract and glaucoma diagnostics, glucose sensing, and the detection of bacteria.

  7. Application of Text Mining in Employment Information Analysis in Higher Vocational Colleges

    Institute of Scientific and Technical Information of China (English)

    宁建飞

    2016-01-01

    Common text mining tasks include text classification, text clustering, text association analysis, distribution analysis, and trend prediction. Text association analysis, a key mining task and one of the more widely used techniques in text information processing, is the main technique applied in this paper. Taking the employment information data of graduates in higher vocational colleges as the analysis object, text mining is applied to employment information analysis. Through mining the employment information data, valuable information is obtained to serve as an important reference for talent training, employment guidance, and other scientific decisions. The experimental results show that text mining is a very effective method for analyzing employment information data.

  8. Association Mining on Massive Text under All-Confidence Based on an Incremental Queue

    Institute of Scientific and Technical Information of China (English)

    刘炜

    2015-01-01

    Association mining is an important data analysis method. This article proposes an incremental-queue association mining algorithm model under the all-confidence measure; applying all-confidence rules to the traditional FP-Growth and FP-Tree association mining algorithms improves their adaptability. On this basis, the article proposes the FP4W-Growth algorithm and applies it to association computation over text data and to association mining of incremental data. Verification experiments show the feasibility and the optimization benefits of the algorithm and model, providing a scientific decision-making approach for discovering hidden, previously unknown, and potentially useful information and patterns in massive text data.
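
    The all-confidence (全置信度) of an itemset is its support divided by the largest support of any single item it contains; a pattern is kept only when this ratio clears a threshold, which prunes itemsets dominated by one very frequent item. A minimal sketch of the measure over toy transactions (this illustrates the pruning criterion only, not the FP4W-Growth algorithm itself):

```python
def support(itemset, transactions):
    """Fraction of transactions containing every item of the itemset."""
    hits = sum(1 for t in transactions if set(itemset) <= t)
    return hits / len(transactions)

def all_confidence(itemset, transactions):
    """all_conf(X) = supp(X) / max over items i in X of supp({i})."""
    max_item_support = max(support((i,), transactions) for i in itemset)
    return support(itemset, transactions) / max_item_support

transactions = [  # toy "documents as item sets"
    {"text", "mining", "data"},
    {"text", "mining"},
    {"data", "warehouse"},
    {"text", "data"},
]
# supp({text, mining}) = 0.5, supp({text}) = 0.75, so all-confidence = 2/3
score = all_confidence(("text", "mining"), transactions)
```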

  9. Research on Text Data Mining of Human Genome Sequencing Literature

    Institute of Scientific and Technical Information of China (English)

    于跃; 潘玮; 王丽伟; 王伟

    2012-01-01

    Literature on human genome sequencing published from January 1, 2001 to May 11, 2011 was retrieved from the PubMed database. The bibliographic information was extracted and co-word cluster analysis was performed: high-frequency subject headings were extracted, and a word-document matrix, a co-occurrence matrix, and co-word clusters were generated. The study shows that text data mining can well reflect the development status and research hotspots of a discipline, thereby providing valuable information to researchers.
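
    Co-word analysis of the kind described starts from a symmetric co-occurrence matrix that counts how many documents each pair of subject headings shares; clustering is then run on that matrix. A small sketch of the matrix construction (the terms are invented for illustration):

```python
from collections import Counter
from itertools import combinations

def cooccurrence(doc_terms):
    """Count, for each unordered term pair, the number of documents sharing it."""
    matrix = Counter()
    for terms in doc_terms:
        for a, b in combinations(sorted(set(terms)), 2):  # sorted => canonical key
            matrix[(a, b)] += 1
    return matrix

docs = [
    {"genome", "sequencing", "human"},
    {"genome", "sequencing"},
    {"genome", "human"},
]
matrix = cooccurrence(docs)
```

    The resulting pair counts are exactly the off-diagonal entries of the symmetric co-occurrence matrix that co-word clustering operates on.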

  10. TargetMine, an integrated data warehouse for candidate gene prioritisation and target discovery.

    Directory of Open Access Journals (Sweden)

    Yi-An Chen

    Full Text Available Prioritising candidate genes for further experimental characterisation is a non-trivial challenge in drug discovery and biomedical research in general. An integrated approach that combines results from multiple data types is best suited for optimal target selection. We developed TargetMine, a data warehouse for efficient target prioritisation. TargetMine utilises the InterMine framework, with new data models such as protein-DNA interactions integrated in a novel way. It enables complicated searches that are difficult to perform with existing tools and it also offers integration of custom annotations and in-house experimental data. We proposed an objective protocol for target prioritisation using TargetMine and set up a benchmarking procedure to evaluate its performance. The results show that the protocol can identify known disease-associated genes with high precision and coverage. A demonstration version of TargetMine is available at http://targetmine.nibio.go.jp/.

  11. Application of an efficient Bayesian discretization method to biomedical data

    Directory of Open Access Journals (Sweden)

    Gopalakrishnan Vanathi

    2011-07-01

    Full Text Available Abstract Background Several data mining methods require data that are discrete, and other methods often perform better with discrete data. We introduce an efficient Bayesian discretization (EBD method for optimal discretization of variables that runs efficiently on high-dimensional biomedical datasets. The EBD method consists of two components, namely, a Bayesian score to evaluate discretizations and a dynamic programming search procedure to efficiently search the space of possible discretizations. We compared the performance of EBD to Fayyad and Irani's (FI discretization method, which is commonly used for discretization. Results On 24 biomedical datasets obtained from high-throughput transcriptomic and proteomic studies, the classification performances of the C4.5 classifier and the naïve Bayes classifier were statistically significantly better when the predictor variables were discretized using EBD over FI. EBD was statistically significantly more stable to the variability of the datasets than FI. However, EBD was less robust, though not statistically significantly so, than FI and produced slightly more complex discretizations than FI. Conclusions On a range of biomedical datasets, a Bayesian discretization method (EBD yielded better classification performance and stability but was less robust than the widely used FI discretization method. The EBD discretization method is easy to implement, permits the incorporation of prior knowledge and belief, and is sufficiently fast for application to high-dimensional data.

  12. The International Library Community's Fight for Text and Data Mining Rights and Its Lessons

    Institute of Scientific and Technical Information of China (English)

    于静

    2016-01-01

    Facing the constraints that copyright issues place on the application of text and data mining technology in libraries, the international library community has adopted several countermeasures to secure text and data mining rights: publishing statements of copyright principles, lobbying for copyright legislation, challenging publishers' copyright policies, and building cooperative alliances for rights protection. The results are mainly reflected in growing social recognition of and support for the libraries' copyright position, the emergence of copyright exception systems, adjustments to some publishers' copyright policies, and the diversification of libraries' copyright practice models. The practices and experience of the international library community in striving for text and data mining rights offer useful lessons for libraries in China.

  13. A Text Mining-Based Patent Analysis Method for the Video Codec Field

    Institute of Scientific and Technical Information of China (English)

    于雷; 夏鹏

    2012-01-01

    This paper introduces methods for text mining analysis of patents using advanced semantic technology and natural language processing, applies these methods to analyze patents in the video codec field, and derives some useful suggestions.

  14. Principles of Biomedical Engineering

    CERN Document Server

    Madihally, Sundararajan V

    2010-01-01

    Describing the role of engineering in medicine today, this comprehensive volume covers a wide range of the most important topics in this burgeoning field. Supported with over 145 illustrations, the book discusses bioelectrical systems, mechanical analysis of biological tissues and organs, biomaterial selection, compartmental modeling, and biomedical instrumentation. Moreover, you find a thorough treatment of the concept of using living cells in various therapeutics and diagnostics.Structured as a complete text for students with some engineering background, the book also makes a valuable refere

  15. Study on a Text Mining-Based Auto-Suggestion Method for Scientific Research Information

    Institute of Scientific and Technical Information of China (English)

    李芳; 朱群雄

    2011-01-01

    This paper studies the characteristics of text data from research journal literature, applies popular text mining techniques to the analysis and processing of such data, and proposes a research information auto-suggestion system. A case study is presented using all 2007 articles from three influential Chinese journals in the information field.

  16. A Competitor Analysis Model Based on Semantic Text Mining

    Institute of Scientific and Technical Information of China (English)

    唐晓波; 郭萍

    2013-01-01

    To remedy the inability of traditional competitor analysis methods to mine information about corporate competitors on the Web effectively, this paper introduces semantic text mining into enterprise competitor analysis and proposes a competitor analysis model based on it. The model uses rule-based topical crawling to obtain structured information, employs a competitive-intelligence domain ontology knowledge base and a semantic VSM matrix to achieve semantic analysis and description of competitor information, and extracts deep-level semantic knowledge about rivals through semantic text mining. An empirical study of two competing enterprises in the camera market, Canon and Nikon, shows that the model has potential practical value and can effectively improve enterprise decision-making.

  17. BioN∅T: A searchable database of biomedical negated sentences

    Directory of Open Access Journals (Sweden)

    Agarwal Shashank

    2011-10-01

    Full Text Available Abstract Background Negated biomedical events are often ignored by text-mining applications; however, such events carry scientific significance. We report on the development of BioN∅T, a database of negated sentences that can be used to extract such negated events. Description Currently BioN∅T incorporates ≈32 million negated sentences, extracted from over 336 million biomedical sentences from three resources: ≈2 million full-text biomedical articles from Elsevier and PubMed Central, as well as ≈20 million abstracts in PubMed. We evaluated BioN∅T on three important genetic disorders: autism, Alzheimer's disease and Parkinson's disease, and found that BioN∅T is able to capture negated events that may be ignored by experts. Conclusions The BioN∅T database can be a useful resource for biomedical researchers. BioN∅T is freely available at http://bionot.askhermes.org/. In future work, we will develop semantic web related technologies to enrich BioN∅T.
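
    BioN∅T's extraction pipeline is not described in detail here, but the core subtask of flagging sentences that contain a negation can be sketched with a simple cue-word detector in the spirit of NegEx (the cue list below is illustrative and far from complete):

```python
import re

# Illustrative negation cues; a production system would use a curated lexicon
# and scope rules rather than a bare word list.
NEGATION_CUES = {"no", "not", "without", "absence", "never", "lack", "cannot", "fails"}

def is_negated(sentence):
    """Flag a sentence as negated if it contains any negation cue word."""
    tokens = re.findall(r"[a-z]+", sentence.lower())
    return any(tok in NEGATION_CUES for tok in tokens)

sentences = [
    "TP53 mutation was not detected in this cohort.",
    "The protein binds DNA in vitro.",
]
flags = [is_negated(s) for s in sentences]
```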

  18. Semi-supervised learning of causal relations in biomedical scientific discourse

    Science.gov (United States)

    2014-01-01

    Background The number of articles published daily in the biomedical domain has become too large for humans to handle on their own. As a result, bio-text mining technologies have been developed to reduce researchers' workload by automatically analysing the text and extracting important knowledge. Specific bio-entities, bio-events between these, and facts can now be recognised with sufficient accuracy and are widely used by biomedical researchers. However, understanding how the extracted facts are connected in text is an extremely difficult task, which cannot be easily tackled by machinery. Results In this article, we describe our method to recognise causal triggers and their arguments in biomedical scientific discourse. We introduce new features and show that a self-learning approach improves the performance obtained by supervised machine learners to 83.47% for causal triggers. Furthermore, the spans of causal arguments can be recognised to a slightly higher level than by the supervised or rule-based methods that have been employed before. Conclusion Exploiting the large amount of unlabelled data that is already available can help improve the performance of recognising causal discourse relations in the biomedical domain. This improvement will further benefit the development of multiple tasks, such as hypothesis generation for experimental laboratories, contradiction detection, and the creation of causal networks. PMID:25559746
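
    The self-learning approach reported above iteratively grows the labeled set with the classifier's most confident predictions on unlabeled data. A minimal numeric sketch of that loop, using a nearest-centroid classifier and a margin-based confidence (a generic illustration of self-training, not the paper's feature-rich model):

```python
def self_train(labeled, unlabeled, rounds=5):
    """Nearest-centroid self-training: each round, adopt the most confident
    unlabeled point (largest margin between its two nearest class centroids)."""
    labeled = list(labeled)
    pool = list(unlabeled)
    for _ in range(rounds):
        if not pool:
            break
        # Recompute class centroids from the current labeled set.
        cents = {}
        for c in {lbl for _, lbl in labeled}:
            vals = [x for x, lbl in labeled if lbl == c]
            cents[c] = sum(vals) / len(vals)

        def margin(x):
            d = sorted(abs(x - m) for m in cents.values())
            return d[1] - d[0]  # gap between nearest and second-nearest centroid

        best = max(pool, key=margin)  # most confident unlabeled point
        label = min(cents, key=lambda c: abs(best - cents[c]))
        labeled.append((best, label))
        pool.remove(best)
    return labeled

labeled = [(0.0, "neg"), (10.0, "pos")]
unlabeled = [1.0, 9.0, 5.2]
final = self_train(labeled, unlabeled)
```

    Easy points near a centroid are absorbed first; the ambiguous midpoint is labeled last, once the centroids have been refined by the earlier rounds.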

  19. PALM-IST: Pathway Assembly from Literature Mining - an Information Search Tool

    Science.gov (United States)

    Mandloi, Sapan; Chakrabarti, Saikat

    2015-01-01

    Manual curation of biomedical literature has become an extremely tedious process due to its exponential growth in recent years. To extract meaningful information from such large and unstructured text, newer and more efficient mining tools are required. Here, we introduce PALM-IST, a computational platform that not only allows users to explore biomedical abstracts using keyword based text mining but also extracts biological entity (e.g., gene/protein, drug, disease, biological processes, cellular component, etc.) information from the extracted text and subsequently mines various databases to provide their comprehensive inter-relation (e.g., interaction, expression, etc.). PALM-IST constructs protein interaction network and pathway information data relevant to the text search using multiple data mining tools and assembles them to create a meta-interaction network. It also analyzes scientific collaboration by extraction and creation of a “co-authorship network” for a given search context. Hence, this useful combination of literature and data mining provided in PALM-IST can be used to extract novel protein-protein interactions (PPIs), to generate meta-pathways and further to identify key crosstalk and bottleneck proteins. PALM-IST is available at www.hpppi.iicb.res.in/ctm. PMID:25989388

  20. BIG: a Grid Portal for Biomedical Data and Images

    Directory of Open Access Journals (Sweden)

    Giovanni Aloisio

    2004-06-01

    Full Text Available Modern management of biomedical systems involves the use of many distributed resources, such as high performance computational resources to analyze biomedical data, mass storage systems to store them, medical instruments (microscopes, tomographs, etc.), and advanced visualization and rendering tools. Grids offer the computational power, security and availability needed by such novel applications. This paper presents BIG (Biomedical Imaging Grid), a Web-based Grid portal for the management of biomedical information (data and images) in a distributed environment. BIG is an interactive environment that deals with complex user requests, regarding the acquisition of biomedical data and the processing and delivering of biomedical images, using the power and security of Computational Grids.

  1. Current Market Demand for Core Competencies of Librarianship—A Text Mining Study of American Library Association’s Advertisements from 2009 through 2014

    Directory of Open Access Journals (Sweden)

    Qinghong Yang

    2016-02-01

    Full Text Available As librarianship evolves, it is important to examine the changes that have taken place in professional requirements. To provide an understanding of the current market demand for core competencies of librarianship, this article applies a semi-automatic methodology to analyze job advertisements (ads) posted on the American Library Association (ALA) Joblist from 2009 through 2014. There is evidence that the ability to solve unexpected complex problems and to provide superior customer service gained increasing importance for librarians during those years. The authors contend that the findings in this report question the status quo of core competencies of librarianship in the US job market.

  2. Biomedical engineering fundamentals

    CERN Document Server

    Bronzino, Joseph D

    2014-01-01

    Known as the bible of biomedical engineering, The Biomedical Engineering Handbook, Fourth Edition, sets the standard against which all other references of this nature are measured. As such, it has served as a major resource for both skilled professionals and novices to biomedical engineering.Biomedical Engineering Fundamentals, the first volume of the handbook, presents material from respected scientists with diverse backgrounds in physiological systems, biomechanics, biomaterials, bioelectric phenomena, and neuroengineering. More than three dozen specific topics are examined, including cardia

  3. A Comparative Study of Root -Based and Stem -Based Approaches for Measuring the Similarity Between Arabic Words for Arabic Text Mining Applications

    Directory of Open Access Journals (Sweden)

    Hanane FROUD

    2012-12-01

    Full Text Available Representation of the semantic information contained in words is needed for any Arabic text mining application. More precisely, the purpose is to better take into account the semantic dependencies between words expressed by the co-occurrence frequencies of these words. There have been many proposals to compute similarities between words based on their distributions in contexts. In this paper, we compare and contrast the effect of two preprocessing techniques applied to an Arabic corpus, the root-based (stemming) and stem-based (light stemming) approaches, for measuring the similarity between Arabic words with the well-known abstractive model Latent Semantic Analysis (LSA) and a wide variety of distance functions and similarity measures, such as the Euclidean distance, cosine similarity, Jaccard coefficient, and Pearson correlation coefficient. The obtained results show that, on the one hand, the variety of the corpus produces more accurate results; on the other hand, the stem-based approach outperforms the root-based one because the latter affects word meanings.
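
    The study compares several similarity measures over word vectors. A sketch of two of them, cosine similarity and a generalized (Tanimoto) Jaccard coefficient, applied to toy co-occurrence count vectors (the Arabic stemming step itself is outside the scope of this sketch):

```python
import math

def cosine(u, v):
    """Cosine similarity between two count vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def jaccard(u, v):
    """Generalized (Tanimoto) Jaccard coefficient for count vectors:
    u.v / (|u|^2 + |v|^2 - u.v)."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sum(a * a for a in u) + sum(b * b for b in v) - dot)

# Toy co-occurrence counts of two words against four context terms.
u, v = [1, 2, 0, 1], [1, 1, 1, 0]
cos_sim = cosine(u, v)
jac_sim = jaccard(u, v)
```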

  4. Implementation of Paste Backfill Mining Technology in Chinese Coal Mines

    Directory of Open Access Journals (Sweden)

    Qingliang Chang

    2014-01-01

    Full Text Available Implementation of clean mining technology at coal mines is crucial to protect the environment and maintain balance among energy resources, consumption, and ecology. After reviewing present coal clean mining technology, we introduce the technology principles and technological process of paste backfill mining in coal mines and discuss the components and features of backfill materials, the constitution of the backfill system, and the backfill process. Specific implementation of this technology and its application are analyzed for paste backfill mining in Daizhuang Coal Mine; a practical implementation shows that paste backfill mining can improve the safety and excavation rate of coal mining, which can effectively resolve surface subsidence problems caused by underground mining activities, by utilizing solid waste such as coal gangues as a resource. Therefore, paste backfill mining is an effective clean coal mining technology, which has widespread application.

  5. Text Mining Technique and Its Application in Lifecycle Condition Assessment for Circuit Breakers

    Institute of Scientific and Technical Information of China (English)

    邱剑; 王慧芳; 应高亮; 张波; 邹国平; 何奔腾

    2016-01-01

    In power grids, operation and maintenance engineers have recorded large numbers of texts and logs during maintenance and inspection activities. These textual data contain abundant asset health information. So far, however, few studies, if any, have examined text mining techniques in the power grid domain. We take the circuit breaker (CB) as a case in point to establish a framework for text mining-based lifecycle condition assessment. Firstly, the key issues of text mining and lifecycle condition assessment models are identified based on a review of CB condition assessment research. The framework then comprises hidden Markov model (HMM)-based text preprocessing and vectorization, text classification with a self-tuning interval search k-nearest neighbor (KNN) algorithm, and a proportional health-index fusion model (PHFM). Finally, real defect texts collected from a power company are used in a demonstration example, which shows that the text mining technique can learn similar defects across assets by itself, and that the PHFM presents the historical data stream and lifecycle health index more rigorously.
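
    The framework above classifies defect texts with a k-nearest-neighbour vote over vectorized records. A minimal KNN sketch using cosine similarity on bag-of-words vectors (the defect snippets and labels are invented; the HMM preprocessing and self-tuning interval search are not reproduced):

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    shared = set(a) & set(b)
    dot = sum(a[w] * b[w] for w in shared)
    na = math.sqrt(sum(c * c for c in a.values()))
    nb = math.sqrt(sum(c * c for c in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def knn_classify(text, labeled, k=3):
    """Majority vote among the k training texts most similar to the query."""
    vec = Counter(text.split())
    ranked = sorted(labeled, key=lambda lv: cosine(vec, Counter(lv[0].split())),
                    reverse=True)
    votes = [label for _, label in ranked[:k]]
    return Counter(votes).most_common(1)[0][0]

labeled = [  # invented stand-ins for recorded defect texts
    ("breaker failed to open on trip command", "mechanism defect"),
    ("slow opening time of breaker mechanism", "mechanism defect"),
    ("sf6 gas pressure low alarm", "gas leak"),
    ("sf6 density relay alarm low pressure", "gas leak"),
]
label = knn_classify("breaker trip command failed", labeled, k=3)
```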

  6. Functionalized carbon nanotubes: biomedical applications

    Directory of Open Access Journals (Sweden)

    Vardharajula S

    2012-10-01

    Full Text Available Sandhya Vardharajula,1 Sk Z Ali,2 Pooja M Tiwari,1 Erdal Eroğlu,1 Komal Vig,1 Vida A Dennis,1 Shree R Singh1 1Center for NanoBiotechnology and Life Sciences Research, Alabama State University, Montgomery, AL, USA; 2Department of Microbiology, Osmania University, Hyderabad, India. Abstract: Carbon nanotubes (CNTs) are emerging as novel nanomaterials for various biomedical applications. CNTs can be used to deliver a variety of therapeutic agents, including biomolecules, to the target disease sites. In addition, their unparalleled optical and electrical properties make them excellent candidates for bioimaging and other biomedical applications. However, the high cytotoxicity of CNTs limits their use in humans and many biological systems. The biocompatibility and low cytotoxicity of CNTs are attributed to size, dose, duration, testing systems, and surface functionalization. The functionalization of CNTs improves their solubility and biocompatibility and alters their cellular interaction pathways, resulting in much-reduced cytotoxic effects. Functionalized CNTs are promising novel materials for a variety of biomedical applications. These potential applications are particularly enhanced by their ability to penetrate biological membranes with relatively low cytotoxicity. This review is directed towards the overview of CNTs and their functionalization for biomedical applications with minimal cytotoxicity. Keywords: carbon nanotubes, cytotoxicity, functionalization, biomedical applications

  7. Passage-Based Bibliographic Coupling: An Inter-Article Similarity Measure for Biomedical Articles.

    Directory of Open Access Journals (Sweden)

    Rey-Long Liu

Full Text Available Biomedical literature is an essential source of biomedical evidence. To translate the evidence for biomedicine study, researchers often need to carefully read multiple articles about specific biomedical issues. These articles thus need to be highly related to each other. They should share similar core contents, including research goals, methods, and findings. However, given an article r, it is challenging for search engines to retrieve highly related articles for r. In this paper, we present a technique PBC (Passage-based Bibliographic Coupling) that estimates inter-article similarity by seamlessly integrating bibliographic coupling with the information collected from context passages around important out-link citations (references) in each article. Empirical evaluation shows that PBC can significantly improve the retrieval of those articles that biomedical experts believe to be highly related to specific articles about gene-disease associations. PBC can thus be used to improve search engines in retrieving the highly related articles for any given article r, even when r is cited by very few (or even no) articles. The contribution is essential for those researchers and text mining systems that aim at cross-validating the evidence about specific gene-disease associations.
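Classical bibliographic coupling, which PBC extends, can be computed from reference lists alone. The second function below is a hypothetical passage-weighted variant (not the paper's exact formula) that additionally weights each shared reference by the word overlap of the passages citing it in the two articles.

```python
import math

def coupling_strength(refs_a, refs_b):
    """Cosine-normalized bibliographic coupling: shared references divided
    by the geometric mean of the two reference-list sizes."""
    shared = set(refs_a) & set(refs_b)
    if not refs_a or not refs_b:
        return 0.0
    return len(shared) / math.sqrt(len(set(refs_a)) * len(set(refs_b)))

def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def pbc_similarity(refs_a, refs_b, ctx_a, ctx_b):
    """Hypothetical passage-weighted coupling: each shared reference
    contributes the Jaccard overlap of the citation-context word sets
    (ctx_a, ctx_b map reference id -> list of context words)."""
    shared = set(refs_a) & set(refs_b)
    if not shared:
        return 0.0
    total = sum(jaccard(ctx_a[r], ctx_b[r]) for r in shared)
    return total / math.sqrt(len(set(refs_a)) * len(set(refs_b)))
```

Two articles that cite the same work for the same reason (similar context passages) thus score higher than two that merely share a reference.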

  8. Biomedical microsystems

    CERN Document Server

    Meng, Ellis

    2010-01-01

Introduction: Evolution of MEMS; Applications of MEMS; BioMEMS Applications; MEMS Resources; Text Goals and Organization. Miniaturization and Scaling. BioMEMS Materials: Traditional MEMS and Microelectronic Materials; Polymeric Materials for MEMS; Biomaterials. Microfabrication Methods and Processes for BioMEMS: Introduction; Microlithography; Doping; Micromachining; Wafer Bonding, Assembly, and Packaging; Surface Treatment; Conversion Factors for Energy and Intensity Units; Laboratory Exercises. Microfluidics: Introduction and Fluid Properties; Concepts in Microfluidics; Fluid-Transport Phenomena and Pumping; Flow Control; Laboratory Exercises

  9. 面向海量高维数据的文本主题发现%Text Topic Mining Oriented toward Massive High-dimensional Data

    Institute of Scientific and Technical Information of China (English)

    王和勇; 蓝金炯

    2015-01-01

Considering the constraints of the Latent Semantic Analysis (LSA) method on massive high-dimensional data, this paper proposes an improved LSA method based on the k-means algorithm, called KLSA. The method uses k-means clustering to preprocess the feature words, reducing them to a relatively low-dimensional space before the LSA method is applied. Text data from Sina Weibo are chosen for the experiments, which show that the proposed method satisfies the computational-efficiency requirements that massive high-dimensional data impose on LSA while preserving classification performance.
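A minimal sketch of the preprocessing idea: cluster the term vectors with k-means, then represent each document by per-cluster counts, so the subsequent LSA/SVD step runs on a much smaller matrix. The data, the choice of Euclidean k-means, and the function names are illustrative.

```python
import random

def kmeans(points, k, iters=50, seed=0):
    """Plain k-means on dense vectors (lists of floats); returns the
    cluster index assigned to each input point."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)

    def d2(p, q):  # squared Euclidean distance
        return sum((a - b) ** 2 for a, b in zip(p, q))

    assign = [0] * len(points)
    for _ in range(iters):
        assign = [min(range(k), key=lambda c: d2(p, centers[c])) for p in points]
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:  # recompute centroid; leave empty clusters alone
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign

def cluster_features(doc_term_rows, term_assign, k):
    """Collapse each document's per-term counts into k per-cluster counts."""
    out = []
    for row in doc_term_rows:
        feat = [0.0] * k
        for j, count in enumerate(row):
            feat[term_assign[j]] += count
        out.append(feat)
    return out
```

Standard LSA (a truncated SVD) would then be applied to the document-by-cluster matrix instead of the full document-by-term matrix.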

  10. Statistics in biomedical research

    Directory of Open Access Journals (Sweden)

    González-Manteiga, Wenceslao

    2007-06-01

Full Text Available The discipline of biostatistics is nowadays a fundamental scientific component of biomedical, public health and health services research. Traditional and emerging areas of application include clinical trials research, observational studies, physiology, imaging, and genomics. The present article reviews the current situation of biostatistics, considering the statistical methods traditionally used in biomedical research, as well as the ongoing development of new methods in response to the new problems arising in medicine. Clearly, the successful application of statistics in biomedical research requires appropriate training of biostatisticians. This training should aim to give due consideration to emerging new areas of statistics, while at the same time retaining full coverage of the fundamentals of statistical theory and methodology. In addition, it is important that students of biostatistics receive formal training in relevant biomedical disciplines, such as epidemiology, clinical trials, molecular biology, genetics, and neuroscience.

  11. Commercial Data Mining Software

    Science.gov (United States)

    Zhang, Qingyu; Segall, Richard S.

This chapter discusses selected commercial software for data mining, supercomputing data mining, text mining, and web mining. The selected software packages are compared in terms of their features and are also applied to available data sets. The software for data mining are SAS Enterprise Miner, Megaputer PolyAnalyst 5.0, PASW (formerly SPSS Clementine), IBM Intelligent Miner, and BioDiscovery GeneSight. The software for supercomputing are Avizo by Visualization Science Group and JMP Genomics from SAS Institute. The software for text mining are SAS Text Miner and Megaputer PolyAnalyst 5.0. The software for web mining are Megaputer PolyAnalyst and SPSS Clementine. Background on related literature and software is presented. Screen shots of each of the selected software are presented, as are conclusions and future directions.

  12. Special Issue: 3D Printing for Biomedical Engineering

    Directory of Open Access Journals (Sweden)

    Chee Kai Chua

    2017-02-01

Full Text Available Three-dimensional (3D) printing has a long history of applications in biomedical engineering. The development and expansion of traditional biomedical applications are being advanced and enriched by new printing technologies. New biomedical applications such as bioprinting are highly attractive and trendy. This Special Issue aims to provide readers with a glimpse of the recent profile of 3D printing in biomedical research.

  13. Layout-aware text extraction from full-text PDF of scientific articles

    Directory of Open Access Journals (Sweden)

    Ramakrishnan Cartic

    2012-05-01

Full Text Available Abstract Background The Portable Document Format (PDF) is the most commonly used file format for online scientific publications. The absence of effective means to extract text from these PDF files in a layout-aware manner presents a significant challenge for developers of biomedical text mining or biocuration informatics systems that use published literature as an information source. In this paper we introduce the ‘Layout-Aware PDF Text Extraction’ (LA-PDFText) system to facilitate accurate extraction of text from PDF files of research articles for use in text mining applications. Results Our paper describes the construction and performance of an open source system that extracts text blocks from PDF-formatted full-text research articles and classifies them into logical units based on rules that characterize specific sections. The LA-PDFText system focuses only on the textual content of the research articles and is meant as a baseline for further experiments into more advanced extraction methods that handle multi-modal content, such as images and graphs. The system works in a three-stage process: (1) detecting contiguous text blocks using spatial layout processing to locate and identify blocks of contiguous text, (2) classifying text blocks into rhetorical categories using a rule-based method, and (3) stitching classified text blocks together in the correct order, resulting in the extraction of text from section-wise grouped blocks. We show that our system can identify text blocks and classify them into rhetorical categories with precision = 0.96, recall = 0.89 and F1 = 0.91. We also present an evaluation of the accuracy of the block detection algorithm used in step 1. Additionally, we have compared the accuracy of the text extracted by LA-PDFText to the text from the Open Access subset of PubMed Central, and compared this accuracy with that of the text extracted by the PDF2Text system, commonly used to extract text from PDF
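Stage 2 of the pipeline, rule-based classification of text blocks into rhetorical categories, can be sketched with a few regular-expression rules keyed on a block's first line. The rules below are hypothetical examples, not LA-PDFText's actual rule set.

```python
import re

# Hypothetical section-cue rules: (pattern on the block's first line, label).
RULES = [
    (re.compile(r"^\s*abstract\b", re.I), "abstract"),
    (re.compile(r"^\s*(background|introduction)\b", re.I), "introduction"),
    (re.compile(r"^\s*(methods?|materials)\b", re.I), "methods"),
    (re.compile(r"^\s*results?\b", re.I), "results"),
    (re.compile(r"^\s*(discussion|conclusions?)\b", re.I), "discussion"),
    (re.compile(r"^\s*references\b", re.I), "references"),
]

def classify_block(block_text):
    """Assign a rhetorical category to a detected text block from cues in
    its first line; blocks with no cue default to section body text."""
    first_line = block_text.strip().splitlines()[0] if block_text.strip() else ""
    for pattern, label in RULES:
        if pattern.search(first_line):
            return label
    return "body"
```

A real system would combine such cues with layout features (font size, position) before the stage-3 stitching step.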

  14. Science and Technology Text Mining: Electrochemical Power

    Science.gov (United States)

    2003-07-14


  15. Science and Technology Text Mining: Analytical Chemistry

    Science.gov (United States)

    2001-01-01

Fragmentary excerpt: the report describes literature-based discovery applied to Raynaud's syndrome, a circulatory disease. An initial study focused on identifying treatments by linking two separately generated literatures; one interesting discovery involved dietary eicosapentaenoic acid.

  16. Science and Technology Text Mining Basic Concepts

    Science.gov (United States)

    2003-01-01


  17. Science and Technology Text Mining: Wireless LANS

    Science.gov (United States)

    2005-01-01


  18. Science and Technology Text Mining: Nonlinear Dynamics

    Science.gov (United States)

    2004-02-01


  19. Exploring Dimensionality Reduction for Text Mining

    Science.gov (United States)

    2007-05-04


  20. Checklists in biomedical publications

    Directory of Open Access Journals (Sweden)

    Pardal-Refoyo JL

    2013-12-01

Full Text Available Introduction and objectives: authors, reviewers, editors and readers need specific tools to help them draft, review, or read articles. Objective: to offer a summary of the major checklists for different types of biomedical research articles. Material and method: review of the literature and resources of the EQUATOR Network and the Spanish-language adaptations published by the journals Medicina Clínica and Evidencias en Pediatría. Results: the checklists elaborated by various working groups for experimental studies (CONSORT and TREND), observational studies (STROBE), diagnostic accuracy studies (STARD), systematic reviews and meta-analyses (PRISMA), and quality improvement studies (SQUIRE). Conclusions: the use of checklists helps to improve the quality of articles and assists authors, reviewers, editors and readers in the development and understanding of their content.

  1. Data mining in radiology

    Directory of Open Access Journals (Sweden)

    Amit T Kharat

    2014-01-01

Full Text Available Data mining facilitates the study of radiology data in various dimensions. It converts large patient image and text datasets into useful information that helps in improving patient care and provides informative reports. Data mining technology analyzes data within the Radiology Information System and Hospital Information System using specialized software which assesses relationships and agreement in available information. By using similar data analysis tools, radiologists can make informed decisions and predict the future outcome of a particular imaging finding. Data, information and knowledge are the components of data mining. Classes, Clusters, Associations, Sequential patterns, Classification, Prediction and Decision tree are the various types of data mining. Data mining has the potential to make delivery of health care affordable and ensure that the best imaging practices are followed. It is a tool for academic research. Data mining is considered to be ethically neutral; however, concerns regarding privacy and legality exist which need to be addressed to ensure the success of data mining.

  2. Passage-Based Bibliographic Coupling: An Inter-Article Similarity Measure for Biomedical Articles.

    Science.gov (United States)

    Liu, Rey-Long

    2015-01-01

    Biomedical literature is an essential source of biomedical evidence. To translate the evidence for biomedicine study, researchers often need to carefully read multiple articles about specific biomedical issues. These articles thus need to be highly related to each other. They should share similar core contents, including research goals, methods, and findings. However, given an article r, it is challenging for search engines to retrieve highly related articles for r. In this paper, we present a technique PBC (Passage-based Bibliographic Coupling) that estimates inter-article similarity by seamlessly integrating bibliographic coupling with the information collected from context passages around important out-link citations (references) in each article. Empirical evaluation shows that PBC can significantly improve the retrieval of those articles that biomedical experts believe to be highly related to specific articles about gene-disease associations. PBC can thus be used to improve search engines in retrieving the highly related articles for any given article r, even when r is cited by very few (or even no) articles. The contribution is essential for those researchers and text mining systems that aim at cross-validating the evidence about specific gene-disease associations.

  3. 改进的朴素贝叶斯聚类Web文本分类挖掘技术%The Improved Naive Bayes Text Classification Data Mining Clustering Web

    Institute of Scientific and Technical Information of China (English)

    高胜利

    2012-01-01

This paper first introduces the basic theory of Web mining and text classification, and analyses the characteristics of Web data in detail. Building on the traditional Bayesian clustering algorithm, it uses webpage markup to effectively compensate for the shortcomings of the naive Bayes algorithm, and applies the improved method to text classification. Experimental results show that the method can classify text effectively.
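The core classifier can be sketched as a multinomial naive Bayes with Laplace smoothing, where the webpage-markup idea is approximated by weighting a term more heavily when it appears inside an emphasised tag. The tag weights and training data are illustrative assumptions, not the paper's actual scheme.

```python
import math
from collections import Counter, defaultdict

# Assumed tag weights (illustrative): terms inside emphasised page markup
# count more toward the class model.
TAG_WEIGHT = {"title": 3.0, "h1": 2.0, "body": 1.0}

def train_nb(docs, labels):
    """docs: list of documents, each a list of (term, tag) pairs.
    Returns (class counts, per-class weighted term counts, vocabulary)."""
    class_counts = Counter(labels)
    term_weights = defaultdict(Counter)
    vocab = set()
    for doc, y in zip(docs, labels):
        for term, tag in doc:
            term_weights[y][term] += TAG_WEIGHT.get(tag, 1.0)
            vocab.add(term)
    return class_counts, term_weights, vocab

def predict_nb(model, doc):
    """Pick the class with the highest Laplace-smoothed log posterior."""
    class_counts, term_weights, vocab = model
    n = sum(class_counts.values())
    best, best_lp = None, float("-inf")
    for y, cnt in class_counts.items():
        total = sum(term_weights[y].values())
        lp = math.log(cnt / n)
        for term, _tag in doc:
            lp += math.log((term_weights[y][term] + 1.0) / (total + len(vocab)))
        if lp > best_lp:
            best, best_lp = y, lp
    return best
```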

  4. Biomedical Engineering Laboratory

    Science.gov (United States)

    2007-11-02

The Masters of Engineering program with concentration in Biomedical Engineering at Tennessee State University was established in fall 2000. Under... biomedical engineering. The lab is fully equipped with 10 Pentium5-based, 2 Pentium4-based laptops for mobile experiments at remote locations, 8 Biopac...students (prospective graduate students in biomedical engineering) are regularly using this lab. This summer, 8 new prospective graduate students

  5. Tuning, Diagnostics & Data Preparation for Generalized Linear Models Supervised Algorithm in Data Mining Technologies

    Directory of Open Access Journals (Sweden)

    Sachin Bhaskar

    2015-07-01

Full Text Available Data mining techniques are the result of a long process of research and product development. Data mining searches large amounts of data to find trends and patterns that go beyond simple analysis, using complex mathematical algorithms to segment data and evaluate the probability of future events. Each data mining model is produced by a specific algorithm, and some data mining problems are best solved by combining more than one algorithm. Data mining technologies can be used through Oracle. The Generalized Linear Models (GLM) algorithm is used in the Regression and Classification Oracle Data Mining functions. GLM is one of the popular statistical techniques for linear modelling, and Oracle Data Mining implements it for regression and binary classification. GLM provides row diagnostics as well as model statistics and extensive coefficient statistics, and it also supports confidence bounds. This paper outlines and analyses the GLM algorithm, which helps in understanding the tuning, diagnostics and data preparation process and the importance of the Regression and Classification supervised Oracle Data Mining functions, which are utilized in marketing, time series prediction, financial forecasting, overall business planning, trend analysis, environmental modelling, biomedical and drug response modelling, etc.
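As a worked example of a GLM for binary classification, the sketch below fits a logistic regression (binomial family, logit link) by batch gradient descent. It illustrates the model family only and does not use Oracle Data Mining; all data and hyperparameters are illustrative.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, lr=0.5, epochs=200):
    """Fit a logistic-regression GLM by batch gradient descent.
    xs: list of feature lists; ys: 0/1 labels. Returns [bias, w1, ...]."""
    w = [0.0] * (len(xs[0]) + 1)  # w[0] is the intercept
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for x, y in zip(xs, ys):
            p = sigmoid(w[0] + sum(wi * xi for wi, xi in zip(w[1:], x)))
            err = p - y  # gradient of the log loss w.r.t. the linear term
            grad[0] += err
            for j, xi in enumerate(x):
                grad[j + 1] += err * xi
        w = [wi - lr * g / len(xs) for wi, g in zip(w, grad)]
    return w

def predict_prob(w, x):
    """P(y = 1 | x) under the fitted GLM."""
    return sigmoid(w[0] + sum(wi * xi for wi, xi in zip(w[1:], x)))
```

Confidence bounds and richer diagnostics, as mentioned above, would come from the fitted model's covariance structure rather than this bare point estimate.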

  6. Biomedical engineering principles

    CERN Document Server

    Ritter, Arthur B; Valdevit, Antonio; Ascione, Alfred N

    2011-01-01

Introduction: Modeling of Physiological Processes; Cell Physiology and Transport; Principles and Biomedical Applications of Hemodynamics; A Systems Approach to Physiology; The Cardiovascular System. Biomedical Signal Processing: Signal Acquisition and Processing; Techniques for Physiological Signal Processing; Examples of Physiological Signal Processing. Principles of Biomechanics; Practical Applications of Biomechanics; Biomaterials. Principles of Biomedical Capstone Design: Unmet Clinical Needs; Entrepreneurship: Reasons why Most Good Designs Never Get to Market; An Engineering Solution in Search of a Biomedical Problem

  7. 中医方剂数据库文本挖掘数据预处理的尝试%An Attempt on Data Preprocessing for Text Mining in TCM Prescription Database

    Institute of Scientific and Technical Information of China (English)

    吴磊; 李舒

    2015-01-01

Objective To propose a set of data preprocessing methods, centred on data cleaning, for TCM prescription databases, making the data more standard, accurate and orderly and thus convenient for follow-up processing. Methods The text data source was retrieved from prescription databases by bibliographic searching techniques. Non-normalized data were cleaned through auxiliary word group line processing, regular-expression substitution, and synonym processing, with the purpose of improving data quality. Results A total of 1758 records were retrieved from the TCM prescription database, and 91 records from the prescription modern application database. After preprocessing, 6913 effective herb entries were obtained from the source text, which could be successfully imported into the relevant information mining system for extraction of prescription names and Chinese herbal medicine terms. Conclusion This method is applicable to text mining and knowledge discovery based on TCM prescription databases. It can successfully clean the source text data, producing standardised, noise-free data and enabling effective extraction of the required prescription and herb information, and it can provide a useful reference for the analysis and mining of textual data on TCM prescriptions.
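The cleaning steps (regular-expression substitution plus synonym normalisation) can be sketched as follows. The substitution patterns and the synonym table are invented examples, not the paper's actual rule set.

```python
import re

# Invented synonym table: map variant herb names to one canonical form.
SYNONYMS = {
    "liquorice root": "licorice root",
    "radix glycyrrhizae": "licorice root",
}

# Invented cleaning substitutions, applied in order.
SUBSTITUTIONS = [
    (re.compile(r"\s+"), " "),                  # collapse runs of whitespace
    (re.compile(r"\d+(\.\d+)?\s*(g|ml)"), ""),  # strip dosage amounts
    (re.compile(r"[;,]+\s*$"), ""),             # drop trailing punctuation
]

def clean_record(text):
    """Normalise one raw herb entry to a canonical, noise-free name."""
    text = text.strip().lower()
    for pattern, repl in SUBSTITUTIONS:
        text = pattern.sub(repl, text).strip()
    return SYNONYMS.get(text, text)
```

Running every retrieved record through such a pipeline yields the uniform entries that downstream mining systems can ingest.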

  8. A methodology for semiautomatic taxonomy of concepts extraction from nuclear scientific documents using text mining techniques; Metodologia para extracao semiautomatica de uma taxonomia de conceitos a partir da producao cientifica da area nuclear utilizando tecnicas de mineracao de textos

    Energy Technology Data Exchange (ETDEWEB)

    Braga, Fabiane dos Reis

    2013-07-01

This thesis presents a text mining method for semi-automatic extraction of a taxonomy of concepts from a textual corpus composed of scientific papers related to the nuclear area. Text classification is a natural human practice and a crucial task for working with large repositories. Document clustering provides a logical and understandable framework that facilitates organization, browsing and searching. Most clustering algorithms use the bag-of-words model to represent the content of a document. This model generates high-dimensional data, ignores the fact that different words can have the same meaning, and does not consider the relationships between words, assuming that they are independent of each other. The methodology presented combines a concept-based model for document representation with a hierarchical document clustering method using the frequency of concept co-occurrence, and a technique for labeling clusters with their most representative concepts, with the objective of producing a taxonomy of concepts that may reflect the structure of the knowledge domain. It is hoped that this work will contribute to the conceptual mapping of the scientific production of the nuclear area and thus support the management of research activities in this area. (author)
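The hierarchical grouping step can be sketched as greedy single-link agglomerative clustering driven by a concept co-occurrence similarity function. The similarity callable and the threshold are illustrative, not the thesis's exact procedure.

```python
def agglomerate(items, sim, threshold):
    """Greedy single-link agglomerative clustering: repeatedly merge the
    two most similar clusters until no pair's best link meets the
    threshold. sim(a, b) scores the similarity of two items."""
    clusters = [{i} for i in range(len(items))]

    def link(a, b):  # single-link: strongest pairwise similarity
        return max(sim(items[i], items[j]) for i in a for j in b)

    while len(clusters) > 1:
        best = max(((a, b) for i, a in enumerate(clusters)
                    for b in clusters[i + 1:]), key=lambda ab: link(*ab))
        if link(*best) < threshold:
            break
        a, b = best
        clusters.remove(a)
        clusters.remove(b)
        clusters.append(a | b)
    return [{items[i] for i in c} for c in clusters]
```

With a co-occurrence-based `sim`, concepts that frequently appear together in the corpus end up in the same taxonomy node, ready for the labeling step.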

  9. Rewriting and suppressing UMLS terms for improved biomedical term identification

    Directory of Open Access Journals (Sweden)

    Hettne Kristina M

    2010-03-01

Full Text Available Abstract Background Identification of terms is essential for biomedical text mining. We concentrate here on the use of vocabularies for term identification, specifically the Unified Medical Language System (UMLS). To make the UMLS more suitable for biomedical text mining we implemented and evaluated nine term rewrite and eight term suppression rules. The rules rely on UMLS properties that have been identified in previous work by others, together with an additional set of new properties discovered by our group during our work with the UMLS. Our work complements the earlier work in that we measure the impact of the different rules on the number of terms identified in a MEDLINE corpus. The number of uniquely identified terms and their frequency in MEDLINE were computed before and after applying the rules. The 50 most frequently found terms together with a sample of 100 randomly selected terms were evaluated for every rule. Results Five of the nine rewrite rules were found to generate additional synonyms and spelling variants that correctly corresponded to the meaning of the original terms, and seven out of the eight suppression rules were found to suppress only undesired terms. Using the five rewrite rules that passed our evaluation, we were able to identify 1,117,772 new occurrences of 14,784 rewritten terms in MEDLINE. Without the rewriting, we recognized 651,268 terms belonging to 397,414 concepts; with rewriting, we recognized 666,053 terms belonging to 410,823 concepts, an increase of 2.8% in the number of terms and 3.4% in the number of concepts recognized. Using the seven suppression rules, a total of 257,118 undesired terms were suppressed in the UMLS, notably decreasing its size; 7,397 terms were suppressed in the corpus. Conclusions We recommend applying the five rewrite rules and seven suppression rules that passed our evaluation when the UMLS is to be used for biomedical term identification in MEDLINE.
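In the spirit of the rewrite and suppression rules evaluated above, a sketch might look like the following. These specific rules (NOS stripping, undoing syntactic inversion, bracket removal, short-term suppression) are illustrative reconstructions, not the paper's exact rule set.

```python
import re

def rewrite_term(term):
    """Return extra synonym candidates generated from one UMLS term
    (the original term is assumed to be kept alongside the variants)."""
    variants = set()
    # Strip an "NOS" (not otherwise specified) qualifier.
    stripped = re.sub(r",?\s+NOS$", "", term)
    if stripped != term:
        variants.add(stripped)
    # Undo syntactic inversion: "Disease, Alzheimer" -> "Alzheimer Disease".
    m = re.match(r"^([^,]+),\s*([^,]+)$", term)
    if m:
        variants.add(f"{m.group(2)} {m.group(1)}")
    # Drop a trailing bracketed qualifier: "Cold [Temperature]" -> "Cold".
    variants.add(re.sub(r"\s*\[[^\]]+\]$", "", term))
    variants.discard(term)  # only report genuinely new strings
    return variants

def suppress_term(term):
    """Flag terms too short or too unspecific to be useful for
    identification in MEDLINE text."""
    return len(term) < 3 or term.isdigit()
```

Applying such rules over the whole vocabulary yields the enlarged synonym set and the pruned term list whose effects the paper quantifies.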

  10. Context-Aware Adaptive Hybrid Semantic Relatedness in Biomedical Science

    Science.gov (United States)

    Emadzadeh, Ehsan

Text mining of biomedical literature and clinical notes is a very active field of research in biomedical science. Semantic analysis is one of the core modules of many Natural Language Processing (NLP) solutions. Methods for calculating the semantic relatedness of two concepts can be very useful in solving different problems such as relationship extraction, ontology creation and question answering [1-6]. Several techniques exist for calculating the semantic relatedness of two concepts, utilizing different knowledge sources and corpora. So far, researchers have attempted to find the best hybrid method for each domain by combining semantic relatedness techniques and data sources manually. In this work, attempts were made to eliminate the need for manually combining semantic relatedness methods for new contexts or resources by proposing an automated method that finds the best combination of semantic relatedness techniques and resources to achieve the best semantic relatedness score in every context. This may help the research community find the best hybrid method for each context considering the available algorithms and resources.
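The automated-combination idea can be sketched as a grid search over convex combinations of two relatedness scorers, picking the weight that best fits gold dev-set judgements. The squared-error criterion, the scorers, and the weight grid are illustrative assumptions, not the dissertation's actual method.

```python
def best_combination(methods, dev_pairs, gold,
                     weights=(0.0, 0.25, 0.5, 0.75, 1.0)):
    """Grid-search a convex combination w*m1 + (1-w)*m2 of two relatedness
    scorers, returning the weight with the smallest squared error against
    gold dev-set judgements."""
    m1, m2 = methods
    best_w, best_err = None, float("inf")
    for w in weights:
        err = sum((w * m1(a, b) + (1 - w) * m2(a, b) - g) ** 2
                  for (a, b), g in zip(dev_pairs, gold))
        if err < best_err:
            best_w, best_err = w, err
    return best_w
```

A context-aware system would rerun this selection whenever the domain, corpus, or available resources change, rather than fixing one hybrid for all contexts.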

  11. Data mining in radiology.

    Science.gov (United States)

    Kharat, Amit T; Singh, Amarjit; Kulkarni, Vilas M; Shah, Digish

    2014-04-01

Data mining facilitates the study of radiology data in various dimensions. It converts large patient image and text datasets into useful information that helps in improving patient care and provides informative reports. Data mining technology analyzes data within the Radiology Information System and Hospital Information System using specialized software which assesses relationships and agreement in available information. By using similar data analysis tools, radiologists can make informed decisions and predict the future outcome of a particular imaging finding. Data, information and knowledge are the components of data mining. Classes, Clusters, Associations, Sequential patterns, Classification, Prediction and Decision tree are the various types of data mining. Data mining has the potential to make delivery of health care affordable and ensure that the best imaging practices are followed. It is a tool for academic research. Data mining is considered to be ethically neutral; however, concerns regarding privacy and legality exist which need to be addressed to ensure the success of data mining.

  12. Exploring subdomain variation in biomedical language

    Directory of Open Access Journals (Sweden)

    Séaghdha Diarmuid Ó

    2011-05-01

Full Text Available Abstract Background Applications of Natural Language Processing (NLP) technology to biomedical texts have generated significant interest in recent years. In this paper we identify and investigate the phenomenon of linguistic subdomain variation within the biomedical domain, i.e., the extent to which different subject areas of biomedicine are characterised by different linguistic behaviour. While variation at a coarser domain level such as between newswire and biomedical text is well-studied and known to affect the portability of NLP systems, we are the first to conduct an extensive investigation into more fine-grained levels of variation. Results Using the large OpenPMC text corpus, which spans the many subdomains of biomedicine, we investigate variation across a number of lexical, syntactic, semantic and discourse-related dimensions. These dimensions are chosen for their relevance to the performance of NLP systems. We use clustering techniques to analyse commonalities and distinctions among the subdomains. Conclusions We find that while patterns of inter-subdomain variation differ somewhat from one feature set to another, robust clusters can be identified that correspond to intuitive distinctions such as that between clinical and laboratory subjects. In particular, subdomains relating to genetics and molecular biology, which are the most common sources of material for training and evaluating biomedical NLP tools, are not representative of all biomedical subdomains. We conclude that an awareness of subdomain variation is important when considering the practical use of language processing applications by biomedical researchers.

  13. Biomedical devices and their applications

    CERN Document Server

    2004-01-01

    This volume introduces readers to the basic concepts and recent advances in the field of biomedical devices. The text gives a detailed account of novel developments in drug delivery, protein electrophoresis, estrogen mimicking methods and medical devices. It also provides the necessary theoretical background as well as describing a wide range of practical applications. The level and style make this book accessible not only to scientific and medical researchers but also to graduate students.

  14. Classification of protein-protein interaction full-text documents using text and citation network features.

    Science.gov (United States)

    Kolchinsky, Artemy; Abi-Haidar, Alaa; Kaur, Jasleen; Hamed, Ahmed Abdeen; Rocha, Luis M

    2010-01-01

We participated (as Team 9) in the Article Classification Task of the Biocreative II.5 Challenge: binary classification of full-text documents relevant for protein-protein interaction. We used two distinct classifiers for the online and offline challenges: 1) the lightweight Variable Trigonometric Threshold (VTT) linear classifier we successfully introduced in BioCreative 2 for binary classification of abstracts and 2) a novel Naive Bayes classifier using features from the citation network of the relevant literature. We supplemented the supplied training data with full-text documents from the MIPS database. The lightweight VTT classifier was very competitive in this new full-text scenario: it was a top-performing submission in this task, taking into account the rank product of the Area Under the interpolated precision and recall Curve, Accuracy, Balanced F-Score, and Matthews correlation coefficient performance measures. The novel citation network classifier for the biomedical text mining domain, while not a top performing classifier in the challenge, performed above the central tendency of all submissions, and therefore indicates a promising new avenue to investigate further in bibliome informatics.

  15. Clinic-Genomic Association Mining for Colorectal Cancer Using Publicly Available Datasets

    Directory of Open Access Journals (Sweden)

    Fang Liu

    2014-01-01

Full Text Available In recent years, a growing number of researchers have begun to focus on how to establish associations between clinical and genomic data. However, up to now, there has been a lack of research mining clinic-genomic associations by comprehensively analysing available gene expression data for a single disease. Colorectal cancer is one of the most common malignant tumours, and a number of genetic syndromes have been proven to be associated with it. This paper presents our research on mining clinic-genomic associations for colorectal cancer in a biomedical big data environment. The proposed method is engineered with multiple technologies, including extracting clinical concepts using the Unified Medical Language System (UMLS), extracting genes through literature mining, and mining clinic-genomic associations through statistical analysis. We applied this method to datasets extracted from both the gene expression omnibus (GEO) and the genetic association database (GAD). A total of 23517 clinic-genomic associations between 139 clinical concepts and 7914 genes were obtained, of which 3474 associations between 31 clinical concepts and 1689 genes were identified as highly reliable ones. Evaluation and interpretation were performed using UMLS, KEGG, and Gephi, and potential new discoveries were explored. The proposed method is effective in mining valuable knowledge from available biomedical big data and achieves a good performance in bridging clinical data with genomic data for colorectal cancer.
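Statistical mining of clinic-genomic associations often reduces to testing a 2x2 contingency table (clinical concept present/absent versus gene mentioned/not mentioned across datasets). A right-tailed Fisher exact test, sketched below, is one standard choice; the paper's exact statistic is not specified here, so this is an illustrative stand-in.

```python
import math

def fisher_right_tail(a, b, c, d):
    """Right-tailed Fisher exact test for the 2x2 table [[a, b], [c, d]]:
    the probability, under independence, of observing a co-occurrence
    count of at least a (hypergeometric tail sum)."""
    n = a + b + c + d
    row1, col1 = a + b, a + c

    def p_table(x):  # hypergeometric probability of cell value x
        return (math.comb(col1, x) * math.comb(n - col1, row1 - x)
                / math.comb(n, row1))

    p = 0.0
    for x in range(a, min(row1, col1) + 1):
        if row1 - x <= n - col1:  # table must remain feasible
            p += p_table(x)
    return p
```

Pairs whose p-value survives a significance threshold (after multiple-testing correction) would be kept as candidate highly reliable associations.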

  16. Integrating Data Mining Techniques into Telemedicine Systems

    Directory of Open Access Journals (Sweden)

    Mihaela GHEORGHE

    2014-01-01

    Full Text Available The medical system faces a wide range of challenges nowadays due to changes taking place in global healthcare systems. These challenges stem mostly from economic constraints (spiraling costs and financial issues), but also from an increased emphasis on accountability and transparency, changes in the education field, and the growing complexity of biomedical research studies. New partnerships in medical care systems and the great advances in the IT industry also suggest that a paradigm shift is occurring. This shift requires a focus on interaction, collaboration, and increased sharing of information and knowledge, which in turn leads healthcare organizations to embrace data mining techniques in order to create and sustain optimal healthcare outcomes. Data mining is a domain of great importance nowadays, as it provides advanced data analysis techniques for extracting knowledge from the huge volumes of data collected and stored by every system on a daily basis. In healthcare organizations, data mining can provide valuable information for patient diagnosis and treatment planning, customer relationship management, resource management, and fraud detection. In this article we describe the importance of data mining techniques and systems for healthcare organizations, focusing on developing and implementing telemedicine solutions in order to improve the healthcare services provided to patients. We provide an architecture for integrating data mining techniques into telemedicine systems and offer an overview of understanding and improving the implemented solution by using Business Process Management methods.

  17. Application of Computer Text Information Mining Technology in Network Security

    Institute of Scientific and Technical Information of China (English)

    韩文智

    2016-01-01

    To address the problem of judging the security of network text information, an improved nearest-neighbor classification algorithm is applied to text mining. The improved method defines classification features as in the traditional approach, but additionally employs a collinearity discriminant matrix to merge features with collinear attributes. This strategy both increases the accuracy of the classification features and speeds up the classification of text information. An experimental study on the Spambase corpus evaluates classification performance along four dimensions: precision, recall, joint-judgment degree, and error. The results show that the improved nearest-neighbor method has clear advantages and distinguishes safe text from dangerous text more accurately.
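
    The abstract describes merging collinear features before nearest-neighbor classification but gives no formulas. The following is a minimal sketch of that idea, assuming Pearson correlation as the collinearity measure and cosine k-NN as the classifier; all names, thresholds, and toy data are assumptions for illustration.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length feature columns."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = math.sqrt(sum((a - mx) ** 2 for a in x))
    vy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (vx * vy) if vx and vy else 0.0

def merge_collinear(X, threshold=0.95):
    """Greedily group feature columns whose absolute correlation exceeds
    the threshold, then sum each group into a single merged feature."""
    n_feat = len(X[0])
    cols = [[row[j] for row in X] for j in range(n_feat)]
    groups, assigned = [], set()
    for j in range(n_feat):
        if j in assigned:
            continue
        group = [j]
        assigned.add(j)
        for k in range(j + 1, n_feat):
            if k not in assigned and abs(pearson(cols[j], cols[k])) >= threshold:
                group.append(k)
                assigned.add(k)
        groups.append(group)
    merged = [[sum(row[j] for j in g) for g in groups] for row in X]
    return merged, groups

def knn_predict(X_train, y_train, x, k=3):
    """Majority vote among the k training points most cosine-similar to x."""
    def cos(u, v):
        du = math.sqrt(sum(a * a for a in u))
        dv = math.sqrt(sum(b * b for b in v))
        return sum(a * b for a, b in zip(u, v)) / (du * dv) if du and dv else 0.0
    neigh = sorted(zip(X_train, y_train), key=lambda t: cos(t[0], x), reverse=True)[:k]
    votes = {}
    for _, lab in neigh:
        votes[lab] = votes.get(lab, 0) + 1
    return max(votes, key=votes.get)
```

    Merging the redundant columns shrinks the feature space, which is the source of the speedup the abstract claims.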

  18. Biomedical Applications of Biodegradable Polyesters

    Directory of Open Access Journals (Sweden)

    Iman Manavitehrani

    2016-01-01

    Full Text Available The focus in the field of biomedical engineering has shifted in recent years to biodegradable polymers and, in particular, polyesters. Dozens of polyester-based medical devices are commercially available, and every year more are introduced to the market. The mechanical performance and wide range of biodegradation properties of this class of polymers allow for high degrees of selectivity for targeted clinical applications. Recent research endeavors to expand the application of polymers have been driven by a need to target the general hydrophobic nature of polyesters and their limited cell motif sites. This review provides a comprehensive investigation into advanced strategies to modify polyesters and their clinical potential for future biomedical applications.

  19. Introducing Text Analytics as a Graduate Business School Course

    Science.gov (United States)

    Edgington, Theresa M.

    2011-01-01

    Text analytics refers to the process of analyzing unstructured data from documented sources, including open-ended surveys, blogs, and other types of web dialog. Text analytics has enveloped the concept of text mining, an analysis approach influenced heavily from data mining. While text mining has been covered extensively in various computer…

  20. Powering biomedical devices

    CERN Document Server

    Romero, Edwar

    2013-01-01

    From exoskeletons to neural implants, biomedical devices are no less than life-changing. Compact and constant power sources are necessary to keep these devices running efficiently. Edwar Romero's Powering Biomedical Devices reviews the background, current technologies, and possible future developments of these power sources, examining not only the types of biomedical power sources available (macro, mini, MEMS, and nano), but also what they power (such as prostheses, insulin pumps, and muscular and neural stimulators), and how they work (covering batteries, biofluids, kinetic and ther

  1. Biomedical applications of polymers

    CERN Document Server

    Gebelein, C G

    1991-01-01

    The biomedical applications of polymers span an extremely wide spectrum of uses, including artificial organs, skin and soft tissue replacements, orthopaedic applications, dental applications, and controlled release of medications. No single, short review can possibly cover all these items in detail, and dozens of books and hundreds of reviews exist on biomedical polymers. Only a few relatively recent examples will be cited here; additional reviews are listed under most of the major topics in this book. We will consider each of the major classifications of biomedical polymers to some extent, inclu

  2. Biomedical engineering fundamentals

    CERN Document Server

    Bronzino, Joseph D; Bronzino, Joseph D

    2006-01-01

    Over the last century, medicine has come out of the "black bag" and emerged as one of the most dynamic and advanced fields of development in science and technology. Today, biomedical engineering plays a critical role in patient diagnosis, care, and rehabilitation. As such, the field encompasses a wide range of disciplines, from biology and physiology to informatics and signal processing. Reflecting the enormous growth and change in biomedical engineering during the infancy of the 21st century, The Biomedical Engineering Handbook enters its third edition as a set of three carefully focused and

  3. Biomedical Engineering Desk Reference

    CERN Document Server

    Ratner, Buddy D; Schoen, Frederick J; Lemons, Jack E; Dyro, Joseph; Martinsen, Orjan G; Kyle, Richard; Preim, Bernhard; Bartz, Dirk; Grimnes, Sverre; Vallero, Daniel; Semmlow, John; Murray, W Bosseau; Perez, Reinaldo; Bankman, Isaac; Dunn, Stanley; Ikada, Yoshito; Moghe, Prabhas V; Constantinides, Alkis

    2009-01-01

    A one-stop Desk Reference, for Biomedical Engineers involved in the ever expanding and very fast moving area; this is a book that will not gather dust on the shelf. It brings together the essential professional reference content from leading international contributors in the biomedical engineering field. Material covers a broad range of topics including: Biomechanics and Biomaterials; Tissue Engineering; and Biosignal Processing* A hard-working desk reference providing all the essential material needed by biomedical and clinical engineers on a day-to-day basis * Fundamentals, key techniques,

  4. Handbook of biomedical optics

    CERN Document Server

    Boas, David A

    2011-01-01

    Biomedical optics holds tremendous promise to deliver effective, safe, non- or minimally invasive diagnostics and targeted, customizable therapeutics. Handbook of Biomedical Optics provides an in-depth treatment of the field, including coverage of applications for biomedical research, diagnosis, and therapy. It introduces the theory and fundamentals of each subject, ensuring accessibility to a wide multidisciplinary readership. It also offers a view of the state of the art and discusses advantages and disadvantages of various techniques.Organized into six sections, this handbook: Contains intr

  5. Regulatory relations represented in logics and biomedical texts

    DEFF Research Database (Denmark)

    Zambach, Sine

    Regulatory networks are used for simple modeling of varying complexity, for example within biology, economics and other fields that apply dynamic systems. In biomedicine, regulatory networks are widely used to model regulatory pathways, which, in short, are characterized by processes containing gene...... products and smaller molecules that regulate each other through different mechanisms through different paths. The relations between the building blocks of these networks are typically modeled either very expressively or very simply in graphs in information systems. The focus of this dissertation...... of information services on regulatory events within biomedicine.

  6. Review of Biomedical Image Processing

    Directory of Open Access Journals (Sweden)

    Ciaccio Edward J

    2011-11-01

    Full Text Available Abstract This article is a review of the book 'Biomedical Image Processing' by Thomas M. Deserno, published by Springer-Verlag. The review presents salient information that will be useful for deciding whether the book is relevant to topics of interest to the reader and whether it might be suitable as a course textbook. This includes information about the book's details, a summary, the suitability of the text for course and research work, the framework of the book, its specific content, and conclusions.

  7. Sensors for biomedical applications

    NARCIS (Netherlands)

    Bergveld, Piet

    1986-01-01

    This paper considers the impact during the last decade of modern IC technology, microelectronics, thin- and thick-film technology, fibre optic technology, etc. on the development of sensors for biomedical applications.

  8. Statistics in biomedical research

    OpenAIRE

    González-Manteiga, Wenceslao; Cadarso-Suárez, Carmen

    2007-01-01

    The discipline of biostatistics is nowadays a fundamental scientific component of biomedical, public health and health services research. Traditional and emerging areas of application include clinical trials research, observational studies, physiology, imaging, and genomics. The present article reviews the current situation of biostatistics, considering the statistical methods traditionally used in biomedical research, as well as the ongoing development of new methods in response to the new p...

  9. Biomedical signal analysis

    CERN Document Server

    Rangayyan, Rangaraj M

    2015-01-01

    The book assists the reader in developing techniques for the analysis of biomedical signals and computer-aided diagnosis, with a pedagogical examination of basic and advanced topics accompanied by over 350 figures and illustrations. A wide range of filtering techniques is presented to address various applications, along with 800 mathematical expressions and equations, and practical questions, problems, and laboratory exercises. The book also covers fractals and chaos theory with biomedical applications.

  10. Extraction of genotype-phenotype-drug relationships from text: from entity recognition to bioinformatics application.

    Science.gov (United States)

    Coulet, Adrien; Shah, Nigam; Hunter, Lawrence; Barral, Chitta; Altman, Russ B

    2010-01-01

    Advances in concept recognition and natural language parsing have led to the development of various tools that enable the identification of biomedical entities and relationships between them in text. The aim of the Genotype-Phenotype-Drug Relationship Extraction from Text workshop (or GPD-Rx workshop) is to examine the current state of the art and discuss the next steps for making the extraction of relationships between biomedical entities integral to the curation and knowledge management workflow in Pharmacogenomics. The workshop will focus particularly on the extraction of Genotype-Phenotype, Genotype-Drug, and Phenotype-Drug relationships that are of interest to Pharmacogenomics. Extracting and structuring such text-mined relationships is key to supporting the evaluation and validation of the multiple hypotheses that emerge from high-throughput translational studies spanning multiple measurement modalities. In order to advance this agenda, it is essential that existing relationship extraction methods be compared to one another and that a community-wide benchmark corpus emerges against which future methods can be compared. The workshop aims to bring together researchers working on the automatic or semi-automatic extraction of relationships between biomedical entities from the research literature in order to identify the key groups interested in creating such a benchmark.

  11. Data mining in Cloud Computing

    Directory of Open Access Journals (Sweden)

    Ruxandra-Ştefania PETRE

    2012-10-01

    Full Text Available This paper describes how data mining is used in cloud computing. Data mining is used for extracting potentially useful information from raw data. The integration of data mining techniques into normal day-to-day activities has become commonplace. Every day people are confronted with targeted advertising, and data mining techniques help businesses become more efficient by reducing costs. Data mining techniques and applications are very much needed in the cloud computing paradigm. The implementation of data mining techniques through cloud computing allows users to retrieve meaningful information from a virtually integrated data warehouse, which reduces the costs of infrastructure and storage.

  12. Exploration on Senior Managers' Strategic Leadership Based on Text Mining Technology

    Institute of Scientific and Technical Information of China (English)

    任嵘嵘; 吴凯

    2015-01-01

    This paper examines the strategic leadership of senior managers in Chinese enterprises from the perspective of business requirements. Taking the recruitment advertisements for senior managers of 103 enterprises as a sample and applying text mining techniques, it explores the components of senior managers' strategic leadership along two dimensions: essential requirements and strategic requirements. The study finds that, in terms of essential requirements, enterprises state no specific requirements for senior managers' educational background, age, knowledge, or major, while requirements for work experience and years in the industry are relatively clear and explicit; in terms of strategic requirements, enterprises prefer senior managers with judgment and decision-making ability, communication skills, foresight, the ability to command the overall situation, adaptability to change, and teamwork ability.
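
    The abstract does not detail the text mining step; at its simplest, mining recruitment ads for capability requirements can begin with document-frequency ranking of terms. A hedged sketch of that first pass, using toy English ads and a hypothetical function name:

```python
from collections import Counter

def top_terms(ads, stopwords, k=3):
    """Rank terms across a set of job-ad texts by document frequency
    (counting each term once per ad), skipping stopwords."""
    df = Counter()
    for ad in ads:
        # dict.fromkeys deduplicates while preserving first-seen order,
        # so tie-breaking in most_common stays deterministic
        terms = [t for t in dict.fromkeys(ad.lower().split()) if t not in stopwords]
        df.update(terms)
    return [t for t, _ in df.most_common(k)]
```

    The ranked terms would then be grouped manually or statistically into requirement categories such as communication or decision-making ability.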

  13. System Architecture Analysis of an Automatic Ontology Construction System Based on Text Mining

    Institute of Scientific and Technical Information of China (English)

    薛中玉; 李春梅; 黄道雄

    2011-01-01

    Ontology can provide semantic support for human-computer interaction and therefore finds wide application in fields such as artificial intelligence and knowledge engineering. At present, however, ontologies are mainly constructed manually, which entails high cost, long development cycles, and uncertain quality; these drawbacks have become the main bottleneck restricting ontology applications. This paper presents an automatic ontology construction system and method based on text mining, describes in detail the functions and implementation of the modules in the user layer, system tools layer, and data resource layer, and analyzes the complete data processing flow of the system. The system and method offer a useful reference for solving similar problems in ontology construction.

  14. Gold mining areas in Suriname: reservoirs of malaria resistance?

    Directory of Open Access Journals (Sweden)

    Adhin MR

    2014-05-01

    Full Text Available Malti R Adhin,1 Mergiory Labadie-Bracho,2 Stephen Vreden3 1Faculty of Medical Sciences, Department of Biochemistry, Anton de Kom Universiteit van Suriname; 2Prof Dr Paul C Flu Institute for Biomedical Sciences; 3Academic Hospital Paramaribo, Paramaribo, Suriname. Background: At present, malaria cases in Suriname occur predominantly in migrants and people living and/or working in areas with gold mining operations. A molecular survey was performed in Plasmodium falciparum isolates originating from persons from gold mining areas to assess the extent and role of mining areas as reservoirs of malaria resistance in Suriname. Methods: The status of 14 putative resistance-associated single nucleotide polymorphisms in the pfdhfr, pfcrt, pfmdr1, and pfATP6 genes was assessed for 28 samples from gold miners diagnosed with P. falciparum malaria using polymerase chain reaction amplification and restriction fragment length polymorphism analysis, and the results were compared with earlier data from nonmining villagers. Results: Isolates from miners showed a high degree of homogeneity, with a fixed pfdhfr Ile51/Asn108, pfmdr1 Phe184/Asp1042/Tyr1246, and pfcrt Thr76 mutant genotype, while an exclusively wild-type genotype was observed for pfmdr1 Asn86 and pfdhfr Ala16, Cys59, and Ile164, and for the pfATP6 positions Leu263/Ala623/Ser769. Small variations were observed for pfmdr1 S1034C. No statistically significant difference could be detected in allele frequencies between mining and nonmining villagers. Conclusion: Despite the increased risk of malaria infection in individuals working/living in gold mining areas, we did not detect an increase in mutation frequency at the 14 analyzed single nucleotide polymorphisms. Therefore, mining areas in Suriname cannot yet be considered as reservoirs for malaria resistance. Keywords: Plasmodium falciparum, gold mining, mutation frequency, Suriname

  15. Text Analytics to Data Warehousing

    Directory of Open Access Journals (Sweden)

    Kalli Srinivasa Nageswara Prasad

    2010-09-01

    Full Text Available Information hidden or stored in unstructured data can play a critical role in decision making, understanding, and other business functions. Integrating data stored in both structured and unstructured formats can add significant value to an organization. With the progress in text mining and in technologies for handling unstructured and semi-structured data, such as XML and MML (Mining Markup Language), text analytics has evolved to handle unstructured data and helps unlock and predict business results via business intelligence and data warehousing. Text mining deals with texts in documents and discovers hidden patterns, whereas text analytics additionally enhances information retrieval in the form of search and clustering of results; in short, text analytics is text mining plus visualization. In this paper we discuss how unstructured data in documents can be handled so that they fit into business applications such as data warehouses for further analysis, and we present the framework we used for the solution.

  16. Biomedical Image Analysis by Program "Vision Assistant" and "Labview"

    Directory of Open Access Journals (Sweden)

    Peter Izak

    2005-01-01

    Full Text Available This paper introduces an application of image analysis for biomedical images. The general task is focused on the analysis and diagnosis of biomedical images obtained from the program ImageJ. Methods that can be used for images in biomedical applications are described. The main idea is based on particle analysis and pattern matching techniques. For this task, a sophisticated method was chosen using the Vision Assistant program, which is a part of LabVIEW.

  17. A Review of Biomedical Centrifugal Microfluidic Platforms

    Directory of Open Access Journals (Sweden)

    Minghui Tang

    2016-02-01

    Full Text Available Centrifugal microfluidic or lab-on-a-disc platforms have many advantages over other microfluidic systems. These advantages include a minimal amount of instrumentation, the efficient removal of any disturbing bubbles or residual volumes, and inherently available density-based sample transportation and separation. Centrifugal microfluidic devices applied to biomedical analysis and point-of-care diagnostics have been extensively promoted recently. This paper presents an up-to-date overview of these devices. The development of biomedical centrifugal microfluidic platforms essentially covers two categories: (i) unit operations that perform specific functionalities, and (ii) systems that aim to address certain biomedical applications. With the aim to provide a comprehensive representation of current development in this field, this review summarizes progress in both categories. The advanced unit operations implemented for biological processing include mixing, valving, switching, metering and sequential loading. Depending on the type of sample to be used in the system, biomedical applications are classified into four groups: nucleic acid analysis, blood analysis, immunoassays, and other biomedical applications. Our overview of advanced unit operations also includes the basic concepts and mechanisms involved in centrifugal microfluidics, while on the other hand an outline on reported applications clarifies how an assembly of unit operations enables efficient implementation of various types of complex assays. Lastly, challenges and potential for future development of biomedical centrifugal microfluidic devices are discussed.

  18. Longwall mining

    Energy Technology Data Exchange (ETDEWEB)

    NONE

    1995-03-14

    As part of EIA's program to provide information on coal, this report, Longwall Mining, describes longwall mining and compares it with other underground mining methods. Using data from EIA and private sector surveys, the report describes major changes in the geologic, technological, and operating characteristics of longwall mining over the past decade. Most importantly, the report shows how these changes led to dramatic improvements in longwall mining productivity. For readers interested in the history of longwall mining and greater detail on recent developments affecting longwall mining, the report includes a bibliography.

  19. Clustering Text Data Streams

    Institute of Scientific and Technical Information of China (English)

    Yu-Bao Liu; Jia-Rong Cai; Jian Yin; Ada Wai-Chee Fu

    2008-01-01

    Clustering text data streams is an important issue in the data mining community and has a number of applications such as newsgroup filtering, text crawling, document organization, and topic detection and tracing. However, most existing methods are similarity-based approaches that use only the TF*IDF scheme to represent the semantics of text data, which often leads to poor clustering quality. Recently, researchers have argued that the semantic smoothing model is more efficient than the existing TF*IDF scheme for improving text clustering quality. However, the existing semantic smoothing model is not suitable for a dynamic text data context. In this paper, we first extend the semantic smoothing model to the text data stream context. Based on the extended model, we then present two online clustering algorithms, OCTS and OCTSM, for the clustering of massive text data streams. In both algorithms, we also present a new cluster statistics structure, named the cluster profile, which can capture the semantics of text data streams dynamically and at the same time speed up the clustering process. Efficient implementations of our algorithms are also given. Finally, we present a series of experimental results illustrating the effectiveness of our technique.
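
    The OCTS/OCTSM algorithms and the semantic smoothing model are not reproduced in the abstract; the following is a simplified sketch of the cluster-profile idea only, using plain term counts in place of semantic smoothing. The class, threshold, and toy stream are illustrative assumptions.

```python
import math
from collections import Counter

class ClusterProfile:
    """Running summary of a text cluster: term-count sums and size,
    updated incrementally as documents stream in."""
    def __init__(self, tokens):
        self.term_sums = Counter(tokens)
        self.size = 1

    def add(self, tokens):
        self.term_sums.update(tokens)
        self.size += 1

    def cosine(self, tokens):
        """Cosine similarity between the profile and an incoming document."""
        doc = Counter(tokens)
        dot = sum(self.term_sums[t] * c for t, c in doc.items())
        np_ = math.sqrt(sum(v * v for v in self.term_sums.values()))
        nd = math.sqrt(sum(c * c for c in doc.values()))
        return dot / (np_ * nd) if np_ and nd else 0.0

def cluster_stream(docs, sim_threshold=0.3):
    """One-pass clustering: assign each document to the most similar
    profile, or spawn a new cluster when no profile is similar enough."""
    profiles, assignments = [], []
    for tokens in docs:
        best, best_sim = None, sim_threshold
        for i, p in enumerate(profiles):
            s = p.cosine(tokens)
            if s > best_sim:
                best, best_sim = i, s
        if best is None:
            profiles.append(ClusterProfile(tokens))
            assignments.append(len(profiles) - 1)
        else:
            profiles[best].add(tokens)
            assignments.append(best)
    return assignments, profiles
```

    Because the profile is an additive summary, each document is touched exactly once, which is what makes such structures suitable for massive streams.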

  20. Science and Technology Text Mining: Text Mining of the Journal Cortex

    Science.gov (United States)

    2004-01-01


  1. Thermoresponsive Polymers for Biomedical Applications

    Directory of Open Access Journals (Sweden)

    Theoni K. Georgiou

    2011-08-01

    Full Text Available Thermoresponsive polymers are a class of “smart” materials that have the ability to respond to a change in temperature; a property that makes them useful materials in a wide range of applications and consequently attracts much scientific interest. This review focuses mainly on the studies published over the last 10 years on the synthesis and use of thermoresponsive polymers for biomedical applications including drug delivery, tissue engineering and gene delivery. A summary of the main applications is given following the different studies on thermoresponsive polymers, which are categorized based on their 3-dimensional structure: hydrogels, interpenetrating networks, micelles, crosslinked micelles, polymersomes, films and particles.

  2. Study on Laptop Satisfaction Degree via Text Mining Based on Network Forums

    Institute of Scientific and Technical Information of China (English)

    李艳红; 程翔

    2014-01-01

    Unlike past satisfaction models, which establish their index systems through brainstorming and questionnaire surveys, this study builds on the large volume of review information buried and distributed across various network platforms. Text mining tools are used to analyze which aspects of laptops consumers pay most attention to, and with what content, in order to establish the evaluation index system. Using multivariate regression based on the Fornell model, a laptop satisfaction model is established. Because the process is grounded in consumers' real experiences, the strengths and weaknesses that consumers care most about can be extracted, helping product manufacturers understand customer demands and psychological expectations comprehensively and accurately. The satisfaction model also enables consumers and manufacturers to compute, compare, and select laptop satisfaction comprehensively.

  3. Empirical Study on Assessing the Helpfulness of Chinese Online Product Reviews: A Text Mining Approach

    Institute of Scientific and Technical Information of China (English)

    丁乃鹏; 汪勇慧

    2016-01-01

    In the Web 2.0 era, reading online product reviews has become a habit before shopping. However, the number of reviews on the Internet is huge and opinions differ, so it is difficult for consumers to find the reviews that are genuinely useful to them. This paper studies the assessment of the helpfulness of Chinese online product reviews. Combining the characteristics of Chinese online reviews, a feature system for assessing review helpfulness is constructed. Centered on the idea of binary classification and following the basic text mining workflow, Chinese product reviews are classified, and the influence of each content feature on classification performance is examined. The results show that the proposed evaluation method can effectively identify helpful reviews; shallow syntactic features contribute most to classification, while the contributions of semantic and sentiment features vary with the type of corpus.
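
    The paper's actual feature set and classifier are not reproduced in the abstract; as a minimal sketch of helpfulness classification with shallow features, the following extracts a few toy surface features (length, digits, emphasis, sentiment-word hits) and trains a perceptron. All feature choices, names, and data are assumptions for illustration.

```python
def extract_features(review, sentiment_lexicon):
    """Toy shallow features: token count, digit tokens (concrete details),
    exclamation marks (emphasis), sentiment-word hits, and a bias term."""
    tokens = review.lower().split()
    return [
        len(tokens),
        sum(t.isdigit() for t in tokens),
        review.count("!"),
        sum(t in sentiment_lexicon for t in tokens),
        1.0,  # bias
    ]

def train_perceptron(X, y, epochs=20, lr=1.0):
    """Classic perceptron: update weights on each misclassified example.
    Labels y are in {-1, +1} (+1 = helpful)."""
    w = [0.0] * len(X[0])
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            score = sum(a * b for a, b in zip(w, xi))
            if yi * score <= 0:  # misclassified (or on the boundary)
                w = [a + lr * yi * b for a, b in zip(w, xi)]
    return w

def predict(w, x):
    return 1 if sum(a * b for a, b in zip(w, x)) > 0 else -1
```

    On real data the feature vector would include the syntactic, semantic, and sentiment features the paper evaluates, with the same training loop.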

  4. Biomedical enhancements as justice.

    Science.gov (United States)

    Nam, Jeesoo

    2015-02-01

    Biomedical enhancements, the applications of medical technology to make better those who are neither ill nor deficient, have made great strides in the past few decades. Using Amartya Sen's capability approach as my framework, I argue in this article that far from being simply permissible, we have a prima facie moral obligation to use these new developments for the end goal of promoting social justice. In terms of both range and magnitude, the use of biomedical enhancements will mark a radical advance in how we compensate the most disadvantaged members of society.

  5. Advances in biomedical engineering

    CERN Document Server

    Brown, J H U

    1976-01-01

    Advances in Biomedical Engineering, Volume 6, is a collection of papers that discusses the role of integrated electronics in medical systems and the usage of biological mathematical models in biological systems. Other papers deal with the health care systems, the problems and methods of approach toward rehabilitation, as well as the future of biomedical engineering. One paper discusses the use of system identification as it applies to biological systems to estimate the values of a number of parameters (for example, resistance, diffusion coefficients) by indirect means. More particularly, the i

  6. Advances in biomedical engineering

    CERN Document Server

    Brown, J H U

    1976-01-01

    Advances in Biomedical Engineering, Volume 5, is a collection of papers that deals with application of the principles and practices of engineering to basic and applied biomedical research, development, and the delivery of health care. The papers also describe breakthroughs in health improvements, as well as basic research that have been accomplished through clinical applications. One paper examines engineering principles and practices that can be applied in developing therapeutic systems by a controlled delivery system in drug dosage. Another paper examines the physiological and materials vari

  7. Biomedical implantable microelectronics.

    Science.gov (United States)

    Meindl, J D

    1980-10-17

    Innovative applications of microelectronics in new biomedical implantable instruments offer a singular opportunity for advances in medical research and practice because of two salient factors: (i) beyond all other types of biomedical instruments, implants fully exploit the inherent technical advantages of microelectronics: complex functional capability, high reliability, low power drain, and small size and weight; and (ii) implants bring microelectronics into intimate association with biological systems. The combination of these two factors enables otherwise impossible new experiments to be conducted and new prostheses to be developed that will improve the quality of human life.

  8. Ethics in biomedical engineering.

    Science.gov (United States)

    Morsy, Ahmed; Flexman, Jennifer

    2008-01-01

    This session focuses on a number of aspects of ethics in biomedical engineering. It starts with a case study of a company that manufactures artificial heart valves, whose valves were failing at an unexpected rate. The case study focuses on the biomedical engineers working at the company and how their education and training did not prepare them to deal properly with such a situation. The second part of the session highlights the need to learn about the various ethics rules and policies regulating research involving human or animal subjects.

  9. Biomedical Engineering in Modern Society

    Science.gov (United States)

    Attinger, E. O.

    1971-01-01

    Considers definition of biomedical engineering (BME) and how biomedical engineers should be trained. State of the art descriptions of BME and BME education are followed by a brief look at the future of BME. (TS)

  10. Anatomy for Biomedical Engineers

    Science.gov (United States)

    Carmichael, Stephen W.; Robb, Richard A.

    2008-01-01

    There is a perceived need for anatomy instruction for graduate students enrolled in a biomedical engineering program. This appeared especially important for students interested in and using medical images. These students typically did not have a strong background in biology. The authors arranged for students to dissect regions of the body that…

  11. Holography In Biomedical Sciences

    Science.gov (United States)

    von Bally, G.

    1988-01-01

    Today not only physicists and engineers but also biological and medical scientists are exploring the potential of holographic methods in their special fields of work. Most of the underlying physical principles, such as coherence, interference, diffraction and polarization, as well as general features of holography, e.g. storage and retrieval of the amplitude and phase of a wavefront, 3-d imaging, large depth of field, redundant storage of information, spatial filtering, and high-resolution, non-contact 3-d form and motion analysis, are explained in detail in other contributions to this book. Therefore, this article is confined to the applications of holography in the biomedical sciences. Because of the great number of contributions and the variety of applications [1,2,3,4,5,6,7,8], in this review the investigations can only be mentioned briefly and the survey has to be confined to some examples. As in all fields of optics and laser metrology, a review of biomedical applications of holography would be incomplete if military developments and their utilization were not mentioned. As will be demonstrated by selected examples, the increasing interlacing of science with the military does not stop at domains that are traditionally regarded as exclusively oriented to human welfare, like biomedical research [9]. This fact is characterized and stressed by the expression "Star Wars Medicine", which is becoming increasingly common as a popular description for laser applications (including holography) in medicine [10]. Thus, the consequences, even in such highly specialized fields as the biomedical applications of holography, have to be discussed.

  12. What is biomedical informatics?

    Science.gov (United States)

    Bernstam, Elmer V; Smith, Jack W; Johnson, Todd R

    2010-02-01

    Biomedical informatics lacks a clear and theoretically-grounded definition. Many proposed definitions focus on data, information, and knowledge, but do not provide an adequate definition of these terms. Leveraging insights from the philosophy of information, we define informatics as the science of information, where information is data plus meaning. Biomedical informatics is the science of information as applied to or studied in the context of biomedicine. Defining the object of study of informatics as data plus meaning clearly distinguishes the field from related fields, such as computer science, statistics and biomedicine, which have different objects of study. The emphasis on data plus meaning also suggests that biomedical informatics problems tend to be difficult when they deal with concepts that are hard to capture using formal, computational definitions. In other words, problems where meaning must be considered are more difficult than problems where manipulating data without regard for meaning is sufficient. Furthermore, the definition implies that informatics research, teaching, and service should focus on biomedical information as data plus meaning rather than only computer applications in biomedicine.

  13. Various criteria in the evaluation of biomedical named entity recognition

    Directory of Open Access Journals (Sweden)

    Lin Yu-Chun

    2006-02-01

    Full Text Available Abstract Background Text mining in the biomedical domain is receiving increasing attention. A key component of this process is named entity recognition (NER). Generally speaking, two annotated corpora, GENIA and GENETAG, are most frequently used for training and testing biomedical named entity recognition (Bio-NER) systems. JNLPBA and BioCreAtIvE are two major Bio-NER tasks using these corpora. The two tasks take different approaches to corpus annotation and use different matching criteria to evaluate system performance. This paper details these differences and describes alternative criteria. We then examine the impact of different criteria and annotation schemes on system performance by retesting systems that participated in the above two tasks. Results To analyze the difference between JNLPBA's and BioCreAtIvE's evaluations, we conduct Experiment 1 to evaluate the top four JNLPBA systems using BioCreAtIvE's classification scheme and compare them with the top four BioCreAtIvE systems. Among them, three systems participated in both tasks, and each has a lower F-score on JNLPBA than on BioCreAtIvE. In Experiment 2, we apply hypothesis testing and correlation coefficients to find alternatives to BioCreAtIvE's evaluation scheme, showing that the right-match and left-match criteria have no significant difference from BioCreAtIvE's. In Experiment 3, we propose a customized relaxed-match criterion that uses right match and merges JNLPBA's five NE classes into two, which achieves an F-score of 81.5%. In Experiment 4, we evaluate a range of five matching criteria from loose to strict on the top JNLPBA system and examine the percentage of false negatives. Our experiment gives the relative change in precision, recall and F-score as matching criteria are relaxed. Conclusion In many applications, biomedical NEs can have several acceptable tags, which might differ only in their left or right boundaries. However, most corpora annotate only one of them. In our…
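
    The boundary-matching criteria compared in these experiments (exact, left and right match) reduce to a simple span comparison. The sketch below is illustrative only, assuming entities are (start, end) offsets; it is not the tasks' official evaluation code.

```python
# Illustrative span-matching criteria for Bio-NER evaluation; entities
# are (start, end) character offsets. Not the official JNLPBA/BioCreAtIvE
# evaluation scripts.

def matches(pred, gold, criterion="exact"):
    (s_p, e_p), (s_g, e_g) = pred, gold
    if criterion == "exact":
        return s_p == s_g and e_p == e_g
    if criterion == "left":     # left boundaries must agree
        return s_p == s_g
    if criterion == "right":    # right boundaries must agree
        return e_p == e_g
    raise ValueError(criterion)

def f_score(preds, golds, criterion="exact"):
    tp = sum(any(matches(p, g, criterion) for g in golds) for p in preds)
    precision = tp / len(preds) if preds else 0.0
    recall = tp / len(golds) if golds else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

golds = [(0, 5), (10, 18)]
preds = [(0, 5), (10, 20)]              # second prediction over-extends to the right
print(f_score(preds, golds, "exact"))   # 0.5
print(f_score(preds, golds, "left"))    # 1.0
```

    Relaxing the criterion from exact to left match turns the boundary error into a hit, which mirrors the precision/recall shift measured as criteria are relaxed.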

  14. Mining and Utilization of Bio-medical Information in the Time of Big Data

    Institute of Scientific and Technical Information of China (English)

    时钢; 王兴梅; 黄志民; 洪松林; 闫妍; 高伟伟; 门天男

    2014-01-01

    With the development of hospital information systems, the progress of medical diagnostics and the use of high-throughput experimental equipment, medical data are growing geometrically and exhibiting the characteristics of big data. This poses huge challenges for medical research, specimen library construction, clinical medical treatment, and medical and health regulation, in terms of how to make use of existing medical information systems and how to build biomedical information infrastructure in the future; it has also brought unprecedented opportunities for biomedical research. Carrying out big data research is meaningful for hospital informatization and for the construction of biological specimen information databases. The application of these research techniques will become a trend in the development of biomedical science and technology, and will be a core technology of future bioinformatics research. It is therefore necessary to address the related technical knowledge, infrastructure requirements, and personnel training. Big data will permeate the medical field, changing medical research, the practice of clinical medicine, and medical management.

  15. Carbon Nanotubes Reinforced Composites for Biomedical Applications

    Directory of Open Access Journals (Sweden)

    Wei Wang

    2014-01-01

    Full Text Available This review paper reports on carbon nanotube (CNT) reinforced composites for biomedical applications. Several studies have found enhancement in the mechanical properties of CNT-based reinforced composites upon the addition of CNTs. CNT-reinforced composites have been intensively investigated for many aspects of life, especially for biomedical applications. The review introduces the fabrication of CNT-reinforced composites (CNT-reinforced metal matrix composites, CNT-reinforced polymer matrix composites, and CNT-reinforced ceramic matrix composites), their mechanical properties, cell experiments in vitro, and biocompatibility tests in vivo.

  16. Building biomedical materials layer-by-layer

    Directory of Open Access Journals (Sweden)

    Paula T. Hammond

    2012-05-01

    Full Text Available In this materials perspective, the promise of water-based layer-by-layer (LbL) assembly as a means of generating drug-releasing surfaces for biomedical applications, from small-molecule therapeutics to biologic drugs and nucleic acids, is examined. Specific advantages of LbL assembly over traditional polymeric blend encapsulation are discussed, and examples are provided to present potential new directions. Translational opportunities are examined to assess the impact and potential for true biomedical translation using rapid assembly methods, and applications with high need and medical return are discussed.

  17. Functionalized Gold Nanoparticles and Their Biomedical Applications

    Directory of Open Access Journals (Sweden)

    Shree R. Singh

    2011-06-01

    Full Text Available Metal nanoparticles are being extensively used in various biomedical applications due to their small size-to-volume ratio and extensive thermal stability. Gold nanoparticles (GNPs) are an obvious choice due to their amenability to synthesis and functionalization, low toxicity and ease of detection. The present review focuses on various methods of functionalization of GNPs and their applications in biomedical research. Functionalization facilitates targeted delivery of these nanoparticles to various cell types, bioimaging, gene delivery, drug delivery and other therapeutic and diagnostic applications. This review is an amalgamation of recent advances in the functionalization of gold nanoparticles and their potential applications in medicine and biology.

  18. Graphene based materials for biomedical applications

    Directory of Open Access Journals (Sweden)

    Yuqi Yang

    2013-10-01

    Full Text Available Graphene, a single-layer 2-dimensional nanomaterial with unique physicochemical properties (e.g. high surface area, excellent electrical conductivity, strong mechanical strength, unparalleled thermal conductivity, remarkable biocompatibility and ease of functionalization), has received increasing attention in physical, chemical and biomedical fields. This article selectively reviews current advances of graphene-based materials for biomedical applications. In particular, graphene-based biosensors for small biomolecules (glucose, dopamine, etc.), proteins and DNA detection are summarized, and graphene-based bioimaging, drug delivery, and photothermal therapy applications are described in detail. Future perspectives and possible challenges in this rapidly developing area are also discussed.

  19. Integrating systems biology models and biomedical ontologies

    Directory of Open Access Journals (Sweden)

    de Bono Bernard

    2011-08-01

    Full Text Available Abstract Background Systems biology is an approach to biology that emphasizes the structure and dynamic behavior of biological systems and the interactions that occur within them. To succeed, systems biology crucially depends on the accessibility and integration of data across domains and levels of granularity. Biomedical ontologies were developed to facilitate such an integration of data and are often used to annotate biosimulation models in systems biology. Results We provide a framework to integrate representations of in silico systems biology with those of in vivo biology as described by biomedical ontologies and demonstrate this framework using the Systems Biology Markup Language. We developed the SBML Harvester software that automatically converts annotated SBML models into OWL and we apply our software to those biosimulation models that are contained in the BioModels Database. We utilize the resulting knowledge base for complex biological queries that can bridge levels of granularity, verify models based on the biological phenomenon they represent and provide a means to establish a basic qualitative layer on which to express the semantics of biosimulation models. Conclusions We establish an information flow between biomedical ontologies and biosimulation models and we demonstrate that the integration of annotated biosimulation models and biomedical ontologies enables the verification of models as well as expressive queries. Establishing a bi-directional information flow between systems biology and biomedical ontologies has the potential to enable large-scale analyses of biological systems that span levels of granularity from molecules to organisms.

  20. Weighted Association Rules and Text Mining-Based Agent Realisation of Financial News Spreading

    Institute of Scientific and Technical Information of China (English)

    张人上; 曲开社

    2015-01-01

    Traditional financial prediction systems rely only on quantitative data such as stock prices and market indexes, and thus cannot satisfy requirements for real-time operation and high accuracy. To address this, we propose an agent for financial news spreading based on weighted association rules and text mining. First, a Chinese knowledge and information processing system segments each news headline into Chinese words. Then, a weighted association rule (WAR) algorithm detects terms that frequently appear together in the same headlines, extracting nouns, verbs and compounds. Finally, weights are assigned to the extracted keywords according to the financial price index of stock transactions on the first trading day after the news appears, and the weighted value of a headline is used to estimate its degree of influence on stock prices. Experiments on a database of news-headline features verify the feasibility of the method for real-time information delivery of financial news headlines, and show that it achieves higher prediction accuracy and recall than several other prediction methods.
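
    A minimal reading of the two steps described above can be sketched as follows: detect term pairs that frequently co-occur in the same headline, then score a segmented headline by summing index-derived term weights. The headlines and weights below are fabricated; this is an illustration, not the paper's WAR algorithm.

```python
# Toy version of (1) frequent-pair detection over headlines and
# (2) weight-based headline scoring. Data are fabricated.
from collections import Counter
from itertools import combinations

headlines = [                       # already segmented into words
    ["bank", "profit", "rises"],
    ["bank", "profit", "beats", "forecast"],
    ["oil", "loss", "widens"],
]

pair_counts = Counter(p for h in headlines for p in combinations(sorted(set(h)), 2))
frequent_pairs = [p for p, c in pair_counts.items() if c >= 2]

term_weights = {"profit": 0.8, "loss": -0.6}    # hypothetical index-derived weights

def headline_score(tokens, weights):
    """Estimated influence of a headline on stock prices."""
    return sum(weights.get(t, 0.0) for t in tokens)

print(frequent_pairs)                                            # [('bank', 'profit')]
print(headline_score(["bank", "profit", "rises"], term_weights))  # 0.8
```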

  1. Data Mining and Pattern Recognition Models for Identifying Inherited Diseases

    DEFF Research Database (Denmark)

    Iddamalgoda, Lahiru; Das, Partha S; Aponso, Achala;

    2016-01-01

    Data mining and pattern recognition methods reveal interesting findings in genetic studies, especially on how the genetic makeup is associated with inherited diseases. Although researchers have proposed various data mining models for biomedical approaches, accurately prioritizing the single nucleotide polymorphisms (SNPs) associated with a disease remains a challenge. In this commentary, we review state-of-the-art data mining and pattern recognition models for identifying inherited diseases and deliberate the need for binary classification- and scoring-based prioritization methods…
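
    Scoring-based prioritization of the kind discussed here can be illustrated with a toy ranking: each SNP receives a combined score from evidence features and SNPs are sorted by score. The SNP IDs, features and weighting below are fabricated for illustration; real methods use many more features and learned weights.

```python
# Toy scoring-based SNP prioritization over fabricated evidence features.
import math

snps = {
    "rs0001": {"assoc_pvalue": 1e-6, "functional": 1.0},
    "rs0002": {"assoc_pvalue": 1e-3, "functional": 0.0},
    "rs0003": {"assoc_pvalue": 1e-8, "functional": 0.5},
}

def score(features):
    # smaller association p-value and functional evidence raise the score
    return -math.log10(features["assoc_pvalue"]) + 2.0 * features["functional"]

ranked = sorted(snps, key=lambda s: score(snps[s]), reverse=True)
print(ranked)  # ['rs0003', 'rs0001', 'rs0002']
```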

  2. Optomechatronics for Biomedical Optical Imaging: An Overview

    Directory of Open Access Journals (Sweden)

    Cho Hyungsuck

    2015-01-01

    Full Text Available The use of optomechatronic technology, particularly in biomedical optical imaging, is becoming pronounced and ever increasing due to the synergistic effect of integrating optics and mechatronics. The background of this trend is that biomedical optical imaging, for example in-vivo imaging related to retraction of tissues, diagnosis, and surgical operations, presents a variety of challenges due to the complexity of the internal structure and properties of the biological body and the resulting optical phenomena. This paper addresses the technical issues related to tissue imaging, visualization of interior surfaces of organs, laparoscopic and endoscopic imaging, and imaging of neuronal activities and structures. Within these problem domains, the paper overviews state-of-the-art technology, focusing on how optical components are fused with those of mechatronics to create the functionalities required for imaging systems. A brief future perspective of optical imaging in the biomedical field is also presented.

  3. Wikipedia Mining

    Science.gov (United States)

    Nakayama, Kotaro; Ito, Masahiro; Erdmann, Maike; Shirakawa, Masumi; Michishita, Tomoyuki; Hara, Takahiro; Nishio, Shojiro

    Wikipedia, a collaborative Wiki-based encyclopedia, has become a huge phenomenon among Internet users. It covers a huge number of concepts from various fields such as arts, geography, history, science, sports and games. As a corpus for knowledge extraction, Wikipedia's impressive characteristics are not limited to its scale, but also include its dense link structure, URL-based word sense disambiguation, and brief anchor texts. Because of these characteristics, Wikipedia has become a promising corpus and a new frontier for research. In the past few years, a considerable amount of research has been conducted in various areas such as semantic relatedness measurement, bilingual dictionary construction, and ontology construction. Extracting machine-understandable knowledge from Wikipedia to enhance the intelligence of computational systems is the main goal of "Wikipedia Mining," a project on CREP (Challenge for Realizing Early Profits) in JSAI. In this paper, we take a comprehensive, panoramic view of Wikipedia Mining research and the current status of our challenge. After that, we discuss the future vision of this challenge.
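
    A simple instance of the link-structure-based relatedness this research builds on is Jaccard overlap between two articles' link sets; published Wikipedia Mining measures differ, and the link sets below are fabricated for illustration.

```python
# Toy link-structure relatedness: two articles are related to the degree
# that their link sets overlap (Jaccard similarity here).

def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

links = {
    "Dog": {"Mammal", "Pet", "Wolf"},
    "Cat": {"Mammal", "Pet", "Lion"},
    "Car": {"Engine", "Wheel"},
}
print(jaccard(links["Dog"], links["Cat"]))  # 0.5
print(jaccard(links["Dog"], links["Car"]))  # 0.0
```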

  4. Biomedical signals, imaging, and informatics

    CERN Document Server

    Bronzino, Joseph D

    2014-01-01

    Known as the bible of biomedical engineering, The Biomedical Engineering Handbook, Fourth Edition, sets the standard against which all other references of this nature are measured. As such, it has served as a major resource for both skilled professionals and novices to biomedical engineering. Biomedical Signals, Imaging, and Informatics, the third volume of the handbook, presents material from respected scientists with diverse backgrounds in biosignal processing, medical imaging, infrared imaging, and medical informatics. More than three dozen specific topics are examined, including biomedical s…

  5. Link mining models, algorithms, and applications

    CERN Document Server

    Yu, Philip S; Faloutsos, Christos

    2010-01-01

    This book presents in-depth surveys and systematic discussions on models, algorithms and applications for link mining. Link mining is an important field of data mining. Traditional data mining focuses on 'flat' data in which each data object is represented as a fixed-length attribute vector. However, many real-world data sets are much richer in structure, involving objects of multiple types that are related to each other. Hence, link mining has recently become an emerging field of data mining, with high impact in various important applications such as text mining and social network analysis…

  6. Data mining in healthcare and biomedicine: a survey of the literature.

    Science.gov (United States)

    Yoo, Illhoi; Alafaireet, Patricia; Marinov, Miroslav; Pena-Hernandez, Keila; Gopidi, Rajitha; Chang, Jia-Fu; Hua, Lei

    2012-08-01

    As a new concept that emerged in the middle of the 1990s, data mining can help researchers gain both novel and deep insights and can facilitate unprecedented understanding of large biomedical datasets. Data mining can uncover new biomedical and healthcare knowledge for clinical and administrative decision making, as well as generate scientific hypotheses from large experimental data, clinical databases, and/or biomedical literature. This review first introduces data mining in general (e.g., the background, definition, and process of data mining), discusses the major differences between statistics and data mining, and then speaks to the uniqueness of data mining in the biomedical and healthcare fields. A brief summarization of various data mining algorithms used for classification, clustering, and association, as well as their respective advantages and drawbacks, is also presented. Suggested guidelines on how to use data mining algorithms in each area of classification, clustering, and association are offered, along with three examples of how data mining has been used in the healthcare industry. Given the successful application of data mining by health-related organizations, which has helped to predict health insurance fraud and under-diagnosed patients, and to identify and classify at-risk people in terms of health with the goal of reducing healthcare cost, we introduce how data mining technologies (in each area of classification, clustering, and association) have been used for a multitude of purposes, including research in the biomedical and healthcare fields. A discussion of the technologies available to enable the prediction of healthcare costs (including length of hospital stay), disease diagnosis and prognosis, and the discovery of hidden biomedical and healthcare patterns from related databases is offered, along with a discussion of the use of data mining to discover such relationships as those between health conditions and a disease, relationships among diseases, and…
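
    The association-mining area mentioned above (e.g., relationships between health conditions and a disease) rests on support and confidence, which a few lines make concrete; the patient records and the rule below are fabricated for illustration.

```python
# Support and confidence for a toy rule "hypertension -> diabetes"
# over fabricated patient records.

records = [
    {"hypertension", "diabetes"},
    {"hypertension", "diabetes", "obesity"},
    {"hypertension"},
    {"obesity", "diabetes"},
]

def support(itemset, records):
    """Fraction of records containing every item in the itemset."""
    return sum(itemset <= r for r in records) / len(records)

def confidence(lhs, rhs, records):
    """Of the records matching lhs, the fraction that also match rhs."""
    return support(lhs | rhs, records) / support(lhs, records)

print(support({"hypertension", "diabetes"}, records))       # 0.5
print(confidence({"hypertension"}, {"diabetes"}, records))  # ~0.667
```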

  7. Multilingual biomedical dictionary.

    Science.gov (United States)

    Daumke, Philipp; Markó, Kornél; Poprat, Michael; Schulz, Stefan

    2005-01-01

    We present a unique technique to create a multilingual biomedical dictionary, based on a methodology called Morpho-Semantic indexing. Our approach closes a gap caused by the absence of freely available multilingual medical dictionaries and the lack of accuracy of non-medical electronic translation tools. We first explain the underlying technology, followed by a description of the dictionary interface, which makes use of a multilingual subword thesaurus and of statistical information from a domain-specific, multilingual corpus.
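
    The subword-based idea can be sketched as follows: terms in different languages are decomposed into subwords that map to shared, language-independent identifiers, and equal identifier sequences indicate translation equivalence. The subword tables below are toy examples, not the system's actual thesaurus.

```python
# Toy morpho-semantic indexing: greedy subword decomposition into
# language-independent identifiers. Tables are fabricated.

subwords = {
    "en": {"gastr": "#stomach", "itis": "#inflammation"},
    "de": {"magen": "#stomach", "entzündung": "#inflammation"},
}

def index(term, lang):
    ids, rest = [], term.lower()
    while rest:
        for sw, ident in subwords[lang].items():
            if rest.startswith(sw):
                ids.append(ident)
                rest = rest[len(sw):]
                break
        else:
            rest = rest[1:]  # skip an unanalyzable character
    return ids

print(index("gastritis", "en"))                                    # ['#stomach', '#inflammation']
print(index("gastritis", "en") == index("Magenentzündung", "de"))  # True
```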

  8. Adaptive Biomedical Innovation.

    Science.gov (United States)

    Honig, P K; Hirsch, G

    2016-12-01

    Adaptive Biomedical Innovation (ABI) is a multistakeholder approach to product and process innovation aimed at accelerating the delivery of clinical value to patients and society. ABI offers the opportunity to transcend the fragmentation and linearity of decision-making in our current model and create a common collaborative framework that optimizes the benefit and access of new medicines for patients as well as creating a more sustainable innovation ecosystem.

  9. [Biomedical activity of biosurfactants].

    Science.gov (United States)

    Krasowska, Anna

    2010-07-23

    Biosurfactants, amphiphilic compounds synthesized by microorganisms, have surface, antimicrobial and antitumor properties. Biosurfactants prevent adhesion and biofilm formation by bacteria and fungi on various surfaces. For many years microbial surfactants have been used as antibiotics with a broad spectrum of activity against microorganisms. Biosurfactants act as antiviral compounds, and their antitumor activities are mediated through induction of apoptosis. This work presents the current state of knowledge on the biomedical activity of biosurfactants.

  10. MINING INDUSTRY IN CROATIA

    Directory of Open Access Journals (Sweden)

    Slavko Vujec

    1996-12-01

    Full Text Available The trends of the world and European mining industry are presented in a short introductory review. The mining industry is very important in the economy of Croatia, because it covers most of the needed petroleum and natural gas, all construction raw materials, and industrial non-metallic raw minerals. A detailed quantitative presentation of mineral raw material production is compared with the pre-war situation. The value of annual production is given for each raw mineral (the paper is published in Croatian).

  11. Biomedical accelerator mass spectrometry

    Science.gov (United States)

    Freeman, Stewart P. H. T.; Vogel, John S.

    1995-05-01

    Ultrasensitive SIMS with accelerator-based spectrometers has recently begun to be applied to biomedical problems. Certain very long-lived radioisotopes of very low natural abundance can be used to trace metabolism at environmental dose levels (≥ zmol in mg samples). 14C in particular can be employed to label a myriad of compounds. Competing technologies typically require super-environmental doses that can perturb the system under investigation, followed by uncertain extrapolation to the low-dose regime. 41Ca and 26Al are also used as elemental tracers. Given the sensitivity of the accelerator method, care must be taken to avoid contamination of the mass spectrometer and the apparatus employed in prior sample handling, including chemical separation. This infant field comprises the efforts of a dozen accelerator laboratories. The Center for Accelerator Mass Spectrometry has been particularly active. In addition to collaborating with groups further afield, we are researching the kinematics and binding of genotoxins in-house, and we support innovative uses of our capability in the disciplines of chemistry, pharmacology, nutrition and physiology within the University of California. The field can be expected to grow further given the numerous potential applications, the efforts of several groups and companies to integrate accelerator technology more fully into biomedical research programs, and the anticipated development of miniaturized accelerator systems and ion sources capable of interfacing to conventional HPLC and GMC, etc., apparatus for complementary chemical analysis in biomedical laboratories.

  12. Industry Situation Analysis Based on Web Text Mining: A Case Study of the 2011 Shanghai Auto Show

    Institute of Scientific and Technical Information of China (English)

    许鑫; 郭金龙; 姚占雷

    2012-01-01

    This paper summarizes the applications of Web text mining in the field of competitive intelligence, applies Web text mining to industry situation analysis, and puts forward a Web text mining process for industry situation analysis. Taking the 2011 Shanghai Auto Show as an example, the authors conduct an empirical study of the automobile industry situation on related news reports, using time-series analysis, spatial distribution analysis, word-frequency analysis and co-occurrence analysis. Finally, the authors discuss the development trends of China's automobile industry.
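
    The word-frequency and co-occurrence analyses used in such studies amount to counting terms and within-document term pairs; a minimal sketch, with stand-in documents in place of segmented news reports:

```python
# Term frequencies and within-document term-pair co-occurrences,
# the two basic counting operations behind the analyses above.
from collections import Counter
from itertools import combinations

docs = [
    ["electric", "vehicle", "battery"],
    ["electric", "vehicle", "export"],
    ["battery", "export"],
]

freq = Counter(w for d in docs for w in d)
cooc = Counter(frozenset(p) for d in docs for p in combinations(sorted(set(d)), 2))

print(freq["electric"])                          # 2
print(cooc[frozenset({"electric", "vehicle"})])  # 2
```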

  13. Discovery of Recurring Anomalies in Text Reports

    Data.gov (United States)

    National Aeronautics and Space Administration — This paper describes the results of a significant research and development effort conducted at NASA Ames Research Center to develop new text mining algorithms to...

  14. Mining the pharmacogenomics literature--a survey of the state of the art.

    Science.gov (United States)

    Hahn, Udo; Cohen, K Bretonnel; Garten, Yael; Shah, Nigam H

    2012-07-01

    This article surveys efforts on text mining of the pharmacogenomics literature, mainly from the period 2008 to 2011. Pharmacogenomics (or pharmacogenetics) is the field that studies how human genetic variation impacts drug response. Therefore, publications span the intersection of research in genotypes, phenotypes and pharmacology, a topic that has increasingly become a focus of active research in recent years. This survey covers efforts dealing with the automatic recognition of relevant named entities (e.g. genes, gene variants and proteins, diseases and other pathological phenomena, drugs and other chemicals relevant for medical treatment), as well as various forms of relations between them. A wide range of text genres is considered, such as scientific publications (abstracts, as well as full texts), patent texts and clinical narratives. We also discuss infrastructure and resources needed for advanced text analytics, e.g. document corpora annotated with corresponding semantic metadata (gold standards and training data), biomedical terminologies and ontologies providing domain-specific background knowledge at different levels of formality and specificity, software architectures for building complex and scalable text analytics pipelines and Web services grounded to them, as well as comprehensive ways to disseminate and interact with the typically huge amounts of semiformal knowledge structures extracted by text mining tools. Finally, we consider some of the novel applications that have already been developed in the field of pharmacogenomic text mining and point out perspectives for future research.
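
    The simplest relation-mining baseline in this literature is sentence-level co-occurrence of entities: a gene and a drug mentioned in the same sentence become a candidate gene-drug relation. A minimal sketch with toy dictionaries follows; real systems use far richer lexicons and machine-learned extractors.

```python
# Sentence-level co-occurrence as a candidate gene-drug relation extractor.
# Entity dictionaries are toy examples.
import re

GENES = {"CYP2D6", "VKORC1"}
DRUGS = {"warfarin", "codeine"}

def candidate_relations(text):
    pairs = set()
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        genes = {g for g in GENES if g in sentence}
        drugs = {d for d in DRUGS if d in sentence.lower()}
        pairs |= {(g, d) for g in genes for d in drugs}
    return pairs

abstract = ("VKORC1 variants alter warfarin dose requirements. "
            "CYP2D6 poor metabolizers respond poorly to codeine.")
print(candidate_relations(abstract))
```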

  15. RECENT POTENTIAL USAGE OF SURFACTANT FROM MICROBIAL ORIGIN IN PHARMACEUTICAL AND BIOMEDICAL ARENA: A PERSPECTIVE

    Directory of Open Access Journals (Sweden)

    Rath Kalyani

    2011-08-01

    Full Text Available The use and potential commercial application of biosurfactants have increased during the past decade. They can be used as emulsifiers, de-emulsifiers, wetting and foaming agents, functional food ingredients and detergents in the petroleum, petrochemical, environmental management, agrochemical, food and beverage, cosmetics and pharmaceutical industries, and in mining and metallurgy. Their antibacterial, antifungal and antiviral activities make them relevant molecules for combating many diseases and for use as therapeutic agents. In addition, their role as anti-adhesive agents against several disease-causing pathogens makes them suitable anti-adhesive coating agents for medical insertional materials, helping to reduce a large number of hospital infections without the use of synthetic drugs and chemicals. This review looks at the various pharmaceutical, biomedical and therapeutic perspectives on biosurfactant applications.

  16. Pbm: A new dataset for blog mining

    CERN Document Server

    Aziz, Mehwish

    2012-01-01

    Text mining is becoming vital as Web 2.0 offers collaborative content creation and sharing. Researchers now have a growing interest in text mining methods for discovering knowledge. Text mining researchers come from a variety of areas: natural language processing, computational linguistics, machine learning, and statistics. A typical text mining application involves preprocessing of text, stemming and lemmatization, tagging and annotation, deriving knowledge patterns, and evaluating and interpreting the results. There are numerous approaches for performing text mining tasks, such as clustering, categorization, sentiment analysis, and summarization. There is a growing need to standardize the evaluation of these tasks, and one major component of establishing standardization is to provide standard datasets. Although various standard datasets are available for traditional text mining tasks, there are very few, and expensive, datasets for the blog-mining task. Blogs, a new genre in Web 2.0, is a digital…
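
    The pipeline stages named above (preprocessing, stemming, deriving patterns) can be sketched in a few lines; the plural-stripping stemmer below is a deliberate toy standing in for real stemming or lemmatization.

```python
# Minimal text mining pipeline: tokenize, (toy) stem, count patterns.
import re
from collections import Counter

def preprocess(text):
    """Lowercase and tokenize, dropping punctuation."""
    return re.findall(r"[a-z]+", text.lower())

def stem(word):
    """Toy plural-stripping stemmer, not a real stemming algorithm."""
    return word[:-1] if word.endswith("s") and len(word) > 3 else word

post = "Blogs and blog posts: mining blogs for post topics"
tokens = [stem(w) for w in preprocess(post)]
print(Counter(tokens).most_common(1))  # [('blog', 3)]
```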

  17. Biomedical Applications for Introductory Physics

    Science.gov (United States)

    Tuszynski, J. A.; Dixon, J. M.

    2001-12-01

    Can be utilized in either algebra- or calculus-based courses and is available either as a standalone text or as a supplement for books like Cutnell PHYSICS, 5e or Halliday, Resnick, & Walker FUNDAMENTALS OF PHYSICS, 6e. Math level is algebra and trigonometry; however, a few examples require the use of integration and differentiation. Unlike competing supplements, Tuszynski offers both a wealth of engaging biomedical applications and quantitative problem-solving, the latter presented in the form of worked examples and homework problems. The standard organization facilitates the integration of the material into most introductory courses.

  18. An Approach to Mine Textual Information From Pubmed Database

    Directory of Open Access Journals (Sweden)

    G Charles Babu

    2012-05-01

Full Text Available The web has greatly improved access to scientific literature. A wide spectrum of research data has been created and collected by researchers. However, textual information on the web is largely disorganized, with research articles spread across archive sites, institution sites, journal sites, and researcher homepages. Data is widely available over the internet, and many kinds of data pose challenges in storage and retrieval. Datasets can be made more accessible and user-friendly through annotation, aggregation, and cross-linking to other datasets. Biomedical datasets are growing exponentially, and new curated information appears regularly in research publications indexed by resources such as MEDLINE, PubMed, and ScienceDirect. Therefore, a context-based text mining tool was developed in Python to search a huge database such as PubMed for a given keyword, retrieving data between specified years.
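The paper does not publish its code, but a keyword-plus-date-range PubMed search of the kind it describes can be sketched against NCBI's public E-utilities ESearch endpoint. The function name and parameter defaults here are illustrative; only the endpoint URL and its query parameters are standard E-utilities usage.

```python
from urllib.parse import urlencode

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def pubmed_query_url(keyword, min_year, max_year, retmax=20):
    """Build an ESearch URL restricted to a publication-date range."""
    params = {
        "db": "pubmed",          # search the PubMed database
        "term": keyword,         # the user-supplied keyword
        "datetype": "pdat",      # filter on publication date
        "mindate": str(min_year),
        "maxdate": str(max_year),
        "retmax": str(retmax),   # maximum number of PMIDs returned
        "retmode": "json",
    }
    return ESEARCH + "?" + urlencode(params)

url = pubmed_query_url("text mining", 2005, 2012)
```

Fetching `url` (e.g. with `urllib.request.urlopen`) returns a JSON document whose `esearchresult.idlist` holds the matching PMIDs, which can then be passed to EFetch to retrieve abstracts.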

  19. Enriching a biomedical event corpus with meta-knowledge annotation

    Directory of Open Access Journals (Sweden)

    Thompson Paul

    2011-10-01

Full Text Available Abstract Background Biomedical papers contain rich information about entities, facts and events of biological relevance. To discover these automatically, we use text mining techniques, which rely on annotated corpora for training. In order to extract protein-protein interactions, genotype-phenotype/gene-disease associations, etc., we rely on event corpora that are annotated with classified, structured representations of important facts and findings contained within text. These provide an important resource for the training of domain-specific information extraction (IE systems, to facilitate semantic-based searching of documents. Correct interpretation of these events is not possible without additional information, e.g., does an event describe a fact, a hypothesis, an experimental result or an analysis of results? How confident is the author about the validity of her analyses? These and other types of information, which we collectively term meta-knowledge, can be derived from the context of the event. Results We have designed an annotation scheme for meta-knowledge enrichment of biomedical event corpora. The scheme is multi-dimensional, in that each event is annotated for five different aspects of meta-knowledge that can be derived from the textual context of the event. Textual clues used to determine the values are also annotated. The scheme is intended to be general enough to allow integration with different types of bio-event annotation, whilst being detailed enough to capture important subtleties in the nature of the meta-knowledge expressed in the text. We report here on both the main features of the annotation scheme and its application to the GENIA event corpus (1000 abstracts with 36,858 events). High levels of inter-annotator agreement have been achieved, falling in the range of 0.84-0.93 Kappa. Conclusion By augmenting event annotations with meta-knowledge, more sophisticated IE systems can be trained, which allow interpretative...
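The abstract's idea of attaching multi-dimensional meta-knowledge (plus textual clues) to each event can be sketched as a simple data structure. The dimension names and values below are illustrative placeholders, not the scheme's actual labels; the paper itself defines the five dimensions and their permitted values.

```python
from dataclasses import dataclass, field

@dataclass
class BioEvent:
    """A bio-event enriched with hypothetical meta-knowledge slots."""
    trigger: str                          # textual trigger of the event
    event_type: str                       # e.g. "Negative_regulation"
    knowledge_type: str = "Observation"   # fact / hypothesis / analysis, etc.
    certainty: str = "L3"                 # author's confidence in the event
    polarity: str = "Positive"            # whether the event is negated
    manner: str = "Neutral"               # intensity, e.g. strong vs. weak
    source: str = "Current"               # current study vs. cited work
    clue_words: list = field(default_factory=list)  # annotated textual clues

ev = BioEvent(trigger="inhibits", event_type="Negative_regulation",
              clue_words=["suggest"])
```

Training an IE system on such records lets it learn, for instance, to distinguish hedged hypotheses from asserted experimental results when indexing documents.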

  20. Health Effects Associated with Inhalation of Airborne Arsenic Arising from Mining Operations

    Directory of Open Access Journals (Sweden)

    Rachael Martin

    2014-08-01

Full Text Available Arsenic in dust and aerosol generated by the mining, mineral processing and metallurgical extraction industries is a serious threat to human populations throughout the world. Major sources of contamination include smelting operations, coal combustion, and hard rock mining, as well as their associated waste products, including fly ash, mine wastes and tailings. The number of uncontained arsenic-rich mine waste sites throughout the world is of growing concern, as is the number of people at risk of exposure. Inhalation exposures to arsenic-bearing dusts and aerosol, in both occupational and environmental settings, have been definitively linked to increased systemic uptake, as well as to carcinogenic and non-carcinogenic health outcomes. It is therefore becoming increasingly important to identify human populations and sensitive sub-populations at risk of exposure, and to better understand the modes of action for pulmonary arsenic toxicity and carcinogenesis. In this paper we explore the contribution of smelting, coal combustion, hard rock mining and their associated waste products to atmospheric arsenic. We also report on the current understanding of the health effects of inhaled arsenic, citing results from various toxicological, biomedical and epidemiological studies. This review is particularly aimed at those researchers engaged in the distinct, but complementary, areas of arsenic research within the multidisciplinary field of medical geology.