WorldWideScience

Sample records for biomedical text mining

  1. Text mining patents for biomedical knowledge.

    Science.gov (United States)

    Rodriguez-Esteban, Raul; Bundschus, Markus

    2016-06-01

    Biomedical text mining of scientific knowledge bases, such as Medline, has received much attention in recent years. Given that text mining is able to automatically extract biomedical facts that revolve around entities such as genes, proteins, and drugs, from unstructured text sources, it is seen as a major enabler to foster biomedical research and drug discovery. In contrast to the biomedical literature, research into the mining of biomedical patents has not reached the same level of maturity. Here, we review existing work and highlight the associated technical challenges that emerge from automatically extracting facts from patents. We conclude by outlining potential future directions in this domain that could help drive biomedical research and drug discovery. PMID:27179985

  2. CONAN: Text Mining in the Biomedical Domain

    NARCIS (Netherlands)

    Malik, R.

    2006-01-01

    This thesis is about text mining: extracting important information from literature. In recent years, the number of biomedical articles and journals has been growing exponentially. Scientists might not find the information they want because of the large number of publications. Therefore a system was cons

  3. @Note: a workbench for biomedical text mining.

    Science.gov (United States)

    Lourenço, Anália; Carreira, Rafael; Carneiro, Sónia; Maia, Paulo; Glez-Peña, Daniel; Fdez-Riverola, Florentino; Ferreira, Eugénio C; Rocha, Isabel; Rocha, Miguel

    2009-08-01

    Biomedical Text Mining (BioTM) is providing valuable approaches to the automated curation of scientific literature. However, most efforts have addressed the benchmarking of new algorithms rather than user operational needs. Bridging the gap between BioTM researchers and biologists' needs is crucial to solve real-world problems and promote further research. We present @Note, a platform for BioTM that aims at the effective translation of the advances between three distinct classes of users: biologists, text miners and software developers. Its main functional contributions are the ability to process abstracts and full-texts; an information retrieval module enabling PubMed search and journal crawling; a pre-processing module with PDF-to-text conversion, tokenisation and stopword removal; a semantic annotation schema; a lexicon-based annotator; a user-friendly annotation view that allows users to correct annotations; and a Text Mining module supporting dataset preparation and algorithm evaluation. @Note improves interoperability, modularity and flexibility when integrating in-house and open-source third-party components. Its component-based architecture allows the rapid development of new applications, emphasizing the principles of transparency and simplicity of use. Although it is still ongoing, it has already allowed the development of applications that are currently being used. PMID:19393341
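
    The abstract above outlines a classic pre-processing and lexicon-based annotation pipeline. The following Python sketch illustrates that general pattern only; it is not part of @Note, and the stopword list and lexicon are toy placeholders.

    ```python
    import re

    # Illustrative stopword list and lexicon; a real pipeline (such as @Note)
    # would load these from curated resources.
    STOPWORDS = {"the", "of", "and", "in", "is", "a", "to", "with", "by"}
    LEXICON = {"p53": "GENE", "apoptosis": "PROCESS", "doxorubicin": "DRUG"}

    def tokenize(text):
        """Lowercase word tokenisation."""
        return re.findall(r"[a-z0-9]+", text.lower())

    def preprocess(text):
        """Tokenise and drop stopwords."""
        return [t for t in tokenize(text) if t not in STOPWORDS]

    def annotate(tokens):
        """Lexicon-based annotation: tag tokens found in the dictionary."""
        return [(t, LEXICON[t]) for t in tokens if t in LEXICON]

    abstract = "The role of p53 in apoptosis is modulated by doxorubicin."
    print(annotate(preprocess(abstract)))
    # [('p53', 'GENE'), ('apoptosis', 'PROCESS'), ('doxorubicin', 'DRUG')]
    ```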

  4. Frontiers of biomedical text mining: current progress

    OpenAIRE

    Zweigenbaum, Pierre; Demner-Fushman, Dina; Yu, Hong; Cohen, Kevin B.

    2007-01-01

    It is now almost 15 years since the publication of the first paper on text mining in the genomics domain, and decades since the first paper on text mining in the medical domain. Enormous progress has been made in the areas of information retrieval, evaluation methodologies and resource construction. Some problems, such as abbreviation-handling, can essentially be considered solved problems, and others, such as identification of gene mentions in text, seem likely to be solved soon. However, a ...

  5. Biomedical text mining and its applications in cancer research.

    Science.gov (United States)

    Zhu, Fei; Patumcharoenpol, Preecha; Zhang, Cheng; Yang, Yang; Chan, Jonathan; Meechai, Asawin; Vongsangnak, Wanwipa; Shen, Bairong

    2013-04-01

    Cancer is a malignant disease that has caused millions of human deaths. Its study has a long history of well over 100 years. There have been an enormous number of publications on cancer research. This integrated but unstructured biomedical text is of great value for cancer diagnostics, treatment, and prevention. The immense body and rapid growth of biomedical text on cancer has led to the appearance of a large number of text mining techniques aimed at extracting novel knowledge from scientific text. Biomedical text mining on cancer research is computationally automatic and high-throughput in nature. However, it is error-prone due to the complexity of natural language processing. In this review, we introduce the basic concepts underlying text mining and examine some frequently used algorithms, tools, and data sets, as well as assessing how much these algorithms have been utilized. We then discuss the current state-of-the-art text mining applications in cancer research and we also provide some resources for cancer text mining. With the development of systems biology, researchers tend to understand complex biomedical systems from a systems biology viewpoint. Thus, the full utilization of text mining to facilitate cancer systems biology research is fast becoming a major concern. To address this issue, we describe the general workflow of text mining in cancer systems biology and each phase of the workflow. We hope that this review can (i) provide a useful overview of the current work of this field; (ii) help researchers to choose text mining tools and datasets; and (iii) highlight how to apply text mining to assist cancer systems biology research. PMID:23159498

  6. OntoGene web services for biomedical text mining

    OpenAIRE

    Rinaldi, Fabio; Clematide, Simon; Marques, Hernani; Ellendorff, Tilia; Romacker, Martin; Rodriguez-Esteban, Raul

    2014-01-01

    Text mining services are rapidly becoming a crucial component of various knowledge management pipelines, for example in the process of database curation, or for exploration and enrichment of biomedical data within the pharmaceutical industry. Traditional architectures, based on monolithic applications, do not offer sufficient flexibility for a wide range of use case scenarios, and therefore open architectures, as provided by web services, are attracting increased interest. We present an appro...

  7. OntoGene web services for biomedical text mining.

    Science.gov (United States)

    Rinaldi, Fabio; Clematide, Simon; Marques, Hernani; Ellendorff, Tilia; Romacker, Martin; Rodriguez-Esteban, Raul

    2014-01-01

    Text mining services are rapidly becoming a crucial component of various knowledge management pipelines, for example in the process of database curation, or for exploration and enrichment of biomedical data within the pharmaceutical industry. Traditional architectures, based on monolithic applications, do not offer sufficient flexibility for a wide range of use case scenarios, and therefore open architectures, as provided by web services, are attracting increased interest. We present an approach towards providing advanced text mining capabilities through web services, using a recently proposed standard for textual data interchange (BioC). The web services leverage a state-of-the-art platform for text mining (OntoGene) which has been tested in several community-organized evaluation challenges, with top-ranked results in several of them. PMID:25472638

  8. Application of text mining in the biomedical domain

    NARCIS (Netherlands)

    Fleuren, W.W.M.; Alkema, W.B.L.

    2015-01-01

    In recent years the amount of experimental data that is produced in biomedical research and the number of papers that are being published in this field have grown rapidly. In order to keep up to date with developments in their field of interest and to interpret the outcome of experiments in light of

  9. Recent Advances and Emerging Applications in Text and Data Mining for Biomedical Discovery.

    Science.gov (United States)

    Gonzalez, Graciela H; Tahsin, Tasnia; Goodale, Britton C; Greene, Anna C; Greene, Casey S

    2016-01-01

    Precision medicine will revolutionize the way we treat and prevent disease. A major barrier to the implementation of precision medicine that clinicians and translational scientists face is understanding the underlying mechanisms of disease. We are starting to address this challenge through automatic approaches for information extraction, representation and analysis. Recent advances in text and data mining have been applied to a broad spectrum of key biomedical questions in genomics, pharmacogenomics and other fields. We present an overview of the fundamental methods for text and data mining, as well as recent advances and emerging applications toward precision medicine. PMID:26420781

  10. Knowledge acquisition, semantic text mining, and security risks in health and biomedical informatics

    Institute of Scientific and Technical Information of China (English)

    Pardue, J Harold; Gerthoffer, William T

    2012-01-01

    Computational techniques have been adopted in medical and biological systems for a long time. There is no doubt that the development and application of computational methods will render great help in better understanding biomedical and biological functions. Large amounts of datasets have been produced by biomedical and biological experiments and simulations. In order for researchers to gain knowledge from original data, nontrivial transformation is necessary, which is regarded as a critical link in the chain of knowledge acquisition, sharing, and reuse. Challenges that have been encountered include: how to efficiently and effectively represent human knowledge in formal computing models, how to take advantage of semantic text mining techniques rather than traditional syntactic text mining, and how to handle security issues during knowledge sharing and reuse. This paper summarizes the state-of-the-art in these research directions. We aim to provide readers with an introduction to the major computing themes applied to medical and biological research.

  11. Recent Advances and Emerging Applications in Text and Data Mining for Biomedical Discovery

    OpenAIRE

    Gonzalez, Graciela H.; Tahsin, Tasnia; Goodale, Britton C.; Greene, Anna C.; Greene, Casey S

    2015-01-01

    Precision medicine will revolutionize the way we treat and prevent disease. A major barrier to the implementation of precision medicine that clinicians and translational scientists face is understanding the underlying mechanisms of disease. We are starting to address this challenge through automatic approaches for information extraction, representation and analysis. Recent advances in text and data mining have been applied to a broad spectrum of key biomedical questions in genomics, pharmacog...

  12. Community challenges in biomedical text mining over 10 years: success, failure and the future.

    Science.gov (United States)

    Huang, Chung-Chi; Lu, Zhiyong

    2016-01-01

    One effective way to improve the state of the art is through competitions. Following the success of the Critical Assessment of protein Structure Prediction (CASP) in bioinformatics research, a number of challenge evaluations have been organized by the text-mining research community to assess and advance natural language processing (NLP) research for biomedicine. In this article, we review the different community challenge evaluations held from 2002 to 2014 and their respective tasks. Furthermore, we examine these challenge tasks through their targeted problems in NLP research and biomedical applications, respectively. Next, we describe the general workflow of organizing a Biomedical NLP (BioNLP) challenge and involved stakeholders (task organizers, task data producers, task participants and end users). Finally, we summarize the impact and contributions by taking into account different BioNLP challenges as a whole, followed by a discussion of their limitations and difficulties. We conclude with future trends in BioNLP challenge evaluations. PMID:25935162

  13. Text Mining.

    Science.gov (United States)

    Trybula, Walter J.

    1999-01-01

    Reviews the state of research in text mining, focusing on newer developments. The intent is to describe the disparate investigations currently included under the term text mining and provide a cohesive structure for these efforts. A summary of research identifies key organizations responsible for pushing the development of text mining. A section…

  14. An unsupervised text mining method for relation extraction from biomedical literature.

    Directory of Open Access Journals (Sweden)

    Changqin Quan

    Full Text Available The wealth of interaction information provided in biomedical articles motivated the implementation of text mining approaches to automatically extract biomedical relations. This paper presents an unsupervised method based on pattern clustering and sentence parsing to deal with biomedical relation extraction. The pattern clustering algorithm is based on a polynomial kernel method, which identifies interaction words from unlabeled data; these interaction words are then used in relation extraction between entity pairs. Dependency parsing and phrase structure parsing are combined for relation extraction. Based on the semi-supervised KNN algorithm, we extend the proposed unsupervised approach to a semi-supervised approach by combining pattern clustering, dependency parsing and phrase structure parsing rules. We evaluated the approaches on two different tasks: (1) protein-protein interaction extraction, and (2) gene-suicide association extraction. The evaluation of task (1) on the benchmark dataset (AImed corpus) showed that our proposed unsupervised approach outperformed three supervised methods, which are rule-based, SVM-based, and kernel-based, respectively. The proposed semi-supervised approach is superior to the existing semi-supervised methods. The evaluation of gene-suicide association extraction on a smaller dataset from the Genetic Association Database and a larger dataset from publicly available PubMed showed that the proposed unsupervised and semi-supervised methods achieved much higher F-scores than the co-occurrence-based method.
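
    The paper gives no implementation details here, but the core idea of grouping candidate interaction words by the contexts they occur in can be sketched with polynomial-kernel spectral clustering, as below. This is an illustrative approximation, not the authors' algorithm; the sentences and candidate words are toy data.

    ```python
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.cluster import SpectralClustering

    # Toy sentences; candidate interaction words are clustered by their contexts.
    sentences = [
        "ProteinA binds ProteinB in vitro",
        "ProteinC interacts with ProteinD",
        "GeneX is located on chromosome 7",
        "ProteinE associates with ProteinF",
    ]
    candidates = ["binds", "interacts", "located", "associates"]

    # Represent each candidate word by the bag of words of sentences containing it.
    vec = CountVectorizer()
    X = vec.fit_transform(
        [" ".join(s for s in sentences if w in s) for w in candidates]
    )

    # Spectral clustering with a polynomial-kernel affinity, loosely mirroring
    # the polynomial-kernel pattern clustering described in the abstract.
    labels = SpectralClustering(
        n_clusters=2, affinity="polynomial", random_state=0
    ).fit_predict(X.toarray())
    print(dict(zip(candidates, labels)))
    ```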

  15. Knowledge acquisition, semantic text mining, and security risks in health and biomedical informatics

    OpenAIRE

    Huang, Jingshan; Dou, Dejing; Dang, Jiangbo; Pardue, J Harold; Qin, Xiao; Huan, Jun; Gerthoffer, William T; Tan, Ming

    2012-01-01

    Computational techniques have been adopted in medical and biological systems for a long time. There is no doubt that the development and application of computational methods will render great help in better understanding biomedical and biological functions. Large amounts of datasets have been produced by biomedical and biological experiments and simulations. In order for researchers to gain knowledge from original data, nontrivial transformation is necessary, which is regarded as a critical l...

  16. Text Mining for Neuroscience

    Science.gov (United States)

    Tirupattur, Naveen; Lapish, Christopher C.; Mukhopadhyay, Snehasis

    2011-06-01

    Text mining, sometimes alternately referred to as text analytics, refers to the process of extracting high-quality knowledge from the analysis of textual data. Text mining has a wide variety of applications in areas such as biomedical science, news analysis, and homeland security. In this paper, we describe an approach and some relatively small-scale experiments which apply text mining to neuroscience research literature to find novel associations among a diverse set of entities. Neuroscience is a discipline which encompasses an exceptionally wide range of experimental approaches and rapidly growing interest. This combination results in an overwhelmingly large and often diffuse literature which makes a comprehensive synthesis difficult. Understanding the relations or associations among the entities appearing in the literature not only improves the researchers' current understanding of recent advances in their field, but also provides an important computational tool to formulate novel hypotheses and thereby assist in scientific discoveries. We describe a methodology to automatically mine the literature and form novel associations through direct analysis of published texts. The method first retrieves a set of documents from databases such as PubMed using a set of relevant domain terms. In the current study these terms yielded document sets ranging from 160,909 to 367,214 documents. Each document is then represented in a numerical vector form from which an Association Graph is computed which represents relationships between all pairs of domain terms, based on co-occurrence. Association graphs can then be subjected to various graph theoretic algorithms such as transitive closure and cycle (circuit) detection to derive additional information, and can also be visually presented to a human researcher for understanding. In this paper, we present three relatively small-scale problem-specific case studies to demonstrate that such an approach is very successful in
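
    The co-occurrence-based association graph described above can be sketched in a few lines of Python with networkx; the documents, terms and use of transitive closure below are illustrative only, not the authors' implementation.

    ```python
    from itertools import combinations
    import networkx as nx

    # Toy "documents" (e.g. PubMed abstracts) and domain terms of interest.
    documents = [
        "dopamine release in the prefrontal cortex modulates working memory",
        "working memory deficits are linked to schizophrenia",
        "dopamine signalling is altered in schizophrenia",
    ]
    terms = ["dopamine", "prefrontal cortex", "working memory", "schizophrenia"]

    # Build an association graph: terms are nodes, co-occurrence of two terms
    # in the same document adds (or strengthens) an edge between them.
    G = nx.Graph()
    G.add_nodes_from(terms)
    for doc in documents:
        present = [t for t in terms if t in doc]
        for a, b in combinations(present, 2):
            w = G.get_edge_data(a, b, {}).get("weight", 0)
            G.add_edge(a, b, weight=w + 1)

    print(list(G.edges(data=True)))

    # Derived information, e.g. indirect associations via transitive closure.
    closure = nx.transitive_closure(nx.DiGraph(G))
    print(list(closure.edges()))
    ```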

  17. Mining text data

    CERN Document Server

    Aggarwal, Charu C

    2012-01-01

    Text mining applications have experienced tremendous advances because of web 2.0 and social networking applications. Recent advances in hardware and software technology have led to a number of unique scenarios where text mining algorithms are learned. ""Mining Text Data"" introduces an important niche in the text analytics field, and is an edited volume contributed by leading international researchers and practitioners focused on social networks & data mining. This book contains a wide swath of topics across social networks & data mining. Each chapter contains a comprehensive survey including

  18. Services for annotation of biomedical text

    OpenAIRE

    Hakenberg, Jörg

    2008-01-01

    Motivation: Text mining in the biomedical domain in recent years has focused on the development of tools for recognizing named entities and extracting relations. Such research resulted from the need for such tools as basic components for more advanced solutions. Named entity recognition, entity mention normalization, and relationship extraction now have reached a stage where they perform comparably to human annotators (considering inter-annotator agreement, measured in many studies to be aro...

  19. Hotspots in text mining of the biomedical field

    Institute of Scientific and Technical Information of China (English)

    史航; 高雯珺; 崔雷

    2016-01-01

    The high frequency subject terms were extracted from the PubMed-covered papers on text mining in the biomedical field published from January 2000 to March 2015 to generate a matrix of high frequency subject terms and their source papers. The co-occurrence of high frequency subject terms in the same paper was analyzed by clustering analysis. The hotspots in text mining of the biomedical field were analyzed according to the clustering analysis of high frequency subject terms and their corresponding class labels, which showed that the hotspots in text mining of the biomedical field were the basic technologies of text mining, the application of text mining in biomedical informatics, and the application of text mining in the extraction of drug-related facts.

  20. Figure text extraction in biomedical literature.

    Directory of Open Access Journals (Sweden)

    Daehyun Kim

    Full Text Available BACKGROUND: Figures are ubiquitous in biomedical full-text articles, and they represent important biomedical knowledge. However, the sheer volume of biomedical publications has made it necessary to develop computational approaches for accessing figures. Therefore, we are developing the Biomedical Figure Search engine (http://figuresearch.askHERMES.org) to allow bioscientists to access figures efficiently. Since text frequently appears in figures, automatically extracting such text may assist the task of mining information from figures. Little research, however, has been conducted exploring text extraction from biomedical figures. METHODOLOGY: We first evaluated an off-the-shelf Optical Character Recognition (OCR) tool on its ability to extract text from figures appearing in biomedical full-text articles. We then developed a Figure Text Extraction Tool (FigTExT) to improve the performance of the OCR tool for figure text extraction through the use of three innovative components: image preprocessing, character recognition, and text correction. We first developed image preprocessing to enhance image quality and to improve text localization. Then we adapted the off-the-shelf OCR tool on the improved text localization for character recognition. Finally, we developed and evaluated a novel text correction framework by taking advantage of figure-specific lexicons. RESULTS/CONCLUSIONS: The evaluation on 382 figures (9,643 figure texts in total) randomly selected from PubMed Central full-text articles shows that FigTExT performed with 84% precision, 98% recall, and 90% F1-score for text localization and with 62.5% precision, 51.0% recall and 56.2% F1-score for figure text extraction. When limiting figure texts to those judged by domain experts to be important content, FigTExT performed with 87.3% precision, 68.8% recall, and 77% F1-score. FigTExT significantly improved the performance of the off-the-shelf OCR tool we used, which on its own performed with 36
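
    A heavily simplified sketch of the preprocess-then-OCR-then-correct pattern described above follows. It assumes the Pillow and pytesseract packages plus a local Tesseract installation; the image path and figure-specific lexicon are placeholders, and FigTExT itself is considerably more sophisticated.

    ```python
    import difflib

    from PIL import Image
    import pytesseract

    # Illustrative figure-specific lexicon used for text correction.
    FIGURE_LEXICON = ["p53", "MDM2", "apoptosis", "wild-type"]

    def preprocess(path):
        """Basic image preprocessing: grayscale conversion and upscaling."""
        img = Image.open(path).convert("L")
        return img.resize((img.width * 2, img.height * 2))

    def correct(word):
        """Correct an OCR token against the figure-specific lexicon."""
        match = difflib.get_close_matches(word, FIGURE_LEXICON, n=1, cutoff=0.8)
        return match[0] if match else word

    def extract_figure_text(path):
        raw = pytesseract.image_to_string(preprocess(path))
        return [correct(w) for w in raw.split()]

    if __name__ == "__main__":
        print(extract_figure_text("figure.png"))  # placeholder image path
    ```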

  1. Mining Molecular Pharmacological Effects from Biomedical Text: a Case Study for Eliciting Anti-Obesity/Diabetes Effects of Chemical Compounds.

    Science.gov (United States)

    Dura, Elzbieta; Muresan, Sorel; Engkvist, Ola; Blomberg, Niklas; Chen, Hongming

    2014-05-01

    In the pharmaceutical industry, efficiently mining pharmacological data from the rapidly increasing scientific literature is crucial for many aspects of the drug discovery process such as target validation, tool compound selection, etc. A quick and reliable way is needed to collect literature assertions of selected compounds' biological and pharmacological effects in order to assist the hypothesis generation and decision-making of drug developers. INFUSIS, the text mining system presented here, extracts data on chemical compounds from PubMed abstracts. It involves an extensive use of customized natural language processing in addition to a co-occurrence analysis. As a proof-of-concept study, INFUSIS was used to search in abstract texts for several obesity/diabetes-related pharmacological effects of the compounds included in a compound dictionary. The system extracts assertions regarding the pharmacological effects of each given compound and scores them by relevance. For each selected pharmacological effect, the highest scoring assertions in 100 abstracts were manually evaluated, i.e. 800 abstracts in total. The overall accuracy for the inferred assertions was over 90 percent. PMID:27485890
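
    The co-occurrence component mentioned above can be illustrated with a minimal sketch that scores compound-effect assertions by sentence-level co-occurrence. The abstracts, compound dictionary and effect terms are toy data, and INFUSIS adds customized natural language processing on top of this kind of signal.

    ```python
    import re
    from collections import Counter

    # Toy abstracts; compound dictionary and effect terms are illustrative.
    abstracts = [
        "Compound X reduced body weight gain and improved insulin sensitivity.",
        "Treatment with compound X lowered blood glucose in obese mice.",
        "Compound Y had no effect on body weight.",
    ]
    compounds = ["compound x", "compound y"]
    effects = ["body weight", "insulin sensitivity", "blood glucose"]

    def sentences(text):
        """Naive sentence splitting on terminal punctuation."""
        return re.split(r"(?<=[.!?])\s+", text.lower())

    # Score each (compound, effect) pair by how often the two co-occur in a
    # sentence; a full system adds NLP to turn these hits into assertions.
    scores = Counter()
    for abstract in abstracts:
        for sent in sentences(abstract):
            for c in compounds:
                for e in effects:
                    if c in sent and e in sent:
                        scores[(c, e)] += 1

    for (c, e), s in scores.most_common():
        print(f"{c} -> {e}: {s}")
    ```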

  2. Contextual Text Mining

    Science.gov (United States)

    Mei, Qiaozhu

    2009-01-01

    With the dramatic growth of text information, there is an increasing need for powerful text mining systems that can automatically discover useful knowledge from text. Text is generally associated with all kinds of contextual information. Those contexts can be explicit, such as the time and the location where a blog article is written, and the…

  3. Chapter 16: Text Mining for Translational Bioinformatics

    OpenAIRE

    Bretonnel Cohen, K; Hunter, Lawrence E.

    2013-01-01

    Text mining for translational bioinformatics is a new field with tremendous research potential. It is a subfield of biomedical natural language processing that concerns itself directly with the problem of relating basic biomedical research to clinical practice, and vice versa. Applications of text mining fall both into the category of T1 translational research (translating basic science results into new interventions) and T2 translational research, or translational research for public health. P...

  4. Text Mining: (Asynchronous Sequences)

    Directory of Open Access Journals (Sweden)

    Sheema Khan

    2014-12-01

    Full Text Available In this paper we try to correlate text sequences that provide common topics as semantic clues. We propose a two-step method for asynchronous text mining. Step one checks for the common topics in the sequences and isolates these with their timestamps. Step two takes the topic and tries to give the timestamp of the text document. After multiple repetitions of step two, we can give an optimal result.

  5. Research on Biomedical Text Mining Based on Knowledge Organization System

    Institute of Scientific and Technical Information of China (English)

    钱庆

    2016-01-01

    With the rapid development of biomedical information technology, the biomedical literature is growing exponentially. It is hard to acquire and understand the required knowledge by manual reading alone, so how to integrate existing knowledge from huge amounts of biomedical literature and mine new knowledge has become a current research hotspot. Knowledge organization system construction in the biomedical field is more standardized and complete than in other fields, which lays the foundation for biomedical text mining, and a large number of text mining methods and systems based on knowledge organization systems have developed rapidly. This paper surveys the existing medical knowledge organization systems and summarizes the main process of biomedical text mining. It also summarizes the research and recent progress by mining task, and analyzes the characteristics of biomedical text mining based on knowledge organization systems. The important role that knowledge organization systems play in biomedical text mining and the challenges facing current research are summarized, so as to provide references for biomedical workers.

  6. Text Mining Infrastructure in R

    OpenAIRE

    Kurt Hornik; Ingo Feinerer; David Meyer

    2008-01-01

    During the last decade text mining has become a widely used discipline utilizing statistical and machine learning methods. We present the tm package which provides a framework for text mining applications within R. We give a survey on text mining facilities in R and explain how typical application tasks can be carried out using our framework. We present techniques for count-based analysis methods, text clustering, text classification and string kernels. (authors' abstract)

  7. Text Mining for Protein Docking.

    Directory of Open Access Journals (Sweden)

    Varsha D Badal

    2015-12-01

    Full Text Available The rapidly growing amount of publicly available information from biomedical research is readily accessible on the Internet, providing a powerful resource for predictive biomolecular modeling. The accumulated data on experimentally determined structures transformed structure prediction of proteins and protein complexes. Instead of exploring the enormous search space, predictive tools can simply proceed to the solution based on similarity to the existing, previously determined structures. A similar major paradigm shift is emerging due to the rapidly expanding amount of information, other than experimentally determined structures, which still can be used as constraints in biomolecular structure prediction. Automated text mining has been widely used in recreating protein interaction networks, as well as in detecting small ligand binding sites on protein structures. Combining and expanding these two well-developed areas of research, we applied text mining to structural modeling of protein-protein complexes (protein docking). Protein docking can be significantly improved when constraints on the docking mode are available. We developed a procedure that retrieves published abstracts on a specific protein-protein interaction and extracts information relevant to docking. The procedure was assessed on protein complexes from Dockground (http://dockground.compbio.ku.edu). The results show that correct information on binding residues can be extracted for about half of the complexes. The amount of irrelevant information was reduced by conceptual analysis of a subset of the retrieved abstracts, based on the bag-of-words (features) approach. Support Vector Machine models were trained and validated on the subset. The remaining abstracts were filtered by the best-performing models, which decreased the irrelevant information for ~25% of the complexes in the dataset. The extracted constraints were incorporated in the docking protocol and tested on the Dockground unbound
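
    The abstract filtering step described above (bag-of-words features plus Support Vector Machine models) can be sketched with scikit-learn as follows; the labelled abstracts are invented toy examples, not Dockground data.

    ```python
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.pipeline import make_pipeline
    from sklearn.svm import LinearSVC

    # Toy labelled abstracts (1 = contains docking-relevant binding information).
    train_texts = [
        "Mutation of residue R45 abolished binding to the partner protein.",
        "The interface residues were mapped by alanine scanning.",
        "The gene is conserved across vertebrate species.",
        "Expression levels varied between tissue types.",
    ]
    train_labels = [1, 1, 0, 0]

    # Bag-of-words features plus a linear SVM, mirroring the filtering step
    # described above (real feature sets and training data would be larger).
    clf = make_pipeline(TfidfVectorizer(), LinearSVC())
    clf.fit(train_texts, train_labels)

    new_abstract = ["Substitution of the interface residue reduced complex formation."]
    print(clf.predict(new_abstract))  # 1 keeps the abstract, 0 filters it out
    ```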

  8. Text Mining Applications and Theory

    CERN Document Server

    Berry, Michael W

    2010-01-01

    Text Mining: Applications and Theory presents the state-of-the-art algorithms for text mining from both the academic and industrial perspectives.  The contributors span several countries and scientific domains: universities, industrial corporations, and government laboratories, and demonstrate the use of techniques from machine learning, knowledge discovery, natural language processing and information retrieval to design computational models for automated text analysis and mining. This volume demonstrates how advancements in the fields of applied mathematics, computer science, machine learning

  9. Text mining meets workflow: linking U-Compare with Taverna

    OpenAIRE

    Kano, Yoshinobu; Dobson, Paul; Nakanishi, Mio; Tsujii, Jun'ichi; Ananiadou, Sophia

    2010-01-01

    Summary: Text mining from the biomedical literature is of increasing importance, yet it is not easy for the bioinformatics community to create and run text mining workflows due to the lack of accessibility and interoperability of the text mining resources. The U-Compare system provides a wide range of bio text mining resources in a highly interoperable workflow environment where workflows can very easily be created, executed, evaluated and visualized without coding. We have linked U-Compare t...

  10. Towards Effective Sentence Simplification for Automatic Processing of Biomedical Text

    CERN Document Server

    Jonnalagadda, Siddhartha; Hakenberg, Jorg; Baral, Chitta; Gonzalez, Graciela

    2010-01-01

    The complexity of sentences characteristic to biomedical articles poses a challenge to natural language parsers, which are typically trained on large-scale corpora of non-technical text. We propose a text simplification process, bioSimplify, that seeks to reduce the complexity of sentences in biomedical abstracts in order to improve the performance of syntactic parsers on the processed sentences. Syntactic parsing is typically one of the first steps in a text mining pipeline. Thus, any improvement in performance would have a ripple effect over all processing steps. We evaluated our method using a corpus of biomedical sentences annotated with syntactic links. Our empirical results show an improvement of 2.90% for the Charniak-McClosky parser and of 4.23% for the Link Grammar parser when processing simplified sentences rather than the original sentences in the corpus.

  11. Text mining: A Brief survey

    OpenAIRE

    Falguni N. Patel; Neha R. Soni

    2012-01-01

    The unstructured texts which contain massive amounts of information cannot simply be used for further processing by computers. Therefore, specific processing methods and algorithms are required in order to extract useful patterns. The process of extracting interesting information and knowledge from unstructured text is accomplished using text mining. In this paper, we have discussed text mining, as a recent and interesting field, with the details of the steps involved in the overall process. We have...

  12. Using natural language processing to improve biomedical concept normalization and relation mining

    OpenAIRE

    Kang, Ning

    2013-01-01

    This thesis concerns the use of natural language processing for improving biomedical concept normalization and relation mining. We begin by introducing the background of biomedical text mining, and subsequently describe a typical text mining pipeline, some key issues and problems in mining biomedical texts, and the possibility of using natural language processing to solve the problems. Finally, we end with an outline of the work done in this thesis.

  13. DeTEXT: A Database for Evaluating Text Extraction from Biomedical Literature Figures.

    Directory of Open Access Journals (Sweden)

    Xu-Cheng Yin

    Full Text Available Hundreds of millions of figures are available in biomedical literature, representing important biomedical experimental evidence. Since text is a rich source of information in figures, automatically extracting such text may assist in the task of mining figure information. A high-quality ground truth standard can greatly facilitate the development of an automated system. This article describes DeTEXT: A database for evaluating text extraction from biomedical literature figures. It is the first publicly available, human-annotated, high quality, and large-scale figure-text dataset with 288 full-text articles, 500 biomedical figures, and 9308 text regions. This article describes how figures were selected from open-access full-text biomedical articles and how annotation guidelines and annotation tools were developed. We also discuss the inter-annotator agreement and the reliability of the annotations. We summarize the statistics of the DeTEXT data and make available evaluation protocols for DeTEXT. Finally we lay out challenges we observed in the automated detection and recognition of figure text and discuss research directions in this area. DeTEXT is publicly available for downloading at http://prir.ustb.edu.cn/DeTEXT/.

  14. Typesafe Modeling in Text Mining

    OpenAIRE

    Steeg, Fabian

    2011-01-01

    Based on the concept of annotation-based agents, this report introduces tools and a formal notation for defining and running text mining experiments using a statically typed domain-specific language embedded in Scala. Using machine learning for classification as an example, the framework is used to develop and document text mining experiments, and to show how the concept of generic, typesafe annotation corresponds to a general information model that goes beyond text processing.

  15. System for Distributed Text Mining

    OpenAIRE

    Torgersen, Martin Nordseth

    2011-01-01

    Text mining presents us with new possibilities for the use of collections of documents. There exists a large amount of hidden implicit information inside these collections, which text mining techniques may help us to uncover. Unfortunately, these techniques generally require large amounts of computational power. This is addressed by the introduction of distributed systems and methods for distributed processing, such as Hadoop and MapReduce. This thesis aims to describe, design, implement and ev...

  16. Text mining: A Brief survey

    Directory of Open Access Journals (Sweden)

    Falguni N. Patel; Neha R. Soni

    2012-12-01

    Full Text Available The unstructured texts which contain massive amounts of information cannot simply be used for further processing by computers. Therefore, specific processing methods and algorithms are required in order to extract useful patterns. The process of extracting interesting information and knowledge from unstructured text is accomplished using text mining. In this paper, we have discussed text mining, as a recent and interesting field, with the details of the steps involved in the overall process. We have also discussed different technologies that teach computers to deal with natural language so that they may analyze, understand, and even generate text. In addition, we briefly discuss a number of successful applications of text mining which are in use currently and anticipated in the future.

  17. Learning the Structure of Biomedical Relationships from Unstructured Text.

    Directory of Open Access Journals (Sweden)

    Bethany Percha

    2015-07-01

    Full Text Available The published biomedical research literature encompasses most of our understanding of how drugs interact with gene products to produce physiological responses (phenotypes). Unfortunately, this information is distributed throughout the unstructured text of over 23 million articles. The creation of structured resources that catalog the relationships between drugs and genes would accelerate the translation of basic molecular knowledge into discoveries of genomic biomarkers for drug response and prediction of unexpected drug-drug interactions. Extracting these relationships from natural language sentences on such a large scale, however, requires text mining algorithms that can recognize when different-looking statements are expressing similar ideas. Here we describe a novel algorithm, Ensemble Biclustering for Classification (EBC), that learns the structure of biomedical relationships automatically from text, overcoming differences in word choice and sentence structure. We validate EBC's performance against manually-curated sets of (1) pharmacogenomic relationships from PharmGKB and (2) drug-target relationships from DrugBank, and use it to discover new drug-gene relationships for both knowledge bases. We then apply EBC to map the complete universe of drug-gene relationships based on their descriptions in Medline, revealing unexpected structure that challenges current notions about how these relationships are expressed in text. For instance, we learn that newer experimental findings are described in consistently different ways than established knowledge, and that seemingly pure classes of relationships can exhibit interesting chimeric structure. The EBC algorithm is flexible and adaptable to a wide range of problems in biomedical text mining.
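
    EBC builds an ensemble of biclusterings of a co-occurrence matrix whose rows are drug-gene pairs and whose columns are the textual patterns connecting them. The sketch below shows a single spectral co-clustering of such a matrix with scikit-learn as a rough illustration of the idea; it is not the authors' implementation, and the matrix is toy data.

    ```python
    import numpy as np
    from sklearn.cluster import SpectralCoclustering

    # Toy co-occurrence matrix: rows are drug-gene pairs, columns are textual
    # patterns (e.g. dependency paths) connecting them in sentences.
    pairs = ["warfarin-VKORC1", "gefitinib-EGFR", "metformin-AMPK", "aspirin-PTGS2"]
    patterns = ["inhibits", "targets", "activates", "is metabolized by"]
    counts = np.array([
        [5, 1, 0, 2],
        [4, 6, 0, 0],
        [0, 0, 7, 1],
        [3, 2, 0, 0],
    ])

    # A single spectral co-clustering run; EBC aggregates many such runs into
    # an ensemble to learn relationship classes robustly.
    model = SpectralCoclustering(n_clusters=2, random_state=0).fit(counts)
    for k in range(2):
        rows, cols = model.get_indices(k)
        print(f"bicluster {k}:",
              [pairs[i] for i in rows], [patterns[j] for j in cols])
    ```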

  18. Text mining for the biocuration workflow.

    Science.gov (United States)

    Hirschman, Lynette; Burns, Gully A P C; Krallinger, Martin; Arighi, Cecilia; Cohen, K Bretonnel; Valencia, Alfonso; Wu, Cathy H; Chatr-Aryamontri, Andrew; Dowell, Karen G; Huala, Eva; Lourenço, Anália; Nash, Robert; Veuthey, Anne-Lise; Wiegers, Thomas; Winter, Andrew G

    2012-01-01

    Molecular biology has become heavily dependent on biological knowledge encoded in expert curated biological databases. As the volume of biological literature increases, biocurators need help in keeping up with the literature; (semi-) automated aids for biocuration would seem to be an ideal application for natural language processing and text mining. However, to date, there have been few documented successes for improving biocuration throughput using text mining. Our initial investigations took place for the workshop on 'Text Mining for the BioCuration Workflow' at the third International Biocuration Conference (Berlin, 2009). We interviewed biocurators to obtain workflows from eight biological databases. This initial study revealed high-level commonalities, including (i) selection of documents for curation; (ii) indexing of documents with biologically relevant entities (e.g. genes); and (iii) detailed curation of specific relations (e.g. interactions); however, the detailed workflows also showed many variabilities. Following the workshop, we conducted a survey of biocurators. The survey identified biocurator priorities, including the handling of full text indexed with biological entities and support for the identification and prioritization of documents for curation. It also indicated that two-thirds of the biocuration teams had experimented with text mining and almost half were using text mining at that time. Analysis of our interviews and survey provide a set of requirements for the integration of text mining into the biocuration workflow. These can guide the identification of common needs across curated databases and encourage joint experimentation involving biocurators, text mining developers and the larger biomedical research community. PMID:22513129

  19. Learning the Structure of Biomedical Relationships from Unstructured Text.

    Science.gov (United States)

    Percha, Bethany; Altman, Russ B

    2015-07-01

    The published biomedical research literature encompasses most of our understanding of how drugs interact with gene products to produce physiological responses (phenotypes). Unfortunately, this information is distributed throughout the unstructured text of over 23 million articles. The creation of structured resources that catalog the relationships between drugs and genes would accelerate the translation of basic molecular knowledge into discoveries of genomic biomarkers for drug response and prediction of unexpected drug-drug interactions. Extracting these relationships from natural language sentences on such a large scale, however, requires text mining algorithms that can recognize when different-looking statements are expressing similar ideas. Here we describe a novel algorithm, Ensemble Biclustering for Classification (EBC), that learns the structure of biomedical relationships automatically from text, overcoming differences in word choice and sentence structure. We validate EBC's performance against manually-curated sets of (1) pharmacogenomic relationships from PharmGKB and (2) drug-target relationships from DrugBank, and use it to discover new drug-gene relationships for both knowledge bases. We then apply EBC to map the complete universe of drug-gene relationships based on their descriptions in Medline, revealing unexpected structure that challenges current notions about how these relationships are expressed in text. For instance, we learn that newer experimental findings are described in consistently different ways than established knowledge, and that seemingly pure classes of relationships can exhibit interesting chimeric structure. The EBC algorithm is flexible and adaptable to a wide range of problems in biomedical text mining. PMID:26219079

  20. Using natural language processing to improve biomedical concept normalization and relation mining

    NARCIS (Netherlands)

    N. Kang (Ning)

    2013-01-01

    This thesis concerns the use of natural language processing for improving biomedical concept normalization and relation mining. We begin by introducing the background of biomedical text mining, and subsequently describe a typical text mining pipeline, some key iss

  1. Biomarker Identification Using Text Mining

    Directory of Open Access Journals (Sweden)

    Hui Li

    2012-01-01

    Full Text Available Identifying molecular biomarkers has become one of the important tasks for scientists to assess the different phenotypic states of cells or organisms correlated to the genotypes of diseases from large-scale biological data. In this paper, we propose a text-mining-based method to discover biomarkers from PubMed. First, we construct a database based on a dictionary, and then we use a finite state machine to identify the biomarkers. Our method of text mining provides a highly reliable approach to discovering the biomarkers in the PubMed database.
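
    A minimal sketch of the dictionary-driven matching idea described above follows: a tiny illustrative biomarker dictionary and a longest-match scan over the token stream standing in for the finite state machine.

    ```python
    # Illustrative biomarker dictionary: entries are tuples of lowercase tokens.
    BIOMARKERS = {("psa",), ("ca", "125"), ("her2",), ("c", "reactive", "protein")}

    def find_biomarkers(text):
        """Scan the token stream, preferring the longest dictionary match."""
        tokens = text.lower().replace("-", " ").split()
        hits = []
        i = 0
        while i < len(tokens):
            for length in sorted({len(e) for e in BIOMARKERS}, reverse=True):
                candidate = tuple(tokens[i:i + length])
                if candidate in BIOMARKERS:
                    hits.append(" ".join(candidate))
                    i += length
                    break
            else:
                i += 1  # no dictionary entry starts here; advance one token
        return hits

    print(find_biomarkers(
        "Serum PSA and C-reactive protein were elevated while CA 125 was normal"))
    # ['psa', 'c reactive protein', 'ca 125']
    ```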

  2. Database citation in full text biomedical articles.

    Science.gov (United States)

    Kafkas, Şenay; Kim, Jee-Hyub; McEntyre, Johanna R

    2013-01-01

    Molecular biology and literature databases represent essential infrastructure for life science research. Effective integration of these data resources requires that there are structured cross-references at the level of individual articles and biological records. Here, we describe the current patterns of how database entries are cited in research articles, based on analysis of the full text Open Access articles available from Europe PMC. Focusing on citation of entries in the European Nucleotide Archive (ENA), UniProt and Protein Data Bank in Europe (PDBe), we demonstrate that text mining doubles the number of structured annotations of database record citations supplied in journal articles by publishers. Many thousands of new literature-database relationships are found by text mining, since these relationships are also not present in the set of articles cited by database records. We recommend that structured annotation of database records in articles is extended to other databases, such as ArrayExpress and Pfam, entries from which are also cited widely in the literature. The very high precision and high throughput of this text-mining pipeline make this activity possible both accurately and at low cost, which will allow the development of new integrated data services. PMID:23734176
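
    Recognizing database accession numbers in article text is, at its simplest, a pattern-matching task. The sketch below uses simplified, approximate regular expressions for a few accession formats; the production Europe PMC pipeline is far richer and adds contextual validation.

    ```python
    import re

    # Simplified, approximate accession-number patterns for illustration only.
    PATTERNS = {
        "UniProt": r"\b[OPQ][0-9][A-Z0-9]{3}[0-9]\b",
        "PDB": r"\bPDB[: ]([0-9][A-Za-z0-9]{3})\b",
        "ENA": r"\b[A-Z]{2}[0-9]{6}\b",
    }

    def find_database_citations(text):
        """Return candidate accession numbers grouped by database."""
        return {db: re.findall(pattern, text) for db, pattern in PATTERNS.items()}

    sentence = ("The sequence (ENA accession AB123456) encodes a protein "
                "(UniProt P04637) whose structure is deposited as PDB 1TUP.")
    print(find_database_citations(sentence))
    # {'UniProt': ['P04637'], 'PDB': ['1TUP'], 'ENA': ['AB123456']}
    ```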

  3. NAMED ENTITY RECOGNITION FROM BIOMEDICAL TEXT - AN INFORMATION EXTRACTION TASK

    Directory of Open Access Journals (Sweden)

    N. Kanya

    2016-07-01

    Full Text Available Biomedical Text Mining targets the extraction of significant information from biomedical archives. BioTM encompasses Information Retrieval (IR) and Information Extraction (IE). Information Retrieval retrieves the relevant biomedical literature documents from various repositories such as PubMed, MEDLINE etc., based on a search query. The IR process ends with the generation of a corpus of the relevant documents retrieved from the publication databases based on the query. The IE task includes preprocessing of the documents, Named Entity Recognition (NER) from the documents, and relationship extraction. This process involves natural language processing, data mining techniques and machine learning algorithms. The preprocessing task includes tokenization, stop word removal, shallow parsing, and parts-of-speech tagging. The NER phase involves recognition of well-defined objects such as genes, proteins or cell lines. This process leads to the next phase, that is, extraction of relationships (IE). The work was based on the machine learning algorithm Conditional Random Fields (CRF).
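
    A minimal sketch of the CRF-based NER step described above is given below, assuming the third-party sklearn-crfsuite package is installed; the feature template and two-sentence training set are illustrative only, not the paper's data.

    ```python
    import sklearn_crfsuite

    def token_features(sent, i):
        """Simple illustrative feature template for one token."""
        word = sent[i]
        return {
            "lower": word.lower(),
            "is_upper": word.isupper(),
            "has_digit": any(c.isdigit() for c in word),
            "suffix3": word[-3:],
            "prev": sent[i - 1].lower() if i > 0 else "<BOS>",
            "next": sent[i + 1].lower() if i < len(sent) - 1 else "<EOS>",
        }

    # Tiny toy training set with BIO-style gene labels.
    train_sents = [
        ["BRCA1", "mutations", "increase", "cancer", "risk", "."],
        ["The", "TP53", "protein", "regulates", "apoptosis", "."],
    ]
    train_labels = [
        ["B-GENE", "O", "O", "O", "O", "O"],
        ["O", "B-GENE", "O", "O", "O", "O"],
    ]

    X = [[token_features(s, i) for i in range(len(s))] for s in train_sents]
    crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
    crf.fit(X, train_labels)

    test = ["EGFR", "signalling", "was", "inhibited", "."]
    features = [token_features(test, i) for i in range(len(test))]
    print(list(zip(test, crf.predict_single(features))))
    ```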

  4. Reviving "Walden": Mining the Text.

    Science.gov (United States)

    Hewitt, Julia

    2000-01-01

    Describes how the author and her high school English students begin their study of Thoreau's "Walden" by mining the text for quotations to inspire their own writing and discussion on the topic, "How does Thoreau speak to you or how could he speak to someone you know?" (SR)

  5. Mining biomedical images towards valuable information retrieval in biomedical and life sciences

    Science.gov (United States)

    Ahmed, Zeeshan; Zeeshan, Saman; Dandekar, Thomas

    2016-01-01

    Biomedical images are helpful sources for scientists and practitioners in drawing significant hypotheses, exemplifying approaches and describing experimental results in the published biomedical literature. In recent decades, there has been an enormous increase in the amount of heterogeneous biomedical image production and publication, which results in a need for bioimaging platforms for feature extraction and analysis of text and content in biomedical images in order to implement effective information retrieval systems. In this review, we summarize technologies related to data mining of figures. We describe and compare the potential of different approaches in terms of their developmental aspects, used methodologies, produced results, achieved accuracies and limitations. Our comparative conclusions include current challenges for bioimaging software with selective image mining, embedded text extraction and processing of complex natural language queries. PMID:27538578

  6. Text Classification using Data Mining

    CERN Document Server

    Kamruzzaman, S M; Hasan, Ahmed Ryadh

    2010-01-01

    Text classification is the process of classifying documents into predefined categories based on their content. It is the automated assignment of natural language texts to predefined categories. Text classification is the primary requirement of text retrieval systems, which retrieve texts in response to a user query, and text understanding systems, which transform text in some way such as producing summaries, answering questions or extracting data. Existing supervised learning algorithms to automatically classify text need sufficient documents to learn accurately. This paper presents a new algorithm for text classification using data mining that requires fewer documents for training. Instead of using words, word relations, i.e. association rules derived from these words, are used to derive the feature set from pre-classified text documents. The concept of a Naive Bayes classifier is then used on the derived features, and finally only a single concept of a Genetic Algorithm has been added for final classification. A system based on the...
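
    The idea of using word relations rather than single words as features can be sketched as follows, where unordered word pairs stand in for association rules and a Naive Bayes classifier is trained on them with scikit-learn; the genetic algorithm step of the paper is omitted, and the documents are toy data.

    ```python
    from itertools import combinations

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    # Toy pre-classified documents.
    docs = [
        "gene expression measured in tumor cells",
        "protein binding assay in cancer tissue",
        "stock market prices fell sharply today",
        "interest rates and market volatility rose",
    ]
    labels = ["bio", "bio", "finance", "finance"]

    def word_pairs(text):
        """Use unordered word pairs (a crude stand-in for association rules
        between words) as features instead of single words."""
        words = sorted(set(text.lower().split()))
        return [f"{a}|{b}" for a, b in combinations(words, 2)]

    clf = make_pipeline(CountVectorizer(analyzer=word_pairs), MultinomialNB())
    clf.fit(docs, labels)
    print(clf.predict(["tumor gene expression profiling"]))
    ```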

  7. SIAM 2007 Text Mining Competition dataset

    Data.gov (United States)

    National Aeronautics and Space Administration — Subject Area: Text Mining Description: This is the dataset used for the SIAM 2007 Text Mining competition. This competition focused on developing text mining...

  8. Text Mining in Social Networks

    Science.gov (United States)

    Aggarwal, Charu C.; Wang, Haixun

    Social networks are rich in various kinds of contents such as text and multimedia. The ability to apply text mining algorithms effectively in the context of text data is critical for a wide variety of applications. Social networks require text mining algorithms for a wide variety of applications such as keyword search, classification, and clustering. While search and classification are well known applications for a wide variety of scenarios, social networks have a much richer structure both in terms of text and links. Much of the work in the area uses either purely the text content or purely the linkage structure. However, many recent algorithms use a combination of linkage and content information for mining purposes. In many cases, it turns out that the use of a combination of linkage and content information provides much more effective results than a system which is based purely on either of the two. This paper provides a survey of such algorithms, and the advantages observed by using such algorithms in different scenarios. We also present avenues for future research in this area.

  9. Document Exploration and Automatic Knowledge Extraction for Unstructured Biomedical Text

    Science.gov (United States)

    Chu, S.; Totaro, G.; Doshi, N.; Thapar, S.; Mattmann, C. A.; Ramirez, P.

    2015-12-01

    We describe our work on building a web-browser based document reader with a built-in exploration tool and automatic concept extraction of medical entities for biomedical text. Vast amounts of biomedical information are offered in unstructured text form through scientific publications and R&D reports. Utilizing text mining can help us to mine information and extract relevant knowledge from a plethora of biomedical text. The ability to employ such technologies to aid researchers in coping with information overload is greatly desirable. In recent years, there has been an increased interest in automatic biomedical concept extraction [1, 2] and intelligent PDF reader tools with the ability to search on content and find related articles [3]. Such reader tools are typically desktop applications and are limited to specific platforms. Our goal is to provide researchers with a simple tool to aid them in finding, reading, and exploring documents. Thus, we propose a web-based document explorer, which we called Shangri-Docs, which combines a document reader with automatic concept extraction and highlighting of relevant terms. Shangri-Docs also provides the ability to evaluate a wide variety of document formats (e.g. PDF, Word, PPT, text, etc.) and to exploit the linked nature of the Web and personal content by performing searches on content from public sites (e.g. Wikipedia, PubMed) and private cataloged databases simultaneously. Shangri-Docs utilizes Apache cTAKES (clinical Text Analysis and Knowledge Extraction System) [4] and Unified Medical Language System (UMLS) to automatically identify and highlight terms and concepts, such as specific symptoms, diseases, drugs, and anatomical sites, mentioned in the text. cTAKES was originally designed specifically to extract information from clinical medical records. Our investigation leads us to extend the automatic knowledge extraction process of cTAKES for the biomedical research domain by improving the ontology-guided information extraction

  10. Searching Biomedical Text: Towards Maximum Relevant Results

    OpenAIRE

    Galde, Ola; Sevaldsen, John Harald

    2006-01-01

    The amount of biomedical information available to users today is large and increasing. The ability to precisely retrieve desired information is vital in order to utilize available knowledge. In this work we investigated how to improve the relevance of biomedical search results. Using the Lucene Java API we applied a series of information retrieval techniques to search in biomedical data. The techniques ranged from basic stemming and stop-word removal to more advanced methods like user relevan...

  11. GPU-Accelerated Text Mining

    Energy Technology Data Exchange (ETDEWEB)

    Cui, Xiaohui [ORNL]; Mueller, Frank [North Carolina State University]; Zhang, Yongpeng [ORNL]; Potok, Thomas E [ORNL]

    2009-01-01

    Accelerating hardware devices represent a novel promise for improving performance in many problem domains, but it is not clear which accelerators are suitable for which domains. While there is no room in general-purpose processor design to significantly increase the processor frequency, developers are instead resorting to multi-core chips duplicating conventional computing capabilities on a single die. Yet, accelerators offer more radical designs with a much higher level of parallelism and novel programming environments. The present work assesses the viability of text mining on CUDA. Text mining is one of the key concepts that has become prominent as an effective means to index the Internet, but its applications range beyond this scope and extend to providing document similarity metrics, the subject of this work. We have developed and optimized text search algorithms for GPUs to exploit their potential for massive data processing. We discuss the algorithmic challenges of parallelization for text search problems on GPUs and demonstrate the potential of these devices in experiments by reporting significant speedups. Our study may be one of the first to assess more complex text search problems for suitability for GPU devices, and it may also be one of the first to exploit and report on atomic instruction usage that have recently become available in NVIDIA devices.

  12. GPU-Accelerated Text Mining

    International Nuclear Information System (INIS)

    Accelerating hardware devices represent a novel promise for improving performance in many problem domains, but it is not clear which accelerators are suitable for which domains. While there is no room in general-purpose processor design to significantly increase the processor frequency, developers are instead resorting to multi-core chips duplicating conventional computing capabilities on a single die. Yet, accelerators offer more radical designs with a much higher level of parallelism and novel programming environments. The present work assesses the viability of text mining on CUDA. Text mining is one of the key concepts that has become prominent as an effective means to index the Internet, but its applications range beyond this scope and extend to providing document similarity metrics, the subject of this work. We have developed and optimized text search algorithms for GPUs to exploit their potential for massive data processing. We discuss the algorithmic challenges of parallelization for text search problems on GPUs and demonstrate the potential of these devices in experiments by reporting significant speedups. Our study may be one of the first to assess more complex text search problems for suitability for GPU devices, and it may also be one of the first to exploit and report on atomic instruction usage that have recently become available in NVIDIA devices.

  13. Efficient Retrieval of Text for Biomedical Domain using Expectation Maximization Algorithm

    Directory of Open Access Journals (Sweden)

    Sumit Vashishtha

    2011-11-01

    Full Text Available Data mining, a branch of computer science [1], is the process of extracting patterns from large data sets by combining methods from statistics and artificial intelligence with database management. Data mining is seen as an increasingly important tool by modern business for transforming data into business intelligence, giving an informational advantage. Biomedical text retrieval refers to text retrieval techniques applied to biomedical resources and literature of the biomedical and molecular biology domain. The volume of published biomedical research, and therefore the underlying biomedical knowledge base, is expanding at an increasing rate. Biomedical text retrieval is a way to aid researchers in coping with information overload. By discovering predictive relationships between different pieces of extracted data, data-mining algorithms can be used to improve the accuracy of information extraction. However, textual variation due to typos, abbreviations, and other sources can prevent the productive discovery and utilization of hard-matching rules. Recent methods of soft clustering can exploit predictive relationships in textual data. This paper presents a technique for using a soft-clustering data mining algorithm to increase the accuracy of biomedical text extraction. Experimental results demonstrate that this approach improves text extraction more effectively than hard keyword-matching rules.
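 
    The abstract argues that soft clustering tolerates typos and abbreviations better than hard keyword matching. The sketch below is not the paper's algorithm; it merely illustrates soft clustering of noisy biomedical strings: character n-gram TF-IDF vectors (robust to small spelling variations) are reduced with truncated SVD and clustered with a Gaussian mixture fitted by expectation maximization, so each string receives membership probabilities rather than a hard label. The example strings are invented.

        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.decomposition import TruncatedSVD
        from sklearn.mixture import GaussianMixture

        # Short biomedical strings with spelling variation (illustrative).
        texts = [
            "myocardial infarction", "myocardial infraction", "heart attack",
            "diabetes mellitus type 2", "type II diabetes", "diabetis mellitus",
        ]

        # Character n-grams make the representation tolerant of typos.
        vectors = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)).fit_transform(texts)
        reduced = TruncatedSVD(n_components=3, random_state=0).fit_transform(vectors)

        # EM-fitted Gaussian mixture gives soft (probabilistic) cluster membership.
        gmm = GaussianMixture(n_components=2, random_state=0).fit(reduced)
        for text, probs in zip(texts, gmm.predict_proba(reduced)):
            print(text, probs.round(2))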

  14. A Survey on Web Text Information Retrieval in Text Mining

    OpenAIRE

    Tapaswini Nayak; Srinivash Prasad; Manas Ranjan Senapat

    2015-01-01

    In this study we have analyzed different techniques for information retrieval in text mining. The aim of the study is to survey web text information retrieval. Text mining is closely akin to text analytics, a process of deriving high-quality information from text. High-quality information is typically derived through the devising of patterns and trends, by means such as statistical pattern learning. Typical text mining tasks include text categorization, text clustering, concep...

  15. Association between leukemia and genes detected using biomedical text mining tools

    Institute of Scientific and Technical Information of China (English)

    朱祥; 张云秋; 冯佳

    2015-01-01

    Five genes most closely related to leukemia were identified using COREMINE Medical; the abstracts of related papers retrieved from PubMed were then analyzed with the biomedical text mining tool Chilibot, and an in-depth analysis of the extracted interactions revealed the interaction relationships between leukemia and the five genes detected with COREMINE Medical.

  16. Automatically classifying sentences in full-text biomedical articles into Introduction, Methods, Results and Discussion

    OpenAIRE

    Agarwal, Shashank; Yu, Hong

    2009-01-01

    Biomedical texts can typically be represented by four rhetorical categories: Introduction, Methods, Results and Discussion (IMRAD). Classifying sentences into these categories can benefit many other text-mining tasks. Although many studies have applied different approaches to automatically classifying sentences in MEDLINE abstracts into the IMRAD categories, few have explored the classification of sentences that appear in full-text biomedical articles. We first evaluated whether sentences in...
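 
    A minimal sketch of the task described above, assuming a supervised setup: a TF-IDF plus logistic-regression pipeline trained to assign sentences to IMRAD categories. The handful of labelled sentences is invented and far smaller than the corpora used in the study; it is only meant to show the shape of such a classifier.

        from sklearn.pipeline import make_pipeline
        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.linear_model import LogisticRegression

        # Invented training sentences labelled with IMRAD categories.
        sentences = [
            "Little is known about the role of this receptor in disease.",   # Introduction
            "Cells were cultured for 48 hours before RNA extraction.",       # Methods
            "Expression increased two-fold in treated samples (p < 0.05).",  # Results
            "These findings suggest a possible therapeutic target.",         # Discussion
            "Previous studies have reported conflicting evidence.",          # Introduction
            "Statistical analysis was performed using a t-test.",            # Methods
        ]
        labels = ["INTRO", "METHODS", "RESULTS", "DISCUSSION", "INTRO", "METHODS"]

        model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                              LogisticRegression(max_iter=1000))
        model.fit(sentences, labels)

        print(model.predict(["Samples were centrifuged at 3000 rpm for ten minutes."]))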

  17. Scalable Text Mining with Sparse Generative Models

    OpenAIRE

    Puurula, Antti

    2016-01-01

    The information age has brought a deluge of data. Much of this is in text form, insurmountable in scope for humans and incomprehensible in structure for computers. Text mining is an expanding field of research that seeks to utilize the information contained in vast document collections. General data mining methods based on machine learning face challenges with the scale of text data, posing a need for scalable text mining methods. This thesis proposes a solution to scalable text mining: gener...

  18. Text mining from ontology learning to automated text processing applications

    CERN Document Server

    Biemann, Chris

    2014-01-01

    This book comprises a set of articles that specify the methodology of text mining, describe the creation of lexical resources in the framework of text mining and use text mining for various tasks in natural language processing (NLP). The analysis of large amounts of textual data is a prerequisite to build lexical resources such as dictionaries and ontologies and also has direct applications in automated text processing in fields such as history, healthcare and mobile applications, just to name a few. This volume gives an update in terms of the recent gains in text mining methods and reflects

  19. A Survey on Preprocessing in Text Mining

    OpenAIRE

    Dr. Anadakumar. K; Ms. Padmavathy. V

    2013-01-01

    Nowadays, information is stored electronically in databases. Extracting reliable, previously unknown and useful information from this abundant source is a demanding task. Data mining and text mining are processes for extracting such unknown and useful information. Text mining is the process of extracting interesting and non-trivial patterns or knowledge from text documents. This paper presents the related activities and focuses on the preprocessing steps in text mining.

  20. Text Mining Perspectives in Microarray Data Mining

    OpenAIRE

    Natarajan, Jeyakumar

    2013-01-01

    Current microarray data mining methods such as clustering, classification, and association analysis heavily rely on statistical and machine learning algorithms for analysis of large sets of gene expression data. In recent years, there has been a growing interest in methods that attempt to discover patterns based on multiple but related data sources. Gene expression data and the corresponding literature data are one such example. This paper suggests a new approach to microarray data mining as ...

  1. Knowledge discovery data and text mining

    CERN Document Server

    Olmer, Petr

    2008-01-01

    Data mining and text mining refer to techniques, models, algorithms, and processes for knowledge discovery and extraction. Basic definitions are given together with the description of a standard data mining process. Common models and algorithms are presented. Attention is given to text clustering, how to convert unstructured text to structured data (vectors), and how to compute their importance and position within clusters.

  2. Enhancing Biomedical Text Summarization Using Semantic Relation Extraction

    OpenAIRE

    Yue Shang; Yanpeng Li; Hongfei Lin; Zhihao Yang

    2011-01-01

    Automatic text summarization for a biomedical concept can help researchers to get the key points of a certain topic from a large amount of biomedical literature efficiently. In this paper, we present a method for generating a text summary for a given biomedical concept, e.g., H1N1 disease, from multiple documents based on semantic relation extraction. Our approach includes three stages: 1) We extract semantic relations in each sentence using the semantic knowledge representation tool SemRep. 2) W...

  3. Working with text tools, techniques and approaches for text mining

    CERN Document Server

    Tourte, Gregory J L

    2016-01-01

    Text mining tools and technologies have long been a part of the repository world, where they have been applied to a variety of purposes, from pragmatic aims to support tools. Research areas as diverse as biology, chemistry, sociology and criminology have seen effective use made of text mining technologies. Working With Text collects a subset of the best contributions from the 'Working with text: Tools, techniques and approaches for text mining' workshop, alongside contributions from experts in the area. Text mining tools and technologies in support of academic research include supporting research on the basis of a large body of documents, facilitating access to and reuse of extant work, and bridging between the formal academic world and areas such as traditional and social media. Jisc have funded a number of projects, including NaCTem (the National Centre for Text Mining) and the ResDis programme. Contents are developed from workshop submissions and invited contributions, including: Legal considerations in te...

  4. OntoRest: Text Mining Web Services in BioC Format

    OpenAIRE

    Marques, Hernani; Rinaldi, Fabio

    2014-01-01

    In this poster we present a set of biomedical text mining web services which can be used to provide remote access to the annotation results of an advanced text mining pipeline. The pipeline is part of a system which has been tested several times in community organized text mining competitions, often achieving top-ranked results.

  5. Text Association Analysis and Ambiguity in Text Mining

    Science.gov (United States)

    Bhonde, S. B.; Paikrao, R. L.; Rahane, K. U.

    2010-11-01

    Text Mining is the process of analyzing a semantically rich document or set of documents to understand the content and meaning of the information they contain. Research in Text Mining will enhance humans' ability to process massive quantities of information, and it has high commercial value. Firstly, the paper introduces TM and its definition, and then gives an overview of the text mining process and its applications. Up to now, not much research in text mining, especially in concept/entity extraction, has focused on the ambiguity problem. This paper addresses ambiguity issues in natural language texts, and presents a new technique for resolving the ambiguity problem in extracting concepts/entities from texts. In the end, it shows the importance of TM in knowledge discovery and highlights the upcoming challenges of document mining and the opportunities it offers.

  6. Discover Effective Pattern for Text Mining

    OpenAIRE

    Khade, A. D.; A. B. Karche

    2014-01-01

    Many data mining techniques have been discovered for finding useful patterns in documents such as text documents. However, how to effectively use and keep up to date the discovered patterns is still an open research task, especially in the domain of text mining. Text mining is the discovery of interesting knowledge (or features) in text documents. It is a challenging task to find appropriate knowledge (or features) in text documents to help users to find what they exactly want...

  7. A Survey on Web Text Information Retrieval in Text Mining

    Directory of Open Access Journals (Sweden)

    Tapaswini Nayak

    2015-08-01

    Full Text Available In this study we have analyzed different techniques for information retrieval in text mining. The aim of the study is to survey web text information retrieval. Text mining is closely akin to text analytics, a process of deriving high-quality information from text. High-quality information is typically derived through the devising of patterns and trends, by means such as statistical pattern learning. Typical text mining tasks include text categorization, text clustering, concept/entity extraction, creation of coarse taxonomies, sentiment analysis, document summarization and entity relation modeling. It is used to mine hidden information from unstructured or semi-structured data. This capability is necessary because a large amount of Web information is semi-structured due to the nested structure of HTML code, is linked and is redundant. Web content categorization with a content database is the most important tool for the efficient use of search engines. A customer requesting information on a particular subject or item would otherwise have to search through hundreds of results to find the most relevant information to his query. By using text mining, this step reduces those hundreds of results, which eliminates the aggravation and improves the navigation of information on the Web.

  8. Extracting laboratory test information from biomedical text

    Directory of Open Access Journals (Sweden)

    Yanna Shen Kang

    2013-01-01

    Full Text Available Background: No previous study reported the efficacy of current natural language processing (NLP) methods for extracting laboratory test information from narrative documents. This study investigates the pathology informatics question of how accurately such information can be extracted from text with the current tools and techniques, especially machine learning and symbolic NLP methods. The study data came from a text corpus maintained by the U.S. Food and Drug Administration, containing a rich set of information on laboratory tests and test devices. Methods: The authors developed a symbolic information extraction (SIE) system to extract device and test specific information about four types of laboratory test entities: Specimens, analytes, units of measures and detection limits. They compared the performance of SIE and three prominent machine learning based NLP systems, LingPipe, GATE and BANNER, each implementing a distinct supervised machine learning method, hidden Markov models, support vector machines and conditional random fields, respectively. Results: Machine learning systems recognized laboratory test entities with moderately high recall, but low precision rates. Their recall rates were relatively higher when the number of distinct entity values (e.g., the spectrum of specimens) was very limited or when lexical morphology of the entity was distinctive (as in units of measures), yet SIE outperformed them with statistically significant margins on extracting specimen, analyte and detection limit information in both precision and F-measure. Its high recall performance was statistically significant on analyte information extraction. Conclusions: Despite its shortcomings against machine learning methods, a well-tailored symbolic system may better discern relevancy among a pile of information of the same type and may outperform a machine learning system by tapping into lexically non-local contextual information such as the document structure.
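 
    The SIE system itself is not available from this record. Purely to illustrate the symbolic, rule-based style of extraction it describes, the toy sketch below uses regular expressions to pull two of the four entity types (units of measure and detection limits) from an invented sentence; the patterns are illustrative, not the study's rules.

        import re

        text = ("The assay reports glucose in serum with a detection limit "
                "of 0.5 mg/dL and creatinine in urine measured in mmol/L.")

        # Illustrative patterns for two of the four entity types in the study.
        UNIT_PATTERN = re.compile(r"\b(?:mg/dL|mmol/L|g/L|IU/mL|ng/mL)\b")
        LIMIT_PATTERN = re.compile(
            r"detection limit\s+of\s+([\d.]+\s*(?:mg/dL|mmol/L|ng/mL))",
            re.IGNORECASE)

        print("units of measure:", UNIT_PATTERN.findall(text))
        print("detection limits:", LIMIT_PATTERN.findall(text))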

  9. Anomaly Detection with Text Mining

    Data.gov (United States)

    National Aeronautics and Space Administration — Many existing complex space systems have a significant amount of historical maintenance and problem data bases that are stored in unstructured text forms. The...

  10. Text mining library for Orange data mining suite

    OpenAIRE

    Novak, David

    2016-01-01

    We have developed a text mining system that can be used as an add-on for Orange, a data mining platform. Orange envelops a set of supervised and unsupervised machine learning methods that benefit a typical text mining platform and therefore offers an excellent foundation for development. We have studied the field of text mining and reviewed several open-source toolkits to define its base components. We have included widgets that enable retrieval of data from remote repositories, such as PubMe...

  11. Preprocessing and Morphological Analysis in Text Mining

    Directory of Open Access Journals (Sweden)

    Krishna Kumar Mohbey; Sachin Tiwari

    2011-12-01

    Full Text Available This paper is based on the preprocessing activities which are performed by software or language translators before applying mining algorithms to huge data. Text mining is an important area of data mining and plays a vital role in extracting useful information from a huge database or data warehouse. Before applying the text mining or information extraction process, preprocessing is a must, because the given data or dataset may contain noisy, incomplete, inconsistent, dirty and unformatted data. In this paper we collect the necessary requirements for preprocessing. Once the preprocessing task is complete, useful knowledge can easily be extracted using a mining strategy. This paper also provides information about the analysis of data, such as tokenization and stemming, and about semantic analysis, such as phrase recognition and parsing. It also collects the procedures for preprocessing data, i.e. it describes how stemming, tokenization and parsing are applied.
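 
    A minimal sketch of the preprocessing steps listed above (tokenisation, stop-word removal, stemming), using NLTK's Porter stemmer, a regular-expression tokeniser and a small illustrative stop-word list so that no corpus downloads are needed.

        import re
        from nltk.stem import PorterStemmer

        STOPWORDS = {"the", "is", "a", "an", "of", "and", "to", "in", "for"}  # illustrative subset
        stemmer = PorterStemmer()

        def preprocess(text):
            """Tokenise, drop stop words, and stem the remaining tokens."""
            tokens = re.findall(r"[a-z0-9]+", text.lower())          # tokenisation
            tokens = [t for t in tokens if t not in STOPWORDS]       # stop-word removal
            return [stemmer.stem(t) for t in tokens]                 # stemming

        print(preprocess("Text mining is the discovery of useful patterns in documents."))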

  12. Text Mining of Supreme Administrative Court Jurisdictions

    OpenAIRE

    Feinerer, Ingo; Hornik, Kurt

    2007-01-01

    Within the last decade text mining, i.e., extracting sensitive information from text corpora, has become a major factor in business intelligence. The automated textual analysis of law corpora is highly valuable because of its impact on a company's legal options and the raw amount of available jurisdiction. The study of supreme court jurisdiction and international law corpora is equally important due to its effects on business sectors. In this paper we use text mining methods to investigate Au...

  13. Biocuration workflows and text mining: overview of the BioCreative 2012 Workshop Track II

    OpenAIRE

    Lu, Zhiyong; Hirschman, Lynette

    2012-01-01

    Manual curation of data from the biomedical literature is a rate-limiting factor for many expert curated databases. Despite the continuing advances in biomedical text mining and the pressing needs of biocurators for better tools, few existing text-mining tools have been successfully integrated into production literature curation systems such as those used by the expert curated databases. To close this gap and better understand all aspects of literature curation, we invited submissions of writ...

  14. Text Mining for Adverse Drug Events: the Promise, Challenges, and State of the Art

    OpenAIRE

    Harpaz, Rave; Callahan, Alison; Tamang, Suzanne; Low, Yen; Odgers, David; Finlayson, Sam; Jung, Kenneth; LePendu, Paea; Shah, Nigam H.

    2014-01-01

    Text mining is the computational process of extracting meaningful information from large amounts of unstructured text. Text mining is emerging as a tool to leverage underutilized data sources that can improve pharmacovigilance, including the objective of adverse drug event detection and assessment. This article provides an overview of recent advances in pharmacovigilance driven by the application of text mining, and discusses several data sources—such as biomedical literature, clinical narrat...

  15. Text mining for the biocuration workflow

    OpenAIRE

    Hirschman, L.; Burns, G. A. P. C.; Krallinger, M.; Arighi, C.; Cohen, K. B.; Valencia, A.; Wu, C H; Chatr-aryamontri, A; Dowell, K. G.; Huala, E; Lourenco, A.; Nash, R; Veuthey, A.-L.; Wiegers, T.; Winter, A. G.

    2012-01-01

    Molecular biology has become heavily dependent on biological knowledge encoded in expert curated biological databases. As the volume of biological literature increases, biocurators need help in keeping up with the literature; (semi-) automated aids for biocuration would seem to be an ideal application for natural language processing and text mining. However, to date, there have been few documented successes for improving biocuration throughput using text mining. Our initial investigations too...

  16. Financial Statement Fraud Detection using Text Mining

    OpenAIRE

    Rajan Gupta; Nasib Singh Gill

    2013-01-01

    Data mining techniques have been used enormously by the researchers’ community in detecting financial statement fraud. Most of the research in this direction has used the numbers (quantitative information) i.e. financial ratios present in the financial statements for detecting fraud. There is very little or no research on the analysis of text such as auditor’s comments or notes present in published reports. In this study we propose a text mining approach for detecting financial statement frau...

  17. Enhancing biomedical text summarization using semantic relation extraction.

    Directory of Open Access Journals (Sweden)

    Yue Shang

    Full Text Available Automatic text summarization for a biomedical concept can help researchers to get the key points of a certain topic from a large amount of biomedical literature efficiently. In this paper, we present a method for generating a text summary for a given biomedical concept, e.g., H1N1 disease, from multiple documents based on semantic relation extraction. Our approach includes three stages: 1) We extract semantic relations in each sentence using the semantic knowledge representation tool SemRep. 2) We develop a relation-level retrieval method to select the relations most relevant to each query concept and visualize them in a graphic representation. 3) For relations in the relevant set, we extract informative sentences that can interpret them from the document collection to generate a text summary using an information retrieval based method. Our major focus in this work is to investigate the contribution of semantic relation extraction to the task of biomedical text summarization. The experimental results on summarization for a set of diseases show that the introduction of semantic knowledge improves the performance, and our results are better than those of the MEAD system, a well-known tool for text summarization.
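 
    SemRep is the relation extractor actually used in this work and is not reproduced here. The sketch below substitutes invented subject-predicate-object triples for SemRep output and shows only the final stage in simplified form: scoring candidate sentences by how many relation-argument terms they mention and keeping the top-scoring ones as the summary.

        # Invented relations standing in for SemRep output: (subject, predicate, object).
        relations = [
            ("oseltamivir", "TREATS", "H1N1 influenza"),
            ("H1N1 influenza", "CAUSES", "respiratory distress"),
        ]

        sentences = [
            "Oseltamivir remains the first-line treatment for H1N1 influenza.",
            "The 2009 pandemic received extensive media coverage.",
            "Severe H1N1 influenza can cause acute respiratory distress.",
        ]

        def relation_terms(rels):
            """Collect the words appearing as relation arguments."""
            terms = set()
            for subj, _, obj in rels:
                terms.update(subj.lower().split())
                terms.update(obj.lower().split())
            return terms

        terms = relation_terms(relations)

        def score(sentence):
            """Score a sentence by how many relation-argument terms it mentions."""
            words = set(sentence.lower().replace(".", "").split())
            return len(words & terms)

        # Keep the highest-scoring sentences as the summary.
        summary = sorted(sentences, key=score, reverse=True)[:2]
        print(summary)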

  18. Negation scope and spelling variation for text-mining of Danish electronic patient records

    DEFF Research Database (Denmark)

    Thomas, Cecilia Engel; Jensen, Peter Bjødstrup; Werge, Thomas;

    2014-01-01

    Electronic patient records are a potentially rich data source for knowledge extraction in biomedical research. Here we present a method based on the ICD10 system for text-mining of Danish health records. We have evaluated how adding functionalities to a baseline text-mining tool affected the...

  19. Financial Statement Fraud Detection using Text Mining

    Directory of Open Access Journals (Sweden)

    Rajan Gupta

    2013-01-01

    Full Text Available Data mining techniques have been used enormously by the researchers’ community in detecting financial statement fraud. Most of the research in this direction has used the numbers (quantitative information) i.e. financial ratios present in the financial statements for detecting fraud. There is very little or no research on the analysis of text such as auditor’s comments or notes present in published reports. In this study we propose a text mining approach for detecting financial statement fraud by analyzing the hidden clues in the qualitative information (text) present in financial statements.

  20. A Survey on Text Mining in Clustering

    Directory of Open Access Journals (Sweden)

    S. Logeswari

    2011-02-01

    Full Text Available Text mining has important applications in the areas of data mining and information retrieval. One of the important tasks in text mining is document clustering. Many existing document clustering techniques use the bag-of-words model to represent the content of a document. It is only effective for grouping related documents when these documents share a large proportion of lexically equivalent terms; the synonymy between related documents is ignored. This reduces the effectiveness of applications using a standard full-text document representation. This paper emphasizes the various techniques that are used to cluster text documents based on keywords, phrases and concepts. It also includes the different performance measures that are used to evaluate the quality of clusters.

  1. Text mining and visualization using VOSviewer

    OpenAIRE

    van Eck, Nees Jan; Waltman, Ludo

    2011-01-01

    VOSviewer is a computer program for creating, visualizing, and exploring bibliometric maps of science. In this report, the new text mining functionality of VOSviewer is presented. A number of examples are given of applications in which VOSviewer is used for analyzing large amounts of text data.

  2. Arabic Text Mining Using Rule Based Classification

    OpenAIRE

    Fadi Thabtah; Omar Gharaibeh; Rashid Al-Zubaidy

    2012-01-01

    A well-known classification problem in the domain of text mining is text classification, which concerns mapping textual documents into one or more predefined categories based on their content. The text classification arena has recently attracted many researchers because of the massive amounts of online documents and text archives which hold essential information for decision-making processes. In this field, most such research focuses on classifying English documents, while there are limited studi...

  3. New directions in biomedical text annotation: definitions, guidelines and corpus construction

    Directory of Open Access Journals (Sweden)

    Rzhetsky Andrey

    2006-07-01

    Full Text Available Abstract Background While biomedical text mining is emerging as an important research area, practical results have proven difficult to achieve. We believe that an important first step towards more accurate text-mining lies in the ability to identify and characterize text that satisfies various types of information needs. We report here the results of our inquiry into properties of scientific text that have sufficient generality to transcend the confines of a narrow subject area, while supporting practical mining of text for factual information. Our ultimate goal is to annotate a significant corpus of biomedical text and train machine learning methods to automatically categorize such text along certain dimensions that we have defined. Results We have identified five qualitative dimensions that we believe characterize a broad range of scientific sentences, and are therefore useful for supporting a general approach to text-mining: focus, polarity, certainty, evidence, and directionality. We define these dimensions and describe the guidelines we have developed for annotating text with regard to them. To examine the effectiveness of the guidelines, twelve annotators independently annotated the same set of 101 sentences that were randomly selected from current biomedical periodicals. Analysis of these annotations shows 70–80% inter-annotator agreement, suggesting that our guidelines indeed present a well-defined, executable and reproducible task. Conclusion We present our guidelines defining a text annotation task, along with annotation results from multiple independently produced annotations, demonstrating the feasibility of the task. The annotation of a very large corpus of documents along these guidelines is currently ongoing. These annotations form the basis for the categorization of text along multiple dimensions, to support viable text mining for experimental results, methodology statements, and other forms of information. We are currently
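 
    The abstract reports 70-80% inter-annotator agreement over twelve annotators; the exact agreement measure is not specified there. The sketch below shows one common way such figures are computed, assuming invented polarity labels from three annotators: pairwise percentage agreement alongside Cohen's kappa.

        from itertools import combinations
        from sklearn.metrics import cohen_kappa_score

        # Invented polarity annotations from three annotators over the same 8 sentences.
        annotations = {
            "A1": ["pos", "neg", "pos", "pos", "neg", "pos", "neg", "pos"],
            "A2": ["pos", "neg", "pos", "neg", "neg", "pos", "neg", "pos"],
            "A3": ["pos", "pos", "pos", "pos", "neg", "pos", "neg", "neg"],
        }

        for a, b in combinations(annotations, 2):
            x, y = annotations[a], annotations[b]
            percent = sum(i == j for i, j in zip(x, y)) / len(x)   # raw agreement
            kappa = cohen_kappa_score(x, y)                        # chance-corrected
            print(f"{a} vs {b}: agreement={percent:.0%}, kappa={kappa:.2f}")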

  4. Mining Quality Phrases from Massive Text Corpora

    OpenAIRE

    Liu, Jialu; Shang, Jingbo; Wang, Chi; Ren, Xiang; Han, Jiawei

    2015-01-01

    Text data are ubiquitous and play an essential role in big data applications. However, text data are mostly unstructured. Transforming unstructured text into structured units (e.g., semantically meaningful phrases) will substantially reduce semantic ambiguity and enhance the power and efficiency at manipulating such data using database technology. Thus mining quality phrases is a critical research problem in the field of databases. In this paper, we propose a new framework that extracts quali...

  5. Mining Texts in Reading to Write.

    Science.gov (United States)

    Greene, Stuart

    1992-01-01

    Proposes a set of strategies for connecting reading and writing, placing the discussion in the context of other pedagogical approaches designed to exploit the relationship between reading and writing. Explores ways in which students employ the strategies involved in "mining" a text--reconstructing context, inferring or imposing structure, and…

  6. Monitoring interaction and collective text production through text mining

    Directory of Open Access Journals (Sweden)

    Macedo, Alexandra Lorandi

    2014-04-01

    Full Text Available This article presents the Concepts Network tool, developed using text mining technology. The main objective of this tool is to extract and relate the terms of greatest incidence from a text and exhibit the results in the form of a graph. The Network was implemented in the Collective Text Editor (CTE), which is an online tool that allows the production of texts in synchronized or non-synchronized forms. This article describes the application of the Network both to texts produced collectively and to texts produced in a forum. The purpose of the tool is to offer support to the teacher in managing the high volume of data generated in the process of interaction amongst students and in the construction of the text. Specifically, the aim is to facilitate the teacher’s job by allowing him/her to process data in a shorter time than is currently demanded. The results suggest that the Concepts Network can aid the teacher, as it provides indicators of the quality of the text produced. Moreover, messages posted in forums can be analyzed without their content necessarily having to be pre-read.
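 
    The Concepts Network tool itself is part of the Collective Text Editor and is not reproduced here. The following sketch only illustrates its core idea, extracting high-incidence terms and relating them in a graph, by counting term co-occurrence within sentences of an invented snippet and building a weighted networkx graph; the stop-word list is illustrative.

        import re
        from collections import Counter
        from itertools import combinations
        import networkx as nx

        text = ("Students discussed climate change. Climate policy shapes energy use. "
                "Energy use drives climate change.")

        STOPWORDS = {"students", "discussed", "shapes", "drives", "use"}  # illustrative

        # Count term incidence and pairwise co-occurrence per sentence.
        graph = nx.Graph()
        counts = Counter()
        for sentence in re.split(r"[.!?]", text):
            terms = sorted({t for t in re.findall(r"[a-z]+", sentence.lower())
                            if t not in STOPWORDS})
            counts.update(terms)
            for a, b in combinations(terms, 2):
                weight = graph.get_edge_data(a, b, {"weight": 0})["weight"] + 1
                graph.add_edge(a, b, weight=weight)

        print("most frequent terms:", counts.most_common(3))
        print("edges:", list(graph.edges(data=True)))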

  7. Demo: Using RapidMiner for Text Mining

    OpenAIRE

    Shterev, Yordan

    2013-01-01

    In this demo, the basic text mining technologies available through RapidMiner are reviewed. RapidMiner's basic characteristics and text mining operators are described, and a text mining example using the Naive Bayes algorithm and process modeling is presented.
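 
    RapidMiner builds such a process visually with its text mining operators; as a scripted stand-in only, the sketch below trains a multinomial Naive Bayes classifier over bag-of-words counts with scikit-learn, using invented training texts.

        from sklearn.pipeline import make_pipeline
        from sklearn.feature_extraction.text import CountVectorizer
        from sklearn.naive_bayes import MultinomialNB

        # Invented labelled texts for a two-class demo.
        texts = [
            "great product fast delivery", "excellent quality highly recommend",
            "terrible service broken on arrival", "poor quality waste of money",
        ]
        labels = ["positive", "positive", "negative", "negative"]

        # Bag-of-words counts feed the Naive Bayes classifier.
        model = make_pipeline(CountVectorizer(), MultinomialNB())
        model.fit(texts, labels)
        print(model.predict(["broken product poor delivery"]))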

  8. Text mining applications in psychiatry: a systematic literature review.

    Science.gov (United States)

    Abbe, Adeline; Grouin, Cyril; Zweigenbaum, Pierre; Falissard, Bruno

    2016-06-01

    The expansion of biomedical literature is creating the need for efficient tools to keep pace with increasing volumes of information. Text mining (TM) approaches are becoming essential to facilitate the automated extraction of useful biomedical information from unstructured text. We reviewed the applications of TM in psychiatry, and explored its advantages and limitations. A systematic review of the literature was carried out using the CINAHL, Medline, EMBASE, PsycINFO and Cochrane databases. In this review, 1103 papers were screened, and 38 were included as applications of TM in psychiatric research. Using TM and content analysis, we identified four major areas of application: (1) Psychopathology (i.e. observational studies focusing on mental illnesses) (2) the Patient perspective (i.e. patients' thoughts and opinions), (3) Medical records (i.e. safety issues, quality of care and description of treatments), and (4) Medical literature (i.e. identification of new scientific information in the literature). The information sources were qualitative studies, Internet postings, medical records and biomedical literature. Our work demonstrates that TM can contribute to complex research tasks in psychiatry. We discuss the benefits, limits, and further applications of this tool in the future. Copyright © 2015 John Wiley & Sons, Ltd. PMID:26184780

  9. Text Data Mining: Theory and Methods

    Directory of Open Access Journals (Sweden)

    Jeffrey L. Solka

    2008-01-01

    Full Text Available This paper provides the reader with a very brief introduction to some of the theory and methods of text data mining. The intent of this article is to introduce the reader to some of the current methodologies that are employed within this discipline area while at the same time making the reader aware of some of the interesting challenges that remain to be solved within the area. Finally, the article serves as a very rudimentary tutorial on some of the techniques while also providing the reader with a list of references for additional study.

  10. Mining Causality for Explanation Knowledge from Text

    Institute of Scientific and Technical Information of China (English)

    Chaveevan Pechsiri; Asanee Kawtrakul

    2007-01-01

    Mining causality is essential to provide a diagnosis. This research aims at extracting the causality existing within multiple sentences or EDUs (Elementary Discourse Units). The research emphasizes the use of causality verbs because they make explicit in a certain way the consequent events of a cause, e.g., "Aphids suck the sap from rice leaves. Then leaves will shrink. Later, they will become yellow and dry.". A verb can also be the causal-verb link between cause and effect within EDU(s), e.g., "Aphids suck the sap from rice leaves causing leaves to be shrunk" ("causing" is equivalent to a causal-verb link in Thai). The research confronts two main problems: identifying the interesting causality events from documents and identifying their boundaries. We then propose mining on verbs by using two different machine learning techniques, the Naive Bayes classifier and Support Vector Machines. The resulting mining rules are used for the identification and extraction of causality across multiple EDUs from text. Our multiple-EDU extraction shows 0.88 precision with 0.75 recall from the Naive Bayes classifier and 0.89 precision with 0.76 recall from the Support Vector Machine.

  11. Text Data Mining: Theory and Methods

    OpenAIRE

    Solka, Jeffrey L.

    2008-01-01

    This paper provides the reader with a very brief introduction to some of the theory and methods of text data mining. The intent of this article is to introduce the reader to some of the current methodologies that are employed within this discipline area while at the same time making the reader aware of some of the interesting challenges that remain to be solved within the area. Finally, the article serves as a very rudimentary tutorial on some of the techniques while also providing the reader wi...

  12. Methods for Mining and Summarizing Text Conversations

    CERN Document Server

    Carenini, Giuseppe; Murray, Gabriel

    2011-01-01

    Due to the Internet Revolution, human conversational data -- in written forms -- are accumulating at a phenomenal rate. At the same time, improvements in speech technology enable many spoken conversations to be transcribed. Individuals and organizations engage in email exchanges, face-to-face meetings, blogging, texting and other social media activities. The advances in natural language processing provide ample opportunities for these "informal documents" to be analyzed and mined, thus creating numerous new and valuable applications. This book presents a set of computational methods

  13. Rising of Text Mining Technique: As Unforeseen-part of Data Mining

    OpenAIRE

    Param Deep Singh, Jitendra Raghuvanshi

    2012-01-01

    Text Data Mining, or Knowledge Discovery in Text (KDT), refers generally to the process of extracting interesting and non-trivial information and knowledge from unstructured text. Text mining is a variation on a field called data mining, which tries to find interesting patterns in large databases; text mining is also known as Intelligent Text Analysis (ITA). Text mining is a young interdisciplinary field which draws on information retrieval, data mining, machine learn...

  14. Semantic text mining support for lignocellulose research

    Directory of Open Access Journals (Sweden)

    Meurs Marie-Jean

    2012-04-01

    Full Text Available Abstract Background Biofuels produced from biomass are considered to be promising sustainable alternatives to fossil fuels. The conversion of lignocellulose into fermentable sugars for biofuels production requires the use of enzyme cocktails that can efficiently and economically hydrolyze lignocellulosic biomass. As many fungi naturally break down lignocellulose, the identification and characterization of the enzymes involved is a key challenge in the research and development of biomass-derived products and fuels. One approach to meeting this challenge is to mine the rapidly-expanding repertoire of microbial genomes for enzymes with the appropriate catalytic properties. Results Semantic technologies, including natural language processing, ontologies, semantic Web services and Web-based collaboration tools, promise to support users in handling complex data, thereby facilitating knowledge-intensive tasks. An ongoing challenge is to select the appropriate technologies and combine them in a coherent system that brings measurable improvements to the users. We present our ongoing development of a semantic infrastructure in support of genomics-based lignocellulose research. Part of this effort is the automated curation of knowledge from information on fungal enzymes that is available in the literature and genome resources. Conclusions Working closely with fungal biology researchers who manually curate the existing literature, we developed ontological natural language processing pipelines integrated in a Web-based interface to assist them in two main tasks: mining the literature for relevant knowledge, and at the same time providing rich and semantically linked information.

  15. Unsupervised text mining for assessing and augmenting GWAS results.

    Science.gov (United States)

    Ailem, Melissa; Role, François; Nadif, Mohamed; Demenais, Florence

    2016-04-01

    Text mining can assist in the analysis and interpretation of large-scale biomedical data, helping biologists to quickly and cheaply gain confirmation of hypothesized relationships between biological entities. We set this question in the context of genome-wide association studies (GWAS), an actively emerging field that has contributed to the identification of many genes associated with multifactorial diseases. These studies make it possible to identify groups of genes associated with the same phenotype, but provide no information about the relationships between these genes. Therefore, our objective is to leverage unsupervised text mining techniques, using text-based cosine similarity comparisons and clustering applied to candidate and random gene vectors, in order to augment the GWAS results. We propose a generic framework which we used to characterize the relationships between 10 genes reported as associated with asthma by a previous GWAS. The results of this experiment showed that the similarities between these 10 genes were significantly stronger than would be expected by chance (one-sided p-value<0.01). The clustering of observed and randomly selected genes also allowed us to generate hypotheses about potential functional relationships between these genes and thus contributed to the discovery of new candidate genes for asthma. PMID:26911523
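 
    A simplified sketch of the statistical idea described above, under stated assumptions: each gene is represented by a TF-IDF vector of text associated with it (invented snippets here), the mean pairwise cosine similarity of the candidate gene set is compared against randomly drawn gene sets of the same size, and an empirical one-sided p-value is reported. Gene names and texts are illustrative.

        import numpy as np
        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.metrics.pairwise import cosine_similarity

        # Invented per-gene text (in practice, literature associated with each gene).
        gene_text = {
            "IL13":   "cytokine airway inflammation asthma allergy",
            "IL4":    "cytokine allergy immune response asthma",
            "ORMDL3": "asthma susceptibility sphingolipid airway",
            "TP53":   "tumor suppressor cell cycle apoptosis cancer",
            "BRCA2":  "dna repair breast cancer tumor",
            "INS":    "insulin glucose metabolism diabetes",
        }
        candidates = ["IL13", "IL4", "ORMDL3"]

        genes = list(gene_text)
        vectors = TfidfVectorizer().fit_transform(gene_text[g] for g in genes)
        sim = cosine_similarity(vectors)
        index = {g: i for i, g in enumerate(genes)}

        def mean_pairwise(gene_set):
            """Mean cosine similarity over all pairs in the gene set."""
            idx = [index[g] for g in gene_set]
            sub = sim[np.ix_(idx, idx)]
            return sub[np.triu_indices(len(idx), k=1)].mean()

        observed = mean_pairwise(candidates)

        # Empirical one-sided p-value from random gene sets of the same size.
        rng = np.random.default_rng(0)
        draws = [mean_pairwise(rng.choice(genes, size=len(candidates), replace=False))
                 for _ in range(1000)]
        p_value = (1 + sum(d >= observed for d in draws)) / (len(draws) + 1)
        print(f"observed={observed:.3f}, empirical one-sided p={p_value:.3f}")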

  16. Data, text and web mining for business intelligence: a survey

    OpenAIRE

    Abdul-Aziz Rashid Al-Azmi

    2013-01-01

    The Information and Communication Technologies revolution brought a digital world with huge amounts of data available. Enterprises use mining technologies to search vast amounts of data for vital insight and knowledge. Mining tools such as data mining, text mining, and web mining are used to find hidden knowledge in large databases or the Internet. Mining tools are automated software tools used to achieve business intelligence by finding hidden relations, and predicting future eve...

  17. TEXT MINING – PREREQUISITE FOR KNOWLEDGE MANAGEMENT SYSTEMS

    OpenAIRE

    Dragoş Marcel VESPAN

    2009-01-01

    Text mining is an interdisciplinary field with the main purpose of retrieving new knowledge from large collections of text documents. This paper presents the main techniques used for knowledge extraction through text mining and their main areas of applicability and emphasizes the importance of text mining in knowledge management systems.

  18. Text Mining the History of Medicine.

    Directory of Open Access Journals (Sweden)

    Paul Thompson

    Full Text Available Historical text archives constitute a rich and diverse source of information, which is becoming increasingly readily accessible, due to large-scale digitisation efforts. However, it can be difficult for researchers to explore and search such large volumes of data in an efficient manner. Text mining (TM) methods can help, through their ability to recognise various types of semantic information automatically, e.g., instances of concepts (places, medical conditions, drugs, etc.), synonyms/variant forms of concepts, and relationships holding between concepts (which drugs are used to treat which medical conditions, etc.). TM analysis allows search systems to incorporate functionality such as automatic suggestions of synonyms of user-entered query terms, exploration of different concepts mentioned within search results or isolation of documents in which concepts are related in specific ways. However, applying TM methods to historical text can be challenging, according to differences and evolutions in vocabulary, terminology, language structure and style, compared to more modern text. In this article, we present our efforts to overcome the various challenges faced in the semantic analysis of published historical medical text dating back to the mid 19th century. Firstly, we used evidence from diverse historical medical documents from different periods to develop new resources that provide accounts of the multiple, evolving ways in which concepts, their variants and relationships amongst them may be expressed. These resources were employed to support the development of a modular processing pipeline of TM tools for the robust detection of semantic information in historical medical documents with varying characteristics. We applied the pipeline to two large-scale medical document archives covering wide temporal ranges as the basis for the development of a publicly accessible semantically-oriented search system. The novel resources are available for research

  19. Techniques, Applications and Challenging Issue in Text Mining

    Directory of Open Access Journals (Sweden)

    Shaidah Jusoh

    2012-11-01

    Full Text Available Text mining is a very exciting research area as it tries to discover knowledge from unstructured texts. These texts can be found on a desktop, intranets and the internet. The aim of this paper is to give an overview of text mining in the contexts of its techniques, application domains and the most challenging issue. The focus is on fundamental methods of text mining, which include natural language processing and information extraction. This paper also gives a short review of domains which have employed text mining. The challenging issue in text mining, which is caused by the complexity of natural language, is also addressed in this paper.

  20. Text Mining the History of Medicine.

    Science.gov (United States)

    Thompson, Paul; Batista-Navarro, Riza Theresa; Kontonatsios, Georgios; Carter, Jacob; Toon, Elizabeth; McNaught, John; Timmermann, Carsten; Worboys, Michael; Ananiadou, Sophia

    2016-01-01

    Historical text archives constitute a rich and diverse source of information, which is becoming increasingly readily accessible, due to large-scale digitisation efforts. However, it can be difficult for researchers to explore and search such large volumes of data in an efficient manner. Text mining (TM) methods can help, through their ability to recognise various types of semantic information automatically, e.g., instances of concepts (places, medical conditions, drugs, etc.), synonyms/variant forms of concepts, and relationships holding between concepts (which drugs are used to treat which medical conditions, etc.). TM analysis allows search systems to incorporate functionality such as automatic suggestions of synonyms of user-entered query terms, exploration of different concepts mentioned within search results or isolation of documents in which concepts are related in specific ways. However, applying TM methods to historical text can be challenging, according to differences and evolutions in vocabulary, terminology, language structure and style, compared to more modern text. In this article, we present our efforts to overcome the various challenges faced in the semantic analysis of published historical medical text dating back to the mid 19th century. Firstly, we used evidence from diverse historical medical documents from different periods to develop new resources that provide accounts of the multiple, evolving ways in which concepts, their variants and relationships amongst them may be expressed. These resources were employed to support the development of a modular processing pipeline of TM tools for the robust detection of semantic information in historical medical documents with varying characteristics. We applied the pipeline to two large-scale medical document archives covering wide temporal ranges as the basis for the development of a publicly accessible semantically-oriented search system. The novel resources are available for research purposes, while

  1. A comparison study on algorithms of detecting long forms for short forms in biomedical text

    Directory of Open Access Journals (Sweden)

    Wu Cathy H

    2007-11-01

    Full Text Available Abstract Motivation With more and more research dedicated to literature mining in the biomedical domain, more and more systems are available for people to choose from when building literature mining applications. In this study, we focus on one specific kind of literature mining task, i.e., detecting definitions of acronyms, abbreviations, and symbols in biomedical text. We denote acronyms, abbreviations, and symbols as short forms (SFs) and their corresponding definitions as long forms (LFs). The study was designed to answer the following questions: (i) how well a system performs in detecting LFs from novel text, (ii) what the coverage is for various terminological knowledge bases in including SFs as synonyms of their LFs, and (iii) how to combine results from various SF knowledge bases. Method We evaluated the following three publicly available detection systems in detecting LFs for SFs: (i) a handcrafted pattern/rule based system by Ao and Takagi, ALICE, (ii) a machine learning system by Chang et al., and (iii) a simple alignment-based program by Schwartz and Hearst. In addition, we investigated the conceptual coverage of two terminological knowledge bases: (i) the UMLS (the Unified Medical Language System), and (ii) the BioThesaurus (a thesaurus of names for all UniProt protein records). We also implemented a web interface that provides a virtual integration of various SF knowledge bases. Results We found that detection systems agree with each other on most cases, and the existing terminological knowledge bases have a good coverage of synonymous relationship for frequently defined LFs. The web interface allows people to detect SF definitions from text and to search several SF knowledge bases. Availability The web site is http://gauss.dbb.georgetown.edu/liblab/SFThesaurus.
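 
    The alignment-based program by Schwartz and Hearst is a simple character-matching heuristic, which makes it easy to sketch. The following condensed Python port (not the authors' implementation) finds "long form (SF)" patterns and validates each short form by matching its characters right-to-left inside a window of preceding words.

        import re

        def find_best_long_form(short_form, candidate):
            """Condensed port of the Schwartz & Hearst (2003) right-to-left alignment."""
            s_idx = len(short_form) - 1
            l_idx = len(candidate) - 1
            while s_idx >= 0:
                ch = short_form[s_idx].lower()
                if not ch.isalnum():
                    s_idx -= 1
                    continue
                # Move left until this SF character matches; the SF's first character
                # must additionally start a word of the long form.
                while (l_idx >= 0 and candidate[l_idx].lower() != ch) or \
                      (s_idx == 0 and l_idx > 0 and candidate[l_idx - 1].isalnum()):
                    l_idx -= 1
                if l_idx < 0:
                    return None
                l_idx -= 1
                s_idx -= 1
            # Expand to the start of the word containing the first matched character.
            l_idx = candidate.rfind(" ", 0, l_idx + 1) + 1
            return candidate[l_idx:]

        text = "Tumor necrosis factor (TNF) signalling activates heat shock factor (HSF)."
        for match in re.finditer(r"([^()]+)\((\w+)\)", text):
            candidate, sf = match.group(1).strip(), match.group(2)
            # Restrict the candidate to a window of words before the parenthesis,
            # as the original algorithm does (min(|SF|+5, 2*|SF|) words).
            words = candidate.split()
            window = " ".join(words[-min(len(sf) + 5, 2 * len(sf)):])
            lf = find_best_long_form(sf, window)
            if lf:
                print(f"{sf} -> {lf}")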

  2. Techniques, Applications and Challenging Issue in Text Mining

    OpenAIRE

    Shaidah Jusoh; Hejab M. Alfawareh

    2012-01-01

    Text mining is a very exciting research area as it tries to discover knowledge from unstructured texts. These texts can be found on a desktop, intranets and the internet. The aim of this paper is to give an overview of text mining in the contexts of its techniques, application domains and the most challenging issue. The focus is on fundamental methods of text mining, which include natural language processing and information extraction. This paper also gives a short review of domains whi...

  3. Research on Text Mining Based on Domain Ontology

    OpenAIRE

    Li-hua, Jiang; Neng-fu, Xie; Hong-bin, Zhang

    2013-01-01

    This paper improves on traditional text mining technology, which cannot understand text semantics. The author discusses text mining methods based on ontology and puts forward a text mining model based on domain ontology. An ontology structure is built first and a "concept-concept" similarity matrix is introduced; then a concept vector space model based on the domain ontology is used in place of the traditional vector space model to represent the documents in order to realize text m...

  4. Data, Text and Web Mining for Business Intelligence : A Survey

    Directory of Open Access Journals (Sweden)

    Abdul-Aziz Rashid Al-Azmi

    2013-04-01

    Full Text Available The Information and Communication Technologies revolution brought a digital world with huge amounts of data available. Enterprises use mining technologies to search vast amounts of data for vital insight and knowledge. Mining tools such as data mining, text mining, and web mining are used to find hidden knowledge in large databases or the Internet. Mining tools are automated software tools used to achieve business intelligence by finding hidden relations, and predicting future events from vast amounts of data. This uncovered knowledge helps in gaining competitive advantages, better customer relationships, and even fraud detection. In this survey, we describe how these techniques work and how they are implemented. Furthermore, we discuss how business intelligence is achieved using these mining tools, and then look into some case studies of success stories using mining tools. Finally, we demonstrate some of the main challenges to the mining technologies that limit their potential.

  5. DATA, TEXT, AND WEB MINING FOR BUSINESS INTELLIGENCE: A SURVEY

    Directory of Open Access Journals (Sweden)

    Abdul-Aziz Rashid Al-Azmi

    2013-03-01

    Full Text Available The Information and Communication Technologies revolution brought a digital world with huge amounts of data available. Enterprises use mining technologies to search vast amounts of data for vital insight and knowledge. Mining tools such as data mining, text mining, and web mining are used to find hidden knowledge in large databases or the Internet. Mining tools are automated software tools used to achieve business intelligence by finding hidden relations, and predicting future events from vast amounts of data. This uncovered knowledge helps in gaining competitive advantages, better customer relationships, and even fraud detection. In this survey, we describe how these techniques work and how they are implemented. Furthermore, we discuss how business intelligence is achieved using these mining tools. Then we look into some case studies of success stories using mining tools. Finally, we demonstrate some of the main challenges to the mining technologies that limit their potential.

  6. Regulatory relations represented in logics and biomedical texts

    DEFF Research Database (Denmark)

    Zambach, Sine

    biomedical semantics of regulates relations, i.e. positively regulates, negatively regulates and regulates, the last of which is assumed to be a super-relation of the first two. This thesis discusses an initial framework for knowledge representation based on logics, and carries out a corpus analysis on the verbs

  7. Text Mining Approaches To Extract Interesting Association Rules from Text Documents

    Directory of Open Access Journals (Sweden)

    Vishwadeepak Singh Baghela

    2012-05-01

    Full Text Available A handful of text data mining approaches are available to extract potentially useful information and associations from large amounts of text data. The term data mining is used for methods that analyze data with the objective of finding rules and patterns describing the characteristic properties of the data. The mined information is typically represented as a model of the semantic structure of the dataset, where the model may be used on new data for prediction or classification. In general, data mining deals with structured data (for example, relational databases), whereas text presents special characteristics and is unstructured. Unstructured data are totally different from databases, where mining techniques are usually applied and structured data are managed. Text mining can work with unstructured or semi-structured data sets. A brief review of some recent research related to mining associations from text documents is presented in this paper.
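 
    As a tiny illustration of mining association rules from text, the pure-Python sketch below treats each (invented) document as a set of terms, counts frequent term pairs, and reports rules that clear minimum support and confidence thresholds; the thresholds are arbitrary.

        from itertools import combinations
        from collections import Counter

        # Invented documents represented as term sets.
        docs = [
            {"aspirin", "headache", "relief"},
            {"aspirin", "fever", "relief"},
            {"ibuprofen", "fever", "relief"},
            {"aspirin", "headache"},
        ]
        MIN_SUPPORT, MIN_CONFIDENCE = 0.5, 0.7

        item_counts = Counter(term for doc in docs for term in doc)
        pair_counts = Counter(pair for doc in docs for pair in combinations(sorted(doc), 2))

        n = len(docs)
        for (a, b), count in pair_counts.items():
            support = count / n
            if support < MIN_SUPPORT:
                continue
            for lhs, rhs in ((a, b), (b, a)):
                confidence = count / item_counts[lhs]
                if confidence >= MIN_CONFIDENCE:
                    print(f"{{{lhs}}} -> {{{rhs}}}  support={support:.2f} confidence={confidence:.2f}")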

  8. Investigating and Annotating the Role of Citation in Biomedical Full-Text Articles

    OpenAIRE

    Yu, Hong; Agarwal, Shashank; Frid, Nadya

    2009-01-01

    Citations are ubiquitous in scientific articles and play important roles for representing the semantic content of a full-text biomedical article. In this work, we manually examined full-text biomedical articles to analyze the semantic content of citations in full-text biomedical articles. After developing a citation relation schema and annotation guideline, our pilot annotation results show an overall agreement of 0.71, and here we report on the research challenges and the lessons we've learn...

  9. Redundancy in electronic health record corpora: analysis, impact on text mining performance and mitigation strategies

    OpenAIRE

    Cohen, Raphael; Elhadad, Michael; Elhadad, Noémie

    2013-01-01

    Background The increasing availability of Electronic Health Record (EHR) data and specifically free-text patient notes presents opportunities for phenotype extraction. Text-mining methods in particular can help disease modeling by mapping named-entities mentions to terminologies and clustering semantically related terms. EHR corpora, however, exhibit specific statistical and linguistic characteristics when compared with corpora in the biomedical literature domain. We focus on copy-and-paste r...

  10. A text mining framework in R and its applications

    OpenAIRE

    Feinerer, Ingo

    2008-01-01

    Text mining has become an established discipline both in research and in business intelligence. However, many existing text mining toolkits lack easy extensibility and provide only poor support for interacting with statistical computing environments. Therefore we propose a text mining framework for the statistical computing environment R which provides intelligent methods for corpora handling, meta data management, preprocessing, operations on documents, and data export. We present how well es...

  11. Using Dependency Parses to Augment Feature Construction for Text Mining

    OpenAIRE

    Guo, Sheng

    2012-01-01

    With the prevalence of large data stored in the cloud, including unstructured information in the form of text, there is now an increased emphasis on text mining. A broad range of techniques are now used for text mining, including algorithms adapted from machine learning, NLP, computational linguistics, and data mining. Applications are also multi-fold, including classification, clustering, segmentation, relationship discovery, and practically any task that discovers latent information from wr...

  12. VIRTUAL MINING MODEL FOR CLASSIFYING TEXT USING UNSUPERVISED LEARNING

    Directory of Open Access Journals (Sweden)

    S. Koteeswaran

    2014-01-01

    Full Text Available In the real world, data mining is emerging in various areas; some of its most notable applications are in research fields such as big data, multimedia mining and text mining. Researchers demonstrate their contributions through substantial improvements in their proposals, expressed by means of mathematical representation. Approaches to a given problem can be classified into mathematical and implementation models. The mathematical model relates to straightforward rules and formulas tied to the problem definition of a particular domain, whereas the implementation model derives some form of knowledge from real-time decision-making behaviour, such as artificial intelligence and swarm intelligence, and involves a more complex set of rules than the mathematical model. The implementation model mines and derives a knowledge model from a collection of datasets and attributes, and this knowledge is applied to the problem definition at hand. The objective of our work is to efficiently mine knowledge from unstructured text documents; to mine textual documents, text mining, a sub-domain of data mining, is applied. In this work, the proposed Virtual Mining Model (VMM) is defined for effective text clustering. The VMM involves the learning of conceptual terms, which are grouped in a Significant Term List (STL); the VMM is an appropriate combination of a layer-1 arch with ABI (Analysis of Bilateral Intelligence). Frequent updating of the conceptual terms in the STL is important for effective clustering. As the results show, an artificial neural network based unsupervised learning algorithm is used for learning textual patterns in the Virtual Mining Model; for learning such terminologies, this paper proposes an artificial neural network based learning algorithm.

  13. Mining knowledge from text repositories using information extraction: A review

    Indian Academy of Sciences (India)

    Sandeep R Sirsat; Dr Vinay Chavan; Dr Shrinivas P Deshpande

    2014-02-01

    There are two approaches to mining text from online repositories. First, when the knowledge to be discovered is expressed directly in the documents to be mined, Information Extraction (IE) alone can serve as an effective tool for such text mining. Second, when the documents contain concrete data in unstructured form rather than abstract knowledge, Information Extraction (IE) can be used to first transform the unstructured data in the document corpus into a structured database, and then some state-of-the-art data mining algorithms/tools can be used to identify abstract patterns in this extracted data. This paper presents a review of several methods related to these two approaches.

  14. E3Miner: a text mining tool for ubiquitin-protein ligases

    OpenAIRE

    Lee, Hodong; Yi, Gwan-Su; Park, Jong C.

    2008-01-01

    Ubiquitination is a regulatory process critically involved in the degradation of >80% of cellular proteins, where such proteins are specifically recognized by a key enzyme, or a ubiquitin-protein ligase (E3). Because of this important role of E3s, a rapidly growing body of the published literature in biology and biomedical fields reports novel findings about various E3s and their molecular mechanisms. However, such findings are neither adequately retrieved by general text-mining tools nor sys...

  15. BICEPP: an example-based statistical text mining method for predicting the binary characteristics of drugs

    OpenAIRE

    Tsafnat Guy; Polasek Thomas M; Anthony Stephen; Lin Frank PY; Doogue Matthew P

    2011-01-01

    Abstract Background The identification of drug characteristics is a clinically important task, but it requires much expert knowledge and consumes substantial resources. We have developed a statistical text-mining approach (BInary Characteristics Extractor and biomedical Properties Predictor: BICEPP) to help experts screen drugs that may have important clinical characteristics of interest. Results BICEPP first retrieves MEDLINE abstracts containing drug names, then selects tokens that best pre...

  16. Text Mining Approaches To Extract Interesting Association Rules from Text Documents

    OpenAIRE

    Vishwadeepak Singh Baghela; S. P. Tripathi

    2012-01-01

    A number of text data mining approaches are available to extract potentially useful information and associations from large amounts of text data. The term data mining is used for methods that analyze data with the objective of finding rules and patterns describing the characteristic properties of the data. The mined information is typically represented as a model of the semantic structure of the dataset, where the model may be used on new data for prediction or classification. In general, data mi...

  17. Cultural text mining: using text mining to map the emergence of transnational reference cultures in public media repositories

    NARCIS (Netherlands)

    Pieters, Toine; Verheul, Jaap

    2014-01-01

    This paper discusses the research project Translantis, which uses innovative technologies for cultural text mining to analyze large repositories of digitized public media, such as newspapers and journals.1 The Translantis research team uses and develops the text mining tool Texcavator, which is base

  18. VIRTUAL MINING MODEL FOR CLASSIFYING TEXT USING UNSUPERVISED LEARNING

    OpenAIRE

    S. Koteeswaran; E. Kannan; P. Visu

    2014-01-01

    Data mining now appears in many areas of research, with notable results in fields such as big data, multimedia mining and text mining, and researchers demonstrate their contributions through mathematical representations. Approaches to a problem can be classified into mathematical and implementation models. The mathematical model relates to the straightforward rules and formulas that are re...

  19. Cultural text mining: using text mining to map the emergence of transnational reference cultures in public media repositories

    OpenAIRE

    Pieters, Toine; Verheul, Jaap

    2014-01-01

    This paper discusses the research project Translantis, which uses innovative technologies for cultural text mining to analyze large repositories of digitized public media, such as newspapers and journals.1 The Translantis research team uses and develops the text mining tool Texcavator, which is based on the scalable open source text analysis service xTAS (developed by the Intelligent Systems Lab Amsterdam). The text analysis service xTAS has been used successfully in computational humanities ...

  20. Text mining of web-based medical content

    CERN Document Server

    Neustein, Amy

    2014-01-01

    Text Mining of Web-Based Medical Content examines web mining for extracting useful information that can be used for treating and monitoring the healthcare of patients. This work provides methodological approaches to designing mapping tools that exploit data found in social media postings. Specific linguistic features of medical postings are analyzed vis-a-vis available data extraction tools for culling useful information.

  1. pubmed.mineR: an R package with text-mining algorithms to analyse PubMed abstracts.

    Science.gov (United States)

    Rani, Jyoti; Shah, A B Rauf; Ramachandran, Srinivasan

    2015-10-01

    The PubMed literature database is a valuable source of information for scientific research. It is rich in biomedical literature with more than 24 million citations. Data-mining of voluminous literature is a challenging task. Although several text-mining algorithms have been developed in recent years with a focus on data visualization, they have limitations in speed and flexibility and are not available as open source. We have developed an R package, pubmed.mineR, wherein we have combined the advantages of existing algorithms, overcome their limitations, and offer user flexibility and links with other packages in Bioconductor and the Comprehensive R Archive Network (CRAN) in order to expand the user capabilities for executing multifaceted approaches. Three case studies are presented, namely, 'Evolving role of diabetes educators', 'Cancer risk assessment' and 'Dynamic concepts on disease and comorbidity', to illustrate the use of pubmed.mineR. The package generally runs fast with small elapsed times on regular workstations, even on large corpus sizes and with compute-intensive functions. The pubmed.mineR package is available at http://cran.r-project.org/web/packages/pubmed.mineR. PMID:26564970
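
    As a rough illustration of the kind of workflow such a package automates, the Python sketch below fetches a few PubMed abstracts through Biopython's E-utilities wrapper and counts terms. It is not pubmed.mineR itself, and the query term and e-mail address are placeholders.

    ```python
    # Sketch only: fetch a handful of PubMed abstracts and summarize term frequencies.
    # Assumes Biopython is installed; the query and e-mail are placeholder values.
    import re
    from collections import Counter

    from Bio import Entrez

    Entrez.email = "your.name@example.org"  # NCBI asks for a contact address

    # Find a few PubMed IDs matching a query.
    handle = Entrez.esearch(db="pubmed", term="diabetes educator", retmax=20)
    ids = Entrez.read(handle)["IdList"]
    handle.close()

    # Fetch the corresponding abstracts as plain text.
    handle = Entrez.efetch(db="pubmed", id=",".join(ids), rettype="abstract", retmode="text")
    text = handle.read()
    handle.close()

    # Crude corpus-level term-frequency summary.
    tokens = re.findall(r"[a-z]{4,}", text.lower())
    print(Counter(tokens).most_common(15))
    ```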

  2. Semi-Automatic Indexing of Full Text Biomedical Articles

    OpenAIRE

    Gay, Clifford W.; Kayaalp, Mehmet; Aronson, Alan R.

    2005-01-01

    The main application of U.S. National Library of Medicine’s Medical Text Indexer (MTI) is to provide indexing recommendations to the Library’s indexing staff. The current input to MTI consists of the titles and abstracts of articles to be indexed. This study reports on an extension of MTI to the full text of articles appearing in online medical journals that are indexed for Medline®. Using a collection of 17 journal issues containing 500 articles, we report on the effectiven...

  3. Facts from text: can text mining help to scale-up high-quality manual curation of gene products with ontologies?

    Science.gov (United States)

    Winnenburg, Rainer; Wächter, Thomas; Plake, Conrad; Doms, Andreas; Schroeder, Michael

    2008-11-01

    The biomedical literature can be seen as a large integrated, but unstructured data repository. Extracting facts from literature and making them accessible is approached from two directions: manual curation efforts develop ontologies and vocabularies to annotate gene products based on statements in papers. Text mining aims to automatically identify entities and their relationships in text using information retrieval and natural language processing techniques. Manual curation is highly accurate but time consuming, and does not scale with the ever increasing growth of literature. Text mining as a high-throughput computational technique scales well, but is error-prone due to the complexity of natural language. How can both be married to combine scalability and accuracy? Here, we review the state-of-the-art text mining approaches that are relevant to annotation and discuss available online services analysing biomedical literature by means of text mining techniques, which could also be utilised by annotation projects. We then examine how far text mining has already been utilised in existing annotation projects and conclude how these techniques could be tightly integrated into the manual annotation process through novel authoring systems to scale-up high-quality manual curation. PMID:19060303

  4. UMLS Content Views Appropriate for NLP Processing of the Biomedical Literature vs. Clinical Text

    OpenAIRE

    Demner-Fushman, Dina; Mork, James G; Shooshan, Sonya E.; Aronson, Alan R.

    2010-01-01

    Identification of medical terms in free text is a first step in such Natural Language Processing (NLP) tasks as automatic indexing of biomedical literature and extraction of patients’ problem lists from the text of clinical notes. Many tools developed to perform these tasks use biomedical knowledge encoded in the Unified Medical Language System (UMLS) Metathesaurus. We continue our exploration of automatic approaches to creation of subsets (UMLS content views) which can support NLP processing...

  5. Text mining for literature review and knowledge discovery in cancer risk assessment and research.

    Directory of Open Access Journals (Sweden)

    Anna Korhonen

    Full Text Available Research in biomedical text mining is starting to produce technology which can make information in biomedical literature more accessible for bio-scientists. One of the current challenges is to integrate and refine this technology to support real-life scientific tasks in biomedicine, and to evaluate its usefulness in the context of such tasks. We describe CRAB - a fully integrated text mining tool designed to support chemical health risk assessment. This task is complex and time-consuming, requiring a thorough review of existing scientific data on a particular chemical. Covering human, animal, cellular and other mechanistic data from various fields of biomedicine, this is highly varied and therefore difficult to harvest from literature databases via manual means. Our tool automates the process by extracting relevant scientific data in published literature and classifying it according to multiple qualitative dimensions. Developed in close collaboration with risk assessors, the tool allows navigating the classified dataset in various ways and sharing the data with other users. We present a direct and user-based evaluation which shows that the technology integrated in the tool is highly accurate, and report a number of case studies which demonstrate how the tool can be used to support scientific discovery in cancer risk assessment and research. Our work demonstrates the usefulness of a text mining pipeline in facilitating complex research tasks in biomedicine. We discuss further development and application of our technology to other types of chemical risk assessment in the future.

  6. Managing biological networks by using text mining and computer-aided curation

    Science.gov (United States)

    Yu, Seok Jong; Cho, Yongseong; Lee, Min-Ho; Lim, Jongtae; Yoo, Jaesoo

    2015-11-01

    In order to understand a biological mechanism in a cell, a researcher should collect a huge number of protein interactions with experimental data from experiments and the literature. Text mining systems that extract biological interactions from papers have been used to construct biological networks for a few decades. Even though text mining of the literature is necessary to construct a biological network, few systems with a text mining tool are available for biologists who want to construct their own biological networks. We have developed a biological network construction system called BioKnowledge Viewer that can generate a biological interaction network by using a text mining tool and biological taggers. It also includes Boolean simulation software to provide a biological modeling system to simulate the model that is made with the text mining tool. A user can download PubMed articles and construct a biological network by using the Multi-level Knowledge Emergence Model (KMEM), MetaMap, and A Biomedical Named Entity Recognizer (ABNER) as text mining tools. To evaluate the system, we constructed an aging-related biological network that consists of 9,415 nodes (genes) by using manual curation. With network analysis, we found that several genes, including JNK, AP-1, and BCL-2, were highly related in the aging biological network. We provide a semi-automatic curation environment so that users can obtain a graph database for managing text mining results that are generated in the server system and can navigate the network with BioKnowledge Viewer, which is freely available at http://bioknowledgeviewer.kisti.re.kr.

  7. Application of text mining for customer evaluations in commercial banking

    Science.gov (United States)

    Tan, Jing; Du, Xiaojiang; Hao, Pengpeng; Wang, Yanbo J.

    2015-07-01

    Nowadays customer attrition is increasingly serious in commercial banks. To combat this problem comprehensively, mining customer evaluation texts is as important as mining customer structured data. In order to extract hidden information from customer evaluations, Textual Feature Selection, Classification and Association Rule Mining are necessary techniques. This paper presents all three techniques by using Chinese Word Segmentation, C5.0 and Apriori, and a set of experiments was run on a collection of real textual data comprising 823 customer evaluations taken from a Chinese commercial bank. Results, consequent solutions, and some advice for the commercial bank are given in this paper.
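
    A minimal sketch of the association-rule idea on tokenized evaluation texts is shown below; it uses toy English comments and a hand-rolled pairwise co-occurrence count rather than the paper's Chinese Word Segmentation, C5.0 and Apriori toolchain.

    ```python
    # Illustrative sketch (not the paper's pipeline): mine simple association rules
    # between words co-occurring in customer evaluation texts.
    from itertools import combinations
    from collections import Counter

    evaluations = [            # toy, hypothetical customer comments
        "slow service long queue",
        "friendly staff slow service",
        "long queue unhelpful staff",
        "friendly staff quick service",
    ]

    min_support, min_confidence = 0.5, 0.6
    baskets = [set(e.split()) for e in evaluations]
    n = len(baskets)

    item_count = Counter(w for b in baskets for w in b)                       # document frequency per word
    pair_count = Counter(p for b in baskets for p in combinations(sorted(b), 2))  # per word pair

    for (a, b), c in pair_count.items():
        support = c / n
        if support < min_support:
            continue
        for lhs, rhs in ((a, b), (b, a)):
            confidence = c / item_count[lhs]
            if confidence >= min_confidence:
                print(f"{lhs} -> {rhs}  support={support:.2f} confidence={confidence:.2f}")
    ```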

  8. Text Mining: Wissensgewinnung aus natürlichsprachigen Dokumenten

    OpenAIRE

    Witte, René; Mülle, Jutta

    2006-01-01

    The still relatively young research field of "text mining" combines methods from natural language processing with database and information system technologies. It arose from the observation that roughly 85% of all database content is available only in unstructured form, so that the techniques of classical data mining cannot be applied to it for knowledge discovery. Examples of such data are full-text databases of books, corporate web pages, archives of ...

  9. Applications of string mining techniques in text analysis

    OpenAIRE

    Horațiu Mocian

    2012-01-01

    The focus of this project is on the algorithms and data structures used in string mining and their applications in bioinformatics, text mining and information retrieval. More specifically, it studies the use of suffix trees and suffix arrays for biological sequence analysis, and the algorithms used for approximate string matching, both general ones and specialized ones used in bioinformatics, like the BLAST algorithm and the PAM substitution matrix. Also, an attempt is made to apply these structures ...

  10. Mining the Text: 34 Text Features that Can Ease or Obstruct Text Comprehension and Use

    Science.gov (United States)

    White, Sheida

    2012-01-01

    This article presents 34 characteristics of texts and tasks ("text features") that can make continuous (prose), noncontinuous (document), and quantitative texts easier or more difficult for adolescents and adults to comprehend and use. The text features were identified by examining the assessment tasks and associated texts in the national…

  11. Text Mining Driven Drug-Drug Interaction Detection.

    Science.gov (United States)

    Yan, Su; Jiang, Xiaoqian; Chen, Ying

    2013-01-01

    Identifying drug-drug interactions is an important and challenging problem in computational biology and healthcare research. There are accurate, structured but limited domain knowledge and noisy, unstructured but abundant textual information available for building predictive models. The difficulty lies in mining the true patterns embedded in text data and developing efficient and effective ways to combine heterogenous types of information. We demonstrate a novel approach of leveraging augmented text-mining features to build a logistic regression model with improved prediction performance (in terms of discrimination and calibration). Our model based on synthesized features significantly outperforms the model trained with only structured features (AUC: 96% vs. 91%, Sensitivity: 90% vs. 82% and Specificity: 88% vs. 81%). Along with the quantitative results, we also show learned "latent topics", an intermediary result of our text mining module, and discuss their implications. PMID:25131635
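
    The general pattern of augmenting structured features with text-derived features can be sketched as follows, using synthetic data and a scikit-learn logistic regression. This is not the authors' model or dataset, and the AUC values it prints are illustrative only.

    ```python
    # Sketch of the general idea: compare a classifier trained on structured features
    # alone against one trained on structured plus text-derived (e.g. topic) features.
    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import roc_auc_score

    rng = np.random.default_rng(0)
    n = 2000
    structured = rng.normal(size=(n, 5))      # e.g. shared targets, pathway flags (synthetic)
    text_topics = rng.normal(size=(n, 10))    # e.g. latent topics mined from literature (synthetic)
    logits = structured[:, 0] + 0.8 * text_topics[:, 0] + 0.5 * text_topics[:, 1]
    y = (logits + rng.normal(scale=1.0, size=n) > 0).astype(int)

    for name, X in [("structured only", structured),
                    ("structured + text", np.hstack([structured, text_topics]))]:
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
        model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
        auc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
        print(f"{name}: AUC = {auc:.3f}")
    ```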

  12. Grid-based Support for Different Text Mining Tasks

    Directory of Open Access Journals (Sweden)

    Martin Sarnovský

    2009-12-01

    Full Text Available This paper provides an overview of our research activities aimed at the efficient use of Grid infrastructure to solve various text mining tasks. Grid-enabling of various text mining tasks was mainly driven by the increasing volume of processed data. Utilizing the Grid services approach therefore enables us to perform various text mining scenarios and also opens ways to design distributed modifications of existing methods. In particular, some parts of the mining process can significantly benefit from the decomposition paradigm; in this study we present our approach to data-driven decomposition of a decision tree building algorithm, a clustering algorithm based on self-organizing maps, and its application in a conceptual model building task using the FCA-based algorithm. The work presented in this paper is rather to be considered a 'proof of concept' for the design and implementation of decomposition methods, as we performed the experiments mostly on standard textual databases.

  13. Research on Online Topic Evolutionary Pattern Mining in Text Streams

    Directory of Open Access Journals (Sweden)

    Qian Chen

    2014-06-01

    Full Text Available Text streams are a class of ubiquitous data that arrive over time and are so extraordinarily large in scale that we often lose track of them. Text streams form a fundamental source of information that can be used to detect the semantic topics that individuals and organizations are interested in, as well as to detect burst events within communities. Thus, an intelligent system that can automatically extract interesting temporal patterns from text streams is badly needed; however, evolutionary pattern mining is not well addressed in previous work. In this paper, we begin a tentative study of a topic evolutionary pattern mining system by fully discussing the properties of a topic after formally defining it, and by proposing a common, formal framework for analyzing text streams. We also define three basic tasks, namely (1) online topic detection, (2) event evolution extraction and (3) topic property life cycle, and propose three common mining algorithms, one for each task. Finally, we exemplify the application of evolutionary pattern mining and show that interesting patterns can be extracted from a newswire dataset.

  14. Collaborative mining and interpretation of large-scale data for biomedical research insights.

    Directory of Open Access Journals (Sweden)

    Georgia Tsiliki

    Full Text Available Biomedical research is becoming increasingly interdisciplinary and collaborative in nature. Researchers need to efficiently and effectively collaborate and make decisions by meaningfully assembling, mining and analyzing available large-scale volumes of complex multi-faceted data residing in different sources. In line with research directions revealing that, in spite of the recent advances in data mining and computational analysis, humans can easily detect patterns which computer algorithms may have difficulty in finding, this paper reports on the practical use of an innovative web-based collaboration support platform in a biomedical research context. Arguing that dealing with data-intensive and cognitively complex settings is not a technical problem alone, the proposed platform adopts a hybrid approach that builds on the synergy between machine and human intelligence to facilitate the underlying sense-making and decision-making processes. User experience shows that the platform enables more informed and quicker decisions by displaying the aggregated information according to users' needs, while also exploiting the associated human intelligence.

  15. Clique-based data mining for related genes in a biomedical database

    Directory of Open Access Journals (Sweden)

    Tomita Etsuji

    2009-07-01

    Full Text Available Abstract Background Progress in the life sciences cannot be made without integrating biomedical knowledge on numerous genes in order to help formulate hypotheses on the genetic mechanisms behind various biological phenomena, including diseases. There is thus a strong need for a way to automatically and comprehensively search from biomedical databases for related genes, such as genes in the same families and genes encoding components of the same pathways. Here we address the extraction of related genes by searching for densely-connected subgraphs, which are modeled as cliques, in a biomedical relational graph. Results We constructed a graph whose nodes were gene or disease pages, and edges were the hyperlink connections between those pages in the Online Mendelian Inheritance in Man (OMIM) database. We obtained over 20,000 sets of related genes (called 'gene modules') by enumerating cliques computationally. The modules included genes in the same family, genes for proteins that form a complex, and genes for components of the same signaling pathway. The results of experiments using 'metabolic syndrome'-related gene modules show that the gene modules can be used to get a coherent holistic picture helpful for interpreting relations among genes. Conclusion We presented a data mining approach extracting related genes by enumerating cliques. The extracted gene sets provide a holistic picture useful for comprehending complex disease mechanisms.
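
    A toy sketch of the clique-enumeration step is given below, with hypothetical page names and hyperlinks and the networkx library standing in for the authors' implementation.

    ```python
    # Toy sketch of the clique-enumeration idea (not the paper's OMIM pipeline):
    # build a graph whose nodes are gene/disease pages and whose edges are hyperlinks,
    # then enumerate maximal cliques as candidate "gene modules".
    import networkx as nx

    hyperlinks = [                      # hypothetical page-to-page links
        ("INS", "IRS1"), ("INS", "PPARG"), ("IRS1", "PPARG"),
        ("PPARG", "obesity"), ("INS", "obesity"), ("IRS1", "obesity"),
        ("BRCA1", "BRCA2"),
    ]

    G = nx.Graph(hyperlinks)
    modules = [c for c in nx.find_cliques(G) if len(c) >= 3]  # keep non-trivial cliques
    for m in modules:
        print(sorted(m))
    ```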

  16. Text Mining of Journal Articles for Sleep Disorder Terminologies.

    Directory of Open Access Journals (Sweden)

    Calvin Lam

    Full Text Available Research on publication trends in journal articles on sleep disorders (SDs) and the associated methodologies by using text mining has been limited. The present study involved text mining for terms to determine the publication trends in sleep-related journal articles published during 2000-2013 and to identify associations between SD and methodology terms as well as conducting statistical analyses of the text mining findings. SD and methodology terms were extracted from 3,720 sleep-related journal articles in the PubMed database by using MetaMap. The extracted data set was analyzed using hierarchical cluster analyses and adjusted logistic regression models to investigate publication trends and associations between SD and methodology terms. MetaMap had a text mining precision, recall, and false positive rate of 0.70, 0.77, and 11.51%, respectively. The most common SD term was breathing-related sleep disorder, whereas narcolepsy was the least common. Cluster analyses showed similar methodology clusters for each SD term, except narcolepsy. The logistic regression models showed an increasing prevalence of insomnia, parasomnia, and other sleep disorders but a decreasing prevalence of breathing-related sleep disorder during 2000-2013. Different SD terms were positively associated with different methodology terms regarding research design terms, measure terms, and analysis terms. Insomnia-, parasomnia-, and other sleep disorder-related articles showed an increasing publication trend, whereas those related to breathing-related sleep disorder showed a decreasing trend. Furthermore, experimental studies more commonly focused on hypersomnia and other SDs and less commonly on insomnia, breathing-related sleep disorder, narcolepsy, and parasomnia. Thus, text mining may facilitate the exploration of the publication trends in SDs and the associated methodologies.

  17. Using Text Mining to Characterize Online Discussion Facilitation

    Science.gov (United States)

    Ming, Norma; Baumer, Eric

    2011-01-01

    Facilitating class discussions effectively is a critical yet challenging component of instruction, particularly in online environments where student and faculty interaction is limited. Our goals in this research were to identify facilitation strategies that encourage productive discussion, and to explore text mining techniques that can help…

  18. Text mining for biology--the way forward

    DEFF Research Database (Denmark)

    Altman, Russ B; Bergman, Casey M; Blake, Judith;

    2008-01-01

    This article collects opinions from leading scientists about how text mining can provide better access to the biological literature, how the scientific community can help with this process, what the next steps are, and what role future BioCreative evaluations can play. The responses identify seve...

  19. Text mining improves prediction of protein functional sites.

    Directory of Open Access Journals (Sweden)

    Karin M Verspoor

    Full Text Available We present an approach that integrates protein structure analysis and text mining for protein functional site prediction, called LEAP-FS (Literature Enhanced Automated Prediction of Functional Sites). The structure analysis was carried out using Dynamics Perturbation Analysis (DPA), which predicts functional sites at control points where interactions greatly perturb protein vibrations. The text mining extracts mentions of residues in the literature, and predicts that residues mentioned are functionally important. We assessed the significance of each of these methods by analyzing their performance in finding known functional sites (specifically, small-molecule binding sites and catalytic sites) in about 100,000 publicly available protein structures. The DPA predictions recapitulated many of the functional site annotations and preferentially recovered binding sites annotated as biologically relevant vs. those annotated as potentially spurious. The text-based predictions were also substantially supported by the functional site annotations: compared to other residues, residues mentioned in text were roughly six times more likely to be found in a functional site. The overlap of predictions with annotations improved when the text-based and structure-based methods agreed. Our analysis also yielded new high-quality predictions of many functional site residues that were not catalogued in the curated data sources we inspected. We conclude that both DPA and text mining independently provide valuable high-throughput protein functional site predictions, and that integrating the two methods using LEAP-FS further improves the quality of these predictions.
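
    The reported 'roughly six times more likely' figure is an enrichment ratio; with made-up counts, it can be computed as in the short sketch below.

    ```python
    # Illustration only (made-up counts): how a "mentioned residues are ~6x more
    # likely to be in a functional site" style enrichment ratio can be computed.
    mentioned_total, mentioned_in_site = 10_000, 3_000          # residues mentioned in text
    unmentioned_total, unmentioned_in_site = 1_000_000, 50_000  # all other residues

    p_mentioned = mentioned_in_site / mentioned_total
    p_unmentioned = unmentioned_in_site / unmentioned_total
    print(f"enrichment = {p_mentioned / p_unmentioned:.1f}x")   # 0.30 / 0.05 = 6.0x
    ```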

  20. Facilitating Full-text Access to Biomedical Literature Using Open Access Resources.

    Science.gov (United States)

    Kang, Hongyu; Hou, Zhen; Li, Jiao

    2015-01-01

    Open access (OA) resources and local libraries often have their own literature databases, especially in the field of biomedicine. We have developed a method of linking a local library to a biomedical OA resource facilitating researchers' full-text article access. The method uses a model based on vector space to measure similarities between two articles in local library and OA resources. The method achieved an F-score of 99.61%. This method of article linkage and mapping between local library and OA resources is available for use. Through this work, we have improved the full-text access of the biomedical OA resources. PMID:26262422

  1. Citation Mining: Integrating Text Mining and Bibliometrics for Research User Profiling.

    Science.gov (United States)

    Kostoff, Ronald N.; del Rio, J. Antonio; Humenik, James A.; Garcia, Esther Ofilia; Ramirez, Ana Maria

    2001-01-01

    Discusses the importance of identifying the users and impact of research, and describes an approach for identifying the pathways through which research can impact other research, technology development, and applications. Describes a study that used citation mining, an integration of citation bibliometrics and text mining, on articles from the…

  2. Application of Text Mining in Cancer Symptom Management.

    Science.gov (United States)

    Lee, Young Ji; Donovan, Heidi

    2016-01-01

    Fatigue continues to be one of the main symptoms that afflict ovarian cancer patients and negatively affects their functional status and quality of life. To manage fatigue effectively, the symptom must be understood from the perspective of patients. We utilized text mining to understand the symptom experiences and strategies that were associated with fatigue among ovarian cancer patients. Through text analysis, we determined that descriptors such as energetic, challenging, frustrating, struggling, unmanageable, and agony were associated with fatigue. Descriptors such as decadron, encourager, grocery, massage, relaxing, shower, sleep, zoloft, and church were associated with strategies to ameliorate fatigue. This study demonstrates the potential of applying text mining in cancer research to understand patients' perspective on symptom management. Future study will consider various factors to refine the results. PMID:27332415

  3. Biomedical Mathematics, Unit II: Propagation of Error, Vectors and Linear Programming. Student Text. Revised Version, 1975.

    Science.gov (United States)

    Biomedical Interdisciplinary Curriculum Project, Berkeley, CA.

    This student text presents instructional materials for a unit of mathematics within the Biomedical Interdisciplinary Curriculum Project (BICP), a two-year interdisciplinary precollege curriculum aimed at preparing high school students for entry into college and vocational programs leading to a career in the health field. Lessons concentrate on…

  4. Decision Support for E-Governance: A Text Mining Approach

    Directory of Open Access Journals (Sweden)

    G. Koteswara Rao

    2011-09-01

    Full Text Available Information and communication technology has the capability to improve the process by which governments involve citizens in formulating public policy and public projects. Even though much of government regulations may now be in digital form (and often available online), due to their complexity and diversity, identifying the ones relevant to a particular context is a non-trivial task. Similarly, with the advent of a number of electronic online forums, social networking sites and blogs, the opportunity of gathering citizens' petitions and stakeholders' views on government policy and proposals has increased greatly, but the volume and the complexity of analyzing unstructured data makes this difficult. On the other hand, text mining has come a long way from simple keyword search, and matured into a discipline capable of dealing with much more complex tasks. In this paper we discuss how text-mining techniques can help in retrieval of information and relationships from textual data sources, thereby assisting policy makers in discovering associations between policies and citizens' opinions expressed in electronic public forums and blogs etc. We also present here an integrated text mining based architecture for e-governance decision support along with a discussion on the Indian scenario.

  5. Extraction of semantic biomedical relations from text using conditional random fields

    Directory of Open Access Journals (Sweden)

    Stetter Martin

    2008-04-01

    text and apply it to the biomedical domain. Our approach is based on a rich set of textual features and achieves a performance that is competitive to leading approaches. The model is quite general and can be extended to handle arbitrary biological entities and relation types. The resulting gene-disease network shows that the GeneRIF database provides a rich knowledge source for text mining. Current work is focused on improving the accuracy of detection of entities as well as entity boundaries, which will also greatly improve the relation extraction performance.

  6. Decision Support for e-Governance: A Text Mining Approach

    CERN Document Server

    Rao, G Koteswara

    2011-01-01

    Information and communication technology has the capability to improve the process by which governments involve citizens in formulating public policy and public projects. Even though much of government regulations may now be in digital form (and often available online), due to their complexity and diversity, identifying the ones relevant to a particular context is a non-trivial task. Similarly, with the advent of a number of electronic online forums, social networking sites and blogs, the opportunity of gathering citizens' petitions and stakeholders' views on government policy and proposals has increased greatly, but the volume and the complexity of analyzing unstructured data makes this difficult. On the other hand, text mining has come a long way from simple keyword search, and matured into a discipline capable of dealing with much more complex tasks. In this paper we discuss how text-mining techniques can help in retrieval of information and relationships from textual data sources, thereby assisting policy...

  7. Text Mining System for Non-Expert Miners

    OpenAIRE

    Ramya, P.; S. Sasirekha

    2014-01-01

    Service oriented architecture integrated with text mining allows services to extract information in a well defined manner. In this paper, it is proposed to design a knowledge extracting system for the Ocean Information Data System. Deployed ARGO floating sensors of INCOIS (Indian National Council for Ocean Information Systems) organization reflects the characteristics of ocean. This is forwarded to the OIDS (Ocean Information Data System). For the data received from OIDS, pre-processing techn...

  8. Text Mining Driven Drug-Drug Interaction Detection

    OpenAIRE

    Yan, Su; Jiang, Xiaoqian; Chen, Ying

    2013-01-01

    Identifying drug-drug interactions is an important and challenging problem in computational biology and healthcare research. There are accurate, structured but limited domain knowledge and noisy, unstructured but abundant textual information available for building predictive models. The difficulty lies in mining the true patterns embedded in text data and developing efficient and effective ways to combine heterogenous types of information. We demonstrate a novel approach of leveraging augment...

  9. Text mining a self-report back-translation.

    Science.gov (United States)

    Blanch, Angel; Aluja, Anton

    2016-06-01

    There are several recommendations about the routine to follow when back-translating self-report instruments in cross-cultural research. However, text mining methods have generally been ignored within this field. This work describes an innovative text mining application used to adapt a personality questionnaire to 12 different languages. The method is divided into 3 stages: a descriptive analysis of the available back-translated instrument versions, a dissimilarity assessment between the source-language instrument and the 12 back-translations, and an assessment of item meaning equivalence. The suggested method contributes to improving the back-translation process of self-report instruments for cross-cultural research in 2 significant intertwined ways. First, it defines a systematic approach to the back-translation issue, allowing for a more orderly and informed evaluation concerning the equivalence of different versions of the same instrument in different languages. Second, it provides more accurate instrument back-translations, which has direct implications for the reliability and validity of the instrument's test scores when used in different cultures/languages. In addition, this procedure can be extended to the back-translation of self-reports measuring psychological constructs in clinical assessment. Future research could refine the suggested methodology and use additional available text mining tools. (PsycINFO Database Record) PMID:26302100
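
    The dissimilarity-assessment stage can be sketched with TF-IDF vectors and cosine similarity, as below. The items are toy English sentences, not the study's questionnaire, and the measure is a generic stand-in rather than the authors' exact procedure.

    ```python
    # Sketch of a dissimilarity assessment between a source item and its back-translations.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    source_item = "I often feel nervous in social situations"
    back_translations = {
        "lang_A": "I frequently feel nervous in social situations",
        "lang_B": "I am often anxious when I am with other people",
        "lang_C": "Social situations often make me feel nervous",
    }

    texts = [source_item] + list(back_translations.values())
    tfidf = TfidfVectorizer().fit_transform(texts)
    sims = cosine_similarity(tfidf[0], tfidf[1:]).ravel()

    for lang, sim in zip(back_translations, sims):
        print(f"{lang}: dissimilarity = {1 - sim:.2f}")
    ```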

  10. Computational intelligence methods on biomedical signal analysis and data mining in medical records

    OpenAIRE

    Vladutu, Liviu-Mihai

    2004-01-01

    This thesis is centered around the development and application of computationally effective solutions based on artificial neural networks (ANN) for biomedical signal analysis and data mining in medical records. The ultimate goal of this work in the field of Biomedical Engineering is to provide the clinician with the best possible information needed to make an accurate diagnosis (in our case of myocardial ischemia) and to propose advanced mathematical models for recovering the complex de...

  11. Unsupervised Biomedical Named Entity Recognition: Experiments with Clinical and Biological Texts

    OpenAIRE

    Zhang, Shaodian; Elhadad, Nóemie

    2013-01-01

    Named entity recognition is a crucial component of biomedical natural language processing, enabling information extraction and ultimately reasoning over and knowledge discovery from text. Much progress has been made in the design of rule-based and supervised tools, but they are often genre and task dependent. As such, adapting them to different genres of text or identifying new types of entities requires major effort in re-annotation or rule development. In this paper, we propose an unsupervi...

  12. FigSum: Automatically Generating Structured Text Summaries for Figures in Biomedical Literature

    OpenAIRE

    Agarwal, Shashank; Yu, Hong

    2009-01-01

    Figures are frequently used in biomedical articles to support research findings; however, they are often difficult to comprehend based on their legends alone and information from the full-text articles is required to fully understand them. Previously, we found that the information associated with a single figure is distributed throughout the full-text article the figure appears in. Here, we develop and evaluate a figure summarization system – FigSum, which aggregates this scattered informatio...

  13. Using rule-based natural language processing to improve disease normalization in biomedical text

    OpenAIRE

    2013-01-01

    Background and objective In order for computers to extract useful information from unstructured text, a concept normalization system is needed to link relevant concepts in a text to sources that contain further information about the concept. Popular concept normalization tools in the biomedical field are dictionary-based. In this study we investigate the usefulness of natural language processing (NLP) as an adjunct to dictionary-based concept normalization. Methods We compared the performance...

  14. Using rule-based natural language processing to improve disease normalization in biomedical text

    OpenAIRE

    Kang, Ning; Singh, Bharat; Afzal, Zubair; Mulligen, Erik; Kors, Jan

    2013-01-01

    Background and objective: In order for computers to extract useful information from unstructured text, a concept normalization system is needed to link relevant concepts in a text to sources that contain further information about the concept. Popular concept normalization tools in the biomedical field are dictionary-based. In this study we investigate the usefulness of natural language processing (NLP) as an adjunct to dictionary-based concept normalization. Methods: We compared th...

  15. A Semi-Structured Document Model for Text Mining

    Institute of Scientific and Technical Information of China (English)

    杨建武; 陈晓鸥

    2002-01-01

    A semi-structured document has more structured information compared to an ordinary document, and the relations among semi-structured documents can be fully utilized. In order to take advantage of the structure and link information in a semi-structured document for better mining, a structured link vector model (SLVM) is presented in this paper, where a vector represents a document, and the vector's elements are determined by terms, document structure and neighboring documents. Text mining based on SLVM is described in the procedure of K-means for briefness and clarity: calculating document similarity and calculating cluster centers. The clustering based on SLVM performs significantly better than that based on a conventional vector space model in the experiments, and its F value increases from 0.65-0.73 to 0.82-0.86.
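
    A rough sketch of the idea behind such a structured link vector model is shown below; the field weights are placeholders rather than the SLVM formulation, and the documents are toy examples.

    ```python
    # Rough sketch: a document's vector combines its own terms, terms from emphasized
    # structural fields (e.g. the title), and terms from linked/neighboring documents,
    # and documents are then clustered with K-means. Weights are assumed, not SLVM's.
    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.cluster import KMeans

    docs = [
        {"title": "gene expression clustering", "body": "microarray analysis of expression profiles",
         "neighbors": "protein interaction networks"},
        {"title": "protein interaction networks", "body": "graph models of protein complexes",
         "neighbors": "gene expression clustering"},
        {"title": "stock market prediction", "body": "time series models of market prices",
         "neighbors": "financial risk analysis"},
    ]

    fields = ["title", "body", "neighbors"]
    weights = {"title": 2.0, "body": 1.0, "neighbors": 0.5}   # assumed weights

    vec = TfidfVectorizer().fit([d[f] for d in docs for f in fields])
    X = np.vstack([
        sum(weights[f] * vec.transform([d[f]]).toarray()[0] for f in fields)
        for d in docs
    ])

    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
    print(labels)   # the two related documents should share a cluster
    ```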

  16. A Fuzzy Similarity Based Concept Mining Model for Text Classification

    CERN Document Server

    Puri, Shalini

    2012-01-01

    Text classification is a challenging and very active field with great importance for text categorization applications. A lot of research work has been done in this field, but there is still a need to categorize a collection of text documents into mutually exclusive categories by extracting the concepts or features using a supervised learning paradigm and different classification algorithms. In this paper, a new Fuzzy Similarity Based Concept Mining Model (FSCMM) is proposed to classify a set of text documents into pre-defined Category Groups (CG) by providing them training and preparing on the sentence, document and integrated corpora levels along with feature reduction and ambiguity removal on each level to achieve high system performance. A Fuzzy Feature Category Similarity Analyzer (FFCSA) is used to analyze each extracted feature of the Integrated Corpora Feature Vector (ICFV) with the corresponding categories or classes. This model uses a Support Vector Machine Classifier (SVMC) to classify correct...

  17. A TEXT MINING RESEARCH BASED ON LDA TOPIC MODELLING

    Directory of Open Access Journals (Sweden)

    Zhou Tong

    2016-05-01

    Full Text Available A large amount of digital text information is generated every day, and effectively searching, managing and exploring these text data has become a main task. In this paper, we first present an introduction to text mining and to the probabilistic topic model Latent Dirichlet Allocation. Two experiments are then proposed: topic modelling of Wikipedia articles and of users' tweets. The former builds a document topic model, aiming at a topic-based approach to searching, exploring and recommending articles. The latter builds a user topic model, providing an analysis of Twitter users' interests. The experimental process, including data collection, data pre-processing and model training, is fully documented and commented. Furthermore, the conclusions and applications of this paper could serve as a useful computational tool for social and business research.
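
    A minimal LDA sketch on a toy corpus is given below; the paper's experiments use Wikipedia articles and tweets, so this only illustrates the topic-modelling step itself with scikit-learn.

    ```python
    # Minimal Latent Dirichlet Allocation example on a tiny toy corpus.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import LatentDirichletAllocation

    corpus = [
        "the election results and the new government policy",
        "parliament debates the government budget policy",
        "the team won the football match in the league",
        "the league final was a close football match",
    ]

    vec = CountVectorizer(stop_words="english")
    X = vec.fit_transform(corpus)

    lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)
    terms = vec.get_feature_names_out()
    for k, topic in enumerate(lda.components_):
        top = [terms[i] for i in topic.argsort()[-4:][::-1]]   # top words per topic
        print(f"topic {k}: {', '.join(top)}")
    ```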

  18. Unsupervised biomedical named entity recognition: experiments with clinical and biological texts.

    Science.gov (United States)

    Zhang, Shaodian; Elhadad, Noémie

    2013-12-01

    Named entity recognition is a crucial component of biomedical natural language processing, enabling information extraction and ultimately reasoning over and knowledge discovery from text. Much progress has been made in the design of rule-based and supervised tools, but they are often genre and task dependent. As such, adapting them to different genres of text or identifying new types of entities requires major effort in re-annotation or rule development. In this paper, we propose an unsupervised approach to extracting named entities from biomedical text. We describe a stepwise solution to tackle the challenges of entity boundary detection and entity type classification without relying on any handcrafted rules, heuristics, or annotated data. A noun phrase chunker followed by a filter based on inverse document frequency extracts candidate entities from free text. Classification of candidate entities into categories of interest is carried out by leveraging principles from distributional semantics. Experiments show that our system, especially the entity classification step, yields competitive results on two popular biomedical datasets of clinical notes and biological literature, and outperforms a baseline dictionary match approach. Detailed error analysis provides a road map for future work. PMID:23954592
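
    The candidate-filtering idea can be sketched as a simple IDF cutoff over tokens, as below; a real system would add a noun-phrase chunker and the distributional-semantics classification step described above.

    ```python
    # Sketch of the candidate-filtering step only: drop candidate terms that occur in
    # too many documents (low inverse document frequency), keeping rarer, more
    # entity-like candidates.
    import math
    import re
    from collections import Counter

    docs = [
        "The patient was started on metformin for type 2 diabetes.",
        "Warfarin dose was adjusted after the INR result.",
        "The patient denied chest pain and shortness of breath.",
    ]

    def tokenize(text):
        return set(re.findall(r"[a-z]+", text.lower()))

    n_docs = len(docs)
    doc_freq = Counter(t for d in docs for t in tokenize(d))
    idf = {t: math.log(n_docs / df) for t, df in doc_freq.items()}

    # Keep only terms seen in a single document (the maximum IDF in this tiny corpus).
    idf_cutoff = math.log(n_docs) - 1e-9
    candidates = sorted(t for t, v in idf.items() if v >= idf_cutoff and len(t) > 3)
    print(candidates)
    ```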

  19. Practical Text Mining and Statistical Analysis for Non-structured Text Data Applications

    CERN Document Server

    Miner, Gary; Hill, Thomas; Nisbet, Robert; Delen, Dursun

    2012-01-01

    The world contains an unimaginably vast amount of digital information which is getting ever vaster ever more rapidly. This makes it possible to do many things that previously could not be done: spot business trends, prevent diseases, combat crime and so on. Managed well, the textual data can be used to unlock new sources of economic value, provide fresh insights into science and hold governments to account. As the Internet expands and our natural capacity to process the unstructured text that it contains diminishes, the value of text mining for information retrieval and search will increase d

  20. Drug name recognition in biomedical texts: a machine-learning-based method.

    Science.gov (United States)

    He, Linna; Yang, Zhihao; Lin, Hongfei; Li, Yanpeng

    2014-05-01

    Currently, there is an urgent need to develop a technology for extracting drug information automatically from biomedical texts, and drug name recognition is an essential prerequisite for extracting drug information. This article presents a machine-learning-based approach to recognize drug names in biomedical texts. In this approach, a drug name dictionary is first constructed with the external resource of DrugBank and PubMed. Then a semi-supervised learning method, feature coupling generalization, is used to filter this dictionary. Finally, the dictionary look-up and the condition random field method are combined to recognize drug names. Experimental results show that our approach achieves an F-score of 92.54% on the test set of DDIExtraction2011. PMID:24140287

  1. Data Mining Algorithms for Classification of Complex Biomedical Data

    Science.gov (United States)

    Lan, Liang

    2012-01-01

    In my dissertation, I will present my research which contributes to solve the following three open problems from biomedical informatics: (1) Multi-task approaches for microarray classification; (2) Multi-label classification of gene and protein prediction from multi-source biological data; (3) Spatial scan for movement data. In microarray…

  2. Enhancing Text Clustering Using Concept-based Mining Model

    Directory of Open Access Journals (Sweden)

    Lincy Liptha R.

    2012-03-01

    Full Text Available Text mining techniques are mostly based on the statistical analysis of a word or phrase. The statistical analysis of term frequency captures the importance of a term within a document only, but two terms can have the same frequency in the same document while the meaning contributed by one term is more appropriate than the meaning contributed by the other. Hence, terms that capture the semantics of the text should be given more importance. Here, a new concept-based mining model is introduced. It analyses terms at the sentence, document and corpus levels. The model consists of sentence-based concept analysis, which calculates the conceptual term frequency (ctf); document-based concept analysis, which finds the term frequency (tf); corpus-based concept analysis, which determines the document frequency (df); and a concept-based similarity measure. The process of calculating the ctf, tf and df measures in a corpus is carried out by the proposed algorithm, called the Concept-Based Analysis Algorithm. By doing so we cluster web documents in an efficient way, and the quality of the clusters achieved by this model significantly surpasses that of traditional single-term-based approaches.
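
    A small sketch of the three frequency levels is shown below; terms are plain tokens here rather than the model's verb-argument concepts, so the numbers are only illustrative.

    ```python
    # Compute sentence-level (ctf), document-level (tf) and corpus-level (df) counts.
    from collections import Counter

    corpus = [
        "Text mining extracts knowledge. Text mining needs preprocessing.",
        "Clustering groups documents. Text clustering uses term weights.",
    ]

    def tokenize(s):
        return [w.strip(".,").lower() for w in s.split() if w.strip(".,")]

    for d_idx, doc in enumerate(corpus):
        sentences = [s for s in doc.split(".") if s.strip()]
        tf = Counter(tokenize(doc))                                    # document level
        for term in ["text", "mining", "clustering"]:
            ctf = sum(term in tokenize(s) for s in sentences)          # sentence level
            df = sum(term in tokenize(d) for d in corpus)              # corpus level
            print(f"doc {d_idx} term={term!r:12} ctf={ctf} tf={tf[term]} df={df}")
    ```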

  3. World Wide Web platform-independent access to biomedical text/image databases

    Science.gov (United States)

    Long, L. Rodney; Goh, Gin-Hua; Neve, Leif; Thoma, George R.

    1998-07-01

    The biomedical digital library of the future is expected to provide access to stores of biomedical database information containing text and images. Developing efficient methods for accessing such databases is a research effort at the Lister Hill National Center for Biomedical Communications of the National Library of Medicine. In this paper we examine issues in providing access to databases across the Web and describe a tool we have developed: the Web-based Medical Information Retrieval System (WebMIRS). We address a number of critical issues, including preservation of data integrity, efficient database design, access to documentation, quality of query and results interfaces, capability to export results to other software, and exploitation of multimedia data. WebMIRS is implemented as a Java applet that allows database access to text and to associated image data, without requiring any user software beyond a standard Web browser. The applet implementation allows WebMIRS to run on any hardware platform (such as PCs, the Macintosh, or Unix machines) which supports a Java-enabled Web browser, such as Netscape or Internet Explorer. WebMIRS is being tested on text/x-ray image databases created from the National Health and Nutrition Examination Surveys (NHANES) data collected by the National Center for Health Statistics.

  4. Significant Term List Based Metadata Conceptual Mining Model for Effective Text Clustering

    OpenAIRE

    J. Janet; S. Koteeswaran; E. Kannan

    2012-01-01

    As the engineering world grows fast, the usage of data in the day-to-day activities of the engineering industry is also growing rapidly. In order to handle huge data stores and to find the hidden knowledge in them, data mining is very helpful right now. Text mining, network mining, multimedia mining and trend analysis are a few applications of data mining. In text mining, a variety of methods have been proposed by many researchers; even so, high precision and better recall are still a critica...

  5. Mining Sequential Update Summarization with Hierarchical Text Analysis

    Directory of Open Access Journals (Sweden)

    Chunyun Zhang

    2016-01-01

    Full Text Available The outbreak of unexpected news events such as large human accidents or natural disasters brings about a new information access problem where traditional approaches fail. Typically, news of these events is sparse early on and redundant later. Hence, it is very important to get updates and provide individuals with timely and important information about these incidents during their development, especially in wireless and mobile Internet of Things (IoT) settings. In this paper, we define the problem of sequential update summarization extraction and present a new hierarchical update mining system which can broadcast useful, new, and timely sentence-length updates about a developing event. The new system proposes a novel method which incorporates techniques from topic-level and sentence-level summarization. To evaluate the performance of the proposed system, we apply it to the sequential update summarization task of the temporal summarization (TS) track at the Text Retrieval Conference (TREC) 2013 and compute four measurements of the update mining system: expected gain, expected latency gain, comprehensiveness, and latency comprehensiveness. Experimental results show that our proposed method has good performance.

  6. Bio-SCoRes: A Smorgasbord Architecture for Coreference Resolution in Biomedical Text.

    Science.gov (United States)

    Kilicoglu, Halil; Demner-Fushman, Dina

    2016-01-01

    Coreference resolution is one of the fundamental and challenging tasks in natural language processing. Resolving coreference successfully can have a significant positive effect on downstream natural language processing tasks, such as information extraction and question answering. The importance of coreference resolution for biomedical text analysis applications has increasingly been acknowledged. One of the difficulties in coreference resolution stems from the fact that distinct types of coreference (e.g., anaphora, appositive) are expressed with a variety of lexical and syntactic means (e.g., personal pronouns, definite noun phrases), and that resolution of each combination often requires a different approach. In the biomedical domain, it is common for coreference annotation and resolution efforts to focus on specific subcategories of coreference deemed important for the downstream task. In the current work, we aim to address some of these concerns regarding coreference resolution in biomedical text. We propose a general, modular framework underpinned by a smorgasbord architecture (Bio-SCoRes), which incorporates a variety of coreference types, their mentions and allows fine-grained specification of resolution strategies to resolve coreference of distinct coreference type-mention pairs. For development and evaluation, we used a corpus of structured drug labels annotated with fine-grained coreference information. In addition, we evaluated our approach on two other corpora (i2b2/VA discharge summaries and protein coreference dataset) to investigate its generality and ease of adaptation to other biomedical text types. Our results demonstrate the usefulness of our novel smorgasbord architecture. The specific pipelines based on the architecture perform successfully in linking coreferential mention pairs, while we find that recognition of full mention clusters is more challenging. The corpus of structured drug labels (SPL) as well as the components of Bio-SCoRes and

  7. BioC: a minimalist approach to interoperability for biomedical text processing

    OpenAIRE

    Comeau, Donald C; Islamaj Doğan, Rezarta; Ciccarese, Paolo; Cohen, Kevin Bretonnel; Krallinger, Martin; Leitner, Florian; Lu, Zhiyong; Peng, Yifan; Rinaldi, Fabio; Torii, Manabu; Valencia, Alfonso; Verspoor, Karin; Wiegers, Thomas C.; Wu, Cathy H; Wilbur, W John

    2013-01-01

    A vast amount of scientific information is encoded in natural language text, and the quantity of such text has become so great that it is no longer economically feasible to have a human as the first step in the search process. Natural language processing and text mining tools have become essential to facilitate the search for and extraction of information from text. This has led to vigorous research efforts to create useful tools and to create humanly labeled text corpora, which can be used t...

  8. AuDis: an automatic CRF-enhanced disease normalization in biomedical text.

    Science.gov (United States)

    Lee, Hsin-Chun; Hsu, Yi-Yu; Kao, Hung-Yu

    2016-01-01

    Diseases play central roles in many areas of biomedical research and healthcare. Consequently, aggregating disease knowledge and treatment research reports becomes an extremely critical issue, especially in rapidly growing knowledge bases (e.g. PubMed). We therefore developed a system, AuDis, for disease mention recognition and normalization in biomedical texts. Our system utilizes an order-two conditional random fields model. To optimize the results, we customize several post-processing steps, including abbreviation resolution, consistency improvement and stopword filtering. In the official evaluation of the CDR task at BioCreative V, AuDis obtained the best performance (F-score of 86.46%) among 40 runs (16 unique teams) on the disease normalization (DNER) subtask. These results suggest that AuDis is a high-performance system for disease recognition and normalization in the biomedical literature. Database URL: http://ikmlab.csie.ncku.edu.tw/CDR2015/AuDis.html. PMID:27278815

  9. EnvMine: A text-mining system for the automatic extraction of contextual information

    Directory of Open Access Journals (Sweden)

    de Lorenzo Victor

    2010-06-01

    Full Text Available Abstract Background For ecological studies, it is crucial to have adequate descriptions of the environments and samples being studied. Such a description must be done in terms of their physicochemical characteristics, allowing a direct comparison between different environments that would be difficult to do otherwise. Also the characterization must include the precise geographical location, to make possible the study of geographical distributions and biogeographical patterns. Currently, there is no schema for annotating these environmental features, and these data have to be extracted from textual sources (published articles). So far, this had to be performed by manual inspection of the corresponding documents. To facilitate this task, we have developed EnvMine, a set of text-mining tools devoted to retrieving contextual information (physicochemical variables and geographical locations) from textual sources of any kind. Results EnvMine is capable of retrieving the physicochemical variables cited in the text by means of the accurate identification of their associated units of measurement. In this task, the system achieves a recall (percentage of items retrieved) of 92% with less than 1% error. Also a Bayesian classifier was tested for distinguishing parts of the text describing environmental characteristics from others dealing with, for instance, experimental settings. Regarding the identification of geographical locations, the system takes advantage of existing databases such as GeoNames to achieve 86% recall with 92% precision. The identification of a location also includes the determination of its exact coordinates (latitude and longitude), thus allowing the calculation of the distance between individual locations. Conclusion EnvMine is a very efficient method for extracting contextual information from different text sources, like published articles or web pages. This tool can help in determining the precise location and physicochemical
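
    The unit-anchored extraction of physicochemical variables can be sketched with a regular expression, as below; the unit list is a small assumed sample, not EnvMine's grammar.

    ```python
    # Toy sketch of unit-anchored extraction: find numeric values followed by a small,
    # assumed list of physicochemical units.
    import re

    text = ("Samples were taken at 35 m depth; water temperature was 4.2 °C, "
            "pH 7.8, and salinity 35 psu near 41.40°N, 2.17°E.")

    units = r"(?:°C|m|psu|mM|mg/L)"
    pattern = re.compile(rf"(\d+(?:\.\d+)?)\s*({units})\b")

    for value, unit in pattern.findall(text):
        print(f"value={value} unit={unit}")
    ```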

  10. Mining consumer health vocabulary from community-generated text.

    Science.gov (United States)

    Vydiswaran, V G Vinod; Mei, Qiaozhu; Hanauer, David A; Zheng, Kai

    2014-01-01

    Community-generated text corpora can be a valuable resource for extracting consumer health vocabulary (CHV) and linking it to professional terminologies and alternative variants. In this research, we propose a pattern-based text-mining approach to identify pairs of CHV and professional terms from Wikipedia, a large text corpus created and maintained by the community. A novel measure, leveraging the ratio of frequency of occurrence, was used to differentiate consumer terms from professional terms. We empirically evaluated the applicability of this approach using a large data sample consisting of MEDLINE abstracts and all posts from an online health forum, MedHelp. The results show that the proposed approach is able to identify synonymous pairs and label each term as either a consumer or a professional term with high accuracy. We conclude that the proposed approach has great potential to produce a high-quality CHV to improve the performance of computational applications in processing consumer-generated health text. PMID:25954426
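
    The snippet below is a hedged sketch of the general idea behind the frequency-ratio measure mentioned above, not the authors' exact formula: it compares how often a term occurs in a consumer corpus (e.g. forum posts) versus a professional corpus (e.g. MEDLINE abstracts) and treats a ratio above 1 as evidence for a consumer term; the two tiny corpora are invented.

```python
# Toy consumer-vs-professional term scoring by relative frequency.
from collections import Counter

consumer_corpus = ["my sugar levels are high", "high blood sugar runs in the family"]
professional_corpus = ["hyperglycemia was observed", "patients presented with hyperglycemia"]

def relative_freq(corpus):
    counts = Counter()
    for doc in corpus:
        counts.update(doc.lower().split())
    total = sum(counts.values())
    return {term: n / total for term, n in counts.items()}

cf, pf = relative_freq(consumer_corpus), relative_freq(professional_corpus)

def consumer_ratio(term, eps=1e-6):
    # > 1 suggests a consumer term, < 1 a professional term
    return (cf.get(term, 0) + eps) / (pf.get(term, 0) + eps)

for term in ("sugar", "hyperglycemia"):
    print(term, round(consumer_ratio(term), 2))
```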

  11. On Utilization and Importance of Perl Status Reporter (SRr) in Text Mining

    OpenAIRE

    Sugam Sharma; Tzusheng Pei; Hari Cohly

    2010-01-01

    In bioinformatics, text mining and text data mining, terms sometimes used interchangeably, denote the process of deriving high-quality information from text. Perl Status Reporter (SRr) [1] is a tool for fetching data from a flat text file, and in this research paper we illustrate the use of SRr in text/data mining. SRr needs a flat text input file on which the mining process is to be performed. SRr reads the input file and derives the high-quality information from it. Typical text mining tasks are text categorization, te...

  12. Processing the Text of the Holy Quran: a Text Mining Study

    Directory of Open Access Journals (Sweden)

    Mohammad Alhawarat

    2015-02-01

    The Holy Quran is the reference book for more than 1.6 billion Muslims around the world. Extracting information and knowledge from the Holy Quran is of high benefit both for people specialized in Islamic studies and for non-specialized people. This paper initiates a series of research studies that aim to serve the Holy Quran and provide helpful and accurate information and knowledge to all human beings. The planned research studies also aim to lay out a framework that will be used by researchers in the field of Arabic natural language processing by providing a "Golden Dataset" along with useful techniques and information that will advance this field further. The aim of this paper is to find an approach for analyzing Arabic text and then providing statistical information which might be helpful for people in this research area. In this paper the Holy Quran text is preprocessed and then different text mining operations are applied to it to reveal simple facts about the terms of the Holy Quran. The results show a variety of characteristics of the Holy Quran such as its most important words, its word cloud, and the chapters with the highest term frequencies. All these results are based on term frequencies that are calculated using both the Term Frequency (TF) and Term Frequency-Inverse Document Frequency (TF-IDF) methods.
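
    To make the two weighting schemes named above concrete, the short example below computes TF and TF-IDF weights with scikit-learn over three invented toy "chapters"; it only illustrates the mechanics, not the paper's actual preprocessing of the Arabic text.

```python
# TF vs TF-IDF weighting on a toy corpus.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

chapters = [
    "praise mercy guidance",
    "mercy mercy patience",
    "guidance patience charity charity",
]

tf = CountVectorizer()
tf_matrix = tf.fit_transform(chapters)          # raw term frequencies
tfidf = TfidfVectorizer()
tfidf_matrix = tfidf.fit_transform(chapters)    # frequencies down-weighted by document frequency

print(dict(zip(tf.get_feature_names_out(), tf_matrix.toarray().sum(axis=0))))
print(dict(zip(tfidf.get_feature_names_out(), tfidf_matrix.toarray()[0].round(2))))
```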

  13. A novel procedure on next generation sequencing data analysis using text mining algorithm

    OpenAIRE

    Zhao, Weizhong; Chen, James J.; Perkins, Roger; Wang, Yuping; Liu, Zhichao; Hong, Huixiao; Tong, Weida; ZOU, WEN

    2016-01-01

    Background Next-generation sequencing (NGS) technologies have provided researchers with vast possibilities in various biological and biomedical research areas. Efficient data mining strategies are in high demand for large-scale comparative and evolutionary studies to be performed on the large amounts of data derived from NGS projects. Topic modeling is an active research field in machine learning and has been mainly used as an analytical tool to structure large textual corpora for data mining....
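
    As a small illustration of the topic-modeling step mentioned above (a sketch only, not the authors' pipeline), the code below fits a two-topic latent Dirichlet allocation model with scikit-learn on a few invented NGS-flavoured documents and prints the top words per topic.

```python
# Toy LDA topic model over invented documents.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "genome assembly read alignment coverage",
    "rna expression transcript abundance",
    "variant calling snp genotype coverage",
    "transcript splicing expression isoform",
]
vec = CountVectorizer()
X = vec.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(X)               # per-document topic mixtures
terms = vec.get_feature_names_out()
for k, weights in enumerate(lda.components_):
    top_terms = [terms[i] for i in weights.argsort()[-3:][::-1]]
    print(f"topic {k}:", top_terms)
```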

  14. Annotated chemical patent corpus: a gold standard for text mining.

    Directory of Open Access Journals (Sweden)

    Saber A Akhondi

    Exploring the chemical and biological space covered by patent applications is crucial in early-stage medicinal chemistry activities. Patent analysis can provide understanding of compound prior art, novelty checking, validation of biological assays, and identification of new starting points for chemical exploration. Extracting chemical and biological entities from patents through manual extraction by expert curators can take a substantial amount of time and resources. Text mining methods can help to ease this process. To validate the performance of such methods, a manually annotated patent corpus is essential. In this study we have produced a large gold standard chemical patent corpus. We developed annotation guidelines and selected 200 full patents from the World Intellectual Property Organization, United States Patent and Trademark Office, and European Patent Office. The patents were pre-annotated automatically and made available to four independent annotator groups each consisting of two to ten annotators. The annotators marked chemicals in different subclasses, diseases, targets, and modes of action. Spelling mistakes and spurious line breaks due to optical character recognition errors were also annotated. A subset of 47 patents was annotated by at least three annotator groups, from which harmonized annotations and inter-annotator agreement scores were derived. One group annotated the full set. The patent corpus includes 400,125 annotations for the full set and 36,537 annotations for the harmonized set. All patents and annotated entities are publicly available at www.biosemantics.org.

  15. Supporting the education evidence portal via text mining.

    Science.gov (United States)

    Ananiadou, Sophia; Thompson, Paul; Thomas, James; Mu, Tingting; Oliver, Sandy; Rickinson, Mark; Sasaki, Yutaka; Weissenbacher, Davy; McNaught, John

    2010-08-28

    The UK Education Evidence Portal (eep) provides a single, searchable, point of access to the contents of the websites of 33 organizations relating to education, with the aim of revolutionizing work practices for the education community. Use of the portal alleviates the need to spend time searching multiple resources to find relevant information. However, the combined content of the websites of interest is still very large (over 500,000 documents and growing). This means that searches using the portal can produce very large numbers of hits. As users often have limited time, they would benefit from enhanced methods of performing searches and viewing results, allowing them to drill down to information of interest more efficiently, without having to sift through potentially long lists of irrelevant documents. The Joint Information Systems Committee (JISC)-funded ASSIST project has produced a prototype web interface to demonstrate the applicability of integrating a number of text-mining tools and methods into the eep, to facilitate an enhanced searching, browsing and document-viewing experience. New features include automatic classification of documents according to a taxonomy, automatic clustering of search results according to similar document content, and automatic identification and highlighting of key terms within documents. PMID:20643679

  16. A feature representation method for biomedical scientific data based on composite text description

    Institute of Scientific and Technical Information of China (English)

    Sun, Wei

    2009-01-01

    Feature representation is one of the key issues in data clustering. The existing feature representation of scientific data is not sufficient, which to some extent affects the result of scientific data clustering. Therefore, the paper proposes the concept of a composite text description (CTD) and a CTD-based feature representation method for biomedical scientific data. The method mainly uses different feature weighting algorithms to represent candidate features based on two types of data sources respectively, then combines and finally strengthens the two feature sets. Experiments show that the proposed feature representation method is more effective than traditional methods and can significantly improve the performance of biomedical data clustering.

  17. TEXT MINING AND CLASSIFICATION OF PRODUCT REVIEWS USING STRUCTURED SUPPORT VECTOR MACHINE

    OpenAIRE

    Jincy B. Chrystal; Stephy Joseph

    2015-01-01

    Text mining and text classification are two prominent and challenging tasks in the field of machine learning. Text mining refers to the process of deriving high-quality and relevant information from text, while text classification deals with the categorization of text documents into different classes. The real challenge in these areas is to address problems like handling large text corpora, similarity of words in text documents, and association of text documents with a ...

  18. Generation of silver standard concept annotations from biomedical texts with special relevance to phenotypes.

    Directory of Open Access Journals (Sweden)

    Anika Oellrich

    Electronic health records and scientific articles possess differing linguistic characteristics that may impact the performance of natural language processing tools developed for one or the other. In this paper, we investigate the performance of four extant concept recognition tools: the clinical Text Analysis and Knowledge Extraction System (cTAKES), the National Center for Biomedical Ontology (NCBO) Annotator, the Biomedical Concept Annotation System (BeCAS) and MetaMap. Each of the four concept recognition systems is applied to four different corpora: the i2b2 corpus of clinical documents, a PubMed corpus of Medline abstracts, a clinical trials corpus and the ShARe/CLEF corpus. In addition, we assess the individual system performances with respect to one gold standard annotation set, available for the ShARe/CLEF corpus. Furthermore, we built a silver standard annotation set from the individual systems' output and assess its quality as well as the contribution of individual systems to the quality of the silver standard. Our results demonstrate that mainly the NCBO Annotator and cTAKES contribute to the silver standard corpora (F1-measures in the range of 21% to 74%) and their quality (best F1-measure of 33%), independently of the type of text investigated. While BeCAS and MetaMap can contribute to the precision of silver standard annotations (precision of up to 42%), the F1-measure drops when they are combined with the NCBO Annotator and cTAKES due to a low recall. In conclusion, the performances of individual systems need to be improved independently of the text types, and the leveraging strategies to best take advantage of individual systems' annotations need to be revised. The textual content of the PubMed corpus, accession numbers for the clinical trials corpus, and assigned annotations of the four concept recognition systems as well as the generated silver standard annotation sets are available from http://purl.org/phenotype/resources. The textual content …

  19. An integrated text mining framework for metabolic interaction network reconstruction.

    Science.gov (United States)

    Patumcharoenpol, Preecha; Doungpan, Narumol; Meechai, Asawin; Shen, Bairong; Chan, Jonathan H; Vongsangnak, Wanwipa

    2016-01-01

    Text mining (TM) in the field of biology is fast becoming a routine analysis for the extraction and curation of biological entities (e.g., genes, proteins, simple chemicals) as well as their relationships. Due to the wide applicability of TM in situations involving complex relationships, it is valuable to apply TM to the extraction of metabolic interactions (i.e., enzyme and metabolite interactions) through metabolic events. Here we present an integrated TM framework containing two modules, one for the extraction of metabolic events (Metabolic Event Extraction module, MEE) and one for the construction of a metabolic interaction network (Metabolic Interaction Network Reconstruction module, MINR). The proposed integrated TM framework performed well based on standard measures of recall, precision and F-score. Evaluation of the MEE module using the constructed Metabolic Entities (ME) corpus yielded F-scores of 59.15% and 48.59% for the detection of metabolic events for production and consumption, respectively. As for the testing of the entity tagger for Gene and Protein (GP) and metabolite on the test corpus, the obtained F-score was greater than 80% for the Superpathway of leucine, valine, and isoleucine biosynthesis. Mapping of enzyme and metabolite interactions through network reconstruction showed a fair performance for the MINR module on the test corpus, with an F-score >70%. Finally, an application of our integrated TM framework to large-scale data (i.e., EcoCyc extraction data) for reconstructing a metabolic interaction network showed reasonable precisions of 69.93%, 70.63% and 46.71% for enzyme, metabolite and enzyme-metabolite interactions, respectively. This study presents the first open-source integrated TM framework for reconstructing a metabolic interaction network. This framework can be a powerful tool that helps biologists to extract metabolic events for further reconstruction of a metabolic interaction network. The ME corpus, test corpus, source code, and virtual …

  20. Human-centered text mining: a new software system

    NARCIS (Netherlands)

    J. Poelmans; P. Elzinga; A.A. Neznanov; G. Dedene; S. Viaene; S. Kuznetsov

    2012-01-01

    In this paper we introduce a novel human-centered data mining software system which was designed to gain intelligence from unstructured textual data. The architecture takes its roots in several case studies which were a collaboration between the Amsterdam-Amstelland Police, GasthuisZusters Antwerpen

  1. Mining Texts in Reading to Write. Occasional Paper No. 29.

    Science.gov (United States)

    Greene, Stuart

    Reading and writing are commonly seen as parallel processes of composing meaning, employing similar cognitive and linguistic strategies. Research has begun to examine ways in which knowledge of content and strategies contribute to the construction of meaning in reading and writing. The metaphor of mining can provide a useful and descriptive means…

  2. Generation of silver standard concept annotations from biomedical texts with special relevance to phenotypes.

    Science.gov (United States)

    Oellrich, Anika; Collier, Nigel; Smedley, Damian; Groza, Tudor

    2015-01-01

    Electronic health records and scientific articles possess differing linguistic characteristics that may impact the performance of natural language processing tools developed for one or the other. In this paper, we investigate the performance of four extant concept recognition tools: the clinical Text Analysis and Knowledge Extraction System (cTAKES), the National Center for Biomedical Ontology (NCBO) Annotator, the Biomedical Concept Annotation System (BeCAS) and MetaMap. Each of the four concept recognition systems is applied to four different corpora: the i2b2 corpus of clinical documents, a PubMed corpus of Medline abstracts, a clinical trials corpus and the ShARe/CLEF corpus. In addition, we assess the individual system performances with respect to one gold standard annotation set, available for the ShARe/CLEF corpus. Furthermore, we built a silver standard annotation set from the individual systems' output and assess its quality as well as the contribution of individual systems to the quality of the silver standard. Our results demonstrate that mainly the NCBO Annotator and cTAKES contribute to the silver standard corpora (F1-measures in the range of 21% to 74%) and their quality (best F1-measure of 33%), independently of the type of text investigated. While BeCAS and MetaMap can contribute to the precision of silver standard annotations (precision of up to 42%), the F1-measure drops when they are combined with the NCBO Annotator and cTAKES due to a low recall. In conclusion, the performances of individual systems need to be improved independently of the text types, and the leveraging strategies to best take advantage of individual systems' annotations need to be revised. The textual content of the PubMed corpus, accession numbers for the clinical trials corpus, and assigned annotations of the four concept recognition systems as well as the generated silver standard annotation sets are available from http://purl.org/phenotype/resources. The textual content of the Sh…

  3. An Intelligent Agent Based Text-Mining System: Presenting Concept through Design Approach

    OpenAIRE

    Kaustubh S. Raval; Ranjeetsingh S.Suryawanshi; Devendra M. Thakore

    2011-01-01

    Text mining is a variation on a field called data mining and refers to the process of deriving high-quality information from unstructured text. In text mining the goal is to discover unknown information, something that may not be known by people. The aim here is to design an intelligent agent-based text-mining system which reads the text (input) and, based on the keyword, provides the matching documents (in the form of links) or options (statements) according to the user's query. In this ...

  4. Significant Term List Based Metadata Conceptual Mining Model for Effective Text Clustering

    Directory of Open Access Journals (Sweden)

    J. Janet

    2012-01-01

    As the engineering world grows fast, the usage of data in the day-to-day activity of the engineering industry is also growing rapidly. Data mining is very helpful for handling huge data stores and finding the hidden knowledge in them. Text mining, network mining, multimedia mining and trend analysis are a few applications of data mining. In text mining, a variety of methods have been proposed by many researchers, yet high precision and good recall are still critical issues. In this study, the focus is on text mining, and a conceptual mining model is applied for improved clustering. The proposed work, termed the Metadata Conceptual Mining Model (MCMM), is validated with a few world-leading technical digital library data sets such as IEEE, ACM and Scopus. The performance, derived as precision and recall, is described in terms of entropy and F-measure, which are calculated and compared with an existing term-based model and a concept-based mining model.
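
    For readers unfamiliar with the quality measures named above, the toy calculation below shows how cluster entropy and the F-measure are computed; the labels, cluster assignments and counts are invented and do not come from the study.

```python
# Toy computation of clustering entropy and F-measure.
import math
from collections import Counter

gold = ["ieee", "ieee", "acm", "acm", "scopus", "scopus"]   # true classes
clusters = [0, 0, 0, 1, 1, 1]                               # predicted clusters

def cluster_entropy(gold, clusters):
    total = len(gold)
    entropy = 0.0
    for c in set(clusters):
        members = [g for g, k in zip(gold, clusters) if k == c]
        class_counts = Counter(members)
        h = -sum((n / len(members)) * math.log2(n / len(members))
                 for n in class_counts.values())
        entropy += (len(members) / total) * h
    return entropy

def f_measure(tp, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

print("entropy:", round(cluster_entropy(gold, clusters), 3))
print("F-measure:", round(f_measure(tp=8, fp=2, fn=4), 3))
```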

  5. Text mining for metabolic reaction extraction from scientific literature

    OpenAIRE

    Risse, J.E.

    2014-01-01

    Science relies on data in all its different forms. In molecular biology and bioinformatics in particular large scale data generation has taken centre stage in the form of high-throughput experiments. In line with this exponential increase of experimental data has been the near exponential growth of scientific publications. Yet where classical data mining techniques are still capable of coping with this deluge in structured data (Chapter 2), access of information found in scientific literature...

  6. Text-mining and information-retrieval services for molecular biology

    OpenAIRE

    Krallinger, Martin; Valencia, Alfonso

    2005-01-01

    Text-mining in molecular biology - defined as the automatic extraction of information about genes, proteins and their functional relationships from text documents - has emerged as a hybrid discipline on the edges of the fields of information science, bioinformatics and computational linguistics. A range of text-mining applications have been developed recently that will improve access to knowledge for biologists and database annotators.

  7. Enhancing navigation in biomedical databases by community voting and database-driven text classification

    Directory of Open Access Journals (Sweden)

    Guettler Daniel

    2009-10-01

    community, is completely controlled by the database, scales well with concurrent change events, and can be adapted to add text classification capability to other biomedical databases. The system can be accessed at http://pepbank.mgh.harvard.edu.

  8. A Consistent Web Documents Based Text Clustering Using Concept Based Mining Model

    OpenAIRE

    V.M.Navaneethakumar; C Chandrasekar

    2012-01-01

    Text mining is a growing, innovative field that endeavors to gather significant information from natural language text. It might be loosely described as the process of examining texts to extract information that is practical for particular purposes. In this case, the mining model can capture terms that identify the concepts of the sentence or document, which tends to detect the subject of the document. In existing work, the concept-based mining model is used only for n...

  9. Text and Web Mining Approaches in Order to Build Specialized Ontologies

    OpenAIRE

    Roche, Mathieu; Kodratoff, Yves

    2009-01-01

    This paper presents a text-mining approach in order to extract candidate terms from a corpus. The relevant candidates are selected using a web-mining approach. The terms (i.e. relevant candidate terms) we find are the instances of specialized ontologies built during this process. The experiments are based on real data – Human Resources corpus – and they show the quality of our text and web mining approaches.

  10. An overview of the biocreative 2012 workshop track III: Interactive text mining task

    Science.gov (United States)

    An important question is how to make use of text mining to enhance the biocuration workflow. A number of groups have developed tools for text mining from a computer science/linguistics perspective and there are many initiatives to curate some aspect of biology from the literature. In some cases the ...

  11. A Formal Framework on the Semantics of Regulatory Relations and Their Presence as Verbs in Biomedical Texts

    DEFF Research Database (Denmark)

    Zambach, Sine

    2009-01-01

    logical properties of positive and negative regulations, both as formal relations and the frequency of their usage as verbs in texts. The paper discusses whether there exists a weak transitivity-like property for the relations. Our corpora consist of biomedical patents, Medline abstracts and the British...

  12. Benchmarking of the 2010 BioCreative Challenge III text-mining competition by the BioGRID and MINT interaction databases

    OpenAIRE

    Krallinger, Martin; Vazquez, Miguel; Leitner, Florian; Salgado, David; Chatr-aryamontri, Andrew; Winter, Andrew; Perfetto, Livia; Briganti, Leonardo; Licata, Luana; Iannuccelli, Marta; Castagnoli, Luisa; Cesareni, Gianni; Tyers, Mike; Schneider, Gerold; Rinaldi, Fabio

    2011-01-01

    Abstract Background Determining usefulness of biomedical text mining systems requires realistic task definition and data selection criteria without artificial constraints, measuring performance aspects that go beyond traditional metrics. The BioCreative III Protein-Protein Interaction (PPI) tasks were motivated by such considerations, trying to address aspects including how the end user would oversee the generated output, for instance by providing ranked results, textual evidence fo...

  13. PolySearch2: a significantly improved text-mining system for discovering associations between human diseases, genes, drugs, metabolites, toxins and more

    OpenAIRE

    Liu, Yifeng; Liang, Yongjie; Wishart, David

    2015-01-01

    PolySearch2 (http://polysearch.ca) is an online text-mining system for identifying relationships between biomedical entities such as human diseases, genes, SNPs, proteins, drugs, metabolites, toxins, metabolic pathways, organs, tissues, subcellular organelles, positive health effects, negative health effects, drug actions, Gene Ontology terms, MeSH terms, ICD-10 medical codes, biological taxonomies and chemical taxonomies. PolySearch2 supports a generalized ‘Given X, find all associated Ys’ q...

  14. Feature Engineering for Drug Name Recognition in Biomedical Texts: Feature Conjunction and Feature Selection

    Directory of Open Access Journals (Sweden)

    Shengyu Liu

    2015-01-01

    Drug name recognition (DNR) is a critical step for drug information extraction. Machine learning-based methods have been widely used for DNR with various types of features such as part-of-speech, word shape, and dictionary features. Features used in current machine learning-based methods are usually singleton features, partly because combining singleton features into conjunction features can lead to an explosion of features and a large number of noisy features. However, singleton features that can only capture one linguistic characteristic of a word are not sufficient to describe the information needed for DNR when multiple characteristics should be considered. In this study, we explore feature conjunction and feature selection for DNR, which have never been reported. We intuitively select 8 types of singleton features and combine them into conjunction features in two ways. Then, chi-square, mutual information, and information gain are used to mine effective features. Experimental results show that feature conjunction and feature selection can improve the performance of the DNR system with a moderate number of features, and our DNR system significantly outperforms the best system in the DDIExtraction 2013 challenge.
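
    The sketch below illustrates the two ideas described above in miniature, without reproducing the authors' feature set: it builds a simple conjunction feature (previous word plus current word) alongside singleton features and then keeps the highest-scoring features with a chi-square test via scikit-learn; the labelled tokens are invented.

```python
# Toy feature conjunction + chi-square feature selection for token classification.
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_selection import SelectKBest, chi2

def features(tokens, i):
    f = {
        "word": tokens[i].lower(),                        # singleton feature
        "shape": "Xx" if tokens[i].istitle() else "x",    # singleton feature
    }
    prev = tokens[i - 1].lower() if i > 0 else "<BOS>"
    f["prev|word"] = prev + "|" + tokens[i].lower()       # conjunction feature
    return f

labelled = [("Aspirin", 1), ("reduces", 0), ("fever", 0), ("Ibuprofen", 1), ("was", 0)]
tokens = [w for w, _ in labelled]
X_dicts = [features(tokens, i) for i in range(len(tokens))]
y = [label for _, label in labelled]

X = DictVectorizer().fit_transform(X_dicts)
X_selected = SelectKBest(chi2, k=5).fit_transform(X, y)
print(X_selected.shape)   # only the 5 highest-scoring features are kept
```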

  15. Integration of Text- and Data-Mining Technologies for Use in Banking Applications

    Science.gov (United States)

    Maslankowski, Jacek

    Unstructured data, most of it in the form of text files, typically accounts for 85% of an organization's knowledge stores, but it is not always easy to find, access, analyze or use (Robb 2004). That is why it is important to use solutions based on text and data mining, a combination known as duo mining. This helps improve management based on the knowledge an organization owns. The results are interesting. Data mining deals with structured data, usually sourced from data warehouses. Text mining, sometimes called web mining, looks for patterns in unstructured data: memos, documents and the web. Integrating text-based information with structured data enriches predictive modeling capabilities and provides new stores of insightful and valuable information for driving business and research initiatives forward.

  16. BICEPP: an example-based statistical text mining method for predicting the binary characteristics of drugs

    Directory of Open Access Journals (Sweden)

    Tsafnat Guy

    2011-04-01

    Background The identification of drug characteristics is a clinically important task, but it requires much expert knowledge and consumes substantial resources. We have developed a statistical text-mining approach (BInary Characteristics Extractor and biomedical Properties Predictor: BICEPP) to help experts screen drugs that may have important clinical characteristics of interest. Results BICEPP first retrieves MEDLINE abstracts containing drug names, then selects tokens that best predict the list of drugs which represents the characteristic of interest. Machine learning is then used to classify drugs using a document frequency-based measure. Evaluation experiments were performed to validate BICEPP's performance on 484 characteristics of 857 drugs, identified from the Australian Medicines Handbook (AMH) and the PharmacoKinetic Interaction Screening (PKIS) database. Stratified cross-validations revealed that BICEPP was able to classify drugs into all 20 major therapeutic classes (100%) and 157 (of 197) minor drug classes (80%) with areas under the receiver operating characteristic curve (AUC) > 0.80. Similarly, AUC > 0.80 could be obtained in the classification of 173 (of 238) adverse events (73%), up to 12 (of 15) groups of clinically significant cytochrome P450 enzyme (CYP) inducers or inhibitors (80%), and up to 11 (of 14) groups of narrow therapeutic index drugs (79%). Interestingly, it was observed that the keywords used to describe a drug characteristic were not necessarily the most predictive ones for the classification task. Conclusions BICEPP has sufficient classification power to automatically distinguish a wide range of clinical properties of drugs. This may be used in pharmacovigilance applications to assist with rapid screening of large drug databases to identify important characteristics for further evaluation.
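
    A hedged, minimal sketch of the classification setup described above (not BICEPP itself): each drug is represented by binary token occurrences pooled over the abstracts that mention it, a classifier is trained for one invented binary characteristic, and the AUC is reported with scikit-learn. All data are invented.

```python
# Toy drug-characteristic classifier evaluated with AUC.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

drug_abstracts = {                       # token streams pooled per drug (invented)
    "drugA": "sedation drowsiness sedation reported",
    "drugB": "rash nausea mild nausea",
    "drugC": "drowsiness sedation dizziness",
    "drugD": "headache nausea rash",
}
has_characteristic = [1, 0, 1, 0]        # e.g. "causes sedation"

vectorizer = CountVectorizer(binary=True)      # binary occurrence features
X = vectorizer.fit_transform(drug_abstracts.values())
clf = LogisticRegression().fit(X, has_characteristic)
scores = clf.predict_proba(X)[:, 1]
print("AUC (on training data, for illustration only):",
      roc_auc_score(has_characteristic, scores))
```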

  17. Ontology-based retrieval of bio-medical information based on microarray text corpora

    DEFF Research Database (Denmark)

    Hansen, Kim Allan; Zambach, Sine; Have, Christian Theil

    Microarray technology is often used in gene expression experiments. Information retrieval in the context of microarrays has mainly been concerned with the analysis of the numeric data produced; however, the experiments are often annotated with textual metadata. Although biomedical resources…

  18. Signal Detection Framework Using Semantic Text Mining Techniques

    Science.gov (United States)

    Sudarsan, Sithu D.

    2009-01-01

    Signal detection is a challenging task for regulatory and intelligence agencies. Subject matter experts in those agencies analyze documents, generally containing narrative text, in a time-bound manner for signals, by identification, evaluation and confirmation, leading to follow-up action, e.g., recalling a defective product or issuing a public advisory for…

  19. On Utilization and Importance of Perl Status Reporter (SRr) in Text Mining

    CERN Document Server

    Sharma, Sugam; Cohly, Hari

    2010-01-01

    In bioinformatics, text mining and text data mining, terms sometimes used interchangeably, denote the process of deriving high-quality information from text. Perl Status Reporter (SRr) is a tool for fetching data from a flat text file, and in this research paper we illustrate the use of SRr in text or data mining. SRr needs a flat text input file on which the mining process is to be performed. SRr reads the input file and derives the high-quality information from it. Typical text mining tasks are text categorization, text clustering, concept and entity extraction, and document summarization. SRr can be utilized for any of these tasks with little or no customization effort. In our implementation we perform a text categorization mining operation on the input file. The input file has two parameters of interest (firstKey and secondKey). The composition of these two parameters describes the uniqueness of entries in that file, in a similar manner to a composite key in a database. SRr reads the input file line by line and extracts the parameter...

  20. SOME APPROACHES TO TEXT MINING AND THEIR POTENTIAL FOR SEMANTIC WEB APPLICATIONS

    OpenAIRE

    Jan Paralič; Marek Paralič

    2007-01-01

    In this paper we describe some approaches to text mining, which are supported by an original software system developed in Java for the support of information retrieval and text mining (JBowl), as well as its possible use in a distributed environment. The JBowl system is being developed as open source software with the intention to provide an easily extensible, modular framework for pre-processing, indexing and further exploration of large text collections. The overall architecture of the syst...

  1. OntoPDF: using a text mining pipeline to generate enriched PDF versions of scientific papers

    OpenAIRE

    Zhu, Yi; Rinaldi, Fabio

    2014-01-01

    In this poster we present a recent extension of the OntoGene text mining utilities, which enables the generation of annotated pdf versions of the original articles. While a text-based view (in XML or HTML) can allow a more flexible presentation of the results of a text mining pipeline, for some applications, notably in assisted curation, it might be desirable to present the annotations in the context of the original pdf document.

  2. Data Mining of Causal Relations from Text: Analysing Maritime Accident Investigation Reports

    OpenAIRE

    Tirunagari, Santosh

    2015-01-01

    Text mining is a process of extracting information of interest from text. Such a method includes techniques from various areas such as Information Retrieval (IR), Natural Language Processing (NLP), and Information Extraction (IE). In this study, text mining methods are applied to extract causal relations from maritime accident investigation reports collected from the Marine Accident Investigation Branch (MAIB). These causal relations provide information on various mechanisms behind accidents,...

  3. A Survey of Topic Modeling in Text Mining

    Directory of Open Access Journals (Sweden)

    Rubayyi Alghamdi

    2015-01-01

    Topic models provide a convenient way to analyze large volumes of unclassified text. A topic contains a cluster of words that frequently occur together. Topic modeling can connect words with similar meanings and distinguish between uses of words with multiple meanings. This paper covers two categories within the field of topic modeling. The first concerns methods of topic modeling, of which four are considered: latent semantic analysis (LSA), probabilistic latent semantic analysis (PLSA), latent Dirichlet allocation (LDA), and the correlated topic model (CTM). The second category is called topic evolution models, which model topics by considering an important factor: time. In the second category, different models are discussed, such as topics over time (TOT), dynamic topic models (DTM), multiscale topic tomography, dynamic topic correlation detection, detecting topic evolution in scientific literature, etc.
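
    As a concrete taste of one of the surveyed methods (latent semantic analysis), the snippet below applies truncated SVD to a TF-IDF matrix with scikit-learn and projects a few invented documents into a two-dimensional latent topic space; it is a sketch of LSA only, not of PLSA, LDA or CTM.

```python
# Toy latent semantic analysis via truncated SVD.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = [
    "topic models cluster words that occur together",
    "dynamic topic models add a time dimension",
    "singular value decomposition reveals latent structure",
    "latent structure in word co-occurrence supports clustering",
]
X = TfidfVectorizer(stop_words="english").fit_transform(docs)
lsa = TruncatedSVD(n_components=2, random_state=0)
doc_coords = lsa.fit_transform(X)   # documents in a 2-dimensional latent space
print(doc_coords.round(2))
```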

  4. Using Text Mining to Uncover Students' Technology-Related Problems in Live Video Streaming

    Science.gov (United States)

    Abdous, M'hammed; He, Wu

    2011-01-01

    Because of their capacity to sift through large amounts of data, text mining and data mining are enabling higher education institutions to reveal valuable patterns in students' learning behaviours without having to resort to traditional survey methods. In an effort to uncover live video streaming (LVS) students' technology-related problems and to…

  5. An Evaluation of Text Mining Tools as Applied to Selected Scientific and Engineering Literature.

    Science.gov (United States)

    Trybula, Walter J.; Wyllys, Ronald E.

    2000-01-01

    Addresses an approach to the discovery of scientific knowledge through an examination of data mining and text mining techniques. Presents the results of experiments that investigated knowledge acquisition from a selected set of technical documents by domain experts. (Contains 15 references.) (Author/LRW)

  6. Fast max-margin clustering for unsupervised word sense disambiguation in biomedical texts

    OpenAIRE

    Duan, Weisi; Song, Min; Yates, Alexander

    2009-01-01

    Background We aim to solve the problem of determining word senses for ambiguous biomedical terms with minimal human effort. Methods We build a fully automated system for Word Sense Disambiguation by designing a system that does not require manually-constructed external resources or manually-labeled training examples except for a single ambiguous word. The system uses a novel and efficient graph-based algorithm to cluster words into groups that have the same meaning. Our algorithm follows the ...

  7. Fuzzy Classification of Web Reports with Linguistic Text Mining

    Czech Academy of Sciences Publication Activity Database

    Dědek, Jan; Vojtáš, Peter

    Vol. 3. Los Alamitos: IEEE Computer Society, 2009 - (Boldi, P.; Vizzari, G.; Pasi, G.; Baeza-Yates, R.), s. 167-170 ISBN 978-0-7695-3801-3. [WI-IAT 2009 Workshops. IEEE/WIC/ACM 2009 International Conference on Web Intelligence and Intelligent Agent Technology. Milan (IT), 15.09.2009-18.09.2009] R&D Projects: GA AV ČR 1ET100300517; GA ČR GD201/09/H057 Institutional research plan: CEZ:AV0Z10300504 Keywords : ILP * fuzzy * text classification * information extraction Subject RIV: IN - Informatics, Computer Science

  8. A review of the applications of data mining and machine learning for the prediction of biomedical properties of nanoparticles.

    Science.gov (United States)

    Jones, David E; Ghandehari, Hamidreza; Facelli, Julio C

    2016-08-01

    This article presents a comprehensive review of applications of data mining and machine learning for the prediction of biomedical properties of nanoparticles of medical interest. The papers reviewed here present the results of research using these techniques to predict the biological fate and properties of a variety of nanoparticles relevant to their biomedical applications. These include the influence of particle physicochemical properties on cellular uptake, cytotoxicity, molecular loading, and molecular release in addition to manufacturing properties like nanoparticle size, and polydispersity. Overall, the results are encouraging and suggest that as more systematic data from nanoparticles becomes available, machine learning and data mining would become a powerful aid in the design of nanoparticles for biomedical applications. There is however the challenge of great heterogeneity in nanoparticles, which will make these discoveries more challenging than for traditional small molecule drug design. PMID:27282231

  9. TEXT DATA MINING OF ENGLISH BOOKS ON ENVIRONMENTOLOGY

    Directory of Open Access Journals (Sweden)

    Hiromi Ban

    2012-12-01

    Recently, to confront environmental problems, a system of "environmentology" has been under construction. In order to study environmentology, reading materials in English is considered indispensable. In this paper, we investigated several English books on environmentology, comparing them with journalism in terms of metrical linguistics. In short, frequency characteristics of character and word appearance were investigated using a program written in C++. These characteristics were approximated by an exponential function. Furthermore, we calculated the percentage of Japanese junior high school required vocabulary and American basic vocabulary to obtain the difficulty level as well as the K-characteristic of each material. As a result, it was clearly shown that English materials on environmentology have a similar tendency to literary writings in the characteristics of character appearance. Besides, the values of the K-characteristic for the materials on environmentology are high, and some books are more difficult than TIME magazine.
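
    The original study used a C++ program; purely as an illustration of the same kind of rank-frequency analysis, the Python sketch below counts word frequencies in an invented sentence and fits an exponential curve to the rank-frequency profile with SciPy.

```python
# Toy rank-frequency analysis with an exponential fit.
from collections import Counter
import numpy as np
from scipy.optimize import curve_fit

text = ("the environment and the climate change the way we use energy "
        "and the way we study the environment")
freqs = np.array(sorted(Counter(text.split()).values(), reverse=True), dtype=float)
ranks = np.arange(1, len(freqs) + 1, dtype=float)

def exponential(x, a, b):
    return a * np.exp(-b * x)

(a, b), _ = curve_fit(exponential, ranks, freqs, p0=(freqs[0], 0.1))
print(f"fitted curve: f(r) = {a:.2f} * exp(-{b:.2f} * r)")
```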

  10. Visualization Model for Chinese Text Mining%可视化中文文本挖掘模型

    Institute of Scientific and Technical Information of China (English)

    林鸿飞; 贡大跃; 张跃; 姚天顺

    2000-01-01

    This paper briefly describes the background of text mining and the main difficulties in Chinese text mining, presents a visual model for Chinese text mining, and puts forward a method of text categorization based on concepts, a method of text summarization based on statistics, and a method for identifying Chinese names.

  11. Text mining and visualization case studies using open-source tools

    CERN Document Server

    Chisholm, Andrew

    2016-01-01

    Text Mining and Visualization: Case Studies Using Open-Source Tools provides an introduction to text mining using some of the most popular and powerful open-source tools: KNIME, RapidMiner, Weka, R, and Python. The contributors, all highly experienced with text mining and open-source software, explain how text data are gathered and processed from a wide variety of sources, including books, server access logs, websites, social media sites, and message boards. Each chapter presents a case study that you can follow as part of a step-by-step, reproducible example. You can also easily apply and extend the techniques to other problems. All the examples are available on a supplementary website. The book shows you how to exploit your text data, offering successful application examples and blueprints for you to tackle your text mining tasks and benefit from open and freely available tools. It gets you up to date on the latest and most powerful tools, the data mining process, and specific text mining activities.

  12. Text Mining on the Internet%Internet 上的文本数据挖掘

    Institute of Scientific and Technical Information of China (English)

    王伟强; 高文; 段立娟

    2000-01-01

    The booming growth of the Internet has made text mining on it a promising research field in practice. The paper briefly introduces some aspects of this field, including some potential applications, some techniques used and some existing systems.

  13. Classifying unstructed textual data using the Product Score Model: an alternative text mining algorithm

    NARCIS (Netherlands)

    He, Q.; Veldkamp, B.P.; Eggen, T.J.H.M.; Veldkamp, B.P.

    2012-01-01

    Unstructured textual data such as students’ essays and life narratives can provide helpful information in educational and psychological measurement, but often contain irregularities and ambiguities, which creates difficulties in analysis. Text mining techniques that seek to extract useful informatio

  14. Trading Consequences: A Case Study of Combining Text Mining and Visualization to Facilitate Document Exploration

    OpenAIRE

    Hinrichs, Uta; Alex, Beatrice; Clifford, Jim; Watson, Andrew; Quigley, Aaron; Klein, Ewan; Coates, Colin M.

    2015-01-01

    Large-scale digitization efforts and the availability of computational methods, including text mining and information visualization, have enabled new approaches to historical research. However, we lack case studies of how these methods can be applied in practice and what their potential impact may be. Trading Consequences is an interdisciplinary research project between environmental historians, computational linguists, and visualization specialists. It combines text mining and information vi...

  15. Text and Structural Data Mining of Influenza Mentions in Web and Social Media

    OpenAIRE

    Singh, Karan P.; Mikler, Armin R.; Diane J. Cook; Corley, Courtney D.

    2010-01-01

    Text and structural data mining of web and social media (WSM) provides a novel disease surveillance resource and can identify online communities for targeted public health communications (PHC) to assure wide dissemination of pertinent information. WSM that mention influenza are harvested over a 24-week period, 5 October 2008 to 21 March 2009. Link analysis reveals communities for targeted PHC. Text mining is shown to identify trends in flu posts that correlate to real-world influenza-like ill...

  16. Towards Applying Text Mining Techniques on Software Quality Standards and Models

    OpenAIRE

    Kelemen, Zádor Dániel; Kusters, Rob; Trienekens, Jos; Balla, Katalin

    2013-01-01

    Many quality approaches are described in hundreds of pages of text. Manual processing of this information consumes plenty of resources. In this report we present a text mining approach applied to CMMI, one well-known and widely used quality approach. The text mining analysis can provide a quick overview of the scope of a quality approach. The result of the analysis could accelerate the understanding and the selection of quality approaches.

  17. Trading Consequences: A Case Study of Combining Text Mining & Visualisation to Facilitate Document Exploration

    OpenAIRE

    Hinrichs, Uta; Alex, Beatrice; Clifford, Jim; Quigley, Aaron

    2014-01-01

    Trading Consequences is an interdisciplinary research project between historians, computational linguists and visualization specialists. We use text mining and visualisations to explore the growth of the global commodity trade in the nineteenth century. Feedback from a group of environmental historians during a workshop provided essential information to adapt advanced text mining and visualisation techniques to historical research. Expert feedback is an essential tool for effective interdisci...

  18. Advanced Text Mining Methods for the Financial Markets and Forecasting of Intraday Volatility

    OpenAIRE

    Pieper, Michael J.

    2011-01-01

    The flow of information in financial markets is covered in two parts. A high-order estimator of intraday volatility is introduced in order to boost risk forecasts. Over the last decade, text mining of news and its application to finance has been a vibrant topic of research, in academia as well as in the finance industry. This thesis develops a coherent approach to financial text mining that can be utilized for automated trading.

  19. News Discourse and Strategic Monitoring of Events Textometry and Information Extraction for Text Mining

    OpenAIRE

    Erin, Macmurray

    2012-01-01

    This research demonstrates two methods of text mining for strategic monitoring purposes: information extraction and Textometry. In strategic monitoring, text mining is used to automatically obtain information on the activities of corporations. For this objective, information extraction identifies and labels units of information, named entities (companies, places, people), which then constitute entry points for the analysis of economic activities or events. These include mergers, bankruptcies,...

  20. News Discourse and Strategic Monitoring of Events. Textometry and Information Extraction for Text Mining

    OpenAIRE

    MacMurray, Erin

    2012-01-01

    This research demonstrates two methods of text mining for strategic monitoring purposes: information extraction and Textometry. In strategic monitoring, text mining is used to automatically obtain information on the activities of corporations. For this objective, information extraction identifies and labels units of information, named entities (companies, places, people), which then constitute entry points for the analysis of economic activities or events. These include mergers, bankruptcies,...

  1. Text and Structural Data Mining of Influenza Mentions in Web and Social Media

    Energy Technology Data Exchange (ETDEWEB)

    Corley, Courtney D.; Cook, Diane; Mikler, Armin R.; Singh, Karan P.

    2010-02-22

    Text and structural data mining of Web and social media (WSM) provides a novel disease surveillance resource and can identify online communities for targeted public health communications (PHC) to assure wide dissemination of pertinent information. WSM that mention influenza are harvested over a 24-week period, 5-October-2008 to 21-March-2009. Link analysis reveals communities for targeted PHC. Text mining is shown to identify trends in flu posts that correlate to real-world influenza-like-illness patient report data. We also bring to bear a graph-based data mining technique to detect anomalies among flu blogs connected by publisher type, links, and user-tags.
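
    A hedged sketch of the link-analysis step described above (not the authors' system): a small, invented graph of blogs and health sites is built with NetworkX and split into communities using greedy modularity maximization, the kind of grouping that could then be targeted for public health communications.

```python
# Toy community detection over an invented blog link graph.
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

G = nx.Graph()
G.add_edges_from([
    ("flu_blog_1", "flu_blog_2"), ("flu_blog_2", "flu_blog_3"),
    ("health_dept", "clinic_news"), ("clinic_news", "vaccine_blog"),
    ("flu_blog_3", "health_dept"),   # weak bridge between the two groups
])
for i, community in enumerate(greedy_modularity_communities(G)):
    print(f"community {i}:", sorted(community))
```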

  2. BelSmile: a biomedical semantic role labeling approach for extracting biological expression language from text.

    Science.gov (United States)

    Lai, Po-Ting; Lo, Yu-Yan; Huang, Ming-Siang; Hsiao, Yu-Cheng; Tsai, Richard Tzong-Han

    2016-01-01

    Biological expression language (BEL) is one of the most popular languages to represent the causal and correlative relationships among biological events. Automatically extracting and representing biomedical events using BEL can help biologists quickly survey and understand relevant literature. Recently, many researchers have shown interest in biomedical event extraction. However, the task is still a challenge for current systems because of the complexity of integrating different information extraction tasks such as named entity recognition (NER), named entity normalization (NEN) and relation extraction into a single system. In this study, we introduce our BelSmile system, which uses a semantic-role-labeling (SRL)-based approach to extract the NEs and events for BEL statements. BelSmile combines our previous NER, NEN and SRL systems. We evaluate BelSmile using the BioCreative V BEL task dataset. Our system achieved an F-score of 27.8%, ∼7% higher than the top BioCreative V system. The three main contributions of this study are (i) an effective pipeline approach to extract BEL statements, and (ii) a syntactic-based labeler to extract subject-verb-object tuples. We also implement a web-based version of BelSmile (iii) that is publicly available at iisrserv.csie.ncu.edu.tw/belsmile. PMID:27173520

  3. BelSmile: a biomedical semantic role labeling approach for extracting biological expression language from text

    Science.gov (United States)

    Lai, Po-Ting; Lo, Yu-Yan; Huang, Ming-Siang; Hsiao, Yu-Cheng; Tsai, Richard Tzong-Han

    2016-01-01

    Biological expression language (BEL) is one of the most popular languages to represent the causal and correlative relationships among biological events. Automatically extracting and representing biomedical events using BEL can help biologists quickly survey and understand relevant literature. Recently, many researchers have shown interest in biomedical event extraction. However, the task is still a challenge for current systems because of the complexity of integrating different information extraction tasks such as named entity recognition (NER), named entity normalization (NEN) and relation extraction into a single system. In this study, we introduce our BelSmile system, which uses a semantic-role-labeling (SRL)-based approach to extract the NEs and events for BEL statements. BelSmile combines our previous NER, NEN and SRL systems. We evaluate BelSmile using the BioCreative V BEL task dataset. Our system achieved an F-score of 27.8%, ∼7% higher than the top BioCreative V system. The three main contributions of this study are (i) an effective pipeline approach to extract BEL statements, and (ii) a syntactic-based labeler to extract subject–verb–object tuples. We also implement a web-based version of BelSmile (iii) that is publicly available at iisrserv.csie.ncu.edu.tw/belsmile. PMID:27173520
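
    The following is not the BelSmile labeler, only a minimal dependency-parse sketch of the subject-verb-object extraction idea mentioned above; it assumes spaCy and its small English model (en_core_web_sm) are installed, and the exact output depends on that model's parses.

```python
# Toy subject-verb-object triple extraction from a dependency parse.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("AKT1 phosphorylates GSK3B and MDM2 inhibits TP53.")

for token in doc:
    if token.pos_ == "VERB":
        subjects = [c for c in token.children if c.dep_ in ("nsubj", "nsubjpass")]
        objects = [c for c in token.children if c.dep_ in ("dobj", "obj")]
        for subj in subjects:
            for obj in objects:
                print((subj.text, token.lemma_, obj.text))
```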

  4. A framework of Chinese semantic text mining based on ontology learning

    Science.gov (United States)

    Zhang, Yu-feng; Hu, Feng

    2012-01-01

    Text mining and ontology learning can be effectively employed to acquire Chinese semantic information. This paper explores a framework for semantic text mining based on ontology learning to find potential semantic knowledge in the immense amount of text information on the Internet. The framework consists of four parts: Data Acquisition, Feature Extraction, Ontology Construction, and Text Knowledge Pattern Discovery. The framework is then applied to an actual case to find valuable information and even to assist consumers in selecting proper products. The results show that this framework is reasonable and effective.

  5. Summary of the workshop and survey results "Needs and requirements for resources for text and data mining" (Zusammenfassung Workshop und Umfrageergebnisse "Bedarf und Anforderungen an Ressourcen für Text und Data Mining")

    OpenAIRE

    Sens, Irina; Katerbow, Matthias; Schöch, Christof; Mittermaier, Bernhard

    2015-01-01

    Summary of the workshop and visualization of the results of the survey "Needs and requirements for resources for text and data mining" ("Bedarf und Anforderungen an Ressourcen für Text und Data Mining"), carried out by the priority initiative "Digital Information" of the Alliance of German Science Organisations, Text and Data Mining working group.

  6. Compatibility between Text Mining and Qualitative Research in the Perspectives of Grounded Theory, Content Analysis, and Reliability

    Science.gov (United States)

    Yu, Chong Ho; Jannasch-Pennell, Angel; DiGangi, Samuel

    2011-01-01

    The objective of this article is to illustrate that text mining and qualitative research are epistemologically compatible. First, like many qualitative research approaches, such as grounded theory, text mining encourages open-mindedness and discourages preconceptions. Contrary to the popular belief that text mining is a linear and fully automated…

  7. Using text-mining techniques in electronic patient records to identify ADRs from medicine use

    DEFF Research Database (Denmark)

    Warrer, Pernille; Hansen, Ebba Holme; Jensen, Lars Juhl;

    2012-01-01

    We included empirically based studies on text mining of electronic patient records (EPRs) that focused on detecting ADRs, excluding those that investigated adverse events not related to medicine use. We extracted information on study populations, EPR data sources, frequencies and types of the identified ADRs, medicines associated with ADRs, text-mining algorithms used and their performance. Seven studies, all from the United States, were eligible for inclusion in the review. Studies were published from 2001, the majority between 2009 and 2010. Text-mining techniques varied over time, from simple free-text searching of outpatient visit notes and inpatient discharge summaries to more advanced techniques involving natural language processing (NLP) of inpatient discharge summaries. Performance appeared to increase with the use of NLP, although many ADRs were still missed. Due to differences in study design...

  8. Text mining for science and technology - a review part I - characterization/scientometrics

    Directory of Open Access Journals (Sweden)

    Ronald N Kostoff

    2012-01-01

    This article is the first part of a two-part review of the author's work in developing text mining procedures. The focus of Part I is scientometrics. Novel approaches that were used to text mine the field of nanoscience/nanotechnology and the science and technology portfolio of China are described. A unique approach to identifying documents related to an application theme (e.g., military-related, intelligence-related, space-related) rather than a discipline theme is also described in some detail.

  9. Discriminative Features Selection in Text Mining Using TF-IDF Scheme

    Directory of Open Access Journals (Sweden)

    Vaishali Bhujade; N. J. Janwe; Chhaya Meshram

    2011-08-01

    This paper describes a technique for discriminative feature selection in text mining. Text mining is the discovery of new, previously unknown information by computer. Discriminative features are the most important keywords or terms inside a document collection, which describe the informative content included in the collection. The generated keyword set is used to discover association rules among the keywords labeling the documents. For feature extraction, an information retrieval scheme, TF-IDF, is used. This system builds on previous work, which contains text preprocessing phases (filtration and stemming) and serves as the basis for the association rule mining phase. Association rule mining is a text mining technique whose goal is to find interesting association or correlation relationships among a large set of data items. With massive amounts of data continuously being collected and stored in databases, many companies are becoming interested in mining association rules from their databases to increase their profits. Knowledge discovery in databases (KDD) is the process of finding useful information and patterns in data.

  10. Feature engineering for drug name recognition in biomedical texts: feature conjunction and feature selection.

    Science.gov (United States)

    Liu, Shengyu; Tang, Buzhou; Chen, Qingcai; Wang, Xiaolong; Fan, Xiaoming

    2015-01-01

    Drug name recognition (DNR) is a critical step for drug information extraction. Machine learning-based methods have been widely used for DNR with various types of features such as part-of-speech, word shape, and dictionary features. Features used in current machine learning-based methods are usually singleton features, partly because combining singleton features into conjunction features can lead to an explosion of features and a large number of noisy features. However, singleton features that can only capture one linguistic characteristic of a word are not sufficient to describe the information needed for DNR when multiple characteristics should be considered. In this study, we explore feature conjunction and feature selection for DNR, which have never been reported. We intuitively select 8 types of singleton features and combine them into conjunction features in two ways. Then, chi-square, mutual information, and information gain are used to mine effective features. Experimental results show that feature conjunction and feature selection can improve the performance of the DNR system with a moderate number of features, and our DNR system significantly outperforms the best system in the DDIExtraction 2013 challenge. PMID:25861377

  11. Creating Knowledgebases to Text-Mine PUBMED Articles Using Clustering Techniques

    OpenAIRE

    Crasto, Chiquito J.; Thomas M . Morse; Migliore, Michele; Nadkarni, Prakash; Hines, Michael; Brash, Douglas E.; Perry L Miller; Gordon M Shepherd

    2003-01-01

    Knowledgebase-mediated text-mining approaches work best when processing the natural language of domain-specific text. To enhance the utility of our successfully tested program, NeuroText, and to extend its methodologies to other domains, we have designed clustering algorithms; clustering is the principal step in automatically creating a knowledgebase. Our algorithms are designed to improve the quality of clustering by applying both semantic and syntactic parsing to the test corpus.

  12. Towards a Text Mining Methodology Using Frequent Itemsets and Association Rule Extraction

    OpenAIRE

    Cherfi, Hacène; Napoli, Amedeo; Toussaint, Yannick

    2003-01-01

    This paper proposes a methodology for text mining relying on the classical knowledge discovery loop, with a number of adaptations. First, texts are indexed and prepared to be processed by a levelwise frequent itemset search. Association rules are then extracted and interpreted, with respect to a set of quality measures and domain knowledge, under the control of an analyst. The article includes an experiment on a real-world text corpus from molecular biology.
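
    The levelwise itemset search and rule extraction can be illustrated at a toy scale; the sketch below (not the authors' methodology or corpus) computes support and confidence over small word sets standing in for indexed texts, with arbitrary thresholds.

```python
# Tiny illustration of a levelwise frequent-itemset search (levels 1 and 2) and
# association-rule extraction over word sets; corpus and thresholds are invented.
from itertools import combinations

texts = [
    {"gene", "protein", "expression"},
    {"gene", "mutation", "protein"},
    {"protein", "expression", "pathway"},
    {"gene", "protein", "pathway"},
]
min_support, min_confidence = 0.5, 0.7

def support(itemset):
    return sum(itemset <= t for t in texts) / len(texts)

items = sorted({w for t in texts for w in t})
frequent_words = [w for w in items if support({w}) >= min_support]
frequent_pairs = [set(p) for p in combinations(frequent_words, 2)
                  if support(set(p)) >= min_support]
print("frequent words:", frequent_words)

# Rules A -> B from frequent pairs, kept only if confident enough.
for pair in frequent_pairs:
    for a in pair:
        b = next(iter(pair - {a}))
        confidence = support(pair) / support({a})
        if confidence >= min_confidence:
            print(f"{a} -> {b}  support={support(pair):.2f}  confidence={confidence:.2f}")
```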

  13. The potential of text mining in data integration and network biology for plant research: a case study on Arabidopsis

    OpenAIRE

    Van Landeghem, Sofie; De Bodt, Stefanie; Drebert, Zuzanna; Inzé, Dirk; Van de Peer, Yves

    2013-01-01

    Despite the availability of various data repositories for plant research, a wealth of information currently remains hidden within the biomolecular literature. Text mining provides the necessary means to retrieve these data through automated processing of texts. However, only recently has advanced text mining methodology been implemented with sufficient computational power to process texts at a large scale. In this study, we assess the potential of large-scale text mining for plant biology res...

  14. Text mining facilitates database curation - extraction of mutation-disease associations from Bio-medical literature

    OpenAIRE

    Ravikumar, Komandur Elayavilli; Wagholikar, Kavishwar B; Li, Dingcheng; Kocher, Jean-Pierre; PhD, Hongfang Liu

    2015-01-01

    Background Advances in the next generation sequencing technology has accelerated the pace of individualized medicine (IM), which aims to incorporate genetic/genomic information into medicine. One immediate need in interpreting sequencing data is the assembly of information about genetic variants and their corresponding associations with other entities (e.g., diseases or medications). Even with dedicated effort to capture such information in biological databases, much of this information remai...

  15. The Text-mining based PubChem Bioassay neighboring analysis

    Directory of Open Access Journals (Sweden)

    Wang Yanli

    2010-11-01

    Full Text Available Abstract Background In recent years, the number of High Throughput Screening (HTS) assays deposited in PubChem has grown quickly. As a result, the volume of both the structured information (i.e., molecular structures and bioactivities) and the unstructured information (such as descriptions of bioassay experiments) has been increasing exponentially. It has therefore become even more demanding and challenging to efficiently assemble the bioactivity data by mining this huge amount of information to identify and interpret the relationships among the diversified bioassay experiments. In this work, we propose a text-mining based approach for bioassay neighboring analysis from the unstructured text descriptions contained in the PubChem BioAssay database. Results The neighboring analysis is achieved by evaluating the cosine scores of each bioassay pair and the fraction of overlaps among the human-curated neighbors. Our results from the cosine score distribution analysis and assay neighbor clustering analysis on all PubChem bioassays suggest that strong correlations among the bioassays can be identified from their conceptual relevance. A comparison with other existing assay neighboring methods suggests that the text-mining based bioassay neighboring approach provides meaningful linkages among the PubChem bioassays, and complements the existing methods by identifying additional relationships among the bioassay entries. Conclusions The text-mining based bioassay neighboring analysis is efficient for correlating bioassays and studying different aspects of a biological process, which are otherwise difficult to achieve by existing neighboring procedures due to the lack of specific annotations and structured information. It is suggested that the text-mining based bioassay neighboring analysis can be used as a standalone or as a complementary tool for the PubChem bioassay neighboring process to enable efficient integration of assay results and generate hypotheses for
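
    The cosine-score step is straightforward to prototype; the following sketch (with invented assay descriptions and an arbitrary threshold, not PubChem data) computes cosine similarities between TF-IDF vectors of assay descriptions using scikit-learn.

```python
# Sketch of text-based assay "neighboring": cosine similarity between TF-IDF vectors
# of assay descriptions. Illustrative only; descriptions and cutoff are invented.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

assay_descriptions = {
    "AID-1": "luciferase reporter assay measuring inhibition of kinase activity",
    "AID-2": "cell-based luciferase assay for kinase inhibitors in HEK293 cells",
    "AID-3": "binding assay measuring displacement of a radioligand at a GPCR",
}

ids = list(assay_descriptions)
tfidf = TfidfVectorizer(stop_words="english").fit_transform(assay_descriptions.values())
scores = cosine_similarity(tfidf)

# Report pairs whose cosine score exceeds an (arbitrary) neighboring threshold.
threshold = 0.2
for i in range(len(ids)):
    for j in range(i + 1, len(ids)):
        if scores[i, j] >= threshold:
            print(f"{ids[i]} ~ {ids[j]}  cosine={scores[i, j]:.2f}")
```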

  16. Biomedical Mathematics, Unit I: Measurement, Linear Functions and Dimensional Algebra. Student Text. Revised Version, 1975.

    Science.gov (United States)

    Biomedical Interdisciplinary Curriculum Project, Berkeley, CA.

    This text presents lessons relating specific mathematical concepts to the ideas, skills, and tasks pertinent to the health care field. Among other concepts covered are linear functions, vectors, trigonometry, and statistics. Many of the lessons use data acquired during science experiments as the basis for exercises in mathematics. Lessons present…

  17. Trends of E-Learning Research from 2000 to 2008: Use of Text Mining and Bibliometrics

    Science.gov (United States)

    Hung, Jui-long

    2012-01-01

    This study investigated the longitudinal trends of e-learning research using text mining techniques. Six hundred and eighty-nine (689) refereed journal articles and proceedings were retrieved from the Science Citation Index/Social Science Citation Index database in the period from 2000 to 2008. All e-learning publications were grouped into two…

  18. Analysis of Nature of Science Included in Recent Popular Writing Using Text Mining Techniques

    Science.gov (United States)

    Jiang, Feng; McComas, William F.

    2014-01-01

    This study examined the inclusion of nature of science (NOS) in popular science writing to determine whether it could serve as a supplementary resource for teaching NOS and to evaluate the accuracy of text mining and classification as a viable research tool in science education research. Four groups of documents published from 2001 to 2010 were…

  19. The Determination of Children's Knowledge of Global Lunar Patterns from Online Essays Using Text Mining Analysis

    Science.gov (United States)

    Cheon, Jongpil; Lee, Sangno; Smith, Walter; Song, Jaeki; Kim, Yongjin

    2013-01-01

    The purpose of this study was to use text mining analysis of early adolescents' online essays to determine their knowledge of global lunar patterns. Australian and American students in grades five to seven wrote about global lunar patterns they had discovered by sharing observations with each other via the Internet. These essays were analyzed for…

  20. Complementing the Numbers: A Text Mining Analysis of College Course Withdrawals

    Science.gov (United States)

    Michalski, Greg V.

    2011-01-01

    Excessive college course withdrawals are costly to the student and the institution in terms of time to degree completion, available classroom space, and other resources. Although generally well quantified, detailed analysis of the reasons given by students for course withdrawal is less common. To address this, a text mining analysis was performed…

  1. Mining for associations between text and brain activation in a functional neuroimaging database

    DEFF Research Database (Denmark)

    Nielsen, Finn Arup; Hansen, Lars Kai; Balslev, Daniela

    2004-01-01

    We describe a method for mining a neuroimaging database for associations between text and brain locations. The objective is to discover association rules between words indicative of cognitive function as described in abstracts of neuroscience papers and sets of reported stereotactic Talairach...

  2. Integrated Text Mining and Chemoinformatics Analysis Associates Diet to Health Benefit at Molecular Level

    DEFF Research Database (Denmark)

    Jensen, Kasper; Panagiotou, Gianni; Kouskoumvekaki, Irene

    2014-01-01

    ..., such as polyphenols, lipids and nutrients. In this work, we applied text mining and Naïve Bayes classification to assemble the knowledge space of food-phytochemical and food-disease associations, where we distinguish between disease prevention/amelioration and disease progression. We subsequently searched for frequently...

  3. A Feature Mining Based Approach for the Classification of Text Documents into Disjoint Classes.

    Science.gov (United States)

    Nieto Sanchez, Salvador; Triantaphyllou, Evangelos; Kraft, Donald

    2002-01-01

    Proposes a new approach for classifying text documents into two disjoint classes. Highlights include a brief overview of document clustering; a data mining approach called the One Clause at a Time (OCAT) algorithm which is based on mathematical logic; vector space model (VSM); and comparing the OCAT to the VSM. (Author/LRW)

  4. Exploring the potential of Social Media Data using Text Mining to augment Business Intelligence

    Directory of Open Access Journals (Sweden)

    Dr. Ananthi Sheshasaayee

    2014-03-01

    Full Text Available In recent years, social media has become famous worldwide and important for content sharing, social networking and similar activities. The content generated from these websites remains largely unused. Social media contains text, images, audio, video and so on, and its data largely consist of unstructured text. The foremost task is to extract the information in this unstructured text. This paper presents the influence of social media data on research and shows how the content can be used, by applying text mining methods, to predict real-world decisions that enhance business intelligence.

  5. A corpus of full-text journal articles is a robust evaluation tool for revealing differences in performance of biomedical natural language processing tools

    OpenAIRE

    Verspoor Karin; Cohen Kevin; Lanfranchi Arrick; Warner Colin; Johnson Helen L; Roeder Christophe; Choi Jinho D; Funk Christopher; Malenkiy Yuriy; Eckert Miriam; Xue Nianwen; Baumgartner William A; Bada Michael; Palmer Martha; Hunter Lawrence E

    2012-01-01

    Abstract Background We introduce the linguistic annotation of a corpus of 97 full-text biomedical publications, known as the Colorado Richly Annotated Full Text (CRAFT) corpus. We further assess the performance of existing tools for performing sentence splitting, tokenization, syntactic parsing, and named entity recognition on this corpus. Results Many biomedical natural language processing systems demonstrated large differences between their previously published results and their performance...

  6. Improving LSTM-based Video Description with Linguistic Knowledge Mined from Text

    OpenAIRE

    Venugopalan, Subhashini; Hendricks, Lisa Anne; Mooney, Raymond; Saenko, Kate

    2016-01-01

    This paper investigates how linguistic knowledge mined from large text corpora can aid the generation of natural language descriptions of videos. Specifically, we integrate both a neural language model and distributional semantics trained on large text corpora into a recent LSTM-based architecture for video description. We evaluate our approach on a collection of Youtube videos as well as two large movie description datasets showing significant improvements in grammaticality while maintaining...

  7. Service-Oriented Text Mining, Using Entity-Extracting Services as an Example

    OpenAIRE

    Pfeifer, Katja

    2014-01-01

    Most business-relevant knowledge today exists as unstructured information in the form of text data on web pages, in office documents or in forum posts. A large number of text mining solutions have been developed to extract and exploit this unstructured information. Many of these systems have recently been made accessible as web services in order to simplify their use and integration. The combination of various such text-min...

  8. A tm Plug-In for Distributed Text Mining in R

    OpenAIRE

    Stefan Theussl; Ingo Feinerer; Kurt Hornik

    2012-01-01

    R has gained explicit text mining support with the tm package enabling statisticians to answer many interesting research questions via statistical analysis or modeling of (text) corpora. However, we typically face two challenges when analyzing large corpora: (1) the amount of data to be processed in a single machine is usually limited by the available main memory (i.e., RAM), and (2) the more data to be analyzed the higher the need for efficient procedures for calculating valua...

  9. Text-mining the NeuroSynth corpus using Deep Boltzmann Machines

    OpenAIRE

    Monti, Ricardo Pio; Lorenz, Romy; Leech, Robert; Anagnostopoulos, Christoforos; Montana, Giovanni

    2016-01-01

    Large-scale automated meta-analysis of neuroimaging data has recently established itself as an important tool in advancing our understanding of human brain function. This research has been pioneered by NeuroSynth, a database collecting both brain activation coordinates and associated text across a large cohort of neuroimaging research papers. One of the fundamental aspects of such meta-analysis is text-mining. To date, word counts and more sophisticated methods such as Latent Dirichlet Alloca...

  10. Discriminative Features Selection in Text Mining Using TF-IDF Scheme

    OpenAIRE

    Ms. Vaishali Bhujade, Prof. N. J. Janwe, Ms. Chhaya Meshram

    2011-01-01

    This paper describes a technique for discriminative feature selection in text mining. Text mining is the discovery of new, previously unknown information by computer. Discriminative features are the most important keywords or terms in a document collection, describing the informative content of that collection. The generated keyword set is used to discover association rules among the keywords labeling the documents. For feature extraction, the information retrieval scheme TF-IDF...

  11. SOME APPROACHES TO TEXT MINING AND THEIR POTENTIAL FOR SEMANTIC WEB APPLICATIONS

    Directory of Open Access Journals (Sweden)

    Jan Paralič

    2007-06-01

    Full Text Available In this paper we describe some approaches to text mining, which are supported by an original software system developed in Java for the support of information retrieval and text mining (JBowl), as well as its possible use in a distributed environment. The JBowl system is being developed as open source software with the intention of providing an easily extensible, modular framework for pre-processing, indexing and further exploration of large text collections. The overall architecture of the system is described, followed by some typical use case scenarios that have been used in previous projects. Then, basic principles and technologies used for service-oriented computing, web services and semantic web services are presented. We further discuss how the JBowl system can be adapted to a distributed environment using technologies that are already available, and what benefits such an adaptation can bring. This is particularly important in the context of KP-Lab (Knowledge Practices Laboratory), a new integrated EU-funded project that is briefly presented as well, together with the role of the proposed text mining services, which are currently being designed and developed there.

  12. An overview of the BioCreative 2012 Workshop Track III: interactive text mining task.

    Science.gov (United States)

    Arighi, Cecilia N; Carterette, Ben; Cohen, K Bretonnel; Krallinger, Martin; Wilbur, W John; Fey, Petra; Dodson, Robert; Cooper, Laurel; Van Slyke, Ceri E; Dahdul, Wasila; Mabee, Paula; Li, Donghui; Harris, Bethany; Gillespie, Marc; Jimenez, Silvia; Roberts, Phoebe; Matthews, Lisa; Becker, Kevin; Drabkin, Harold; Bello, Susan; Licata, Luana; Chatr-aryamontri, Andrew; Schaeffer, Mary L; Park, Julie; Haendel, Melissa; Van Auken, Kimberly; Li, Yuling; Chan, Juancarlos; Muller, Hans-Michael; Cui, Hong; Balhoff, James P; Chi-Yang Wu, Johnny; Lu, Zhiyong; Wei, Chih-Hsuan; Tudor, Catalina O; Raja, Kalpana; Subramani, Suresh; Natarajan, Jeyakumar; Cejuela, Juan Miguel; Dubey, Pratibha; Wu, Cathy

    2013-01-01

    In many databases, biocuration primarily involves literature curation, which usually involves retrieving relevant articles, extracting information that will translate into annotations and identifying new incoming literature. As the volume of biological literature increases, the use of text mining to assist in biocuration becomes increasingly relevant. A number of groups have developed tools for text mining from a computer science/linguistics perspective, and there are many initiatives to curate some aspect of biology from the literature. Some biocuration efforts already make use of a text mining tool, but there have not been many broad-based systematic efforts to study which aspects of a text mining tool contribute to its usefulness for a curation task. Here, we report on an effort to bring together text mining tool developers and database biocurators to test the utility and usability of tools. Six text mining systems presenting diverse biocuration tasks participated in a formal evaluation, and appropriate biocurators were recruited for testing. The performance results from this evaluation indicate that some of the systems were able to improve efficiency of curation by speeding up the curation task significantly (∼1.7- to 2.5-fold) over manual curation. In addition, some of the systems were able to improve annotation accuracy when compared with the performance on the manually curated set. In terms of inter-annotator agreement, the factors that contributed to significant differences for some of the systems included the expertise of the biocurator on the given curation task, the inherent difficulty of the curation and attention to annotation guidelines. After the task, annotators were asked to complete a survey to help identify strengths and weaknesses of the various systems. The analysis of this survey highlights how important task completion is to the biocurators' overall experience of a system, regardless of the system's high score on design, learnability and

  13. Review of Text Mining Tools

    Institute of Scientific and Technical Information of China (English)

    张雯雯; 许鑫

    2012-01-01

    The authors first briefly describe some commercial and open source text mining tools, together with a detailed comparison of four typical open source tools in terms of data format, functional modules and user experience. They then test the text classification function of three tools with distinctive designs. Finally, the authors offer some suggestions regarding the current state of open source text mining tools.

  14. A Case Study in Text Mining: Interpreting Twitter Data From World Cup Tweets

    OpenAIRE

    Godfrey, Daniel; Johns, Caley; Meyer, Carl; Race, Shaina; Sadek, Carol

    2014-01-01

    Cluster analysis is a field of data analysis that extracts underlying patterns in data. One application of cluster analysis is in text-mining, the analysis of large collections of text to find similarities between documents. We used a collection of about 30,000 tweets extracted from Twitter just before the World Cup started. A common problem with real world text data is the presence of linguistic noise. In our case it would be extraneous tweets that are unrelated to dominant themes. To combat...

  15. Agile text mining for the 2014 i2b2/UTHealth Cardiac risk factors challenge.

    Science.gov (United States)

    Cormack, James; Nath, Chinmoy; Milward, David; Raja, Kalpana; Jonnalagadda, Siddhartha R

    2015-12-01

    This paper describes the use of an agile text mining platform (Linguamatics' Interactive Information Extraction Platform, I2E) to extract document-level cardiac risk factors in patient records as defined in the i2b2/UTHealth 2014 challenge. The approach uses a data-driven rule-based methodology with the addition of a simple supervised classifier. We demonstrate that agile text mining allows for rapid optimization of extraction strategies, while post-processing can leverage annotation guidelines, corpus statistics and logic inferred from the gold standard data. We also show how data imbalance in a training set affects performance. Evaluation of this approach on the test data gave an F-Score of 91.7%, one percent behind the top performing system. PMID:26209007

  16. The WONP-NURT corpus as nuclear knowledge base for text mining in the INIS database

    International Nuclear Information System (INIS)

    In the present work the WONP-NURT corpus is taken as a knowledge base for text mining in the INIS database. The main components of the information processing system, as well as computational methods for content analysis of INIS database record files, are described. Results of the content analysis of the WONP-NURT corpus are reported, together with the results of two comparative text mining studies in the INIS database. The first study explores 10 research areas in the more familiar near range of the WONP-NURT corpus, while the second surveys 15 regions in the more exotic far range. The results provide new elements to assess the significance of the WONP-NURT corpus in the context of the current state of nuclear science and technology research areas. (Author)

  17. Text Mining and copy right laws : A case for change in the medical research field

    OpenAIRE

    Blanc, Xavier; Tinh-Hai, Collet; Iriarte, Pablo; De Kaenel, Isabelle; Krause, Jan Brice

    2012-01-01

    Mid 2011, the research team at the Department of Ambulatory Care and Community Medicine, University of Lausanne, decided to start a project on a new research topic: Shared Decision Making (SDM). The objective was to identify publication trends about SDM in 15 major internal medicine journals over the last 15 years. It was decided to use a "text mining" approach to systematically review all the articles published in these main journals and automatically search for the different occurrences of ...

  18. Supporting text mining for e-Science: the challenges for Grid-enabled natural language processing

    OpenAIRE

    Carroll, John; Evans, Roger; Klein, Ewan

    2005-01-01

    Over the last few years, language technology has moved rapidly from 'applied research' to 'engineering', and from small-scale to large-scale engineering. Applications such as advanced text mining systems are feasible, but very resource-intensive, while research seeking to address the underlying language processing questions faces very real practical and methodological limitations. The e-Science vision, and the creation of the e-Science Grid, promises the level of integrated large-scale techno...

  19. Online Discourse on Fibromyalgia: Text-Mining to Identify Clinical Distinction and Patient Concerns

    OpenAIRE

    Park, Jungsik; Ryu, Young Uk

    2014-01-01

    Background The purpose of this study was to evaluate the possibility of using text-mining to identify clinical distinctions and patient concerns in online memoires posted by patients with fibromyalgia (FM). Material/Methods A total of 399 memoirs were collected from an FM group website. The unstructured data of memoirs associated with FM were collected through a crawling process and converted into structured data with a concordance, parts of speech tagging, and word frequency. We also conduct...

  20. Benefits and Use of Text Mining for Media Analysis

    OpenAIRE

    Richter, Matthias

    2011-01-01

    On the one hand, existing results from directions as diverse as empirical media research and text mining are brought together. This concerns content analysis, done by hand, with computer support, or fully automatically, in particular with regard to factors such as time, development and change. The condensation and compilation not only provides an overview from an unusual perspective; in this process there also occurs the synthesis of something ne...

  1. Cluo: Web-Scale Text Mining System For Open Source Intelligence Purposes

    OpenAIRE

    Przemyslaw Maciolek; Grzegorz Dobrowolski

    2013-01-01

    The amount of textual information published on the Internet is considered to be in billions of web pages, blog posts, comments, social media updates and others. Analyzing such quantities of data requires a high level of distribution, both of data and of computing. This is especially true in the case of complex algorithms, often used in text mining tasks. The paper presents a prototype implementation of CLUO, an Open Source Intelligence (OSINT) system which extracts and analyzes significant quantities of openl...

  2. A Conformity Measure using Background Knowledge for Association Rules: Application to Text Mining

    OpenAIRE

    Cherfi, Hacène; Napoli, Amedeo; Toussaint, Yannick

    2009-01-01

    A text mining process using association rules generates a very large number of rules. According to experts of the domain, most of these rules basically convey a common knowledge, i.e. rules which associate terms that experts may likely relate to each other. In order to focus on the result interpretation and discover new knowledge units, it is necessary to define criteria for classifying the extracted rules. Most of the rule classification methods are based on numerical quality measures. In th...

  3. Coronary artery disease risk assessment from unstructured electronic health records using text mining.

    Science.gov (United States)

    Jonnagaddala, Jitendra; Liaw, Siaw-Teng; Ray, Pradeep; Kumar, Manish; Chang, Nai-Wen; Dai, Hong-Jie

    2015-12-01

    Coronary artery disease (CAD) often leads to myocardial infarction, which may be fatal. Risk factors can be used to predict CAD, which may subsequently lead to prevention or early intervention. Patient data such as co-morbidities, medication history, social history and family history are required to determine the risk factors for a disease. However, risk factor data are usually embedded in unstructured clinical narratives if the data are not collected specifically for risk assessment purposes. Clinical text mining can be used to extract data related to risk factors from unstructured clinical notes. This study presents methods to extract Framingham risk factors from unstructured electronic health records using clinical text mining and to calculate 10-year coronary artery disease risk scores in a cohort of diabetic patients. We developed a rule-based system to extract risk factors: age, gender, total cholesterol, HDL-C, blood pressure, diabetes history and smoking history. The results showed that the output from the text mining system was reliable, but there was a significant amount of missing data for calculating the Framingham risk score. A systematic approach to understanding the missing data was followed by the implementation of imputation strategies. An analysis of the 10-year Framingham risk scores for coronary artery disease in this cohort has shown that the majority of the diabetic patients are at moderate risk of CAD. PMID:26319542
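
    As a hedged illustration of the rule-based extraction step (not the system used in this study), the sketch below pulls a few Framingham-style risk factors out of an invented clinical note with regular expressions.

```python
# Toy rule-based extraction of a few Framingham-style risk factors from free text.
# Not the system described in the paper; the note and patterns are invented examples.
import re

note = ("68 yo male with type 2 diabetes. BP 142/88 today. "
        "Total cholesterol 230 mg/dL, HDL 38 mg/dL. Former smoker, quit 2005.")

patterns = {
    "age":               r"(\d{2})\s*(?:yo|y/o|year[- ]old)",
    "gender":            r"\b(male|female)\b",
    "systolic_bp":       r"BP\s*(\d{2,3})/\d{2,3}",
    "total_cholesterol": r"total cholesterol\s*(\d{2,3})",
    "hdl_c":             r"HDL\s*(\d{2,3})",
    "diabetes":          r"\b(diabetes|diabetic)\b",
    "smoking":           r"\b(smoker|smoking)\b",
}

extracted = {}
for factor, pattern in patterns.items():
    m = re.search(pattern, note, flags=re.IGNORECASE)
    if m:
        extracted[factor] = m.group(1)

print(extracted)
# e.g. {'age': '68', 'gender': 'male', 'systolic_bp': '142', ...}
```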

  4. Experiences with Text Mining Large Collections of Unstructured Systems Development Artifacts at JPL

    Science.gov (United States)

    Port, Dan; Nikora, Allen; Hihn, Jairus; Huang, LiGuo

    2011-01-01

    Repositories of systems engineering artifacts at NASA's Jet Propulsion Laboratory (JPL) are often so large and poorly structured that they have outgrown our capability to manually process their contents and extract useful information effectively. Sophisticated text mining methods and tools seem a quick, low-effort way to automate our limited manual efforts. Our experience in exploring such methods in three main areas at JPL (historical risk analysis, defect identification based on requirements analysis, and over-time analysis of system anomalies) has shown that obtaining useful results requires substantial unanticipated effort, from preprocessing the data to transforming the output for practical applications. We have not observed any quick 'wins' or realized benefits from short-term effort avoidance through automation in this area. Surprisingly, we have realized a number of unexpected long-term benefits from the process of applying text mining to our repositories. This paper elaborates some of these benefits and the important lessons we learned from preparing and applying text mining to large unstructured system artifacts at JPL, with the aim of benefiting future text mining applications in similar problem domains and, hopefully, of being extended to broader areas of application.

  5. Mining Clinicians' Electronic Documentation to Identify Heart Failure Patients with Ineffective Self-Management: A Pilot Text-Mining Study.

    Science.gov (United States)

    Topaz, Maxim; Radhakrishnan, Kavita; Lei, Victor; Zhou, Li

    2016-01-01

    Effective self-management can decrease up to 50% of heart failure hospitalizations. Unfortunately, self-management by patients with heart failure remains poor. This pilot study aimed to explore the use of text-mining to identify heart failure patients with ineffective self-management. We first built a comprehensive self-management vocabulary based on the literature and clinical notes review. We then randomly selected 545 heart failure patients treated within Partners Healthcare hospitals (Boston, MA, USA) and conducted a regular expression search with the compiled vocabulary within 43,107 interdisciplinary clinical notes of these patients. We found that 38.2% (n = 208) of the patients had documentation of ineffective heart failure self-management in the domains of poor diet adherence (28.4%), missed medical encounters (26.4%), poor medication adherence (20.2%) and non-specified self-management issues (e.g., "compliance issues", 34.6%). We showed the feasibility of using text-mining to identify patients with ineffective self-management. More natural language processing algorithms are needed to help busy clinicians identify these patients. PMID:27332377
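
    A minimal sketch of the vocabulary-based regular expression search is shown below; the self-management vocabulary and notes are invented and far smaller than the study's lexicon.

```python
# Sketch of flagging notes with a compiled "ineffective self-management" vocabulary.
# The vocabulary and notes are invented; this is not the study's lexicon or data.
import re

vocabulary = {
    "poor diet adherence":       [r"high salt diet", r"not following .{0,20}diet"],
    "missed medical encounters": [r"missed (?:appointment|clinic visit)", r"no[- ]show"],
    "poor medication adherence": [r"not taking .{0,20}medication", r"ran out of .{0,20}pills"],
}

notes = [
    "Patient admits to a high salt diet and ran out of his pills last week.",
    "Routine follow-up, taking medications as prescribed.",
]

compiled = {domain: re.compile("|".join(pats), re.IGNORECASE)
            for domain, pats in vocabulary.items()}

for i, note in enumerate(notes):
    hits = [domain for domain, rx in compiled.items() if rx.search(note)]
    label = ", ".join(hits) if hits else "no flags"
    print(f"note {i}: {label}")
```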

  6. USING TEXT MINING TECHNIQUES TO ANALYZE HOW MOVIE FORUMS AFFECT THE BOX OFFICE

    Directory of Open Access Journals (Sweden)

    I-ping Chiang

    2014-06-01

    Full Text Available As a forecasting tool, audience movie reviews provide a guide for film companies. This study uses a text mining technique to analyze the American film market. It explores movie reviews, including word-of-mouth (WOM) factors (i.e., movie content, positive, negative, and promotion) and related factors (i.e., time, rating, and the number of ratings), with respect to the box office. The major factors that affect the box office are determined according to the relationships between the keyword clusters. The findings provide a reference for movie producers seeking to manage WOM.

  7. Cluo: Web-Scale Text Mining System For Open Source Intelligence Purposes

    Directory of Open Access Journals (Sweden)

    Przemyslaw Maciolek

    2013-01-01

    Full Text Available The amount of textual information published on the Internet is considered to be in billions of web pages, blog posts, comments, social media updates and others. Analyzing such quantities of data requires a high level of distribution, both of data and of computing. This is especially true in the case of complex algorithms, often used in text mining tasks. The paper presents a prototype implementation of CLUO, an Open Source Intelligence (OSINT) system which extracts and analyzes significant quantities of openly available information.

  8. PolySearch2: a significantly improved text-mining system for discovering associations between human diseases, genes, drugs, metabolites, toxins and more.

    Science.gov (United States)

    Liu, Yifeng; Liang, Yongjie; Wishart, David

    2015-07-01

    PolySearch2 (http://polysearch.ca) is an online text-mining system for identifying relationships between biomedical entities such as human diseases, genes, SNPs, proteins, drugs, metabolites, toxins, metabolic pathways, organs, tissues, subcellular organelles, positive health effects, negative health effects, drug actions, Gene Ontology terms, MeSH terms, ICD-10 medical codes, biological taxonomies and chemical taxonomies. PolySearch2 supports a generalized 'Given X, find all associated Ys' query, where X and Y can be selected from the aforementioned biomedical entities. An example query might be: 'Find all diseases associated with Bisphenol A'. To find its answers, PolySearch2 searches for associations against comprehensive collections of free text, including local versions of MEDLINE abstracts, PubMed Central full-text articles, Wikipedia full-text articles and US Patent application abstracts. PolySearch2 also searches 14 widely used, text-rich biological databases such as UniProt, DrugBank and the Human Metabolome Database to improve its accuracy and coverage. PolySearch2 maintains an extensive thesaurus of biological terms and exploits the latest search engine technology to rapidly retrieve relevant articles and database records. PolySearch2 also generates, ranks and annotates associative candidates and presents results with relevancy statistics and highlighted key sentences to facilitate user interpretation. PMID:25925572

  9. Mining Health-Related Issues in Consumer Product Reviews by Using Scalable Text Analytics

    Science.gov (United States)

    Torii, Manabu; Tilak, Sameer S.; Doan, Son; Zisook, Daniel S.; Fan, Jung-wei

    2016-01-01

    In an era when most of our life activities are digitized and recorded, opportunities abound to gain insights about population health. Online product reviews present a unique data source that is currently underexplored. Health-related information, although scarce, can be systematically mined in online product reviews. Leveraging natural language processing and machine learning tools, we were able to mine 1.3 million grocery product reviews for health-related information. The objectives of the study were as follows: (1) conduct quantitative and qualitative analysis on the types of health issues found in consumer product reviews; (2) develop a machine learning classifier to detect reviews that contain health-related issues; and (3) gain insights about the task characteristics and challenges for text analytics to guide future research. PMID:27375358

  10. Screening for posttraumatic stress disorder using verbal features in self narratives: a text mining approach.

    Science.gov (United States)

    He, Qiwei; Veldkamp, Bernard P; de Vries, Theo

    2012-08-15

    Much evidence has shown that people's physical and mental health can be predicted by the words they use. However, such verbal information is seldom used in the screening and diagnosis process, probably because handling these words is rather difficult with traditional quantitative methods. The first challenge is to extract robust information from diversified expression patterns; the second is to transform unstructured text into a structured dataset. The present study developed a new textual assessment method to screen for posttraumatic stress disorder (PTSD) using lexical features of self narratives, with text mining techniques. Using 300 self narratives collected online, we extracted highly discriminative keywords with the chi-square algorithm and constructed a textual assessment model to classify individuals by the presence or absence of PTSD. This resulted in high agreement between computer and psychiatrists' diagnoses for PTSD and revealed some expressive characteristics in the writings of PTSD patients. Although the results of text analysis are not completely analogous to the results of structured interviews in PTSD diagnosis, the application of text mining is a promising addition to assessing PTSD in clinical and research settings. PMID:22464046
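
    The chi-square keyword selection can be sketched directly from the 2x2 contingency table of word presence versus group membership; the narratives below are invented, and the code only illustrates the scoring, not the study's model.

```python
# Toy chi-square keyword ranking over two groups of self narratives.
# The narratives are invented; this only illustrates the kind of scoring used.
from collections import Counter

ptsd = ["nightmares keep coming back every night",
        "flashbacks and fear since the accident"]
control = ["work was busy but the weekend was relaxing",
           "planning a trip with friends next month"]

def doc_freq(docs):
    return Counter(w for d in docs for w in set(d.lower().split()))

df_pos, df_neg = doc_freq(ptsd), doc_freq(control)
n_pos, n_neg = len(ptsd), len(control)

def chi_square(word):
    # 2x2 contingency table of word presence vs. group membership.
    a, b = df_pos[word], df_neg[word]          # narratives containing the word
    c, d = n_pos - a, n_neg - b                # narratives not containing it
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return 0.0 if denom == 0 else n * (a * d - b * c) ** 2 / denom

vocab = set(df_pos) | set(df_neg)
ranked = sorted(vocab, key=chi_square, reverse=True)
print(ranked[:5])   # most discriminative keywords come first
```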

  11. Integrated Text Mining and Chemoinformatics Analysis Associates Diet to Health Benefit at Molecular Level.

    OpenAIRE

    Jensen, Kasper; Panagiotou, Gianni; Kouskoumvekaki, Irene

    2014-01-01

    Awareness that disease susceptibility is not only dependent on genetic make up, but can be affected by lifestyle decisions, has brought more attention to the role of diet. However, food is often treated as a black box, or the focus is limited to few, well-studied compounds, such as polyphenols, lipids and nutrients. In this work, we applied text mining and Naïve Bayes classification to assemble the knowledge space of food-phytochemical and food-disease associations, where we distinguish betwe...
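
    A minimal Naïve Bayes sketch in the spirit of this classification step is shown below, assuming scikit-learn is available; the sentences and prevention/progression labels are invented, not the paper's corpus.

```python
# Minimal Naive Bayes sketch for labelling food-disease sentences as describing
# disease prevention vs. progression. Sentences and labels are invented examples.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

sentences = [
    "green tea polyphenols reduced the risk of colon cancer",
    "dietary fiber intake protects against cardiovascular disease",
    "high red meat consumption increased the incidence of colorectal cancer",
    "excess sugar intake promotes progression of fatty liver disease",
]
labels = ["prevention", "prevention", "progression", "progression"]

model = make_pipeline(CountVectorizer(), MultinomialNB()).fit(sentences, labels)
print(model.predict(["broccoli intake reduced tumour incidence in mice"]))
```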

  12. Integrated text mining and chemoinformatics analysis associates diet to health benefit at molecular level.

    OpenAIRE

    Kasper Jensen; Gianni Panagiotou; Irene Kouskoumvekaki

    2014-01-01

    Awareness that disease susceptibility is not only dependent on genetic make up, but can be affected by lifestyle decisions, has brought more attention to the role of diet. However, food is often treated as a black box, or the focus is limited to few, well-studied compounds, such as polyphenols, lipids and nutrients. In this work, we applied text mining and Naïve Bayes classification to assemble the knowledge space of food-phytochemical and food-disease associations, where we distinguish betwe...

  13. Mining for associations between text and brain activation in a functional neuroimaging database

    DEFF Research Database (Denmark)

    Nielsen, Finn Årup; Hansen, Lars Kai; Balslev, D.

    2004-01-01

    We describe a method for mining a neuroimaging database for associations between text and brain locations. The objective is to discover association rules between words indicative of cognitive function as described in abstracts of neuroscience papers and sets of reported stereotactic Talairach... coordinates. We invoke a simple probabilistic framework in which kernel density estimates are used to model distributions of brain activation foci conditioned on words in a given abstract. The principal associations are found in the joint probability density between words and voxels. We show that the...
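
    The kernel density idea is easy to prototype; the sketch below (using invented Talairach-like coordinates rather than the actual database) fits scipy's gaussian_kde to the activation foci pooled for a given word and evaluates the density at two locations.

```python
# Sketch of a kernel density estimate over activation coordinates reported in
# abstracts that contain a given word. Coordinates and word pools are invented.
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)

# Invented pools of Talairach-like activation foci for abstracts containing a word.
foci_by_word = {
    "memory": rng.normal(loc=[-24.0, -30.0, -8.0], scale=4.0, size=(30, 3)),
    "motor":  rng.normal(loc=[38.0, -22.0, 54.0],  scale=4.0, size=(30, 3)),
}

def density_for_word(word):
    # Fit a 3-D kernel density estimate to the foci pooled for that word.
    return gaussian_kde(foci_by_word[word].T)   # expects shape (dims, n_points)

kde = density_for_word("memory")
print(kde([[-24.0], [-30.0], [-8.0]]))   # high density near the memory cluster
print(kde([[38.0], [-22.0], [54.0]]))    # near-zero density at a motor focus
```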

  14. Text Mining of the Classical Medical Literature for Medicines That Show Potential in Diabetic Nephropathy

    Directory of Open Access Journals (Sweden)

    Lei Zhang

    2014-01-01

    Full Text Available Objectives. To apply modern text-mining methods to identify candidate herbs and formulae for the treatment of diabetic nephropathy. Methods. The method we developed includes three steps: (1) identification of candidate ancient terms; (2) systemic search and assessment of medical records written in classical Chinese; (3) preliminary evaluation of the effect and safety of candidates. Results. Ancient terms Xia Xiao, Shen Xiao, and Xiao Shen were determined as the most likely to correspond with diabetic nephropathy and used in text mining. A total of 80 Chinese formulae for treating conditions congruent with diabetic nephropathy recorded in medical books from Tang Dynasty to Qing Dynasty were collected. Sao si tang (also called Reeling Silk Decoction) was chosen to show the process of preliminary evaluation of the candidates. It had promising potential for development as new agent for the treatment of diabetic nephropathy. However, further investigations about the safety to patients with renal insufficiency are still needed. Conclusions. The methods developed in this study offer a targeted approach to identifying traditional herbs and/or formulae as candidates for further investigation in the search for new drugs for modern disease. However, more effort is still required to improve our techniques, especially with regard to compound formulae.

  15. A corpus of full-text journal articles is a robust evaluation tool for revealing differences in performance of biomedical natural language processing tools

    Directory of Open Access Journals (Sweden)

    Verspoor Karin

    2012-08-01

    Full Text Available Abstract Background We introduce the linguistic annotation of a corpus of 97 full-text biomedical publications, known as the Colorado Richly Annotated Full Text (CRAFT) corpus. We further assess the performance of existing tools for performing sentence splitting, tokenization, syntactic parsing, and named entity recognition on this corpus. Results Many biomedical natural language processing systems demonstrated large differences between their previously published results and their performance on the CRAFT corpus when tested with the publicly available models or rule sets. Trainable systems differed widely with respect to their ability to build high-performing models based on this data. Conclusions The finding that some systems were able to train high-performing models based on this corpus is additional evidence, beyond high inter-annotator agreement, that the quality of the CRAFT corpus is high. The overall poor performance of various systems indicates that considerable work needs to be done to enable natural language processing systems to work well when the input is full-text journal articles. The CRAFT corpus provides a valuable resource to the biomedical natural language processing community for evaluation and training of new models for biomedical full text publications.

  16. Text mining analysis of public comments regarding high-level radioactive waste disposal

    International Nuclear Information System (INIS)

    In order to narrow the risk perception gap seen in social surveys between the general public and people involved in the nuclear industry, public comments on high-level radioactive waste (HLW) disposal were analyzed to find the significant talking points with the general public for constructing an effective model for communicating social risk information regarding HLW disposal. Text mining was introduced to examine the public comments and identify the core public interests underlying them. The text mining method used clusters specific groups of words with negative meanings and then analyzes public understanding by employing text structural analysis to extract words from subjective expressions. Using these procedures, it was found that the public does not trust the nuclear fuel cycle promotion policy and shows signs of anxiety about the long-lasting technological reliability of waste storage. To develop effective social risk communication of HLW issues, these findings are expected to help experts in the nuclear industry communicate with the general public more effectively and obtain their trust. (author)

  17. Identifying Understudied Nuclear Reactions by Text-mining the EXFOR Experimental Nuclear Reaction Library

    Science.gov (United States)

    Hirdt, J. A.; Brown, D. A.

    2016-01-01

    The EXFOR library contains the largest collection of experimental nuclear reaction data available as well as the data's bibliographic information and experimental details. We text-mined the REACTION and MONITOR fields of the ENTRYs in the EXFOR library in order to identify understudied reactions and quantities. Using the results of the text-mining, we created an undirected graph from the EXFOR datasets with each graph node representing a single reaction and quantity and graph links representing the various types of connections between these reactions and quantities. This graph is an abstract representation of the connections in EXFOR, similar to graphs of social networks, authorship networks, etc. We use various graph theoretical tools to identify important yet understudied reactions and quantities in EXFOR. Although we identified a few cross sections relevant for shielding applications and isotope production, mostly we identified charged particle fluence monitor cross sections. As a side effect of this work, we learn that our abstract graph is typical of other real-world graphs.
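
    A toy version of the graph construction and ranking is sketched below with networkx; the reactions, links and ENTRY counts are invented, and the "important but understudied" score is only one plausible way to combine graph centrality with measurement counts.

```python
# Sketch of the graph-based view: reaction/quantity nodes, links between them, and a
# simple "important but understudied" score. All data and weighting are invented.
import networkx as nx

# (reaction/quantity, number of EXFOR ENTRYs measuring it) -- invented counts
entry_counts = {
    "27Al(n,a) SIG": 120, "197Au(n,g) SIG": 300, "58Ni(n,p) SIG": 90,
    "55Mn(n,g) SIG": 40,  "natC(n,el) DA": 15,
}
# Links, e.g. one reaction used as a MONITOR for another -- also invented
links = [
    ("27Al(n,a) SIG", "197Au(n,g) SIG"),
    ("58Ni(n,p) SIG", "197Au(n,g) SIG"),
    ("58Ni(n,p) SIG", "27Al(n,a) SIG"),
    ("55Mn(n,g) SIG", "197Au(n,g) SIG"),
    ("natC(n,el) DA", "55Mn(n,g) SIG"),
]

g = nx.Graph(links)
centrality = nx.betweenness_centrality(g)

# Rank nodes that are structurally important yet backed by few measurements.
ranked = sorted(entry_counts, key=lambda r: centrality[r] / (1 + entry_counts[r]), reverse=True)
for r in ranked:
    print(f"{r:18s} centrality={centrality[r]:.2f} entries={entry_counts[r]}")
```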

  18. DDMGD: the database of text-mined associations between genes methylated in diseases from different species

    KAUST Repository

    Raies, A. B.

    2014-11-14

    Gathering information about associations between methylated genes and diseases is important for disease diagnosis and treatment decisions. Recent advancements in epigenetics research allow for large-scale discoveries of associations of genes methylated in diseases in different species. Searching manually for such information is not easy, as it is scattered across a large number of electronic publications and repositories. Therefore, we developed the DDMGD database (http://www.cbrc.kaust.edu.sa/ddmgd/) to provide a comprehensive repository of information related to genes methylated in diseases that can be found through text mining. DDMGD's scope is not limited to a particular group of genes, diseases or species. Using the text mining system DEMGD we developed earlier and additional post-processing, we extracted associations of genes methylated in different diseases from PubMed Central articles and PubMed abstracts. The accuracy of extracted associations is 82% as estimated on 2500 hand-curated entries. DDMGD provides a user-friendly interface facilitating retrieval of these associations ranked according to confidence scores. Submission of new associations to DDMGD is provided. A comparison analysis of DDMGD with several other databases focused on genes methylated in diseases shows that DDMGD is comprehensive and includes most of the recent information on genes methylated in diseases.

  19. Text mining of the classical medical literature for medicines that show potential in diabetic nephropathy.

    Science.gov (United States)

    Zhang, Lei; Li, Yin; Guo, Xinfeng; May, Brian H; Xue, Charlie C L; Yang, Lihong; Liu, Xusheng

    2014-01-01

    Objectives. To apply modern text-mining methods to identify candidate herbs and formulae for the treatment of diabetic nephropathy. Methods. The method we developed includes three steps: (1) identification of candidate ancient terms; (2) systemic search and assessment of medical records written in classical Chinese; (3) preliminary evaluation of the effect and safety of candidates. Results. Ancient terms Xia Xiao, Shen Xiao, and Xiao Shen were determined as the most likely to correspond with diabetic nephropathy and used in text mining. A total of 80 Chinese formulae for treating conditions congruent with diabetic nephropathy recorded in medical books from Tang Dynasty to Qing Dynasty were collected. Sao si tang (also called Reeling Silk Decoction) was chosen to show the process of preliminary evaluation of the candidates. It had promising potential for development as new agent for the treatment of diabetic nephropathy. However, further investigations about the safety to patients with renal insufficiency are still needed. Conclusions. The methods developed in this study offer a targeted approach to identifying traditional herbs and/or formulae as candidates for further investigation in the search for new drugs for modern disease. However, more effort is still required to improve our techniques, especially with regard to compound formulae. PMID:24744808

  20. Texts and data mining and their possibilities applied to the process of news production

    Directory of Open Access Journals (Sweden)

    Walter Teixeira Lima Jr

    2008-06-01

    Full Text Available This essay discusses the challenges of representing, in a formal computational process, the knowledge that the journalist uses to articulate news values for the purpose of selecting and ranking news. It discusses how to bridge this empirically acquired knowledge with the foundations of computational science in the areas of storage, retrieval and linking of data in a database, which must reflect the way human brains treat information obtained through the sensory system. Systematizing and automating part of the journalistic process in a database contributes to eliminating distortions and faults and to applying, in an efficient manner, Data and/or Text Mining techniques which, by definition, permit the discovery of nontrivial relations.

  1. DiMeX: A Text Mining System for Mutation-Disease Association Extraction.

    Science.gov (United States)

    Mahmood, A S M Ashique; Wu, Tsung-Jung; Mazumder, Raja; Vijay-Shanker, K

    2016-01-01

    The number of published articles describing associations between mutations and diseases is increasing at a fast pace. There is a pressing need to gather such mutation-disease associations into public knowledge bases, but manual curation slows down the growth of such databases. We have addressed this problem by developing a text-mining system (DiMeX) to extract mutation to disease associations from publication abstracts. DiMeX consists of a series of natural language processing modules that preprocess input text and apply syntactic and semantic patterns to extract mutation-disease associations. DiMeX achieves high precision and recall with F-scores of 0.88, 0.91 and 0.89 when evaluated on three different datasets for mutation-disease associations. DiMeX includes a separate component that extracts mutation mentions in text and associates them with genes. This component has been also evaluated on different datasets and shown to achieve state-of-the-art performance. The results indicate that our system outperforms the existing mutation-disease association tools, addressing the low precision problems suffered by most approaches. DiMeX was applied on a large set of abstracts from Medline to extract mutation-disease associations, as well as other relevant information including patient/cohort size and population data. The results are stored in a database that can be queried and downloaded at http://biotm.cis.udel.edu/dimex/. We conclude that this high-throughput text-mining approach has the potential to significantly assist researchers and curators to enrich mutation databases. PMID:27073839
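
    A deliberately simplistic sketch of spotting protein-level mutation mentions and pairing them with disease terms in the same sentence is shown below; it is not the DiMeX pipeline, and the patterns, disease list and abstract are invented.

```python
# Rough sketch of spotting protein-level mutation mentions (e.g. V600E, p.R175H) and
# pairing them with disease terms found in the same sentence. Not the DiMeX system;
# the sentence splitting, disease list and patterns here are deliberately simplistic.
import re

MUTATION = re.compile(r"\b(?:p\.)?([ACDEFGHIKLMNPQRSTVWY])(\d+)([ACDEFGHIKLMNPQRSTVWY])\b")
DISEASES = ["melanoma", "colorectal cancer", "lung cancer"]

abstract = ("The BRAF V600E mutation was detected in 45 of 60 melanoma samples. "
            "In contrast, p.G12D in KRAS was frequent in colorectal cancer.")

for sentence in re.split(r"(?<=\.)\s+", abstract):
    mutations = ["".join(m) for m in MUTATION.findall(sentence)]
    diseases = [d for d in DISEASES if d in sentence.lower()]
    for mut in mutations:
        for dis in diseases:
            print(f"{mut} <-> {dis}")
```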

  2. Automated extraction of precise protein expression patterns in lymphoma by text mining abstracts of immunohistochemical studies

    Directory of Open Access Journals (Sweden)

    Jia-Fu Chang

    2013-01-01

    Full Text Available Background: In general, surgical pathology reviews report protein expression by tumors in a semi-quantitative manner, that is, -, -/+, +/-, +. At the same time, the experimental pathology literature provides multiple examples of precise expression levels determined by immunohistochemical (IHC) tissue examination of populations of tumors. Natural language processing (NLP) techniques enable the automated extraction of such information through text mining. We propose establishing a database linking quantitative protein expression levels with specific tumor classifications through NLP. Materials and Methods: Our method takes advantage of typical forms of representing experimental findings in terms of percentages of protein expression manifest by the tumor population under study. Characteristically, percentages are represented straightforwardly with the % symbol or as the number of positive findings of the total population. Such text is readily recognized using regular expressions and templates permitting extraction of sentences containing these forms for further analysis using grammatical structures and rule-based algorithms. Results: Our pilot study is limited to the extraction of such information related to lymphomas. We achieved a satisfactory level of retrieval as reflected in scores of 69.91% precision and 57.25% recall with an F-score of 62.95%. In addition, we demonstrate the utility of a web-based curation tool for confirming and correcting our findings. Conclusions: The experimental pathology literature represents a rich source of pathobiological information, which has been relatively underutilized. There has been a combinatorial explosion of knowledge within the pathology domain as represented by increasing numbers of immunophenotypes and disease subclassifications. NLP techniques support practical text mining techniques for extracting this knowledge and organizing it in forms appropriate for pathology decision support systems.
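
    The two textual forms mentioned (a "%" figure or "k of n" positive cases) can be captured with simple regular expressions, as in the sketch below; the sentences and patterns are invented and far less robust than the system described.

```python
# Toy extraction of protein-expression percentages from IHC-style sentences using the
# two typical textual forms (a "%" figure, or "k of n cases"). Sentences are invented.
import re

sentences = [
    "CD30 was expressed in 85% of the anaplastic large cell lymphoma cases.",
    "BCL2 staining was positive in 14 of 20 follicular lymphoma samples.",
    "No consistent pattern of CD5 expression was observed.",
]

percent_form = re.compile(r"(?P<protein>[A-Z][A-Z0-9]{1,9})\b.*?(?P<pct>\d{1,3})\s*%")
ratio_form = re.compile(r"(?P<protein>[A-Z][A-Z0-9]{1,9})\b.*?(?P<pos>\d+)\s+of\s+(?P<total>\d+)")

for s in sentences:
    m = percent_form.search(s)
    if m:
        print(m["protein"], f"{m['pct']}%")
        continue
    m = ratio_form.search(s)
    if m:
        pct = 100 * int(m["pos"]) / int(m["total"])
        print(m["protein"], f"{pct:.0f}% ({m['pos']}/{m['total']})")
```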

  3. Biomedical Science, Unit III: The Circulatory System in Health and Science. The Heart and Blood Vessels; Blood and Its Properties; The Urinary Tract. Student Text. Revised Version, 1976.

    Science.gov (United States)

    Biomedical Interdisciplinary Curriculum Project, Berkeley, CA.

    This student text presents instructional materials for a unit of science within the Biomedical Interdisciplinary Curriculum Project (BICP), a two-year interdisciplinary precollege curriculum aimed at preparing high school students for entry into college and vocational programs leading to a career in the health field. Lessons concentrate on the…

  4. PreBIND and Textomy – mining the biomedical literature for protein-protein interactions using a support vector machine

    OpenAIRE

    Baskin Berivan; Zhang Shudong; Tuekam Brigitte; Lay Vicki; Wolting Cheryl; de Bruijn Berry; Martin Joel; Donaldson Ian; Bader Gary D; Michalickova Katerina; Pawson Tony; Hogue Christopher WV

    2003-01-01

    Abstract Background The majority of experimentally verified molecular interaction and biological pathway data are present in the unstructured text of biomedical journal articles where they are inaccessible to computational methods. The Biomolecular interaction network database (BIND) seeks to capture these data in a machine-readable format. We hypothesized that the formidable task-size of backfilling the database could be reduced by using Support Vector Machine technology to first locate inte...

  5. A multilingual text mining based content gathering system for open source intelligence

    International Nuclear Information System (INIS)

    The number of documents available in electronic format has grown dramatically in recent years, whilst the information that States provide to the IAEA is not always complete or clear. Independent information sources can balance the limited State-reported information, particularly in relation to non-cooperative targets. The process of accessing all these raw data, heterogeneous in both source and language, and transforming them into information is therefore inextricably linked to the concepts of automatic textual analysis and synthesis, hinging greatly on the ability to master the problems of multilinguality. This paper describes a multilingual indexing, searching and clustering system whose main goal is managing huge collections of data coming from different and geographically distributed information sources, providing language independent searches and dynamic classification facilities. The automatic linguistic analysis of documents is based on morpho-syntactic, functional and statistical criteria. This phase is intended to identify only the significant expressions from the whole raw text: the system analyzes each sentence, cycling through all the possible sentence constructions. Using a series of word relationship tests to establish context, the system tries to determine the meaning of the sentence. Once reduced to its part-of-speech- and functionally-tagged base form and mapped to its language-independent entry in a sectorial multilingual dictionary, each tagged lemma is used as a descriptor and a possible seed for clustering. The automatic classification of results is performed by an unsupervised classification schema. Through multilingual text mining, analysts can get an overview of large volumes of textual data in a highly readable grid, which helps them discover meaningful similarities among documents and find any nuclear proliferation and safeguards related information. Multilingual text mining makes it possible to overcome linguistic barriers, allowing the automatic

  6. MET network in PubMed: a text-mined network visualization and curation system.

    Science.gov (United States)

    Dai, Hong-Jie; Su, Chu-Hsien; Lai, Po-Ting; Huang, Ming-Siang; Jonnagaddala, Jitendra; Rose Jue, Toni; Rao, Shruti; Chou, Hui-Jou; Milacic, Marija; Singh, Onkar; Syed-Abdul, Shabbir; Hsu, Wen-Lian

    2016-01-01

    Metastasis is the dissemination of a cancer/tumor from one organ to another, and it is the most dangerous stage during cancer progression, causing more than 90% of cancer deaths. Improving the understanding of the complicated cellular mechanisms underlying metastasis requires investigations of the signaling pathways. To this end, we developed a METastasis (MET) network visualization and curation tool to assist metastasis researchers retrieve network information of interest while browsing through the large volume of studies in PubMed. MET can recognize relations among genes, cancers, tissues and organs of metastasis mentioned in the literature through text-mining techniques, and then produce a visualization of all mined relations in a metastasis network. To facilitate the curation process, MET is developed as a browser extension that allows curators to review and edit concepts and relations related to metastasis directly in PubMed. PubMed users can also view the metastatic networks integrated from the large collection of research papers directly through MET. For the BioCreative 2015 interactive track (IAT), a curation task was proposed to curate metastatic networks among PubMed abstracts. Six curators participated in the proposed task and a post-IAT task, curating 963 unique metastatic relations from 174 PubMed abstracts using MET.Database URL: http://btm.tmu.edu.tw/metastasisway. PMID:27242035

  8. Sequential Data Mining for Information Extraction from Texts

    Directory of Open Access Journals (Sweden)

    Thierry Charnois

    2010-09-01

    Full Text Available This paper shows the benefit of using data mining methods for biological natural language processing. A method for discovering linguistic patterns based on recursive sequential pattern mining is proposed. It requires neither sentence parsing nor any resource other than a training data set. It produces understandable results, and we show its usefulness for the extraction of relations between named entities. For the named entity recognition problem, we propose a method based on a new kind of pattern that takes into account the sequence and its context.

  9. A Distributed Look-up Architecture for Text Mining Applications using MapReduce.

    Science.gov (United States)

    Balkir, Atilla Soner; Foster, Ian; Rzhetsky, Andrey

    2011-11-01

    Text mining applications typically involve statistical models that require accessing and updating model parameters in an iterative fashion. With the growing size of the data, such models become extremely parameter rich, and naive parallel implementations fail to address the scalability problem of maintaining a distributed look-up table that maps model parameters to their values. We evaluate several existing alternatives to provide coordination among worker nodes in Hadoop [11] clusters, and suggest a new multi-layered look-up architecture that is specifically optimized for certain problem domains. Our solution exploits the power-law distribution characteristics of the phrase or n-gram counts in large corpora while utilizing a Bloom Filter [2], in-memory cache, and an HBase [12] cluster at varying levels of abstraction. PMID:25356441
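
    As a rough illustration of the layering idea described above, the sketch below combines a hand-rolled Bloom filter, a small in-memory cache for the frequent head of the n-gram distribution, and a plain dictionary standing in for the HBase layer. It is a single-process toy on invented counts, not the authors' Hadoop implementation.

```python
# Multi-tier parameter look-up sketch: Bloom filter -> RAM cache -> backing store.
import hashlib
from collections import Counter

class BloomFilter:
    def __init__(self, size_bits=1 << 20, num_hashes=4):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8 + 1)

    def _positions(self, key):
        for i in range(self.num_hashes):
            digest = hashlib.md5(f"{i}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, key):
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key))

class TieredLookup:
    """Look up n-gram counts through a Bloom filter, then a RAM cache, then the store."""
    def __init__(self, counts, cache_fraction=0.1):
        self.store = dict(counts)                    # stand-in for the HBase layer
        self.bloom = BloomFilter()
        for ngram in self.store:
            self.bloom.add(ngram)
        # Cache only the most frequent n-grams (the head of the power-law distribution).
        top = Counter(self.store).most_common(max(1, int(len(self.store) * cache_fraction)))
        self.cache = dict(top)

    def count(self, ngram):
        if ngram not in self.bloom:      # definitely unseen: no remote call needed
            return 0
        if ngram in self.cache:          # hot path: served from memory
            return self.cache[ngram]
        return self.store.get(ngram, 0)  # cold path: hit the backing store

# Tiny usage example with made-up counts.
counts = {"protein kinase": 120, "gene expression": 95, "rare phrase": 1}
lookup = TieredLookup(counts)
print(lookup.count("protein kinase"), lookup.count("never seen"))
```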

  10. Natural products for chronic cough: Text mining the East Asian historical literature for future therapeutics.

    Science.gov (United States)

    Shergis, Johannah Linda; Wu, Lei; May, Brian H; Zhang, Anthony Lin; Guo, Xinfeng; Lu, Chuanjian; Xue, Charlie Changli

    2015-08-01

    Chronic cough is a significant health burden. Patients experience variable benefits from over the counter and prescribed products, but there is an unmet need to provide more effective treatments. Natural products have been used to treat cough and some plant compounds such as pseudoephedrine from ephedra and codeine from opium poppy have been developed into drugs. Text mining historical literature may offer new insight for future therapeutic development. We identified natural products used in the East Asian historical literature to treat chronic cough. Evaluation of the historical literature revealed 331 natural products used to treat chronic cough. Products included plants, minerals and animal substances. These natural products were found in 75 different books published between AD 363 and 1911. Of the 331 products, the 10 most frequently and continually used products were examined, taking into consideration findings from contemporary experimental studies. The natural products identified are promising and offer new directions in therapeutic development for treating chronic cough. PMID:25901012

  11. Developing timely insights into comparative effectiveness research with a text-mining pipeline.

    Science.gov (United States)

    Chang, Meiping; Chang, Man; Reed, Jane Z; Milward, David; Xu, Jinghai James; Cornell, Wendy D

    2016-03-01

    Comparative effectiveness research (CER) provides evidence for the relative effectiveness and risks of different treatment options and informs decisions made by healthcare providers, payers, and pharmaceutical companies. CER data come from retrospective analyses as well as prospective clinical trials. Here, we describe the development of a text-mining pipeline based on natural language processing (NLP) that extracts key information from three different trial data sources: NIH ClinicalTrials.gov, WHO International Clinical Trials Registry Platform (ICTRP), and Citeline Trialtrove. The pipeline leverages tailored terminologies to produce an integrated and structured output, capturing any trials in which pharmaceutical products of interest are compared with another therapy. The timely information alerts generated by this system provide the earliest and most complete picture of emerging clinical research. PMID:26854423

  12. A text mining approach to the prediction of disease status from clinical discharge summaries.

    Science.gov (United States)

    Yang, Hui; Spasic, Irena; Keane, John A; Nenadic, Goran

    2009-01-01

    OBJECTIVE The authors present a system developed for the Challenge in Natural Language Processing for Clinical Data (the i2b2 obesity challenge), whose aim was to automatically identify the status of obesity and 15 related co-morbidities in patients using their clinical discharge summaries. The challenge consisted of two tasks, textual and intuitive. The textual task was to identify explicit references to the diseases, whereas the intuitive task focused on the prediction of the disease status when the evidence was not explicitly asserted. DESIGN The authors assembled a set of resources to lexically and semantically profile the diseases and their associated symptoms, treatments, etc. These features were explored in a hybrid text mining approach, which combined dictionary look-up, rule-based, and machine-learning methods. MEASUREMENTS The methods were applied on a set of 507 previously unseen discharge summaries, and the predictions were evaluated against a manually prepared gold standard. The overall ranking of the participating teams was primarily based on the macro-averaged F-measure. RESULTS The implemented method achieved a macro-averaged F-measure of 81% for the textual task (the highest achieved in the challenge) and 63% for the intuitive task (ranked 7th out of 28 teams; the highest was 66%). The micro-averaged F-measure showed an average accuracy of 97% for textual and 96% for intuitive annotations. CONCLUSIONS The performance achieved was in line with the agreement between human annotators, indicating the potential of text mining for accurate and efficient prediction of disease statuses from clinical discharge summaries. PMID:19390098
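
    The ranking measures named in this abstract can be reproduced with standard library calls. The snippet below contrasts macro- and micro-averaged F-measures on invented disease-status labels (Y/N/Q/U); it is a generic illustration, not the challenge's official scorer.

```python
from sklearn.metrics import f1_score

# Hypothetical gold and predicted disease-status labels (Y/N/Q/U).
y_true = ["Y", "N", "N", "Q", "Y", "U", "N", "Y"]
y_pred = ["Y", "N", "Y", "Q", "Y", "U", "N", "N"]

macro_f1 = f1_score(y_true, y_pred, average="macro")  # every class weighted equally
micro_f1 = f1_score(y_true, y_pred, average="micro")  # every decision weighted equally

print(f"macro-F1 = {macro_f1:.2f}, micro-F1 = {micro_f1:.2f}")
```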

  13. Effective Mining of Protein Interactions

    OpenAIRE

    Rinaldi, F; Schneider, G; Kaljurand, K.; Clematide, S

    2009-01-01

    The detection of mentions of protein-protein interactions in the scientific literature has recently emerged as a core task in biomedical text mining. We present effective techniques for this task, which have been developed using the IntAct database as a gold standard, and have been evaluated in two text mining competitions.

  14. Practice-based evidence: profiling the safety of cilostazol by text-mining of clinical notes.

    Directory of Open Access Journals (Sweden)

    Nicholas J Leeper

    Full Text Available BACKGROUND: Peripheral arterial disease (PAD) is a growing problem with few available therapies. Cilostazol is the only FDA-approved medication with a class I indication for intermittent claudication, but carries a black box warning due to concerns for increased cardiovascular mortality. To assess the validity of this black box warning, we employed a novel text-analytics pipeline to quantify the adverse events associated with Cilostazol use in a clinical setting, including patients with congestive heart failure (CHF). METHODS AND RESULTS: We analyzed the electronic medical records of 1.8 million subjects from the Stanford clinical data warehouse spanning 18 years using a novel text-mining/statistical analytics pipeline. We identified 232 PAD patients taking Cilostazol and created a control group of 1,160 PAD patients not taking this drug using 1:5 propensity-score matching. Over a mean follow-up of 4.2 years, we observed no association between Cilostazol use and any major adverse cardiovascular event including stroke (OR = 1.13, CI [0.82, 1.55]), myocardial infarction (OR = 1.00, CI [0.71, 1.39]), or death (OR = 0.86, CI [0.63, 1.18]). Cilostazol was not associated with an increase in any arrhythmic complication. We also identified a subset of CHF patients who were prescribed Cilostazol despite its black box warning, and found that it did not increase mortality in this high-risk group of patients. CONCLUSIONS: This proof-of-principle study shows the potential of text-analytics to mine clinical data warehouses to uncover 'natural experiments' such as the use of Cilostazol in CHF patients. We envision this method will have broad applications for examining difficult-to-test clinical hypotheses and to aid in post-marketing drug safety surveillance. Moreover, our observations argue for a prospective study to examine the validity of a drug safety warning that may be unnecessarily limiting the use of an efficacious therapy.
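
    The matching step described above can be sketched as follows: estimate a propensity score with logistic regression and greedily pair each treated patient with the five untreated patients whose scores are closest. The data are simulated, and the greedy matcher is a simplification of whatever matching routine the authors actually used.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 2000
covariates = rng.normal(size=(n, 4))                    # e.g. age and comorbidity scores
p_treat = 1 / (1 + np.exp(-(covariates[:, 0] - 3)))     # confounded, fairly rare treatment
treated = rng.binomial(1, p_treat)

# Step 1: propensity scores P(treated | covariates).
propensity = LogisticRegression(max_iter=1000).fit(covariates, treated).predict_proba(covariates)[:, 1]

# Step 2: greedy 1:5 nearest-neighbour matching on the propensity score.
treated_idx = np.where(treated == 1)[0]
controls = list(np.where(treated == 0)[0])
matches = {}
for t in treated_idx:
    controls.sort(key=lambda c: abs(propensity[c] - propensity[t]))
    matches[t] = controls[:5]          # the five closest untreated patients
    controls = controls[5:]            # match without replacement

matched_controls = sum(len(v) for v in matches.values())
print(f"{len(matches)} treated patients matched to {matched_controls} controls")
```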

  15. Development of Workshops on Biodiversity and Evaluation of the Educational Effect by Text Mining Analysis

    Science.gov (United States)

    Baba, R.; Iijima, A.

    2014-12-01

    Conservation of biodiversity is one of the key issues in environmental studies. As a means to address this issue, education is becoming increasingly important. In previous work, we developed a course of workshops on the conservation of biodiversity. To disseminate the course as a tool for environmental education, determining its educational effect is essential. Text mining enables analyses of the frequency and co-occurrence of words in freely described texts. This study evaluates the effect of the workshop by using text mining techniques. We hosted the originally developed workshop on the conservation of biodiversity for 22 college students. The aim of the workshop was to convey the definition of biodiversity. Generally, biodiversity refers to the diversity of ecosystems, diversity between species, and diversity within species. To facilitate discussion, supplementary materials were used. For instance, field guides of wildlife species were used to discuss the diversity of ecosystems, a hierarchical framework in an ecological pyramid was shown to explain the role of diversity between species, and a document on the historical Potato Famine in Ireland was offered to discuss diversity within species from the genetic viewpoint. Before and after the workshop, we asked the students for free descriptions of the definition of biodiversity and analyzed them using Tiny Text Miner, which enables Japanese-language morphological analysis. Frequently used words were sorted into categories, and a principal component analysis was carried out. After the workshop, the frequency of words tagged to diversity between species and diversity within species increased significantly. In the principal component analysis, the first component consists of words such as producer, consumer, decomposer, and food chain. This indicates that the students have comprehended the close relationship between
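
    A minimal version of this frequency-plus-PCA analysis, on invented English responses rather than the Japanese free descriptions, might look like the following.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import PCA

answers = [
    "biodiversity means many kinds of ecosystems and species",         # before workshop
    "biodiversity means diversity of ecosystems only",                 # before workshop
    "producer consumer decomposer form a food chain between species",  # after workshop
    "diversity within species and genetic variation matter",           # after workshop
]

# Document-term matrix of word frequencies, then a 2-component PCA over it.
counts = CountVectorizer().fit_transform(answers).toarray()
components = PCA(n_components=2).fit_transform(counts)

for text, (pc1, pc2) in zip(answers, components):
    print(f"PC1 = {pc1:+.2f}, PC2 = {pc2:+.2f} | {text}")
```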

  16. A methodology for semiautomatic taxonomy of concepts extraction from nuclear scientific documents using text mining techniques

    International Nuclear Information System (INIS)

    This thesis presents a text mining method for the semi-automatic extraction of a taxonomy of concepts from a textual corpus composed of scientific papers related to the nuclear area. Text classification is a natural human practice and a crucial task when working with large repositories. Document clustering techniques provide a logical and understandable framework that facilitates organization, browsing and searching. Most clustering algorithms use the bag-of-words model to represent the content of a document. This model generates highly dimensional data, ignores the fact that different words can have the same meaning, and does not consider the relationships between words, assuming that they are independent of each other. The methodology combines a concept-based model of document representation with a hierarchical document clustering method based on the frequency of co-occurring concepts, and a technique for labeling clusters with their most representative concepts, with the objective of producing a taxonomy of concepts that reflects the structure of the knowledge domain. It is hoped that this work will contribute to the conceptual mapping of the scientific production of the nuclear area and thus support the management of research activities in this area. (author)
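
    The pipeline outlined in this abstract can be approximated in a few lines: represent documents by concept counts, cluster them hierarchically on co-occurrence-based distances, and label each cluster by its most frequent concepts. The concepts and counts below are invented, and the average-linkage/cosine-distance choices are assumptions rather than the thesis' exact settings.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

concepts = ["reactor", "neutron flux", "fuel cycle", "dosimetry", "radiation protection"]
# Rows = documents, columns = how often each concept occurs in the document.
doc_concept = np.array([
    [3, 2, 1, 0, 0],
    [2, 3, 2, 0, 0],
    [0, 0, 0, 4, 3],
    [0, 1, 0, 3, 4],
    [4, 1, 2, 0, 0],
])

# Hierarchical (average-linkage) clustering on cosine distances between documents.
distances = pdist(doc_concept, metric="cosine")
tree = linkage(distances, method="average")
labels = fcluster(tree, t=2, criterion="maxclust")   # cut the tree into 2 clusters

# Label each cluster with its two most frequent concepts.
for cluster_id in sorted(set(labels)):
    totals = doc_concept[labels == cluster_id].sum(axis=0)
    top = [concepts[i] for i in np.argsort(totals)[::-1][:2]]
    members = np.where(labels == cluster_id)[0].tolist()
    print(f"cluster {cluster_id}: documents {members}, label = {top}")
```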

  17. Building Classification System to Predict Risk factors of Diabetic Retinopathy Using Text mining

    Directory of Open Access Journals (Sweden)

    T.Sudha,

    2010-10-01

    Full Text Available Making medical decisions, such as diagnosing the diseases that cause a patient’s illness, is often a complex task. Diabetic retinopathy is one of the complications of diabetes and one of the most common causes of blindness. Unfortunately, in many cases the patient is not aware of any symptoms until it is too late for effective treatment. Analysis of the evoked potential response of the retina, the optic nerve and the optical brain centre paves the way for early diagnosis of diabetic retinopathy and for prognosis during the treatment process. The objective of this study is to identify the prevalence and severity of diabetic retinopathy and to determine the relationship between risk factors and the prevalence and severity of diabetic retinopathy. We collected the histories of 3,450 patients suffering from type 2 diabetes. As the available data are not in a structured format, we apply a text mining classification technique to predict the risk factors of diabetic retinopathy. This study shows that a relatively short duration of case management instituted before the onset of clinically identifiable retinopathy significantly reduces the risk of developing retinopathy in patients with type 2 diabetes. A total of 1,402 patients (39.8%) had evidence of retinopathy, comprising 32% at the initial stage of DR, 20% with retinal haemorrhages, 14% with mild non-proliferative diabetic retinopathy, 18% with moderate non-proliferative DR, 1% with proliferative DR and 14% at high risk.

  18. Use of a Web Crawler to Collect Tweets with a Text Mining Pre-Processing Method

    Directory of Open Access Journals (Sweden)

    Bayu Rima Aditya

    2015-11-01

    Full Text Available The amount of data on social media is now very large, but it is still rarely exploited or processed into something of practical value; one example is the tweets on the social network Twitter. This paper describes the results of using a web crawler engine together with a text mining pre-processing method. The web crawler engine collects tweets through the Twitter API as unstructured text data, which are then presented again in web form. The pre-processing method filters the tweets in three stages: cleansing, case folding, and parsing. The application designed in this study follows the waterfall software development model and is implemented in the PHP programming language, and black-box testing was used to check whether the implemented design runs as expected. The result of this research is an application that turns the collected tweets into data ready for further processing according to user needs, based on search keywords and dates. This was done because several related studies show that social media data, especially from Twitter, has become a target for companies and institutions seeking to understand public opinion.
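
    The three pre-processing stages named in the abstract (cleansing, case folding and parsing) translate directly into code. The sketch below uses Python rather than the paper's PHP, and its cleansing rules are illustrative assumptions, not the paper's exact ones.

```python
import re

def cleanse(tweet: str) -> str:
    """Remove URLs, user mentions, hashtag markers and non-alphanumeric noise."""
    tweet = re.sub(r"https?://\S+", " ", tweet)    # URLs
    tweet = re.sub(r"@\w+", " ", tweet)            # mentions
    tweet = re.sub(r"#", " ", tweet)               # keep hashtag text, drop '#'
    tweet = re.sub(r"[^0-9A-Za-z\s]", " ", tweet)  # punctuation and symbols
    return re.sub(r"\s+", " ", tweet).strip()

def case_fold(tweet: str) -> str:
    """Normalize everything to lower case."""
    return tweet.lower()

def parse(tweet: str) -> list:
    """Split the cleaned tweet into tokens for later analysis."""
    return tweet.split()

raw = "Check out our #TextMining demo @someuser -> https://example.com !!!"
tokens = parse(case_fold(cleanse(raw)))
print(tokens)   # ['check', 'out', 'our', 'textmining', 'demo']
```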

  19. Interpretation of the consequences of mutations in protein kinases: combined use of bioinformatics and text mining

    Directory of Open Access Journals (Sweden)

    Jose M.G. Izarzugaza

    2012-08-01

    Full Text Available Protein kinases play a crucial role in a plethora of significant physiological functions, and a number of mutations in this superfamily have been reported in the literature to disrupt protein structure and/or function. Computational and experimental research aims to discover the mechanistic connection between mutations in protein kinases and disease, with the final aim of predicting the consequences of mutations on protein function and the subsequent phenotypic alterations. In this chapter, we review the possibilities and limitations of current computational methods for the prediction of the pathogenicity of mutations in the protein kinase superfamily. In particular we focus on the problem of benchmarking the predictions with independent gold-standard datasets. We propose a pipeline for the curation of mutations automatically extracted from the literature. Since many of these mutations are not included in the databases that are commonly used to train the computational methods that predict the pathogenicity of protein kinase mutations, we propose using them to build a valuable gold-standard dataset for benchmarking a number of these predictors. Finally, we discuss how text mining approaches constitute a powerful tool for the interpretation of the consequences of mutations in the context of personalized/stratified medicine.

  20. Incorporating domain knowledge in chemical and biomedical named entity recognition with word representations

    OpenAIRE

    Munkhdalai, Tsendsuren; Li, Meijing; Batsuren, Khuyagbaatar; Park, Hyeon Ah; Choi, Nak Hyeon; Ryu, Keun Ho

    2015-01-01

    Background Chemical and biomedical Named Entity Recognition (NER) is an essential prerequisite task before effective text mining can begin for biochemical-text data. Exploiting unlabeled text data to leverage system performance has been an active and challenging research topic in text mining due to the recent growth in the amount of biomedical literature. We present a semi-supervised learning method that efficiently exploits unlabeled data in order to incorporate domain knowledge into a named...

  1. A practical application of text mining to literature on cognitive rehabilitation and enhancement through neurostimulation

    Directory of Open Access Journals (Sweden)

    Puiu F Balan

    2014-09-01

    Full Text Available The exponential growth in publications represents a major challenge for researchers. Many scientific domains, including neuroscience, are not yet fully engaged in exploiting large bodies of publications. In this paper, we promote the idea of partially automating the processing of scientific documents, specifically using text mining (TM), to efficiently review big corpora of publications. The cognitive advantage given by TM is mainly related to the automatic extraction of relevant trends from corpora of literature that would otherwise be impossible to analyze in short periods of time. Specifically, the benefits of TM are increased speed, quality and reproducibility of text processing, boosted by rapid updates of the results. First, we selected a set of TM tools that allow user-friendly approaches to the scientific literature and that could serve as a guide for researchers willing to incorporate TM in their work. Second, we used these TM tools to obtain basic insights into the relevant literature on cognitive rehabilitation (CR) and cognitive enhancement (CE) using transcranial magnetic stimulation (TMS). TM readily extracted the diversity of TMS applications in CR and CE from vast corpora of publications, automatically retrieving trends already described in published reviews. TMS emerged as one of the important non-invasive tools that can both improve cognitive and motor functions in numerous neurological diseases and induce modulations/enhancements of many fundamental brain functions. TM also revealed trends in big corpora of publications by extracting the occurrence frequency and relationships of particular subtopics. Moreover, we showed that CR and CE share research topics, both aiming to increase the brain’s capacity to process information, thus supporting their integration in a larger perspective. Methodologically, despite the limitations of a simple user-friendly approach, TM served the reviewing process well.

  2. Statistical modeling of biomedical corpora: mining the Caenorhabditis Genetic Center Bibliography for genes related to life span

    Directory of Open Access Journals (Sweden)

    Jordan MI

    2006-05-01

    Full Text Available Abstract Background The statistical modeling of biomedical corpora could yield integrated, coarse-to-fine views of biological phenomena that complement discoveries made from analysis of molecular sequence and profiling data. Here, the potential of such modeling is demonstrated by examining the 5,225 free-text items in the Caenorhabditis Genetic Center (CGC) Bibliography using techniques from statistical information retrieval. Items in the CGC biomedical text corpus were modeled using the Latent Dirichlet Allocation (LDA) model. LDA is a hierarchical Bayesian model which represents a document as a random mixture over latent topics; each topic is characterized by a distribution over words. Results An LDA model estimated from CGC items had better predictive performance than two standard models (unigram and mixture of unigrams) trained using the same data. To illustrate the practical utility of LDA models of biomedical corpora, a trained CGC LDA model was used for a retrospective study of nematode genes known to be associated with life span modification. Corpus-, document-, and word-level LDA parameters were combined with terms from the Gene Ontology to enhance the explanatory value of the CGC LDA model, and to suggest additional candidates for age-related genes. A novel, pairwise document similarity measure based on the posterior distribution on the topic simplex was formulated and used to search the CGC database for "homologs" of a "query" document discussing the life span-modifying clk-2 gene. Inspection of these document homologs enabled and facilitated the production of hypotheses about the function and role of clk-2. Conclusion Like other graphical models for genetic, genomic and other types of biological data, LDA provides a method for extracting unanticipated insights and generating predictions amenable to subsequent experimental validation.
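
    A toy version of the modeling described above can be put together with standard libraries: fit an LDA model to a handful of abstracts and rank documents by the similarity of their inferred topic distributions. The mini-corpus is invented, and Jensen-Shannon distance is used here as a stand-in for the paper's posterior-simplex similarity measure.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from scipy.spatial.distance import jensenshannon

docs = [
    "clk-2 mutants show extended life span and slow development",
    "daf-16 regulates longevity and stress resistance in nematodes",
    "neuronal wiring and axon guidance in the developing worm",
    "insulin signalling modulates life span in C. elegans",
]

counts = CountVectorizer(stop_words="english").fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)
doc_topics = lda.transform(counts)        # rows sum to 1: P(topic | document)

# "Homologs" of a query document = documents with the closest topic distribution.
query = 0                                  # the clk-2 abstract
dists = [jensenshannon(doc_topics[query], doc_topics[i]) for i in range(len(docs))]
ranked = sorted(range(len(docs)), key=lambda i: dists[i])
print("documents ranked by similarity to the query:", ranked)
```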

  3. Text mining for neuroanatomy using WhiteText with an updated corpus and a new web application

    Directory of Open Access Journals (Sweden)

    Leon French

    2015-05-01

    Full Text Available We describe the WhiteText project, and its progress towards automatically extracting statements of neuroanatomical connectivity from text. We review progress to date on the three main steps of the project: recognition of brain region mentions, standardization of brain region mentions to neuroanatomical nomenclature, and connectivity statement extraction. We further describe a new version of our manually curated corpus that adds 2,111 connectivity statements from 1,828 additional abstracts. Cross-validation classification within the new corpus replicates results on our original corpus, recalling 51% of connectivity statements at 67% precision. The resulting merged corpus provides 5,208 connectivity statements that can be used to seed species-specific connectivity matrices and to better train automated techniques. Finally, we present a new web application that allows fast interactive browsing of the over 70,000 sentences indexed by the system, as a tool for accessing the data and assisting in further curation. Software and data are freely available at http://www.chibi.ubc.ca/WhiteText/.

  4. Mining

    Directory of Open Access Journals (Sweden)

    Khairullah Khan

    2014-09-01

    Full Text Available Opinion mining is an interesting area of research because of its applications in various fields. Collecting opinions of people about products and about social and political events and problems through the Web is becoming increasingly popular every day. The opinions of users are helpful for the public and for stakeholders when making certain decisions. Opinion mining is a way to retrieve information through search engines, Web blogs and social networks. Because of the huge number of reviews in the form of unstructured text, it is impossible to summarize the information manually. Accordingly, efficient computational methods are needed for mining and summarizing the reviews from corpuses and Web documents. This study presents a systematic literature survey regarding the computational techniques, models and algorithms for mining opinion components from unstructured reviews.

  5. METSP: A Maximum-Entropy Classifier Based Text Mining Tool for Transporter-Substrate Identification with Semistructured Text

    Directory of Open Access Journals (Sweden)

    Min Zhao

    2015-01-01

    Full Text Available The substrates of a transporter are not only useful for inferring the function of the transporter, but are also important for discovering compound-compound interactions and reconstructing metabolic pathways. Though plenty of data has been accumulated with the development of new technologies such as in vitro transporter assays, the search for substrates of transporters is far from complete. In this article, we introduce METSP, a maximum-entropy classifier devoted to retrieving transporter-substrate pairs (TSPs) from semistructured text. Based on the high quality annotation from UniProt, METSP achieves high precision and recall in cross-validation experiments. When METSP is applied to 182,829 human transporter annotation sentences in UniProt, it identifies 3942 sentences with transporter and compound information. Finally, 1547 confidential human TSPs are identified for further manual curation, among which 58.37% are pairs with novel substrates not annotated in public transporter databases. METSP is the first efficient tool to extract TSPs from semistructured annotation text in UniProt. This tool can help to determine the precise substrates and drugs of transporters, thus facilitating drug-target prediction, metabolic network reconstruction, and literature classification.
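
    Since a maximum-entropy classifier is equivalent to logistic regression, the core of such a system can be sketched as below. The sentences, labels and TF-IDF features are invented stand-ins for the UniProt annotation data and for METSP's actual feature set.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

sentences = [
    "Mediates the uptake of glucose into hepatocytes.",
    "Transports citrate across the inner mitochondrial membrane.",
    "Involved in cell adhesion and signal transduction.",
    "May play a role in embryonic development.",
]
labels = [1, 1, 0, 0]   # 1 = sentence carries transporter-substrate information

# Maximum entropy == logistic regression over text features.
maxent = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
maxent.fit(sentences, labels)

test = ["Catalyzes the transport of zinc ions out of the cytoplasm."]
print(maxent.predict(test), maxent.predict_proba(test))
```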

  6. Text mining for neuroanatomy using WhiteText with an updated corpus and a new web application.

    Science.gov (United States)

    French, Leon; Liu, Po; Marais, Olivia; Koreman, Tianna; Tseng, Lucia; Lai, Artemis; Pavlidis, Paul

    2015-01-01

    We describe the WhiteText project, and its progress towards automatically extracting statements of neuroanatomical connectivity from text. We review progress to date on the three main steps of the project: recognition of brain region mentions, standardization of brain region mentions to neuroanatomical nomenclature, and connectivity statement extraction. We further describe a new version of our manually curated corpus that adds 2,111 connectivity statements from 1,828 additional abstracts. Cross-validation classification within the new corpus replicates results on our original corpus, recalling 67% of connectivity statements at 51% precision. The resulting merged corpus provides 5,208 connectivity statements that can be used to seed species-specific connectivity matrices and to better train automated techniques. Finally, we present a new web application that allows fast interactive browsing of the over 70,000 sentences indexed by the system, as a tool for accessing the data and assisting in further curation. Software and data are freely available at http://www.chibi.ubc.ca/WhiteText/. PMID:26052282

  7. The Feasibility of Using Large-Scale Text Mining to Detect Adverse Childhood Experiences in a VA-Treated Population.

    Science.gov (United States)

    Hammond, Kenric W; Ben-Ari, Alon Y; Laundry, Ryan J; Boyko, Edward J; Samore, Matthew H

    2015-12-01

    Free text in electronic health records resists large-scale analysis. Text records facts of interest not found in encoded data, and text mining enables their retrieval and quantification. The U.S. Department of Veterans Affairs (VA) clinical data repository affords an opportunity to apply text-mining methodology to study clinical questions in large populations. To assess the feasibility of text mining, investigation of the relationship between exposure to adverse childhood experiences (ACEs) and recorded diagnoses was conducted among all VA-treated Gulf war veterans, utilizing all progress notes recorded from 2000-2011. Text processing extracted ACE exposures recorded among 44.7 million clinical notes belonging to 243,973 veterans. The relationship of ACE exposure to adult illnesses was analyzed using logistic regression. Bias considerations were assessed. ACE score was strongly associated with suicide attempts and serious mental disorders (ORs = 1.84 to 1.97), and less so with behaviorally mediated and somatic conditions (ORs = 1.02 to 1.36) per unit. Bias adjustments did not remove persistent associations between ACE score and most illnesses. Text mining to detect ACE exposure in a large population was feasible. Analysis of the relationship between ACE score and adult health conditions yielded patterns of association consistent with prior research. PMID:26579624
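
    The reported odds ratios per unit of ACE score come from logistic regression; the snippet below reproduces that style of analysis on simulated data (not VA data), reading the odds ratio and its confidence interval off the fitted coefficient.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 5000
ace_score = rng.integers(0, 8, size=n)                  # number of ACE categories recorded
logit = -3.0 + 0.6 * ace_score                          # assumed per-unit log-odds effect
outcome = rng.binomial(1, 1 / (1 + np.exp(-logit)))     # e.g. a recorded diagnosis

X = sm.add_constant(ace_score.astype(float))
fit = sm.Logit(outcome, X).fit(disp=False)

odds_ratio = np.exp(fit.params[1])                      # OR per unit ACE score
ci_low, ci_high = np.exp(fit.conf_int()[1])
print(f"OR per ACE unit = {odds_ratio:.2f} (95% CI {ci_low:.2f}-{ci_high:.2f})")
```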

  9. Development of a Text Mining System for the Discussion of Proactive Aging Management in Nuclear Power Plant

    Energy Technology Data Exchange (ETDEWEB)

    Shiraishi, Natsuki; Takahashi, Makoto; Wakabayashi, Toshio [Tohoku University, Tohoku (Japan)

    2011-08-15

    The purpose of this study is to develop an effective system to support the process of extracting knowledge from the database of incident records of long-operated nuclear power plants using text mining technology, especially for the Generic Issues of the proactive materials degradation management (PMDM) project in Japan. A modified system based on text mining technology has been developed to support the exploration of keyword relationships as cues for the discussion of Generic Issues. The evaluation confirmed that the knowledge extraction method with the modified system supports the exploration of keyword relationships more effectively than the method proposed in the previous study.

  10. Text mining in bioinformatics

    Institute of Scientific and Technical Information of China (English)

    邹权; 林琛; 刘晓燕; 郭茂祖

    2011-01-01

    Text mining methods in bioinformatics are discussed from two perspectives. First, with the goal of searching for biological knowledge, text mining can be used for literature retrieval and for building related databases; for example, knowledge such as protein-protein interactions and gene-disease relationships can be mined from PubMed. Second, the bioinformatics problems to which text mining techniques can be applied are summarized, such as the analysis of protein structure and function. Finally, areas of bioinformatics that text mining researchers could explore are discussed, so that more text mining researchers can apply their results to bioinformatics research.

  11. Text Mining to inform construction of Earth and Environmental Science Ontologies

    Science.gov (United States)

    Schildhauer, M.; Adams, B.; Rebich Hespanha, S.

    2013-12-01

    There is a clear need for better semantic representation of Earth and environmental concepts, to facilitate more effective discovery and re-use of information resources relevant to scientists doing integrative research. In order to develop general-purpose Earth and environmental science ontologies, however, it is necessary to represent concepts and relationships that span usage across multiple disciplines and scientific specialties. Traditional knowledge modeling through ontologies utilizes expert knowledge but inevitably favors the particular perspectives of the ontology engineers, as well as the domain experts who interacted with them. This often leads to ontologies that lack robust coverage of synonymy, while also missing important relationships among concepts that can be extremely useful for working scientists to be aware of. In this presentation we will discuss methods we have developed that utilize statistical topic modeling on a large corpus of Earth and environmental science articles, to expand coverage and disclose relationships among concepts in the Earth sciences. For our work we collected a corpus of over 121,000 abstracts from many of the top Earth and environmental science journals. We performed latent Dirichlet allocation topic modeling on this corpus to discover a set of latent topics, which consist of terms that commonly co-occur in abstracts. We match terms in the topics to concept labels in existing ontologies to reveal gaps, and we examine which terms are commonly associated in natural language discourse, to identify relationships that are important to formally model in ontologies. Our text mining methodology uncovers significant gaps in the content of some popular existing ontologies, and we show how, through a workflow involving human interpretation of topic models, we can bootstrap ontologies to have much better coverage and richer semantics. Because we base our methods directly on what working scientists are communicating about their
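
    Once the topics are estimated, the gap analysis described above reduces to comparing topic terms with ontology concept labels. The toy sketch below uses invented topics and an invented label set, not the actual 121,000-abstract corpus or any real ontology.

```python
from itertools import combinations

# Terms that commonly co-occur within topics (stand-ins for real topic-model output).
topics = [
    ["watershed", "runoff", "streamflow", "precipitation", "snowmelt"],
    ["soil", "carbon", "nitrogen", "microbial", "decomposition"],
]

# Labels already present in a hypothetical environmental-science ontology.
ontology_labels = {"watershed", "precipitation", "soil", "carbon", "nitrogen"}

for i, topic in enumerate(topics):
    missing = [t for t in topic if t not in ontology_labels]
    print(f"topic {i}: terms missing from the ontology -> {missing}")

    # Term pairs that co-occur in the same topic hint at relationships worth
    # modelling explicitly (e.g. 'runoff' and 'streamflow').
    for a, b in combinations(topic[:3], 2):
        print(f"  candidate relation: {a} -- {b}")
```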

  12. Text Mining and Its Key Techniques and Methods

    Institute of Scientific and Technical Information of China (English)

    王丽坤; 王宏; 陆玉昌

    2002-01-01

    With the dramatic development of the Internet, information processing and management technology on the WWW has become an important branch of data mining and data warehousing. Text mining in particular is emerging rapidly and plays an important role in related fields, so it is worth summarizing text mining from its definition through its methods and techniques. In this paper, drawing on comparatively mature data mining technology, we present the definition of text mining and a multi-stage text mining process model. Moreover, the paper introduces the key areas of text mining and some of the powerful text analysis techniques, including automatic word segmentation, feature representation, feature extraction, text categorization, text clustering, text summarization, information extraction and pattern quality evaluation. These techniques cover the whole process from information preprocessing to knowledge acquisition.

  13. The first step in the development of text mining technology for cancer risk assessment: identifying and organizing scientific evidence in risk assessment literature

    Directory of Open Access Journals (Sweden)

    Sun Lin

    2009-09-01

    Full Text Available Abstract Background One of the most neglected areas of biomedical Text Mining (TM) is the development of systems based on carefully assessed user needs. We have recently investigated the user needs of an important task yet to be tackled by TM -- Cancer Risk Assessment (CRA). Here we take the first step towards the development of TM technology for the task: identifying and organizing the scientific evidence required for CRA in a taxonomy which is capable of supporting extensive data gathering from biomedical literature. Results The taxonomy is based on expert annotation of 1297 abstracts downloaded from relevant PubMed journals. It classifies 1742 unique keywords found in the corpus to 48 classes which specify core evidence required for CRA. We report promising results with inter-annotator agreement tests and automatic classification of PubMed abstracts to taxonomy classes. A simple user test is also reported in a near real-world CRA scenario which demonstrates along with other evaluation that the resources we have built are well-defined, accurate, and applicable in practice. Conclusion We present our annotation guidelines and a tool which we have designed for expert annotation of PubMed abstracts. A corpus annotated for keywords and document relevance is also presented, along with the taxonomy which organizes the keywords into classes defining core evidence for CRA. As demonstrated by the evaluation, the materials we have constructed provide a good basis for classification of CRA literature along multiple dimensions. They can support current manual CRA as well as facilitate the development of an approach based on TM. We discuss extending the taxonomy further via manual and machine learning approaches and the subsequent steps required to develop TM technology for the needs of CRA.
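
    Inter-annotator agreement of the kind mentioned above is commonly reported as Cohen's kappa; a minimal computation on invented taxonomy-class labels is shown below.

```python
from sklearn.metrics import cohen_kappa_score

# Two experts assigning the same six abstracts to (hypothetical) taxonomy classes.
annotator_1 = ["exposure", "dose_response", "exposure", "mechanism", "mechanism", "exposure"]
annotator_2 = ["exposure", "dose_response", "mechanism", "mechanism", "mechanism", "exposure"]

kappa = cohen_kappa_score(annotator_1, annotator_2)
print(f"Cohen's kappa = {kappa:.2f}")   # 1.0 = perfect agreement, 0 = chance level
```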

  14. Towards Evidence-based Precision Medicine: Extracting Population Information from Biomedical Text using Binary Classifiers and Syntactic Patterns.

    Science.gov (United States)

    Raja, Kalpana; Dasot, Naman; Goyal, Pawan; Jonnalagadda, Siddhartha R

    2016-01-01

    Precision Medicine is an emerging approach for the prevention and treatment of disease that considers individual variability in genes, environment, and lifestyle for each person. The dissemination of individualized evidence by automatically identifying population information in the literature is key for evidence-based precision medicine at the point of care. We propose a hybrid approach using natural language processing techniques to automatically extract population information from the biomedical literature. Our approach first implements a binary classifier to classify sentences with or without population information. A rule-based system based on syntactic-tree regular expressions is then applied to sentences containing population information to extract the population named entities. The proposed two-stage approach achieved an F-score of 0.81 using a MaxEnt classifier and the rule-based system, and an F-score of 0.87 using a Naïve Bayes classifier and the rule-based system, and performed relatively well compared to many existing systems. The system and evaluation dataset are being released as open source. PMID:27570671

  15. Examining Mobile Learning Trends 2003-2008: A Categorical Meta-Trend Analysis Using Text Mining Techniques

    Science.gov (United States)

    Hung, Jui-Long; Zhang, Ke

    2012-01-01

    This study investigated the longitudinal trends of academic articles in Mobile Learning (ML) using text mining techniques. One hundred and nineteen (119) refereed journal articles and proceedings papers from the SCI/SSCI database were retrieved and analyzed. The taxonomies of ML publications were grouped into twelve clusters (topics) and four…

  16. BIOMedical search engine framework: lightweight and customized implementation of domain-specific biomedical search engines

    OpenAIRE

    Jácome, Alberto G.; Fdez-Riverola, Florentino; Lourenço, Anália

    2016-01-01

    The Smart Drug Search is publicly accessible at http://sing.ei.uvigo.es/sds/. The BIOMedical Search Engine Framework is freely available for non-commercial use at https://github.com/agjacome/biomsef. Background and Objectives: Text mining and semantic analysis approaches can be applied to the construction of biomedical domain-specific search engines and provide an attractive alternative for creating personalized and enhanced search experiences. Therefore, this work introduces the new open-sour...

  17. Towards Openness in Biomedical Informatics

    OpenAIRE

    Maojo Garcia, Victor Manuel; Jiménez Castellanos, Ana; Iglesia Jimenez, Diana de la

    2011-01-01

    Over recent years, and particularly in the context of the COMBIOMED network, our biomedical informatics (BMI) group at the Universidad Politecnica de Madrid has pursued several approaches to address a fundamental issue: facilitating open access to, and retrieval of, BMI resources, including software, databases and services. In this regard, we have followed various directions: a) a text mining-based approach to automatically build a “resourceome”, an inventory of open resources, b) met...

  18. PPInterFinder—a mining tool for extracting causal relations on human proteins from literature

    OpenAIRE

    Raja, Kalpana; Subramani, Suresh; Natarajan, Jeyakumar

    2013-01-01

    One of the most common and challenging problems in biomedical text mining is mining protein–protein interactions (PPIs) from MEDLINE abstracts and full-text research articles, because PPIs play a major role in understanding various biological processes and the impact of proteins in diseases. We implemented PPInterFinder, a web-based text mining tool to extract human PPIs from the biomedical literature. PPInterFinder uses relation keyword co-occurrences with protein names to extract information...

  19. Construction of an index of information from clinical practice in Radiology and Imaging Diagnosis based on text mining and thesaurus

    Directory of Open Access Journals (Sweden)

    Paulo Roberto Barbosa Serapiao

    2013-09-01

    Full Text Available Objective To construct a Portuguese-language index of information on the practice of diagnostic radiology in order to improve the standardization of medical language and terminology. Materials and Methods A total of 61,461 definitive reports were collected from the database of the Radiology Information System at Hospital das Clínicas – Faculdade de Medicina de Ribeirão Preto (RIS/HCFMRP) as follows: 30,000 chest x-ray reports; 27,000 mammography reports; and 4,461 thyroid ultrasonography reports. The text mining technique was applied for the selection of terms, and the ANSI/NISO Z39.19-2005 standard was utilized to construct the index based on a thesaurus structure. The system was created in HTML. Results The text mining resulted in a set of 358,236 (n = 100%) words. Out of this total, 76,347 (n = 21%) terms were selected to form the index. Such terms refer to anatomical pathology descriptions, imaging techniques, equipment, types of study and some other composite terms. The index system was developed with 78,538 HTML web pages. Conclusion The utilization of text mining on a radiological report database has allowed the construction of a lexical system in the Portuguese language consistent with clinical practice in Radiology.

  20. Text Classification using the Concept of Association Rule of Data Mining

    OpenAIRE

    Rahman, Chowdhury Mofizur; Sohel, Ferdous Ahmed; Naushad, Parvez; Kamruzzaman, S. M.

    2010-01-01

    As the amount of online text increases, the demand for text classification to aid the analysis and management of text is increasing. Text is cheap, but information, in the form of knowing what classes a text belongs to, is expensive. Automatic classification of text can provide this information at low cost, but the classifiers themselves must be built with expensive human effort, or trained from texts which have themselves been manually classified. In this paper we will discuss a procedure of...
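
    One common way to realize the idea, sketched below on invented training texts, is to mine frequent, confident "word-set => class" rules and let the rules that a new document satisfies vote for its class. This is a generic illustration, not necessarily the exact procedure the paper discusses.

```python
from collections import defaultdict
from itertools import combinations

train = [
    ("the striker scored a late goal in the match", "sport"),
    ("the team won the match after a penalty goal", "sport"),
    ("parliament passed the new budget bill", "politics"),
    ("the minister defended the budget in parliament", "politics"),
]

min_support, min_confidence = 2, 0.8   # support measured as an absolute document count here
itemset_counts, itemset_class_counts = defaultdict(int), defaultdict(int)

# Count single words and word pairs per document, overall and per class.
for text, label in train:
    words = set(text.split())
    for size in (1, 2):
        for itemset in combinations(sorted(words), size):
            itemset_counts[itemset] += 1
            itemset_class_counts[(itemset, label)] += 1

# Keep rules "itemset => label" that are frequent and confident.
rules = {}
for (itemset, label), count in itemset_class_counts.items():
    confidence = count / itemset_counts[itemset]
    if count >= min_support and confidence >= min_confidence:
        rules[itemset] = (label, confidence)

def classify(text):
    words = set(text.split())
    votes = defaultdict(float)
    for itemset, (label, confidence) in rules.items():
        if set(itemset) <= words:
            votes[label] += confidence        # let each fired rule vote
    return max(votes, key=votes.get) if votes else None

print(classify("a dramatic goal decided the match"))       # expected: sport
print(classify("a vote on the budget in parliament"))       # expected: politics
```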

  1. Two Similarity Metrics for Medical Subject Headings (MeSH): An Aid to Biomedical Text Mining and Author Name Disambiguation.

    Science.gov (United States)

    Smalheiser, Neil R; Bonifield, Gary

    2016-01-01

    In the present paper, we have created and characterized several similarity metrics for relating any two Medical Subject Headings (MeSH terms) to each other. The article-based metric measures the tendency of two MeSH terms to appear in the MEDLINE record of the same article. The author-based metric measures the tendency of two MeSH terms to appear in the body of articles written by the same individual (using the 2009 Author-ity author name disambiguation dataset as a gold standard). The two metrics are only modestly correlated with each other (r = 0.50), indicating that they capture different aspects of term usage. The article-based metric provides a measure of semantic relatedness, and MeSH term pairs that co-occur more often than expected by chance may reflect relations between the two terms. In contrast, the author metric is indicative of how individuals practice science, and may have value for author name disambiguation and studies of scientific discovery. We have calculated article metrics for all MeSH terms appearing in at least 25 articles in MEDLINE (as of 2014) and author metrics for MeSH terms published as of 2009. The dataset is freely available for download and can be queried at http://arrowsmith.psych.uic.edu/arrowsmith_uic/mesh_pair_metrics.html. Handling editor: Elizabeth Workman, MLIS, PhD. PMID:27213780
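
    An article-based co-occurrence score of the kind described can be computed as an observed-over-expected ratio of joint MeSH assignments. The sketch below uses invented article-term assignments, and its normalization is a deliberate simplification of the published metrics.

```python
from collections import Counter
from itertools import combinations

articles = {
    "pmid1": {"Neoplasms", "Apoptosis", "Mice"},
    "pmid2": {"Neoplasms", "Apoptosis"},
    "pmid3": {"Mice", "Behavior"},
    "pmid4": {"Neoplasms", "Mice"},
}

n_articles = len(articles)
term_counts = Counter(t for terms in articles.values() for t in terms)
pair_counts = Counter(frozenset(p) for terms in articles.values()
                      for p in combinations(sorted(terms), 2))

def relatedness(term_a, term_b):
    """How much more often two MeSH terms co-occur than expected by chance."""
    observed = pair_counts[frozenset((term_a, term_b))] / n_articles
    expected = (term_counts[term_a] / n_articles) * (term_counts[term_b] / n_articles)
    return observed / expected if expected else 0.0

print(relatedness("Neoplasms", "Apoptosis"))   # co-occur more than chance (> 1)
print(relatedness("Apoptosis", "Behavior"))    # never co-occur -> 0.0
```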

  2. A Digital Humanities Approach to the History of Science Eugenics Revisited in Hidden Debates by Means of Semantic Text Mining

    OpenAIRE

    Huijnen, Pim; Laan, Fons; De Rijke, Maarten; Pieters, Toine

    2014-01-01

    Comparative historical research on the intensity, diversity and fluidity of public discourses has been severely hampered by the extraordinary task of manually gathering and processing large sets of opinionated data in news media in different countries. At most 50,000 documents have been systematically studied in a single comparative historical project in the subject area of heredity and eugenics. Digital techniques, like the text mining tools WAHSP and BILAND we have developed in two succ...

  3. Text Mining and Natural Language Processing Approaches for Automatic Categorization of Lay Requests to Web-Based Expert Forums

    OpenAIRE

    Himmel, Wolfgang; Reincke, Ulrich; Michelmann, Hans Wilhelm

    2009-01-01

    Background Both healthy and sick people increasingly use electronic media to obtain medical information and advice. For example, Internet users may send requests to Web-based expert forums, or so-called “ask the doctor” services. Objective To automatically classify lay requests to an Internet medical expert forum using a combination of different text-mining strategies. Methods We first manually classified a sample of 988 requests directed to an involuntary childlessness forum on the German web...

  5. Evaluation of carcinogenic modes of action for pesticides in fruit on the Swedish market using a text-mining tool

    OpenAIRE

    Silins, Ilona; Korhonen, Anna; Stenius, Ulla

    2014-01-01

    Toxicity caused by chemical mixtures has emerged as a significant challenge for toxicologists and risk assessors. Information on individual chemicals' modes of action is an important part of the hazard identification step. In this study, an automatic text mining-based tool was employed as a method to identify the carcinogenic modes of action of pesticides frequently found in fruit on the Swedish market. The current available scientific literature on the 26 most common pesticides found in appl...

  6. Business intelligence in banking: A literature analysis from 2002 to 2013 using Text Mining and latent Dirichlet allocation

    OpenAIRE

    S. Moro; Cortez, P.; Rita, P.

    2015-01-01

    WOS:000345734700028 (Web of Science accession number). This paper analyzes recent literature in the search for trends in business intelligence applications for the banking industry. Searches were performed in relevant journals resulting in 219 articles published between 2002 and 2013. To analyze such a large number of manuscripts, text mining techniques were used in pursuit of relevant terms in both the business intelligence and banking domains. Moreover, the latent Dirichlet allocation modeling w...

  7. Functional evaluation of out-of-the-box text-mining tools for data-mining tasks

    OpenAIRE

    Jung, Kenneth; LePendu, Paea; Iyer, Srinivasan; Bauer-Mehren, Anna; Percha, Bethany; Shah, Nigam H.

    2014-01-01

    Objective The trade-off between the speed and simplicity of dictionary-based term recognition and the richer linguistic information provided by more advanced natural language processing (NLP) is an area of active discussion in clinical informatics. In this paper, we quantify this trade-off among text processing systems that make different trade-offs between speed and linguistic understanding. We tested both types of systems in three clinical research tasks: phase IV safety profiling of a drug...

  8. DESTAF: A database of text-mined associations for reproductive toxins potentially affecting human fertility

    KAUST Repository

    Dawe, Adam Sean

    2012-01-01

    The Dragon Exploration System for Toxicants and Fertility (DESTAF) is a publicly available resource which enables researchers to efficiently explore both known and potentially novel information and associations in the field of reproductive toxicology. To create DESTAF we used data from the literature (including over 10,500 PubMed abstracts), several publicly available biomedical repositories, and specialized, curated dictionaries. DESTAF has an interface designed to facilitate rapid assessment of the key associations between relevant concepts, allowing for a more in-depth exploration of information based on different gene/protein-, enzyme/metabolite-, toxin/chemical-, disease- or anatomically centric perspectives. As a special feature, DESTAF allows for the creation and initial testing of potentially new association hypotheses that suggest links between biological entities identified through the database. DESTAF, along with a PDF manual, can be found at http://cbrc.kaust.edu.sa/destaf. It is free to academic and non-commercial users and will be updated quarterly.

  9. Development and testing of a text-mining approach to analyse patients’ comments on their experiences of colorectal cancer care

    OpenAIRE

    Wagland, Richard; Recio Saucedo, Alejandra; Simon, Michael; Bracher, Michael; Hunt, Katherine; Foster, Claire; Downing, Amy; Glaser, Adam W; Corner, Jessica

    2016-01-01

    Background: Quality of cancer care may greatly impact upon patients’ health-related quality of life (HRQoL). Free-text responses to patient-reported outcome measures (PROMs) provide rich data but analysis is time and resource-intensive. This study developed and tested a learning-based text-mining approach to facilitate analysis of patients’ experiences of care and develop an explanatory model illustrating impact upon HRQoL. Methods: Respondents to a population-based survey of colorectal c...

  10. Exploring the potential of Social Media Data using Text Mining to augment Business Intelligence

    OpenAIRE

    Ananthi Sheshasaayee; R. Jayanthi

    2014-01-01

    In recent years, social media has become popular worldwide and important for content sharing, social networking, and more, yet the content generated on these websites remains largely unused. Social media contains text, images, audio, video, and so on, and social media data largely consists of unstructured text. The foremost task is to extract the information contained in this unstructured text. This paper presents the influence of social media data on research and how the content can be used to predic...

  11. The biomedical discourse relation bank

    Directory of Open Access Journals (Sweden)

    Joshi Aravind

    2011-05-01

    Full Text Available Abstract Background Identification of discourse relations, such as causal and contrastive relations, between situations mentioned in text is an important task for biomedical text-mining. A biomedical text corpus annotated with discourse relations would be very useful for developing and evaluating methods for biomedical discourse processing. However, little effort has been made to develop such an annotated resource. Results We have developed the Biomedical Discourse Relation Bank (BioDRB), in which we have annotated explicit and implicit discourse relations in 24 open-access full-text biomedical articles from the GENIA corpus. Guidelines for the annotation were adapted from the Penn Discourse TreeBank (PDTB), which has discourse relations annotated over open-domain news articles. We introduced new conventions and modifications to the sense classification. We report reliable inter-annotator agreement of over 80% for all sub-tasks. Experiments for identifying the sense of explicit discourse connectives show the connective itself as a highly reliable indicator for coarse sense classification (accuracy 90.9% and F1 score 0.89). These results are comparable to results obtained with the same classifier on the PDTB data. With more refined sense classification, there is degradation in performance (accuracy 69.2% and F1 score 0.28), mainly due to sparsity in the data. The size of the corpus was found to be sufficient for identifying the sense of explicit connectives, with classifier performance stabilizing at about 1900 training instances. Finally, the classifier performs poorly when trained on PDTB and tested on BioDRB (accuracy 54.5% and F1 score 0.57). Conclusion Our work shows that discourse relations can be reliably annotated in biomedical text. Coarse sense disambiguation of explicit connectives can be done with high reliability by using just the connective as a feature, but more refined sense classification requires either richer features or more

  12. Data Mining of Acupoint Characteristics from the Classical Medical Text: DongUiBoGam of Korean Medicine

    Directory of Open Access Journals (Sweden)

    Taehyung Lee

    2014-01-01

    Full Text Available Throughout the history of East Asian medicine, different kinds of acupuncture treatment experiences have been accumulated in classical medical texts. Reexamining knowledge from classical medical texts is expected to provide meaningful information that could be utilized in current medical practices. In this study, we used data mining methods to analyze the association between acupoints and patterns of disorder with the classical medical book DongUiBoGam of Korean medicine. Using the term frequency-inverse document frequency (tf-idf) method, we quantified the significance of acupoints to their targeted patterns and, conversely, the significance of patterns to acupoints. Through these processes, we extracted characteristics of each acupoint based on its treating patterns. We also drew practical information for selecting acupoints for certain patterns according to their association. Data analysis on DongUiBoGam’s acupuncture treatment gave us an insight into the main idea of DongUiBoGam. We strongly believe that our approach can provide a novel understanding of unknown characteristics of acupoint and pattern identification from the classical medical text using data mining methods.
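
    The tf-idf weighting used in this study can be illustrated by treating each pattern of disorder as a document and the acupoints prescribed for it as terms, as in the small sketch below. The acupoints, patterns and counts are invented, not taken from DongUiBoGam.

```python
import math

# pattern -> {acupoint: number of passages prescribing it for that pattern}
pattern_acupoints = {
    "headache":  {"LI4": 8, "GB20": 6, "ST36": 1},
    "dizziness": {"GB20": 5, "GV20": 4},
    "fatigue":   {"ST36": 7, "CV4": 3},
}

n_patterns = len(pattern_acupoints)
doc_freq = {}
for counts in pattern_acupoints.values():
    for acupoint in counts:
        doc_freq[acupoint] = doc_freq.get(acupoint, 0) + 1

def tf_idf(pattern, acupoint):
    counts = pattern_acupoints[pattern]
    tf = counts.get(acupoint, 0) / sum(counts.values())
    idf = math.log(n_patterns / doc_freq[acupoint])
    return tf * idf

# LI4 is specific to headache; ST36 is shared across patterns, so it scores lower here.
print(round(tf_idf("headache", "LI4"), 3), round(tf_idf("headache", "ST36"), 3))
```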

  13. An Ontology based Text Mining framework for R&D Project Selection

    Directory of Open Access Journals (Sweden)

    E.Sathya

    2013-03-01

    Full Text Available Research and development (R&D) project selection is a decision-making task commonly found in government funding agencies, universities, research institutes, and technology-intensive companies. Text mining has emerged as a definitive technique for extracting unknown information from large text documents. An ontology is a knowledge repository in which concepts and terms are defined, as well as the relationships between these concepts. Ontologies make the task of searching for similar patterns of text more effective, efficient and interactive. A method for grouping proposals for research project selection is proposed, using an ontology-based text mining approach to cluster research proposals based on their similarities in research area. This method is efficient and effective for clustering research proposals; however, the assignment of proposals to experts by research area is often not accurate. This paper presents a framework of ontology-based text mining to cluster research proposals and external reviewers by their research areas and to assign the relevant research proposals to reviewers systematically. A knowledge-based agent is appended to the proposed system for efficient retrieval of data from the system.

  14. Automatic Building of an Ontology from a Corpus of Text Documents Using Data Mining Tools

    Directory of Open Access Journals (Sweden)

    J. I. Toledo-Alvarado

    2012-06-01

    Full Text Available In this paper we show a procedure to build automatically an ontology from a corpus of text documents without external help such as dictionaries or thesauri. The method proposed finds relevant concepts in the form of multi-words in the corpus and non-hierarchical relations between them in an unsupervised manner.

  15. PreBIND and Textomy – mining the biomedical literature for protein-protein interactions using a support vector machine

    Directory of Open Access Journals (Sweden)

    Baskin Berivan

    2003-03-01

    Full Text Available Abstract Background The majority of experimentally verified molecular interaction and biological pathway data are present in the unstructured text of biomedical journal articles where they are inaccessible to computational methods. The Biomolecular Interaction Network Database (BIND) seeks to capture these data in a machine-readable format. We hypothesized that the formidable task-size of backfilling the database could be reduced by using Support Vector Machine technology to first locate interaction information in the literature. We present an information extraction system that was designed to locate protein-protein interaction data in the literature and present these data to curators and the public for review and entry into BIND. Results Cross-validation estimated that the support vector machine's test-set precision, accuracy and recall for classifying abstracts describing interaction information were 92%, 90% and 92%, respectively. We estimated that the system would be able to recall up to 60% of all non-high throughput interactions present in another yeast-protein interaction database. Finally, this system was applied to a real-world curation problem and its use was found to reduce the task duration by 70%, thus saving 176 days. Conclusions Machine learning methods are useful as tools to direct interaction and pathway database back-filling; however, this potential can only be realized if these techniques are coupled with human review and entry into a factual database such as BIND. The PreBIND system described here is available to the public at http://bind.ca. Current capabilities allow searching for human, mouse and yeast protein-interaction information.
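
    As a rough illustration of the abstract-classification step, the sketch below trains a linear support vector machine on bag-of-words features with scikit-learn. The tiny training set and labels are invented placeholders, not PreBIND data, and the real system used far richer features and much more training material.

        # Classify abstracts as describing protein-protein interactions (1) or not (0).
        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.svm import LinearSVC
        from sklearn.pipeline import make_pipeline

        abstracts = [
            "Protein A binds protein B in a yeast two-hybrid assay.",
            "Kinase X phosphorylates and interacts with substrate Y.",
            "We report the crystal structure of enzyme Z.",
            "Gene W expression increases under heat stress.",
        ]
        labels = [1, 1, 0, 0]  # purely illustrative annotations

        model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
        model.fit(abstracts, labels)
        print(model.predict(["Protein Q interacts with receptor R in vivo."]))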

  16. Motif-Based Text Mining of Microbial Metagenome Redundancy Profiling Data for Disease Classification.

    Science.gov (United States)

    Wang, Yin; Li, Rudong; Zhou, Yuhua; Ling, Zongxin; Guo, Xiaokui; Xie, Lu; Liu, Lei

    2016-01-01

    Background. Text data of 16S rRNA are informative for classifications of microbiota-associated diseases. However, the raw text data need to be systematically processed so that features for classification can be defined/extracted; moreover, the high-dimension feature spaces generated by the text data also pose an additional difficulty. Results. Here we present a Phylogenetic Tree-Based Motif Finding algorithm (PMF) to analyze 16S rRNA text data. By integrating phylogenetic rules and other statistical indexes for classification, we can effectively reduce the dimension of the large feature spaces generated by the text datasets. Using the retrieved motifs in combination with common classification methods, we can discriminate different samples of both pneumonia and dental caries better than other existing methods. Conclusions. We extend the phylogenetic approaches to perform supervised learning on microbiota text data to discriminate the pathological states for pneumonia and dental caries. The results have shown that PMF may enhance the efficiency and reliability in analyzing high-dimension text data. PMID:27057545

  17. Proceedings of the International Workshop on Text Mining Research, Practice and Opportunities

    OpenAIRE

    2005-01-01

    It is a well known fact that huge quantities of valuable knowledge are embedded in unstructured texts that can be found in the World Wide Web, in intranets and on personal desktop machines. In recent years, there has been an increasing research interest in technologies for extracting and analysing useful structured knowledge from unstructured texts. At the same time, there has been an increasing commercial interest with a number of tools appearing on the market that address the needs of the u...

  18. Cloud Based Metalearning System for Predictive Modeling of Biomedical Data

    Directory of Open Access Journals (Sweden)

    Milan Vukićević

    2014-01-01

    Full Text Available The rapid growth and storage of biomedical data has enabled many opportunities for predictive modeling and improvement of healthcare processes. On the other hand, the analysis of such large amounts of data is a difficult and computationally intensive task for most existing data mining algorithms. This problem is addressed by proposing a cloud-based system that integrates a metalearning framework, for ranking and selecting the best predictive algorithms for the data at hand, with open source big data technologies for the analysis of biomedical data.

  19. Textpresso Site-Specific Recombinases: a text-mining server for the recombinase literature including Cre mice and conditional alleles

    Science.gov (United States)

    Urbanski, William M.; Condie, Brian G.

    2016-01-01

    Textpresso Site Specific Recombinases (http://ssrc.genetics.uga.edu/) is a text-mining web server for searching a database of over 9000 full-text publications. The papers and abstracts in this database represent a wide range of topics related to site-specific recombinase (SSR) research tools. Included in the database are most of the papers that report the characterization or use of mouse strains that express Cre recombinase as well as papers that describe or analyze mouse lines that carry conditional (floxed) alleles or SSR-activated transgenes/knockins. The database also includes reports describing SSR-based cloning methods such as the Gateway or the Creator systems, papers reporting the development or use of SSR-based tools in systems such as Drosophila, bacteria, parasites, stem cells, yeast, plants, zebrafish and Xenopus as well as publications that describe the biochemistry, genetics or molecular structure of the SSRs themselves. Textpresso Site Specific Recombinases is the only comprehensive text-mining resource available for the literature describing the biology and technical applications of site-specific recombinases. PMID:19882667

  20. Data Mining of Acupoint Characteristics from the Classical Medical Text: DongUiBoGam of Korean Medicine.

    Science.gov (United States)

    Lee, Taehyung; Jung, Won-Mo; Lee, In-Seon; Lee, Ye-Seul; Lee, Hyejung; Park, Hi-Joon; Kim, Namil; Chae, Younbyoung

    2014-01-01

    Throughout the history of East Asian medicine, different kinds of acupuncture treatment experiences have been accumulated in classical medical texts. Reexamining knowledge from classical medical texts is expected to provide meaningful information that could be utilized in current medical practices. In this study, we used data mining methods to analyze the association between acupoints and patterns of disorder within the classical medical book DongUiBoGam of Korean medicine. Using the term frequency-inverse document frequency (tf-idf) method, we quantified the significance of acupoints to their target patterns and, conversely, the significance of patterns to acupoints. Through these processes, we extracted characteristics of each acupoint based on its treating patterns. We also derived practical information for selecting acupoints for certain patterns according to their association. Data analysis of DongUiBoGam's acupuncture treatment gave us an insight into the main idea of DongUiBoGam. We strongly believe that our approach can provide a novel understanding of unknown characteristics of acupoints and pattern identification from the classical medical text using data mining methods. PMID:25574179

  1. Models of text mining to measure improvements to doctoral courses suggested by “STELLA” phd survey respondents

    Directory of Open Access Journals (Sweden)

    Pasquale Pavone

    2014-10-01

    Full Text Available We present text mining models to thematically categorise and measure the suggestions of PhD holders on improving PhD programmes in the STELLA survey (Statistiche in TEma di Laureati e LAvoro). The coded-response questionnaire, designed to evaluate the employment opportunities of students and assess their learning experience, included open-ended questions on how to improve PhD programmes. The corpus analysed was taken from the data of Italian PhD holders who graduated between 2005 and 2009 at eight universities (Bergamo, Brescia, Milano Statale, Milano Bicocca, Pisa, Scuola Superiore Sant’Anna, Palermo and Pavia). The usual methodological approach to text analysis allowed us to categorize the open-ended proposals for improving PhD courses at these eight Italian universities.

  2. Mining for constructions in texts using N-gram and network analysis

    DEFF Research Database (Denmark)

    Shibuya, Yoshikata; Jensen, Kim Ebensgaard

    2015-01-01

    The study applies N-gram analysis to Lewis Carroll's novel Alice's Adventures in Wonderland and Mark Twain's novel The Adventures of Huckleberry Finn and extrapolates a number of likely constructional phenomena from recurring N-gram patterns in the two texts. In addition to simple N-gram analysis, the following will be... -styles, text-world construction and specification of narrative temporality. In this paper, our special interest lies in the relationship between constructions and the discourse of fiction. As the study reported in this article is exploratory, it serves just as much to test the methods mentioned above as to...
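
    A bare-bones version of the recurring-N-gram extraction mentioned above might look like the following sketch; the short sample string stands in for the full novel text and is only a placeholder.

        # Count recurring trigrams in a text as candidate constructional patterns.
        import re
        from collections import Counter

        # Stand-in for the full novel text; in practice it would be read from a file.
        text = ("Alice was beginning to get very tired of sitting by her sister "
                "on the bank and of having nothing to do")

        def ngrams(tokens, n):
            return zip(*(tokens[i:] for i in range(n)))

        tokens = re.findall(r"[a-z']+", text.lower())
        trigram_counts = Counter(ngrams(tokens, 3))
        for gram, freq in trigram_counts.most_common(5):
            print(" ".join(gram), freq)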

  3. Combining Natural Language Processing and Statistical Text Mining: A Study of Specialized versus Common Languages

    Science.gov (United States)

    Jarman, Jay

    2011-01-01

    This dissertation focuses on developing and evaluating hybrid approaches for analyzing free-form text in the medical domain. This research draws on natural language processing (NLP) techniques that are used to parse and extract concepts based on a controlled vocabulary. Once important concepts are extracted, additional machine learning algorithms,…

  4. Text Mining for Information Systems Researchers: An Annotated Topic Modeling Tutorial

    DEFF Research Database (Denmark)

    Debortoli, Stefan; Müller, Oliver; Junglas, Iris;

    2016-01-01

    The tutorial demonstrates probabilistic topic modeling via Latent Dirichlet Allocation, an unsupervised text mining technique, in combination with a LASSO multinomial logistic regression, to explain user satisfaction with an IT artifact by automatically analyzing more than 12,000 online customer reviews. For fellow information systems...
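
    A compact sketch of the topic-modeling step is given below, assuming scikit-learn's LatentDirichletAllocation; the four toy reviews and the choice of two topics are illustrative assumptions, not the tutorial's 12,000-review dataset.

        # Fit LDA topics over a handful of toy customer reviews.
        from sklearn.feature_extraction.text import CountVectorizer
        from sklearn.decomposition import LatentDirichletAllocation

        reviews = [
            "The app crashes whenever I open the settings page.",
            "Great interface, easy to navigate and very fast.",
            "Customer support never answered my ticket.",
            "Love the new dashboard layout and the dark mode.",
        ]

        vectorizer = CountVectorizer(stop_words="english")
        X = vectorizer.fit_transform(reviews)

        lda = LatentDirichletAllocation(n_components=2, random_state=0)
        doc_topics = lda.fit_transform(X)  # document-topic proportions

        terms = vectorizer.get_feature_names_out()
        for k, component in enumerate(lda.components_):
            top = component.argsort()[-5:][::-1]
            print(f"topic {k}:", [terms[i] for i in top])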

  5. Clustering Unstructured Data (Flat Files) - An Implementation in Text Mining Tool

    CERN Document Server

    Safeer, Yasir; Ali, Anis Noor

    2010-01-01

    With the advancement of technology and reduced storage costs, individuals and organizations are tending towards the use of electronic media for storing textual information and documents. It is time-consuming for readers to retrieve relevant information from an unstructured document collection; it is easier and less time-consuming to find documents in a large collection when the collection is ordered or classified by group or category, but the problem of finding the best such grouping remains open. This paper discusses our implementation of the k-Means clustering algorithm for clustering unstructured text documents, beginning with the representation of unstructured text and ending with the resulting set of clusters. Based on the analysis of the resulting clusters for a sample set of documents, we also propose a technique for representing documents that can further improve the clustering result.
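
    The pipeline described above (represent each document as a vector, then run k-Means) can be sketched as follows with scikit-learn; the four documents and the choice of k = 2 are illustrative assumptions.

        # Cluster unstructured documents: tf-idf representation followed by k-Means.
        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.cluster import KMeans

        documents = [
            "Invoice for office supplies and printer paper.",
            "Quarterly financial report and budget forecast.",
            "Meeting notes from the engineering stand-up.",
            "Sprint retrospective and engineering action items.",
        ]

        X = TfidfVectorizer(stop_words="english").fit_transform(documents)
        km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

        for doc, label in zip(documents, km.labels_):
            print(label, doc)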

  6. Ask and Ye Shall Receive? Automated Text Mining of Michigan Capital Facility Finance Bond Election Proposals to Identify Which Topics Are Associated with Bond Passage and Voter Turnout

    Science.gov (United States)

    Bowers, Alex J.; Chen, Jingjing

    2015-01-01

    The purpose of this study is to bring together recent innovations in the research literature around school district capital facility finance, municipal bond elections, statistical models of conditional time-varying outcomes, and data mining algorithms for automated text mining of election ballot proposals to examine the factors that influence the…

  7. Newspaper archives + text mining = rich sources of historical geo-spatial data

    Science.gov (United States)

    Yzaguirre, A.; Smit, M.; Warren, R.

    2016-04-01

    Newspaper archives are rich sources of cultural, social, and historical information. These archives, even when digitized, are typically unstructured and organized by date rather than by subject or location, and require substantial manual effort to analyze. The effort of journalists to be accurate and precise means that there is often rich geo-spatial data embedded in the text, alongside text describing events that editors considered to be of sufficient importance to the region or the world to merit column inches. A regional newspaper can add over 100,000 articles to its database each year, and extracting information from this data for even a single country would pose a substantial Big Data challenge. In this paper, we describe a pilot study on the construction of a database of historical flood events (location(s), date, cause, magnitude) to be used in flood assessment projects, for example to calibrate models, estimate frequency, establish high water marks, or plan for future events in contexts ranging from urban planning to climate change adaptation. We then present a vision for extracting and using the rich geospatial data available in unstructured text archives, and suggest future avenues of research.
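
    As a toy illustration of pulling structured flood-event facts out of article text, the sketch below matches a small hypothetical gazetteer and a date pattern against one sentence; a production pipeline would rely on full named-entity recognition and geocoding rather than this simplification.

        # Extract candidate place names and dates from a news sentence.
        import re

        GAZETTEER = {"Halifax", "Truro", "Sydney"}  # hypothetical place list
        DATE = re.compile(r"\b(?:January|February|March|April|May|June|July|August|"
                          r"September|October|November|December)\s+\d{1,2},\s+\d{4}\b")

        sentence = ("Flood waters rose through downtown Truro on March 31, 2003, "
                    "forcing dozens of residents from their homes.")

        places = [p for p in GAZETTEER if p in sentence]
        dates = DATE.findall(sentence)
        print({"places": places, "dates": dates})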

  8. Combining QSAR Modeling and Text-Mining Techniques to Link Chemical Structures and Carcinogenic Modes of Action

    Science.gov (United States)

    Papamokos, George; Silins, Ilona

    2016-01-01

    There is an increasing need for new reliable non-animal based methods to predict and test toxicity of chemicals. Quantitative structure-activity relationship (QSAR), a computer-based method linking chemical structures with biological activities, is used in predictive toxicology. In this study, we tested the approach to combine QSAR data with literature profiles of carcinogenic modes of action automatically generated by a text-mining tool. The aim was to generate data patterns to identify associations between chemical structures and biological mechanisms related to carcinogenesis. Using these two methods, individually and combined, we evaluated 96 rat carcinogens of the hematopoietic system, liver, lung, and skin. We found that skin and lung rat carcinogens were mainly mutagenic, while the group of carcinogens affecting the hematopoietic system and the liver also included a large proportion of non-mutagens. The automatic literature analysis showed that mutagenicity was a frequently reported endpoint in the literature of these carcinogens, however, less common endpoints such as immunosuppression and hormonal receptor-mediated effects were also found in connection with some of the carcinogens, results of potential importance for certain target organs. The combined approach, using QSAR and text-mining techniques, could be useful for identifying more detailed information on biological mechanisms and the relation with chemical structures. The method can be particularly useful in increasing the understanding of structure and activity relationships for non-mutagens.

  9. Louhi 2010: Special issue on Text and Data Mining of Health Documents

    Directory of Open Access Journals (Sweden)

    Dalianis Hercules

    2011-07-01

    Full Text Available Abstract The papers presented in this supplement focus and reflect on computer use in every-day clinical work in hospitals and clinics such as electronic health record systems, pre-processing for computer aided summaries, clinical coding, computer decision systems, as well as related ethical concerns and security. Much of this work concerns itself by necessity with incorporation and development of language processing tools and methods, and as such this supplement aims at providing an arena for reporting on development in a diversity of languages. In the supplement we can read about some of the challenges identified above.

  10. DEVELOP ENGLISH TUTORIAL SYSTEM USING NATURAL LANGAUGE PROCESSING WITH TEXT MINING

    Directory of Open Access Journals (Sweden)

    K. J. SATAO,

    2010-12-01

    Full Text Available This paper describes the model of an interactive tutoring system. The model enables an Indian user to generate computer-based tutorials without any programming knowledge and serves as a multiple-domain tutor for monolingual users who are learning English; the prototype also incorporates an English-Hindi translation ability. Based on reviews of previous work, such features are usually incorporated separately in different systems and applications. Therefore, the primary goal of this paper is to describe an experiment which investigates the feasibility of integrating these features into one common platform. The implementation of the prototype is based on paradigms of a flexible and interactive user interface and natural language processing.

  11. A multilingual text mining based content gathering system for open source intelligence

    International Nuclear Information System (INIS)

    Full text: The number of documents available in electronic format has grown dramatically in recent years, whilst the information that States provide to the IAEA is not always complete or clear. Generally speaking, up to 80% of electronic data is textual, and the most valuable information is often hidden and encoded in pages which are neither structured nor classified. The huge amount of data available in open sources leads to a paradox that is now well identified: an overload of information yields no usable knowledge. Besides, open source texts are - and will be - written in various native languages, but these documents are relevant even to IAEA staff who are not native speakers of those languages. Independent information sources can balance the limited State-reported information, particularly if related to non-cooperative targets. The process of accessing all these raw data, heterogeneous in type (scientific article, patent, free textual document), source (Internet/Intranet, database, etc), protocol (HTTP/HTTPS, FTP, GOPHER, IRC, NNTP, etc) and language used, and transforming them into information, is therefore inextricably linked to the concepts of focused crawling, textual analysis and synthesis, hinging greatly on the ability to master the problems of multilinguality. This task undoubtedly requires remarkable effort. This poster describes a multimedia content gathering, multilingual indexing, searching and clustering system whose main goal is managing huge collections of data coming from different and geographically distributed information sources, providing language-independent searches and dynamic classification facilities. Its focused crawling aims to crawl only the subset of web pages related to a specific category, in order to find only information of interest and improve the quality of document gathering. The focused crawling algorithm builds a model for the context within which topically relevant pages occur on the web, typically capturing link hierarchies

  12. Treatment Rules of Sinomenium Acutum by Text Mining

    Institute of Scientific and Technical Information of China (English)

    李雨彦; 郑光; 刘良

    2015-01-01

    Objective: This study summarized the treatment rules of Sinomenium acutum (Menispermaceae, SA) using text mining techniques. Methods: We first collected the literature related to SA from the Chinese Biomedical Literature (CBM) Database and performed text mining on it. Structured query language was then used for data processing, and a data-stratification algorithm based on cleaning, noise reduction and keyword frequency statistics was used to analyze the basic patterns of symptoms, TCM patterns, herb compatibility and drug combinations. Results: Sinomenium acutum was mainly used to treat conditions presenting with pain, swelling, stiffness and deformity, with wind, cold, dampness, heat, phlegm, stasis and deficiency as the main pattern elements. The diseases treated were dominated by rheumatoid arthritis, together with various other rheumatic diseases as well as chronic nephritis, hepatitis and arrhythmia. Sinomenium acutum was usually combined with herbs that dispel wind and eliminate dampness, nourish the blood and promote blood circulation, dredge the collaterals, warm the meridians and tonify the kidney, and it was often used alongside immune-regulating and collateral-dredging preparations such as tripterygium glycosides and Huoluo pills. Conclusion: Text mining allowed us to summarize the treatment rules of Sinomenium acutum in a systematic, comprehensive and precise way, providing a literature basis for future clinical application and drug research.

  13. Detection of drug adverse effects by text-mining

    Institute of Scientific and Technical Information of China (English)

    隋明爽; 崔雷

    2015-01-01

    After analyzing the necessity and feasibility of detecting drug adverse effects by text mining, this paper summarizes current research on the detection of drug adverse effects by text mining, along with unsolved problems and future development trends, in terms of the text mining workflow, mining/extraction methods, result assessment, and existing tool software.

  14. Using large-scale text mining for a systematic reconstruction of molecular mechanisms of diseases: a case study in thyroid cancer

    OpenAIRE

    Wu, Chengkun

    2014-01-01

    Information about genes and pathways involved in a disease is usually 'buried' in the scientific literature, making it difficult to perform systematic studies for a comprehensive understanding. Text mining provides opportunities to retrieve and extract the most relevant information from the literature, and thus might enable collecting and exploring data relevant to a given disease systematically. This thesis aims to develop a text-mining pipeline that can identify genes and pathways involved in the...

  15. Applying a text mining framework to the extraction of numerical parameters from scientific literature in the biotechnology domain

    Directory of Open Access Journals (Sweden)

    André SANTOS

    2012-07-01

    Full Text Available Scientific publications are the main vehicle to disseminate information in the field of biotechnology for wastewater treatment. Indeed, the new research paradigms and the application of high-throughput technologies have increased the rate of publication considerably. The problem is that manual curation becomes harder, prone-to-errors and time-consuming, leading to a probable loss of information and inefficient knowledge acquisition. As a result, research outputs are hardly reaching engineers, hampering the calibration of mathematical models used to optimize the stability and performance of biotechnological systems. In this context, we have developed a data curation workflow, based on text mining techniques, to extract numerical parameters from scientific literature, and applied it to the biotechnology domain. A workflow was built to process wastewater-related articles with the main goal of identifying physico-chemical parameters mentioned in the text. This work describes the implementation of the workflow, identifies achievements and current limitations in the overall process, and presents the results obtained for a corpus of 50 full-text documents.

  16. Applying a text mining framework to the extraction of numerical parameters from scientific literature in the biotechnology domain

    Directory of Open Access Journals (Sweden)

    Anália LOURENÇO

    2013-07-01

    Full Text Available Scientific publications are the main vehicle to disseminate information in the field of biotechnology for wastewater treatment. Indeed, the new research paradigms and the application of high-throughput technologies have increased the rate of publication considerably. The problem is that manual curation becomes harder, prone-to-errors and time-consuming, leading to a probable loss of information and inefficient knowledge acquisition. As a result, research outputs are hardly reaching engineers, hampering the calibration of mathematical models used to optimize the stability and performance of biotechnological systems. In this context, we have developed a data curation workflow, based on text mining techniques, to extract numerical parameters from scientific literature, and applied it to the biotechnology domain. A workflow was built to process wastewater-related articles with the main goal of identifying physico-chemical parameters mentioned in the text. This work describes the implementation of the workflow, identifies achievements and current limitations in the overall process, and presents the results obtained for a corpus of 50 full-text documents.
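
    The parameter-extraction step can be approximated with a simple pattern over sentences, as in the sketch below; the unit list and regular expression are simplified assumptions and not the authors' actual workflow.

        # Pull "<number> <unit>" mentions of physico-chemical parameters from text.
        import re

        UNITS = r"(?:mg/L|g/L|mmol/L|°C|h|d|%)"
        PATTERN = re.compile(rf"(\d+(?:\.\d+)?)\s*({UNITS})")

        sentence = ("The reactor was operated at 35 °C with an influent COD of "
                    "1200 mg/L and a hydraulic retention time of 12 h.")

        for value, unit in PATTERN.findall(sentence):
            print(value, unit)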

  17. PESCADOR, a web-based tool to assist text-mining of biointeractions extracted from PubMed queries

    Directory of Open Access Journals (Sweden)

    Barbosa-Silva Adriano

    2011-11-01

    Full Text Available Abstract Background Biological function is greatly dependent on the interactions of proteins with other proteins and genes. Abstracts from the biomedical literature stored in the NCBI's PubMed database can be used for the derivation of interactions between genes and proteins by identifying the co-occurrences of their terms. Often, the amount of interactions obtained through such an approach is large and may mix processes occurring in different contexts. Current tools do not allow studying these data with a focus on concepts of relevance to a user, for example, interactions related to a disease or to a biological mechanism such as protein aggregation. Results To help the concept-oriented exploration of such data we developed PESCADOR, a web tool that extracts a network of interactions from a set of PubMed abstracts given by a user, and allows filtering the interaction network according to user-defined concepts. We illustrate its use in exploring protein aggregation in neurodegenerative disease and in the expansion of pathways associated to colon cancer. Conclusions PESCADOR is a platform independent web resource available at: http://cbdm.mdc-berlin.de/tools/pescador/
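
    The co-occurrence idea behind such tools can be sketched in a few lines: two entities mentioned in the same sentence become an edge in an interaction network. The gene list and sentences below are invented, and real systems such as the one described rely on curated dictionaries and named-entity recognition rather than plain substring matching.

        # Build a co-occurrence network of gene mentions with networkx.
        import itertools
        import networkx as nx

        genes = {"TP53", "MDM2", "CDKN1A", "BAX"}
        sentences = [
            "TP53 activates transcription of CDKN1A and BAX.",
            "MDM2 binds TP53 and targets it for degradation.",
        ]

        G = nx.Graph()
        for sent in sentences:
            mentioned = [g for g in genes if g in sent]
            for a, b in itertools.combinations(mentioned, 2):
                G.add_edge(a, b)

        print(sorted(G.edges()))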

  18. Introduction to biomedical engineering

    CERN Document Server

    Enderle, John

    2011-01-01

    Introduction to Biomedical Engineering is a comprehensive survey text for biomedical engineering courses. It is the most widely adopted text across the BME course spectrum, valued by instructors and students alike for its authority, clarity and encyclopedic coverage in a single volume. Biomedical engineers need to understand the wide range of topics that are covered in this text, including basic mathematical modeling; anatomy and physiology; electrical engineering, signal processing and instrumentation; biomechanics; biomaterials science and tissue engineering; and medical and engineering e

  19. Biomedical Science, Unit II: Nutrition in Health and Medicine. Digestion of Foods; Organic Chemistry of Nutrients; Energy and Cell Respiration; The Optimal Diet; Foodborne Diseases; Food Technology; Dental Science and Nutrition. Student Text. Revised Version, 1975.

    Science.gov (United States)

    Biomedical Interdisciplinary Curriculum Project, Berkeley, CA.

    This student text presents instructional materials for a unit of science within the Biomedical Interdisciplinary Curriculum Project (BICP), a two-year interdisciplinary precollege curriculum aimed at preparing high school students for entry into college and vocational programs leading to a career in the health field. Lessons concentrate on…

  20. MegaMiner: A Tool for Lead Identification Through Text Mining Using Chemoinformatics Tools and Cloud Computing Environment.

    Science.gov (United States)

    Karthikeyan, Muthukumarasamy; Pandit, Yogesh; Pandit, Deepak; Vyas, Renu

    2015-01-01

    Virtual screening is an indispensable tool to cope with the massive amount of data being tossed by the high throughput omics technologies. With the objective of enhancing the automation capability of virtual screening process a robust portal termed MegaMiner has been built using the cloud computing platform wherein the user submits a text query and directly accesses the proposed lead molecules along with their drug-like, lead-like and docking scores. Textual chemical structural data representation is fraught with ambiguity in the absence of a global identifier. We have used a combination of statistical models, chemical dictionary and regular expression for building a disease specific dictionary. To demonstrate the effectiveness of this approach, a case study on malaria has been carried out in the present work. MegaMiner offered superior results compared to other text mining search engines, as established by F score analysis. A single query term 'malaria' in the portlet led to retrieval of related PubMed records, protein classes, drug classes and 8000 scaffolds which were internally processed and filtered to suggest new molecules as potential anti-malarials. The results obtained were validated by docking the virtual molecules into relevant protein targets. It is hoped that MegaMiner will serve as an indispensable tool for not only identifying hidden relationships between various biological and chemical entities but also for building better corpus and ontologies. PMID:26138567

  1. Research and Development of a Text Mining System Based on the Web

    Institute of Scientific and Technical Information of China (English)

    唐菁; 沈记全; 杨炳儒

    2003-01-01

    With the development of network technology, information spreads on the Internet more and more quickly, and the resulting information ocean contains many types of complicated data. Acquiring useful knowledge quickly from this information ocean is very difficult. Web-based text mining is a new research field that can address this problem effectively. In this paper, we present a structural model of text mining and study its core algorithm, a classification algorithm. We have developed a Web-based text mining system and applied it in modern distance education. The system automatically classifies educational text collected from education sites on the Internet and helps people browse important information quickly and acquire knowledge.

  2. Term identification in the biomedical literature.

    Science.gov (United States)

    Krauthammer, Michael; Nenadic, Goran

    2004-12-01

    Sophisticated information technologies are needed for effective data acquisition and integration from a growing body of the biomedical literature. Successful term identification is key to getting access to the stored literature information, as it is the terms (and their relationships) that convey knowledge across scientific articles. Due to the complexities of a dynamically changing biomedical terminology, term identification has been recognized as the current bottleneck in text mining, and, as a consequence, has become an important research topic both in natural language processing and biomedical communities. This article overviews state-of-the-art approaches in term identification. The process of identifying terms is analysed through three steps: term recognition, term classification, and term mapping. For each step, main approaches and general trends, along with the major problems, are discussed. By assessing previous work in context of the overall term identification process, the review also tries to delineate needs for future work in the field. PMID:15542023

  3. Identification of candidate genes in Populus cell wall biosynthesis using text-mining, co-expression network and comparative genomics

    Energy Technology Data Exchange (ETDEWEB)

    Yang, Xiaohan [ORNL; Ye, Chuyu [ORNL; Bisaria, Anjali [ORNL; Tuskan, Gerald A [ORNL; Kalluri, Udaya C [ORNL

    2011-01-01

    Populus is an important bioenergy crop for bioethanol production. A greater understanding of cell wall biosynthesis processes is critical in reducing biomass recalcitrance, a major hindrance in efficient generation of ethanol from lignocellulosic biomass. Here, we report the identification of candidate cell wall biosynthesis genes through the development and application of a novel bioinformatics pipeline. As a first step, via text-mining of PubMed publications, we obtained 121 Arabidopsis genes that had the experimental evidences supporting their involvement in cell wall biosynthesis or remodeling. The 121 genes were then used as bait genes to query an Arabidopsis co-expression database and additional genes were identified as neighbors of the bait genes in the network, increasing the number of genes to 548. The 548 Arabidopsis genes were then used to re-query the Arabidopsis co-expression database and re-construct a network that captured additional network neighbors, expanding to a total of 694 genes. The 694 Arabidopsis genes were computationally divided into 22 clusters. Queries of the Populus genome using the Arabidopsis genes revealed 817 Populus orthologs. Functional analysis of gene ontology and tissue-specific gene expression indicated that these Arabidopsis and Populus genes are high likelihood candidates for functional genomics in relation to cell wall biosynthesis.

  4. Text mining for pharmacovigilance: Using machine learning for drug name recognition and drug-drug interaction extraction and classification.

    Science.gov (United States)

    Ben Abacha, Asma; Chowdhury, Md Faisal Mahbub; Karanasiou, Aikaterini; Mrabet, Yassine; Lavelli, Alberto; Zweigenbaum, Pierre

    2015-12-01

    Pharmacovigilance (PV) is defined by the World Health Organization as the science and activities related to the detection, assessment, understanding and prevention of adverse effects or any other drug-related problem. An essential aspect in PV is to acquire knowledge about Drug-Drug Interactions (DDIs). The shared tasks on DDI-Extraction organized in 2011 and 2013 have pointed out the importance of this issue and provided benchmarks for: Drug Name Recognition, DDI extraction and DDI classification. In this paper, we present our text mining systems for these tasks and evaluate their results on the DDI-Extraction benchmarks. Our systems rely on machine learning techniques using both feature-based and kernel-based methods. The obtained results for drug name recognition are encouraging. For DDI-Extraction, our hybrid system combining a feature-based method and a kernel-based method was ranked second in the DDI-Extraction-2011 challenge, and our two-step system for DDI detection and classification was ranked first in the DDI-Extraction-2013 task at SemEval. We discuss our methods and results and give pointers to future work. PMID:26432353

  5. Grouping chemicals for health risk assessment: A text mining-based case study of polychlorinated biphenyls (PCBs).

    Science.gov (United States)

    Ali, Imran; Guo, Yufan; Silins, Ilona; Högberg, Johan; Stenius, Ulla; Korhonen, Anna

    2016-01-22

    As many chemicals act as carcinogens, chemical health risk assessment is critically important. A notoriously time consuming process, risk assessment could be greatly supported by classifying chemicals with similar toxicological profiles so that they can be assessed in groups rather than individually. We have previously developed a text mining (TM)-based tool that can automatically identify the mode of action (MOA) of a carcinogen based on the scientific evidence in literature, and it can measure the MOA similarity between chemicals on the basis of their literature profiles (Korhonen et al., 2009, 2012). A new version of the tool (2.0) was recently released and here we apply this tool for the first time to investigate and identify meaningful groups of chemicals for risk assessment. We used published literature on polychlorinated biphenyls (PCBs), persistent, widely spread toxic organic compounds comprising 209 different congeners. Although chemically similar, these compounds are heterogeneous in terms of MOA. We show that our TM tool, when applied to 1648 PubMed abstracts, produces a MOA profile for a subgroup of dioxin-like PCBs (DL-PCBs) which differs clearly from that for the rest of PCBs. This suggests that the tool could be used to effectively identify homogenous groups of chemicals and, when integrated in real-life risk assessment, could significantly improve the efficiency of the process. PMID:26562772
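
    One way to picture the grouping step is to represent each chemical by counts of mode-of-action terms found in its literature and cluster the resulting profiles, as in the hedged sketch below; the chemicals, term counts and cluster count are invented for illustration and do not reproduce the tool's actual method.

        # Group chemicals by cosine similarity of their literature MOA profiles.
        import numpy as np
        from scipy.spatial.distance import pdist
        from scipy.cluster.hierarchy import linkage, fcluster

        chemicals = ["PCB-126", "PCB-153", "PCB-118"]
        # Columns: mutagenicity, receptor-mediated effects, immunosuppression.
        profiles = np.array([
            [2, 40, 15],
            [1,  3,  2],
            [3, 35, 10],
        ], dtype=float)

        dist = pdist(profiles, metric="cosine")              # pairwise cosine distance
        tree = linkage(dist, method="average")               # hierarchical clustering
        groups = fcluster(tree, t=2, criterion="maxclust")   # cut into two groups
        print(dict(zip(chemicals, groups)))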

  6. Unblocking Blockbusters: Using Boolean Text-Mining to Optimise Clinical Trial Design and Timeline for Novel Anticancer Drugs

    Directory of Open Access Journals (Sweden)

    Richard J. Epstein

    2009-01-01

    Full Text Available Two problems now threaten the future of anticancer drug development: (i) the information explosion has made research into new target-specific drugs more duplication-prone, and hence less cost-efficient; and (ii) high-throughput genomic technologies have failed to deliver the anticipated early windfall of novel first-in-class drugs. Here it is argued that the resulting crisis of blockbuster drug development may be remedied in part by innovative exploitation of informatic power. Using scenarios relating to oncology, it is shown that rapid data-mining of the scientific literature can refine therapeutic hypotheses and thus reduce empirical reliance on preclinical model development and early-phase clinical trials. Moreover, as personalised medicine evolves, this approach may inform biomarker-guided phase III trial strategies for noncytotoxic (antimetastatic) drugs that prolong patient survival without necessarily inducing tumor shrinkage. Though not replacing conventional gold standards, these findings suggest that this computational research approach could reduce costly ‘blue skies’ R&D investment and time to market for new biological drugs, thereby helping to reverse unsustainable drug price inflation.

  7. Unblocking Blockbusters: Using Boolean Text-Mining to Optimise Clinical Trial Design and Timeline for Novel Anticancer Drugs

    Directory of Open Access Journals (Sweden)

    Richard J. Epstein

    2009-08-01

    Full Text Available Two problems now threaten the future of anticancer drug development: (i) the information explosion has made research into new target-specific drugs more duplication-prone, and hence less cost-efficient; and (ii) high-throughput genomic technologies have failed to deliver the anticipated early windfall of novel first-in-class drugs. Here it is argued that the resulting crisis of blockbuster drug development may be remedied in part by innovative exploitation of informatic power. Using scenarios relating to oncology, it is shown that rapid data-mining of the scientific literature can refine therapeutic hypotheses and thus reduce empirical reliance on preclinical model development and early-phase clinical trials. Moreover, as personalised medicine evolves, this approach may inform biomarker-guided phase III trial strategies for noncytotoxic (antimetastatic) drugs that prolong patient survival without necessarily inducing tumor shrinkage. Though not replacing conventional gold standards, these findings suggest that this computational research approach could reduce costly ‘blue skies’ R&D investment and time to market for new biological drugs, thereby helping to reverse unsustainable drug price inflation.

  8. PALM-IST: Pathway Assembly from Literature Mining - an Information Search Tool

    OpenAIRE

    Mandloi, Sapan; Chakrabarti, Saikat

    2015-01-01

    Manual curation of biomedical literature has become an extremely tedious process due to its exponential growth in recent years. To extract meaningful information from such large and unstructured text, newer and more efficient mining tools are required. Here, we introduce PALM-IST, a computational platform that not only allows users to explore biomedical abstracts using keyword based text mining but also extracts biological entity (e.g., gene/protein, drug, disease, biological processes, cellular c...

  9. Auto-selection of DRG codes from discharge summaries by text mining in several hospitals: analysis of difference of discharge summaries.

    Science.gov (United States)

    Suzuki, Takahiro; Doi, Shunsuke; Shimada, Gen; Takasaki, Mitsuhiro; Tamura, Toshiyo; Fujita, Shinsuke; Takabayashi, Katsuhiko

    2010-01-01

    Recently, electronic medical record (EMR) systems have become popular in Japan, and a large number of discharge summaries are stored electronically, though they have not yet been reused. We performed text mining with the tf-idf method and morphological analysis on discharge summaries from three hospitals (Chiba University Hospital, St. Luke's International Hospital and Saga University Hospital). We found differences in summary style between the hospitals, while the rate of properly classified DPC (Diagnosis Procedure Combination) codes was almost the same. Despite the different styles of the discharge summaries, the text mining method could extract the proper DPC codes. Improvement was observed by using an integrated model of data from the hospitals. It appears that a huge database containing the data of many hospitals can improve the precision of text mining. PMID:20841838

  10. Dropping down the Maximum Item Set: Improving the Stylometric Authorship Attribution Algorithm in the Text Mining for Authorship Investigation

    Directory of Open Access Journals (Sweden)

    Tareef K. Mustafa

    2010-01-01

    Full Text Available Problem statement: Stylometric authorship attribution is an approach to analyzing texts in text mining, e.g., novels and plays written by famous authors, that tries to measure an author's style by choosing attributes that reflect the author's way of writing, on the assumption that each writer has a way of writing that no other writer shares; authorship attribution is thus the task of identifying the author of a given text. In this study, we propose an authorship attribution algorithm that improves the accuracy with which the stylometric features of different writers can be discriminated, so that they serve almost as well as fingerprints of different persons. Approach: The main goal of this study is to build an algorithm that supports a decision-making system enabling users to predict and choose the right author for a specific anonymous novel under consideration, using a learning procedure to teach the system the stylometric map of the author so that it can behave as an expert opinion. Stylometric Authorship Attribution (AA) usually depends on frequent words as the best attribute that can be used; many studies have sought other beneficial attributes, but the frequent word remains ahead of other attributes, giving better results in research and experiments, and the best parameter and technique used so far is counting the bag-of-words with the maximum item set. Results: To improve AA techniques, we need a new set of attributes with a new measurement tool. The first attribute used in this study is the frequent pair, meaning a pair of words that always appear together; this attribute is clearly not new, but it has not been a successful attribute compared with the frequent word when using maximum-item-set counters. The word pairs made some mistakes, as seen in the experimental results, improving the winnow algorithm by combining it with the computational
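
    For orientation, frequent-word stylometry of the kind discussed here can be sketched as follows: each author is represented by the relative frequencies of common function words, and an anonymous text is attributed to the nearest profile. The texts, word list and distance measure below are simplifications, not the algorithm proposed in the paper.

        # Attribute an anonymous text to the author with the closest function-word profile.
        import re
        from collections import Counter

        FUNCTION_WORDS = ["the", "and", "of", "to", "in", "that", "it", "was"]

        def profile(text):
            tokens = re.findall(r"[a-z']+", text.lower())
            counts = Counter(tokens)
            total = max(len(tokens), 1)
            return [counts[w] / total for w in FUNCTION_WORDS]

        def distance(p, q):
            return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

        known = {
            "Author A": "It was the best of times and it was the worst of times",
            "Author B": "The whale surfaced to the east of the ship in that grey dawn",
        }
        anonymous = "It was in that moment that the truth of it all came to him"

        scores = {name: distance(profile(text), profile(anonymous))
                  for name, text in known.items()}
        print(min(scores, key=scores.get))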

  11. Exploratory Text Mining Algorithm Based on High-Dimensional Clustering

    Institute of Scientific and Technical Information of China (English)

    张爱科; 符保龙

    2013-01-01

    Because of the unstructured nature of free text, text mining has become an important branch of data mining, and many types of text mining algorithms have emerged in recent years. This paper proposes an exploratory text mining algorithm based on high-dimensional clustering that uses text mining to guide data mining within data-like text. The algorithm requires only a small number of iterations to produce good clusters from very large text collections; mapping to other recorded data and assigning texts to user groups further improves the results. The feasibility and validity of the proposed method are verified through tests on related data and analysis of the experimental results.

  12. Research on Web Text Mining Based on Domain Ontology

    Institute of Scientific and Technical Information of China (English)

    阮光册

    2011-01-01

    To remedy the traditional Web text mining methods' lack of semantic understanding of text, this paper combines ontologies with Web text mining and explores a Web text mining method based on a domain ontology. It first builds the ontology structure of Web texts, then introduces a domain-ontology "concept-concept" similarity matrix and describes how relations between concepts are identified, and finally presents an implementation of Web text mining that uncovers the implicit meaning of Web text information. In an experiment taking online media reports as an example, relevant conclusions are drawn through text mining.

  13. Research on Fuzzy Clustering Validity in Web Text Mining

    Institute of Scientific and Technical Information of China (English)

    罗琪

    2012-01-01

    This paper studies Web document mining based on fuzzy clustering together with a fuzzy clustering validity evaluation function, and applies the validity function to the evaluation of fuzzy clustering in Web text mining. The experiments show that FKCM can effectively improve the precision of Web text clustering, and simulation results indicate that the method is feasible and reasonably accurate.

  14. A sentence sliding window approach to extract protein annotations from biomedical articles

    OpenAIRE

    2005-01-01

    Background Within the emerging field of text mining and statistical natural language processing (NLP) applied to biomedical articles, a broad variety of techniques have been developed during the past years. Nevertheless, there is still a great need for comparative assessment of the performance of the proposed methods and for the development of common evaluation criteria. This issue was addressed by the Critical Assessment of Text Mining Methods in Molecular Biology (BioCreative) contest. The aim of...

  15. Fundamental of biomedical engineering

    CERN Document Server

    Sawhney, GS

    2007-01-01

    About the Book: A well set out textbook explains the fundamentals of biomedical engineering in the areas of biomechanics, biofluid flow, biomaterials, bioinstrumentation and use of computing in biomedical engineering. All these subjects form a basic part of an engineer's education. The text is admirably suited to meet the needs of the students of mechanical engineering, opting for the elective of Biomedical Engineering. Coverage of bioinstrumentation, biomaterials and computing for biomedical engineers can meet the needs of the students of Electronic & Communication, Electronic & Instrumenta

  16. Research on Semantic Text Mining Based on Domain Ontology

    Institute of Scientific and Technical Information of China (English)

    张玉峰; 何超

    2011-01-01

    To improve the depth and accuracy of text mining, a semantic text mining model based on a domain ontology is proposed. In this model, semantic role labeling is applied to semantic analysis so that concepts and the semantic relations between them can be extracted accurately, improving the accuracy of text representation. To address the inability of traditional knowledge mining algorithms to mine a semantic metadata base effectively, a semantics-based association pattern mining algorithm is designed to acquire deep semantic association patterns. Experimental results show that the model can mine deep semantic knowledge from text databases, the patterns obtained have strong potential applications, and the algorithm has strong adaptability and scalability.

  17. Research on Methods of Text Mining and Their Application

    Institute of Scientific and Technical Information of China (English)

    张晓艳; 华英

    2011-01-01

    The rise of the Internet has brought vast amounts of text information. Text mining technology is mainly used to extract information useful to users from semi-structured and unstructured texts. This article compares and analyzes several common text mining methods and summarizes the main application areas of text mining today.

  18. Parsing Citations in Biomedical Articles Using Conditional Random Fields

    OpenAIRE

    Zhang, Qing; Cao, Yong-Gang; Yu, Hong

    2011-01-01

    Citations are used ubiquitously in biomedical full-text articles and play an important role for representing both the rhetorical structure and the semantic content of the articles. As a result, text mining systems will significantly benefit from a tool that automatically extracts the content of a citation. In this study, we applied the supervised machine-learning algorithms Conditional Random Fields (CRFs) to automatically parse a citation into its fields (e.g., Author, Title, Journal, and Ye...
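
    A toy version of CRF-based citation field labeling is sketched below, assuming the sklearn-crfsuite package; the single training citation, its token labels and the feature set are invented for illustration and are far smaller than what a real parser would need.

        # Label citation tokens (AUTHOR / TITLE / JOURNAL / YEAR) with a linear-chain CRF.
        import sklearn_crfsuite

        def token_features(tokens, i):
            tok = tokens[i]
            return {
                "lower": tok.lower(),
                "is_digit": tok.isdigit(),
                "is_title": tok.istitle(),
                "position": i,
            }

        tokens = ["Smith", "J", ".", "Parsing", "citations", ".",
                  "J", "Biomed", "Inform", ".", "2011", "."]
        labels = ["AUTHOR", "AUTHOR", "AUTHOR", "TITLE", "TITLE", "TITLE",
                  "JOURNAL", "JOURNAL", "JOURNAL", "JOURNAL", "YEAR", "YEAR"]

        X_train = [[token_features(tokens, i) for i in range(len(tokens))]]
        y_train = [labels]

        crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
        crf.fit(X_train, y_train)
        print(crf.predict(X_train)[0])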

  19. Threats and violence as a precursor to occupational injury : Text-mining of insurance-based information on police officers and security guards in Sweden 2004-2007

    OpenAIRE

    Larsson, Tore J.; Tezic, Kerem; Oldertz, Cecilia

    2010-01-01

    The full text of all occupational injury claims associated with threats or violence from Police Officers and Security Guards reported to the Swedish national workers’ compensation insurance in 2004-2007 was analysed with the help of text-mining software. The analysis generated clusters of details on hazardous exposures and accident processes, and the level of information in the clusters describing the different scenarios identified possible practical modifications in training, technology and pr...

  20. Pattern Mining Methods for Multimedia Text Data

    Institute of Scientific and Technical Information of China (English)

    刘茂福; 曹加恒; 彭敏; 叶可; 林芝

    2001-01-01

    Multimedia text data mining (MTM) is a new research field in data mining. This article gives the definition and classification of MTM, proposes a multimedia text data mining process model (MTMM) and its feature representation, discusses methods for multimedia text categorization mining, and points out the differences and relationships between MTM and Web mining, with the goal of discovering useful knowledge or patterns and promoting the development and application of MTM.

  1. Application of Text Mining to Extract Hotel Attributes and Construct Perceptual Map of Five Star Hotels from Online Review: Study of Jakarta and Singapore Five-Star Hotels

    Directory of Open Access Journals (Sweden)

    Arga Hananto

    2015-12-01

    Full Text Available The use of post-purchase online consumer reviews in hotel attribute studies was still scarce in the literature. Arguably, post-purchase online review data would yield more accurate attributes that consumers actually consider in their purchase decision. This study aims to extract attributes from two samples of five-star hotel reviews (Jakarta and Singapore) with text mining methodology. In addition, this study also aims to describe the positioning of five-star hotels in Jakarta and Singapore based on the extracted attributes using Correspondence Analysis. This study finds that reviewers of five-star hotels in both cities mentioned similar attributes such as service, staff, club, location, pool and food. Attributes derived from text mining seem to be viable input to build a fairly accurate positioning map of hotels. This study has demonstrated the viability of online reviews as a source of data for hotel attribute and positioning studies.

  2. What online communities can tell us about electronic cigarettes and hookah use: A study using text mining and visualization techniques

    OpenAIRE

    Chen, AT; Zhu, SH; Conway, M.

    2015-01-01

    © 2015 Journal of Medical Internet Research. Background: The rise in popularity of electronic cigarettes (e-cigarettes) and hookah over recent years has been accompanied by some confusion and uncertainty regarding the development of an appropriate regulatory response towards these emerging products. Mining online discussion content can lead to insights into people's experiences, which can in turn further our knowledge of how to address potential health implications. In this work, we take a no...

  3. Benchmarking of the 2010 BioCreative Challenge III text-mining competition by the BioGRID and MINT interaction databases

    Directory of Open Access Journals (Sweden)

    Cesareni Gianni

    2011-10-01

    Full Text Available Abstract Background The vast amount of data published in the primary biomedical literature represents a challenge for the automated extraction and codification of individual data elements. Biological databases that rely solely on manual extraction by expert curators are unable to comprehensively annotate the information dispersed across the entire biomedical literature. The development of efficient tools based on natural language processing (NLP systems is essential for the selection of relevant publications, identification of data attributes and partially automated annotation. One of the tasks of the Biocreative 2010 Challenge III was devoted to the evaluation of NLP systems developed to identify articles for curation and extraction of protein-protein interaction (PPI data. Results The Biocreative 2010 competition addressed three tasks: gene normalization, article classification and interaction method identification. The BioGRID and MINT protein interaction databases both participated in the generation of the test publication set for gene normalization, annotated the development and test sets for article classification, and curated the test set for interaction method classification. These test datasets served as a gold standard for the evaluation of data extraction algorithms. Conclusion The development of efficient tools for extraction of PPI data is a necessary step to achieve full curation of the biomedical literature. NLP systems can in the first instance facilitate expert curation by refining the list of candidate publications that contain PPI data; more ambitiously, NLP approaches may be able to directly extract relevant information from full-text articles for rapid inspection by expert curators. Close collaboration between biological databases and NLP systems developers will continue to facilitate the long-term objectives of both disciplines.

  4. TML:A General High-Performance Text Mining Language%TML:一种通用高效的文本挖掘语言

    Institute of Scientific and Technical Information of China (English)

    李佳静; 李晓明; 孟涛

    2015-01-01

    This paper proposes and implements a general-purpose, high-performance programming language for text mining named TML (short for "text mining language"), together with its compiler, runtime virtual machine (interpreter) and graphical development environment; TML aims to turn complicated text mining tasks into easy jobs. Users write TML code to specify the extraction targets and extraction methods; the code is compiled into optimized bytecode, which the virtual machine then matches against the semantic structure of the input text. TML supplies most common text mining techniques, implemented as grammars and reserved words, and has the following characteristics: 1) it provides a formal way to describe the scope, targets and methods of text mining, so that declarative text mining can be programmed for different application domains; 2) the runtime virtual machine, built around information extraction techniques, efficiently implements a range of common text mining techniques and organizes them into a text analysis pipeline; 3) a series of compiler optimizations allows large numbers of matching instructions to execute concurrently, which addresses the language's efficiency on massive rule sets and massive data. Practical cases illustrate the descriptive power of TML and its real-world applications.

  5. Working with Data: Discovering Knowledge through Mining and Analysis; Systematic Knowledge Management and Knowledge Discovery; Text Mining; Methodological Approach in Discovering User Search Patterns through Web Log Analysis; Knowledge Discovery in Databases Using Formal Concept Analysis; Knowledge Discovery with a Little Perspective.

    Science.gov (United States)

    Qin, Jian; Jurisica, Igor; Liddy, Elizabeth D.; Jansen, Bernard J; Spink, Amanda; Priss, Uta; Norton, Melanie J.

    2000-01-01

    These six articles discuss knowledge discovery in databases (KDD). Topics include data mining; knowledge management systems; applications of knowledge discovery; text and Web mining; text mining and information retrieval; user search patterns through Web log analysis; concept analysis; data collection; and data structure inconsistency. (LRW)

  6. Evaluation of the strengths and weaknesses of Text Mining and Netnography as methods of understanding consumer conversations around luxury brands on social media platforms.

    OpenAIRE

    SAINI, CHITRA; ,

    2015-01-01

    The advent of social media has led luxury brands to increasingly turn to social media sites to build brand value. Understanding the discussions that happen on social media is therefore key for the marketing managers of luxury brands. Two prominent methodologies have been used widely in the literature to study consumer conversations on social media: Text Mining and Netnography. In this study I will compare and contrast both these methodologies t...

  7. Application of Text Mining to Extract Hotel Attributes and Construct Perceptual Map of Five Star Hotels from Online Review: Study of Jakarta and Singapore Five-Star Hotels

    OpenAIRE

    Arga Hananto

    2015-01-01

    The use of post-purchase online consumer review in hotel attributes study was still scarce in the literature. Arguably, post purchase online review data would gain more accurate attributes that consumers actually consider in their purchase decision. This study aims to extract attributes from two samples of five-star hotel reviews (Jakarta and Singapore) with text mining methodology. In addition, this study also aims to describe positioning of five-star hotels in Jakarta and Singapore based on t...

  8. The Application of the Web Text Mining in the Druggist Interest Extraction%Web文本挖掘在药商兴趣提取中的应用

    Institute of Scientific and Technical Information of China (English)

    孙士新

    2014-01-01

    Information acquisition has become an important part of druggists' business operations and a basis for their market judgments. The emergence of large amounts of unstructured and semi-structured information on the network provides the technical space and empirical basis for personalized services for druggists. This paper designs the key text mining techniques for personalized service, applies the text mining process to a Traditional Chinese Medicinal Materials information website, and uses text mining technology to automatically acquire user interests from that website.

  9. Key Issues in Morphology Analysis Based on Text Mining%基于文本挖掘的形态分析方法的关键问题

    Institute of Scientific and Technical Information of China (English)

    冷伏海; 王林; 王立学

    2012-01-01

    Morphological analysis based on text mining integrates text mining methods into the traditional approach, reducing the reliance on domain experts and adding objective data to support the analysis; it is a useful exploration and improvement of morphological analysis by researchers at home and abroad. The method involves four key issues: definition of the morphological structure, feature word selection, morphology representation and morphology analysis. Improving the solutions to these four issues plays a key role in enhancing the efficiency and quality of the whole method.

  10. 基于文本挖掘的网络新闻报道差异分析%Analysis on Web Media Report Differences Based on Text Mining

    Institute of Scientific and Technical Information of China (English)

    阮光册

    2012-01-01

    Using text mining technology to discover potential but valuable information in web news reports is a new attempt in intelligence research. This paper discusses text mining methods for web news reports and, taking web media coverage of the Shanghai Expo as a case, conducts an empirical study of the differences between web media. Reports on the Expo from Hong Kong, Taiwan, overseas Chinese-language media and local Shanghai media were selected, the differences in report content were analyzed on the basis of text mining and feature extraction, and conclusions were drawn.

  11. 关联挖掘下的海量文本信息深入挖掘实现%Text Mining Method of Massive Network Based on Correlation Mining

    Institute of Scientific and Technical Information of China (English)

    彭其华

    2013-01-01

    A text mining method for massive network text based on correlation mining is studied. With the rapid development of computer and network technology, the volume of text on the network is growing rapidly. Traditional network text mining methods are based on feature extraction; they work for small data volumes but cannot keep up with the rapid growth of information. A correlation-based mining method for massive network text is therefore proposed: features are first extracted from the massive text for preliminary classification and feature identification, the correlation between the features of the various texts is then computed, and the texts are finally grouped according to the magnitude of the correlation, since correlation reflects the relationships between text features well. A test with a set of random popular network terms shows that the algorithm adapts well to mining massive text and has good application value.
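
    To make the correlation idea above concrete, here is a small, hedged sketch (not the paper's algorithm): a toy term-document count matrix is built, the pairwise correlation between feature profiles is computed, and features whose correlation exceeds a threshold are grouped together. The terms, counts and threshold are illustrative only.

      import numpy as np
      from collections import defaultdict

      # Toy term-document counts: rows are features (terms), columns are documents.
      terms = ["price", "cost", "camera", "lens", "battery"]
      X = np.array([[3, 0, 2, 5, 0, 1],
                    [2, 0, 3, 4, 0, 0],
                    [0, 4, 0, 0, 3, 2],
                    [0, 3, 1, 0, 4, 2],
                    [1, 1, 0, 1, 2, 0]], dtype=float)

      R = np.corrcoef(X)          # pairwise Pearson correlation between term profiles
      threshold = 0.8

      # Greedy single-link grouping: any pair above the threshold ends up in one group.
      parent = list(range(len(terms)))
      def find(i):
          while parent[i] != i:
              parent[i] = parent[parent[i]]
              i = parent[i]
          return i
      def union(i, j):
          parent[find(i)] = find(j)

      for i in range(len(terms)):
          for j in range(i + 1, len(terms)):
              if R[i, j] >= threshold:
                  union(i, j)

      groups = defaultdict(list)
      for i, t in enumerate(terms):
          groups[find(i)].append(t)
      print(list(groups.values()))   # e.g. [['price', 'cost'], ['camera', 'lens'], ['battery']]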

  12. Text-mining of PubMed abstracts by natural language processing to create a public knowledge base on molecular mechanisms of bacterial enteropathogens

    Directory of Open Access Journals (Sweden)

    Perna Nicole T

    2009-06-01

    Full Text Available Abstract Background The Enteropathogen Resource Integration Center (ERIC; http://www.ericbrc.org) has a goal of providing bioinformatics support for the scientific community researching enteropathogenic bacteria such as Escherichia coli and Salmonella spp. Rapid and accurate identification of experimental conclusions from the scientific literature is critical to support research in this field. Natural Language Processing (NLP), and in particular Information Extraction (IE) technology, can be a significant aid to this process. Description We have trained a powerful, state-of-the-art IE technology on a corpus of abstracts from the microbial literature in PubMed to automatically identify and categorize biologically relevant entities and predicative relations. These relations include: Genes/Gene Products and their Roles; Gene Mutations and the resulting Phenotypes; and Organisms and their associated Pathogenicity. Evaluations on blind datasets show an F-measure average of greater than 90% for entities (genes, operons, etc.) and over 70% for relations (gene/gene product to role, etc.). This IE capability, combined with text indexing and relational database technologies, constitute the core of our recently deployed text mining application. Conclusion Our Text Mining application is available online on the ERIC website http://www.ericbrc.org/portal/eric/articles. The information retrieval interface displays a list of recently published enteropathogen literature abstracts, and also provides a search interface to execute custom queries by keyword, date range, etc. Upon selection, processed abstracts and the entities and relations extracted from them are retrieved from a relational database and marked up to highlight the entities and relations. The abstract also provides links from extracted genes and gene products to the ERIC Annotations database, thus providing access to comprehensive genomic annotations and adding value to both the text-mining and annotations

  13. Biomedical Science, Unit I: Respiration in Health and Medicine. Respiratory Anatomy, Physiology and Pathology; The Behavior of Gases; Introductory Chemistry; and Air Pollution. Student Text. Revised Version, 1975.

    Science.gov (United States)

    Biomedical Interdisciplinary Curriculum Project, Berkeley, CA.

    This student text deals with the human respiratory system and its relation to the environment. Topics include the process of respiration, the relationship of air to diseases of the respiratory system, the chemical and physical properties of gases, the impact on air quality of human activities and the effect of this air pollution on health.…

  14. Evaluation of information retrieval and text mining tools on automatic named entity extraction. Intelligence and security informatics. Proceedings

    OpenAIRE

    Kumar, Nishant; De Beer, Jan; Vanthienen, Jan; Moens, Marie-Francine

    2006-01-01

    We will report evaluation of Automatic Named Entity Extraction feature of IR tools on Dutch, French, and English text. The aim is to analyze the competency of off-the-shelf information extraction tools in recognizing entity types including person, organization, location, vehicle, time, & currency from unstructured text. Within such an evaluation one can compare the effectiveness of different approaches for identifying named entities.

  15. Improving links between literature and biological data with text mining: a case study with GEO, PDB and MEDLINE.

    Science.gov (United States)

    Névéol, Aurélie; Wilbur, W John; Lu, Zhiyong

    2012-01-01

    High-throughput experiments and bioinformatics techniques are creating an exploding volume of data that are becoming overwhelming to keep track of for biologists and researchers who need to access, analyze and process existing data. Much of the available data are being deposited in specialized databases, such as the Gene Expression Omnibus (GEO) for microarrays or the Protein Data Bank (PDB) for protein structures and coordinates. Data sets are also being described by their authors in publications archived in literature databases such as MEDLINE and PubMed Central. Currently, the curation of links between biological databases and the literature mainly relies on manual labour, which makes it a time-consuming and daunting task. Herein, we analysed the current state of link curation between GEO, PDB and MEDLINE. We found that the link curation is heterogeneous depending on the sources and databases involved, and that overlap between sources is low; text mining tools can automatically provide valuable evidence to help curators broaden the scope of articles and database entries that they review. As a result, we made recommendations to improve the coverage of curated links, as well as the consistency of information available from different databases while maintaining high-quality curation. Database URLs: http://www.ncbi.nlm.nih.gov/PubMed, http://www.ncbi.nlm.nih.gov/geo/, http://www.rcsb.org/pdb/ PMID:22685160

  16. Biomedical signal processing

    CERN Document Server

    Akay, Metin

    1994-01-01

    Sophisticated techniques for signal processing are now available to the biomedical specialist! Written in an easy-to-read, straightforward style, Biomedical Signal Processing presents techniques to eliminate background noise, enhance signal detection, and analyze computer data, making results easy to comprehend and apply. In addition to examining techniques for electrical signal analysis, filtering, and transforms, the author supplies an extensive appendix with several computer programs that demonstrate techniques presented in the text.

  17. Effective Classification of Text

    OpenAIRE

    A Saritha; N NaveenKumar

    2014-01-01

    Text mining is the process of obtaining useful and interesting information from text. A huge amount of text data is available in various formats, and most of it is unstructured. Text mining usually involves structuring the input text (parsing it and inserting results into a database), deriving patterns from the structured data, and finally evaluating and interpreting the output. There are several data mining techniques proposed for mi...

  18. 文本挖掘、数据挖掘和知识管理%Text Mining,Data Mining vs.Knowledge Management:the Intelligent Information Processing in the 21st Century

    Institute of Scientific and Technical Information of China (English)

    韩客松; 王永成

    2001-01-01

    Based on an introduction to the concepts of data mining, text mining and knowledge management, this paper divides knowledge management, from the viewpoint of technical development, into three phases: knowledge repository, knowledge sharing and knowledge discovery. It analyses the key techniques and significance of knowledge discovery as the highest phase, and points out that knowledge discovery in text is an important direction of intelligent information processing in the new century.

  19. 基于文本挖掘技术的偏头痛临床诊疗规律分析%Analysis of Regularity of Clinical Medication for Migraine with Text Mining Approach

    Institute of Scientific and Technical Information of China (English)

    杨静; 蔡峰; 谭勇; 郑光; 李立; 姜淼; 吕爱平

    2013-01-01

    Objective To analyze the regularity of clinical medication in the treatment of migraine with a text mining approach. Methods The data set of migraine literature was downloaded from the Chinese BioMedical Literature Database (CBM). Rules relating TCM patterns, symptoms, Chinese herbal medicines (CHM), Chinese patent medicines (CPM) and western medicines for migraine were mined out with a data slicing algorithm, and the results were presented in frequency tables and a two-dimensional network. Results A total of 7 921 articles were retrieved. The main TCM syndrome classifications of migraine included liver-yang hyperactivity syndrome, stagnation of liver qi syndrome and qi deficiency with blood stasis syndrome. The core symptoms of migraine included headache, vomiting, nausea, vertigo and photophobia. Chinese herbal medicines for migraine included Chuanxiong, Tianma, Danshen, Chaihu, Danggui, Baishao and Baizhi. Among Chinese patent medicines, Yangxueqingnao granule, Toutongning capsule and Zhengtian pill were used in treating migraine. Among western medicines, flunarizine, nimodipine and aspirin were used frequently. For integrated treatment with TCM and western medicine, the combination of Yangxueqingnao granule and nimodipine was most commonly used. Conclusion The text mining approach provides a novel way to summarize treatment rules for migraine in both TCM and western medicine. To some extent, the migraine results from text mining have significance for clinical practice.
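
    The frequency tables and two-dimensional network mentioned above are, at bottom, co-occurrence counts over literature records. The sketch below shows that generic counting step only, not the data slicing algorithm used in the study; the records and keywords are invented for illustration.

      from collections import Counter
      from itertools import combinations

      # Invented literature records, each reduced to the controlled keywords it mentions.
      records = [
          {"Chuanxiong", "Tianma", "headache", "nausea"},
          {"Chuanxiong", "Baishao", "headache", "photophobia"},
          {"Tianma", "flunarizine", "headache"},
          {"Chuanxiong", "Tianma", "nausea"},
      ]

      term_freq = Counter()
      pair_freq = Counter()
      for rec in records:
          term_freq.update(rec)
          pair_freq.update(frozenset(p) for p in combinations(sorted(rec), 2))

      print(term_freq.most_common(3))
      # Edges of a co-occurrence network: pairs mentioned together in at least 2 records.
      edges = [(tuple(p), n) for p, n in pair_freq.items() if n >= 2]
      print(edges)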

  20. Application of Lanczos bidiagonalization algorithm in text mining%Lanczos双对角算法在文本挖掘当中的应用

    Institute of Scientific and Technical Information of China (English)

    范伟鹏

    2012-01-01

    Text mining plays an important role in data mining, and classical text mining is based on latent semantic analysis. Traditionally, latent semantic analysis obtains a low-rank approximation through singular value decomposition. The full singular value decomposition requires cubic time and is therefore costly, particularly when the matrix is large and sparse, as term-document matrices usually are. To address this problem, this paper applies the Lanczos bidiagonalization algorithm and the extended Lanczos bidiagonalization algorithm, both of which are efficient and effective methods for computing low-rank approximations of large sparse matrices.
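
    For readers who want to try the idea, SciPy exposes an iterative partial SVD (scipy.sparse.linalg.svds, ARPACK/Lanczos-style by default) that computes only the leading singular triplets of a sparse matrix. The sketch below applies it to a random sparse matrix standing in for a term-document matrix; it illustrates the low-rank approximation step in general and is not the algorithm implemented in the paper.

      import numpy as np
      from scipy.sparse import random as sparse_random
      from scipy.sparse.linalg import svds

      # Stand-in for a large sparse term-document matrix (terms x documents).
      A = sparse_random(10000, 2000, density=0.001, format="csr", random_state=0)

      k = 50                                # target rank for latent semantic analysis
      U, s, Vt = svds(A, k=k)               # iterative partial SVD of a sparse matrix
      order = np.argsort(s)[::-1]           # svds returns singular values in ascending order
      U, s, Vt = U[:, order], s[order], Vt[order, :]

      # Rank-k approximation A_k = U_k diag(s_k) V_k^T; documents live in a k-dimensional space.
      doc_vectors = (np.diag(s) @ Vt).T     # shape: (n_documents, k)
      print(doc_vectors.shape, s[:5].round(3))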

  1. 基于文本挖掘的流行病学致病因素的提取%Extraction of epidemiologic risk factors based on text mining

    Institute of Scientific and Technical Information of China (English)

    卢延鑫; 姚旭峰

    2013-01-01

    Objective To design a system that automatically extracts epidemiologic risk factors, based on text mining techniques. Methods The system consists of a text mining engine subsystem and a rule-based information extraction subsystem. First, all noun phrases are identified by the text mining engine subsystem and their semantic information is collected. Then, the epidemiologic risk factors are identified by the rule-based text classifier. Results Evaluation of the system on text annotated by an epidemiologist shows a best F-measure of 64.6% (precision 61.0%, recall 68.8%), which is better than in related work, with some remaining mistakes that are avoidable. Conclusions The text mining based method is very helpful for the automatic extraction of risk factors from the epidemiologic literature.
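
    To illustrate the second, rule-based stage described above (the noun phrases are assumed to arrive from the upstream mining engine), here is a deliberately tiny sketch; the cue patterns, sentence and rule are invented and much simpler than a real system's expert-derived rules.

      import re

      # Cue patterns a rule-based classifier might use to decide whether a noun phrase
      # in a sentence is being described as a risk factor.
      CAUSAL_CUES = re.compile(
          r"\b(risk factor for|associated with|increases? the risk of|predispos\w+ to)\b",
          re.IGNORECASE,
      )

      def classify_noun_phrases(sentence, noun_phrases):
          """Return the noun phrases that the simple rule marks as risk factors."""
          match = CAUSAL_CUES.search(sentence)
          if not match:
              return []
          kept = []
          for phrase in noun_phrases:
              pos = sentence.lower().find(phrase.lower())
              # Rule: keep noun phrases that appear before the causal cue.
              if pos != -1 and pos < match.start():
                  kept.append(phrase)
          return kept

      sentence = "Cigarette smoking is a well-established risk factor for lung cancer."
      noun_phrases = ["Cigarette smoking", "lung cancer"]   # as delivered by the mining engine
      print(classify_noun_phrases(sentence, noun_phrases))  # ['Cigarette smoking']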

  2. Simulation Research of Text Categorization Based on Data Mining%基于数据挖掘的文本自动分类仿真研究

    Institute of Scientific and Technical Information of China (English)

    赖娟

    2011-01-01

    Text classification optimization is studied. Text is semi-structured, its feature dimension often reaches tens of thousands, and the features are correlated and redundant, which hurts classification accuracy; traditional classification methods therefore struggle to achieve high accuracy. To improve the accuracy of automatic text categorization, a method based on data mining technology is proposed. Exploiting the insensitivity of support vector machines to feature correlation and sparseness and their strength on high-dimensional problems, the contribution of each word to classification is computed, words with similar contribution values are merged into a single feature of the text vector, and a support vector machine is then used to learn and classify over these features. Tests on a text classification corpus show that the data mining based method not only speeds up text classification but also improves classification accuracy and recall.
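
    The abstract outlines a two-step idea: score each word's contribution to classification, merge words with similar contributions into single features, and classify with a support vector machine. The code below is one rough, illustrative reading of that idea using scikit-learn rather than the paper's implementation: it trains a linear SVM on tf-idf features and then buckets words by the magnitude of their learned weights as a stand-in for "similar contribution".

      import numpy as np
      from sklearn.feature_extraction.text import TfidfVectorizer
      from sklearn.svm import LinearSVC

      # Tiny invented corpus with two classes (0 = sports, 1 = politics).
      docs = [
          "the match ended with a late goal and a penalty",
          "the striker scored a goal in the final match",
          "the election results and the new policy were debated",
          "parliament passed the policy after a long debate",
      ]
      labels = [0, 0, 1, 1]

      vec = TfidfVectorizer()
      X = vec.fit_transform(docs)
      clf = LinearSVC(C=1.0).fit(X, labels)

      # "Contribution" of each word: absolute weight in the linear decision function.
      weights = np.abs(clf.coef_).ravel()
      terms = np.array(vec.get_feature_names_out())

      # Merge words with similar contribution by bucketing the weights; each bucket
      # would become one aggregated feature in the reduced representation.
      edges = [weights.max() * 0.33, weights.max() * 0.66]
      buckets = np.digitize(weights, edges)
      for b in sorted(set(buckets)):
          print(f"bucket {b}: {sorted(terms[buckets == b])[:6]}")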

  3. Benefits of off-campus education for students in the health sciences: a text-mining analysis

    Directory of Open Access Journals (Sweden)

    Nakagawa Kazumasa

    2012-08-01

    Full Text Available Abstract Background In Japan, few community-based approaches have been adopted in health-care professional education, and the appropriate content for such approaches has not been clarified. In establishing community-based education for health-care professionals, clarification of its learning effects is required. A community-based educational program was started in 2009 in the health sciences course at Gunma University, and one of the main elements in this program is conducting classes outside school. The purpose of this study was to investigate using text-analysis methods how the off-campus program affects students. Methods In all, 116 self-assessment worksheets submitted by students after participating in the off-campus classes were decomposed into words. The extracted words were carefully selected from the perspective of contained meaning or content. With the selected terms, the relations to each word were analyzed by means of cluster analysis. Results Cluster analysis was used to select and divide 32 extracted words into four clusters: cluster 1—“actually/direct,” “learn/watch/hear,” “how,” “experience/participation,” “local residents,” “atmosphere in community-based clinical care settings,” “favorable,” “communication/conversation,” and “study”; cluster 2—“work of staff member” and “role”; cluster 3—“interaction/communication,” “understanding,” “feel,” “significant/important/necessity,” and “think”; and cluster 4—“community,” “confusing,” “enjoyable,” “proactive,” “knowledge,” “academic knowledge,” and “class.” Conclusions The students who participated in the program achieved different types of learning through the off-campus classes. They also had a positive impression of the community-based experience and interaction with the local residents, which is considered a favorable outcome. Off-campus programs could be a useful educational approach

  4. Study of Cloud Based ERP Services for Small and Medium Enterprises (Data is Processed by Text Mining Technique

    Directory of Open Access Journals (Sweden)

    SHARMA, R.

    2014-06-01

    Full Text Available The purpose of this research paper is to explore the knowledge of existing studies related to the current cloud computing trend. The outcome of the research is demonstrated in the form of a diagram which simplifies the ERP integration process for the in-house and cloud eco-systems. It provides a conceptual view to new clients or entrepreneurs using ERP services, explains how to deal with the two stages of ERP systems (cloud and in-house), and also suggests how to improve knowledge about ERP services and the implementation process for both stages. The work recommends which ERP services can be outsourced over the cloud. Cloud ERP is a mix of standard ERP services along with cloud flexibility and the low cost of affording these services. This is a recent phenomenon in enterprise service offerings. For most entrepreneurs without an IT background it is an unclear and broad concept, since all the research work related to it has been done within the last couple of years. Most cloud ERP vendors describe their products as straightforward, yet the process of selecting cloud ERP services and vendors is not clear. This research work draws a framework for selecting non-core business processes from preferred ERP service partners. It also recommends which ERP services should be outsourced first over the cloud, and discusses the security issues related to data or information moved out from company premises to the cloud eco-system.

  5. Association text classification of mining ItemSet significance%挖掘重要项集的关联文本分类

    Institute of Scientific and Technical Information of China (English)

    蔡金凤; 白清源

    2011-01-01

    To address the problem that, in the classifier construction stage of association rule classification algorithms, only the presence or absence of feature words is considered and the weights of text features are ignored, this paper proposes ISARC (ItemSet Significance-based ARC), an algorithm built on the association rule based text classification method ARC-BC that can improve the accuracy of associative text classification. ISARC uses feature term weights to define the significance of k-itemsets, generates association rules by mining significant itemsets, and takes the effect of lift on the text to be classified into account. Experimental results show that ISARC, which mines significant itemsets, can improve the accuracy of associative text classification. By way of background: text classification is an important basis of information retrieval and text mining, and its main task is to assign a category label according to a given category set; it has a wide range of applications in natural language processing and understanding, information organization and management, information filtering and other areas, and current approaches fall broadly into statistical, connectionist and rule-based methods. The basic idea of the traditional associative text classification algorithm, associative rule-based classifier by category (ARC-BC), is to use the association rule mining algorithm Apriori to generate frequent items, i.e. feature items or itemsets that appear frequently, to use these frequent items as rule antecedents with the category as the rule consequent, and to let the resulting rule set constitute the classifier. When classifying a test sample, if the sample matches a rule antecedent, the rule's confidence is added to the counter of the rule's category, and the sample is assigned to the category whose counter accumulates the maximum confidence. However, the ARC-BC algorithm has two main drawbacks: (1) during classifier construction it only considers the existence of feature words and ignores the weight of text features for mining frequent itemsets and generating association rules
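
    To make the "itemset significance" idea tangible, here is a small, hedged sketch rather than the ISARC implementation: frequent term sets are enumerated by plain support counting, and each itemset is then scored by combining its support with the average weight of its member terms, so that heavily weighted terms, not merely frequent ones, drive rule generation. The documents, weights and thresholds are invented.

      from itertools import combinations
      from collections import Counter

      # Invented transactions: each training document reduced to its feature terms.
      docs = [
          {"goal", "match", "striker"},
          {"goal", "match", "penalty"},
          {"policy", "election", "debate"},
          {"policy", "parliament", "debate"},
      ]
      # Invented per-term weights (e.g. tf-idf averaged over the class corpus).
      weight = {"goal": 0.9, "match": 0.4, "striker": 0.7, "penalty": 0.6,
                "policy": 0.8, "election": 0.7, "parliament": 0.6, "debate": 0.3}

      min_support = 0.5
      n = len(docs)
      counts = Counter()
      for d in docs:
          for k in (1, 2):
              counts.update(frozenset(c) for c in combinations(sorted(d), k))

      def significance(itemset, support):
          """Support weighted by the mean term weight of the itemset."""
          return support * sum(weight[t] for t in itemset) / len(itemset)

      frequent = {s: c / n for s, c in counts.items() if c / n >= min_support}
      ranked = sorted(frequent, key=lambda s: significance(s, frequent[s]), reverse=True)
      for s in ranked[:5]:
          print(set(s), round(frequent[s], 2), round(significance(s, frequent[s]), 2))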

  6. Construction of an annotated corpus to support biomedical information extraction

    Directory of Open Access Journals (Sweden)

    McNaught John

    2009-10-01

    Full Text Available Abstract Background Information Extraction (IE) is a component of text mining that facilitates knowledge discovery by automatically locating instances of interesting biomedical events from huge document collections. As events are usually centred on verbs and nominalised verbs, understanding the syntactic and semantic behaviour of these words is highly important. Corpora annotated with information concerning this behaviour can constitute a valuable resource in the training of IE components and resources. Results We have defined a new scheme for annotating sentence-bound gene regulation events, centred on both verbs and nominalised verbs. For each event instance, all participants (arguments) in the same sentence are identified and assigned a semantic role from a rich set of 13 roles tailored to biomedical research articles, together with a biological concept type linked to the Gene Regulation Ontology. To our knowledge, our scheme is unique within the biomedical field in terms of the range of event arguments identified. Using the scheme, we have created the Gene Regulation Event Corpus (GREC), consisting of 240 MEDLINE abstracts, in which events relating to gene regulation and expression have been annotated by biologists. A novel method of evaluating various different facets of the annotation task showed that average inter-annotator agreement rates fall within the range of 66% - 90%. Conclusion The GREC is a unique resource within the biomedical field, in that it annotates not only core relationships between entities, but also a range of other important details about these relationships, e.g., location, temporal, manner and environmental conditions. As such, it is specifically designed to support bio-specific tool and resource development. It has already been used to acquire semantic frames for inclusion within the BioLexicon (a lexical, terminological resource to aid biomedical text mining). Initial experiments have also shown that the corpus may

  7. 文本数据挖掘技术在Web知识库中的应用研究%The Applied Research of Text Data Mining Technology in the Web Knowledge Base

    Institute of Scientific and Technical Information of China (English)

    蔡立斌

    2012-01-01

    This article first briefly describes the basic theory of text data mining and knowledge extraction, and then analyzes the characteristics of network information retrieval and mining, especially the series of problems associated with text mining, Web data mining and content-based data mining. On this basis, it analyzes the theory and technology required for the design and construction of a Web knowledge base, for text data mining and for knowledge discovery, analyzes and designs the architecture and functional modules of the Web knowledge base system, and establishes a model of a Web knowledge base built on text data mining.

  8. Biomedical photonics handbook biomedical diagnostics

    CERN Document Server

    Vo-Dinh, Tuan

    2014-01-01

    Shaped by Quantum Theory, Technology, and the Genomics RevolutionThe integration of photonics, electronics, biomaterials, and nanotechnology holds great promise for the future of medicine. This topic has recently experienced an explosive growth due to the noninvasive or minimally invasive nature and the cost-effectiveness of photonic modalities in medical diagnostics and therapy. The second edition of the Biomedical Photonics Handbook presents fundamental developments as well as important applications of biomedical photonics of interest to scientists, engineers, manufacturers, teachers, studen

  9. Thrombin and Related Coding Gene Text Mining Analysis%凝血酶及其相关编码基因的文本挖掘分析

    Institute of Scientific and Technical Information of China (English)

    许航; 吴坚

    2012-01-01

    [Objective] To perform a text mining analysis of thrombin and its related coding genes. [Method] Literature on the topic published since 1980 was retrieved from PubMed using the search strategy "Thrombin" AND "gene". The retrieved records were processed with MMTx, and a Java program was designed to extract, from the processing results, pathological function concepts of the semantic type "Pathologic Function" and gene concepts of the type "Gene or Genome" to form a co-sentence matrix; co-word and cluster analysis of the text mining data were then performed with SPSS 16 to reflect the affinities between these concepts across the literature and to explore their actual connections and intrinsic links. [Result] The relationships between thrombin coding genes and the corresponding pathological functions were systematically surveyed and classified. [Conclusion] Literature reverse-tracing indicates that the text mining in this study is effective.

  10. 基于文本情报的数据挖掘%Data Mining Realization Technology Based on Text Intelligence Data

    Institute of Scientific and Technical Information of China (English)

    吕曹芳; 侯智斌

    2012-01-01

    This paper introduces data mining methods suited to intelligence data in the military domain and establishes a processing approach for unstructured text intelligence. Combining the characteristics of military intelligence, it proposes a framework model for data mining in military intelligence and discusses methods for mining Chinese text. Chinese word segmentation, keyword extraction, word frequency analysis and association analysis of intelligence text data are implemented.

  11. Benchmarking of the 2010 BioCreative Challenge III text-mining competition by the BioGRID and MINT interaction databases

    OpenAIRE

    Cesareni Gianni; Castagnoli Luisa; Iannuccelli Marta; Licata Luana; Briganti Leonardo; Perfetto Livia; Winter Andrew; Chatr-aryamontri Andrew; Tyers Mike

    2011-01-01

    Abstract Background The vast amount of data published in the primary biomedical literature represents a challenge for the automated extraction and codification of individual data elements. Biological databases that rely solely on manual extraction by expert curators are unable to comprehensively annotate the information dispersed across the entire biomedical literature. The development of efficient tools based on natural language processing (NLP) systems is essential for the selection of rele...

  12. Constructing a semantic predication gold standard from the biomedical literature

    Directory of Open Access Journals (Sweden)

    Kilicoglu Halil

    2011-12-01

    Full Text Available Abstract Background Semantic relations increasingly underpin biomedical text mining and knowledge discovery applications. The success of such practical applications crucially depends on the quality of extracted relations, which can be assessed against a gold standard reference. Most such references in biomedical text mining focus on narrow subdomains and adopt different semantic representations, rendering them difficult to use for benchmarking independently developed relation extraction systems. In this article, we present a multi-phase gold standard annotation study, in which we annotated 500 sentences randomly selected from MEDLINE abstracts on a wide range of biomedical topics with 1371 semantic predications. The UMLS Metathesaurus served as the main source for conceptual information and the UMLS Semantic Network for relational information. We measured interannotator agreement and analyzed the annotations closely to identify some of the challenges in annotating biomedical text with relations based on an ontology or a terminology. Results We obtain fair to moderate interannotator agreement in the practice phase (0.378-0.475). With improved guidelines and additional semantic equivalence criteria, the agreement increases by 12% (0.415 to 0.536) in the main annotation phase. In addition, we find that agreement increases to 0.688 when the agreement calculation is limited to those predications that are based only on the explicitly provided UMLS concepts and relations. Conclusions While interannotator agreement in the practice phase confirms that conceptual annotation is a challenging task, the increasing agreement in the main annotation phase points out that an acceptable level of agreement can be achieved in multiple iterations, by setting stricter guidelines and establishing semantic equivalence criteria. Mapping text to ontological concepts emerges as the main challenge in conceptual annotation. Annotating predications involving biomolecular

  13. Implantable CMOS Biomedical Devices

    Directory of Open Access Journals (Sweden)

    Toshihiko Noda

    2009-11-01

    Full Text Available The results of recent research on our implantable CMOS biomedical devices are reviewed. Topics include retinal prosthesis devices and deep-brain implantation devices for small animals. Fundamental device structures and characteristics as well as in vivo experiments are presented.

  14. Electronic Medical Record Writing Aided System Based on Text Mining%基于文本挖掘的电子病历书写辅助系统

    Institute of Scientific and Technical Information of China (English)

    周民

    2014-01-01

    Electronic medical record systems are widely used in hospital management. Because of the particular nature of their work, doctors cannot be expected to type as quickly as professional typists. This paper studies methods to help doctors enter medical record information quickly and, based on text mining technology, proposes a writing-aid system for electronic medical records. The system uses data mining technology to mine the information commonly used in medical records, builds different lexicons for different types of records, and uses pinyin initial abbreviations instead of full Chinese character input to speed up record entry.
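
    As a toy illustration of the pinyin-initial shortcut described above (not the system itself), the snippet keeps one lexicon per record type that maps initial-letter abbreviations to frequently used phrases, so typing the initials retrieves candidate completions. The record types, phrases and abbreviations are made up; a real system would mine them from historical records.

      # Hypothetical lexicons mined per record type: pinyin initials -> common phrases.
      LEXICONS = {
          "admission_note": {
              "fr": ["发热 (fever)"],
              "fl": ["乏力 (fatigue)"],
              "tytt": ["头晕头痛 (dizziness and headache)"],
          },
          "surgical_record": {
              "ssjg": ["手术经过 (operative course)"],
          },
      }

      def suggest(record_type, initials):
          """Return candidate phrases for the typed pinyin initials."""
          lexicon = LEXICONS.get(record_type, {})
          return lexicon.get(initials.lower(), [])

      print(suggest("admission_note", "fr"))    # ['发热 (fever)']
      print(suggest("admission_note", "xx"))    # [] -> fall back to normal input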

  15. A CONDITIONAL RANDOM FIELDS APPROACH TO BIOMEDICAL NAMED ENTITY RECOGNITION

    Institute of Scientific and Technical Information of China (English)

    2007-01-01

    Named entity recognition is a fundamental task in biomedical data mining. In this letter, a named entity recognition system based on CRFs (Conditional Random Fields) for biomedical texts is presented. The system makes extensive use of a diverse set of features, including local features, full text features and external resource features. All features incorporated in this system are described in detail,and the impacts of different feature sets on the performance of the system are evaluated. In order to improve the performance of system, post-processing modules are exploited to deal with the abbreviation phenomena, cascaded named entity and boundary errors identification. Evaluation on this system proved that the feature selection has important impact on the system performance, and the post-processing explored has an important contribution on system performance to achieve better results.
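
    A minimal sketch in the spirit of the letter above, written with the third-party sklearn-crfsuite package rather than the authors' system: the token features shown are a small subset of the local and orthographic features such systems typically use, and the training data is a single invented sentence, so the example only demonstrates the wiring.

      import sklearn_crfsuite

      def token_features(sent, i):
          """A few local/orthographic features for the i-th token (illustrative only)."""
          w = sent[i]
          return {
              "lower": w.lower(),
              "is_upper": w.isupper(),
              "is_title": w.istitle(),
              "has_digit": any(ch.isdigit() for ch in w),
              "suffix3": w[-3:],
              "prev": sent[i - 1].lower() if i > 0 else "<BOS>",
              "next": sent[i + 1].lower() if i < len(sent) - 1 else "<EOS>",
          }

      # One invented training sentence with BIO labels for protein mentions.
      sent = ["IL-2", "activates", "the", "JAK3", "kinase", "."]
      labels = ["B-protein", "O", "O", "B-protein", "I-protein", "O"]

      X_train = [[token_features(sent, i) for i in range(len(sent))]]
      y_train = [labels]

      crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1, max_iterations=50)
      crf.fit(X_train, y_train)
      print(crf.predict(X_train))   # reproduces the training labels on this toy set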

  16. Biomedical Engineering

    CERN Document Server

    Suh, Sang C; Tanik, Murat M

    2011-01-01

    Biomedical Engineering: Health Care Systems, Technology and Techniques is an edited volume with contributions from world experts. It provides readers with unique contributions related to current research and future healthcare systems. Practitioners and researchers focused on computer science, bioinformatics, engineering and medicine will find this book a valuable reference.

  17. CGMIM: Automated text-mining of Online Mendelian Inheritance in Man (OMIM to identify genetically-associated cancers and candidate genes

    Directory of Open Access Journals (Sweden)

    Jones Steven

    2005-03-01

    Full Text Available Abstract Background Online Mendelian Inheritance in Man (OMIM) is a computerized database of information about genes and heritable traits in human populations, based on information reported in the scientific literature. Our objective was to establish an automated text-mining system for OMIM that will identify genetically-related cancers and cancer-related genes. We developed the computer program CGMIM to search for entries in OMIM that are related to one or more cancer types. We performed manual searches of OMIM to verify the program results. Results In the OMIM database on September 30, 2004, CGMIM identified 1943 genes related to cancer. BRCA2 (OMIM *164757), BRAF (OMIM *164757) and CDKN2A (OMIM *600160) were each related to 14 types of cancer. There were 45 genes related to cancer of the esophagus, 121 genes related to cancer of the stomach, and 21 genes related to both. Analysis of CGMIM results indicates that fewer than three gene entries in OMIM should mention both, and the more than seven-fold discrepancy suggests cancers of the esophagus and stomach are more genetically related than the current literature suggests. Conclusion CGMIM identifies genetically-related cancers and cancer-related genes. In several ways, cancers with shared genetic etiology are anticipated to lead to further etiologic hypotheses and advances regarding environmental agents. CGMIM results are posted monthly and the source code can be obtained free of charge from the BC Cancer Research Centre website http://www.bccrc.ca/ccr/CGMIM.

  18. FROM DATA MINING TO BEHAVIOR MINING

    OpenAIRE

    ZHENGXIN CHEN

    2006-01-01

    Knowledge economy requires data mining be more goal-oriented so that more tangible results can be produced. This requirement implies that the semantics of the data should be incorporated into the mining process. Data mining is ready to deal with this challenge because recent developments in data mining have shown an increasing interest on mining of complex data (as exemplified by graph mining, text mining, etc.). By incorporating the relationships of the data along with the data itself (rathe...

  19. Data Mining.

    Science.gov (United States)

    Benoit, Gerald

    2002-01-01

    Discusses data mining (DM) and knowledge discovery in databases (KDD), taking the view that KDD is the larger view of the entire process, with DM emphasizing the cleaning, warehousing, mining, and visualization of knowledge discovery in databases. Highlights include algorithms; users; the Internet; text mining; and information extraction.…

  20. Problems in using p-curve analysis and text-mining to detect rate of p-hacking and evidential value.

    Science.gov (United States)

    Bishop, Dorothy V M; Thompson, Paul A

    2016-01-01

    Background. The p-curve is a plot of the distribution of p-values reported in a set of scientific studies. Comparisons between ranges of p-values have been used to evaluate fields of research in terms of the extent to which studies have genuine evidential value, and the extent to which they suffer from bias in the selection of variables and analyses for publication, p-hacking. Methods. p-hacking can take various forms. Here we used R code to simulate the use of ghost variables, where an experimenter gathers data on several dependent variables but reports only those with statistically significant effects. We also examined a text-mined dataset used by Head et al. (2015) and assessed its suitability for investigating p-hacking. Results. We show that when there is ghost p-hacking, the shape of the p-curve depends on whether dependent variables are intercorrelated. For uncorrelated variables, simulated p-hacked data do not give the "p-hacking bump" just below .05 that is regarded as evidence of p-hacking, though there is a negative skew when simulated variables are inter-correlated. The way p-curves vary according to features of underlying data poses problems when automated text mining is used to detect p-values in heterogeneous sets of published papers. Conclusions. The absence of a bump in the p-curve is not indicative of lack of p-hacking. Furthermore, while studies with evidential value will usually generate a right-skewed p-curve, we cannot treat a right-skewed p-curve as an indicator of the extent of evidential value, unless we have a model specific to the type of p-values entered into the analysis. We conclude that it is not feasible to use the p-curve to estimate the extent of p-hacking and evidential value unless there is considerable control over the type of data entered into the analysis. In particular, p-hacking with ghost variables is likely to be missed. PMID:26925335
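
    The simulation idea is easy to reproduce. The sketch below follows the general recipe in Python rather than the authors' R code: several dependent variables are generated with no true group difference, only the smallest p-value per "experiment" is reported, and the significant p-values are binned into a p-curve. The sample size, number of DVs and correlation are arbitrary choices; set rho above zero to intercorrelate the DVs.

      import numpy as np
      from scipy import stats

      rng = np.random.default_rng(1)
      n, n_dvs, rho, n_experiments = 30, 5, 0.0, 20000

      # Covariance matrix for the dependent variables.
      cov = np.full((n_dvs, n_dvs), rho) + (1 - rho) * np.eye(n_dvs)

      reported_p = []
      for _ in range(n_experiments):
          g1 = rng.multivariate_normal(np.zeros(n_dvs), cov, size=n)
          g2 = rng.multivariate_normal(np.zeros(n_dvs), cov, size=n)   # no true effect
          pvals = [stats.ttest_ind(g1[:, j], g2[:, j]).pvalue for j in range(n_dvs)]
          p = min(pvals)                 # "ghost" p-hacking: report only the best DV
          if p < 0.05:
              reported_p.append(p)

      # p-curve: distribution of reported significant p-values in .01-wide bins.
      hist, edges = np.histogram(reported_p, bins=np.arange(0, 0.051, 0.01))
      for lo, hi, count in zip(edges[:-1], edges[1:], hist):
          print(f"{lo:.2f}-{hi:.2f}: {count}")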

  1. RESEARCH ON TEXT MINING BASED ON BACKGROUND KNOWLEDGE AND ACTIVE LEARNING%基于背景知识和主动学习的文本挖掘技术研究

    Institute of Scientific and Technical Information of China (English)

    符保龙

    2013-01-01

    Achieving good text classification and text mining results often requires a large amount of labelled data. However, labelling data is not only complex to carry out but also expensive. This paper therefore introduces unlabelled data into text classification and text mining within a support vector machine based classification framework, implemented through two methods: one based on background knowledge and one based on active learning. Experimental results show that text mining based on background knowledge can deliver excellent performance when the baseline classifier is strong, while text mining based on active learning can improve the performance of text mining in the general case.

  2. Using a search engine-based mutually reinforcing approach to assess the semantic relatedness of biomedical terms.

    Directory of Open Access Journals (Sweden)

    Yi-Yu Hsu

    Full Text Available BACKGROUND: Determining the semantic relatedness of two biomedical terms is an important task for many text-mining applications in the biomedical field. Previous studies, such as those using ontology-based and corpus-based approaches, measured semantic relatedness by using information from the structure of biomedical literature, but these methods are limited by the small size of training resources. To increase the size of training datasets, the outputs of search engines have been used extensively to analyze the lexical patterns of biomedical terms. METHODOLOGY/PRINCIPAL FINDINGS: In this work, we propose the Mutually Reinforcing Lexical Pattern Ranking (ReLPR) algorithm for learning and exploring the lexical patterns of synonym pairs in biomedical text. ReLPR employs lexical patterns and their pattern containers to assess the semantic relatedness of biomedical terms. By combining sentence structures and the linking activities between containers and lexical patterns, our algorithm can explore the correlation between two biomedical terms. CONCLUSIONS/SIGNIFICANCE: The average correlation coefficient of the ReLPR algorithm was 0.82 for various datasets. The results of the ReLPR algorithm were significantly superior to those of previous methods.

  3. Biomedical Materials

    Institute of Scientific and Technical Information of China (English)

    CHANG Jiang; ZHOU Yanling

    2011-01-01

    Biomedical materials, biomaterials for short, are regarded as "any substance or combination of substances, synthetic or natural in origin, which can be used for any period of time, as a whole or as part of a system which treats, augments, or replaces any tissue, organ or function of the body" (Vonrecum & Laberge, 1995). Biomaterials can save lives, relieve suffering and enhance the quality of life for human beings.

  4. Design and Implementation of a Comprehensive Web-based Survey for Ovarian Cancer Survivorship with an Analysis of Prediagnosis Symptoms via Text Mining.

    Science.gov (United States)

    Sun, Jiayang; Bogie, Kath M; Teagno, Joe; Sun, Yu-Hsiang Sam; Carter, Rebecca R; Cui, Licong; Zhang, Guo-Qiang

    2014-01-01

    Ovarian cancer (OvCa) is the most lethal gynecologic disease in the United States, with an overall 5-year survival rate of 44.5%, about half of the 89.2% for all breast cancer patients. To identify factors that possibly contribute to the long-term survivorship of women with OvCa, we conducted a comprehensive online Ovarian Cancer Survivorship Survey from 2009 to 2013. This paper presents the design and implementation of our survey, introduces its resulting data source, the OVA-CRADLE™ (Clinical Research Analytics and Data Lifecycle Environment), and illustrates a sample application of the survey and data by an analysis of prediagnosis symptoms, using text mining and statistics. The OVA-CRADLE™ is an application of our patented Physio-MIMI technology, facilitating Web-based access, online query and exploration of data. The prediagnostic symptoms and association of early-stage OvCa diagnosis with endometriosis provide potentially important indicators for future studies in this field. PMID:25861211

  5. Spatial Patterns of the Indications of Acupoints Using Data Mining in Classic Medical Text: A Possible Visualization of the Meridian System.

    Science.gov (United States)

    Jung, Won-Mo; Lee, Taehyung; Lee, In-Seon; Kim, Sanghyun; Jang, Hyunchul; Kim, Song-Yi; Park, Hi-Joon; Chae, Younbyoung

    2015-01-01

    The indications of acupoints are thought to be highly associated with the lines of the meridian systems. The present study used data mining methods to analyze the characteristics of the indications of each acupoint and to visualize the relationships between the acupoints and disease sites in the classic Korean medical text Chimgoogyeongheombang. Using a term frequency-inverse document frequency (tf-idf) scheme, the present study extracted valuable data regarding the indications of each acupoint according to the frequency of the cooccurrences of eight Source points and eighteen disease sites. Furthermore, the spatial patterns of the indications of each acupoint on a body map were visualized according to the tf-idf values. Each acupoint along the different meridians exhibited different constellation patterns at various disease sites. Additionally, the spatial patterns of the indications of each acupoint were highly associated with the route of the corresponding meridian. The present findings demonstrate that the indications of each acupoint were primarily associated with the corresponding meridian system. Furthermore, these findings suggest that the routes of the meridians may have clinical implications in terms of identifying the constellations of the indications of acupoints. PMID:26539224
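
    As a rough illustration of the tf-idf weighting described above (not the study's actual computation), the sketch treats each acupoint as a "document" whose terms are the disease sites it co-occurs with in the text, and scores each acupoint-site pair; the acupoint codes are real but the counts are fabricated.

      import math
      from collections import Counter

      # Fabricated co-occurrence counts: acupoint -> {disease site: count}.
      cooc = {
          "LU9": Counter({"chest": 8, "throat": 5, "head": 1}),
          "HT7": Counter({"chest": 6, "head": 4}),
          "LR3": Counter({"head": 9, "abdomen": 3}),
      }

      n_docs = len(cooc)
      df = Counter()                      # in how many acupoint "documents" each site appears
      for sites in cooc.values():
          df.update(sites.keys())

      def tf_idf(acupoint, site):
          tf = cooc[acupoint][site] / sum(cooc[acupoint].values())
          idf = math.log(n_docs / df[site])
          return tf * idf

      for acupoint in cooc:
          scores = {s: round(tf_idf(acupoint, s), 3) for s in cooc[acupoint]}
          print(acupoint, scores)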

  6. Effective use of Latent Semantic Indexing and Computational Linguistics in Biological and Biomedical Applications

    Directory of Open Access Journals (Sweden)

    Hongyu eChen

    2013-01-01

    Full Text Available Text mining is rapidly becoming an essential technique for the annotation and analysis of large biological data sets. Biomedical literature currently increases at a rate of several thousand papers per week, making automated information retrieval methods the only feasible method of managing this expanding corpus. With the increasing prevalence of open-access journals and constant growth of publicly-available repositories of biomedical literature, literature mining has become much more effective with respect to the extraction of biomedically-relevant data. In recent years, text mining of popular databases such as MEDLINE has evolved from basic term-searches to more sophisticated natural language processing techniques, indexing and retrieval methods, structural analysis and integration of literature with associated metadata. In this review, we will focus on Latent Semantic Indexing (LSI), a computational linguistics technique increasingly used for a variety of biological purposes. It is noted for its ability to consistently outperform benchmark Boolean text searches and co-occurrence models at information retrieval and its power to extract indirect relationships within a data set. LSI has been used successfully to formulate new hypotheses, generate novel connections from existing data, and validate empirical data.
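
    As a compact, hedged example of the LSI workflow the review surveys (not code from it), the snippet below builds tf-idf vectors for a toy set of abstracts, projects them into a low-dimensional latent space with truncated SVD, and ranks the documents against a query by cosine similarity in that space.

      from sklearn.feature_extraction.text import TfidfVectorizer
      from sklearn.decomposition import TruncatedSVD
      from sklearn.metrics.pairwise import cosine_similarity

      docs = [
          "p53 mutations are frequent in human tumours",
          "the tumour suppressor p53 regulates apoptosis",
          "insulin signalling controls glucose uptake in muscle",
          "glucose metabolism and insulin resistance in diabetes",
      ]

      vec = TfidfVectorizer(stop_words="english")
      X = vec.fit_transform(docs)

      lsi = TruncatedSVD(n_components=2, random_state=0)   # latent semantic space
      X_lsi = lsi.fit_transform(X)

      query = "apoptosis and tumour suppressors"
      q_lsi = lsi.transform(vec.transform([query]))

      scores = cosine_similarity(q_lsi, X_lsi).ravel()
      for rank, i in enumerate(scores.argsort()[::-1], 1):
          print(rank, round(scores[i], 3), docs[i])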

  7. Next-generation text-mining mediated generation of chemical response-specific gene sets for interpretation of gene expression data

    Directory of Open Access Journals (Sweden)

    Hettne Kristina M

    2013-01-01

    Full Text Available Abstract Background Availability of chemical response-specific lists of genes (gene sets) for pharmacological and/or toxic effect prediction for compounds is limited. We hypothesize that more gene sets can be created by next-generation text mining (next-gen TM), and that these can be used with gene set analysis (GSA) methods for chemical treatment identification, for pharmacological mechanism elucidation, and for comparing compound toxicity profiles. Methods We created 30,211 chemical response-specific gene sets for human and mouse by next-gen TM, and derived 1,189 (human) and 588 (mouse) gene sets from the Comparative Toxicogenomics Database (CTD). We tested for significant differential expression (SDE) (false discovery rate-corrected p-values Results Next-gen TM-derived gene sets matching the chemical treatment were significantly altered in three GE data sets, and the corresponding CTD-derived gene sets were significantly altered in five GE data sets. Six next-gen TM-derived and four CTD-derived fibrate gene sets were significantly altered in the PPARA knock-out GE dataset. None of the fibrate signatures in cMap scored significant against the PPARA GE signature. 33 environmental toxicant gene sets were significantly altered in the triazole GE data sets. 21 of these toxicants had a similar toxicity pattern as the triazoles. We confirmed embryotoxic effects, and discriminated triazoles from other chemicals. Conclusions Gene set analysis with next-gen TM-derived chemical response-specific gene sets is a scalable method for identifying similarities in gene responses to other chemicals, from which one may infer potential mode of action and/or toxic effect.
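
    Gene set analysis can be run in several ways; one common, simple variant is an over-representation test with the hypergeometric distribution. The sketch below shows only that variant, with invented gene identifiers and set sizes, and is not the GSA procedure used in the paper.

      from scipy.stats import hypergeom

      # Invented example: a universe of measured genes, a text-mined chemical-response
      # gene set, and the differentially expressed (DE) genes from an experiment.
      universe = {f"G{i}" for i in range(1, 2001)}                         # 2000 measured genes
      gene_set = {f"G{i}" for i in range(1, 101)}                          # 100-gene TM-derived set
      diff_expressed = {f"G{i}" for i in range(1, 31)} | {"G500", "G1500"} # 32 DE genes

      N = len(universe)                      # population size
      K = len(gene_set & universe)           # gene set members present in the universe
      n = len(diff_expressed & universe)     # number of DE genes drawn
      k = len(diff_expressed & gene_set)     # overlap between DE genes and the gene set

      # P(overlap >= k) under random draws of n genes from the universe.
      p_value = hypergeom.sf(k - 1, N, K, n)
      print(f"overlap={k}, p={p_value:.3e}")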

  8. 人文社会科学研究中文本挖掘技术应用进展%Progress of Text Mining Applications in Humanities and Social Science

    Institute of Scientific and Technical Information of China (English)

    郭金龙; 许鑫; 陆宇杰

    2012-01-01

    As an effective tool for handling the data deluge, text mining has received widespread attention in the humanities and social sciences in recent years. This paper first summarizes the relevant techniques of text mining and the current state of research, and then introduces the specific applications of frequently used text mining methods, such as information extraction, text classification, text clustering, association rules and pattern discovery, in humanities and social science research, so as to expand the application domain of text mining and provide new ideas for methodological innovation in humanities and social science research.

  9. An Italian Biomedical Publications Database

    OpenAIRE

    De Robbio, Antonella; Mozzati, Paola; Lazzari, Luigina; Maguolo, Dario; Dolfino, Manuela; Gradito, Paola

    2002-01-01

    Periodical scientific literature is one of the most important information sources for the scientific community and particularly for biomedicine. As regards Italian publications today, apart from very few laudable exceptions, there is a lack of the instruments necessary for accessing the information that they contain. Of over 700 Italian biomedical texts, only 25% are mentioned in the more important biomedical data banks, such as Medline, Embase, Pascal, CAB, with unfortunately a great deal...

  10. MONK Project and the Reference for Text Mining Applied to the Humanities in China%MONK项目及其对我国人文领域文本挖掘的借鉴

    Institute of Scientific and Technical Information of China (English)

    许鑫; 郭金龙; 蔚海燕

    2012-01-01

    MONK is a cross-disciplinary text mining project in the humanities undertaken by several universities and research institutes from the United States and Canada. This paper mainly discusses the text mining process of MONK as well as the relevant tools, techniques and algorithms. Two case studies based on MONK are introduced to give details about the application of text mining to the humanities. Finally the authors summarize several kinds of applications of text mining in the humanities and discuss what the MONK project suggests for applying text mining in the humanities in China.

  11. Optical Polarization in Biomedical Applications

    CERN Document Server

    Tuchin, Valery V; Zimnyakov, Dmitry A

    2006-01-01

    Optical Polarization in Biomedical Applications introduces key developments in optical polarization methods for quantitative studies of tissues, while presenting the theory of polarization transfer in a random medium as a basis for the quantitative description of polarized light interaction with tissues. This theory uses the modified transfer equation for Stokes parameters and predicts the polarization structure of multiple scattered optical fields. The backscattering polarization matrices (Jones matrix and Mueller matrix) important for noninvasive medical diagnostics are introduced. The text also describes a number of diagnostic techniques such as CW polarization imaging and spectroscopy, polarization microscopy and cytometry. As a new tool for medical diagnosis, optical coherent polarization tomography is analyzed. The monograph also covers a range of biomedical applications, among them cataract and glaucoma diagnostics, glucose sensing, and the detection of bacteria.

  12. Toxicology of Biomedical Polymers

    Directory of Open Access Journals (Sweden)

    P. V. Vedanarayanan

    1987-04-01

    Full Text Available This paper deals with the various types of polymers used in the fabrication of medical devices, their diverse applications and the toxic hazards which may arise out of their use. The potential toxicity of monomers and of the various additives used in the manufacture of biomedical polymers is discussed, along with hazards which may arise from the processing of devices, such as sterilization. The importance of quality control and stringent toxicity evaluation methods is emphasised, since in our country, at present, there are no regulations covering the manufacturing and marketing of medical devices. Finally, the question of the general and subtle long-term systemic toxicity of biomedical polymers is brought to attention, with the suggestion that this question needs to be resolved permanently by appropriate studies.

  13. Text Sentiment Analysis Algorithm Combining Semantic Association Mining

    Institute of Scientific and Technical Information of China (English)

    明均仁

    2012-01-01

    Facing the growing textual sentiment information resources on the network, using association mining technology to mine and analyse them automatically and intelligently, and thereby obtain user sentiment knowledge at the semantic level, has important potential value for enterprises in formulating competitive strategies and maintaining competitive advantage. This paper integrates association mining technology into text sentiment analysis, and researches and designs a text sentiment analysis algorithm combining semantic association mining, realizing sentiment analysis and user sentiment knowledge mining at the semantic level. Experimental results demonstrate that the algorithm achieves the expected effect, dramatically improving the accuracy and efficiency of sentiment analysis as well as the depth and breadth of association mining.

  14. Application of Text Mining Technology in Traditional Chinese Medicine Literature Research

    Institute of Scientific and Technical Information of China (English)

    郭洪涛

    2013-01-01

    Objective: To investigate the application of text mining technology in traditional Chinese medicine (TCM) literature research. Methods: Achievements of the application of text mining technology in TCM were summarized through a review of the literature published in recent years. Results: Text mining technology can parse data in both linear and nonlinear ways, supports a high level of knowledge integration, and is good at dealing with fuzzy and non-quantitative data. In the future, text mining may integrate TCM data with protein and metabolomics data and analyse combinations of active ingredients of Chinese medicines, building a platform for new drug discovery and the formation of combination drugs. Conclusion: Using text mining technology for the research and analysis of traditional Chinese medicine is a promising approach.

  15. Centroid Based Text Clustering

    Directory of Open Access Journals (Sweden)

    Priti Maheshwari

    2010-09-01

    Full Text Available Web mining is a burgeoning new field that attempts to glean meaningful information from natural language text. Web mining refers generally to the process of extracting interesting information and knowledge from unstructured text. Text clustering is one of the important Web mining functionalities; it is the task in which texts are classified into groups of similar objects based on their contents. Current research in the area of Web mining tackles problems of text data representation, classification, clustering, information extraction, and the search for and modeling of hidden patterns. In this paper we propose that, for mining large document collections, it is necessary to pre-process the web documents and store the information in a data structure which is more appropriate for further processing than a plain web file. We developed a PHP/MySQL-based utility to convert unstructured web documents into a structured tabular representation by preprocessing and indexing. We apply a centroid-based web clustering method on the preprocessed data, using three clustering approaches. Finally, we propose a method that can increase accuracy based on the clustering of documents.
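
    The record above describes the general centroid-based clustering recipe (TF-IDF document vectors, assignment to the most similar centroid, centroid recomputation) rather than giving code. The following is a minimal Python sketch of that recipe only, not the authors' PHP/MySQL utility; the toy corpus, the number of clusters and the number of iterations are illustrative assumptions.

```python
# Minimal sketch of centroid-based text clustering: TF-IDF vectors,
# each document assigned to the most similar centroid, centroids recomputed.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["web mining extracts knowledge from unstructured text",        # toy corpus
        "clustering groups similar documents together",
        "centroid based methods assign documents to the closest centroid",
        "tf idf weighting scores terms by frequency and rarity"]
k = 2

X = TfidfVectorizer(stop_words="english").fit_transform(docs).toarray()
rng = np.random.default_rng(0)
centroids = X[rng.choice(len(docs), size=k, replace=False)]

for _ in range(10):                                   # a few refinement passes
    centroids /= np.linalg.norm(centroids, axis=1, keepdims=True) + 1e-12
    labels = (X @ centroids.T).argmax(axis=1)         # cosine: rows are L2-normalised
    centroids = np.vstack([X[labels == c].mean(axis=0) if (labels == c).any()
                           else centroids[c] for c in range(k)])

print(list(labels))
```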

  16. A Review of Patent Technical Topic Analysis Based on Text Mining

    Institute of Scientific and Technical Information of China (English)

    胡阿沛; 张静; 雷孝平; 张晓宇

    2013-01-01

    To cope with the challenges that huge numbers of patent documents and increasingly sophisticated technology pose for patent technical topic analysis, patent technical topic analysis based on text mining has recently become a research focus. Firstly, the concept of text mining and its development history are introduced. Secondly, methods of analysing patent technical topics based on text mining are summarized, including topic-word frequency analysis, co-word analysis, text clustering analysis, and analysis combined with citation clustering; commonly used analytical tools are reviewed and a new science mapping analysis software tool, SciMAT, is introduced. Finally, the paper points out the advantages and deficiencies of technical topic analysis based on text mining and suggests directions for future research.

  17. Biomedical engineering and nanotechnology

    International Nuclear Information System (INIS)

    This book is predominantly a compilation of papers presented in the conference which is focused on the development in biomedical materials, biomedical devices and instrumentation, biomedical effects of electromagnetic radiation, electrotherapy, radiotherapy, biosensors, biotechnology, bioengineering, tissue engineering, clinical engineering and surgical planning, medical imaging, hospital system management, biomedical education, biomedical industry and society, bioinformatics, structured nanomaterial for biomedical application, nano-composites, nano-medicine, synthesis of nanomaterial, nano science and technology development. The papers presented herein contain the scientific substance to suffice the academic directivity of the researchers from the field of biomedicine, biomedical engineering, material science and nanotechnology. Papers relevant to INIS are indexed separately

  18. A Review of Typical Applications of Text Mining in Humanities and Social Science Research

    Institute of Scientific and Technical Information of China (English)

    陆宇杰; 许鑫; 郭金龙

    2012-01-01

    This paper investigates the current status of text mining applications in the humanities and social sciences, introduces successful international cases and experience of applying text mining in these fields, presents the latest research progress of text mining in the humanities and social sciences, and aims to provide some inspiration for related research in China.

  19. BRONCO: Biomedical entity Relation ONcology COrpus for extracting gene-variant-disease-drug relations

    Science.gov (United States)

    Lee, Kyubum; Lee, Sunwon; Park, Sungjoon; Kim, Sunkyu; Kim, Suhkyung; Choi, Kwanghun; Tan, Aik Choon; Kang, Jaewoo

    2016-01-01

    Comprehensive knowledge of genomic variants in a biological context is key for precision medicine. As next-generation sequencing technologies improve, the amount of literature containing genomic variant data, such as new functions or related phenotypes, rapidly increases. Because numerous articles are published every day, it is almost impossible to manually curate all the variant information from the literature. Many researchers focus on creating an improved automated biomedical natural language processing (BioNLP) method that extracts useful variants and their functional information from the literature. However, there is no gold-standard data set that contains texts annotated with variants and their related functions. To overcome these limitations, we introduce a Biomedical entity Relation ONcology COrpus (BRONCO) that contains more than 400 variants and their relations with genes, diseases, drugs and cell lines in the context of cancer and anti-tumor drug screening research. The variants and their relations were manually extracted from 108 full-text articles. BRONCO can be utilized to evaluate and train new methods used for extracting biomedical entity relations from full-text publications, and thus be a valuable resource to the biomedical text mining research community. Using BRONCO, we quantitatively and qualitatively evaluated the performance of three state-of-the-art BioNLP methods. We also identified their shortcomings, and suggested remedies for each method. We implemented post-processing modules for the three BioNLP methods, which improved their performance. Database URL: http://infos.korea.ac.kr/bronco PMID:27074804

  1. The Design and Implementation of a Text Mining Algorithm Based on the MapReduce Framework

    Institute of Scientific and Technical Information of China (English)

    朱蔷蔷; 张桂芸; 刘文龙

    2012-01-01

    With the expanding application of text mining in active information services, analyzing the inherent characteristics of text data has become a current research trend. This paper designs and implements a text mining algorithm on the Hadoop platform: using the MapReduce framework, it counts adjacent word pairs in natural-language corpora and outputs them in descending order of frequency, thus helping users mine the links between itemsets in large quantities of data. Given the distributed nature of the Hadoop platform, the experimental results show the effectiveness of the algorithm and a good speedup.
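
    The abstract describes the core MapReduce logic only (map emits adjacent word pairs, reduce sums their counts, output is sorted by descending frequency). A minimal sketch of that logic in plain Python follows; it assumes a toy in-memory corpus and does not involve Hadoop itself.

```python
# Sketch of the map/reduce logic described above, in plain Python:
# map emits adjacent word pairs, reduce sums their counts, and the
# result is listed in descending order of frequency.
from collections import Counter
from itertools import chain

def map_phase(line):
    words = line.lower().split()
    return [((a, b), 1) for a, b in zip(words, words[1:])]   # adjacent pairs

def reduce_phase(pairs):
    counts = Counter()
    for key, value in pairs:
        counts[key] += value
    return counts

corpus = ["text mining on hadoop platform",
          "text mining finds adjacent word pairs",
          "hadoop platform supports text mining"]

counts = reduce_phase(chain.from_iterable(map_phase(line) for line in corpus))
for (a, b), n in counts.most_common():
    print(f"{a} {b}\t{n}")
```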

  2. Principles of Biomedical Engineering

    CERN Document Server

    Madihally, Sundararajan V

    2010-01-01

    Describing the role of engineering in medicine today, this comprehensive volume covers a wide range of the most important topics in this burgeoning field. Supported with over 145 illustrations, the book discusses bioelectrical systems, mechanical analysis of biological tissues and organs, biomaterial selection, compartmental modeling, and biomedical instrumentation. Moreover, you find a thorough treatment of the concept of using living cells in various therapeutics and diagnostics.Structured as a complete text for students with some engineering background, the book also makes a valuable refere

  3. A Semantics-Based Approach to Retrieving Biomedical Information

    DEFF Research Database (Denmark)

    Andreasen, Troels; Bulskov, Henrik; Zambach, Sine;

    2011-01-01

    This paper describes an approach to representing, organising, and accessing conceptual content of biomedical texts using a formal ontology. The ontology is based on UMLS resources supplemented with domain ontologies developed in the project. The approach introduces the notion of 'generative ontologies', i.e., ontologies providing increasingly specialised concepts reflecting the phrase structure of natural language. Furthermore, we propose a novel so-called ontological semantics which maps noun phrases from texts and queries into nodes in the generative ontology. This enables an advanced form of data mining of texts, identifying paraphrases and concept relations and measuring distances between key concepts in texts. Thus, the project is distinct in its attempt to provide a formal underpinning of conceptual similarity or relatedness of meaning.

  4. Application of an efficient Bayesian discretization method to biomedical data

    Directory of Open Access Journals (Sweden)

    Gopalakrishnan Vanathi

    2011-07-01

    Full Text Available Abstract Background Several data mining methods require data that are discrete, and other methods often perform better with discrete data. We introduce an efficient Bayesian discretization (EBD) method for optimal discretization of variables that runs efficiently on high-dimensional biomedical datasets. The EBD method consists of two components, namely, a Bayesian score to evaluate discretizations and a dynamic programming search procedure to efficiently search the space of possible discretizations. We compared the performance of EBD to Fayyad and Irani's (FI) discretization method, which is commonly used for discretization. Results On 24 biomedical datasets obtained from high-throughput transcriptomic and proteomic studies, the classification performances of the C4.5 classifier and the naïve Bayes classifier were statistically significantly better when the predictor variables were discretized using EBD over FI. EBD was statistically significantly more stable to the variability of the datasets than FI. However, EBD was less robust, though not statistically significantly so, than FI and produced slightly more complex discretizations than FI. Conclusions On a range of biomedical datasets, a Bayesian discretization method (EBD) yielded better classification performance and stability but was less robust than the widely used FI discretization method. The EBD discretization method is easy to implement, permits the incorporation of prior knowledge and belief, and is sufficiently fast for application to high-dimensional data.
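
    The EBD scoring and dynamic-programming search are not reproduced here. The sketch below only illustrates the surrounding experimental setup, i.e., comparing naive Bayes and decision-tree classifiers on raw versus discretized features, with scikit-learn's quantile binning standing in for EBD and a bundled dataset standing in for the omics data of the study.

```python
# Sketch of comparing classifiers on raw vs. discretized features.
# KBinsDiscretizer (quantile binning) stands in for the EBD method;
# the breast-cancer dataset stands in for the omics data of the study.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
disc = KBinsDiscretizer(n_bins=4, encode="ordinal", strategy="quantile")

for name, clf in [("naive Bayes", GaussianNB()),
                  ("decision tree", DecisionTreeClassifier(random_state=0))]:
    raw = cross_val_score(clf, X, y, cv=5).mean()
    binned = cross_val_score(make_pipeline(disc, clf), X, y, cv=5).mean()
    print(f"{name}: raw={raw:.3f}  discretized={binned:.3f}")
```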

  5. Algorithm for Text Vector Feature Mining Based on Multi-Factor Analysis of Variance

    Institute of Scientific and Technical Information of China (English)

    谭海中; 何波

    2015-01-01

    Text vector feature mining is applied in the field of information resource organization and management and has great application value in large-scale data mining, but traditional algorithms based on K-means clustering achieve poor accuracy. A new text vector feature mining algorithm based on multi-factor analysis of variance is proposed. Multi-factor ANOVA is used to obtain feature-mining regularities from multiple corpora; combined with an ant colony algorithm, transfer rules trained on ant-colony fitness probabilities yield the maximum probability of effective features in the data sets obtained at the most recent stage of population evolution. The initial K-means cluster centres are then selected through an optimal partition: the sample data are first partitioned, and the initial cluster centres are determined from the distribution characteristics of the samples, which improves text feature mining performance. Simulation results show that the algorithm improves the clustering of text feature vectors and hence the performance of feature mining, achieving higher recall and detection rates for data features with less time consumed, so it has considerable application value in areas such as data mining.
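
    The multi-factor ANOVA and ant-colony components are not reconstructed here. The sketch below illustrates only the final step the abstract mentions, choosing K-means initial centres from a prior partition of the samples instead of at random; the partition used (equal-frequency slices along the first principal component) is an illustrative assumption, not the paper's optimal-partition procedure.

```python
# Sketch of choosing K-means initial centres from a prior partition of the
# samples (here: equal-frequency slices along the first principal component),
# rather than at random. The ANOVA/ant-colony parts of the paper are omitted.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc, 0.5, size=(50, 5)) for loc in (0.0, 3.0, 6.0)])
k = 3

proj = PCA(n_components=1).fit_transform(X).ravel()
order = np.argsort(proj)
slices = np.array_split(order, k)                       # partition the samples
init_centres = np.vstack([X[idx].mean(axis=0) for idx in slices])

km = KMeans(n_clusters=k, init=init_centres, n_init=1, random_state=0).fit(X)
print(km.cluster_centers_.round(2))
```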

  6. Application of Text Mining in Employment Information Analysis in Higher Vocational Colleges

    Institute of Scientific and Technical Information of China (English)

    宁建飞

    2016-01-01

    ... text classification, text clustering, text association analysis, distribution analysis and trend prediction, among others. Text association analysis is a key mining task among these and one of the more widely used techniques in text information processing; it is the main technique applied in this paper. Taking the employment information data of graduates in higher vocational colleges as the analysis object, text mining is applied to employment information analysis. Through mining the employment information data, valuable results are obtained that serve as important references for talent training, employment guidance and other scientific decisions. The experimental results show that text mining is a very effective method for the analysis of employment information.

  7. BioN∅T: A searchable database of biomedical negated sentences

    Directory of Open Access Journals (Sweden)

    Agarwal Shashank

    2011-10-01

    Full Text Available Abstract Background Negated biomedical events are often ignored by text-mining applications; however, such events carry scientific significance. We report on the development of BioN∅T, a database of negated sentences that can be used to extract such negated events. Description Currently BioN∅T incorporates ≈32 million negated sentences, extracted from over 336 million biomedical sentences from three resources: ≈2 million full-text biomedical articles in Elsevier and the PubMed Central, as well as ≈20 million abstracts in PubMed. We evaluated BioN∅T on three important genetic disorders: autism, Alzheimer's disease and Parkinson's disease, and found that BioN∅T is able to capture negated events that may be ignored by experts. Conclusions The BioN∅T database can be a useful resource for biomedical researchers. BioN∅T is freely available at http://bionot.askhermes.org/. In future work, we will develop semantic web related technologies to enrich BioN∅T.
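
    BioN∅T's own extraction pipeline is not detailed in the record; the following is a minimal cue-based sketch of how negated sentences can be flagged when building such a corpus, with an illustrative negation cue list (not the one used by BioN∅T).

```python
# Minimal cue-based sketch for flagging negated biomedical sentences,
# in the spirit of building a negation-focused corpus. The cue list is
# illustrative, not the one used by BioN0T.
import re

NEGATION_CUES = {"no", "not", "never", "without", "absence", "lack",
                 "failed", "unable", "neither", "nor"}

def is_negated(sentence: str) -> bool:
    tokens = re.findall(r"[a-z]+", sentence.lower())
    return any(tok in NEGATION_CUES for tok in tokens)

sentences = [
    "TP53 mutation was not associated with reduced survival.",
    "The treatment increased expression of BDNF in the hippocampus.",
    "There was no evidence of amyloid deposition in control mice.",
]
for s in sentences:
    print(is_negated(s), "-", s)
```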

  8. BIG: a Grid Portal for Biomedical Data and Images

    Directory of Open Access Journals (Sweden)

    Giovanni Aloisio

    2004-06-01

    Full Text Available Modern management of biomedical systems involves the use of many distributed resources, such as high performance computational resources to analyze biomedical data, mass storage systems to store them, medical instruments (microscopes, tomographs, etc.), and advanced visualization and rendering tools. Grids offer the computational power, security and availability needed by such novel applications. This paper presents BIG (Biomedical Imaging Grid), a Web-based Grid portal for management of biomedical information (data and images) in a distributed environment. BIG is an interactive environment that deals with complex user requests, regarding the acquisition of biomedical data and the "processing" and "delivering" of biomedical images, using the power and security of Computational Grids.

  9. Association Mining on Massive Text under Full Confidence Based on an Incremental Queue

    Institute of Scientific and Technical Information of China (English)

    刘炜

    2015-01-01

    Association mining is an important data analysis method. This article proposes an incremental queue association mining algorithm model under full confidence; applying full-confidence rules within the traditional FP-Growth and FP-Tree association mining improves the adaptability of the algorithm. On this basis, the article proposes the FP4W-Growth algorithm and applies it to the association calculation of text data and to association mining over incremental data. Verification experiments show the feasibility and optimization of the algorithm and model, providing a scientific decision-making approach for finding hidden, previously unknown and potentially useful new information and patterns in massive text data.
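
    The FP4W-Growth algorithm itself is not specified in the record; the sketch below only illustrates the all-confidence ("full confidence") measure that the approach uses to filter associations, computed by brute force over a toy set of keyword transactions.

```python
# Sketch of the all-confidence ("full confidence") measure used to filter
# associations: all-conf(X) = support(X) / max_{i in X} support({i}).
# Transactions are a toy stand-in for keyword sets extracted from documents.
from itertools import combinations

transactions = [{"gene", "protein", "expression"},
                {"gene", "mutation"},
                {"protein", "expression"},
                {"gene", "protein", "expression", "pathway"}]

def support(itemset):
    return sum(itemset <= t for t in transactions) / len(transactions)

def all_confidence(itemset):
    return support(itemset) / max(support({i}) for i in itemset)

items = sorted(set().union(*transactions))
for pair in combinations(items, 2):
    s = support(set(pair))
    if s > 0:
        print(pair, round(s, 2), round(all_confidence(set(pair)), 2))
```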

  10. The SPECIES and ORGANISMS Resources for Fast and Accurate Identification of Taxonomic Names in Text

    DEFF Research Database (Denmark)

    Pafilis, Evangelos; Pletscher-Frankild, Sune; Fanini, Lucia;

    2013-01-01

    The exponential growth of the biomedical literature is making the need for efficient, accurate text-mining tools increasingly clear. The identification of named biological entities in text is a central and difficult task. We have developed an efficient algorithm and implementation of a dictionary-based approach to named entity recognition, which we here use to identify names of species and other taxa in text. The tool, SPECIES, is more than an order of magnitude faster and as accurate as existing tools. The precision and recall were assessed both on an existing gold-standard corpus and on a new corpus of...
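
    As a rough illustration of the dictionary-based approach, the following Python sketch performs longest-match tagging of species names against a tiny illustrative dictionary; SPECIES itself uses vastly larger dictionaries and a far faster implementation.

```python
# Minimal longest-match sketch of dictionary-based tagging of taxonomic
# names in text. The dictionary is a tiny illustrative sample; SPECIES
# uses a far larger dictionary and a much faster implementation.
SPECIES_DICT = {
    "escherichia coli": "NCBI:562",
    "e. coli": "NCBI:562",
    "homo sapiens": "NCBI:9606",
    "mus musculus": "NCBI:10090",
}
MAX_WORDS = max(len(name.split()) for name in SPECIES_DICT)

def tag_species(text):
    words = text.split()
    hits, i = [], 0
    while i < len(words):
        for n in range(min(MAX_WORDS, len(words) - i), 0, -1):  # longest match first
            candidate = " ".join(words[i:i + n]).lower().strip(".,;()")
            if candidate in SPECIES_DICT:
                hits.append((candidate, SPECIES_DICT[candidate]))
                i += n
                break
        else:
            i += 1
    return hits

print(tag_species("Biofilms of Escherichia coli were compared with Mus musculus models."))
```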

  11. Gene Tree Labeling Using Nonnegative Matrix Factorization on Biomedical Literature

    Directory of Open Access Journals (Sweden)

    Kevin E. Heinrich

    2008-01-01

    Full Text Available Identifying functional groups of genes is a challenging problem for biological applications. Text mining approaches can be used to build hierarchical clusters or trees from the information in the biological literature. In particular, the nonnegative matrix factorization (NMF) is examined as one approach to label hierarchical trees. A generic labeling algorithm as well as an evaluation technique is proposed, and the effects of different NMF parameters with regard to convergence and labeling accuracy are discussed. The primary goals of this study are to provide a qualitative assessment of the NMF and its various parameters and initialization, to provide an automated way to classify biomedical data, and to provide a method for evaluating labeled data assuming a static input tree. As a byproduct, a method for generating gold standard trees is proposed.
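
    A minimal sketch of the core labelling idea follows: factorise a TF-IDF term-document matrix with NMF and take each factor's top-weighted terms as candidate labels. The toy corpus, the number of components and the choice of initialisation are illustrative assumptions, not the study's settings.

```python
# Sketch of NMF-based labelling: factorise a TF-IDF term-document matrix
# and use each factor's top-weighted terms as candidate cluster labels.
# The toy corpus stands in for gene-related abstracts.
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["dna repair gene expression in tumour cells",
        "cell cycle checkpoint and dna damage response",
        "insulin signalling regulates glucose metabolism",
        "glucose uptake and metabolic pathway activation"]

vec = TfidfVectorizer(stop_words="english")
X = vec.fit_transform(docs)
model = NMF(n_components=2, init="nndsvda", random_state=0)
W = model.fit_transform(X)            # document-to-factor weights
terms = vec.get_feature_names_out()

for k, component in enumerate(model.components_):
    top = [terms[i] for i in component.argsort()[::-1][:3]]
    print(f"factor {k} label candidates: {top}")
```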

  12. Biomedical engineering fundamentals

    CERN Document Server

    Bronzino, Joseph D

    2014-01-01

    Known as the bible of biomedical engineering, The Biomedical Engineering Handbook, Fourth Edition, sets the standard against which all other references of this nature are measured. As such, it has served as a major resource for both skilled professionals and novices to biomedical engineering.Biomedical Engineering Fundamentals, the first volume of the handbook, presents material from respected scientists with diverse backgrounds in physiological systems, biomechanics, biomaterials, bioelectric phenomena, and neuroengineering. More than three dozen specific topics are examined, including cardia

  13. In silico discoveries for biomedical sciences

    NARCIS (Netherlands)

    Haagen, Herman van

    2011-01-01

    Text-mining is a challenging field of research initially meant for reading large text collections with a computer. Text-mining is useful for summarizing text, searching for informative documents and, most importantly, for knowledge discovery. Knowledge discovery is the main subject of this thesis.

  14. Biomedical microsystems

    CERN Document Server

    Meng, Ellis

    2010-01-01

    Introduction; Evolution of MEMS; Applications of MEMS; BioMEMS Applications; MEMS Resources; Text Goals and Organization; Miniaturization and Scaling; BioMEMS Materials; Traditional MEMS and Microelectronic Materials; Polymeric Materials for MEMS; Biomaterials; Microfabrication Methods and Processes for BioMEMS; Introduction; Microlithography; Doping; Micromachining; Wafer Bonding, Assembly, and Packaging; Surface Treatment; Conversion Factors for Energy and Intensity Units; Laboratory Exercises; Microfluidics; Introduction and Fluid Properties; Concepts in Microfluidics; Fluid-Transport Phenomena and Pumping; Flow Control; Laboratory Exercis

  15. Automatic Arabic Text Classification

    OpenAIRE

    Al-harbi, S; Almuhareb, A.; Al-Thubaity , A; Khorsheed, M. S.; Al-Rajeh, A.

    2008-01-01

    Automated document classification is an important text mining task, especially with the rapid growth of the number of online documents present in the Arabic language. Text classification aims to automatically assign a text to a predefined category based on linguistic features. Such a process has different useful applications including, but not restricted to, e-mail spam detection, web page content filtering, and automatic message routing. This paper presents the results of experiments on documen...

  16. Opportunities of Literacy through Text Mining and Fanfiction Writing

    Directory of Open Access Journals (Sweden)

    Patrícia da Silva Campelo Costa

    2012-01-01

    Full Text Available This work aims at investigating how literacy may be supported by the use of a digital resource which can help the processes of reading and text production. The present work is based on studies by Feldman and Sanger (2006) about text mining, and on research by Black (2007; 2009) about the incorporation of a textual genre characteristic of the Internet (fanfiction) in language learning. Through the use of a text mining resource (Sobek), which promotes the extraction of frequent terms present in a text, the participants of our pilot study created narratives in English as a foreign language (FL) in digital media, and used the mining tool to develop graphs with the recurrent terms found in the story. It was observed that the use of the digital tool supported text production in the FL, and its subsequent practice of literacy, as the students relied on the mining resource to create their fanfictions.

  18. Statistics in biomedical research

    Directory of Open Access Journals (Sweden)

    González-Manteiga, Wenceslao

    2007-06-01

    Full Text Available The discipline of biostatistics is nowadays a fundamental scientific component of biomedical, public health and health services research. Traditional and emerging areas of application include clinical trials research, observational studies, physiology, imaging, and genomics. The present article reviews the current situation of biostatistics, considering the statistical methods traditionally used in biomedical research, as well as the ongoing development of new methods in response to the new problems arising in medicine. Clearly, the successful application of statistics in biomedical research requires appropriate training of biostatisticians. This training should aim to give due consideration to emerging new areas of statistics, while at the same time retaining full coverage of the fundamentals of statistical theory and methodology. In addition, it is important that students of biostatistics receive formal training in relevant biomedical disciplines, such as epidemiology, clinical trials, molecular biology, genetics, and neuroscience.

  19. The International Library Community's Fight for Text and Data Mining Rights and Its Implications

    Institute of Scientific and Technical Information of China (English)

    于静

    2016-01-01

    Facing the constraints that copyright issues impose on the application of text and data mining technology in libraries, the international library community has adopted strategies to secure text and data mining rights, including publishing statements of copyright principles, lobbying for copyright legislation, questioning publishers' copyright policies, and building cooperative alliances for rights advocacy. The main results are that the community's copyright position has gained social recognition and support, copyright exception systems have begun to emerge, some publishers have adjusted their copyright policies, and libraries' copyright practice models have diversified. The practices and experience of the international library community in striving for text and data mining rights offer useful lessons for libraries in China.

  20. Passage-Based Bibliographic Coupling: An Inter-Article Similarity Measure for Biomedical Articles.

    Directory of Open Access Journals (Sweden)

    Rey-Long Liu

    Full Text Available Biomedical literature is an essential source of biomedical evidence. To translate the evidence for biomedicine study, researchers often need to carefully read multiple articles about specific biomedical issues. These articles thus need to be highly related to each other. They should share similar core contents, including research goals, methods, and findings. However, given an article r, it is challenging for search engines to retrieve highly related articles for r. In this paper, we present a technique PBC (Passage-based Bibliographic Coupling) that estimates inter-article similarity by seamlessly integrating bibliographic coupling with the information collected from context passages around important out-link citations (references) in each article. Empirical evaluation shows that PBC can significantly improve the retrieval of those articles that biomedical experts believe to be highly related to specific articles about gene-disease associations. PBC can thus be used to improve search engines in retrieving the highly related articles for any given article r, even when r is cited by very few (or even no) articles. The contribution is essential for those researchers and text mining systems that aim at cross-validating the evidence about specific gene-disease associations.
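
    A minimal sketch of the idea follows: coupling strength is accumulated over shared references, with each shared reference weighted by the similarity of its citation-context passages. The token-overlap weighting used here is an assumption for illustration, not the exact PBC formulation.

```python
# Sketch of passage-weighted bibliographic coupling: two articles are
# compared on shared references, with each shared reference weighted by
# the similarity of its citation-context passages. The weighting here is
# a simple token-overlap score, not the exact PBC formulation.
def passage_similarity(p1, p2):
    t1, t2 = set(p1.lower().split()), set(p2.lower().split())
    return len(t1 & t2) / len(t1 | t2) if (t1 | t2) else 0.0

def pbc_similarity(art1, art2):
    # art = {reference_id: "citation context passage"}
    shared = set(art1) & set(art2)
    return sum(1.0 + passage_similarity(art1[r], art2[r]) for r in shared)

a = {"PMID:1": "BRCA1 variants increase breast cancer risk",
     "PMID:2": "statistical model of gene disease association"}
b = {"PMID:1": "BRCA1 variants and hereditary breast cancer risk",
     "PMID:3": "deep learning for variant calling"}
print(round(pbc_similarity(a, b), 3))
```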

  1. A Patent Analysis Method for the Video Codec Field Based on Text Mining

    Institute of Scientific and Technical Information of China (English)

    于雷; 夏鹏

    2012-01-01

    This paper introduces a method for analysing patents through text mining based on advanced semantic technology and natural language processing, and applies the method to patents in the video codec field, obtaining some useful suggestions.

  2. Study on an Auto-Suggestion Method for Scientific Research Information Based on Text Mining

    Institute of Scientific and Technical Information of China (English)

    李芳; 朱群雄

    2011-01-01

    This paper studies the characteristics of text data from research journal literature, applies popular text mining techniques to the analysis and processing of such data, and proposes an auto-suggestion system for scientific research information based on text mining. A case study is presented using the articles published in 2007 by three influential Chinese journals in the information field.

  3. Layout-aware text extraction from full-text PDF of scientific articles

    Directory of Open Access Journals (Sweden)

    Ramakrishnan Cartic

    2012-05-01

    Full Text Available Abstract Background The Portable Document Format (PDF) is the most commonly used file format for online scientific publications. The absence of effective means to extract text from these PDF files in a layout-aware manner presents a significant challenge for developers of biomedical text mining or biocuration informatics systems that use published literature as an information source. In this paper we introduce the ‘Layout-Aware PDF Text Extraction’ (LA-PDFText) system to facilitate accurate extraction of text from PDF files of research articles for use in text mining applications. Results Our paper describes the construction and performance of an open source system that extracts text blocks from PDF-formatted full-text research articles and classifies them into logical units based on rules that characterize specific sections. The LA-PDFText system focuses only on the textual content of the research articles and is meant as a baseline for further experiments into more advanced extraction methods that handle multi-modal content, such as images and graphs. The system works in a three-stage process: (1) Detecting contiguous text blocks using spatial layout processing to locate and identify blocks of contiguous text, (2) Classifying text blocks into rhetorical categories using a rule-based method and (3) Stitching classified text blocks together in the correct order, resulting in the extraction of text from section-wise grouped blocks. We show that our system can identify text blocks and classify them into rhetorical categories with Precision = 0.96, Recall = 0.89 and F1 = 0.91. We also present an evaluation of the accuracy of the block detection algorithm used in step 2. Additionally, we have compared the accuracy of the text extracted by LA-PDFText to the text from the Open Access subset of PubMed Central. We then compared this accuracy with that of the text extracted by the PDF2Text system, commonly used to extract text from PDF
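
    The following sketch illustrates only the second stage described above, rule-based classification of text blocks into rhetorical categories keyed on section-heading patterns; the rules and category names are illustrative, not LA-PDFText's actual rule set.

```python
# Sketch of rule-based classification of extracted text blocks into
# rhetorical categories, keyed on section-heading patterns. The rules and
# category names are illustrative, not LA-PDFText's actual rule set.
import re

RULES = [
    (re.compile(r"^\s*abstract\b", re.I), "ABSTRACT"),
    (re.compile(r"^\s*(introduction|background)\b", re.I), "INTRODUCTION"),
    (re.compile(r"^\s*(methods?|materials and methods)\b", re.I), "METHODS"),
    (re.compile(r"^\s*results?\b", re.I), "RESULTS"),
    (re.compile(r"^\s*(discussion|conclusions?)\b", re.I), "DISCUSSION"),
    (re.compile(r"^\s*references\b", re.I), "REFERENCES"),
]

def classify_block(block_text, previous_label="UNKNOWN"):
    for pattern, label in RULES:
        if pattern.search(block_text):
            return label
    return previous_label        # body text inherits the last seen section

blocks = ["Abstract  We present a new corpus...",
          "The corpus was annotated by two curators.",
          "Results  Precision reached 0.91 on held-out articles."]
label = "UNKNOWN"
for b in blocks:
    label = classify_block(b, label)
    print(label, "|", b[:40])
```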

  4. TargetMine, an integrated data warehouse for candidate gene prioritisation and target discovery.

    Directory of Open Access Journals (Sweden)

    Yi-An Chen

    Full Text Available Prioritising candidate genes for further experimental characterisation is a non-trivial challenge in drug discovery and biomedical research in general. An integrated approach that combines results from multiple data types is best suited for optimal target selection. We developed TargetMine, a data warehouse for efficient target prioritisation. TargetMine utilises the InterMine framework, with new data models such as protein-DNA interactions integrated in a novel way. It enables complicated searches that are difficult to perform with existing tools and it also offers integration of custom annotations and in-house experimental data. We proposed an objective protocol for target prioritisation using TargetMine and set up a benchmarking procedure to evaluate its performance. The results show that the protocol can identify known disease-associated genes with high precision and coverage. A demonstration version of TargetMine is available at http://targetmine.nibio.go.jp/.

  5. Uranium mining

    International Nuclear Information System (INIS)

    Full text: The economic and environmental sustainability of uranium mining has been analysed by Monash University researcher Dr Gavin Mudd in a paper that challenges the perception that uranium mining is an 'infinite quality source' that provides solutions to the world's demand for energy. Dr Mudd says information on the uranium industry touted by politicians and mining companies is not necessarily inaccurate, but it does not tell the whole story, being often just an average snapshot of the costs of uranium mining today without reflecting the escalating costs associated with the process in years to come. 'From a sustainability perspective, it is critical to evaluate accurately the true lifecycle costs of all forms of electricity production, especially with respect to greenhouse emissions, ' he says. 'For nuclear power, a significant proportion of greenhouse emissions are derived from the fuel supply, including uranium mining, milling, enrichment and fuel manufacture.' Dr Mudd found that financial and environmental costs escalate dramatically as the uranium ore is used. The deeper the mining process required to extract the ore, the higher the cost for mining companies, the greater the impact on the environment and the more resources needed to obtain the product. It is clear that there is a strong sensitivity of energy and water consumption and greenhouse emissions to ore grade, and that ore grades are likely to continue to decline gradually in the medium to long term. These issues are critical to the current debate over nuclear power and greenhouse emissions, especially with respect to ascribing sustainability to such activities as uranium mining and milling. For example, mining at Roxby Downs is responsible for the emission of over one million tonnes of greenhouse gases per year and this could increase to four million tonnes if the mine is expanded.'

  6. A Competitor Analysis Model Based on Semantic Text Mining

    Institute of Scientific and Technical Information of China (English)

    唐晓波; 郭萍

    2013-01-01

    To make up for the inability of traditional competitor analysis methods to effectively mine information about corporate competitors on the Web, this paper introduces semantic text mining into enterprise competitor analysis and puts forward a competitor analysis model based on semantic text mining. The model adopts rule-based topical crawling technology to obtain structured information, uses a competitive intelligence domain ontology knowledge base and a semantic VSM matrix to achieve semantic analysis and description of competitor information, and extracts deep-level semantic knowledge about rivals through semantic-based text mining. Two competing enterprises in the camera market, Canon and Nikon, are chosen to demonstrate the applicability of the model; the results show that the model has potential practical value and can effectively improve the level of business decision-making.

  7. PALM-IST: Pathway Assembly from Literature Mining--an Information Search Tool.

    Science.gov (United States)

    Mandloi, Sapan; Chakrabarti, Saikat

    2015-01-01

    Manual curation of biomedical literature has become an extremely tedious process due to its exponential growth in recent years. To extract meaningful information from such large and unstructured text, newer and more efficient mining tools are required. Here, we introduce PALM-IST, a computational platform that not only allows users to explore biomedical abstracts using keyword-based text mining but also extracts biological entity (e.g., gene/protein, drug, disease, biological process, cellular component, etc.) information from the extracted text and subsequently mines various databases to provide their comprehensive inter-relations (e.g., interaction, expression, etc.). PALM-IST constructs protein interaction network and pathway information data relevant to the text search using multiple data mining tools and assembles them to create a meta-interaction network. It also analyzes scientific collaboration by extraction and creation of a "co-authorship network" for a given search context. Hence, this useful combination of literature and data mining provided in PALM-IST can be used to extract novel protein-protein interactions (PPI), to generate meta-pathways and further to identify key crosstalk and bottleneck proteins. PALM-IST is available at www.hpppi.iicb.res.in/ctm. PMID:25989388

  8. Current Market Demand for Core Competencies of Librarianship—A Text Mining Study of American Library Association’s Advertisements from 2009 through 2014

    Directory of Open Access Journals (Sweden)

    Qinghong Yang

    2016-02-01

    Full Text Available As librarianship evolves, it is important to examine the changes that have taken place in professional requirements. To provide an understanding of the current market demand for core competencies of librarianship, this article conducts a semi-automatic methodology to analyze job advertisements (ads) posted on the American Library Association (ALA) Joblist from 2009 through 2014. There is evidence that the ability to solve unexpected complex problems and to provide superior customer service gained increasing importance for librarians during those years. The authors contend that the findings in this report question the status quo of core competencies of librarianship in the US job market.

  9. Biomedical optical imaging

    CERN Document Server

    Fujimoto, James G

    2009-01-01

    Biomedical optical imaging is a rapidly emerging research area with widespread fundamental research and clinical applications. This book gives an overview of biomedical optical imaging with contributions from leading international research groups who have pioneered many of these techniques and applications. A unique research field spanning the microscopic to the macroscopic, biomedical optical imaging allows both structural and functional imaging. Techniques such as confocal and multiphoton microscopy provide cellular level resolution imaging in biological systems. The integration of this tech

  10. Biomedical engineering principles

    CERN Document Server

    Ritter, Arthur B; Valdevit, Antonio; Ascione, Alfred N

    2011-01-01

    Introduction: Modeling of Physiological Processes; Cell Physiology and Transport; Principles and Biomedical Applications of Hemodynamics; A Systems Approach to Physiology; The Cardiovascular System; Biomedical Signal Processing; Signal Acquisition and Processing; Techniques for Physiological Signal Processing; Examples of Physiological Signal Processing; Principles of Biomechanics; Practical Applications of Biomechanics; Biomaterials; Principles of Biomedical Capstone Design; Unmet Clinical Needs; Entrepreneurship: Reasons Why Most Good Designs Never Get to Market; An Engineering Solution in Search of a Biomedical Problem

  11. Exploring subdomain variation in biomedical language

    Directory of Open Access Journals (Sweden)

    Séaghdha Diarmuid Ó

    2011-05-01

    Full Text Available Abstract Background Applications of Natural Language Processing (NLP) technology to biomedical texts have generated significant interest in recent years. In this paper we identify and investigate the phenomenon of linguistic subdomain variation within the biomedical domain, i.e., the extent to which different subject areas of biomedicine are characterised by different linguistic behaviour. While variation at a coarser domain level such as between newswire and biomedical text is well-studied and known to affect the portability of NLP systems, we are the first to conduct an extensive investigation into more fine-grained levels of variation. Results Using the large OpenPMC text corpus, which spans the many subdomains of biomedicine, we investigate variation across a number of lexical, syntactic, semantic and discourse-related dimensions. These dimensions are chosen for their relevance to the performance of NLP systems. We use clustering techniques to analyse commonalities and distinctions among the subdomains. Conclusions We find that while patterns of inter-subdomain variation differ somewhat from one feature set to another, robust clusters can be identified that correspond to intuitive distinctions such as that between clinical and laboratory subjects. In particular, subdomains relating to genetics and molecular biology, which are the most common sources of material for training and evaluating biomedical NLP tools, are not representative of all biomedical subdomains. We conclude that an awareness of subdomain variation is important when considering the practical use of language processing applications by biomedical researchers.

  12. Biomedical devices and their applications

    CERN Document Server

    2004-01-01

    This volume introduces readers to the basic concepts and recent advances in the field of biomedical devices. The text gives a detailed account of novel developments in drug delivery, protein electrophoresis, estrogen mimicking methods and medical devices. It also provides the necessary theoretical background as well as describing a wide range of practical applications. The level and style make this book accessible not only to scientific and medical researchers but also to graduate students.

  13. UMLS knowledge for biomedical language processing.

    OpenAIRE

    McCray, A T; Aronson, A. R.; Browne, A. C.; Rindflesch, T. C.; A razi; Srinivasan, S

    1993-01-01

    This paper describes efforts to provide access to the free text in biomedical databases. The focus of the effort is the development of SPECIALIST, an experimental natural language processing system for the biomedical domain. The system includes a broad coverage parser supported by a large lexicon, modules that provide access to the extensive Unified Medical Language System (UMLS) Knowledge Sources, and a retrieval module that permits experiments in information retrieval. The UMLS Metathesauru...

  14. Rewriting and suppressing UMLS terms for improved biomedical term identification

    Directory of Open Access Journals (Sweden)

    Hettne Kristina M

    2010-03-01

    Full Text Available Abstract Background Identification of terms is essential for biomedical text mining. We concentrate here on the use of vocabularies for term identification, specifically the Unified Medical Language System (UMLS). To make the UMLS more suitable for biomedical text mining we implemented and evaluated nine term rewrite and eight term suppression rules. The rules rely on UMLS properties that have been identified in previous work by others, together with an additional set of new properties discovered by our group during our work with the UMLS. Our work complements the earlier work in that we measure the impact on the number of terms identified by the different rules on a MEDLINE corpus. The number of uniquely identified terms and their frequency in MEDLINE were computed before and after applying the rules. The 50 most frequently found terms together with a sample of 100 randomly selected terms were evaluated for every rule. Results Five of the nine rewrite rules were found to generate additional synonyms and spelling variants that correctly corresponded to the meaning of the original terms and seven out of the eight suppression rules were found to suppress only undesired terms. Using the five rewrite rules that passed our evaluation, we were able to identify 1,117,772 new occurrences of 14,784 rewritten terms in MEDLINE. Without the rewriting, we recognized 651,268 terms belonging to 397,414 concepts; with rewriting, we recognized 666,053 terms belonging to 410,823 concepts, which is an increase of 2.8% in the number of terms and an increase of 3.4% in the number of concepts recognized. Using the seven suppression rules, a total of 257,118 undesired terms were suppressed in the UMLS, notably decreasing its size. 7,397 terms were suppressed in the corpus. Conclusions We recommend applying the five rewrite rules and seven suppression rules that passed our evaluation when the UMLS is to be used for biomedical term identification in MEDLINE. A software
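
    As a rough illustration of what such rules look like, the sketch below applies one rewrite rule (strip a trailing parenthetical qualifier to generate a variant) and one suppression rule (drop very short or purely numeric terms) to a handful of made-up vocabulary entries; these are illustrative stand-ins, not the nine rewrite and eight suppression rules of the paper.

```python
# Sketch of vocabulary rewrite and suppression rules of the kind described
# above. The two rules shown (strip a parenthetical qualifier; suppress very
# short or purely numeric terms) are illustrative, not the paper's full set.
import re

def rewrite(term):
    variants = {term}
    stripped = re.sub(r"\s*\([^)]*\)\s*$", "", term)      # "X (qualifier)" -> "X"
    if stripped and stripped != term:
        variants.add(stripped)
    return variants

def suppress(term):
    return len(term) < 3 or term.isdigit()                # too short / numeric

umls_terms = ["Cold (temperature)", "Aspirin", "NS", "10", "Heart failure (disorder)"]
kept = set()
for t in umls_terms:
    for v in rewrite(t):
        if not suppress(v):
            kept.add(v)
print(sorted(kept))
```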

  15. Text Data Mining in Biomedical Informatics Based on the Biclustering Method

    Institute of Scientific and Technical Information of China (English)

    于跃; 徐志健; 王坤; 王伟

    2012-01-01

    In this paper, the authors retrieve the relevant literature of the past five years from the SCI database and process the records with TDA software and gCluto software to obtain biclustering matrix figures. The outcomes indicate the recent journal research topics and hot spots in this field. The conclusion is that biclustering methods can reflect the development of a given discipline and its research hot spots, providing valuable information and knowledge for medical researchers; further research and extension are needed to confirm the worth of these techniques.

  16. UKPMC: a full text article resource for the life sciences.

    Science.gov (United States)

    McEntyre, Johanna R; Ananiadou, Sophia; Andrews, Stephen; Black, William J; Boulderstone, Richard; Buttery, Paula; Chaplin, David; Chevuru, Sandeepreddy; Cobley, Norman; Coleman, Lee-Ann; Davey, Paul; Gupta, Bharti; Haji-Gholam, Lesley; Hawkins, Craig; Horne, Alan; Hubbard, Simon J; Kim, Jee-Hyub; Lewin, Ian; Lyte, Vic; MacIntyre, Ross; Mansoor, Sami; Mason, Linda; McNaught, John; Newbold, Elizabeth; Nobata, Chikashi; Ong, Ernest; Pillai, Sharmila; Rebholz-Schuhmann, Dietrich; Rosie, Heather; Rowbotham, Rob; Rupp, C J; Stoehr, Peter; Vaughan, Philip

    2011-01-01

    UK PubMed Central (UKPMC) is a full-text article database that extends the functionality of the original PubMed Central (PMC) repository. The UKPMC project was launched as the first 'mirror' site to PMC, which in analogy to the International Nucleotide Sequence Database Collaboration, aims to provide international preservation of the open and free-access biomedical literature. UKPMC (http://ukpmc.ac.uk) has undergone considerable development since its inception in 2007 and now includes both a UKPMC and PubMed search, as well as access to other records such as Agricola, Patents and recent biomedical theses. UKPMC also differs from PubMed/PMC in that the full text and abstract information can be searched in an integrated manner from one input box. Furthermore, UKPMC contains 'Cited By' information as an alternative way to navigate the literature and has incorporated text-mining approaches to semantically enrich content and integrate it with related database resources. Finally, UKPMC also offers added-value services (UKPMC+) that enable grantees to deposit manuscripts, link papers to grants, publish online portfolios and view citation information on their papers. Here we describe UKPMC and clarify the relationship between PMC and UKPMC, providing historical context and future directions, 10 years on from when PMC was first launched. PMID:21062818

  17. Detecting the Research Trend of Islet Amyloid Polypeptide with Text Mining Techniques

    Institute of Scientific and Technical Information of China (English)

    殷蜀梅; 李春英

    2014-01-01

    Islet amyloid polypeptide (IAPP) is an important etiologic factor in type 2 diabetes mellitus. To investigate the biological functions and applications of IAPP, we used text mining in this study to explore trends in research on IAPP biochemical reagents and test kits.

  18. MedlineRanker: flexible ranking of biomedical literature.

    Science.gov (United States)

    Fontaine, Jean-Fred; Barbosa-Silva, Adriano; Schaefer, Martin; Huska, Matthew R; Muro, Enrique M; Andrade-Navarro, Miguel A

    2009-07-01

    The biomedical literature is represented by millions of abstracts available in the Medline database. These abstracts can be queried with the PubMed interface, which provides a keyword-based Boolean search engine. This approach shows limitations in the retrieval of abstracts related to very specific topics, as it is difficult for a non-expert user to find all of the most relevant keywords related to a biomedical topic. Additionally, when searching for more general topics, the same approach may return hundreds of unranked references. To address these issues, text mining tools have been developed to help scientists focus on relevant abstracts. We have implemented the MedlineRanker webserver, which allows a flexible ranking of Medline for a topic of interest without expert knowledge. Given some abstracts related to a topic, the program deduces automatically the most discriminative words in comparison to a random selection. These words are used to score other abstracts, including those from not yet annotated recent publications, which can be then ranked by relevance. We show that our tool can be highly accurate and that it is able to process millions of abstracts in a practical amount of time. MedlineRanker is free for use and is available at http://cbdm.mdc-berlin.de/tools/medlineranker. PMID:19429696
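
    A minimal sketch of the underlying idea follows: words over-represented in a topic-specific abstract set relative to a background set are used to score and rank new abstracts. The simple add-one log-ratio scoring here is an illustrative stand-in for MedlineRanker's actual statistic.

```python
# Sketch of ranking abstracts by discriminative words: words over-represented
# in a topic set relative to a background set score new abstracts. The simple
# log-ratio used here stands in for MedlineRanker's actual statistic.
import math
from collections import Counter

def word_counts(texts):
    c = Counter()
    for t in texts:
        c.update(set(t.lower().split()))     # document frequency, not raw counts
    return c

topic = ["insulin resistance in type 2 diabetes",
         "beta cell failure drives diabetes progression"]
background = ["neurons fire action potentials",
              "soil bacteria fix nitrogen",
              "insulin signalling in liver cells"]

tc, bc = word_counts(topic), word_counts(background)

def score(abstract):
    return sum(math.log((tc[w] + 1) / (bc[w] + 1)) for w in set(abstract.lower().split()))

candidates = ["glucose uptake is impaired in diabetes",
              "volcanic rock composition varies by region"]
for a in sorted(candidates, key=score, reverse=True):
    print(round(score(a), 2), "-", a)
```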

  19. Context-Aware Adaptive Hybrid Semantic Relatedness in Biomedical Science

    Science.gov (United States)

    Emadzadeh, Ehsan

    Text mining of biomedical literature and clinical notes is a very active field of research in biomedical science. Semantic analysis is one of the core modules for different Natural Language Processing (NLP) solutions. Methods for calculating semantic relatedness of two concepts can be very useful in solutions solving different problems such as relationship extraction, ontology creation and question answering [1-6]. Several techniques exist for calculating the semantic relatedness of two concepts. These techniques utilize different knowledge sources and corpora. So far, researchers have attempted to find the best hybrid method for each domain by combining semantic relatedness techniques and data sources manually. In this work, attempts were made to eliminate the need for manually combining semantic relatedness methods for new contexts or resources by proposing an automated method, which attempts to find the best combination of semantic relatedness techniques and resources to achieve the best semantic relatedness score in every context. This may help the research community find the best hybrid method for each context considering the available algorithms and resources.

  20. Exploring the association rules between traditional Chinese medicine and Western medicine for urinary tract infection with a text mining technique

    Institute of Scientific and Technical Information of China (English)

    陈文; 姜洋; 黄蕙莉; 孙玉香

    2013-01-01

    Objective To explore the association rules between Western medicine and traditional Chinese medicine (TCM) for urinary tract infection (UTI) with a text mining technique. Methods A data set on UTI was downloaded from the CBM database. The patterns of Chinese patent medicines (CPM), Western medicines, and combinations of CPM and Western medicines for UTI were mined with a data slicing algorithm based on frequency statistics of sensitive keywords, and the results were visualized with the Cytoscape 2.8 software. Results The main functions of the CPM used were clearing heat, removing toxicity, promoting diuresis and relieving stranguria. Among Western medicines, antibacterial agents were most often used, and they were frequently combined with CPM such as Sanjinpian. Conclusions Text mining provides an objective way to summarize medication patterns for a disease in both TCM and Western medicine and offers a useful reference for clinical application.
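
    The underlying bookkeeping, counting how often a Chinese patent medicine and a Western drug appear in the same record and exporting the counts as a network, can be sketched as below. The term lists and records are hypothetical, and the paper's own keyword-frequency data slicing algorithm is not reproduced; the output CSV is simply an edge table that Cytoscape can import.

        # Sketch: co-occurrence counts of CPM and Western drug terms, exported as a
        # Cytoscape-importable edge table. Term lists and records are hypothetical.
        import csv
        from collections import Counter
        from itertools import product

        CPM_TERMS = ["Sanjinpian", "Bazheng mixture"]     # Chinese patent medicines
        WESTERN_TERMS = ["levofloxacin", "ceftriaxone"]   # Western antibacterials

        def cooccurrence(records):
            pairs = Counter()
            for text in records:
                low = text.lower()
                for cpm, west in product(CPM_TERMS, WESTERN_TERMS):
                    if cpm.lower() in low and west.lower() in low:
                        pairs[(cpm, west)] += 1
            return pairs

        def write_edges(pairs, path="uti_edges.csv"):
            with open(path, "w", newline="", encoding="utf-8") as fh:
                writer = csv.writer(fh)
                writer.writerow(["source", "target", "weight"])
                for (cpm, west), n in pairs.items():
                    writer.writerow([cpm, west, n])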

  1. The Protein-Protein Interaction tasks of BioCreative III: classification/ranking of articles and linking bio-ontology concepts to full text

    OpenAIRE

    Krallinger, Martin; Vazquez, Miguel; Leitner, Florian; Salgado, David; Chatr-aryamontri, Andrew; Winter, Andrew; Perfetto, Livia; Briganti, Leonardo; Licata, Luana; Iannuccelli, Marta; Castagnoli, Luisa; Cesareni, Gianni; Tyers, Mike; Schneider, Gerold; Rinaldi, Fabio

    2011-01-01

    Background Determining usefulness of biomedical text mining systems requires realistic task definition and data selection criteria without artificial constraints, measuring performance aspects that go beyond traditional metrics. The BioCreative III Protein-Protein Interaction (PPI) tasks were motivated by such considerations, trying to address aspects including how the end user would oversee the generated output, for instance by providing ranked results, textual evidence for human interpretat...

  2. Implementation of Paste Backfill Mining Technology in Chinese Coal Mines

    Directory of Open Access Journals (Sweden)

    Qingliang Chang

    2014-01-01

    Full Text Available Implementation of clean mining technology at coal mines is crucial to protect the environment and maintain balance among energy resources, consumption, and ecology. After reviewing current clean coal mining technology, we introduce the principles and technological process of paste backfill mining in coal mines and discuss the components and features of backfill materials, the constitution of the backfill system, and the backfill process. The specific implementation of this technology and its application are analyzed for paste backfill mining in Daizhuang Coal Mine; this practical implementation shows that paste backfill mining can improve the safety and excavation rate of coal mining and effectively resolve surface subsidence problems caused by underground mining activities by utilizing solid waste such as coal gangue as a resource. Therefore, paste backfill mining is an effective clean coal mining technology with widespread applicability.

  3. An Improved Naive Bayes Clustering Technique for Web Text Classification Mining

    Institute of Scientific and Technical Information of China (English)

    高胜利

    2012-01-01

    Based on a detailed analysis of the characteristics of Web data, and building on the traditional Bayesian clustering algorithm, this paper uses web-page markup to compensate for the shortcomings of the naive Bayes algorithm and applies the improved method to text classification. Experimental results show that the method can classify text effectively.
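
    For reference, a plain multinomial naive Bayes text classifier, the baseline this work improves upon, can be assembled in a few lines with scikit-learn. The paper's contribution (weighting terms by the web-page markup they occur in) is not implemented here; it would add markup-derived features on top of such a pipeline. The toy training data are invented.

        # Baseline only: multinomial naive Bayes text classification with scikit-learn.
        from sklearn.feature_extraction.text import CountVectorizer
        from sklearn.naive_bayes import MultinomialNB
        from sklearn.pipeline import make_pipeline

        train_texts = ["cheap flights and hotel deals", "protein binding assay results"]
        train_labels = ["travel", "science"]

        model = make_pipeline(CountVectorizer(), MultinomialNB())
        model.fit(train_texts, train_labels)
        print(model.predict(["binding of the protein to the receptor"]))  # ['science']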

  4. Coreference resolution improves extraction of Biological Expression Language statements from texts.

    Science.gov (United States)

    Choi, Miji; Liu, Haibin; Baumgartner, William; Zobel, Justin; Verspoor, Karin

    2016-01-01

    We describe a system that automatically extracts biological events from biomedical journal articles, and translates those events into Biological Expression Language (BEL) statements. The system incorporates existing text mining components for coreference resolution and biological event extraction, and a previously formally untested strategy for BEL statement generation. While addressing the BEL track (Track 4) at BioCreative V (2015), we also investigate how incorporating coreference resolution might impact event extraction in the biomedical domain. In this paper, we report that our system achieved the best performance, with F-scores of 20.2 and 35.2 at the full BEL statement level in stage 1 and in stage 2 (using provided gold-standard entities), respectively. We also report that our results evaluated on the training dataset show benefit from integrating coreference resolution with event extraction. PMID:27374122

  5. Handbook of biomedical optics

    CERN Document Server

    Boas, David A

    2011-01-01

    Biomedical optics holds tremendous promise to deliver effective, safe, non- or minimally invasive diagnostics and targeted, customizable therapeutics. Handbook of Biomedical Optics provides an in-depth treatment of the field, including coverage of applications for biomedical research, diagnosis, and therapy. It introduces the theory and fundamentals of each subject, ensuring accessibility to a wide multidisciplinary readership. It also offers a view of the state of the art and discusses advantages and disadvantages of various techniques. Organized into six sections, this handbook: Contains intr

  6. Biomedical applications of polymers

    CERN Document Server

    Gebelein, C G

    1991-01-01

    The biomedical applications of polymers span an extremely wide spectrum of uses, including artificial organs, skin and soft tissue replacements, orthopaedic applications, dental applications, and controlled release of medications. No single, short review can possibly cover all these items in detail, and dozens of books and hundreds of reviews exist on biomedical polymers. Only a few relatively recent examples will be cited here; additional reviews are listed under most of the major topics in this book. We will consider each of the major classifications of biomedical polymers to some extent, inclu

  7. Powering biomedical devices

    CERN Document Server

    Romero, Edwar

    2013-01-01

    From exoskeletons to neural implants, biomedical devices are no less than life-changing. Compact and constant power sources are necessary to keep these devices running efficiently. Edwar Romero's Powering Biomedical Devices reviews the background, current technologies, and possible future developments of these power sources, examining not only the types of biomedical power sources available (macro, mini, MEMS, and nano), but also what they power (such as prostheses, insulin pumps, and muscular and neural stimulators), and how they work (covering batteries, biofluids, kinetic and ther

  8. Biomedical Engineering Desk Reference

    CERN Document Server

    Ratner, Buddy D; Schoen, Frederick J; Lemons, Jack E; Dyro, Joseph; Martinsen, Orjan G; Kyle, Richard; Preim, Bernhard; Bartz, Dirk; Grimnes, Sverre; Vallero, Daniel; Semmlow, John; Murray, W Bosseau; Perez, Reinaldo; Bankman, Isaac; Dunn, Stanley; Ikada, Yoshito; Moghe, Prabhas V; Constantinides, Alkis

    2009-01-01

    A one-stop desk reference for biomedical engineers involved in this ever expanding and very fast moving area; this is a book that will not gather dust on the shelf. It brings together the essential professional reference content from leading international contributors in the biomedical engineering field. Material covers a broad range of topics including: Biomechanics and Biomaterials; Tissue Engineering; and Biosignal Processing. * A hard-working desk reference providing all the essential material needed by biomedical and clinical engineers on a day-to-day basis * Fundamentals, key techniques,

  9. Biomedical engineering fundamentals

    CERN Document Server

    Bronzino, Joseph D; Bronzino, Joseph D

    2006-01-01

    Over the last century, medicine has come out of the "black bag" and emerged as one of the most dynamic and advanced fields of development in science and technology. Today, biomedical engineering plays a critical role in patient diagnosis, care, and rehabilitation. As such, the field encompasses a wide range of disciplines, from biology and physiology to informatics and signal processing. Reflecting the enormous growth and change in biomedical engineering during the infancy of the 21st century, The Biomedical Engineering Handbook enters its third edition as a set of three carefully focused and

  10. An Attempt at Data Preprocessing for Text Mining of a TCM Prescription Database

    Institute of Scientific and Technical Information of China (English)

    吴磊; 李舒

    2015-01-01

    Objective To propose a set of data preprocessing methods, based on data cleaning, for TCM prescription databases, so that the data become more standard, accurate and orderly and are convenient for follow-up processing. Methods The source text data were retrieved from prescription databases using bibliographic search techniques. Non-normalized data were then cleaned through auxiliary word-group line processing, regular-expression substitution and synonym processing, with the purpose of improving data quality. Results In total, 1758 effective records were retrieved from the TCM prescription database and 91 records from the prescription modern application database. After preprocessing, 6913 effective Chinese herbal medicine entries were obtained, which could be successfully imported into the relevant information mining system for extraction of prescription and herb names. Conclusion This method is applicable to text mining and knowledge discovery in TCM prescription databases. It successfully cleans the source text data, yields data that are uniform and free of noise, and enables effective extraction of prescription information, providing a reference for research on the analysis and mining of TCM prescription text data.
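
    The cleaning steps named here, regular-expression substitution followed by synonym normalization, can be sketched as follows. The patterns and the synonym dictionary are illustrative guesses, not the rule set actually used in the study.

        # Illustrative cleaning of herb-name strings: strip dosage annotations with a
        # regular expression, then map known synonyms to a canonical name.
        import re

        SYNONYMS = {"生甘草": "甘草", "炙甘草": "甘草", "川军": "大黄"}  # hypothetical entries

        def clean_herb(raw):
            herb = re.sub(r"[\d一二三四五六七八九十]+[克钱两g]", "", raw)  # drop dosages
            herb = re.sub(r"[\s()（）]", "", herb)                        # drop spaces/brackets
            return SYNONYMS.get(herb, herb)

        print([clean_herb(h) for h in ["炙甘草10g", "川军 6克", "黄芩"]])
        # ['甘草', '大黄', '黄芩']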

  11. A methodology for semiautomatic extraction of a taxonomy of concepts from nuclear scientific documents using text mining techniques

    Energy Technology Data Exchange (ETDEWEB)

    Braga, Fabiane dos Reis

    2013-07-01

    This thesis presents a text mining method for the semi-automatic extraction of a taxonomy of concepts from a textual corpus composed of scientific papers related to the nuclear area. Text classification is a natural human practice and a crucial task for working with large repositories. Document clustering techniques provide a logical and understandable framework that facilitates organization, browsing and searching. Most clustering algorithms use the bag-of-words model to represent the content of a document. This model generates high-dimensional data, ignores the fact that different words can have the same meaning, and does not consider the relationships between words, assuming that words are independent of each other. The methodology combines a concept-based model of document representation with a hierarchical document clustering method that uses the frequency of concept co-occurrence, and a technique for labeling clusters with their most representative concepts, with the objective of producing a taxonomy of concepts that reflects the structure of the knowledge domain. It is hoped that this work will contribute to the conceptual mapping of the scientific production of the nuclear area and thus support the management of research activities in this area. (author)
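
    The overall shape of the method, representing documents by concept frequencies and clustering them hierarchically by co-occurrence similarity, can be illustrated with a toy concept matrix. The concept extraction and cluster-labeling steps of the thesis are not shown, and the data below are invented.

        # Toy illustration: hierarchical clustering of documents represented as
        # concept-frequency vectors, using cosine distance and average linkage.
        import numpy as np
        from scipy.cluster.hierarchy import fcluster, linkage

        concepts = ["reactor", "fuel cycle", "dosimetry", "radiotherapy"]
        doc_concept = np.array([   # rows: documents, columns: concept counts
            [5, 3, 0, 0],
            [4, 2, 1, 0],
            [0, 0, 6, 4],
            [0, 1, 5, 5],
        ], dtype=float)

        Z = linkage(doc_concept, method="average", metric="cosine")
        labels = fcluster(Z, t=2, criterion="maxclust")
        print(labels)  # e.g. [1 1 2 2]: two thematic clusters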

  12. Sensors for biomedical applications

    NARCIS (Netherlands)

    Bergveld, Piet

    1986-01-01

    This paper considers the impact during the last decade of modern IC technology, microelectronics, thin- and thick-film technology, fibre optic technology, etc. on the development of sensors for biomedical applications.

  13. Extracting Biomolecular Interactions Using Semantic Parsing of Biomedical Text

    OpenAIRE

    Garg, Sahil; Galstyan, Aram; Hermjakob, Ulf; Marcu, Daniel

    2015-01-01

    We advance the state of the art in biomolecular interaction extraction with three contributions: (i) We show that deep Abstract Meaning Representations (AMR) significantly improve the accuracy of a biomolecular interaction extraction system when compared to a baseline that relies solely on surface- and syntax-based features; (ii) In contrast with previous approaches that infer relations on a sentence-by-sentence basis, we expand our framework to enable consistent predictions over sets of sen...

  14. Statistical Machine Translation for Biomedical Text: Are We There Yet?

    OpenAIRE

    Wu, Cuijun; Xia, Fei; Deleger, Louise; Solti, Imre

    2011-01-01

    In our paper we addressed the research question: “Has machine translation achieved sufficiently high quality to translate PubMed titles for patients?”. We analyzed statistical machine translation output for six foreign language - English translation pairs (bi-directionally). We built a high performing in-house system and evaluated its output for each translation pair on large scale both with automated BLEU scores and human judgment. In addition to the in-house system, we also evaluated Google...
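
    The automated side of such an evaluation, corpus-level BLEU over machine-translated titles against reference translations, can be reproduced with NLTK as sketched below. The example sentences are invented; the in-house and Google systems and the actual test data from the paper are not used.

        # Sketch: corpus-level BLEU for translated titles against references (NLTK).
        from nltk.translate.bleu_score import SmoothingFunction, corpus_bleu

        references = [  # one list of reference token lists per translated title
            [["effects", "of", "metformin", "on", "insulin", "resistance"]],
        ]
        hypotheses = [  # system output, tokenized
            ["effect", "of", "metformin", "on", "insulin", "resistance"],
        ]
        smooth = SmoothingFunction().method1
        print(corpus_bleu(references, hypotheses, smoothing_function=smooth))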

  15. Sharing big biomedical data

    OpenAIRE

    Toga, Arthur W.; Dinov, Ivo D.

    2015-01-01

    Background The promise of Big Biomedical Data may be offset by the enormous challenges in handling, analyzing, and sharing it. In this paper, we provide a framework for developing practical and reasonable data sharing policies that incorporate the sociological, financial, technical and scientific requirements of a sustainable Big Data dependent scientific community. Findings Many biomedical and healthcare studies may be significantly impacted by using large, heterogeneous and incongruent data...

  16. Biomedical signal analysis

    CERN Document Server

    Rangayyan, Rangaraj M

    2015-01-01

    The book assists the reader in developing techniques for the analysis of biomedical signals and computer-aided diagnosis, with a pedagogical examination of basic and advanced topics accompanied by over 350 figures and illustrations. A wide range of filtering techniques is presented to address various applications, along with 800 mathematical expressions and equations, practical questions, problems and laboratory exercises, and coverage of fractals and chaos theory with biomedical applications.

  17. What is biomedical informatics?

    OpenAIRE

    Bernstam, Elmer V.; Smith, Jack W.; Johnson, Todd R

    2009-01-01

    Biomedical informatics lacks a clear and theoretically grounded definition. Many proposed definitions focus on data, information, and knowledge, but do not provide an adequate definition of these terms. Leveraging insights from the philosophy of information, we define informatics as the science of information, where information is data plus meaning. Biomedical informatics is the science of information as applied to or studied in the context of biomedicine. Defining the object of study of info...

  18. Text Laws

    Czech Academy of Sciences Publication Activity Database

    Hřebíček, Luděk

    Vol. 26. Ein internationales Handbuch/An International Handbook. Berlin-New York : Walter de Gruyter, 2005 - (Köhler, R.; Altmann, G.; Piotrowski, R.), s. 348-361 ISBN 978-3-11-015578-5 Institutional research plan: CEZ:AV0Z90210515 Keywords : Text structure * Quantitative linguistics Subject RIV: AI - Linguistics

  19. The Ontology for Biomedical Investigations.

    Directory of Open Access Journals (Sweden)

    Anita Bandrowski

    Full Text Available The Ontology for Biomedical Investigations (OBI) is an ontology that provides terms with precisely defined meanings to describe all aspects of how investigations in the biological and medical domains are conducted. OBI re-uses ontologies that provide a representation of biomedical knowledge from the Open Biological and Biomedical Ontologies (OBO) project and adds the ability to describe how this knowledge was derived. We here describe the state of OBI and several applications that are using it, such as adding semantic expressivity to existing databases, building data entry forms, and enabling interoperability between knowledge resources. OBI covers all phases of the investigation process, such as planning, execution and reporting. It represents information and material entities that participate in these processes, as well as roles and functions. Prior to OBI, it was not possible to use a single internally consistent resource that could be applied to multiple types of experiments for these applications. OBI has made this possible by creating terms for entities involved in biological and medical investigations and by importing parts of other biomedical ontologies such as GO, Chemical Entities of Biological Interest (ChEBI) and Phenotype Attribute and Trait Ontology (PATO) without altering their meaning. OBI is being used in a wide range of projects covering genomics, multi-omics, immunology, and catalogs of services. OBI has also spawned other ontologies (Information Artifact Ontology) and methods for importing parts of ontologies (Minimum Information to Reference an External Ontology Term, MIREOT). The OBI project is an open cross-disciplinary collaborative effort, encompassing multiple research communities from around the globe. To date, OBI has created 2366 classes and 40 relations along with textual and formal definitions. The OBI Consortium maintains a web resource (http://obi-ontology.org) providing details on the people, policies, and issues being

  20. Biomedical Image Analysis by Program "Vision Assistant" and "Labview"

    Directory of Open Access Journals (Sweden)

    Peter Izak

    2005-01-01

    Full Text Available This paper introduces an application of image analysis to biomedical images. The general task focuses on the analysis and diagnosis of biomedical images obtained from the program ImageJ. Methods that can be used for images in biomedical applications are described. The main idea is based on particle analysis and pattern matching techniques. For this task, a sophisticated method provided by the program Vision Assistant, which is a part of LabVIEW, was chosen.
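
    The pattern-matching idea behind such tools can be illustrated independently of Vision Assistant and LabVIEW, for example as a brute-force normalized cross-correlation template search in NumPy. This is only a simple illustration of the technique, not the implementation used in the paper, and the toy image is invented.

        # Illustration of pattern matching: brute-force normalized cross-correlation.
        import numpy as np

        def match_template(image, template):
            """Return (row, col) of the best normalized cross-correlation match."""
            th, tw = template.shape
            t = template - template.mean()
            best, best_pos = -np.inf, (0, 0)
            for r in range(image.shape[0] - th + 1):
                for c in range(image.shape[1] - tw + 1):
                    patch = image[r:r + th, c:c + tw]
                    p = patch - patch.mean()
                    denom = np.sqrt((p * p).sum() * (t * t).sum())
                    score = (p * t).sum() / denom if denom else -np.inf
                    if score > best:
                        best, best_pos = score, (r, c)
            return best_pos

        img = np.zeros((8, 8)); img[3:5, 4:6] = 1.0      # toy image with a bright 2x2 blob
        tmpl = np.array([[0.0, 1.0], [0.0, 1.0]])        # left edge of such a blob
        print(match_template(img, tmpl))                 # (3, 3)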