WorldWideScience

Sample records for biomedical text mining

  1. Text mining patents for biomedical knowledge.

    Science.gov (United States)

    Rodriguez-Esteban, Raul; Bundschus, Markus

    2016-06-01

    Biomedical text mining of scientific knowledge bases, such as Medline, has received much attention in recent years. Given that text mining is able to automatically extract biomedical facts that revolve around entities such as genes, proteins, and drugs from unstructured text sources, it is seen as a major enabler of biomedical research and drug discovery. In contrast to the biomedical literature, research into the mining of biomedical patents has not reached the same level of maturity. Here, we review existing work and highlight the associated technical challenges that emerge from automatically extracting facts from patents. We conclude by outlining potential future directions in this domain that could help drive biomedical research and drug discovery. PMID:27179985

  3. CONAN: Text Mining in the Biomedical Domain

    NARCIS (Netherlands)

    Malik, R.

    2006-01-01

    This thesis is about text mining: extracting important information from the literature. In recent years, the number of biomedical articles and journals has been growing exponentially. Scientists might not find the information they want because of the large number of publications. Therefore a system was constructed

  4. Frontiers of biomedical text mining: current progress

    OpenAIRE

    Zweigenbaum, Pierre; Demner-Fushman, Dina; Hong YU; Cohen, Kevin B.

    2007-01-01

    It is now almost 15 years since the publication of the first paper on text mining in the genomics domain, and decades since the first paper on text mining in the medical domain. Enormous progress has been made in the areas of information retrieval, evaluation methodologies and resource construction. Some problems, such as abbreviation-handling, can essentially be considered solved problems, and others, such as identification of gene mentions in text, seem likely to be solved soon. However, a ...

  5. Biomedical text mining and its applications in cancer research.

    Science.gov (United States)

    Zhu, Fei; Patumcharoenpol, Preecha; Zhang, Cheng; Yang, Yang; Chan, Jonathan; Meechai, Asawin; Vongsangnak, Wanwipa; Shen, Bairong

    2013-04-01

    Cancer is a malignant disease that has caused millions of human deaths. Its study has a long history of well over 100 years. There have been an enormous number of publications on cancer research. This integrated but unstructured biomedical text is of great value for cancer diagnostics, treatment, and prevention. The immense body and rapid growth of biomedical text on cancer have led to the appearance of a large number of text mining techniques aimed at extracting novel knowledge from scientific text. Biomedical text mining on cancer research is computationally automatic and high-throughput in nature. However, it is error-prone due to the complexity of natural language processing. In this review, we introduce the basic concepts underlying text mining and examine some frequently used algorithms, tools, and data sets, as well as assessing the extent to which these algorithms have been utilized. We then discuss the current state-of-the-art text mining applications in cancer research and provide some resources for cancer text mining. With the development of systems biology, researchers increasingly seek to understand complex biomedical systems from a systems biology viewpoint. Thus, the full utilization of text mining to facilitate cancer systems biology research is fast becoming a major concern. To address this issue, we describe the general workflow of text mining in cancer systems biology and each phase of the workflow. We hope that this review can (i) provide a useful overview of the current work in this field; (ii) help researchers choose text mining tools and datasets; and (iii) highlight how to apply text mining to assist cancer systems biology research. PMID:23159498

  6. OntoGene web services for biomedical text mining

    OpenAIRE

    Rinaldi, Fabio; Clematide, Simon; Marques, Hernani; Ellendorff, Tilia; Romacker, Martin; Rodriguez-Esteban, Raul

    2014-01-01

    Text mining services are rapidly becoming a crucial component of various knowledge management pipelines, for example in the process of database curation, or for exploration and enrichment of biomedical data within the pharmaceutical industry. Traditional architectures, based on monolithic applications, do not offer sufficient flexibility for a wide range of use case scenarios, and therefore open architectures, as provided by web services, are attracting increased interest. We present an appro...

  7. OntoGene web services for biomedical text mining.

    Science.gov (United States)

    Rinaldi, Fabio; Clematide, Simon; Marques, Hernani; Ellendorff, Tilia; Romacker, Martin; Rodriguez-Esteban, Raul

    2014-01-01

    Text mining services are rapidly becoming a crucial component of various knowledge management pipelines, for example in the process of database curation, or for exploration and enrichment of biomedical data within the pharmaceutical industry. Traditional architectures, based on monolithic applications, do not offer sufficient flexibility for a wide range of use case scenarios, and therefore open architectures, as provided by web services, are attracting increased interest. We present an approach towards providing advanced text mining capabilities through web services, using a recently proposed standard for textual data interchange (BioC). The web services leverage a state-of-the-art text mining platform (OntoGene) which has been tested in several community-organized evaluation challenges, with top-ranked results in several of them. PMID:25472638

  8. Application of text mining in the biomedical domain

    NARCIS (Netherlands)

    Fleuren, W.W.M.; Alkema, W.B.L.

    2015-01-01

    In recent years the amount of experimental data that is produced in biomedical research and the number of papers that are being published in this field have grown rapidly. In order to keep up to date with developments in their field of interest and to interpret the outcome of experiments in light of

  9. Recent Advances and Emerging Applications in Text and Data Mining for Biomedical Discovery.

    Science.gov (United States)

    Gonzalez, Graciela H; Tahsin, Tasnia; Goodale, Britton C; Greene, Anna C; Greene, Casey S

    2016-01-01

    Precision medicine will revolutionize the way we treat and prevent disease. A major barrier to the implementation of precision medicine that clinicians and translational scientists face is understanding the underlying mechanisms of disease. We are starting to address this challenge through automatic approaches for information extraction, representation and analysis. Recent advances in text and data mining have been applied to a broad spectrum of key biomedical questions in genomics, pharmacogenomics and other fields. We present an overview of the fundamental methods for text and data mining, as well as recent advances and emerging applications toward precision medicine. PMID:26420781

  10. Knowledge acquisition, semantic text mining, and security risks in health and biomedical informatics

    Institute of Scientific and Technical Information of China (English)

    Pardue, J Harold; Gerthoffer, William T

    2012-01-01

    Computational techniques have been adopted in medical and biological systems for a long time. There is no doubt that the development and application of computational methods will greatly help in better understanding biomedical and biological functions. Large amounts of datasets have been produced by biomedical and biological experiments and simulations. In order for researchers to gain knowledge from original data, nontrivial transformation is necessary, which is regarded as a critical link in the chain of knowledge acquisition, sharing, and reuse. Challenges that have been encountered include: how to efficiently and effectively represent human knowledge in formal computing models, how to take advantage of semantic text mining techniques rather than traditional syntactic text mining, and how to handle security issues during knowledge sharing and reuse. This paper summarizes the state of the art in these research directions. We aim to provide readers with an introduction to the major computing themes applied to medical and biological research.

  11. Knowledge acquisition, semantic text mining, and security risks in health and biomedical informatics.

    Science.gov (United States)

    Huang, Jingshan; Dou, Dejing; Dang, Jiangbo; Pardue, J Harold; Qin, Xiao; Huan, Jun; Gerthoffer, William T; Tan, Ming

    2012-02-26

    Computational techniques have been adopted in medical and biological systems for a long time. There is no doubt that the development and application of computational methods will greatly help in better understanding biomedical and biological functions. Large amounts of datasets have been produced by biomedical and biological experiments and simulations. In order for researchers to gain knowledge from original data, nontrivial transformation is necessary, which is regarded as a critical link in the chain of knowledge acquisition, sharing, and reuse. Challenges that have been encountered include: how to efficiently and effectively represent human knowledge in formal computing models, how to take advantage of semantic text mining techniques rather than traditional syntactic text mining, and how to handle security issues during knowledge sharing and reuse. This paper summarizes the state of the art in these research directions. We aim to provide readers with an introduction to the major computing themes applied to medical and biological research.

  12. An Unsupervised Graph Based Continuous Word Representation Method for Biomedical Text Mining.

    Science.gov (United States)

    Jiang, Zhenchao; Li, Lishuang; Huang, Degen

    2016-01-01

    In biomedical text mining tasks, distributed word representations have succeeded in capturing semantic regularities, but most existing models are shallow-window-based, which is not sufficient for expressing the meaning of words. To represent words using deeper information, we make explicit the semantic regularities that emerge in word relations, including dependency relations and context relations, and propose a novel architecture for computing continuous vector representations by leveraging those relations. The performance of our model is measured on the word analogy task and the Protein-Protein Interaction Extraction (PPIE) task. Experimental results show that our method performs better overall than other word representation models on the word analogy task and has many advantages for biomedical text mining.

  13. Community challenges in biomedical text mining over 10 years: success, failure and the future.

    Science.gov (United States)

    Huang, Chung-Chi; Lu, Zhiyong

    2016-01-01

    One effective way to improve the state of the art is through competitions. Following the success of the Critical Assessment of protein Structure Prediction (CASP) in bioinformatics research, a number of challenge evaluations have been organized by the text-mining research community to assess and advance natural language processing (NLP) research for biomedicine. In this article, we review the different community challenge evaluations held from 2002 to 2014 and their respective tasks. Furthermore, we examine these challenge tasks through their targeted problems in NLP research and biomedical applications, respectively. Next, we describe the general workflow of organizing a Biomedical NLP (BioNLP) challenge and involved stakeholders (task organizers, task data producers, task participants and end users). Finally, we summarize the impact and contributions by taking into account different BioNLP challenges as a whole, followed by a discussion of their limitations and difficulties. We conclude with future trends in BioNLP challenge evaluations. PMID:25935162

  14. Text Mining.

    Science.gov (United States)

    Trybula, Walter J.

    1999-01-01

    Reviews the state of research in text mining, focusing on newer developments. The intent is to describe the disparate investigations currently included under the term text mining and provide a cohesive structure for these efforts. A summary of research identifies key organizations responsible for pushing the development of text mining. A section…

  15. The BioLexicon: a large-scale terminological resource for biomedical text mining

    Directory of Open Access Journals (Sweden)

    Thompson Paul

    2011-10-01

    Background: Due to the rapidly expanding body of biomedical literature, biologists require increasingly sophisticated and efficient systems to help them search for relevant information. Such systems should account for the multiple written variants used to represent biomedical concepts, and allow the user to search for specific pieces of knowledge (or events) involving these concepts, e.g., protein-protein interactions. Such functionality requires access to detailed information about the words used in the biomedical literature. Existing databases and ontologies often have a specific focus and are oriented towards human use. Consequently, biological knowledge is dispersed amongst many resources, which often do not attempt to account for the large and frequently changing set of variants that appear in the literature. Additionally, such resources typically do not provide information about how terms relate to each other in texts to describe events. Results: This article provides an overview of the design, construction and evaluation of a large-scale lexical and conceptual resource for the biomedical domain, the BioLexicon. The resource can be exploited by text mining tools at several levels, e.g., part-of-speech tagging, recognition of biomedical entities, and the extraction of events in which they are involved. As such, the BioLexicon must account for real usage of words in biomedical texts. In particular, the BioLexicon gathers together different types of terms from several existing data resources into a single, unified repository, and augments them with new term variants automatically extracted from the biomedical literature. Extraction of events is facilitated through the inclusion of biologically pertinent verbs (around which events are typically organized), together with information about typical patterns of their grammatical and semantic behaviour, which are acquired from domain-specific texts.
In order to foster interoperability, the BioLexicon is

  16. An unsupervised text mining method for relation extraction from biomedical literature.

    Directory of Open Access Journals (Sweden)

    Changqin Quan

    The wealth of interaction information provided in biomedical articles has motivated the implementation of text mining approaches to automatically extract biomedical relations. This paper presents an unsupervised method based on pattern clustering and sentence parsing to deal with biomedical relation extraction. The pattern clustering algorithm is based on the polynomial kernel method, which identifies interaction words from unlabeled data; these interaction words are then used in relation extraction between entity pairs. Dependency parsing and phrase structure parsing are combined for relation extraction. Based on the semi-supervised KNN algorithm, we extend the proposed unsupervised approach to a semi-supervised approach by combining pattern clustering, dependency parsing and phrase structure parsing rules. We evaluated the approaches on two different tasks: (1) protein-protein interaction extraction, and (2) gene-suicide association extraction. The evaluation of task (1) on the benchmark dataset (the AImed corpus) showed that our proposed unsupervised approach outperformed three supervised methods, which are rule-based, SVM-based, and kernel-based, respectively. The proposed semi-supervised approach is superior to existing semi-supervised methods. The evaluation of gene-suicide association extraction on a smaller dataset from the Genetic Association Database and a larger dataset from publicly available PubMed showed that the proposed unsupervised and semi-supervised methods achieved much higher F-scores than a co-occurrence-based method.
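The co-occurrence baseline that the evaluation compares against can be sketched as follows; the interaction-word list, entity names, and sentences are hypothetical, and the paper itself learns interaction words from unlabeled data via pattern clustering rather than using a fixed list.

```python
import re

# Hypothetical interaction-word list; the paper derives such words
# automatically from unlabeled data via pattern clustering.
INTERACTION_WORDS = {"binds", "activates", "inhibits", "phosphorylates"}

def cooccurrence_relations(sentences, entities):
    """Naive baseline: report an (entity, entity, sentence) triple whenever
    two entities appear in a sentence that also contains an interaction word."""
    triples = []
    for sent in sentences:
        tokens = set(re.findall(r"[A-Za-z0-9-]+", sent.lower()))
        found = [e for e in entities if e.lower() in tokens]
        if len(found) >= 2 and tokens & INTERACTION_WORDS:
            for i in range(len(found)):
                for j in range(i + 1, len(found)):
                    triples.append((found[i], found[j], sent))
    return triples

sents = ["BRCA1 binds RAD51 in vivo.", "TP53 was sequenced in all samples."]
pairs = cooccurrence_relations(sents, ["BRCA1", "RAD51", "TP53"])
```

A parser-based method like the one described would instead constrain the entity pair to a grammatical dependency path rather than mere sentence-level co-occurrence.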

  17. Knowledge acquisition, semantic text mining, and security risks in health and biomedical informatics

    OpenAIRE

    Huang, Jingshan; Dou, Dejing; Dang, Jiangbo; Pardue, J Harold; Qin, Xiao; Huan, Jun; Gerthoffer, William T; Tan, Ming

    2012-01-01

    Computational techniques have been adopted in medical and biological systems for a long time. There is no doubt that the development and application of computational methods will render great help in better understanding biomedical and biological functions. Large amounts of datasets have been produced by biomedical and biological experiments and simulations. In order for researchers to gain knowledge from original data, nontrivial transformation is necessary, which is regarded as a critical l...

  18. Text Mining for Neuroscience

    Science.gov (United States)

    Tirupattur, Naveen; Lapish, Christopher C.; Mukhopadhyay, Snehasis

    2011-06-01

    Text mining, sometimes alternatively referred to as text analytics, refers to the process of extracting high-quality knowledge from the analysis of textual data. Text mining has a wide variety of applications in areas such as biomedical science, news analysis, and homeland security. In this paper, we describe an approach and some relatively small-scale experiments which apply text mining to neuroscience research literature to find novel associations among a diverse set of entities. Neuroscience is a discipline which encompasses an exceptionally wide range of experimental approaches and rapidly growing interest. This combination results in an overwhelmingly large and often diffuse literature which makes a comprehensive synthesis difficult. Understanding the relations or associations among the entities appearing in the literature not only improves researchers' current understanding of recent advances in their field, but also provides an important computational tool to formulate novel hypotheses and thereby assist in scientific discoveries. We describe a methodology to automatically mine the literature and form novel associations through direct analysis of published texts. The method first retrieves a set of documents from databases such as PubMed using a set of relevant domain terms. In the current study these terms yielded document sets ranging from 160,909 to 367,214 documents. Each document is then represented in numerical vector form, from which an association graph is computed that represents relationships between all pairs of domain terms, based on co-occurrence. Association graphs can then be subjected to various graph-theoretic algorithms such as transitive closure and cycle (circuit) detection to derive additional information, and can also be visually presented to a human researcher for understanding. In this paper, we present three relatively small-scale problem-specific case studies to demonstrate that such an approach is very successful in
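The association-graph construction and transitive-closure step described above can be sketched as follows, assuming simple substring matching for term detection (the actual system retrieves documents from PubMed and supports further graph algorithms such as cycle detection); the documents and terms below are invented for illustration.

```python
from collections import defaultdict
from itertools import combinations

def association_graph(documents, terms):
    """Weight an edge between two domain terms by the number of
    documents in which they co-occur."""
    graph = defaultdict(int)
    for doc in documents:
        text = doc.lower()
        present = [t for t in terms if t.lower() in text]
        for a, b in combinations(sorted(present), 2):
            graph[(a, b)] += 1
    return dict(graph)

def reachable(graph, start):
    """Transitive closure from one term: every term reachable through
    a chain of co-occurrence edges."""
    adj = defaultdict(set)
    for (a, b) in graph:
        adj[a].add(b)
        adj[b].add(a)
    seen, stack = {start}, [start]
    while stack:
        for nxt in adj[stack.pop()] - seen:
            seen.add(nxt)
            stack.append(nxt)
    return seen - {start}

docs = ["dopamine modulates prefrontal cortex activity",
        "prefrontal cortex lesions impair working memory"]
g = association_graph(docs, ["dopamine", "prefrontal cortex", "working memory"])
```

Here `reachable(g, "dopamine")` surfaces an indirect association (dopamine to working memory) that no single document states, which is the kind of novel hypothesis the paper aims to generate.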

  19. Mining text data

    CERN Document Server

    Aggarwal, Charu C

    2012-01-01

    Text mining applications have experienced tremendous advances because of Web 2.0 and social networking applications. Recent advances in hardware and software technology have led to a number of unique scenarios where text mining algorithms are learned. "Mining Text Data" introduces an important niche in the text analytics field, and is an edited volume contributed by leading international researchers and practitioners focused on social networks & data mining. This book covers a wide swath of topics across social networks & data mining. Each chapter contains a comprehensive survey including

  20. Services for annotation of biomedical text

    OpenAIRE

    Hakenberg, Jörg

    2008-01-01

    Motivation: Text mining in the biomedical domain in recent years has focused on the development of tools for recognizing named entities and extracting relations. Such research resulted from the need for such tools as basic components of more advanced solutions. Named entity recognition, entity mention normalization, and relationship extraction have now reached a stage where they perform comparably to human annotators (considering inter-annotator agreement, measured in many studies to be aro...

  1. Hotspots in text mining of the biomedical field

    Institute of Scientific and Technical Information of China (English)

    史航; 高雯珺; 崔雷

    2016-01-01

    High-frequency subject terms were extracted from PubMed-indexed papers on text mining in the biomedical field published from January 2000 to March 2015, and a matrix of high-frequency subject terms and their source papers was generated. The co-occurrence of high-frequency subject terms in the same paper was analyzed by clustering analysis, and the hotspots in biomedical text mining were identified from the resulting clusters and their corresponding class-label papers. The results show that the current hotspots are the basic technologies of text mining, the application of text mining in bioinformatics, and the application of text mining in the extraction of drug-related facts.
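The clustering step above (grouping high-frequency subject terms by co-occurrence in the same paper) can be approximated with a single-link, connected-components sketch; the MeSH terms and paper IDs are hypothetical, and the study's actual clustering algorithm may differ.

```python
from collections import defaultdict

def cooccurrence_clusters(term_papers):
    """Group terms into clusters: two terms join the same cluster whenever
    they share at least one paper (single-link, via union-find)."""
    paper_terms = defaultdict(set)
    for term, papers in term_papers.items():
        for p in papers:
            paper_terms[p].add(term)
    parent = {t: t for t in term_papers}
    def find(t):
        # Path-halving union-find lookup.
        while parent[t] != t:
            parent[t] = parent[parent[t]]
            t = parent[t]
        return t
    for terms in paper_terms.values():
        terms = sorted(terms)
        for other in terms[1:]:
            parent[find(other)] = find(terms[0])
    groups = defaultdict(set)
    for t in term_papers:
        groups[find(t)].add(t)
    return sorted(map(sorted, groups.values()))

# Hypothetical high-frequency subject terms mapped to the papers citing them.
clusters = cooccurrence_clusters({
    "Data Mining": {1, 2}, "Natural Language Processing": {2},
    "Drug Interactions": {3}, "Adverse Drug Reaction": {3}})
```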

  2. Contextual Text Mining

    Science.gov (United States)

    Mei, Qiaozhu

    2009-01-01

    With the dramatic growth of text information, there is an increasing need for powerful text mining systems that can automatically discover useful knowledge from text. Text is generally associated with all kinds of contextual information. Those contexts can be explicit, such as the time and the location where a blog article is written, and the…

  3. Chapter 16: Text Mining for Translational Bioinformatics

    OpenAIRE

    Bretonnel Cohen, K; Hunter, Lawrence E.

    2013-01-01

    Text mining for translational bioinformatics is a new field with tremendous research potential. It is a subfield of biomedical natural language processing that concerns itself directly with the problem of relating basic biomedical research to clinical practice, and vice versa. Applications of text mining fall both into the category of T1 translational research-translating basic science results into new interventions-and T2 translational research, or translational research for public health. P...

  4. Mining Molecular Pharmacological Effects from Biomedical Text: a Case Study for Eliciting Anti-Obesity/Diabetes Effects of Chemical Compounds.

    Science.gov (United States)

    Dura, Elzbieta; Muresan, Sorel; Engkvist, Ola; Blomberg, Niklas; Chen, Hongming

    2014-05-01

    In the pharmaceutical industry, efficiently mining pharmacological data from the rapidly increasing scientific literature is crucial for many aspects of the drug discovery process, such as target validation and tool compound selection. A quick and reliable way is needed to collect literature assertions of selected compounds' biological and pharmacological effects in order to assist the hypothesis generation and decision-making of drug developers. INFUSIS, the text mining system presented here, extracts data on chemical compounds from PubMed abstracts. It involves extensive use of customized natural language processing in addition to co-occurrence analysis. As a proof-of-concept study, INFUSIS was used to search abstract texts for several obesity/diabetes-related pharmacological effects of the compounds included in a compound dictionary. The system extracts assertions regarding the pharmacological effects of each given compound and scores them by relevance. For each selected pharmacological effect, the highest-scoring assertions in 100 abstracts were manually evaluated, i.e. 800 abstracts in total. The overall accuracy for the inferred assertions was over 90 percent. PMID:27485890

  5. Text Mining: Asynchronous Sequences

    Directory of Open Access Journals (Sweden)

    Sheema Khan

    2014-12-01

    In this paper we attempt to correlate text sequences that provide common topics as semantic clues. We propose a two-step method for asynchronous text mining. Step one checks for the common topics in the sequences and isolates them with their timestamps. Step two takes a topic and tries to assign a timestamp to the text document. After multiple repetitions of step two, we can give an optimal result.

  6. Chapter 16: text mining for translational bioinformatics.

    Directory of Open Access Journals (Sweden)

    K Bretonnel Cohen

    2013-04-01

    Text mining for translational bioinformatics is a new field with tremendous research potential. It is a subfield of biomedical natural language processing that concerns itself directly with the problem of relating basic biomedical research to clinical practice, and vice versa. Applications of text mining fall both into the category of T1 translational research (translating basic science results into new interventions) and T2 translational research, or translational research for public health. Potential use cases include better phenotyping of research subjects, and pharmacogenomic research. A variety of methods for evaluating text mining applications exist, including corpora, structured test suites, and post hoc judging. Two basic principles of linguistic structure are relevant for building text mining applications. One is that linguistic structure consists of multiple levels. The other is that every level of linguistic structure is characterized by ambiguity. There are two basic approaches to text mining: rule-based, also known as knowledge-based; and machine-learning-based, also known as statistical. Many systems are hybrids of the two approaches. Shared tasks have had a strong effect on the direction of the field. Like all translational bioinformatics software, text mining software for translational bioinformatics can be considered health-critical and should be subject to the strictest standards of quality assurance and software testing.

  7. Chapter 16: text mining for translational bioinformatics.

    Science.gov (United States)

    Cohen, K Bretonnel; Hunter, Lawrence E

    2013-04-01

    Text mining for translational bioinformatics is a new field with tremendous research potential. It is a subfield of biomedical natural language processing that concerns itself directly with the problem of relating basic biomedical research to clinical practice, and vice versa. Applications of text mining fall both into the category of T1 translational research (translating basic science results into new interventions) and T2 translational research, or translational research for public health. Potential use cases include better phenotyping of research subjects, and pharmacogenomic research. A variety of methods for evaluating text mining applications exist, including corpora, structured test suites, and post hoc judging. Two basic principles of linguistic structure are relevant for building text mining applications. One is that linguistic structure consists of multiple levels. The other is that every level of linguistic structure is characterized by ambiguity. There are two basic approaches to text mining: rule-based, also known as knowledge-based; and machine-learning-based, also known as statistical. Many systems are hybrids of the two approaches. Shared tasks have had a strong effect on the direction of the field. Like all translational bioinformatics software, text mining software for translational bioinformatics can be considered health-critical and should be subject to the strictest standards of quality assurance and software testing.

  8. Text Mining Infrastructure in R

    OpenAIRE

    Kurt Hornik; Ingo Feinerer; David Meyer

    2008-01-01

    During the last decade text mining has become a widely used discipline utilizing statistical and machine learning methods. We present the tm package, which provides a framework for text mining applications within R. We give a survey of text mining facilities in R and explain how typical application tasks can be carried out using our framework. We present techniques for count-based analysis methods, text clustering, text classification and string kernels. (authors' abstract)
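tm itself is an R framework, but its count-based workflow (corpus, preprocessing, then a term-document matrix) can be sketched in Python for illustration; the stop-word list and corpus below are toy examples, not part of the tm API.

```python
import re
from collections import Counter

STOPWORDS = {"the", "a", "of", "in", "and", "is"}  # toy list for illustration

def term_document_matrix(corpus):
    """Tokenize, lowercase, drop stop words, and count terms per document,
    mirroring tm's Corpus -> tm_map -> TermDocumentMatrix chain."""
    return [Counter(w for w in re.findall(r"[a-z]+", doc.lower())
                    if w not in STOPWORDS)
            for doc in corpus]

corpus = ["Text mining is fun", "Mining text data"]
tdm = term_document_matrix(corpus)
```

Count vectors like these feed directly into the clustering and classification methods the package surveys.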

  9. Text Mining for Protein Docking.

    Directory of Open Access Journals (Sweden)

    Varsha D Badal

    2015-12-01

    The rapidly growing amount of publicly available information from biomedical research is readily accessible on the Internet, providing a powerful resource for predictive biomolecular modeling. The accumulated data on experimentally determined structures transformed structure prediction of proteins and protein complexes. Instead of exploring the enormous search space, predictive tools can simply proceed to the solution based on similarity to existing, previously determined structures. A similar major paradigm shift is emerging due to the rapidly expanding amount of information, other than experimentally determined structures, which can still be used as constraints in biomolecular structure prediction. Automated text mining has been widely used in recreating protein interaction networks, as well as in detecting small-ligand binding sites on protein structures. Combining and expanding these two well-developed areas of research, we applied text mining to structural modeling of protein-protein complexes (protein docking). Protein docking can be significantly improved when constraints on the docking mode are available. We developed a procedure that retrieves published abstracts on a specific protein-protein interaction and extracts information relevant to docking. The procedure was assessed on protein complexes from Dockground (http://dockground.compbio.ku.edu). The results show that correct information on binding residues can be extracted for about half of the complexes. The amount of irrelevant information was reduced by conceptual analysis of a subset of the retrieved abstracts, based on the bag-of-words (features) approach. Support Vector Machine models were trained and validated on the subset. The remaining abstracts were filtered by the best-performing models, which decreased the irrelevant information for ~25% of the complexes in the dataset.
The extracted constraints were incorporated in the docking protocol and tested on the Dockground unbound
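The abstract-filtering step (bag-of-words features plus trained models) can be illustrated with a dependency-free sketch; a nearest-centroid classifier stands in here for the Support Vector Machine models the study actually trained, and all texts are invented.

```python
import re
from collections import Counter

def bow(text):
    """Bag-of-words feature vector as a term-count dictionary."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def centroid(vectors):
    """Mean term-count vector over a set of labeled examples."""
    total = Counter()
    for v in vectors:
        total.update(v)
    return {w: c / len(vectors) for w, c in total.items()}

def similarity(v, c):
    """Unnormalized dot product between a document and a centroid."""
    return sum(v[w] * c.get(w, 0.0) for w in v)

def filter_relevant(abstracts, relevant, irrelevant):
    """Keep abstracts closer to the 'relevant' centroid (an SVM stand-in)."""
    rel_c = centroid([bow(t) for t in relevant])
    irr_c = centroid([bow(t) for t in irrelevant])
    return [t for t in abstracts
            if similarity(bow(t), rel_c) > similarity(bow(t), irr_c)]

kept = filter_relevant(
    ["residues in the binding interface were mutated"],
    ["binding interface residues", "interface residues contact"],
    ["gene expression microarray", "expression profiling"])
```

In the study's pipeline, abstracts surviving this filter supply the binding-residue constraints fed to the docking protocol.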

  10. Research on Biomedical Text Mining Based on Knowledge Organization Systems

    Institute of Scientific and Technical Information of China (English)

    钱庆

    2016-01-01

    With the rapid development of biomedical information technology, the biomedical literature is growing exponentially, and acquiring and understanding the required knowledge by manual reading alone has become extremely difficult. How to integrate existing knowledge and mine new knowledge from huge amounts of biomedical literature has therefore become a current research hot spot. Knowledge organization systems in the biomedical field are more normative and complete than in other fields, which lays the foundation for biomedical text mining, and a large number of text mining methods and systems based on knowledge organization systems have developed rapidly. This paper surveys existing medical knowledge organization systems, outlines the main process of biomedical text mining, and reviews the principal research and recent progress by mining task. It further analyzes the characteristics of biomedical text mining based on knowledge organization systems, and summarizes the main roles that knowledge organization systems play in biomedical text mining as well as the challenges facing current research, so as to provide a reference for biomedical researchers.
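The dictionary lookup step that such knowledge-organization-driven pipelines typically start from can be sketched as follows; the terms and identifiers below are invented stand-ins for a controlled vocabulary such as MeSH.

```python
import re

# Toy dictionary standing in for a knowledge organization system such as
# MeSH; the terms and identifiers here are illustrative, not authoritative.
MESH_LIKE = {
    "leukemia": "D007938",
    "text mining": "D057225",
    "gene expression": "D015870",
}

def tag_concepts(text):
    """Return (term, concept_id, start_offset) for every dictionary term found."""
    hits = []
    for term, cid in MESH_LIKE.items():
        for m in re.finditer(re.escape(term), text.lower()):
            hits.append((term, cid, m.start()))
    return sorted(hits, key=lambda h: h[2])

hits = tag_concepts("Text mining of leukemia literature reveals gene expression patterns.")
```

Real systems add tokenization, synonym expansion, and disambiguation on top of this lookup, but the trie/dictionary scan is the common core.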

  11. Text Mining Applications and Theory

    CERN Document Server

    Berry, Michael W

    2010-01-01

    Text Mining: Applications and Theory presents the state-of-the-art algorithms for text mining from both the academic and industrial perspectives.  The contributors span several countries and scientific domains: universities, industrial corporations, and government laboratories, and demonstrate the use of techniques from machine learning, knowledge discovery, natural language processing and information retrieval to design computational models for automated text analysis and mining. This volume demonstrates how advancements in the fields of applied mathematics, computer science, machine learning

  12. Text mining meets workflow: linking U-Compare with Taverna

    OpenAIRE

    Kano, Yoshinobu; Dobson, Paul; Nakanishi, Mio; Tsujii, Jun'ichi; Ananiadou, Sophia

    2010-01-01

    Summary: Text mining from the biomedical literature is of increasing importance, yet it is not easy for the bioinformatics community to create and run text mining workflows due to the lack of accessibility and interoperability of the text mining resources. The U-Compare system provides a wide range of bio text mining resources in a highly interoperable workflow environment where workflows can very easily be created, executed, evaluated and visualized without coding. We have linked U-Compare t...

  13. Text mining: A Brief survey

    OpenAIRE

    Falguni N. Patel; Neha R. Soni

    2012-01-01

    The unstructured texts which contain massive amounts of information cannot simply be used for further processing by computers. Therefore, specific processing methods and algorithms are required in order to extract useful patterns. The process of extracting interesting information and knowledge from unstructured text is accomplished using text mining. In this paper, we discuss text mining as a recent and interesting field, with details of the steps involved in the overall process. We have...

  14. Typesafe Modeling in Text Mining

    OpenAIRE

    Steeg, Fabian

    2011-01-01

    Based on the concept of annotation-based agents, this report introduces tools and a formal notation for defining and running text mining experiments using a statically typed domain-specific language embedded in Scala. Using machine learning for classification as an example, the framework is used to develop and document text mining experiments, and to show how the concept of generic, typesafe annotation corresponds to a general information model that goes beyond text processing.

  15. Towards Effective Sentence Simplification for Automatic Processing of Biomedical Text

    CERN Document Server

    Jonnalagadda, Siddhartha; Hakenberg, Jorg; Baral, Chitta; Gonzalez, Graciela

    2010-01-01

    The complexity of sentences characteristic of biomedical articles poses a challenge to natural language parsers, which are typically trained on large-scale corpora of non-technical text. We propose a text simplification process, bioSimplify, that seeks to reduce the complexity of sentences in biomedical abstracts in order to improve the performance of syntactic parsers on the processed sentences. Syntactic parsing is typically one of the first steps in a text mining pipeline. Thus, any improvement in performance would have a ripple effect over all processing steps. We evaluated our method using a corpus of biomedical sentences annotated with syntactic links. Our empirical results show an improvement of 2.90% for the Charniak-McClosky parser and of 4.23% for the Link Grammar parser when processing simplified sentences rather than the original sentences in the corpus.
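As a rough illustration of the kind of rewriting such a simplification step performs (not the actual bioSimplify algorithm, whose rules are more sophisticated), one can drop parenthetical asides and split coordinate clauses:

```python
import re

def simplify(sentence):
    """Naive sentence simplification: drop parenthetical asides, then split
    clauses joined by semicolons or non-restrictive ', which' relatives.
    A rough illustration only -- not the bioSimplify algorithm itself."""
    s = re.sub(r"\s*\([^)]*\)", "", sentence)        # remove (...) asides
    parts = re.split(r";\s*|,\s+which\s+", s)        # split clause joins
    return [p.strip().rstrip(".") + "." for p in parts if p.strip()]

out = simplify("TP53 (a tumor suppressor) regulates apoptosis, "
               "which is vital; it is often mutated.")
```

Each output fragment is shorter and syntactically flatter, which is what helps the downstream parser.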

  16. System for Distributed Text Mining

    OpenAIRE

    Torgersen, Martin Nordseth

    2011-01-01

    Text mining presents us with new possibilities for the use of collections of documents. There exists a large amount of hidden, implicit information inside these collections, which text mining techniques may help us to uncover. Unfortunately, these techniques generally require large amounts of computational power. This is addressed by the introduction of distributed systems and methods for distributed processing, such as Hadoop and MapReduce. This thesis aims to describe, design, implement and ev...

  17. DeTEXT: A Database for Evaluating Text Extraction from Biomedical Literature Figures.

    Directory of Open Access Journals (Sweden)

    Xu-Cheng Yin

    Full Text Available Hundreds of millions of figures are available in the biomedical literature, representing important biomedical experimental evidence. Since text is a rich source of information in figures, automatically extracting such text may assist in the task of mining figure information. A high-quality ground-truth standard can greatly facilitate the development of an automated system. This article describes DeTEXT: a database for evaluating text extraction from biomedical literature figures. It is the first publicly available, human-annotated, high-quality, large-scale figure-text dataset, with 288 full-text articles, 500 biomedical figures, and 9308 text regions. This article describes how figures were selected from open-access full-text biomedical articles and how annotation guidelines and annotation tools were developed. We also discuss the inter-annotator agreement and the reliability of the annotations. We summarize the statistics of the DeTEXT data and make available evaluation protocols for DeTEXT. Finally, we lay out the challenges we observed in the automated detection and recognition of figure text and discuss research directions in this area. DeTEXT is publicly available for downloading at http://prir.ustb.edu.cn/DeTEXT/.

  18. Learning the Structure of Biomedical Relationships from Unstructured Text.

    Directory of Open Access Journals (Sweden)

    Bethany Percha

    2015-07-01

    Full Text Available The published biomedical research literature encompasses most of our understanding of how drugs interact with gene products to produce physiological responses (phenotypes). Unfortunately, this information is distributed throughout the unstructured text of over 23 million articles. The creation of structured resources that catalog the relationships between drugs and genes would accelerate the translation of basic molecular knowledge into discoveries of genomic biomarkers for drug response and prediction of unexpected drug-drug interactions. Extracting these relationships from natural language sentences on such a large scale, however, requires text mining algorithms that can recognize when different-looking statements are expressing similar ideas. Here we describe a novel algorithm, Ensemble Biclustering for Classification (EBC), that learns the structure of biomedical relationships automatically from text, overcoming differences in word choice and sentence structure. We validate EBC's performance against manually-curated sets of (1) pharmacogenomic relationships from PharmGKB and (2) drug-target relationships from DrugBank, and use it to discover new drug-gene relationships for both knowledge bases. We then apply EBC to map the complete universe of drug-gene relationships based on their descriptions in Medline, revealing unexpected structure that challenges current notions about how these relationships are expressed in text. For instance, we learn that newer experimental findings are described in consistently different ways than established knowledge, and that seemingly pure classes of relationships can exhibit interesting chimeric structure. The EBC algorithm is flexible and adaptable to a wide range of problems in biomedical text mining.
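EBC clusters a matrix whose rows are entity pairs and whose columns are the textual (dependency-path) contexts connecting them. As a much-simplified stand-in for that idea, the sketch below compares the context profiles of entity pairs with cosine similarity; all pairs and counts are invented:

```python
from collections import Counter
from math import sqrt

# Rows: entity pairs; columns: textual contexts connecting them. Counts are
# invented; the real EBC algorithm biclusters a pairs-by-dependency-paths matrix.
profiles = {
    ("warfarin", "VKORC1"): Counter({"inhibits": 4, "targets": 2}),
    ("gefitinib", "EGFR"):  Counter({"inhibits": 5, "targets": 3}),
    ("caffeine", "CYP1A2"): Counter({"metabolized_by": 6}),
}

def cosine(a, b):
    """Cosine similarity of two sparse context-count vectors."""
    dot = sum(v * b[k] for k, v in a.items())
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb)

sim = cosine(profiles[("warfarin", "VKORC1")], profiles[("gefitinib", "EGFR")])
```

Pairs expressed through similar contexts (here, two kinase/enzyme inhibition pairs) score near 1, while unrelated relationship types score 0; biclustering generalizes this by grouping rows and columns jointly.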

  19. Biomarker Identification Using Text Mining

    Directory of Open Access Journals (Sweden)

    Hui Li

    2012-01-01

    Full Text Available Identifying molecular biomarkers has become one of the important tasks for scientists assessing the different phenotypic states of cells or organisms correlated to the genotypes of diseases from large-scale biological data. In this paper, we propose a text-mining-based method to discover biomarkers from PubMed. First, we construct a database based on a dictionary, and then we use a finite state machine to identify the biomarkers. Our text mining method provides a highly reliable approach to discovering biomarkers in the PubMed database.
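A dictionary-plus-finite-state-machine matcher of the kind described can be sketched as a token-level trie; the biomarker names below are illustrative only:

```python
# Token-level trie matcher: a small finite-state machine that recognizes
# multi-word biomarker names from a dictionary. Entries are invented examples.
BIOMARKERS = ["psa", "ca 125", "her 2", "troponin i"]

def build_trie(terms):
    """Build a nested-dict trie over the tokens of each term."""
    trie = {}
    for term in terms:
        node = trie
        for tok in term.split():
            node = node.setdefault(tok, {})
        node["$"] = term                     # end-of-term marker (accept state)
    return trie

def match(tokens, trie):
    """Scan the token stream, advancing the FSM from every start position."""
    found = []
    for i in range(len(tokens)):
        node, j = trie, i
        while j < len(tokens) and tokens[j] in node:
            node = node[tokens[j]]
            j += 1
            if "$" in node:
                found.append(node["$"])
        # falling out of the while loop = no transition = reject this start
    return found

trie = build_trie(BIOMARKERS)
hits = match("serum ca 125 and psa levels".split(), trie)
```

The trie states correspond exactly to FSM states, with the `"$"` key marking accepting states, so multi-word names like `ca 125` match in a single left-to-right pass.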

  1. Using natural language processing to improve biomedical concept normalization and relation mining

    NARCIS (Netherlands)

    N. Kang (Ning)

    2013-01-01

    This thesis concerns the use of natural language processing for improving biomedical concept normalization and relation mining. We begin by introducing the background of biomedical text mining, and subsequently continue by describing a typical text mining pipeline, some key iss

  2. Text Classification using Data Mining

    CERN Document Server

    Kamruzzaman, S M; Hasan, Ahmed Ryadh

    2010-01-01

    Text classification is the process of classifying documents into predefined categories based on their content. It is the automated assignment of natural language texts to predefined categories. Text classification is the primary requirement of text retrieval systems, which retrieve texts in response to a user query, and text understanding systems, which transform text in some way such as producing summaries, answering questions or extracting data. Existing supervised learning algorithms for automatically classifying text need sufficient documents to learn accurately. This paper presents a new algorithm for text classification using data mining that requires fewer documents for training. Instead of using words, word relations, i.e., association rules derived from these words, are used to build the feature set from pre-classified text documents. The concept of the Naive Bayes classifier is then applied to the derived features, and finally a single concept from the Genetic Algorithm is added for the final classification. A system based on the...
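A hedged sketch of this idea, substituting simple within-document word pairs for mined association rules and applying a multinomial Naive Bayes with Laplace smoothing (all documents and labels below are invented):

```python
from collections import Counter, defaultdict
from itertools import combinations
from math import log

def pair_features(text):
    """Unordered within-document word pairs: a crude stand-in for the
    association-rule features described in the paper."""
    words = sorted(set(text.lower().split()))
    return [a + "|" + b for a, b in combinations(words, 2)]

def train(docs):
    """docs: list of (text, label). Returns per-class feature counts,
    per-class totals, and class priors."""
    counts, totals, priors = defaultdict(Counter), Counter(), Counter()
    for text, label in docs:
        priors[label] += 1
        for f in pair_features(text):
            counts[label][f] += 1
            totals[label] += 1
    return counts, totals, priors

def classify(text, counts, totals, priors, vocab_size=1000):
    scores = {}
    for label in priors:
        s = log(priors[label])
        for f in pair_features(text):
            # Laplace (add-one) smoothing over an assumed feature-space size
            s += log((counts[label][f] + 1) / (totals[label] + vocab_size))
        scores[label] = s
    return max(scores, key=scores.get)

docs = [("heart attack cardiac", "cardio"), ("cardiac arrest heart", "cardio"),
        ("tumor cell growth", "onco"), ("cell tumor biopsy", "onco")]
model = train(docs)
label = classify("cardiac heart failure", *model)
```

The genetic-algorithm refinement step of the paper is omitted here; this shows only the feature derivation and the Bayesian scoring.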

  3. SIAM 2007 Text Mining Competition dataset

    Data.gov (United States)

    National Aeronautics and Space Administration — Subject Area: Text Mining Description: This is the dataset used for the SIAM 2007 Text Mining competition. This competition focused on developing text mining...

  4. Text Mining in Social Networks

    Science.gov (United States)

    Aggarwal, Charu C.; Wang, Haixun

    Social networks are rich in various kinds of contents such as text and multimedia. The ability to apply text mining algorithms effectively in the context of text data is critical for a wide variety of applications. Social networks require text mining algorithms for a wide variety of applications such as keyword search, classification, and clustering. While search and classification are well known applications for a wide variety of scenarios, social networks have a much richer structure both in terms of text and links. Much of the work in the area uses either purely the text content or purely the linkage structure. However, many recent algorithms use a combination of linkage and content information for mining purposes. In many cases, it turns out that the use of a combination of linkage and content information provides much more effective results than a system which is based purely on either of the two. This paper provides a survey of such algorithms, and the advantages observed by using such algorithms in different scenarios. We also present avenues for future research in this area.

  5. NAMED ENTITY RECOGNITION FROM BIOMEDICAL TEXT -AN INFORMATION EXTRACTION TASK

    Directory of Open Access Journals (Sweden)

    N. Kanya

    2016-07-01

    Full Text Available Biomedical Text Mining targets the extraction of significant information from biomedical archives. Bio TM encompasses Information Retrieval (IR) and Information Extraction (IE). Information Retrieval retrieves the relevant biomedical literature documents from various repositories such as PubMed, MedLine, etc., based on a search query. The IR process ends with the generation of a corpus of the relevant documents retrieved from the publication databases based on the query. The IE task includes preprocessing of the documents, Named Entity Recognition (NER), and Relationship Extraction. This process draws on Natural Language Processing, data mining techniques, and machine learning algorithms. The preprocessing task includes tokenization, stop-word removal, shallow parsing, and Parts-Of-Speech tagging. The NER phase involves recognition of well-defined objects such as genes, proteins, or cell lines. This phase leads to the next, the extraction of relationships (IE). The work was based on the machine learning algorithm Conditional Random Fields (CRF).
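The preprocessing and feature extraction stages feeding a CRF-based NER tagger might look like the following sketch (the tiny stop-word list and the feature set are illustrative choices, not the paper's):

```python
import re

STOP = {"the", "of", "and", "in", "a", "is"}   # tiny illustrative stop-word list

def tokenize(sentence):
    """Simple regex tokenizer keeping alphanumerics and hyphens."""
    return re.findall(r"[A-Za-z0-9-]+", sentence)

def features(tokens, i):
    """Per-token features of the kind typically fed to a CRF for NER."""
    tok = tokens[i]
    return {
        "word": tok.lower(),
        "is_capitalized": tok[0].isupper(),
        "has_digit": any(c.isdigit() for c in tok),
        "suffix3": tok[-3:].lower(),
        "prev": tokens[i - 1].lower() if i > 0 else "<s>",
        "next": tokens[i + 1].lower() if i < len(tokens) - 1 else "</s>",
    }

toks = [t for t in tokenize("The BRCA1 gene is expressed in tumor cells")
        if t.lower() not in STOP]
f = features(toks, 0)   # feature dict for the token "BRCA1"
```

Capitalization, digit, and suffix features are what let a CRF generalize from seen gene names like `BRCA1` to unseen ones with similar surface shape.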

  6. Mining biomedical images towards valuable information retrieval in biomedical and life sciences.

    Science.gov (United States)

    Ahmed, Zeeshan; Zeeshan, Saman; Dandekar, Thomas

    2016-01-01

    Biomedical images are helpful sources for scientists and practitioners in drawing significant hypotheses, exemplifying approaches, and describing experimental results in the published biomedical literature. In recent decades, there has been an enormous increase in heterogeneous biomedical image production and publication, which results in a need for bioimaging platforms for feature extraction and analysis of text and content in biomedical images, so that effective information retrieval systems can be implemented. In this review, we summarize technologies related to data mining of figures. We describe and compare the potential of different approaches in terms of their developmental aspects, methodologies used, results produced, accuracies achieved, and limitations. Our comparative conclusions include current challenges for bioimaging software with selective image mining, embedded text extraction, and processing of complex natural language queries. PMID:27538578

  9. GPU-Accelerated Text Mining

    Energy Technology Data Exchange (ETDEWEB)

    Cui, Xiaohui [ORNL; Mueller, Frank [North Carolina State University; Zhang, Yongpeng [ORNL; Potok, Thomas E [ORNL

    2009-01-01

    Accelerating hardware devices represent a novel promise for improving performance in many problem domains, but it is not clear which accelerators are suitable for which domains. While there is no room in general-purpose processor design to significantly increase the processor frequency, developers are instead resorting to multi-core chips that duplicate conventional computing capabilities on a single die. Yet accelerators offer more radical designs with a much higher level of parallelism and novel programming environments. The present work assesses the viability of text mining on CUDA. Text mining is one of the key concepts that has become prominent as an effective means to index the Internet, but its applications range beyond this scope and extend to providing document similarity metrics, the subject of this work. We have developed and optimized text search algorithms for GPUs to exploit their potential for massive data processing. We discuss the algorithmic challenges of parallelizing text search problems on GPUs and demonstrate the potential of these devices in experiments by reporting significant speedups. Our study may be one of the first to assess more complex text search problems for suitability for GPU devices, and it may also be one of the first to exploit and report on the atomic instructions that have recently become available in NVIDIA devices.
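The document similarity metric at the heart of such work is typically TF-IDF cosine similarity; a CPU reference version is sketched below (on a GPU, each document pair would map naturally to a thread):

```python
from collections import Counter
from math import log, sqrt

def tfidf_vectors(docs):
    """Bag-of-words TF-IDF vectors as sparse dicts. This is the CPU reference;
    a GPU version would parallelize the per-pair similarity computations."""
    tokenized = [d.lower().split() for d in docs]
    df = Counter(w for toks in tokenized for w in set(toks))   # document freq.
    n = len(docs)
    vecs = []
    for toks in tokenized:
        tf = Counter(toks)
        vecs.append({w: c * log(n / df[w]) for w, c in tf.items()})
    return vecs

def cosine(a, b):
    """Cosine similarity of two sparse TF-IDF vectors."""
    dot = sum(v * b.get(w, 0.0) for w, v in a.items())
    norm = lambda x: sqrt(sum(v * v for v in x.values()))
    na, nb = norm(a), norm(b)
    return dot / (na * nb) if na and nb else 0.0

docs = ["gpu parallel text mining", "text mining on gpu", "protein folding simulation"]
vecs = tfidf_vectors(docs)
```

Because each pairwise cosine is independent, the all-pairs similarity matrix is embarrassingly parallel, which is exactly what makes it attractive for CUDA.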

  11. Document Exploration and Automatic Knowledge Extraction for Unstructured Biomedical Text

    Science.gov (United States)

    Chu, S.; Totaro, G.; Doshi, N.; Thapar, S.; Mattmann, C. A.; Ramirez, P.

    2015-12-01

    We describe our work on building a web-browser-based document reader with a built-in exploration tool and automatic concept extraction of medical entities for biomedical text. Vast amounts of biomedical information are offered in unstructured text form through scientific publications and R&D reports. Utilizing text mining can help us to mine information and extract relevant knowledge from a plethora of biomedical text. The ability to employ such technologies to aid researchers in coping with information overload is greatly desirable. In recent years, there has been an increased interest in automatic biomedical concept extraction [1, 2] and intelligent PDF reader tools with the ability to search on content and find related articles [3]. Such reader tools are typically desktop applications and are limited to specific platforms. Our goal is to provide researchers with a simple tool to aid them in finding, reading, and exploring documents. Thus, we propose a web-based document explorer, which we call Shangri-Docs, which combines a document reader with automatic concept extraction and highlighting of relevant terms. Shangri-Docs also provides the ability to evaluate a wide variety of document formats (e.g. PDF, Word, PPT, text, etc.) and to exploit the linked nature of the Web and personal content by performing searches on content from public sites (e.g. Wikipedia, PubMed) and private cataloged databases simultaneously. Shangri-Docs utilizes Apache cTAKES (clinical Text Analysis and Knowledge Extraction System) [4] and the Unified Medical Language System (UMLS) to automatically identify and highlight terms and concepts, such as specific symptoms, diseases, drugs, and anatomical sites, mentioned in the text. cTAKES was originally designed specifically to extract information from clinical medical records.
Our investigation leads us to extend the automatic knowledge extraction process of cTAKES for the biomedical research domain by improving the ontology-guided information extraction
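Dictionary-driven term highlighting of the kind such a reader performs on cTAKES/UMLS output can be sketched as follows; the two-entry lexicon is an invented stand-in for UMLS concept strings, not actual UMLS data:

```python
import re
import html

# Invented mini-lexicon standing in for UMLS concept strings and their
# semantic types; a real system would query the UMLS Metathesaurus.
LEXICON = {"fever": "Sign or Symptom", "aspirin": "Pharmacologic Substance"}

def highlight(text):
    """Wrap each lexicon term in a <mark> tag carrying its semantic type,
    roughly what a concept-highlighting reader does in the browser."""
    out = html.escape(text)
    for term, sem_type in LEXICON.items():
        out = re.sub(r"\b(%s)\b" % re.escape(term),
                     r'<mark title="%s">\1</mark>' % sem_type,
                     out, flags=re.IGNORECASE)
    return out

marked = highlight("Aspirin reduced the fever within hours.")
```

Hovering a `<mark>` element then shows the concept's semantic type, giving the reader inline access to the extracted knowledge.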

  12. Searching Biomedical Text: Towards Maximum Relevant Results

    OpenAIRE

    Galde, Ola; Sevaldsen, John Harald

    2006-01-01

    The amount of biomedical information available to users today is large and increasing. The ability to precisely retrieve desired information is vital in order to utilize available knowledge. In this work we investigated how to improve the relevance of biomedical search results. Using the Lucene Java API we applied a series of information retrieval techniques to search in biomedical data. The techniques ranged from basic stemming and stop-word removal to more advanced methods like user relevan...

  13. A Survey on Web Text Information Retrieval in Text Mining

    OpenAIRE

    Tapaswini Nayak; Srinivash Prasad; Manas Ranjan Senapat

    2015-01-01

    In this study we have analyzed different techniques for information retrieval in text mining. The aim of the study is to identify web text information retrieval. Text mining is similar to text analytics: a process of deriving high-quality information from text. High-quality information is typically derived through the devising of patterns and trends by means such as statistical pattern learning. Typical text mining tasks include text categorization, text clustering, concep...

  14. Efficient Retrieval of Text for Biomedical Domain using Expectation Maximization Algorithm

    Directory of Open Access Journals (Sweden)

    Sumit Vashishtha

    2011-11-01

    Full Text Available Data mining, a branch of computer science [1], is the process of extracting patterns from large data sets by combining methods from statistics and artificial intelligence with database management. Data mining is seen as an increasingly important tool by modern business to transform data into business intelligence, giving an informational advantage. Biomedical text retrieval refers to text retrieval techniques applied to biomedical resources and literature available in the biomedical and molecular biology domain. The volume of published biomedical research, and therefore the underlying biomedical knowledge base, is expanding at an increasing rate. Biomedical text retrieval is a way to aid researchers in coping with information overload. By discovering predictive relationships between different pieces of extracted data, data-mining algorithms can be used to improve the accuracy of information extraction. However, textual variation due to typos, abbreviations, and other sources can prevent the productive discovery and utilization of hard-matching rules. Recent methods of soft clustering can exploit predictive relationships in textual data. This paper presents a technique for using a soft-clustering data mining algorithm to increase the accuracy of biomedical text extraction. Experimental results demonstrate that this approach improves text extraction more effectively than hard keyword-matching rules.
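A toy version of the soft-clustering idea (EM for a two-component multinomial mixture over bags of words, seeded deterministically; a sketch of the general technique, not the paper's exact method) can be written as:

```python
from collections import Counter

def em_soft_cluster(docs, seeds, iters=20, alpha=0.1):
    """EM for a two-component multinomial mixture over bags of words.
    `seeds` are indices of two documents used to initialize the components."""
    bags = [Counter(d.split()) for d in docs]
    vocab = sorted({w for b in bags for w in b})
    # initialize each component's word distribution from a seed document
    comps = []
    for s in seeds:
        total = sum(bags[s].values()) + alpha * len(vocab)
        comps.append({w: (bags[s][w] + alpha) / total for w in vocab})
    mix = [0.5, 0.5]
    for _ in range(iters):
        # E-step: soft responsibility of each component for each document
        resp = []
        for b in bags:
            like = []
            for k in range(2):
                p = mix[k]
                for w, c in b.items():
                    p *= comps[k][w] ** c
                like.append(p)
            z = sum(like)
            resp.append([l / z for l in like])
        # M-step: re-estimate mixing weights and smoothed word distributions
        mix = [sum(r[k] for r in resp) / len(bags) for k in range(2)]
        for k in range(2):
            counts = {w: alpha for w in vocab}
            for b, r in zip(bags, resp):
                for w, c in b.items():
                    counts[w] += r[k] * c
            total = sum(counts.values())
            comps[k] = {w: v / total for w, v in counts.items()}
    return [max(range(2), key=lambda r=r: r[k]) for r in resp for k in [0]] if False else \
           [max(range(2), key=lambda k, r=r: r[k]) for r in resp]

docs = ["gene protein gene", "coal mine shaft",
        "protein gene expression", "mine coal seam"]
labels = em_soft_cluster(docs, seeds=(0, 1))
```

The soft (fractional) responsibilities are what let textual variants contribute partially to several clusters instead of being forced into one hard match.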

  15. Association between leukemia and genes detected using biomedical text mining tools

    Institute of Scientific and Technical Information of China (English)

    朱祥; 张云秋; 冯佳

    2015-01-01

    Five genes most closely related to leukemia were detected and identified using COREMINE Medical, and the abstracts of related papers indexed in PubMed were then analyzed with the biomedical text mining tool Chilibot; in-depth analysis of these interactions revealed the interaction relationships between leukemia and the five genes.

  16. Scalable Text Mining with Sparse Generative Models

    OpenAIRE

    Puurula, Antti

    2016-01-01

    The information age has brought a deluge of data. Much of this is in text form, insurmountable in scope for humans and incomprehensible in structure for computers. Text mining is an expanding field of research that seeks to utilize the information contained in vast document collections. General data mining methods based on machine learning face challenges with the scale of text data, posing a need for scalable text mining methods. This thesis proposes a solution to scalable text mining: gener...

  17. Automatically classifying sentences in full-text biomedical articles into Introduction, Methods, Results and Discussion

    OpenAIRE

    Agarwal, Shashank; Yu, Hong

    2009-01-01

    Biomedical texts can be typically represented by four rhetorical categories: Introduction, Methods, Results and Discussion (IMRAD). Classifying sentences into these categories can benefit many other text-mining tasks. Although many studies have applied different approaches for automatically classifying sentences in MEDLINE abstracts into the IMRAD categories, few have explored the classification of sentences that appear in full-text biomedical articles. We first evaluated whether sentences in...

  18. Text mining from ontology learning to automated text processing applications

    CERN Document Server

    Biemann, Chris

    2014-01-01

    This book comprises a set of articles that specify the methodology of text mining, describe the creation of lexical resources in the framework of text mining and use text mining for various tasks in natural language processing (NLP). The analysis of large amounts of textual data is a prerequisite to build lexical resources such as dictionaries and ontologies and also has direct applications in automated text processing in fields such as history, healthcare and mobile applications, just to name a few. This volume gives an update in terms of the recent gains in text mining methods and reflects

  19. Text Mining Perspectives in Microarray Data Mining

    OpenAIRE

    Natarajan, Jeyakumar

    2013-01-01

    Current microarray data mining methods such as clustering, classification, and association analysis heavily rely on statistical and machine learning algorithms for analysis of large sets of gene expression data. In recent years, there has been a growing interest in methods that attempt to discover patterns based on multiple but related data sources. Gene expression data and the corresponding literature data are one such example. This paper suggests a new approach to microarray data mining as ...

  20. Knowledge discovery data and text mining

    CERN Document Server

    Olmer, Petr

    2008-01-01

    Data mining and text mining refer to techniques, models, algorithms, and processes for knowledge discovery and extraction. Basic definitions are given together with the description of a standard data mining process. Common models and algorithms are presented. Attention is given to text clustering, how to convert unstructured text to structured data (vectors), and how to compute their importance and position within clusters.
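The vector conversion and within-cluster position computation mentioned here can be sketched with term-frequency vectors and Euclidean distance to the cluster centroid (example documents invented):

```python
from collections import Counter
from math import sqrt

def to_vector(text, vocab):
    """Convert unstructured text to a structured term-frequency vector."""
    tf = Counter(text.lower().split())
    return [tf[w] for w in vocab]

def centroid(vectors):
    """Component-wise mean of a set of vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def distance(a, b):
    """Euclidean distance between two dense vectors."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

docs = ["data mining methods", "text mining methods", "text data process"]
vocab = sorted({w for d in docs for w in d.lower().split()})
vecs = [to_vector(d, vocab) for d in docs]
c = centroid(vecs)
# a document's position within the cluster: its distance to the centroid
ranks = sorted(range(len(docs)), key=lambda i: distance(vecs[i], c))
```

Documents nearest the centroid are the most representative of the cluster; the farthest ones are candidates for reassignment or outlier inspection.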

  1. Working with text tools, techniques and approaches for text mining

    CERN Document Server

    Tourte, Gregory J L

    2016-01-01

    Text mining tools and technologies have long been a part of the repository world, where they have been applied to a variety of purposes, from pragmatic aims to support tools. Research areas as diverse as biology, chemistry, sociology and criminology have seen effective use made of text mining technologies. Working With Text collects a subset of the best contributions from the 'Working with text: Tools, techniques and approaches for text mining' workshop, alongside contributions from experts in the area. Text mining tools and technologies in support of academic research include supporting research on the basis of a large body of documents, facilitating access to and reuse of extant work, and bridging between the formal academic world and areas such as traditional and social media. Jisc have funded a number of projects, including NaCTem (the National Centre for Text Mining) and the ResDis programme. Contents are developed from workshop submissions and invited contributions, including: Legal considerations in te...

  2. Text Association Analysis and Ambiguity in Text Mining

    Science.gov (United States)

    Bhonde, S. B.; Paikrao, R. L.; Rahane, K. U.

    2010-11-01

    Text Mining is the process of analyzing a semantically rich document or set of documents to understand the content and meaning of the information they contain. Research in Text Mining will enhance humans' ability to process massive quantities of information, and it has high commercial value. The paper first introduces TM and its definition, and then gives an overview of the text mining process and its applications. Up to now, not much research in text mining, especially in concept/entity extraction, has focused on the ambiguity problem. This paper addresses ambiguity issues in natural language texts, and presents a new technique for resolving the ambiguity problem in extracting concepts/entities from texts. In the end, it shows the importance of TM in knowledge discovery and highlights the upcoming challenges of document mining and the opportunities it offers.
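    One classical baseline for the ambiguity problem this record raises is a simplified Lesk-style overlap heuristic: pick the sense whose gloss shares the most words with the surrounding context. The senses and glosses below are hypothetical, and this is not the technique the paper proposes.

```python
def lesk_disambiguate(context, senses):
    """Pick the sense whose gloss overlaps most with the context (simplified Lesk)."""
    ctx = set(context)
    best, best_overlap = None, -1
    for sense, gloss in senses.items():
        overlap = len(ctx & set(gloss.split()))
        if overlap > best_overlap:
            best, best_overlap = sense, overlap
    return best

# Hypothetical senses for the ambiguous token "cold"
senses = {
    "common_cold": "viral infection causing cough and fever",
    "temperature": "low temperature sensation of the weather",
}
context = "patient presented with cough and fever after infection".split()
print(lesk_disambiguate(context, senses))
```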

  3. OntoRest: Text Mining Web Services in BioC Format

    OpenAIRE

    Marques, Hernani; Rinaldi, Fabio

    2014-01-01

    In this poster we present a set of biomedical text mining web services which can be used to provide remote access to the annotation results of an advanced text mining pipeline. The pipeline is part of a system which has been tested several times in community organized text mining competitions, often achieving top-ranked results.

  4. Discover Effective Pattern for Text Mining

    OpenAIRE

    Khade, A. D.; A. B. Karche

    2014-01-01

    Many data mining techniques have been developed for finding useful patterns in documents such as text documents. However, how to effectively use and keep up to date the discovered patterns is still an open research task, especially in the domain of text mining. Text mining is the discovery of interesting knowledge (or features) in text documents. It is a challenging task to find appropriate knowledge (or features) in text documents to help users to find what they exactly want...

  5. Text mining for traditional Chinese medical knowledge discovery: a survey.

    Science.gov (United States)

    Zhou, Xuezhong; Peng, Yonghong; Liu, Baoyan

    2010-08-01

    Extracting meaningful information and knowledge from free text is the subject of considerable research interest in the machine learning and data mining fields. Text data mining (or text mining) has become one of the most active research sub-fields in data mining. Significant developments in the area of biomedical text mining during the past years have demonstrated its great promise for supporting scientists in developing novel hypotheses and new knowledge from the biomedical literature. Traditional Chinese medicine (TCM) provides a distinct methodology with which to view human life. It is one of the most complete and distinguished traditional medicines with a history of several thousand years of studying and practicing the diagnosis and treatment of human disease. It has been shown that the TCM knowledge obtained from clinical practice has become a significant complementary source of information for modern biomedical sciences. TCM literature obtained from the historical period and from modern clinical studies has recently been transformed into digital data in the form of relational databases or text documents, which provide an effective platform for information sharing and retrieval. This motivates and facilitates research and development into knowledge discovery approaches and efforts to modernize TCM. In order to contribute to this still growing field, this paper presents (1) a comparative introduction to TCM and modern biomedicine, (2) a survey of the related information sources of TCM, (3) a review and discussion of the state of the art and the development of text mining techniques with applications to TCM, (4) a discussion of the research issues around TCM text mining and its future directions.

  6. Enhancing Biomedical Text Summarization Using Semantic Relation Extraction

    OpenAIRE

    Yue Shang; Yanpeng Li; Hongfei Lin; Zhihao Yang

    2011-01-01

    Automatic text summarization for a biomedical concept can help researchers to get the key points of a certain topic from large amount of biomedical literature efficiently. In this paper, we present a method for generating text summary for a given biomedical concept, e.g., H1N1 disease, from multiple documents based on semantic relation extraction. Our approach includes three stages: 1) We extract semantic relations in each sentence using the semantic knowledge representation tool SemRep. 2) W...

  7. Text mining and its potential applications in systems biology.

    Science.gov (United States)

    Ananiadou, Sophia; Kell, Douglas B; Tsujii, Jun-ichi

    2006-12-01

    With biomedical literature increasing at a rate of several thousand papers per week, it is impossible to keep abreast of all developments; therefore, automated means to manage the information overload are required. Text mining techniques, which involve the processes of information retrieval, information extraction and data mining, provide a means of solving this. By adding meaning to text, these techniques produce a more structured analysis of textual knowledge than simple word searches, and can provide powerful tools for the production and analysis of systems biology models. PMID:17045684

  8. A Survey on Web Text Information Retrieval in Text Mining

    Directory of Open Access Journals (Sweden)

    Tapaswini Nayak

    2015-08-01

    Full Text Available In this study we have analyzed different techniques for information retrieval in text mining. The aim of the study is to identify web text information retrieval. Text mining is closely akin to analytics, a process of deriving high-quality information from text. High-quality information is typically derived through the devising of patterns and trends by means such as statistical pattern learning. Typical text mining tasks include text categorization, text clustering, concept/entity extraction, creation of coarse taxonomies, sentiment analysis, document summarization and entity relation modeling. It is used to mine hidden information from unstructured or semi-structured data. This capability is necessary because a large amount of Web information is semi-structured due to the nested structure of HTML code, and is linked and redundant. Web content categorization with a content database is the most important tool for the efficient use of search engines. A customer requesting information on a particular subject or item would otherwise have to search through hundreds of results to find the most relevant information to his query. Text mining reduces these hundreds of results, eliminating the aggravation and improving the navigation of information on the Web.

  9. Anomaly Detection with Text Mining

    Data.gov (United States)

    National Aeronautics and Space Administration — Many existing complex space systems have a significant amount of historical maintenance and problem data bases that are stored in unstructured text forms. The...

  10. Extracting laboratory test information from biomedical text

    Directory of Open Access Journals (Sweden)

    Yanna Shen Kang

    2013-01-01

    Full Text Available Background: No previous study reported the efficacy of current natural language processing (NLP) methods for extracting laboratory test information from narrative documents. This study investigates the pathology informatics question of how accurately such information can be extracted from text with the current tools and techniques, especially machine learning and symbolic NLP methods. The study data came from a text corpus maintained by the U.S. Food and Drug Administration, containing a rich set of information on laboratory tests and test devices. Methods: The authors developed a symbolic information extraction (SIE) system to extract device- and test-specific information about four types of laboratory test entities: specimens, analytes, units of measure and detection limits. They compared the performance of SIE and three prominent machine learning based NLP systems, LingPipe, GATE and BANNER, each implementing a distinct supervised machine learning method: hidden Markov models, support vector machines and conditional random fields, respectively. Results: Machine learning systems recognized laboratory test entities with moderately high recall, but low precision rates. Their recall rates were relatively higher when the number of distinct entity values (e.g., the spectrum of specimens) was very limited or when the lexical morphology of the entity was distinctive (as in units of measure), yet SIE outperformed them with statistically significant margins on extracting specimen, analyte and detection limit information in both precision and F-measure. Its high recall performance was statistically significant on analyte information extraction. Conclusions: Despite its shortcomings against machine learning methods, a well-tailored symbolic system may better discern relevancy among a pile of information of the same type and may outperform a machine learning system by tapping into lexically non-local contextual information such as the document structure.
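    The precision, recall and F-measure used above to compare the symbolic and machine learning systems can be computed directly from sets of gold and predicted entities. The entity sets below are invented for illustration, not drawn from the FDA corpus.

```python
def prf(true_entities, predicted_entities):
    """Precision, recall and F-measure for a set of extracted entities."""
    tp = len(true_entities & predicted_entities)          # true positives
    precision = tp / len(predicted_entities) if predicted_entities else 0.0
    recall = tp / len(true_entities) if true_entities else 0.0
    f = (2 * precision * recall / (precision + recall)) if precision + recall else 0.0
    return precision, recall, f

gold = {"glucose", "sodium", "creatinine", "mg/dL"}
pred = {"glucose", "sodium", "serum", "mg/dL", "patient"}
p, r, f = prf(gold, pred)
```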

  11. Text mining library for Orange data mining suite

    OpenAIRE

    Novak, David

    2016-01-01

    We have developed a text mining system that can be used as an add-on for Orange, a data mining platform. Orange envelops a set of supervised and unsupervised machine learning methods that benefit a typical text mining platform and therefore offers an excellent foundation for development. We have studied the field of text mining and reviewed several open-source toolkits to define its base components. We have included widgets that enable retrieval of data from remote repositories, such as PubMe...

  12. Preprocessing and Morphological Analysis in Text Mining

    Directory of Open Access Journals (Sweden)

    Krishna Kumar Mohbey Sachin Tiwari

    2011-12-01

    Full Text Available This paper is based on the preprocessing activities that are performed by software or language translators before applying mining algorithms to huge data. Text mining is an important area of data mining, and it plays a vital role in extracting useful information from a huge database or data warehouse. Before applying text mining or the information extraction process, preprocessing is a must, because the given data or dataset may contain noisy, incomplete, inconsistent, dirty and unformatted data. In this paper we try to collect the necessary requirements for preprocessing. Once the preprocessing task is complete, we can easily extract useful information using a mining strategy. This paper also provides information about the analysis of data, such as tokenization and stemming, and about semantic analysis, such as phrase recognition and parsing. It also describes procedures for preprocessing data, i.e., how stemming, tokenization or parsing are applied.
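    The tokenization and stemming steps described here can be sketched as follows. The suffix-stripping stemmer is deliberately crude, standing in for a real algorithm such as Porter's, and the suffix list is an illustrative assumption.

```python
import re

SUFFIXES = ("ing", "ed", "es", "s")     # crude stemmer, not a real Porter implementation

def tokenize(text):
    """Lowercase and split on non-alphanumeric characters."""
    return [t for t in re.split(r"\W+", text.lower()) if t]

def stem(token):
    """Strip the first matching suffix, keeping a stem of at least 3 characters."""
    for suf in SUFFIXES:
        if token.endswith(suf) and len(token) - len(suf) >= 3:
            return token[: -len(suf)]
    return token

raw = "Mining algorithms extracted useful patterns from texts."
tokens = [stem(t) for t in tokenize(raw)]
print(tokens)
```

    Over-stemming (e.g. "mining" becoming "min") is the classic weakness of such ad-hoc rules, which is why production pipelines use vetted stemmers or lemmatizers instead.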

  13. Text mining for the biocuration workflow

    OpenAIRE

    Hirschman, L.; Burns, G. A. P. C.; Krallinger, M.; Arighi, C.; Cohen, K. B.; Valencia, A.; Wu, C H; Chatr-aryamontri, A; Dowell, K. G.; Huala, E; Lourenco, A.; Nash, R; Veuthey, A.-L.; Wiegers, T.; Winter, A. G.

    2012-01-01

    Molecular biology has become heavily dependent on biological knowledge encoded in expert curated biological databases. As the volume of biological literature increases, biocurators need help in keeping up with the literature; (semi-) automated aids for biocuration would seem to be an ideal application for natural language processing and text mining. However, to date, there have been few documented successes for improving biocuration throughput using text mining. Our initial investigations too...

  14. Financial Statement Fraud Detection using Text Mining

    OpenAIRE

    Rajan Gupta; Nasib Singh Gill

    2013-01-01

    Data mining techniques have been used enormously by the researchers’ community in detecting financial statement fraud. Most of the research in this direction has used the numbers (quantitative information) i.e. financial ratios present in the financial statements for detecting fraud. There is very little or no research on the analysis of text such as auditor’s comments or notes present in published reports. In this study we propose a text mining approach for detecting financial statement frau...

  15. Biocuration workflows and text mining: overview of the BioCreative 2012 Workshop Track II

    OpenAIRE

    Lu, Zhiyong; Hirschman, Lynette

    2012-01-01

    Manual curation of data from the biomedical literature is a rate-limiting factor for many expert curated databases. Despite the continuing advances in biomedical text mining and the pressing needs of biocurators for better tools, few existing text-mining tools have been successfully integrated into production literature curation systems such as those used by the expert curated databases. To close this gap and better understand all aspects of literature curation, we invited submissions of writ...

  16. Text Mining for Adverse Drug Events: the Promise, Challenges, and State of the Art

    OpenAIRE

    Harpaz, Rave; Callahan, Alison; Tamang, Suzanne; Low, Yen; Odgers, David; Finlayson, Sam; Jung, Kenneth; LePendu, Paea; Shah, Nigam H.

    2014-01-01

    Text mining is the computational process of extracting meaningful information from large amounts of unstructured text. Text mining is emerging as a tool to leverage underutilized data sources that can improve pharmacovigilance, including the objective of adverse drug event detection and assessment. This article provides an overview of recent advances in pharmacovigilance driven by the application of text mining, and discusses several data sources—such as biomedical literature, clinical narrat...

  17. Financial Statement Fraud Detection using Text Mining

    Directory of Open Access Journals (Sweden)

    Rajan Gupta

    2013-01-01

    Full Text Available Data mining techniques have been used enormously by the researchers’ community in detecting financial statement fraud. Most of the research in this direction has used the numbers (quantitative information), i.e. financial ratios present in the financial statements, for detecting fraud. There is very little or no research on the analysis of text such as auditor’s comments or notes present in published reports. In this study we propose a text mining approach for detecting financial statement fraud by analyzing the hidden clues in the qualitative information (text) present in financial statements.

  18. Negation scope and spelling variation for text-mining of Danish electronic patient records

    DEFF Research Database (Denmark)

    Thomas, Cecilia Engel; Jensen, Peter Bjødstrup; Werge, Thomas;

    2014-01-01

    Electronic patient records are a potentially rich data source for knowledge extraction in biomedical research. Here we present a method based on the ICD10 system for text-mining of Danish health records. We have evaluated how adding functionalities to a baseline text-mining tool affected the over...... scopes, some spanning only a few words, while others span up to whole sentences....

  19. A Relation Extraction Framework for Biomedical Text Using Hybrid Feature Set

    Directory of Open Access Journals (Sweden)

    Abdul Wahab Muzaffar

    2015-01-01

    Full Text Available The information extraction from unstructured text segments is a complex task. Although manual information extraction often produces the best results, it is harder to manage biomedical data extraction manually because of the exponential increase in data size. Thus, there is a need for automatic tools and techniques for information extraction in biomedical text mining. Relation extraction is a significant area under biomedical information extraction that has gained much importance in the last two decades. A lot of work has been done on biomedical relation extraction focusing on rule-based and machine learning techniques. In the last decade, the focus has changed to hybrid approaches showing better results. This research presents a hybrid feature set for classification of relations between biomedical entities. The main contribution of this research is done in the semantic feature set where verb phrases are ranked using Unified Medical Language System (UMLS) and a ranking algorithm. Support Vector Machine and Naïve Bayes, the two effective machine learning techniques, are used to classify these relations. Our approach has been validated on the standard biomedical text corpus obtained from MEDLINE 2001. Conclusively, it can be articulated that our framework outperforms all state-of-the-art approaches used for relation extraction on the same corpus.

  20. Text mining and visualization using VOSviewer

    OpenAIRE

    van Eck, Nees Jan; Waltman, Ludo

    2011-01-01

    VOSviewer is a computer program for creating, visualizing, and exploring bibliometric maps of science. In this report, the new text mining functionality of VOSviewer is presented. A number of examples are given of applications in which VOSviewer is used for analyzing large amounts of text data.

  1. Enhancing biomedical text summarization using semantic relation extraction.

    Directory of Open Access Journals (Sweden)

    Yue Shang

    Full Text Available Automatic text summarization for a biomedical concept can help researchers to get the key points of a certain topic from large amount of biomedical literature efficiently. In this paper, we present a method for generating text summary for a given biomedical concept, e.g., H1N1 disease, from multiple documents based on semantic relation extraction. Our approach includes three stages: 1) We extract semantic relations in each sentence using the semantic knowledge representation tool SemRep. 2) We develop a relation-level retrieval method to select the relations most relevant to each query concept and visualize them in a graphic representation. 3) For relations in the relevant set, we extract informative sentences that can interpret them from the document collection to generate text summary using an information retrieval based method. Our major focus in this work is to investigate the contribution of semantic relation extraction to the task of biomedical text summarization. The experimental results on summarization for a set of diseases show that the introduction of semantic knowledge improves the performance and our results are better than the MEAD system, a well-known tool for text summarization.
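    The third stage, extracting informative sentences with an information-retrieval-based method, can be approximated by a simple overlap score between each sentence and the query concept's terms. This is a sketch of extractive selection only, not the SemRep-based pipeline itself; the sentences and query are invented.

```python
def summarize(sentences, query_terms, k=2):
    """Rank sentences by word overlap with the query terms; keep top-k in original order."""
    scored = []
    for i, s in enumerate(sentences):
        score = len(set(s.lower().split()) & set(query_terms))
        scored.append((score, i, s))
    top = sorted(scored, key=lambda x: (-x[0], x[1]))[:k]
    return [s for _, i, s in sorted(top, key=lambda x: x[1])]

sents = [
    "H1N1 is an influenza virus subtype.",
    "The weather was mild that year.",
    "H1N1 vaccination reduced influenza cases.",
]
summary = summarize(sents, {"h1n1", "influenza", "vaccination"})
```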

  2. Arabic Text Mining Using Rule Based Classification

    OpenAIRE

    Fadi Thabtah; Omar Gharaibeh; Rashid Al-Zubaidy

    2012-01-01

    A well-known classification problem in the domain of text mining is text classification, which concerns about mapping textual documents into one or more predefined category based on its content. Text classification arena recently attracted many researchers because of the massive amounts of online documents and text archives which hold essential information for a decision-making process. In this field, most of such researches focus on classifying English documents while there are limited studi...

  3. Semantator: semantic annotator for converting biomedical text to linked data.

    Science.gov (United States)

    Tao, Cui; Song, Dezhao; Sharma, Deepak; Chute, Christopher G

    2013-10-01

    More than 80% of biomedical data is embedded in plain text. The unstructured nature of these text-based documents makes it challenging to easily browse and query the data of interest in them. One approach to facilitate browsing and querying biomedical text is to convert the plain text to a linked web of data, i.e., converting data originally in free text to structured formats with defined meta-level semantics. In this paper, we introduce Semantator (Semantic Annotator), a semantic-web-based environment for annotating data of interest in biomedical documents, browsing and querying the annotated data, and interactively refining annotation results if needed. Through Semantator, information of interest can be either annotated manually or semi-automatically using plug-in information extraction tools. The annotated results will be stored in RDF and can be queried using the SPARQL query language. In addition, semantic reasoners can be directly applied to the annotated data for consistency checking and knowledge inference. Semantator has been released online and was used by the biomedical ontology community, who provided positive feedback. Our evaluation results indicated that (1) Semantator can perform the annotation functionalities as designed; (2) Semantator can be adopted in real applications in clinical and translational research; and (3) the annotated results using Semantator can be easily used in Semantic-web-based reasoning tools for further inference.

  4. A Relation Extraction Framework for Biomedical Text Using Hybrid Feature Set.

    Science.gov (United States)

    Muzaffar, Abdul Wahab; Azam, Farooque; Qamar, Usman

    2015-01-01

    The information extraction from unstructured text segments is a complex task. Although manual information extraction often produces the best results, it is harder to manage biomedical data extraction manually because of the exponential increase in data size. Thus, there is a need for automatic tools and techniques for information extraction in biomedical text mining. Relation extraction is a significant area under biomedical information extraction that has gained much importance in the last two decades. A lot of work has been done on biomedical relation extraction focusing on rule-based and machine learning techniques. In the last decade, the focus has changed to hybrid approaches showing better results. This research presents a hybrid feature set for classification of relations between biomedical entities. The main contribution of this research is done in the semantic feature set where verb phrases are ranked using Unified Medical Language System (UMLS) and a ranking algorithm. Support Vector Machine and Naïve Bayes, the two effective machine learning techniques, are used to classify these relations. Our approach has been validated on the standard biomedical text corpus obtained from MEDLINE 2001. Conclusively, it can be articulated that our framework outperforms all state-of-the-art approaches used for relation extraction on the same corpus.
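    One of the two classifiers this record names, Naïve Bayes, can be sketched from scratch over bag-of-words features. The training relations below are invented, and the paper's actual hybrid feature set with UMLS-based verb ranking is not reproduced.

```python
import math
from collections import Counter

class NaiveBayes:
    """Multinomial Naive Bayes with add-one smoothing over bag-of-words features."""
    def fit(self, docs, labels):
        self.classes = set(labels)
        self.prior = {c: labels.count(c) / len(labels) for c in self.classes}
        self.counts = {c: Counter() for c in self.classes}
        self.vocab = set()
        for doc, lab in zip(docs, labels):
            self.counts[lab].update(doc)
            self.vocab.update(doc)
        return self

    def predict(self, doc):
        def logp(c):
            total = sum(self.counts[c].values()) + len(self.vocab)
            return math.log(self.prior[c]) + sum(
                math.log((self.counts[c][w] + 1) / total) for w in doc)
        return max(self.classes, key=logp)

# Toy relation instances represented by their trigger words
train = [["inhibits", "protein"], ["activates", "gene"], ["inhibits", "enzyme"]]
labels = ["inhibition", "activation", "inhibition"]
nb = NaiveBayes().fit(train, labels)
print(nb.predict(["inhibits", "gene"]))
```

    In practice an SVM with richer lexical, syntactic and semantic features (as the paper describes) usually outperforms this baseline, but Naïve Bayes remains a strong and cheap starting point.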

  5. Mining Quality Phrases from Massive Text Corpora

    OpenAIRE

    Liu, Jialu; Shang, Jingbo; Wang, Chi; Ren, Xiang; Han, Jiawei

    2015-01-01

    Text data are ubiquitous and play an essential role in big data applications. However, text data are mostly unstructured. Transforming unstructured text into structured units (e.g., semantically meaningful phrases) will substantially reduce semantic ambiguity and enhance the power and efficiency at manipulating such data using database technology. Thus mining quality phrases is a critical research problem in the field of databases. In this paper, we propose a new framework that extracts quali...

  6. Full text clustering and relationship network analysis of biomedical publications.

    Directory of Open Access Journals (Sweden)

    Renchu Guan

    Full Text Available Rapid developments in the biomedical sciences have increased the demand for automatic clustering of biomedical publications. In contrast to current approaches to text clustering, which focus exclusively on the contents of abstracts, a novel method is proposed for clustering and analysis of complete biomedical article texts. To reduce dimensionality, Cosine Coefficient is used on a sub-space of only two vectors, instead of computing the Euclidean distance within the space of all vectors. Then a strategy and algorithm is introduced for Semi-supervised Affinity Propagation (SSAP) to improve analysis efficiency, using biomedical journal names as an evaluation background. Experimental results show that by avoiding high-dimensional sparse matrix computations, SSAP outperforms conventional k-means methods and improves upon the standard Affinity Propagation algorithm. In constructing a directed relationship network and distribution matrix for the clustering results, it can be noted that overlaps in scope and interests among BioMed publications can be easily identified, providing a valuable analytical tool for editors, authors and readers.
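    The Cosine Coefficient mentioned here is straightforward to compute over sparse term-weight vectors. The vectors below are illustrative; the paper's two-vector sub-space construction and the SSAP algorithm itself are not reproduced.

```python
import math

def cosine(u, v):
    """Cosine coefficient between two sparse term-weight vectors (dicts)."""
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0

a = {"gene": 2.0, "protein": 1.0}
b = {"gene": 1.0, "cell": 3.0}
print(round(cosine(a, b), 4))
```

    Because only shared terms contribute to the dot product, the dict representation avoids materializing the full high-dimensional sparse matrix, which is the efficiency point the abstract makes.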

  7. Text Mining applied to Molecular Biology

    NARCIS (Netherlands)

    R. Jelier (Rob)

    2008-01-01

    This thesis describes the development of text-mining algorithms for molecular biology, in particular for DNA microarray data analysis. Concept profiles were introduced, which characterize the context in which a gene is mentioned in literature, to retrieve functional associations

  8. New directions in biomedical text annotation: definitions, guidelines and corpus construction

    Directory of Open Access Journals (Sweden)

    Rzhetsky Andrey

    2006-07-01

    Full Text Available Abstract Background While biomedical text mining is emerging as an important research area, practical results have proven difficult to achieve. We believe that an important first step towards more accurate text-mining lies in the ability to identify and characterize text that satisfies various types of information needs. We report here the results of our inquiry into properties of scientific text that have sufficient generality to transcend the confines of a narrow subject area, while supporting practical mining of text for factual information. Our ultimate goal is to annotate a significant corpus of biomedical text and train machine learning methods to automatically categorize such text along certain dimensions that we have defined. Results We have identified five qualitative dimensions that we believe characterize a broad range of scientific sentences, and are therefore useful for supporting a general approach to text-mining: focus, polarity, certainty, evidence, and directionality. We define these dimensions and describe the guidelines we have developed for annotating text with regard to them. To examine the effectiveness of the guidelines, twelve annotators independently annotated the same set of 101 sentences that were randomly selected from current biomedical periodicals. Analysis of these annotations shows 70–80% inter-annotator agreement, suggesting that our guidelines indeed present a well-defined, executable and reproducible task. Conclusion We present our guidelines defining a text annotation task, along with annotation results from multiple independently produced annotations, demonstrating the feasibility of the task. The annotation of a very large corpus of documents along these guidelines is currently ongoing. These annotations form the basis for the categorization of text along multiple dimensions, to support viable text mining for experimental results, methodology statements, and other forms of information. We are currently
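    The 70–80% inter-annotator agreement reported above is a raw percentage; agreement studies often also report a chance-corrected statistic. A minimal Cohen's kappa for two annotators might look like this (the labels below are hypothetical, not from the annotated 101-sentence set):

```python
from collections import Counter

def cohens_kappa(ann1, ann2):
    """Cohen's kappa for two annotators' labels over the same items."""
    assert len(ann1) == len(ann2)
    n = len(ann1)
    observed = sum(a == b for a, b in zip(ann1, ann2)) / n
    c1, c2 = Counter(ann1), Counter(ann2)
    # Chance agreement from each annotator's marginal label distribution
    expected = sum(c1[k] * c2[k] for k in c1.keys() | c2.keys()) / (n * n)
    return (observed - expected) / (1 - expected)

a1 = ["certain", "certain", "uncertain", "certain", "uncertain", "certain"]
a2 = ["certain", "uncertain", "uncertain", "certain", "uncertain", "certain"]
print(round(cohens_kappa(a1, a2), 3))
```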

  9. Monitoring interaction and collective text production through text mining

    Directory of Open Access Journals (Sweden)

    Macedo, Alexandra Lorandi

    2014-04-01

    Full Text Available This article presents the Concepts Network tool, developed using text mining technology. The main objective of this tool is to extract and relate terms of greatest incidence from a text and exhibit the results in the form of a graph. The Network was implemented in the Collective Text Editor (CTE), which is an online tool that allows the production of texts in synchronized or non-synchronized forms. This article describes the application of the Network both in texts produced collectively and texts produced in a forum. The purpose of the tool is to offer support to the teacher in managing the high volume of data generated in the process of interaction amongst students and in the construction of the text. Specifically, the aim is to facilitate the teacher’s job by allowing him/her to process data in a shorter time than is currently demanded. The results suggest that the Concepts Network can aid the teacher, as it provides indicators of the quality of the text produced. Moreover, messages posted in forums can be analyzed without their content necessarily having to be pre-read.
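    The Concepts Network idea, extracting the terms of greatest incidence and relating them in a graph, can be sketched as a co-occurrence edge counter. This is a stdlib-only sketch, not the CTE implementation; the forum posts are invented.

```python
from collections import Counter
from itertools import combinations

def concept_network(texts, top_n=3):
    """Count co-occurrence edges between the most frequent terms across texts."""
    freq = Counter(t for text in texts for t in text)
    top = {t for t, _ in freq.most_common(top_n)}       # terms of greatest incidence
    edges = Counter()
    for text in texts:
        present = sorted(top & set(text))
        edges.update(combinations(present, 2))          # one edge per co-occurring pair
    return edges

posts = [
    ["mining", "text", "pattern"],
    ["text", "mining", "graph"],
    ["graph", "text"],
]
net = concept_network(posts)
print(net)
```

    The resulting edge weights can be handed to any graph library for layout, giving the teacher the kind of at-a-glance view of the discussion the abstract describes.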

  10. Imitating manual curation of text-mined facts in biomedicine.

    Directory of Open Access Journals (Sweden)

    Raul Rodriguez-Esteban

    2006-09-01

    Full Text Available Text-mining algorithms make mistakes in extracting facts from natural-language texts. In biomedical applications, which rely on use of text-mined data, it is critical to assess the quality (the probability that the message is correctly extracted) of individual facts, to resolve data conflicts and inconsistencies. Using a large set of almost 100,000 manually produced evaluations (most facts were independently reviewed more than once, producing independent evaluations), we implemented and tested a collection of algorithms that mimic human evaluation of facts provided by an automated information-extraction system. The performance of our best automated classifiers closely approached that of our human evaluators (ROC score close to 0.95). Our hypothesis is that, were we to use a larger number of human experts to evaluate any given sentence, we could implement an artificial-intelligence curator that would perform the classification job at least as accurately as an average individual human evaluator. We illustrated our analysis by visualizing the predicted accuracy of the text-mined relations involving the term cocaine.
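    The ROC score used to compare the automated classifiers with the human evaluators can be computed with the rank-based (Mann-Whitney) formulation: the probability that a randomly chosen correct extraction scores higher than an incorrect one. The labels and scores below are invented.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of positive/negative pairs ranked correctly (ties count half)."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels = [1, 1, 1, 0, 0]        # 1 = fact correctly extracted, 0 = extraction error
scores = [0.9, 0.8, 0.4, 0.5, 0.2]
auc = roc_auc(labels, scores)
print(auc)
```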

  11. Demo: Using RapidMiner for Text Mining

    OpenAIRE

    Shterev, Yordan

    2013-01-01

    In this demo the basic text mining technologies using RapidMiner are reviewed. RapidMiner's basic characteristics and text mining operators are described. A text mining example using the Naïve Bayes algorithm and process modeling is presented.

  12. Text mining applications in psychiatry: a systematic literature review.

    Science.gov (United States)

    Abbe, Adeline; Grouin, Cyril; Zweigenbaum, Pierre; Falissard, Bruno

    2016-06-01

    The expansion of biomedical literature is creating the need for efficient tools to keep pace with increasing volumes of information. Text mining (TM) approaches are becoming essential to facilitate the automated extraction of useful biomedical information from unstructured text. We reviewed the applications of TM in psychiatry, and explored its advantages and limitations. A systematic review of the literature was carried out using the CINAHL, Medline, EMBASE, PsycINFO and Cochrane databases. In this review, 1103 papers were screened, and 38 were included as applications of TM in psychiatric research. Using TM and content analysis, we identified four major areas of application: (1) Psychopathology (i.e. observational studies focusing on mental illnesses) (2) the Patient perspective (i.e. patients' thoughts and opinions), (3) Medical records (i.e. safety issues, quality of care and description of treatments), and (4) Medical literature (i.e. identification of new scientific information in the literature). The information sources were qualitative studies, Internet postings, medical records and biomedical literature. Our work demonstrates that TM can contribute to complex research tasks in psychiatry. We discuss the benefits, limits, and further applications of this tool in the future. Copyright © 2015 John Wiley & Sons, Ltd. PMID:26184780

  13. Text Data Mining: Theory and Methods

    Directory of Open Access Journals (Sweden)

    Jeffrey L. Solka

    2008-01-01

    Full Text Available This paper provides the reader with a very brief introduction to some of the theory and methods of text data mining. The intent of this article is to introduce the reader to some of the current methodologies that are employed within this discipline area while at the same time making the reader aware of some of the interesting challenges that remain to be solved within the area. Finally, the article serves as a very rudimentary tutorial on some of the techniques while also providing the reader with a list of references for additional study.

  14. Corpus annotation for mining biomedical events from literature

    Directory of Open Access Journals (Sweden)

    Tsujii Jun'ichi

    2008-01-01

    Full Text Available Abstract Background Advanced Text Mining (TM) such as semantic enrichment of papers, event or relation extraction, and intelligent Question Answering have increasingly attracted attention in the bio-medical domain. For such attempts to succeed, text annotation from the biological point of view is indispensable. However, due to the complexity of the task, semantic annotation has never been tried on a large scale, apart from relatively simple term annotation. Results We have completed a new type of semantic annotation, event annotation, which is an addition to the existing annotations in the GENIA corpus. The corpus has already been annotated with POS (Parts of Speech), syntactic trees, terms, etc. The new annotation was made on half of the GENIA corpus, consisting of 1,000 Medline abstracts. It contains 9,372 sentences in which 36,114 events are identified. The major challenges during event annotation were (1) to design a scheme of annotation which meets the specific requirements of text annotation, (2) to achieve biology-oriented annotation which reflects biologists' interpretation of text, and (3) to ensure the homogeneity of annotation quality across annotators. To meet these challenges, we introduced new concepts such as Single-facet Annotation and Semantic Typing, which have collectively contributed to the successful completion of a large scale annotation. Conclusion The resulting event-annotated corpus is the largest and one of the best in quality among similar annotation efforts. We expect it to become a valuable resource for NLP (Natural Language Processing)-based TM in the bio-medical domain.

  15. Mining Causality for Explanation Knowledge from Text

    Institute of Scientific and Technical Information of China (English)

    Chaveevan Pechsiri; Asanee Kawtrakul

    2007-01-01

    Mining causality is essential to provide a diagnosis. This research aims at extracting the causality existing within multiple sentences or EDUs (Elementary Discourse Units). The research emphasizes the use of causality verbs because they make explicit in a certain way the consequent events of a cause, e.g., "Aphids suck the sap from rice leaves. Then leaves will shrink. Later, they will become yellow and dry.". A verb can also be the causal-verb link between cause and effect within EDU(s), e.g., "Aphids suck the sap from rice leaves causing leaves to be shrunk" ("causing" is equivalent to a causal-verb link in Thai). The research confronts two main problems: identifying the interesting causality events from documents and identifying their boundaries. Then, we propose mining on verbs by using two different machine learning techniques, the Naive Bayes classifier and the Support Vector Machine. The resulting mining rules are used for the identification and extraction of causality spanning multiple EDUs in text. Our multiple-EDU extraction shows 0.88 precision with 0.75 recall for the Naive Bayes classifier and 0.89 precision with 0.76 recall for the Support Vector Machine.
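The precision and recall figures reported above follow the standard definitions over extracted versus gold-standard items; a minimal sketch (the item identifiers are invented):

```python
def precision_recall(predicted, gold):
    """Precision and recall of extracted items against a gold-standard set."""
    tp = len(predicted & gold)                    # true positives
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    return precision, recall

# Invented extraction results: 3 of 4 predictions correct, 1 gold item missed.
predicted = {"e1", "e2", "e3", "e4"}
gold = {"e1", "e2", "e3", "e5"}
print(precision_recall(predicted, gold))  # -> (0.75, 0.75)
```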

  16. Text Data Mining: Theory and Methods

    OpenAIRE

    Solka, Jeffrey L.

    2008-01-01

    This paper provides the reader with a very brief introduction to some of the theory and methods of text data mining. The intent of this article is to introduce the reader to some of the current methodologies that are employed within this discipline area while at the same time making the reader aware of some of the interesting challenges that remain to be solved within the area. Finally, the articles serves as a very rudimentary tutorial on some of techniques while also providing the reader wi...

  17. Methods for Mining and Summarizing Text Conversations

    CERN Document Server

    Carenini, Giuseppe; Murray, Gabriel

    2011-01-01

    Due to the Internet Revolution, human conversational data -- in written forms -- are accumulating at a phenomenal rate. At the same time, improvements in speech technology enable many spoken conversations to be transcribed. Individuals and organizations engage in email exchanges, face-to-face meetings, blogging, texting and other social media activities. The advances in natural language processing provide ample opportunities for these "informal documents" to be analyzed and mined, thus creating numerous new and valuable applications. This book presents a set of computational methods

  18. Semantic text mining support for lignocellulose research

    Directory of Open Access Journals (Sweden)

    Meurs Marie-Jean

    2012-04-01

    Full Text Available Abstract Background Biofuels produced from biomass are considered to be promising sustainable alternatives to fossil fuels. The conversion of lignocellulose into fermentable sugars for biofuels production requires the use of enzyme cocktails that can efficiently and economically hydrolyze lignocellulosic biomass. As many fungi naturally break down lignocellulose, the identification and characterization of the enzymes involved is a key challenge in the research and development of biomass-derived products and fuels. One approach to meeting this challenge is to mine the rapidly-expanding repertoire of microbial genomes for enzymes with the appropriate catalytic properties. Results Semantic technologies, including natural language processing, ontologies, semantic Web services and Web-based collaboration tools, promise to support users in handling complex data, thereby facilitating knowledge-intensive tasks. An ongoing challenge is to select the appropriate technologies and combine them in a coherent system that brings measurable improvements to the users. We present our ongoing development of a semantic infrastructure in support of genomics-based lignocellulose research. Part of this effort is the automated curation of knowledge from information on fungal enzymes that is available in the literature and genome resources. Conclusions Working closely with fungal biology researchers who manually curate the existing literature, we developed ontological natural language processing pipelines integrated in a Web-based interface to assist them in two main tasks: mining the literature for relevant knowledge, and at the same time providing rich and semantically linked information.

  19. Unsupervised text mining for assessing and augmenting GWAS results.

    Science.gov (United States)

    Ailem, Melissa; Role, François; Nadif, Mohamed; Demenais, Florence

    2016-04-01

    Text mining can assist in the analysis and interpretation of large-scale biomedical data, helping biologists to quickly and cheaply gain confirmation of hypothesized relationships between biological entities. We set this question in the context of genome-wide association studies (GWAS), an actively emerging field that has contributed to the identification of many genes associated with multifactorial diseases. These studies allow the identification of groups of genes associated with the same phenotype, but provide no information about the relationships between these genes. Therefore, our objective is to leverage unsupervised text mining techniques, using text-based cosine similarity comparisons and clustering applied to candidate and random gene vectors, in order to augment the GWAS results. We propose a generic framework which we used to characterize the relationships between 10 genes reported as associated with asthma by a previous GWAS. The results of this experiment showed that the similarities between these 10 genes were significantly stronger than would be expected by chance (one-sided p-value<0.01). The clustering of observed and randomly selected genes also allowed us to generate hypotheses about potential functional relationships between these genes and thus contributed to the discovery of new candidate genes for asthma. PMID:26911523
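The comparison of candidate gene vectors against random gene vectors described above can be sketched as a permutation test over cosine similarities. This is a generic outline under my own assumptions, not the authors' code, and the toy vectors are invented:

```python
import math
import random

def cosine(u, v):
    """Cosine similarity between two term-weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def mean_pairwise_similarity(vectors):
    """Average cosine similarity over all pairs in a gene set."""
    sims = [cosine(vectors[i], vectors[j])
            for i in range(len(vectors)) for j in range(i + 1, len(vectors))]
    return sum(sims) / len(sims)

def empirical_p(candidate_vecs, pool, n_perm=1000, seed=0):
    """One-sided empirical p-value: fraction of random gene sets of the same
    size whose mean pairwise similarity is at least the observed one."""
    rng = random.Random(seed)
    observed = mean_pairwise_similarity(candidate_vecs)
    k = len(candidate_vecs)
    hits = sum(mean_pairwise_similarity(rng.sample(pool, k)) >= observed
               for _ in range(n_perm))
    return (hits + 1) / (n_perm + 1)

# Two identical toy "gene vectors" are maximally similar.
print(mean_pairwise_similarity([[1, 0, 1], [1, 0, 1]]))  # -> 1.0
```

The `+1` in numerator and denominator is the usual correction that keeps an empirical p-value strictly positive.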

  20. Data, text and web mining for business intelligence: a survey

    OpenAIRE

    Abdul-Aziz Rashid Al-Azmi

    2013-01-01

    The Information and Communication Technologies revolution brought a digital world with huge amounts of data available. Enterprises use mining technologies to search vast amounts of data for vital insight and knowledge. Mining tools such as data mining, text mining, and web mining are used to find hidden knowledge in large databases or the Internet. Mining tools are automated software tools used to achieve business intelligence by finding hidden relations, and predicting future eve...

  1. Spectral signature verification using statistical analysis and text mining

    Science.gov (United States)

    DeCoster, Mallory E.; Firpi, Alexe H.; Jacobs, Samantha K.; Cone, Shelli R.; Tzeng, Nigel H.; Rodriguez, Benjamin M.

    2016-05-01

    In the spectral science community, numerous spectral signatures are stored in databases representative of many sample materials collected from a variety of spectrometers and spectroscopists. Due to the variety and variability of the spectra that comprise many spectral databases, it is necessary to establish a metric for validating the quality of spectral signatures. This has been an area of great discussion and debate in the spectral science community. This paper discusses a method that independently validates two different aspects of a spectral signature to arrive at a final qualitative assessment: the textual meta-data and the numerical spectral data. Results associated with the spectral data stored in the Signature Database1 (SigDB) are proposed. The numerical data comprising a sample material's spectrum is validated based on statistical properties derived from an ideal population set. The quality of the test spectrum is ranked based on a spectral angle mapper (SAM) comparison to the mean spectrum derived from the population set. Additionally, the contextual data of a test spectrum is qualitatively analyzed using lexical-analysis text mining. This technique analyzes the syntax of the meta-data to provide local learning patterns and trends within the spectral data, indicative of the test spectrum's quality. Text mining applications have successfully been implemented for security2 (text encryption/decryption), biomedical3, and marketing4 applications. The text mining lexical analysis algorithm is trained on the meta-data patterns of a subset of high- and low-quality spectra, in order to have a model to apply to the entire SigDB data set. The statistical and textual methods combine to assess the quality of a test spectrum existing in a database without the need of an expert user. This method has been compared to other validation methods accepted by the spectral science community, and has provided promising results when a baseline spectral signature is
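The spectral angle mapper (SAM) comparison mentioned above reduces to the angle between two spectra treated as vectors; a minimal sketch, with invented toy spectra:

```python
import math

def spectral_angle(test, reference):
    """Spectral angle (radians) between a test spectrum and a reference
    (e.g. the mean spectrum of an ideal population set)."""
    dot = sum(a * b for a, b in zip(test, reference))
    na = math.sqrt(sum(a * a for a in test))
    nb = math.sqrt(sum(b * b for b in reference))
    # Clamp against floating-point drift outside [-1, 1] before acos.
    return math.acos(max(-1.0, min(1.0, dot / (na * nb))))

# A spectrum scaled by a constant has the same shape: angle 0.
print(round(spectral_angle([1, 2, 3], [2, 4, 6]), 4))  # -> 0.0
```

Because SAM depends only on the angle, it is insensitive to overall illumination or gain differences, which is why it suits cross-instrument comparisons.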

  2. TEXT MINING – PREREQUISITE FOR KNOWLEDGE MANAGEMENT SYSTEMS

    OpenAIRE

    Dragoş Marcel VESPAN

    2009-01-01

    Text mining is an interdisciplinary field with the main purpose of retrieving new knowledge from large collections of text documents. This paper presents the main techniques used for knowledge extraction through text mining and their main areas of applicability and emphasizes the importance of text mining in knowledge management systems.

  3. Text Mining the History of Medicine.

    Directory of Open Access Journals (Sweden)

    Paul Thompson

    Full Text Available Historical text archives constitute a rich and diverse source of information, which is becoming increasingly readily accessible, due to large-scale digitisation efforts. However, it can be difficult for researchers to explore and search such large volumes of data in an efficient manner. Text mining (TM) methods can help, through their ability to recognise various types of semantic information automatically, e.g., instances of concepts (places, medical conditions, drugs, etc.), synonyms/variant forms of concepts, and relationships holding between concepts (which drugs are used to treat which medical conditions, etc.). TM analysis allows search systems to incorporate functionality such as automatic suggestions of synonyms of user-entered query terms, exploration of different concepts mentioned within search results or isolation of documents in which concepts are related in specific ways. However, applying TM methods to historical text can be challenging, according to differences and evolutions in vocabulary, terminology, language structure and style, compared to more modern text. In this article, we present our efforts to overcome the various challenges faced in the semantic analysis of published historical medical text dating back to the mid 19th century. Firstly, we used evidence from diverse historical medical documents from different periods to develop new resources that provide accounts of the multiple, evolving ways in which concepts, their variants and relationships amongst them may be expressed. These resources were employed to support the development of a modular processing pipeline of TM tools for the robust detection of semantic information in historical medical documents with varying characteristics. We applied the pipeline to two large-scale medical document archives covering wide temporal ranges as the basis for the development of a publicly accessible semantically-oriented search system. 
The novel resources are available for research

  4. Techniques, Applications and Challenging Issue in Text Mining

    Directory of Open Access Journals (Sweden)

    Shaidah Jusoh

    2012-11-01

    Full Text Available Text mining is a very exciting research area as it tries to discover knowledge from unstructured texts. These texts can be found on a desktop, intranets and the internet. The aim of this paper is to give an overview of text mining in the contexts of its techniques, application domains and the most challenging issue. The focus is on fundamental methods of text mining, which include natural language processing and information extraction. This paper also gives a short review of domains which have employed text mining. The challenging issue in text mining, which is caused by the complexity of natural language, is also addressed in this paper.

  5. Text Mining the History of Medicine.

    Science.gov (United States)

    Thompson, Paul; Batista-Navarro, Riza Theresa; Kontonatsios, Georgios; Carter, Jacob; Toon, Elizabeth; McNaught, John; Timmermann, Carsten; Worboys, Michael; Ananiadou, Sophia

    2016-01-01

    Historical text archives constitute a rich and diverse source of information, which is becoming increasingly readily accessible, due to large-scale digitisation efforts. However, it can be difficult for researchers to explore and search such large volumes of data in an efficient manner. Text mining (TM) methods can help, through their ability to recognise various types of semantic information automatically, e.g., instances of concepts (places, medical conditions, drugs, etc.), synonyms/variant forms of concepts, and relationships holding between concepts (which drugs are used to treat which medical conditions, etc.). TM analysis allows search systems to incorporate functionality such as automatic suggestions of synonyms of user-entered query terms, exploration of different concepts mentioned within search results or isolation of documents in which concepts are related in specific ways. However, applying TM methods to historical text can be challenging, according to differences and evolutions in vocabulary, terminology, language structure and style, compared to more modern text. In this article, we present our efforts to overcome the various challenges faced in the semantic analysis of published historical medical text dating back to the mid 19th century. Firstly, we used evidence from diverse historical medical documents from different periods to develop new resources that provide accounts of the multiple, evolving ways in which concepts, their variants and relationships amongst them may be expressed. These resources were employed to support the development of a modular processing pipeline of TM tools for the robust detection of semantic information in historical medical documents with varying characteristics. We applied the pipeline to two large-scale medical document archives covering wide temporal ranges as the basis for the development of a publicly accessible semantically-oriented search system. The novel resources are available for research purposes, while

  7. BioLemmatizer: a lemmatization tool for morphological processing of biomedical text

    Directory of Open Access Journals (Sweden)

    Liu Haibin

    2012-04-01

    Full Text Available Abstract Background The wide variety of morphological variants of domain-specific technical terms contributes to the complexity of performing natural language processing of the scientific literature related to molecular biology. For morphological analysis of these texts, lemmatization has been actively applied in recent biomedical research. Results In this work, we developed a domain-specific lemmatization tool, BioLemmatizer, for the morphological analysis of biomedical literature. The tool focuses on the inflectional morphology of English and is based on the general English lemmatization tool MorphAdorner. The BioLemmatizer is further tailored to the biological domain through incorporation of several published lexical resources. It retrieves lemmas based on the use of a word lexicon, and defines a set of rules that transform a word to a lemma if it is not encountered in the lexicon. An innovative aspect of the BioLemmatizer is the use of a hierarchical strategy for searching the lexicon, which enables the discovery of the correct lemma even if the input Part-of-Speech information is inaccurate. The BioLemmatizer achieves an accuracy of 97.5% in lemmatizing an evaluation set prepared from the CRAFT corpus, a collection of full-text biomedical articles, and an accuracy of 97.6% on the LLL05 corpus. The contribution of the BioLemmatizer to accuracy improvement of a practical information extraction task is further demonstrated when it is used as a component in a biomedical text mining system. Conclusions The BioLemmatizer outperforms other tools when compared with eight existing lemmatizers. The BioLemmatizer is released as open source software and can be downloaded from http://biolemmatizer.sourceforge.net.
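The lexicon-first strategy with rule-based fallback described above can be sketched as follows. The lexicon entries and suffix rules here are invented toy examples, not BioLemmatizer's actual resources:

```python
# Toy lexicon mapping inflected forms to lemmas (invented; real coverage
# would come from the published lexical resources the tool incorporates).
LEXICON = {"mice": "mouse", "analyses": "analysis", "bound": "bind"}
SUFFIX_RULES = [("ies", "y"), ("es", "e"), ("s", "")]  # tried in order

def lemmatize(word):
    """Lexicon lookup first; fall back to suffix-transformation rules."""
    w = word.lower()
    if w in LEXICON:
        return LEXICON[w]
    for suffix, replacement in SUFFIX_RULES:
        if w.endswith(suffix) and len(w) > len(suffix) + 1:
            return w[: -len(suffix)] + replacement
    return w

print(lemmatize("mice"), lemmatize("kinases"), lemmatize("genes"))
# -> mouse kinase gene
```

The lexicon catches irregular forms the rules would mangle ("mice" -> "mouse"), which is why lookup runs before the rules.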

  8. A comparison study on algorithms of detecting long forms for short forms in biomedical text

    Directory of Open Access Journals (Sweden)

    Wu Cathy H

    2007-11-01

    Full Text Available Abstract Motivation With more and more research dedicated to literature mining in the biomedical domain, more and more systems are available for people to choose from when building literature mining applications. In this study, we focus on one specific kind of literature mining task, i.e., detecting definitions of acronyms, abbreviations, and symbols in biomedical text. We denote acronyms, abbreviations, and symbols as short forms (SFs) and their corresponding definitions as long forms (LFs). The study was designed to answer the following questions: (i) how well a system performs in detecting LFs from novel text, (ii) what the coverage is for various terminological knowledge bases in including SFs as synonyms of their LFs, and (iii) how to combine results from various SF knowledge bases. Method We evaluated the following three publicly available detection systems in detecting LFs for SFs: (i) a handcrafted pattern/rule-based system by Ao and Takagi, ALICE, (ii) a machine learning system by Chang et al., and (iii) a simple alignment-based program by Schwartz and Hearst. In addition, we investigated the conceptual coverage of two terminological knowledge bases: (i) the UMLS (the Unified Medical Language System), and (ii) the BioThesaurus (a thesaurus of names for all UniProt protein records). We also implemented a web interface that provides a virtual integration of various SF knowledge bases. Results We found that the detection systems agree with each other in most cases, and the existing terminological knowledge bases have a good coverage of synonymous relationships for frequently defined LFs. The web interface allows people to detect SF definitions from text and to search several SF knowledge bases. Availability The web site is http://gauss.dbb.georgetown.edu/liblab/SFThesaurus.
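The alignment-based approach of Schwartz and Hearst matches the characters of a short form right-to-left against the text preceding its parenthesised mention; a sketch in that style (my own simplification, not the original implementation):

```python
def find_long_form(short_form, candidate):
    """Match short-form characters right-to-left against the candidate text;
    the first character must begin a word of the long form."""
    s_idx = len(short_form) - 1
    c_idx = len(candidate) - 1
    while s_idx >= 0:
        ch = short_form[s_idx].lower()
        if not ch.isalnum():        # skip punctuation in the short form
            s_idx -= 1
            continue
        while c_idx >= 0 and (candidate[c_idx].lower() != ch or
                              (s_idx == 0 and c_idx > 0
                               and candidate[c_idx - 1].isalnum())):
            c_idx -= 1
        if c_idx < 0:
            return None             # no long form found
        c_idx -= 1
        s_idx -= 1
    return candidate[c_idx + 1:].strip()

print(find_long_form("UMLS", "the Unified Medical Language System"))
# -> Unified Medical Language System
```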

  9. Techniques, Applications and Challenging Issue in Text Mining

    OpenAIRE

    Shaidah Jusoh; Hejab M. Alfawareh

    2012-01-01

    Text mining is a very exciting research area as it tries to discover knowledge from unstructured texts. These texts can be found on a desktop, intranets and the internet. The aim of this paper is to give an overview of text mining in the contexts of its techniques, application domains and the most challenging issue. The focus is on fundamental methods of text mining, which include natural language processing and information extraction. This paper also gives a short review on domains whi...

  10. Research on Text Mining Based on Domain Ontology

    OpenAIRE

    Li-hua, Jiang; Neng-fu, Xie; Hong-bin, Zhang

    2013-01-01

    This paper improves on traditional text mining technology, which cannot understand text semantics. The author discusses text mining methods based on ontology and puts forward a text mining model based on domain ontology. The ontology structure is built first and a "concept-concept" similarity matrix is introduced; then a concept vector space model based on domain ontology is used in place of the traditional vector space model to represent documents, in order to realize text m...
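One way to read the "concept-concept" similarity matrix above is as a generalised cosine in which related concepts contribute partial matches. This sketch is my assumption about how such a model could work, not the paper's actual formulation, and the matrix values are invented:

```python
def ontology_similarity(u, v, S):
    """Generalised cosine: u and v are concept-weight vectors, S is a
    concept-concept similarity matrix (S = identity gives plain cosine)."""
    def quad(a, b):
        return sum(a[i] * S[i][j] * b[j]
                   for i in range(len(a)) for j in range(len(b)))
    return quad(u, v) / (quad(u, u) ** 0.5 * quad(v, v) ** 0.5)

# Concepts: 0 = "tumour", 1 = "neoplasm", 2 = "kinase"; the 0.9 entries say
# the ontology treats the first two as near-synonyms (invented values).
S = [[1.0, 0.9, 0.0],
     [0.9, 1.0, 0.0],
     [0.0, 0.0, 1.0]]
d1 = [1, 0, 0]   # document mentioning only "tumour"
d2 = [0, 1, 0]   # document mentioning only "neoplasm"
print(round(ontology_similarity(d1, d2, S), 2))  # -> 0.9 (plain cosine: 0.0)
```

The point of the example: two documents sharing no terms still score highly when the ontology says their terms denote related concepts.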

  11. Data, Text and Web Mining for Business Intelligence : A Survey

    Directory of Open Access Journals (Sweden)

    Abdul-Aziz Rashid Al-Azmi

    2013-04-01

    Full Text Available The Information and Communication Technologies revolution brought a digital world with huge amounts of data available. Enterprises use mining technologies to search vast amounts of data for vital insight and knowledge. Mining tools such as data mining, text mining, and web mining are used to find hidden knowledge in large databases or the Internet. Mining tools are automated software tools used to achieve business intelligence by finding hidden relations, and predicting future events from vast amounts of data. This uncovered knowledge helps in gaining competitive advantages, better customer relationships, and even fraud detection. In this survey, we describe how these techniques work and how they are implemented. Furthermore, we discuss how business intelligence is achieved using these mining tools, then look into some case studies of success stories using mining tools. Finally, we demonstrate some of the main challenges to the mining technologies that limit their potential.

  12. DATA, TEXT, AND WEB MINING FOR BUSINESS INTELLIGENCE: A SURVEY

    Directory of Open Access Journals (Sweden)

    Abdul-Aziz Rashid

    2013-03-01

    Full Text Available The Information and Communication Technologies revolution brought a digital world with huge amounts of data available. Enterprises use mining technologies to search vast amounts of data for vital insight and knowledge. Mining tools such as data mining, text mining, and web mining are used to find hidden knowledge in large databases or the Internet. Mining tools are automated software tools used to achieve business intelligence by finding hidden relations, and predicting future events from vast amounts of data. This uncovered knowledge helps in gaining competitive advantages, better customer relationships, and even fraud detection. In this survey, we describe how these techniques work and how they are implemented. Furthermore, we discuss how business intelligence is achieved using these mining tools, then look into some case studies of success stories using mining tools. Finally, we demonstrate some of the main challenges to the mining technologies that limit their potential.

  13. Text Mining Approaches To Extract Interesting Association Rules from Text Documents

    Directory of Open Access Journals (Sweden)

    Vishwadeepak Singh Baghela

    2012-05-01

    Full Text Available A handful of text data mining approaches are available to extract much potential information and many associations from large amounts of text data. The term data mining is used for methods that analyze data with the objective of finding rules and patterns describing the characteristic properties of the data. The 'mined' information is typically represented as a model of the semantic structure of the dataset, where the model may be used on new data for prediction or classification. In general, data mining deals with structured data (for example, relational databases), whereas text presents special characteristics and is unstructured. Unstructured data is totally different from databases, where mining techniques are usually applied and structured data is managed. Text mining can work with unstructured or semi-structured data sets. A brief review of some recent research related to mining associations from text documents is presented in this paper.
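Association rules over text documents are conventionally scored by support and confidence at the level of document co-occurrence; a minimal sketch with invented toy documents (the thresholds and terms are illustrative, not from the paper):

```python
from itertools import combinations

def association_rules(docs, min_support=0.4, min_confidence=0.6):
    """Mine term-pair rules A -> B from document-level co-occurrence."""
    n = len(docs)
    term_docs = {}                       # term -> set of doc indices
    for i, doc in enumerate(docs):
        for term in set(doc):
            term_docs.setdefault(term, set()).add(i)
    rules = []
    for a, b in combinations(sorted(term_docs), 2):
        both = term_docs[a] & term_docs[b]
        support = len(both) / n          # fraction of docs with both terms
        if support < min_support:
            continue
        for x, y in ((a, b), (b, a)):
            confidence = len(both) / len(term_docs[x])
            if confidence >= min_confidence:
                rules.append((x, y, round(support, 2), round(confidence, 2)))
    return rules

docs = [["gene", "protein"], ["gene", "protein", "cell"],
        ["protein", "cell"], ["gene", "protein"]]
print(association_rules(docs))
```

Support filters out rare pairings; confidence then measures how reliably the antecedent term implies the consequent.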

  14. A text mining framework in R and its applications

    OpenAIRE

    Feinerer, Ingo

    2008-01-01

    Text mining has become an established discipline both in research and in business intelligence. However, many existing text mining toolkits lack easy extensibility and provide only poor support for interacting with statistical computing environments. Therefore we propose a text mining framework for the statistical computing environment R which provides intelligent methods for corpora handling, meta data management, preprocessing, operations on documents, and data export. We present how well es...

  15. Using Dependency Parses to Augment Feature Construction for Text Mining

    OpenAIRE

    Guo, Sheng

    2012-01-01

    With the prevalence of large data stored in the cloud, including unstructured information in the form of text, there is now an increased emphasis on text mining. A broad range of techniques are now used for text mining, including algorithms adapted from machine learning, NLP, computational linguistics, and data mining. Applications are also multi-fold, including classification, clustering, segmentation, relationship discovery, and practically any task that discovers latent information from wr...

  16. Regulatory relations represented in logics and biomedical texts

    DEFF Research Database (Denmark)

    Zambach, Sine

    is the biomedical semantics of regulates relations, i.e. positively regulates, negatively regulates and regulates, of which the last is assumed to be a super relation of the first two. This thesis discusses an initial framework for knowledge representation based on logics, carries out a corpus analysis on the verbs

  17. Regulatory relations represented in logics and biomedical texts

    DEFF Research Database (Denmark)

    Zambach, Sine

    biomedical semantics of regulates relations, i.e. positively regulates, negatively regulates and regulates, of which the last is assumed to be a super relation of the first two. This thesis discusses an initial framework for knowledge representation based on logics, carries out a corpus analysis on the verbs

  18. Redundancy in electronic health record corpora: analysis, impact on text mining performance and mitigation strategies

    OpenAIRE

    Cohen, Raphael; Elhadad, Michael; Elhadad, Noémie

    2013-01-01

    Background The increasing availability of Electronic Health Record (EHR) data and specifically free-text patient notes presents opportunities for phenotype extraction. Text-mining methods in particular can help disease modeling by mapping named-entities mentions to terminologies and clustering semantically related terms. EHR corpora, however, exhibit specific statistical and linguistic characteristics when compared with corpora in the biomedical literature domain. We focus on copy-and-paste r...

  19. Analysing Customer Opinions with Text Mining Algorithms

    Science.gov (United States)

    Consoli, Domenico

    2009-08-01

    Knowing what the customer thinks of a particular product/service helps top management to introduce improvements in processes and products, thus differentiating the company from its competitors and gaining competitive advantages. The customers, with their preferences, determine the success or failure of a company. In order to know the opinions of customers, we can use technologies available from Web 2.0 (blogs, wikis, forums, chat, social networking, social commerce). From these web sites, useful information must be extracted, for strategic purposes, using techniques of sentiment analysis or opinion mining.

  20. Event-based text mining for biology and functional genomics.

    Science.gov (United States)

    Ananiadou, Sophia; Thompson, Paul; Nawaz, Raheel; McNaught, John; Kell, Douglas B

    2015-05-01

    The assessment of genome function requires a mapping between genome-derived entities and biochemical reactions, and the biomedical literature represents a rich source of information about reactions between biological components. However, the increasingly rapid growth in the volume of literature provides both a challenge and an opportunity for researchers to isolate information about reactions of interest in a timely and efficient manner. In response, recent text mining research in the biology domain has been largely focused on the identification and extraction of 'events', i.e. categorised, structured representations of relationships between biochemical entities, from the literature. Functional genomics analyses necessarily encompass events as so defined. Automatic event extraction systems facilitate the development of sophisticated semantic search applications, allowing researchers to formulate structured queries over extracted events, so as to specify the exact types of reactions to be retrieved. This article provides an overview of recent research into event extraction. We cover annotated corpora on which systems are trained, systems that achieve state-of-the-art performance and details of the community shared tasks that have been instrumental in increasing the quality, coverage and scalability of recent systems. Finally, several concrete applications of event extraction are covered, together with emerging directions of research.

  1. VIRTUAL MINING MODEL FOR CLASSIFYING TEXT USING UNSUPERVISED LEARNING

    Directory of Open Access Journals (Sweden)

    S. Koteeswaran

    2014-01-01

    Full Text Available In the real world, data mining is emerging in various areas; some of its most notable results are found in research fields such as big data, multimedia mining, and text mining. Researchers demonstrate their contributions, with substantial improvements over prior proposals, by means of mathematical representation. Approaches to a problem can be classified into mathematical and implementation models. A mathematical model consists of straightforward rules and formulas related to the problem definition of a particular domain. An implementation model, in contrast, derives knowledge from real-time decision-making behaviour, such as artificial intelligence and swarm intelligence, and has a more complex set of rules than the mathematical model. The implementation model mines and derives a knowledge model from a collection of datasets and attributes, and this knowledge is applied to the problem at hand. The objective of our work is to efficiently mine knowledge from unstructured text documents; to mine textual documents, text mining, a sub-domain of data mining, is applied. In this work, the proposed Virtual Mining Model (VMM) is defined for effective text clustering. The VMM involves learning conceptual terms, which are grouped in a Significant Term List (STL). The VMM is an appropriate combination of a layer-1 arch with ABI (Analysis of Bilateral Intelligence). Frequent updating of the conceptual terms in the STL is important for effective clustering. As the results show, an artificial neural network-based unsupervised learning algorithm is used for learning textual patterns in the Virtual Mining Model; for learning such terminologies, this paper proposes an artificial neural network-based learning algorithm.

  2. Mining knowledge from text repositories using information extraction: A review

    Indian Academy of Sciences (India)

    Sandeep R Sirsat; Dr Vinay Chavan; Dr Shrinivas P Deshpande

    2014-02-01

    There are two approaches to mining text from online repositories. First, when the knowledge to be discovered is expressed directly in the documents to be mined, Information Extraction (IE) alone can serve as an effective tool for such text mining. Second, when the documents contain concrete data in unstructured form rather than abstract knowledge, IE can be used first to transform the unstructured data in the document corpus into a structured database, and state-of-the-art data mining algorithms/tools can then identify abstract patterns in this extracted data. This paper presents a review of several methods related to these two approaches.
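
    The second approach can be illustrated with a minimal sketch: a hand-written regular expression stands in for a real IE system (the pattern, drug names and relation below are invented for illustration) and turns free-text sentences into structured records ready for a downstream mining step.

```python
import re

# Toy illustration of the second approach: a hand-written extraction
# pattern (a stand-in for a real IE system) turns free-text sentences
# into structured records that downstream data mining tools can consume.
PATTERN = re.compile(r"(?P<drug>\w+) inhibits (?P<target>[\w-]+)")

def extract_records(sentences):
    """Return a list of {drug, target} dicts found in the sentences."""
    records = []
    for s in sentences:
        for m in PATTERN.finditer(s):
            records.append({"drug": m.group("drug"),
                            "target": m.group("target")})
    return records

corpus = [
    "Aspirin inhibits COX1 in platelets.",
    "Imatinib inhibits BCR-ABL kinase activity.",
]
structured = extract_records(corpus)
# structured == [{'drug': 'Aspirin', 'target': 'COX1'},
#                {'drug': 'Imatinib', 'target': 'BCR-ABL'}]
```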

  3. Text Mining Approaches To Extract Interesting Association Rules from Text Documents

    OpenAIRE

    Vishwadeepak Singh Baghela; S. P. Tripathi

    2012-01-01

    A handful of text data mining approaches are available to extract potential information and associations from large amounts of text data. The term data mining is used for methods that analyze data with the objective of finding rules and patterns describing the characteristic properties of the data. The mined information is typically represented as a model of the semantic structure of the dataset, where the model may be used on new data for prediction or classification. In general, data mi...

  4. MINING TEXTS TO UNDERSTAND CUSTOMERS' IMAGE OF BRANDS

    Directory of Open Access Journals (Sweden)

    Hyung Jun Ahn

    2013-06-01

    Full Text Available Text mining is becoming increasingly important in understanding customers and markets these days. This paper presents a method of mining texts about customer sentiments using a network analysis technique. A data set collected about two global mobile device manufacturers was used for testing the method. The analysis results show that the method can be effectively used to extract key sentiments from customers' texts.

  5. E3Miner: a text mining tool for ubiquitin-protein ligases

    OpenAIRE

    Lee, Hodong; Yi, Gwan-Su; Park, Jong C.

    2008-01-01

    Ubiquitination is a regulatory process critically involved in the degradation of >80% of cellular proteins, where such proteins are specifically recognized by a key enzyme, or a ubiquitin-protein ligase (E3). Because of this important role of E3s, a rapidly growing body of the published literature in biology and biomedical fields reports novel findings about various E3s and their molecular mechanisms. However, such findings are neither adequately retrieved by general text-mining tools nor sys...

  6. Cultural text mining: using text mining to map the emergence of transnational reference cultures in public media repositories

    NARCIS (Netherlands)

    Pieters, Toine; Verheul, Jaap

    2014-01-01

    This paper discusses the research project Translantis, which uses innovative technologies for cultural text mining to analyze large repositories of digitized public media, such as newspapers and journals.1 The Translantis research team uses and develops the text mining tool Texcavator, which is based on the scalable open source text analysis service xTAS.

  7. VIRTUAL MINING MODEL FOR CLASSIFYING TEXT USING UNSUPERVISED LEARNING

    OpenAIRE

    S. Koteeswaran; E. Kannan; P. Visu

    2014-01-01

    In real world data mining is emerging in various era, one of its most outstanding performance is held in various research such as Big data, multimedia mining, text mining etc. Each of the researcher proves their contribution with tremendous improvements in their proposal by means of mathematical representation. Empowering each problem with solutions are classified into mathematical and implementation models. The mathematical model relates to the straight forward rules and formulas that are re...

  8. Cultural text mining: using text mining to map the emergence of transnational reference cultures in public media repositories

    OpenAIRE

    Pieters, Toine; Verheul, Jaap

    2014-01-01

    This paper discusses the research project Translantis, which uses innovative technologies for cultural text mining to analyze large repositories of digitized public media, such as newspapers and journals.1 The Translantis research team uses and develops the text mining tool Texcavator, which is based on the scalable open source text analysis service xTAS (developed by the Intelligent Systems Lab Amsterdam). The text analysis service xTAS has been used successfully in computational humanities ...

  9. Text mining of web-based medical content

    CERN Document Server

    Neustein, Amy

    2014-01-01

    Text Mining of Web-Based Medical Content examines web mining for extracting useful information that can be used for treating and monitoring the healthcare of patients. This work provides methodological approaches to designing mapping tools that exploit data found in social media postings. Specific linguistic features of medical postings are analyzed vis-a-vis available data extraction tools for culling useful information.

  10. Mining biomedical data using MetaMap Transfer (MMtx) and the Unified Medical Language System (UMLS).

    Science.gov (United States)

    Osborne, John D; Lin, Simon; Zhu, Lihua; Kibbe, Warren A

    2007-01-01

    Detailed instructions are given for mapping unstructured, free-text data onto common biomedical concepts (drugs, diseases, anatomy, and so on) found in the Unified Medical Language System using MetaMap Transfer (MMTx). MMTx can be used in applications including mining and inferring relationships between concepts in MEDLINE publications by transforming free text into computable concepts. MMTx is generally not designed to be an end-user program; therefore, a simple analysis using MMTx is described for users without any programming knowledge. In addition, two Java template files are provided for automated processing of the MMTx output, which users can adopt with minimal Java programming experience. PMID:18314582
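
    The concept-mapping step that MMTx performs can be approximated, very roughly, by dictionary lookup. The sketch below is not the MMTx API; the vocabulary and its concept identifiers are hard-coded stand-ins for a UMLS subset.

```python
# A minimal sketch of dictionary-based concept mapping in the spirit of
# MetaMap: the vocabulary below is invented for illustration, not UMLS.
VOCAB = {
    "myocardial infarction": "C0027051",
    "aspirin": "C0004057",
    "heart": "C0018787",
}

def map_concepts(text):
    """Greedy longest-phrase-first lookup of vocabulary entries in text."""
    text = text.lower()
    found = []
    for phrase in sorted(VOCAB, key=len, reverse=True):
        if phrase in text:
            found.append((phrase, VOCAB[phrase]))
    return found

hits = map_concepts("Aspirin reduced mortality after myocardial infarction.")
# hits == [('myocardial infarction', 'C0027051'), ('aspirin', 'C0004057')]
```

    Real concept mapping additionally handles tokenization variants, word order, and ambiguity, which is why MMTx is far more involved than a substring scan.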

  11. Introduction to Text Mining with R for Information Professionals

    Directory of Open Access Journals (Sweden)

    Monica Maceli

    2016-07-01

    Full Text Available The 'tm: Text Mining Package' in the open source statistical software R has made text analysis techniques easily accessible to both novice and expert practitioners, providing useful ways of analyzing and understanding large, unstructured datasets. Such an approach can yield many benefits to information professionals, particularly those involved in text-heavy research projects. This article discusses the functionality and possibilities of text mining, as well as the basic setup necessary for novice R users to employ the RStudio integrated development environment (IDE). Common use cases, such as analyzing a corpus of text documents or spreadsheet text data, are covered, as well as text mining tools for calculating term frequency and term correlations, clustering, creating wordclouds, and plotting.
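
    The term-frequency calculation described for R's tm package can be sketched in Python as a plain token count over a corpus (the two-document corpus below is invented):

```python
from collections import Counter
import re

# Term-frequency counting, conceptually similar to building a
# term-document matrix and summing over documents.
def term_frequencies(documents):
    counts = Counter()
    for doc in documents:
        counts.update(re.findall(r"[a-z]+", doc.lower()))
    return counts

docs = ["Text mining mines text.", "Mining text is useful."]
tf = term_frequencies(docs)
# tf["text"] == 3, tf["mining"] == 2, tf["mines"] == 1
```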

  12. pubmed.mineR: an R package with text-mining algorithms to analyse PubMed abstracts.

    Science.gov (United States)

    Rani, Jyoti; Shah, A B Rauf; Ramachandran, Srinivasan

    2015-10-01

    The PubMed literature database is a valuable source of information for scientific research. It is rich in biomedical literature, with more than 24 million citations. Data mining of this voluminous literature is a challenging task. Although several text-mining algorithms have been developed in recent years with a focus on data visualization, they have limitations: they are slow, rigid, and not available as open source. We have developed an R package, pubmed.mineR, in which we have combined the advantages of existing algorithms, overcome their limitations, and offered user flexibility and links with other packages in Bioconductor and the Comprehensive R Archive Network (CRAN) in order to expand the user capabilities for executing multifaceted approaches. Three case studies are presented, namely, 'Evolving role of diabetes educators', 'Cancer risk assessment' and 'Dynamic concepts on disease and comorbidity', to illustrate the use of pubmed.mineR. The package generally runs fast, with small elapsed times on regular workstations, even for large corpus sizes and compute-intensive functions. The pubmed.mineR package is available at http://cran.r-project.org/web/packages/pubmed.mineR. PMID:26564970

  13. Facts from text: can text mining help to scale-up high-quality manual curation of gene products with ontologies?

    Science.gov (United States)

    Winnenburg, Rainer; Wächter, Thomas; Plake, Conrad; Doms, Andreas; Schroeder, Michael

    2008-11-01

    The biomedical literature can be seen as a large integrated, but unstructured data repository. Extracting facts from literature and making them accessible is approached from two directions: manual curation efforts develop ontologies and vocabularies to annotate gene products based on statements in papers. Text mining aims to automatically identify entities and their relationships in text using information retrieval and natural language processing techniques. Manual curation is highly accurate but time consuming, and does not scale with the ever increasing growth of literature. Text mining as a high-throughput computational technique scales well, but is error-prone due to the complexity of natural language. How can both be married to combine scalability and accuracy? Here, we review the state-of-the-art text mining approaches that are relevant to annotation and discuss available online services analysing biomedical literature by means of text mining techniques, which could also be utilised by annotation projects. We then examine how far text mining has already been utilised in existing annotation projects and conclude how these techniques could be tightly integrated into the manual annotation process through novel authoring systems to scale-up high-quality manual curation. PMID:19060303

  14. Semi-Automatic Indexing of Full Text Biomedical Articles

    OpenAIRE

    Gay, Clifford W.; Kayaalp, Mehmet; Aronson, Alan R.

    2005-01-01

    The main application of U.S. National Library of Medicine’s Medical Text Indexer (MTI) is to provide indexing recommendations to the Library’s indexing staff. The current input to MTI consists of the titles and abstracts of articles to be indexed. This study reports on an extension of MTI to the full text of articles appearing in online medical journals that are indexed for Medline®. Using a collection of 17 journal issues containing 500 articles, we report on the effectiven...

  15. Managing biological networks by using text mining and computer-aided curation

    Science.gov (United States)

    Yu, Seok Jong; Cho, Yongseong; Lee, Min-Ho; Lim, Jongtae; Yoo, Jaesoo

    2015-11-01

    In order to understand a biological mechanism in a cell, a researcher must collect a huge number of protein interactions, with experimental data from experiments and the literature. Text mining systems that extract biological interactions from papers have been used to construct biological networks for a few decades. Even though text mining of the literature is necessary to construct a biological network, few systems with a text mining tool are available for biologists who want to construct their own biological networks. We have developed a biological network construction system called BioKnowledge Viewer that can generate a biological interaction network by using a text mining tool and biological taggers. It also provides Boolean simulation software, offering a biological modeling system to simulate the model that is built with the text mining tool. A user can download PubMed articles and construct a biological network by using the Multi-level Knowledge Emergence Model (KMEM), MetaMap, and A Biomedical Named Entity Recognizer (ABNER) as text mining tools. To evaluate the system, we constructed an aging-related biological network consisting of 9,415 nodes (genes) by using manual curation. With network analysis, we found that several genes, including JNK, AP-1, and BCL-2, were highly connected in the aging biological network. We provide a semi-automatic curation environment so that users can obtain a graph database for managing text mining results generated on the server system and can navigate the network with BioKnowledge Viewer, which is freely available at http://bioknowledgeviewer.kisti.re.kr.

  16. Text mining for literature review and knowledge discovery in cancer risk assessment and research.

    Directory of Open Access Journals (Sweden)

    Anna Korhonen

    Full Text Available Research in biomedical text mining is starting to produce technology which can make information in biomedical literature more accessible for bio-scientists. One of the current challenges is to integrate and refine this technology to support real-life scientific tasks in biomedicine, and to evaluate its usefulness in the context of such tasks. We describe CRAB - a fully integrated text mining tool designed to support chemical health risk assessment. This task is complex and time-consuming, requiring a thorough review of existing scientific data on a particular chemical. Covering human, animal, cellular and other mechanistic data from various fields of biomedicine, this is highly varied and therefore difficult to harvest from literature databases via manual means. Our tool automates the process by extracting relevant scientific data in published literature and classifying it according to multiple qualitative dimensions. Developed in close collaboration with risk assessors, the tool allows navigating the classified dataset in various ways and sharing the data with other users. We present a direct and user-based evaluation which shows that the technology integrated in the tool is highly accurate, and report a number of case studies which demonstrate how the tool can be used to support scientific discovery in cancer risk assessment and research. Our work demonstrates the usefulness of a text mining pipeline in facilitating complex research tasks in biomedicine. We discuss further development and application of our technology to other types of chemical risk assessment in the future.

  17. Application of text mining for customer evaluations in commercial banking

    Science.gov (United States)

    Tan, Jing; Du, Xiaojiang; Hao, Pengpeng; Wang, Yanbo J.

    2015-07-01

    Nowadays customer attrition is increasingly serious in commercial banks. To combat this problem thoroughly, mining customer evaluation texts is as important as mining structured customer data. In order to extract hidden information from customer evaluations, textual feature selection, classification, and association rule mining are necessary techniques. This paper applies all three techniques, using Chinese word segmentation, C5.0, and Apriori, and a set of experiments was run on a collection of real textual data comprising 823 customer evaluations taken from a Chinese commercial bank. Results, consequent solutions, and some advice for the commercial bank are given in this paper.
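
    A compact version of the Apriori step can be sketched as frequent itemset mining over tokenized evaluations; the transactions and support threshold below are invented, and a real deployment would use an optimized implementation rather than this brute-force scan.

```python
from itertools import combinations

# A compact Apriori sketch: grow candidate itemsets level by level and
# keep only those meeting the minimum support over all transactions.
def apriori(transactions, min_support):
    transactions = [frozenset(t) for t in transactions]
    items = {i for t in transactions for i in t}

    def support(itemset):
        return sum(itemset <= t for t in transactions)

    frequent = {}
    k_sets = [frozenset([i]) for i in items]
    while k_sets:
        k_sets = [s for s in k_sets if support(s) >= min_support]
        for s in k_sets:
            frequent[s] = support(s)
        # candidate generation: union pairs that differ by exactly one item
        k_sets = list({a | b for a, b in combinations(k_sets, 2)
                       if len(a | b) == len(a) + 1})
    return frequent

# Invented complaint-term transactions from customer evaluations.
tx = [{"slow", "app"}, {"slow", "fees"}, {"slow", "app", "fees"}, {"fees"}]
freq = apriori(tx, min_support=2)
# {"slow","app"} is frequent (support 2); {"app","fees"} is not (support 1).
```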

  18. Text Mining: Wissensgewinnung aus natürlichsprachigen Dokumenten

    OpenAIRE

    Witte, René; Mülle, Jutta

    2006-01-01

    The still relatively young research field of "text mining" combines natural language processing techniques with database and information system technologies. It arose from the observation that roughly 85% of all database content is available only in unstructured form, so that the techniques of classical data mining cannot be applied for knowledge discovery. Examples of such data include full-text databases of books, corporate websites, archives of ...

  19. Applications of string mining techniques in text analysis

    OpenAIRE

    Horațiu Mocian

    2012-01-01

    The focus of this project is on the algorithms and data structures used in string mining and their applications in bioinformatics, text mining and information retrieval. More specifically, it studies the use of suffix trees and suffix arrays for biological sequence analysis, and the algorithms used for approximate string matching, both general ones and specialized ones used in bioinformatics, such as the BLAST algorithm and the PAM substitution matrix. Also, an attempt is made to apply these structures ...
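
    A suffix array of the kind discussed can be built naively by sorting suffix start positions. This is an O(n² log n) illustration (Python string slicing makes each comparison linear); production code would use an O(n log n) or linear-time construction.

```python
# Naive suffix-array construction: sort suffix start indices by the
# suffix they begin. Sorted suffixes of "banana":
#   "a"(5), "ana"(3), "anana"(1), "banana"(0), "na"(4), "nana"(2)
def suffix_array(s):
    return sorted(range(len(s)), key=lambda i: s[i:])

sa = suffix_array("banana")
# sa == [5, 3, 1, 0, 4, 2]
```

    With the array in hand, any substring query can be answered by binary search over the sorted suffixes, which is what makes the structure useful for text indexing.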

  20. Text Mining Driven Drug-Drug Interaction Detection.

    Science.gov (United States)

    Yan, Su; Jiang, Xiaoqian; Chen, Ying

    2013-01-01

    Identifying drug-drug interactions is an important and challenging problem in computational biology and healthcare research. There are accurate, structured but limited domain knowledge and noisy, unstructured but abundant textual information available for building predictive models. The difficulty lies in mining the true patterns embedded in text data and developing efficient and effective ways to combine heterogeneous types of information. We demonstrate a novel approach of leveraging augmented text-mining features to build a logistic regression model with improved prediction performance (in terms of discrimination and calibration). Our model based on synthesized features significantly outperforms the model trained with only structured features (AUC: 96% vs. 91%, Sensitivity: 90% vs. 82% and Specificity: 88% vs. 81%). Along with the quantitative results, we also show learned "latent topics", an intermediary result of our text mining module, and discuss their implications. PMID:25131635
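
    The idea of combining structured and text-derived features in one logistic regression can be sketched with a toy model trained by stochastic gradient descent. The features, labels and hyperparameters below are invented; this is a sketch of the technique, not the authors' model.

```python
import math

# Each example pairs one structured feature with one text-mining feature;
# label 1 means "interaction". Data are invented and linearly separable.
X = [[1.0, 0.9], [0.8, 0.7], [0.1, 0.2], [0.0, 0.1]]
y = [1, 1, 0, 0]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train(X, y, lr=0.5, epochs=2000):
    """Plain stochastic gradient descent on the logistic loss."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b)
            err = p - yi
            w = [wj - lr * err * xj for wj, xj in zip(w, xi)]
            b -= lr * err
    return w, b

w, b = train(X, y)

def predict(x):
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)
```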

  1. Grid-based Support for Different Text Mining Tasks

    Directory of Open Access Journals (Sweden)

    Martin Sarnovský

    2009-12-01

    Full Text Available This paper provides an overview of our research activities aimed at efficient use of Grid infrastructure to solve various text mining tasks. Grid-enabling of various text mining tasks was mainly driven by the increasing volume of processed data. Utilizing the Grid services approach therefore makes it possible to perform various text mining scenarios and also opens ways to design distributed modifications of existing methods. In particular, some parts of the mining process can significantly benefit from the decomposition paradigm; in this study we present our approach to data-driven decomposition of a decision tree building algorithm, a clustering algorithm based on self-organizing maps, and its application in a conceptual model building task using the FCA-based algorithm. The work presented in this paper is rather to be considered a 'proof of concept' for the design and implementation of decomposition methods, as we performed the experiments mostly on standard textual databases.

  2. Approximate subgraph matching-based literature mining for biomedical events and relations.

    Directory of Open Access Journals (Sweden)

    Haibin Liu

    Full Text Available The biomedical text mining community has focused on developing techniques to automatically extract important relations between biological components and semantic events involving genes or proteins from literature. In this paper, we propose a novel approach for mining relations and events in the biomedical literature using approximate subgraph matching. Extraction of such knowledge is performed by searching for an approximate subgraph isomorphism between key contextual dependencies and input sentence graphs. Our approach significantly increases the chance of retrieving relations or events encoded within complex dependency contexts by introducing error tolerance into the graph matching process, while maintaining the extraction precision at a high level. When evaluated on practical tasks, it achieves a 51.12% F-score in extracting nine types of biological events on the GE task of the BioNLP-ST 2011 and an 84.22% F-score in detecting protein-residue associations. The performance is comparable to the reported systems across these tasks, and thus demonstrates the generalizability of our proposed approach.
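
    Approximate subgraph matching with error tolerance can be demonstrated in miniature: enumerate injective mappings from pattern nodes to graph nodes and accept those with at most `tol` missing edges. The pattern and sentence graph below are invented, and real systems prune this exponential search rather than enumerating every mapping.

```python
from itertools import permutations

# Naive approximate subgraph matching: try every injective mapping of
# pattern nodes onto graph nodes; a mapping is acceptable when the number
# of pattern edges missing from the graph stays within the tolerance.
# Brute force like this only suits tiny graphs.
def approx_match(pattern_edges, graph_edges, pattern_nodes, graph_nodes, tol):
    graph_edges = set(graph_edges)
    best = None
    for perm in permutations(graph_nodes, len(pattern_nodes)):
        mapping = dict(zip(pattern_nodes, perm))
        missing = sum((mapping[a], mapping[b]) not in graph_edges
                      for a, b in pattern_edges)
        if missing <= tol and (best is None or missing < best[1]):
            best = (mapping, missing)
    return best  # (node mapping, number of missing edges) or None

# Pattern: a trigger -> theme dependency; graph: a toy dependency parse.
pattern = [("trigger", "theme")]
graph = [("activates", "kinase"), ("kinase", "AKT")]
m = approx_match(pattern, graph, ["trigger", "theme"],
                 ["activates", "kinase", "AKT"], tol=0)
```

    Raising `tol` above zero is what buys the error tolerance described in the abstract: mappings whose dependency context differs slightly from the pattern are still retrieved.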

  3. Research on Online Topic Evolutionary Pattern Mining in Text Streams

    Directory of Open Access Journals (Sweden)

    Qian Chen

    2014-06-01

    Full Text Available Text streams are a class of ubiquitous data that arrive over time and are so extraordinarily large in scale that we often lose track of them. Text streams form a fundamental source of information that can be used to detect the semantic topics that individuals and organizations are interested in, as well as to detect burst events within communities. Thus, an intelligent system that can automatically extract interesting temporal patterns from text streams is sorely needed; however, evolutionary pattern mining has not been well addressed in previous work. In this paper, we begin a tentative study of a topic evolutionary pattern mining system by fully discussing the properties of a topic after formally defining it, and by proposing a common, formal framework for analyzing text streams. We also define three basic tasks, namely (1) online topic detection, (2) event evolution extraction and (3) topic property life cycle, and propose three common mining algorithms, respectively. Finally, we exemplify the application of evolutionary pattern mining and show that interesting patterns can be extracted from a newswire dataset.
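
    One simplified stand-in for the online topic/burst detection task is to compare each window's term counts against the trailing history and flag sudden jumps; the thresholds and the two-window stream below are invented for illustration.

```python
from collections import Counter

# Toy online burst detector over a windowed text stream: a term is
# flagged when its current-window count clears a minimum and exceeds a
# multiple of its trailing average across earlier windows.
def detect_bursts(windows, ratio=3.0, min_count=3):
    history, bursts = [], []
    for w, docs in enumerate(windows):
        counts = Counter(tok for d in docs for tok in d.lower().split())
        for term, c in counts.items():
            past = [h.get(term, 0) for h in history]
            avg = (sum(past) / len(past)) if past else 0.0
            if c >= min_count and c > ratio * max(avg, 1e-9):
                bursts.append((w, term))
        history.append(counts)
    return bursts

stream = [
    ["markets open calm", "weather calm today"],
    ["quake hits city", "quake damage wide", "city quake relief"],
]
events = detect_bursts(stream)
# events == [(1, "quake")]: "quake" bursts in window 1.
```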

  4. Text Mining of Journal Articles for Sleep Disorder Terminologies.

    Directory of Open Access Journals (Sweden)

    Calvin Lam

    Full Text Available Research using text mining on publication trends in journal articles on sleep disorders (SDs) and the associated methodologies has been limited. The present study involved text mining for terms to determine the publication trends in sleep-related journal articles published during 2000-2013 and to identify associations between SD and methodology terms, as well as conducting statistical analyses of the text mining findings. SD and methodology terms were extracted from 3,720 sleep-related journal articles in the PubMed database by using MetaMap. The extracted data set was analyzed using hierarchical cluster analyses and adjusted logistic regression models to investigate publication trends and associations between SD and methodology terms. MetaMap had a text mining precision, recall, and false positive rate of 0.70, 0.77, and 11.51%, respectively. The most common SD term was breathing-related sleep disorder, whereas narcolepsy was the least common. Cluster analyses showed similar methodology clusters for each SD term, except narcolepsy. The logistic regression models showed an increasing prevalence of insomnia, parasomnia, and other sleep disorders but a decreasing prevalence of breathing-related sleep disorder during 2000-2013. Different SD terms were positively associated with different methodology terms regarding research design terms, measure terms, and analysis terms. Insomnia-, parasomnia-, and other sleep disorder-related articles showed an increasing publication trend, whereas those related to breathing-related sleep disorder showed a decreasing trend. Furthermore, experimental studies more commonly focused on hypersomnia and other SDs and less commonly on insomnia, breathing-related sleep disorder, narcolepsy, and parasomnia. Thus, text mining may facilitate the exploration of the publication trends in SDs and the associated methodologies.
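
    Evaluation numbers like the precision and recall reported above can be computed from raw counts as follows. The counts here are invented (chosen merely to land near 0.70/0.77), and the paper's exact false-positive-rate definition may differ.

```python
# Precision, recall, false positive rate, and F1 from confusion-matrix
# counts: tp/fp/fn/tn are invented illustration values.
def evaluation_metrics(tp, fp, fn, tn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    fpr = fp / (fp + tn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, fpr, f1

p, r, fpr, f1 = evaluation_metrics(tp=70, fp=30, fn=21, tn=879)
# p == 0.70, r ≈ 0.77, fpr ≈ 0.033
```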

  5. Text mining for biology--the way forward

    DEFF Research Database (Denmark)

    Altman, Russ B; Bergman, Casey M; Blake, Judith;

    2008-01-01

    This article collects opinions from leading scientists about how text mining can provide better access to the biological literature, how the scientific community can help with this process, what the next steps are, and what role future BioCreative evaluations can play. The responses identify … workflows; and suggestions for additional challenge evaluations, new applications, and additional resources needed to make progress …

  6. Using Text Mining to Characterize Online Discussion Facilitation

    Science.gov (United States)

    Ming, Norma; Baumer, Eric

    2011-01-01

    Facilitating class discussions effectively is a critical yet challenging component of instruction, particularly in online environments where student and faculty interaction is limited. Our goals in this research were to identify facilitation strategies that encourage productive discussion, and to explore text mining techniques that can help…

  7. Text mining improves prediction of protein functional sites.

    Directory of Open Access Journals (Sweden)

    Karin M Verspoor

    Full Text Available We present an approach that integrates protein structure analysis and text mining for protein functional site prediction, called LEAP-FS (Literature Enhanced Automated Prediction of Functional Sites). The structure analysis was carried out using Dynamics Perturbation Analysis (DPA), which predicts functional sites at control points where interactions greatly perturb protein vibrations. The text mining extracts mentions of residues in the literature, and predicts that residues mentioned are functionally important. We assessed the significance of each of these methods by analyzing their performance in finding known functional sites (specifically, small-molecule binding sites and catalytic sites) in about 100,000 publicly available protein structures. The DPA predictions recapitulated many of the functional site annotations and preferentially recovered binding sites annotated as biologically relevant vs. those annotated as potentially spurious. The text-based predictions were also substantially supported by the functional site annotations: compared to other residues, residues mentioned in text were roughly six times more likely to be found in a functional site. The overlap of predictions with annotations improved when the text-based and structure-based methods agreed. Our analysis also yielded new high-quality predictions of many functional site residues that were not catalogued in the curated data sources we inspected. We conclude that both DPA and text mining independently provide valuable high-throughput protein functional site predictions, and that integrating the two methods using LEAP-FS further improves the quality of these predictions.
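
    The mention-extraction step (finding residues named in text) can be sketched with a regular expression over three-letter residue codes; the example sentence is invented, and real systems also handle one-letter codes, position ranges, and coreference.

```python
import re

# Extract protein residue mentions such as "His57" or "Ser-195":
# a three-letter amino-acid code, optional hyphen, then a position.
RESIDUE = re.compile(
    r"\b(Ala|Arg|Asn|Asp|Cys|Gln|Glu|Gly|His|Ile|Leu|Lys|"
    r"Met|Phe|Pro|Ser|Thr|Trp|Tyr|Val)-?(\d+)\b")

def residue_mentions(text):
    return [(m.group(1), int(m.group(2))) for m in RESIDUE.finditer(text)]

mentions = residue_mentions(
    "Catalysis depends on the triad His57, Asp102 and Ser-195.")
# mentions == [("His", 57), ("Asp", 102), ("Ser", 195)]
```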

  8. Collaborative mining and interpretation of large-scale data for biomedical research insights.

    Directory of Open Access Journals (Sweden)

    Georgia Tsiliki

    Full Text Available Biomedical research is becoming increasingly interdisciplinary and collaborative in nature. Researchers need to efficiently and effectively collaborate and make decisions by meaningfully assembling, mining and analyzing available large-scale volumes of complex multi-faceted data residing in different sources. In line with related research directives revealing that, in spite of the recent advances in data mining and computational analysis, humans can easily detect patterns which computer algorithms may have difficulty in finding, this paper reports on the practical use of an innovative web-based collaboration support platform in a biomedical research context. Arguing that dealing with data-intensive and cognitively complex settings is not a technical problem alone, the proposed platform adopts a hybrid approach that builds on the synergy between machine and human intelligence to facilitate the underlying sense-making and decision-making processes. User experience shows that the platform enables more informed and quicker decisions by displaying aggregated information according to users' needs, while also exploiting the associated human intelligence.

  9. Citation Mining: Integrating Text Mining and Bibliometrics for Research User Profiling.

    Science.gov (United States)

    Kostoff, Ronald N.; del Rio, J. Antonio; Humenik, James A.; Garcia, Esther Ofilia; Ramirez, Ana Maria

    2001-01-01

    Discusses the importance of identifying the users and impact of research, and describes an approach for identifying the pathways through which research can impact other research, technology development, and applications. Describes a study that used citation mining, an integration of citation bibliometrics and text mining, on articles from the…

  10. Application of Text Mining in Cancer Symptom Management.

    Science.gov (United States)

    Lee, Young Ji; Donovan, Heidi

    2016-01-01

    Fatigue continues to be one of the main symptoms that afflict ovarian cancer patients and negatively affects their functional status and quality of life. To manage fatigue effectively, the symptom must be understood from the perspective of patients. We utilized text mining to understand the symptom experiences and strategies that were associated with fatigue among ovarian cancer patients. Through text analysis, we determined that descriptors such as energetic, challenging, frustrating, struggling, unmanageable, and agony were associated with fatigue. Descriptors such as decadron, encourager, grocery, massage, relaxing, shower, sleep, zoloft, and church were associated with strategies to ameliorate fatigue. This study demonstrates the potential of applying text mining in cancer research to understand patients' perspective on symptom management. Future study will consider various factors to refine the results. PMID:27332415

  11. Unsupervised mining of frequent tags for clinical eligibility text indexing.

    Science.gov (United States)

    Miotto, Riccardo; Weng, Chunhua

    2013-12-01

    Clinical text, such as clinical trial eligibility criteria, is largely underused in state-of-the-art medical search engines due to difficulties of accurate parsing. This paper proposes a novel methodology to derive a semantic index for clinical eligibility documents based on a controlled vocabulary of frequent tags, which are automatically mined from the text. We applied this method to eligibility criteria on ClinicalTrials.gov and report that frequent tags (1) define an effective and efficient index of clinical trials and (2) are unlikely to grow radically when the repository increases. We proposed to apply the semantic index to filter clinical trial search results and we concluded that frequent tags reduce the result space more efficiently than an uncontrolled set of UMLS concepts. Overall, unsupervised mining of frequent tags from clinical text leads to an effective semantic index for the clinical eligibility documents and promotes their computational reuse.
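
    Frequent-tag mining of the kind described can be sketched as counting n-grams across eligibility criteria and keeping those above a support threshold; real systems use part-of-speech patterns to target noun phrases, and the criteria below are invented.

```python
from collections import Counter
import re

# Unsupervised mining of frequent tags: count bigrams across criteria
# and keep those reaching the minimum support. Plain bigrams stand in
# for the noun-phrase patterns a real system would use.
def frequent_tags(criteria, min_support=2, n=2):
    counts = Counter()
    for text in criteria:
        tokens = re.findall(r"[a-z]+", text.lower())
        counts.update(tuple(tokens[i:i + n])
                      for i in range(len(tokens) - n + 1))
    return {gram: c for gram, c in counts.items() if c >= min_support}

criteria = [
    "History of breast cancer",
    "No prior breast cancer treatment",
    "Age over 18 years",
]
tags = frequent_tags(criteria)
# tags == {("breast", "cancer"): 2}
```

    The resulting controlled vocabulary of frequent tags can then index and filter documents, as the abstract proposes for ClinicalTrials.gov eligibility criteria.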

  12. Decision Support for E-Governance: A Text Mining Approach

    Directory of Open Access Journals (Sweden)

    G. Koteswara Rao

    2011-09-01

Full Text Available Information and communication technology has the capability to improve the process by which governments involve citizens in formulating public policy and public projects. Even though much of government regulations may now be in digital form (and often available online), due to their complexity and diversity, identifying the ones relevant to a particular context is a non-trivial task. Similarly, with the advent of a number of electronic online forums, social networking sites and blogs, the opportunity of gathering citizens’ petitions and stakeholders’ views on government policy and proposals has increased greatly, but the volume and the complexity of analyzing unstructured data makes this difficult. On the other hand, text mining has come a long way from simple keyword search, and matured into a discipline capable of dealing with much more complex tasks. In this paper we discuss how text-mining techniques can help in retrieval of information and relationships from textual data sources, thereby assisting policy makers in discovering associations between policies and citizens’ opinions expressed in electronic public forums and blogs etc. We also present here an integrated text mining based architecture for e-governance decision support along with a discussion on the Indian scenario.

  13. Decision Support for e-Governance: A Text Mining Approach

    CERN Document Server

    Rao, G Koteswara

    2011-01-01

    Information and communication technology has the capability to improve the process by which governments involve citizens in formulating public policy and public projects. Even though much of government regulations may now be in digital form (and often available online), due to their complexity and diversity, identifying the ones relevant to a particular context is a non-trivial task. Similarly, with the advent of a number of electronic online forums, social networking sites and blogs, the opportunity of gathering citizens' petitions and stakeholders' views on government policy and proposals has increased greatly, but the volume and the complexity of analyzing unstructured data makes this difficult. On the other hand, text mining has come a long way from simple keyword search, and matured into a discipline capable of dealing with much more complex tasks. In this paper we discuss how text-mining techniques can help in retrieval of information and relationships from textual data sources, thereby assisting policy...

  14. Text Mining System for Non-Expert Miners

    OpenAIRE

    Ramya, P.; S. Sasirekha

    2014-01-01

Service-oriented architecture integrated with text mining allows services to extract information in a well-defined manner. In this paper, it is proposed to design a knowledge-extracting system for the Ocean Information Data System. The ARGO floating sensors deployed by INCOIS (Indian National Council for Ocean Information Systems) reflect the characteristics of the ocean, and their data are forwarded to the OIDS (Ocean Information Data System). For the data received from OIDS, pre-processing techn...

  15. Text Mining Driven Drug-Drug Interaction Detection

    OpenAIRE

    Yan, Su; Jiang, Xiaoqian; Chen, Ying

    2013-01-01

Identifying drug-drug interactions is an important and challenging problem in computational biology and healthcare research. There are accurate, structured but limited domain knowledge and noisy, unstructured but abundant textual information available for building predictive models. The difficulty lies in mining the true patterns embedded in text data and developing efficient and effective ways to combine heterogeneous types of information. We demonstrate a novel approach of leveraging augment...

  16. Intrinsic evaluation of text mining tools may not predict performance on realistic tasks.

    Science.gov (United States)

    Caporaso, J Gregory; Deshpande, Nita; Fink, J Lynn; Bourne, Philip E; Cohen, K Bretonnel; Hunter, Lawrence

    2008-01-01

    Biomedical text mining and other automated techniques are beginning to achieve performance which suggests that they could be applied to aid database curators. However, few studies have evaluated how these systems might work in practice. In this article we focus on the problem of annotating mutations in Protein Data Bank (PDB) entries, and evaluate the relationship between performance of two automated techniques, a text-mining-based approach (MutationFinder) and an alignment-based approach, in intrinsic versus extrinsic evaluations. We find that high performance on gold standard data (an intrinsic evaluation) does not necessarily translate to high performance for database annotation (an extrinsic evaluation). We show that this is in part a result of lack of access to the full text of journal articles, which appears to be critical for comprehensive database annotation by text mining. Additionally, we evaluate the accuracy and completeness of manually annotated mutation data in the PDB, and find that it is far from perfect. We conclude that currently the most cost-effective and reliable approach for database annotation might incorporate manual and automatic annotation methods.

  17. Linking genes to literature: text mining, information extraction, and retrieval applications for biology.

    Science.gov (United States)

    Krallinger, Martin; Valencia, Alfonso; Hirschman, Lynette

    2008-01-01

    Efficient access to information contained in online scientific literature collections is essential for life science research, playing a crucial role from the initial stage of experiment planning to the final interpretation and communication of the results. The biological literature also constitutes the main information source for manual literature curation used by expert-curated databases. Following the increasing popularity of web-based applications for analyzing biological data, new text-mining and information extraction strategies are being implemented. These systems exploit existing regularities in natural language to extract biologically relevant information from electronic texts automatically. The aim of the BioCreative challenge is to promote the development of such tools and to provide insight into their performance. This review presents a general introduction to the main characteristics and applications of currently available text-mining systems for life sciences in terms of the following: the type of biological information demands being addressed; the level of information granularity of both user queries and results; and the features and methods commonly exploited by these applications. The current trend in biomedical text mining points toward an increasing diversification in terms of application types and techniques, together with integration of domain-specific resources such as ontologies. Additional descriptions of some of the systems discussed here are available on the internet http://zope.bioinfo.cnio.es/bionlp_tools/. PMID:18834499

  18. Text mining a self-report back-translation.

    Science.gov (United States)

    Blanch, Angel; Aluja, Anton

    2016-06-01

There are several recommendations about the routine to undertake when back translating self-report instruments in cross-cultural research. However, text mining methods have been generally ignored within this field. This work describes a text mining innovative application useful to adapt a personality questionnaire to 12 different languages. The method is divided in 3 different stages, a descriptive analysis of the available back-translated instrument versions, a dissimilarity assessment between the source language instrument and the 12 back-translations, and an item assessment of item meaning equivalence. The suggested method contributes to improving the back-translation process of self-report instruments for cross-cultural research in 2 significant intertwined ways. First, it defines a systematic approach to the back translation issue, allowing for a more orderly and informed evaluation concerning the equivalence of different versions of the same instrument in different languages. Second, it provides more accurate instrument back-translations, which has direct implications for the reliability and validity of the instrument's test scores when used in different cultures/languages. In addition, this procedure can be extended to the back-translation of self-reports measuring psychological constructs in clinical assessment. Future research works could refine the suggested methodology and use additional available text mining tools. (PsycINFO Database Record) PMID:26302100

  20. The Role of Text Mining in Export Control

    Energy Technology Data Exchange (ETDEWEB)

    Tae, Jae-woong; Son, Choul-woong; Shin, Dong-hoon [Korea Institute of Nuclear Nonproliferation and Control, Daejeon (Korea, Republic of)

    2015-10-15

The Korean government provides classification services to exporters. Technology in the form of documents and drawings is simple to copy, and new technology is easily derived from existing technology. This diversity of technology makes classification difficult, because the boundary between strategic and nonstrategic technology is unclear and ambiguous. Reviewers should take sufficient account of previous classification cases; however, the growing number of cases makes consistent classification hard, so innovative and more effective approaches are needed. IXCRS (Intelligent Export Control Review System) is proposed to meet these demands. IXCRS consists of an expert system, a semantic searching system, a full text retrieval system, an image retrieval system and a document retrieval system. It is the aim of the present paper to describe the document retrieval system based on text mining and to discuss how to utilize it. This study has demonstrated how text mining techniques can be applied to export control. The document retrieval system supports reviewers in treating previous classification cases effectively. In particular, it is highly probable that similarity data will contribute to specifying classification criteria. However, an analysis of the system revealed a number of problems that remain to be explored, such as multilanguage handling and inclusion relationships. Further research should be directed at solving these problems and at applying more data mining techniques, so that the system can serve as a useful tool for export control.

  1. Biomedical Mathematics, Unit II: Propagation of Error, Vectors and Linear Programming. Student Text. Revised Version, 1975.

    Science.gov (United States)

    Biomedical Interdisciplinary Curriculum Project, Berkeley, CA.

    This student text presents instructional materials for a unit of mathematics within the Biomedical Interdisciplinary Curriculum Project (BICP), a two-year interdisciplinary precollege curriculum aimed at preparing high school students for entry into college and vocational programs leading to a career in the health field. Lessons concentrate on…

  2. Extraction of semantic biomedical relations from text using conditional random fields

    Directory of Open Access Journals (Sweden)

    Stetter Martin

    2008-04-01

    text and apply it to the biomedical domain. Our approach is based on a rich set of textual features and achieves a performance that is competitive to leading approaches. The model is quite general and can be extended to handle arbitrary biological entities and relation types. The resulting gene-disease network shows that the GeneRIF database provides a rich knowledge source for text mining. Current work is focused on improving the accuracy of detection of entities as well as entity boundaries, which will also greatly improve the relation extraction performance.

  3. A Semi-Structured Document Model for Text Mining

    Institute of Scientific and Technical Information of China (English)

    杨建武; 陈晓鸥

    2002-01-01

A semi-structured document has more structured information compared to an ordinary document, and the relation among semi-structured documents can be fully utilized. In order to take advantage of the structure and link information in a semi-structured document for better mining, a structured link vector model (SLVM) is presented in this paper, where a vector represents a document, and the vector's elements are determined by terms, document structure and neighboring documents. Text mining based on SLVM is described in the procedure of K-means for briefness and clarity: calculating document similarity and calculating cluster centers. Clustering based on SLVM performs significantly better than that based on a conventional vector space model in the experiments, and its F value increases from 0.65-0.73 to 0.82-0.86.
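
    The core SLVM idea, that a document vector should reflect structure as well as terms, can be sketched with a toy structure-weighted vector and cosine similarity. This is a minimal sketch, not the full model: it up-weights title terms as a stand-in for structural elements, ignores neighboring documents, and assigns each document to the nearest of two seed documents rather than running full K-means. The corpus and weights are illustrative assumptions.

```python
import math
from collections import Counter

def doc_vector(title, body, title_weight=3.0):
    """Term vector where title terms (a structural field) are up-weighted,
    a crude stand-in for SLVM's structure-aware vector elements."""
    vec = Counter()
    for t in body.lower().split():
        vec[t] += 1.0
    for t in title.lower().split():
        vec[t] += title_weight
    return vec

def cosine(u, v):
    """Cosine similarity between two sparse term vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

docs = [
    ("gene mining", "mining genes from biomedical text"),
    ("protein text", "text mining for protein interactions"),
    ("soccer news", "the soccer team won the final match"),
]
vecs = [doc_vector(t, b) for t, b in docs]
# One assignment step of K-means: attach each document to the
# nearest of two seed "cluster centers"
seeds = [vecs[0], vecs[2]]
labels = [max(range(2), key=lambda k: cosine(v, seeds[k])) for v in vecs]
```

    The two text-mining documents land in one cluster and the sports document in the other; in the full algorithm the assignment and center-recalculation steps would alternate until convergence.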

  4. Computational intelligence methods on biomedical signal analysis and data mining in medical records

    OpenAIRE

    Vladutu, Liviu-Mihai

    2004-01-01

    This thesis is centered around the development and application of computationally effective solutions based on artificial neural networks (ANN) for biomedical signal analysis and data mining in medical records. The ultimate goal of this work in the field of Biomedical Engineering is to provide the clinician with the best possible information needed to make an accurate diagnosis (in our case of myocardial ischemia) and to propose advanced mathematical models for recovering the complex de...

  5. A Fuzzy Similarity Based Concept Mining Model for Text Classification

    CERN Document Server

    Puri, Shalini

    2012-01-01

Text classification is a challenging and rapidly growing field, with great importance in text categorization applications. Much research has been done in this field, but there remains a need to categorize a collection of text documents into mutually exclusive categories by extracting the concepts or features using a supervised learning paradigm and different classification algorithms. In this paper, a new Fuzzy Similarity Based Concept Mining Model (FSCMM) is proposed to classify a set of text documents into pre-defined Category Groups (CG) by training and preparing them at the sentence, document and integrated corpora levels, along with feature reduction and ambiguity removal at each level to achieve high system performance. A Fuzzy Feature Category Similarity Analyzer (FFCSA) is used to analyze each extracted feature of the Integrated Corpora Feature Vector (ICFV) with the corresponding categories or classes. This model uses a Support Vector Machine Classifier (SVMC) to classify correct...

  6. Mining consumer health vocabulary from community-generated text.

    Science.gov (United States)

    Vydiswaran, V G Vinod; Mei, Qiaozhu; Hanauer, David A; Zheng, Kai

    2014-01-01

    Community-generated text corpora can be a valuable resource to extract consumer health vocabulary (CHV) and link them to professional terminologies and alternative variants. In this research, we propose a pattern-based text-mining approach to identify pairs of CHV and professional terms from Wikipedia, a large text corpus created and maintained by the community. A novel measure, leveraging the ratio of frequency of occurrence, was used to differentiate consumer terms from professional terms. We empirically evaluated the applicability of this approach using a large data sample consisting of MedLine abstracts and all posts from an online health forum, MedHelp. The results show that the proposed approach is able to identify synonymous pairs and label the terms as either consumer or professional term with high accuracy. We conclude that the proposed approach provides great potential to produce a high quality CHV to improve the performance of computational applications in processing consumer-generated health text.
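
    The frequency-ratio idea above can be sketched directly: estimate a term's relative frequency in a consumer corpus and in a professional corpus, and take their ratio to decide which register the term belongs to. This is a rough stand-in for the paper's measure under illustrative assumptions (tiny toy corpora instead of Wikipedia/MedHelp/MedLine, additive smoothing, whitespace tokenization).

```python
from collections import Counter

def term_frequencies(corpus):
    """Term counts and total token count for a list of texts."""
    counts, total = Counter(), 0
    for text in corpus:
        for t in text.lower().split():
            counts[t] += 1
            total += 1
    return counts, total

def consumer_score(term, consumer, professional, smoothing=1.0):
    """Ratio of a term's relative frequency in the consumer corpus vs the
    professional one; > 1 suggests a consumer term, < 1 a professional one."""
    c_counts, c_total = consumer
    p_counts, p_total = professional
    c_rate = (c_counts[term] + smoothing) / (c_total + smoothing)
    p_rate = (p_counts[term] + smoothing) / (p_total + smoothing)
    return c_rate / p_rate

# Hypothetical snippets standing in for forum posts and article abstracts
forum = ["my heart attack scared me", "heart attack symptoms at night"]
abstracts = ["acute myocardial infarction outcomes",
             "myocardial infarction risk factors"]
consumer = term_frequencies(forum)
professional = term_frequencies(abstracts)
```

    With such scores, a synonymous pair like "heart attack" / "myocardial infarction" can be split into its consumer and professional members, which is the labeling step the paper evaluates.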

  7. A TEXT MINING RESEARCH BASED ON LDA TOPIC MODELLING

    Directory of Open Access Journals (Sweden)

    Zhou Tong

    2016-05-01

Full Text Available A large amount of digital text is generated every day, and effectively searching, managing and exploring this text data has become a main task. In this paper, we first present an introduction to text mining and the probabilistic topic model Latent Dirichlet allocation. Two experiments are then proposed: topic modelling of Wikipedia articles and of users’ tweets. The former builds a document topic model, aiming at a topic-based approach to searching, exploring and recommending articles. The latter builds a user topic model, providing a full analysis of Twitter users’ interests. The experimental process, including data collection, data pre-processing and model training, is fully documented and commented. Furthermore, the conclusions and applications of this paper could serve as a useful computational tool for social and business research.
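
    The modelling pipeline described above can be sketched with scikit-learn's LDA implementation (assumed available; the paper's own tooling is not specified). The toy corpus and parameter choices below are illustrative, not the paper's Wikipedia/Twitter setup.

```python
# Minimal LDA pipeline: vectorize a corpus, fit a topic model,
# and read off each document's topic distribution.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

corpus = [
    "gene protein expression in cancer cells",
    "protein folding and gene regulation",
    "election vote government policy debate",
    "government policy on election reform",
]
# Bag-of-words counts with English stop words removed
X = CountVectorizer(stop_words="english").fit_transform(corpus)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
# Rows are documents, columns are topic weights summing to 1
doc_topics = lda.fit_transform(X)
```

    In the paper's setting, the same document-topic matrix drives searching and recommending articles (document model) or profiling users by the topics of their tweets (user model).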

  8. FigSum: Automatically Generating Structured Text Summaries for Figures in Biomedical Literature

    OpenAIRE

    Agarwal, Shashank; Yu, Hong

    2009-01-01

    Figures are frequently used in biomedical articles to support research findings; however, they are often difficult to comprehend based on their legends alone and information from the full-text articles is required to fully understand them. Previously, we found that the information associated with a single figure is distributed throughout the full-text article the figure appears in. Here, we develop and evaluate a figure summarization system – FigSum, which aggregates this scattered informatio...

  9. Practical Text Mining and Statistical Analysis for Non-structured Text Data Applications

    CERN Document Server

    Miner, Gary; Hill, Thomas; Nisbet, Robert; Delen, Dursun

    2012-01-01

    The world contains an unimaginably vast amount of digital information which is getting ever vaster ever more rapidly. This makes it possible to do many things that previously could not be done: spot business trends, prevent diseases, combat crime and so on. Managed well, the textual data can be used to unlock new sources of economic value, provide fresh insights into science and hold governments to account. As the Internet expands and our natural capacity to process the unstructured text that it contains diminishes, the value of text mining for information retrieval and search will increase d

  10. Semantic-based image retrieval by text mining on environmental texts

    Science.gov (United States)

    Yang, Hsin-Chang; Lee, Chung-Hong

    2003-01-01

In this paper we propose a novel method to bridge the 'semantic gap' between a user's information need and the image content. The semantic gap describes the major deficiency of content-based image retrieval (CBIR) systems, which use visual features extracted from images to describe them. We overcome this deficiency by extracting the semantics of an image from the environmental texts around it. Since an image generally co-exists with accompanying texts in various formats, we may rely on such environmental texts to discover the semantics of the image. A text mining approach based on self-organizing maps is used to extract the semantics of an image from its environmental texts. We performed experiments on a small set of images and obtained promising results.

  11. Enhancing Text Clustering Using Concept-based Mining Model

    Directory of Open Access Journals (Sweden)

    Lincy Liptha R.

    2012-03-01

Full Text Available Text mining techniques are mostly based on statistical analysis of a word or phrase. The statistical analysis of term frequency captures the importance of a term within a document only, yet two terms can have the same frequency in the same document while one contributes more to the meaning than the other. Hence, terms that capture the semantics of the text should be given more importance. Here, a new concept-based mining model is introduced. It analyses terms at the sentence, document and corpus levels. The model consists of sentence-based concept analysis, which calculates the conceptual term frequency (ctf); document-based concept analysis, which finds the term frequency (tf); corpus-based concept analysis, which determines the document frequency (df); and a concept-based similarity measure. The process of calculating the ctf, tf and df measures in a corpus is carried out by the proposed Concept-Based Analysis Algorithm. By doing so, we cluster web documents efficiently, and the quality of the clusters achieved by this model significantly surpasses that of traditional single-term-based approaches.
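
    The tf and df quantities above can be sketched concretely; this minimal version computes per-document term frequency and corpus document frequency and combines them into a tf-idf style weight. It omits the sentence-level ctf and the concept extraction itself, so it is a simplified stand-in, not the paper's algorithm; the corpus is illustrative.

```python
import math
from collections import Counter

def analyze(corpus):
    """Per-document term frequency (tf) and corpus-wide document
    frequency (df); ctf, the sentence-level count, is omitted here."""
    tfs = [Counter(doc.lower().split()) for doc in corpus]
    df = Counter()
    for tf in tfs:
        for term in tf:
            df[term] += 1
    return tfs, df

def weight(term, tf, df, n_docs):
    """tf-idf style weight: high when a term is frequent in the
    document but rare across the corpus."""
    return tf[term] * math.log(n_docs / df[term])

corpus = [
    "text mining extracts knowledge from text",
    "concept mining analyses concepts in text",
    "football season starts today",
]
tfs, df = analyze(corpus)
w_text = weight("text", tfs[0], df, len(corpus))
w_mining = weight("mining", tfs[0], df, len(corpus))
```

    In the full model these weights would be computed over extracted concepts rather than raw words, which is what lets semantically important terms dominate the similarity measure used for clustering.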

  12. Pharmspresso: a text mining tool for extraction of pharmacogenomic concepts and relationships from full text

    Directory of Open Access Journals (Sweden)

    Altman Russ B

    2009-02-01

Full Text Available Abstract Background Pharmacogenomics studies the relationship between genetic variation and the variation in drug response phenotypes. The field is rapidly gaining importance: it promises drugs targeted to particular subpopulations based on genetic background. The pharmacogenomics literature has expanded rapidly, but is dispersed in many journals. It is challenging, therefore, to identify important associations between drugs and molecular entities – particularly genes and gene variants, and thus these critical connections are often lost. Text mining techniques can allow us to convert the free-style text to a computable, searchable format in which pharmacogenomic concepts (such as genes, drugs, polymorphisms, and diseases) are identified, and important links between these concepts are recorded. Availability of full text articles as input into text mining engines is key, as literature abstracts often do not contain sufficient information to identify these pharmacogenomic associations. Results Thus, building on a tool called Textpresso, we have created the Pharmspresso tool to assist in identifying important pharmacogenomic facts in full text articles. Pharmspresso parses text to find references to human genes, polymorphisms, drugs and diseases and their relationships. It presents these as a series of marked-up text fragments, in which key concepts are visually highlighted. To evaluate Pharmspresso, we used a gold standard of 45 human-curated articles. Pharmspresso identified 78%, 61%, and 74% of target gene, polymorphism, and drug concepts, respectively. Conclusion Pharmspresso is a text analysis tool that extracts pharmacogenomic concepts from the literature automatically and thus captures our current understanding of gene-drug interactions in a computable form. We have made Pharmspresso available at http://pharmspresso.stanford.edu.

  13. Significant Term List Based Metadata Conceptual Mining Model for Effective Text Clustering

    OpenAIRE

    J. Janet; S. Koteeswaran; E. Kannan

    2012-01-01

As the engineering world grows fast, the use of data in the day-to-day activity of the engineering industry also grows rapidly. Data mining is now very helpful for handling huge data stores and finding the hidden knowledge within them. Text mining, network mining, multimedia mining and trend analysis are a few applications of data mining. In text mining, a variety of methods have been proposed by many researchers; even so, high precision and better recall are still a critica...

  14. Mining Sequential Update Summarization with Hierarchical Text Analysis

    Directory of Open Access Journals (Sweden)

    Chunyun Zhang

    2016-01-01

Full Text Available The outbreak of unexpected news events such as major human accidents or natural disasters brings about a new information access problem where traditional approaches fail. Mostly, news of these events is sparse early on and redundant later. Hence, it is very important to get updates and provide individuals with timely and important information about these incidents during their development, especially in wireless and mobile Internet of Things (IoT) applications. In this paper, we define the problem of sequential update summarization extraction and present a new hierarchical update mining system which can broadcast useful, new, and timely sentence-length updates about a developing event. The new system proposes a novel method which incorporates techniques from topic-level and sentence-level summarization. To evaluate the performance of the proposed system, we apply it to the sequential update summarization task of the temporal summarization (TS) track at the Text Retrieval Conference (TREC) 2013, computing four measurements of the update mining system: expected gain, expected latency gain, comprehensiveness, and latency comprehensiveness. Experimental results show that our proposed method performs well.

  15. Data mining of text as a tool in authorship attribution

    Science.gov (United States)

    Visa, Ari J. E.; Toivonen, Jarmo; Autio, Sami; Maekinen, Jarno; Back, Barbro; Vanharanta, Hannu

    2001-03-01

It is common that text documents are characterized and classified by keywords that their authors assign. Visa et al. have developed a new methodology based on prototype matching. The prototype is an interesting document, or a part of an extracted, interesting text, which is matched against the document database of the monitored document flow. The new methodology is capable of extracting the meaning of a document to a certain degree. Our claim is that it is also capable of authenticating authorship. To verify this claim, two tests were designed. The test hypothesis was that the words and the word order in the sentences can authenticate the author. In the first test three authors were selected: William Shakespeare, Edgar Allan Poe, and George Bernard Shaw. Three texts from each author were examined; every text was used in turn as a prototype, and the two nearest matches with the prototype were noted. The second test used the Reuters-21578 financial news database, examining a group of 25 short financial news reports from five different authors. Our new methodology and the interesting results from the two tests are reported in this paper. In the first test, all cases were successful for Shakespeare and Poe; for Shaw, one text was confused with Poe. In the second test, the authors of the Reuters-21578 financial news were identified relatively well. The conclusion is that our text mining methodology seems capable of authorship attribution.
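
    The prototype-matching step can be sketched as nearest-neighbour search over word-frequency profiles: build a profile per prototype text, then attribute an unknown text to the author whose prototype is most similar. This is a minimal sketch under stated assumptions; the real method also exploits word order, which this bag-of-words version ignores, and the snippets below are illustrative, not the actual test texts.

```python
import math
from collections import Counter

def profile(text):
    """Word-frequency profile of a text (word order is ignored here,
    unlike in the original methodology)."""
    return Counter(text.lower().split())

def cosine(u, v):
    """Cosine similarity between two word-frequency profiles."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    return dot / (math.sqrt(sum(x * x for x in u.values())) *
                  math.sqrt(sum(x * x for x in v.values())) or 1.0)

# Hypothetical prototype texts for two authors
prototypes = {
    "author_a": profile("thou art more lovely and more temperate thou art"),
    "author_b": profile("the market rallied as investors bought shares"),
}
unknown = profile("shall i compare thee thou art so lovely")
# Attribute the unknown text to the nearest prototype
best = max(prototypes, key=lambda a: cosine(unknown, prototypes[a]))
```

    Reporting the two nearest prototypes instead of only the best match, as the first test does, is a simple way to expose confusions such as the Shaw/Poe case.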

  16. Text-mining-assisted biocuration workflows in Argo.

    Science.gov (United States)

    Rak, Rafal; Batista-Navarro, Riza Theresa; Rowley, Andrew; Carter, Jacob; Ananiadou, Sophia

    2014-01-01

    Biocuration activities have been broadly categorized into the selection of relevant documents, the annotation of biological concepts of interest and identification of interactions between the concepts. Text mining has been shown to have a potential to significantly reduce the effort of biocurators in all the three activities, and various semi-automatic methodologies have been integrated into curation pipelines to support them. We investigate the suitability of Argo, a workbench for building text-mining solutions with the use of a rich graphical user interface, for the process of biocuration. Central to Argo are customizable workflows that users compose by arranging available elementary analytics to form task-specific processing units. A built-in manual annotation editor is the single most used biocuration tool of the workbench, as it allows users to create annotations directly in text, as well as modify or delete annotations created by automatic processing components. Apart from syntactic and semantic analytics, the ever-growing library of components includes several data readers and consumers that support well-established as well as emerging data interchange formats such as XMI, RDF and BioC, which facilitate the interoperability of Argo with other platforms or resources. To validate the suitability of Argo for curation activities, we participated in the BioCreative IV challenge whose purpose was to evaluate Web-based systems addressing user-defined biocuration tasks. Argo proved to have the edge over other systems in terms of flexibility of defining biocuration tasks. As expected, the versatility of the workbench inevitably lengthened the time the curators spent on learning the system before taking on the task, which may have affected the usability of Argo. The participation in the challenge gave us an opportunity to gather valuable feedback and identify areas of improvement, some of which have already been introduced. Database URL: http://argo.nactem.ac.uk.

  17. EnvMine: A text-mining system for the automatic extraction of contextual information

    Directory of Open Access Journals (Sweden)

    de Lorenzo Victor

    2010-06-01

Full Text Available Abstract Background For ecological studies, it is crucial to count on adequate descriptions of the environments and samples being studied. Such a description must be done in terms of their physicochemical characteristics, allowing a direct comparison between different environments that would be difficult to do otherwise. Also the characterization must include the precise geographical location, to make possible the study of geographical distributions and biogeographical patterns. Currently, there is no schema for annotating these environmental features, and these data have to be extracted from textual sources (published articles). So far, this had to be performed by manual inspection of the corresponding documents. To facilitate this task, we have developed EnvMine, a set of text-mining tools devoted to retrieve contextual information (physicochemical variables and geographical locations) from textual sources of any kind. Results EnvMine is capable of retrieving the physicochemical variables cited in the text by means of the accurate identification of their associated units of measurement. In this task, the system achieves a recall (percentage of items retrieved) of 92% with less than 1% error. Also a Bayesian classifier was tested for distinguishing parts of the text describing environmental characteristics from others dealing with, for instance, experimental settings. Regarding the identification of geographical locations, the system takes advantage of existing databases such as GeoNames to achieve 86% recall with 92% precision. The identification of a location also includes the determination of its exact coordinates (latitude and longitude), thus allowing the calculation of distances between the individual locations. Conclusion EnvMine is a very efficient method for extracting contextual information from different text sources, like published articles or web pages. This tool can help in determining the precise location and physicochemical

  18. Bio-SCoRes: A Smorgasbord Architecture for Coreference Resolution in Biomedical Text.

    Science.gov (United States)

    Kilicoglu, Halil; Demner-Fushman, Dina

    2016-01-01

    Coreference resolution is one of the fundamental and challenging tasks in natural language processing. Resolving coreference successfully can have a significant positive effect on downstream natural language processing tasks, such as information extraction and question answering. The importance of coreference resolution for biomedical text analysis applications has increasingly been acknowledged. One of the difficulties in coreference resolution stems from the fact that distinct types of coreference (e.g., anaphora, appositive) are expressed with a variety of lexical and syntactic means (e.g., personal pronouns, definite noun phrases), and that resolution of each combination often requires a different approach. In the biomedical domain, it is common for coreference annotation and resolution efforts to focus on specific subcategories of coreference deemed important for the downstream task. In the current work, we aim to address some of these concerns regarding coreference resolution in biomedical text. We propose a general, modular framework underpinned by a smorgasbord architecture (Bio-SCoRes), which incorporates a variety of coreference types, their mentions and allows fine-grained specification of resolution strategies to resolve coreference of distinct coreference type-mention pairs. For development and evaluation, we used a corpus of structured drug labels annotated with fine-grained coreference information. In addition, we evaluated our approach on two other corpora (i2b2/VA discharge summaries and protein coreference dataset) to investigate its generality and ease of adaptation to other biomedical text types. Our results demonstrate the usefulness of our novel smorgasbord architecture. The specific pipelines based on the architecture perform successfully in linking coreferential mention pairs, while we find that recognition of full mention clusters is more challenging. The corpus of structured drug labels (SPL) as well as the components of Bio-SCoRes and

  19. On Utilization and Importance of Perl Status Reporter (SRr) in Text Mining

    OpenAIRE

    Sugam Sharma; Tzusheng Pei; Hari Cohly

    2010-01-01

    In Bioinformatics, text mining (sometimes used interchangeably with text data mining) is a process for deriving high-quality information from text. Perl Status Reporter (SRr) [1] is a tool for fetching data from a flat text file, and in this research paper we illustrate the use of SRr in text/data mining. SRr needs a flat text input file on which the mining process is to be performed. SRr reads the input file and derives the high-quality information from it. Typically text mining tasks are text categorization, te...

  20. AuDis: an automatic CRF-enhanced disease normalization in biomedical text.

    Science.gov (United States)

    Lee, Hsin-Chun; Hsu, Yi-Yu; Kao, Hung-Yu

    2016-01-01

    Diseases play central roles in many areas of biomedical research and healthcare. Consequently, aggregating disease knowledge and treatment research reports becomes an extremely critical issue, especially in rapidly growing knowledge bases (e.g. PubMed). We therefore developed a system, AuDis, for disease mention recognition and normalization in biomedical texts. Our system utilizes an order-two conditional random fields model. To optimize the results, we customize several post-processing steps, including abbreviation resolution, consistency improvement and stopword filtering. In the official evaluation of the CDR task at BioCreative V, AuDis obtained the best performance (F-score of 86.46%) among 40 runs (16 unique teams) on disease normalization in the DNER subtask. These results suggest that AuDis is a high-performance system for disease recognition and normalization from the biomedical literature. Database URL: http://ikmlab.csie.ncku.edu.tw/CDR2015/AuDis.html. PMID:27278815
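
    The post-processing pipeline described above (abbreviation resolution followed by lexicon lookup) can be sketched as a minimal dictionary-based normalizer. This is a hypothetical illustration, not the AuDis implementation; the lexicon entries and abbreviation map below are invented examples.

```python
# Hypothetical sketch of dictionary-based disease mention normalization with
# abbreviation resolution; lexicon and abbreviation map are invented examples.

DISEASE_LEXICON = {
    "type 2 diabetes mellitus": "MESH:D003924",
    "alzheimer disease": "MESH:D000544",
}

# e.g. harvested from "long form (short form)" patterns in the document
ABBREVIATIONS = {
    "t2dm": "type 2 diabetes mellitus",
    "ad": "alzheimer disease",
}

def normalize(mention):
    """Map a recognized disease mention to a concept identifier (or None)."""
    key = mention.lower().strip()
    key = ABBREVIATIONS.get(key, key)  # abbreviation resolution
    return DISEASE_LEXICON.get(key)    # lexicon lookup

print(normalize("T2DM"))  # MESH:D003924
```

    A real system would add the consistency and stopword-filtering steps mentioned in the abstract on top of this lookup.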

  1. Processing the Text of the Holy Quran: a Text Mining Study

    Directory of Open Access Journals (Sweden)

    Mohammad Alhawarat

    2015-02-01

    Full Text Available The Holy Quran is the reference book for more than 1.6 billion Muslims all around the world. Extracting information and knowledge from the Holy Quran is of high benefit for both people specialized in Islamic studies and non-specialized people. This paper initiates a series of research studies that aim to serve the Holy Quran and provide helpful and accurate information and knowledge to all human beings. Also, the planned research studies aim to lay out a framework that will be used by researchers in the field of Arabic natural language processing by providing a “Golden Dataset” along with useful techniques and information that will advance this field further. The aim of this paper is to find an approach for analyzing Arabic text and then providing statistical information which might be helpful for people in this research area. In this paper the Holy Quran text is preprocessed and then different text mining operations are applied to it to reveal simple facts about the terms of the Holy Quran. The results show a variety of characteristics of the Holy Quran such as its most important words, its wordcloud and chapters with high term frequencies. All these results are based on term frequencies that are calculated using both Term Frequency (TF) and Term Frequency-Inverse Document Frequency (TF-IDF) methods.
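
    The two weighting schemes named above can be sketched in a few lines. The toy corpus below is an invented English stand-in for the tokenized chapter texts, purely for illustration.

```python
import math
from collections import Counter

# Toy corpus standing in for tokenized chapters; raw TF and TF-IDF weighting.
corpus = [
    "god is merciful and god is great".split(),
    "the prophet spoke of mercy".split(),
    "great is the reward".split(),
]

def tf(term, doc):
    """Raw term frequency, normalized by document length."""
    return Counter(doc)[term] / len(doc)

def idf(term, docs):
    """Inverse document frequency: log(N / document frequency)."""
    df = sum(1 for d in docs if term in d)
    return math.log(len(docs) / df) if df else 0.0

def tfidf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

# "god" is frequent in the first document and absent elsewhere,
# so it gets a high TF-IDF weight there:
print(round(tfidf("god", corpus[0], corpus), 3))  # 0.314
```

    TF alone favors words that are frequent everywhere; the IDF factor downweights them, which is why the two methods surface different "important words".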

  2. A novel procedure on next generation sequencing data analysis using text mining algorithm

    OpenAIRE

    Zhao, Weizhong; Chen, James J.; Perkins, Roger; Wang, Yuping; Liu, Zhichao; Hong, Huixiao; Tong, Weida; ZOU, WEN

    2016-01-01

    Background Next-generation sequencing (NGS) technologies have provided researchers with vast possibilities in various biological and biomedical research areas. Efficient data mining strategies are in high demand for large scale comparative and evolutional studies to be performed on the large amounts of data derived from NGS projects. Topic modeling is an active research field in machine learning and has been mainly used as an analytical tool to structure large textual corpora for data mining....

  3. Annotated chemical patent corpus: a gold standard for text mining.

    Directory of Open Access Journals (Sweden)

    Saber A Akhondi

    Full Text Available Exploring the chemical and biological space covered by patent applications is crucial in early-stage medicinal chemistry activities. Patent analysis can provide understanding of compound prior art, novelty checking, validation of biological assays, and identification of new starting points for chemical exploration. Extracting chemical and biological entities from patents through manual curation by experts can take a substantial amount of time and resources. Text mining methods can help to ease this process. To validate the performance of such methods, a manually annotated patent corpus is essential. In this study we have produced a large gold standard chemical patent corpus. We developed annotation guidelines and selected 200 full patents from the World Intellectual Property Organization, United States Patent and Trademark Office, and European Patent Office. The patents were pre-annotated automatically and made available to four independent annotator groups each consisting of two to ten annotators. The annotators marked chemicals in different subclasses, diseases, targets, and modes of action. Spelling mistakes and spurious line breaks due to optical character recognition errors were also annotated. A subset of 47 patents was annotated by at least three annotator groups, from which harmonized annotations and inter-annotator agreement scores were derived. One group annotated the full set. The patent corpus includes 400,125 annotations for the full set and 36,537 annotations for the harmonized set. All patents and annotated entities are publicly available at www.biosemantics.org.

  4. Supporting the education evidence portal via text mining.

    Science.gov (United States)

    Ananiadou, Sophia; Thompson, Paul; Thomas, James; Mu, Tingting; Oliver, Sandy; Rickinson, Mark; Sasaki, Yutaka; Weissenbacher, Davy; McNaught, John

    2010-08-28

    The UK Education Evidence Portal (eep) provides a single, searchable, point of access to the contents of the websites of 33 organizations relating to education, with the aim of revolutionizing work practices for the education community. Use of the portal alleviates the need to spend time searching multiple resources to find relevant information. However, the combined content of the websites of interest is still very large (over 500,000 documents and growing). This means that searches using the portal can produce very large numbers of hits. As users often have limited time, they would benefit from enhanced methods of performing searches and viewing results, allowing them to drill down to information of interest more efficiently, without having to sift through potentially long lists of irrelevant documents. The Joint Information Systems Committee (JISC)-funded ASSIST project has produced a prototype web interface to demonstrate the applicability of integrating a number of text-mining tools and methods into the eep, to facilitate an enhanced searching, browsing and document-viewing experience. New features include automatic classification of documents according to a taxonomy, automatic clustering of search results according to similar document content, and automatic identification and highlighting of key terms within documents. PMID:20643679

  5. PubMedPortable: A Framework for Supporting the Development of Text Mining Applications

    Science.gov (United States)

    Döring, Kersten; Grüning, Björn A.; Telukunta, Kiran K.; Thomas, Philippe; Günther, Stefan

    2016-01-01

    Information extraction from biomedical literature is continuously growing in scope and importance. Many tools exist that perform named entity recognition, e.g. of proteins, chemical compounds, and diseases. Furthermore, several approaches deal with the extraction of relations between identified entities. The BioCreative community supports these developments with yearly open challenges, which led to a standardised XML text annotation format called BioC. PubMed provides access to the largest open biomedical literature repository, but there is no unified way of connecting its data to natural language processing tools. Therefore, an appropriate data environment is needed as a basis to combine different software solutions and to develop customised text mining applications. PubMedPortable builds a relational database and a full text index on PubMed citations. It can be applied either to the complete PubMed data set or an arbitrary subset of downloaded PubMed XML files. The software provides the infrastructure to combine stand-alone applications by exporting different data formats, e.g. BioC. The presented workflows show how to use PubMedPortable to retrieve, store, and analyse a disease-specific data set. The provided use cases are well documented in the PubMedPortable wiki. The open-source software library is small, easy to use, and scalable to the user’s system requirements. It is freely available for Linux on the web at https://github.com/KerstenDoering/PubMedPortable and for other operating systems as a virtual container. The approach was tested extensively and applied successfully in several projects. PMID:27706202

  6. TEXT MINING AND CLASSIFICATION OF PRODUCT REVIEWS USING STRUCTURED SUPPORT VECTOR MACHINE

    OpenAIRE

    Jincy B. Chrystal; Stephy Joseph

    2015-01-01

    Text mining and Text classification are the two prominent and challenging tasks in the field of Machine learning. Text mining refers to the process of deriving high quality and relevant information from text, while Text classification deals with the categorization of text documents into different classes. The real challenge in these areas is to address the problems like handling large text corpora, similarity of words in text documents, and association of text documents with a ...

  7. A feature representation method for biomedical scientific data based on composite text description

    Institute of Scientific and Technical Information of China (English)

    SUN; Wei

    2009-01-01

    Feature representation is one of the key issues in data clustering. The existing feature representation of scientific data is not sufficient, which to some extent affects the result of scientific data clustering. Therefore, the paper proposes a concept of composite text description (CTD) and a CTD-based feature representation method for biomedical scientific data. The method mainly uses different feature weight algorithms to represent candidate features based on two types of data sources respectively, then combines and finally strengthens the two feature sets. Experiments show that, compared with traditional methods, the feature representation method is more effective and can significantly improve the performance of biomedical data clustering.

  8. An integrated text mining framework for metabolic interaction network reconstruction.

    Science.gov (United States)

    Patumcharoenpol, Preecha; Doungpan, Narumol; Meechai, Asawin; Shen, Bairong; Chan, Jonathan H; Vongsangnak, Wanwipa

    2016-01-01

    Text mining (TM) in the field of biology is fast becoming a routine analysis for the extraction and curation of biological entities (e.g., genes, proteins, simple chemicals) as well as their relationships. Due to the wide applicability of TM in situations involving complex relationships, it is valuable to apply TM to the extraction of metabolic interactions (i.e., enzyme and metabolite interactions) through metabolic events. Here we present an integrated TM framework containing two modules for the extraction of metabolic events (Metabolic Event Extraction module-MEE) and for the construction of a metabolic interaction network (Metabolic Interaction Network Reconstruction module-MINR). The proposed integrated TM framework performed well based on standard measures of recall, precision and F-score. Evaluation of the MEE module using the constructed Metabolic Entities (ME) corpus yielded F-scores of 59.15% and 48.59% for the detection of metabolic events for production and consumption, respectively. As for the testing of the entity tagger for Gene and Protein (GP) and metabolite with the test corpus, the obtained F-score was greater than 80% for the Superpathway of leucine, valine, and isoleucine biosynthesis. Mapping of enzyme and metabolite interactions through network reconstruction showed a fair performance for the MINR module on the test corpus with F-score >70%. Finally, an application of our integrated TM framework on large-scale data (i.e., EcoCyc extraction data) for reconstructing a metabolic interaction network showed reasonable precisions of 69.93%, 70.63% and 46.71% for enzyme, metabolite and enzyme-metabolite interaction, respectively. This study presents the first open-source integrated TM framework for reconstructing a metabolic interaction network. This framework can be a powerful tool that helps biologists to extract metabolic events for further reconstruction of a metabolic interaction network. The ME corpus, test corpus, source code, and virtual

  9. Human-centered text mining: a new software system

    NARCIS (Netherlands)

    J. Poelmans; P. Elzinga; A.A. Neznanov; G. Dedene; S. Viaene; S. Kuznetsov

    2012-01-01

    In this paper we introduce a novel human-centered data mining software system which was designed to gain intelligence from unstructured textual data. The architecture takes its roots in several case studies which were a collaboration between the Amsterdam-Amstelland Police, GasthuisZusters Antwerpen

  10. An Intelligent Agent Based Text-Mining System: Presenting Concept through Design Approach

    OpenAIRE

    Kaustubh S. Raval; Ranjeetsingh S.Suryawanshi; Devendra M. Thakore

    2011-01-01

    Text mining is a variation on a field called data mining and refers to the process of deriving high-quality information from unstructured text. In text mining the goal is to discover unknown information, something that may not be known by people. The aim here is to design an intelligent agent based text-mining system which reads the text (input) and, based on the keyword, provides the matching documents (in the form of links) or options (statements) according to the user’s query. In this ...

  11. Generation of silver standard concept annotations from biomedical texts with special relevance to phenotypes.

    Directory of Open Access Journals (Sweden)

    Anika Oellrich

    Full Text Available Electronic health records and scientific articles possess differing linguistic characteristics that may impact the performance of natural language processing tools developed for one or the other. In this paper, we investigate the performance of four extant concept recognition tools: the clinical Text Analysis and Knowledge Extraction System (cTAKES), the National Center for Biomedical Ontology (NCBO) Annotator, the Biomedical Concept Annotation System (BeCAS) and MetaMap. Each of the four concept recognition systems is applied to four different corpora: the i2b2 corpus of clinical documents, a PubMed corpus of Medline abstracts, a clinical trials corpus and the ShARe/CLEF corpus. In addition, we assess the individual system performances with respect to one gold standard annotation set, available for the ShARe/CLEF corpus. Furthermore, we built a silver standard annotation set from the individual systems' output and assess the quality as well as the contribution of individual systems to the quality of the silver standard. Our results demonstrate that mainly the NCBO Annotator and cTAKES contribute to the silver standard corpora (F1-measures in the range of 21% to 74%) and their quality (best F1-measure of 33%), independent of the type of text investigated. While BeCAS and MetaMap can contribute to the precision of silver standard annotations (precision of up to 42%), the F1-measure drops when combined with NCBO Annotator and cTAKES due to a low recall. In conclusion, the performances of individual systems need to be improved independently from the text types, and the leveraging strategies to best take advantage of individual systems' annotations need to be revised. The textual content of the PubMed corpus, accession numbers for the clinical trials corpus, and assigned annotations of the four concept recognition systems as well as the generated silver standard annotation sets are available from http://purl.org/phenotype/resources.

  12. Significant Term List Based Metadata Conceptual Mining Model for Effective Text Clustering

    Directory of Open Access Journals (Sweden)

    J. Janet

    2012-01-01

    Full Text Available As the engineering world is growing fast, the usage of data in the day-to-day activities of the engineering industry is also growing rapidly. In order to handle huge data stores and to find the hidden knowledge in them, data mining is very helpful. Text mining, network mining, multimedia mining and trend analysis are a few applications of data mining. In text mining, a variety of methods have been proposed by many researchers, yet high precision and good recall remain critical issues. This study focuses on text mining and applies a conceptual mining model for improved clustering. The proposed work, termed the Metadata Conceptual Mining Model (MCMM), is validated with a few world-leading technical digital library data sets such as IEEE, ACM and Scopus. The performance, derived as precision and recall, is described in terms of entropy and F-measure, which are calculated and compared with an existing term-based model and a concept-based mining model.
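
    Precision, recall and F-measure, the comparison metrics named in the abstract above, reduce to simple ratios over true positives, false positives and false negatives. A minimal sketch with invented counts:

```python
# Standard retrieval/clustering evaluation ratios; the example counts are invented.

def precision(tp, fp):
    """Fraction of retrieved items that are correct."""
    return tp / (tp + fp) if tp + fp else 0.0

def recall(tp, fn):
    """Fraction of relevant items that were retrieved."""
    return tp / (tp + fn) if tp + fn else 0.0

def f_measure(tp, fp, fn):
    """Harmonic mean of precision and recall."""
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r) if p + r else 0.0

# e.g. 8 correct assignments, 2 spurious, 2 missed:
print(round(f_measure(8, 2, 2), 3))  # 0.8
```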

  13. Text-mining and information-retrieval services for molecular biology

    OpenAIRE

    Krallinger, Martin; Valencia, Alfonso

    2005-01-01

    Text-mining in molecular biology - defined as the automatic extraction of information about genes, proteins and their functional relationships from text documents - has emerged as a hybrid discipline on the edges of the fields of information science, bioinformatics and computational linguistics. A range of text-mining applications have been developed recently that will improve access to knowledge for biologists and database annotators.

  14. A Consistent Web Documents Based Text Clustering Using Concept Based Mining Model

    OpenAIRE

    V.M.Navaneethakumar; C Chandrasekar

    2012-01-01

    Text mining is a growing innovative field that endeavors to collect significant information from natural language text. It might be loosely characterized as the process of examining texts to extract information that is practical for particular purposes. In this case, the mining model can capture terms that identify the concepts of the sentence or document, which tends to detect the subject of the document. In an existing work, the concept-based mining model is used only for n...

  15. Enhancing navigation in biomedical databases by community voting and database-driven text classification

    Directory of Open Access Journals (Sweden)

    Guettler Daniel

    2009-10-01

    community, is completely controlled by the database, scales well with concurrent change events, and can be adapted to add text classification capability to other biomedical databases. The system can be accessed at http://pepbank.mgh.harvard.edu.

  16. Opinion Mining in Latvian Text Using Semantic Polarity Analysis and Machine Learning Approach

    Directory of Open Access Journals (Sweden)

    Gatis Špats

    2016-07-01

    Full Text Available In this paper we demonstrate approaches for opinion mining in Latvian text. Authors have applied, combined and extended results of several previous studies and public resources to perform opinion mining in Latvian text using two approaches, namely, semantic polarity analysis and machine learning. One of the most significant constraints that make application of opinion mining for written content classification in Latvian text challenging is the limited publicly available text corpora for classifier training. We have joined several sources and created a publicly available extended lexicon. Our results are comparable to or outperform current achievements in opinion mining in Latvian. Experiments show that lexicon-based methods provide more accurate opinion mining than the application of a Naive Bayes machine learning classifier on Latvian tweets. Methods used during this study could be further extended using human annotators, unsupervised machine learning and bootstrapping to create larger corpora of classified text.
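
    The lexicon-based side of the comparison above can be sketched as a simple signed word count. The English lexicon entries below are invented stand-ins for the extended Latvian lexicon the authors built:

```python
# Hypothetical sentiment lexicon; real entries would be Latvian word forms.
LEXICON = {"good": 1, "great": 1, "happy": 1, "bad": -1, "awful": -1}

def polarity(text):
    """Sum lexicon scores of the tokens and map the sign to a label."""
    score = sum(LEXICON.get(tok, 0) for tok in text.lower().split())
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(polarity("a great and happy day"))  # positive
```

    Such methods need no labeled training data, which is exactly why they remain competitive in low-resource languages where classifier training corpora are scarce.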

  17. Getting more out of biomedical documents with GATE's full lifecycle open source text analytics.

    Directory of Open Access Journals (Sweden)

    Hamish Cunningham

    Full Text Available This software article describes the GATE family of open source text analysis tools and processes. GATE is one of the most widely used systems of its type with yearly download rates of tens of thousands and many active users in both academic and industrial contexts. In this paper we report three examples of GATE-based systems operating in the life sciences and in medicine. First, in genome-wide association studies which have contributed to discovery of a head and neck cancer mutation association. Second, medical records analysis which has significantly increased the statistical power of treatment/outcome models in the UK's largest psychiatric patient cohort. Third, richer constructs in drug-related searching. We also explore the ways in which the GATE family supports the various stages of the lifecycle present in our examples. We conclude that the deployment of text mining for document abstraction or rich search and navigation is best thought of as a process, and that with the right computational tools and data collection strategies this process can be made defined and repeatable. The GATE research programme is now 20 years old and has grown from its roots as a specialist development tool for text processing to become a rather comprehensive ecosystem, bringing together software developers, language engineers and research staff from diverse fields. GATE now has a strong claim to cover a uniquely wide range of the lifecycle of text analysis systems. It forms a focal point for the integration and reuse of advances that have been made by many people (the majority outside of the authors' own group) who work in text processing for biomedicine and other areas. GATE is available online under GNU open source licences and runs on all major operating systems. Support is available from an active user and developer community and also on a commercial basis.

  18. An overview of the biocreative 2012 workshop track III: Interactive text mining task

    Science.gov (United States)

    An important question is how to make use of text mining to enhance the biocuration workflow. A number of groups have developed tools for text mining from a computer science/linguistics perspective and there are many initiatives to curate some aspect of biology from the literature. In some cases the ...

  19. Getting more out of biomedical documents with GATE's full lifecycle open source text analytics.

    Science.gov (United States)

    Cunningham, Hamish; Tablan, Valentin; Roberts, Angus; Bontcheva, Kalina

    2013-01-01

    This software article describes the GATE family of open source text analysis tools and processes. GATE is one of the most widely used systems of its type with yearly download rates of tens of thousands and many active users in both academic and industrial contexts. In this paper we report three examples of GATE-based systems operating in the life sciences and in medicine. First, in genome-wide association studies which have contributed to discovery of a head and neck cancer mutation association. Second, medical records analysis which has significantly increased the statistical power of treatment/outcome models in the UK's largest psychiatric patient cohort. Third, richer constructs in drug-related searching. We also explore the ways in which the GATE family supports the various stages of the lifecycle present in our examples. We conclude that the deployment of text mining for document abstraction or rich search and navigation is best thought of as a process, and that with the right computational tools and data collection strategies this process can be made defined and repeatable. The GATE research programme is now 20 years old and has grown from its roots as a specialist development tool for text processing to become a rather comprehensive ecosystem, bringing together software developers, language engineers and research staff from diverse fields. GATE now has a strong claim to cover a uniquely wide range of the lifecycle of text analysis systems. It forms a focal point for the integration and reuse of advances that have been made by many people (the majority outside of the authors' own group) who work in text processing for biomedicine and other areas. GATE is available online under GNU open source licences and runs on all major operating systems. Support is available from an active user and developer community and also on a commercial basis.

  20. Benchmarking of the 2010 BioCreative Challenge III text-mining competition by the BioGRID and MINT interaction databases

    OpenAIRE

    Krallinger, Martin; Vazquez, Miguel; Leitner, Florian; Salgado, David; Chatr-aryamontri, Andrew; Winter, Andrew; Perfetto, Livia; Briganti, Leonardo; Licata, Luana; Iannuccelli, Marta; Castagnoli, Luisa; Cesareni, Gianni; Tyers, Mike; Schneider, Gerold; Rinaldi, Fabio

    2011-01-01

    Abstract Background Determining usefulness of biomedical text mining systems requires realistic task definition and data selection criteria without artificial constraints, measuring performance aspects that go beyond traditional metrics. The BioCreative III Protein-Protein Interaction (PPI) tasks were motivated by such considerations, trying to address aspects including how the end user would oversee the generated output, for instance by providing ranked results, textual evidence fo...

  1. BICEPP: an example-based statistical text mining method for predicting the binary characteristics of drugs

    Directory of Open Access Journals (Sweden)

    Tsafnat Guy

    2011-04-01

    Full Text Available Abstract Background The identification of drug characteristics is a clinically important task, but it requires much expert knowledge and consumes substantial resources. We have developed a statistical text-mining approach (BInary Characteristics Extractor and biomedical Properties Predictor: BICEPP) to help experts screen drugs that may have important clinical characteristics of interest. Results BICEPP first retrieves MEDLINE abstracts containing drug names, then selects tokens that best predict the list of drugs which represents the characteristic of interest. Machine learning is then used to classify drugs using a document frequency-based measure. Evaluation experiments were performed to validate BICEPP's performance on 484 characteristics of 857 drugs, identified from the Australian Medicines Handbook (AMH) and the PharmacoKinetic Interaction Screening (PKIS) database. Stratified cross-validations revealed that BICEPP was able to classify drugs into all 20 major therapeutic classes (100%) and 157 (of 197) minor drug classes (80%) with areas under the receiver operating characteristic curve (AUC) > 0.80. Similarly, AUC > 0.80 could be obtained in the classification of 173 (of 238) adverse events (73%), up to 12 (of 15) groups of clinically significant cytochrome P450 enzyme (CYP) inducers or inhibitors (80%), and up to 11 (of 14) groups of narrow therapeutic index drugs (79%). Interestingly, it was observed that the keywords used to describe a drug characteristic were not necessarily the most predictive ones for the classification task. Conclusions BICEPP has sufficient classification power to automatically distinguish a wide range of clinical properties of drugs. This may be used in pharmacovigilance applications to assist with rapid screening of large drug databases to identify important characteristics for further evaluation.
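
    The AUC figure used throughout the evaluation above has a direct probabilistic reading: the chance that a randomly chosen positive example is scored higher than a randomly chosen negative one. A minimal sketch on invented scores (not BICEPP's actual classifier output):

```python
def auc(scores, labels):
    """Area under the ROC curve via the pairwise ranking probability
    (a tie between a positive and a negative counts as half a win)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = [(p > n) + 0.5 * (p == n) for p in pos for n in neg]
    return sum(wins) / len(wins)

# Perfectly separated scores give AUC = 1.0:
print(auc([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0]))  # 1.0
```

    An AUC of 0.80, the threshold reported above, means the classifier ranks a random positive drug above a random negative one 80% of the time.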

  2. A Two Step Data Mining Approach for Amharic Text Classification

    Directory of Open Access Journals (Sweden)

    Seffi Gebeyehu

    2016-08-01

    Full Text Available Traditionally, text classifiers are built from labeled training examples (supervised). Labeling is usually done manually by human experts (or the users), which is a labor intensive and time consuming process. In the past few years, researchers have investigated various forms of semi-supervised learning to reduce the burden of manual labeling. This paper aims to show that the accuracy of learned text classifiers can be improved by augmenting a small number of labeled training documents with a large pool of unlabeled documents. This is important because in many text classification problems obtaining training labels is expensive, while large quantities of unlabeled documents are readily available. In this paper, we implement an algorithm for learning from labeled and unlabeled documents based on the combination of Expectation-Maximization (EM) and two classifiers: Naive Bayes (NB) and locally weighted learning (LWL). NB first trains a classifier using the available labeled documents, and probabilistically labels the unlabeled documents, while LWL uses a class of function approximation to build a model around the current point of interest. An experiment conducted on a mixture of labeled and unlabeled Amharic text documents showed that the new method achieved a significant performance improvement in comparison with supervised LWL and NB. The result also pointed out that the use of unlabeled data with EM reduces the classification absolute error by 27.6%. In general, since unlabeled documents are much less expensive and easier to collect than labeled documents, this method will be useful for text categorization tasks including online data sources such as web pages, e-mails and news group postings. If one uses this method, building text categorization systems will be significantly faster and less expensive than the supervised learning approach.
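
    The EM-with-Naive-Bayes idea described above (train NB on the labeled documents, probabilistically label the unlabeled ones, then retrain with those soft labels) can be sketched as one EM round. The toy English documents below are invented stand-ins for the Amharic corpus, and the LWL component is omitted for brevity:

```python
import math
from collections import Counter, defaultdict

labeled = [("good great fine", "pos"), ("bad awful poor", "neg")]
unlabeled = ["great fine day", "awful poor luck"]

def train(docs):
    """docs: list of (text, {class: weight}) pairs with soft class weights."""
    prior, counts = Counter(), defaultdict(Counter)
    for text, weights in docs:
        for c, w in weights.items():
            prior[c] += w
            for tok in text.split():
                counts[c][tok] += w
    return prior, counts

def predict(text, prior, counts, classes=("pos", "neg")):
    """Posterior class probabilities under multinomial NB with add-one smoothing."""
    vocab = {t for c in counts for t in counts[c]}
    logp = {}
    for c in classes:
        total = sum(counts[c].values())
        logp[c] = math.log(prior[c] / sum(prior.values()))
        for tok in text.split():
            logp[c] += math.log((counts[c][tok] + 1) / (total + len(vocab)))
    z = max(logp.values())
    unnorm = {c: math.exp(v - z) for c, v in logp.items()}
    s = sum(unnorm.values())
    return {c: p / s for c, p in unnorm.items()}

# Train on labeled data, soft-label the unlabeled pool, retrain once (one EM round):
docs = [(t, {c: 1.0}) for t, c in labeled]
prior, counts = train(docs)
soft = [(t, predict(t, prior, counts)) for t in unlabeled]
prior, counts = train(docs + soft)

probs = predict("great day", prior, counts)
print(max(probs, key=probs.get))  # pos
```

    Note that "day" never occurs in the labeled documents; it only becomes informative after the unlabeled pool is folded in, which is precisely the benefit the abstract reports.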

  3. A Formal Framework on the Semantics of Regulatory Relations and Their Presence as Verbs in Biomedical Texts

    DEFF Research Database (Denmark)

    Zambach, Sine

    2009-01-01

    on the logical properties of positive and negative regulations, both as formal relations and the frequency of their usage as verbs in texts. The paper discusses whether there exists a weak transitivity-like property for the relations. Our corpora consist of biomedical patents, Medline abstracts and the British...

  4. Ontology-based retrieval of bio-medical information based on microarray text corpora

    DEFF Research Database (Denmark)

    Hansen, Kim Allan; Zambach, Sine; Have, Christian Theil

degree. We explore the possibilities of retrieving biomedical information from microarrays in Gene Expression Omnibus (GEO), of which we have indexed a sample semantically, as a first step towards ontology based searches. Through an example we argue that it is possible to improve the retrieval......Microarray technology is often used in gene expression experiments. Information retrieval in the context of microarrays has mainly been concerned with the analysis of the numeric data produced; however, the experiments are often annotated with textual metadata. Although biomedical resources...

  5. Logical implications for regulatory relations represented by verbs in biomedical texts

    DEFF Research Database (Denmark)

    Zambach, Sine

    Relations used in biomedical ontologies can be very general or very specific in respect to the domain. However, some relations are used widely in for example regulatory networks. This work focuses on positive and negative regulatory relations, in particular their usage expressed as verbs in diffe......Relations used in biomedical ontologies can be very general or very specific in respect to the domain. However, some relations are used widely in for example regulatory networks. This work focuses on positive and negative regulatory relations, in particular their usage expressed as verbs...

  6. Feature Engineering for Drug Name Recognition in Biomedical Texts: Feature Conjunction and Feature Selection

    Directory of Open Access Journals (Sweden)

    Shengyu Liu

    2015-01-01

Full Text Available Drug name recognition (DNR) is a critical step for drug information extraction. Machine learning-based methods have been widely used for DNR with various types of features such as part-of-speech, word shape, and dictionary features. Features used in current machine learning-based methods are usually singleton features, possibly because combining singleton features into conjunction features leads to a feature explosion and a large number of noisy features. However, singleton features, each of which can only capture one linguistic characteristic of a word, are not sufficient to describe the information needed for DNR when multiple characteristics should be considered. In this study, we explore feature conjunction and feature selection for DNR, which have not been reported before. We intuitively select 8 types of singleton features and combine them into conjunction features in two ways. Then, chi-square, mutual information, and information gain are used to mine effective features. Experimental results show that feature conjunction and feature selection can improve the performance of the DNR system with a moderate number of features, and that our DNR system significantly outperforms the best system in the DDIExtraction 2013 challenge.
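The chi-square feature ranking mentioned above can be illustrated with a minimal sketch. This is not the paper's system: the 2x2 contingency computation is standard, but the toy feature sets (e.g. a capitalization flag `Cap`) and the one-vs-rest scoring scheme are illustrative assumptions.

```python
def chi_square(n11, n10, n01, n00):
    """2x2 chi-square statistic for a feature/class contingency table.
    n11: docs in class with the feature; n10: docs in class without it;
    n01: docs outside the class with it; n00: neither."""
    n = n11 + n10 + n01 + n00
    num = n * (n11 * n00 - n10 * n01) ** 2
    den = (n11 + n01) * (n10 + n00) * (n11 + n10) * (n01 + n00)
    return num / den if den else 0.0

def select_features(docs, k=2):
    """docs: list of (set_of_features, label). Score each feature against each
    label one-vs-rest and keep the k highest-scoring features (ties broken
    alphabetically for determinism)."""
    feats = {f for fs, _ in docs for f in fs}
    labels = {l for _, l in docs}
    scores = {}
    for f in feats:
        best = 0.0
        for l in labels:
            n11 = sum(1 for fs, y in docs if f in fs and y == l)
            n10 = sum(1 for fs, y in docs if f not in fs and y == l)
            n01 = sum(1 for fs, y in docs if f in fs and y != l)
            n00 = sum(1 for fs, y in docs if f not in fs and y != l)
            best = max(best, chi_square(n11, n10, n01, n00))
        scores[f] = best
    return sorted(feats, key=lambda f: (-scores[f], f))[:k]

docs = [({"aspirin", "Cap"}, "drug"), ({"ibuprofen", "Cap"}, "drug"),
        ({"table", "low"}, "other"), ({"chair", "low"}, "other")]
print(select_features(docs, k=2))  # → ['Cap', 'low']
```

Mutual information and information gain would slot into the same loop in place of `chi_square`; only the scoring function changes.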

  7. Signal Detection Framework Using Semantic Text Mining Techniques

    Science.gov (United States)

    Sudarsan, Sithu D.

    2009-01-01

Signal detection is a challenging task for regulatory and intelligence agencies. Subject matter experts in those agencies analyze documents, generally containing narrative text, in a time-bound manner, looking for signals through identification, evaluation and confirmation, leading to follow-up action, e.g., recalling a defective product or issuing a public advisory for…

  8. On Utilization and Importance of Perl Status Reporter (SRr) in Text Mining

    CERN Document Server

    Sharma, Sugam; Cohly, Hari

    2010-01-01

In bioinformatics, text mining and text data mining, terms sometimes used interchangeably, refer to the process of deriving high-quality information from text. Perl Status Reporter (SRr) is a data-fetching tool for flat text files, and in this research paper we illustrate the use of SRr in text or data mining. SRr needs a flat text input file on which the mining process is to be performed. SRr reads the input file and derives the high-quality information from it. Typical text mining tasks are text categorization, text clustering, concept and entity extraction, and document summarization. SRr can be utilized for any of these tasks with little or no customization effort. In our implementation we perform a text categorization mining operation on the input file. The input file has two parameters of interest (firstKey and secondKey). The composition of these two parameters describes the uniqueness of entries in that file, in the same manner as a composite key in a database. SRr reads the input file line by line and extracts the parameter...

  9. A Survey of Topic Modeling in Text Mining

    Directory of Open Access Journals (Sweden)

    Rubayyi Alghamdi

    2015-01-01

Full Text Available Topic models provide a convenient way to analyze large amounts of unclassified text. A topic contains a cluster of words that frequently occur together. Topic modeling can connect words with similar meanings and distinguish between uses of words with multiple meanings. This paper covers two categories within the field of topic modeling. The first discusses methods of topic modeling, four of which are considered: latent semantic analysis (LSA), probabilistic latent semantic analysis (PLSA), latent Dirichlet allocation (LDA), and the correlated topic model (CTM). The second category is called topic evolution models, which model topics by considering an important factor: time. In the second category, different models are discussed, such as topic over time (TOT), dynamic topic models (DTM), multiscale topic tomography, dynamic topic correlation detection, detecting topic evolution in scientific literature, etc.

  10. SOME APPROACHES TO TEXT MINING AND THEIR POTENTIAL FOR SEMANTIC WEB APPLICATIONS

    OpenAIRE

    Jan Paralič; Marek Paralič

    2007-01-01

In this paper we describe some approaches to text mining, supported by an original software system developed in Java for information retrieval and text mining (JBowl), as well as its possible use in a distributed environment. The system JBowl is being developed as open source software with the intention of providing an easily extensible, modular framework for pre-processing, indexing and further exploration of large text collections. The overall architecture of the syst...

  11. OntoPDF: using a text mining pipeline to generate enriched PDF versions of scientific papers

    OpenAIRE

    Zhu, Yi; Rinaldi, Fabio

    2014-01-01

    In this poster we present a recent extension of the OntoGene text mining utilities, which enables the generation of annotated pdf versions of the original articles. While a text-based view (in XML or HTML) can allow a more flexible presentation of the results of a text mining pipeline, for some applications, notably in assisted curation, it might be desirable to present the annotations in the context of the original pdf document.

  12. Using Text Mining to Uncover Students' Technology-Related Problems in Live Video Streaming

    Science.gov (United States)

    Abdous, M'hammed; He, Wu

    2011-01-01

Because of their capacity to sift through large amounts of data, text mining and data mining are enabling higher education institutions to reveal valuable patterns in students' learning behaviours without having to resort to traditional survey methods. In an effort to uncover live video streaming (LVS) students' technology-related problems and to…

  13. An Evaluation of Text Mining Tools as Applied to Selected Scientific and Engineering Literature.

    Science.gov (United States)

    Trybula, Walter J.; Wyllys, Ronald E.

    2000-01-01

    Addresses an approach to the discovery of scientific knowledge through an examination of data mining and text mining techniques. Presents the results of experiments that investigated knowledge acquisition from a selected set of technical documents by domain experts. (Contains 15 references.) (Author/LRW)

  14. TEXT DATA MINING OF ENGLISH BOOKS ON ENVIRONMENTOLOGY

    Directory of Open Access Journals (Sweden)

    Hiromi Ban

    2012-12-01

Full Text Available Recently, to confront environmental problems, a system of “environmentology” is being constructed. In order to study environmentology, reading materials in English is considered indispensable. In this paper, we investigated several English books on environmentology, comparing them with journalism in terms of metrical linguistics. In short, frequency characteristics of character and word appearance were investigated using a program written in C++. These characteristics were approximated by an exponential function. Furthermore, we calculated the percentage of Japanese junior high school required vocabulary and American basic vocabulary to obtain the difficulty level as well as the K-characteristic of each material. As a result, it was clearly shown that English materials on environmentology have a tendency similar to that of literary writings in the characteristics of character appearance. Besides, the values of the K-characteristic for the materials on environmentology are high, and some books are more difficult than TIME magazine.
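A minimal sketch of the kind of frequency analysis described above: character rank-frequency data fitted to an exponential curve by log-linear least squares. The sample text and the fitting details are illustrative assumptions, not the book's C++ program.

```python
import math
from collections import Counter

def char_rank_frequencies(text):
    """Relative frequencies of alphabetic characters, highest first."""
    counts = Counter(c.lower() for c in text if c.isalpha())
    total = sum(counts.values())
    return [n / total for _, n in counts.most_common()]

def fit_exponential(freqs):
    """Least-squares fit of log(freq) = log(a) + b*rank, i.e. freq ≈ a*e^(b*rank)."""
    xs = list(range(1, len(freqs) + 1))
    ys = [math.log(f) for f in freqs]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    a = math.exp(my - b * mx)
    return a, b

text = "environmental problems require global cooperation and careful study"
freqs = char_rank_frequencies(text)
a, b = fit_exponential(freqs)
print(f"fit: a={a:.3f}, b={b:.3f}")  # b is negative: frequency decays with rank
```

The same two-step procedure (rank, then fit on a log scale) applies unchanged to word frequencies.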

  15. Visualization Model for Chinese Text Mining%可视化中文文本挖掘模型

    Institute of Scientific and Technical Information of China (English)

    林鸿飞; 贡大跃; 张跃; 姚天顺

    2000-01-01

This paper briefly describes the background of text mining and the main difficulties in Chinese text mining, presents a visual model for Chinese text mining, and puts forward a method of text categorization based on concepts, a method of text summarization based on statistics, and a method of identifying Chinese names.

  16. Aspects of Text Mining From Computational Semiotics to Systemic Functional Hypertexts

    Directory of Open Access Journals (Sweden)

    Alexander Mehler

    2001-05-01

Full Text Available The significance of natural language texts as the prime information structure for the management and dissemination of knowledge in organisations is still increasing. Making relevant documents available for varying tasks in different contexts is of primary importance for efficient task completion. Meeting this demand requires content-based processing of texts, which makes it possible to reconstruct or, if necessary, explore the relationship of task, context and document. Text mining is a technology suitable for solving problems of this kind. In the following, semiotic aspects of text mining are investigated. Based on the primary object of text mining, natural language lexis, the specific complexity of this class of signs is outlined and requirements for the implementation of text mining procedures are derived. This is done with reference to text linkage, introduced as a special task in text mining. Text linkage refers to the exploration of implicit, content-based relations of texts (and their annotation as typed links) in corpora possibly organised as hypertexts. In this context, the term systemic functional hypertext is introduced, which distinguishes genre and register layers for the management of links in a poly-level hypertext system.

  17. Using text-mining techniques in electronic patient records to identify ADRs from medicine use

    DEFF Research Database (Denmark)

    Warrer, Pernille; Hansen, Ebba Holme; Jensen, Lars Juhl;

    2012-01-01

    This literature review included studies that use text-mining techniques in narrative documents stored in electronic patient records (EPRs) to investigate ADRs. We searched PubMed, Embase, Web of Science and International Pharmaceutical Abstracts without restrictions from origin until July 2011. We...... and populations, various types of ADRs were identified and thus we could not make comparisons across studies. The review underscores the feasibility and potential of text mining to investigate narrative documents in EPRs for ADRs. However, more empirical studies are needed to evaluate whether text mining of EPRs...... included empirically based studies on text mining of electronic patient records (EPRs) that focused on detecting ADRs, excluding those that investigated adverse events not related to medicine use. We extracted information on study populations, EPR data sources, frequencies and types of the identified ADRs...

  18. Text Mining on the Internet%Internet 上的文本数据挖掘

    Institute of Scientific and Technical Information of China (English)

    王伟强; 高文; 段立娟

    2000-01-01

The booming growth of the Internet has made text mining on it a promising research field in practice. This paper briefly introduces several aspects of the topic, including potential applications, techniques used, and existing systems.

  19. Classifying unstructured textual data using the Product Score Model: an alternative text mining algorithm

    NARCIS (Netherlands)

    He, Q.; Veldkamp, B.P.; Eggen, T.J.H.M.; Veldkamp, B.P.

    2012-01-01

    Unstructured textual data such as students’ essays and life narratives can provide helpful information in educational and psychological measurement, but often contain irregularities and ambiguities, which creates difficulties in analysis. Text mining techniques that seek to extract useful informatio

  20. Text mining and visualization case studies using open-source tools

    CERN Document Server

    Chisholm, Andrew

    2016-01-01

Text Mining and Visualization: Case Studies Using Open-Source Tools provides an introduction to text mining using some of the most popular and powerful open-source tools: KNIME, RapidMiner, Weka, R, and Python. The contributors, all highly experienced with text mining and open-source software, explain how text data are gathered and processed from a wide variety of sources, including books, server access logs, websites, social media sites, and message boards. Each chapter presents a case study that you can follow as part of a step-by-step, reproducible example. You can also easily apply and extend the techniques to other problems. All the examples are available on a supplementary website. The book shows you how to exploit your text data, offering successful application examples and blueprints for you to tackle your text mining tasks and benefit from open and freely available tools. It gets you up to date on the latest and most powerful tools, the data mining process, and specific text mining activities.

  1. Fast max-margin clustering for unsupervised word sense disambiguation in biomedical texts

    OpenAIRE

    Duan, Weisi; Song, Min; Yates, Alexander

    2009-01-01

Background We aim to solve the problem of determining word senses for ambiguous biomedical terms with minimal human effort. Methods We build a fully automated system for Word Sense Disambiguation that does not require manually-constructed external resources or manually-labeled training examples, except for a single ambiguous word. The system uses a novel and efficient graph-based algorithm to cluster words into groups that have the same meaning. Our algorithm follows the ...

  2. A review of the applications of data mining and machine learning for the prediction of biomedical properties of nanoparticles.

    Science.gov (United States)

    Jones, David E; Ghandehari, Hamidreza; Facelli, Julio C

    2016-08-01

    This article presents a comprehensive review of applications of data mining and machine learning for the prediction of biomedical properties of nanoparticles of medical interest. The papers reviewed here present the results of research using these techniques to predict the biological fate and properties of a variety of nanoparticles relevant to their biomedical applications. These include the influence of particle physicochemical properties on cellular uptake, cytotoxicity, molecular loading, and molecular release in addition to manufacturing properties like nanoparticle size, and polydispersity. Overall, the results are encouraging and suggest that as more systematic data from nanoparticles becomes available, machine learning and data mining would become a powerful aid in the design of nanoparticles for biomedical applications. There is however the challenge of great heterogeneity in nanoparticles, which will make these discoveries more challenging than for traditional small molecule drug design.

  4. Text and Structural Data Mining of Influenza Mentions in Web and Social Media

    Energy Technology Data Exchange (ETDEWEB)

    Corley, Courtney D.; Cook, Diane; Mikler, Armin R.; Singh, Karan P.

    2010-02-22

    Text and structural data mining of Web and social media (WSM) provides a novel disease surveillance resource and can identify online communities for targeted public health communications (PHC) to assure wide dissemination of pertinent information. WSM that mention influenza are harvested over a 24-week period, 5-October-2008 to 21-March-2009. Link analysis reveals communities for targeted PHC. Text mining is shown to identify trends in flu posts that correlate to real-world influenza-like-illness patient report data. We also bring to bear a graph-based data mining technique to detect anomalies among flu blogs connected by publisher type, links, and user-tags.

  5. Trading Consequences: A Case Study of Combining Text Mining and Visualization to Facilitate Document Exploration

    OpenAIRE

    Hinrichs, Uta; Alex, Beatrice; Clifford, Jim; Watson, Andrew; Quigley, Aaron; Klein, Ewan; Coates, Colin M.

    2015-01-01

    Large-scale digitization efforts and the availability of computational methods, including text mining and information visualization, have enabled new approaches to historical research. However, we lack case studies of how these methods can be applied in practice and what their potential impact may be. Trading Consequences is an interdisciplinary research project between environmental historians, computational linguists, and visualization specialists. It combines text mining and information vi...

  6. Text and Structural Data Mining of Influenza Mentions in Web and Social Media

    OpenAIRE

    Singh, Karan P.; Mikler, Armin R.; Diane J. Cook; Corley, Courtney D.

    2010-01-01

    Text and structural data mining of web and social media (WSM) provides a novel disease surveillance resource and can identify online communities for targeted public health communications (PHC) to assure wide dissemination of pertinent information. WSM that mention influenza are harvested over a 24-week period, 5 October 2008 to 21 March 2009. Link analysis reveals communities for targeted PHC. Text mining is shown to identify trends in flu posts that correlate to real-world influenza-like ill...

  7. Towards Applying Text Mining Techniques on Software Quality Standards and Models

    OpenAIRE

    Kelemen, Zádor Dániel; Kusters, Rob; Trienekens, Jos; Balla, Katalin

    2013-01-01

Many quality approaches are described in hundreds of textual pages. Manual processing of this information consumes plenty of resources. In this report we present a text mining approach applied to CMMI, one well-known and widely used quality approach. The text mining analysis can provide a quick overview of the scope of a quality approach. The result of the analysis could accelerate the understanding and the selection of quality approaches.

  8. Trading Consequences: A Case Study of Combining Text Mining & Visualisation to Facilitate Document Exploration

    OpenAIRE

    Hinrichs, Uta; Alex, Beatrice; Clifford, Jim; Quigley, Aaron

    2014-01-01

    Trading Consequences is an interdisciplinary research project between historians, computational linguists and visualization specialists. We use text mining and visualisations to explore the growth of the global commodity trade in the nineteenth century. Feedback from a group of environmental historians during a workshop provided essential information to adapt advanced text mining and visualisation techniques to historical research. Expert feedback is an essential tool for effective interdisci...

  9. Advanced Text Mining Methods for the Financial Markets and Forecasting of Intraday Volatility

    OpenAIRE

    Pieper, Michael J.

    2011-01-01

The flow of information in financial markets is covered in two parts. A high-order estimator of intraday volatility is introduced in order to boost risk forecasts. Over the last decade, text mining of news and its application to finance has been a vibrant topic of research, in academia as well as in the finance industry. This thesis develops a coherent approach to financial text mining that can be utilized for automated trading.

  10. Text Mining for Information Systems Researchers: An Annotated Topic Modeling Tutorial

    DEFF Research Database (Denmark)

    Debortoli, Stefan; Müller, Oliver; Junglas, Iris;

    2016-01-01

, such as manual coding. Yet, the size of text data sets obtained from the Internet makes manual analysis virtually impossible. In this tutorial, we discuss the challenges encountered when applying automated text-mining techniques in information systems research. In particular, we showcase the use of probabilistic...... topic modeling via Latent Dirichlet Allocation, an unsupervised text mining technique, in combination with a LASSO multinomial logistic regression to explain user satisfaction with an IT artifact by automatically analyzing more than 12,000 online customer reviews. For fellow information systems...... researchers, this tutorial provides some guidance for conducting text mining studies on their own and for evaluating the quality of others....

  11. A framework of Chinese semantic text mining based on ontology learning

    Science.gov (United States)

    Zhang, Yu-feng; Hu, Feng

    2012-01-01

Text mining and ontology learning can be effectively employed to acquire Chinese semantic information. This paper explores a framework of semantic text mining based on ontology learning to find potential semantic knowledge in the immense amount of text information on the Internet. This framework consists of four parts: Data Acquisition, Feature Extraction, Ontology Construction, and Text Knowledge Pattern Discovery. The framework is then applied to an actual case to try to find out valuable information, and even to assist consumers with selecting proper products. The results show that this framework is reasonable and effective.

  12. Summary of the Workshop and Survey Results "Bedarf und Anforderungen an Ressourcen für Text und Data Mining"

    OpenAIRE

    Sens, Irina; Katerbow, Matthias; Schöch, Christof; Mittermaier, Bernhard

    2015-01-01

Summary of the workshop and visualization of the results of the survey "Bedarf und Anforderungen an Ressourcen für Text und Data Mining" (needs and requirements for text and data mining resources), conducted by the working group Text and Data Mining of the priority initiative "Digitale Information" of the Alliance of German Science Organisations.

  13. Compatibility between Text Mining and Qualitative Research in the Perspectives of Grounded Theory, Content Analysis, and Reliability

    Science.gov (United States)

    Yu, Chong Ho; Jannasch-Pennell, Angel; DiGangi, Samuel

    2011-01-01

    The objective of this article is to illustrate that text mining and qualitative research are epistemologically compatible. First, like many qualitative research approaches, such as grounded theory, text mining encourages open-mindedness and discourages preconceptions. Contrary to the popular belief that text mining is a linear and fully automated…

  14. Text mining for science and technology - a review part I - characterization/scientometrics

    Directory of Open Access Journals (Sweden)

    Ronald N Kostoff

    2012-01-01

Full Text Available This article is the first part of a two-part review of the author's work in developing text mining procedures. The focus of Part I is scientometrics. Novel approaches that were used to text mine the field of nanoscience/nanotechnology and the science and technology portfolio of China are described. A unique approach to identifying documents related to an application theme (e.g., military-related, intelligence-related, space-related) rather than a discipline theme is also described in some detail.

  15. Text Matching and Categorization: Mining Implicit Semantic Knowledge from Tree-Shape Structures

    Directory of Open Access Journals (Sweden)

    Lin Guo

    2015-01-01

Full Text Available The diversities of large-scale semistructured data make the extraction of implicit semantic information enormously difficult. This paper proposes an automatic and unsupervised method of text categorization, in which tree-shape structures are used to represent semantic knowledge and to explore implicit information by mining hidden structures without cumbersome lexical analysis. Mining implicit frequent structures in trees can discover both direct and indirect semantic relations, which largely enhances the accuracy of matching and classifying texts. The experimental results show that the proposed algorithm remarkably reduces the time and effort spent in training and classifying, and outperforms established competitors in correctness and effectiveness.

  16. PageRank without hyperlinks: Reranking with PubMed related article networks for biomedical text retrieval

    Directory of Open Access Journals (Sweden)

    Lin Jimmy

    2008-06-01

    Full Text Available Abstract Background Graph analysis algorithms such as PageRank and HITS have been successful in Web environments because they are able to extract important inter-document relationships from manually-created hyperlinks. We consider the application of these techniques to biomedical text retrieval. In the current PubMed® search interface, a MEDLINE® citation is connected to a number of related citations, which are in turn connected to other citations. Thus, a MEDLINE record represents a node in a vast content-similarity network. This article explores the hypothesis that these networks can be exploited for text retrieval, in the same manner as hyperlink graphs on the Web. Results We conducted a number of reranking experiments using the TREC 2005 genomics track test collection in which scores extracted from PageRank and HITS analysis were combined with scores returned by an off-the-shelf retrieval engine. Experiments demonstrate that incorporating PageRank scores yields significant improvements in terms of standard ranked-retrieval metrics. Conclusion The link structure of content-similarity networks can be exploited to improve the effectiveness of information retrieval systems. These results generalize the applicability of graph analysis algorithms to text retrieval in the biomedical domain.
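The reranking scheme described in the abstract can be sketched as follows. This is not the authors' TREC setup: the five-node "related article" graph, the retrieval scores, and the equal-weight linear combination are all invented for illustration; only the PageRank computation itself is standard.

```python
def pagerank(graph, damping=0.85, iters=50):
    """Power-iteration PageRank. graph: dict node -> list of out-neighbors."""
    nodes = list(graph)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iters):
        new = {v: (1 - damping) / n for v in nodes}
        for v, outs in graph.items():
            if outs:
                share = damping * rank[v] / len(outs)
                for u in outs:
                    new[u] += share
            else:  # dangling node: spread its rank uniformly
                for u in nodes:
                    new[u] += damping * rank[v] / n
        rank = new
    return rank

# Toy "related article" network among five citations.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"], "e": ["a", "c"]}
# Toy scores from a retrieval engine for the same five citations.
retrieval = {"a": 0.9, "b": 0.8, "c": 0.5, "d": 0.4, "e": 0.3}
pr = pagerank(graph)
# Linear combination of retrieval score and normalized PageRank;
# the 0.5/0.5 weighting is arbitrary, for illustration only.
combined = {v: 0.5 * retrieval[v] + 0.5 * pr[v] / max(pr.values())
            for v in retrieval}
reranked = sorted(retrieval, key=combined.get, reverse=True)
print(reranked)  # → ['a', 'c', 'b', 'd', 'e']
```

Note how "c", which has many inlinks, moves ahead of "b" despite a lower retrieval score; that is the effect the experiments above measure at scale.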

  17. The potential of text mining in data integration and network biology for plant research: a case study on Arabidopsis

    OpenAIRE

    Van Landeghem, Sofie; De Bodt, Stefanie; Drebert, Zuzanna; Inzé, Dirk; Van de Peer, Yves

    2013-01-01

    Despite the availability of various data repositories for plant research, a wealth of information currently remains hidden within the biomolecular literature. Text mining provides the necessary means to retrieve these data through automated processing of texts. However, only recently has advanced text mining methodology been implemented with sufficient computational power to process texts at a large scale. In this study, we assess the potential of large-scale text mining for plant biology res...

  18. The Text-mining based PubChem Bioassay neighboring analysis

    Directory of Open Access Journals (Sweden)

    Wang Yanli

    2010-11-01

Full Text Available Abstract Background In recent years, the number of High Throughput Screening (HTS) assays deposited in PubChem has grown quickly. As a result, the volume of both the structured information (i.e. molecular structure, bioactivities) and the unstructured information (such as descriptions of bioassay experiments) has been increasing exponentially. It has therefore become even more demanding and challenging to efficiently assemble the bioactivity data by mining the huge amount of information to identify and interpret the relationships among the diversified bioassay experiments. In this work, we propose a text-mining based approach for bioassay neighboring analysis from the unstructured text descriptions contained in the PubChem BioAssay database. Results The neighboring analysis is achieved by evaluating the cosine score of each bioassay pair and the fraction of overlap with the human-curated neighbors. Our results from the cosine score distribution analysis and assay neighbor clustering analysis on all PubChem bioassays suggest that strong correlations among the bioassays can be identified from their conceptual relevance. A comparison with other existing assay neighboring methods suggests that the text-mining based bioassay neighboring approach provides meaningful linkages among the PubChem bioassays, and complements the existing methods by identifying additional relationships among the bioassay entries. Conclusions The text-mining based bioassay neighboring analysis is efficient for correlating bioassays and studying different aspects of a biological process, which are otherwise difficult to achieve by existing neighboring procedures due to the lack of specific annotations and structured information. It is suggested that the text-mining based bioassay neighboring analysis can be used as a standalone or as a complementary tool for the PubChem bioassay neighboring process to enable efficient integration of assay results and generate hypotheses for
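The cosine-score neighboring step can be illustrated with a minimal term-frequency sketch. The assay descriptions, the whitespace tokenization, and the 0.3 threshold are illustrative assumptions, not the PubChem pipeline.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two term-frequency vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def neighbors(descriptions, threshold=0.3):
    """Pairs of assay ids whose description cosine score exceeds the threshold."""
    vecs = {aid: Counter(text.lower().split())
            for aid, text in descriptions.items()}
    ids = sorted(vecs)
    return [(x, y) for i, x in enumerate(ids) for y in ids[i + 1:]
            if cosine(vecs[x], vecs[y]) > threshold]

assays = {
    "AID1": "inhibition of kinase activity in a cell based assay",
    "AID2": "cell based assay measuring kinase inhibition",
    "AID3": "binding affinity of a fluorescent probe",
}
print(neighbors(assays))  # → [('AID1', 'AID2')]
```

A production system would weight terms (e.g. TF-IDF) and remove stop words; here the two kinase assays still score well above the threshold while the binding assay does not.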

  19. Trends of E-Learning Research from 2000 to 2008: Use of Text Mining and Bibliometrics

    Science.gov (United States)

    Hung, Jui-long

    2012-01-01

    This study investigated the longitudinal trends of e-learning research using text mining techniques. Six hundred and eighty-nine (689) refereed journal articles and proceedings were retrieved from the Science Citation Index/Social Science Citation Index database in the period from 2000 to 2008. All e-learning publications were grouped into two…

  20. Mining for associations between text and brain activation in a functional neuroimaging database

    DEFF Research Database (Denmark)

    Nielsen, Finn Årup; Hansen, Lars Kai; Balslev, Daniela

    2004-01-01

    We describe a method for mining a neuroimaging database for associations between text and brain locations. The objective is to discover association rules between words indicative of cognitive function as described in abstracts of neuroscience papers and sets of reported stereotactic Talairach...... that the statistically motivated associations are well aligned with general neuroscientific knowledge...

  2. Analysis of Nature of Science Included in Recent Popular Writing Using Text Mining Techniques

    Science.gov (United States)

    Jiang, Feng; McComas, William F.

    2014-01-01

    This study examined the inclusion of nature of science (NOS) in popular science writing to determine whether it could serve as a supplementary resource for teaching NOS and to evaluate the accuracy of text mining and classification as a viable research tool in science education research. Four groups of documents published from 2001 to 2010 were…

  3. The Determination of Children's Knowledge of Global Lunar Patterns from Online Essays Using Text Mining Analysis

    Science.gov (United States)

    Cheon, Jongpil; Lee, Sangno; Smith, Walter; Song, Jaeki; Kim, Yongjin

    2013-01-01

    The purpose of this study was to use text mining analysis of early adolescents' online essays to determine their knowledge of global lunar patterns. Australian and American students in grades five to seven wrote about global lunar patterns they had discovered by sharing observations with each other via the Internet. These essays were analyzed for…

  4. Complementing the Numbers: A Text Mining Analysis of College Course Withdrawals

    Science.gov (United States)

    Michalski, Greg V.

    2011-01-01

    Excessive college course withdrawals are costly to the student and the institution in terms of time to degree completion, available classroom space, and other resources. Although generally well quantified, detailed analysis of the reasons given by students for course withdrawal is less common. To address this, a text mining analysis was performed…

  5. Exploring the potential of Social Media Data using Text Mining to augment Business Intelligence

    Directory of Open Access Journals (Sweden)

    Dr. Ananthi Sheshasaayee

    2014-03-01

    Full Text Available In recent years, social media has become famous worldwide and important for content sharing, social networking, and more, yet the content generated on these websites remains largely unused. Social media data contain text, images, audio, video, and so on, and consist largely of unstructured text. The foremost task is to extract the information in this unstructured text. This paper presents the influence of social media data on research and shows how the content can be used to predict real-world decisions that enhance business intelligence, by applying text mining methods.

  6. Serviceorientiertes Text Mining am Beispiel von Entitätsextrahierenden Diensten

    OpenAIRE

    Pfeifer, Katja

    2014-01-01

    The majority of business-relevant knowledge today exists as unstructured information in the form of text data on web pages, in office documents, or in forum posts. A large number of text mining solutions have been developed to extract and exploit this unstructured information. In the recent past, many of these systems have been made accessible as web services in order to simplify their use and integration. The combination of various such text-min...

  7. A tm Plug-In for Distributed Text Mining in R

    OpenAIRE

    Stefan Theussl; Ingo Feinerer; Kurt Hornik

    2012-01-01

    R has gained explicit text mining support with the tm package enabling statisticians to answer many interesting research questions via statistical analysis or modeling of (text) corpora. However, we typically face two challenges when analyzing large corpora: (1) the amount of data to be processed in a single machine is usually limited by the available main memory (i.e., RAM), and (2) the more data to be analyzed the higher the need for efficient procedures for calculating valua...

  8. Biomedical Mathematics, Unit I: Measurement, Linear Functions and Dimensional Algebra. Student Text. Revised Version, 1975.

    Science.gov (United States)

    Biomedical Interdisciplinary Curriculum Project, Berkeley, CA.

    This text presents lessons relating specific mathematical concepts to the ideas, skills, and tasks pertinent to the health care field. Among other concepts covered are linear functions, vectors, trigonometry, and statistics. Many of the lessons use data acquired during science experiments as the basis for exercises in mathematics. Lessons present…

  9. A tm Plug-In for Distributed Text Mining in R

    Directory of Open Access Journals (Sweden)

    Stefan Theussl

    2012-11-01

    Full Text Available R has gained explicit text mining support with the tm package enabling statisticians to answer many interesting research questions via statistical analysis or modeling of (text) corpora. However, we typically face two challenges when analyzing large corpora: (1) the amount of data to be processed in a single machine is usually limited by the available main memory (i.e., RAM), and (2) the more data to be analyzed the higher the need for efficient procedures for calculating valuable results. Fortunately, adequate programming models like MapReduce facilitate parallelization of text mining tasks and allow for processing data sets beyond what would fit into memory by using a distributed file system possibly spanning over several machines, e.g., in a cluster of workstations. In this paper we present a plug-in package to tm called tm.plugin.dc implementing a distributed corpus class which can take advantage of the Hadoop MapReduce library for large scale text mining tasks. We show on the basis of an application in culturomics that we can efficiently handle data sets of significant size.
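
    The MapReduce programming model that tm.plugin.dc builds on can be illustrated outside of R and Hadoop. This minimal Python sketch (the corpus and the sequential reduce are invented for illustration) splits term counting into a map step per document and a reduce step that merges partial counts; distributing the map step across workers or a Hadoop cluster is what makes the approach scale:

    ```python
    from collections import Counter
    from functools import reduce

    def map_count(doc: str) -> Counter:
        # Map step: term frequencies for a single document.
        return Counter(doc.lower().split())

    def reduce_counts(acc: Counter, part: Counter) -> Counter:
        # Reduce step: merge partial counts into the running total.
        acc.update(part)
        return acc

    corpus = ["text mining in R", "distributed text mining", "mining large corpora"]

    # Sequential here; a process pool or Hadoop job would run map_count in parallel.
    partials = [map_count(d) for d in corpus]
    totals = reduce(reduce_counts, partials, Counter())
    print(totals["mining"])  # 3
    ```

    Because each map step sees only one document, no single machine ever needs the whole corpus in memory, which is the point made in the abstract.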

  10. SOME APPROACHES TO TEXT MINING AND THEIR POTENTIAL FOR SEMANTIC WEB APPLICATIONS

    Directory of Open Access Journals (Sweden)

    Jan Paralič

    2007-06-01

    Full Text Available In this paper we describe some approaches to text mining, which are supported by an original software system developed in Java for support of information retrieval and text mining (JBowl), as well as its possible use in a distributed environment. The system JBowl is being developed as open source software with the intention to provide an easily extensible, modular framework for pre-processing, indexing and further exploration of large text collections. The overall architecture of the system is described, followed by some typical use case scenarios, which have been used in some previous projects. Then, basic principles and technologies used for service-oriented computing, web services and semantic web services are presented. We further discuss how the JBowl system can be adopted into a distributed environment via technologies already available and what benefits such an adaptation can bring. This is particularly important in the context of a new integrated EU-funded project, KP-Lab (Knowledge Practices Laboratory), which is briefly presented as well, along with the role of the proposed text mining services, which are currently being designed and developed there.

  11. An overview of the BioCreative 2012 Workshop Track III: interactive text mining task.

    Science.gov (United States)

    Arighi, Cecilia N; Carterette, Ben; Cohen, K Bretonnel; Krallinger, Martin; Wilbur, W John; Fey, Petra; Dodson, Robert; Cooper, Laurel; Van Slyke, Ceri E; Dahdul, Wasila; Mabee, Paula; Li, Donghui; Harris, Bethany; Gillespie, Marc; Jimenez, Silvia; Roberts, Phoebe; Matthews, Lisa; Becker, Kevin; Drabkin, Harold; Bello, Susan; Licata, Luana; Chatr-aryamontri, Andrew; Schaeffer, Mary L; Park, Julie; Haendel, Melissa; Van Auken, Kimberly; Li, Yuling; Chan, Juancarlos; Muller, Hans-Michael; Cui, Hong; Balhoff, James P; Chi-Yang Wu, Johnny; Lu, Zhiyong; Wei, Chih-Hsuan; Tudor, Catalina O; Raja, Kalpana; Subramani, Suresh; Natarajan, Jeyakumar; Cejuela, Juan Miguel; Dubey, Pratibha; Wu, Cathy

    2013-01-01

    In many databases, biocuration primarily involves literature curation, which usually involves retrieving relevant articles, extracting information that will translate into annotations and identifying new incoming literature. As the volume of biological literature increases, the use of text mining to assist in biocuration becomes increasingly relevant. A number of groups have developed tools for text mining from a computer science/linguistics perspective, and there are many initiatives to curate some aspect of biology from the literature. Some biocuration efforts already make use of a text mining tool, but there have not been many broad-based systematic efforts to study which aspects of a text mining tool contribute to its usefulness for a curation task. Here, we report on an effort to bring together text mining tool developers and database biocurators to test the utility and usability of tools. Six text mining systems presenting diverse biocuration tasks participated in a formal evaluation, and appropriate biocurators were recruited for testing. The performance results from this evaluation indicate that some of the systems were able to improve efficiency of curation by speeding up the curation task significantly (∼1.7- to 2.5-fold) over manual curation. In addition, some of the systems were able to improve annotation accuracy when compared with the performance on the manually curated set. In terms of inter-annotator agreement, the factors that contributed to significant differences for some of the systems included the expertise of the biocurator on the given curation task, the inherent difficulty of the curation and attention to annotation guidelines. After the task, annotators were asked to complete a survey to help identify strengths and weaknesses of the various systems. The analysis of this survey highlights how important task completion is to the biocurators' overall experience of a system, regardless of the system's high score on design, learnability and

  12. Clustering more than two million biomedical publications: comparing the accuracies of nine text-based similarity approaches.

    Directory of Open Access Journals (Sweden)

    Kevin W Boyack

    Full Text Available BACKGROUND: We investigate the accuracy of different similarity approaches for clustering over two million biomedical documents. Clustering large sets of text documents is important for a variety of information needs and applications such as collection management and navigation, summary and analysis. The few comparisons of clustering results from different similarity approaches have focused on small literature sets and have given conflicting results. Our study was designed to seek a robust answer to the question of which similarity approach would generate the most coherent clusters of a biomedical literature set of over two million documents. METHODOLOGY: We used a corpus of 2.15 million recent (2004-2008) records from MEDLINE, and generated nine different document-document similarity matrices from information extracted from their bibliographic records, including titles, abstracts and subject headings. The nine approaches comprised five different analytical techniques with two data sources. The five analytical techniques are cosine similarity using term frequency-inverse document frequency vectors (tf-idf cosine), latent semantic analysis (LSA), topic modeling, and two Poisson-based language models: BM25 and PMRA (PubMed Related Articles). The two data sources were (a) MeSH subject headings, and (b) words from titles and abstracts. Each similarity matrix was filtered to keep the top-n highest similarities per document and then clustered using a combination of graph layout and average-link clustering. Cluster results from the nine similarity approaches were compared using (1) within-cluster textual coherence based on the Jensen-Shannon divergence, and (2) two concentration measures based on grant-to-article linkages indexed in MEDLINE.
CONCLUSIONS: PubMed's own related article approach (PMRA) generated the most coherent and most concentrated cluster solution of the nine text-based similarity approaches tested, followed closely by the BM25 approach
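
    One of the simpler approaches evaluated above, tf-idf cosine similarity with top-n filtering per document, can be sketched as follows. The four toy "abstracts" are invented stand-ins for MEDLINE records, and a production system would use sparse matrices rather than dictionaries:

    ```python
    import math
    from collections import Counter

    # Toy stand-ins for MEDLINE titles/abstracts.
    docs = {
        "d1": "gene expression in tumor cells",
        "d2": "tumor gene expression profiling",
        "d3": "protein folding dynamics",
        "d4": "dynamics of protein structure",
    }

    tf = {d: Counter(text.lower().split()) for d, text in docs.items()}
    df = Counter(term for counts in tf.values() for term in counts)
    n_docs = len(docs)

    def tfidf(counts: Counter) -> dict:
        # Weight raw term frequency by inverse document frequency.
        return {t: f * math.log(n_docs / df[t]) for t, f in counts.items()}

    vecs = {d: tfidf(c) for d, c in tf.items()}

    def cos(a: dict, b: dict) -> float:
        dot = sum(w * b[t] for t, w in a.items() if t in b)
        norm = lambda v: math.sqrt(sum(w * w for w in v.values()))
        na, nb = norm(a), norm(b)
        return dot / (na * nb) if na and nb else 0.0

    # Top-n filtering with n = 1: keep only each document's strongest neighbor.
    nearest = {d: max((e for e in docs if e != d), key=lambda e: cos(vecs[d], vecs[e]))
               for d in docs}
    print(nearest["d1"])  # d2
    ```

    The kept similarities define a sparse document graph, which is what the study then lays out and clusters with average-link clustering.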

  13. Review of Text Mining Tools (文本挖掘工具述评)

    Institute of Scientific and Technical Information of China (English)

    张雯雯; 许鑫

    2012-01-01

    The authors briefly describe some commercial and open source text mining tools, and compare four typical open source tools in detail with respect to data format, functional modules and user experience. They then test the text classification function of three tools with distinctive designs. Finally, the authors offer some suggestions regarding the current state of open source text mining tools.

  15. Agile text mining for the 2014 i2b2/UTHealth Cardiac risk factors challenge.

    Science.gov (United States)

    Cormack, James; Nath, Chinmoy; Milward, David; Raja, Kalpana; Jonnalagadda, Siddhartha R

    2015-12-01

    This paper describes the use of an agile text mining platform (Linguamatics' Interactive Information Extraction Platform, I2E) to extract document-level cardiac risk factors in patient records as defined in the i2b2/UTHealth 2014 challenge. The approach uses a data-driven rule-based methodology with the addition of a simple supervised classifier. We demonstrate that agile text mining allows for rapid optimization of extraction strategies, while post-processing can leverage annotation guidelines, corpus statistics and logic inferred from the gold standard data. We also show how data imbalance in a training set affects performance. Evaluation of this approach on the test data gave an F-Score of 91.7%, one percent behind the top performing system. PMID:26209007

  16. The WONP-NURT corpus as nuclear knowledge base for text mining in the INIS database

    International Nuclear Information System (INIS)

    In the present work the WONP-NURT corpus is taken as a knowledge base for text mining in the INIS database. Main components of the information processing system, as well as computational methods for content analysis of INIS database record files, are described. Results of the content analysis of the WONP-NURT corpus are reported. Furthermore, results of two comparative text mining studies in the INIS database are also shown. The first one explores 10 research areas in the more familiar nearest range of the WONP-NURT corpus, while the second one surveys 15 regions in the more exotic far range. The results provide new elements to assess the significance of the WONP-NURT corpus in the context of the current state of nuclear science and technology research areas. (Author)

  17. A Conformity Measure using Background Knowledge for Association Rules: Application to Text Mining

    OpenAIRE

    Cherfi, Hacène; Napoli, Amedeo; Toussaint, Yannick

    2009-01-01

    A text mining process using association rules generates a very large number of rules. According to experts of the domain, most of these rules basically convey a common knowledge, i.e. rules which associate terms that experts may likely relate to each other. In order to focus on the result interpretation and discover new knowledge units, it is necessary to define criteria for classifying the extracted rules. Most of the rule classification methods are based on numerical quality measures. In th...

  18. Text Mining and copy right laws : A case for change in the medical research field

    OpenAIRE

    Blanc, Xavier; Tinh-Hai, Collet; Iriarte, Pablo; De Kaenel, Isabelle; Krause, Jan Brice

    2012-01-01

    Mid 2011, the research team at the Department of Ambulatory Care and Community Medicine, University of Lausanne, decided to start a project on a new research topic: Shared Decision Making (SDM). The objective was to identify publication trends about SDM in 15 major internal medicine journals over the last 15 years. It was decided to use a "text mining" approach to systematically review all the articles published in these main journals and automatically search for the different occurrences of ...

  19. Cluo: Web-Scale Text Mining System For Open Source Intelligence Purposes

    OpenAIRE

    Przemyslaw Maciolek; Grzegorz Dobrowolski

    2013-01-01

    The amount of textual information published on the Internet is considered to be in billions of web pages, blog posts, comments, social media updates and others. Analyzing such quantities of data requires a high level of distribution – both of data and computing. This is especially true in the case of complex algorithms, often used in text mining tasks. The paper presents a prototype implementation of CLUO – an Open Source Intelligence (OSINT) system, which extracts and analyzes significant quantities of openl...

  20. Coronary artery disease risk assessment from unstructured electronic health records using text mining.

    Science.gov (United States)

    Jonnagaddala, Jitendra; Liaw, Siaw-Teng; Ray, Pradeep; Kumar, Manish; Chang, Nai-Wen; Dai, Hong-Jie

    2015-12-01

    Coronary artery disease (CAD) often leads to myocardial infarction, which may be fatal. Risk factors can be used to predict CAD, which may subsequently lead to prevention or early intervention. Patient data such as co-morbidities, medication history, social history and family history are required to determine the risk factors for a disease. However, risk factor data are usually embedded in unstructured clinical narratives if the data is not collected specifically for risk assessment purposes. Clinical text mining can be used to extract data related to risk factors from unstructured clinical notes. This study presents methods to extract Framingham risk factors from unstructured electronic health records using clinical text mining and to calculate 10-year coronary artery disease risk scores in a cohort of diabetic patients. We developed a rule-based system to extract risk factors: age, gender, total cholesterol, HDL-C, blood pressure, diabetes history and smoking history. The results showed that the output from the text mining system was reliable, but there was a significant amount of missing data to calculate the Framingham risk score. A systematic approach for understanding missing data was followed by implementation of imputation strategies. An analysis of the 10-year Framingham risk scores for coronary artery disease in this cohort has shown that the majority of the diabetic patients are at moderate risk of CAD. PMID:26319542
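
    A rule-based extractor of the kind described above can be approximated with regular expressions. The note text and the patterns below are invented for illustration and are far simpler than real clinical text mining, which must also handle negation, units, abbreviations and section context:

    ```python
    import re

    # Hypothetical clinical note; the field names are illustrative only.
    note = """
    67 yo male. PMH: type 2 diabetes, hypertension.
    BP 148/92 mmHg. Total cholesterol 240 mg/dL, HDL-C 38 mg/dL.
    Social history: current smoker, 1 ppd.
    """

    rules = {
        "age": r"(\d{1,3})\s*yo",
        "systolic_bp": r"BP\s*(\d{2,3})/\d{2,3}",
        "total_chol": r"[Tt]otal cholesterol\s*(\d+)",
        "hdl": r"HDL-C\s*(\d+)",
        "smoker": r"current smoker",
    }

    extracted = {}
    for name, pattern in rules.items():
        m = re.search(pattern, note)
        if m:
            # Numeric captures become ints; flag patterns become booleans.
            extracted[name] = int(m.group(1)) if m.groups() else True

    print(extracted)
    ```

    The extracted fields would then feed a risk equation such as Framingham; the abstract's point about missing data shows up here too, since any factor the rules fail to find must be imputed before a score can be computed.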

  1. Experiences with Text Mining Large Collections of Unstructured Systems Development Artifacts at JPL

    Science.gov (United States)

    Port, Dan; Nikora, Allen; Hihn, Jairus; Huang, LiGuo

    2011-01-01

    Often repositories of systems engineering artifacts at NASA's Jet Propulsion Laboratory (JPL) are so large and poorly structured that they have outgrown our capability to effectively process their contents manually to extract useful information. Sophisticated text mining methods and tools seem a quick, low-effort approach to automating our limited manual efforts. Our experience exploring such methods at JPL, mainly in three areas (historical risk analysis, defect identification based on requirements analysis, and over-time analysis of system anomalies), has shown that obtaining useful results requires substantial unanticipated effort, from preprocessing the data to transforming the output for practical applications. We have not observed any quick 'wins' or realized benefit from short-term effort avoidance through automation in this area. Surprisingly, we have realized a number of unexpected long-term benefits from the process of applying text mining to our repositories. This paper elaborates some of these benefits and the important lessons we learned from preparing and applying text mining to large unstructured system artifacts at JPL, aiming to benefit future text mining (TM) applications in similar problem domains, in the hope that they can be extended to broader areas of application.

  3. Cluo: Web-Scale Text Mining System For Open Source Intelligence Purposes

    Directory of Open Access Journals (Sweden)

    Przemyslaw Maciolek

    2013-01-01

    Full Text Available The amount of textual information published on the Internet is considered to be in billions of web pages, blog posts, comments, social media updates and others. Analyzing such quantities of data requires a high level of distribution – both of data and computing. This is especially true in the case of complex algorithms, often used in text mining tasks. The paper presents a prototype implementation of CLUO – an Open Source Intelligence (OSINT) system, which extracts and analyzes significant quantities of openly available information.

  5. Mining Health-Related Issues in Consumer Product Reviews by Using Scalable Text Analytics

    Science.gov (United States)

    Torii, Manabu; Tilak, Sameer S.; Doan, Son; Zisook, Daniel S.; Fan, Jung-wei

    2016-01-01

    In an era when most of our life activities are digitized and recorded, opportunities abound to gain insights about population health. Online product reviews present a unique data source that is currently underexplored. Health-related information, although scarce, can be systematically mined in online product reviews. Leveraging natural language processing and machine learning tools, we were able to mine 1.3 million grocery product reviews for health-related information. The objectives of the study were as follows: (1) conduct quantitative and qualitative analysis on the types of health issues found in consumer product reviews; (2) develop a machine learning classifier to detect reviews that contain health-related issues; and (3) gain insights about the task characteristics and challenges for text analytics to guide future research. PMID:27375358

  6. PolySearch2: a significantly improved text-mining system for discovering associations between human diseases, genes, drugs, metabolites, toxins and more.

    Science.gov (United States)

    Liu, Yifeng; Liang, Yongjie; Wishart, David

    2015-07-01

    PolySearch2 (http://polysearch.ca) is an online text-mining system for identifying relationships between biomedical entities such as human diseases, genes, SNPs, proteins, drugs, metabolites, toxins, metabolic pathways, organs, tissues, subcellular organelles, positive health effects, negative health effects, drug actions, Gene Ontology terms, MeSH terms, ICD-10 medical codes, biological taxonomies and chemical taxonomies. PolySearch2 supports a generalized 'Given X, find all associated Ys' query, where X and Y can be selected from the aforementioned biomedical entities. An example query might be: 'Find all diseases associated with Bisphenol A'. To find its answers, PolySearch2 searches for associations in comprehensive collections of free text, including local versions of MEDLINE abstracts, PubMed Central full-text articles, Wikipedia full-text articles and US Patent application abstracts. PolySearch2 also searches 14 widely used, text-rich biological databases such as UniProt, DrugBank and the Human Metabolome Database to improve its accuracy and coverage. PolySearch2 maintains an extensive thesaurus of biological terms and exploits the latest search engine technology to rapidly retrieve relevant articles and database records. PolySearch2 also generates, ranks and annotates associative candidates and presents results with relevancy statistics and highlighted key sentences to facilitate user interpretation.
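
    The 'Given X, find all associated Ys' query pattern can be approximated with sentence-level co-occurrence counting. Everything below (the sentences, the disease vocabulary, the one-vote scoring) is a toy stand-in for PolySearch2's thesaurus-driven pipeline and relevancy statistics:

    ```python
    from collections import Counter

    # Toy abstract sentences; the entity list is illustrative, not a real thesaurus.
    sentences = [
        "bisphenol a exposure is linked to obesity",
        "obesity and diabetes share risk factors",
        "bisphenol a may contribute to diabetes risk",
        "asthma prevalence is rising",
    ]
    diseases = {"obesity", "diabetes", "asthma"}
    query = "bisphenol a"  # the X in "Given X, find all associated Ys"

    scores = Counter()
    for s in sentences:
        if query in s:
            for d in diseases:
                if d in s:
                    scores[d] += 1  # one co-mention in a sentence = one vote

    for disease, hits in scores.most_common():
        print(disease, hits)
    ```

    Real systems refine this with synonym expansion, weighting by sentence quality, and statistical significance of the co-occurrence counts, but the ranking-by-association idea is the same.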

  7. Screening for posttraumatic stress disorder using verbal features in self narratives: a text mining approach.

    Science.gov (United States)

    He, Qiwei; Veldkamp, Bernard P; de Vries, Theo

    2012-08-15

    Much evidence has shown that people's physical and mental health can be predicted by the words they use. However, such verbal information is seldom used in the screening and diagnosis process, probably because handling these words with traditional quantitative methods is rather difficult. The first challenge is to extract robust information from diversified expression patterns, the second to transform unstructured text into a structured dataset. The present study developed a new textual assessment method to screen posttraumatic stress disorder (PTSD) patients using lexical features in self narratives with text mining techniques. Using 300 self narratives collected online, we extracted highly discriminative keywords with the Chi-square algorithm and constructed a textual assessment model to classify individuals with the presence or absence of PTSD. This resulted in a high agreement between computer and psychiatrists' diagnoses for PTSD and revealed some expressive characteristics in the writings of PTSD patients. Although the results of text analysis are not completely analogous to the results of structured interviews in PTSD diagnosis, the application of text mining is a promising addition to assessing PTSD in clinical and research settings. PMID:22464046
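
    The chi-square keyword selection step can be sketched on a toy corpus. The four labeled "narratives" below are invented; the study used 300 real self narratives and built a classifier on top of the selected features:

    ```python
    # Toy labeled narratives: (text, label) with 1 = PTSD, 0 = control.
    docs = [
        ("nightmares and flashbacks keep returning", 1),
        ("flashbacks after the accident", 1),
        ("a calm week at work", 0),
        ("enjoyed a calm family dinner", 0),
    ]

    def chi2(term: str) -> float:
        # 2x2 contingency table: term presence vs. class label.
        a = sum(1 for t, y in docs if term in t and y == 1)
        b = sum(1 for t, y in docs if term in t and y == 0)
        c = sum(1 for t, y in docs if term not in t and y == 1)
        d = sum(1 for t, y in docs if term not in t and y == 0)
        n = a + b + c + d
        denom = (a + b) * (c + d) * (a + c) * (b + d)
        return n * (a * d - b * c) ** 2 / denom if denom else 0.0

    vocab = {w for t, _ in docs for w in t.split()}
    ranked = sorted(vocab, key=chi2, reverse=True)
    print(ranked[0] in {"flashbacks", "calm"})  # most class-discriminative terms: True
    ```

    Terms whose occurrence is strongly skewed toward one class get high chi-square scores and survive as features, which is how the "highly discriminative keywords" in the abstract are obtained.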

  8. BSQA: integrated text mining using entity relation semantics extracted from biological literature of insects.

    Science.gov (United States)

    He, Xin; Li, Yanen; Khetani, Radhika; Sanders, Barry; Lu, Yue; Ling, Xu; Zhai, Chengxiang; Schatz, Bruce

    2010-07-01

    Text mining is one promising way of extracting information automatically from the vast biological literature. To maximize its potential, the knowledge encoded in the text should be translated to some semantic representation such as entities and relations, which could be analyzed by machines. But large-scale practical systems for this purpose are rare. We present BeeSpace question/answering (BSQA) system that performs integrated text mining for insect biology, covering diverse aspects from molecular interactions of genes to insect behavior. BSQA recognizes a number of entities and relations in Medline documents about the model insect, Drosophila melanogaster. For any text query, BSQA exploits entity annotation of retrieved documents to identify important concepts in different categories. By utilizing the extracted relations, BSQA is also able to answer many biologically motivated questions, from simple ones such as, which anatomical part is a gene expressed in, to more complex ones involving multiple types of relations. BSQA is freely available at http://www.beespace.uiuc.edu/QuestionAnswer.

  9. Text Mining of the Classical Medical Literature for Medicines That Show Potential in Diabetic Nephropathy

    Directory of Open Access Journals (Sweden)

    Lei Zhang

    2014-01-01

    Full Text Available Objectives. To apply modern text-mining methods to identify candidate herbs and formulae for the treatment of diabetic nephropathy. Methods. The method we developed includes three steps: (1) identification of candidate ancient terms; (2) systemic search and assessment of medical records written in classical Chinese; (3) preliminary evaluation of the effect and safety of candidates. Results. The ancient terms Xia Xiao, Shen Xiao, and Xiao Shen were determined as the most likely to correspond with diabetic nephropathy and used in text mining. A total of 80 Chinese formulae for treating conditions congruent with diabetic nephropathy recorded in medical books from the Tang Dynasty to the Qing Dynasty were collected. Sao si tang (also called Reeling Silk Decoction) was chosen to show the process of preliminary evaluation of the candidates. It had promising potential for development as a new agent for the treatment of diabetic nephropathy. However, further investigations into its safety for patients with renal insufficiency are still needed. Conclusions. The methods developed in this study offer a targeted approach to identifying traditional herbs and/or formulae as candidates for further investigation in the search for new drugs for modern diseases. However, more effort is still required to improve our techniques, especially with regard to compound formulae.

  10. DDMGD: the database of text-mined associations between genes methylated in diseases from different species

    KAUST Repository

    Raies, A. B.

    2014-11-14

    Gathering information about associations between methylated genes and diseases is important for disease diagnosis and treatment decisions. Recent advancements in epigenetics research allow for large-scale discoveries of associations of genes methylated in diseases in different species. Searching manually for such information is not easy, as it is scattered across a large number of electronic publications and repositories. Therefore, we developed the DDMGD database (http://www.cbrc.kaust.edu.sa/ddmgd/) to provide a comprehensive repository of information related to genes methylated in diseases that can be found through text mining. DDMGD's scope is not limited to a particular group of genes, diseases or species. Using the text mining system DEMGD, which we developed earlier, and additional post-processing, we extracted associations of genes methylated in different diseases from PubMed Central articles and PubMed abstracts. The accuracy of extracted associations is 82%, as estimated on 2500 hand-curated entries. DDMGD provides a user-friendly interface facilitating retrieval of these associations ranked according to confidence scores. Submission of new associations to DDMGD is supported. A comparison of DDMGD with several other databases focused on genes methylated in diseases shows that DDMGD is comprehensive and includes most of the recent information on genes methylated in diseases.

  11. Identifying Understudied Nuclear Reactions by Text-mining the EXFOR Experimental Nuclear Reaction Library

    Science.gov (United States)

    Hirdt, J. A.; Brown, D. A.

    2016-01-01

    The EXFOR library contains the largest collection of experimental nuclear reaction data available as well as the data's bibliographic information and experimental details. We text-mined the REACTION and MONITOR fields of the ENTRYs in the EXFOR library in order to identify understudied reactions and quantities. Using the results of the text-mining, we created an undirected graph from the EXFOR datasets with each graph node representing a single reaction and quantity and graph links representing the various types of connections between these reactions and quantities. This graph is an abstract representation of the connections in EXFOR, similar to graphs of social networks, authorship networks, etc. We use various graph theoretical tools to identify important yet understudied reactions and quantities in EXFOR. Although we identified a few cross sections relevant for shielding applications and isotope production, mostly we identified charged particle fluence monitor cross sections. As a side effect of this work, we learn that our abstract graph is typical of other real-world graphs.
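    The graph construction described above can be sketched in a few lines. The ENTRY contents below are invented placeholders, and node degree is used as a crude stand-in for the paper's more sophisticated graph-theoretic measures:

    ```python
    from collections import defaultdict
    from itertools import combinations

    def build_graph(entries):
        """Link reaction/quantity nodes that co-occur in the same EXFOR ENTRY."""
        adj = defaultdict(set)
        for nodes in entries:
            for n in nodes:
                adj[n]  # touching the key registers isolated nodes too
            for a, b in combinations(sorted(set(nodes)), 2):
                adj[a].add(b)
                adj[b].add(a)
        return adj

    def least_connected(adj, k=3):
        """Lowest-degree nodes: candidates for understudied reactions/quantities."""
        return sorted(adj, key=lambda n: (len(adj[n]), n))[:k]

    # Invented example datasets, each listing the reaction/quantity nodes it measures.
    entries = [
        ["Fe56(n,g),SIG", "Fe56(n,el),SIG"],
        ["Fe56(n,g),SIG", "Au197(n,g),SIG"],
        ["U235(n,f),SIG"],
    ]
    adj = build_graph(entries)
    print(least_connected(adj, k=2))  # → ['U235(n,f),SIG', 'Au197(n,g),SIG']
    ```

    The node with no links at all surfaces first, mirroring how isolated or weakly connected reactions stand out in the full EXFOR graph.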

  12. Text mining analysis of public comments regarding high-level radioactive waste disposal

    International Nuclear Information System (INIS)

    In order to narrow the risk perception gap between the general public and people involved in the nuclear industry, as seen in social surveys, an analysis of public comments on high-level radioactive waste (HLW) disposal was conducted to find the significant talking points with the general public for constructing an effective model for communicating social risk information regarding HLW disposal. Text mining was introduced to examine the comments and identify the core public interests underlying them. The text mining method used clusters specific groups of words with negative meanings and then analyzes public understanding by employing text structural analysis to extract words from subjective expressions. Using these procedures, it was found that the public does not trust the nuclear fuel cycle promotion policy and shows signs of anxiety about the long-term technological reliability of waste storage. To develop effective social risk communication on HLW issues, these findings are expected to help experts in the nuclear industry communicate with the general public more effectively and obtain their trust. (author)

  13. Text mining of the classical medical literature for medicines that show potential in diabetic nephropathy.

    Science.gov (United States)

    Zhang, Lei; Li, Yin; Guo, Xinfeng; May, Brian H; Xue, Charlie C L; Yang, Lihong; Liu, Xusheng

    2014-01-01

    Objectives. To apply modern text-mining methods to identify candidate herbs and formulae for the treatment of diabetic nephropathy. Methods. The method we developed includes three steps: (1) identification of candidate ancient terms; (2) systematic search and assessment of medical records written in classical Chinese; (3) preliminary evaluation of the effect and safety of candidates. Results. The ancient terms Xia Xiao, Shen Xiao, and Xiao Shen were determined as the most likely to correspond with diabetic nephropathy and were used in text mining. A total of 80 Chinese formulae for treating conditions congruent with diabetic nephropathy, recorded in medical books from the Tang Dynasty to the Qing Dynasty, were collected. Sao si tang (also called Reeling Silk Decoction) was chosen to illustrate the process of preliminary evaluation of the candidates. It has promising potential for development as a new agent for the treatment of diabetic nephropathy. However, further investigations of its safety for patients with renal insufficiency are still needed. Conclusions. The methods developed in this study offer a targeted approach to identifying traditional herbs and/or formulae as candidates for further investigation in the search for new drugs for modern diseases. However, more effort is still required to improve our techniques, especially with regard to compound formulae. PMID:24744808

  14. Texts and data mining and their possibilities applied to the process of news production

    Directory of Open Access Journals (Sweden)

    Walter Teixeira Lima Jr

    2008-06-01

    Full Text Available The proposal of this essay is to discuss the challenges of representing, in a formal computational process, the knowledge that a journalist uses to articulate news values for the purpose of selecting and ranking news. It discusses how to build bridges that emulate this empirically acquired knowledge using the foundations of computer science in the areas of data storage, retrieval, and linking within a database, which should mirror the way human brains treat information obtained through the sensory system. Systematizing and automating part of the journalistic process in a database contributes to eliminating distortions and faults and to applying, in an efficient manner, techniques of Data and/or Text Mining which, by definition, permit the discovery of nontrivial relations.

  15. DiMeX: A Text Mining System for Mutation-Disease Association Extraction.

    Science.gov (United States)

    Mahmood, A S M Ashique; Wu, Tsung-Jung; Mazumder, Raja; Vijay-Shanker, K

    2016-01-01

    The number of published articles describing associations between mutations and diseases is increasing at a fast pace. There is a pressing need to gather such mutation-disease associations into public knowledge bases, but manual curation slows down the growth of such databases. We have addressed this problem by developing a text-mining system (DiMeX) to extract mutation-disease associations from publication abstracts. DiMeX consists of a series of natural language processing modules that preprocess input text and apply syntactic and semantic patterns to extract mutation-disease associations. DiMeX achieves high precision and recall, with F-scores of 0.88, 0.91 and 0.89 when evaluated on three different datasets for mutation-disease associations. DiMeX includes a separate component that extracts mutation mentions in text and associates them with genes. This component has also been evaluated on different datasets and shown to achieve state-of-the-art performance. The results indicate that our system outperforms the existing mutation-disease association tools, addressing the low-precision problems suffered by most approaches. DiMeX was applied to a large set of abstracts from Medline to extract mutation-disease associations, as well as other relevant information including patient/cohort size and population data. The results are stored in a database that can be queried and downloaded at http://biotm.cis.udel.edu/dimex/. We conclude that this high-throughput text-mining approach has the potential to significantly assist researchers and curators in enriching mutation databases. PMID:27073839
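    The mutation-mention component is specific to DiMeX, but the core idea of spotting protein point mutations in text can be sketched with a regular expression. The pattern and the example sentence below are illustrative assumptions, not DiMeX's actual rules:

    ```python
    import re

    # Protein point mutations in the common "wild-type + position + variant"
    # style (e.g. V600E, p.R175H), using the 20 standard amino-acid codes.
    AA = "ACDEFGHIKLMNPQRSTVWY"
    MUTATION = re.compile(r"\b(?:p\.)?([%s])(\d+)([%s])\b" % (AA, AA))

    def find_mutations(text):
        """Return (wild-type, position, variant) tuples mentioned in text."""
        return [(m.group(1), int(m.group(2)), m.group(3))
                for m in MUTATION.finditer(text)]

    sentence = ("The BRAF V600E mutation and the TP53 p.R175H variant "
                "were associated with increased risk.")
    print(find_mutations(sentence))  # → [('V', 600, 'E'), ('R', 175, 'H')]
    ```

    Real systems layer gene association, normalization, and disambiguation on top of such patterns; this only shows the recognition step.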

  17. Online Discourse on Fibromyalgia: Text-Mining to Identify Clinical Distinction and Patient Concerns

    Science.gov (United States)

    Park, Jungsik; Ryu, Young Uk

    2014-01-01

    Background The purpose of this study was to evaluate the possibility of using text-mining to identify clinical distinctions and patient concerns in online memoirs posted by patients with fibromyalgia (FM). Material/Methods A total of 399 memoirs were collected from an FM group website. The unstructured data of memoirs associated with FM were collected through a crawling process and converted into structured data with a concordance, parts-of-speech tagging, and word frequency. We also conducted a lexical analysis and phrase pattern identification. After examining the data, a set of FM-related keywords were obtained and phrase net relationships were set through a web-based visualization tool. Results The clinical distinction of FM was verified. Pain is the biggest issue for FM patients. The pain affected body parts including ‘muscles,’ ‘leg,’ ‘neck,’ ‘back,’ ‘joints,’ and ‘shoulders,’ with accompanying symptoms such as ‘spasms,’ ‘stiffness,’ and ‘aching,’ and was described as ‘severe,’ ‘chronic,’ and ‘constant.’ This study also demonstrated that it was possible to understand the interests and concerns of FM patients through text-mining. FM patients wanted to escape from the pain and symptoms, so they were interested in medical treatment and help. They also seemed to have an interest in their work and occupation, and hoped to continue living their lives through relationships with the people around them. Conclusions This research shows the potential for extracting keywords to confirm the clinical distinction of a certain disease, and text-mining can help us objectively understand the concerns of patients by generalizing over a large number of subjective illness experiences. However, there are limitations to the processes and methods for organizing and classifying large amounts of text, so these limits have to be considered when analyzing the results. The development of research methodology to overcome
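    The crawl-then-count pipeline described above can be approximated in a few lines of Python. The posts and the stopword list below are invented stand-ins for the 399 memoirs and the study's actual lexicon:

    ```python
    import re
    from collections import Counter

    # Minimal illustrative stopword list, not the study's actual lexicon.
    STOPWORDS = {"the", "and", "a", "of", "my", "is", "in", "to", "i"}

    def keyword_frequencies(posts, stopwords=STOPWORDS):
        """Crawl -> tokenize -> count: turn unstructured posts into a
        structured word-frequency table."""
        counts = Counter()
        for post in posts:
            counts.update(w for w in re.findall(r"[a-z']+", post.lower())
                          if w not in stopwords)
        return counts

    posts = ["The chronic pain in my neck and shoulders is constant.",
             "Muscle stiffness and aching pain in my back."]
    print(keyword_frequencies(posts).most_common(1))  # → [('pain', 2)]
    ```

    Even this toy version surfaces 'pain' as the dominant keyword, which is exactly the kind of signal the study extracted at scale.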

  18. A corpus of full-text journal articles is a robust evaluation tool for revealing differences in performance of biomedical natural language processing tools

    Directory of Open Access Journals (Sweden)

    Verspoor Karin

    2012-08-01

    Full Text Available Abstract Background We introduce the linguistic annotation of a corpus of 97 full-text biomedical publications, known as the Colorado Richly Annotated Full Text (CRAFT) corpus. We further assess the performance of existing tools for sentence splitting, tokenization, syntactic parsing, and named entity recognition on this corpus. Results Many biomedical natural language processing systems demonstrated large differences between their previously published results and their performance on the CRAFT corpus when tested with the publicly available models or rule sets. Trainable systems differed widely with respect to their ability to build high-performing models based on these data. Conclusions The finding that some systems were able to train high-performing models on this corpus is additional evidence, beyond high inter-annotator agreement, that the quality of the CRAFT corpus is high. The overall poor performance of various systems indicates that considerable work remains to be done before natural language processing systems work well when the input is full-text journal articles. The CRAFT corpus provides a valuable resource to the biomedical natural language processing community for the evaluation and training of new models for biomedical full-text publications.
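    Tool comparisons like the one above typically reduce to span-level precision, recall, and F1 against gold annotations. A minimal sketch of that scoring (not the CRAFT evaluation scripts themselves):

    ```python
    def prf(gold, predicted):
        """Span-level precision, recall and F1 against gold annotations."""
        gold, predicted = set(gold), set(predicted)
        tp = len(gold & predicted)
        p = tp / len(predicted) if predicted else 0.0
        r = tp / len(gold) if gold else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        return p, r, f

    # Character-offset token spans: the tool splits the third token wrongly.
    gold = {(0, 3), (4, 9), (10, 17)}
    pred = {(0, 3), (4, 9), (10, 13), (13, 17)}
    print(prf(gold, pred))  # precision 0.5, recall ~0.67, F1 ~0.57
    ```

    The same function scores tokenization, named entity recognition, or any other task whose output can be cast as offset spans.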

  19. A multilingual text mining based content gathering system for open source intelligence

    International Nuclear Information System (INIS)

    The number of documents available in electronic format has grown dramatically in recent years, whilst the information that States provide to the IAEA is not always complete or clear. Independent information sources can balance the limited State-reported information, particularly for non-cooperative targets. The process of accessing all these raw data, heterogeneous in both source and language, and transforming them into information is therefore inextricably linked to the concepts of automatic textual analysis and synthesis, hinging greatly on the ability to master the problems of multilinguality. This paper describes a multilingual indexing, searching and clustering system whose main goal is managing huge collections of data coming from different and geographically distributed information sources, providing language-independent searches and dynamic classification facilities. The automatic linguistic analysis of documents is based on morpho-syntactic, functional and statistical criteria. This phase is intended to identify only the significant expressions in the raw text: the system analyzes each sentence, cycling through all the possible sentence constructions. Using a series of word relationship tests to establish context, the system tries to determine the meaning of the sentence. Once reduced to its part-of-speech and functionally tagged base form, referred to its language-independent entry in a sectorial multilingual dictionary, each tagged lemma is used as a descriptor and possible seed for clustering. The automatic classification of results is made by an unsupervised classification schema. Through multilingual text mining, analysts can get an overview of large volumes of textual data in a highly readable grid, which helps them discover meaningful similarities among documents and find any nuclear proliferation and safeguards related information. Multilingual text mining makes it possible to overcome linguistic barriers, allowing the automatic

  20. MET network in PubMed: a text-mined network visualization and curation system.

    Science.gov (United States)

    Dai, Hong-Jie; Su, Chu-Hsien; Lai, Po-Ting; Huang, Ming-Siang; Jonnagaddala, Jitendra; Rose Jue, Toni; Rao, Shruti; Chou, Hui-Jou; Milacic, Marija; Singh, Onkar; Syed-Abdul, Shabbir; Hsu, Wen-Lian

    2016-01-01

    Metastasis is the dissemination of a cancer/tumor from one organ to another, and it is the most dangerous stage during cancer progression, causing more than 90% of cancer deaths. Improving the understanding of the complicated cellular mechanisms underlying metastasis requires investigations of the signaling pathways. To this end, we developed a METastasis (MET) network visualization and curation tool to assist metastasis researchers retrieve network information of interest while browsing through the large volume of studies in PubMed. MET can recognize relations among genes, cancers, tissues and organs of metastasis mentioned in the literature through text-mining techniques, and then produce a visualization of all mined relations in a metastasis network. To facilitate the curation process, MET is developed as a browser extension that allows curators to review and edit concepts and relations related to metastasis directly in PubMed. PubMed users can also view the metastatic networks integrated from the large collection of research papers directly through MET. For the BioCreative 2015 interactive track (IAT), a curation task was proposed to curate metastatic networks among PubMed abstracts. Six curators participated in the proposed task and a post-IAT task, curating 963 unique metastatic relations from 174 PubMed abstracts using MET. Database URL: http://btm.tmu.edu.tw/metastasisway. PMID:27242035

  1. Sequential Data Mining for Information Extraction from Texts Fouille de données séquentielles pour l’extraction d’information dans les textes

    Directory of Open Access Journals (Sweden)

    Thierry Charnois

    2010-09-01

    Full Text Available This paper shows the benefit of using data mining methods for biological natural language processing. A method for discovering linguistic patterns based on recursive sequential pattern mining is proposed. It requires neither sentence parsing nor any resources other than a training data set. It produces understandable results, and we demonstrate its usefulness for the extraction of relations between named entities. For the named entity recognition problem, we propose a method based on a new kind of pattern that takes into account the sequence and its context.
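    As a rough illustration of the idea (not the authors' recursive algorithm), frequent contiguous word patterns can be mined from a toy training set with a support threshold:

    ```python
    from collections import Counter

    def frequent_patterns(sequences, min_support=2, max_len=3):
        """Count contiguous word subsequences (a simplified stand-in for
        recursive sequential pattern mining) and keep those meeting the
        support threshold."""
        counts = Counter()
        for seq in sequences:
            for n in range(1, max_len + 1):
                # each distinct n-gram counts at most once per sequence
                grams = {tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)}
                counts.update(grams)
        return {p: c for p, c in counts.items() if c >= min_support}

    sents = [
        ["geneA", "interacts", "with", "geneB"],
        ["geneC", "interacts", "with", "geneD"],
    ]
    pats = frequent_patterns(sents)
    print(pats[("interacts", "with")])  # → 2
    ```

    Patterns like ("interacts", "with") that recur across sentences are exactly the kind of linguistic cue the paper exploits for relation extraction.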

  2. Developing timely insights into comparative effectiveness research with a text-mining pipeline.

    Science.gov (United States)

    Chang, Meiping; Chang, Man; Reed, Jane Z; Milward, David; Xu, Jinghai James; Cornell, Wendy D

    2016-03-01

    Comparative effectiveness research (CER) provides evidence for the relative effectiveness and risks of different treatment options and informs decisions made by healthcare providers, payers, and pharmaceutical companies. CER data come from retrospective analyses as well as prospective clinical trials. Here, we describe the development of a text-mining pipeline based on natural language processing (NLP) that extracts key information from three different trial data sources: NIH ClinicalTrials.gov, WHO International Clinical Trials Registry Platform (ICTRP), and Citeline Trialtrove. The pipeline leverages tailored terminologies to produce an integrated and structured output, capturing any trials in which pharmaceutical products of interest are compared with another therapy. The timely information alerts generated by this system provide the earliest and most complete picture of emerging clinical research. PMID:26854423

  3. A Distributed Look-up Architecture for Text Mining Applications using MapReduce.

    Science.gov (United States)

    Balkir, Atilla Soner; Foster, Ian; Rzhetsky, Andrey

    2011-11-01

    Text mining applications typically involve statistical models that require accessing and updating model parameters in an iterative fashion. With the growing size of the data, such models become extremely parameter rich, and naive parallel implementations fail to address the scalability problem of maintaining a distributed look-up table that maps model parameters to their values. We evaluate several existing alternatives to provide coordination among worker nodes in Hadoop [11] clusters, and suggest a new multi-layered look-up architecture that is specifically optimized for certain problem domains. Our solution exploits the power-law distribution characteristics of the phrase or n-gram counts in large corpora while utilizing a Bloom Filter [2], in-memory cache, and an HBase [12] cluster at varying levels of abstraction. PMID:25356441
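    The Bloom filter layer mentioned above can be sketched in pure Python. The size and hash count below are arbitrary illustrative choices, not the paper's configuration:

    ```python
    import hashlib

    class BloomFilter:
        """Minimal Bloom filter: probabilistic membership with no false
        negatives, usable as the first (cheapest) look-up layer."""
        def __init__(self, size=1024, hashes=3):
            self.size, self.hashes, self.bits = size, hashes, 0

        def _positions(self, item):
            # derive k independent bit positions from salted SHA-256 digests
            for i in range(self.hashes):
                h = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
                yield int(h, 16) % self.size

        def add(self, item):
            for pos in self._positions(item):
                self.bits |= 1 << pos

        def __contains__(self, item):
            return all(self.bits >> pos & 1 for pos in self._positions(item))

    bf = BloomFilter()
    bf.add("protein kinase")
    print("protein kinase" in bf)  # → True (guaranteed: no false negatives)
    print("random phrase" in bf)   # usually False; small false-positive chance
    ```

    Because most rare n-grams are rejected by this cheap check, the slower in-memory cache and HBase layers are consulted far less often, which is the point of the multi-layered design.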

  4. Finding falls in ambulatory care clinical documents using statistical text mining

    Science.gov (United States)

    McCart, James A; Berndt, Donald J; Jarman, Jay; Finch, Dezon K; Luther, Stephen L

    2013-01-01

    Objective To determine how well statistical text mining (STM) models can identify falls within clinical text associated with an ambulatory encounter. Materials and Methods 2241 patients were selected with a fall-related ICD-9-CM E-code or matched injury diagnosis code while being treated as an outpatient at one of four sites within the Veterans Health Administration. All clinical documents within a 48-h window of the recorded E-code or injury diagnosis code for each patient were obtained (n=26 010; 611 distinct document titles) and annotated for falls. Logistic regression, support vector machine, and cost-sensitive support vector machine (SVM-cost) models were trained on a stratified sample of 70% of documents from one location (dataset Atrain) and then applied to the remaining unseen documents (datasets Atest–D). Results All three STM models obtained area under the receiver operating characteristic curve (AUC) scores above 0.950 on the four test datasets (Atest–D). The SVM-cost model obtained the highest AUC scores, ranging from 0.953 to 0.978. The SVM-cost model also achieved F-measure values ranging from 0.745 to 0.853, sensitivity from 0.890 to 0.931, and specificity from 0.877 to 0.944. Discussion The STM models performed well across a large heterogeneous collection of document titles. In addition, the models also generalized across other sites, including a traditionally bilingual site that had distinctly different grammatical patterns. Conclusions The results of this study suggest STM-based models have the potential to improve surveillance of falls. Furthermore, the encouraging evidence shown here that STM is a robust technique for mining clinical documents bodes well for other surveillance-related topics. PMID:23242765
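    A minimal sketch of statistical text mining in this spirit: a bag-of-words logistic regression trained on toy "clinical notes". The documents are invented examples; the study used regularized models on far larger data:

    ```python
    import math
    import re
    from collections import defaultdict

    def tokens(doc):
        return re.findall(r"[a-z]+", doc.lower())

    def train(docs, labels, epochs=200, lr=0.5):
        """Tiny bag-of-words logistic regression trained by stochastic
        gradient ascent on the log-likelihood."""
        w, b = defaultdict(float), 0.0
        for _ in range(epochs):
            for doc, y in zip(docs, labels):
                z = b + sum(w[t] for t in tokens(doc))
                p = 1 / (1 + math.exp(-z))
                g = y - p          # gradient of the log-likelihood
                b += lr * g
                for t in tokens(doc):
                    w[t] += lr * g
        return w, b

    def predict(model, doc):
        w, b = model
        z = b + sum(w[t] for t in tokens(doc))
        return 1 / (1 + math.exp(-z))

    docs = ["patient slipped and fell in bathroom",
            "fall from ladder, hip pain",
            "routine follow up, medication refill",
            "blood pressure check, no complaints"]
    labels = [1, 1, 0, 0]  # 1 = fall-related note
    model = train(docs, labels)
    print(predict(model, "patient had a fall at home") > 0.5)
    ```

    Real STM pipelines add regularization, class-cost weighting (as in the SVM-cost model), and evaluation by AUC, but the document-to-feature-to-probability flow is the same.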

  5. PreBIND and Textomy – mining the biomedical literature for protein-protein interactions using a support vector machine

    OpenAIRE

    Baskin Berivan; Zhang Shudong; Tuekam Brigitte; Lay Vicki; Wolting Cheryl; de Bruijn Berry; Martin Joel; Donaldson Ian; Bader Gary D; Michalickova Katerina; Pawson Tony; Hogue Christopher WV

    2003-01-01

    Abstract Background The majority of experimentally verified molecular interaction and biological pathway data are present in the unstructured text of biomedical journal articles where they are inaccessible to computational methods. The Biomolecular interaction network database (BIND) seeks to capture these data in a machine-readable format. We hypothesized that the formidable task-size of backfilling the database could be reduced by using Support Vector Machine technology to first locate inte...

  6. How to link ontologies and protein-protein interactions to literature: text-mining approaches and the BioCreative experience.

    Science.gov (United States)

    Krallinger, Martin; Leitner, Florian; Vazquez, Miguel; Salgado, David; Marcelle, Christophe; Tyers, Mike; Valencia, Alfonso; Chatr-aryamontri, Andrew

    2012-01-01

    There is an increasing interest in developing ontologies and controlled vocabularies to improve the efficiency and consistency of manual literature curation, to enable more formal biocuration workflow results and ultimately to improve analysis of biological data. Two ontologies that have been successfully used for this purpose are the Gene Ontology (GO) for annotating aspects of gene products and the Molecular Interaction ontology (PSI-MI) used by databases that archive protein-protein interactions. The examination of protein interactions has proven to be extremely promising for the understanding of cellular processes. Manual mapping of information from the biomedical literature to bio-ontology terms is one of the most challenging components in the curation pipeline. It requires that expert curators interpret the natural language descriptions contained in articles and infer their semantic equivalents in the ontology (controlled vocabulary). Since manual curation is a time-consuming process, there is strong motivation to implement text-mining techniques to automatically extract annotations from free text. A range of text mining strategies has been devised to assist in the automated extraction of biological data. These strategies either recognize technical terms used recurrently in the literature and propose them as candidates for inclusion in ontologies, or retrieve passages that serve as evidential support for annotating an ontology term, e.g. from the PSI-MI or GO controlled vocabularies. Here, we provide a general overview of current text-mining methods to automatically extract annotations of GO and PSI-MI ontology terms in the context of the BioCreative (Critical Assessment of Information Extraction Systems in Biology) challenge. Special emphasis is given to protein-protein interaction data and PSI-MI terms referring to interaction detection methods.
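    The simplest of the strategies described, dictionary-based retrieval of ontology terms, might look like this. The vocabulary entries are illustrative; verify any accession against the current GO/PSI-MI releases before relying on it:

    ```python
    import re

    # Hypothetical mini-vocabulary mapping term names to ontology accessions.
    VOCAB = {
        "two hybrid": "MI:0018",
        "coimmunoprecipitation": "MI:0019",
        "protein binding": "GO:0005515",
    }

    def annotate(sentence, vocab=VOCAB):
        """Dictionary-based matching: propose ontology terms whose names
        occur verbatim in the sentence."""
        text = sentence.lower()
        return sorted((term, acc) for term, acc in vocab.items()
                      if re.search(r"\b" + re.escape(term) + r"\b", text))

    s = "The interaction was confirmed by coimmunoprecipitation and two hybrid assays."
    print(annotate(s))
    ```

    Production curation-support systems go well beyond verbatim lookup, handling synonyms, morphological variants, and evidence passages, but this is the baseline the BioCreative tasks build on.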

  7. Practice-based evidence: profiling the safety of cilostazol by text-mining of clinical notes.

    Directory of Open Access Journals (Sweden)

    Nicholas J Leeper

    Full Text Available BACKGROUND: Peripheral arterial disease (PAD) is a growing problem with few available therapies. Cilostazol is the only FDA-approved medication with a class I indication for intermittent claudication, but carries a black box warning due to concerns for increased cardiovascular mortality. To assess the validity of this black box warning, we employed a novel text-analytics pipeline to quantify the adverse events associated with Cilostazol use in a clinical setting, including patients with congestive heart failure (CHF). METHODS AND RESULTS: We analyzed the electronic medical records of 1.8 million subjects from the Stanford clinical data warehouse spanning 18 years using a novel text-mining/statistical analytics pipeline. We identified 232 PAD patients taking Cilostazol and created a control group of 1,160 PAD patients not taking this drug using 1:5 propensity-score matching. Over a mean follow up of 4.2 years, we observed no association between Cilostazol use and any major adverse cardiovascular event including stroke (OR = 1.13, CI [0.82, 1.55]), myocardial infarction (OR = 1.00, CI [0.71, 1.39]), or death (OR = 0.86, CI [0.63, 1.18]). Cilostazol was not associated with an increase in any arrhythmic complication. We also identified a subset of CHF patients who were prescribed Cilostazol despite its black box warning, and found that it did not increase mortality in this high-risk group of patients. CONCLUSIONS: This proof of principle study shows the potential of text-analytics to mine clinical data warehouses to uncover 'natural experiments' such as the use of Cilostazol in CHF patients. We envision this method will have broad applications for examining difficult to test clinical hypotheses and for aiding post-marketing drug safety surveillance. Moreover, our observations argue for a prospective study to examine the validity of a drug safety warning that may be unnecessarily limiting the use of an efficacious therapy.
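    The 1:5 propensity-score matching step can be sketched as a greedy nearest-neighbor search over estimated scores. This is a simplified illustration with invented scores and an assumed caliper, not the study's matching code:

    ```python
    def match_one_to_k(treated, controls, k=5, caliper=0.05):
        """Greedy 1:k propensity-score matching: each treated subject gets
        the k nearest unused controls within the caliper."""
        used, matches = set(), {}
        for t_id, t_score in treated:
            candidates = sorted(
                (abs(c_score - t_score), c_id)
                for c_id, c_score in controls
                if c_id not in used and abs(c_score - t_score) <= caliper)
            chosen = [c_id for _, c_id in candidates[:k]]
            used.update(chosen)
            matches[t_id] = chosen
        return matches

    # Invented propensity scores for one treated subject and seven controls.
    treated = [("T1", 0.30)]
    controls = [("C1", 0.29), ("C2", 0.31), ("C3", 0.33), ("C4", 0.50),
                ("C5", 0.27), ("C6", 0.34), ("C7", 0.32)]
    pairs = match_one_to_k(treated, controls, k=5)
    print(pairs)  # C4 is excluded: its score falls outside the caliper
    ```

    In practice the scores come from a model of treatment assignment (e.g. logistic regression on covariates), and matching quality is checked by covariate balance diagnostics.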

  8. Development of Workshops on Biodiversity and Evaluation of the Educational Effect by Text Mining Analysis

    Science.gov (United States)

    Baba, R.; Iijima, A.

    2014-12-01

    Conservation of biodiversity is one of the key issues in environmental studies. As a means to address this issue, education is becoming increasingly important. In previous work, we developed a course of workshops on the conservation of biodiversity. To disseminate the course as a tool for environmental education, determining its educational effect is essential. Text mining enables analyses of the frequency and co-occurrence of words in freely described texts. This study is intended to evaluate the effect of the workshop by using text mining techniques. We hosted the originally developed workshop on the conservation of biodiversity for 22 college students. The aim of the workshop was to convey the definition of biodiversity. Generally, biodiversity refers to the diversity of ecosystems, diversity between species, and diversity within species. To facilitate discussion, supplementary materials were used. For instance, field guides of wildlife species were used to discuss the diversity of ecosystems. Moreover, a hierarchical framework in an ecological pyramid was shown to aid understanding of the role of diversity between species. In addition, we offered a document on the historical Potato Famine in Ireland to discuss diversity within species from the genetic viewpoint. Before and after the workshop, we asked students for a free description of the definition of biodiversity and analyzed the responses using Tiny Text Miner. This technique enables Japanese-language morphological analysis. Frequently used words were sorted into categories. Moreover, a principal component analysis was carried out. After the workshop, the frequency of words tagged to diversity between species and diversity within species increased significantly. From the principal component analysis, the 1st component consists of words such as producer, consumer, decomposer, and food chain. This indicates that the students have comprehended the close relationship between

  9. Effective Mining of Protein Interactions

    OpenAIRE

    Rinaldi, F; Schneider, G; Kaljurand, K.; Clematide, S

    2009-01-01

    The detection of mentions of protein-protein interactions in the scientific literature has recently emerged as a core task in biomedical text mining. We present effective techniques for this task, which have been developed using the IntAct database as a gold standard, and have been evaluated in two text mining competitions.

  10. Identifying the Uncertainty in Physician Practice Location through Spatial Analytics and Text Mining

    Directory of Open Access Journals (Sweden)

    Xuan Shi

    2016-09-01

    Full Text Available In response to widespread concern about the adequacy, distribution, and disparity of access to the health care workforce, the correct identification of physicians’ practice locations is critical to assessing access to public health services. In the prior literature, little effort has been made to detect and resolve the uncertainty about whether the address provided by a physician in a survey is a practice address or a home address. This paper introduces how to identify the uncertainty in a physician’s practice location through spatial analytics, text mining, and visual examination. While land use and zoning codes, embedded within the parcel datasets, help to differentiate residential areas from other types, spatial analytics may have certain limitations in matching and comparing physician and parcel datasets with different uncertainty issues, which may lead to unforeseen results. Handling and matching the string components between physicians’ addresses and the addresses of the parcels can identify spatial uncertainty and instability and derive a more reasonable relationship between the different datasets. Visual analytics and examination further help to clarify otherwise undetectable patterns. This research will have a broader impact on federal and state initiatives and policies to address both the insufficiency and maldistribution of the health care workforce and to improve the accessibility of public health services.
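    Matching the string components of physician and parcel addresses can be sketched with the standard library's `difflib`. The normalization table and the example addresses are illustrative assumptions, not the paper's matcher:

    ```python
    from difflib import SequenceMatcher

    # Illustrative abbreviation/punctuation table for address normalization.
    ABBREV = {"street": "st", "avenue": "ave", "suite": "ste", ",": "", ".": ""}

    def normalize(addr):
        """Crude normalization of string components before comparison."""
        addr = addr.lower()
        for full, short in ABBREV.items():
            addr = addr.replace(full, short)
        return " ".join(addr.split())

    def similarity(addr_a, addr_b):
        """Fuzzy similarity in [0, 1] between a physician-reported address
        and a parcel address."""
        return SequenceMatcher(None, normalize(addr_a), normalize(addr_b)).ratio()

    survey = "123 N. Main Street, Suite 4"
    parcel = "123 n main st ste 4"
    print(similarity(survey, parcel))  # → 1.0 after normalization
    ```

    Pairs scoring below a chosen threshold would then be flagged for the visual examination step described above.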

  11. Interpretation of the consequences of mutations in protein kinases: combined use of bioinformatics and text mining

    Directory of Open Access Journals (Sweden)

    Jose M.G. Izarzugaza

    2012-08-01

    Full Text Available Protein kinases play a crucial role in a plethora of significant physiological functions, and a number of mutations in this superfamily have been reported in the literature to disrupt protein structure and/or function. Computational and experimental research aims to discover the mechanistic connection between mutations in protein kinases and disease, with the final aim of predicting the consequences of mutations on protein function and the subsequent phenotypic alterations. In this chapter, we will review the possibilities and limitations of current computational methods for the prediction of the pathogenicity of mutations in the protein kinase superfamily. In particular we will focus on the problem of benchmarking the predictions with independent gold-standard datasets. We will propose a pipeline for the curation of mutations automatically extracted from the literature. Since many of these mutations are not included in the databases that are commonly used to train the computational methods to predict the pathogenicity of protein kinase mutations, we propose using them to build a valuable gold-standard dataset for benchmarking a number of these predictors. Finally, we will discuss how text mining approaches constitute a powerful tool for the interpretation of the consequences of mutations in the context of personalized/stratified medicine.
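A curation pipeline for mutations mined from the literature typically begins by spotting point-mutation mentions in text. As a minimal illustrative first pass (not the authors' pipeline), a regular expression over single-letter amino-acid codes catches the common "V600E" / "p.G12D" notation:

```python
import re

# Single-letter amino-acid codes; matches mentions like "V600E" or "p.G12D".
AA = "ACDEFGHIKLMNPQRSTVWY"
MUTATION_RE = re.compile(rf"\b(?:p\.)?([{AA}])(\d+)([{AA}])\b")

def find_point_mutations(text):
    """Return (wild-type, position, mutant) tuples mentioned in the text."""
    return [(wt, int(pos), mt) for wt, pos, mt in MUTATION_RE.findall(text)]

sentence = ("The BRAF V600E mutation and KRAS p.G12D were both reported "
            "to activate downstream kinase signalling.")
print(find_point_mutations(sentence))  # → [('V', 600, 'E'), ('G', 12, 'D')]
```

Real systems add disambiguation (e.g. against cell-line names that match the same pattern) and normalization to a reference protein sequence, both of which this sketch omits.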

  12. A methodology for semiautomatic taxonomy of concepts extraction from nuclear scientific documents using text mining techniques

    International Nuclear Information System (INIS)

    This thesis presents a text mining method for the semi-automatic extraction of a taxonomy of concepts from a textual corpus composed of scientific papers related to the nuclear area. Text classification is a natural human practice and a crucial task for working with large repositories. Document clustering provides a logical and understandable framework that facilitates organization, browsing, and searching. Most clustering algorithms use the bag-of-words model to represent the content of a document. This model generates high-dimensional data, ignores the fact that different words can have the same meaning, and does not consider the relationships between words, assuming that they are independent of each other. The methodology combines a model for document representation by concepts with a hierarchical document clustering method based on the co-occurrence frequency of concepts, and a technique for labeling clusters with their most representative terms, with the objective of producing a taxonomy of concepts that may reflect the structure of the knowledge domain. It is hoped that this work will contribute to the conceptual mapping of the scientific production of the nuclear area and thus support the management of research activities in this area. (author)

  13. A practical application of text mining to literature on cognitive rehabilitation and enhancement through neurostimulation

    Directory of Open Access Journals (Sweden)

    Puiu F Balan

    2014-09-01

    Full Text Available The exponential growth in publications represents a major challenge for researchers. Many scientific domains, including neuroscience, are not yet fully engaged in exploiting large bodies of publications. In this paper, we promote the idea to partially automate the processing of scientific documents, specifically using text mining (TM), to efficiently review big corpora of publications. The cognitive advantage given by TM is mainly related to the automatic extraction of relevant trends from corpora of literature, otherwise impossible to analyze in short periods of time. Specifically, the benefits of TM are increased speed, quality and reproducibility of text processing, boosted by rapid updates of the results. First, we selected a set of TM-tools that allow user-friendly approaches of the scientific literature, and which could serve as a guide for researchers willing to incorporate TM in their work. Second, we used these TM-tools to obtain basic insights into the relevant literature on cognitive rehabilitation (CR) and cognitive enhancement (CE) using transcranial magnetic stimulation (TMS). TM readily extracted the diversity of TMS applications in CR and CE from vast corpora of publications, automatically retrieving trends already described in published reviews. TMS emerged as one of the important non-invasive tools that can both improve cognitive and motor functions in numerous neurological diseases and induce modulations/enhancements of many fundamental brain functions. TM also revealed trends in big corpora of publications by extracting occurrence frequency and relationships of particular subtopics. Moreover, we showed that CR and CE share research topics, both aiming to increase the brain’s capacity to process information, thus supporting their integration in a larger perspective. Methodologically, despite limitations of a simple user-friendly approach, TM served well the reviewing process.

  14. Text mining for neuroanatomy using WhiteText with an updated corpus and a new web application

    Directory of Open Access Journals (Sweden)

    Leon eFrench

    2015-05-01

    Full Text Available We describe the WhiteText project, and its progress towards automatically extracting statements of neuroanatomical connectivity from text. We review progress to date on the three main steps of the project: recognition of brain region mentions, standardization of brain region mentions to neuroanatomical nomenclature, and connectivity statement extraction. We further describe a new version of our manually curated corpus that adds 2,111 connectivity statements from 1,828 additional abstracts. Cross-validation classification within the new corpus replicates results on our original corpus, recalling 51% of connectivity statements at 67% precision. The resulting merged corpus provides 5,208 connectivity statements that can be used to seed species-specific connectivity matrices and to better train automated techniques. Finally, we present a new web application that allows fast interactive browsing of the over 70,000 sentences indexed by the system, as a tool for accessing the data and assisting in further curation. Software and data are freely available at http://www.chibi.ubc.ca/WhiteText/.

  15. Mining

    Directory of Open Access Journals (Sweden)

    Khairullah Khan

    2014-09-01

    Full Text Available Opinion mining is an interesting area of research because of its applications in various fields. Collecting opinions of people about products and about social and political events and problems through the Web is becoming increasingly popular every day. The opinions of users are helpful for the public and for stakeholders when making certain decisions. Opinion mining is a way to retrieve information through search engines, Web blogs and social networks. Because of the huge number of reviews in the form of unstructured text, it is impossible to summarize the information manually. Accordingly, efficient computational methods are needed for mining and summarizing the reviews from corpuses and Web documents. This study presents a systematic literature survey regarding the computational techniques, models and algorithms for mining opinion components from unstructured reviews.

  16. Integrating protein-protein interactions and text mining for protein function prediction

    Directory of Open Access Journals (Sweden)

    Leser Ulf

    2008-07-01

    Full Text Available Abstract Background Functional annotation of proteins remains a challenging task. Currently the scientific literature serves as the main source for yet uncurated functional annotations, but curation work is slow and expensive. Automatic techniques that support this work are still lacking reliability. We developed a method to identify conserved protein interaction graphs and to predict missing protein functions from orthologs in these graphs. To enhance the precision of the results, we furthermore implemented a procedure that validates all predictions based on findings reported in the literature. Results Using this procedure, more than 80% of the GO annotations for proteins with highly conserved orthologs that are available in UniProtKb/Swiss-Prot could be verified automatically. For a subset of proteins we predicted new GO annotations that were not available in UniProtKb/Swiss-Prot. All predictions were correct (100% precision) according to the verifications from a trained curator. Conclusion Our method of integrating CCSs and literature mining is thus a highly reliable approach to predict GO annotations for weakly characterized proteins with orthologs.

  17. Identifying the Uncertainty in Physician Practice Location through Spatial Analytics and Text Mining

    Science.gov (United States)

    Shi, Xuan; Xue, Bowei; Xierali, Imam M.

    2016-01-01

    In response to the widespread concern about the adequacy, distribution, and disparity of access to a health care workforce, the correct identification of physicians’ practice locations is critical to access public health services. In prior literature, little effort has been made to detect and resolve the uncertainty about whether the address provided by a physician in the survey is a practice address or a home address. This paper introduces how to identify the uncertainty in a physician’s practice location through spatial analytics, text mining, and visual examination. While land use and zoning code, embedded within the parcel datasets, help to differentiate resident areas from other types, spatial analytics may have certain limitations in matching and comparing physician and parcel datasets with different uncertainty issues, which may lead to unforeseen results. Handling and matching the string components between physicians’ addresses and the addresses of the parcels could identify the spatial uncertainty and instability to derive a more reasonable relationship between different datasets. Visual analytics and examination further help to clarify the undetectable patterns. This research will have a broader impact over federal and state initiatives and policies to address both insufficiency and maldistribution of a health care workforce to improve the accessibility to public health services. PMID:27657100

  18. METSP: A Maximum-Entropy Classifier Based Text Mining Tool for Transporter-Substrate Identification with Semistructured Text

    Directory of Open Access Journals (Sweden)

    Min Zhao

    2015-01-01

    Full Text Available The substrates of a transporter are not only useful for inferring the function of the transporter, but also important for discovering compound-compound interactions and for reconstructing metabolic pathways. Though plenty of data has been accumulated with the development of new technologies such as in vitro transporter assays, the search for substrates of transporters is far from complete. In this article, we introduce METSP, a maximum-entropy classifier devoted to retrieving transporter-substrate pairs (TSPs) from semistructured text. Based on the high quality annotation from UniProt, METSP achieves high precision and recall in cross-validation experiments. When METSP is applied to 182,829 human transporter annotation sentences in UniProt, it identifies 3942 sentences with transporter and compound information. Finally, 1547 high-confidence human TSPs are identified for further manual curation, of which 58.37% are pairs with novel substrates not annotated in public transporter databases. METSP is the first efficient tool to extract TSPs from semistructured annotation text in UniProt. This tool can help to determine the precise substrates and drugs of transporters, thus facilitating drug-target prediction, metabolic network reconstruction, and literature classification.
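A maximum-entropy classifier over sentence features is equivalent to logistic regression with indicator features. The tiny bag-of-words model below, fitted by plain gradient ascent on invented training sentences, illustrates the technique in general; it is a sketch of the idea, not METSP itself:

```python
import math
from collections import defaultdict

def tokens(sentence):
    return sentence.lower().split()

def train_maxent(labelled, epochs=200, lr=0.5):
    """Binary maximum-entropy (logistic regression) model over bag-of-words
    features, fitted with stochastic gradient ascent on the log-likelihood."""
    weights = defaultdict(float)
    bias = 0.0
    for _ in range(epochs):
        for sentence, label in labelled:
            feats = tokens(sentence)
            score = bias + sum(weights[t] for t in feats)
            prob = 1.0 / (1.0 + math.exp(-score))
            grad = label - prob          # gradient of the log-likelihood
            bias += lr * grad
            for t in feats:
                weights[t] += lr * grad
    return weights, bias

def predict(model, sentence):
    weights, bias = model
    score = bias + sum(weights[t] for t in tokens(sentence))
    return 1 if score > 0 else 0

# Hypothetical sentences: label 1 = mentions a transporter-substrate pair.
train = [
    ("slc7a5 transports leucine across the membrane", 1),
    ("abcb1 exports digoxin from the cell", 1),
    ("the protein is expressed in liver tissue", 0),
    ("expression increases under hypoxic stress", 0),
]
model = train_maxent(train)
print(predict(model, "slc7a5 transports leucine"))   # → 1
print(predict(model, "expressed in liver tissue"))   # → 0
```

METSP's real feature set is richer than raw tokens (it exploits the semistructured layout of UniProt annotation lines), but the decision rule has this same form.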

  19. Text mining for neuroanatomy using WhiteText with an updated corpus and a new web application.

    Science.gov (United States)

    French, Leon; Liu, Po; Marais, Olivia; Koreman, Tianna; Tseng, Lucia; Lai, Artemis; Pavlidis, Paul

    2015-01-01

    We describe the WhiteText project, and its progress towards automatically extracting statements of neuroanatomical connectivity from text. We review progress to date on the three main steps of the project: recognition of brain region mentions, standardization of brain region mentions to neuroanatomical nomenclature, and connectivity statement extraction. We further describe a new version of our manually curated corpus that adds 2,111 connectivity statements from 1,828 additional abstracts. Cross-validation classification within the new corpus replicates results on our original corpus, recalling 67% of connectivity statements at 51% precision. The resulting merged corpus provides 5,208 connectivity statements that can be used to seed species-specific connectivity matrices and to better train automated techniques. Finally, we present a new web application that allows fast interactive browsing of the over 70,000 sentences indexed by the system, as a tool for accessing the data and assisting in further curation. Software and data are freely available at http://www.chibi.ubc.ca/WhiteText/. PMID:26052282

  20. The Feasibility of Using Large-Scale Text Mining to Detect Adverse Childhood Experiences in a VA-Treated Population.

    Science.gov (United States)

    Hammond, Kenric W; Ben-Ari, Alon Y; Laundry, Ryan J; Boyko, Edward J; Samore, Matthew H

    2015-12-01

    Free text in electronic health records resists large-scale analysis. Text records facts of interest not found in encoded data, and text mining enables their retrieval and quantification. The U.S. Department of Veterans Affairs (VA) clinical data repository affords an opportunity to apply text-mining methodology to study clinical questions in large populations. To assess the feasibility of text mining, investigation of the relationship between exposure to adverse childhood experiences (ACEs) and recorded diagnoses was conducted among all VA-treated Gulf war veterans, utilizing all progress notes recorded from 2000-2011. Text processing extracted ACE exposures recorded among 44.7 million clinical notes belonging to 243,973 veterans. The relationship of ACE exposure to adult illnesses was analyzed using logistic regression. Bias considerations were assessed. ACE score was strongly associated with suicide attempts and serious mental disorders (ORs = 1.84 to 1.97), and less so with behaviorally mediated and somatic conditions (ORs = 1.02 to 1.36) per unit. Bias adjustments did not remove persistent associations between ACE score and most illnesses. Text mining to detect ACE exposure in a large population was feasible. Analysis of the relationship between ACE score and adult health conditions yielded patterns of association consistent with prior research. PMID:26579624
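The reported associations are odds ratios from logistic regression. As a minimal worked example of the underlying quantity, an odds ratio with a Woolf 95% confidence interval can be computed directly from a 2x2 exposure/outcome table; the counts below are hypothetical, not the study's data:

```python
import math

def odds_ratio(exposed_cases, exposed_controls, unexposed_cases, unexposed_controls):
    """Odds ratio and 95% CI from a 2x2 table (Woolf's log-based method)."""
    a, b, c, d = exposed_cases, exposed_controls, unexposed_cases, unexposed_controls
    or_value = (a * d) / (b * c)
    se_log = math.sqrt(1/a + 1/b + 1/c + 1/d)   # SE of log(OR)
    lo = math.exp(math.log(or_value) - 1.96 * se_log)
    hi = math.exp(math.log(or_value) + 1.96 * se_log)
    return or_value, (lo, hi)

# Hypothetical counts: ACE-exposed vs unexposed, diagnosis present vs absent.
or_value, (lo, hi) = odds_ratio(120, 880, 70, 930)
print(f"OR = {or_value:.2f}, 95% CI ({lo:.2f}, {hi:.2f})")
# → OR = 1.81, 95% CI (1.33, 2.47)
```

A regression model additionally adjusts for covariates; the unadjusted 2x2 computation shown here is the simplest interpretable special case.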

  1. Discovery and explanation of drug-drug interactions via text mining.

    Science.gov (United States)

    Percha, Bethany; Garten, Yael; Altman, Russ B

    2012-01-01

    Drug-drug interactions (DDIs) can occur when two drugs interact with the same gene product. Most available information about gene-drug relationships is contained within the scientific literature, but is dispersed over a large number of publications, with thousands of new publications added each month. In this setting, automated text mining is an attractive solution for identifying gene-drug relationships and aggregating them to predict novel DDIs. In previous work, we have shown that gene-drug interactions can be extracted from Medline abstracts with high fidelity - we extract not only the genes and drugs, but also the type of relationship expressed in individual sentences (e.g. metabolize, inhibit, activate and many others). We normalize these relationships and map them to a standardized ontology. In this work, we hypothesize that we can combine these normalized gene-drug relationships, drawn from a very broad and diverse literature, to infer DDIs. Using a training set of established DDIs, we have trained a random forest classifier to score potential DDIs based on the features of the normalized assertions extracted from the literature that relate two drugs to a gene product. The classifier recognizes the combinations of relationships, drugs and genes that are most associated with the gold standard DDIs, correctly identifying 79.8% of assertions relating interacting drug pairs and 78.9% of assertions relating noninteracting drug pairs. Most significantly, because our text processing method captures the semantics of individual gene-drug relationships, we can construct mechanistic pharmacological explanations for the newly-proposed DDIs. We show how our classifier can be used to explain known DDIs and to uncover new DDIs that have not yet been reported.
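Before a pair classifier can score two drugs, the normalized gene-drug assertions must be folded into pair-level features. One plausible featurization, sketched below with invented assertions (not the authors' exact feature set), counts relation-type combinations over shared gene products:

```python
from collections import Counter

# Normalized (drug, relation, gene) assertions mined from the literature --
# hypothetical examples for illustration.
assertions = [
    ("warfarin", "metabolized_by", "CYP2C9"),
    ("fluconazole", "inhibits", "CYP2C9"),
    ("simvastatin", "metabolized_by", "CYP3A4"),
    ("ketoconazole", "inhibits", "CYP3A4"),
    ("aspirin", "activates", "PTGS1"),
]

def pair_features(drug_a, drug_b, facts):
    """Count (relation_a, relation_b) combinations over shared genes --
    the kind of feature vector a pair classifier could consume."""
    by_gene = {}
    for drug, rel, gene in facts:
        by_gene.setdefault(gene, []).append((drug, rel))
    feats = Counter()
    for entries in by_gene.values():
        rels_a = [r for d, r in entries if d == drug_a]
        rels_b = [r for d, r in entries if d == drug_b]
        for ra in rels_a:
            for rb in rels_b:
                feats[(ra, rb)] += 1
    return feats

print(pair_features("warfarin", "fluconazole", assertions))
# → Counter({('metabolized_by', 'inhibits'): 1})
```

A feature like ('metabolized_by', 'inhibits') is also directly readable as a mechanistic explanation, which is the point the abstract emphasizes: the second drug inhibits the enzyme that metabolizes the first.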

  2. Incorporating domain knowledge in chemical and biomedical named entity recognition with word representations

    OpenAIRE

    Munkhdalai, Tsendsuren; Li, Meijing; Batsuren, Khuyagbaatar; Park, Hyeon Ah; Choi, Nak Hyeon; Ryu, Keun Ho

    2015-01-01

    Background Chemical and biomedical Named Entity Recognition (NER) is an essential prerequisite task before effective text mining can begin for biochemical-text data. Exploiting unlabeled text data to leverage system performance has been an active and challenging research topic in text mining due to the recent growth in the amount of biomedical literature. We present a semi-supervised learning method that efficiently exploits unlabeled data in order to incorporate domain knowledge into a named...

  3. Statistical modeling of biomedical corpora: mining the Caenorhabditis Genetic Center Bibliography for genes related to life span

    Directory of Open Access Journals (Sweden)

    Jordan MI

    2006-05-01

    Full Text Available Abstract Background The statistical modeling of biomedical corpora could yield integrated, coarse-to-fine views of biological phenomena that complement discoveries made from analysis of molecular sequence and profiling data. Here, the potential of such modeling is demonstrated by examining the 5,225 free-text items in the Caenorhabditis Genetic Center (CGC) Bibliography using techniques from statistical information retrieval. Items in the CGC biomedical text corpus were modeled using the Latent Dirichlet Allocation (LDA) model. LDA is a hierarchical Bayesian model which represents a document as a random mixture over latent topics; each topic is characterized by a distribution over words. Results An LDA model estimated from CGC items had better predictive performance than two standard models (unigram and mixture of unigrams) trained using the same data. To illustrate the practical utility of LDA models of biomedical corpora, a trained CGC LDA model was used for a retrospective study of nematode genes known to be associated with life span modification. Corpus-, document-, and word-level LDA parameters were combined with terms from the Gene Ontology to enhance the explanatory value of the CGC LDA model, and to suggest additional candidates for age-related genes. A novel, pairwise document similarity measure based on the posterior distribution on the topic simplex was formulated and used to search the CGC database for "homologs" of a "query" document discussing the life span-modifying clk-2 gene. Inspection of these document homologs enabled and facilitated the production of hypotheses about the function and role of clk-2. Conclusion Like other graphical models for genetic, genomic and other types of biological data, LDA provides a method for extracting unanticipated insights and generating predictions amenable to subsequent experimental validation.
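The pairwise similarity measure mentioned above operates on the topic simplex. As a small sketch of that idea, the snippet below computes Jensen-Shannon divergence (one common choice, not necessarily the paper's exact measure) between hypothetical posterior topic mixtures of the kind a fitted LDA model would supply:

```python
import math

def kl(p, q):
    """Kullback-Leibler divergence, skipping zero-probability terms of p."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence between two topic distributions."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Hypothetical posterior topic mixtures for three documents.
doc_query   = [0.70, 0.20, 0.10]   # e.g. the clk-2 "query" document
doc_homolog = [0.65, 0.25, 0.10]
doc_other   = [0.05, 0.15, 0.80]

# The "homolog" lies closer to the query on the simplex than the other doc.
print(js_divergence(doc_query, doc_homolog) < js_divergence(doc_query, doc_other))
# → True
```

Ranking all corpus documents by such a divergence from the query document is what turns the topic model into a "document homolog" search.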

  4. Development of a Text Mining System for the Discussion of Proactive Aging Management in Nuclear Power Plant

    International Nuclear Information System (INIS)

    The purpose of this study is to develop an effective system, based on text mining technology, to support the extraction of knowledge from the database of incident records in long-operated nuclear power plants, particularly for the Generic Issues of the proactive materials degradation management (PMDM) project in Japan. A modified system using text mining technology has been developed to support the exploration of keyword relationships as cues for the discussion of Generic Issues. Evaluation confirmed that the knowledge extraction method with the modified system supports the exploration of keyword relationships more effectively than the method proposed in the previous study.

  5. Development of a Text Mining System for the Discussion of Proactive Aging Management in Nuclear Power Plant

    Energy Technology Data Exchange (ETDEWEB)

    Shiraishi, Natsuki; Takahashi, Makoto; Wakabayashi, Toshio [Tohoku University, Tohoku (Japan)

    2011-08-15

    The purpose of this study is to develop an effective system, based on text mining technology, to support the extraction of knowledge from the database of incident records in long-operated nuclear power plants, particularly for the Generic Issues of the proactive materials degradation management (PMDM) project in Japan. A modified system using text mining technology has been developed to support the exploration of keyword relationships as cues for the discussion of Generic Issues. Evaluation confirmed that the knowledge extraction method with the modified system supports the exploration of keyword relationships more effectively than the method proposed in the previous study.

  6. Text mining in bioinformatics

    Institute of Scientific and Technical Information of China (English)

    邹权; 林琛; 刘晓燕; 郭茂祖

    2011-01-01

    Text mining methods in bioinformatics are discussed from two perspectives. First, with the goal of searching for biological knowledge, text mining methods are used for literature retrieval and for constructing related databases; for example, protein-protein interactions and gene-disease relationships can be mined from PubMed. Bioinformatics problems to which text mining can be applied are then summarized, such as the analysis of protein structure and function. Finally, areas of bioinformatics that text mining researchers could explore are discussed, so that more of them can apply their results to bioinformatics research.

  7. Text Mining to inform construction of Earth and Environmental Science Ontologies

    Science.gov (United States)

    Schildhauer, M.; Adams, B.; Rebich Hespanha, S.

    2013-12-01

    There is a clear need for better semantic representation of Earth and environmental concepts, to facilitate more effective discovery and re-use of information resources relevant to scientists doing integrative research. In order to develop general-purpose Earth and environmental science ontologies, however, it is necessary to represent concepts and relationships that span usage across multiple disciplines and scientific specialties. Traditional knowledge modeling through ontologies utilizes expert knowledge but inevitably favors the particular perspectives of the ontology engineers, as well as the domain experts who interacted with them. This often leads to ontologies that lack robust coverage of synonymy, while also missing important relationships among concepts that can be extremely useful for working scientists to be aware of. In this presentation we will discuss methods we have developed that utilize statistical topic modeling on a large corpus of Earth and environmental science articles, to expand coverage and disclose relationships among concepts in the Earth sciences. For our work we collected a corpus of over 121,000 abstracts from many of the top Earth and environmental science journals. We performed latent Dirichlet allocation topic modeling on this corpus to discover a set of latent topics, which consist of terms that commonly co-occur in abstracts. We match terms in the topics to concept labels in existing ontologies to reveal gaps, and we examine which terms are commonly associated in natural language discourse, to identify relationships that are important to formally model in ontologies. Our text mining methodology uncovers significant gaps in the content of some popular existing ontologies, and we show how, through a workflow involving human interpretation of topic models, we can bootstrap ontologies to have much better coverage and richer semantics. Because we base our methods directly on what working scientists are communicating about their

  8. Text Mining and its Key Techniques and Methods

    Institute of Scientific and Technical Information of China (English)

    王丽坤; 王宏; 陆玉昌

    2002-01-01

    With the dramatic development of the Internet, information processing and management technology on the WWW has become an important branch of data mining and data warehousing. Text mining in particular is emerging rapidly and plays an important role in related fields, so it is worth summarizing text mining from its definition to its methods and techniques. In this paper, drawing on the comparatively mature technology of data mining, we present a definition of text mining and a multi-stage text mining process model. Moreover, this paper introduces the key areas of text mining and some of the powerful text analysis techniques, including automatic word segmentation, feature representation, feature extraction, text categorization, text clustering, text summarization, information extraction, and pattern quality evaluation. These techniques cover the whole process from information preprocessing to knowledge acquisition.
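Of the techniques listed, feature representation is the pivot on which categorization and clustering both turn. A self-contained TF-IDF sketch over a toy corpus, using the standard term-frequency times log-inverse-document-frequency weighting, illustrates that step:

```python
import math
from collections import Counter

def tf_idf(corpus):
    """TF-IDF weight per term per document -- the feature-representation
    step that categorization and clustering build on."""
    docs = [doc.lower().split() for doc in corpus]
    n = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: (count / len(doc)) * math.log(n / df[t])
                        for t, count in tf.items()})
    return vectors

corpus = ["text mining extracts knowledge",
          "data mining finds patterns",
          "text classification labels documents"]
weights = tf_idf(corpus)
# "mining" appears in two documents, so it is down-weighted vs "knowledge".
print(weights[0]["knowledge"] > weights[0]["mining"])  # → True
```

The resulting sparse vectors feed directly into the later stages of the process model: a cosine distance over them supports clustering, and any standard classifier supports categorization.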

  9. The first step in the development of text mining technology for cancer risk assessment: identifying and organizing scientific evidence in risk assessment literature

    Directory of Open Access Journals (Sweden)

    Sun Lin

    2009-09-01

    Full Text Available Abstract Background One of the most neglected areas of biomedical Text Mining (TM) is the development of systems based on carefully assessed user needs. We have recently investigated the user needs of an important task yet to be tackled by TM -- Cancer Risk Assessment (CRA). Here we take the first step towards the development of TM technology for the task: identifying and organizing the scientific evidence required for CRA in a taxonomy which is capable of supporting extensive data gathering from biomedical literature. Results The taxonomy is based on expert annotation of 1297 abstracts downloaded from relevant PubMed journals. It classifies 1742 unique keywords found in the corpus to 48 classes which specify core evidence required for CRA. We report promising results with inter-annotator agreement tests and automatic classification of PubMed abstracts to taxonomy classes. A simple user test is also reported in a near real-world CRA scenario which demonstrates along with other evaluation that the resources we have built are well-defined, accurate, and applicable in practice. Conclusion We present our annotation guidelines and a tool which we have designed for expert annotation of PubMed abstracts. A corpus annotated for keywords and document relevance is also presented, along with the taxonomy which organizes the keywords into classes defining core evidence for CRA. As demonstrated by the evaluation, the materials we have constructed provide a good basis for classification of CRA literature along multiple dimensions. They can support current manual CRA as well as facilitate the development of an approach based on TM. We discuss extending the taxonomy further via manual and machine learning approaches and the subsequent steps required to develop TM technology for the needs of CRA.
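Classifying abstracts to taxonomy classes can be approximated, in the simplest case, by keyword lookup against each class's vocabulary. The fragmentary taxonomy below is invented for illustration and is far cruder than the paper's 48-class scheme:

```python
from collections import Counter

# Hypothetical fragment of a CRA-style taxonomy: class -> evidence keywords.
TAXONOMY = {
    "carcinogenic_activity": {"tumour", "tumor", "carcinoma", "neoplasm"},
    "mutagenicity": {"mutation", "mutagenic", "dna damage", "micronucleus"},
    "exposure_route": {"inhalation", "dermal", "oral", "gavage"},
}

def classify_abstract(text):
    """Rank taxonomy classes by how many of their keywords the abstract hits."""
    lowered = text.lower()
    hits = Counter({cls: sum(kw in lowered for kw in kws)
                    for cls, kws in TAXONOMY.items()})
    return [cls for cls, n in hits.most_common() if n > 0]

abstract = ("Rats exposed by inhalation developed lung carcinoma, "
            "and a dose-dependent increase in DNA damage was observed.")
print(classify_abstract(abstract))
```

A trained classifier over the annotated corpus would replace the raw substring matching, but the output shape (one abstract, multiple evidence classes) is the same multi-dimensional classification the paper evaluates.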

  10. Text mining with emergent self organizing maps and multi-dimensional scaling: a comparative study on domestic violence

    NARCIS (Netherlands)

    J. Poelmans; M.M. van Hulle; S. Viaene; P. Elzinga; G. Dedene

    2011-01-01

    In this paper we compare the usability of ESOM and MDS as text exploration instruments in police investigations. We combine them with traditional classification instruments such as the SVM and Naïve Bayes. We perform a case of real-life data mining using a dataset consisting of police reports descri

  11. Examining Mobile Learning Trends 2003-2008: A Categorical Meta-Trend Analysis Using Text Mining Techniques

    Science.gov (United States)

    Hung, Jui-Long; Zhang, Ke

    2012-01-01

    This study investigated the longitudinal trends of academic articles in Mobile Learning (ML) using text mining techniques. One hundred and nineteen (119) refereed journal articles and proceedings papers from the SCI/SSCI database were retrieved and analyzed. The taxonomies of ML publications were grouped into twelve clusters (topics) and four…

  12. Towards Evidence-based Precision Medicine: Extracting Population Information from Biomedical Text using Binary Classifiers and Syntactic Patterns.

    Science.gov (United States)

    Raja, Kalpana; Dasot, Naman; Goyal, Pawan; Jonnalagadda, Siddhartha R

    2016-01-01

    Precision Medicine is an emerging approach for prevention and treatment of disease that considers individual variability in genes, environment, and lifestyle for each person. The dissemination of individualized evidence by automatically identifying population information in literature is a key for evidence-based precision medicine at the point-of-care. We propose a hybrid approach using natural language processing techniques to automatically extract the population information from biomedical literature. Our approach first implements a binary classifier to classify sentences with or without population information. A rule-based system based on syntactic-tree regular expressions is then applied to sentences containing population information to extract the population named entities. The proposed two-stage approach achieved an F-score of 0.81 using a MaxEnt classifier and the rule-based system, and an F-score of 0.87 using a Naïve Bayes classifier and the rule-based system, and performed relatively well compared to many existing systems. The system and evaluation dataset are being released as open source. PMID:27570671
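The second, rule-based stage can be caricatured with a flat regular expression standing in for the paper's syntactic-tree patterns; the pattern below is illustrative only and much weaker than grammar-based rules:

```python
import re

# Rule stage only: a flat regex stand-in for syntactic-tree patterns.
POPULATION_RE = re.compile(
    r"\b(\d[\d,]*)\s+"
    r"(patients|subjects|participants|adults|children|women|men)\b",
    re.IGNORECASE)

def extract_population(sentence):
    """Return (count, group) population mentions from one sentence."""
    return [(int(num.replace(",", "")), group.lower())
            for num, group in POPULATION_RE.findall(sentence)]

sent = "We randomized 1,284 patients and followed 96 children for two years."
print(extract_population(sent))  # → [(1284, 'patients'), (96, 'children')]
```

In the full pipeline this rule stage only runs on sentences the binary classifier has already flagged as containing population information, which is what keeps a shallow pattern like this from firing on irrelevant counts.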

  13. BIOMedical search engine framework: lightweight and customized implementation of domain-specific biomedical search engines

    OpenAIRE

    Jácome, Alberto G.; Fdez-Riverola, Florentino; Lourenço, Anália

    2016-01-01

    The Smart Drug Search is publicly accessible at http://sing.ei.uvigo.es/sds/. The BIOMedical Search Engine Framework is freely available for non-commercial use at https://github.com/agjacome/biomsef Background and Objectives: Text mining and semantic analysis approaches can be applied to the construction of biomedical domain-specific search engines and provide an attractive alternative to create personalized and enhanced search experiences. Therefore, this work introduces the new open-sour...

  14. Construction of an index of information from clinical practice in Radiology and Imaging Diagnosis based on text mining and thesaurus

    Directory of Open Access Journals (Sweden)

    Paulo Roberto Barbosa Serapiao

    2013-09-01

    Full Text Available Objective To construct a Portuguese language index of information on the practice of diagnostic radiology in order to improve the standardization of the medical language and terminology. Materials and Methods A total of 61,461 definitive reports were collected from the database of the Radiology Information System at Hospital das Clínicas – Faculdade de Medicina de Ribeirão Preto (RIS/HCFMRP) as follows: 30,000 chest x-ray reports; 27,000 mammography reports; and 4,461 thyroid ultrasonography reports. The text mining technique was applied for the selection of terms, and the ANSI/NISO Z39.19-2005 standard was utilized to construct the index based on a thesaurus structure. The system was created in *html. Results The text mining resulted in a set of 358,236 words (n = 100%). Out of this total, 76,347 terms (n = 21%) were selected to form the index. Such terms refer to anatomical pathology description, imaging techniques, equipment, type of study and some other composite terms. The index system was developed with 78,538 *html web pages. Conclusion The utilization of text mining on a radiological reports database has allowed the construction of a lexical system in Portuguese language consistent with the clinical practice in Radiology.

  15. Text Classification using the Concept of Association Rule of Data Mining

    OpenAIRE

    Rahman, Chowdhury Mofizur; Sohel, Ferdous Ahmed; Naushad, Parvez; Kamruzzaman, S. M.

    2010-01-01

    As the amount of online text increases, the demand for text classification to aid the analysis and management of text is increasing. Text is cheap, but information, in the form of knowing what classes a text belongs to, is expensive. Automatic classification of text can provide this information at low cost, but the classifiers themselves must be built with expensive human effort, or trained from texts which have themselves been manually classified. In this paper we will discuss a procedure of...
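A toy version of the association-rule idea behind this record: derive rules of the form `word -> class` from labelled documents and keep only rules whose confidence clears a threshold. The function, thresholds and sample data are invented for illustration; the paper's procedure is more elaborate.

```python
from collections import defaultdict

def mine_class_rules(docs, min_confidence=0.8):
    """Keep a rule `word -> class` when the word predicts the class with
    confidence >= min_confidence over the labelled training documents."""
    word_class = defaultdict(lambda: defaultdict(int))
    word_total = defaultdict(int)
    for words, label in docs:
        for w in set(words):
            word_class[w][label] += 1
            word_total[w] += 1
    rules = {}
    for w, by_class in word_class.items():
        label, hits = max(by_class.items(), key=lambda kv: kv[1])
        confidence = hits / word_total[w]
        if confidence >= min_confidence:
            rules[w] = (label, confidence)
    return rules

docs = [
    (["goal", "match", "team"], "sports"),
    (["match", "team", "league"], "sports"),
    (["election", "vote"], "politics"),
]
rules = mine_class_rules(docs)
print(rules["team"])  # ('sports', 1.0)
```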

  16. Business intelligence in banking: A literature analysis from 2002 to 2013 using Text Mining and latent Dirichlet allocation

    OpenAIRE

    S. Moro; Cortez, P.; Rita, P.

    2015-01-01

    WOS:000345734700028 (Web of Science accession number) This paper analyzes recent literature in the search for trends in business intelligence applications for the banking industry. Searches were performed in relevant journals resulting in 219 articles published between 2002 and 2013. To analyze such a large number of manuscripts, text mining techniques were used in pursuit of relevant terms in both business intelligence and banking domains. Moreover, the latent Dirichlet allocation modeling w...

  17. A Digital Humanities Approach to the History of Science Eugenics Revisited in Hidden Debates by Means of Semantic Text Mining

    OpenAIRE

    Huijnen, Pim; Laan, Fons; De Rijke, Maarten; Pieters, Toine

    2014-01-01

    Comparative historical research on the intensity, diversity and fluidity of public discourses has been severely hampered by the extraordinary task of manually gathering and processing large sets of opinionated data in news media in different countries. At most 50,000 documents have been systematically studied in a single comparative historical project in the subject area of heredity and eugenics. Digital techniques, like the text mining tools WAHSP and BILAND we have developed in two succ...

  18. Evaluation of carcinogenic modes of action for pesticides in fruit on the Swedish market using a text-mining tool

    OpenAIRE

    Silins, Ilona; Korhonen, Anna; Stenius, Ulla

    2014-01-01

    Toxicity caused by chemical mixtures has emerged as a significant challenge for toxicologists and risk assessors. Information on individual chemicals’ modes of action is an important part of the hazard identification step. In this study, an automatic text mining-based tool was employed as a method to identify the carcinogenic modes of action of pesticides frequently found in fruit on the Swedish market. The current available scientific literature on the 26 most common pesticides found in appl...

  20. Two Similarity Metrics for Medical Subject Headings (MeSH): An Aid to Biomedical Text Mining and Author Name Disambiguation.

    Science.gov (United States)

    Smalheiser, Neil R; Bonifield, Gary

    2016-01-01

    In the present paper, we have created and characterized several similarity metrics for relating any two Medical Subject Headings (MeSH terms) to each other. The article-based metric measures the tendency of two MeSH terms to appear in the MEDLINE record of the same article. The author-based metric measures the tendency of two MeSH terms to appear in the body of articles written by the same individual (using the 2009 Author-ity author name disambiguation dataset as a gold standard). The two metrics are only modestly correlated with each other (r = 0.50), indicating that they capture different aspects of term usage. The article-based metric provides a measure of semantic relatedness, and MeSH term pairs that co-occur more often than expected by chance may reflect relations between the two terms. In contrast, the author metric is indicative of how individuals practice science, and may have value for author name disambiguation and studies of scientific discovery. We have calculated article metrics for all MeSH terms appearing in at least 25 articles in MEDLINE (as of 2014) and author metrics for MeSH terms published as of 2009. The dataset is freely available for download and can be queried at http://arrowsmith.psych.uic.edu/arrowsmith_uic/mesh_pair_metrics.html. Handling editor: Elizabeth Workman, MLIS, PhD. PMID:27213780
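An article-based relatedness metric of the kind described here can be sketched as scoring each MeSH pair by how much more often it co-occurs than chance would predict, i.e. log(observed / expected). This is one plausible formulation for illustration only; the published metric may be defined differently, and the terms below are toy data.

```python
import math
from itertools import combinations

def cooccurrence_scores(articles):
    """Score term pairs by log(observed / expected) co-occurrence across
    per-article term lists; positive scores mean the pair appears together
    more often than independent assignment would predict."""
    n = len(articles)
    term_count = {}
    pair_count = {}
    for terms in articles:
        ts = set(terms)
        for t in ts:
            term_count[t] = term_count.get(t, 0) + 1
        for a, b in combinations(sorted(ts), 2):
            pair_count[(a, b)] = pair_count.get((a, b), 0) + 1
    scores = {}
    for (a, b), observed in pair_count.items():
        expected = term_count[a] * term_count[b] / n
        scores[(a, b)] = math.log(observed / expected)
    return scores

articles = [
    ["Asthma", "Bronchodilator Agents"],
    ["Asthma", "Bronchodilator Agents"],
    ["Hypertension", "Asthma"],
    ["Hypertension"],
]
scores = cooccurrence_scores(articles)
print(scores[("Asthma", "Bronchodilator Agents")] > 0)  # co-occur above chance
```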

  2. DESTAF: A database of text-mined associations for reproductive toxins potentially affecting human fertility

    KAUST Repository

    Dawe, Adam Sean

    2012-01-01

    The Dragon Exploration System for Toxicants and Fertility (DESTAF) is a publicly available resource which enables researchers to efficiently explore both known and potentially novel information and associations in the field of reproductive toxicology. To create DESTAF we used data from the literature (including over 10,500 PubMed abstracts), several publicly available biomedical repositories, and specialized, curated dictionaries. DESTAF has an interface designed to facilitate rapid assessment of the key associations between relevant concepts, allowing for a more in-depth exploration of information based on different gene/protein-, enzyme/metabolite-, toxin/chemical-, disease- or anatomically centric perspectives. As a special feature, DESTAF allows for the creation and initial testing of potentially new association hypotheses that suggest links between biological entities identified through the database. DESTAF, along with a PDF manual, can be found at http://cbrc.kaust.edu.sa/destaf. It is free to academic and non-commercial users and will be updated quarterly. © 2011 Elsevier Inc.

  3. Development and testing of a text-mining approach to analyse patients’ comments on their experiences of colorectal cancer care

    OpenAIRE

    Wagland, Richard; Recio Saucedo, Alejandra; Simon, Michael; Bracher, Michael; Hunt, Katherine; Foster, Claire; Downing, Amy; Glaser, Adam W; Corner, Jessica

    2016-01-01

    Background: Quality of cancer care may greatly impact upon patients’ health-related quality of life (HRQoL). Free-text responses to patient-reported outcome measures (PROMs) provide rich data, but analysis is time- and resource-intensive. This study developed and tested a learning-based text-mining approach to facilitate analysis of patients’ experiences of care and develop an explanatory model illustrating impact upon HRQoL. Methods: Respondents to a population-based survey of colorectal c...

  4. An Ontology based Text Mining framework for R&D Project Selection

    Directory of Open Access Journals (Sweden)

    E.Sathya

    2013-03-01

    Full Text Available Research and development (R&D) project selection is a decision-making task commonly found in government funding agencies, universities, research institutes, and technology-intensive companies. Text mining has emerged as a definitive technique for extracting unknown information from large text documents. An ontology is a knowledge repository in which concepts and terms are defined, as well as the relationships between these concepts. Ontologies make the task of searching for similar patterns of text more effective, efficient and interactive. The current method for grouping proposals for research project selection uses an ontology-based text mining approach to cluster research proposals based on their similarities in research area. This method is efficient and effective for clustering research proposals; however, the assignment of proposals to experts by research area is often inaccurate. This paper presents an ontology-based text mining framework to cluster research proposals and external reviewers by research area and to systematically assign the relevant proposals to reviewers. A knowledge-based agent is appended to the proposed system for efficient retrieval of data from the system.
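The core benefit of an ontology in this setting is that proposals phrased with different keywords can still be matched at the concept level. A minimal sketch, with an invented mini-ontology (term to broader research-area concept) and Jaccard similarity standing in for the paper's clustering machinery:

```python
def expand_with_ontology(keywords, ontology):
    """Map each keyword to its broader concept so proposals using
    different terms still compare as similar."""
    return {ontology.get(k, k) for k in keywords}

def jaccard(a, b):
    """Set-overlap similarity in [0, 1]."""
    return len(a & b) / len(a | b)

# Toy ontology: term -> broader research-area concept (illustrative only).
ontology = {
    "svm": "machine learning",
    "neural networks": "machine learning",
    "crystallography": "structural biology",
}

p1 = expand_with_ontology({"svm", "text mining"}, ontology)
p2 = expand_with_ontology({"neural networks", "text mining"}, ontology)
print(jaccard(p1, p2))  # identical at the concept level
```

Without the ontology step the two proposals share only one of three keywords; after expansion they land in the same cluster.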

  5. Motif-Based Text Mining of Microbial Metagenome Redundancy Profiling Data for Disease Classification

    Directory of Open Access Journals (Sweden)

    Yin Wang

    2016-01-01

    Full Text Available Background. Text data of 16S rRNA are informative for classifications of microbiota-associated diseases. However, the raw text data need to be systematically processed so that features for classification can be defined/extracted; moreover, the high-dimension feature spaces generated by the text data also pose an additional difficulty. Results. Here we present a Phylogenetic Tree-Based Motif Finding algorithm (PMF) to analyze 16S rRNA text data. By integrating phylogenetic rules and other statistical indexes for classification, we can effectively reduce the dimension of the large feature spaces generated by the text datasets. Using the retrieved motifs in combination with common classification methods, we can discriminate different samples of both pneumonia and dental caries better than other existing methods. Conclusions. We extend the phylogenetic approaches to perform supervised learning on microbiota text data to discriminate the pathological states for pneumonia and dental caries. The results have shown that PMF may enhance the efficiency and reliability in analyzing high-dimension text data.

  7. Textpresso Site-Specific Recombinases: a text-mining server for the recombinase literature including Cre mice and conditional alleles

    Science.gov (United States)

    Urbanski, William M.; Condie, Brian G.

    2016-01-01

    Textpresso Site Specific Recombinases (http://ssrc.genetics.uga.edu/) is a text-mining web server for searching a database of over 9000 full-text publications. The papers and abstracts in this database represent a wide range of topics related to site-specific recombinase (SSR) research tools. Included in the database are most of the papers that report the characterization or use of mouse strains that express Cre recombinase as well as papers that describe or analyze mouse lines that carry conditional (floxed) alleles or SSR-activated transgenes/knockins. The database also includes reports describing SSR-based cloning methods such as the Gateway or the Creator systems, papers reporting the development or use of SSR-based tools in systems such as Drosophila, bacteria, parasites, stem cells, yeast, plants, zebrafish and Xenopus as well as publications that describe the biochemistry, genetics or molecular structure of the SSRs themselves. Textpresso Site Specific Recombinases is the only comprehensive text-mining resource available for the literature describing the biology and technical applications of site-specific recombinases. PMID:19882667

  8. PreBIND and Textomy – mining the biomedical literature for protein-protein interactions using a support vector machine

    Directory of Open Access Journals (Sweden)

    Baskin Berivan

    2003-03-01

    Full Text Available Abstract Background The majority of experimentally verified molecular interaction and biological pathway data are present in the unstructured text of biomedical journal articles where they are inaccessible to computational methods. The Biomolecular Interaction Network Database (BIND) seeks to capture these data in a machine-readable format. We hypothesized that the formidable task-size of backfilling the database could be reduced by using Support Vector Machine technology to first locate interaction information in the literature. We present an information extraction system that was designed to locate protein-protein interaction data in the literature and present these data to curators and the public for review and entry into BIND. Results Cross-validation estimated that the support vector machine's test-set precision, accuracy and recall for classifying abstracts describing interaction information were 92%, 90% and 92%, respectively. We estimated that the system would be able to recall up to 60% of all non-high-throughput interactions present in another yeast-protein interaction database. Finally, this system was applied to a real-world curation problem and its use was found to reduce the task duration by 70%, thus saving 176 days. Conclusions Machine learning methods are useful as tools to direct interaction and pathway database back-filling; however, this potential can only be realized if these techniques are coupled with human review and entry into a factual database such as BIND. The PreBIND system described here is available to the public at http://bind.ca. Current capabilities allow searching for human, mouse and yeast protein-interaction information.
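The precision, accuracy and recall figures quoted above are standard binary-classification metrics. A self-contained reminder of how they are computed from predicted and true labels (the label vectors below are a toy example, not PreBIND's data):

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Precision, recall and accuracy for a binary classifier,
    computed from true/false positives and negatives."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return {
        "precision": tp / (tp + fp),  # of predicted positives, how many are right
        "recall": tp / (tp + fn),     # of actual positives, how many were found
        "accuracy": correct / len(y_true),
    }

y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 1]
m = classification_metrics(y_true, y_pred)
print(m)
```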

  9. Mining for constructions in texts using N-gram and network analysis

    DEFF Research Database (Denmark)

    Shibuya, Yoshikata; Jensen, Kim Ebensgaard

    2015-01-01

    N-gram analysis to Lewis Carroll's novel Alice's Adventures in Wonderland and Mark Twain's novel The Adventures of Huckleberry Finn and extrapolate a number of likely constructional phenomena from recurring N-gram patterns in the two texts. In addition to simple N-gram analysis, the following...
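The "recurring N-gram patterns" the record mentions can be extracted with a simple sliding-window count. A minimal sketch (function name, threshold and sample sentence are invented; the study's actual pipeline adds network analysis on top):

```python
from collections import Counter

def recurring_ngrams(tokens, n, min_count=2):
    """Count all n-grams in a token sequence and keep those recurring at
    least min_count times, i.e. candidate constructional patterns."""
    grams = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return {g: c for g, c in grams.items() if c >= min_count}

tokens = "down the rabbit hole and down the river".split()
bigrams = recurring_ngrams(tokens, 2)
print(bigrams)  # only the repeated bigram survives
```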

  10. Combining Natural Language Processing and Statistical Text Mining: A Study of Specialized versus Common Languages

    Science.gov (United States)

    Jarman, Jay

    2011-01-01

    This dissertation focuses on developing and evaluating hybrid approaches for analyzing free-form text in the medical domain. This research draws on natural language processing (NLP) techniques that are used to parse and extract concepts based on a controlled vocabulary. Once important concepts are extracted, additional machine learning algorithms,…

  11. The biomedical discourse relation bank

    Directory of Open Access Journals (Sweden)

    Joshi Aravind

    2011-05-01

    Full Text Available Abstract Background Identification of discourse relations, such as causal and contrastive relations, between situations mentioned in text is an important task for biomedical text-mining. A biomedical text corpus annotated with discourse relations would be very useful for developing and evaluating methods for biomedical discourse processing. However, little effort has been made to develop such an annotated resource. Results We have developed the Biomedical Discourse Relation Bank (BioDRB), in which we have annotated explicit and implicit discourse relations in 24 open-access full-text biomedical articles from the GENIA corpus. Guidelines for the annotation were adapted from the Penn Discourse TreeBank (PDTB), which has discourse relations annotated over open-domain news articles. We introduced new conventions and modifications to the sense classification. We report reliable inter-annotator agreement of over 80% for all sub-tasks. Experiments for identifying the sense of explicit discourse connectives show the connective itself as a highly reliable indicator for coarse sense classification (accuracy 90.9% and F1 score 0.89). These results are comparable to results obtained with the same classifier on the PDTB data. With more refined sense classification, there is degradation in performance (accuracy 69.2% and F1 score 0.28), mainly due to sparsity in the data. The size of the corpus was found to be sufficient for identifying the sense of explicit connectives, with classifier performance stabilizing at about 1900 training instances. Finally, the classifier performs poorly when trained on PDTB and tested on BioDRB (accuracy 54.5% and F1 score 0.57). Conclusion Our work shows that discourse relations can be reliably annotated in biomedical text. Coarse sense disambiguation of explicit connectives can be done with high reliability by using just the connective as a feature, but more refined sense classification requires either richer features or more

  12. Clustering Unstructured Data (Flat Files) - An Implementation in Text Mining Tool

    CERN Document Server

    Safeer, Yasir; Ali, Anis Noor

    2010-01-01

    With the advancement of technology and reduced storage costs, individuals and organizations are tending towards the usage of electronic media for storing textual information and documents. It is time consuming for readers to retrieve relevant information from an unstructured document collection. It is easier and less time consuming to find documents from a large collection when the collection is ordered or classified by group or category. The problem of finding the best such grouping remains. This paper discusses our implementation of the k-Means clustering algorithm for clustering unstructured text documents, beginning with the representation of unstructured text and reaching the resulting set of clusters. Based on the analysis of the resulting clusters for a sample set of documents, we have also proposed a technique to represent documents that can further improve the clustering result.
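The pipeline described, representing unstructured text as vectors and then running k-Means, can be sketched end-to-end as follows. This is a bare-bones illustration with plain term-frequency vectors and deterministic seeding (the paper's representation and any weighting scheme it uses are not reproduced here):

```python
import math
from collections import Counter

def tf_vector(text, vocab):
    """Represent a document as term frequencies over a fixed vocabulary."""
    counts = Counter(text.lower().split())
    return [counts[w] for w in vocab]

def dist(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kmeans(vectors, k, iters=10):
    """Plain k-Means: assign each document to the nearest centroid, then
    recompute centroids, and repeat. Seeds with the first k vectors for
    determinism (a simplification of the usual random initialisation)."""
    centroids = [list(v) for v in vectors[:k]]
    assignment = [0] * len(vectors)
    for _ in range(iters):
        assignment = [min(range(k), key=lambda c: dist(v, centroids[c]))
                      for v in vectors]
        for c in range(k):
            members = [v for v, a in zip(vectors, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment

docs = ["cat cat pet", "cat pet", "stock market", "market stock stock"]
vocab = sorted({w for d in docs for w in d.split()})
vectors = [tf_vector(d, vocab) for d in docs]
print(kmeans(vectors, 2))  # the two pet documents and the two finance documents
```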

  13. Ask and Ye Shall Receive? Automated Text Mining of Michigan Capital Facility Finance Bond Election Proposals to Identify Which Topics Are Associated with Bond Passage and Voter Turnout

    Science.gov (United States)

    Bowers, Alex J.; Chen, Jingjing

    2015-01-01

    The purpose of this study is to bring together recent innovations in the research literature around school district capital facility finance, municipal bond elections, statistical models of conditional time-varying outcomes, and data mining algorithms for automated text mining of election ballot proposals to examine the factors that influence the…

  14. Newspaper archives + text mining = rich sources of historical geo-spatial data

    Science.gov (United States)

    Yzaguirre, A.; Smit, M.; Warren, R.

    2016-04-01

    Newspaper archives are rich sources of cultural, social, and historical information. These archives, even when digitized, are typically unstructured and organized by date rather than by subject or location, and require substantial manual effort to analyze. The effort of journalists to be accurate and precise means that there is often rich geo-spatial data embedded in the text, alongside text describing events that editors considered to be of sufficient importance to the region or the world to merit column inches. A regional newspaper can add over 100,000 articles to its database each year, and extracting information from this data for even a single country would pose a substantial Big Data challenge. In this paper, we describe a pilot study on the construction of a database of historical flood events (location(s), date, cause, magnitude) to be used in flood assessment projects, for example to calibrate models, estimate frequency, establish high water marks, or plan for future events in contexts ranging from urban planning to climate change adaptation. We then present a vision for extracting and using the rich geospatial data available in unstructured text archives, and suggest future avenues of research.
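A first-cut version of the flood-event extraction described in the pilot study can be done with a regular expression pulling a place and year out of flood-related sentences. Everything here, the pattern, the dictionaries and the sample headlines, is invented for illustration; a production pipeline would use named-entity recognition and gazetteers instead of a single regex.

```python
import re

# Hypothetical pattern: a flood keyword, a capitalised place name, then a year.
FLOOD = re.compile(
    r"flood(?:ing|s)?\s+(?:in|hit|struck)\s+(?P<place>[A-Z][a-z]+)"
    r".*?(?P<year>\b(?:18|19|20)\d{2}\b)"
)

def extract_flood_events(sentences):
    """Return (place, year) tuples for sentences matching the flood pattern."""
    events = []
    for s in sentences:
        m = FLOOD.search(s)
        if m:
            events.append((m.group("place"), int(m.group("year"))))
    return events

news = [
    "Severe flooding in Fredericton closed bridges in spring 1973.",
    "The council debated road repairs for months.",
]
events = extract_flood_events(news)
print(events)
```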

  15. ChemicalTagger: A tool for semantic text-mining in chemistry

    Directory of Open Access Journals (Sweden)

    Hawizy Lezan

    2011-05-01

    Full Text Available Abstract Background The primary method for scientific communication is in the form of published scientific articles and theses which use natural language combined with domain-specific terminology. As such, they contain free-flowing unstructured text. Given the usefulness of data extraction from unstructured literature, we aim to show how this can be achieved for the discipline of chemistry. The highly formulaic style of writing most chemists adopt makes their contributions well suited to high-throughput Natural Language Processing (NLP) approaches. Results We have developed the ChemicalTagger parser as a medium-depth, phrase-based semantic NLP tool for the language of chemical experiments. Tagging is based on a modular architecture and uses a combination of OSCAR, domain-specific regex and English taggers to identify parts-of-speech. The ANTLR grammar is used to structure this into tree-based phrases. Using a metric that allows for overlapping annotations, we achieved machine-annotator agreements of 88.9% for phrase recognition and 91.9% for phrase-type identification (Action names). Conclusions It is possible to parse chemical experimental text using rule-based techniques in conjunction with a formal grammar parser. ChemicalTagger has been deployed for over 10,000 patents and has identified solvents from their linguistic context with >99.5% precision.
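The regex-tagger layer the abstract mentions can be illustrated with a few lines of Python. The tag names, the solvent dictionary and the token pattern below are invented for illustration and are far simpler than ChemicalTagger's modular OSCAR/regex/grammar stack:

```python
import re

SOLVENTS = {"ethanol", "toluene", "water"}          # toy dictionary
TOKEN = re.compile(r"\d+(?:\.\d+)?|[A-Za-z]+")      # numbers or words

def tag_sentence(sentence):
    """Assign a coarse tag to each token of an experimental sentence:
    SOLVENT (dictionary hit), AMOUNT (number), UNIT, or plain WORD."""
    tags = []
    for tok in TOKEN.findall(sentence):
        if tok.lower() in SOLVENTS:
            tags.append((tok, "SOLVENT"))
        elif re.fullmatch(r"\d+(?:\.\d+)?", tok):
            tags.append((tok, "AMOUNT"))
        elif tok.lower() in {"ml", "mg", "g"}:
            tags.append((tok, "UNIT"))
        else:
            tags.append((tok, "WORD"))
    return tags

tags = tag_sentence("Dissolve 5 mg in 10 ml ethanol")
print(tags)
```

A real system would then feed such token tags into a phrase grammar, as ChemicalTagger does with ANTLR.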

  16. Combining QSAR Modeling and Text-Mining Techniques to Link Chemical Structures and Carcinogenic Modes of Action

    Science.gov (United States)

    Papamokos, George; Silins, Ilona

    2016-01-01

    There is an increasing need for new reliable non-animal based methods to predict and test toxicity of chemicals. Quantitative structure-activity relationship (QSAR), a computer-based method linking chemical structures with biological activities, is used in predictive toxicology. In this study, we tested the approach to combine QSAR data with literature profiles of carcinogenic modes of action automatically generated by a text-mining tool. The aim was to generate data patterns to identify associations between chemical structures and biological mechanisms related to carcinogenesis. Using these two methods, individually and combined, we evaluated 96 rat carcinogens of the hematopoietic system, liver, lung, and skin. We found that skin and lung rat carcinogens were mainly mutagenic, while the group of carcinogens affecting the hematopoietic system and the liver also included a large proportion of non-mutagens. The automatic literature analysis showed that mutagenicity was a frequently reported endpoint in the literature of these carcinogens, however, less common endpoints such as immunosuppression and hormonal receptor-mediated effects were also found in connection with some of the carcinogens, results of potential importance for certain target organs. The combined approach, using QSAR and text-mining techniques, could be useful for identifying more detailed information on biological mechanisms and the relation with chemical structures. The method can be particularly useful in increasing the understanding of structure and activity relationships for non-mutagens. PMID:27625608

  18. Cloud Based Metalearning System for Predictive Modeling of Biomedical Data

    Directory of Open Access Journals (Sweden)

    Milan Vukićević

    2014-01-01

    Full Text Available Rapid growth and storage of biomedical data have enabled many opportunities for predictive modeling and improvement of healthcare processes. On the other hand, the analysis of such large amounts of data is a difficult and computationally intensive task for most existing data mining algorithms. This problem is addressed by proposing a cloud-based system that integrates a metalearning framework, for ranking and selection of the best predictive algorithms for the data at hand, with open source big data technologies for the analysis of biomedical data.
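Metalearning-based ranking, as described in this record, chooses algorithms for a new dataset from their track record on similar past datasets. A minimal sketch under stated assumptions: dataset "meta-features" are plain numeric tuples, similarity is negative L1 distance, and all names, features and scores below are invented.

```python
def rank_algorithms(history, dataset_features, k=2):
    """Recommend algorithms for a new dataset by averaging their past
    scores on the k most similar historical datasets."""
    def similarity(f1, f2):
        return -sum(abs(a - b) for a, b in zip(f1, f2))

    nearest = sorted(history,
                     key=lambda h: similarity(h["features"], dataset_features),
                     reverse=True)[:k]
    algos = nearest[0]["scores"].keys()
    avg = {a: sum(h["scores"][a] for h in nearest) / len(nearest) for a in algos}
    return sorted(avg, key=avg.get, reverse=True)

history = [
    {"features": (100, 5), "scores": {"tree": 0.80, "svm": 0.70}},
    {"features": (120, 6), "scores": {"tree": 0.85, "svm": 0.75}},
    {"features": (5000, 200), "scores": {"tree": 0.60, "svm": 0.90}},
]
ranking = rank_algorithms(history, (110, 5))
print(ranking)  # tree ranks first for a small dataset
```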

  19. DEVELOP ENGLISH TUTORIAL SYSTEM USING NATURAL LANGUAGE PROCESSING WITH TEXT MINING

    Directory of Open Access Journals (Sweden)

    K. J. SATAO,

    2010-12-01

    Full Text Available This paper describes the model of an Interactive Tutoring system. The model enables an Indian user to generate computer-based tutorials without any programming knowledge, serves as a multiple-domain tutor to monolingual users who are learning English, and the prototype also incorporates an English-Hindi translation ability. Based on reviews of previous work, such features are usually incorporated separately in different systems and applications. Therefore, the primary goal of this paper is to describe an experiment which investigates the feasibility of integrating these features into one common platform. The implementation of the prototype is based on paradigms of a flexible and interactive user interface, and Natural Language Processing.

  20. Analyzing Self-Help Forums with Ontology-Based Text Mining: An Exploration in Kidney Space.

    Science.gov (United States)

    Burckhardt, Philipp; Padman, Rema

    2015-01-01

    The Internet has emerged as a popular source for health-related information. More than eighty percent of American Internet users have searched for health topics online. Millions of patients use self-help online forums to exchange information and support. In parallel, the increasing prevalence of chronic diseases has become a financial burden for the healthcare system demanding new, cost-effective interventions. To provide such interventions, it is necessary to understand patients' preferences of treatment options and to gain insights into their experiences as patients. We introduce a text-processing algorithm based on semantic ontologies to allow for finer-grained analyses of online forums compared to standard methods. We have applied our method in an analysis of two major Chronic Kidney Disease (CKD) forums. Our results suggest that the analysis of forums may provide valuable insights on daily issues patients face, their choice of different treatment options and interactions between patients, their relatives and clinicians.
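The "ontology-based" step that makes such forum analyses finer-grained than plain keyword counts can be sketched as mapping post terms onto concepts from a controlled vocabulary. The concept dictionary, concept names and sample post below are invented stand-ins for a real medical ontology:

```python
# Toy concept dictionary standing in for a medical ontology.
CONCEPTS = {
    "dialysis": "Treatment/Dialysis",
    "transplant": "Treatment/Transplant",
    "fatigue": "Symptom/Fatigue",
}

def annotate_post(post):
    """Map a forum post to ontology concepts by simple term lookup; the
    concepts, not the raw words, become the unit of analysis."""
    words = post.lower().replace(",", " ").split()
    return sorted({CONCEPTS[w] for w in words if w in CONCEPTS})

post = "Started dialysis last month, fatigue is the worst part"
concepts = annotate_post(post)
print(concepts)
```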

  1. A multilingual text mining based content gathering system for open source intelligence

    International Nuclear Information System (INIS)

    Full text: The number of documents available in electronic format has grown dramatically in recent years, whilst the information that States provide to the IAEA is not always complete or clear. Generally speaking, up to 80% of electronic data is textual, and the most valuable information is often hidden and encoded in pages which are neither structured nor classified. The availability of huge amounts of data in the open sources leads to the now well-identified paradox: an overload of information means no usable knowledge. Besides, open source texts are - and will be - written in various native languages, but these documents are relevant even to non-native IAEA speakers. Independent information sources can balance the limited State-reported information, particularly if related to non-cooperative targets. The process of accessing all these raw data, heterogeneous in type (scientific article, patent, free textual document), source (Internet/Intranet, database, etc.), protocol (HTTP/HTTPS, FTP, GOPHER, IRC, NNTP, etc.) and language used, and transforming them into information is therefore inextricably linked to the concepts of focused crawling, textual analysis and synthesis, hinging greatly on the ability to master the problems of multilinguality. This task undoubtedly requires remarkable efforts. This poster describes a multimedia content gathering, multilingual indexing, searching and clustering system, whose main goal is managing huge collections of data coming from different and geographically distributed information sources, providing language-independent searches and dynamic classification facilities. Its focused crawling aims to crawl only the subset of web pages related to a specific category, in order to find only information of interest and improve the quality of document gathering. The focused crawling algorithm builds a model for the context within which topically relevant pages occur on the web, typically capturing link hierarchies

  2. Treatment Rules of Sinomenium Acutum by Text Mining

    Institute of Scientific and Technical Information of China (English)

    李雨彦; 郑光; 刘良

    2015-01-01

    Objective: To summarize the treatment rules of Sinomenium acutum (Menispermaceae, SA) using text mining techniques. Methods: First, literature related to SA was collected from the Chinese Biomedical Literature (CBM) Database. Structured query language was then used for data processing and data stratification, and an algorithm was applied to analyze the basic patterns of symptoms, TCM patterns, herb compatibility and drug combinations. Results: Sinomenium acutum was mainly used to treat diseases marked by symptoms such as pain, swelling, stiffness and deformity. Wind, cold, dampness, heat, phlegm, stasis and deficiency were the main etiological and pathological factors. The diseases treated were chiefly rheumatoid arthritis in modern medical terms, along with various other rheumatic diseases as well as chronic nephritis, hepatitis and arrhythmia. Sinomenium acutum was usually combined with herbs that dispel wind and eliminate dampness, nourish the blood and promote blood circulation, dredge the collaterals, warm the meridians and nourish the kidney; it was also frequently combined with immunomodulatory and collateral-dredging drugs such as tripterygium glycosides and Huoluo pills. Conclusion: Text mining summarized the treatment rules of Sinomenium acutum in a systematic, comprehensive and precise way, providing a literature basis for future clinical application and drug research.

  3. Detection of drug adverse effects by text-mining

    Institute of Scientific and Technical Information of China (English)

    隋明爽; 崔雷

    2015-01-01

    This paper analyzes the necessity and feasibility of detecting drug adverse effects by text mining, and summarizes current research on the topic, unsolved problems and future development trends in four aspects: the text mining workflow, mining/extraction methods, evaluation of results, and existing tool software.

  4. Using large-scale text mining for a systematic reconstruction of molecular mechanisms of diseases: a case study in thyroid cancer

    OpenAIRE

    Wu, Chengkun

    2014-01-01

    Information about the genes and pathways involved in a disease is usually 'buried' in the scientific literature, making it difficult to perform systematic studies for a comprehensive understanding. Text mining provides opportunities to retrieve and extract the most relevant information from the literature, and thus might enable collecting and exploring data relevant to a certain disease systematically. This thesis aims to develop a text-mining pipeline that can identify genes and pathways involved in the...

  5. Applying a text mining framework to the extraction of numerical parameters from scientific literature in the biotechnology domain

    Directory of Open Access Journals (Sweden)

    Anália LOURENÇO

    2013-07-01

    Full Text Available Scientific publications are the main vehicle for disseminating information in the field of biotechnology for wastewater treatment. Indeed, new research paradigms and the application of high-throughput technologies have increased the rate of publication considerably. The problem is that manual curation becomes harder, more error-prone and time-consuming, leading to a probable loss of information and inefficient knowledge acquisition. As a result, research outputs hardly reach engineers, hampering the calibration of mathematical models used to optimize the stability and performance of biotechnological systems. In this context, we have developed a data curation workflow, based on text mining techniques, to extract numerical parameters from scientific literature, and applied it to the biotechnology domain. A workflow was built to process wastewater-related articles with the main goal of identifying the physico-chemical parameters mentioned in the text. This work describes the implementation of the workflow, identifies achievements and current limitations in the overall process, and presents the results obtained for a corpus of 50 full-text documents.
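    The parameter-identification step such a workflow performs can be sketched as a regular-expression pass over free text. The pattern, unit list and example sentence below are illustrative assumptions, not the authors' actual implementation:

    ```python
    import re

    # Capture "<value> <unit>" mentions of physico-chemical parameters.
    # The unit list is an invented, illustrative subset.
    UNIT_PATTERN = re.compile(
        r"(?P<value>\d+(?:\.\d+)?)\s*"
        r"(?P<unit>mg/L|g/L|kg/m3|°C|mS/cm|h|d)\b"
    )

    def extract_parameters(text):
        """Return (value, unit) pairs mentioned in the text."""
        return [(float(m.group("value")), m.group("unit"))
                for m in UNIT_PATTERN.finditer(text)]

    sentence = ("The reactor was operated at 35 °C with an influent COD "
                "of 1200 mg/L and a hydraulic retention time of 12 h.")
    print(extract_parameters(sentence))
    ```

    A real curation workflow would add unit normalization and link each value to the parameter name it describes; this sketch only covers the surface-level matching.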

  6. Applying a text mining framework to the extraction of numerical parameters from scientific literature in the biotechnology domain

    Directory of Open Access Journals (Sweden)

    André SANTOS

    2012-07-01

    Full Text Available Scientific publications are the main vehicle for disseminating information in the field of biotechnology for wastewater treatment. Indeed, new research paradigms and the application of high-throughput technologies have increased the rate of publication considerably. The problem is that manual curation becomes harder, more error-prone and time-consuming, leading to a probable loss of information and inefficient knowledge acquisition. As a result, research outputs hardly reach engineers, hampering the calibration of mathematical models used to optimize the stability and performance of biotechnological systems. In this context, we have developed a data curation workflow, based on text mining techniques, to extract numerical parameters from scientific literature, and applied it to the biotechnology domain. A workflow was built to process wastewater-related articles with the main goal of identifying the physico-chemical parameters mentioned in the text. This work describes the implementation of the workflow, identifies achievements and current limitations in the overall process, and presents the results obtained for a corpus of 50 full-text documents.

  7. PESCADOR, a web-based tool to assist text-mining of biointeractions extracted from PubMed queries

    Directory of Open Access Journals (Sweden)

    Barbosa-Silva Adriano

    2011-11-01

    Full Text Available Abstract Background Biological function is greatly dependent on the interactions of proteins with other proteins and genes. Abstracts from the biomedical literature stored in the NCBI's PubMed database can be used to derive interactions between genes and proteins by identifying co-occurrences of their terms. Often, the number of interactions obtained through such an approach is large and may mix processes occurring in different contexts. Current tools do not allow studying these data with a focus on concepts of relevance to a user, for example, interactions related to a disease or to a biological mechanism such as protein aggregation. Results To help the concept-oriented exploration of such data we developed PESCADOR, a web tool that extracts a network of interactions from a set of PubMed abstracts given by a user, and allows filtering the interaction network according to user-defined concepts. We illustrate its use in exploring protein aggregation in neurodegenerative disease and in the expansion of pathways associated with colon cancer. Conclusions PESCADOR is a platform-independent web resource available at: http://cbdm.mdc-berlin.de/tools/pescador/
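    The co-occurrence idea underlying such tools can be sketched in a few lines: two entity names mentioned in the same abstract are linked, and the edge weight counts shared abstracts. The gene list and abstracts below are invented examples; PESCADOR itself does considerably more (concept tagging, context filtering):

    ```python
    from collections import Counter
    from itertools import combinations

    # Toy gazetteer of gene/protein names (invented subset).
    GENES = {"TP53", "MDM2", "BRCA1", "ATM"}

    def cooccurrence_network(abstracts):
        """Count, for each gene pair, how many abstracts mention both."""
        edges = Counter()
        for text in abstracts:
            found = {g for g in GENES if g in text}
            for pair in combinations(sorted(found), 2):
                edges[pair] += 1
        return edges

    abstracts = [
        "TP53 is negatively regulated by MDM2.",
        "ATM phosphorylates TP53 and interacts with BRCA1.",
        "MDM2 overexpression inhibits TP53 function.",
    ]
    print(cooccurrence_network(abstracts))
    ```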

  8. MegaMiner: A Tool for Lead Identification Through Text Mining Using Chemoinformatics Tools and Cloud Computing Environment.

    Science.gov (United States)

    Karthikeyan, Muthukumarasamy; Pandit, Yogesh; Pandit, Deepak; Vyas, Renu

    2015-01-01

    Virtual screening is an indispensable tool for coping with the massive amounts of data generated by high-throughput omics technologies. With the objective of enhancing the automation of the virtual screening process, a robust portal termed MegaMiner has been built on a cloud computing platform, wherein the user submits a text query and directly accesses the proposed lead molecules along with their drug-like, lead-like and docking scores. Textual representation of chemical structure data is fraught with ambiguity in the absence of a global identifier. We have used a combination of statistical models, a chemical dictionary and regular expressions to build a disease-specific dictionary. To demonstrate the effectiveness of this approach, a case study on malaria has been carried out in the present work. MegaMiner offered superior results compared to other text mining search engines, as established by F-score analysis. A single query term, 'malaria', in the portlet led to the retrieval of related PubMed records, protein classes, drug classes and 8000 scaffolds, which were internally processed and filtered to suggest new molecules as potential anti-malarials. The results obtained were validated by docking the virtual molecules into relevant protein targets. It is hoped that MegaMiner will serve as an indispensable tool not only for identifying hidden relationships between various biological and chemical entities but also for building better corpora and ontologies.
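    A minimal sketch of the dictionary-plus-regular-expression idea the abstract describes. The dictionary entries and the drug-suffix pattern below are invented for illustration and are not MegaMiner's actual resources:

    ```python
    import re

    # Invented disease dictionary entries.
    DISEASE_DICT = {"malaria", "falciparum malaria"}
    # Crude, illustrative pattern for drug-like tokens with common suffixes.
    DRUG_SUFFIX = re.compile(r"\b\w+(?:quine|artemisinin|cycline)\b", re.IGNORECASE)

    def tag(text):
        """Return dictionary-matched diseases and regex-matched drug tokens."""
        low = text.lower()
        diseases = sorted(d for d in DISEASE_DICT if d in low)
        drugs = sorted({m.group(0).lower() for m in DRUG_SUFFIX.finditer(text)})
        return {"diseases": diseases, "drugs": drugs}

    print(tag("Chloroquine resistance complicates falciparum malaria treatment."))
    ```

    Note that both "malaria" and "falciparum malaria" match here; a production system would resolve such overlaps, e.g. by keeping only the longest span.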

  9. Introduction to biomedical engineering

    CERN Document Server

    Enderle, John

    2011-01-01

    Introduction to Biomedical Engineering is a comprehensive survey text for biomedical engineering courses. It is the most widely adopted text across the BME course spectrum, valued by instructors and students alike for its authority, clarity and encyclopedic coverage in a single volume. Biomedical engineers need to understand the wide range of topics that are covered in this text, including basic mathematical modeling; anatomy and physiology; electrical engineering, signal processing and instrumentation; biomechanics; biomaterials science and tissue engineering; and medical and engineering e

  10. Biomedical Science, Unit II: Nutrition in Health and Medicine. Digestion of Foods; Organic Chemistry of Nutrients; Energy and Cell Respiration; The Optimal Diet; Foodborne Diseases; Food Technology; Dental Science and Nutrition. Student Text. Revised Version, 1975.

    Science.gov (United States)

    Biomedical Interdisciplinary Curriculum Project, Berkeley, CA.

    This student text presents instructional materials for a unit of science within the Biomedical Interdisciplinary Curriculum Project (BICP), a two-year interdisciplinary precollege curriculum aimed at preparing high school students for entry into college and vocational programs leading to a career in the health field. Lessons concentrate on…

  11. The Research and Development of a Web-Based Text Mining System

    Institute of Scientific and Technical Information of China (English)

    唐菁; 沈记全; 杨炳儒

    2003-01-01

    With the development of network technology, information spreads on the Internet more and more quickly. There are many types of complicated data in this ocean of information, and acquiring useful knowledge from it quickly is very difficult. Web-based text mining is a new research field that can solve this problem effectively. In this paper, we present a structural model for text mining and study its core algorithm, the classification algorithm. We have developed a Web-based text mining system and applied it in modern distance education. This system can automatically classify text information collected from education sites on the Internet, helping people browse important information quickly and acquire knowledge.
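    The paper does not specify its classification algorithm, so the sketch below uses a toy multinomial Naive Bayes classifier as a generic stand-in; the training snippets and labels are invented:

    ```python
    import math
    from collections import Counter, defaultdict

    def train(docs):
        """docs: list of (text, label). Returns per-label word counts and label counts."""
        word_counts = defaultdict(Counter)
        labels = Counter()
        for text, label in docs:
            labels[label] += 1
            word_counts[label].update(text.lower().split())
        return word_counts, labels

    def classify(text, word_counts, labels):
        """Multinomial Naive Bayes with add-one smoothing."""
        vocab = {w for c in word_counts.values() for w in c}
        best, best_lp = None, float("-inf")
        for label, n in labels.items():
            lp = math.log(n / sum(labels.values()))          # log prior
            total = sum(word_counts[label].values())
            for w in text.lower().split():                   # log likelihoods
                lp += math.log((word_counts[label][w] + 1) / (total + len(vocab)))
            if lp > best_lp:
                best, best_lp = label, lp
        return best

    docs = [("course syllabus exam lecture", "education"),
            ("exam schedule lecture notes", "education"),
            ("stock market price shares", "finance"),
            ("market trading price index", "finance")]
    model = train(docs)
    print(classify("lecture exam notes", *model))
    ```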

  12. Unblocking Blockbusters: Using Boolean Text-Mining to Optimise Clinical Trial Design and Timeline for Novel Anticancer Drugs

    Directory of Open Access Journals (Sweden)

    Richard J. Epstein

    2009-01-01

    Full Text Available Two problems now threaten the future of anticancer drug development: (i) the information explosion has made research into new target-specific drugs more duplication-prone, and hence less cost-efficient; and (ii) high-throughput genomic technologies have failed to deliver the anticipated early windfall of novel first-in-class drugs. Here it is argued that the resulting crisis of blockbuster drug development may be remedied in part by innovative exploitation of informatic power. Using scenarios relating to oncology, it is shown that rapid data-mining of the scientific literature can refine therapeutic hypotheses and thus reduce empirical reliance on preclinical model development and early-phase clinical trials. Moreover, as personalised medicine evolves, this approach may inform biomarker-guided phase III trial strategies for noncytotoxic (antimetastatic) drugs that prolong patient survival without necessarily inducing tumor shrinkage. Though not replacing conventional gold standards, these findings suggest that this computational research approach could reduce costly ‘blue skies’ R&D investment and time to market for new biological drugs, thereby helping to reverse unsustainable drug price inflation.

  13. Unblocking Blockbusters: Using Boolean Text-Mining to Optimise Clinical Trial Design and Timeline for Novel Anticancer Drugs

    Directory of Open Access Journals (Sweden)

    Richard J. Epstein

    2009-08-01

    Full Text Available Two problems now threaten the future of anticancer drug development: (i) the information explosion has made research into new target-specific drugs more duplication-prone, and hence less cost-efficient; and (ii) high-throughput genomic technologies have failed to deliver the anticipated early windfall of novel first-in-class drugs. Here it is argued that the resulting crisis of blockbuster drug development may be remedied in part by innovative exploitation of informatic power. Using scenarios relating to oncology, it is shown that rapid data-mining of the scientific literature can refine therapeutic hypotheses and thus reduce empirical reliance on preclinical model development and early-phase clinical trials. Moreover, as personalised medicine evolves, this approach may inform biomarker-guided phase III trial strategies for noncytotoxic (antimetastatic) drugs that prolong patient survival without necessarily inducing tumor shrinkage. Though not replacing conventional gold standards, these findings suggest that this computational research approach could reduce costly ‘blue skies’ R&D investment and time to market for new biological drugs, thereby helping to reverse unsustainable drug price inflation.

  14. Grouping chemicals for health risk assessment: A text mining-based case study of polychlorinated biphenyls (PCBs).

    Science.gov (United States)

    Ali, Imran; Guo, Yufan; Silins, Ilona; Högberg, Johan; Stenius, Ulla; Korhonen, Anna

    2016-01-22

    As many chemicals act as carcinogens, chemical health risk assessment is critically important. A notoriously time-consuming process, risk assessment could be greatly supported by classifying chemicals with similar toxicological profiles so that they can be assessed in groups rather than individually. We have previously developed a text mining (TM)-based tool that can automatically identify the mode of action (MOA) of a carcinogen based on the scientific evidence in the literature, and it can measure the MOA similarity between chemicals on the basis of their literature profiles (Korhonen et al., 2009, 2012). A new version of the tool (2.0) was recently released, and here we apply this tool for the first time to investigate and identify meaningful groups of chemicals for risk assessment. We used the published literature on polychlorinated biphenyls (PCBs), persistent, widely spread toxic organic compounds comprising 209 different congeners. Although chemically similar, these compounds are heterogeneous in terms of MOA. We show that our TM tool, when applied to 1648 PubMed abstracts, produces a MOA profile for a subgroup of dioxin-like PCBs (DL-PCBs) which differs clearly from that for the rest of the PCBs. This suggests that the tool could be used to effectively identify homogeneous groups of chemicals and, when integrated into real-life risk assessment, could significantly improve the efficiency of the process.
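    The idea of comparing chemicals via their literature profiles can be sketched with term-frequency vectors and cosine similarity. This is analogous in spirit, not in detail, to the tool's MOA similarity measure, and the example "abstracts" below are invented:

    ```python
    import math
    from collections import Counter

    def profile(abstracts):
        """Build a term-frequency literature profile from a list of abstracts."""
        words = Counter()
        for text in abstracts:
            words.update(text.lower().split())
        return words

    def cosine(p, q):
        """Cosine similarity between two term-frequency profiles."""
        dot = sum(p[w] * q[w] for w in set(p) & set(q))
        norm = (math.sqrt(sum(v * v for v in p.values()))
                * math.sqrt(sum(v * v for v in q.values())))
        return dot / norm if norm else 0.0

    dl_pcb = profile(["ahr activation dioxin-like response",
                      "ahr mediated toxicity"])
    other = profile(["neurotoxic effects calcium signalling",
                     "calcium homeostasis disruption"])
    print(round(cosine(dl_pcb, dl_pcb), 2), round(cosine(dl_pcb, other), 2))
    ```

    Profiles with high pairwise similarity would be candidates for assessment as a group.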

  15. Identification of candidate genes in Populus cell wall biosynthesis using text-mining, co-expression network and comparative genomics

    Energy Technology Data Exchange (ETDEWEB)

    Yang, Xiaohan [ORNL; Ye, Chuyu [ORNL; Bisaria, Anjali [ORNL; Tuskan, Gerald A [ORNL; Kalluri, Udaya C [ORNL

    2011-01-01

    Populus is an important bioenergy crop for bioethanol production. A greater understanding of cell wall biosynthesis processes is critical to reducing biomass recalcitrance, a major hindrance to the efficient generation of ethanol from lignocellulosic biomass. Here, we report the identification of candidate cell wall biosynthesis genes through the development and application of a novel bioinformatics pipeline. As a first step, via text-mining of PubMed publications, we obtained 121 Arabidopsis genes that had experimental evidence supporting their involvement in cell wall biosynthesis or remodeling. The 121 genes were then used as bait genes to query an Arabidopsis co-expression database, and additional genes were identified as neighbors of the bait genes in the network, increasing the number of genes to 548. The 548 Arabidopsis genes were then used to re-query the Arabidopsis co-expression database and reconstruct a network that captured additional network neighbors, expanding to a total of 694 genes. The 694 Arabidopsis genes were computationally divided into 22 clusters. Queries of the Populus genome using the Arabidopsis genes revealed 817 Populus orthologs. Functional analysis of gene ontology and tissue-specific gene expression indicated that these Arabidopsis and Populus genes are high-likelihood candidates for functional genomics in relation to cell wall biosynthesis.
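    The bait-and-expand step described above can be sketched as a one-hop neighbourhood query over a co-expression graph; the gene names and edges below are invented placeholders, not data from the study:

    ```python
    from collections import defaultdict

    def expand(baits, edges):
        """Return the bait genes plus their direct co-expression neighbours."""
        graph = defaultdict(set)
        for a, b in edges:
            graph[a].add(b)
            graph[b].add(a)
        expanded = set(baits)
        for g in baits:
            expanded |= graph[g]
        return expanded

    edges = [("CESA1", "CESA3"), ("CESA3", "KOR1"), ("MYB46", "CESA8")]
    baits = {"CESA1", "MYB46"}
    print(sorted(expand(baits, edges)))
    ```

    The second query round in the pipeline corresponds to applying the same step again, i.e. `expand(expand(baits, edges), edges)`, which here would additionally pull in KOR1.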

  16. Text mining for pharmacovigilance: Using machine learning for drug name recognition and drug-drug interaction extraction and classification.

    Science.gov (United States)

    Ben Abacha, Asma; Chowdhury, Md Faisal Mahbub; Karanasiou, Aikaterini; Mrabet, Yassine; Lavelli, Alberto; Zweigenbaum, Pierre

    2015-12-01

    Pharmacovigilance (PV) is defined by the World Health Organization as the science and activities related to the detection, assessment, understanding and prevention of adverse effects or any other drug-related problem. An essential aspect in PV is to acquire knowledge about Drug-Drug Interactions (DDIs). The shared tasks on DDI-Extraction organized in 2011 and 2013 have pointed out the importance of this issue and provided benchmarks for: Drug Name Recognition, DDI extraction and DDI classification. In this paper, we present our text mining systems for these tasks and evaluate their results on the DDI-Extraction benchmarks. Our systems rely on machine learning techniques using both feature-based and kernel-based methods. The obtained results for drug name recognition are encouraging. For DDI-Extraction, our hybrid system combining a feature-based method and a kernel-based method was ranked second in the DDI-Extraction-2011 challenge, and our two-step system for DDI detection and classification was ranked first in the DDI-Extraction-2013 task at SemEval. We discuss our methods and results and give pointers to future work. PMID:26432353

  17. Grouping chemicals for health risk assessment: A text mining-based case study of polychlorinated biphenyls (PCBs).

    Science.gov (United States)

    Ali, Imran; Guo, Yufan; Silins, Ilona; Högberg, Johan; Stenius, Ulla; Korhonen, Anna

    2016-01-22

    As many chemicals act as carcinogens, chemical health risk assessment is critically important. A notoriously time-consuming process, risk assessment could be greatly supported by classifying chemicals with similar toxicological profiles so that they can be assessed in groups rather than individually. We have previously developed a text mining (TM)-based tool that can automatically identify the mode of action (MOA) of a carcinogen based on the scientific evidence in the literature, and it can measure the MOA similarity between chemicals on the basis of their literature profiles (Korhonen et al., 2009, 2012). A new version of the tool (2.0) was recently released, and here we apply this tool for the first time to investigate and identify meaningful groups of chemicals for risk assessment. We used the published literature on polychlorinated biphenyls (PCBs), persistent, widely spread toxic organic compounds comprising 209 different congeners. Although chemically similar, these compounds are heterogeneous in terms of MOA. We show that our TM tool, when applied to 1648 PubMed abstracts, produces a MOA profile for a subgroup of dioxin-like PCBs (DL-PCBs) which differs clearly from that for the rest of the PCBs. This suggests that the tool could be used to effectively identify homogeneous groups of chemicals and, when integrated into real-life risk assessment, could significantly improve the efficiency of the process. PMID:26562772

  18. Quantifying the impact and extent of undocumented biomedical synonymy.

    Directory of Open Access Journals (Sweden)

    David R Blair

    2014-09-01

    Full Text Available Synonymous relationships among biomedical terms are extensively annotated within specialized terminologies, implying that synonymy is important for practical computational applications within this field. It remains unclear, however, whether text mining actually benefits from documented synonymy and whether existing biomedical thesauri provide adequate coverage of these linguistic relationships. In this study, we examine the impact and extent of undocumented synonymy within a very large compendium of biomedical thesauri. First, we demonstrate that missing synonymy has a significant negative impact on named entity normalization, an important problem within the field of biomedical text mining. To estimate the amount of synonymy currently missing from thesauri, we develop a probabilistic model for the construction of synonym terminologies that is capable of handling a wide range of potential biases, and we evaluate its performance using the broader domain of near-synonymy among general English words. Our model predicts that over 90% of these relationships are currently undocumented, a result that we support experimentally through "crowd-sourcing." Finally, we apply our model to biomedical terminologies and predict that they are missing the vast majority (>90%) of the synonymous relationships they intend to document. Overall, our results expose the dramatic incompleteness of current biomedical thesauri and suggest the need for "next-generation," high-coverage lexical terminologies.
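    Why missing synonymy hurts named entity normalization can be shown with a minimal thesaurus lookup: a dictionary-based normalizer resolves only the surface forms its thesaurus documents. The terms and identifier below are illustrative examples, not an excerpt from any real terminology:

    ```python
    # Toy thesaurus mapping surface forms to a concept identifier.
    THESAURUS = {
        "myocardial infarction": "D009203",   # preferred term
        "heart attack": "D009203",            # documented synonym
    }

    def normalize(mention, thesaurus):
        """Map a textual mention to a concept ID, or None if unknown."""
        return thesaurus.get(mention.lower())

    print(normalize("Heart attack", THESAURUS))        # documented synonym resolves
    print(normalize("cardiac infarction", THESAURUS))  # undocumented synonym fails
    ```

    Every undocumented synonym behaves like "cardiac infarction" here: the mention silently fails to normalize, which is the effect the study quantifies at scale.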

  19. Dropping down the Maximum Item Set: Improving the Stylometric Authorship Attribution Algorithm in the Text Mining for Authorship Investigation

    Directory of Open Access Journals (Sweden)

    Tareef K. Mustafa

    2010-01-01

    Full Text Available Problem statement: Stylometric authorship attribution is an approach concerned with analyzing texts in text mining, e.g., novels and plays written by famous authors, and measuring each author's style by choosing attributes that reflect that author's way of writing, on the assumption that each writer has a way of writing that no other writer shares; thus, authorship attribution is the task of identifying the author of a given text. In this study, we propose an authorship attribution algorithm that improves the accuracy with which the stylometric features of different writers can be discriminated, nearly as well as the fingerprints of different persons. Approach: The main target of this study is to build an algorithm supporting decision-making systems that enables users to predict and choose the right author for a specific anonymous novel under consideration, by using a learning procedure to teach the system the stylometric map of each author so that it can act as an expert opinion. Stylometric Authorship Attribution (AA) usually depends on the frequent word as the best attribute available; many studies have sought other beneficial attributes, but the frequent word remains ahead of other attributes, giving better results in research and experiments, and the best technique used so far is counting the bag-of-words with the maximum item set. Results: To improve AA techniques, we use a new set of attributes with a new measurement tool. The first attribute used in this study is the frequent pair, meaning a pair of words that always appear together; this attribute is clearly not new, but it has not been successful compared with the frequent word under maximum-item-set counting. The word pairs made some mistakes, as the experimental results show, improving the winnow algorithm by combining it with the computational
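    The frequent-pair attribute amounts to counting adjacent word pairs (bigrams) per text and comparing the most frequent ones across authors. A minimal sketch, with an invented sample text:

    ```python
    from collections import Counter

    def frequent_pairs(text, top=3):
        """Return the `top` most frequent adjacent word pairs in the text."""
        words = text.lower().split()
        pairs = Counter(zip(words, words[1:]))
        return pairs.most_common(top)

    sample = "it was the best of times it was the worst of times"
    print(frequent_pairs(sample))
    ```

    An attribution system would build one such pair profile per candidate author from known texts and assign an anonymous text to the author whose profile it most resembles.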

  20. PALM-IST: Pathway Assembly from Literature Mining - an Information Search Tool

    OpenAIRE

    Mandloi, Sapan; Chakrabarti, Saikat

    2015-01-01

    Manual curation of the biomedical literature has become an extremely tedious process due to its exponential growth in recent years. To extract meaningful information from such large volumes of unstructured text, newer and more efficient mining tools are required. Here, we introduce PALM-IST, a computational platform that not only allows users to explore biomedical abstracts using keyword-based text mining but also extracts biological entities (e.g., gene/protein, drug, disease, biological processes, cellular c...

  1. Exploratory text mining algorithm based on high-dimensional clustering

    Institute of Scientific and Technical Information of China (English)

    张爱科; 符保龙

    2013-01-01

    Because of the unstructured characteristics of free text, text mining has become an important branch of data mining, and in recent years many types of text mining algorithms have emerged. This paper establishes an exploratory text mining algorithm based on high-dimensional clustering, using the guiding role of text mining to achieve data mining in data-like text. The algorithm requires only a small number of iterations to produce good clusters from very large text collections; mapping to other recorded data and assigning texts to user groups can further improve its results. The feasibility and validity of the proposed method are verified by tests on related data and analysis of the experimental results.

  2. Research on Fuzzy Clustering Validity in Web Text Mining

    Institute of Scientific and Technical Information of China (English)

    罗琪

    2012-01-01

    This paper studies Web text mining based on fuzzy clustering together with a cluster validity evaluation function, and applies the validity function to evaluating fuzzy clustering in Web text mining. Simulation experiments show that FKCM can effectively improve the precision of Web text clustering and that the method has a certain accuracy and feasibility.

  3. Research on Web Text Mining Based on Domain Ontology

    Institute of Scientific and Technical Information of China (English)

    阮光册

    2011-01-01

    To remedy the traditional Web text mining methods' lack of semantic understanding of text, this paper combines ontologies with Web text mining and discusses a Web text mining method based on domain ontology. It first creates an ontology structure for Web text, then introduces a domain-ontology "concept-concept" similarity matrix and describes the identification of relations among concepts, and finally presents an implementation of Web text mining that uncovers the latent content of Web text. In the experiment, taking online media reports as an example, relevant conclusions are drawn through text mining.

  4. Standardisation in the area of innovation and technological development, notably in the field of Text and Data Mining: report from the expert group

    NARCIS (Netherlands)

    I. Hargreaves; L. Guibault; C. Handke; P. Valcke; B. Martens

    2014-01-01

    Text and data mining (TDM) is an important technique for analysing and extracting new insights and knowledge from the exponentially increasing store of digital data (‘Big Data’). TDM is useful to researchers of all kinds, from historians to medical experts, and its methods are relevant to organisati

  5. Research on Semantic Text Mining Based on Domain Ontology

    Institute of Scientific and Technical Information of China (English)

    张玉峰; 何超

    2011-01-01

    In order to improve the depth and accuracy of text mining, a semantic text mining model based on domain ontology is proposed. In this model, semantic role labeling is applied to semantic analysis so that concepts and the semantic relations between them can be extracted accurately, improving the accuracy of text representation. Because traditional knowledge mining algorithms cannot effectively mine a semantic metadata base, an association pattern mining algorithm based on semantics is designed and used to acquire deep semantic association patterns from the semantic metadata base. Experimental results show that the model can mine deep semantic knowledge from text databases, that the patterns obtained have great potential application value, and that the algorithm designed has strong adaptability and scalability.

  6. Research on Methods of Text Mining and Its Application

    Institute of Scientific and Technical Information of China (English)

    张晓艳; 华英

    2011-01-01

    The rise of the Internet has brought a vast amount of text information. Text mining technology is used to extract information useful to users from semi-structured or unstructured texts. This article compares and analyzes several common methods of text mining and summarizes its current main application areas.

  7. Pattern Mining Methods for Multimedia Text Data

    Institute of Scientific and Technical Information of China (English)

    刘茂福; 曹加恒; 彭敏; 叶可; 林芝

    2001-01-01

    Multimedia Text data Mining (MTM) is a new research field in data mining. This paper gives the definition and classifications of MTM, proposes a Multimedia Text data Mining Model (MTMM) with its feature representation, discusses methods for multimedia text classification mining, and examines the differences and relationships between MTM and Web mining. The goal of MTM is to discover useful knowledge or patterns and to promote the development and application of MTM.

  8. Application of Text Mining to Extract Hotel Attributes and Construct Perceptual Map of Five Star Hotels from Online Review: Study of Jakarta and Singapore Five-Star Hotels

    Directory of Open Access Journals (Sweden)

    Arga Hananto

    2015-12-01

    Full Text Available The use of post-purchase online consumer reviews in studies of hotel attributes is still scarce in the literature. Arguably, post-purchase online review data yield more accurate attributes that consumers actually consider in their purchase decisions. This study aims to extract attributes from two samples of five-star hotel reviews (Jakarta and Singapore) with a text mining methodology. In addition, this study describes the positioning of five-star hotels in Jakarta and Singapore based on the extracted attributes using Correspondence Analysis. This study finds that reviewers of five-star hotels in both cities mention similar attributes, such as service, staff, club, location, pool and food. Attributes derived from text mining appear to be viable input for building a fairly accurate positioning map of hotels. This study has demonstrated the viability of online reviews as a source of data for hotel attribute and positioning studies.

  9. Fundamental of biomedical engineering

    CERN Document Server

    Sawhney, GS

    2007-01-01

    About the Book: A well set out textbook that explains the fundamentals of biomedical engineering in the areas of biomechanics, biofluid flow, biomaterials, bioinstrumentation and the use of computing in biomedical engineering. All these subjects form a basic part of an engineer's education. The text is admirably suited to meet the needs of students of mechanical engineering opting for the elective in Biomedical Engineering. Coverage of bioinstrumentation, biomaterials and computing for biomedical engineers can also meet the needs of students of Electronic & Communication and Electronic & Instrumentation

  10. What online communities can tell us about electronic cigarettes and hookah use: A study using text mining and visualization techniques

    OpenAIRE

    Chen, AT; Zhu, SH; Conway, M.

    2015-01-01

    © 2015 Journal of Medical Internet Research. Background: The rise in popularity of electronic cigarettes (e-cigarettes) and hookah over recent years has been accompanied by some confusion and uncertainty regarding the development of an appropriate regulatory response towards these emerging products. Mining online discussion content can lead to insights into people's experiences, which can in turn further our knowledge of how to address potential health implications. In this work, we take a no...

  11. Enriching a biomedical event corpus with meta-knowledge annotation

    OpenAIRE

    Thompson Paul; Nawaz Raheel; McNaught John; Ananiadou Sophia

    2011-01-01

    Abstract Background Biomedical papers contain rich information about entities, facts and events of biological relevance. To discover these automatically, we use text mining techniques, which rely on annotated corpora for training. In order to extract protein-protein interactions, genotype-phenotype/gene-disease associations, etc., we rely on event corpora that are annotated with classified, structured representations of important facts and findings contained within text. These provide an impo...

  12. TML: A General High-Performance Text Mining Language

    Institute of Scientific and Technical Information of China (English)

    李佳静; 李晓明; 孟涛

    2015-01-01

    This paper presents a general-purpose, high-performance programming language for text mining named TML ("text mining language"), together with its compiler, runtime virtual machine and graphical development environment. Users write TML code to specify extraction targets and extraction methods; the code is compiled into bytecode, optimized, and then matched against the semantic structure of the input text. The language has the following characteristics: 1) it supplies a formal way to describe the scope, targets and methods of text mining, so that declarative text mining in different application domains can be done simply by writing TML code; 2) the runtime virtual machine, built around information extraction techniques, efficiently implements many common text mining techniques and organizes them into a text analysis pipeline; 3) a series of compiler optimizations allows large numbers of matching instructions to execute concurrently, addressing the language's efficiency on massive rule sets and massive data. Practical cases illustrate the descriptive power of TML and its real-world applications.

  13. Exploring the Corresponding Rules of "Syndrome-Symptom-Prescription-Drug" in Depression by Text Mining

    Institute of Scientific and Technical Information of China (English)

    展俊平; 姜淼; 张彤; 郑光; 吕诚; 蔡峰; 杨静; 何晓鹃; 梁非; 吕爱平

    2012-01-01

    Background: This study explores the rules of Chinese herbal medicine use under the syndrome framework of traditional Chinese medicine (TCM). Methods: A data set on depression was downloaded from the Chinese BioMedical Literature Database, and the rules relating TCM syndromes, symptoms and Chinese herbal medicines were mined out using a data stratification algorithm based on frequency statistics of sensitive keywords; the results are shown as frequency tables and two-dimensional networks. Results: In TCM pattern terms, depression mainly presents as intermingled deficiency and excess syndrome; the internal organs involved are chiefly the liver, together with the heart, spleen and kidney; the dominant syndromes are liver qi stagnation and liver depression with spleen deficiency; the core symptoms are mental disturbances such as insomnia, low mood and pessimism; both the decoctions and the individual herbs used centre on soothing the liver, relieving depression, strengthening the spleen, nourishing the heart and calming the mind. Conclusions: Text mining, combined with literature tracing and manual reading for noise reduction, can objectively summarize the TCM rules of "syndrome-symptom-prescription-drug" correspondence.
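
    The mining step — counting sensitive keywords per abstract and linking co-occurring ones into the two-dimensional network — can be sketched minimally in Python. The keyword list below is illustrative only, not the study's actual vocabulary:

```python
from collections import Counter
from itertools import combinations

# Illustrative sensitive-keyword list; the study's real vocabulary covers
# syndromes, symptoms, prescriptions and herbal medicines.
KEYWORDS = {"insomnia", "depression", "pessimism", "liver", "spleen"}

def mine(abstracts):
    """Count per-keyword frequency and keyword co-occurrence within abstracts."""
    freq, edges = Counter(), Counter()
    for text in abstracts:
        hits = sorted(w for w in KEYWORDS if w in text.lower())
        freq.update(hits)
        edges.update(combinations(hits, 2))  # edges of the 2-D network
    return freq, edges

freq, edges = mine([
    "Insomnia and depression often involve the liver.",
    "Depression with pessimism and spleen deficiency.",
])
print(freq["depression"])                 # appears in both abstracts
print(edges[("depression", "insomnia")])  # co-occur once
```

    The `freq` table corresponds to the one-dimensional frequency explanation and `edges` to the two-dimensional network described in the abstract.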

  14. Working with Data: Discovering Knowledge through Mining and Analysis; Systematic Knowledge Management and Knowledge Discovery; Text Mining; Methodological Approach in Discovering User Search Patterns through Web Log Analysis; Knowledge Discovery in Databases Using Formal Concept Analysis; Knowledge Discovery with a Little Perspective.

    Science.gov (United States)

    Qin, Jian; Jurisica, Igor; Liddy, Elizabeth D.; Jansen, Bernard J; Spink, Amanda; Priss, Uta; Norton, Melanie J.

    2000-01-01

    These six articles discuss knowledge discovery in databases (KDD). Topics include data mining; knowledge management systems; applications of knowledge discovery; text and Web mining; text mining and information retrieval; user search patterns through Web log analysis; concept analysis; data collection; and data structure inconsistency. (LRW)

  15. Benchmarking of the 2010 BioCreative Challenge III text-mining competition by the BioGRID and MINT interaction databases

    Directory of Open Access Journals (Sweden)

    Cesareni Gianni

    2011-10-01

    Full Text Available Abstract Background The vast amount of data published in the primary biomedical literature represents a challenge for the automated extraction and codification of individual data elements. Biological databases that rely solely on manual extraction by expert curators are unable to comprehensively annotate the information dispersed across the entire biomedical literature. The development of efficient tools based on natural language processing (NLP) systems is essential for the selection of relevant publications, identification of data attributes and partially automated annotation. One of the tasks of the BioCreative 2010 Challenge III was devoted to the evaluation of NLP systems developed to identify articles for curation and extraction of protein-protein interaction (PPI) data. Results The BioCreative 2010 competition addressed three tasks: gene normalization, article classification and interaction method identification. The BioGRID and MINT protein interaction databases both participated in the generation of the test publication set for gene normalization, annotated the development and test sets for article classification, and curated the test set for interaction method classification. These test datasets served as a gold standard for the evaluation of data extraction algorithms. Conclusion The development of efficient tools for extraction of PPI data is a necessary step to achieve full curation of the biomedical literature. NLP systems can in the first instance facilitate expert curation by refining the list of candidate publications that contain PPI data; more ambitiously, NLP approaches may be able to directly extract relevant information from full-text articles for rapid inspection by expert curators. Close collaboration between biological databases and NLP systems developers will continue to facilitate the long-term objectives of both disciplines.

  16. Application of Text Mining to Extract Hotel Attributes and Construct Perceptual Map of Five Star Hotels from Online Review: Study of Jakarta and Singapore Five-Star Hotels

    OpenAIRE

    Arga Hananto

    2015-01-01

    The use of post-purchase online consumer review in hotel attributes study was still scarce in the literature. Arguably, post purchase online review data would gain more accurate attributes thatconsumers actually consider in their purchase decision. This study aims to extract attributes from two samples of five-star hotel reviews (Jakarta and Singapore) with text mining methodology. In addition,this study also aims to describe positioning of five-star hotels in Jakarta and Singapore based on t...

  17. Evaluation of the strengths and weaknesses of Text Mining and Netnography as methods of understanding consumer conversations around luxury brands on social media platforms.

    OpenAIRE

    SAINI, CHITRA; ,

    2015-01-01

    The advent of social media has led to luxury brands increasingly turning to social media sites to build brand value. Understanding the discussions that happen on social media is therefore key for the marketing managers of luxury brands. Two prominent methodologies have been widely used in the literature to study consumer conversations on social media: Text Mining and Netnography. In this study I will compare and contrast both these methodologies t...

  18. Key Issues in Morphology Analysis Based on Text Mining

    Institute of Scientific and Technical Information of China (English)

    冷伏海; 王林; 王立学

    2012-01-01

    Morphology analysis based on text mining integrates text mining methods into the traditional approach, reducing reliance on domain experts and adding objective data support to the analysis, thereby improving the method's efficiency and scientific rigour. The approach involves four key issues: definition of the morphological structure, feature word selection, morphology representation, and morphology analysis. Optimizing the solutions to these four issues plays a key role in improving the efficiency and quality of the whole method.

  19. The Application of Web Text Mining in Druggist Interest Extraction

    Institute of Scientific and Technical Information of China (English)

    孙士新

    2014-01-01

    Information acquisition has become an important component of druggists' business operations and a basis for their market judgments, and the appearance of large amounts of unstructured and semi-structured information on the web provides the technical space and empirical basis for personalized services for druggists. This paper designs the key text mining techniques for such personalized service, applies the text mining process to a traditional Chinese medicinal materials information website, and demonstrates automatic acquisition of user interests from that website as a worked example.

  20. Analysis of Web Media Report Differences Based on Text Mining

    Institute of Scientific and Technical Information of China (English)

    阮光册

    2012-01-01

    Using text mining technology to discover potential but valuable information in web media reports is a new attempt in information research. This paper discusses text mining methods for web media reports and, taking web coverage of the Shanghai Expo as a case, carries out an empirical study of the differences among web media from different regions. Reports on the Expo from Hong Kong, Taiwan, overseas Chinese-language media and Shanghai local media are selected, and the differences in their content are analyzed on the basis of text mining and feature extraction, leading to several conclusions.

  1. Using Google blogs and discussions to recommend biomedical resources: a case study.

    Science.gov (United States)

    Reed, Robyn B; Chattopadhyay, Ansuman; Iwema, Carrie L

    2013-01-01

    This case study investigated whether data gathered from discussions within social media provide a reliable basis for a biomedical resources recommendation system. Using a search query to mine text from Google Blogs and Discussions, a ranking of biomedical resources was determined based on those most frequently mentioned. To establish quality, these results were compared with rankings by subject experts. An overall agreement between the frequency of social media discussions and subject expert recommendations was observed when identifying key bioinformatics and consumer health resources. Testing the method in more than one biomedical area implies this procedure could be employed across different subjects.

  2. WSAM: An Internet Text UGC Subjective Attitude Mining System

    Institute of Scientific and Technical Information of China (English)

    费仲超; 朱鲲鹏; 魏芳

    2012-01-01

    The information about users' subjective attitudes contained in internet UGC (User Generated Content) is valuable for user behaviour analysis and user demand analysis. This paper designs WSAM, an internet UGC text subjective attitude analysis system based on natural language understanding, which can mine both the objects of attention and the subjective components contained in users' subjective attitudes. The paper analyses the UGC phenomenon on the internet and the reasons it arises, and identifies four main types of user subjective attitude in text UGC. The mining task is converted into recognizing the object of attention in a sentence and determining its subjective components; the algorithm combines lexical and structural features and uses a maximum entropy classifier to mine users' subjective attitudes. Experiments confirm that the algorithm adopted by WSAM performs well and that the system can easily be extended to related applications such as opinion mining, with good results.
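
    WSAM's maximum-entropy classifier is described only at a high level. For a binary subjective/objective decision over indicator features, the maximum-entropy model reduces to logistic regression, which a few lines of Python can sketch; the vocabulary, training sentences and hyperparameters below are invented for illustration:

```python
import math

VOCAB = ["love", "terrible", "awful", "weighs", "grams"]  # illustrative lexical features

def featurize(sentence):
    """Binary indicator features over a toy vocabulary."""
    toks = sentence.lower().split()
    return [1.0 if v in toks else 0.0 for v in VOCAB]

def train(X, y, epochs=200, lr=0.5):
    """Binary maximum-entropy model = logistic regression, trained by SGD."""
    w = [0.0] * len(VOCAB)
    for _ in range(epochs):
        for x, t in zip(X, y):
            p = 1.0 / (1.0 + math.exp(-sum(wi * xi for wi, xi in zip(w, x))))
            w = [wi + lr * (t - p) * xi for wi, xi in zip(w, x)]
    return w

def predict(w, sentence):
    return 1 if sum(wi * xi for wi, xi in zip(w, featurize(sentence))) > 0 else 0

# Toy corpus: 1 = subjective attitude, 0 = objective statement
docs = ["i love this phone", "terrible awful service",
        "the phone weighs 170 grams", "the box contains a charger"]
labels = [1, 1, 0, 0]
w = train([featurize(d) for d in docs], labels)
print(predict(w, "love the new screen"))   # → 1 (subjective)
print(predict(w, "it weighs 200 grams"))   # → 0 (objective)
```

    A production system would of course use the multi-class maximum-entropy formulation with the richer lexical and structural feature classes the abstract mentions.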

  3. Text Mining Method for Massive Network Text Based on Correlation Mining

    Institute of Scientific and Technical Information of China (English)

    彭其华

    2013-01-01

    A method for mining massive network text based on correlation mining is studied. With the rapid development of computer and network technology, text on the network is growing massively; traditional network text mining is implemented by feature extraction, which works for small data volumes but cannot keep up with the rapid growth in information. A correlation-based method is therefore proposed: features are first extracted from the mass of text for initial classification and feature identification, then the correlation between the features of the various texts is computed, and texts are grouped according to the strength of that correlation, since correlation reflects the relationships between text features well. Finally, a set of randomly chosen popular network terms is used for testing; the results show that the algorithm adapts well to mining massive text and has good practical value.
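
    The abstract does not define its correlation measure; Pearson correlation between the term-frequency vectors of two features across documents is one common choice. A small sketch under that assumption (the frequency vectors are illustrative):

```python
def pearson(x, y):
    """Pearson correlation of two equal-length term-frequency vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

# Frequencies of two terms across the same four documents (illustrative):
tf_cloud     = [3, 0, 2, 0]
tf_computing = [2, 0, 3, 1]
print(round(pearson(tf_cloud, tf_computing), 2))  # strongly correlated features
```

    Feature pairs whose correlation exceeds a threshold would then be grouped into the same cluster during the final mining step.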

  4. Text-mining of PubMed abstracts by natural language processing to create a public knowledge base on molecular mechanisms of bacterial enteropathogens

    Directory of Open Access Journals (Sweden)

    Perna Nicole T

    2009-06-01

    Full Text Available Abstract Background The Enteropathogen Resource Integration Center (ERIC; http://www.ericbrc.org) has a goal of providing bioinformatics support for the scientific community researching enteropathogenic bacteria such as Escherichia coli and Salmonella spp. Rapid and accurate identification of experimental conclusions from the scientific literature is critical to support research in this field. Natural Language Processing (NLP), and in particular Information Extraction (IE) technology, can be a significant aid to this process. Description We have trained a powerful, state-of-the-art IE technology on a corpus of abstracts from the microbial literature in PubMed to automatically identify and categorize biologically relevant entities and predicative relations. These relations include: Genes/Gene Products and their Roles; Gene Mutations and the resulting Phenotypes; and Organisms and their associated Pathogenicity. Evaluations on blind datasets show an F-measure average of greater than 90% for entities (genes, operons, etc.) and over 70% for relations (gene/gene product to role, etc.). This IE capability, combined with text indexing and relational database technologies, constitutes the core of our recently deployed text mining application. Conclusion Our Text Mining application is available online on the ERIC website http://www.ericbrc.org/portal/eric/articles. The information retrieval interface displays a list of recently published enteropathogen literature abstracts, and also provides a search interface to execute custom queries by keyword, date range, etc. Upon selection, processed abstracts and the entities and relations extracted from them are retrieved from a relational database and marked up to highlight the entities and relations. The abstract also provides links from extracted genes and gene products to the ERIC Annotations database, thus providing access to comprehensive genomic annotations and adding value to both the text-mining and annotations
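
    The F-measures quoted above combine precision and recall in the standard way; for reference, the computation from entity-level counts (the counts here are illustrative, not the evaluation's actual numbers):

```python
def prf(tp, fp, fn):
    """Precision, recall and balanced F-measure from extraction counts."""
    precision = tp / (tp + fp)   # correct extractions / all extractions
    recall = tp / (tp + fn)      # correct extractions / all gold entities
    f = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f

# e.g. 90 correctly extracted entities, 8 spurious, 10 missed:
p, r, f = prf(90, 8, 10)
print(round(p, 3), round(r, 3), round(f, 3))  # → 0.918 0.9 0.909
```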

  5. Ranking Biomedical Annotations with Annotator’s Semantic Relevancy

    Directory of Open Access Journals (Sweden)

    Aihua Wu

    2014-01-01

    Full Text Available Biomedical annotation is a common and effective artifact for researchers to discuss, show opinions, and share discoveries. It has become increasingly popular in many online research communities and carries much useful information. Ranking biomedical annotations is a critical problem for data users trying to obtain information efficiently. As the annotator's knowledge about the annotated entity normally determines the quality of the annotations, we evaluate that knowledge, that is, the semantic relationship between annotator and entity, in two ways. The first is extracting relational information from credible websites by mining association rules between an annotator and a biomedical entity. The second is frequent pattern mining from historical annotations, which reveals common features of biomedical entities that an annotator can annotate with high quality. We propose a weighted and concept-extended RDF model to represent an annotator, a biomedical entity, and their background attributes, and merge information from the two ways as the context of an annotator. Based on that, we present a method to rank the annotations by evaluating their correctness according to users' votes and the semantic relevancy between the annotator and the annotated entity. The experimental results show that the approach is applicable and efficient even when the data set is large.

  6. Text Mining, Data Mining and Knowledge Management: Intelligent Information Processing in the 21st Century

    Institute of Scientific and Technical Information of China (English)

    韩客松; 王永成

    2001-01-01

    Based on an introduction to the concepts of data mining, text mining and knowledge management, this paper divides knowledge management into three phases from the viewpoint of technical development: knowledge repository, knowledge sharing and knowledge discovery. It analyses the key techniques and significance of knowledge discovery as the highest phase, and points out that knowledge discovery in text is an important direction for intelligent information processing in the new century.

  7. Analysis of the Regularity of Clinical Medication for Migraine with a Text Mining Approach

    Institute of Scientific and Technical Information of China (English)

    杨静; 蔡峰; 谭勇; 郑光; 李立; 姜淼; 吕爱平

    2013-01-01

    Objective To analyze the regularity of clinical medication in the treatment of migraine with a text mining approach. Methods The data set on migraine was downloaded from the Chinese BioMedical Literature Database (CBM). Rules of TCM patterns, symptoms, Chinese herbal medicines (CHM), Chinese patent medicines (CPM) and western medicines for migraine were mined out by a data slicing algorithm, combined with literature tracing and manual reading, and the results were demonstrated in frequency tables and two-dimensional networks. Results A total of 7,921 articles were retrieved. The main TCM syndrome classifications of migraine included liver-yang hyperactivity syndrome, stagnation of liver qi syndrome and qi deficiency with blood stasis syndrome. The core symptoms of migraine included headache, vomiting, nausea, vertigo and photophobia. Chinese herbal medicines for migraine included Chuanxiong, Tianma, Danshen, Chaihu, Danggui, Baishao and Baizhi; among Chinese patent medicines, Yangxueqingnao granule, Toutongning capsule and Zhengtian pill were used; in western medicine, flunarizine, nimodipine and aspirin were used frequently. For integrated TCM and western medicine treatment, the combination of Yangxueqingnao granule and nimodipine was most common. Conclusion The text mining approach provides a novel way to summarize treatment rules for migraine in both TCM and western medicine, and to some extent the results are significant for clinical practice.

  8. In search of new product ideas: Identifying ideas in online communities by machine learning and text mining

    DEFF Research Database (Denmark)

    Christensen, Kasper; Frederiksen, Lars; Nørskov, Sladjana;

    2016-01-01

    Online communities are attractive sources of ideas and knowledge relevant for new product development and innovation. However, making sense of the “big data” available in these communities is a complex analytical task. A systematic way of dealing with these data is needed to fully exploit … their potential for boosting companies’ innovation performance. We propose a method for analysing online community data with a special focus on identifying ideas. We employ a research design where, in a first step, two human raters classified 3000 texts extracted from an online community as to whether each text … with large amounts of unstructured text data. Also, it is interesting in its own right that machine learning can be used to detect ideas…

  9. Semantic Relation Annotation for Biomedical Text Mining Based on Recursive Directed Graph

    Institute of Scientific and Technical Information of China (English)

    陈波; 吕晨; 魏小梅

    2015-01-01

    This paper proposes a novel model, the feature-structure-based recursive directed graph, and applies it to describing the semantic relations of postpositive attributives in English biomedical text. Postpositive attributives are complex and variable in usage, falling into three main types — present participles, past participles and prepositional phrases — which pose many difficulties for automatic analysis. We summarize and annotate the semantic information of these three types. Compared with dependency structures, feature structures can be formalized as recursive directed graphs, and the annotation results show that recursive directed graphs are better suited to extracting the complex semantic relations encountered in biomedical text mining.

  10. Application of the Lanczos Bidiagonalization Algorithm in Text Mining

    Institute of Scientific and Technical Information of China (English)

    范伟鹏

    2012-01-01

    Text mining plays an important role in data mining, and classical text mining is based on latent semantic analysis. Traditionally, latent semantic analysis obtains its low-rank approximation through singular value decomposition. As is well known, a full singular value decomposition requires cubic time, so it is expensive, particularly when the matrix is large and sparse — and matrices built from text are typically large and sparse. To address this, the paper applies the Lanczos bidiagonalization algorithm and an extended Lanczos bidiagonalization algorithm; the analysis shows that both are efficient and effective methods for computing low-rank approximations of large sparse matrices.
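
    A pure-Python sketch of one Golub–Kahan (Lanczos) bidiagonalization pass, the kernel such algorithms build on: after k steps, A·V ≈ U·B with B bidiagonal, so a rank-k approximation comes from the small matrix B instead of a full SVD. Small dense lists stand in for the large sparse term-document matrices the paper targets, and no reorthogonalization is done, which real implementations need for numerical stability:

```python
def matvec(A, x):          # y = A x
    return [sum(a * b for a, b in zip(row, x)) for row in A]

def matvec_t(A, x):        # y = A^T x
    return [sum(A[i][j] * x[i] for i in range(len(A))) for j in range(len(A[0]))]

def norm(v):
    return sum(e * e for e in v) ** 0.5

def lanczos_bidiag(A, b, k, tol=1e-12):
    """k steps of Golub-Kahan bidiagonalization started from vector b.

    Returns alphas (diagonal of B), betas (subdiagonal), and the Lanczos
    vectors U, V. In exact arithmetic U and V are orthonormal.
    """
    u = [e / norm(b) for e in b]
    U, V, alphas, betas = [u], [], [], []
    v_prev, beta = [0.0] * len(A[0]), 0.0
    for _ in range(k):
        r = [a - beta * c for a, c in zip(matvec_t(A, u), v_prev)]
        alpha = norm(r)
        if alpha < tol:            # breakdown: invariant subspace reached
            break
        v = [e / alpha for e in r]
        p = [a - alpha * c for a, c in zip(matvec(A, v), u)]
        beta = norm(p)
        alphas.append(alpha); V.append(v); v_prev = v
        if beta < tol:
            break
        u = [e / beta for e in p]
        betas.append(beta); U.append(u)
    return alphas, betas, U, V

A = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
alphas, betas, U, V = lanczos_bidiag(A, [1.0, 0.0, 0.0], k=2)
dot = lambda x, y: sum(a * b for a, b in zip(x, y))
print(abs(dot(U[0], U[1])) < 1e-10, abs(dot(V[0], V[1])) < 1e-10)  # → True True
```

    Each step touches A only through matrix-vector products, which is exactly why the method suits large sparse matrices where forming a full SVD is infeasible.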

  11. Biomedical signal processing

    CERN Document Server

    Akay, Metin

    1994-01-01

    Sophisticated techniques for signal processing are now available to the biomedical specialist! Written in an easy-to-read, straightforward style, Biomedical Signal Processing presents techniques to eliminate background noise, enhance signal detection, and analyze computer data, making results easy to comprehend and apply. In addition to examining techniques for electrical signal analysis, filtering, and transforms, the author supplies an extensive appendix with several computer programs that demonstrate techniques presented in the text.

  12. Extraction of Epidemiologic Risk Factors Based on Text Mining

    Institute of Scientific and Technical Information of China (English)

    卢延鑫; 姚旭峰

    2013-01-01

    Objective To design a system that automatically extracts epidemiologic risk factors based on text mining techniques. Methods The system consists of a text mining engine subsystem and a rule-based information extraction subsystem. First, all noun phrases are identified by the text mining engine and their semantic and other information is collected; a rule-based text classifier then labels the epidemiologic risk factors. Results Evaluated against text annotated by an epidemiologist, the system achieves a best F-measure of 64.6% (precision 61.0%, recall 68.8%), which is better than related work, and some of the remaining errors are avoidable. Conclusions The text mining based method is very helpful for automatically extracting risk factor information from the epidemiologic literature.

  13. Simulation Research on Text Categorization Based on Data Mining

    Institute of Scientific and Technical Information of China (English)

    赖娟

    2011-01-01

    This paper studies the optimization of text classification. Text is semi-structured; the number of features often reaches tens of thousands, and the features are correlated and highly redundant, which degrades classification accuracy, so traditional methods struggle to achieve high precision. To improve the accuracy of automatic text categorization, a method based on data mining technology is proposed. Exploiting the insensitivity of support vector machines to feature correlation and sparseness and their strength on high-dimensional problems, the contribution of each word to classification is computed, words with similar contribution values are merged into a single feature item of the text vector, and a support vector machine then learns and classifies on these feature items. Tests on a text classification corpus show that the data mining based method not only speeds up text classification but also improves classification precision and recall.
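
    The paper's dimensionality-reduction step — score each word's contribution to the class decision, then merge words with similar scores into one feature — can be sketched as follows. The contribution score here is a simple class-frequency difference and the bin width is an assumption; the abstract does not give the paper's exact contribution formula:

```python
from collections import defaultdict

def contribution(docs, labels, vocab):
    """Word score = fraction of positive docs containing it minus fraction of negatives."""
    pos = [d for d, l in zip(docs, labels) if l == 1]
    neg = [d for d, l in zip(docs, labels) if l == 0]
    return {w: sum(w in d for d in pos) / len(pos) - sum(w in d for d in neg) / len(neg)
            for w in vocab}

def merge_features(scores, bin_width=0.25):
    """Words whose contribution falls in the same bin become one feature of the text vector."""
    groups = defaultdict(list)
    for w, s in sorted(scores.items()):
        groups[round(s / bin_width)].append(w)
    return list(groups.values())

docs = [{"good", "fast"}, {"good"}, {"slow", "bad"}, {"bad"}]  # toy token sets
labels = [1, 1, 0, 0]
scores = contribution(docs, labels, {"good", "fast", "slow", "bad"})
print(scores["good"], scores["bad"])   # → 1.0 -1.0
print(len(merge_features(scores)))     # → 4 distinct contribution bins
```

    The merged groups replace individual words as dimensions, shrinking the vector the SVM trains on.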

  14. LINNAEUS: A species name identification system for biomedical literature

    Directory of Open Access Journals (Sweden)

    Nenadic Goran

    2010-02-01

    Full Text Available Abstract Background The task of recognizing and identifying species names in biomedical literature has recently been regarded as critical for a number of applications in text and data mining, including gene name recognition, species-specific document retrieval, and semantic enrichment of biomedical articles. Results In this paper we describe an open-source species name recognition and normalization software system, LINNAEUS, and evaluate its performance relative to several automatically generated biomedical corpora, as well as a novel corpus of full-text documents manually annotated for species mentions. LINNAEUS uses a dictionary-based approach (implemented as an efficient deterministic finite-state automaton) to identify species names and a set of heuristics to resolve ambiguous mentions. When compared against our manually annotated corpus, LINNAEUS performs with 94% recall and 97% precision at the mention level, and 98% recall and 90% precision at the document level. Our system successfully solves the problem of disambiguating uncertain species mentions, with 97% of all mentions in PubMed Central full-text documents resolved to unambiguous NCBI taxonomy identifiers. Conclusions LINNAEUS is an open source, stand-alone software system capable of recognizing and normalizing species name mentions with speed and accuracy, and can therefore be integrated into a range of bioinformatics and text-mining applications. The software and manually annotated corpus can be downloaded freely at http://linnaeus.sourceforge.net/.
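
    The dictionary-plus-automaton idea is easy to sketch: a character trie scanned with greedy longest match plays the role of LINNAEUS's compiled deterministic automaton (the real system adds heuristics for ambiguous mentions; the names and NCBI taxonomy IDs below are illustrative entries, not its actual dictionary):

```python
def build_trie(entries):
    """Compile a name -> taxonomy-ID dictionary into a character trie."""
    root = {}
    for name, taxid in entries.items():
        node = root
        for ch in name.lower():
            node = node.setdefault(ch, {})
        node["$"] = taxid          # end-of-name marker
    return root

def tag(text, trie):
    """Scan text, returning (start, end, taxid) for each longest dictionary match."""
    mentions, low, i = [], text.lower(), 0
    while i < len(low):
        node, j, best = trie, i, None
        while j < len(low) and low[j] in node:
            node = node[low[j]]
            j += 1
            if "$" in node:
                best = (i, j, node["$"])   # remember the longest match so far
        if best:
            mentions.append(best)
            i = best[1]                    # resume scanning after the match
        else:
            i += 1
    return mentions

trie = build_trie({"escherichia coli": "562", "e. coli": "562", "homo sapiens": "9606"})
print(tag("Escherichia coli colonizes Homo sapiens.", trie))
```

    Case-insensitive matching and normalization to a single identifier per species ("562" for both surface forms of E. coli) mirror the mention-to-taxonomy-ID mapping the system performs.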

  15. Benefits of off-campus education for students in the health sciences: a text-mining analysis

    Directory of Open Access Journals (Sweden)

    Nakagawa Kazumasa

    2012-08-01

    Full Text Available Abstract Background In Japan, few community-based approaches have been adopted in health-care professional education, and the appropriate content for such approaches has not been clarified. In establishing community-based education for health-care professionals, clarification of its learning effects is required. A community-based educational program was started in 2009 in the health sciences course at Gunma University, and one of the main elements in this program is conducting classes outside school. The purpose of this study was to investigate using text-analysis methods how the off-campus program affects students. Methods In all, 116 self-assessment worksheets submitted by students after participating in the off-campus classes were decomposed into words. The extracted words were carefully selected from the perspective of contained meaning or content. With the selected terms, the relations to each word were analyzed by means of cluster analysis. Results Cluster analysis was used to select and divide 32 extracted words into four clusters: cluster 1—“actually/direct,” “learn/watch/hear,” “how,” “experience/participation,” “local residents,” “atmosphere in community-based clinical care settings,” “favorable,” “communication/conversation,” and “study”; cluster 2—“work of staff member” and “role”; cluster 3—“interaction/communication,” “understanding,” “feel,” “significant/important/necessity,” and “think”; and cluster 4—“community,” “confusing,” “enjoyable,” “proactive,” “knowledge,” “academic knowledge,” and “class.” Conclusions The students who participated in the program achieved different types of learning through the off-campus classes. They also had a positive impression of the community-based experience and interaction with the local residents, which is considered a favorable outcome. Off-campus programs could be a useful educational approach

  16. Study of Cloud Based ERP Services for Small and Medium Enterprises (Data is Processed by Text Mining Technique)

    Directory of Open Access Journals (Sweden)

    SHARMA, R.

    2014-06-01

    Full Text Available The purpose of this research paper is to explore the knowledge of existing studies related to current trends in cloud computing. The outcome of the research is presented as a diagram that simplifies the ERP integration process for the in-house and cloud eco-systems. It provides a conceptual view for new clients or entrepreneurs using ERP services and explains how to deal with the two deployment modes of ERP systems (cloud and in-house). It also suggests how to improve knowledge about ERP services and the implementation process for both modes, and recommends which ERP services can be outsourced over the cloud. Cloud ERP is a mix of standard ERP services along with cloud flexibility and the low cost of affording these services, a recent phenomenon in enterprise service offerings. For most entrepreneurs without an IT background it is an unclear and broad concept, since all the related research has been done within the last couple of years. Most cloud ERP vendors describe adoption of their products as a straightforward task, yet the process of selecting cloud ERP services and vendors is not clear. This research work draws up a framework for selecting non-core business processes from preferred ERP service partners. It also recommends which ERP services should be outsourced over the cloud first, and discusses the security issues that arise when data or information moves out of company premises into the cloud eco-system.

  17. Association Text Classification Based on Mining Significant Itemsets

    Institute of Scientific and Technical Information of China (English)

    蔡金凤; 白清源

    2011-01-01

    Text classification, an important basis of information retrieval and text mining, assigns documents to categories from a given category set, and is widely applied in natural language processing and understanding, information organization and management, information filtering, and other areas. Current approaches can be broadly divided into statistical methods, connectionist methods, and rule-based methods. The traditional associative classifier ARC-BC (associative rule-based classifier by category) uses the Apriori algorithm to mine frequently occurring feature items or itemsets, forms rules whose antecedents are these frequent itemsets and whose consequents are categories, and combines the rules into a classifier. When a test sample matches a rule antecedent, the rule's confidence is added to the counter of its class, and the sample is assigned to the class whose counter accumulates the highest confidence. However, in its classifier-construction stage ARC-BC considers only whether feature words are present and ignores text feature weights. To address this, an ISARC (ItemSet Significance-based ARC) algorithm that can improve the accuracy of associative text classification is proposed on the basis of ARC-BC: it uses feature-term weights to define k-itemset significance, generates association rules by mining significant itemsets, and takes the lift of a rule into account when classifying a text. Experimental results show that mining significant itemsets with ISARC improves the accuracy of associative text classification.
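
The ARC-BC scheme summarized above (mined itemsets as rule antecedents, classification by cumulative confidence) can be sketched on a toy corpus. The documents, the support threshold of 2, and the restriction to itemsets of size 1-2 are illustrative simplifications; a real implementation would run full Apriori, and ISARC would additionally weight itemsets by term significance.

```python
from collections import Counter
from itertools import combinations

# Toy labelled documents (term sets), a stand-in for a training corpus.
train = [
    ({"gene", "protein", "cell"}, "bio"),
    ({"gene", "cell"}, "bio"),
    ({"stock", "market"}, "finance"),
    ({"stock", "market", "bank"}, "finance"),
]

# Mine per-class frequent itemsets (support >= 2, sizes 1-2): the
# Apriori step of ARC-BC, radically simplified.
support = Counter()        # itemset -> count over all docs
class_support = Counter()  # (itemset, class) -> count
for terms, label in train:
    for k in (1, 2):
        for items in combinations(sorted(terms), k):
            support[items] += 1
            class_support[(items, label)] += 1

# Rule = (antecedent itemset, consequent class, confidence).
rules = []
for (items, label), n in class_support.items():
    if n >= 2:
        rules.append((set(items), label, n / support[items]))

def classify(doc_terms):
    # Accumulate the confidence of every matching rule per class
    # (ARC-BC style) and pick the class with the highest total.
    score = Counter()
    for items, label, conf in rules:
        if items <= doc_terms:
            score[label] += conf
    return score.most_common(1)[0][0] if score else None
```
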

  18. Applied Research on Text Data Mining Technology in Web Knowledge Bases

    Institute of Scientific and Technical Information of China (English)

    蔡立斌

    2012-01-01

    This article first describes the basic theory of text data mining and knowledge extraction, and then analyzes the characteristics of network information retrieval and mining, in particular text mining, Web data mining, and content-based data mining, along with the series of problems associated with them. On this basis, it analyzes the theory and technology required for designing and building a Web knowledge base and for text data mining and knowledge discovery, analyzes and designs the architecture and functional modules of a Web knowledge base system, and establishes a model of a Web knowledge base based on text data mining.

  19. A CTD-Pfizer collaboration: manual curation of 88,000 scientific articles text mined for drug-disease and drug-phenotype interactions.

    Science.gov (United States)

    Davis, Allan Peter; Wiegers, Thomas C; Roberts, Phoebe M; King, Benjamin L; Lay, Jean M; Lennon-Hopkins, Kelley; Sciaky, Daniela; Johnson, Robin; Keating, Heather; Greene, Nigel; Hernandez, Robert; McConnell, Kevin J; Enayetallah, Ahmed E; Mattingly, Carolyn J

    2013-01-01

    Improving the prediction of chemical toxicity is a goal common to both environmental health research and pharmaceutical drug development. To improve safety detection assays, it is critical to have a reference set of molecules with well-defined toxicity annotations for training and validation purposes. Here, we describe a collaboration between safety researchers at Pfizer and the research team at the Comparative Toxicogenomics Database (CTD) to text mine and manually review a collection of 88,629 articles relating over 1,200 pharmaceutical drugs to their potential involvement in cardiovascular, neurological, renal and hepatic toxicity. In 1 year, CTD biocurators curated 254,173 toxicogenomic interactions (152,173 chemical-disease, 58,572 chemical-gene, 5,345 gene-disease and 38,083 phenotype interactions). All chemical-gene-disease interactions are fully integrated with public CTD, and phenotype interactions can be downloaded. We describe Pfizer's text-mining process to collate the articles, and CTD's curation strategy, performance metrics, enhanced data content and new module to curate phenotype information. As well, we show how data integration can connect phenotypes to diseases. This curation can be leveraged for information about toxic endpoints important to drug safety and help develop testable hypotheses for drug-disease events. The availability of these detailed, contextualized, high-quality annotations curated from seven decades' worth of the scientific literature should help facilitate new mechanistic screening assays for pharmaceutical compound survival. This unique partnership demonstrates the importance of resource sharing and collaboration between public and private entities and underscores the complementary needs of the environmental health science and pharmaceutical communities. Database URL: http://ctdbase.org/

  20. Text Mining Analysis of Thrombin and Its Related Coding Genes

    Institute of Scientific and Technical Information of China (English)

    许航; 吴坚

    2012-01-01

    [Objective] To perform text mining analysis on thrombin and its related genes. [Method] PubMed was searched with the strategy "Thrombin" AND "gene", collecting the literature on this topic published since 1980. The retrieved records were processed with MMTx, and a Java program was designed to extract, from the processing results, concepts of the semantic types "Pathologic Function" and "Gene or Genome" to form a co-sentence matrix. Co-word analysis and cluster analysis (with the statistical software SPSS 16) were then applied to the text mining data to reflect the affinities between these concepts across the literature, analyze their actual connections and meanings, and explore the intrinsic links between these biological concepts. [Result] Through text mining, the relationships between thrombin-coding genes and the corresponding pathological functions were systematically identified and classified. [Conclusion] Literature reverse-tracing indicated that the text mining in this study was effective.

  1. Construction of an annotated corpus to support biomedical information extraction

    Directory of Open Access Journals (Sweden)

    McNaught John

    2009-10-01

    Full Text Available Abstract Background Information Extraction (IE) is a component of text mining that facilitates knowledge discovery by automatically locating instances of interesting biomedical events from huge document collections. As events are usually centred on verbs and nominalised verbs, understanding the syntactic and semantic behaviour of these words is highly important. Corpora annotated with information concerning this behaviour can constitute a valuable resource in the training of IE components and resources. Results We have defined a new scheme for annotating sentence-bound gene regulation events, centred on both verbs and nominalised verbs. For each event instance, all participants (arguments) in the same sentence are identified and assigned a semantic role from a rich set of 13 roles tailored to biomedical research articles, together with a biological concept type linked to the Gene Regulation Ontology. To our knowledge, our scheme is unique within the biomedical field in terms of the range of event arguments identified. Using the scheme, we have created the Gene Regulation Event Corpus (GREC), consisting of 240 MEDLINE abstracts, in which events relating to gene regulation and expression have been annotated by biologists. A novel method of evaluating various different facets of the annotation task showed that average inter-annotator agreement rates fall within the range of 66% - 90%. Conclusion The GREC is a unique resource within the biomedical field, in that it annotates not only core relationships between entities, but also a range of other important details about these relationships, e.g., location, temporal, manner and environmental conditions. As such, it is specifically designed to support bio-specific tool and resource development. It has already been used to acquire semantic frames for inclusion within the BioLexicon (a lexical, terminological resource to aid biomedical text mining). Initial experiments have also shown that the corpus may

  2. Data Mining Based on Text Intelligence Data

    Institute of Scientific and Technical Information of China (English)

    吕曹芳; 侯智斌

    2012-01-01

    This paper introduces data mining methods suitable for intelligence data in the military domain and establishes a processing approach for unstructured textual intelligence data. Combining the characteristics of military intelligence, it proposes a framework model for data mining in military intelligence and discusses methods for mining Chinese-language intelligence texts. The framework implements Chinese word segmentation, keyword extraction, word-frequency analysis, and association analysis on intelligence text data.
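
The keyword-extraction and word-frequency steps mentioned above are commonly implemented with TF-IDF scoring; the following is a stdlib-only sketch on invented toy reports, not the paper's own implementation (which would also require Chinese word segmentation).

```python
import math
from collections import Counter

# Toy corpus of intelligence-style reports (hypothetical text, already
# segmented into words).
docs = [
    "convoy moved north convoy sighted",
    "radar site active radar emissions detected",
    "convoy halted near bridge",
]
tokenized = [d.split() for d in docs]

# Document frequency: in how many reports does each term appear?
df = Counter()
for toks in tokenized:
    df.update(set(toks))
N = len(docs)

def keywords(toks, top=2):
    """Rank a report's terms by TF-IDF and return the top few."""
    tf = Counter(toks)
    scores = {w: (c / len(toks)) * math.log(N / df[w]) for w, c in tf.items()}
    return [w for w, _ in sorted(scores.items(), key=lambda kv: -kv[1])[:top]]
```

Terms that are frequent in one report but rare across the corpus (here, "radar" in the second report) score highest, which is the usual rationale for TF-IDF keyword extraction.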

  3. Benchmarking of the 2010 BioCreative Challenge III text-mining competition by the BioGRID and MINT interaction databases

    OpenAIRE

    Cesareni Gianni; Castagnoli Luisa; Iannuccelli Marta; Licata Luana; Briganti Leonardo; Perfetto Livia; Winter Andrew; Chatr-aryamontri Andrew; Tyers Mike

    2011-01-01

    Abstract Background The vast amount of data published in the primary biomedical literature represents a challenge for the automated extraction and codification of individual data elements. Biological databases that rely solely on manual extraction by expert curators are unable to comprehensively annotate the information dispersed across the entire biomedical literature. The development of efficient tools based on natural language processing (NLP) systems is essential for the selection of rele...

  4. Biomedical photonics handbook biomedical diagnostics

    CERN Document Server

    Vo-Dinh, Tuan

    2014-01-01

    Shaped by Quantum Theory, Technology, and the Genomics RevolutionThe integration of photonics, electronics, biomaterials, and nanotechnology holds great promise for the future of medicine. This topic has recently experienced an explosive growth due to the noninvasive or minimally invasive nature and the cost-effectiveness of photonic modalities in medical diagnostics and therapy. The second edition of the Biomedical Photonics Handbook presents fundamental developments as well as important applications of biomedical photonics of interest to scientists, engineers, manufacturers, teachers, studen

  5. An Electronic Medical Record Writing Assistance System Based on Text Mining

    Institute of Scientific and Technical Information of China (English)

    周民

    2014-01-01

    Electronic medical record systems are widely used in hospital management. Because of the particular demands of the profession, doctors cannot enter medical records as quickly as professional typists. This paper studies methods for helping doctors quickly enter medical record information and, based on text mining technology, proposes a writing assistance system for electronic medical records. The system applies data mining techniques to the information commonly used in medical records, constructs different lexicons for different types of records, and uses Pinyin-initial abbreviations in place of full Chinese character input to speed up record entry.
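
The Pinyin-initial lookup described above can be sketched as a prefix index from initials to candidate phrases. The entries below are illustrative, not a real hospital lexicon, and a real system would derive the initials from a pronunciation table per record-type lexicon.

```python
from collections import defaultdict

# Toy (phrase, Pinyin-initials) pairs, e.g. 冠心病 = guan xin bing -> "gxb".
entries = [
    ("冠心病", "gxb"),
    ("高血压", "gxy"),
    ("高血糖", "gxt"),
    ("糖尿病", "tnb"),
]

# Prefix index: typing the first letters of each syllable retrieves
# candidate phrases, the entry speed-up the abstract describes.
index = defaultdict(list)
for phrase, initials in entries:
    for i in range(1, len(initials) + 1):
        index[initials[:i]].append(phrase)

def suggest(typed):
    """Return candidate phrases for a typed Pinyin-initial prefix."""
    return index.get(typed.lower(), [])
```
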

  6. CGMIM: Automated text-mining of Online Mendelian Inheritance in Man (OMIM) to identify genetically-associated cancers and candidate genes

    Directory of Open Access Journals (Sweden)

    Jones Steven

    2005-03-01

    Full Text Available Abstract Background Online Mendelian Inheritance in Man (OMIM) is a computerized database of information about genes and heritable traits in human populations, based on information reported in the scientific literature. Our objective was to establish an automated text-mining system for OMIM that will identify genetically-related cancers and cancer-related genes. We developed the computer program CGMIM to search for entries in OMIM that are related to one or more cancer types. We performed manual searches of OMIM to verify the program results. Results In the OMIM database on September 30, 2004, CGMIM identified 1943 genes related to cancer. BRCA2 (OMIM *600185), BRAF (OMIM *164757) and CDKN2A (OMIM *600160) were each related to 14 types of cancer. There were 45 genes related to cancer of the esophagus, 121 genes related to cancer of the stomach, and 21 genes related to both. If these cancers were genetically unrelated, fewer than three gene entries in OMIM would be expected to mention both; the more than seven-fold discrepancy suggests cancers of the esophagus and stomach are more genetically related than the current literature suggests. Conclusion CGMIM identifies genetically-related cancers and cancer-related genes. Cancers with shared genetic etiology are anticipated to lead to further etiologic hypotheses and advances regarding environmental agents. CGMIM results are posted monthly and the source code can be obtained free of charge from the BC Cancer Research Centre website http://www.bccrc.ca/ccr/CGMIM.

  7. FROM DATA MINING TO BEHAVIOR MINING

    OpenAIRE

    ZHENGXIN CHEN

    2006-01-01

    Knowledge economy requires data mining be more goal-oriented so that more tangible results can be produced. This requirement implies that the semantics of the data should be incorporated into the mining process. Data mining is ready to deal with this challenge because recent developments in data mining have shown an increasing interest on mining of complex data (as exemplified by graph mining, text mining, etc.). By incorporating the relationships of the data along with the data itself (rathe...

  8. Constructing a semantic predication gold standard from the biomedical literature

    Directory of Open Access Journals (Sweden)

    Kilicoglu Halil

    2011-12-01

    Full Text Available Abstract Background Semantic relations increasingly underpin biomedical text mining and knowledge discovery applications. The success of such practical applications crucially depends on the quality of extracted relations, which can be assessed against a gold standard reference. Most such references in biomedical text mining focus on narrow subdomains and adopt different semantic representations, rendering them difficult to use for benchmarking independently developed relation extraction systems. In this article, we present a multi-phase gold standard annotation study, in which we annotated 500 sentences randomly selected from MEDLINE abstracts on a wide range of biomedical topics with 1371 semantic predications. The UMLS Metathesaurus served as the main source for conceptual information and the UMLS Semantic Network for relational information. We measured interannotator agreement and analyzed the annotations closely to identify some of the challenges in annotating biomedical text with relations based on an ontology or a terminology. Results We obtain fair to moderate interannotator agreement in the practice phase (0.378-0.475). With improved guidelines and additional semantic equivalence criteria, the agreement increases by 12% (0.415 to 0.536) in the main annotation phase. In addition, we find that agreement increases to 0.688 when the agreement calculation is limited to those predications that are based only on the explicitly provided UMLS concepts and relations. Conclusions While interannotator agreement in the practice phase confirms that conceptual annotation is a challenging task, the increasing agreement in the main annotation phase points out that an acceptable level of agreement can be achieved in multiple iterations, by setting stricter guidelines and establishing semantic equivalence criteria. Mapping text to ontological concepts emerges as the main challenge in conceptual annotation. Annotating predications involving biomolecular
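
Inter-annotator agreement figures of the kind reported above are commonly computed with chance-corrected statistics; the snippet below is a generic Cohen's kappa sketch on invented labels, not the study's own agreement measure.

```python
from collections import Counter

def cohens_kappa(a, b):
    """Cohen's kappa for two annotators' label sequences of equal length:
    observed agreement corrected for the agreement expected by chance."""
    n = len(a)
    po = sum(x == y for x, y in zip(a, b)) / n          # observed agreement
    ca, cb = Counter(a), Counter(b)
    pe = sum(ca[k] * cb[k] for k in ca) / (n * n)        # chance agreement
    return (po - pe) / (1 - pe)

# Toy example: two annotators label five items, disagreeing on the last.
k = cohens_kappa(list("AABBB"), list("AABBA"))
```
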

  9. Problems in using p-curve analysis and text-mining to detect rate of p-hacking and evidential value.

    Science.gov (United States)

    Bishop, Dorothy V M; Thompson, Paul A

    2016-01-01

    Background. The p-curve is a plot of the distribution of p-values reported in a set of scientific studies. Comparisons between ranges of p-values have been used to evaluate fields of research in terms of the extent to which studies have genuine evidential value, and the extent to which they suffer from bias in the selection of variables and analyses for publication, p-hacking. Methods. p-hacking can take various forms. Here we used R code to simulate the use of ghost variables, where an experimenter gathers data on several dependent variables but reports only those with statistically significant effects. We also examined a text-mined dataset used by Head et al. (2015) and assessed its suitability for investigating p-hacking. Results. We show that when there is ghost p-hacking, the shape of the p-curve depends on whether dependent variables are intercorrelated. For uncorrelated variables, simulated p-hacked data do not give the "p-hacking bump" just below .05 that is regarded as evidence of p-hacking, though there is a negative skew when simulated variables are inter-correlated. The way p-curves vary according to features of underlying data poses problems when automated text mining is used to detect p-values in heterogeneous sets of published papers. Conclusions. The absence of a bump in the p-curve is not indicative of lack of p-hacking. Furthermore, while studies with evidential value will usually generate a right-skewed p-curve, we cannot treat a right-skewed p-curve as an indicator of the extent of evidential value, unless we have a model specific to the type of p-values entered into the analysis. We conclude that it is not feasible to use the p-curve to estimate the extent of p-hacking and evidential value unless there is considerable control over the type of data entered into the analysis. In particular, p-hacking with ghost variables is likely to be missed.
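
The ghost-variable simulation described above used R; a minimal Python sketch of the same idea (with invented parameters: five uncorrelated null dependent variables and 50,000 simulated experiments) shows the reported minimum p-values piling up near zero rather than forming a bump just below .05.

```python
import random

random.seed(1)

# Ghost-variable p-hacking under the null hypothesis: each simulated
# experimenter measures k independent dependent variables (each p-value
# is Uniform(0,1) under H0) and reports only the smallest p, and only
# when it clears the .05 threshold.
def ghost_p(k=5):
    return min(random.random() for _ in range(k))

reported = [p for p in (ghost_p() for _ in range(50000)) if p < 0.05]

# Bin the reported p-values into a p-curve over [0, .05).
bins = [0.0, 0.01, 0.02, 0.03, 0.04, 0.05]
counts = [sum(lo <= p < hi for p in reported) for lo, hi in zip(bins, bins[1:])]
```

With uncorrelated ghost variables the counts decrease from the lowest bin to the bin just below .05, i.e. there is no "p-hacking bump", which is the paper's point about this form of p-hacking going undetected.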

  11. A CONDITIONAL RANDOM FIELDS APPROACH TO BIOMEDICAL NAMED ENTITY RECOGNITION

    Institute of Scientific and Technical Information of China (English)

    2007-01-01

    Named entity recognition is a fundamental task in biomedical data mining. In this letter, a named entity recognition system for biomedical texts based on CRFs (Conditional Random Fields) is presented. The system makes extensive use of a diverse set of features, including local features, full text features and external resource features. All features incorporated in this system are described in detail, and the impacts of different feature sets on the performance of the system are evaluated. In order to improve the performance of the system, post-processing modules are exploited to deal with abbreviation phenomena, cascaded named entities, and boundary-error identification. Evaluation showed that feature selection has an important impact on system performance, and that the post-processing modules contribute substantially to achieving better results.
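
As an illustration of the "local features" mentioned above, a token-level feature extractor for biomedical NER might look like the following. The feature set and example sentence are invented, and the actual system additionally uses full-text and external-resource features.

```python
# Illustrative orthographic feature extractor of the kind typically fed
# to a CRF for biomedical named entity recognition.
def token_features(tokens, i):
    w = tokens[i]
    return {
        "word": w.lower(),
        "is_capitalized": w[:1].isupper(),
        "has_digit": any(c.isdigit() for c in w),   # e.g. "IL-2"
        "has_hyphen": "-" in w,                      # common in gene names
        "suffix3": w[-3:].lower(),
        "prev": tokens[i - 1].lower() if i > 0 else "<s>",
        "next": tokens[i + 1].lower() if i < len(tokens) - 1 else "</s>",
    }

sent = "IL-2 activates NF-kappaB in T cells".split()
f = token_features(sent, 0)
```

Each token is mapped to a feature dictionary; a CRF then scores label sequences over these per-token features jointly rather than classifying tokens independently.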

  12. Research on Text Mining Based on Background Knowledge and Active Learning

    Institute of Scientific and Technical Information of China (English)

    符保龙

    2013-01-01

    Achieving good text classification and text mining results often requires a large amount of labelled data, but labelling data is complex to carry out and expensive. This paper therefore introduces unlabelled data into text classification and text mining within a support vector machine-based classification framework, implemented through two methods: one based on background knowledge and one based on active learning. Experimental results show that text mining based on background knowledge delivers excellent performance when the baseline classifier is strong, while text mining based on active learning improves text mining performance even in the general case.

  13. Implantable CMOS Biomedical Devices

    Directory of Open Access Journals (Sweden)

    Toshihiko Noda

    2009-11-01

    Full Text Available The results of recent research on our implantable CMOS biomedical devices are reviewed. Topics include retinal prosthesis devices and deep-brain implantation devices for small animals. Fundamental device structures and characteristics as well as in vivo experiments are presented.

  14. Biomedical Engineering

    CERN Document Server

    Suh, Sang C; Tanik, Murat M

    2011-01-01

    Biomedical Engineering: Health Care Systems, Technology and Techniques is an edited volume with contributions from world experts. It provides readers with unique contributions related to current research and future healthcare systems. Practitioners and researchers focused on computer science, bioinformatics, engineering and medicine will find this book a valuable reference.

  15. [Studies Using Text Mining on the Differences in Learning Effects between the KJ and World Café Method as Learning Strategies].

    Science.gov (United States)

    Yasuhara, Tomohisa; Sone, Tomomichi; Konishi, Motomi; Kushihata, Taro; Nishikawa, Tomoe; Yamamoto, Yumi; Kurio, Wasako; Kohno, Takeyuki

    2015-01-01

    The KJ method (named for developer Jiro Kawakita; also known as affinity diagramming) is widely used in participatory learning as a means to collect and organize information. In addition, the World Café (WC) has recently become popular. However, differences in the information obtained using each method have not been studied comprehensively. To determine the appropriate information selection criteria, we analyzed differences in the information generated by the WC and KJ methods. Two groups engaged in sessions to collect and organize information using either the WC or KJ method and small group discussions were held to create "proposals to improve first-year education". Both groups answered two pre- and post- session questionnaires that asked for free descriptions. Key words were extracted from the results of the two questionnaires and categorized using text mining. In the responses to questionnaire 1, which was directly related to the session theme, a significant increase in the number of key words was observed in the WC group (p=0.0050, Fisher's exact test). However, there was no significant increase in the number of key words in the responses to questionnaire 2, which was not directly related to the session theme (p=0.8347, Fisher's exact test). In the KJ method, participants extracted the most notable issues and progressed to a detailed discussion, whereas in the WC method, various information and problems were spread among the participants. The choice between the WC and KJ method should be made to reflect the educational objective and desired direction of discussion.
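
The keyword-count comparisons above rely on Fisher's exact test; a one-sided version for a 2x2 table follows directly from the hypergeometric distribution, as in this stdlib-only sketch (the study itself presumably used a statistics package).

```python
from math import comb  # Python 3.8+

def fisher_one_sided(a, b, c, d):
    """One-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]]:
    the probability of observing a count of a or greater in the top-left
    cell with all margins held fixed (upper hypergeometric tail)."""
    n = a + b + c + d
    row1, col1 = a + b, a + c
    def p_table(x):
        # Probability of exactly x successes in the top-left cell.
        return comb(col1, x) * comb(n - col1, row1 - x) / comb(n, row1)
    return sum(p_table(x) for x in range(a, min(row1, col1) + 1))
```

For the classic table [[3, 1], [1, 3]] this gives 17/70, roughly 0.243, matching the standard hand calculation.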

  16. Networks Models of Actin Dynamics during Spermatozoa Postejaculatory Life: A Comparison among Human-Made and Text Mining-Based Models.

    Science.gov (United States)

    Bernabò, Nicola; Ordinelli, Alessandra; Ramal Sanchez, Marina; Mattioli, Mauro; Barboni, Barbara

    2016-01-01

    Here we realized a network-based model representing the process of actin remodelling that occurs during the acquisition of fertilizing ability of human spermatozoa (HumanMade_ActinSpermNetwork, HM_ASN). Then, we compared it with the networks provided by two different text mining tools: Agilent Literature Search (ALS) and PESCADOR. As a reference, we used the data from the online repository Kyoto Encyclopaedia of Genes and Genomes (KEGG), referred to the actin dynamics in a more general biological context. We found that HM_ASN and the networks from KEGG data shared the same scale-free topology following the Barabasi-Albert model, thus suggesting that the information is spread within the network quickly and efficiently. On the contrary, the networks obtained by ALS and PESCADOR have a scale-free hierarchical architecture, which implies a different pattern of information transmission. Also, the hubs identified within the networks are different: HM_ASN and KEGG networks contain as hubs several molecules known to be involved in actin signalling; ALS was unable to find other hubs than "actin," whereas PESCADOR gave some nonspecific results. This seems to suggest that human-made information retrieval, in the case of a specific event such as actin dynamics in human spermatozoa, could be a reliable strategy. PMID:27642606
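
The scale-free topology attributed above to the Barabasi-Albert model can be illustrated with a minimal preferential-attachment sketch (m = 1 and a toy size, chosen here for brevity; these are not the networks analysed in the paper).

```python
import random
from collections import Counter

random.seed(0)

# Barabasi-Albert preferential attachment with m = 1: each new node
# attaches to one existing node chosen proportionally to its degree,
# which produces the scale-free, hub-dominated topology.
def barabasi_albert(n):
    edges = [(0, 1)]
    pool = [0, 1]  # each node appears once per unit of degree
    for new_node in range(2, n):
        target = random.choice(pool)  # degree-proportional choice
        edges.append((new_node, target))
        pool += [new_node, target]
    return edges

edges = barabasi_albert(500)
degree = Counter(v for e in edges for v in e)
mean_degree = 2 * len(edges) / len(degree)
```

The degree distribution is heavy-tailed: a few hubs accumulate far more links than the average node, which is what "scale-free" refers to in the comparison above.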

  17. A Novel Scheme for Mining Implicit Knowledge Fragments from Abstracts

    Institute of Scientific and Technical Information of China (English)

    戴璐; 丁立新; 薛兵

    2013-01-01

    This paper extracts keywords that appear with high frequency in the literature, locates them within abstracts through an inverted index, mines the implicit semantic phrases that form fixed collocations with these keywords in the abstracts, and uses bibliometric text analysis to track how the phrases have changed dynamically in recent years. An association network is then built from a correlation-influence matrix to analyze the associations between the semantic phrases. Experimental results show that the implicit knowledge fragments in literature abstracts better reflect the development trends of a discipline.
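
The inverted-index step described above (locating keywords inside abstracts before mining adjacent phrases) can be sketched as follows, with invented toy abstracts.

```python
from collections import defaultdict

# Toy abstracts standing in for a bibliographic corpus.
abstracts = [
    "text mining of biomedical abstracts",
    "frequent keywords reveal research trends",
    "mining frequent phrases from abstracts",
]

# Inverted index: term -> list of (doc_id, token_position), so a keyword
# can be located in every abstract without rescanning the corpus.
index = defaultdict(list)
for doc_id, text in enumerate(abstracts):
    for pos, term in enumerate(text.split()):
        index[term].append((doc_id, pos))

def locate(term):
    """Return all (doc_id, position) postings for a keyword."""
    return index.get(term, [])
```

Once a keyword is positioned, the tokens at neighbouring positions can be inspected for recurring collocations, which is the phrase-mining step the abstract describes.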

  18. Next-generation text-mining mediated generation of chemical response-specific gene sets for interpretation of gene expression data

    Directory of Open Access Journals (Sweden)

    Hettne Kristina M

    2013-01-01

    Full Text Available Abstract Background Availability of chemical response-specific lists of genes (gene sets) for pharmacological and/or toxic effect prediction for compounds is limited. We hypothesize that more gene sets can be created by next-generation text mining (next-gen TM), and that these can be used with gene set analysis (GSA) methods for chemical treatment identification, for pharmacological mechanism elucidation, and for comparing compound toxicity profiles. Methods We created 30,211 chemical response-specific gene sets for human and mouse by next-gen TM, and derived 1,189 (human) and 588 (mouse) gene sets from the Comparative Toxicogenomics Database (CTD). We tested these gene sets for significant differential expression (SDE; false discovery rate-corrected p-values) in gene expression (GE) data sets. Results Next-gen TM-derived gene sets matching the chemical treatment were significantly altered in three GE data sets, and the corresponding CTD-derived gene sets were significantly altered in five GE data sets. Six next-gen TM-derived and four CTD-derived fibrate gene sets were significantly altered in the PPARA knock-out GE dataset. None of the fibrate signatures in cMap scored as significant against the PPARA GE signature. 33 environmental toxicant gene sets were significantly altered in the triazole GE data sets. 21 of these toxicants had a similar toxicity pattern as the triazoles. We confirmed embryotoxic effects, and discriminated triazoles from other chemicals. Conclusions Gene set analysis with next-gen TM-derived chemical response-specific gene sets is a scalable method for identifying similarities in gene responses to other chemicals, from which one may infer potential mode of action and/or toxic effect.

  19. Progress of Text Mining Applications in the Humanities and Social Sciences

    Institute of Scientific and Technical Information of China (English)

    郭金龙; 许鑫; 陆宇杰

    2012-01-01

    As an effective tool for handling massive data, text mining has received widespread attention in the humanities and social sciences in recent years. This paper first summarizes the relevant text mining techniques and the current state of research, then introduces specific applications of frequently used text mining methods, such as information extraction, text classification, text clustering, and association rules and pattern discovery, in humanities and social science research, so as to expand the application domain of text mining and provide new ideas for methodological innovation in humanities and social science research.

  20. Using a search engine-based mutually reinforcing approach to assess the semantic relatedness of biomedical terms.

    Directory of Open Access Journals (Sweden)

    Yi-Yu Hsu

    Full Text Available BACKGROUND: Determining the semantic relatedness of two biomedical terms is an important task for many text-mining applications in the biomedical field. Previous studies, such as those using ontology-based and corpus-based approaches, measured semantic relatedness by using information from the structure of biomedical literature, but these methods are limited by the small size of training resources. To increase the size of training datasets, the outputs of search engines have been used extensively to analyze the lexical patterns of biomedical terms. METHODOLOGY/PRINCIPAL FINDINGS: In this work, we propose the Mutually Reinforcing Lexical Pattern Ranking (ReLPR) algorithm for learning and exploring the lexical patterns of synonym pairs in biomedical text. ReLPR employs lexical patterns and their pattern containers to assess the semantic relatedness of biomedical terms. By combining sentence structures and the linking activities between containers and lexical patterns, our algorithm can explore the correlation between two biomedical terms. CONCLUSIONS/SIGNIFICANCE: The average correlation coefficient of the ReLPR algorithm was 0.82 for various datasets. The results of the ReLPR algorithm were significantly superior to those of previous methods.
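ReLPR itself is not reproduced here, but the general idea of scoring term relatedness from search-engine counts can be sketched with the classic Normalized Google Distance; the hit counts below are hypothetical, not taken from the study.

```python
from math import log

def ngd(f_x, f_y, f_xy, n_pages):
    """Normalized Google Distance from search-engine hit counts:
    0 means the terms always co-occur; larger values mean less related."""
    lx, ly, lxy = log(f_x), log(f_y), log(f_xy)
    return (max(lx, ly) - lxy) / (log(n_pages) - min(lx, ly))

# Hypothetical hit counts for two biomedical terms and an indexed-page total.
d = ngd(f_x=120_000, f_y=80_000, f_xy=25_000, n_pages=10_000_000_000)
```

A small distance like this one indicates that the two terms co-occur far more often than chance, which corpus-based relatedness measures exploit.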

  1. Biomedical Materials

    Institute of Scientific and Technical Information of China (English)

    CHANG Jiang; ZHOU Yanling

    2011-01-01

    Biomedical materials, biomaterials for short, are regarded as "any substance or combination of substances, synthetic or natural in origin, which can be used for any period of time, as a whole or as part of a system which treats, augments, or replaces any tissue, organ or function of the body" (Vonrecum & Laberge, 1995). Biomaterials can save lives, relieve suffering and enhance the quality of life for human beings.

  2. Effective use of Latent Semantic Indexing and Computational Linguistics in Biological and Biomedical Applications

    Directory of Open Access Journals (Sweden)

    Hongyu eChen

    2013-01-01

    Full Text Available Text mining is rapidly becoming an essential technique for the annotation and analysis of large biological data sets. Biomedical literature currently increases at a rate of several thousand papers per week, making automated information retrieval methods the only feasible method of managing this expanding corpus. With the increasing prevalence of open-access journals and constant growth of publicly-available repositories of biomedical literature, literature mining has become much more effective with respect to the extraction of biomedically-relevant data. In recent years, text mining of popular databases such as MEDLINE has evolved from basic term-searches to more sophisticated natural language processing techniques, indexing and retrieval methods, structural analysis and integration of literature with associated metadata. In this review, we will focus on Latent Semantic Indexing (LSI), a computational linguistics technique increasingly used for a variety of biological purposes. It is noted for its ability to consistently outperform benchmark Boolean text searches and co-occurrence models at information retrieval and its power to extract indirect relationships within a data set. LSI has been used successfully to formulate new hypotheses, generate novel connections from existing data, and validate empirical data.
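As a rough sketch of the LSI technique this record describes (not the authors' implementation), a truncated SVD of a toy term-document matrix places documents that share vocabulary close together in the latent space, even across different surface terms:

```python
import numpy as np

# Toy term-document matrix (terms x documents); counts are invented.
A = np.array([
    [2, 0, 1, 0],   # "gene"
    [1, 0, 2, 0],   # "protein"
    [0, 3, 0, 1],   # "patient"
    [0, 1, 0, 2],   # "trial"
], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2                                    # latent dimensions kept
docs_k = (np.diag(s[:k]) @ Vt[:k]).T     # documents in latent space

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Documents 0 and 2 share vocabulary, so they land close together;
# document 1 uses disjoint terms and ends up orthogonal to document 0.
sim_02 = cos(docs_k[0], docs_k[2])
sim_01 = cos(docs_k[0], docs_k[1])
```

The same projection is what lets LSI surface indirect relationships: two documents that never share a term can still be near each other if they share latent dimensions.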

  3. MONK Project and the Reference for Text Mining Applied to the Humanities in China%MONK项目及其对我国人文领域文本挖掘的借鉴

    Institute of Scientific and Technical Information of China (English)

    许鑫; 郭金龙; 蔚海燕

    2012-01-01

    MONK is a cross-disciplinary text mining project in the humanities undertaken by several universities and research institutes from America and Canada. This paper mainly discusses the text mining process of MONK as well as the relevant tools, techniques and algorithms. Two case studies based on MONK are introduced to give details about the application of text mining to the humanities. Finally the authors summarize some unique applications of text mining in the humanities and discuss what we can learn from the MONK project.%针对美国和加拿大等高校共同承担的大型跨学科人文文本挖掘项目MONK,详细介绍其文本挖掘流程及相应的工具、技术和算法,并具体探讨利用MONK提供的工具进行文学文本挖掘研究的应用实例。最后总结人文领域文本挖掘方法的几类应用,提出该项目对我国人文领域应用文本挖掘的启示。

  4. Effective use of latent semantic indexing and computational linguistics in biological and biomedical applications.

    Science.gov (United States)

    Chen, Hongyu; Martin, Bronwen; Daimon, Caitlin M; Maudsley, Stuart

    2013-01-01

    Text mining is rapidly becoming an essential technique for the annotation and analysis of large biological data sets. Biomedical literature currently increases at a rate of several thousand papers per week, making automated information retrieval methods the only feasible method of managing this expanding corpus. With the increasing prevalence of open-access journals and constant growth of publicly-available repositories of biomedical literature, literature mining has become much more effective with respect to the extraction of biomedically-relevant data. In recent years, text mining of popular databases such as MEDLINE has evolved from basic term-searches to more sophisticated natural language processing techniques, indexing and retrieval methods, structural analysis and integration of literature with associated metadata. In this review, we will focus on Latent Semantic Indexing (LSI), a computational linguistics technique increasingly used for a variety of biological purposes. It is noted for its ability to consistently outperform benchmark Boolean text searches and co-occurrence models at information retrieval and its power to extract indirect relationships within a data set. LSI has been used successfully to formulate new hypotheses, generate novel connections from existing data, and validate empirical data.

  5. 融合语义关联挖掘的文本情感分析算法研究%Text Sentiment Analysis Algorithm Combining with Semantic Association Mining

    Institute of Scientific and Technical Information of China (English)

    明均仁

    2012-01-01

    Facing the increasingly rich textual sentiment information resources in the network, utilizing association mining technology to mine and analyse them automatically and intelligently to obtain user sentiment knowledge at the semantic level has important potential value for enterprises in formulating competitive strategies and maintaining competitive advantage. This paper integrates association mining technology into text sentiment analysis, and researches and designs a text sentiment analysis algorithm combining semantic association mining to realize sentiment analysis and user sentiment knowledge mining at the semantic level. Experiment results demonstrate that this algorithm achieves the expected effect: it dramatically improves the accuracy and efficiency of sentiment analysis and the depth and width of association mining.%面对网络中日益丰富的文本性情感信息资源,利用关联挖掘技术对其进行智能化的自动挖掘与分析,获取语义层面的用户情感知识,对于企业竞争策略的制定和竞争优势的保持具有重要的潜在价值。将关联挖掘技术融入文本情感分析之中,研究并设计一种融合语义关联挖掘的文本情感分析算法,实现语义层面的情感分析与用户情感知识挖掘。实验结果表明,该算法取得了很好的预期效果,显著提高了情感分析的准确率与效率以及关联挖掘的深度与广度。

  6. A realistic assessment of methods for extracting gene/protein interactions from free text

    Directory of Open Access Journals (Sweden)

    Shepherd Adrian J

    2009-07-01

    Full Text Available Abstract Background The automated extraction of gene and/or protein interactions from the literature is one of the most important targets of biomedical text mining research. In this paper we present a realistic evaluation of gene/protein interaction mining relevant to potential non-specialist users. Hence we have specifically avoided methods that are complex to install or require reimplementation, and we coupled our chosen extraction methods with a state-of-the-art biomedical named entity tagger. Results Our results show: that performance across different evaluation corpora is extremely variable; that the use of tagged (as opposed to gold standard) gene and protein names has a significant impact on performance, with a drop in F-score of over 20 percentage points being commonplace; and that a simple keyword-based benchmark algorithm when coupled with a named entity tagger outperforms two of the tools most widely used to extract gene/protein interactions. Conclusion In terms of availability, ease of use and performance, the potential non-specialist user community interested in automatically extracting gene and/or protein interactions from free text is poorly served by current tools and systems. The public release of extraction tools that are easy to install and use, and that achieve state-of-the-art levels of performance should be treated as a high priority by the biomedical text mining community.
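A keyword-plus-tagger baseline of the kind this abstract benchmarks against can be sketched as follows. The sentences, gene list, and keyword list are invented; in a real system the gene mentions would come from a named entity tagger rather than a fixed set.

```python
import re

# Hypothetical mini-corpus; gene names would normally come from a tagger.
SENTENCES = [
    "BRCA1 interacts with RAD51 during repair.",
    "TP53 was sequenced in all samples.",
    "MDM2 binds TP53 and inhibits its activity.",
]
GENES = {"BRCA1", "RAD51", "TP53", "MDM2"}
KEYWORDS = {"interacts", "binds", "activates", "inhibits", "phosphorylates"}

def extract_pairs(sentence):
    """Report a gene pair when two tagged genes co-occur in a sentence
    together with an interaction keyword."""
    tokens = re.findall(r"[A-Za-z0-9]+", sentence)
    genes = [t for t in tokens if t in GENES]
    if len(genes) >= 2 and any(t.lower() in KEYWORDS for t in tokens):
        return [(genes[0], genes[1])]
    return []

pairs = [p for s in SENTENCES for p in extract_pairs(s)]
```

Simple as it is, a co-occurrence-plus-keyword baseline like this is exactly what the paper found competitive with far heavier interaction-extraction tools once both were fed tagged entities.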

  7. Application of Text Mining Technology in the Traditional Chinese Medicine Literature Research%文本挖掘技术在中医药文献研究中的应用

    Institute of Scientific and Technical Information of China (English)

    郭洪涛

    2013-01-01

    目的:探讨文本挖掘技术在中医药文献研究中的应用成果.方法:对近年来文本挖掘技术应用于中医药研究的文献进行综述,总结文本挖掘技术在中医药中的应用成果.结果:文本挖掘技术能以线性和非线性方式解析数据,且能进行高层次的知识整合,又善于处理模糊和非量化数据.未来文本挖掘有可能整合中医药数据、蛋白质及代谢组学数据,分析组合中药活性成分,为新药发现和组合药物形成构建研发平台.结论:利用文本挖掘技术对中医药进行研究分析是一种很有前景的方法.%Objective: To investigate the application progress of text mining technology in traditional Chinese medicine literature research. Methods: Achievements of the application of text mining technology in TCM were summarized through a review of the literature on applying text mining technology to traditional Chinese medicine research in recent years. Results: Text mining technology can parse data in both linear and nonlinear manners, supports a high level of knowledge integration, and is good at dealing with fuzzy and non-quantitative data. In the future, text mining may integrate traditional Chinese medicine data with protein and metabolomics data and analyze combinations of active ingredients of Chinese medicines, building a research and development platform for new drug discovery and the formation of combination drugs. Conclusion: Using text mining technology for the research and analysis of traditional Chinese medicine is a promising method.

  8. Optical Polarization in Biomedical Applications

    CERN Document Server

    Tuchin, Valery V; Zimnyakov, Dmitry A

    2006-01-01

    Optical Polarization in Biomedical Applications introduces key developments in optical polarization methods for quantitative studies of tissues, while presenting the theory of polarization transfer in a random medium as a basis for the quantitative description of polarized light interaction with tissues. This theory uses the modified transfer equation for Stokes parameters and predicts the polarization structure of multiple scattered optical fields. The backscattering polarization matrices (Jones matrix and Mueller matrix) important for noninvasive medical diagnostics are introduced. The text also describes a number of diagnostic techniques such as CW polarization imaging and spectroscopy, polarization microscopy and cytometry. As a new tool for medical diagnosis, optical coherent polarization tomography is analyzed. The monograph also covers a range of biomedical applications, among them cataract and glaucoma diagnostics, glucose sensing, and the detection of bacteria.

  9. Toxicology of Biomedical Polymers

    Directory of Open Access Journals (Sweden)

    P. V. Vedanarayanan

    1987-04-01

    Full Text Available This paper deals with the various types of polymers used in the fabrication of medical devices, their diversity of applications, and the toxic hazards which may arise from their use. The potential toxicity of monomers and of the various additives used in the manufacture of biomedical polymers is discussed, along with hazards which may arise from the processing of devices, such as sterilization. The importance of quality control and stringent toxicity evaluation methods is emphasised, since in our country there are at present no regulations covering the manufacture and marketing of medical devices. Finally, the question of the general and subtle long-term systemic toxicity of biomedical polymers is brought to attention, with the suggestion that this question needs to be resolved permanently by appropriate studies.

  10. A Review of Technical Topic Analysis Based on Text Mining%基于文本挖掘的专利技术主题分析研究综述

    Institute of Scientific and Technical Information of China (English)

    胡阿沛; 张静; 雷孝平; 张晓宇

    2013-01-01

    为应对专利数量巨大和技术的日益复杂给专利技术主题分析带来的挑战,以及利用文本挖掘技术的专利技术主题分析近来成为研究热点。首先介绍文本挖掘的概念和其发展历史。其次,对目前基于文本挖掘的专利技术主题分析方法进行了归纳,包括主题词词频分析、共词分析、文本聚类分析和与引文聚类结合的分析方法,总结其常用的分析工具并介绍新的科学图谱分析软件---SciMAT。最后总结基于文本挖掘的专利技术主题分析方法的优点与不足,为其将来的研究提供建议。%To cope with the challenges that the huge number of patent documents and increasingly sophisticated technology present for patent technical topic analysis, technical topic analysis based on text mining has become a research focus in recent years. Firstly, the concept of text mining and its development history are introduced. Secondly, methods of analyzing patent technical topics based on text mining are summarized, including subject-word frequency analysis, co-word analysis, text clustering analysis and analysis combined with citation clustering. Commonly used analytical tools are summarized and a new science mapping analysis software tool, SciMAT, is introduced. Finally, the paper points out the advantages and deficiencies of technical topic analysis based on text mining and suggests directions for future research.

  11. The structural and content aspects of abstracts versus bodies of full text journal articles are different

    Directory of Open Access Journals (Sweden)

    Roeder Christophe

    2010-09-01

    Full Text Available Abstract Background An increase in work on the full text of journal articles and the growth of PubMedCentral have the opportunity to create a major paradigm shift in how biomedical text mining is done. However, until now there has been no comprehensive characterization of how the bodies of full text journal articles differ from the abstracts that until now have been the subject of most biomedical text mining research. Results We examined the structural and linguistic aspects of abstracts and bodies of full text articles, the performance of text mining tools on both, and the distribution of a variety of semantic classes of named entities between them. We found marked structural differences, with longer sentences in the article bodies and much heavier use of parenthesized material in the bodies than in the abstracts. We found content differences with respect to linguistic features. Three out of four of the linguistic features that we examined were statistically significantly differently distributed between the two genres. We also found content differences with respect to the distribution of semantic features. There were significantly different densities per thousand words for three out of four semantic classes, and clear differences in the extent to which they appeared in the two genres. With respect to the performance of text mining tools, we found that a mutation finder performed equally well in both genres, but that a wide variety of gene mention systems performed much worse on article bodies than they did on abstracts. POS tagging was also more accurate in abstracts than in article bodies. Conclusions Aspects of structure and content differ markedly between article abstracts and article bodies. A number of these differences may pose problems as the text mining field moves more into the area of processing full-text articles. However, these differences also present a number of opportunities for the extraction of data types, particularly that

  12. Centroid Based Text Clustering

    Directory of Open Access Journals (Sweden)

    Priti Maheshwari

    2010-09-01

    Full Text Available Web mining is a burgeoning new field that attempts to glean meaningful information from natural language text. Web mining refers generally to the process of extracting interesting information and knowledge from unstructured text. Text clustering is one of the important Web mining functionalities; it is the task in which texts are classified into groups of similar objects based on their contents. Current research in the area of Web mining tackles problems of text data representation, classification, clustering, information extraction, and the search for and modeling of hidden patterns. In this paper we propose that for mining large document collections it is necessary to pre-process the web documents and store the information in a data structure which is more appropriate for further processing than a plain web file. We developed a PHP and MySQL based utility to convert unstructured web documents into a structured tabular representation by preprocessing and indexing. We apply a centroid-based web clustering method on the preprocessed data, using three methods for clustering. Finally we propose a method that can increase accuracy based on the clustering of documents.
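A minimal version of the centroid-based clustering step (after the pre-processing the paper describes) might look like the sketch below; the documents and the seed choice are illustrative assumptions, not the paper's data or utility.

```python
import numpy as np

DOCS = [
    "web mining extracts knowledge from web text",
    "text clustering groups similar documents",
    "gene expression profiles in cancer tissue",
    "tumor tissue gene profiles and expression",
]

# Term-frequency matrix after trivial preprocessing (lowercase, whitespace split).
vocab = sorted({w for d in DOCS for w in d.split()})
X = np.array([[d.split().count(w) for w in vocab] for d in DOCS], dtype=float)
X /= np.linalg.norm(X, axis=1, keepdims=True)   # unit length -> cosine geometry

def centroid_cluster(X, seeds, iters=10):
    """Assign each document to its most similar centroid, then recompute
    each centroid as the mean of its members."""
    C = X[list(seeds)].copy()
    for _ in range(iters):
        labels = np.argmax(X @ C.T, axis=1)     # cosine similarity to centroids
        for j in range(len(C)):
            members = X[labels == j]
            if len(members):
                C[j] = members.mean(axis=0)
    return labels

labels = centroid_cluster(X, seeds=(0, 2))
```

On this toy corpus the two web-mining documents and the two genomics documents separate into the two clusters, which is the behaviour the centroid method is meant to deliver at scale.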

  13. Design and Implementation of Collection Western Biomedical Journal Full-text Database Based on ASP and SQL Server%基于ASP+SQLServer框架的馆藏西文生物医学期刊全文数据库的设计与实现

    Institute of Scientific and Technical Information of China (English)

    刘红芝; 周薇

    2012-01-01

    为了提高读者对西文生物医学期刊全文的获取效率.开发馆藏西文生物医学期刊全文数据库系统。简要论述该系统的需求分析、结构设计、数据字典及应用框架ASP+SQLServer的原理与方法,利用ASP+SQLServer集成框架开发系统,实现全文的浏览、检索与后台管理等功能。%In order to increase user efficiency in accessing and obtaining Western Biomedical Journal full-text resources, a Collection Western Biomedical Journal Full-text database system is developed. The paper discusses the system requirements, structure design and data dictionary, as well as the fundamentals and methodologies of the ASP+SQL Server framework, which is the key technology used to develop the system, and realizes full-text browsing, retrieval and background management functions.

  14. Review of Text Mining Application in Humanity and Social Science%文本挖掘在人文社会科学研究中的典型应用述评

    Institute of Scientific and Technical Information of China (English)

    陆宇杰; 许鑫; 郭金龙

    2012-01-01

    调研文本挖掘在人文社会科学领域的应用现况,介绍国际上文本挖掘在这些领域应用的成功案例与经验,展现目前文本挖掘在人文社科领域的最新研究进展,给国内相关研究的开展提供一定的启示。%This paper investigates the current status of text mining applications in the humanities and social science, introduces successful cases and experience of text mining practice in these domains worldwide, shows the newest research progress of text mining in the humanities and social science, and correspondingly tries to bring some inspiration to Chinese researchers.

  15. The Design and Implemention of a Text Mining Algorithm Based on MapReduce Framework%基于MapReduce框架一种文本挖掘算法的设计与实现

    Institute of Scientific and Technical Information of China (English)

    朱蔷蔷; 张桂芸; 刘文龙

    2012-01-01

    随着文本挖掘在主动信息服务中应用的日益扩展,在文本数据的基础上分析数据的内在特征已经成为目前的研究趋势,本文在Hadoop平台上设计并实现了一种文本挖掘算法,该算法利用MapReduce框架按照自然语料中相邻词组出现的频数进行降序输出,从而有助于用户挖掘大量数据中各项集之间的联系,实验结果体现了该算法的有效性和良好的加速比.%With the expanding application of text mining in active information services, analyzing the inherent characteristics of data on the basis of text data has become a current research trend. This paper designs and implements a text mining algorithm on the Hadoop platform which uses the MapReduce framework to output adjacent phrases from natural-language corpora in descending order of their occurrence frequency, thus helping users mine the associations among itemsets in large quantities of data. In view of the distributed nature of the Hadoop platform, the experimental results show the effectiveness and good speedup of the algorithm.
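The core of such an algorithm can be sketched in plain Python rather than on Hadoop (the two-line corpus is invented): a mapper emits adjacent word pairs with a count of 1, and a reducer sums the counts and sorts them in descending frequency.

```python
from collections import Counter
from itertools import chain

CORPUS = [
    "text mining finds patterns in text",
    "text mining scales with mapreduce",
]

def mapper(line):
    """Emit ((word, next_word), 1) for every adjacent pair in the line."""
    words = line.split()
    return [((a, b), 1) for a, b in zip(words, words[1:])]

def reducer(pairs):
    """Sum counts per pair and sort in descending frequency,
    mirroring the descending output the paper describes."""
    counts = Counter()
    for key, n in pairs:
        counts[key] += n
    return counts.most_common()

ranked = reducer(chain.from_iterable(mapper(line) for line in CORPUS))
```

On Hadoop the same mapper and reducer would run as distributed tasks, with the shuffle phase grouping identical pairs; the local simulation above only changes where the grouping happens.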

  16. Biomedical engineering and nanotechnology

    International Nuclear Information System (INIS)

    This book is predominantly a compilation of papers presented in the conference which is focused on the development in biomedical materials, biomedical devices and instrumentation, biomedical effects of electromagnetic radiation, electrotherapy, radiotherapy, biosensors, biotechnology, bioengineering, tissue engineering, clinical engineering and surgical planning, medical imaging, hospital system management, biomedical education, biomedical industry and society, bioinformatics, structured nanomaterial for biomedical application, nano-composites, nano-medicine, synthesis of nanomaterial, nano science and technology development. The papers presented herein contain the scientific substance to suffice the academic directivity of the researchers from the field of biomedicine, biomedical engineering, material science and nanotechnology. Papers relevant to INIS are indexed separately

  17. Algorithm of Text Vectors Feature Mining Based on Multi Factor Analysis of Variance%基于多因素方差分析的文本向量特征挖掘算法

    Institute of Scientific and Technical Information of China (English)

    谭海中; 何波

    2015-01-01

    Text feature vector mining is applied in the fields of information resource organization and management and has great application value in data mining; the traditional text feature vector mining algorithm uses the K-means algorithm, and its accuracy is not good. A new feature mining algorithm for text vectors based on multi-factor analysis of variance is proposed. Multi-factor analysis of variance is used to obtain feature mining rules from a variety of corpora; combined with an ant colony algorithm and regular training transfer rules based on ant colony fitness probability, the maximum probability of effective features in the data sets obtained at the latest moment of population evolution is derived. A K-means initial clustering center selection algorithm based on optimized division first divides the sample data and then determines the initial clustering centers according to the distribution characteristics of the samples, improving the performance of text feature mining. Simulation results show that this algorithm improves the clustering effect of text feature vectors and thus the performance of feature mining; data features show higher recall and detection rates with less time consumed, giving the method greater application value in fields such as data mining.%文本向量特征挖掘应用于信息资源组织和管理领域,在大数据挖掘领域具有较大应用价值,传统算法精度不好。提出一种基于多因素方差分析的文本向量特征挖掘算法。使用多因素方差分析方法得到多种语料库的特征挖掘规律,结合蚁群算法,根据蚁群适应度概率正则训练迁移法则,得到种群进化最近时刻获得的数据集有效特征概率最大值,基于最优划分的K-means初始聚类中心选取算法,先对数据样本进行划分,然后根据样本分布特点来确定初始聚类中心,提高文本特征挖掘性能。仿真结果表明,该算法提高了文本向量特征的聚类效
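The "initial clustering centers from an optimized division of the sample data" idea can be sketched generically (this is an illustrative reading, not the paper's exact procedure): sort the data along its leading direction, split it into k slices, and use the slice means as the initial centers before running standard K-means.

```python
import numpy as np

rng = np.random.default_rng(0)
# Two synthetic 2-D blobs standing in for text feature vectors.
X = np.vstack([rng.normal(0.0, 0.3, (50, 2)), rng.normal(3.0, 0.3, (50, 2))])

def partition_init(X, k):
    """Initial centers from a simple 'optimized division': sort the samples
    along the leading principal direction and average each of k slices."""
    d = X - X.mean(axis=0)
    w = np.linalg.svd(d, full_matrices=False)[2][0]   # leading direction
    order = np.argsort(d @ w)
    return np.array([X[chunk].mean(axis=0) for chunk in np.array_split(order, k)])

def kmeans(X, k, iters=20):
    C = partition_init(X, k)
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - C[None]) ** 2).sum(-1), axis=1)
        C = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    return labels, C

labels, centers = kmeans(X, k=2)
```

Because the initial centers already sit near the two dense regions, the iterations converge quickly and avoid the poor local optima that random initial centers can produce.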

  18. Application of Text Mining in Employment Information Analysis in Higher Vocational College%文本挖掘在高职院校就业信息分析中的应用

    Institute of Scientific and Technical Information of China (English)

    宁建飞

    2016-01-01

    、文本分类、文本聚类、文本关联分析、分布分析和趋势预测等。文本关联分析是其中一种很关键的挖掘任务,也是在文本信息处理领域用得较多的一种技术。本文主要用到文本关联分析,下面做重点介绍。%Taking the employment information data of graduates in higher vocational colleges as the analysis object, text mining is applied to employment information analysis. Through the mining of employment information data, valuable data is obtained to act as an important reference for talent training, employment guidance and other scientific decisions. The experimental results show that text mining is a very effective method for the analysis of employment information data.

  19. BRONCO: Biomedical entity Relation ONcology COrpus for extracting gene-variant-disease-drug relations.

    Science.gov (United States)

    Lee, Kyubum; Lee, Sunwon; Park, Sungjoon; Kim, Sunkyu; Kim, Suhkyung; Choi, Kwanghun; Tan, Aik Choon; Kang, Jaewoo

    2016-01-01

    Comprehensive knowledge of genomic variants in a biological context is key for precision medicine. As next-generation sequencing technologies improve, the amount of literature containing genomic variant data, such as new functions or related phenotypes, rapidly increases. Because numerous articles are published every day, it is almost impossible to manually curate all the variant information from the literature. Many researchers focus on creating an improved automated biomedical natural language processing (BioNLP) method that extracts useful variants and their functional information from the literature. However, there is no gold-standard data set that contains texts annotated with variants and their related functions. To overcome these limitations, we introduce a Biomedical entity Relation ONcology COrpus (BRONCO) that contains more than 400 variants and their relations with genes, diseases, drugs and cell lines in the context of cancer and anti-tumor drug screening research. The variants and their relations were manually extracted from 108 full-text articles. BRONCO can be utilized to evaluate and train new methods used for extracting biomedical entity relations from full-text publications, and thus be a valuable resource to the biomedical text mining research community. Using BRONCO, we quantitatively and qualitatively evaluated the performance of three state-of-the-art BioNLP methods. We also identified their shortcomings, and suggested remedies for each method. We implemented post-processing modules for the three BioNLP methods, which improved their performance. Database URL: http://infos.korea.ac.kr/bronco. PMID:27074804

  20. A Semantics-Based Approach to Retrieving Biomedical Information

    DEFF Research Database (Denmark)

    Andreasen, Troels; Bulskov, Henrik; Zambach, Sine;

    2011-01-01

    This paper describes an approach to representing, organising, and accessing conceptual content of biomedical texts using a formal ontology. The ontology is based on UMLS resources supplemented with domain ontologies developed in the project. The approach introduces the notion of ‘generative ontologies’, i.e., ontologies providing increasingly specialised concepts reflecting the phrase structure of natural language. Furthermore, we propose a novel so-called ontological semantics which maps noun phrases from texts and queries into nodes in the generative ontology. This enables an advanced form of data mining of texts, identifying paraphrases and concept relations and measuring distances between key concepts in texts. Thus, the project is distinct in its attempt to provide a formal underpinning of conceptual similarity or relatedness of meaning.

  1. Principles of Biomedical Engineering

    CERN Document Server

    Madihally, Sundararajan V

    2010-01-01

    Describing the role of engineering in medicine today, this comprehensive volume covers a wide range of the most important topics in this burgeoning field. Supported with over 145 illustrations, the book discusses bioelectrical systems, mechanical analysis of biological tissues and organs, biomaterial selection, compartmental modeling, and biomedical instrumentation. Moreover, you find a thorough treatment of the concept of using living cells in various therapeutics and diagnostics. Structured as a complete text for students with some engineering background, the book also makes a valuable reference.

  2. 基于增量队列的在全置信度下的关联挖掘%Association Mining on Massive Text under Full Confidence Based on Incremental Queue

    Institute of Scientific and Technical Information of China (English)

    刘炜

    2015-01-01

    关联挖掘是一种重要的数据分析方法, 提出了一种在全置信度下的增量队列关联挖掘算法模型, 在传统的 FP-Growth 及 FP-Tree 算法的关联挖掘中使用了全置信度规则, 算法的适应性得到提升, 由此提出FP4W-Growth 算法并运用到对文本数据的关联计算以及对增量式的数据进行关联性挖掘的研究中, 通过实验验证了此算法及模型的可行性与优化性, 为在庞大的文本数据中发现隐藏着的先前未知的并潜在有用的新信息和新模式, 提供了科学的决策方法.%Association mining is an important data analysis method. This article proposes an incremental queue association mining algorithm model under full confidence; using full-confidence rules in the traditional FP-Growth and FP-Tree association mining algorithms improves the adaptability of the algorithm. On this basis, the article proposes the FP4W-Growth algorithm and applies it to the association calculation of text data and to association mining of incremental data. A verification experiment was conducted, and the results show the feasibility and optimization of this algorithm and model. The article provides a scientific decision-making approach to finding hidden, previously unknown, and potentially useful new information and patterns in large amounts of text data.
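The full-confidence (all-confidence) measure this record builds on can be computed directly; the toy transactions below are invented stand-ins for sets of terms extracted from texts.

```python
TRANSACTIONS = [
    {"fever", "cough", "headache"},
    {"fever", "cough"},
    {"fever", "rash"},
    {"cough", "headache"},
]

def support(itemset):
    """Fraction of transactions containing every item in the itemset."""
    return sum(itemset <= t for t in TRANSACTIONS) / len(TRANSACTIONS)

def all_confidence(itemset):
    """all-confidence = support(X) / max item support: every rule
    X \\ {i} -> i holds with at least this confidence."""
    return support(itemset) / max(support({i}) for i in itemset)

ac = all_confidence({"fever", "cough"})
```

Filtering itemsets by all-confidence rather than plain support prunes cross-support patterns, which is why full-confidence rules can improve the adaptability of FP-Growth-style mining.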

  3. Application of an efficient Bayesian discretization method to biomedical data

    Directory of Open Access Journals (Sweden)

    Gopalakrishnan Vanathi

    2011-07-01

    Full Text Available Abstract Background Several data mining methods require data that are discrete, and other methods often perform better with discrete data. We introduce an efficient Bayesian discretization (EBD) method for optimal discretization of variables that runs efficiently on high-dimensional biomedical datasets. The EBD method consists of two components, namely, a Bayesian score to evaluate discretizations and a dynamic programming search procedure to efficiently search the space of possible discretizations. We compared the performance of EBD to Fayyad and Irani's (FI) discretization method, which is commonly used for discretization. Results On 24 biomedical datasets obtained from high-throughput transcriptomic and proteomic studies, the classification performances of the C4.5 classifier and the naïve Bayes classifier were statistically significantly better when the predictor variables were discretized using EBD over FI. EBD was statistically significantly more stable to the variability of the datasets than FI. However, EBD was less robust, though not statistically significantly so, than FI and produced slightly more complex discretizations than FI. Conclusions On a range of biomedical datasets, a Bayesian discretization method (EBD) yielded better classification performance and stability but was less robust than the widely used FI discretization method. The EBD discretization method is easy to implement, permits the incorporation of prior knowledge and belief, and is sufficiently fast for application to high-dimensional data.

  4. Automated recognition of malignancy mentions in biomedical literature

    Directory of Open Access Journals (Sweden)

    Liberman Mark Y

    2006-11-01

    Full Text Available Abstract Background The rapid proliferation of biomedical text makes it increasingly difficult for researchers to identify, synthesize, and utilize developed knowledge in their fields of interest. Automated information extraction procedures can assist in the acquisition and management of this knowledge. Previous efforts in biomedical text mining have focused primarily upon named entity recognition of well-defined molecular objects such as genes, but less work has been performed to identify disease-related objects and concepts. Furthermore, promise has been tempered by an inability to efficiently scale approaches in ways that minimize manual efforts and still perform with high accuracy. Here, we have applied a machine-learning approach previously successful for identifying molecular entities to a disease concept to determine if the underlying probabilistic model effectively generalizes to unrelated concepts with minimal manual intervention for model retraining. Results We developed a named entity recognizer (MTag), an entity tagger for recognizing clinical descriptions of malignancy presented in text. The application uses the machine-learning technique Conditional Random Fields with additional domain-specific features. MTag was tested with 1,010 training and 432 evaluation documents pertaining to cancer genomics. Overall, our experiments resulted in 0.85 precision, 0.83 recall, and 0.84 F-measure on the evaluation set. Compared with a baseline system using string matching of text with a neoplasm term list, MTag performed with a much higher recall rate (92.1% vs. 42.1% recall) and demonstrated the ability to learn new patterns. Application of MTag to all MEDLINE abstracts yielded the identification of 580,002 unique and 9,153,340 overall mentions of malignancy. Significantly, addition of an extensive lexicon of malignancy mentions as a feature set for extraction had minimal impact in performance. 
Conclusion Together, these results suggest that the
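
MTag's tagging model is a linear-chain CRF over per-token features. As an illustration of the kind of feature extraction such taggers rely on (the feature names and the tiny neoplasm lexicon below are invented for this sketch, not taken from MTag):

```python
def token_features(tokens, i):
    """Build a feature dict for token i, in the style fed to a linear-chain CRF."""
    tok = tokens[i]
    return {
        "lower": tok.lower(),
        "is_capitalized": tok[:1].isupper(),
        "has_digit": any(c.isdigit() for c in tok),
        "suffix3": tok[-3:].lower(),
        "prev": tokens[i - 1].lower() if i > 0 else "<BOS>",
        "next": tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>",
        # hypothetical domain-specific cue: token appears in a neoplasm lexicon
        "in_lexicon": tok.lower() in {"carcinoma", "melanoma", "lymphoma"},
    }

sentence = "Metastatic carcinoma was observed".split()
feats = token_features(sentence, 1)
```

In a real system these dicts would be passed, one per token, to a CRF training library along with the gold labels.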

  5. Research on Text Data Mining in Human Genome Sequence Analysis

    Institute of Scientific and Technical Information of China (English)

    于跃; 潘玮; 王丽伟; 王伟

    2012-01-01

    Retrieving the literature on human genome sequencing published between 1 January 2001 and 11 May 2011 from PubMed, this study extracts the bibliographic information and performs co-word cluster analysis: high-frequency subject headings are extracted, and the word-document matrix, co-occurrence matrix, and co-word clusters are generated. The results indicate that text data mining can effectively reflect the development status and research hotspots of a discipline, thereby providing valuable information to researchers.
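
The co-word analysis described above reduces to counting subject headings and their pairwise co-occurrences across records. A minimal sketch in Python, with invented terms standing in for the real subject headings:

```python
from collections import Counter
from itertools import combinations

# Each record is the set of subject headings assigned to one article (illustrative).
records = [
    {"genome", "sequencing", "cancer"},
    {"genome", "sequencing"},
    {"genome", "cancer"},
]

# Term frequencies (for selecting high-frequency headings).
term_freq = Counter(t for rec in records for t in rec)

# Co-occurrence matrix, stored sparsely as pair counts.
cooccur = Counter()
for rec in records:
    for a, b in combinations(sorted(rec), 2):
        cooccur[(a, b)] += 1
```

The resulting pair counts are exactly what a co-word clustering step would take as input.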

  6. Semi-supervised learning of causal relations in biomedical scientific discourse

    Science.gov (United States)

    2014-01-01

    Background The increasing number of daily published articles in the biomedical domain has become too large for humans to handle on their own. As a result, bio-text mining technologies have been developed to ease their workload by automatically analysing the text and extracting important knowledge. Specific bio-entities, bio-events between these and facts can now be recognised with sufficient accuracy and are widely used by biomedical researchers. However, understanding how the extracted facts are connected in text is an extremely difficult task, which cannot be easily tackled by machinery. Results In this article, we describe our method to recognise causal triggers and their arguments in biomedical scientific discourse. We introduce new features and show that a self-learning approach improves the performance obtained by supervised machine learners to 83.47% for causal triggers. Furthermore, the spans of causal arguments can be recognised to a slightly higher level than by using supervised or rule-based methods that have been employed before. Conclusion Exploiting the large amount of unlabelled data that is already available can help improve the performance of recognising causal discourse relations in the biomedical domain. This improvement will further benefit the development of multiple tasks, such as hypothesis generation for experimental laboratories, contradiction detection, and the creation of causal networks. PMID:25559746
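
The self-learning (self-training) approach referred to above iteratively retrains a supervised learner on its own high-confidence predictions. A toy sketch of that loop, with a deliberately simple one-dimensional threshold classifier standing in for the real causal-trigger model and an assumed confidence cutoff:

```python
# Labeled data: (feature value, class); unlabeled data: feature values only.
labeled = [(0.1, 0), (0.2, 0), (0.8, 1), (0.9, 1)]
unlabeled = [0.15, 0.85, 0.5]

def train(data):
    # Toy classifier: decision threshold halfway between the class means.
    mean0 = sum(x for x, y in data if y == 0) / sum(1 for _, y in data if y == 0)
    mean1 = sum(x for x, y in data if y == 1) / sum(1 for _, y in data if y == 1)
    return (mean0 + mean1) / 2

for _ in range(3):
    threshold = train(labeled)
    confident, remaining = [], []
    for x in unlabeled:
        if abs(x - threshold) > 0.2:  # confidence cutoff (assumption)
            confident.append((x, int(x > threshold)))
        else:
            remaining.append(x)
    labeled += confident      # self-labeled examples join the training set
    unlabeled = remaining     # low-confidence examples stay unlabeled
```

Ambiguous examples (here, 0.5) are never self-labeled, which is what keeps the loop from reinforcing its own mistakes.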

  7. BioN∅T: A searchable database of biomedical negated sentences

    Directory of Open Access Journals (Sweden)

    Agarwal Shashank

    2011-10-01

    Full Text Available Abstract Background Negated biomedical events are often ignored by text-mining applications; however, such events carry scientific significance. We report on the development of BioN∅T, a database of negated sentences that can be used to extract such negated events. Description Currently BioN∅T incorporates ≈32 million negated sentences, extracted from over 336 million biomedical sentences from three resources: ≈2 million full-text biomedical articles in Elsevier and PubMed Central, as well as ≈20 million abstracts in PubMed. We evaluated BioN∅T on three important genetic disorders: autism, Alzheimer's disease and Parkinson's disease, and found that BioN∅T is able to capture negated events that may be ignored by experts. Conclusions The BioN∅T database can be a useful resource for biomedical researchers. BioN∅T is freely available at http://bionot.askhermes.org/. In future work, we will develop semantic web related technologies to enrich BioN∅T.
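
Collecting candidate negated sentences of the kind BioN∅T stores can be approximated with a cue-word scan; the cue list and example sentences below are illustrative, not BioN∅T's actual method:

```python
import re

# A few common negation cues (illustrative, far from exhaustive).
NEG_CUES = re.compile(r"\b(no|not|never|without|absence of|failed to)\b", re.I)

sentences = [
    "TP53 mutation was not detected in these samples.",
    "The drug inhibited tumor growth.",
    "There was no evidence of metastasis.",
]

# Keep only sentences containing a negation cue.
negated = [s for s in sentences if NEG_CUES.search(s)]
```

A production system would add scope detection, since a cue word alone does not say which event is negated.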

  8. Opportunities of literacy through text mining and fanfiction writing

    Directory of Open Access Journals (Sweden)

    Patrícia da Silva Campelo Costa

    2012-01-01

    Full Text Available This work aims at investigating how literacy may be supported by the use of a digital resource which can help the processes of reading and writing. The study builds on Feldman and Sanger (2006) on text mining and on Black (2007; 2009) on the incorporation of fanfiction, a textual genre characteristic of the Internet, into language learning. Through the use of a text mining resource (Sobek), which extracts the most frequent terms in a text, the participants of this pilot study created narratives in English as a foreign language (FL) in digital media, and used the mining tool to develop graphs of the recurrent terms found in the story. It was observed that the use of the digital tool supported text production in the FL, and the ensuing literacy practice, as the students relied on the mining resource to create their fanfictions.

  10. The International Library Community's Fight for Text and Data Mining Rights and Its Lessons

    Institute of Scientific and Technical Information of China (English)

    于静

    2016-01-01

    Facing copyright constraints on the application of text and data mining in libraries, the international library community has pursued several strategies to secure text and data mining rights: publishing statements of copyright principles, lobbying for copyright legislation, challenging publishers' copyright policies, and building advocacy alliances. The results are mainly reflected in four areas: the community's copyright position has gained social recognition and support, copyright exception regimes have begun to emerge, some publishers have adjusted their copyright policies, and libraries' copyright practice models have diversified. The practices and experience of the international library community in striving for text and data mining rights offer useful lessons for libraries in China.

  11. BIG: a Grid Portal for Biomedical Data and Images

    Directory of Open Access Journals (Sweden)

    Giovanni Aloisio

    2004-06-01

    Full Text Available Modern management of biomedical systems involves the use of many distributed resources, such as high performance computational resources to analyze biomedical data, mass storage systems to store them, medical instruments (microscopes, tomographs, etc.), and advanced visualization and rendering tools. Grids offer the computational power, security and availability needed by such novel applications. This paper presents BIG (Biomedical Imaging Grid), a Web-based Grid portal for the management of biomedical information (data and images) in a distributed environment. BIG is an interactive environment that deals with complex user requests, regarding the acquisition of biomedical data and the "processing" and "delivering" of biomedical images, using the power and security of Computational Grids.

  12. Patent Analysis Method for the Field of Video Codec Based on Text Mining

    Institute of Scientific and Technical Information of China (English)

    于雷; 夏鹏

    2012-01-01

    This paper introduces a method for patent text mining analysis based on advanced semantic technology and natural language processing, and applies the method to analyse patents in the video codec field, obtaining some useful suggestions.

  13. Gene Tree Labeling Using Nonnegative Matrix Factorization on Biomedical Literature

    Directory of Open Access Journals (Sweden)

    Kevin E. Heinrich

    2008-01-01

    Full Text Available Identifying functional groups of genes is a challenging problem for biological applications. Text mining approaches can be used to build hierarchical clusters or trees from the information in the biological literature. In particular, the nonnegative matrix factorization (NMF) is examined as one approach to label hierarchical trees. A generic labeling algorithm as well as an evaluation technique is proposed, and the effects of different NMF parameters with regard to convergence and labeling accuracy are discussed. The primary goals of this study are to provide a qualitative assessment of the NMF and its various parameters and initialization, to provide an automated way to classify biomedical data, and to provide a method for evaluating labeled data assuming a static input tree. As a byproduct, a method for generating gold standard trees is proposed.
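
The labeling step described above amounts to ranking terms by their weights in each NMF factor. A minimal sketch with a hand-made term-by-factor matrix (the terms and weights are invented; computing the factorization itself is omitted):

```python
# Rows of W correspond to terms, columns to NMF factors (illustrative values).
terms = ["kinase", "apoptosis", "receptor", "transcription", "mitosis"]
W = [
    [0.9, 0.0],
    [0.1, 0.8],
    [0.7, 0.1],
    [0.0, 0.6],
    [0.2, 0.3],
]

def factor_labels(terms, W, k=2):
    """Label each NMF factor with its k top-weighted terms."""
    n_factors = len(W[0])
    labels = []
    for j in range(n_factors):
        ranked = sorted(range(len(terms)), key=lambda i: W[i][j], reverse=True)
        labels.append([terms[i] for i in ranked[:k]])
    return labels

labels = factor_labels(terms, W)
```

Each label list can then be attached to the corresponding node of the hierarchical tree.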

  14. In silico discoveries for biomedical sciences

    NARCIS (Netherlands)

    Haagen, Herman van

    2011-01-01

    Text-mining is a challenging field of research initially meant for reading large text collections with a computer. It is useful for summarizing text, searching for informative documents, and, most importantly, for knowledge discovery. Knowledge discovery is the main subject of this thesis.

  15. Study on an auto-suggestion method for scientific research information based on text mining

    Institute of Scientific and Technical Information of China (English)

    李芳; 朱群雄

    2011-01-01

    This paper studies the characteristics of text data in research journal literature, applies popular text mining techniques to the analysis and processing of such data, and proposes an auto-suggestion system for research information based on text mining. The system is demonstrated in a simulated case study on the full-year 2007 articles of three influential Chinese journals in the information field.

  16. Biomedical engineering fundamentals

    CERN Document Server

    Bronzino, Joseph D

    2014-01-01

    Known as the bible of biomedical engineering, The Biomedical Engineering Handbook, Fourth Edition, sets the standard against which all other references of this nature are measured. As such, it has served as a major resource for both skilled professionals and novices to biomedical engineering. Biomedical Engineering Fundamentals, the first volume of the handbook, presents material from respected scientists with diverse backgrounds in physiological systems, biomechanics, biomaterials, bioelectric phenomena, and neuroengineering. More than three dozen specific topics are examined, including cardia

  17. Functionalized carbon nanotubes: biomedical applications

    Directory of Open Access Journals (Sweden)

    Vardharajula S

    2012-10-01

    Full Text Available Sandhya Vardharajula,1 Sk Z Ali,2 Pooja M Tiwari,1 Erdal Eroğlu,1 Komal Vig,1 Vida A Dennis,1 Shree R Singh1 1Center for NanoBiotechnology and Life Sciences Research, Alabama State University, Montgomery, AL, USA; 2Department of Microbiology, Osmania University, Hyderabad, India. Abstract: Carbon nanotubes (CNTs) are emerging as novel nanomaterials for various biomedical applications. CNTs can be used to deliver a variety of therapeutic agents, including biomolecules, to the target disease sites. In addition, their unparalleled optical and electrical properties make them excellent candidates for bioimaging and other biomedical applications. However, the high cytotoxicity of CNTs limits their use in humans and many biological systems. The biocompatibility and low cytotoxicity of CNTs are attributed to size, dose, duration, testing systems, and surface functionalization. The functionalization of CNTs improves their solubility and biocompatibility and alters their cellular interaction pathways, resulting in much-reduced cytotoxic effects. Functionalized CNTs are promising novel materials for a variety of biomedical applications. These potential applications are particularly enhanced by their ability to penetrate biological membranes with relatively low cytotoxicity. This review is directed towards the overview of CNTs and their functionalization for biomedical applications with minimal cytotoxicity. Keywords: carbon nanotubes, cytotoxicity, functionalization, biomedical applications

  18. TargetMine, an integrated data warehouse for candidate gene prioritisation and target discovery.

    Directory of Open Access Journals (Sweden)

    Yi-An Chen

    Full Text Available Prioritising candidate genes for further experimental characterisation is a non-trivial challenge in drug discovery and biomedical research in general. An integrated approach that combines results from multiple data types is best suited for optimal target selection. We developed TargetMine, a data warehouse for efficient target prioritisation. TargetMine utilises the InterMine framework, with new data models such as protein-DNA interactions integrated in a novel way. It enables complicated searches that are difficult to perform with existing tools and it also offers integration of custom annotations and in-house experimental data. We proposed an objective protocol for target prioritisation using TargetMine and set up a benchmarking procedure to evaluate its performance. The results show that the protocol can identify known disease-associated genes with high precision and coverage. A demonstration version of TargetMine is available at http://targetmine.nibio.go.jp/.

  19. Biomedical microsystems

    CERN Document Server

    Meng, Ellis

    2010-01-01

    Contents include: Introduction; Evolution of MEMS; Applications of MEMS; BioMEMS Applications; MEMS Resources; Text Goals and Organization; Miniaturization and Scaling; BioMEMS Materials; Traditional MEMS and Microelectronic Materials; Polymeric Materials for MEMS; Biomaterials; Microfabrication Methods and Processes for BioMEMS; Introduction; Microlithography; Doping; Micromachining; Wafer Bonding, Assembly, and Packaging; Surface Treatment; Conversion Factors for Energy and Intensity Units; Laboratory Exercises; Microfluidics; Introduction and Fluid Properties; Concepts in Microfluidics; Fluid-Transport Phenomena and Pumping; Flow Control; Laboratory Exercises

  20. A Competitor Analysis Model Based on Semantic Text Mining

    Institute of Scientific and Technical Information of China (English)

    唐晓波; 郭萍

    2013-01-01

    To overcome the failure of traditional competitor analysis methods to mine information about corporate competitors on the Web, this paper introduces semantic text mining into competitor analysis and proposes a competitor analysis model based on it. The model uses rule-based topical crawling to obtain structured information, employs a competitive-intelligence domain ontology knowledge base and a semantic VSM matrix to achieve semantic analysis and description of competitor information, and extracts deep-level semantic knowledge about rivals through semantic text mining. Two competing enterprises in the camera market, Canon and Nikon, are chosen to demonstrate the applicability of the model. The results show that the model has potential practical value and can effectively improve the quality of business decision-making.

  1. Locating previously unknown patterns in data-mining results: a dual data- and knowledge-mining method

    Directory of Open Access Journals (Sweden)

    Knaus William A

    2006-03-01

    Full Text Available Abstract Background Data mining can be utilized to automate analysis of substantial amounts of data produced in many organizations. However, data mining produces large numbers of rules and patterns, many of which are not useful. Existing methods for pruning uninteresting patterns have only begun to automate the knowledge acquisition step (which is required for subjective measures of interestingness, hence leaving a serious bottleneck. In this paper we propose a method for automatically acquiring knowledge to shorten the pattern list by locating the novel and interesting ones. Methods The dual-mining method is based on automatically comparing the strength of patterns mined from a database with the strength of equivalent patterns mined from a relevant knowledgebase. When these two estimates of pattern strength do not match, a high "surprise score" is assigned to the pattern, identifying the pattern as potentially interesting. The surprise score captures the degree of novelty or interestingness of the mined pattern. In addition, we show how to compute p values for each surprise score, thus filtering out noise and attaching statistical significance. Results We have implemented the dual-mining method using scripts written in Perl and R. We applied the method to a large patient database and a biomedical literature citation knowledgebase. The system estimated association scores for 50,000 patterns, composed of disease entities and lab results, by querying the database and the knowledgebase. It then computed the surprise scores by comparing the pairs of association scores. Finally, the system estimated statistical significance of the scores. Conclusion The dual-mining method eliminates more than 90% of patterns with strong associations, thus identifying them as uninteresting. We found that the pruning of patterns using the surprise score matched the biomedical evidence in the 100 cases that were examined by hand. 
The method automates the acquisition of
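
The core of the dual-mining method is a comparison of two association-strength estimates per pattern. A toy sketch (the strengths, the patterns, and the 0.5 cutoff are illustrative; the paper additionally attaches p values, which are omitted here):

```python
# Association strength of each pattern, estimated from the patient database
# and from the literature knowledgebase (illustrative values).
db_strength = {("diabetes", "hba1c"): 0.80, ("anemia", "wbc"): 0.75}
kb_strength = {("diabetes", "hba1c"): 0.78, ("anemia", "wbc"): 0.05}

def surprise(pattern):
    """Surprise score: mismatch between the two strength estimates."""
    return abs(db_strength[pattern] - kb_strength[pattern])

scores = {p: surprise(p) for p in db_strength}
# Patterns whose strengths disagree strongly are flagged as interesting.
interesting = [p for p, s in scores.items() if s > 0.5]
```

Patterns strong in both sources (already well known) get low scores and are pruned; only the mismatched pattern survives.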

  2. Layout-aware text extraction from full-text PDF of scientific articles

    Directory of Open Access Journals (Sweden)

    Ramakrishnan Cartic

    2012-05-01

    Full Text Available Abstract Background The Portable Document Format (PDF) is the most commonly used file format for online scientific publications. The absence of effective means to extract text from these PDF files in a layout-aware manner presents a significant challenge for developers of biomedical text mining or biocuration informatics systems that use published literature as an information source. In this paper we introduce the ‘Layout-Aware PDF Text Extraction’ (LA-PDFText) system to facilitate accurate extraction of text from PDF files of research articles for use in text mining applications. Results Our paper describes the construction and performance of an open source system that extracts text blocks from PDF-formatted full-text research articles and classifies them into logical units based on rules that characterize specific sections. The LA-PDFText system focuses only on the textual content of the research articles and is meant as a baseline for further experiments into more advanced extraction methods that handle multi-modal content, such as images and graphs. The system works in a three-stage process: (1) detecting contiguous text blocks using spatial layout processing to locate and identify blocks of contiguous text, (2) classifying text blocks into rhetorical categories using a rule-based method, and (3) stitching classified text blocks together in the correct order, resulting in the extraction of text from section-wise grouped blocks. We show that our system can identify text blocks and classify them into rhetorical categories with Precision = 0.96, Recall = 0.89 and F1 = 0.91. We also present an evaluation of the accuracy of the block detection algorithm used in step 2. Additionally, we have compared the accuracy of the text extracted by LA-PDFText to the text from the Open Access subset of PubMed Central. We then compared this accuracy with that of the text extracted by the PDF2Text system, commonly used to extract text from PDF
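
The rule-based classification of text blocks into rhetorical categories (stage 2 above) can be sketched as a first-line pattern match; the rules below are illustrative, not LA-PDFText's actual rule set:

```python
import re

# Illustrative section-heading rules: (pattern on the block's first line, category).
RULES = [
    (re.compile(r"^\s*abstract\b", re.I), "abstract"),
    (re.compile(r"^\s*(methods?|materials and methods)\b", re.I), "methods"),
    (re.compile(r"^\s*results?\b", re.I), "results"),
    (re.compile(r"^\s*(references|bibliography)\b", re.I), "references"),
]

def classify_block(block):
    """Assign a rhetorical category by matching the block's first line."""
    first_line = block.strip().splitlines()[0]
    for pattern, category in RULES:
        if pattern.search(first_line):
            return category
    return "body"  # default for ordinary paragraphs

cat = classify_block("Methods\nWe extracted text blocks from each article ...")
```

Real rules would also use layout cues such as font size and position, which plain pattern matching cannot see.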

  3. Statistics in biomedical research

    Directory of Open Access Journals (Sweden)

    González-Manteiga, Wenceslao

    2007-06-01

    Full Text Available The discipline of biostatistics is nowadays a fundamental scientific component of biomedical, public health and health services research. Traditional and emerging areas of application include clinical trials research, observational studies, physiology, imaging, and genomics. The present article reviews the current situation of biostatistics, considering the statistical methods traditionally used in biomedical research, as well as the ongoing development of new methods in response to the new problems arising in medicine. Clearly, the successful application of statistics in biomedical research requires appropriate training of biostatisticians. This training should aim to give due consideration to emerging new areas of statistics, while at the same time retaining full coverage of the fundamentals of statistical theory and methodology. In addition, it is important that students of biostatistics receive formal training in relevant biomedical disciplines, such as epidemiology, clinical trials, molecular biology, genetics, and neuroscience.

  4. Passage-Based Bibliographic Coupling: An Inter-Article Similarity Measure for Biomedical Articles.

    Directory of Open Access Journals (Sweden)

    Rey-Long Liu

    Full Text Available Biomedical literature is an essential source of biomedical evidence. To translate the evidence for biomedicine study, researchers often need to carefully read multiple articles about specific biomedical issues. These articles thus need to be highly related to each other. They should share similar core contents, including research goals, methods, and findings. However, given an article r, it is challenging for search engines to retrieve highly related articles for r. In this paper, we present a technique PBC (Passage-based Bibliographic Coupling) that estimates inter-article similarity by seamlessly integrating bibliographic coupling with the information collected from context passages around important out-link citations (references) in each article. Empirical evaluation shows that PBC can significantly improve the retrieval of those articles that biomedical experts believe to be highly related to specific articles about gene-disease associations. PBC can thus be used to improve search engines in retrieving the highly related articles for any given article r, even when r is cited by very few (or even no) articles. The contribution is essential for those researchers and text mining systems that aim at cross-validating the evidence about specific gene-disease associations.
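
Plain bibliographic coupling, the basis that PBC builds on, scores two articles by their shared references. A minimal Jaccard-style sketch (the reference sets are illustrative; PBC's passage-context weighting is omitted):

```python
# Out-link reference sets of three articles (illustrative).
refs = {
    "article_a": {"r1", "r2", "r3"},
    "article_b": {"r2", "r3", "r4"},
    "article_c": {"r5"},
}

def coupling(x, y):
    """Jaccard-normalized bibliographic coupling: shared refs over all refs."""
    shared = refs[x] & refs[y]
    union = refs[x] | refs[y]
    return len(shared) / len(union) if union else 0.0

sim_ab = coupling("article_a", "article_b")
sim_ac = coupling("article_a", "article_c")
```

PBC would additionally weigh each shared reference by the similarity of the passages surrounding its citations in the two articles.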

  5. Current Market Demand for Core Competencies of Librarianship—A Text Mining Study of American Library Association’s Advertisements from 2009 through 2014

    Directory of Open Access Journals (Sweden)

    Qinghong Yang

    2016-02-01

    Full Text Available As librarianship evolves, it is important to examine the changes that have taken place in professional requirements. To provide an understanding of the current market demand for core competencies of librarianship, this article applies a semi-automatic methodology to analyze job advertisements (ads) posted on the American Library Association (ALA) Joblist from 2009 through 2014. There is evidence that the ability to solve unexpected complex problems and to provide superior customer service gained increasing importance for librarians during those years. The authors contend that the findings in this report question the status quo of core competencies of librarianship in the US job market.

  6. A Comparative Study of Root -Based and Stem -Based Approaches for Measuring the Similarity Between Arabic Words for Arabic Text Mining Applications

    Directory of Open Access Journals (Sweden)

    Hanane FROUD

    2012-12-01

    Full Text Available Representation of the semantic information contained in words is needed for any Arabic text mining application. More precisely, the purpose is to better take into account the semantic dependencies between words expressed by the co-occurrence frequencies of these words. There have been many proposals to compute similarities between words based on their distributions in contexts. In this paper, we compare and contrast the effect of two preprocessing techniques applied to an Arabic corpus, the root-based (stemming) and stem-based (light stemming) approaches, for measuring the similarity between Arabic words with the well-known abstractive model, Latent Semantic Analysis (LSA), using a wide variety of distance functions and similarity measures, such as the Euclidean distance, cosine similarity, Jaccard coefficient, and the Pearson correlation coefficient. The obtained results show that, on the one hand, the variety of the corpus produces more accurate results; on the other hand, the stem-based approach outperforms the root-based one because the latter affects the meanings of words.
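
Two of the similarity measures compared in the study, cosine similarity and the Jaccard coefficient, can be computed over word co-occurrence vectors as follows (the vectors are illustrative):

```python
import math

# Sparse co-occurrence vectors for two words: context term -> count (illustrative).
v1 = {"book": 2, "read": 1}
v2 = {"book": 1, "write": 1}

def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    na = math.sqrt(sum(x * x for x in a.values()))
    nb = math.sqrt(sum(x * x for x in b.values()))
    return dot / (na * nb)

def jaccard(a, b):
    """Jaccard coefficient over the sets of nonzero dimensions."""
    return len(a.keys() & b.keys()) / len(a.keys() | b.keys())

cos = cosine(v1, v2)
jac = jaccard(v1, v2)
```

In the study these measures are applied after stemming or light stemming, which changes which dimensions the vectors share.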

  7. PALM-IST: Pathway Assembly from Literature Mining--an Information Search Tool.

    Science.gov (United States)

    Mandloi, Sapan; Chakrabarti, Saikat

    2015-01-01

    Manual curation of biomedical literature has become an extremely tedious process due to its exponential growth in recent years. To extract meaningful information from such large volumes of unstructured text, newer and more efficient mining tools are required. Here, we introduce PALM-IST, a computational platform that not only allows users to explore biomedical abstracts using keyword-based text mining but also extracts biological entity (e.g., gene/protein, drug, disease, biological process, cellular component, etc.) information from the extracted text and subsequently mines various databases to provide their comprehensive inter-relations (e.g., interaction, expression, etc.). PALM-IST constructs protein interaction networks and pathway information relevant to the text search using multiple data mining tools and assembles them into a meta-interaction network. It also analyzes scientific collaboration by extracting and creating a "co-authorship network" for a given search context. Hence, this useful combination of literature and data mining provided in PALM-IST can be used to extract novel protein-protein interactions (PPI), to generate meta-pathways, and further to identify key crosstalk and bottleneck proteins. PALM-IST is available at www.hpppi.iicb.res.in/ctm. PMID:25989388

  8. Text Topic Mining Oriented toward Massive High-dimensional Data

    Institute of Scientific and Technical Information of China (English)

    王和勇; 蓝金炯

    2015-01-01

    Considering the constraints on the Latent Semantic Analysis (LSA) method in massive high-dimensional data, this paper proposes an improved LSA method based on the k-means algorithm, called KLSA: k-means clustering is used to preprocess the feature words, reducing them to a relatively low-dimensional space before the LSA method is applied. An experiment on microblog text data from Sina Weibo shows that the proposed method satisfies the computational-speed requirements that massive high-dimensional data places on LSA while preserving the model's classification performance.
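
KLSA's preprocessing step is ordinary k-means clustering applied before LSA. A self-contained one-dimensional k-means sketch (toy data and initial centers; the real system clusters high-dimensional feature-word vectors):

```python
def kmeans_1d(points, centers, iters=10):
    """Plain k-means on scalars: assign to nearest center, then recompute means."""
    clusters = [[] for _ in centers]
    for _ in range(iters):
        clusters = [[] for _ in centers]
        for p in points:
            j = min(range(len(centers)), key=lambda c: abs(p - centers[c]))
            clusters[j].append(p)
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

points = [1.0, 1.2, 0.8, 8.0, 8.2, 7.8]
centers, clusters = kmeans_1d(points, [0.0, 10.0])
```

In KLSA, each resulting cluster of feature words becomes one dimension of the reduced space on which LSA is then run.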

  9. Checklists in biomedical publications

    Directory of Open Access Journals (Sweden)

    Pardal-Refoyo JL

    2013-12-01

    Full Text Available Introduction and objectives: authors, reviewers, editors and readers must have specific tools that help them in the process of drafting, reviewing, or reading articles. Objective: to offer a summary of the major checklists for the different types of biomedical research articles. Material and method: review of the literature and resources of the EQUATOR Network and the adaptations in Spanish published by the journals Medicina Clínica and Evidencias en Pediatría. Results: checklists have been elaborated by various working groups for experimental studies (CONSORT and TREND), observational studies (STROBE), diagnostic accuracy studies (STARD), systematic reviews and meta-analyses (PRISMA), and studies to improve quality (SQUIRE). Conclusions: the use of checklists helps to improve the quality of articles and helps authors, reviewers, editors and readers in the development and understanding of their content.

  10. Biomedical optical imaging

    CERN Document Server

    Fujimoto, James G

    2009-01-01

    Biomedical optical imaging is a rapidly emerging research area with widespread fundamental research and clinical applications. This book gives an overview of biomedical optical imaging with contributions from leading international research groups who have pioneered many of these techniques and applications. A unique research field spanning the microscopic to the macroscopic, biomedical optical imaging allows both structural and functional imaging. Techniques such as confocal and multiphoton microscopy provide cellular level resolution imaging in biological systems. The integration of this tech

  11. Detecting the research trend of islet amyloid polypeptide with a text mining technique

    Institute of Scientific and Technical Information of China (English)

    殷蜀梅; 李春英

    2014-01-01

    Islet amyloid polypeptide (IAPP) is an important etiologic factor in type 2 diabetes mellitus. To investigate the biological functions and applications of IAPP, this study uses text mining to explore trends in research on IAPP biochemical reagents and test kits.

  12. Exploring the associated rules of traditional Chinese medicine and western medicine on urinary tract infection with a text mining technique

    Institute of Scientific and Technical Information of China (English)

    陈文; 姜洋; 黄蕙莉; 孙玉香

    2013-01-01

    Objective To explore the association rules between western medicine and traditional Chinese medicine (TCM) in the treatment of urinary tract infection (UTI) using text mining. Methods The UTI data set was downloaded from the CBM database. Regularities in the use of Chinese patent medicines (CPM), western medicines, and CPM-western medicine combinations for UTI were mined with a data-slicing algorithm based on frequency counts of sensitive keywords, and the results were visualized with Cytoscape 2.8. Results CPM use centred on clearing heat, removing toxicity, promoting diuresis and relieving stranguria; western medicine use centred on antibacterial agents, which were also frequently combined with CPM such as Sanjinpian. Conclusions Text mining provides an objective way to summarize medication patterns for a disease in both TCM and western medicine, offering a useful exploration and reference for clinical practice.
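The combination regularities described above come down to counting which remedies co-occur in the same reports; a minimal sketch with invented records (not data mined from CBM):

```python
from collections import Counter
from itertools import combinations

# Hypothetical treatment records, each the set of drugs used in one report.
records = [
    {"Sanjinpian", "levofloxacin"},
    {"Sanjinpian", "levofloxacin", "Banlangen"},
    {"Sanjinpian", "levofloxacin", "cefixime"},
    {"Banlangen", "levofloxacin"},
]
pair_counts = Counter()
for rec in records:
    for pair in combinations(sorted(rec), 2):  # canonical order for counting
        pair_counts[pair] += 1

top_pair, top_n = pair_counts.most_common(1)[0]
assert top_pair == ("Sanjinpian", "levofloxacin") and top_n == 3
```

The most frequent pairs are the candidate "combined application" patterns that a tool like Cytoscape would then display as a network.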

  13. Biomedical engineering principles

    CERN Document Server

    Ritter, Arthur B; Valdevit, Antonio; Ascione, Alfred N

    2011-01-01

    Introduction: Modeling of Physiological Processes; Cell Physiology and Transport; Principles and Biomedical Applications of Hemodynamics; A Systems Approach to Physiology; The Cardiovascular System; Biomedical Signal Processing; Signal Acquisition and Processing; Techniques for Physiological Signal Processing; Examples of Physiological Signal Processing; Principles of Biomechanics; Practical Applications of Biomechanics; Biomaterials; Principles of Biomedical Capstone Design; Unmet Clinical Needs; Entrepreneurship: Reasons why Most Good Designs Never Get to Market; An Engineering Solution in Search of a Biomedical Problem

  14. Text Data Mining in Biomedical Informatics Based on a Biclustering Method

    Institute of Scientific and Technical Information of China (English)

    于跃; 徐志健; 王坤; 王伟

    2012-01-01

    Using the TDA and gCluto software packages, the authors processed the relevant literature of the past five years retrieved from the SCI database and obtained biclustering matrix diagrams whose analysis reveals the recent research subject areas and hot spots of the field. The conclusion is that biclustering can reflect the development of a given discipline and its research hot spots well, providing valuable information and knowledge for medical researchers; further research and extension of the method is warranted.
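The biclustering step can be caricatured with a stdlib grouping of papers and terms that share identical occurrence patterns; this is a naive stand-in for the matrix biclustering gCluto performs, with an invented toy matrix:

```python
# Binary document-term matrix with block structure (toy, hypothetical labels).
matrix = [
    [1, 1, 0, 0],   # paper A: imaging terms
    [1, 1, 0, 0],   # paper B
    [0, 0, 1, 1],   # paper C: genomics terms
    [0, 0, 1, 1],   # paper D
]
# Group rows (papers) by their term pattern ...
row_groups = {}
for i, row in enumerate(matrix):
    row_groups.setdefault(tuple(row), []).append(i)
# ... and columns (terms) by their paper pattern, simultaneously.
col_groups = {}
for j in range(len(matrix[0])):
    col = tuple(matrix[i][j] for i in range(len(matrix)))
    col_groups.setdefault(col, []).append(j)

assert sorted(row_groups.values()) == [[0, 1], [2, 3]]
assert sorted(col_groups.values()) == [[0, 1], [2, 3]]
```

Each (row group, column group) pair corresponds to one block in the biclustering matrix diagram, i.e. a candidate research topic.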

  15. Implementation of Paste Backfill Mining Technology in Chinese Coal Mines

    Directory of Open Access Journals (Sweden)

    Qingliang Chang

    2014-01-01

    Full Text Available Implementation of clean mining technology at coal mines is crucial to protecting the environment and maintaining balance among energy resources, consumption, and ecology. After reviewing present coal clean mining technology, we introduce the principles and technological process of paste backfill mining in coal mines and discuss the components and features of backfill materials, the constitution of the backfill system, and the backfill process. Specific implementation of this technology is analyzed for paste backfill mining in Daizhuang Coal Mine; the practical implementation shows that paste backfill mining can improve the safety and excavation rate of coal mining and, by utilizing solid waste such as coal gangue as a resource, effectively resolve the surface subsidence problems caused by underground mining activities. Paste backfill mining is therefore an effective clean coal mining technology with widespread applicability.

  16. UKPMC: a full text article resource for the life sciences.

    Science.gov (United States)

    McEntyre, Johanna R; Ananiadou, Sophia; Andrews, Stephen; Black, William J; Boulderstone, Richard; Buttery, Paula; Chaplin, David; Chevuru, Sandeepreddy; Cobley, Norman; Coleman, Lee-Ann; Davey, Paul; Gupta, Bharti; Haji-Gholam, Lesley; Hawkins, Craig; Horne, Alan; Hubbard, Simon J; Kim, Jee-Hyub; Lewin, Ian; Lyte, Vic; MacIntyre, Ross; Mansoor, Sami; Mason, Linda; McNaught, John; Newbold, Elizabeth; Nobata, Chikashi; Ong, Ernest; Pillai, Sharmila; Rebholz-Schuhmann, Dietrich; Rosie, Heather; Rowbotham, Rob; Rupp, C J; Stoehr, Peter; Vaughan, Philip

    2011-01-01

    UK PubMed Central (UKPMC) is a full-text article database that extends the functionality of the original PubMed Central (PMC) repository. The UKPMC project was launched as the first 'mirror' site to PMC, which, in analogy to the International Nucleotide Sequence Database Collaboration, aims to provide international preservation of the open and free-access biomedical literature. UKPMC (http://ukpmc.ac.uk) has undergone considerable development since its inception in 2007 and now includes both a UKPMC and PubMed search, as well as access to other records such as Agricola, Patents and recent biomedical theses. UKPMC also differs from PubMed/PMC in that the full text and abstract information can be searched in an integrated manner from one input box. Furthermore, UKPMC contains 'Cited By' information as an alternative way to navigate the literature and has incorporated text-mining approaches to semantically enrich content and integrate it with related database resources. Finally, UKPMC also offers added-value services (UKPMC+) that enable grantees to deposit manuscripts, link papers to grants, publish online portfolios and view citation information on their papers. Here we describe UKPMC and clarify the relationship between PMC and UKPMC, providing historical context and future directions, 10 years on from when PMC was first launched. PMID:21062818

  17. Exploring subdomain variation in biomedical language

    Directory of Open Access Journals (Sweden)

    Séaghdha Diarmuid Ó

    2011-05-01

    Full Text Available Abstract Background Applications of Natural Language Processing (NLP) technology to biomedical texts have generated significant interest in recent years. In this paper we identify and investigate the phenomenon of linguistic subdomain variation within the biomedical domain, i.e., the extent to which different subject areas of biomedicine are characterised by different linguistic behaviour. While variation at a coarser domain level such as between newswire and biomedical text is well-studied and known to affect the portability of NLP systems, we are the first to conduct an extensive investigation into more fine-grained levels of variation. Results Using the large OpenPMC text corpus, which spans the many subdomains of biomedicine, we investigate variation across a number of lexical, syntactic, semantic and discourse-related dimensions. These dimensions are chosen for their relevance to the performance of NLP systems. We use clustering techniques to analyse commonalities and distinctions among the subdomains. Conclusions We find that while patterns of inter-subdomain variation differ somewhat from one feature set to another, robust clusters can be identified that correspond to intuitive distinctions such as that between clinical and laboratory subjects. In particular, subdomains relating to genetics and molecular biology, which are the most common sources of material for training and evaluating biomedical NLP tools, are not representative of all biomedical subdomains. We conclude that an awareness of subdomain variation is important when considering the practical use of language processing applications by biomedical researchers.
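The kind of lexical comparison underlying such clustering can be sketched with cosine similarity over per-subdomain word-frequency profiles; the subdomain labels and numbers below are invented for illustration:

```python
import math

# Toy lexical profiles: relative frequencies of three marker words for
# three hypothetical subdomains (invented data).
profiles = {
    "genetics":  [0.30, 0.40, 0.05],
    "molecular": [0.28, 0.42, 0.06],
    "clinical":  [0.05, 0.05, 0.60],
}

def cosine(u, v):
    """Cosine similarity between two frequency vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

# Genetics and molecular biology cluster together; clinical stands apart.
assert (cosine(profiles["genetics"], profiles["molecular"])
        > cosine(profiles["genetics"], profiles["clinical"]))
```

In the paper's terms, a clustering algorithm run over such similarities would recover the clinical-versus-laboratory split.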

  18. Biomedical devices and their applications

    CERN Document Server

    2004-01-01

    This volume introduces readers to the basic concepts and recent advances in the field of biomedical devices. The text gives a detailed account of novel developments in drug delivery, protein electrophoresis, estrogen mimicking methods and medical devices. It also provides the necessary theoretical background as well as describing a wide range of practical applications. The level and style make this book accessible not only to scientific and medical researchers but also to graduate students.

  19. An Improved Naive Bayes Clustering Algorithm for Web Text Classification and Mining

    Institute of Scientific and Technical Information of China (English)

    高胜利

    2012-01-01

    Based on a detailed analysis of the characteristics of Web data, this paper builds on the traditional Bayesian clustering algorithm and uses web-page markup to compensate effectively for the shortcomings of the naive Bayes algorithm, applying the improved method to text classification. Experimental results show that the method can classify text effectively.
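As a baseline for the classifier being improved, a minimal multinomial naive Bayes text classifier with Laplace smoothing can be written in pure Python (toy training documents, not Web data):

```python
import math
from collections import Counter, defaultdict

def train(docs):
    """docs: list of (label, tokens). Returns priors, per-class counts, vocab."""
    labels = Counter(lab for lab, _ in docs)
    words = defaultdict(Counter)
    vocab = set()
    for lab, toks in docs:
        words[lab].update(toks)
        vocab.update(toks)
    return labels, words, vocab

def classify(labels, words, vocab, toks):
    total = sum(labels.values())
    best, best_lp = None, -math.inf
    for lab in labels:
        lp = math.log(labels[lab] / total)                 # class prior
        denom = sum(words[lab].values()) + len(vocab)
        for t in toks:
            lp += math.log((words[lab][t] + 1) / denom)    # Laplace smoothing
        if lp > best_lp:
            best, best_lp = lab, lp
    return best

docs = [
    ("sport", ["match", "goal", "team"]),
    ("sport", ["team", "win"]),
    ("tech", ["code", "web", "server"]),
    ("tech", ["web", "data"]),
]
model = train(docs)
assert classify(*model, ["team", "goal"]) == "sport"
assert classify(*model, ["web", "code"]) == "tech"
```

The paper's improvement would then weight tokens by the HTML markup they appear in (title, heading, body) before feeding them to such a classifier.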

  20. The Protein-Protein Interaction tasks of BioCreative III: classification/ranking of articles and linking bio-ontology concepts to full text

    OpenAIRE

    Krallinger, Martin; Vazquez, Miguel; Leitner, Florian; Salgado, David; Chatr-aryamontri, Andrew; Winter, Andrew; Perfetto, Livia; Briganti, Leonardo; Licata, Luana; Iannuccelli, Marta; Castagnoli, Luisa; Cesareni, Gianni; Tyers, Mike; Schneider, Gerold; Rinaldi, Fabio

    2011-01-01

    Background Determining usefulness of biomedical text mining systems requires realistic task definition and data selection criteria without artificial constraints, measuring performance aspects that go beyond traditional metrics. The BioCreative III Protein-Protein Interaction (PPI) tasks were motivated by such considerations, trying to address aspects including how the end user would oversee the generated output, for instance by providing ranked results, textual evidence for human interpretat...

  1. Rewriting and suppressing UMLS terms for improved biomedical term identification

    Directory of Open Access Journals (Sweden)

    Hettne Kristina M

    2010-03-01

    Full Text Available Abstract Background Identification of terms is essential for biomedical text mining. We concentrate here on the use of vocabularies for term identification, specifically the Unified Medical Language System (UMLS). To make the UMLS more suitable for biomedical text mining we implemented and evaluated nine term rewrite and eight term suppression rules. The rules rely on UMLS properties that have been identified in previous work by others, together with an additional set of new properties discovered by our group during our work with the UMLS. Our work complements the earlier work in that we measure the impact on the number of terms identified by the different rules on a MEDLINE corpus. The number of uniquely identified terms and their frequency in MEDLINE were computed before and after applying the rules. The 50 most frequently found terms together with a sample of 100 randomly selected terms were evaluated for every rule. Results Five of the nine rewrite rules were found to generate additional synonyms and spelling variants that correctly corresponded to the meaning of the original terms and seven out of the eight suppression rules were found to suppress only undesired terms. Using the five rewrite rules that passed our evaluation, we were able to identify 1,117,772 new occurrences of 14,784 rewritten terms in MEDLINE. Without the rewriting, we recognized 651,268 terms belonging to 397,414 concepts; with rewriting, we recognized 666,053 terms belonging to 410,823 concepts, which is an increase of 2.8% in the number of terms and an increase of 3.4% in the number of concepts recognized. Using the seven suppression rules, a total of 257,118 undesired terms were suppressed in the UMLS, notably decreasing its size. 7,397 terms were suppressed in the corpus. Conclusions We recommend applying the five rewrite rules and seven suppression rules that passed our evaluation when the UMLS is to be used for biomedical term identification in MEDLINE.
A software
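The flavour of such rewrite and suppression rules can be illustrated with two invented examples (not the authors' actual rule set): a rewrite that strips a trailing bracketed qualifier and undoes comma inversion, and a suppression test for strings too short to match safely:

```python
import re

def rewrite(term):
    """Return the term plus illustrative rewritten variants."""
    variants = {term}
    # strip a trailing bracketed semantic qualifier: "Cold (Disease)" -> "Cold"
    stripped = re.sub(r"\s*\([^)]*\)$", "", term)
    variants.add(stripped)
    # undo comma inversion: "Failure, Heart" -> "Heart Failure"
    m = re.fullmatch(r"(\w+), (\w+)", stripped)
    if m:
        variants.add(f"{m.group(2)} {m.group(1)}")
    return variants

def suppress(term):
    # suppress very short or purely numeric strings, which match too loosely
    return len(term) < 3 or term.isdigit()

assert "Heart Failure" in rewrite("Failure, Heart (Disease)")
assert suppress("mg") and not suppress("aspirin")
```

Rewriting enlarges the set of surface forms a term matcher can recognize, while suppression removes entries that would otherwise cause spurious matches.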

  2. Biomedical applications engineering tasks

    Science.gov (United States)

    Laenger, C. J., Sr.

    1976-01-01

    The engineering tasks performed in response to needs articulated by clinicians are described. Initial contacts were made with these clinician-technology requestors by the Southwest Research Institute NASA Biomedical Applications Team. The basic purpose of the program was to effectively transfer aerospace technology into functional hardware to solve real biomedical problems.

  3. MedlineRanker: flexible ranking of biomedical literature.

    Science.gov (United States)

    Fontaine, Jean-Fred; Barbosa-Silva, Adriano; Schaefer, Martin; Huska, Matthew R; Muro, Enrique M; Andrade-Navarro, Miguel A

    2009-07-01

    The biomedical literature is represented by millions of abstracts available in the Medline database. These abstracts can be queried with the PubMed interface, which provides a keyword-based Boolean search engine. This approach shows limitations in the retrieval of abstracts related to very specific topics, as it is difficult for a non-expert user to find all of the most relevant keywords related to a biomedical topic. Additionally, when searching for more general topics, the same approach may return hundreds of unranked references. To address these issues, text mining tools have been developed to help scientists focus on relevant abstracts. We have implemented the MedlineRanker webserver, which allows a flexible ranking of Medline for a topic of interest without expert knowledge. Given some abstracts related to a topic, the program deduces automatically the most discriminative words in comparison to a random selection. These words are used to score other abstracts, including those from not yet annotated recent publications, which can be then ranked by relevance. We show that our tool can be highly accurate and that it is able to process millions of abstracts in a practical amount of time. MedlineRanker is free for use and is available at http://cbdm.mdc-berlin.de/tools/medlineranker. PMID:19429696
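The core ranking idea, scoring words by how much more often they appear in topic abstracts than in a background set, can be sketched as follows; the documents and the smoothed frequency ratio are illustrative, not MedlineRanker's exact statistic:

```python
from collections import Counter

# Toy "abstracts": a topic set and a random background set (invented).
topic = ["insulin receptor signaling", "insulin secretion beta cells",
         "insulin resistance pathway"]
background = ["galaxy cluster survey", "insulin pump tubing recall",
              "protein folding dynamics", "market price analysis"]

def word_freqs(docs):
    counts = Counter(w for d in docs for w in d.split())
    return counts, sum(counts.values())

tc, tn = word_freqs(topic)
bc, bn = word_freqs(background)
# Discriminative score: smoothed ratio of topic to background frequency.
score = {w: (tc[w] / tn) / ((bc[w] + 1) / bn) for w in tc}
assert max(score, key=score.get) == "insulin"
```

New abstracts would then be scored by summing the scores of the discriminative words they contain, yielding the ranking the webserver returns.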

  4. Context-Aware Adaptive Hybrid Semantic Relatedness in Biomedical Science

    Science.gov (United States)

    Emadzadeh, Ehsan

    Text mining of biomedical literature and clinical notes is a very active field of research in biomedical science. Semantic analysis is one of the core modules of many Natural Language Processing (NLP) solutions. Methods for calculating the semantic relatedness of two concepts can be very useful in solutions to problems such as relationship extraction, ontology creation and question answering [1-6]. Several techniques exist for calculating the semantic relatedness of two concepts, utilizing different knowledge sources and corpora. So far, researchers have attempted to find the best hybrid method for each domain by combining semantic relatedness techniques and data sources manually. This work attempts to eliminate the need for manually combining semantic relatedness methods for each new context or resource by proposing an automated method that finds the combination of semantic relatedness techniques and resources achieving the best semantic relatedness score in every context. This may help the research community find the best hybrid method for each context given the available algorithms and resources.

  5. Coreference resolution improves extraction of Biological Expression Language statements from texts.

    Science.gov (United States)

    Choi, Miji; Liu, Haibin; Baumgartner, William; Zobel, Justin; Verspoor, Karin

    2016-01-01

    We describe a system that automatically extracts biological events from biomedical journal articles and translates those events into Biological Expression Language (BEL) statements. The system incorporates existing text mining components for coreference resolution and biological event extraction, and a previously formally untested strategy for BEL statement generation. While addressing the BEL track (Track 4) at BioCreative V (2015), we also investigate how incorporating coreference resolution might affect event extraction in the biomedical domain. We report that our system achieved the best performance, with F-scores of 20.2 and 35.2 at the full BEL statement level on stage 1 and on stage 2 (using the provided gold-standard entities), respectively. Our results evaluated on the training dataset also show the benefit of integrating coreference resolution with event extraction. PMID:27374122
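The last step the abstract describes, serialising an extracted event as a BEL statement, reduces to string assembly once the entities and relation are known; the entities below are invented examples:

```python
def to_bel(subj_ns, subj, relation, obj_ns, obj):
    """Render a protein-to-protein event as a BEL statement string."""
    return f"p({subj_ns}:{subj}) {relation} p({obj_ns}:{obj})"

# Hypothetical extracted event: AKT1 increases MTOR (HGNC namespace).
event = ("HGNC", "AKT1", "increases", "HGNC", "MTOR")
assert to_bel(*event) == "p(HGNC:AKT1) increases p(HGNC:MTOR)"
```

The hard part of the pipeline is upstream: event extraction and coreference resolution must supply the correct, normalized entities that fill these slots.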

  6. An Attempt at Data Preprocessing for Text Mining in a TCM Prescription Database

    Institute of Scientific and Technical Information of China (English)

    吴磊; 李舒

    2015-01-01

    Objective To propose a set of data preprocessing methods, centred on data cleaning, for TCM prescription databases, so that the data become more standard, accurate and orderly and more convenient for follow-up processing. Methods Source text data were retrieved from prescription databases using bibliographic search techniques. Non-normalized data were cleaned in steps comprising auxiliary word-group line processing, regular-expression substitution and synonym processing, with the purpose of improving data quality. Results A total of 1758 effective records were retrieved from the TCM prescription database and 91 records from the prescription modern-application database. After preprocessing, 6913 effective Chinese herbal medicine entries were obtained, which could be imported successfully into the information mining system for extraction of prescription names and herb names. Conclusion This method is applicable to text mining and knowledge discovery in TCM prescription databases. It implements data cleaning of the source text, yields data that follow a unified standard and are free of noise, and realizes effective extraction of prescription information, providing a reference for analysis and mining of TCM prescription text data.
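Two of the cleaning steps named above, regular-expression substitution and synonym processing, can be sketched like this; the dosage pattern and the alias table are my own invented stand-ins, not the paper's:

```python
import re

# Hypothetical alias table unifying variant herb names to one canonical form.
SYNONYMS = {"radix glycyrrhizae": "gancao", "licorice root": "gancao"}

def clean(entry):
    """Strip dosage annotations like '15g', then normalise via synonyms."""
    entry = re.sub(r"\d+\s*g\b", "", entry).strip(" ,;")
    return SYNONYMS.get(entry.lower(), entry.lower())

raw = ["Licorice root 15g", "radix glycyrrhizae 9 g", "Huangqi 30g"]
herbs = [clean(e) for e in raw]
assert herbs == ["gancao", "gancao", "huangqi"]
```

After such cleaning, the three raw strings collapse to two herb entries, which is exactly the deduplication the mining system downstream depends on.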

  7. A methodology for semiautomatic extraction of a taxonomy of concepts from nuclear scientific documents using text mining techniques

    Energy Technology Data Exchange (ETDEWEB)

    Braga, Fabiane dos Reis

    2013-07-01

    This thesis presents a text mining method for the semi-automatic extraction of a taxonomy of concepts from a textual corpus composed of scientific papers related to the nuclear area. Text classification is a natural human practice and a crucial task for working with large repositories. Document clustering provides a logical and understandable framework that facilitates organization, browsing and searching. Most clustering algorithms use the bag-of-words model to represent the content of a document; this model generates high-dimensional data, ignores the fact that different words can have the same meaning, and does not consider the relationships between words, assuming that they are independent of each other. The methodology combines a concept-based model of document representation with a hierarchical document clustering method based on the co-occurrence frequency of concepts, and a technique for labelling the most representative clusters, with the objective of producing a taxonomy of concepts that reflects the structure of the knowledge domain. It is hoped that this work will contribute to the conceptual mapping of the scientific production of the nuclear area and thus support the management of research activities in this area. (author)
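The clustering idea, agglomeratively merging concepts by co-occurrence frequency, can be sketched with a stdlib single-linkage loop over a toy concept-document matrix; the concepts are invented for illustration:

```python
from itertools import combinations

# Toy documents, each the set of concepts it mentions (hypothetical).
docs = [
    {"reactor", "neutron", "fuel"},
    {"reactor", "neutron"},
    {"dosimetry", "radiation"},
    {"dosimetry", "radiation", "fuel"},
]

def cooccur(a, b):
    return sum(1 for d in docs if a in d and b in d)

concepts = sorted(set().union(*docs))
clusters = [{c} for c in concepts]
merges = []
while len(clusters) > 1:
    # single linkage: merge the pair of clusters with the strongest link
    i, j = max(combinations(range(len(clusters)), 2),
               key=lambda p: max(cooccur(a, b)
                                 for a in clusters[p[0]] for b in clusters[p[1]]))
    merged = clusters[i] | clusters[j]
    merges.append(merged)
    clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]

# the first merges join the most strongly co-occurring concept pairs
assert {"dosimetry", "radiation"} in merges[:2]
assert {"reactor", "neutron"} in merges[:2]
```

The recorded merge order is the dendrogram; cutting it at a chosen depth and labelling each cluster yields the taxonomy the thesis describes.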

  8. Data mining in radiology

    Directory of Open Access Journals (Sweden)

    Amit T Kharat

    2014-01-01

    Full Text Available Data mining facilitates the study of radiology data in various dimensions. It converts large patient image and text datasets into useful information that helps improve patient care and provides informative reports. Data mining technology analyzes data within the Radiology Information System and the Hospital Information System using specialized software that assesses relationships and agreement in the available information. By using such data analysis tools, radiologists can make informed decisions and predict the future outcome of a particular imaging finding. Data, information and knowledge are the components of data mining; classes, clusters, associations, sequential patterns, classification, prediction and decision trees are its main types. Data mining has the potential to make the delivery of health care affordable and to ensure that the best imaging practices are followed. It is a tool for academic research. Data mining is considered ethically neutral; however, concerns regarding privacy and legality exist and need to be addressed to ensure its success.

  9. Text Analytics to Data Warehousing

    OpenAIRE

    Kalli Srinivasa Nageswara Prasad; S. Ramakrishna

    2010-01-01

    Information hidden or stored in unstructured data can play a critical role in making decisions, understanding and conducting other business functions. Integrating data stored in both structured and unstructured formats can add significant value to an organization. With the extent of development happening in text mining and in technologies for dealing with unstructured and semi-structured data, such as XML and MML (Mining Markup Language), to extract and analyze data, text analytics has evolved to handle un...

  10. Tuning, Diagnostics & Data Preparation for Generalized Linear Models Supervised Algorithm in Data Mining Technologies

    Directory of Open Access Journals (Sweden)

    Sachin Bhaskar

    2015-07-01

    Full Text Available Data mining techniques are the result of a long process of research and product development. Data mining searches large amounts of data to find trends and patterns that go beyond simple analysis, using complex mathematical algorithms to segment data and to evaluate the probability of future events. Each data mining model is produced by a specific algorithm, and some data mining problems are best solved by combining more than one algorithm. Data mining technologies can be used through Oracle. The Generalized Linear Models (GLM) algorithm is used in the regression and classification functions of Oracle Data Mining. GLM is one of the most popular statistical techniques for linear modelling, and Oracle Data Mining implements it for regression and binary classification. GLM provides row diagnostics as well as model statistics and extensive coefficient statistics, and it also supports confidence bounds. This paper outlines and analyses the GLM algorithm to guide understanding of the tuning, diagnostics and data preparation process and of the importance of the supervised regression and classification functions of Oracle Data Mining, which are utilized in marketing, time series prediction, financial forecasting, overall business planning, trend analysis, environmental modelling, biomedical and drug response modelling, etc.
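A GLM for binary classification is logistic regression; the following stdlib-only sketch fitted by gradient ascent is a toy stand-in for what Oracle Data Mining provides, with invented one-feature data:

```python
import math

# Toy one-feature binary data: larger x means class 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [0, 0, 0, 1, 1, 1]

w, b = 0.0, 0.0
lr = 0.5
for _ in range(2000):
    gw = gb = 0.0
    for x, y in zip(xs, ys):
        p = 1.0 / (1.0 + math.exp(-(w * x + b)))  # logistic link function
        gw += (y - p) * x
        gb += (y - p)
    w += lr * gw / len(xs)  # gradient ascent on the log-likelihood
    b += lr * gb / len(xs)

def predict(x):
    return 1.0 / (1.0 + math.exp(-(w * x + b)))

assert predict(0.0) < 0.5 < predict(5.0)
```

A production GLM adds exactly the extras the abstract lists on top of this core fit: coefficient statistics, row diagnostics, and confidence bounds on the predictions.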

  11. Handbook of biomedical optics

    CERN Document Server

    Boas, David A

    2011-01-01

    Biomedical optics holds tremendous promise to deliver effective, safe, non- or minimally invasive diagnostics and targeted, customizable therapeutics. Handbook of Biomedical Optics provides an in-depth treatment of the field, including coverage of applications for biomedical research, diagnosis, and therapy. It introduces the theory and fundamentals of each subject, ensuring accessibility to a wide multidisciplinary readership. It also offers a view of the state of the art and discusses advantages and disadvantages of various techniques.Organized into six sections, this handbook: Contains intr

  12. Biomedical engineering fundamentals

    CERN Document Server

    Bronzino, Joseph D; Bronzino, Joseph D

    2006-01-01

    Over the last century, medicine has come out of the "black bag" and emerged as one of the most dynamic and advanced fields of development in science and technology. Today, biomedical engineering plays a critical role in patient diagnosis, care, and rehabilitation. As such, the field encompasses a wide range of disciplines, from biology and physiology to informatics and signal processing. Reflecting the enormous growth and change in biomedical engineering during the infancy of the 21st century, The Biomedical Engineering Handbook enters its third edition as a set of three carefully focused and

  13. Biomedical Engineering Desk Reference

    CERN Document Server

    Ratner, Buddy D; Schoen, Frederick J; Lemons, Jack E; Dyro, Joseph; Martinsen, Orjan G; Kyle, Richard; Preim, Bernhard; Bartz, Dirk; Grimnes, Sverre; Vallero, Daniel; Semmlow, John; Murray, W Bosseau; Perez, Reinaldo; Bankman, Isaac; Dunn, Stanley; Ikada, Yoshito; Moghe, Prabhas V; Constantinides, Alkis

    2009-01-01

    A one-stop Desk Reference, for Biomedical Engineers involved in the ever expanding and very fast moving area; this is a book that will not gather dust on the shelf. It brings together the essential professional reference content from leading international contributors in the biomedical engineering field. Material covers a broad range of topics including: Biomechanics and Biomaterials; Tissue Engineering; and Biosignal Processing. * A hard-working desk reference providing all the essential material needed by biomedical and clinical engineers on a day-to-day basis * Fundamentals, key techniques,

  14. Biomedical applications of polymers

    CERN Document Server

    Gebelein, C G

    1991-01-01

    The biomedical applications of polymers span an extremely wide spectrum of uses, including artificial organs, skin and soft tissue replacements, orthopaedic applications, dental applications, and controlled release of medications. No single, short review can possibly cover all these items in detail, and dozens of books and hundreds of reviews exist on biomedical polymers. Only a few relatively recent examples will be cited here; additional reviews are listed under most of the major topics in this book. We will consider each of the major classifications of biomedical polymers to some extent, inclu

  15. Powering biomedical devices

    CERN Document Server

    Romero, Edwar

    2013-01-01

    From exoskeletons to neural implants, biomedical devices are no less than life-changing. Compact and constant power sources are necessary to keep these devices running efficiently. Edwar Romero's Powering Biomedical Devices reviews the background, current technologies, and possible future developments of these power sources, examining not only the types of biomedical power sources available (macro, mini, MEMS, and nano), but also what they power (such as prostheses, insulin pumps, and muscular and neural stimulators), and how they work (covering batteries, biofluids, kinetic and ther

  16. Introducing Text Analytics as a Graduate Business School Course

    Science.gov (United States)

    Edgington, Theresa M.

    2011-01-01

    Text analytics refers to the process of analyzing unstructured data from documented sources, including open-ended surveys, blogs, and other types of web dialog. Text analytics has enveloped the concept of text mining, an analysis approach influenced heavily by data mining. While text mining has been covered extensively in various computer…

  17. Automatically Generated Model of a Book Acquisition Recommendation List Using Text Mining

    Institute of Scientific and Technical Information of China (English)

    张成林

    2013-01-01

    For most libraries, the number of acquisition and cataloguing librarians is limited, and acquiring books through traditional manual statistical forms rarely satisfies readers' needs. Using Web text mining technology to extract the keywords from readers' historical queries, this paper constructs a model that automatically generates a book acquisition recommendation list. Experimental data show that the model achieves good recall and precision rates and can effectively generate the list automatically.
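The keyword-extraction core of such a model can be sketched by ranking terms from a reader query log; the queries below are invented:

```python
from collections import Counter

# Hypothetical reader query log from the library catalogue.
queries = [
    "machine learning", "deep learning", "machine learning",
    "organic chemistry", "machine learning python",
]
counts = Counter(w for q in queries for w in q.split())
ranked = [w for w, _ in counts.most_common()]
assert ranked[0] == "learning" and counts["learning"] == 4
```

The top-ranked keywords, matched against vendor catalogues, would seed the recommendation list for acquisition librarians to review.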

  18. Text Mining-based Consistency of Product Reviews in Different Shopping Websites

    Institute of Scientific and Technical Information of China (English)

    施国良; 石桥峰

    2011-01-01

    Based on the theory of text mining, this paper puts forward a method for the comparative analysis of product reviews on different shopping websites, and studies whether the reviews of the same product on different websites are consistent. The analysis first compares reviews of individual product features and then extends to the overall feature profile of the product. The study finds that reviews of the same product on different shopping websites are not completely consistent, and that this inconsistency is mainly reflected in product features, which indicates that product reviews differ across shopping websites.
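A crude version of the feature-level consistency comparison is the Jaccard overlap between the sets of product features mentioned on each site; the feature sets below are invented:

```python
# Hypothetical per-site sets of product features extracted from reviews.
site_a = {"battery", "screen", "price", "camera"}
site_b = {"battery", "screen", "shipping"}

# Jaccard overlap as a simple consistency measure between two shops.
jaccard = len(site_a & site_b) / len(site_a | site_b)
assert abs(jaccard - 2 / 5) < 1e-9
```

A full implementation would also compare per-feature sentiment, not just which features are mentioned, before declaring two sites inconsistent.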

  19. Research and Application of a Text Mining Algorithm in an SSH-based Web Platform

    Institute of Scientific and Technical Information of China (English)

    王钊

    2015-01-01

    With the rapid development of Internet technology, we have entered the era of massive data. People can obtain vast amounts of information through the Internet, but this also brings difficulties, for example, how to search such information accurately; retrieving the desired information from massive data quickly and accurately is particularly important. This paper therefore studies a text mining algorithm for an information platform and applies it on a Web platform based on SSH (Struts + Spring + Hibernate). Through this application, users can search for the information they want more accurately and quickly, so the work has practical significance and research value.
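The search component such a platform needs is typically built on an inverted index; a minimal stdlib sketch with toy documents (not the paper's implementation):

```python
from collections import defaultdict

# Toy document store: id -> text.
docs = {1: "java web framework", 2: "text mining algorithm", 3: "web text search"}

# Inverted index: word -> set of document ids containing it.
index = defaultdict(set)
for doc_id, text in docs.items():
    for word in text.split():
        index[word].add(doc_id)

def search(query):
    """AND-semantics search: documents containing every query word."""
    sets = [index[w] for w in query.split()]
    return set.intersection(*sets) if sets else set()

assert search("web text") == {3}
```

In the SSH stack, such an index would live behind the service layer, with Struts actions delegating query parsing and Hibernate supplying the document records.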

  20. Sensors for biomedical applications

    NARCIS (Netherlands)

    Bergveld, Piet

    1986-01-01

    This paper considers the impact during the last decade of modern IC technology, microelectronics, thin- and thick-film technology, fibre optic technology, etc. on the development of sensors for biomedical applications.