WorldWideScience

Sample records for web-based text mining

  1. Text mining of web-based medical content

    CERN Document Server

    Neustein, Amy

    2014-01-01

    Text Mining of Web-Based Medical Content examines web mining for extracting useful information that can be used for treating and monitoring the healthcare of patients. This work provides methodological approaches to designing mapping tools that exploit data found in social media postings. Specific linguistic features of medical postings are analyzed vis-a-vis available data extraction tools for culling useful information.

  2. OntoGene web services for biomedical text mining.

    Science.gov (United States)

    Rinaldi, Fabio; Clematide, Simon; Marques, Hernani; Ellendorff, Tilia; Romacker, Martin; Rodriguez-Esteban, Raul

    2014-01-01

    Text mining services are rapidly becoming a crucial component of various knowledge management pipelines, for example in the process of database curation, or for exploration and enrichment of biomedical data within the pharmaceutical industry. Traditional architectures, based on monolithic applications, do not offer sufficient flexibility for a wide range of use case scenarios, and therefore open architectures, as provided by web services, are attracting increased interest. We present an approach towards providing advanced text mining capabilities through web services, using a recently proposed standard for textual data interchange (BioC). The web services leverage a state-of-the-art platform for text mining (OntoGene) which has been tested in several community-organized evaluation challenges,with top ranked results in several of them.

  3. Beyond accuracy: creating interoperable and scalable text-mining web services.

    Science.gov (United States)

    Wei, Chih-Hsuan; Leaman, Robert; Lu, Zhiyong

    2016-06-15

    The biomedical literature is a knowledge-rich resource and an important foundation for future research. With over 24 million articles in PubMed and an increasing growth rate, research in automated text processing is becoming increasingly important. We report here our recently developed web-based text mining services for biomedical concept recognition and normalization. Unlike most text-mining software tools, our web services integrate several state-of-the-art entity tagging systems (DNorm, GNormPlus, SR4GN, tmChem and tmVar) and offer a batch-processing mode able to process arbitrary text input (e.g. scholarly publications, patents and medical records) in multiple formats (e.g. BioC). We support multiple standards to make our service interoperable and allow simpler integration with other text-processing pipelines. To maximize scalability, we have preprocessed all PubMed articles, and use a computer cluster for processing large requests of arbitrary text. Our text-mining web service is freely available at http://www.ncbi.nlm.nih.gov/CBBresearch/Lu/Demo/tmTools/#curl : Zhiyong.Lu@nih.gov. Published by Oxford University Press 2016. This work is written by US Government employees and is in the public domain in the US.

  4. Web services-based text-mining demonstrates broad impacts for interoperability and process simplification.

    Science.gov (United States)

    Wiegers, Thomas C; Davis, Allan Peter; Mattingly, Carolyn J

    2014-01-01

    The Critical Assessment of Information Extraction systems in Biology (BioCreAtIvE) challenge evaluation tasks collectively represent a community-wide effort to evaluate a variety of text-mining and information extraction systems applied to the biological domain. The BioCreative IV Workshop included five independent subject areas, including Track 3, which focused on named-entity recognition (NER) for the Comparative Toxicogenomics Database (CTD; http://ctdbase.org). Previously, CTD had organized document ranking and NER-related tasks for the BioCreative Workshop 2012; a key finding of that effort was that interoperability and integration complexity were major impediments to the direct application of the systems to CTD's text-mining pipeline. This underscored a prevailing problem with software integration efforts. Major interoperability-related issues included lack of process modularity, operating system incompatibility, tool configuration complexity and lack of standardization of high-level inter-process communications. One approach to potentially mitigate interoperability and general integration issues is the use of Web services to abstract implementation details; rather than integrating NER tools directly, HTTP-based calls from CTD's asynchronous, batch-oriented text-mining pipeline could be made to remote NER Web services for recognition of specific biological terms using BioC (an emerging family of XML formats) for inter-process communications. To test this concept, participating groups developed Representational State Transfer /BioC-compliant Web services tailored to CTD's NER requirements. Participants were provided with a comprehensive set of training materials. CTD evaluated results obtained from the remote Web service-based URLs against a test data set of 510 manually curated scientific articles. Twelve groups participated in the challenge. Recall, precision, balanced F-scores and response times were calculated. Top balanced F-scores for gene, chemical and

  5. WEB STRUCTURE MINING

    Directory of Open Access Journals (Sweden)

    CLAUDIA ELENA DINUCĂ

    2011-01-01

    Full Text Available The World Wide Web became one of the most valuable resources for information retrievals and knowledge discoveries due to the permanent increasing of the amount of data available online. Taking into consideration the web dimension, the users get easily lost in the web’s rich hyper structure. Application of data mining methods is the right solution for knowledge discovery on the Web. The knowledge extracted from the Web can be used to raise the performances for Web information retrievals, question answering and Web based data warehousing. In this paper, I provide an introduction of Web mining categories and I focus on one of these categories: the Web structure mining. Web structure mining, one of three categories of web mining for data, is a tool used to identify the relationship between Web pages linked by information or direct link connection. It offers information about how different pages are linked together to form this huge web. Web Structure Mining finds hidden basic structures and uses hyperlinks for more web applications such as web search.

  6. The Voice of Chinese Health Consumers: A Text Mining Approach to Web-Based Physician Reviews.

    Science.gov (United States)

    Hao, Haijing; Zhang, Kunpeng

    2016-05-10

    Many Web-based health care platforms allow patients to evaluate physicians by posting open-end textual reviews based on their experiences. These reviews are helpful resources for other patients to choose high-quality doctors, especially in countries like China where no doctor referral systems exist. Analyzing such a large amount of user-generated content to understand the voice of health consumers has attracted much attention from health care providers and health care researchers. The aim of this paper is to automatically extract hidden topics from Web-based physician reviews using text-mining techniques to examine what Chinese patients have said about their doctors and whether these topics differ across various specialties. This knowledge will help health care consumers, providers, and researchers better understand this information. We conducted two-fold analyses on the data collected from the "Good Doctor Online" platform, the largest online health community in China. First, we explored all reviews from 2006-2014 using descriptive statistics. Second, we applied the well-known topic extraction algorithm Latent Dirichlet Allocation to more than 500,000 textual reviews from over 75,000 Chinese doctors across four major specialty areas to understand what Chinese health consumers said online about their doctor visits. On the "Good Doctor Online" platform, 112,873 out of 314,624 doctors had been reviewed at least once by April 11, 2014. Among the 772,979 textual reviews, we chose to focus on four major specialty areas that received the most reviews: Internal Medicine, Surgery, Obstetrics/Gynecology and Pediatrics, and Chinese Traditional Medicine. Among the doctors who received reviews from those four medical specialties, two-thirds of them received more than two reviews and in a few extreme cases, some doctors received more than 500 reviews. Across the four major areas, the most popular topics reviewers found were the experience of finding doctors, doctors' technical

  7. Interactive text mining with Pipeline Pilot: a bibliographic web-based tool for PubMed.

    Science.gov (United States)

    Vellay, S G P; Latimer, N E Miller; Paillard, G

    2009-06-01

    Text mining has become an integral part of all research in the medical field. Many text analysis software platforms support particular use cases and only those. We show an example of a bibliographic tool that can be used to support virtually any use case in an agile manner. Here we focus on a Pipeline Pilot web-based application that interactively analyzes and reports on PubMed search results. This will be of interest to any scientist to help identify the most relevant papers in a topical area more quickly and to evaluate the results of query refinement. Links with Entrez databases help both the biologist and the chemist alike. We illustrate this application with Leishmaniasis, a neglected tropical disease, as a case study.

  8. Applying Web usage mining for personalizing hyperlinks in Web-based adaptive educational systems

    NARCIS (Netherlands)

    Romero, C.; Ventura, S.; Zafra, A.; Bra, de P.M.E.

    2009-01-01

    Nowadays, the application of Web mining techniques in e-learning and Web-based adaptive educational systems is increasing exponentially. In this paper, we propose an advanced architecture for a personalization system to facilitate Web mining. A specific Web mining tool is developed and a recommender

  9. Applying Web Usage Mining for Personalizing Hyperlinks in Web-Based Adaptive Educational Systems

    Science.gov (United States)

    Romero, Cristobal; Ventura, Sebastian; Zafra, Amelia; de Bra, Paul

    2009-01-01

    Nowadays, the application of Web mining techniques in e-learning and Web-based adaptive educational systems is increasing exponentially. In this paper, we propose an advanced architecture for a personalization system to facilitate Web mining. A specific Web mining tool is developed and a recommender engine is integrated into the AHA! system in…

  10. Using Open Web APIs in Teaching Web Mining

    Science.gov (United States)

    Chen, Hsinchun; Li, Xin; Chau, M.; Ho, Yi-Jen; Tseng, Chunju

    2009-01-01

    With the advent of the World Wide Web, many business applications that utilize data mining and text mining techniques to extract useful business information on the Web have evolved from Web searching to Web mining. It is important for students to acquire knowledge and hands-on experience in Web mining during their education in information systems…

  11. Web based parallel/distributed medical data mining using software agents

    Energy Technology Data Exchange (ETDEWEB)

    Kargupta, H.; Stafford, B.; Hamzaoglu, I.

    1997-12-31

    This paper describes an experimental parallel/distributed data mining system PADMA (PArallel Data Mining Agents) that uses software agents for local data accessing and analysis and a web based interface for interactive data visualization. It also presents the results of applying PADMA for detecting patterns in unstructured texts of postmortem reports and laboratory test data for Hepatitis C patients.

  12. Cluo: Web-Scale Text Mining System For Open Source Intelligence Purposes

    Directory of Open Access Journals (Sweden)

    Przemyslaw Maciolek

    2013-01-01

    Full Text Available The amount of textual information published on the Internet is considered tobe in billions of web pages, blog posts, comments, social media updates andothers. Analyzing such quantities of data requires high level of distribution –both data and computing. This is especially true in case of complex algorithms,often used in text mining tasks.The paper presents a prototype implementation of CLUO – an Open SourceIntelligence (OSINT system, which extracts and analyzes significant quantitiesof openly available information.

  13. Web Mining

    Science.gov (United States)

    Fürnkranz, Johannes

    The World-Wide Web provides every internet citizen with access to an abundance of information, but it becomes increasingly difficult to identify the relevant pieces of information. Research in web mining tries to address this problem by applying techniques from data mining and machine learning to Web data and documents. This chapter provides a brief overview of web mining techniques and research areas, most notably hypertext classification, wrapper induction, recommender systems and web usage mining.

  14. Web Mining of Hotel Customer Survey Data

    Directory of Open Access Journals (Sweden)

    Richard S. Segall

    2008-12-01

    Full Text Available This paper provides an extensive literature review and list of references on the background of web mining as applied specifically to hotel customer survey data. This research applies the techniques of web mining to actual text of written comments for hotel customers using Megaputer PolyAnalyst®. Web mining functionalities utilized include those such as clustering, link analysis, key word and phrase extraction, taxonomy, and dimension matrices. This paper provides screen shots of the web mining applications using Megaputer PolyAnalyst®. Conclusions and future directions of the research are presented.

  15. ANALYSIS OF WEB MINING APPLICATIONS AND BENEFICIAL AREAS

    Directory of Open Access Journals (Sweden)

    Khaleel Ahmad

    2011-10-01

    Full Text Available The main purpose of this paper is to study the process of Web mining techniques, features, application ( e-commerce and e-business and its beneficial areas. Web mining has become more popular and its widely used in varies application areas (such as business intelligent system, e-commerce and e-business. The e-commerce or e-business results are bettered by the application of the mining techniques such as data mining and text mining, among all the mining techniques web mining is better.

  16. Mining Web-based Educational Systems to Predict Student Learning Achievements

    Directory of Open Access Journals (Sweden)

    José del Campo-Ávila

    2015-03-01

    Full Text Available Educational Data Mining (EDM is getting great importance as a new interdisciplinary research field related to some other areas. It is directly connected with Web-based Educational Systems (WBES and Data Mining (DM, a fundamental part of Knowledge Discovery in Databases. The former defines the context: WBES store and manage huge amounts of data. Such data are increasingly growing and they contain hidden knowledge that could be very useful to the users (both teachers and students. It is desirable to identify such knowledge in the form of models, patterns or any other representation schema that allows a better exploitation of the system. The latter reveals itself as the tool to achieve such discovering. Data mining must afford very complex and different situations to reach quality solutions. Therefore, data mining is a research field where many advances are being done to accommodate and solve emerging problems. For this purpose, many techniques are usually considered. In this paper we study how data mining can be used to induce student models from the data acquired by a specific Web-based tool for adaptive testing, called SIETTE. Concretely we have used top down induction decision trees algorithms to extract the patterns because these models, decision trees, are easily understandable. In addition, the conducted validation processes have assured high quality models.

  17. Integration of Web mining and web crawler: Relevance and State of Art

    OpenAIRE

    Subhendu kumar pani; Deepak Mohapatra,; Bikram Keshari Ratha

    2010-01-01

    This study presents the role of web crawler in web mining environment. As the growth of the World Wide Web exceeded all expectations,the research on Web mining is growing more and more.web mining research topic which combines two of the activated research areas: Data Mining and World Wide Web .So, the World Wide Web is a very advanced area for data mining research. Search engines that are based on web crawling framework also used in web mining to find theinteracted web pages. This paper discu...

  18. USING WEB MINING IN E-COMMERCE APPLICATIONS

    Directory of Open Access Journals (Sweden)

    Claudia Elena Dinucă

    2011-09-01

    Full Text Available Nowadays, the web is an important part of our daily life. The web is now the best medium of doing business. Large companies rethink their business strategy using the web to improve business. Business carried on the Web offers the opportunity to potential customers or partners where their products and specific business can be found. Business presence through a company web site has several advantages as it breaks the barrier of time and space compared with the existence of a physical office. To differentiate through the Internet economy, winning companies have realized that e-commerce transactions is more than just buying / selling, appropriate strategies are key to improve competitive power. One effective technique used for this purpose is data mining. Data mining is the process of extracting interesting knowledge from data. Web mining is the use of data mining techniques to extract information from web data. This article presents the three components of web mining: web usage mining, web structure mining and web content mining.

  19. Is Toscana A Formal Concept Analysis Based Solution In Web Usage Mining?

    Directory of Open Access Journals (Sweden)

    Dan-Andrei SITAR-TĂUT

    2012-01-01

    Full Text Available Analyzing large amount of data come from web logs represents a complex, but challenging nowadays problem with implication in various fields, thing that lets open a way for theoretically infinite approaches an implementations. The main goal of our paper represents the possibility of applying the formal concept analysis as viable solution of sustaining the web mining process, based on a technological open-source solution called TOSCANA.

  20. A STUDY OF TEXT MINING METHODS, APPLICATIONS,AND TECHNIQUES

    OpenAIRE

    R. Rajamani*1 & S. Saranya2

    2017-01-01

    Data mining is used to extract useful information from the large amount of data. It is used to implement and solve different types of research problems. The research related areas in data mining are text mining, web mining, image mining, sequential pattern mining, spatial mining, medical mining, multimedia mining, structure mining and graph mining. Text mining also referred to text of data mining, it is also called knowledge discovery in text (KDT) or knowledge of intelligent text analysis. T...

  1. AN INNOVATIVE WEB MINING APPLICATION ON BLOGS - A LAYOUT

    Directory of Open Access Journals (Sweden)

    S. Prakash

    2012-01-01

    Full Text Available Blogs and Web services agree to express user’s opinions and interests, in the form of small text messages which gives abbreviated and highly personalized remarks in real-time. Recognizing emotion is really significant for a text-based communication tool such as blogs. Nowadays, user opinions in the structure of comments, reviews in blogs have been utilized by researchers for various purposes. Among them the application of sentiment analysis techniques to these opinions is an interesting one. This paper deals with a proposal of a software structural design for constructing Web mining applications in the blog world. The design includes blog crawling and data mining algorithms, to offer a full-fledged and flexible key for constructing general-purpose Web mining applications. The structural design allocates some significant customizations, such as the construction of adapters for reading text from different blogs, and the utilization of different pre-processing methods and data mining procedures. The core of this paper is on explaining the innovative software structural design of the general framework offering thorough information about the data mining sub-framework.

  2. WEB STRUCTURE MINING USING PAGERANK, IMPROVED PAGERANK – AN OVERVIEW

    Directory of Open Access Journals (Sweden)

    V. Lakshmi Praba

    2011-03-01

    Full Text Available Web Mining is the extraction of interesting and potentially useful patterns and information from Web. It includes Web documents, hyperlinks between documents, and usage logs of web sites. The significant task for web mining can be listed out as Information Retrieval, Information Selection / Extraction, Generalization and Analysis. Web information retrieval tools consider only the text on pages and ignore information in the links. The goal of Web structure mining is to explore structural summary about web. Web structure mining focusing on link information is an important aspect of web data. This paper presents an overview of the PageRank, Improved Page Rank and its working functionality in web structure mining.

  3. Introduction to the JASIST Special Topic Issue on Web Retrieval and Mining: A Machine Learning Perspective.

    Science.gov (United States)

    Chen, Hsinchun

    2003-01-01

    Discusses information retrieval techniques used on the World Wide Web. Topics include machine learning in information extraction; relevance feedback; information filtering and recommendation; text classification and text clustering; Web mining, based on data mining techniques; hyperlink structure; and Web size. (LRW)

  4. World Wide Web Usage Mining Systems and Technologies

    Directory of Open Access Journals (Sweden)

    Wen-Chen Hu

    2003-08-01

    Full Text Available Web usage mining is used to discover interesting user navigation patterns and can be applied to many real-world problems, such as improving Web sites/pages, making additional topic or product recommendations, user/customer behavior studies, etc. This article provides a survey and analysis of current Web usage mining systems and technologies. A Web usage mining system performs five major tasks: i data gathering, ii data preparation, iii navigation pattern discovery, iv pattern analysis and visualization, and v pattern applications. Each task is explained in detail and its related technologies are introduced. A list of major research systems and projects concerning Web usage mining is also presented, and a summary of Web usage mining is given in the last section.

  5. Web Page Recommendation Using Web Mining

    OpenAIRE

    Modraj Bhavsar; Mrs. P. M. Chavan

    2014-01-01

    On World Wide Web various kind of content are generated in huge amount, so to give relevant result to user web recommendation become important part of web application. On web different kind of web recommendation are made available to user every day that includes Image, Video, Audio, query suggestion and web page. In this paper we are aiming at providing framework for web page recommendation. 1) First we describe the basics of web mining, types of web mining. 2) Details of each...

  6. Web Mining and Social Networking

    DEFF Research Database (Denmark)

    Xu, Guandong; Zhang, Yanchun; Li, Lin

    This book examines the techniques and applications involved in the Web Mining, Web Personalization and Recommendation and Web Community Analysis domains, including a detailed presentation of the principles, developed algorithms, and systems of the research in these areas. The applications of web ...... sense of individuals or communities. The volume will benefit both academic and industry communities interested in the techniques and applications of web search, web data management, web mining and web knowledge discovery, as well as web community and social network analysis.......This book examines the techniques and applications involved in the Web Mining, Web Personalization and Recommendation and Web Community Analysis domains, including a detailed presentation of the principles, developed algorithms, and systems of the research in these areas. The applications of web...... mining, and the issue of how to incorporate web mining into web personalization and recommendation systems are also reviewed. Additionally, the volume explores web community mining and analysis to find the structural, organizational and temporal developments of web communities and reveal the societal...

  7. Web Mining and Social Networking

    CERN Document Server

    Xu, Guandong; Li, Lin

    2011-01-01

    This book examines the techniques and applications involved in the Web Mining, Web Personalization and Recommendation and Web Community Analysis domains, including a detailed presentation of the principles, developed algorithms, and systems of the research in these areas. The applications of web mining, and the issue of how to incorporate web mining into web personalization and recommendation systems are also reviewed. Additionally, the volume explores web community mining and analysis to find the structural, organizational and temporal developments of web communities and reveal the societal s

  8. Earth Science Mining Web Services

    Science.gov (United States)

    Pham, Long; Lynnes, Christopher; Hegde, Mahabaleshwa; Graves, Sara; Ramachandran, Rahul; Maskey, Manil; Keiser, Ken

    2008-01-01

    To allow scientists further capabilities in the area of data mining and web services, the Goddard Earth Sciences Data and Information Services Center (GES DISC) and researchers at the University of Alabama in Huntsville (UAH) have developed a system to mine data at the source without the need of network transfers. The system has been constructed by linking together several pre-existing technologies: the Simple Scalable Script-based Science Processor for Measurements (S4PM), a processing engine at he GES DISC; the Algorithm Development and Mining (ADaM) system, a data mining toolkit from UAH that can be configured in a variety of ways to create customized mining processes; ActiveBPEL, a workflow execution engine based on BPEL (Business Process Execution Language); XBaya, a graphical workflow composer; and the EOS Clearinghouse (ECHO). XBaya is used to construct an analysis workflow at UAH using ADam components, which are also installed remotely at the GES DISC, wrapped as Web Services. The S4PM processing engine searches ECHO for data using space-time criteria, staging them to cache, allowing the ActiveBPEL engine to remotely orchestras the processing workflow within S4PM. As mining is completed, the output is placed in an FTP holding area for the end user. The goals are to give users control over the data they want to process, while mining data at the data source using the server's resources rather than transferring the full volume over the internet. These diverse technologies have been infused into a functioning, distributed system with only minor changes to the underlying technologies. The key to the infusion is the loosely coupled, Web-Services based architecture: All of the participating components are accessible (one way or another) through (Simple Object Access Protocol) SOAP-based Web Services.

  9. An Educational Data Mining Approach to Concept Map Construction for Web based Learning

    Directory of Open Access Journals (Sweden)

    Anal ACHARYA

    2017-01-01

    Full Text Available This aim of this article is to study the use of Educational Data Mining (EDM techniques in constructing concept maps for organizing knowledge in web based learning systems whereby studying their synergistic effects in enhancing learning. This article first provides a tutorial based introduction to EDM. The applicability of web based learning systems in enhancing the efficiency of EDM techniques in real time environment is investigated. Web based learning systems often use a tool for organizing knowledge. This article explores the use of one such tool called concept map for this purpose. The pioneering works by various researchers who proposed web based learning systems in personalized and collaborative environment in this arena are next presented. A set of parameters are proposed based on which personalized and collaborative learning applications may be generalized and their performances compared. It is found that personalized learning environment uses EDM techniques more exhaustively compared to collaborative learning for concept map construction in web based environment. This article can be used as a starting point for freshers who would like to use EDM techniques for concept map construction for web based learning purposes.

  10. Association and Sequence Mining in Web Usage

    Directory of Open Access Journals (Sweden)

    Claudia Elena DINUCA

    2011-06-01

    Full Text Available Web servers worldwide generate a vast amount of information on web users’ browsing activities. Several researchers have studied these so-called clickstream or web access log data to better understand and characterize web users. Clickstream data can be enriched with information about the content of visited pages and the origin (e.g., geographic, organizational of the requests. The goal of this project is to analyse user behaviour by mining enriched web access log data. With the continued growth and proliferation of e-commerce, Web services, and Web-based information systems, the volumes of click stream and user data collected by Web-based organizations in their daily operations has reached astronomical proportions. This information can be exploited in various ways, such as enhancing the effectiveness of websites or developing directed web marketing campaigns. The discovered patterns are usually represented as collections of pages, objects, or re-sources that are frequently accessed by groups of users with common needs or interests. The focus of this paper is to provide an overview how to use frequent pattern techniques for discovering different types of patterns in a Web log database. In this paper we will focus on finding association as a data mining technique to extract potentially useful knowledge from web usage data. I implemented in Java, using NetBeans IDE, a program for identification of pages’ association from sessions. For exemplification, we used the log files from a commercial web site.

  11. The design and implementation of web mining in web sites security

    Science.gov (United States)

    Li, Jian; Zhang, Guo-Yin; Gu, Guo-Chang; Li, Jian-Li

    2003-06-01

    The backdoor or information leak of Web servers can be detected by using Web Mining techniques on some abnormal Web log and Web application log data. The security of Web servers can be enhanced and the damage of illegal access can be avoided. Firstly, the system for discovering the patterns of information leakages in CGI scripts from Web log data was proposed. Secondly, those patterns for system administrators to modify their codes and enhance their Web site security were provided. The following aspects were described: one is to combine web application log with web log to extract more information, so web data mining could be used to mine web log for discovering the information that firewall and Information Detection System cannot find. Another approach is to propose an operation module of web site to enhance Web site security. In cluster server session, Density-Based Clustering technique is used to reduce resource cost and obtain better efficiency.

  12. Mining Tasks from the Web Anchor Text Graph: MSR Notebook Paper for the TREC 2015 Tasks Track

    Science.gov (United States)

    2015-11-20

    Mining Tasks from the Web Anchor Text Graph: MSR Notebook Paper for the TREC 2015 Tasks Track Paul N. Bennett Microsoft Research Redmond, USA pauben...anchor text graph has proven useful in the general realm of query reformulation [2], we sought to quantify the value of extracting key phrases from...anchor text in the broader setting of the task understanding track. Given a query, our approach considers a simple method for identifying a relevant

  13. SEMANTIC WEB MINING: ISSUES AND CHALLENGES

    OpenAIRE

    Karan Singh*, Anil kumar, Arun Kumar Yadav

    2016-01-01

    The combination of the two fast evolving scientific research areas “Semantic Web” and “Web Mining” are well-known as “Semantic Web Mining” in computer science. These two areas cover way for the mining of related and meaningful information from the web, by this means giving growth to the term “Semantic Web Mining”. The “Semantic Web” makes mining easy and “Web Mining” can construct new structure of Web. Web Mining applies Data Mining technique on web content, Structure and Usage. This paper gi...

  14. DrugQuest - a text mining workflow for drug association discovery.

    Science.gov (United States)

    Papanikolaou, Nikolas; Pavlopoulos, Georgios A; Theodosiou, Theodosios; Vizirianakis, Ioannis S; Iliopoulos, Ioannis

    2016-06-06

    Text mining and data integration methods are gaining ground in the field of health sciences due to the exponential growth of bio-medical literature and information stored in biological databases. While such methods mostly try to extract bioentity associations from PubMed, very few of them are dedicated in mining other types of repositories such as chemical databases. Herein, we apply a text mining approach on the DrugBank database in order to explore drug associations based on the DrugBank "Description", "Indication", "Pharmacodynamics" and "Mechanism of Action" text fields. We apply Name Entity Recognition (NER) techniques on these fields to identify chemicals, proteins, genes, pathways, diseases, and we utilize the TextQuest algorithm to find additional biologically significant words. Using a plethora of similarity and partitional clustering techniques, we group the DrugBank records based on their common terms and investigate possible scenarios why these records are clustered together. Different views such as clustered chemicals based on their textual information, tag clouds consisting of Significant Terms along with the terms that were used for clustering are delivered to the user through a user-friendly web interface. DrugQuest is a text mining tool for knowledge discovery: it is designed to cluster DrugBank records based on text attributes in order to find new associations between drugs. The service is freely available at http://bioinformatics.med.uoc.gr/drugquest .

  15. LimTox: a web tool for applied text mining of adverse event and toxicity associations of compounds, drugs and genes.

    Science.gov (United States)

    Cañada, Andres; Capella-Gutierrez, Salvador; Rabal, Obdulia; Oyarzabal, Julen; Valencia, Alfonso; Krallinger, Martin

    2017-07-03

    A considerable effort has been devoted to retrieve systematically information for genes and proteins as well as relationships between them. Despite the importance of chemical compounds and drugs as a central bio-entity in pharmacological and biological research, only a limited number of freely available chemical text-mining/search engine technologies are currently accessible. Here we present LimTox (Literature Mining for Toxicology), a web-based online biomedical search tool with special focus on adverse hepatobiliary reactions. It integrates a range of text mining, named entity recognition and information extraction components. LimTox relies on machine-learning, rule-based, pattern-based and term lookup strategies. This system processes scientific abstracts, a set of full text articles and medical agency assessment reports. Although the main focus of LimTox is on adverse liver events, it enables also basic searches for other organ level toxicity associations (nephrotoxicity, cardiotoxicity, thyrotoxicity and phospholipidosis). This tool supports specialized search queries for: chemical compounds/drugs, genes (with additional emphasis on key enzymes in drug metabolism, namely P450 cytochromes-CYPs) and biochemical liver markers. The LimTox website is free and open to all users and there is no login requirement. LimTox can be accessed at: http://limtox.bioinfo.cnio.es. © The Author(s) 2017. Published by Oxford University Press on behalf of Nucleic Acids Research.

  16. Working with Data: Discovering Knowledge through Mining and Analysis; Systematic Knowledge Management and Knowledge Discovery; Text Mining; Methodological Approach in Discovering User Search Patterns through Web Log Analysis; Knowledge Discovery in Databases Using Formal Concept Analysis; Knowledge Discovery with a Little Perspective.

    Science.gov (United States)

    Qin, Jian; Jurisica, Igor; Liddy, Elizabeth D.; Jansen, Bernard J; Spink, Amanda; Priss, Uta; Norton, Melanie J.

    2000-01-01

    These six articles discuss knowledge discovery in databases (KDD). Topics include data mining; knowledge management systems; applications of knowledge discovery; text and Web mining; text mining and information retrieval; user search patterns through Web log analysis; concept analysis; data collection; and data structure inconsistency. (LRW)

  17. Data Preparation for Web Mining – A survey

    OpenAIRE

    Amog Rajenderan

    2012-01-01

    An accepted trend is to categorize web mining intothree main areas: web content mining, webstructure mining and web usage mining. Webcontent mining involves extractingdetails/information from the contents of webpagesand performing things like knowledge synthesis.Web structure mining involves the usage of graphtheory to understand website structure/hierarchy.Web usage mining involves the mining of usefulinformation from things like server logs, tounderstand what the user does while on the inte...

  18. ASCOT: a text mining-based web-service for efficient search and assisted creation of clinical trials

    Science.gov (United States)

    2012-01-01

    Clinical trials are mandatory protocols describing medical research on humans and among the most valuable sources of medical practice evidence. Searching for trials relevant to some query is laborious due to the immense number of existing protocols. Apart from search, writing new trials includes composing detailed eligibility criteria, which might be time-consuming, especially for new researchers. In this paper we present ASCOT, an efficient search application customised for clinical trials. ASCOT uses text mining and data mining methods to enrich clinical trials with metadata, that in turn serve as effective tools to narrow down search. In addition, ASCOT integrates a component for recommending eligibility criteria based on a set of selected protocols. PMID:22595088

  19. Mining social media and web searches for disease detection

    Directory of Open Access Journals (Sweden)

    Y. Tony Yang

    2013-05-01

    Full Text Available Web-based social media is increasingly being used across different settings in the health care industry. The increased frequency in the use of the Internet via computer or mobile devices provides an opportunity for social media to be the medium through which people can be provided with valuable health information quickly and directly. While traditional methods of detection relied predominately on hierarchical or bureaucratic lines of communication, these often failed to yield timely and accurate epidemiological intelligence. New web-based platforms promise increased opportunities for a more timely and accurate spreading of information and analysis. This article aims to provide an overview and discussion of the availability of timely and accurate information. It is especially useful for the rapid identification of an outbreak of an infectious disease that is necessary to promptly and effectively develop public health responses. These web-based platforms include search queries, data mining of web and social media, process and analysis of blogs containing epidemic key words, text mining, and geographical information system data analyses. These new sources of analysis and information are intended to complement traditional sources of epidemic intelligence. Despite the attractiveness of these new approaches, further study is needed to determine the accuracy of blogger statements, as increases in public participation may not necessarily mean the information provided is more accurate.

  20. DISEASES: text mining and data integration of disease-gene associations.

    Science.gov (United States)

    Pletscher-Frankild, Sune; Pallejà, Albert; Tsafou, Kalliopi; Binder, Janos X; Jensen, Lars Juhl

    2015-03-01

    Text mining is a flexible technology that can be applied to numerous different tasks in biology and medicine. We present a system for extracting disease-gene associations from biomedical abstracts. The system consists of a highly efficient dictionary-based tagger for named entity recognition of human genes and diseases, which we combine with a scoring scheme that takes into account co-occurrences both within and between sentences. We show that this approach is able to extract half of all manually curated associations with a false positive rate of only 0.16%. Nonetheless, text mining should not stand alone, but be combined with other types of evidence. For this reason, we have developed the DISEASES resource, which integrates the results from text mining with manually curated disease-gene associations, cancer mutation data, and genome-wide association studies from existing databases. The DISEASES resource is accessible through a web interface at http://diseases.jensenlab.org/, where the text-mining software and all associations are also freely available for download. Copyright © 2014 The Authors. Published by Elsevier Inc. All rights reserved.

  1. Fuzzy Clustering: An Approachfor Mining Usage Profilesfrom Web

    OpenAIRE

    Ms.Archana N. Boob; Prof. D. M. Dakhane

    2012-01-01

    Web usage mining is an application of data mining technology to mining the data of the web server log file. It can discover the browsing patterns of user and some kind of correlations between the web pages. Web usage mining provides the support for the web site design, providing personalization server and other business making decision, etc. Web mining applies the data mining, the artificial intelligence and the chart technology and so on to the web data and traces users' visiting characteris...

  2. Text Mining.

    Science.gov (United States)

    Trybula, Walter J.

    1999-01-01

    Reviews the state of research in text mining, focusing on newer developments. The intent is to describe the disparate investigations currently included under the term text mining and provide a cohesive structure for these efforts. A summary of research identifies key organizations responsible for pushing the development of text mining. A section…

  3. Using an improved association rules mining optimization algorithm in web-based mobile-learning system

    Science.gov (United States)

    Huang, Yin; Chen, Jianhua; Xiong, Shaojun

    2009-07-01

    Mobile-Learning (M-learning) makes many learners get the advantages of both traditional learning and E-learning. Currently, Web-based Mobile-Learning Systems have created many new ways and defined new relationships between educators and learners. Association rule mining is one of the most important fields in data mining and knowledge discovery in databases. Rules explosion is a serious problem which causes great concerns, as conventional mining algorithms often produce too many rules for decision makers to digest. Since Web-based Mobile-Learning System collects vast amounts of student profile data, data mining and knowledge discovery techniques can be applied to find interesting relationships between attributes of learners, assessments, the solution strategies adopted by learners and so on. Therefore ,this paper focus on a new data-mining algorithm, combined with the advantages of genetic algorithm and simulated annealing algorithm , called ARGSA(Association rules based on an improved Genetic Simulated Annealing Algorithm), to mine the association rules. This paper first takes advantage of the Parallel Genetic Algorithm and Simulated Algorithm designed specifically for discovering association rules. Moreover, the analysis and experiment are also made to show the proposed method is superior to the Apriori algorithm in this Mobile-Learning system.

  4. Using text-mining techniques in electronic patient records to identify ADRs from medicine use

    DEFF Research Database (Denmark)

    Warrer, Pernille; Hansen, Ebba Holme; Jensen, Lars Juhl

    2012-01-01

    This literature review included studies that use text-mining techniques in narrative documents stored in electronic patient records (EPRs) to investigate ADRs. We searched PubMed, Embase, Web of Science and International Pharmaceutical Abstracts without restrictions from origin until July 2011. We...... included empirically based studies on text mining of electronic patient records (EPRs) that focused on detecting ADRs, excluding those that investigated adverse events not related to medicine use. We extracted information on study populations, EPR data sources, frequencies and types of the identified ADRs......, medicines associated with ADRs, text-mining algorithms used and their performance. Seven studies, all from the United States, were eligible for inclusion in the review. Studies were published from 2001, the majority between 2009 and 2010. Text-mining techniques varied over time from simple free text...

  5. A Customizable Text Classifier for Text Mining

    Directory of Open Access Journals (Sweden)

    Yun-liang Zhang

    2007-12-01

    Full Text Available Text mining deals with complex and unstructured texts. Usually a particular collection of texts that is specified to one or more domains is necessary. We have developed a customizable text classifier for users to mine the collection automatically. It derives from the sentence category of the HNC theory and corresponding techniques. It can start with a few texts, and it can adjust automatically or be adjusted by user. The user can also control the number of domains chosen and decide the standard with which to choose the texts based on demand and abundance of materials. The performance of the classifier varies with the user's choice.

  6. Constructing a web recommender system using web usage mining and user’s profiles

    Directory of Open Access Journals (Sweden)

    T. Mombeini

    2014-12-01

    Full Text Available The World Wide Web is a great source of information, which is nowadays being widely used due to the availability of useful information changing, dynamically. However, the large number of webpages often confuses many users and it is hard for them to find information on their interests. Therefore, it is necessary to provide a system capable of guiding users towards their desired choices and services. Recommender systems search among a large collection of user interests and recommend those, which are likely to be favored the most by the user. Web usage mining was designed to function on web server records, which are included in user search results. Therefore, recommender servers use the web usage mining technique to predict users’ browsing patterns and recommend those patterns in the form of a suggestion list. In this article, a recommender system based on web usage mining phases (online and offline was proposed. In the offline phase, the first step is to analyze user access records to identify user sessions. Next, user profiles are built using data from server records based on the frequency of access to pages, the time spent by the user on each page and the date of page view. Date is of importance since it is more possible for users to request new pages more than old ones and old pages are less probable to be viewed, as users mostly look for new information. Following the creation of user profiles, users are categorized in clusters using the Fuzzy C-means clustering algorithm and S(c criterion based on their similarities. In the online phase, a neural network is offered to identify the suggested model while online suggestions are generated using the suggestion module for the active user. Search engines analyze suggestion lists based on rate of user interest in pages and page rank and finally suggest appropriate pages to the active user. Experiments show that the proposed method of predicting user recent requested pages has more accuracy and

  7. Mining social media and web searches for disease detection.

    Science.gov (United States)

    Yang, Y Tony; Horneffer, Michael; DiLisio, Nicole

    2013-04-28

    Web-based social media is increasingly being used across different settings in the health care industry. The increased frequency in the use of the Internet via computer or mobile devices provides an opportunity for social media to be the medium through which people can be provided with valuable health information quickly and directly. While traditional methods of detection relied predominately on hierarchical or bureaucratic lines of communication, these often failed to yield timely and accurate epidemiological intelligence. New web-based platforms promise increased opportunities for a more timely and accurate spreading of information and analysis. This article aims to provide an overview and discussion of the availability of timely and accurate information. It is especially useful for the rapid identification of an outbreak of an infectious disease that is necessary to promptly and effectively develop public health responses. These web-based platforms include search queries, data mining of web and social media, process and analysis of blogs containing epidemic key words, text mining, and geographical information system data analyses. These new sources of analysis and information are intended to complement traditional sources of epidemic intelligence. Despite the attractiveness of these new approaches, further study is needed to determine the accuracy of blogger statements, as increases in public participation may not necessarily mean the information provided is more accurate.

  8. Data Mining Web Services for Science Data Repositories

    Science.gov (United States)

    Graves, S.; Ramachandran, R.; Keiser, K.; Maskey, M.; Lynnes, C.; Pham, L.

    2006-12-01

    The maturation of web services standards and technologies sets the stage for a distributed "Service-Oriented Architecture" (SOA) for NASA's next generation science data processing. This architecture will allow members of the scientific community to create and combine persistent distributed data processing services and make them available to other users over the Internet. NASA has initiated a project to create a suite of specialized data mining web services designed specifically for science data. The project leverages the Algorithm Development and Mining (ADaM) toolkit as its basis. The ADaM toolkit is a robust, mature and freely available science data mining toolkit that is being used by several research organizations and educational institutions worldwide. These mining services will give the scientific community a powerful and versatile data mining capability that can be used to create higher order products such as thematic maps from current and future NASA satellite data records with methods that are not currently available. The package of mining and related services are being developed using Web Services standards so that community-based measurement processing systems can access and interoperate with them. These standards-based services allow users different options for utilizing them, from direct remote invocation by a client application to deployment of a Business Process Execution Language (BPEL) solutions package where a complex data mining workflow is exposed to others as a single service. The ability to deploy and operate these services at a data archive allows the data mining algorithms to be run where the data are stored, a more efficient scenario than moving large amounts of data over the network. This will be demonstrated in a scenario in which a user uses a remote Web-Service-enabled clustering algorithm to create cloud masks from satellite imagery at the Goddard Earth Sciences Data and Information Services Center (GES DISC).

  9. Text mining and natural language processing approaches for automatic categorization of lay requests to web-based expert forums.

    Science.gov (United States)

    Himmel, Wolfgang; Reincke, Ulrich; Michelmann, Hans Wilhelm

    2009-07-22

    Both healthy and sick people increasingly use electronic media to obtain medical information and advice. For example, Internet users may send requests to Web-based expert forums, or so-called "ask the doctor" services. To automatically classify lay requests to an Internet medical expert forum using a combination of different text-mining strategies. We first manually classified a sample of 988 requests directed to a involuntary childlessness forum on the German website "Rund ums Baby" ("Everything about Babies") into one or more of 38 categories belonging to two dimensions ("subject matter" and "expectations"). After creating start and synonym lists, we calculated the average Cramer's V statistic for the association of each word with each category. We also used principle component analysis and singular value decomposition as further text-mining strategies. With these measures we trained regression models and determined, on the basis of best regression models, for any request the probability of belonging to each of the 38 different categories, with a cutoff of 50%. Recall and precision of a test sample were calculated as a measure of quality for the automatic classification. According to the manual classification of 988 documents, 102 (10%) documents fell into the category "in vitro fertilization (IVF)," 81 (8%) into the category "ovulation," 79 (8%) into "cycle," and 57 (6%) into "semen analysis." These were the four most frequent categories in the subject matter dimension (consisting of 32 categories). The expectation dimension comprised six categories; we classified 533 documents (54%) as "general information" and 351 (36%) as a wish for "treatment recommendations." The generation of indicator variables based on the chi-square analysis and Cramer's V proved to be the best approach for automatic classification in about half of the categories. In combination with the two other approaches, 100% precision and 100% recall were realized in 18 (47%) out of the 38

  10. An Application for Data Preprocessing and Models Extractions in Web Usage Mining

    Directory of Open Access Journals (Sweden)

    Claudia Elena DINUCA

    2011-11-01

    Full Text Available Web servers worldwide generate a vast amount of information on web users’ browsing activities. Several researchers have studied these so-called clickstream or web access log data to better understand and characterize web users. The goal of this application is to analyze user behaviour by mining enriched web access log data. With the continued growth and proliferation of e-commerce, Web services, and Web-based information systems, the volumes of click stream and user data collected by Web-based organizations in their daily operations has reached astronomical proportions. This information can be exploited in various ways, such as enhancing the effectiveness of websites or developing directed web marketing campaigns. The discovered patterns are usually represented as collections of pages, objects, or re-sources that are frequently accessed by groups of users with common needs or interests. In this paper we will focus on displaying the way how it was implemented the application for data preprocessing and extracting different data models from web logs data, finding association as a data mining technique to extract potentially useful knowledge from web usage data. We find different data models navigation patterns by analysing the log files of the web-site. I implemented the application in Java using NetBeans IDE. For exemplification, I used the log files data from a commercial web site www.nice-layouts.com.

  11. AN EFFECTIVE RECOMMENDATIONS BY DIFFUSION ALGORITHM FOR WEB GRAPH MINING

    Directory of Open Access Journals (Sweden)

    S. Vasukipriya

    2013-04-01

    Full Text Available The information on the World Wide Web grows in an explosive rate. Societies are relying more on the Web for their miscellaneous needs of information. Recommendation systems are active information filtering systems that attempt to present the information items like movies, music, images, books recommendations, tags recommendations, query suggestions, etc., to the users. Various kinds of data bases are used for the recommendations; fundamentally these data bases can be molded in the form of many types of graphs. Aiming at provided that a general framework on effective DR (Recommendations by Diffusion algorithm for web graphs mining. First introduce a novel graph diffusion model based on heat diffusion. This method can be applied to both undirected graphs and directed graphs. Then it shows how to convert different Web data sources into correct graphs in our models.

  12. A construction scheme of web page comment information extraction system based on frequent subtree mining

    Science.gov (United States)

    Zhang, Xiaowen; Chen, Bingfeng

    2017-08-01

    Based on the frequent sub-tree mining algorithm, this paper proposes a construction scheme of web page comment information extraction system based on frequent subtree mining, referred to as FSM system. The entire system architecture and the various modules to do a brief introduction, and then the core of the system to do a detailed description, and finally give the system prototype.

  13. Web-video-mining-supported workflow modeling for laparoscopic surgeries.

    Science.gov (United States)

    Liu, Rui; Zhang, Xiaoli; Zhang, Hao

    2016-11-01

    As quality assurance is of strong concern in advanced surgeries, intelligent surgical systems are expected to have knowledge such as the knowledge of the surgical workflow model (SWM) to support their intuitive cooperation with surgeons. For generating a robust and reliable SWM, a large amount of training data is required. However, training data collected by physically recording surgery operations is often limited and data collection is time-consuming and labor-intensive, severely influencing knowledge scalability of the surgical systems. The objective of this research is to solve the knowledge scalability problem in surgical workflow modeling with a low cost and labor efficient way. A novel web-video-mining-supported surgical workflow modeling (webSWM) method is developed. A novel video quality analysis method based on topic analysis and sentiment analysis techniques is developed to select high-quality videos from abundant and noisy web videos. A statistical learning method is then used to build the workflow model based on the selected videos. To test the effectiveness of the webSWM method, 250 web videos were mined to generate a surgical workflow for the robotic cholecystectomy surgery. The generated workflow was evaluated by 4 web-retrieved videos and 4 operation-room-recorded videos, respectively. The evaluation results (video selection consistency n-index ≥0.60; surgical workflow matching degree ≥0.84) proved the effectiveness of the webSWM method in generating robust and reliable SWM knowledge by mining web videos. With the webSWM method, abundant web videos were selected and a reliable SWM was modeled in a short time with low labor cost. Satisfied performances in mining web videos and learning surgery-related knowledge show that the webSWM method is promising in scaling knowledge for intelligent surgical systems. Copyright © 2016 Elsevier B.V. All rights reserved.

  14. Web Usage Mining, Pattern Discovery dan Log File

    OpenAIRE

    Tri Suratno; Toni Prahasto; Adian Fatchur Rochim

    2014-01-01

    Analysis  of  data  to  access  the  server  can  provide  significant  and  useful  information  for  performance  improvement,  restructuring  andimproving the effectiveness of a web site. Data mining is one of the most effective way to detect a series of patterns of information from large amounts of data. Application of  data mining  on  Internet use  called web  mining  is a set of  data mining  techniques  are  used  for the web. Web mining technologies and data mining is a combination o...

  15. QuadBase2: web server for multiplexed guanine quadruplex mining and visualization

    Science.gov (United States)

    Dhapola, Parashar; Chowdhury, Shantanu

    2016-01-01

    DNA guanine quadruplexes or G4s are non-canonical DNA secondary structures which affect genomic processes like replication, transcription and recombination. G4s are computationally identified by specific nucleotide motifs which are also called putative G4 (PG4) motifs. Despite the general relevance of these structures, there is currently no tool available that can allow batch queries and genome-wide analysis of these motifs in a user-friendly interface. QuadBase2 (quadbase.igib.res.in) presents a completely reinvented web server version of previously published QuadBase database. QuadBase2 enables users to mine PG4 motifs in up to 178 eukaryotes through the EuQuad module. This module interfaces with Ensembl Compara database, to allow users mine PG4 motifs in the orthologues of genes of interest across eukaryotes. PG4 motifs can be mined across genes and their promoter sequences in 1719 prokaryotes through ProQuad module. This module includes a feature that allows genome-wide mining of PG4 motifs and their visualization as circular histograms. TetraplexFinder, the module for mining PG4 motifs in user-provided sequences is now capable of handling up to 20 MB of data. QuadBase2 is a comprehensive PG4 motif mining tool that further expands the configurations and algorithms for mining PG4 motifs in a user-friendly way. PMID:27185890

  16. A comparative study on the flow experience in web-based and text-based interaction environments.

    Science.gov (United States)

    Huang, Li-Ting; Chiu, Chen-An; Sung, Kai; Farn, Cheng-Kiang

    2011-01-01

    The purpose of this study was to explore a substantial phenomenon related to flow experiences (immersion) in text-based interaction systems. Most previous research emphasizes the effects of challenge/skill, focused attention, telepresence, web characteristics, and systems' interface design on users' flow experiences in online environments. However, text-based interaction systems without telepresence features and web characteristics still seem to create opportunities for flow experience. To explore this phenomenon, this study incorporates subject involvement and interpersonal interaction as critical antecedents into the model of flow experience, as well as considers the existence of telepresence. Results reveal that subject involvement, interpersonal interaction, and interactivity speed are critical to focused attention, which enhances users' immersion. With regard to the effect of telepresence, the perceived attractiveness of the interface is a significant facilitator determining users' immersion in web-based, rather than in text-based, interaction environments. Interactivity speed is unrelated to immersion in both web-based and text-based interaction environments. The influence of interpersonal involvement is diminished in web-based interaction environments. The implications and limitations of this study are discussed.

  17. Biomarker Identification Using Text Mining

    Directory of Open Access Journals (Sweden)

    Hui Li

    2012-01-01

    Full Text Available Identifying molecular biomarkers has become one of the important tasks for scientists to assess the different phenotypic states of cells or organisms correlated to the genotypes of diseases from large-scale biological data. In this paper, we proposed a text-mining-based method to discover biomarkers from PubMed. First, we construct a database based on a dictionary, and then we used a finite state machine to identify the biomarkers. Our method of text mining provides a highly reliable approach to discover the biomarkers in the PubMed database.

  18. Semantic Web Requirements through Web Mining Techniques

    OpenAIRE

    Hassanzadeh, Hamed; Keyvanpour, Mohammad Reza

    2012-01-01

    In recent years, Semantic web has become a topic of active research in several fields of computer science and has applied in a wide range of domains such as bioinformatics, life sciences, and knowledge management. The two fast-developing research areas semantic web and web mining can complement each other and their different techniques can be used jointly or separately to solve the issues in both areas. In addition, since shifting from current web to semantic web mainly depends on the enhance...

  19. A WebGIS Decision Support System for Management of Abandoned Mines

    Directory of Open Access Journals (Sweden)

    Ranka Stanković

    2016-07-01

    Full Text Available This paper presents the development of a WebGIS application aimed at providing safe and reliable data needed for reclamation of abandoned mines in national parks and other protected areas in Vojvodina in compliance with existing legal regulations. The geodatabase model for this application has been developed using UML and the CASE tool Microsoft Visio featuring an interface with ArcGIS. The WebGIS application was developed using GeoServer, an open source tool in the Java programming language, with integrated PostgreSQL DB and the possibility of generating and publishing WMS, WFS and KML services. The WebGIS application is publicly available, based on an appropriate central database, which for the first time encompasses all available data on abandoned mines in Vojvodina, and as such may serve as a model for similar databases on the territory of the Republic of Serbia.

  20. PubstractHelper: A Web-based Text-Mining Tool for Marking Sentences in Abstracts from PubMed Using Multiple User-Defined Keywords.

    Science.gov (United States)

    Chen, Chou-Cheng; Ho, Chung-Liang

    2014-01-01

    While a huge amount of information about biological literature can be obtained by searching the PubMed database, reading through all the titles and abstracts resulting from such a search for useful information is inefficient. Text mining makes it possible to increase this efficiency. Some websites use text mining to gather information from the PubMed database; however, they are database-oriented, using pre-defined search keywords while lacking a query interface for user-defined search inputs. We present the PubMed Abstract Reading Helper (PubstractHelper) website which combines text mining and reading assistance for an efficient PubMed search. PubstractHelper can accept a maximum of ten groups of keywords, within each group containing up to ten keywords. The principle behind the text-mining function of PubstractHelper is that keywords contained in the same sentence are likely to be related. PubstractHelper highlights sentences with co-occurring keywords in different colors. The user can download the PMID and the abstracts with color markings to be reviewed later. The PubstractHelper website can help users to identify relevant publications based on the presence of related keywords, which should be a handy tool for their research. http://bio.yungyun.com.tw/ATM/PubstractHelper.aspx and http://holab.med.ncku.edu.tw/ATM/PubstractHelper.aspx.

  1. Hot complaint intelligent classification based on text mining

    Directory of Open Access Journals (Sweden)

    XIA Haifeng

    2013-10-01

    Full Text Available The complaint recognizer system plays an important role in making sure the correct classification of the hot complaint,improving the service quantity of telecommunications industry.The customers’ complaint in telecommunications industry has its special particularity which should be done in limited time,which cause the error in classification of hot complaint.The paper presents a model of complaint hot intelligent classification based on text mining,which can classify the hot complaint in the correct level of the complaint navigation.The examples show that the model can be efficient to classify the text of the complaint.

  2. Chapter 16: text mining for translational bioinformatics.

    Science.gov (United States)

    Cohen, K Bretonnel; Hunter, Lawrence E

    2013-04-01

    Text mining for translational bioinformatics is a new field with tremendous research potential. It is a subfield of biomedical natural language processing that concerns itself directly with the problem of relating basic biomedical research to clinical practice, and vice versa. Applications of text mining fall both into the category of T1 translational research-translating basic science results into new interventions-and T2 translational research, or translational research for public health. Potential use cases include better phenotyping of research subjects, and pharmacogenomic research. A variety of methods for evaluating text mining applications exist, including corpora, structured test suites, and post hoc judging. Two basic principles of linguistic structure are relevant for building text mining applications. One is that linguistic structure consists of multiple levels. The other is that every level of linguistic structure is characterized by ambiguity. There are two basic approaches to text mining: rule-based, also known as knowledge-based; and machine-learning-based, also known as statistical. Many systems are hybrids of the two approaches. Shared tasks have had a strong effect on the direction of the field. Like all translational bioinformatics software, text mining software for translational bioinformatics can be considered health-critical and should be subject to the strictest standards of quality assurance and software testing.

  3. Knowledge based word-concept model estimation and refinement for biomedical text mining.

    Science.gov (United States)

    Jimeno Yepes, Antonio; Berlanga, Rafael

    2015-02-01

    Text mining of scientific literature has been essential for setting up large public biomedical databases, which are being widely used by the research community. In the biomedical domain, the existence of a large number of terminological resources and knowledge bases (KB) has enabled a myriad of machine learning methods for different text mining related tasks. Unfortunately, KBs have not been devised for text mining tasks but for human interpretation, thus performance of KB-based methods is usually lower when compared to supervised machine learning methods. The disadvantage of supervised methods though is they require labeled training data and therefore not useful for large scale biomedical text mining systems. KB-based methods do not have this limitation. In this paper, we describe a novel method to generate word-concept probabilities from a KB, which can serve as a basis for several text mining tasks. This method not only takes into account the underlying patterns within the descriptions contained in the KB but also those in texts available from large unlabeled corpora such as MEDLINE. The parameters of the model have been estimated without training data. Patterns from MEDLINE have been built using MetaMap for entity recognition and related using co-occurrences. The word-concept probabilities were evaluated on the task of word sense disambiguation (WSD). The results showed that our method obtained a higher degree of accuracy than other state-of-the-art approaches when evaluated on the MSH WSD data set. We also evaluated our method on the task of document ranking using MEDLINE citations. These results also showed an increase in performance over existing baseline retrieval approaches. Copyright © 2014 Elsevier Inc. All rights reserved.

  4. Effective Filtering of Query Results on Updated User Behavioral Profiles in Web Mining

    Directory of Open Access Journals (Sweden)

    S. Sadesh

    2015-01-01

    Full Text Available Web with tremendous volume of information retrieves result for user related queries. With the rapid growth of web page recommendation, results retrieved based on data mining techniques did not offer higher performance filtering rate because relationships between user profile and queries were not analyzed in an extensive manner. At the same time, existing user profile based prediction in web data mining is not exhaustive in producing personalized result rate. To improve the query result rate on dynamics of user behavior over time, Hamilton Filtered Regime Switching User Query Probability (HFRS-UQP framework is proposed. HFRS-UQP framework is split into two processes, where filtering and switching are carried out. The data mining based filtering in our research work uses the Hamilton Filtering framework to filter user result based on personalized information on automatic updated profiles through search engine. Maximized result is fetched, that is, filtered out with respect to user behavior profiles. The switching performs accurate filtering updated profiles using regime switching. The updating in profile change (i.e., switches regime in HFRS-UQP framework identifies the second- and higher-order association of query result on the updated profiles. Experiment is conducted on factors such as personalized information search retrieval rate, filtering efficiency, and precision ratio.

  5. Effect of Temporal Relationships in Associative Rule Mining for Web Log Data

    Science.gov (United States)

    Mohd Khairudin, Nazli; Mustapha, Aida

    2014-01-01

    The advent of web-based applications and services has created such diverse and voluminous web log data stored in web servers, proxy servers, client machines, or organizational databases. This paper attempts to investigate the effect of temporal attribute in relational rule mining for web log data. We incorporated the characteristics of time in the rule mining process and analysed the effect of various temporal parameters. The rules generated from temporal relational rule mining are then compared against the rules generated from the classical rule mining approach such as the Apriori and FP-Growth algorithms. The results showed that by incorporating the temporal attribute via time, the number of rules generated is subsequently smaller but is comparable in terms of quality. PMID:24587757

  6. Research on the optimization strategy of web search engine based on data mining

    Science.gov (United States)

    Chen, Ronghua

    2018-04-01

    With the wide application of search engines, web site information has become an important way for people to obtain information. People have found that they are growing in an increasingly explosive manner. Web site information is verydifficult to find the information they need, and now the search engine can not meet the need, so there is an urgent need for the network to provide website personalized information service, data mining technology for this new challenge is to find a breakthrough. In order to improve people's accuracy of finding information from websites, a website search engine optimization strategy based on data mining is proposed, and verified by website search engine optimization experiment. The results show that the proposed strategy improves the accuracy of the people to find information, and reduces the time for people to find information. It has an important practical value.

  7. Learners’ Evaluation Based on Data Mining in a Web Based Learning Environment

    Directory of Open Access Journals (Sweden)

    İdris GÖKSU

    2015-06-01

    Full Text Available This study has been done in order to determine the efficiency level in the extend of learners’ evaluation by means of comparing the Web Based Learning (WBL with traditional face to face learning. In this respect, the effect of WBL and traditional environment has been analyzed in the class of Visual Programming I, and the learners have been evaluated with the rule based data mining method in a WBL environment. The study has been conducted according to experimental design with pre-test and post-test groups. Experimental group has attended the class in WBL environment, and the control group in a traditional class environment. In accordance with the pre-test and post-test scores of experimental and control groups, both methods have been proved to be effective. According the average scores of post-test, the learners in experimental groups have been more successful than the ones in the control group. The guiding of WBL system prepared for the study has been found to be significant in terms of both underlining the points in which the learners are unsuccessful in a short time and having trust in the system technically.

  8. Argo: an integrative, interactive, text mining-based workbench supporting curation

    Science.gov (United States)

    Rak, Rafal; Rowley, Andrew; Black, William; Ananiadou, Sophia

    2012-01-01

    Curation of biomedical literature is often supported by the automatic analysis of textual content that generally involves a sequence of individual processing components. Text mining (TM) has been used to enhance the process of manual biocuration, but has been focused on specific databases and tasks rather than an environment integrating TM tools into the curation pipeline, catering for a variety of tasks, types of information and applications. Processing components usually come from different sources and often lack interoperability. The well established Unstructured Information Management Architecture is a framework that addresses interoperability by defining common data structures and interfaces. However, most of the efforts are targeted towards software developers and are not suitable for curators, or are otherwise inconvenient to use on a higher level of abstraction. To overcome these issues we introduce Argo, an interoperable, integrative, interactive and collaborative system for text analysis with a convenient graphic user interface to ease the development of processing workflows and boost productivity in labour-intensive manual curation. Robust, scalable text analytics follow a modular approach, adopting component modules for distinct levels of text analysis. The user interface is available entirely through a web browser that saves the user from going through often complicated and platform-dependent installation procedures. Argo comes with a predefined set of processing components commonly used in text analysis, while giving the users the ability to deposit their own components. The system accommodates various areas and levels of user expertise, from TM and computational linguistics to ontology-based curation. One of the key functionalities of Argo is its ability to seamlessly incorporate user-interactive components, such as manual annotation editors, into otherwise completely automatic pipelines. As a use case, we demonstrate the functionality of an in

  9. Text mining for adverse drug events: the promise, challenges, and state of the art.

    Science.gov (United States)

    Harpaz, Rave; Callahan, Alison; Tamang, Suzanne; Low, Yen; Odgers, David; Finlayson, Sam; Jung, Kenneth; LePendu, Paea; Shah, Nigam H

    2014-10-01

    Text mining is the computational process of extracting meaningful information from large amounts of unstructured text. It is emerging as a tool to leverage underutilized data sources that can improve pharmacovigilance, including the objective of adverse drug event (ADE) detection and assessment. This article provides an overview of recent advances in pharmacovigilance driven by the application of text mining, and discusses several data sources-such as biomedical literature, clinical narratives, product labeling, social media, and Web search logs-that are amenable to text mining for pharmacovigilance. Given the state of the art, it appears text mining can be applied to extract useful ADE-related information from multiple textual sources. Nonetheless, further research is required to address remaining technical challenges associated with the text mining methodologies, and to conclusively determine the relative contribution of each textual source to improving pharmacovigilance.

  10. Web of Things-Based Remote Monitoring System for Coal Mine Safety Using Wireless Sensor Network

    OpenAIRE

    Bo, Cheng; Xin, Cheng; Zhongyi, Zhai; Chengwen, Zhang; Junliang, Chen

    2014-01-01

    Frequent accidents have occurred in coal mine enterprises; therefore, raising the technological level of coal mine safety monitoring systems is an urgent problem. Wireless sensor networks (WSN), as a new field of research, have broad application prospects. This paper proposes a Web of Things- (WoT-) based remote monitoring system that takes full advantage of wireless sensor networks in combination with the CAN bus communication technique that abstracts the underground sensor data and capabili...

  11. Study on online community user motif using web usage mining

    Science.gov (United States)

    Alphy, Meera; Sharma, Ajay

    2016-04-01

    The Web usage mining is the application of data mining, which is used to extract useful information from the online community. The World Wide Web contains at least 4.73 billion pages according to Indexed Web and it contains at least 228.52 million pages according Dutch Indexed web on 6th august 2015, Thursday. It’s difficult to get needed data from these billions of web pages in World Wide Web. Here is the importance of web usage mining. Personalizing the search engine helps the web user to identify the most used data in an easy way. It reduces the time consumption; automatic site search and automatic restore the useful sites. This study represents the old techniques to latest techniques used in pattern discovery and analysis in web usage mining from 1996 to 2015. Analyzing user motif helps in the improvement of business, e-commerce, personalisation and improvement of websites.

  12. Text mining for biology--the way forward

    DEFF Research Database (Denmark)

    Altman, Russ B; Bergman, Casey M; Blake, Judith

    2008-01-01

    This article collects opinions from leading scientists about how text mining can provide better access to the biological literature, how the scientific community can help with this process, what the next steps are, and what role future BioCreative evaluations can play. The responses identify...... several broad themes, including the possibility of fusing literature and biological databases through text mining; the need for user interfaces tailored to different classes of users and supporting community-based annotation; the importance of scaling text mining technology and inserting it into larger...

  13. Using text-mining techniques in electronic patient records to identify ADRs from medicine use.

    Science.gov (United States)

    Warrer, Pernille; Hansen, Ebba Holme; Juhl-Jensen, Lars; Aagaard, Lise

    2012-05-01

    This literature review included studies that use text-mining techniques in narrative documents stored in electronic patient records (EPRs) to investigate ADRs. We searched PubMed, Embase, Web of Science and International Pharmaceutical Abstracts without restrictions from origin until July 2011. We included empirically based studies on text mining of electronic patient records (EPRs) that focused on detecting ADRs, excluding those that investigated adverse events not related to medicine use. We extracted information on study populations, EPR data sources, frequencies and types of the identified ADRs, medicines associated with ADRs, text-mining algorithms used and their performance. Seven studies, all from the United States, were eligible for inclusion in the review. Studies were published from 2001, the majority between 2009 and 2010. Text-mining techniques varied over time from simple free text searching of outpatient visit notes and inpatient discharge summaries to more advanced techniques involving natural language processing (NLP) of inpatient discharge summaries. Performance appeared to increase with the use of NLP, although many ADRs were still missed. Due to differences in study design and populations, various types of ADRs were identified and thus we could not make comparisons across studies. The review underscores the feasibility and potential of text mining to investigate narrative documents in EPRs for ADRs. However, more empirical studies are needed to evaluate whether text mining of EPRs can be used systematically to collect new information about ADRs. © 2011 The Authors. British Journal of Clinical Pharmacology © 2011 The British Pharmacological Society.

  14. Text mining patents for biomedical knowledge.

    Science.gov (United States)

    Rodriguez-Esteban, Raul; Bundschus, Markus

    2016-06-01

    Biomedical text mining of scientific knowledge bases, such as Medline, has received much attention in recent years. Given that text mining is able to automatically extract biomedical facts that revolve around entities such as genes, proteins, and drugs, from unstructured text sources, it is seen as a major enabler to foster biomedical research and drug discovery. In contrast to the biomedical literature, research into the mining of biomedical patents has not reached the same level of maturity. Here, we review existing work and highlight the associated technical challenges that emerge from automatically extracting facts from patents. We conclude by outlining potential future directions in this domain that could help drive biomedical research and drug discovery. Copyright © 2016 Elsevier Ltd. All rights reserved.

  15. Mining protein function from text using term-based support vector machines

    Science.gov (United States)

    Rice, Simon B; Nenadic, Goran; Stapley, Benjamin J

    2005-01-01

    Background Text mining has spurred huge interest in the domain of biology. The goal of the BioCreAtIvE exercise was to evaluate the performance of current text mining systems. We participated in Task 2, which addressed assigning Gene Ontology terms to human proteins and selecting relevant evidence from full-text documents. We approached it as a modified form of the document classification task. We used a supervised machine-learning approach (based on support vector machines) to assign protein function and select passages that support the assignments. As classification features, we used a protein's co-occurring terms that were automatically extracted from documents. Results The results evaluated by curators were modest, and quite variable for different problems: in many cases we have relatively good assignment of GO terms to proteins, but the selected supporting text was typically non-relevant (precision spanning from 3% to 50%). The method appears to work best when a substantial set of relevant documents is obtained, while it works poorly on single documents and/or short passages. The initial results suggest that our approach can also mine annotations from text even when an explicit statement relating a protein to a GO term is absent. Conclusion A machine learning approach to mining protein function predictions from text can yield good performance only if sufficient training data is available, and significant amount of supporting data is used for prediction. The most promising results are for combined document retrieval and GO term assignment, which calls for the integration of methods developed in BioCreAtIvE Task 1 and Task 2. PMID:15960835

  16. pubmed.mineR: an R package with text-mining algorithms to analyse PubMed abstracts.

    Science.gov (United States)

    Rani, Jyoti; Shah, A B Rauf; Ramachandran, Srinivasan

    2015-10-01

    The PubMed literature database is a valuable source of information for scientific research. It is rich in biomedical literature with more than 24 million citations. Data-mining of voluminous literature is a challenging task. Although several text-mining algorithms have been developed in recent years with focus on data visualization, they have limitations such as speed, are rigid and are not available in the open source. We have developed an R package, pubmed.mineR, wherein we have combined the advantages of existing algorithms, overcome their limitations, and offer user flexibility and link with other packages in Bioconductor and the Comprehensive R Network (CRAN) in order to expand the user capabilities for executing multifaceted approaches. Three case studies are presented, namely, 'Evolving role of diabetes educators', 'Cancer risk assessment' and 'Dynamic concepts on disease and comorbidity' to illustrate the use of pubmed.mineR. The package generally runs fast with small elapsed times in regular workstations even on large corpus sizes and with compute intensive functions. The pubmed.mineR is available at http://cran.rproject. org/web/packages/pubmed.mineR.

  17. On-Board Mining in the Sensor Web

    Science.gov (United States)

    Tanner, S.; Conover, H.; Graves, S.; Ramachandran, R.; Rushing, J.

    2004-12-01

    On-board data mining can contribute to many research and engineering applications, including natural hazard detection and prediction, intelligent sensor control, and the generation of customized data products for direct distribution to users. The ability to mine sensor data in real time can also be a critical component of autonomous operations, supporting deep space missions, unmanned aerial and ground-based vehicles (UAVs, UGVs), and a wide range of sensor meshes, webs and grids. On-board processing is expected to play a significant role in the next generation of NASA, Homeland Security, Department of Defense and civilian programs, providing for greater flexibility and versatility in measurements of physical systems. In addition, the use of UAV and UGV systems is increasing in military, emergency response and industrial applications. As research into the autonomy of these vehicles progresses, especially in fleet or web configurations, the applicability of on-board data mining is expected to increase significantly. Data mining in real time on board sensor platforms presents unique challenges. Most notably, the data to be mined is a continuous stream, rather than a fixed store such as a database. This means that the data mining algorithms must be modified to make only a single pass through the data. In addition, the on-board environment requires real time processing with limited computing resources, thus the algorithms must use fixed and relatively small amounts of processing time and memory. The University of Alabama in Huntsville is developing an innovative processing framework for the on-board data and information environment. The Environment for On-Board Processing (EVE) and the Adaptive On-board Data Processing (AODP) projects serve as proofs-of-concept of advanced information systems for remote sensing platforms. The EVE real-time processing infrastructure will upload, schedule and control the execution of processing plans on board remote sensors. These plans

  18. Text Mining in Organizational Research.

    Science.gov (United States)

    Kobayashi, Vladimer B; Mol, Stefan T; Berkers, Hannah A; Kismihók, Gábor; Den Hartog, Deanne N

    2018-07-01

    Despite the ubiquity of textual data, so far few researchers have applied text mining to answer organizational research questions. Text mining, which essentially entails a quantitative approach to the analysis of (usually) voluminous textual data, helps accelerate knowledge discovery by radically increasing the amount data that can be analyzed. This article aims to acquaint organizational researchers with the fundamental logic underpinning text mining, the analytical stages involved, and contemporary techniques that may be used to achieve different types of objectives. The specific analytical techniques reviewed are (a) dimensionality reduction, (b) distance and similarity computing, (c) clustering, (d) topic modeling, and (e) classification. We describe how text mining may extend contemporary organizational research by allowing the testing of existing or new research questions with data that are likely to be rich, contextualized, and ecologically valid. After an exploration of how evidence for the validity of text mining output may be generated, we conclude the article by illustrating the text mining process in a job analysis setting using a dataset composed of job vacancies.

  19. Experimental economics for web mining

    OpenAIRE

    Tagiew, Rustam; Ignatov, Dmitry I.; Amroush, Fadi

    2014-01-01

    This paper offers a step towards research infrastructure, which makes data from experimental economics efficiently usable for analysis of web data. We believe that regularities of human behavior found in experimental data also emerge in real world web data. A format for data from experiments is suggested, which enables its publication as open data. Once standardized datasets of experiments are available on-line, web mining can take advantages from this data. Further, the questions about the o...

  20. SparkText: Biomedical Text Mining on Big Data Framework.

    Science.gov (United States)

    Ye, Zhan; Tafti, Ahmad P; He, Karen Y; Wang, Kai; He, Max M

    Many new biomedical research articles are published every day, accumulating rich information, such as genetic variants, genes, diseases, and treatments. Rapid yet accurate text mining on large-scale scientific literature can discover novel knowledge to better understand human diseases and to improve the quality of disease diagnosis, prevention, and treatment. In this study, we designed and developed an efficient text mining framework called SparkText on a Big Data infrastructure, which is composed of Apache Spark data streaming and machine learning methods, combined with a Cassandra NoSQL database. To demonstrate its performance for classifying cancer types, we extracted information (e.g., breast, prostate, and lung cancers) from tens of thousands of articles downloaded from PubMed, and then employed Naïve Bayes, Support Vector Machine (SVM), and Logistic Regression to build prediction models to mine the articles. The accuracy of predicting a cancer type by SVM using the 29,437 full-text articles was 93.81%. While competing text-mining tools took more than 11 hours, SparkText mined the dataset in approximately 6 minutes. This study demonstrates the potential for mining large-scale scientific articles on a Big Data infrastructure, with real-time update from new articles published daily. SparkText can be extended to other areas of biomedical research.

  1. SparkText: Biomedical Text Mining on Big Data Framework.

    Directory of Open Access Journals (Sweden)

    Zhan Ye

    Full Text Available Many new biomedical research articles are published every day, accumulating rich information, such as genetic variants, genes, diseases, and treatments. Rapid yet accurate text mining on large-scale scientific literature can discover novel knowledge to better understand human diseases and to improve the quality of disease diagnosis, prevention, and treatment.In this study, we designed and developed an efficient text mining framework called SparkText on a Big Data infrastructure, which is composed of Apache Spark data streaming and machine learning methods, combined with a Cassandra NoSQL database. To demonstrate its performance for classifying cancer types, we extracted information (e.g., breast, prostate, and lung cancers from tens of thousands of articles downloaded from PubMed, and then employed Naïve Bayes, Support Vector Machine (SVM, and Logistic Regression to build prediction models to mine the articles. The accuracy of predicting a cancer type by SVM using the 29,437 full-text articles was 93.81%. While competing text-mining tools took more than 11 hours, SparkText mined the dataset in approximately 6 minutes.This study demonstrates the potential for mining large-scale scientific articles on a Big Data infrastructure, with real-time update from new articles published daily. SparkText can be extended to other areas of biomedical research.

  2. SIAM 2007 Text Mining Competition dataset

    Data.gov (United States)

    National Aeronautics and Space Administration — Subject Area: Text Mining Description: This is the dataset used for the SIAM 2007 Text Mining competition. This competition focused on developing text mining...

  3. Mining web-based data to assess public response to environmental events

    International Nuclear Information System (INIS)

    Cha, YoonKyung; Stow, Craig A.

    2015-01-01

    We explore how the analysis of web-based data, such as Twitter and Google Trends, can be used to assess the social relevance of an environmental accident. The concept and methods are applied in the shutdown of drinking water supply at the city of Toledo, Ohio, USA. Toledo's notice, which persisted from August 1 to 4, 2014, is a high-profile event that directly influenced approximately half a million people and received wide recognition. The notice was given when excessive levels of microcystin, a byproduct of cyanobacteria blooms, were discovered at the drinking water treatment plant on Lake Erie. Twitter mining results illustrated an instant response to the Toledo incident, the associated collective knowledge, and public perception. The results from Google Trends, on the other hand, revealed how the Toledo event raised public attention on the associated environmental issue, harmful algal blooms, in a long-term context. Thus, when jointly applied, Twitter and Google Trend analysis results offer complementary perspectives. Web content aggregated through mining approaches provides a social standpoint, such as public perception and interest, and offers context for establishing and evaluating environmental management policies. - The joint application of Twitter and Google Trend analysis to an environmental event offered both short and long-term patterns of public perception and interest on the event

  4. Building a glaucoma interaction network using a text mining approach.

    Science.gov (United States)

    Soliman, Maha; Nasraoui, Olfa; Cooper, Nigel G F

    2016-01-01

    The volume of biomedical literature and its underlying knowledge base is rapidly expanding, making it beyond the ability of a single human being to read through all the literature. Several automated methods have been developed to help make sense of this dilemma. The present study reports on the results of a text mining approach to extract gene interactions from the data warehouse of published experimental results which are then used to benchmark an interaction network associated with glaucoma. To the best of our knowledge, there is, as yet, no glaucoma interaction network derived solely from text mining approaches. The presence of such a network could provide a useful summative knowledge base to complement other forms of clinical information related to this disease. A glaucoma corpus was constructed from PubMed Central and a text mining approach was applied to extract genes and their relations from this corpus. The extracted relations between genes were checked using reference interaction databases and classified generally as known or new relations. The extracted genes and relations were then used to construct a glaucoma interaction network. Analysis of the resulting network indicated that it bears the characteristics of a small world interaction network. Our analysis showed the presence of seven glaucoma linked genes that defined the network modularity. A web-based system for browsing and visualizing the extracted glaucoma related interaction networks is made available at http://neurogene.spd.louisville.edu/GlaucomaINViewer/Form1.aspx. This study has reported the first version of a glaucoma interaction network using a text mining approach. The power of such an approach is in its ability to cover a wide range of glaucoma related studies published over many years. Hence, a bigger picture of the disease can be established. To the best of our knowledge, this is the first glaucoma interaction network to summarize the known literature. The major findings were a set of

  5. SparkText: Biomedical Text Mining on Big Data Framework

    Science.gov (United States)

    He, Karen Y.; Wang, Kai

    2016-01-01

    Background Many new biomedical research articles are published every day, accumulating rich information, such as genetic variants, genes, diseases, and treatments. Rapid yet accurate text mining on large-scale scientific literature can discover novel knowledge to better understand human diseases and to improve the quality of disease diagnosis, prevention, and treatment. Results In this study, we designed and developed an efficient text mining framework called SparkText on a Big Data infrastructure, which is composed of Apache Spark data streaming and machine learning methods, combined with a Cassandra NoSQL database. To demonstrate its performance for classifying cancer types, we extracted information (e.g., breast, prostate, and lung cancers) from tens of thousands of articles downloaded from PubMed, and then employed Naïve Bayes, Support Vector Machine (SVM), and Logistic Regression to build prediction models to mine the articles. The accuracy of predicting a cancer type by SVM using the 29,437 full-text articles was 93.81%. While competing text-mining tools took more than 11 hours, SparkText mined the dataset in approximately 6 minutes. Conclusions This study demonstrates the potential for mining large-scale scientific articles on a Big Data infrastructure, with real-time update from new articles published daily. SparkText can be extended to other areas of biomedical research. PMID:27685652

  6. U-Compare: share and compare text mining tools with UIMA

    Science.gov (United States)

    Kano, Yoshinobu; Baumgartner, William A.; McCrohon, Luke; Ananiadou, Sophia; Cohen, K. Bretonnel; Hunter, Lawrence; Tsujii, Jun'ichi

    2009-01-01

    Summary: Due to the increasing number of text mining resources (tools and corpora) available to biologists, interoperability issues between these resources are becoming significant obstacles to using them effectively. UIMA, the Unstructured Information Management Architecture, is an open framework designed to aid in the construction of more interoperable tools. U-Compare is built on top of the UIMA framework, and provides both a concrete framework for out-of-the-box text mining and a sophisticated evaluation platform allowing users to run specific tools on any target text, generating both detailed statistics and instance-based visualizations of outputs. U-Compare is a joint project, providing the world's largest, and still growing, collection of UIMA-compatible resources. These resources, originally developed by different groups for a variety of domains, include many famous tools and corpora. U-Compare can be launched straight from the web, without needing to be manually installed. All U-Compare components are provided ready-to-use and can be combined easily via a drag-and-drop interface without any programming. External UIMA components can also simply be mixed with U-Compare components, without distinguishing between locally and remotely deployed resources. Availability: http://u-compare.org/ Contact: kano@is.s.u-tokyo.ac.jp PMID:19414535

  7. Text mining from ontology learning to automated text processing applications

    CERN Document Server

    Biemann, Chris

    2014-01-01

    This book comprises a set of articles that specify the methodology of text mining, describe the creation of lexical resources in the framework of text mining and use text mining for various tasks in natural language processing (NLP). The analysis of large amounts of textual data is a prerequisite to build lexical resources such as dictionaries and ontologies and also has direct applications in automated text processing in fields such as history, healthcare and mobile applications, just to name a few. This volume gives an update in terms of the recent gains in text mining methods and reflects

  8. EnvMine: A text-mining system for the automatic extraction of contextual information

    Directory of Open Access Journals (Sweden)

    de Lorenzo Victor

    2010-06-01

    Full Text Available Abstract Background For ecological studies, it is crucial to count on adequate descriptions of the environments and samples being studied. Such a description must be done in terms of their physicochemical characteristics, allowing a direct comparison between different environments that would be difficult to do otherwise. Also the characterization must include the precise geographical location, to make possible the study of geographical distributions and biogeographical patterns. Currently, there is no schema for annotating these environmental features, and these data have to be extracted from textual sources (published articles. So far, this had to be performed by manual inspection of the corresponding documents. To facilitate this task, we have developed EnvMine, a set of text-mining tools devoted to retrieve contextual information (physicochemical variables and geographical locations from textual sources of any kind. Results EnvMine is capable of retrieving the physicochemical variables cited in the text, by means of the accurate identification of their associated units of measurement. In this task, the system achieves a recall (percentage of items retrieved of 92% with less than 1% error. Also a Bayesian classifier was tested for distinguishing parts of the text describing environmental characteristics from others dealing with, for instance, experimental settings. Regarding the identification of geographical locations, the system takes advantage of existing databases such as GeoNames to achieve 86% recall with 92% precision. The identification of a location includes also the determination of its exact coordinates (latitude and longitude, thus allowing the calculation of distance between the individual locations. Conclusion EnvMine is a very efficient method for extracting contextual information from different text sources, like published articles or web pages. This tool can help in determining the precise location and physicochemical

  9. A fuzzy method for improving the functionality of search engines based on user's web interactions

    Directory of Open Access Journals (Sweden)

    Farzaneh Kabirbeyk

    2015-04-01

    Full Text Available Web mining has been widely used to discover knowledge from various sources in the web. One of the important tools in web mining is mining of web user’s behavior that is considered as a way to discover the potential knowledge of web user’s interaction. Nowadays, Website personalization is regarded as a popular phenomenon among web users and it plays an important role in facilitating user access and provides information of users’ requirements based on their own interests. Extracting important features about web user behavior plays a significant role in web usage mining. Such features are page visit frequency in each session, visit duration, and dates of visiting a certain pages. This paper presents a method to predict user’s interest and to propose a list of pages based on their interests by identifying user’s behavior based on fuzzy techniques called fuzzy clustering method. Due to the user’s different interests and use of one or more interest at a time, user’s interest may belong to several clusters and fuzzy clustering provide a possible overlap. Using the resulted cluster helps extract fuzzy rules. This helps detecting user’s movement pattern and using neural network a list of suggested pages to the users is provided.

  10. Optimizing the Information Presentation on Mining Potential by using Web Services Technology with Restful Protocol

    Science.gov (United States)

    Abdillah, T.; Dai, R.; Setiawan, E.

    2018-02-01

    This study aims to develop the application of Web Services technology with RestFul Protocol to optimize the information presentation on mining potential. This study used User Interface Design approach for the information accuracy and relevance as well as the Web Service for the reliability in presenting the information. The results show that: the information accuracy and relevance regarding mining potential can be seen from the achievement of User Interface implementation in the application that is based on the following rules: The consideration of the appropriate colours and objects, the easiness of using the navigation, and users’ interaction with the applications that employs symbols and languages understood by the users; the information accuracy and relevance related to mining potential can be observed by the information presented by using charts and Tool Tip Text to help the users understand the provided chart/figure; the reliability of the information presentation is evident by the results of Web Services testing in Figure 4.5.6. This study finds out that User Interface Design and Web Services approaches (for the access of different Platform apps) are able to optimize the presentation. The results of this study can be used as a reference for software developers and Provincial Government of Gorontalo.

  11. Text Mining Applications and Theory

    CERN Document Server

    Berry, Michael W

    2010-01-01

    Text Mining: Applications and Theory presents the state-of-the-art algorithms for text mining from both the academic and industrial perspectives.  The contributors span several countries and scientific domains: universities, industrial corporations, and government laboratories, and demonstrate the use of techniques from machine learning, knowledge discovery, natural language processing and information retrieval to design computational models for automated text analysis and mining. This volume demonstrates how advancements in the fields of applied mathematics, computer science, machine learning

  12. Graph Mining Meets the Semantic Web

    Energy Technology Data Exchange (ETDEWEB)

    Lee, Sangkeun (Matt) [ORNL; Sukumar, Sreenivas R [ORNL; Lim, Seung-Hwan [ORNL

    2015-01-01

    The Resource Description Framework (RDF) and SPARQL Protocol and RDF Query Language (SPARQL) were introduced about a decade ago to enable flexible schema-free data interchange on the Semantic Web. Today, data scientists use the framework as a scalable graph representation for integrating, querying, exploring and analyzing data sets hosted at different sources. With increasing adoption, the need for graph mining capabilities for the Semantic Web has emerged. We address that need through implementation of three popular iterative Graph Mining algorithms (Triangle count, Connected component analysis, and PageRank). We implement these algorithms as SPARQL queries, wrapped within Python scripts. We evaluate the performance of our implementation on 6 real world data sets and show graph mining algorithms (that have a linear-algebra formulation) can indeed be unleashed on data represented as RDF graphs using the SPARQL query interface.

  13. Working with text tools, techniques and approaches for text mining

    CERN Document Server

    Tourte, Gregory J L

    2016-01-01

    Text mining tools and technologies have long been a part of the repository world, where they have been applied to a variety of purposes, from pragmatic aims to support tools. Research areas as diverse as biology, chemistry, sociology and criminology have seen effective use made of text mining technologies. Working With Text collects a subset of the best contributions from the 'Working with text: Tools, techniques and approaches for text mining' workshop, alongside contributions from experts in the area. Text mining tools and technologies in support of academic research include supporting research on the basis of a large body of documents, facilitating access to and reuse of extant work, and bridging between the formal academic world and areas such as traditional and social media. Jisc have funded a number of projects, including NaCTem (the National Centre for Text Mining) and the ResDis programme. Contents are developed from workshop submissions and invited contributions, including: Legal considerations in te...

  14. Benchmarking infrastructure for mutation text mining.

    Science.gov (United States)

    Klein, Artjom; Riazanov, Alexandre; Hindle, Matthew M; Baker, Christopher Jo

    2014-02-25

    Experimental research on the automatic extraction of information about mutations from texts is greatly hindered by the lack of consensus evaluation infrastructure for the testing and benchmarking of mutation text mining systems. We propose a community-oriented annotation and benchmarking infrastructure to support development, testing, benchmarking, and comparison of mutation text mining systems. The design is based on semantic standards, where RDF is used to represent annotations, an OWL ontology provides an extensible schema for the data and SPARQL is used to compute various performance metrics, so that in many cases no programming is needed to analyze results from a text mining system. While large benchmark corpora for biological entity and relation extraction are focused mostly on genes, proteins, diseases, and species, our benchmarking infrastructure fills the gap for mutation information. The core infrastructure comprises (1) an ontology for modelling annotations, (2) SPARQL queries for computing performance metrics, and (3) a sizeable collection of manually curated documents, that can support mutation grounding and mutation impact extraction experiments. We have developed the principal infrastructure for the benchmarking of mutation text mining tasks. The use of RDF and OWL as the representation for corpora ensures extensibility. The infrastructure is suitable for out-of-the-box use in several important scenarios and is ready, in its current state, for initial community adoption.

  15. Benchmarking infrastructure for mutation text mining

    Science.gov (United States)

    2014-01-01

    Background Experimental research on the automatic extraction of information about mutations from texts is greatly hindered by the lack of consensus evaluation infrastructure for the testing and benchmarking of mutation text mining systems. Results We propose a community-oriented annotation and benchmarking infrastructure to support development, testing, benchmarking, and comparison of mutation text mining systems. The design is based on semantic standards, where RDF is used to represent annotations, an OWL ontology provides an extensible schema for the data and SPARQL is used to compute various performance metrics, so that in many cases no programming is needed to analyze results from a text mining system. While large benchmark corpora for biological entity and relation extraction are focused mostly on genes, proteins, diseases, and species, our benchmarking infrastructure fills the gap for mutation information. The core infrastructure comprises (1) an ontology for modelling annotations, (2) SPARQL queries for computing performance metrics, and (3) a sizeable collection of manually curated documents, that can support mutation grounding and mutation impact extraction experiments. Conclusion We have developed the principal infrastructure for the benchmarking of mutation text mining tasks. The use of RDF and OWL as the representation for corpora ensures extensibility. The infrastructure is suitable for out-of-the-box use in several important scenarios and is ready, in its current state, for initial community adoption. PMID:24568600

  16. A Watercolor NPR System with Web-Mining 3D Color Charts

    Science.gov (United States)

    Chen, Lieu-Hen; Ho, Yi-Hsin; Liu, Ting-Yu; Hsieh, Wen-Chieh

    In this paper, we propose a watercolor image synthesizing system which integrates the user-personalized color charts based on web-mining technologies with the 3D Watercolor NPR system. Through our system, users can personalize their own color palette by using keywords such as the name of the artist or by choosing color sets on an emotional map. The related images are searched from web by adopting web mining technology, and the appropriate colors are extracted to construct the color chart by analyzing these images. Then, the color chart is rendered in a 3D visualization system which allows users to view and manage the distribution of colors interactively. Then, users can use these colors on our watercolor NPR system with a sketch-based GUI which allows users to manipulate watercolor attributes of object intuitively and directly.

  17. Project management in mine actions using Multi-Criteria-Analysis-based decision support system

    Directory of Open Access Journals (Sweden)

    Marko Mladineo

    2014-12-01

    Full Text Available In this paper, a Web-based Decision Support System (Web DSS, that supports humanitarian demining operations and restoration of mine-contaminated areas, is presented. The financial shortage usually triggers a need for priority setting in Project Management in Mine actions. As part of the FP7 Project TIRAMISU, a specialized Web DSS has been developed to achieve a fully transparent priority setting process. It allows stakeholders and donors to actively join the decision making process using a user-friendly and intuitive Web application. The main advantage of this Web DSS is its unique way of managing a mine action project using Multi-Criteria Analysis (MCA, namely the PROMETHEE method, in order to select priorities for demining actions. The developed Web DSS allows decision makers to use several predefined scenarios (different criteria weights or to develop their own, so it allows project managers to compare different demining possibilities with ease.

  18. Contextual Text Mining

    Science.gov (United States)

    Mei, Qiaozhu

    2009-01-01

    With the dramatic growth of text information, there is an increasing need for powerful text mining systems that can automatically discover useful knowledge from text. Text is generally associated with all kinds of contextual information. Those contexts can be explicit, such as the time and the location where a blog article is written, and the…

  19. Web mining in soft computing framework: relevance, state of the art and future directions.

    Science.gov (United States)

    Pal, S K; Talwar, V; Mitra, P

    2002-01-01

    The paper summarizes the different characteristics of Web data, the basic components of Web mining and its different types, and the current state of the art. The reason for considering Web mining, a separate field from data mining, is explained. The limitations of some of the existing Web mining methods and tools are enunciated, and the significance of soft computing (comprising fuzzy logic (FL), artificial neural networks (ANNs), genetic algorithms (GAs), and rough sets (RSs) are highlighted. A survey of the existing literature on "soft Web mining" is provided along with the commercially available systems. The prospective areas of Web mining where the application of soft computing needs immediate attention are outlined with justification. Scope for future research in developing "soft Web mining" systems is explained. An extensive bibliography is also provided.

  20. Text-mining-assisted biocuration workflows in Argo

    Science.gov (United States)

    Rak, Rafal; Batista-Navarro, Riza Theresa; Rowley, Andrew; Carter, Jacob; Ananiadou, Sophia

    2014-01-01

    Biocuration activities have been broadly categorized into the selection of relevant documents, the annotation of biological concepts of interest and identification of interactions between the concepts. Text mining has been shown to have a potential to significantly reduce the effort of biocurators in all the three activities, and various semi-automatic methodologies have been integrated into curation pipelines to support them. We investigate the suitability of Argo, a workbench for building text-mining solutions with the use of a rich graphical user interface, for the process of biocuration. Central to Argo are customizable workflows that users compose by arranging available elementary analytics to form task-specific processing units. A built-in manual annotation editor is the single most used biocuration tool of the workbench, as it allows users to create annotations directly in text, as well as modify or delete annotations created by automatic processing components. Apart from syntactic and semantic analytics, the ever-growing library of components includes several data readers and consumers that support well-established as well as emerging data interchange formats such as XMI, RDF and BioC, which facilitate the interoperability of Argo with other platforms or resources. To validate the suitability of Argo for curation activities, we participated in the BioCreative IV challenge whose purpose was to evaluate Web-based systems addressing user-defined biocuration tasks. Argo proved to have the edge over other systems in terms of flexibility of defining biocuration tasks. As expected, the versatility of the workbench inevitably lengthened the time the curators spent on learning the system before taking on the task, which may have affected the usability of Argo. The participation in the challenge gave us an opportunity to gather valuable feedback and identify areas of improvement, some of which have already been introduced. Database URL: http://argo.nactem.ac.uk PMID

  1. URL Mining Using Agglomerative Clustering Algorithm

    Directory of Open Access Journals (Sweden)

    Chinmay R. Deshmukh

    2015-02-01

    Full Text Available Abstract The tremendous growth of the web world incorporates application of data mining techniques to the web logs. Data Mining and World Wide Web encompasses an important and active area of research. Web log mining is analysis of web log files with web pages sequences. Web mining is broadly classified as web content mining web usage mining and web structure mining. Web usage mining is a technique to discover usage patterns from Web data in order to understand and better serve the needs of Web-based applications. URL mining refers to a subclass of Web mining that helps us to investigate the details of a Uniform Resource Locator. URL mining can be advantageous in the fields of security and protection. The paper introduces a technique for mining a collection of user transactions with an Internet search engine to discover clusters of similar queries and similar URLs. The information we exploit is a clickthrough data each record consist of a users query to a search engine along with the URL which the user selected from among the candidates offered by search engine. By viewing this dataset as a bipartite graph with the vertices on one side corresponding to queries and on the other side to URLs one can apply an agglomerative clustering algorithm to the graphs vertices to identify related queries and URLs.

  2. Aspects of Text Mining From Computational Semiotics to Systemic Functional Hypertexts

    Directory of Open Access Journals (Sweden)

    Alexander Mehler

    2001-05-01

    Full Text Available The significance of natural language texts as the prime information structure for the management and dissemination of knowledge in organisations is still increasing. Making relevant documents available depending on varying tasks in different contexts is of primary importance for any efficient task completion. Implementing this demand requires the content based processing of texts, which enables to reconstruct or, if necessary, to explore the relationship of task, context and document. Text mining is a technology that is suitable for solving problems of this kind. In the following, semiotic aspects of text mining are investigated. Based on the primary object of text mining - natural language lexis - the specific complexity of this class of signs is outlined and requirements for the implementation of text mining procedures are derived. This is done with reference to text linkage introduced as a special task in text mining. Text linkage refers to the exploration of implicit, content based relations of texts (and their annotation as typed links in corpora possibly organised as hypertexts. In this context, the term systemic functional hypertext is introduced, which distinguishes genre and register layers for the management of links in a poly-level hypertext system.

  3. Recognition of pornographic web pages by classifying texts and images.

    Science.gov (United States)

    Hu, Weiming; Wu, Ou; Chen, Zhouyao; Fu, Zhouyu; Maybank, Steve

    2007-06-01

    With the rapid development of the World Wide Web, people benefit more and more from the sharing of information. However, Web pages with obscene, harmful, or illegal content can be easily accessed. It is important to recognize such unsuitable, offensive, or pornographic Web pages. In this paper, a novel framework for recognizing pornographic Web pages is described. A C4.5 decision tree is used to divide Web pages, according to content representations, into continuous text pages, discrete text pages, and image pages. These three categories of Web pages are handled, respectively, by a continuous text classifier, a discrete text classifier, and an algorithm that fuses the results from the image classifier and the discrete text classifier. In the continuous text classifier, statistical and semantic features are used to recognize pornographic texts. In the discrete text classifier, the naive Bayes rule is used to calculate the probability that a discrete text is pornographic. In the image classifier, the object's contour-based features are extracted to recognize pornographic images. In the text and image fusion algorithm, the Bayes theory is used to combine the recognition results from images and texts. Experimental results demonstrate that the continuous text classifier outperforms the traditional keyword-statistics-based classifier, the contour-based image classifier outperforms the traditional skin-region-based image classifier, the results obtained by our fusion algorithm outperform those by either of the individual classifiers, and our framework can be adapted to different categories of Web pages.

  4. Text mining for the biocuration workflow.

    Science.gov (United States)

    Hirschman, Lynette; Burns, Gully A P C; Krallinger, Martin; Arighi, Cecilia; Cohen, K Bretonnel; Valencia, Alfonso; Wu, Cathy H; Chatr-Aryamontri, Andrew; Dowell, Karen G; Huala, Eva; Lourenço, Anália; Nash, Robert; Veuthey, Anne-Lise; Wiegers, Thomas; Winter, Andrew G

    2012-01-01

    Molecular biology has become heavily dependent on biological knowledge encoded in expert curated biological databases. As the volume of biological literature increases, biocurators need help in keeping up with the literature; (semi-) automated aids for biocuration would seem to be an ideal application for natural language processing and text mining. However, to date, there have been few documented successes for improving biocuration throughput using text mining. Our initial investigations took place for the workshop on 'Text Mining for the BioCuration Workflow' at the third International Biocuration Conference (Berlin, 2009). We interviewed biocurators to obtain workflows from eight biological databases. This initial study revealed high-level commonalities, including (i) selection of documents for curation; (ii) indexing of documents with biologically relevant entities (e.g. genes); and (iii) detailed curation of specific relations (e.g. interactions); however, the detailed workflows also showed many variabilities. Following the workshop, we conducted a survey of biocurators. The survey identified biocurator priorities, including the handling of full text indexed with biological entities and support for the identification and prioritization of documents for curation. It also indicated that two-thirds of the biocuration teams had experimented with text mining and almost half were using text mining at that time. Analysis of our interviews and survey provide a set of requirements for the integration of text mining into the biocuration workflow. These can guide the identification of common needs across curated databases and encourage joint experimentation involving biocurators, text mining developers and the larger biomedical research community.

  5. Text mining for the biocuration workflow

    Science.gov (United States)

    Hirschman, Lynette; Burns, Gully A. P. C; Krallinger, Martin; Arighi, Cecilia; Cohen, K. Bretonnel; Valencia, Alfonso; Wu, Cathy H.; Chatr-Aryamontri, Andrew; Dowell, Karen G.; Huala, Eva; Lourenço, Anália; Nash, Robert; Veuthey, Anne-Lise; Wiegers, Thomas; Winter, Andrew G.

    2012-01-01

    Molecular biology has become heavily dependent on biological knowledge encoded in expert curated biological databases. As the volume of biological literature increases, biocurators need help in keeping up with the literature; (semi-) automated aids for biocuration would seem to be an ideal application for natural language processing and text mining. However, to date, there have been few documented successes for improving biocuration throughput using text mining. Our initial investigations took place for the workshop on ‘Text Mining for the BioCuration Workflow’ at the third International Biocuration Conference (Berlin, 2009). We interviewed biocurators to obtain workflows from eight biological databases. This initial study revealed high-level commonalities, including (i) selection of documents for curation; (ii) indexing of documents with biologically relevant entities (e.g. genes); and (iii) detailed curation of specific relations (e.g. interactions); however, the detailed workflows also showed many variabilities. Following the workshop, we conducted a survey of biocurators. The survey identified biocurator priorities, including the handling of full text indexed with biological entities and support for the identification and prioritization of documents for curation. It also indicated that two-thirds of the biocuration teams had experimented with text mining and almost half were using text mining at that time. Analysis of our interviews and survey provide a set of requirements for the integration of text mining into the biocuration workflow. These can guide the identification of common needs across curated databases and encourage joint experimentation involving biocurators, text mining developers and the larger biomedical research community. PMID:22513129

  6. SA-Search: a web tool for protein structure mining based on a Structural Alphabet

    OpenAIRE

    Guyon, Frédéric; Camproux, Anne-Claude; Hochez, Joëlle; Tufféry, Pierre

    2004-01-01

    SA-Search is a web tool that can be used to mine for protein structures and extract structural similarities. It is based on a hidden Markov model derived Structural Alphabet (SA) that allows the compression of three-dimensional (3D) protein conformations into a one-dimensional (1D) representation using a limited number of prototype conformations. Using such a representation, classical methods developed for amino acid sequences can be employed. Currently, SA-Search permits the performance of f...

  7. Socio-contextual Network Mining for User Assistance in Web-based Knowledge Gathering Tasks

    Science.gov (United States)

    Rajendran, Balaji; Kombiah, Iyakutti

    Web-based Knowledge Gathering (WKG) is a specialized and complex information seeking task carried out by many users on the web, for their various learning, and decision-making requirements. We construct a contextual semantic structure by observing the actions of the users involved in WKG task, in order to gain an understanding of their task and requirement. We also build a knowledge warehouse in the form of a master Semantic Link Network (SLX) that accommodates and assimilates all the contextual semantic structures. This master SLX, which is a socio-contextual network, is then mined to provide contextual inputs to the current users through their agents. We validated our approach through experiments and analyzed the benefits to the users in terms of resource explorations and the time saved. The results are positive enough to motivate us to implement in a larger scale.

  8. tagtog: interactive and text-mining-assisted annotation of gene mentions in PLOS full-text articles.

    Science.gov (United States)

    Cejuela, Juan Miguel; McQuilton, Peter; Ponting, Laura; Marygold, Steven J; Stefancsik, Raymund; Millburn, Gillian H; Rost, Burkhard

    2014-01-01

    The breadth and depth of biomedical literature are increasing year upon year. To keep abreast of these increases, FlyBase, a database for Drosophila genomic and genetic information, is constantly exploring new ways to mine the published literature to increase the efficiency and accuracy of manual curation and to automate some aspects, such as triaging and entity extraction. Toward this end, we present the 'tagtog' system, a web-based annotation framework that can be used to mark up biological entities (such as genes) and concepts (such as Gene Ontology terms) in full-text articles. tagtog leverages manual user annotation in combination with automatic machine-learned annotation to provide accurate identification of gene symbols and gene names. As part of the BioCreative IV Interactive Annotation Task, FlyBase has used tagtog to identify and extract mentions of Drosophila melanogaster gene symbols and names in full-text biomedical articles from the PLOS stable of journals. We show here the results of three experiments with different sized corpora and assess gene recognition performance and curation speed. We conclude that tagtog-named entity recognition improves with a larger corpus and that tagtog-assisted curation is quicker than manual curation. DATABASE URL: www.tagtog.net, www.flybase.org.

  9. Text mining in livestock animal science: introducing the potential of text mining to animal sciences.

    Science.gov (United States)

    Sahadevan, S; Hofmann-Apitius, M; Schellander, K; Tesfaye, D; Fluck, J; Friedrich, C M

    2012-10-01

    In biological research, establishing the prior art by searching and collecting information already present in the domain has equal importance as the experiments done. To obtain a complete overview about the relevant knowledge, researchers mainly rely on 2 major information sources: i) various biological databases and ii) scientific publications in the field. The major difference between the 2 information sources is that information from databases is available, typically well structured and condensed. The information content in scientific literature is vastly unstructured; that is, dispersed among the many different sections of scientific text. The traditional method of information extraction from scientific literature occurs by generating a list of relevant publications in the field of interest and manually scanning these texts for relevant information, which is very time consuming. It is more than likely that in using this "classical" approach the researcher misses some relevant information mentioned in the literature or has to go through biological databases to extract further information. Text mining and named entity recognition methods have already been used in human genomics and related fields as a solution to this problem. These methods can process and extract information from large volumes of scientific text. Text mining is defined as the automatic extraction of previously unknown and potentially useful information from text. Named entity recognition (NER) is defined as the method of identifying named entities (names of real world objects; for example, gene/protein names, drugs, enzymes) in text. In animal sciences, text mining and related methods have been briefly used in murine genomics and associated fields, leaving behind other fields of animal sciences, such as livestock genomics. The aim of this work was to develop an information retrieval platform in the livestock domain focusing on livestock publications and the recognition of relevant data from

  10. Universally Designed Text on the Web: Towards Readability Criteria Based on Anti-Patterns.

    Science.gov (United States)

    Eika, Evelyn

    2016-01-01

    The readability of web texts affects accessibility. The Web Content Accessibility guidelines (WCAG) state that the recommended reading level should match that of someone who has completed basic schooling. However, WCAG does not give advice on what constitutes an appropriate reading level. Web authors need tools to help composing WCAG compliant texts, and specific criteria are needed. Classic readability metrics are generally based on lengths of words and sentences and have been criticized for being over-simplistic. Automatic measures and classifications of texts' reading levels employing more advanced constructs remain an unresolved problem. If such measures were feasible, what should these be? This work examines three language constructs not captured by current readability indices but believed to significantly affect actual readability, namely, relative clauses, garden path sentences, and left-branching structures. The goal is to see whether quantifications of these stylistic features reflect readability and how they correspond to common readability measures. Manual assessments of a set of authentic web texts for such uses were conducted. The results reveal that texts related to narratives such as children's stories, which are given the highest readability value, do not contain these constructs. The structures in question occur more frequently in expository texts that aim at educating or disseminating information such as strategy and journal articles. The results suggest that language anti-patterns hold potential for establishing a set of deeper readability criteria.

  11. Alkemio: association of chemicals with biomedical topics by text and data mining.

    Science.gov (United States)

    Gijón-Correas, José A; Andrade-Navarro, Miguel A; Fontaine, Jean F

    2014-07-01

    The PubMed® database of biomedical citations allows the retrieval of scientific articles studying the function of chemicals in biology and medicine. Mining millions of available citations to search reported associations between chemicals and topics of interest would require substantial human time. We have implemented the Alkemio text mining web tool and SOAP web service to help in this task. The tool uses biomedical articles discussing chemicals (including drugs), predicts their relatedness to the query topic with a naïve Bayesian classifier and ranks all chemicals by P-values computed from random simulations. Benchmarks on seven human pathways showed good retrieval performance (areas under the receiver operating characteristic curves ranged from 73.6 to 94.5%). Comparison with existing tools to retrieve chemicals associated to eight diseases showed the higher precision and recall of Alkemio when considering the top 10 candidate chemicals. Alkemio is a high performing web tool ranking chemicals for any biomedical topics and it is free to non-commercial users. http://cbdm.mdc-berlin.de/∼medlineranker/cms/alkemio. © The Author(s) 2014. Published by Oxford University Press on behalf of Nucleic Acids Research.

  12. Analysing Customer Opinions with Text Mining Algorithms

    Science.gov (United States)

    Consoli, Domenico

    2009-08-01

    Knowing what the customer thinks of a particular product/service helps top management to introduce improvements in processes and products, thus differentiating the company from their competitors and gain competitive advantages. The customers, with their preferences, determine the success or failure of a company. In order to know opinions of the customers we can use technologies available from the web 2.0 (blog, wiki, forums, chat, social networking, social commerce). From these web sites, useful information must be extracted, for strategic purposes, using techniques of sentiment analysis or opinion mining.

  13. Text mining improves prediction of protein functional sites.

    Directory of Open Access Journals (Sweden)

    Karin M Verspoor

    Full Text Available We present an approach that integrates protein structure analysis and text mining for protein functional site prediction, called LEAP-FS (Literature Enhanced Automated Prediction of Functional Sites. The structure analysis was carried out using Dynamics Perturbation Analysis (DPA, which predicts functional sites at control points where interactions greatly perturb protein vibrations. The text mining extracts mentions of residues in the literature, and predicts that residues mentioned are functionally important. We assessed the significance of each of these methods by analyzing their performance in finding known functional sites (specifically, small-molecule binding sites and catalytic sites in about 100,000 publicly available protein structures. The DPA predictions recapitulated many of the functional site annotations and preferentially recovered binding sites annotated as biologically relevant vs. those annotated as potentially spurious. The text-based predictions were also substantially supported by the functional site annotations: compared to other residues, residues mentioned in text were roughly six times more likely to be found in a functional site. The overlap of predictions with annotations improved when the text-based and structure-based methods agreed. Our analysis also yielded new high-quality predictions of many functional site residues that were not catalogued in the curated data sources we inspected. We conclude that both DPA and text mining independently provide valuable high-throughput protein functional site predictions, and that integrating the two methods using LEAP-FS further improves the quality of these predictions.

  14. Text Mining Improves Prediction of Protein Functional Sites

    Science.gov (United States)

    Cohn, Judith D.; Ravikumar, Komandur E.

    2012-01-01

    We present an approach that integrates protein structure analysis and text mining for protein functional site prediction, called LEAP-FS (Literature Enhanced Automated Prediction of Functional Sites). The structure analysis was carried out using Dynamics Perturbation Analysis (DPA), which predicts functional sites at control points where interactions greatly perturb protein vibrations. The text mining extracts mentions of residues in the literature, and predicts that residues mentioned are functionally important. We assessed the significance of each of these methods by analyzing their performance in finding known functional sites (specifically, small-molecule binding sites and catalytic sites) in about 100,000 publicly available protein structures. The DPA predictions recapitulated many of the functional site annotations and preferentially recovered binding sites annotated as biologically relevant vs. those annotated as potentially spurious. The text-based predictions were also substantially supported by the functional site annotations: compared to other residues, residues mentioned in text were roughly six times more likely to be found in a functional site. The overlap of predictions with annotations improved when the text-based and structure-based methods agreed. Our analysis also yielded new high-quality predictions of many functional site residues that were not catalogued in the curated data sources we inspected. We conclude that both DPA and text mining independently provide valuable high-throughput protein functional site predictions, and that integrating the two methods using LEAP-FS further improves the quality of these predictions. PMID:22393388

  15. Mining

    Directory of Open Access Journals (Sweden)

    Khairullah Khan

    2014-09-01

    Full Text Available Opinion mining is an interesting area of research because of its applications in various fields. Collecting opinions of people about products and about social and political events and problems through the Web is becoming increasingly popular every day. The opinions of users are helpful for the public and for stakeholders when making certain decisions. Opinion mining is a way to retrieve information through search engines, Web blogs and social networks. Because of the huge number of reviews in the form of unstructured text, it is impossible to summarize the information manually. Accordingly, efficient computational methods are needed for mining and summarizing the reviews from corpuses and Web documents. This study presents a systematic literature survey regarding the computational techniques, models and algorithms for mining opinion components from unstructured reviews.

  16. Data mining approach to web application intrusions detection

    Science.gov (United States)

    Kalicki, Arkadiusz

    2011-10-01

    Web applications became most popular medium in the Internet. Popularity, easiness of web application script languages and frameworks together with careless development results in high number of web application vulnerabilities and high number of attacks performed. There are several types of attacks possible because of improper input validation: SQL injection Cross-site scripting, Cross-Site Request Forgery (CSRF), web spam in blogs and others. In order to secure web applications intrusion detection (IDS) and intrusion prevention systems (IPS) are being used. Intrusion detection systems are divided in two groups: misuse detection (traditional IDS) and anomaly detection. This paper presents data mining based algorithm for anomaly detection. The principle of this method is the comparison of the incoming HTTP traffic with a previously built profile that contains a representation of the "normal" or expected web application usage sequence patterns. The frequent sequence patterns are found with GSP algorithm. Previously presented detection method was rewritten and improved. Some tests show that the software catches malicious requests, especially long attack sequences, results quite good with medium length sequences, for short length sequences must be complemented with other methods.

  17. Information Retrieval and Text Mining Technologies for Chemistry.

    Science.gov (United States)

    Krallinger, Martin; Rabal, Obdulia; Lourenço, Anália; Oyarzabal, Julen; Valencia, Alfonso

    2017-06-28

    Efficient access to chemical information contained in scientific literature, patents, technical reports, or the web is a pressing need shared by researchers and patent attorneys from different chemical disciplines. Retrieval of important chemical information in most cases starts with finding relevant documents for a particular chemical compound or family. Targeted retrieval of chemical documents is closely connected to the automatic recognition of chemical entities in the text, which commonly involves the extraction of the entire list of chemicals mentioned in a document, including any associated information. In this Review, we provide a comprehensive and in-depth description of fundamental concepts, technical implementations, and current technologies for meeting these information demands. A strong focus is placed on community challenges addressing systems performance, more particularly CHEMDNER and CHEMDNER patents tasks of BioCreative IV and V, respectively. Considering the growing interest in the construction of automatically annotated chemical knowledge bases that integrate chemical information and biological data, cheminformatics approaches for mapping the extracted chemical names into chemical structures and their subsequent annotation together with text mining applications for linking chemistry with biological information are also presented. Finally, future trends and current challenges are highlighted as a roadmap proposal for research in this emerging field.

  18. Internet of Things in Health Trends Through Bibliometrics and Text Mining.

    Science.gov (United States)

    Konstantinidis, Stathis Th; Billis, Antonis; Wharrad, Heather; Bamidis, Panagiotis D

    2017-01-01

    Recently a new buzzword has slowly but surely emerged, namely the Internet of Things (IoT). The importance of IoT is identified worldwide both by organisations and governments and the scientific community with an incremental number of publications during the last few years. IoT in Health is one of the main pillars of this evolution, but limited research has been performed on future visions and trends. Thus, in this study we investigate the longitudinal trends of Internet of Things in Health through bibliometrics and use of text mining. Seven hundred seventy eight (778) articles were retrieved form The Web of Science database from 1998 to 2016. The publications are grouped into thirty (30) clusters based on abstract text analysis resulting into some eight (8) trends of IoT in Health. Research in this field is obviously obtaining a worldwide character with specific trends, which are worth delineating to be in favour of some areas.

  19. Frontiers of biomedical text mining: current progress

    Science.gov (United States)

    Zweigenbaum, Pierre; Demner-Fushman, Dina; Yu, Hong; Cohen, Kevin B.

    2008-01-01

    It is now almost 15 years since the publication of the first paper on text mining in the genomics domain, and decades since the first paper on text mining in the medical domain. Enormous progress has been made in the areas of information retrieval, evaluation methodologies and resource construction. Some problems, such as abbreviation-handling, can essentially be considered solved problems, and others, such as identification of gene mentions in text, seem likely to be solved soon. However, a number of problems at the frontiers of biomedical text mining continue to present interesting challenges and opportunities for great improvements and interesting research. In this article we review the current state of the art in biomedical text mining or ‘BioNLP’ in general, focusing primarily on papers published within the past year. PMID:17977867

  20. Text mining in the classification of digital documents

    Directory of Open Access Journals (Sweden)

    Marcial Contreras Barrera

    2016-11-01

    Full Text Available Objective: Develop an automated classifier for the classification of bibliographic material by means of the text mining. Methodology: The text mining is used for the development of the classifier, based on a method of type supervised, conformed by two phases; learning and recognition, in the learning phase, the classifier learns patterns across the analysis of bibliographical records, of the classification Z, belonging to library science, information sciences and information resources, recovered from the database LIBRUNAM, in this phase is obtained the classifier capable of recognizing different subclasses (LC. In the recognition phase the classifier is validated and evaluates across classification tests, for this end bibliographical records of the classification Z are taken randomly, classified by a cataloguer and processed by the automated classifier, in order to obtain the precision of the automated classifier. Results: The application of the text mining achieved the development of the automated classifier, through the method classifying documents supervised type. The precision of the classifier was calculated doing the comparison among the assigned topics manually and automated obtaining 75.70% of precision. Conclusions: The application of text mining facilitated the creation of automated classifier, allowing to obtain useful technology for the classification of bibliographical material with the aim of improving and speed up the process of organizing digital documents.

  1. Stratification-Based Outlier Detection over the Deep Web.

    Science.gov (United States)

    Xian, Xuefeng; Zhao, Pengpeng; Sheng, Victor S; Fang, Ligang; Gu, Caidong; Yang, Yuanfeng; Cui, Zhiming

    2016-01-01

    For many applications, finding rare instances or outliers can be more interesting than finding common patterns. Existing work in outlier detection never considers the context of deep web. In this paper, we argue that, for many scenarios, it is more meaningful to detect outliers over deep web. In the context of deep web, users must submit queries through a query interface to retrieve corresponding data. Therefore, traditional data mining methods cannot be directly applied. The primary contribution of this paper is to develop a new data mining method for outlier detection over deep web. In our approach, the query space of a deep web data source is stratified based on a pilot sample. Neighborhood sampling and uncertainty sampling are developed in this paper with the goal of improving recall and precision based on stratification. Finally, a careful performance evaluation of our algorithm confirms that our approach can effectively detect outliers in deep web.

  2. Text mining resources for the life sciences.

    Science.gov (United States)

    Przybyła, Piotr; Shardlow, Matthew; Aubin, Sophie; Bossy, Robert; Eckart de Castilho, Richard; Piperidis, Stelios; McNaught, John; Ananiadou, Sophia

    2016-01-01

    Text mining is a powerful technology for quickly distilling key information from vast quantities of biomedical literature. However, to harness this power the researcher must be well versed in the availability, suitability, adaptability, interoperability and comparative accuracy of current text mining resources. In this survey, we give an overview of the text mining resources that exist in the life sciences to help researchers, especially those employed in biocuration, to engage with text mining in their own work. We categorize the various resources under three sections: Content Discovery looks at where and how to find biomedical publications for text mining; Knowledge Encoding describes the formats used to represent the different levels of information associated with content that enable text mining, including those formats used to carry such information between processes; Tools and Services gives an overview of workflow management systems that can be used to rapidly configure and compare domain- and task-specific processes, via access to a wide range of pre-built tools. We also provide links to relevant repositories in each section to enable the reader to find resources relevant to their own area of interest. Throughout this work we give a special focus to resources that are interoperable-those that have the crucial ability to share information, enabling smooth integration and reusability. © The Author(s) 2016. Published by Oxford University Press.

  3. Text mining resources for the life sciences

    Science.gov (United States)

    Shardlow, Matthew; Aubin, Sophie; Bossy, Robert; Eckart de Castilho, Richard; Piperidis, Stelios; McNaught, John; Ananiadou, Sophia

    2016-01-01

    Text mining is a powerful technology for quickly distilling key information from vast quantities of biomedical literature. However, to harness this power the researcher must be well versed in the availability, suitability, adaptability, interoperability and comparative accuracy of current text mining resources. In this survey, we give an overview of the text mining resources that exist in the life sciences to help researchers, especially those employed in biocuration, to engage with text mining in their own work. We categorize the various resources under three sections: Content Discovery looks at where and how to find biomedical publications for text mining; Knowledge Encoding describes the formats used to represent the different levels of information associated with content that enable text mining, including those formats used to carry such information between processes; Tools and Services gives an overview of workflow management systems that can be used to rapidly configure and compare domain- and task-specific processes, via access to a wide range of pre-built tools. We also provide links to relevant repositories in each section to enable the reader to find resources relevant to their own area of interest. Throughout this work we give a special focus to resources that are interoperable—those that have the crucial ability to share information, enabling smooth integration and reusability. PMID:27888231

  4. An Introduction to Social Semantic Web Mining & Big Data Analytics for Political Attitudes and Mentalities Research

    Directory of Open Access Journals (Sweden)

    Markus Schatten

    2015-01-01

    Full Text Available The social web has become a major repository of social and behavioral data that is of exceptional interest to the social science and humanities research community. Computer science has only recently developed various technologies and techniques that allow for harvesting, organizing and analyzing such data and provide knowledge and insights into the structure and behavior or people on-line. Some of these techniques include social web mining, conceptual and social network analysis and modeling, tag clouds, topic maps, folksonomies, complex network visualizations, modeling of processes on networks, agent based models of social network emergence, speech recognition, computer vision, natural language processing, opinion mining and sentiment analysis, recommender systems, user profiling and semantic wikis. All of these techniques are briefly introduced, example studies are given and ideas as well as possible directions in the field of political attitudes and mentalities are given. In the end challenges for future studies are discussed.

  5. Cultural text mining: using text mining to map the emergence of transnational reference cultures in public media repositories

    NARCIS (Netherlands)

    Pieters, Toine; Verheul, Jaap

    2014-01-01

    This paper discusses the research project Translantis, which uses innovative technologies for cultural text mining to analyze large repositories of digitized public media, such as newspapers and journals.1 The Translantis research team uses and develops the text mining tool Texcavator, which is

  6. PathText: a text mining integrator for biological pathway visualizations

    Science.gov (United States)

    Kemper, Brian; Matsuzaki, Takuya; Matsuoka, Yukiko; Tsuruoka, Yoshimasa; Kitano, Hiroaki; Ananiadou, Sophia; Tsujii, Jun'ichi

    2010-01-01

    Motivation: Metabolic and signaling pathways are an increasingly important part of organizing knowledge in systems biology. They serve to integrate collective interpretations of facts scattered throughout literature. Biologists construct a pathway by reading a large number of articles and interpreting them as a consistent network, but most of the models constructed currently lack direct links to those articles. Biologists who want to check the original articles have to spend substantial amounts of time to collect relevant articles and identify the sections relevant to the pathway. Furthermore, with the scientific literature expanding by several thousand papers per week, keeping a model relevant requires a continuous curation effort. In this article, we present a system designed to integrate a pathway visualizer, text mining systems and annotation tools into a seamless environment. This will enable biologists to freely move between parts of a pathway and relevant sections of articles, as well as identify relevant papers from large text bases. The system, PathText, is developed by Systems Biology Institute, Okinawa Institute of Science and Technology, National Centre for Text Mining (University of Manchester) and the University of Tokyo, and is being used by groups of biologists from these locations. Contact: brian@monrovian.com. PMID:20529930

  7. Wireless sensing of gas in mining with web service in real time

    Directory of Open Access Journals (Sweden)

    Juan Mauricio Salamanca

    2014-12-01

    hierarchically in order to transmit the data to the entrance of the mine. Finally, the network configuration is done until the system enters in mode sleep (idle when it is not receiving information, in this way the consuming power decreased, increasing the autonomy of the batteries. This paper describes the design, implementation and operation of a gas monitoring system in mining with web service inreal-time based on a network of Zigbee sensors.

  8. ArrayMining: a modular web-application for microarray analysis combining ensemble and consensus methods with cross-study normalization

    Directory of Open Access Journals (Sweden)

    Krasnogor Natalio

    2009-10-01

    Full Text Available Abstract Background Statistical analysis of DNA microarray data provides a valuable diagnostic tool for the investigation of genetic components of diseases. To take advantage of the multitude of available data sets and analysis methods, it is desirable to combine both different algorithms and data from different studies. Applying ensemble learning, consensus clustering and cross-study normalization methods for this purpose in an almost fully automated process and linking different analysis modules together under a single interface would simplify many microarray analysis tasks. Results We present ArrayMining.net, a web-application for microarray analysis that provides easy access to a wide choice of feature selection, clustering, prediction, gene set analysis and cross-study normalization methods. In contrast to other microarray-related web-tools, multiple algorithms and data sets for an analysis task can be combined using ensemble feature selection, ensemble prediction, consensus clustering and cross-platform data integration. By interlinking different analysis tools in a modular fashion, new exploratory routes become available, e.g. ensemble sample classification using features obtained from a gene set analysis and data from multiple studies. The analysis is further simplified by automatic parameter selection mechanisms and linkage to web tools and databases for functional annotation and literature mining. Conclusion ArrayMining.net is a free web-application for microarray analysis combining a broad choice of algorithms based on ensemble and consensus methods, using automatic parameter selection and integration with annotation databases.

  9. A Text-Mining Framework for Supporting Systematic Reviews.

    Science.gov (United States)

    Li, Dingcheng; Wang, Zhen; Wang, Liwei; Sohn, Sunghwan; Shen, Feichen; Murad, Mohammad Hassan; Liu, Hongfang

    2016-11-01

    Systematic reviews (SRs) involve the identification, appraisal, and synthesis of all relevant studies for focused questions in a structured reproducible manner. High-quality SRs follow strict procedures and require significant resources and time. We investigated advanced text-mining approaches to reduce the burden associated with abstract screening in SRs and provide high-level information summary. A text-mining SR supporting framework consisting of three self-defined semantics-based ranking metrics was proposed, including keyword relevance, indexed-term relevance and topic relevance. Keyword relevance is based on the user-defined keyword list used in the search strategy. Indexed-term relevance is derived from indexed vocabulary developed by domain experts used for indexing journal articles and books. Topic relevance is defined as the semantic similarity among retrieved abstracts in terms of topics generated by latent Dirichlet allocation, a Bayesian-based model for discovering topics. We tested the proposed framework using three published SRs addressing a variety of topics (Mass Media Interventions, Rectal Cancer and Influenza Vaccine). The results showed that when 91.8%, 85.7%, and 49.3% of the abstract screening labor was saved, the recalls were as high as 100% for the three cases; respectively. Relevant studies identified manually showed strong topic similarity through topic analysis, which supported the inclusion of topic analysis as relevance metric. It was demonstrated that advanced text mining approaches can significantly reduce the abstract screening labor of SRs and provide an informative summary of relevant studies.

  10. GPU-Accelerated Text Mining

    International Nuclear Information System (INIS)

    Cui, X.; Mueller, F.; Zhang, Y.; Potok, Thomas E.

    2009-01-01

    Accelerating hardware devices represent a novel promise for improving the performance for many problem domains but it is not clear for which domains what accelerators are suitable. While there is no room in general-purpose processor design to significantly increase the processor frequency, developers are instead resorting to multi-core chips duplicating conventional computing capabilities on a single die. Yet, accelerators offer more radical designs with a much higher level of parallelism and novel programming environments. This present work assesses the viability of text mining on CUDA. Text mining is one of the key concepts that has become prominent as an effective means to index the Internet, but its applications range beyond this scope and extend to providing document similarity metrics, the subject of this work. We have developed and optimized text search algorithms for GPUs to exploit their potential for massive data processing. We discuss the algorithmic challenges of parallelization for text search problems on GPUs and demonstrate the potential of these devices in experiments by reporting significant speedups. Our study may be one of the first to assess more complex text search problems for suitability for GPU devices, and it may also be one of the first to exploit and report on atomic instruction usage that have recently become available in NVIDIA devices

  11. HC StratoMineR: A Web-Based Tool for the Rapid Analysis of High-Content Datasets.

    Science.gov (United States)

    Omta, Wienand A; van Heesbeen, Roy G; Pagliero, Romina J; van der Velden, Lieke M; Lelieveld, Daphne; Nellen, Mehdi; Kramer, Maik; Yeong, Marley; Saeidi, Amir M; Medema, Rene H; Spruit, Marco; Brinkkemper, Sjaak; Klumperman, Judith; Egan, David A

    2016-10-01

    High-content screening (HCS) can generate large multidimensional datasets and when aligned with the appropriate data mining tools, it can yield valuable insights into the mechanism of action of bioactive molecules. However, easy-to-use data mining tools are not widely available, with the result that these datasets are frequently underutilized. Here, we present HC StratoMineR, a web-based tool for high-content data analysis. It is a decision-supportive platform that guides even non-expert users through a high-content data analysis workflow. HC StratoMineR is built by using My Structured Query Language for storage and querying, PHP: Hypertext Preprocessor as the main programming language, and jQuery for additional user interface functionality. R is used for statistical calculations, logic and data visualizations. Furthermore, C++ and graphical processor unit power is diffusely embedded in R by using the rcpp and rpud libraries for operations that are computationally highly intensive. We show that we can use HC StratoMineR for the analysis of multivariate data from a high-content siRNA knock-down screen and a small-molecule screen. It can be used to rapidly filter out undesirable data; to select relevant data; and to perform quality control, data reduction, data exploration, morphological hit picking, and data clustering. Our results demonstrate that HC StratoMineR can be used to functionally categorize HCS hits and, thus, provide valuable information for hit prioritization.

  12. Biomedical text mining and its applications in cancer research.

    Science.gov (United States)

    Zhu, Fei; Patumcharoenpol, Preecha; Zhang, Cheng; Yang, Yang; Chan, Jonathan; Meechai, Asawin; Vongsangnak, Wanwipa; Shen, Bairong

    2013-04-01

    Cancer is a malignant disease that has caused millions of human deaths. Its study has a long history of well over 100years. There have been an enormous number of publications on cancer research. This integrated but unstructured biomedical text is of great value for cancer diagnostics, treatment, and prevention. The immense body and rapid growth of biomedical text on cancer has led to the appearance of a large number of text mining techniques aimed at extracting novel knowledge from scientific text. Biomedical text mining on cancer research is computationally automatic and high-throughput in nature. However, it is error-prone due to the complexity of natural language processing. In this review, we introduce the basic concepts underlying text mining and examine some frequently used algorithms, tools, and data sets, as well as assessing how much these algorithms have been utilized. We then discuss the current state-of-the-art text mining applications in cancer research and we also provide some resources for cancer text mining. With the development of systems biology, researchers tend to understand complex biomedical systems from a systems biology viewpoint. Thus, the full utilization of text mining to facilitate cancer systems biology research is fast becoming a major concern. To address this issue, we describe the general workflow of text mining in cancer systems biology and each phase of the workflow. We hope that this review can (i) provide a useful overview of the current work of this field; (ii) help researchers to choose text mining tools and datasets; and (iii) highlight how to apply text mining to assist cancer systems biology research. Copyright © 2012 Elsevier Inc. All rights reserved.

  13. Text Mining in Biomedical Domain with Emphasis on Document Clustering.

    Science.gov (United States)

    Renganathan, Vinaitheerthan

    2017-07-01

    With the exponential increase in the number of articles published every year in the biomedical domain, there is a need to build automated systems to extract unknown information from the articles published. Text mining techniques enable the extraction of unknown knowledge from unstructured documents. This paper reviews text mining processes in detail and the software tools available to carry out text mining. It also reviews the roles and applications of text mining in the biomedical domain. Text mining processes, such as search and retrieval of documents, pre-processing of documents, natural language processing, methods for text clustering, and methods for text classification are described in detail. Text mining techniques can facilitate the mining of vast amounts of knowledge on a given topic from published biomedical research articles and draw meaningful conclusions that are not possible otherwise.

  14. Active Collection of Land Cover Sample Data from Geo-Tagged Web Texts

    Directory of Open Access Journals (Sweden)

    Dongyang Hou

    2015-05-01

    Full Text Available Sample data plays an important role in land cover (LC map validation. Traditionally, they are collected through field survey or image interpretation, either of which is costly, labor-intensive and time-consuming. In recent years, massive geo-tagged texts are emerging on the web and they contain valuable information for LC map validation. However, this kind of special textual data has seldom been analyzed and used for supporting LC map validation. This paper examines the potential of geo-tagged web texts as a new cost-free sample data source to assist LC map validation and proposes an active data collection approach. The proposed approach uses a customized deep web crawler to search for geo-tagged web texts based on land cover-related keywords and string-based rules matching. A data transformation based on buffer analysis is then performed to convert the collected web texts into LC sample data. Using three provinces and three municipalities directly under the Central Government in China as study areas, geo-tagged web texts were collected to validate artificial surface class of China’s 30-meter global land cover datasets (GlobeLand30-2010. A total of 6283 geo-tagged web texts were collected at a speed of 0.58 texts per second. The collected texts about built-up areas were transformed into sample data. User’s accuracy of 82.2% was achieved, which is close to that derived from formal expert validation. The preliminary results show that geo-tagged web texts are valuable ancillary data for LC map validation and the proposed approach can improve the efficiency of sample data collection.

  15. Adverse Event extraction from Structured Product Labels using the Event-based Text-mining of Health Electronic Records (ETHER)system.

    Science.gov (United States)

    Pandey, Abhishek; Kreimeyer, Kory; Foster, Matthew; Botsis, Taxiarchis; Dang, Oanh; Ly, Thomas; Wang, Wei; Forshee, Richard

    2018-01-01

    Structured Product Labels follow an XML-based document markup standard approved by the Health Level Seven organization and adopted by the US Food and Drug Administration as a mechanism for exchanging medical products information. Their current organization makes their secondary use rather challenging. We used the Side Effect Resource database and DailyMed to generate a comparison dataset of 1159 Structured Product Labels. We processed the Adverse Reaction section of these Structured Product Labels with the Event-based Text-mining of Health Electronic Records system and evaluated its ability to extract and encode Adverse Event terms to Medical Dictionary for Regulatory Activities Preferred Terms. A small sample of 100 labels was then selected for further analysis. Of the 100 labels, Event-based Text-mining of Health Electronic Records achieved a precision and recall of 81 percent and 92 percent, respectively. This study demonstrated Event-based Text-mining of Health Electronic Record's ability to extract and encode Adverse Event terms from Structured Product Labels which may potentially support multiple pharmacoepidemiological tasks.

  16. Text mining meets workflow: linking U-Compare with Taverna

    Science.gov (United States)

    Kano, Yoshinobu; Dobson, Paul; Nakanishi, Mio; Tsujii, Jun'ichi; Ananiadou, Sophia

    2010-01-01

    Summary: Text mining from the biomedical literature is of increasing importance, yet it is not easy for the bioinformatics community to create and run text mining workflows due to the lack of accessibility and interoperability of the text mining resources. The U-Compare system provides a wide range of bio text mining resources in a highly interoperable workflow environment where workflows can very easily be created, executed, evaluated and visualized without coding. We have linked U-Compare to Taverna, a generic workflow system, to expose text mining functionality to the bioinformatics community. Availability: http://u-compare.org/taverna.html, http://u-compare.org Contact: kano@is.s.u-tokyo.ac.jp Supplementary information: Supplementary data are available at Bioinformatics online. PMID:20709690

  17. Text Mining for Protein Docking.

    Directory of Open Access Journals (Sweden)

    Varsha D Badal

    2015-12-01

    Full Text Available The rapidly growing amount of publicly available information from biomedical research is readily accessible on the Internet, providing a powerful resource for predictive biomolecular modeling. The accumulated data on experimentally determined structures transformed structure prediction of proteins and protein complexes. Instead of exploring the enormous search space, predictive tools can simply proceed to the solution based on similarity to the existing, previously determined structures. A similar major paradigm shift is emerging due to the rapidly expanding amount of information, other than experimentally determined structures, which still can be used as constraints in biomolecular structure prediction. Automated text mining has been widely used in recreating protein interaction networks, as well as in detecting small ligand binding sites on protein structures. Combining and expanding these two well-developed areas of research, we applied the text mining to structural modeling of protein-protein complexes (protein docking. Protein docking can be significantly improved when constraints on the docking mode are available. We developed a procedure that retrieves published abstracts on a specific protein-protein interaction and extracts information relevant to docking. The procedure was assessed on protein complexes from Dockground (http://dockground.compbio.ku.edu. The results show that correct information on binding residues can be extracted for about half of the complexes. The amount of irrelevant information was reduced by conceptual analysis of a subset of the retrieved abstracts, based on the bag-of-words (features approach. Support Vector Machine models were trained and validated on the subset. The remaining abstracts were filtered by the best-performing models, which decreased the irrelevant information for ~ 25% complexes in the dataset. The extracted constraints were incorporated in the docking protocol and tested on the Dockground unbound

  18. The WONP-NURT corpus as nuclear knowledge base for text mining in the INIS database

    International Nuclear Information System (INIS)

    Guerra Valdes, R.

    2011-01-01

    In the present work the WONP-NURT corpus is taken as knowledge base for text mining in the INIS database. Main components of the information processing system, as well as computational methods for content analysis of INIS database record files are described. Results of the content analysis of the WONP-NURT corpus are reported. Furthermore, results of two comparative text mining studies in the INIS database are also shown. The first one explores 10 research areas in the more familiar nearest range of WONP-NURT corpus, while the second one surveys 15 regions in the more exotic far range. The results provide new elements to asses the significance of the WONP-NURT corpus in the context of the current state of nuclear science and technology research areas. (Author)

  19. Mining knowledge from text repositories using information extraction ...

    Indian Academy of Sciences (India)

    Information extraction (IE); text mining; text repositories; knowledge discovery from .... general purpose English words. However ... of precision and recall, as extensive experimentation is required due to lack of public tagged corpora. 4. Mining ...

  20. Opinion Mining in Latvian Text Using Semantic Polarity Analysis and Machine Learning Approach

    Directory of Open Access Journals (Sweden)

    Gatis Špats

    2016-07-01

    Full Text Available In this paper we demonstrate approaches for opinion mining in Latvian text. Authors have applied, combined and extended results of several previous studies and public resources to perform opinion mining in Latvian text using two approaches, namely, semantic polarity analysis and machine learning. One of the most significant constraints that make application of opinion mining for written content classification in Latvian text challenging is the limited publicly available text corpora for classifier training. We have joined several sources and created a publically available extended lexicon. Our results are comparable to or outperform current achievements in opinion mining in Latvian. Experiments show that lexicon-based methods provide more accurate opinion mining than the application of Naive Bayes machine learning classifier on Latvian tweets. Methods used during this study could be further extended using human annotators, unsupervised machine learning and bootstrapping to create larger corpora of classified text.

  1. Text Mining of Supreme Administrative Court Jurisdictions

    OpenAIRE

    Feinerer, Ingo; Hornik, Kurt

    2007-01-01

    Within the last decade text mining, i.e., extracting sensitive information from text corpora, has become a major factor in business intelligence. The automated textual analysis of law corpora is highly valuable because of its impact on a company's legal options and the raw amount of available jurisdiction. The study of supreme court jurisdiction and international law corpora is equally important due to its effects on business sectors. In this paper we use text mining methods to investigate Au...

  2. Text mining in cancer gene and pathway prioritization.

    Science.gov (United States)

    Luo, Yuan; Riedlinger, Gregory; Szolovits, Peter

    2014-01-01

    Prioritization of cancer implicated genes has received growing attention as an effective way to reduce wet lab cost by computational analysis that ranks candidate genes according to the likelihood that experimental verifications will succeed. A multitude of gene prioritization tools have been developed, each integrating different data sources covering gene sequences, differential expressions, function annotations, gene regulations, protein domains, protein interactions, and pathways. This review places existing gene prioritization tools against the backdrop of an integrative Omic hierarchy view toward cancer and focuses on the analysis of their text mining components. We explain the relatively slow progress of text mining in gene prioritization, identify several challenges to current text mining methods, and highlight a few directions where more effective text mining algorithms may improve the overall prioritization task and where prioritizing the pathways may be more desirable than prioritizing only genes.

  3. Integrating UIMA annotators in a web-based text processing framework.

    Science.gov (United States)

    Chen, Xiang; Arnold, Corey W

    2013-01-01

    The Unstructured Information Management Architecture (UIMA) [1] framework is a growing platform for natural language processing (NLP) applications. However, such applications may be difficult for non-technical users deploy. This project presents a web-based framework that wraps UIMA-based annotator systems into a graphical user interface for researchers and clinicians, and a web service for developers. An annotator that extracts data elements from lung cancer radiology reports is presented to illustrate the use of the system. Annotation results from the web system can be exported to multiple formats for users to utilize in other aspects of their research and workflow. This project demonstrates the benefits of a lay-user interface for complex NLP applications. Efforts such as this can lead to increased interest and support for NLP work in the clinical domain.

  4. Using ontology network structure in text mining.

    Science.gov (United States)

    Berndt, Donald J; McCart, James A; Luther, Stephen L

    2010-11-13

    Statistical text mining treats documents as bags of words, with a focus on term frequencies within documents and across document collections. Unlike natural language processing (NLP) techniques that rely on an engineered vocabulary or a full-featured ontology, statistical approaches do not make use of domain-specific knowledge. The freedom from biases can be an advantage, but at the cost of ignoring potentially valuable knowledge. The approach proposed here investigates a hybrid strategy based on computing graph measures of term importance over an entire ontology and injecting the measures into the statistical text mining process. As a starting point, we adapt existing search engine algorithms such as PageRank and HITS to determine term importance within an ontology graph. The graph-theoretic approach is evaluated using a smoking data set from the i2b2 National Center for Biomedical Computing, cast as a simple binary classification task for categorizing smoking-related documents, demonstrating consistent improvements in accuracy.

  5. Web-Based Tools for Text-Based Patient-Provider Communication in Chronic Conditions: Scoping Review.

    Science.gov (United States)

    Voruganti, Teja; Grunfeld, Eva; Makuwaza, Tutsirai; Bender, Jacqueline L

    2017-10-27

    Patients with chronic conditions require ongoing care which not only necessitates support from health care providers outside appointments but also self-management. Web-based tools for text-based patient-provider communication, such as secure messaging, allow for sharing of contextual information and personal narrative in a simple accessible medium, empowering patients and enabling their providers to address emerging care needs. The objectives of this study were to (1) conduct a systematic search of the published literature and the Internet for Web-based tools for text-based communication between patients and providers; (2) map tool characteristics, their intended use, contexts in which they were used, and by whom; (3) describe the nature of their evaluation; and (4) understand the terminology used to describe the tools. We conducted a scoping review using the MEDLINE (Medical Literature Analysis and Retrieval System Online) and EMBASE (Excerpta Medica Database) databases. We summarized information on the characteristics of the tools (structure, functions, and communication paradigm), intended use, context and users, evaluation (study design and outcomes), and terminology. We performed a parallel search of the Internet to compare with tools identified in the published literature. We identified 54 papers describing 47 unique tools from 13 countries studied in the context of 68 chronic health conditions. The majority of tools (77%, 36/47) had functions in addition to communication (eg, viewable care plan, symptom diary, or tracker). Eight tools (17%, 8/47) were described as allowing patients to communicate with the team or multiple health care providers. Most of the tools were intended to support communication regarding symptom reporting (49%, 23/47), and lifestyle or behavior modification (36%, 17/47). The type of health care providers who used tools to communicate with patients were predominantly allied health professionals of various disciplines (30%, 14

  6. Negation scope and spelling variation for text-mining of Danish electronic patient records

    DEFF Research Database (Denmark)

    Thomas, Cecilia Engel; Jensen, Peter Bjødstrup; Werge, Thomas

    2014-01-01

    Electronic patient records are a potentially rich data source for knowledge extraction in biomedical research. Here we present a method based on the ICD10 system for text-mining of Danish health records. We have evaluated how adding functionalities to a baseline text-mining tool affected...

  7. Vaccine adverse event text mining system for extracting features from vaccine safety reports.

    Science.gov (United States)

    Botsis, Taxiarchis; Buttolph, Thomas; Nguyen, Michael D; Winiecki, Scott; Woo, Emily Jane; Ball, Robert

    2012-01-01

    To develop and evaluate a text mining system for extracting key clinical features from vaccine adverse event reporting system (VAERS) narratives to aid in the automated review of adverse event reports. Based upon clinical significance to VAERS reviewing physicians, we defined the primary (diagnosis and cause of death) and secondary features (eg, symptoms) for extraction. We built a novel vaccine adverse event text mining (VaeTM) system based on a semantic text mining strategy. The performance of VaeTM was evaluated using a total of 300 VAERS reports in three sequential evaluations of 100 reports each. Moreover, we evaluated the VaeTM contribution to case classification; an information retrieval-based approach was used for the identification of anaphylaxis cases in a set of reports and was compared with two other methods: a dedicated text classifier and an online tool. The performance metrics of VaeTM were text mining metrics: recall, precision and F-measure. We also conducted a qualitative difference analysis and calculated sensitivity and specificity for classification of anaphylaxis cases based on the above three approaches. VaeTM performed best in extracting diagnosis, second level diagnosis, drug, vaccine, and lot number features (lenient F-measure in the third evaluation: 0.897, 0.817, 0.858, 0.874, and 0.914, respectively). In terms of case classification, high sensitivity was achieved (83.1%); this was equal and better compared to the text classifier (83.1%) and the online tool (40.7%), respectively. Our VaeTM implementation of a semantic text mining strategy shows promise in providing accurate and efficient extraction of key features from VAERS narratives.

  8. A text-mining system for extracting metabolic reactions from full-text articles.

    Science.gov (United States)

    Czarnecki, Jan; Nobeli, Irene; Smith, Adrian M; Shepherd, Adrian J

    2012-07-23

    Increasingly biological text mining research is focusing on the extraction of complex relationships relevant to the construction and curation of biological networks and pathways. However, one important category of pathway - metabolic pathways - has been largely neglected.Here we present a relatively simple method for extracting metabolic reaction information from free text that scores different permutations of assigned entities (enzymes and metabolites) within a given sentence based on the presence and location of stemmed keywords. This method extends an approach that has proved effective in the context of the extraction of protein-protein interactions. When evaluated on a set of manually-curated metabolic pathways using standard performance criteria, our method performs surprisingly well. Precision and recall rates are comparable to those previously achieved for the well-known protein-protein interaction extraction task. We conclude that automated metabolic pathway construction is more tractable than has often been assumed, and that (as in the case of protein-protein interaction extraction) relatively simple text-mining approaches can prove surprisingly effective. It is hoped that these results will provide an impetus to further research and act as a useful benchmark for judging the performance of more sophisticated methods that are yet to be developed.

  9. Text mining and visualization case studies using open-source tools

    CERN Document Server

    Chisholm, Andrew

    2016-01-01

    Text Mining and Visualization: Case Studies Using Open-Source Tools provides an introduction to text mining using some of the most popular and powerful open-source tools: KNIME, RapidMiner, Weka, R, and Python. The contributors-all highly experienced with text mining and open-source software-explain how text data are gathered and processed from a wide variety of sources, including books, server access logs, websites, social media sites, and message boards. Each chapter presents a case study that you can follow as part of a step-by-step, reproducible example. You can also easily apply and extend the techniques to other problems. All the examples are available on a supplementary website. The book shows you how to exploit your text data, offering successful application examples and blueprints for you to tackle your text mining tasks and benefit from open and freely available tools. It gets you up to date on the latest and most powerful tools, the data mining process, and specific text mining activities.

  10. SA-Search: a web tool for protein structure mining based on a Structural Alphabet.

    Science.gov (United States)

    Guyon, Frédéric; Camproux, Anne-Claude; Hochez, Joëlle; Tufféry, Pierre

    2004-07-01

    SA-Search is a web tool that can be used to mine for protein structures and extract structural similarities. It is based on a hidden Markov model derived Structural Alphabet (SA) that allows the compression of three-dimensional (3D) protein conformations into a one-dimensional (1D) representation using a limited number of prototype conformations. Using such a representation, classical methods developed for amino acid sequences can be employed. Currently, SA-Search permits the performance of fast 3D similarity searches such as the extraction of exact words using a suffix tree approach, and the search for fuzzy words viewed as a simple 1D sequence alignment problem. SA-Search is available at http://bioserv.rpbs.jussieu.fr/cgi-bin/SA-Search.

  11. A text-based data mining and toxicity prediction modeling system for a clinical decision support in radiation oncology: A preliminary study

    Science.gov (United States)

    Kim, Kwang Hyeon; Lee, Suk; Shim, Jang Bo; Chang, Kyung Hwan; Yang, Dae Sik; Yoon, Won Sup; Park, Young Je; Kim, Chul Yong; Cao, Yuan Jie

    2017-08-01

    The aim of this study is an integrated research for text-based data mining and toxicity prediction modeling system for clinical decision support system based on big data in radiation oncology as a preliminary research. The structured and unstructured data were prepared by treatment plans and the unstructured data were extracted by dose-volume data image pattern recognition of prostate cancer for research articles crawling through the internet. We modeled an artificial neural network to build a predictor model system for toxicity prediction of organs at risk. We used a text-based data mining approach to build the artificial neural network model for bladder and rectum complication predictions. The pattern recognition method was used to mine the unstructured toxicity data for dose-volume at the detection accuracy of 97.9%. The confusion matrix and training model of the neural network were achieved with 50 modeled plans (n = 50) for validation. The toxicity level was analyzed and the risk factors for 25% bladder, 50% bladder, 20% rectum, and 50% rectum were calculated by the artificial neural network algorithm. As a result, 32 plans could cause complication but 18 plans were designed as non-complication among 50 modeled plans. We integrated data mining and a toxicity modeling method for toxicity prediction using prostate cancer cases. It is shown that a preprocessing analysis using text-based data mining and prediction modeling can be expanded to personalized patient treatment decision support based on big data.

  12. AN EFFICIENT WEB PERSONALIZATION APPROACH TO DISCOVER USER INTERESTED DIRECTORIES

    Directory of Open Access Journals (Sweden)

    M. Robinson Joel

    2014-04-01

    Full Text Available Web Usage Mining is the application of data mining technique used to retrieve the web usage from web proxy log file. Web Usage Mining consists of three major stages: preprocessing, clustering and pattern analysis. This paper explains each of these stages in detail. In this proposed approach, the web directories are discovered based on the user’s interestingness. The web proxy log file undergoes a preprocessing phase to improve the quality of data. Fuzzy Clustering Algorithm is used to cluster the user and session into disjoint clusters. In this paper, an effective approach is presented for Web personalization based on an Advanced Apriori algorithm. It is used to select the user interested web directories. The proposed method is compared with the existing web personalization methods like Objective Probabilistic Directory Miner (OPDM, Objective Community Directory Miner (OCDM and Objective Clustering and Probabilistic Directory Miner (OCPDM. The result shows that the proposed approach provides better results than the aforementioned existing approaches. At last, an application is developed with the user interested directories and web usage details.

  13. Citation Mining: Integrating Text Mining and Bibliometrics for Research User Profiling.

    Science.gov (United States)

    Kostoff, Ronald N.; del Rio, J. Antonio; Humenik, James A.; Garcia, Esther Ofilia; Ramirez, Ana Maria

    2001-01-01

    Discusses the importance of identifying the users and impact of research, and describes an approach for identifying the pathways through which research can impact other research, technology development, and applications. Describes a study that used citation mining, an integration of citation bibliometrics and text mining, on articles from the…

  14. Data Mining of Web-Based Documents on Social Networking Sites That Included Suicide-Related Words Among Korean Adolescents.

    Science.gov (United States)

    Song, Juyoung; Song, Tae Min; Seo, Dong-Chul; Jin, Jae Hyun

    2016-12-01

    To investigate online search activity of suicide-related words in South Korean adolescents through data mining of social media Web sites as the suicide rate in South Korea is one of the highest in the world. Out of more than 2.35 billion posts for 2 years from January 1, 2011 to December 31, 2012 on 163 social media Web sites in South Korea, 99,693 suicide-related documents were retrieved by Crawler and analyzed using text mining and opinion mining. These data were further combined with monthly employment rate, monthly rental prices index, monthly youth suicide rate, and monthly number of reported bully victims to fit multilevel models as well as structural equation models. The link from grade pressure to suicide risk showed the largest standardized path coefficient (beta = .357, p < .001) in structural models and a significant random effect (p < .01) in multilevel models. Depression was a partial mediator between suicide risk and grade pressure, low body image, victims of bullying, and concerns about disease. The largest total effect was observed in the grade pressure to depression to suicide risk. The multilevel models indicate about 27% of the variance in the daily suicide-related word search activity is explained by month-to-month variations. A lower employment rate, a higher rental prices index, and more bullying were associated with an increased suicide-related word search activity. Academic pressure appears to be the biggest contributor to Korean adolescents' suicide risk. Real-time suicide-related word search activity monitoring and response system needs to be developed. Copyright © 2016 Society for Adolescent Health and Medicine. Published by Elsevier Inc. All rights reserved.

  15. Application of text mining for customer evaluations in commercial banking

    Science.gov (United States)

    Tan, Jing; Du, Xiaojiang; Hao, Pengpeng; Wang, Yanbo J.

    2015-07-01

    Nowadays customer attrition is increasingly serious in commercial banks. To combat this problem roundly, mining customer evaluation texts is as important as mining customer structured data. In order to extract hidden information from customer evaluations, Textual Feature Selection, Classification and Association Rule Mining are necessary techniques. This paper presents all three techniques by using Chinese Word Segmentation, C5.0 and Apriori, and a set of experiments were run based on a collection of real textual data that includes 823 customer evaluations taken from a Chinese commercial bank. Results, consequent solutions, some advice for the commercial bank are given in this paper.

  16. Mercury flow through an Asian rice-based food web

    International Nuclear Information System (INIS)

    Abeysinghe, Kasun S.; Qiu, Guangle; Goodale, Eben; Anderson, Christopher W.N.; Bishop, Kevin; Evers, David C.; Goodale, Morgan W.; Hintelmann, Holger; Liu, Shengjie

    2017-01-01

    Mercury (Hg) is a globally-distributed pollutant, toxic to humans and animals. Emissions are particularly high in Asia, and the source of exposure for humans there may also be different from other regions, including rice as well as fish consumption, particularly in contaminated areas. Yet the threats Asian wildlife face in rice-based ecosystems are as yet unclear. We sought to understand how Hg flows through rice-based food webs in historic mining and non-mining regions of Guizhou, China. We measured total Hg (THg) and methylmercury (MeHg) in soil, rice, 38 animal species (27 for MeHg) spanning multiple trophic levels, and examined the relationship between stable isotopes and Hg concentrations. Our results confirm biomagnification of THg/MeHg, with a high trophic magnification slope. Invertivorous songbirds had concentrations of THg in their feathers that were 15x and 3x the concentration reported to significantly impair reproduction, at mining and non-mining sites, respectively. High concentrations in specialist rice consumers and in granivorous birds, the later as high as in piscivorous birds, suggest rice is a primary source of exposure. Spiders had the highest THg concentrations among invertebrates and may represent a vector through which Hg is passed to vertebrates, especially songbirds. Our findings suggest there could be significant population level health effects and consequent biodiversity loss in sensitive ecosystems, like agricultural wetlands, across Asia, and invertivorous songbirds would be good subjects for further studies investigating this possibility. - Highlights: • Hg concentrations were measured across rice-based food webs in Guizhou, China. • Of 38 animal species, THg concentrations were highest for invertivorous songbirds. • High THg levels in rice pests and in granivorous birds suggest rice as a source. • Levels of THg in songbird feathers at mining site were among highest ever recorded. • Even at non-mining site, THg in such

  17. Geovisualization of Local and Regional Migration Using Web-mined Demographics

    Science.gov (United States)

    Schuermann, R. T.; Chow, T. E.

    2014-11-01

    The intent of this research was to augment and facilitate analyses, which gauges the feasibility of web-mined demographics to study spatio-temporal dynamics of migration. As a case study, we explored the spatio-temporal dynamics of Vietnamese Americans (VA) in Texas through geovisualization of mined demographic microdata from the World Wide Web. Based on string matching across all demographic attributes, including full name, address, date of birth, age and phone number, multiple records of the same entity (i.e. person) over time were resolved and reconciled into a database. Migration trajectories were geovisualized through animated sprites by connecting the different addresses associated with the same person and segmenting the trajectory into small fragments. Intra-metropolitan migration patterns appeared at the local scale within many metropolitan areas. At the scale of metropolitan area, varying degrees of immigration and emigration manifest different types of migration clusters. This paper presents a methodology incorporating GIS methods and cartographic design to produce geovisualization animation, enabling the cognitive identification of migration patterns at multiple scales. Identification of spatio-temporal patterns often stimulates further research to better understand the phenomenon and enhance subsequent modeling.

  18. Financial Statement Fraud Detection using Text Mining

    OpenAIRE

    Rajan Gupta; Nasib Singh Gill

    2013-01-01

    Data mining techniques have been used enormously by the researchers’ community in detecting financial statement fraud. Most of the research in this direction has used the numbers (quantitative information) i.e. financial ratios present in the financial statements for detecting fraud. There is very little or no research on the analysis of text such as auditor’s comments or notes present in published reports. In this study we propose a text mining approach for detecting financial statement frau...

  19. Web-based control application using WebSocket

    International Nuclear Information System (INIS)

    Furukawa, Y.

    2012-01-01

    The WebSocket allows asynchronous full-duplex communication between a Web-based (i.e. Java Script-based) application and a Web-server. WebSocket started as a part of HTML5 standardization but has now been separated from HTML5 and has been developed independently. Using WebSocket, it becomes easy to develop platform independent presentation layer applications for accelerator and beamline control software. In addition, a Web browser is the only application program that needs to be installed on client computer. The WebSocket-based applications communicate with the WebSocket server using simple text-based messages, so WebSocket is applicable message-based control system like MADOCA, which was developed for the SPring-8 control system. A simple WebSocket server for the MADOCA control system and a simple motor control application were successfully made as a first trial of the WebSocket control application. Using Google-Chrome (version 13.0) on Debian/Linux and Windows 7, Opera (version 11.0) on Debian/Linux and Safari (version 5.0.3) on Mac OS X as clients, the motors can be controlled using a WebSocket-based Web-application. Diffractometer control application use in synchrotron radiation diffraction experiment was also developed. (author)

  20. Mining the inner structure of the Web graph

    International Nuclear Information System (INIS)

    Donato, Debora; Leonardi, Stefano; Millozzi, Stefano; Tsaparas, Panayiotis

    2008-01-01

    Despite being the sum of decentralized and uncoordinated efforts by heterogeneous groups and individuals, the World Wide Web exhibits a well-defined structure, characterized by several interesting properties. This structure was clearly revealed by Broder et al (2000 Graph structure in the web Comput. Netw. 33 309) who presented the evocative bow-tie picture of the Web. Although, the bow-tie structure is a relatively clear abstraction of the macroscopic picture of the Web, it is quite uninformative with respect to the finer details of the Web graph. In this paper, we mine the inner structure of the Web graph. We present a series of measurements on the Web, which offer a better understanding of the individual components of the bow-tie. In the process, we develop algorithmic techniques for performing these measurements. We discover that the scale-free properties permeate all the components of the bow-tie which exhibit the same macroscopic properties as the Web graph itself. However, close inspection reveals that their inner structure is quite distinct. We show that the Web graph does not exhibit self similarity within its components, and we propose a possible alternative picture for the Web graph, as it emerges from our experiments

  1. Application of text mining in the biomedical domain.

    Science.gov (United States)

    Fleuren, Wilco W M; Alkema, Wynand

    2015-03-01

    In recent years the amount of experimental data that is produced in biomedical research and the number of papers that are being published in this field have grown rapidly. In order to keep up to date with developments in their field of interest and to interpret the outcome of experiments in light of all available literature, researchers turn more and more to the use of automated literature mining. As a consequence, text mining tools have evolved considerably in number and quality and nowadays can be used to address a variety of research questions ranging from de novo drug target discovery to enhanced biological interpretation of the results from high throughput experiments. In this paper we introduce the most important techniques that are used for a text mining and give an overview of the text mining tools that are currently being used and the type of problems they are typically applied for. Copyright © 2015 Elsevier Inc. All rights reserved.

  2. The Role of Text Mining in Export Control

    Energy Technology Data Exchange (ETDEWEB)

    Tae, Jae-woong; Son, Choul-woong; Shin, Dong-hoon [Korea Institute of Nuclear Nonproliferation and Control, Daejeon (Korea, Republic of)

    2015-10-15

    Korean government provides classification services to exporters. It is simple to copy technology such as documents and drawings. Moreover, it is also easy that new technology derived from the existing technology. The diversity of technology makes classification difficult because the boundary between strategic and nonstrategic technology is unclear and ambiguous. Reviewers should consider previous classification cases enough. However, the increase of the classification cases prevent consistent classifications. This made another innovative and effective approaches necessary. IXCRS (Intelligent Export Control Review System) is proposed to coincide with demands. IXCRS consists of and expert system, a semantic searching system, a full text retrieval system, and image retrieval system and a document retrieval system. It is the aim of the present paper to observe the document retrieval system based on text mining and to discuss how to utilize the system. This study has demonstrated how text mining technique can be applied to export control. The document retrieval system supports reviewers to treat previous classification cases effectively. Especially, it is highly probable that similarity data will contribute to specify classification criterion. However, an analysis of the system showed a number of problems that remain to be explored such as a multilanguage problem and an inclusion relationship problem. Further research should be directed to solve problems and to apply more data mining techniques so that the system should be used as one of useful tools for export control.

  3. The Role of Text Mining in Export Control

    International Nuclear Information System (INIS)

    Tae, Jae-woong; Son, Choul-woong; Shin, Dong-hoon

    2015-01-01

    Korean government provides classification services to exporters. It is simple to copy technology such as documents and drawings. Moreover, it is also easy that new technology derived from the existing technology. The diversity of technology makes classification difficult because the boundary between strategic and nonstrategic technology is unclear and ambiguous. Reviewers should consider previous classification cases enough. However, the increase of the classification cases prevent consistent classifications. This made another innovative and effective approaches necessary. IXCRS (Intelligent Export Control Review System) is proposed to coincide with demands. IXCRS consists of and expert system, a semantic searching system, a full text retrieval system, and image retrieval system and a document retrieval system. It is the aim of the present paper to observe the document retrieval system based on text mining and to discuss how to utilize the system. This study has demonstrated how text mining technique can be applied to export control. The document retrieval system supports reviewers to treat previous classification cases effectively. Especially, it is highly probable that similarity data will contribute to specify classification criterion. However, an analysis of the system showed a number of problems that remain to be explored such as a multilanguage problem and an inclusion relationship problem. Further research should be directed to solve problems and to apply more data mining techniques so that the system should be used as one of useful tools for export control

  4. Mining biological networks from full-text articles.

    Science.gov (United States)

    Czarnecki, Jan; Shepherd, Adrian J

    2014-01-01

    The study of biological networks is playing an increasingly important role in the life sciences. Many different kinds of biological system can be modelled as networks; perhaps the most important examples are protein-protein interaction (PPI) networks, metabolic pathways, gene regulatory networks, and signalling networks. Although much useful information is easily accessible in publicly databases, a lot of extra relevant data lies scattered in numerous published papers. Hence there is a pressing need for automated text-mining methods capable of extracting such information from full-text articles. Here we present practical guidelines for constructing a text-mining pipeline from existing code and software components capable of extracting PPI networks from full-text articles. This approach can be adapted to tackle other types of biological network.

  5. PubMed-EX: a web browser extension to enhance PubMed search with text mining features.

    Science.gov (United States)

    Tsai, Richard Tzong-Han; Dai, Hong-Jie; Lai, Po-Ting; Huang, Chi-Hsin

    2009-11-15

    PubMed-EX is a browser extension that marks up PubMed search results with additional text-mining information. PubMed-EX's page mark-up, which includes section categorization and gene/disease and relation mark-up, can help researchers to quickly focus on key terms and provide additional information on them. All text processing is performed server-side, freeing up user resources. PubMed-EX is freely available at http://bws.iis.sinica.edu.tw/PubMed-EX and http://iisr.cse.yzu.edu.tw:8000/PubMed-EX/.

  6. The Application of Text Mining in Business Research

    DEFF Research Database (Denmark)

    Preuss, Bjørn

    2017-01-01

    The aim of this paper is to present a methodological concept in business research that has the potential to become one of the most powerful methods in the upcoming years when it comes to research qualitative phenomena in business and society. It presents a selection of algorithms as well elaborat...... on potential use cases for a text mining based approach to qualitative data analysis....

  7. MeSHmap: a text mining tool for MEDLINE.

    OpenAIRE

    Srinivasan, P.

    2001-01-01

    Our research goal is to explore text mining from the metadata included in MEDLINE documents. We present MeSHmap our prototype text mining system that exploits the MeSH indexing accompanying MEDLINE records. MeSHmap supports searches via PubMed followed by user driven exploration of the MeSH terms and subheadings in the retrieved set. The potential of the system goes beyond text retrieval. It may also be used to compare entities of the same type such as pairs of drugs or pairs of procedures et...

  8. Text messaging as an adjunct to a web-based intervention for college student alcohol use: A preliminary study.

    Science.gov (United States)

    Tahaney, Kelli D; Palfai, Tibor P

    2017-10-01

    Brief, web-based motivational interventions have shown promising results for reducing alcohol use and associated harm among college students. However, findings regarding which alcohol use outcomes are impacted are mixed and effects tend to be small to moderate, with effect sizes decreasing over longer-term follow-up periods. As a result, these interventions may benefit from adjunctive strategies to bolster students' engagement with intervention material and to extend interventions beyond initial contacts into student's daily lives. This study tested the efficacy of text messaging as an adjunct to a web-based intervention for heavy episodic drinking college students. One-hundred and thirteen undergraduate student risky drinkers recruited from an introductory psychology class were randomly assigned to one of three conditions-assessment only (AO), web intervention (WI), and web intervention plus text messaging (WI+TXT). Heavy drinking episodes (HDEs), weekend quantity per occasion, and alcohol-related consequences were assessed at baseline and one month follow-up. Univariate analysis of covariance (ANCOVA) was used to assess the influence of condition assignment on 1-month outcomes, controlling for baseline variables. Planned contrasts showed that those in the WI+TXT condition showed significantly less weekend drinking than those in the AO and WI conditions. Although those in the WI+TXT condition showed significantly fewer HDEs compared to AO, it was not significantly different than the WI only condition. No differences were observed on alcohol-related problems. These findings provide partial support for the view that text messaging may be a useful adjunct to web-based interventions for reducing alcohol consumption among student drinkers. Copyright © 2017. Published by Elsevier Ltd.

  9. Text mining with R a tidy approach

    CERN Document Server

    Silge, Julia

    2017-01-01

    Much of the data available today is unstructured and text-heavy, making it challenging for analysts to apply their usual data wrangling and visualization tools. With this practical book, you'll explore text-mining techniques with tidytext, a package that authors Julia Silge and David Robinson developed using the tidy principles behind R packages like ggraph and dplyr. You'll learn how tidytext and other tidy tools in R can make text analysis easier and more effective. The authors demonstrate how treating text as data frames enables you to manipulate, summarize, and visualize characteristics of text. You'll also learn how to integrate natural language processing (NLP) into effective workflows. Practical code examples and data explorations will help you generate real insights from literature, news, and social media. Learn how to apply the tidy text format to NLP Use sentiment analysis to mine the emotional content of text Identify a document's most important terms with frequency measurements E...

  10. Rule-based statistical data mining agents for an e-commerce application

    Science.gov (United States)

    Qin, Yi; Zhang, Yan-Qing; King, K. N.; Sunderraman, Rajshekhar

    2003-03-01

    Intelligent data mining techniques have useful e-Business applications. Because an e-Commerce application is related to multiple domains such as statistical analysis, market competition, price comparison, profit improvement and personal preferences, this paper presents a hybrid knowledge-based e-Commerce system fusing intelligent techniques, statistical data mining, and personal information to enhance QoS (Quality of Service) of e-Commerce. A Web-based e-Commerce application software system, eDVD Web Shopping Center, is successfully implemented uisng Java servlets and an Oracle81 database server. Simulation results have shown that the hybrid intelligent e-Commerce system is able to make smart decisions for different customers.

  11. Text mining for traditional Chinese medical knowledge discovery: a survey.

    Science.gov (United States)

    Zhou, Xuezhong; Peng, Yonghong; Liu, Baoyan

    2010-08-01

    Extracting meaningful information and knowledge from free text is the subject of considerable research interest in the machine learning and data mining fields. Text data mining (or text mining) has become one of the most active research sub-fields in data mining. Significant developments in the area of biomedical text mining during the past years have demonstrated its great promise for supporting scientists in developing novel hypotheses and new knowledge from the biomedical literature. Traditional Chinese medicine (TCM) provides a distinct methodology with which to view human life. It is one of the most complete and distinguished traditional medicines with a history of several thousand years of studying and practicing the diagnosis and treatment of human disease. It has been shown that the TCM knowledge obtained from clinical practice has become a significant complementary source of information for modern biomedical sciences. TCM literature obtained from the historical period and from modern clinical studies has recently been transformed into digital data in the form of relational databases or text documents, which provide an effective platform for information sharing and retrieval. This motivates and facilitates research and development into knowledge discovery approaches and to modernize TCM. In order to contribute to this still growing field, this paper presents (1) a comparative introduction to TCM and modern biomedicine, (2) a survey of the related information sources of TCM, (3) a review and discussion of the state of the art and the development of text mining techniques with applications to TCM, (4) a discussion of the research issues around TCM text mining and its future directions. Copyright 2010 Elsevier Inc. All rights reserved.

  12. The Distribution of the Informative Intensity of the Text in Terms of its Structure (On Materials of the English Texts in the Mining Sphere

    Directory of Open Access Journals (Sweden)

    Znikina Ludmila

    2017-01-01

    Full Text Available The article deals with the distribution of informative intensity of the English-language scientific text based on its structural features contributing to the process of formalization of the scientific text and the preservation of the adequacy of the text with derived semantic information in relation to the primary. Discourse analysis is built on specific compositional and meaningful examples of scientific texts taken from the mining field. It also analyzes the adequacy of the translation of foreign texts into another language, the relationships between elements of linguistic systems, the degree of a formal conformance, translation with the specific objectives and information needs of the recipient. Some key words and ideas are emphasized in the paragraphs of the English-language mining scientific texts. The article gives the characteristic features of the structure of paragraphs of technical text and examples of constructions in English scientific texts based on a mining theme with the aim to explain the possible ways of their adequate translation.

  13. A citation analysis of the research reports of the Central Mining Institute. Mining and Environment using the Web of Science, Scopus, BazTech, and Google Scholar: A case study

    OpenAIRE

    Magdalena Bemke-Switilnik; Aneta Drabek

    2015-01-01

    This paper presents the analysis of a Polish mining sciences journal (Prace Naukowe GIG. Górnictwo i Środowisko; title in English: Research Reports of the Central Mining Institute. Mining and Environment; acronym in English [RRCMIME]). The analysis is based on data from the following sources: the Web of Science (WoS), Scopus, BazTech (a bibliographic database containing citations from Polish Technical Journals), and Google Scholar (GS). The data from the WoS and Scopus were collected manually...

  14. The Distribution of the Informative Intensity of the Text in Terms of its Structure (On Materials of the English Texts in the Mining Sphere)

    Science.gov (United States)

    Znikina, Ludmila; Rozhneva, Elena

    2017-11-01

    The article deals with the distribution of informative intensity of the English-language scientific text based on its structural features contributing to the process of formalization of the scientific text and the preservation of the adequacy of the text with derived semantic information in relation to the primary. Discourse analysis is built on specific compositional and meaningful examples of scientific texts taken from the mining field. It also analyzes the adequacy of the translation of foreign texts into another language, the relationships between elements of linguistic systems, the degree of a formal conformance, translation with the specific objectives and information needs of the recipient. Some key words and ideas are emphasized in the paragraphs of the English-language mining scientific texts. The article gives the characteristic features of the structure of paragraphs of technical text and examples of constructions in English scientific texts based on a mining theme with the aim to explain the possible ways of their adequate translation.

  15. Two-step web-mining approach to study geology/geophysics-related open-source software projects

    Science.gov (United States)

    Behrends, Knut; Conze, Ronald

    2013-04-01

    Geology/geophysics is a highly interdisciplinary science, overlapping with, for instance, physics, biology and chemistry. In today's software-intensive work environments, geoscientists often encounter new open-source software from scientific fields that are only remotely related to the own field of expertise. We show how web-mining techniques can help to carry out systematic discovery and evaluation of such software. In a first step, we downloaded ~500 abstracts (each consisting of ~1 kb UTF-8 text) from agu-fm12.abstractcentral.com. This web site hosts the abstracts of all publications presented at AGU Fall Meeting 2012, the world's largest annual geology/geophysics conference. All abstracts belonged to the category "Earth and Space Science Informatics", an interdisciplinary label cross-cutting many disciplines such as "deep biosphere", "atmospheric research", and "mineral physics". Each publication was represented by a highly structured record with ~20 short data attributes, the largest authorship-record being the unstructured "abstract" field. We processed texts of the abstracts with the statistics software "R" to calculate a corpus and a term-document matrix. Using R package "tm", we applied text-mining techniques to filter data and develop hypotheses about software-development activities happening in various geology/geophysics fields. Analyzing the term-document matrix with basic techniques (e.g., word frequencies, co-occurences, weighting) as well as more complex methods (clustering, classification) several key pieces of information were extracted. For example, text-mining can be used to identify scientists who are also developers of open-source scientific software, and the names of their programming projects and codes can also be identified. In a second step, based on the intermediate results found by processing the conference-abstracts, any new hypotheses can be tested in another webmining subproject: by merging the dataset with open data from github

  16. Applying Supervised Opinion Mining Techniques on Online User Reviews

    Directory of Open Access Journals (Sweden)

    Ion SMEUREANU

    2012-01-01

    Full Text Available In recent years, the spectacular development of web technologies, lead to an enormous quantity of user generated information in online systems. This large amount of information on web platforms make them viable for use as data sources, in applications based on opinion mining and sentiment analysis. The paper proposes an algorithm for detecting sentiments on movie user reviews, based on naive Bayes classifier. We make an analysis of the opinion mining domain, techniques used in sentiment analysis and its applicability. We implemented the proposed algorithm and we tested its performance, and suggested directions of development.

  17. VisualUrText: A Text Analytics Tool for Unstructured Textual Data

    Science.gov (United States)

    Zainol, Zuraini; Jaymes, Mohd T. H.; Nohuddin, Puteri N. E.

    2018-05-01

    The growing amount of unstructured text over Internet is tremendous. Text repositories come from Web 2.0, business intelligence and social networking applications. It is also believed that 80-90% of future growth data is available in the form of unstructured text databases that may potentially contain interesting patterns and trends. Text Mining is well known technique for discovering interesting patterns and trends which are non-trivial knowledge from massive unstructured text data. Text Mining covers multidisciplinary fields involving information retrieval (IR), text analysis, natural language processing (NLP), data mining, machine learning statistics and computational linguistics. This paper discusses the development of text analytics tool that is proficient in extracting, processing, analyzing the unstructured text data and visualizing cleaned text data into multiple forms such as Document Term Matrix (DTM), Frequency Graph, Network Analysis Graph, Word Cloud and Dendogram. This tool, VisualUrText, is developed to assist students and researchers for extracting interesting patterns and trends in document analyses.

  18. Text-mining analysis of mHealth research

    Science.gov (United States)

    Zengul, Ferhat; Oner, Nurettin; Delen, Dursun

    2017-01-01

    Interventions, (V) Research Design, (VI) Infrastructure, (VII) Applications, (VIII) Research and Innovation in Health Technologies, (IX) Sensor-based Devices and Measurement Algorithms, (X) Survey-based Research. Third, the trend analyses indicated the infrastructure cluster as the highest percentage researched area until 2014. The Research and Innovation in Health Technologies cluster experienced the largest increase in numbers of publications in recent years, especially after 2014. This study is unique because it is the only known study utilizing text-mining analyses to reveal the streams and trends for mHealth research. The fast growth in mobile technologies is expected to lead to higher numbers of studies focusing on mHealth and its implications for various healthcare outcomes. Findings of this study can be utilized by researchers in identifying areas for future studies. PMID:29430456

  19. Text-mining analysis of mHealth research.

    Science.gov (United States)

    Ozaydin, Bunyamin; Zengul, Ferhat; Oner, Nurettin; Delen, Dursun

    2017-01-01

    , (V) Research Design, (VI) Infrastructure, (VII) Applications, (VIII) Research and Innovation in Health Technologies, (IX) Sensor-based Devices and Measurement Algorithms, (X) Survey-based Research. Third, the trend analyses indicated the infrastructure cluster as the highest percentage researched area until 2014. The Research and Innovation in Health Technologies cluster experienced the largest increase in numbers of publications in recent years, especially after 2014. This study is unique because it is the only known study utilizing text-mining analyses to reveal the streams and trends for mHealth research. The fast growth in mobile technologies is expected to lead to higher numbers of studies focusing on mHealth and its implications for various healthcare outcomes. Findings of this study can be utilized by researchers in identifying areas for future studies.

  20. BAGEL4: a user-friendly web server to thoroughly mine RiPPs and bacteriocins.

    Science.gov (United States)

    van Heel, Auke J; de Jong, Anne; Song, Chunxu; Viel, Jakob H; Kok, Jan; Kuipers, Oscar P

    2018-05-21

    Interest in secondary metabolites such as RiPPs (ribosomally synthesized and posttranslationally modified peptides) is increasing worldwide. To facilitate the research in this field we have updated our mining web server. BAGEL4 is faster than its predecessor and is now fully independent from ORF-calling. Gene clusters of interest are discovered using the core-peptide database and/or through HMM motifs that are present in associated context genes. The databases used for mining have been updated and extended with literature references and links to UniProt and NCBI. Additionally, we have included automated promoter and terminator prediction and the option to upload RNA expression data, which can be displayed along with the identified clusters. Further improvements include the annotation of the context genes, which is now based on a fast blast against the prokaryote part of the UniRef90 database, and the improved web-BLAST feature that dynamically loads structural data such as internal cross-linking from UniProt. Overall BAGEL4 provides the user with more information through a user-friendly web-interface which simplifies data evaluation. BAGEL4 is freely accessible at http://bagel4.molgenrug.nl.

  1. The Effects of a Web-Based Vocabulary Development Tool on Student Reading Comprehension of Science Texts

    Directory of Open Access Journals (Sweden)

    Karen Thompson

    2012-10-01

    Full Text Available The complexities of reading comprehension have received increasing recognition in recent years. In this realm, the power of vocabulary in predicting cognitive challenges in phonological, orthographic, and semantic processes is well documented. In this study, we present a web-based vocabulary development tool that has a series of interactive displays, including a list of the 50 most frequent words in a particular text, Google image and video results for any combination of those words, definitions, and synonyms for particular words from the text, and a list of sentences from the text in which particular words appear. Additionally, we report the results of an experiment that was performed working collaboratively with middle school science teachers from a large urban district in the United States. While this experiment did not show a significant positive effect of this tool on reading comprehension in science, we did find that girls seem to score worse on a reading comprehension assessment after using our web-based tool. This result could reflect prior research that suggests that some girls tend to have a negative attitude towards technology due to gender stereotypes that give girls the impression that they are not as good as boys in working with computers.

  2. A Combined Mining Approach and Application in Tax Administration

    OpenAIRE

    Arun Solanki; Dr. Ela Kumar

    2010-01-01

    This paper reports the development of a model for taxation. This model will work for the tax payers as well as for the administrator. It utilizes the technique of web mining, text mining, data mining and human experience knowledge for creating a knowledge base of taxation. All knowledge from each part is saved in knowledge base through a knowledge management platform. Using this knowledge management platform the administrator and tax payer can retrieve knowledge;send feedback on the basis of ...

  3. Text mining-based in silico drug discovery in oral mucositis caused by high-dose cancer therapy.

    Science.gov (United States)

    Kirk, Jon; Shah, Nirav; Noll, Braxton; Stevens, Craig B; Lawler, Marshall; Mougeot, Farah B; Mougeot, Jean-Luc C

    2018-08-01

    Oral mucositis (OM) is a major dose-limiting side effect of chemotherapy and radiation used in cancer treatment. Due to the complex nature of OM, currently available drug-based treatments are of limited efficacy. Our objectives were (i) to determine genes and molecular pathways associated with OM and wound healing using computational tools and publicly available data and (ii) to identify drugs formulated for topical use targeting the relevant OM molecular pathways. OM and wound healing-associated genes were determined by text mining, and the intersection of the two gene sets was selected for gene ontology analysis using the GeneCodis program. Protein interaction network analysis was performed using STRING-db. Enriched gene sets belonging to the identified pathways were queried against the Drug-Gene Interaction database to find drug candidates for topical use in OM. Our analysis identified 447 genes common to both the "OM" and "wound healing" text mining concepts. Gene enrichment analysis yielded 20 genes representing six pathways and targetable by a total of 32 drugs which could possibly be formulated for topical application. A manual search on ClinicalTrials.gov confirmed no relevant pathway/drug candidate had been overlooked. Twenty-five of the 32 drugs can directly affect the PTGS2 (COX-2) pathway, the pathway that has been targeted in previous clinical trials with limited success. Drug discovery using in silico text mining and pathway analysis tools can facilitate the identification of existing drugs that have the potential of topical administration to improve OM treatment.

  4. A Framework for Text Mining in Scientometric Study: A Case Study in Biomedicine Publications

    Science.gov (United States)

    Silalahi, V. M. M.; Hardiyati, R.; Nadhiroh, I. M.; Handayani, T.; Rahmaida, R.; Amelia, M.

    2018-04-01

    The data of Indonesians research publications in the domain of biomedicine has been collected to be text mined for the purpose of a scientometric study. The goal is to build a predictive model that provides a classification of research publications on the potency for downstreaming. The model is based on the drug development processes adapted from the literatures. An effort is described to build the conceptual model and the development of a corpus on the research publications in the domain of Indonesian biomedicine. Then an investigation is conducted relating to the problems associated with building a corpus and validating the model. Based on our experience, a framework is proposed to manage the scientometric study based on text mining. Our method shows the effectiveness of conducting a scientometric study based on text mining in order to get a valid classification model. This valid model is mainly supported by the iterative and close interactions with the domain experts starting from identifying the issues, building a conceptual model, to the labelling, validation and results interpretation.

  5. PubRunner: A light-weight framework for updating text mining results.

    Science.gov (United States)

    Anekalla, Kishore R; Courneya, J P; Fiorini, Nicolas; Lever, Jake; Muchow, Michael; Busby, Ben

    2017-01-01

    Biomedical text mining promises to assist biologists in quickly navigating the combined knowledge in their domain. This would allow improved understanding of the complex interactions within biological systems and faster hypothesis generation. New biomedical research articles are published daily and text mining tools are only as good as the corpus from which they work. Many text mining tools are underused because their results are static and do not reflect the constantly expanding knowledge in the field. In order for biomedical text mining to become an indispensable tool used by researchers, this problem must be addressed. To this end, we present PubRunner, a framework for regularly running text mining tools on the latest publications. PubRunner is lightweight, simple to use, and can be integrated with an existing text mining tool. The workflow involves downloading the latest abstracts from PubMed, executing a user-defined tool, pushing the resulting data to a public FTP or Zenodo dataset, and publicizing the location of these results on the public PubRunner website. We illustrate the use of this tool by re-running the commonly used word2vec tool on the latest PubMed abstracts to generate up-to-date word vector representations for the biomedical domain. This shows a proof of concept that we hope will encourage text mining developers to build tools that truly will aid biologists in exploring the latest publications.

  6. Text mining a self-report back-translation.

    Science.gov (United States)

    Blanch, Angel; Aluja, Anton

    2016-06-01

    There are several recommendations about the routine to undertake when back translating self-report instruments in cross-cultural research. However, text mining methods have been generally ignored within this field. This work describes a text mining innovative application useful to adapt a personality questionnaire to 12 different languages. The method is divided in 3 different stages, a descriptive analysis of the available back-translated instrument versions, a dissimilarity assessment between the source language instrument and the 12 back-translations, and an item assessment of item meaning equivalence. The suggested method contributes to improve the back-translation process of self-report instruments for cross-cultural research in 2 significant intertwined ways. First, it defines a systematic approach to the back translation issue, allowing for a more orderly and informed evaluation concerning the equivalence of different versions of the same instrument in different languages. Second, it provides more accurate instrument back-translations, which has direct implications for the reliability and validity of the instrument's test scores when used in different cultures/languages. In addition, this procedure can be extended to the back-translation of self-reports measuring psychological constructs in clinical assessment. Future research works could refine the suggested methodology and use additional available text mining tools. (PsycINFO Database Record (c) 2016 APA, all rights reserved).

  7. Text Mining of Journal Articles for Sleep Disorder Terminologies.

    Directory of Open Access Journals (Sweden)

    Calvin Lam

    Full Text Available Research on publication trends in journal articles on sleep disorders (SDs and the associated methodologies by using text mining has been limited. The present study involved text mining for terms to determine the publication trends in sleep-related journal articles published during 2000-2013 and to identify associations between SD and methodology terms as well as conducting statistical analyses of the text mining findings.SD and methodology terms were extracted from 3,720 sleep-related journal articles in the PubMed database by using MetaMap. The extracted data set was analyzed using hierarchical cluster analyses and adjusted logistic regression models to investigate publication trends and associations between SD and methodology terms.MetaMap had a text mining precision, recall, and false positive rate of 0.70, 0.77, and 11.51%, respectively. The most common SD term was breathing-related sleep disorder, whereas narcolepsy was the least common. Cluster analyses showed similar methodology clusters for each SD term, except narcolepsy. The logistic regression models showed an increasing prevalence of insomnia, parasomnia, and other sleep disorders but a decreasing prevalence of breathing-related sleep disorder during 2000-2013. Different SD terms were positively associated with different methodology terms regarding research design terms, measure terms, and analysis terms.Insomnia-, parasomnia-, and other sleep disorder-related articles showed an increasing publication trend, whereas those related to breathing-related sleep disorder showed a decreasing trend. Furthermore, experimental studies more commonly focused on hypersomnia and other SDs and less commonly on insomnia, breathing-related sleep disorder, narcolepsy, and parasomnia. Thus, text mining may facilitate the exploration of the publication trends in SDs and the associated methodologies.

  8. Science and Technology Text Mining Basic Concepts

    National Research Council Canada - National Science Library

    Losiewicz, Paul

    2003-01-01

    ...). It then presents some of the most widely used data and text mining techniques, including clustering and classification methods, such as nearest neighbor, relational learning models, and genetic...

  9. Social big data mining

    CERN Document Server

    Ishikawa, Hiroshi

    2015-01-01

    Social Media. Big Data and Social Data. Hypotheses in the Era of Big Data. Social Big Data Applications. Basic Concepts in Data Mining. Association Rule Mining. Clustering. Classification. Prediction. Web Structure Mining. Web Content Mining. Web Access Log Mining, Information Extraction and Deep Web Mining. Media Mining. Scalability and Outlier Detection.

  10. A comprehensive and quantitative comparison of text-mining in 15 million full-text articles versus their corresponding abstracts.

    Science.gov (United States)

    Westergaard, David; Stærfeldt, Hans-Henrik; Tønsberg, Christian; Jensen, Lars Juhl; Brunak, Søren

    2018-02-01

    Across academia and industry, text mining has become a popular strategy for keeping up with the rapid growth of the scientific literature. Text mining of the scientific literature has mostly been carried out on collections of abstracts, due to their availability. Here we present an analysis of 15 million English scientific full-text articles published during the period 1823-2016. We describe the development in article length and publication sub-topics during these nearly 250 years. We showcase the potential of text mining by extracting published protein-protein, disease-gene, and protein subcellular associations using a named entity recognition system, and quantitatively report on their accuracy using gold standard benchmark data sets. We subsequently compare the findings to corresponding results obtained on 16.5 million abstracts included in MEDLINE and show that text mining of full-text articles consistently outperforms using abstracts only.

  11. Verification of the fulfilment of the purposes of Basel II, Pillar 3 through application of the web log mining methods

    Directory of Open Access Journals (Sweden)

    M. Munk

    2012-01-01

    Full Text Available The objective of the paper is the verification of the fulfilment of the purposes of Basel II, Pillar 3 – market discipline during the recent financial crisis. The objective of the paper is to describe the current state of the working out of the project that is focused on the analysis of the market participants’ interest in mandatory disclosure of financial information by a commercial bank by means of advanced methods of web log mining. The output of the realized project will be the verification of the assumptions related to the purposes of Basel III by means of the web mining methods, the recommendations for possible reduction of mandatory disclosure of information under Basel II and III, the proposal of the methodology for data preparation for web log mining in this application domain and the generalised procedure for users’ behaviour modelling dependent on time. The schedule of the project has been divided into three phases. The paper deals with its first phase that is focusing on the data pre-processing, analysis and evaluation of the required information under Basel II, Pillar 3 since 2008 and its disclosure into the web site of a commercial bank. The authors introduce the methodologies for data preparation and known heuristic methods for path completion into web log files with respect to the particularity of investigated application domain. They propose scientific methods for modelling users’ behaviour of the webpages related to Pillar 3 with respect to time.

  12. Preprocessing and Content/Navigational Pages Identification as Premises for an Extended Web Usage Mining Model Development

    Directory of Open Access Journals (Sweden)

    Daniel MICAN

    2009-01-01

    Full Text Available From its appearance until nowadays, the internet saw a spectacular growth not only in terms of websites number and information volume, but also in terms of the number of visitors. Therefore, the need of an overall analysis regarding both the web sites and the content provided by them was required. Thus, a new branch of research was developed, namely web mining, that aims to discover useful information and knowledge, based not only on the analysis of websites and content, but also on the way in which the users interact with them. The aim of the present paper is to design a database that captures only the relevant data from logs in a way that will allow to store and manage large sets of temporal data with common tools in real time. In our work, we rely on different web sites or website sections with known architecture and we test several hypotheses from the literature in order to extend the framework to sites with unknown or chaotic structure, which are non-transparent in determining the type of visited pages. In doing this, we will start from non-proprietary, preexisting raw server logs.

  13. Spectral signature verification using statistical analysis and text mining

    Science.gov (United States)

    DeCoster, Mallory E.; Firpi, Alexe H.; Jacobs, Samantha K.; Cone, Shelli R.; Tzeng, Nigel H.; Rodriguez, Benjamin M.

    2016-05-01

    In the spectral science community, numerous spectral signatures are stored in databases representative of many sample materials collected from a variety of spectrometers and spectroscopists. Due to the variety and variability of the spectra that comprise many spectral databases, it is necessary to establish a metric for validating the quality of spectral signatures. This has been an area of great discussion and debate in the spectral science community. This paper discusses a method that independently validates two different aspects of a spectral signature to arrive at a final qualitative assessment; the textual meta-data and numerical spectral data. Results associated with the spectral data stored in the Signature Database1 (SigDB) are proposed. The numerical data comprising a sample material's spectrum is validated based on statistical properties derived from an ideal population set. The quality of the test spectrum is ranked based on a spectral angle mapper (SAM) comparison to the mean spectrum derived from the population set. Additionally, the contextual data of a test spectrum is qualitatively analyzed using lexical analysis text mining. This technique analyzes to understand the syntax of the meta-data to provide local learning patterns and trends within the spectral data, indicative of the test spectrum's quality. Text mining applications have successfully been implemented for security2 (text encryption/decryption), biomedical3 , and marketing4 applications. The text mining lexical analysis algorithm is trained on the meta-data patterns of a subset of high and low quality spectra, in order to have a model to apply to the entire SigDB data set. The statistical and textual methods combine to assess the quality of a test spectrum existing in a database without the need of an expert user. This method has been compared to other validation methods accepted by the spectral science community, and has provided promising results when a baseline spectral signature is

  14. Biocuration workflows and text mining: overview of the BioCreative 2012 Workshop Track II.

    Science.gov (United States)

    Lu, Zhiyong; Hirschman, Lynette

    2012-01-01

    Manual curation of data from the biomedical literature is a rate-limiting factor for many expert curated databases. Despite the continuing advances in biomedical text mining and the pressing needs of biocurators for better tools, few existing text-mining tools have been successfully integrated into production literature curation systems such as those used by the expert curated databases. To close this gap and better understand all aspects of literature curation, we invited submissions of written descriptions of curation workflows from expert curated databases for the BioCreative 2012 Workshop Track II. We received seven qualified contributions, primarily from model organism databases. Based on these descriptions, we identified commonalities and differences across the workflows, the common ontologies and controlled vocabularies used and the current and desired uses of text mining for biocuration. Compared to a survey done in 2009, our 2012 results show that many more databases are now using text mining in parts of their curation workflows. In addition, the workshop participants identified text-mining aids for finding gene names and symbols (gene indexing), prioritization of documents for curation (document triage) and ontology concept assignment as those most desired by the biocurators. DATABASE URL: http://www.biocreative.org/tasks/bc-workshop-2012/workflow/.

  15. Pressing needs of biomedical text mining in biocuration and beyond: opportunities and challenges.

    Science.gov (United States)

    Singhal, Ayush; Leaman, Robert; Catlett, Natalie; Lemberger, Thomas; McEntyre, Johanna; Polson, Shawn; Xenarios, Ioannis; Arighi, Cecilia; Lu, Zhiyong

    2016-01-01

    Text mining in the biomedical sciences is rapidly transitioning from small-scale evaluation to large-scale application. In this article, we argue that text-mining technologies have become essential tools in real-world biomedical research. We describe four large scale applications of text mining, as showcased during a recent panel discussion at the BioCreative V Challenge Workshop. We draw on these applications as case studies to characterize common requirements for successfully applying text-mining techniques to practical biocuration needs. We note that system 'accuracy' remains a challenge and identify several additional common difficulties and potential research directions including (i) the 'scalability' issue due to the increasing need of mining information from millions of full-text articles, (ii) the 'interoperability' issue of integrating various text-mining systems into existing curation workflows and (iii) the 'reusability' issue on the difficulty of applying trained systems to text genres that are not seen previously during development. We then describe related efforts within the text-mining community, with a special focus on the BioCreative series of challenge workshops. We believe that focusing on the near-term challenges identified in this work will amplify the opportunities afforded by the continued adoption of text-mining tools. Finally, in order to sustain the curation ecosystem and have text-mining systems adopted for practical benefits, we call for increased collaboration between text-mining researchers and various stakeholders, including researchers, publishers and biocurators. Published by Oxford University Press 2016. This work is written by US Government employees and is in the public domain in the US.

  16. pubmed.mineR: An R package with text-mining algorithms to ...

    Indian Academy of Sciences (India)

    2015-09-29

    Sep 29, 2015 ... using text-mining algorithms for biomedical research pur- poses. ... studies are described to illustrate some potential uses of ... This is the most applied task. ... other alphabets (for example, Greek alphabets) and hyphens.

  17. Rare disease diagnosis: A review of web search, social media and large-scale data-mining approaches.

    Science.gov (United States)

    Svenstrup, Dan; Jørgensen, Henrik L; Winther, Ole

    2015-01-01

    Physicians and the general public are increasingly using web-based tools to find answers to medical questions. The field of rare diseases is especially challenging and important as shown by the long delay and many mistakes associated with diagnoses. In this paper we review recent initiatives on the use of web search, social media and data mining in data repositories for medical diagnosis. We compare the retrieval accuracy on 56 rare disease cases with known diagnosis for the web search tools google.com, pubmed.gov, omim.org and our own search tool findzebra.com. We give a detailed description of IBM's Watson system and make a rough comparison between findzebra.com and Watson on subsets of the Doctor's dilemma dataset. The recall@10 and recall@20 (fraction of cases where the correct result appears in top 10 and top 20) for the 56 cases are found to be be 29%, 16%, 27% and 59% and 32%, 18%, 34% and 64%, respectively. Thus, FindZebra has a significantly (p mining tools and social media are some of the areas that hold promise.

  18. A comprehensive and quantitative comparison of text-mining in 15 million full-text articles versus their corresponding abstracts

    Science.gov (United States)

    Westergaard, David; Stærfeldt, Hans-Henrik

    2018-01-01

    Across academia and industry, text mining has become a popular strategy for keeping up with the rapid growth of the scientific literature. Text mining of the scientific literature has mostly been carried out on collections of abstracts, due to their availability. Here we present an analysis of 15 million English scientific full-text articles published during the period 1823–2016. We describe the development in article length and publication sub-topics during these nearly 250 years. We showcase the potential of text mining by extracting published protein–protein, disease–gene, and protein subcellular associations using a named entity recognition system, and quantitatively report on their accuracy using gold standard benchmark data sets. We subsequently compare the findings to corresponding results obtained on 16.5 million abstracts included in MEDLINE and show that text mining of full-text articles consistently outperforms using abstracts only. PMID:29447159

  19. Text Mining of Journal Articles for Sleep Disorder Terminologies.

    Science.gov (United States)

    Lam, Calvin; Lai, Fu-Chih; Wang, Chia-Hui; Lai, Mei-Hsin; Hsu, Nanly; Chung, Min-Huey

    2016-01-01

    Research on publication trends in journal articles on sleep disorders (SDs) and the associated methodologies by using text mining has been limited. The present study involved text mining for terms to determine the publication trends in sleep-related journal articles published during 2000-2013 and to identify associations between SD and methodology terms as well as conducting statistical analyses of the text mining findings. SD and methodology terms were extracted from 3,720 sleep-related journal articles in the PubMed database by using MetaMap. The extracted data set was analyzed using hierarchical cluster analyses and adjusted logistic regression models to investigate publication trends and associations between SD and methodology terms. MetaMap had a text mining precision, recall, and false positive rate of 0.70, 0.77, and 11.51%, respectively. The most common SD term was breathing-related sleep disorder, whereas narcolepsy was the least common. Cluster analyses showed similar methodology clusters for each SD term, except narcolepsy. The logistic regression models showed an increasing prevalence of insomnia, parasomnia, and other sleep disorders but a decreasing prevalence of breathing-related sleep disorder during 2000-2013. Different SD terms were positively associated with different methodology terms regarding research design terms, measure terms, and analysis terms. Insomnia-, parasomnia-, and other sleep disorder-related articles showed an increasing publication trend, whereas those related to breathing-related sleep disorder showed a decreasing trend. Furthermore, experimental studies more commonly focused on hypersomnia and other SDs and less commonly on insomnia, breathing-related sleep disorder, narcolepsy, and parasomnia. Thus, text mining may facilitate the exploration of the publication trends in SDs and the associated methodologies.

  20. A comprehensive and quantitative comparison of text-mining in 15 million full-text articles versus their corresponding abstracts

    DEFF Research Database (Denmark)

    Westergaard, David; Stærfeldt, Hans Henrik; Tønsberg, Christian

    2018-01-01

    Across academia and industry, text mining has become a popular strategy for keeping up with the rapid growth of the scientific literature. Text mining of the scientific literature has mostly been carried out on collections of abstracts, due to their availability. Here we present an analysis of 15...... subcellular associations using a named entity recognition system, and quantitatively report on their accuracy using gold standard benchmark data sets. We subsequently compare the findings to corresponding results obtained on 16.5 million abstracts included in MEDLINE and show that text mining of full...... million English scientific full-text articles published during the period 1823-2016. We describe the development in article length and publication sub-topics during these nearly 250 years. We showcase the potential of text mining by extracting published protein-protein, disease-gene, and protein...

  1. Imitating manual curation of text-mined facts in biomedicine.

    Directory of Open Access Journals (Sweden)

    Raul Rodriguez-Esteban

    2006-09-01

    Full Text Available Text-mining algorithms make mistakes in extracting facts from natural-language texts. In biomedical applications, which rely on use of text-mined data, it is critical to assess the quality (the probability that the message is correctly extracted of individual facts--to resolve data conflicts and inconsistencies. Using a large set of almost 100,000 manually produced evaluations (most facts were independently reviewed more than once, producing independent evaluations, we implemented and tested a collection of algorithms that mimic human evaluation of facts provided by an automated information-extraction system. The performance of our best automated classifiers closely approached that of our human evaluators (ROC score close to 0.95. Our hypothesis is that, were we to use a larger number of human experts to evaluate any given sentence, we could implement an artificial-intelligence curator that would perform the classification job at least as accurately as an average individual human evaluator. We illustrated our analysis by visualizing the predicted accuracy of the text-mined relations involving the term cocaine.

  2. Automated web usage data mining and recommendation system using K-Nearest Neighbor (KNN classification method

    Directory of Open Access Journals (Sweden)

    D.A. Adeniyi

    2016-01-01

    Full Text Available The major problem of many on-line web sites is the presentation of many choices to the client at a time; this usually results to strenuous and time consuming task in finding the right product or information on the site. In this work, we present a study of automatic web usage data mining and recommendation system based on current user behavior through his/her click stream data on the newly developed Really Simple Syndication (RSS reader website, in order to provide relevant information to the individual without explicitly asking for it. The K-Nearest-Neighbor (KNN classification method has been trained to be used on-line and in Real-Time to identify clients/visitors click stream data, matching it to a particular user group and recommend a tailored browsing option that meet the need of the specific user at a particular time. To achieve this, web users RSS address file was extracted, cleansed, formatted and grouped into meaningful session and data mart was developed. Our result shows that the K-Nearest Neighbor classifier is transparent, consistent, straightforward, simple to understand, high tendency to possess desirable qualities and easy to implement than most other machine learning techniques specifically when there is little or no prior knowledge about data distribution.

  3. Text Clustering Algorithm Based on Random Cluster Core

    Directory of Open Access Journals (Sweden)

    Huang Long-Jun

    2016-01-01

    Full Text Available Nowadays clustering has become a popular text mining algorithm, but the huge data can put forward higher requirements for the accuracy and performance of text mining. In view of the performance bottleneck of traditional text clustering algorithm, this paper proposes a text clustering algorithm with random features. This is a kind of clustering algorithm based on text density, at the same time using the neighboring heuristic rules, the concept of random cluster is introduced, which effectively reduces the complexity of the distance calculation.

  4. A Survey of Text Mining in Social Media: Facebook and Twitter Perspectives

    Directory of Open Access Journals (Sweden)

    Said A. Salloum

    2017-01-01

    Full Text Available Text mining has become one of the trendy fields that has been incorporated in several research fields such as computational linguistics, Information Retrieval (IR and data mining. Natural Language Processing (NLP techniques were used to extract knowledge from the textual text that is written by human beings. Text mining reads an unstructured form of data to provide meaningful information patterns in a shortest time period. Social networking sites are a great source of communication as most of the people in today’s world use these sites in their daily lives to keep connected to each other. It becomes a common practice to not write a sentence with correct grammar and spelling. This practice may lead to different kinds of ambiguities like lexical, syntactic, and semantic and due to this type of unclear data, it is hard to find out the actual data order. Accordingly, we are conducting an investigation with the aim of looking for different text mining methods to get various textual orders on social media websites. This survey aims to describe how studies in social media have used text analytics and text mining techniques for the purpose of identifying the key themes in the data. This survey focused on analyzing the text mining studies related to Facebook and Twitter; the two dominant social media in the world. Results of this survey can serve as the baselines for future text mining research.

  5. [Text mining, a method for computer-assisted analysis of scientific texts, demonstrated by an analysis of author networks].

    Science.gov (United States)

    Hahn, P; Dullweber, F; Unglaub, F; Spies, C K

    2014-06-01

    Searching for relevant publications is becoming more difficult with the increasing number of scientific articles. Text mining as a specific form of computer-based data analysis may be helpful in this context. Highlighting relations between authors and finding relevant publications concerning a specific subject using text analysis programs are illustrated graphically by 2 performed examples. © Georg Thieme Verlag KG Stuttgart · New York.

  6. Combining Data Warehouse and Data Mining Techniques for Web Log Analysis

    DEFF Research Database (Denmark)

    Pedersen, Torben Bach; Jespersen, Søren; Thorhauge, Jesper

    2008-01-01

    a number of approaches thatcombine data warehousing and data mining techniques in order to analyze Web logs.After introducing the well-known click and session data warehouse (DW) schemas,the chapter presents the subsession schema, which allows fast queries on sequences...

  7. New challenges for text mining: mapping between text and manually curated pathways

    Science.gov (United States)

    Oda, Kanae; Kim, Jin-Dong; Ohta, Tomoko; Okanohara, Daisuke; Matsuzaki, Takuya; Tateisi, Yuka; Tsujii, Jun'ichi

    2008-01-01

    Background Associating literature with pathways poses new challenges to the Text Mining (TM) community. There are three main challenges to this task: (1) the identification of the mapping position of a specific entity or reaction in a given pathway, (2) the recognition of the causal relationships among multiple reactions, and (3) the formulation and implementation of required inferences based on biological domain knowledge. Results To address these challenges, we constructed new resources to link the text with a model pathway; they are: the GENIA pathway corpus with event annotation and NF-kB pathway. Through their detailed analysis, we address the untapped resource, ‘bio-inference,’ as well as the differences between text and pathway representation. Here, we show the precise comparisons of their representations and the nine classes of ‘bio-inference’ schemes observed in the pathway corpus. Conclusions We believe that the creation of such rich resources and their detailed analysis is the significant first step for accelerating the research of the automatic construction of pathway from text. PMID:18426550

  8. Redundancy in electronic health record corpora: analysis, impact on text mining performance and mitigation strategies.

    Science.gov (United States)

    Cohen, Raphael; Elhadad, Michael; Elhadad, Noémie

    2013-01-16

    The increasing availability of Electronic Health Record (EHR) data and specifically free-text patient notes presents opportunities for phenotype extraction. Text-mining methods in particular can help disease modeling by mapping named-entities mentions to terminologies and clustering semantically related terms. EHR corpora, however, exhibit specific statistical and linguistic characteristics when compared with corpora in the biomedical literature domain. We focus on copy-and-paste redundancy: clinicians typically copy and paste information from previous notes when documenting a current patient encounter. Thus, within a longitudinal patient record, one expects to observe heavy redundancy. In this paper, we ask three research questions: (i) How can redundancy be quantified in large-scale text corpora? (ii) Conventional wisdom is that larger corpora yield better results in text mining. But how does the observed EHR redundancy affect text mining? Does such redundancy introduce a bias that distorts learned models? Or does the redundancy introduce benefits by highlighting stable and important subsets of the corpus? (iii) How can one mitigate the impact of redundancy on text mining? We analyze a large-scale EHR corpus and quantify redundancy both in terms of word and semantic concept repetition. We observe redundancy levels of about 30% and non-standard distribution of both words and concepts. We measure the impact of redundancy on two standard text-mining applications: collocation identification and topic modeling. We compare the results of these methods on synthetic data with controlled levels of redundancy and observe significant performance variation. Finally, we compare two mitigation strategies to avoid redundancy-induced bias: (i) a baseline strategy, keeping only the last note for each patient in the corpus; (ii) removing redundant notes with an efficient fingerprinting-based algorithm. (a)For text mining, preprocessing the EHR corpus with fingerprinting yields

  9. Data Processing and Text Mining Technologies on Electronic Medical Records: A Review

    Directory of Open Access Journals (Sweden)

    Wencheng Sun

    2018-01-01

    Full Text Available Currently, medical institutes generally use EMR to record patient’s condition, including diagnostic information, procedures performed, and treatment results. EMR has been recognized as a valuable resource for large-scale analysis. However, EMR has the characteristics of diversity, incompleteness, redundancy, and privacy, which make it difficult to carry out data mining and analysis directly. Therefore, it is necessary to preprocess the source data in order to improve data quality and improve the data mining results. Different types of data require different processing technologies. Most structured data commonly needs classic preprocessing technologies, including data cleansing, data integration, data transformation, and data reduction. For semistructured or unstructured data, such as medical text, containing more health information, it requires more complex and challenging processing methods. The task of information extraction for medical texts mainly includes NER (named-entity recognition and RE (relation extraction. This paper focuses on the process of EMR processing and emphatically analyzes the key techniques. In addition, we make an in-depth study on the applications developed based on text mining together with the open challenges and research issues for future work.

  10. Pharmspresso: a text mining tool for extraction of pharmacogenomic concepts and relationships from full text.

    Science.gov (United States)

    Garten, Yael; Altman, Russ B

    2009-02-05

    Pharmacogenomics studies the relationship between genetic variation and the variation in drug response phenotypes. The field is rapidly gaining importance: it promises drugs targeted to particular subpopulations based on genetic background. The pharmacogenomics literature has expanded rapidly, but is dispersed in many journals. It is challenging, therefore, to identify important associations between drugs and molecular entities--particularly genes and gene variants, and thus these critical connections are often lost. Text mining techniques can allow us to convert the free-style text to a computable, searchable format in which pharmacogenomic concepts (such as genes, drugs, polymorphisms, and diseases) are identified, and important links between these concepts are recorded. Availability of full text articles as input into text mining engines is key, as literature abstracts often do not contain sufficient information to identify these pharmacogenomic associations. Thus, building on a tool called Textpresso, we have created the Pharmspresso tool to assist in identifying important pharmacogenomic facts in full text articles. Pharmspresso parses text to find references to human genes, polymorphisms, drugs and diseases and their relationships. It presents these as a series of marked-up text fragments, in which key concepts are visually highlighted. To evaluate Pharmspresso, we used a gold standard of 45 human-curated articles. Pharmspresso identified 78%, 61%, and 74% of target gene, polymorphism, and drug concepts, respectively. Pharmspresso is a text analysis tool that extracts pharmacogenomic concepts from the literature automatically and thus captures our current understanding of gene-drug interactions in a computable form. We have made Pharmspresso available at http://pharmspresso.stanford.edu.

  11. Automated detection of follow-up appointments using text mining of discharge records.

    Science.gov (United States)

    Ruud, Kari L; Johnson, Matthew G; Liesinger, Juliette T; Grafft, Carrie A; Naessens, James M

    2010-06-01

    To determine whether text mining can accurately detect specific follow-up appointment criteria in free-text hospital discharge records. Cross-sectional study. Mayo Clinic Rochester hospitals. Inpatients discharged from general medicine services in 2006 (n = 6481). Textual hospital dismissal summaries were manually reviewed to determine whether the records contained specific follow-up appointment arrangement elements: date, time and either physician or location for an appointment. The data set was evaluated for the same criteria using SAS Text Miner software. The two assessments were compared to determine the accuracy of text mining for detecting records containing follow-up appointment arrangements. Agreement of text-mined appointment findings with gold standard (manual abstraction) including sensitivity, specificity, positive predictive and negative predictive values (PPV and NPV). About 55.2% (3576) of discharge records contained all criteria for follow-up appointment arrangements according to the manual review, 3.2% (113) of which were missed through text mining. Text mining incorrectly identified 3.7% (107) follow-up appointments that were not considered valid through manual review. Therefore, the text mining analysis concurred with the manual review in 96.6% of the appointment findings. Overall sensitivity and specificity were 96.8 and 96.3%, respectively; and PPV and NPV were 97.0 and 96.1%, respectively. of individual appointment criteria resulted in accuracy rates of 93.5% for date, 97.4% for time, 97.5% for physician and 82.9% for location. Text mining of unstructured hospital dismissal summaries can accurately detect documentation of follow-up appointment arrangement elements, thus saving considerable resources for performance assessment and quality-related research.

  12. OSCAR4: a flexible architecture for chemical text-mining

    Directory of Open Access Journals (Sweden)

    Jessop David M

    2011-10-01

    Full Text Available Abstract The Open-Source Chemistry Analysis Routines (OSCAR software, a toolkit for the recognition of named entities and data in chemistry publications, has been developed since 2002. Recent work has resulted in the separation of the core OSCAR functionality and its release as the OSCAR4 library. This library features a modular API (based on reduction of surface coupling that permits client programmers to easily incorporate it into external applications. OSCAR4 offers a domain-independent architecture upon which chemistry specific text-mining tools can be built, and its development and usage are discussed.

  13. Developing an open source-based spatial data infrastructure for integrated monitoring of mining areas

    Science.gov (United States)

    Lahn, Florian; Knoth, Christian; Prinz, Torsten; Pebesma, Edzer

    2014-05-01

    In all phases of mining campaigns, comprehensive spatial information is an essential requirement in order to ensure economically efficient but also safe mining activities as well as to reduce environmental impacts. Earth observation data acquired from various sources like remote sensing or ground measurements is important e.g. for the exploration of mineral deposits, the monitoring of mining induced impacts on vegetation or the detection of ground subsidence. The GMES4Mining project aims at exploring new remote sensing techniques and developing analysis methods on various types of sensor data to provide comprehensive spatial information during mining campaigns (BENECKE et al. 2013). One important task in this project is the integration of the data gathered (e.g. hyperspectral images, spaceborne radar data and ground measurements) as well as results of the developed analysis methods within a web-accessible data source based on open source software. The main challenges here are to provide various types and formats of data from different sensors and to enable access to analysis and processing techniques without particular software or licensing requirements for users. Furthermore the high volume of the involved data (especially hyperspectral remote sensing images) makes data transfer a major issue in this use case. To engage these problems a spatial data infrastructure (SDI) including a web portal as user frontend is being developed which allows users to access not only the data but also several analysis methods. The Geoserver software is used for publishing the data, which is then accessed and visualized in a JavaScript-based web portal. In order to perform descriptive statistics and some straightforward image processing techniques on the raster data (e.g. band arithmetic or principal component analysis) the statistics software R is implemented on a server and connected via Rserve. The analysis is controlled and executed directly by the user through the web portal and

  14. Monitoring interaction and collective text production through text mining

    Directory of Open Access Journals (Sweden)

    Macedo, Alexandra Lorandi

    2014-04-01

    Full Text Available This article presents the Concepts Network tool, developed using text mining technology. The main objective of this tool is to extract and relate terms of greatest incidence from a text and exhibit the results in the form of a graph. The Network was implemented in the Collective Text Editor (CTE which is an online tool that allows the production of texts in synchronized or non-synchronized forms. This article describes the application of the Network both in texts produced collectively and texts produced in a forum. The purpose of the tool is to offer support to the teacher in managing the high volume of data generated in the process of interaction amongst students and in the construction of the text. Specifically, the aim is to facilitate the teacher’s job by allowing him/her to process data in a shorter time than is currently demanded. The results suggest that the Concepts Network can aid the teacher, as it provides indicators of the quality of the text produced. Moreover, messages posted in forums can be analyzed without their content necessarily having to be pre-read.

  15. Web-Based Analysis for Decision Support Systems

    African Journals Online (AJOL)

    pc

    2018-03-05

    Mar 5, 2018 ... such as web mining, social analytics, and data mining were examined. ... Additionally, the systems possess superb interaction capability which enables .... technologies has a significant impact on DSS design especially ..... Evaluating the Impact of User Characteristics and Different Layouts on an Interactive ...

  16. Unsupervised text mining for assessing and augmenting GWAS results.

    Science.gov (United States)

    Ailem, Melissa; Role, François; Nadif, Mohamed; Demenais, Florence

    2016-04-01

    Text mining can assist in the analysis and interpretation of large-scale biomedical data, helping biologists to quickly and cheaply gain confirmation of hypothesized relationships between biological entities. We set this question in the context of genome-wide association studies (GWAS), an actively emerging field that contributed to identify many genes associated with multifactorial diseases. These studies allow to identify groups of genes associated with the same phenotype, but provide no information about the relationships between these genes. Therefore, our objective is to leverage unsupervised text mining techniques using text-based cosine similarity comparisons and clustering applied to candidate and random gene vectors, in order to augment the GWAS results. We propose a generic framework which we used to characterize the relationships between 10 genes reported associated with asthma by a previous GWAS. The results of this experiment showed that the similarities between these 10 genes were significantly stronger than would be expected by chance (one-sided p-value<0.01). The clustering of observed and randomly selected gene also allowed to generate hypotheses about potential functional relationships between these genes and thus contributed to the discovery of new candidate genes for asthma. Copyright © 2016 Elsevier Inc. All rights reserved.

  17. Supporting the education evidence portal via text mining

    Science.gov (United States)

    Ananiadou, Sophia; Thompson, Paul; Thomas, James; Mu, Tingting; Oliver, Sandy; Rickinson, Mark; Sasaki, Yutaka; Weissenbacher, Davy; McNaught, John

    2010-01-01

    The UK Education Evidence Portal (eep) provides a single, searchable, point of access to the contents of the websites of 33 organizations relating to education, with the aim of revolutionizing work practices for the education community. Use of the portal alleviates the need to spend time searching multiple resources to find relevant information. However, the combined content of the websites of interest is still very large (over 500 000 documents and growing). This means that searches using the portal can produce very large numbers of hits. As users often have limited time, they would benefit from enhanced methods of performing searches and viewing results, allowing them to drill down to information of interest more efficiently, without having to sift through potentially long lists of irrelevant documents. The Joint Information Systems Committee (JISC)-funded ASSIST project has produced a prototype web interface to demonstrate the applicability of integrating a number of text-mining tools and methods into the eep, to facilitate an enhanced searching, browsing and document-viewing experience. New features include automatic classification of documents according to a taxonomy, automatic clustering of search results according to similar document content, and automatic identification and highlighting of key terms within documents. PMID:20643679

  18. 3D Web-based HMI with WebGL Rendering Performance

    Directory of Open Access Journals (Sweden)

    Muennoi Atitayaporn

    2016-01-01

    Full Text Available An HMI, or Human-Machine Interface, is a software allowing users to communicate with a machine or automation system. It usually serves as a display section in SCADA (Supervisory Control and Data Acquisition system for device monitoring and control. In this papper, a 3D Web-based HMI with WebGL (Web-based Graphics Library rendering performance is presented. The main purpose of this work is to attempt to reduce the limitations of traditional 3D web HMI using the advantage of WebGL. To evaluate the performance, frame rate and frame time metrics were used. The results showed 3D Web-based HMI can maintain the frame rate 60FPS for #cube=0.5K/0.8K, 30FPS for #cube=1.1K/1.6K when it was run on Internet Explorer and Chrome respectively. Moreover, the study found that 3D Web-based HMI using WebGL contains similar frame time in each frame even though the numbers of cubes are up to 5K. This indicated stuttering incurred less in the proposed 3D Web-based HMI compared to the chosen commercial HMI product.

  19. Mining consumer health vocabulary from community-generated text.

    Science.gov (United States)

    Vydiswaran, V G Vinod; Mei, Qiaozhu; Hanauer, David A; Zheng, Kai

    2014-01-01

    Community-generated text corpora can be a valuable resource to extract consumer health vocabulary (CHV) and link them to professional terminologies and alternative variants. In this research, we propose a pattern-based text-mining approach to identify pairs of CHV and professional terms from Wikipedia, a large text corpus created and maintained by the community. A novel measure, leveraging the ratio of frequency of occurrence, was used to differentiate consumer terms from professional terms. We empirically evaluated the applicability of this approach using a large data sample consisting of MedLine abstracts and all posts from an online health forum, MedHelp. The results show that the proposed approach is able to identify synonymous pairs and label the terms as either consumer or professional term with high accuracy. We conclude that the proposed approach provides great potential to produce a high quality CHV to improve the performance of computational applications in processing consumer-generated health text.

  20. Nuclear expert web mining system: monitoring and analysis of nuclear acceptance by information retrieval and opinion extraction on the Internet

    Energy Technology Data Exchange (ETDEWEB)

    Reis, Thiago; Barroso, Antonio C.O.; Imakuma, Kengo, E-mail: thiagoreis@usp.b, E-mail: barroso@ipen.b, E-mail: kimakuma@ipen.b [Instituto de Pesquisas Energeticas e Nucleares (IPEN/CNEN-SP), Sao Paulo, SP (Brazil)

    2011-07-01

    This paper presents a research initiative that aims to collect nuclear related information and to analyze opinionated texts by mining the hypertextual data environment and social networks web sites on the Internet. Different from previous approaches that employed traditional statistical techniques, it is being proposed a novel Web Mining approach, built using the concept of Expert Systems, for massive and autonomous data collection and analysis. The initial step has been accomplished, resulting in a framework design that is able to gradually encompass a set of evolving techniques, methods, and theories in such a way that this work will build a platform upon which new researches can be performed more easily by just substituting modules or plugging in new ones. Upon completion it is expected that this research will contribute to the understanding of the population views on nuclear technology and its acceptance. (author)

  1. Nuclear expert web mining system: monitoring and analysis of nuclear acceptance by information retrieval and opinion extraction on the Internet

    International Nuclear Information System (INIS)

    Reis, Thiago; Barroso, Antonio C.O.; Imakuma, Kengo

    2011-01-01

    This paper presents a research initiative that aims to collect nuclear related information and to analyze opinionated texts by mining the hypertextual data environment and social networks web sites on the Internet. Different from previous approaches that employed traditional statistical techniques, it is being proposed a novel Web Mining approach, built using the concept of Expert Systems, for massive and autonomous data collection and analysis. The initial step has been accomplished, resulting in a framework design that is able to gradually encompass a set of evolving techniques, methods, and theories in such a way that this work will build a platform upon which new researches can be performed more easily by just substituting modules or plugging in new ones. Upon completion it is expected that this research will contribute to the understanding of the population views on nuclear technology and its acceptance. (author)

  2. CONAN : Text Mining in the Biomedical Domain

    NARCIS (Netherlands)

    Malik, R.

    2006-01-01

    This thesis is about Text Mining. Extracting important information from literature. In the last years, the number of biomedical articles and journals is growing exponentially. Scientists might not find the information they want because of the large number of publications. Therefore a system was

  3. Using text mining for study identification in systematic reviews: a systematic review of current approaches.

    Science.gov (United States)

    O'Mara-Eves, Alison; Thomas, James; McNaught, John; Miwa, Makoto; Ananiadou, Sophia

    2015-01-14

    The large and growing number of published studies, and their increasing rate of publication, makes the task of identifying relevant studies in an unbiased way for inclusion in systematic reviews both complex and time consuming. Text mining has been offered as a potential solution: through automating some of the screening process, reviewer time can be saved. The evidence base around the use of text mining for screening has not yet been pulled together systematically; this systematic review fills that research gap. Focusing mainly on non-technical issues, the review aims to increase awareness of the potential of these technologies and promote further collaborative research between the computer science and systematic review communities. Five research questions led our review: what is the state of the evidence base; how has workload reduction been evaluated; what are the purposes of semi-automation and how effective are they; how have key contextual problems of applying text mining to the systematic review field been addressed; and what challenges to implementation have emerged? We answered these questions using standard systematic review methods: systematic and exhaustive searching, quality-assured data extraction and a narrative synthesis to synthesise findings. The evidence base is active and diverse; there is almost no replication between studies or collaboration between research teams and, whilst it is difficult to establish any overall conclusions about best approaches, it is clear that efficiencies and reductions in workload are potentially achievable. On the whole, most suggested that a saving in workload of between 30% and 70% might be possible, though sometimes the saving in workload is accompanied by the loss of 5% of relevant studies (i.e. a 95% recall). Using text mining to prioritise the order in which items are screened should be considered safe and ready for use in 'live' reviews. The use of text mining as a 'second screener' may also be used cautiously

  4. DYNAMIC FEATURE SELECTION FOR WEB USER IDENTIFICATION ON LINGUISTIC AND STYLISTIC FEATURES OF ONLINE TEXTS

    Directory of Open Access Journals (Sweden)

    A. A. Vorobeva

    2017-01-01

    Full Text Available The paper deals with identification and authentication of web users participating in the Internet information processes (based on features of online texts.In digital forensics web user identification based on various linguistic features can be used to discover identity of individuals, criminals or terrorists using the Internet to commit cybercrimes. Internet could be used as a tool in different types of cybercrimes (fraud and identity theft, harassment and anonymous threats, terrorist or extremist statements, distribution of illegal content and information warfare. Linguistic identification of web users is a kind of biometric identification, it can be used to narrow down the suspects, identify a criminal and prosecute him. Feature set includes various linguistic and stylistic features extracted from online texts. We propose dynamic feature selection for each web user identification task. Selection is based on calculating Manhattan distance to k-nearest neighbors (Relief-f algorithm. This approach improves the identification accuracy and minimizes the number of features. Experiments were carried out on several datasets with different level of class imbalance. Experiment results showed that features relevance varies in different set of web users (probable authors of some text; features selection for each set of web users improves identification accuracy by 4% at the average that is approximately 1% higher than with the use of static set of features. The proposed approach is most effective for a small number of training samples (messages per user.

  5. Using text mining for study identification in systematic reviews: a systematic review of current approaches

    OpenAIRE

    O?Mara-Eves, Alison; Thomas, James; McNaught, John; Miwa, Makoto; Ananiadou, Sophia

    2015-01-01

    Background The large and growing number of published studies, and their increasing rate of publication, makes the task of identifying relevant studies in an unbiased way for inclusion in systematic reviews both complex and time consuming. Text mining has been offered as a potential solution: through automating some of the screening process, reviewer time can be saved. The evidence base around the use of text mining for screening has not yet been pulled together systematically; this systematic...

  6. A tm Plug-In for Distributed Text Mining in R

    Directory of Open Access Journals (Sweden)

    Stefan Theussl

    2012-11-01

    Full Text Available R has gained explicit text mining support with the tm package enabling statisticians to answer many interesting research questions via statistical analysis or modeling of (text corpora. However, we typically face two challenges when analyzing large corpora: (1 the amount of data to be processed in a single machine is usually limited by the available main memory (i.e., RAM, and (2 the more data to be analyzed the higher the need for efficient procedures for calculating valuable results. Fortunately, adequate programming models like MapReduce facilitate parallelization of text mining tasks and allow for processing data sets beyond what would fit into memory by using a distributed file system possibly spanning over several machines, e.g., in a cluster of workstations. In this paper we present a plug-in package to tm called tm.plugin.dc implementing a distributed corpus class which can take advantage of the Hadoop MapReduce library for large scale text mining tasks. We show on the basis of an application in culturomics that we can efficiently handle data sets of significant size.

  7. Text Mining of UU-ITE Implementation in Indonesia

    Science.gov (United States)

    Hakim, Lukmanul; Kusumasari, Tien F.; Lubis, Muharman

    2018-04-01

    At present, social media and networks act as one of the main platforms for sharing information, idea, thought and opinions. Many people share their knowledge and express their views on the specific topics or current hot issues that interest them. The social media texts have rich information about the complaints, comments, recommendation and suggestion as the automatic reaction or respond to government initiative or policy in order to overcome certain issues.This study examines the sentiment from netizensas part of citizen who has vocal sound about the implementation of UU ITE as the first cyberlaw in Indonesia as a means to identify the current tendency of citizen perception. To perform text mining techniques, this study used Twitter Rest API while R programming was utilized for the purpose of classification analysis based on hierarchical cluster.

  8. Use and Effectiveness of a Video- and Text-Driven Web-Based Computer-Tailored Intervention: Randomized Controlled Trial.

    Science.gov (United States)

    Walthouwer, Michel Jean Louis; Oenema, Anke; Lechner, Lilian; de Vries, Hein

    2015-09-25

    Many Web-based computer-tailored interventions are characterized by high dropout rates, which limit their potential impact. This study had 4 aims: (1) examining if the use of a Web-based computer-tailored obesity prevention intervention can be increased by using videos as the delivery format, (2) examining if the delivery of intervention content via participants' preferred delivery format can increase intervention use, (3) examining if intervention effects are moderated by intervention use and matching or mismatching intervention delivery format preference, (4) and identifying which sociodemographic factors and intervention appreciation variables predict intervention use. Data were used from a randomized controlled study into the efficacy of a video and text version of a Web-based computer-tailored obesity prevention intervention consisting of a baseline measurement and a 6-month follow-up measurement. The intervention consisted of 6 weekly sessions and could be used for 3 months. ANCOVAs were conducted to assess differences in use between the video and text version and between participants allocated to a matching and mismatching intervention delivery format. Potential moderation by intervention use and matching/mismatching delivery format on self-reported body mass index (BMI), physical activity, and energy intake was examined using regression analyses with interaction terms. Finally, regression analysis was performed to assess determinants of intervention use. In total, 1419 participants completed the baseline questionnaire (follow-up response=71.53%, 1015/1419). Intervention use declined rapidly over time; the first 2 intervention sessions were completed by approximately half of the participants and only 10.9% (104/956) of the study population completed all 6 sessions of the intervention. There were no significant differences in use between the video and text version. Intervention use was significantly higher among participants who were allocated to an

  9. UMineAR: Mobile-Tablet-Based Abandoned Mine Hazard Site Investigation Support System Using Augmented Reality

    Directory of Open Access Journals (Sweden)

    Jangwon Suh

    2017-10-01

    Full Text Available Conventional mine site investigation has difficulties in fostering location awareness and understanding the subsurface environment; moreover, it produces a large amount of hardcopy data. To overcome these limitations, the UMineAR mobile tablet application was developed. It enables users to rapidly identify underground mine objects (drifts, entrances, boreholes, hazards and intuitively visualize them in 3D using a mobile augmented reality (AR technique. To design UMineAR, South Korean georeferenced standard-mine geographic information system (GIS databases were employed. A web database system was designed to access via a tablet groundwater-level data measured every hour by sensors installed in boreholes. UMineAR consists of search, AR, map, and database modules. The search module provides data retrieval and visualization options/functions. The AR module provides 3D interactive visualization of mine GIS data and camera imagery on the tablet screen. The map module shows the locations of corresponding borehole data on a 2D map. The database module provides mine GIS database management functions. A case study showed that the proposed application is suitable for onsite visualization of high-volume mine GIS data based on geolocations; no specialized equipment or skills are required to understand the underground mine environment. UMineAR can be used to support abandoned-mine hazard site investigations.

  10. EVALUATION OF WEB SEARCHING METHOD USING A NOVEL WPRR ALGORITHM FOR TWO DIFFERENT CASE STUDIES

    Directory of Open Access Journals (Sweden)

    V. Lakshmi Praba

    2012-04-01

    Full Text Available The World-Wide Web provides every internet citizen with access to an abundance of information, but it becomes increasingly difficult to identify the relevant pieces of information. Research in web mining tries to address this problem by applying techniques from data mining and machine learning to web data and documents. Web content mining and web structure mining have important roles in identifying the relevant web page. Relevancy of web page denotes how well a retrieved web page or set of web pages meets the information need of the user. Page Rank, Weighted Page Rank and Hypertext Induced Topic Selection (HITS are existing algorithms which considers only web structure mining. Vector Space Model (VSM, Cover Density Ranking (CDR, Okapi similarity measurement (Okapi and Three-Level Scoring method (TLS are some of existing relevancy score methods which consider only web content mining. In this paper, we propose a new algorithm, Weighted Page with Relevant Rank (WPRR which is blend of both web content mining and web structure mining that demonstrates the relevancy of the page with respect to given query for two different case scenarios. It is shown that WPRR’s performance is better than the existing algorithms.

  11. Text mining approach to predict hospital admissions using early medical records from the emergency department.

    Science.gov (United States)

    Lucini, Filipe R; S Fogliatto, Flavio; C da Silveira, Giovani J; L Neyeloff, Jeruza; Anzanello, Michel J; de S Kuchenbecker, Ricardo; D Schaan, Beatriz

    2017-04-01

    Emergency department (ED) overcrowding is a serious issue for hospitals. Early information on short-term inward bed demand from patients receiving care at the ED may reduce the overcrowding problem, and optimize the use of hospital resources. In this study, we use text mining methods to process data from early ED patient records using the SOAP framework, and predict future hospitalizations and discharges. We try different approaches for pre-processing of text records and to predict hospitalization. Sets-of-words are obtained via binary representation, term frequency, and term frequency-inverse document frequency. Unigrams, bigrams and trigrams are tested for feature formation. Feature selection is based on χ 2 and F-score metrics. In the prediction module, eight text mining methods are tested: Decision Tree, Random Forest, Extremely Randomized Tree, AdaBoost, Logistic Regression, Multinomial Naïve Bayes, Support Vector Machine (Kernel linear) and Nu-Support Vector Machine (Kernel linear). Prediction performance is evaluated by F1-scores. Precision and Recall values are also informed for all text mining methods tested. Nu-Support Vector Machine was the text mining method with the best overall performance. Its average F1-score in predicting hospitalization was 77.70%, with a standard deviation (SD) of 0.66%. The method could be used to manage daily routines in EDs such as capacity planning and resource allocation. Text mining could provide valuable information and facilitate decision-making by inward bed management teams. Copyright © 2017 Elsevier Ireland Ltd. All rights reserved.

  12. Feature Reduction Based on Genetic Algorithm and Hybrid Model for Opinion Mining

    Directory of Open Access Journals (Sweden)

    P. Kalaivani

    2015-01-01

    Full Text Available With the rapid growth of websites and web form the number of product reviews is available on the sites. An opinion mining system is needed to help the people to evaluate emotions, opinions, attitude, and behavior of others, which is used to make decisions based on the user preference. In this paper, we proposed an optimized feature reduction that incorporates an ensemble method of machine learning approaches that uses information gain and genetic algorithm as feature reduction techniques. We conducted comparative study experiments on multidomain review dataset and movie review dataset in opinion mining. The effectiveness of single classifiers Naïve Bayes, logistic regression, support vector machine, and ensemble technique for opinion mining are compared on five datasets. The proposed hybrid method is evaluated and experimental results using information gain and genetic algorithm with ensemble technique perform better in terms of various measures for multidomain review and movie reviews. Classification algorithms are evaluated using McNemar’s test to compare the level of significance of the classifiers.

  13. Sustainable Supply Chain Based on News Articles and Sustainability Reports: Text Mining with Leximancer and DICTION

    Directory of Open Access Journals (Sweden)

    Dongwook Kim

    2017-06-01

    Full Text Available The purpose of this research is to explore sustainable supply chain management (SSCM trends, and firms’ strategic positioning and execution with regard to sustainability in the textile and apparel industry based on news articles and sustainability reports. Further analysis of the rhetoric in Chief executive officer (CEO letters within sustainability reports is used to determine firms’ resoluteness, positive entailments, sharing of values, perception of reality, and sustainability strategy and execution feasibility. Computer-based content analysis is used for this research: Leximancer is applied for text analysis, while dictionary-based text mining program DICTION and SPSS are used for rhetorical analysis. Overall, contents similar to the literature on environmental, social, and economic aspects of the triple bottom line (TBL are observed, however, topics such as regulation, green incentives, and international standards are not readily observed. Furthmore, ethical issues, sustainable production, quality, and customer roles are emphasized in texts analyzed. The CEO letter analysis indicates that listed firms show relatively low realism and high commonality, while North American firms exhibit relatively high commonality, and Europe firms show relatively high realism. The results will serve as a baseline for providing academia guidelines in SSCM research, and provide an opportunity for businesses to complement their sustainability strategies and executions.

  14. A comparison study on algorithms of detecting long forms for short forms in biomedical text

    Directory of Open Access Journals (Sweden)

    Wu Cathy H

    2007-11-01

    Full Text Available Abstract Motivation With more and more research dedicated to literature mining in the biomedical domain, more and more systems are available for people to choose from when building literature mining applications. In this study, we focus on one specific kind of literature mining task, i.e., detecting definitions of acronyms, abbreviations, and symbols in biomedical text. We denote acronyms, abbreviations, and symbols as short forms (SFs and their corresponding definitions as long forms (LFs. The study was designed to answer the following questions; i how well a system performs in detecting LFs from novel text, ii what the coverage is for various terminological knowledge bases in including SFs as synonyms of their LFs, and iii how to combine results from various SF knowledge bases. Method We evaluated the following three publicly available detection systems in detecting LFs for SFs: i a handcrafted pattern/rule based system by Ao and Takagi, ALICE, ii a machine learning system by Chang et al., and iii a simple alignment-based program by Schwartz and Hearst. In addition, we investigated the conceptual coverage of two terminological knowledge bases: i the UMLS (the Unified Medical Language System, and ii the BioThesaurus (a thesaurus of names for all UniProt protein records. We also implemented a web interface that provides a virtual integration of various SF knowledge bases. Results We found that detection systems agree with each other on most cases, and the existing terminological knowledge bases have a good coverage of synonymous relationship for frequently defined LFs. The web interface allows people to detect SF definitions from text and to search several SF knowledge bases. Availability The web site is http://gauss.dbb.georgetown.edu/liblab/SFThesaurus.

  15. ParaBTM: A Parallel Processing Framework for Biomedical Text Mining on Supercomputers.

    Science.gov (United States)

    Xing, Yuting; Wu, Chengkun; Yang, Xi; Wang, Wei; Zhu, En; Yin, Jianping

    2018-04-27

    A prevailing way of extracting valuable information from biomedical literature is to apply text mining methods on unstructured texts. However, the massive amount of literature that needs to be analyzed poses a big data challenge to the processing efficiency of text mining. In this paper, we address this challenge by introducing parallel processing on a supercomputer. We developed paraBTM, a runnable framework that enables parallel text mining on the Tianhe-2 supercomputer. It employs a low-cost yet effective load balancing strategy to maximize the efficiency of parallel processing. We evaluated the performance of paraBTM on several datasets, utilizing three types of named entity recognition tasks as demonstration. Results show that, in most cases, the processing efficiency can be greatly improved with parallel processing, and the proposed load balancing strategy is simple and effective. In addition, our framework can be readily applied to other tasks of biomedical text mining besides NER.

  16. Semantics-based Automated Web Testing

    Directory of Open Access Journals (Sweden)

    Hai-Feng Guo

    2015-08-01

    Full Text Available We present TAO, a software testing tool performing automated test and oracle generation based on a semantic approach. TAO entangles grammar-based test generation with automated semantics evaluation using a denotational semantics framework. We show how TAO can be incorporated with the Selenium automation tool for automated web testing, and how TAO can be further extended to support automated delta debugging, where a failing web test script can be systematically reduced based on grammar-directed strategies. A real-life parking website is adopted throughout the paper to demonstrate the effectivity of our semantics-based web testing approach.

  17. Empirical advances with text mining of electronic health records.

    Science.gov (United States)

    Delespierre, T; Denormandie, P; Bar-Hen, A; Josseran, L

    2017-08-22

    Korian is a private group specializing in medical accommodations for elderly and dependent people. A professional data warehouse (DWH) established in 2010 hosts all of the residents' data. Inside this information system (IS), clinical narratives (CNs) were used only by medical staff as a residents' care linking tool. The objective of this study was to show that, through qualitative and quantitative textual analysis of a relatively small physiotherapy and well-defined CN sample, it was possible to build a physiotherapy corpus and, through this process, generate a new body of knowledge by adding relevant information to describe the residents' care and lives. Meaningful words were extracted through Standard Query Language (SQL) with the LIKE function and wildcards to perform pattern matching, followed by text mining and a word cloud using R® packages. Another step involved principal components and multiple correspondence analyses, plus clustering on the same residents' sample as well as on other health data using a health model measuring the residents' care level needs. By combining these techniques, physiotherapy treatments could be characterized by a list of constructed keywords, and the residents' health characteristics were built. Feeding defects or health outlier groups could be detected, physiotherapy residents' data and their health data were matched, and differences in health situations showed qualitative and quantitative differences in physiotherapy narratives. This textual experiment using a textual process in two stages showed that text mining and data mining techniques provide convenient tools to improve residents' health and quality of care by adding new, simple, useable data to the electronic health record (EHR). When used with a normalized physiotherapy problem list, text mining through information extraction (IE), named entity recognition (NER) and data mining (DM) can provide a real advantage to describe health care, adding new medical material and

  18. Assimilating Text-Mining & Bio-Informatics Tools to Analyze Cellulase structures

    Science.gov (United States)

    Satyasree, K. P. N. V., Dr; Lalitha Kumari, B., Dr; Jyotsna Devi, K. S. N. V.; Choudri, S. M. Roy; Pratap Joshi, K.

    2017-08-01

    Text-mining is one of the best potential way of automatically extracting information from the huge biological literature. To exploit its prospective, the knowledge encrypted in the text should be converted to some semantic representation such as entities and relations, which could be analyzed by machines. But large-scale practical systems for this purpose are rare. But text mining could be helpful for generating or validating predictions. Cellulases have abundant applications in various industries. Cellulose degrading enzymes are cellulases and the same producing bacteria - Bacillus subtilis & fungus Pseudomonas putida were isolated from top soil of Guntur Dt. A.P. India. Absolute cultures were conserved on potato dextrose agar medium for molecular studies. In this paper, we presented how well the text mining concepts can be used to analyze cellulase producing bacteria and fungi, their comparative structures are also studied with the aid of well-establised, high quality standard bioinformatic tools such as Bioedit, Swissport, Protparam, EMBOSSwin with which a complete data on Cellulases like structure, constituents of the enzyme has been obtained.

  19. A method for extracting design rationale knowledge based on Text Mining

    Directory of Open Access Journals (Sweden)

    Liu Jihong

    2017-01-01

    Full Text Available Capture design rationale (DR knowledge and presenting it to designers by good form, which have great significance for design reuse and design innovation. Since the 1970s design rationality began to develop, many teams have developed their own design rational system. However, the DR acquisition system is not intelligent enough, and it still requires designers to do a lot of operations. In addition, the existing design documents contain a large number of DR knowledge, but it has not been well excavated. Therefore, a method and system are needed to better extract DR knowledge in design documents. We have proposed a DRKH (design rationale knowledge hierarchy model for DR representation. The DRKH model has three layers, respectively as design intent layer, design decision layer and design basis layer. In this paper, we use text mining method to extract DR from design documents and construct DR model. Finally, the welding robot design specification is taken as an example to demonstrate the system interface.

  20. Data Mining Based on Cloud-Computing Technology

    Directory of Open Access Journals (Sweden)

    Ren Ying

    2016-01-01

    Full Text Available There are performance bottlenecks and scalability problems when traditional data-mining system is used in cloud computing. In this paper, we present a data-mining platform based on cloud computing. Compared with a traditional data mining system, this platform is highly scalable, has massive data processing capacities, is service-oriented, and has low hardware cost. This platform can support the design and applications of a wide range of distributed data-mining systems.

  1. BioCreative Workshops for DOE Genome Sciences: Text Mining for Metagenomics

    Energy Technology Data Exchange (ETDEWEB)

    Wu, Cathy H. [Univ. of Delaware, Newark, DE (United States). Center for Bioinformatics and Computational Biology; Hirschman, Lynette [The MITRE Corporation, Bedford, MA (United States)

    2016-10-29

    The objective of this project was to host BioCreative workshops to define and develop text mining tasks to meet the needs of the Genome Sciences community, focusing on metadata information extraction in metagenomics. Following the successful introduction of metagenomics at the BioCreative IV workshop, members of the metagenomics community and BioCreative communities continued discussion to identify candidate topics for a BioCreative metagenomics track for BioCreative V. Of particular interest was the capture of environmental and isolation source information from text. The outcome was to form a “community of interest” around work on the interactive EXTRACT system, which supported interactive tagging of environmental and species data. This experiment is included in the BioCreative V virtual issue of Database. In addition, there was broad participation by members of the metagenomics community in the panels held at BioCreative V, leading to valuable exchanges between the text mining developers and members of the metagenomics research community. These exchanges are reflected in a number of the overview and perspective pieces also being captured in the BioCreative V virtual issue. Overall, this conversation has exposed the metagenomics researchers to the possibilities of text mining, and educated the text mining developers to the specific needs of the metagenomics community.

  2. Experiences with Text Mining Large Collections of Unstructured Systems Development Artifacts at JPL

    Science.gov (United States)

    Port, Dan; Nikora, Allen; Hihn, Jairus; Huang, LiGuo

    2011-01-01

    Often repositories of systems engineering artifacts at NASA's Jet Propulsion Laboratory (JPL) are so large and poorly structured that they have outgrown our capability to effectively manually process their contents to extract useful information. Sophisticated text mining methods and tools seem a quick, low-effort approach to automating our limited manual efforts. Our experiences of exploring such methods mainly in three areas including historical risk analysis, defect identification based on requirements analysis, and over-time analysis of system anomalies at JPL, have shown that obtaining useful results requires substantial unanticipated efforts - from preprocessing the data to transforming the output for practical applications. We have not observed any quick 'wins' or realized benefit from short-term effort avoidance through automation in this area. Surprisingly we have realized a number of unexpected long-term benefits from the process of applying text mining to our repositories. This paper elaborates some of these benefits and our important lessons learned from the process of preparing and applying text mining to large unstructured system artifacts at JPL aiming to benefit future TM applications in similar problem domains and also in hope for being extended to broader areas of applications.

  3. The Application of Machine Learning Algorithms for Text Mining based on Sentiment Analysis Approach

    Directory of Open Access Journals (Sweden)

    Reza Samizade

    2018-06-01

    Full Text Available Classification of the cyber texts and comments into two categories of positive and negative sentiment among social media users is of high importance in the research are related to text mining. In this research, we applied supervised classification methods to classify Persian texts based on sentiment in cyber space. The result of this research is in a form of a system that can decide whether a comment which is published in cyber space such as social networks is considered positive or negative. The comments that are published in Persian movie and movie review websites from 1392 to 1395 are considered as the data set for this research. A part of these data are considered as training and others are considered as testing data. Prior to implementing the algorithms, pre-processing activities such as tokenizing, removing stop words, and n-germs process were applied on the texts. Naïve Bayes, Neural Networks and support vector machine were used for text classification in this study. Out of sample tests showed that there is no evidence indicating that the accuracy of SVM approach is statistically higher than Naïve Bayes or that the accuracy of Naïve Bayes is not statistically higher than NN approach. However, the researchers can conclude that the accuracy of the classification using SVM approach is statistically higher than the accuracy of NN approach in 5% confidence level.

  4. Systematic Review of Data Mining Applications in Patient-Centered Mobile-Based Information Systems.

    Science.gov (United States)

    Fallah, Mina; Niakan Kalhori, Sharareh R

    2017-10-01

    Smartphones represent a promising technology for patient-centered healthcare. It is claimed that data mining techniques have improved mobile apps to address patients' needs at subgroup and individual levels. This study reviewed the current literature regarding data mining applications in patient-centered mobile-based information systems. We systematically searched PubMed, Scopus, and Web of Science for original studies reported from 2014 to 2016. After screening 226 records at the title/abstract level, the full texts of 92 relevant papers were retrieved and checked against inclusion criteria. Finally, 30 papers were included in this study and reviewed. Data mining techniques have been reported in development of mobile health apps for three main purposes: data analysis for follow-up and monitoring, early diagnosis and detection for screening purpose, classification/prediction of outcomes, and risk calculation (n = 27); data collection (n = 3); and provision of recommendations (n = 2). The most accurate and frequently applied data mining method was support vector machine; however, decision tree has shown superior performance to enhance mobile apps applied for patients' self-management. Embedded data-mining-based feature in mobile apps, such as case detection, prediction/classification, risk estimation, or collection of patient data, particularly during self-management, would save, apply, and analyze patient data during and after care. More intelligent methods, such as artificial neural networks, fuzzy logic, and genetic algorithms, and even the hybrid methods may result in more patients-centered recommendations, providing education, guidance, alerts, and awareness of personalized output.

  5. Text mining by Tsallis entropy

    Science.gov (United States)

    Jamaati, Maryam; Mehri, Ali

    2018-01-01

    Long-range correlations between the elements of natural languages enable them to convey very complex information. Complex structure of human language, as a manifestation of natural languages, motivates us to apply nonextensive statistical mechanics in text mining. Tsallis entropy appropriately ranks the terms' relevance to document subject, taking advantage of their spatial correlation length. We apply this statistical concept as a new powerful word ranking metric in order to extract keywords of a single document. We carry out an experimental evaluation, which shows capability of the presented method in keyword extraction. We find that, Tsallis entropy has reliable word ranking performance, at the same level of the best previous ranking methods.

  6. Text Mining to Support Gene Ontology Curation and Vice Versa.

    Science.gov (United States)

    Ruch, Patrick

    2017-01-01

    In this chapter, we explain how text mining can support the curation of molecular biology databases dealing with protein functions. We also show how curated data can play a disruptive role in the developments of text mining methods. We review a decade of efforts to improve the automatic assignment of Gene Ontology (GO) descriptors, the reference ontology for the characterization of genes and gene products. To illustrate the high potential of this approach, we compare the performances of an automatic text categorizer and show a large improvement of +225 % in both precision and recall on benchmarked data. We argue that automatic text categorization functions can ultimately be embedded into a Question-Answering (QA) system to answer questions related to protein functions. Because GO descriptors can be relatively long and specific, traditional QA systems cannot answer such questions. A new type of QA system, so-called Deep QA which uses machine learning methods trained with curated contents, is thus emerging. Finally, future advances of text mining instruments are directly dependent on the availability of high-quality annotated contents at every curation step. Databases workflows must start recording explicitly all the data they curate and ideally also some of the data they do not curate.

  7. Ion Channel ElectroPhysiology Ontology (ICEPO) - a case study of text mining assisted ontology development.

    Science.gov (United States)

    Elayavilli, Ravikumar Komandur; Liu, Hongfang

    2016-01-01

    Computational modeling of biological cascades is of great interest to quantitative biologists. Biomedical text has been a rich source for quantitative information. Gathering quantitative parameters and values from biomedical text is one significant challenge in the early steps of computational modeling as it involves huge manual effort. While automatically extracting such quantitative information from bio-medical text may offer some relief, lack of ontological representation for a subdomain serves as impedance in normalizing textual extractions to a standard representation. This may render textual extractions less meaningful to the domain experts. In this work, we propose a rule-based approach to automatically extract relations involving quantitative data from biomedical text describing ion channel electrophysiology. We further translated the quantitative assertions extracted through text mining to a formal representation that may help in constructing ontology for ion channel events using a rule based approach. We have developed Ion Channel ElectroPhysiology Ontology (ICEPO) by integrating the information represented in closely related ontologies such as, Cell Physiology Ontology (CPO), and Cardiac Electro Physiology Ontology (CPEO) and the knowledge provided by domain experts. The rule-based system achieved an overall F-measure of 68.93% in extracting the quantitative data assertions system on an independently annotated blind data set. We further made an initial attempt in formalizing the quantitative data assertions extracted from the biomedical text into a formal representation that offers potential to facilitate the integration of text mining into ontological workflow, a novel aspect of this study. This work is a case study where we created a platform that provides formal interaction between ontology development and text mining. We have achieved partial success in extracting quantitative assertions from the biomedical text and formalizing them in ontological

  8. Advances in Text Mining and Visualization for Precision Medicine.

    Science.gov (United States)

    Gonzalez-Hernandez, Graciela; Sarker, Abeed; O'Connor, Karen; Greene, Casey; Liu, Hongfang

    2018-01-01

    According to the National Institutes of Health (NIH), precision medicine is "an emerging approach for disease treatment and prevention that takes into account individual variability in genes, environment, and lifestyle for each person." Although the text mining community has explored this realm for some years, the official endorsement and funding launched in 2015 with the Precision Medicine Initiative are beginning to bear fruit. This session sought to elicit participation of researchers with strong background in text mining and/or visualization who are actively collaborating with bench scientists and clinicians for the deployment of integrative approaches in precision medicine that could impact scientific discovery and advance the vision of precision medicine as a universal, accessible approach at the point of care.

  9. Using Text Mining to Characterize Online Discussion Facilitation

    Science.gov (United States)

    Ming, Norma; Baumer, Eric

    2011-01-01

    Facilitating class discussions effectively is a critical yet challenging component of instruction, particularly in online environments where student and faculty interaction is limited. Our goals in this research were to identify facilitation strategies that encourage productive discussion, and to explore text mining techniques that can help…

  10. J-TEXT WebScope: An efficient data access and visualization system for long pulse fusion experiment

    Energy Technology Data Exchange (ETDEWEB)

    Zheng, Wei, E-mail: zhenghaku@gmail.com [State Key Laboratory of Advanced Electromagnetic Engineering and Technology in Huazhong University of Science and Technology, Wuhan 430074 (China); School of Electrical and Electronic Engineering in Huazhong University of Science and Technology, Wuhan 430074 (China); Wan, Kuanhong; Chen, Zhi; Hu, Feiran; Liu, Qiang [State Key Laboratory of Advanced Electromagnetic Engineering and Technology in Huazhong University of Science and Technology, Wuhan 430074 (China); School of Electrical and Electronic Engineering in Huazhong University of Science and Technology, Wuhan 430074 (China)

    2016-11-15

    Highlights: • No matter how large the data is, the response time is always less than 500 milliseconds. • It is intelligent and just gives you the data you want. • It can be accessed directly over the Internet without installing special client software if you already have a browser. • Adopt scale and segment technology to organize data. • To support a new database for the WebScope is quite easy. • With the configuration stored in user’s profile, you have your own portable WebScope. - Abstract: Fusion research is an international collaboration work. To enable researchers across the world to visualize and analyze the experiment data, a web based data access and visualization tool is quite important [1]. Now, a new WebScope based on RIA (Rich Internet Application) is designed and implemented to meet these requirements. On the browser side, a fluent and intuitive interface is provided for researchers at J-TEXT laboratory and collaborators from all over the world to view experiment data and related metadata. The fusion experiments will feature long pulse and high sampling rate in the future. The data access and visualization system in this work has adopted segment and scale concept. Large data samples are re-sampled in different scales and then split into segments for instant response. It allows users to view extremely large data on the web browser efficiently, without worrying about the limitation on the size of the data. The HTML5 and JavaScript based web front-end can provide intuitive and fluent user experience. On the server side, a RESTful (Representational State Transfer) web API, which is based on ASP.NET MVC (Model View Controller), allows users to access the data and its metadata through HTTP (HyperText Transfer Protocol). An interface to the database has been designed to decouple the data access and visualization system from the data storage. It can be applied upon any data storage system like MDSplus or JTEXTDB, and this system is very easy to

  11. J-TEXT WebScope: An efficient data access and visualization system for long pulse fusion experiment

    International Nuclear Information System (INIS)

    Zheng, Wei; Wan, Kuanhong; Chen, Zhi; Hu, Feiran; Liu, Qiang

    2016-01-01

    Highlights: • No matter how large the data is, the response time is always less than 500 milliseconds. • It is intelligent and just gives you the data you want. • It can be accessed directly over the Internet without installing special client software if you already have a browser. • Adopt scale and segment technology to organize data. • To support a new database for the WebScope is quite easy. • With the configuration stored in user’s profile, you have your own portable WebScope. - Abstract: Fusion research is an international collaboration work. To enable researchers across the world to visualize and analyze the experiment data, a web based data access and visualization tool is quite important [1]. Now, a new WebScope based on RIA (Rich Internet Application) is designed and implemented to meet these requirements. On the browser side, a fluent and intuitive interface is provided for researchers at J-TEXT laboratory and collaborators from all over the world to view experiment data and related metadata. The fusion experiments will feature long pulse and high sampling rate in the future. The data access and visualization system in this work has adopted segment and scale concept. Large data samples are re-sampled in different scales and then split into segments for instant response. It allows users to view extremely large data on the web browser efficiently, without worrying about the limitation on the size of the data. The HTML5 and JavaScript based web front-end can provide intuitive and fluent user experience. On the server side, a RESTful (Representational State Transfer) web API, which is based on ASP.NET MVC (Model View Controller), allows users to access the data and its metadata through HTTP (HyperText Transfer Protocol). An interface to the database has been designed to decouple the data access and visualization system from the data storage. It can be applied upon any data storage system like MDSplus or JTEXTDB, and this system is very easy to

  12. An overview of the BioCreative 2012 Workshop Track III: interactive text mining task.

    Science.gov (United States)

    Arighi, Cecilia N; Carterette, Ben; Cohen, K Bretonnel; Krallinger, Martin; Wilbur, W John; Fey, Petra; Dodson, Robert; Cooper, Laurel; Van Slyke, Ceri E; Dahdul, Wasila; Mabee, Paula; Li, Donghui; Harris, Bethany; Gillespie, Marc; Jimenez, Silvia; Roberts, Phoebe; Matthews, Lisa; Becker, Kevin; Drabkin, Harold; Bello, Susan; Licata, Luana; Chatr-aryamontri, Andrew; Schaeffer, Mary L; Park, Julie; Haendel, Melissa; Van Auken, Kimberly; Li, Yuling; Chan, Juancarlos; Muller, Hans-Michael; Cui, Hong; Balhoff, James P; Chi-Yang Wu, Johnny; Lu, Zhiyong; Wei, Chih-Hsuan; Tudor, Catalina O; Raja, Kalpana; Subramani, Suresh; Natarajan, Jeyakumar; Cejuela, Juan Miguel; Dubey, Pratibha; Wu, Cathy

    2013-01-01

    In many databases, biocuration primarily involves literature curation, which usually involves retrieving relevant articles, extracting information that will translate into annotations and identifying new incoming literature. As the volume of biological literature increases, the use of text mining to assist in biocuration becomes increasingly relevant. A number of groups have developed tools for text mining from a computer science/linguistics perspective, and there are many initiatives to curate some aspect of biology from the literature. Some biocuration efforts already make use of a text mining tool, but there have not been many broad-based systematic efforts to study which aspects of a text mining tool contribute to its usefulness for a curation task. Here, we report on an effort to bring together text mining tool developers and database biocurators to test the utility and usability of tools. Six text mining systems presenting diverse biocuration tasks participated in a formal evaluation, and appropriate biocurators were recruited for testing. The performance results from this evaluation indicate that some of the systems were able to improve efficiency of curation by speeding up the curation task significantly (∼1.7- to 2.5-fold) over manual curation. In addition, some of the systems were able to improve annotation accuracy when compared with the performance on the manually curated set. In terms of inter-annotator agreement, the factors that contributed to significant differences for some of the systems included the expertise of the biocurator on the given curation task, the inherent difficulty of the curation and attention to annotation guidelines. After the task, annotators were asked to complete a survey to help identify strengths and weaknesses of the various systems. The analysis of this survey highlights how important task completion is to the biocurators' overall experience of a system, regardless of the system's high score on design, learnability and

  13. Studies on medicinal herbs for cognitive enhancement based on the text mining of Dongeuibogam and preliminary evaluation of its effects.

    Science.gov (United States)

    Pak, Malk Eun; Kim, Yu Ri; Kim, Ha Neui; Ahn, Sung Min; Shin, Hwa Kyoung; Baek, Jin Ung; Choi, Byung Tae

    2016-02-17

    In literature on Korean medicine, Dongeuibogam (Treasured Mirror of Eastern Medicine), published in 1613, represents the overall results of the traditional medicines of North-East Asia based on prior medicinal literature of this region. We utilized this medicinal literature by text mining to establish a list of candidate herbs for cognitive enhancement in the elderly and then performed an evaluation of their effects. Text mining was performed for selection of candidate herbs. Cell viability was determined in HT22 hippocampal cells and immunohistochemistry and behavioral analysis was performed in a kainic acid (KA) mice model in order to observe alterations of hippocampal cells and cognition. Twenty four herbs for cognitive enhancement in the elderly were selected by text mining of Dongeuibogam. In HT22 cells, pretreatment with 3 candidate herbs resulted in significantly reduced glutamate-induced cell death. Panax ginseng was the most neuroprotective herb against glutamate-induced cell death. In the hippocampus of a KA mice model, pretreatment with 11 candidate herbs resulted in suppression of caspase-3 expression. Treatment with 7 candidate herbs resulted in significantly enhanced expression levels of phosphorylated cAMP response element binding protein. Number of proliferated cells indicated by BrdU labeling was increased by treatment with 10 candidate herbs. Schisandra chinensis was the most effective herb against cell death and proliferation of progenitor cells and Rehmannia glutinosa in neuroprotection in the hippocampus of a KA mice model. In a KA mice model, we confirmed improved spatial and short memory by treatment with the 3 most effective candidate herbs and these recovered functions were involved in a higher number of newly formed neurons from progenitor cells in the hippocampus. These established herbs and their combinations identified by text-mining technique and evaluation for effectiveness may have value in further experimental and clinical

  14. A Semantic Web-based System for Mining Genetic Mutations in Cancer Clinical Trials.

    Science.gov (United States)

    Priya, Sambhawa; Jiang, Guoqian; Dasari, Surendra; Zimmermann, Michael T; Wang, Chen; Heflin, Jeff; Chute, Christopher G

    2015-01-01

    Textual eligibility criteria in clinical trial protocols contain important information about potential clinically relevant pharmacogenomic events. Manual curation for harvesting this evidence is intractable as it is error prone and time consuming. In this paper, we develop and evaluate a Semantic Web-based system that captures and manages mutation evidences and related contextual information from cancer clinical trials. The system has 2 main components: an NLP-based annotator and a Semantic Web ontology-based annotation manager. We evaluated the performance of the annotator in terms of precision and recall. We demonstrated the usefulness of the system by conducting case studies in retrieving relevant clinical trials using a collection of mutations identified from TCGA Leukemia patients and Atlas of Genetics and Cytogenetics in Oncology and Haematology. In conclusion, our system using Semantic Web technologies provides an effective framework for extraction, annotation, standardization and management of genetic mutations in cancer clinical trials.

  15. Practical text mining and statistical analysis for non-structured text data applications

    CERN Document Server

    Miner, Gary; Hill, Thomas; Nisbet, Robert; Delen, Dursun

    2012-01-01

    The world contains an unimaginably vast amount of digital information which is getting ever vaster ever more rapidly. This makes it possible to do many things that previously could not be done: spot business trends, prevent diseases, combat crime and so on. Managed well, the textual data can be used to unlock new sources of economic value, provide fresh insights into science and hold governments to account. As the Internet expands and our natural capacity to process the unstructured text that it contains diminishes, the value of text mining for information retrieval and search will increase d

  16. Using Web Crawler Technology for Text Analysis of Geo-Events: A Case Study of the Huangyan Island Incident

    Science.gov (United States)

    Hu, H.; Ge, Y. J.

    2013-11-01

    With the social networking and network socialisation have brought more text information and social relationships into our daily lives, the question of whether big data can be fully used to study the phenomenon and discipline of natural sciences has prompted many specialists and scholars to innovate their research. Though politics were integrally involved in the hyperlinked word issues since 1990s, automatic assembly of different geospatial web and distributed geospatial information systems utilizing service chaining have explored and built recently, the information collection and data visualisation of geo-events have always faced the bottleneck of traditional manual analysis because of the sensibility, complexity, relativity, timeliness and unexpected characteristics of political events. Based on the framework of Heritrix and the analysis of web-based text, word frequency, sentiment tendency and dissemination path of the Huangyan Island incident is studied here by combining web crawler technology and the text analysis method. The results indicate that tag cloud, frequency map, attitudes pie, individual mention ratios and dissemination flow graph based on the data collection and processing not only highlight the subject and theme vocabularies of related topics but also certain issues and problems behind it. Being able to express the time-space relationship of text information and to disseminate the information regarding geo-events, the text analysis of network information based on focused web crawler technology can be a tool for understanding the formation and diffusion of web-based public opinions in political events.

  17. Discovering More Accurate Frequent Web Usage Patterns

    OpenAIRE

    Bayir, Murat Ali; Toroslu, Ismail Hakki; Cosar, Ahmet; Fidan, Guven

    2008-01-01

    Web usage mining is a type of web mining, which exploits data mining techniques to discover valuable information from navigation behavior of World Wide Web users. As in classical data mining, data preparation and pattern discovery are the main issues in web usage mining. The first phase of web usage mining is the data processing phase, which includes the session reconstruction operation from server logs. Session reconstruction success directly affects the quality of the frequent patterns disc...

  18. Event-based text mining for biology and functional genomics

    Science.gov (United States)

    Thompson, Paul; Nawaz, Raheel; McNaught, John; Kell, Douglas B.

    2015-01-01

    The assessment of genome function requires a mapping between genome-derived entities and biochemical reactions, and the biomedical literature represents a rich source of information about reactions between biological components. However, the increasingly rapid growth in the volume of literature provides both a challenge and an opportunity for researchers to isolate information about reactions of interest in a timely and efficient manner. In response, recent text mining research in the biology domain has been largely focused on the identification and extraction of ‘events’, i.e. categorised, structured representations of relationships between biochemical entities, from the literature. Functional genomics analyses necessarily encompass events as so defined. Automatic event extraction systems facilitate the development of sophisticated semantic search applications, allowing researchers to formulate structured queries over extracted events, so as to specify the exact types of reactions to be retrieved. This article provides an overview of recent research into event extraction. We cover annotated corpora on which systems are trained, systems that achieve state-of-the-art performance and details of the community shared tasks that have been instrumental in increasing the quality, coverage and scalability of recent systems. Finally, several concrete applications of event extraction are covered, together with emerging directions of research. PMID:24907365

  19. Research trends on Big Data in Marketing: A text mining and topic modeling based literature analysis

    Directory of Open Access Journals (Sweden)

    Alexandra Amado

    2018-01-01

    Full Text Available Given the research interest on Big Data in Marketing, we present a research literature analysis based on a text mining semi-automated approach with the goal of identifying the main trends in this domain. In particular, the analysis focuses on relevant terms and topics related with five dimensions: Big Data, Marketing, Geographic location of authors’ affiliation (countries and continents, Products, and Sectors. A total of 1560 articles published from 2010 to 2015 were scrutinized. The findings revealed that research is bipartite between technological and research domains, with Big Data publications not clearly aligning cutting edge techniques toward Marketing benefits. Also, few inter-continental co-authored publications were found. Moreover, findings show that research in Big Data applications to Marketing is still in an embryonic stage, thus making it essential to develop more direct efforts toward business for Big Data to thrive in the Marketing arena.

  20. Design of an Interface for Page Rank Calculation using Web Link Attributes Information

    Directory of Open Access Journals (Sweden)

    Jeyalatha SIVARAMAKRISHNAN

    2010-01-01

    Full Text Available This paper deals with the Web Structure Mining and the different Structure Mining Algorithms like Page Rank, HITS, Trust Rank and Sel-HITS. The functioning of these algorithms are discussed. An incremental algorithm for calculation of PageRank using an interface has been formulated. This algorithm makes use of Web Link Attributes Information as key parameters and has been implemented using Visibility and Position of a Link. The application of Web Structure Mining Algorithm in an Academic Search Application has been discussed. The present work can be a useful input to Web Users, Faculty, Students and Web Administrators in a University Environment.

  1. Biomedical hypothesis generation by text mining and gene prioritization.

    Science.gov (United States)

    Petric, Ingrid; Ligeti, Balazs; Gyorffy, Balazs; Pongor, Sandor

    2014-01-01

    Text mining methods can facilitate the generation of biomedical hypotheses by suggesting novel associations between diseases and genes. Previously, we developed a rare-term model called RaJoLink (Petric et al, J. Biomed. Inform. 42(2): 219-227, 2009) in which hypotheses are formulated on the basis of terms rarely associated with a target domain. Since many current medical hypotheses are formulated in terms of molecular entities and molecular mechanisms, here we extend the methodology to proteins and genes, using a standardized vocabulary as well as a gene/protein network model. The proposed enhanced RaJoLink rare-term model combines text mining and gene prioritization approaches. Its utility is illustrated by finding known as well as potential gene-disease associations in ovarian cancer using MEDLINE abstracts and the STRING database.

  2. Indian accent text-to-speech system for web browsing

    Indian Academy of Sciences (India)

    This paper describes a 'web reader' which 'reads out' the textual contents of a selected web page in Hindi or in English with Indian accent. The content of the page is downloaded and parsed into suitable textual form. It is then passed on to an indigenously developed text-to-speech system for Hindi/Indian English, ...

  3. Cognition-Based Approaches for High-Precision Text Mining

    Science.gov (United States)

    Shannon, George John

    2017-01-01

    This research improves the precision of information extraction from free-form text via the use of cognitive-based approaches to natural language processing (NLP). Cognitive-based approaches are an important, and relatively new, area of research in NLP and search, as well as linguistics. Cognitive approaches enable significant improvements in both…

  4. Text Mining for Information Systems Researchers: An Annotated Topic Modeling Tutorial

    DEFF Research Database (Denmark)

    Debortoli, Stefan; Müller, Oliver; Junglas, Iris

    2016-01-01

    , such as manual coding. Yet, the size of text data setsobtained from the Internet makes manual analysis virtually impossible. In this tutorial, we discuss the challengesencountered when applying automated text-mining techniques in information systems research. In particular, weshowcase the use of probabilistic...... researchers,this tutorial provides some guidance for conducting text mining studies on their own and for evaluating the quality ofothers.......t is estimated that more than 80 percent of today’s data is stored in unstructured form (e.g., text, audio, image, video);and much of it is expressed in rich and ambiguous natural language. Traditionally, the analysis of natural languagehas prompted the use of qualitative data analysis approaches...

  5. Sentiment Analysis and Opinion Mining

    CERN Document Server

    Liu, Bing

    2012-01-01

    Sentiment analysis and opinion mining is the field of study that analyzes people's opinions, sentiments, evaluations, attitudes, and emotions from written language. It is one of the most active research areas in natural language processing and is also widely studied in data mining, Web mining, and text mining. In fact, this research has spread outside of computer science to the management sciences and social sciences due to its importance to business and society as a whole. The growing importance of sentiment analysis coincides with the growth of social media such as reviews, forum discussions

  6. Towards A Model Of Knowledge Extraction Of Text Mining For Palliative Care Patients In Panama.

    Directory of Open Access Journals (Sweden)

    Denis Cedeno Moreno

    2015-08-01

    Full Text Available Solutions using information technology is an innovative way to manage the information hospice patients in hospitals in Panama. The application of techniques of text mining for the domain of medicine especially information from electronic health records of patients in palliative care is one of the most recent and promising research areas for the analysis of textual data. Text mining is based on new knowledge extraction from unstructured natural language data. We may also create ontologies to describe the terminology and knowledge in a given domain. In an ontology conceptualization of a domain that may be general or specific formalized. Knowledge can be used for decision making by health specialists or can help in research topics for improving the health system.

  7. A COMPARATIVE ANALYSIS OF WEB INFORMATION EXTRACTION TECHNIQUES DEEP LEARNING vs. NAÏVE BAYES vs. BACK PROPAGATION NEURAL NETWORKS IN WEB DOCUMENT EXTRACTION

    OpenAIRE

    J. Sharmila; A. Subramani

    2016-01-01

    Web mining related exploration is getting the chance to be more essential these days in view of the reason that a lot of information is overseen through the web. Web utilization is expanding in an uncontrolled way. A particular framework is required for controlling such extensive measure of information in the web space. Web mining is ordered into three noteworthy divisions: Web content mining, web usage mining and web structure mining. Tak-Lam Wong has proposed a web content mining methodolog...

  8. Supporting the annotation of chronic obstructive pulmonary disease (COPD) phenotypes with text mining workflows.

    Science.gov (United States)

    Fu, Xiao; Batista-Navarro, Riza; Rak, Rafal; Ananiadou, Sophia

    2015-01-01

    Chronic obstructive pulmonary disease (COPD) is a life-threatening lung disorder whose recent prevalence has led to an increasing burden on public healthcare. Phenotypic information in electronic clinical records is essential in providing suitable personalised treatment to patients with COPD. However, as phenotypes are often "hidden" within free text in clinical records, clinicians could benefit from text mining systems that facilitate their prompt recognition. This paper reports on a semi-automatic methodology for producing a corpus that can ultimately support the development of text mining tools that, in turn, will expedite the process of identifying groups of COPD patients. A corpus of 30 full-text papers was formed based on selection criteria informed by the expertise of COPD specialists. We developed an annotation scheme that is aimed at producing fine-grained, expressive and computable COPD annotations without burdening our curators with a highly complicated task. This was implemented in the Argo platform by means of a semi-automatic annotation workflow that integrates several text mining tools, including a graphical user interface for marking up documents. When evaluated using gold standard (i.e., manually validated) annotations, the semi-automatic workflow was shown to obtain a micro-averaged F-score of 45.70% (with relaxed matching). Utilising the gold standard data to train new concept recognisers, we demonstrated that our corpus, although still a work in progress, can foster the development of significantly better performing COPD phenotype extractors. We describe in this work the means by which we aim to eventually support the process of COPD phenotype curation, i.e., by the application of various text mining tools integrated into an annotation workflow. Although the corpus being described is still under development, our results thus far are encouraging and show great potential in stimulating the development of further automatic COPD phenotype extractors.

  9. Models and methods for building web recommendation systems

    OpenAIRE

    Stekh, Yu.; Artsibasov, V.

    2012-01-01

    Modern Word Wide Web contains a large number of Web sites and pages in each Web site. Web recommendation system (recommendation system for web pages) are typically implemented on web servers and use the data obtained from the collection viewed web templates (implicit data) or user registration data (explicit data). In article considering methods and algorithms of web recommendation system based on the technology of data mining (web mining). Сучасна мережа Інтернет містить велику кількість веб...

  10. Deploying and sharing U-Compare workflows as web services.

    Science.gov (United States)

    Kontonatsios, Georgios; Korkontzelos, Ioannis; Kolluru, Balakrishna; Thompson, Paul; Ananiadou, Sophia

    2013-02-18

    U-Compare is a text mining platform that allows the construction, evaluation and comparison of text mining workflows. U-Compare contains a large library of components that are tuned to the biomedical domain. Users can rapidly develop biomedical text mining workflows by mixing and matching U-Compare's components. Workflows developed using U-Compare can be exported and sent to other users who, in turn, can import and re-use them. However, the resulting workflows are standalone applications, i.e., software tools that run and are accessible only via a local machine, and that can only be run with the U-Compare platform. We address the above issues by extending U-Compare to convert standalone workflows into web services automatically, via a two-click process. The resulting web services can be registered on a central server and made publicly available. Alternatively, users can make web services available on their own servers, after installing the web application framework, which is part of the extension to U-Compare. We have performed a user-oriented evaluation of the proposed extension, by asking users who have tested the enhanced functionality of U-Compare to complete questionnaires that assess its functionality, reliability, usability, efficiency and maintainability. The results obtained reveal that the new functionality is well received by users. The web services produced by U-Compare are built on top of open standards, i.e., REST and SOAP protocols, and therefore, they are decoupled from the underlying platform. Exported workflows can be integrated with any application that supports these open standards. We demonstrate how the newly extended U-Compare enhances the cross-platform interoperability of workflows, by seamlessly importing a number of text mining workflow web services exported from U-Compare into Taverna, i.e., a generic scientific workflow construction platform.

  11. Stratification-Based Outlier Detection over the Deep Web

    OpenAIRE

    Xian, Xuefeng; Zhao, Pengpeng; Sheng, Victor S.; Fang, Ligang; Gu, Caidong; Yang, Yuanfeng; Cui, Zhiming

    2016-01-01

    For many applications, finding rare instances or outliers can be more interesting than finding common patterns. Existing work in outlier detection never considers the context of deep web. In this paper, we argue that, for many scenarios, it is more meaningful to detect outliers over deep web. In the context of deep web, users must submit queries through a query interface to retrieve corresponding data. Therefore, traditional data mining methods cannot be directly applied. The primary contribu...

  12. SalanderMaps: A rapid overview about felt earthquakes through data mining of web-accesses

    Science.gov (United States)

    Kradolfer, Urs

    2013-04-01

    While seismological observatories detect and locate earthquakes based on measurements of the ground motion, they neither know a priori whether an earthquake has been felt by the public nor is it known, where it has been felt. Such information is usually gathered by evaluating feedback reported by the public through on-line forms on the web. However, after a felt earthquake in Switzerland, many people visit the webpages of the Swiss Seismological Service (SED) at the ETH Zurich and each such visit leaves traces in the logfiles on our web-servers. Data mining techniques, applied to these logfiles and mining publicly available data bases on the internet open possibilities to obtain previously unknown information about our virtual visitors. In order to provide precise information to authorities and the media, it would be desirable to rapidly know from which locations these web-accesses origin. The method 'Salander' (Seismic Activitiy Linked to Area codes - Nimble Detection of Earthquake Rumbles) will be introduced and it will be explained, how the IP-addresses (each computer or router directly connected to the internet has a unique IP-address; an example would be 129.132.53.5) of a sufficient amount of our virtual visitors were linked to their geographical area. This allows us to unprecedentedly quickly know whether and where an earthquake was felt in Switzerland. It will also be explained, why the method Salander is superior to commercial so-called geolocation products. The corresponding products of the Salander method, animated SalanderMaps, which are routinely generated after each earthquake with a magnitude of M>2 in Switzerland (http://www.seismo.ethz.ch/prod/salandermaps/, available after March 2013), demonstrate how the wavefield of earthquakes propagates through Switzerland and where it was felt. Often, such information is available within less than 60 seconds after origin time, and we always get a clear picture within already five minutes after origin time

  13. SQUAT: A web tool to mine human, murine and avian SAGE data

    Directory of Open Access Journals (Sweden)

    Besson Jérémy

    2008-09-01

    Full Text Available Abstract Background There is an increasing need in transcriptome research for gene expression data and pattern warehouses. It is of importance to integrate in these warehouses both raw transcriptomic data, as well as some properties encoded in these data, like local patterns. Description We have developed an application called SQUAT (SAGE Querying and Analysis Tools which is available at: http://bsmc.insa-lyon.fr/squat/. This database gives access to both raw SAGE data and patterns mined from these data, for three species (human, mouse and chicken. This database allows to make simple queries like "In which biological situations is my favorite gene expressed?" as well as much more complex queries like: ≪what are the genes that are frequently co-over-expressed with my gene of interest in given biological situations?≫. Connections with external web databases enrich biological interpretations, and enable sophisticated queries. To illustrate the power of SQUAT, we show and analyze the results of three different queries, one of which led to a biological hypothesis that was experimentally validated. Conclusion SQUAT is a user-friendly information retrieval platform, which aims at bringing some of the state-of-the-art mining tools to biologists.

  14. Paragraphs or Lists? The Effects of Text Structure on Web Sites

    NARCIS (Netherlands)

    Karreman, Joyce; Loorbach, N.R.

    2007-01-01

    This paper describes a study that we conducted to investigate the effects of the visual text structure on Web sites on the users' browsing behavior and on their appreciation for the Web site. It has been known for a long time that the visual structure of a text has considerable effects on reading

  15. A simplified approach to the PROMETHEE method for priority setting in management of mine action projects

    Directory of Open Access Journals (Sweden)

    Marko Mladineo

    2016-12-01

    Full Text Available In the last 20 years, priority setting in mine actions, i.e. in humanitarian demining, has become an increasingly important topic. Given that mine action projects require management and decision-making based on a multi -criteria approach, multi-criteria decision-making methods like PROMETHEE and AHP have been used worldwide for priority setting. However, from the aspect of mine action, where stakeholders in the decision-making process for priority setting are project managers, local politicians, leaders of different humanitarian organizations, or similar, applying these methods can be difficult. Therefore, a specialized web-based decision support system (Web DSS for priority setting, developed as part of the FP7 project TIRAMISU, has been extended using a module for developing custom priority setting scenarios in line with an exceptionally easy, user-friendly approach. The idea behind this research is to simplify the multi-criteria analysis based on the PROMETHEE method. Therefore, a simplified PROMETHEE method based on statistical analysis for automated suggestions of parameters such as preference function thresholds, interactive selection of criteria weights, and easy input of criteria evaluations is presented in this paper. The result is web-based DSS that can be applied worldwide for priority setting in mine action. Additionally, the management of mine action projects is supported using modules for providing spatial data based on the geographic information system (GIS. In this paper, the benefits and limitations of a simplified PROMETHEE method are presented using a case study involving mine action projects, and subsequently, certain proposals are given for the further research.

  16. Placing Music Artists and Songs in Time Using Editorial Metadata and Web Mining Techniques

    NARCIS (Netherlands)

    Bountouridis, D.; Veltkamp, R.C.; Balen, J.M.H. van

    2013-01-01

    This paper investigates the novel task of situating music artists and songs in time, thereby adding contextual information that typically correlates with an artist’s similarities, collaborations and influences. The proposed method makes use of editorial metadata in conjunction with web mining

  17. Coronary artery disease risk assessment from unstructured electronic health records using text mining.

    Science.gov (United States)

    Jonnagaddala, Jitendra; Liaw, Siaw-Teng; Ray, Pradeep; Kumar, Manish; Chang, Nai-Wen; Dai, Hong-Jie

    2015-12-01

    Coronary artery disease (CAD) often leads to myocardial infarction, which may be fatal. Risk factors can be used to predict CAD, which may subsequently lead to prevention or early intervention. Patient data such as co-morbidities, medication history, social history and family history are required to determine the risk factors for a disease. However, risk factor data are usually embedded in unstructured clinical narratives if the data is not collected specifically for risk assessment purposes. Clinical text mining can be used to extract data related to risk factors from unstructured clinical notes. This study presents methods to extract Framingham risk factors from unstructured electronic health records using clinical text mining and to calculate 10-year coronary artery disease risk scores in a cohort of diabetic patients. We developed a rule-based system to extract risk factors: age, gender, total cholesterol, HDL-C, blood pressure, diabetes history and smoking history. The results showed that the output from the text mining system was reliable, but there was a significant amount of missing data to calculate the Framingham risk score. A systematic approach for understanding missing data was followed by implementation of imputation strategies. An analysis of the 10-year Framingham risk scores for coronary artery disease in this cohort has shown that the majority of the diabetic patients are at moderate risk of CAD. Copyright © 2015 Elsevier Inc. All rights reserved.

  18. A COMPARATIVE ANALYSIS OF WEB INFORMATION EXTRACTION TECHNIQUES DEEP LEARNING vs. NAÏVE BAYES vs. BACK PROPAGATION NEURAL NETWORKS IN WEB DOCUMENT EXTRACTION

    Directory of Open Access Journals (Sweden)

    J. Sharmila

    2016-01-01

    Full Text Available Web mining related exploration is getting the chance to be more essential these days in view of the reason that a lot of information is overseen through the web. Web utilization is expanding in an uncontrolled way. A particular framework is required for controlling such extensive measure of information in the web space. Web mining is ordered into three noteworthy divisions: Web content mining, web usage mining and web structure mining. Tak-Lam Wong has proposed a web content mining methodology in the exploration with the aid of Bayesian Networks (BN. In their methodology, they were learning on separating the web data and characteristic revelation in view of the Bayesian approach. Roused from their investigation, we mean to propose a web content mining methodology, in view of a Deep Learning Algorithm. The Deep Learning Algorithm gives the interest over BN on the basis that BN is not considered in any learning architecture planning like to propose system. The main objective of this investigation is web document extraction utilizing different grouping algorithm and investigation. This work extricates the data from the web URL. This work shows three classification algorithms, Deep Learning Algorithm, Bayesian Algorithm and BPNN Algorithm. Deep Learning is a capable arrangement of strategies for learning in neural system which is connected like computer vision, speech recognition, and natural language processing and biometrics framework. Deep Learning is one of the simple classification technique and which is utilized for subset of extensive field furthermore Deep Learning has less time for classification. Naive Bayes classifiers are a group of basic probabilistic classifiers in view of applying Bayes hypothesis with concrete independence assumptions between the features. At that point the BPNN algorithm is utilized for classification. Initially training and testing dataset contains more URL. We extract the content presently from the dataset. The

  19. Uncoolness factor of collaborative Web Mining Tools (WMT

    Directory of Open Access Journals (Sweden)

    Juan Luis Chulilla

    2009-12-01

    Full Text Available The recent development of social mining is a useful and direct analogy to talking about the less visible part of the adoption of successive waves of social software. The striking fact of visibility decrease as each type of social software matures should be taken into account for any comprehensive analysis of the relation between collectives and Internet technologies. One of the main results of this relation is the social data mining of Internet, which both gives sense to virtual communities and produces contents via feedback. We are just at the beginning of the adoption of new ways of social data mining, which will be significant when grow mature and become invisible.

  20. Practice-based evidence: profiling the safety of cilostazol by text-mining of clinical notes.

    Directory of Open Access Journals (Sweden)

    Nicholas J Leeper

    Full Text Available Peripheral arterial disease (PAD is a growing problem with few available therapies. Cilostazol is the only FDA-approved medication with a class I indication for intermittent claudication, but carries a black box warning due to concerns for increased cardiovascular mortality. To assess the validity of this black box warning, we employed a novel text-analytics pipeline to quantify the adverse events associated with Cilostazol use in a clinical setting, including patients with congestive heart failure (CHF.We analyzed the electronic medical records of 1.8 million subjects from the Stanford clinical data warehouse spanning 18 years using a novel text-mining/statistical analytics pipeline. We identified 232 PAD patients taking Cilostazol and created a control group of 1,160 PAD patients not taking this drug using 1:5 propensity-score matching. Over a mean follow up of 4.2 years, we observed no association between Cilostazol use and any major adverse cardiovascular event including stroke (OR = 1.13, CI [0.82, 1.55], myocardial infarction (OR = 1.00, CI [0.71, 1.39], or death (OR = 0.86, CI [0.63, 1.18]. Cilostazol was not associated with an increase in any arrhythmic complication. We also identified a subset of CHF patients who were prescribed Cilostazol despite its black box warning, and found that it did not increase mortality in this high-risk group of patients.This proof of principle study shows the potential of text-analytics to mine clinical data warehouses to uncover 'natural experiments' such as the use of Cilostazol in CHF patients. We envision this method will have broad applications for examining difficult to test clinical hypotheses and to aid in post-marketing drug safety surveillance. Moreover, our observations argue for a prospective study to examine the validity of a drug safety warning that may be unnecessarily limiting the use of an efficacious therapy.

  1. Web multimedia information retrieval using improved Bayesian algorithm.

    Science.gov (United States)

    Yu, Yi-Jun; Chen, Chun; Yu, Yi-Min; Lin, Huai-Zhong

    2003-01-01

    The main thrust of this paper is application of a novel data mining approach on the log of user's feedback to improve web multimedia information retrieval performance. A user space model was constructed based on data mining, and then integrated into the original information space model to improve the accuracy of the new information space model. It can remove clutter and irrelevant text information and help to eliminate mismatch between the page author's expression and the user's understanding and expectation. User space model was also utilized to discover the relationship between high-level and low-level features for assigning weight. The authors proposed improved Bayesian algorithm for data mining. Experiment proved that the authors' proposed algorithm was efficient.

  2. Research on Classification of Chinese Text Data Based on SVM

    Science.gov (United States)

    Lin, Yuan; Yu, Hongzhi; Wan, Fucheng; Xu, Tao

    2017-09-01

    Data Mining has important application value in today’s industry and academia. Text classification is a very important technology in data mining. At present, there are many mature algorithms for text classification. KNN, NB, AB, SVM, decision tree and other classification methods all show good classification performance. Support Vector Machine’ (SVM) classification method is a good classifier in machine learning research. This paper will study the classification effect based on the SVM method in the Chinese text data, and use the support vector machine method in the chinese text to achieve the classify chinese text, and to able to combination of academia and practical application.

  3. Classification algorithm of Web document in ionization radiation

    International Nuclear Information System (INIS)

    Geng Zengmin; Liu Wanchun

    2005-01-01

    Resources in the Internet is numerous. It is one of research directions of Web mining (WM) how to mine the resource of some calling or trade more efficiently. The paper studies the classification of Web document in ionization radiation (IR) based on the algorithm of Bayes, Rocchio, Widrow-Hoff, and analyses the result of trial effect. (authors)

  4. From university research to innovation Detecting knowledge transfer via text mining

    DEFF Research Database (Denmark)

    Woltmann, Sabrina; Clemmensen, Line Katrine Harder; Alkærsig, Lars

    2016-01-01

    and indicators such as patents, collaborative publications and license agreements, to assess the contribution to the socioeconomic surrounding of universities. In this study, we present an extension of the current empirical framework by applying new computational methods, namely text mining and pattern...... associated the former with the latter to obtain insights into possible text and semantic relatedness. The text mining methods are extrapolating the correlations, semantic patterns and content comparison of the two corpora to define the document relatedness. We expect the development of a novel tool using...... recognition. Text samples for this purpose can include files containing social media contents, company websites and annual reports. The empirical focus in the present study is on the technical sciences and in particular on the case of the Technical University of Denmark (DTU). We generated two independent...

  5. The Rise and Fall of Text on the Web: A Quantitative Study of Web Archives

    Science.gov (United States)

    Cocciolo, Anthony

    2015-01-01

    Introduction: This study addresses the following research question: is the use of text on the World Wide Web declining? If so, when did it start declining, and by how much has it declined? Method: Web pages are downloaded from the Internet Archive for the years 1999, 2002, 2005, 2008, 2011 and 2014, producing 600 captures of 100 prominent and…

  6. Populating the Semantic Web by Macro-reading Internet Text

    Science.gov (United States)

    Mitchell, Tom M.; Betteridge, Justin; Carlson, Andrew; Hruschka, Estevam; Wang, Richard

    A key question regarding the future of the semantic web is "how will we acquire structured information to populate the semantic web on a vast scale?" One approach is to enter this information manually. A second approach is to take advantage of pre-existing databases, and to develop common ontologies, publishing standards, and reward systems to make this data widely accessible. We consider here a third approach: developing software that automatically extracts structured information from unstructured text present on the web. We also describe preliminary results demonstrating that machine learning algorithms can learn to extract tens of thousands of facts to populate a diverse ontology, with imperfect but reasonably good accuracy.

  7. MetaRanker 2.0: a web server for prioritization of genetic variation data

    DEFF Research Database (Denmark)

    Pers, Tune Hannes; Dworzynski, Piotr; Thomas, Cecilia Engel

    2013-01-01

    MetaRanker 2.0 is a web server for prioritization of common and rare frequency genetic variation data. Based on heterogeneous data sets including genetic association data, protein–protein interactions, large-scale text-mining data, copy number variation data and gene expression experiments, Meta...

  8. Methods for Mining and Summarizing Text Conversations

    CERN Document Server

    Carenini, Giuseppe; Murray, Gabriel

    2011-01-01

    Due to the Internet Revolution, human conversational data -- in written forms -- are accumulating at a phenomenal rate. At the same time, improvements in speech technology enable many spoken conversations to be transcribed. Individuals and organizations engage in email exchanges, face-to-face meetings, blogging, texting and other social media activities. The advances in natural language processing provide ample opportunities for these "informal documents" to be analyzed and mined, thus creating numerous new and valuable applications. This book presents a set of computational methods

  9. Effects of Surrounding Information and Line Length on Text Comprehension from the Web

    Directory of Open Access Journals (Sweden)

    Jess McMullin

    2002-02-01

    Full Text Available The World Wide Web (Web is becoming a popular medium for transmission of information and online learning. We need to understand how people comprehend information from the Web to design Web sites that maximize the acquisition of information. We examined two features of Web page design that are easily modified by developers, namely line length and the amount of surrounding information, or whitespace. Undergraduate university student participants read text and answered comprehension questions on the Web. Comprehension was affected by whitespace; participants had better comprehension for information surrounded by whitespace than for information surrounded by meaningless information. Participants were not affected by line length. These findings demonstrate that reading from the Web is not the same as reading print and have implications for instructional Web design.

  10. Development of a web-based, underground coalmine gas outburst information management system

    Energy Technology Data Exchange (ETDEWEB)

    Naj Aziz; Richard Caladine; Lucia Tome; Ken Cram; Devendra Vyas [University of Wollongong, NSW (Australia)

    2007-04-15

    The primary objective of this project was to develop an online coal mine outburst information management system to provide the coal mining industry with the necessary information and knowledge on outbursts via the World Wide Web. The Website has been constructed using the standard web format. Access to the site is by standard web browsers. The address of the site is http://www.uow.edu.au/eng/outburst. The website has 85 conference papers which were held in Australia, dating as far back as the 1980's, various seminar presentations, more than 250 references, a limited but important collection of international papers, direct links to ACARP and NERRDC publication lists, links to several leading organisations of particular interest in mine gas and outburst control. These links include both private and government organisations, and a forum for discussion.

  11. Mining Sequential Update Summarization with Hierarchical Text Analysis

    Directory of Open Access Journals (Sweden)

    Chunyun Zhang

    2016-01-01

    Full Text Available The outbreak of unexpected news events such as large human accident or natural disaster brings about a new information access problem where traditional approaches fail. Mostly, news of these events shows characteristics that are early sparse and later redundant. Hence, it is very important to get updates and provide individuals with timely and important information of these incidents during their development, especially when being applied in wireless and mobile Internet of Things (IoT. In this paper, we define the problem of sequential update summarization extraction and present a new hierarchical update mining system which can broadcast with useful, new, and timely sentence-length updates about a developing event. The new system proposes a novel method, which incorporates techniques from topic-level and sentence-level summarization. To evaluate the performance of the proposed system, we apply it to the task of sequential update summarization of temporal summarization (TS track at Text Retrieval Conference (TREC 2013 to compute four measurements of the update mining system: the expected gain, expected latency gain, comprehensiveness, and latency comprehensiveness. Experimental results show that our proposed method has good performance.

  12. Text Mining Untuk Analisis Sentimen Review Film Menggunakan Algoritma K-Means

    Directory of Open Access Journals (Sweden)

    Setyo Budi

    2017-02-01

    Full Text Available Kemudahan manusia didalam menggunakan website mengakibatkan bertambahnya dokumen teks yang berupa pendapat dan informasi. Dalam waktu yang lama dokumen teks akan bertambah besar. Text mining merupakan salah satu teknik yang digunakan untuk menggali kumpulan dokumen text sehingga dapat diambil intisarinya. Ada beberapa algoritma yang di gunakan untuk penggalian dokumen untuk analisis sentimen, salah satunya adalah K-Means. Didalam penelitian ini algoritma yang digunakan adalah K-Means. Hasil penelitian menunjukkan bahwa akurasi K-Means dengan dataset digunakan 300 positif dan 300 negatif  akurasinya 57.83%,  700 dokumen positif dan 700  negatif akurasinya 56.71%%, 1000 dokumen positif dan 1000  negatif akurasinya 50.40%%. Dari hasil pengujian disimpulkan bahwa semakin besar dataset yang digunakan semakin rendah akurasi K-Means.   Kata Kunci : Text Mining, Analisis Sentimen, K-Means, Review Film 

  13. Web-based Cooperative Learning in College Chemistry Teaching

    Directory of Open Access Journals (Sweden)

    Bin Jiang

    2014-03-01

    Full Text Available With the coming of information era, information process depend on internet and multi-media technology in education becomes the new approach of present teaching model reform. Web-based cooperative learning is becoming a popular learning approach with the rapid development of web technology. The paper aims to how to carry out the teaching strategy of web-based cooperative learning and applied in the foundation chemistry teaching.It was shown that with the support of modern web-based teaching environment, students' cooperative learning capacity and overall competence can be better improved and the problems of interaction in large foundation chemistry classes can be solved. Web-based cooperative learning can improve learning performance of students, what's more Web-based cooperative learning provides students with cooperative skills, communication skills, creativity, critical thinking skills and skills in information technology application.

  14. A web server for mining Comparative Genomic Hybridization (CGH) data

    Science.gov (United States)

    Liu, Jun; Ranka, Sanjay; Kahveci, Tamer

    2007-11-01

    Advances in cytogenetics and molecular biology has established that chromosomal alterations are critical in the pathogenesis of human cancer. Recurrent chromosomal alterations provide cytological and molecular markers for the diagnosis and prognosis of disease. They also facilitate the identification of genes that are important in carcinogenesis, which in the future may help in the development of targeted therapy. A large amount of publicly available cancer genetic data is now available and it is growing. There is a need for public domain tools that allow users to analyze their data and visualize the results. This chapter describes a web based software tool that will allow researchers to analyze and visualize Comparative Genomic Hybridization (CGH) datasets. It employs novel data mining methodologies for clustering and classification of CGH datasets as well as algorithms for identifying important markers (small set of genomic intervals with aberrations) that are potentially cancer signatures. The developed software will help in understanding the relationships between genomic aberrations and cancer types.

  15. Web-based Data Mining to Systematically Determine Data Quality From the EarthScope USArray Seismic Observatory Project

    Science.gov (United States)

    Newman, R. L.; Lindquist, K. G.; Hansen, T. S.; Vernon, F. L.; Eakins, J.; Foley, S.

    2004-12-01

    When fully operational, the Transportable Array (TA) and Flexible Array (FA) components of the continent-scale EarthScope USArray seismic observatory project will provide telemetered real-time data from up to 600 stations. By the fifth year of the deployment the predicted total amount of data production for the TA and FA will be approximately 1500 Gb/yr and approximately 1000 Gb/yr respectively. In addition to delivering the data to the IRIS Data Management Center (DMC) for permanent archiving, the Array Network Facility (ANF) is charged with real-time data quality control, calibration, metadata storage and retrieval, network monitoring and local archiving. The Antelope real-time processing software provides the back-bone to this effort, supported by the Storage Resource Broker data replication/archiving system and the Nagios network monitoring tool. Real-time, web-based data mining, with support for multiple database schemas, is provided by an Antelope interface to both Perl and PHP scripting languages. This allows embedding of database functions in HTML. A suite of online tools allows query and graphical display of dynamic real-time sensor network parameters such as data latency, network topologies, and data return rates. Data and metadata are also web-accessible, for example XML trees of seismic data and graphical display of instrument response functions. The purpose of these tools is to provide the ANF, IRIS and end-users of USArray data with a real-time systematic method of determining data quality for the spatio-temporal area of interest. The tools are accessible at http://anf.ucsd.edu

  16. Text mining for literature review and knowledge discovery in cancer risk assessment and research.

    Directory of Open Access Journals (Sweden)

    Anna Korhonen

    Full Text Available Research in biomedical text mining is starting to produce technology which can make information in biomedical literature more accessible for bio-scientists. One of the current challenges is to integrate and refine this technology to support real-life scientific tasks in biomedicine, and to evaluate its usefulness in the context of such tasks. We describe CRAB - a fully integrated text mining tool designed to support chemical health risk assessment. This task is complex and time-consuming, requiring a thorough review of existing scientific data on a particular chemical. Covering human, animal, cellular and other mechanistic data from various fields of biomedicine, this is highly varied and therefore difficult to harvest from literature databases via manual means. Our tool automates the process by extracting relevant scientific data in published literature and classifying it according to multiple qualitative dimensions. Developed in close collaboration with risk assessors, the tool allows navigating the classified dataset in various ways and sharing the data with other users. We present a direct and user-based evaluation which shows that the technology integrated in the tool is highly accurate, and report a number of case studies which demonstrate how the tool can be used to support scientific discovery in cancer risk assessment and research. Our work demonstrates the usefulness of a text mining pipeline in facilitating complex research tasks in biomedicine. We discuss further development and application of our technology to other types of chemical risk assessment in the future.

  17. Tools and Databases of the KOMICS Web Portal for Preprocessing, Mining, and Dissemination of Metabolomics Data

    Directory of Open Access Journals (Sweden)

    Nozomu Sakurai

    2014-01-01

    Full Text Available A metabolome—the collection of comprehensive quantitative data on metabolites in an organism—has been increasingly utilized for applications such as data-intensive systems biology, disease diagnostics, biomarker discovery, and assessment of food quality. A considerable number of tools and databases have been developed to date for the analysis of data generated by various combinations of chromatography and mass spectrometry. We report here a web portal named KOMICS (The Kazusa Metabolomics Portal, where the tools and databases that we developed are available for free to academic users. KOMICS includes the tools and databases for preprocessing, mining, visualization, and publication of metabolomics data. Improvements in the annotation of unknown metabolites and dissemination of comprehensive metabolomic data are the primary aims behind the development of this portal. For this purpose, PowerGet and FragmentAlign include a manual curation function for the results of metabolite feature alignments. A metadata-specific wiki-based database, Metabolonote, functions as a hub of web resources related to the submitters' work. This feature is expected to increase citation of the submitters' work, thereby promoting data publication. As an example of the practical use of KOMICS, a workflow for a study on Jatropha curcas is presented. The tools and databases available at KOMICS should contribute to enhanced production, interpretation, and utilization of metabolomic Big Data.

  18. Text mining, a race against time? An attempt to quantify possible variations in text corpora of medical publications throughout the years.

    Science.gov (United States)

    Wagner, Mathias; Vicinus, Benjamin; Muthra, Sherieda T; Richards, Tereza A; Linder, Roland; Frick, Vilma Oliveira; Groh, Andreas; Rubie, Claudia; Weichert, Frank

    2016-06-01

    The continuous growth of medical sciences literature indicates the need for automated text analysis. Scientific writing which is neither unitary, transcending social situation nor defined by a timeless idea is subject to constant change as it develops in response to evolving knowledge, aims at different goals, and embodies different assumptions about nature and communication. The objective of this study was to evaluate whether publication dates should be considered when performing text mining. A search of PUBMED for combined references to chemokine identifiers and particular cancer related terms was conducted to detect changes over the past 36 years. Text analyses were performed using freeware available from the World Wide Web. TOEFL Scores of territories hosting institutional affiliations as well as various readability indices were investigated. Further assessment was conducted using Principal Component Analysis. Laboratory examination was performed to evaluate the quality of attempts to extract content from the examined linguistic features. The PUBMED search yielded a total of 14,420 abstracts (3,190,219 words). The range of findings in laboratory experimentation were coherent with the variability of the results described in the analyzed body of literature. Increased concurrence of chemokine identifiers together with cancer related terms was found at the abstract and sentence level, whereas complexity of sentences remained fairly stable. The findings of the present study indicate that concurrent references to chemokines and cancer increased over time whereas text complexity remained stable. Copyright © 2016 Elsevier Ltd. All rights reserved.

  19. Information Gain Based Dimensionality Selection for Classifying Text Documents

    Energy Technology Data Exchange (ETDEWEB)

    Dumidu Wijayasekara; Milos Manic; Miles McQueen

    2013-06-01

    Selecting the optimal dimensions for various knowledge extraction applications is an essential component of data mining. Dimensionality selection techniques are utilized in classification applications to increase the classification accuracy and reduce the computational complexity. In text classification, where the dimensionality of the dataset is extremely high, dimensionality selection is even more important. This paper presents a novel, genetic algorithm based methodology, for dimensionality selection in text mining applications that utilizes information gain. The presented methodology uses information gain of each dimension to change the mutation probability of chromosomes dynamically. Since the information gain is calculated a priori, the computational complexity is not affected. The presented method was tested on a specific text classification problem and compared with conventional genetic algorithm based dimensionality selection. The results show an improvement of 3% in the true positives and 1.6% in the true negatives over conventional dimensionality selection methods.

  20. Text mining facilitates database curation - extraction of mutation-disease associations from Bio-medical literature.

    Science.gov (United States)

    Ravikumar, Komandur Elayavilli; Wagholikar, Kavishwar B; Li, Dingcheng; Kocher, Jean-Pierre; Liu, Hongfang

    2015-06-06

    Advances in the next generation sequencing technology has accelerated the pace of individualized medicine (IM), which aims to incorporate genetic/genomic information into medicine. One immediate need in interpreting sequencing data is the assembly of information about genetic variants and their corresponding associations with other entities (e.g., diseases or medications). Even with dedicated effort to capture such information in biological databases, much of this information remains 'locked' in the unstructured text of biomedical publications. There is a substantial lag between the publication and the subsequent abstraction of such information into databases. Multiple text mining systems have been developed, but most of them focus on the sentence level association extraction with performance evaluation based on gold standard text annotations specifically prepared for text mining systems. We developed and evaluated a text mining system, MutD, which extracts protein mutation-disease associations from MEDLINE abstracts by incorporating discourse level analysis, using a benchmark data set extracted from curated database records. MutD achieves an F-measure of 64.3% for reconstructing protein mutation disease associations in curated database records. Discourse level analysis component of MutD contributed to a gain of more than 10% in F-measure when compared against the sentence level association extraction. Our error analysis indicates that 23 of the 64 precision errors are true associations that were not captured by database curators and 68 of the 113 recall errors are caused by the absence of associated disease entities in the abstract. After adjusting for the defects in the curated database, the revised F-measure of MutD in association detection reaches 81.5%. Our quantitative analysis reveals that MutD can effectively extract protein mutation disease associations when benchmarking based on curated database records. The analysis also demonstrates that incorporating

  1. Compatibility between Text Mining and Qualitative Research in the Perspectives of Grounded Theory, Content Analysis, and Reliability

    Science.gov (United States)

    Yu, Chong Ho; Jannasch-Pennell, Angel; DiGangi, Samuel

    2011-01-01

    The objective of this article is to illustrate that text mining and qualitative research are epistemologically compatible. First, like many qualitative research approaches, such as grounded theory, text mining encourages open-mindedness and discourages preconceptions. Contrary to the popular belief that text mining is a linear and fully automated…

  2. Identifying child abuse through text mining and machine learning

    NARCIS (Netherlands)

    Amrit, Chintan; Paauw, Tim; Aly, Robin; Lavric, Miha

    2017-01-01

    In this paper, we describe how we used text mining and analysis to identify and predict cases of child abuse in a public health institution. Such institutions in the Netherlands try to identify and prevent different kinds of abuse. A significant part of the medical data that the institutions have on

  3. Practice-based evidence: profiling the safety of cilostazol by text-mining of clinical notes.

    Science.gov (United States)

    Leeper, Nicholas J; Bauer-Mehren, Anna; Iyer, Srinivasan V; Lependu, Paea; Olson, Cliff; Shah, Nigam H

    2013-01-01

    Peripheral arterial disease (PAD) is a growing problem with few available therapies. Cilostazol is the only FDA-approved medication with a class I indication for intermittent claudication, but carries a black box warning due to concerns for increased cardiovascular mortality. To assess the validity of this black box warning, we employed a novel text-analytics pipeline to quantify the adverse events associated with Cilostazol use in a clinical setting, including patients with congestive heart failure (CHF). We analyzed the electronic medical records of 1.8 million subjects from the Stanford clinical data warehouse spanning 18 years using a novel text-mining/statistical analytics pipeline. We identified 232 PAD patients taking Cilostazol and created a control group of 1,160 PAD patients not taking this drug using 1:5 propensity-score matching. Over a mean follow up of 4.2 years, we observed no association between Cilostazol use and any major adverse cardiovascular event including stroke (OR = 1.13, CI [0.82, 1.55]), myocardial infarction (OR = 1.00, CI [0.71, 1.39]), or death (OR = 0.86, CI [0.63, 1.18]). Cilostazol was not associated with an increase in any arrhythmic complication. We also identified a subset of CHF patients who were prescribed Cilostazol despite its black box warning, and found that it did not increase mortality in this high-risk group of patients. This proof of principle study shows the potential of text-analytics to mine clinical data warehouses to uncover 'natural experiments' such as the use of Cilostazol in CHF patients. We envision this method will have broad applications for examining difficult to test clinical hypotheses and to aid in post-marketing drug safety surveillance. Moreover, our observations argue for a prospective study to examine the validity of a drug safety warning that may be unnecessarily limiting the use of an efficacious therapy.

  4. Comparing a Video and Text Version of a Web-Based Computer-Tailored Intervention for Obesity Prevention: A Randomized Controlled Trial.

    Science.gov (United States)

    Walthouwer, Michel Jean Louis; Oenema, Anke; Lechner, Lilian; de Vries, Hein

    2015-10-19

    Web-based computer-tailored interventions often suffer from small effect sizes and high drop-out rates, particularly among people with a low level of education. Using videos as a delivery format can possibly improve the effects and attractiveness of these interventions The main aim of this study was to examine the effects of a video and text version of a Web-based computer-tailored obesity prevention intervention on dietary intake, physical activity, and body mass index (BMI) among Dutch adults. A second study aim was to examine differences in appreciation between the video and text version. The final study aim was to examine possible differences in intervention effects and appreciation per educational level. A three-armed randomized controlled trial was conducted with a baseline and 6 months follow-up measurement. The intervention consisted of six sessions, lasting about 15 minutes each. In the video version, the core tailored information was provided by means of videos. In the text version, the same tailored information was provided in text format. Outcome variables were self-reported and included BMI, physical activity, energy intake, and appreciation of the intervention. Multiple imputation was used to replace missing values. The effect analyses were carried out with multiple linear regression analyses and adjusted for confounders. The process evaluation data were analyzed with independent samples t tests. The baseline questionnaire was completed by 1419 participants and the 6 months follow-up measurement by 1015 participants (71.53%). No significant interaction effects of educational level were found on any of the outcome variables. Compared to the control condition, the video version resulted in lower BMI (B=-0.25, P=.049) and lower average daily energy intake from energy-dense food products (B=-175.58, PWeb-based computer-tailored obesity prevention intervention was the most effective intervention and most appreciated. Future research needs to examine if the

  5. PPI finder: a mining tool for human protein-protein interactions.

    Directory of Open Access Journals (Sweden)

    Min He

    Full Text Available BACKGROUND: The exponential increase of published biomedical literature prompts the use of text mining tools to manage the information overload automatically. One of the most common applications is to mine protein-protein interactions (PPIs from PubMed abstracts. Currently, most tools in mining PPIs from literature are using co-occurrence-based approaches or rule-based approaches. Hybrid methods (frame-based approaches by combining these two methods may have better performance in predicting PPIs. However, the predicted PPIs from these methods are rarely evaluated by known PPI databases and co-occurred terms in Gene Ontology (GO database. METHODOLOGY/PRINCIPAL FINDINGS: We here developed a web-based tool, PPI Finder, to mine human PPIs from PubMed abstracts based on their co-occurrences and interaction words, followed by evidences in human PPI databases and shared terms in GO database. Only 28% of the co-occurred pairs in PubMed abstracts appeared in any of the commonly used human PPI databases (HPRD, BioGRID and BIND. On the other hand, of the known PPIs in HPRD, 69% showed co-occurrences in the literature, and 65% shared GO terms. CONCLUSIONS: PPI Finder provides a useful tool for biologists to uncover potential novel PPIs. It is freely accessible at http://liweilab.genetics.ac.cn/tm/.

  6. Agile text mining for the 2014 i2b2/UTHealth Cardiac risk factors challenge.

    Science.gov (United States)

    Cormack, James; Nath, Chinmoy; Milward, David; Raja, Kalpana; Jonnalagadda, Siddhartha R

    2015-12-01

    This paper describes the use of an agile text mining platform (Linguamatics' Interactive Information Extraction Platform, I2E) to extract document-level cardiac risk factors in patient records as defined in the i2b2/UTHealth 2014 challenge. The approach uses a data-driven rule-based methodology with the addition of a simple supervised classifier. We demonstrate that agile text mining allows for rapid optimization of extraction strategies, while post-processing can leverage annotation guidelines, corpus statistics and logic inferred from the gold standard data. We also show how data imbalance in a training set affects performance. Evaluation of this approach on the test data gave an F-Score of 91.7%, one percent behind the top performing system. Copyright © 2015 Elsevier Inc. All rights reserved.

  7. Web-based Visual Analytics for Extreme Scale Climate Science

    Energy Technology Data Exchange (ETDEWEB)

    Steed, Chad A [ORNL; Evans, Katherine J [ORNL; Harney, John F [ORNL; Jewell, Brian C [ORNL; Shipman, Galen M [ORNL; Smith, Brian E [ORNL; Thornton, Peter E [ORNL; Williams, Dean N. [Lawrence Livermore National Laboratory (LLNL)

    2014-01-01

    In this paper, we introduce a Web-based visual analytics framework for democratizing advanced visualization and analysis capabilities pertinent to large-scale earth system simulations. We address significant limitations of present climate data analysis tools such as tightly coupled dependencies, ineffi- cient data movements, complex user interfaces, and static visualizations. Our Web-based visual analytics framework removes critical barriers to the widespread accessibility and adoption of advanced scientific techniques. Using distributed connections to back-end diagnostics, we minimize data movements and leverage HPC platforms. We also mitigate system dependency issues by employing a RESTful interface. Our framework embraces the visual analytics paradigm via new visual navigation techniques for hierarchical parameter spaces, multi-scale representations, and interactive spatio-temporal data mining methods that retain details. Although generalizable to other science domains, the current work focuses on improving exploratory analysis of large-scale Community Land Model (CLM) and Community Atmosphere Model (CAM) simulations.

  8. Efficacy of Web-Based Collection of Strength-Based Testimonials for Text Message Extension of Youth Suicide Prevention Program: Randomized Controlled Experiment.

    Science.gov (United States)

    Thiha, Phyo; Pisani, Anthony R; Gurditta, Kunali; Cherry, Erin; Peterson, Derick R; Kautz, Henry; Wyman, Peter A

    2016-11-09

    Equipping members of a target population to deliver effective public health messaging to peers is an established approach in health promotion. The Sources of Strength program has demonstrated the promise of this approach for "upstream" youth suicide prevention. Text messaging is a well-established medium for promoting behavior change and is the dominant communication medium for youth. In order for peer 'opinion leader' programs like Sources of Strength to use scalable, wide-reaching media such as text messaging to spread peer-to-peer messages, they need techniques for assisting peer opinion leaders in creating effective testimonials to engage peers and match program goals. We developed a Web interface, called Stories of Personal Resilience in Managing Emotions (StoryPRIME), which helps peer opinion leaders write effective, short-form messages that can be delivered to the target population in youth suicide prevention program like Sources of Strength. To determine the efficacy of StoryPRIME, a Web-based interface for remotely eliciting high school peer leaders, and helping them produce high-quality, personal testimonials for use in a text messaging extension of an evidence-based, peer-led suicide prevention program. In a double-blind randomized controlled experiment, 36 high school students wrote testimonials with or without eliciting from the StoryPRIME interface. The interface was created in the context of Sources of Strength-an evidence-based youth suicide prevention program-and 24 ninth graders rated these testimonials on relatability, usefulness/relevance, intrigue, and likability. Testimonials written with the StoryPRIME interface were rated as more relatable, useful/relevant, intriguing, and likable than testimonials written without StoryPRIME, P=.054. StoryPRIME is a promising way to elicit high-quality, personal testimonials from youth for prevention programs that draw on members of a target population to spread public health messages. ©Phyo Thiha, Anthony

  9. An Enhanced Rule-Based Web Scanner Based on Similarity Score

    Directory of Open Access Journals (Sweden)

    LEE, M.

    2016-08-01

    Full Text Available This paper proposes an enhanced rule-based web scanner in order to get better accuracy in detecting web vulnerabilities than the existing tools, which have relatively high false alarm rate when the web pages are installed in unconventional directory paths. Using the proposed matching method based on similarity score, the proposed scheme can determine whether two pages have the same vulnerabilities or not. With this method, the proposed scheme is able to figure out the target web pages are vulnerable by comparing them to the web pages that are known to have vulnerabilities. We show the proposed scanner reduces 12% false alarm rate compared to the existing well-known scanner through the performance evaluation via various experiments. The proposed scheme is especially helpful in detecting vulnerabilities of the web applications which come from well-known open-source web applications after small customization, which happens frequently in many small-sized companies.

  10. Web-based tools for data analysis and quality assurance on a life-history trait database of plants of Northwest Europe

    NARCIS (Netherlands)

    Stadler, Michael; Ahlers, Dirk; Bekker, Rene M.; Finke, Jens; Kunzmann, Dierk; Sonnenschein, Michael

    2006-01-01

    Most data mining techniques have rarely been used in ecology. To address the specific needs of scientists analysing data from a plant trait database developed during the LEDA project, a web-based data mining tool has been developed. This paper presents the DIONE data miner and the project it has

  11. Smartphone-based evaluations of clinical placements—a useful complement to web-based evaluation tools

    Directory of Open Access Journals (Sweden)

    Jesper Hessius

    2015-11-01

    Full Text Available Purpose: Web-based questionnaires are currently the standard method for course evaluations. The high rate of smartphone adoption in Sweden makes possible a range of new uses, including course evaluation. This study examines the potential advantages and disadvantages of using a smartphone app as a complement to web-based course evaluationsystems. Methods: An iPhone app for course evaluations was developed and interfaced to an existing web-based tool. Evaluations submitted using the app were compared with those submitted using the web between August 2012 and June 2013, at the Faculty of Medicine at Uppsala University, Sweden. Results: At the time of the study, 49% of the students were judged to own iPhones. Over the course of the study, 3,340 evaluations were submitted, of which 22.8% were submitted using the app. The median of mean scores in the submitted evaluations was 4.50 for the app (with an interquartile range of 3.70-5.20 and 4.60 (3.70-5.20 for the web (P=0.24. The proportion of evaluations that included a free-text comment was 50.5% for the app and 49.9% for the web (P=0.80. Conclusion: An app introduced as a complement to a web-based course evaluation system met with rapid adoption. We found no difference in the frequency of free-text comments or in the evaluation scores. Apps appear to be promising tools for course evaluations. web-based course evaluation system met with rapid adoption. We found no difference in the frequency of free-text comments or in the evaluation scores. Apps appear to be promising tools for course evaluations.

  12. Chemotext: A Publicly Available Web Server for Mining Drug-Target-Disease Relationships in PubMed.

    Science.gov (United States)

    Capuzzi, Stephen J; Thornton, Thomas E; Liu, Kammy; Baker, Nancy; Lam, Wai In; O'Banion, Colin P; Muratov, Eugene N; Pozefsky, Diane; Tropsha, Alexander

    2018-02-26

    Elucidation of the mechanistic relationships between drugs, their targets, and diseases is at the core of modern drug discovery research. Thousands of studies relevant to the drug-target-disease (DTD) triangle have been published and annotated in the Medline/PubMed database. Mining this database affords rapid identification of all published studies that confirm connections between vertices of this triangle or enable new inferences of such connections. To this end, we describe the development of Chemotext, a publicly available Web server that mines the entire compendium of published literature in PubMed annotated by Medline Subject Heading (MeSH) terms. The goal of Chemotext is to identify all known DTD relationships and infer missing links between vertices of the DTD triangle. As a proof-of-concept, we show that Chemotext could be instrumental in generating new drug repurposing hypotheses or annotating clinical outcomes pathways for known drugs. The Chemotext Web server is freely available at http://chemotext.mml.unc.edu .

  13. Editorial: Web-Based Learning: Innovations and Challenges

    Directory of Open Access Journals (Sweden)

    Mudasser F. Wyne

    2010-12-01

    Full Text Available This special issue of the Knowledge Management & E-Learning: an international journal(KM&EL aims to stimulate interest in the web based issues in both teaching and learning, expose natural collaboration among the authors and readers, inform the larger research community of the interest and importance of this area and create a forum for evaluating innovations and challenges. We intend to bring together researchers and practitioners interested in developing and enhancing web-based learning environment. The objectives for this attempt are to provide a forum for discussion of ideas and techniques developed and used in web based learning. In addition the issue can also be used for educators and developers to discuss requirements for web-based education. Both theoretical papers and papers reporting implementation models, technology used and practical results are included in the issue.

  14. Evolution of bayesian-related research over time: a temporal text mining task

    CSIR Research Space (South Africa)

    de Waal, A

    2006-06-01

    Full Text Available Ronald Reagan’s Radio Addresses? Bayesian Analysis 2006, Volume 1, Number 2, pp. 189-383. 2. Mei Q and Zhai C, 2005. Discovering Evolutionary Theme Patterns from Text – An Exploration of Temporal Text Mining. KDD’05, August 21-24, 2005. Chicago...

  15. Scrutinizing the Cybersell: Teen-Targeted Web Sites as Texts

    Science.gov (United States)

    Crovitz, Darren

    2007-01-01

    Darren Crovitz explains that the explosive growth of Web-based content and communication in recent years compels us to teach students how to examine the "rhetorical nature and ethical dimensions of the online world." He demonstrates successful approaches to accomplish this goal through his analysis of the selling techniques of two Web sites…

  16. Integrated pathway-based transcription regulation network mining and visualization based on gene expression profiles.

    Science.gov (United States)

    Kibinge, Nelson; Ono, Naoaki; Horie, Masafumi; Sato, Tetsuo; Sugiura, Tadao; Altaf-Ul-Amin, Md; Saito, Akira; Kanaya, Shigehiko

    2016-06-01

    Conventionally, workflows examining transcription regulation networks from gene expression data involve distinct analytical steps. There is a need for pipelines that unify data mining and inference deduction into a singular framework to enhance interpretation and hypotheses generation. We propose a workflow that merges network construction with gene expression data mining focusing on regulation processes in the context of transcription factor driven gene regulation. The pipeline implements pathway-based modularization of expression profiles into functional units to improve biological interpretation. The integrated workflow was implemented as a web application software (TransReguloNet) with functions that enable pathway visualization and comparison of transcription factor activity between sample conditions defined in the experimental design. The pipeline merges differential expression, network construction, pathway-based abstraction, clustering and visualization. The framework was applied in analysis of actual expression datasets related to lung, breast and prostrate cancer. Copyright © 2016 Elsevier Inc. All rights reserved.

  17. PLAN: a web platform for automating high-throughput BLAST searches and for managing and mining results

    Directory of Open Access Journals (Sweden)

    Zhao Xuechun

    2007-02-01

    Full Text Available Abstract Background BLAST searches are widely used for sequence alignment. The search results are commonly adopted for various functional and comparative genomics tasks such as annotating unknown sequences, investigating gene models and comparing two sequence sets. Advances in sequencing technologies pose challenges for high-throughput analysis of large-scale sequence data. A number of programs and hardware solutions exist for efficient BLAST searching, but there is a lack of generic software solutions for mining and personalized management of the results. Systematically reviewing the results and identifying information of interest remains tedious and time-consuming. Results Personal BLAST Navigator (PLAN is a versatile web platform that helps users to carry out various personalized pre- and post-BLAST tasks, including: (1 query and target sequence database management, (2 automated high-throughput BLAST searching, (3 indexing and searching of results, (4 filtering results online, (5 managing results of personal interest in favorite categories, (6 automated sequence annotation (such as NCBI NR and ontology-based annotation. PLAN integrates, by default, the Decypher hardware-based BLAST solution provided by Active Motif Inc. with a greatly improved efficiency over conventional BLAST software. BLAST results are visualized by spreadsheets and graphs and are full-text searchable. BLAST results and sequence annotations can be exported, in part or in full, in various formats including Microsoft Excel and FASTA. Sequences and BLAST results are organized in projects, the data publication levels of which are controlled by the registered project owners. In addition, all analytical functions are provided to public users without registration. Conclusion PLAN has proved a valuable addition to the community for automated high-throughput BLAST searches, and, more importantly, for knowledge discovery, management and sharing based on sequence alignment results

  18. DDMGD: the database of text-mined associations between genes methylated in diseases from different species

    KAUST Repository

    Raies, A. B.; Mansour, H.; Incitti, R.; Bajic, Vladimir B.

    2014-01-01

    ://www.cbrc.kaust.edu.sa/ddmgd/) to provide a comprehensive repository of information related to genes methylated in diseases that can be found through text mining. DDMGD's scope is not limited to a particular group of genes, diseases or species. Using the text mining system DEMGD we

  19. Identifying Engineering Students' English Sentence Reading Comprehension Errors: Applying a Data Mining Technique

    Science.gov (United States)

    Tsai, Yea-Ru; Ouyang, Chen-Sen; Chang, Yukon

    2016-01-01

    The purpose of this study is to propose a diagnostic approach to identify engineering students' English reading comprehension errors. Student data were collected during the process of reading texts of English for science and technology on a web-based cumulative sentence analysis system. For the analysis, the association-rule, data mining technique…

  20. An Enhanced Text-Mining Framework for Extracting Disaster Relevant Data through Social Media and Remote Sensing Data Fusion

    Science.gov (United States)

    Scheele, C. J.; Huang, Q.

    2016-12-01

    In the past decade, the rise in social media has led to the development of a vast number of social media services and applications. Disaster management represents one of such applications leveraging massive data generated for event detection, response, and recovery. In order to find disaster relevant social media data, current approaches utilize natural language processing (NLP) methods based on keywords, or machine learning algorithms relying on text only. However, these approaches cannot be perfectly accurate due to the variability and uncertainty in language used on social media. To improve current methods, the enhanced text-mining framework is proposed to incorporate location information from social media and authoritative remote sensing datasets for detecting disaster relevant social media posts, which are determined by assessing the textual content using common text mining methods and how the post relates spatiotemporally to the disaster event. To assess the framework, geo-tagged Tweets were collected for three different spatial and temporal disaster events: hurricane, flood, and tornado. Remote sensing data and products for each event were then collected using RealEarthTM. Both Naive Bayes and Logistic Regression classifiers were used to compare the accuracy within the enhanced text-mining framework. Finally, the accuracies from the enhanced text-mining framework were compared to the current text-only methods for each of the case study disaster events. The results from this study address the need for more authoritative data when using social media in disaster management applications.

  1. ESTminer: a Web interface for mining EST contig and cluster databases.

    Science.gov (United States)

    Huang, Yecheng; Pumphrey, Janie; Gingle, Alan R

    2005-03-01

    ESTminer is a Web application and database schema for interactive mining of expressed sequence tag (EST) contig and cluster datasets. The Web interface contains a query frame that allows the selection of contigs/clusters with specific cDNA library makeup or a threshold number of members. The results are displayed as color-coded tree nodes, where the color indicates the fractional size of each cDNA library component. The nodes are expandable, revealing library statistics as well as EST or contig members, with links to sequence data, GenBank records or user configurable links. Also, the interface allows 'queries within queries' where the result set of a query is further filtered by the subsequent query. ESTminer is implemented in Java/JSP and the package, including MySQL and Oracle schema creation scripts, is available from http://cggc.agtec.uga.edu/Data/download.asp agingle@uga.edu.

  2. Using Text Mining to Uncover Students' Technology-Related Problems in Live Video Streaming

    Science.gov (United States)

    Abdous, M'hammed; He, Wu

    2011-01-01

    Because of their capacity to sift through large amounts of data, text mining and data mining are enabling higher education institutions to reveal valuable patterns in students' learning behaviours without having to resort to traditional survey methods. In an effort to uncover live video streaming (LVS) students' technology related-problems and to…

  3. Soil food web changes during spontaneous succession at post mining sites: a possible ecosystem engineering effect on food web organization?

    Science.gov (United States)

    Frouz, Jan; Thébault, Elisa; Pižl, Václav; Adl, Sina; Cajthaml, Tomáš; Baldrián, Petr; Háněl, Ladislav; Starý, Josef; Tajovský, Karel; Materna, Jan; Nováková, Alena; de Ruiter, Peter C

    2013-01-01

    Parameters characterizing the structure of the decomposer food web, biomass of the soil microflora (bacteria and fungi) and soil micro-, meso- and macrofauna were studied at 14 non-reclaimed 1- 41-year-old post-mining sites near the town of Sokolov (Czech Republic). These observations on the decomposer food webs were compared with knowledge of vegetation and soil microstructure development from previous studies. The amount of carbon entering the food web increased with succession age in a similar way as the total amount of C in food web biomass and the number of functional groups in the food web. Connectance did not show any significant changes with succession age, however. In early stages of the succession, the bacterial channel dominated the food web. Later on, in shrub-dominated stands, the fungal channel took over. Even later, in the forest stage, the bacterial channel prevailed again. The best predictor of fungal bacterial ratio is thickness of fermentation layer. We argue that these changes correspond with changes in topsoil microstructure driven by a combination of plant organic matter input and engineering effects of earthworms. In early stages, soil is alkaline, and a discontinuous litter layer on the soil surface promotes bacterial biomass growth, so the bacterial food web channel can dominate. Litter accumulation on the soil surface supports the development of the fungal channel. In older stages, earthworms arrive, mix litter into the mineral soil and form an organo-mineral topsoil, which is beneficial for bacteria and enhances the bacterial food web channel.

  4. Lightweight monitoring and control system for coal mine safety using REST style.

    Science.gov (United States)

    Cheng, Bo; Cheng, Xin; Chen, Junliang

    2015-01-01

    The complex environment of a coal mine requires the underground environment, devices and miners to be constantly monitored to ensure safe coal production. However, existing coal mines do not meet these coverage requirements because blind spots occur when using a wired network. In this paper, we develop a Web-based, lightweight remote monitoring and control platform using a wireless sensor network (WSN) with the REST style to collect temperature, humidity and methane concentration data in a coal mine using sensor nodes. This platform also collects information on personnel positions inside the mine. We implement a RESTful application programming interface (API) that provides access to underground sensors and instruments through the Web such that underground coal mine physical devices can be easily interfaced to remote monitoring and control applications. We also implement three different scenarios for Web-based, lightweight remote monitoring and control of coal mine safety and measure and analyze the system performance. Finally, we present the conclusions from this study and discuss future work. Copyright © 2014 ISA. Published by Elsevier Ltd. All rights reserved.

  5. Simple, Scalable, Script-based, Science Processor for Measurements - Data Mining Edition (S4PM-DME)

    Science.gov (United States)

    Pham, L. B.; Eng, E. K.; Lynnes, C. S.; Berrick, S. W.; Vollmer, B. E.

    2005-12-01

    The S4PM-DME is the Goddard Earth Sciences Distributed Active Archive Center's (GES DAAC) web-based data mining environment. The S4PM-DME replaces the Near-line Archive Data Mining (NADM) system with a better web environment and a richer set of production rules. S4PM-DME enables registered users to submit and execute custom data mining algorithms. The S4PM-DME system uses the GES DAAC developed Simple Scalable Script-based Science Processor for Measurements (S4PM) to automate tasks and perform the actual data processing. A web interface allows the user to access the S4PM-DME system. The user first develops personalized data mining algorithm on his/her home platform and then uploads them to the S4PM-DME system. Algorithms in C and FORTRAN languages are currently supported. The user developed algorithm is automatically audited for any potential security problems before it is installed within the S4PM-DME system and made available to the user. Once the algorithm has been installed the user can promote the algorithm to the "operational" environment. From here the user can search and order the data available in the GES DAAC archive for his/her science algorithm. The user can also set up a processing subscription. The subscription will automatically process new data as it becomes available in the GES DAAC archive. The generated mined data products are then made available for FTP pickup. The benefits of using S4PM-DME are 1) to decrease the downloading time it typically takes a user to transfer the GES DAAC data to his/her system thus off-load the heavy network traffic, 2) to free-up the load on their system, and last 3) to utilize the rich and abundance ocean, atmosphere data from the MODIS and AIRS instruments available from the GES DAAC.

  6. Personalized links recommendation based on data mining in adaptive educational hypermedia systems

    NARCIS (Netherlands)

    Romero, C.; Ventura, S.; Delgado, J.A.; De Bra, P.M.E.; Duval, E.; Klamma, R.; Wolpers, M.

    2007-01-01

    In this paper, we describe a personalized recommender system that uses web mining techniques for recommending a student which (next) links to visit within an adaptable educational hypermedia system. We present a specific mining tool and a recommender engine that we have integrated in the AHA! system

  7. [Exploring the clinical characters of Shugan Jieyu capsule through text mining].

    Science.gov (United States)

    Pu, Zheng-Ping; Xia, Jiang-Ming; Xie, Wei; He, Jin-Cai

    2017-09-01

    The study was main to explore the clinical characters of Shugan Jieyu capsule through text mining. The data sets of Shugan Jieyu capsule were downloaded from CMCC database by the method of literature retrieved from May 2009 to Jan 2016. Rules of Chinese medical patterns, diseases, symptoms and combination treatment were mined out by data slicing algorithm, and they were demonstrated in frequency tables and two dimension based network. Then totally 190 literature were recruited. The outcomess suggested that SC was most frequently correlated with liver Qi stagnation. Primary depression, depression due to brain disease, concomitant depression followed by physical diseases, concomitant depression followed by schizophrenia and functional dyspepsia were main diseases treated by Shugan Jieyu capsule. Symptoms like low mood, psychic anxiety, somatic anxiety and dysfunction of automatic nerve were mainy relieved bv Shugan Jieyu capsule.For combination treatment. Shugan Jieyu capsule was most commonly used with paroxetine, sertraline and fluoxetine. The research suggested that syndrome types and mining results of Shugan Jieyu capsule were almost the same as its instructions. Syndrome of malnutrition of heart spirit was the potential Chinese medical pattern of Shugan Jieyu capsule. Primary comorbid anxiety and depression, concomitant comorbid anxiety and depression followed by physical diseases, and postpartum depression were potential diseases treated by Shugan Jieyu capsule.For combination treatment, Shugan Jieyu capsule was most commonly used with paroxetine, sertraline and fluoxetine. Copyright© by the Chinese Pharmaceutical Association.

  8. Enriching semantic knowledge bases for opinion mining in big data applications

    OpenAIRE

    Weichselbraun, A.; Gindl, S.; Scharl, A.

    2014-01-01

    This paper presents a novel method for contextualizing and enriching large semantic knowledge bases for opinion mining with a focus on Web intelligence platforms and other high-throughput big data applications. The method is not only applicable to traditional sentiment lexicons, but also to more comprehensive, multi-dimensional affective resources such as SenticNet. It comprises the following steps: (i) identify ambiguous sentiment terms, (ii) provide context information extracted from a doma...

  9. Context mining and integration into predictive web analytics

    NARCIS (Netherlands)

    Kiseleva, Y.

    2013-01-01

    Predictive Web Analytics is aimed at understanding behavioural patterns of users of various web-based applications: e-commerce, ubiquitous and mobile computing, and computational advertising. Within these applications business decisions often rely on two types of predictions: an overall or

  10. DiMeX: A Text Mining System for Mutation-Disease Association Extraction.

    Science.gov (United States)

    Mahmood, A S M Ashique; Wu, Tsung-Jung; Mazumder, Raja; Vijay-Shanker, K

    2016-01-01

    The number of published articles describing associations between mutations and diseases is increasing at a fast pace. There is a pressing need to gather such mutation-disease associations into public knowledge bases, but manual curation slows down the growth of such databases. We have addressed this problem by developing a text-mining system (DiMeX) to extract mutation to disease associations from publication abstracts. DiMeX consists of a series of natural language processing modules that preprocess input text and apply syntactic and semantic patterns to extract mutation-disease associations. DiMeX achieves high precision and recall with F-scores of 0.88, 0.91 and 0.89 when evaluated on three different datasets for mutation-disease associations. DiMeX includes a separate component that extracts mutation mentions in text and associates them with genes. This component has been also evaluated on different datasets and shown to achieve state-of-the-art performance. The results indicate that our system outperforms the existing mutation-disease association tools, addressing the low precision problems suffered by most approaches. DiMeX was applied on a large set of abstracts from Medline to extract mutation-disease associations, as well as other relevant information including patient/cohort size and population data. The results are stored in a database that can be queried and downloaded at http://biotm.cis.udel.edu/dimex/. We conclude that this high-throughput text-mining approach has the potential to significantly assist researchers and curators to enrich mutation databases.

  11. A Sensor Web and Web Service-Based Approach for Active Hydrological Disaster Monitoring

    Directory of Open Access Journals (Sweden)

    Xi Zhai

    2016-09-01

    Full Text Available Rapid advancements in Earth-observing sensor systems have led to the generation of large amounts of remote sensing data that can be used for the dynamic monitoring and analysis of hydrological disasters. The management and analysis of these data could take advantage of distributed information infrastructure technologies such as Web service and Sensor Web technologies, which have shown great potential in facilitating the use of observed big data in an interoperable, flexible and on-demand way. However, it remains a challenge to achieve timely response to hydrological disaster events and to automate the geoprocessing of hydrological disaster observations. This article proposes a Sensor Web and Web service-based approach to support active hydrological disaster monitoring. This approach integrates an event-driven mechanism, Web services, and a Sensor Web and coordinates them using workflow technologies to facilitate the Web-based sharing and processing of hydrological hazard information. The design and implementation of hydrological Web services for conducting various hydrological analysis tasks on the Web using dynamically updating sensor observation data are presented. An application example is provided to demonstrate the benefits of the proposed approach over the traditional approach. The results confirm the effectiveness and practicality of the proposed approach in cases of hydrological disaster.

  12. An Evaluation of Text Mining Tools as Applied to Selected Scientific and Engineering Literature.

    Science.gov (United States)

    Trybula, Walter J.; Wyllys, Ronald E.

    2000-01-01

    Addresses an approach to the discovery of scientific knowledge through an examination of data mining and text mining techniques. Presents the results of experiments that investigated knowledge acquisition from a selected set of technical documents by domain experts. (Contains 15 references.) (Author/LRW)

  13. CrossRef text and data mining services

    Directory of Open Access Journals (Sweden)

    Rachael Lammey

    2015-02-01

    Full Text Available CrossRef is an association of scholarly publishers that develops shared infrastructure to support more effective scholarly communications. It is a registration agency for the digital object identifier (DOI, and has built additional services for CrossRef members around the DOI and the bibliographic metadata that publishers deposit in order to register DOIs for their publications. Among these services are CrossCheck, powered by iThenticate, which helps publishers screen for plagiarism in submitted manuscripts and FundRef, which gives publishers standard way to report funding sources for published scholarly research. To add to these services, Cross-Ref launched CrossRef text and data mining services in May 2014. This article will explain the thinking behind CrossRef launching this new service, what it offers to publishers and researchers alike, how publishers can participate in it, and the uptake of the service so far.

  14. Smartphone-based evaluations of clinical placements-a useful complement to web-based evaluation tools.

    Science.gov (United States)

    Hessius, Jesper; Johansson, Jakob

    2015-01-01

    Web-based questionnaires are currently the standard method for course evaluations. The high rate of smartphone adoption in Sweden makes possible a range of new uses, including course evaluation. This study examines the potential advantages and disadvantages of using a smartphone app as a complement to web-based course evaluationsystems. An iPhone app for course evaluations was developed and interfaced to an existing web-based tool. Evaluations submitted using the app were compared with those submitted using the web between August 2012 and June 2013, at the Faculty of Medicine at Uppsala University, Sweden. At the time of the study, 49% of the students were judged to own iPhones. Over the course of the study, 3,340 evaluations were submitted, of which 22.8% were submitted using the app. The median of mean scores in the submitted evaluations was 4.50 for the app (with an interquartile range of 3.70-5.20) and 4.60 (3.70-5.20) for the web (P=0.24). The proportion of evaluations that included a free-text comment was 50.5% for the app and 49.9% for the web (P=0.80). An app introduced as a complement to a web-based course evaluation system met with rapid adoption. We found no difference in the frequency of free-text comments or in the evaluation scores. Apps appear to be promising tools for course evaluations. web-based course evaluation system met with rapid adoption. We found no difference in the frequency of free-text comments or in the evaluation scores. Apps appear to be promising tools for course evaluations.

  15. A Parallel Approach for Frequent Subgraph Mining in a Single Large Graph Using Spark

    Directory of Open Access Journals (Sweden)

    Fengcai Qiao

    2018-02-01

    Full Text Available Frequent subgraph mining (FSM plays an important role in graph mining, attracting a great deal of attention in many areas, such as bioinformatics, web data mining and social networks. In this paper, we propose SSiGraM (Spark based Single Graph Mining, a Spark based parallel frequent subgraph mining algorithm in a single large graph. Aiming to approach the two computational challenges of FSM, we conduct the subgraph extension and support evaluation parallel across all the distributed cluster worker nodes. In addition, we also employ a heuristic search strategy and three novel optimizations: load balancing, pre-search pruning and top-down pruning in the support evaluation process, which significantly improve the performance. Extensive experiments with four different real-world datasets demonstrate that the proposed algorithm outperforms the existing GraMi (Graph Mining algorithm by an order of magnitude for all datasets and can work with a lower support threshold.

  16. Text mining effectively scores and ranks the literature for improving chemical-gene-disease curation at the comparative toxicogenomics database.

    Directory of Open Access Journals (Sweden)

    Allan Peter Davis

    Full Text Available The Comparative Toxicogenomics Database (CTD; http://ctdbase.org/ is a public resource that curates interactions between environmental chemicals and gene products, and their relationships to diseases, as a means of understanding the effects of environmental chemicals on human health. CTD provides a triad of core information in the form of chemical-gene, chemical-disease, and gene-disease interactions that are manually curated from scientific articles. To increase the efficiency, productivity, and data coverage of manual curation, we have leveraged text mining to help rank and prioritize the triaged literature. Here, we describe our text-mining process that computes and assigns each article a document relevancy score (DRS, wherein a high DRS suggests that an article is more likely to be relevant for curation at CTD. We evaluated our process by first text mining a corpus of 14,904 articles triaged for seven heavy metals (cadmium, cobalt, copper, lead, manganese, mercury, and nickel. Based upon initial analysis, a representative subset corpus of 3,583 articles was then selected from the 14,094 articles and sent to five CTD biocurators for review. The resulting curation of these 3,583 articles was analyzed for a variety of parameters, including article relevancy, novel data content, interaction yield rate, mean average precision, and biological and toxicological interpretability. We show that for all measured parameters, the DRS is an effective indicator for scoring and improving the ranking of literature for the curation of chemical-gene-disease information at CTD. Here, we demonstrate how fully incorporating text mining-based DRS scoring into our curation pipeline enhances manual curation by prioritizing more relevant articles, thereby increasing data content, productivity, and efficiency.

  17. Integrating text mining, data mining, and network analysis for identifying genetic breast cancer trends.

    Science.gov (United States)

    Jurca, Gabriela; Addam, Omar; Aksac, Alper; Gao, Shang; Özyer, Tansel; Demetrick, Douglas; Alhajj, Reda

    2016-04-26

    Breast cancer is a serious disease which affects many women and may lead to death. It has received considerable attention from the research community. Thus, biomedical researchers aim to find genetic biomarkers indicative of the disease. Novel biomarkers can be elucidated from the existing literature. However, the vast amount of scientific publications on breast cancer make this a daunting task. This paper presents a framework which investigates existing literature data for informative discoveries. It integrates text mining and social network analysis in order to identify new potential biomarkers for breast cancer. We utilized PubMed for the testing. We investigated gene-gene interactions, as well as novel interactions such as gene-year, gene-country, and abstract-country to find out how the discoveries varied over time and how overlapping/diverse are the discoveries and the interest of various research groups in different countries. Interesting trends have been identified and discussed, e.g., different genes are highlighted in relationship to different countries though the various genes were found to share functionality. Some text analysis based results have been validated against results from other tools that predict gene-gene relations and gene functions.

  18. Using ant-behavior-based simulation model AntWeb to improve website organization

    Science.gov (United States)

    Li, Weigang; Pinheiro Dib, Marcos V.; Teles, Wesley M.; Morais de Andrade, Vlaudemir; Alves de Melo, Alba C. M.; Cariolano, Judas T.

    2002-03-01

    Some web usage mining algorithms showed the potential application to find the difference among the organizations expected by visitors to the website. However, there are still no efficient method and criterion for a web administrator to measure the performance of the modification. In this paper, we developed an AntWeb, a model inspired by ants' behavior to simulate the sequence of visiting the website, in order to measure the efficient of the web structure. We implemented a web usage mining algorithm using backtrack to the intranet website of the Politec Informatic Ltd., Brazil. We defined throughput (the number of visitors to reach their target pages per time unit relates to the total number of visitors) as an index to measure the website's performance. We also used the link in a web page to represent the effect of visitors' pheromone trails. For every modification in the website organization, for example, putting a link from the expected location to the target object, the simulation reported the value of throughput as a quick answer about this modification. The experiment showed the stability of our simulation model, and a positive modification to the intranet website of the Politec.

  19. Service mining framework and application

    CERN Document Server

    Chang, Wei-Lun

    2014-01-01

    The shifting focus of service from the 1980s to 2000s has proved that IT not only lowers the cost of service but creates avenues to enhance and increase revenue through service. The new type of service, e-service, is mobile, flexible, interactive, and interchangeable. While service science provides an avenue for future service researches, the specific research areas from the IT perspective still need to be elaborated. This book introduces a novel concept-service mining-to address several research areas from technology, model, management, and application perspectives. Service mining is defined as "a systematical process including service discovery, service experience, service recovery, and service retention to discover unique patterns and exceptional values within the existing services." The goal of service mining is similar to data mining, text mining, or web mining, and aims to "detect something new" from the service pool. The major difference is the feature of service is quite distinct from the mining targe...

  20. Towards Technological Approaches for Concept Maps Mining from Text

    Directory of Open Access Journals (Sweden)

    Camila Zacche Aguiar

    2018-04-01

    Full Text Available Concept maps are resources for the representation and construction of knowledge. They allow showing, through concepts and relationships, how knowledge about a subject is organized. Technological advances have boosted the development of approaches for the automatic construction of a concept map, to facilitate and provide the benefits of that resource more broadly. Due to the need to better identify and analyze the functionalities and characteristics of those approaches, we conducted a detailed study on technological approaches for automatic construction of concept maps published between 1994 and 2016 in the IEEE Xplore, ACM and Elsevier Science Direct data bases. From this study, we elaborate a categorization defined on two perspectives, Data Source and Graphic Representation, and fourteen categories. That study collected 30 relevant articles, which were applied to the proposed categorization to identify the main features and limitations of each approach. A detailed view on these approaches, their characteristics and techniques are presented enabling a quantitative analysis. In addition, the categorization has given us objective conditions to establish new specification requirements for a new technological approach aiming at concept maps mining from texts.

  1. An unsupervised text mining method for relation extraction from biomedical literature.

    Directory of Open Access Journals (Sweden)

    Changqin Quan

    Full Text Available The wealth of interaction information provided in biomedical articles motivated the implementation of text mining approaches to automatically extract biomedical relations. This paper presents an unsupervised method based on pattern clustering and sentence parsing to deal with biomedical relation extraction. Pattern clustering algorithm is based on Polynomial Kernel method, which identifies interaction words from unlabeled data; these interaction words are then used in relation extraction between entity pairs. Dependency parsing and phrase structure parsing are combined for relation extraction. Based on the semi-supervised KNN algorithm, we extend the proposed unsupervised approach to a semi-supervised approach by combining pattern clustering, dependency parsing and phrase structure parsing rules. We evaluated the approaches on two different tasks: (1 Protein-protein interactions extraction, and (2 Gene-suicide association extraction. The evaluation of task (1 on the benchmark dataset (AImed corpus showed that our proposed unsupervised approach outperformed three supervised methods. The three supervised methods are rule based, SVM based, and Kernel based separately. The proposed semi-supervised approach is superior to the existing semi-supervised methods. The evaluation on gene-suicide association extraction on a smaller dataset from Genetic Association Database and a larger dataset from publicly available PubMed showed that the proposed unsupervised and semi-supervised methods achieved much higher F-scores than co-occurrence based method.

  2. PubMedPortable: A Framework for Supporting the Development of Text Mining Applications.

    Science.gov (United States)

    Döring, Kersten; Grüning, Björn A; Telukunta, Kiran K; Thomas, Philippe; Günther, Stefan

    2016-01-01

    Information extraction from biomedical literature is continuously growing in scope and importance. Many tools exist that perform named entity recognition, e.g. of proteins, chemical compounds, and diseases. Furthermore, several approaches deal with the extraction of relations between identified entities. The BioCreative community supports these developments with yearly open challenges, which led to a standardised XML text annotation format called BioC. PubMed provides access to the largest open biomedical literature repository, but there is no unified way of connecting its data to natural language processing tools. Therefore, an appropriate data environment is needed as a basis to combine different software solutions and to develop customised text mining applications. PubMedPortable builds a relational database and a full text index on PubMed citations. It can be applied either to the complete PubMed data set or an arbitrary subset of downloaded PubMed XML files. The software provides the infrastructure to combine stand-alone applications by exporting different data formats, e.g. BioC. The presented workflows show how to use PubMedPortable to retrieve, store, and analyse a disease-specific data set. The provided use cases are well documented in the PubMedPortable wiki. The open-source software library is small, easy to use, and scalable to the user's system requirements. It is freely available for Linux on the web at https://github.com/KerstenDoering/PubMedPortable and for other operating systems as a virtual container. The approach was tested extensively and applied successfully in several projects.

  3. Instructional Uses of Web-Based Survey Software

    Directory of Open Access Journals (Sweden)

    Concetta A. DePaolo, Ph.D.

    2006-07-01

    Full Text Available Recent technological advances have led to changes in how instruction is delivered. Such technology can create opportunities to enhance instruction and make instructors more efficient in performing instructional tasks, especially if the technology is easy to use and requires no training. One such technology, web-based survey software, is extremely accessible for anyone with basic computer skills. Web-based survey software can be used for a variety of instructional purposes to streamline instructor tasks, as well as enhance instruction and communication with students. Following a brief overview of the technology, we discuss how Web Forms from nTreePoint can be used to conduct instructional surveys, collect course feedback, conduct peer evaluations of group work, collect completed assignments, schedule meeting times among multiple people, and aid in pedagogical research. We also discuss our experiences with these tasks within traditional on-campus courses and how they were enhanced or expedited by the use of web-based survey software.

  4. Text Mining for Drugs and Chemical Compounds: Methods, Tools and Applications.

    Science.gov (United States)

    Vazquez, Miguel; Krallinger, Martin; Leitner, Florian; Valencia, Alfonso

    2011-06-01

    Providing prior knowledge about biological properties of chemicals, such as kinetic values, protein targets, or toxic effects, can facilitate many aspects of drug development. Chemical information is rapidly accumulating in all sorts of free text documents like patents, industry reports, or scientific articles, which has motivated the development of specifically tailored text mining applications. Despite the potential gains, chemical text mining still faces significant challenges. One of the most salient is the recognition of chemical entities mentioned in text. To help practitioners contribute to this area, a good portion of this review is devoted to this issue, and presents the basic concepts and principles underlying the main strategies. The technical details are introduced and accompanied by relevant bibliographic references. Other tasks discussed are retrieving relevant articles, identifying relationships between chemicals and other entities, or determining the chemical structures of chemicals mentioned in text. This review also introduces a number of published applications that can be used to build pipelines in topics like drug side effects, toxicity, and protein-disease-compound network analysis. We conclude the review with an outlook on how we expect the field to evolve, discussing its possibilities and its current limitations. Copyright © 2011 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim.

  5. Mining free-text medical records for companion animal enteric syndrome surveillance.

    Science.gov (United States)

    Anholt, R M; Berezowski, J; Jamal, I; Ribble, C; Stephen, C

    2014-03-01

    Large amounts of animal health care data are present in veterinary electronic medical records (EMR) and they present an opportunity for companion animal disease surveillance. Veterinary patient records are largely in free-text without clinical coding or fixed vocabulary. Text-mining, a computer and information technology application, is needed to identify cases of interest and to add structure to the otherwise unstructured data. In this study EMR's were extracted from veterinary management programs of 12 participating veterinary practices and stored in a data warehouse. Using commercially available text-mining software (WordStat™), we developed a categorization dictionary that could be used to automatically classify and extract enteric syndrome cases from the warehoused electronic medical records. The diagnostic accuracy of the text-miner for retrieving cases of enteric syndrome was measured against human reviewers who independently categorized a random sample of 2500 cases as enteric syndrome positive or negative. Compared to the reviewers, the text-miner retrieved cases with enteric signs with a sensitivity of 87.6% (95%CI, 80.4-92.9%) and a specificity of 99.3% (95%CI, 98.9-99.6%). Automatic and accurate detection of enteric syndrome cases provides an opportunity for community surveillance of enteric pathogens in companion animals. Copyright © 2014 Elsevier B.V. All rights reserved.

  6. A survey of text clustering techniques used for web mining

    Directory of Open Access Journals (Sweden)

    Dan MUNTEANU

    2005-12-01

    Full Text Available This paper contains an overview of basic formulations and approaches to clustering. Then it presents two important clustering paradigms: a bottom-up agglomerative technique, which collects similar documents into larger and larger groups, and a top-down partitioning technique, which divides a corpus into topic-oriented partitions.

  7. Text Mining Effectively Scores and Ranks the Literature for Improving Chemical-Gene-Disease Curation at the Comparative Toxicogenomics Database

    Science.gov (United States)

    Johnson, Robin J.; Lay, Jean M.; Lennon-Hopkins, Kelley; Saraceni-Richards, Cynthia; Sciaky, Daniela; Murphy, Cynthia Grondin; Mattingly, Carolyn J.

    2013-01-01

    The Comparative Toxicogenomics Database (CTD; http://ctdbase.org/) is a public resource that curates interactions between environmental chemicals and gene products, and their relationships to diseases, as a means of understanding the effects of environmental chemicals on human health. CTD provides a triad of core information in the form of chemical-gene, chemical-disease, and gene-disease interactions that are manually curated from scientific articles. To increase the efficiency, productivity, and data coverage of manual curation, we have leveraged text mining to help rank and prioritize the triaged literature. Here, we describe our text-mining process that computes and assigns each article a document relevancy score (DRS), wherein a high DRS suggests that an article is more likely to be relevant for curation at CTD. We evaluated our process by first text mining a corpus of 14,904 articles triaged for seven heavy metals (cadmium, cobalt, copper, lead, manganese, mercury, and nickel). Based upon initial analysis, a representative subset corpus of 3,583 articles was then selected from the 14,094 articles and sent to five CTD biocurators for review. The resulting curation of these 3,583 articles was analyzed for a variety of parameters, including article relevancy, novel data content, interaction yield rate, mean average precision, and biological and toxicological interpretability. We show that for all measured parameters, the DRS is an effective indicator for scoring and improving the ranking of literature for the curation of chemical-gene-disease information at CTD. Here, we demonstrate how fully incorporating text mining-based DRS scoring into our curation pipeline enhances manual curation by prioritizing more relevant articles, thereby increasing data content, productivity, and efficiency. PMID:23613709

  8. Web-based interventions in nursing.

    Science.gov (United States)

    Im, Eun-Ok; Chang, Sun Ju

    2013-02-01

    With recent advances in computer and Internet technologies and high funding priority on technological aspects of nursing research, researchers at the field level began to develop, use, and test various types of Web-based interventions. Despite high potential impacts of Web-based interventions, little is still known about Web-based interventions in nursing. In this article, to identify strengths and weaknesses of Web-based nursing interventions, a literature review was conducted using multiple databases with combined keywords of "online," "Internet" or "Web," "intervention," and "nursing." A total of 95 articles were retrieved through the databases and sorted by research topics. These articles were then analyzed to identify strengths and weaknesses of Web-based interventions in nursing. A strength of the Web-based interventions was their coverage of various content areas. In addition, many of them were theory-driven. They had advantages in their flexibility and comfort. They could provide consistency in interventions and require less cost in the intervention implementation. However, Web-based intervention studies had selected participants. They lacked controllability and had high dropouts. They required technical expertise and high development costs. Based on these findings, directions for future Web-based intervention research were provided.

  9. The potential of text mining in data integration and network biology for plant research: a case study on Arabidopsis.

    Science.gov (United States)

    Van Landeghem, Sofie; De Bodt, Stefanie; Drebert, Zuzanna J; Inzé, Dirk; Van de Peer, Yves

    2013-03-01

    Despite the availability of various data repositories for plant research, a wealth of information currently remains hidden within the biomolecular literature. Text mining provides the necessary means to retrieve these data through automated processing of texts. However, only recently has advanced text mining methodology been implemented with sufficient computational power to process texts at a large scale. In this study, we assess the potential of large-scale text mining for plant biology research in general and for network biology in particular using a state-of-the-art text mining system applied to all PubMed abstracts and PubMed Central full texts. We present extensive evaluation of the textual data for Arabidopsis thaliana, assessing the overall accuracy of this new resource for usage in plant network analyses. Furthermore, we combine text mining information with both protein-protein and regulatory interactions from experimental databases. Clusters of tightly connected genes are delineated from the resulting network, illustrating how such an integrative approach is essential to grasp the current knowledge available for Arabidopsis and to uncover gene information through guilt by association. All large-scale data sets, as well as the manually curated textual data, are made publicly available, hereby stimulating the application of text mining data in future plant biology studies.

  10. Improving Web Learning through model Optimization using Bootstrap for a Tour-Guide Robot

    Directory of Open Access Journals (Sweden)

    Rafael León

    2012-09-01

    Full Text Available We perform a review of Web Mining techniques and we describe a Bootstrap Statistics methodology applied to pattern model classifier optimization and verification for Supervised Learning for Tour-Guide Robot knowledge repository management. It is virtually impossible to test thoroughly Web Page Classifiers and many other Internet Applications with pure empirical data, due to the need for human intervention to generate training sets and test sets. We propose using the computer-based Bootstrap paradigm to design a test environment where they are checked with better reliability

  11. Evaluating a Bilingual Text-Mining System with a Taxonomy of Key Words and Hierarchical Visualization for Understanding Learner-Generated Text

    Science.gov (United States)

    Kong, Siu Cheung; Li, Ping; Song, Yanjie

    2018-01-01

    This study evaluated a bilingual text-mining system, which incorporated a bilingual taxonomy of key words and provided hierarchical visualization, for understanding learner-generated text in the learning management systems through automatic identification and counting of matching key words. A class of 27 in-service teachers studied a course…

  12. Semantic Web and Contextual Information: Semantic Network Analysis of Online Journalistic Texts

    Science.gov (United States)

    Lim, Yon Soo

    This study examines why contextual information is important to actualize the idea of semantic web, based on a case study of a socio-political issue in South Korea. For this study, semantic network analyses were conducted regarding English-language based 62 blog posts and 101 news stories on the web. The results indicated the differences of the meaning structures between blog posts and professional journalism as well as between conservative journalism and progressive journalism. From the results, this study ascertains empirical validity of current concerns about the practical application of the new web technology, and discusses how the semantic web should be developed.

  13. Text Mining the History of Medicine.

    Science.gov (United States)

    Thompson, Paul; Batista-Navarro, Riza Theresa; Kontonatsios, Georgios; Carter, Jacob; Toon, Elizabeth; McNaught, John; Timmermann, Carsten; Worboys, Michael; Ananiadou, Sophia

    2016-01-01

    Historical text archives constitute a rich and diverse source of information, which is becoming increasingly readily accessible, due to large-scale digitisation efforts. However, it can be difficult for researchers to explore and search such large volumes of data in an efficient manner. Text mining (TM) methods can help, through their ability to recognise various types of semantic information automatically, e.g., instances of concepts (places, medical conditions, drugs, etc.), synonyms/variant forms of concepts, and relationships holding between concepts (which drugs are used to treat which medical conditions, etc.). TM analysis allows search systems to incorporate functionality such as automatic suggestions of synonyms of user-entered query terms, exploration of different concepts mentioned within search results or isolation of documents in which concepts are related in specific ways. However, applying TM methods to historical text can be challenging, according to differences and evolutions in vocabulary, terminology, language structure and style, compared to more modern text. In this article, we present our efforts to overcome the various challenges faced in the semantic analysis of published historical medical text dating back to the mid 19th century. Firstly, we used evidence from diverse historical medical documents from different periods to develop new resources that provide accounts of the multiple, evolving ways in which concepts, their variants and relationships amongst them may be expressed. These resources were employed to support the development of a modular processing pipeline of TM tools for the robust detection of semantic information in historical medical documents with varying characteristics. We applied the pipeline to two large-scale medical document archives covering wide temporal ranges as the basis for the development of a publicly accessible semantically-oriented search system. The novel resources are available for research purposes, while

  14. Text Mining Untuk Analisis Sentimen Review Film Menggunakan Algoritma K-Means

    OpenAIRE

    Setyo Budi

    2017-01-01

    Kemudahan manusia didalam menggunakan website mengakibatkan bertambahnya dokumen teks yang berupa pendapat dan informasi. Dalam waktu yang lama dokumen teks akan bertambah besar. Text mining merupakan salah satu teknik yang digunakan untuk menggali kumpulan dokumen text sehingga dapat diambil intisarinya. Ada beberapa algoritma yang di gunakan untuk penggalian dokumen untuk analisis sentimen, salah satunya adalah K-Means. Didalam penelitian ini algoritma yang digunakan adalah K-Means. Hasil p...

  15. Recent Advances and Emerging Applications in Text and Data Mining for Biomedical Discovery.

    Science.gov (United States)

    Gonzalez, Graciela H; Tahsin, Tasnia; Goodale, Britton C; Greene, Anna C; Greene, Casey S

    2016-01-01

    Precision medicine will revolutionize the way we treat and prevent disease. A major barrier to the implementation of precision medicine that clinicians and translational scientists face is understanding the underlying mechanisms of disease. We are starting to address this challenge through automatic approaches for information extraction, representation and analysis. Recent advances in text and data mining have been applied to a broad spectrum of key biomedical questions in genomics, pharmacogenomics and other fields. We present an overview of the fundamental methods for text and data mining, as well as recent advances and emerging applications toward precision medicine. © The Author 2015. Published by Oxford University Press.

  16. A Dynamic Recommender System for Improved Web Usage Mining and CRM Using Swarm Intelligence.

    Science.gov (United States)

    Alphy, Anna; Prabakaran, S

    2015-01-01

    In modern days, to enrich e-business, the websites are personalized for each user by understanding their interests and behavior. The main challenges of online usage data are information overload and their dynamic nature. In this paper, to address these issues, a WebBluegillRecom-annealing dynamic recommender system that uses web usage mining techniques in tandem with software agents developed for providing dynamic recommendations to users that can be used for customizing a website is proposed. The proposed WebBluegillRecom-annealing dynamic recommender uses swarm intelligence from the foraging behavior of a bluegill fish. It overcomes the information overload by handling dynamic behaviors of users. Our dynamic recommender system was compared against traditional collaborative filtering systems. The results show that the proposed system has higher precision, coverage, F1 measure, and scalability than the traditional collaborative filtering systems. Moreover, the recommendations given by our system overcome the overspecialization problem by including variety in recommendations.

  17. Onebox: Free-Text Interfaces as an Alternative to Complex Web Forms

    NARCIS (Netherlands)

    Tjin-Kam-Jet, Kien; Trieschnigg, Rudolf Berend; Hiemstra, Djoerd

    2011-01-01

    This paper investigates the problem of translating free-text queries into key-value pairs as an alternative means for searching `behind' web forms. We introduce a novel specication language for specifying free-text interfaces, and report the results of a user study where we evaluated our prototype

  18. Mining the Social Web Analyzing Data from Facebook, Twitter, LinkedIn, and Other Social Media Sites

    CERN Document Server

    Russell, Matthew

    2011-01-01

    Want to tap the tremendous amount of valuable social data in Facebook, Twitter, LinkedIn, and Google+? This refreshed edition helps you discover who's making connections with social media, what they're talking about, and where they're located. You'll learn how to combine social web data, analysis techniques, and visualization to find what you've been looking for in the social haystack-as well as useful information you didn't know existed. Each standalone chapter introduces techniques for mining data in different areas of the social Web, including blogs and email. All you need to get started

  19. Genomics Portals: integrative web-platform for mining genomics data

    Directory of Open Access Journals (Sweden)

    Ghosh Krishnendu

    2010-01-01

    Full Text Available Abstract Background A large amount of experimental data generated by modern high-throughput technologies is available through various public repositories. Our knowledge about molecular interaction networks, functional biological pathways and transcriptional regulatory modules is rapidly expanding, and is being organized in lists of functionally related genes. Jointly, these two sources of information hold a tremendous potential for gaining new insights into functioning of living systems. Results Genomics Portals platform integrates access to an extensive knowledge base and a large database of human, mouse, and rat genomics data with basic analytical visualization tools. It provides the context for analyzing and interpreting new experimental data and the tool for effective mining of a large number of publicly available genomics datasets stored in the back-end databases. The uniqueness of this platform lies in the volume and the diversity of genomics data that can be accessed and analyzed (gene expression, ChIP-chip, ChIP-seq, epigenomics, computationally predicted binding sites, etc, and the integration with an extensive knowledge base that can be used in such analysis. Conclusion The integrated access to primary genomics data, functional knowledge and analytical tools makes Genomics Portals platform a unique tool for interpreting results of new genomics experiments and for mining the vast amount of data stored in the Genomics Portals backend databases. Genomics Portals can be accessed and used freely at http://GenomicsPortals.org.

  20. Development and testing of a text-mining approach to analyse patients' comments on their experiences of colorectal cancer care.

    Science.gov (United States)

    Wagland, Richard; Recio-Saucedo, Alejandra; Simon, Michael; Bracher, Michael; Hunt, Katherine; Foster, Claire; Downing, Amy; Glaser, Adam; Corner, Jessica

    2016-08-01

    Quality of cancer care may greatly impact on patients' health-related quality of life (HRQoL). Free-text responses to patient-reported outcome measures (PROMs) provide rich data but analysis is time and resource-intensive. This study developed and tested a learning-based text-mining approach to facilitate analysis of patients' experiences of care and develop an explanatory model illustrating impact on HRQoL. Respondents to a population-based survey of colorectal cancer survivors provided free-text comments regarding their experience of living with and beyond cancer. An existing coding framework was tested and adapted, which informed learning-based text mining of the data. Machine-learning algorithms were trained to identify comments relating to patients' specific experiences of service quality, which were verified by manual qualitative analysis. Comparisons between coded retrieved comments and a HRQoL measure (EQ5D) were explored. The survey response rate was 63.3% (21 802/34 467), of which 25.8% (n=5634) participants provided free-text comments. Of retrieved comments on experiences of care (n=1688), over half (n=1045, 62%) described positive care experiences. Most negative experiences concerned a lack of post-treatment care (n=191, 11% of retrieved comments) and insufficient information concerning self-management strategies (n=135, 8%) or treatment side effects (n=160, 9%). Associations existed between HRQoL scores and coded algorithm-retrieved comments. Analysis indicated that the mechanism by which service quality impacted on HRQoL was the extent to which services prevented or alleviated challenges associated with disease and treatment burdens. Learning-based text mining techniques were found useful and practical tools to identify specific free-text comments within a large dataset, facilitating resource-efficient qualitative analysis. This method should be considered for future PROM analysis to inform policy and practice. Study findings indicated that

  1. Texting and accessing the web while driving: traffic citations and crashes among young adult drivers.

    Science.gov (United States)

    Cook, Jerry L; Jones, Randall M

    2011-12-01

    We examined relations between young adult texting and accessing the web while driving with driving outcomes (viz. crashes and traffic citations). Our premise is that engaging in texting and accessing the web while driving is not only distracting but that these activities represent a pattern of behavior that leads to an increase in unwanted outcomes, such as crashes and citations. College students (N = 274) on 3 campuses (one in California and 2 in Utah) completed an electronic questionnaire regarding their driving experience and cell phone use. Our data indicate that 3 out of 4 (74.3%) young adults engage in texting while driving, over half on a weekly basis (51.8%), and some engage in accessing the web while driving (16.8%). Data analysis revealed a relationship between these cell phone behaviors and traffic citations and crashes. The findings support Jessor and Jessor's (1977) "problem behavior syndrome" by showing that traffic citations are related to texting and accessing the web while driving and that crashes are related to accessing the web while driving. Limitations and recommendations are discussed.

  2. A Text Matching Method to Facilitate the Validation of Frequent Order Sets Obtained Through Data Mining

    OpenAIRE

    Che, Chengjian; Rocha, Roberto A.

    2006-01-01

    In order to compare order sets discovered using a data mining algorithm with existing order sets, we developed an order matching tool based on Oracle Text. The tool includes both automated searching and manual review processes. The comparison between the automated process and the manual review process indicates that the sensitivity of the automated matching is 81% and the specificity is 84%.

  3. Separate but Equal? A Comparison of Content on Library Web Pages and Their Text Versions

    Science.gov (United States)

    Hazard, Brenda L.

    2008-01-01

    This study examines the Web sites of the Association of Research Libraries member libraries to determine the presence of a separate text version of the default graphical homepage. The content of the text version and the homepage is compared. Of 121 Web sites examined, twenty libraries currently offer a text version. Ten sites maintain wholly…

  4. Improving the web site's effectiveness by considering each page's temporal information

    NARCIS (Netherlands)

    Li, ZG; Sun, MT; Dunham, MH; Xiao, YQ; Dong, G; Tang, C; Wang, W

    2003-01-01

    Improving the effectiveness of a web site is always one of its owner's top concerns. By focusing on analyzing web users' visiting behavior, web mining researchers have developed a variety of helpful methods, based upon association rules, clustering, prediction and so on. However, we have found

  5. Data Mining of Acupoint Characteristics from the Classical Medical Text: DongUiBoGam of Korean Medicine

    Directory of Open Access Journals (Sweden)

    Taehyung Lee

    2014-01-01

    Full Text Available Throughout the history of East Asian medicine, different kinds of acupuncture treatment experiences have been accumulated in classical medical texts. Reexamining knowledge from classical medical texts is expected to provide meaningful information that could be utilized in current medical practices. In this study, we used data mining methods to analyze the association between acupoints and patterns of disorder with the classical medical book DongUiBoGam of Korean medicine. Using the term frequency-inverse document frequency (tf-idf method, we quantified the significance of acupoints to its targeting patterns and, conversely, the significance of patterns to acupoints. Through these processes, we extracted characteristics of each acupoint based on its treating patterns. We also drew practical information for selecting acupoints on certain patterns according to their association. Data analysis on DongUiBoGam’s acupuncture treatment gave us an insight into the main idea of DongUiBoGam. We strongly believe that our approach can provide a novel understanding of unknown characteristics of acupoint and pattern identification from the classical medical text using data mining methods.

  6. tmBioC: improving interoperability of text-mining tools with BioC.

    Science.gov (United States)

    Khare, Ritu; Wei, Chih-Hsuan; Mao, Yuqing; Leaman, Robert; Lu, Zhiyong

    2014-01-01

    The lack of interoperability among biomedical text-mining tools is a major bottleneck in creating more complex applications. Despite the availability of numerous methods and techniques for various text-mining tasks, combining different tools requires substantial efforts and time owing to heterogeneity and variety in data formats. In response, BioC is a recent proposal that offers a minimalistic approach to tool interoperability by stipulating minimal changes to existing tools and applications. BioC is a family of XML formats that define how to present text documents and annotations, and also provides easy-to-use functions to read/write documents in the BioC format. In this study, we introduce our text-mining toolkit, which is designed to perform several challenging and significant tasks in the biomedical domain, and repackage the toolkit into BioC to enhance its interoperability. Our toolkit consists of six state-of-the-art tools for named-entity recognition, normalization and annotation (PubTator) of genes (GenNorm), diseases (DNorm), mutations (tmVar), species (SR4GN) and chemicals (tmChem). Although developed within the same group, each tool is designed to process input articles and output annotations in a different format. We modify these tools and enable them to read/write data in the proposed BioC format. We find that, using the BioC family of formats and functions, only minimal changes were required to build the newer versions of the tools. The resulting BioC wrapped toolkit, which we have named tmBioC, consists of our tools in BioC, an annotated full-text corpus in BioC, and a format detection and conversion tool. Furthermore, through participation in the 2013 BioCreative IV Interoperability Track, we empirically demonstrate that the tools in tmBioC can be more efficiently integrated with each other as well as with external tools: Our experimental results show that using BioC reduces >60% in lines of code for text-mining tool integration. The tmBioC toolkit

  7. SeMPI: a genome-based secondary metabolite prediction and identification web server.

    Science.gov (United States)

    Zierep, Paul F; Padilla, Natàlia; Yonchev, Dimitar G; Telukunta, Kiran K; Klementz, Dennis; Günther, Stefan

    2017-07-03

    The secondary metabolism of bacteria, fungi and plants yields a vast number of bioactive substances. The constantly increasing amount of published genomic data provides the opportunity for an efficient identification of gene clusters by genome mining. Conversely, for many natural products with resolved structures, the encoding gene clusters have not been identified yet. Even though genome mining tools have become significantly more efficient in the identification of biosynthetic gene clusters, structural elucidation of the actual secondary metabolite is still challenging, especially due to as yet unpredictable post-modifications. Here, we introduce SeMPI, a web server providing a prediction and identification pipeline for natural products synthesized by polyketide synthases of type I modular. In order to limit the possible structures of PKS products and to include putative tailoring reactions, a structural comparison with annotated natural products was introduced. Furthermore, a benchmark was designed based on 40 gene clusters with annotated PKS products. The web server of the pipeline (SeMPI) is freely available at: http://www.pharmaceutical-bioinformatics.de/sempi. © The Author(s) 2017. Published by Oxford University Press on behalf of Nucleic Acids Research.

  8. SVG-Based Web Publishing

    Science.gov (United States)

    Gao, Jerry Z.; Zhu, Eugene; Shim, Simon

    2003-01-01

    With the increasing applications of the Web in e-commerce, advertising, and publication, new technologies are needed to improve Web graphics technology due to the current limitation of technology. The SVG (Scalable Vector Graphics) technology is a new revolutionary solution to overcome the existing problems in the current web technology. It provides precise and high-resolution web graphics using plain text format commands. It sets a new standard for web graphic format to allow us to present complicated graphics with rich test fonts and colors, high printing quality, and dynamic layout capabilities. This paper provides a tutorial overview about SVG technology and its essential features, capability, and advantages. The reports a comparison studies between SVG and other web graphics technologies.

  9. Text-based language identification for the South African languages

    CSIR Research Space (South Africa)

    Botha, G

    2006-11-01

    Full Text Available -crawling ap- proach described in [2]. That method employed an early language-identification system for au- tomatic selection of Web pages, and turned out to suffer from two limitations, namely wrongly identified web pages and web pages with mixed text (i...

  10. Development of a Mine Rescue Drilling System (MRDS)

    Energy Technology Data Exchange (ETDEWEB)

    Raymond, David W. [Sandia National Lab. (SNL-NM), Albuquerque, NM (United States); Gaither, Katherine N. [Sandia National Lab. (SNL-NM), Albuquerque, NM (United States); Polsky, Yarom [Sandia National Lab. (SNL-NM), Albuquerque, NM (United States); Knudsen, Steven D. [Sandia National Lab. (SNL-NM), Albuquerque, NM (United States); Broome, Scott Thomas [Sandia National Lab. (SNL-NM), Albuquerque, NM (United States); Su, Jiann-Cherng [Sandia National Lab. (SNL-NM), Albuquerque, NM (United States); Blankenship, Douglas A. [Sandia National Lab. (SNL-NM), Albuquerque, NM (United States); Costin, Laurence S. [Sandia National Lab. (SNL-NM), Albuquerque, NM (United States)

    2014-06-01

    Sandia National Laboratories (Sandia) has a long history in developing compact, mobile, very high-speed drilling systems and this technology could be applied to increasing the rate at which boreholes are drilled during a mine accident response. The present study reviews current technical approaches, primarily based on technology developed under other programs, analyzes mine rescue specific requirements to develop a conceptual mine rescue drilling approach, and finally, proposes development of a phased mine rescue drilling system (MRDS) that accomplishes (1) development of rapid drilling MRDS equipment; (2) structuring improved web communication through the Mine Safety & Health Administration (MSHA) web site; (3) development of an improved protocol for employment of existing drilling technology in emergencies; (4) deployment of advanced technologies to complement mine rescue drilling operations during emergency events; and (5) preliminary discussion of potential future technology development of specialized MRDS equipment. This phased approach allows for rapid fielding of a basic system for improved rescue drilling, with the ability to improve the system over time at a reasonable cost.

  11. Text mining factor analysis (TFA) in green tea patent data

    Science.gov (United States)

    Rahmawati, Sela; Suprijadi, Jadi; Zulhanif

    2017-03-01

    Factor analysis has become one of the most widely used multivariate statistical procedures in applied research endeavors across a multitude of domains. There are two main types of analyses based on factor analysis: Exploratory Factor Analysis (EFA) and Confirmatory Factor Analysis (CFA). Both EFA and CFA aim to observed relationships among a group of indicators with a latent variable, but they differ fundamentally, a priori and restrictions made to the factor model. This method will be applied to patent data technology sector green tea to determine the development technology of green tea in the world. Patent analysis is useful in identifying the future technological trends in a specific field of technology. Database patent are obtained from agency European Patent Organization (EPO). In this paper, CFA model will be applied to the nominal data, which obtain from the presence absence matrix. While doing processing, analysis CFA for nominal data analysis was based on Tetrachoric matrix. Meanwhile, EFA model will be applied on a title from sector technology dominant. Title will be pre-processing first using text mining analysis.

  12. Towards Development of Web-based Assessment System Based on Semantic Web Technology

    Directory of Open Access Journals (Sweden)

    Hosam Farouk El-Sofany

    2011-01-01

    Full Text Available The assessment process in an educational system is an important and primordial part of its success to assure the correct way of knowledge transmission and to ensure that students are working correctly and succeed to acquire the needed knowledge. In this study, we aim to include Semantic Web technologies in the E-learning process, as new components. We use Semantic Web (SW to: 1 support the evaluation of open questions in e-learning courses, 2 support the creation of questions and exams automatically, 3 support the evaluation of exams created by the system. These components should allow for measuring academic performance, providing feedback mechanisms, and improving participative and collaborative ideas. Our goal is to use Semantic Web and Wireless technologies to design and implement the assessment system that allows the students, to take: web-based tutorials, quizzes, free exercises, and exams, to download: course reviews, previous exams and their model answers, to access the system through the Mobile and take quick quizzes and exercises. The system facilitates generation of automatic, balanced, and different exam sheets that contain different types of questions covering the entire curriculum, and display gradually from easiness to difficulty. The system provides the teachers and administrators with several services such as: store different types of questions, generate exams with specific criteria, and upload course assignments, exams, and reviews.

  13. Towards the Development of Web-based Business intelligence Tools

    DEFF Research Database (Denmark)

    Georgiev, Lachezar; Tanev, Stoyan

    2011-01-01

    This paper focuses on using web search techniques in examining the co-creation strategies of technology driven firms. It does not focus on the co-creation results but describes the implementation of a software tool using data mining techniques to analyze the content on firms’ websites. The tool...

  14. The Effectiveness of Web-Based Instruction: An Initial Inquiry

    Directory of Open Access Journals (Sweden)

    Tatana M. Olson

    2002-10-01

    Full Text Available As the use of Web-based instruction increases in the educational and training domains, many people have recognized the importance of evaluating its effects on student outcomes such as learning, performance, and satisfaction. Often, these results are compared to those of conventional classroom instruction in order to determine which method is “better.” However, major differences in technology and presentation rather than instructional content can obscure the true relationship between Web-based instruction and these outcomes. Computer-based instruction (CBI, with more features similar to Web-based instruction, may be a more appropriate benchmark than conventional classroom instruction. Furthermore, there is little consensus as to what variables should be examined or what measures of learning are the most appropriate, making comparisons between studies difficult and inconclusive. In this article, we review the historical findings of CBI as an appropriate benchmark to Web-based instruction. In addition, we review 47 reports of evaluations of Web-based courses in higher education published between 1996 and 2002. A tabulation of the documented findings into eight characteristics is offered, along with our assessments of the experimental designs, effect sizes, and the degree to which the evaluations incorporated features unique to Web-based instruction.

  15. Analysis of mesenchymal stem cell differentiation in vitro using classification association rule mining.

    Science.gov (United States)

    Wang, Weiqi; Wang, Yanbo Justin; Bañares-Alcántara, René; Coenen, Frans; Cui, Zhanfeng

    2009-12-01

    In this paper, data mining is used to analyze the data on the differentiation of mammalian Mesenchymal Stem Cells (MSCs), aiming at discovering known and hidden rules governing MSC differentiation, following the establishment of a web-based public database containing experimental data on the MSC proliferation and differentiation. To this effect, a web-based public interactive database comprising the key parameters which influence the fate and destiny of mammalian MSCs has been constructed and analyzed using Classification Association Rule Mining (CARM) as a data-mining technique. The results show that the proposed approach is technically feasible and performs well with respect to the accuracy of (classification) prediction. Key rules mined from the constructed MSC database are consistent with experimental observations, indicating the validity of the method developed and the first step in the application of data mining to the study of MSCs.

  16. CAI System with Multi-Media Text Through Web Browser for NC Lathe Programming

    Science.gov (United States)

    Mizugaki, Yoshio; Kikkawa, Koichi; Mizui, Masahiko; Kamijo, Keisuke

    A new Computer Aided Instruction (CAI) system for NC lathe programming has been developed with use of multi-media texts including movies, animations, pictures, sound and texts through Web browser. Although many CAI systems developed previously for NC programming consist of text-based instructions, it is difficult for beginners to learn NC programming with use of them. In the developed CAI system, multi-media texts are adopted for the help of users' understanding, and it is available through Web browser anytime and anywhere. Also the error log is automatically recorded for the future references. According to the NC programming coded by a user, the movement of the NC lathe is animated and shown in the monitor screen in front of the user. If its movement causes the collision between a cutting tool and the lathe, some sound and the caution remark are generated. If the user makes mistakes some times at a certain stage in learning NC, the corresponding suggestion is shown in the form of movies, animations, and so forth. By using the multimedia texts, users' attention is kept concentrated during a training course. In this paper, the configuration of the CAI system is explained and the actual procedures for users to learn the NC programming are also explained too. Some beginners tested this CAI system and their results are illustrated and discussed from the viewpoint of the efficiency and usefulness of this CAI system. A brief conclusion is also mentioned.

  17. An ultrasonic-based localization system for underground mines

    CSIR Research Space (South Africa)

    Jordaan, JP

    2017-07-01

    Full Text Available -based localization system for underground mines 2017 IEEE 15th International Conference on Industrial Informatics (INDIN), 24-26 July 2017, Emden, Germany JP Jordaan, CP Kruger, BJ Silva and GP Hancke Abstract: Localization is important for a wide range...

  18. DEVELOPING A WEB-BASED PPGIS, AS AN ENVIRONMENTAL REPORTING SERVICE

    Directory of Open Access Journals (Sweden)

    N. Ranjbar Nooshery

    2017-09-01

    Full Text Available Today municipalities are searching for new tools to empower locals for changing the future of their own areas by increasing their participation in different levels of urban planning. These tools should involve the community in planning process using participatory approaches instead of long traditional top-down planning models and help municipalities to obtain proper insight about major problems of urban neighborhoods from the residents’ point of view. In this matter, public participation GIS (PPGIS which enables citizens to record and following up their feeling and spatial knowledge regarding problems of the city in the form of maps have been introduced. In this research, a tool entitled CAER (Collecting & Analyzing of Environmental Reports is developed. In the first step, a software framework based on Web-GIS tool, called EPGIS (Environmental Participatory GIS has been designed to support public participation in reporting urban environmental problems and to facilitate data flow between citizens and municipality. A web-based cartography tool was employed for geo-visualization and dissemination of map-based reports. In the second step of CAER, a subsystem is developed based on SOLAP (Spatial On-Line Analytical Processing, as a data mining tools to elicit the local knowledge facilitating bottom-up urban planning practices and to help urban managers to find hidden relations among the recorded reports. This system is implemented in a case study area in Boston, Massachusetts and its usability was evaluated. The CAER should be considered as bottom-up planning tools to collect people’s problems and views about their neighborhood and transmits them to the city officials. It also helps urban planners to find solutions for better management from citizen’s viewpoint and gives them this chance to develop good plans to the neighborhoods that should be satisfied the citizens.

  19. Data mining of text as a tool in authorship attribution

    Science.gov (United States)

    Visa, Ari J. E.; Toivonen, Jarmo; Autio, Sami; Maekinen, Jarno; Back, Barbro; Vanharanta, Hannu

    2001-03-01

    It is common that text documents are characterized and classified by keywords that the authors use to give them. Visa et al. have developed a new methodology based on prototype matching. The prototype is an interesting document or a part of an extracted, interesting text. This prototype is matched with the document database of the monitored document flow. The new methodology is capable of extracting the meaning of the document in a certain degree. Our claim is that the new methodology is also capable of authenticating the authorship. To verify this claim two tests were designed. The test hypothesis was that the words and the word order in the sentences could authenticate the author. In the first test three authors were selected. The selected authors were William Shakespeare, Edgar Allan Poe, and George Bernard Shaw. Three texts from each author were examined. Every text was one by one used as a prototype. The two nearest matches with the prototype were noted. The second test uses the Reuters-21578 financial news database. A group of 25 short financial news reports from five different authors are examined. Our new methodology and the interesting results from the two tests are reported in this paper. In the first test, for Shakespeare and for Poe all cases were successful. For Shaw one text was confused with Poe. In the second test the Reuters-21578 financial news were identified by the author relatively well. The resolution is that our text mining methodology seems to be capable of authorship attribution.

  20. Complementing the Numbers: A Text Mining Analysis of College Course Withdrawals

    Science.gov (United States)

    Michalski, Greg V.

    2011-01-01

    Excessive college course withdrawals are costly to the student and the institution in terms of time to degree completion, available classroom space, and other resources. Although generally well quantified, detailed analysis of the reasons given by students for course withdrawal is less common. To address this, a text mining analysis was performed…

  1. TIME SERIES ANALYSIS ON STOCK MARKET FOR TEXT MINING CORRELATION OF ECONOMY NEWS

    Directory of Open Access Journals (Sweden)

    Sadi Evren SEKER

    2014-01-01

    Full Text Available This paper proposes an information retrieval methodfor the economy news. Theeffect of economy news, are researched in the wordlevel and stock market valuesare considered as the ground proof.The correlation between stock market prices and economy news is an already ad-dressed problem for most of the countries. The mostwell-known approach is ap-plying the text mining approaches to the news and some time series analysis tech-niques over stock market closing values in order toapply classification or cluster-ing algorithms over the features extracted. This study goes further and tries to askthe question what are the available time series analysis techniques for the stockmarket closing values and which one is the most suitable? In this study, the newsand their dates are collected into a database and text mining is applied over thenews, the text mining part has been kept simple with only term frequency – in-verse document frequency method. For the time series analysis part, we havestudied 10 different methods such as random walk, moving average, acceleration,Bollinger band, price rate of change, periodic average, difference, momentum orrelative strength index and their variation. In this study we have also explainedthese techniques in a comparative way and we have applied the methods overTurkish Stock Market closing values for more than a2 year period. On the otherhand, we have applied the term frequency – inversedocument frequency methodon the economy news of one of the high-circulatingnewspapers in Turkey.

  2. Informing child welfare policy and practice: using knowledge discovery and data mining technology via a dynamic Web site.

    Science.gov (United States)

    Duncan, Dean F; Kum, Hye-Chung; Weigensberg, Elizabeth Caplick; Flair, Kimberly A; Stewart, C Joy

    2008-11-01

    Proper management and implementation of an effective child welfare agency requires the constant use of information about the experiences and outcomes of children involved in the system, emphasizing the need for comprehensive, timely, and accurate data. In the past 20 years, there have been many advances in technology that can maximize the potential of administrative data to promote better evaluation and management in the field of child welfare. Specifically, this article discusses the use of knowledge discovery and data mining (KDD), which makes it possible to create longitudinal data files from administrative data sources, extract valuable knowledge, and make the information available via a user-friendly public Web site. This article demonstrates a successful project in North Carolina where knowledge discovery and data mining technology was used to develop a comprehensive set of child welfare outcomes available through a public Web site to facilitate information sharing of child welfare data to improve policy and practice.

  3. Arabic web pages clustering and annotation using semantic class features

    Directory of Open Access Journals (Sweden)

    Hanan M. Alghamdi

    2014-12-01

    Full Text Available To effectively manage the great amount of data on Arabic web pages and to enable the classification of relevant information are very important research problems. Studies on sentiment text mining have been very limited in the Arabic language because they need to involve deep semantic processing. Therefore, in this paper, we aim to retrieve machine-understandable data with the help of a Web content mining technique to detect covert knowledge within these data. We propose an approach to achieve clustering with semantic similarities. This approach comprises integrating k-means document clustering with semantic feature extraction and document vectorization to group Arabic web pages according to semantic similarities and then show the semantic annotation. The document vectorization helps to transform text documents into a semantic class probability distribution or semantic class density. To reach semantic similarities, the approach extracts the semantic class features and integrates them into the similarity weighting schema. The quality of the clustering result has evaluated the use of the purity and the mean intra-cluster distance (MICD evaluation measures. We have evaluated the proposed approach on a set of common Arabic news web pages. We have acquired favorable clustering results that are effective in minimizing the MICD, expanding the purity and lowering the runtime.

  4. PLAN: a web platform for automating high-throughput BLAST searches and for managing and mining results.

    Science.gov (United States)

    He, Ji; Dai, Xinbin; Zhao, Xuechun

    2007-02-09

    BLAST searches are widely used for sequence alignment. The search results are commonly adopted for various functional and comparative genomics tasks such as annotating unknown sequences, investigating gene models and comparing two sequence sets. Advances in sequencing technologies pose challenges for high-throughput analysis of large-scale sequence data. A number of programs and hardware solutions exist for efficient BLAST searching, but there is a lack of generic software solutions for mining and personalized management of the results. Systematically reviewing the results and identifying information of interest remains tedious and time-consuming. Personal BLAST Navigator (PLAN) is a versatile web platform that helps users to carry out various personalized pre- and post-BLAST tasks, including: (1) query and target sequence database management, (2) automated high-throughput BLAST searching, (3) indexing and searching of results, (4) filtering results online, (5) managing results of personal interest in favorite categories, (6) automated sequence annotation (such as NCBI NR and ontology-based annotation). PLAN integrates, by default, the Decypher hardware-based BLAST solution provided by Active Motif Inc. with a greatly improved efficiency over conventional BLAST software. BLAST results are visualized by spreadsheets and graphs and are full-text searchable. BLAST results and sequence annotations can be exported, in part or in full, in various formats including Microsoft Excel and FASTA. Sequences and BLAST results are organized in projects, the data publication levels of which are controlled by the registered project owners. In addition, all analytical functions are provided to public users without registration. PLAN has proved a valuable addition to the community for automated high-throughput BLAST searches, and, more importantly, for knowledge discovery, management and sharing based on sequence alignment results. The PLAN web interface is platform

  5. Sentiment analysis of Arabic tweets using text mining techniques

    Science.gov (United States)

    Al-Horaibi, Lamia; Khan, Muhammad Badruddin

    2016-07-01

    Sentiment analysis has become a flourishing field of text mining and natural language processing. Sentiment analysis aims to determine whether the text is written to express positive, negative, or neutral emotions about a certain domain. Most sentiment analysis researchers focus on English texts, with very limited resources available for other complex languages, such as Arabic. In this study, the target was to develop an initial model that performs satisfactorily and measures Arabic Twitter sentiment by using machine learning approach, Naïve Bayes and Decision Tree for classification algorithms. The datasets used contains more than 2,000 Arabic tweets collected from Twitter. We performed several experiments to check the performance of the two algorithms classifiers using different combinations of text-processing functions. We found that available facilities for Arabic text processing need to be made from scratch or improved to develop accurate classifiers. The small functionalities developed by us in a Python language environment helped improve the results and proved that sentiment analysis in the Arabic domain needs lot of work on the lexicon side.

  6. Biomedical text mining for research rigor and integrity: tasks, challenges, directions.

    Science.gov (United States)

    Kilicoglu, Halil

    2017-06-13

    An estimated quarter of a trillion US dollars is invested in the biomedical research enterprise annually. There is growing alarm that a significant portion of this investment is wasted because of problems in reproducibility of research findings and in the rigor and integrity of research conduct and reporting. Recent years have seen a flurry of activities focusing on standardization and guideline development to enhance the reproducibility and rigor of biomedical research. Research activity is primarily communicated via textual artifacts, ranging from grant applications to journal publications. These artifacts can be both the source and the manifestation of practices leading to research waste. For example, an article may describe a poorly designed experiment, or the authors may reach conclusions not supported by the evidence presented. In this article, we pose the question of whether biomedical text mining techniques can assist the stakeholders in the biomedical research enterprise in doing their part toward enhancing research integrity and rigor. In particular, we identify four key areas in which text mining techniques can make a significant contribution: plagiarism/fraud detection, ensuring adherence to reporting guidelines, managing information overload and accurate citation/enhanced bibliometrics. We review the existing methods and tools for specific tasks, if they exist, or discuss relevant research that can provide guidance for future work. With the exponential increase in biomedical research output and the ability of text mining approaches to perform automatic tasks at large scale, we propose that such approaches can support tools that promote responsible research practices, providing significant benefits for the biomedical research enterprise. Published by Oxford University Press 2017. This work is written by a US Government employee and is in the public domain in the US.

  7. Fundamentals for Future Mobile-Health (mHealth): A Systematic Review of Mobile Phone and Web-Based Text Messaging in Mental Health.

    Science.gov (United States)

    Berrouiguet, Sofian; Baca-García, Enrique; Brandt, Sara; Walter, Michel; Courtet, Philippe

    2016-06-10

    Mobile phone text messages (short message service, SMS) are used pervasively as a form of communication. Almost 100% of the population uses text messaging worldwide and this technology is being suggested as a promising tool in psychiatry. Text messages can be sent either from a classic mobile phone or a web-based application. Reviews are needed to better understand how text messaging can be used in mental health care and other fields of medicine. The objective of the study was to review the literature regarding the use of mobile phone text messaging in mental health care. We conducted a thorough literature review of studies involving text messaging in health care management. Searches included PubMed, PsycINFO, Cochrane, Scopus, Embase and Web of Science databases on May 25, 2015. Studies reporting the use of text messaging as a tool in managing patients with mental health disorders were included. Given the heterogeneity of studies, this review was summarized using a descriptive approach. From 677 initial citations, 36 studies were included in the review. Text messaging was used in a wide range of mental health situations, notably substance abuse (31%), schizophrenia (22%), and affective disorders (17%). We identified four ways in which text messages were used: reminders (14%), information (17%), supportive messages (42%), and self-monitoring procedures (42%). Applications were sometimes combined. We report growing interest in text messaging since 2006. Text messages have been proposed as a health care tool in a wide spectrum of psychiatric disorders including substance abuse, schizophrenia, affective disorders, and suicide prevention. Most papers described pilot studies, while some randomized clinical trials (RCTs) were also reported. Overall, a positive attitude toward text messages was reported. RCTs reported improved treatment adherence and symptom surveillance. Other positive points included an increase in appointment attendance and in satisfaction with

  8. MouseMine: a new data warehouse for MGI.

    Science.gov (United States)

    Motenko, H; Neuhauser, S B; O'Keefe, M; Richardson, J E

    2015-08-01

    MouseMine (www.mousemine.org) is a new data warehouse for accessing mouse data from Mouse Genome Informatics (MGI). Based on the InterMine software framework, MouseMine supports powerful query, reporting, and analysis capabilities, the ability to save and combine results from different queries, easy integration into larger workflows, and a comprehensive Web Services layer. Through MouseMine, users can access a significant portion of MGI data in new and useful ways. Importantly, MouseMine is also a member of a growing community of online data resources based on InterMine, including those established by other model organism databases. Adopting common interfaces and collaborating on data representation standards are critical to fostering cross-species data analysis. This paper presents a general introduction to MouseMine, presents examples of its use, and discusses the potential for further integration into the MGI interface.

  9. Data pre-processing for web log mining: Case study of commercial bank website usage analysis

    Directory of Open Access Journals (Sweden)

    Jozef Kapusta

    2013-01-01

    Full Text Available We use data cleaning, integration, reduction and data conversion methods in the pre-processing level of data analysis. Data processing techniques improve the overall quality of the patterns mined. The paper describes using of standard pre-processing methods for preparing data of the commercial bank website in the form of the log file obtained from the web server. Data cleaning, as the simplest step of data pre-processing, is non–trivial as the analysed content is highly specific. We had to deal with the problem of frequent changes of the content and even frequent changes of the structure. Regular changes in the structure make use of the sitemap impossible. We presented approaches how to deal with this problem. We were able to create the sitemap dynamically just based on the content of the log file. In this case study, we also examined just the one part of the website over the standard analysis of an entire website, as we did not have access to all log files for the security reason. As the result, the traditional practices had to be adapted for this special case. Analysing just the small fraction of the website resulted in the short session time of regular visitors. We were not able to use recommended methods to determine the optimal value of session time. Therefore, we proposed new methods based on outliers identification for raising the accuracy of the session length in this paper.

  10. Text Mining Metal-Organic Framework Papers.

    Science.gov (United States)

    Park, Sanghoon; Kim, Baekjun; Choi, Sihoon; Boyd, Peter G; Smit, Berend; Kim, Jihan

    2018-02-26

    We have developed a simple text mining algorithm that allows us to identify surface area and pore volumes of metal-organic frameworks (MOFs) using manuscript html files as inputs. The algorithm searches for common units (e.g., m 2 /g, cm 3 /g) associated with these two quantities to facilitate the search. From the sample set data of over 200 MOFs, the algorithm managed to identify 90% and 88.8% of the correct surface area and pore volume values. Further application to a test set of randomly chosen MOF html files yielded 73.2% and 85.1% accuracies for the two respective quantities. Most of the errors stem from unorthodox sentence structures that made it difficult to identify the correct data as well as bolded notations of MOFs (e.g., 1a) that made it difficult identify its real name. These types of tools will become useful when it comes to discovering structure-property relationships among MOFs as well as collecting a large set of data for references.

  11. Application of Ferulic Acid for Alzheimer's Disease: Combination of Text Mining and Experimental Validation.

    Science.gov (United States)

    Meng, Guilin; Meng, Xiulin; Ma, Xiaoye; Zhang, Gengping; Hu, Xiaolin; Jin, Aiping; Zhao, Yanxin; Liu, Xueyuan

    2018-01-01

    Alzheimer's disease (AD) is an increasing concern in human health. Despite significant research, highly effective drugs to treat AD are lacking. The present study describes the text mining process to identify drug candidates from a traditional Chinese medicine (TCM) database, along with associated protein target mechanisms. We carried out text mining to identify literatures that referenced both AD and TCM and focused on identifying compounds and protein targets of interest. After targeting one potential TCM candidate, corresponding protein-protein interaction (PPI) networks were assembled in STRING to decipher the most possible mechanism of action. This was followed by validation using Western blot and co-immunoprecipitation in an AD cell model. The text mining strategy using a vast amount of AD-related literature and the TCM database identified curcumin, whose major component was ferulic acid (FA). This was used as a key candidate compound for further study. Using the top calculated interaction score in STRING, BACE1 and MMP2 were implicated in the activity of FA in AD. Exposure of SHSY5Y-APP cells to FA resulted in the decrease in expression levels of BACE-1 and APP, while the expression of MMP-2 and MMP-9 increased in a dose-dependent manner. This suggests that FA induced BACE1 and MMP2 pathways maybe novel potential mechanisms involved in AD. The text mining of literature and TCM database related to AD suggested FA as a promising TCM ingredient for the treatment of AD. Potential mechanisms interconnected and integrated with Aβ aggregation inhibition and extracellular matrix remodeling underlying the activity of FA were identified using in vitro studies.

  12. Text Mining for Precision Medicine: Bringing structure to EHRs and biomedical literature to understand genes and health

    Science.gov (United States)

    Simmons, Michael; Singhal, Ayush; Lu, Zhiyong

    2018-01-01

    The key question of precision medicine is whether it is possible to find clinically actionable granularity in diagnosing disease and classifying patient risk. The advent of next generation sequencing and the widespread adoption of electronic health records (EHRs) have provided clinicians and researchers a wealth of data and made possible the precise characterization of individual patient genotypes and phenotypes. Unstructured text — found in biomedical publications and clinical notes — is an important component of genotype and phenotype knowledge. Publications in the biomedical literature provide essential information for interpreting genetic data. Likewise, clinical notes contain the richest source of phenotype information in EHRs. Text mining can render these texts computationally accessible and support information extraction and hypothesis generation. This chapter reviews the mechanics of text mining in precision medicine and discusses several specific use cases, including database curation for personalized cancer medicine, patient outcome prediction from EHR-derived cohorts, and pharmacogenomic research. Taken as a whole, these use cases demonstrate how text mining enables effective utilization of existing knowledge sources and thus promotes increased value for patients and healthcare systems. Text mining is an indispensable tool for translating genotype-phenotype data into effective clinical care that will undoubtedly play an important role in the eventual realization of precision medicine. PMID:27807747

  13. Text Mining for Precision Medicine: Bringing Structure to EHRs and Biomedical Literature to Understand Genes and Health.

    Science.gov (United States)

    Simmons, Michael; Singhal, Ayush; Lu, Zhiyong

    2016-01-01

    The key question of precision medicine is whether it is possible to find clinically actionable granularity in diagnosing disease and classifying patient risk. The advent of next-generation sequencing and the widespread adoption of electronic health records (EHRs) have provided clinicians and researchers a wealth of data and made possible the precise characterization of individual patient genotypes and phenotypes. Unstructured text-found in biomedical publications and clinical notes-is an important component of genotype and phenotype knowledge. Publications in the biomedical literature provide essential information for interpreting genetic data. Likewise, clinical notes contain the richest source of phenotype information in EHRs. Text mining can render these texts computationally accessible and support information extraction and hypothesis generation. This chapter reviews the mechanics of text mining in precision medicine and discusses several specific use cases, including database curation for personalized cancer medicine, patient outcome prediction from EHR-derived cohorts, and pharmacogenomic research. Taken as a whole, these use cases demonstrate how text mining enables effective utilization of existing knowledge sources and thus promotes increased value for patients and healthcare systems. Text mining is an indispensable tool for translating genotype-phenotype data into effective clinical care that will undoubtedly play an important role in the eventual realization of precision medicine.

  14. Automatic detection of adverse events to predict drug label changes using text and data mining techniques.

    Science.gov (United States)

    Gurulingappa, Harsha; Toldo, Luca; Rajput, Abdul Mateen; Kors, Jan A; Taweel, Adel; Tayrouz, Yorki

    2013-11-01

    The aim of this study was to assess the impact of automatically detected adverse event signals from text and open-source data on the prediction of drug label changes. Open-source adverse effect data were collected from FAERS, Yellow Cards and SIDER databases. A shallow linguistic relation extraction system (JSRE) was applied for extraction of adverse effects from MEDLINE case reports. Statistical approach was applied on the extracted datasets for signal detection and subsequent prediction of label changes issued for 29 drugs by the UK Regulatory Authority in 2009. 76% of drug label changes were automatically predicted. Out of these, 6% of drug label changes were detected only by text mining. JSRE enabled precise identification of four adverse drug events from MEDLINE that were undetectable otherwise. Changes in drug labels can be predicted automatically using data and text mining techniques. Text mining technology is mature and well-placed to support the pharmacovigilance tasks. Copyright © 2013 John Wiley & Sons, Ltd.

  15. Using anchor text, spam filtering and Wikipedia for web search and entity ranking

    NARCIS (Netherlands)

    Kamps, J.; Kaptein, R.; Koolen, M.; Voorhees, E.M.; Buckland, L.P.

    2010-01-01

    In this paper, we document our efforts in participating to the TREC 2010 Entity Ranking and Web Tracks. We had multiple aims: For the Web Track we wanted to compare the effectiveness of anchor text of the category A and B collections and the impact of global document quality measures such as

  16. Key word placing in Web page body text to increase visibility to search engines

    Directory of Open Access Journals (Sweden)

    W. T. Kritzinger

    2007-11-01

    Full Text Available The growth of the World Wide Web has spawned a wide variety of new information sources, which has also left users with the daunting task of determining which sources are valid. Many users rely on the Web as an information source because of the low cost of information retrieval. It is also claimed that the Web has evolved into a powerful business tool. Examples include highly popular business services such as Amazon.com and Kalahari.net. It is estimated that around 80% of users utilize search engines to locate information on the Internet. This, by implication, places emphasis on the underlying importance of Web pages being listed on search engines indices. Empirical evidence that the placement of key words in certain areas of the body text will have an influence on the Web sites' visibility to search engines could not be found in the literature. The result of two experiments indicated that key words should be concentrated towards the top, and diluted towards the bottom of a Web page to increase visibility. However, care should be taken in terms of key word density, to prevent search engine algorithms from raising the spam alarm.

  17. Mixed Reality Environment for Web-Based Laboratory Interactive Learning

    Directory of Open Access Journals (Sweden)

    A. I. Saleem

    2008-02-01

    Full Text Available This paper presents a web-based laboratory fordistance learners by incorporating simulation andhardware implementation into web-based e-learningsystems. It presents a development consisting of laboratorycourse through internet based on mixed reality technique tosetup, run and manipulateset of experiments. Eachexperiment has been designed in a way that allows thelearner to manipulate the components and check if it worksproperly in order to achieve the experiment objective. Theproposed laboratory e-learning tool has web-basedcomponents accessed by authorized users. Learners canacquire the necessary skills they need, while learning thetheory of the experiment and the basic characteristics ofeach component used in the experiment. Finally, a casestudy was conducted to show the feasibility and efficiencyof the proposed method.

  18. SU-F-P-10: A Web-Based Radiation Safety Relational Database Module for Regulatory Compliance

    Energy Technology Data Exchange (ETDEWEB)

    Rosen, C; Ramsay, B; Konerth, S; Roller, D; Ramsay, A [Dade Moeller Health Group, Kalamazoo, MI (United States)

    2016-06-15

    Purpose: Maintaining compliance with Radioactive Materials Licenses is inherently a time-consuming task requiring focus and attention to detail. Staff tasked with these responsibilities, such as the Radiation Safety Officer and associated personnel must retain disparate records for eventual placement into one or more annual reports. Entering results and records in a relational database using a web browser as the interface, and storing that data in a cloud-based storage site, removes procedural barriers. The data becomes more adaptable for mining and sharing. Methods: Web-based code was written utilizing the web framework Django, written in Python. Additionally, the application utilizes JavaScript for front-end interaction, SQL, HTML and CSS. Quality assurance code testing is performed in a sequential style, and new code is only added after the successful testing of the previous goals. Separate sections of the module include data entry and analysis for audits, surveys, quality management, and continuous quality improvement. Data elements can be adapted for quarterly and annual reporting, and for immediate notification of user determined alarm settings. Results: Current advances are focusing on user interface issues, and determining the simplest manner by which to teach the user to build query forms. One solution has been to prepare library documents that a user can select or edit in place of creation a new document. Forms are being developed based upon Nuclear Regulatory Commission federal code, and will be expanded to include State Regulations. Conclusion: Establishing a secure website to act as the portal for data entry, storage and manipulation can lead to added efficiencies for a Radiation Safety Program. Access to multiple databases can lead to mining for big data programs, and for determining safety issues before they occur. Overcoming web programming challenges, a category that includes mathematical handling, is providing challenges that are being overcome.

  19. SU-F-P-10: A Web-Based Radiation Safety Relational Database Module for Regulatory Compliance

    International Nuclear Information System (INIS)

    Rosen, C; Ramsay, B; Konerth, S; Roller, D; Ramsay, A

    2016-01-01

    Purpose: Maintaining compliance with Radioactive Materials Licenses is inherently a time-consuming task requiring focus and attention to detail. Staff tasked with these responsibilities, such as the Radiation Safety Officer and associated personnel must retain disparate records for eventual placement into one or more annual reports. Entering results and records in a relational database using a web browser as the interface, and storing that data in a cloud-based storage site, removes procedural barriers. The data becomes more adaptable for mining and sharing. Methods: Web-based code was written utilizing the web framework Django, written in Python. Additionally, the application utilizes JavaScript for front-end interaction, SQL, HTML and CSS. Quality assurance code testing is performed in a sequential style, and new code is only added after the successful testing of the previous goals. Separate sections of the module include data entry and analysis for audits, surveys, quality management, and continuous quality improvement. Data elements can be adapted for quarterly and annual reporting, and for immediate notification of user determined alarm settings. Results: Current advances are focusing on user interface issues, and determining the simplest manner by which to teach the user to build query forms. One solution has been to prepare library documents that a user can select or edit in place of creation a new document. Forms are being developed based upon Nuclear Regulatory Commission federal code, and will be expanded to include State Regulations. Conclusion: Establishing a secure website to act as the portal for data entry, storage and manipulation can lead to added efficiencies for a Radiation Safety Program. Access to multiple databases can lead to mining for big data programs, and for determining safety issues before they occur. Overcoming web programming challenges, a category that includes mathematical handling, is providing challenges that are being overcome.

  20. Streaming Media for Web Based Training.

    Science.gov (United States)

    Childers, Chad; Rizzo, Frank; Bangert, Linda

    This paper discusses streaming media for World Wide Web-based training (WBT). The first section addresses WBT in the 21st century, including the Synchronized Multimedia Integration Language (SMIL) standard that allows multimedia content such as text, pictures, sound, and video to be synchronized for a coherent learning experience. The second…

  1. Leveraging Bibliographic RDF Data for Keyword Prediction with Association Rule Mining (ARM

    Directory of Open Access Journals (Sweden)

    Nidhi Kushwaha

    2014-11-01

    Full Text Available The Semantic Web (Web 3.0 has been proposed as an efficient way to access the increasingly large amounts of data on the internet. The Linked Open Data Cloud project at present is the major effort to implement the concepts of the Seamtic Web, addressing the problems of inhomogeneity and large data volumes. RKBExplorer is one of many repositories implementing Open Data and contains considerable bibliographic information. This paper discusses bibliographic data, an important part of cloud data. Effective searching of bibiographic datasets can be a challenge as many of the papers residing in these databases do not have sufficient or comprehensive keyword information. In these cases however, a search engine based on RKBExplorer is only able to use information to retrieve papers based on author names and title of papers without keywords. In this paper we attempt to address this problem by using the data mining algorithm Association Rule Mining (ARM to develop keywords based on features retrieved from Resource Description Framework (RDF data within a bibliographic citation. We have demonstrate the applicability of this method for predicting missing keywords for bibliographic entries in several typical databases. −−−−− Paper presented at 1st International Symposium on Big Data and Cloud Computing Challenges (ISBCC-2014 March 27-28, 2014. Organized by VIT University, Chennai, India. Sponsored by BRNS.

  2. Text mining applied to electronic cardiovascular procedure reports to identify patients with trileaflet aortic stenosis and coronary artery disease.

    Science.gov (United States)

    Small, Aeron M; Kiss, Daniel H; Zlatsin, Yevgeny; Birtwell, David L; Williams, Heather; Guerraty, Marie A; Han, Yuchi; Anwaruddin, Saif; Holmes, John H; Chirinos, Julio A; Wilensky, Robert L; Giri, Jay; Rader, Daniel J

    2017-08-01

    Interrogation of the electronic health record (EHR) using billing codes as a surrogate for diagnoses of interest has been widely used for clinical research. However, the accuracy of this methodology is variable, as it reflects billing codes rather than severity of disease, and depends on the disease and the accuracy of the coding practitioner. Systematic application of text mining to the EHR has had variable success for the detection of cardiovascular phenotypes. We hypothesize that the application of text mining algorithms to cardiovascular procedure reports may be a superior method to identify patients with cardiovascular conditions of interest. We adapted the Oracle product Endeca, which utilizes text mining to identify terms of interest from a NoSQL-like database, for purposes of searching cardiovascular procedure reports and termed the tool "PennSeek". We imported 282,569 echocardiography reports representing 81,164 individuals and 27,205 cardiac catheterization reports representing 14,567 individuals from non-searchable databases into PennSeek. We then applied clinical criteria to these reports in PennSeek to identify patients with trileaflet aortic stenosis (TAS) and coronary artery disease (CAD). Accuracy of patient identification by text mining through PennSeek was compared with ICD-9 billing codes. Text mining identified 7115 patients with TAS and 9247 patients with CAD. ICD-9 codes identified 8272 patients with TAS and 6913 patients with CAD. 4346 patients with AS and 6024 patients with CAD were identified by both approaches. A randomly selected sample of 200-250 patients uniquely identified by text mining was compared with 200-250 patients uniquely identified by billing codes for both diseases. We demonstrate that text mining was superior, with a positive predictive value (PPV) of 0.95 compared to 0.53 by ICD-9 for TAS, and a PPV of 0.97 compared to 0.86 for CAD. These results highlight the superiority of text mining algorithms applied to electronic

  3. Reliability, compliance, and security in web-based course assessments

    Directory of Open Access Journals (Sweden)

    Scott Bonham

    2008-04-01

    Full Text Available Pre- and postcourse assessment has become a very important tool for education research in physics and other areas. The web offers an attractive alternative to in-class paper administration, but concerns about web-based administration include reliability due to changes in medium, student compliance rates, and test security, both question leakage and utilization of web resources. An investigation was carried out in introductory astronomy courses comparing pre- and postcourse administration of assessments using the web and on paper. Overall no difference was seen in performance due to the medium. Compliance rates fluctuated greatly, and factors that seemed to produce higher rates are identified. Notably, email reminders increased compliance by 20%. Most of the 559 students complied with requests to not copy, print, or save questions nor use web resources; about 1% did copy some question text and around 2% frequently used other windows or applications while completing the assessment.

  4. Error Checking for Chinese Query by Mining Web Log

    Directory of Open Access Journals (Sweden)

    Jianyong Duan

    2015-01-01

    Full Text Available For the search engine, error-input query is a common phenomenon. This paper uses web log as the training set for the query error checking. Through the n-gram language model that is trained by web log, the queries are analyzed and checked. Some features including query words and their number are introduced into the model. At the same time data smoothing algorithm is used to solve data sparseness problem. It will improve the overall accuracy of the n-gram model. The experimental results show that it is effective.

  5. Web based remote instrumentation and control

    International Nuclear Information System (INIS)

    Dhekne, P.S.; Patil, Jitendra; Kulkarni, Jitendra; Babu, Prasad; Lad, U.C.; Rahurkar, A.G.; Kaura, H.K.

    2001-01-01

    The Web-based technology provides a very powerful communication medium for transmitting effectively multimedia information containing data generated from various sources, which may be in the form of audio, video, text, still or moving images etc. Large number of sophisticated web based software tools are available that can be used to monitor and control distributed electronic instrumentation projects. For example data can be collected online from various smart sensors/instruments such as images from CCD camera, pressure/ humidity sensor, light intensity transducer, smoke detectors etc and uploaded in real time to a central web server. This information can be processed further, to take control action in real time from any remote client, of course with due security care. The web-based technology offers greater flexibility, higher functionality, and high degree of integration providing standardization. Further easy to use standard browser based interface at the client end to monitor, view and control the desired process parameters allow you to cut down the development time and cost to a great extent. A system based on a web client-server approach has been designed and developed at Computer division, BARC and is operational since last year to monitor and control remotely various environmental parameters of distributed computer centers. In this paper we shall discuss details of this system, its current status and additional features which are currently under development. This type of system is typically very useful for Meteorology, Environmental monitoring of Nuclear stations, Radio active labs, Nuclear waste immobilization plants, Medical and Biological research labs., Security surveillance and in many such distributed situations. A brief description of various tools used for this project such as Java, CGI, Java Script, HTML, VBScript, M-JPEG, TCP/IP, UDP, RTP etc. along with their merits/demerits have also been included

  6. An automated text-based synthetic face with emotions for web lectures

    NARCIS (Netherlands)

    Fitrianie, S.; Rothkrantz, L.J.M.

    2006-01-01

    Web lectures have many positive aspects, e.g. they enable learners to easily control the learning experiences). To develop high-quality online learning materials takes a lot of time and human efforts [2]. An alternative is to develop a digital teacher. We have developed a prototype of a synthetic 3D

  7. BICEPP: an example-based statistical text mining method for predicting the binary characteristics of drugs

    Directory of Open Access Journals (Sweden)

    Tsafnat Guy

    2011-04-01

    Full Text Available Abstract Background The identification of drug characteristics is a clinically important task, but it requires much expert knowledge and consumes substantial resources. We have developed a statistical text-mining approach (BInary Characteristics Extractor and biomedical Properties Predictor: BICEPP to help experts screen drugs that may have important clinical characteristics of interest. Results BICEPP first retrieves MEDLINE abstracts containing drug names, then selects tokens that best predict the list of drugs which represents the characteristic of interest. Machine learning is then used to classify drugs using a document frequency-based measure. Evaluation experiments were performed to validate BICEPP's performance on 484 characteristics of 857 drugs, identified from the Australian Medicines Handbook (AMH and the PharmacoKinetic Interaction Screening (PKIS database. Stratified cross-validations revealed that BICEPP was able to classify drugs into all 20 major therapeutic classes (100% and 157 (of 197 minor drug classes (80% with areas under the receiver operating characteristic curve (AUC > 0.80. Similarly, AUC > 0.80 could be obtained in the classification of 173 (of 238 adverse events (73%, up to 12 (of 15 groups of clinically significant cytochrome P450 enzyme (CYP inducers or inhibitors (80%, and up to 11 (of 14 groups of narrow therapeutic index drugs (79%. Interestingly, it was observed that the keywords used to describe a drug characteristic were not necessarily the most predictive ones for the classification task. Conclusions BICEPP has sufficient classification power to automatically distinguish a wide range of clinical properties of drugs. This may be used in pharmacovigilance applications to assist with rapid screening of large drug databases to identify important characteristics for further evaluation.

  8. Finding novel relationships with integrated gene-gene association network analysis of Synechocystis sp. PCC 6803 using species-independent text-mining.

    Science.gov (United States)

    Kreula, Sanna M; Kaewphan, Suwisa; Ginter, Filip; Jones, Patrik R

    2018-01-01

    The increasing move towards open access full-text scientific literature enhances our ability to utilize advanced text-mining methods to construct information-rich networks that no human will be able to grasp simply from 'reading the literature'. The utility of text-mining for well-studied species is obvious though the utility for less studied species, or those with no prior track-record at all, is not clear. Here we present a concept for how advanced text-mining can be used to create information-rich networks even for less well studied species and apply it to generate an open-access gene-gene association network resource for Synechocystis sp. PCC 6803, a representative model organism for cyanobacteria and first case-study for the methodology. By merging the text-mining network with networks generated from species-specific experimental data, network integration was used to enhance the accuracy of predicting novel interactions that are biologically relevant. A rule-based algorithm (filter) was constructed in order to automate the search for novel candidate genes with a high degree of likely association to known target genes by (1) ignoring established relationships from the existing literature, as they are already 'known', and (2) demanding multiple independent evidences for every novel and potentially relevant relationship. Using selected case studies, we demonstrate the utility of the network resource and filter to ( i ) discover novel candidate associations between different genes or proteins in the network, and ( ii ) rapidly evaluate the potential role of any one particular gene or protein. The full network is provided as an open-source resource.

  9. Literature Mining Methods for Toxicology and Construction of ...

    Science.gov (United States)

    Webinar Presentation on text-mining methodologies in use at NCCT and how they can be used to assist with the OECD Retinoid project. Presentation to 1st Workshop/Scientific Expert Group meeting on the OECD Retinoid Project - April 26, 2016 –Brussels, Presented remotely via web.

  10. Interactive Web-based e-learning for Studying Flexible Manipulator Systems

    Directory of Open Access Journals (Sweden)

    Abul K. M. Azad

    2008-03-01

    Full Text Available Abstract— This paper presents a web-based e-leaning facility for simulation, modeling, and control of flexible manipulator systems. The simulation and modeling part includes finite difference and finite element simulations along with neural network and genetic algorithm based modeling strategies for flexible manipulator systems. The controller part constitutes a number of open-loop and closed-loop designs. Closed loop control designs include the classical, adaptive, and neuro-model based strategies. Matlab software package and its associated toolboxes are used to implement these. The Matlab web server is used as the gateway between the facility and web-access. ASP.NET technology and SQL database are utilized to develop web applications for access control, user account and password maintenance, administrative management, and facility utilization monitoring. The reported facility provides a flexible but effective approach of web-based interactive e-learning facility of an engineering system. This can be extended to incorporate additional engineering systems within the e-learning framework.

  11. Project Management Methodology for the Development of M-Learning Web Based Applications

    Directory of Open Access Journals (Sweden)

    Adrian VISOIU

    2010-01-01

    Full Text Available M-learning web based applications are a particular case of web applications designed to be operated from mobile devices. Also, their purpose is to implement learning aspects. Project management of such applications takes into account the identified peculiarities. M-learning web based application characteristics are identified. M-learning functionality covers the needs of an educational process. Development is described taking into account the mobile web and its influences over the analysis, design, construction and testing phases. Activities building up a work breakdown structure for development of m-learning web based applications are presented. Project monitoring and control techniques are proposed. Resources required for projects are discussed.

  12. Systematic analysis of molecular mechanisms for HCC metastasis via text mining approach.

    Science.gov (United States)

    Zhen, Cheng; Zhu, Caizhong; Chen, Haoyang; Xiong, Yiru; Tan, Junyuan; Chen, Dong; Li, Jin

    2017-02-21

    To systematically explore the molecular mechanism for hepatocellular carcinoma (HCC) metastasis and identify regulatory genes with text mining methods. Genes with highest frequencies and significant pathways related to HCC metastasis were listed. A handful of proteins such as EGFR, MDM2, TP53 and APP, were identified as hub nodes in PPI (protein-protein interaction) network. Compared with unique genes for HBV-HCCs, genes particular to HCV-HCCs were less, but may participate in more extensive signaling processes. VEGFA, PI3KCA, MAPK1, MMP9 and other genes may play important roles in multiple phenotypes of metastasis. Genes in abstracts of HCC-metastasis literatures were identified. Word frequency analysis, KEGG pathway and PPI network analysis were performed. Then co-occurrence analysis between genes and metastasis-related phenotypes were carried out. Text mining is effective for revealing potential regulators or pathways, but the purpose of it should be specific, and the combination of various methods will be more useful.

  13. GROUPING WEB ACCESS SEQUENCES uSING SEQUENCE ALIGNMENT METHOD

    OpenAIRE

    BHUPENDRA S CHORDIA; KRISHNAKANT P ADHIYA

    2011-01-01

    In web usage mining grouping of web access sequences can be used to determine the behavior or intent of a set of users. Grouping websessions is how to measure the similarity between web sessions. There are many shortcomings in traditional measurement methods. The taskof grouping web sessions based on similarity and consists of maximizing the intra-group similarity while minimizing the inter-groupsimilarity is done using sequence alignment method. This paper introduces a new method to group we...

  14. Grist: Grid-based Data Mining for Astronomy

    Science.gov (United States)

    Jacob, J. C.; Katz, D. S.; Miller, C. D.; Walia, H.; Williams, R. D.; Djorgovski, S. G.; Graham, M. J.; Mahabal, A. A.; Babu, G. J.; vanden Berk, D. E.; Nichol, R.

    2005-12-01

    The Grist project is developing a grid-technology based system as a research environment for astronomy with massive and complex datasets. This knowledge extraction system will consist of a library of distributed grid services controlled by a workflow system, compliant with standards emerging from the grid computing, web services, and virtual observatory communities. This new technology is being used to find high redshift quasars, study peculiar variable objects, search for transients in real time, and fit SDSS QSO spectra to measure black hole masses. Grist services are also a component of the ``hyperatlas'' project to serve high-resolution multi-wavelength imagery over the Internet. In support of these science and outreach objectives, the Grist framework will provide the enabling fabric to tie together distributed grid services in the areas of data access, federation, mining, subsetting, source extraction, image mosaicking, statistics, and visualization.

  15. Grist : grid-based data mining for astronomy

    Science.gov (United States)

    Jacob, Joseph C.; Katz, Daniel S.; Miller, Craig D.; Walia, Harshpreet; Williams, Roy; Djorgovski, S. George; Graham, Matthew J.; Mahabal, Ashish; Babu, Jogesh; Berk, Daniel E. Vanden; hide

    2004-01-01

    The Grist project is developing a grid-technology based system as a research environment for astronomy with massive and complex datasets. This knowledge extraction system will consist of a library of distributed grid services controlled by a workflow system, compliant with standards emerging from the grid computing, web services, and virtual observatory communities. This new technology is being used to find high redshift quasars, study peculiar variable objects, search for transients in real time, and fit SDSS QSO spectra to measure black hole masses. Grist services are also a component of the 'hyperatlas' project to serve high-resolution multi-wavelength imagery over the Internet. In support of these science and outreach objectives, the Grist framework will provide the enabling fabric to tie together distributed grid services in the areas of data access, federation, mining, subsetting, source extraction, image mosaicking, statistics, and visualization.

  16. Educational data mining: a sample of review and study case

    Directory of Open Access Journals (Sweden)

    Alejandro Pena, Rafael Domínguez, Jose de Jesus Medel

    2009-12-01

    Full Text Available The aim of this work is to encourage the research in a novel merged field: Educational data mining (EDM. Thereby, twosubjects are outlined: The first one corresponds to a review of data mining (DM methods and EDM applications. Thesecond topic represents an EDM study case. As a result of the application of DM in Web-based Education Systems (WBES,stratified groups of students were found during a trial. Such groups reveal key attributes of volunteers that deserted orremained during a WBES experiment. This kind of discovered knowledge inspires the statement of correlational hypothesisto set relations between attributes and behavioral patterns of WBES users. We concluded that: When EDM findings aretaken into account for designing and managing WBES, the learning objectives are improved

  17. The Potential of Text Mining in Data Integration and Network Biology for Plant Research: A Case Study on Arabidopsis[C][W

    Science.gov (United States)

    Van Landeghem, Sofie; De Bodt, Stefanie; Drebert, Zuzanna J.; Inzé, Dirk; Van de Peer, Yves

    2013-01-01

    Despite the availability of various data repositories for plant research, a wealth of information currently remains hidden within the biomolecular literature. Text mining provides the necessary means to retrieve these data through automated processing of texts. However, only recently has advanced text mining methodology been implemented with sufficient computational power to process texts at a large scale. In this study, we assess the potential of large-scale text mining for plant biology research in general and for network biology in particular using a state-of-the-art text mining system applied to all PubMed abstracts and PubMed Central full texts. We present extensive evaluation of the textual data for Arabidopsis thaliana, assessing the overall accuracy of this new resource for usage in plant network analyses. Furthermore, we combine text mining information with both protein–protein and regulatory interactions from experimental databases. Clusters of tightly connected genes are delineated from the resulting network, illustrating how such an integrative approach is essential to grasp the current knowledge available for Arabidopsis and to uncover gene information through guilt by association. All large-scale data sets, as well as the manually curated textual data, are made publicly available, hereby stimulating the application of text mining data in future plant biology studies. PMID:23532071

  18. Open-Source web-based geographical information system for health exposure assessment

    Directory of Open Access Journals (Sweden)

    Evans Barry

    2012-01-01

    Full Text Available Abstract This paper presents the design and development of an open source web-based Geographical Information System allowing users to visualise, customise and interact with spatial data within their web browser. The developed application shows that by using solely Open Source software it was possible to develop a customisable web based GIS application that provides functions necessary to convey health and environmental data to experts and non-experts alike without the requirement of proprietary software.

  19. Can abstract screening workload be reduced using text mining? User experiences of the tool Rayyan.

    Science.gov (United States)

    Olofsson, Hanna; Brolund, Agneta; Hellberg, Christel; Silverstein, Rebecca; Stenström, Karin; Österberg, Marie; Dagerhamn, Jessica

    2017-09-01

    One time-consuming aspect of conducting systematic reviews is the task of sifting through abstracts to identify relevant studies. One promising approach for reducing this burden uses text mining technology to identify those abstracts that are potentially most relevant for a project, allowing those abstracts to be screened first. To examine the effectiveness of the text mining functionality of the abstract screening tool Rayyan. User experiences were collected. Rayyan was used to screen abstracts for 6 reviews in 2015. After screening 25%, 50%, and 75% of the abstracts, the screeners logged the relevant references identified. A survey was sent to users. After screening half of the search result with Rayyan, 86% to 99% of the references deemed relevant to the study were identified. Of those studies included in the final reports, 96% to 100% were already identified in the first half of the screening process. Users rated Rayyan 4.5 out of 5. The text mining function in Rayyan successfully helped reviewers identify relevant studies early in the screening process. Copyright © 2017 John Wiley & Sons, Ltd.

  20. Texts and data mining and their possibilities applied to the process of news production

    Directory of Open Access Journals (Sweden)

    Walter Teixeira Lima Jr

    2008-06-01

    Full Text Available The proposal of this essay is to discuss the challenges of representing in a formalist computational process the knowledge which the journalist uses to articulate news values for the purpose of selecting and imposing hierarchy on news. It discusses how to make bridges to emulate this knowledge obtained in an empirical form with the bases of computational science, in the area of storage, recovery and linked to data in a database, which must show the way human brains treat information obtained through their sensorial system. Systemizing and automating part of the journalistic process in a database contributes to eliminating distortions, faults and to applying, in an efficient manner, techniques for Data Mining and/or Texts which, by definition, permit the discovery of nontrivial relations.

  1. Texts and data mining and their possibilities applied to the process of news production

    Directory of Open Access Journals (Sweden)

    Walter Teixeira Lima Jr

    2011-02-01

    Full Text Available The proposal of this essay is to discuss the challenges of representing in a formalist computational process the knowledge which the journalist uses to articulate news values for the purpose of selecting and imposing hierarchy on news. It discusses how to make bridges to emulate this knowledge obtained in an empirical form with the bases of computational science, in the area of storage, recovery and linked to data in a database, which must show the way human brains treat information obtained through their sensorial system. Systemizing and automating part of the journalistic process in a database contributes to eliminating distortions, faults and to applying, in an efficient manner, techniques for Data Mining and/or Texts which, by definition, permit the discovery of nontrivial relations.

  2. TDCCREC: AN EFFICIENT AND SCALABLE WEB-BASED RECOMMENDATION SYSTEM

    Directory of Open Access Journals (Sweden)

    K.Latha

    2010-10-01

    Full Text Available Web browsers are provided with complex information space where the volume of information available to them is huge. There comes the Recommender system which effectively recommends web pages that are related to the current webpage, to provide the user with further customized reading material. To enhance the performance of the recommender systems, we include an elegant proposed web based recommendation system; Truth Discovery based Content and Collaborative RECommender (TDCCREC which is capable of addressing scalability. Existing approaches such as Learning automata deals with usage and navigational patterns of users. On the other hand, Weighted Association Rule is applied for recommending web pages by assigning weights to each page in all the transactions. Both of them have their own disadvantages. The websites recommended by the search engines have no guarantee for information correctness and often delivers conflicting information. To solve them, content based filtering and collaborative filtering techniques are introduced for recommending web pages to the active user along with the trustworthiness of the website and confidence of facts which outperforms the existing methods. Our results show how the proposed recommender system performs better in predicting the next request of web users.

  3. Web-based Surveys: Changing the Survey Process

    OpenAIRE

    Gunn, Holly

    2002-01-01

    Web-based surveys are having a profound influence on the survey process. Unlike other types of surveys, Web page design skills and computer programming expertise play a significant role in the design of Web-based surveys. Survey respondents face new and different challenges in completing a Web-based survey. This paper examines the different types of Web-based surveys, the advantages and challenges of using Web-based surveys, the design of Web-based surveys, and the issues of validity, error, ...

  4. Design and Implementation WebGIS for Improving the Quality of Exploration Decisions at Sin-Quyen Copper Mine, Northern Vietnam

    Science.gov (United States)

    Quang Truong, Xuan; Luan Truong, Xuan; Nguyen, Tuan Anh; Nguyen, Dinh Tuan; Cong Nguyen, Chi

    2017-12-01

    The objective of this study is to design and implement a WebGIS Decision Support System (WDSS) for reducing uncertainty and supporting to improve the quality of exploration decisions in the Sin-Quyen copper mine, northern Vietnam. The main distinctive feature of the Sin-Quyen deposit is an unusual composition of ores. Computer and software applied to the exploration problem have had a significant impact on the exploration process over the past 25 years, but up until now, no online system has been undertaken. The system was completely built on open source technology and the Open Geospatial Consortium Web Services (OWS). The input data includes remote sensing (RS), Geographical Information System (GIS) and data from drillhole explorations, the drillhole exploration data sets were designed as a geodatabase and stored in PostgreSQL. The WDSS must be able to processed exploration data and support users to access 2-dimensional (2D) or 3-dimensional (3D) cross-sections and map of boreholles exploration data and drill holes. The interface was designed in order to interact with based maps (e.g., Digital Elevation Model, Google Map, OpenStreetMap) and thematic maps (e.g., land use and land cover, administrative map, drillholes exploration map), and to provide GIS functions (such as creating a new map, updating an existing map, querying and statistical charts). In addition, the system provides geological cross-sections of ore bodies based on Inverse Distance Weighting (IDW), nearest neighbour interpolation and Kriging methods (e.g., Simple Kriging, Ordinary Kriging, Indicator Kriging and CoKriging). The results based on data available indicate that the best estimation method (of 23 borehole exploration data sets) for estimating geological cross-sections of ore bodies in Sin-Quyen copper mine is Ordinary Kriging. The WDSS could provide useful information to improve drilling efficiency in mineral exploration and for management decision making.

  5. The Feasibility of Using Large-Scale Text Mining to Detect Adverse Childhood Experiences in a VA-Treated Population.

    Science.gov (United States)

    Hammond, Kenric W; Ben-Ari, Alon Y; Laundry, Ryan J; Boyko, Edward J; Samore, Matthew H

    2015-12-01

    Free text in electronic health records resists large-scale analysis. Text records facts of interest not found in encoded data, and text mining enables their retrieval and quantification. The U.S. Department of Veterans Affairs (VA) clinical data repository affords an opportunity to apply text-mining methodology to study clinical questions in large populations. To assess the feasibility of text mining, investigation of the relationship between exposure to adverse childhood experiences (ACEs) and recorded diagnoses was conducted among all VA-treated Gulf war veterans, utilizing all progress notes recorded from 2000-2011. Text processing extracted ACE exposures recorded among 44.7 million clinical notes belonging to 243,973 veterans. The relationship of ACE exposure to adult illnesses was analyzed using logistic regression. Bias considerations were assessed. ACE score was strongly associated with suicide attempts and serious mental disorders (ORs = 1.84 to 1.97), and less so with behaviorally mediated and somatic conditions (ORs = 1.02 to 1.36) per unit. Bias adjustments did not remove persistent associations between ACE score and most illnesses. Text mining to detect ACE exposure in a large population was feasible. Analysis of the relationship between ACE score and adult health conditions yielded patterns of association consistent with prior research. Copyright © 2015 International Society for Traumatic Stress Studies.

  6. Documenting clinical pharmacist intervention before and after the introduction of a web-based tool.

    Science.gov (United States)

    Nurgat, Zubeir A; Al-Jazairi, Abdulrazaq S; Abu-Shraie, Nada; Al-Jedai, Ahmed

    2011-04-01

    To develop a database for documenting pharmacist intervention through a web-based application. The secondary endpoint was to determine if the new, web-based application provides any benefits with regards to documentation compliance by clinical pharmacists and ease of calculating cost savings compared with our previous method of documenting pharmacist interventions. A tertiary care hospital in Saudi Arabia. The documentation of interventions using a web-based documentation application was retrospectively compared with previous methods of documentation of clinical pharmacists' interventions (multi-user PC software). The number and types of interventions recorded by pharmacists, data mining of archived data, efficiency, cost savings, and the accuracy of the data generated. The number of documented clinical interventions increased from 4,926, using the multi-user PC software, to 6,840 for the web-based application. On average, we observed 653 interventions per clinical pharmacist using the web-based application, which showed an increase compared to an average of 493 interventions using the old multi-user PC software. However, using a paired Student's t-test there was no statistical significance difference between the two means (P = 0.201). Using a χ² test, which captured management level and the type of system used, we found a strong effect of management level (P educational level and the number of interventions documented (P = 0.045). The mean ± SD time required to document an intervention using the web-based application was 66.55 ± 8.98 s. Using the web-based application, 29.06% of documented interventions resulted in cost-savings, while using the multi-user PC software only 4.75% of interventions did so. The majority of cost savings across both platforms resulted from the discontinuation of unnecessary drugs and a change in dosage regimen. Data collection using the web-based application was consistently more complete when compared to the multi-user PC software

  7. Aplicación de técnicas de text mining para analizar las interacciones de los estudiantes en el proceso de aprendizaje de gestión de proyectos

    OpenAIRE

    Olarte Valentín, Rubén; González Marcos, Ana; Alba Elías, Fernando; Ordieres-Meré, Joaquín

    2016-01-01

    En este artículo se presenta la aplicación de técnicas de text mining para analizar la comunicación online de estudiantes que trabajan juntos en un mismo proyecto, con el fin de identificar la aparición de problemas en el desarrollo de la experiencia de aprendizaje en gestión de proyectos. Los datos empleados en este estudio son los mensajes que los estudiantes intercambiaron a través de las herramientas de comunicación existentes en la plataforma web empleada específicamente para el desarrol...

  8. Text Mining Genotype-Phenotype Relationships from Biomedical Literature for Database Curation and Precision Medicine.

    Science.gov (United States)

    Singhal, Ayush; Simmons, Michael; Lu, Zhiyong

    2016-11-01

    The practice of precision medicine will ultimately require databases of genes and mutations for healthcare providers to reference in order to understand the clinical implications of each patient's genetic makeup. Although the highest quality databases require manual curation, text mining tools can facilitate the curation process, increasing accuracy, coverage, and productivity. However, to date there are no available text mining tools that offer high-accuracy performance for extracting such triplets from biomedical literature. In this paper we propose a high-performance machine learning approach to automate the extraction of disease-gene-variant triplets from biomedical literature. Our approach is unique because we identify the genes and protein products associated with each mutation from not just the local text content, but from a global context as well (from the Internet and from all literature in PubMed). Our approach also incorporates protein sequence validation and disease association using a novel text-mining-based machine learning approach. We extract disease-gene-variant triplets from all abstracts in PubMed related to a set of ten important diseases (breast cancer, prostate cancer, pancreatic cancer, lung cancer, acute myeloid leukemia, Alzheimer's disease, hemochromatosis, age-related macular degeneration (AMD), diabetes mellitus, and cystic fibrosis). We then evaluate our approach in two ways: (1) a direct comparison with the state of the art using benchmark datasets; (2) a validation study comparing the results of our approach with entries in a popular human-curated database (UniProt) for each of the previously mentioned diseases. In the benchmark comparison, our full approach achieves a 28% improvement in F1-measure (from 0.62 to 0.79) over the state-of-the-art results. For the validation study with UniProt Knowledgebase (KB), we present a thorough analysis of the results and errors. Across all diseases, our approach returned 272 triplets (disease

  9. AHCODA-DB: a data repository with web-based mining tools for the analysis of automated high-content mouse phenomics data.

    Science.gov (United States)

    Koopmans, Bastijn; Smit, August B; Verhage, Matthijs; Loos, Maarten

    2017-04-04

    Systematic, standardized and in-depth phenotyping and data analyses of rodent behaviour empowers gene-function studies, drug testing and therapy design. However, no data repositories are currently available for standardized quality control, data analysis and mining at the resolution of individual mice. Here, we present AHCODA-DB, a public data repository with standardized quality control and exclusion criteria aimed to enhance robustness of data, enabled with web-based mining tools for the analysis of individually and group-wise collected mouse phenotypic data. AHCODA-DB allows monitoring in vivo effects of compounds collected from conventional behavioural tests and from automated home-cage experiments assessing spontaneous behaviour, anxiety and cognition without human interference. AHCODA-DB includes such data from mutant mice (transgenics, knock-out, knock-in), (recombinant) inbred strains, and compound effects in wildtype mice and disease models. AHCODA-DB provides real time statistical analyses with single mouse resolution and versatile suite of data presentation tools. On March 9th, 2017 AHCODA-DB contained 650 k data points on 2419 parameters from 1563 mice. AHCODA-DB provides users with tools to systematically explore mouse behavioural data, both with positive and negative outcome, published and unpublished, across time and experiments with single mouse resolution. The standardized (automated) experimental settings and the large current dataset (1563 mice) in AHCODA-DB provide a unique framework for the interpretation of behavioural data and drug effects. The use of common ontologies allows data export to other databases such as the Mouse Phenome Database. Unbiased presentation of positive and negative data obtained under the highly standardized screening conditions increase cost efficiency of publicly funded mouse screening projects and help to reach consensus conclusions on drug responses and mouse behavioural phenotypes. The website is publicly

  10. GenCLiP 2.0: a web server for functional clustering of genes and construction of molecular networks based on free terms.

    Science.gov (United States)

    Wang, Jia-Hong; Zhao, Ling-Feng; Lin, Pei; Su, Xiao-Rong; Chen, Shi-Jun; Huang, Li-Qiang; Wang, Hua-Feng; Zhang, Hai; Hu, Zhen-Fu; Yao, Kai-Tai; Huang, Zhong-Xi

    2014-09-01

    Identifying biological functions and molecular networks in a gene list and how the genes may relate to various topics is of considerable value to biomedical researchers. Here, we present a web-based text-mining server, GenCLiP 2.0, which can analyze human genes with enriched keywords and molecular interactions. Compared with other similar tools, GenCLiP 2.0 offers two unique features: (i) analysis of gene functions with free terms (i.e. any terms in the literature) generated by literature mining or provided by the user and (ii) accurate identification and integration of comprehensive molecular interactions from Medline abstracts, to construct molecular networks and subnetworks related to the free terms. http://ci.smu.edu.cn. Supplementary data are available at Bioinformatics online. © The Author 2014. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com.

  11. Measuring participant rurality in Web-based interventions

    Directory of Open Access Journals (Sweden)

    McKay H Garth

    2007-08-01

    Full Text Available Abstract Background Web-based health behavior change programs can reach large groups of disparate participants and thus they provide promise of becoming important public health tools. Data on participant rurality can complement other demographic measures to deepen our understanding of the success of these programs. Specifically, analysis of participant rurality can inform recruitment and social marketing efforts, and facilitate the targeting and tailoring of program content. Rurality analysis can also help evaluate the effectiveness of interventions across population groupings. Methods We describe how the RUCAs (Rural-Urban Commuting Area Codes methodology can be used to examine results from two Randomized Controlled Trials of Web-based tobacco cessation programs: the ChewFree.com project for smokeless tobacco cessation and the Smokers' Health Improvement Program (SHIP project for smoking cessation. Results Using RUCAs methodology helped to highlight the extent to which both Web-based interventions reached a substantial percentage of rural participants. The ChewFree program was found to have more rural participation which is consistent with the greater prevalence of smokeless tobacco use in rural settings as well as ChewFree's multifaceted recruitment program that specifically targeted rural settings. Conclusion Researchers of Web-based health behavior change programs targeted to the US should routinely include RUCAs as a part of analyzing participant demographics. Researchers in other countries should examine rurality indices germane to their country.

  12. Textpresso Central: a customizable platform for searching, text mining, viewing, and curating biomedical literature.

    Science.gov (United States)

    Müller, H-M; Van Auken, K M; Li, Y; Sternberg, P W

    2018-03-09

    The biomedical literature continues to grow at a rapid pace, making the challenge of knowledge retrieval and extraction ever greater. Tools that provide a means to search and mine the full text of literature thus represent an important way by which the efficiency of these processes can be improved. We describe the next generation of the Textpresso information retrieval system, Textpresso Central (TPC). TPC builds on the strengths of the original system by expanding the full text corpus to include the PubMed Central Open Access Subset (PMC OA), as well as the WormBase C. elegans bibliography. In addition, TPC allows users to create a customized corpus by uploading and processing documents of their choosing. TPC is UIMA compliant, to facilitate compatibility with external processing modules, and takes advantage of Lucene indexing and search technology for efficient handling of millions of full text documents. Like Textpresso, TPC searches can be performed using keywords and/or categories (semantically related groups of terms), but to provide better context for interpreting and validating queries, search results may now be viewed as highlighted passages in the context of full text. To facilitate biocuration efforts, TPC also allows users to select text spans from the full text and annotate them, create customized curation forms for any data type, and send resulting annotations to external curation databases. As an example of such a curation form, we describe integration of TPC with the Noctua curation tool developed by the Gene Ontology (GO) Consortium. Textpresso Central is an online literature search and curation platform that enables biocurators and biomedical researchers to search and mine the full text of literature by integrating keyword and category searches with viewing search results in the context of the full text. It also allows users to create customized curation interfaces, use those interfaces to make annotations linked to supporting evidence statements

  13. Web-Based Course Management and Web Services

    Science.gov (United States)

    Mandal, Chittaranjan; Sinha, Vijay Luxmi; Reade, Christopher M. P.

    2004-01-01

    The architecture of a web-based course management tool that has been developed at IIT [Indian Institute of Technology], Kharagpur and which manages the submission of assignments is discussed. Both the distributed architecture used for data storage and the client-server architecture supporting the web interface are described. Further developments…

  14. Automatic target validation based on neuroscientific literature mining for tractography

    Directory of Open Access Journals (Sweden)

    Xavier eVasques

    2015-05-01

    Full Text Available Target identification for tractography studies requires solid anatomical knowledge validated by an extensive literature review across species for each seed structure to be studied. Manual literature review to identify targets for a given seed region is tedious and potentially subjective. Therefore, complementary approaches would be useful. We propose to use text-mining models to automatically suggest potential targets from the neuroscientific literature, full-text articles and abstracts, so that they can be used for anatomical connection studies and more specifically for tractography. We applied text-mining models to three structures: two well studied structures, since validated deep brain stimulation targets, the internal globus pallidus and the subthalamic nucleus and, the nucleus accumbens, an exploratory target for treating psychiatric disorders. We performed a systematic review of the literature to document the projections of the three selected structures and compared it with the targets proposed by text-mining models, both in rat and primate (including human. We ran probabilistic tractography on the nucleus accumbens and compared the output with the results of the text-mining models and literature review. Overall, text-mining the literature could find three times as many targets as two man-weeks of curation could. The overall efficiency of the text-mining against literature review in our study was 98% recall (at 36% precision, meaning that over all the targets for the three selected seeds, only one target has been missed by text-mining. We demonstrate that connectivity for a structure of interest can be extracted from a very large amount of publications and abstracts. We believe this tool will be useful in helping the neuroscience community to facilitate connectivity studies of particular brain regions. The text mining tools used for the study are part of the HBP Neuroinformatics Platform, publicly available at http://connectivity-brainer.rhcloud.com/.

  15. Mining the human phenome using semantic web technologies: a case study for Type 2 Diabetes.

    Science.gov (United States)

    Pathak, Jyotishman; Kiefer, Richard C; Bielinski, Suzette J; Chute, Christopher G

    2012-01-01

    The ability to conduct genome-wide association studies (GWAS) has enabled new exploration of how genetic variations contribute to health and disease etiology. However, historically GWAS have been limited by inadequate sample size due to associated costs for genotyping and phenotyping of study subjects. This has prompted several academic medical centers to form "biobanks" where biospecimens linked to personal health information, typically in electronic health records (EHRs), are collected and stored on large number of subjects. This provides tremendous opportunities to discover novel genotype-phenotype associations and foster hypothesis generation. In this work, we study how emerging Semantic Web technologies can be applied in conjunction with clinical and genotype data stored at the Mayo Clinic Biobank to mine the phenotype data for genetic associations. In particular, we demonstrate the role of using Resource Description Framework (RDF) for representing EHR diagnoses and procedure data, and enable federated querying via standardized Web protocols to identify subjects genotyped with Type 2 Diabetes for discovering gene-disease associations. Our study highlights the potential of Web-scale data federation techniques to execute complex queries.

  16. A hybrid BCI web browser based on EEG and EOG signals.

    Science.gov (United States)

    Shenghong He; Tianyou Yu; Zhenghui Gu; Yuanqing Li

    2017-07-01

    In this study, we propose a new web browser based on a hybrid brain computer interface (BCI) combining electroencephalographic (EEG) and electrooculography (EOG) signals. Specifically, the user can control the horizontal movement of the mouse by imagining left/right hand motion, and control the vertical movement of the mouse, select/reject a target, or input text in an edit box by blinking eyes in synchrony with the flashes of the corresponding buttons on the GUI. Based on mouse control, target selection and text input, the user can open a web page of interest, select an intended target in the web and read the page content. An online experiment was conducted involving five healthy subjects. The experimental results demonstrated the effectiveness of the proposed method.

  17. Comparison of Physics Frameworks for WebGL-Based Game Engine

    Directory of Open Access Journals (Sweden)

    Yogya Resa

    2014-03-01

    Full Text Available Recently, a new technology called WebGL shows a lot of potentials for developing games. However since this technology is still new, there are still many potentials in the game development area that are not explored yet. This paper tries to uncover the potential of integrating physics frameworks with WebGL technology in a game engine for developing 2D or 3D games. Specifically we integrated three open source physics frameworks: Bullet, Cannon, and JigLib into a WebGL-based game engine. Using experiment, we assessed these frameworks in terms of their correctness or accuracy, performance, completeness and compatibility. The results show that it is possible to integrate open source physics frameworks into a WebGLbased game engine, and Bullet is the best physics framework to be integrated into the WebGL-based game engine.

  18. Text Mining Genotype-Phenotype Relationships from Biomedical Literature for Database Curation and Precision Medicine.

    Directory of Open Access Journals (Sweden)

    Ayush Singhal

    2016-11-01

    Full Text Available The practice of precision medicine will ultimately require databases of genes and mutations for healthcare providers to reference in order to understand the clinical implications of each patient's genetic makeup. Although the highest quality databases require manual curation, text mining tools can facilitate the curation process, increasing accuracy, coverage, and productivity. However, to date there are no available text mining tools that offer high-accuracy performance for extracting such triplets from biomedical literature. In this paper we propose a high-performance machine learning approach to automate the extraction of disease-gene-variant triplets from biomedical literature. Our approach is unique because we identify the genes and protein products associated with each mutation from not just the local text content, but from a global context as well (from the Internet and from all literature in PubMed. Our approach also incorporates protein sequence validation and disease association using a novel text-mining-based machine learning approach. We extract disease-gene-variant triplets from all abstracts in PubMed related to a set of ten important diseases (breast cancer, prostate cancer, pancreatic cancer, lung cancer, acute myeloid leukemia, Alzheimer's disease, hemochromatosis, age-related macular degeneration (AMD, diabetes mellitus, and cystic fibrosis. We then evaluate our approach in two ways: (1 a direct comparison with the state of the art using benchmark datasets; (2 a validation study comparing the results of our approach with entries in a popular human-curated database (UniProt for each of the previously mentioned diseases. In the benchmark comparison, our full approach achieves a 28% improvement in F1-measure (from 0.62 to 0.79 over the state-of-the-art results. For the validation study with UniProt Knowledgebase (KB, we present a thorough analysis of the results and errors. Across all diseases, our approach returned 272 triplets

  19. A Web-Based Rice Plant Expert System Using Rule-Based Reasoning

    Directory of Open Access Journals (Sweden)

    Anton Setiawan Honggowibowo

    2009-12-01

    Full Text Available Rice plants can be attacked by various kinds of diseases which are possible to be determined from their symptoms. However, it is to recognize that to find out the exact type of disease, an agricultural expert’s opinion is needed, meanwhile the numbers of agricultural experts are limited and there are too many problems to be solved at the same time. This makes a system with a capability as an expert is required. This system must contain the knowledge of the diseases and symptom of rice plants as an agricultural expert has to have. This research designs a web-based expert system using rule-based reasoning. The rule are modified from the method of forward chaining inference and backward chaining in order to to help farmers in the rice plant disease diagnosis. The web-based rice plants disease diagnosis expert system has the advantages to access and use easily. With web-based features inside, it is expected that the farmer can accesse the expert system everywhere to overcome the problem to diagnose rice diseases.

  20. BrainBrowser: distributed, web-based neurological data visualization

    Directory of Open Access Journals (Sweden)

    Tarek eSherif

    2015-01-01

    Full Text Available Recent years have seen massive, distributed datasets become the norm in neuroimaging research, and the methodologies used analyze them have, in response, become more collaborative and exploratory. Tools and infrastructure are continuously being developed and deployed to facilitate research in this context: grid computation platforms to process the data, distributed data stores to house and share them, high-speed networks to move them around and collaborative, often web-based, platforms to provide access to and sometimes manage the entire system. BrainBrowser is a lightweight, high-performance JavaScript visualization library built to provide easy-to-use, powerful, on-demand visualization of remote datasets in this new research environment. BrainBrowser leverages modern Web technologies, such as WebGL, HTML5 and Web Workers, to visualize 3D surface and volumetric neuroimaging data in any modern web browser without requiring any browser plugins. It is thus trivial to integrate BrainBrowser into any web-based platform. BrainBrowser is simple enough to produce a basic web-based visualization in a few lines of code, while at the same time being robust enough to create full-featured visualization applications. BrainBrowser can dynamically load the data required for a given visualization, so no network bandwidth needs to be waisted on data that will not be used. BrainBrowser's integration into the standardized web platform also allows users to consider using 3D data visualization in novel ways, such as for data distribution, data sharing and dynamic online publications. BrainBrowser is already being used in two major online platforms, CBRAIN and LORIS, and has been used to make the 1TB MACACC dataset openly accessible.

  1. Mining for associations between text and brain activation in a functional neuroimaging database

    DEFF Research Database (Denmark)

    Nielsen, Finn Årup; Hansen, Lars Kai; Balslev, D.

    2004-01-01

    We describe a method for mining a neuroimaging database for associations between text and brain locations. The objective is to discover association rules between words indicative of cognitive function as described in abstracts of neuroscience papers and sets of reported stereotactic Talairach...

  2. Text mining analysis of public comments regarding high-level radioactive waste disposal

    International Nuclear Information System (INIS)

    Kugo, Akihide; Yoshikawa, Hidekazu; Shimoda, Hiroshi; Wakabayashi, Yasunaga

    2005-01-01

    In order to narrow the risk perception gap as seen in social investigations between the general public and people who are involved in nuclear industry, public comments on high-level radioactive waste (HLW) disposal have been conducted to find the significant talking points with the general public for constructing an effective risk communication model of social risk information regarding HLW disposal. Text mining was introduced to examine public comments to identify the core public interest underlying the comments. The utilized test mining method is to cluster specific groups of words with negative meanings and then to analyze public understanding by employing text structural analysis to extract words from subjective expressions. Using these procedures, it was found that the public does not trust the nuclear fuel cycle promotion policy and shows signs of anxiety about the long-lasting technological reliability of waste storage. To develop effective social risk communication of HLW issues, these findings are expected to help experts in the nuclear industry to communicate with the general public more effectively to obtain their trust. (author)

  3. Combining QSAR Modeling and Text-Mining Techniques to Link Chemical Structures and Carcinogenic Modes of Action.

    Science.gov (United States)

    Papamokos, George; Silins, Ilona

    2016-01-01

    There is an increasing need for new reliable non-animal based methods to predict and test toxicity of chemicals. Quantitative structure-activity relationship (QSAR), a computer-based method linking chemical structures with biological activities, is used in predictive toxicology. In this study, we tested the approach to combine QSAR data with literature profiles of carcinogenic modes of action automatically generated by a text-mining tool. The aim was to generate data patterns to identify associations between chemical structures and biological mechanisms related to carcinogenesis. Using these two methods, individually and combined, we evaluated 96 rat carcinogens of the hematopoietic system, liver, lung, and skin. We found that skin and lung rat carcinogens were mainly mutagenic, while the group of carcinogens affecting the hematopoietic system and the liver also included a large proportion of non-mutagens. The automatic literature analysis showed that mutagenicity was a frequently reported endpoint in the literature of these carcinogens, however, less common endpoints such as immunosuppression and hormonal receptor-mediated effects were also found in connection with some of the carcinogens, results of potential importance for certain target organs. The combined approach, using QSAR and text-mining techniques, could be useful for identifying more detailed information on biological mechanisms and the relation with chemical structures. The method can be particularly useful in increasing the understanding of structure and activity relationships for non-mutagens.

  4. Combining QSAR Modeling and Text-Mining Techniques to Link Chemical Structures and Carcinogenic Modes of Action

    Science.gov (United States)

    Papamokos, George; Silins, Ilona

    2016-01-01

    There is an increasing need for new reliable non-animal based methods to predict and test toxicity of chemicals. Quantitative structure-activity relationship (QSAR), a computer-based method linking chemical structures with biological activities, is used in predictive toxicology. In this study, we tested the approach to combine QSAR data with literature profiles of carcinogenic modes of action automatically generated by a text-mining tool. The aim was to generate data patterns to identify associations between chemical structures and biological mechanisms related to carcinogenesis. Using these two methods, individually and combined, we evaluated 96 rat carcinogens of the hematopoietic system, liver, lung, and skin. We found that skin and lung rat carcinogens were mainly mutagenic, while the group of carcinogens affecting the hematopoietic system and the liver also included a large proportion of non-mutagens. The automatic literature analysis showed that mutagenicity was a frequently reported endpoint in the literature of these carcinogens, however, less common endpoints such as immunosuppression and hormonal receptor-mediated effects were also found in connection with some of the carcinogens, results of potential importance for certain target organs. The combined approach, using QSAR and text-mining techniques, could be useful for identifying more detailed information on biological mechanisms and the relation with chemical structures. The method can be particularly useful in increasing the understanding of structure and activity relationships for non-mutagens. PMID:27625608

  5. A Web text acquisition method%基于Delphi的Web文本获取方法

    Institute of Scientific and Technical Information of China (English)

    刘建培

    2016-01-01

    提出基于delphi的Web文本获取方法,从网页中获取Web页面格式的源文件(.html文件),分析它的结构信息,处理它的控制符,通过分析过滤源文件的格式来提取网页中的文本信息。利用标点符号对文本信息进行章节、段落、句子等预处理,将文本信息转换成句子序列,让用户快速地定位到需要了解的内容,从而让用户远离钓鱼网站、恶意广告、欺诈信息以及在浏览网页内容时产生的骚扰,提高互联网体验。%In this paper, a method of Web text acquisition with Delphi is proposed, which obtains the source files of the Web page format (.Html file) from the Web page, analyzes its structure information, deals with its control character, and extracts the text information from the Web page by analyzing and filtering the source files’ formats. The method makes use of punctuation marks to preprocess the text information for sections, paragraphs and sentences, converts the text information into sentence sequences, which allows the users to quickly navigate to the contents needed to know, allows the users to stay away from phishing sites, malicious advertising, fraud information and the harassment generated by browsing the content of Web pages, and improves their Internet experience.

  6. Análisis de sesiones de la web del Cindoc: una aproximación a la minería de uso web

    OpenAIRE

    Ortega-Priego, José-Luis

    2005-01-01

    This paper try an usability and navigability study of the Cindoc web site through web log files of the main server for october 2003. For this, web mining are used, concretly, web usage mining techniques to the detection of sessions with the aim of determine navigation patterns and design faults. Several design problems are detected in the navigation menu, in the layouth of the contents and in the web structure. Different navigation identificated patterns are discussed and many advices are ...

  7. From university research to innovation: Detecting knowledge transfer via text mining

    Energy Technology Data Exchange (ETDEWEB)

    Woltmann, S.; Clemmensen, L.; Alkærsig, L

    2016-07-01

    Knowledge transfer by universities is a top priority in innovation policy and a primary purpose for public research funding, due to being an important driver of technical change and innovation. Current empirical research on the impact of university research relies mainly on formal databases and indicators such as patents, collaborative publications and license agreements, to assess the contribution to the socioeconomic surrounding of universities. In this study, we present an extension of the current empirical framework by applying new computational methods, namely text mining and pattern recognition. Text samples for this purpose can include files containing social media contents, company websites and annual reports. The empirical focus in the present study is on the technical sciences and in particular on the case of the Technical University of Denmark (DTU). We generated two independent text collections (corpora) to identify correlations of university publications and company webpages. One corpus representing the company sites, serving as sample of the private economy and a second corpus, providing the reference to the university research, containing relevant publications. We associated the former with the latter to obtain insights into possible text and semantic relatedness. The text mining methods are extrapolating the correlations, semantic patterns and content comparison of the two corpora to define the document relatedness. We expect the development of a novel tool using contemporary techniques for the measurement of public research impact. The approach aims to be applicable across universities and thus enable a more holistic comparable assessment. This rely less on formal databases, which is certainly beneficial in terms of the data reliability. We seek to provide a supplementary perspective for the detection of the dissemination of university research and hereby enable policy makers to gain additional insights of (informal) contributions of knowledge

  8. Retrieving top-k prestige-based relevant spatial web objects

    DEFF Research Database (Denmark)

    Cao, Xin; Cong, Gao; Jensen, Christian S.

    2010-01-01

    The location-aware keyword query returns ranked objects that are near a query location and that have textual descriptions that match query keywords. This query occurs inherently in many types of mobile and traditional web services and applications, e.g., Yellow Pages and Maps services. Previous...... of prestige-based relevance to capture both the textual relevance of an object to a query and the effects of nearby objects. Based on this, a new type of query, the Location-aware top-k Prestige-based Text retrieval (LkPT) query, is proposed that retrieves the top-k spatial web objects ranked according...... to both prestige-based relevance and location proximity. We propose two algorithms that compute LkPT queries. Empirical studies with real-world spatial data demonstrate that LkPT queries are more effective in retrieving web objects than a previous approach that does not consider the effects of nearby...

  9. A Random-Dot Kinematogram for Web-Based Vision Research

    Directory of Open Access Journals (Sweden)

    Sivananda Rajananda

    2018-01-01

    Full Text Available Web-based experiments using visual stimuli have become increasingly common in recent years, but many frequently-used stimuli in vision research have yet to be developed for online platforms. Here, we introduce the first open access random-dot kinematogram (RDK for use in web browsers. This fully customizable RDK offers options to implement several different types of noise (random position, random walk, random direction and parameters to control aperture shape, coherence level, the number of dots, and other features. We include links to commented JavaScript code for easy implementation in web-based experiments, as well as an example of how this stimulus can be integrated as a plugin with a JavaScript library for online studies (jsPsych.

  10. Personality and Education Mining based Job Advisory System

    Directory of Open Access Journals (Sweden)

    Rajendra S. Choudhary

    2014-09-01

    Full Text Available Every job demands an employee with some specific qualities in addition to the basic educational qualification. For example, an introvert person cannot be a good leader despite of a very good academic qualification. Thinking and logical ability is required for a person to be a successful software engineer. So, the aim of this paper is to present a novel approach for advising an ideal job to the job seeker while considering his personality trait and educational qualification both. Very well-known theories of personality like MBTI indicator and OCEAN theory, are used for personality mining. For education mining, score based system is used. The score based system captures the information from attributes like most scoring subject, dream job etc. After personality mining, the resultant values are coalesced with the information extracted from education mining. And finally, the most suited jobs, in terms of personality and educational qualification are recommended to the job seekers. The experiment is conducted on the students who have earned an engineering degree in the field of computer science, information technology and electronics. Nevertheless, the same architecture can easily be extended to other educational degrees also. To the best of the author’s knowledge, this is a first e-job advisory system that recommends the job best suited as per one’s personality using MBTI and OCEAN theory both.

  11. Applied data mining for business and industry

    CERN Document Server

    Giudici, Paolo

    2009-01-01

    The increasing availability of data in our current, information overloaded society has led to the need for valid tools for its modelling and analysis. Data mining and applied statistical methods are the appropriate tools to extract knowledge from such data. This book provides an accessible introduction to data mining methods in a consistent and application oriented statistical framework, using case studies drawn from real industry projects and highlighting the use of data mining methods in a variety of business applications. Introduces data mining methods and applications.Covers classical and Bayesian multivariate statistical methodology as well as machine learning and computational data mining methods.Includes many recent developments such as association and sequence rules, graphical Markov models, lifetime value modelling, credit risk, operational risk and web mining.Features detailed case studies based on applied projects within industry.Incorporates discussion of data mining software, with case studies a...

  12. Web content adaptation for mobile device: A fuzzy-based approach

    Directory of Open Access Journals (Sweden)

    Frank C.C. Wu

    2012-03-01

    Full Text Available While HTML will continue to be used to develop Web content, how to effectively and efficiently transform HTML-based content automatically into formats suitable for mobile devices remains a challenge. In this paper, we introduce a concept of coherence set and propose an algorithm to automatically identify and detect coherence sets based on quantified similarity between adjacent presentation groups. Experimental results demonstrate that our method enhances Web content analysis and adaptation on the mobile Internet.

  13. Using Web Crawler Technology for Geo-Events Analysis: A Case Study of the Huangyan Island Incident

    Directory of Open Access Journals (Sweden)

    Hao Hu

    2014-04-01

    Full Text Available Social networking and network socialization provide abundant text information and social relationships into our daily lives. Making full use of these data in the big data era is of great significance for us to better understand the changing world and the information-based society. Though politics have been integrally involved in the hyperlinked world issues since the 1990s, the text analysis and data visualization of geo-events faced the bottleneck of traditional manual analysis. Though automatic assembly of different geospatial web and distributed geospatial information systems utilizing service chaining have been explored and built recently, the data mining and information collection are not comprehensive enough because of the sensibility, complexity, relativity, timeliness, and unexpected characteristics of political events. Based on the framework of Heritrix and the analysis of web-based text, word frequency, sentiment tendency, and dissemination path of the Huangyan Island incident were studied by using web crawler technology and the text analysis. The results indicate that tag cloud, frequency map, attitudes pie, individual mention ratios, and dissemination flow graph, based on the crawled information and data processing not only highlight the characteristics of geo-event itself, but also implicate many interesting phenomenon and deep-seated problems behind it, such as related topics, theme vocabularies, subject contents, hot countries, event bodies, opinion leaders, high-frequency vocabularies, information sources, semantic structure, propagation paths, distribution of different attitudes, and regional difference of net citizens’ response in the Huangyan Island incident. Furthermore, the text analysis of network information with the help of focused web crawler is able to express the time-space relationship of crawled information and the information characteristic of semantic network to the geo-events. Therefore, it is a useful tool to

  14. Churn prediction based on text mining and CRM data analysis

    OpenAIRE

    Schatzmann, Anders; Heitz, Christoph; Münch, Thomas

    2014-01-01

    Within quantitative marketing, churn prediction on a single customer level has become a major issue. An extensive body of literature shows that, today, churn prediction is mainly based on structured CRM data. However, in the past years, more and more digitized customer text data has become available, originating from emails, surveys or scripts of phone calls. To date, this data source remains vastly untapped for churn prediction, and corresponding methods are rarely described in literature. ...

  15. Weighted mining of massive collections of [Formula: see text]-values by convex optimization.

    Science.gov (United States)

    Dobriban, Edgar

    2018-06-01

    Researchers in data-rich disciplines-think of computational genomics and observational cosmology-often wish to mine large bodies of [Formula: see text]-values looking for significant effects, while controlling the false discovery rate or family-wise error rate. Increasingly, researchers also wish to prioritize certain hypotheses, for example, those thought to have larger effect sizes, by upweighting, and to impose constraints on the underlying mining, such as monotonicity along a certain sequence. We introduce Princessp , a principled method for performing weighted multiple testing by constrained convex optimization. Our method elegantly allows one to prioritize certain hypotheses through upweighting and to discount others through downweighting, while constraining the underlying weights involved in the mining process. When the [Formula: see text]-values derive from monotone likelihood ratio families such as the Gaussian means model, the new method allows exact solution of an important optimal weighting problem previously thought to be non-convex and computationally infeasible. Our method scales to massive data set sizes. We illustrate the applications of Princessp on a series of standard genomics data sets and offer comparisons with several previous 'standard' methods. Princessp offers both ease of operation and the ability to scale to extremely large problem sizes. The method is available as open-source software from github.com/dobriban/pvalue_weighting_matlab (accessed 11 October 2017).

  16. Personalization of Rule-based Web Services.

    Science.gov (United States)

    Choi, Okkyung; Han, Sang Yong

    2008-04-04

    Nowadays Web users have clearly expressed their wishes to receive personalized services directly. Personalization is the way to tailor services directly to the immediate requirements of the user. However, the current Web Services System does not provide any features supporting this such as consideration of personalization of services and intelligent matchmaking. In this research a flexible, personalized Rule-based Web Services System to address these problems and to enable efficient search, discovery and construction across general Web documents and Semantic Web documents in a Web Services System is proposed. This system utilizes matchmaking among service requesters', service providers' and users' preferences using a Rule-based Search Method, and subsequently ranks search results. A prototype of efficient Web Services search and construction for the suggested system is developed based on the current work.

  17. Tracing Knowledge Transfer from Universities to Industry: A Text Mining Approach

    DEFF Research Database (Denmark)

    Woltmann, Sabrina; Alkærsig, Lars

    2017-01-01

    This paper identifies transferred knowledge between universities and the industry by proposing the use of a computational linguistic method. Current research on university-industry knowledge exchange relies often on formal databases and indicators such as patents, collaborative publications and l...... is the first step to enable the identification of common knowledge and knowledge transfer via text mining to increase its measurability....... and license agreements, to assess the contribution to the socioeconomic surrounding of universities. We, on the other hand, use the texts from university abstracts to identify university knowledge and compare them with texts from firm webpages. We use these text data to identify common key words and thereby...... identify overlapping contents among the texts. As method we use a well-established word ranking method from the field of information retrieval term frequency–inverse document frequency (TFIDF) to identify commonalities between texts from university. In examining the outcomes of the TFIDF statistic we find...

  18. Evaluation of a Web-based Online Grant Application Review Solution

    Directory of Open Access Journals (Sweden)

    Marius Daniel PETRISOR

    2013-12-01

    Full Text Available This paper focuses on the evaluation of a web-based application used in grant application evaluations, software developed in our university, and underlines the need for simple solutions, based on recent technology, specifically tailored to one’s needs. We asked the reviewers to answer a short questionnaire, in order to assess their satisfaction with such a web-based grant application evaluation solution. All 20 reviewers accepted to answer the questionnaire, which contained 8 closed items (YES/NO answers related to reviewer’s previous experience in evaluating grant applications, previous use of such software solutions and his familiarity in using computer systems. The presented web-based application, evaluated by the users, shown a high level of acceptance and those respondents stated that they are willing to use such a solution in the future.

  19. Free web-based modelling platform for managed aquifer recharge (MAR) applications

    Science.gov (United States)

    Stefan, Catalin; Junghanns, Ralf; Glaß, Jana; Sallwey, Jana; Fatkhutdinov, Aybulat; Fichtner, Thomas; Barquero, Felix; Moreno, Miguel; Bonilla, José; Kwoyiga, Lydia

    2017-04-01

    Managed aquifer recharge represents a valuable instrument for sustainable water resources management. The concept implies purposeful infiltration of surface water into underground for later recovery or environmental benefits. Over decades, MAR schemes were successfully installed worldwide for a variety of reasons: to maximize the natural storage capacity of aquifers, physical aquifer management, water quality management, and ecological benefits. The INOWAS-DSS platform provides a collection of free web-based tools for planning, management and optimization of main components of MAR schemes. The tools are grouped into 13 specific applications that cover most relevant challenges encountered at MAR sites, both from quantitative and qualitative perspectives. The applications include among others the optimization of MAR site location, the assessment of saltwater intrusion, the restoration of groundwater levels in overexploited aquifers, the maximization of natural storage capacity of aquifers, the improvement of water quality, the design and operational optimization of MAR schemes, clogging development and risk assessment. The platform contains a collection of about 35 web-based tools of various degrees of complexity, which are either included in application specific workflows or used as standalone modelling instruments. Among them are simple tools derived from data mining and empirical equations, analytical groundwater related equations, as well as complex numerical flow and transport models (MODFLOW, MT3DMS and SEAWAT). Up to now, the simulation core of the INOWAS-DSS, which is based on the finite differences groundwater flow model MODFLOW, is implemented and runs on the web. A scenario analyser helps to easily set up and evaluate new management options as well as future development such as land use and climate change and compare them to previous scenarios. Additionally simple tools such as analytical equations to assess saltwater intrusion are already running online

  20. Identifying Understudied Nuclear Reactions by Text-mining the EXFOR Experimental Nuclear Reaction Library

    Energy Technology Data Exchange (ETDEWEB)

    Hirdt, J.A. [Department of Mathematics and Computer Science, St. Joseph' s College, Patchogue, NY 11772 (United States); Brown, D.A., E-mail: dbrown@bnl.gov [National Nuclear Data Center, Brookhaven National Laboratory, Upton, NY 11973-5000 (United States)

    2016-01-15

    The EXFOR library contains the largest collection of experimental nuclear reaction data available as well as the data's bibliographic information and experimental details. We text-mined the REACTION and MONITOR fields of the ENTRYs in the EXFOR library in order to identify understudied reactions and quantities. Using the results of the text-mining, we created an undirected graph from the EXFOR datasets with each graph node representing a single reaction and quantity and graph links representing the various types of connections between these reactions and quantities. This graph is an abstract representation of the connections in EXFOR, similar to graphs of social networks, authorship networks, etc. We use various graph theoretical tools to identify important yet understudied reactions and quantities in EXFOR. Although we identified a few cross sections relevant for shielding applications and isotope production, mostly we identified charged particle fluence monitor cross sections. As a side effect of this work, we learn that our abstract graph is typical of other real-world graphs.

  1. Identifying Understudied Nuclear Reactions by Text-mining the EXFOR Experimental Nuclear Reaction Library

    International Nuclear Information System (INIS)

    Hirdt, J.A.; Brown, D.A.

    2016-01-01

    The EXFOR library contains the largest collection of experimental nuclear reaction data available as well as the data's bibliographic information and experimental details. We text-mined the REACTION and MONITOR fields of the ENTRYs in the EXFOR library in order to identify understudied reactions and quantities. Using the results of the text-mining, we created an undirected graph from the EXFOR datasets with each graph node representing a single reaction and quantity and graph links representing the various types of connections between these reactions and quantities. This graph is an abstract representation of the connections in EXFOR, similar to graphs of social networks, authorship networks, etc. We use various graph theoretical tools to identify important yet understudied reactions and quantities in EXFOR. Although we identified a few cross sections relevant for shielding applications and isotope production, mostly we identified charged particle fluence monitor cross sections. As a side effect of this work, we learn that our abstract graph is typical of other real-world graphs.

  2. Text mining to decipher free-response consumer complaints: insights from the NHTSA vehicle owner's complaint database.

    Science.gov (United States)

    Ghazizadeh, Mahtab; McDonald, Anthony D; Lee, John D

    2014-09-01

    This study applies text mining to extract clusters of vehicle problems and associated trends from free-response data in the National Highway Traffic Safety Administration's vehicle owner's complaint database. As the automotive industry adopts new technologies, it is important to systematically assess the effect of these changes on traffic safety. Driving simulators, naturalistic driving data, and crash databases all contribute to a better understanding of how drivers respond to changing vehicle technology, but other approaches, such as automated analysis of incident reports, are needed. Free-response data from incidents representing two severity levels (fatal incidents and incidents involving injury) were analyzed using a text mining approach: latent semantic analysis (LSA). LSA and hierarchical clustering identified clusters of complaints for each severity level, which were compared and analyzed across time. Cluster analysis identified eight clusters of fatal incidents and six clusters of incidents involving injury. Comparisons showed that although the airbag clusters across the two severity levels have the same most frequent terms, the circumstances around the incidents differ. The time trends show clear increases in complaints surrounding the Ford/Firestone tire recall and the Toyota unintended acceleration recall. Increases in complaints may be partially driven by these recall announcements and the associated media attention. Text mining can reveal useful information from free-response databases that would otherwise be prohibitively time-consuming and difficult to summarize manually. Text mining can extend human analysis capabilities for large free-response databases to support earlier detection of problems and more timely safety interventions.

  3. Large-Scale Constraint-Based Pattern Mining

    Science.gov (United States)

    Zhu, Feida

    2009-01-01

    We studied the problem of constraint-based pattern mining for three different data formats, item-set, sequence and graph, and focused on mining patterns of large sizes. Colossal patterns in each data formats are studied to discover pruning properties that are useful for direct mining of these patterns. For item-set data, we observed robustness of…

  4. Public reactions to e-cigarette regulations on Twitter: a text mining analysis.

    Science.gov (United States)

    Lazard, Allison J; Wilcox, Gary B; Tuttle, Hannah M; Glowacki, Elizabeth M; Pikowski, Jessica

    2017-12-01

    In May 2016, the Food and Drug Administration (FDA) issued a final rule that deemed e-cigarettes to be within their regulatory authority as a tobacco product. News and opinions about the regulation were shared on social media platforms, such as Twitter, which can play an important role in shaping the public's attitudes. We analysed information shared on Twitter for insights into initial public reactions. A text mining approach was used to uncover important topics among reactions to the e-cigarette regulations on Twitter. SAS Text Miner V.12.1 software was used for descriptive text mining to uncover the primary topics from tweets collected from May 1 to May 17 2016 using NUVI software to gather the data. A total of nine topics were generated. These topics reveal initial reactions to whether the FDA's e-cigarette regulations will benefit or harm public health, how the regulations will impact the emerging e-cigarette market and efforts to share the news. The topics were dominated by negative or mixed reactions. In the days following the FDA's announcement of the new deeming regulations, the public reaction on Twitter was largely negative. Public health advocates should consider using social media outlets to better communicate the policy's intentions, reach and potential impact for public good to create a more balanced conversation. © Article author(s) (or their employer(s) unless otherwise stated in the text of the article) 2017. All rights reserved. No commercial use is permitted unless otherwise expressly granted.

  5. Our Policies, Their Text: German Language Students' Strategies with and Beliefs about Web-Based Machine Translation

    Science.gov (United States)

    White, Kelsey D.; Heidrich, Emily

    2013-01-01

    Most educators are aware that some students utilize web-based machine translators for foreign language assignments, however, little research has been done to determine how and why students utilize these programs, or what the implications are for language learning and teaching. In this mixed-methods study we utilized surveys, a translation task,…

  6. Utilizing coal remaining resources and post-mining land use planning based on GIS-based optimization method : study case at PT Adaro coal mine in South Kalimantan

    Directory of Open Access Journals (Sweden)

    Mohamad Anis

    2017-06-01

    Full Text Available Coal mining activities may cause a series of environmental and socio-economic issues in communities around the mining area. Mining can become an obstacle to environmental sustainability and a major hidden danger to the security of the local ecology. Therefore, the coal mining industry should follow some specific principles and factors in achieving sustainable development. These factors include geological conditions, land use, mining technology, environmental sustainability policies and government regulations, socio-economic factors, as well as sustainability optimization for post-mining land use. Resources of the remains of the coal which is defined as the last remaining condition of the resources and reserves of coal when the coal companies have already completed the life of the mine or the expiration of the licensing contract (in accordance with government permission. This research uses approch of knowledge-driven GIS based methods mainly Analytical Hierarchy Process (AHP and Fuzzy logic for utilizing coal remaining resources and post-mining land use planning. The mining area selected for this study belongs to a PKP2B (Work Agreement for Coal Mining company named Adaro Indonesia (PT Adaro. The result shows that geologically the existing formation is dominated by Coal Bearing Formation (Warukin Formation which allows the presence of remains coal resource potential after the lifetime of mine, and the suitability of rubber plantation for the optimization of land use in all mining sites and also in some disposal places in conservation areas and protected forests.

  7. A grammar checker based on web searching

    Directory of Open Access Journals (Sweden)

    Joaquim Moré

    2006-05-01

    Full Text Available This paper presents an English grammar and style checker for non-native English speakers. The main characteristic of this checker is the use of an Internet search engine. As the number of web pages written in English is immense, the system hypothesises that a piece of text not found on the Web is probably badly written. The system also hypothesises that the Web will provide examples of how the content of the text segment can be expressed in a grammatically correct and idiomatic way. Thus, when the checker warns the user about the odd nature of a text segment, the Internet engine searches for contexts that can help the user decide whether he/she should correct the segment or not. By means of a search engine, the checker also suggests use of other expressions that appear on the Web more often than the expression he/she actually wrote.

  8. Assessment of Mining Extent and Expansion in Myanmar Based on Freely-Available Satellite Imagery

    Directory of Open Access Journals (Sweden)

    Katherine J. LaJeunesse Connette

    2016-11-01

    Full Text Available Using freely-available data and open-source software, we developed a remote sensing methodology to identify mining areas and assess recent mining expansion in Myanmar. Our country-wide analysis used Landsat 8 satellite data from a select number of mining areas to create a raster layer of potential mining areas. We used this layer to guide a systematic scan of freely-available fine-resolution imagery, such as Google Earth, in order to digitize likely mining areas. During this process, each mining area was assigned a ranking indicating our certainty in correct identification of the mining land use. Finally, we identified areas of recent mining expansion based on the change in albedo, or brightness, between Landsat images from 2002 and 2015. We identified 90,041 ha of potential mining areas in Myanmar, of which 58% (52,312 ha was assigned high certainty, 29% (26,251 ha medium certainty, and 13% (11,478 ha low certainty. Of the high-certainty mining areas, 62% of bare ground was disturbed (had a large increase in albedo since 2002. This four-month project provides the first publicly-available database of mining areas in Myanmar, and it demonstrates an approach for large-scale assessment of mining extent and expansion based on freely-available data.

  9. Security Assessment of Web Based Distributed Applications

    Directory of Open Access Journals (Sweden)

    Catalin BOJA

    2010-01-01

    Full Text Available This paper presents an overview about the evaluation of risks and vulnerabilities in a web based distributed application by emphasizing aspects concerning the process of security assessment with regards to the audit field. In the audit process, an important activity is dedicated to the measurement of the characteristics taken into consideration for evaluation. From this point of view, the quality of the audit process depends on the quality of assessment methods and techniques. By doing a review of the fields involved in the research process, the approach wants to reflect the main concerns that address the web based distributed applications using exploratory research techniques. The results show that many are the aspects which must carefully be worked with, across a distributed system and they can be revealed by doing a depth introspective analyze upon the information flow and internal processes that are part of the system. This paper reveals the limitations of a non-existing unified security risk assessment model that could prevent such risks and vulnerabilities debated. Based on such standardize models, secure web based distributed applications can be easily audited and many vulnerabilities which can appear due to the lack of access to information can be avoided.

  10. Classifying unstructed textual data using the Product Score Model: an alternative text mining algorithm

    NARCIS (Netherlands)

    He, Qiwei; Veldkamp, Bernard P.; Eggen, T.J.H.M.; Veldkamp, B.P.

    2012-01-01

    Unstructured textual data such as students’ essays and life narratives can provide helpful information in educational and psychological measurement, but often contain irregularities and ambiguities, which creates difficulties in analysis. Text mining techniques that seek to extract useful

  11. Networks Models of Actin Dynamics during Spermatozoa Postejaculatory Life: A Comparison among Human-Made and Text Mining-Based Models

    Directory of Open Access Journals (Sweden)

    Nicola Bernabò

    2016-01-01

    Full Text Available Here we realized a networks-based model representing the process of actin remodelling that occurs during the acquisition of fertilizing ability of human spermatozoa (HumanMade_ActinSpermNetwork, HM_ASN. Then, we compared it with the networks provided by two different text mining tools: Agilent Literature Search (ALS and PESCADOR. As a reference, we used the data from the online repository Kyoto Encyclopaedia of Genes and Genomes (KEGG, referred to the actin dynamics in a more general biological context. We found that HM_ALS and the networks from KEGG data shared the same scale-free topology following the Barabasi-Albert model, thus suggesting that the information is spread within the network quickly and efficiently. On the contrary, the networks obtained by ALS and PESCADOR have a scale-free hierarchical architecture, which implies a different pattern of information transmission. Also, the hubs identified within the networks are different: HM_ALS and KEGG networks contain as hubs several molecules known to be involved in actin signalling; ALS was unable to find other hubs than “actin,” whereas PESCADOR gave some nonspecific result. This seems to suggest that the human-made information retrieval in the case of a specific event, such as actin dynamics in human spermatozoa, could be a reliable strategy.

  12. Comparison of SOAP and REST Based Web Services Using Software Evaluation Metrics

    Directory of Open Access Journals (Sweden)

    Tihomirovs Juris

    2016-12-01

    Full Text Available The usage of Web services has recently increased. Therefore, it is important to select right type of Web services at the project design stage. The most common implementations are based on SOAP (Simple Object Access Protocol and REST (Representational State Transfer Protocol styles. Maintainability of REST and SOAP Web services has become an important issue as popularity of Web services is increasing. Choice of the right approach is not an easy decision since it is influenced by development requirements and maintenance considerations. In the present research, we present the comparison of SOAP and REST based Web services using software evaluation metrics. To achieve this aim, a systematic literature review will be made to compare REST and SOAP Web services in terms of the software evaluation metrics.

  13. Web-Based Distributed XML Query Processing

    NARCIS (Netherlands)

    Smiljanic, M.; Feng, L.; Jonker, Willem; Blanken, Henk; Grabs, T.; Schek, H-J.; Schenkel, R.; Weikum, G.

    2003-01-01

    Web-based distributed XML query processing has gained in importance in recent years due to the widespread popularity of XML on the Web. Unlike centralized and tightly coupled distributed systems, Web-based distributed database systems are highly unpredictable and uncontrollable, with a rather

  14. An Overview of GIS-Based Modeling and Assessment of Mining-Induced Hazards: Soil, Water, and Forest

    Directory of Open Access Journals (Sweden)

    Jangwon Suh

    2017-11-01

    Full Text Available In this study, current geographic information system (GIS-based methods and their application for the modeling and assessment of mining-induced hazards were reviewed. Various types of mining-induced hazard, including soil contamination, soil erosion, water pollution, and deforestation were considered in the discussion of the strength and role of GIS as a viable problem-solving tool in relation to mining-induced hazards. The various types of mining-induced hazard were classified into two or three subtopics according to the steps involved in the reclamation procedure, or elements of the hazard of interest. Because GIS is appropriated for the handling of geospatial data in relation to mining-induced hazards, the application and feasibility of exploiting GIS-based modeling and assessment of mining-induced hazards within the mining industry could be expanded further.

  15. Analysis and Design of Web-Based Database Application for Culinary Community

    Directory of Open Access Journals (Sweden)

    Choirul Huda

    2017-03-01

    Full Text Available This research is based on the rapid development of the culinary and information technology. The difficulties in communicating with the culinary expert and on recipe documentation make a proper support for media very important. Therefore, a web-based database application for the public is important to help the culinary community in communication, searching and recipe management. The aim of the research was to design a web-based database application that could be used as social media for the culinary community. This research used literature review, user interviews, and questionnaires. Moreover, the database system development life cycle was used as a guide for designing a database especially for conceptual database design, logical database design, and physical design database. Web-based application design used eight golden rules for user interface design. The result of this research is the availability of a web-based database application that can fulfill the needs of users in the culinary field related to communication and recipe management.

  16. Anomaly Detection with Text Mining

    Data.gov (United States)

    National Aeronautics and Space Administration — Many existing complex space systems have a significant amount of historical maintenance and problem data bases that are stored in unstructured text forms. The...

  17. RCrawler: An R package for parallel web crawling and scraping

    Directory of Open Access Journals (Sweden)

    Salim Khalil

    2017-01-01

    Full Text Available RCrawler is a contributed R package for domain-based web crawling and content scraping. As the first implementation of a parallel web crawler in the R environment, RCrawler can crawl, parse, store pages, extract contents, and produce data that can be directly employed for web content mining applications. However, it is also flexible, and could be adapted to other applications. The main features of RCrawler are multi-threaded crawling, content extraction, and duplicate content detection. In addition, it includes functionalities such as URL and content-type filtering, depth level controlling, and a robot.txt parser. Our crawler has a highly optimized system, and can download a large number of pages per second while being robust against certain crashes and spider traps. In this paper, we describe the design and functionality of RCrawler, and report on our experience of implementing it in an R environment, including different optimizations that handle the limitations of R. Finally, we discuss our experimental results.

  18. A node linkage approach for sequential pattern mining.

    Directory of Open Access Journals (Sweden)

    Osvaldo Navarro

    Full Text Available Sequential Pattern Mining is a widely addressed problem in data mining, with applications such as analyzing Web usage, examining purchase behavior, and text mining, among others. Nevertheless, with the dramatic increase in data volume, the current approaches prove inefficient when dealing with large input datasets, a large number of different symbols and low minimum supports. In this paper, we propose a new sequential pattern mining algorithm, which follows a pattern-growth scheme to discover sequential patterns. Unlike most pattern growth algorithms, our approach does not build a data structure to represent the input dataset, but instead accesses the required sequences through pseudo-projection databases, achieving better runtime and reducing memory requirements. Our algorithm traverses the search space in a depth-first fashion and only preserves in memory a pattern node linkage and the pseudo-projections required for the branch being explored at the time. Experimental results show that our new approach, the Node Linkage Depth-First Traversal algorithm (NLDFT, has better performance and scalability in comparison with state of the art algorithms.

  19. Multi-Course Comparison of Traditional versus Web-based Course Delivery Systems

    Directory of Open Access Journals (Sweden)

    J. Michael Weber, PhD.,

    2007-07-01

    Full Text Available The purpose of this paper is to measure and compare the effectiveness of a Web-based course delivery system to a traditional course delivery system. The results indicate that a web-based course is effective and equivalent to a traditional classroom environment. As with the implementation of all new technologies, there are some pros and cons that should be considered. The significant pro is the element of convenience which eliminates the constrictive boundaries of space and time. The most notable con involves the impersonal nature of the online environment. Overall, we found the web-based course delivery system to be very successful in terms of learning outcomes and student satisfaction.

  20. Stemming Malay Text and Its Application in Automatic Text Categorization

    Science.gov (United States)

    Yasukawa, Michiko; Lim, Hui Tian; Yokoo, Hidetoshi

    In Malay language, there are no conjugations and declensions and affixes have important grammatical functions. In Malay, the same word may function as a noun, an adjective, an adverb, or, a verb, depending on its position in the sentence. Although extensively simple root words are used in informal conversations, it is essential to use the precise words in formal speech or written texts. In Malay, to make sentences clear, derivative words are used. Derivation is achieved mainly by the use of affixes. There are approximately a hundred possible derivative forms of a root word in written language of the educated Malay. Therefore, the composition of Malay words may be complicated. Although there are several types of stemming algorithms available for text processing in English and some other languages, they cannot be used to overcome the difficulties in Malay word stemming. Stemming is the process of reducing various words to their root forms in order to improve the effectiveness of text processing in information systems. It is essential to avoid both over-stemming and under-stemming errors. We have developed a new Malay stemmer (stemming algorithm) for removing inflectional and derivational affixes. Our stemmer uses a set of affix rules and two types of dictionaries: a root-word dictionary and a derivative-word dictionary. The use of set of rules is aimed at reducing the occurrence of under-stemming errors, while that of the dictionaries is believed to reduce the occurrence of over-stemming errors. We performed an experiment to evaluate the application of our stemmer in text mining software. For the experiment, text data used were actual web pages collected from the World Wide Web to demonstrate the effectiveness of our Malay stemming algorithm. The experimental results showed that our stemmer can effectively increase the precision of the extracted Boolean expressions for text categorization.

  1. Semantic web for integrated network analysis in biomedicine.

    Science.gov (United States)

    Chen, Huajun; Ding, Li; Wu, Zhaohui; Yu, Tong; Dhanapalan, Lavanya; Chen, Jake Y

    2009-03-01

    The Semantic Web technology enables integration of heterogeneous data on the World Wide Web by making the semantics of data explicit through formal ontologies. In this article, we survey the feasibility and state of the art of utilizing the Semantic Web technology to represent, integrate and analyze the knowledge in various biomedical networks. We introduce a new conceptual framework, semantic graph mining, to enable researchers to integrate graph mining with ontology reasoning in network data analysis. Through four case studies, we demonstrate how semantic graph mining can be applied to the analysis of disease-causal genes, Gene Ontology category cross-talks, drug efficacy analysis and herb-drug interactions analysis.

  2. Clustering box office movie with Partition Around Medoids (PAM) Algorithm based on Text Mining of Indonesian subtitle

    Science.gov (United States)

    Alfarizy, A. D.; Indahwati; Sartono, B.

    2017-03-01

    Indonesia is the largest Hollywood movie industry target market in Southeast Asia in 2015. Hollywood movies distributed in Indonesia targeted people in all range of ages including children. Low awareness of guiding children while watching movies make them could watch any rated films even the unsuitable ones for their ages. Even after being translated into Bahasa and passed the censorship phase, words that uncomfortable for children to watch still exist. The purpose of this research is to cluster box office Hollywood movies based on Indonesian subtitle, revenue, IMDb user rating and genres as one of the reference for adults to choose right movies for their children to watch. Text mining is used to extract words from the subtitles and count the frequency for three group of words (bad words, sexual words and terror words), while Partition Around Medoids (PAM) Algorithm with Gower similarity coefficient as proximity matrix is used as clustering method. We clustered 624 movies from 2006 until first half of 2016 from IMDb. Cluster with highest silhouette coefficient value (0.36) is the one with 5 clusters. Animation, Adventure and Comedy movies with high revenue like in cluster 5 is recommended for children to watch, while Comedy movies with high revenue like in cluster 4 should be avoided to watch.

  3. MetaRanker 2.0: a web server for prioritization of genetic variation data.

    Science.gov (United States)

    Pers, Tune H; Dworzyński, Piotr; Thomas, Cecilia Engel; Lage, Kasper; Brunak, Søren

    2013-07-01

    MetaRanker 2.0 is a web server for prioritization of common and rare frequency genetic variation data. Based on heterogeneous data sets including genetic association data, protein-protein interactions, large-scale text-mining data, copy number variation data and gene expression experiments, MetaRanker 2.0 prioritizes the protein-coding part of the human genome to shortlist candidate genes for targeted follow-up studies. MetaRanker 2.0 is made freely available at www.cbs.dtu.dk/services/MetaRanker-2.0.

  4. Development, implementation and pilot evaluation of a Web-based Virtual Patient Case Simulation environment – Web-SP

    Directory of Open Access Journals (Sweden)

    Boberg Jonas

    2006-02-01

    Full Text Available Abstract Background The Web-based Simulation of Patients (Web-SP project was initiated in order to facilitate the use of realistic and interactive virtual patients (VP in medicine and healthcare education. Web-SP focuses on moving beyond the technology savvy teachers, when integrating simulation-based education into health sciences curricula, by making the creation and use of virtual patients easier. The project strives to provide a common generic platform for design/creation, management, evaluation and sharing of web-based virtual patients. The aim of this study was to evaluate if it was possible to develop a web-based virtual patient case simulation environment where the entire case authoring process might be handled by teachers and which would be flexible enough to be used in different healthcare disciplines. Results The Web-SP system was constructed to support easy authoring, management and presentation of virtual patient cases. The case authoring environment was found to facilitate for teachers to create full-fledged patient cases without the assistance of computer specialists. Web-SP was successfully implemented at several universities by taking into account key factors such as cost, access, security, scalability and flexibility. Pilot evaluations in medical, dentistry and pharmacy courses shows that students regarded Web-SP as easy to use, engaging and to be of educational value. Cases adapted for all three disciplines were judged to be of significant educational value by the course leaders. Conclusion The Web-SP system seems to fulfil the aim of providing a common generic platform for creation, management and evaluation of web-based virtual patient cases. The responses regarding the authoring environment indicated that the system might be user-friendly enough to appeal to a majority of the academic staff. In terms of implementation strengths, Web-SP seems to fulfil most needs from course directors and teachers from various educational

  5. The use of web ontology languages and other semantic web tools in drug discovery.

    Science.gov (United States)

    Chen, Huajun; Xie, Guotong

    2010-05-01

    To optimize drug development processes, pharmaceutical companies require principled approaches to integrate disparate data on a unified infrastructure, such as the web. The semantic web, developed on the web technology, provides a common, open framework capable of harmonizing diversified resources to enable networked and collaborative drug discovery. We survey the state of art of utilizing web ontologies and other semantic web technologies to interlink both data and people to support integrated drug discovery across domains and multiple disciplines. Particularly, the survey covers three major application categories including: i) semantic integration and open data linking; ii) semantic web service and scientific collaboration and iii) semantic data mining and integrative network analysis. The reader will gain: i) basic knowledge of the semantic web technologies; ii) an overview of the web ontology landscape for drug discovery and iii) a basic understanding of the values and benefits of utilizing the web ontologies in drug discovery. i) The semantic web enables a network effect for linking open data for integrated drug discovery; ii) The semantic web service technology can support instant ad hoc collaboration to improve pipeline productivity and iii) The semantic web encourages publishing data in a semantic way such as resource description framework attributes and thus helps move away from a reliance on pure textual content analysis toward more efficient semantic data mining.

  6. Integrated Text Mining and Chemoinformatics Analysis Associates Diet to Health Benefit at Molecular Level

    DEFF Research Database (Denmark)

    Jensen, Kasper; Panagiotou, Gianni; Kouskoumvekaki, Irene

    2014-01-01

    , lipids and nutrients. In this work, we applied text mining and Naïve Bayes classification to assemble the knowledge space of food-phytochemical and food-disease associations, where we distinguish between disease prevention/amelioration and disease progression. We subsequently searched for frequently...

  7. Study on the Method of Association Rules Mining Based on Genetic Algorithm and Application in Analysis of Seawater Samples

    Directory of Open Access Journals (Sweden)

    Qiuhong Sun

    2014-04-01

    Full Text Available Based on the data mining research, the data mining based on genetic algorithm method, the genetic algorithm is briefly introduced, while the genetic algorithm based on two important theories and theoretical templates principle implicit parallelism is also discussed. Focuses on the application of genetic algorithms for association rule mining method based on association rule mining, this paper proposes a genetic algorithm fitness function structure, data encoding, such as the title of the improvement program, in particular through the early issues study, proposed the improved adaptive Pc, Pm algorithm is applied to the genetic algorithm, thereby improving efficiency of the algorithm. Finally, a genetic algorithm based association rule mining algorithm, and be applied in sea water samples database in data mining and prove its effective.

  8. Quality of reporting web-based and non-web-based survey studies: What authors, reviewers and consumers should consider.

    Science.gov (United States)

    Turk, Tarek; Elhady, Mohamed Tamer; Rashed, Sherwet; Abdelkhalek, Mariam; Nasef, Somia Ahmed; Khallaf, Ashraf Mohamed; Mohammed, Abdelrahman Tarek; Attia, Andrew Wassef; Adhikari, Purushottam; Amin, Mohamed Alsabbahi; Hirayama, Kenji; Huy, Nguyen Tien

    2018-01-01

    Several influential aspects of survey research have been under-investigated and there is a lack of guidance on reporting survey studies, especially web-based projects. In this review, we aim to investigate the reporting practices and quality of both web- and non-web-based survey studies to enhance the quality of reporting medical evidence that is derived from survey studies and to maximize the efficiency of its consumption. Reporting practices and quality of 100 random web- and 100 random non-web-based articles published from 2004 to 2016 were assessed using the SUrvey Reporting GuidelinE (SURGE). The CHERRIES guideline was also used to assess the reporting quality of Web-based studies. Our results revealed a potential gap in the reporting of many necessary checklist items in both web-based and non-web-based survey studies including development, description and testing of the questionnaire, the advertisement and administration of the questionnaire, sample representativeness and response rates, incentives, informed consent, and methods of statistical analysis. Our findings confirm the presence of major discrepancies in reporting results of survey-based studies. This can be attributed to the lack of availability of updated universal checklists for quality of reporting standards. We have summarized our findings in a table that may serve as a roadmap for future guidelines and checklists, which will hopefully include all types and all aspects of survey research.

  9. HPIminer: A text mining system for building and visualizing human protein interaction networks and pathways.

    Science.gov (United States)

    Subramani, Suresh; Kalpana, Raja; Monickaraj, Pankaj Moses; Natarajan, Jeyakumar

    2015-04-01

    The knowledge on protein-protein interactions (PPI) and their related pathways are equally important to understand the biological functions of the living cell. Such information on human proteins is highly desirable to understand the mechanism of several diseases such as cancer, diabetes, and Alzheimer's disease. Because much of that information is buried in biomedical literature, an automated text mining system for visualizing human PPI and pathways is highly desirable. In this paper, we present HPIminer, a text mining system for visualizing human protein interactions and pathways from biomedical literature. HPIminer extracts human PPI information and PPI pairs from biomedical literature, and visualize their associated interactions, networks and pathways using two curated databases HPRD and KEGG. To our knowledge, HPIminer is the first system to build interaction networks from literature as well as curated databases. Further, the new interactions mined only from literature and not reported earlier in databases are highlighted as new. A comparative study with other similar tools shows that the resultant network is more informative and provides additional information on interacting proteins and their associated networks. Copyright © 2015 Elsevier Inc. All rights reserved.

  10. pubmed. mineR: An R package with text-mining algorithms to ...

    Indian Academy of Sciences (India)

    2016-08-26

    Aug 26, 2016 ... Three case studies are presented, namely, `Evolving role of diabetes educators', `Cancer risk assessment' and `Dynamic concepts on disease and comorbidity' to illustrate the use of pubmed.mineR. The package generally runs fast with small elapsed times in regular workstations even on large corpus ...

  11. An Open-Source Web-Based Tool for Resource-Agnostic Interactive Translation Prediction

    Directory of Open Access Journals (Sweden)

    Daniel Torregrosa

    2014-09-01

    Full Text Available We present a web-based open-source tool for interactive translation prediction (ITP and describe its underlying architecture. ITP systems assist human translators by making context-based computer-generated suggestions as they type. Most of the ITP systems in literature are strongly coupled with a statistical machine translation system that is conveniently adapted to provide the suggestions. Our system, however, follows a resource-agnostic approach and suggestions are obtained from any unmodified black-box bilingual resource. This paper reviews our ITP method and describes the architecture of Forecat, a web tool, partly based on the recent technology of web components, that eases the use of our ITP approach in any web application requiring this kind of translation assistance. We also evaluate the performance of our method when using an unmodified Moses-based statistical machine translation system as the bilingual resource.

  12. Web-Based Learning Support System

    Science.gov (United States)

    Fan, Lisa

    Web-based learning support system offers many benefits over traditional learning environments and has become very popular. The Web is a powerful environment for distributing information and delivering knowledge to an increasingly wide and diverse audience. Typical Web-based learning environments, such as Web-CT, Blackboard, include course content delivery tools, quiz modules, grade reporting systems, assignment submission components, etc. They are powerful integrated learning management systems (LMS) that support a number of activities performed by teachers and students during the learning process [1]. However, students who study a course on the Internet tend to be more heterogeneously distributed than those found in a traditional classroom situation. In order to achieve optimal efficiency in a learning process, an individual learner needs his or her own personalized assistance. For a web-based open and dynamic learning environment, personalized support for learners becomes more important. This chapter demonstrates how to realize personalized learning support in dynamic and heterogeneous learning environments by utilizing Adaptive Web technologies. It focuses on course personalization in terms of contents and teaching materials that is according to each student's needs and capabilities. An example of using Rough Set to analyze student personal information to assist students with effective learning and predict student performance is presented.

  13. miRiaD: A Text Mining Tool for Detecting Associations of microRNAs with Diseases.

    Science.gov (United States)

    Gupta, Samir; Ross, Karen E; Tudor, Catalina O; Wu, Cathy H; Schmidt, Carl J; Vijay-Shanker, K

    2016-04-29

    MicroRNAs are increasingly being appreciated as critical players in human diseases, and questions concerning the role of microRNAs arise in many areas of biomedical research. There are several manually curated databases of microRNA-disease associations gathered from the biomedical literature; however, it is difficult for curators of these databases to keep up with the explosion of publications in the microRNA-disease field. Moreover, automated literature mining tools that assist manual curation of microRNA-disease associations currently capture only one microRNA property (expression) in the context of one disease (cancer). Thus, there is a clear need to develop more sophisticated automated literature mining tools that capture a variety of microRNA properties and relations in the context of multiple diseases to provide researchers with fast access to the most recent published information and to streamline and accelerate manual curation. We have developed miRiaD (microRNAs in association with Disease), a text-mining tool that automatically extracts associations between microRNAs and diseases from the literature. These associations are often not directly linked, and the intermediate relations are often highly informative for the biomedical researcher. Thus, miRiaD extracts the miR-disease pairs together with an explanation for their association. We also developed a procedure that assigns scores to sentences, marking their informativeness, based on the microRNA-disease relation observed within the sentence. miRiaD was applied to the entire Medline corpus, identifying 8301 PMIDs with miR-disease associations. These abstracts and the miR-disease associations are available for browsing at http://biotm.cis.udel.edu/miRiaD . We evaluated the recall and precision of miRiaD with respect to information of high interest to public microRNA-disease database curators (expression and target gene associations), obtaining a recall of 88.46-90.78. When we expanded the evaluation to

  14. Asymmetric threat data mining and knowledge discovery

    Science.gov (United States)

    Gilmore, John F.; Pagels, Michael A.; Palk, Justin

    2001-03-01

    Asymmetric threats differ from the conventional force-on- force military encounters that the Defense Department has historically been trained to engage. Terrorism by its nature is now an operational activity that is neither easily detected or countered as its very existence depends on small covert attacks exploiting the element of surprise. But terrorism does have defined forms, motivations, tactics and organizational structure. Exploiting a terrorism taxonomy provides the opportunity to discover and assess knowledge of terrorist operations. This paper describes the Asymmetric Threat Terrorist Assessment, Countering, and Knowledge (ATTACK) system. ATTACK has been developed to (a) data mine open source intelligence (OSINT) information from web-based newspaper sources, video news web casts, and actual terrorist web sites, (b) evaluate this information against a terrorism taxonomy, (c) exploit country/region specific social, economic, political, and religious knowledge, and (d) discover and predict potential terrorist activities and association links. Details of the asymmetric threat structure and the ATTACK system architecture are presented with results of an actual terrorist data mining and knowledge discovery test case shown.

  15. Web-based computational chemistry education with CHARMMing I: Lessons and tutorial.

    Directory of Open Access Journals (Sweden)

    Benjamin T Miller

    2014-07-01

    Full Text Available This article describes the development, implementation, and use of web-based "lessons" to introduce students and other newcomers to computer simulations of biological macromolecules. These lessons, i.e., interactive step-by-step instructions for performing common molecular simulation tasks, are integrated into the collaboratively developed CHARMM INterface and Graphics (CHARMMing web user interface (http://www.charmming.org. Several lessons have already been developed with new ones easily added via a provided Python script. In addition to CHARMMing's new lessons functionality, web-based graphical capabilities have been overhauled and are fully compatible with modern mobile web browsers (e.g., phones and tablets, allowing easy integration of these advanced simulation techniques into coursework. Finally, one of the primary objections to web-based systems like CHARMMing has been that "point and click" simulation set-up does little to teach the user about the underlying physics, biology, and computational methods being applied. In response to this criticism, we have developed a freely available tutorial to bridge the gap between graphical simulation setup and the technical knowledge necessary to perform simulations without user interface assistance.

  16. Social Web mining and exploitation for serious applications: Technosocial Predictive Analytics and related technologies for public health, environmental and national security surveillance.

    Science.gov (United States)

    Kamel Boulos, Maged N; Sanfilippo, Antonio P; Corley, Courtney D; Wheeler, Steve

    2010-10-01

    This paper explores Technosocial Predictive Analytics (TPA) and related methods for Web "data mining" where users' posts and queries are garnered from Social Web ("Web 2.0") tools such as blogs, micro-blogging and social networking sites to form coherent representations of real-time health events. The paper includes a brief introduction to commonly used Social Web tools such as mashups and aggregators, and maps their exponential growth as an open architecture of participation for the masses and an emerging way to gain insight about people's collective health status of whole populations. Several health related tool examples are described and demonstrated as practical means through which health professionals might create clear location specific pictures of epidemiological data such as flu outbreaks. Copyright 2010 Elsevier Ireland Ltd. All rights reserved.

  17. Students' Satisfaction and Perceived Learning with a Web-based Course

    Directory of Open Access Journals (Sweden)

    Derek Holton

    2003-01-01

    Full Text Available This paper describes a study, which explored students' responses and reactions to a Web-based tertiary statistics course supporting problem-based learning. The study was undertaken among postgraduate students in a Malaysian university. The findings revealed that the majority of the students were satisfied with their learning experience and achieved comparable learning outcomes to students in the face-to-face version of the course. Students appreciated the flexibility of anytime, anywhere learning. The majority of the students was motivated to learn and had adequate technical support to complete the course. Improvement in computer skills was an incidental learning outcome from the course. The student-student and student-teacher communication was satisfactory but a few students felt isolated learning in the Web environment. These students expressed a need for some face-to-face lectures. While the majority of the students saw value in learning in a problem-based setting, around a third of the students expressed no opinion on, or were dissatisfied with, the problem-based environment. They were satisfied with the group facilitators and learning materials but were unhappy with the group dynamics. Some of the students felt unable to contribute to or learn from the asynchronous Web-based conferences using problem-based approach. Some of the students were not punctual and were not prepared to take part in the Web-based conferences. The findings have suggested a need to explicitly design an organising strategy in the asynchronous Web-based conferences using problem-based approach to aid students in completing the problem-based learning process.

  18. Pricing Mining Concessions Based on Combined Multinomial Pricing Model

    Directory of Open Access Journals (Sweden)

    Chang Xiao

    2017-01-01

    Full Text Available A combined multinomial pricing model is proposed for pricing mining concession in which the annualized volatility of the price of mineral products follows a multinomial distribution. First, a combined multinomial pricing model is proposed which consists of binomial pricing models calculated according to different volatility values. Second, a method is provided to calculate the annualized volatility and the distribution. Third, the value of convenience yields is calculated based on the relationship between the futures price and the spot price. The notion of convenience yields is used to adjust our model as well. Based on an empirical study of a Chinese copper mine concession, we verify that our model is easy to use and better than the model with constant volatility when considering the changing annualized volatility of the price of the mineral product.

  19. Tools and databases of the KOMICS web portal for preprocessing, mining, and dissemination of metabolomics data.

    Science.gov (United States)

    Sakurai, Nozomu; Ara, Takeshi; Enomoto, Mitsuo; Motegi, Takeshi; Morishita, Yoshihiko; Kurabayashi, Atsushi; Iijima, Yoko; Ogata, Yoshiyuki; Nakajima, Daisuke; Suzuki, Hideyuki; Shibata, Daisuke

    2014-01-01

    A metabolome--the collection of comprehensive quantitative data on metabolites in an organism--has been increasingly utilized for applications such as data-intensive systems biology, disease diagnostics, biomarker discovery, and assessment of food quality. A considerable number of tools and databases have been developed to date for the analysis of data generated by various combinations of chromatography and mass spectrometry. We report here a web portal named KOMICS (The Kazusa Metabolomics Portal), where the tools and databases that we developed are available for free to academic users. KOMICS includes the tools and databases for preprocessing, mining, visualization, and publication of metabolomics data. Improvements in the annotation of unknown metabolites and dissemination of comprehensive metabolomic data are the primary aims behind the development of this portal. For this purpose, PowerGet and FragmentAlign include a manual curation function for the results of metabolite feature alignments. A metadata-specific wiki-based database, Metabolonote, functions as a hub of web resources related to the submitters' work. This feature is expected to increase citation of the submitters' work, thereby promoting data publication. As an example of the practical use of KOMICS, a workflow for a study on Jatropha curcas is presented. The tools and databases available at KOMICS should contribute to enhanced production, interpretation, and utilization of metabolomic Big Data.

  20. Rweb:Web-based Statistical Analysis

    Directory of Open Access Journals (Sweden)

    Jeff Banfield

    1999-03-01

    Full Text Available Rweb is a freely accessible statistical analysis environment that is delivered through the World Wide Web (WWW. It is based on R, a well known statistical analysis package. The only requirement to run the basic Rweb interface is a WWW browser that supports forms. If you want graphical output you must, of course, have a browser that supports graphics. The interface provides access to WWW accessible data sets, so you may run Rweb on your own data. Rweb can provide a four window statistical computing environment (code input, text output, graphical output, and error information through browsers that support Javascript. There is also a set of point and click modules under development for use in introductory statistics courses.