WorldWideScience

Sample records for ascii text documents

  1. Documents and legal texts

    International Nuclear Information System (INIS)

    This section reprints a selection of recently published legislative texts and documents: - Russian Federation: Federal Law No.170 of 21 November 1995 on the use of atomic energy, Adopted by the State Duma on 20 October 1995; - Uruguay: Law No.19.056 On the Radiological Protection and Safety of Persons, Property and the Environment (4 January 2013); - Japan: Third Supplement to Interim Guidelines on Determination of the Scope of Nuclear Damage resulting from the Accident at the Tokyo Electric Power Company Fukushima Daiichi and Daini Nuclear Power Plants (concerning Damages related to Rumour-Related Damage in the Agriculture, Forestry, Fishery and Food Industries), 30 January 2013; - France and the United States: Joint Statement on Liability for Nuclear Damage (Aug 2013); - Franco-Russian Nuclear Power Declaration (1 November 2013)

  2. Documents and legal texts

    International Nuclear Information System (INIS)

    This section treats of the following Documents and legal texts: 1 - Canada: Nuclear Liability and Compensation Act (An Act respecting civil liability and compensation for damage in case of a nuclear incident, repealing the Nuclear Liability Act and making consequential amendments to other acts); 2 - Japan: Act on Compensation for Nuclear Damage (The purpose of this act is to protect persons suffering from nuclear damage and to contribute to the sound development of the nuclear industry by establishing a basic system regarding compensation in case of nuclear damage caused by reactor operation etc.); Act on Indemnity Agreements for Compensation of Nuclear Damage; 3 - Slovak Republic: Act on Civil Liability for Nuclear Damage and on its Financial Coverage and on Changes and Amendments to Certain Laws (This Act regulates: a) The civil liability for nuclear damage incurred in the causation of a nuclear incident, b) The scope of powers of the Nuclear Regulatory Authority (hereinafter only as the 'Authority') in relation to the application of this Act, c) The competence of the National Bank of Slovakia in relation to the supervised financial market entities in the financial coverage of liability for nuclear damage; and d) The penalties for violation of this Act)

  3. An Advanced Text Encryption & Compression System Based on ASCII Values & Arithmetic Encoding to Improve Data Security

    OpenAIRE

    Amandeep Singh Sidhu; Er. Meenakshi Garg

    2014-01-01

    Compression algorithms reduce the redundancy in data representation thus increasing effective data density. Data compression is a very useful technique that helps in reducing the size of text data and storing the same amount of data in relatively fewer bits resulting in reducing the data storage space, resource usage or transmission capacity. There are a number of techniques that have been used for text data compression which can be categorized as Lossy and Lossless data compre...

  4. Survey on Text Document Clustering

    OpenAIRE

    M.Thangamani; Dr.P.Thangaraj

    2010-01-01

    Document clustering, also referred to as text clustering, is conceptually close to data clustering. It is quite difficult to find the relevant information among a large number of documents, which is why document clustering came into the picture. A cluster is basically a group of similar data; document clustering means segregating the data into different groups of similar items. Clustering can belong to the mathematical, statistical or numerical domain. Clustering is a fundamental data analysi...

  5. Text documents as social networks

    Science.gov (United States)

    Balinsky, Helen; Balinsky, Alexander; Simske, Steven J.

    2012-03-01

    The extraction of keywords and features is a fundamental problem in text data mining. Document processing applications directly depend on the quality and speed of the identification of salient terms and phrases. Applications as disparate as automatic document classification, information visualization, filtering and security policy enforcement all rely on the quality of automatically extracted keywords. Recently, a novel approach to rapid change detection in data streams and documents has been developed. It is based on ideas from image processing and in particular on the Helmholtz Principle from the Gestalt Theory of human perception. By modeling a document as a one-parameter family of graphs with its sentences or paragraphs defining the vertex set and with edges defined by Helmholtz's principle, we demonstrated that for some range of the parameters, the resulting graph becomes a small-world network. In this article we investigate the natural orientation of edges in such small-world networks. For two connected sentences, we can say which one is the first and which one is the second, according to their position in the document. This makes such a graph look like a small WWW-type network, and PageRank-type algorithms produce an interesting ranking of its nodes.
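
    The graph-and-rank idea above can be illustrated with a short sketch, assuming the networkx package is available. The shared-term rule used here to create edges is only a stand-in for the Helmholtz-principle criterion the authors actually use, and the three-sentence document is invented.

        # Sketch: model a document as a directed graph of sentences and rank them.
        # Edges here come from a naive shared-term rule, not the Helmholtz principle.
        import itertools
        import networkx as nx

        document = ("Keyword extraction is a core text mining task. "
                    "Graphs of sentences expose salient terms. "
                    "PageRank on sentence graphs highlights central sentences.")

        sentences = [s.strip() for s in document.split(".") if s.strip()]
        tokens = [set(s.lower().split()) for s in sentences]

        graph = nx.DiGraph()
        graph.add_nodes_from(range(len(sentences)))
        for i, j in itertools.combinations(range(len(sentences)), 2):
            if tokens[i] & tokens[j]:        # sentences share at least one term
                graph.add_edge(i, j)         # edge oriented by document order

        ranks = nx.pagerank(graph, alpha=0.85)
        for idx in sorted(ranks, key=ranks.get, reverse=True):
            print(f"{ranks[idx]:.3f}  {sentences[idx]}")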

  6. Emotion Detection From Text Documents

    Directory of Open Access Journals (Sweden)

    Shiv Naresh Shivhare

    2014-11-01

    Emotion detection is one of the most emerging issues in human-computer interaction. A sufficient amount of work has been done by researchers to detect emotions from facial and audio information, whereas recognizing emotions from textual data is still a fresh and hot research area. This paper presents a knowledge-based survey of emotion detection from textual data and the methods used for this purpose. The paper also proposes a new architecture for recognizing emotions from text documents. The proposed architecture is composed of two main parts, an emotion ontology and an emotion detector algorithm. The proposed emotion detector system takes a text document and the emotion ontology as inputs and produces one of six emotion classes (i.e., love, joy, anger, sadness, fear and surprise) as the output.

  7. Text line Segmentation of Curved Document Images

    Directory of Open Access Journals (Sweden)

    Anusree.M

    2014-05-01

    Document image analysis has been widely used in historical and heritage studies, education and digital libraries. Document image analysis techniques are mainly used to improve the human readability and the OCR quality of a document. During digitization, camera-captured images contain warped documents due to perspective and geometric distortions. The main difficulty is text line detection in the document. Many algorithms have been proposed to address the problem of text line detection in printed documents, but they fail to extract text lines in curved documents. This paper describes a segmentation technique that detects curled text lines in camera-captured document images.

  8. Plagiarism in text documents: Methods of Plagiarism

    OpenAIRE

    Opička, Jan

    2009-01-01

    This thesis is devoted to the detection of plagiarism among documents in large document databases. The problem of plagiarism detection is more pressing today than ever, and the easy accessibility of documents in digital form contributes to it. To enforce authors' rights and stamp out plagiarism, it is necessary to design a system that can reliably distinguish plagiarized documents. Such a system is a valuable help in the academic field, where it can be used for controlling of...

  9. Typograph: Multiscale Spatial Exploration of Text Documents

    Energy Technology Data Exchange (ETDEWEB)

    Endert, Alexander; Burtner, Edwin R.; Cramer, Nicholas O.; Perko, Ralph J.; Hampton, Shawn D.; Cook, Kristin A.

    2013-12-01

    Visualizing large document collections using a spatial layout of terms can enable quick overviews of information. However, these metaphors (e.g., word clouds, tag clouds, etc.) often lack interactivity for exploring the information, and the location and rendering of the terms are often not based on mathematical models that maintain relative distances from other information based on similarity metrics. Further, transitioning between levels of detail (i.e., from terms to full documents) can be challenging. In this paper, we present Typograph, a multi-scale spatial exploration visualization for large document collections. Building on term-based visualization methods, Typograph enables multiple levels of detail (terms, phrases, snippets, and full documents) within a single spatialization. Further, information is placed based on its relative similarity to other information to create the “near = similar” geography metaphor. This paper discusses the design principles and functionality of Typograph and presents a use case analyzing Wikipedia to demonstrate usage.

  10. Typograph: Multiscale Spatial Exploration of Text Documents

    Energy Technology Data Exchange (ETDEWEB)

    Endert, Alexander; Burtner, Edwin R.; Cramer, Nicholas O.; Perko, Ralph J.; Hampton, Shawn D.; Cook, Kristin A.

    2013-10-06

    Visualizing large document collections using a spatial layout of terms can enable quick overviews of information. These visual metaphors (e.g., word clouds, tag clouds, etc.) traditionally show a series of terms organized by space-filling algorithms. However, these views often lack the ability to interactively explore the information to gain more detail, and the location and rendering of the terms are often not based on mathematical models that maintain relative distances from other information based on similarity metrics. In this paper, we present Typograph, a multi-scale spatial exploration visualization for large document collections. Building on term-based visualization methods, Typograph enables multiple levels of detail (terms, phrases, snippets, and full documents) within a single spatialization. Further, information is placed based on its relative similarity to other information to create the “near = similar” geographic metaphor. This paper discusses the design principles and functionality of Typograph and presents a use case analyzing Wikipedia to demonstrate usage.

  11. GENERATION OF A SET OF KEY TERMS CHARACTERISING TEXT DOCUMENTS

    Directory of Open Access Journals (Sweden)

    Kristina Machova

    2007-06-01

    The presented paper describes statistical methods (information gain, mutual X^2 statistics, and the TF-IDF method) for key word generation from a text document collection. These key words should characterise the content of text documents and can be used to retrieve relevant documents from a document collection. Term relations were detected on the basis of the conditional probability of term occurrences. The focus is on the detection of those words which occur together very often; thus, key words consisting of two terms were generated additionally. Several tests were carried out using the 20 Newsgroups collection of text documents.
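
    The TF-IDF part of such a pipeline can be sketched in a few lines; the snippet below assumes scikit-learn, uses three invented mini-documents, and leaves out the information gain and X^2 scoring that the paper also covers. The bigram setting plays the role of the two-term key words mentioned above.

        # Sketch: TF-IDF based key-term generation for a tiny document collection.
        from sklearn.feature_extraction.text import TfidfVectorizer

        docs = [
            "nuclear reactor safety regulation and licensing",
            "text clustering groups similar documents together",
            "keyword extraction characterises document content",
        ]

        # ngram_range=(1, 2) also produces two-term key words.
        vectorizer = TfidfVectorizer(stop_words="english", ngram_range=(1, 2))
        matrix = vectorizer.fit_transform(docs)
        terms = vectorizer.get_feature_names_out()

        for row, doc in zip(matrix.toarray(), docs):
            top = sorted(zip(row, terms), reverse=True)[:3]
            print(doc, "->", [term for score, term in top if score > 0])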

  12. A New Fragile Watermarking Scheme for Text Documents Authentication

    Institute of Scientific and Technical Information of China (English)

    XIANG Huazheng; SUN Xingming; TANG Chengliang

    2006-01-01

    Because there are different modification types, such as deleting characters and inserting characters in text documents, algorithms for image authentication cannot be used directly for text document authentication. A text watermarking scheme for text document authentication is proposed in this paper. By extracting the features of the character cascade together with the user secret key, the scheme combines the features of the text with the user information as a watermark, which is embedded into the transformed text itself. Receivers can verify the integrity and authenticity of the text through blind detection. Further research demonstrates that the scheme can also localize tampering, classify the type of modification, and recover part of a modified text document. These conclusions are supported by both our experimental results and analysis.

  13. CERCLIS (Superfund) ASCII Text Format - CPAD Database

    Data.gov (United States)

    U.S. Environmental Protection Agency — The Comprehensive Environmental Response, Compensation and Liability Information System (CERCLIS) (Superfund) Public Access Database (CPAD) contains a selected set...

  14. A Semi-Structured Document Model for Text Mining

    Institute of Scientific and Technical Information of China (English)

    杨建武; 陈晓鸥

    2002-01-01

    A semi-structured document has more structured information compared to an ordinary document, and the relation among semi-structured documents can be fully utilized. In order to take advantage of the structure and link information in a semi-structured document for better mining, a structured link vector model (SLVM) is presented in this paper, where a vector represents a document, and the vector's elements are determined by terms, document structure and neighboring documents. Text mining based on SLVM is described in the procedure of K-means for briefness and clarity: calculating document similarity and calculating the cluster center. The clustering based on SLVM performs significantly better than that based on a conventional vector space model in the experiments, and its F value increases from 0.65-0.73 to 0.82-0.86.

  15. Classification process in a text document recommender system

    Directory of Open Access Journals (Sweden)

    Dan MUNTEANU

    2005-12-01

    This paper presents the classification process in a recommender system used for textual documents taken especially from the web. In the classification process, the system uses a combination of content filters, event filters and collaborative filters, and it uses implicit and explicit feedback for evaluating documents.

  16. Literature Review of Automatic Multiple Documents Text Summarization

    Directory of Open Access Journals (Sweden)

    Md. Majharul Haque

    2013-05-01

    Thanks to the World Wide Web, the corpus of online information is gigantic in volume. Search engines such as Google, AltaVista and Yahoo have been developed to retrieve specific information from this huge amount of data. But the outcome of a search engine is unable to provide the expected result, as the quantity of information is increasing enormously day by day and the findings are abundant. So automatic text summarization is demanded for salient information retrieval. Automatic text summarization is a system of summarizing text by computer, where a text is given to the computer as input and the output is a shorter and less redundant form of the original text. An informative précis is very helpful in our daily life to save valuable time. Research first started naively on single-document abridgement, but recently information about a single topic is found from various sources in different websites, journals, newspapers, text books, etc., for which multi-document summarization is required. In this paper, the task of automatic multiple-document text summarization is addressed and the different procedures of various researchers are discussed. Various techniques that have been used for multi-document summarization are compared here. Some promising approaches are indicated, and particular attention is dedicated to describing different methods, from the raw level to those resembling human experts, so that in the future one can get significant guidance for further analysis.

  17. Integrated Clustering and Feature Selection Scheme for Text Documents.

    Directory of Open Access Journals (Sweden)

    M. Thangamani

    2010-01-01

    Problem statement: Text documents are unstructured databases that contain raw data collections. Clustering techniques are used to group text documents with reference to their similarity. Approach: Feature selection techniques were used to improve the efficiency and accuracy of the clustering process. Feature selection was done by eliminating redundant and irrelevant items from the text document contents. Statistical methods were used in the text clustering and feature selection algorithm. The cube size is very high and accuracy is low in the term-based text clustering and feature selection method. A semantic clustering and feature selection method was proposed to improve the clustering and feature selection mechanism with semantic relations of the text documents. The proposed system was designed to identify semantic relations using an ontology. The ontology was used to represent the term and concept relationships. Results: The synonym, meronym and hypernym relationships were represented in the ontology. Concept weights were estimated with reference to the ontology. The concept weight was used for the clustering process. The system was implemented in two methods: term clustering with feature selection and semantic clustering with feature selection. Conclusion: The performance analysis was carried out with the term clustering and semantic clustering methods. The accuracy and efficiency factors were analyzed in the performance analysis.

  18. EXPLOITING RHETORICAL RELATIONS TO MULTIPLE DOCUMENTS TEXT SUMMARIZATION

    Directory of Open Access Journals (Sweden)

    N. Adilah Hanin Zahri

    2015-03-01

    Much previous research has proven that the usage of rhetorical relations is capable of enhancing many applications such as text summarization, question answering and natural language generation. This work proposes an approach that extends the benefit of rhetorical relations to address the redundancy problem in cluster-based text summarization of multiple documents. We exploited the rhetorical relations existing between sentences to group similar sentences into multiple clusters and identify themes of common information. Candidate summary sentences were extracted from these clusters. Then, cluster-based text summarization was performed using a Conditional Markov Random Walk Model to measure the saliency scores of the candidate summaries. We evaluated our method by measuring the cohesion and separation of the clusters constructed by exploiting rhetorical relations, and the ROUGE scores of the generated summaries. The experimental results show that our method performed well, which indicates the promising potential of applying rhetorical relations to text clustering in support of multi-document text summarization.

  19. Raw Data (ASCII format) - PLACE | LSDB Archive [Life Science Database Archive metadata

    Lifescience Database Archive (English)

    Data name: Raw Data (ASCII format); database: PLACE (LSDB Archive).

  20. Text recognition in both ancient and cartographic documents

    OpenAIRE

    Zaghden, Nizar; Khelifi, Badreddine; Alimi, Adel M.; Mullot, Remy

    2013-01-01

    This paper deals with the recognition and matching of text in both cartographic maps and ancient documents. The purpose of this work is to find similar text regions based on statistical and global features. A normalization phase is performed first, in order to properly categorize the same quantity of information. A word-spotting phase is performed next by combining local and global features. We carry out different experiments combining the different feature-extraction techniques in order to obtain ...

  1. A Fuzzy Approach to Classification of Text Documents

    Institute of Scientific and Technical Information of China (English)

    LIU WeiYi(刘惟一); SONG Ning(宋宁)

    2003-01-01

    This paper discusses the classification problems of text documents. Based on the concept of the proximity degree, the set of words is partitioned into some equivalence classes. Particularly, the concepts of the semantic field and association degree are given in this paper. Based on the above concepts, this paper presents a fuzzy classification approach for document categorization. Furthermore, applying the concept of the entropy of information, the approaches to select key words from the set of words covering the classification of documents and to construct the hierarchical structure of key words are obtained.

  2. Document Exploration and Automatic Knowledge Extraction for Unstructured Biomedical Text

    Science.gov (United States)

    Chu, S.; Totaro, G.; Doshi, N.; Thapar, S.; Mattmann, C. A.; Ramirez, P.

    2015-12-01

    We describe our work on building a web-browser based document reader with a built-in exploration tool and automatic concept extraction of medical entities for biomedical text. Vast amounts of biomedical information are offered in unstructured text form through scientific publications and R&D reports. Utilizing text mining can help us to mine information and extract relevant knowledge from a plethora of biomedical text. The ability to employ such technologies to aid researchers in coping with information overload is greatly desirable. In recent years, there has been an increased interest in automatic biomedical concept extraction [1, 2] and intelligent PDF reader tools with the ability to search on content and find related articles [3]. Such reader tools are typically desktop applications and are limited to specific platforms. Our goal is to provide researchers with a simple tool to aid them in finding, reading, and exploring documents. Thus, we propose a web-based document explorer, which we call Shangri-Docs, which combines a document reader with automatic concept extraction and highlighting of relevant terms. Shangri-Docs also provides the ability to evaluate a wide variety of document formats (e.g., PDF, Word, PPT, text, etc.) and to exploit the linked nature of the Web and personal content by performing searches on content from public sites (e.g., Wikipedia, PubMed) and private cataloged databases simultaneously. Shangri-Docs utilizes Apache cTAKES (clinical Text Analysis and Knowledge Extraction System) [4] and the Unified Medical Language System (UMLS) to automatically identify and highlight terms and concepts, such as specific symptoms, diseases, drugs, and anatomical sites, mentioned in the text. cTAKES was originally designed specifically to extract information from clinical medical records. Our investigation leads us to extend the automatic knowledge extraction process of cTAKES to the biomedical research domain by improving the ontology-guided information extraction

  3. Term Weighting Schemes for Slovak Text Document Clustering

    Directory of Open Access Journals (Sweden)

    ZLACKÝ Daniel

    2013-05-01

    Text representation is the task of transforming textual data into a multidimensional space with corresponding weights for every word. We have tested several widely used term weighting methods on a manually created database of Slovak Wikipedia articles. The created vector space models were used as input to unsupervised clustering algorithms, which cluster text documents based on these models. We have tested nine different weighting schemes with the K-means clustering algorithm. The best results were obtained with the TF-RIDF weighting scheme. However, subsequent experiments with different clustering techniques did not confirm these results.
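
    As a rough sketch of the experiment described above, the snippet below weights a few invented snippets with plain TF-IDF and clusters them with K-means using scikit-learn; the TF-RIDF scheme reported as best is not a stock scikit-learn weighting and is therefore not shown.

        # Sketch: term weighting followed by K-means clustering of short texts.
        from sklearn.cluster import KMeans
        from sklearn.feature_extraction.text import TfidfVectorizer

        articles = [
            "football match season league goal",
            "parliament election vote government law",
            "league cup players transfer coach",
            "minister budget tax policy vote",
        ]

        weights = TfidfVectorizer().fit_transform(articles)
        labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(weights)
        for label, text in zip(labels, articles):
            print(label, text)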

  4. Approaches to Ontology Based Algorithms for Clustering Text Documents

    Directory of Open Access Journals (Sweden)

    V.Sureka

    2012-09-01

    The advancement in digital technology and the World Wide Web has increased the usage of digital documents for various purposes like e-publishing and digital libraries. The increase in the number of text documents requires efficient techniques that can help during searching and retrieval. Document clustering is one such technique, which automatically organizes text documents into meaningful groups. This paper compares the performance of enhanced ontological algorithms based on K-Means and DBScan clustering. Ontology is introduced by using a concept weight which is calculated by considering the correlation coefficient of the word and the probability of the concept. Various experiments were conducted during performance evaluation, and the results showed that the inclusion of ontology increased the efficiency of clustering and that the performance of the ontology-based DBScan algorithm is better than the ontology-based K-Means algorithm.

  5. Finding Text Information in the Ocean of Electronic Documents

    Energy Technology Data Exchange (ETDEWEB)

    Medvick, Patricia A.; Calapristi, Augustin J.

    2003-02-05

    Information management in natural resources has become an overwhelming task. A massive amount of electronic documents and data is now available for creating informed decisions. The problem is finding the relevant information to support the decision-making process. Determining gaps in knowledge in order to propose new studies or to determine which proposals to fund for maximum potential is a time-consuming and difficult task. Additionally, available data stores are increasing in complexity; they now may include not only text and numerical data, but also images, sounds, and video recordings. Information visualization specialists at Pacific Northwest National Laboratory (PNNL) have software tools for exploring electronic data stores and for discovering and exploiting relationships within data sets. These provide capabilities for unstructured text explorations, the use of data signatures (a compact format for the essence of a set of scientific data) for visualization (Wong et al. 2000), visualizations for multiple query results (Havre et al. 2001), and others (http://www.pnl.gov/infoviz). We will focus on IN-SPIRE, an MS Windows version of PNNL's SPIRE (Spatial Paradigm for Information Retrieval and Exploration). IN-SPIRE was developed to assist information analysts in finding and discovering information in huge masses of text documents.

  6. Text Mining Approaches To Extract Interesting Association Rules from Text Documents

    Directory of Open Access Journals (Sweden)

    Vishwadeepak Singh Baghela

    2012-05-01

    A handful of text data mining approaches are available to extract potential information and associations from large amounts of text data. The term data mining is used for methods that analyze data with the objective of finding rules and patterns describing the characteristic properties of the data. The mined information is typically represented as a model of the semantic structure of the dataset, where the model may be used on new data for prediction or classification. In general, data mining deals with structured data (for example, relational databases), whereas text presents special characteristics and is unstructured. Unstructured data is totally different from databases, where mining techniques are usually applied and structured data is managed. Text mining can work with unstructured or semi-structured data sets. A brief review of some recent research related to mining association rules from text documents is presented in this paper.

  7. Leveraging Text Content for Management of Construction Project Documents

    Science.gov (United States)

    Alqady, Mohammed

    2012-01-01

    The construction industry is a knowledge intensive industry. Thousands of documents are generated by construction projects. Documents, as information carriers, must be managed effectively to ensure successful project management. The fact that a single project can produce thousands of documents and that a lot of the documents are generated in a…

  8. Transliterating non-ASCII characters with Python

    Directory of Open Access Journals (Sweden)

    Seth Bernstein

    2013-10-01

    This lesson shows how to use Python to automatically transliterate a list of words from a language with a non-Latin alphabet into a standardized format using American Standard Code for Information Interchange (ASCII) characters. It builds on readers' understanding of Python from the lessons “Viewing HTML Files,” “Working with Web Pages,” “From HTML to List of Words (part 1)” and “Intro to Beautiful Soup.” At the end of the lesson, we use the transliteration dictionary to convert the names from a database of the Russian organization Memorial from Cyrillic into Latin characters. Although the example uses Cyrillic characters, the technique can be reproduced with other alphabets using Unicode.
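
    The core of such a lesson is a plain mapping dictionary; the sketch below illustrates the idea with a partial Cyrillic-to-Latin table (an assumption, not the lesson's exact mapping) and a small helper that preserves capitalization.

        # Sketch: transliterating Cyrillic to ASCII with a dictionary lookup.
        translit = {
            "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "е": "e",
            "ж": "zh", "з": "z", "и": "i", "й": "i", "к": "k", "л": "l",
            "м": "m", "н": "n", "о": "o", "п": "p", "р": "r", "с": "s",
            "т": "t", "у": "u", "ф": "f", "х": "kh", "ц": "ts", "ч": "ch",
            "ш": "sh", "щ": "shch", "ы": "y", "э": "e", "ю": "iu", "я": "ia",
            "ь": "", "ъ": "",
        }

        def transliterate(word: str) -> str:
            """Map each Cyrillic letter to an ASCII equivalent, keeping case."""
            out = []
            for ch in word:
                mapped = translit.get(ch.lower(), ch)
                out.append(mapped.capitalize() if ch.isupper() else mapped)
            return "".join(out)

        print(transliterate("Мемориал"))   # -> Memorial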

  9. Information Gain Based Dimensionality Selection for Classifying Text Documents

    Energy Technology Data Exchange (ETDEWEB)

    Dumidu Wijayasekara; Milos Manic; Miles McQueen

    2013-06-01

    Selecting the optimal dimensions for various knowledge extraction applications is an essential component of data mining. Dimensionality selection techniques are utilized in classification applications to increase the classification accuracy and reduce the computational complexity. In text classification, where the dimensionality of the dataset is extremely high, dimensionality selection is even more important. This paper presents a novel genetic algorithm-based methodology for dimensionality selection in text mining applications that utilizes information gain. The presented methodology uses the information gain of each dimension to dynamically change the mutation probability of chromosomes. Since the information gain is calculated a priori, the computational complexity is not affected. The presented method was tested on a specific text classification problem and compared with conventional genetic algorithm-based dimensionality selection. The results show an improvement of 3% in the true positives and 1.6% in the true negatives over conventional dimensionality selection methods.
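
    Independently of the genetic algorithm, the information-gain scoring itself is easy to sketch; the snippet below assumes scikit-learn, uses mutual information as the information-gain score, and works on four invented two-class snippets.

        # Sketch: score term dimensions by information gain and keep the best ones.
        from sklearn.feature_extraction.text import CountVectorizer
        from sklearn.feature_selection import SelectKBest, mutual_info_classif

        texts = [
            "power plant outage report", "turbine failure incident log",
            "holiday schedule announcement", "cafeteria menu update",
        ]
        labels = [1, 1, 0, 0]            # 1 = operational incident, 0 = other

        vectorizer = CountVectorizer()
        counts = vectorizer.fit_transform(texts)
        selector = SelectKBest(mutual_info_classif, k=4).fit(counts, labels)
        kept = vectorizer.get_feature_names_out()[selector.get_support()]
        print(sorted(kept))              # the four most informative terms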

  10. Text Mining Approaches To Extract Interesting Association Rules from Text Documents

    OpenAIRE

    Vishwadeepak Singh Baghela; S. P. Tripathi

    2012-01-01

    A handful of text data mining approaches are available to extract potential information and associations from large amounts of text data. The term data mining is used for methods that analyze data with the objective of finding rules and patterns describing the characteristic properties of the data. The mined information is typically represented as a model of the semantic structure of the dataset, where the model may be used on new data for prediction or classification. In general, data mi...

  11. Cluster Based Hybrid Niche Mimetic and Genetic Algorithm for Text Document Categorization

    Directory of Open Access Journals (Sweden)

    A. K. Santra

    2011-09-01

    An efficient cluster-based hybrid niche mimetic and genetic algorithm for text document categorization to improve the retrieval rate of relevant document fetching is addressed. The proposal minimizes the processing needed to structure the document with better feature selection using the hybrid algorithm. In addition, the restructuring of feature words to associated documents is reduced, which in turn increases the document clustering rate. The performance of the proposed work is measured in terms of cluster object accuracy, term weight, term frequency and inverse document frequency. Experimental results demonstrate that it achieves very good performance on both feature selection and text document categorization, compared to other classifier methods.

  12. Gabor Filter Based Block Energy Analysis for Text Extraction from Digital Document Images

    OpenAIRE

    Raju, Sabari S; Pati, Peeta Basa; Ramakrishnan, AG

    2004-01-01

    Extraction of text areas is a necessary first step when taking a complex document image through a character recognition task. In digital libraries, such OCR'ed text facilitates access to the image of a document page through keyword search. Gabor filters, known to simulate certain characteristics of the Human Visual System (HVS), have been employed for this task by a large number of scientists in scanned document images. Adapting such a scheme for camera-based document images is a relatively new ...

  13. MULTI-DOCUMENT TEXT SUMMARIZATION USING CLUSTERING TECHNIQUES AND LEXICAL CHAINING

    Directory of Open Access Journals (Sweden)

    S. Saraswathi

    2010-07-01

    This paper investigates the use of clustering and lexical chains to produce coherent summaries of multiple documents in text format and to generate an indicative, less redundant summary. The summary is designed as per the user's requirement of conciseness, i.e., the documents are summarized according to the percentage input by the user. To achieve the above, various clustering techniques are used. Clustering is done at two levels: first at the single-document level and then at the multi-document level. The clustered sentences are scored based on five different methods and lexically linked to produce the final summary in a text document.
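
    A compact stand-in for this pipeline is sketched below, assuming scikit-learn and NumPy: sentences pooled from two invented documents are clustered with K-means and the sentence closest to each cluster centre is kept, in place of the paper's lexical chaining and five scoring methods.

        # Sketch: cluster-then-pick extractive summarization over pooled sentences.
        import numpy as np
        from sklearn.cluster import KMeans
        from sklearn.feature_extraction.text import TfidfVectorizer

        docs = [
            "The storm closed the airport. Flights resumed the next morning.",
            "Heavy winds shut the airport for a day. Airlines rebooked passengers.",
        ]
        sentences = [s.strip() for d in docs for s in d.split(".") if s.strip()]

        vectors = TfidfVectorizer().fit_transform(sentences)
        k = 2                            # summary length requested by the user
        model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(vectors)

        summary = []
        for c in range(k):
            members = np.where(model.labels_ == c)[0]
            centre = model.cluster_centers_[c]
            best = min(members,
                       key=lambda i: np.linalg.norm(vectors[i].toarray() - centre))
            summary.append(sentences[best])
        print(" ".join(summary))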

  14. Electronic Documentation Support Tools and Text Duplication in the Electronic Medical Record

    Science.gov (United States)

    Wrenn, Jesse

    2010-01-01

    In order to ease the burden of electronic note entry on physicians, electronic documentation support tools have been developed to assist in note authoring. There is little evidence of the effects of these tools on attributes of clinical documentation, including document quality. Furthermore, the resultant abundance of duplicated text and…

  15. THE SEGMENTATION OF A TEXT LINE FOR A HANDWRITTEN UNCONSTRAINED DOCUMENT USING THINNING ALGORITHM

    NARCIS (Netherlands)

    Tsuruoka, S.; Adachi, Y.; Yoshikawa, T.

    2004-01-01

    For printed documents, the projection analysis of black pixels is widely used for the segmentation of a text line. However, for handwritten documents, we think that the projection analysis is not appropriate, as the separating border line of a text line is not a straight line on a paper with no rule

  16. Can ASCII data files be standardized for Earth Science?

    Science.gov (United States)

    Evans, K. D.; Chen, G.; Wilson, A.; Law, E.; Olding, S. W.; Krotkov, N. A.; Conover, H.

    2015-12-01

    NASA's Earth Science Data Systems Working Groups (ESDSWG) were created over 10 years ago. The role of the ESDSWG is to make recommendations relevant to NASA's Earth science data systems from user experiences. Each group works independently, focusing on a unique topic. Participation in ESDSWG groups comes from a variety of NASA-funded science and technology projects, such as MEaSUREs, NASA information technology experts, affiliated contractor staff and other interested community members from academia and industry. Recommendations from the ESDSWG groups will enhance NASA's efforts to develop long term data products. Each year, the ESDSWG has a face-to-face meeting to discuss recommendations and future efforts. Last year's (2014) ASCII for Science Data Working Group (ASCII WG) completed its goals and made recommendations on a minimum set of information that is needed to make ASCII files at least human readable and usable for the foreseeable future. The 2014 ASCII WG created a table of ASCII files and their components as a means for understanding what kinds of ASCII formats exist and what components they have in common. Using this table and adding information from other ASCII file formats, we will discuss the advantages and disadvantages of a standardized format. For instance, space geodesy scientists have been using the same RINEX/SINEX ASCII format for decades. Astronomers mostly archive their data in the FITS format. Yet Earth scientists seem to have a slew of ASCII formats, such as ICARTT, netCDF (an ASCII dump) and the IceBridge ASCII format. The 2015 Working Group is focusing on promoting extendibility and machine readability of ASCII data. Questions have been posed, including: Can we have a standardized ASCII file format? Can it be machine-readable and simultaneously human-readable? We will present a summary of the currently used ASCII formats in terms of advantages and shortcomings, as well as potential improvements.

  17. Text Categorization for Multi-Page Documents: A Hybrid Naive Bayes HMM Approach.

    Science.gov (United States)

    Frasconi, Paolo; Soda, Giovanni; Vullo, Alessandro

    Text categorization is typically formulated as a concept learning problem where each instance is a single isolated document. This paper is interested in a more general formulation where documents are organized as page sequences, as naturally occurring in digital libraries of scanned books and magazines. The paper describes a method for classifying…

  18. A Consistent Web Documents Based Text Clustering Using Concept Based Mining Model

    OpenAIRE

    V.M.Navaneethakumar; C Chandrasekar

    2012-01-01

    Text mining is a growing, innovative field that endeavors to collect significant information from natural language text. It might be loosely characterized as the process of examining texts to extract information that is useful for particular purposes. In this case, the mining model can capture terms that identify the concepts of a sentence or document, which tends to reveal the subject of the document. In existing work, the concept-based mining model is used only for n...

  19. Arabic Text Summarization Based on Latent Semantic Analysis to Enhance Arabic Documents Clustering

    Directory of Open Access Journals (Sweden)

    Hanane Froud

    2013-02-01

    Arabic document clustering is an important task for obtaining good results with traditional Information Retrieval (IR) systems, especially with the rapid growth of the number of online documents available in the Arabic language. Document clustering aims to automatically group similar documents into one cluster using different similarity/distance measures. This task is often affected by document length: useful information in the documents is often accompanied by a large amount of noise, and therefore it is necessary to eliminate this noise while keeping useful information to boost the performance of document clustering. In this paper, we propose to evaluate the impact of text summarization using the Latent Semantic Analysis model on Arabic document clustering in order to solve the problems cited above, using five similarity/distance measures: Euclidean Distance, Cosine Similarity, Jaccard Coefficient, Pearson Correlation Coefficient and Averaged Kullback-Leibler Divergence, both without and with stemming. Our experimental results indicate that the proposed approach effectively solves the problems of noisy information and document length, and thus significantly improves clustering performance.

  20. ARABIC TEXT SUMMARIZATION BASED ON LATENT SEMANTIC ANALYSIS TO ENHANCE ARABIC DOCUMENTS CLUSTERING

    Directory of Open Access Journals (Sweden)

    Hanane Froud

    2013-01-01

    Arabic document clustering is an important task for obtaining good results with traditional Information Retrieval (IR) systems, especially with the rapid growth of the number of online documents available in the Arabic language. Document clustering aims to automatically group similar documents into one cluster using different similarity/distance measures. This task is often affected by document length: useful information in the documents is often accompanied by a large amount of noise, and therefore it is necessary to eliminate this noise while keeping useful information to boost the performance of document clustering. In this paper, we propose to evaluate the impact of text summarization using the Latent Semantic Analysis model on Arabic document clustering in order to solve the problems cited above, using five similarity/distance measures: Euclidean Distance, Cosine Similarity, Jaccard Coefficient, Pearson Correlation Coefficient and Averaged Kullback-Leibler Divergence, both without and with stemming. Our experimental results indicate that the proposed approach effectively solves the problems of noisy information and document length, and thus significantly improves clustering performance.

  1. Thematic clustering of text documents using an EM-based approach.

    Science.gov (United States)

    Kim, Sun; Wilbur, W John

    2012-10-01

    Clustering textual contents is an important step in mining useful information on the web or other text-based resources. The common task in text clustering is to handle text in a multi-dimensional space, and to partition documents into groups, where each group contains documents that are similar to each other. However, this strategy lacks a comprehensive view for humans in general since it cannot explain the main subject of each cluster. Utilizing semantic information can solve this problem, but it needs a well-defined ontology or pre-labeled gold standard set. In this paper, we present a thematic clustering algorithm for text documents. Given text, subject terms are extracted and used for clustering documents in a probabilistic framework. An EM approach is used to ensure documents are assigned to correct subjects, hence it converges to a locally optimal solution. The proposed method is distinctive because its results are sufficiently explanatory for human understanding as well as efficient for clustering performance. The experimental results show that the proposed method provides a competitive performance compared to other state-of-the-art approaches. We also show that the extracted themes from the MEDLINE® dataset represent the subjects of clusters reasonably well.
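
    A generic EM-style analogue can be sketched with a Gaussian mixture over reduced TF-IDF vectors, assuming scikit-learn; note that the paper's own EM formulation operates on extracted subject terms rather than on this representation, so the snippet only illustrates soft, EM-fitted cluster membership on invented toy abstracts.

        # Sketch: EM-fitted soft clustering of documents via a Gaussian mixture.
        from sklearn.decomposition import TruncatedSVD
        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.mixture import GaussianMixture

        abstracts = [
            "gene expression in tumour cells",
            "protein folding and molecular dynamics",
            "tumour growth and cancer therapy",
            "molecular simulation of protein structures",
        ]

        tfidf = TfidfVectorizer().fit_transform(abstracts)
        reduced = TruncatedSVD(n_components=2, random_state=0).fit_transform(tfidf)
        gmm = GaussianMixture(n_components=2, random_state=0).fit(reduced)
        print(gmm.predict_proba(reduced).round(2))   # soft cluster memberships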

  2. LOG2MARKUP: State module to transform a Stata text log into a markup document

    DEFF Research Database (Denmark)

    2016-01-01

    log2markup extracts parts of the text version of the Stata log command and transforms the logfile into a markup-based document with the same name, but with the extension markup (or as otherwise specified in the extension option) instead of log. The author usually uses Markdown for writing documents; however, other users may decide on all sorts of markup languages, e.g. HTML or LaTeX. The key is that the markup of Stata code and Stata output can be set through the options.
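
    The idea is easy to mimic outside Stata; the Python sketch below is an illustration only, not the module's actual logic: Stata comment lines become prose and everything else is fenced as code, for an invented sample log.

        # Sketch: turn a Stata log into Markdown-ish text. Comment lines (". *")
        # become prose; commands and output are wrapped in fenced code blocks.
        sample_log = [
            ". * # Results",
            ". * The regression below reproduces the main table.",
            ". sysuse auto",
            "(1978 Automobile Data)",
            ". regress price mpg",
        ]

        def log_to_markup(log_lines):
            out, in_code = [], False
            for line in log_lines:
                prose = line.startswith(". *")   # Stata comment -> prose
                if prose and in_code:
                    out.append("```")            # close the open code block
                    in_code = False
                elif not prose and not in_code:
                    out.append("```stata")       # open a new code block
                    in_code = True
                out.append(line[4:] if prose else line)
            if in_code:
                out.append("```")
            return "\n".join(out)

        print(log_to_markup(sample_log))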

  3. A Novel Model for Timed Event Extraction and Temporal Reasoning In Legal Text Documents

    Directory of Open Access Journals (Sweden)

    Kolikipogu Ramakrishna

    2011-02-01

    Information retrieval is at a nascent stage in providing any type of information queried by a naïve user. Question answering systems are one such successful area of information retrieval. Legal documents (case law, statutes or transactional documents) are increasing day by day with new applications (mobile transactions, medical diagnosis reports, law cases, etc.) in the world. Documentation of various business and human resource (HR) applications involves legal documents. Analysis and temporal reasoning of such documents is a demanding area of research. In this paper we build a novel model for timed event extraction and temporal reasoning in legal text documents. This paper mainly works on how one can do further reasoning with the extracted temporal information. Exploring temporal information in legal text documents is an important task to support legal practitioners, in order to determine temporal-based context decisions. Legal documents are available in different natural languages; hence the model uses an NLP system for pre-processing, a temporal constraint structure for temporal expressions, an associated tagger, and a post-processor with a knowledge-based subsystem that helps in discovering implicit information. The resulting information resolves temporal expressions and deals with issues such as granularity and vagueness, and a reasoning mechanism models the temporal constraint satisfaction network.

  4. A COMPARATIVE STUDY TO FIND A SUITABLE METHOD FOR TEXT DOCUMENT CLUSTERING

    Directory of Open Access Journals (Sweden)

    Dr.M.Punithavalli

    2012-01-01

    Text mining is used in various text-related tasks such as information extraction, concept/entity extraction, document summarization, entity relation modeling (i.e., learning relations between named entities), categorization/classification and clustering. This paper focuses on document clustering, a field of text mining which groups a set of documents into a list of meaningful categories. The main focus of this paper is to present a performance analysis of various techniques available for document clustering. The results of this comparative study can be used to improve existing text data mining frameworks and improve the way of knowledge discovery. This paper considers six clustering techniques for document clustering. The techniques are grouped into three groups, namely Group 1 - K-means and its variants (traditional K-means and K* Means algorithms), Group 2 - Expectation Maximization and its variants (traditional EM, the Spherical Gaussian EM algorithm and Linear Partitioning and Reallocation clustering (LPR) using EM algorithms), and Group 3 - Semantic-based techniques (Hybrid method and Feature-based algorithms). A total of seven algorithms are considered, selected based on their popularity in the text mining field. Several experiments were conducted to analyze the performance of the algorithms and to select the winner in terms of cluster purity, clustering accuracy and speed of clustering.

  5. Approach for Arabic Handwritten Image Processing: Case of Text Detection in Degraded Documents

    Science.gov (United States)

    Boulid, Youssef; Youssfi Elkettani, Mohamed

    2014-09-01

    This study presents a new approach for the processing of Arabic handwritten documents based on the extraction of characteristics and mechanisms involved in the process of human visual perception. The architecture which has been developed is based on the concept of multi-agent systems, allowing the integration of different stages of the character recognition process in a cooperative way. This is illustrated using the preprocessing of binary noisy documents as an example. A method is proposed to distinguish between text and non-text components, using new geometric primitives extracted from the analysis of the characteristics of Arabic script. Results show pixel-level precision and recall of 98% and 93%, respectively, for noise removal. This proves the effectiveness of the proposed approach in processing degraded documents and, consequently, improving recognition performance.

  6. Issues and approaches for electronic document approval and transmittal using digital signatures and text authentication: Prototype documentation

    Science.gov (United States)

    Boling, M. E.

    1989-09-01

    Prototypes were assembled pursuant to recommendations made in report K/DSRD-96, Issues and Approaches for Electronic Document Approval and Transmittal Using Digital Signatures and Text Authentication, and to examine and discover the possibilities for integrating available hardware and software to provide cost effective systems for digital signatures and text authentication. These prototypes show that on a LAN, a multitasking, windowed, mouse/keyboard menu-driven interface can be assembled to provide easy and quick access to bit-mapped images of documents, electronic forms and electronic mail messages with a means to sign, encrypt, deliver, receive or retrieve and authenticate text and signatures. In addition they show that some of this same software may be used in a classified environment using host to terminal transactions to accomplish these same operations. Finally, a prototype was developed demonstrating that binary files may be signed electronically and sent by point to point communication and over ARPANET to remote locations where the authenticity of the code and signature may be verified. Related studies on the subject of electronic signatures and text authentication using public key encryption were done within the Department of Energy. These studies include timing studies of public key encryption software and hardware and testing of experimental user-generated host resident software for public key encryption. This software used commercially available command-line source code. These studies are responsive to an initiative within the Office of the Secretary of Defense (OSD) for the protection of unclassified but sensitive data. It is notable that these related studies are all built around the same commercially available public key encryption products from the private sector and that the software selection was made independently by each study group.
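
    For a modern flavour of the same sign-and-verify workflow, the sketch below uses Ed25519 from the third-party cryptography package (an assumption; the 1989 prototypes used commercial public-key software of their era) to sign a document's bytes and check the signature.

        # Sketch: sign a document's bytes and verify the signature.
        from cryptography.exceptions import InvalidSignature
        from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

        private_key = Ed25519PrivateKey.generate()
        public_key = private_key.public_key()

        document = b"Electronic document approval example"
        signature = private_key.sign(document)

        try:
            public_key.verify(signature, document)   # raises if content was altered
            print("signature valid")
        except InvalidSignature:
            print("signature invalid - document or signature was modified")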

  7. Automatic Extraction of Spatio-Temporal Information from Arabic Text Documents

    Directory of Open Access Journals (Sweden)

    Abdelkoui Feriel

    2015-10-01

    Unstructured Arabic text documents are an important source of geographical and temporal information. The possibility of automatically tracking spatio-temporal information, capturing changes relating to events from text documents, is a new challenge in the fields of geographic information retrieval (GIR), temporal information retrieval (TIR) and natural language processing (NLP). There has been a lot of work on the extraction of such information in other languages that use the Latin alphabet, such as English, French, or Spanish, whereas the Arabic language is still not well supported in GIR and TIR and more research is needed. In this paper, we present an approach that supports automated exploration and extraction of spatio-temporal information from Arabic text documents in order to capture and model such information before it can be utilized in search and exploration tasks. The system has been successfully tested on 50 documents that include a mixture of types of spatial/temporal information. The results achieved 91.01% recall and 80% precision. This illustrates that our approach is effective and its performance is satisfactory.

  8. Semi-supervised learning for detecting text-lines in noisy document images

    Science.gov (United States)

    Liu, Zongyi; Zhou, Hanning

    2010-01-01

    Document layout analysis is a key step in document image understanding, with wide applications in document digitization and reformatting. Identifying the correct layout from noisy scanned images is especially challenging. In this paper, we introduce a semi-supervised learning framework to detect text-lines from noisy document images. Our framework consists of three steps. The first step is the initial segmentation that extracts text-lines and images using simple morphological operations. The second step is a grouping-based layout analysis that identifies text-lines, image zones, column separators and vertical border noise. It is able to efficiently remove the vertical border noise from multi-column pages. The third step is an online classifier that is trained with the high-confidence line detection results from Step Two, and filters out noise from low-confidence lines. The classifier effectively removes speckle noise embedded inside the content zones. We compare the performance of our algorithm to the state-of-the-art work in the field on the UW-III database. We choose the results reported by the Image Understanding Pattern Recognition Research (IUPR) and Scansoft Omnipage SDK 15.5. We evaluate the performance at both the page frame level and the text-line level. The results show that our system has a much lower false-alarm rate, while maintaining a similar content detection rate. In addition, we also show that our online training model generalizes better than algorithms depending on offline training.

  9. Robust Text Extraction for Automated Processing of Multi-Lingual Personal Identity Documents

    Directory of Open Access Journals (Sweden)

    Pushpa B R

    2016-04-01

    Text extraction is a technique to extract the textual portion from a non-textual background such as images. It plays an important role in deciphering valuable information from images. Variation in text size, font, orientation, alignment, contrast, etc. makes the task of text extraction challenging. Existing text extraction methods focus on certain regions of interest, and characteristics like noise, blur, distortion and variation in fonts make text extraction difficult. This paper proposes a technique to extract textual characters from scanned personal identity document images. Current procedures keep track of user records manually and thus give way to inefficient practices and the need for abundant time and human resources. The proposed methodology digitizes personal identity documents and eliminates the need for a large portion of the manual work involved in existing data entry and verification procedures. The proposed method has been tested extensively with large datasets of varying sizes and image qualities. The results obtained indicate high accuracy in the extraction of important textual features from the document images.

  10. Finding falls in ambulatory care clinical documents using statistical text mining

    Science.gov (United States)

    McCart, James A; Berndt, Donald J; Jarman, Jay; Finch, Dezon K; Luther, Stephen L

    2013-01-01

    Objective To determine how well statistical text mining (STM) models can identify falls within clinical text associated with an ambulatory encounter. Materials and Methods 2241 patients were selected with a fall-related ICD-9-CM E-code or matched injury diagnosis code while being treated as an outpatient at one of four sites within the Veterans Health Administration. All clinical documents within a 48-h window of the recorded E-code or injury diagnosis code for each patient were obtained (n=26 010; 611 distinct document titles) and annotated for falls. Logistic regression, support vector machine, and cost-sensitive support vector machine (SVM-cost) models were trained on a stratified sample of 70% of documents from one location (dataset Atrain) and then applied to the remaining unseen documents (datasets Atest–D). Results All three STM models obtained area under the receiver operating characteristic curve (AUC) scores above 0.950 on the four test datasets (Atest–D). The SVM-cost model obtained the highest AUC scores, ranging from 0.953 to 0.978. The SVM-cost model also achieved F-measure values ranging from 0.745 to 0.853, sensitivity from 0.890 to 0.931, and specificity from 0.877 to 0.944. Discussion The STM models performed well across a large heterogeneous collection of document titles. In addition, the models also generalized across other sites, including a traditionally bilingual site that had distinctly different grammatical patterns. Conclusions The results of this study suggest STM-based models have the potential to improve surveillance of falls. Furthermore, the encouraging evidence shown here that STM is a robust technique for mining clinical documents bodes well for other surveillance-related topics. PMID:23242765
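
    The modelling step is structurally simple; the sketch below assumes scikit-learn and invented toy notes (not the VHA corpus), trains a bag-of-words logistic regression, and reports AUC, mirroring the kind of evaluation described above.

        # Sketch: a statistical text mining classifier for fall-related notes.
        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.linear_model import LogisticRegression
        from sklearn.metrics import roc_auc_score
        from sklearn.model_selection import train_test_split

        notes = [
            "patient slipped on ice and fell, bruised hip",
            "fell from ladder at home, wrist fracture",
            "routine follow up, blood pressure controlled",
            "medication refill requested, no complaints",
            "found on floor after fall, head ct ordered",
            "annual physical exam, labs within normal limits",
        ]
        fall = [1, 1, 0, 0, 1, 0]        # 1 = note describes a fall

        x_train, x_test, y_train, y_test = train_test_split(
            notes, fall, test_size=0.33, stratify=fall, random_state=0)

        vectorizer = TfidfVectorizer()
        model = LogisticRegression().fit(vectorizer.fit_transform(x_train), y_train)
        scores = model.predict_proba(vectorizer.transform(x_test))[:, 1]
        print("AUC:", roc_auc_score(y_test, scores))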

  11. Human Rights Texts: Converting Human Rights Primary Source Documents into Data.

    Science.gov (United States)

    Fariss, Christopher J; Linder, Fridolin J; Jones, Zachary M; Crabtree, Charles D; Biek, Megan A; Ross, Ana-Sophia M; Kaur, Taranamol; Tsai, Michael

    2015-01-01

    We introduce and make publicly available a large corpus of digitized primary source human rights documents which are published annually by monitoring agencies that include Amnesty International, Human Rights Watch, the Lawyers Committee for Human Rights, and the United States Department of State. In addition to the digitized text, we also make available and describe document-term matrices, which are datasets that systematically organize the word counts from each unique document by each unique term within the corpus of human rights documents. To contextualize the importance of this corpus, we describe the development of coding procedures in the human rights community and several existing categorical indicators that have been created by human coding of the human rights documents contained in the corpus. We then discuss how the new human rights corpus and the existing human rights datasets can be used with a variety of statistical analyses and machine learning algorithms to help scholars understand how human rights practices and reporting have evolved over time. We close with a discussion of our plans for dataset maintenance, updating, and availability.
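
    A document-term matrix of the kind described above can be produced directly with scikit-learn's CountVectorizer; the three "reports" below are invented placeholders, not text from the corpus.

        # Sketch: build a document-term matrix of word counts.
        from sklearn.feature_extraction.text import CountVectorizer

        reports = [
            "arbitrary detention and torture were reported",
            "freedom of assembly was restricted",
            "reports of arbitrary detention continued",
        ]

        vectorizer = CountVectorizer()
        dtm = vectorizer.fit_transform(reports)   # rows: documents, columns: terms
        print(vectorizer.get_feature_names_out())
        print(dtm.toarray())                      # word counts per document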

  12. A methodology for semiautomatic taxonomy of concepts extraction from nuclear scientific documents using text mining techniques

    International Nuclear Information System (INIS)

    This thesis presents a text mining method for the semi-automatic extraction of a taxonomy of concepts from a textual corpus composed of scientific papers related to the nuclear area. Text classification is a natural human practice and a crucial task for working with large repositories. Document clustering provides a logical and understandable framework that facilitates organization, browsing and searching. Most clustering algorithms use the bag-of-words model to represent the content of a document. This model generates high-dimensional data, ignores the fact that different words can have the same meaning, and does not consider the relationships between words, assuming that they are independent of each other. The methodology combines a model for document representation by concepts with a hierarchical document clustering method based on the frequency of co-occurring concepts and a technique for labeling clusters with their most representative concepts, with the objective of producing a taxonomy of concepts that reflects the structure of the knowledge domain. It is hoped that this work will contribute to the conceptual mapping of the scientific production of the nuclear area and thus support the management of research activities in this area. (author)

  13. Les références aux documents en ligne dans les textes scientifiques (References to online documents in scientific texts)

    Directory of Open Access Journals (Sweden)

    Marc Couture

    2010-01-01

    Full Text Available With the increase of online scientific publications, the traditional primacy of print documents has become less and less relevant. In some scientific journals, notably in the field of information technology and its applications (like IJTHE), a significant part of the citations now refer to online documents. This paper first reviews the various roles played by citations in scientific texts according to the literature. It then describes the characteristics and conditions which must be met if citations to online documents are to fully play these roles. Finally, it shows how, in the French-language adaptation of the APA reference formats which was adopted by IJTHE, a few choices regarding the format of references to online documents ease the task of both referees and readers. By the same token, authors will find guidelines and suggestions which should make this often neglected dimension of scholarly communication more relevant and effective.

  14. ParaText : scalable solutions for processing and searching very large document collections : final LDRD report.

    Energy Technology Data Exchange (ETDEWEB)

    Crossno, Patricia Joyce; Dunlavy, Daniel M.; Stanton, Eric T.; Shead, Timothy M.

    2010-09-01

    This report is a summary of the accomplishments of the 'Scalable Solutions for Processing and Searching Very Large Document Collections' LDRD, which ran from FY08 through FY10. Our goal was to investigate scalable text analysis; specifically, methods for information retrieval and visualization that could scale to extremely large document collections. Towards that end, we designed, implemented, and demonstrated a scalable framework for text analysis - ParaText - as a major project deliverable. Further, we demonstrated the benefits of using visual analysis in text analysis algorithm development, improved performance of heterogeneous ensemble models in data classification problems, and the advantages of information theoretic methods in user analysis and interpretation in cross language information retrieval. The project involved 5 members of the technical staff and 3 summer interns (including one who worked two summers). It resulted in a total of 14 publications, 3 new software libraries (2 open source and 1 internal to Sandia), several new end-user software applications, and over 20 presentations. Several follow-on projects have already begun or will start in FY11, with additional projects currently in proposal.

  15. Using complex networks for text classification: Discriminating informative and imaginative documents

    Science.gov (United States)

    de Arruda, Henrique F.; Costa, Luciano da F.; Amancio, Diego R.

    2016-01-01

    Statistical methods have been widely employed in recent years to grasp many language properties. The application of such techniques has allowed an improvement of several linguistic applications, such as machine translation and document classification. In the latter, many approaches have emphasised the semantic content of texts, as is the case with bag-of-words language models. These approaches have certainly yielded reasonable performance. However, some potential features such as the structural organization of texts have been used only in a few studies. In this context, we probe how features derived from textual structure analysis can be effectively employed in a classification task. More specifically, we performed a supervised classification aiming at discriminating informative from imaginative documents. Using a networked model that describes the local topological/dynamical properties of function words, we achieved an accuracy rate of up to 95%, which is much higher than that of similar networked approaches. A systematic analysis of feature relevance revealed that symmetry and accessibility measurements are among the most prominent network measurements. Our results suggest that these measurements could be used in related language applications, as they play a complementary role in characterising texts.

  16. Text Feature Weighting For Summarization Of Document Bahasa Indonesia Using Genetic Algorithm

    Directory of Open Access Journals (Sweden)

    Aristoteles.

    2012-05-01

    Full Text Available This paper aims to perform text feature weighting for summarization of documents in Bahasa Indonesia using a genetic algorithm. There are eleven text features, i.e., sentence position (f1), positive keywords in sentence (f2), negative keywords in sentence (f3), sentence centrality (f4), sentence resemblance to the title (f5), sentence inclusion of name entity (f6), sentence inclusion of numerical data (f7), sentence relative length (f8), bushy path of the node (f9), summation of similarities for each node (f10), and latent semantic feature (f11). We investigate the effect of the first ten sentence features on the summarization task. Then, we use the latent semantic feature to increase the accuracy. All feature score functions are used to train a genetic algorithm model to obtain a suitable combination of feature weights. Evaluation of text summarization uses the F-measure. The F-measure is directly related to the compression rate. The results showed that adding f11 increases the F-measure by 3.26% and 1.55% for compression ratios of 10% and 30%, respectively. On the other hand, it decreases the F-measure by 0.58% for a compression ratio of 20%. Analysis of the text feature weights showed that using only f2, f4, f5, and f11 can deliver a performance similar to using all eleven features.

  17. Getting more out of biomedical documents with GATE's full lifecycle open source text analytics.

    Directory of Open Access Journals (Sweden)

    Hamish Cunningham

    Full Text Available This software article describes the GATE family of open source text analysis tools and processes. GATE is one of the most widely used systems of its type with yearly download rates of tens of thousands and many active users in both academic and industrial contexts. In this paper we report three examples of GATE-based systems operating in the life sciences and in medicine. First, in genome-wide association studies which have contributed to discovery of a head and neck cancer mutation association. Second, medical records analysis which has significantly increased the statistical power of treatment/outcome models in the UK's largest psychiatric patient cohort. Third, richer constructs in drug-related searching. We also explore the ways in which the GATE family supports the various stages of the lifecycle present in our examples. We conclude that the deployment of text mining for document abstraction or rich search and navigation is best thought of as a process, and that with the right computational tools and data collection strategies this process can be made defined and repeatable. The GATE research programme is now 20 years old and has grown from its roots as a specialist development tool for text processing to become a rather comprehensive ecosystem, bringing together software developers, language engineers and research staff from diverse fields. GATE now has a strong claim to cover a uniquely wide range of the lifecycle of text analysis systems. It forms a focal point for the integration and reuse of advances that have been made by many people (the majority outside of the authors' own group) who work in text processing for biomedicine and other areas. GATE is available online under GNU open source licences and runs on all major operating systems. Support is available from an active user and developer community and also on a commercial basis.

  18. Getting more out of biomedical documents with GATE's full lifecycle open source text analytics.

    Science.gov (United States)

    Cunningham, Hamish; Tablan, Valentin; Roberts, Angus; Bontcheva, Kalina

    2013-01-01

    This software article describes the GATE family of open source text analysis tools and processes. GATE is one of the most widely used systems of its type with yearly download rates of tens of thousands and many active users in both academic and industrial contexts. In this paper we report three examples of GATE-based systems operating in the life sciences and in medicine. First, in genome-wide association studies which have contributed to discovery of a head and neck cancer mutation association. Second, medical records analysis which has significantly increased the statistical power of treatment/outcome models in the UK's largest psychiatric patient cohort. Third, richer constructs in drug-related searching. We also explore the ways in which the GATE family supports the various stages of the lifecycle present in our examples. We conclude that the deployment of text mining for document abstraction or rich search and navigation is best thought of as a process, and that with the right computational tools and data collection strategies this process can be made defined and repeatable. The GATE research programme is now 20 years old and has grown from its roots as a specialist development tool for text processing to become a rather comprehensive ecosystem, bringing together software developers, language engineers and research staff from diverse fields. GATE now has a strong claim to cover a uniquely wide range of the lifecycle of text analysis systems. It forms a focal point for the integration and reuse of advances that have been made by many people (the majority outside of the authors' own group) who work in text processing for biomedicine and other areas. GATE is available online under GNU open source licences and runs on all major operating systems. Support is available from an active user and developer community and also on a commercial basis.

  19. An Efficient Technique to Implement Similarity Measures in Text Document Clustering using Artificial Neural Networks Algorithm

    Directory of Open Access Journals (Sweden)

    K. Selvi

    2014-12-01

    Full Text Available Pattern recognition, covering supervised and unsupervised methods, optimization, associative memory and control processes, is among the diverse problems that can be addressed by artificial neural networks. Problem identified: of late, discovering the required information in massive quantities of data has become a challenging task. The model of similarity evaluation is the central element in achieving an understanding of the variables and perceptions that drive behavior. This study proposes artificial neural network algorithms to compute similarity measures. In order to apply singular value decomposition, the frequency of word pairs is established in the given document. (1) Tokenization: the splitting of a stream of text into words, phrases, symbols, or other significant parts. (2) Stop words: the words that are filtered out before or after processing natural language data. (3) Porter stemming: this algorithm is mainly used as part of a term normalization process that is typically carried out when setting up an information retrieval system. (4) WordNet: a lexical database for the English language. Based on artificial neural networks, the core part of this work extends an n-gram-based algorithm, in which phonemes, syllables, letters, words or base pairs are used according to the application. Future work extends the application of these same similarity measures to various other neural network algorithms to achieve improved results.
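
    The preprocessing steps listed in this abstract (tokenization, stop-word removal, Porter stemming and WordNet lookup) can be sketched with NLTK as below; the sample sentence is invented, the required NLTK corpora must be downloaded first, and the neural-network similarity model itself is not reproduced.

        import nltk
        from nltk.corpus import stopwords, wordnet
        from nltk.stem import PorterStemmer

        # One-time downloads: nltk.download("punkt"); nltk.download("stopwords"); nltk.download("wordnet")
        text = "Artificial neural networks resolve similarity measures for document clustering."

        tokens = nltk.word_tokenize(text.lower())                          # (1) tokenization
        stop = set(stopwords.words("english"))
        tokens = [t for t in tokens if t.isalpha() and t not in stop]      # (2) stop-word removal
        stems = [PorterStemmer().stem(t) for t in tokens]                  # (3) Porter stemming
        senses = {t: [s.name() for s in wordnet.synsets(t)][:3]
                  for t in tokens}                                         # (4) WordNet lookup

        print(stems)
        print(senses)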

  20. A novel technique for estimation of skew in binary text document images based on linear regression analysis

    Indian Academy of Sciences (India)

    P Shivakumara; G Hemantha Kumar; D S Guru; P Nagabhushan

    2005-02-01

    When a document is scanned either mechanically or manually for digitization, it often suffers from some degree of skew or tilt. Skew-angle detection plays an important role in the field of document analysis systems and OCR in achieving the expected accuracy. In this paper, we consider skew estimation of Roman script. The method uses the boundary growing approach to extract the lowermost and uppermost coordinates of pixels of characters of text lines present in the document, which can be subjected to linear regression analysis (LRA) to determine the skew angle of a skewed document. Further, the proposed technique works fine for scaled text binary documents also. The technique works based on the assumption that the space between the text lines is greater than the space between the words and characters. Finally, in order to evaluate the performance of the proposed methodology we compare the experimental results with those of well-known existing methods.
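
    The core of the skew-estimation idea described above, fitting a straight line through the lowermost pixel coordinates of a text line and reading the skew angle off its slope, can be sketched as follows; the coordinate array is a made-up stand-in for points produced by the boundary-growing step.

        import numpy as np

        # Hypothetical (x, y) coordinates of the lowermost pixels of characters on one text line.
        points = np.array([[10, 100], [40, 102], [80, 105], [120, 107], [160, 110]])

        slope, intercept = np.polyfit(points[:, 0], points[:, 1], deg=1)  # linear regression (LRA)
        skew_degrees = np.degrees(np.arctan(slope))
        print(f"estimated skew: {skew_degrees:.2f} degrees")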

  1. Interuniversity Style Guide for Writing Institutional Texts in English: Model Documents

    OpenAIRE

    Xarxa Vives d'Universitats. Grup de Treball de Qualitat Lingüística

    2015-01-01

    Interuniversity style guide for writing institutional texts in English: model documents. Guidelines and models for drafting, in English, applications, resolutions, notifications, certificates, certifying statements, letters, e-mails and agreements.

  2. Content analysis to detect high stress in oral interviews and text documents

    Science.gov (United States)

    Thirumalainambi, Rajkumar (Inventor); Jorgensen, Charles C. (Inventor)

    2012-01-01

    A system of interrogation to estimate whether a subject of interrogation is likely experiencing high stress, emotional volatility and/or internal conflict in the subject's responses to an interviewer's questions. The system applies one or more of four procedures, a first statistical analysis, a second statistical analysis, a third analysis and a heat map analysis, to identify one or more documents containing the subject's responses for which further examination is recommended. Words in the documents are characterized in terms of dimensions representing different classes of emotions and states of mind, in which the subject's responses that manifest high stress, emotional volatility and/or internal conflict are identified. A heat map visually displays the dimensions manifested by the subject's responses in different colors, textures, geometric shapes or other visually distinguishable indicia.

  3. Ultrasound-guided nerve blocks - is documentation and education feasible using only text and pictures?

    DEFF Research Database (Denmark)

    Worm, Bjarne Skjødt; Krag, Mette; Jensen, Kenneth

    2014-01-01

    With the advancement of ultrasound guidance for peripheral nerve blocks, still pictures from representative ultrasonograms are increasingly used for clinical documentation of the procedure and for educational purposes in textbook materials. However, little is actually known about the clinical and educational usefulness of these still pictures, in particular how well nerve structures can be identified compared to real-time ultrasound examination. We aimed to quantify gross visibility or ultrastructure using still-picture sonograms compared to real-time ultrasound for trainees and experts, for large or small nerves, and to discuss the clinical or educational relevance of these findings.

  4. Trading Consequences: A Case Study of Combining Text Mining and Visualization to Facilitate Document Exploration

    OpenAIRE

    Hinrichs, Uta; Alex, Beatrice; Clifford, Jim; Watson, Andrew; Quigley, Aaron; Klein, Ewan; Coates, Colin M.

    2015-01-01

    Large-scale digitization efforts and the availability of computational methods, including text mining and information visualization, have enabled new approaches to historical research. However, we lack case studies of how these methods can be applied in practice and what their potential impact may be. Trading Consequences is an interdisciplinary research project between environmental historians, computational linguists, and visualization specialists. It combines text mining and information visualization ...

  5. Trading Consequences: A Case Study of Combining Text Mining & Visualisation to Facilitate Document Exploration

    OpenAIRE

    Hinrichs, Uta; Alex, Beatrice; Clifford, Jim; Quigley, Aaron

    2014-01-01

    Trading Consequences is an interdisciplinary research project between historians, computational linguists and visualization specialists. We use text mining and visualisations to explore the growth of the global commodity trade in the nineteenth century. Feedback from a group of environmental historians during a workshop provided essential information to adapt advanced text mining and visualisation techniques to historical research. Expert feedback is an essential tool for effective interdisciplinary ...

  6. MeSH Up: Effective MeSH text classification for improved document retrieval

    NARCIS (Netherlands)

    Trieschnigg, D.; Pezik, P.; Lee, V.; Jong, F.de; Kraaij, W.; Rebholz-Schuhmann, D.

    2009-01-01

    Motivation: Controlled vocabularies such as the Medical Subject Headings (MeSH) thesaurus and the Gene Ontology (GO) provide an efficient way of accessing and organizing biomedical information by reducing the ambiguity inherent to free-text data. Different methods of automating the assignment of MeSH ...

  7. MeSH Up: effective MeSH text classification for improved document retrieval

    NARCIS (Netherlands)

    Trieschnigg, Dolf; Pezik, Piotr; Lee, Vivian; Jong, de Franciska; Kraaij, Wessel; Rebholz-Schuhmann, Dietrich

    2009-01-01

    Motivation: Controlled vocabularies such as the Medical Subject Headings (MeSH) thesaurus and the Gene Ontology (GO) provide an efficient way of accessing and organizing biomedical information by reducing the ambiguity inherent to free-text data. Different methods of automating the assignment of MeSH ...

  8. Using ImageMagick to Automatically Increase Legibility of Scanned Text Documents

    OpenAIRE

    Doreva Belfiore

    2011-01-01

    The Law Library Digitization Project of the Rutgers University School of Law in Camden, New Jersey, developed a Perl script to use the open-source module PerlMagick to automatically adjust the brightness levels of digitized images from scanned microfiche. This script can be adapted by novice Perl programmers to manipulate large numbers of text and image files using commands available in PerlMagick and ImageMagick.
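
    The record's script itself is written in Perl with PerlMagick; as a loose Python analogue of the same batch brightness adjustment, one could use Pillow as sketched below. The directory names and the 1.4 brightness factor are arbitrary placeholders, not values from the project.

        from pathlib import Path
        from PIL import Image, ImageEnhance

        out_dir = Path("adjusted")
        out_dir.mkdir(exist_ok=True)

        for scan in Path("scans").glob("*.tif"):                     # hypothetical input directory
            image = Image.open(scan)
            brighter = ImageEnhance.Brightness(image).enhance(1.4)   # factor > 1.0 brightens
            brighter.save(out_dir / scan.name)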

  9. Farsi/Arabic Document Image Retrieval through Sub -Letter Shape Coding for mixed Farsi/Arabic and English text

    Directory of Open Access Journals (Sweden)

    Zahra Bahmani

    2011-09-01

    Full Text Available A recognition-free retrieval method for Farsi/Arabic documents is proposed in this paper. The system can be used on mixed Farsi/Arabic and English text. The method consists of preprocessing, word and sub-word extraction, detection and cancellation of sub-letter connectors, annotation of sub-letters by shape coding, classification of sub-letters by a decision tree, and an RBF neural network for sub-letter recognition. The proposed system retrieves document images using a new sub-letter shape coding scheme for Farsi/Arabic documents. In this method the document content is captured through sub-letter coding of words. The decision-tree-based classifier partitions the sub-letter space into a number of sub-regions by splitting the space on one topological shape feature at a time. Topological shape features include height, width, holes, openings, valleys, jags, and sub-letter ascenders/descenders. Experimental results show the advantages of this method for Farsi/Arabic document image retrieval.

  10. Progress Report on the ASCII for Science Data, Airborne and Geospatial Working Groups of the 2014 ESDSWG for MEaSUREs

    Science.gov (United States)

    Evans, K. D.; Krotkov, N. A.; Mattmann, C. A.; Boustani, M.; Law, E.; Conover, H.; Chen, G.; Olding, S. W.; Walter, J.

    2014-12-01

    The Earth Science Data Systems Working Groups (ESDSWG) were set up by NASA HQ 10 years ago. The role of the ESDSWG is to make recommendations relevant to NASA's Earth science data systems based on users' experiences. Each group works independently, focusing on a unique topic. Participation in ESDSWG groups comes from a variety of NASA-funded science and technology projects, NASA information technology experts, affiliated contractor staff and other interested community members from academia and industry. Recommendations from the ESDSWG groups will enhance NASA's efforts to develop long-term data products. The ASCII for Science Data Working Group (WG) will define a minimum set of information that should be included in ASCII file headers so that users will be able to access the data using only the header information. After reviewing various use cases, such as field data and ASCII data exported from software tools, and reviewing ASCII data guidelines documentation, this WG will deliver guidelines for creating ASCII files that contain enough header information to allow the user to access the science data. The Airborne WG's goal is to improve airborne data access and use for NASA science. The first step is to evaluate the state of airborne data and make recommendations focusing on data delivery to the DAACs (data centers). The long-term goal is to improve airborne data use for Earth science research. Many aircraft observations are reported in ASCII format. The ASCII and Airborne WGs may seem like the same group, but the Airborne WG is concerned with maintaining and using airborne data for science research, not just the data format. The Geospatial WG focuses on the interoperability issues of Geographic Information System (GIS) and remotely sensed data, in particular focusing on DAAC data from NASA's Earth Science Enterprise. This WG will provide a set of tools (GIS libraries) together with training and/or cookbooks through the use of open source technologies. A progress ...

  11. Memoria documental en textos chilenos del período colonial (siglos XVI y XVII (Documental memory in Chilean texts of the colonial period (sixteenth and seventeenth centuries

    Directory of Open Access Journals (Sweden)

    Manuel Contreras Seitz

    2013-06-01

    Full Text Available This article presents the basic notions for the construction of a diachronic textual corpus covering the Chilean colonial period, focusing with particular emphasis on the 16th and 17th centuries. It also discusses methodological aspects of the preliminary critical edition of these documents, concerning the paleographic transcription of the texts, their adaptation to specific philological norms, and the critical apparatus that must be implemented according to the intended readership, without setting aside historical and documentary rigor. Special mention is made of the lexical-semantic requirements for the edition of these documents, the problem of graphs and abbreviations, and the preliminary steps needed to create an optical character recognition program for handwritten texts of the period.

  12. Computing symmetrical strength of N-grams: a two pass filtering approach in automatic classification of text documents

    OpenAIRE

    Agnihotri, Deepak; Verma, Kesari; Tripathi, Priyanka

    2016-01-01

    The contiguous sequences of the terms (N-grams) in the documents are symmetrically distributed among different classes. The symmetrical distribution of the N-Grams raises uncertainty in the belongings of the N-Grams towards the class. In this paper, we focused on the selection of most discriminating N-Grams by reducing the effects of symmetrical distribution. In this context, a new text feature selection method named as the symmetrical strength of the N-Grams (SSNG) is proposed using a two pass filtering based feature selection (TPF) approach.

  13. Computing symmetrical strength of N-grams: a two pass filtering approach in automatic classification of text documents.

    Science.gov (United States)

    Agnihotri, Deepak; Verma, Kesari; Tripathi, Priyanka

    2016-01-01

    The contiguous sequences of the terms (N-grams) in the documents are symmetrically distributed among different classes. The symmetrical distribution of the N-Grams raises uncertainty in the belongings of the N-Grams towards the class. In this paper, we focused on the selection of most discriminating N-Grams by reducing the effects of symmetrical distribution. In this context, a new text feature selection method named as the symmetrical strength of the N-Grams (SSNG) is proposed using a two pass filtering based feature selection (TPF) approach. Initially, in the first pass of the TPF, the SSNG method chooses various informative N-Grams from the entire extracted N-Grams of the corpus. Subsequently, in the second pass the well-known Chi Square (χ²) method is used to select the few most informative N-Grams. Further, to classify the documents the two standard classifiers Multinomial Naive Bayes and Linear Support Vector Machine have been applied on ten standard text data sets. In most of the datasets, the experimental results show that the performance and success rate of the SSNG method using the TPF approach are superior to the state-of-the-art methods viz. Mutual Information, Information Gain, Odds Ratio, Discriminating Feature Selection and χ². PMID:27386386
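
    The second-pass chi-square selection and the two reference classifiers named in this abstract can be sketched with scikit-learn as below; the SSNG first pass is specific to the paper and is not reproduced, and the corpus and labels are toy placeholders.

        from sklearn.feature_extraction.text import CountVectorizer
        from sklearn.feature_selection import SelectKBest, chi2
        from sklearn.naive_bayes import MultinomialNB
        from sklearn.pipeline import make_pipeline
        from sklearn.svm import LinearSVC

        docs = ["cheap flights book now", "meeting agenda attached",
                "win a free prize now", "quarterly report attached"]
        labels = [1, 0, 1, 0]

        for clf in (MultinomialNB(), LinearSVC()):
            pipe = make_pipeline(CountVectorizer(ngram_range=(1, 2)),  # extract uni- and bi-grams
                                 SelectKBest(chi2, k=10),              # keep the most informative N-grams
                                 clf)
            pipe.fit(docs, labels)
            print(type(clf).__name__, pipe.predict(["free prize flights"]))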

  14. Lidar Bathymetry Data of Cape Canaveral, Florida, (2014) in XYZ ASCII text file format

    Data.gov (United States)

    U.S. Geological Survey, Department of the Interior — The Cape Canaveral Coastal System (CCCS) is a prominent feature along the Southeast U.S. coastline and is the only large cape south of Cape Fear, North Carolina....

  15. PHYSICAL MODELLING OF TERRAIN DIRECTLY FROM SURFER GRID AND ARC/INFO ASCII DATA FORMATS

    Directory of Open Access Journals (Sweden)

    Y.K. Modi

    2012-01-01

    Full Text Available

    ENGLISH ABSTRACT: Additive manufacturing technology is used to make physical models of terrain using GIS surface data. Attempts have been made to understand several other GIS file formats, such as the Surfer grid and the ARC/INFO ASCII grid. The surface of the terrain in these file formats has been converted into an STL file format that is suitable for additive manufacturing. The STL surface is converted into a 3D model by making the walls and the base. In this paper, the terrain modelling work has been extended to several other widely-used GIS file formats. Terrain models can be created in less time and at less cost, and intricate geometries of terrain can be created with ease and great accuracy.

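    For reference, an ARC/INFO (Esri) ASCII grid of the kind mentioned above is simply a short plain-text header followed by rows of values, and can be read as sketched below; the file name is a placeholder and the STL conversion step is not shown.

        import numpy as np

        def read_esri_ascii(path):
            """Read an Esri ASCII grid: six header lines, then the elevation rows."""
            header = {}
            with open(path) as fh:
                for _ in range(6):      # ncols, nrows, xllcorner, yllcorner, cellsize, NODATA_value
                    key, value = fh.readline().split()
                    header[key.lower()] = float(value)
                data = np.loadtxt(fh)
            nodata = header.get("nodata_value")
            if nodata is not None:
                data = np.where(data == nodata, np.nan, data)
            return header, data

        header, elevations = read_esri_ascii("terrain.asc")   # hypothetical file
        print(int(header["ncols"]), int(header["nrows"]), np.nanmax(elevations))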

  16. Formation of skill of interpretation of legal documents and potential of graphic means registrations of the text

    OpenAIRE

    Kosareva T. B.

    2010-01-01

    The article deals with teaching the translation of legal documents, ways of learning legal vocabulary effectively, and testing for educational purposes. Testing is seen as a kind of training for achieving automatic interpretation skills, in which the teaching materials are designed with the help of graphic highlighting of the text.

  17. Is there still an unknown Freud? A note on the publications of Freud's texts and on unpublished documents.

    Science.gov (United States)

    Falzeder, Ernst

    2007-01-01

    This article presents an overview of the existing editions of what Freud wrote (works, letters, manuscripts and drafts, diaries and calendar notes, dedications and margin notes in books, case notes, and patient calendars) and what he is recorded as having said (minutes of meetings, interviews, memoirs of and interviews with patients, family members, and followers, and other quotes). There follows a short overview of biographies of Freud and other documentation on his life. It is concluded that a wealth of material is now available to Freud scholars, although more often than not this information is used in a biased and partisan way.

  18. The Hong Kong Chinese University Document Retrieval Database——The Hong Kong Newspaper Full-text Database Project

    Institute of Scientific and Technical Information of China (English)

    MichaelM.Lee

    1994-01-01

    This project is to collect, organize, index and store the full text and graphics of selected Chinese and English newspapers currently published in Hong Kong. The end product will be an electronic database available to researchers through local area network, Internet and dial-up access. News items from the day before and up to six months old will be available for online searching, via keyword or subject. Earlier cumulated materials, along with the same indexing and searching software, will be archived to optical media (CD-ROM disks). As Hong Kong experiences rapid social, financial, commercial, political, educational and cultural changes, our state-of-the-art comprehensive coverage of local and regional newspapers will be a landmark contribution to information industries and researchers internationally. As the coverage of the database will be comprehensive and centralized, retrieval of news items from major Hong Kong newspapers will be fast and immediate. Users do not need to look through daily or bi-monthly indexes in order to go to the newspapers or cuttings to obtain a hard copy, and then bring it to the photocopier to copy. At this stage, we are hiring librarians, information specialists and support staff to work on this project. We have also met and are working with newspaper indexing and retrieval system developers in Beijing and Hong Kong to study cooperative systems to speed up the process. So far, we have received funding support from the Chinese University and the Hong Kong Government for two years. It is our plan to have a presentable sample database done by mid 1995, and to have several newspapers indexed and stored in a structure and format easy for migration to the eventual database system by the end of 1996.

  19. Oracle Text全文检索技术在文档资料管理中的应用%Application of Full-Text Search of Oracle Text in Documents Management

    Institute of Scientific and Technical Information of China (English)

    李培军; 毕于慧; 张权; 董玮

    2014-01-01

    Using the full-text retrieval capability of Oracle Text, a keyword table was built according to the business logic of the database, and retrieval efficiency was improved by creating indexes on the keyword table. A management system for multiple document types was then developed on the Visual C++ 6 platform using a client/server (C/S) architecture, enabling efficient management of office documents.

  20. Combining Position Weight Matrices and Document-Term Matrix for Efficient Extraction of Associations of Methylated Genes and Diseases from Free Text

    KAUST Repository

    Bin Raies, Arwa

    2013-10-16

    Background: In a number of diseases, certain genes are reported to be strongly methylated and thus can serve as diagnostic markers in many cases. Scientific literature in digital form is an important source of information about methylated genes implicated in particular diseases. The large volume of the electronic text makes it difficult and impractical to search for this information manually. Methodology: We developed a novel text mining methodology based on a new concept of position weight matrices (PWMs) for text representation and feature generation. We applied PWMs in conjunction with the document-term matrix to extract with high accuracy associations between methylated genes and diseases from free text. The performance results are based on large manually-classified data. Additionally, we developed a web-tool, DEMGD, which automates extraction of these associations from free text. DEMGD presents the extracted associations in summary tables and full reports in addition to evidence tagging of text with respect to genes, diseases and methylation words. The methodology we developed in this study can be applied to similar association extraction problems from free text. Conclusion: The new methodology developed in this study allows for efficient identification of associations between concepts. Our method applied to methylated genes in different diseases is implemented as a Web-tool, DEMGD, which is freely available at http://www.cbrc.kaust.edu.sa/demgd/. The data is available for online browsing and download. © 2013 Bin Raies et al.

  1. Scholars in the Humanities Are Reluctant to Cite E-Texts as Primary Materials. A Review of: Sukovic, S. (2009). References to e-texts in academic publications. Journal of Documentation, 65(6), 997-1015.

    Directory of Open Access Journals (Sweden)

    Deena Yanofsky

    2011-03-01

    collections as well as 'electronically born' documents, works of art and popular culture artifacts. Of the 22 works resulting from the research projects examined during the study period, half did not cite e-texts as primary materials. The 11 works that made at least one reference to an e-text included 4 works in which the only reference was to e-texts created by the actual author. In total, only 7 works referred to e-texts created by outside authors. These 7 final works were written by 5 participants, representing 31 percent of the total number of study participants. Analysis of the participants' citation practices revealed that decisions to cite an electronic source or omit it from publication were based on two important factors: (1) the perceived trustworthiness of an e-text and (2) a sense of what was acceptable practice. Participants established trustworthiness through a process of verification. To confirm the authenticity and reliability of an e-text, most participants compared electronic documents against a print version to verify provenance, context, and details. Even when digitized materials were established as trustworthy sources, however, hard copies were often cited because they were considered more authoritative or accurate. Traditions of a particular discipline also had a strong influence on a participant's willingness to cite e-texts. Participants working on traditional historical topics were more reluctant to cite electronic resources, while researchers who worked on topics that explored relatively new fields were more willing to acknowledge the use of e-texts in published works. Traditional practices also influenced participants' decisions about how to cite materials. Some participants always cited original works in hard copy, regardless of electronic access, because it was accepted scholarly practice. Conclusions – The results of this study suggest that the small number of citations to electronic sources in publications in the humanities is directly

  2. PDF文档HTML化中文本重排问题研究%A Study of Text Rearrang in Conversion of PDF Documents into HTML

    Institute of Scientific and Technical Information of China (English)

    林青; 李健

    2014-01-01

    Most existing PDF converters perform text detection by locating the coordinates of each text element; specifically, the text is reconstructed by rearranging these elements from left to right and from top to bottom. Unfortunately, such methods fail on complex multi-column PDF documents. To address this problem, this work proposes a page segmentation algorithm that first divides a page into several blocks and then reorders these blocks. With the proposed algorithm, the correctness of restoring the original reading order of complex multi-column text increases effectively.

  3. Single-Beam Bathymetry Sounding Data of Cape Canaveral, Florida, (2014) in XYZ ASCII text file format

    Data.gov (United States)

    U.S. Geological Survey, Department of the Interior — The Cape Canaveral Coastal System (CCCS) is a prominent feature along the Southeast U.S. coastline, and is the only large cape south of Cape Fear, North Carolina....

  4. Text files of the navigation logged by the U.S. Geological Survey offshore of Fire Island, NY in 2011 (Geographic, WGS 84, HYPACK ASCII Text Files)

    Data.gov (United States)

    U.S. Geological Survey, Department of the Interior — The U.S. Geological Survey (USGS) mapped approximately 336 square kilometers of the lower shoreface and inner-continental shelf offshore of Fire Island, New York in...

  5. Research on Document Relevancy Based on Full-Text Retrieval System%一种基于全文检索系统的文档关联研究与实现

    Institute of Scientific and Technical Information of China (English)

    饶祎; 郭辉; 蔡庆生

    2003-01-01

    As a important application of the Full-Text retrieval system, document relevancy has powerful function. In this paper, a document relevancy method based on the Full-Text retrieval system is presented, which is deeply discussed from two aspects, content relevancy and properties relevancy. This system is proved to have good response time and precision by tests. It has great prospects in application area.

  6. Native Language Processing using Exegy Text Miner

    Energy Technology Data Exchange (ETDEWEB)

    Compton, J

    2007-10-18

    Lawrence Livermore National Laboratory's New Architectures Testbed recently evaluated Exegy's Text Miner appliance to assess its applicability to high-performance, automated native language analysis. The evaluation was performed with support from the Computing Applications and Research Department in close collaboration with Global Security programs, and institutional activities in native language analysis. The Exegy Text Miner is a special-purpose device for detecting and flagging user-supplied patterns of characters, whether in streaming text or in collections of documents, at very high rates. Patterns may consist of simple lists of words or complex expressions with sub-patterns linked by logical operators. These searches are accomplished through a combination of specialized hardware (i.e., one or more field-programmable gate arrays in addition to general-purpose processors) and proprietary software that exploits these individual components in an optimal manner (through parallelism and pipelining). For this application the Text Miner has performed accurately and reproducibly at high speeds approaching those documented by Exegy in its technical specifications. The Exegy Text Miner is primarily intended for the single-byte ASCII characters used in English, but at a technical level its capabilities are language-neutral and can be applied to multi-byte character sets such as those found in Arabic and Chinese. The system is used for searching databases or tracking streaming text with respect to one or more lexicons. In a real operational environment it is likely that data would need to be processed separately for each lexicon or search technique. However, the searches would be so fast that multiple passes should not be considered as a limitation a priori. Indeed, it is conceivable that large databases could be searched as often as necessary if new queries were deemed worthwhile. This project is concerned with evaluating the Exegy Text Miner installed in the ...

  7. Text Steganographic Approaches: A Comparison

    Directory of Open Access Journals (Sweden)

    Monika Agarwal

    2013-02-01

    Full Text Available This paper presents three novel approaches to text steganography. The first approach uses the theme of a missing letter puzzle, where each character of the message is hidden by missing one or more letters in a word of the cover. The average Jaro score was found to be 0.95, indicating closer similarity between cover and stego file. The second approach hides a message in a wordlist where the ASCII value of the embedded character determines the length and starting letter of a word. The third approach conceals a message, without degrading the cover, by using the start and end letters of words of the cover. For enhancing the security of the secret message, the message is scrambled using a one-time pad scheme before being concealed, and the cipher text is then concealed in the cover. We also present an empirical comparison of the proposed approaches with some of the popular text steganographic approaches and show that our approaches outperform the existing approaches.
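
    The one-time-pad scrambling step mentioned in this abstract (applied to the message before it is concealed) can be illustrated as below; the message is invented and the concealment in the cover text itself is not reproduced.

        import secrets

        message = b"meet at dawn"                      # secret message (ASCII bytes)
        key = secrets.token_bytes(len(message))        # one-time pad: one random byte per character

        cipher = bytes(m ^ k for m, k in zip(message, key))       # scramble before concealment
        recovered = bytes(c ^ k for c, k in zip(cipher, key))     # the same key unscrambles it
        assert recovered == message
        print(cipher.hex())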

  8. 一种大容量文本集的智能检索方法%Intelligent information retrieval approach for large-scale collections of full-text document

    Institute of Scientific and Technical Information of China (English)

    金小峰

    2011-01-01

    An information retrieval approach for large-scale collections of full-text documents is proposed, based on latent semantic model analysis and an investigation of latent-space text representation. The retrieval process is divided into a rough culling procedure that discards irrelevant full-text documents and a precise search procedure over the relevant documents. Irrelevant documents are removed by the first procedure; relevant full-text documents are then retrieved at the passage level by the second one, in which a genetic algorithm (GA) is introduced in order to achieve the best performance. Finally, the candidate passage indices are returned. The validity and high efficiency of the proposed method are shown by experimental results.
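
    Retrieval in a latent semantic space, the general technique this record builds on, can be sketched with scikit-learn as follows; the documents are invented and the passage-level, GA-assisted refinement described in the abstract is not reproduced.

        from sklearn.decomposition import TruncatedSVD
        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.metrics.pairwise import cosine_similarity

        docs = ["full-text retrieval of large document collections",
                "genetic algorithms optimise search parameters",
                "weather report for the coming week"]

        vectorizer = TfidfVectorizer()
        tfidf = vectorizer.fit_transform(docs)
        svd = TruncatedSVD(n_components=2)             # project into a small latent semantic space
        doc_vectors = svd.fit_transform(tfidf)

        query = svd.transform(vectorizer.transform(["document retrieval"]))
        scores = cosine_similarity(query, doc_vectors)[0]
        print(scores.argsort()[::-1])                  # document indices, most relevant first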

  9. TRMM Gridded Text Products

    Science.gov (United States)

    Stocker, Erich Franz

    2007-01-01

    NASA's Tropical Rainfall Measuring Mission (TRMM) has many products that contain instantaneous or gridded rain rates, often among many other parameters. However, because of their completeness these products can seem intimidating to users desiring only surface rain rates. For example, one of the gridded monthly products contains well over 200 parameters. It is clear that if only rain rates are desired, this many parameters might prove intimidating. In addition, for many good reasons these products are archived and currently distributed in HDF format. This also can be an inhibiting factor in using TRMM rain rates. To provide a simple format and isolate just the rain rates from the many other parameters, the TRMM project created a series of gridded products in ASCII text format. This paper describes the various text rain rate products produced. It provides detailed information about parameters and how they are calculated. It also gives detailed format information. These products are used in a number of applications within the TRMM processing system. The products are produced from the swath instantaneous rain rates and contain information from the three major TRMM instruments: radar, radiometer, and combined. They are simple to use, human readable, and small for downloading.

  10. Extraction of text content from PDF documents based on automaton theory%基于自动机理论的PDF文本内容抽取

    Institute of Scientific and Technical Information of China (English)

    王晓娟; 谭建龙; 刘燕兵; 刘金刚

    2012-01-01

    The existing methods of extracting text content from a PDF file, such as the one adopted by the PDFBox library, are not efficient enough to handle high-speed network traffic. Moreover, these methods cannot extract content in a streaming fashion from partial PDF packets in transfer. This paper proposes a new method based on automaton theory. The method adopts a hierarchical keyword Deterministic Finite Automaton (DFA) to extract text content from complete or incomplete PDF files. Experimental results on Chinese and English PDF document datasets show that the time consumed by the proposed method is only about 17%-37% of that of the PDFBox method.

  11. 可全文检索的校园文档管理系统设计%The Design of Campus Full-text Search and Manage Document System

    Institute of Scientific and Technical Information of China (English)

    韩金松

    2013-01-01

    General search engines can typically only search web page content and cannot search the content of documents attached to web pages. This article focuses on methods for searching document content and, in combination with the actual situation of the school, presents the design of a campus document management system with full-text search.

  12. A methodology for semiautomatic taxonomy of concepts extraction from nuclear scientific documents using text mining techniques; Metodologia para extracao semiautomatica de uma taxonomia de conceitos a partir da producao cientifica da area nuclear utilizando tecnicas de mineracao de textos

    Energy Technology Data Exchange (ETDEWEB)

    Braga, Fabiane dos Reis

    2013-07-01

    This thesis presents a text mining method for the semi-automatic extraction of a taxonomy of concepts from a textual corpus composed of scientific papers related to the nuclear area. Text classification is a natural human practice and a crucial task for working with large repositories. Document clustering provides a logical and understandable framework that facilitates organization, browsing and searching. Most clustering algorithms use the bag-of-words model to represent the content of a document. This model generates high-dimensional data, ignores the fact that different words can have the same meaning, and does not consider the relationships between words, assuming that they are independent of each other. The methodology combines a model for document representation by concepts with a hierarchical document clustering method based on the frequency of co-occurring concepts and a technique for labeling clusters with their most representative concepts, with the objective of producing a taxonomy of concepts that reflects the structure of the knowledge domain. It is hoped that this work will contribute to the conceptual mapping of the scientific production of the nuclear area and thus support the management of research activities in this area. (author)

  13. Documenting the Earliest Chinese Journals

    Directory of Open Access Journals (Sweden)

    Jian-zhong (Joe) Zhou

    2001-10-01

    Full Text Available

    Pages: 19-24

    According to various authoritative sources, the English word "journal" was first used in the 16th century, but the existence of the journal in its original meaning as a daily record can be traced back to the Acta Diurna (Daily Events) in ancient Roman cities as early as 59 B.C. This article documents the first appearance of Chinese daily records that were much earlier than 59 B.C.

    The evidence of the earlier Chinese daily records came from some important archaeological discoveries in the 1970s, but they were also documented by Sima Qian (145 B.C. - 85 B.C.), the grand historian of the Han Dynasty imperial court. Sima's lifetime contribution was the publication of Shi Ji (史記, The Grand Scribe's Records; the Records hereafter). The Records is a book of history of a grand scope. It encompasses all Chinese history from the 30th century B.C. through the end of the second century B.C. in 130 chapters and over 525,000 Chinese characters.

  14. Locations and analysis of sediment samples collected offshore of Massachusetts within Northern Cape Cod Bay(CCB_SedSamples Esri Shapefile, and ASCII text format, WGS84)

    Data.gov (United States)

    U.S. Geological Survey, Department of the Interior — These data were collected under a cooperative agreement with the Massachusetts Office of Coastal Zone Management (CZM) and the U.S. Geological Survey (USGS),...

  15. Text Mining.

    Science.gov (United States)

    Trybula, Walter J.

    1999-01-01

    Reviews the state of research in text mining, focusing on newer developments. The intent is to describe the disparate investigations currently included under the term text mining and provide a cohesive structure for these efforts. A summary of research identifies key organizations responsible for pushing the development of text mining. A section…

  16. Text Mining: (Asynchronous Sequences)

    Directory of Open Access Journals (Sweden)

    Sheema Khan

    2014-12-01

    Full Text Available In this paper we try to correlate text sequences that provide common topics as semantic clues. We propose a two-step method for asynchronous text mining. Step one checks for the common topics in the sequences and isolates them with their timestamps. Step two takes the topic and tries to assign the timestamp of the text document. After multiple repetitions of step two, an optimum result can be given.

  17. De que modo os textos oficiais prescrevem o trabalho do professor? Análise comparativa de documentos brasileiros e genebrinos How do official texts prescribe the teacher's work? A comparative analysis of the brazilian and genebrian Documents

    Directory of Open Access Journals (Sweden)

    Anna Rachel Machado

    2005-12-01

    Full Text Available In this article we present the results of the analysis of two documents produced by official agencies to guide the teachers' work in Brazil and in Switzerland. On the one hand, we sought to identify the textualization features used to prescribe the teacher's work. Results show that, besides the properties common to prescriptive texts (enunciator's erasure, felicity contract, etc.), these documents are characterized by a more complex thematic structure, articulating a prescriptive doing, a source-doing and a prescribed-doing. On the other hand, we sought to identify how the object of prescription is constructed, which allowed us to verify that, in both contexts, this object takes the form of a global pedagogical proposal rather than the teachers' concrete work; teachers are not represented in these texts as actors who bear real responsibility for developing the proposals, while students are presented as inert targets. This work also allowed us to identify some differences in the forms of textualization of the prescriptions examined, differences that we relate to the political and economic context of the two countries. Finally, we raise questions concerning the reasons why the teachers' actual work is not taken into account in this type of document.

  18. EMOTION DETECTION FROM TEXT

    Directory of Open Access Journals (Sweden)

    Shiv Naresh Shivhare

    2012-05-01

    Full Text Available Emotion can be expressed in many observable ways, such as facial expressions and gestures, speech, and written text. Emotion detection in text documents is essentially a content-based classification problem involving concepts from the domains of natural language processing as well as machine learning. In this paper, emotion recognition based on textual data and the techniques used in emotion detection are discussed.

  19. Text Classification using Data Mining

    CERN Document Server

    Kamruzzaman, S M; Hasan, Ahmed Ryadh

    2010-01-01

    Text classification is the process of classifying documents into predefined categories based on their content. It is the automated assignment of natural language texts to predefined categories. Text classification is the primary requirement of text retrieval systems, which retrieve texts in response to a user query, and of text understanding systems, which transform text in some way such as producing summaries, answering questions or extracting data. Existing supervised learning algorithms to automatically classify text need sufficient documents to learn accurately. This paper presents a new algorithm for text classification using data mining that requires fewer documents for training. Instead of using words, word relations, i.e. association rules derived from these words, are used to build the feature set from pre-classified text documents. The concept of the Naive Bayes classifier is then used on the derived features, and finally only a single concept of a Genetic Algorithm has been added for the final classification. A system based on the...

  20. Text Classification using Artificial Intelligence

    CERN Document Server

    Kamruzzaman, S M

    2010-01-01

    Text classification is the process of classifying documents into predefined categories based on their content. It is the automated assignment of natural language texts to predefined categories. Text classification is the primary requirement of text retrieval systems, which retrieve texts in response to a user query, and of text understanding systems, which transform text in some way such as producing summaries, answering questions or extracting data. Existing supervised learning algorithms for classifying text need sufficient documents to learn accurately. This paper presents a new algorithm for text classification using an artificial intelligence technique that requires fewer documents for training. Instead of using words, word relations, i.e. association rules derived from these words, are used to build the feature set from pre-classified text documents. The concept of the naïve Bayes classifier is then used on the derived features, and finally a single concept of the genetic algorithm is added for the final classification. A syste...

  1. Centroid Based Text Clustering

    Directory of Open Access Journals (Sweden)

    Priti Maheshwari

    2010-09-01

    Full Text Available Web mining is a burgeoning new field that attempts to glean meaningful information from natural language text. Web mining refers generally to the process of extracting interesting information and knowledge from unstructured text. Text clustering is one of the important Web mining functionalities: the task in which texts are classified into groups of similar objects based on their contents. Current research in the area of Web mining tackles problems of text data representation, classification, clustering, information extraction or the search for and modeling of hidden patterns. In this paper we propose that, for mining large document collections, it is necessary to pre-process the web documents and store the information in a data structure that is more appropriate for further processing than a plain web file. We developed a php-mySql based utility to convert unstructured web documents into a structured tabular representation by preprocessing and indexing. We apply a centroid-based web clustering method on the preprocessed data, using three methods for clustering. Finally we propose a method that can increase accuracy based on the clustering of documents.
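
    As an illustration of centroid-based clustering of preprocessed documents, the sketch below uses TF-IDF vectors and k-means, where each cluster is represented by its centroid. The php/MySQL preprocessing utility and the accuracy-improving method proposed in the paper are not reproduced; the documents and parameters are assumptions.

    # Illustrative centroid-based clustering of preprocessed documents. TF-IDF
    # vectors stand in for the structured representation built by the paper's
    # php/MySQL utility; the corpus and number of clusters are illustrative.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.cluster import KMeans

    docs = ["web mining extracts knowledge from web pages",
            "text clustering groups similar documents together",
            "search engines index web pages for retrieval",
            "document clustering relies on similarity of contents"]

    X = TfidfVectorizer(stop_words="english").fit_transform(docs)
    km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

    # Each cluster is represented by its centroid; documents are assigned to
    # the nearest centroid.
    print(km.labels_)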

  2. School Survey on Crime and Safety (SSOCS) 2000 Public-Use Data Files, User's Manual, and Detailed Data Documentation. [CD-ROM].

    Science.gov (United States)

    National Center for Education Statistics (ED), Washington, DC.

    This CD-ROM contains the raw, public-use data from the 2000 School Survey on Crime and Safety (SSOCS) along with a User's Manual and Detailed Data Documentation. The data are provided in SAS, SPSS, STATA, and ASCII formats. The User's Manual and the Detailed Data Documentation are provided as .pdf files. (Author)

  3. Emotion Detection from Text

    CERN Document Server

    Shivhare, Shiv Naresh

    2012-01-01

    Emotion can be expressed in many visible ways, such as facial expressions and gestures, speech, and written text. Emotion detection in text documents is essentially a content-based classification problem involving concepts from the domains of Natural Language Processing as well as Machine Learning. In this paper, emotion recognition based on textual data and the techniques used in emotion detection are discussed.

  4. Exploiting Document Level Semantics in Document Clustering

    Directory of Open Access Journals (Sweden)

    Muhammad Rafi

    2016-06-01

    Full Text Available Document clustering is an unsupervised machine learning method that separates a large, subject-heterogeneous collection (corpus) into smaller, more manageable, subject-homogeneous collections (clusters). Traditional document clustering works by extracting textual features like terms, sequences, and phrases from documents. These features are independent of each other and do not capture the meaning behind the words in the clustering process. In order to perform semantically viable clustering, we believe that the problem of document clustering has two main components: (1) to represent the document in a form that inherently captures the semantics of the text, which may also help to reduce the dimensionality of the document; and (2) to define a similarity measure based on lexical, syntactic and semantic features such that it assigns higher numerical values to document pairs which have a higher syntactic and semantic relationship. In this paper, we propose a representation of documents by extracting three different types of features from a given document: lexical, syntactic and semantic. A meta-descriptor for each document is proposed using these three features: first lexical, then syntactic and last semantic. A document-to-document similarity matrix is produced in which each entry contains a three-value vector for lexical, syntactic and semantic similarity. The main contributions of this research are: (i) a document-level descriptor using three different features for text (lexical, syntactic and semantic); (ii) a similarity function using these three; and (iii) a new candidate clustering algorithm using the three components of the similarity measure to guide the clustering process in a direction that produces more semantically rich clusters. We performed an extensive series of experiments on standard text mining data sets with external clustering evaluations like F-Measure and Purity, and have obtained
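
    A hedged sketch of the central data structure, a document-to-document matrix whose entries hold a three-value (lexical, syntactic, semantic) similarity vector, is shown below. The three component functions are crude placeholders, not the descriptors defined in the paper, and the combination weights are arbitrary.

    # Hedged sketch: combining lexical, syntactic and semantic similarities into a
    # single document-to-document matrix. The three component functions below are
    # placeholders, not the descriptors defined in the paper.
    import numpy as np

    def lexical_sim(a, b):      # word-overlap (Jaccard) as a stand-in
        wa, wb = set(a.split()), set(b.split())
        return len(wa & wb) / len(wa | wb)

    def syntactic_sim(a, b):    # placeholder: similarity of document lengths
        return min(len(a), len(b)) / max(len(a), len(b))

    def semantic_sim(a, b):     # placeholder: would use WordNet or embeddings
        return lexical_sim(a, b)

    docs = ["the cat sat on the mat", "a cat was sitting on a mat", "stock prices fell sharply"]
    n = len(docs)
    sim = np.zeros((n, n, 3))           # one three-value vector per document pair
    for i in range(n):
        for j in range(n):
            a, b = docs[i], docs[j]
            sim[i, j] = (lexical_sim(a, b), syntactic_sim(a, b), semantic_sim(a, b))

    # A weighted combination can then guide a clustering algorithm.
    weights = np.array([0.4, 0.2, 0.4])
    combined = sim @ weights
    print(np.round(combined, 2))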

  5. Quality text editing

    Directory of Open Access Journals (Sweden)

    Gyöngyi Bujdosó

    2009-10-01

    Full Text Available Text editing is more than the knowledge of word processing techniques. Originally typographers, printers and text editors were the ones qualified to edit texts, which were well structured, legible, easily understandable, clear, and able to emphasize the core of the text. Times have changed, and nowadays everyone has access to computers as well as to text editing software, and most users believe that having these tools is enough to edit texts. However, text editing requires more skills. Texts appearing either in printed or in electronic form reveal that most users do not realize that they are not qualified to edit and publish their works. Analyzing the 'text products' of the last decade, a clear tendency can be drawn: more and more documents appear which, instead of emphasizing the subject matter, are lost in a maze of unstructured text slices. Different font types, colors, sizes, strange arrangements of objects, etc. are applied without further thought. We present examples with the most common typographic and text editing errors. Our aim is to call attention to these mistakes and persuade users to spend time educating themselves in text editing. They have to realize that a well-structured text is able to strengthen the effect on the reader, so that the original message will reach the target group.

  6. Integrated Documents

    OpenAIRE

    Sawitzki, Günther

    2000-01-01

    An introduction to integrated documents in statistics. Integrated documents allow a seamless integration of interactive statistics and data analysis components in 'life' documents while keeping the full computational power needed for simulation or resampling.

  7. About CABI Full Text

    Institute of Scientific and Technical Information of China (English)

    2012-01-01

    Centre for Agriculture and Bioscience International (CABI) is a not-for-profit international agricultural information institute with headquarters in Britain. It aims to improve people's lives by providing information and applying scientific expertise to solve problems in agriculture and the environment. CABI Full Text is one of the publishing products of CABI. CABI's full-text repository is growing rapidly and has now been integrated into all our databases, including CAB Abstracts, Global Health, our Internet Resources and Abstract Journals. There are currently over 60,000 full-text articles available to access. These documents, made possible by agreement with third

  8. 2005-004-FA_HYPACK: Text files of the Wide Area Augmentation System (WAAS) navigation collected by the U.S. Geological Survey in Moultonborough Bay, Lake Winnipesaukee, New Hampshire in 2005 (Geographic, WGS 84, HYPACK ASCII Text Files)

    Data.gov (United States)

    U.S. Geological Survey, Department of the Interior — In freshwater bodies of New Hampshire, the most problematic aquatic invasive plant species is Myriophyllum heterophyllum or variable leaf water-milfoil. Once...

  9. Discover Effective Pattern for Text Mining

    OpenAIRE

    Khade, A. D.; A. B. Karche

    2014-01-01

    Many data mining techniques have been proposed for finding useful patterns in documents such as text documents. However, how to effectively use discovered patterns and keep them up to date is still an open research task, especially in the domain of text mining. Text mining is the discovery of interesting knowledge (or features) in text documents. It is a challenging task to find appropriate knowledge (or features) in text documents to help users to find what they exactly want...

  10. Arabic Short Text Compression

    Directory of Open Access Journals (Sweden)

    Eman Omer

    2010-01-01

    Full Text Available Problem statement: Text compression permits representing a document using less space. This is useful not only to save disk space, but more importantly, to save disk transfer and network transmission time. With the continuous increase in the number of Arabic short text messages sent by mobile phones, the use of a suitable compression scheme would allow users to use more characters than the default value specified by the provider. The development of an efficient compression scheme to compress short Arabic texts is not a straightforward task. Approach: This study combined the benefits of pre-processing, entropy reduction through splitting files and hybrid dynamic coding, a new technique proposed in this study that uses the fact that Arabic texts have single-case letters. Experimental tests were performed on short Arabic texts and a comparison with the well-known plain Huffman compression was made to measure the performance of the proposed scheme for Arabic short text. Results: The proposed scheme can achieve a compression ratio around 4.6 bits byte-1 for very short Arabic text sequences of 15 bytes and around 4 bits byte-1 for 50-byte text sequences, using only 8 Kbytes of memory overhead. Conclusion: Furthermore, a reasonable compression ratio can be achieved using less than 0.4 KB of memory overhead. We recommend the use of the proposed scheme to compress short Arabic texts where resources are limited.
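
    The abstract reports compression performance in bits per byte against a plain Huffman baseline. The sketch below is a minimal static Huffman coder used only to illustrate how that baseline metric can be measured on a short text; it is not the proposed hybrid dynamic scheme, and the sample text is arbitrary rather than Arabic SMS data.

    # Minimal static Huffman coder used only to illustrate the bits-per-byte metric
    # the abstract reports; this is the plain-Huffman baseline, not the proposed
    # hybrid dynamic scheme.
    import heapq
    from collections import Counter

    def huffman_code_lengths(text):
        freqs = Counter(text)
        heap = [(f, i, {ch: 0}) for i, (ch, f) in enumerate(freqs.items())]
        heapq.heapify(heap)
        uid = len(heap)
        while len(heap) > 1:
            f1, _, c1 = heapq.heappop(heap)
            f2, _, c2 = heapq.heappop(heap)
            merged = {ch: depth + 1 for ch, depth in {**c1, **c2}.items()}
            heapq.heappush(heap, (f1 + f2, uid, merged))
            uid += 1
        return heap[0][2], freqs

    text = "a short sample message of fifty bytes used for testing!"
    lengths, freqs = huffman_code_lengths(text)
    total_bits = sum(lengths[ch] * freqs[ch] for ch in freqs)
    print(f"{total_bits / len(text):.2f} bits per byte")  # plain Huffman baseline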

  11. Documenting localities

    CERN Document Server

    Cox, Richard J

    1996-01-01

    Now in paperback! Documenting Localities is the first effort to summarize the past decade of renewed discussion about archival appraisal theory and methodology and to provide a practical guide for the documentation of localities.This book discusses the continuing importance of the locality in American historical research and archival practice, traditional methods archivists have used to document localities, and case studies in documenting localities. These chapters draw on a wide range of writings from archivists, historians, material culture specialists, historic preservationists

  12. Text Mining for Neuroscience

    Science.gov (United States)

    Tirupattur, Naveen; Lapish, Christopher C.; Mukhopadhyay, Snehasis

    2011-06-01

    Text mining, sometimes alternately referred to as text analytics, refers to the process of extracting high-quality knowledge from the analysis of textual data. Text mining has a wide variety of applications in areas such as biomedical science, news analysis, and homeland security. In this paper, we describe an approach and some relatively small-scale experiments which apply text mining to neuroscience research literature to find novel associations among a diverse set of entities. Neuroscience is a discipline which encompasses an exceptionally wide range of experimental approaches and rapidly growing interest. This combination results in an overwhelmingly large and often diffuse literature which makes a comprehensive synthesis difficult. Understanding the relations or associations among the entities appearing in the literature not only improves the researchers' current understanding of recent advances in their field, but also provides an important computational tool to formulate novel hypotheses and thereby assist in scientific discoveries. We describe a methodology to automatically mine the literature and form novel associations through direct analysis of published texts. The method first retrieves a set of documents from databases such as PubMed using a set of relevant domain terms. In the current study these terms yielded sets ranging from 160,909 to 367,214 documents. Each document is then represented in a numerical vector form, from which an Association Graph is computed representing relationships between all pairs of domain terms, based on co-occurrence. Association graphs can then be subjected to various graph-theoretic algorithms such as transitive closure and cycle (circuit) detection to derive additional information, and can also be visually presented to a human researcher for understanding. In this paper, we present three relatively small-scale problem-specific case studies to demonstrate that such an approach is very successful in
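
    A small sketch of the co-occurrence step, building an association graph over a set of domain terms, is given below. The term list and documents are illustrative stand-ins, not the PubMed corpus or the vector representation used in the paper.

    # Small sketch of building a term co-occurrence association graph, in the
    # spirit of the approach described above. The term list and documents are
    # illustrative, not the PubMed corpus used in the paper.
    from itertools import combinations
    import networkx as nx

    domain_terms = {"dopamine", "prefrontal", "cortex", "addiction", "reward"}
    documents = [
        "dopamine signalling in the prefrontal cortex modulates reward",
        "addiction alters dopamine and reward circuits",
        "prefrontal cortex lesions impair decision making",
    ]

    graph = nx.Graph()
    for doc in documents:
        present = sorted(domain_terms & set(doc.split()))
        for a, b in combinations(present, 2):
            # edge weight counts the number of co-occurring documents
            w = graph.get_edge_data(a, b, {"weight": 0})["weight"]
            graph.add_edge(a, b, weight=w + 1)

    print(sorted(graph.edges(data="weight")))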

  13. Clustering Text Data Streams

    Institute of Scientific and Technical Information of China (English)

    Yu-Bao Liu; Jia-Rong Cai; Jian Yin; Ada Wai-Chee Fu

    2008-01-01

    Clustering text data streams is an important issue in the data mining community and has a number of applications such as news group filtering, text crawling, document organization and topic detection and tracking. However, most methods are similarity-based approaches that only use the TF-IDF scheme to represent the semantics of text data, and they often lead to poor clustering quality. Recently, researchers have argued that a semantic smoothing model is more efficient than the existing TF-IDF scheme for improving text clustering quality. However, the existing semantic smoothing model is not suitable for the dynamic text data context. In this paper, we first extend the semantic smoothing model to the text data stream context. Based on the extended model, we then present two online clustering algorithms, OCTS and OCTSM, for the clustering of massive text data streams. In both algorithms, we also present a new cluster statistics structure named the cluster profile, which can capture the semantics of text data streams dynamically and at the same time speed up the clustering process. Some efficient implementations for our algorithms are also given. Finally, we present a series of experimental results illustrating the effectiveness of our technique.
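
    To make the idea of incrementally updated cluster statistics concrete, the sketch below shows a single-pass online clustering loop in which each cluster keeps a running term-count profile. This is a plain frequency-based simplification; the semantic smoothing model of OCTS/OCTSM is not reproduced, and the similarity threshold is an assumption.

    # Simplified online (single-pass) clustering of a text stream. Each cluster
    # keeps a running sum of term counts as its "profile"; the semantic smoothing
    # model of OCTS/OCTSM is not reproduced here. The threshold is illustrative.
    from collections import Counter

    def cosine(a, b):
        common = set(a) & set(b)
        num = sum(a[t] * b[t] for t in common)
        den = (sum(v * v for v in a.values()) ** 0.5) * (sum(v * v for v in b.values()) ** 0.5)
        return num / den if den else 0.0

    clusters = []          # each cluster profile is a Counter of term counts
    THRESHOLD = 0.3

    def process(doc):
        vec = Counter(doc.lower().split())
        best, best_sim = None, 0.0
        for profile in clusters:
            s = cosine(vec, profile)
            if s > best_sim:
                best, best_sim = profile, s
        if best is not None and best_sim >= THRESHOLD:
            best.update(vec)           # update the cluster profile incrementally
        else:
            clusters.append(vec)       # start a new cluster

    for doc in ["stocks rally on earnings", "earnings lift stocks again", "new vaccine trial begins"]:
        process(doc)
    print(len(clusters), "clusters")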

  14. Text Association Analysis and Ambiguity in Text Mining

    Science.gov (United States)

    Bhonde, S. B.; Paikrao, R. L.; Rahane, K. U.

    2010-11-01

    Text Mining is the process of analyzing a semantically rich document or set of documents to understand the content and meaning of the information they contain. Research in Text Mining will enhance humans' ability to process massive quantities of information, and it has high commercial value. First, the paper introduces TM and its definition, and then gives an overview of the text mining process and its applications. Up to now, not much research in text mining, especially in concept/entity extraction, has focused on the ambiguity problem. This paper addresses ambiguity issues in natural language texts and presents a new technique for resolving the ambiguity problem in extracting concepts/entities from texts. In the end, it shows the importance of TM in knowledge discovery and highlights the upcoming challenges of document mining and the opportunities it offers.

  15. Visualization Guided Document Reading by Citation and Text Summarization%基于文本摘要及引用关系的可视辅助文献阅读

    Institute of Scientific and Technical Information of China (English)

    张加万; 杨思琪; 李泽宇; 杨伟强; 王锦东; 贺瑞芳; 黄茂林

    2016-01-01

    With the growing volume of publications in recent years, researchers have to read many more papers. Therefore, how to read a scientific article in an efficient way becomes an important issue. When reading an article, it is necessary to read its references in order to get a better understanding. However, how to differentiate between relevant and non-relevant references, and how to stay on topic in a large document collection, are still challenging tasks. This paper presents GUDOR (GUided DOcument Reader), a visualization-guided reader based on citation and summarization. It (1) extracts the important sentences from a scientific article with an objective-based summarization technique and visualizes the extraction results by a multi-resolution method; (2) identifies the main topics of the references with an LDA (Latent Dirichlet Allocation) model; and (3) tracks the user's reading behavior to keep him or her focused on the reading objective. In addition, the paper describes the functions and operations of the system in a usage scenario and validates its applicability by a user study.

  16. Termination Documentation

    Science.gov (United States)

    Duncan, Mike; Hill, Jillian

    2014-01-01

    In this study, we examined 11 workplaces to determine how they handle termination documentation, an empirically unexplored area in technical communication and rhetoric. We found that the use of termination documentation is context dependent while following a basic pattern of infraction, investigation, intervention, and termination. Furthermore,…

  17. Maury Documentation

    Data.gov (United States)

    National Oceanic and Atmospheric Administration, Department of Commerce — Supporting documentation for the Maury Collection of marine observations. Includes explanations from Maury himself, as well as guides and descriptions by the U.S....

  18. DM Documentation

    OpenAIRE

    Sick, Jonathan

    2016-01-01

    An overview of resources for the science community to learn about, and interact with, LSST Data Management. This talk highlights the LSST Community Forum, https://community.lsst.org, as well as Data Management Technical Notes and software documentation projects.

  19. Arabic Text Mining Using Rule Based Classification

    OpenAIRE

    Fadi Thabtah; Omar Gharaibeh; Rashid Al-Zubaidy

    2012-01-01

    A well-known classification problem in the domain of text mining is text classification, which concerns mapping textual documents into one or more predefined categories based on their content. The text classification arena has recently attracted many researchers because of the massive amounts of online documents and text archives which hold essential information for decision-making processes. In this field, most research focuses on classifying English documents, while there are limited studi...

  20. Learning Context for Text Categorization

    CERN Document Server

    Haribhakta, Y V

    2011-01-01

    This paper describes our work, which is based on discovering context for text document categorization. The document categorization approach is derived from a combination of a learning paradigm known as relation extraction and a technique known as context discovery. We demonstrate the effectiveness of our categorization approach using the Reuters-21578 dataset and synthetic real-world data from the sports domain. Our experimental results indicate that the learned context greatly improves the categorization performance as compared to traditional categorization approaches.

  1. TEXT CATEGORIZATION USING Q-LEARNING ALGORITHM

    OpenAIRE

    Dr.S.R.Suresh; T.Karthikeyan,; D.B.Shanmugam,; J.Dhilipan

    2011-01-01

    This paper aims at the creation of an efficient document classification process using reinforcement learning, a branch of machine learning that concerns itself with optimal sequential decision-making. One strength of reinforcement learning is that it provides a formalism for measuring the utility of actions that give benefit only in the future. An effective and flexible classifier learning algorithm is provided, which classifies a set of text documents into a more specific domain like Cricket, Tenn...

  2. Typesafe Modeling in Text Mining

    OpenAIRE

    Steeg, Fabian

    2011-01-01

    Based on the concept of annotation-based agents, this report introduces tools and a formal notation for defining and running text mining experiments using a statically typed domain-specific language embedded in Scala. Using machine learning for classification as an example, the framework is used to develop and document text mining experiments, and to show how the concept of generic, typesafe annotation corresponds to a general information model that goes beyond text processing.

  3. Performance Documentation.

    Science.gov (United States)

    Foster, Paula

    2002-01-01

    Presents an interview with experts on performance documentation. Suggests that educators should strive to represent performance appraisal writing to students in a way that reflects the way it is perceived and evaluated in the workplace. Concludes that educators can enrich their pedagogy with practice by helping students understand the importance…

  4. Removing Manually-Generated Boilerplate from Electronic Texts: Experiments with Project Gutenberg e-Books

    CERN Document Server

    Kaser, Owen

    2007-01-01

    Collaborative work on unstructured or semi-structured documents, such as in literature corpora or source code, often involves agreed upon templates containing metadata. These templates are not consistent across users and over time. Rule-based parsing of these templates is expensive to maintain and tends to fail as new documents are added. Statistical techniques based on frequent occurrences have the potential to identify automatically a large fraction of the templates, thus reducing the burden on the programmers. We investigate the case of the Project Gutenberg corpus, where most documents are in ASCII format with preambles and epilogues that are often copied and pasted or manually typed. We show that a statistical approach can solve most cases though some documents require knowledge of English. We also survey various technical solutions that make our approach applicable to large data sets.
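
    A minimal sketch of the frequency-based idea, flagging lines that recur across many documents as likely template text, is shown below. The toy corpus and the document-frequency threshold are assumptions for illustration; they are not the Project Gutenberg settings or the full statistical method of the paper.

    # Minimal sketch of frequency-based boilerplate detection: lines that recur in
    # a large fraction of documents are flagged as likely template text. The corpus
    # and threshold are illustrative, not Project Gutenberg settings.
    from collections import Counter

    docs = [
        "*** START OF THIS PROJECT ***\nChapter 1\nIt was a dark night.\n*** END ***",
        "*** START OF THIS PROJECT ***\nChapter 1\nCall me Ishmael.\n*** END ***",
        "*** START OF THIS PROJECT ***\nPreface\nOnce upon a time.\n*** END ***",
    ]

    line_doc_freq = Counter()
    for doc in docs:
        for line in set(doc.splitlines()):        # count each line once per document
            line_doc_freq[line] += 1

    BOILERPLATE_FRACTION = 0.8
    boilerplate = {l for l, c in line_doc_freq.items() if c / len(docs) >= BOILERPLATE_FRACTION}

    cleaned = ["\n".join(l for l in d.splitlines() if l not in boilerplate) for d in docs]
    print(cleaned[0])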

  5. Short Text Classification: A Survey

    Directory of Open Access Journals (Sweden)

    Ge Song

    2014-05-01

    Full Text Available With the recent explosive growth of e-commerce and online communication, a new genre of text, short text, has been extensively applied in many areas, and much research therefore focuses on short text mining. It is a challenge to classify short text owing to its inherent characteristics, such as sparseness, large scale, immediacy and non-standard language. It is difficult for traditional methods to deal with short text classification, mainly because the very limited number of words in a short text cannot adequately represent the feature space and the relationship between words and documents. Several studies and reviews on text classification have appeared in recent times; however, only a few focus on short text classification. This paper discusses the characteristics of short text and the difficulty of short text classification. Then we introduce the existing popular work on short text classifiers and models, including short text classification using semantic analysis, semi-supervised short text classification, ensemble short text classification, and real-time classification. The evaluation of short text classification is analyzed in our paper. Finally, we summarize the existing classification technology and discuss prospects for the development of short text classification.

  6. A Survey of Unstructured Text Summarization Techniques

    Directory of Open Access Journals (Sweden)

    Sherif Elfayoumy

    2014-05-01

    Full Text Available Due to the explosive amounts of text data being created and organizations' increased desire to leverage their data corpora, especially with the availability of Big Data platforms, there is not usually enough time to read and understand each document and make decisions based on document contents. Hence, there is a great demand for summarizing text documents to provide a representative substitute for the original documents. By improving summarization techniques, the precision of document retrieval through search queries against summarized documents is expected to improve in comparison to querying against the full spectrum of original documents. Several generic text summarization algorithms have been developed, each with its own advantages and disadvantages. For example, some algorithms are particularly good at summarizing short documents but not long ones. Others perform well in identifying and summarizing single-topic documents, but their precision degrades sharply with multi-topic documents. In this article we present a survey of the literature in text summarization. We also survey some of the most common evaluation methods for the quality of automated text summarization techniques. Last, we identify some of the challenging problems that are still open, in particular the need for a universal approach that yields good results for mixed types of documents.

  7. Interconnectedness und digitale Texte

    Directory of Open Access Journals (Sweden)

    Detlev Doherr

    2013-04-01

    Full Text Available The multimedia information services on the Internet are becoming ever larger and more comprehensive, and documents that exist only in printed form are being digitized by libraries and placed online. These documents can be found via online document management systems or search engines and are then made available in common formats such as PDF. This article examines how the Humboldt Digital Library works, which for more than ten years has made documents by Alexander von Humboldt freely available on the Web in English translation as the HDL (Humboldt Digital Library). Unlike a conventional digital library, however, it does not merely provide digitized documents as scans or PDFs, but makes the text itself available in an interlinked form. The system therefore resembles an information system more than a digital library, which is also reflected in the available functions for finding texts in different versions and translations, comparing paragraphs of different documents, or displaying images in their context. The development of dynamic hyperlinks based on the individual text paragraphs of Humboldt's works, in the form of media assets, makes it possible to use the Google Maps programming interface for geographic as well as content-based navigation. Going beyond the service of a digital library, the HDL offers the prototype of a multidimensional information system that works with dynamic structures and enables extensive thematic analyses and comparisons.

  8. Segmentation of complex document

    Directory of Open Access Journals (Sweden)

    Souad Oudjemia

    2014-06-01

    Full Text Available In this paper we present a method for the segmentation of document images with complex structure. The technique is based on the GLCM (Grey Level Co-occurrence Matrix) and is used to segment this type of document into three regions, namely 'graphics', 'background' and 'text'. Very briefly, the method divides the document image into blocks whose size is chosen after a series of tests, and then applies the co-occurrence matrix to each block in order to extract five textural parameters: energy, entropy, sum entropy, difference entropy and standard deviation. These parameters are then used to classify the image into three regions using the k-means algorithm; the final segmentation is obtained by grouping connected pixels. Two performance measurements are carried out for both the graphics and text zones; we obtained a classification rate of 98.3% and a misclassification rate of 1.79%.
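
    A hedged sketch of the per-block pipeline (compute a GLCM per block, derive textural parameters, cluster blocks with k-means) is shown below using scikit-image. Only a few of the five parameters are computed, the entropy is taken directly from the normalized GLCM since it is not built into graycoprops, and the image and block size are placeholders rather than the paper's settings.

    # Hedged sketch of the per-block GLCM pipeline: compute a co-occurrence matrix
    # for each block, derive a few textural parameters, and cluster blocks with
    # k-means. Block size, image and feature details are illustrative.
    import numpy as np
    from skimage.feature import graycomatrix, graycoprops
    from sklearn.cluster import KMeans

    def block_features(block):
        glcm = graycomatrix(block, distances=[1], angles=[0], levels=256,
                            symmetric=True, normed=True)
        p = glcm[:, :, 0, 0]
        entropy = -np.sum(p[p > 0] * np.log2(p[p > 0]))      # not built into graycoprops
        energy = graycoprops(glcm, "energy")[0, 0]
        return [energy, entropy, block.std()]

    rng = np.random.default_rng(0)
    image = rng.integers(0, 256, size=(64, 64), dtype=np.uint8)  # placeholder document image

    B = 16                                                       # block size chosen after tests
    feats = [block_features(image[i:i + B, j:j + B])
             for i in range(0, image.shape[0], B)
             for j in range(0, image.shape[1], B)]

    labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(np.array(feats))
    print(labels.reshape(4, 4))   # one label ('text', 'graphics' or 'background') per block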

  9. Documenting Spreadsheets

    CERN Document Server

    Payette, Raymond

    2008-01-01

    This paper discusses spreadsheet documentation and new means to achieve this end by using Excel's built-in "Comment" function. By structuring comments, they can be used as an essential tool to fully explain a spreadsheet. This will greatly facilitate spreadsheet change control, risk management and auditing. It will fill a crucial gap in corporate governance by adding essential information that can be managed in order to satisfy internal controls and accountability standards.
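
    As an illustration only, structured cell comments of the kind the paper advocates can also be written programmatically, for example with openpyxl. The comment fields shown (purpose, author, last-reviewed) are an assumed structure, not the paper's template.

    # Illustration only: writing a structured documentation comment into a cell
    # programmatically with openpyxl. The comment fields shown are an assumed
    # structure, not the paper's template.
    from openpyxl import Workbook
    from openpyxl.comments import Comment

    wb = Workbook()
    ws = wb.active
    ws["B2"] = "=SUM(A1:A10)"
    ws["B2"].comment = Comment(
        "PURPOSE: total of monthly figures\nAUTHOR: finance team\nLAST REVIEWED: 2024-01",
        "documentation")
    wb.save("documented.xlsx")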

  10. Working with text tools, techniques and approaches for text mining

    CERN Document Server

    Tourte, Gregory J L

    2016-01-01

    Text mining tools and technologies have long been a part of the repository world, where they have been applied to a variety of purposes, from pragmatic aims to support tools. Research areas as diverse as biology, chemistry, sociology and criminology have seen effective use made of text mining technologies. Working With Text collects a subset of the best contributions from the 'Working with text: Tools, techniques and approaches for text mining' workshop, alongside contributions from experts in the area. Text mining tools and technologies in support of academic research include supporting research on the basis of a large body of documents, facilitating access to and reuse of extant work, and bridging between the formal academic world and areas such as traditional and social media. Jisc have funded a number of projects, including NaCTem (the National Centre for Text Mining) and the ResDis programme. Contents are developed from workshop submissions and invited contributions, including: Legal considerations in te...

  11. CMS DOCUMENTATION

    CERN Multimedia

    CMS TALKS AT MAJOR MEETINGS The agenda and talks from major CMS meetings can now be electronically accessed from the iCMS Web site. The following items can be found on: http://cms.cern.ch/iCMS/ Management- CMS Weeks (Collaboration Meetings), CMS Weeks Agendas The talks presented at the Plenary Sessions. Management - CB - MB - FB Agendas and minutes are accessible to CMS members through their AFS account (ZH). However some linked documents are restricted to the Board Members. FB documents are only accessible to FB members. LHCC The talks presented at the ‘CMS Meetings with LHCC Referees’ are available on request from the PM or MB Country Representative. Annual Reviews The talks presented at the 2007 Annual reviews are posted. CMS DOCUMENTS It is considered useful to establish information on the first employment of CMS doctoral students upon completion of their theses. Therefore it is requested that Ph.D students inform the CMS Secretariat about the nature of employment and ...

  12. CMS DOCUMENTATION

    CERN Multimedia

    CMS TALKS AT MAJOR MEETINGS The agenda and talks from major CMS meetings can now be electronically accessed from the iCMS Web site. The following items can be found on: http://cms.cern.ch/iCMS/ General - CMS Weeks (Collaboration Meetings), CMS Weeks Agendas The talks presented at the Plenary Sessions. LHC Symposiums Management - CB - MB - FB - FMC Agendas and minutes are accessible to CMS members through their AFS account (ZH). However some linked documents are restricted to the Board Members. FB documents are only accessible to FB members. LHCC The talks presented at the ‘CMS Meetings with LHCC Referees’ are available on request from the PM or MB Country Representative. Annual Reviews The talks presented at the 2006 Annual reviews are posted.   CMS DOCUMENTS It is considered useful to establish information on the first employment of CMS doctoral students upon completion of their theses. Therefore it is requested that Ph.D students inform the CMS Secretariat a...

  13. CMS DOCUMENTATION

    CERN Multimedia

    CMS TALKS AT MAJOR MEETINGS The agenda and talks from major CMS meetings can now be electronically accessed from the iCMS Web site. The following items can be found on: http://cms.cern.ch/iCMS/ General - CMS Weeks (Collaboration Meetings), CMS Weeks Agendas The talks presented at the Plenary Sessions. LHC Symposiums Management - CB - MB - FB - FMC Agendas and minutes are accessible to CMS members through their AFS account (ZH). However some linked documents are restricted to the Board Members. FB documents are only accessible to FB members. LHCC The talks presented at the ‘CMS Meetings with LHCC Referees’ are available on request from the PM or MB Country Representative. Annual Reviews The talks presented at the 2006 Annual reviews are posted. CMS DOCUMENTS It is considered useful to establish information on the first employment of CMS doctoral students upon completion of their theses. Therefore it is requested that Ph.D students inform the CMS Secretariat about the natur...

  14. CMS DOCUMENTATION

    CERN Multimedia

    CMS TALKS AT MAJOR MEETINGS The agenda and talks from major CMS meetings can now be electronically accessed from the iCMS Web site. The following items can be found on: http://cms.cern.ch/iCMS/ General - CMS Weeks (Collaboration Meetings), CMS Weeks Agendas The talks presented at the Plenary Sessions. LHC Symposiums Management - CB - MB - FB - FMC Agendas and minutes are accessible to CMS members through their AFS account (ZH). However some linked documents are restricted to the Board Members. FB documents are only accessible to FB members. LHCC The talks presented at the ‘CMS Meetings with LHCC Referees’ are available on request from the PM or MB Country Representative. Annual Reviews The talks presented at the 2006 Annual reviews are posted. CMS DOCUMENTS It is considered useful to establish information on the first employment of CMS doctoral students upon completion of their theses. Therefore it is requested that Ph.D students inform the CMS Secretariat about the natu...

  15. CMS DOCUMENTATION

    CERN Multimedia

    CMS TALKS AT MAJOR MEETINGS The agenda and talks from major CMS meetings can now be electronically accessed from the iCMS Web site. The following items can be found on: http://cms.cern.ch/iCMS/ Management- CMS Weeks (Collaboration Meetings), CMS Weeks Agendas The talks presented at the Plenary Sessions. Management - CB - MB - FB Agendas and minutes are accessible to CMS members through their AFS account (ZH). However some linked documents are restricted to the Board Members. FB documents are only accessible to FB members. LHCC The talks presented at the ‘CMS Meetings with LHCC Referees’ are available on request from the PM or MB Country Representative. Annual Reviews The talks presented at the 2007 Annual reviews are posted. CMS DOCUMENTS It is considered useful to establish information on the first employment of CMS doctoral students upon completion of their theses. Therefore it is requested that Ph.D students inform the CMS Secretariat about the nature of empl...

  16. CMS DOCUMENTATION

    CERN Multimedia

    CMS TALKS AT MAJOR MEETINGS The agenda and talks from major CMS meetings can now be electronically accessed from the iCMS Web site. The following items can be found on: http://cms.cern.ch/iCMS/ General - CMS Weeks (Collaboration Meetings), CMS Weeks Agendas The talks presented at the Plenary Sessions. LHC Symposiums Management - CB - MB - FB - FMC Agendas and minutes are accessible to CMS members through their AFS account (ZH). However some linked documents are restricted to the Board Members. FB documents are only accessible to FB members. LHCC The talks presented at the ‘CMS Meetings with LHCC Referees’ are available on request from the PM or MB Country Representative. Annual Reviews The talks presented at the 2006 Annual reviews are posted. CMS DOCUMENTS It is considered useful to establish information on the first employment of CMS doctoral students upon completion of their theses. Therefore it is requested that Ph.D students inform the CMS Secretariat about the na...

  17. System for Distributed Text Mining

    OpenAIRE

    Torgersen, Martin Nordseth

    2011-01-01

    Text mining presents us with new possibilities for the use of collections of documents. There exists a large amount of hidden, implicit information inside these collections, which text mining techniques may help us to uncover. Unfortunately, these techniques generally require large amounts of computational power. This is addressed by the introduction of distributed systems and methods for distributed processing, such as Hadoop and MapReduce. This thesis aims to describe, design, implement and ev...
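
    Since the thesis concerns distributing text mining with Hadoop and MapReduce, a minimal word-count example expressed in plain Python may help convey the programming model. On an actual cluster the same map and reduce functions would be run via Hadoop Streaming or a comparable framework; this sketch only illustrates the model, not the thesis's system.

    # Minimal illustration of the MapReduce programming model (word count) in plain
    # Python. On a real Hadoop cluster the map and reduce functions would run via
    # Hadoop Streaming or a comparable framework; this only shows the model.
    from collections import defaultdict

    def map_phase(doc):
        for word in doc.lower().split():
            yield word, 1

    def shuffle(pairs):
        grouped = defaultdict(list)
        for key, value in pairs:
            grouped[key].append(value)
        return grouped

    def reduce_phase(key, values):
        return key, sum(values)

    docs = ["text mining at scale", "mining text with hadoop", "scale matters"]
    pairs = [kv for d in docs for kv in map_phase(d)]
    counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
    print(counts["mining"], counts["scale"])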

  18. CNEA's quality system documentation

    International Nuclear Information System (INIS)

    Full text: To obtain an effective and coherent documentation system suitable for CNEA's Quality Management Program, we decided to organize CNEA's quality documentation as: a) Level 1: Quality manual; b) Level 2: Procedures; c) Level 3: Quality plans; d) Level 4: Instructions; e) Level 5: Records and other documents. The objective of this work is to present a standardization of the documentation of CNEA's quality system for facilities, laboratories, services, and R and D activities. Considering the diversity of criteria and formats used by different departments to elaborate the documentation, and since ultimately each of them generally includes the same quality management policy, we proposed the elaboration of a system in order to improve the documentation, avoiding unnecessary time wasting and costs. This will allow each sector to focus on its specific documentation. The quality manuals of the atomic centers fulfill rule 3.6.1 of the Nuclear Regulatory Authority and the Safety Series 50-C/SG-Q of the International Atomic Energy Agency. They are designed by groups of competent and highly trained people from different departments. The normative procedures are elaborated with the same methodology as the quality manuals. The quality plans, which describe the organizational structure of the working group and the appropriate documentation, will assess the quality manuals of the facilities, laboratories, services, and research and development activities of the atomic centers. The responsibility for approval of the normative documentation is assigned to the management in charge of the administration of economic and human resources, in order to fulfill the institutional objectives. Another improvement, aimed at eliminating unnecessary processes, is the inclusion of all the quality system's normative documentation in the CNEA intranet. (author)

  19. TRMM .25 deg x .25 deg Gridded Precipitation Text Product

    Science.gov (United States)

    Stocker, Erich; Kelley, Owen

    2009-01-01

    Since the launch of the Tropical Rainfall Measuring Mission (TRMM), the Precipitation Measurement Missions science team has endeavored to provide TRMM precipitation retrievals in a variety of formats that are more easily usable by the broad science community than the standard Hierarchical Data Format (HDF) in which TRMM data are produced and archived. At the request of users, the Precipitation Processing System (PPS) has developed a .25 x .25 degree gridded product in an easily used ASCII text format. The entire TRMM mission data record has been made available in this format. The paper provides the details of this new precipitation product, which is designated with the TRMM designator 3G68.25. The format is packaged into daily files. It provides hourly precipitation information from the TRMM Microwave Imager (TMI), Precipitation Radar (PR), and TMI/PR combined rain retrievals. Major advantages of this approach are the inclusion of rain data only, compression when a particular grid cell has no rain from the PR or combined retrievals, and the direct ASCII text format. For those interested only in rain retrievals and whether rain is convective or stratiform, these products provide a huge reduction in the data volume inherent in the standard TRMM products. This paper provides examples of the 3G68 data products and their uses. It also provides information about C tools that can be used to aggregate daily files into larger time samples. In addition, it describes the possibilities inherent in the spatial sampling, which allows resampling into coarser spatial sampling. The paper concludes with information about downloading the gridded text data products.

  20. Text Recognition from an Image

    Directory of Open Access Journals (Sweden)

    Shrinath Janvalkar

    2014-04-01

    Full Text Available To achieve high speed in data processing it is necessary to convert analog data into digital data. Storing a hard copy of any document occupies a large amount of space, and retrieving information from that document is time consuming. An optical character recognition (OCR) system is an effective way to recognize printed characters. It provides an easy way to recognize the printed text in an image and convert it into editable text, and it also increases the speed of data retrieval from the image. An image which contains characters can be scanned through a scanner, and the recognition engine of the OCR system then interprets the image and converts images of printed characters into machine-readable characters [8]. This improves the interface between man and machine in many applications.
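
    As an illustration of the scan-recognize-convert workflow described above, the short sketch below uses the Tesseract engine via the pytesseract wrapper. This is not the recognition engine discussed in the paper, and the file name is a placeholder.

    # Illustrative OCR workflow using the Tesseract engine via pytesseract; this is
    # not the specific recognition engine of the paper. 'scanned_page.png' is a
    # placeholder file name.
    from PIL import Image
    import pytesseract

    image = Image.open("scanned_page.png")       # scanned document image
    editable_text = pytesseract.image_to_string(image)

    with open("scanned_page.txt", "w", encoding="utf-8") as out:
        out.write(editable_text)                 # machine-readable, editable text
    print(editable_text[:200])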

  1. Secure Copier Which Allows Reuse Copied Documents with Sorting Capability in Accordance with Document Types

    Directory of Open Access Journals (Sweden)

    Kohei Arai

    2013-09-01

    Full Text Available A secure copy machine is proposed that allows copied documents to be reused, with a sorting capability in accordance with the document type. Through experiments with a variety of document types, it is found that copied documents can be shared and stored securely in a database in accordance with automatically classified document types. The copied documents are protected by data hiding based on wavelet Multi-Resolution Analysis (MRA).
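
    To illustrate the general idea of wavelet-domain data hiding (not the paper's MRA-based scheme), the sketch below adds a few payload bits to the diagonal detail coefficients of a one-level Haar transform using PyWavelets. The image, payload and embedding strength are toy values.

    # Much-simplified illustration of wavelet-domain data hiding: a few payload
    # bits are added to the diagonal detail coefficients of a one-level Haar
    # transform. This is not the MRA-based scheme of the paper.
    import numpy as np
    import pywt

    rng = np.random.default_rng(1)
    page = rng.integers(0, 256, size=(8, 8)).astype(float)   # placeholder document image

    cA, (cH, cV, cD) = pywt.dwt2(page, "haar")

    payload = np.array([1, 0, 1, 1])                         # bits to hide
    strength = 4.0
    flat = cD.flatten()
    flat[: payload.size] += strength * (2 * payload - 1)     # +strength for 1, -strength for 0
    cD_marked = flat.reshape(cD.shape)

    marked_page = pywt.idwt2((cA, (cH, cV, cD_marked)), "haar")
    print(np.abs(marked_page - page).max())                  # small change per pixel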

  2. ASCII Text File of the Original 1-m Bathymetry from National Oceanic and Atmospheric Administration (NOAA) Survey H11321 in Central Rhode Island Sound (H11321_1M_UTM19NAD83.TXT)

    Data.gov (United States)

    U.S. Geological Survey, Department of the Interior — The United States Geological Survey (USGS) is working cooperatively with the National Oceanic and Atmospheric Administration (NOAA) to interpret the surficial...

  3. Text Analytics to Data Warehousing

    Directory of Open Access Journals (Sweden)

    Kalli Srinivasa Nageswara Prasad

    2010-09-01

    Full Text Available Information hidden or stored in unstructured data can play a critical role in making decisions, understanding and conducting other business functions. Integrating data stored in both structured and unstructured formats can add significant value to an organization. With the extent of development happening in Text Mining and in technologies to deal with unstructured and semi-structured data, such as XML and MML (Mining Markup Language), to extract and analyze data, text analytics has evolved to handle unstructured data and helps unlock and predict business results via Business Intelligence and Data Warehousing. Text mining involves dealing with texts in documents and discovering hidden patterns, while Text Analytics enhances Information Retrieval in the form of search, enables clustering of results, and moreover combines text mining and visualization. In this paper we discuss handling unstructured data found in documents so that it fits into business applications like Data Warehouses for further analysis, and we describe the framework we have used for the solution.

  4. INFORMATION RETRIEVAL FOR SHORT DOCUMENTS

    Institute of Scientific and Technical Information of China (English)

    Qi Haoliang; Li Mu; Gao Jianfeng; Li Sheng

    2006-01-01

    The major problem with most current information retrieval models is that individual words provide unreliable evidence about the content of texts. When the document is short, e.g. when only the abstract is available, the word-use variability problem has a substantial impact on Information Retrieval (IR) performance. To solve the problem, a new technique for short document retrieval named the Reference Document Model (RDM) is put forward in this letter. RDM obtains the statistical semantics of the query/document by pseudo feedback, for both the query and the document, from reference documents. The contributions of this model are three-fold: (1) pseudo feedback both for the query and the document; (2) building the query model and the document model from reference documents; (3) flexible indexing units, which can be any linguistic elements such as documents, paragraphs, sentences, n-grams, terms or characters. For short document retrieval, RDM achieves significant improvements over the classical probabilistic models on the task of ad hoc retrieval on Text REtrieval Conference (TREC) test sets. Results also show that the shorter the document, the better the RDM performance.
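
    A hedged sketch of one simple form of the pseudo-feedback idea is given below: a short document's representation is enriched with terms from its nearest reference documents before matching. The similarity measure and expansion rule are plain stand-ins, not the RDM smoothing described in the letter, and the reference collection is illustrative.

    # Hedged sketch of the pseudo-feedback idea: a short document's representation
    # is enriched with terms from its nearest reference documents before retrieval.
    # The similarity measure and expansion rule are simple stand-ins, not RDM itself.
    from collections import Counter

    reference_docs = [
        "information retrieval ranks documents for a user query",
        "language models smooth term probabilities for retrieval",
        "image processing enhances photographs",
    ]

    def expand(short_doc, k=1):
        base = Counter(short_doc.lower().split())
        scored = sorted(reference_docs,
                        key=lambda r: len(base & Counter(r.lower().split())),
                        reverse=True)
        expanded = Counter(base)
        for ref in scored[:k]:                    # borrow terms from top-k references
            expanded.update(Counter(ref.lower().split()))
        return expanded

    print(expand("retrieval of short abstracts").most_common(5))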

  5. Summit documents; Documents du sommet

    Energy Technology Data Exchange (ETDEWEB)

    NONE

    2003-07-01

    This document gathers three declarations about the non-proliferation of weapons of mass destruction, made by the participants of the G8 organization during their last summit, held in Evian (France): a declaration about the enforcement of and respect for the non-proliferation measures implemented by the IAEA and by the conventions on chemical and biological weapons; a declaration about the protection of radioactive sources against diversion (regulatory control, inventory, control of source exports, etc.); and a guarantee of the security of radioactive sources (the G8 approach, support for the IAEA's action, support for the most vulnerable states, control mechanisms, the political commitment of states, and implementation of the recommendations of the international conference on the security and safety of radiation sources held in Vienna (Austria) in March 2003). (J.S.)

  6. CMS DOCUMENTATION

    CERN Multimedia

    CMS TALKS AT MAJOR MEETINGS The agenda and talks from major CMS meetings can now be electronically accessed from the ICMS Web site. The following items can be found on: http://cms.cern.ch/iCMS Management – CMS Weeks (Collaboration Meetings), CMS Weeks Agendas The talks presented at the Plenary Sessions. Management – CB – MB – FB Agendas and minutes are accessible to CMS members through Indico. LHCC The talks presented at the ‘CMS Meetings with LHCC Referees’ are available on request from the PM or MB Country Representative. Annual Reviews The talks presented at the 2008 Annual Reviews are posted in Indico. CMS DOCUMENTS It is considered useful to establish information on the first employment of CMS doctoral student upon completion of their theses.  Therefore it is requested that Ph.D students inform the CMS Secretariat about the nature of employment and name of their first employer. The Notes, Conference Reports and Theses published si...

  7. Open architecture for multilingual parallel texts

    CERN Document Server

    Benitez, M T Carrasco

    2008-01-01

    Multilingual parallel texts (abbreviated to parallel texts) are linguistic versions of the same content ("translations"); e.g., the Maastricht Treaty in English and Spanish are parallel texts. This document is about creating an open architecture for the whole Authoring, Translation and Publishing Chain (ATP-chain) for the processing of parallel texts.

  8. A Survey on Web Text Information Retrieval in Text Mining

    Directory of Open Access Journals (Sweden)

    Tapaswini Nayak

    2015-08-01

    Full Text Available In this study we analyze different techniques for information retrieval in text mining, with the aim of identifying approaches to Web text information retrieval. Text mining is closely related to text analytics: the process of deriving high-quality information from text. High-quality information is typically derived through the identification of patterns and trends by means such as statistical pattern learning. Typical text mining tasks include text categorization, text clustering, concept/entity extraction, creation of coarse taxonomies, sentiment analysis, document summarization and entity relation modeling. Text mining is used to extract hidden information from unstructured or semi-structured data. This capability is necessary because a large amount of Web information is semi-structured due to the nested structure of HTML code, is linked and is redundant. Web content categorization with a content database is the most important tool for the efficient use of search engines. A customer requesting information on a particular subject or item would otherwise have to search through hundreds of results to find the most relevant information for his query. Through the use of text mining, these hundreds of results are reduced, which eliminates the aggravation and improves the navigation of information on the Web.

  9. Omega documentation

    Energy Technology Data Exchange (ETDEWEB)

    Howerton, R.J.; Dye, R.E.; Giles, P.C.; Kimlinger, J.R.; Perkins, S.T.; Plechaty, E.F.

    1983-08-01

    OMEGA is a CRAY I computer program that controls nine codes used by LLNL Physical Data Group for: 1) updating the libraries of evaluated data maintained by the group (UPDATE); 2) calculating average values of energy deposited in secondary particles and residual nuclei (ENDEP); 3) checking the libraries for internal consistency, especially for energy conservation (GAMCHK); 4) producing listings, indexes and plots of the library data (UTILITY); 5) producing calculational constants such as group averaged cross sections and transfer matrices for diffusion and Sn transport codes (CLYDE); 6) producing and updating standard files of the calculational constants used by LLNL Sn and diffusion transport codes (NDFL); 7) producing calculational constants for Monte Carlo transport codes that use group-averaged cross sections and continuous energy for particles (CTART); 8) producing and updating standard files used by the LLNL Monte Carlo transport codes (TRTL); and 9) producing standard files used by the LANL pointwise Monte Carlo transport code MCNP (MCPOINT). The first four of these functions and codes deal with the libraries of evaluated data and the last five with various aspects of producing calculational constants for use by transport codes. In 1970 a series, called PD memos, of internal and informal memoranda was begun. These were intended to be circulated among the group for comment and then to provide documentation for later reference whenever questions arose about the subject matter of the memos. They have served this purpose and now will be drawn upon as source material for this more comprehensive report that deals with most of the matters covered in those memos.

  10. Omega documentation

    International Nuclear Information System (INIS)

    OMEGA is a CRAY I computer program that controls nine codes used by LLNL Physical Data Group for: 1) updating the libraries of evaluated data maintained by the group (UPDATE); 2) calculating average values of energy deposited in secondary particles and residual nuclei (ENDEP); 3) checking the libraries for internal consistency, especially for energy conservation (GAMCHK); 4) producing listings, indexes and plots of the library data (UTILITY); 5) producing calculational constants such as group averaged cross sections and transfer matrices for diffusion and Sn transport codes (CLYDE); 6) producing and updating standard files of the calculational constants used by LLNL Sn and diffusion transport codes (NDFL); 7) producing calculational constants for Monte Carlo transport codes that use group-averaged cross sections and continuous energy for particles (CTART); 8) producing and updating standard files used by the LLNL Monte Carlo transport codes (TRTL); and 9) producing standard files used by the LANL pointwise Monte Carlo transport code MCNP (MCPOINT). The first four of these functions and codes deal with the libraries of evaluated data and the last five with various aspects of producing calculational constants for use by transport codes. In 1970 a series, called PD memos, of internal and informal memoranda was begun. These were intended to be circulated among the group for comment and then to provide documentation for later reference whenever questions arose about the subject matter of the memos. They have served this purpose and now will be drawn upon as source material for this more comprehensive report that deals with most of the matters covered in those memos

  11. “Dreamers Often Lie”: On “Compromise”, the subversive documentation of an Israeli- Palestinian political adaptation of Shakespeare’s Romeo and Juliet

    Directory of Open Access Journals (Sweden)

    Yael Munk

    2012-07-01

    Full Text Available Is Romeo and Juliet relevant to a description of the Middle-East conflict? This is the question raised in Compromise, an Israeli documentary that follows the Jerusalem Khan Theater's production of the play in the mid-1990's. This paper describes how the cinematic documentation of a theatrical Shakespeare production can undermine the original intentions of its creators. This staging of the play was carefully planned in order to demonstrate to the country and the

  12. Bengali text summarization by sentence extraction

    CERN Document Server

    Sarkar, Kamal

    2012-01-01

    Text summarization is a process to produce an abstract or a summary by selecting significant portion of the information from one or more texts. In an automatic text summarization process, a text is given to the computer and the computer returns a shorter less redundant extract or abstract of the original text(s). Many techniques have been developed for summarizing English text(s). But, a very few attempts have been made for Bengali text summarization. This paper presents a method for Bengali text summarization which extracts important sentences from a Bengali document to produce a summary.
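
    A language-agnostic sketch of the extractive approach (score sentences, keep the top-ranked ones) is shown below. The frequency-based scoring and the sample text are assumptions for illustration; the Bengali-specific processing of the paper is not reproduced.

    # Language-agnostic sketch of extractive summarization: score sentences by the
    # frequency of their words and keep the top-ranked ones. The Bengali-specific
    # processing of the paper is not reproduced; the sample text is illustrative.
    import re
    from collections import Counter

    text = ("Text summarization selects significant portions of a text. "
            "Extractive methods pick whole sentences from the document. "
            "The cat sat quietly on the mat. "
            "Sentence scores often come from word frequencies in the document.")

    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    word_freq = Counter(re.findall(r"\w+", text.lower()))

    def score(sentence):
        words = re.findall(r"\w+", sentence.lower())
        return sum(word_freq[w] for w in words) / max(len(words), 1)

    summary = sorted(sentences, key=score, reverse=True)[:2]
    print(" ".join(summary))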

  13. Securing XML Documents

    Directory of Open Access Journals (Sweden)

    Charles Shoniregun

    2004-11-01

    Full Text Available XML (extensible markup language) is becoming the current standard for establishing interoperability on the Web. XML data are self-descriptive and syntax-extensible; this makes it very suitable for representation and exchange of semi-structured data, and allows users to define new elements for their specific applications. As a result, the number of documents incorporating this standard is continuously increasing over the Web. The processing of XML documents may require a traversal of the entire document structure and, therefore, the cost can be very high. A strong demand for a means of efficient and effective XML processing has posed a new challenge for the database world. This paper discusses a fast and efficient indexing technique for XML documents, and introduces the XML graph numbering scheme. It can be used for indexing and securing the graph structure of XML documents. This technique provides an efficient method to speed up XML data processing. Furthermore, the paper explores the classification of existing methods and their impact on query processing and indexing.
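
    As a hedged illustration of what a numbering scheme over an XML document's structure can look like, the sketch below assigns each element a (pre, post) interval during a traversal so that ancestor/descendant tests reduce to integer comparisons; the function name and the toy document are assumptions, and the paper's own numbering scheme and security extensions are not reproduced here.

      # (pre, post) interval numbering of XML elements for structural indexing.
      import xml.etree.ElementTree as ET

      def number_nodes(root):
          labels, counter = {}, [0]
          def visit(node):
              pre = counter[0]; counter[0] += 1
              for child in node:
                  visit(child)
              labels[node] = (pre, counter[0]); counter[0] += 1
          visit(root)
          return labels

      root = ET.fromstring("<a><b><c/></b><d/></a>")
      labels = number_nodes(root)
      # X is an ancestor of Y iff pre(X) < pre(Y) and post(X) > post(Y).
      for node, (pre, post) in labels.items():
          print(node.tag, pre, post)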

  14. Contextual Text Mining

    Science.gov (United States)

    Mei, Qiaozhu

    2009-01-01

    With the dramatic growth of text information, there is an increasing need for powerful text mining systems that can automatically discover useful knowledge from text. Text is generally associated with all kinds of contextual information. Those contexts can be explicit, such as the time and the location where a blog article is written, and the…

  15. Automatic Multi Document Summarization Approaches

    Directory of Open Access Journals (Sweden)

    Yogan J. Kumar

    2012-01-01

    Full Text Available Problem statement: Text summarization can be of different natures, ranging from an indicative summary that identifies the topics of the document to an informative summary which is meant to represent a concise description of the original document, providing an idea of what the whole content of the document is all about. Approach: A single-document summary seems to capture both kinds of information well, but this has not been the case for multi-document summaries, where the overall comprehensive quality of the informative summary is often lacking. It is found that most of the existing methods tend to focus on sentence scoring, and less consideration is given to the contextual information content in multiple documents. Results: In this study, a survey of multi-document summarization approaches is presented. We direct our focus notably on four well-known approaches to multi-document summarization, namely the feature-based method, the cluster-based method, the graph-based method and the knowledge-based method. The general ideas behind these methods are described. Conclusion: Besides the general idea and concept, we discuss the benefits and limitations of these methods. With the aim of enhancing multi-document summarization, specifically of news documents, a novel type of approach is outlined to be developed in the future, taking into account the generic components of a news story in order to generate a better summary.

  16. Automatic text summarization

    CERN Document Server

    Torres Moreno, Juan Manuel

    2014-01-01

    This new textbook examines the motivations and the different algorithms for automatic document summarization (ADS). It provides a recent state-of-the-art review. The book shows the main problems of ADS, the difficulties involved, and the solutions provided by the community. It presents recent advances in ADS, as well as current applications and trends. The approaches are statistical, linguistic and symbolic. Several examples are included in order to clarify the theoretical concepts. The books currently available in the area of Automatic Document Summarization are not recent. Powerful algorithms have been develop

  17. Academic Journal Embargoes and Full Text Databases.

    Science.gov (United States)

    Brooks, Sam

    2003-01-01

    Documents the reasons for embargoes of academic journals in full text databases (i.e., publisher-imposed delays on the availability of full text content) and provides insight regarding common misconceptions. Tables present data on selected journals covering a cross-section of subjects and publishers and comparing two full text business databases.…

  18. Scalable Text Mining with Sparse Generative Models

    OpenAIRE

    Puurula, Antti

    2016-01-01

    The information age has brought a deluge of data. Much of this is in text form, insurmountable in scope for humans and incomprehensible in structure for computers. Text mining is an expanding field of research that seeks to utilize the information contained in vast document collections. General data mining methods based on machine learning face challenges with the scale of text data, posing a need for scalable text mining methods. This thesis proposes a solution to scalable text mining: gener...

  19. La Documentation photographique

    Directory of Open Access Journals (Sweden)

    Magali Hamm

    2009-03-01

    Full Text Available The Documentation photographique, a magazine dedicated to teachers and students in History and Geography, places the image at the heart of its editorial line. In order to follow the current evolutions of Geography, the collection presents an increasingly diversified iconography: maps, photographs, but also caricatures, newspaper front pages or advertisements, all of these being considered as geographical documents in their own right. An image can be a synthesis; on the contrary, it can show the different facets of a single object; often it makes it possible to portray geographical phenomena. Combined with other documents, images help teachers initiate their students into complex geographical reasoning. But in order to learn how to read them, it is fundamental to contextualize them, comment on them and question their relation to reality.

  20. Text Coherence in Translation

    Science.gov (United States)

    Zheng, Yanping

    2009-01-01

    In the thesis a coherent text is defined as a continuity of senses of the outcome of combining concepts and relations into a network composed of knowledge space centered around main topics. And the author maintains that in order to obtain the coherence of a target language text from a source text during the process of translation, a translator can…

  1. Unstructured Documents Categorization: A Study

    Directory of Open Access Journals (Sweden)

    Debnath Bhattacharyya

    2008-12-01

    Full Text Available The main purpose of communication is to transfer information from one corner of the world to another. The information is basically stored in the form of documents or files created on the basis of requirements. So, the randomness of creation and storage makes them unstructured in nature. As a consequence, data retrieval and modification become a hard nut to crack. Data that are required frequently should maintain a certain pattern. Otherwise, problems like retrieving erroneous data, anomalies in modification, or time consumption in the retrieval process may arise. As every problem has its own solution, these unstructured documents have also been given a solution, named unstructured document categorization. That means the collected unstructured documents will be categorized based on some given constraints. This paper is a review which deals with different techniques appearing in the literature, like text and data mining, genetic algorithms, lexical chaining and binarization methods, to achieve the desired unstructured document categorization.

  2. Multilingual Topic Models for Unaligned Text

    CERN Document Server

    Boyd-Graber, Jordan

    2012-01-01

    We develop the multilingual topic model for unaligned text (MuTo), a probabilistic model of text that is designed to analyze corpora composed of documents in two languages. From these documents, MuTo uses stochastic EM to simultaneously discover both a matching between the languages and multilingual latent topics. We demonstrate that MuTo is able to find shared topics on real-world multilingual corpora, successfully pairing related documents across languages. MuTo provides a new framework for creating multilingual topic models without needing carefully curated parallel corpora and allows applications built using the topic model formalism to be applied to a much wider class of corpora.

  3. Vocabulary Constraint on Texts

    Directory of Open Access Journals (Sweden)

    C. Sutarsyah

    2008-01-01

    Full Text Available This case study was carried out in the English Education Department of the State University of Malang. The aim of the study was to identify and describe the vocabulary in the reading texts and to assess whether the texts are useful for reading skill development. A descriptive qualitative design was applied to obtain the data. For this purpose, some available computer programs were used to find the description of vocabulary in the texts. It was found that the 20 texts containing 7,945 words are dominated by low frequency words, which account for 16.97% of the words in the texts. The high frequency words occurring in the texts were dominated by function words. In the case of word levels, it was found that the texts have a very limited number of words from the GSL (General Service List of English Words (West, 1953)). The proportion of the first 1,000 words of the GSL only accounts for 44.6%. The data also show that the texts contain too large a proportion of words which are not in the three levels (the first 2,000 and the UWL). These words account for 26.44% of the running words in the texts. It is believed that the constraints are due to the selection of the texts, which are made of a series of short, unrelated texts. This kind of text is subject to the accumulation of low frequency words, especially content words, and a limited number of words from the GSL. It could also defeat the development of students' reading skills and vocabulary enrichment.
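
    The coverage computation described above can be sketched in a few lines; the tokenizer and the idea of loading the first 1,000 GSL headwords from a file are assumptions made for illustration, not the exact programs used in the study.

      # What share of running words in a text falls inside a given word list?
      import re

      def coverage(text, word_list):
          tokens = re.findall(r"[a-z]+", text.lower())
          hits = sum(1 for t in tokens if t in word_list)
          return hits / len(tokens) if tokens else 0.0

      # gsl_1000 would be loaded from a file of the first 1,000 GSL headwords:
      # gsl_1000 = set(open("gsl_1000.txt").read().split())
      # print(f"GSL-1000 coverage: {coverage(open('texts.txt').read(), gsl_1000):.1%}")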

  4. Mining text data

    CERN Document Server

    Aggarwal, Charu C

    2012-01-01

    Text mining applications have experienced tremendous advances because of web 2.0 and social networking applications. Recent advances in hardware and software technology have led to a number of unique scenarios where text mining algorithms are learned. "Mining Text Data" introduces an important niche in the text analytics field, and is an edited volume contributed by leading international researchers and practitioners focused on social networks & data mining. This book contains a wide swath of topics across social networks & data mining. Each chapter contains a comprehensive survey including

  5. Instant Sublime Text starter

    CERN Document Server

    Haughee, Eric

    2013-01-01

    A starter which teaches the basic tasks to be performed with Sublime Text with the necessary practical examples and screenshots. This book requires only basic knowledge of the Internet and basic familiarity with any one of the three major operating systems, Windows, Linux, or Mac OS X. However, as Sublime Text 2 is primarily a text editor for writing software, many of the topics discussed will be specifically relevant to software development. That being said, the Sublime Text 2 Starter is also suitable for someone without a programming background who may be looking to learn one of the tools of

  6. Context Based Word Sense Extraction in Text

    Directory of Open Access Journals (Sweden)

    Ranjeetsingh S.Suryawanshi

    2011-11-01

    Full Text Available In the era of modern e-document technology, everyone uses computerized documents for their own purposes. Due to the huge amount of text documents available in the form of pdf, doc, txt, html, and xml, users may be confused about the sense in which words are used in these documents, since the same word can be interpreted in different senses. Word sense has always been an important problem in information retrieval and extraction, as well as text mining, because machines do not have the intelligence of a human to sense a word in a particular context. Users want to determine which sense of a word is used in a given context. The sense resource is usage-based, and part of it can be created automatically from an electronic dictionary. This paper describes word senses as expressed by WordNet synsets, arranged according to their relevance, with their contexts expressed by means of word association
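
    A small illustration of context-based sense selection over WordNet synsets, using NLTK's simplified Lesk algorithm as a stand-in for the word-association ranking described above; the example sentence is an assumption, and the WordNet corpus must already be available (e.g. via nltk.download).

      # Pick the WordNet synset of an ambiguous word that best fits its context.
      from nltk.corpus import wordnet as wn
      from nltk.wsd import lesk

      sentence = "I deposited the cheque at the bank yesterday".split()
      sense = lesk(sentence, "bank")  # simplified Lesk: overlap with glosses
      print(sense, "-", sense.definition() if sense else "no sense found")

      # A few of the candidate senses listed in WordNet for comparison:
      for s in wn.synsets("bank")[:3]:
          print(s.name(), "-", s.definition())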

  7. Text Classification and Classifiers:A Survey

    Directory of Open Access Journals (Sweden)

    Vandana Korde

    2012-03-01

    Full Text Available As most information (over 80%) is stored as text, text mining is believed to have a high commercial potential value. Knowledge may be discovered from many sources of information; yet, unstructured texts remain the largest readily available source of knowledge. Text classification classifies documents according to predefined categories. In this paper we try to give an introduction to text classification and the process of text classification, as well as an overview of classifiers, and we compare some existing classifiers on the basis of a few criteria like time complexity, principle and performance.

  8. TEXT CLASSIFICATION TOWARD A SCIENTIFIC FORUM

    Institute of Scientific and Technical Information of China (English)

    2007-01-01

    Text mining, also known as discovering knowledge from the text, which has emerged as a possible solution for the current information explosion, refers to the process of extracting non-trivial and useful patterns from unstructured text. Among the general tasks of text mining such as text clustering,summarization, etc, text classification is a subtask of intelligent information processing, which employs unsupervised learning to construct a classifier from training text by which to predict the class of unlabeled text. Because of its simplicity and objectivity in performance evaluation, text classification was usually used as a standard tool to determine the advantage or weakness of a text processing method, such as text representation, text feature selection, etc. In this paper, text classification is carried out to classify the Web documents collected from XSSC Website (http://www. xssc.ac.cn). The performance of support vector machine (SVM) and back propagation neural network (BPNN) is compared on this task. Specifically, binary text classification and multi-class text classification were conducted on the XSSC documents. Moreover, the classification results of both methods are combined to improve the accuracy of classification. An experiment is conducted to show that BPNN can compete with SVM in binary text classification; but for multi-class text classification, SVM performs much better. Furthermore, the classification is improved in both binary and multi-class with the combined method.
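
    A toy sketch of the comparison described above, with scikit-learn's LinearSVC standing in for the SVM and MLPClassifier standing in for the back propagation neural network; the four example documents and their labels are invented for illustration and are unrelated to the XSSC collection.

      # TF-IDF features fed to an SVM and to a small back-propagation network.
      from sklearn.feature_extraction.text import TfidfVectorizer
      from sklearn.svm import LinearSVC
      from sklearn.neural_network import MLPClassifier
      from sklearn.pipeline import make_pipeline

      docs = ["stock markets fell sharply", "the striker scored a late goal",
              "quarterly earnings beat forecasts", "the team won the championship"]
      labels = ["finance", "sport", "finance", "sport"]

      for clf in (LinearSVC(), MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000)):
          model = make_pipeline(TfidfVectorizer(), clf)
          model.fit(docs, labels)
          print(type(clf).__name__, model.predict(["goal scored in the final minute"]))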

  9. Automatic text categorisation of racist webpages

    OpenAIRE

    Greevy, Edel

    2004-01-01

    Automatic Text Categorisation (TC) involves the assignment of one or more predefined categories to text documents in order that they can be effectively managed. In this thesis we examine the possibility of applying automatic text categorisation to the problem of categorising texts (web pages) based on whether or not they are racist. TC has proven successful for topic-based problems such as news story categorisation. However, the problem of detecting racism is dissimilar to topic-based pro...

  10. GPU-Accelerated Text Mining

    Energy Technology Data Exchange (ETDEWEB)

    Cui, Xiaohui [ORNL; Mueller, Frank [North Carolina State University; Zhang, Yongpeng [ORNL; Potok, Thomas E [ORNL

    2009-01-01

    Accelerating hardware devices represent a novel promise for improving the performance for many problem domains but it is not clear for which domains what accelerators are suitable. While there is no room in general-purpose processor design to significantly increase the processor frequency, developers are instead resorting to multi-core chips duplicating conventional computing capabilities on a single die. Yet, accelerators offer more radical designs with a much higher level of parallelism and novel programming environments. This present work assesses the viability of text mining on CUDA. Text mining is one of the key concepts that has become prominent as an effective means to index the Internet, but its applications range beyond this scope and extend to providing document similarity metrics, the subject of this work. We have developed and optimized text search algorithms for GPUs to exploit their potential for massive data processing. We discuss the algorithmic challenges of parallelization for text search problems on GPUs and demonstrate the potential of these devices in experiments by reporting significant speedups. Our study may be one of the first to assess more complex text search problems for suitability for GPU devices, and it may also be one of the first to exploit and report on atomic instruction usage that have recently become available in NVIDIA devices.
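
    The document-similarity metric at the core of this work can be sketched on the CPU as pairwise cosine similarity over TF-IDF vectors; the CUDA kernels and the actual corpora are not reproduced here, and the three example documents are assumptions.

      # Pairwise cosine similarity between documents (CPU stand-in for the GPU version).
      import numpy as np
      from sklearn.feature_extraction.text import TfidfVectorizer

      docs = ["gpu accelerated text mining", "text mining on graphics processors",
              "indexing the internet with document similarity"]
      X = TfidfVectorizer().fit_transform(docs).toarray()
      X = X / np.linalg.norm(X, axis=1, keepdims=True)  # unit-length rows
      similarity = X @ X.T                               # dense similarity matrix
      print(np.round(similarity, 2))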

  11. GPU-Accelerated Text Mining

    International Nuclear Information System (INIS)

    Accelerating hardware devices represent a novel promise for improving the performance for many problem domains but it is not clear for which domains what accelerators are suitable. While there is no room in general-purpose processor design to significantly increase the processor frequency, developers are instead resorting to multi-core chips duplicating conventional computing capabilities on a single die. Yet, accelerators offer more radical designs with a much higher level of parallelism and novel programming environments. This present work assesses the viability of text mining on CUDA. Text mining is one of the key concepts that has become prominent as an effective means to index the Internet, but its applications range beyond this scope and extend to providing document similarity metrics, the subject of this work. We have developed and optimized text search algorithms for GPUs to exploit their potential for massive data processing. We discuss the algorithmic challenges of parallelization for text search problems on GPUs and demonstrate the potential of these devices in experiments by reporting significant speedups. Our study may be one of the first to assess more complex text search problems for suitability for GPU devices, and it may also be one of the first to exploit and report on atomic instruction usage that have recently become available in NVIDIA devices

  12. Linguistics in Text Interpretation

    DEFF Research Database (Denmark)

    Togeby, Ole

    2011-01-01

    A model for how text interpretation proceeds from what is pronounced, through what is said, to what is communicated, and a definition of the concepts 'presupposition' and 'implicature'.

  13. Systematic text condensation

    DEFF Research Database (Denmark)

    Malterud, Kirsti

    2012-01-01

    To present background, principles, and procedures for a strategy for qualitative analysis called systematic text condensation and discuss this approach compared with related strategies.

  14. Text Categorization with Latent Dirichlet Allocation

    Directory of Open Access Journals (Sweden)

    ZLACKÝ Daniel

    2014-05-01

    Full Text Available This paper focuses on the text categorization of Slovak text corpora using latent Dirichlet allocation. Our goal is to build text subcorpora that contain similar text documents. We want to use these better organized text subcorpora to build more robust language models that can be used in the area of speech recognition systems. Our previous research in the area of text categorization showed that we can achieve better results with categorized text corpora. In this paper we used latent Dirichlet allocation for text categorization. We divided the initial text corpus into 2, 5, 10, 20 or 100 subcorpora with various iterations and save steps. Language models were built on these subcorpora and adapted with linear interpolation to the judicial domain. The experiment results showed that text categorization using latent Dirichlet allocation can improve the system for automatic speech recognition by creating the language models from organized text corpora.
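
    A minimal sketch of splitting a corpus into topic-based subcorpora with latent Dirichlet allocation, shown here with gensim on an invented toy corpus; the Slovak data, the iteration/save-step settings and the language-model adaptation step from the paper are not reproduced.

      # Assign each document to its dominant LDA topic, i.e. its subcorpus.
      from gensim import corpora, models

      texts = [["court", "judge", "verdict"], ["match", "goal", "referee"],
               ["appeal", "sentence", "court"], ["team", "season", "goal"]]
      dictionary = corpora.Dictionary(texts)
      bows = [dictionary.doc2bow(t) for t in texts]
      lda = models.LdaModel(bows, num_topics=2, id2word=dictionary,
                            passes=20, random_state=0)

      for tokens, bow in zip(texts, bows):
          topic = max(lda.get_document_topics(bow), key=lambda p: p[1])[0]
          print(topic, tokens)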

  15. Extracting Text from Video

    Directory of Open Access Journals (Sweden)

    Jayshree Ghorpade

    2011-09-01

    Full Text Available The text data present in images and video contain certain useful information for automatic annotation, indexing, and structuring of images. However, variations of the text due to differences in text style, font, size, orientation and alignment, as well as low image contrast and complex background, make the problem of automatic text extraction an extremely difficult and challenging job. A large number of techniques have been proposed to address this problem, and the purpose of this paper is to design algorithms for each phase of extracting text from a video using Java libraries and classes. Here we first frame the input video into a stream of images using the Java Media Framework (JMF), with the input being real time or a video from the database. Then we apply preprocessing algorithms to convert the image to gray scale and remove disturbances like superimposed lines over the text, discontinuity removal, and dot removal. Then we continue with the algorithms for localization, segmentation and recognition, for which we use the neural network pattern matching technique. The performance of our approach is demonstrated by presenting experimental results for a set of static images.

  16. EXTRACTING TEXT FROM VIDEO

    Directory of Open Access Journals (Sweden)

    Jayshree Ghorpade

    2011-06-01

    Full Text Available The text data present in images and video contain certain useful information for automatic annotation, indexing, and structuring of images. However, variations of the text due to differences in text style, font, size, orientation and alignment, as well as low image contrast and complex background, make the problem of automatic text extraction an extremely difficult and challenging job. A large number of techniques have been proposed to address this problem, and the purpose of this paper is to design algorithms for each phase of extracting text from a video using Java libraries and classes. Here we first frame the input video into a stream of images using the Java Media Framework (JMF), with the input being real time or a video from the database. Then we apply preprocessing algorithms to convert the image to gray scale and remove disturbances like superimposed lines over the text, discontinuity removal, and dot removal. Then we continue with the algorithms for localization, segmentation and recognition, for which we use the neural network pattern matching technique. The performance of our approach is demonstrated by presenting experimental results for a set of static images.

  17. Texts of Television Advertisements

    OpenAIRE

    Michalewski, Kazimierz

    1995-01-01

    Short advertisement films occupy a large part (especially around the peak viewing hours) of the everyday programmes of Polish state television. Even though it is possible to imagine an advertisement film employing only extralinguistic means of communication, the advertisements in general have so far been using written and spoken texts. The basic function of such a text and of the whole film is to encourage the viewers to buy the advertised product. However, independently of th...

  18. Machine Translation from Text

    Science.gov (United States)

    Habash, Nizar; Olive, Joseph; Christianson, Caitlin; McCary, John

    Machine translation (MT) from text, the topic of this chapter, is perhaps the heart of the GALE project. Beyond being a well defined application that stands on its own, MT from text is the link between the automatic speech recognition component and the distillation component. The focus of MT in GALE is on translating from Arabic or Chinese to English. The three languages represent a wide range of linguistic diversity and make the GALE MT task rather challenging and exciting.

  19. Text simplification for children

    OpenAIRE

    De Belder, Jan; Moens, Marie-Francine

    2010-01-01

    The goal in this paper is to automatically transform text into a simpler text, so that it is easier to understand by children. We perform syntactic simplification, i.e. the splitting of sentences, and lexical simplification, i.e. replacing difficult words with easier synonyms. We test the performance of this approach for each component separately on a per sentence basis, and globally with the automatic construction of simplified news articles and encyclopedia articles. By including informatio...

  20. About CABI Full Text

    Institute of Scientific and Technical Information of China (English)

    2014-01-01

    Centre for Agriculture and Bioscience International (CABI) is a not-for-profit international Agricultural Information Institute with headquarters in Britain. It aims to improve people’s lives by providing information and applying scientific expertise to solve problems in agriculture and the environment. CABI Full-text is one of the publishing products of CABI. CABI’s full text repository is growing rapidly

  1. Arabic Text Classification Using Support Vector Machines

    NARCIS (Netherlands)

    Gharib, Tarek Fouad; Habib, Mena Badieh; Fayed, Zaki Taha; Zhu, Qiang

    2009-01-01

    Text classification (TC) is the process of classifying documents into a predefined set of categories based on their content. Arabic language is highly inflectional and derivational language which makes text mining a complex task. In this paper we applied the Support Vector Machines (SVM) model in cl

  2. Documentation of Cultural Heritage Objects

    Directory of Open Access Journals (Sweden)

    Jon Grobovšek

    2013-09-01

    Full Text Available EXTENDED ABSTRACT:The first and important phase of documentation of cultural heritage objects is to understand which objects need to be documented. The entire documentation process is determined by the characteristics and scope of the cultural heritage object. The next question to be considered is the expected outcome of the documentation process and the purpose for which it will be used. These two essential guidelines determine each stage of the documentation workflow: the choice of the most appropriate data capturing technology and data processing method, how detailed should the documentation be, what problems may occur, what the expected outcome is, what it will be used for, and the plan for storing data and results. Cultural heritage objects require diverse data capturing and data processing methods. It is important that even the first stages of raw data capturing are oriented towards the applicability of results. The selection of the appropriate working method can facilitate the data processing and the preparation of final documentation. Documentation of paintings requires different data capturing method than documentation of buildings or building areas. The purpose of documentation can also be the preservation of the contemporary cultural heritage to posterity or the basis for future projects and activities on threatened objects. Documentation procedures should be adapted to our needs and capabilities. Captured and unprocessed data are lost unless accompanied by additional analyses and interpretations. Information on tools, procedures and outcomes must be included into documentation. A thorough analysis of unprocessed but accessible documentation, if adequately stored and accompanied by additional information, enables us to gather useful data. In this way it is possible to upgrade the existing documentation and to avoid data duplication or unintentional misleading of users. The documentation should be archived safely and in a way to meet

  3. A new graph based text segmentation using Wikipedia for automatic text summarization

    Directory of Open Access Journals (Sweden)

    Mohsen Pourvali

    2012-01-01

    Full Text Available The technology of automatic document summarization is maturing and may provide a solution to the information overload problem. Nowadays, document summarization plays an important role in information retrieval. With a large volume of documents, presenting the user with a summary of each document greatly facilitates the task of finding the desired documents. Document summarization is a process of automatically creating a compressed version of a given document that provides useful information to users, and multi-document summarization aims to produce a summary delivering the majority of the information content from a set of documents about an explicit or implicit main topic. According to the input text, in this paper we use the knowledge base of Wikipedia and the words of the main text to create independent graphs. We then determine the importance of the graphs and identify the sentences whose topics have high importance. Finally, we extract the sentences with high importance. The experimental results on open benchmark datasets from DUC01 and DUC02 show that our proposed approach can improve the performance compared to state-of-the-art summarization approaches.

  4. About CABI Full Text

    Institute of Scientific and Technical Information of China (English)

    2013-01-01

    Centre for Agriculture and Bioscience International (CABI) is a not-for-profit international Agricultural Information Institute with headquarters in Britain. It aims to improve people’s lives by providing information and applying scientific expertise to solve problems in agriculture and the environment. CABI Full-text is one of the publishing products of CABI. CABI’s full text repository is growing rapidly and has now been integrated into all our databases including CAB Abstracts, Global Health

  5. About CABI Full Text

    Institute of Scientific and Technical Information of China (English)

    2013-01-01

    Centre for Agriculture and Bioscience International (CABI) is a not-for-profit international Agricultural Information Institute with headquarters in Britain. It aims to improve people’s lives by providing information and applying scientific expertise to solve problems in agriculture and the environment. CABI Full-text is one of the publishing products of CABI. CABI’s full text repository is growing rapidly and has now been integrated into all our databases including CAB Abstracts, Global Health, our Internet Resources and Jour-

  6. Automatic Induction of Rule Based Text Categorization

    OpenAIRE

    D.Maghesh Kumar

    2010-01-01

    The automated categorization of texts into predefined categories has witnessed a booming interest in the last 10 years, due to the increased availability of documents in digital form and the ensuing need to organize them. In the research community the dominant approach to this problem is based on machine learning techniques: a general inductive process automatically builds a classifier by learning, from a set of preclassified documents, the characteristics of the categories. This paper describ

  7. Machine Learning in Automated Text Categorization

    OpenAIRE

    Sebastiani, Fabrizio

    2001-01-01

    The automated categorization (or classification) of texts into predefined categories has witnessed a booming interest in the last ten years, due to the increased availability of documents in digital form and the ensuing need to organize them. In the research community the dominant approach to this problem is based on machine learning techniques: a general inductive process automatically builds a classifier by learning, from a set of preclassified documents, the characteristics of the categori...

  8. Working with Documents in Databases

    Directory of Open Access Journals (Sweden)

    Marian DARDALA

    2008-01-01

    Full Text Available The use of electronic documents on an ever larger scale within organizations and public institutions requires their storage and unitary exploitation by means of databases. The purpose of this article is to present the way of loading, exploiting and visualizing documents in a database, taking as an example the DBMS MS SQL Server. On the other hand, the modules for loading the documents into the database and for their visualization are presented through code sequences written in C#. The interoperability between environments is achieved by means of the ADO.NET database access technology.

  9. Modified Approach to Transform Arc From Text to Linear Form Text : A Preprocessing Stage for OCR

    Directory of Open Access Journals (Sweden)

    Vijayashree C S

    2014-08-01

    Full Text Available Arc-form-text is an artistic text which is quite common in several documents such as certificates, advertisements and history documents. OCRs fail to read such arc-form-text, and it is necessary to transform it to linear-form-text at the preprocessing stage. In this paper, we present a modification to an existing transformation model for better readability by OCRs. The method takes the segmented arc-form-text as input. Initially, two concentric ellipses are approximated to enclose the arc-form-text, and then the modified transformation model transforms the text from arc form to linear form. The proposed method is implemented on several upper semi-circular arc-form-text inputs and the readability of the transformed text is analyzed with an OCR
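
    A hedged sketch of the unwrapping step: text lying between two concentric ellipses is sampled along radial directions into a straight strip. The function, its parameters and the sampling strategy are simplifications assumed for illustration and do not reproduce the modified transformation model of the paper.

      # Unwrap an upper semi-circular arc of text into a rectangular strip.
      import numpy as np

      def unwrap_arc(image, cx, cy, a_in, b_in, a_out, b_out, width=720, height=60):
          out = np.zeros((height, width), dtype=image.dtype)
          thetas = np.linspace(np.pi, 0, width)      # sweep the upper arc left to right
          for j, t in enumerate(thetas):
              for i in range(height):
                  f = i / (height - 1)               # blend inner ellipse -> outer ellipse
                  a = a_in + f * (a_out - a_in)
                  b = b_in + f * (b_out - b_in)
                  x, y = int(cx + a * np.cos(t)), int(cy - b * np.sin(t))
                  if 0 <= y < image.shape[0] and 0 <= x < image.shape[1]:
                      out[height - 1 - i, j] = image[y, x]
          return out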

  10. Document Clustering based on Topic Maps

    CERN Document Server

    Rafi, Muhammad; Farooq, Amir; 10.5120/1640-2204

    2011-01-01

    Importance of document clustering is now widely acknowledged by researchers for better management, smart navigation, efficient filtering, and concise summarization of large collection of documents like World Wide Web (WWW). The next challenge lies in semantically performing clustering based on the semantic contents of the document. The problem of document clustering has two main components: (1) to represent the document in such a form that inherently captures semantics of the text. This may also help to reduce dimensionality of the document, and (2) to define a similarity measure based on the semantic representation such that it assigns higher numerical values to document pairs which have higher semantic relationship. Feature space of the documents can be very challenging for document clustering. A document may contain multiple topics, it may contain a large set of class-independent general-words, and a handful class-specific core-words. With these features in mind, traditional agglomerative clustering algori...

  11. Reading Authorship into Texts.

    Science.gov (United States)

    Werner, Walter

    2000-01-01

    Provides eight concepts, with illustrative questions for interpreting the authorship of texts, that are borrowed from cultural studies literature: (1) representation; (2) the gaze; (3) voice; (4) intertextuality; (5) absence; (6) authority; (7) mediation; and (8) reflexivity. States that examples were taken from British Columbia's (Canada) social…

  12. Reading Authentic Texts

    DEFF Research Database (Denmark)

    Balling, Laura Winther

    2013-01-01

    Most research on cognates has focused on words presented in isolation that are easily defined as cognate between L1 and L2. In contrast, this study investigates what counts as cognate in authentic texts and how such cognates are read. Participants with L1 Danish read news articles in their highly...

  13. Texts in the landscape

    Directory of Open Access Journals (Sweden)

    James Graham-Campbell

    1998-11-01

    Full Text Available The Institute's members of UCL's "Celtic Inscribed Stones" project describe, in collaboration with Wendy Davies, Mark Handley and Paul Kershaw (Department of History), a major interdisciplinary study of inscriptions of the early middle ages from the Celtic areas of northwest Europe.

  14. Polymorphous Perversity in Texts

    Science.gov (United States)

    Johnson-Eilola, Johndan

    2012-01-01

    Here's the tricky part: If we teach ourselves and our students that texts are made to be broken apart, remixed, remade, do we lose the polymorphous perversity that brought us pleasure in the first place? Does the pleasure of transgression evaporate when the borders are opened?

  15. Text Induced Spelling Correction

    NARCIS (Netherlands)

    Reynaert, M.W.C.

    2004-01-01

    We present TISC, a language-independent and context-sensitive spelling checking and correction system designed to facilitate the automatic removal of non-word spelling errors in large corpora. Its lexicon is derived from a very large corpus of raw text, without supervision, and contains word unigram

  16. About CABI Full Text

    Institute of Scientific and Technical Information of China (English)

    2013-01-01

    Centre for Agriculture and Bioscience International (CABI) is a not-for-profit international Agricultural Information Institute with headquarters in Britain. It aims to improve people’s lives by providing information and applying scientific expertise to solve problems in agriculture and the environment. CABI Full-text is one of the publishing products of CABI.

  17. About CABI Full Text

    Institute of Scientific and Technical Information of China (English)

    2013-01-01

    Centre for Agriculture and Bioscience International (CABI) is a not-for-profit international Agricultural Information Institute with headquarters in Britain. It aims to improve people’s lives by providing information and applying scientific expertise to solve problems in agriculture and the environment. CABI Full-text is one of the publishing products of CABI.

  18. About CABI Full Text

    Institute of Scientific and Technical Information of China (English)

    2011-01-01

    Centre for Agriculture and Bioscience International (CABI) is a not-for-profit international Agricultural Information Institute with headquarters in Britain. It aims to improve people’s lives by providing information and applying scientific expertise to solve problems in agriculture and the environment. CABI Full-text is one of the publishing products of CABI.

  19. Text as Image.

    Science.gov (United States)

    Woal, Michael; Corn, Marcia Lynn

    As electronically mediated communication becomes more prevalent, print is regaining the original pictorial qualities which graphemes (written signs) lost when primitive pictographs (or picture writing) and ideographs (simplified graphemes used to communicate ideas as well as to represent objects) evolved into first written, then printed, texts of…

  20. Generic safety documentation model

    International Nuclear Information System (INIS)

    This document is intended to be a resource for preparers of safety documentation for Sandia National Laboratories, New Mexico facilities. It provides standardized discussions of some topics that are generic to most, if not all, Sandia/NM facilities safety documents. The material provides a "core" upon which to develop facility-specific safety documentation. The use of the information in this document will reduce the cost of safety document preparation and improve consistency of information.

  1. Generic safety documentation model

    Energy Technology Data Exchange (ETDEWEB)

    Mahn, J.A.

    1994-04-01

    This document is intended to be a resource for preparers of safety documentation for Sandia National Laboratories, New Mexico facilities. It provides standardized discussions of some topics that are generic to most, if not all, Sandia/NM facilities safety documents. The material provides a "core" upon which to develop facility-specific safety documentation. The use of the information in this document will reduce the cost of safety document preparation and improve consistency of information.

  2. How Much Handwritten Text Is Needed for Text-Independent Writer Verification and Identification

    NARCIS (Netherlands)

    Brink, Axel; Bulacu, Marius; Schomaker, Lambert

    2008-01-01

    The performance of off-line text-independent writer verification and identification increases when the documents contain more text. This relation was examined by repeatedly conducting writer verification and identification performance tests while gradually increasing the amount of text on the pages.

  3. Data Security by Preprocessing the Text with Secret Hiding

    Directory of Open Access Journals (Sweden)

    Ajit Singh

    2012-06-01

    Full Text Available With the advent of the Internet, an open forum, the massive increase in data travelling across networks makes secure transmission an issue. Cryptography is the term that covers the many encryption methods used to make data secure. But the transmission of the secure data is an intricate task. Steganography here comes with the effect of transmission without revealing the secure data. The research paper provides a mechanism which enhances the security of data by using a crypto + stegano combination to increase the security level without revealing the fact that some secret data is being shared across networks. In the first phase, data is encrypted by manipulating the text using ASCII codes and some randomly generated strings for the codes, derived from a few parameters. Steganography related to cryptography forms the basis for many data hiding techniques. The data is encrypted using the proposed approach and the message is then hidden in N random images with the help of a perfect hashing scheme, which increases the security of the message before it is sent across the medium. Thus the sending and receiving of the message will be safe and secure with increased confidentiality.
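
    The ASCII-manipulation idea in the first phase can be illustrated with a simple key-driven shift of each character's code; the offsets and key handling below are assumptions made for this sketch, and the random strings and the image-hiding phase of the paper are not reproduced.

      # Shift each character's ASCII code by a key-derived offset (toy cipher).
      def ascii_shift(text, key, decrypt=False):
          out = []
          for i, ch in enumerate(text):
              offset = (ord(key[i % len(key)]) % 26) * (-1 if decrypt else 1)
              out.append(chr((ord(ch) + offset) % 256))
          return "".join(out)

      cipher = ascii_shift("meet at dawn", "secret")
      print(repr(cipher), ascii_shift(cipher, "secret", decrypt=True))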

  4. Toponym Resolution in Text

    OpenAIRE

    Leidner, Jochen Lothar

    2007-01-01

    Background. In the area of Geographic Information Systems (GIS), a shared discipline between informatics and geography, the term geo-parsing is used to describe the process of identifying names in text, which in computational linguistics is known as named entity recognition and classification (NERC). The term geo-coding is used for the task of mapping from implicitly geo-referenced datasets (such as structured address records) to explicitly geo-referenced representations (e.g.,...

  5. Reading Text While Driving

    OpenAIRE

    Liang, Yulan; Horrey, William J.; Hoffman, Joshua D.

    2015-01-01

    Objective In this study, we investigated how drivers adapt secondary-task initiation and time-sharing behavior when faced with fluctuating driving demands. Background Reading text while driving is particularly detrimental; however, in real-world driving, drivers actively decide when to perform the task. Method In a test track experiment, participants were free to decide when to read messages while driving along a straight road consisting of an area with increased driving demands (demand zone)...

  6. Urdu Text Classification using Majority Voting

    Directory of Open Access Journals (Sweden)

    Muhammad Usman

    2016-08-01

    Full Text Available Text classification is a tool to assign predefined categories to text documents using supervised machine learning algorithms. It has various practical applications like spam detection, sentiment detection, and detection of a natural language. Based on this idea, we applied five well-known classification techniques to an Urdu language corpus and assigned a class to the documents using majority voting. The corpus contains 21769 news documents of seven categories (Business, Entertainment, Culture, Health, Sports, and Weird). The algorithms were not able to work directly on the data, so we applied preprocessing techniques like tokenization, stop words removal and a rule-based stemmer. After preprocessing, 93400 features are extracted from the data to apply machine learning algorithms. Furthermore, we achieved up to 94% precision and recall using majority voting.
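
    The majority-voting step itself is straightforward; the sketch below picks, for one document, the label predicted most often across the individual classifiers. The classifier names and their predictions are invented for illustration.

      # Majority vote over several classifiers' predictions for one document.
      from collections import Counter

      predictions = {  # classifier name -> predicted label
          "NaiveBayes": "Sports", "SVM": "Sports", "kNN": "Business",
          "DecisionTree": "Sports", "LogReg": "Business",
      }
      label, votes = Counter(predictions.values()).most_common(1)[0]
      print(f"majority label: {label} ({votes}/{len(predictions)} votes)")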

  7. Modeling statistical properties of written text.

    Directory of Open Access Journals (Sweden)

    M Angeles Serrano

    Full Text Available Written text is one of the fundamental manifestations of human language, and the study of its universal regularities can give clues about how our brains process information and how we, as a society, organize and share it. Among these regularities, only Zipf's law has been explored in depth. Other basic properties, such as the existence of bursts of rare words in specific documents, have only been studied independently of each other and mainly by descriptive models. As a consequence, there is a lack of understanding of linguistic processes as complex emergent phenomena. Beyond Zipf's law for word frequencies, here we focus on burstiness, Heaps' law describing the sublinear growth of vocabulary size with the length of a document, and the topicality of document collections, which encode correlations within and across documents absent in random null models. We introduce and validate a generative model that explains the simultaneous emergence of all these patterns from simple rules. As a result, we find a connection between the bursty nature of rare words and the topical organization of texts and identify dynamic word ranking and memory across documents as key mechanisms explaining the non trivial organization of written text. Our research can have broad implications and practical applications in computer science, cognitive science and linguistics.
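
    Two of the regularities discussed above, the Zipf rank-frequency relation and the Heaps vocabulary-growth curve, can be computed from any plain text with a short routine such as the following sketch (the tokenizer and output format are assumptions).

      # Empirical Zipf and Heaps curves for a piece of text.
      import re
      from collections import Counter

      def zipf_and_heaps(text):
          tokens = re.findall(r"\w+", text.lower())
          freq = Counter(tokens)
          # Zipf: (rank, frequency) pairs, most frequent word first.
          zipf = [(rank + 1, count) for rank, (_, count) in enumerate(freq.most_common())]
          # Heaps: (text length so far, vocabulary size so far).
          heaps, vocab = [], set()
          for n, tok in enumerate(tokens, start=1):
              vocab.add(tok)
              heaps.append((n, len(vocab)))
          return zipf, heaps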

  8. Text and Music Revisited

    OpenAIRE

    Fornäs, Johan

    1997-01-01

    Are words and music two separate symbolic modes, or rather variants of the same human symbolic practice? Are they parallel, opposing or overlapping? What do they have in common and how does each of them exceed the other? Is music perhaps incomparably different from words, or even their anti-verbal Other? Distinctions between text (in the verbal sense of units of words rather than in the wide sense of symbolic webs in general) and music are regularly made – but also problematized – withi...

  9. Events and Trends in Text Streams

    Energy Technology Data Exchange (ETDEWEB)

    Engel, David W.; Whitney, Paul D.; Cramer, Nicholas O.

    2010-03-04

    "Text streams--collections of documents or messages that are generated and observed over time--are ubiquitous. Our research and development are targeted at developing algorithms to find and characterize changes in topic within text streams. To date, this research has demonstrated the ability to detect and describe 1) short duration, atypical events and 2) the emergence of longer-term shifts in topical content. This technology has been applied to predefined temporally ordered document collections but is also suitable for application to near-real-time textual data streams."

  10. Weitere Texte physiognomischen Inhalts

    Directory of Open Access Journals (Sweden)

    Böck, Barbara

    2004-12-01

    Full Text Available The present article offers the edition of three cuneiform texts belonging to the Akkadian handbook of omens drawn from the physical appearance as well as the morals and behaviour of man. The book comprising up to 27 chapters with more than 100 omens each was entitled in antiquity Alamdimmû. The edition of the three cuneiform tablets completes, thus, the author's monographic study on the ancient Mesopotamian divinatory discipline of physiognomy (Die babylonisch-assyrische Morphoskopie (Wien 2000 [=AfO Beih. 27].

    This article presents the editio princeps of three cuneiform texts preserved in the British Museum (London) and the Vorderasiatisches Museum (Berlin) that belong to the Assyro-Babylonian book of physiognomic omens. This book, originally entitled Alamdimmû ('form, figure'), consists of 27 chapters, each with more than one hundred omens written in the Akkadian language. The three texts thus complete the author's monographic study of the divinatory discipline of physiognomy in the ancient Near East (Die babylonisch-assyrische Morphoskopie (Wien 2000 [=AfO Beih. 27]).

  11. Texts of presentation

    Energy Technology Data Exchange (ETDEWEB)

    Magnin, G.; Vidolov, K.; Dufour-Fallot, B.; Dewarrat, Th.; Rose, T.; Favatier, A.; Gazeley, D.; Pujol, T.; Worner, D.; Van de Wel, E.; Revaz, J.M.; Clerfayt, G.; Creedy, A.; Moisan, F.; Geissler, M.; Isbell, P.; Macaluso, M.; Litzka, V.; Gillis, W.; Jarvis, I.; Gorg, M.; Bebie, B.

    2004-07-01

    Implementing a sustainable local energy policy involves a long term reflection on the general interest, energy efficiency, distributed generation and environmental protection. Providing services on a market involves looking for activities that are profitable, if possible in the 'short-term'. The aim of this conference is to analyse the possibility of reconciling these apparently contradictory requirements and how this can be achieved. This conference brings together the best specialists from European municipalities as well as important partners for local authorities (energy agencies, service companies, institutions, etc.) in order to discuss the public-private partnerships concerning the various functions that municipalities may perform in the energy field as consumers and customers, planners and organizers of urban space and rousers as regards inhabitants and economic players of their areas. This document contains the summaries of the following presentations: 1 - Performance contracting: Bulgarian municipalities use private capital for energy efficiency improvement (K. VIDOLOV, Varna (BG)), Contracting experiences in Swiss municipalities: consistent energy policy thanks to the Energy-city label (B. DUFOUR-FALLOT and T. DEWARRAT (CH)), Experience of contracting in the domestic sector (T. ROSE (GB)); 2 - Public procurement: Multicolor electricity (A. FAVATIER (CH)), Tendering for new green electricity capacity (D. GAZELEY (GB)), The Barcelona solar thermal ordinance (T. PUJOL (ES)); 3 - Urban planning and schemes: Influencing energy issues through urban planning (D. WOERNER (DE)), Tendering for the supply of energy infrastructure (E. VAN DE WEL (NL)), Concessions and public utility warranty (J.M. REVAZ (CH)); 4 - Certificate schemes: the market of green certificates in Wallonia region in a liberalized power market (G. CLERFAYT (BE)), The Carbon Neutral{sup R} project: a voluntary certification scheme with opportunity for implementation in other European

  12. Weaving with text

    DEFF Research Database (Denmark)

    Hagedorn-Rasmussen, Peter

    This paper explores how a school principal by means of practical authorship creates reservoirs of language that provide a possible context for collective sensemaking. The paper draws upon a field study in which a school principal, and his managerial team, was shadowed in a period of intensive changes. The paper explores how the manager weaves with text, extracted from stakeholders, administration, politicians, employees, public discourse etc., as a means of creating a new fabric, a texture, of diverse perspectives that aims for collective sensemaking.

  13. Audit of Orthopaedic Surgical Documentation

    Directory of Open Access Journals (Sweden)

    Fionn Coughlan

    2015-01-01

    Full Text Available Introduction. The Royal College of Surgeons in England published guidelines in 2008 outlining the information that should be documented at each surgery. St. James’s Hospital uses a standard operation sheet for all surgical procedures and these were examined to assess documentation standards. Objectives. To retrospectively audit the hand written orthopaedic operative notes according to established guidelines. Methods. A total of 63 operation notes over seven months were audited in terms of date and time of surgery, surgeon, procedure, elective or emergency indication, operative diagnosis, incision details, signature, closure details, tourniquet time, postop instructions, complications, prosthesis, and serial numbers. Results. A consultant performed 71.4% of procedures; however, 85.7% of the operative notes were written by the registrar. The date and time of surgery, name of surgeon, procedure name, and signature were documented in all cases. The operative diagnosis and postoperative instructions were frequently not documented in the designated location. Incision details were included in 81.7% and prosthesis details in only 30% while the tourniquet time was not documented in any. Conclusion. Completion and documentation of operative procedures were excellent in some areas; improvement is needed in documenting tourniquet time, prosthesis and incision details, and the location of operative diagnosis and postoperative instructions.

  14. Towards document engineering

    OpenAIRE

    Quint, Vincent; Nanard, M.; André, Jacques

    1990-01-01

    This article compares methods and techniques used in software engineering with the ones used for handling electronic documents. It shows the common features in both domains, but also the differences and it proposes an approach which extends the field of document manipulation to document engineering. It shows also in what respect document engineering is different from software engineering. Therefore specific techniques must be developped for building integrated environments for document engine...

  15. Handwritten Document Editor: An Approach

    Directory of Open Access Journals (Sweden)

    Sumit Nalawade

    2014-05-01

    Full Text Available With advancement in new technologies many individuals are moving towards personalization of the same. The same idea inspired us to develop a system which can provide a personal touch to all our documents, including both electronic and paper media. In this article we propose a novel idea for creating an editor system which takes a handwritten scanned document as the input, recognizes the characters from the document, and then proceeds with creating a font of the recognized handwriting to allow the user to edit the document. We have proposed the use of a genetic algorithm along with a K-NN classifier for fast recognition of handwritten characters, and the use of the marching squares algorithm for tracing contour points of characters to generate a handwritten font.

  16. Handwritten Document Editor: An Approach

    Directory of Open Access Journals (Sweden)

    Sumit Nalawade

    2015-11-01

    Full Text Available With advancement in new technologies many individuals are moving towards personalization of the same. The same idea inspired us to develop a system which can provide a personal touch to all our documents, including both electronic and paper media. In this article we propose a novel idea for creating an editor system which takes a handwritten scanned document as the input, recognizes the characters from the document, and then proceeds with creating a font of the recognized handwriting to allow the user to edit the document. We have proposed the use of a genetic algorithm along with a K-NN classifier for fast recognition of handwritten characters, and the use of the marching squares algorithm for tracing contour points of characters to generate a handwritten font.

  17. HANDWRITTEN TEXT IMAGE AUTHENTICATION USING BACK PROPAGATION

    Directory of Open Access Journals (Sweden)

    A S N Chakravarthy

    2011-10-01

    Full Text Available Authentication is the act of confirming the truth of an attribute of a datum or entity. This might involve confirming the identity of a person, tracing the origins of an artefact, ensuring that a product is what its packaging and labelling claims to be, or assuring that a computer program is a trusted one. The authentication of information can pose special problems (especially man-in-the-middle attacks), and is often wrapped up with authenticating identity. Literary forgery can involve imitating the style of a famous author. If an original manuscript, typewritten text, or recording is available, then the medium itself (or its packaging - anything from a box to e-mail headers) can help prove or disprove the authenticity of the document. The use of digital images of handwritten historical documents has become more popular in recent years. Volunteers around the world now read thousands of these images as part of their indexing process. Handwritten text images of old documents are sometimes difficult to read or noisy due to the preservation of the document and the quality of the image [1]. Handwritten text offers challenges that are rarely encountered in machine-printed text. In addition, most problems faced in reading machine-printed text (e.g., character recognition, word segmentation, letter segmentation, etc.) are more severe in handwritten text. In this paper we propose a method for authenticating handwritten text images using the back propagation algorithm.

  18. La Documentacion Automatica (Automated Documentation).

    Science.gov (United States)

    Levery, Francis

    1971-01-01

    Documentation centers are needed to handle the vast amount of scientific and technical information currently being issued. Such centers should be concerned both with handling inquiries in a particular field and with producing a general catalog of current information. Automatic analysis of texts by computers will be the best way to handle material,…

  19. Contextualizing Data Warehouses with Documents

    DEFF Research Database (Denmark)

    Perez, Juan Manuel; Berlanga, Rafael; Aramburu, Maria Jose;

    2008-01-01

    Current data warehouse and OLAP technologies are applied to analyze the structured data that companies store in databases. The context that helps to understand data over time is usually described separately in text-rich documents. This paper proposes to integrate the traditional corporate data...

  20. Text mining for the biocuration workflow

    OpenAIRE

    Hirschman, L.; Burns, G. A. P. C.; Krallinger, M.; Arighi, C.; Cohen, K. B.; Valencia, A.; Wu, C H; Chatr-aryamontri, A; Dowell, K. G.; Huala, E; Lourenco, A.; Nash, R; Veuthey, A.-L.; Wiegers, T.; Winter, A. G.

    2012-01-01

    Molecular biology has become heavily dependent on biological knowledge encoded in expert curated biological databases. As the volume of biological literature increases, biocurators need help in keeping up with the literature; (semi-) automated aids for biocuration would seem to be an ideal application for natural language processing and text mining. However, to date, there have been few documented successes for improving biocuration throughput using text mining. Our initial investigations too...

  1. Registration document 2005; Document de reference 2005

    Energy Technology Data Exchange (ETDEWEB)

    NONE

    2005-07-01

    This reference document of Gaz de France provides information and data on the Group's activities in 2005: financial information, business activities, equipment, factories and real estate, trade, capital, organization charts, employment, contracts and research programs. (A.L.B.)

  2. Improving text recall with multiple summaries

    NARCIS (Netherlands)

    Meij, van der Hans; Meij, van der Jan

    2012-01-01

    Background. QuikScan (QS) is an innovative design that aims to improve accessibility, comprehensibility, and subsequent recall of expository text by means of frequent within-document summaries that are formatted as numbered list items. The numbers in the QS summaries correspond to numbers placed in

  3. Writing Treatment for Aphasia: A Texting Approach

    Science.gov (United States)

    Beeson, Pelagie M.; Higginson, Kristina; Rising, Kindle

    2013-01-01

    Purpose: Treatment studies have documented the therapeutic and functional value of lexical writing treatment for individuals with severe aphasia. The purpose of this study was to determine whether such retraining could be accomplished using the typing feature of a cellular telephone, with the ultimate goal of using text messaging for…

  4. Polarity Analysis of Texts using Discourse Structure

    NARCIS (Netherlands)

    Heerschop, Bas; Goosen, Frank; Hogenboom, Alexander; Frasincar, Flavius; Kaymak, Uzay; Jong, de Franciska

    2011-01-01

    Sentiment analysis has applications in many areas and the exploration of its potential has only just begun. We propose Pathos, a framework which performs document sentiment analysis (partly) based on a document’s discourse structure. We hypothesize that by splitting a text into important and less im

  5. Automatic Induction of Rule Based Text Categorization

    Directory of Open Access Journals (Sweden)

    D.Maghesh Kumar

    2010-12-01

    Full Text Available The automated categorization of texts into predefined categories has witnessed a booming interest in the last 10 years, due to the increased availability of documents in digital form and the ensuing need to organize them. In the research community the dominant approach to this problem is based on machine learning techniques: a general inductive process automatically builds a classifier by learning, from a set of preclassified documents, the characteristics of the categories. This paper describes a novel method for the automatic induction of rule-based text classifiers. This method supports a hypothesis language of the form "if T1, … or Tn occurs in document d, and none of Tn+1,... Tn+m occurs in d, then classify d under category c," where each Ti is a conjunction of terms. This survey discusses the main approaches to text categorization that fall within the machine learning paradigm. Issues pertaining to three different problems, namely, document representation, classifier construction, and classifier evaluation, are discussed in detail.
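
    As a rough illustration of the rule format quoted above, the following hypothetical sketch (not the paper's induction algorithm) evaluates one rule of the form "if any of the positive term conjunctions occurs in d and none of the negative terms occurs in d, then classify d under c"; the example document, terms and category name are all made up.

        def matches_rule(document, positive_conjunctions, negative_terms):
            """Return True if the document satisfies a rule of the form
            'if T1 or ... or Tn occurs and none of Tn+1 ... Tn+m occurs'."""
            words = set(document.lower().split())
            # Each positive entry is a conjunction of terms: all of them must be present.
            any_positive = any(all(t in words for t in conj) for conj in positive_conjunctions)
            # None of the negative terms may be present.
            no_negative = all(t not in words for t in negative_terms)
            return any_positive and no_negative

        doc = "the central bank raised interest rates again this quarter"
        rule_positive = [("interest", "rates"), ("central", "bank")]   # T1 ... Tn (conjunctions)
        rule_negative = ("football",)                                  # Tn+1 ... Tn+m
        if matches_rule(doc, rule_positive, rule_negative):
            print("classify under category: economics")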

  6. 2002 reference document; Document de reference 2002

    Energy Technology Data Exchange (ETDEWEB)

    NONE

    2002-07-01

    This 2002 reference document of the group Areva, provides information on the society. Organized in seven chapters, it presents the persons responsible for the reference document and for auditing the financial statements, information pertaining to the transaction, general information on the company and share capital, information on company operation, changes and future prospects, assets, financial position, financial performance, information on company management and executive board and supervisory board, recent developments and future prospects. (A.L.B.)

  7. Didaktischer Informationsdienst Mathematik. Thema: Proportion. Curriculum Heft 22. [and] Quellensammlung zu: Didaktischer Informationdienst Mathematik. Thema: Proportion. Dokumentation: Literaturnachweise 3. (Didactical Information Service for Mathematics. Topic: Proportion. Curriculum Text 22. [and] Source Materials for the: Didactical Information Service for Mathematics. Topic: Proportion. Documentation: Literature Review 3).

    Science.gov (United States)

    Andelfinger, Bernhard; Zuckett-Peerenboom, Rolf D.

    In 1979, a small team at the Landesinstitut Nordrhein-Westfalen Neuss, Federal Republic of Germany, started the Didactical Information Service for Mathematics in School (DID-M). They are developing a series of books containing information on and documentation for the most important mathematics topics for grades 5-10. Chosen as the first of these…

  8. Enterprise Document Management

    Data.gov (United States)

    US Agency for International Development — The function of the operation is to provide e-Signature and document management support for Acquisition and Assisitance (A&A) documents including vouchers in...

  9. TEXT MINING AND CLASSIFICATION OF PRODUCT REVIEWS USING STRUCTURED SUPPORT VECTOR MACHINE

    OpenAIRE

    Jincy B. Chrystal; Stephy Joseph

    2015-01-01

    Text mining and Text classification are the two prominent and challenging tasks in the field of Machine learning. Text mining refers to the process of deriving high quality and relevant information from text, while Text classification deals with the categorization of text documents into different classes. The real challenge in these areas is to address the problems like handling large text corpora, similarity of words in text documents, and association of text documents with a ...

  10. Document Analysis by Crosscount Approach

    Institute of Scientific and Technical Information of China (English)

    王海琴; 戴汝为

    1998-01-01

    In this paper a new feature called crosscount for document analysis is introduced. The feature crosscount is a function of white line segments with their start on the edge of document images. It reflects not only the contour of the image, but also the periodicity of white lines (background) and text lines in the document images. In complex printed-page layouts, there are different blocks such as textual, graphical, tabular, and so on. Of these blocks, textual ones have the most obvious periodicity, with their homogeneous white lines arranged regularly. This important property of textual blocks can be extracted by crosscount functions. Here the document layouts are classified into three classes on the basis of their physical structures. Then the definition and properties of the crosscount function are described. According to the classification of document layouts, the application of this new feature to the analysis and understanding of different types of document images is discussed.

  11. Electronic Document Management Using Inverted Files System

    Directory of Open Access Journals (Sweden)

    Suhartono Derwin

    2014-03-01

    Full Text Available The number of documents is increasing rapidly. These documents exist not only on paper but also in electronic form. This can be seen from a data sample taken from the SpringerLink publisher in 2010, which showed an increase in the number of digital document collections from 2003 to mid-2010. How to manage them well therefore becomes an important need. This paper describes a new method for managing documents called an inverted files system. For electronic documents, the inverted files system is used so that documents can be searched over the Internet using a search engine. It can improve the document search mechanism and the document storage mechanism.
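
    A minimal sketch of the inverted-file idea described above, assuming a toy in-memory collection rather than the system in the paper: each term is mapped to the set of document identifiers containing it, and a query is answered by intersecting the posting lists of its terms.

        from collections import defaultdict

        documents = {
            1: "electronic document management with inverted files",
            2: "paper based document archive",
            3: "searching electronic documents over the internet",
        }

        # Build the inverted file: term -> set of document ids (posting list).
        index = defaultdict(set)
        for doc_id, text in documents.items():
            for term in text.lower().split():
                index[term].add(doc_id)

        def search(query):
            """Intersect the posting lists of all query terms."""
            postings = [index.get(term, set()) for term in query.lower().split()]
            return set.intersection(*postings) if postings else set()

        print(search("electronic document"))   # -> {1} (no stemming, so doc 3's "documents" does not match)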

  12. ODQ: A Fluid Office Document Query Language

    Directory of Open Access Journals (Sweden)

    Xuhong Liu

    2015-06-01

    Full Text Available Fluid office documents, as semi-structured data often represented by the Extensible Markup Language (XML), are important parts of Big Data. These office documents have different formats, and their matching Application Programming Interfaces (APIs) depend on the developing platform and version, which causes difficulty in custom development and information retrieval from them. To solve this problem, we have been developing an office document query (ODQ) language which provides a uniform method to retrieve content from documents with different formats and versions. ODQ builds a common document model ontology to conceal the format details of documents and provides a uniform operation interface to handle office documents with different formats. The results show that ODQ has advantages in format independence, and can facilitate users in developing document processing systems with good interoperability.

  13. Clinical document architecture.

    Science.gov (United States)

    Heitmann, Kai

    2003-01-01

    The Clinical Document Architecture (CDA), a standard developed by the Health Level Seven organisation (HL7), is an ANSI-approved document architecture for the exchange of clinical information using XML. A CDA document comprises a header with associated vocabularies and a body containing the structural clinical information. PMID:15061557

  14. Informative document waste plastics

    NARCIS (Netherlands)

    Nagelhout D; Sein AA; Duvoort GL

    1989-01-01

    This "Informative document waste plastics" forms part of a series of "informative documents waste materials". These documents are conducted by RIVM on the indstruction of the Directorate General for the Environment, Waste Materials Directorate, in behalf of the program of acti

  15. Traceability Method for Software Engineering Documentation

    Directory of Open Access Journals (Sweden)

    Nur Adila Azram

    2012-03-01

    Full Text Available Traceability has been widely discussed in the research community and has long been a topic of interest in software engineering. Traceability in software documentation is one of the interesting topics for further research. It is important in software documentation to trace the flow or process across all the documents and to establish whether they depend on one another or not. In this paper, we present a traceability method for software engineering documentation. The objective of this research is to facilitate the tracing of software documentation.

  16. Electronic documents circulation in taxation

    Directory of Open Access Journals (Sweden)

    Yanchev A.V.

    2015-03-01

    Full Text Available Scientific recommendations are developed for introducing electronic document circulation into the taxation system. A model of initial tax calculation is offered which, in contrast to the available one, takes into account the specifics of an enterprise's administrative business processes, together with an algorithm for the taxpayer's electronic authentication in a global system of electronic government. This allows the quality of information streams for the electronic documentation of tax calculations to be improved without additional costs, and avoids the duplication of documents and of their separate requisites in the state database.

  17. Text Classification Using Sentential Frequent Itemsets

    Institute of Scientific and Technical Information of China (English)

    Shi-Zhu Liu; He-Ping Hu

    2007-01-01

    Text classification techniques mostly rely on single-term analysis of the document data set, while more concepts, especially specific ones, are usually conveyed by sets of terms. To achieve a more accurate text classifier, more informative features, including frequent co-occurring words in the same sentence and their weights, are particularly important in such scenarios. In this paper, we propose a novel approach using sentential frequent itemsets, a concept from association rule mining, for text classification; it views a sentence rather than a document as a transaction, and uses a variable precision rough set based method to evaluate each sentential frequent itemset's contribution to the classification. Experiments over the Reuters and newsgroup corpora are carried out, which validate the practicability of the proposed system.
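
    The "sentence as transaction" idea can be sketched under simplifying assumptions (plain support counting, without the paper's rough-set weighting): every sentence is treated as a transaction of terms, and term pairs that reach a minimum support across sentences are kept as frequent itemsets; the example text and the support threshold below are made up for illustration.

        from itertools import combinations
        from collections import Counter

        text = ("Interest rates rose sharply. The central bank raised interest rates. "
                "Analysts expect the central bank to cut rates next year.")

        # Each sentence is one transaction (a set of terms).
        transactions = [set(s.lower().split()) for s in text.split(".") if s.strip()]

        # Count 2-item itemsets per transaction, mimicking association-rule support counting.
        counts = Counter()
        for items in transactions:
            for pair in combinations(sorted(items), 2):
                counts[pair] += 1

        min_support = 2                                   # assumed threshold
        frequent = {pair: c for pair, c in counts.items() if c >= min_support}
        print(frequent)                                   # e.g. ('central', 'rates'): 2, ('interest', 'rates'): 2, ...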

  18. Document image cleanup and binarization

    Science.gov (United States)

    Wu, Victor; Manmatha, Raghaven

    1998-04-01

    Image binarization is a difficult task for documents with text over textured or shaded backgrounds, poor contrast, and/or considerable noise. Current optical character recognition (OCR) and document analysis technologies do not handle such documents well. We have developed a simple yet effective algorithm for document image clean-up and binarization. The algorithm consists of two basic steps. In the first step, the input image is smoothed using a low-pass filter. The smoothing operation enhances the text relative to any background texture. This is because background texture normally has higher frequency than text does. The smoothing operation also removes speckle noise. In the second step, the intensity histogram of the smoothed image is computed and a threshold automatically selected as follows. For black text, the first peak of the histogram corresponds to text. Thresholding the image at the value of the valley between the first and second peaks of the histogram binarizes the image well. In order to reliably identify the valley, the histogram is smoothed by a low-pass filter before the threshold is computed. The algorithm has been applied to some 50 images from a wide variety of sources: digitized video frames, photos, newspapers, advertisements in magazines or sales flyers, personal checks, etc. There are 21820 characters and 4406 words in these images. 91 percent of the characters and 86 percent of the words are successfully cleaned up and binarized. A commercial OCR was applied to the binarized text when it consisted of fonts which were OCR recognizable. The recognition rate was 84 percent for the characters and 77 percent for the words.
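
    The two-step procedure described above can be sketched as follows; this is a simplified illustration rather than the authors' code, assuming a grayscale array with dark text, and it uses a crude search for the valley between the first two histogram peaks.

        import numpy as np
        from scipy.ndimage import gaussian_filter, gaussian_filter1d

        def binarize(gray):
            """gray: 2-D float array in [0, 255], dark text on a lighter background."""
            # Step 1: low-pass filter to suppress background texture and speckle noise.
            smoothed = gaussian_filter(gray, sigma=2.0)

            # Step 2: smooth the intensity histogram and pick the valley between
            # the first (text) peak and the second (background) peak as the threshold.
            hist, _ = np.histogram(smoothed, bins=256, range=(0, 255))
            hist = gaussian_filter1d(hist.astype(float), sigma=3.0)

            peaks = [i for i in range(1, 255) if hist[i] > hist[i - 1] and hist[i] >= hist[i + 1]]
            if len(peaks) < 2:                       # fallback: global mean threshold
                threshold = smoothed.mean()
            else:
                lo, hi = peaks[0], peaks[1]
                threshold = lo + int(np.argmin(hist[lo:hi + 1]))

            return smoothed <= threshold             # True where a pixel is treated as text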

  19. Scheme Program Documentation Tools

    DEFF Research Database (Denmark)

    Nørmark, Kurt

    2004-01-01

    This paper describes and discusses two different Scheme documentation tools. The first is SchemeDoc, which is intended for documentation of the interfaces of Scheme libraries (APIs). The second is the Scheme Elucidator, which is for internal documentation of Scheme programs. Although the tools ... are separate and intended for different documentation purposes they are related to each other in several ways. Both tools are based on XML languages for tool setup and for documentation authoring. In addition, both tools rely on the LAML framework which---in a systematic way---makes an XML language available...

  20. Text Mining the History of Medicine.

    Directory of Open Access Journals (Sweden)

    Paul Thompson

    Full Text Available Historical text archives constitute a rich and diverse source of information, which is becoming increasingly readily accessible, due to large-scale digitisation efforts. However, it can be difficult for researchers to explore and search such large volumes of data in an efficient manner. Text mining (TM) methods can help, through their ability to recognise various types of semantic information automatically, e.g., instances of concepts (places, medical conditions, drugs, etc.), synonyms/variant forms of concepts, and relationships holding between concepts (which drugs are used to treat which medical conditions, etc.). TM analysis allows search systems to incorporate functionality such as automatic suggestions of synonyms of user-entered query terms, exploration of different concepts mentioned within search results or isolation of documents in which concepts are related in specific ways. However, applying TM methods to historical text can be challenging, according to differences and evolutions in vocabulary, terminology, language structure and style, compared to more modern text. In this article, we present our efforts to overcome the various challenges faced in the semantic analysis of published historical medical text dating back to the mid 19th century. Firstly, we used evidence from diverse historical medical documents from different periods to develop new resources that provide accounts of the multiple, evolving ways in which concepts, their variants and relationships amongst them may be expressed. These resources were employed to support the development of a modular processing pipeline of TM tools for the robust detection of semantic information in historical medical documents with varying characteristics. We applied the pipeline to two large-scale medical document archives covering wide temporal ranges as the basis for the development of a publicly accessible semantically-oriented search system. The novel resources are available for research

  1. NEW TECHNIQUES USED IN AUTOMATED TEXT ANALYSIS

    Directory of Open Access Journals (Sweden)

    M. Istrate

    2010-12-01

    Full Text Available Automated analysis of natural language texts is one of the most important knowledge discovery tasks for any organization. According to Gartner Group, almost 90% of knowledge available at an organization today is dispersed throughout piles of documents buried within unstructured text. Analyzing huge volumes of textual information is often involved in making informed and correct business decisions. Traditional analysis methods based on statistics fail to help processing unstructured texts and the society is in search of new technologies for text analysis. There exist a variety of approaches to the analysis of natural language texts, but most of them do not provide results that could be successfully applied in practice. This article concentrates on recent ideas and practical implementations in this area.

  2. ARABIC TEXT CATEGORIZATION ALGORITHM USING VECTOR EVALUATION METHOD

    Directory of Open Access Journals (Sweden)

    Ashraf Odeh

    2014-12-01

    Full Text Available Text categorization is the process of grouping documents into categories based on their contents. This process is important to make information retrieval easier, and it has become more important due to the huge amount of textual information available online. The main problem in text categorization is how to improve the classification accuracy. Although Arabic text categorization is a new and promising field, there has been little research in it. This paper proposes a new method for Arabic text categorization using vector evaluation. The proposed method uses a categorized corpus of Arabic documents; the weights of the tested document's words are then calculated to determine the document keywords, which are compared with the keywords of the corpus categories to determine the tested document's best category.
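
    A simplified sketch of the keyword-comparison step (hypothetical term weighting, not the paper's exact vector evaluation method): the tested document's most heavily weighted words are compared against per-category keyword sets, and the category with the largest overlap wins; the category keyword lists below are assumptions standing in for what would come from the categorized corpus.

        from collections import Counter

        category_keywords = {                    # assumed to be derived from a categorized corpus
            "sports": {"match", "team", "goal", "player"},
            "economy": {"market", "bank", "rates", "inflation"},
        }

        def top_keywords(text, k=5):
            """Weight words by raw frequency and keep the k heaviest (a stand-in for tf-idf)."""
            counts = Counter(text.lower().split())
            return {w for w, _ in counts.most_common(k)}

        def categorize(text):
            keywords = top_keywords(text)
            scores = {cat: len(keywords & kws) for cat, kws in category_keywords.items()}
            return max(scores, key=scores.get)

        print(categorize("the bank raised rates as the market reacted to inflation"))  # -> economy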

  3. Hypermedia and Free Text Retrieval.

    Science.gov (United States)

    Dunlop, Mark D.; van Rijsbergen, C. J.

    1993-01-01

    Discusses access to nontextual documents in large multimedia document bases. A hybrid information retrieval model, using queries in a hypertext environment for location of browsing areas, is presented; and two experiments using cluster-based descriptions of content are reported. (23 references) (EA)

  4. Exploiting Surrounding Text for Retrieving Web Images

    Directory of Open Access Journals (Sweden)

    S. A. Noah

    2008-01-01

    Full Text Available Web documents contain useful textual information that can be exploited for describing images. Research has focused on representing images by means of their content (low-level descriptions such as color, shape and texture), while little research has been directed to exploiting such textual information. The aim of this research was to systematically exploit the textual content of HTML documents for automatically indexing and ranking images embedded in web documents. A heuristic approach for locating and weighting the text surrounding web images, together with a modified tf.idf weighting scheme, is proposed. Precision-recall evaluations were conducted for ten queries and promising results were achieved. The proposed approach showed a slightly better precision measure than a popular search engine, with average relative precision measures of 0.63 and 0.55 respectively.
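
    As an illustration of weighting the text surrounding an image (hypothetical weights, not the authors' exact heuristic), the sketch below gives terms occurring closer to the image position a larger weight and combines that with an idf-style factor; the token list, image position and idf values are all assumptions.

        def surrounding_text_weights(tokens, image_pos, idf, window=10):
            """Weight tokens near an image position more heavily (distance decay times idf)."""
            weights = {}
            start = max(0, image_pos - window)
            end = min(len(tokens), image_pos + window + 1)
            for i in range(start, end):
                term = tokens[i].lower()
                proximity = 1.0 / (1 + abs(i - image_pos))       # closer terms weigh more
                weights[term] = weights.get(term, 0.0) + proximity * idf.get(term, 1.0)
            return weights

        tokens = "the photo shows a red sports car parked near the old harbour".split()
        idf = {"car": 2.3, "harbour": 3.0, "the": 0.1}           # assumed idf values
        print(surrounding_text_weights(tokens, image_pos=6, idf=idf))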

  5. Text Mining the History of Medicine.

    Science.gov (United States)

    Thompson, Paul; Batista-Navarro, Riza Theresa; Kontonatsios, Georgios; Carter, Jacob; Toon, Elizabeth; McNaught, John; Timmermann, Carsten; Worboys, Michael; Ananiadou, Sophia

    2016-01-01

    Historical text archives constitute a rich and diverse source of information, which is becoming increasingly readily accessible, due to large-scale digitisation efforts. However, it can be difficult for researchers to explore and search such large volumes of data in an efficient manner. Text mining (TM) methods can help, through their ability to recognise various types of semantic information automatically, e.g., instances of concepts (places, medical conditions, drugs, etc.), synonyms/variant forms of concepts, and relationships holding between concepts (which drugs are used to treat which medical conditions, etc.). TM analysis allows search systems to incorporate functionality such as automatic suggestions of synonyms of user-entered query terms, exploration of different concepts mentioned within search results or isolation of documents in which concepts are related in specific ways. However, applying TM methods to historical text can be challenging, according to differences and evolutions in vocabulary, terminology, language structure and style, compared to more modern text. In this article, we present our efforts to overcome the various challenges faced in the semantic analysis of published historical medical text dating back to the mid 19th century. Firstly, we used evidence from diverse historical medical documents from different periods to develop new resources that provide accounts of the multiple, evolving ways in which concepts, their variants and relationships amongst them may be expressed. These resources were employed to support the development of a modular processing pipeline of TM tools for the robust detection of semantic information in historical medical documents with varying characteristics. We applied the pipeline to two large-scale medical document archives covering wide temporal ranges as the basis for the development of a publicly accessible semantically-oriented search system. The novel resources are available for research purposes, while

  7. Modeling Documents with Event Model

    Directory of Open Access Journals (Sweden)

    Longhui Wang

    2015-08-01

    Full Text Available Currently deep learning has made great breakthroughs in visual and speech processing, mainly because it draws lessons from the hierarchical way the brain deals with images and speech. In the field of NLP, a topic model is one of the important ways for modeling documents. Topic models are built on a generative model that clearly does not match the way humans write. In this paper, we propose Event Model, which is unsupervised and based on the language processing mechanism of neurolinguistics, to model documents. In Event Model, documents are descriptions of concrete or abstract events seen, heard, or sensed by people, and words are objects in the events. Event Model has two stages: word learning and dimensionality reduction. Word learning is to learn the semantics of words based on deep learning. Dimensionality reduction is the process of representing a document as a low-dimensional vector by a linear model that is completely different from topic models. Event Model achieves state-of-the-art results on document retrieval tasks.

  8. Text Character Extraction Implementation from Captured Handwritten Image to Text Conversionusing Template Matching Technique

    Directory of Open Access Journals (Sweden)

    Barate Seema

    2016-01-01

    Full Text Available Images contain various types of useful information that should be extracted whenever required. Various algorithms and methods have been proposed to extract text from a given image, so that the user is able to access the text from any image. Variations in text may occur because of differences in the size, style, orientation and alignment of text, while low image contrast and composite backgrounds make text extraction difficult. If we develop an application that extracts and recognizes those texts accurately in real time, then it can be applied to many important applications like document analysis, vehicle license plate extraction, text-based image indexing, etc., and many such applications have become realities in recent years. To overcome the above problems we develop an application that converts an image into text using techniques such as bounding box detection, the HSV model, blob analysis, template matching and template generation.
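
    A minimal sketch of the template-matching step on synthetic data, assuming OpenCV is available; it illustrates normalized cross-correlation matching in general, not the exact bounding box/HSV/blob pipeline proposed above, and the threshold value is an assumption.

        import cv2
        import numpy as np

        # Synthetic grayscale "page" with two dark blobs standing in for characters.
        page = np.full((60, 80), 255, dtype=np.uint8)
        page[10:20, 10:18] = 0
        page[35:45, 50:58] = 0
        template = page[8:22, 8:20].copy()        # one character (plus some background) as the template

        # Slide the template over the page and score each position with
        # normalized cross-correlation; high scores indicate likely occurrences.
        scores = cv2.matchTemplate(page, template, cv2.TM_CCOEFF_NORMED)
        ys, xs = np.where(scores >= 0.8)          # assumed confidence threshold

        h, w = template.shape
        for x, y in zip(xs, ys):
            cv2.rectangle(page, (x, y), (x + w, y + h), 128, 1)   # mark detected characters
        print(len(xs), "match positions above the threshold")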

  9. A Survey on Web Text Information Retrieval in Text Mining

    OpenAIRE

    Tapaswini Nayak; Srinivash Prasad; Manas Ranjan Senapat

    2015-01-01

    In this study we have analyzed different techniques for information retrieval in text mining. The aim of the study is to identify web text information retrieval. Text mining is much like analytics: a process of deriving high-quality information from text. High-quality information is typically derived through the devising of patterns and trends by means such as statistical pattern learning. Typical text mining tasks include text categorization, text clustering, concep...

  10. Document Classification Using Support Vector Machine

    OpenAIRE

    Shweta Mayor; Bhasker Pant

    2012-01-01

    Information like NEWS FEEDS is generally stored in the form of documents and files created on the basis of daily occurrences in the world. Classifying unstructured text in these large document corpora has become cumbersome. Efficiently and effectively retrieving and categorizing these documents is a hard task to perform. This research paper discusses in detail the implementation of Support Vector Machine (SVM) for calculating term frequency of the features used as Sports, Business and Entertai...
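
    A minimal sketch of term-frequency features feeding a linear SVM with scikit-learn; the tiny labelled corpus and its categories are made up to mirror those named in the abstract, and this is an illustration rather than the paper's implementation.

        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.svm import LinearSVC
        from sklearn.pipeline import make_pipeline

        train_texts = [
            "the team won the final match",            # sports
            "stock markets fell as rates rose",        # business
            "the film premiere drew huge crowds",      # entertainment
            "the striker scored two goals",            # sports
        ]
        train_labels = ["sports", "business", "entertainment", "sports"]

        # Term-frequency style features feeding a linear support vector machine.
        model = make_pipeline(TfidfVectorizer(), LinearSVC())
        model.fit(train_texts, train_labels)

        print(model.predict(["interest rates and the stock market"]))   # likely ['business']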

  11. Document control program (DCP)

    Energy Technology Data Exchange (ETDEWEB)

    Burger, M.J.

    1978-01-01

    The management and control of classified and unclassified documents is tedious, time consuming, and error prone. DCP is a simple, inexpensive, but effective program for the Livermore Time Sharing System and is written in TRIX and TRIX AC. It is used to computerize the classified document control task with a completely self-contained program requiring essentially no modifications or programmer support to implement or maintain. DCP provides a complete dialect to interactively prepare the input data, update the document master file, and interrogate and retrieve any information desired from the document file. 2 figures. (RWR)

  12. CAED Document Repository

    Data.gov (United States)

    U.S. Environmental Protection Agency — Compliance Assurance and Enforcement Division Document Repository (CAEDDOCRESP) provides internal and external access of Inspection Records, Enforcement Actions,...

  13. Health physics documentation

    International Nuclear Information System (INIS)

    When dealing with radioactive material the health physicist receives innumerable papers and documents within the fields of researching, prosecuting, organizing and justifying radiation protection. Some of these papers are requested by the health physicist and some are required by law. The scope, quantity and deposit periods of the health physics documentation at the Karlsruhe Nuclear Research Center are presented and rationalizing methods discussed. The aim of this documentation should be the application of physics to accident prevention, i.e. documentation should protect those concerned and not the health physicist. (H.K.)

  14. Introducing Text Analytics as a Graduate Business School Course

    Science.gov (United States)

    Edgington, Theresa M.

    2011-01-01

    Text analytics refers to the process of analyzing unstructured data from documented sources, including open-ended surveys, blogs, and other types of web dialog. Text analytics has enveloped the concept of text mining, an analysis approach influenced heavily from data mining. While text mining has been covered extensively in various computer…

  15. TEXT MINING – PREREQUISITE FOR KNOWLEDGE MANAGEMENT SYSTEMS

    OpenAIRE

    Dragoº Marcel VESPAN

    2009-01-01

    Text mining is an interdisciplinary field with the main purpose of retrieving new knowledge from large collections of text documents. This paper presents the main techniques used for knowledge extraction through text mining and their main areas of applicability and emphasizes the importance of text mining in knowledge management systems.

  16. IR and OLAP in XML document warehouses

    DEFF Research Database (Denmark)

    Perez, Juan Manuel; Pedersen, Torben Bach; Berlanga, Rafael;

    2005-01-01

    In this paper we propose to combine IR and OLAP (On-Line Analytical Processing) technologies to exploit a warehouse of text-rich XML documents. In the system we plan to develop, a multidimensional implementation of a relevance modeling document model will be used for interactively querying...

  17. IDC System Specification Document.

    Energy Technology Data Exchange (ETDEWEB)

    Clifford, David J.

    2014-12-01

    This document contains the system specifications derived to satisfy the system requirements found in the IDC System Requirements Document for the IDC Reengineering Phase 2 project. Revision history: Version V1.0, dated 12/2014, by the IDC Reengineering Project Team, revision description "Initial delivery", authorized by M. Harris.

  18. INFCE plenary conference documents

    International Nuclear Information System (INIS)

    This document consists of the reports to the First INFCE Plenary Conference (November 1978) by the Working Groups, a record of the Plenary Conference's actions and decisions, the Communique of the Final INFCE Plenary Conference (February 1980), and a list of all documents in the IAEA depository for INFCE.

  19. Enriching software architecture documentation

    NARCIS (Netherlands)

    Jansen, Anton; Avgeriou, Paris; Ven, Jan Salvador van der

    2009-01-01

    The effective documentation of Architectural Knowledge (AK) is one of the key factors in leveraging the paradigm shift toward sharing and reusing AK. However, current documentation approaches have severe shortcomings in capturing the knowledge of large and complex systems and subsequently facilitati

  20. Extracting Conceptual Feature Structures from Text

    DEFF Research Database (Denmark)

    Andreasen, Troels; Bulskov, Henrik; Jensen, Per Anker;

    2011-01-01

    This paper describes an approach to indexing texts by their conceptual content using ontologies along with lexico-syntactic information and semantic role assignment provided by lexical resources. The conceptual content of meaningful chunks of text is transformed into conceptual feature structures and mapped into concepts in a generative ontology. Synonymous but linguistically quite distinct expressions are mapped to the same concept in the ontology. This allows us to perform a content-based search which will retrieve relevant documents independently of the linguistic form of the query as well...

  1. Towards Multi Label Text Classification through Label Propagation

    Directory of Open Access Journals (Sweden)

    Shweta C. Dharmadhikari

    2012-06-01

    Full Text Available Classifying text data has been an active area of research for a long time. A text document is a multifaceted object and is often inherently ambiguous by nature. Multi-label learning deals with such ambiguous objects. Classification of such ambiguous text objects often makes the task of the classifier difficult when assigning relevant classes to an input document. Traditional single-label and multi-class text classification paradigms cannot efficiently classify such multifaceted text corpora. In this paper we propose a novel label propagation approach based on semi-supervised learning for multi-label text classification. Our proposed approach models the relationship between class labels and also effectively represents input text documents. We use a semi-supervised learning technique for effective utilization of labeled and unlabeled data for classification. Our proposed approach promises better classification accuracy and handling of complexity, and is elaborated on the basis of standard datasets such as Enron, Slashdot and Bibtex.
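
    A rough sketch of semi-supervised label propagation applied per label (binary relevance) with scikit-learn; the tiny feature matrix, the two label columns and the kernel parameters are assumptions for illustration, not the authors' exact propagation model.

        import numpy as np
        from sklearn.semi_supervised import LabelPropagation

        # Toy document features (e.g. reduced tf-idf vectors); one row per document.
        X = np.array([[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9], [0.5, 0.5]])

        # Multi-label ground truth for the labelled documents; -1 marks unlabelled entries.
        # One binary column per class label, propagated independently.
        Y = np.array([
            [1, 0],
            [-1, -1],
            [0, 1],
            [-1, -1],
            [-1, -1],
        ])

        predictions = np.zeros_like(Y)
        for j in range(Y.shape[1]):
            lp = LabelPropagation(kernel="rbf", gamma=20)
            lp.fit(X, Y[:, j])                     # -1 entries are treated as unlabelled
            predictions[:, j] = lp.transduction_
        print(predictions)                          # propagated 0/1 assignment for every label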

  2. Inclusivity, Gestalt Principles, and Plain Language in Document Design

    Directory of Open Access Journals (Sweden)

    Jennifer Turner

    2016-06-01

    Full Text Available In Brief: Good design makes documents easier to use, helps documents stand out from other pieces of information, and lends credibility to document creators. Librarians across library types and departments provide instruction and training materials to co-workers and library users. For these materials to be readable and accessible, they must follow guidelines for usable document design. […

  3. Automatic handwriting identification on medieval documents

    NARCIS (Netherlands)

    Bulacu, M.L.; Schomaker, L.R.B.

    2007-01-01

    In this paper, we evaluate the performance of text-independent writer identification methods on a handwriting dataset containing medieval English documents. Applicable identification rates are achieved by combining textural features (joint directional probability distributions) with allographic feat

  4. Handwriting segmentation of unconstrained Oriya text

    Indian Academy of Sciences (India)

    N Tripathy; U Pal

    2006-12-01

    Segmentation of handwritten text into lines, words and characters is one of the important steps in the handwritten text recognition process. In this paper we propose a water reservoir concept-based scheme for segmentation of unconstrained Oriya handwritten text into individual characters. Here, at first, the text image is segmented into lines, and the lines are then segmented into individual words. For line segmentation, the document is divided into vertical stripes. Analysing the heights of the water reservoirs obtained from different components of the document, the width of a stripe is calculated. Stripe-wise horizontal histograms are then computed and the relationship of the peak–valley points of the histograms is used for line segmentation. Based on vertical projection profiles and structural features of Oriya characters, text lines are segmented into words. For character segmentation, at first, the isolated and connected (touching) characters in a word are detected. Using structural, topological and water reservoir concept-based features, characters of the word that touch are then segmented. From experiments we have observed that the proposed “touching character” segmentation module has 96·7% accuracy for two-character touching strings.
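
    The stripe-wise horizontal histogram step can be illustrated with a simplified sketch (a whole-page projection profile, without the stripe division or water reservoir analysis described above): text lines appear as runs of rows containing ink, and sufficiently wide gaps between them give the line boundaries; the toy page and the gap threshold are assumptions.

        import numpy as np

        def segment_lines(binary_page, min_gap=2):
            """binary_page: 2-D array, 1 = ink, 0 = background. Returns (top, bottom) row ranges."""
            # Horizontal projection profile: number of ink pixels in every row.
            profile = binary_page.sum(axis=1)
            rows_with_ink = profile > 0

            lines, start, gap = [], None, 0
            for r, has_ink in enumerate(rows_with_ink):
                if has_ink:
                    if start is None:
                        start = r
                    gap = 0
                elif start is not None:
                    gap += 1
                    if gap >= min_gap:               # a valley wide enough to end the line
                        lines.append((start, r - gap))
                        start, gap = None, 0
            if start is not None:
                lines.append((start, len(rows_with_ink) - 1))
            return lines

        page = np.zeros((12, 20), dtype=int)
        page[1:3, 2:18] = 1                          # first text line
        page[6:9, 2:18] = 1                          # second text line
        print(segment_lines(page))                   # [(1, 2), (6, 8)]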

  5. Illumination Compensation Algorithm for Unevenly Lighted Document Segmentation

    Directory of Open Access Journals (Sweden)

    Ju Zhiyong

    2013-07-01

    Full Text Available For the problem of segmenting unevenly lighted document images, this paper proposes an illumination compensation segmentation algorithm which can effectively segment such documents. The illumination compensation method equivalently converts an unevenly lighted document image into an evenly lighted one, which is then segmented directly. Experimental results show that the proposed method obtains accurate evenly lighted document images so that the document can be segmented accurately, and that it is more efficient at processing unevenly lighted document images than traditional binarization methods. The algorithm effectively overcomes the difficulty in handling uneven lighting and enhances segmentation quality considerably.
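
    A simplified sketch of one way to compensate illumination (not the paper's exact algorithm): estimate the slowly varying illumination field with a very wide blur, divide it out to obtain an approximately evenly lighted image, and then apply an ordinary global threshold; the blur width and the threshold rule are assumptions.

        import numpy as np
        from scipy.ndimage import gaussian_filter

        def compensate_and_segment(gray):
            """gray: 2-D float array in [0, 1], dark text on an unevenly lit background."""
            # Estimate the illumination field with a very wide Gaussian blur;
            # text strokes are too thin to survive this smoothing.
            illumination = np.clip(gaussian_filter(gray, sigma=25), 1e-3, None)

            # Divide out the illumination to get an (approximately) evenly lighted image.
            even = gray / illumination

            # A simple global threshold is now enough to separate text from background.
            threshold = even.mean() - even.std()
            return even <= threshold                 # True where a pixel is treated as text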

  6. Classroom Texting in College Students

    Science.gov (United States)

    Pettijohn, Terry F.; Frazier, Erik; Rieser, Elizabeth; Vaughn, Nicholas; Hupp-Wilds, Bobbi

    2015-01-01

    A 21-item survey on texting in the classroom was given to 235 college students. Overall, 99.6% of students owned a cellphone and 98% texted daily. Of the 138 students who texted in the classroom, most texted friends or significant others, and indicated that the reason for classroom texting was boredom or work. Students who texted sent a mean of 12.21…

  7. Methodological Aspects of Architectural Documentation

    Directory of Open Access Journals (Sweden)

    Arivaldo Amorim

    2011-12-01

    Full Text Available This paper discusses the methodological approach that has been developed in the state of Bahia, Brazil, since 2003 for the documentation of architectural and urban sites using extensive digital technologies. Bahia has a vast territory with important architectural ensembles ranging from the sixteenth century to the present day. As part of this heritage is constructed of raw earth and wood, it is very sensitive to various deleterious agents. It is therefore critical to document this collection, which is under threat. To conduct these activities, diverse digital technologies that could be used in the documentation process are being tested. The task is being developed as academic research, with few financial resources, by scholarship students and some volunteers. Several technologies are tested, ranging from the simplest to the more sophisticated ones, in the main stages of the documentation project: overall work planning, data acquisition, processing and management, and ultimately the control and evaluation of the work. The activities that motivated this paper are being conducted in the cities of Rio de Contas and Lençóis in the Chapada Diamantina, located 420 km and 750 km from Salvador respectively, in the city of Cachoeira in the Recôncavo Baiano area, 120 km from Salvador, the capital of Bahia state, and in the Pelourinho neighbourhood, located in the historic capital. Part of the material produced can be consulted on the website: <www.lcad.ufba.br>.

  8. Document reconstruction by layout analysis of snippets

    Science.gov (United States)

    Kleber, Florian; Diem, Markus; Sablatnig, Robert

    2010-02-01

    Document analysis is done to analyze entire forms (e.g. intelligent form analysis, table detection) or to describe the layout/structure of a document. Skew detection of scanned documents is also performed to support OCR algorithms that are sensitive to skew. In this paper document analysis is applied to snippets of torn documents to calculate features for their reconstruction. Documents can be destroyed either intentionally, to make the printed content unavailable (e.g. tax fraud investigation, business crime), or through time-induced degradation of ancient documents (e.g. bad storage conditions). Current reconstruction methods for manually torn documents deal with shape, inpainting and texture synthesis techniques. In this paper the possibility of using document analysis techniques on snippets to support the matching algorithm with additional features is shown. This involves a rotational analysis, a color analysis and a line detection. As future work it is planned to extend the feature set with the paper type (blank, checked, lined), the type of writing (handwritten vs. machine printed) and the text layout of a snippet (text size, line spacing). Preliminary results show that these pre-processing steps can be performed reliably on a real dataset consisting of 690 snippets.

  9. New Challenges of the Documentation in Media

    Directory of Open Access Journals (Sweden)

    Antonio García Jiménez

    2015-07-01

    Full Text Available This special issue, presented by index.comunicación, is focused on media related information & documentation. This field undergoes constant and profound changes, especially visible in documentation processes. A situation characterized by the existence of tablets, smartphones, applications, and by the almost achieved digitization of traditional documents, in addition to the crisis of the press business model, that involves mutations in the journalists’ tasks and in the relationship between them and Documentation. Papers included in this special issue focus on some of the concerns in this domain: the progressive autonomy of the journalist in access to information sources, the role of press offices as documentation sources, the search of information on the web, the situation of media blogs, the viability of elements of information architecture in smart TV and the development of social TV and its connection to Documentation.

  10. The Challenge of Challenging Text

    Science.gov (United States)

    Shanahan, Timothy; Fisher, Douglas; Frey, Nancy

    2012-01-01

    The Common Core State Standards emphasize the value of teaching students to engage with complex text. But what exactly makes a text complex, and how can teachers help students develop their ability to learn from such texts? The authors of this article discuss five factors that determine text complexity: vocabulary, sentence structure, coherence,…

  11. Text-Attentional Convolutional Neural Network for Scene Text Detection.

    Science.gov (United States)

    He, Tong; Huang, Weilin; Qiao, Yu; Yao, Jian

    2016-06-01

    Recent deep learning models have demonstrated strong capabilities for classifying text and non-text components in natural images. They extract a high-level feature globally computed from a whole image component (patch), where the cluttered background information may dominate true text features in the deep representation. This leads to less discriminative power and poorer robustness. In this paper, we present a new system for scene text detection by proposing a novel text-attentional convolutional neural network (Text-CNN) that particularly focuses on extracting text-related regions and features from the image components. We develop a new learning mechanism to train the Text-CNN with multi-level and rich supervised information, including text region mask, character label, and binary text/non-text information. The rich supervision information enables the Text-CNN with a strong capability for discriminating ambiguous texts, and also increases its robustness against complicated background components. The training process is formulated as a multi-task learning problem, where low-level supervised information greatly facilitates the main task of text/non-text classification. In addition, a powerful low-level detector called contrast-enhancement maximally stable extremal regions (MSERs) is developed, which extends the widely used MSERs by enhancing intensity contrast between text patterns and background. This allows it to detect highly challenging text patterns, resulting in a higher recall. Our approach achieved promising results on the ICDAR 2013 data set, with an F-measure of 0.82, substantially improving the state-of-the-art results. PMID:27093723
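
    The low-level detection stage can be approximated with off-the-shelf OpenCV components, as in the hedged sketch below: CLAHE serves as a generic local contrast-enhancement step (a stand-in for, not an implementation of, the paper's CE-MSER intensity-contrast enhancement) before extracting MSER candidate regions; the synthetic image and CLAHE parameters are assumptions.

        import cv2
        import numpy as np

        # Synthetic grayscale scene: dark "characters" on an unevenly lit background.
        gray = np.tile(np.linspace(120, 200, 160, dtype=np.uint8), (80, 1))
        cv2.putText(gray, "TEXT", (20, 50), cv2.FONT_HERSHEY_SIMPLEX, 1.2, 30, 3)

        # Generic local contrast enhancement before region extraction.
        clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
        enhanced = clahe.apply(gray)

        # Maximally stable extremal regions as character/text candidate regions.
        mser = cv2.MSER_create()
        regions, boxes = mser.detectRegions(enhanced)
        print(len(regions), "candidate regions detected")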

  12. Handwritten Text Image Authentication using Back Propagation

    CERN Document Server

    Chakravarthy, A S N; Avadhani, P S

    2011-01-01

    Authentication is the act of confirming the truth of an attribute of a datum or entity. This might involve confirming the identity of a person, tracing the origins of an artefact, ensuring that a product is what its packaging and labelling claims to be, or assuring that a computer program is a trusted one. The authentication of information can pose special problems (especially man-in-the-middle attacks), and is often wrapped up with authenticating identity. Literary forgery can involve imitating the style of a famous author. If an original manuscript, typewritten text, or recording is available, then the medium itself (or its packaging - anything from a box to e-mail headers) can help prove or disprove the authenticity of the document. The use of digital images of handwritten historical documents has become more popular in recent years. Volunteers around the world now read thousands of these images as part of their indexing process. Handwritten text images of old documents are sometimes difficult to read or noisy du...

  13. Text Mining Infrastructure in R

    OpenAIRE

    Kurt Hornik; Ingo Feinerer; David Meyer

    2008-01-01

    During the last decade text mining has become a widely used discipline utilizing statistical and machine learning methods. We present the tm package which provides a framework for text mining applications within R. We give a survey on text mining facilities in R and explain how typical application tasks can be carried out using our framework. We present techniques for count-based analysis methods, text clustering, text classification and string kernels. (authors' abstract)

  14. Integrated Criteria document Chlorophenols

    NARCIS (Netherlands)

    Slooff W; Bremmer HJ; Janus JA; Matthijsen AJCM; van Beelen P; van den Berg R; Bloemen HJT; Canton JH; Eerens HC; Hrubec J; Janssens H; Jumelet JC; Knaap AGAC; de Leeuw FAAM; van der Linden AMA; Loch JPG; van Loveren H; Peijnenburg WJGM; Piersma AH; Struijs J; Taalman RDFM; Theelen RMC; van der Velde JMA; Verburgh JJ; Versteegh JFM; van der Woerd KF

    1991-01-01

    This report is accompanied by an appendix under the same number, entitled "Integrated Criteria Document Chlorophenols: Effects". Authors: Janus JA, Taalman RDFM, Theelen RMC. This is the English edition of report 710401003.

  15. NCDC Archive Documentation Manuals

    Data.gov (United States)

    National Oceanic and Atmospheric Administration, Department of Commerce — The National Climatic Data Center Tape Deck Documentation library is a collection of over 400 manuals describing NCDC's digital holdings (both historic and...

  16. Registration document 2005

    International Nuclear Information System (INIS)

    This reference document of Gaz de France provides information and data on the Group's activities in 2005: financial information, business activities, equipment, factories and real estate, trade, capital, organization charts, employment, contracts and research programs. (A.L.B.)

  17. Transportation System Requirements Document

    International Nuclear Information System (INIS)

    This Transportation System Requirements Document (Trans-SRD) describes the functions to be performed by and the technical requirements for the Transportation System to transport spent nuclear fuel (SNF) and high-level radioactive waste (HLW) from Purchaser and Producer sites to a Civilian Radioactive Waste Management System (CRWMS) site, and between CRWMS sites. The purpose of this document is to define the system-level requirements for Transportation consistent with the CRWMS Requirement Document (CRD). These requirements include design and operations requirements to the extent they impact on the development of the physical segments of Transportation. The document also presents an overall description of Transportation, its functions, its segments, and the requirements allocated to the segments and the system-level interfaces with Transportation. The interface identification and description are published in the CRWMS Interface Specification

  18. Text analysis devices, articles of manufacture, and text analysis methods

    Science.gov (United States)

    Turner, Alan E; Hetzler, Elizabeth G; Nakamura, Grant C

    2013-05-28

    Text analysis devices, articles of manufacture, and text analysis methods are described according to some aspects. In one aspect, a text analysis device includes processing circuitry configured to analyze initial text to generate a measurement basis usable in analysis of subsequent text, wherein the measurement basis comprises a plurality of measurement features from the initial text, a plurality of dimension anchors from the initial text and a plurality of associations of the measurement features with the dimension anchors, and wherein the processing circuitry is configured to access a viewpoint indicative of a perspective of interest of a user with respect to the analysis of the subsequent text, and wherein the processing circuitry is configured to use the viewpoint to generate the measurement basis.

  19. Text-Attentional Convolutional Neural Network for Scene Text Detection

    Science.gov (United States)

    He, Tong; Huang, Weilin; Qiao, Yu; Yao, Jian

    2016-06-01

    Recent deep learning models have demonstrated strong capabilities for classifying text and non-text components in natural images. They extract a high-level feature computed globally from a whole image component (patch), where the cluttered background information may dominate true text features in the deep representation. This leads to less discriminative power and poorer robustness. In this work, we present a new system for scene text detection by proposing a novel Text-Attentional Convolutional Neural Network (Text-CNN) that particularly focuses on extracting text-related regions and features from the image components. We develop a new learning mechanism to train the Text-CNN with multi-level and rich supervised information, including text region mask, character label, and binary text/non-text information. The rich supervision information enables the Text-CNN with a strong capability for discriminating ambiguous texts, and also increases its robustness against complicated background components. The training process is formulated as a multi-task learning problem, where low-level supervised information greatly facilitates the main task of text/non-text classification. In addition, a powerful low-level detector called Contrast-Enhancement Maximally Stable Extremal Regions (CE-MSERs) is developed, which extends the widely-used MSERs by enhancing intensity contrast between text patterns and background. This allows it to detect highly challenging text patterns, resulting in a higher recall. Our approach achieved promising results on the ICDAR 2013 dataset, with an F-measure of 0.82, improving the state-of-the-art results substantially.

  20. Stroke Briefing: Technical Documentation

    OpenAIRE

    Institute of Public Health in Ireland

    2012-01-01

    A stroke happens when blood flow to a part of the brain is interrupted by a blocked or burst blood vessel. A lack of blood supply can damage brain cells and affect body functions. IPH has systematically estimated and forecast the prevalence of stroke on the island of Ireland. This document details the methods used to calculate these estimates and forecasts.

  1. 2002 reference document

    International Nuclear Information System (INIS)

    This 2002 reference document of the group Areva, provides information on the society. Organized in seven chapters, it presents the persons responsible for the reference document and for auditing the financial statements, information pertaining to the transaction, general information on the company and share capital, information on company operation, changes and future prospects, assets, financial position, financial performance, information on company management and executive board and supervisory board, recent developments and future prospects. (A.L.B.)

  2. Contrastive Study of Coherence in Chinese Text and English Text

    Institute of Scientific and Technical Information of China (English)

    王婷

    2013-01-01

    The paper presents the text-linguistic concepts on which the analysis of textual structure is based, including text and discourse, and coherence and cohesion. In addition, we try to discover different manifestations of coherence between English texts (ET) and Chinese texts (CT), including different coherence structures.

  3. Documentation of spectrom-32

    International Nuclear Information System (INIS)

    SPECTROM-32 is a finite element program for analyzing two-dimensional and axisymmetric inelastic thermomechanical problems related to the geological disposal of nuclear waste. The code is part of the SPECTROM series of special-purpose computer programs that are being developed by RE/SPEC Inc. to address many unique rock mechanics problems encountered in analyzing radioactive wastes stored in geologic formations. This document presents the theoretical basis for the mathematical models, the finite element formulation and solution procedure of the program, a description of the input data for the program, verification problems, and details about program support and continuing documentation. The computer code documentation is intended to satisfy the requirements and guidelines outlined in the document entitled Final Technical Position on Documentation of Computer Codes for High-Level Waste Management. The principal component models used in the program involve thermoelastic, thermoviscoelastic, thermoelastic-plastic, and thermoviscoplastic types of material behavior. Special material considerations provide for the incorporation of limited-tension material behavior and consideration of jointed material behavior. Numerous program options provide the capabilities for various boundary conditions, sliding interfaces, excavation, backfill, arbitrary initial stresses, multiple material domains, load incrementation, plotting database storage and access of results, and other features unique to the geologic disposal of radioactive wastes. Numerous verification problems that exercise many of the program options and illustrate the required data input and printed results are included in the documentation.

  4. Documentation of spectrom-32

    International Nuclear Information System (INIS)

    SPECTROM-32 is a finite element program for analyzing two-dimensional and axisymmetric inelastic thermomechanical problems related to the geological disposal of nuclear waste. The code is part of the SPECTROM series of special-purpose computer programs that are being developed by RE/SPEC Inc. to address many unique rock mechanics problems encountered in analyzing radioactive wastes stored in geologic formations. This document presents the theoretical basis for the mathematical models, the finite element formulation and solution procedure of the program, a description of the input data for the program, verification problems, and details about program support and continuing documentation. The computer code documentation is intended to satisfy the requirements and guidelines outlined in the document entitled Final Technical Position on Documentation of Computer Codes for High-Level Waste Management. The principal component models used in the program involve thermoelastic, thermoviscoelastic, thermoelastic-plastic, and thermoviscoplastic types of material behavior. Special material considerations provide for the incorporation of limited-tension material behavior and consideration of jointed material behavior. Numerous program options provide the capabilities for various boundary conditions, sliding interfaces, excavation, backfill, arbitrary initial stresses, multiple material domains, load incrementation, plotting database storage and access of results, and other features unique to the geologic disposal of radioactive wastes. Numerous verification problems that exercise many of the program options and illustrate the required data input and printed results are included in the documentation

  5. LCS Content Document Application

    Science.gov (United States)

    Hochstadt, Jake

    2011-01-01

    My project at KSC during my spring 2011 internship was to develop a Ruby on Rails application to manage Content Documents. A Content Document is a collection of documents and information that describes what software is installed on a Launch Control System Computer. It's important for us to make sure the tools we use every day are secure, up-to-date, and properly licensed. Previously, keeping track of the information was done by Excel and Word files between different personnel. The goal of the new application is to be able to manage and access the Content Documents through a single database backed web application. Our LCS team will benefit greatly from this app. Admins will be able to log in securely to keep track of and update the software installed on each computer in a timely manner. We also included exportability such as attaching additional documents that can be downloaded from the web application. The finished application will ease the process of managing Content Documents while streamlining the procedure. Ruby on Rails is a very powerful programming language and I am grateful to have the opportunity to build this application.

  6. Survey on Feature Selection in Document Clustering

    Directory of Open Access Journals (Sweden)

    MS. K.Mugunthadevi,

    2011-03-01

    Full Text Available Text mining refers to technologies for discovering useful knowledge from enormous collections of documents, and for developing systems that provide knowledge and support decision making. Basically a cluster means a group of similar data; document clustering means segregating the data into different groups of similar data. Clustering is a fundamental data analysis technique used for various applications such as biology, psychology, control and signal processing, information theory and mining technologies. Text mining is not a stand-alone task that human analysts typically engage in. The goal is to transform text composed of everyday language into a structured, database format. In this way, heterogeneous documents are summarized and presented in a uniform manner. Among others, the challenging problems of text clustering are big volume, high dimensionality and complex semantics.
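
    To make the clustering pipeline surveyed above concrete, the sketch below clusters a handful of short documents with TF-IDF features and k-means using scikit-learn; the sample documents and the choice of two clusters are illustrative assumptions rather than material from the record.

        # Minimal document-clustering sketch: TF-IDF features + k-means.
        # The documents and k=2 are illustrative assumptions, not from the survey.
        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.cluster import KMeans

        docs = [
            "nuclear waste disposal in geologic formations",
            "radioactive waste repository design",
            "text mining and document clustering methods",
            "feature selection for text categorization",
        ]

        vectorizer = TfidfVectorizer(stop_words="english")   # bag-of-words weighting
        X = vectorizer.fit_transform(docs)                    # sparse term-document matrix

        kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
        print(kmeans.labels_)                                 # cluster id per document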

  7. Endangered Language Documentation and Transmission

    Directory of Open Access Journals (Sweden)

    D. Victoria Rau

    2007-01-01

    Full Text Available This paper describes an on-going project on digital archiving Yami language documentation (http://www.hrelp.org/grants/projects/index.php?projid=60). We present a cross-disciplinary approach, involving computer science and applied linguistics, to document the Yami language and prepare teaching materials. Our discussion begins with an introduction to an integrated framework for archiving, processing and developing learning materials for Yami (Yang and Rau 2005), followed by a historical account of Yami language teaching, from a grammatical syllabus (Dong and Rau 2000b) to a communicative syllabus using a multimedia CD as a resource (Rau et al. 2005), to the development of interactive on-line learning based on the digital archiving project. We discuss the methods used and challenges of each stage of preparing Yami teaching materials, and present a proposal for rethinking pedagogical models for e-learning.

  8. Wilmar joint market model, Documentation

    Energy Technology Data Exchange (ETDEWEB)

    Meibom, P.; Larsen, Helge V. [Risoe National Lab. (Denmark); Barth, R.; Brand, H. [IER, Univ. of Stuttgart (Germany); Weber, C.; Voll, O. [Univ. of Duisburg-Essen (Germany)

    2006-01-15

    The Wilmar Planning Tool is developed in the project Wind Power Integration in Liberalised Electricity Markets (WILMAR) supported by EU (Contract No. ENK5-CT-2002-00663). A User Shell implemented in an Excel workbook controls the Wilmar Planning Tool. All data are contained in Access databases that communicate with various sub-models through text files that are exported from or imported to the databases. The Joint Market Model (JMM) constitutes one of these sub-models. This report documents the Joint Market model (JMM). The documentation describes: 1. The file structure of the JMM. 2. The sets, parameters and variables in the JMM. 3. The equations in the JMM. 4. The looping structure in the JMM. (au)

  9. A Two Step Data Mining Approach for Amharic Text Classification

    Directory of Open Access Journals (Sweden)

    Seffi Gebeyehu

    2016-08-01

    Full Text Available Traditionally, text classifiers are built from labeled training examples (supervised). Labeling is usually done manually by human experts (or the users), which is a labor intensive and time consuming process. In the past few years, researchers have investigated various forms of semi-supervised learning to reduce the burden of manual labeling. This paper aims to show how the accuracy of learned text classifiers can be improved by augmenting a small number of labeled training documents with a large pool of unlabeled documents. This is important because in many text classification problems obtaining training labels is expensive, while large quantities of unlabeled documents are readily available. In this paper, we implement an algorithm for learning from labeled and unlabeled documents based on the combination of Expectation-Maximization (EM) and two classifiers: Naive Bayes (NB) and locally weighted learning (LWL). NB first trains a classifier using the available labeled documents and probabilistically labels the unlabeled documents, while LWL uses a class of function approximation to build a model around the current point of interest. An experiment conducted on a mixture of labeled and unlabeled Amharic text documents showed that the new method achieved a significant improvement in performance in comparison with supervised LWL and NB. The result also pointed out that the use of unlabeled data with EM reduces the classification absolute error by 27.6%. In general, since unlabeled documents are much less expensive and easier to collect than labeled documents, this method will be useful for text categorization tasks including online data sources such as web pages, e-mails and news group postings. If one uses this method, building text categorization systems will be significantly faster and less expensive than with the supervised learning approach.
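
    The following is a minimal, generic sketch of the EM-with-Naive-Bayes idea described above, using scikit-learn; it is not the authors' Amharic system, and the toy corpus, the number of iterations and the use of hard (argmax) labels weighted by confidence are simplifying assumptions.

        # Sketch of semi-supervised learning with Naive Bayes (not the paper's system):
        # train on labeled data, probabilistically label unlabeled data, retrain, repeat.
        import numpy as np
        from sklearn.feature_extraction.text import CountVectorizer
        from sklearn.naive_bayes import MultinomialNB

        labeled   = ["good service fast delivery", "terrible slow broken"]
        labels    = np.array([1, 0])
        unlabeled = ["fast friendly service", "broken item slow refund"]

        vec = CountVectorizer()
        X_lab = vec.fit_transform(labeled)
        X_unl = vec.transform(unlabeled)

        nb = MultinomialNB().fit(X_lab, labels)               # step 0: labeled data only
        for _ in range(5):                                    # a few EM-style iterations
            probs = nb.predict_proba(X_unl)                   # E-step: soft labels
            X_all = np.vstack([X_lab.toarray(), X_unl.toarray()])
            y_all = np.concatenate([labels, probs.argmax(axis=1)])
            w_all = np.concatenate([np.ones(len(labels)), probs.max(axis=1)])
            nb = MultinomialNB().fit(X_all, y_all, sample_weight=w_all)  # M-step
        print(nb.predict(X_unl))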

  10. Text mining from ontology learning to automated text processing applications

    CERN Document Server

    Biemann, Chris

    2014-01-01

    This book comprises a set of articles that specify the methodology of text mining, describe the creation of lexical resources in the framework of text mining and use text mining for various tasks in natural language processing (NLP). The analysis of large amounts of textual data is a prerequisite to build lexical resources such as dictionaries and ontologies and also has direct applications in automated text processing in fields such as history, healthcare and mobile applications, just to name a few. This volume gives an update in terms of the recent gains in text mining methods and reflects

  11. Document segmentation for high-quality printing

    Science.gov (United States)

    Ancin, Hakan

    1997-04-01

    A technique to segment dark text on the light background of mixed mode color documents is presented. This process does not perceptually change graphics and photo regions. Color documents are scanned and printed from various media which usually do not have a clean background. This is especially the case for printouts generated from thin magazine samples; these printouts usually include text and figures from the back of the page, which is called bleeding. Removal of bleeding artifacts improves the perceptual quality of the printed document and reduces color ink usage. By detecting the light background of the document, these artifacts are removed from background regions. Also, detection of dark text regions enables the halftoning algorithms to use true black ink for the black text pixels instead of composite black. The processed document contains sharp black text on a white background, resulting in improved perceptual quality and better ink utilization. The described method is memory efficient and requires a small number of scan lines of high resolution color documents during processing.
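
    The sketch below illustrates only the basic idea of whitening a light background to suppress bleed-through; the synthetic page and the fixed grey-level threshold are assumptions and do not reproduce the paper's segmentation method.

        # Simplified bleed-through suppression: pixels lighter than a threshold are
        # treated as background and forced to white; dark pixels (text) are kept.
        import numpy as np
        from PIL import Image

        # A tiny synthetic grey-level page: dark "text" strokes on a light,
        # slightly dirty background standing in for bleed-through.
        rng = np.random.default_rng(0)
        page = rng.integers(200, 231, size=(64, 64), dtype=np.uint8)
        page[20:24, 8:56] = 40                          # a dark text line

        BACKGROUND_THRESHOLD = 180                      # assumed value, not from the paper
        cleaned = page.copy()
        cleaned[cleaned > BACKGROUND_THRESHOLD] = 255   # whiten faint background artifacts

        Image.fromarray(cleaned).save("cleaned_page.png")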

  12. Research on Text Mining Based on Domain Ontology

    OpenAIRE

    Li-hua, Jiang; Neng-fu, Xie; Hong-bin, Zhang

    2013-01-01

    This paper improves on traditional text mining technology, which cannot understand text semantics. The author discusses text mining methods based on ontology and puts forward a text mining model based on domain ontology. The ontology structure is built first and the "concept-concept" similarity matrix is introduced; then a conception vector space model based on domain ontology is used in place of the traditional vector space model to represent documents in order to realize text m...

  13. Author Gender Identification from Text

    OpenAIRE

    Rezaei, Atoosa Mohammad

    2014-01-01

    ABSTRACT: The identification of an author's gender from a text has become a popular research area within the scope of text categorization. The number of users of social network applications based on text, such as Twitter, Facebook and text messaging services, has grown rapidly over the past few decades. As a result, text has become one of the most important and prevalent media types on the Internet. This thesis aims to determine the gender of an author from an arbitrary piece of text such as,...

  14. Extracting information from free-text mammography reports

    OpenAIRE

    Esuli, Andrea; Marcheggiani, Diego; Sebastiani, Fabrizio

    2010-01-01

    Researchers from ISTI-CNR, Pisa, aim at effectively and efficiently extracting information from free-text mammography reports, as a step towards the automatic transformation of unstructured medical documentation into structured data.

  15. Using LSA and text segmentation to improve automatic Chinese dialogue text summarization

    Institute of Scientific and Technical Information of China (English)

    LIU Chuan-han; WANG Yong-cheng; ZHENG Fei; LIU De-rong

    2007-01-01

    Automatic Chinese text summarization for dialogue style is a relatively new research area. In this paper, Latent Semantic Analysis (LSA) is first used to extract semantic knowledge from a given document and all question paragraphs are identified; an automatic text segmentation approach analogous to TextTiling is exploited to improve the precision of correlating question paragraphs and answer paragraphs; and finally some "important" sentences are extracted from the generic content and the question-answer pairs to generate a complete summary. Experimental results showed that our approach is highly efficient and significantly improves the coherence of the summary while not compromising informativeness.
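
    As a rough illustration of the LSA step, the sketch below ranks sentences by their weight on the first latent topic of a TF-IDF matrix using scikit-learn; the sample sentences, the number of components and the ranking rule are assumptions, and the TextTiling-style segmentation is not shown.

        # Generic LSA-based sentence ranking sketch (not the paper's dialogue system):
        # sentences with the largest weight on the first latent topic are selected.
        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.decomposition import TruncatedSVD

        sentences = [
            "The customer asked how to reset the device.",
            "The agent explained the reset procedure step by step.",
            "They also discussed the weather briefly.",
            "The reset procedure solved the customer's problem.",
        ]

        X = TfidfVectorizer(stop_words="english").fit_transform(sentences)
        svd = TruncatedSVD(n_components=2, random_state=0).fit(X)
        scores = svd.transform(X)[:, 0]                 # weight of each sentence on topic 1
        top = scores.argsort()[::-1][:2]                # two highest-scoring sentences
        print([sentences[i] for i in sorted(top)])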

  16. Problems and Methods of Source Study of Cinema Documents

    Directory of Open Access Journals (Sweden)

    Grigory N. Lanskoy

    2016-03-01

    Full Text Available The article is devoted to basic problems of the analysis and interpretation of cinema documents in historical studies, among them the possibility of a shared approach to the study of cinema and paper documents, the use of principles from art studies in the analysis of cinema documents, and the efficacy of a textual approach to the study of cinema documents. The forms of applying different scientific methods to the evaluation of cinema documents as historical sources are also discussed in the article.

  17. An Evident Theoretic Feature Selection Approach for Text Categorization

    OpenAIRE

    UMARSATHIC ALI; JOTHI VENKATESWARAN

    2012-01-01

    With the exponential growth of textual documents available in unstructured form on the Internet, feature selection approaches are increasingly significant for the preprocessing of textual documents for automatic text categorization. Feature selection, which focuses on identifying relevant and informative features, can help reduce the computational cost of processing voluminous amounts of data as well as increase the effectiveness of the subsequent text categorization tasks. In this paper, we ...
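
    For comparison with the evidence-theoretic approach proposed in the record, the sketch below shows a conventional feature-selection baseline, chi-squared scoring with scikit-learn's SelectKBest; the toy corpus, labels and value of k are assumptions.

        # Baseline feature selection for text categorization: keep the k terms most
        # associated with the class labels according to a chi-squared test.
        from sklearn.feature_extraction.text import CountVectorizer
        from sklearn.feature_selection import SelectKBest, chi2

        docs   = ["cheap pills buy now", "meeting agenda attached",
                  "win money now", "project status report"]
        labels = [1, 0, 1, 0]                     # 1 = spam, 0 = ham (assumed labels)

        vec = CountVectorizer()
        X = vec.fit_transform(docs)
        selector = SelectKBest(chi2, k=3).fit(X, labels)
        kept = selector.get_support(indices=True)
        print([vec.get_feature_names_out()[i] for i in kept])   # the retained terms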

  18. Multioriented and Curved Text Lines Extraction from Documents

    OpenAIRE

    Vaibhav Gavali; B. R. Bombade

    2013-01-01

    There is a need for robust segmentation algorithms to extract text lines from documents independently of script, color, font and size. This paper presents a simple method to extract curved and multioriented text lines from documents. The input may be a color or grayscale image. A discrete wavelet transform is applied to the input image to obtain four sub-bands. Thresholding is applied on three of the sub-bands (horizontal, vertical, diagonal). Edge detection is applied on t...

  19. Text mining: A Brief survey

    OpenAIRE

    Falguni N. Patel; Neha R. Soni

    2012-01-01

    The unstructured texts which contain massive amounts of information cannot simply be used for further processing by computers. Therefore, specific processing methods and algorithms are required in order to extract useful patterns. The process of extracting interesting information and knowledge from unstructured text is accomplished by text mining. In this paper, we discuss text mining as a recent and interesting field, detailing the steps involved in the overall process. We have...

  20. Relation Based Mining Model for Enhancing Web Document Clustering

    Directory of Open Access Journals (Sweden)

    M.Reka

    2014-05-01

    Full Text Available The design of web information management systems has become more complex, with greater time complexity. Information retrieval is a difficult task due to the huge volume of web documents. Clustering makes retrieval easier and less time consuming. This algorithm introduces a web document clustering approach which uses the semantic relation between documents, reducing the time complexity. It identifies the relations and concepts in a document and also computes the relation score between documents. This algorithm analyses the key concepts from the web documents by preprocessing, stemming, and stop word removal. Identified concepts are used to compute the document relation score and cluster relation score. The domain ontology is used to compute the document relation score and cluster relation score. Based on the document relation score and cluster relation score, the web document cluster is identified. This algorithm uses 200,000 web documents for evaluation, with 60 percent as the training set and 40 percent as the testing set.

  1. SEMANTIC METADATA FOR HETEROGENEOUS SPATIAL PLANNING DOCUMENTS

    Directory of Open Access Journals (Sweden)

    A. Iwaniak

    2016-09-01

    Full Text Available Spatial planning documents contain information about the principles and rights of land use in different zones of a local authority. They are the basis for administrative decision making in support of sustainable development. In Poland these documents are published on the Web according to a prescribed non-extendable XML schema, designed for optimum presentation to humans in HTML web pages. There is no document standard, and limited functionality exists for adding references to external resources. The text in these documents is discoverable and searchable by general-purpose web search engines, but the semantics of the content cannot be discovered or queried. The spatial information in these documents is geographically referenced but not machine-readable. Major manual efforts are required to integrate such heterogeneous spatial planning documents from various local authorities for analysis, scenario planning and decision support. This article presents results of an implementation using machine-readable semantic metadata to identify relationships among regulations in the text, spatial objects in the drawings and links to external resources. A spatial planning ontology was used to annotate different sections of spatial planning documents with semantic metadata in the Resource Description Framework in Attributes (RDFa). The semantic interpretation of the content, links between document elements and links to external resources were embedded in XHTML pages. An example and use case from the spatial planning domain in Poland is presented to evaluate its efficiency and applicability. The solution enables the automated integration of spatial planning documents from multiple local authorities to assist decision makers with understanding and interpreting spatial planning information. The approach is equally applicable to legal documents from other countries and domains, such as cultural heritage and environmental management.

  2. Predicting Prosody from Text for Text-to-Speech Synthesis

    CERN Document Server

    Rao, K Sreenivasa

    2012-01-01

    Predicting Prosody from Text for Text-to-Speech Synthesis covers the specific aspects of prosody, mainly focusing on how to predict the prosodic information from linguistic text, and then how to exploit the predicted prosodic knowledge for various speech applications. Author K. Sreenivasa Rao discusses proposed methods along with state-of-the-art techniques for the acquisition and incorporation of prosodic knowledge for developing speech systems. Positional, contextual and phonological features are proposed for representing the linguistic and production constraints of the sound units present in the text. This book is intended for graduate students and researchers working in the area of speech processing.

  3. Monitoring interaction and collective text production through text mining

    Directory of Open Access Journals (Sweden)

    Macedo, Alexandra Lorandi

    2014-04-01

    Full Text Available This article presents the Concepts Network tool, developed using text mining technology. The main objective of this tool is to extract and relate terms of greatest incidence from a text and exhibit the results in the form of a graph. The Network was implemented in the Collective Text Editor (CTE), which is an online tool that allows the production of texts in synchronized or non-synchronized forms. This article describes the application of the Network both in texts produced collectively and in texts produced in a forum. The purpose of the tool is to offer support to the teacher in managing the high volume of data generated in the process of interaction amongst students and in the construction of the text. Specifically, the aim is to facilitate the teacher’s job by allowing him/her to process data in a shorter time than is currently demanded. The results suggest that the Concepts Network can aid the teacher, as it provides indicators of the quality of the text produced. Moreover, messages posted in forums can be analyzed without their content necessarily having to be pre-read.
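
    A minimal sketch of the underlying idea, building a weighted graph of co-occurring terms, is given below using networkx; the sample texts, the crude stop-word list and the minimum-count cut-off are assumptions and this is not the CTE implementation.

        # Sketch of a "network of co-occurring terms": count how often word pairs
        # appear in the same text and keep recurring pairs as weighted graph edges.
        from itertools import combinations
        from collections import Counter
        import networkx as nx

        texts = [
            "students write the collective text together",
            "the teacher monitors the collective text production",
            "students discuss the text in the forum",
        ]

        pair_counts = Counter()
        for t in texts:
            words = sorted(set(t.split()) - {"the", "in"})   # crude stop-word removal
            pair_counts.update(combinations(words, 2))

        graph = nx.Graph()
        for (a, b), n in pair_counts.items():
            if n >= 2:                                       # keep recurring pairs only
                graph.add_edge(a, b, weight=n)
        print(graph.edges(data=True))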

  4. Text comprehension practice in school

    Directory of Open Access Journals (Sweden)

    Hernández, José Emilio

    2010-01-01

    Full Text Available The starting point of the study is the existence of relations between the two dimensions of text comprehension: the instrumental dimension and the cognitive dimension. The first one includes the system of actions, the second one the system of knowledge. Descriptions of identifying, describing, inferring, appraising and creating actions are suggested for each type of text. Likewise, the importance of implementing text comprehension is outlined on the basis of the assumption that the text is a tool for preserving and communicating culture, one that allows human beings to widen their respective cultural horizons and develop the cognitive and affective processes that allow them to grasp universal moral values.

  5. ONTOLOGY BASED DOCUMENT CLUSTERING USING MAPREDUCE

    Directory of Open Access Journals (Sweden)

    Abdelrahman Elsayed

    2015-05-01

    Full Text Available Nowadays, document clustering is considered a data intensive task due to the dramatic, fast increase in the number of available documents. Moreover, the feature sets that represent those documents are also very large. The most common method for representing documents is the vector space model, which represents document features as a bag of words and does not represent semantic relations between words. In this paper we introduce a distributed implementation of bisecting k-means using the MapReduce programming model. The aim behind our proposed implementation is to solve the problem of clustering intensive data documents. In addition, we propose integrating the WordNet ontology with bisecting k-means in order to utilize the semantic relations between words to enhance document clustering results. Our experimental results show that using lexical categories for nouns only enhances internal evaluation measures of document clustering and decreases the number of document features from thousands to tens. Our experiments were conducted using Amazon Elastic MapReduce to deploy the bisecting k-means algorithm.
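
    The sketch below shows a plain, single-machine version of bisecting k-means built on scikit-learn; the MapReduce deployment and the WordNet integration described in the record are not reproduced, and the random toy vectors are an assumption.

        # Minimal bisecting k-means sketch: repeatedly split the largest cluster
        # in two until the requested number of clusters is reached.
        import numpy as np
        from sklearn.cluster import KMeans

        def bisecting_kmeans(X, n_clusters, seed=0):
            clusters = [np.arange(X.shape[0])]            # start with one big cluster
            while len(clusters) < n_clusters:
                biggest = max(range(len(clusters)), key=lambda i: len(clusters[i]))
                idx = clusters.pop(biggest)
                labels = KMeans(n_clusters=2, n_init=10,
                                random_state=seed).fit_predict(X[idx])
                clusters.append(idx[labels == 0])
                clusters.append(idx[labels == 1])
            return clusters

        X = np.random.RandomState(0).rand(20, 5)          # toy document vectors (assumed)
        print([len(c) for c in bisecting_kmeans(X, 4)])   # sizes of the four clusters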

  6. Mining knowledge from text repositories using information extraction: A review

    Indian Academy of Sciences (India)

    Sandeep R Sirsat; Dr Vinay Chavan; Dr Shrinivas P Deshpande

    2014-02-01

    There are two approaches to mining text from online repositories. First, when the knowledge to be discovered is expressed directly in the documents to be mined, Information Extraction (IE) alone can serve as an effective tool for such text mining. Second, when the documents contain concrete data in unstructured form rather than abstract knowledge, Information Extraction (IE) can be used to first transform the unstructured data in the document corpus into a structured database, and then some state-of-the-art data mining algorithms/tools can be used to identify abstract patterns in this extracted data. This paper presents a review of several methods related to these two approaches.

  7. Methods for Mining and Summarizing Text Conversations

    CERN Document Server

    Carenini, Giuseppe; Murray, Gabriel

    2011-01-01

    Due to the Internet Revolution, human conversational data -- in written forms -- are accumulating at a phenomenal rate. At the same time, improvements in speech technology enable many spoken conversations to be transcribed. Individuals and organizations engage in email exchanges, face-to-face meetings, blogging, texting and other social media activities. The advances in natural language processing provide ample opportunities for these "informal documents" to be analyzed and mined, thus creating numerous new and valuable applications. This book presents a set of computational methods

  8. Use of Printed and Online Documents.

    Science.gov (United States)

    Poupa, Christine

    2001-01-01

    Explains how written material started; describes the nature and supply of electronic documents; characterizes student practices in using paper texts and online electronic texts in higher education, including the role of reading; and considers communication and speed and informational inflation and choice. (Author/LRW)

  9. Customer Communication Document

    Science.gov (United States)

    2009-01-01

    This procedure communicates to the Customers of the Automation, Robotics and Simulation Division (AR&SD) Dynamics Systems Test Branch (DSTB) how to obtain services of the Six-Degrees-Of-Freedom Dynamic Test System (SDTS). The scope includes the major communication documents between the SDTS and its Customer. It establishes the initial communication and contact points as well as provides the initial documentation in electronic media for the customer. Contact the SDTS Manager (SM) for the names and numbers of the current contact points.

  10. Document Management vs. Knowledge Management

    Directory of Open Access Journals (Sweden)

    Sergiu JECAN

    2008-01-01

    Full Text Available Most large organizations have been investing in various disconnected management technologies during the past few years. Efforts to improve management have been especially noticeable over the last 18-24 months, as organizations try to tame the chaos behind their public internet and internal intranet sites. More recently, regulatory concerns have reawakened interest in records management, archiving and document management. In addition, organizations seeking to increase innovation and overall employee efficiency have initiated projects to improve collaborative capabilities. With business models constantly changing and organizations moving to outsourced solutions, the drive towards improving business processes has never been greater. Organizations expect outsourcing to streamline business processes efficiently and effectively if they are to achieve rapid payback and return on investment (ROI). This is where workflow, document management and knowledge management can support the in-house and outsourced business process improvements that help CEOs gain the business benefits they seek in order to remain competitive. We will show how processes can be improved through workflow, document management and knowledge management.

  11. GURMUKHI TEXT EXTRACTION FROM IMAGE USING SUPPORT VECTOR MACHINE (SVM

    Directory of Open Access Journals (Sweden)

    SUKHWINDER KAUR

    2011-04-01

    Full Text Available Extensive research has been done on image classification for different purposes like face recognition, identification of different objects and identification/extraction of text from images having some background. Text identification is an active research area whereby a system tries to identify the text area in a given image. The text area identified is then passed to an OCR system for further recognition of the text. This work is about classifying image areas into two classes, text and non-text, using an SVM (support vector machine). We identify the features and train a model based on the feature vector, which is then used to classify text and non-text areas in an image. The system reports 70.5% accuracy for caption text images, 70.43% for document text images and 50.40% for scene text images.
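
    The classification step can be pictured with the short scikit-learn sketch below, an SVM trained on labelled region feature vectors; the random feature values stand in for real region descriptors and are purely an assumption.

        # Generic sketch of the classification step: an SVM trained on feature vectors
        # extracted from image regions, labelled text (1) or non-text (0).
        import numpy as np
        from sklearn.svm import SVC

        rng = np.random.RandomState(0)
        features = rng.rand(40, 8)                 # 40 regions, 8 features each (assumed)
        labels = rng.randint(0, 2, size=40)        # 1 = text region, 0 = non-text region

        clf = SVC(kernel="rbf", C=1.0).fit(features, labels)
        new_region = rng.rand(1, 8)
        print("text region" if clf.predict(new_region)[0] == 1 else "non-text region")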

  12. Sustainable, Extensible Documentation Generation Using inlinedocs

    Directory of Open Access Journals (Sweden)

    Toby Dylan Hocking

    2013-09-01

    Full Text Available This article presents inlinedocs, an R package for generating documentation from comments. The concept of structured, interwoven code and documentation has existed for many years, but existing systems that implement this for the R programming language do not tightly integrate with R code, leading to several drawbacks. This article attempts to address these issues and presents 2 contributions for documentation generation for the R community. First, we propose a new syntax for inline documentation of R code within comments adjacent to the relevant code, which allows for highly readable and maintainable code and documentation. Second, we propose an extensible system for parsing these comments, which allows the syntax to be easily augmented.

  13. Document Summarization Using Positive Pointwise Mutual Information

    Directory of Open Access Journals (Sweden)

    Aji S

    2012-05-01

    Full Text Available The degree of success in document summarization processes depends on the performance of the method used in identifying significant sentences in the documents. The collection of unique words characterizes the major signature of the document, and forms the basis for the Term-Sentence-Matrix (TSM). Positive Pointwise Mutual Information, which works well for measuring semantic similarity in the Term-Sentence-Matrix, is used in our method to assign weights for each entry in the Term-Sentence-Matrix. The Sentence-Rank-Matrix generated from this weighted TSM is then used to extract a summary from the document. Our experiments show that such a method would outperform most of the existing methods in producing summaries from large documents.
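
    A generic sketch of PPMI weighting of a term-sentence matrix, with a crude sentence ranking, is given below in NumPy; the toy counts and the column-sum ranking rule are assumptions and not the paper's exact Sentence-Rank-Matrix construction.

        # PPMI weighting of a term-sentence matrix and a simple sentence ranking.
        import numpy as np

        # rows = terms, columns = sentences; counts are illustrative assumptions
        tsm = np.array([[2, 0, 1],
                        [1, 1, 0],
                        [0, 3, 1]], dtype=float)

        total = tsm.sum()
        p_ts = tsm / total                                   # joint probabilities
        p_t = tsm.sum(axis=1, keepdims=True) / total         # term marginals
        p_s = tsm.sum(axis=0, keepdims=True) / total         # sentence marginals

        with np.errstate(divide="ignore", invalid="ignore"):
            pmi = np.log(p_ts / (p_t @ p_s))
        ppmi = np.where(np.isfinite(pmi) & (pmi > 0), pmi, 0.0)   # positive PMI only

        sentence_scores = ppmi.sum(axis=0)        # crude rank: total PPMI mass per sentence
        print(sentence_scores.argsort()[::-1])    # sentence indices, best first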

  14. Detection of Plagiarism in Arabic Documents

    Directory of Open Access Journals (Sweden)

    Mohamed El Bachir Menai

    2012-09-01

    Full Text Available Many language-sensitive tools for detecting plagiarism in natural language documents have been developed, particularly for English. Language-independent tools exist as well, but are considered restrictive as they usually do not take into account specific language features. Detecting plagiarism in Arabic documents is particularly a challenging task because of the complex linguistic structure of Arabic. In this paper, we present a plagiarism detection tool for comparison of Arabic documents to identify potential similarities. The tool is based on a new comparison algorithm that uses heuristics to compare suspect documents at different hierarchical levels to avoid unnecessary comparisons. We evaluate its performance in terms of precision and recall on a large data set of Arabic documents, and show its capability in identifying direct and sophisticated copying, such as sentence reordering and synonym substitution. We also demonstrate its advantages over other plagiarism detection tools, including Turnitin, the well-known language-independent tool.

  15. Exploring lexical patterns in text

    OpenAIRE

    Teich, Elke; Fankhauser, Peter

    2005-01-01

    We present a system for the linguistic exploration and analysis of lexical cohesion in English texts. Using an electronic thesaurus-like resource, Princeton WordNet, and the Brown Corpus of English, we have implemented a process of annotating text with lexical chains and a graphical user interface for inspection of the annotated text. We describe the system and report on some sample linguistic analyses carried out using the combined thesaurus-corpus resource.

  16. Text Mining Applications and Theory

    CERN Document Server

    Berry, Michael W

    2010-01-01

    Text Mining: Applications and Theory presents the state-of-the-art algorithms for text mining from both the academic and industrial perspectives.  The contributors span several countries and scientific domains: universities, industrial corporations, and government laboratories, and demonstrate the use of techniques from machine learning, knowledge discovery, natural language processing and information retrieval to design computational models for automated text analysis and mining. This volume demonstrates how advancements in the fields of applied mathematics, computer science, machine learning

  17. Text Type and Translation Strategy

    Institute of Scientific and Technical Information of China (English)

    刘福娟

    2015-01-01

    Translation strategy and translation standards are undoubtedly the core problems translators are confronted with in translation. There have arisen many kinds of translation strategies in translation history, among which the text type theory is considered an important breakthrough and a significant complement of traditional translation standards. This essay attempts to demonstrate the value of text typology (informative, expressive, and operative) to translation strategy, emphasizing the importance of text types and their communicative functions.

  18. Knowledge Representation in Travelling Texts

    DEFF Research Database (Denmark)

    Mousten, Birthe; Locmele, Gunta

    2014-01-01

    Today, information travels fast. Texts travel, too. In a corporate context, the question is how to manage which knowledge elements should travel to a new language area or market, and in which form? The decision to let knowledge elements travel or not travel highly depends on the limitation and the purpose of the text in a new context as well as on predefined parameters for text travel. For texts used in marketing and in technology, the question is whether culture-bound knowledge representation should be domesticated or kept as foreign elements, or should be mirrored or moulded, or should not travel at all...

  19. QA programme documentation

    International Nuclear Information System (INIS)

    The present paper deals with the following topics: The need for a documented Q.A. program; Establishing a Q.A. program; Q.A. activities; Fundamental policies; Q.A. policies; Quality objectives; Q.A. manual. (orig./RW)

  20. Hypertension Briefing: Technical documentation

    OpenAIRE

    Institute of Public Health in Ireland

    2012-01-01

    Blood pressure is the force exerted on artery walls as the heart pumps blood through the body. Hypertension, or high blood pressure, occurs when blood pressure is constantly higher than the pressure needed to carry blood through the body. This document details how the IPH uses a systematic and consistent method to produce prevalence data for hypertension on the island of Ireland.

  1. Course documentation report

    DEFF Research Database (Denmark)

    Buus, Lillian; Bygholm, Ann; Walther, Tina Dyngby Lyng

    A documentation report on the three pedagogical courses developed during the MVU project period. The report describes the three processes, taking their departure in the structure and material available in the virtual learning environment. The report also describes the way two of the courses developed...

  2. Extremely secure identification documents

    Energy Technology Data Exchange (ETDEWEB)

    Tolk, K.M. [Sandia National Labs., Albuquerque, NM (United States); Bell, M. [Sandia National Labs., Livermore, CA (United States)

    1997-09-01

    The technology developed in this project uses biometric information printed on the document and public key cryptography to ensure that an adversary cannot issue identification documents to unauthorized individuals or alter existing documents to allow their use by unauthorized individuals. This process can be used to produce many types of identification documents with much higher security than any currently in use. The system is demonstrated using a security badge as an example. This project focused on the technologies requiring development in order to make the approach viable with existing badge printing and laminating technologies. By far the most difficult was the image processing required to verify that the picture on the badge had not been altered. Another area that required considerable work was the high density printed data storage required to get sufficient data on the badge for verification of the picture. The image processing process was successfully tested, and recommendations are included to refine the badge system to ensure high reliability. A two dimensional data array suitable for printing the required data on the badge was proposed, but testing of the readability of the array had to be abandoned due to reallocation of the budgeted funds by the LDRD office.
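
    The general mechanism, binding document data to a digital signature so that altered documents fail verification, can be sketched as below with the third-party Python cryptography package; the badge payload fields and the choice of Ed25519 are assumptions and this is not the badge system described in the report.

        # Generic illustration of signing badge data with public key cryptography so
        # that altered documents fail verification (not the report's badge system).
        from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey
        from cryptography.exceptions import InvalidSignature

        issuer_key = Ed25519PrivateKey.generate()
        badge_payload = b"name=J. Doe;badge_id=0042;photo_hash=ab12cd34"   # assumed fields
        signature = issuer_key.sign(badge_payload)          # stored/printed with the badge

        public_key = issuer_key.public_key()                # held by verifiers
        try:
            public_key.verify(signature, badge_payload)     # raises if data was altered
            print("badge data verified")
        except InvalidSignature:
            print("badge data has been tampered with")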

  3. Integrated Criteria Document Arsenic

    NARCIS (Netherlands)

    Slooff W; Haring BJA; Hesse JM; Janus JA; Thomas R; van Beelen P; de Boer JLM; Boumans LJM; Buijsman E; Canton JH; Cremers PMA; van der Heijden CA; Knaap AGAC; Krajnc EI; Kramers PGN; Kreis IA; Kroese ED; Lebret E; Matthijsen AJCM; van de Meent D; van der Meulen A; Meulenbelt J; Taalman RDFM; Bijstra D; Bril J; Salomons W; van der Woerd KF

    1990-01-01

    This is the English edition of report 758701002. An appendix, also numbered 758701002 and entitled "Integrated Criteria Document Arsenicum: Effects", belongs to this report. Authors of the appendix: Hesse JM; Janus JA; Krajnc EI; Kroese ED

  4. Extremely secure identification documents

    International Nuclear Information System (INIS)

    The technology developed in this project uses biometric information printed on the document and public key cryptography to ensure that an adversary cannot issue identification documents to unauthorized individuals or alter existing documents to allow their use by unauthorized individuals. This process can be used to produce many types of identification documents with much higher security than any currently in use. The system is demonstrated using a security badge as an example. This project focused on the technologies requiring development in order to make the approach viable with existing badge printing and laminating technologies. By far the most difficult was the image processing required to verify that the picture on the badge had not been altered. Another area that required considerable work was the high density printed data storage required to get sufficient data on the badge for verification of the picture. The image processing process was successfully tested, and recommendations are included to refine the badge system to ensure high reliability. A two dimensional data array suitable for printing the required data on the badge was proposed, but testing of the readability of the array had to be abandoned due to reallocation of the budgeted funds by the LDRD office

  5. Analysis of Design Documentation

    DEFF Research Database (Denmark)

    Hansen, Claus Thorp

    1998-01-01

    has been established where we seek to identify useful design work patterns by retrospective analyses of documentation created during design projects. This paper describes the analysis method and a tentatively defined metric to evaluate identified work patterns, and presents results from the first analysis accomplished.

  6. Documentation of CORTAX

    OpenAIRE

    Leon Bettendorf; Albert Van der Horst

    2006-01-01

    CORTAX is applied in Bettendorf et al. (2006), a simulation study on the economic and welfare implications of reforms in corporate income taxation. This technical documentation of the model consists of the derivation and listing of the equations of the model and a justification of the calibration.

  7. Documentation of spectrom-41

    International Nuclear Information System (INIS)

    SPECTROM-41 is a finite element heat transfer computer program developed to analyze thermal problems related to nuclear waste disposal. The code is part of the SPECTROM (Special Purpose Engineering Codes for Thermal/Rock Mechanics) series of special purpose finite element programs that are continually being developed by RE/SPEC Inc. (RSI) to address many unique problems encountered in analyzing radioactive wastes stored in geologic formations. This document presents the theoretical basis for the mathematical model, the finite element formulation of the program, and a description of the input data for the program, along with details about program support and continuing documentation. The documentation is intended to satisfy the requirements and guidelines outlined in NUREG-0856. The principal component model used in the program is based on Fourier's law of heat conduction. Numerous program options provide the capability of considering various boundary conditions, material stratification and anisotropy, and time-dependent heat generation that are characteristic of problems involving the disposal of nuclear waste in geologic formations. Numerous verification problems are included in the documentation in addition to highlights of past and ongoing verification and validation efforts. A typical repository problem is solved using SPECTROM-41 to demonstrate the use of the program in addressing problems related to the disposal of nuclear waste.

  8. Geometric Correction for Braille Document Images

    Directory of Open Access Journals (Sweden)

    Padmavathi.S

    2016-04-01

    Full Text Available The Braille system has been used by visually impaired people for reading. The shortage of Braille books has caused a need for conversion of Braille to text. This paper addresses the geometric correction of Braille document images. Due to the standard measurement of the Braille cells, identification of Braille characters could be achieved by a simple cell overlapping procedure. The standard measurement varies in a scaled document, and fitting of the cells becomes difficult if the document is tilted. This paper proposes a line fitting algorithm for identifying the tilt (skew) angle. The horizontal and vertical scale factors are identified based on the ratio of the distance between characters to the distance between dots. These are used in the geometric transformation matrix for correction. Rotation correction is done prior to scale correction. This process aids in increased accuracy. The results for various Braille documents are tabulated.
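
    The line-fitting idea for skew estimation can be illustrated with the short NumPy sketch below; the dot-centroid coordinates are invented for illustration and the paper's actual dot detection and correction steps are not shown.

        # Estimate document skew by fitting a line through dot centroids that
        # belong to one Braille row (a generic illustration, not the paper's method).
        import numpy as np

        # (x, y) centroids of dots detected along a single Braille line, slightly tilted
        x = np.array([10, 40, 70, 100, 130, 160], dtype=float)
        y = np.array([52, 54, 55, 57, 58, 60], dtype=float)

        slope, intercept = np.polyfit(x, y, deg=1)      # least-squares line fit
        skew_degrees = np.degrees(np.arctan(slope))     # tilt angle of the document
        print(f"estimated skew: {skew_degrees:.2f} degrees")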

  9. Semantic Metadata for Heterogeneous Spatial Planning Documents

    Science.gov (United States)

    Iwaniak, A.; Kaczmarek, I.; Łukowicz, J.; Strzelecki, M.; Coetzee, S.; Paluszyński, W.

    2016-09-01

    Spatial planning documents contain information about the principles and rights of land use in different zones of a local authority. They are the basis for administrative decision making in support of sustainable development. In Poland these documents are published on the Web according to a prescribed non-extendable XML schema, designed for optimum presentation to humans in HTML web pages. There is no document standard, and limited functionality exists for adding references to external resources. The text in these documents is discoverable and searchable by general-purpose web search engines, but the semantics of the content cannot be discovered or queried. The spatial information in these documents is geographically referenced but not machine-readable. Major manual efforts are required to integrate such heterogeneous spatial planning documents from various local authorities for analysis, scenario planning and decision support. This article presents results of an implementation using machine-readable semantic metadata to identify relationships among regulations in the text, spatial objects in the drawings and links to external resources. A spatial planning ontology was used to annotate different sections of spatial planning documents with semantic metadata in the Resource Description Framework in Attributes (RDFa). The semantic interpretation of the content, links between document elements and links to external resources were embedded in XHTML pages. An example and use case from the spatial planning domain in Poland is presented to evaluate its efficiency and applicability. The solution enables the automated integration of spatial planning documents from multiple local authorities to assist decision makers with understanding and interpreting spatial planning information. The approach is equally applicable to legal documents from other countries and domains, such as cultural heritage and environmental management.

  10. Technical approach document

    Energy Technology Data Exchange (ETDEWEB)

    1989-12-01

    The Uranium Mill Tailings Radiation Control Act (UMTRCA) of 1978, Public Law 95-604 (PL95-604), grants the Secretary of Energy the authority and responsibility to perform such actions as are necessary to minimize radiation health hazards and other environmental hazards caused by inactive uranium mill sites. This Technical Approach Document (TAD) describes the general technical approaches and design criteria adopted by the US Department of Energy (DOE) in order to implement remedial action plans (RAPS) and final designs that comply with EPA standards. It does not address the technical approaches necessary for aquifer restoration at processing sites; a guidance document, currently in preparation, will describe aquifer restoration concerns and technical protocols. This document is a second revision to the original document issued in May 1986; the revision has been made in response to changes to the groundwater standards of 40 CFR 192, Subparts A--C, proposed by EPA as draft standards. New sections were added to define the design approaches and designs necessary to comply with the groundwater standards. These new sections are in addition to changes made throughout the document to reflect current procedures, especially in cover design, water resources protection, and alternate site selection; only minor revisions were made to some of the sections. Section 3.0 is a new section defining the approach taken in the design of disposal cells; Section 4.0 has been revised to include design of vegetated covers; Section 8.0 discusses design approaches necessary for compliance with the groundwater standards; and Section 9.0 is a new section dealing with nonradiological hazardous constituents. 203 refs., 18 figs., 26 tabs.

  11. Technical approach document

    International Nuclear Information System (INIS)

    The Uranium Mill Tailings Radiation Control Act (UMTRCA) of 1978, Public Law 95-604 (PL95-604), grants the Secretary of Energy the authority and responsibility to perform such actions as are necessary to minimize radiation health hazards and other environmental hazards caused by inactive uranium mill sites. This Technical Approach Document (TAD) describes the general technical approaches and design criteria adopted by the US Department of Energy (DOE) in order to implement remedial action plans (RAPS) and final designs that comply with EPA standards. It does not address the technical approaches necessary for aquifer restoration at processing sites; a guidance document, currently in preparation, will describe aquifer restoration concerns and technical protocols. This document is a second revision to the original document issued in May 1986; the revision has been made in response to changes to the groundwater standards of 40 CFR 192, Subparts A--C, proposed by EPA as draft standards. New sections were added to define the design approaches and designs necessary to comply with the groundwater standards. These new sections are in addition to changes made throughout the document to reflect current procedures, especially in cover design, water resources protection, and alternate site selection; only minor revisions were made to some of the sections. Section 3.0 is a new section defining the approach taken in the design of disposal cells; Section 4.0 has been revised to include design of vegetated covers; Section 8.0 discusses design approaches necessary for compliance with the groundwater standards; and Section 9.0 is a new section dealing with nonradiological hazardous constituents. 203 refs., 18 figs., 26 tabs

  12. AN EFFICIENT TEXT CLASSIFICATION USING KNN AND NAIVE BAYESIAN

    OpenAIRE

    J.Sreemathy; P. S. Balamurugan

    2012-01-01

    The main objective is to propose a text classification approach based on feature selection and preprocessing, thereby reducing the dimensionality of the feature vector and increasing the classification accuracy. Text classification is the process of assigning a document to one or more target categories based on its contents. In the proposed method, machine learning methods for text classification are used, applying some text preprocessing methods to different datasets, and then extracting feature vecto...
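
    The two classifiers compared in the record can be sketched on TF-IDF features with scikit-learn as below; the tiny corpus, labels and query are assumptions.

        # Minimal sketch of the two classifiers the record compares, k-NN and Naive
        # Bayes, trained on TF-IDF features of a toy corpus.
        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.neighbors import KNeighborsClassifier
        from sklearn.naive_bayes import MultinomialNB

        docs   = ["stock prices fell sharply", "the team won the final match",
                  "market rally lifts shares", "player injured before the game"]
        labels = ["business", "sport", "business", "sport"]

        vec = TfidfVectorizer()
        X = vec.fit_transform(docs)
        knn = KNeighborsClassifier(n_neighbors=3).fit(X, labels)
        nb  = MultinomialNB().fit(X, labels)

        query = vec.transform(["shares and markets today"])
        print("kNN:", knn.predict(query)[0], "| NB:", nb.predict(query)[0])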

  13. ERRORS AND DIFFICULTIES IN TRANSLATING LEGAL TEXTS

    Directory of Open Access Journals (Sweden)

    Camelia, CHIRILA

    2014-11-01

    Full Text Available Nowadays the accurate translation of legal texts has become highly important as the mistranslation of a passage in a contract, for example, could lead to lawsuits and loss of money. Consequently, the translation of legal texts to other languages faces many difficulties and only professional translators specialised in legal translation should deal with the translation of legal documents and scholarly writings. The purpose of this paper is to analyze translation from three perspectives: translation quality, errors and difficulties encountered in translating legal texts and consequences of such errors in professional translation. First of all, the paper points out the importance of performing a good and correct translation, which is one of the most important elements to be considered when discussing translation. Furthermore, the paper presents an overview of the errors and difficulties in translating texts and of the consequences of errors in professional translation, with applications to the field of law. The paper is also an approach to the differences between languages (English and Romanian) that can hinder comprehension for those who have embarked upon the difficult task of translation. The research method that I have used to achieve the objectives of the paper was the content analysis of various Romanian and foreign authors' works.

  14. A NOVEL MULTIDICTIONARY BASED TEXT COMPRESSION

    Directory of Open Access Journals (Sweden)

    Y. Venkataramani

    2012-01-01

    Full Text Available The amount of digital content grows at an ever faster speed, and so does the demand to communicate it. On the other hand, the amount of storage and bandwidth increases at a slower rate. Thus powerful and efficient compression methods are required. The repetition of words and phrases makes the reordered text much more compressible than the original text. On the whole, the system is fast and achieves close to the best results on the test files. In this study a novel fast dictionary-based text compression technique, MBRH (multidictionary with Burrows-Wheeler transform, run-length coding and Huffman coding), is proposed for the purpose of obtaining improved performance on various document sizes. The MBRH algorithm comprises two stages: the first stage is concerned with the conversion of input text into dictionary-based compression, and the second stage deals mainly with reduction of the redundancy in multidictionary-based compression by using BWT, RLE and Huffman coding. On the bib test file, with an input size of 111,261 bytes, the MBRH algorithm achieves a compression ratio of 0.192 and a bit rate of 1.538 at high speed. The algorithm attains a good compression ratio, a reduced bit rate and increased execution speed.
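
    Two of the stages named above, the Burrows-Wheeler transform and run-length encoding, are sketched naively below (the dictionary stage and Huffman coding are omitted); the sample string is an assumption and the quadratic rotation sort is for illustration only.

        # Naive sketch of two MBRH stages: Burrows-Wheeler transform + run-length encoding.
        def bwt(text, end="\0"):
            text += end                                   # unique end-of-string marker
            rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
            return "".join(r[-1] for r in rotations)      # last column of sorted rotations

        def run_length_encode(text):
            out, i = [], 0
            while i < len(text):
                j = i
                while j < len(text) and text[j] == text[i]:
                    j += 1
                out.append((text[i], j - i))              # (symbol, run length)
                i = j
            return out

        transformed = bwt("banana_bandana")
        print(transformed)
        print(run_length_encode(transformed))             # repeated symbols compress well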

  15. Online Visual Analytics of Text Streams.

    Science.gov (United States)

    Liu, Shixia; Yin, Jialun; Wang, Xiting; Cui, Weiwei; Cao, Kelei; Pei, Jian

    2016-11-01

    We present an online visual analytics approach to helping users explore and understand hierarchical topic evolution in high-volume text streams. The key idea behind this approach is to identify representative topics in incoming documents and align them with the existing representative topics that they immediately follow (in time). To this end, we learn a set of streaming tree cuts from topic trees based on user-selected focus nodes. A dynamic Bayesian network model has been developed to derive the tree cuts in the incoming topic trees to balance the fitness of each tree cut and the smoothness between adjacent tree cuts. By connecting the corresponding topics at different times, we are able to provide an overview of the evolving hierarchical topics. A sedimentation-based visualization has been designed to enable the interactive analysis of streaming text data from global patterns to local details. We evaluated our method on real-world datasets and the results are generally favorable.

  16. Extraction of information from unstructured text

    Energy Technology Data Exchange (ETDEWEB)

    Irwin, N.H.; DeLand, S.M.; Crowder, S.V.

    1995-11-01

    Extracting information from unstructured text has become an emphasis in recent years due to the large amount of text now electronically available. This status report describes the findings and work done by the end of the first year of a two-year LDRD. Requirements of the approach included that it model the information in a domain independent way. This means that it would differ from current systems by not relying on previously built domain knowledge and that it would do more than keyword identification. Three areas that are discussed and expected to contribute to a solution include (1) identifying key entities through document level profiling and preprocessing, (2) identifying relationships between entities through sentence level syntax, and (3) combining the first two with semantic knowledge about the terms.

  17. Choices of texts for literary education

    DEFF Research Database (Denmark)

    Skyggebjerg, Anna Karlskov

    readers with literary interests, competences, possibilities, needs, etc. Generally speaking, the criteria for the choice of texts for teaching literature in Danish schools have been dominated by considerations for the subject and literature in itself. The predominant view of literature comes from literature studies at universities, where criteria concerning language and form are often more valued than criteria concerning character and content. This tendency to celebrate the formal aspects and the literariness of literature is recognized in governmental documents, teaching materials, and in the registration of texts for examinations. Genres such as poetry and short stories, periods such as avant-garde and modernism, and acknowledged and well-known authorships are often included, whereas representations of popular fiction and such genres as fantasy, sci-fi, and biography are rare. Often, pupils...

  18. PIXE analysis and imaging of papyrus documents

    Science.gov (United States)

    Lövestam, N. E. Göran; Swietlicki, Erik

    1990-01-01

    The analysis of antique papyrus documents using an external milliprobe is described. Missing characters of text in the documents were made visible by means of PIXE analysis and X-ray imaging of the areas studied. The contrast between the papyrus and the ink was further increased when the information contained in all the elements was taken into account simultaneously using a multivariate technique (partial least-squares regression).
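
    The abstract mentions combining the information from all detected elements with a multivariate technique (partial least-squares regression) to sharpen the ink/papyrus contrast. A rough sketch of that idea on synthetic data, with scikit-learn's PLSRegression standing in for whatever implementation the authors used, might look like this:

        import numpy as np
        from sklearn.cross_decomposition import PLSRegression

        # Hypothetical input: three elemental count maps from a PIXE scan, flattened to
        # one row per pixel, plus labels for pixels known to be ink (1) or papyrus (0).
        rng = np.random.default_rng(0)
        n_pixels = 64 * 64
        X = rng.poisson(lam=[50.0, 20.0, 5.0], size=(n_pixels, 3)).astype(float)
        is_ink = rng.random(n_pixels) < 0.1
        X[is_ink] += [5.0, 15.0, 40.0]      # ink pixels are enriched in some elements
        y = is_ink.astype(float)

        pls = PLSRegression(n_components=1)
        pls.fit(X, y)
        contrast = pls.predict(X).ravel().reshape(64, 64)   # combined-element "image"
        print(contrast.min(), contrast.max())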

  19. Improve Reading with Complex Texts

    Science.gov (United States)

    Fisher, Douglas; Frey, Nancy

    2015-01-01

    The Common Core State Standards have cast a renewed light on reading instruction, presenting teachers with the new requirements to teach close reading of complex texts. Teachers and administrators should consider a number of essential features of close reading: They are short, complex texts; rich discussions based on worthy questions; revisiting…

  20. Strategies for Translating Vocative Texts

    Directory of Open Access Journals (Sweden)

    Olga COJOCARU

    2014-12-01

    Full Text Available The paper deals with the linguistic and cultural elements of vocative texts and the techniques used in translating them, giving examples of texts that are typically vocative (i.e. advertisements and instructions for use). Semantic and communicative strategies are popular in translation studies and each of them has its own advantages and disadvantages in translating vocative texts. The advantage of semantic translation is that it takes more account of the aesthetic value of the SL text, while communicative translation attempts to render the exact contextual meaning of the original text in such a way that both content and language are readily acceptable and comprehensible to the readership. Focus is laid on the strategies used in translating vocative texts, strategies that highlight and introduce a cultural context to the target audience, in order to achieve their overall purpose, that is to sell or persuade the reader to behave in a certain way. To that end, a number of advertisements from the cosmetics and electronic gadgets industries were selected for analysis. The aim is to gather insights into vocative text translation and to create new perspectives on this field of research, now considered a process of innovation and diversion, especially in areas as important as economy and marketing.

  1. Linguistic Dating of Biblical Texts

    DEFF Research Database (Denmark)

    Ehrensvärd, Martin Gustaf

    2003-01-01

    For two centuries, scholars have pointed to consistent differences in the Hebrew of certain biblical texts and interpreted these differences as reflecting the date of composition of the texts. Until the 1980s, this was quite uncontroversial as the linguistic findings largely confirmed the chronology of the texts established by other means: the Hebrew of Genesis-2 Kings was judged to be early and that of Esther, Daniel, Ezra, Nehemiah, and Chronicles to be late. In the current debate where revisionists have questioned the traditional dating, linguistic arguments in the dating of texts have come more into focus. The study critically examines some linguistic arguments adduced to support the traditional position, and reviewing the arguments it points to weaknesses in the linguistic dating of EBH texts to pre-exilic times. When viewing the linguistic evidence in isolation it will be clear...

  2. Chemical-text hybrid search engines.

    Science.gov (United States)

    Zhou, Yingyao; Zhou, Bin; Jiang, Shumei; King, Frederick J

    2010-01-01

    As the amount of chemical literature increases, it is critical that researchers be enabled to accurately locate documents related to a particular aspect of a given compound. Existing solutions, based on text and chemical search engines alone, suffer from the inclusion of "false negative" and "false positive" results, and cannot accommodate the diverse repertoire of formats currently available for chemical documents. To address these concerns, we developed an approach called Entity-Canonical Keyword Indexing (ECKI), which converts a chemical entity embedded in a data source into its canonical keyword representation prior to being indexed by text search engines. We implemented ECKI using Microsoft Office SharePoint Server Search, and the resultant hybrid search engine not only supported complex mixed chemical and keyword queries but also was applied to both intranet and Internet environments. We envision that the adoption of ECKI will empower researchers to pose more complex search questions that were not readily attainable previously and to obtain answers at much improved speed and accuracy.

  4. GPM Mission Gridded Text Products Providing Surface Precipitation Retrievals

    Science.gov (United States)

    Stocker, Erich Franz; Kelley, Owen; Huffman, George; Kummerow, Christian

    2015-04-01

    constellation satellites. Both of these gridded products are generated on a 0.25 degree x 0.25 degree hourly grid and packaged into daily ASCII files that can be downloaded from the PPS FTP site. To reduce the download size, the files are compressed using the gzip utility. This paper focuses on presenting high-level details about the gridded text product generated from the instruments on the GPM core satellite, but summary information is also presented about the partner radiometer gridded product. All retrievals for the partner radiometer are done using the GPROF2014 algorithm, with the PPS-generated inter-calibrated 1C product for the radiometer as input.

  5. Document Delivery Services around the World

    Directory of Open Access Journals (Sweden)

    Ashrafosadat Foladi

    2008-04-01

    Full Text Available Given the importance of information access versus collection, the present study identified and investigated the ten most important document delivery websites which had the highest frequency of citations in online directories and printed sources. The evaluation was based on the indicators and policies of the Iranian Scientific Information and Documentation Center (IRANDOC). These included document diversity, document request mechanisms, document delivery options, response time, payment options, costs, and copyright clearance. The findings were then processed statistically using SPSS. It was found that based on document diversity BLDSC, LHL and DocDeliver are the frontrunners. In terms of subject comprehensiveness, DocDeliver, BLDSC, Infotrieve, Ingenta, ISI and UMI are at the same level. All ten sites studied covered basic sciences. BL is strong with respect to diversity of document delivery options, payment options and response time. ISI is most suitable when diversity in request options is required. Ingenta is suitable when diversity in payment options is required. NTIS is in the lead when special documents such as technical reports are required, while UMI is most suitable for dissertations and rare books.

  6. Biomarker Identification Using Text Mining

    Directory of Open Access Journals (Sweden)

    Hui Li

    2012-01-01

    Full Text Available Identifying molecular biomarkers has become one of the important tasks for scientists who assess the different phenotypic states of cells or organisms correlated to the genotypes of diseases from large-scale biological data. In this paper, we propose a text-mining-based method to discover biomarkers from PubMed. First, we construct a database based on a dictionary, and then we use a finite state machine to identify the biomarkers. Our text mining method provides a highly reliable approach to discovering biomarkers in the PubMed database.
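
    The record gives no detail on the finite state machine itself; the sketch below approximates the idea with a plain longest-match scan over a small dictionary (the biomarker names are illustrative, not taken from the paper's database).

        BIOMARKERS = {"psa", "ca-125", "her2", "troponin i"}

        def find_biomarkers(text, dictionary=BIOMARKERS):
            """Return (token position, term) pairs for dictionary hits, preferring
            the longest match (up to three tokens) at each position."""
            tokens = text.lower().split()
            hits = []
            i = 0
            while i < len(tokens):
                match = None
                for j in range(min(len(tokens), i + 3), i, -1):
                    candidate = " ".join(tokens[i:j])
                    if candidate in dictionary:
                        match = (i, candidate)
                        break
                if match:
                    hits.append(match)
                    i += len(match[1].split())
                else:
                    i += 1
            return hits

        print(find_biomarkers("Serum troponin I and PSA were measured."))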

  7. Outer Texts in Bilingual Dictionaries

    Directory of Open Access Journals (Sweden)

    Rufus H. Gouws

    2011-10-01

    Full Text Available

    Abstract: Dictionaries often display a central list bias with little or no attention to the use of outer texts. This article focuses on dictionaries as text compounds and carriers of different text types. Utilising either a partial or a complete frame structure, a variety of outer text types can be used to enhance the data distribution structure of a dictionary and to ensure a better information retrieval by the intended target user. A distinction is made between primary frame structures and secondary frame structures and attention is drawn to the use of complex outer texts and the need of an extended complex outer text with its own table of contents to guide the user to the relevant texts in the complex outer text. It is emphasised that outer texts need to be planned in a meticulous way and that they should participate in the lexicographic functions of the specific dictionary, both knowledge-orientated and communication-orientated functions, to ensure a transtextual functional approach.

    Keywords: BACK MATTER, CENTRAL LIST, COMMUNICATION-ORIENTATED FUNCTIONS, COMPLEX TEXT, CULTURAL DATA, EXTENDED COMPLEX TEXT, EXTENDED TEXTS, FRONT MATTER, FRAME STRUCTURE, KNOWLEDGE-ORIENTATED FUNCTIONS, LEXICOGRAPHIC FUNCTIONS, OUTER TEXTS, PRIMARY FRAME, SECONDARY FRAME

    Summary: Outer texts in bilingual dictionaries. Dictionaries often display a bias in favour of the central list, with little or no attention to the outer texts. This article focuses on dictionaries as text compounds and carriers of different text types. By employing either a partial or a complete frame structure, a variety of outer texts can be used to improve the data distribution structure of a dictionary and to ensure better retrieval of information by the target user. A distinction is made between primary and secondary frame structures, and attention is drawn to complex outer texts and the need for an extended complex

  8. Data mining of text as a tool in authorship attribution

    Science.gov (United States)

    Visa, Ari J. E.; Toivonen, Jarmo; Autio, Sami; Maekinen, Jarno; Back, Barbro; Vanharanta, Hannu

    2001-03-01

    It is common for text documents to be characterized and classified by keywords that their authors assign to them. Visa et al. have developed a new methodology based on prototype matching. The prototype is an interesting document or an extracted part of an interesting text. This prototype is matched with the document database of the monitored document flow. The new methodology is capable of extracting the meaning of the document to a certain degree. Our claim is that the new methodology is also capable of authenticating the authorship. To verify this claim two tests were designed. The test hypothesis was that the words and the word order in the sentences could authenticate the author. In the first test three authors were selected. The selected authors were William Shakespeare, Edgar Allan Poe, and George Bernard Shaw. Three texts from each author were examined. Each text was used in turn as a prototype. The two nearest matches with the prototype were noted. The second test uses the Reuters-21578 financial news database. A group of 25 short financial news reports from five different authors is examined. Our new methodology and the interesting results from the two tests are reported in this paper. In the first test, all cases were successful for Shakespeare and for Poe. For Shaw one text was confused with Poe. In the second test the authors of the Reuters-21578 financial news reports were identified relatively well. The conclusion is that our text mining methodology seems to be capable of authorship attribution.
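
    The prototype-matching method itself is not spelled out in the abstract. A generic stand-in for the first test - representing each known text as a TF-IDF vector and returning the two nearest matches to the prototype by cosine similarity - could look like this (the snippets and labels are placeholders):

        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.metrics.pairwise import cosine_similarity

        corpus = [
            ("shakespeare", "Shall I compare thee to a summer's day ..."),
            ("shakespeare", "To be, or not to be, that is the question ..."),
            ("poe", "Once upon a midnight dreary, while I pondered, weak and weary ..."),
            ("poe", "It was many and many a year ago, in a kingdom by the sea ..."),
        ]
        prototype = "Deep into that darkness peering, long I stood there wondering ..."

        labels, texts = zip(*corpus)
        vectorizer = TfidfVectorizer(ngram_range=(1, 2))
        doc_matrix = vectorizer.fit_transform(texts)
        proto_vec = vectorizer.transform([prototype])

        scores = cosine_similarity(proto_vec, doc_matrix).ravel()
        for idx in scores.argsort()[::-1][:2]:      # two nearest matches, as in the paper
            print(labels[idx], round(float(scores[idx]), 3))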

  9. Standardization of engineering documentation

    International Nuclear Information System (INIS)

    Many interrelated activities involving a number of organizational units comprise the process for the design and construction of a nuclear steam supply system (NSSS). In the application of a standard NSSS design, many activities are duplicated from project to project and form a standard process for the engineering. This standard process in turn lends itself to a system for standardizing the engineering documentation associated with a particular design application. For these varied activities to be carried out successfully, a strong network of communication is required not only within each design organization but also externally among the various participants: the owner, the NSSS supplier, the architect-engineer, the construction agency, equipment suppliers, and others. This paper discusses, from the viewpoint of an NSSS supplier's engineering organization, the role of standard engineering documents in the design process and communication network.

  10. Musique et document sonore

    OpenAIRE

    Javault, Patrick

    2013-01-01

    Drawn from a doctoral thesis in musicology and aesthetics, this study is remarkable first of all for the way it establishes groupings and defines its line of reflection. Pierre-Yves Macé, himself a musician, composer and practitioner of phonographic archives, finds cross-cutting paths to shed light on the use of the sound document in art music and experimental music (deliberately leaving pop aside) - the sound document considered as the Other of the musical, which can also be called "the real of...

  11. Areva - 2011 Reference document

    International Nuclear Information System (INIS)

    After having indicated the person responsible for this document and the statutory auditors, and provided some financial information, this document gives an overview of the different risk factors existing in the company: legal risks, industrial and environmental risks, operational risks, risks related to large projects, market and liquidity risks. Then, after having recalled the history and evolution of the company and the evolution of its investments over the last five years, it proposes an overview of Areva's activities in the markets of nuclear energy and renewable energies, of its clients and suppliers, of its strategy, and of the activities of its different departments. Other information is provided: the company's organization chart, property, plant and equipment, an analysis of its financial situation, its research and development policy, the present context, and profit forecasts or estimates, as well as management organization and operation

  12. Layout-aware text extraction from full-text PDF of scientific articles

    Directory of Open Access Journals (Sweden)

    Ramakrishnan Cartic

    2012-05-01

    Full Text Available Abstract Background The Portable Document Format (PDF) is the most commonly used file format for online scientific publications. The absence of effective means to extract text from these PDF files in a layout-aware manner presents a significant challenge for developers of biomedical text mining or biocuration informatics systems that use published literature as an information source. In this paper we introduce the 'Layout-Aware PDF Text Extraction' (LA-PDFText) system to facilitate accurate extraction of text from PDF files of research articles for use in text mining applications. Results Our paper describes the construction and performance of an open source system that extracts text blocks from PDF-formatted full-text research articles and classifies them into logical units based on rules that characterize specific sections. The LA-PDFText system focuses only on the textual content of the research articles and is meant as a baseline for further experiments into more advanced extraction methods that handle multi-modal content, such as images and graphs. The system works in a three-stage process: (1) detecting contiguous text blocks using spatial layout processing to locate and identify blocks of contiguous text, (2) classifying text blocks into rhetorical categories using a rule-based method and (3) stitching classified text blocks together in the correct order, resulting in the extraction of text from section-wise grouped blocks. We show that our system can identify text blocks and classify them into rhetorical categories with Precision = 0.96, Recall = 0.89 and F1 = 0.91. We also present an evaluation of the accuracy of the block detection algorithm used in step 2. Additionally, we have compared the accuracy of the text extracted by LA-PDFText to the text from the Open Access subset of PubMed Central. We then compared this accuracy with that of the text extracted by the PDF2Text system, commonly used to extract text from PDF
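
    LA-PDFText itself is an open source system; purely as a baseline sketch of layout-aware extraction, the fragment below uses pdfminer.six to pull out text blocks together with their bounding boxes, which is the raw material for the kind of rule-based block classification described above.

        from pdfminer.high_level import extract_pages
        from pdfminer.layout import LTTextContainer

        def text_blocks(pdf_path):
            """Collect every detected text block with its page number and bounding box."""
            blocks = []
            for page_no, page in enumerate(extract_pages(pdf_path), start=1):
                for element in page:
                    if isinstance(element, LTTextContainer):
                        blocks.append({
                            "page": page_no,
                            "bbox": element.bbox,          # (x0, y0, x1, y1)
                            "text": element.get_text().strip(),
                        })
            return blocks

        # for block in text_blocks("article.pdf"):    # "article.pdf" is a placeholder path
        #     print(block["page"], block["bbox"], block["text"][:60])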

  13. SANSMIC design document.

    Energy Technology Data Exchange (ETDEWEB)

    Weber, Paula D. [Sandia National Laboratories (SNL-NM), Albuquerque, NM (United States); Rudeen, David Keith [GRAM, Inc., Albuquerque, NM (United States)

    2015-07-01

    The United States Strategic Petroleum Reserve (SPR) maintains an underground storage system consisting of caverns that were leached or solution mined in four salt domes located near the Gulf of Mexico in Texas and Louisiana. The SPR comprises more than 60 active caverns containing approximately 700 million barrels of crude oil. Sandia National Laboratories (SNL) is the geotechnical advisor to the SPR. As the most pressing need at the inception of the SPR was to create and fill storage volume with oil, the decision was made to leach the caverns and fill them simultaneously (leach-fill). Therefore, A.J. Russo developed SANSMIC in the early 1980s which allows for a transient oil-brine interface (OBI) making it possible to model leach-fill and withdrawal operations. As the majority of caverns are currently filled to storage capacity, the primary uses of SANSMIC at this time are related to the effects of small and large withdrawals, expansion of existing caverns, and projecting future pillar to diameter ratios. SANSMIC was identified by SNL as a priority candidate for qualification. This report continues the quality assurance (QA) process by documenting the "as built" mathematical and numerical models that comprise this document. The program flow is outlined and the models are discussed in detail. Code features that were added later or were not documented previously have been expounded. No changes in the code's physics have occurred since the original documentation (Russo, 1981, 1983) although recent experiments may yield improvements to the temperature and plume methods in the future.

  14. Why is Light Text Harder to Read Than Dark Text?

    Science.gov (United States)

    Scharff, Lauren V.; Ahumada, Albert J.

    2005-01-01

    Scharff and Ahumada (2002, 2003) measured text legibility for light text and dark text. For paragraph readability and letter identification, responses to light text were slower and less accurate for a given contrast. Was this polarity effect (1) an artifact of our apparatus, (2) a physiological difference in the separate pathways for positive and negative contrast or (3) the result of increased experience with dark text on light backgrounds? To rule out the apparatus-artifact hypothesis, all data were collected on one monitor. Its luminance was measured at all levels used, and the spatial effects of the monitor were reduced by pixel doubling and quadrupling (increasing the viewing distance to maintain constant angular size). Luminances of vertical and horizontal square-wave gratings were compared to assess display speed effects. They existed, even for 4-pixel-wide bars. Tests for polarity asymmetries in display speed were negative. Increased experience might develop full letter templates for dark text, while recognition of light letters is based on component features. Earlier, an observer ran all conditions at one polarity and then switched. If dark and light letters were intermixed, the observer might use component features on all trials and do worse on the dark letters, reducing the polarity effect. We varied polarity blocking (completely blocked, alternating smaller blocks, and intermixed blocks). Letter identification responses times showed polarity effects at all contrasts and display resolution levels. Observers were also more accurate with higher contrasts and more pixels per degree. Intermixed blocks increased the polarity effect by reducing performance on the light letters, but only if the randomized block occurred prior to the nonrandomized block. Perhaps observers tried to use poorly developed templates, or they did not work as hard on the more difficult items. The experience hypothesis and the physiological gain hypothesis remain viable explanations.

  15. Stemming Malay Text and Its Application in Automatic Text Categorization

    Science.gov (United States)

    Yasukawa, Michiko; Lim, Hui Tian; Yokoo, Hidetoshi

    In the Malay language there are no conjugations or declensions, and affixes have important grammatical functions. In Malay, the same word may function as a noun, an adjective, an adverb, or a verb, depending on its position in the sentence. Although simple root words are used extensively in informal conversations, it is essential to use the precise words in formal speech or written texts. In Malay, to make sentences clear, derivative words are used. Derivation is achieved mainly by the use of affixes. There are approximately a hundred possible derivative forms of a root word in the written language of educated Malay. Therefore, the composition of Malay words may be complicated. Although there are several types of stemming algorithms available for text processing in English and some other languages, they cannot be used to overcome the difficulties in Malay word stemming. Stemming is the process of reducing various words to their root forms in order to improve the effectiveness of text processing in information systems. It is essential to avoid both over-stemming and under-stemming errors. We have developed a new Malay stemmer (stemming algorithm) for removing inflectional and derivational affixes. Our stemmer uses a set of affix rules and two types of dictionaries: a root-word dictionary and a derivative-word dictionary. The use of the set of rules is aimed at reducing the occurrence of under-stemming errors, while that of the dictionaries is believed to reduce the occurrence of over-stemming errors. We performed an experiment to evaluate the application of our stemmer in text mining software. For the experiment, the text data used were actual web pages collected from the World Wide Web to demonstrate the effectiveness of our Malay stemming algorithm. The experimental results showed that our stemmer can effectively increase the precision of the extracted Boolean expressions for text categorization.
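
    As a toy illustration of the rule-plus-dictionary idea (the affix lists and root dictionary below are tiny stand-ins for the real ones, and circumfixes are ignored):

        PREFIXES = ["meng", "men", "mem", "me", "ber", "di", "pe", "ter"]
        SUFFIXES = ["kan", "an", "i", "lah", "nya"]
        ROOTS = {"ajar", "baca", "main", "tulis"}

        def stem(word):
            """Strip one prefix and/or one suffix, accepting the result only if it is
            a known root (the dictionary check guards against over-stemming)."""
            candidates = [word]
            for prefix in PREFIXES:
                if word.startswith(prefix):
                    candidates.append(word[len(prefix):])
            with_suffix_removed = []
            for candidate in candidates:
                for suffix in SUFFIXES:
                    if candidate.endswith(suffix):
                        with_suffix_removed.append(candidate[:-len(suffix)])
            for candidate in candidates + with_suffix_removed:
                if candidate in ROOTS:
                    return candidate
            return word        # fall back to the surface form (possible under-stemming)

        print([stem(w) for w in ["mengajar", "membaca", "ajaran", "pelajaran"]])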

  16. Anomaly Detection with Text Mining

    Data.gov (United States)

    National Aeronautics and Space Administration — Many existing complex space systems have a significant amount of historical maintenance and problem data bases that are stored in unstructured text forms. The...

  17. AREVA - 2013 Reference document

    International Nuclear Information System (INIS)

    This Reference Document contains information on the AREVA group's objectives, prospects and development strategies, as well as estimates of the markets, market shares and competitive position of the AREVA group. Content: 1 - Person responsible for the Reference Document; 2 - Statutory auditors; 3 - Selected financial information; 4 - Description of major risks confronting the company; 5 - Information about the issuer; 6 - Business overview; 7 - Organizational structure; 8 - Property, plant and equipment; 9 - Situation and activities of the company and its subsidiaries; 10 - Capital resources; 11 - Research and development programs, patents and licenses; 12 - Trend information; 13 - Profit forecasts or estimates; 14 - Management and supervisory bodies; 15 - Compensation and benefits; 16 - Functioning of the management and supervisory bodies; 17 - Human resources information; 18 - Principal shareholders; 19 - Transactions with related parties; 20 - Financial information concerning assets, financial positions and financial performance; 21 - Additional information; 22 - Major contracts; 23 - Third party information, statements by experts and declarations of interest; 24 - Documents on display; 25 - Information on holdings; Appendix 1: report of the supervisory board chairman on the preparation and organization of the board's activities and internal control procedures; Appendix 2: statutory auditors' reports; Appendix 3: environmental report; Appendix 4: non-financial reporting methodology and independent third-party report on social, environmental and societal data; Appendix 5: ordinary and extraordinary general shareholders' meeting; Appendix 6: values charter; Appendix 7: table of concordance of the management report; glossaries

  18. Content Documents Management

    Science.gov (United States)

    Muniz, R.; Hochstadt, J.; Boelke J.; Dalton, A.

    2011-01-01

    The Content Documents are created and managed under the System Software group within the Launch Control System (LCS) project. The System Software product group is led by the NASA Engineering Control and Data Systems branch (NEC3) at Kennedy Space Center. The team is working on creating Operating System Images (OSI) for different platforms (i.e. AIX, Linux, Solaris and Windows). Before the OSI can be created, the team must create a Content Document which provides the information for a workstation or server, with the list of all the software that is to be installed on it and also the set where the hardware belongs. This can be, for example, the LDS, the ADS or the FR-l. The objective of this project is to create a User Interface Web application that can manage the information of the Content Documents, with all the correct validations and filters for administrator purposes. For this project we used one of the best tools in agile application development, Ruby on Rails. This tool helps pragmatic programmers develop Web applications with the Rails framework and the Ruby programming language. It is amazing to see how a student can learn about OOP features with the Ruby language, manage the user interface with HTML and CSS, create associations and queries with gems, manage databases and run a server with MySQL, run shell commands with the command prompt and create Web frameworks with Rails. All of this in a real world project and in just fifteen weeks!

  19. An Integrated Multimedia Approach to Cultural Heritage e-Documents

    NARCIS (Netherlands)

    A.W.M. Smeulders; H.L. Hardman; G. Schreiber; J.M. Geusebroek

    2002-01-01

    We discuss access to e-documents from three different perspectives beyond the plain keyword web-search of the entire document. The first one is the situation-dependent delivery of multimedia documents, adapting the preferred form (picture, text, speech) to the available information capacity or need e

  20. Parsimonious language models for a terabyte of text

    NARCIS (Netherlands)

    D. Hiemstra; J. Kamps; R. Kaptein; R. Li

    2007-01-01

    The aims of this paper are twofold. Our first aim is to compare results of the earlier Terabyte tracks to the Million Query track. We submitted a number of runs using different document representations (such as full-text, title-fields, or incoming anchor-texts) to increase pool diversity. The initia

  1. Text Mining in Social Networks

    Science.gov (United States)

    Aggarwal, Charu C.; Wang, Haixun

    Social networks are rich in various kinds of contents such as text and multimedia. The ability to apply text mining algorithms effectively in the context of text data is critical for a wide variety of applications. Social networks require text mining algorithms for a wide variety of applications such as keyword search, classification, and clustering. While search and classification are well known applications for a wide variety of scenarios, social networks have a much richer structure both in terms of text and links. Much of the work in the area uses either purely the text content or purely the linkage structure. However, many recent algorithms use a combination of linkage and content information for mining purposes. In many cases, it turns out that the use of a combination of linkage and content information provides much more effective results than a system which is based purely on either of the two. This paper provides a survey of such algorithms, and the advantages observed by using such algorithms in different scenarios. We also present avenues for future research in this area.

  2. A Fuzzy Similarity Based Concept Mining Model for Text Classification

    CERN Document Server

    Puri, Shalini

    2012-01-01

    Text Classification is a challenging and very active field at present and has great importance in text categorization applications. A lot of research work has been done in this field, but there is still a need to categorize a collection of text documents into mutually exclusive categories by extracting the concepts or features using a supervised learning paradigm and different classification algorithms. In this paper, a new Fuzzy Similarity Based Concept Mining Model (FSCMM) is proposed to classify a set of text documents into pre-defined Category Groups (CG) by training and preparing them at the sentence, document and integrated corpora levels, along with feature reduction and ambiguity removal at each level, to achieve high system performance. The Fuzzy Feature Category Similarity Analyzer (FFCSA) is used to analyze each extracted feature of the Integrated Corpora Feature Vector (ICFV) with the corresponding categories or classes. This model uses a Support Vector Machine Classifier (SVMC) to classify correct...
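
    The fuzzy feature analysis and corpus-level preparation of FSCMM are not reproduced here, but the final classification step it feeds - an SVM over extracted text features - can be sketched as a TF-IDF/linear-SVM pipeline on a toy corpus (documents, labels and categories are placeholders):

        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.pipeline import make_pipeline
        from sklearn.svm import LinearSVC

        train_docs = ["stock markets rallied today", "the team won the championship game",
                      "quarterly earnings beat expectations", "the striker scored twice"]
        train_labels = ["finance", "sports", "finance", "sports"]

        classifier = make_pipeline(TfidfVectorizer(), LinearSVC())
        classifier.fit(train_docs, train_labels)
        print(classifier.predict(["quarterly earnings rallied"]))   # expected: ['finance']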

  3. Text segmentation with character-level text embeddings

    NARCIS (Netherlands)

    Chrupała, Grzegorz

    2013-01-01

    Learning word representations has recently seen much success in computational linguistics. However, assuming sequences of word tokens as input to linguistic analysis is often unjustified. For many languages word segmentation is a non-trivial task and naturally occurring text is sometimes a mixture o

  4. Analysing ESP Texts, but How?

    Directory of Open Access Journals (Sweden)

    Borza Natalia

    2015-03-01

    Full Text Available English as a second language (ESL) teachers instructing general English and English for specific purposes (ESP) in bilingual secondary schools face various challenges when it comes to choosing the main linguistic foci of language preparatory courses enabling non-native students to study academic subjects in English. ESL teachers intending to analyse English-language subject textbooks written for secondary school students, with the aim of gaining information about what bilingual secondary school students need to know in terms of language to process academic textbooks, cannot avoid dealing with a dilemma. It needs to be decided which way is most appropriate to analyse the texts in question. Handbooks of English applied linguistics are not immensely helpful with regard to this problem as they tend not to give recommendations as to which major text analytical approaches are advisable to follow in a pre-college setting. The present theoretical research aims to address this lacuna. Respectively, the purpose of this pedagogically motivated theoretical paper is to investigate two major approaches of ESP text analysis, register analysis and genre analysis, in order to find the more suitable one for exploring the language use of secondary school subject texts from the point of view of an English as a second language teacher. Comparing and contrasting the merits and limitations of the two approaches allows for a better understanding of the nature of the two different perspectives of text analysis. The study examines the goals, the scope of analysis, and the achievements of the register perspective and those of the genre approach alike. The paper also investigates and reviews in detail the starkly different methods of ESP text analysis applied by the two perspectives. Discovering text analysis from a theoretical and methodological angle supports a practical aspect of English teaching, namely making an informed choice when setting out to analyse

  5. Extracting laboratory test information from biomedical text

    Directory of Open Access Journals (Sweden)

    Yanna Shen Kang

    2013-01-01

    Full Text Available Background: No previous study has reported the efficacy of current natural language processing (NLP) methods for extracting laboratory test information from narrative documents. This study investigates the pathology informatics question of how accurately such information can be extracted from text with current tools and techniques, especially machine learning and symbolic NLP methods. The study data came from a text corpus maintained by the U.S. Food and Drug Administration, containing a rich set of information on laboratory tests and test devices. Methods: The authors developed a symbolic information extraction (SIE) system to extract device- and test-specific information about four types of laboratory test entities: specimens, analytes, units of measure and detection limits. They compared the performance of SIE and three prominent machine learning based NLP systems, LingPipe, GATE and BANNER, each implementing a distinct supervised machine learning method: hidden Markov models, support vector machines and conditional random fields, respectively. Results: The machine learning systems recognized laboratory test entities with moderately high recall, but low precision rates. Their recall rates were relatively higher when the number of distinct entity values (e.g., the spectrum of specimens) was very limited or when the lexical morphology of the entity was distinctive (as in units of measure), yet SIE outperformed them with statistically significant margins on extracting specimen, analyte and detection limit information in both precision and F-measure. Its high recall performance was statistically significant on analyte information extraction. Conclusions: Despite its shortcomings against machine learning methods, a well-tailored symbolic system may better discern relevancy among a pile of information of the same type and may outperform a machine learning system by tapping into lexically non-local contextual information such as the document structure.
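
    As a toy version of the symbolic extraction idea for two of the entity types (units of measure and detection limits), a couple of hand-written patterns suffice; the unit list and phrasing below are illustrative and not the SIE system's actual rules:

        import re

        UNITS = r"(?:ng/mL|mg/dL|mmol/L|IU/L|%)"
        unit_pattern = re.compile(rf"\b\d+(?:\.\d+)?\s*{UNITS}", re.IGNORECASE)
        limit_pattern = re.compile(rf"detection limit of\s+(\d+(?:\.\d+)?\s*{UNITS})",
                                   re.IGNORECASE)

        text = "The assay has a detection limit of 0.5 ng/mL; glucose was 95 mg/dL."
        print(unit_pattern.findall(text))    # ['0.5 ng/mL', '95 mg/dL']
        print(limit_pattern.findall(text))   # ['0.5 ng/mL']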

  6. Multi-perspective Event Detection in Texts Documenting the 1944 Battle of Arnhem

    NARCIS (Netherlands)

    Düring, M.D.; Bosch, A.P.J. van den

    2014-01-01

    We present a pilot project which combines the respective strengths of research practices in history, memory studies, and computational linguistics. We present a proof-of-concept workflow for the semi-automatic detection and linking of narratives referring to the same event based on references to loc

  7. Chinese multi-document personal name disambiguation

    Institute of Scientific and Technical Information of China (English)

    2005-01-01

    This paper presents a new approach to determining whether a personal name of interest occurring across documents refers to the same entity. Firstly, three vectors are formed for each text: a personal name Boolean vector denoting whether a personal name occurs in the text, a biographical word Boolean vector representing title, occupation and so forth, and a feature vector with real values. Then, by combining a heuristic strategy based on the Boolean vectors with an agglomerative clustering algorithm based on the feature vectors, it seeks to resolve multi-document personal name coreference. Experimental results show that this approach achieves good performance when tested on the "Wang Gang" corpus.
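
    A rough sketch of the clustering step - grouping documents that mention the ambiguous name - on toy documents, with bag-of-words counts standing in for the paper's real-valued feature vectors and the Boolean-vector heuristic omitted:

        from sklearn.feature_extraction.text import CountVectorizer
        from sklearn.cluster import AgglomerativeClustering

        docs = [
            "Wang Gang the actor starred in a new television drama",
            "actor Wang Gang joined the cast of a historical drama series",
            "Wang Gang the minister met foreign diplomats in Beijing",
            "minister Wang Gang discussed trade policy with visiting diplomats",
        ]

        X = CountVectorizer().fit_transform(docs).toarray().astype(float)
        labels = AgglomerativeClustering(n_clusters=2).fit_predict(X)
        print(labels)   # documents about the same person should share a cluster label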

  8. Aiding the Interpretation of Ancient Documents

    DEFF Research Database (Denmark)

    Roued-Cunliffe, Henriette

    and Latin texts and the term ‘scholars’ is used to describe readers of these documents (e.g. papyrologists, epigraphers, palaeographers). However, the results from this research can be applicable to many other texts ranging from Nordic runes to 18th Century love letters. In order to develop an appropriate...... tool it is important first to comprehend the interpretation process involved in reading ancient documents. This is not a linear process but rather a recursive process where the scholar moves between different levels of reading, such as ‘understanding the meaning of a character’ or ‘understanding...

  9. Genetic Programming for Document Segmentation and Region Classification Using Discipulus

    Directory of Open Access Journals (Sweden)

    Priyadharshini N

    2013-02-01

    Full Text Available Document segmentation is a method of dividing a document into distinct regions. A document is an assortment of information and a standard means of conveying information to others. Extracting data from documents involves a great deal of human effort and time, and this can severely limit the use of data systems, so automatic information extraction from documents has become a major issue. It has been shown that document segmentation can help to overcome such problems. This paper proposes a new approach to segment and classify document regions as text, image, drawings and table. The document image is divided into blocks using the run-length smearing algorithm and features are extracted from every block. The Discipulus tool has been used to construct the genetic-programming-based classifier model, which achieved 97.5% classification accuracy.
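
    The record does not detail the smearing step; a minimal horizontal run-length smearing pass (full RLSA also applies a vertical pass and combines the two) might look like:

        import numpy as np

        def rlsa_horizontal(binary, threshold):
            """Fill horizontal runs of background (0) pixels shorter than `threshold`
            when they lie between foreground (1) pixels, merging nearby characters
            into candidate text blocks."""
            out = binary.copy()
            for row in out:
                run_start = None
                last_foreground = None
                for j, value in enumerate(row):
                    if value == 1:
                        if last_foreground is not None and run_start is not None:
                            if j - run_start <= threshold:
                                row[run_start:j] = 1
                        last_foreground = j
                        run_start = None
                    elif run_start is None:
                        run_start = j
            return out

        page = np.array([[1, 0, 0, 1, 0, 0, 0, 0, 1, 1]], dtype=int)
        print(rlsa_horizontal(page, threshold=2))   # [[1 1 1 1 0 0 0 0 1 1]]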

  10. Tank waste remediation system functions and requirements document

    International Nuclear Information System (INIS)

    This is the Tank Waste Remediation System (TWRS) Functions and Requirements Document derived from the TWRS Technical Baseline. The document consists of several text sections that provide the purpose, scope, background information, and an explanation of how this document assists the application of Systems Engineering to the TWRS. The primary functions identified in the TWRS Functions and Requirements Document are identified in Figure 4.1 (Section 4.0) Currently, this document is part of the overall effort to develop the TWRS Functional Requirements Baseline, and contains the functions and requirements needed to properly define the top three TWRS function levels. TWRS Technical Baseline information (RDD-100 database) included in the appendices of the attached document contain the TWRS functions, requirements, and architecture necessary to define the TWRS Functional Requirements Baseline. Document organization and user directions are provided in the introductory text. This document will continue to be modified during the TWRS life-cycle

  11. Tank waste remediation system functions and requirements document

    Energy Technology Data Exchange (ETDEWEB)

    Carpenter, K.E

    1996-10-03

    This is the Tank Waste Remediation System (TWRS) Functions and Requirements Document derived from the TWRS Technical Baseline. The document consists of several text sections that provide the purpose, scope, background information, and an explanation of how this document assists the application of Systems Engineering to the TWRS. The primary functions identified in the TWRS Functions and Requirements Document are identified in Figure 4.1 (Section 4.0) Currently, this document is part of the overall effort to develop the TWRS Functional Requirements Baseline, and contains the functions and requirements needed to properly define the top three TWRS function levels. TWRS Technical Baseline information (RDD-100 database) included in the appendices of the attached document contain the TWRS functions, requirements, and architecture necessary to define the TWRS Functional Requirements Baseline. Document organization and user directions are provided in the introductory text. This document will continue to be modified during the TWRS life-cycle.

  12. Inferring Group Processes from Computer-Mediated Affective Text Analysis

    Energy Technology Data Exchange (ETDEWEB)

    Schryver, Jack C [ORNL; Begoli, Edmon [ORNL; Jose, Ajith [Missouri University of Science and Technology; Griffin, Christopher [Pennsylvania State University

    2011-02-01

    Political communications in the form of unstructured text convey rich connotative meaning that can reveal underlying group social processes. Previous research has focused on sentiment analysis at the document level, but we extend this analysis to sub-document levels through a detailed analysis of affective relationships between entities extracted from a document. Instead of pure sentiment analysis, which is just positive or negative, we explore nuances of affective meaning in 22 affect categories. Our affect propagation algorithm automatically calculates and displays extracted affective relationships among entities in graphical form in our prototype (TEAMSTER), starting with seed lists of affect terms. Several useful metrics are defined to infer underlying group processes by aggregating affective relationships discovered in a text. Our approach has been validated with annotated documents from the MPQA corpus, achieving a performance gain of 74% over comparable random guessers.

  13. Princess Brambilla - images/text

    Directory of Open Access Journals (Sweden)

    Maria Aparecida Barbosa

    2016-06-01

    Full Text Available Reading an illustrated literary text means thinking pictures and words simultaneously. This articulation between the written text and the pictures adds potential, expands, and becomes complex. It coincides with current discussions of Giorgio Agamben's "contemporary", which adds to whatever adheres to its own time the displacement and the distance needed to understand it, and shakes linear notions of historical chronology. In some way this coincidence is related to the current interest in the concept of "Nachleben" (survival), which assumes the rescue of images of the past, postulated by the art historian Aby Warburg in his research on the survival of ancient motifs of motion in Renaissance pictures such as Botticelli's. Such discussions were fundamental for the translation into Portuguese of Princesa Brambilla – um capriccio segundo Jakob Callot, de E. T. A. Hoffmann, com 8 gravuras cunhadas a partir de moldes originais de Callot (1820), as I try to show in this article.

  14. Shape codebook based handwritten and machine printed text zone extraction

    Science.gov (United States)

    Kumar, Jayant; Prasad, Rohit; Cao, Huiagu; Abd-Almageed, Wael; Doermann, David; Natarajan, Premkumar

    2011-01-01

    In this paper, we present a novel method for extracting handwritten and printed text zones from noisy document images with mixed content. We use Triple-Adjacent-Segment (TAS) based features which encode local shape characteristics of text in a consistent manner. We first construct two codebooks of the shape features extracted from a set of handwritten and printed text documents respectively. We then compute the normalized histogram of codewords for each segmented zone and use it to train a Support Vector Machine (SVM) classifier. The codebook based approach is robust to the background noise present in the image and TAS features are invariant to translation, scale and rotation of text. In experiments, we show that a pixel-weighted zone classification accuracy of 98% can be achieved for noisy Arabic documents. Further, we demonstrate the effectiveness of our method for document page classification and show that a high precision can be achieved for the detection of machine printed documents. The proposed method is robust to the size of zones, which may contain text content at line or paragraph level.

  15. Fuzzy Swarm Based Text Summarization

    Directory of Open Access Journals (Sweden)

    Mohammed S. Binwahlan

    2009-01-01

    Full Text Available Problem statement: The aim of automatic text summarization systems is to select the most relevant information from an abundance of text sources. The rapid daily growth of data on the internet makes achieving this aim a big challenge. Approach: In this study, we incorporated fuzzy logic with swarm intelligence, so that risks, uncertainty, ambiguity and imprecise values in choosing the feature weights (scores) could be flexibly tolerated. The weights obtained from the swarm experiment were used to adjust the text feature scores, and the feature scores were then used as inputs to the fuzzy inference system to produce the final sentence score. The sentences were ranked in descending order based on their scores and the top n sentences were selected as the final summary. Results: The experiments showed that the incorporation of fuzzy logic with swarm intelligence could play an important role in the selection of the most important sentences to be included in the final summary. The results also showed that the proposed method performed well, outperforming the swarm model and the benchmark methods. Conclusion: Incorporating more than one technique for sentence scoring proved to be an effective mechanism. PSO was employed to produce the text feature weights, in order to treat the text features fairly according to their importance and to differentiate between more and less important features. The fuzzy inference system was employed to determine the final sentence score, on which the decision was made whether to include the sentence in the summary or not.
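
    A bare-bones sketch of the scoring-and-selection step follows: fixed hand-picked weights stand in for the PSO-derived ones, the fuzzy inference stage is reduced to a weighted sum, and the sentences, features and weights are all placeholders.

        def summarize(sentences, features, weights, n=2):
            """Score each sentence as a weighted sum of its feature scores (all in [0, 1]),
            keep the top n, and return them in their original order."""
            scored = []
            for sentence, feats in zip(sentences, features):
                score = sum(weights[name] * value for name, value in feats.items())
                scored.append((score, sentence))
            top = sorted(scored, reverse=True)[:n]
            return [s for _, s in sorted(top, key=lambda pair: sentences.index(pair[1]))]

        sentences = ["The storm closed three major highways.",
                     "Officials said repairs may take weeks.",
                     "A cat was rescued from a tree."]
        features = [{"title_overlap": 0.9, "position": 1.0, "length": 0.6},
                    {"title_overlap": 0.7, "position": 0.8, "length": 0.5},
                    {"title_overlap": 0.1, "position": 0.3, "length": 0.4}]
        weights = {"title_overlap": 0.5, "position": 0.3, "length": 0.2}
        print(summarize(sentences, features, weights, n=2))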

  16. Software design and documentation language

    Science.gov (United States)

    Kleine, H.

    1980-01-01

    Language supports design and documentation of complex software. Included are: a design and documentation language for expressing design concepts; a processor that produces intelligible documentation based on design specifications; and a methodology for using the language and processor to create well-structured top-down programs and documentation. The processor is written in the SIMSCRIPT II.5 programming language for use on UNIVAC, IBM, and CDC machines.

  17. Ontological representation of texts, and its applicationsin text analysis

    OpenAIRE

    Solheim, Bent André; Vågsnes, Kristian

    2003-01-01

    For the management of a company, the need to know what people think of their products or services is becoming increasingly important in an increasingly competitive market. As the Internet can nearly be described as a digital mirror of events in the "real" world, being able to make sense of the semi-structured nature of natural language texts published in this ubiquitous medium has received growing interest. The approach proposed in the thesis combines natural language processin...

  18. Cluster Based Text Classification Model

    DEFF Research Database (Denmark)

    Nizamani, Sarwat; Memon, Nasrullah; Wiil, Uffe Kock

    2011-01-01

    We propose a cluster based classification model for suspicious email detection and other text classification tasks. The text classification tasks comprise many training examples that require a complex classification model. Using clusters for classification makes the model simpler and increases......, the classifier is trained on each cluster with reduced dimensionality and fewer examples. The experimental results show that the proposed model outperforms the existing classification models for the task of suspicious email detection and topic categorization on the Reuters-21578 and 20 Newsgroups...... datasets. Our model also outperforms A Decision Cluster Classification (ADCC) and the Decision Cluster Forest Classification (DCFC) models on the Reuters-21578 dataset....

  19. Quality Inspection of Printed Texts

    DEFF Research Database (Denmark)

    Pedersen, Jesper Ballisager; Nasrollahi, Kamal; Moeslund, Thomas B.

    2016-01-01

    Inspecting the quality of printed texts has its own importance in many industrial applications. To do so, this paper proposes a grading system which evaluates the performance of the printing task using quality measures for each character and symbol. The purpose of this grading system is two-fold: for customers of the printing and verification system, the overall grade is used to verify whether the text is of sufficient quality, while for the printer's manufacturer, the detailed character/symbol grades and quality measurements are used for the improvement and optimization of the printing task. The proposed system...

  20. A Guide Text or Many Texts? "That is the Question”

    Directory of Open Access Journals (Sweden)

    Delgado de Valencia Sonia

    2001-08-01

    Full Text Available The use of supplementary materials in the classroom has always been an essential part of the teaching and learning process. To restrict our teaching to the scope of one single textbook means to stand behind the advances of knowledge, in any area and context. Young learners appreciate any new and varied support that expands their knowledge of the world: diaries, letters, panels, free texts, magazines, short stories, poems or literary excerpts, and articles taken from the Internet are materials that allow learners to share more and work more collaboratively. In this article we deal with some of these materials, and with the criteria for selecting, adapting, and creating them, so that they may be of interest to the learner and may promote reading and writing processes. Since no text can entirely satisfy the needs of students and teachers, the creativity of both parties will be necessary to improve the quality of teaching through the adequate use and adaptation of supplementary materials.

  1. Density Based Script Identification of a Multilingual Document Image

    OpenAIRE

    Rumaan Bashir; S. M. K Quadri

    2015-01-01

    Automatic Pattern Recognition field has witnessed enormous growth in the past few decades. Being an essential element of Pattern Recognition, Document Image Analysis is the procedure of analyzing a document image with the intention of working out the contents so that they can be manipulated as per the requirements at various levels. It involves various procedures like document classification, organizing, conversion, identification and many more. Since a document chiefly contains text, Script ...

  2. AN EFFICIENT TEXT CLASSIFICATION USING KNN AND NAIVE BAYESIAN

    Directory of Open Access Journals (Sweden)

    J.Sreemathy

    2012-03-01

    Full Text Available The main objective is to propose a text classification method based on feature selection and preprocessing, thereby reducing the dimensionality of the feature vector and increasing the classification accuracy. Text classification is the process of assigning a document to one or more target categories based on its contents. In the proposed method, machine learning methods for text classification are used, applying text preprocessing methods to different datasets and then extracting feature vectors for each new document by using various feature weighting methods to enhance the text classification accuracy. After training the classifier with the Naive Bayesian (NB) and K-nearest neighbor (KNN) algorithms, the prediction can be made according to the category distribution among the k nearest neighbors. Experimental results show that the methods are favorable in terms of their effectiveness and efficiency when compared with other classifiers such as SVM.
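
    An illustrative comparison of the two classifiers named above on a toy corpus, with TF-IDF features and placeholder documents (the paper's feature weighting and preprocessing steps are not reproduced):

        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.naive_bayes import MultinomialNB
        from sklearn.neighbors import KNeighborsClassifier
        from sklearn.pipeline import make_pipeline

        docs = ["cheap flights and hotel deals", "win a free prize now",
                "meeting rescheduled to friday", "quarterly report attached"]
        labels = ["spam", "spam", "ham", "ham"]

        for name, model in [("NB", MultinomialNB()),
                            ("KNN", KNeighborsClassifier(n_neighbors=3))]:
            classifier = make_pipeline(TfidfVectorizer(), model).fit(docs, labels)
            print(name, classifier.predict(["free hotel prize"]))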

  3. Simple-Random-Sampling-Based Multiclass Text Classification Algorithm

    OpenAIRE

    Wuying Liu; Lin Wang; Mianzhu Yi

    2014-01-01

    Multiclass text classification (MTC) is a challenging issue and the corresponding MTC algorithms can be used in many applications. The space-time overhead of these algorithms is a serious concern in the era of big data. Through an investigation of the token frequency distribution in a Chinese web document collection, this paper reexamines the power law and proposes a simple-random-sampling-based MTC (SRSMTC) algorithm. Supported by a token level memory to store labeled documents, the SRSMTC al...

  4. SSC Safety Review Document

    Energy Technology Data Exchange (ETDEWEB)

    Toohig, T.E. [ed.

    1988-11-01

    The safety strategy of the Superconducting Super Collider (SSC) Central Design Group (CDG) is to mitigate potential hazards to personnel, as far as possible, through appropriate measures in the design and engineering of the facility. The Safety Review Document identifies, on the basis of the Conceptual Design Report (CDR) and related studies, potential hazards inherent in the SSC project independent of its site. Mitigative measures in the design of facilities and in the structuring of laboratory operations are described for each of the hazards identified.

  5. Automatic generation of documents

    OpenAIRE

    Rosa Gini; Jacopo Pasquini

    2006-01-01

    This paper describes a natural interaction between Stata and markup languages. Stata’s programming and analysis features, together with the flexibility in output formatting of markup languages, allow generation and/or update of whole documents (reports, presentations on screen or web, etc.). Examples are given for both LaTeX and HTML. Stata’s commands are mainly dedicated to analysis of data on a computer screen and output of analysis stored in a log file available to researchers for later re...

  6. AREVA 2009 reference document

    International Nuclear Information System (INIS)

    This Reference Document contains information on the AREVA group's objectives, prospects and development strategies. It contains information on the markets, market shares and competitive position of the AREVA group. This information provides an adequate picture of the size of these markets and of the AREVA group's competitive position. Content: 1 - Person responsible for the Reference Document and Attestation by the person responsible for the Reference Document; 2 - Statutory and Deputy Auditors; 3 - Selected financial information; 4 - Risks: Risk management and coverage, Legal risk, Industrial and environmental risk, Operating risk, Risk related to major projects, Liquidity and market risk, Other risk; 5 - Information about the issuer: History and development, Investments; 6 - Business overview: Markets for nuclear power and renewable energies, AREVA customers and suppliers, Overview and strategy of the group, Business divisions, Discontinued operations: AREVA Transmission and Distribution; 7 - Organizational structure; 8 - Property, plant and equipment: Principal sites of the AREVA group, Environmental issues that may affect the issuer's; 9 - Analysis of and comments on the group's financial position and performance: Overview, Financial position, Cash flow, Statement of financial position, Events subsequent to year-end closing for 2009; 10 - Capital Resources; 11 - Research and development programs, patents and licenses; 12 -trend information: Current situation, Financial objectives; 13 - Profit forecasts or estimates; 14 - Administrative, management and supervisory bodies and senior management; 15 - Compensation and benefits; 16 - Functioning of corporate bodies; 17 - Employees; 18 - Principal shareholders; 19 - Transactions with related parties: French state, CEA, EDF group; 20 - Financial information concerning assets, financial positions and financial performance; 21 - Additional information: Share capital, Certificate of incorporation and by-laws; 22 - Major

  7. Presentation of the math text

    OpenAIRE

    KREJČOVÁ, Iva

    2009-01-01

    The aim of this bachelor thesis is to map out the basic tools for creating and presenting mathematical texts and to acquire basic user skills in these programs. The tools are also compared in terms of availability and ease of use, their capabilities, and the quality of their output.

  8. Seductive Texts with Serious Intentions.

    Science.gov (United States)

    Nielsen, Harriet Bjerrum

    1995-01-01

    Debates whether a text claiming to have scientific value is using seduction irresponsibly at the expense of the truth, and discusses who is the subject and who is the object of such seduction. It argues that, rather than being an assault against scientific ethics, seduction is a necessary premise for a sensible conversation to take place. (GR)

  9. Values Education: Texts and Supplements.

    Science.gov (United States)

    Curriculum Review, 1979

    1979-01-01

    This column describes and evaluates almost 40 texts, instructional kits, and teacher resources on values, interpersonal relations, self-awareness, self-help skills, juvenile psychology, and youth suicide. Eight effective picture books for the primary grades and seven titles in values fiction for teens are also reviewed. (SJL)

  10. Comparison of Text Categorization Algorithms

    Institute of Scientific and Technical Information of China (English)

    SHI Yong-feng; ZHAO Yan-ping

    2004-01-01

    This paper summarizes several automatic text categorization algorithms in common use recently, and analyzes and compares their advantages and disadvantages. It provides clues for choosing appropriate automatic classification algorithms in different fields. Finally, some evaluations and summaries of these algorithms are discussed, and directions for further research are pointed out.

  11. Multilingual text induced spelling correction

    NARCIS (Netherlands)

    Reynaert, M.W.C.

    2004-01-01

    We present TISC, a multilingual, language-independent and context-sensitive spelling checking and correction system designed to facilitate the automatic removal of non-word spelling errors in large corpora. Its lexicon is derived from raw text corpora, without supervision, and contains word unigrams

  12. COMPENDEX/TEXT-PAC: CIS.

    Science.gov (United States)

    Standera, Oldrich

    This report evaluates the engineering information services provided by the University of Calgary since implementation of the COMPENDEX (tape service of Engineering Index, Inc.) service using the IBM TEXT-PAC system. Evaluation was made by a survey of the users of the Current Information Selection (CIS) service, the interaction between the system…

  13. Extractive Summarisation of Medical Documents

    Directory of Open Access Journals (Sweden)

    Abeed Sarker

    2012-09-01

    Full Text Available Background Evidence Based Medicine (EBM) practice requires practitioners to extract evidence from published medical research when answering clinical queries. Due to the time-consuming nature of this practice, there is a strong motivation for systems that can automatically summarise medical documents and help practitioners find relevant information. Aim The aim of this work is to propose an automatic query-focused, extractive summarisation approach that selects informative sentences from medical documents. Method We use a corpus that is specifically designed for summarisation in the EBM domain. We use approximately half the corpus for deriving important statistics associated with the best possible extractive summaries. We take into account factors such as sentence position, length, sentence content, and the type of the query posed. Using the statistics from the first set, we evaluate our approach on a separate set. Evaluation of the qualities of the generated summaries is performed automatically using ROUGE, which is a popular tool for evaluating automatic summaries. Results Our summarisation approach outperforms all baselines (best baseline score: 0.1594; our score 0.1653). Further improvements are achieved when query types are taken into account. Conclusion The quality of extractive summarisation in the medical domain can be significantly improved by incorporating domain knowledge and statistics derived from a specialised corpus. Such techniques can therefore be applied for content selection in end-to-end summarisation systems.
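
    The sentence-scoring idea described above can be illustrated with a short sketch. This is not the authors' system: the feature weights, the crude sentence splitter and the ROUGE-1 recall helper below are illustrative stand-ins for the corpus-derived statistics and the ROUGE toolkit the paper actually uses.

        # Minimal query-focused extractive summariser (illustrative sketch only).
        def score_sentence(sentence, position, total, query):
            words = set(sentence.lower().split())
            qwords = set(query.lower().split())
            overlap = len(words & qwords) / (len(qwords) or 1)    # query-term overlap
            pos_score = 1.0 - position / total                    # earlier sentences favoured
            len_score = min(len(words), 25) / 25.0                # mild preference for longer sentences
            return 0.5 * overlap + 0.3 * pos_score + 0.2 * len_score

        def summarise(document, query, k=3):
            sentences = [s.strip() for s in document.split('.') if s.strip()]
            scored = [(score_sentence(s, i, len(sentences), query), s)
                      for i, s in enumerate(sentences)]
            return [s for _, s in sorted(scored, reverse=True)[:k]]

        def rouge1_recall(candidate, reference):
            # Unigram recall against a reference summary, a simplified ROUGE-1.
            cand, ref = candidate.lower().split(), reference.lower().split()
            matches = sum(min(cand.count(w), ref.count(w)) for w in set(ref))
            return matches / (len(ref) or 1)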

  14. Does pedagogical documentation support maternal reminiscing conversations?

    Directory of Open Access Journals (Sweden)

    Bethany Fleck

    2015-12-01

    Full Text Available When parents talk with their children about lessons learned in school, they are participating in reminiscing of an unshared event. This study sought to understand if pedagogical documentation, from the Reggio Approach to early childhood education, would support and enhance the conversation. Mother–child dyads reminisced two separate times about preschool lessons, one time with documentation available to them and one time without. Transcripts were coded extracting variables indicative of high and low maternal reminiscing styles. Results indicate that mother and child conversation characteristics were more highly elaborative when documentation was present than when it was not. In addition, children added more information to the conversation supporting the notion that such conversations enhanced memory for lessons. Documentation could be used as a support tool for conversations and children’s memory about lessons learned in school.

  15. Visual Similarity Based Document Layout Analysis

    Institute of Scientific and Technical Information of China (English)

    Di Wen; Xiao-Qing Ding

    2006-01-01

    In this paper, a visual similarity based document layout analysis (DLA) scheme is proposed, which, by using a clustering strategy, can adaptively deal with documents in different languages, with different layout structures and skew angles. Aiming at a robust and adaptive DLA approach, the authors first manage to find a set of representative filters and statistics to characterize typical texture patterns in document images, which is done through a visual similarity testing process. Texture features are then extracted from these filters and passed into a dynamic clustering procedure, which is called visual similarity clustering. Finally, text contents are located from the clustered results. Benefiting from this scheme, the algorithm demonstrates strong robustness and adaptability in a wide variety of documents, which previous traditional DLA approaches do not possess.

  16. A text mining framework in R and its applications

    OpenAIRE

    Feinerer, Ingo

    2008-01-01

    Text mining has become an established discipline both in research and in business intelligence. However, many existing text mining toolkits lack easy extensibility and provide only poor support for interacting with statistical computing environments. Therefore we propose a text mining framework for the statistical computing environment R which provides intelligent methods for corpora handling, meta data management, preprocessing, operations on documents, and data export. We present how well es...

  17. Enhancing Text Clustering Using Concept-based Mining Model

    Directory of Open Access Journals (Sweden)

    Lincy Liptha R.

    2012-03-01

    Full Text Available Text Mining techniques are mostly based on statistical analysis of a word or phrase. The statistical analysis of term frequency captures the importance of a term within a single document only, but two terms can have the same frequency in the same document while the meaning contributed by one term is more appropriate than the meaning contributed by the other. Hence, the terms that capture the semantics of the text should be given more importance. Here, a new concept-based mining model is introduced. It analyses the terms at the sentence, document and corpus levels. The model consists of sentence-based concept analysis which calculates the conceptual term frequency (ctf), document-based concept analysis which finds the term frequency (tf), corpus-based concept analysis which determines the document frequency (df), and a concept-based similarity measure. The process of calculating the ctf, tf and df measures in a corpus is carried out by the proposed algorithm, which is called the Concept-Based Analysis Algorithm. By doing so we cluster the web documents in an efficient way, and the quality of the clusters achieved by this model significantly surpasses that of traditional single-term-based approaches.
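
    A minimal sketch of the three frequency measures named above (ctf, tf, df) is given below. It assumes each document arrives pre-split into sentences and uses plain whitespace tokenisation; the paper's actual Concept-Based Analysis Algorithm and similarity measure are not reproduced.

        from collections import Counter

        def concept_statistics(corpus):
            """corpus: list of documents, each a list of sentence strings."""
            df = Counter()                       # document frequency per term
            per_document = []
            for doc in corpus:
                tokens = [w for sent in doc for w in sent.lower().split()]
                tf = Counter(tokens)             # term frequency within the document
                ctf = {}                         # crude conceptual term frequency:
                for term in tf:                  # mean count per sentence containing the term
                    hits = [sent.lower().split().count(term) for sent in doc
                            if term in sent.lower().split()]
                    ctf[term] = sum(hits) / len(hits)
                df.update(tf.keys())
                per_document.append({"tf": tf, "ctf": ctf})
            return per_document, df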

  18. AREVA - 2012 Reference document

    International Nuclear Information System (INIS)

    After a presentation of the person responsible for this Reference Document, of statutory auditors, and of a summary of financial information, this report address the different risk factors: risk management and coverage, legal risk, industrial and environmental risk, operational risk, risk related to major projects, liquidity and market risk, and other risks (related to political and economic conditions, to Group's structure, and to human resources). The next parts propose information about the issuer, a business overview (markets for nuclear power and renewable energies, customers and suppliers, group's strategy, operations), a brief presentation of the organizational structure, a presentation of properties, plants and equipment (principal sites, environmental issues which may affect these items), analysis and comments on the group's financial position and performance, a presentation of capital resources, a presentation of research and development activities (programs, patents and licenses), a brief description of financial objectives and profit forecasts or estimates, a presentation of administration, management and supervision bodies, a description of the operation of corporate bodies, an overview of personnel, of principal shareholders, and of transactions with related parties, a more detailed presentation of financial information concerning assets, financial positions and financial performance. Addition information regarding share capital is given, as well as an indication of major contracts, third party information, available documents, and information on holdings

  19. Regulatory guidance document

    Energy Technology Data Exchange (ETDEWEB)

    NONE

    1994-05-01

    The Office of Civilian Radioactive Waste Management (OCRWM) Program Management System Manual requires preparation of the OCRWM Regulatory Guidance Document (RGD) that addresses licensing, environmental compliance, and safety and health compliance. The document provides: regulatory compliance policy; guidance to OCRWM organizational elements to ensure a consistent approach when complying with regulatory requirements; strategies to achieve policy objectives; organizational responsibilities for regulatory compliance; guidance with regard to Program compliance oversight; and guidance on the contents of a project-level Regulatory Compliance Plan. The scope of the RGD includes site suitability evaluation, licensing, environmental compliance, and safety and health compliance, in accordance with the direction provided by Section 4.6.3 of the PMS Manual. Site suitability evaluation and regulatory compliance during site characterization are significant activities, particularly with regard to the YW MSA. OCRWM's evaluation of whether the Yucca Mountain site is suitable for repository development must precede its submittal of a license application to the Nuclear Regulatory Commission (NRC). Accordingly, site suitability evaluation is discussed in Chapter 4, and the general statements of policy regarding site suitability evaluation are discussed in Section 2.1. Although much of the data and analyses may initially be similar, the licensing process is discussed separately in Chapter 5. Environmental compliance is discussed in Chapter 6. Safety and Health compliance is discussed in Chapter 7.

  20. AREVA 2010 Reference document

    International Nuclear Information System (INIS)

    After a presentation of the person responsible for this document, and of statutory auditors, this report proposes some selected financial information. Then, it addresses, presents and comments the different risk factors: risk management and coverage, legal risk, industrial and environmental risk, operational risk, risks related to major projects, liquidity and market risk, and other risk. Then, after a presentation of the issuer, it proposes a business overview (markets for nuclear and renewable energies, AREVA customers and suppliers, strategy, activities), a presentation of the organizational structure, a presentation of AREVA properties, plants and equipment (sites, environmental issues), an analysis and comment of the group's financial position and performance, a presentation of its capital resources, an overview of its research and development activities, programs, patents and licenses. It indicates profit forecast and estimates, presents the administrative, management and supervisory bodies, and compensation and benefits amounts, reports of the functioning of corporate bodies. It describes the human resource company policy, indicates the main shareholders and transactions with related parties. It proposes financial information concerning assets, financial positions and financial performance. This document contains its French and its English versions

  1. Regulatory guidance document

    International Nuclear Information System (INIS)

    The Office of Civilian Radioactive Waste Management (OCRWM) Program Management System Manual requires preparation of the OCRWM Regulatory Guidance Document (RGD) that addresses licensing, environmental compliance, and safety and health compliance. The document provides: regulatory compliance policy; guidance to OCRWM organizational elements to ensure a consistent approach when complying with regulatory requirements; strategies to achieve policy objectives; organizational responsibilities for regulatory compliance; guidance with regard to Program compliance oversight; and guidance on the contents of a project-level Regulatory Compliance Plan. The scope of the RGD includes site suitability evaluation, licensing, environmental compliance, and safety and health compliance, in accordance with the direction provided by Section 4.6.3 of the PMS Manual. Site suitability evaluation and regulatory compliance during site characterization are significant activities, particularly with regard to the YW MSA. OCRWM's evaluation of whether the Yucca Mountain site is suitable for repository development must precede its submittal of a license application to the Nuclear Regulatory Commission (NRC). Accordingly, site suitability evaluation is discussed in Chapter 4, and the general statements of policy regarding site suitability evaluation are discussed in Section 2.1. Although much of the data and analyses may initially be similar, the licensing process is discussed separately in Chapter 5. Environmental compliance is discussed in Chapter 6. Safety and Health compliance is discussed in Chapter 7

  2. ExactPack Documentation

    Energy Technology Data Exchange (ETDEWEB)

    Singleton, Jr., Robert [Los Alamos National Lab. (LANL), Los Alamos, NM (United States); Israel, Daniel M. [Los Alamos National Lab. (LANL), Los Alamos, NM (United States); Doebling, Scott William [Los Alamos National Lab. (LANL), Los Alamos, NM (United States); Woods, Charles Nathan [Los Alamos National Lab. (LANL), Los Alamos, NM (United States); Kaul, Ann [Los Alamos National Lab. (LANL), Los Alamos, NM (United States); Walter, Jr., John William [Los Alamos National Lab. (LANL), Los Alamos, NM (United States); Rogers, Michael Lloyd [Los Alamos National Lab. (LANL), Los Alamos, NM (United States)

    2016-05-09

    For code verification, one compares the code output against known exact solutions. There are many standard test problems used in this capacity, such as the Noh and Sedov problems. ExactPack is a utility that integrates many of these exact solution codes into a common API (application program interface), and can be used as a stand-alone code or as a python package. ExactPack consists of python driver scripts that access a library of exact solutions written in Fortran or Python. The spatial profiles of the relevant physical quantities, such as the density, fluid velocity, sound speed, or internal energy, are returned at a time specified by the user. The solution profiles can be viewed and examined by a command line interface or a graphical user interface, and a number of analysis tools and unit tests are also provided. We have documented the physics of each problem in the solution library, and provided complete documentation on how to extend the library to include additional exact solutions. ExactPack’s code architecture makes it easy to extend the solution-code library to include additional exact solutions in a robust, reliable, and maintainable manner.

  3. TEXT SIGNAGE RECOGNITION IN ANDROID MOBILE DEVICES

    Directory of Open Access Journals (Sweden)

    Oi-Mean Foong

    2013-01-01

    Full Text Available This study presents a Text Signage Recognition (TSR) model in Android mobile devices for Visually Impaired People (VIP). Independent navigation is always a challenge for VIPs in unfamiliar indoor surroundings. Assistive technology such as Android smart devices has great potential to assist VIPs in indoor navigation using a built-in speech synthesizer. In contrast to previous TSR research, which was deployed on a standalone personal computer using Otsu's algorithm, we have developed an affordable Text Signage Recognition system for Android mobile devices using the Tesseract OCR engine. The proposed TSR model used input images from the International Conference on Document Analysis and Recognition (ICDAR) 2003 dataset for system training and testing. The TSR model was tested by four volunteers who were blindfolded. The system performance of the TSR model was assessed using different metrics (i.e., Precision, Recall, F-Score and Recognition Formulas) to determine its accuracy. Experimental results show that the proposed TSR model achieved a satisfactory recognition rate.
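
    A desktop approximation of the recognise-then-speak pipeline is sketched below. The paper targets Android; pytesseract and pyttsx3 are used here only as convenient stand-ins for the on-device Tesseract engine and the Android speech synthesizer, and the sketch assumes a locally installed Tesseract binary.

        from PIL import Image
        import pytesseract          # wrapper around a locally installed Tesseract engine
        import pyttsx3              # off-the-shelf speech synthesis, stand-in for Android TTS

        def read_signage(image_path):
            # OCR the signage image, tidy the result, then speak it aloud.
            text = pytesseract.image_to_string(Image.open(image_path))
            text = " ".join(text.split())        # collapse OCR line breaks and stray whitespace
            if text:
                engine = pyttsx3.init()
                engine.say(text)
                engine.runAndWait()
            return text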

  4. Algorithmic Detection of Computer Generated Text

    CERN Document Server

    Lavoie, Allen

    2010-01-01

    Computer generated academic papers have been used to expose a lack of thorough human review at several computer science conferences. We assess the problem of classifying such documents. After identifying and evaluating several quantifiable features of academic papers, we apply methods from machine learning to build a binary classifier. In tests with two hundred papers, the resulting classifier correctly labeled papers either as human written or as computer generated with no false classifications of computer generated papers as human and a 2% false classification rate for human papers as computer generated. We believe generalizations of these features are applicable to similar classification problems. While most current text-based spam detection techniques focus on the keyword-based classification of email messages, a new generation of unsolicited computer-generated advertisements masquerade as legitimate postings in online groups, message boards and social news sites. Our results show that taking the formatti...
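
    The classifier described above can be approximated with a few quantifiable features and an off-the-shelf linear model. The features and labels below are illustrative stand-ins, not the authors' actual feature set or training corpus.

        import re
        from sklearn.linear_model import LogisticRegression

        def paper_features(text):
            words = text.split()
            sentences = [s for s in re.split(r"[.!?]", text) if s.strip()]
            avg_sentence_len = len(words) / (len(sentences) or 1)
            type_token_ratio = len(set(w.lower() for w in words)) / (len(words) or 1)
            citation_markers = text.count("[")               # crude reference-density proxy
            return [avg_sentence_len, type_token_ratio, citation_markers]

        def train_detector(texts, labels):
            """labels: 1 = computer generated, 0 = human written."""
            X = [paper_features(t) for t in texts]
            return LogisticRegression(max_iter=1000).fit(X, labels)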

  5. Hierarchical Classification of Chinese Documents Based on N-grams

    Institute of Scientific and Technical Information of China (English)

    2001-01-01

    We explore techniques of utilizing N-gram information to categorize Chinese text documents hierarchically, so that the classifier can shake off the burden of large dictionaries and complex segmentation processing and subsequently be domain- and time-independent. A hierarchical Chinese text classifier is implemented. Experimental results show that hierarchically classifying Chinese text documents based on N-grams can achieve satisfactory performance and outperforms other traditional Chinese text classifiers.
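
    The appeal of the N-gram approach is that character N-grams sidestep dictionaries and word segmentation entirely. The sketch below shows a flat (non-hierarchical) character N-gram classifier built with scikit-learn; the hierarchical scheme and the Chinese corpus used in the paper are not reproduced, and the parameter choices are illustrative.

        from sklearn.feature_extraction.text import CountVectorizer
        from sklearn.naive_bayes import MultinomialNB
        from sklearn.pipeline import make_pipeline

        def build_ngram_classifier(train_texts, train_labels, n=2):
            # Character uni- and bi-grams avoid any word segmentation step.
            model = make_pipeline(
                CountVectorizer(analyzer="char", ngram_range=(1, n)),
                MultinomialNB(),
            )
            model.fit(train_texts, train_labels)
            return model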

  6. A Hybrid Feature Selection Approach for Arabic Documents Classification

    NARCIS (Netherlands)

    Habib, Mena B.; Fayed, Zaki T.; Gharib, Tarek F.; Sarhan, Ahmed A. E.; Salem, Abdel-Badeeh M.

    2006-01-01

    Text Categorization (classification) is the process of classifying documents into a predefined set of categories based on their content. Text categorization algorithms usually represent documents as bags of words and consequently have to deal with huge number of features. Feature selection tries to

  7. On Text Realization Image Steganography

    Directory of Open Access Journals (Sweden)

    Dr. Mohammed Nasser Hussein Al-Turfi

    2012-02-01

    Full Text Available In this paper the steganography strategy is implemented in a different way and from a different scope: the important data are neither hidden in an image nor transferred through the communication channel inside an image. On the contrary, a well-known image that exists on both sides of the channel is used, and a text message containing the important data is transmitted. With suitable operations, the source image can be re-mixed and re-made. The algorithm is implemented in MATLAB 7, where it shows a high ability to accomplish the task for images of different types and sizes. Perfect reconstruction was achieved on the receiving side. Most interestingly, the algorithm, which deals with secured image transmission, transmits no images at all.

  8. Linguistic dating of biblical texts

    DEFF Research Database (Denmark)

    Young, Ian; Rezetko, Robert; Ehrensvärd, Martin Gustaf

    Since the beginning of critical scholarship biblical texts have been dated using linguistic evidence. In recent years this has become a controversial topic, especially with the publication of Ian Young (ed.), Biblical Hebrew: Studies in Chronology and Typology (2003). However, until now there has been no introduction and comprehensive study of the field. Volume 1 introduces the field of linguistic dating of biblical texts, particularly to intermediate and advanced students of biblical Hebrew who have a reasonable background in the language, having completed at least an introductory course at the university or divinity school level, but also to scholars of the Hebrew Bible in general who have not been exposed to the full scope of issues. The book is useful to a wide range of readers by introducing topics at a basic level before entering into detailed discussion. Among the many issues discussed...

  9. Challenges in Kurdish Text Processing

    OpenAIRE

    Esmaili, Kyumars Sheykh

    2012-01-01

    Despite having a large number of speakers, the Kurdish language is among the less-resourced languages. In this work we highlight the challenges and problems in providing the required tools and techniques for processing texts written in Kurdish. From a high-level perspective, the main challenges are: the inherent diversity of the language, standardization and segmentation issues, and the lack of language resources.

  10. Psychologische Interpretation. Biographien, Texte, Tests

    OpenAIRE

    Fahrenberg, Jochen

    2002-01-01

    Biographies, texts and tests are interpreted psychologically. Psychological interpretation is defined as the translation of a statement accompanied by explanations that establish relationships. In this way connections are uncovered and results are placed in context. Interpretation is translation and mutual understanding. It must combine heuristics with methodological critique. The book introduces these methodological foundations and the rules of psychological interpretation. The first chapters of the book begin with an interpretation...

  11. Text Analytics to Data Warehousing

    OpenAIRE

    Kalli Srinivasa Nageswara Prasad; S. Ramakrishna

    2010-01-01

    Information hidden or stored in unstructured data can play a critical role in making decisions, understanding and conducting other business functions. Integrating data stored in both structured and unstructured formats can add significant value to an organization. With the extent of development happening in Text Mining and in technologies for dealing with unstructured and semi-structured data, such as XML and MML (Mining Markup Language), to extract and analyze data, text analytics has evolved to handle un...

  12. A Hough Transform based Technique for Text Segmentation

    CERN Document Server

    Saha, Satadal; Nasipuri, Mita; Basu, Dipak Kr

    2010-01-01

    Text segmentation is an inherent part of an OCR system irrespective of its domain of application. The OCR system contains a segmentation module where the text lines, words and ultimately the characters must be segmented properly for successful recognition. The present work implements a Hough transform based technique for line and word segmentation from digitized images. The proposed technique is applied not only on a document image dataset but also on datasets for a business card reader system and a license plate recognition system. For standardization of the performance of the system, the technique is also applied on the public domain dataset published on the website of CMATER, Jadavpur University. The document images consist of multi-script printed and handwritten text lines with variety in script and line spacing within a single document image. The technique performs quite satisfactorily when applied on mobile camera captured business card images with low resolution. The usefulness of the technique is verifie...
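
    As a rough illustration of the Hough-based idea, the sketch below detects long, roughly horizontal segments on a binarised page with OpenCV. All parameter values are illustrative; the paper's own line and word segmentation logic is considerably more involved.

        import cv2
        import numpy as np

        def detect_line_segments(image_path):
            # Binarise the page, then look for long segments with a probabilistic Hough transform.
            img = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
            _, binary = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
            lines = cv2.HoughLinesP(binary, rho=1, theta=np.pi / 180, threshold=100,
                                    minLineLength=img.shape[1] // 3, maxLineGap=20)
            return [] if lines is None else [tuple(l[0]) for l in lines]   # (x1, y1, x2, y2) segments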

  13. AN APPROACH FOR TEXT SUMMARIZATION USING DEEP LEARNING ALGORITHM

    Directory of Open Access Journals (Sweden)

    G. PadmaPriya

    2014-01-01

    Full Text Available Nowadays much research is being carried out on text summarization. Because of the increasing amount of information on the internet, this kind of research is gaining more and more attention among researchers. Extractive text summarization generates a brief summary by extracting a proper set of sentences from a document or multiple documents using deep learning. The whole concept is to reduce the documents to the important information present in them. The procedure uses the Restricted Boltzmann Machine (RBM) algorithm for better efficiency by removing redundant sentences. The restricted Boltzmann machine is a graphical model for binary random variables. It consists of three layers: input, hidden and output. The input data are uniformly distributed in the hidden layer for operation. The experimentation is carried out and the summary is generated for three different document sets from different knowledge domains. The f-measure value is the indicator of the performance of the proposed text summarization method. The top responses of the three different knowledge domains in accordance with the f-measure are 0.85, 1.42 and 1.97 respectively for the three document sets.
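
    A loose sketch of RBM-assisted sentence selection follows. The sentence vectors, the number of hidden units and the salience score are illustrative choices; the paper's exact feature set, training procedure and redundancy removal are not reproduced.

        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.neural_network import BernoulliRBM

        def rbm_summary(sentences, k=3):
            X = TfidfVectorizer().fit_transform(sentences)          # TF-IDF sentence vectors in [0, 1]
            rbm = BernoulliRBM(n_components=16, learning_rate=0.05, n_iter=50, random_state=0)
            hidden = rbm.fit_transform(X)                           # latent sentence features
            scores = hidden.sum(axis=1)                             # crude salience score per sentence
            top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:k]
            return [sentences[i] for i in sorted(top)]              # keep original sentence order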

  14. Multilingual documentation and classification.

    Science.gov (United States)

    Donnelly, Kevin

    2008-01-01

    Health care providers around the world have used classification systems for decades as a basis for documentation, communications, statistical reporting, reimbursement and research. In more recent years machine-readable medical terminologies have taken on greater importance with the adoption of electronic health records and the need for greater granularity of data in clinical systems. Use of a clinical terminology harmonised with classifications, implemented within a clinical information system, will enable the delivery of many patient health benefits including electronic clinical decision support, disease screening and enhanced patient safety. In order to be usable these systems must be translated into the language of use, without losing meaning. It is evident that today one system cannot meet all requirements which call for collaboration and harmonisation in order to achieve true interoperability on a multilingual basis.

  15. Visualising Discourse Structure in Interactive Documents

    OpenAIRE

    Mancini, Clara; Pietsch, Christian; Scott, Donia; Busemann, Stephan

    2007-01-01

    In this paper we introduce a method for generating interactive documents which exploits the visual features of hypertext to represent discourse structure. We explore the consistent and principled use of graphics and animation to support navigation and comprehension of non-linear text, where textual discourse markers do not always work effectively.

  16. Text processing for technical reports (direct computer-assisted origination, editing, and output of text)

    Energy Technology Data Exchange (ETDEWEB)

    De Volpi, A.; Fenrick, M. R.; Stanford, G. S.; Fink, C. L.; Rhodes, E. A.

    1980-10-01

    Documentation often is a primary residual of research and development. Because of this important role and because of the large amount of time consumed in generating technical reports, particularly those containing formulas and graphics, an existing data-processing computer system has been adapted so as to provide text-processing of technical documents. Emphasis has been on accuracy, turnaround time, and time savings for staff and secretaries, for the types of reports normally produced in the reactor development program. The computer-assisted text-processing system, called TXT, has been implemented to benefit primarily the originator of technical reports. The system is of particular value to professional staff, such as scientists and engineers, who have responsibility for generating much correspondence or lengthy, complex reports or manuscripts - especially if prompt turnaround and high accuracy are required. It can produce text that contains special Greek or mathematical symbols. Written in FORTRAN and MACRO, the program TXT operates on a PDP-11 minicomputer under the RSX-11M multitask multiuser monitor. Peripheral hardware includes videoterminals, electrostatic printers, and magnetic disks. Either data- or word-processing tasks may be performed at the terminals. The repertoire of operations has been restricted so as to minimize user training and memory burden. Secretarial staff may be readily trained to make corrections from annotated copy. Some examples of camera-ready copy are provided.

  17. REVISION AND REWRITING IN OFFICIAL DOCUMENTS: CONCEPTS AND METHODOLOGICAL ORIENTATIONS

    Directory of Open Access Journals (Sweden)

    Renilson José MENEGASSI

    2014-12-01

    Full Text Available The text discusses how the concepts of and methodological orientations for the processes of text revision and rewriting, in the teaching context, are conceived and presented, and how they guide the Portuguese Language teacher's work. To this end, the concepts of revision and rewriting are characterized in four Brazilian official documents, two of national scope and two from Paraná state. The information was organized according to what the documents say about the attitudes of teacher and student towards the investigated concepts, which determine the methodological orientations for text production work. The results show irregularities in the handling of these processes, highlighting one of the official documents, of national scope, as the one presenting the most suitable methodological and conceptual orientations. This shows that the documents which guide mother-tongue teaching in the country still do not appropriately discuss the written text production process, specifically revision and rewriting, even in the more recent documents.

  18. Areva reference document 2007

    International Nuclear Information System (INIS)

    This reference document contains information on the AREVA group's objectives, prospects and development strategies, particularly in Chapters 4 and 7. It contains also information on the markets, market shares and competitive position of the AREVA group. Content: 1 - Person responsible for the reference document and persons responsible for auditing the financial statements; 2 - Information pertaining to the transaction (not applicable); 3 - General information on the company and its share capital: Information on Areva, Information on share capital and voting rights, Investment certificate trading, Dividends, Organization chart of AREVA group companies, Equity interests, Shareholders' agreements; 4 - Information on company operations, new developments and future prospects: Overview and strategy of the AREVA group, The Nuclear Power and Transmission and Distribution markets, The energy businesses of the AREVA group, Front End division, Reactors and Services division, Back End division, Transmission and Distribution division, Major contracts, Principal sites of the AREVA group, AREVA's customers and suppliers, Sustainable Development and Continuous Improvement, Capital spending programs, Research and Development programs, Intellectual Property and Trademarks, Risk and insurance; 5 - Assets financial position financial performance: Analysis of and comments on the group's financial position and performance, Human Resources report, Environmental report, Consolidated financial statements 2007, Notes to the consolidated financial statements, Annual financial statements 2007, Notes to the corporate financial statements; 6 - Corporate governance: Composition and functioning of corporate bodies, Executive compensation, Profit-sharing plans, AREVA Values Charter, Annual Ordinary General Meeting of Shareholders of April 17, 2008; 7 - Recent developments and future prospects: Events subsequent to year-end closing for 2007, Outlook; Glossary; table of concordance

  19. Areva, reference document 2006

    International Nuclear Information System (INIS)

    This reference document contains information on the AREVA group's objectives, prospects and development strategies, particularly in Chapters 4 and 7. It contains information on the markets, market shares and competitive position of the AREVA group. Content: - 1 Person responsible for the reference document and persons responsible for auditing the financial statements; - 2 Information pertaining to the transaction (Not applicable); - 3 General information on the company and its share capital: Information on AREVA, on share capital and voting rights, Investment certificate trading, Dividends, Organization chart of AREVA group companies, Equity interests, Shareholders' agreements; - 4 Information on company operations, new developments and future prospects: Overview and strategy of the AREVA group, The Nuclear Power and Transmission and Distribution markets, The energy businesses of the AREVA group, Front End division, Reactors and Services division, Back End division, Transmission and Distribution division, Major contracts, The principal sites of the AREVA group, AREVA's customers and suppliers, Sustainable Development and Continuous Improvement, Capital spending programs, Research and development programs, intellectual property and trademarks, Risk and insurance; - 5 Assets - Financial position - Financial performance: Analysis of and comments on the group's financial position and performance, 2006 Human Resources Report, Environmental Report, Consolidated financial statements, Notes to the consolidated financial statements, AREVA SA financial statements, Notes to the corporate financial statements; 6 - Corporate Governance: Composition and functioning of corporate bodies, Executive compensation, Profit-sharing plans, AREVA Values Charter, Annual Combined General Meeting of Shareholders of May 3, 2007; 7 - Recent developments and future prospects: Events subsequent to year-end closing for 2006, Outlook; 8 - Glossary; 9 - Table of concordance

  20. A Survey On Various Approaches Of Text Extraction In Images

    Directory of Open Access Journals (Sweden)

    C.P. Sumathi

    2012-09-01

    Full Text Available Text Extraction plays a major role in finding vital and valuable information. Text extraction involves detection, localization, tracking, binarization, extraction, enhancement and recognition of the text from a given image. These text characters are difficult to detect and recognize due to their variation in size, font, style, orientation, alignment, contrast, and complex colored or textured backgrounds. Due to the rapid growth of available multimedia documents and the growing requirement for information identification, indexing and retrieval, much research has been done on text extraction in images. Several techniques have been developed for extracting the text from an image. The proposed methods were based on morphological operators, wavelet transform, artificial neural networks, skeletonization operations, edge detection algorithms, histogram techniques, etc. All these techniques have their benefits and restrictions. This article discusses various schemes proposed earlier for extracting the text from an image. This paper also provides a performance comparison of several existing methods proposed by researchers for extracting the text from an image.

  1. Automated Postediting of Documents

    CERN Document Server

    Knight, K; Knight, Kevin; Chander, Ishwar

    1994-01-01

    Large amounts of low- to medium-quality English texts are now being produced by machine translation (MT) systems, optical character readers (OCR), and non-native speakers of English. Most of this text must be postedited by hand before it sees the light of day. Improving text quality is tedious work, but its automation has not received much research attention. Anyone who has postedited a technical report or thesis written by a non-native speaker of English knows the potential of an automated postediting system. For the case of MT-generated text, we argue for the construction of postediting modules that are portable across MT systems, as an alternative to hardcoding improvements inside any one system. As an example, we have built a complete self-contained postediting module for the task of article selection (a, an, the) for English noun phrases. This is a notoriously difficult problem for Japanese-English MT. Our system contains over 200,000 rules derived automatically from online text resources. We report on l...

  2. Text Mining for Protein Docking.

    Directory of Open Access Journals (Sweden)

    Varsha D Badal

    2015-12-01

    Full Text Available The rapidly growing amount of publicly available information from biomedical research is readily accessible on the Internet, providing a powerful resource for predictive biomolecular modeling. The accumulated data on experimentally determined structures transformed structure prediction of proteins and protein complexes. Instead of exploring the enormous search space, predictive tools can simply proceed to the solution based on similarity to the existing, previously determined structures. A similar major paradigm shift is emerging due to the rapidly expanding amount of information, other than experimentally determined structures, which still can be used as constraints in biomolecular structure prediction. Automated text mining has been widely used in recreating protein interaction networks, as well as in detecting small ligand binding sites on protein structures. Combining and expanding these two well-developed areas of research, we applied text mining to structural modeling of protein-protein complexes (protein docking). Protein docking can be significantly improved when constraints on the docking mode are available. We developed a procedure that retrieves published abstracts on a specific protein-protein interaction and extracts information relevant to docking. The procedure was assessed on protein complexes from Dockground (http://dockground.compbio.ku.edu). The results show that correct information on binding residues can be extracted for about half of the complexes. The amount of irrelevant information was reduced by conceptual analysis of a subset of the retrieved abstracts, based on the bag-of-words (features) approach. Support Vector Machine models were trained and validated on the subset. The remaining abstracts were filtered by the best-performing models, which decreased the irrelevant information for ~25% of the complexes in the dataset. The extracted constraints were incorporated in the docking protocol and tested on the Dockground unbound
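
    The abstract-filtering step can be approximated with a bag-of-words representation and a linear SVM, as sketched below. The labelled training abstracts and the Dockground-derived evaluation are the authors'; the vectoriser settings here are illustrative.

        from sklearn.feature_extraction.text import CountVectorizer
        from sklearn.svm import LinearSVC
        from sklearn.pipeline import make_pipeline

        def train_abstract_filter(abstracts, labels):
            """abstracts: retrieved abstracts; labels: 1 if the abstract carries docking-relevant constraints."""
            model = make_pipeline(CountVectorizer(stop_words="english", min_df=2), LinearSVC())
            model.fit(abstracts, labels)
            return model

        # Usage sketch: keep only abstracts the model judges relevant before extracting residue constraints.
        # relevant = [a for a in new_abstracts if trained_model.predict([a])[0] == 1]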

  3. Text writing in the air

    OpenAIRE

    Beg, Saira; Khan, M. Fahad; Baig, Faisal

    2016-01-01

    This paper presents a real-time video-based pointing method which allows sketching and writing of English text in the air in front of a mobile camera. The proposed method has two main tasks: first it tracks the colored fingertip in the video frames, and then it applies English OCR over the plotted images in order to recognize the written characters. Moreover, the proposed method provides natural human-system interaction in that it does not require a keypad, stylus, pen or glove, etc., for character input. For...

  4. New Historicism: Text and Context

    Directory of Open Access Journals (Sweden)

    Violeta M. Vesić

    2016-02-01

    Full Text Available During most of the twentieth century history was seen as a phenomenon outside of literature that guaranteed the veracity of literary interpretation. History was unique and it functioned as a basis for reading literary works. During the seventies of the twentieth century there occurred a change of attitude towards history in American literary theory, and there appeared a new theoretical approach which soon became known as New Historicism. Since its inception, New Historicism has been identified with the study of the Renaissance and Romanticism, but nowadays it has been increasingly involved in other literary trends. Although there are great differences in the arguments and practices among the various representatives of this school, New Historicism has clearly recognizable features, and many new historicists will agree with the statement of Walter Cohen that New Historicism, when it appeared in the eighties, represented something quite new in reference to the studies of theory, criticism and history (Cohen 1987, 33). Theoretical connection with Bakhtin, Foucault and Marx is clear, as well as a kind of uneasy tie with deconstruction and the work of Paul de Man. At the center of this approach is a renewed interest in the study of literary works in the light of the historical and political circumstances in which they were created. Foucault encouraged readers to begin to move literary texts and to link them with discourses and representations that are not literary, as well as to examine the sociological aspects of the texts in order to take part in the social struggles of today. The study of literary works using New Historicism is the study of politics, history, culture and the circumstances in which these works were created. With regard to one of the main facts at the center of this criticism, that history cannot be viewed objectively and that reality can only be understood through a cultural context that reveals the work, re-reading and interpretation of

  5. Attitudes and emotions through written text: the case of textual deformation in internet chat rooms.

    Directory of Open Access Journals (Sweden)

    Francisco Yus Ramos

    2010-11-01

    Full Text Available Spanish Internet chat rooms are visited by many young people who use language in a highly creative way (e.g. repetition of letters and punctuation marks). This article evaluates several hypotheses on the use of textual deformation with regard to its communicative effectiveness. The aim is to determine whether these deformations favour a more adequate identification and assessment of the attitudes (propositional or affective) and emotions of their authors. The answers to a questionnaire reveal that, despite the additional information that textual deformation provides, readers do not usually agree on the exact quality of these attitudes and emotions, nor do they establish degrees of intensity related to the amount of text typed. Nevertheless, and despite these results, textual deformation seems to play a role in the interpretation that is finally chosen for these messages sent to chat rooms.

  6. Rank Based Clustering For Document Retrieval From Biomedical Databases

    Directory of Open Access Journals (Sweden)

    Jayanthi Manicassamy

    2009-09-01

    Full Text Available Nowadays, search engines are the most widely used tools for extracting information from various resources throughout the world, and the majority of searches lie in the field of biomedicine, retrieving related documents from various biomedical databases. Current search engines lack document clustering and do not represent the relevance level of the documents extracted from the databases. In order to overcome these pitfalls, a text-based search engine has been developed for retrieving documents from the Medline and PubMed biomedical databases. The search engine incorporates a page-ranking-based clustering concept which automatically represents relevance on a clustering basis. Apart from this, a graph tree is constructed to represent the level of relatedness of the documents that are networked together. Incorporating this advanced functionality into a biomedical document search engine was found to provide better results in reviewing related documents based on relevance.

  7. Succincter Text Indexing with Wildcards

    CERN Document Server

    Thachuk, Chris

    2011-01-01

    We study the problem of indexing text with wildcard positions, motivated by the challenge of aligning sequencing data to large genomes that contain millions of single nucleotide polymorphisms (SNPs)---positions known to differ between individuals. SNPs modeled as wildcards can lead to more informed and biologically relevant alignments. We improve the space complexity of previous approaches by giving a succinct index requiring $(2 + o(1))n \log \sigma + O(n) + O(d \log n) + O(k \log k)$ bits for a text of length $n$ over an alphabet of size $\sigma$ containing $d$ groups of $k$ wildcards. A key to the space reduction is a result we give showing how any compressed suffix array can be supplemented with auxiliary data structures occupying $O(n) + O(d \log \frac{n}{d})$ bits to also support efficient dictionary matching queries. The query algorithm for our wildcard index is faster than previous approaches using reasonable working space. More importantly our new algorithm greatly reduces the query working space to ...

  8. Documenting Penicillin Allergy: The Impact of Inconsistency.

    Directory of Open Access Journals (Sweden)

    Nirav S Shah

    Full Text Available Allergy documentation is frequently inconsistent and incomplete. The impact of this variability on subsequent treatment is not well described. To determine how allergy documentation affects subsequent antibiotic choice. Retrospective, cohort study. 232,616 adult patients seen by 199 primary care providers (PCPs) between January 1, 2009 and January 1, 2014 at an academic medical system. Inter-physician variation in beta-lactam allergy documentation; antibiotic treatment following beta-lactam allergy documentation. 15.6% of patients had a reported beta-lactam allergy. Of those patients, 39.8% had a specific allergen identified and 22.7% had allergic reaction characteristics documented. Variation between PCPs was greater than would be expected by chance (all p<0.001) in the percentage of their patients with a documented beta-lactam allergy (7.9% to 24.8%), identification of a specific allergen (e.g. amoxicillin as opposed to "penicillins") (24.0% to 58.2%) and documentation of the reaction characteristics (5.4% to 51.9%). After beta-lactam allergy documentation, patients were less likely to receive penicillins (Relative Risk [RR] 0.16 [95% Confidence Interval: 0.15-0.17]) and cephalosporins (RR 0.28 [95% CI 0.27-0.30]) and more likely to receive fluoroquinolones (RR 1.5 [95% CI 1.5-1.6]), clindamycin (RR 3.8 [95% CI 3.6-4.0]) and vancomycin (RR 5.0 [95% CI 4.3-5.8]). Among patients with beta-lactam allergy, rechallenge was more likely when a specific allergen was identified (RR 1.6 [95% CI 1.5-1.8]) and when reaction characteristics were documented (RR 2.0 [95% CI 1.8-2.2]). Provider documentation of beta-lactam allergy is highly variable, and details of the allergy are infrequently documented. Classification of a patient as beta-lactam allergic and incomplete documentation regarding the details of the allergy lead to beta-lactam avoidance and use of other antimicrobial agents, behaviors that may adversely impact care quality and cost.

  9. Transcript mapping for handwritten English documents

    Science.gov (United States)

    Jose, Damien; Bharadwaj, Anurag; Govindaraju, Venu

    2008-01-01

    Transcript mapping or text alignment with handwritten documents is the automatic alignment of words in a text file with word images in a handwritten document. Such a mapping has several applications in fields ranging from machine learning where large quantities of truth data are required for evaluating handwriting recognition algorithms, to data mining where word image indexes are used in ranked retrieval of scanned documents in a digital library. The alignment also aids "writer identity" verification algorithms. Interfaces which display scanned handwritten documents may use this alignment to highlight manuscript tokens when a person examines the corresponding transcript word. We propose an adaptation of the True DTW dynamic programming algorithm for English handwritten documents. The integration of the dissimilarity scores from a word-model word recognizer and Levenshtein distance between the recognized word and lexicon word, as a cost metric in the DTW algorithm leading to a fast and accurate alignment, is our primary contribution. Results provided, confirm the effectiveness of our approach.
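
    A simplified dynamic-programming alignment in the spirit of the cost metric described above is sketched below: the local cost mixes a recognizer dissimilarity score with the Levenshtein distance between the recognized word and the transcript word. The True DTW adaptation and the word-model recognizer itself are not reproduced, and the weighting is illustrative.

        import numpy as np

        def levenshtein(a, b):
            # Standard edit distance between two strings.
            d = np.zeros((len(a) + 1, len(b) + 1), dtype=int)
            d[:, 0] = np.arange(len(a) + 1)
            d[0, :] = np.arange(len(b) + 1)
            for i in range(1, len(a) + 1):
                for j in range(1, len(b) + 1):
                    d[i, j] = min(d[i - 1, j] + 1, d[i, j - 1] + 1,
                                  d[i - 1, j - 1] + (a[i - 1] != b[j - 1]))
            return d[len(a), len(b)]

        def alignment_cost(transcript_words, recognized_words, recognizer_scores, w=0.5):
            """recognizer_scores[j]: dissimilarity of word image j under the word recognizer."""
            n, m = len(transcript_words), len(recognized_words)
            cost = np.full((n + 1, m + 1), np.inf)
            cost[0, 0] = 0.0
            for i in range(1, n + 1):
                for j in range(1, m + 1):
                    local = w * recognizer_scores[j - 1] + (1 - w) * levenshtein(
                        transcript_words[i - 1], recognized_words[j - 1])
                    cost[i, j] = local + min(cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1])
            return cost[n, m]     # backtracking through the cost matrix yields the word-level mapping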

  10. OVERLAPPING VIRTUAL CADASTRAL DOCUMENTATION

    Directory of Open Access Journals (Sweden)

    Madalina - Cristina Marian

    2013-12-01

    Full Text Available Two cadastral plans of buildings can overlap virtually. The overlap becomes apparent at digital reception. According to Law no. 7/1996, as amended and supplemented, these problems are solved by updating the graphical database, that is, by repositioning. This paper addresses the issue of virtual cadastral overlap over the period 1999-2012.

  11. Toward Documentation of Program Evolution

    DEFF Research Database (Denmark)

    Vestdam, Thomas; Nørmark, Kurt

    2005-01-01

    The documentation of a program often falls behind the evolution of the program source files. When this happens it may be attractive to shift the documentation mode from updating the documentation to documenting the evolution of the program. This paper describes tools that support the documentation of program evolution. The tools are refinements of the Elucidative Programming tools, which in turn are inspired by Literate Programming tools. The version-aware Elucidative Programming tools are able to process a set of program source files in different versions together with unversioned documentation files. The paper introduces a set of fine-grained program evolution steps, which are supported directly by the documentation tools. The automatic discovery of the fine-grained program evolution steps makes up a platform for documenting coarse-grained and more high-level program evolution steps...

  12. Relevant documents initiating the EDA

    International Nuclear Information System (INIS)

    In December 1990, the four ITER Parties successfully concluded the Conceptual Design Activities for ITER. In January, 1991, each of the Parties had decided to enter negotiations on co-operation in the ITER EDA, which are to be conducted under the auspices of the IAEA; and each Party was prepared to receive a letter of invitation from the Director General of the IAEA to participate in those negotiations. Four negotiating meetings were held in 1991, the first being in Vienna, the second in Tokyo, the third in Reston near Washington, and the fourth in Moscow. After completion of the negotiations, each of the Parties proceeded domestically to reach its decision to sign the ITER EDA Agreement and its Protocol 1. All formalities were concluded during the first half of 1992, and the EDA documents were signed in Washington on July 21, 1992. Following the signing, each of the Parties provided the Director General with the names of its two ITER Council members. With the formation of the Council, the EDA had begun. This volume contains the papers developed before the start of the EDA. It begins with the Director General's invitation to participate in the negotiations and ends with the Parties' designations of the ITER Council members. While the evolving text of the Agreement and its Protocol 1 is referred to in some of these papers as an attachment, it is only the final, signed text that is reproduced in this volume

  13. Document Clustering Based on Semi-Supervised Term Clustering

    Directory of Open Access Journals (Sweden)

    Hamid Mahmoodi

    2012-05-01

    Full Text Available The study proposes a multi-step feature (term) selection process which, in a semi-supervised fashion, provides initial centers for the term clusters. The fuzzy c-means (FCM) clustering algorithm is then used for clustering the terms. Finally, each document is assigned to its closest associated term clusters. While most text clustering algorithms use documents directly for clustering, we propose to first group the terms using the FCM algorithm and then cluster the documents based on the term clusters. We evaluate the effectiveness of our technique on several standard text collections and compare our results with some classical text clustering algorithms.
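
    The two-stage idea (cluster terms first, then assign documents to term clusters) can be sketched with a small fuzzy c-means implementation. The term vectors, seed centers and fuzzifier value are illustrative; the paper's semi-supervised seed selection is not reproduced.

        import numpy as np

        def fuzzy_cmeans(X, centers, m=2.0, iters=50):
            """X: (n_terms, dim) term vectors; centers: (c, dim) initial seed centers."""
            for _ in range(iters):
                dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-9
                u = 1.0 / (dist ** (2.0 / (m - 1.0)))
                u /= u.sum(axis=1, keepdims=True)                 # fuzzy memberships per term
                um = u ** m
                centers = (um.T @ X) / um.sum(axis=0)[:, None]    # weighted center update
            return u, centers

        def assign_documents(doc_vectors, centers):
            # Each document goes to the term cluster whose center it is closest to.
            d = np.linalg.norm(doc_vectors[:, None, :] - centers[None, :, :], axis=2)
            return d.argmin(axis=1)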

  14. A Noisy-Channel Model for Document Compression

    CERN Document Server

    Daumé, Hal

    2009-01-01

    We present a document compression system that uses a hierarchical noisy-channel model of text production. Our compression system first automatically derives the syntactic structure of each sentence and the overall discourse structure of the text given as input. The system then uses a statistical hierarchical model of text production in order to drop non-important syntactic and discourse constituents so as to generate coherent, grammatical document compressions of arbitrary length. The system outperforms both a baseline and a sentence-based compression system that operates by simplifying sequentially all sentences in a text. Our results support the claim that discourse knowledge plays an important role in document summarization.

  15. Querying XML Documents in Logic Programming

    CERN Document Server

    Almendros-Jiménez, J M; Enciso-Baños, F J

    2007-01-01

    Extensible Markup Language (XML) is a simple, very flexible text format derived from SGML. Originally designed to meet the challenges of large-scale electronic publishing, XML is also playing an increasingly important role in the exchange of a wide variety of data on the Web and elsewhere. The XPath language is the result of an effort to provide a way of addressing parts of an XML document, and in support of this primary purpose it has become a query language over XML documents. In this paper we present a proposal for the implementation of the XPath language in logic programming. With this aim we describe the representation of XML documents by means of a logic program. Rules and facts can be used for representing the document schema and the XML document itself. In particular, we present how to index XML documents in logic programs: rules are supposed to be stored in main memory, whereas facts are stored in secondary memory using two kinds of indexes: one for each XML tag, and another for each group of terminal it...

  16. Everyday Life as a Text

    Directory of Open Access Journals (Sweden)

    Michael Lahey

    2016-02-01

    Full Text Available This article explores how audience data are utilized in the tentative partnerships created between television and social media companies. Specifically, it looks at the mutually beneficial relationship formed between the social media platform Twitter and television. It calls attention to how audience data are utilized as a way for the television industry to map itself onto the everyday lives of digital media audiences. I argue that the data-intensive monitoring of everyday life offers some measure of soft control over audiences in a digital media landscape. To do this, I explore “Social TV”—the relationships created between social media technologies and television—before explaining how Twitter leverages user data into partnerships with various television companies. Finally, the article explains what is fruitful about understanding the Twitter–television relationship as a form of soft control.

  17. PHYSICAL THERAPY DOCUMENTATION: FROM EXAMINATION TO OUTCOME

    Directory of Open Access Journals (Sweden)

    Mia Erickson

    2008-12-01

    Full Text Available The book covers the fundamentals of documentation within contemporary physical therapy practice. It uses practice exercises and case studies to develop the documentation skills needed for current quality standards of patient care. PURPOSE The book aims to provide a comprehensive reference tool for physical therapy students and practitioners to use when documenting in current practice settings. FEATURES The text begins with a chapter comparing different disablement models, including (1) the International Classification of Functioning, Disability, and Health (ICF), developed and recently revised by the WHO, (2) the Nagi framework, and (3) the National Center for Medical Rehabilitation Research (NCMRR) disability classification scheme. The following four chapters sequentially provide the rationale, basic rules and guidelines for documenting physical therapy evaluation, treatment planning and functional outcome in medical records. Different documentation formats are presented using multiple examples of clinical cases. Chapters 6 through 9 provide practice in writing different aspects of notes, including subjective and objective patient information, assessment, treatment plan, and interim and discharge notes. The last three chapters focus on outcome measurement and a discussion of regulatory and reimbursement issues, which are of utmost importance in terms of documentation. Objectives are provided at the beginning and application exercises at the end of each chapter to facilitate the reader's full understanding of the practical and theoretical information. AUDIENCE This book can be considered an excellent source for physical therapy students, educators and practitioners. ASSESSMENT This is a valuable reference tool written by subject specialists in relation to a specific aspect of current physical therapy practice. It fully covers essential information and offers plenty of clinical examples assisting the development and improvement of documentation skills for

  18. Recommended HSE-7 documents hierarchy

    International Nuclear Information System (INIS)

    This report recommends a hierarchy of waste management documents at Los Alamos National Laboratory (LANL or ''Laboratory''). The hierarchy addresses documents that are required to plan, implement, and document waste management programs at Los Alamos. These documents will enable the waste management group and the six sections contained within that group to satisfy requirements that are imposed upon them by the US Department of Energy (DOE), DOE Albuquerque Operations, US Environmental Protection Agency, various State of New Mexico agencies, and Laboratory management

  19. On classifying digital accounting documents

    OpenAIRE

    Chih-Fong, Tsai

    2007-01-01

    Advances in computing and multimedia technologies allow many accounting documents to be digitized at little cost for effective storage and access. Moreover, the amount of accounting documents is increasing rapidly, which leads to the need to develop mechanisms to effectively manage those (semi-structured) digital accounting documents for future accounting information systems (AIS). In general, accounting documents include invoices, purchase orders, checks, photographs, cha...

  20. Language Documentation in the Americas

    Science.gov (United States)

    Franchetto, Bruna; Rice, Keren

    2014-01-01

    In the last decades, the documentation of endangered languages has advanced greatly in the Americas. In this paper we survey the role that international funding programs have played in advancing documentation in this part of the world, with a particular focus on the growth of documentation in Brazil, and we examine some of the major opportunities…

  1. The Practicalities of Document Conversion.

    Science.gov (United States)

    Galbraith, Ian

    1993-01-01

    Describes steps involved in the conversion of source documents to scanned digital image format. Topics addressed include document preparation, including photographs and oversized material; indexing procedures, including automatic indexing possibilities; scanning documents, including resolution and throughput; quality control; backfile conversion;…

  2. Using color management in color document processing

    Science.gov (United States)

    Nehab, Smadar

    1995-04-01

    Color Management Systems have been used for several years in Desktop Publishing (DTP) environments. While this development hasn't matured yet, we are already experiencing the next generation of the color imaging revolution-Device Independent Color for the small office/home office (SOHO) environment. Though there are still open technical issues with device independent color matching, they are not the focal point of this paper. This paper discusses two new and crucial aspects in using color management in color document processing: the management of color objects and their associated color rendering methods; a proposal for a precedence order and handshaking protocol among the various software components involved in color document processing. As color peripherals become affordable to the SOHO market, color management also becomes a prerequisite for common document authoring applications such as word processors. The first color management solutions were oriented towards DTP environments whose requirements were largely different. For example, DTP documents are image-centric, as opposed to SOHO documents that are text and charts centric. To achieve optimal reproduction on low-cost SOHO peripherals, it is critical that different color rendering methods are used for the different document object types. The first challenge in using color management of color document processing is the association of rendering methods with object types. As a result of an evolutionary process, color matching solutions are now available as application software, as driver embedded software and as operating system extensions. Consequently, document processing faces a new challenge, the correct selection of the color matching solution while avoiding duplicate color corrections.

  3. Introduction to Text Mining with R for Information Professionals

    Directory of Open Access Journals (Sweden)

    Monica Maceli

    2016-07-01

    Full Text Available The 'tm: Text Mining Package' in the open source statistical software R has made text analysis techniques easily accessible to both novice and expert practitioners, providing useful ways of analyzing and understanding large, unstructured datasets. Such an approach can yield many benefits to information professionals, particularly those involved in text-heavy research projects. This article will discuss the functionality and possibilities of text mining, as well as the basic setup necessary for novice R users to employ the RStudio integrated development environment (IDE). Common use cases, such as analyzing a corpus of text documents or spreadsheet text data, will be covered, as well as the text mining tools for calculating term frequency, term correlations, clustering, creating wordclouds, and plotting.

  4. DIVERSE DEPICTION OF PARTICLE SWARM OPTIMIZATION FOR DOCUMENT CLUSTERING

    Directory of Open Access Journals (Sweden)

    K. Premalatha

    2011-01-01

    Full Text Available Document clustering algorithms play an important role in organizing huge amounts of documents into a small number of meaningful clusters. Traditional clustering algorithms search only a small subset of possible clusterings and, as a result, there is no guarantee that the solution found will be optimal. This paper presents different representations of the particle in Particle Swarm Optimization (PSO) for document clustering. Experimental results are examined on a document corpus and demonstrate that the Discrete PSO algorithm statistically outperforms the Binary PSO and Simple PSO for document clustering.

  5. Dynamic documents with R and knitr

    CERN Document Server

    Xie, Yihui

    2013-01-01

    Contents include: Introduction; Reproducible Research; Literature; Good and Bad Practices; Barriers; A First Look; Setup; Minimal Examples; Quick Reporting; Extracting R Code; Editors; RStudio; LYX; Emacs/ESS; Other Editors; Document Formats; Input Syntax; Document Formats; Output Renderers; R Scripts; Text Output; Inline Output; Chunk Output; Tables; Themes; Graphics; Graphical Devices; Plot Recording; Plot Rearrangement; Plot Size in Output; Extra Output Options; The tikz Device; Figure Environment; Figure Path; Cache; Implementation; Write Cache; When to Update Cache; Side Effects; Chunk Dependencies; Cross Reference; Chunk Reference; Code Externalization; Chi...

  6. A programmed text in statistics

    CERN Document Server

    Hine, J

    1975-01-01

    Contents include exercises for Section 2 and solutions to the exercises for Sections 1 and 2 (each grouped into physical sciences and engineering, biological sciences, and social sciences), plus tables for χ²-tests (tests involving variances, one-tailed tests, two-tailed tests) and the F-distribution. Preface: This project started some years ago when the Nuffield Foundation kindly gave a grant for writing a programmed text to use with service courses in statistics. The work was carried out by Mrs. Joan Hine and Professor G. B. Wetherill at Bath University, together with some other help from time to time by colleagues at Bath University and elsewhere. Testing was done at various colleges and universities, and some helpful comments were received, but we particularly mention King Edwards School, Bath, who provided some sixth formers as 'guinea pigs' for the fir...

  7. Improving collaborative documentation in CMS

    CERN Document Server

    Lassila-Perini, Kati

    2009-01-01

    Complete and up-to-date documentation is essential for efficient data analysis in a large and complex collaboration like CMS. Good documentation reduces the time spent in problem solving for users and software developers. The scientists in our research environment do not necessarily have the interests or skills of professional technical writers. This results in inconsistencies in the documentation. To improve the quality, we have started a multidisciplinary project involving CMS user support and expertise in technical communication from the University of Turku, Finland. In this paper, we present possible approaches to study the usability of the documentation, for instance, usability tests conducted recently for the CMS software and computing user documentation

  8. Simple-Random-Sampling-Based Multiclass Text Classification Algorithm

    Directory of Open Access Journals (Sweden)

    Wuying Liu

    2014-01-01

    Full Text Available Multiclass text classification (MTC) is a challenging issue and the corresponding MTC algorithms can be used in many applications. The space-time overhead of these algorithms is a serious concern in the era of big data. Through an investigation of the token frequency distribution in a Chinese web document collection, this paper reexamines the power law and proposes a simple-random-sampling-based MTC (SRSMTC) algorithm. Supported by a token-level memory to store labeled documents, the SRSMTC algorithm uses a text retrieval approach to solve text classification problems. The experimental results on the TanCorp data set show that the SRSMTC algorithm can achieve state-of-the-art performance at greatly reduced space-time requirements.

  9. An Evident Theoretic Feature Selection Approach for Text Categorization

    Directory of Open Access Journals (Sweden)

    UMARSATHIC ALI

    2012-06-01

    Full Text Available With the exponential growth of textual documents available in unstructured form on the Internet, feature selection approaches are increasingly significant for the preprocessing of textual documents for automatic text categorization. Feature selection, which focuses on identifying relevant and informative features, can help reduce the computational cost of processing voluminous amounts of data as well as increase the effectiveness of the subsequent text categorization tasks. In this paper, we propose a new evident theoretic feature selection approach for text categorization based on the transferable belief model (TBM). An evaluation of the performance of the proposed evident theoretic feature selection approach on benchmark datasets is also presented. We empirically show the effectiveness of our approach in outperforming the traditional feature selection methods using two standard benchmark datasets.

  10. TEXTS SENTIMENT-ANALYSIS APPLICATION FOR PUBLIC OPINION ASSESSMENT

    Directory of Open Access Journals (Sweden)

    I. A. Bessmertny

    2015-01-01

    Full Text Available The paper describes an approach to the emotional tonality assessment of natural language texts based on special dictionaries. A method for the automatic assessment of public opinion by means of sentiment analysis of the reviews and discussions that follow published Web documents is proposed. The method is based on statistics of words in the documents. A pilot model of the software system implementing sentiment analysis of natural language text in Russian based on a linear assessment scale is developed. Syntactic analysis and word lemmatization are used to identify terms more correctly. Tonality dictionaries are presented in an editable format and are open for enhancement. The program system implementing sentiment analysis of Russian texts based on open tonality dictionaries is presented for the first time.
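
    The abstract describes dictionary-based tonality scoring on a linear scale; the following Python sketch illustrates that general idea with a tiny made-up tonality dictionary (the authors' Russian dictionaries and lemmatization pipeline are not reproduced here).

    ```python
    # Sketch: score a text by averaging the weights of dictionary words it contains.
    import re

    TONALITY = {"good": 1.0, "excellent": 2.0, "bad": -1.0, "terrible": -2.0}  # toy dictionary

    def sentiment_score(text: str) -> float:
        words = re.findall(r"[a-z]+", text.lower())
        hits = [TONALITY[w] for w in words if w in TONALITY]
        return sum(hits) / len(hits) if hits else 0.0   # linear scale, 0 = neutral

    print(sentiment_score("The service was good but the food was terrible"))
    ```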

  11. A TEXT MINING RESEARCH BASED ON LDA TOPIC MODELLING

    Directory of Open Access Journals (Sweden)

    Zhou Tong

    2016-05-01

    Full Text Available A large amount of digital text information is generated every day. Effectively searching, managing and exploring this text data has become a main task. In this paper, we first present an introduction to text mining and the probabilistic topic model Latent Dirichlet Allocation. Two experiments are then proposed: topic modelling of Wikipedia articles and of users' tweets. The former builds a document topic model, aiming at a topic-perspective solution for searching, exploring and recommending articles. The latter sets up a user topic model, providing a full analysis of Twitter users' interests. The experimental process, including data collection, data pre-processing and model training, is fully documented and commented. Furthermore, the conclusions and applications of this paper could serve as a useful computational tool for social and business research.
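
    A short sketch of the kind of topic-modelling experiment the abstract describes, using scikit-learn's LDA implementation on a toy corpus; the corpus, topic count and library choice are illustrative assumptions rather than the authors' setup.

    ```python
    # Sketch: fit a 2-topic LDA model and print the top terms of each topic.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import LatentDirichletAllocation

    docs = [
        "the match ended with a late goal and a penalty",
        "the election campaign focused on taxes and healthcare",
        "the striker scored twice before the final whistle",
        "parliament debated the new healthcare bill",
    ]
    counts = CountVectorizer(stop_words="english").fit(docs)
    X = counts.transform(docs)

    lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)
    terms = counts.get_feature_names_out()
    for k, topic in enumerate(lda.components_):
        top = [terms[i] for i in topic.argsort()[-5:][::-1]]
        print(f"topic {k}: {', '.join(top)}")
    ```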

  12. Enhancing Biomedical Text Summarization Using Semantic Relation Extraction

    OpenAIRE

    Yue Shang; Yanpeng Li; Hongfei Lin; Zhihao Yang

    2011-01-01

    Automatic text summarization for a biomedical concept can help researchers to get the key points of a certain topic from large amount of biomedical literature efficiently. In this paper, we present a method for generating text summary for a given biomedical concept, e.g., H1N1 disease, from multiple documents based on semantic relation extraction. Our approach includes three stages: 1) We extract semantic relations in each sentence using the semantic knowledge representation tool SemRep. 2) W...

  13. The Challenge of Violence. [Student Text and] Teacher's Guide.

    Science.gov (United States)

    Croddy, Marshall; Degelman, Charles; Hayes, Bill

    This document addresses violence as one of the key challenges facing the democratic and pluralistic republic under the framework of the Constitution and its Bill of Rights. Primary focus is on criminal violence and the factors and behaviors that contribute to violent crime. The text is organized into three chapters: (1) "The Problem of Violence";…

  14. Texts of the Agency's Agreements with the Republic of Austria

    International Nuclear Information System (INIS)

    The document reproduces the text of the Exchange of Letters, dated 8 January 1999 and 27 January 1999 respectively, between the Ministry of Foreign Affairs of Austria and the IAEA, constituting a supplementary agreement to the Agreement between the Republic of Austria and the IAEA regarding the Headquarters of the IAEA. The aforementioned Agreement entered into force on 8 February 1999

  15. Subject Retrieval from Full-Text Databases in the Humanities

    Science.gov (United States)

    East, John W.

    2007-01-01

    This paper examines the problems involved in subject retrieval from full-text databases of secondary materials in the humanities. Ten such databases were studied and their search functionality evaluated, focusing on factors such as Boolean operators, document surrogates, limiting by subject area, proximity operators, phrase searching, wildcards,…

  16. The Texts of the Agency's Relationship Agreements with Specialized Agencies

    International Nuclear Information System (INIS)

    The texts of the relationship agreements which the Agency has concluded with the specialized agencies listed below, together with the respective protocols authenticating them, are reproduced in this document in the order in which the agreements entered into force, for the information of all Members of the Agency

  17. Signal Detection Framework Using Semantic Text Mining Techniques

    Science.gov (United States)

    Sudarsan, Sithu D.

    2009-01-01

    Signal detection is a challenging task for regulatory and intelligence agencies. Subject matter experts in those agencies analyze documents, generally containing narrative text, in a time-bound manner for signals by identification, evaluation and confirmation, leading to follow-up action, e.g., recalling a defective product or public advisory for…

  18. Orientalist discourse in media texts

    Directory of Open Access Journals (Sweden)

    Necla Mora

    2009-10-01

    Full Text Available By placing itself at the center of the world with a Eurocentric point of view, the West exploits other countries and communities through inflicting cultural change and transformation on them either from within via colonialist movements or from outside via “Orientalist” discourses in line with its imperialist objectives. The West has fictionalized the “image of the Orient” in terms of science by making use of social sciences like anthropology, history and philology and launched an intensive propaganda which covers literature, painting, cinema and other fields of art in order to actualize this fiction. Accordingly, the image of the Orient – which has been built firstly in terms of science then socially – has been engraved into the collective memory of both the Westerner and the Easterner. The internalized “Orientalist” point of view and discourse cause the Westerner to see and perceive the Easterner with the image formed in his/her memory while looking at them. The Easterner represents and expresses himself/herself from the eyes of the Westerner and with the image which the Westerner fictionalized for him/her. Hence, in order to gain acceptance from the West, the East tries to shape itself into the “Orientalist” mold which the Westerner fictionalized for it. Artists, intellectuals, writers and media professionals, who embrace and internalize the stereotypical hegemonic-driven “Orientalist” discourse of the Westerner and who rank among the elite group, reflect their internalized “Orientalist” discourse on their own actions. This condition causes the “Orientalist” clichés to be engraved in the memory of the society; causes the society to view itself with an “Orientalist” point of view and perceive itself with the clichés of the Westerner. Consequently, the second ring of the hegemony is reproduced by the symbolic elites who represent the power/authority within the country. The “Orientalist” discourse, which is

  19. t-Plausibility: Generalizing Words to Desensitize Text

    Directory of Open Access Journals (Sweden)

    Balamurugan Anandan

    2012-12-01

    Full Text Available De-identified data has the potential to be shared widely to support decision making and research. While significant advances have been made in anonymization of structured data, anonymization of textual information is in its infancy. Document sanitization requires finding and removing personally identifiable information. While current tools are effective at removing specific types of information (names, addresses, dates), they fail on two counts. The first is that complete text redaction may not be necessary to prevent re-identification, since this can affect the readability and usability of the text. More serious is that identifying information, as well as sensitive information, can be quite subtle and still be present in the text even after the removal of obvious identifiers. Observe that a diagnosis "tuberculosis" is sensitive, but in some situations it can also be identifying. Replacing it with the less sensitive term "infectious disease" also reduces identifiability. That is, instead of simply removing sensitive terms, these terms can be hidden by more general but semantically related terms to protect sensitive and identifying information, without unnecessarily degrading the amount of information contained in the document. Based on this observation, the main contribution of this paper is to provide a novel information theoretic approach to text sanitization and develop efficient heuristics to sanitize text documents.
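
    A minimal illustration of the generalization step the abstract argues for: a sensitive term is replaced by a broader, semantically related term instead of being redacted. The hypernym map below is a tiny made-up stand-in for the ontology a real sanitizer would consult.

    ```python
    # Sketch: replace sensitive terms with more general terms rather than deleting them.
    GENERALIZE = {
        "tuberculosis": "infectious disease",   # illustrative mappings only
        "melanoma": "cancer",
        "prozac": "antidepressant",
    }

    def sanitize(text: str) -> str:
        out = []
        for word in text.split():
            key = word.lower().strip(".,")
            out.append(GENERALIZE.get(key, word))  # keep non-sensitive words as-is
        return " ".join(out)

    print(sanitize("Patient was treated for tuberculosis in 2010."))
    ```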

  20. Probabilistic Aspects in Spoken Document Retrieval

    Directory of Open Access Journals (Sweden)

    Macherey Wolfgang

    2003-01-01

    Full Text Available Accessing information in multimedia databases encompasses a wide range of applications in which spoken document retrieval (SDR) plays an important role. In SDR, a set of automatically transcribed speech documents constitutes the files for retrieval, to which a user may address a request in natural language. This paper deals with two probabilistic aspects in SDR. The first part investigates the effect of recognition errors on retrieval performance and examines why recognition errors have only a small effect on the retrieval performance. In the second part, we present a new probabilistic approach to SDR that is based on interpolations between document representations. Experiments performed on the TREC-7 and TREC-8 SDR tasks show comparable or even better results for the newly proposed method than other advanced heuristic and probabilistic retrieval metrics.

  1. Stamp Detection in Color Document Images

    DEFF Research Database (Denmark)

    Micenkova, Barbora; van Beusekom, Joost

    2011-01-01

    An automatic system for stamp segmentation and further verification is needed especially for environments like insurance companies where a huge volume of documents is processed daily. However, detection of a general stamp is not a trivial task as it can have different shapes and colors and, moreover, it can be imprinted with a variable quality and rotation. Previous methods were restricted to detection of stamps of particular shapes or colors. The method presented in the paper includes segmentation of the image by color clustering and subsequent classification of candidate solutions by geometrical and color-related features. The approach allows for differentiation of stamps from other color objects in the document such as logos or texts. For the purpose of evaluation, a data set of 400 document images has been collected, annotated and made public. With the proposed method, recall of 83% and...

  2. Processing the Text of the Holy Quran: a Text Mining Study

    Directory of Open Access Journals (Sweden)

    Mohammad Alhawarat

    2015-02-01

    Full Text Available The Holy Quran is the reference book for more than 1.6 billion Muslims all around the world. Extracting information and knowledge from the Holy Quran is of high benefit both for people specialized in Islamic studies and for non-specialized people. This paper initiates a series of research studies that aim to serve the Holy Quran and provide helpful and accurate information and knowledge to all human beings. Also, the planned research studies aim to lay out a framework that will be used by researchers in the field of Arabic natural language processing by providing a "Golden Dataset" along with useful techniques and information that will advance this field further. The aim of this paper is to find an approach for analyzing Arabic text and then providing statistical information which might be helpful for people in this research area. In this paper the Holy Quran text is preprocessed and then different text mining operations are applied to it to reveal simple facts about the terms of the Holy Quran. The results show a variety of characteristics of the Holy Quran such as its most important words, its wordcloud and chapters with high term frequencies. All these results are based on term frequencies that are calculated using both the Term Frequency (TF) and Term Frequency-Inverse Document Frequency (TF-IDF) methods.
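
    The term statistics mentioned at the end of the abstract can be computed, for example, with scikit-learn's TF and TF-IDF vectorizers; the snippet below is a generic sketch on a toy English corpus, not the authors' Arabic pipeline.

    ```python
    # Sketch: raw term-frequency counts versus TF-IDF weights for a small corpus.
    from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

    chapters = [
        "praise and mercy and guidance",
        "guidance for the people and mercy",
        "patience and prayer and praise",
    ]
    tf = CountVectorizer().fit(chapters)        # term frequency (TF)
    tfidf = TfidfVectorizer().fit(chapters)     # TF-IDF weighting

    print("vocabulary:", tf.vocabulary_)
    print("TF matrix:\n", tf.transform(chapters).toarray())
    print("TF-IDF matrix:\n", tfidf.transform(chapters).toarray().round(2))
    ```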

  3. INTEGRATION OF COMPUTER TECHNOLOGIES SMK: AUTOMATION OF THE PRODUCTION CERTIFICATION PROCEDURE AND FORMING OF SHIPPING DOCUMENTS

    Directory of Open Access Journals (Sweden)

    S. A. Pavlenko

    2009-01-01

    Full Text Available The integration of information and computer technologies made it possible to reorganize and optimize some processes by reducing document circulation, unifying documentation forms, and other measures.

  4. Identifying Sentiment in Web Multi-topic Documents

    Directory of Open Access Journals (Sweden)

    Na Fan

    2012-02-01

    Full Text Available Most web documents cover multiple topics. Identifying the sentiment of multi-topic documents is a challenging task. In this paper, we propose a new method to solve this problem. The method first reveals the latent topical facets in documents with a Parametric Mixture Model (PMM). By focusing on modeling the generation process of a document with multiple topics, we can extract specific properties of such documents. PMM models documents with multiple topics by mixing the model parameters of each single topic. To analyze the sentiment of each topic, conditional random field techniques are used to identify sentiment. Empirical experiments on test datasets show that this approach is effective for extracting subtopics and revealing the sentiment of each topic. Moreover, the method is quite general and can be applied to any kind of text collection.

  5. HL7 Clinical Document Architecture, Release 2.

    Science.gov (United States)

    Dolin, Robert H; Alschuler, Liora; Boyer, Sandy; Beebe, Calvin; Behlen, Fred M; Biron, Paul V; Shabo Shvo, Amnon

    2006-01-01

    Clinical Document Architecture, Release One (CDA R1), became an American National Standards Institute (ANSI)-approved HL7 Standard in November 2000, representing the first specification derived from the Health Level 7 (HL7) Reference Information Model (RIM). CDA, Release Two (CDA R2), became an ANSI-approved HL7 Standard in May 2005 and is the subject of this article, where the focus is primarily on how the standard has evolved since CDA R1, particularly in the area of semantic representation of clinical events. CDA is a document markup standard that specifies the structure and semantics of a clinical document (such as a discharge summary or progress note) for the purpose of exchange. A CDA document is a defined and complete information object that can include text, images, sounds, and other multimedia content. It can be transferred within a message and can exist independently, outside the transferring message. CDA documents are encoded in Extensible Markup Language (XML), and they derive their machine processable meaning from the RIM, coupled with terminology. The CDA R2 model is richly expressive, enabling the formal representation of clinical statements (such as observations, medication administrations, and adverse events) such that they can be interpreted and acted upon by a computer. On the other hand, CDA R2 offers a low bar for adoption, providing a mechanism for simply wrapping a non-XML document with the CDA header or for creating a document with a structured header and sections containing only narrative content. The intent is to facilitate widespread adoption, while providing a mechanism for incremental semantic interoperability. PMID:16221939

  6. Supporting the education evidence portal via text mining.

    Science.gov (United States)

    Ananiadou, Sophia; Thompson, Paul; Thomas, James; Mu, Tingting; Oliver, Sandy; Rickinson, Mark; Sasaki, Yutaka; Weissenbacher, Davy; McNaught, John

    2010-08-28

    The UK Education Evidence Portal (eep) provides a single, searchable, point of access to the contents of the websites of 33 organizations relating to education, with the aim of revolutionizing work practices for the education community. Use of the portal alleviates the need to spend time searching multiple resources to find relevant information. However, the combined content of the websites of interest is still very large (over 500,000 documents and growing). This means that searches using the portal can produce very large numbers of hits. As users often have limited time, they would benefit from enhanced methods of performing searches and viewing results, allowing them to drill down to information of interest more efficiently, without having to sift through potentially long lists of irrelevant documents. The Joint Information Systems Committee (JISC)-funded ASSIST project has produced a prototype web interface to demonstrate the applicability of integrating a number of text-mining tools and methods into the eep, to facilitate an enhanced searching, browsing and document-viewing experience. New features include automatic classification of documents according to a taxonomy, automatic clustering of search results according to similar document content, and automatic identification and highlighting of key terms within documents. PMID:20643679

  7. Conservation Documentation and the Implications of Digitisation

    Directory of Open Access Journals (Sweden)

    Michelle Moore

    2001-11-01

    Full Text Available Conservation documentation can be defined as the textual and visual records collected during the care and treatment of an object. It can include records of the object's condition, any treatment done to the object, any observations or conclusions made by the conservator as well as details on the object's past and present environment. The form of documentation is not universally agreed upon nor has it always been considered an important aspect of the conservation profession. Good documentation tells the complete story of an object thus far and should provide as much information as possible for the future researcher, curator, or conservator. The conservation profession will benefit from digitising its documentation using software such as databases and hardware like digital cameras and scanners. Digital technology will make conservation documentation more easily accessible, cost/time efficient, and will increase consistency and accuracy of the recorded data, and reduce physical storage space requirements. The major drawback to digitising conservation records is maintaining access to the information for the future; the notorious pace of technological change has serious implications for retrieving data from any machine- readable medium.

  8. Title Based Duplicate Detection of Web Documents

    Directory of Open Access Journals (Sweden)

    Mrs. M. Kiruthika

    2012-09-01

    Full Text Available In recent times, the concept of web crawling has gained remarkable significance owing to the extreme growth of the World Wide Web. Very large numbers of web documents are swarming the web, making search engines less useful to users. Among the vast number of web documents are many duplicates and near duplicates, i.e. variants derived from the same original web document, which create additional overheads for search engines and significantly affect their performance and quality. The web crawling research community has widely recognized the need for detection of duplicate and near-duplicate web pages. Providing users with relevant results for their queries on the first page, without duplicates and redundant results, is a vital requisite. This duplication should also be avoided to save storage as well as to improve search quality. Near-duplicate web pages are detected, followed by the storage of the crawled web pages in repositories. The detection of near duplicates conserves network bandwidth, brings down storage cost and enhances the quality of search engines. In this paper, we discuss a feasible method for detecting near-duplicate web documents based on the titles of the documents, which will help to reduce the overhead of search engines and improve their performance.
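
    A minimal sketch of title-based near-duplicate detection, assuming word shingles and Jaccard similarity with an arbitrary threshold; the paper's actual features and threshold are not specified here.

    ```python
    # Sketch: two titles are near duplicates if their word-shingle sets overlap enough.
    def shingles(title: str, k: int = 2) -> set:
        words = title.lower().split()
        return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}

    def jaccard(a: set, b: set) -> float:
        return len(a & b) / len(a | b) if a | b else 0.0

    def near_duplicate(t1: str, t2: str, threshold: float = 0.6) -> bool:
        return jaccard(shingles(t1), shingles(t2)) >= threshold

    print(near_duplicate("Cheap flights to Paris this summer",
                         "Cheap flights to Paris this summer 2024"))
    ```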

  9. Development of digital library system on regulatory documents for nuclear power plants

    Energy Technology Data Exchange (ETDEWEB)

    Lee, K. H.; Kim, K. J.; Yoon, Y. H.; Kim, M. W.; Lee, J. I. [KINS, Taejon (Korea, Republic of)

    2001-10-01

    The main objective of this study is to establish an internet-based nuclear regulatory document retrieval system. With the advancement of internet and information processing technology, information management patterns are going through a new paradigm. In line with this trend, it is a general tendency to convert paper documents into electronic documents through document scanning and indexing. The system consists of nuclear regulatory documents, nuclear safety documents, a digital library, and an information system with index and full text.

  10. 77 FR 60475 - Draft of SWGDOC Standard Classification of Typewritten Text

    Science.gov (United States)

    2012-10-03

    ... From the Federal Register Online via the Government Publishing Office DEPARTMENT OF JUSTICE Office of Justice Programs Draft of SWGDOC Standard Classification of Typewritten Text AGENCY: National... general public a draft document entitled, ``SWGDOC Standard Classification of Typewritten Text''....

  11. Document image analysis: A primer

    Indian Academy of Sciences (India)

    Rangachar Kasturi; Lawrence O’Gorman; Venu Govindaraju

    2002-02-01

    Document image analysis refers to algorithms and techniques that are applied to images of documents to obtain a computer-readable description from pixel data. A well-known document image analysis product is the Optical Character Recognition (OCR) software that recognizes characters in a scanned document. OCR makes it possible for the user to edit or search the document’s contents. In this paper we briefly describe various components of a document analysis system. Many of these basic building blocks are found in most document analysis systems, irrespective of the particular domain or language to which they are applied. We hope that this paper will help the reader by providing the background necessary to understand the detailed descriptions of specific techniques presented in other papers in this issue.

  12. Mining Causality for Explanation Knowledge from Text

    Institute of Scientific and Technical Information of China (English)

    Chaveevan Pechsiri; Asanee Kawtrakul

    2007-01-01

    Mining causality is essential to provide a diagnosis. This research aims at extracting the causality existing within multiple sentences or EDUs (Elementary Discourse Units). The research emphasizes the use of causality verbs because they make explicit in a certain way the consequent events of a cause, e.g., "Aphids suck the sap from rice leaves. Then leaves will shrink. Later, they will become yellow and dry.". A verb can also be the causal-verb link between cause and effect within EDU(s), e.g., "Aphids suck the sap from rice leaves causing leaves to be shrunk" ("causing" is equivalent to a causal-verb link in Thai). The research confronts two main problems: identifying the interesting causality events from documents and identifying their boundaries. Then, we propose mining on verbs by using two different machine learning techniques, the Naive Bayes classifier and the Support Vector Machine. The resulting mining rules will be used for the identification and the causality extraction of the multiple EDUs from text. Our multiple-EDU extraction shows 0.88 precision with 0.75 recall from the Naive Bayes classifier and 0.89 precision with 0.76 recall from the Support Vector Machine.

  13. SDDL- SOFTWARE DESIGN AND DOCUMENTATION LANGUAGE

    Science.gov (United States)

    Kleine, H.

    1994-01-01

    and a collection of directives which control processor actions. The designer has complete control over the choice of keywords, commanding the capabilities of the processor in a way which is best suited to communicating the intent of the design. The SDDL processor translates the designer's creative thinking into an effective document for communication. The processor performs as many automatic functions as possible, thereby freeing the designer's energy for the creative effort. Document formatting includes graphical highlighting of structure logic, accentuation of structure escapes and module invocations, logic error detection, and special handling of title pages and text segments. The SDDL generated document contains software design summary information including module invocation hierarchy, module cross reference, and cross reference tables of user selected words or phrases appearing in the document. The basic forms of the methodology are module and block structures and the module invocation statement. A design is stated in terms of modules that represent problem abstractions which are complete and independent enough to be treated as separate problem entities. Blocks are lower-level structures used to build the modules. Both kinds of structures may have an initiator part, a terminator part, an escape segment, or a substructure. The SDDL processor is written in PASCAL for batch execution on a DEC VAX series computer under VMS. SDDL was developed in 1981 and last updated in 1984.

  14. A Topic Modeling Based Solution for Confirming Software Documentation Quality

    Directory of Open Access Journals (Sweden)

    Nouh Alhindawi

    2016-02-01

    Full Text Available This paper presents an approach for evaluating and confirming the quality of external software documentation using topic modeling. Typically, the quality of the external documentation has to mirror precisely the organization of the source code. Therefore, the elements of such documentation should be strongly written, associated, and presented. In this paper, we use Latent Dirichlet Allocation (LDA) and the Hellinger distance to compute the similarities between fragments of source code and the external documentation topics. These similarities are used to improve and advance the existing external documentation. Furthermore, they can also be used to evaluate the new documenting process during the evolution phase of the software. The results show that the new approach yields state-of-the-art performance in evaluating and confirming the quality and superiority of the existing external documentation.
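
    For reference, the Hellinger distance used to compare the LDA topic distribution of a code fragment with that of a documentation section can be computed as below (0 means identical distributions, 1 maximally different); the two distributions shown are invented for illustration.

    ```python
    # Sketch: Hellinger distance between two topic distributions.
    import numpy as np

    def hellinger(p: np.ndarray, q: np.ndarray) -> float:
        return float(np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)))

    code_topics = np.array([0.70, 0.20, 0.10])   # e.g. topic mix of a source file
    doc_topics  = np.array([0.65, 0.25, 0.10])   # topic mix of its documentation
    print(hellinger(code_topics, doc_topics))
    ```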

  15. Semi-structured document image matching and recognition

    OpenAIRE

    Augereau, Olivier; Journet, Nicholas; Domenger, Jean-Philippe

    2013-01-01

    This article presents a method to recognize and localize semi-structured documents such as ID cards, tickets, invoices, etc. Object recognition methods based on interest points work well on natural images but fail on document images because of repetitive patterns like text. In this article, we propose an adaptation of object recognition for document images. The advantage of our method is that it does not use character recognition or segmentation and it is robust ...

  16. Automatic digital document processing and management problems, algorithms and techniques

    CERN Document Server

    Ferilli, Stefano

    2011-01-01

    This text reviews the issues involved in handling and processing digital documents. Examining the full range of a document's lifetime, this book covers acquisition, representation, security, pre-processing, layout analysis, understanding, analysis of single components, information extraction, filing, indexing and retrieval. This title: provides a list of acronyms and a glossary of technical terms; contains appendices covering key concepts in machine learning, and providing a case study on building an intelligent system for digital document and library management; discusses issues of security,

  17. Documentation and the users of digital resources in the humanities.

    OpenAIRE

    Warwick, C.; Galina, I.; Rimmer, J; M. Terras; Blandford, A.; Gow, J; Buchanan, G

    2009-01-01

    Purpose – The purpose of this paper is to discuss the importance of documentation for digital humanities resources. This includes technical documentation of textual markup or database construction, and procedural documentation about resource construction. Design/methodology/approach – A case study is presented of an attempt to reuse electronic text to create a digital library for humanities users, as part of the UCIS project. The results of qualitative research by the LAIRAH study on prov...

  18. Pedagogical documentation: Preschool teachers’ perspective

    OpenAIRE

    Pavlović-Breneselović Dragana; Krnjaja Živka; Matović Nataša

    2012-01-01

    Educational policy shapes the positions of all stakeholders and their mutual relations in the system of preschool education through its attitude towards documentation. The attitude towards the function of pedagogical documentation in preschool education programmes reflects certain views on children, learning and nature of the programmes. Although contemporary approaches to preschool education emphasise the issue of documentation, this problem is dealt with partially and technically in o...

  19. EDF Group - 2010 Reference Document

    International Nuclear Information System (INIS)

    Besides the accounts of EDF for 2008 and 2009, this voluminous document presents the persons in charge, the legal account auditors, and how risks are managed within the company. It gives an overview of EDF's activities, organization and assets. It presents and discusses its financial situation and results, indicates the main contracts, and proposes other documents concerning the company. Many documents and reports are provided in the appendix

  20. Document understanding for a broad class of documents

    NARCIS (Netherlands)

    Aiello, Marco; Monz, Christof; Todoran, Leon; Worring, Marcel

    2002-01-01

    We present a document analysis system able to assign logical labels and extract the reading order in a broad set of documents. All information sources, from geometric features and spatial relations to the textual features and content are employed in the analysis. To deal effectively with these infor

  1. Extended Approach to Water Flow Algorithm for Text Line Segmentation

    Institute of Scientific and Technical Information of China (English)

    Darko Brodić

    2012-01-01

    This paper proposes a new approach to the water flow algorithm for text line segmentation. In the basic method, hypothetical water flows under a few specified angles, defined by the water flow angle parameter. It is applied to the document image frame from left to right and vice versa. As a result, unwetted and wetted areas are established. These areas separate text from non-text elements in each text line, respectively. Hence, they represent the control areas that are of major importance for text line segmentation. Primarily, the extended approach means extraction of the connected components by bounding boxes over text. In this way, each connected component is mutually separated. Hence, the water flow angle, which defines the unwetted areas, is determined adaptively. By choosing an appropriate water flow angle, the unwetted areas are lengthened, which leads to better text line segmentation. Results of this approach are encouraging due to the text line segmentation improvement, which is the most challenging step in document image processing.
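
    The connected-component extraction step described above can be sketched as follows, assuming a binarized page image and SciPy's labelling utilities; the toy array stands in for a real scanned document.

    ```python
    # Sketch: label connected components of a binary image and report their bounding boxes.
    import numpy as np
    from scipy import ndimage

    page = np.zeros((10, 20), dtype=np.uint8)
    page[2:4, 2:7] = 1      # a "word"
    page[6:9, 10:16] = 1    # another "word"

    labels, n = ndimage.label(page)          # connected components
    boxes = ndimage.find_objects(labels)     # one bounding-box slice pair per component
    for i, (rows, cols) in enumerate(boxes, start=1):
        print(f"component {i}: rows {rows.start}-{rows.stop - 1}, "
              f"cols {cols.start}-{cols.stop - 1}")
    ```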

  2. Fuzzy Logic Based Method for Improving Text Summarization

    CERN Document Server

    Suanmali, Ladda; Binwahlan, Mohammed Salem

    2009-01-01

    Text summarization can be classified into two approaches: extraction and abstraction. This paper focuses on the extraction approach. The goal of extraction-based text summarization is sentence selection. One method of obtaining suitable sentences is to assign each sentence a numerical measure for the summary, called sentence weighting, and then select the best ones. The first step in summarization by extraction is the identification of important features. In our experiment, we used 125 test documents from the DUC2002 data set. Each document is prepared by a preprocessing process: sentence segmentation, tokenization, stop word removal, and word stemming. Then, we use 8 important features and calculate their scores for each sentence. We propose text summarization based on fuzzy logic to improve the quality of the summary created by the general statistical method. We compare our results with a baseline summarizer and the Microsoft Word 2007 summarizer. The results show that the best average precision, rec...

  3. Unsupervised mining of frequent tags for clinical eligibility text indexing.

    Science.gov (United States)

    Miotto, Riccardo; Weng, Chunhua

    2013-12-01

    Clinical text, such as clinical trial eligibility criteria, is largely underused in state-of-the-art medical search engines due to difficulties of accurate parsing. This paper proposes a novel methodology to derive a semantic index for clinical eligibility documents based on a controlled vocabulary of frequent tags, which are automatically mined from the text. We applied this method to eligibility criteria on ClinicalTrials.gov and report that frequent tags (1) define an effective and efficient index of clinical trials and (2) are unlikely to grow radically when the repository increases. We proposed to apply the semantic index to filter clinical trial search results and we concluded that frequent tags reduce the result space more efficiently than an uncontrolled set of UMLS concepts. Overall, unsupervised mining of frequent tags from clinical text leads to an effective semantic index for the clinical eligibility documents and promotes their computational reuse.
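
    A rough sketch of the frequent-tag idea on toy eligibility criteria: recurring word n-grams above a support threshold form the controlled vocabulary and index. The criteria, n-gram length and threshold are illustrative assumptions, not the authors' mined vocabulary.

    ```python
    # Sketch: mine frequent word bigrams across criteria and index each criterion by them.
    from collections import Counter
    import re

    criteria = [
        "age greater than 18 years and no prior chemotherapy",
        "no prior chemotherapy and adequate renal function",
        "age greater than 18 years with adequate renal function",
    ]

    def ngrams(text: str, n: int = 2):
        words = re.findall(r"[a-z0-9]+", text.lower())
        return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

    counts = Counter(tag for c in criteria for tag in ngrams(c))
    frequent_tags = {tag for tag, cnt in counts.items() if cnt >= 2}   # support >= 2
    index = {i: frequent_tags & set(ngrams(c)) for i, c in enumerate(criteria)}
    print(frequent_tags)
    print(index)
    ```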

  4. The Role of Text Mining in Export Control

    Energy Technology Data Exchange (ETDEWEB)

    Tae, Jae-woong; Son, Choul-woong; Shin, Dong-hoon [Korea Institute of Nuclear Nonproliferation and Control, Daejeon (Korea, Republic of)

    2015-10-15

    The Korean government provides classification services to exporters. It is simple to copy technology in the form of documents and drawings, and it is also easy for new technology to be derived from existing technology. The diversity of technology makes classification difficult because the boundary between strategic and non-strategic technology is unclear and ambiguous. Reviewers should take previous classification cases fully into account; however, the growing number of classification cases makes consistent classification harder, so other innovative and effective approaches are needed. IXCRS (Intelligent Export Control Review System) is proposed to meet these demands. IXCRS consists of an expert system, a semantic search system, a full-text retrieval system, an image retrieval system and a document retrieval system. The aim of the present paper is to examine the document retrieval system based on text mining and to discuss how to utilize it. This study demonstrates how text mining techniques can be applied to export control. The document retrieval system helps reviewers handle previous classification cases effectively. In particular, it is highly probable that similarity data will help specify classification criteria. However, an analysis of the system revealed a number of problems that remain to be explored, such as a multi-language problem and an inclusion-relationship problem. Further research should be directed at solving these problems and applying more data mining techniques so that the system can be used as a useful tool for export control.

  5. An Effective Concept Extraction Method for Improving Text Classification Performance

    Institute of Scientific and Technical Information of China (English)

    ZHANG Yuntao; GONG Ling; WANG Yongcheng; YIN Zhonghang

    2003-01-01

    This paper presents a new way to extract concepts that can be used to improve text classification performance (precision and recall). The computational measure is divided into two layers. The bottom layer, called the document layer, is concerned with extracting the concepts of a particular document, and the upper layer, called the category layer, with finding the description and subject concepts of a particular category. The relevant implementation algorithm, which dramatically decreases the search space, is discussed in detail. The experiment based on real-world data collected from InfoBank shows that the approach is superior to the traditional ones.

  6. EDCMS: A Content Management System for Engineering Documents

    Institute of Scientific and Technical Information of China (English)

    Shaofeng Liu; Chris McMahon; Mansur Darlington; Steve Culley; Peter Wild

    2007-01-01

    Engineers often need to look for the right pieces of information by sifting through long engineering documents. It is a very tiring and time-consuming job. To address this issue, researchers are increasingly devoting their attention to new ways to help information users, including engineers, to access and retrieve document content. The research reported in this paper explores how to use the key technologies of document decomposition (study of document structure), document mark-up (with Extensible Markup Language (XML), HyperText Mark-up Language (HTML), and Scalable Vector Graphics (SVG)), and a facetted classification mechanism. Document content extraction is implemented via computer programming (with Java). An Engineering Document Content Management System (EDCMS) developed in this research demonstrates that, as information providers, we can make document content available in a more accessible manner for information users including engineers. The main features of the EDCMS system are: 1) EDCMS is a system that enables users, especially engineers, to access and retrieve information at content rather than document level. In other words, it provides the right pieces of information that answer specific questions so that engineers don't need to waste time sifting through the whole document to obtain the required piece of information. 2) Users can use the EDCMS via both the data and metadata of a document to access engineering document content. 3) Users can use the EDCMS to access and retrieve content objects, i.e. text, images and graphics (including engineering drawings), via multiple views and at different granularities based on decomposition schemes. Experiments with the EDCMS have been conducted on semi-structured documents, a textbook of CADCAM, and a set of project posters in the Engineering Design domain. Experimental results show that the system provides information users with a powerful solution to access document content.

  7. Intelligent bar chart plagiarism detection in documents.

    Science.gov (United States)

    Al-Dabbagh, Mohammed Mumtaz; Salim, Naomie; Rehman, Amjad; Alkawaz, Mohammed Hazim; Saba, Tanzila; Al-Rodhaan, Mznah; Al-Dhelaan, Abdullah

    2014-01-01

    This paper presents a novel features mining approach from documents that could not be mined via optical character recognition (OCR). By identifying the intimate relationship between the text and graphical components, the proposed technique pulls out the Start, End, and Exact values for each bar. Furthermore, the word 2-gram and Euclidean distance methods are used to accurately detect and determine plagiarism in bar charts.

  8. Script Identification In Trilingual Indian Documents

    Directory of Open Access Journals (Sweden)

    R. R. Aparna

    2014-07-01

    Full Text Available This paper presents a research work in identification of script from trilingual Indian documents. This paper proposes a classification algorithm based on structural and contour features. The proposed system identifies the script of languages like English, Tamil and Hindi. 300 word images of the above mentioned three scripts were tested and 98.6% accuracy was obtained. Performance comparison with various existing methods is discussed.

  9. Healing texts and healing techniques in indigenous Balinese medicine.

    Science.gov (United States)

    McCauley, A P

    1988-01-01

    Case histories of three prominent Balinese healers illustrate various ways that indigenous medical texts are used in healing. Most healers employ mantras, spells and inscriptions from the texts because they believe them to have innate power which can heal. A smaller group of healers are literate in the archaic language used in the palm-leaf medical manuscripts. However, their use of these manuscripts often differs from the literal and unambiguous way that Westerners read medical documents. An examination of Balinese medical manuscripts, in the context of the conventions of Balinese literature, demonstrates the use of these texts to align the body with the macrocosm and to reaffirm the beliefs of the ancestors.

  10. FLCW: Frequent Itemset Based Text Clustering with Window Constraint

    Institute of Scientific and Technical Information of China (English)

    ZHOU Chong; LU Yansheng; ZOU Lei; HU Rong

    2006-01-01

    Most existing text clustering algorithms overlook the fact that a document is a word sequence carrying semantic information, and that important semantic information resides in the positions of words in the sequence. In this paper, a novel method named Frequent Itemset-based Clustering with Window (FICW) is proposed, which makes use of this semantic information for text clustering with a window constraint. The experimental results obtained from tests on three (hypertext) text sets show that FICW outperforms the compared method in both clustering accuracy and efficiency.

  11. An Optimization Model and DPSO-EDA for Document Summarization

    Directory of Open Access Journals (Sweden)

    Rasim M. Alguliev

    2011-11-01

    Full Text Available We model document summarization as a nonlinear 0-1 programming problem whose objective function is defined as the Heronian mean of objective functions enforcing coverage and diversity. The proposed model is applied to a multi-document summarization task. Experiments on the DUC2001 and DUC2002 datasets show that the proposed model outperforms other summarization methods.
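
    The abstract does not spell out the combined objective; for reference, the classical Heronian mean of two nonnegative quantities a and b is (a + √(ab) + b)/3, so one plausible reading of the model, sketched below, is to score a candidate summary by that mean of its coverage and diversity values.

    ```python
    # Sketch: Heronian-mean combination of coverage and diversity scores (illustrative only).
    from math import sqrt

    def heronian_objective(coverage: float, diversity: float) -> float:
        return (coverage + sqrt(coverage * diversity) + diversity) / 3.0

    print(heronian_objective(0.8, 0.6))
    ```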

  12. Magnetic fusion program summary document

    International Nuclear Information System (INIS)

    This document outlines the current and planned research, development, and commercialization (RD and C) activities of the Office of Fusion Energy under the Assistant Secretary for Energy Technology, US Department of Energy (DOE). The purpose of this document is to explain the Office of Fusion Energy's activities to Congress and its committees and to interested members of the public

  13. SRS ecology: Environmental information document

    International Nuclear Information System (INIS)

    The purpose of this document is to provide a source of ecological information based on the existing knowledge gained from research conducted at the Savannah River Site. This document provides a summary and synthesis of ecological research in the three main ecosystem types found at SRS and information on the threatened and endangered species residing there

  14. Document Organization Using Kohonen's Algorithm.

    Science.gov (United States)

    Guerrero Bote, Vicente P.; Moya Anegon, Felix de; Herrero Solana, Victor

    2002-01-01

    Discussion of the classification of documents from bibliographic databases focuses on a method of vectorizing reference documents from LISA (Library and Information Science Abstracts) which permits their topological organization using Kohonen's algorithm. Analyzes possibilities of this type of neural network with respect to the development of…

  15. SRS ecology: Environmental information document

    Energy Technology Data Exchange (ETDEWEB)

    Wike, L.D.; Shipley, R.W.; Bowers, J.A. [and others

    1993-09-01

    The purpose of this document is to provide a source of ecological information based on the existing knowledge gained from research conducted at the Savannah River Site. This document provides a summary and synthesis of ecological research in the three main ecosystem types found at SRS and information on the threatened and endangered species residing there.

  16. Documentation: The Reggio Emilia Approach.

    Science.gov (United States)

    Katz, Lilian G.; Chard, Sylvia C.

    1997-01-01

    Notes the municipal preprimary schools of Reggio Emilia, Italy, are attracting worldwide attention for extensively documenting children's experience, memories, thoughts, and ideas. Suggest this type of documentation contributes to early-childhood-program quality by enhancing learning, taking children's ideas and work seriously, providing…

  17. Bulkloading and Maintaining XML Documents

    NARCIS (Netherlands)

    Schmidt, A.R.; Kersten, M.L.

    2002-01-01

    The popularity of XML as a exchange and storage format brings about massive amounts of documents to be stored, maintained and analyzed -- a challenge that traditionally has been tackled with Database Management Systems (DBMS). To open up the content of XML documents to analysis with declarative quer

  18. Vector space model for document representation in information retrieval

    Directory of Open Access Journals (Sweden)

    Dan MUNTEANU

    2007-12-01

    Full Text Available This paper presents the basics of information retrieval: the vector space model for document representation with Boolean and term weighted models, ranking methods based on the cosine factor and evaluation measures: recall, precision and combined measure.
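
    A compact sketch of the vector space model and cosine ranking the paper surveys, using TF-IDF weights from scikit-learn on a toy corpus and query.

    ```python
    # Sketch: represent documents and a query as TF-IDF vectors and rank by cosine similarity.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    docs = ["information retrieval and ranking",
            "vector space model for documents",
            "cooking recipes for the weekend"]
    query = ["ranking documents in the vector space model"]

    vec = TfidfVectorizer().fit(docs)
    scores = cosine_similarity(vec.transform(query), vec.transform(docs))[0]
    for doc, s in sorted(zip(docs, scores), key=lambda p: -p[1]):
        print(f"{s:.2f}  {doc}")
    ```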

  19. Photographic Documentation in Plastic Surgeon’s Practice

    Directory of Open Access Journals (Sweden)

    Kasielska-Trojan Anna

    2016-05-01

    Full Text Available The aim of the study was to analyze practices of clinical photographic documentation management among plastic surgeons in Poland as well as to gain their opinion about the characteristics of “ideal” software for image archiving.

  20. Invisible in Thailand: documenting the need for protection

    Directory of Open Access Journals (Sweden)

    Margaret Green

    2008-04-01

    Full Text Available The International Rescue Committee (IRC) has conducted a survey to document the experiences of Burmese people living in border areas of Thailand and assess the degree to which they merit international protection as refugees.