WorldWideScience

Sample records for ascii text documents

  1. State Of The Art In Digital Steganography Focusing ASCII Text Documents

    CERN Document Server

    Rafat, Khan Farhan

    2010-01-01

    Digitization of analogue signals has opened up new avenues for information hiding, and recent advancements in the telecommunication field have taken this desire even further. From copper wire to fiber optics, technology has evolved, and so have the ways of covert channel communication. By "covert" we mean "anything not meant for the purpose for which it is being used". Investigation and detection of the existence of such covert channel communication has always remained a serious concern of information security professionals, and it has now also become a motivation for an adversary to communicate secretly in the "open" without being caught or noticed. This paper presents a survey report on steganographic techniques which have evolved over the years to hide the existence of secret information inside some cover (text) object. The introduction of the subject is followed by a discussion which is narrowed down to the area where digital ASCII text documents are used as cover. Finally, the conc...
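
    One family of techniques such surveys cover hides data in the whitespace of an ASCII cover text. The Python sketch below is illustrative only, not a scheme from the paper: it encodes each secret bit as a trailing space (0) or tab (1) on successive cover lines; real schemes add encryption and better capacity management.

      def embed(cover_lines, secret: bytes):
          # big-endian bits of the payload, one bit per cover line
          bits = [(byte >> i) & 1 for byte in secret for i in range(7, -1, -1)]
          if len(bits) > len(cover_lines):
              raise ValueError("cover text too short for this payload")
          out = []
          for i, line in enumerate(cover_lines):
              marker = ("\t" if bits[i] else " ") if i < len(bits) else ""
              out.append(line.rstrip() + marker)
          return out

      def extract(stego_lines, n_bytes: int):
          bits = [1 if line.endswith("\t") else 0 for line in stego_lines[:n_bytes * 8]]
          data = bytearray()
          for i in range(0, len(bits), 8):
              byte = 0
              for b in bits[i:i + 8]:
                  byte = (byte << 1) | b
              data.append(byte)
          return bytes(data)

      cover = ["line %d of an innocuous cover document" % i for i in range(20)]
      assert extract(embed(cover, b"hi"), 2) == b"hi"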

  2. Communication in Veil: Enhanced Paradigm for ASCII Text Files

    Directory of Open Access Journals (Sweden)

    Muhammad Sher

    2013-08-01

    Digitization has had a pervasive impact on the information and communication technology (ICT) field, which can be seen from the fact that today one seldom thinks of standing in a long queue just to deposit utility bills, buy a movie ticket, or dispatch private letters via the post office; these and other similar activities are now preferably done electronically over the internet, which has shattered geographical boundaries and tied people across the world into a single logical unit called the global village. The efficacy and precision with which electronic transactions are made is commendable and is one of the reasons why more and more people are switching over to e-commerce for official and personal usage. Via social networking sites one can interact with family and friends at any time of one's choice. The darker side of this comforting aspect, however, is that the contents sent on- or off-line may be monitored for active or passive intervention by antagonistic forces for illicit motives ranging from, but not limited to, password, ID and social security number theft to impersonation, compromising of personal information, and blackmail. This necessitates hiding data or information of some significance in an oblivious manner so as to frustrate its detection by the enemy. This paper aims at evolving an avant-garde information hiding scheme for ASCII text files, a research area regarded as the most difficult for this purpose in contrast to the audio, video or image file formats.

  3. Documents and legal texts

    International Nuclear Information System (INIS)

    This section reprints a selection of recently published legislative texts and documents: - Russian Federation: Federal Law No.170 of 21 November 1995 on the use of atomic energy, Adopted by the State Duma on 20 October 1995; - Uruguay: Law No.19.056 On the Radiological Protection and Safety of Persons, Property and the Environment (4 January 2013); - Japan: Third Supplement to Interim Guidelines on Determination of the Scope of Nuclear Damage resulting from the Accident at the Tokyo Electric Power Company Fukushima Daiichi and Daini Nuclear Power Plants (concerning Damages related to Rumour-Related Damage in the Agriculture, Forestry, Fishery and Food Industries), 30 January 2013; - France and the United States: Joint Statement on Liability for Nuclear Damage (Aug 2013); - Franco-Russian Nuclear Power Declaration (1 November 2013)

  4. Documents and legal texts

    International Nuclear Information System (INIS)

    This section treats of the following Documents and legal texts: 1 - Canada: Nuclear Liability and Compensation Act (An Act respecting civil liability and compensation for damage in case of a nuclear incident, repealing the Nuclear Liability Act and making consequential amendments to other acts); 2 - Japan: Act on Compensation for Nuclear Damage (The purpose of this act is to protect persons suffering from nuclear damage and to contribute to the sound development of the nuclear industry by establishing a basic system regarding compensation in case of nuclear damage caused by reactor operation etc.); Act on Indemnity Agreements for Compensation of Nuclear Damage; 3 - Slovak Republic: Act on Civil Liability for Nuclear Damage and on its Financial Coverage and on Changes and Amendments to Certain Laws (This Act regulates: a) The civil liability for nuclear damage incurred in the causation of a nuclear incident, b) The scope of powers of the Nuclear Regulatory Authority (hereinafter only as the 'Authority') in relation to the application of this Act, c) The competence of the National Bank of Slovakia in relation to the supervised financial market entities in the financial coverage of liability for nuclear damage; and d) The penalties for violation of this Act)

  5. Text document classification

    Czech Academy of Sciences Publication Activity Database

    Novovičová, Jana

    č. 62 (2005), s. 53-54. ISSN 0926-4981 R&D Projects: GA AV ČR IAA2075302; GA AV ČR KSK1019101; GA MŠk 1M0572 Institutional research plan: CEZ:AV0Z10750506 Keywords : document representation * categorization * classification Subject RIV: BD - Theory of Information

  6. Documents and legal texts

    International Nuclear Information System (INIS)

    This section reprints the text of the following laws: 1 - United Arab Emirates: Federal Law by Decree No. 4 of 2012 concerning civil liability for nuclear damage; India: The Civil Liability for Nuclear Damage Act, 2010, No. 38 of 2010, 21 September 2010 (An Act to provide for civil liability for nuclear damage and prompt compensation to the victims of a nuclear accident through a no-fault liability regime channeling liability to the operator, appointment of a Claims Commissioner, establishment of a Nuclear Damage Claims Commission and for matters connected therewith or incidental thereto); 2 - Republic of Moldova - Parliament: Law No. 132 of 08.06.2012 on the safe conduct of nuclear and radiological activities (Published: 02.11.2012 in the Official Gazette No. 229-233, art. No. 739). The purpose of this law is to regulate nuclear and radiological activities in accordance with the international requirements in this field arising out of several treaties, conventions and directives.

  7. An Advanced Text Encryption & Compression System Based on ASCII Values & Arithmetic Encoding to Improve Data Security

    OpenAIRE

    Amandeep Singh Sidhu; Er. Meenakshi Garg

    2014-01-01

    Compression algorithms reduce the redundancy in data representation, thus increasing effective data density. Data compression is a very useful technique that helps in reducing the size of text data and storing the same amount of data in relatively fewer bits, thereby reducing data storage space, resource usage or transmission capacity. There are a number of techniques that have been used for text data compression, which can be categorized as lossy and lossless data compre...
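
    The paper pairs an ASCII-value cipher with arithmetic coding; in the rough Python sketch below, a toy cyclic shift over printable ASCII stands in for the cipher and zlib's LZ77/Huffman coder stands in for the arithmetic coder, so only the pipeline shape is faithful, not the algorithms themselves.

      import zlib

      def ascii_shift(text: str, key: int) -> str:
          # cyclically shift printable ASCII (codes 32..126) by `key`
          return "".join(
              chr((ord(c) - 32 + key) % 95 + 32) if 32 <= ord(c) <= 126 else c
              for c in text
          )

      plain = "this is a fairly redundant sample text, " * 20
      packed = zlib.compress(ascii_shift(plain, 7).encode("ascii"), level=9)
      print(len(plain), "->", len(packed), "bytes")  # redundancy squeezed out
      assert ascii_shift(zlib.decompress(packed).decode("ascii"), -7) == plain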

  8. Survey on Text Document Clustering

    OpenAIRE

    M.Thangamani; Dr.P.Thangaraj

    2010-01-01

    Document clustering is also referred to as text clustering, and the concept is essentially the same as data clustering. It is quite difficult to find selective information among an 'N' number of information series, which is why document clustering came into the picture. Basically a cluster means a group of similar data, and document clustering means segregating the data into different groups of similar data. Clustering can belong to the mathematical, statistical or numerical domain. Clustering is a fundamental data analysi...
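
    A minimal document-clustering sketch of the kind this survey covers, grouping TF-IDF vectors with K-means (scikit-learn assumed installed; the corpus and parameters are illustrative):

      from sklearn.cluster import KMeans
      from sklearn.feature_extraction.text import TfidfVectorizer

      docs = [
          "the striker scored a late goal in the football match",
          "the keeper saved a penalty and the match ended level",
          "the central bank raised interest rates again",
          "markets fell after the bank announced new rates",
      ]
      X = TfidfVectorizer(stop_words="english").fit_transform(docs)
      labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
      print(labels)  # sports and finance documents fall into separate groups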

  9. Text documents as social networks

    Science.gov (United States)

    Balinsky, Helen; Balinsky, Alexander; Simske, Steven J.

    2012-03-01

    The extraction of keywords and features is a fundamental problem in text data mining. Document processing applications directly depend on the quality and speed of the identification of salient terms and phrases. Applications as disparate as automatic document classification, information visualization, filtering and security policy enforcement all rely on the quality of automatically extracted keywords. Recently, a novel approach to rapid change detection in data streams and documents has been developed. It is based on ideas from image processing and in particular on the Helmholtz Principle from the Gestalt Theory of human perception. By modeling a document as a one-parameter family of graphs with its sentences or paragraphs defining the vertex set and with edges defined by Helmholtz's principle, we demonstrated that for some range of the parameters, the resulting graph becomes a small-world network. In this article we investigate the natural orientation of edges in such small world networks. For two connected sentences, we can say which one is the first and which one is the second, according to their position in a document. This will make such a graph look like a small WWW-type network and PageRank type algorithms will produce interesting ranking of nodes in such a document.
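
    The sentence-graph construction can be sketched as follows; a crude shared-word threshold stands in for the paper's Helmholtz-principle edge test, with edges oriented from the earlier sentence to the later one as described above (networkx assumed installed):

      import networkx as nx

      sentences = [
          "steganography hides data in text",
          "hidden data in text is hard to detect",
          "detection of hidden text data is an open problem",
          "networks model sentence relations",
      ]
      g = nx.DiGraph()
      g.add_nodes_from(range(len(sentences)))
      for i in range(len(sentences)):
          for j in range(i + 1, len(sentences)):
              if len(set(sentences[i].split()) & set(sentences[j].split())) >= 2:
                  g.add_edge(i, j)  # earlier sentence points to the later one

      ranks = nx.pagerank(g)
      print(sorted(ranks, key=ranks.get, reverse=True))  # most central sentences first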

  10. Emotion Detection From Text Documents

    Directory of Open Access Journals (Sweden)

    Shiv Naresh Shivhare

    2014-11-01

    Emotion detection is one of the most emerging issues in human-computer interaction. A sufficient amount of work has been done by researchers to detect emotions from facial and audio information, whereas recognizing emotions from textual data is still a fresh and hot research area. This paper presents a knowledge-based survey on emotion detection from textual data and the methods used for this purpose. As the next step, the paper also proposes a new architecture for recognizing emotions from text documents. The proposed architecture is composed of two main parts, an emotion ontology and an emotion detector algorithm. The proposed emotion detector system takes a text document and the emotion ontology as inputs and produces one of the six emotion classes (i.e. love, joy, anger, sadness, fear and surprise) as the output.
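
    The detector's input/output contract can be imitated with a toy keyword lexicon in place of the paper's emotion ontology (the word lists below are invented for illustration, not taken from the paper):

      EMOTION_LEXICON = {  # invented mini-ontology stand-in
          "love": {"love", "adore", "cherish"},
          "joy": {"happy", "joy", "delighted"},
          "anger": {"angry", "furious", "rage"},
          "sadness": {"sad", "grief", "miserable"},
          "fear": {"afraid", "scared", "terror"},
          "surprise": {"surprised", "astonished", "unexpected"},
      }

      def detect_emotion(document: str) -> str:
          words = document.lower().split()
          scores = {emotion: sum(w in lexicon for w in words)
                    for emotion, lexicon in EMOTION_LEXICON.items()}
          return max(scores, key=scores.get)

      print(detect_emotion("I was so happy and delighted by the news"))  # joy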

  11. Text line Segmentation of Curved Document Images

    Directory of Open Access Journals (Sweden)

    Anusree.M

    2014-05-01

    Document image analysis has been widely used in historical and heritage studies, education and digital libraries. Document image analysis techniques are mainly used for improving the human readability and the OCR quality of the document. During digitization, camera-captured images contain warped document content due to perspective and geometric distortions. The main difficulty is text line detection in the document. Many algorithms have been proposed to address the problem of printed-document text line detection, but they fail to extract text lines in curved documents. This paper describes a segmentation technique that detects the curled text lines in camera-captured document images.

  12. Arabic multi-document text summarisation

    OpenAIRE

    El-Haj, Mahmoud

    2012-01-01

    Multi-document summarisation is the process of producing a single summary of a collection of related documents. Much of the current work on multi-document text summarisation is concerned with the English language; relevant resources are numerous and readily available. These resources include human generated (gold-standard) and automatic summaries. Arabic multi-document summarisation is still in its infancy. One of the obstacles to progress is the limited availability of Arabic resources to su...

  13. Plagiarism in text documents: Methods of Plagiarism

    OpenAIRE

    Opička, Jan

    2009-01-01

    This thesis is devoted to the detection of plagiarism among documents in large document databases. The problem of plagiarism detection is more pressing today than ever, and the easy accessibility of documents in digital form contributes to it. To enforce author rights and wipe out plagiarism it is necessary to design a system that is able to distinguish plagiarism among documents with certainty. Such a system is a valuable help in the academic field, where it can be used for controlling of...

  14. Typograph: Multiscale Spatial Exploration of Text Documents

    Energy Technology Data Exchange (ETDEWEB)

    Endert, Alexander; Burtner, Edwin R.; Cramer, Nicholas O.; Perko, Ralph J.; Hampton, Shawn D.; Cook, Kristin A.

    2013-12-01

    Visualizing large document collections using a spatial layout of terms can enable quick overviews of information. However, these metaphors (e.g., word clouds, tag clouds, etc.) often lack interactivity for exploring the information, and the location and rendering of the terms are often not based on mathematical models that maintain relative distances from other information based on similarity metrics. Further, transitioning between levels of detail (i.e., from terms to full documents) can be challenging. In this paper, we present Typograph, a multi-scale spatial exploration visualization for large document collections. Building on term-based visualization methods, Typograph enables multiple levels of detail (terms, phrases, snippets, and full documents) within a single spatialization. Further, the information is placed based on its relative similarity to other information to create the "near = similar" geography metaphor. This paper discusses the design principles and functionality of Typograph and presents a use case analyzing Wikipedia to demonstrate usage.

  15. Typograph: Multiscale Spatial Exploration of Text Documents

    Energy Technology Data Exchange (ETDEWEB)

    Endert, Alexander; Burtner, Edwin R.; Cramer, Nicholas O.; Perko, Ralph J.; Hampton, Shawn D.; Cook, Kristin A.

    2013-10-06

    Visualizing large document collections using a spatial layout of terms can enable quick overviews of information. These visual metaphors (e.g., word clouds, tag clouds, etc.) traditionally show a series of terms organized by space-filling algorithms. However, often lacking in these views is the ability to interactively explore the information to gain more detail, and the location and rendering of the terms are often not based on mathematical models that maintain relative distances from other information based on similarity metrics. In this paper, we present Typograph, a multi-scale spatial exploration visualization for large document collections. Building on term-based visualization methods, Typograph enables multiple levels of detail (terms, phrases, snippets, and full documents) within a single spatialization. Further, the information is placed based on its relative similarity to other information to create the "near = similar" geographic metaphor. This paper discusses the design principles and functionality of Typograph and presents a use case analyzing Wikipedia to demonstrate usage.

  16. Text document classification based on mixture models

    Czech Academy of Sciences Publication Activity Database

    Novovičová, Jana; Malík, Antonín

    2004-01-01

    Roč. 40, č. 3 (2004), s. 293-304. ISSN 0023-5954 R&D Projects: GA AV ČR IAA2075302; GA ČR GA102/03/0049; GA AV ČR KSK1019101 Institutional research plan: CEZ:AV0Z1075907 Keywords : text classification * text categorization * multinomial mixture model Subject RIV: BB - Applied Statistics, Operational Research Impact factor: 0.224, year: 2004

  17. GENERATION OF A SET OF KEY TERMS CHARACTERISING TEXT DOCUMENTS

    Directory of Open Access Journals (Sweden)

    Kristina Machova

    2007-06-01

    The presented paper describes statistical methods (information gain, mutual information, X^2 statistics, and the TF-IDF method) for key word generation from a text document collection. These key words should characterize the content of the text documents and can be used to retrieve relevant documents from a document collection. Term relations were detected on the basis of the conditional probability of term occurrences. The focus is on the detection of those words which occur together very often. Thus, key words which consist of two terms were generated additionally. Several tests were carried out using the 20 Newsgroups collection of text documents.
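
    Of the scoring functions named above, TF-IDF is the simplest to sketch; the pure-Python toy below ranks each document's terms (swapping in information gain or X^2 would only change the scoring function):

      import math
      from collections import Counter

      docs = [
          "news groups discuss space missions and space shuttles",
          "other news groups discuss graphics cards",
          "space agencies launch shuttle missions",
      ]
      tokenized = [d.split() for d in docs]
      df = Counter(w for doc in tokenized for w in set(doc))  # document frequency

      def tfidf(doc):
          tf = Counter(doc)
          return {w: (tf[w] / len(doc)) * math.log(len(docs) / df[w]) for w in tf}

      for doc in tokenized:
          scores = tfidf(doc)
          print(sorted(scores, key=scores.get, reverse=True)[:3])  # top key terms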

  18. Quality Control in Software Documentation Based on Measurement of Text Comprehension and Text Comprehensibility.

    Science.gov (United States)

    Lehner, Franz

    1993-01-01

    Discusses methods of textual documentation that can be used for software documentation. Highlights include measurement of text comprehensibility; methods for the measurement of documentation quality, including readability and the Cloze Procedure; tools for the measurement of text readability; and the development of the Reading Measurability…

  19. A New Fragile Watermarking Scheme for Text Documents Authentication

    Institute of Scientific and Technical Information of China (English)

    XIANG Huazheng; SUN Xingming; TANG Chengliang

    2006-01-01

    Because text documents are subject to different modification types, such as deleting characters and inserting characters, algorithms for image authentication cannot be used directly for text document authentication. A text watermarking scheme for text document authentication is proposed in this paper. By extracting the features of character cascades together with the user's secret key, the scheme combines the features of the text with the user information as a watermark which is embedded into the transformed text itself. The receivers can verify the integrity and the authenticity of the text through a blind detection technique. Further research demonstrates that it can also localize tampering, classify the type of modification, and recover part of the modified text document. The aforementioned conclusions are supported by both our experimental results and analysis.
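
    The scheme itself is not reproduced here, but the keyed-integrity idea it builds on (binding text content to a user secret so any edit is detectable) can be sketched with a standard HMAC:

      import hashlib
      import hmac

      def make_tag(text: str, key: bytes) -> str:
          return hmac.new(key, text.encode("utf-8"), hashlib.sha256).hexdigest()

      key = b"user-secret-key"
      document = "This contract takes effect on 1 May."
      tag = make_tag(document, key)

      tampered = document.replace("1 May", "1 June")
      print(hmac.compare_digest(tag, make_tag(document, key)))  # True: authentic
      print(hmac.compare_digest(tag, make_tag(tampered, key)))  # False: modified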

  20. CERCLIS (Superfund) ASCII Text Format - CPAD Database

    Data.gov (United States)

    U.S. Environmental Protection Agency — The Comprehensive Environmental Response, Compensation and Liability Information System (CERCLIS) (Superfund) Public Access Database (CPAD) contains a selected set...

  1. A Semi-Structured Document Model for Text Mining

    Institute of Scientific and Technical Information of China (English)

    杨建武; 陈晓鸥

    2002-01-01

    A semi-structured document has more structured information compared to an ordinary document, and the relations among semi-structured documents can be fully utilized. In order to take advantage of the structure and link information in a semi-structured document for better mining, a structured link vector model (SLVM) is presented in this paper, where a vector represents a document and the vector's elements are determined by terms, document structure and neighboring documents. Text mining based on SLVM is described in the procedure of K-means for briefness and clarity: calculating document similarity and calculating cluster centers. The clustering based on SLVM performs significantly better than that based on a conventional vector space model in the experiments, and its F value increases from 0.65-0.73 to 0.82-0.86.

  2. Classification process in a text document recommender system

    Directory of Open Access Journals (Sweden)

    Dan MUNTEANU

    2005-12-01

    This paper presents the classification process in a recommender system used for textual documents taken especially from the web. In the classification process the system uses a combination of content filters, event filters and collaborative filters, and it uses implicit and explicit feedback for evaluating documents.

  3. Literature Review of Automatic Multiple Documents Text Summarization

    Directory of Open Access Journals (Sweden)

    Md. Majharul Haque

    2013-05-01

    Thanks to the World Wide Web, the corpus of online information is gigantic in volume. Search engines such as Google, AltaVista, Yahoo, etc. have been developed to retrieve specific information from this huge amount of data. But the outcome of a search engine fails to provide the expected result, as the quantity of information is increasing enormously day by day and the findings are abundant. So automatic text summarization is in demand for salient information retrieval. Automatic text summarization is a system of summarizing text by computer, where a text is given to the computer as input and the output is a shorter and less redundant form of the original text. An informative précis is very helpful in our daily life to save valuable time. Research first started naively on single-document abridgement, but recently information about a single topic is found from various sources in different websites, journals, newspapers, text books, etc., for which multi-document summarization is required. In this paper, the automatic multiple-document text summarization task is addressed and the different procedures of various researchers are discussed. Various techniques that have been applied to multi-document summarization are compared here. Some promising approaches are indicated, and particular concentration is dedicated to describing different methods from the raw level to ones similar to human experts, so that in future one can get significant instruction for further analysis.
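
    A frequency-based extractive summarizer, the simplest of the families reviewed here, can be sketched as follows (the sentence splitting and scoring are deliberately naive):

      from collections import Counter

      documents = [
          "The storm hit the coast on Monday. Thousands lost power.",
          "Power was lost by thousands after the storm. Repairs begin Tuesday.",
      ]
      sentences = [s.strip() for d in documents for s in d.split(".") if s.strip()]
      freq = Counter(w.lower() for s in sentences for w in s.split())

      def score(sentence):
          words = sentence.lower().split()
          return sum(freq[w] for w in words) / len(words)

      summary = sorted(sentences, key=score, reverse=True)[:2]
      print(". ".join(summary) + ".")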

  4. Keyword Extraction Based Summarization of Categorized Kannada Text Documents

    Directory of Open Access Journals (Sweden)

    Jayashree.R

    2011-12-01

    The internet has caused a humongous growth in the number of documents available online. Summaries of documents can help find the right information and are particularly effective when the document base is very large. Keywords are closely associated with a document as they reflect the document's content and act as indices for a given document. In this work, we present a method to produce extractive summaries of documents in the Kannada language, given a number of sentences as the limitation. The algorithm extracts key words from pre-categorized Kannada documents collected from online resources. We use two feature selection techniques for obtaining features from documents; then we combine the scores obtained by the GSS (Galavotti, Sebastiani, Simi) coefficients and IDF (Inverse Document Frequency) methods along with TF (Term Frequency) for extracting key words, and later use these for summarization based on the rank of the sentence. In the current implementation, a document from a given category is selected from our database and, depending on the number of sentences given by the user, a summary is generated.

  5. EXPLOITING RHETORICAL RELATIONS TO MULTIPLE DOCUMENTS TEXT SUMMARIZATION

    Directory of Open Access Journals (Sweden)

    N. Adilah Hanin Zahri

    2015-03-01

    Much previous research has proven that the usage of rhetorical relations is capable of enhancing many applications such as text summarization, question answering and natural language generation. This work proposes an approach that extends the benefit of rhetorical relations to address the redundancy problem in cluster-based text summarization of multiple documents. We exploited the rhetorical relations existing between sentences to group similar sentences into multiple clusters and identify themes of common information. The candidate summary sentences were extracted from these clusters. Then, cluster-based text summarization is performed using a Conditional Markov Random Walk Model to measure the saliency scores of the candidates. We evaluated our method by measuring the cohesion and separation of the clusters constructed by exploiting rhetorical relations, and the ROUGE scores of the generated summaries. The experimental results show that our method performed well, which shows the promising potential of applying rhetorical relations to text clustering to the benefit of multi-document text summarization.

  6. Integrated Clustering and Feature Selection Scheme for Text Documents.

    Directory of Open Access Journals (Sweden)

    M. Thangamani

    2010-01-01

    Problem statement: Text documents are unstructured databases that contain raw data collections. Clustering techniques are used to group text documents with reference to their similarity. Approach: Feature selection techniques were used to improve the efficiency and accuracy of the clustering process. Feature selection was done by eliminating redundant and irrelevant items from the text document contents. Statistical methods were used in the text clustering and feature selection algorithm. The cube size is very high and the accuracy is low in the term-based text clustering and feature selection method. A semantic clustering and feature selection method was proposed to improve the clustering and feature selection mechanism with semantic relations of the text documents. The proposed system was designed to identify semantic relations using an ontology, which was used to represent term and concept relationships. Results: The synonym, meronym and hypernym relationships were represented in the ontology. Concept weights were estimated with reference to the ontology and used for the clustering process. The system was implemented in two methods: term clustering with feature selection and semantic clustering with feature selection. Conclusion: The performance analysis was carried out with the term clustering and semantic clustering methods. The accuracy and efficiency factors were analyzed in the performance analysis.

  7. Approaches to Ontology Based Algorithms for Clustering Text Documents

    OpenAIRE

    V.Sureka; S.C. Punitha

    2012-01-01

    The advancement in digital technology and the World Wide Web has increased the usage of digital documents for various purposes like e-publishing and digital libraries. The increase in the number of text documents requires efficient techniques that can help during searching and retrieval. Document clustering is one such technique which automatically organizes text documents into meaningful groups. This paper compares the performance of enhanced ontological algorithms based on K-Means and DBScan clustering. Onto...

  8. Text recognition in both ancient and cartographic documents

    OpenAIRE

    Zaghden, Nizar; Khelifi, Badreddine; Alimi, Adel M.; Mullot, Remy

    2013-01-01

    This paper deals with the recognition and matching of text in both cartographic maps and ancient documents. The purpose of this work is to find similar text regions based on statistical and global features. A normalization phase is done first, in order to categorize the same quantity of information properly. A word-spotting phase is done next by combining local and global features. We run different experiments, combining the different feature extraction techniques, in order to obtain ...

  9. Literature Review of Automatic Single Document Text Summarization Using NLP

    Directory of Open Access Journals (Sweden)

    Md. Majharul Haque

    2013-07-01

    In this time of overloaded online information, automatic text summarization is especially in demand for salient information retrieval from huge amounts of electronic text. Thanks to the World Wide Web, the mass of data is now enormous in volume. Researchers realized this fact from various aspects and have tried to generate automatic abstracts of this gigantic body of data since the commencement of the last half century. There are numerous ways of characterizing different approaches to passage recapitulation: extractive and abstractive, from single or compound documents, objective of content abridgement, characteristics of text summarization, level of processing from superficial to profound, and sort of article content. A significant précis is very helpful in day-to-day life and can save valuable time. The investigation was at first commenced naively on single-document abstraction. In this paper, the automatic single-document text summarization task is addressed and the different methodologies of various researchers are discussed, from the very beginning of this research to the modern age. This literature review intends to observe the trends in abstraction procedures using natural language processing. Some promising approaches are also indicated, and particular concentration is dedicated to the categorization of diversified methods from the raw level to ones similar to human professionals, so that in future one can get precious direction for further analysis.

  10. A Fuzzy Approach to Classification of Text Documents

    Institute of Scientific and Technical Information of China (English)

    LIU WeiYi(刘惟一); SONG Ning(宋宁)

    2003-01-01

    This paper discusses the classification problems of text documents. Based on the concept of the proximity degree, the set of words is partitioned into some equivalence classes. In particular, the concepts of the semantic field and association degree are given in this paper. Based on the above concepts, this paper presents a fuzzy classification approach for document categorization. Furthermore, applying the concept of the entropy of information, approaches to select key words from the set of words covering the classification of documents and to construct the hierarchical structure of key words are obtained.

  11. Term Weighting Schemes for Slovak Text Document Clustering

    Directory of Open Access Journals (Sweden)

    ZLACKÝ Daniel

    2013-05-01

    Text representation is the task of transforming textual data into a multidimensional space with corresponding weights for every word. We have tested several widely used term weighting methods on a manually created database of Slovak Wikipedia articles. The created vector space models were used as input to unsupervised clustering algorithms, which cluster text documents based on these models. We have tested nine different weighting schemes with the K-means clustering algorithm. The best results were obtained with the TF-RIDF weighting scheme. However, subsequent experiments with different clustering techniques have not confirmed these results.

  12. Document Exploration and Automatic Knowledge Extraction for Unstructured Biomedical Text

    Science.gov (United States)

    Chu, S.; Totaro, G.; Doshi, N.; Thapar, S.; Mattmann, C. A.; Ramirez, P.

    2015-12-01

    We describe our work on building a web-browser based document reader with a built-in exploration tool and automatic concept extraction of medical entities for biomedical text. Vast amounts of biomedical information are offered in unstructured text form through scientific publications and R&D reports. Utilizing text mining can help us to mine information and extract relevant knowledge from a plethora of biomedical text. The ability to employ such technologies to aid researchers in coping with information overload is greatly desirable. In recent years, there has been an increased interest in automatic biomedical concept extraction [1, 2] and intelligent PDF reader tools with the ability to search on content and find related articles [3]. Such reader tools are typically desktop applications and are limited to specific platforms. Our goal is to provide researchers with a simple tool to aid them in finding, reading, and exploring documents. Thus, we propose a web-based document explorer, which we call Shangri-Docs, which combines a document reader with automatic concept extraction and highlighting of relevant terms. Shangri-Docs also provides the ability to evaluate a wide variety of document formats (e.g. PDF, Word, PPT, text, etc.) and to exploit the linked nature of the Web and personal content by performing searches on content from public sites (e.g. Wikipedia, PubMed) and private cataloged databases simultaneously. Shangri-Docs utilizes Apache cTAKES (clinical Text Analysis and Knowledge Extraction System) [4] and the Unified Medical Language System (UMLS) to automatically identify and highlight terms and concepts, such as specific symptoms, diseases, drugs, and anatomical sites, mentioned in the text. cTAKES was originally designed specifically to extract information from clinical medical records. Our investigation leads us to extend the automatic knowledge extraction process of cTAKES to the biomedical research domain by improving the ontology guided information extraction

  13. Approaches to Ontology Based Algorithms for Clustering Text Documents

    Directory of Open Access Journals (Sweden)

    V.Sureka

    2012-09-01

    The advancement in digital technology and the World Wide Web has increased the usage of digital documents for various purposes like e-publishing and digital libraries. The increase in the number of text documents requires efficient techniques that can help during searching and retrieval. Document clustering is one such technique which automatically organizes text documents into meaningful groups. This paper compares the performance of enhanced ontological algorithms based on K-Means and DBScan clustering. Ontology is introduced by using a concept weight which is calculated by considering the correlation coefficient of the word and the probability of the concept. Various experiments were conducted during performance evaluation, and the results showed that the inclusion of ontology increased the efficiency of clustering and that the performance of the ontology-based DBScan algorithm is better than that of the ontology-based K-Means algorithm.

  14. Using linguistic information to classify Portuguese text documents

    OpenAIRE

    Teresa, Gonçalves; Paulo, Quaresma

    2008-01-01

    This paper examines the role of various linguistic structures in text classification, applying the study to the Portuguese language. Besides using a bag-of-words representation, where we evaluate different measures and use linguistic knowledge for term selection, we run several experiments using syntactic information, representing documents as strings of words and strings of syntactic parse trees. To build the classifier we use the Support Vector Machine (SVM) algorithm, which is known to prod...

  15. Text Mining Approaches To Extract Interesting Association Rules from Text Documents

    Directory of Open Access Journals (Sweden)

    Vishwadeepak Singh Baghela

    2012-05-01

    A handful of text data mining approaches are available to extract much potential information and many associations from large amounts of text data. The term data mining is used for methods that analyze data with the objective of finding rules and patterns describing the characteristic properties of the data. The 'mined' information is typically represented as a model of the semantic structure of the dataset, where the model may be used on new data for prediction or classification. In general, data mining deals with structured data (for example relational databases), whereas text presents special characteristics and is unstructured. Unstructured data is totally different from databases, where mining techniques are usually applied and structured data is managed. Text mining can work with unstructured or semi-structured data sets. A brief review of some recent research related to mining association rules from text documents is presented in this paper.
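
    The rule-mining idea itself fits in a few lines: treat each document's word set as a transaction and keep rules A -> B whose support and confidence clear fixed thresholds (the corpus and thresholds below are toy values, not from the paper):

      from collections import Counter
      from itertools import permutations

      docs = [
          {"data", "mining", "rules"},
          {"data", "mining", "text"},
          {"text", "mining", "rules"},
          {"data", "text"},
      ]
      n = len(docs)
      single = Counter(w for d in docs for w in d)
      pairs = Counter(p for d in docs for p in permutations(d, 2))

      MIN_SUPPORT, MIN_CONFIDENCE = 0.5, 0.8
      for (a, b), count in sorted(pairs.items()):
          support, confidence = count / n, count / single[a]
          if support >= MIN_SUPPORT and confidence >= MIN_CONFIDENCE:
              print(f"{a} -> {b} (support {support:.2f}, confidence {confidence:.2f})")
      # prints: rules -> mining (support 0.50, confidence 1.00)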

  16. Finding Text Information in the Ocean of Electronic Documents

    Energy Technology Data Exchange (ETDEWEB)

    Medvick, Patricia A.; Calapristi, Augustin J.

    2003-02-05

    Information management in natural resources has become an overwhelming task. A massive amount of electronic documents and data is now available for creating informed decisions. The problem is finding the relevant information to support the decision-making process. Determining gaps in knowledge in order to propose new studies or to determine which proposals to fund for maximum potential is a time-consuming and difficult task. Additionally, available data stores are increasing in complexity; they now may include not only text and numerical data, but also images, sounds, and video recordings. Information visualization specialists at Pacific Northwest National Laboratory (PNNL) have software tools for exploring electronic data stores and for discovering and exploiting relationships within data sets. These provide capabilities for unstructured text exploration, the use of data signatures (a compact format for the essence of a set of scientific data) for visualization (Wong et al. 2000), visualizations for multiple query results (Havre et al. 2001), and others (http://www.pnl.gov/infoviz). We will focus on IN-SPIRE, an MS Windows version of PNNL's SPIRE (Spatial Paradigm for Information Retrieval and Exploration). IN-SPIRE was developed to assist information analysts in finding and discovering information in huge masses of text documents.

  17. Leveraging Text Content for Management of Construction Project Documents

    Science.gov (United States)

    Alqady, Mohammed

    2012-01-01

    The construction industry is a knowledge intensive industry. Thousands of documents are generated by construction projects. Documents, as information carriers, must be managed effectively to ensure successful project management. The fact that a single project can produce thousands of documents and that a lot of the documents are generated in a…

  18. Transliterating non-ASCII characters with Python

    Directory of Open Access Journals (Sweden)

    Seth Bernstein

    2013-10-01

    This lesson shows how to use Python to automatically transliterate a list of words from a language with a non-Latin alphabet to a standardized format using American Standard Code for Information Interchange (ASCII) characters. It builds on readers' understanding of Python from the lessons "Viewing HTML Files," "Working with Web Pages," "From HTML to List of Words (part 1)" and "Intro to Beautiful Soup." At the end of the lesson, we will use the transliteration dictionary to convert the names from a database of the Russian organization Memorial from Cyrillic into Latin characters. Although the example uses Cyrillic characters, the technique can be reproduced with other alphabets using Unicode.
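
    The heart of such a lesson is a mapping table applied character by character; a compressed version (mapping abridged, lowercase only, not the lesson's exact table) might look like this:

      TRANSLIT = {
          "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "е": "e",
          "ж": "zh", "з": "z", "и": "i", "й": "i", "к": "k", "л": "l",
          "м": "m", "н": "n", "о": "o", "п": "p", "р": "r", "с": "s",
          "т": "t", "у": "u", "ф": "f", "х": "kh", "ц": "ts", "ч": "ch",
          "ш": "sh", "щ": "shch", "ъ": "", "ы": "y", "ь": "", "э": "e",
          "ю": "iu", "я": "ia",
      }

      def transliterate(text: str) -> str:
          # characters without an entry (Latin letters, digits, ...) pass through
          return "".join(TRANSLIT.get(ch, ch) for ch in text.lower())

      print(transliterate("Мемориал"))  # memorial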

  19. Literature Review of Automatic Single Document Text Summarization Using NLP

    OpenAIRE

    Md. Majharul Haque; Suraiya Pervin; Zerina Begum

    2013-01-01

    In this time of overloaded online information, automatic text summarization is especially in demand for salient information retrieval from huge amounts of electronic text. Thanks to the World Wide Web, the mass of data is now enormous in volume. Researchers realized this fact from various aspects and have tried to generate automatic abstracts of this gigantic body of data since the commencement of the last half century. There are numerous ways of characterizing different approaches to passage...

  20. Information Gain Based Dimensionality Selection for Classifying Text Documents

    Energy Technology Data Exchange (ETDEWEB)

    Dumidu Wijayasekara; Milos Manic; Miles McQueen

    2013-06-01

    Selecting the optimal dimensions for various knowledge extraction applications is an essential component of data mining. Dimensionality selection techniques are utilized in classification applications to increase the classification accuracy and reduce the computational complexity. In text classification, where the dimensionality of the dataset is extremely high, dimensionality selection is even more important. This paper presents a novel genetic-algorithm-based methodology for dimensionality selection in text mining applications that utilizes information gain. The presented methodology uses the information gain of each dimension to change the mutation probability of chromosomes dynamically. Since the information gain is calculated a priori, the computational complexity is not affected. The presented method was tested on a specific text classification problem and compared with conventional genetic-algorithm-based dimensionality selection. The results show an improvement of 3% in the true positives and 1.6% in the true negatives over conventional dimensionality selection methods.
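
    The information gain driving the mutation probabilities is the standard entropy reduction from conditioning on a term's presence; a small self-contained computation (toy corpus and labels invented for illustration) looks like this:

      import math
      from collections import Counter

      def entropy(labels):
          n = len(labels)
          return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

      def information_gain(docs, labels, term):
          present = [l for d, l in zip(docs, labels) if term in d]
          absent = [l for d, l in zip(docs, labels) if term not in d]
          n = len(labels)
          conditional = (len(present) / n) * entropy(present) \
                      + (len(absent) / n) * entropy(absent)
          return entropy(labels) - conditional

      docs = [set(d.split()) for d in
              ["spam offer now", "cheap spam offer", "meeting agenda notes", "project meeting"]]
      labels = ["spam", "spam", "ham", "ham"]
      print(information_gain(docs, labels, "spam"))  # 1.0: splits the classes perfectly
      print(information_gain(docs, labels, "now"))   # ~0.31: a weaker dimension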

  1. Algebraic specification of documents

    OpenAIRE

    Ramalho, José Carlos; Almeida, J. J.; Henriques, Pedro Rangel

    1995-01-01

    According to recent research, nearly 95 percent of corporate information is stored in documents. Further studies indicate that companies spend between 6 and 10 percent of their gross revenues printing and distributing documents in several ways: web and CD-ROM publishing, database storage and retrieval, and printing. In this context documents exist in several different formats, from pure ASCII files to internal database or text processor formats. It is clear that document reu...

  2. Text Mining Approaches To Extract Interesting Association Rules from Text Documents

    OpenAIRE

    Vishwadeepak Singh Baghela; S. P. Tripathi

    2012-01-01

    A handful of text data mining approaches are available to extract much potential information and many associations from large amounts of text data. The term data mining is used for methods that analyze data with the objective of finding rules and patterns describing the characteristic properties of the data. The 'mined' information is typically represented as a model of the semantic structure of the dataset, where the model may be used on new data for prediction or classification. In general, data mi...

  3. Cluster Based Hybrid Niche Mimetic and Genetic Algorithm for Text Document Categorization

    Directory of Open Access Journals (Sweden)

    A. K. Santra

    2011-09-01

    An efficient cluster-based hybrid niche mimetic and genetic algorithm for text document categorization, aimed at improving the retrieval rate of relevant document fetching, is addressed. The proposal minimizes the processing needed to structure the document, with better feature selection using the hybrid algorithm. In addition, the restructuring of feature words to associated documents is reduced, which in turn increases the document clustering rate. The performance of the proposed work is measured in terms of cluster object accuracy, term weight, term frequency and inverse document frequency. Experimental results demonstrate that it achieves very good performance on both feature selection and text document categorization, compared to other classifier methods.

  4. Gabor Filter Based Block Energy Analysis for Text Extraction from Digital Document Images

    OpenAIRE

    Raju, Sabari S; Pati, Peeta Basa; Ramakrishnan, AG

    2004-01-01

    Extraction of text areas is a necessary first step in taking a complex document image through a character recognition task. In digital libraries, such OCR'ed text facilitates access to the image of the document page through keyword search. Gabor filters, known to simulate certain characteristics of the Human Visual System (HVS), have been employed for this task by a large number of scientists for scanned document images. Adapting such a scheme to camera-based document images is a relatively new ...

  5. MULTI-DOCUMENT TEXT SUMMARIZATION USING CLUSTERING TECHNIQUES AND LEXICAL CHAINING

    Directory of Open Access Journals (Sweden)

    S. Saraswathi

    2010-07-01

    This paper investigates the use of clustering and lexical chains to produce coherent, indicative and less redundant summaries of multiple documents in text format. The summary is designed per the user's requirement of conciseness, i.e., the documents are summarized according to the percentage input by the user. To achieve the above, various clustering techniques are used. Clustering is done at two levels: first at the single-document level and then at the multi-document level. The clustered sentences are scored based on five different methods and lexically linked to produce the final summary in a text document.

  6. PLACE OF INDUSTRY-SPECIFIC DOCUMENTS IN THE TRANSLATION-ORIENTED TEXT CLASSIFICATION (BY AN EXAMPLE OF RAILWAY DOCUMENTATION)

    Directory of Open Access Journals (Sweden)

    VOLEGZHANINA I.S.

    2015-01-01

    Documentation is an important tool for ensuring an effective process of business communication within enterprises of different industrial sectors. The authors made an attempt to locate industry-specific documents within the translation-oriented text classification suggested by Irina S. Alekseeva. The British railway standards were analyzed for this purpose: their linguo-communicative specificity was described and some possible translation problems were identified.

  7. Electronic Documentation Support Tools and Text Duplication in the Electronic Medical Record

    Science.gov (United States)

    Wrenn, Jesse

    2010-01-01

    In order to ease the burden of electronic note entry on physicians, electronic documentation support tools have been developed to assist in note authoring. There is little evidence of the effects of these tools on attributes of clinical documentation, including document quality. Furthermore, the resultant abundance of duplicated text and…

  8. THE SEGMENTATION OF A TEXT LINE FOR A HANDWRITTEN UNCONSTRAINED DOCUMENT USING THINNING ALGORITHM

    NARCIS (Netherlands)

    Tsuruoka, S.; Adachi, Y.; Yoshikawa, T.

    2004-01-01

    For printed documents, the projection analysis of black pixels is widely used for the segmentation of a text line. However, for handwritten documents, we think that the projection analysis is not appropriate, as the separating border line of a text line is not a straight line on a paper with no rule
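
    The projection analysis used as the baseline here reduces to summing black pixels per row of a binarized page image and splitting at empty rows; a toy numpy version (synthetic image, invented band positions):

      import numpy as np

      page = np.zeros((60, 200), dtype=np.uint8)  # toy binary page, 1 = black pixel
      page[10:20, 20:180] = 1                     # first printed "text line"
      page[35:45, 20:180] = 1                     # second printed "text line"

      profile = page.sum(axis=1)                  # black-pixel count per row
      edges = np.flatnonzero(np.diff((profile > 0).astype(int)))
      lines = [(int(top) + 1, int(bottom)) for top, bottom in zip(edges[::2], edges[1::2])]
      print(lines)  # [(10, 19), (35, 44)]: row bands holding the two lines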

  9. Document expansion for text-based image retrieval at WikipediaMM 2010

    OpenAIRE

    Min, Jinming; LEVELING, JOHANNES; Jones, Gareth J.F.

    2010-01-01

    We describe and analyze our participation in the WikipediaMM task at ImageCLEF 2010. Our approach is based on text-based image retrieval using information retrieval techniques on the metadata documents of the images. We submitted two English monolingual runs and one multilingual run. The monolingual runs used the query to retrieve the metadata document, with the query and document in the same language; the multilingual run used queries in one language to search the metadata provided in...

  10. UNESDOC: full texts of UNESCO documents

    CERN Document Server

    UNESCO. Paris

    UNESDOC contains the texts of the main documents of the governing bodies (General Conference and Executive Board), sectoral documents (main and working series, reports and documents of meetings/conferences organized by UNESCO), speeches by the Director-General, the "UNESCO Courier" and the bulletin "UNESCO Sources".

  11. A Feature Mining Based Approach for the Classification of Text Documents into Disjoint Classes.

    Science.gov (United States)

    Nieto Sanchez, Salvador; Triantaphyllou, Evangelos; Kraft, Donald

    2002-01-01

    Proposes a new approach for classifying text documents into two disjoint classes. Highlights include a brief overview of document clustering; a data mining approach called the One Clause At a Time (OCAT) algorithm, which is based on mathematical logic; the vector space model (VSM); and a comparison of the OCAT to the VSM. (Author/LRW)

  12. A Consistent Web Documents Based Text Clustering Using Concept Based Mining Model

    OpenAIRE

    V.M.Navaneethakumar; C Chandrasekar

    2012-01-01

    Text mining is a growing innovative field that endeavors to collect significant information from natural language text. It might be loosely defined as the process of examining texts to extract information that is practical for particular purposes. In this case, the mining model can capture terms that identify the concepts of the sentence or document, which tends to detect the subject of the document. In existing work, the concept-based mining model is used only for n...

  13. ARABIC TEXT SUMMARIZATION BASED ON LATENT SEMANTIC ANALYSIS TO ENHANCE ARABIC DOCUMENTS CLUSTERING

    Directory of Open Access Journals (Sweden)

    Hanane Froud

    2013-01-01

    Arabic document clustering is an important task for obtaining good results with traditional Information Retrieval (IR) systems, especially with the rapid growth of the number of online documents available in the Arabic language. Document clustering aims to automatically group similar documents into one cluster using different similarity/distance measures. This task is often affected by the document length: useful information in a document is often accompanied by a large amount of noise, and therefore it is necessary to eliminate this noise while keeping useful information to boost the performance of document clustering. In this paper, we propose to evaluate the impact of text summarization using the Latent Semantic Analysis model on Arabic document clustering, in order to solve the problems cited above, using five similarity/distance measures: Euclidean distance, cosine similarity, Jaccard coefficient, Pearson correlation coefficient and averaged Kullback-Leibler divergence, both without and with stemming. Our experimental results indicate that our proposed approach effectively solves the problems of noisy information and document length, and thus significantly improves clustering performance.
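
    The LSA step can be sketched with scikit-learn (assumed installed): project TF-IDF vectors onto a low-rank latent space, then compare documents there with one of the measures listed above, cosine similarity in this toy English example:

      from sklearn.decomposition import TruncatedSVD
      from sklearn.feature_extraction.text import TfidfVectorizer
      from sklearn.metrics.pairwise import cosine_similarity

      docs = [
          "the treaty covers nuclear damage liability",
          "liability for nuclear damage is covered by the treaty",
          "the recipe needs flour sugar and eggs",
      ]
      X = TfidfVectorizer().fit_transform(docs)                           # term space
      Z = TruncatedSVD(n_components=2, random_state=0).fit_transform(X)   # latent space
      print(cosine_similarity(Z).round(2))  # docs 0 and 1 score far above doc 2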

  14. Automatic Building of an Ontology from a Corpus of Text Documents Using Data Mining Tools

    Directory of Open Access Journals (Sweden)

    J. I. Toledo-Alvarado

    2012-06-01

    In this paper we show a procedure to build an ontology automatically from a corpus of text documents without external help such as dictionaries or thesauri. The proposed method finds relevant concepts in the form of multi-words in the corpus, and non-hierarchical relations between them, in an unsupervised manner.

  15. Evaluation of a Language Identification System for Mono- and Multi-lingual Text Documents

    OpenAIRE

    Artemenko, Olga; Mandl, Thomas; Shramko, Margaryta; Womser-Hacker, Christa

    2006-01-01

    Language identification is an important task for web information retrieval. This paper presents the implementation of a tool for language identification in mono- and multilingual documents. The tool implements four algorithms for language identification. Furthermore, we present an n-gram approach for the identification of languages in multilingual documents. An evaluation for monolingual texts of varied length is presented. Results for eight languages including U...
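
    A character n-gram identifier of the kind evaluated here can be sketched in a few lines; the two-language trigram profiles below are built from toy sample sentences rather than the real training corpora such a tool would use:

      from collections import Counter

      def trigrams(text):
          text = "  " + text.lower() + "  "
          return Counter(text[i:i + 3] for i in range(len(text) - 2))

      # per-language profiles; real systems train these on large corpora
      profiles = {
          "en": trigrams("the quick brown fox jumps over the lazy dog and the cat"),
          "de": trigrams("der schnelle braune fuchs springt ueber den faulen hund und die katze"),
      }

      def identify(text):
          grams = trigrams(text)
          return max(profiles, key=lambda lang: sum(
              min(count, profiles[lang][g]) for g, count in grams.items()))

      print(identify("the dog and the fox"))     # en
      print(identify("der hund und der fuchs"))  # de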

  16. LOG2MARKUP: State module to transform a Stata text log into a markup document

    DEFF Research Database (Denmark)

    2016-01-01

    log2markup extracts parts of the text version of the Stata log command output and transforms the logfile into a markup-based document with the same name, but with the extension markup (or as otherwise specified in the option extension) instead of log. The author usually uses markdown for writing documents. However, other users may decide on all sorts of markup languages, e.g. HTML or LaTeX. The key point is that the markup of Stata code and Stata output can be set by the options.

  17. A Novel Model for Timed Event Extraction and Temporal Reasoning In Legal Text Documents

    Directory of Open Access Journals (Sweden)

    Kolikipogu Ramakrishna

    2011-02-01

    Information retrieval is at a nascent stage in providing any type of information queried by a naïve user. Question answering systems are one such successful area of information retrieval. Legal documents (case law, statutes or transactional documents) are increasing day by day with the new applications (mobile transactions, medical diagnosis reports, law cases, etc.) in the world. The documentation of various business and human resource (HR) applications involves legal documents. Analysis and temporal reasoning over such documents is a demanding area of research. In this paper we build a novel model for timed event extraction and temporal reasoning in legal text documents. This paper mainly works on how one can do further reasoning with the extracted temporal information. Exploring temporal information in legal text documents is an important task to support the legal practitioner (lawyer) in determining temporal-based context decisions. Legal documents are available in different natural languages; hence the model uses an NLP system for its pre-processing steps, a temporal constraint structure for temporal expressions, an associated tagger, and a post-processor with a knowledge-based subsystem that helps in discovering implicit information. The resulting information resolves temporal expressions and deals with issues such as granularity and vagueness, using a reasoning mechanism which models the temporal constraint satisfaction network.

  18. THE COMPOSITIONAL AND SPEECH ORGANIZATION OF REGULATION TEXT AS A REGULATORY DOCUMENT

    Directory of Open Access Journals (Sweden)

    Sharipova Roza Rifatovna

    2014-06-01

    The relevance of the study covered by this article is determined by the extension of the scope of business communication, as well as by the necessity to upgrade the administrative activity of organizations, which largely depends on documentation quality. Documents are used in various communicative situations and reflect intercultural business relations; that is why the problem of studying the nature and functions of documents is urgent. Business communication involves interaction in different areas of activity, and a document is one of the main tools for regulating this process. The author studies the regulation, a document which ensures the systematization and adjustment of the management process and reflects certain production processes and the order of their execution. Taking into account a complex of criteria (the functioning level of the document, the specificity of the business communication subjects, the diversity of the regulated processes, and the compositional, content, and speech organization of the text), the author suggests distinguishing three types of regulations. Regulations of the first type systematize business activity at the level of government or the corresponding administration. Regulations of the second type are used to regulate external relations (with counter-agents and partners) during an undetermined (long-term) or determined (having a starting and ending date) validity period. Regulations of the third type serve to regulate internal relations within an organization and are mostly intended for staff. From the compositional viewpoint, regulations of all types represent a text consisting of several paginated sections; moreover, the functioning level of the regulation and the specificity of the business communication subjects define the character of the information, i.e. the degree of its generality or detail. The speech organization of the studied documents is similar, as it is characterized by the use of lexis with process semantics and of official clichés. The regulations differ in terminology

  19. A COMPARATIVE STUDY TO FIND A SUITABLE METHOD FOR TEXT DOCUMENT CLUSTERING

    Directory of Open Access Journals (Sweden)

    Dr.M.Punithavalli

    2012-01-01

    Text mining is used in various text-related tasks such as information extraction, concept/entity extraction, document summarization, entity relation modeling (i.e., learning relations between named entities), categorization/classification and clustering. This paper focuses on document clustering, a field of text mining which groups a set of documents into a list of meaningful categories. The main focus of this paper is to present a performance analysis of the various techniques available for document clustering. The results of this comparative study can be used to improve existing text data mining frameworks and to improve the way of knowledge discovery. This paper considers six clustering techniques for document clustering. The techniques are grouped into three groups: Group 1, K-means and its variants (the traditional K-means and K*-means algorithms); Group 2, Expectation Maximization and its variants (the traditional EM, spherical Gaussian EM, and linear partitioning and reallocation (LPR) clustering using EM algorithms); Group 3, semantic-based techniques (hybrid method and feature-based algorithms). A total of seven algorithms are considered, selected based on their popularity in the text mining field. Several experiments were conducted to analyze the performance of the algorithms and to select the winner in terms of cluster purity, clustering accuracy and speed of clustering.

  20. Automatic Extraction of Spatio-Temporal Information from Arabic Text Documents

    Directory of Open Access Journals (Sweden)

    Abdelkoui Feriel

    2015-10-01

    Full Text Available Unstructured Arabic text documents are an important source of geographical and temporal information. The possibility of automatically tracking spatio-temporal information, capturing changes relating to events from text documents, is a new challenge in the fields of geographic information retrieval (GIR), temporal information retrieval (TIR) and natural language processing (NLP). There has been a lot of work on the extraction of information in other languages that use the Latin alphabet, such as English, French, or Spanish; by contrast, the Arabic language is still not well supported in GIR and TIR, and more research needs to be conducted. In this paper, we present an approach that supports automated exploration and extraction of spatio-temporal information from Arabic text documents in order to capture and model such information before it can be utilized in search and exploration tasks. The system has been successfully tested on 50 documents that include a mixture of types of spatial/temporal information. The result achieved 91.01% recall and 80% precision. This illustrates that our approach is effective and its performance is satisfactory.
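
    The rule-based flavor of such extraction can be sketched in a few lines: a gazetteer lookup for place names and a regular expression for dates. The sketch below is a schematic stand-in for the authors' Arabic pipeline, shown on English tokens for readability; the gazetteer entries and date format are assumptions.

        import re

        GAZETTEER = {"Algiers", "Oran", "Constantine"}         # hypothetical place list
        DATE = re.compile(r"\b(\d{1,2})/(\d{1,2})/(\d{4})\b")  # dd/mm/yyyy dates

        def extract(text):
            places = [w for w in re.findall(r"\w+", text) if w in GAZETTEER]
            dates = ["%s/%s/%s" % d for d in DATE.findall(text)]
            return {"places": places, "dates": dates}

        print(extract("Floods hit Algiers on 12/03/2012 and reached Oran."))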

  1. Semi-supervised learning for detecting text-lines in noisy document images

    Science.gov (United States)

    Liu, Zongyi; Zhou, Hanning

    2010-01-01

    Document layout analysis is a key step in document image understanding, with wide applications in document digitization and reformatting. Identifying the correct layout from noisy scanned images is especially challenging. In this paper, we introduce a semi-supervised learning framework to detect text-lines from noisy document images. Our framework consists of three steps. The first step is the initial segmentation that extracts text-lines and images using simple morphological operations. The second step is a grouping-based layout analysis that identifies text-lines, image zones, column separators and vertical border noise. It is able to efficiently remove the vertical border noise from multi-column pages. The third step is an online classifier that is trained with the high-confidence line detection results from Step Two and filters out noise from low-confidence lines. The classifier effectively removes speckle noise embedded inside the content zones. We compare the performance of our algorithm to the state-of-the-art work in the field on the UW-III database. We choose the results reported by the Image Understanding Pattern Recognition Research (IUPR) group and Scansoft Omnipage SDK 15.5. We evaluate the performance at both the page frame level and the text-line level. The results show that our system has a much lower false-alarm rate, while maintaining a similar content detection rate. In addition, we also show that our online training model generalizes better than algorithms depending on offline training.
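
    The "simple morphological operations" of the first step can be sketched as below: binarize the page, close horizontally so the characters of a line merge, then read off connected components as candidate text-lines. This is a generic OpenCV illustration, not the authors' code; the file name, kernel size and shape filter are assumptions.

        import cv2

        img = cv2.imread("page.png", cv2.IMREAD_GRAYSCALE)   # hypothetical scan
        _, bw = cv2.threshold(img, 0, 255,
                              cv2.THRESH_BINARY_INV | cv2.THRESH_OTSU)

        # Dilate horizontally so characters on one line fuse into a single blob.
        kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (25, 3))  # assumed size
        lines = cv2.morphologyEx(bw, cv2.MORPH_CLOSE, kernel)

        # Each remaining connected component is a candidate text-line.
        n, _, stats, _ = cv2.connectedComponentsWithStats(lines)
        for x, y, w, h, area in stats[1:]:        # row 0 is the background
            if h > 5 and w > 3 * h:               # crude text-line shape filter
                print("text-line at", (x, y, w, h))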

  2. Issues and approaches for electronic document approval and transmittal using digital signatures and text authentication: Prototype documentation

    Science.gov (United States)

    Boling, M. E.

    1989-09-01

    Prototypes were assembled pursuant to recommendations made in report K/DSRD-96, Issues and Approaches for Electronic Document Approval and Transmittal Using Digital Signatures and Text Authentication, and to examine and discover the possibilities for integrating available hardware and software to provide cost-effective systems for digital signatures and text authentication. These prototypes show that on a LAN, a multitasking, windowed, mouse/keyboard menu-driven interface can be assembled to provide easy and quick access to bit-mapped images of documents, electronic forms and electronic mail messages, with a means to sign, encrypt, deliver, receive or retrieve and authenticate text and signatures. In addition they show that some of this same software may be used in a classified environment using host-to-terminal transactions to accomplish these same operations. Finally, a prototype was developed demonstrating that binary files may be signed electronically and sent by point-to-point communication and over ARPANET to remote locations, where the authenticity of the code and signature may be verified. Related studies on the subject of electronic signatures and text authentication using public key encryption were done within the Department of Energy. These studies include timing studies of public key encryption software and hardware and testing of experimental user-generated host-resident software for public key encryption. This software used commercially available command-line source code. These studies are responsive to an initiative within the Office of the Secretary of Defense (OSD) for the protection of unclassified but sensitive data. It is notable that these related studies are all built around the same commercially available public key encryption products from the private sector and that the software selection was made independently by each study group.
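
    The sign-and-verify cycle those prototypes exercised looks, in today's terms, like the sketch below. The Python cryptography package and the Ed25519 scheme are modern stand-ins chosen for this illustration, not the commercial products the 1989 prototypes used, and the file name is hypothetical.

        from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

        document = open("memo.txt", "rb").read()   # hypothetical document bytes

        private_key = Ed25519PrivateKey.generate()
        signature = private_key.sign(document)     # 64-byte detached signature

        # Anyone holding the public key can authenticate text and signature;
        # verify() raises InvalidSignature if either has been altered.
        private_key.public_key().verify(signature, document)
        print("signature verified")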

  3. A methodology for semiautomatic taxonomy of concepts extraction from nuclear scientific documents using text mining techniques

    International Nuclear Information System (INIS)

    This thesis presents a text mining method for the semi-automatic extraction of a taxonomy of concepts from a textual corpus composed of scientific papers related to the nuclear area. Text classification is a natural human practice and a crucial task for working with large repositories. The document clustering technique provides a logical and understandable framework that facilitates organization, browsing and searching. Most clustering algorithms use the bag-of-words model to represent the content of a document. This model generates a high dimensionality of the data, ignores the fact that different words can have the same meaning, and does not consider the relationship between them, assuming that words are independent of each other. The methodology combines a model for document representation by concepts with a hierarchical document clustering method based on the frequency of co-occurring concepts, and a technique for labeling clusters with their most representative concepts, with the objective of producing a taxonomy of concepts which may reflect a structure of the knowledge domain. It is hoped that this work will contribute to the conceptual mapping of the scientific production of the nuclear area and thus support the management of research activities in this area. (author)
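
    The clustering step the methodology builds on, agglomerative grouping of documents by co-occurring concepts, can be sketched as follows; the tiny concept-document matrix is invented for illustration and the linkage settings are assumptions, not the thesis's parameters.

        import numpy as np
        from scipy.cluster.hierarchy import fcluster, linkage

        # Rows: documents; columns: concepts (1 = concept occurs in document).
        M = np.array([[1, 1, 0, 0],
                      [1, 1, 1, 0],
                      [0, 0, 1, 1],
                      [0, 1, 1, 1]], dtype=float)

        # Hierarchical clustering by cosine distance of concept profiles;
        # the resulting tree is what a taxonomy of concepts is read from.
        Z = linkage(M, method="average", metric="cosine")
        print(fcluster(Z, t=2, criterion="maxclust"))  # two top-level clusters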

  4. Domain Based Ontology and Automated Text Categorization Based on Improved Term Frequency – Inverse Document Frequency

    Directory of Open Access Journals (Sweden)

    Sukanya Ray

    2012-05-01

    Full Text Available In recent years there has been a massive growth in textual information, especially on the internet. People now tend to read more e-books than hard copies of books. While searching for some topic, especially a new topic, on the internet, it is easier if someone knows the pre-requisites and post-requisites of that topic. Often topics are found without any proper title, and it becomes difficult later on to find which document was for which topic. A text categorization method can provide a solution to this problem. In this paper a domain-based ontology is created so that users can relate to different topics of a domain, and an automated text categorization technique is proposed that will categorize the uncategorized documents. The proposed idea is based on the Term Frequency - Inverse Document Frequency (tf-idf) method, and a dependency graph is also provided in the domain-based ontology so that users can visualize the relations among the terms.
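
    The tf-idf weighting the proposal rests on can be computed directly, as in the sketch below; the toy tokenized documents are invented, and a real system would add the normalization and smoothing variants the paper's "improved" scheme concerns.

        import math

        docs = [["ontology", "graph", "term"],        # toy tokenized documents
                ["term", "frequency", "weight"],
                ["graph", "term", "frequency"]]

        def tf_idf(term, doc, corpus):
            tf = doc.count(term) / len(doc)           # term frequency in this doc
            df = sum(term in d for d in corpus)       # documents containing term
            idf = math.log(len(corpus) / df)          # inverse document frequency
            return tf * idf

        for doc in docs:
            print({t: round(tf_idf(t, doc, docs), 3) for t in set(doc)})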

  5. A TEI P5 Document Grammar for the IDS Text Model

    OpenAIRE

    Lüngen, Harald; Sperberg-McQueen, C. M.

    2012-01-01

    This paper describes work in progress on I5, a TEI-based document grammar for the corpus holdings of the Institut für Deutsche Sprache (IDS) in Mannheim and the text model used by IDS in its work. The paper begins with background information on the nature and purposes of the corpora collected at IDS and the motivation for the I5 project (section 1). It continues with a description of the origin and history of the IDS text model (section 2), and a description (section 3) of the techniques us...

  6. ParaText : scalable solutions for processing and searching very large document collections : final LDRD report.

    Energy Technology Data Exchange (ETDEWEB)

    Crossno, Patricia Joyce; Dunlavy, Daniel M.; Stanton, Eric T.; Shead, Timothy M.

    2010-09-01

    This report is a summary of the accomplishments of the 'Scalable Solutions for Processing and Searching Very Large Document Collections' LDRD, which ran from FY08 through FY10. Our goal was to investigate scalable text analysis; specifically, methods for information retrieval and visualization that could scale to extremely large document collections. Towards that end, we designed, implemented, and demonstrated a scalable framework for text analysis - ParaText - as a major project deliverable. Further, we demonstrated the benefits of using visual analysis in text analysis algorithm development, improved performance of heterogeneous ensemble models in data classification problems, and the advantages of information theoretic methods in user analysis and interpretation in cross language information retrieval. The project involved 5 members of the technical staff and 3 summer interns (including one who worked two summers). It resulted in a total of 14 publications, 3 new software libraries (2 open source and 1 internal to Sandia), several new end-user software applications, and over 20 presentations. Several follow-on projects have already begun or will start in FY11, with additional projects currently in proposal.

  7. Hierarchical Concept Indexing of Full-Text Documents in the Unified Medical Language System Information Sources Map.

    Science.gov (United States)

    Wright, Lawrence W.; Nardini, Holly K. Grossetta; Aronson, Alan R.; Rindflesch, Thomas C.

    1999-01-01

    Describes methods for applying natural-language processing for automatic concept-based indexing of full text and methods for exploiting the structure and hierarchy of full-text documents to a large collection of full-text documents drawn from the Health Services/Technology Assessment Text database at the National Library of Medicine. Examines how…

  8. Text Feature Weighting For Summarization Of Document Bahasa Indonesia Using Genetic Algorithm

    Directory of Open Access Journals (Sweden)

    Aristoteles.

    2012-05-01

    Full Text Available This paper aims to perform text feature weighting for summarization of Bahasa Indonesia documents using a genetic algorithm. There are eleven text features, i.e., sentence position (f1), positive keywords in sentence (f2), negative keywords in sentence (f3), sentence centrality (f4), sentence resemblance to the title (f5), sentence inclusion of name entity (f6), sentence inclusion of numerical data (f7), sentence relative length (f8), bushy path of the node (f9), summation of similarities for each node (f10), and latent semantic feature (f11). We investigate the effect of the first ten sentence features on the summarization task. Then, we use the latent semantic feature to increase the accuracy. All feature score functions are used to train a genetic algorithm model to obtain a suitable combination of feature weights. Evaluation of text summarization uses the F-measure. The F-measure is directly related to the compression rate. The results showed that adding f11 increases the F-measure by 3.26% and 1.55% for compression ratios of 10% and 30%, respectively. On the other hand, it decreases the F-measure by 0.58% for a compression ratio of 20%. Analysis of the text feature weights showed that using only f2, f4, f5, and f11 can deliver a similar performance to using all eleven features.
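
    The scoring core of such a summarizer, a weighted sum of per-sentence feature scores followed by top-k selection at a given compression rate, can be sketched as below; the feature values and weights are invented stand-ins for the eleven features and the GA-tuned weights.

        # Each row holds feature scores (four of them here) for one sentence.
        features = [[0.9, 0.2, 0.4, 0.7],      # sentence 0
                    [0.1, 0.8, 0.6, 0.3],      # sentence 1
                    [0.5, 0.5, 0.9, 0.8]]      # sentence 2
        weights = [0.4, 0.1, 0.3, 0.2]         # stand-in for GA-optimized weights

        def score(feats):                      # weighted sum of feature scores
            return sum(w * f for w, f in zip(weights, feats))

        ranked = sorted(range(len(features)),
                        key=lambda i: score(features[i]), reverse=True)
        compression = 1 / 3                    # keep a third of the sentences
        k = max(1, round(len(features) * compression))
        print("summary sentences:", sorted(ranked[:k]))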

  9. Document analysis at DFKI. - Part 1: Image analysis and text recognition

    OpenAIRE

    Ali, Majdi Ben Hadj; Fein, Frank; Hönes, Frank; Jäger, Thorsten; Weigel, Achim

    1995-01-01

    Document analysis is responsible for an essential progress in office automation. This paper is part of an overview about the combined research efforts in document analysis at the DFKI. Common to all document analysis projects is the global goal of providing a high level electronic representation of documents in terms of iconic, structural, textual, and semantic information. These symbolic document descriptions enable an "intelligent" access to a document database. Currently there are three o...

  10. A New Property Coding in Text Steganography of Microsoft Word Documents

    OpenAIRE

    Stojanov, Ivan; Mileva, Aleksandra; STOJANOVIC Igor

    2014-01-01

    Electronic documents, similarly to printed documents, need to be secured by adding some specific features that allow efficient copyright protection, authentication, document tracking or investigation of counterfeiting and forgeries. Microsoft Word is one of the most popular word processors, and several methods exist for embedding data specifically in documents produced by it. We present a new type of method for hiding data in Microsoft Word documents, named Property coding, which deploys pro...

  11. An Efficient Technique to Implement Similarity Measures in Text Document Clustering using Artificial Neural Networks Algorithm

    Directory of Open Access Journals (Sweden)

    K. Selvi

    2014-12-01

    Full Text Available Pattern recognition, envisaging supervised and unsupervised methods, optimization, associative memory and control processes are some of the diversified problems that can be resolved by artificial neural networks. Problem identified: of late, discovering the required information in a massive quantity of data is a challenging task. The model of similarity evaluation is the central element in accomplishing an understanding of the variables and perceptions that encourage behavior and mediate concern. This study proposes artificial neural network algorithms to resolve similarity measures. In order to apply singular value decomposition, the frequency of word pairs is established in the given document. (1) Tokenization: the splitting up of a stream of text into words, phrases, signs, or other significant parts is called tokenization. (2) Stop words: the words that are segregated before or after processing natural language data are called stop words. (3) Porter stemming: the main utilization of this algorithm is as part of a phrase normalization process that is typically completed while setting up an information retrieval technique. (4) WordNet: the compilation of a lexical database for the English language is called WordNet. Based on artificial neural networks, the core part of this study extends a proposed n-gram algorithm. All the phonemes, syllables, letters, words or base pairs correspond in accordance with the application. Future work extends the application of these same similarity measures in various other neural network algorithms to accomplish improved results.
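
    Steps (1)-(3) of the pipeline can be sketched with NLTK as below; this is a generic illustration, not the paper's system, and it assumes NLTK's punkt tokenizer and stop-word data can be downloaded.

        import nltk
        from nltk.corpus import stopwords
        from nltk.stem import PorterStemmer

        nltk.download("punkt", quiet=True)         # tokenizer model
        nltk.download("stopwords", quiet=True)     # stop-word lists

        text = "Neural networks resolve similarity measures between documents."

        tokens = nltk.word_tokenize(text.lower())             # (1) tokenization
        stop = set(stopwords.words("english"))
        content = [t for t in tokens
                   if t.isalpha() and t not in stop]          # (2) stop-word removal
        stems = [PorterStemmer().stem(t) for t in content]    # (3) Porter stemming
        print(stems)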

  12. THE COMPOSITIONAL AND SPEECH ORGANIZATION OF REGULATION TEXT AS A REGULATORY DOCUMENT

    OpenAIRE

    Sharipova Roza Rifatovna

    2014-01-01

    The relevance of the study covered by this article is determined by the extension of the business communication scope, as well as the necessity to upgrade the administrative activity of organizations, which largely depends on the quality of documentation. Documents are used in various communicative situations and reflect intercultural business relations, which is why the problem of studying the nature and functions of documents is urgent. Business communication involves interaction in differen...

  13. A novel technique for estimation of skew in binary text document images based on linear regression analysis

    Indian Academy of Sciences (India)

    P Shivakumara; G Hemantha Kumar; D S Guru; P Nagabhushan

    2005-02-01

    When a document is scanned either mechanically or manually for digitization, it often suffers from some degree of skew or tilt. Skew-angle detection plays an important role in the field of document analysis systems and OCR in achieving the expected accuracy. In this paper, we consider skew estimation of Roman script. The method uses the boundary growing approach to extract the lowermost and uppermost coordinates of pixels of characters of text lines present in the document, which can be subjected to linear regression analysis (LRA) to determine the skew angle of a skewed document. Further, the proposed technique works fine for scaled text binary documents also. The technique works based on the assumption that the space between the text lines is greater than the space between the words and characters. Finally, in order to evaluate the performance of the proposed methodology we compare the experimental results with those of well-known existing methods.
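
    The heart of the method, fitting a straight line to the lowermost pixel coordinates of a text line and reading the skew angle from its slope, can be sketched with a least-squares fit; the coordinate values below are invented for illustration.

        import numpy as np

        # Hypothetical lowermost pixel coordinates (x, y) of characters on one
        # text line of a skewed scan; y grows downward in image coordinates.
        x = np.array([12, 40, 75, 110, 150, 190], dtype=float)
        y = np.array([200, 203, 206, 210, 213, 217], dtype=float)

        slope, intercept = np.polyfit(x, y, 1)    # linear regression (LRA)
        skew_deg = np.degrees(np.arctan(slope))   # skew angle of the text line
        print("estimated skew: %.2f degrees" % skew_deg)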

  14. Ultrasound-guided nerve blocks - is documentation and education feasible using only text and pictures?

    DEFF Research Database (Denmark)

    Worm, Bjarne Skjødt; Krag, Mette; Jensen, Kenneth

    2014-01-01

    With the advancement of ultrasound guidance for peripheral nerve blocks, still pictures from representative ultrasonograms are increasingly used for clinical documentation of the procedure and for educational purposes in textbook materials. However, little is actually known about the

  15. Interuniversity Style Guide for Writing Institutional Texts in English: Model Documents

    OpenAIRE

    Xarxa Vives d'Universitats. Grup de Treball de Qualitat Lingüística

    2015-01-01

    Interuniversity style guide for writing institutional texts in English: model documents. Guidelines and models for writing applications, resolutions, notifications, certificates, certifying statements, letters, emails and agreements in English.

  16. The Notion of Text and the Notion of Document - what Difference does it make?

    OpenAIRE

    Roswitha Skare

    2009-01-01

    The notion of text has a long tradition within the human sciences. A broad definition of this concept considers all man-made products as systems of signs and thereby as texts, but often not "as the physical manifestation as such, but as the abstract representation of a work" (Gunder: 2001, 86). Considering that everything - including sculpture, music, photography and film - can become a text, either a written text - like literature in a traditional sense - or a verbal text, one can at least wo...

  17. Content analysis to detect high stress in oral interviews and text documents

    Science.gov (United States)

    Thirumalainambi, Rajkumar (Inventor); Jorgensen, Charles C. (Inventor)

    2012-01-01

    A system of interrogation to estimate whether a subject of interrogation is likely experiencing high stress, emotional volatility and/or internal conflict in the subject's responses to an interviewer's questions. The system applies one or more of four procedures, a first statistical analysis, a second statistical analysis, a third analysis and a heat map analysis, to identify one or more documents containing the subject's responses for which further examination is recommended. Words in the documents are characterized in terms of dimensions representing different classes of emotions and states of mind, in which the subject's responses that manifest high stress, emotional volatility and/or internal conflict are identified. A heat map visually displays the dimensions manifested by the subject's responses in different colors, textures, geometric shapes or other visually distinguishable indicia.

  18. Trading Consequences: A Case Study of Combining Text Mining and Visualization to Facilitate Document Exploration

    OpenAIRE

    Hinrichs, Uta; Alex, Beatrice; Clifford, Jim; Watson, Andrew; Quigley, Aaron; Klein, Ewan; Coates, Colin M.

    2015-01-01

    Large-scale digitization efforts and the availability of computational methods, including text mining and information visualization, have enabled new approaches to historical research. However, we lack case studies of how these methods can be applied in practice and what their potential impact may be. Trading Consequences is an interdisciplinary research project between environmental historians, computational linguists, and visualization specialists. It combines text mining and information vi...

  19. Trading Consequences: A Case Study of Combining Text Mining & Visualisation to Facilitate Document Exploration

    OpenAIRE

    Hinrichs, Uta; Alex, Beatrice; Clifford, Jim; Quigley, Aaron

    2014-01-01

    Trading Consequences is an interdisciplinary research project between historians, computational linguists and visualization specialists. We use text mining and visualisations to explore the growth of the global commodity trade in the nineteenth century. Feedback from a group of environmental historians during a workshop provided essential information to adapt advanced text mining and visualisation techniques to historical research. Expert feedback is an essential tool for effective interdisci...

  20. Using ImageMagick to Automatically Increase Legibility of Scanned Text Documents

    Directory of Open Access Journals (Sweden)

    Doreva Belfiore

    2011-07-01

    Full Text Available The Law Library Digitization Project of the Rutgers University School of Law in Camden, New Jersey, developed a Perl script to use the open-source module PerlMagick to automatically adjust the brightness levels of digitized images from scanned microfiche. This script can be adapted by novice Perl programmers to manipulate large numbers of text and image files using commands available in PerlMagick and ImageMagick.

  1. Using ImageMagick to Automatically Increase Legibility of Scanned Text Documents

    OpenAIRE

    Doreva Belfiore

    2011-01-01

    The Law Library Digitization Project of the Rutgers University School of Law in Camden, New Jersey, developed a Perl script to use the open-source module PerlMagick to automatically adjust the brightness levels of digitized images from scanned microfiche. This script can be adapted by novice Perl programmers to manipulate large numbers of text and image files using commands available in PerlMagick and ImageMagick.
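
    The same brightness adjustment can be scripted against ImageMagick's command-line tools, in the spirit of the PerlMagick script described above; this sketch assumes ImageMagick 6's convert command is on the PATH, and the file names and adjustment value are invented.

        import subprocess

        # Brighten a dark microfiche scan by +20 brightness, 0 contrast change,
        # writing the adjusted copy alongside the original.
        src, dst = "fiche_0001.tif", "fiche_0001_bright.tif"  # hypothetical files
        subprocess.run(["convert", src, "-brightness-contrast", "20x0", dst],
                       check=True)
        print("wrote", dst)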

  2. Raw Data (ASCII format) - PLACE | LSDB Archive [Life Science Database Archive metadata

    Lifescience Database Archive (English)

    Full Text Available Raw data of the PLACE database in ASCII format. Entries provide literature data under field tags (RA: authors; RT: title; RL: literature data; RC: criteria, e.g., DNase I footprinting study, chemical modifications, gel retardation assay, homology, other; RD: external database, e.g., MEDLINE), together with accession numbers in sequence databases; XX is the data item delimiter, and nucleotide ambiguity codes follow IUPAC (e.g., W: A or T; Y: C or T). (Mar. 31, 1997)

  3. Louhi 2010: Special issue on Text and Data Mining of Health Documents

    Directory of Open Access Journals (Sweden)

    Dalianis Hercules

    2011-07-01

    Full Text Available Abstract The papers presented in this supplement focus and reflect on computer use in every-day clinical work in hospitals and clinics such as electronic health record systems, pre-processing for computer aided summaries, clinical coding, computer decision systems, as well as related ethical concerns and security. Much of this work concerns itself by necessity with incorporation and development of language processing tools and methods, and as such this supplement aims at providing an arena for reporting on development in a diversity of languages. In the supplement we can read about some of the challenges identified above.

  4. Farsi/Arabic Document Image Retrieval through Sub -Letter Shape Coding for mixed Farsi/Arabic and English text

    Directory of Open Access Journals (Sweden)

    Zahra Bahmani

    2011-09-01

    Full Text Available A recognition-free retrieval method for Farsi/Arabic documents is proposed in this paper. The system can be used in mixed Farsi/Arabic and English text. The method consists of preprocessing, word and sub_word extraction, detection and cancelation of sub_letter connectors, annotation of sub_letters by shape coding, classification of sub_letters using a decision tree, and the use of an RBF neural network for sub_letter recognition. The proposed system retrieves document images by a new sub_letter shape coding scheme in Farsi/Arabic documents. In this method, document content is captured through sub_letter coding of words. The decision tree-based classifier partitions the sub_letter space into a number of sub-regions by splitting the sub_letter space, using one topological shape feature at a time. Topological shape features include height, width, holes, openings, valleys, jags, and sub_letter ascenders/descenders. Experimental results show the advantages of this method in Farsi/Arabic document image retrieval.

  5. Progress Report on the ASCII for Science Data, Airborne and Geospatial Working Groups of the 2014 ESDSWG for MEaSUREs

    Science.gov (United States)

    Evans, K. D.; Krotkov, N. A.; Mattmann, C. A.; Boustani, M.; Law, E.; Conover, H.; Chen, G.; Olding, S. W.; Walter, J.

    2014-12-01

    The Earth Science Data Systems Working Groups (ESDSWG) were set up by NASA HQ 10 years ago. The role of the ESDSWG is to make recommendations relevant to NASA's Earth science data systems from users' experiences. Each group works independently, focusing on a unique topic. Participation in ESDSWG groups comes from a variety of NASA-funded science and technology projects, NASA information technology experts, affiliated contractor staff and other interested community members from academia and industry. Recommendations from the ESDSWG groups will enhance NASA's efforts to develop long-term data products. The ASCII for Science Data Working Group (WG) will define a minimum set of information that should be included in ASCII file headers so that users will be able to access the data using only the header information. After reviewing various use cases, such as field data and ASCII data exported from software tools, and reviewing ASCII data guidelines documentation, this WG will deliver guidelines for creating ASCII files that contain enough header information to allow the user to access the science data. The Airborne WG's goal is to improve airborne data access and use for NASA science. The first step is to evaluate the state of airborne data and make recommendations focusing on data delivery to the DAACs (data centers). The long-term goal is to improve airborne data use for Earth science research. Many airborne observations are reported in ASCII format. The ASCII and Airborne WGs may seem like the same group, but the Airborne WG is concerned with maintaining and using airborne data for science research, not just the data format. The Geospatial WG's focus is on the interoperability issues of Geospatial Information System (GIS) and remotely sensed data, in particular focusing on DAAC(s) data from NASA's Earth Science Enterprise. This WG will provide a set of tools (GIS libraries) to use with training and/or cookbooks through the use of Open Source technologies. A progress
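
    The idea of a self-describing ASCII header can be made concrete with a small sketch: the header carries enough metadata to interpret the data without outside documentation. The key names below are invented for illustration and are not the working group's recommended set.

        # Write a data file whose '#'-prefixed header describes its own columns.
        header = {"title": "surface ozone", "columns": "time_utc,ozone_ppb",
                  "units": "hours,ppb", "missing_value": "-999"}
        with open("ozone.txt", "w") as f:
            for key, value in header.items():
                f.write("# %s: %s\n" % (key, value))
            f.write("0,31.2\n1,-999\n2,33.8\n")

        # Read it back, using only the header to interpret the data records.
        meta, rows = {}, []
        for line in open("ozone.txt"):
            if line.startswith("#"):
                key, _, value = line[1:].partition(":")
                meta[key.strip()] = value.strip()
            else:
                rows.append(line.strip().split(","))
        print(meta["columns"], rows)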

  6. Segmentation of Handwritten Text Document Written in Devanagri Script for Simple character, skewed character and broken character

    Directory of Open Access Journals (Sweden)

    Vneeta Rani

    2013-06-01

    Full Text Available OCR (optical character recognition) is a technology commonly used for pattern recognition in artificial intelligence and computer vision. With the help of OCR we can convert scanned documents into editable documents which can be further used in various research areas. In this paper, we present a character segmentation technique that can segment simple characters, skewed characters as well as broken characters. Character segmentation is a very important phase in any OCR process because the output of this phase serves as input to various other phases like the character recognition phase. If there is some problem in the character segmentation phase, then recognition of the corresponding character is very difficult or nearly impossible.

  7. Computing symmetrical strength of N-grams: a two pass filtering approach in automatic classification of text documents

    OpenAIRE

    Agnihotri, Deepak; Verma, Kesari; Tripathi, Priyanka

    2016-01-01

    The contiguous sequences of the terms (N-grams) in the documents are symmetrically distributed among different classes. The symmetrical distribution of the N-grams raises uncertainty about which class an N-gram belongs to. In this paper, we focus on the selection of the most discriminating N-grams by reducing the effects of symmetrical distribution. In this context, a new text feature selection method named the symmetrical strength of the N-grams (SSNG) is proposed using a two pa...

  8. Computing symmetrical strength of N-grams: a two pass filtering approach in automatic classification of text documents.

    Science.gov (United States)

    Agnihotri, Deepak; Verma, Kesari; Tripathi, Priyanka

    2016-01-01

    The contiguous sequences of the terms (N-grams) in the documents are symmetrically distributed among different classes. The symmetrical distribution of the N-grams raises uncertainty about which class an N-gram belongs to. In this paper, we focus on the selection of the most discriminating N-grams by reducing the effects of symmetrical distribution. In this context, a new text feature selection method named the symmetrical strength of the N-grams (SSNG) is proposed using a two-pass filtering based feature selection (TPF) approach. Initially, in the first pass of TPF, the SSNG method chooses various informative N-grams from the entire set of N-grams extracted from the corpus. Subsequently, in the second pass the well-known Chi Square (χ2) method is used to select the few most informative N-grams. Further, to classify the documents, the two standard classifiers Multinomial Naive Bayes and Linear Support Vector Machine have been applied on ten standard text data sets. In most of the datasets, the experimental results show that the performance and success rate of the SSNG method using the TPF approach is superior to state-of-the-art methods, viz. Mutual Information, Information Gain, Odds Ratio, Discriminating Feature Selection and χ2. PMID:27386386
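
    The second pass of TPF, chi-square selection over extracted N-grams, can be sketched with scikit-learn as below; the first pass, SSNG scoring, is the paper's own contribution and is omitted here. The toy corpus and the value of k are assumptions, and this is a generic illustration rather than the authors' code.

        from sklearn.feature_extraction.text import CountVectorizer
        from sklearn.feature_selection import SelectKBest, chi2

        docs = ["the reactor core heats up", "cooling the reactor core",
                "stocks rally on wall street", "wall street stocks slide"]
        y = [0, 0, 1, 1]                          # class labels

        # Extract uni- and bi-grams, then keep the most class-informative ones.
        vec = CountVectorizer(ngram_range=(1, 2))
        X = vec.fit_transform(docs)
        keep = SelectKBest(chi2, k=5).fit(X, y).get_support(indices=True)
        names = vec.get_feature_names_out()
        print([names[i] for i in keep])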

  9. Lidar Bathymetry Data of Cape Canaveral, Florida, (2014) in XYZ ASCII text file format

    Data.gov (United States)

    U.S. Geological Survey, Department of the Interior — The Cape Canaveral Coastal System (CCCS) is a prominent feature along the Southeast U.S. coastline and is the only large cape south of Cape Fear, North Carolina....

  10. Mining Clinicians' Electronic Documentation to Identify Heart Failure Patients with Ineffective Self-Management: A Pilot Text-Mining Study.

    Science.gov (United States)

    Topaz, Maxim; Radhakrishnan, Kavita; Lei, Victor; Zhou, Li

    2016-01-01

    Effective self-management can decrease up to 50% of heart failure hospitalizations. Unfortunately, self-management by patients with heart failure remains poor. This pilot study aimed to explore the use of text mining to identify heart failure patients with ineffective self-management. We first built a comprehensive self-management vocabulary based on the literature and clinical notes review. We then randomly selected 545 heart failure patients treated within Partners Healthcare hospitals (Boston, MA, USA) and conducted a regular expression search with the compiled vocabulary within 43,107 interdisciplinary clinical notes of these patients. We found that 38.2% (n = 208) of patients had documentation of ineffective heart failure self-management in the domains of poor diet adherence (28.4%), missed medical encounters (26.4%), poor medication adherence (20.2%) and non-specified self-management issues (e.g., "compliance issues", 34.6%). We showed the feasibility of using text mining to identify patients with ineffective self-management. More natural language processing algorithms are needed to help busy clinicians identify these patients. PMID:27332377
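
    The core of the method, a regular expression search of clinical notes against a curated vocabulary, can be sketched as below; the vocabulary fragment and the note are invented, not the study's materials.

        import re

        # Hypothetical slice of a self-management vocabulary.
        vocabulary = [r"missed appointment", r"stopped taking",
                      r"non[- ]?complian\w+", r"diet (?:non)?adherence"]
        pattern = re.compile("|".join(vocabulary), re.IGNORECASE)

        note = "Pt missed appointment twice; reports he stopped taking furosemide."
        print(pattern.findall(note))   # evidence of ineffective self-management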

  11. PHYSICAL MODELLING OF TERRAIN DIRECTLY FROM SURFER GRID AND ARC/INFO ASCII DATA FORMATS#

    Directory of Open Access Journals (Sweden)

    Y.K. Modi

    2012-01-01

    Full Text Available

    ENGLISH ABSTRACT: Additive manufacturing technology is used to make physical models of terrain using GIS surface data. Attempts have been made to understand several other GIS file formats, such as the Surfer grid and the ARC/INFO ASCII grid. The surface of the terrain in these file formats has been converted into an STL file format that is suitable for additive manufacturing. The STL surface is converted into a 3D model by making the walls and the base. In this paper, the terrain modelling work has been extended to several other widely-used GIS file formats. Terrain models can be created in less time and at less cost, and intricate geometries of terrain can be created with ease and great accuracy.

    AFRIKAANS SUMMARY: Additive manufacturing technology is used to make physical models of terrain from GIS surface data. An attempt was made to understand several other GIS file formats, such as the Surfer grid and the ARC/INFO ASCII grid. The terrain surface in these file formats was converted into an STL file format suitable for additive manufacturing. Further, the STL surface was converted into a 3D model by modelling the sides and the base. In this article, the terrain modelling work is extended to several other commonly used GIS file formats. Terrain models can thus be created in less time and at lower cost, while complex terrain geometries can be created with ease and great accuracy.

  12. Gridded bathymetry of French Frigate Shoals, Hawaii, USA - Arc ASCII format

    Data.gov (United States)

    National Oceanic and Atmospheric Administration, Department of Commerce — Gridded bathymetry (5m) of the shelf environment of French Frigate Shoals, Hawaii, USA. The ASCII includes multibeam bathymetry from the Simrad EM3002d, and Reson...

  13. Formation of skill of interpretation of legal documents and potential of graphic means registrations of the text

    OpenAIRE

    Kosareva T. B.

    2010-01-01

    The article deals with teaching the translation of legal documents, ways of effectively learning legal vocabulary, and testing for educational purposes. Testing is seen as a kind of training for achieving automatic interpretation skills, in which teaching materials are designed with the help of graphic highlighting.

  14. A Research paper: An ASCII value based data encryption algorithm and its comparison with other symmetric data encryption algorithms

    Directory of Open Access Journals (Sweden)

    Akanksha Mathur

    2012-09-01

    Full Text Available Encryption is the process of transforming plaintext into ciphertext, where plaintext is the input to the encryption process and ciphertext is the output of the encryption process. Decryption is the process of transforming ciphertext into plaintext, where ciphertext is the input to the decryption process and plaintext is the output of the decryption process. There are various encryption algorithms, classified as symmetric and asymmetric encryption algorithms. Here, I present an algorithm for data encryption and decryption which is based on the ASCII values of characters in the plaintext. This algorithm encrypts data by using the ASCII values of the data to be encrypted. The secret used is modified into another string, and that string is used as a key to encrypt or decrypt the data. So, it can be said that it is a kind of symmetric encryption algorithm, because it uses the same key for encryption and decryption but slightly modifies it. This algorithm operates when the length of the input and the length of the key are the same.
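
    A toy cipher in the same family (not the paper's algorithm) is sketched below: each plaintext character is shifted by the ASCII value of the matching key character, wrapping within the printable range, with the key required to match the input length as the abstract specifies.

        def shift(text, key, sign):
            # Shift each character by the ASCII value of the matching key
            # character, wrapping within printable ASCII (codes 32..126).
            assert len(text) == len(key), "key must match input length"
            return "".join(chr((ord(t) - 32 + sign * ord(k)) % 95 + 32)
                           for t, k in zip(text, key))

        encrypt = lambda p, k: shift(p, k, +1)
        decrypt = lambda c, k: shift(c, k, -1)

        cipher = encrypt("attack at dawn", "secretsecrets!")
        print(cipher, "->", decrypt(cipher, "secretsecrets!"))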

  15. The Hong Kong Chinese University Document Retrieval Database——The Hong Kong Newspaper Full-text Database Projeet

    Institute of Scientific and Technical Information of China (English)

    Michael M. Lee

    1994-01-01

    This project is to collect, organize, index and store the full text and graphics of selected Chinese and English newspapers currently published in Hong Kong. The end product will be an electronic database available to researchers through local area network, Internet and dial-up access. News items of the day before and up to six months old will be available for online searching, via keyword or subject. Earlier cumulated materials, along with the same indexing and searching software, will be archived to optical media (CD-ROM disks). As Hong Kong experiences rapid social, financial, commercial, political, educational and cultural changes, our state-of-the-art comprehensive coverage of local and regional newspapers will be a landmark contribution to information industries and researchers internationally. As the coverage of the database will be comprehensive and centralized, retrieval of news items of major Hong Kong newspapers will be fast and immediate. Users will not need to look through daily or bi-monthly indexes in order to go to the newspapers or cuttings to obtain a hard copy, and then bring it to the photocopier machine to copy. At this stage, we are hiring librarians, information specialists and support staff to work on this project. We have also met and worked with newspaper indexing and retrieval system developers in Beijing and Hong Kong to study cooperative systems to speed up the process. So far, we have received funding support from the Chinese University and the Hong Kong Government for two years. It is our plan to have a presentable sample database done by mid 1995, and to have several newspapers indexed and stored in a structure and format easy for migration to the eventual database system by the end of 1996.

  16. Combining Position Weight Matrices and Document-Term Matrix for Efficient Extraction of Associations of Methylated Genes and Diseases from Free Text

    KAUST Repository

    Bin Raies, Arwa

    2013-10-16

    Background: In a number of diseases, certain genes are reported to be strongly methylated and thus can serve as diagnostic markers in many cases. Scientific literature in digital form is an important source of information about methylated genes implicated in particular diseases. The large volume of the electronic text makes it difficult and impractical to search for this information manually. Methodology: We developed a novel text mining methodology based on a new concept of position weight matrices (PWMs) for text representation and feature generation. We applied PWMs in conjunction with the document-term matrix to extract with high accuracy associations between methylated genes and diseases from free text. The performance results are based on large manually-classified data. Additionally, we developed a web-tool, DEMGD, which automates extraction of these associations from free text. DEMGD presents the extracted associations in summary tables and full reports in addition to evidence tagging of text with respect to genes, diseases and methylation words. The methodology we developed in this study can be applied to similar association extraction problems from free text. Conclusion: The new methodology developed in this study allows for efficient identification of associations between concepts. Our method applied to methylated genes in different diseases is implemented as a Web-tool, DEMGD, which is freely available at http://www.cbrc.kaust.edu.sa/demgd/. The data is available for online browsing and download. © 2013 Bin Raies et al.
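
    The representation idea, a position weight matrix over token windows rather than DNA bases, can be sketched as below; the toy windows and window length are invented, and the real method couples such matrices with a document-term matrix.

        from collections import Counter

        # Toy 4-token windows around gene-disease co-mentions (hypothetical).
        windows = [["gene", "is", "methylated", "in"],
                   ["gene", "was", "methylated", "in"],
                   ["gene", "is", "silenced", "in"]]

        # One token-probability table per window position.
        pwm = []
        for pos in range(len(windows[0])):
            counts = Counter(w[pos] for w in windows)
            total = sum(counts.values())
            pwm.append({tok: n / total for tok, n in counts.items()})

        def score(window):
            # Product of per-position probabilities; 0 if a token is unseen.
            p = 1.0
            for pos, tok in enumerate(window):
                p *= pwm[pos].get(tok, 0.0)
            return p

        print(score(["gene", "is", "methylated", "in"]))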

  17. Scholars in the Humanities Are Reluctant to Cite E-Texts as Primary Materials. A Review of: Sukovic, S. (2009. References to e-texts in academic publications. Journal of Documentation, 65(6, 997-1015.

    Directory of Open Access Journals (Sweden)

    Deena Yanofsky

    2011-03-01

    collections as well as ‘electronically born’ documents, works of art and popular culture artifacts. Of the 22 works resulting from the research projects examined during the study period, half did not cite e-texts as primary materials. The 11 works that made at least one reference to an e-text included 4 works in which the only reference was to e-texts created by the actual author. In total, only 7 works referred to e-texts created by outside authors. These 7 final works were written by 5 participants, representing 31 percent of the total number of study participants. Analysis of the participants’ citation practices revealed that decisions to cite an electronic source or omit it from publication were based on two important factors: (1) the perceived trustworthiness of an e-text and (2) a sense of what was acceptable practice. Participants established trustworthiness through a process of verification. To confirm the authenticity and reliability of an e-text, most participants compared electronic documents against a print version to verify provenance, context, and details. Even when digitized materials were established as trustworthy sources, however, hard copies were often cited because they were considered more authoritative or accurate. Traditions of a particular discipline also had a strong influence on a participant’s willingness to cite e-texts. Participants working on traditional historical topics were more reluctant to cite electronic resources, while researchers who worked on topics that explored relatively new fields were more willing to acknowledge the use of e-texts in published works. Traditional practices also influenced participants’ decisions about how to cite materials. Some participants always cited original works in hard copy, regardless of electronic access, because it was accepted scholarly practice. Conclusions - The results of this study suggest that the small number of citations to electronic sources in publications in the humanities is directly

  18. PDF文档HTML化中文本重排问题研究%A Study of Text Rearrangement in Conversion of PDF Documents into HTML

    Institute of Scientific and Technical Information of China (English)

    林青; 李健

    2014-01-01

    At present, among the various PDF conversion tools, the method of restoring order after extracting PDF elements is to rearrange the elements according to the coordinates of each text element, from left to right and top to bottom. This kind of rearrangement cannot correctly restore multi-column or multi-region PDF documents. This paper proposes a page segmentation algorithm: the page is divided into different regions and the elements are reordered on the basis of those regions, effectively improving the correctness of text-order restoration for multi-column and multi-region PDF documents. Most of the existing PDF converters fulfill text detection by locating the coordinates of each text element. Specifically, text detection is realized by rearranging these elements from left to right as well as from top to bottom. Unfortunately, such methods fail to work in complex multiple-column PDF documents. To settle this problem, this work proposes a novel page segmentation algorithm. The proposed algorithm first divides a page into several blocks, and then reorders these blocks. With the proposed algorithm, the correctness of restoring the original complex multiple-column text order increases effectively.
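
    The difference between naive top-to-bottom ordering and region-aware ordering can be sketched as below: blocks are first assigned to columns, then read column by column. The fixed column edges are an invented stand-in for the paper's page segmentation step.

        # Toy text blocks as (x, y, text); x, y are top-left page coordinates.
        blocks = [(50, 100, "left col, para 1"), (300, 90, "right col, para 1"),
                  (50, 200, "left col, para 2"), (300, 210, "right col, para 2")]

        def column_of(block, edges=(0, 250, 550)):
            # Assign a block to the column whose x-range contains it.
            x = block[0]
            for i in range(len(edges) - 1):
                if edges[i] <= x < edges[i + 1]:
                    return i
            return len(edges) - 2

        # Region-aware reading order: by column first, then top to bottom,
        # instead of sorting every block globally by y alone.
        ordered = sorted(blocks, key=lambda b: (column_of(b), b[1]))
        print([b[2] for b in ordered])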

  19. Single-Beam Bathymetry Sounding Data of Cape Canaveral, Florida, (2014) in XYZ ASCII text file format

    Data.gov (United States)

    U.S. Geological Survey, Department of the Interior — The Cape Canaveral Coastal System (CCCS) is a prominent feature along the Southeast U.S. coastline, and is the only large cape south of Cape Fear, North Carolina....

  20. Text files of the navigation logged by the U.S. Geological Survey offshore of Fire Island, NY in 2011 (Geographic, WGS 84, HYPACK ASCII Text Files)

    Data.gov (United States)

    U.S. Geological Survey, Department of the Interior — The U.S. Geological Survey (USGS) mapped approximately 336 square kilometers of the lower shoreface and inner-continental shelf offshore of Fire Island, New York in...

  1. Research on Document Relevancy Based on Full-Text Retrieval System%一种基于全文检索系统的文档关联研究与实现

    Institute of Scientific and Technical Information of China (English)

    饶祎; 郭辉; 蔡庆生

    2003-01-01

    As an important application of full-text retrieval systems, document relevancy provides powerful functionality. In this paper, a document relevancy method based on a full-text retrieval system is presented and discussed in depth from two aspects: content relevancy and property relevancy. Tests show that the system has good response time and precision. It has great prospects in the application area.

  2. Keyless Entry: Building a Text Database Using OCR Technology.

    Science.gov (United States)

    Grotophorst, Clyde W.

    1989-01-01

    Discusses the use of optical character recognition (OCR) technology to produce an ASCII text database. A tutorial on digital scanning and OCR is provided, and a systems integration project which used the Calera CDP-3000XF scanner and text retrieval software to construct a database of dissertations at George Mason University is described. (four…

  3. Native Language Processing using Exegy Text Miner

    Energy Technology Data Exchange (ETDEWEB)

    Compton, J

    2007-10-18

    Lawrence Livermore National Laboratory's New Architectures Testbed recently evaluated Exegy's Text Miner appliance to assess its applicability to high-performance, automated native language analysis. The evaluation was performed with support from the Computing Applications and Research Department in close collaboration with Global Security programs and institutional activities in native language analysis. The Exegy Text Miner is a special-purpose device for detecting and flagging user-supplied patterns of characters, whether in streaming text or in collections of documents, at very high rates. Patterns may consist of simple lists of words or complex expressions with sub-patterns linked by logical operators. These searches are accomplished through a combination of specialized hardware (i.e., one or more field-programmable gate arrays in addition to general-purpose processors) and proprietary software that exploits these individual components in an optimal manner (through parallelism and pipelining). For this application the Text Miner has performed accurately and reproducibly at high speeds approaching those documented by Exegy in its technical specifications. The Exegy Text Miner is primarily intended for the single-byte ASCII characters used in English, but at a technical level its capabilities are language-neutral and can be applied to multi-byte character sets such as those found in Arabic and Chinese. The system is used for searching databases or tracking streaming text with respect to one or more lexicons. In a real operational environment it is likely that data would need to be processed separately for each lexicon or search technique. However, the searches would be so fast that multiple passes should not be considered a limitation a priori. Indeed, it is conceivable that large databases could be searched as often as necessary if new queries were deemed worthwhile. This project is concerned with evaluating the Exegy Text Miner installed in the

  4. Text Steganographic Approaches: A Comparison

    Directory of Open Access Journals (Sweden)

    Monika Agarwal

    2013-02-01

    Full Text Available This paper presents three novel approaches to text steganography. The first approach uses the theme of a missing letter puzzle, where each character of the message is hidden by missing one or more letters in a word of the cover. The average Jaro score was found to be 0.95, indicating close similarity between cover and stego file. The second approach hides a message in a wordlist, where the ASCII value of the embedded character determines the length and starting letter of a word. The third approach conceals a message, without degrading the cover, by using the start and end letters of words of the cover. To enhance the security of the secret message, the message is scrambled using a one-time pad scheme before being concealed, and the cipher text is then concealed in the cover. We also present an empirical comparison of the proposed approaches with some of the popular text steganographic approaches and show that our approaches outperform the existing approaches.
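
    The second approach can be illustrated with a toy encoder: the ASCII code of each secret character is split into a starting letter and a word length, and a word with exactly that first letter and length is emitted. The synthetic "words" below stand in for entries of a real wordlist, and the exact mapping is an assumption for illustration, not the paper's.

        # ASCII code c (32..126) -> first letter c % 26, word length c // 26 + 3.
        def encode_char(c):
            first = chr(ord("a") + c % 26)
            length = c // 26 + 3               # lengths 4..7 for printable ASCII
            return first + "x" * (length - 1)  # toy word from a fake wordlist

        def decode_word(w):
            return (len(w) - 3) * 26 + (ord(w[0]) - ord("a"))

        secret = "Hi!"
        stego = [encode_char(ord(ch)) for ch in secret]
        print(" ".join(stego))                              # cover-like word stream
        print("".join(chr(decode_word(w)) for w in stego))  # -> Hi!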

  5. Multilingual information identification and extraction from imaged documents using optical correlator

    Science.gov (United States)

    Stalcup, Bruce W.; Brower, James; Vaughn, Lou; Vertuno, Mike

    2002-11-01

    Most organizations maintain large archives of paper documents. These archives typically contain valuable information and data, which are imaged to provide electronic access. However, once a document is either printed or imaged, these organizations have no efficient method of retrieving information from it. The only methods available to retrieve information from such documents were either to read them manually or to convert them to ASCII text using optical character recognition (OCR). For most archives with large numbers of documents, these methods are problematic. Manual searches are not feasible. OCR, on the other hand, can be CPU intensive and prone to error. In addition, for many foreign languages, OCR engines do not exist. By contrast, our system provides an innovative approach to the problem of retrieving information from imaged document archives, utilizing a client/server architecture. Since its beginning in 1999, we have made significant advances in the development of a system that employs optical correlation (OC) technology (either software or hardware) to access directly the textual and graphic information contained in imaged paper documents, thereby eliminating the OCR process. It provides a fast, accurate means of accessing this information directly from multilingual documents. In addition, our system can also rapidly and accurately detect the presence of duplicate documents within an archive using optical correlation techniques. In this paper, we describe the present system and selected examples of its capabilities. We also present some performance results (accuracy, speed, etc.) against test document sets.

  6. A methodology for semiautomatic taxonomy of concepts extraction from nuclear scientific documents using text mining techniques; Metodologia para extracao semiautomatica de uma taxonomia de conceitos a partir da producao cientifica da area nuclear utilizando tecnicas de mineracao de textos

    Energy Technology Data Exchange (ETDEWEB)

    Braga, Fabiane dos Reis

    2013-07-01

    This thesis presents a text mining method for the semi-automatic extraction of a taxonomy of concepts from a textual corpus composed of scientific papers related to the nuclear area. Text classification is a natural human practice and a crucial task for working with large repositories. The document clustering technique provides a logical and understandable framework that facilitates organization, browsing and searching. Most clustering algorithms use the bag-of-words model to represent the content of a document. This model generates a high dimensionality of the data, ignores the fact that different words can have the same meaning, and does not consider the relationship between them, assuming that words are independent of each other. The methodology combines a model for document representation by concepts with a hierarchical document clustering method based on the frequency of co-occurring concepts, and a technique for labeling clusters with their most representative concepts, with the objective of producing a taxonomy of concepts which may reflect a structure of the knowledge domain. It is hoped that this work will contribute to the conceptual mapping of the scientific production of the nuclear area and thus support the management of research activities in this area. (author)

  7. Automatic Arabic Text Classification

    OpenAIRE

    Al-harbi, S; Almuhareb, A.; Al-Thubaity , A; Khorsheed, M. S.; Al-Rajeh, A.

    2008-01-01

    Automated document classification is an important text mining task especially with the rapid growth of the number of online documents present in Arabic language. Text classification aims to automatically assign the text to a predefined category based on linguistic features. Such a process has different useful applications including, but not restricted to, e-mail spam detection, web page content filtering, and automatic message routing. This paper presents the results of experiments on documen...

  8. Locations and analysis of sediment samples collected offshore of Massachusetts within Northern Cape Cod Bay(CCB_SedSamples Esri Shapefile, and ASCII text format, WGS84)

    Data.gov (United States)

    U.S. Geological Survey, Department of the Interior — These data were collected under a cooperative agreement with the Massachusetts Office of Coastal Zone Management (CZM) and the U.S. Geological Survey (USGS),...

  9. Text Mining.

    Science.gov (United States)

    Trybula, Walter J.

    1999-01-01

    Reviews the state of research in text mining, focusing on newer developments. The intent is to describe the disparate investigations currently included under the term text mining and provide a cohesive structure for these efforts. A summary of research identifies key organizations responsible for pushing the development of text mining. A section…

  10. Text Mining: (Asynchronous Sequences

    Directory of Open Access Journals (Sweden)

    Sheema Khan

    2014-12-01

    Full Text Available In this paper we try to correlate text sequences that provide common topics as semantic clues. We propose a two-step method for asynchronous text mining. Step one checks for the common topics in the sequences and isolates them with their timestamps. Step two takes the topic and tries to give the timestamp of the text document. After multiple repetitions of step two, we can give an optimal result.

  11. De que modo os textos oficiais prescrevem o trabalho do professor? Análise comparativa de documentos brasileiros e genebrinos How do official texts prescribe the teacher's work? A comparative analysis of the Brazilian and Genevan documents

    Directory of Open Access Journals (Sweden)

    Anna Rachel Machado

    2005-12-01

    Full Text Available This article presents the results of analyses of two documents produced by official bodies to guide the work of teachers in Brazil and in Switzerland. On the one hand, we sought to detect the characteristics of the textualization of the prescription of the teacher's work. The results show that, besides the properties common to prescriptive texts (erasure of the enunciator, felicity contract, etc.), these documents are characterized by a more complex thematic structure, articulating a prescriptive doing, a source-doing and a prescribed-doing. In addition, we sought to identify the forms of construction of the object of prescription, which made it possible to verify that, in both contexts, this object is configured as a global pedagogical proposal and not as the concrete work of teachers; teachers are not represented in these texts as actors with real responsibility for the development of the proposals and, in parallel, students are presented as inert targets. This work also allowed us to raise some differences in the forms of textualization of the prescriptions examined, differences that we relate to the political-economic context of the two countries. Finally, we arrive at questions concerning the reasons for the non-consideration of the teachers' actual work in this type of document. In this article we present the results of two documents produced by official agencies that aim at guiding the teachers' work in Brazil and in Switzerland. On the one hand, we focused on detecting the textualization features used to prescribe the teacher's work. Results show that besides the common prescriptive features of the two texts (enunciator's erasure, felicity contract etc.), these documents carry a more complex thematic structure, articulating a prescriptive doing, a source-doing and a prescribed-doing. We have also tried to identify the forms of building the object of prescription, which

  12. Text Laws

    Czech Academy of Sciences Publication Activity Database

    Hřebíček, Luděk

    Vol. 26. Ein internationales Handbuch/An International Handbook. Berlin-New York : Walter de Gruyter, 2005 - (Köhler, R.; Altmann, G.; Piotrowski, R.), s. 348-361 ISBN 978-3-11-015578-5 Institutional research plan: CEZ:AV0Z90210515 Keywords : Text structure * Quantitative linguistics Subject RIV: AI - Linguistics

  13. EMOTION DETECTION FROM TEXT

    Directory of Open Access Journals (Sweden)

    Shiv Naresh Shivhare

    2012-05-01

    Full Text Available Emotion can be expressed in many observable ways, such as facial expressions and gestures, speech, and written text. Emotion detection in text documents is essentially a content-based classification problem involving concepts from the domains of Natural Language Processing and Machine Learning. In this paper, emotion recognition based on textual data and the techniques used in emotion detection are discussed.
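
    The paper surveys techniques rather than giving code; as a minimal illustration of the content-based classification framing, here is a sketch using scikit-learn, with a toy training set invented for the example.

        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.naive_bayes import MultinomialNB
        from sklearn.pipeline import make_pipeline

        # Toy labelled data (invented); a real study would use an emotion corpus.
        train_texts = ["I am so happy today", "this is wonderful news",
                       "I feel sad and lonely", "what a terrible loss",
                       "this makes me furious", "I am really angry about it"]
        train_labels = ["joy", "joy", "sadness", "sadness", "anger", "anger"]

        clf = make_pipeline(TfidfVectorizer(), MultinomialNB())
        clf.fit(train_texts, train_labels)
        print(clf.predict(["such sad news today"]))   # ['sadness'] on this toy data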

  14. Documenting the Earliest Chinese Journals

    Directory of Open Access Journals (Sweden)

    Jian-zhong (Joe Zhou

    2001-10-01

    Full Text Available

    Pages: 19-24

    According to various authoritative sources, the English word "journal" was first used in the 16th century, but the existence of the journal in its original meaning as a daily record can be traced back to the Acta Diurna (Daily Events) of ancient Roman cities as early as 59 B.C. This article documents the first appearance of Chinese daily records, which were much earlier than 59 B.C.

    The evidence for the earlier Chinese daily records came from important archaeological discoveries in the 1970s, but they were also documented by Sima Qian (145 B.C. - 85 B.C.), the grand historian of the Han Dynasty imperial court. Sima's lifetime contribution was the publication of Shi Ji (史記, The Grand Scribe's Records; the Records hereafter). The Records is a book of history of grand scope. It encompasses all Chinese history from the 30th century B.C. through the end of the second century B.C. in 130 chapters and over 525,000 Chinese characters.

  15. Text Classification using Artificial Intelligence

    CERN Document Server

    Kamruzzaman, S M

    2010-01-01

    Text classification is the process of classifying documents into predefined categories based on their content. It is the automated assignment of natural language texts to predefined categories. Text classification is the primary requirement of text retrieval systems, which retrieve texts in response to a user query, and text understanding systems, which transform text in some way such as producing summaries, answering questions or extracting data. Existing supervised learning algorithms for classifying text need sufficient documents to learn accurately. This paper presents a new algorithm for text classification using an artificial intelligence technique that requires fewer documents for training. Instead of using words, word relations, i.e. association rules from these words, are used to derive a feature set from pre-classified text documents. The concept of the naïve Bayes classifier is then used on the derived features and finally only a single concept of a genetic algorithm has been added for final classification. A syste...
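
    The abstract describes deriving features from word associations rather than single words; a compressed sketch of that idea (frequent co-occurring word pairs as binary features for a naive Bayes classifier, with the genetic-algorithm stage omitted) might look as follows. The toy corpus is invented.

        from itertools import combinations
        from collections import Counter
        from sklearn.naive_bayes import BernoulliNB

        def frequent_pairs(docs, min_support=2):
            """Mine word pairs co-occurring in at least min_support documents,
            a simplified stand-in for full association-rule mining."""
            counts = Counter()
            for doc in docs:
                counts.update(combinations(sorted(set(doc.lower().split())), 2))
            return [p for p, c in counts.items() if c >= min_support]

        def to_features(docs, pairs):
            rows = []
            for doc in docs:
                words = set(doc.lower().split())
                rows.append([int(a in words and b in words) for a, b in pairs])
            return rows

        # Invented toy corpus with two classes.
        docs = ["cricket bat and ball", "cricket ball game",
                "stock market prices", "market prices fall"]
        labels = ["sport", "sport", "finance", "finance"]
        pairs = frequent_pairs(docs)
        clf = BernoulliNB().fit(to_features(docs, pairs), labels)
        print(clf.predict(to_features(["cricket ball match"], pairs)))  # ['sport']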

  16. Text Classification using Data Mining

    CERN Document Server

    Kamruzzaman, S M; Hasan, Ahmed Ryadh

    2010-01-01

    Text classification is the process of classifying documents into predefined categories based on their content. It is the automated assignment of natural language texts to predefined categories. Text classification is the primary requirement of text retrieval systems, which retrieve texts in response to a user query, and text understanding systems, which transform text in some way such as producing summaries, answering questions or extracting data. Existing supervised learning algorithms to automatically classify text need sufficient documents to learn accurately. This paper presents a new algorithm for text classification using data mining that requires fewer documents for training. Instead of using words, word relations, i.e. association rules from these words, are used to derive a feature set from pre-classified text documents. The concept of the Naive Bayes classifier is then used on the derived features and finally only a single concept of the Genetic Algorithm is added for final classification. A system based on the...

  17. Centroid Based Text Clustering

    Directory of Open Access Journals (Sweden)

    Priti Maheshwari

    2010-09-01

    Full Text Available Web mining is a burgeoning new field that attempts to glean meaningful information from natural language text. Web mining refers generally to the process of extracting interesting information and knowledge from unstructured text. Text clustering is one of the important Web mining functionalities: the task in which texts are classified into groups of similar objects based on their contents. Current research in the area of Web mining tackles problems of text data representation, classification, clustering, information extraction, and the search for and modeling of hidden patterns. In this paper we propose that, for mining large document collections, it is necessary to pre-process the web documents and store the information in a data structure which is more appropriate for further processing than a plain web file. We developed a PHP/MySQL-based utility to convert unstructured web documents into a structured tabular representation by preprocessing and indexing. We apply a centroid-based web clustering method on the preprocessed data, using three methods for clustering. Finally we propose a method that can increase accuracy based on the clustering of documents.
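
    The paper's utility is PHP/MySQL-based; the clustering stage itself is language-independent, and a minimal sketch of centroid-based clustering over preprocessed documents, assuming scikit-learn, could read:

        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.cluster import KMeans

        docs = ["web mining extracts knowledge from text",
                "clustering groups similar documents",
                "text clustering is a web mining task",
                "databases store structured tables"]

        # Preprocessing and indexing stand-in: TF-IDF term weights per document.
        X = TfidfVectorizer(stop_words="english").fit_transform(docs)
        # Centroid-based clustering: each cluster is represented by its centroid.
        km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
        print(km.labels_)                 # cluster id assigned to each document
        print(km.cluster_centers_.shape)  # one centroid vector per cluster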

  18. Classification of Arabic Documents

    OpenAIRE

    Elbery, Ahmed

    2012-01-01

    Arabic is a very rich language with complex morphology, so its structure is very different from, and more difficult than, that of other languages. It is therefore important to build an Arabic Text Classifier (ATC) to deal with this complex language. The importance of text or document classification comes from its wide variety of application domains such as text indexing, document sorting, text filtering, and Web page categorization. Due to the immense amount of Arabic documents as well as the number of inter...

  19. EMOTION DETECTION FROM TEXT

    OpenAIRE

    Shiv Naresh Shivhare; Saritha Khethawat

    2012-01-01

    Emotion can be expressed in many observable ways, such as facial expressions and gestures, speech, and written text. Emotion detection in text documents is essentially a content-based classification problem involving concepts from the domains of Natural Language Processing and Machine Learning. In this paper, emotion recognition based on textual data and the techniques used in emotion detection are discussed.

  20. Emotion Detection from Text

    CERN Document Server

    Shivhare, Shiv Naresh

    2012-01-01

    Emotion can be expressed in many observable ways, such as facial expressions and gestures, speech, and written text. Emotion detection in text documents is essentially a content-based classification problem involving concepts from the domains of Natural Language Processing and Machine Learning. In this paper, emotion recognition based on textual data and the techniques used in emotion detection are discussed.

  1. Quality text editing

    Directory of Open Access Journals (Sweden)

    Gyöngyi Bujdosó

    2009-10-01

    Full Text Available Text editing is more than the knowledge of word processing techniques. Originally typographers, printers, text editors were the ones qualified to edit texts, which were well structured, legible, easily understandable, clear, and were able to emphasize the core of the text. Time has changed, and nowadays everyone has access to computers as well as to text editing software and most users believe that having these tools is enough to edit texts. However, text editing requires more skills. Texts appearing either in printed or in electronic form reveal that most of the users do not realize that they are not qualified to edit and publish their works. Analyzing the ‘text-products’ of the last decade a tendency can clearly be drawn. More and more documents appear, which instead of emphasizing the subject matter, are lost in the maze of unstructured text slices. Without further thoughts different font types, colors, sizes, strange arrangements of objects, etc. are applied. We present examples with the most common typographic and text editing errors. Our aim is to call the attention to these mistakes and persuade users to spend time to educate themselves in text editing. They have to realize that a well-structured text is able to strengthen the effect on the reader, thus the original message will reach the target group.

  2. Mining the Text: 34 Text Features that Can Ease or Obstruct Text Comprehension and Use

    Science.gov (United States)

    White, Sheida

    2012-01-01

    This article presents 34 characteristics of texts and tasks ("text features") that can make continuous (prose), noncontinuous (document), and quantitative texts easier or more difficult for adolescents and adults to comprehend and use. The text features were identified by examining the assessment tasks and associated texts in the national…

  3. Exploiting Document Level Semantics in Document Clustering

    Directory of Open Access Journals (Sweden)

    Muhammad Rafi

    2016-06-01

    Full Text Available Document clustering is an unsupervised machine learning method that separates a large subject-heterogeneous collection (corpus) into smaller, more manageable, subject-homogeneous collections (clusters). Traditional methods of document clustering work around extracting textual features like terms, sequences, and phrases from documents. These features are independent of each other and do not capture the meaning behind the words in the clustering process. In order to perform semantically viable clustering, we believe that the problem of document clustering has two main components: (1) to represent the document in a form that inherently captures the semantics of the text, which may also help to reduce the dimensionality of the document; and (2) to define a similarity measure based on lexical, syntactic and semantic features such that it assigns higher numerical values to document pairs which have a higher syntactic and semantic relationship. In this paper, we propose a representation of documents by extracting three different types of features from a given document: lexical, syntactic and semantic. A meta-descriptor for each document is proposed using these three features: first lexical, then syntactic and lastly semantic. A document-to-document similarity matrix is produced where each entry contains a three-value vector for lexical, syntactic and semantic similarity. The main contributions of this research are: (i) a document-level descriptor using three different features for text (lexical, syntactic and semantic); (ii) a similarity function using these three; and (iii) a new candidate clustering algorithm using the three components of the similarity measure to guide the clustering process in a direction that produces semantically richer clusters. We performed an extensive series of experiments on standard text mining data sets with external clustering evaluations like F-Measure and Purity, and have obtained...
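
    The meta-descriptor itself is not specified in the abstract; a minimal sketch of the downstream step, assuming three precomputed document-pair similarity matrices (lexical, syntactic, semantic) that are combined to guide agglomerative clustering, could be:

        import numpy as np
        from scipy.cluster.hierarchy import linkage, fcluster
        from scipy.spatial.distance import squareform

        def cluster_with_three_similarities(sim_lex, sim_syn, sim_sem,
                                            weights=(0.4, 0.3, 0.3), k=2):
            """Combine lexical, syntactic and semantic similarity matrices and
            run average-link agglomerative clustering on the result."""
            w1, w2, w3 = weights
            sim = w1 * sim_lex + w2 * sim_syn + w3 * sim_sem
            dist = 1.0 - sim                      # turn similarity into distance
            np.fill_diagonal(dist, 0.0)
            Z = linkage(squareform(dist, checks=False), method="average")
            return fcluster(Z, t=k, criterion="maxclust")

        # Random symmetric matrices stand in for the three feature similarities.
        rng = np.random.default_rng(0)
        def rand_sim(n):
            a = rng.random((n, n)); s = (a + a.T) / 2; np.fill_diagonal(s, 1.0); return s
        print(cluster_with_three_similarities(rand_sim(4), rand_sim(4), rand_sim(4)))

    The weights are invented; the paper keeps the three values separate in each matrix entry, whereas this sketch collapses them for simplicity.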

  4. Automated document analysis system

    Science.gov (United States)

    Black, Jeffrey D.; Dietzel, Robert; Hartnett, David

    2002-08-01

    A software application has been developed to aid law enforcement and government intelligence gathering organizations in the translation and analysis of foreign language documents with potential intelligence content. The Automated Document Analysis System (ADAS) provides the capability to search (data or text mine) documents in English and the most commonly encountered foreign languages, including Arabic. Hardcopy documents are scanned by a high-speed scanner and are optical character recognized (OCR). Documents obtained in an electronic format bypass the OCR and are copied directly to a working directory. For translation and analysis, the script and the language of the documents are first determined. If the document is not in English, the document is machine translated to English. The documents are searched for keywords and key features in either the native language or translated English. The user can quickly review the document to determine if it has any intelligence content and whether detailed, verbatim human translation is required. The documents and document content are cataloged for potential future analysis. The system allows non-linguists to evaluate foreign language documents and allows for the quick analysis of a large quantity of documents. All document processing can be performed manually or automatically on a single document or a batch of documents.

  5. 2005-004-FA_HYPACK: Text files of the Wide Area Augmentation System (WAAS) navigation collected by the U.S. Geological Survey in Moultonborough Bay, Lake Winnipesaukee, New Hampshire in 2005 (Geographic, WGS 84, HYPACK ASCII Text Files)

    Data.gov (United States)

    U.S. Geological Survey, Department of the Interior — In freshwater bodies of New Hampshire, the most problematic aquatic invasive plant species is Myriophyllum heterophyllum or variable leaf water-milfoil. Once...

  6. Integrated Documents

    OpenAIRE

    Sawitzki, Günther

    2000-01-01

    An introduction to integrated documents in statistics. Integrated documents allow a seamless integration of interactive statistics and data analysis components in 'live' documents while keeping the full computational power needed for simulation or resampling.

  7. Discover Effective Pattern for Text Mining

    OpenAIRE

    Khade, A. D.; A. B. Karche

    2014-01-01

    Many data mining techniques have been discovered for finding useful patterns in documents such as text documents. However, how to effectively use and keep up to date the discovered patterns is still an open research task, especially in the domain of text mining. Text mining is the discovery of interesting knowledge (or features) in text documents. It is a challenging task to find appropriate knowledge (or features) in text documents to help users find what they exactly want...

  8. Arabic Short Text Compression

    Directory of Open Access Journals (Sweden)

    Eman Omer

    2010-01-01

    Full Text Available Problem statement: Text compression permits representing a document using less space. This is useful not only to save disk space, but more importantly, to save disk transfer and network transmission time. With the continuous increase in the number of Arabic short text messages sent by mobile phones, the use of a suitable compression scheme would allow users to use more characters than the default value specified by the provider. The development of an efficient compression scheme to compress short Arabic texts is not a straightforward task. Approach: This study combined the benefits of pre-processing, entropy reduction through splitting files, and hybrid dynamic coding, a new technique proposed in this study that exploits the fact that Arabic text has single-case letters. Experimental tests were performed on short Arabic texts, and a comparison with the well-known plain Huffman compression was made to measure the performance of the proposed scheme for Arabic short text. Results: The proposed scheme can achieve a compression ratio of around 4.6 bits per byte for very short Arabic text sequences of 15 bytes and around 4 bits per byte for 50-byte text sequences, using only 8 Kbytes of memory overhead. Conclusion: Furthermore, a reasonable compression ratio can be achieved using less than 0.4 KB of memory overhead. We recommend the proposed scheme for compressing short Arabic texts on resource-limited devices.
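
    The hybrid dynamic coding itself is not specified in the abstract; as a point of reference, the plain Huffman baseline it is compared against can be measured in bits per byte with a short sketch like the following.

        import heapq
        from collections import Counter

        def huffman_code_lengths(text):
            """Build a Huffman tree over character frequencies and return the
            code length (in bits) assigned to each character."""
            freq = Counter(text)
            if len(freq) == 1:                    # degenerate single-symbol case
                return {next(iter(freq)): 1}
            # Heap entries: (frequency, tiebreaker, {char: depth-so-far}).
            heap = [(f, i, {c: 0}) for i, (c, f) in enumerate(freq.items())]
            heapq.heapify(heap)
            count = len(heap)
            while len(heap) > 1:
                f1, _, d1 = heapq.heappop(heap)
                f2, _, d2 = heapq.heappop(heap)
                merged = {c: l + 1 for c, l in {**d1, **d2}.items()}
                heapq.heappush(heap, (f1 + f2, count, merged))
                count += 1
            return heap[0][2]

        def bits_per_byte(text):
            lengths = huffman_code_lengths(text)
            freq = Counter(text)
            total_bits = sum(freq[c] * lengths[c] for c in freq)
            return total_bits / len(text)

        print(round(bits_per_byte("this is a short text message"), 2))

    Note that this measures only code length, ignoring the overhead of transmitting the code table, which matters greatly for 15- to 50-byte messages.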

  9. Text Mining for Neuroscience

    Science.gov (United States)

    Tirupattur, Naveen; Lapish, Christopher C.; Mukhopadhyay, Snehasis

    2011-06-01

    Text mining, sometimes alternately referred to as text analytics, refers to the process of extracting high-quality knowledge from the analysis of textual data. Text mining has a wide variety of applications in areas such as biomedical science, news analysis, and homeland security. In this paper, we describe an approach and some relatively small-scale experiments which apply text mining to neuroscience research literature to find novel associations among a diverse set of entities. Neuroscience is a discipline which encompasses an exceptionally wide range of experimental approaches and rapidly growing interest. This combination results in an overwhelmingly large and often diffuse literature which makes a comprehensive synthesis difficult. Understanding the relations or associations among the entities appearing in the literature not only improves researchers' current understanding of recent advances in their field, but also provides an important computational tool to formulate novel hypotheses and thereby assist in scientific discoveries. We describe a methodology to automatically mine the literature and form novel associations through direct analysis of published texts. The method first retrieves a set of documents from databases such as PubMed using a set of relevant domain terms. In the current study these terms yielded document sets ranging from 160,909 to 367,214 documents. Each document is then represented in a numerical vector form, from which an Association Graph is computed that represents relationships between all pairs of domain terms, based on co-occurrence. Association graphs can then be subjected to various graph-theoretic algorithms such as transitive closure and cycle (circuit) detection to derive additional information, and can also be visually presented to a human researcher for understanding. In this paper, we present three relatively small-scale problem-specific case studies to demonstrate that such an approach is very successful in
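
    As a minimal illustration of the association-graph step, the sketch below counts, for a given term list, how many documents mention each pair of terms; the tiny corpus is invented.

        from itertools import combinations
        from collections import Counter

        def association_graph(documents, terms, min_cooccur=1):
            """Build an association graph whose edge weights count how many
            documents mention both terms of a pair (co-occurrence)."""
            terms = [t.lower() for t in terms]
            edges = Counter()
            for doc in documents:
                words = set(doc.lower().split())
                present = [t for t in terms if t in words]
                edges.update(combinations(sorted(present), 2))
            return {pair: w for pair, w in edges.items() if w >= min_cooccur}

        docs = ["dopamine neurons fire during reward",
                "reward learning involves dopamine",
                "cortex activity during learning"]
        print(association_graph(docs, ["dopamine", "reward", "learning", "cortex"]))
        # e.g. {('dopamine', 'reward'): 2, ('dopamine', 'learning'): 1, ...}

    Graph algorithms such as transitive closure would then run over this edge dictionary; multi-word domain terms would need phrase matching rather than the whitespace split used here.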

  10. Clustering Text Data Streams

    Institute of Scientific and Technical Information of China (English)

    Yu-Bao Liu; Jia-Rong Cai; Jian Yin; Ada Wai-Chee Fu

    2008-01-01

    Clustering text data streams is an important issue in the data mining community and has a number of applications such as news group filtering, text crawling, document organization, and topic detection and tracking. However, most methods are similarity-based approaches and only use the TF-IDF scheme to represent the semantics of text data, which often leads to poor clustering quality. Recently, researchers have argued that the semantic smoothing model is more efficient than the existing TF-IDF scheme for improving text clustering quality. However, the existing semantic smoothing model is not suitable for the dynamic text data context. In this paper, we first extend the semantic smoothing model to the text data stream context. Based on the extended model, we then present two online clustering algorithms, OCTS and OCTSM, for the clustering of massive text data streams. In both algorithms, we also present a new cluster statistics structure named the cluster profile, which can capture the semantics of text data streams dynamically and at the same time speed up the clustering process. Some efficient implementations of our algorithms are also given. Finally, we present a series of experimental results illustrating the effectiveness of our technique.
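
    OCTS/OCTSM and the semantic smoothing details are beyond an abstract; a bare-bones sketch of the single-pass idea, with a cluster profile kept as a running vector sum and count (numpy assumed, plain vectors standing in for smoothed document models), might read:

        import numpy as np

        class ClusterProfile:
            """Running statistics for one cluster: vector sum and member count."""
            def __init__(self, vec):
                self.sum, self.n = vec.astype(float).copy(), 1
            def centroid(self):
                return self.sum / self.n
            def add(self, vec):
                self.sum += vec; self.n += 1

        def online_cluster(stream, threshold=0.5):
            """Single-pass clustering: assign each arriving vector to the most
            similar profile, or open a new cluster if none is similar enough."""
            profiles, labels = [], []
            for vec in stream:
                sims = [float(vec @ p.centroid()) /
                        (np.linalg.norm(vec) * np.linalg.norm(p.centroid()) + 1e-12)
                        for p in profiles]
                best = int(np.argmax(sims)) if sims else -1
                if best >= 0 and sims[best] >= threshold:
                    profiles[best].add(vec); labels.append(best)
                else:
                    profiles.append(ClusterProfile(vec)); labels.append(len(profiles) - 1)
            return labels

        print(online_cluster([np.array([1., 0.]), np.array([0.9, 0.1]),
                              np.array([0., 1.])]))   # [0, 0, 1]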

  11. Combined image and text relational database features and implications

    International Nuclear Information System (INIS)

    A relational database has been developed that allows comparison of patient information, imaging signs, and diagnosis with surgical pathology and the actual images. This allows comparison of the diagnostic accuracy of CT, MR imaging, US, and digital subtraction angiography with pathologic, cytologic, and surgical findings. This is a significant research and quality-assurance tool. This system is unique because of its ability to have a totally relational database linking binary image files with ASCII text. Previously this has not been possible at 8-bit pixel depth because of the lack of interface modules

  12. Text Association Analysis and Ambiguity in Text Mining

    Science.gov (United States)

    Bhonde, S. B.; Paikrao, R. L.; Rahane, K. U.

    2010-11-01

    Text Mining is the process of analyzing a semantically rich document or set of documents to understand the content and meaning of the information they contain. Research in Text Mining will enhance humans' ability to process massive quantities of information, and it has high commercial value. The paper first gives an introduction to TM and its definition, and then gives an overview of the process of text mining and its applications. Up to now, not much research in text mining, especially in concept/entity extraction, has focused on the ambiguity problem. This paper addresses ambiguity issues in natural language texts, and presents a new technique for resolving the ambiguity problem in extracting concepts/entities from texts. In the end, it shows the importance of TM in knowledge discovery and highlights the upcoming challenges of document mining and the opportunities it offers.

  13. Documenting localities

    CERN Document Server

    Cox, Richard J

    1996-01-01

    Now in paperback! Documenting Localities is the first effort to summarize the past decade of renewed discussion about archival appraisal theory and methodology and to provide a practical guide for the documentation of localities.This book discusses the continuing importance of the locality in American historical research and archival practice, traditional methods archivists have used to document localities, and case studies in documenting localities. These chapters draw on a wide range of writings from archivists, historians, material culture specialists, historic preservationists

  14. IMPROVED TEXT CLUSTERING WITH NEIGHBORS

    Directory of Open Access Journals (Sweden)

    Sri Lalitha Y

    2015-03-01

    Full Text Available With the ever increasing number of documents on the web and in other repositories, the task of organizing and categorizing these documents to the diverse needs of users by manual means is a complicated job; hence a machine learning technique named clustering is very useful. Text documents are clustered by pairwise similarity of documents with similarity measures like Cosine, Jaccard or Pearson. The best clustering results are seen when the overlapping of terms in documents is low, that is, when clusters are distinguishable. Hence, to find document similarity we apply the link and neighbor notions introduced in ROCK. A link specifies the number of shared neighbors of a pair of documents, and significantly similar documents are called neighbors. This work applies links and neighbors to Bisecting K-means clustering for identifying seed documents in the dataset, as a heuristic measure for choosing a cluster to be partitioned, and as a means to find the number of partitions possible in the dataset. Our experiments on real-time datasets showed a significant improvement in terms of accuracy with minimum time.
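
    A small sketch of the link and neighbor computation (in the ROCK sense), assuming a precomputed pairwise similarity matrix:

        import numpy as np

        def links_and_neighbors(sim, theta=0.5):
            """Two documents are neighbors if their similarity is at least theta;
            link(a, b) counts the neighbors the pair shares (as in ROCK)."""
            neighbors = (sim >= theta).astype(int)
            np.fill_diagonal(neighbors, 0)     # a document is not its own neighbor
            links = neighbors @ neighbors      # shared-neighbor counts per pair
            return neighbors, links

        sim = np.array([[1.0, 0.8, 0.6, 0.1],
                        [0.8, 1.0, 0.7, 0.2],
                        [0.6, 0.7, 1.0, 0.3],
                        [0.1, 0.2, 0.3, 1.0]])
        neighbors, links = links_and_neighbors(sim, theta=0.5)
        print(links[0, 1])   # 1: document 2 is the shared neighbor of 0 and 1

    How the link counts then steer seed selection inside Bisecting K-means is the paper's contribution and is not reproduced here.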

  15. A Survey on Text Mining in Clustering

    Directory of Open Access Journals (Sweden)

    S.Logeswari

    2011-02-01

    Full Text Available Text mining has important applications in the areas of data mining and information retrieval. One of the important tasks in text mining is document clustering. Many existing document clustering techniques use the bag-of-words model to represent the content of a document, which is only effective for grouping related documents when those documents share a large proportion of lexically equivalent terms. Synonymy between related documents is ignored, which reduces the effectiveness of applications using a standard full-text document representation. This paper focuses on the various techniques that are used to cluster text documents based on keywords, phrases and concepts. It also includes the different performance measures that are used to evaluate the quality of clusters.

  16. Arabic Text Mining Using Rule Based Classification

    OpenAIRE

    Fadi Thabtah; Omar Gharaibeh; Rashid Al-Zubaidy

    2012-01-01

    A well-known classification problem in the domain of text mining is text classification, which concerns mapping textual documents into one or more predefined categories based on their content. The text classification arena has recently attracted many researchers because of the massive amounts of online documents and text archives which hold essential information for decision-making processes. In this field, most research focuses on classifying English documents, while there are limited studi...

  17. Termination Documentation

    Science.gov (United States)

    Duncan, Mike; Hill, Jillian

    2014-01-01

    In this study, we examined 11 workplaces to determine how they handle termination documentation, an empirically unexplored area in technical communication and rhetoric. We found that the use of termination documentation is context dependent while following a basic pattern of infraction, investigation, intervention, and termination. Furthermore,…

  18. Learning Context for Text Categorization

    OpenAIRE

    Haribhakta, Y. V.; Parag Kulkarni

    2011-01-01

    This paper describes our work, which is based on discovering context for text document categorization. The document categorization approach is derived from a combination of a learning paradigm known as relation extraction and a technique known as context discovery. We demonstrate the effectiveness of our categorization approach using the Reuters-21578 dataset and synthetic real-world data from the sports domain. Our experimental results indicate that the learned context greatly improves t...

  19. Learning Context for Text Categorization

    CERN Document Server

    Haribhakta, Y V

    2011-01-01

    This paper describes our work, which is based on discovering context for text document categorization. The document categorization approach is derived from a combination of a learning paradigm known as relation extraction and a technique known as context discovery. We demonstrate the effectiveness of our categorization approach using the Reuters-21578 dataset and synthetic real-world data from the sports domain. Our experimental results indicate that the learned context greatly improves the categorization performance as compared to traditional categorization approaches.

  20. TEXT CATEGORIZATION USING Q-LEARNING ALGORITHM

    OpenAIRE

    Dr. S.R. Suresh; T. Karthikeyan; D.B. Shanmugam; J. Dhilipan

    2011-01-01

    This paper aims at the creation of an efficient document classification process using reinforcement learning, a branch of machine learning that concerns itself with optimal sequential decision-making. One strength of reinforcement learning is that it provides a formalism for measuring the utility of actions that give benefit only in the future. An effective and flexible classifier learning algorithm is provided, which classifies a set of text documents into a more specific domain like Cricket, Tenn...

  1. Typesafe Modeling in Text Mining

    OpenAIRE

    Steeg, Fabian

    2011-01-01

    Based on the concept of annotation-based agents, this report introduces tools and a formal notation for defining and running text mining experiments using a statically typed domain-specific language embedded in Scala. Using machine learning for classification as an example, the framework is used to develop and document text mining experiments, and to show how the concept of generic, typesafe annotation corresponds to a general information model that goes beyond text processing.

  2. Approaches to Automatic Text Structuring

    OpenAIRE

    Erbs, Nicolai

    2015-01-01

    Structured text helps readers to better understand the content of documents. In classic newspaper texts or books, some structure already exists. In the Web 2.0, the amount of textual data, especially user-generated data, has increased dramatically. As a result, there exists a large amount of textual data which lacks structure, thus making it more difficult to understand. In this thesis, we will explore techniques for automatic text structuring to help readers to fulfill their information need...

  3. Task specific image text recognition

    OpenAIRE

    Ben-Haim, Nadav

    2008-01-01

    This thesis addresses the problem of reading image text, which we define here as a digital image of machine printed text. Images of license plates, signs, and scanned documents fall into this category, whereas images of handwriting do not. Automatically reading image text is a very well researched problem, which falls into the broader category of Optical Character Recognition (OCR). Virtually all work in this domain begins by segmenting characters from the image and proceeds with a classifica...

  4. DM Documentation

    OpenAIRE

    Sick, Jonathan

    2016-01-01

    An overview of resources for the science community to learn about, and interact with, LSST Data Management. This talk highlights the LSST Community Forum, https://community.lsst.org, as well as Data Management Technical Notes and software documentation projects.

  5. Maury Documentation

    Data.gov (United States)

    National Oceanic and Atmospheric Administration, Department of Commerce — Supporting documentation for the Maury Collection of marine observations. Includes explanations from Maury himself, as well as guides and descriptions by the U.S....

  6. Short Text Classification: A Survey

    Directory of Open Access Journals (Sweden)

    Ge Song

    2014-05-01

    Full Text Available With the recent explosive growth of e-commerce and online communication, a new genre of text, short text, has been extensively applied in many areas, and much research focuses on short text mining. It is a challenge to classify short text owing to its inherent characteristics, such as sparseness, large scale, immediacy and non-standardization. It is difficult for traditional methods to deal with short text classification, mainly because the limited number of words in a short text cannot represent the feature space and the relationship between words and documents. Several studies and reviews on text classification have appeared in recent times; however, only a few focus on short text classification. This paper discusses the characteristics of short text and the difficulty of short text classification. Then we introduce the existing popular work on short text classifiers and models, including short text classification using semantic analysis, semi-supervised short text classification, ensemble short text classification, and real-time classification. The evaluation of short text classification is also analyzed. Finally we summarize the existing classification technology and prospects for development trends in short text classification.

  7. Al-Hadith Text Classifier

    OpenAIRE

    Mohammed Naji Al-Kabi; Ghassan Kanaan; Riyad Al-Shalabi; Saja I. Al- Sinjilawi; Ronza S. Al- Mustafa

    2005-01-01

    This study explores the implementation of a text classification method to classify the prophet Mohammed's (PBUH) hadiths (sayings) using the Sahih Al-Bukhari classification. The sayings explain the Holy Qur`an, which is considered by Muslims to be the direct word of Allah. The present method adopts TF-IDF (Term Frequency-Inverse Document Frequency), which is usually used for text search. TF-IDF was used for term weighting, in which document weights for the selected terms are computed, to classify non-vocali...

  8. A Survey of Unstructured Text Summarization Techniques

    Directory of Open Access Journals (Sweden)

    Sherif Elfayoumy

    2014-05-01

    Full Text Available Due to the explosive amounts of text data being created and organizations' increased desire to leverage their data corpora, especially with the availability of Big Data platforms, there is not usually enough time to read and understand each document and make decisions based on document contents. Hence, there is a great demand for summarizing text documents to provide a representative substitute for the original documents. By improving summarizing techniques, the precision of document retrieval through search queries against summarized documents is expected to improve in comparison to querying against the full spectrum of original documents. Several generic text summarization algorithms have been developed, each with its own advantages and disadvantages. For example, some algorithms are particularly good for summarizing short documents but not for long ones. Others perform well in identifying and summarizing single-topic documents but their precision degrades sharply with multi-topic documents. In this article we present a survey of the literature in text summarization. We also survey some of the most common evaluation methods for the quality of automated text summarization techniques. Lastly, we identify some of the challenging problems that are still open, in particular the need for a universal approach that yields good results for mixed types of documents.

  9. Identifying Patients with Depression Using Free-text Clinical Documents.

    Science.gov (United States)

    Zhou, Li; Baughman, Amy W; Lei, Victor J; Lai, Kenneth H; Navathe, Amol S; Chang, Frank; Sordo, Margarita; Topaz, Maxim; Zhong, Feiran; Murrali, Madhavan; Navathe, Shamkant; Rocha, Roberto A

    2015-01-01

    About 1 in 10 adults are reported to exhibit clinical depression and the associated personal, societal, and economic costs are significant. In this study, we applied the MTERMS NLP system and machine learning classification algorithms to identify patients with depression using discharge summaries. Domain experts reviewed both the training and test cases, and classified these cases as depression with a high, intermediate, and low confidence. For depression cases with high confidence, all of the algorithms we tested performed similarly, with MTERMS' knowledge-based decision tree slightly better than the machine learning classifiers, achieving an F-measure of 89.6%. MTERMS also achieved the highest F-measure (70.6%) on intermediate confidence cases. The RIPPER rule learner was the best performing machine learning method, with an F-measure of 70.0%, and a higher precision but lower recall than MTERMS. The proposed NLP-based approach was able to identify a significant portion of the depression cases (about 20%) that were not on the coded diagnosis list. PMID:26262127

  10. Working with text tools, techniques and approaches for text mining

    CERN Document Server

    Tourte, Gregory J L

    2016-01-01

    Text mining tools and technologies have long been a part of the repository world, where they have been applied to a variety of purposes, from pragmatic aims to support tools. Research areas as diverse as biology, chemistry, sociology and criminology have seen effective use made of text mining technologies. Working With Text collects a subset of the best contributions from the 'Working with text: Tools, techniques and approaches for text mining' workshop, alongside contributions from experts in the area. Text mining tools and technologies in support of academic research include supporting research on the basis of a large body of documents, facilitating access to and reuse of extant work, and bridging between the formal academic world and areas such as traditional and social media. Jisc have funded a number of projects, including NaCTem (the National Centre for Text Mining) and the ResDis programme. Contents are developed from workshop submissions and invited contributions, including: Legal considerations in te...

  11. Removing Manually-Generated Boilerplate from Electronic Texts: Experiments with Project Gutenberg e-Books

    CERN Document Server

    Kaser, Owen

    2007-01-01

    Collaborative work on unstructured or semi-structured documents, such as in literature corpora or source code, often involves agreed upon templates containing metadata. These templates are not consistent across users and over time. Rule-based parsing of these templates is expensive to maintain and tends to fail as new documents are added. Statistical techniques based on frequent occurrences have the potential to identify automatically a large fraction of the templates, thus reducing the burden on the programmers. We investigate the case of the Project Gutenberg corpus, where most documents are in ASCII format with preambles and epilogues that are often copied and pasted or manually typed. We show that a statistical approach can solve most cases though some documents require knowledge of English. We also survey various technical solutions that make our approach applicable to large data sets.
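
    The paper's statistical approach is more involved; a toy version of the underlying idea (lines that recur verbatim across many documents are probably template boilerplate rather than content) can be sketched as:

        from collections import Counter

        def find_boilerplate_lines(documents, min_fraction=0.5):
            """Normalized lines recurring in a large fraction of the documents
            are treated as template boilerplate."""
            counts = Counter()
            for doc in documents:
                counts.update({line.strip().lower()
                               for line in doc.splitlines() if line.strip()})
            cutoff = min_fraction * len(documents)
            return {line for line, c in counts.items() if c >= cutoff}

        def strip_boilerplate(doc, boilerplate):
            """Remove the detected boilerplate lines from one document."""
            return "\n".join(line for line in doc.splitlines()
                             if line.strip().lower() not in boilerplate)

    Exact line matching fails on manually retyped preambles with small variations, which is where the paper's frequency statistics (and, for some cases, knowledge of English) come in.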

  12. Interconnectedness und digitale Texte

    Directory of Open Access Journals (Sweden)

    Detlev Doherr

    2013-04-01

    Full Text Available Summary: The multimedia information services on the Internet are becoming ever larger and more comprehensive, and even documents that exist only in printed form are being digitized by libraries and put on the net. These documents can be found via online document management systems or search engines and then provided in common formats such as PDF. This article examines the workings of the Humboldt Digital Library (HDL), which for more than ten years has made documents by Alexander von Humboldt freely available on the Web in English translation. Unlike a conventional digital library, however, it does not merely provide digitized documents as scans or PDFs: the text as such is made available in interlinked form. The system thus resembles an information system more than a digital library, which is also reflected in the available functions for finding texts in different versions and translations, comparing paragraphs of different documents, or displaying images in their context. The development of dynamic hyperlinks based on the individual text paragraphs of Humboldt's works, in the form of media assets, enables the use of the Google Maps programming interface for geographic as well as content-based navigation. Going beyond the services of a digital library, the HDL offers the prototype of a multidimensional information system that works with dynamic structures and enables extensive thematic analyses and comparisons.

  13. System for Distributed Text Mining

    OpenAIRE

    Torgersen, Martin Nordseth

    2011-01-01

    Text mining presents us with new possibilities for the use of collections of documents. There exists a large amount of hidden, implicit information inside these collections, which text mining techniques may help us to uncover. Unfortunately, these techniques generally require large amounts of computational power. This is addressed by the introduction of distributed systems and methods for distributed processing, such as Hadoop and MapReduce. This thesis aims to describe, design, implement and ev...
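
    As a minimal illustration of the MapReduce style the thesis builds on (here simulated in a single Python process rather than on a Hadoop cluster):

        from collections import defaultdict
        from itertools import chain

        def mapper(doc):
            """Map phase: emit (word, 1) for every word in one document."""
            for word in doc.lower().split():
                yield word, 1

        def reducer(pairs):
            """Reduce phase: sum the counts emitted for each word."""
            counts = defaultdict(int)
            for word, n in pairs:
                counts[word] += n
            return dict(counts)

        docs = ["text mining at scale", "distributed text mining"]
        print(reducer(chain.from_iterable(mapper(d) for d in docs)))
        # {'text': 2, 'mining': 2, 'at': 1, 'scale': 1, 'distributed': 1}

    On a real cluster the map calls run on different machines and the shuffle groups keys before reduction; the single-process version only shows the programming model.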

  14. Segmentation of complex document

    Directory of Open Access Journals (Sweden)

    Souad Oudjemia

    2014-06-01

    Full Text Available In this paper we present a method for the segmentation of document images with complex structure. The technique, based on the GLCM (Grey Level Co-occurrence Matrix), is used to segment this type of document into three regions, namely 'graphics', 'background' and 'text'. Very briefly, the method divides the document image into blocks of a size chosen after a series of tests, and then applies the co-occurrence matrix to each block in order to extract five textural parameters: energy, entropy, sum entropy, difference entropy and standard deviation. These parameters are then used to classify the image into three regions using the k-means algorithm; the last step of the segmentation is obtained by grouping connected pixels. Two performance measurements are reported for the graphics and text zones; we obtained a classification rate of 98.3% and a misclassification rate of 1.79%.
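
    The five textural parameters can be computed per block; a sketch assuming scikit-image (whose GLCM function is graycomatrix in recent releases, greycomatrix in older ones) and a quantization to 8 grey levels, both choices being assumptions rather than the paper's settings:

        import numpy as np
        from skimage.feature import graycomatrix

        def block_texture_features(block, levels=8):
            """Energy, entropy, sum entropy, difference entropy and standard
            deviation of one image block, from its grey-level co-occurrence matrix."""
            q = (block.astype(float) / 256 * levels).astype(np.uint8)  # quantize
            glcm = graycomatrix(q, distances=[1], angles=[0], levels=levels,
                                symmetric=True, normed=True)[:, :, 0, 0]
            eps = 1e-12
            i, j = np.indices(glcm.shape)
            energy = np.sum(glcm ** 2)
            entropy = -np.sum(glcm * np.log2(glcm + eps))
            p_sum = np.array([glcm[i + j == k].sum() for k in range(2 * levels - 1)])
            p_diff = np.array([glcm[np.abs(i - j) == k].sum() for k in range(levels)])
            sum_entropy = -np.sum(p_sum * np.log2(p_sum + eps))
            diff_entropy = -np.sum(p_diff * np.log2(p_diff + eps))
            return energy, entropy, sum_entropy, diff_entropy, float(block.std())

    The five-value vector of every block would then be fed to k-means, with connected-pixel grouping as the final step, as the abstract describes.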

  15. Effective Term Based Text Clustering Algorithms

    OpenAIRE

    P. Ponmuthuramalingam,; T. Devi

    2010-01-01

    Text clustering methods can be used to group large sets of text documents. Most text clustering methods do not address problems such as the very high dimensionality of the data and the understandability of the clustering descriptions. In this paper, a frequent-term-based approach to clustering is introduced; it provides a natural way of reducing the large dimensionality of the document vector space. This approach is based on clustering the low-dimensionality frequent...

  16. Al-Hadith Text Classifier

    Directory of Open Access Journals (Sweden)

    Mohammed Naji Al-Kabi

    2005-01-01

    Full Text Available This study explores the implementation of a text classification method to classify the prophet Mohammed's (PBUH) hadiths (sayings) using the Sahih Al-Bukhari classification. The sayings explain the Holy Qur`an, which is considered by Muslims to be the direct word of Allah. The present method adopts TF-IDF (Term Frequency-Inverse Document Frequency), which is usually used for text search. TF-IDF was used for term weighting, in which document weights for the selected terms are computed, to classify non-vocalized sayings, after their terms (keywords) have been transformed to the corresponding canonical form (i.e., roots), to one of eight Books (classes), according to the Al-Bukhari classification. A term has a higher weight if it is a good descriptor for a particular book, i.e., it appears frequently in the book but is infrequent in the entire corpus.
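
    The weighting itself is the standard TF-IDF scheme; a from-scratch sketch over documents already reduced to canonical roots (the toy "books" below are invented) might read:

        import math
        from collections import Counter

        def tfidf_weights(docs):
            """Plain TF-IDF: a term weighs more in a book where it is frequent
            but appears in few books of the corpus overall."""
            n = len(docs)
            df = Counter()                       # document frequency per term
            for doc in docs:
                df.update(set(doc))
            weights = []
            for doc in docs:
                tf = Counter(doc)
                weights.append({t: (tf[t] / len(doc)) * math.log(n / df[t])
                                for t in tf})
            return weights

        # Each "book" is a list of canonical roots after stemming (invented data).
        books = [["prayer", "fast", "prayer"], ["trade", "debt", "trade"],
                 ["prayer", "trade"]]
        for w in tfidf_weights(books):
            print({t: round(v, 3) for t, v in w.items()})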

  17. Text Recognition from an Image

    Directory of Open Access Journals (Sweden)

    Shrinath Janvalkar

    2014-04-01

    Full Text Available To achieve high speed in data processing it is necessary to convert analog data into digital data. Storing a hard copy of any document occupies a large space, and retrieving information from that document is time-consuming. An optical character recognition system is an effective way to recognize printed characters. It provides an easy way to recognize the printed text in an image and convert it into editable text, and it also increases the speed of data retrieval from the image. An image which contains characters can be scanned through a scanner, and the recognition engine of the OCR system then interprets the image and converts the images of printed characters into machine-readable characters [8], improving the interface between man and machine in many applications.

  18. Documenting Spreadsheets

    CERN Document Server

    Payette, Raymond

    2008-01-01

    This paper discusses spreadsheets documentation and new means to achieve this end by using Excel's built-in "Comment" function. By structuring comments, they can be used as an essential tool to fully explain spreadsheet. This will greatly facilitate spreadsheet change control, risk management and auditing. It will fill a crucial gap in corporate governance by adding essential information that can be managed in order to satisfy internal controls and accountability standards.

  19. CMS DOCUMENTATION

    CERN Multimedia

    CMS TALKS AT MAJOR MEETINGS The agenda and talks from major CMS meetings can now be electronically accessed from the iCMS Web site. The following items can be found on: http://cms.cern.ch/iCMS/ General - CMS Weeks (Collaboration Meetings), CMS Weeks Agendas The talks presented at the Plenary Sessions. LHC Symposiums Management - CB - MB - FB - FMC Agendas and minutes are accessible to CMS members through their AFS account (ZH). However some linked documents are restricted to the Board Members. FB documents are only accessible to FB members. LHCC The talks presented at the ‘CMS Meetings with LHCC Referees’ are available on request from the PM or MB Country Representative. Annual Reviews The talks presented at the 2006 Annual reviews are posted. CMS DOCUMENTS It is considered useful to establish information on the first employment of CMS doctoral students upon completion of their theses. Therefore it is requested that Ph.D students inform the CMS Secretariat about the na...

  20. CMS DOCUMENTATION

    CERN Multimedia

    CMS TALKS AT MAJOR MEETINGS The agenda and talks from major CMS meetings can now be electronically accessed from the iCMS Web site. The following items can be found on: http://cms.cern.ch/iCMS/ Management- CMS Weeks (Collaboration Meetings), CMS Weeks Agendas The talks presented at the Plenary Sessions. Management - CB - MB - FB Agendas and minutes are accessible to CMS members through their AFS account (ZH). However some linked documents are restricted to the Board Members. FB documents are only accessible to FB members. LHCC The talks presented at the ‘CMS Meetings with LHCC Referees’ are available on request from the PM or MB Country Representative. Annual Reviews The talks presented at the 2007 Annual reviews are posted. CMS DOCUMENTS It is considered useful to establish information on the first employment of CMS doctoral students upon completion of their theses. Therefore it is requested that Ph.D students inform the CMS Secretariat about the nature of employment and ...

  1. CMS DOCUMENTATION

    CERN Multimedia

    CMS TALKS AT MAJOR MEETINGS The agenda and talks from major CMS meetings can now be electronically accessed from the iCMS Web site. The following items can be found on: http://cms.cern.ch/iCMS/ General - CMS Weeks (Collaboration Meetings), CMS Weeks Agendas The talks presented at the Plenary Sessions. LHC Symposiums Management - CB - MB - FB - FMC Agendas and minutes are accessible to CMS members through their AFS account (ZH). However some linked documents are restricted to the Board Members. FB documents are only accessible to FB members. LHCC The talks presented at the ‘CMS Meetings with LHCC Referees’ are available on request from the PM or MB Country Representative. Annual Reviews The talks presented at the 2006 Annual reviews are posted. CMS DOCUMENTS It is considered useful to establish information on the first employment of CMS doctoral students upon completion of their theses. Therefore it is requested that Ph.D students inform the CMS Secretariat about the natu...

  2. CMS DOCUMENTATION

    CERN Document Server

    CMS TALKS AT MAJOR MEETINGS The agenda and talks from major CMS meetings can now be electronically accessed from the iCMS Web site. The following items can be found on: http://cms.cern.ch/iCMS/ General - CMS Weeks (Collaboration Meetings), CMS Weeks Agendas The talks presented at the Plenary Sessions. LHC Symposiums Management - CB - MB - FB - FMC Agendas and minutes are accessible to CMS members through their AFS account (ZH). However some linked documents are restricted to the Board Members. FB documents are only accessible to FB members. LHCC The talks presented at the ‘CMS Meetings with LHCC Referees’ are available on request from the PM or MB Country Representative. Annual Reviews The talks presented at the 2006 Annual reviews are posted. CMS DOCUMENTS It is considered useful to establish information on the first employment of CMS doctoral students upon completion of their theses. Therefore it is requested that Ph.D students inform the CMS Secretariat about the natur...

  3. CMS DOCUMENTATION

    CERN Multimedia

    CMS TALKS AT MAJOR MEETINGS The agenda and talks from major CMS meetings can now be electronically accessed from the iCMS Web site. The following items can be found on: http://cms.cern.ch/iCMS/ General - CMS Weeks (Collaboration Meetings), CMS Weeks Agendas The talks presented at the Plenary Sessions. LHC Symposiums Management - CB - MB - FB - FMC Agendas and minutes are accessible to CMS members through their AFS account (ZH). However some linked documents are restricted to the Board Members. FB documents are only accessible to FB members. LHCC The talks presented at the ‘CMS Meetings with LHCC Referees’ are available on request from the PM or MB Country Representative. Annual Reviews The talks presented at the 2006 Annual reviews are posted.   CMS DOCUMENTS It is considered useful to establish information on the first employment of CMS doctoral students upon completion of their theses. Therefore it is requested that Ph.D students inform the CMS Secretariat a...

  4. CMS DOCUMENTATION

    CERN Multimedia

    CMS TALKS AT MAJOR MEETINGS The agenda and talks from major CMS meetings can now be electronically accessed from the iCMS Web site. The following items can be found on: http://cms.cern.ch/iCMS/ Management- CMS Weeks (Collaboration Meetings), CMS Weeks Agendas The talks presented at the Plenary Sessions. Management - CB - MB - FB Agendas and minutes are accessible to CMS members through their AFS account (ZH). However some linked documents are restricted to the Board Members. FB documents are only accessible to FB members. LHCC The talks presented at the ‘CMS Meetings with LHCC Referees’ are available on request from the PM or MB Country Representative. Annual Reviews The talks presented at the 2007 Annual reviews are posted. CMS DOCUMENTS It is considered useful to establish information on the first employment of CMS doctoral students upon completion of their theses. Therefore it is requested that Ph.D students inform the CMS Secretariat about the nature of empl...

  5. CNEA's quality system documentation

    International Nuclear Information System (INIS)

    Full text: To obtain an effective and coherent documentation system suitable for CNEA's Quality Management Program, we decided to organize CNEA's quality documentation as: a- Level 1: Quality manual. b- Level 2: Procedures. c- Level 3: Quality plans. d- Level 4: Instructions. e- Level 5: Records and other documents. The objective of this work is to present a standardization of the documentation of CNEA's quality system for facilities, laboratories, services, and R and D activities. Considering the diversity of criteria and formats for elaborating the documentation in different departments, and since ultimately each of them generally includes the same quality management policy, we proposed the elaboration of a system in order to improve the documentation, avoiding unnecessary time wasting and costs. This will allow each sector to focus on its specific documentation. The quality manuals of the atomic centers fulfill rule 3.6.1 of the Nuclear Regulatory Authority and Safety Series 50-C/SG-Q of the International Atomic Energy Agency. They are designed by groups of competent and highly trained people from different departments. The normative procedures are elaborated with the same methodology as the quality manuals. The quality plans, which describe the organizational structure of the working groups and the appropriate documentation, will assess the quality manuals of the facilities, laboratories, services, and research and development activities of the atomic centers. Responsibility for approval of the normative documentation is assigned to the management in charge of the administration of economic and human resources, in order to fulfill the institutional objectives. Another improvement, aimed at eliminating unnecessary processes, is the inclusion of all the quality system's normative documentation on the CNEA intranet. (author)

  6. Text Analytics to Data Warehousing

    Directory of Open Access Journals (Sweden)

    Kalli Srinivasa Nageswara Prasad

    2010-09-01

    Full Text Available Information hidden or stored in unstructured data can play a critical role in making decisions, understanding and conducting other business functions. Integrating data stored in both structured and unstructured formats can add significant value to an organization. With the extent of development happening in Text Mining and in technologies to deal with unstructured and semi-structured data, like XML and MML (Mining Markup Language), to extract and analyze data, text analytics has evolved to handle unstructured data and helps unlock and predict business results via Business Intelligence and Data Warehousing. Text mining involves dealing with texts in documents and discovering hidden patterns, but Text Analytics enhances Information Retrieval in the form of search, enabling clustering of results; moreover, Text Analytics is text mining plus visualization. In this paper we discuss handling unstructured data that resides in documents so that it fits into business applications like Data Warehouses for further analysis, and the framework we have used for the solution.
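
    The paper's framework is only outlined in the abstract; as a toy illustration of moving text mining output into a warehouse-style structured table (SQLite standing in for a real warehouse, with invented facts), one might write:

        import sqlite3

        conn = sqlite3.connect(":memory:")       # stand-in for a warehouse table
        conn.execute("CREATE TABLE doc_facts (doc_id TEXT, term TEXT, tf REAL)")
        # Term frequencies as produced by an upstream text mining step (invented).
        facts = [("d1", "revenue", 0.12), ("d1", "risk", 0.08),
                 ("d2", "merger", 0.15)]
        conn.executemany("INSERT INTO doc_facts VALUES (?, ?, ?)", facts)
        for row in conn.execute("SELECT term, COUNT(*) FROM doc_facts GROUP BY term"):
            print(row)

    Once in tabular form, the extracted text features can be joined with structured business data and queried by standard BI tooling.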

  7. Secure Copier Which Allows Reuse Copied Documents with Sorting Capability in Accordance with Document Types

    Directory of Open Access Journals (Sweden)

    Kohei Arai

    2013-09-01

    Full Text Available A secure copy machine is proposed that allows the reuse of copied documents, with sorting capability in accordance with document types. Through experiments with a variety of document types, it is found that copied documents can be securely shared and stored in a database in accordance with automatically classified document types. The copied documents are protected by data hiding based on wavelet Multi-Resolution Analysis (MRA).

  8. Open architecture for multilingual parallel texts

    CERN Document Server

    Benitez, M T Carrasco

    2008-01-01

    Multilingual parallel texts (abbreviated to parallel texts) are linguistic versions of the same content ("translations"); e.g., the Maastricht Treaty in English and Spanish are parallel texts. This document is about creating an open architecture for the whole Authoring, Translation and Publishing Chain (ATP-chain) for the processing of parallel texts.

  9. ASCII Text File of the Original 1-m Bathymetry from National Oceanic and Atmospheric Administration (NOAA) Survey H11321 in Central Rhode Island Sound (H11321_1M_UTM19NAD83.TXT)

    Data.gov (United States)

    U.S. Geological Survey, Department of the Interior — The United States Geological Survey (USGS) is working cooperatively with the National Oceanic and Atmospheric Administration (NOAA) to interpret the surficial...

  10. MMI Diversity Based Text Summarization

    Directory of Open Access Journals (Sweden)

    Ladda Suanmali

    2009-03-01

    Full Text Available Searching for interesting information in a huge data collection is a tough job that frustrates information seekers. Automatic text summarization has come to facilitate this searching process. The selection of distinct ideas ("diversity") from the original document can produce an appropriate summary, and incorporating multiple means can help to find the diversity in the text. In this paper, we propose an approach to text summarization in which three evidences are employed (clustering, a binary tree and a diversity-based method) to help find the document's distinct ideas. The emphasis of our approach is on controlling redundancy in the summarized text. The role of clustering is very important, as some clustering algorithms perform better than others; therefore we conducted an experiment comparing two clustering algorithms (K-means and complete-linkage clustering) based on the performance of our method, and the results show that K-means performs better than complete linkage. In general, the experimental results show that our method performs well for text summarization compared with the benchmark methods used in this study.
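
    The full method (binary tree, MMI scoring) is not reproduced here; a reduced sketch of the clustering-for-diversity idea, assuming scikit-learn, selects one representative sentence per K-means cluster so the summary covers distinct ideas with little redundancy:

        import numpy as np
        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.cluster import KMeans

        def diverse_summary(sentences, k=2):
            """Cluster sentences and keep the one nearest each centroid."""
            X = TfidfVectorizer().fit_transform(sentences)
            km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
            chosen = []
            for c in range(k):
                idx = np.where(km.labels_ == c)[0]
                d = np.linalg.norm(X[idx].toarray() - km.cluster_centers_[c], axis=1)
                chosen.append(idx[int(np.argmin(d))])
            return [sentences[i] for i in sorted(chosen)]

        sents = ["Cats sleep all day.", "Dogs bark at night.",
                 "Cats nap in the sun.", "Dogs chase the mail."]
        print(diverse_summary(sents, k=2))   # one sentence per topic cluster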

  11. A Survey on Web Text Information Retrieval in Text Mining

    Directory of Open Access Journals (Sweden)

    Tapaswini Nayak

    2015-08-01

    Full Text Available In this study we analyze different techniques for information retrieval in text mining, with the aim of identifying web text information retrieval. Text mining is much like analytics, being a process of deriving high-quality information from text. High-quality information is typically derived through the devising of patterns and trends by means such as statistical pattern learning. Typical text mining tasks include text categorization, text clustering, concept/entity extraction, creation of coarse taxonomies, sentiment analysis, document summarization and entity relation modeling. Text mining is used to mine hidden information from unstructured or semi-structured data. This capability is necessary because a large amount of Web information is semi-structured (due to the nested structure of HTML code), linked and redundant. Web content categorization with a content database is the most important tool for the efficient use of search engines. A customer requesting information on a particular subject or item would otherwise have to search through hundreds of results to find the most relevant information to his query. This step, through the use of text mining, reduces those hundreds of results, eliminating the aggravation and improving the navigation of information on the Web.

  12. Bengali text summarization by sentence extraction

    CERN Document Server

    Sarkar, Kamal

    2012-01-01

    Text summarization is a process that produces an abstract or a summary by selecting significant portions of the information from one or more texts. In an automatic text summarization process, a text is given to the computer and the computer returns a shorter, less redundant extract or abstract of the original text(s). Many techniques have been developed for summarizing English texts, but very few attempts have been made at Bengali text summarization. This paper presents a method for Bengali text summarization which extracts important sentences from a Bengali document to produce a summary.
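    As a rough, language-agnostic illustration of sentence extraction (the paper's Bengali-specific processing is not reproduced; scoring by average word frequency is an assumption):

```python
# Score each sentence by the average corpus frequency of its words and
# keep the top-scoring ones, preserving the original document order.
from collections import Counter

def extract_summary(sentences, top_n=2):
    freq = Counter(w.lower() for s in sentences for w in s.split())
    def score(s):
        toks = s.lower().split()
        return sum(freq[t] for t in toks) / max(len(toks), 1)
    keep = set(sorted(sentences, key=score, reverse=True)[:top_n])
    return [s for s in sentences if s in keep]
```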

  13. INFORMATION RETRIEVAL FOR SHORT DOCUMENTS

    Institute of Scientific and Technical Information of China (English)

    Qi Haoliang; Li Mu; Gao Jianfeng; Li Sheng

    2006-01-01

    The major problem of most current approaches to information retrieval models lies in the fact that individual words provide unreliable evidence about the content of texts. When the document is short, e.g. when only the abstract is available, this word-use variability problem has a substantial impact on Information Retrieval (IR) performance. To solve the problem, a new technique for short document retrieval named the Reference Document Model (RDM) is put forward in this letter. RDM obtains the statistical semantics of the query/document by pseudo feedback, for both the query and the document, from reference documents. The contributions of this model are three-fold: (1) pseudo feedback for both the query and the document; (2) building the query model and the document model from reference documents; (3) flexible indexing units, which can be any linguistic elements such as documents, paragraphs, sentences, n-grams, terms or characters. For short document retrieval, RDM achieves significant improvements over the classical probabilistic models on the task of ad hoc retrieval on Text REtrieval Conference (TREC) test sets. Results also show that the shorter the document, the better the RDM performance.
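    The pseudo-feedback step can be sketched as follows, assuming expansion terms are drawn from the centroid of the top-ranked reference documents (the helper and its weighting are illustrative, not RDM's exact formulation):

```python
# Expand a short query/document with salient terms from the reference
# documents it retrieves best (pseudo feedback on toy tf-idf vectors).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

def expand_with_feedback(text, reference_docs, top_docs=3, top_terms=5):
    vec = TfidfVectorizer(stop_words="english")
    D = vec.fit_transform(reference_docs)
    sims = cosine_similarity(vec.transform([text]), D).ravel()
    best = sims.argsort()[::-1][:top_docs]            # pseudo-relevant docs
    centroid = np.asarray(D[best].mean(axis=0)).ravel()
    terms = vec.get_feature_names_out()
    extra = terms[centroid.argsort()[::-1][:top_terms]]
    return text + " " + " ".join(extra)
```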

  14. Summit documents; Documents du sommet

    Energy Technology Data Exchange (ETDEWEB)

    NONE

    2003-07-01

    This document gathers three declarations about the non-proliferation of weapons of mass destruction, made by the G8 participants during their summit held in Evian (France): a declaration about the enforcement and respect of the non-proliferation measures implemented by the IAEA and by the conventions on chemical and biological weapons; a declaration about the protection of radioactive sources against diversion (regulatory control, inventory, control of source exports, etc.); and a guarantee of the security of radioactive sources (the G8 approach, support of the IAEA action, support of the most vulnerable states, control mechanisms, political commitment of states, and implementation of the recommendations of the international conference on the security and safety of radiation sources held in Vienna (Austria) in March 2003). (J.S.)

  15. CMS DOCUMENTATION

    CERN Multimedia

    CMS TALKS AT MAJOR MEETINGS The agenda and talks from major CMS meetings can now be electronically accessed from the ICMS Web site. The following items can be found on: http://cms.cern.ch/iCMS Management – CMS Weeks (Collaboration Meetings), CMS Weeks Agendas The talks presented at the Plenary Sessions. Management – CB – MB – FB Agendas and minutes are accessible to CMS members through Indico. LHCC The talks presented at the ‘CMS Meetings with LHCC Referees’ are available on request from the PM or MB Country Representative. Annual Reviews The talks presented at the 2008 Annual Reviews are posted in Indico. CMS DOCUMENTS It is considered useful to establish information on the first employment of CMS doctoral students upon completion of their theses. Therefore it is requested that Ph.D. students inform the CMS Secretariat about the nature of their employment and the name of their first employer. The Notes, Conference Reports and Theses published si...

  16. “Dreamers Often Lie”: On “Compromise”, the subversive documentation of an Israeli- Palestinian political adaptation of Shakespeare’s Romeo and Juliet

    Directory of Open Access Journals (Sweden)

    Yael Munk

    2012-07-01

    Full Text Available Is Romeo and Juliet relevant to a description of the Middle-East conflict? This is the question raised in Compromise, an Israeli documentary that follows the Jerusalem Khan Theater's production of the play in the mid-1990's. This paper describes how the cinematic documentation of a theatrical Shakespeare production can undermine the original intentions of its creators. This staging of the play was carefully planned in order to demonstrate to the country and the

  17. Contextual Text Mining

    Science.gov (United States)

    Mei, Qiaozhu

    2009-01-01

    With the dramatic growth of text information, there is an increasing need for powerful text mining systems that can automatically discover useful knowledge from text. Text is generally associated with all kinds of contextual information. Those contexts can be explicit, such as the time and the location where a blog article is written, and the…

  18. Effective Classification of Text

    OpenAIRE

    A Saritha; N NaveenKumar

    2014-01-01

    Text mining is the process of obtaining useful and interesting information from text. A huge amount of text data is available in various formats, and most of it is unstructured. Text mining usually involves structuring the input text (parsing it and inserting the results into a database), deriving patterns from the structured data, and finally evaluating and interpreting the output. There are several data mining techniques proposed for mi...

  19. Text Coherence in Translation

    OpenAIRE

    Yanping Zheng

    2009-01-01

    In the thesis a coherent text is defined as a continuity of senses of the outcome of combining concepts and relations into a network composed of knowledge space centered around main topics. And the author maintains that in order to obtain the coherence of a target language text from a source text during the process of translation, a translator can utilize the following approaches: retention of the continuity of senses of a text; reconstruction of the target text for the purpose of continuity;...

  20. Omega documentation

    Energy Technology Data Exchange (ETDEWEB)

    Howerton, R.J.; Dye, R.E.; Giles, P.C.; Kimlinger, J.R.; Perkins, S.T.; Plechaty, E.F.

    1983-08-01

    OMEGA is a CRAY I computer program that controls nine codes used by LLNL Physical Data Group for: 1) updating the libraries of evaluated data maintained by the group (UPDATE); 2) calculating average values of energy deposited in secondary particles and residual nuclei (ENDEP); 3) checking the libraries for internal consistency, especially for energy conservation (GAMCHK); 4) producing listings, indexes and plots of the library data (UTILITY); 5) producing calculational constants such as group averaged cross sections and transfer matrices for diffusion and Sn transport codes (CLYDE); 6) producing and updating standard files of the calculational constants used by LLNL Sn and diffusion transport codes (NDFL); 7) producing calculational constants for Monte Carlo transport codes that use group-averaged cross sections and continuous energy for particles (CTART); 8) producing and updating standard files used by the LLNL Monte Carlo transport codes (TRTL); and 9) producing standard files used by the LANL pointwise Monte Carlo transport code MCNP (MCPOINT). The first four of these functions and codes deal with the libraries of evaluated data and the last five with various aspects of producing calculational constants for use by transport codes. In 1970 a series, called PD memos, of internal and informal memoranda was begun. These were intended to be circulated among the group for comment and then to provide documentation for later reference whenever questions arose about the subject matter of the memos. They have served this purpose and now will be drawn upon as source material for this more comprehensive report that deals with most of the matters covered in those memos.

  1. Omega documentation

    International Nuclear Information System (INIS)

    OMEGA is a CRAY I computer program that controls nine codes used by LLNL Physical Data Group for: 1) updating the libraries of evaluated data maintained by the group (UPDATE); 2) calculating average values of energy deposited in secondary particles and residual nuclei (ENDEP); 3) checking the libraries for internal consistency, especially for energy conservation (GAMCHK); 4) producing listings, indexes and plots of the library data (UTILITY); 5) producing calculational constants such as group averaged cross sections and transfer matrices for diffusion and Sn transport codes (CLYDE); 6) producing and updating standard files of the calculational constants used by LLNL Sn and diffusion transport codes (NDFL); 7) producing calculational constants for Monte Carlo transport codes that use group-averaged cross sections and continuous energy for particles (CTART); 8) producing and updating standard files used by the LLNL Monte Carlo transport codes (TRTL); and 9) producing standard files used by the LANL pointwise Monte Carlo transport code MCNP (MCPOINT). The first four of these functions and codes deal with the libraries of evaluated data and the last five with various aspects of producing calculational constants for use by transport codes. In 1970 a series, called PD memos, of internal and informal memoranda was begun. These were intended to be circulated among the group for comment and then to provide documentation for later reference whenever questions arose about the subject matter of the memos. They have served this purpose and now will be drawn upon as source material for this more comprehensive report that deals with most of the matters covered in those memos

  2. Automatic text summarization

    CERN Document Server

    Torres Moreno, Juan Manuel

    2014-01-01

    This new textbook examines the motivations and the different algorithms for automatic document summarization (ADS). We present a recent state of the art. The book shows the main problems of ADS, the difficulties, and the solutions provided by the community. It presents recent advances in ADS, as well as current applications and trends. The approaches are statistical, linguistic and symbolic. Several examples are included in order to clarify the theoretical concepts. The books currently available in the area of Automatic Document Summarization are not recent. Powerful algorithms have been develop

  3. Scalable Text Mining with Sparse Generative Models

    OpenAIRE

    Puurula, Antti

    2016-01-01

    The information age has brought a deluge of data. Much of this is in text form, insurmountable in scope for humans and incomprehensible in structure for computers. Text mining is an expanding field of research that seeks to utilize the information contained in vast document collections. General data mining methods based on machine learning face challenges with the scale of text data, posing a need for scalable text mining methods. This thesis proposes a solution to scalable text mining: gener...

  4. English Metafunction Analysis in Chemistry Text: Characterization of Scientific Text

    Directory of Open Access Journals (Sweden)

    Ahmad Amin Dalimunte, M.Hum

    2013-09-01

    Full Text Available The objectives of this research are to identify which Metafunctions are applied in a chemistry text and how they characterize a scientific text. It was conducted by applying content analysis. The data for this research was a twelve-paragraph chemistry text, collected by applying a documentary technique: the document was read and analyzed to find the Metafunctions. The data were analyzed by several procedures: identifying the types of process, counting the processes, categorizing and counting the cohesion devices, classifying the types of modulation and determining modality value, and finally counting the sentences and clauses and scoring the grammatical intricacy index. The findings show that Material process (71 of 100) is the most used, and circumstance of spatial location (26 of 56) is more dominant than the others. Modality (5) is used sparingly in order to avoid subjectivity. Impersonality is implied through sparing use of reference, whether pronouns (7) or demonstratives (7); conjunctions (60) are applied to develop ideas; and the total number of clauses (109) is much more dominant than the total number of sentences (40), which results in a high grammatical intricacy index. The Metafunctions found indicate that the chemistry text fulfils the characteristics of a scientific or academic text, truly reflecting it as a natural science text.
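    For reference, the grammatical intricacy index quoted above follows directly from the reported counts, assuming the usual definition of clauses per sentence:

```python
# Grammatical intricacy index = clauses / sentences (assumed definition).
clauses, sentences = 109, 40
print(f"grammatical intricacy index = {clauses / sentences:.2f}")  # 2.72
```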

  5. Securing XML Documents

    Directory of Open Access Journals (Sweden)

    Charles Shoniregun

    2004-11-01

    Full Text Available XML (extensible markup language) is becoming the current standard for establishing interoperability on the Web. XML data are self-descriptive and syntax-extensible, which makes XML very suitable for the representation and exchange of semi-structured data and allows users to define new elements for their specific applications. As a result, the number of documents incorporating this standard is continuously increasing over the Web. The processing of XML documents may require a traversal of the whole document structure, so the cost can be very high. A strong demand for efficient and effective XML processing has posed a new challenge for the database world. This paper discusses a fast and efficient indexing technique for XML documents and introduces the XML graph numbering scheme, which can be used for indexing and securing the graph structure of XML documents. This technique provides an efficient method to speed up XML data processing. Furthermore, the paper explores the classification of existing methods, their impact on query processing, and indexing.
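    The record does not spell out the numbering scheme; a common graph-numbering approach for XML indexing assigns each node a (pre, post) pair so that ancestor tests reduce to two integer comparisons. A minimal sketch under that assumption:

```python
# Assign (preorder, postorder) numbers to every element of an XML tree;
# x is an ancestor of y iff pre[x] < pre[y] and post[x] > post[y].
import xml.etree.ElementTree as ET
from itertools import count

def number_nodes(root):
    pre, post, cpre, cpost = {}, {}, count(), count()
    def visit(node):
        pre[node] = next(cpre)
        for child in node:
            visit(child)
        post[node] = next(cpost)
    visit(root)
    return pre, post

root = ET.fromstring("<a><b><c/></b><d/></a>")
pre, post = number_nodes(root)
c = root.find("b").find("c")
print(pre[root] < pre[c] and post[root] > post[c])  # True: <a> is an ancestor of <c>
```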

  6. CLUSTERING-BASED ANALYSIS OF TEXT SIMILARITY

    OpenAIRE

    Bovcon , Borja

    2013-01-01

    The focus of this thesis is a comparison of text-document similarity analyses using clustering algorithms. We begin by defining the main problem; then we describe the two most widely used text-document representation techniques, presenting word-filtering methods and their importance, Porter's algorithm, and the tf-idf term-weighting algorithm. We then apply all previously described algorithms to selected datasets, which vary in size and compactness. Following this, we ...
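    The tf-idf weighting mentioned above can be stated compactly as w(t, d) = tf(t, d) * log(N / df(t)); a minimal hand-rolled sketch on toy documents (no word filtering or Porter stemming):

```python
# Hand-rolled tf-idf: term frequency times inverse document frequency.
import math
from collections import Counter

docs = [d.split() for d in ["the cat sat", "the dog sat", "cats chase dogs"]]
N = len(docs)
df = Counter(t for d in docs for t in set(d))          # document frequency

def tfidf(doc):
    tf = Counter(doc)
    return {t: tf[t] * math.log(N / df[t]) for t in tf}

print(tfidf(docs[0]))
```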

  7. Questioning the Text.

    Science.gov (United States)

    Harvey, Stephanie

    2001-01-01

    One way teachers can improve students' reading comprehension is to teach them to think while reading, questioning the text and carrying on an inner conversation. This involves: choosing the text for questioning; introducing the strategy to the class; modeling thinking aloud and marking the text with stick-on notes; and allowing time for guided…

  8. Text Coherence in Translation

    Science.gov (United States)

    Zheng, Yanping

    2009-01-01

    In the thesis a coherent text is defined as a continuity of senses of the outcome of combining concepts and relations into a network composed of knowledge space centered around main topics. And the author maintains that in order to obtain the coherence of a target language text from a source text during the process of translation, a translator can…

  9. Multilingual Topic Models for Unaligned Text

    CERN Document Server

    Boyd-Graber, Jordan

    2012-01-01

    We develop the multilingual topic model for unaligned text (MuTo), a probabilistic model of text that is designed to analyze corpora composed of documents in two languages. From these documents, MuTo uses stochastic EM to simultaneously discover both a matching between the languages and multilingual latent topics. We demonstrate that MuTo is able to find shared topics on real-world multilingual corpora, successfully pairing related documents across languages. MuTo provides a new framework for creating multilingual topic models without needing carefully curated parallel corpora and allows applications built using the topic model formalism to be applied to a much wider class of corpora.

  10. Vocabulary Constraint on Texts

    Directory of Open Access Journals (Sweden)

    C. Sutarsyah

    2008-01-01

    Full Text Available This case study was carried out in the English Education Department of the State University of Malang. The aim of the study was to identify and describe the vocabulary in the reading texts and to assess whether the texts are useful for reading skill development. A descriptive qualitative design was applied to obtain the data, and some available computer programs were used to describe the vocabulary in the texts. It was found that the 20 texts, containing 7,945 words, are dominated by low-frequency words, which account for 16.97% of the words in the texts. The high-frequency words occurring in the texts are dominated by function words. As for word levels, the texts have a very limited number of words from the GSL (General Service List of English Words; West, 1953): the proportion of the first 1,000 words of the GSL accounts for only 44.6%. The data also show that the texts contain too large a proportion of words outside the three levels (the first 2,000 words and the UWL); these words account for 26.44% of the running words in the texts. It is believed that the constraints are due to the selection of the texts, which are made up of a series of short, unrelated texts. This kind of text is subject to the accumulation of low-frequency words, especially content words, and a limited number of words from the GSL. It could also defeat the development of students' reading skills and vocabulary enrichment.
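    Coverage figures such as 44.6% are simple proportions of running words found in a given word list; a sketch with a hypothetical stand-in for the first 1,000 GSL words:

```python
# Share of running words covered by a word list (toy stand-in for the GSL).
def coverage(tokens, word_list):
    hits = sum(1 for t in tokens if t.lower() in word_list)
    return 100.0 * hits / len(tokens)

gsl_first_1000 = {"the", "of", "and", "to", "in", "is", "text"}  # hypothetical
tokens = "The vocabulary of the text is analysed in this study".split()
print(f"{coverage(tokens, gsl_first_1000):.1f}% of running words covered")
```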

  11. From Text to Knowledge

    OpenAIRE

    Bundschus, Markus

    2010-01-01

    The global information space provided by the World Wide Web has dramatically changed the way knowledge is shared all over the world. To make this unbelievably huge information space accessible, search engines index the uploaded contents and provide efficient algorithmic machinery for ranking the importance of documents with respect to an input query. All major search engines such as Google, Yahoo or Bing are keyword-based, which is indisputably a very powerful tool for accessin...

  12. Planning Argumentative Texts

    CERN Document Server

    Huang, X

    1994-01-01

    This paper presents \\proverb\\, a text planner for argumentative texts. \\proverb\\'s main feature is that it combines global hierarchical planning and unplanned organization of text with respect to local derivation relations in a complementary way. The former splits the task of presenting a particular proof into subtasks of presenting subproofs. The latter simulates how the next intermediate conclusion to be presented is chosen under the guidance of the local focus.

  13. Text Summarizing In Polish

    Directory of Open Access Journals (Sweden)

    Emilia Branny

    2005-01-01

    Full Text Available The aim of this article is to describe an existing implementation of a text summarizer for Polish, to analyze the results, and to propose possibilities for further development. The problem of text summarizing has already been addressed by science, but until now there has been no implementation designed for Polish. The implemented algorithm is based on existing developments in the field, but it also includes some improvements. It has been optimized for newspaper texts ranging from approx. 10 to 50 sentences. Evaluation has shown that it works better than known generic summarization tools when applied to Polish.

  14. Instant Sublime Text starter

    CERN Document Server

    Haughee, Eric

    2013-01-01

    A starter which teaches the basic tasks to be performed with Sublime Text with the necessary practical examples and screenshots. This book requires only basic knowledge of the Internet and basic familiarity with any one of the three major operating systems, Windows, Linux, or Mac OS X. However, as Sublime Text 2 is primarily a text editor for writing software, many of the topics discussed will be specifically relevant to software development. That being said, the Sublime Text 2 Starter is also suitable for someone without a programming background who may be looking to learn one of the tools of

  15. Mining text data

    CERN Document Server

    Aggarwal, Charu C

    2012-01-01

    Text mining applications have experienced tremendous advances because of web 2.0 and social networking applications. Recent advances in hardware and software technology have led to a number of unique scenarios where text mining algorithms are learned. "Mining Text Data" introduces an important niche in the text analytics field, and is an edited volume contributed by leading international researchers and practitioners focused on social networks & data mining. This book covers a wide swath of topics across social networks & data mining. Each chapter contains a comprehensive survey including

  16. TEXT CLASSIFICATION TOWARD A SCIENTIFIC FORUM

    Institute of Scientific and Technical Information of China (English)

    2007-01-01

    Text mining, also known as discovering knowledge from text, has emerged as a possible solution for the current information explosion; it refers to the process of extracting non-trivial and useful patterns from unstructured text. Among the general tasks of text mining such as text clustering, summarization, etc., text classification is a subtask of intelligent information processing which employs supervised learning to construct a classifier from pre-classified training text and then predicts the class of unlabeled text. Because of its simplicity and objectivity in performance evaluation, text classification is often used as a standard tool to determine the advantages or weaknesses of a text processing method, such as text representation or text feature selection. In this paper, text classification is carried out to classify Web documents collected from the XSSC Website (http://www.xssc.ac.cn). The performance of the support vector machine (SVM) and the back-propagation neural network (BPNN) is compared on this task. Specifically, binary text classification and multi-class text classification were conducted on the XSSC documents. Moreover, the classification results of both methods are combined to improve the accuracy of classification. The experiments show that BPNN can compete with SVM in binary text classification, but for multi-class text classification SVM performs much better. Furthermore, the classification is improved in both the binary and the multi-class setting with the combined method.
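    A minimal sketch of the SVM side of such a comparison, as a generic tf-idf plus linear-SVM pipeline on toy data (the XSSC documents and the BPNN setup are not reproduced):

```python
# Tf-idf features fed to a linear SVM for text classification.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

train_texts = ["protein folding simulation", "stock market analysis",
               "gene expression data", "bond yields and inflation"]
train_labels = ["science", "finance", "science", "finance"]

clf = make_pipeline(TfidfVectorizer(), LinearSVC())
clf.fit(train_texts, train_labels)
print(clf.predict(["protein expression study"]))   # expected: ['science']
```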

  17. Multilingual access to full text databases

    International Nuclear Information System (INIS)

    Many full text databases are available in only one language or, moreover, may contain documents in different languages. Even if the user is able to understand the language of the documents in the database, it may be easier for him to express his need in his own language. For databases containing documents in different languages, it is simpler to formulate the query in one language only and to retrieve documents in different languages. This paper presents the developments and the first experiments of multilingual search, applied to the French-English pair, for text data in the nuclear field, based on the SPIRIT system. After reviewing the general problems of searching full text databases with queries formulated in natural language, we present the methods used to reformulate the queries and show how they can be expanded for multilingual search. The first results on data in the nuclear field are presented (AFCEN norms and INIS abstracts). 4 refs

  18. PUNJABI TEXT CLUSTERING BY SENTENCE STRUCTURE ANALYSIS

    Directory of Open Access Journals (Sweden)

    Saurabh Sharma

    2012-10-01

    Full Text Available Punjabi text document clustering is done by analyzing the sentence structure of similar documents sharing the same topics and grouping them into clusters. The prevalent algorithms in this field utilize the vector space model, which treats documents as bags of words. Meaning in natural language inherently depends on word sequences, which are overlooked and ignored while clustering. The current paper presents a new Punjabi text clustering algorithm named Clustering by Sentence Structure Analysis (CSSA), which has been applied to 221 Punjabi news articles available on news sites. Phrases are extracted for processing by a meticulous analysis of sentence structure, applying the basic grammatical rules of Karaka. Sequences formed from phrases are used to find the topic and to find similarities among all documents, which results in the formation of meaningful clusters.

  19. Context Based Word Sense Extraction in Text

    Directory of Open Access Journals (Sweden)

    Ranjeetsingh S.Suryawanshi

    2011-11-01

    Full Text Available In the era of modern e-document technology, everyone uses computerized documents. With the huge number of text documents available as pdf, doc, txt, html and xml, users may be confused about the intended sense of a document when the same word expresses different senses. Word sense has always been an important problem in information retrieval and extraction, as well as in text mining, because machines do not have human-level intelligence for sensing a word in a particular context. Users want to determine which sense of a word is used in a given context. The sense inventory is usage-based, and part of it can be created automatically from an electronic dictionary. This paper describes word senses as expressed by WordNet synsets, arranged according to their relevance, with their contexts expressed by means of word association.
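    A small sketch of context-based sense lookup against WordNet synsets, using NLTK's Lesk implementation as a classic stand-in for the paper's word-association approach:

```python
# Pick the WordNet synset of "bank" that best overlaps the context (Lesk).
from nltk.corpus import wordnet as wn
from nltk.wsd import lesk

# first run may need: import nltk; nltk.download("wordnet")
context = "I went to the bank to deposit my money".split()
sense = lesk(context, "bank", "n")
print(sense.name(), "-", sense.definition())
for s in wn.synsets("bank", "n")[:3]:   # part of the sense inventory
    print(s.name(), "-", s.definition())
```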

  20. Automatic text categorisation of racist webpages

    OpenAIRE

    Greevy, Edel

    2004-01-01

    Automatic Text Categorisation (TC) involves the assignment of one or more predefined categories to text documents in order that they can be effectively managed. In this thesis we examine the possibility of applying automatic text categorisation to the problem of categorising texts (web pages) based on whether or not they are racist. TC has proven successful for topic-based problems such as news story categorisation. However, the problem of detecting racism is dissimilar to topic-based pro...

  1. A Survey on Preprocessing in Text Mining

    OpenAIRE

    Dr. Anadakumar. K; Ms. Padmavathy. V

    2013-01-01

    Nowadays, information is stored electronically in databases. Extracting reliable, previously unknown and useful information from this abundant source is an eminent task. Data mining and text mining are processes for extracting such unknown and useful information. Text mining is the process of extracting interesting and non-trivial patterns or knowledge from text documents. This paper presents the related activities and focuses on the preprocessing steps in text mining.
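    The preprocessing steps such surveys typically cover include tokenization, stop-word removal and stemming; a minimal NLTK-based sketch:

```python
# Tokenize, drop stop words, then stem with the Porter stemmer.
import re
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

# first run may need: import nltk; nltk.download("stopwords")
stop = set(stopwords.words("english"))
stem = PorterStemmer().stem

def preprocess(text):
    tokens = re.findall(r"[a-z]+", text.lower())
    return [stem(t) for t in tokens if t not in stop]

print(preprocess("The processes for extracting useful information from texts"))
```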

  2. GPU-Accelerated Text Mining

    Energy Technology Data Exchange (ETDEWEB)

    Cui, Xiaohui [ORNL; Mueller, Frank [North Carolina State University; Zhang, Yongpeng [ORNL; Potok, Thomas E [ORNL

    2009-01-01

    Accelerating hardware devices represent a novel promise for improving the performance for many problem domains but it is not clear for which domains what accelerators are suitable. While there is no room in general-purpose processor design to significantly increase the processor frequency, developers are instead resorting to multi-core chips duplicating conventional computing capabilities on a single die. Yet, accelerators offer more radical designs with a much higher level of parallelism and novel programming environments. This present work assesses the viability of text mining on CUDA. Text mining is one of the key concepts that has become prominent as an effective means to index the Internet, but its applications range beyond this scope and extend to providing document similarity metrics, the subject of this work. We have developed and optimized text search algorithms for GPUs to exploit their potential for massive data processing. We discuss the algorithmic challenges of parallelization for text search problems on GPUs and demonstrate the potential of these devices in experiments by reporting significant speedups. Our study may be one of the first to assess more complex text search problems for suitability for GPU devices, and it may also be one of the first to exploit and report on atomic instruction usage that have recently become available in NVIDIA devices.

  3. GPU-Accelerated Text Mining

    International Nuclear Information System (INIS)

    Accelerating hardware devices represent a novel promise for improving the performance for many problem domains but it is not clear for which domains what accelerators are suitable. While there is no room in general-purpose processor design to significantly increase the processor frequency, developers are instead resorting to multi-core chips duplicating conventional computing capabilities on a single die. Yet, accelerators offer more radical designs with a much higher level of parallelism and novel programming environments. This present work assesses the viability of text mining on CUDA. Text mining is one of the key concepts that has become prominent as an effective means to index the Internet, but its applications range beyond this scope and extend to providing document similarity metrics, the subject of this work. We have developed and optimized text search algorithms for GPUs to exploit their potential for massive data processing. We discuss the algorithmic challenges of parallelization for text search problems on GPUs and demonstrate the potential of these devices in experiments by reporting significant speedups. Our study may be one of the first to assess more complex text search problems for suitability for GPU devices, and it may also be one of the first to exploit and report on atomic instruction usage that have recently become available in NVIDIA devices

  4. Systematic text condensation

    DEFF Research Database (Denmark)

    Malterud, Kirsti

    2012-01-01

    To present background, principles, and procedures for a strategy for qualitative analysis called systematic text condensation and discuss this approach compared with related strategies.

  5. Linguistics in Text Interpretation

    DEFF Research Database (Denmark)

    Togeby, Ole

    A model for how text interpretation proceeds from what is pronounced, through what is said, to what is communicated, and a definition of the concepts 'presupposition' and 'implicature'.

  6. Making Sense of Texts

    Science.gov (United States)

    Harper, Rebecca G.

    2014-01-01

    This article addresses the triadic nature regarding meaning construction of texts. Grounded in Rosenblatt's (1995; 1998; 2004) Transactional Theory, research conducted in an undergraduate Language Arts curriculum course revealed that when presented with unfamiliar texts, students used prior experiences, social interactions, and literary…

  7. Text Categorization with Latent Dirichlet Allocation

    Directory of Open Access Journals (Sweden)

    ZLACKÝ Daniel

    2014-05-01

    Full Text Available This paper focuses on the text categorization of Slovak text corpora using latent Dirichlet allocation. Our goal is to build text subcorpora that contain similar text documents, and to use these better-organized subcorpora to build more robust language models for speech recognition systems. Our previous research in text categorization showed that we can achieve better results with categorized text corpora. In this paper we used latent Dirichlet allocation for text categorization, dividing the initial text corpus into 2, 5, 10, 20 or 100 subcorpora with various iteration and save steps. Language models were built on these subcorpora and adapted with linear interpolation to the judicial domain. The experimental results show that text categorization using latent Dirichlet allocation can improve a system for automatic speech recognition by creating the language models from organized text corpora.
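    A hedged sketch of LDA-based subcorpus assignment (a scikit-learn stand-in with toy documents; the paper's Slovak corpora, tooling and iteration/save settings differ):

```python
# Fit LDA on word counts, then assign each document to its dominant topic.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = ["court ruled on the appeal", "judge sentenced the defendant",
        "team won the match", "player scored a late goal"]
X = CountVectorizer().fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)
print(lda.transform(X).argmax(axis=1))   # subcorpus id per document
```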

  8. Unstructured Documents Categorization: A Study

    Directory of Open Access Journals (Sweden)

    Debnath Bhattacharyya

    2008-12-01

    Full Text Available The main purpose of communication is to transfer information from one corner of the world to another. Information is basically stored in documents or files created on the basis of requirements, so the randomness of creation and storage makes them unstructured in nature. As a consequence, data retrieval and modification become a hard nut to crack. Data that is required frequently should maintain a certain pattern; otherwise, problems such as retrieving erroneous data, anomalies in modification, or excessive time consumption in the retrieval process may arise. As every problem has its own solution, these unstructured documents also have one, named unstructured document categorization: the collected unstructured documents are categorized based on some given constraints. This paper is a review which deals with the different techniques appearing in the literature for achieving the desired unstructured document categorization, such as text and data mining, genetic algorithms, lexical chaining and binarization methods.

  9. Arabic Text Classification Using Support Vector Machines

    NARCIS (Netherlands)

    Gharib, Tarek Fouad; Habib, Mena Badieh; Fayed, Zaki Taha; Zhu, Qiang

    2009-01-01

    Text classification (TC) is the process of classifying documents into a predefined set of categories based on their content. Arabic is a highly inflectional and derivational language, which makes text mining a complex task. In this paper we applied the Support Vector Machines (SVM) model in cl

  10. Texts of Television Advertisements

    OpenAIRE

    Michalewski, Kazimierz

    1995-01-01

    Short advertisement films occupy a large part (especially around the peak viewing hours) of the everyday programming of Polish state television. Even though it is possible to imagine an advertisement film employing only extralinguistic means of communication, advertisements in general have so far used written and spoken texts. The basic function of such a text, and of the whole film, is to encourage viewers to buy the advertised product. However, independently of th...

  11. Text simplification for children

    OpenAIRE

    De Belder, Jan; Moens, Marie-Francine

    2010-01-01

    The goal in this paper is to automatically transform text into a simpler text, so that it is easier to understand by children. We perform syntactic simplification, i.e. the splitting of sentences, and lexical simplification, i.e. replacing difficult words with easier synonyms. We test the performance of this approach for each component separately on a per sentence basis, and globally with the automatic construction of simplified news articles and encyclopedia articles. By including informatio...

  12. Polyglotte Texte : Einleitung

    OpenAIRE

    Zemanek, Evi; Willms, Weertje

    2014-01-01

    When one speaks of polyglossia or multilingualism, several things may be meant: first, the literary multilingualism of individual authors or cultural communities who communicate and compose texts in different languages, without one and the same "text" necessarily being multilingual. This is a phenomenon with a rich tradition: one need only think of the centuries-long coexistence of the vernacular and Latin in several European cultures between ...

  13. Modified Approach to Transform Arc From Text to Linear Form Text : A Preprocessing Stage for OCR

    Directory of Open Access Journals (Sweden)

    Vijayashree C S

    2014-08-01

    Full Text Available Arc-form-text is an artistic text which is quite common in several kinds of documents such as certificates, advertisements and historical documents. OCRs fail to read such arc-form-text, so it is necessary to transform it to linear-form-text at the preprocessing stage. In this paper, we present a modification to an existing transformation model for better readability by OCRs. The method takes the segmented arc-form-text as input. Initially, two concentric ellipses are approximated to enclose the arc-form-text; the modified transformation model then transforms the text from arc form to linear form. The proposed method is implemented on several upper semi-circular arc-form-text inputs and the readability of the transformed text is analyzed with an OCR.

  14. A new graph based text segmentation using Wikipedia for automatic text summarization

    Directory of Open Access Journals (Sweden)

    Mohsen Pourvali

    2012-01-01

    Full Text Available The technology of automatic document summarization is maturing and may provide a solution to the information overload problem. Nowadays, document summarization plays an important role in information retrieval: with a large volume of documents, presenting the user with a summary of each document greatly facilitates the task of finding the desired ones. Document summarization is the process of automatically creating a compressed version of a given document that provides useful information to users, while multi-document summarization produces a summary delivering the majority of the information content from a set of documents about an explicit or implicit main topic. In this paper we use the knowledge base of Wikipedia and the words of the main text to create independent graphs from the input text. We then determine the importance of the graphs, identify the sentences whose topics have high importance, and finally extract those sentences. The experimental results on open benchmark datasets from DUC01 and DUC02 show that our proposed approach can improve the performance compared with state-of-the-art summarization approaches.

  15. Automatic Induction of Rule Based Text Categorization

    OpenAIRE

    D.Maghesh Kumar

    2010-01-01

    The automated categorization of texts into predefined categories has witnessed a booming interest in the last 10 years, due to the increased availability of documents in digital form and the ensuing need to organize them. In the research community the dominant approach to this problem is based on machine learning techniques: a general inductive process automatically builds a classifier by learning, from a set of preclassified documents, the characteristics of the categories. This paper describ...

  16. Machine Learning in Automated Text Categorization

    OpenAIRE

    Sebastiani, Fabrizio

    2001-01-01

    The automated categorization (or classification) of texts into predefined categories has witnessed a booming interest in the last ten years, due to the increased availability of documents in digital form and the ensuing need to organize them. In the research community the dominant approach to this problem is based on machine learning techniques: a general inductive process automatically builds a classifier by learning, from a set of preclassified documents, the characteristics of the categori...

  17. Reading Authentic Texts

    DEFF Research Database (Denmark)

    Balling, Laura Winther

    2013-01-01

    Most research on cognates has focused on words presented in isolation that are easily defined as cognate between L1 and L2. In contrast, this study investigates what counts as cognate in authentic texts and how such cognates are read. Participants with L1 Danish read news articles in their highly...

  18. Reading Authorship into Texts.

    Science.gov (United States)

    Werner, Walter

    2000-01-01

    Provides eight concepts, with illustrative questions for interpreting the authorship of texts, that are borrowed from cultural studies literature: (1) representation; (2) the gaze; (3) voice; (4) intertextuality; (5) absence; (6) authority; (7) mediation; and (8) reflexivity. States that examples were taken from British Columbia's (Canada) social…

  19. 26. Text laws

    Czech Academy of Sciences Publication Activity Database

    Hřebíček, Luděk

    Vol. 26. Ein internationales Handbuch/An International Handbook. Berlin-New York : Walter de Gruyter, 2005 - (Köhler, R.; Altmann, G.; Piotrowski, R.), s. 348-361 ISBN 978-3-11-015578-5 Institutional research plan: CEZ:AV0Z9021901 Keywords : Text structure * Quantitative linguistics Subject RIV: AI - Linguistics

  20. EAL studying texts

    CERN Document Server

    Napthin, Melanie

    2013-01-01

    EAL: Studying texts has been developed out of Insight's best-selling ESL English for Year 12, which has helped thousands of ESL/ EAL students to achieve top marks. Offering comprehensive coverage of Area of Study 1: Reading and responding in VCE English, the book takes a highly practical approach that builds students' skills progressively.

  1. Text Induced Spelling Correction

    NARCIS (Netherlands)

    Reynaert, M.W.C.

    2004-01-01

    We present TISC, a language-independent and context-sensitive spelling checking and correction system designed to facilitate the automatic removal of non-word spelling errors in large corpora. Its lexicon is derived from a very large corpus of raw text, without supervision, and contains word unigram
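    TISC's algorithm is not detailed in this record; as a rough illustration of corpus-derived, unsupervised correction of non-word errors, here is a minimal edit-distance-1 sketch (context-free, unlike TISC itself):

```python
# Correct a non-word to the most frequent in-lexicon word one edit away.
import re
from collections import Counter

corpus = "the cat sat on the mat the cat ate"     # stand-in for a large corpus
WORDS = Counter(re.findall(r"\w+", corpus.lower()))

def edits1(word):
    letters = "abcdefghijklmnopqrstuvwxyz"
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts = [L + c + R for L, R in splits for c in letters]
    return set(deletes + replaces + inserts)

def correct(word):
    candidates = [w for w in edits1(word) if w in WORDS] or [word]
    return max(candidates, key=WORDS.get)

print(correct("cst"))   # -> 'cat'
```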

  2. Texts in the landscape

    Directory of Open Access Journals (Sweden)

    James Graham-Campbell

    1998-11-01

    Full Text Available The Institute's members of UCL's "Celtic Inscribed Stones" project describe, in collaboration with Wendy Davies, Mark Handley and Paul Kershaw (Department of History), a major interdisciplinary study of inscriptions of the early middle ages from the Celtic areas of northwest Europe.

  3. About CABI Full Text

    Institute of Scientific and Technical Information of China (English)

    2013-01-01

    Centre for Agriculture and Bioscience International (CABI) is a not-for-profit international agricultural information institute with headquarters in Britain. It aims to improve people's lives by providing information and applying scientific expertise to solve problems in agriculture and the environment. CABI Full-text is one of the publishing products of CABI.

  4. How Much Handwritten Text Is Needed for Text-Independent Writer Verification and Identification

    NARCIS (Netherlands)

    Brink, Axel; Bulacu, Marius; Schomaker, Lambert

    2008-01-01

    The performance of off-line text-independent writer verification and identification increases when the documents contain more text. This relation was examined by repeatedly conducting writer verification and identification performance tests while gradually increasing the amount of text on the pages.

  5. [On two antique medical texts].

    Science.gov (United States)

    Rosa, Maria Carlota

    2005-01-01

    The two texts presented here--Regimento proueytoso contra ha pestenença [literally, "useful regime against pestilence"] and Modus curandi cum balsamo ["curing method using balm"]--represent the extent of Portugal's known medical library until circa 1530, produced in gothic letters by foreign printers: Germany's Valentim Fernandes, perhaps the era's most important printer, who worked in Lisbon between 1495 and 1518, and Germão Galharde, a Frenchman who practiced his trade in Lisbon and Coimbra between 1519 and 1560. Modus curandi, which came to light in 1974 thanks to bibliophile José de Pina Martins, is anonymous. Johannes Jacobi is believed to be the author of Regimento proueytoso, which was translated into Latin (Regimen contra pestilentiam), French, and English. Both texts are presented here in facsimile and in modern Portuguese, while the first has also been reproduced in archaic Portuguese using modern typographical characters. This philological venture into sixteenth-century medicine is supplemented by a scholarly glossary which serves as a valuable tool in interpreting not only Regimento proueytoso but also other texts from the era. Two articles place these documents in historical perspective. PMID:17500134

  6. Gabor filters for Document analysis in Indian Bilingual Documents

    OpenAIRE

    Pati, Peeta Basa; Raju, Sabari S; Pati, Nishikanta; Ramakrishnan, AG

    2004-01-01

    Reasonable success has been achieved at developing monolingual OCR systems in Indian scripts. Scientists, optimistically, have started to look beyond. Development of bilingual OCR systems and OCR systems with the capability to identify text areas are some of the pointers to future activities in the Indian scenario. The separation of text and non-text regions before considering a document image for OCR is an important task. In this paper, we present a biologically inspired, multi-channel...

  7. Documentation of Cultural Heritage Objects

    Directory of Open Access Journals (Sweden)

    Jon Grobovšek

    2013-09-01

    Full Text Available EXTENDED ABSTRACT:The first and important phase of documentation of cultural heritage objects is to understand which objects need to be documented. The entire documentation process is determined by the characteristics and scope of the cultural heritage object. The next question to be considered is the expected outcome of the documentation process and the purpose for which it will be used. These two essential guidelines determine each stage of the documentation workflow: the choice of the most appropriate data capturing technology and data processing method, how detailed should the documentation be, what problems may occur, what the expected outcome is, what it will be used for, and the plan for storing data and results. Cultural heritage objects require diverse data capturing and data processing methods. It is important that even the first stages of raw data capturing are oriented towards the applicability of results. The selection of the appropriate working method can facilitate the data processing and the preparation of final documentation. Documentation of paintings requires different data capturing method than documentation of buildings or building areas. The purpose of documentation can also be the preservation of the contemporary cultural heritage to posterity or the basis for future projects and activities on threatened objects. Documentation procedures should be adapted to our needs and capabilities. Captured and unprocessed data are lost unless accompanied by additional analyses and interpretations. Information on tools, procedures and outcomes must be included into documentation. A thorough analysis of unprocessed but accessible documentation, if adequately stored and accompanied by additional information, enables us to gather useful data. In this way it is possible to upgrade the existing documentation and to avoid data duplication or unintentional misleading of users. The documentation should be archived safely and in a way to meet

  8. Data Security by Preprocessing the Text with Secret Hiding

    Directory of Open Access Journals (Sweden)

    Ajit Singh

    2012-06-01

    Full Text Available With the advent of the Internet, an open forum, the massive increase in data traveling across networks has made secure transmission an issue. Cryptography involves many encryption methods for making data secure, but the transmission of secured data remains an intricate task. Steganography complements it by transmitting data without revealing that secret data is present. This research paper provides a mechanism which enhances the security of data by using a crypto+stegano combination to increase the security level without revealing the fact that some secret data is being shared across networks. In the first phase, data is encrypted by manipulating the text using the ASCII codes and some randomly generated strings for the codes, taking some parameters. Steganography, related to cryptography, forms the basis for many data hiding techniques: the data is encrypted using the proposed approach, and the message is then hidden in N random images with the help of a perfect hashing scheme, which increases the security of the message before sending it across the medium. Thus the sending and receiving of the message will be safe and secure, with increased confidentiality.
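    A hedged sketch of the first phase as described, shifting ASCII codes with a randomly generated key string (the paper's exact manipulation and parameters are not given here, and a vetted cipher should be used in practice):

```python
# Toy ASCII-code manipulation: shift each character by a key character
# modulo 128; the shared seed stands in for an agreed secret.
import random, string

def make_key(n, seed=42):
    rng = random.Random(seed)
    return "".join(rng.choice(string.printable) for _ in range(n))

def encrypt(plain, key):
    return "".join(chr((ord(c) + ord(k)) % 128) for c, k in zip(plain, key))

def decrypt(cipher, key):
    return "".join(chr((ord(c) - ord(k)) % 128) for c, k in zip(cipher, key))

msg = "meet at noon"
key = make_key(len(msg))
assert decrypt(encrypt(msg, key), key) == msg
```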

  9. Reading Text While Driving

    OpenAIRE

    Liang, Yulan; Horrey, William J.; Hoffman, Joshua D.

    2015-01-01

    Objective In this study, we investigated how drivers adapt secondary-task initiation and time-sharing behavior when faced with fluctuating driving demands. Background Reading text while driving is particularly detrimental; however, in real-world driving, drivers actively decide when to perform the task. Method In a test track experiment, participants were free to decide when to read messages while driving along a straight road consisting of an area with increased driving demands (demand zone)...

  10. Toponym Resolution in Text

    OpenAIRE

    Leidner, Jochen Lothar

    2007-01-01

    Background. In the area of Geographic Information Systems (GIS), a shared discipline between informatics and geography, the term geo-parsing is used to describe the process of identifying names in text, which in computational linguistics is known as named entity recognition and classification (NERC). The term geo-coding is used for the task of mapping from implicitly geo-referenced datasets (such as structured address records) to explicitly geo-referenced representations (e.g.,...

  11. Text classification method review

    OpenAIRE

    Mahinovs, Aigars; Tiwari, Ashutosh; Roy, Rajkumar; Baxter, David

    2007-01-01

    With the explosion of information fuelled by the growth of the World Wide Web it is no longer feasible for a human observer to understand all the data coming in or even classify it into categories. With this growth of information and simultaneous growth of available computing power automatic classification of data, particularly textual data, gains increasingly high importance. This paper provides a review of generic text classification process, phases of that process and met...

  12. Text a jeho ilustrace

    OpenAIRE

    SNÁŠELOVÁ, Karolína

    2015-01-01

    The thesis deals with the question of linking the visual representation to a literary work of art. It focuses primarily on the genre of book illustration and the question of the relationship between the verbal and visual components of a literary work, i.e. the possibilities and limits of transforming language into visual representation. The theoretical explanation accompanies an analysis of several illustrations of specific works of Czech literature in their relation to the text.

  13. TCD: A Text-Based UML Class Diagram Notation and Its Model Converters

    Science.gov (United States)

    Washizaki, Hironori; Akimoto, Masayoshi; Hasebe, Atsushi; Kubo, Atsuto; Fukazawa, Yoshiaki

    Among the several diagrams defined in UML, the class diagram is particularly useful throughout the entire software development process, from early domain analysis stages to later maintenance stages. However, conventional UML environments are often inappropriate for collaborative modeling across physically remote locations, such as exchanging models on a public mailing list via email. To overcome this issue, we propose a new diagram notation, called "TCD" (Text-based uml Class Diagram), for describing UML class diagrams using ASCII text. Since text files can be easily created, modified and exchanged anywhere on any computing platform, TCD facilitates collaborative modeling by any number of unspecified people. Moreover, we implemented model converters for converting in both directions between UML class diagrams described in the XMI form and those in the TCD form. By using the converters, the reusability of models can be significantly improved, because many UML modeling tools support XMI for importing and exporting modeling data.

  14. Modeling statistical properties of written text.

    Directory of Open Access Journals (Sweden)

    M Angeles Serrano

    Full Text Available Written text is one of the fundamental manifestations of human language, and the study of its universal regularities can give clues about how our brains process information and how we, as a society, organize and share it. Among these regularities, only Zipf's law has been explored in depth. Other basic properties, such as the existence of bursts of rare words in specific documents, have only been studied independently of each other and mainly through descriptive models. As a consequence, there is a lack of understanding of linguistic processes as complex emergent phenomena. Beyond Zipf's law for word frequencies, here we focus on burstiness, on Heaps' law describing the sublinear growth of vocabulary size with the length of a document, and on the topicality of document collections, which encode correlations within and across documents absent in random null models. We introduce and validate a generative model that explains the simultaneous emergence of all these patterns from simple rules. As a result, we find a connection between the bursty nature of rare words and the topical organization of texts, and we identify dynamic word ranking and memory across documents as key mechanisms explaining the non-trivial organization of written text. Our research can have broad implications and practical applications in computer science, cognitive science and linguistics.
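    Heaps' law as used above, V(n) ~ K * n^beta with beta < 1, can be measured directly from a token stream; a sketch on a Zipf-distributed toy stream:

```python
# Track vocabulary growth V(n) and fit the Heaps exponent on log-log scale.
import numpy as np

rng = np.random.default_rng(0)
tokens = rng.zipf(1.5, size=5000)    # Zipf-distributed word ids (toy corpus)
seen, V = set(), []
for t in tokens:
    seen.add(int(t))
    V.append(len(seen))
n = np.arange(1, len(tokens) + 1)
beta = np.polyfit(np.log(n), np.log(V), 1)[0]
print(f"estimated Heaps exponent beta = {beta:.2f}")   # sublinear: beta < 1
```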

  15. Document Management on Display.

    Science.gov (United States)

    Grimshaw, Anne

    1998-01-01

    Describes some of the products displayed at the United Kingdom's largest document management, imaging and workflow exhibition (Document 97, Birmingham, England, October 7-9, 1997). Includes recognition technologies; document delivery; scanning; document warehousing; document management and retrieval software; workflow systems; Internet software;…

  16. Working with Documents in Databases

    Directory of Open Access Journals (Sweden)

    Marian DARDALA

    2008-01-01

    Full Text Available The use of electronic documents on a larger and larger scale within organizations and public institutions requires their storage and unitary exploitation by means of databases. The purpose of this article is to present the way of loading, exploiting and visualizing documents in a database, taking as an example the DBMS Microsoft SQL Server. The modules for loading documents into the database and for their visualization are presented through code sequences written in C#. Interoperability between environments is achieved by means of the ADO.NET database-access technology.

  17. Events and Trends in Text Streams

    Energy Technology Data Exchange (ETDEWEB)

    Engel, David W.; Whitney, Paul D.; Cramer, Nicholas O.

    2010-03-04

    "Text streams--collections of documents or messages that are generated and observed over time--are ubiquitous. Our research and development are targeted at developing algorithms to find and characterize changes in topic within text streams. To date, this research has demonstrated the ability to detect and describe 1) short duration, atypical events and 2) the emergence of longer-term shifts in topical content. This technology has been applied to predefined temporally ordered document collections but is also suitable for application to near-real-time textual data streams."

  18. Text Classification: A Sequential Reading Approach

    CERN Document Server

    Dulac-Arnold, Gabriel; Gallinari, Patrick

    2011-01-01

    We propose to model the text classification process as a sequential decision process. In this process, an agent learns to classify documents into topics while reading the document sentences sequentially, and learns to stop as soon as enough information has been read to make a decision. The proposed algorithm is based on modeling text classification as a Markov Decision Process and learns by using reinforcement learning. Experiments on four different classical mono-label corpora show that the proposed approach performs comparably to classical SVM approaches for large training sets, and better for small training sets. In addition, the model automatically adapts its reading process to the quantity of training information provided.

  19. Text and Music Revisited

    OpenAIRE

    Fornäs, Johan

    1997-01-01

    Are words and music two separate symbolic modes, or rather variants of the same human symbolic practice? Are they parallel, opposing or overlapping? What do they have in common and how does each of them exceed the other? Is music perhaps incomparably different from words, or even their anti-verbal Other? Distinctions between text (in the verbal sense of units of words rather than in the wide sense of symbolic webs in general) and music are regularly made – but also problematized – withi...

  20. Weaving with text

    DEFF Research Database (Denmark)

    Hagedorn-Rasmussen, Peter

    This paper explores how a school principal by means of practical authorship creates reservoirs of language that provide a possible context for collective sensemaking. The paper draws upon a field study in which a school principal, and his managerial team, was shadowed in a period of intensive changes. The paper explores how the manager weaves with text, extracted from stakeholders, administration, politicians, employees, public discourse etc., as a means of creating a new fabric, a texture, of diverse perspectives that aims for collective sensemaking.

  1. Document Clustering based on Topic Maps

    CERN Document Server

    Rafi, Muhammad; Farooq, Amir; 10.5120/1640-2204

    2011-01-01

    Importance of document clustering is now widely acknowledged by researchers for better management, smart navigation, efficient filtering, and concise summarization of large collections of documents like the World Wide Web (WWW). The next challenge lies in semantically performing clustering based on the semantic contents of the document. The problem of document clustering has two main components: (1) to represent the document in such a form that inherently captures the semantics of the text (this may also help reduce the dimensionality of the document), and (2) to define a similarity measure based on the semantic representation such that it assigns higher numerical values to document pairs which have a higher semantic relationship. The feature space of documents can be very challenging for document clustering. A document may contain multiple topics, it may contain a large set of class-independent general-words, and a handful of class-specific core-words. With these features in mind, traditional agglomerative clustering algori...

  2. Text Clustering Using a Suffix Tree Similarity Measure

    OpenAIRE

    Huang, Chenghui; Yin, Jian; Fang HOU

    2011-01-01

    In the text mining area, popular methods use bag-of-words models, which represent a document as a vector. These methods ignore word sequence information, and their good clustering results are limited to some special domains. This paper proposes a new similarity measure based on the suffix tree model of text documents. It analyzes the word sequence information, and then computes the similarity between the text documents of a corpus by applying a suffix tree similarity that combines with TF-IDF weighti...

  3. Weitere Texte physiognomischen Inhalts

    Directory of Open Access Journals (Sweden)

    Böck, Barbara

    2004-12-01

    Full Text Available The present article offers the edition of three cuneiform texts belonging to the Akkadian handbook of omens drawn from the physical appearance as well as the morals and behaviour of man. The book, comprising up to 27 chapters with more than 100 omens each, was entitled in antiquity Alamdimmû. The edition of the three cuneiform tablets thus completes the author's monographic study on the ancient Mesopotamian divinatory discipline of physiognomy (Die babylonisch-assyrische Morphoskopie, Wien 2000 [= AfO Beih. 27]).

    This article presents the editio princeps of three cuneiform texts preserved in the British Museum (London) and the Vorderasiatisches Museum (Berlin) which belong to the Assyro-Babylonian book of physiognomic omens. This book, originally entitled Alamdimmû ('form, figure'), consists of 27 chapters, each with more than one hundred omens written in the Akkadian language. The three texts thus complete the author's monographic study on the divinatory discipline of physiognomy in the ancient Near East (Die babylonisch-assyrische Morphoskopie, Wien 2000 [= AfO Beih. 27]).

  4. Texts of presentation

    Energy Technology Data Exchange (ETDEWEB)

    Magnin, G.; Vidolov, K.; Dufour-Fallot, B.; Dewarrat, Th.; Rose, T.; Favatier, A.; Gazeley, D.; Pujol, T.; Worner, D.; Van de Wel, E.; Revaz, J.M.; Clerfayt, G.; Creedy, A.; Moisan, F.; Geissler, M.; Isbell, P.; Macaluso, M.; Litzka, V.; Gillis, W.; Jarvis, I.; Gorg, M.; Bebie, B.

    2004-07-01

    Implementing a sustainable local energy policy involves a long-term reflection on the general interest, energy efficiency, distributed generation and environmental protection. Providing services on a market involves looking for activities that are profitable, if possible in the 'short term'. The aim of this conference is to analyse the possibility of reconciling these apparently contradictory requirements and how this can be achieved. This conference brings together the best specialists from European municipalities as well as important partners for local authorities (energy agencies, service companies, institutions, etc.) in order to discuss the public-private partnerships concerning the various functions that municipalities may perform in the energy field as consumers and customers, planners and organizers of urban space, and motivators of the inhabitants and economic players of their areas. This document contains the summaries of the following presentations: 1 - Performance contracting: Bulgarian municipalities use private capital for energy efficiency improvement (K. VIDOLOV, Varna (BG)), Contracting experiences in Swiss municipalities: consistent energy policy thanks to the Energy-city label (B. DUFOUR-FALLOT and T. DEWARRAT (CH)), Experience of contracting in the domestic sector (T. ROSE (GB)); 2 - Public procurement: Multicolor electricity (A. FAVATIER (CH)), Tendering for new green electricity capacity (D. GAZELEY (GB)), The Barcelona solar thermal ordinance (T. PUJOL (ES)); 3 - Urban planning and schemes: Influencing energy issues through urban planning (D. WOERNER (DE)), Tendering for the supply of energy infrastructure (E. VAN DE WEL (NL)), Concessions and public utility warranty (J.M. REVAZ (CH)); 4 - Certificate schemes: the market of green certificates in the Wallonia region in a liberalized power market (G. CLERFAYT (BE)), The Carbon Neutral(R) project: a voluntary certification scheme with opportunity for implementation in other European

  5. Generic safety documentation model

    Energy Technology Data Exchange (ETDEWEB)

    Mahn, J.A.

    1994-04-01

    This document is intended to be a resource for preparers of safety documentation for Sandia National Laboratories, New Mexico facilities. It provides standardized discussions of some topics that are generic to most, if not all, Sandia/NM facilities safety documents. The material provides a "core" upon which to develop facility-specific safety documentation. The use of the information in this document will reduce the cost of safety document preparation and improve consistency of information.

  6. Generic safety documentation model

    International Nuclear Information System (INIS)

    This document is intended to be a resource for preparers of safety documentation for Sandia National Laboratories, New Mexico facilities. It provides standardized discussions of some topics that are generic to most, if not all, Sandia/NM facilities safety documents. The material provides a "core" upon which to develop facility-specific safety documentation. The use of the information in this document will reduce the cost of safety document preparation and improve consistency of information.

  7. HANDWRITTEN TEXT IMAGE AUTHENTICATION USING BACK PROPAGATION

    Directory of Open Access Journals (Sweden)

    A S N Chakravarthy

    2011-10-01

    Full Text Available Authentication is the act of confirming the truth of an attribute of a datum or entity. This might involve confirming the identity of a person, tracing the origins of an artefact, ensuring that a product is what its packaging and labelling claims to be, or assuring that a computer program is a trusted one. The authentication of information can pose special problems (especially man-in-the-middle attacks), and is often wrapped up with authenticating identity. Literary forgery can involve imitating the style of a famous author. If an original manuscript, typewritten text, or recording is available, then the medium itself (or its packaging - anything from a box to e-mail headers) can help prove or disprove the authenticity of the document. The use of digital images of handwritten historical documents has become more popular in recent years. Volunteers around the world now read thousands of these images as part of their indexing process. Handwritten text images of old documents are sometimes difficult to read or noisy due to the preservation of the document and the quality of the image [1]. Handwritten text offers challenges that are rarely encountered in machine-printed text. In addition, most problems faced in reading machine-printed text (e.g., character recognition, word segmentation, letter segmentation, etc.) are more severe in handwritten text. In this paper we propose a method for authenticating handwritten text images using the back propagation algorithm.

  8. Multinomial Inverse Regression for Text Analysis

    OpenAIRE

    Taddy, Matt

    2010-01-01

    Text data, including speeches, stories, and other document forms, are often connected to sentiment variables that are of interest for research in marketing, economics, and elsewhere. Such data are also very high dimensional and difficult to incorporate into statistical analyses. This article introduces a straightforward framework of sentiment-preserving dimension reduction for text data. Multinomial inverse regression is introduced as a general tool for simplifying predictor sets that can be represen...

  9. Text mining for the biocuration workflow

    OpenAIRE

    Hirschman, L.; Burns, G. A. P. C.; Krallinger, M.; Arighi, C.; Cohen, K. B.; Valencia, A.; Wu, C H; Chatr-aryamontri, A; Dowell, K. G.; Huala, E; Lourenco, A.; Nash, R; Veuthey, A.-L.; Wiegers, T.; Winter, A. G.

    2012-01-01

    Molecular biology has become heavily dependent on biological knowledge encoded in expert curated biological databases. As the volume of biological literature increases, biocurators need help in keeping up with the literature; (semi-) automated aids for biocuration would seem to be an ideal application for natural language processing and text mining. However, to date, there have been few documented successes for improving biocuration throughput using text mining. Our initial investigations too...

  10. Text mining for the biocuration workflow.

    Science.gov (United States)

    Hirschman, Lynette; Burns, Gully A P C; Krallinger, Martin; Arighi, Cecilia; Cohen, K Bretonnel; Valencia, Alfonso; Wu, Cathy H; Chatr-Aryamontri, Andrew; Dowell, Karen G; Huala, Eva; Lourenço, Anália; Nash, Robert; Veuthey, Anne-Lise; Wiegers, Thomas; Winter, Andrew G

    2012-01-01

    Molecular biology has become heavily dependent on biological knowledge encoded in expert curated biological databases. As the volume of biological literature increases, biocurators need help in keeping up with the literature; (semi-) automated aids for biocuration would seem to be an ideal application for natural language processing and text mining. However, to date, there have been few documented successes for improving biocuration throughput using text mining. Our initial investigations took place for the workshop on 'Text Mining for the BioCuration Workflow' at the third International Biocuration Conference (Berlin, 2009). We interviewed biocurators to obtain workflows from eight biological databases. This initial study revealed high-level commonalities, including (i) selection of documents for curation; (ii) indexing of documents with biologically relevant entities (e.g. genes); and (iii) detailed curation of specific relations (e.g. interactions); however, the detailed workflows also showed many variabilities. Following the workshop, we conducted a survey of biocurators. The survey identified biocurator priorities, including the handling of full text indexed with biological entities and support for the identification and prioritization of documents for curation. It also indicated that two-thirds of the biocuration teams had experimented with text mining and almost half were using text mining at that time. Analysis of our interviews and survey provide a set of requirements for the integration of text mining into the biocuration workflow. These can guide the identification of common needs across curated databases and encourage joint experimentation involving biocurators, text mining developers and the larger biomedical research community. PMID:22513129

  11. Writing Treatment for Aphasia: A Texting Approach

    Science.gov (United States)

    Beeson, Pelagie M.; Higginson, Kristina; Rising, Kindle

    2013-01-01

    Purpose: Treatment studies have documented the therapeutic and functional value of lexical writing treatment for individuals with severe aphasia. The purpose of this study was to determine whether such retraining could be accomplished using the typing feature of a cellular telephone, with the ultimate goal of using text messaging for…

  12. Automatic Syntactic Analysis of Free Text.

    Science.gov (United States)

    Schwarz, Christoph

    1990-01-01

    Discusses problems encountered with the syntactic analysis of free text documents in indexing. Postcoordination and precoordination of terms is discussed, an automatic indexing system call COPSY (context operator syntax) that uses natural language processing techniques is described, and future developments are explained. (60 references) (LRW)

  13. Automatic Induction of Rule Based Text Categorization

    Directory of Open Access Journals (Sweden)

    D.Maghesh Kumar

    2010-12-01

    Full Text Available The automated categorization of texts into predefined categories has witnessed booming interest in the last 10 years, due to the increased availability of documents in digital form and the ensuing need to organize them. In the research community the dominant approach to this problem is based on machine learning techniques: a general inductive process automatically builds a classifier by learning, from a set of preclassified documents, the characteristics of the categories. This paper describes a novel method for the automatic induction of rule-based text classifiers. This method supports a hypothesis language of the form "if T1, … or Tn occurs in document d, and none of Tn+1, … Tn+m occurs in d, then classify d under category c," where each Ti is a conjunction of terms. The paper also surveys the main approaches to text categorization that fall within the machine learning paradigm, discussing in detail issues pertaining to three different problems, namely, document representation, classifier construction, and classifier evaluation.
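
    A rule in the quoted hypothesis language can be evaluated mechanically. The Python sketch below, with made-up terms and categories, shows one way to represent and apply such rules; it illustrates only the rule form, not the induction algorithm described in the paper:

        # A rule fires when at least one positive conjunction matches the document
        # and no negative conjunction does: "if T1 ... or Tn occurs in d, and none
        # of Tn+1 ... Tn+m occurs in d, then classify d under category c".
        RULES = [  # hypothetical rules, for illustration only
            {"positive": [{"goal", "match"}, {"league"}],
             "negative": [{"stock"}],
             "category": "sports"},
        ]

        def classify(document, rules):
            words = set(document.lower().split())
            labels = []
            for rule in rules:
                fires = any(conj <= words for conj in rule["positive"]) and \
                        not any(conj <= words for conj in rule["negative"])
                if fires:
                    labels.append(rule["category"])
            return labels

        print(classify("The league match ended with a late goal", RULES))  # ['sports']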

  14. Subsegmental language detection in Celtic language text

    OpenAIRE

    Tyers, Francis Morton; Minocha, Akshay

    2014-01-01

    This paper describes an experiment to perform language identification on a sub-sentence basis. The typical case of language identification is to detect the language of documents or sentences. However, it may be the case that a single sentence or segment contains more than one language. This is especially the case in texts where code switching occurs.

  15. DOCUMENT SUMMARIZATION IN KANNADA USING KEYWORD EXTRACTION

    Directory of Open Access Journals (Sweden)

    Jayashree.R

    2011-09-01

    Full Text Available The internet has caused a humongous growth in the amount of data available to the common man. Summaries of documents can help find the right information and are particularly effective when the document base is very large. Keywords are closely associated with a document as they reflect the document's content and act as indexes for the given document. In this work, we present a method to produce extractive summaries of documents in the Kannada language. The algorithm extracts key words from pre-categorized Kannada documents collected from online resources. We combine GSS (Galavotti, Sebastiani, Simi) coefficients and IDF (Inverse Document Frequency) methods along with TF (Term Frequency) for extracting key words and later use these for summarization. In the current implementation a document from a given category is selected from our database and depending on the number of sentences given by the user, a summary is generated.
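
    The paper combines GSS coefficients with TF and IDF; the Python sketch below reproduces only the TF-IDF part of the scoring plus a keyword-based sentence ranking (the GSS term and all Kannada-specific processing are omitted, and the tokenization is a naive assumption):

        import math, re
        from collections import Counter

        def tfidf_keywords(doc, corpus, top=5):
            """Score the words of `doc` by TF * IDF against a background corpus."""
            tf = Counter(doc.lower().split())
            n = len(corpus)
            def idf(w):
                df = sum(1 for d in corpus if w in d.lower().split())
                return math.log((n + 1) / (df + 1))
            return sorted(tf, key=lambda w: tf[w] * idf(w), reverse=True)[:top]

        def summarize(doc, corpus, n_sentences=2):
            """Pick the sentences containing the most high-scoring keywords."""
            keys = set(tfidf_keywords(doc, corpus))
            sentences = re.split(r"(?<=[.!?])\s+", doc)
            scored = sorted(sentences,
                            key=lambda s: sum(w in keys for w in s.lower().split()),
                            reverse=True)
            return " ".join(scored[:n_sentences])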

  16. TEXT MINING AND CLASSIFICATION OF PRODUCT REVIEWS USING STRUCTURED SUPPORT VECTOR MACHINE

    OpenAIRE

    Jincy B. Chrystal; Stephy Joseph

    2015-01-01

    Text mining and Text classification are the two prominent and challenging tasks in the field of Machine learning. Text mining refers to the process of deriving high quality and relevant information from text, while Text classification deals with the categorization of text documents into different classes. The real challenge in these areas is to address the problems like handling large text corpora, similarity of words in text documents, and association of text documents with a ...

  17. Audit of Orthopaedic Surgical Documentation

    Directory of Open Access Journals (Sweden)

    Fionn Coughlan

    2015-01-01

    Full Text Available Introduction. The Royal College of Surgeons in England published guidelines in 2008 outlining the information that should be documented at each surgery. St. James’s Hospital uses a standard operation sheet for all surgical procedures and these were examined to assess documentation standards. Objectives. To retrospectively audit the hand written orthopaedic operative notes according to established guidelines. Methods. A total of 63 operation notes over seven months were audited in terms of date and time of surgery, surgeon, procedure, elective or emergency indication, operative diagnosis, incision details, signature, closure details, tourniquet time, postop instructions, complications, prosthesis, and serial numbers. Results. A consultant performed 71.4% of procedures; however, 85.7% of the operative notes were written by the registrar. The date and time of surgery, name of surgeon, procedure name, and signature were documented in all cases. The operative diagnosis and postoperative instructions were frequently not documented in the designated location. Incision details were included in 81.7% and prosthesis details in only 30% while the tourniquet time was not documented in any. Conclusion. Completion and documentation of operative procedures were excellent in some areas; improvement is needed in documenting tourniquet time, prosthesis and incision details, and the location of operative diagnosis and postoperative instructions.

  18. Vietnamese Document Representation and Classification

    Science.gov (United States)

    Nguyen, Giang-Son; Gao, Xiaoying; Andreae, Peter

    Vietnamese is very different from English and little research has been done on Vietnamese document classification, or indeed, on any kind of Vietnamese language processing, and only a few small corpora are available for research. We created a large Vietnamese text corpus with about 18000 documents, and manually classified them based on different criteria such as topics and styles, giving several classification tasks of different difficulty levels. This paper introduces a new syllable-based document representation at the morphological level of the language for efficient classification. We tested the representation on our corpus with different classification tasks using six classification algorithms and two feature selection techniques. Our experiments show that the new representation is effective for Vietnamese categorization, and suggest that best performance can be achieved using syllable-pair document representation, an SVM with a polynomial kernel as the learning algorithm, and using Information gain and an external dictionary for feature selection.
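
    A rough analogue of the best-performing configuration reported above (syllable-pair features with a polynomial-kernel SVM) can be assembled from scikit-learn. In this sketch word bigrams stand in for Vietnamese syllable pairs (whitespace-delimited Vietnamese units are roughly syllables), and the training data are placeholders:

        from sklearn.feature_extraction.text import CountVectorizer
        from sklearn.pipeline import make_pipeline
        from sklearn.svm import SVC

        # Bigrams over whitespace tokens approximate syllable-pair features.
        model = make_pipeline(
            CountVectorizer(ngram_range=(2, 2)),
            SVC(kernel="poly", degree=2),
        )

        train_docs = ["hypothetical document one", "hypothetical document two"]
        train_labels = ["topicA", "topicB"]
        model.fit(train_docs, train_labels)
        print(model.predict(["hypothetical document one"]))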

  19. Towards document engineering

    OpenAIRE

    Quint, Vincent; Nanard, M.; André, Jacques

    1990-01-01

    This article compares methods and techniques used in software engineering with the ones used for handling electronic documents. It shows the common features in both domains, but also the differences, and it proposes an approach which extends the field of document manipulation to document engineering. It also shows in what respect document engineering is different from software engineering. Therefore specific techniques must be developed for building integrated environments for document engine...

  20. Ants for Document Clustering

    Directory of Open Access Journals (Sweden)

    Priya Vaijayanthi

    2012-03-01

    Full Text Available The usage of computers for mass storage has become mandatory nowadays due to the World Wide Web (WWW). This has posed many challenges to the Information Retrieval (IR) system. Clustering of the available documents improves the efficiency of the IR system. The problem of clustering has become a combinatorial optimization problem in IR systems due to the exponential growth in information over the WWW. In this paper, a hybrid algorithm that combines basic Ant Colony Optimization with Tabu search has been proposed. The feasibility of the proposed algorithm is tested over a few standard benchmark datasets. The experimental results reveal that the proposed algorithm yields promising quality clusters compared to those produced by the K-means algorithm.

  1. Contextualizing Data Warehouses with Documents

    DEFF Research Database (Denmark)

    Perez, Juan Manuel; Berlanga, Rafael; Aramburu, Maria Jose;

    2008-01-01

    Current data warehouse and OLAP technologies are applied to analyze the structured data that companies store in databases. The context that helps to understand data over time is usually described separately in text-rich documents. This paper proposes to integrate the traditional corporate data...

  2. Text Classification Using Sentential Frequent Itemsets

    Institute of Scientific and Technical Information of China (English)

    Shi-Zhu Liu; He-Ping Hu

    2007-01-01

    Text classification techniques mostly rely on single term analysis of the document data set, while more concepts, especially the specific ones, are usually conveyed by sets of terms. To achieve a more accurate text classifier, more informative features, including frequent co-occurring words in the same sentence and their weights, are particularly important in such scenarios. In this paper, we propose a novel approach using sentential frequent itemsets, a concept from association rule mining, for text classification, which views a sentence rather than a document as a transaction, and uses a variable precision rough set based method to evaluate each sentential frequent itemset's contribution to the classification. Experiments over the Reuters and newsgroup corpora are carried out, which validate the practicability of the proposed system.
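
    The transaction view described above is straightforward to sketch: each sentence becomes a transaction, and frequent co-occurring word pairs are mined from them. The sketch below covers only this mining step; the variable precision rough set weighting of the paper is not reproduced, and the example documents are made up:

        import re
        from collections import Counter
        from itertools import combinations

        def sentential_frequent_pairs(documents, min_support=2):
            """Treat each sentence as a transaction; return word pairs that
            co-occur in at least `min_support` sentences across the corpus."""
            counts = Counter()
            for doc in documents:
                for sentence in re.split(r"[.!?]+", doc):
                    words = sorted(set(sentence.lower().split()))
                    counts.update(combinations(words, 2))
            return {pair: c for pair, c in counts.items() if c >= min_support}

        docs = ["Interest rates rose. Interest rates may rise again.",
                "Analysts expect interest rates to rise."]
        print(sentential_frequent_pairs(docs))   # ('interest', 'rates') has support 3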

  3. Segmentation of Handwritten Text in Gurmukhi Script

    Directory of Open Access Journals (Sweden)

    Rajiv K. Sharma

    2008-09-01

    Full Text Available Character segmentation is an important preprocessing step for text recognition. The size and shape of characters generally play an important role in the process of segmentation. But for any optical character recognition (OCR) system, the presence of touching characters in textual as well as handwritten documents further decreases the correct segmentation as well as the recognition rate drastically. Because one cannot control the size and shape of characters in handwritten documents, the segmentation process for handwritten documents is very difficult. We tried to segment handwritten text by proposing some algorithms, which were implemented and have shown encouraging results. Algorithms have been proposed to segment the touching characters. These algorithms have shown a reasonable improvement in segmenting the touching handwritten characters in Gurmukhi script.

  4. A Novel Approach For Syntactic Similarity Between Two Short Text

    OpenAIRE

    Anterpreet Kaur

    2015-01-01

    ABSTRACT Syntactic similarity is an important activity in the fields of text document data mining, natural language processing and information retrieval. Natural language processing (NLP) gives machines the ability to translate text into a natural language such as English, and into computer languages such as C. Web mining is used for tasks such as document clustering and community mining performed on the web. However, finding the similarity between two documents ...

  5. The Chinese Text Categorization System with Category Priorities

    OpenAIRE

    Huan-Chao Keh; Ding-An Chiang; Chih-Cheng Hsu; Hui-Hua Huang

    2010-01-01

    The process of text categorization involves some understanding of the content of the documents and/or some previous knowledge of the categories. For the content of the documents, we use a filtering measure for feature selection in our Chinese text categorization system. We modify the formula of Term Frequency-Inverse Document Frequency (TF-IDF) to strengthen important keywords’ weights and weaken unimportant keywords’ weights. For the knowledge of the categories, we use category priority to r...

  6. NEW TECHNIQUES USED IN AUTOMATED TEXT ANALYSIS

    Directory of Open Access Journals (Sweden)

    M. Istrate

    2010-12-01

    Full Text Available Automated analysis of natural language texts is one of the most important knowledge discovery tasks for any organization. According to Gartner Group, almost 90% of the knowledge available in an organization today is dispersed throughout piles of documents buried within unstructured text. Analyzing huge volumes of textual information is often involved in making informed and correct business decisions. Traditional analysis methods based on statistics fail to help in processing unstructured texts, and society is in search of new technologies for text analysis. There exist a variety of approaches to the analysis of natural language texts, but most of them do not provide results that could be successfully applied in practice. This article concentrates on recent ideas and practical implementations in this area.

  7. Text Clustering with String Kernels in R

    OpenAIRE

    Karatzoglou, Alexandros; Feinerer, Ingo

    2006-01-01

    We present a package which provides a general framework, including tools and algorithms, for text mining in R using the S4 class system. Using this package and the kernlab R package we explore the use of kernel methods for clustering (e.g., kernel k-means and spectral clustering) on a set of text documents, using string kernels. We compare these methods to a more traditional clustering technique like k-means on a bag of word representation of the text and evaluate the viability of kernel-base...

  8. Text Mining the History of Medicine.

    Directory of Open Access Journals (Sweden)

    Paul Thompson

    Full Text Available Historical text archives constitute a rich and diverse source of information, which is becoming increasingly readily accessible, due to large-scale digitisation efforts. However, it can be difficult for researchers to explore and search such large volumes of data in an efficient manner. Text mining (TM) methods can help, through their ability to recognise various types of semantic information automatically, e.g., instances of concepts (places, medical conditions, drugs, etc.), synonyms/variant forms of concepts, and relationships holding between concepts (which drugs are used to treat which medical conditions, etc.). TM analysis allows search systems to incorporate functionality such as automatic suggestions of synonyms of user-entered query terms, exploration of different concepts mentioned within search results or isolation of documents in which concepts are related in specific ways. However, applying TM methods to historical text can be challenging, according to differences and evolutions in vocabulary, terminology, language structure and style, compared to more modern text. In this article, we present our efforts to overcome the various challenges faced in the semantic analysis of published historical medical text dating back to the mid 19th century. Firstly, we used evidence from diverse historical medical documents from different periods to develop new resources that provide accounts of the multiple, evolving ways in which concepts, their variants and relationships amongst them may be expressed. These resources were employed to support the development of a modular processing pipeline of TM tools for the robust detection of semantic information in historical medical documents with varying characteristics. We applied the pipeline to two large-scale medical document archives covering wide temporal ranges as the basis for the development of a publicly accessible semantically-oriented search system. The novel resources are available for research

  9. Registration document 2005; Document de reference 2005

    Energy Technology Data Exchange (ETDEWEB)

    NONE

    2005-07-01

    This reference document of Gaz de France provides information and data on the Group's activities in 2005: financial information, business activities, equipment, factories and real estate, trade, capital, organization charts, employment, contracts and research programs. (A.L.B.)

  10. 2002 reference document; Document de reference 2002

    Energy Technology Data Exchange (ETDEWEB)

    NONE

    2002-07-01

    This 2002 reference document of the Areva group provides information on the company. Organized in seven chapters, it presents the persons responsible for the reference document and for auditing the financial statements; information pertaining to the transaction; general information on the company and share capital; information on company operations, changes and future prospects; assets, financial position and financial performance; information on company management and the executive and supervisory boards; and recent developments and future prospects. (A.L.B.)

  11. Preventing Document Leakage through Active Document

    OpenAIRE

    Aaber, Zeyad; Crowder, Richard; Fadhel, Nawfal; Wills, Gary B.

    2014-01-01

    Electronic documents inside any enterprise environment are assets that add to the enterprise's capital in intellectual property, such as design patents or customer information; securing these assets is a priority requirement in any security system design. The security of these documents suffers once they have migrated outside the organization's security system, as there is not always a way to extend the enterprise security policy to limit/prevent access to those assets. This paper present...

  12. An Efficient Pattern Discovery Over Long Text Patterns

    Directory of Open Access Journals (Sweden)

    T.Ravi Kiran

    2014-04-01

    Full Text Available Several techniques have been implemented for mining documents, yet many problems remain in extracting exact patterns in text mining. In the proposed system, a temporal text mining approach is introduced, and the system is evaluated in terms of its ability to predict forthcoming events in the documents. We discover an optimal decomposition of the time period associated with the given document set, where each subinterval consists of consecutive time points having identical information content. Extraction of sequences of events from news and other documents based on the publication times of these documents has been shown to be extremely effective in tracking past events.

  13. ARABIC TEXT CATEGORIZATION ALGORITHM USING VECTOR EVALUATION METHOD

    Directory of Open Access Journals (Sweden)

    Ashraf Odeh

    2014-12-01

    Full Text Available Text categorization is the process of grouping documents into categories based on their contents. This process is important to make information retrieval easier, and it became more important due to the huge textual information available online. The main problem in text categorization is how to improve the classification accuracy. Although Arabic text categorization is a new promising field, there has been little research in it. This paper proposes a new method for Arabic text categorization using vector evaluation. The proposed method uses a categorized Arabic documents corpus; the weights of the tested document's words are then calculated to determine the document keywords, which are compared with the keywords of the corpus categories to determine the tested document's best category.
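
    The comparison between a tested document's keyword weights and per-category keyword profiles can be sketched as a cosine similarity between sparse weight vectors. The categories, transliterated keywords and weights below are placeholders for illustration, not data from the paper:

        import math

        def cosine(a, b):
            """Cosine similarity between two sparse word-weight dicts."""
            num = sum(a[w] * b[w] for w in set(a) & set(b))
            den = math.sqrt(sum(v * v for v in a.values())) * \
                  math.sqrt(sum(v * v for v in b.values()))
            return num / den if den else 0.0

        CATEGORY_KEYWORDS = {          # hypothetical category profiles
            "economy": {"suq": 0.9, "iqtisad": 0.8},    # transliterated Arabic terms
            "sports":  {"mubarah": 0.9, "fariq": 0.7},
        }

        def best_category(doc_weights):
            return max(CATEGORY_KEYWORDS,
                       key=lambda c: cosine(doc_weights, CATEGORY_KEYWORDS[c]))

        print(best_category({"iqtisad": 0.6, "suq": 0.4}))   # -> 'economy'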

  14. Electronic Document Management Using Inverted Files System

    Directory of Open Access Journals (Sweden)

    Suhartono Derwin

    2014-03-01

    Full Text Available The number of documents is increasing rapidly. Documents exist not only in paper-based form but also in electronic form, as can be seen from the data sample taken by the SpringerLink publisher in 2010, which showed an increase in the number of digital document collections from 2003 to mid-2010. How to manage them well has therefore become an important need. This paper describes a new method of managing documents, called the inverted files system. For electronic documents, the inverted files system is applied so that documents can be searched over the Internet using a search engine. It can improve both the document search mechanism and the document storage mechanism.
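
    A minimal inverted file maps each term to the set of identifiers of the documents containing it. The Python sketch below (a simplification of the system described above) builds such an index and answers conjunctive queries:

        from collections import defaultdict

        def build_inverted_index(docs):
            """Map each term to the set of document ids that contain it."""
            index = defaultdict(set)
            for doc_id, text in docs.items():
                for term in set(text.lower().split()):
                    index[term].add(doc_id)
            return index

        def search(index, query):
            """Conjunctive query: documents containing every query term."""
            terms = query.lower().split()
            if not terms:
                return set()
            result = index.get(terms[0], set()).copy()
            for term in terms[1:]:
                result &= index.get(term, set())
            return result

        docs = {1: "inverted files improve search", 2: "documents grow fast"}
        idx = build_inverted_index(docs)
        print(search(idx, "inverted search"))   # {1}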

  15. ODQ: A Fluid Office Document Query Language

    Directory of Open Access Journals (Sweden)

    Xuhong Liu

    2015-06-01

    Full Text Available Fluid office documents, as semi-structured data often represented by Extensible Markup Language (XML), are important parts of Big Data. These office documents have different formats, and their matching Application Programming Interfaces (APIs) depend on the development platform and version, which causes difficulty in custom development and information retrieval from them. To solve this problem, we have been developing an office document query (ODQ) language which provides a uniform method to retrieve content from documents with different formats and versions. ODQ builds a common document model ontology to conceal the format details of documents and provides a uniform operation interface to handle office documents with different formats. The results show that ODQ has advantages in format independence and can facilitate users in developing document processing systems with good interoperability.

  16. Text

    International Nuclear Information System (INIS)

    The purpose of this act is to safeguard against the dangers and harmful effects of radioactive waste and to contribute to public safety and environmental protection by laying down requirements for the safe and efficient management of radioactive waste. We will find definitions, interrelation with other legislation, responsibilities of the state and local governments, responsibilities of radioactive waste management companies and generators, formulation of the basic plan for the control of radioactive waste, radioactive waste management ( with public information, financing and part of spent fuel management), Korea radioactive waste management corporation ( business activities, budget), establishment of a radioactive waste fund in order to secure the financial resources required for radioactive waste management, and penalties in case of improper operation of radioactive waste management. (N.C.)

  17. Document Analysis by Crosscount Approach

    Institute of Scientific and Technical Information of China (English)

    王海琴; 戴汝为

    1998-01-01

    In this paper a new feature called crosscount for document analysis is introduced. The feature crosscount is a function of white line segments with their start on the edge of document images. It reflects not only the contour of the image, but also the periodicity of white lines (background) and text lines in the document images. In complex printed-page layouts, there are different blocks such as textual, graphical, tabular, and so on. Of these blocks, textual ones have the most obvious periodicity, with their homogeneous white lines arranged regularly. This important property of textual blocks can be extracted by crosscount functions. Here the document layouts are classified into three classes on the basis of their physical structures. Then the definition and properties of the crosscount function are described. According to the classification of document layouts, the application of this new feature to the analysis and understanding of different types of document images is discussed.

  18. Enterprise Document Management

    Data.gov (United States)

    US Agency for International Development — The function of the operation is to provide e-Signature and document management support for Acquisition and Assisitance (A&A) documents including vouchers in...

  19. Web document engineering

    International Nuclear Information System (INIS)

    This tutorial provides an overview of several document engineering techniques which are applicable to the authoring of World Wide Web documents. It illustrates how pre-WWW hypertext research is applicable to the development of WWW information resources

  20. Hypermedia and Free Text Retrieval.

    Science.gov (United States)

    Dunlop, Mark D.; van Rijsbergen, C. J.

    1993-01-01

    Discusses access to nontextual documents in large multimedia document bases. A hybrid information retrieval model, using queries in a hypertext environment for location of browsing areas, is presented; and two experiments using cluster-based descriptions of content are reported. (23 references) (EA)

  1. Traceability Method for Software Engineering Documentation

    Directory of Open Access Journals (Sweden)

    Nur Adila Azram

    2012-03-01

    Full Text Available Traceability has been widely discussed in the research community and has been a topic of interest in software engineering. Traceability in software documentation is one of the interesting topics to be researched further. It is important in software documentation to trace out the flow or process across all the documents, whether they depend on one another or not. In this paper, we present a traceability method for software engineering documentation. The objective of this research is to facilitate the tracing of software documentation.

  2. Text Mining the History of Medicine.

    Science.gov (United States)

    Thompson, Paul; Batista-Navarro, Riza Theresa; Kontonatsios, Georgios; Carter, Jacob; Toon, Elizabeth; McNaught, John; Timmermann, Carsten; Worboys, Michael; Ananiadou, Sophia

    2016-01-01

    Historical text archives constitute a rich and diverse source of information, which is becoming increasingly readily accessible, due to large-scale digitisation efforts. However, it can be difficult for researchers to explore and search such large volumes of data in an efficient manner. Text mining (TM) methods can help, through their ability to recognise various types of semantic information automatically, e.g., instances of concepts (places, medical conditions, drugs, etc.), synonyms/variant forms of concepts, and relationships holding between concepts (which drugs are used to treat which medical conditions, etc.). TM analysis allows search systems to incorporate functionality such as automatic suggestions of synonyms of user-entered query terms, exploration of different concepts mentioned within search results or isolation of documents in which concepts are related in specific ways. However, applying TM methods to historical text can be challenging, according to differences and evolutions in vocabulary, terminology, language structure and style, compared to more modern text. In this article, we present our efforts to overcome the various challenges faced in the semantic analysis of published historical medical text dating back to the mid 19th century. Firstly, we used evidence from diverse historical medical documents from different periods to develop new resources that provide accounts of the multiple, evolving ways in which concepts, their variants and relationships amongst them may be expressed. These resources were employed to support the development of a modular processing pipeline of TM tools for the robust detection of semantic information in historical medical documents with varying characteristics. We applied the pipeline to two large-scale medical document archives covering wide temporal ranges as the basis for the development of a publicly accessible semantically-oriented search system. The novel resources are available for research purposes, while

  3. A Survey on Web Text Information Retrieval in Text Mining

    OpenAIRE

    Tapaswini Nayak; Srinivash Prasad; Manas Ranjan Senapat

    2015-01-01

    In this study we have analyzed different techniques for information retrieval in text mining. The aim of the study is to identify web text information retrieval. Text mining is closely akin to text analytics, a process of deriving high-quality information from text. High-quality information is typically derived through the devising of patterns and trends by means such as statistical pattern learning. Typical text mining tasks include text categorization, text clustering, concep...

  4. Document image cleanup and binarization

    Science.gov (United States)

    Wu, Victor; Manmatha, Raghaven

    1998-04-01

    Image binarization is a difficult task for documents with text over textured or shaded backgrounds, poor contrast, and/or considerable noise. Current optical character recognition (OCR) and document analysis technology do not handle such documents well. We have developed a simple yet effective algorithm for document image clean-up and binarization. The algorithm consists of two basic steps. In the first step, the input image is smoothed using a low-pass filter. The smoothing operation enhances the text relative to any background texture. This is because background texture normally has higher frequency than text does. The smoothing operation also removes speckle noise. In the second step, the intensity histogram of the smoothed image is computed and a threshold automatically selected as follows. For black text, the first peak of the histogram corresponds to text. Thresholding the image at the value of the valley between the first and second peaks of the histogram binarizes the image well. In order to reliably identify the valley, the histogram is smoothed by a low-pass filter before the threshold is computed. The algorithm has been applied to some 50 images from a wide variety of source: digitized video frames, photos, newspapers, advertisements in magazines or sales flyers, personal checks, etc. There are 21820 characters and 4406 words in these images. 91 percent of the characters and 86 percent of the words are successfully cleaned up and binarized. A commercial OCR was applied to the binarized text when it consisted of fonts which were OCR recognizable. The recognition rate was 84 percent for the characters and 77 percent for the words.
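
    The two-step algorithm described above translates almost directly into NumPy. The sketch below assumes an 8-bit grayscale array with dark text, uses a simple 3x3 box filter for the low-pass smoothing, and picks the valley after the first histogram peak as the threshold; the 9-bin histogram smoothing and the 100-bin valley search window are arbitrary choices made for the example:

        import numpy as np

        def binarize(gray):
            """Clean up and binarize dark text on a light, possibly textured
            background (8-bit grayscale input assumed)."""
            # Step 1: low-pass filter (3x3 box blur) to suppress background
            # texture and speckle noise relative to the text.
            pad = np.pad(gray.astype(float), 1, mode="edge")
            smooth = sum(pad[i:i + gray.shape[0], j:j + gray.shape[1]]
                         for i in range(3) for j in range(3)) / 9.0
            # Step 2: smooth the intensity histogram, locate the first peak
            # (text, since text is dark) and threshold at the valley after it.
            hist, _ = np.histogram(smooth, bins=256, range=(0, 256))
            h = np.convolve(hist, np.ones(9) / 9.0, mode="same")
            peak = next((i for i in range(1, 255)
                         if h[i] >= h[i - 1] and h[i] >= h[i + 1]), 0)
            window = h[peak:min(peak + 100, 256)]        # arbitrary search window
            threshold = peak + int(np.argmin(window))
            return (smooth <= threshold).astype(np.uint8)    # 1 = text pixel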

  5. Clinical document architecture.

    Science.gov (United States)

    Heitmann, Kai

    2003-01-01

    The Clinical Document Architecture (CDA), a standard developed by the Health Level Seven organisation (HL7), is an ANSI approved document architecture for exchange of clinical information using XML. A CDA document is comprised of a header with associated vocabularies and a body containing the structural clinical information. PMID:15061557

  6. Scheme Program Documentation Tools

    DEFF Research Database (Denmark)

    Nørmark, Kurt

    2004-01-01

    This paper describes and discusses two different Scheme documentation tools. The first is SchemeDoc, which is intended for documentation of the interfaces of Scheme libraries (APIs). The second is the Scheme Elucidator, which is for internal documentation of Scheme programs. Although the tools ar...

  7. Informative document waste plastics

    NARCIS (Netherlands)

    Nagelhout D; Sein AA; Duvoort GL

    1989-01-01

    This "Informative document waste plastics" forms part of a series of "informative documents waste materials". These documents are conducted by RIVM on the indstruction of the Directorate General for the Environment, Waste Materials Directorate, in behalf of the program of acti

  8. Documents preparation and review

    International Nuclear Information System (INIS)

    The Ignalina Safety Analysis Group takes an active role in assisting the regulatory body VATESI to prepare various regulatory documents and in reviewing safety reports and other documentation presented by the Ignalina NPP in the process of licensing of unit 1. The list of the main documents prepared and reviewed is presented

  9. TEXT MINING – PREREQUISITE FOR KNOWLEDGE MANAGEMENT SYSTEMS

    OpenAIRE

    Dragoº Marcel VESPAN

    2009-01-01

    Text mining is an interdisciplinary field with the main purpose of retrieving new knowledge from large collections of text documents. This paper presents the main techniques used for knowledge extraction through text mining and their main areas of applicability and emphasizes the importance of text mining in knowledge management systems.

  10. Introducing Text Analytics as a Graduate Business School Course

    Science.gov (United States)

    Edgington, Theresa M.

    2011-01-01

    Text analytics refers to the process of analyzing unstructured data from documented sources, including open-ended surveys, blogs, and other types of web dialog. Text analytics has enveloped the concept of text mining, an analysis approach influenced heavily from data mining. While text mining has been covered extensively in various computer…

  11. Script Recovery from Scanned Document Image

    Directory of Open Access Journals (Sweden)

    Dr. Srinivasan K.S

    2012-10-01

    Full Text Available Document digitization with a scanner produces text document images which have distortions that deteriorate the quality of the document. We propose a goal-oriented rectification methodology to recover the document from the distorted document image. Our approach relies upon a coarse-to-fine strategy. First, a coarse rectification is accomplished with the projection of the curved surface on the plane, guided by the appearance of the textual content in the document image, while incorporating a transformation which does not depend on specific model primitives or scanner setup parameters. Secondly, normalization is applied at the word level, aiming to restore all the local distortions of the document image. Experimental results on various document images with a variety of distortions demonstrate the robustness and effectiveness of the proposed rectification methodology, which improves OCR accuracy. It finds wide application in de-warping of document images, images captured from sculptures, cursive handwritten text, text from palm leaves and so on.

  12. Language Independent Document Retrieval Using Unicode Standard

    Directory of Open Access Journals (Sweden)

    Vidhya M

    2014-08-01

    Full Text Available In this paper, we present a method to retrieve documents with unstructured text data written in different languages. Unlike ordinary document retrieval systems, the proposed system can also process queries with terms in more than one language. Unicode, the universally accepted encoding standard, is used to present the data in a common platform while converting the text data into the Vector Space Model. We obtained notable F-measure values in the experiments irrespective of the languages used in documents and queries

  13. Towards Multi Label Text Classification through Label Propagation

    Directory of Open Access Journals (Sweden)

    Shweta C. Dharmadhikari

    2012-06-01

    Full Text Available Classifying text data has been an active area of research for a long time. A text document is a multifaceted object and often inherently ambiguous by nature. Multi-label learning deals with such ambiguous objects. Classification of such ambiguous text objects often makes the task of the classifier difficult while assigning relevant classes to the input document. Traditional single-label and multi-class text classification paradigms cannot efficiently classify such multifaceted text corpora. In this paper we propose a novel label propagation approach based on semi-supervised learning for multi-label text classification. Our proposed approach models the relationships between class labels and also effectively represents input text documents. We use a semi-supervised learning technique for the effective utilization of labeled and unlabeled data for classification. Our proposed approach promises better classification accuracy and handling of complexity, and is evaluated on standard datasets such as Enron, Slashdot and Bibtex.
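
    One simple realization of semi-supervised label propagation for the multi-label case is to build a document similarity graph and repeatedly average the neighbours' label distributions while clamping the labeled documents. This is a generic sketch of the technique under that assumption, not necessarily the exact model proposed above; the similarity matrix and labels are placeholders:

        import numpy as np

        def propagate_labels(sim, labels, iterations=20):
            """sim: (n, n) symmetric document similarity matrix.
            labels: (n, k) matrix; labeled rows are 0/1 indicator vectors,
            unlabeled rows start as zeros and absorb their neighbours' labels."""
            labeled = labels.sum(axis=1) > 0
            P = sim / sim.sum(axis=1, keepdims=True)   # row-normalized transitions
            Y = labels.astype(float)
            for _ in range(iterations):
                Y = P @ Y                      # propagate labels along the graph
                Y[labeled] = labels[labeled]   # clamp the known labels
            return Y                           # per-document label scores in [0, 1]

        sim = np.array([[1.0, 0.8, 0.1],
                        [0.8, 1.0, 0.2],
                        [0.1, 0.2, 1.0]])
        labels = np.array([[1, 0], [0, 0], [0, 1]])   # document 1 is unlabeled
        print(propagate_labels(sim, labels).round(2))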

  14. Extracting Conceptual Feature Structures from Text

    DEFF Research Database (Denmark)

    Andreasen, Troels; Bulskov, Henrik; Jensen, Per Anker;

    2011-01-01

    This paper describes an approach to indexing texts by their conceptual content using ontologies along with lexico-syntactic information and semantic role assignment provided by lexical resources. The conceptual content of meaningful chunks of text is transformed into conceptual feature structures and mapped into concepts in a generative ontology. Synonymous but linguistically quite distinct expressions are mapped to the same concept in the ontology. This allows us to perform a content-based search which will retrieve relevant documents independently of the linguistic form of the query as well...

  15. Choices of texts for literary education

    DEFF Research Database (Denmark)

    Skyggebjerg, Anna Karlskov

    literature studies at universities, where criteria concerning language and form are often more valued than criteria concerning character and content. This tendency to celebrate the formal aspects and the literariness of literature is recognized in governmental documents, teaching materials, and in the... the possibility for positioning pupils/young adults? What does the choice of texts mean for pupils'/young adults' possibilities as readers and individual interpreters? How are the pupils' potentials for envisioning and engaging in literature with certain choices of texts?

  16. Modeling Documents with Event Model

    Directory of Open Access Journals (Sweden)

    Longhui Wang

    2015-08-01

    Full Text Available Currently deep learning has made great breakthroughs in visual and speech processing, mainly because it draws lessons from the hierarchical mode in which the brain deals with images and speech. In the field of NLP, a topic model is one of the important ways of modeling documents. Topic models are built on a generative model that clearly does not match the way humans write. In this paper, we propose the Event Model, which is unsupervised and based on the language processing mechanism of neurolinguistics, to model documents. In the Event Model, documents are descriptions of concrete or abstract events seen, heard, or sensed by people, and words are objects in those events. The Event Model has two stages: word learning and dimensionality reduction. Word learning is to learn the semantics of words based on deep learning. Dimensionality reduction is the process of representing a document as a low-dimensional vector by a linear model that is completely different from topic models. The Event Model achieves state-of-the-art results on document retrieval tasks.

  17. Handwriting segmentation of unconstrained Oriya text

    Indian Academy of Sciences (India)

    N Tripathy; U Pal

    2006-12-01

    Segmentation of handwritten text into lines, words and characters is one of the important steps in the handwritten text recognition process. In this paper we propose a water reservoir concept-based scheme for segmentation of unconstrained Oriya handwritten text into individual characters. Here, at first, the text image is segmented into lines, and the lines are then segmented into individual words. For line segmentation, the document is divided into vertical stripes. Analysing the heights of the water reservoirs obtained from different components of the document, the width of a stripe is calculated. Stripe-wise horizontal histograms are then computed and the relationship of the peak–valley points of the histograms is used for line segmentation. Based on vertical projection profiles and structural features of Oriya characters, text lines are segmented into words. For character segmentation, at first, the isolated and connected (touching) characters in a word are detected. Using structural, topological and water reservoir concept-based features, characters of the word that touch are then segmented. From experiments we have observed that the proposed “touching character” segmentation module has 96·7% accuracy for two-character touching strings.
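
    The projection-profile idea at the heart of the line-segmentation step can be sketched in a simplified, single-stripe form as follows; the stripe subdivision and the water-reservoir features of the full method are not reproduced here, and text pixels are assumed to be 1 in a binary page image:

        import numpy as np

        def segment_lines(binary_img, min_gap=2):
            """Split a binary page image (text pixels = 1) into horizontal
            text lines by finding gaps in the row-wise projection profile."""
            profile = binary_img.sum(axis=1)       # amount of ink per row
            lines, start, blank = [], None, 0
            for y, ink in enumerate(profile):
                if ink > 0:
                    if start is None:
                        start = y                  # a new text line begins
                    blank = 0
                elif start is not None:
                    blank += 1
                    if blank >= min_gap:           # enough empty rows: line ends
                        lines.append((start, y - blank + 1))
                        start = None
            if start is not None:                  # line running to the bottom
                lines.append((start, len(profile)))
            return [binary_img[a:b] for a, b in lines]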

  18. Document Classification Using Support Vector Machine

    OpenAIRE

    Shweta Mayor; Bhasker Pant

    2012-01-01

    Information like news feeds is generally stored in the form of documents and files created on the basis of daily occurrences in the world. Classifying unstructured text in these large document corpora has become cumbersome, and efficiently and effectively retrieving and categorizing these documents is a hard task to perform. This research paper discusses in detail the implementation of a Support Vector Machine (SVM) for calculating the term frequency of the features used as Sports, Business and Entertai...

  19. Classroom Texting in College Students

    Science.gov (United States)

    Pettijohn, Terry F.; Frazier, Erik; Rieser, Elizabeth; Vaughn, Nicholas; Hupp-Wilds, Bobbi

    2015-01-01

    A 21-item survey on texting in the classroom was given to 235 college students. Overall, 99.6% of students owned a cellphone and 98% texted daily. Of the 138 students who texted in the classroom, most texted friends or significant others, and indicated the reason for classroom texting is boredom or work. Students who texted sent a mean of 12.21…

  20. Text Clustering Using a Suffix Tree Similarity Measure

    Directory of Open Access Journals (Sweden)

    Chenghui HUANG

    2011-10-01

    Full Text Available In the text mining area, popular methods use bag-of-words models, which represent a document as a vector. These methods ignore word sequence information, and their good clustering results are limited to some special domains. This paper proposes a new similarity measure based on the suffix tree model of text documents. It analyzes the word sequence information, and then computes the similarity between the text documents of a corpus by applying a suffix tree similarity combined with the TF-IDF weighting method. Experimental results on the standard document benchmark corpora Reuters and BBC indicate that the new text similarity measure is effective. Compared with the results of the other two frequent-word-sequence-based methods, our proposed method achieves an improvement of about 15% in average F-measure score.
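
    A full suffix tree implementation is beyond a short example; the sketch below keeps the key idea of scoring shared word sequences with TF-IDF-style weights, but replaces suffix-tree paths with explicit word n-grams (a simplification for illustration, not the authors' algorithm):

        import math
        from collections import Counter

        def word_ngrams(text, n_max=3):
            """Count all word n-grams of length 1..n_max in a text."""
            words = text.lower().split()
            return Counter(tuple(words[i:i + n])
                           for n in range(1, n_max + 1)
                           for i in range(len(words) - n + 1))

        def phrase_similarity(a, b, corpus):
            """Cosine over shared word n-grams, each weighted by TF * IDF;
            shared n-grams approximate common suffix-tree paths."""
            ga, gb = word_ngrams(a), word_ngrams(b)
            n = len(corpus)
            def idf(g):
                df = sum(1 for doc in corpus if g in word_ngrams(doc))
                return math.log((n + 1) / (df + 1)) + 1
            wa = {g: c * idf(g) for g, c in ga.items()}
            wb = {g: c * idf(g) for g, c in gb.items()}
            num = sum(wa[g] * wb[g] for g in set(wa) & set(wb))
            den = math.sqrt(sum(v * v for v in wa.values())) * \
                  math.sqrt(sum(v * v for v in wb.values()))
            return num / den if den else 0.0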

  1. A Novel Approach For Syntactic Similarity Between Two Short Text

    Directory of Open Access Journals (Sweden)

    Anterpreet Kaur

    2015-06-01

    Full Text Available Syntactic similarity is an important task in text document data mining, natural language processing and information retrieval. Natural language processing (NLP) enables intelligent machines to handle text both in natural languages such as English and in computer languages such as C. Web mining is used for tasks such as document clustering and community mining performed on the web. However, finding the similarity between two documents is a difficult task, so the increasing scope of NLP requires techniques for dealing with many aspects of language, in particular syntax, semantics and paradigms.
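
    As a minimal baseline for such a measure (an invented example, not the author's method), token-set overlap already gives a crude syntactic similarity score for two short texts:

      def jaccard(a: str, b: str) -> float:
          # Token-level Jaccard overlap; word order is ignored.
          ta, tb = set(a.lower().split()), set(b.lower().split())
          return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

      print(jaccard("natural language processing for short text",
                    "short text similarity in natural language"))  # 0.5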

  2. Health physics documentation

    International Nuclear Information System (INIS)

    When dealing with radioactive material the health physicist receives innumerable papers and documents within the fields of researching, prosecuting, organizing and justifying radiation protection. Some of these papers are requested by the health physicist and some are required by law. The scope, quantity and deposit periods of the health physics documentation at the Karlsruhe Nuclear Research Center are presented and rationalizing methods discussed. The aim of this documentation should be the application of physics to accident prevention, i.e. documentation should protect those concerned and not the health physicist. (H.K.)

  3. Document control program (DCP)

    Energy Technology Data Exchange (ETDEWEB)

    Burger, M.J.

    1978-01-01

    The management and control of classified and unclassified documents is tedious, time consuming, and error-prone. DCP is a simple, inexpensive, but effective program for the Livermore Time Sharing System and is written in TRIX and TRIX AC. It is used to computerize the classified document control task with a completely self-contained program requiring essentially no modifications or programmer support to implement or maintain. DCP provides a complete dialect to prepare interactively the input data, update the document master file, and interrogate and retrieve any information desired from the document file. 2 figures. (RWR)

  4. Text Steganography with Multi level Shielding

    Directory of Open Access Journals (Sweden)

    Sharon Rose Govada

    2012-07-01

    Full Text Available Steganography is a form of security through obscurity. It is the art and science of writing hidden messages in such a way that no one except the sender and intended recipient can understand the hidden message. The purpose of steganography is covert communication: to hide the existence of a message from a third party. Compared with the study of text steganography, research on text steganalysis is in its infancy. In this paper, we present a method that performs text steganography more reliably and securely than existing algorithms. Our method is a combination of word shifting, text steganography and synonym text steganography, so we call it “Three-Phase Shielding Text Steganography”. This method overcomes various limitations faced by existing steganographic algorithms, and the experimental results are very encouraging when compared to them. Our method also helps in finding the embedding rate of a secret message in a text document.
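
    For illustration only, the sketch below hides bits in inter-word spacing, a much simpler classic text-steganography technique than the paper's three-phase scheme: one space between words encodes a 0, two spaces encode a 1.

      def embed(cover: str, bits: str) -> str:
          # One inter-word gap carries one bit: " " -> 0, "  " -> 1.
          words = cover.split()
          assert len(bits) <= len(words) - 1, "cover text too short for message"
          bits = bits.ljust(len(words) - 1, "0")     # pad; receiver knows length
          out = [words[0]]
          for word, bit in zip(words[1:], bits):
              out.append((" " if bit == "0" else "  ") + word)
          return "".join(out)

      def extract(stego: str, n_bits: int) -> str:
          # Splitting on single spaces turns each double space into an empty token.
          tokens, bits, j = stego.split(" "), [], 1
          while j < len(tokens) and len(bits) < n_bits:
              if tokens[j] == "":
                  bits.append("1"); j += 2
              else:
                  bits.append("0"); j += 1
          return "".join(bits)

      stego = embed("the quick brown fox jumps over the lazy dog", "10110")
      print(extract(stego, 5))   # -> 10110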

  5. Text-Attentional Convolutional Neural Network for Scene Text Detection.

    Science.gov (United States)

    He, Tong; Huang, Weilin; Qiao, Yu; Yao, Jian

    2016-06-01

    Recent deep learning models have demonstrated strong capabilities for classifying text and non-text components in natural images. They extract a high-level feature globally computed from a whole image component (patch), where the cluttered background information may dominate true text features in the deep representation. This leads to less discriminative power and poorer robustness. In this paper, we present a new system for scene text detection by proposing a novel text-attentional convolutional neural network (Text-CNN) that particularly focuses on extracting text-related regions and features from the image components. We develop a new learning mechanism to train the Text-CNN with multi-level and rich supervised information, including text region mask, character label, and binary text/non-text information. The rich supervision information enables the Text-CNN with a strong capability for discriminating ambiguous texts, and also increases its robustness against complicated background components. The training process is formulated as a multi-task learning problem, where low-level supervised information greatly facilitates the main task of text/non-text classification. In addition, a powerful low-level detector called contrast-enhancement maximally stable extremal regions (MSERs) is developed, which extends the widely used MSERs by enhancing intensity contrast between text patterns and background. This allows it to detect highly challenging text patterns, resulting in a higher recall. Our approach achieved promising results on the ICDAR 2013 data set, with an F-measure of 0.82, substantially improving the state-of-the-art results. PMID:27093723

  6. Automatic handwriting identification on medieval documents

    NARCIS (Netherlands)

    Bulacu, M.L.; Schomaker, L.R.B.

    2007-01-01

    In this paper, we evaluate the performance of text-independent writer identification methods on a handwriting dataset containing medieval English documents. Applicable identification rates are achieved by combining textural features (joint directional probability distributions) with allographic features.

  7. Inclusivity, Gestalt Principles, and Plain Language in Document Design

    Directory of Open Access Journals (Sweden)

    Jennifer Turner

    2016-06-01

    Full Text Available In Brief: Good design makes documents easier to use, helps documents stand out from other pieces of information, and lends credibility to document creators. Librarians across library types and departments provide instruction and training materials to co-workers and library users. For these materials to be readable and accessible, they must follow guidelines for usable document design. […]

  8. Text Mining Infrastructure in R

    OpenAIRE

    Kurt Hornik; Ingo Feinerer; David Meyer

    2008-01-01

    During the last decade text mining has become a widely used discipline utilizing statistical and machine learning methods. We present the tm package which provides a framework for text mining applications within R. We give a survey on text mining facilities in R and explain how typical application tasks can be carried out using our framework. We present techniques for count-based analysis methods, text clustering, text classification and string kernels. (authors' abstract)

  9. FENDL/A-MCNP and FENDL/A-175G. The processed neutron activation cross-section data files of the FENDL project. Summary documentation

    International Nuclear Information System (INIS)

    This document summarises a neutron activation cross-section database processed in two formats, as generated by F.M. Mann within the project of the Fusion Evaluated Nuclear Data Library (FENDL): FENDL/PA, in the continuous-energy format used by the Monte Carlo neutron/photon transport code MCNP; and FENDL/PA-175G, in the ASCII 175-group multigroup format used by the transmutation code REAC*2/3. The data are available online from the IAEA Nuclear Data Section via FTP over the Internet. (author)

  10. Text analysis devices, articles of manufacture, and text analysis methods

    Science.gov (United States)

    Turner, Alan E; Hetzler, Elizabeth G; Nakamura, Grant C

    2013-05-28

    Text analysis devices, articles of manufacture, and text analysis methods are described according to some aspects. In one aspect, a text analysis device includes processing circuitry configured to analyze initial text to generate a measurement basis usable in analysis of subsequent text, wherein the measurement basis comprises a plurality of measurement features from the initial text, a plurality of dimension anchors from the initial text and a plurality of associations of the measurement features with the dimension anchors, and wherein the processing circuitry is configured to access a viewpoint indicative of a perspective of interest of a user with respect to the analysis of the subsequent text, and wherein the processing circuitry is configured to use the viewpoint to generate the measurement basis.

  11. Handwritten Text Image Authentication using Back Propagation

    CERN Document Server

    Chakravarthy, A S N; Avadhani, P S

    2011-01-01

    Authentication is the act of confirming the truth of an attribute of a datum or entity. This might involve confirming the identity of a person, tracing the origins of an artefact, ensuring that a product is what its packaging and labelling claims it to be, or assuring that a computer program is a trusted one. The authentication of information can pose special problems (especially man-in-the-middle attacks) and is often wrapped up with authenticating identity. Literary forgery can involve imitating the style of a famous author. If an original manuscript, typewritten text, or recording is available, then the medium itself (or its packaging - anything from a box to e-mail headers) can help prove or disprove the authenticity of the document. The use of digital images of handwritten historical documents has become more popular in recent years. Volunteers around the world now read thousands of these images as part of their indexing process. Handwritten text images of old documents are sometimes difficult to read or noisy du...

  12. Contrastive Study of Coherence in Chinese Text and English Text

    Institute of Scientific and Technical Information of China (English)

    王婷

    2013-01-01

    The paper presents the text-linguistic concepts on which the analysis of textual structure is based, including text and discourse, and coherence and cohesion. In addition, we try to discover different manifestations of text structure between English texts (ET) and Chinese texts (CT), including different coherence structures.

  13. Metamorphoses d'un texte (Metamorphoses of a Text).

    Science.gov (United States)

    Meitinger, Guy Roger

    1993-01-01

    A variety of exercises based on manipulation of a single text are described. The activities involve replacing words or phrases in the text with synonyms or opposites, transposing gender, changing tenses, filling in blanks, and answering multiple-choice questions about linguistic forms. Three brief sample texts are offered. (MSE)

  14. Automatic document classification of biological literature

    OpenAIRE

    Sternberg Paul W; Müller Hans-Michael; Chen David

    2006-01-01

    Abstract Background Document classification is a wide-spread problem with many applications, from organizing search engine snippets to spam filtering. We previously described Textpresso, a text-mining system for biological literature, which marks up full text according to a shallow ontology that includes terms of biological interest. This project investigates document classification in the context of biological literature, making use of the Textpresso markup of a corpus of Caenorhabditis eleg...

  15. Illumination Compensation Algorithm for Unevenly Lighted Document Segmentation

    Directory of Open Access Journals (Sweden)

    Ju Zhiyong

    2013-07-01

    Full Text Available For the problem of segmenting unevenly lighted document images, this paper proposes an illumination compensation segmentation algorithm that can effectively segment unevenly lighted documents. The illumination compensation method equivalently converts an unevenly lighted document image into an evenly lighted one, which is then segmented directly. Experimental results show that the proposed method obtains accurate evenly lighted document images, so the document can be segmented accurately, and that it processes unevenly lighted document images more efficiently than traditional binarization methods. The algorithm effectively overcomes the difficulty in handling uneven lighting and considerably enhances segmentation quality.
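
    One common way to implement such compensation (a sketch with OpenCV; the paper's own estimator may differ in detail) is to estimate the slowly varying background and divide it out before a global threshold:

      import cv2
      import numpy as np

      def compensate_illumination(gray: np.ndarray, kernel: int = 51) -> np.ndarray:
          # Estimate the low-frequency lighting field with a large median filter.
          background = cv2.medianBlur(gray, kernel)
          # Divide it out so the page appears evenly lit (scaled back to 0-255).
          flat = cv2.divide(gray, background, scale=255)
          # A single global Otsu threshold now suffices for segmentation.
          _, binary = cv2.threshold(flat, 0, 255,
                                    cv2.THRESH_BINARY | cv2.THRESH_OTSU)
          return binary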

  16. Significant Attributes of Documents.

    Science.gov (United States)

    Armstrong, Frances T.

    The purpose of this paper is to describe a method of finding the significant attributes of documents established during the course of research on the automatic classification of documents. The problem was first approached by examining the way in which an existing hierarchical classification system classifies things. The study of biological…

  17. INFCE plenary conference documents

    International Nuclear Information System (INIS)

    This document consists of the reports to the First INFCE Plenary Conference (November 1978) by the Working Groups, a summary by the Plenary Conference of its actions and decisions, the Communique of the Final INFCE Plenary Conference (February 1980), and a list of all documents in the IAEA depository for INFCE

  18. IDC System Specification Document.

    Energy Technology Data Exchange (ETDEWEB)

    Clifford, David J.

    2014-12-01

    This document contains the system specifications derived to satisfy the system requirements found in the IDC System Requirements Document for the IDC Reengineering Phase 2 project. Revision history: V1.0, 12/2014, IDC Reengineering Project Team - initial delivery, authorized by M. Harris.

  19. Document image analysis

    CERN Document Server

    Bunke, H; Baird, H

    1994-01-01

    Interest in the automatic processing and analysis of document images has been rapidly increasing during the past few years. This book addresses the different subfields of document image analysis, including preprocessing and segmentation, form processing, handwriting recognition, line drawing and map processing, and contextual processing.

  20. Research on Text Mining Based on Domain Ontology

    OpenAIRE

    Li-hua, Jiang; Neng-fu, Xie; Hong-bin, Zhang

    2013-01-01

    This paper improves on traditional text mining technology, which cannot understand text semantics. The author discusses text mining methods based on ontology and puts forward a text mining model based on domain ontology. The ontology structure is built first and a “concept-concept” similarity matrix is introduced; then a concept vector space model based on domain ontology is used in place of the traditional vector space model to represent documents, in order to realize text m...

  1. Investigating text message classification using case-based reasoning

    OpenAIRE

    Healy, Matt, (Thesis)

    2007-01-01

    Text classification is the categorization of text into a predefined set of categories. Text classification is becoming increasingly important given the large volume of text stored electronically e.g. email, digital libraries and the World Wide Web (WWW). These documents represent a massive amount of information that can be accessed easily. To gain benefit from using this information requires organisation. One way of organising it automatically is to use text classification. A number of well k...

  2. The Use of Bigrams To Enhance Text Categorization.

    Science.gov (United States)

    Tan, Chade-Meng; Wang, Yuan-Fang; Lee, Chan-Do

    2002-01-01

    Presents an efficient text categorization (or text classification) algorithm for document retrieval of natural language texts that generates bigrams (two-word phrases) and uses the information gain metric, combined with various frequency thresholds. Experimental results suggest that the bigrams can substantially raise the quality of feature sets.…
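
    A small sketch of the two ingredients, bigram features and the information gain metric (the toy data is invented for illustration):

      import math
      from collections import Counter

      def bigrams(text):
          words = text.lower().split()
          return {" ".join(p) for p in zip(words, words[1:])}

      def entropy(counts):
          total = sum(counts.values())
          return -sum(c / total * math.log2(c / total) for c in counts.values() if c)

      def information_gain(docs, labels, feature):
          # IG of a bigram = H(class) - H(class | feature present/absent).
          base = entropy(Counter(labels))
          present = [l for d, l in zip(docs, labels) if feature in bigrams(d)]
          absent = [l for d, l in zip(docs, labels) if feature not in bigrams(d)]
          cond = 0.0
          for split in (present, absent):
              if split:
                  cond += len(split) / len(docs) * entropy(Counter(split))
          return base - cond

      docs = ["stock market rises", "stock market falls",
              "home team wins", "away team loses"]
      labels = ["business", "business", "sports", "sports"]
      print(information_gain(docs, labels, "stock market"))  # 1.0: perfectly separating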

  3. IR and OLAP in XML document warehouses

    DEFF Research Database (Denmark)

    Perez, Juan Manuel; Pedersen, Torben Bach; Berlanga, Rafael;

    2005-01-01

    In this paper we propose to combine IR and OLAP (On-Line Analytical Processing) technologies to exploit a warehouse of text-rich XML documents. In the system we plan to develop, a multidimensional implementation of a relevance modeling document model will be used for interactively querying the warehouse by allowing navigation in the structure of documents and in a concept hierarchy of query terms. The facts described in the relevant documents will be ranked and analyzed in a novel OLAP cube model able to represent and manage facts with relevance indexes.

  4. Text mining from ontology learning to automated text processing applications

    CERN Document Server

    Biemann, Chris

    2014-01-01

    This book comprises a set of articles that specify the methodology of text mining, describe the creation of lexical resources in the framework of text mining and use text mining for various tasks in natural language processing (NLP). The analysis of large amounts of textual data is a prerequisite to build lexical resources such as dictionaries and ontologies and also has direct applications in automated text processing in fields such as history, healthcare and mobile applications, just to name a few. This volume gives an update in terms of the recent gains in text mining methods and reflects

  5. Author Gender Identification from Text

    OpenAIRE

    Rezaei, Atoosa Mohammad

    2014-01-01

    ABSTRACT: The identification of an author's gender from a text has become a popular research area within the scope of text categorization. The number of users of social network applications based on text, such as Twitter, Facebook and text messaging services, has grown rapidly over the past few decades. As a result, text has become one of the most important and prevalent media types on the Internet. This thesis aims to determine the gender of an author from an arbitrary piece of text such as,...

  6. Using LSA and text segmentation to improve automatic Chinese dialogue text summarization

    Institute of Scientific and Technical Information of China (English)

    LIU Chuan-han; WANG Yong-cheng; ZHENG Fei; LIU De-rong

    2007-01-01

    Automatic Chinese text summarization for dialogue style is a relatively new research area. In this paper, Latent Semantic Analysis (LSA) is first used to extract semantic knowledge from a given document and all question paragraphs are identified; an automatic text segmentation approach analogous to TextTiling is then exploited to improve the precision of correlating question paragraphs and answer paragraphs, and finally some “important” sentences are extracted from the generic content and the question-answer pairs to generate a complete summary. Experimental results showed that our approach is highly efficient and significantly improves the coherence of the summary while not compromising informativeness.
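
    A compressed sketch of the LSA ingredient with scikit-learn, scoring sentences by their weight in the latent semantic space (the question-paragraph pairing and TextTiling steps of the paper are not shown; the sentences are invented):

      import numpy as np
      from sklearn.decomposition import TruncatedSVD
      from sklearn.feature_extraction.text import TfidfVectorizer

      sentences = ["the committee discussed the budget deficit",
                   "members asked how the deficit would be reduced",
                   "the chair proposed cutting travel expenses",
                   "travel expenses were the largest single cost",
                   "the weather that day was unusually warm"]

      X = TfidfVectorizer().fit_transform(sentences)       # sentence-term matrix
      svd = TruncatedSVD(n_components=2).fit(X)            # latent semantic space
      salience = np.linalg.norm(svd.transform(X), axis=1)  # weight per sentence
      top = sorted(np.argsort(salience)[-2:])              # keep document order
      print([sentences[i] for i in top])                   # extractive summary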

  7. Extracting information from free-text mammography reports

    OpenAIRE

    Esuli, Andrea; Marcheggiani, Diego; Sebastiani, Fabrizio

    2010-01-01

    Researchers from ISTI-CNR, Pisa, aim at effectively and efficiently extracting information from free-text mammography reports, as a step towards the automatic transformation of unstructured medical documentation into structured data.

  8. The Texts of the Agency's Relationship Agreements with Specialized Agencies

    International Nuclear Information System (INIS)

    The text of the relationship agreement which the Agency has concluded with the Inter-Governmental Maritime Consultative Organization, together with the protocol authenticating it, is reproduced in this document for the information of all Members of the Agency

  9. Document reconstruction by layout analysis of snippets

    Science.gov (United States)

    Kleber, Florian; Diem, Markus; Sablatnig, Robert

    2010-02-01

    Document analysis is done to analyze entire forms (e.g. intelligent form analysis, table detection) or to describe the layout/structure of a document. Skew detection of scanned documents is also performed to support OCR algorithms that are sensitive to skew. In this paper document analysis is applied to snippets of torn documents to calculate features for their reconstruction. Documents can be destroyed either intentionally, to make the printed content unavailable (e.g. tax fraud investigation, business crime), or through time-induced degeneration of ancient documents (e.g. bad storage conditions). Current reconstruction methods for manually torn documents deal with shape matching, inpainting and texture synthesis techniques. This paper shows how document analysis techniques applied to snippets can support the matching algorithm with additional features, namely rotational analysis, color analysis and line detection. As future work it is planned to extend the feature set with the paper type (blank, checked, lined), the type of writing (handwritten vs. machine printed) and the text layout of a snippet (text size, line spacing). Preliminary results show that these pre-processing steps can be performed reliably on a real dataset consisting of 690 snippets.

  10. Methodological Aspects of Architectural Documentation

    Directory of Open Access Journals (Sweden)

    Arivaldo Amorim

    2011-12-01

    Full Text Available This paper discusses the methodological approach that has been developed in the state of Bahia, Brazil, since 2003 for the documentation of architectural and urban sites using extensive digital technologies. Bahia has a vast territory with important architectural ensembles ranging from the sixteenth century to the present day. As part of this heritage is constructed of raw earth and wood, it is very sensitive to various deleterious agents. It is therefore critical to document this collection, which is under threat. To conduct these activities, diverse digital technologies that could be used in the documentation process are being tested. The task is being developed as academic research, with few financial resources, by scholarship students and some volunteers. Several technologies are tested, ranging from the simplest to the more sophisticated ones, in the main stages of the documentation project: overall work planning, data acquisition, processing and management, and finally control and evaluation of the work. The activities that motivated this paper are being conducted in the cities of Rio de Contas and Lençóis in the Chapada Diamantina, located 420 km and 750 km from Salvador respectively, in the city of Cachoeira in the Recôncavo Baiano area, 120 km from Salvador, the capital of Bahia state, and in the Pelourinho neighbourhood of the historic capital. Part of the material produced can be consulted on the website: <www.lcad.ufba.br>.

  11. Text mining: A Brief survey

    OpenAIRE

    Falguni N. Patel , Neha R. Soni

    2012-01-01

    The unstructured texts which contain massive amounts of information cannot simply be used for further processing by computers. Therefore, specific processing methods and algorithms are required in order to extract useful patterns. The process of extracting interesting information and knowledge from unstructured text is accomplished using text mining. In this paper, we discuss text mining as a recent and interesting field, with a detailed account of the steps involved in the overall process. We have...

  12. Clustering and Classification in Text Collections Using Graph Modularity

    OpenAIRE

    Pivovarov, Grigory; Trunov, Sergei

    2011-01-01

    A new fast algorithm for clustering and classification of large collections of text documents is introduced. The new algorithm employs the bipartite graph that realizes the word-document matrix of the collection; namely, the modularity of the bipartite graph is used as the optimization functional. Experiments performed with the new algorithm on a number of text collections showed competitive clustering (classification) quality and record-breaking speed.
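
    A sketch of the idea with NetworkX, using the generic greedy modularity heuristic as a stand-in for the authors' algorithm (the toy corpus is invented, and standard Newman modularity on the bipartite graph is only an approximation of a bipartite-specific functional):

      import networkx as nx
      from networkx.algorithms import community

      docs = {"d1": "solar power grid", "d2": "wind power grid",
              "d3": "movie review stars", "d4": "film review critics"}

      # Bipartite word-document graph realizing the word-document matrix.
      G = nx.Graph()
      for doc, text in docs.items():
          for word in text.split():
              G.add_edge(doc, word)

      # Greedy modularity maximization yields document/word co-clusters.
      parts = community.greedy_modularity_communities(G)
      print(community.modularity(G, parts))
      for p in parts:
          print(sorted(p))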

  13. An Evident Theoretic Feature Selection Approach for Text Categorization

    OpenAIRE

    UMARSATHIC ALI; JOTHI VENKATESWARAN

    2012-01-01

    With the exponential growth of textual documents available in unstructured form on the Internet, feature selection approaches are increasingly significant for the preprocessing of textual documents for automatic text categorization. Feature selection, which focuses on identifying relevant and informative features, can help reduce the computational cost of processing voluminous amounts of data as well as increase the effectiveness of the subsequent text categorization tasks. In this paper, we ...

  14. Multioriented and Curved Text Lines Extraction from Documents

    OpenAIRE

    Vaibhav Gavali; B. R. Bombade

    2013-01-01

    There is a need for robust algorithms to extract text lines from script-independent documents, with colour-independent and font- and size-independent segmentation. This paper presents a simple method to extract curved and multioriented text lines from documents. The input may be a colour or grayscale image. A discrete wavelet transform is applied to the input image to obtain four sub-bands. Thresholding is applied to the three detail sub-bands (horizontal, vertical, diagonal). Edge detection is applied on t...
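
    A sketch of the sub-band step with PyWavelets, assuming a grayscale image held as a NumPy array; the edge detection and line-grouping stages that follow in the paper are omitted:

      import numpy as np
      import pywt

      def detail_mask(gray: np.ndarray, thresh: float = 20.0) -> np.ndarray:
          # One-level 2-D DWT: approximation plus three detail sub-bands.
          cA, (cH, cV, cD) = pywt.dwt2(gray.astype(float), "haar")
          # Threshold the detail sub-bands, where text strokes concentrate.
          mask = (np.abs(cH) > thresh) | (np.abs(cV) > thresh) | (np.abs(cD) > thresh)
          return mask  # half-resolution map of text-candidate pixels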

  15. New Challenges of the Documentation in Media

    Directory of Open Access Journals (Sweden)

    Antonio García Jiménez

    2015-07-01

    Full Text Available This special issue, presented by index.comunicación, focuses on media-related information and documentation. This field undergoes constant and profound changes, especially visible in documentation processes: a situation characterized by the existence of tablets, smartphones and applications, and by the almost complete digitization of traditional documents, in addition to the crisis of the press business model, which involves changes in journalists' tasks and in the relationship between journalists and documentation. Papers included in this special issue focus on some of the concerns in this domain: the progressive autonomy of the journalist in access to information sources, the role of press offices as documentation sources, the search for information on the web, the situation of media blogs, the viability of elements of information architecture in smart TV, and the development of social TV and its connection to documentation.

  16. Informational Text and the CCSS

    Science.gov (United States)

    Aspen Institute, 2012

    2012-01-01

    What constitutes an informational text covers a broad swath of different types of texts. Biographies & memoirs, speeches, opinion pieces & argumentative essays, and historical, scientific or technical accounts of a non-narrative nature are all included in what the Common Core State Standards (CCSS) envisions as informational text. Also included…

  17. Too Dumb for Complex Texts?

    Science.gov (United States)

    Bauerlein, Mark

    2011-01-01

    High school students' lack of experience and practice with reading complex texts is a primary cause of their difficulties with college-level reading. Filling the syllabus with digital texts does little to address this deficiency. Complex texts demand three dispositions from readers: a willingness to probe works characterized by dense meanings, the…

  18. Slippery Texts and Evolving Literacies

    Science.gov (United States)

    Mackey, Margaret

    2007-01-01

    The idea of "slippery texts" provides a useful descriptor for materials that mutate and evolve across different media. Eight adult gamers, encountering the slippery text "American McGee's Alice," demonstrate a variety of ways in which players attempt to manage their attention as they encounter a new text with many resonances. The range of their…

  19. Multilingual Text Analysis for Text-to-Speech Synthesis

    CERN Document Server

    Sproat, R

    1996-01-01

    We present a model of text analysis for text-to-speech (TTS) synthesis based on (weighted) finite-state transducers, which serves as the text-analysis module of the multilingual Bell Labs TTS system. The transducers are constructed using a lexical toolkit that allows declarative descriptions of lexicons, morphological rules, numeral-expansion rules, and phonological rules, inter alia. To date, the model has been applied to eight languages: Spanish, Italian, Romanian, French, German, Russian, Mandarin and Japanese.

  20. Collaborative Document Management Systems

    OpenAIRE

    Viisainen, Harri

    2013-01-01

    The volume of electronic documents that a company has to manage nowadays is high, so systematic document management plays a significant role in the company's working processes. The Internet has made it possible to acquire software as a service that operates in a cloud environment, in which case the company has the use of the software and pays only for that use. This study gathered the requirements for a cloud-based document management system. The purpose of the study wa...

  1. Monitoring interaction and collective text production through text mining

    Directory of Open Access Journals (Sweden)

    Macedo, Alexandra Lorandi

    2014-04-01

    Full Text Available This article presents the Concepts Network tool, developed using text mining technology. The main objective of this tool is to extract the terms of greatest incidence from a text, relate them, and exhibit the results in the form of a graph. The Network was implemented in the Collective Text Editor (CTE), an online tool that allows the production of texts in synchronized or non-synchronized forms. This article describes the application of the Network both to texts produced collectively and to texts produced in a forum. The purpose of the tool is to support the teacher in managing the high volume of data generated in the process of interaction amongst students and in the construction of the text. Specifically, the aim is to facilitate the teacher's job by allowing him/her to process data in a shorter time than is currently required. The results suggest that the Concepts Network can aid the teacher, as it provides indicators of the quality of the text produced. Moreover, messages posted in forums can be analyzed without their content necessarily having to be pre-read.

  2. Predicting Prosody from Text for Text-to-Speech Synthesis

    CERN Document Server

    Rao, K Sreenivasa

    2012-01-01

    Predicting Prosody from Text for Text-to-Speech Synthesis covers the specific aspects of prosody, mainly focusing on how to predict the prosodic information from linguistic text, and then how to exploit the predicted prosodic knowledge for various speech applications. Author K. Sreenivasa Rao discusses proposed methods along with state-of-the-art techniques for the acquisition and incorporation of prosodic knowledge for developing speech systems. Positional, contextual and phonological features are proposed for representing the linguistic and production constraints of the sound units present in the text. This book is intended for graduate students and researchers working in the area of speech processing.

  3. Survey on Feature Selection in Document Clustering

    Directory of Open Access Journals (Sweden)

    MS. K.Mugunthadevi,

    2011-03-01

    Full Text Available Text mining researches technologies to discover useful knowledge from enormous collections of documents and to develop systems that provide knowledge and support decision making. A cluster is a group of similar data, and document clustering means segregating data into different groups of similar data. Clustering is a fundamental data analysis technique used for various applications such as biology, psychology, control and signal processing, information theory and mining technologies. Text mining is not a stand-alone task that human analysts typically engage in. The goal is to transform text composed of everyday language into a structured database format; in this way, heterogeneous documents are summarized and presented in a uniform manner. Among others, the challenging problems of text clustering are big volume, high dimensionality and complex semantics.

  4. Text mining: A Brief survey

    Directory of Open Access Journals (Sweden)

    Falguni N. Patel , Neha R. Soni

    2012-12-01

    Full Text Available The unstructured texts which contain massive amounts of information cannot simply be used for further processing by computers. Therefore, specific processing methods and algorithms are required in order to extract useful patterns. The process of extracting interesting information and knowledge from unstructured text is accomplished using text mining. In this paper, we discuss text mining as a recent and interesting field, with a detailed account of the steps involved in the overall process. We also discuss different technologies that teach computers natural language so that they may analyze, understand, and even generate text. In addition, we briefly discuss a number of successful applications of text mining which are used currently and in the future.

  5. TEXT DEIXIS IN NARRATIVE SEQUENCES

    Directory of Open Access Journals (Sweden)

    Josep Rivera

    2007-06-01

    Full Text Available This study looks at demonstrative descriptions, regarding them as text-deictic procedures which contribute to weaving discourse reference. Text deixis is thought of as a metaphorical referential device which maps the ground of utterance onto the text itself. Demonstrative expressions with textual antecedent-triggers, considered the most important text-deictic units, are identified in a narrative corpus consisting of J. M. Barrie’s Peter Pan and its translation into Catalan. Some linguistic and discourse variables related to DemNPs are analysed to characterise text deixis adequately. It is shown that this referential device is usually combined with abstract nouns, thus categorising and encapsulating (non-nominal) complex discourse entities as nouns, while performing a referential cohesive function by means of the “text deixis + general noun” type of lexical cohesion.

  6. Document segmentation for high-quality printing

    Science.gov (United States)

    Ancin, Hakan

    1997-04-01

    A technique to segment dark text on the light background of mixed-mode color documents is presented. This process does not perceptually change graphics and photo regions. Color documents are scanned and printed from various media which usually do not have a clean background. This is especially the case for printouts generated from thin magazine samples; these printouts usually include text and figures from the back of the page, which is called bleeding. Removal of bleeding artifacts improves the perceptual quality of the printed document and reduces color ink usage. By detecting the light background of the document, these artifacts are removed from background regions. Detection of dark text regions also enables the halftoning algorithms to use true black ink for black text pixels instead of composite black. The processed document contains sharp black text on a white background, resulting in improved perceptual quality and better ink utilization. The described method is memory efficient and requires a small number of scan lines of high-resolution color documents during processing.

  7. Integrated Criteria document Chlorophenols

    NARCIS (Netherlands)

    Slooff W; Bremmer HJ; Janus JA; Matthijsen AJCM; van Beelen P; van den Berg R; Bloemen HJT; Canton JH; Eerens HC; Hrubec J; Janssens H; Jumelet JC; Knaap AGAC; de Leeuw FAAM; van der Linden AMA; Loch JPG; van Loveren H; Peijnenburg WJGM; Piersma AH; Struijs J; Taalman RDFM; Theelen RMC; van der Velde JMA; Verburgh JJ; Versteegh JFM; van der Woerd KF

    1991-01-01

    This report is accompanied by an appendix under the same number, entitled "Integrated Criteria Document Chlorophenols: Effects". Authors: Janus JA; Taalman RDFM; Theelen RMC. This is the English edition of 710401003.

  8. NCDC Archive Documentation Manuals

    Data.gov (United States)

    National Oceanic and Atmospheric Administration, Department of Commerce — The National Climatic Data Center Tape Deck Documentation library is a collection of over 400 manuals describing NCDC's digital holdings (both historic and...

  9. Transportation System Requirements Document

    Energy Technology Data Exchange (ETDEWEB)

    1993-09-01

    This Transportation System Requirements Document (Trans-SRD) describes the functions to be performed by and the technical requirements for the Transportation System to transport spent nuclear fuel (SNF) and high-level radioactive waste (HLW) from Purchaser and Producer sites to a Civilian Radioactive Waste Management System (CRWMS) site, and between CRWMS sites. The purpose of this document is to define the system-level requirements for Transportation consistent with the CRWMS Requirement Document (CRD). These requirements include design and operations requirements to the extent they impact on the development of the physical segments of Transportation. The document also presents an overall description of Transportation, its functions, its segments, and the requirements allocated to the segments and the system-level interfaces with Transportation. The interface identification and description are published in the CRWMS Interface Specification.

  10. Registration document 2005

    International Nuclear Information System (INIS)

    This reference document of Gaz de France provides information and data on the Group's activities in 2005: financial information, business activities, equipment, factories and real estate, trade, capital, organization charts, employment, contracts and research programs. (A.L.B.)

  11. Transportation System Requirements Document

    International Nuclear Information System (INIS)

    This Transportation System Requirements Document (Trans-SRD) describes the functions to be performed by and the technical requirements for the Transportation System to transport spent nuclear fuel (SNF) and high-level radioactive waste (HLW) from Purchaser and Producer sites to a Civilian Radioactive Waste Management System (CRWMS) site, and between CRWMS sites. The purpose of this document is to define the system-level requirements for Transportation consistent with the CRWMS Requirement Document (CRD). These requirements include design and operations requirements to the extent they impact on the development of the physical segments of Transportation. The document also presents an overall description of Transportation, its functions, its segments, and the requirements allocated to the segments and the system-level interfaces with Transportation. The interface identification and description are published in the CRWMS Interface Specification

  12. Wilmar joint market model, Documentation

    Energy Technology Data Exchange (ETDEWEB)

    Meibom, P.; Larsen, Helge V. [Risoe National Lab. (Denmark); Barth, R.; Brand, H. [IER, Univ. of Stuttgart (Germany); Weber, C.; Voll, O. [Univ. of Duisburg-Essen (Germany)

    2006-01-15

    The Wilmar Planning Tool is developed in the project Wind Power Integration in Liberalised Electricity Markets (WILMAR) supported by EU (Contract No. ENK5-CT-2002-00663). A User Shell implemented in an Excel workbook controls the Wilmar Planning Tool. All data are contained in Access databases that communicate with various sub-models through text files that are exported from or imported to the databases. The Joint Market Model (JMM) constitutes one of these sub-models. This report documents the Joint Market model (JMM). The documentation describes: 1. The file structure of the JMM. 2. The sets, parameters and variables in the JMM. 3. The equations in the JMM. 4. The looping structure in the JMM. (au)

  13. Wilmar joint market model, Documentation

    International Nuclear Information System (INIS)

    The Wilmar Planning Tool is developed in the project Wind Power Integration in Liberalised Electricity Markets (WILMAR) supported by EU (Contract No. ENK5-CT-2002-00663). A User Shell implemented in an Excel workbook controls the Wilmar Planning Tool. All data are contained in Access databases that communicate with various sub-models through text files that are exported from or imported to the databases. The Joint Market Model (JMM) constitutes one of these sub-models. This report documents the Joint Market model (JMM). The documentation describes: 1. The file structure of the JMM. 2. The sets, parameters and variables in the JMM. 3. The equations in the JMM. 4. The looping structure in the JMM. (au)

  14. Endangered Language Documentation and Transmission

    Directory of Open Access Journals (Sweden)

    D. Victoria Rau

    2007-01-01

    Full Text Available This paper describes an on-going project on digitally archiving Yami language documentation (http://www.hrelp.org/grants/projects/index.php?projid=60). We present a cross-disciplinary approach, involving computer science and applied linguistics, to document the Yami language and prepare teaching materials. Our discussion begins with an introduction to an integrated framework for archiving, processing and developing learning materials for Yami (Yang and Rau 2005), followed by a historical account of Yami language teaching, from a grammatical syllabus (Dong and Rau 2000b) to a communicative syllabus using a multimedia CD as a resource (Rau et al. 2005), to the development of interactive on-line learning based on the digital archiving project. We discuss the methods used and the challenges of each stage of preparing Yami teaching materials, and present a proposal for rethinking pedagogical models for e-learning.

  15. Mining knowledge from text repositories using information extraction: A review

    Indian Academy of Sciences (India)

    Sandeep R Sirsat; Dr Vinay Chavan; Dr Shrinivas P Deshpande

    2014-02-01

    There are two approaches to mining text from online repositories. First, when the knowledge to be discovered is expressed directly in the documents to be mined, Information Extraction (IE) alone can serve as an effective tool for such text mining. Second, when the documents contain concrete data in unstructured form rather than abstract knowledge, IE can be used to first transform the unstructured data in the document corpus into a structured database, and then some state-of-the-art data mining algorithms/tools can be used to identify abstract patterns in this extracted data. This paper presents a review of several methods related to these two approaches.
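
    A toy sketch of the second approach, turning unstructured text into a structured record with hand-written extraction rules (the report text, field names and patterns below are invented for illustration):

      import re

      report = ("Mass in left breast, 12 mm, irregular margins. "
                "BI-RADS category 4. Recommend biopsy.")

      patterns = {                      # hand-written extraction rules
          "birads":  r"BI-RADS category (\d)",
          "side":    r"\b(left|right)\b",
          "size_mm": r"(\d+)\s*mm",
      }
      record = {}
      for field, rx in patterns.items():
          m = re.search(rx, report)
          record[field] = m.group(1) if m else None
      # The structured record is now ready for a database / data mining step.
      print(record)   # {'birads': '4', 'side': 'left', 'size_mm': '12'}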

  16. Methods for Mining and Summarizing Text Conversations

    CERN Document Server

    Carenini, Giuseppe; Murray, Gabriel

    2011-01-01

    Due to the Internet Revolution, human conversational data -- in written forms -- are accumulating at a phenomenal rate. At the same time, improvements in speech technology enable many spoken conversations to be transcribed. Individuals and organizations engage in email exchanges, face-to-face meetings, blogging, texting and other social media activities. The advances in natural language processing provide ample opportunities for these "informal documents" to be analyzed and mined, thus creating numerous new and valuable applications. This book presents a set of computational methods

  17. Stroke Briefing: Technical Documentation

    OpenAIRE

    Institute of Public Health in Ireland

    2012-01-01

    A stroke happens when blood flow to a part of the brain is interrupted by a blocked or burst blood vessel. A lack of blood supply can damage brain cells and affect body functions. IPH has systematically estimated and forecast the prevalence of stroke on the island of Ireland. This document details the methods used to calculate these estimates and forecasts.

  18. 2002 reference document

    International Nuclear Information System (INIS)

    This 2002 reference document of the Areva group provides information on the company. Organized in seven chapters, it presents: the persons responsible for the reference document and for auditing the financial statements; information pertaining to the transaction; general information on the company and share capital; information on company operations, changes and future prospects; assets, financial position and financial performance; information on company management and the executive and supervisory boards; and recent developments and future prospects. (A.L.B.)

  19. The Eagle Document

    OpenAIRE

    Oechsler, Monika

    2008-01-01

    The Eagle Document forms the second stage of an ongoing project by artist Monika Oechsler. Oechsler visited Farnham last September with a radical live performance combining modern dance, performance art, experimental music and a falconry display. Stage two of The Eagle Document is the culmination of filmed performance rehearsals and bird flights presented on five screens. The installation examines notions of performance and 'live' art using projections onto multiple screens. T...

  20. Evaluation of online documentation.

    OpenAIRE

    Prophet, C. M.; Krall, M. E.; Budreau, G. K.; Gibbs, T. D.; Walker, K. P.; Eyman, J. M.; Hafner, M. J.

    1998-01-01

    The University of Iowa Hospitals and Clinics (UIHC) implemented an online documentation system for patient care orders in 1994-1996. Developed entirely in-house, the INFORMM NIS (Information Network for Online Retrieval & Medical Management Nursing Information System) features order-generated task lists, defaulted charting responses, computer-generated chart forms, and graphical data displays. To measure the impact of automation on user perceptions, and documentation compliance, completeness,...

  1. Problems and Methods of Source Study of Cinema Documents

    Directory of Open Access Journals (Sweden)

    Grigory N. Lanskoy

    2016-03-01

    Full Text Available The article is devoted to basic problems of the analysis and interpretation of cinema documents in historical studies, among them the possibility of a shared approach to the study of cinema and paper documents, the application of art-studies principles to the analysis of cinema documents, and the efficacy of a textual approach to the study of cinema documents. The forms of applying different scientific methods to the evaluation of cinema documents as historical sources are also discussed in the article.

  2. Text structures in medical text processing: empirical evidence and a text understanding prototype.

    OpenAIRE

    Hahn, U.; Romacker, M

    1997-01-01

    We consider the role of textual structures in medical texts. In particular, we examine the impact that the lacking recognition of text phenomena has on the validity of medical knowledge bases fed by a natural language understanding front-end. First, we review the results from an empirical study on a sample of medical texts, considering various forms of local coherence phenomena (anaphora and textual ellipses). We then discuss the representation bias emerging in the text knowledge base that is l...

  3. Documentation of spectrom-32

    International Nuclear Information System (INIS)

    SPECTROM-32 is a finite element program for analyzing two-dimensional and axisymmetric inelastic thermomechanical problems related to the geological disposal of nuclear waste. The code is part of the SPECTROM series of special-purpose computer programs that are being developed by RE/SPEC Inc. to address many unique rock mechanics problems encountered in analyzing radioactive wastes stored in geologic formations. This document presents the theoretical basis for the mathematical models, the finite element formulation and solution procedure of the program, a description of the input data for the program, verification problems, and details about program support and continuing documentation. The computer code documentation is intended to satisfy the requirements and guidelines outlined in the document entitled Final Technical Position on Documentation of Computer Codes for High-Level Waste Management. The principal component models used in the program involve thermoelastic, thermoviscoelastic, thermoelastic-plastic, and thermoviscoplastic types of material behavior. Special material considerations provide for the incorporation of limited-tension material behavior and consideration of jointed material behavior. Numerous program options provide the capabilities for various boundary conditions, sliding interfaces, excavation, backfill, arbitrary initial stresses, multiple material domains, load incrementation, plotting database storage and access of results, and other features unique to the geologic disposal of radioactive wastes. Numerous verification problems that exercise many of the program options and illustrate the required data input and printed results are included in the documentation

  4. Documentation of spectrom-32

    International Nuclear Information System (INIS)

    SPECTROM-32 is a finite element program for analyzing two-dimensional and axisymmetric inelastic thermomechanical problems related to the geological disposal of nuclear waste. The code is part of the SPECTROM series of special-purpose computer programs that are being developed by RE/SPEC Inc. to address many unique rock mechanics problems encountered in analyzing radioactive wastes stored in geologic formations. This document presents the theoretical basis for the mathematical models, the finite element formulation and solution procedure of the program, a description of the input data for the program, verification problems, and details about program support and continuing documentation. The computer code documentation is intended to satisfy the requirements and guidelines outlined in the document entitled Final Technical Position on Documentation of Computer Codes for High-Level Waste Management. The principal component models used in the program involve thermoelastic, thermoviscoelastic, thermoelastic-plastic, and thermoviscoplastic types of material behavior. Special material considerations provide for the incorporation of limited-tension material behavior and consideration of jointed material behavior. Numerous program options provide the capabilities for various boundary conditions, sliding interfaces, excavation, backfill, arbitrary initial stresses, multiple material domains, load incrementation, plotting database storage and access of results, and other features unique to the geologic disposal of radioactive wastes. Numerous verification problems that exercise many of the program options and illustrate the required data input and printed results are included in the documentation

  5. LCS Content Document Application

    Science.gov (United States)

    Hochstadt, Jake

    2011-01-01

    My project at KSC during my spring 2011 internship was to develop a Ruby on Rails application to manage Content Documents. A Content Document is a collection of documents and information that describes what software is installed on a Launch Control System computer. It's important for us to make sure the tools we use every day are secure, up-to-date, and properly licensed. Previously, keeping track of the information was done with Excel and Word files passed between personnel. The goal of the new application is to be able to manage and access the Content Documents through a single database-backed web application. Our LCS team will benefit greatly from this app. Admins will be able to log in securely to keep track of and update the software installed on each computer in a timely manner. We also included exportability, such as attaching additional documents that can be downloaded from the web application. The finished application will ease the process of managing Content Documents while streamlining the procedure. Ruby on Rails is a very powerful web framework, and I am grateful to have had the opportunity to build this application.

  6. Texting while driving: is speech-based text entry less risky than handheld text entry?

    Science.gov (United States)

    He, J; Chaparro, A; Nguyen, B; Burge, R J; Crandall, J; Chaparro, B; Ni, R; Cao, S

    2014-11-01

    Research indicates that using a cell phone to talk or text while maneuvering a vehicle impairs driving performance. However, few published studies directly compare the distracting effects of texting using a hands-free (i.e., speech-based interface) versus handheld cell phone, which is an important issue for legislation, automotive interface design and driving safety training. This study compared the effect of speech-based versus handheld text entries on simulated driving performance by asking participants to perform a car following task while controlling the duration of a secondary text-entry task. Results showed that both speech-based and handheld text entries impaired driving performance relative to the drive-only condition by causing more variation in speed and lane position. Handheld text entry also increased the brake response time and increased variation in headway distance. Text entry using a speech-based cell phone was less detrimental to driving performance than handheld text entry. Nevertheless, the speech-based text entry task still significantly impaired driving compared to the drive-only condition. These results suggest that speech-based text entry disrupts driving, but reduces the level of performance interference compared to text entry with a handheld device. In addition, the difference in the distraction effect caused by speech-based and handheld text entry is not simply due to the difference in task duration. PMID:25089769

  7. Situational Interest in Literary Text

    Science.gov (United States)

    Schraw

    1997-10-01

    This study examined relationships among text characteristics, situational interest, two measures of text understanding, and personal responses when reading a literary text. A factor analysis of ratings made after reading revealed six interrelated text characteristics. Of these, suspense, coherence and thematic complexity explained 54% of the variance in interest. Additional analyses found that situational interest was unrelated to a multiple-choice test of main ideas but was related to personal responses and holistic interpretations of the text. These results suggest that multiple aspects of literary texts are interesting to readers, and that interest is related to personal engagement variables even when it is not related to the comprehension of main ideas. Copyright 1997 Academic Press PMID:9356182

  8. Outer Texts in Bilingual Dictionaries

    OpenAIRE

    Rufus H Gouws

    2011-01-01

    Abstract: Dictionaries often display a central-list bias with little or no attention to the use of outer texts. This article focuses on dictionaries as text compounds and carriers of different text types. Utilising either a partial or a complete frame structure, a variety of outer text types can be used to enhance the data distribution structure of a dictionary and to ensure better information retrieval by the intended target user. A distinction is made between primary frame structures...

  9. Active Learning for Text Classification

    OpenAIRE

    Hu, Rong

    2011-01-01

    Text classification approaches are used extensively to solve real-world challenges. The success or failure of text classification systems hangs on the datasets used to train them; without a good dataset it is impossible to build a quality system. This thesis examines the applicability of active learning in text classification for the rapid and economical creation of labelled training data. Four main contributions are made in this thesis. First, we present two novel selection strategies to cho...

  10. Multimodal texts in kindergarten rooms

    OpenAIRE

    Granly, Astrid; Maagerø, Eva

    2012-01-01

    This article provides an overview of the results of our project “The Kindergarten Room: A Multimodal Pedagogical Text”. Our major initiative was to investigate what the multimodal texts in kindergarten represent and the extent to which they reflect and provide attributions to the children’s activities. In addition, we wanted to investigate whether kindergarten walls and floors can be called ‘pedagogical texts’, and the extent to which texts on walls and floors establish a particular text cult...

  11. Text Type and Translation Strategy

    Institute of Scientific and Technical Information of China (English)

    刘福娟

    2015-01-01

    Translation strategy and translation standards are undoubtedly the core problems translators are confronted with in translation. Many kinds of translation strategies have arisen in translation history, among which text type theory is considered an important breakthrough and a significant complement to traditional translation standards. This essay attempts to demonstrate the value of text typology (informative, expressive, and operative) for translation strategy, emphasizing the importance of text types and their communicative functions.

  12. Strategies for Translating Vocative Texts

    OpenAIRE

    Olga COJOCARU

    2014-01-01

    The paper deals with the linguistic and cultural elements of vocative texts and the techniques used in translating them by giving some examples of texts that are typically vocative (i.e. advertisements and instructions for use). Semantic and communicative strategies are popular in translation studies and each of them has its own advantages and disadvantages in translating vocative texts. The advantage of semantic translation is that it takes more account of the aesthetic value of the SL te...

  13. Text Mining Applications and Theory

    CERN Document Server

    Berry, Michael W

    2010-01-01

    Text Mining: Applications and Theory presents the state-of-the-art algorithms for text mining from both the academic and industrial perspectives.  The contributors span several countries and scientific domains: universities, industrial corporations, and government laboratories, and demonstrate the use of techniques from machine learning, knowledge discovery, natural language processing and information retrieval to design computational models for automated text analysis and mining. This volume demonstrates how advancements in the fields of applied mathematics, computer science, machine learning

  14. TRANSLATION PROBLEMS IN MEDICAL TEXTS

    OpenAIRE

    OĞUZ, Derya

    2014-01-01

    In this study, our aim was to emphasize the fact that the translation of medical texts represents a special area in translation, and that the most important aspect of medical text translation is being aware of the purpose and intent behind these translations. Medical text translation is not an area in which any person engaged in translation can work effectively. The translator first needs to have a considerable scientific background and experience. The translator’s task, in this context, is t...

  15. The Chinese Text Categorization System with Category Priorities

    Directory of Open Access Journals (Sweden)

    Huan-Chao Keh

    2010-10-01

    Full Text Available The process of text categorization involves some understanding of the content of the documents and/or some previous knowledge of the categories. For the content of the documents, we use a filtering measure for feature selection in our Chinese text categorization system. We modify the formula of Term Frequency-Inverse Document Frequency (TF-IDF) to strengthen important keywords' weights and weaken unimportant keywords' weights. For the knowledge of the categories, we use category priority to represent the relationship between two different categories. The experimental results show that our method not only effectively decreases noise text but also increases the accuracy and recall rates of text categorization.
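
    As a rough Python sketch of the term weighting this record builds on, the fragment below computes standard TF-IDF for a small tokenized corpus. The paper's modified formula (boosting important keywords and damping unimportant ones) is not reproduced here, so treat this as the unmodified baseline.

        import math
        from collections import Counter

        def tf_idf(docs):
            """Compute standard TF-IDF weights for tokenized documents."""
            n = len(docs)
            df = Counter()                      # document frequency per term
            for doc in docs:
                df.update(set(doc))
            weights = []
            for doc in docs:
                tf = Counter(doc)
                w = {t: (tf[t] / len(doc)) * math.log(n / df[t]) for t in tf}
                weights.append(w)
            return weights

        docs = [["text", "mining", "text"], ["mining", "categories"], ["noise", "text"]]
        for w in tf_idf(docs):
            print({t: round(v, 3) for t, v in w.items()})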

  16. Knowledge Representation in Travelling Texts

    DEFF Research Database (Denmark)

    Mousten, Birthe; Locmele, Gunta

    2014-01-01

    Today, information travels fast. Texts travel, too. In a corporate context, the question is how to manage which knowledge elements should travel to a new language area or market, and in which form. The decision to let knowledge elements travel or not travel depends highly on the limitations and the purpose of the text in a new context as well as on predefined parameters for text travel. For texts used in marketing and in technology, the question is whether culture-bound knowledge representation should be domesticated or kept as foreign elements, or should be mirrored or moulded—or should not travel...

  17. Relation Based Mining Model for Enhancing Web Document Clustering

    Directory of Open Access Journals (Sweden)

    M.Reka

    2014-05-01

    Full Text Available The design of web information management systems is becoming more complex, with greater time complexity. Information retrieval is a difficult task due to the huge volume of web documents. Clustering makes the retrieval easier and less time consuming. This algorithm introduces a web document clustering approach that uses the semantic relations between documents, which reduces the time complexity. It identifies the relations and concepts in a document and also computes the relation score between documents. The algorithm analyses the key concepts from the web documents by preprocessing, stemming, and stop-word removal. The identified concepts are used to compute the document relation score and the cluster relation score. The domain ontology is used to compute the document relation score and the cluster relation score. Based on the document relation score and the cluster relation score, the web document cluster is identified. The algorithm uses 200,000 web documents for evaluation, with 60 percent as the training set and 40 percent as the testing set.

  18. AN EFFICIENT TEXT CLASSIFICATION USING KNN AND NAIVE BAYESIAN

    OpenAIRE

    J.Sreemathy; P. S. Balamurugan

    2012-01-01

    The main objective is to propose a text classification method based on feature selection and preprocessing, thereby reducing the dimensionality of the feature vector and increasing the classification accuracy. Text classification is the process of assigning a document to one or more target categories, based on its contents. In the proposed method, machine learning methods for text classification are used to apply some text preprocessing methods to different datasets, and then to extract feature vecto...

  19. ERRORS AND DIFFICULTIES IN TRANSLATING LEGAL TEXTS

    Directory of Open Access Journals (Sweden)

    Camelia, CHIRILA

    2014-11-01

    Full Text Available Nowadays the accurate translation of legal texts has become highly important, as the mistranslation of a passage in a contract, for example, could lead to lawsuits and loss of money. Consequently, the translation of legal texts into other languages faces many difficulties, and only professional translators specialised in legal translation should deal with the translation of legal documents and scholarly writings. The purpose of this paper is to analyze translation from three perspectives: translation quality, errors and difficulties encountered in translating legal texts, and the consequences of such errors in professional translation. First of all, the paper points out the importance of performing a good and correct translation, which is one of the most important elements to be considered when discussing translation. Furthermore, the paper presents an overview of the errors and difficulties in translating texts and of the consequences of errors in professional translation, with applications to the field of law. The paper is also an approach to the differences between languages (English and Romanian) that can hinder comprehension for those who have embarked upon the difficult task of translation. The research method used to achieve the objectives of the paper was the content analysis of various Romanian and foreign authors' works.

  20. Extraction of information from unstructured text

    Energy Technology Data Exchange (ETDEWEB)

    Irwin, N.H.; DeLand, S.M.; Crowder, S.V.

    1995-11-01

    Extracting information from unstructured text has become an emphasis in recent years due to the large amount of text now electronically available. This status report describes the findings and work done by the end of the first year of a two-year LDRD. Requirements of the approach included that it model the information in a domain independent way. This means that it would differ from current systems by not relying on previously built domain knowledge and that it would do more than keyword identification. Three areas that are discussed and expected to contribute to a solution include (1) identifying key entities through document level profiling and preprocessing, (2) identifying relationships between entities through sentence level syntax, and (3) combining the first two with semantic knowledge about the terms.

  1. Use of Printed and Online Documents.

    Science.gov (United States)

    Poupa, Christine

    2001-01-01

    Explains how written material started; describes the nature and supply of electronic documents; characterizes student practices in using paper texts and online electronic texts in higher education, including the role of reading; and considers communication and speed and informational inflation and choice. (Author/LRW)

  2. Linguistic Dating of Biblical Texts

    DEFF Research Database (Denmark)

    Ehrensvärd, Martin Gustaf

    For two centuries, scholars have pointed to consistent differences in the Hebrew of certain biblical texts and interpreted these differences as reflecting the date of composition of the texts. Until the 1980s, this was quite uncontroversial as the linguistic findings largely confirmed the...

  3. Strategies for Translating Vocative Texts

    Directory of Open Access Journals (Sweden)

    Olga COJOCARU

    2014-12-01

    Full Text Available The paper deals with the linguistic and cultural elements of vocative texts and the techniques used in translating them, giving some examples of texts that are typically vocative (i.e. advertisements and instructions for use). Semantic and communicative strategies are popular in translation studies and each of them has its own advantages and disadvantages in translating vocative texts. The advantage of semantic translation is that it takes more account of the aesthetic value of the SL text, while communicative translation attempts to render the exact contextual meaning of the original text in such a way that both content and language are readily acceptable and comprehensible to the readership. Focus is laid on the strategies used in translating vocative texts, strategies that highlight and introduce a cultural context to the target audience, in order to achieve their overall purpose, that is, to sell or persuade the reader to behave in a certain way. To that end, a number of advertisements from the cosmetics industry and electronic gadgets were selected for analysis. The aim is to gather insights into vocative text translation and to create new perspectives on this field of research, now considered a process of innovation and diversion, especially in areas as important as economy and marketing.

  4. Scalable ranked retrieval using document images

    Science.gov (United States)

    Jain, Rajiv; Oard, Douglas W.; Doermann, David

    2013-12-01

    Despite the explosion of text on the Internet, hard copy documents that have been scanned as images still play a significant role for some tasks. The best method to perform ranked retrieval on a large corpus of document images, however, remains an open research question. The most common approach has been to perform text retrieval using terms generated by optical character recognition. This paper, by contrast, examines whether a scalable segmentation-free image retrieval algorithm, which matches sub-images containing text or graphical objects, can provide additional benefit in satisfying a user's information needs on a large, real world dataset. Results on 7 million scanned pages from the CDIP v1.0 test collection show that content based image retrieval finds a substantial number of documents that text retrieval misses, and that when used as a basis for relevance feedback can yield improvements in retrieval effectiveness.

  5. ONTOLOGY BASED DOCUMENT CLUSTERING USING MAPREDUCE

    Directory of Open Access Journals (Sweden)

    Abdelrahman Elsayed

    2015-05-01

    Full Text Available Nowadays, document clustering is considered a data-intensive task due to the dramatic, fast increase in the number of available documents. Moreover, the feature sets that represent those documents are also very large. The most common method for representing documents is the vector space model, which represents document features as a bag of words and does not capture semantic relations between words. In this paper we introduce a distributed implementation of bisecting k-means using the MapReduce programming model. The aim behind our proposed implementation is to solve the problem of clustering data-intensive document collections. In addition, we propose integrating the WordNet ontology with bisecting k-means in order to utilize the semantic relations between words to enhance document clustering results. Our experimental results show that using lexical categories for nouns only enhances internal evaluation measures of document clustering and reduces the document features from thousands to tens of features. Our experiments were conducted using Amazon Elastic MapReduce to deploy the bisecting k-means algorithm.
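
    A minimal single-machine sketch of the bisecting k-means loop described above, in Python; the MapReduce distribution and the WordNet-based feature reduction are omitted, and the inputs are toy 2-D points rather than document vectors.

        import random

        def kmeans2(points, iters=20):
            # Plain 2-means: split one cluster of vectors into two.
            c = random.sample(points, 2)
            for _ in range(iters):
                groups = ([], [])
                for p in points:
                    d = [sum((a - b) ** 2 for a, b in zip(p, ci)) for ci in c]
                    groups[d.index(min(d))].append(p)
                c = [tuple(sum(col) / len(g) for col in zip(*g)) if g else c[i]
                     for i, g in enumerate(groups)]
            return [g for g in groups if g]

        def bisecting_kmeans(points, k):
            # Repeatedly split the largest cluster until k clusters remain.
            clusters = [list(points)]
            while len(clusters) < k:
                biggest = max(clusters, key=len)
                if len(biggest) < 2:
                    break
                parts = kmeans2(biggest)
                if len(parts) < 2:
                    break                       # degenerate split; stop early
                clusters.remove(biggest)
                clusters.extend(parts)
            return clusters

        points = [(0, 0), (0, 1), (5, 5), (5, 6), (9, 0), (9, 1)]
        for cluster in bisecting_kmeans(points, 3):
            print(cluster)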

  6. Chemical-text hybrid search engines.

    Science.gov (United States)

    Zhou, Yingyao; Zhou, Bin; Jiang, Shumei; King, Frederick J

    2010-01-01

    As the amount of chemical literature increases, it is critical that researchers be enabled to accurately locate documents related to a particular aspect of a given compound. Existing solutions, based on text and chemical search engines alone, suffer from the inclusion of "false negative" and "false positive" results, and cannot accommodate the diverse repertoire of formats currently available for chemical documents. To address these concerns, we developed an approach called Entity-Canonical Keyword Indexing (ECKI), which converts a chemical entity embedded in a data source into its canonical keyword representation prior to being indexed by text search engines. We implemented ECKI using Microsoft Office SharePoint Server Search, and the resultant hybrid search engine not only supported complex mixed chemical and keyword queries but also was applied to both intranet and Internet environments. We envision that the adoption of ECKI will empower researchers to pose more complex search questions that were not readily attainable previously and to obtain answers at much improved speed and accuracy. PMID:20047295
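
    The core ECKI idea, converting recognized chemical names to one canonical keyword before text indexing, can be sketched as below. The synonym table (with aspirin's InChIKey as the canonical key) is an illustrative assumption, not the paper's implementation, which sits inside SharePoint Server Search.

        # Hypothetical synonym-to-canonical-key table (InChIKey shown for aspirin).
        CANONICAL = {
            "acetylsalicylic acid": "BSYNRYMUTXBXSQ-UHFFFAOYSA-N",
            "aspirin": "BSYNRYMUTXBXSQ-UHFFFAOYSA-N",
        }

        def canonicalize(text):
            """Replace known chemical names with their canonical keyword so
            a plain text search engine indexes all synonyms identically."""
            out = text.lower()
            # Longest names first, so multi-word synonyms match before substrings.
            for name, key in sorted(CANONICAL.items(), key=lambda kv: -len(kv[0])):
                out = out.replace(name, key)
            return out

        print(canonicalize("Acetylsalicylic acid (aspirin) reduces fever"))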

  7. Layout-aware text extraction from full-text PDF of scientific articles

    Directory of Open Access Journals (Sweden)

    Ramakrishnan Cartic

    2012-05-01

    Full Text Available Abstract Background The Portable Document Format (PDF) is the most commonly used file format for online scientific publications. The absence of effective means to extract text from these PDF files in a layout-aware manner presents a significant challenge for developers of biomedical text mining or biocuration informatics systems that use published literature as an information source. In this paper we introduce the ‘Layout-Aware PDF Text Extraction’ (LA-PDFText) system to facilitate accurate extraction of text from PDF files of research articles for use in text mining applications. Results Our paper describes the construction and performance of an open source system that extracts text blocks from PDF-formatted full-text research articles and classifies them into logical units based on rules that characterize specific sections. The LA-PDFText system focuses only on the textual content of the research articles and is meant as a baseline for further experiments into more advanced extraction methods that handle multi-modal content, such as images and graphs. The system works in a three-stage process: (1) detecting contiguous text blocks using spatial layout processing to locate and identify blocks of contiguous text, (2) classifying text blocks into rhetorical categories using a rule-based method and (3) stitching classified text blocks together in the correct order, resulting in the extraction of text from section-wise grouped blocks. We show that our system can identify text blocks and classify them into rhetorical categories with precision = 0.96, recall = 0.89 and F1 = 0.91. We also present an evaluation of the accuracy of the block detection algorithm used in step 2. Additionally, we have compared the accuracy of the text extracted by LA-PDFText to the text from the Open Access subset of PubMed Central. We then compared this accuracy with that of the text extracted by the PDF2Text system, commonly used to extract text from PDF

  8. GPM Mission Gridded Text Products Providing Surface Precipitation Retrievals

    Science.gov (United States)

    Stocker, Erich Franz; Kelley, Owen; Huffman, George; Kummerow, Christian

    2015-04-01

    constellation satellites. Both of these gridded products are generated on a 0.25 degree x 0.25 degree hourly grid, which is packaged into daily ASCII files that can be downloaded from the PPS FTP site. To reduce the download size, the files are compressed using the gzip utility. This paper will focus on presenting high-level details about the gridded text product being generated from the instruments on the GPM core satellite. Summary information will also be presented about the partner radiometer gridded product. All retrievals for the partner radiometers are done using the GPROF2014 algorithm, with the PPS-generated inter-calibrated 1C product for the radiometer as input.
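
    For orientation, reading such a gzip-compressed daily ASCII grid might look like the following Python sketch; the three-column line layout is a guessed example, since the record above does not spell out the actual PPS file format.

        import gzip

        def read_gridded_text(path):
            # One "lat lon rate" triple per line is assumed here; the real
            # PPS layout is not described in the record above.
            grid = {}
            with gzip.open(path, "rt") as f:
                for line in f:
                    lat, lon, rate = line.split()
                    grid[(float(lat), float(lon))] = float(rate)
            return grid

        # Example (hypothetical file name):
        # grid = read_gridded_text("gpm_gridded_20150401.txt.gz")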

  9. Outer Texts in Bilingual Dictionaries

    Directory of Open Access Journals (Sweden)

    Rufus H. Gouws

    2011-10-01

    Full Text Available

    Abstract: Dictionaries often display a central list bias with little or no attention to the use of outer texts. This article focuses on dictionaries as text compounds and carriers of different text types. Utilising either a partial or a complete frame structure, a variety of outer text types can be used to enhance the data distribution structure of a dictionary and to ensure a better information retrieval by the intended target user. A distinction is made between primary frame structures and secondary frame structures, and attention is drawn to the use of complex outer texts and the need of an extended complex outer text with its own table of contents to guide the user to the relevant texts in the complex outer text. It is emphasised that outer texts need to be planned in a meticulous way and that they should participate in the lexicographic functions of the specific dictionary, both knowledge-orientated and communication-orientated functions, to ensure a transtextual functional approach.

    Keywords: BACK MATTER, CENTRAL LIST, COMMUNICATION-ORIENTATED FUNCTIONS, COMPLEX TEXT, CULTURAL DATA, EXTENDED COMPLEX TEXT, EXTENDED TEXTS, FRONT MATTER, FRAME STRUCTURE, KNOWLEDGE-ORIENTATED FUNCTIONS, LEXICOGRAPHIC FUNCTIONS, OUTER TEXTS, PRIMARY FRAME, SECONDARY FRAME

    Summary: Outer texts in bilingual dictionaries. Dictionaries often display a bias in favour of the central list, with little or no attention to the outer texts. This article focuses on dictionaries as text compounds and carriers of different text types. By utilising either a partial or a complete frame structure, a variety of outer texts can be employed to improve the data distribution structure of a dictionary and to ensure better retrieval of information by the target user. A distinction is made between primary and secondary frame structures, and attention is drawn to complex outer texts and the need for an extended complex

  10. Biomarker Identification Using Text Mining

    Directory of Open Access Journals (Sweden)

    Hui Li

    2012-01-01

    Full Text Available Identifying molecular biomarkers has become one of the important tasks for scientists seeking to assess the different phenotypic states of cells or organisms correlated to the genotypes of diseases from large-scale biological data. In this paper, we propose a text-mining-based method to discover biomarkers from PubMed. First, we construct a database based on a dictionary, and then we use a finite state machine to identify the biomarkers. Our text mining method provides a highly reliable approach to discovering biomarkers in the PubMed database.
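
    A toy Python version of the dictionary-driven matching step: the finite state machine of the paper is approximated here by a longest-match scan over tokens, and the two-entry dictionary is purely illustrative.

        def find_biomarkers(text, dictionary):
            """Scan tokenized text for dictionary terms, longest match first."""
            tokens = text.lower().split()
            terms = sorted((t.split() for t in dictionary), key=len, reverse=True)
            hits, i = [], 0
            while i < len(tokens):
                for term in terms:
                    if tokens[i:i + len(term)] == term:
                        hits.append(" ".join(term))
                        i += len(term)          # consume the matched span
                        break
                else:
                    i += 1                      # no match at this position
            return hits

        print(find_biomarkers("Serum PSA and CA 125 levels were elevated",
                              {"psa", "ca 125"}))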

  11. Document Management vs. Knowledge Management

    Directory of Open Access Journals (Sweden)

    Sergiu JECAN

    2008-01-01

    Full Text Available Most large organizations have been investing in various disconnected management technologies during the past few years. Efforts to improve management have been especially noticeable over the last 18-24 months, as organizations try to tame the chaos behind their public internet and internal intranet sites. More recently, regulatory concerns have reawakened interest in records management, archiving and document management. In addition, organizations seeking to increase innovation and overall employee efficiency have initiated projects to improve collaborative capabilities. With business models constantly changing and organizations moving to outsourced solutions, the drive towards improving business processes has never been greater. Organizations expect outsourcing to streamline business processes efficiently and effectively if they are to achieve rapid payback and return on investment (ROI). This is where workflow, document management and knowledge management can support the in-house and outsourced business process improvements that help CEOs gain the business benefits they seek in order to remain competitive. We will show how processes can be improved through workflow, document management and knowledge management.

  12. Document clustering methods, document cluster label disambiguation methods, document clustering apparatuses, and articles of manufacture

    Science.gov (United States)

    Sanfilippo, Antonio; Calapristi, Augustin J.; Crow, Vernon L.; Hetzler, Elizabeth G.; Turner, Alan E.

    2009-12-22

    Document clustering methods, document cluster label disambiguation methods, document clustering apparatuses, and articles of manufacture are described. In one aspect, a document clustering method includes providing a document set comprising a plurality of documents, providing a cluster comprising a subset of the documents of the document set, using a plurality of terms of the documents, providing a cluster label indicative of subject matter content of the documents of the cluster, wherein the cluster label comprises a plurality of word senses, and selecting one of the word senses of the cluster label.

  13. Customer Communication Document

    Science.gov (United States)

    2009-01-01

    This procedure communicates to the Customers of the Automation, Robotics and Simulation Division (AR&SD) Dynamics Systems Test Branch (DSTB) how to obtain services of the Six-Degrees-Of-Freedom Dynamic Test System (SDTS). The scope includes the major communication documents between the SDTS and its Customer. It establishes the initial communication and contact points, as well as providing the initial documentation in electronic media for the customer. Contact the SDTS Manager (SM) for the names and numbers of the current contact points.

  14. An Automated FORTRAN documenter

    Science.gov (United States)

    Erickson, T.

    1982-01-01

    A set of programs was written to help R&D programmers document their FORTRAN programs more effectively. The central program reads FORTRAN source code and asks the programmer questions about things it has not heard of before. It inserts the answers to these questions as comments into the FORTRAN code. The comments, as well as extensive cross-reference information, are also written to an unformatted file. Other programs read this file to produce printed information or to act as an interactive document.

  15. Sustainable, Extensible Documentation Generation Using inlinedocs

    Directory of Open Access Journals (Sweden)

    Toby Dylan Hocking

    2013-09-01

    Full Text Available This article presents inlinedocs, an R package for generating documentation from comments. The concept of structured, interwoven code and documentation has existed for many years, but existing systems that implement this for the R programming language do not tightly integrate with R code, leading to several drawbacks. This article attempts to address these issues and presents two contributions for documentation generation for the R community. First, we propose a new syntax for inline documentation of R code within comments adjacent to the relevant code, which allows for highly readable and maintainable code and documentation. Second, we propose an extensible system for parsing these comments, which allows the syntax to be easily augmented.

  16. Why is Light Text Harder to Read Than Dark Text?

    Science.gov (United States)

    Scharff, Lauren V.; Ahumada, Albert J.

    2005-01-01

    Scharff and Ahumada (2002, 2003) measured text legibility for light text and dark text. For paragraph readability and letter identification, responses to light text were slower and less accurate for a given contrast. Was this polarity effect (1) an artifact of our apparatus, (2) a physiological difference in the separate pathways for positive and negative contrast or (3) the result of increased experience with dark text on light backgrounds? To rule out the apparatus-artifact hypothesis, all data were collected on one monitor. Its luminance was measured at all levels used, and the spatial effects of the monitor were reduced by pixel doubling and quadrupling (increasing the viewing distance to maintain constant angular size). Luminances of vertical and horizontal square-wave gratings were compared to assess display speed effects. They existed, even for 4-pixel-wide bars. Tests for polarity asymmetries in display speed were negative. Increased experience might develop full letter templates for dark text, while recognition of light letters is based on component features. Earlier, an observer ran all conditions at one polarity and then switched. If dark and light letters were intermixed, the observer might use component features on all trials and do worse on the dark letters, reducing the polarity effect. We varied polarity blocking (completely blocked, alternating smaller blocks, and intermixed blocks). Letter identification response times showed polarity effects at all contrasts and display resolution levels. Observers were also more accurate with higher contrasts and more pixels per degree. Intermixed blocks increased the polarity effect by reducing performance on the light letters, but only if the randomized block occurred prior to the nonrandomized block. Perhaps observers tried to use poorly developed templates, or they did not work as hard on the more difficult items. The experience hypothesis and the physiological gain hypothesis remain viable explanations.

  17. Stemming Malay Text and Its Application in Automatic Text Categorization

    Science.gov (United States)

    Yasukawa, Michiko; Lim, Hui Tian; Yokoo, Hidetoshi

    In the Malay language there are no conjugations and declensions, and affixes have important grammatical functions. In Malay, the same word may function as a noun, an adjective, an adverb, or a verb, depending on its position in the sentence. Although simple root words are used extensively in informal conversation, it is essential to use the precise words in formal speech or written texts. In Malay, to make sentences clear, derivative words are used. Derivation is achieved mainly by the use of affixes. There are approximately a hundred possible derivative forms of a root word in the written language of educated Malay speakers. Therefore, the composition of Malay words may be complicated. Although there are several types of stemming algorithms available for text processing in English and some other languages, they cannot be used to overcome the difficulties of Malay word stemming. Stemming is the process of reducing various words to their root forms in order to improve the effectiveness of text processing in information systems. It is essential to avoid both over-stemming and under-stemming errors. We have developed a new Malay stemmer (stemming algorithm) for removing inflectional and derivational affixes. Our stemmer uses a set of affix rules and two types of dictionaries: a root-word dictionary and a derivative-word dictionary. The use of the set of rules is aimed at reducing the occurrence of under-stemming errors, while that of the dictionaries is believed to reduce the occurrence of over-stemming errors. We performed an experiment to evaluate the application of our stemmer in text mining software. For the experiment, the text data used were actual web pages collected from the World Wide Web to demonstrate the effectiveness of our Malay stemming algorithm. The experimental results showed that our stemmer can effectively increase the precision of the extracted Boolean expressions for text categorization.
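
    The rule-plus-dictionary design can be sketched in Python as follows; the affix lists and the two tiny dictionaries are illustrative stand-ins for the paper's full rule set and lexicons.

        # Illustrative affix rules and toy dictionaries; real Malay morphology
        # is far richer (the record above cites ~100 derivative forms per root).
        PREFIXES = ["memper", "men", "mem", "me", "ber", "ter", "di", "pe"]
        SUFFIXES = ["kan", "an", "i"]
        ROOTS = {"ajar", "makan", "baca"}           # root-word dictionary
        DERIVATIVES = {"pelajaran": "ajar"}         # derivative-word dictionary

        def stem_malay(word):
            # Dictionary checks limit over- and under-stemming, as the
            # record above recommends.
            if word in ROOTS:
                return word
            if word in DERIVATIVES:                 # irregular forms first
                return DERIVATIVES[word]
            for p in PREFIXES:
                if not word.startswith(p):
                    continue
                for s in [""] + SUFFIXES:
                    if s and not word.endswith(s):
                        continue
                    candidate = word[len(p):len(word) - len(s) or None]
                    if candidate in ROOTS:
                        return candidate
            return word                             # fall back: leave unstemmed

        print(stem_malay("memakan"), stem_malay("pelajaran"))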

  18. An Experimental Text-Commentary

    Science.gov (United States)

    O'Brien, Joan

    1976-01-01

    An experimental text-commentary of selected passages from Sophocles'"Antigone" is described. The commentary is intended for students seeking more than a conventional translation who do not know enough Greek to use a standard commentary. (RM)

  19. Anomaly Detection with Text Mining

    Data.gov (United States)

    National Aeronautics and Space Administration — Many existing complex space systems have a significant amount of historical maintenance and problem data bases that are stored in unstructured text forms. The...

  20. Analysing Representations of Otherness Using Different Text-Types.

    Science.gov (United States)

    Murphy-LeJeune, Elizabeth; And Others

    1996-01-01

    Demonstrates how the teacher can use texts to confront learners with cultural representations. Four texts are used to represent a literary extract, a student essay, an advertising document, and a newspaper article. The article illustrates approaches that borrow from stylistics, linguistics, and discourse analysis. (21 references) (Author/CK)

  1. Text Mining in Social Networks

    Science.gov (United States)

    Aggarwal, Charu C.; Wang, Haixun

    Social networks are rich in various kinds of contents such as text and multimedia. The ability to apply text mining algorithms effectively in the context of text data is critical for a wide variety of applications. Social networks require text mining algorithms for a wide variety of applications such as keyword search, classification, and clustering. While search and classification are well known applications for a wide variety of scenarios, social networks have a much richer structure both in terms of text and links. Much of the work in the area uses either purely the text content or purely the linkage structure. However, many recent algorithms use a combination of linkage and content information for mining purposes. In many cases, it turns out that the use of a combination of linkage and content information provides much more effective results than a system which is based purely on either of the two. This paper provides a survey of such algorithms, and the advantages observed by using such algorithms in different scenarios. We also present avenues for future research in this area.

  2. A Method for Text Summarization by Bacterial Foraging Optimization Algorithm

    Directory of Open Access Journals (Sweden)

    Morteza Dastkhosh Nikoo

    2012-07-01

    Full Text Available Due to the rapid and increasing growth of electronic texts and documents, we need techniques for the integration, communication and appropriate utilization of these texts. Summarizing the literature is one of the most fundamental tasks for integrating and taking advantage of these gathered texts. Selecting key words and then integrating them into a summary text is the most common method of text summarization. In this paper we present a new method of automatic text summarization based on bacterial foraging optimization. The main idea of this method is to weight the words, then score the sentences, and finally extract the key sentences from the text as the summarized text. The TF-IDF term-weighting method is used to determine the weight of each term. Bacterial foraging optimization is then used to converge the solutions obtained from each bacterium, and finally the best candidate summary is produced.
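
    A compact Python baseline for the weight-words / score-sentences / extract pipeline the abstract describes, using plain TF-IDF; the bacterial foraging step that tunes the weights is beyond a short sketch and is deliberately left out.

        import math
        import re
        from collections import Counter

        def summarize(text, n=2):
            """Extractive summary: keep the n sentences with the highest
            average TF-IDF word weight, in their original order."""
            sents = re.split(r"(?<=[.!?])\s+", text.strip())
            words = [re.findall(r"\w+", s.lower()) for s in sents]
            df = Counter(w for ws in words for w in set(ws))

            def score(ws):
                tf = Counter(ws)
                return sum(tf[w] / len(ws) * math.log(len(sents) / df[w])
                           for w in tf) / len(ws)

            top = sorted(range(len(sents)), key=lambda i: score(words[i]),
                         reverse=True)[:n]
            return " ".join(sents[i] for i in sorted(top))

        text = ("Text mining grows quickly. Summaries compress long documents. "
                "Good summaries keep the key sentences. The weather was nice today.")
        print(summarize(text, n=2))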

  3. A Fuzzy Similarity Based Concept Mining Model for Text Classification

    CERN Document Server

    Puri, Shalini

    2012-01-01

    Text classification is a challenging and very active field at present, and has great importance in text categorization applications. A lot of research work has been done in this field, but there is a need to categorize a collection of text documents into mutually exclusive categories by extracting the concepts or features using a supervised learning paradigm and different classification algorithms. In this paper, a new Fuzzy Similarity Based Concept Mining Model (FSCMM) is proposed to classify a set of text documents into pre-defined Category Groups (CG) by training and preparing them at the sentence, document and integrated corpora levels, along with feature reduction and ambiguity removal at each level, to achieve high system performance. A Fuzzy Feature Category Similarity Analyzer (FFCSA) is used to analyze each extracted feature of the Integrated Corpora Feature Vector (ICFV) with the corresponding categories or classes. This model uses a Support Vector Machine Classifier (SVMC) to classify correct...

  4. A search engine for Arabic documents

    OpenAIRE

    Sari, T.; Kefali, A.

    2008-01-01

    This paper is an attempt at indexing and searching degraded document images without recognizing the textual patterns, and so circumventing the cost and laborious effort of OCR technology. The proposed approach deals with text-dominant documents, either handwritten or printed. In the preprocessing and segmentation stages, all the connected components (CCs) of the text are extracted using a bottom-up approach. Each CC is then represented with global indices such as loops, ascenders, etc. E...

  5. Semi-Automatic Indexing of Multilingual Documents

    OpenAIRE

    Schiel, Ulrich; de Souza, Ianna M. Sodre Ferreira; Ferneda, Edberto

    1999-01-01

    With the growing significance of digital libraries and the Internet, more and more electronic texts become accessible to a wide and geographically dispersed public. This requires adequate tools to facilitate indexing, storage, and retrieval of documents written in different languages. We present a method for semi-automatic indexing of electronic documents and construction of a multilingual thesaurus, which can be used for query formulation and information retrieval. We use special dictionaries...

  6. Analysing ESP Texts, but How?

    Directory of Open Access Journals (Sweden)

    Borza Natalia

    2015-03-01

    Full Text Available English as a second language (ESL) teachers instructing general English and English for specific purposes (ESP) in bilingual secondary schools face various challenges when it comes to choosing the main linguistic foci of language preparatory courses enabling non-native students to study academic subjects in English. ESL teachers who intend to analyse English-language subject textbooks written for secondary school students, with the aim of learning what bilingual secondary school students need to know in terms of language to process academic textbooks, cannot avoid dealing with a dilemma. It needs to be decided which way is most appropriate to analyse the texts in question. Handbooks of English applied linguistics are not immensely helpful with regard to this problem, as they tend not to give recommendations as to which major text-analytical approaches are advisable to follow in a pre-college setting. The present theoretical research aims to address this lacuna. Accordingly, the purpose of this pedagogically motivated theoretical paper is to investigate two major approaches to ESP text analysis, register analysis and genre analysis, in order to find the more suitable one for exploring the language use of secondary school subject texts from the point of view of an English as a second language teacher. Comparing and contrasting the merits and limitations of the two approaches allows for a better understanding of the nature of the two different perspectives of text analysis. The study examines the goals, the scope of analysis, and the achievements of the register perspective and those of the genre approach alike. The paper also investigates and reviews in detail the starkly different methods of ESP text analysis applied by the two perspectives. Approaching text analysis from a theoretical and methodological angle supports a practical aspect of English teaching, namely making an informed choice when setting out to analyse

  7. Text segmentation with character-level text embeddings

    NARCIS (Netherlands)

    Chrupała, Grzegorz

    2013-01-01

    Learning word representations has recently seen much success in computational linguistics. However, assuming sequences of word tokens as input to linguistic analysis is often unjustified. For many languages word segmentation is a non-trivial task and naturally occurring text is sometimes a mixture o

  8. Document clustering using graph based document representation with constraints

    OpenAIRE

    Rafi, Muhammad; Amin, Farnaz; Shaikh, Mohammad Shahid

    2014-01-01

    Document clustering is an unsupervised approach in which a large collection of documents (corpus) is subdivided into smaller, meaningful, identifiable, and verifiable sub-groups (clusters). Meaningful representation of documents and implicitly identifying the patterns, on which this separation is performed, is the challenging part of document clustering. We have proposed a document clustering technique using graph based document representation with constraints. A graph data structure can easi...

  9. Documentation of CORTAX

    OpenAIRE

    Leon Bettendorf; Albert Van der Horst

    2006-01-01

    CORTAX is applied in Bettendorf et al. (2006), a simulation study on the economic and welfare implications of reforms in corporate income taxation. This technical documentation of the model consists of the derivation and listing of the equations of the model and a justification of the calibration.

  10. Course documentation report

    DEFF Research Database (Denmark)

    Buus, Lillian; Bygholm, Ann; Walther, Tina Dyngby Lyng

    A documentation report on the three pedagogical courses developed during the MVU project period. The report describes the three processes, taking their point of departure in the structure and material available in the virtual learning environment. The report also describes the way two of the courses developed...

  11. ICRS Recommendation Document

    DEFF Research Database (Denmark)

    Roos, Ewa M.; Engelhart, Luella; Ranstam, Jonas;

    2011-01-01

    function evaluated for validity and psychometric properties in patients with articular cartilage lesions. Results: The knee-specific instruments, titled the International Knee Documentation Committee Subjective Knee Form and the Knee injury and Osteoarthritis and Outcome Score, both fulfill the basic...

  12. Documentation of spectrom-41

    International Nuclear Information System (INIS)

    SPECTROM-41 is a finite element heat transfer computer program developed to analyze thermal problems related to nuclear waste disposal. The code is part of the SPECTROM (Special Purpose Engineering Codes for Thermal/Rock Mechanics) series of special purpose finite element programs that are continually being developed by RE/SPEC Inc. (RSI) to address the many unique problems encountered in analyses involving geologic formations. This document presents the theoretical basis for the mathematical model, the finite element formulation of the program, and a description of the input data for the program, along with details about program support and continuing documentation. The documentation is intended to satisfy the requirements and guidelines outlined in NUREG-0856. The principal model used in the program is based on Fourier's law of heat conduction. Numerous program options provide the capability of considering various boundary conditions, material stratification and anisotropy, and time-dependent heat generation that are characteristic of problems involving the disposal of nuclear waste in geologic formations. Numerous verification problems are included in the documentation in addition to highlights of past and ongoing verification and validation efforts. A typical repository problem is solved using SPECTROM-41 to demonstrate the use of the program in addressing problems related to the disposal of nuclear waste

  13. QA programme documentation

    International Nuclear Information System (INIS)

    The present paper deals with the following topics: The need for a documented Q.A. program; Establishing a Q.A. program; Q.A. activities; Fundamental policies; Q.A. policies; Quality objectives; Q.A. manual. (orig./RW)

  14. Analysis of Design Documentation

    DEFF Research Database (Denmark)

    Hansen, Claus Thorp

    1998-01-01

    has been established where we seek to identify useful design work patterns by retrospective analyses of documentation created during design projects. This paper describes the analysis method, a tentatively defined metric to evaluate identified work patterns, and presents results from the first analysis accomplished.

  15. Biogas document; Dossier Biogaz

    Energy Technology Data Exchange (ETDEWEB)

    Verchin, J.C.; Servais, C. [Club BIOGAZ, 94 - Arcueil (France)

    2002-06-01

    In this document on biogas, the author presents the situation of this renewable energy in 2001-2002, the actors concerned, an inventory of the industrial methanization installations in France, the three main process chains for industrial wastes, and two examples of methanization implementations in a paper mill and a dairy. (A.L.B.)

  16. Extremely secure identification documents

    International Nuclear Information System (INIS)

    The technology developed in this project uses biometric information printed on the document and public key cryptography to ensure that an adversary cannot issue identification documents to unauthorized individuals or alter existing documents to allow their use by unauthorized individuals. This process can be used to produce many types of identification documents with much higher security than any currently in use. The system is demonstrated using a security badge as an example. This project focused on the technologies requiring development in order to make the approach viable with existing badge printing and laminating technologies. By far the most difficult was the image processing required to verify that the picture on the badge had not been altered. Another area that required considerable work was the high density printed data storage required to get sufficient data on the badge for verification of the picture. The image processing process was successfully tested, and recommendations are included to refine the badge system to ensure high reliability. A two dimensional data array suitable for printing the required data on the badge was proposed, but testing of the readability of the array had to be abandoned due to reallocation of the budgeted funds by the LDRD office

  17. Extracting laboratory test information from biomedical text

    Directory of Open Access Journals (Sweden)

    Yanna Shen Kang

    2013-01-01

    Full Text Available Background: No previous study has reported the efficacy of current natural language processing (NLP) methods for extracting laboratory test information from narrative documents. This study investigates the pathology informatics question of how accurately such information can be extracted from text with current tools and techniques, especially machine learning and symbolic NLP methods. The study data came from a text corpus maintained by the U.S. Food and Drug Administration, containing a rich set of information on laboratory tests and test devices. Methods: The authors developed a symbolic information extraction (SIE) system to extract device- and test-specific information about four types of laboratory test entities: specimens, analytes, units of measure and detection limits. They compared the performance of SIE with three prominent machine learning based NLP systems, LingPipe, GATE and BANNER, each implementing a distinct supervised machine learning method: hidden Markov models, support vector machines and conditional random fields, respectively. Results: The machine learning systems recognized laboratory test entities with moderately high recall, but low precision rates. Their recall rates were relatively higher when the number of distinct entity values (e.g., the spectrum of specimens) was very limited or when the lexical morphology of the entity was distinctive (as in units of measure), yet SIE outperformed them with statistically significant margins on extracting specimen, analyte and detection limit information in both precision and F-measure. Its high recall performance was statistically significant on analyte information extraction. Conclusions: Despite its shortcomings against machine learning methods, a well-tailored symbolic system may better discern relevancy among a pile of information of the same type and may outperform a machine learning system by tapping into lexically non-local contextual information such as the document structure.

  18. Practical vision based degraded text recognition system

    Science.gov (United States)

    Mohammad, Khader; Agaian, Sos; Saleh, Hani

    2011-02-01

    Rapid growth and progress in the medical, industrial, security and technology fields mean more and more consideration for the use of camera-based optical character recognition (OCR). Applying OCR to scanned documents is quite mature, and there are many commercial and research products available on this topic. These products achieve acceptable recognition accuracy and reasonable processing times, especially with trained software and constrained text characteristics. Even though the application space for OCR is huge, it is quite challenging to design a single system that is capable of performing automatic OCR for text embedded in an image irrespective of the application. Challenges for OCR systems include images taken under natural real-world conditions: surface curvature, text orientation, font, size, lighting conditions, and noise. These and many other conditions make it extremely difficult to achieve reasonable character recognition. Performance of conventional OCR systems drops dramatically as the degradation level of the text image quality increases. In this paper, a new recognition method is proposed to recognize solid or dotted line degraded characters. The degraded text string is localized and segmented using a new algorithm. The new method was implemented and tested using a development framework system that is capable of performing OCR on camera-captured images. The framework allows parameter tuning of the image-processing algorithm based on a training set of camera-captured text images. Novel methods were used for enhancement, text localization and the segmentation algorithm, which enables building a custom system that is capable of performing automatic OCR for different applications. The developed framework system includes new image enhancement, filtering, and segmentation techniques which enabled higher recognition accuracies, faster processing time, and lower energy consumption, compared with the best state of the art published

  19. Related Documents Search Using User Created Annotations

    Directory of Open Access Journals (Sweden)

    Jakub Sevcech

    2013-01-01

    Full Text Available We often use various services for creating bookmarks, tags, highlights and other types of annotations while surfing the Internet or when reading electronic documents. These services allow us to create many of the types of annotation that we commonly make in printed documents. Annotations attached to electronic documents, however, can be used for other purposes such as navigation support, text summarization, etc. We proposed a method for searching documents related to the currently studied document, using annotations created by the document reader as indicators of the user's interest in particular parts of the document. The method is based on spreading activation in text transformed into a graph. For evaluation we created a service called Annota, which allows users to insert various types of annotations into web pages and PDF documents displayed in the web browser. We analyzed the properties of various types of annotations inserted by users of Annota into documents. Based on these, we evaluated our method by simulation and compared it against a commonly used TF-IDF based method.
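
    A bare-bones Python version of spreading activation over a document graph; the graph construction from text and the annotation weighting used by Annota are not specified above, so the adjacency structure and seed weights here are assumptions.

        def spread_activation(graph, seeds, decay=0.5, iters=3):
            """Propagate activation from seed nodes through the graph;
            highly activated nodes point to related content."""
            act = {n: 0.0 for n in graph}
            act.update(seeds)                  # annotated parts start activated
            for _ in range(iters):
                nxt = dict(act)
                for node, neighbours in graph.items():
                    if not neighbours:
                        continue
                    share = act[node] * decay / len(neighbours)
                    for nb in neighbours:
                        nxt[nb] = nxt.get(nb, 0.0) + share
                act = nxt
            return sorted(act.items(), key=lambda kv: -kv[1])

        graph = {"s1": ["s2", "s3"], "s2": ["s1"], "s3": ["s1", "s4"], "s4": ["s3"]}
        print(spread_activation(graph, {"s1": 1.0}))   # "s1" carries an annotation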

  20. Technical approach document

    Energy Technology Data Exchange (ETDEWEB)

    1989-12-01

    The Uranium Mill Tailings Radiation Control Act (UMTRCA) of 1978, Public Law 95-604 (PL95-604), grants the Secretary of Energy the authority and responsibility to perform such actions as are necessary to minimize radiation health hazards and other environmental hazards caused by inactive uranium mill sites. This Technical Approach Document (TAD) describes the general technical approaches and design criteria adopted by the US Department of Energy (DOE) in order to implement remedial action plans (RAPS) and final designs that comply with EPA standards. It does not address the technical approaches necessary for aquifer restoration at processing sites; a guidance document, currently in preparation, will describe aquifer restoration concerns and technical protocols. This document is a second revision to the original document issued in May 1986; the revision has been made in response to changes to the groundwater standards of 40 CFR 192, Subparts A--C, proposed by EPA as draft standards. New sections were added to define the design approaches and designs necessary to comply with the groundwater standards. These new sections are in addition to changes made throughout the document to reflect current procedures, especially in cover design, water resources protection, and alternate site selection; only minor revisions were made to some of the sections. Section 3.0 is a new section defining the approach taken in the design of disposal cells; Section 4.0 has been revised to include design of vegetated covers; Section 8.0 discusses design approaches necessary for compliance with the groundwater standards; and Section 9.0 is a new section dealing with nonradiological hazardous constituents. 203 refs., 18 figs., 26 tabs.

  1. Technical approach document

    International Nuclear Information System (INIS)

    The Uranium Mill Tailings Radiation Control Act (UMTRCA) of 1978, Public Law 95-604 (PL95-604), grants the Secretary of Energy the authority and responsibility to perform such actions as are necessary to minimize radiation health hazards and other environmental hazards caused by inactive uranium mill sites. This Technical Approach Document (TAD) describes the general technical approaches and design criteria adopted by the US Department of Energy (DOE) in order to implement remedial action plans (RAPS) and final designs that comply with EPA standards. It does not address the technical approaches necessary for aquifer restoration at processing sites; a guidance document, currently in preparation, will describe aquifer restoration concerns and technical protocols. This document is a second revision to the original document issued in May 1986; the revision has been made in response to changes to the groundwater standards of 40 CFR 192, Subparts A--C, proposed by EPA as draft standards. New sections were added to define the design approaches and designs necessary to comply with the groundwater standards. These new sections are in addition to changes made throughout the document to reflect current procedures, especially in cover design, water resources protection, and alternate site selection; only minor revisions were made to some of the sections. Section 3.0 is a new section defining the approach taken in the design of disposal cells; Section 4.0 has been revised to include design of vegetated covers; Section 8.0 discusses design approaches necessary for compliance with the groundwater standards; and Section 9.0 is a new section dealing with nonradiological hazardous constituents. 203 refs., 18 figs., 26 tabs

  2. Archimedes: Accelerator Reveals Ancient Text

    International Nuclear Information System (INIS)

    Archimedes (287-212 BC), who is famous for shouting 'Eureka' (I found it) is considered one of the most brilliant thinkers of all times. The 10th-century parchment document known as the 'Archimedes Palimpsest' is the unique source for two of the great Greek's treatises. Some of the writings, hidden under gold forgeries, have recently been revealed at the Stanford Synchrotron Radiation Laboratory at SLAC. An intense x-ray beam produced in a particle accelerator causes the iron in original ink, which has been partly erased and covered, to send out a fluorescence glow. A detector records the signal and a digital image showing the ancient writings is produced. Please join us in this fascinating journey of a 1,000-year-old parchment from its origin in the Mediterranean city of Constantinople to a particle accelerator in Menlo Park.

  3. Princess Brambilla - images/text

    Directory of Open Access Journals (Sweden)

    Maria Aparecida Barbosa

    2016-06-01

    Full Text Available To read an illustrated literary text is simultaneously to think about pictures and words. This articulation between the written text and the pictures adds potential, expands the work and makes it more complex. It coincides with current discussions of Giorgio Agamben's notion of the "contemporary", which adds to what adheres to its own time the displacement and the distance needed to understand it, shaking linear notions of historical chronology. In some ways this coincidence is related to the current interest in the concept of "Nachleben" (survival), which assumes the rescue of images of the past, postulated by the art historian Aby Warburg in his research on the survival of ancient motifs of motion in Botticelli's Renaissance pictures. Such discussions were fundamental for the translation into Portuguese of Princesa Brambilla – um capriccio segundo Jakob Callot, de E. T. A. Hoffmann, com 8 gravuras cunhadas a partir de moldes originais de Callot (1820), as I try to show in this article.

  4. Inferring Group Processes from Computer-Mediated Affective Text Analysis

    Energy Technology Data Exchange (ETDEWEB)

    Schryver, Jack C [ORNL; Begoli, Edmon [ORNL; Jose, Ajith [Missouri University of Science and Technology; Griffin, Christopher [Pennsylvania State University

    2011-02-01

    Political communications in the form of unstructured text convey rich connotative meaning that can reveal underlying group social processes. Previous research has focused on sentiment analysis at the document level, but we extend this analysis to sub-document levels through a detailed analysis of affective relationships between entities extracted from a document. Instead of pure sentiment analysis, which is just positive or negative, we explore nuances of affective meaning in 22 affect categories. Our affect propagation algorithm automatically calculates and displays extracted affective relationships among entities in graphical form in our prototype (TEAMSTER), starting with seed lists of affect terms. Several useful metrics are defined to infer underlying group processes by aggregating affective relationships discovered in a text. Our approach has been validated with annotated documents from the MPQA corpus, achieving a performance gain of 74% over comparable random guessers.

  5. Fuzzy Swarm Based Text Summarization

    Directory of Open Access Journals (Sweden)

    Mohammed S. Binwahlan

    2009-01-01

    Full Text Available Problem statement: The aim of automatic text summarization systems is to select the most relevant information from an abundance of text sources. The rapid daily growth of data on the internet makes achieving this aim a big challenge. Approach: In this study, we incorporated fuzzy logic with swarm intelligence, so that risks, uncertainty, ambiguity and imprecision in choosing the feature weights (scores) could be flexibly tolerated. The weights obtained from the swarm experiment were used to adjust the text feature scores, and then the feature scores were used as inputs for the fuzzy inference system to produce the final sentence score. The sentences were ranked in descending order based on their scores and then the top n sentences were selected as the final summary. Results: The experiments showed that the incorporation of fuzzy logic with swarm intelligence could play an important role in the selection of the most important sentences to be included in the final summary. The results also showed that the proposed method performed well, outperforming the swarm model and the benchmark methods. Conclusion: Incorporating more than one technique for sentence scoring proved to be an effective mechanism. PSO was employed to produce the text feature weights. The purpose of this process was to deal with the text features fairly, based on their importance, and to differentiate between more and less important features. The fuzzy inference system was employed to determine the final sentence score, on which the decision was made whether to include the sentence in the summary or not.
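
    To make the fuzzy step concrete, here is a heavily simplified Python scorer: triangular memberships over an aggregated, swarm-weighted feature score and a weighted-average defuzzification. The membership breakpoints and output levels are invented for illustration; the paper's actual rule base is not published in this summary.

        def tri(x, a, b, c):
            # Triangular membership function peaking at b.
            return max(0.0, min((x - a) / (b - a), (c - x) / (c - b)))

        def fuzzy_sentence_score(feature_scores):
            """Map aggregated feature scores (assumed in [0, 1]) to one
            sentence score via low/medium/high fuzzy sets."""
            x = sum(feature_scores) / len(feature_scores)   # aggregated input
            low = tri(x, -0.5, 0.0, 0.5)
            med = tri(x, 0.0, 0.5, 1.0)
            high = tri(x, 0.5, 1.0, 1.5)
            # Weighted-average defuzzification over three output levels.
            return (low * 0.1 + med * 0.5 + high * 0.9) / (low + med + high)

        print(fuzzy_sentence_score([0.8, 0.6, 0.9]))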

  6. Shape codebook based handwritten and machine printed text zone extraction

    Science.gov (United States)

    Kumar, Jayant; Prasad, Rohit; Cao, Huiagu; Abd-Almageed, Wael; Doermann, David; Natarajan, Premkumar

    2011-01-01

    In this paper, we present a novel method for extracting handwritten and printed text zones from noisy document images with mixed content. We use Triple-Adjacent-Segment (TAS) based features which encode local shape characteristics of text in a consistent manner. We first construct two codebooks of the shape features extracted from a set of handwritten and printed text documents respectively. We then compute the normalized histogram of codewords for each segmented zone and use it to train a Support Vector Machine (SVM) classifier. The codebook based approach is robust to the background noise present in the image and TAS features are invariant to translation, scale and rotation of text. In experiments, we show that a pixel-weighted zone classification accuracy of 98% can be achieved for noisy Arabic documents. Further, we demonstrate the effectiveness of our method for document page classification and show that a high precision can be achieved for the detection of machine printed documents. The proposed method is robust to the size of zones, which may contain text content at line or paragraph level.
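
    The codebook-histogram step can be sketched as follows, assuming the local shape descriptors per zone have already been extracted; the descriptors, codebook and labels below are synthetic placeholders, and scikit-learn's SVC plays the SVM role:

      # Codebook-histogram step with synthetic descriptors and codebook;
      # scikit-learn's SVC plays the SVM role.
      import numpy as np
      from sklearn.svm import SVC

      def codeword_histogram(descriptors, codebook):
          """Assign each descriptor to its nearest codeword and return the
          normalized histogram of codeword counts for the zone."""
          dists = np.linalg.norm(descriptors[:, None, :] - codebook[None, :, :], axis=2)
          counts = np.bincount(dists.argmin(axis=1), minlength=len(codebook))
          return counts / counts.sum()

      rng = np.random.default_rng(0)
      codebook = rng.normal(size=(32, 8))                  # 32 codewords, 8-D features
      zones = [rng.normal(loc=l, size=(50, 8)) for l in (0.0, 1.0)] * 10
      X = np.stack([codeword_histogram(z, codebook) for z in zones])
      y = np.array([0, 1] * 10)                            # 0 = handwritten, 1 = printed
      clf = SVC(kernel="rbf").fit(X, y)
      print(clf.predict(X[:2]))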

  7. AN EFFICIENT TEXT CLASSIFICATION USING KNN AND NAIVE BAYESIAN

    Directory of Open Access Journals (Sweden)

    J.Sreemathy

    2012-03-01

    Full Text Available The main objective is to propose a text classification method based on feature selection and preprocessing, thereby reducing the dimensionality of the feature vector and increasing the classification accuracy. Text classification is the process of assigning a document to one or more target categories based on its contents. In the proposed method, machine learning methods for text classification apply text preprocessing to different datasets and then extract feature vectors for each new document, using various feature weighting methods to enhance classification accuracy. The classifier is then trained with the Naive Bayesian (NB) and K-nearest neighbor (KNN) algorithms, and the prediction is made according to the category distribution among the k nearest neighbors. Experimental results show that the methods are favorable in terms of their effectiveness and efficiency when compared with other classifiers such as SVM.
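
    A minimal sketch of the two classifiers on TF-IDF features with scikit-learn; the tiny corpus and labels are stand-ins, not the paper's dataset:

      # TF-IDF features plus the two classifiers from the paper, via scikit-learn.
      from sklearn.feature_extraction.text import TfidfVectorizer
      from sklearn.naive_bayes import MultinomialNB
      from sklearn.neighbors import KNeighborsClassifier

      docs = ["stock markets fell sharply", "the team won the final match",
              "shares and bonds rallied", "the striker scored two goals"]
      labels = ["finance", "sport", "finance", "sport"]

      X = TfidfVectorizer(stop_words="english").fit_transform(docs)
      for clf in (MultinomialNB(), KNeighborsClassifier(n_neighbors=3)):
          clf.fit(X, labels)
          print(type(clf).__name__, clf.predict(X[:1]))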

  8. Multimodal interactive handwritten text transcription

    CERN Document Server

    Romero, Veronica; Vidal, Enrique

    2012-01-01

    This book presents an interactive multimodal approach for the efficient transcription of handwritten text images. This approach assists the expert in the recognition and transcription process, rather than aiming at full automation. To date, handwritten text recognition (HTR) systems are far from perfect, and heavy human intervention is often required to check and correct their results. The interactive scenario studied in this book combines the efficiency of automatic handwriting recognition systems with the accuracy of the experts, leading to a cost-effective perfect transcription of the text.

  9. Cluster Based Text Classification Model

    DEFF Research Database (Denmark)

    Nizamani, Sarwat; Memon, Nasrullah; Wiil, Uffe Kock

    2011-01-01

    We propose a cluster based classification model for suspicious email detection and other text classification tasks. The text classification tasks comprise many training examples that require a complex classification model. Using clusters for classification makes the model simpler and increases the accuracy: a classifier is trained on each cluster, which has reduced dimensionality and fewer examples. The experimental results show that the proposed model outperforms the existing classification models for the task of suspicious email detection and topic categorization on the Reuters-21578 and 20 Newsgroups datasets. Our model also outperforms the Decision Cluster Classification (ADCC) and the Decision Cluster Forest Classification (DCFC) models on the Reuters-21578 dataset.

  10. Quality Inspection of Printed Texts

    DEFF Research Database (Denmark)

    Pedersen, Jesper Ballisager; Nasrollahi, Kamal; Moeslund, Thomas B.

    2016-01-01

    Inspecting the quality of printed texts has its own importance in many industrial applications. To do so, this paper proposes a grading system which evaluates the performance of the printing task using quality measures for each character and symbol. The purpose of this grading system is two-fold: for customers of the printing and verification system, the overall grade is used to verify whether the text is of sufficient quality, while for the printer's manufacturer, the detailed character/symbol grades and quality measurements are used for the improvement and optimization of the printing task.

  11. Valentine Wilderness proposal supporting documents

    Data.gov (United States)

    US Fish and Wildlife Service, Department of the Interior — This is a series of documents meant to support the Valentine Wilderness proposal. The documents include a draft bill, a draft letter to the President, a...

  12. Ontological representation of texts, and its applicationsin text analysis

    OpenAIRE

    Solheim, Bent André; Vågsnes, Kristian

    2003-01-01

    For the management of a company, the need to know what people think of their products or services is becoming increasingly important in an increasingly competitive market. As the Internet can nearly be described as a digital mirror of events in the "real" world, being able to make sense of the semi-structured nature of natural language texts published in this ubiquitous medium has received growing interest. The approach proposed in the thesis combines natural language processin...

  13. A Guide Text or Many Texts? "That is the Question”

    Directory of Open Access Journals (Sweden)

    Delgado de Valencia Sonia

    2001-08-01

    Full Text Available The use of supplementary materials in the classroom has always been an essential part of the teaching and learning process. To restrict our teaching to the scope of one single textbook means to fall behind the advances of knowledge, in any area and context. Young learners appreciate any new and varied support that expands their knowledge of the world: diaries, letters, panels, free texts, magazines, short stories, poems or literary excerpts, and articles taken from the Internet are materials that will allow learners to share more and work more collaboratively. In this article we deal with some of these materials, and with the criteria to select, adapt, and create them, so that they may be of interest to the learner and may promote reading and writing processes. Since no text can entirely satisfy the needs of students and teachers, the creativity of both parties will be necessary to improve the quality of teaching through the adequate use and adaptation of supplementary materials.

  14. Full texts in the Czech geographical bibliography database

    OpenAIRE

    Eva Novotná

    2014-01-01

    Open access to documents is one of the basic requirements of database users. Czech Geographical Bibliography On-line provides access to 185,000 bibliographical records of Bohemical geographic and cartographic documents and to more than 30,000 full texts and objects. Access is provided through a connection from the permanent storage, the Digital University Repository, or a URL address in the bibliographical record. Works in the public domain can directly become accessible or it is nece...

  15. Comparison of Text Categorization Algorithms

    Institute of Scientific and Technical Information of China (English)

    SHI Yong-feng; ZHAO Yan-ping

    2004-01-01

    This paper summarizes several automatic text categorization algorithms in common use recently, and analyzes and compares their advantages and disadvantages. It provides clues for choosing appropriate automatic classification algorithms in different fields. Finally, some evaluations and summaries of these algorithms are discussed, and directions for further research are pointed out.

  16. Multilingual text induced spelling correction

    NARCIS (Netherlands)

    Reynaert, M.W.C.

    2004-01-01

    We present TISC, a multilingual, language-independent and context-sensitive spelling checking and correction system designed to facilitate the automatic removal of non-word spelling errors in large corpora. Its lexicon is derived from raw text corpora, without supervision, and contains word unigrams

  17. Values Education: Texts and Supplements.

    Science.gov (United States)

    Curriculum Review, 1979

    1979-01-01

    This column describes and evaluates almost 40 texts, instructional kits, and teacher resources on values, interpersonal relations, self-awareness, self-help skills, juvenile psychology, and youth suicide. Eight effective picture books for the primary grades and seven titles in values fiction for teens are also reviewed. (SJL)

  18. Solar Concepts: A Background Text.

    Science.gov (United States)

    Gorham, Jonathan W.

    This text is designed to provide teachers, students, and the general public with an overview of key solar energy concepts. Various energy terms are defined and explained. Basic thermodynamic laws are discussed. Alternative energy production is described in the context of the present energy situation. Described are the principal contemporary solar…

  19. Reviving "Walden": Mining the Text.

    Science.gov (United States)

    Hewitt Julia

    2000-01-01

    Describes how the author and her high school English students begin their study of Thoreau's "Walden" by mining the text for quotations to inspire their own writing and discussion on the topic, "How does Thoreau speak to you or how could he speak to someone you know?" (SR)

  20. Presentation of the math text

    OpenAIRE

    KREJČOVÁ, Iva

    2009-01-01

    The aim of this bachelor thesis is a basic survey of the tools for creating and presenting mathematical texts, together with the acquisition of basic user skills in these programs. The tools are also compared in terms of availability, ease of use, capabilities, and quality of output.

  1. A text mining framework in R and its applications

    OpenAIRE

    Feinerer, Ingo

    2008-01-01

    Text mining has become an established discipline both in research and in business intelligence. However, many existing text mining toolkits lack easy extensibility and provide only poor support for interacting with statistical computing environments. We therefore propose a text mining framework for the statistical computing environment R which provides intelligent methods for corpora handling, meta data management, preprocessing, operations on documents, and data export. We present how well es...

  2. An Efficient Technique to Implement Similarity Measures in Text Document Clustering using Artificial Neural Networks Algorithm

    OpenAIRE

    K. Selvi; R.M. Suresh

    2014-01-01

    Pattern recognition (encompassing supervised and unsupervised methods), optimization, associative memory and control processes are some of the diverse problems that can be solved by artificial neural networks. Problem identified: of late, discovering the required information in a massive quantity of data is a challenging task. The model of similarity evaluation is the central element in developing an understanding of the variables and perceptions that encourage behavior and mediate concern. This s...

  3. On-line access to the full-texts of non periodical documents

    International Nuclear Information System (INIS)

    This article describes several options for how electronic books (technical handbooks, scientific books, reference works, etc.) are published and made available on-line on the Internet. There is a short description of some of the major services provided by worldwide publishers. As a part of the presentation there will be a live demonstration of selected services and working examples of the most interesting systems. (author)

  4. Electronic books. On-line access to the full-texts of non periodical documents

    International Nuclear Information System (INIS)

    This presentation describes several options for how electronic books (technical handbooks, scientific books, reference works, etc.) are published and made available on-line on the Internet. There is a short description of some of the major services provided by worldwide publishers. As a part of the presentation there will be a live demonstration of selected services and working examples of the most interesting systems. (author)

  5. Standardization of engineering documentation

    International Nuclear Information System (INIS)

    Many interrelated activities involving a number of organizational units comprise the process for the design and construction of a nuclear steam supply system (NSSS). In the application of a standard NSSS design, many activities are duplicated from project to project and form a standard process for the engineering. This standard process in turn lends itself to a system for standardizing the engineering documentation associated with a particular design application. For these varied activities to be carried out successfully, a strong network of communication is required not only within each design organization but also externally among the various participants: the owner, the NSSS supplier, the architect-engineer, the construction agency, equipment suppliers, and others. This paper discusses, from the viewpoint of an NSSS supplier's engineering organization, the role of standard engineering documents in the design process and communication network.

  6. Musique et document sonore

    OpenAIRE

    Javault, Patrick

    2013-01-01

    Drawn from a thesis in musicology and aesthetics, this study is remarkable first of all for the way it establishes groupings and defines its line of inquiry. Pierre-Yves Macé, himself a musician, composer and practitioner of phonographic archives, finds cross-cutting lines to illuminate the use of the sound document in art music and experimental music (deliberately setting pop aside): the sound document considered as the Other of the musical, which can also be called "the real of...

  7. Areva - 2011 Reference document

    International Nuclear Information System (INIS)

    After having indicated the person responsible for this document and the statutory auditors, and provided some financial information, this document gives an overview of the different risk factors existing in the company: legal risks, industrial and environmental risks, operational risks, risks related to large projects, market and liquidity risks. Then, after having recalled the history and evolution of the company and the evolution of its investments over the last five years, it proposes an overview of Areva's activities in the markets of nuclear energy and renewable energies, of its clients and suppliers, of its strategy, and of the activities of its different departments. Other information is provided: the company's organization chart, real estate properties (plants, equipment), an analysis of its financial situation, its research and development policy, the present context, profit forecasts or estimates, and management organization and operation.

  8. Enhancing Text Clustering Using Concept-based Mining Model

    Directory of Open Access Journals (Sweden)

    Lincy Liptha R.

    2012-03-01

    Full Text Available Text mining techniques are mostly based on the statistical analysis of a word or phrase. The statistical analysis of term frequency captures the importance of a term within a document only, yet two terms can have the same frequency in the same document while one contributes more to the meaning than the other. Hence, the terms that capture the semantics of the text should be given more importance. Here, a new concept-based mining model is introduced. It analyses terms at the sentence, document and corpus levels. The model consists of a sentence-based concept analysis which calculates the conceptual term frequency (ctf), a document-based concept analysis which finds the term frequency (tf), a corpus-based concept analysis which determines the document frequency (df), and a concept-based similarity measure. The calculation of the ctf, tf and df measures in a corpus is attained by the proposed algorithm, which is called the Concept-Based Analysis Algorithm. By doing so we cluster the web documents in an efficient way, and the quality of the clusters achieved by this model significantly surpasses that of the traditional single-term-based approaches.
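
    A toy sketch of the three frequency levels; the paper defines ctf at the sentence level of analysis, and a simple per-sentence presence count stands in for it here:

      # Toy version of the three frequency levels; a per-sentence presence
      # count stands in for the paper's sentence-based ctf.
      def concept_stats(corpus, term):
          """corpus: list of documents, each a list of sentence strings."""
          term = term.lower()
          df, per_doc = 0, []
          for doc in corpus:
              sent_tokens = [s.lower().split() for s in doc]
              ctf = sum(term in toks for toks in sent_tokens)     # sentence level
              tf = sum(toks.count(term) for toks in sent_tokens)  # document level
              df += tf > 0                                        # corpus level
              per_doc.append({"ctf": ctf, "tf": tf})
          return per_doc, df

      docs = [["the engine failed", "the engine was replaced"], ["no issues found"]]
      print(concept_stats(docs, "engine"))  # ([{'ctf': 2, 'tf': 2}, {'ctf': 0, 'tf': 0}], 1)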

  9. TEXT SIGNAGE RECOGNITION IN ANDROID MOBILE DEVICES

    Directory of Open Access Journals (Sweden)

    Oi-Mean Foong

    2013-01-01

    Full Text Available This study presents a Text Signage Recognition (TSR) model in Android mobile devices for Visually Impaired People (VIP). Independent navigation is always a challenge for VIPs in unfamiliar indoor surroundings. Assistive technology such as Android smart devices has great potential to assist VIPs in indoor navigation using the built-in speech synthesizer. In contrast to previous TSR research, which was deployed on a standalone personal computer system using Otsu's algorithm, we have developed an affordable Text Signage Recognition system for Android mobile devices using the Tesseract OCR engine. The proposed TSR model used input images from the International Conference on Document Analysis and Recognition (ICDAR) 2003 dataset for system training and testing. The TSR model was tested by four volunteers who were blindfolded. The system performance of the TSR model was assessed using different metrics (i.e., Precision, Recall, F-Score and Recognition Formulas) to determine its accuracy. Experimental results show that the proposed TSR model achieved a satisfactory recognition rate.
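
    A desktop analogue of the recognition step, assuming Tesseract and the pytesseract wrapper are installed (the Android app drives the same engine through its Java bindings); the image path is hypothetical:

      # Desktop analogue of the recognition step; "signage.png" is a
      # hypothetical input image, and Tesseract must be installed locally.
      from PIL import Image
      import pytesseract

      img = Image.open("signage.png").convert("L")   # grayscale often helps OCR
      text = pytesseract.image_to_string(img, lang="eng")
      print(text.strip())                            # pass on to a speech synthesizer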

  10. SANSMIC design document.

    Energy Technology Data Exchange (ETDEWEB)

    Weber, Paula D. [Sandia National Laboratories (SNL-NM), Albuquerque, NM (United States); Rudeen, David Keith [GRAM, Inc., Albuquerque, NM (United States)

    2015-07-01

    The United States Strategic Petroleum Reserve (SPR) maintains an underground storage system consisting of caverns that were leached or solution mined in four salt domes located near the Gulf of Mexico in Texas and Louisiana. The SPR comprises more than 60 active caverns containing approximately 700 million barrels of crude oil. Sandia National Laboratories (SNL) is the geotechnical advisor to the SPR. As the most pressing need at the inception of the SPR was to create and fill storage volume with oil, the decision was made to leach the caverns and fill them simultaneously (leach-fill). Therefore, A.J. Russo developed SANSMIC in the early 1980s, which allows for a transient oil-brine interface (OBI), making it possible to model leach-fill and withdrawal operations. As the majority of caverns are currently filled to storage capacity, the primary uses of SANSMIC at this time are related to the effects of small and large withdrawals, expansion of existing caverns, and projecting future pillar-to-diameter ratios. SANSMIC was identified by SNL as a priority candidate for qualification. This report continues the quality assurance (QA) process by documenting the "as built" mathematical and numerical models that comprise the code. The program flow is outlined and the models are discussed in detail. Code features that were added later or were not documented previously have been expounded. No changes in the code's physics have occurred since the original documentation (Russo, 1981, 1983), although recent experiments may yield improvements to the temperature and plume methods in the future.

  11. Documenting Norwegian Scholarly Publishing

    OpenAIRE

    R.W. Vaagan

    2005-01-01

    From 2005-2006, scholarly publishing, including e-publishing, becomes one of several criteria used by The Ministry of Education and Science in financing research in Norwegian universities and colleges. Based on qualitative methodology and critical case sampling of recent Norwegian policy documents and reports, combined with typical case sampling of articles on e-publishing 2000-2005, especially from D-Lib magazine (Patton, 2002; Hawkins, 2001), the article discusses trends in Norwegian schola...

  12. Chinese multi-document personal name disambiguation

    Institute of Scientific and Technical Information of China (English)

    2005-01-01

    This paper presents a new approach to determining whether occurrences of a personal name of interest across documents refer to the same entity. First, three vectors are formed for each text: the personal-name Boolean vector denoting whether a personal name occurs in the text, the biographical-word Boolean vector representing title, occupation and so forth, and a feature vector with real values. Then, by combining a heuristic strategy based on the Boolean vectors with an agglomerative clustering algorithm based on the feature vectors, it seeks to resolve multi-document personal name coreference. Experimental results show that this approach achieves good performance in tests on the "Wang Gang" corpus.
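
    A schematic version of the two-stage idea, with invented vectors: a Boolean-overlap heuristic first, then agglomerative clustering (scikit-learn) on the real-valued feature vectors:

      # Two-stage idea with invented vectors: Boolean overlap as a heuristic,
      # then agglomerative clustering on the real-valued feature vectors.
      import numpy as np
      from sklearn.cluster import AgglomerativeClustering

      bio = np.array([[1, 0, 1],    # doc 0: title A, occupation C mentioned
                      [1, 0, 1],    # doc 1: same biographical profile
                      [0, 1, 0]])   # doc 2: different profile
      feats = np.array([[0.9, 0.1], [0.8, 0.2], [0.1, 0.9]])

      overlap = bio @ bio.T         # heuristic: shared biographical words
      labels = AgglomerativeClustering(n_clusters=2).fit_predict(feats)
      print(overlap)                # docs 0 and 1 overlap strongly...
      print(labels)                 # ...and fall into the same cluster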

  13. AREVA - 2013 Reference document

    International Nuclear Information System (INIS)

    This Reference Document contains information on the AREVA group's objectives, prospects and development strategies, as well as estimates of the markets, market shares and competitive position of the AREVA group. Content: 1 - Person responsible for the Reference Document; 2 - Statutory auditors; 3 - Selected financial information; 4 - Description of major risks confronting the company; 5 - Information about the issuer; 6 - Business overview; 7 - Organizational structure; 8 - Property, plant and equipment; 9 - Situation and activities of the company and its subsidiaries; 10 - Capital resources; 11 - Research and development programs, patents and licenses; 12 - Trend information; 13 - Profit forecasts or estimates; 14 - Management and supervisory bodies; 15 - Compensation and benefits; 16 - Functioning of the management and supervisory bodies; 17 - Human resources information; 18 - Principal shareholders; 19 - Transactions with related parties; 20 - Financial information concerning assets, financial positions and financial performance; 21 - Additional information; 22 - Major contracts; 23 - Third party information, statements by experts and declarations of interest; 24 - Documents on display; 25 - Information on holdings; Appendix 1: report of the supervisory board chairman on the preparation and organization of the board's activities and internal control procedures; Appendix 2: statutory auditors' reports; Appendix 3: environmental report; Appendix 4: non-financial reporting methodology and independent third-party report on social, environmental and societal data; Appendix 5: ordinary and extraordinary general shareholders' meeting; Appendix 6: values charter; Appendix 7: table of concordance of the management report; glossaries

  14. Content Documents Management

    Science.gov (United States)

    Muniz, R.; Hochstadt, J.; Boelke J.; Dalton, A.

    2011-01-01

    The Content Documents are created and managed under the System Software group within the Launch Control System (LCS) project. The System Software product group is led by the NASA Engineering Control and Data Systems branch (NEC3) at Kennedy Space Center. The team is working on creating Operating System Images (OSI) for different platforms (i.e. AIX, Linux, Solaris and Windows). Before an OSI can be created, the team must create a Content Document, which provides the information for a workstation or server, with the list of all the software that is to be installed on it and also the set where the hardware belongs; this can be, for example, the LDS, the ADS or the FR-1. The objective of this project is to create a user-interface Web application that can manage the information in the Content Documents, with all the correct validations and filters, for administrator purposes. For this project we used one of the most excellent tools in agile Web development, Ruby on Rails. This tool helps pragmatic programmers develop Web applications with the Rails framework and the Ruby programming language. It is amazing to see how a student can learn about OOP features with the Ruby language, manage the user interface with HTML and CSS, create associations and queries with gems, manage databases and run a server with MySQL, run shell commands with the command prompt, and create Web frameworks with Rails. All of this in a real-world project and in just fifteen weeks!

  15. Tank waste remediation system functions and requirements document

    International Nuclear Information System (INIS)

    This is the Tank Waste Remediation System (TWRS) Functions and Requirements Document derived from the TWRS Technical Baseline. The document consists of several text sections that provide the purpose, scope, background information, and an explanation of how this document assists the application of Systems Engineering to the TWRS. The primary functions identified in the TWRS Functions and Requirements Document are identified in Figure 4.1 (Section 4.0). Currently, this document is part of the overall effort to develop the TWRS Functional Requirements Baseline, and contains the functions and requirements needed to properly define the top three TWRS function levels. TWRS Technical Baseline information (RDD-100 database) included in the appendices of the attached document contains the TWRS functions, requirements, and architecture necessary to define the TWRS Functional Requirements Baseline. Document organization and user directions are provided in the introductory text. This document will continue to be modified during the TWRS life-cycle.

  16. Genetic Programming for Document Segmentation and Region Classification Using Discipulus

    Directory of Open Access Journals (Sweden)

    Priyadharshini N

    2013-02-01

    Full Text Available Document segmentation is a method of dividing a document into distinct regions. A document is an assortment of information and a standard means of conveying information to others. Extracting data from documents involves a great deal of human effort and time, which can severely limit the usage of data systems, so automatic information extraction from documents has become a significant issue. It has been shown that document segmentation can help to overcome such problems. This paper proposes a new approach to segment and classify document regions as text, image, drawing or table. The document image is divided into blocks using the run-length smearing algorithm, and features are extracted from every block. The Discipulus tool has been used to construct the genetic-programming-based classifier model, achieving 97.5% classification accuracy.
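
    The horizontal pass of run-length smearing can be sketched as below; the threshold and the toy binary row are illustrative:

      # Horizontal run-length smearing on a binary image (1 = ink): white
      # runs shorter than the threshold are filled to merge text into blocks.
      import numpy as np

      def rlsa_horizontal(binary, threshold=4):
          out = binary.copy()
          for row in out:
              run_start, in_run = None, False
              for j, v in enumerate(row):
                  if v == 0 and not in_run:
                      run_start, in_run = j, True
                  elif v == 1 and in_run:
                      if j - run_start < threshold:
                          row[run_start:j] = 1      # smear the short white gap
                      in_run = False
          return out

      img = np.array([[1, 0, 0, 1, 0, 0, 0, 0, 0, 1]])
      print(rlsa_horizontal(img, threshold=3))      # [[1 1 1 1 0 0 0 0 0 1]]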

  17. Tank waste remediation system functions and requirements document

    Energy Technology Data Exchange (ETDEWEB)

    Carpenter, K.E

    1996-10-03

    This is the Tank Waste Remediation System (TWRS) Functions and Requirements Document derived from the TWRS Technical Baseline. The document consists of several text sections that provide the purpose, scope, background information, and an explanation of how this document assists the application of Systems Engineering to the TWRS. The primary functions identified in the TWRS Functions and Requirements Document are identified in Figure 4.1 (Section 4.0). Currently, this document is part of the overall effort to develop the TWRS Functional Requirements Baseline, and contains the functions and requirements needed to properly define the top three TWRS function levels. TWRS Technical Baseline information (RDD-100 database) included in the appendices of the attached document contains the TWRS functions, requirements, and architecture necessary to define the TWRS Functional Requirements Baseline. Document organization and user directions are provided in the introductory text. This document will continue to be modified during the TWRS life-cycle.

  18. Algorithmic Detection of Computer Generated Text

    CERN Document Server

    Lavoie, Allen

    2010-01-01

    Computer generated academic papers have been used to expose a lack of thorough human review at several computer science conferences. We assess the problem of classifying such documents. After identifying and evaluating several quantifiable features of academic papers, we apply methods from machine learning to build a binary classifier. In tests with two hundred papers, the resulting classifier correctly labeled papers either as human written or as computer generated with no false classifications of computer generated papers as human and a 2% false classification rate for human papers as computer generated. We believe generalizations of these features are applicable to similar classification problems. While most current text-based spam detection techniques focus on the keyword-based classification of email messages, a new generation of unsolicited computer-generated advertisements masquerade as legitimate postings in online groups, message boards and social news sites. Our results show that taking the formatti...

  19. Functional Stylistics and Peripeteic Texts

    DEFF Research Database (Denmark)

    Borchmann, Simon

    2008-01-01

    Using a pragmatically based linguistic description apparatus on literary use of language is not unproblematic. Observations show that literary use of language violates the norms contained by this apparatus. With this paper I suggest how we can deal with this problem by setting up a frame for the use of a functional linguistic description apparatus on literary texts. As an extension of this suggestion I present a model for describing a specific type of literary texts.

  20. TEXT tf coil bonding system

    International Nuclear Information System (INIS)

    An extensive bond test program was conducted prior to manufacturing and bonding the toroidal field (TF) coils for the Texas Experimental Tokamak (TEXT). The bonding materials consisted of fiberglass cloth with pre-impregnated, 'B' staged Hexcel F-159 resin. Approximately 100 double lap bond samples were constructed to test quality, strength, and repeatability of the bonds. The variables investigated included surface machining methods, surface preparations, bond sample size (planform area), bonding pressure, bonding temperature, and the number of laminations bonded simultaneously. Double lap shear tests conducted at room temperature resulted in ultimate shear strengths for all variables in the range of 3000 to 7000 psi with an average value of 5650 psi. Fatigue tests were also conducted to demonstrate bond integrity over the anticipated cycle lifetime of the TEXT machine (10^6 cycles) under simulated worst case conditions. 2 refs

  1. Challenges in Kurdish Text Processing

    OpenAIRE

    Esmaili, Kyumars Sheykh

    2012-01-01

    Despite having a large number of speakers, the Kurdish language is among the less-resourced languages. In this work we highlight the challenges and problems in providing the required tools and techniques for processing texts written in Kurdish. From a high-level perspective, the main challenges are: the inherent diversity of the language, standardization and segmentation issues, and the lack of language resources.

  2. Psychologische Interpretation. Biographien, Texte, Tests

    OpenAIRE

    Fahrenberg, Jochen

    2002-01-01

    Biographies, texts and tests are interpreted psychologically. Psychological interpretation is defined as the translation of a statement accompanied by explanatory comments that establish relationships; in this way connections are inferred and results are placed in context. Interpretation is translation and mutual understanding, and it must combine heuristics with methodological critique. The book introduces these methodological foundations and rules of psychological interpretation. The first chapters of the book open with an interpretatio...

  3. Text Analytics to Data Warehousing

    OpenAIRE

    Kalli Srinivasa Nageswara Prasad; S. Ramakrishna

    2010-01-01

    Information hidden or stored in unstructured data can play a critical role in making decisions, understanding and conducting other business functions. Integrating data stored in both structured and unstructured formats can add significant value to an organization. With the extent of development happening in text mining and in technologies to deal with unstructured and semi-structured data, like XML and MML (Mining Markup Language), to extract and analyze data, text analytics has evolved to handle un...

  4. A Hough Transform based Technique for Text Segmentation

    CERN Document Server

    Saha, Satadal; Nasipuri, Mita; Basu, Dipak Kr

    2010-01-01

    Text segmentation is an inherent part of an OCR system, irrespective of its domain of application. The OCR system contains a segmentation module where the text lines, words and ultimately the characters must be segmented properly for successful recognition. The present work implements a Hough transform based technique for line and word segmentation from digitized images. The proposed technique is applied not only to the document image dataset but also to datasets for a business card reader system and a license plate recognition system. For standardization of the performance of the system, the technique is also applied to the public-domain dataset published on the website of CMATER, Jadavpur University. The document images consist of multi-script printed and handwritten text lines, with variety in script and line spacing within a single document image. The technique performs quite satisfactorily when applied to mobile-camera-captured business card images with low resolution. The usefulness of the technique is verifie...
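
    A minimal OpenCV sketch of the Hough step for locating candidate text lines, assuming a scanned page at the hypothetical path "page.png"; the parameter values are illustrative and would need tuning:

      # Probabilistic Hough transform over a binarized page to find long
      # straight runs that are candidate text lines; parameters need tuning.
      import math
      import cv2

      img = cv2.imread("page.png", cv2.IMREAD_GRAYSCALE)   # hypothetical scan
      binary = cv2.adaptiveThreshold(img, 255, cv2.ADAPTIVE_THRESH_MEAN_C,
                                     cv2.THRESH_BINARY_INV, 15, 10)
      lines = cv2.HoughLinesP(binary, rho=1, theta=math.pi / 180, threshold=100,
                              minLineLength=80, maxLineGap=20)
      for x1, y1, x2, y2 in (lines[:, 0] if lines is not None else []):
          print("candidate line:", (x1, y1), "->", (x2, y2))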

  5. Density Based Script Identification of a Multilingual Document Image

    OpenAIRE

    Rumaan Bashir; S. M. K Quadri

    2015-01-01

    The field of automatic pattern recognition has witnessed enormous growth in the past few decades. As an essential element of pattern recognition, document image analysis is the procedure of analyzing a document image with the intention of working out its contents so that they can be manipulated as per the requirements at various levels. It involves various procedures such as document classification, organizing, conversion, identification and many more. Since a document chiefly contains text, Script ...

  6. AN APPROACH FOR TEXT SUMMARIZATION USING DEEP LEARNING ALGORITHM

    Directory of Open Access Journals (Sweden)

    G. PadmaPriya

    2014-01-01

    Full Text Available Nowadays much research is being carried out on text summarization. Because of the increasing amount of information on the internet, this kind of research is gaining more and more attention among researchers. Extractive text summarization generates a brief summary by extracting a proper set of sentences from a document or multiple documents using deep learning. The overall concept is to condense the important information present in the documents. The procedure uses the Restricted Boltzmann Machine (RBM) algorithm for better efficiency by removing redundant sentences. The restricted Boltzmann machine is a graphical model for binary random variables; it consists of input, hidden and output layers. The input data are distributed uniformly in the hidden layer for processing. The experimentation is carried out and summaries are generated for three document sets from different knowledge domains. The F-measure value is the indicator of the performance of the proposed text summarization method. The top responses of the three knowledge domains, in accordance with the F-measure, are 0.85, 1.42 and 1.97 respectively for the three document sets.

  7. A Survey On Various Approaches Of Text Extraction In Images

    Directory of Open Access Journals (Sweden)

    C.P. Sumathi

    2012-09-01

    Full Text Available Text extraction plays a major role in finding vital and valuable information. Text extraction involves detection, localization, tracking, binarization, extraction, enhancement and recognition of the text from a given image. These text characters are difficult to detect and recognize due to their variation in size, font, style, orientation, alignment and contrast, and due to complex colored or textured backgrounds. Due to the rapid growth of available multimedia documents and the growing requirement for information identification, indexing and retrieval, much research has been done on text extraction from images. Several techniques have been developed for extracting text from an image, based on morphological operators, wavelet transforms, artificial neural networks, skeletonization, edge detection, histogram techniques, etc. All these techniques have their benefits and limitations. This article discusses various schemes proposed earlier for extracting text from an image, and also provides a performance comparison of several existing methods proposed by researchers.

  8. New method to parse invoice as a type the document

    Directory of Open Access Journals (Sweden)

    Mohammed Moujabbir

    2012-01-01

    Full Text Available In this paper we propose a new method for detecting and correcting errors in the recognition of invoice-type documents. We rely on automated document readers that can read and recognize the relevant information in a scanned document. The process on which this method is based consists of digitizing a large volume of documents, passing them through automatic document readers, and then correcting the various errors. The final goal is to obtain an electronic document reflecting the information contained in the source document; the main goal is the generation of organized electronic documents, such as a database or XML files, for a specific use. Our approach is based on language theory: we develop a kind of parser which is applicable to the more general case of documents and can easily detect a specific class of errors and correct them.

  9. Software design and documentation language

    Science.gov (United States)

    Kleine, H.

    1980-01-01

    A language that supports the design and documentation of complex software. Included are: a design and documentation language for expressing design concepts; a processor that produces intelligible documentation based on design specifications; and a methodology for using the language and processor to create well-structured top-down programs and documentation. The processor is written in the SIMSCRIPT II.5 programming language for use on UNIVAC, IBM, and CDC machines.

  10. Text Mining for Protein Docking.

    Directory of Open Access Journals (Sweden)

    Varsha D Badal

    2015-12-01

    Full Text Available The rapidly growing amount of publicly available information from biomedical research is readily accessible on the Internet, providing a powerful resource for predictive biomolecular modeling. The accumulated data on experimentally determined structures transformed structure prediction of proteins and protein complexes. Instead of exploring the enormous search space, predictive tools can simply proceed to the solution based on similarity to existing, previously determined structures. A similar major paradigm shift is emerging due to the rapidly expanding amount of information, other than experimentally determined structures, which still can be used as constraints in biomolecular structure prediction. Automated text mining has been widely used in recreating protein interaction networks, as well as in detecting small ligand binding sites on protein structures. Combining and expanding these two well-developed areas of research, we applied text mining to the structural modeling of protein-protein complexes (protein docking). Protein docking can be significantly improved when constraints on the docking mode are available. We developed a procedure that retrieves published abstracts on a specific protein-protein interaction and extracts information relevant to docking. The procedure was assessed on protein complexes from Dockground (http://dockground.compbio.ku.edu). The results show that correct information on binding residues can be extracted for about half of the complexes. The amount of irrelevant information was reduced by conceptual analysis of a subset of the retrieved abstracts, based on the bag-of-words (features) approach. Support Vector Machine models were trained and validated on the subset. The remaining abstracts were filtered by the best-performing models, which decreased the irrelevant information for ~25% of the complexes in the dataset. The extracted constraints were incorporated in the docking protocol and tested on the Dockground unbound
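
    In the spirit of the bag-of-words filtering step described above, a linear-kernel SVM relevance filter might look like this (scikit-learn; the labelled abstracts are invented placeholders):

      # Bag-of-words relevance filter in the spirit of the abstract-screening
      # step; the labelled abstracts are invented placeholders.
      from sklearn.feature_extraction.text import CountVectorizer
      from sklearn.pipeline import make_pipeline
      from sklearn.svm import SVC

      abstracts = ["mutation of residue Arg45 abolished binding",
                   "the gene is widely expressed in liver tissue",
                   "alanine scanning identified interface residues",
                   "patients were followed for five years"]
      relevant = [1, 0, 1, 0]   # 1 = carries docking-relevant residue information

      model = make_pipeline(CountVectorizer(), SVC(kernel="linear"))
      model.fit(abstracts, relevant)
      print(model.predict(["interface residues were mapped by mutation"]))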

  11. Les dislocations: textes et contextes

    OpenAIRE

    Leonarduzzi, Laetitia; Herry, Nadine

    2005-01-01

    In this paper we analyse the contexts in which left and right dislocations appear. Our corpus is based on written as well as oral discourse, with texts ranging from the year 1884 to 2005. After trying to define both types of dislocation, and seeing how far the definitions can be extended, we notice that the dislocated NP is most of the time definite (90% of our examples). This phenomenon may be explained by the notions of anaphora, deixis and thematisation. Dislocations appear both in oral an...

  12. Text writing in the air

    OpenAIRE

    Beg, Saira; Khan, M. Fahad; Baig, Faisal

    2016-01-01

    This paper presents a real-time video-based pointing method which allows sketching and writing of English text in the air in front of a mobile camera. The proposed method has two main tasks: first it tracks the colored fingertip in the video frames, and then it applies English OCR over the plotted images in order to recognize the written characters. Moreover, the proposed method provides natural human-system interaction in that it does not require a keypad, stylus, pen or glove for character input. For...
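
    The fingertip-tracking stage might be sketched with OpenCV as below, assuming a red marker cap on the finger; the HSV bounds and the fixed frame budget are illustrative:

      # Fingertip-tracking stage only, assuming a red marker cap; the HSV
      # bounds and the frame budget are illustrative.
      import cv2

      cap = cv2.VideoCapture(0)          # default camera
      points = []                        # recovered stroke, later fed to OCR
      for _ in range(200):               # sample a short burst of frames
          ok, frame = cap.read()
          if not ok:
              break
          hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
          mask = cv2.inRange(hsv, (0, 120, 120), (10, 255, 255))  # red-ish range
          m = cv2.moments(mask)
          if m["m00"] > 0:               # centroid of the fingertip blob
              points.append((int(m["m10"] / m["m00"]), int(m["m01"] / m["m00"])))
      cap.release()
      print(len(points), "stroke points captured")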

  13. Thematic networks and text types

    OpenAIRE

    Thomas, Shirley

    2011-01-01

    In this article, the question of textual organization is approached through the analysis of thematic progression. We propose to study to what degree the different types of thematic progression established in a text are linked to the question of its text type, here a scientific research article and a popular-science article. We also consider some didactic orientations arising from this study.

  14. New Historicism: Text and Context

    Directory of Open Access Journals (Sweden)

    Violeta M. Vesić

    2016-02-01

    Full Text Available During most of the twentieth century, history was seen as a phenomenon outside of literature that guaranteed the veracity of literary interpretation. History was unique and it functioned as a basis for reading literary works. During the seventies of the twentieth century there occurred a change of attitude towards history in American literary theory, and there appeared a new theoretical approach which soon became known as New Historicism. Since its inception, New Historicism has been identified with the study of the Renaissance and Romanticism, but nowadays it has been increasingly involved in other literary trends. Although there are great differences in the arguments and practices of its various representatives, New Historicism has clearly recognizable features, and many new historicists will agree with the statement of Walter Cohen that New Historicism, when it appeared in the eighties, represented something quite new in reference to the studies of theory, criticism and history (Cohen 1987, 33). The theoretical connection with Bakhtin, Foucault and Marx is clear, as well as a kind of uneasy tie with deconstruction and the work of Paul de Man. At the center of this approach is a renewed interest in the study of literary works in the light of the historical and political circumstances in which they were created. Foucault encouraged readers to begin to move literary texts and to link them with discourses and representations that are not literary, as well as to examine the sociological aspects of the texts in order to take part in the social struggles of today. The study of literary works using New Historicism is the study of the politics, history, culture and circumstances in which these works were created. With regard to one of the main tenets at the center of this criticism, namely that history cannot be viewed objectively and that reality can only be understood through a cultural context that the work reveals, re-reading and interpretation of

  15. Does pedagogical documentation support maternal reminiscing conversations?

    Directory of Open Access Journals (Sweden)

    Bethany Fleck

    2015-12-01

    Full Text Available When parents talk with their children about lessons learned in school, they are participating in reminiscing of an unshared event. This study sought to understand if pedagogical documentation, from the Reggio Approach to early childhood education, would support and enhance the conversation. Mother–child dyads reminisced two separate times about preschool lessons, one time with documentation available to them and one time without. Transcripts were coded extracting variables indicative of high and low maternal reminiscing styles. Results indicate that mother and child conversation characteristics were more highly elaborative when documentation was present than when it was not. In addition, children added more information to the conversation supporting the notion that such conversations enhanced memory for lessons. Documentation could be used as a support tool for conversations and children’s memory about lessons learned in school.

  16. Visual Similarity Based Document Layout Analysis

    Institute of Scientific and Technical Information of China (English)

    Di Wen; Xiao-Qing Ding

    2006-01-01

    In this paper, a visual similarity based document layout analysis (DLA) scheme is proposed which, by using a clustering strategy, can adaptively deal with documents in different languages, with different layout structures and skew angles. Aiming at a robust and adaptive DLA approach, the authors first find a set of representative filters and statistics to characterize typical texture patterns in document images, through a visual similarity testing process. Texture features are then extracted from these filters and passed into a dynamic clustering procedure, which is called visual similarity clustering. Finally, text contents are located from the clustered results. Benefiting from this scheme, the algorithm demonstrates strong robustness and adaptability on a wide variety of documents, which previous traditional DLA approaches do not possess.

  17. Basic firefly algorithm for document clustering

    Science.gov (United States)

    Mohammed, Athraa Jasim; Yusof, Yuhanis; Husni, Husniza

    2015-12-01

    Document clustering plays a significant role in Information Retrieval (IR), where it organizes documents prior to the retrieval process. To date, various clustering algorithms have been proposed, including K-means and Particle Swarm Optimization. Even though these algorithms have been widely applied in many disciplines due to their simplicity, such approaches tend to become trapped in a local minimum during the search for an optimal solution. To address this shortcoming, this paper proposes a Basic Firefly (Basic FA) algorithm to cluster text documents. The algorithm employs the Average Distance to Document Centroid (ADDC) as the objective function of the search. Experiments utilizing the proposed algorithm were conducted on the 20Newsgroups benchmark dataset. Results demonstrate that the Basic FA generates more robust and compact clusters than the ones produced by K-means and Particle Swarm Optimization (PSO).
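
    A small sketch of the ADDC objective that a candidate (firefly) clustering would be scored with; the document vectors and assignment below are random placeholders for TF-IDF data:

      # ADDC fitness for one candidate clustering; vectors and assignment
      # are random placeholders for TF-IDF document vectors.
      import numpy as np

      def addc(X, labels, k):
          """Average distance of documents to their cluster centroid."""
          total = 0.0
          for c in range(k):
              members = X[labels == c]
              if len(members):
                  centroid = members.mean(axis=0)
                  total += np.linalg.norm(members - centroid, axis=1).mean()
          return total / k

      rng = np.random.default_rng(1)
      X = rng.random((12, 5))               # 12 documents, 5 term weights each
      labels = rng.integers(0, 3, size=12)  # one candidate cluster assignment
      print(addc(X, labels, k=3))           # lower = tighter clusters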

  18. Toward Documentation of Program Evolution

    DEFF Research Database (Denmark)

    Vestdam, Thomas; Nørmark, Kurt

    2005-01-01

    The documentation of a program often falls behind the evolution of the program source files. When this happens it may be attractive to shift the documentation mode from updating the documentation to documenting the evolution of the program. This paper describes tools that support the documentation of program evolution in separate documentation files. The paper introduces a set of fine-grained program evolution steps which are supported directly by the documentation tools. The automatic discovery of the fine-grained program evolution steps makes up a platform for documenting coarse-grained and more high-level program evolution steps. It is concluded that our approach can help revitalize older documentation, and that discovery of the fine-grained program evolution steps helps the programmer in documenting the evolution of the program.

  19. Extractive Summarisation of Medical Documents

    Directory of Open Access Journals (Sweden)

    Abeed Sarker

    2012-09-01

    Full Text Available Background: Evidence Based Medicine (EBM) practice requires practitioners to extract evidence from published medical research when answering clinical queries. Due to the time-consuming nature of this practice, there is a strong motivation for systems that can automatically summarise medical documents and help practitioners find relevant information. Aim: The aim of this work is to propose an automatic query-focused, extractive summarisation approach that selects informative sentences from medical documents. Method: We use a corpus that is specifically designed for summarisation in the EBM domain. We use approximately half the corpus for deriving important statistics associated with the best possible extractive summaries, taking into account factors such as sentence position, length, sentence content, and the type of the query posed. Using the statistics from the first set, we evaluate our approach on a separate set. Evaluation of the quality of the generated summaries is performed automatically using ROUGE, which is a popular tool for evaluating automatic summaries. Results: Our summarisation approach outperforms all baselines (best baseline score: 0.1594; our score: 0.1653). Further improvements are achieved when query types are taken into account. Conclusion: The quality of extractive summarisation in the medical domain can be significantly improved by incorporating domain knowledge and statistics derived from a specialised corpus. Such techniques can therefore be applied for content selection in end-to-end summarisation systems.

  20. A Hybrid Feature Selection Approach for Arabic Documents Classification

    NARCIS (Netherlands)

    Habib, Mena B.; Fayed, Zaki T.; Gharib, Tarek F.; Sarhan, Ahmed A. E.; Salem, Abdel-Badeeh M.

    2006-01-01

    Text Categorization (classification) is the process of classifying documents into a predefined set of categories based on their content. Text categorization algorithms usually represent documents as bags of words and consequently have to deal with huge number of features. Feature selection tries to

  1. Automatic generation of documents

    OpenAIRE

    Rosa Gini; Jacopo Pasquini

    2006-01-01

    This paper describes a natural interaction between Stata and markup languages. Stata’s programming and analysis features, together with the flexibility in output formatting of markup languages, allow generation and/or update of whole documents (reports, presentations on screen or web, etc.). Examples are given for both LaTeX and HTML. Stata’s commands are mainly dedicated to analysis of data on a computer screen and output of analysis stored in a log file available to researchers for later re...

  2. AREVA 2009 reference document

    International Nuclear Information System (INIS)

    This Reference Document contains information on the AREVA group's objectives, prospects and development strategies. It contains information on the markets, market shares and competitive position of the AREVA group. This information provides an adequate picture of the size of these markets and of the AREVA group's competitive position. Content: 1 - Person responsible for the Reference Document and Attestation by the person responsible for the Reference Document; 2 - Statutory and Deputy Auditors; 3 - Selected financial information; 4 - Risks: Risk management and coverage, Legal risk, Industrial and environmental risk, Operating risk, Risk related to major projects, Liquidity and market risk, Other risk; 5 - Information about the issuer: History and development, Investments; 6 - Business overview: Markets for nuclear power and renewable energies, AREVA customers and suppliers, Overview and strategy of the group, Business divisions, Discontinued operations: AREVA Transmission and Distribution; 7 - Organizational structure; 8 - Property, plant and equipment: Principal sites of the AREVA group, Environmental issues that may affect the issuer's; 9 - Analysis of and comments on the group's financial position and performance: Overview, Financial position, Cash flow, Statement of financial position, Events subsequent to year-end closing for 2009; 10 - Capital Resources; 11 - Research and development programs, patents and licenses; 12 - Trend information: Current situation, Financial objectives; 13 - Profit forecasts or estimates; 14 - Administrative, management and supervisory bodies and senior management; 15 - Compensation and benefits; 16 - Functioning of corporate bodies; 17 - Employees; 18 - Principal shareholders; 19 - Transactions with related parties: French state, CEA, EDF group; 20 - Financial information concerning assets, financial positions and financial performance; 21 - Additional information: Share capital, Certificate of incorporation and by-laws; 22 - Major

  3. Attitudes and emotions through written text: the case of textual deformation in internet chat rooms.

    Directory of Open Access Journals (Sweden)

    Francisco Yus Ramos

    2010-11-01

    Full Text Available Spanish Internet chat rooms are visited by many young people who use language in a highly creative way (e.g. repeating letters and punctuation marks). This article assesses several hypotheses concerning the use of such textual deformation and its communicative effectiveness. The aim is to determine whether these deformations favour a more accurate identification and assessment of the (propositional or affective) attitudes and emotions of their authors. The answers to a questionnaire reveal that, despite the additional information that textual deformation provides, readers rarely agree on the exact quality of these attitudes and emotions, nor do they establish degrees of intensity correlated with the amount of text typed. Nevertheless, despite these results, textual deformation does seem to play a role in the interpretation eventually chosen for these messages posted to chat rooms.

  4. AREVA - 2012 Reference document

    International Nuclear Information System (INIS)

    After a presentation of the person responsible for this Reference Document, of the statutory auditors, and of a summary of financial information, this report addresses the different risk factors: risk management and coverage, legal risk, industrial and environmental risk, operational risk, risk related to major projects, liquidity and market risk, and other risks (related to political and economic conditions, to the Group's structure, and to human resources). The next parts propose information about the issuer, a business overview (markets for nuclear power and renewable energies, customers and suppliers, group's strategy, operations), a brief presentation of the organizational structure, a presentation of properties, plants and equipment (principal sites, environmental issues which may affect these items), analysis and comments on the group's financial position and performance, a presentation of capital resources, a presentation of research and development activities (programs, patents and licenses), a brief description of financial objectives and profit forecasts or estimates, a presentation of administration, management and supervision bodies, a description of the operation of corporate bodies, an overview of personnel, of principal shareholders, and of transactions with related parties, and a more detailed presentation of financial information concerning assets, financial positions and financial performance. Additional information regarding share capital is given, as well as an indication of major contracts, third party information, available documents, and information on holdings

  5. Regulatory guidance document

    Energy Technology Data Exchange (ETDEWEB)

    NONE

    1994-05-01

    The Office of Civilian Radioactive Waste Management (OCRWM) Program Management System Manual requires preparation of the OCRWM Regulatory Guidance Document (RGD) that addresses licensing, environmental compliance, and safety and health compliance. The document provides: regulatory compliance policy; guidance to OCRWM organizational elements to ensure a consistent approach when complying with regulatory requirements; strategies to achieve policy objectives; organizational responsibilities for regulatory compliance; guidance with regard to Program compliance oversight; and guidance on the contents of a project-level Regulatory Compliance Plan. The scope of the RGD includes site suitability evaluation, licensing, environmental compliance, and safety and health compliance, in accordance with the direction provided by Section 4.6.3 of the PMS Manual. Site suitability evaluation and regulatory compliance during site characterization are significant activities, particularly with regard to the YW MSA. OCRWM's evaluation of whether the Yucca Mountain site is suitable for repository development must precede its submittal of a license application to the Nuclear Regulatory Commission (NRC). Accordingly, site suitability evaluation is discussed in Chapter 4, and the general statements of policy regarding site suitability evaluation are discussed in Section 2.1. Although much of the data and analyses may initially be similar, the licensing process is discussed separately in Chapter 5. Environmental compliance is discussed in Chapter 6. Safety and Health compliance is discussed in Chapter 7.

  6. Regulatory guidance document

    International Nuclear Information System (INIS)

    The Office of Civilian Radioactive Waste Management (OCRWM) Program Management System Manual requires preparation of the OCRWM Regulatory Guidance Document (RGD) that addresses licensing, environmental compliance, and safety and health compliance. The document provides: regulatory compliance policy; guidance to OCRWM organizational elements to ensure a consistent approach when complying with regulatory requirements; strategies to achieve policy objectives; organizational responsibilities for regulatory compliance; guidance with regard to Program compliance oversight; and guidance on the contents of a project-level Regulatory Compliance Plan. The scope of the RGD includes site suitability evaluation, licensing, environmental compliance, and safety and health compliance, in accordance with the direction provided by Section 4.6.3 of the PMS Manual. Site suitability evaluation and regulatory compliance during site characterization are significant activities, particularly with regard to the YW MSA. OCRWM's evaluation of whether the Yucca Mountain site is suitable for repository development must precede its submittal of a license application to the Nuclear Regulatory Commission (NRC). Accordingly, site suitability evaluation is discussed in Chapter 4, and the general statements of policy regarding site suitability evaluation are discussed in Section 2.1. Although much of the data and analyses may initially be similar, the licensing process is discussed separately in Chapter 5. Environmental compliance is discussed in Chapter 6. Safety and Health compliance is discussed in Chapter 7

  7. ExactPack Documentation

    Energy Technology Data Exchange (ETDEWEB)

    Singleton, Jr., Robert [Los Alamos National Lab. (LANL), Los Alamos, NM (United States); Israel, Daniel M. [Los Alamos National Lab. (LANL), Los Alamos, NM (United States); Doebling, Scott William [Los Alamos National Lab. (LANL), Los Alamos, NM (United States); Woods, Charles Nathan [Los Alamos National Lab. (LANL), Los Alamos, NM (United States); Kaul, Ann [Los Alamos National Lab. (LANL), Los Alamos, NM (United States); Walter, Jr., John William [Los Alamos National Lab. (LANL), Los Alamos, NM (United States); Rogers, Michael Lloyd [Los Alamos National Lab. (LANL), Los Alamos, NM (United States)

    2016-05-09

    For code verification, one compares the code output against known exact solutions. There are many standard test problems used in this capacity, such as the Noh and Sedov problems. ExactPack is a utility that integrates many of these exact solution codes into a common API (application program interface), and can be used as a stand-alone code or as a Python package. ExactPack consists of Python driver scripts that access a library of exact solutions written in Fortran or Python. The spatial profiles of the relevant physical quantities, such as the density, fluid velocity, sound speed, or internal energy, are returned at a time specified by the user. The solution profiles can be viewed and examined by a command line interface or a graphical user interface, and a number of analysis tools and unit tests are also provided. We have documented the physics of each problem in the solution library, and provided complete documentation on how to extend the library to include additional exact solutions. ExactPack’s code architecture makes it easy to extend the solution-code library to include additional exact solutions in a robust, reliable, and maintainable manner.
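
    To make the workflow concrete, the following is a minimal, self-contained sketch of the code-verification pattern ExactPack supports, not a use of the library itself (its actual module paths and solver signatures are not reproduced here): run a numerical code, evaluate a known exact solution on the same grid, and confirm that the error norm shrinks under mesh refinement.

        import numpy as np

        def exact_solution(x, t, a=1.0):
            # Exact solution of u_t + a*u_x = 0 with u(x, 0) = sin(2*pi*x):
            # the initial profile translated by a*t.
            return np.sin(2.0 * np.pi * (x - a * t))

        def upwind_advection(nx, t_final, a=1.0, cfl=0.5):
            # First-order upwind scheme on [0, 1) with periodic boundaries;
            # this plays the role of "the code under verification".
            x = np.linspace(0.0, 1.0, nx, endpoint=False)
            dx = 1.0 / nx
            dt = cfl * dx / a
            u = np.sin(2.0 * np.pi * x)
            t = 0.0
            while t < t_final:
                step = min(dt, t_final - t)
                u = u - a * step / dx * (u - np.roll(u, 1))
                t += step
            return x, u

        # Verification: the L2 error should shrink (here at first order) as the mesh is refined.
        for nx in (50, 100, 200):
            x, u = upwind_advection(nx, t_final=0.5)
            err = np.sqrt(np.mean((u - exact_solution(x, 0.5)) ** 2))
            print(f"nx={nx:4d}  L2 error = {err:.3e}")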

  8. AREVA 2010 Reference document

    International Nuclear Information System (INIS)

    After a presentation of the person responsible for this document, and of the statutory auditors, this report proposes selected financial information. It then addresses, presents and comments on the different risk factors: risk management and coverage, legal risk, industrial and environmental risk, operational risk, risks related to major projects, liquidity and market risk, and other risks. Then, after a presentation of the issuer, it proposes a business overview (markets for nuclear and renewable energies, AREVA customers and suppliers, strategy, activities), a presentation of the organizational structure, a presentation of AREVA properties, plants and equipment (sites, environmental issues), an analysis and commentary on the group's financial position and performance, a presentation of its capital resources, and an overview of its research and development activities, programs, patents and licenses. It indicates profit forecasts and estimates, presents the administrative, management and supervisory bodies, compensation and benefits amounts, and reports on the functioning of corporate bodies. It describes the company's human resources policy, and indicates the main shareholders and transactions with related parties. It proposes financial information concerning assets, financial positions and financial performance. This document contains both its French and its English versions

  9. REVISION AND REWRITING IN OFFICIAL DOCUMENTS: CONCEPTS AND METHODOLOGICAL ORIENTATIONS

    Directory of Open Access Journals (Sweden)

    Renilson José MENEGASSI

    2014-12-01

    Full Text Available The text discusses how the concepts of, and methodological orientations for, text revision and rewriting processes in the teaching context are conceived and presented, and how they guide the Portuguese language teacher's work. To this end, the concepts of revision and rewriting are characterized as they appear in four Brazilian official documents, two of national scope and two from the state of Paraná. The information is organized around what the documents say about teacher and student attitudes toward the investigated concepts, which determine the methodological orientations for text production work. The results show inconsistencies in how these processes are handled, and single out one of the official documents, of national scope, as the one presenting the most suitable methodological and conceptual orientations. This shows that the documents guiding mother-tongue teaching in the country still do not adequately address the written text production process, specifically revision and rewriting, even in the more recent documents.

  10. Automatic document classification of biological literature

    Directory of Open Access Journals (Sweden)

    Sternberg Paul W

    2006-08-01

    Full Text Available Abstract Background Document classification is a widespread problem with many applications, from organizing search engine snippets to spam filtering. We previously described Textpresso, a text-mining system for biological literature, which marks up full text according to a shallow ontology that includes terms of biological interest. This project investigates document classification in the context of biological literature, making use of the Textpresso markup of a corpus of Caenorhabditis elegans literature. Results We present a two-step text categorization algorithm to classify a corpus of C. elegans papers. Our classification method first uses a support vector machine-trained classifier, followed by a novel, phrase-based clustering algorithm. This clustering step autonomously creates cluster labels that are descriptive and understandable by humans. This clustering engine performed better on a standard test-set (Reuters 21578) compared to previously published results (F-value of 0.55 vs. 0.49), while producing cluster descriptions that appear more useful. A web interface allows researchers to quickly navigate through the hierarchy and look for documents that belong to a specific concept. Conclusion We have demonstrated a simple method to classify biological documents that embodies an improvement over current methods. While the classification results are currently optimized for Caenorhabditis elegans papers by human-created rules, the classification engine can be adapted to different types of documents. We have demonstrated this by presenting a web interface that allows researchers to quickly navigate through the hierarchy and look for documents that belong to a specific concept.
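
    The two-step pattern described above, a trained SVM filter followed by clustering of the accepted documents, can be sketched with standard tools. The toy pipeline below uses scikit-learn, with k-means standing in for the paper's phrase-based clustering; it illustrates the general shape of the approach, not the Textpresso implementation.

        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.svm import LinearSVC
        from sklearn.cluster import KMeans

        # Toy corpus: labels mark whether a paper is relevant (1) or not (0).
        train_docs = ["genetic analysis of C. elegans neurons",
                      "spam offer cheap pills",
                      "RNAi screen in nematode development",
                      "win money now"]
        train_labels = [1, 0, 1, 0]
        new_docs = ["neuronal development in C. elegans",
                    "cheap pills online",
                    "nematode RNAi phenotypes"]

        # Step 1: SVM relevance classification on TF-IDF features.
        vec = TfidfVectorizer()
        clf = LinearSVC().fit(vec.fit_transform(train_docs), train_labels)
        accepted = [d for d, y in zip(new_docs, clf.predict(vec.transform(new_docs))) if y == 1]

        # Step 2: cluster the accepted documents into topical groups.
        if accepted:
            km = KMeans(n_clusters=min(2, len(accepted)), n_init=10, random_state=0)
            for doc, c in zip(accepted, km.fit_predict(vec.transform(accepted))):
                print(c, doc)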

  11. Visualising Discourse Structure in Interactive Documents

    OpenAIRE

    Mancini, Clara; Pietsch, Christian; Scott, Donia; Busemann, Stephan

    2007-01-01

    In this paper we introduce a method for generating interactive documents which exploits the visual features of hypertext to represent discourse structure. We explore the consistent and principled use of graphics and animation to support navigation and comprehension of non-linear text, where textual discourse markers do not always work effectively.

  12. Interactive-predictive detection of handwritten text blocks

    Science.gov (United States)

    Ramos Terrades, O.; Serrano, N.; Gordó, A.; Valveny, E.; Juan, A.

    2010-01-01

    A method for text block detection is introduced for old handwritten documents. The proposed method takes advantage of sequential book structure, taking into account layout information from pages previously transcribed. This glance at the past is used to predict the position of text blocks in the current page with the help of conventional layout analysis methods. The method is integrated into the GIDOC prototype: a first attempt to provide integrated support for interactive-predictive page layout analysis, text line detection and handwritten text transcription. Results are given in a transcription task on a 764-page Spanish manuscript from 1891.
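
    A minimal way to exploit this "glance at the past" is to treat the bounding boxes of text blocks on previously transcribed pages as a prior for the current page. The sketch below simply averages the previous boxes; it is an illustrative simplification, not the GIDOC algorithm.

        import numpy as np

        def predict_block(previous_boxes):
            # Predict the current page's main text block as the mean of the
            # (x0, y0, x1, y1) bounding boxes seen on already-transcribed pages.
            return np.asarray(previous_boxes, dtype=float).mean(axis=0)

        # Boxes from three already-transcribed pages of the same book (toy numbers, in pixels).
        history = [(110, 200, 890, 1300), (118, 195, 884, 1310), (112, 205, 895, 1295)]
        print("Predicted block for the next page:", predict_block(history))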

  13. UNL Based Bangla Natural Text Conversion - Predicate Preserving Parser Approach

    CERN Document Server

    Ali, Md Nawab Yousuf; Allayear, Shaikh Muhammad

    2012-01-01

    Universal Networking Language (UNL) is a declarative formal language that is used to represent semantic data extracted from natural language texts. This paper presents a novel approach to converting Bangla natural language text into UNL using a method known as the Predicate Preserving Parser (PPP) technique. PPP performs morphological, syntactic, semantic, and lexical analysis of the text synchronously. This analysis produces a semantic-net-like structure represented in UNL. We demonstrate how Bangla texts are analyzed following the PPP technique to produce UNL documents, which can then be translated into any other suitable natural language, facilitating the development of a universal language translation method via UNL.

  14. Rank Based Clustering For Document Retrieval From Biomedical Databases

    Directory of Open Access Journals (Sweden)

    Jayanthi Manicassamy

    2009-09-01

    Full Text Available Nowadays, search engines are widely used for extracting information from various resources throughout the world, and a majority of searches lie in the biomedical field, retrieving related documents from various biomedical databases. Current search engines fall short in clustering documents and in representing the degree of relatedness of the documents extracted from the databases. To overcome these pitfalls, a text-based search engine has been developed for retrieving documents from the Medline and PubMed biomedical databases. The search engine incorporates a page-ranking-based clustering concept that automatically represents relatedness on a clustering basis. In addition, a graph tree is constructed to represent the level of relatedness of the documents that are networked together. Incorporating this functionality into a biomedical document search engine was found to provide better results when reviewing related documents based on relatedness.
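
    As a rough illustration of ranking retrieved documents by relatedness, the sketch below scores documents against a query with TF-IDF cosine similarity; it is a generic stand-in for the ranking step, not the engine described in the paper.

        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.metrics.pairwise import cosine_similarity

        docs = ["insulin signalling in diabetes",
                "beta cell insulin secretion",
                "fracture healing in bone",
                "insulin resistance and obesity"]
        query = ["insulin diabetes"]

        vec = TfidfVectorizer()
        doc_matrix = vec.fit_transform(docs)
        scores = cosine_similarity(vec.transform(query), doc_matrix).ravel()

        # Rank documents by similarity to the query, highest first.
        for rank, i in enumerate(scores.argsort()[::-1], start=1):
            print(rank, round(float(scores[i]), 3), docs[i])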

  15. An Evident Theoretic Feature Selection Approach for Text Categorization

    Directory of Open Access Journals (Sweden)

    UMARSATHIC ALI

    2012-06-01

    Full Text Available With the exponential growth of textual documents available in unstructured form on the Internet, feature selection approaches are increasingly significant in the preprocessing of textual documents for automatic text categorization. Feature selection, which focuses on identifying relevant and informative features, can help reduce the computational cost of processing voluminous amounts of data as well as increase the effectiveness of the subsequent text categorization tasks. In this paper, we propose a new evident theoretic feature selection approach for text categorization based on the transferable belief model (TBM). An evaluation of the performance of the proposed evident theoretic feature selection approach on benchmark datasets is also presented. We empirically show the effectiveness of our approach in outperforming traditional feature selection methods on two standard benchmark datasets.
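
    For context, a conventional baseline of the kind such approaches are compared against is chi-square feature selection: keep the k terms most associated with the class labels before training a classifier. The sketch below shows that baseline with scikit-learn; it does not implement the TBM-based method proposed in the paper.

        from sklearn.feature_extraction.text import CountVectorizer
        from sklearn.feature_selection import SelectKBest, chi2

        docs = ["the striker scored a late goal",
                "parliament passed the budget bill",
                "midfielder injured before the match",
                "senators debated the new law"]
        labels = [0, 1, 0, 1]  # 0 = sports, 1 = politics

        vec = CountVectorizer(stop_words="english")
        counts = vec.fit_transform(docs)

        # Keep the 4 terms with the highest chi-square score w.r.t. the labels.
        selector = SelectKBest(chi2, k=4).fit(counts, labels)
        print("Selected features:",
              list(selector.get_feature_names_out(vec.get_feature_names_out())))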

  16. A TEXT MINING RESEARCH BASED ON LDA TOPIC MODELLING

    Directory of Open Access Journals (Sweden)

    Zhou Tong

    2016-05-01

    Full Text Available A large amount of digital text is generated every day, and effectively searching, managing and exploring this text data has become a main task. In this paper, we first present an introduction to text mining and to the probabilistic topic model Latent Dirichlet Allocation. Two experiments are then proposed: topic modelling of Wikipedia articles and of users' tweets. The former builds a document topic model, aiming at a topic-perspective solution for searching, exploring and recommending articles. The latter sets up a user topic model, providing a full study and analysis of Twitter users' interests. The experimental process, including data collection, data pre-processing and model training, is fully documented and commented. Furthermore, the conclusions and applications of this paper could serve as a useful computational tool for social and business research.
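
    As a minimal illustration of the LDA workflow described above (a generic scikit-learn sketch, not the authors' Wikipedia/Twitter pipeline): fit a topic model on a small bag-of-words corpus and inspect the top words per topic.

        from sklearn.feature_extraction.text import CountVectorizer
        from sklearn.decomposition import LatentDirichletAllocation

        docs = ["stock market traders price shares",
                "game team players score win",
                "market shares investors price",
                "team season players coach game"]

        vec = CountVectorizer()
        counts = vec.fit_transform(docs)

        # Fit a 2-topic LDA model on the word counts.
        lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)

        terms = vec.get_feature_names_out()
        for k, topic in enumerate(lda.components_):
            top_terms = [terms[i] for i in topic.argsort()[::-1][:4]]
            print(f"topic {k}:", ", ".join(top_terms))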

  17. A programmed text in statistics

    CERN Document Server

    Hine, J

    1975-01-01

    [Front matter: contents listing exercises and solutions for Sections 1 and 2 (each grouped into physical sciences and engineering, biological sciences, and social sciences) and statistical tables (chi-squared tests involving variances, one-tailed and two-tailed chi-squared tests, and the F-distribution).] Preface: This project started some years ago when the Nuffield Foundation kindly gave a grant for writing a programmed text to use with service courses in statistics. The work was carried out by Mrs. Joan Hine and Professor G. B. Wetherill at Bath University, together with some other help from time to time by colleagues at Bath University and elsewhere. Testing was done at various colleges and universities, and some helpful comments were received, but we particularly mention King Edwards School, Bath, who provided some sixth formers as 'guinea pigs' for the fir...

  18. Initial bolometric measurements on TEXT

    International Nuclear Information System (INIS)

    A platinum resistance bolometer has been used to measure the total radiated power in TEXT. Preliminary attempts to determine the scaling of the total radiated power with impurity content, toroidal field, electron density, and plasma current have been made. These measurements indicate that the radiated power is strongly dependent on the impurity content, proportional to the plasma current and electron density, and inversely proportional to the toroidal field. The density and toroidal field dependences are apparently connected with changes in impurity confinement with these parameters. Increases in total radiated power during different impurity injections have also been measured. Shot to shot radial scans of the bolometer across the plasma have been made for several plasma conditions. Estimates of the total radiated power have also been made for these conditions. Comparisons with the ohmic heating input power show that the radiated power is a large percentage of the input power, so that the radiated power is a significant term in thermal transport calculations. This report describes the experimental techniques used and preliminary results of the power measurements

  19. Automated Postediting of Documents

    CERN Document Server

    Knight, K; Knight, Kevin; Chander, Ishwar

    1994-01-01

    Large amounts of low- to medium-quality English texts are now being produced by machine translation (MT) systems, optical character readers (OCR), and non-native speakers of English. Most of this text must be postedited by hand before it sees the light of day. Improving text quality is tedious work, but its automation has not received much research attention. Anyone who has postedited a technical report or thesis written by a non-native speaker of English knows the potential of an automated postediting system. For the case of MT-generated text, we argue for the construction of postediting modules that are portable across MT systems, as an alternative to hardcoding improvements inside any one system. As an example, we have built a complete self-contained postediting module for the task of article selection (a, an, the) for English noun phrases. This is a notoriously difficult problem for Japanese-English MT. Our system contains over 200,000 rules derived automatically from online text resources. We report on l...
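
    The article-selection task can be illustrated with a couple of toy rules. The authors derive over 200,000 rules automatically from online text, so the hand-written sketch below (with made-up rules and a crude vowel test) only shows the shape of the problem, not their system.

        import re

        def choose_article(noun_phrase, definite=False):
            # Toy post-editing rule: 'the' for definite reference, otherwise
            # 'a'/'an' chosen by a crude initial-vowel-letter test.
            if definite:
                return "the"
            first_word = noun_phrase.strip().split()[0].lower()
            return "an" if re.match(r"[aeiou]", first_word) else "a"

        for np_, d in [("error rate", False), ("postediting module", False), ("system we built", True)]:
            print(choose_article(np_, definite=d), np_)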

  20. Integrated criteria document mercury

    International Nuclear Information System (INIS)

    The document contains a systematic review and a critical evaluation of the most relevant data on the priority substance mercury for the purpose of effect-oriented environmental policy. Chapter headings are: properties and existing standards; production, application, sources and emissions (natural sources, industry, energy, households, agriculture, dental use, waste); distribution and transformation (cinnabar; Hg(2+), Hg2(2+), elemental mercury, methylmercury, behavior in soil, water, air, biota); concentrations and fluxes in the environment and exposure levels (sampling and measuring methods, occurrence in soil, water, air etc.); effects (toxicity to humans and aquatic and terrestrial systems); emissions reduction (from industrial sources, energy, waste processing etc.); and evaluation (risks, standards, emission reduction objectives, measuring strategies). 395 refs