WorldWideScience

Sample records for ascii text documents

  1. State Of The Art In Digital Steganography Focusing ASCII Text Documents

    CERN Document Server

    Rafat, Khan Farhan

    2010-01-01

    Digitization of analogue signals has opened up new avenues for information hiding, and recent advancements in the telecommunication field have taken this desire even further. From copper wire to fiber optics, technology has evolved, and so have the ways of covert channel communication. By "covert" we mean "anything not meant for the purpose for which it is being used". Investigation and detection of such covert channel communication has always remained a serious concern of information security professionals, and it has now evolved into a motivation for adversaries to communicate secretly in the "open" without being noticed. This paper presents a survey of steganographic techniques that have evolved over the years to hide the existence of secret information inside some cover (text) object. The introduction of the subject is followed by a discussion narrowed down to the area where digital ASCII text documents are used as cover. Finally, the conc...

  2. Communication in Veil: Enhanced Paradigm for ASCII Text Files

    Directory of Open Access Journals (Sweden)

    Muhammad Sher

    2013-08-01

    Full Text Available Digitization has had a pervasive impact on the information and communication technology (ICT) field, which can be seen from the fact that today one seldom thinks of standing in a long queue just to deposit utility bills, buy a movie ticket, or dispatch private letters via the post office; these and similar activities are now preferably done electronically over the internet, which has shattered geographical boundaries and tied people across the world into a single logical unit called the global village. The efficacy and precision with which electronic transactions are made is commendable and is one of the reasons why more and more people are switching to e-commerce for official and personal use. Via social networking sites one can interact with family and friends at any time of one's choice. The darker side of this comforting aspect, however, is that the contents sent on- or off-line may be monitored for active or passive intervention by antagonistic forces for illicit motives ranging from, but not limited to, password, ID and social security number theft to impersonation, compromise of personal information, and blackmail. This necessitates hiding data or information of significance in an oblivious manner in order to thwart its detection by the enemy. This paper aims at evolving an avant-garde information hiding scheme for ASCII text files - a research area regarded as the most difficult for this purpose in contrast to audio, video or image file formats.
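
    The scheme itself is not detailed in this abstract. As a rough, hedged illustration of the general idea of hiding bits in an ASCII text cover, the Python sketch below encodes a secret in trailing spaces (one bit per line); this is a classic, easily detectable technique chosen only for brevity, not the authors' method.

        def embed(cover_lines, secret):
            # Hide secret ASCII bytes in a text cover: a trailing space encodes bit 1, none encodes bit 0.
            bits = [int(b) for byte in secret.encode("ascii") for b in f"{byte:08b}"]
            if len(bits) > len(cover_lines):
                raise ValueError("cover text too short for this secret")
            stego = []
            for i, line in enumerate(cover_lines):
                line = line.rstrip()
                stego.append(line + (" " if i < len(bits) and bits[i] else ""))
            return stego

        def extract(stego_lines, n_chars):
            # Read back n_chars bytes from the trailing-space pattern.
            bits = ["1" if line.endswith(" ") else "0" for line in stego_lines[:n_chars * 8]]
            return bytes(int("".join(bits[i:i + 8]), 2) for i in range(0, len(bits), 8)).decode("ascii")

        cover = [f"line {i} of an innocuous cover document" for i in range(64)]
        print(extract(embed(cover, "hi"), 2))   # -> "hi"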

  3. Text line Segmentation of Curved Document Images

    Directory of Open Access Journals (Sweden)

    Anusree.M

    2014-05-01

    Full Text Available Document image analysis has been widely used in historical and heritage studies, education and digital libraries. Document image analysis techniques are mainly used for improving the human readability and the OCR quality of the document. During digitization, camera-captured images contain warped document regions due to perspective and geometric distortions. The main difficulty is text line detection in the document. Many algorithms have been proposed to address the problem of text line detection in printed documents, but they fail to extract text lines in curved documents. This paper describes a segmentation technique that detects curled text lines in camera-captured document images.

  4. Segmentation of Ancient Telugu Text Documents

    Directory of Open Access Journals (Sweden)

    Srinivasa Rao A.V

    2012-07-01

    Full Text Available OCR of ancient document images remains a challenging task to date. The scanning process itself introduces deformation of document images, and cleaning these document images results in information loss. Segmentation is an invariant step in OCR. Complex scripts, like derivatives of Brahmi, encounter many problems in the segmentation process. Segmentation into meaningful units (instead of isolated patterns) reveals interesting trends. A technique for segmenting an ancient Telugu document image into meaningful units is proposed. The topological features of the meaningful units within the script line are adopted as a basis while segmenting the text line. The horizontal profile pattern is convolved with a Gaussian kernel. The statistical properties of meaningful units are explored by extensively analyzing their geometrical patterns. The efficiency of the proposed segmentation algorithm is found to be 73.5% for uncleaned document images.
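
    The abstract mentions convolving the horizontal projection profile with a Gaussian kernel. A minimal sketch of that single step is shown below, assuming a binarized page where text pixels are 1; the paper's meaningful-unit analysis and topological features are not reproduced.

        import numpy as np
        from scipy.ndimage import gaussian_filter1d

        def segment_lines(binary_page, sigma=3.0):
            # Return (start, end) row ranges of text lines in a 2-D 0/1 page image.
            profile = binary_page.sum(axis=1).astype(float)   # horizontal projection profile
            smooth = gaussian_filter1d(profile, sigma)        # Gaussian-smoothed profile
            ink = smooth > 0.1 * smooth.max()                 # rows likely to contain text
            lines, start = [], None
            for row, is_ink in enumerate(ink):
                if is_ink and start is None:
                    start = row
                elif not is_ink and start is not None:
                    lines.append((start, row))
                    start = None
            if start is not None:
                lines.append((start, len(ink)))
            return lines

        page = np.zeros((100, 50), dtype=int)
        page[10:20, 5:45] = 1                                 # a synthetic text line
        page[40:52, 5:45] = 1                                 # another line
        print(segment_lines(page))                            # two row ranges, roughly around 10-20 and 40-52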

  5. Visual Classifier Training for Text Document Retrieval.

    Science.gov (United States)

    Heimerl, F; Koch, S; Bosch, H; Ertl, T

    2012-12-01

    Performing exhaustive searches over a large number of text documents can be tedious, since it is very hard to formulate search queries or define filter criteria that adequately capture an analyst's information need. Classification through machine learning has the potential to improve search and filter tasks encompassing either complex or very specific information needs. Unfortunately, analysts who are knowledgeable in their field are typically not machine learning specialists, and most classification methods require a certain expertise regarding their parametrization to achieve good results. Supervised machine learning algorithms instead rely on labeled data, which can be provided by analysts. However, the effort for labeling can be very high, which shifts the problem from composing complex queries or defining accurate filters to another laborious task, in addition to the need for judging the trained classifier's quality. We therefore compare three approaches for interactive classifier training in a user study. All of the approaches are potential candidates for integration into a larger retrieval system. They incorporate active learning to various degrees in order to reduce the labeling effort as well as to increase effectiveness. Two of them include interactive visualization for letting users explore the status of the classifier in the context of the labeled documents, as well as for judging the quality of the classifier in iterative feedback loops. We see our work as a step towards introducing user-controlled classification methods, in addition to text search and filtering, for increasing recall in analytics scenarios involving large corpora.
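
    The three interactive training approaches compared in the study are not specified here. As a hedged sketch of the underlying active-learning idea only (hypothetical documents and labels, scikit-learn), the loop below repeatedly asks the analyst to label the document the current classifier is least certain about.

        import numpy as np
        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.linear_model import LogisticRegression

        docs = [
            "reactor core cooling failure",        # relevant to the analyst (label 1)
            "league standings after matchday",     # irrelevant (label 0)
            "emergency shutdown of the reactor",
            "transfer rumours in football",
            "radiation levels near the plant",
            "stadium attendance figures",
        ]
        true_labels = np.array([1, 0, 1, 0, 1, 0])   # the analyst acts as the labeling oracle

        X = TfidfVectorizer().fit_transform(docs)
        labeled = [0, 1]                              # start from two labeled examples
        unlabeled = [i for i in range(len(docs)) if i not in labeled]

        for _ in range(3):                            # a few active-learning rounds
            clf = LogisticRegression().fit(X[labeled], true_labels[labeled])
            proba = clf.predict_proba(X[unlabeled])[:, 1]
            query = unlabeled[int(np.argmin(np.abs(proba - 0.5)))]   # most uncertain document
            labeled.append(query)                     # oracle labels the queried document
            unlabeled.remove(query)

        print(clf.predict(X))                         # predictions after the labeling rounds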

  6. Typograph: Multiscale Spatial Exploration of Text Documents

    Energy Technology Data Exchange (ETDEWEB)

    Endert, Alexander; Burtner, Edwin R.; Cramer, Nicholas O.; Perko, Ralph J.; Hampton, Shawn D.; Cook, Kristin A.

    2013-12-01

    Visualizing large document collections using a spatial layout of terms can enable quick overviews of information. However, these metaphors (e.g., word clouds, tag clouds, etc.) often lack interactivity for exploring the information, and the location and rendering of the terms are often not based on mathematical models that maintain relative distances from other information based on similarity metrics. Further, transitioning between levels of detail (i.e., from terms to full documents) can be challenging. In this paper, we present Typograph, a multi-scale spatial exploration visualization for large document collections. Building on term-based visualization methods, Typograph enables multiple levels of detail (terms, phrases, snippets, and full documents) within a single spatialization. Further, information is placed based on its relative similarity to other information to create the “near = similar” geographic metaphor. This paper discusses the design principles and functionality of Typograph and presents a use case analyzing Wikipedia to demonstrate usage.

  7. Swarm Intelligence in Text Document Clustering

    Energy Technology Data Exchange (ETDEWEB)

    Cui, Xiaohui [ORNL; Potok, Thomas E [ORNL

    2008-01-01

    Social animals or insects in nature often exhibit a form of emergent collective behavior. The research field that attempts to design algorithms or distributed problem-solving devices inspired by the collective behavior of social insect colonies is called Swarm Intelligence. Compared to traditional algorithms, swarm algorithms are usually flexible, robust, decentralized and self-organized. These characteristics make swarm algorithms suitable for solving complex problems, such as document collection clustering. A major challenge of today's information society is that users are overwhelmed with information on any topic they search for. Fast and high-quality document clustering algorithms play an important role in helping users effectively navigate, summarize, and organize this overwhelming information. In this chapter, we introduce three nature-inspired swarm intelligence clustering approaches for document clustering analysis. These clustering algorithms use stochastic and heuristic principles discovered from observing bird flocks, fish schools and ant food foraging.

  8. Typograph: Multiscale Spatial Exploration of Text Documents

    Energy Technology Data Exchange (ETDEWEB)

    Endert, Alexander; Burtner, Edwin R.; Cramer, Nicholas O.; Perko, Ralph J.; Hampton, Shawn D.; Cook, Kristin A.

    2013-10-06

    Visualizing large document collections using a spatial layout of terms can enable quick overviews of information. These visual metaphors (e.g., word clouds, tag clouds, etc.) traditionally show a series of terms organized by space-filling algorithms. However, these views often lack the ability to interactively explore the information to gain more detail, and the location and rendering of the terms are often not based on mathematical models that maintain relative distances from other information based on similarity metrics. In this paper, we present Typograph, a multi-scale spatial exploration visualization for large document collections. Building on term-based visualization methods, Typograph enables multiple levels of detail (terms, phrases, snippets, and full documents) within a single spatialization. Further, information is placed based on its relative similarity to other information to create the “near = similar” geographic metaphor. This paper discusses the design principles and functionality of Typograph and presents a use case analyzing Wikipedia to demonstrate usage.

  9. GENERATION OF A SET OF KEY TERMS CHARACTERISING TEXT DOCUMENTS

    Directory of Open Access Journals (Sweden)

    Kristina Machova

    2007-06-01

    Full Text Available The presented paper describes statistical methods (information gain, mutual information, X^2 statistics, and the TF-IDF method) for key word generation from a text document collection. These key words should characterise the content of text documents and can be used to retrieve relevant documents from a document collection. Term relations were detected on the basis of the conditional probability of term occurrences. The focus is on detecting those words which occur together very often. Thus, key words consisting of two terms were generated additionally. Several tests were carried out using the 20 Newsgroups collection of text documents.
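
    As a hedged illustration of just one of the listed methods (plain TF-IDF scoring, not the paper's full pipeline or its two-term key words), a small self-contained sketch:

        import math
        from collections import Counter

        docs = [
            "text documents are grouped by key terms",
            "key terms characterise the content of text documents",
            "news groups contain many short text messages",
        ]
        tokenized = [d.split() for d in docs]
        df = Counter(t for doc in tokenized for t in set(doc))       # document frequency per term
        N = len(docs)

        def tfidf(doc):
            tf = Counter(doc)
            return {t: (tf[t] / len(doc)) * math.log(N / df[t]) for t in tf}

        for doc in tokenized:
            scores = tfidf(doc)
            print(sorted(scores, key=scores.get, reverse=True)[:3])  # top candidate key words per document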

  10. AstroAsciiData: ASCII table Python module

    Science.gov (United States)

    Kümmel, Martin; Haase, Jonas

    2013-11-01

    ASCII tables continue to be one of the most popular and widely used data exchange formats in astronomy. AstroAsciiData, written in Python, imports all reasonably well-formed ASCII tables. It retains formatting of data values, allows column-first access, supports SExtractor style headings, performs column sorting, and exports data to other formats, including FITS, Numpy/Numarray, and LaTeX table format. It also offers interchangeable comment character, column delimiter and null value.
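
    AstroAsciiData's own API is not shown in this record; as a hedged stand-in that only illustrates the kind of SExtractor-style ASCII table it targets, the snippet below reads such a table with plain numpy instead.

        import numpy as np

        # A tiny SExtractor-style table: commented column headings followed by whitespace-separated rows.
        lines = [
            "# 1 NUMBER",
            "# 2 MAG_AUTO",
            "# 3 FLUX_AUTO",
            "1   18.32   1523.4",
            "2   19.07    775.1",
            "3   17.88   2210.9",
        ]
        data = np.genfromtxt(lines, comments="#", names=["NUMBER", "MAG_AUTO", "FLUX_AUTO"])
        print(data["MAG_AUTO"])   # column-first access: [18.32 19.07 17.88]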

  11. Semantic Document Image Classification Based on Valuable Text Pattern

    Directory of Open Access Journals (Sweden)

    Hossein Pourghassem

    2011-01-01

    Full Text Available Knowledge extraction from detected document images is a complex problem in the field of information technology. This problem becomes more intricate when we know that only a negligible percentage of the detected document images are valuable. In this paper, a segmentation-based classification algorithm is used to analyze the document image. In this algorithm, using a two-stage segmentation approach, regions of the image are detected and then classified into document and non-document (pure) regions in a hierarchical classification. A novel definition of value is proposed to classify document images into valuable or invaluable categories. The proposed algorithm is evaluated on a database consisting of document and non-document images collected from the Internet. Experimental results show the efficiency of the proposed algorithm for semantic document image classification; it achieves an accuracy of 98.8% on the valuable vs. invaluable document image classification problem.

  12. A New Fragile Watermarking Scheme for Text Documents Authentication

    Institute of Scientific and Technical Information of China (English)

    XIANG Huazheng; SUN Xingming; TANG Chengliang

    2006-01-01

    Because text documents can be modified in different ways, such as deleting or inserting characters, algorithms for image authentication cannot be applied directly to text document authentication. A text watermarking scheme for text document authentication is proposed in this paper. By extracting the features of character cascades together with the user's secret key, the scheme combines the features of the text with the user information into a watermark which is embedded into the transformed text itself. Receivers can verify the integrity and authenticity of the text through blind detection. Further research demonstrates that the scheme can also localize tampering, classify the type of modification, and recover part of the modified text. These conclusions are supported by both our experimental results and analysis.

  13. A Semi-Structured Document Model for Text Mining

    Institute of Scientific and Technical Information of China (English)

    杨建武; 陈晓鸥

    2002-01-01

    A semi-structured document has more structured information compared to an ordinary document, and the relations among semi-structured documents can be fully utilized. In order to take advantage of the structure and link information in a semi-structured document for better mining, a structured link vector model (SLVM) is presented in this paper, in which a vector represents a document and the vector's elements are determined by terms, document structure and neighboring documents. Text mining based on SLVM is described through the K-means procedure, for brevity and clarity: calculating document similarity and calculating cluster centers. In experiments, clustering based on SLVM performs significantly better than clustering based on a conventional vector space model, with the F value increasing from 0.65-0.73 to 0.82-0.86.
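
    SLVM itself is defined in the paper; for reference, the conventional vector-space K-means baseline it is compared against can be sketched roughly as follows (a hedged example with made-up documents, not the SLVM model).

        from sklearn.cluster import KMeans
        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.preprocessing import normalize

        docs = [
            "stock markets fell sharply on Monday",
            "the central bank raised interest rates",
            "the home team won the cup final",
            "injury forces striker to miss the season",
        ]
        X = normalize(TfidfVectorizer().fit_transform(docs))   # unit vectors: Euclidean k-means behaves like cosine
        labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
        print(labels)                                           # e.g. [0 0 1 1]: finance vs. sport documents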

  14. Text Line Segmentation of Historical Documents: a Survey

    CERN Document Server

    Likforman-Sulem, Laurence; Taconet, Bruno

    2007-01-01

    There is a huge amount of historical documents in libraries and in various National Archives that have not been exploited electronically. Although automatic reading of complete pages remains, in most cases, a long-term objective, tasks such as word spotting, text/image alignment, authentication and extraction of specific fields are in use today. For all these tasks, a major step is document segmentation into text lines. Because of the low quality and the complexity of these documents (background noise, artifacts due to aging, interfering lines), automatic text line segmentation remains an open research field. The objective of this paper is to present a survey of existing methods, developed during the last decade, and dedicated to documents of historical interest.

  15. CERCLIS (Superfund) ASCII Text Format - CPAD Database

    Data.gov (United States)

    U.S. Environmental Protection Agency — The Comprehensive Environmental Response, Compensation and Liability Information System (CERCLIS) (Superfund) Public Access Database (CPAD) contains a selected set...

  16. Integrated Clustering and Feature Selection Scheme for Text Documents.

    Directory of Open Access Journals (Sweden)

    M. Thangamani

    2010-01-01

    Full Text Available Problem statement: Text documents are unstructured databases that contain raw data collections. Clustering techniques are used to group text documents by similarity. Approach: Feature selection techniques were used to improve the efficiency and accuracy of the clustering process. Feature selection was done by eliminating redundant and irrelevant items from the text document contents. Statistical methods were used in the text clustering and feature selection algorithm. The cube size is very high and the accuracy low in the term-based text clustering and feature selection method. A semantic clustering and feature selection method was proposed to improve the clustering and feature selection mechanism using semantic relations of the text documents. The proposed system was designed to identify semantic relations using an ontology, which represents term and concept relationships. Results: The synonym, meronym and hypernym relationships were represented in the ontology. Concept weights were estimated with reference to the ontology and used for the clustering process. The system was implemented in two configurations: term clustering with feature selection and semantic clustering with feature selection. Conclusion: The performance analysis was carried out for the term clustering and semantic clustering methods, analyzing their accuracy and efficiency.

  17. EXPLOITING RHETORICAL RELATIONS TO MULTIPLE DOCUMENTS TEXT SUMMARIZATION

    Directory of Open Access Journals (Sweden)

    N. Adilah Hanin Zahri

    2015-03-01

    Full Text Available Much previous research has proven that the use of rhetorical relations can enhance many applications such as text summarization, question answering and natural language generation. This work proposes an approach that extends the benefit of rhetorical relations to address the redundancy problem in cluster-based text summarization of multiple documents. We exploited the rhetorical relations that exist between sentences to group similar sentences into multiple clusters and identify themes of common information, from which candidate summary sentences were extracted. Then, cluster-based text summarization is performed using a Conditional Markov Random Walk Model to measure the saliency scores of the candidates. We evaluated our method by measuring the cohesion and separation of the clusters constructed by exploiting rhetorical relations, and the ROUGE scores of the generated summaries. The experimental results show that our method performs well, which indicates the promising potential of applying rhetorical relations to text clustering for summarization of multiple documents.

  18. Document Exploration and Automatic Knowledge Extraction for Unstructured Biomedical Text

    Science.gov (United States)

    Chu, S.; Totaro, G.; Doshi, N.; Thapar, S.; Mattmann, C. A.; Ramirez, P.

    2015-12-01

    We describe our work on building a web-browser based document reader with a built-in exploration tool and automatic concept extraction of medical entities for biomedical text. Vast amounts of biomedical information are offered in unstructured text form through scientific publications and R&D reports. Utilizing text mining can help us to mine information and extract relevant knowledge from a plethora of biomedical text. The ability to employ such technologies to aid researchers in coping with information overload is greatly desirable. In recent years, there has been an increased interest in automatic biomedical concept extraction [1, 2] and intelligent PDF reader tools with the ability to search on content and find related articles [3]. Such reader tools are typically desktop applications and are limited to specific platforms. Our goal is to provide researchers with a simple tool to aid them in finding, reading, and exploring documents. Thus, we propose a web-based document explorer, which we call Shangri-Docs, which combines a document reader with automatic concept extraction and highlighting of relevant terms. Shangri-Docs also provides the ability to evaluate a wide variety of document formats (e.g. PDF, Word, PPT, text, etc.) and to exploit the linked nature of the Web and personal content by performing searches on content from public sites (e.g. Wikipedia, PubMed) and private cataloged databases simultaneously. Shangri-Docs utilizes Apache cTAKES (clinical Text Analysis and Knowledge Extraction System) [4] and the Unified Medical Language System (UMLS) to automatically identify and highlight terms and concepts, such as specific symptoms, diseases, drugs, and anatomical sites, mentioned in the text. cTAKES was originally designed specifically to extract information from clinical medical records. Our investigation leads us to extend the automatic knowledge extraction process of cTAKES to the biomedical research domain by improving the ontology guided information extraction

  19. Unsupervised Text Normalization Approach for Morphological Analysis of Blog Documents

    Science.gov (United States)

    Ikeda, Kazushi; Yanagihara, Tadashi; Matsumoto, Kazunori; Takishima, Yasuhiro

    In this paper, we propose an algorithm for reducing the number of unknown words in blog documents by replacing peculiar expressions with formal expressions. Japanese blog documents contain many peculiar expressions that are treated as unknown sequences by morphological analyzers, and reducing these unknown sequences improves the accuracy of morphological analysis for blog documents. Manual registration of peculiar expressions in morphological dictionaries is the conventional solution, but it is costly and requires specialized knowledge. In our algorithm, substitution candidates for peculiar expressions are automatically retrieved from formally written documents such as newspapers and stored as substitution rules. For correct replacement, a substitution rule is selected based on three criteria: its appearance frequency in the retrieval process, the edit distance between the substituted sequence and the original text, and the estimated improvement in word segmentation accuracy after the substitution. Experimental results show that our algorithm reduces the number of unknown words by 30.3% while maintaining the same segmentation accuracy as the conventional methods, which is twice the reduction rate of those methods.
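
    One of the three selection criteria is the edit distance between the substituted sequence and the original text. A standard Levenshtein distance, which could serve in that role (the exact variant used by the authors is an assumption), is sketched below.

        def edit_distance(a: str, b: str) -> int:
            # Classic Levenshtein distance computed with a rolling dynamic-programming row.
            prev = list(range(len(b) + 1))
            for i, ca in enumerate(a, 1):
                curr = [i]
                for j, cb in enumerate(b, 1):
                    curr.append(min(prev[j] + 1,                 # deletion
                                    curr[j - 1] + 1,             # insertion
                                    prev[j - 1] + (ca != cb)))   # substitution
                prev = curr
            return prev[-1]

        print(edit_distance("kitten", "sitting"))   # 3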

  20. Finding Text Information in the Ocean of Electronic Documents

    Energy Technology Data Exchange (ETDEWEB)

    Medvick, Patricia A.; Calapristi, Augustin J.

    2003-02-05

    Information management in natural resources has become an overwhelming task. A massive amount of electronic documents and data is now available for creating informed decisions. The problem is finding the relevant information to support the decision-making process. Determining gaps in knowledge in order to propose new studies, or to determine which proposals to fund for maximum potential, is a time-consuming and difficult task. Additionally, available data stores are increasing in complexity; they now may include not only text and numerical data, but also images, sounds, and video recordings. Information visualization specialists at Pacific Northwest National Laboratory (PNNL) have software tools for exploring electronic data stores and for discovering and exploiting relationships within data sets. These provide capabilities for unstructured text exploration, the use of data signatures (a compact format for the essence of a set of scientific data) for visualization (Wong et al. 2000), visualizations for multiple query results (Havre et al. 2001), and others (http://www.pnl.gov/infoviz). We will focus on IN-SPIRE, an MS Windows version of PNNL's SPIRE (Spatial Paradigm for Information Retrieval and Exploration). IN-SPIRE was developed to assist information analysts in finding and discovering information in huge masses of text documents.

  1. Leveraging Text Content for Management of Construction Project Documents

    Science.gov (United States)

    Alqady, Mohammed

    2012-01-01

    The construction industry is a knowledge intensive industry. Thousands of documents are generated by construction projects. Documents, as information carriers, must be managed effectively to ensure successful project management. The fact that a single project can produce thousands of documents and that a lot of the documents are generated in a…

  2. Text Mining Approaches To Extract Interesting Association Rules from Text Documents

    Directory of Open Access Journals (Sweden)

    Vishwadeepak Singh Baghela

    2012-05-01

    Full Text Available A handful of text data mining approaches are available to extract potential information and associations from large amounts of text data. The term data mining is used for methods that analyze data with the objective of finding rules and patterns describing the characteristic properties of the data. The mined information is typically represented as a model of the semantic structure of the dataset, where the model may be used on new data for prediction or classification. In general, data mining deals with structured data (for example, relational databases), whereas text presents special characteristics and is unstructured. Unstructured data are totally different from databases, where mining techniques are usually applied to managed, structured data. Text mining can work with unstructured or semi-structured data sets. A brief review of some recent research related to mining associations from text documents is presented in this paper.

  3. SEGMENTATION OF OVERLAPPING TEXT LINES, CHARACTERS IN PRINTED TELUGU TEXT DOCUMENT IMAGES

    Directory of Open Access Journals (Sweden)

    M Swamy Das,

    2010-11-01

    Full Text Available Segmentation is an important task of any OCR system. It separates image text documents into lines, words and characters, and the accuracy of an OCR system mainly depends on the segmentation algorithm used. Segmenting Telugu text is difficult compared with Latin-based languages because of its structural complexity and larger character set: it contains vowels, consonants and compound characters, and some of the characters may overlap. Profile-based methods can only segment non-overlapping lines and characters. This paper addresses the segmentation of overlapping text lines and characters. The proposed algorithm is based on projection profiles, connected components and spatial vertical relationships, and it also uses a nearest-neighbour method to cluster the connected components. From experimental results it is observed that 100% line segmentation and about 98% character segmentation accuracy can be achieved with overlapping lines and characters.

  4. Classification of protein-protein interaction full-text documents using text and citation network features.

    Science.gov (United States)

    Kolchinsky, Artemy; Abi-Haidar, Alaa; Kaur, Jasleen; Hamed, Ahmed Abdeen; Rocha, Luis M

    2010-01-01

    We participated (as Team 9) in the Article Classification Task of the Biocreative II.5 Challenge: binary classification of full-text documents relevant for protein-protein interaction. We used two distinct classifiers for the online and offline challenges: 1) the lightweight Variable Trigonometric Threshold (VTT) linear classifier we successfully introduced in BioCreative 2 for binary classification of abstracts and 2) a novel Naive Bayes classifier using features from the citation network of the relevant literature. We supplemented the supplied training data with full-text documents from the MIPS database. The lightweight VTT classifier was very competitive in this new full-text scenario: it was a top-performing submission in this task, taking into account the rank product of the Area Under the interpolated precision and recall Curve, Accuracy, Balanced F-Score, and Matthew's Correlation Coefficient performance measures. The novel citation network classifier for the biomedical text mining domain, while not a top performing classifier in the challenge, performed above the central tendency of all submissions, and therefore indicates a promising new avenue to investigate further in bibliome informatics.

  5. Asynchronous ASCII Event Count Status Code

    Science.gov (United States)

    2012-03-01

    IRIG STANDARD 215-12, Telecommunications and Timing Group: Asynchronous ASCII Event Count Status Codes. Provides systems engineers and equipment vendors with an Inter-range Instrumentation Group (IRIG) Standard for American Standard Code for Information Interchange (ASCII)-formatted EC status transfer which can be used over ... circuits and Ethernet networks.

  6. Transliterating non-ASCII characters with Python

    Directory of Open Access Journals (Sweden)

    Seth Bernstein

    2013-10-01

    Full Text Available This lesson shows how to use Python to automatically transliterate a list of words from a language with a non-Latin alphabet to a standardized format using American Standard Code for Information Interchange (ASCII) characters. It builds on readers’ understanding of Python from the lessons “Viewing HTML Files,” “Working with Web Pages,” “From HTML to List of Words (part 1)” and “Intro to Beautiful Soup.” At the end of the lesson, we will use the transliteration dictionary to convert the names from a database of the Russian organization Memorial from Cyrillic into Latin characters. Although the example uses Cyrillic characters, the technique can be reproduced with other alphabets using Unicode.
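
    The lesson's full transliteration table is not reproduced in this record; a deliberately tiny, hedged miniature of the same dictionary-based idea (an incomplete Cyrillic-to-Latin mapping, not the lesson's table) looks like this.

        # Illustrative subset only -- unmapped characters pass through unchanged.
        translit = {
            "А": "A", "а": "a", "Б": "B", "б": "b", "В": "V", "в": "v",
            "Г": "G", "г": "g", "Д": "D", "д": "d", "е": "e", "и": "i",
            "л": "l", "м": "m", "о": "o", "р": "r", "я": "ia",
        }

        def to_latin(text: str) -> str:
            return "".join(translit.get(ch, ch) for ch in text)

        print(to_latin("Владимир"))   # "Vladimir"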

  7. Information Gain Based Dimensionality Selection for Classifying Text Documents

    Energy Technology Data Exchange (ETDEWEB)

    Dumidu Wijayasekara; Milos Manic; Miles McQueen

    2013-06-01

    Selecting the optimal dimensions for various knowledge extraction applications is an essential component of data mining. Dimensionality selection techniques are utilized in classification applications to increase the classification accuracy and reduce the computational complexity. In text classification, where the dimensionality of the dataset is extremely high, dimensionality selection is even more important. This paper presents a novel genetic algorithm based methodology for dimensionality selection in text mining applications that utilizes information gain. The presented methodology uses the information gain of each dimension to dynamically change the mutation probability of chromosomes. Since the information gain is calculated a priori, the computational complexity is not affected. The presented method was tested on a specific text classification problem and compared with conventional genetic algorithm based dimensionality selection. The results show an improvement of 3% in the true positives and 1.6% in the true negatives over conventional dimensionality selection methods.
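
    Under the usual definition, the information gain of a binary term-presence feature with respect to the class labels can be computed as below; this is a hedged sketch of the score that drives the mutation probabilities, not the genetic algorithm itself.

        import math
        from collections import Counter

        def entropy(labels):
            n = len(labels)
            return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

        def information_gain(term_present, labels):
            # Class entropy minus the entropy remaining after splitting on term presence.
            n = len(labels)
            gain = entropy(labels)
            for value in (True, False):
                subset = [y for x, y in zip(term_present, labels) if x == value]
                if subset:
                    gain -= len(subset) / n * entropy(subset)
            return gain

        # Toy data: does the term occur in the document / is the document on-topic?
        present = [True, True, False, False, True, False]
        on_topic = [1, 1, 0, 0, 1, 0]
        print(information_gain(present, on_topic))   # 1.0 -- the term perfectly separates the classes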

  8. Cluster Based Hybrid Niche Mimetic and Genetic Algorithm for Text Document Categorization

    Directory of Open Access Journals (Sweden)

    A. K. Santra

    2011-09-01

    Full Text Available An efficient cluster-based hybrid niche mimetic and genetic algorithm for text document categorization, aimed at improving the retrieval rate of relevant documents, is addressed. The proposal minimizes the processing needed to structure the document, with better feature selection using the hybrid algorithm. In addition, the restructuring of feature words to associated documents is reduced, which in turn increases the document clustering rate. The performance of the proposed work is measured in terms of cluster object accuracy, term weight, term frequency and inverse document frequency. Experimental results demonstrate that it achieves very good performance on both feature selection and text document categorization compared to other classifier methods.

  9. Comparison of Document Index Graph Using TextRank and HITS Weighting Method in Automatic Text Summarization

    Science.gov (United States)

    Hadyan, Fadhlil; Shaufiah; Arif Bijaksana, Moch.

    2017-01-01

    Automatic summarization is a system that can help someone take in the core information of a long text instantly by summarizing it automatically. Many summarization systems have already been developed, but many problems remain. This final project proposes a summarization method using a document index graph. The method adapts the PageRank and HITS formulas, originally used to assess web pages, to assess the words in the sentences of a text document. The expected outcome of this final project is a system that can summarize a single document by utilizing a document index graph with TextRank and HITS to improve the quality of the automatically generated summaries.
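
    The document index graph and the HITS variant are not reproduced here; as a rough, hedged sketch of graph-based sentence ranking in the TextRank spirit (made-up sentences, cosine-similarity edges, PageRank scores), consider:

        import itertools
        import networkx as nx
        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.metrics.pairwise import cosine_similarity

        sentences = [
            "The system summarizes long texts automatically.",
            "Automatic summarization helps readers grasp the core information quickly.",
            "Graph-based ranking scores sentences by their similarity to other sentences.",
            "The weather was pleasant yesterday.",
        ]
        X = TfidfVectorizer().fit_transform(sentences)
        sim = cosine_similarity(X)

        graph = nx.Graph()
        graph.add_nodes_from(range(len(sentences)))
        for i, j in itertools.combinations(range(len(sentences)), 2):
            if sim[i, j] > 0:
                graph.add_edge(i, j, weight=float(sim[i, j]))

        scores = nx.pagerank(graph, weight="weight")
        best = max(scores, key=scores.get)
        print(sentences[best])   # highest-ranked sentence, a candidate for the summary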

  10. A New Retrieval Model Based on TextTiling for Document Similarity Search

    Institute of Scientific and Technical Information of China (English)

    Xiao-Jun Wan; Yu-Xin Peng

    2005-01-01

    Document similarity search is to find documents similar to a given query document and return a ranked list of similar documents to users, which is widely used in many text and web systems, such as digital libraries, search engines, etc. Traditional retrieval models, including the Okapi BM25 model and the Smart vector space model with length normalization, can handle this problem to some extent by taking the query document as a long query. In practice, the Cosine measure is considered the best model for document similarity search because of its good ability to measure similarity between two documents. In this paper, the quantitative performances of the above models are compared using experiments. Because the Cosine measure is not able to reflect the structural similarity between documents, a new retrieval model based on TextTiling is proposed in the paper. The proposed model takes into account the subtopic structures of documents. It first splits the documents into text segments with TextTiling and calculates the similarities for different pairs of text segments in the documents. Lastly, the overall similarity between the documents is obtained by combining the similarities of different pairs of text segments with an optimal matching method. Experiments are performed and the results show: 1) the popular retrieval models (the Okapi BM25 model and the Smart vector space model with length normalization) do not perform well for document similarity search; 2) the proposed model based on TextTiling is effective and outperforms other models, including the Cosine measure; 3) the methods for the three components in the proposed model are validated to be appropriately employed.
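
    The final step combines segment-pair similarities with an optimal matching method. A hedged sketch of that matching step alone, over a hypothetical precomputed segment-similarity matrix (TextTiling itself is not reproduced), can use the Hungarian algorithm from SciPy.

        import numpy as np
        from scipy.optimize import linear_sum_assignment

        # Hypothetical cosine similarities between segments of document A (rows) and document B (columns).
        seg_sim = np.array([
            [0.82, 0.10, 0.05],
            [0.15, 0.64, 0.20],
            [0.02, 0.30, 0.71],
        ])
        rows, cols = linear_sum_assignment(-seg_sim)     # negate to maximize total matched similarity
        overall = seg_sim[rows, cols].mean()             # combine the matched-pair similarities
        print([(int(r), int(c)) for r, c in zip(rows, cols)], round(float(overall), 3))
        # [(0, 0), (1, 1), (2, 2)] 0.723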

  11. THE SEGMENTATION OF A TEXT LINE FOR A HANDWRITTEN UNCONSTRAINED DOCUMENT USING THINNING ALGORITHM

    NARCIS (Netherlands)

    Tsuruoka, S.; Adachi, Y.; Yoshikawa, T.

    2004-01-01

    For printed documents, the projection analysis of black pixels is widely used for the segmentation of a text line. However, for handwritten documents, we think that the projection analysis is not appropriate, as the separating border line of a text line is not a straight line on a paper with no rule

  12. Optimizing A syndromic surveillance text classifier for influenza-like illness: Does document source matter?

    Science.gov (United States)

    South, Brett R; South, Brett Ray; Chapman, Wendy W; Chapman, Wendy; Delisle, Sylvain; Shen, Shuying; Kalp, Ericka; Perl, Trish; Samore, Matthew H; Gundlapalli, Adi V

    2008-11-06

    Syndromic surveillance systems that incorporate electronic free-text data have primarily focused on extracting concepts of interest from chief complaint text, emergency department visit notes, and nurse triage notes. Due to availability and access, there has been limited work on surveilling the full text of all electronic note documents compared with more specific document sources. This study evaluates the performance of a text classifier for detection of influenza-like illness (ILI) by document source, comparing sources commonly used for biosurveillance with routine visit notes and a full electronic note corpus approach. Evaluating the performance of an automated text classifier for syndromic surveillance by source document will inform decisions regarding electronic textual data sources for potential use by automated biosurveillance systems. Even when a full electronic medical record is available, commonly available surveillance source documents provide acceptable statistical performance for automated ILI surveillance.

  13. Text Categorization for Multi-Page Documents: A Hybrid Naive Bayes HMM Approach.

    Science.gov (United States)

    Frasconi, Paolo; Soda, Giovanni; Vullo, Alessandro

    Text categorization is typically formulated as a concept learning problem where each instance is a single isolated document. This paper is interested in a more general formulation where documents are organized as page sequences, as naturally occurring in digital libraries of scanned books and magazines. The paper describes a method for classifying…

  14. A Feature Mining Based Approach for the Classification of Text Documents into Disjoint Classes.

    Science.gov (United States)

    Nieto Sanchez, Salvador; Triantaphyllou, Evangelos; Kraft, Donald

    2002-01-01

    Proposes a new approach for classifying text documents into two disjoint classes. Highlights include a brief overview of document clustering; a data mining approach called the One Clause at a Time (OCAT) algorithm which is based on mathematical logic; vector space model (VSM); and comparing the OCAT to the VSM. (Author/LRW)

  15. ARABIC TEXT SUMMARIZATION BASED ON LATENT SEMANTIC ANALYSIS TO ENHANCE ARABIC DOCUMENTS CLUSTERING

    Directory of Open Access Journals (Sweden)

    Hanane Froud

    2013-01-01

    Full Text Available Arabic document clustering is an important task for obtaining good results with traditional Information Retrieval (IR) systems, especially with the rapid growth of the number of online documents in Arabic. Document clustering aims to automatically group similar documents into one cluster using different similarity/distance measures. This task is often affected by document length: useful information in the documents is often accompanied by a large amount of noise, so it is necessary to eliminate this noise while keeping the useful information to boost clustering performance. In this paper, we propose to evaluate the impact of text summarization using the Latent Semantic Analysis model on Arabic document clustering in order to solve the problems cited above, using five similarity/distance measures: Euclidean distance, cosine similarity, Jaccard coefficient, Pearson correlation coefficient and averaged Kullback-Leibler divergence, both without and with stemming. Our experimental results indicate that the proposed approach effectively addresses the problems of noisy information and document length, and thus significantly improves clustering performance.
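
    The Arabic preprocessing, stemming and the five similarity measures are not reproduced here; a hedged sketch of the latent semantic analysis step alone (scoring sentences by their weight in a truncated SVD space and keeping the most salient ones) might look like this.

        import numpy as np
        from sklearn.decomposition import TruncatedSVD
        from sklearn.feature_extraction.text import TfidfVectorizer

        sentences = [
            "document clustering groups similar documents together",
            "similarity measures compare pairs of documents",
            "noise in long documents hurts clustering quality",
            "summarization removes noise and keeps the salient sentences",
        ]
        X = TfidfVectorizer().fit_transform(sentences)
        svd = TruncatedSVD(n_components=2, random_state=0)
        topic_weights = svd.fit_transform(X)              # sentence-by-latent-topic matrix
        salience = np.linalg.norm(topic_weights, axis=1)  # length in latent space as a salience score
        keep = sorted(salience.argsort()[::-1][:2])       # retain the two most salient sentences, in order
        print([sentences[i] for i in keep])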

  16. Arabic Text Summarization Based on Latent Semantic Analysis to Enhance Arabic Documents Clustering

    Directory of Open Access Journals (Sweden)

    Hanane Froud

    2013-02-01

    Full Text Available Arabic document clustering is an important task for obtaining good results with traditional Information Retrieval (IR) systems, especially with the rapid growth of the number of online documents in Arabic. Document clustering aims to automatically group similar documents into one cluster using different similarity/distance measures. This task is often affected by document length: useful information in the documents is often accompanied by a large amount of noise, so it is necessary to eliminate this noise while keeping the useful information to boost clustering performance. In this paper, we propose to evaluate the impact of text summarization using the Latent Semantic Analysis model on Arabic document clustering in order to solve the problems cited above, using five similarity/distance measures: Euclidean distance, cosine similarity, Jaccard coefficient, Pearson correlation coefficient and averaged Kullback-Leibler divergence, both without and with stemming. Our experimental results indicate that the proposed approach effectively addresses the problems of noisy information and document length, and thus significantly improves clustering performance.

  17. Raw Data (ASCII format) - PLACE | LSDB Archive [Life Science Database Archive metadata

    Lifescience Database Archive (English)

    Full Text Available PLACE Raw Data (ASCII format). Data name: Raw Data (ASCII format). Number of data entries: 469. Data items include ID (identifier), AC (accession number), DT (date, operation, author), DE (description: regulation, gene, trans-acting factor, etc.), KW (keywords), nucleotide ambiguity codes (e.g. W: A or T; Y: C or T) (Mar. 31, 1997), and XX (data item delimiter).

  18. VarifocalReader--In-Depth Visual Analysis of Large Text Documents.

    Science.gov (United States)

    Koch, Steffen; John, Markus; Wörner, Michael; Müller, Andreas; Ertl, Thomas

    2014-12-01

    Interactive visualization provides valuable support for exploring, analyzing, and understanding textual documents. Certain tasks, however, require that insights derived from visual abstractions are verified by a human expert perusing the source text. So far, this problem is typically solved by offering overview-detail techniques, which present different views with different levels of abstractions. This often leads to problems with visual continuity. Focus-context techniques, on the other hand, succeed in accentuating interesting subsections of large text documents but are normally not suited for integrating visual abstractions. With VarifocalReader we present a technique that helps to solve some of these approaches' problems by combining characteristics from both. In particular, our method simplifies working with large and potentially complex text documents by simultaneously offering abstract representations of varying detail, based on the inherent structure of the document, and access to the text itself. In addition, VarifocalReader supports intra-document exploration through advanced navigation concepts and facilitates visual analysis tasks. The approach enables users to apply machine learning techniques and search mechanisms as well as to assess and adapt these techniques. This helps to extract entities, concepts and other artifacts from texts. In combination with the automatic generation of intermediate text levels through topic segmentation for thematic orientation, users can test hypotheses or develop interesting new research questions. To illustrate the advantages of our approach, we provide usage examples from literature studies.

  19. A Novel Model for Timed Event Extraction and Temporal Reasoning In Legal Text Documents

    Directory of Open Access Journals (Sweden)

    Kolikipogu Ramakrishna

    2011-02-01

    Full Text Available Information retrieval is at a nascent stage in providing any type of information queried by a naive user. The question answering system is one successful area of information retrieval. Legal documents (case law, statutes, or transactional documents) are increasing day by day with new applications (mobile transactions, medical diagnosis reports, law cases, etc.) in the world. The documentation of various business and human resource (HR) applications involves legal documents, and the analysis and temporal reasoning of such documents is a demanding area of research. In this paper we build a novel model for timed event extraction and temporal reasoning in legal text documents. This paper mainly addresses how one can do further reasoning with the extracted temporal information. Exploring temporal information in legal text documents is an important task to support legal practitioners and lawyers in making temporally grounded decisions. Legal documents are available in different natural languages; hence the model uses an NLP system for pre-processing, a temporal constraint structure for temporal expressions, an associated tagger, and a post-processor with a knowledge-based subsystem that helps in discovering implicit information. The resulting information resolves temporal expressions and deals with issues such as granularity and vagueness, and provides a reasoning mechanism which models the temporal constraint satisfaction network.

  20. LOG2MARKUP: State module to transform a Stata text log into a markup document

    DEFF Research Database (Denmark)

    2016-01-01

    log2markup extracts parts of the text version of the Stata log produced by the log command and transforms the logfile into a markup-based document with the same name, but with the extension markup (or as otherwise specified in the extension option) instead of log. The author usually uses markdown for writing documents; however, other users may decide on all sorts of markup languages, e.g. HTML or LaTeX. The key is that the markup of Stata code and Stata output can be set by the options.

  1. THE COMPOSITIONAL AND SPEECH ORGANIZATION OF REGULATION TEXT AS A REGULATORY DOCUMENT

    Directory of Open Access Journals (Sweden)

    Sharipova Roza Rifatovna

    2014-06-01

    Full Text Available The relevance of the study covered by this article is determined by the expansion of the scope of business communication, as well as the necessity of upgrading the administrative activity of organizations, which largely depends on documentation quality. Documents are used in various communicative situations and reflect intercultural business relations, which is why the problem of studying the nature and functions of documents is urgent. Business communication involves interaction in different areas of activity, and a document is one of the main tools for regulating this process. The author studies the regulation, a document which ensures the systematization and adjustment of the management process and reflects certain production processes and the order of their execution. Taking into account a complex of criteria (the functioning level of the document, the specificity of the business communication subjects, the diversity of regulated processes, and the compositional, content, and speech organization of the text), the author suggests distinguishing three types of regulations. Regulations of the first type systematize business activity at the level of government or the corresponding administration. Regulations of the second type are used to regulate external relations - with counter-agents and partners - during an undetermined (long-term) or determined (having start and end dates) validity period. Regulations of the third type regulate internal relations within an organization and are mostly intended for staff. From the compositional viewpoint, regulations of all types are texts consisting of several sections; here, the functioning level of the regulation and the specificity of the business communication subjects define the character of the information, i.e. its degree of generality or detail. The speech organization of the studied documents is similar, characterized by the use of lexis with process semantics and official clichés. The regulations differ in terminology

  2. A COMPARATIVE STUDY TO FIND A SUITABLE METHOD FOR TEXT DOCUMENT CLUSTERING

    Directory of Open Access Journals (Sweden)

    Dr.M.Punithavalli

    2012-01-01

    Full Text Available Text mining is used in various text-related tasks such as information extraction, concept/entity extraction, document summarization, entity relation modeling (i.e., learning relations between named entities), categorization/classification and clustering. This paper focuses on document clustering, a field of text mining which groups a set of documents into a list of meaningful categories. The main focus of this paper is to present a performance analysis of the various techniques available for document clustering. The results of this comparative study can be used to improve existing text data mining frameworks and improve the way of knowledge discovery. This paper considers six clustering techniques for document clustering. The techniques are grouped into three groups, namely Group 1 - K-means and its variants (traditional K-means and K*-means algorithms), Group 2 - Expectation Maximization and its variants (traditional EM, the Spherical Gaussian EM algorithm, and Linear Partitioning and Reallocation (LPR) clustering using EM), and Group 3 - semantic-based techniques (a hybrid method and feature-based algorithms). A total of seven algorithms are considered, selected based on their popularity in the text mining field. Several experiments were conducted to analyze the performance of the algorithms and to select the winner in terms of cluster purity, clustering accuracy and speed of clustering.

  3. Semi-supervised learning for detecting text-lines in noisy document images

    Science.gov (United States)

    Liu, Zongyi; Zhou, Hanning

    2010-01-01

    Document layout analysis is a key step in document image understanding, with wide applications in document digitization and reformatting. Identifying the correct layout from noisy scanned images is especially challenging. In this paper, we introduce a semi-supervised learning framework to detect text-lines from noisy document images. Our framework consists of three steps. The first step is an initial segmentation that extracts text-lines and images using simple morphological operations. The second step is a grouping-based layout analysis that identifies text-lines, image zones, column separators and vertical border noise; it is able to efficiently remove the vertical border noise from multi-column pages. The third step is an online classifier that is trained with the high-confidence line detection results from step two and filters out noise from low-confidence lines. The classifier effectively removes speckle noise embedded inside the content zones. We compare the performance of our algorithm to the state-of-the-art work in the field on the UW-III database, choosing the results reported by the Image Understanding Pattern Recognition Research (IUPR) group and Scansoft Omnipage SDK 15.5. We evaluate performance at both the page frame level and the text-line level. The results show that our system has a much lower false-alarm rate while maintaining a similar content detection rate. In addition, we also show that our online training model generalizes better than algorithms depending on offline training.

  4. Robust Text Extraction for Automated Processing of Multi-Lingual Personal Identity Documents

    Directory of Open Access Journals (Sweden)

    Pushpa B R

    2016-04-01

    Full Text Available Text extraction is a technique to extract the textual portion from a non-textual background such as images. It plays an important role in deciphering valuable information from images. Variation in text size, font, orientation, alignment, contrast, etc. makes the task of text extraction challenging. Existing text extraction methods focus on certain regions of interest, and characteristics like noise, blur, distortion and variations in fonts make text extraction difficult. This paper proposes a technique to extract textual characters from scanned personal identity document images. Current procedures keep track of user records manually, giving way to inefficient practices and the need for abundant time and human resources. The proposed methodology digitizes personal identity documents and eliminates the need for a large portion of the manual work involved in existing data entry and verification procedures. The proposed method has been tested extensively with large datasets of varying sizes and image qualities, and the results indicate high accuracy in extracting important textual features from the document images.

  5. Finding falls in ambulatory care clinical documents using statistical text mining

    Science.gov (United States)

    McCart, James A; Berndt, Donald J; Jarman, Jay; Finch, Dezon K; Luther, Stephen L

    2013-01-01

    Objective To determine how well statistical text mining (STM) models can identify falls within clinical text associated with an ambulatory encounter. Materials and Methods 2241 patients were selected with a fall-related ICD-9-CM E-code or matched injury diagnosis code while being treated as an outpatient at one of four sites within the Veterans Health Administration. All clinical documents within a 48-h window of the recorded E-code or injury diagnosis code for each patient were obtained (n=26 010; 611 distinct document titles) and annotated for falls. Logistic regression, support vector machine, and cost-sensitive support vector machine (SVM-cost) models were trained on a stratified sample of 70% of documents from one location (dataset Atrain) and then applied to the remaining unseen documents (datasets Atest–D). Results All three STM models obtained area under the receiver operating characteristic curve (AUC) scores above 0.950 on the four test datasets (Atest–D). The SVM-cost model obtained the highest AUC scores, ranging from 0.953 to 0.978. The SVM-cost model also achieved F-measure values ranging from 0.745 to 0.853, sensitivity from 0.890 to 0.931, and specificity from 0.877 to 0.944. Discussion The STM models performed well across a large heterogeneous collection of document titles. In addition, the models also generalized across other sites, including a traditionally bilingual site that had distinctly different grammatical patterns. Conclusions The results of this study suggest STM-based models have the potential to improve surveillance of falls. Furthermore, the encouraging evidence shown here that STM is a robust technique for mining clinical documents bodes well for other surveillance-related topics. PMID:23242765

  6. Human Rights Texts: Converting Human Rights Primary Source Documents into Data.

    Science.gov (United States)

    Fariss, Christopher J; Linder, Fridolin J; Jones, Zachary M; Crabtree, Charles D; Biek, Megan A; Ross, Ana-Sophia M; Kaur, Taranamol; Tsai, Michael

    2015-01-01

    We introduce and make publicly available a large corpus of digitized primary source human rights documents which are published annually by monitoring agencies that include Amnesty International, Human Rights Watch, the Lawyers Committee for Human Rights, and the United States Department of State. In addition to the digitized text, we also make available and describe document-term matrices, which are datasets that systematically organize the word counts from each unique document by each unique term within the corpus of human rights documents. To contextualize the importance of this corpus, we describe the development of coding procedures in the human rights community and several existing categorical indicators that have been created by human coding of the human rights documents contained in the corpus. We then discuss how the new human rights corpus and the existing human rights datasets can be used with a variety of statistical analyses and machine learning algorithms to help scholars understand how human rights practices and reporting have evolved over time. We close with a discussion of our plans for dataset maintenance, updating, and availability.
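
    The corpus files themselves are distributed by the authors; as a hedged sketch of how such a document-term matrix can be built from raw report text (stand-in snippets, scikit-learn), consider:

        from sklearn.feature_extraction.text import CountVectorizer

        # Stand-in snippets; the real corpus consists of annual monitoring reports.
        reports = [
            "arbitrary detention and torture were reported in several provinces",
            "freedom of assembly was restricted and journalists were detained",
            "the state of emergency was extended and detention without trial continued",
        ]
        vectorizer = CountVectorizer()
        dtm = vectorizer.fit_transform(reports)           # rows: documents, columns: unique terms
        print(dtm.shape)                                  # (3, number of unique terms)
        print(vectorizer.get_feature_names_out()[:5])     # first few vocabulary terms
        print(dtm.toarray()[0])                           # word counts for the first report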

  7. ParaText : scalable solutions for processing and searching very large document collections : final LDRD report.

    Energy Technology Data Exchange (ETDEWEB)

    Crossno, Patricia Joyce; Dunlavy, Daniel M.; Stanton, Eric T.; Shead, Timothy M.

    2010-09-01

    This report is a summary of the accomplishments of the 'Scalable Solutions for Processing and Searching Very Large Document Collections' LDRD, which ran from FY08 through FY10. Our goal was to investigate scalable text analysis; specifically, methods for information retrieval and visualization that could scale to extremely large document collections. Towards that end, we designed, implemented, and demonstrated a scalable framework for text analysis - ParaText - as a major project deliverable. Further, we demonstrated the benefits of using visual analysis in text analysis algorithm development, improved performance of heterogeneous ensemble models in data classification problems, and the advantages of information theoretic methods in user analysis and interpretation in cross language information retrieval. The project involved 5 members of the technical staff and 3 summer interns (including one who worked two summers). It resulted in a total of 14 publications, 3 new software libraries (2 open source and 1 internal to Sandia), several new end-user software applications, and over 20 presentations. Several follow-on projects have already begun or will start in FY11, with additional projects currently in proposal.

  8. Hierarchical Concept Indexing of Full-Text Documents in the Unified Medical Language System Information Sources Map.

    Science.gov (United States)

    Wright, Lawrence W.; Nardini, Holly K. Grossetta; Aronson, Alan R.; Rindflesch, Thomas C.

    1999-01-01

    Describes methods for applying natural-language processing for automatic concept-based indexing of full text and methods for exploiting the structure and hierarchy of full-text documents to a large collection of full-text documents drawn from the Health Services/Technology Assessment Text database at the National Library of Medicine. Examines how…

  9. Text Feature Weighting For Summarization Of Document Bahasa Indonesia Using Genetic Algorithm

    Directory of Open Access Journals (Sweden)

    Aristoteles.

    2012-05-01

    Full Text Available This paper aims to perform text feature weighting for summarization of Bahasa Indonesia documents using a genetic algorithm. There are eleven text features, i.e., sentence position (f1), positive keywords in sentence (f2), negative keywords in sentence (f3), sentence centrality (f4), sentence resemblance to the title (f5), sentence inclusion of named entities (f6), sentence inclusion of numerical data (f7), sentence relative length (f8), bushy path of the node (f9), summation of similarities for each node (f10), and latent semantic feature (f11). We investigate the effect of the first ten sentence features on the summarization task. Then, we use the latent semantic feature to increase accuracy. All feature score functions are used to train a genetic algorithm model to obtain a suitable combination of feature weights. Evaluation of text summarization uses the F-measure, which is directly related to the compression rate. The results showed that adding f11 increases the F-measure by 3.26% and 1.55% for compression ratios of 10% and 30%, respectively. On the other hand, it decreases the F-measure by 0.58% for a compression ratio of 20%. Analysis of the text feature weights showed that using only f2, f4, f5, and f11 can deliver performance similar to using all eleven features.
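
    The core idea, scoring each sentence as a weighted sum of feature values and keeping the top-scoring sentences up to the compression rate, can be sketched as follows. The feature functions and the fixed weight vector below are simplified stand-ins for f1-f11 and for the weights a genetic algorithm would learn; this is not the paper's implementation.

    import re

    def features(sentence, position, n_sentences, title_words):
        words = re.findall(r"\w+", sentence.lower())
        return [
            1.0 - position / max(n_sentences - 1, 1),                  # stand-in for f1: sentence position
            len(set(words) & title_words) / (len(title_words) or 1),   # stand-in for f5: resemblance to title
            min(len(words) / 25.0, 1.0),                               # stand-in for f8: relative length
            sum(w.isdigit() for w in words) / (len(words) or 1),       # stand-in for f7: numerical data
        ]

    def summarize(text, title, weights, compression=0.3):
        sentences = re.split(r"(?<=[.!?])\s+", text.strip())
        title_words = set(re.findall(r"\w+", title.lower()))
        scored = [(sum(w * f for w, f in zip(weights, features(s, i, len(sentences), title_words))), i, s)
                  for i, s in enumerate(sentences)]
        keep = max(1, int(len(sentences) * compression))
        top = sorted(sorted(scored, reverse=True)[:keep], key=lambda t: t[1])  # best sentences, original order
        return " ".join(s for _, _, s in top)

    print(summarize("Cats sleep a lot. The 2012 study measured 14 hours of sleep. "
                    "Dogs sleep less than cats.", "Cat sleep study", weights=[0.3, 0.4, 0.2, 0.1]))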

  10. Getting more out of biomedical documents with GATE's full lifecycle open source text analytics.

    Directory of Open Access Journals (Sweden)

    Hamish Cunningham

    Full Text Available This software article describes the GATE family of open source text analysis tools and processes. GATE is one of the most widely used systems of its type with yearly download rates of tens of thousands and many active users in both academic and industrial contexts. In this paper we report three examples of GATE-based systems operating in the life sciences and in medicine. First, in genome-wide association studies which have contributed to discovery of a head and neck cancer mutation association. Second, medical records analysis which has significantly increased the statistical power of treatment/outcome models in the UK's largest psychiatric patient cohort. Third, richer constructs in drug-related searching. We also explore the ways in which the GATE family supports the various stages of the lifecycle present in our examples. We conclude that the deployment of text mining for document abstraction or rich search and navigation is best thought of as a process, and that with the right computational tools and data collection strategies this process can be made defined and repeatable. The GATE research programme is now 20 years old and has grown from its roots as a specialist development tool for text processing to become a rather comprehensive ecosystem, bringing together software developers, language engineers and research staff from diverse fields. GATE now has a strong claim to cover a uniquely wide range of the lifecycle of text analysis systems. It forms a focal point for the integration and reuse of advances that have been made by many people (the majority outside of the authors' own group) who work in text processing for biomedicine and other areas. GATE is available online under GNU open source licences and runs on all major operating systems. Support is available from an active user and developer community and also on a commercial basis.

  11. Getting more out of biomedical documents with GATE's full lifecycle open source text analytics.

    Science.gov (United States)

    Cunningham, Hamish; Tablan, Valentin; Roberts, Angus; Bontcheva, Kalina

    2013-01-01

    This software article describes the GATE family of open source text analysis tools and processes. GATE is one of the most widely used systems of its type with yearly download rates of tens of thousands and many active users in both academic and industrial contexts. In this paper we report three examples of GATE-based systems operating in the life sciences and in medicine. First, in genome-wide association studies which have contributed to discovery of a head and neck cancer mutation association. Second, medical records analysis which has significantly increased the statistical power of treatment/outcome models in the UK's largest psychiatric patient cohort. Third, richer constructs in drug-related searching. We also explore the ways in which the GATE family supports the various stages of the lifecycle present in our examples. We conclude that the deployment of text mining for document abstraction or rich search and navigation is best thought of as a process, and that with the right computational tools and data collection strategies this process can be made defined and repeatable. The GATE research programme is now 20 years old and has grown from its roots as a specialist development tool for text processing to become a rather comprehensive ecosystem, bringing together software developers, language engineers and research staff from diverse fields. GATE now has a strong claim to cover a uniquely wide range of the lifecycle of text analysis systems. It forms a focal point for the integration and reuse of advances that have been made by many people (the majority outside of the authors' own group) who work in text processing for biomedicine and other areas. GATE is available online under GNU open source licences and runs on all major operating systems. Support is available from an active user and developer community and also on a commercial basis.

  12. An Efficient Technique to Implement Similarity Measures in Text Document Clustering using Artificial Neural Networks Algorithm

    Directory of Open Access Journals (Sweden)

    K. Selvi

    2014-12-01

    Full Text Available Pattern recognition, covering supervised and unsupervised methods, optimization, associative memory and control processes, is among the diverse problems that can be addressed by artificial neural networks. Problem identified: of late, discovering the required information in massive quantities of data has become a challenging task. The model of similarity evaluation is the central element in building an understanding of the variables and perceptions that encourage behavior and mediate concern. This study proposes artificial neural network algorithms to resolve similarity measures. In order to apply singular value decomposition, the frequency of word pairs is established in the given document. (1) Tokenization: the splitting of a stream of text into words, phrases, signs, or other significant parts. (2) Stop words: the words that are segregated before or after processing natural language data. (3) Porter stemming: mainly used as part of a phrase normalization step typically performed when setting up an information retrieval technique. (4) WordNet: a lexical database for the English language. Based on artificial neural networks, the core part of this work extends the proposed n-gram algorithm; phonemes, syllables, letters, words or base pairs correspond according to the application. Future work extends the same similarity measures to various other neural network algorithms to accomplish improved results.
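
    The four preprocessing steps named in the abstract can be sketched as follows. The hand-written stop list, the crude suffix-stripping stemmer and the bigram generator are deliberately tiny stand-ins for the real Porter stemmer, WordNet lookups and n-gram features; this is an illustration only, not the authors' system.

    import re

    STOP_WORDS = {"the", "a", "an", "of", "and", "or", "in", "to", "is", "are"}   # tiny stand-in list

    def tokenize(text):
        return re.findall(r"[a-z]+", text.lower())

    def remove_stop_words(tokens):
        return [t for t in tokens if t not in STOP_WORDS]

    def crude_stem(token):
        # Very rough suffix stripping; a stand-in for the Porter stemmer.
        for suffix in ("ing", "ed", "es", "s"):
            if token.endswith(suffix) and len(token) > len(suffix) + 2:
                return token[: -len(suffix)]
        return token

    def ngrams(tokens, n=2):
        return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

    tokens = [crude_stem(t) for t in remove_stop_words(tokenize(
        "Clustering of text documents using similarity measures"))]
    print(tokens)              # stemmed, stop-word-free tokens
    print(ngrams(tokens))      # word-pair (bigram) features for the similarity computation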

  13. Ultrasound-guided nerve blocks--is documentation and education feasible using only text and pictures?

    Directory of Open Access Journals (Sweden)

    Bjarne Skjødt Worm

    Full Text Available PURPOSE: With the advancement of ultrasound guidance for peripheral nerve blocks, still pictures from representative ultrasonograms are increasingly used for clinical documentation of the procedure and for educational purposes in textbook materials. However, little is actually known about the clinical and educational usefulness of these still pictures, in particular how well nerve structures can be identified compared to real-time ultrasound examination. We aimed to quantify gross visibility or ultrastructure using still picture sonograms compared to real-time ultrasound for trainees and experts, for large or small nerves, and to discuss the clinical or educational relevance of these findings. MATERIALS AND METHODS: We undertook a clinical study to quantify the maximal gross visibility or ultrastructure of seven peripheral nerves identified by either real-time ultrasound (clinical cohort, n = 635) or by still picture ultrasonograms (clinical cohort, n = 112). In addition, we undertook a study on test subjects (n = 4) to quantify interobserver variations and potential bias among expert and trainee observers. RESULTS: When comparing real-time ultrasound and interpretation of still picture sonograms, gross identification of large nerves was reduced by 15% and 40% by expert and trainee observers, respectively, while gross identification of small nerves was reduced by 29% and 66%. Identification of within-nerve ultrastructure was even less. For all nerve sizes, trainees were unable to identify any anatomical structure in 24 to 34% of cases, while experts were unable to identify anything in 9 to 10%. CONCLUSION: Exhaustive ultrasonography experience and real-time ultrasound measurements seem to be keystones in obtaining optimal nerve identification. In contrast, the use of still pictures appears to be insufficient for documentation as well as educational purposes. Alternatives such as video clips or enhanced picture technology are encouraged.

  14. A novel technique for estimation of skew in binary text document images based on linear regression analysis

    Indian Academy of Sciences (India)

    P Shivakumara; G Hemantha Kumar; D S Guru; P Nagabhushan

    2005-02-01

    When a document is scanned either mechanically or manually for digitization, it often suffers from some degree of skew or tilt. Skew-angle detection plays an important role in the field of document analysis systems and OCR in achieving the expected accuracy. In this paper, we consider skew estimation of Roman script. The method uses the boundary growing approach to extract the lowermost and uppermost coordinates of pixels of characters of text lines present in the document, which can be subjected to linear regression analysis (LRA) to determine the skew angle of a skewed document. Further, the proposed technique works fine for scaled text binary documents also. The technique works based on the assumption that the space between the text lines is greater than the space between the words and characters. Finally, in order to evaluate the performance of the proposed methodology we compare the experimental results with those of well-known existing methods.
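
    The regression step itself is straightforward to sketch: given the (x, y) coordinates of the lowermost pixels of characters along a text line, fit a straight line and take the arctangent of its slope as the skew angle. The coordinate extraction (the boundary-growing step in the paper) is assumed to have been done already; the data below are synthetic.

    import math
    import numpy as np

    def skew_angle_degrees(xs, ys):
        slope, _intercept = np.polyfit(xs, ys, deg=1)   # least-squares line through the pixel coordinates
        return math.degrees(math.atan(slope))

    # Synthetic lowermost-pixel coordinates for a text line tilted by about 2 degrees, plus noise.
    xs = np.arange(0, 500, 10)
    ys = np.tan(math.radians(2.0)) * xs + np.random.normal(0, 0.4, xs.size)
    print(f"estimated skew: {skew_angle_degrees(xs, ys):.2f} degrees")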

  15. Content analysis to detect high stress in oral interviews and text documents

    Science.gov (United States)

    Thirumalainambi, Rajkumar (Inventor); Jorgensen, Charles C. (Inventor)

    2012-01-01

    A system of interrogation to estimate whether a subject of interrogation is likely experiencing high stress, emotional volatility and/or internal conflict in the subject's responses to an interviewer's questions. The system applies one or more of four procedures, a first statistical analysis, a second statistical analysis, a third analysis and a heat map analysis, to identify one or more documents containing the subject's responses for which further examination is recommended. Words in the documents are characterized in terms of dimensions representing different classes of emotions and states of mind, in which the subject's responses that manifest high stress, emotional volatility and/or internal conflict are identified. A heat map visually displays the dimensions manifested by the subject's responses in different colors, textures, geometric shapes or other visually distinguishable indicia.

  16. A Solution to Reconstruct Cross-Cut Shredded Text Documents Based on Character Recognition and Genetic Algorithm

    Directory of Open Access Journals (Sweden)

    Hedong Xu

    2014-01-01

    Full Text Available The reconstruction of destroyed paper documents has attracted growing interest in recent years. This topic is relevant to the fields of forensics, investigative sciences, and archeology. Previous research and analysis on the reconstruction of cross-cut shredded text documents (RCCSTD) are mainly based on likelihood and traditional heuristic algorithms. In this paper, a feature-matching algorithm based on character recognition, via establishing a database of the letters, is presented, reconstructing the shredded document by row clustering, intrarow splicing, and interrow splicing. Row clustering is executed through a clustering algorithm according to the clustering vectors of the fragments. Intrarow splicing, regarded as a travelling salesman problem, is solved by an improved genetic algorithm. Finally, the document is reconstructed by interrow splicing according to the line spacing and the proximity of the fragments. Computational experiments suggest that the presented algorithm is of high precision and efficiency, and that it may be useful for different sizes of cross-cut shredded text documents.

  17. MeSH Up: Effective MeSH text classification for improved document retrieval

    NARCIS (Netherlands)

    Trieschnigg, D.; Pezik, P.; Lee, V.; Jong, F.de; Kraaij, W.; Rebholz-Schuhmann, D.

    2009-01-01

    Motivation: Controlled vocabularies such as the Medical Subject Headings (MeSH) thesaurus and the Gene Ontology (GO) provide an efficient way of accessing and organizing biomedical information by reducing the ambiguity inherent to free-text data. Different methods of automating the assignment of MeS

  18. MeSH Up: effective MeSH text classification for improved document retrieval

    NARCIS (Netherlands)

    Trieschnigg, Dolf; Pezik, Piotr; Lee, Vivian; Jong, de Franciska; Kraaij, Wessel; Rebholz-Schuhmann, Dietrich

    2009-01-01

    Motivation: Controlled vocabularies such as the Medical Subject Headings (MeSH) thesaurus and the Gene Ontology (GO) provide an efficient way of accessing and organizing biomedical information by reducing the ambiguity inherent to free-text data. Different methods of automating the assignment of MeS

  19. A Full-Text-Based Search Engine for Finding Highly Matched Documents Across Multiple Categories

    Science.gov (United States)

    Nguyen, Hung D.; Steele, Gynelle C.

    2016-01-01

    This report demonstrates a full-text-based search engine that works in any Web-based mobile application. The engine has the capability to search databases across multiple categories based on a user's queries and to identify the most relevant or similar documents. The search results presented here were found using an Android (Google Co.) mobile device; however, the engine is also compatible with other mobile phones.

  20. Farsi/Arabic Document Image Retrieval through Sub -Letter Shape Coding for mixed Farsi/Arabic and English text

    Directory of Open Access Journals (Sweden)

    Zahra Bahmani

    2011-09-01

    Full Text Available A recognition-free retrieval method for Farsi/Arabic documents is proposed in this paper. The system can be used on mixed Farsi/Arabic and English text. The method consists of preprocessing, word and sub_word extraction, detection and cancellation of sub_letter connectors, annotation of sub_letters by shape coding, classification of sub_letters by a decision tree, and an RBF neural network for sub_letter recognition. The proposed system retrieves document images through a new sub_letter shape coding scheme for Farsi/Arabic documents, in which document content is captured through sub_letter coding of words. The decision-tree-based classifier partitions the sub_letter space into a number of subregions by splitting it on one topological shape feature at a time. Topological shape features include height, width, holes, openings, valleys, jags, and sub_letter ascenders/descenders. Experimental results show the advantages of this method in Farsi/Arabic document image retrieval.

  1. Memoria documental en textos chilenos del período colonial (siglos XVI y XVII) (Documental memory in Chilean texts of the colonial period, sixteenth and seventeenth centuries)

    Directory of Open Access Journals (Sweden)

    Manuel Contreras Seitz

    2013-06-01

    Full Text Available This article explains the basic notions for assembling a diachronic documentary corpus that covers the Chilean colonial period, with particular emphasis on the sixteenth and seventeenth centuries. Methodological aspects of the preliminary critical edition of these documents are also discussed, concerning the paleographic transcription of the documents, the adaptation to specific philological norms, and the critical apparatus that must be implemented according to the intended readers, without setting aside historical and documentary rigor. Special mention is made of the lexical-semantic requirements for the edition of these documents, the problem of graphs and abbreviations, and the preliminary steps required to create an optical character recognition program for handwritten texts of the period.

  2. Progress Report on the ASCII for Science Data, Airborne and Geospatial Working Groups of the 2014 ESDSWG for MEaSUREs

    Science.gov (United States)

    Evans, K. D.; Krotkov, N. A.; Mattmann, C. A.; Boustani, M.; Law, E.; Conover, H.; Chen, G.; Olding, S. W.; Walter, J.

    2014-12-01

    The Earth Science Data Systems Working Groups (ESDSWG) were set up by NASA HQ 10 years ago. The role of the ESDSWG is to make recommendations relevant to NASA's Earth science data systems from users' experiences. Each group works independently, focusing on a unique topic. Participation in ESDSWG groups comes from a variety of NASA-funded science and technology projects, NASA information technology experts, affiliated contractor staff and other interested community members from academia and industry. Recommendations from the ESDSWG groups will enhance NASA's efforts to develop long-term data products. The ASCII for Science Data Working Group (WG) will define a minimum set of information that should be included in ASCII file headers so that users will be able to access the data using only the header information. After reviewing various use cases, such as field data and ASCII data exported from software tools, and reviewing ASCII data guidelines documentation, this WG will deliver guidelines for creating ASCII files that contain enough header information to allow the user to access the science data. The Airborne WG's goal is to improve airborne data access and use for NASA science. The first step is to evaluate the state of airborne data and make recommendations focusing on data delivery to the DAACs (data centers). The long-term goal is to improve airborne data use for Earth Science research. Many aircraft observations are reported in ASCII format. The ASCII and Airborne WGs seem like the same group, but the Airborne WG is concerned with maintaining and using airborne data for science research, not just the data format. The Geospatial WG's focus is on the interoperability issues of Geospatial Information System (GIS) and remotely sensed data, in particular focusing on DAAC(s) data from NASA's Earth Science Enterprise. This WG will provide a set of tools (GIS libraries) to use with training and/or cookbooks through the use of Open Source technologies. A progress
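
    The kind of self-describing ASCII header the working group discusses can be illustrated with a small example. The field names and layout below are hypothetical, not the group's final guidelines; the point is that the header alone carries enough metadata (variables, fill values, delimiter) to read the data block.

    from io import StringIO

    # Hypothetical file contents; the header field names are illustrative only.
    sample = StringIO("""
    # instrument: airborne lidar altimeter (example)
    # variables: time_utc, latitude_deg, longitude_deg, elevation_m
    # fill_value: -9999
    # delimiter: comma
    2014-06-01T12:00:00, 28.455, -80.528, 3.12
    2014-06-01T12:00:01, 28.456, -80.527, -9999
    """)

    header, rows = {}, []
    for raw in sample:
        line = raw.strip()
        if not line:
            continue
        if line.startswith("#"):                      # header: self-describing metadata
            key, _, value = line[1:].partition(":")
            header[key.strip()] = value.strip()
        else:                                         # data block, readable from the header alone
            rows.append([v.strip() for v in line.split(",")])

    print(header["variables"])
    print(rows)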

  3. Lidar Bathymetry Data of Cape Canaveral, Florida, (2014) in XYZ ASCII text file format

    Data.gov (United States)

    U.S. Geological Survey, Department of the Interior — The Cape Canaveral Coastal System (CCCS) is a prominent feature along the Southeast U.S. coastline and is the only large cape south of Cape Fear, North Carolina....

  4. PHYSICAL MODELLING OF TERRAIN DIRECTLY FROM SURFER GRID AND ARC/INFO ASCII DATA FORMATS#

    Directory of Open Access Journals (Sweden)

    Y.K. Modi

    2012-01-01

    Full Text Available

    ENGLISH ABSTRACT: Additive manufacturing technology is used to make physical models of terrain using GIS surface data. Attempts have been made to understand several other GIS file formats, such as the Surfer grid and the ARC/INFO ASCII grid. The surface of the terrain in these file formats has been converted into an STL file format that is suitable for additive manufacturing. The STL surface is converted into a 3D model by making the walls and the base. In this paper, the terrain modelling work has been extended to several other widely-used GIS file formats. Terrain models can be created in less time and at less cost, and intricate geometries of terrain can be created with ease and great accuracy.

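    Reading an ARC/INFO (ESRI) ASCII grid is the first step of the conversion described above: a short header (ncols, nrows, corner coordinates, cell size, no-data value) followed by rows of elevation values. A minimal parser is sketched below; the triangulation and STL writing are omitted, and the grid shown is a toy example.

    from io import StringIO

    def read_arc_ascii(fh):
        header = {}
        for _ in range(6):                                  # standard six-line ESRI ASCII header
            key, value = fh.readline().split()
            header[key.lower()] = float(value)
        nodata = header.get("nodata_value", -9999.0)
        points = []
        for row, line in enumerate(fh):                     # remaining lines: one grid row each, north to south
            for col, token in enumerate(line.split()):
                z = float(token)
                if z == nodata:
                    continue
                x = header["xllcorner"] + col * header["cellsize"]
                y = header["yllcorner"] + (header["nrows"] - 1 - row) * header["cellsize"]
                points.append((x, y, z))
        return header, points

    grid = StringIO("""ncols 3
    nrows 2
    xllcorner 0.0
    yllcorner 0.0
    cellsize 10.0
    NODATA_value -9999
    1.0 2.0 -9999
    3.0 4.0 5.0
    """)
    header, points = read_arc_ascii(grid)
    print(points)   # (x, y, z) vertices ready for surface triangulation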

  5. Is there still an unknown Freud? A note on the publications of Freud's texts and on unpublished documents.

    Science.gov (United States)

    Falzeder, Ernst

    2007-01-01

    This article presents an overview of the existing editions of what Freud wrote (works, letters, manuscripts and drafts, diaries and calendar notes, dedications and margin notes in books, case notes, and patient calendars) and what he is recorded as having said (minutes of meetings, interviews, memoirs of and interviews with patients, family members, and followers, and other quotes). There follows a short overview of biographies of Freud and other documentation on his life. It is concluded that a wealth of material is now available to Freud scholars, although more often than not this information is used in a biased and partisan way.

  6. Gridded bathymetry of French Frigate Shoals, Hawaii, USA - Arc ASCII format

    Data.gov (United States)

    National Oceanic and Atmospheric Administration, Department of Commerce — Gridded bathymetry (5m) of the shelf environment of French Frigate Shoals, Hawaii, USA. The ASCII includes multibeam bathymetry from the Simrad EM3002d, and Reson...

  7. CRED 20m Gridded bathymetry of Necker Island, Hawaii, USA (Arc ASCII format)

    Data.gov (United States)

    National Oceanic and Atmospheric Administration, Department of Commerce — Gridded bathymetry of the shelf and slope environments of Necker Island, Northwestern Hawaiian Islands, Hawaii, USA. This ASCII includes multibeam bathymetry from...

  8. CRED 5 m Gridded bathymetry of Brooks Banks, Hawaii, USA (Arc ASCII format)

    Data.gov (United States)

    National Oceanic and Atmospheric Administration, Department of Commerce — Gridded bathymetry (5m) of the shelf and slope environments of Brooks Banks, Hawaii, USA. The ASCII includes multibeam bathymetry from the Simrad EM300, Simrad...

  9. The Hong Kong Chinese University Document Retrieval Database——The Hong Kong Newspaper Full-text Database Projeet

    Institute of Scientific and Technical Information of China (English)

    MichaelM.Lee

    1994-01-01

    This project is to collect, organize, index and store full text and graphics of selected Chinese and English newspapers currently published in Hong Kong. The end product will be an electronic database available to researchers through a local area network, the Internet and dial-up access. News items from the previous day and up to six months old will be available for online searching, via keyword or subject. Earlier cumulated materials, along with the same indexing and searching software, will be archived to optical media (CD-ROM disks). As Hong Kong experiences rapid social, financial, commercial, political, educational and cultural changes, our state-of-the-art comprehensive coverage of local and regional newspapers will be a landmark contribution to information industries and researchers internationally. As the coverage of the database will be comprehensive and centralized, retrieval of news items from major Hong Kong newspapers will be fast and immediate. Users do not need to look through daily or bi-monthly indexes in order to go to the newspapers or cuttings to obtain a hard copy, and then bring it to a photocopier to copy. At this stage, we are hiring librarians, information specialists and support staff to work on this project. We have also met and are working with newspaper indexing and retrieval system developers in Beijing and Hong Kong to study cooperative systems to speed up the process. So far, we have received funding support from the Chinese University and the Hong Kong Government for two years. It is our plan to have a presentable sample database done by mid-1995, and to have several newspapers indexed and stored in a structure and format that allows easy migration to the eventual database system by the end of 1996.

  10. «Leaving aside Dante’s verses»?: A guided tour through the Studies, combining Texts, Biography and Documents

    Directory of Open Access Journals (Sweden)

    Giuliano Milani

    2014-11-01

    Full Text Available The article deals with the role that the documents gathered in the Codice diplomatico dantesco have played in Dante studies over the last century. The survey is divided into three phases: the first focused on the centenary of 1921 and the figure of Michele Barbi; the second phase, which was dominated by Gianfranco Contini, is located around the centenary of the poet's birth in 1965; the last one is focused on the most recent studies. Scholars' interest in the documents has been discontinuous; multiple factors influenced the turns in their attitudes, including the relations between the various fields of research and the often conflictual dialogue between the various generations of specialists. Starting from the new interest in the documentary sources that emerged in the most recent phase, the author calls for a new approach to the documents exploiting new tools, thus avoiding too much interference between the proper study of the documents and the self-narration Dante often offers in his writings.

  11. Combining position weight matrices and document-term matrix for efficient extraction of associations of methylated genes and diseases from free text.

    Directory of Open Access Journals (Sweden)

    Arwa Bin Raies

    Full Text Available BACKGROUND: In a number of diseases, certain genes are reported to be strongly methylated and thus can serve as diagnostic markers in many cases. Scientific literature in digital form is an important source of information about methylated genes implicated in particular diseases. The large volume of the electronic text makes it difficult and impractical to search for this information manually. METHODOLOGY: We developed a novel text mining methodology based on a new concept of position weight matrices (PWMs) for text representation and feature generation. We applied PWMs in conjunction with the document-term matrix to extract with high accuracy associations between methylated genes and diseases from free text. The performance results are based on large manually-classified data. Additionally, we developed a web-tool, DEMGD, which automates extraction of these associations from free text. DEMGD presents the extracted associations in summary tables and full reports in addition to evidence tagging of text with respect to genes, diseases and methylation words. The methodology we developed in this study can be applied to similar association extraction problems from free text. CONCLUSION: The new methodology developed in this study allows for efficient identification of associations between concepts. Our method applied to methylated genes in different diseases is implemented as a Web-tool, DEMGD, which is freely available at http://www.cbrc.kaust.edu.sa/demgd/. The data is available for online browsing and download.
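
    One simple way to read the position-weight-matrix idea in a text setting is to count which words occur at which offset from a mention of the gene of interest. The toy example below is only an illustration of that general idea, not the feature construction actually used by DEMGD.

    from collections import defaultdict

    sentences = [
        "promoter of BRCA1 is hypermethylated in breast cancer".split(),
        "BRCA1 is frequently methylated in ovarian cancer".split(),
    ]
    gene, window = "BRCA1", 3

    pwm = defaultdict(lambda: defaultdict(int))   # offset from gene mention -> word -> count
    for words in sentences:
        i = words.index(gene)
        for offset in range(-window, window + 1):
            if offset != 0 and 0 <= i + offset < len(words):
                pwm[offset][words[i + offset]] += 1

    for offset in sorted(pwm):
        print(offset, dict(pwm[offset]))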

  12. Oracle Text全文检索技术在文档资料管理中的应用%Application of Full-Text Search of Oracle Text in Documents Management

    Institute of Scientific and Technical Information of China (English)

    李培军; 毕于慧; 张权; 董玮

    2014-01-01

    Based on the full-text search capability of Oracle Text, this article first creates a keyword table according to the database's business logic; search efficiency is improved by building an index on this table. A document management system for multiple file types was then developed on the Visual C++ 6 platform using a client/server (C/S) architecture, to manage official documents efficiently.

  13. Combining Position Weight Matrices and Document-Term Matrix for Efficient Extraction of Associations of Methylated Genes and Diseases from Free Text

    KAUST Repository

    Bin Raies, Arwa

    2013-10-16

    Background:In a number of diseases, certain genes are reported to be strongly methylated and thus can serve as diagnostic markers in many cases. Scientific literature in digital form is an important source of information about methylated genes implicated in particular diseases. The large volume of the electronic text makes it difficult and impractical to search for this information manually.Methodology:We developed a novel text mining methodology based on a new concept of position weight matrices (PWMs) for text representation and feature generation. We applied PWMs in conjunction with the document-term matrix to extract with high accuracy associations between methylated genes and diseases from free text. The performance results are based on large manually-classified data. Additionally, we developed a web-tool, DEMGD, which automates extraction of these associations from free text. DEMGD presents the extracted associations in summary tables and full reports in addition to evidence tagging of text with respect to genes, diseases and methylation words. The methodology we developed in this study can be applied to similar association extraction problems from free text.Conclusion:The new methodology developed in this study allows for efficient identification of associations between concepts. Our method applied to methylated genes in different diseases is implemented as a Web-tool, DEMGD, which is freely available at http://www.cbrc.kaust.edu.sa/demgd/. The data is available for online browsing and download. © 2013 Bin Raies et al.

  14. Scholars in the Humanities Are Reluctant to Cite E-Texts as Primary Materials. A Review of: Sukovic, S. (2009). References to e-texts in academic publications. Journal of Documentation, 65(6), 997-1015.

    Directory of Open Access Journals (Sweden)

    Deena Yanofsky

    2011-03-01

    collections as well as ‘electronically born’ documents, works of art and popular culture artifacts. Of the 22 works resulting from the research projects examined during the study period, half did not cite e-texts as primary materials. The 11 works that made at least one reference to an e-text included 4 works in which the only reference was to e-texts created by the actual author. In total, only 7 works referred to e-texts created by outside authors. These 7 final works were written by 5 participants, representing 31 percent of the total number of study participants. Analysis of the participants' citation practices revealed that decisions to cite an electronic source or omit it from publication were based on two important factors: (1) the perceived trustworthiness of an e-text and (2) a sense of what was acceptable practice. Participants established trustworthiness through a process of verification. To confirm the authenticity and reliability of an e-text, most participants compared electronic documents against a print version to verify provenance, context, and details. Even when digitized materials were established as trustworthy sources, however, hard copies were often cited because they were considered more authoritative or accurate. Traditions of a particular discipline also had a strong influence on a participant's willingness to cite e-texts. Participants working on traditional historical topics were more reluctant to cite electronic resources, while researchers who worked on topics that explored relatively new fields were more willing to acknowledge the use of e-texts in published works. Traditional practices also influenced participants' decisions about how to cite materials. Some participants always cited original works in hard copy, regardless of electronic access, because it was accepted scholarly practice. Conclusions – The results of this study suggest that the small number of citations to electronic sources in publications in the humanities is directly

  15. CRED 20 m Gridded bathymetry of Brooks Banks and St. Rogatien Bank, Hawaii, USA (Arc ASCII format)

    Data.gov (United States)

    National Oceanic and Atmospheric Administration, Department of Commerce — Gridded bathymetry (20m) of the shelf and slope environments of Brooks Banks and St. Rogatien, Hawaii, USA. The ASCII includes multibeam bathymetry from the Simrad...

  16. Single-Beam Bathymetry Sounding Data of Cape Canaveral, Florida, (2014) in XYZ ASCII text file format

    Data.gov (United States)

    U.S. Geological Survey, Department of the Interior — The Cape Canaveral Coastal System (CCCS) is a prominent feature along the Southeast U.S. coastline, and is the only large cape south of Cape Fear, North Carolina....

  17. Text files of the navigation logged by the U.S. Geological Survey offshore of Fire Island, NY in 2011 (Geographic, WGS 84, HYPACK ASCII Text Files)

    Data.gov (United States)

    U.S. Geological Survey, Department of the Interior — The U.S. Geological Survey (USGS) mapped approximately 336 square kilometers of the lower shoreface and inner-continental shelf offshore of Fire Island, New York in...

  18. Position index preserving compression of text data

    OpenAIRE

    Akhtar, Nasim; Rashid, Mamunur; Islam, Shafiqul; Kashem, Mohammod Abul; Kolybanov, Cyrll Y.

    2011-01-01

    Data compression offers an attractive approach to reducing communication cost by using available bandwidth effectively. It also secures data during transmission through its encoded form. In this paper an index-based, position-oriented lossless text compression called PIPC (Position Index Preserving Compression) is developed. In PIPC the position of the input word is denoted by an ASCII code. The basic philosophy of the secure compression is to preprocess the text and transform it into some intermedia...
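
    A toy dictionary/index coder in the spirit of the abstract, where each word is replaced by a printable ASCII character encoding its dictionary index, can be sketched as follows. This illustrates index-based word coding only; it is not the published PIPC algorithm, and it handles at most as many distinct words as fit in the printable ASCII range.

    def encode(text, base=33):                      # 33 is the first printable ASCII code after space
        words = text.split()
        dictionary = sorted(set(words))
        index = {w: i for i, w in enumerate(dictionary)}
        stream = "".join(chr(base + index[w]) for w in words)   # one ASCII character per word
        return dictionary, stream

    def decode(dictionary, stream, base=33):
        return " ".join(dictionary[ord(c) - base] for c in stream)

    dictionary, stream = encode("to be or not to be")
    print(stream)                                   # '$!#"$!'
    print(decode(dictionary, stream))               # 'to be or not to be'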

  19. 一种大容量文本集的智能检索方法%Intelligent information retrieval approach for large-scale collections of full-text document

    Institute of Scientific and Technical Information of China (English)

    金小峰

    2011-01-01

    An information retrieval approach for large-scale collections of full-text documents is proposed, based on latent semantic model analysis and an investigation of latent-space text representation. The retrieval process is divided into a rough culling procedure that discards irrelevant full-text documents and a precise search procedure over the relevant ones. Irrelevant documents are removed by the first procedure; relevant full-text documents are retrieved at passage level by the second one, into which a genetic algorithm is introduced in order to achieve the best performance. Finally, the candidate passage indices are returned. The validity and high efficiency of the proposed method are shown by experimental results.

  20. 可全文检索的校园文档管理系统设计%The Design of Campus Full-text Search and Manage Document System

    Institute of Scientific and Technical Information of China (English)

    韩金松

    2013-01-01

    General search engines can only search web page content and cannot search the content of attached documents. This article focuses on a document content search method and, combined with the actual situation of the school, presents the design of a campus full-text document search and management system.

  1. A methodology for semiautomatic taxonomy of concepts extraction from nuclear scientific documents using text mining techniques; Metodologia para extracao semiautomatica de uma taxonomia de conceitos a partir da producao cientifica da area nuclear utilizando tecnicas de mineracao de textos

    Energy Technology Data Exchange (ETDEWEB)

    Braga, Fabiane dos Reis

    2013-07-01

    This thesis presents a text mining method for the semi-automatic extraction of a taxonomy of concepts from a textual corpus composed of scientific papers related to the nuclear area. Text classification is a natural human practice and a crucial task for working with large repositories. Document clustering techniques provide a logical and understandable framework that facilitates organization, browsing and searching. Most clustering algorithms use the bag-of-words model to represent the content of a document. This model generates a high dimensionality of the data, ignores the fact that different words can have the same meaning, and does not consider the relationships between them, assuming that words are independent of each other. The methodology combines a model for document representation by concepts with a hierarchical document clustering method based on the frequency of co-occurring concepts, and a technique for labeling clusters with their most representative concepts, with the objective of producing a taxonomy of concepts that reflects the structure of the knowledge domain. It is hoped that this work will contribute to the conceptual mapping of the scientific production of the nuclear area and thus support the management of research activities in this area. (author)
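
    The clustering step can be sketched with a small example: documents represented by the concepts they mention, a pairwise dissimilarity computed from concept co-occurrence, and an agglomerative merge tree cut into clusters. The concepts and the document-concept matrix below are toy data, and SciPy's hierarchical clustering stands in for the thesis's specific method.

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster
    from scipy.spatial.distance import pdist

    concepts = ["reactor", "neutron flux", "dosimetry", "radiation protection", "fuel cycle"]
    # rows = documents, columns = concepts (1 if the concept occurs in the document)
    doc_concepts = np.array([
        [1, 1, 0, 0, 1],
        [1, 1, 0, 0, 0],
        [0, 0, 1, 1, 0],
        [0, 1, 1, 1, 0],
    ])

    distances = pdist(doc_concepts, metric="jaccard")    # dissimilarity between documents
    tree = linkage(distances, method="average")          # hierarchical merge tree
    labels = fcluster(tree, t=2, criterion="maxclust")   # cut the tree into two clusters

    for doc, cluster in enumerate(labels):
        present = [c for c, flag in zip(concepts, doc_concepts[doc]) if flag]
        print(f"doc {doc} -> cluster {cluster}: {present}")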

  2. Native Language Processing using Exegy Text Miner

    Energy Technology Data Exchange (ETDEWEB)

    Compton, J

    2007-10-18

    Lawrence Livermore National Laboratory's New Architectures Testbed recently evaluated Exegy's Text Miner appliance to assess its applicability to high-performance, automated native language analysis. The evaluation was performed with support from the Computing Applications and Research Department in close collaboration with Global Security programs, and institutional activities in native language analysis. The Exegy Text Miner is a special-purpose device for detecting and flagging user-supplied patterns of characters, whether in streaming text or in collections of documents at very high rates. Patterns may consist of simple lists of words or complex expressions with sub-patterns linked by logical operators. These searches are accomplished through a combination of specialized hardware (i.e., one or more field-programmable gate arrays in addition to general-purpose processors) and proprietary software that exploits these individual components in an optimal manner (through parallelism and pipelining). For this application the Text Miner has performed accurately and reproducibly at high speeds approaching those documented by Exegy in its technical specifications. The Exegy Text Miner is primarily intended for the single-byte ASCII characters used in English, but at a technical level its capabilities are language-neutral and can be applied to multi-byte character sets such as those found in Arabic and Chinese. The system is used for searching databases or tracking streaming text with respect to one or more lexicons. In a real operational environment it is likely that data would need to be processed separately for each lexicon or search technique. However, the searches would be so fast that multiple passes should not be considered as a limitation a priori. Indeed, it is conceivable that large databases could be searched as often as necessary if new queries were deemed worthwhile. This project is concerned with evaluating the Exegy Text Miner installed in the

  3. Documenting the Earliest Chinese Journals

    Directory of Open Access Journals (Sweden)

    Jian-zhong (Joe Zhou

    2001-10-01

    Full Text Available

    Pages: 19-24

    According to various authoritative sources, the English word "journal" was first used in the 16th century, but the existence of the journal in its original meaning as a daily record can be traced back to the Acta Diurna (Daily Events) in ancient Roman cities as early as 59 B.C. This article documents the first appearance of Chinese daily records that were much earlier than 59 B.C.

    The evidence of the earlier Chinese daily records came from some important archaeological discoveries in the 1970's, but they were also documented by Sima Qian (145 B.C. - 85 B.C.), the grand historian of the Han Dynasty imperial court. Sima's lifetime contribution was the publication of Shi Ji (史記, The Grand Scribe's Records; the Records hereafter). The Records is a book of history of a grand scope. It encompasses all Chinese history from the 30th century B.C. through the end of the second century B.C. in 130 chapters and over 525,000 Chinese

  4. Using Text Documents from American Memory.

    Science.gov (United States)

    Singleton, Laurel R., Ed.

    2002-01-01

    This publication contains classroom-tested teaching ideas. For grades K-4, "'Blessed Ted-fred': Famous Fathers Write to Their Children" uses American Memory for primary source letters written by Theodore Roosevelt and Alexander Graham Bell to their children. For grades 5-8, "Found Poetry and the American Life Histories…

  5. Text Steganographic Approaches: A Comparison

    Directory of Open Access Journals (Sweden)

    Monika Agarwal

    2013-02-01

    Full Text Available This paper presents three novel approaches to text steganography. The first approach uses the theme of a missing-letter puzzle, where each character of the message is hidden by omitting one or more letters in a word of the cover. The average Jaro score was found to be 0.95, indicating close similarity between cover and stego file. The second approach hides a message in a wordlist where the ASCII value of the embedded character determines the length and starting letter of a word. The third approach conceals a message, without degrading the cover, by using the start and end letters of words of the cover. To enhance the security of the secret message, the message is scrambled using a one-time pad scheme before being concealed, and the cipher text is then concealed in the cover. We also present an empirical comparison of the proposed approaches with some of the popular text steganographic approaches and show that our approaches outperform the existing approaches.
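
    The second approach, where the ASCII value of each secret character determines the length and the starting letter of a cover word, can be illustrated with a toy encoder/decoder. Real implementations would draw genuine words from a dictionary; here filler letters are used, so this is a sketch of the mapping only, not the authors' scheme.

    def hide(message):
        words = []
        for ch in message:
            v = ord(ch)                       # assumes printable ASCII (32..126)
            first = chr(ord('a') + v % 26)    # starting letter encodes v mod 26
            length = v // 26 + 1              # word length encodes v div 26
            words.append(first + 'x' * (length - 1))
        return " ".join(words)

    def reveal(cover):
        chars = []
        for word in cover.split():
            v = (len(word) - 1) * 26 + (ord(word[0]) - ord('a'))
            chars.append(chr(v))
        return "".join(chars)

    stego = hide("Hi")
    print(stego)            # 'uxx bxxxx'
    print(reveal(stego))    # 'Hi'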

  6. Research on Model of Content Contrast of Standard Documents Based on Text Classification%基于文本分类的标准文献内容比对模型研究

    Institute of Scientific and Technical Information of China (English)

    刘嘉谊; 刘高勇

    2015-01-01

    Based on an analysis of the structure of standard documents and of text classification, this paper puts forward a model for content comparison of standard documents, in order to realize rapid extraction and automatic classification of standard documents, support easy and quick standards comparison work by technical personnel and enterprises, and provide methods and strategies for the sustainable development of standards comparison work.

  7. Locations and analysis of sediment samples collected offshore of Massachusetts within Northern Cape Cod Bay(CCB_SedSamples Esri Shapefile, and ASCII text format, WGS84)

    Data.gov (United States)

    U.S. Geological Survey, Department of the Interior — These data were collected under a cooperative agreement with the Massachusetts Office of Coastal Zone Management (CZM) and the U.S. Geological Survey (USGS), Coastal...

  8. De que modo os textos oficiais prescrevem o trabalho do professor? Análise comparativa de documentos brasileiros e genebrinos How do official texts prescribe the teacher's work? A comparative analysis of the brazilian and genebrian Documents

    Directory of Open Access Journals (Sweden)

    Anna Rachel Machado

    2005-12-01

    Full Text Available In this article we present the results of analyses of two documents produced by official agencies that aim at guiding teachers' work in Brazil and in Switzerland. On the one hand, we focused on detecting the textualization features used to prescribe the teacher's work. Results show that, besides the common features of prescriptive texts (enunciator's erasure, felicity contract, etc.), these documents carry a more complex thematic structure, articulating a prescriptive doing, a source-doing and a prescribed-doing. We have also tried to identify the forms of building the object of prescription, which

  9. Exploiting Document Level Semantics in Document Clustering

    Directory of Open Access Journals (Sweden)

    Muhammad Rafi

    2016-06-01

    Full Text Available Document clustering is an unsupervised machine learning method that separates a large, subject-heterogeneous collection (corpus) into smaller, more manageable, subject-homogeneous collections (clusters). Traditional methods of document clustering work by extracting textual features like terms, sequences, and phrases from documents. These features are independent of each other and do not capture the meaning behind the words in the clustering process. In order to perform semantically viable clustering, we believe that the problem of document clustering has two main components: (1) to represent the document in a form that inherently captures the semantics of the text, which may also help to reduce the dimensionality of the document, and (2) to define a similarity measure based on lexical, syntactic and semantic features such that it assigns higher numerical values to document pairs which have a stronger syntactic and semantic relationship. In this paper, we propose a representation of documents by extracting three different types of features from a given document: lexical, syntactic and semantic features. A meta-descriptor for each document is proposed using these three features: first lexical, then syntactic and last semantic. A document-to-document similarity matrix is produced where each entry contains a three-value vector for the lexical, syntactic and semantic components. The main contributions of this research are: (i) a document-level descriptor using three different features for text (lexical, syntactic and semantic); (ii) a similarity function using these three; and (iii) a new candidate clustering algorithm using the three components of the similarity measure to guide the clustering process in a direction that produces more semantically rich clusters. We performed an extensive series of experiments on standard text mining data sets with external clustering evaluations like F-measure and Purity, and have obtained
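
    The three-component similarity can be sketched as a weighted combination of per-pair scores. Below, word overlap stands in for the lexical component, bigram overlap for the syntactic component, and a tiny synonym table for the semantic component; the weights and the stand-ins are illustrative only, not the paper's exact features.

    SYNONYMS = {"car": "automobile", "automobile": "car"}   # toy stand-in for a semantic resource

    def tokens(text):
        return text.lower().split()

    def jaccard(a, b):
        a, b = set(a), set(b)
        return len(a & b) / len(a | b) if a | b else 0.0

    def semantic_expand(words):
        return set(words) | {SYNONYMS[w] for w in words if w in SYNONYMS}

    def similarity(doc1, doc2, weights=(0.4, 0.2, 0.4)):
        t1, t2 = tokens(doc1), tokens(doc2)
        lexical = jaccard(t1, t2)                                        # word overlap
        syntactic = jaccard(list(zip(t1, t1[1:])), list(zip(t2, t2[1:])))  # bigram overlap
        semantic = jaccard(semantic_expand(t1), semantic_expand(t2))     # synonym-expanded overlap
        w1, w2, w3 = weights
        return w1 * lexical + w2 * syntactic + w3 * semantic

    print(similarity("the car broke down", "the automobile broke down"))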

  10. Text Mining.

    Science.gov (United States)

    Trybula, Walter J.

    1999-01-01

    Reviews the state of research in text mining, focusing on newer developments. The intent is to describe the disparate investigations currently included under the term text mining and provide a cohesive structure for these efforts. A summary of research identifies key organizations responsible for pushing the development of text mining. A section…

  11. School Survey on Crime and Safety (SSOCS) 2000 Public-Use Data Files, User's Manual, and Detailed Data Documentation. [CD-ROM].

    Science.gov (United States)

    National Center for Education Statistics (ED), Washington, DC.

    This CD-ROM contains the raw, public-use data from the 2000 School Survey on Crime and Safety (SSOCS) along with a User's Manual and Detailed Data Documentation. The data are provided in SAS, SPSS, STATA, and ASCII formats. The User's Manual and the Detailed Data Documentation are provided as .pdf files. (Author)

  12. Text Mining: (Asynchronous Sequences

    Directory of Open Access Journals (Sweden)

    Sheema Khan

    2014-12-01

    Full Text Available In this paper we try to correlate text sequences that share common topics as semantic clues. We propose a two-step method for asynchronous text mining. Step one checks for the common topics in the sequences and isolates them with their timestamps. Step two takes a topic and tries to assign a timestamp to the text document. After multiple repetitions of step two, we can give an optimal result.

  13. Documenting localities

    CERN Document Server

    Cox, Richard J

    1996-01-01

    Now in paperback! Documenting Localities is the first effort to summarize the past decade of renewed discussion about archival appraisal theory and methodology and to provide a practical guide for the documentation of localities.This book discusses the continuing importance of the locality in American historical research and archival practice, traditional methods archivists have used to document localities, and case studies in documenting localities. These chapters draw on a wide range of writings from archivists, historians, material culture specialists, historic preservationists

  14. Text Classification using Artificial Intelligence

    CERN Document Server

    Kamruzzaman, S M

    2010-01-01

    Text classification is the process of classifying documents into predefined categories based on their content. It is the automated assignment of natural language texts to predefined categories. Text classification is the primary requirement of text retrieval systems, which retrieve texts in response to a user query, and text understanding systems, which transform text in some way such as producing summaries, answering questions or extracting data. Existing supervised learning algorithms for classifying text need sufficient documents to learn accurately. This paper presents a new algorithm for text classification using an artificial intelligence technique that requires fewer documents for training. Instead of using words, word relations, i.e. association rules from these words, are used to derive a feature set from pre-classified text documents. The concept of a naïve Bayes classifier is then used on the derived features, and finally only a single concept of a genetic algorithm has been added for final classification. A syste...
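
    A baseline along these lines, word-pair features fed to a multinomial naive Bayes classifier, can be sketched with scikit-learn. The bigram features below stand in for the association rules mentioned in the abstract, the genetic-algorithm step is not reproduced, and the training texts are toy examples.

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB

    train_texts = [
        "interest rates rise as inflation grows",
        "central bank cuts interest rates",
        "team wins the championship final",
        "player scores twice in the final match",
    ]
    train_labels = ["finance", "finance", "sports", "sports"]

    # ngram_range=(2, 2) builds word-pair (bigram) features.
    vectorizer = CountVectorizer(ngram_range=(2, 2))
    X = vectorizer.fit_transform(train_texts)

    classifier = MultinomialNB()
    classifier.fit(X, train_labels)

    test = vectorizer.transform(["bank raises interest rates again"])
    print(classifier.predict(test))   # expected: ['finance']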

  15. Text Classification using Data Mining

    CERN Document Server

    Kamruzzaman, S M; Hasan, Ahmed Ryadh

    2010-01-01

    Text classification is the process of classifying documents into predefined categories based on their content. It is the automated assignment of natural language texts to predefined categories. Text classification is the primary requirement of text retrieval systems, which retrieve texts in response to a user query, and text understanding systems, which transform text in some way such as producing summaries, answering questions or extracting data. Existing supervised learning algorithms to automatically classify text need sufficient documents to learn accurately. This paper presents a new algorithm for text classification using data mining that requires fewer documents for training. Instead of using words, word relation i.e. association rules from these words is used to derive feature set from pre-classified text documents. The concept of Naive Bayes classifier is then used on derived features and finally only a single concept of Genetic Algorithm has been added for final classification. A system based on the...

  16. Termination Documentation

    Science.gov (United States)

    Duncan, Mike; Hill, Jillian

    2014-01-01

    In this study, we examined 11 workplaces to determine how they handle termination documentation, an empirically unexplored area in technical communication and rhetoric. We found that the use of termination documentation is context dependent while following a basic pattern of infraction, investigation, intervention, and termination. Furthermore,…

  17. Emotion Detection from Text

    CERN Document Server

    Shivhare, Shiv Naresh

    2012-01-01

    Emotion can be expressed in many observable ways, such as facial expressions and gestures, speech, and written text. Emotion detection in text documents is essentially a content-based classification problem involving concepts from the domains of natural language processing as well as machine learning. In this paper, emotion recognition based on textual data and the techniques used in emotion detection are discussed.

  18. Maury Documentation

    Data.gov (United States)

    National Oceanic and Atmospheric Administration, Department of Commerce — Supporting documentation for the Maury Collection of marine observations. Includes explanations from Maury himself, as well as guides and descriptions by the U.S....

  19. Quality text editing

    Directory of Open Access Journals (Sweden)

    Gyöngyi Bujdosó

    2009-10-01

    Full Text Available Text editing is more than the knowledge of word processing techniques. Originally, typographers, printers, and text editors were the ones qualified to edit texts, which were well structured, legible, easily understandable, clear, and able to emphasize the core of the text. Times have changed, and nowadays everyone has access to computers as well as to text editing software, and most users believe that having these tools is enough to edit texts. However, text editing requires more skills. Texts appearing either in printed or in electronic form reveal that most users do not realize that they are not qualified to edit and publish their works. Analyzing the 'text products' of the last decade, a tendency can clearly be drawn: more and more documents appear which, instead of emphasizing the subject matter, are lost in a maze of unstructured text slices. Without further thought, different font types, colors, sizes, strange arrangements of objects, etc. are applied. We present examples of the most common typographic and text editing errors. Our aim is to call attention to these mistakes and persuade users to spend time educating themselves in text editing. They have to realize that a well-structured text is able to strengthen its effect on the reader, so that the original message will reach the target group.

  20. 2005-004-FA_HYPACK: Text files of the Wide Area Augmentation System (WAAS) navigation collected by the U.S. Geological Survey in Moultonborough Bay, Lake Winnipesaukee, New Hampshire in 2005 (Geographic, WGS 84, HYPACK ASCII Text Files)

    Data.gov (United States)

    U.S. Geological Survey, Department of the Interior — In freshwater bodies of New Hampshire, the most problematic aquatic invasive plant species is Myriophyllum heterophyllum or variable leaf water-milfoil. Once...

  1. About CABI Full Text

    Institute of Scientific and Technical Information of China (English)

    2012-01-01

    Centre for Agriculture and Bioscience International (CABI) is a not-for-profit international Agricultural Information Institute with headquarters in Britain. It aims to improve people’s lives by providing information and applying scientific expertise to solve problems in agriculture and the environment. CABI Full-text is one of the publishing products of CABI. CABI’s full text repository is growing rapidly and has now been integrated into all our databases including CAB Abstracts, Global Health, our Internet Resources and Abstract Journals. There are currently over 60,000 full text articles available to access. These documents, made possible by agreement with third

  2. Documentation Service; Service de Documentation

    Energy Technology Data Exchange (ETDEWEB)

    Charnay, J.; Chosson, L.; Croize, M.; Ducloux, A.; Flores, S.; Jarroux, D.; Melka, J.; Morgue, D.; Mottin, C. [Inst. de Physique Nucleaire, Lyon-1 Univ., 69 - Villeurbanne (France)

    1998-12-31

    This service assures the treatment and diffusion of scientific information and the management of the scientific production of the institute, as well as the secretariat operation for the groups and services of the institute. The report on the documentation-library section mentions: the management of the documentation holdings, searches in international databases (INIS, Current Contents, Inspec), and the Pret-Inter service which allows documents to be accessed through the DEMOCRITE network of IN2P3. Also mentioned as accomplishments are: the setup of a video and photo database, the Web home page of the institute's library, the follow-up of digitizing the document holdings by integrating CD-ROMs and diskettes, electronic archiving of the scientific production, etc. 1 fig.

  3. Mining the Text: 34 Text Features that Can Ease or Obstruct Text Comprehension and Use

    Science.gov (United States)

    White, Sheida

    2012-01-01

    This article presents 34 characteristics of texts and tasks ("text features") that can make continuous (prose), noncontinuous (document), and quantitative texts easier or more difficult for adolescents and adults to comprehend and use. The text features were identified by examining the assessment tasks and associated texts in the national…

  4. Visualization Guided Document Reading by Citation and Text Summarization

    Institute of Scientific and Technical Information of China (English)

    张加万; 杨思琪; 李泽宇; 杨伟强; 王锦东; 贺瑞芳; 黄茂林

    2016-01-01

    With the growing volume of publications in recent years, researchers have to read much more literature. Therefore, how to read a scientific article in an efficient way becomes an important issue. When reading an article, it is necessary to read its references in order to get a better understanding. However, how to differentiate between relevant and non-relevant references, and how to stay on topic in a large document collection, are still challenging tasks. This paper presents GUDOR (GUided DOcument Reader), a visualization-guided reader based on citation and summarization. It (1) extracts the important sentences from a scientific article with an objective-based summarization technique and visualizes the extraction results by a multi-resolution method; (2) identifies the main topics of the references with an LDA (Latent Dirichlet Allocation) model; and (3) tracks the user's reading behavior to keep him or her focused on the reading objective. In addition, the paper describes the functions and operations of the system in a usage scenario and validates its applicability by a user study.
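
    The topic-identification step mentioned in point (2) can be illustrated with an ordinary LDA fit over reference texts. The following is a generic sketch assuming scikit-learn, with a placeholder corpus and parameter choices; it is not the GUDOR code.

```python
# Minimal sketch (not the GUDOR implementation): extracting reference topics
# with LDA. Corpus and parameters are placeholders chosen only to make the
# example self-contained; recent scikit-learn is assumed.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

references = [
    "graph layout algorithms for citation network visualization",
    "topic models for scientific literature analysis",
    "latent dirichlet allocation applied to document collections",
    "interactive visualization of large document corpora",
]

vec = CountVectorizer(stop_words="english")
X = vec.fit_transform(references)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(X)

terms = vec.get_feature_names_out()
for k, weights in enumerate(lda.components_):
    top = [terms[i] for i in weights.argsort()[::-1][:4]]
    print(f"topic {k}: {', '.join(top)}")   # top terms characterizing each topic
```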

  5. Performance Documentation.

    Science.gov (United States)

    Foster, Paula

    2002-01-01

    Presents an interview with experts on performance documentation. Suggests that educators should strive to represent performance appraisal writing to students in a way that reflects the way it is perceived and evaluated in the workplace. Concludes that educators can enrich their pedagogy with practice by helping students understand the importance…

  6. Documenting Spreadsheets

    CERN Document Server

    Payette, Raymond

    2008-01-01

    This paper discusses spreadsheet documentation and new means to achieve this end by using Excel's built-in "Comment" function. By structuring comments, they can be used as an essential tool to fully explain a spreadsheet. This will greatly facilitate spreadsheet change control, risk management and auditing. It will fill a crucial gap in corporate governance by adding essential information that can be managed in order to satisfy internal controls and accountability standards.

  7. CMS DOCUMENTATION

    CERN Multimedia

    CMS TALKS AT MAJOR MEETINGS The agenda and talks from major CMS meetings can now be electronically accessed from the iCMS Web site. The following items can be found on: http://cms.cern.ch/iCMS/ General - CMS Weeks (Collaboration Meetings), CMS Weeks Agendas The talks presented at the Plenary Sessions. LHC Symposiums Management - CB - MB - FB - FMC Agendas and minutes are accessible to CMS members through their AFS account (ZH). However some linked documents are restricted to the Board Members. FB documents are only accessible to FB members. LHCC The talks presented at the ‘CMS Meetings with LHCC Referees’ are available on request from the PM or MB Country Representative. Annual Reviews The talks presented at the 2006 Annual reviews are posted. CMS DOCUMENTS It is considered useful to establish information on the first employment of CMS doctoral students upon completion of their theses. Therefore it is requested that Ph.D students inform the CMS Secretariat about the na...

  8. CMS DOCUMENTATION

    CERN Multimedia

    CMS TALKS AT MAJOR MEETINGS The agenda and talks from major CMS meetings can now be electronically accessed from the iCMS Web site. The following items can be found on: http://cms.cern.ch/iCMS/ Management- CMS Weeks (Collaboration Meetings), CMS Weeks Agendas The talks presented at the Plenary Sessions. Management - CB - MB - FB Agendas and minutes are accessible to CMS members through their AFS account (ZH). However some linked documents are restricted to the Board Members. FB documents are only accessible to FB members. LHCC The talks presented at the ‘CMS Meetings with LHCC Referees’ are available on request from the PM or MB Country Representative. Annual Reviews The talks presented at the 2007 Annual reviews are posted. CMS DOCUMENTS It is considered useful to establish information on the first employment of CMS doctoral students upon completion of their theses. Therefore it is requested that Ph.D students inform the CMS Secretariat about the nature of empl...

  9. CMS DOCUMENTATION

    CERN Multimedia

    CMS TALKS AT MAJOR MEETINGS The agenda and talks from major CMS meetings can now be electronically accessed from the iCMS Web site. The following items can be found on: http://cms.cern.ch/iCMS/ Management- CMS Weeks (Collaboration Meetings), CMS Weeks Agendas The talks presented at the Plenary Sessions. Management - CB - MB - FB Agendas and minutes are accessible to CMS members through their AFS account (ZH). However some linked documents are restricted to the Board Members. FB documents are only accessible to FB members. LHCC The talks presented at the ‘CMS Meetings with LHCC Referees’ are available on request from the PM or MB Country Representative. Annual Reviews The talks presented at the 2007 Annual reviews are posted. CMS DOCUMENTS It is considered useful to establish information on the first employment of CMS doctoral students upon completion of their theses. Therefore it is requested that Ph.D students inform the CMS Secretariat about the nature of employment and ...

  10. CMS DOCUMENTATION

    CERN Multimedia

    CMS TALKS AT MAJOR MEETINGS The agenda and talks from major CMS meetings can now be electronically accessed from the iCMS Web site. The following items can be found on: http://cms.cern.ch/iCMS/ General - CMS Weeks (Collaboration Meetings), CMS Weeks Agendas The talks presented at the Plenary Sessions. LHC Symposiums Management - CB - MB - FB - FMC Agendas and minutes are accessible to CMS members through their AFS account (ZH). However some linked documents are restricted to the Board Members. FB documents are only accessible to FB members. LHCC The talks presented at the ‘CMS Meetings with LHCC Referees’ are available on request from the PM or MB Country Representative. Annual Reviews The talks presented at the 2006 Annual reviews are posted. CMS DOCUMENTS It is considered useful to establish information on the first employment of CMS doctoral students upon completion of their theses. Therefore it is requested that Ph.D students inform the CMS Secretariat about the natu...

  11. CMS DOCUMENTATION

    CERN Multimedia

    CMS TALKS AT MAJOR MEETINGS The agenda and talks from major CMS meetings can now be electronically accessed from the iCMS Web site. The following items can be found on: http://cms.cern.ch/iCMS/ General - CMS Weeks (Collaboration Meetings), CMS Weeks Agendas The talks presented at the Plenary Sessions. LHC Symposiums Management - CB - MB - FB - FMC Agendas and minutes are accessible to CMS members through their AFS account (ZH). However some linked documents are restricted to the Board Members. FB documents are only accessible to FB members. LHCC The talks presented at the ‘CMS Meetings with LHCC Referees’ are available on request from the PM or MB Country Representative. Annual Reviews The talks presented at the 2006 Annual reviews are posted. CMS DOCUMENTS It is considered useful to establish information on the first employment of CMS doctoral students upon completion of their theses. Therefore it is requested that Ph.D students inform the CMS Secretariat about the natur...

  12. CMS DOCUMENTATION

    CERN Multimedia

    CMS TALKS AT MAJOR MEETINGS The agenda and talks from major CMS meetings can now be electronically accessed from the iCMS Web site. The following items can be found on: http://cms.cern.ch/iCMS/ General - CMS Weeks (Collaboration Meetings), CMS Weeks Agendas The talks presented at the Plenary Sessions. LHC Symposiums Management - CB - MB - FB - FMC Agendas and minutes are accessible to CMS members through their AFS account (ZH). However some linked documents are restricted to the Board Members. FB documents are only accessible to FB members. LHCC The talks presented at the ‘CMS Meetings with LHCC Referees’ are available on request from the PM or MB Country Representative. Annual Reviews The talks presented at the 2006 Annual reviews are posted.   CMS DOCUMENTS It is considered useful to establish information on the first employment of CMS doctoral students upon completion of their theses. Therefore it is requested that Ph.D students inform the CMS Secretariat a...

  13. Text Association Analysis and Ambiguity in Text Mining

    Science.gov (United States)

    Bhonde, S. B.; Paikrao, R. L.; Rahane, K. U.

    2010-11-01

    Text Mining is the process of analyzing a semantically rich document or set of documents to understand the content and meaning of the information they contain. Research in Text Mining will enhance humans' ability to process massive quantities of information, and it has high commercial value. Firstly, the paper introduces text mining (TM) and its definition, and then gives an overview of the text mining process and its applications. Up to now, not much research in text mining, especially in concept/entity extraction, has focused on the ambiguity problem. This paper addresses ambiguity issues in natural language texts, and presents a new technique for resolving the ambiguity problem in extracting concepts/entities from texts. In the end, it shows the importance of TM in knowledge discovery and highlights the upcoming challenges of document mining and the opportunities it offers.

  14. Clustering Text Data Streams

    Institute of Scientific and Technical Information of China (English)

    Yu-Bao Liu; Jia-Rong Cai; Jian Yin; Ada Wai-Chee Fu

    2008-01-01

    Clustering text data streams is an important issue in the data mining community and has a number of applications such as newsgroup filtering, text crawling, document organization and topic detection and tracing. However, most methods are similarity-based approaches that only use the TF-IDF scheme to represent the semantics of text data, and they often lead to poor clustering quality. Recently, researchers have argued that the semantic smoothing model is more effective than the existing TF-IDF scheme for improving text clustering quality. However, the existing semantic smoothing model is not suitable for a dynamic text data context. In this paper, we first extend the semantic smoothing model to the text data stream context. Based on the extended model, we then present two online clustering algorithms, OCTS and OCTSM, for the clustering of massive text data streams. In both algorithms, we also present a new cluster statistics structure named the cluster profile, which can capture the semantics of text data streams dynamically and at the same time speed up the clustering process. Some efficient implementations of our algorithms are also given. Finally, we present a series of experimental results illustrating the effectiveness of our technique.
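
    The flavor of such online clustering can be conveyed with a much simpler sketch than OCTS/OCTSM: plain term-frequency "profiles" are updated as documents stream in, with cosine similarity deciding whether a document joins an existing cluster. The threshold, the toy stream and the profile structure are invented for illustration and do not include the semantic smoothing model.

```python
# Rough illustration (not OCTS/OCTSM): online clustering of a text stream with
# simple term-frequency vectors and cosine similarity; "cluster profiles" here
# are just running term-frequency sums. All names and thresholds are assumptions.
from collections import Counter
import math

def cosine(a, b):
    dot = sum(a[t] * b.get(t, 0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

profiles = []            # one Counter per cluster (its "profile")
THRESHOLD = 0.3

def process(doc):
    tf = Counter(doc.lower().split())
    best, best_sim = None, 0.0
    for i, prof in enumerate(profiles):
        sim = cosine(tf, prof)
        if sim > best_sim:
            best, best_sim = i, sim
    if best is not None and best_sim >= THRESHOLD:
        profiles[best].update(tf)        # fold the document into the profile
        return best
    profiles.append(tf)                  # start a new cluster
    return len(profiles) - 1

stream = ["stock markets rally on earnings",
          "markets fall as earnings disappoint",
          "new vaccine shows promising trial results"]
print([process(d) for d in stream])      # e.g. [0, 0, 1]
```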

  15. Segmentation of complex document

    Directory of Open Access Journals (Sweden)

    Souad Oudjemia

    2014-06-01

    Full Text Available In this paper we present a method for the segmentation of document images with complex structure. This technique, based on the GLCM (Grey Level Co-occurrence Matrix), is used to segment this type of document into three regions, namely 'graphics', 'background' and 'text'. Very briefly, the method divides the document image into blocks of a size chosen after a series of tests, and then applies the co-occurrence matrix to each block in order to extract five textural parameters: energy, entropy, sum entropy, difference entropy and standard deviation. These parameters are then used to classify the image into three regions using the k-means algorithm; the last step of segmentation is obtained by grouping connected pixels. Two performance measurements are performed for both the graphics and text zones; we obtained a classification rate of 98.3% and a misclassification rate of 1.79%.
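
    A block-wise GLCM feature extraction followed by k-means, in the spirit of the description above, might look like the sketch below. It assumes recent scikit-image and scikit-learn, uses a synthetic page, and computes only three illustrative features (energy, entropy, standard deviation); the block size and GLCM settings are placeholder choices, not the authors' values.

```python
# Hedged sketch of block-wise GLCM features + k-means (not the authors' code).
import numpy as np
from skimage.feature import graycomatrix, graycoprops
from sklearn.cluster import KMeans

def block_features(image, block=32):
    feats, positions = [], []
    for y in range(0, image.shape[0] - block + 1, block):
        for x in range(0, image.shape[1] - block + 1, block):
            patch = image[y:y + block, x:x + block]
            glcm = graycomatrix(patch, distances=[1], angles=[0],
                                levels=256, symmetric=True, normed=True)
            p = glcm[:, :, 0, 0]
            entropy = -np.sum(p[p > 0] * np.log2(p[p > 0]))
            feats.append([graycoprops(glcm, "energy")[0, 0],
                          entropy,
                          patch.std()])
            positions.append((y, x))
    return np.array(feats), positions

# Example with a synthetic grayscale "page"; a real run would load a scan.
page = np.random.randint(0, 256, (256, 256), dtype=np.uint8)
X, pos = block_features(page)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print(dict(zip(pos[:5], labels[:5])))   # block position -> region label
```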

  16. Log ASCII Standard (LAS) Files for Geophysical (Gamma Ray) Wireline Well Logs and Their Application to Geologic Cross Section C-C' Through the Central Appalachian Basin

    Science.gov (United States)

    Trippi, Michael H.; Crangle, Robert D.

    2009-01-01

    U.S. Geological Survey (USGS) regional geologic cross section C-C' (Ryder and others, 2008) displays key stratigraphic intervals in the central Appalachian basin. For this cross section, strata were correlated by using descriptions of well cuttings and gamma ray well log traces. This report summarizes the procedures used to convert gamma ray curves on paper well logs to the digital Log ASCII (American Standard Code for Information Interchange) Standard (LAS) format using the third-party software application Neuralog. The procedures could be used with other geophysical wireline logs also. The creation of digital LAS files from paper well logs by using Neuralog is very helpful, especially when dealing with older logs with limited or nonexistent digital data. The LAS files from the gamma ray logs of 11 wells used to construct cross section C-C' are included in this report. They may be downloaded from the index page as a single ZIP file.
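
    Because LAS is a plain ASCII format, even a few lines of code can pull the curve mnemonics and data samples out of a digitized log. The reader below is a deliberately minimal, hypothetical sketch (file name and section handling are simplified); production work would normally rely on a maintained LAS library such as lasio.

```python
# Deliberately minimal, hypothetical LAS reader for illustration only: it pulls
# curve mnemonics from the ~Curve section and numeric rows from the ~ASCII data
# section. Real projects would normally use a dedicated library (e.g. lasio).
def read_las(path):
    curves, rows, section = [], [], None
    with open(path) as fh:
        for raw in fh:
            line = raw.strip()
            if not line or line.startswith("#"):
                continue                       # skip blanks and comments
            if line.startswith("~"):
                section = line[1].upper() if len(line) > 1 else None  # 'C' = curves, 'A' = data
                continue
            if section == "C":
                curves.append(line.split(".")[0].strip())   # mnemonic before the dot
            elif section == "A":
                rows.append([float(v) for v in line.split()])
    return curves, rows

# Hypothetical usage with an assumed file name:
# curves, rows = read_las("well_log.las")
# print(curves)     # e.g. ['DEPT', 'GR']
# print(rows[0])    # first depth sample across all curves
```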

  17. Secure Copier Which Allows Reuse Copied Documents with Sorting Capability in Accordance with Document Types

    Directory of Open Access Journals (Sweden)

    Kohei Arai

    2013-09-01

    Full Text Available A secure copy machine is proposed which allows copied documents to be reused, with a sorting capability according to document type. Through experiments with a variety of document types, it is found that copied documents can be shared and stored securely in a database according to automatically classified document types. The copied documents are protected by data hiding based on wavelet Multi-Resolution Analysis (MRA).
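
    Wavelet-domain data hiding of the general kind mentioned above (not the paper's exact scheme) can be sketched with a Haar DWT: a bit pattern is added to the detail coefficients and recovered by re-running the transform. PyWavelets and NumPy are assumed, and the embedding strength and synthetic image are placeholders.

```python
# Toy illustration of wavelet-domain data hiding (not the paper's scheme).
import numpy as np
import pywt

image = np.random.randint(0, 256, (64, 64)).astype(np.float64)  # stand-in for a scanned page
mark = np.random.randint(0, 2, (32, 32))                        # hidden bit pattern

cA, (cH, cV, cD) = pywt.dwt2(image, "haar")
cD_marked = cD + 2.0 * mark                                # weak additive embedding
stego = pywt.idwt2((cA, (cH, cV, cD_marked)), "haar")

# Detection: re-run the DWT and compare the detail band with the original.
_, (_, _, cD_recovered) = pywt.dwt2(stego, "haar")
recovered = (cD_recovered - cD) > 1.0
print((recovered == mark.astype(bool)).mean())             # ~1.0 if the bits survive
```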

  18. CMS DOCUMENTATION

    CERN Multimedia

    CMS TALKS AT MAJOR MEETINGS The agenda and talks from major CMS meetings can now be electronically accessed from the ICMS Web site. The following items can be found on: http://cms.cern.ch/iCMS Management – CMS Weeks (Collaboration Meetings), CMS Weeks Agendas The talks presented at the Plenary Sessions. Management – CB – MB – FB Agendas and minutes are accessible to CMS members through Indico. LHCC The talks presented at the ‘CMS Meetings with LHCC Referees’ are available on request from the PM or MB Country Representative. Annual Reviews The talks presented at the 2008 Annual Reviews are posted in Indico. CMS DOCUMENTS It is considered useful to establish information on the first employment of CMS doctoral student upon completion of their theses.  Therefore it is requested that Ph.D students inform the CMS Secretariat about the nature of employment and name of their first employer. The Notes, Conference Reports and Theses published si...

  19. INFORMATION RETRIEVAL FOR SHORT DOCUMENTS

    Institute of Scientific and Technical Information of China (English)

    Qi Haoliang; Li Mu; Gao Jianfeng; Li Sheng

    2006-01-01

    The major problem with most current information retrieval models is that individual words provide unreliable evidence about the content of texts. When the document is short, e.g. when only the abstract is available, the word-use variability problem has a substantial impact on Information Retrieval (IR) performance. To solve the problem, a new technique for short document retrieval, named the Reference Document Model (RDM), is put forward in this letter. RDM obtains the statistical semantics of the query/document by pseudo feedback, both for the query and the document, from reference documents. The contributions of this model are three-fold: (1) pseudo feedback both for the query and the document; (2) building the query model and the document model from reference documents; (3) flexible indexing units, which can be any linguistic elements such as documents, paragraphs, sentences, n-grams, terms or characters. For short document retrieval, RDM achieves significant improvements over the classical probabilistic models on the task of ad hoc retrieval on Text REtrieval Conference (TREC) test sets. Results also show that the shorter the document, the better the RDM performance.
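
    The pseudo-feedback idea can be illustrated, in a much simplified form, by expanding a short query with terms from its best-matching reference documents and re-ranking. This sketch assumes scikit-learn; the corpus, mixing weight and number of feedback documents are placeholders, and it is not the RDM model itself.

```python
# Simplified pseudo-relevance feedback for illustration (not the RDM model).
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

reference_docs = [
    "information retrieval models for short documents and abstracts",
    "probabilistic language models smooth sparse term statistics",
    "neural networks for image classification tasks",
]
query = "short document retrieval"

vec = TfidfVectorizer(stop_words="english")
D = vec.fit_transform(reference_docs).toarray()
q = vec.transform([query]).toarray()

sims = cosine_similarity(q, D).ravel()
top = sims.argsort()[::-1][:2]                 # pseudo-relevant documents
expanded = q + 0.5 * D[top].mean(axis=0)       # mix query terms with feedback terms

rescored = cosine_similarity(expanded, D).ravel()
print("initial ranking:", sims.argsort()[::-1])
print("rescored ranking:", rescored.argsort()[::-1])
```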

  20. Omega documentation

    Energy Technology Data Exchange (ETDEWEB)

    Howerton, R.J.; Dye, R.E.; Giles, P.C.; Kimlinger, J.R.; Perkins, S.T.; Plechaty, E.F.

    1983-08-01

    OMEGA is a CRAY I computer program that controls nine codes used by LLNL Physical Data Group for: 1) updating the libraries of evaluated data maintained by the group (UPDATE); 2) calculating average values of energy deposited in secondary particles and residual nuclei (ENDEP); 3) checking the libraries for internal consistency, especially for energy conservation (GAMCHK); 4) producing listings, indexes and plots of the library data (UTILITY); 5) producing calculational constants such as group averaged cross sections and transfer matrices for diffusion and Sn transport codes (CLYDE); 6) producing and updating standard files of the calculational constants used by LLNL Sn and diffusion transport codes (NDFL); 7) producing calculational constants for Monte Carlo transport codes that use group-averaged cross sections and continuous energy for particles (CTART); 8) producing and updating standard files used by the LLNL Monte Carlo transport codes (TRTL); and 9) producing standard files used by the LANL pointwise Monte Carlo transport code MCNP (MCPOINT). The first four of these functions and codes deal with the libraries of evaluated data and the last five with various aspects of producing calculational constants for use by transport codes. In 1970 a series, called PD memos, of internal and informal memoranda was begun. These were intended to be circulated among the group for comment and then to provide documentation for later reference whenever questions arose about the subject matter of the memos. They have served this purpose and now will be drawn upon as source material for this more comprehensive report that deals with most of the matters covered in those memos.

  1. Learning Context for Text Categorization

    CERN Document Server

    Haribhakta, Y V

    2011-01-01

    This paper describes our work, which is based on discovering context for text document categorization. The document categorization approach is derived from a combination of a learning paradigm known as relation extraction and a technique known as context discovery. We demonstrate the effectiveness of our categorization approach using the Reuters-21578 dataset and synthetic real-world data from the sports domain. Our experimental results indicate that the learned context greatly improves the categorization performance as compared to traditional categorization approaches.

  2. New mathematical cuneiform texts

    CERN Document Server

    Friberg, Jöran

    2016-01-01

    This monograph presents in great detail a large number of both unpublished and previously published Babylonian mathematical texts in the cuneiform script. It is a continuation of the work A Remarkable Collection of Babylonian Mathematical Texts (Springer 2007) written by Jöran Friberg, the leading expert on Babylonian mathematics. Focussing on the big picture, Friberg explores in this book several Late Babylonian arithmetical and metro-mathematical table texts from the sites of Babylon, Uruk and Sippar, collections of mathematical exercises from four Old Babylonian sites, as well as a new text from Early Dynastic/Early Sargonic Umma, which is the oldest known collection of mathematical exercises. A table of reciprocals from the end of the third millennium BC, differing radically from well-documented but younger tables of reciprocals from the Neo-Sumerian and Old-Babylonian periods, as well as a fragment of a Neo-Sumerian clay tablet showing a new type of a labyrinth are also discussed. The material is presen...

  3. Removing Manually-Generated Boilerplate from Electronic Texts: Experiments with Project Gutenberg e-Books

    CERN Document Server

    Kaser, Owen

    2007-01-01

    Collaborative work on unstructured or semi-structured documents, such as in literature corpora or source code, often involves agreed upon templates containing metadata. These templates are not consistent across users and over time. Rule-based parsing of these templates is expensive to maintain and tends to fail as new documents are added. Statistical techniques based on frequent occurrences have the potential to identify automatically a large fraction of the templates, thus reducing the burden on the programmers. We investigate the case of the Project Gutenberg corpus, where most documents are in ASCII format with preambles and epilogues that are often copied and pasted or manually typed. We show that a statistical approach can solve most cases though some documents require knowledge of English. We also survey various technical solutions that make our approach applicable to large data sets.
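
    The statistical intuition (lines that recur across many e-texts are probably boilerplate) can be demonstrated in a few lines. The corpus below is a tiny in-memory stand-in for Project Gutenberg files, and the frequency threshold is an arbitrary choice, not the authors' value.

```python
# Sketch of the frequent-line idea described above; not the authors' implementation.
from collections import Counter

books = [
    ["*** START OF THIS PROJECT GUTENBERG EBOOK ***", "Chapter 1", "It was a dark night."],
    ["*** START OF THIS PROJECT GUTENBERG EBOOK ***", "Chapter 1", "Call me Ishmael."],
    ["*** START OF THIS PROJECT GUTENBERG EBOOK ***", "PREFACE", "These are my memoirs."],
]

line_freq = Counter(line for book in books for line in set(book))
n_books = len(books)

def strip_boilerplate(book, threshold=0.8):
    # Drop any line that appears in at least `threshold` of all books.
    return [line for line in book
            if line_freq[line] / n_books < threshold]

print(strip_boilerplate(books[0]))   # ['Chapter 1', 'It was a dark night.']
```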

  4. Perceptions of document relevance

    Directory of Open Access Journals (Sweden)

    Peter eBruza

    2014-07-01

    Full Text Available This article presents a study of how humans perceive the relevance of documents. Humans are adept at making reasonably robust and quick decisions about what information is relevant to them, despite the ever increasing complexity and volume of their surrounding information environment. The literature on document relevance has identified various dimensions of relevance (e.g., topicality, novelty, etc.); however, little is understood about how these dimensions may interact. We performed a crowdsourced study of how human subjects judge two relevance dimensions in relation to document snippets retrieved from an internet search engine. The order of the judgement was controlled. For those judgements exhibiting an order effect, a q-test was performed to determine whether the order effects can be explained by a quantum decision model based on incompatible decision perspectives. Some evidence of incompatibility was found, which suggests that incompatible decision perspectives are appropriate for explaining interacting dimensions of relevance.

  5. Typesafe Modeling in Text Mining

    CERN Document Server

    Steeg, Fabian

    2011-01-01

    Based on the concept of annotation-based agents, this report introduces tools and a formal notation for defining and running text mining experiments using a statically typed domain-specific language embedded in Scala. Using machine learning for classification as an example, the framework is used to develop and document text mining experiments, and to show how the concept of generic, typesafe annotation corresponds to a general information model that goes beyond text processing.

  6. Securing XML Documents

    Directory of Open Access Journals (Sweden)

    Charles Shoniregun

    2004-11-01

    Full Text Available XML (extensible markup language) is becoming the current standard for establishing interoperability on the Web. XML data are self-descriptive and syntax-extensible; this makes it very suitable for representation and exchange of semi-structured data, and allows users to define new elements for their specific applications. As a result, the number of documents incorporating this standard is continuously increasing over the Web. The processing of XML documents may require a traversal of the whole document structure, and therefore the cost can be very high. A strong demand for a means of efficient and effective XML processing has posed a new challenge for the database world. This paper discusses a fast and efficient indexing technique for XML documents, and introduces the XML graph numbering scheme. It can be used for indexing and securing the graph structure of XML documents. This technique provides an efficient method to speed up XML data processing. Furthermore, the paper explores the classification of existing methods, their impact on query processing, and indexing.

  7. Spot Elevations, Raw Mass Point and Breakline Digital Elevation Data; 5KDTM_ASCII; This comma-delimited ASCII text file contains the raw elevation data points created by Intermap Technologies in 1997 under contract to the RIDOT and National Grid., Published in 2001, 1:4800 (1in=400ft) scale, State of Rhode Island and Providence Plantations.

    Data.gov (United States)

    NSGIC GIS Inventory (aka Ramona) — This Spot Elevations dataset, published at 1:4800 (1in=400ft) scale, was produced all or in part from Orthoimagery information as of 2001. It is described as 'Raw...

  8. A Survey of Unstructured Text Summarization Techniques

    Directory of Open Access Journals (Sweden)

    Sherif Elfayoumy

    2014-05-01

    Full Text Available Due to the explosive amounts of text data being created and organizations' increased desire to leverage their data corpora, especially with the availability of Big Data platforms, there is not usually enough time to read and understand each document and make decisions based on document contents. Hence, there is a great demand for summarizing text documents to provide a representative substitute for the original documents. By improving summarizing techniques, the precision of document retrieval through search queries against summarized documents is expected to improve in comparison to querying against the full spectrum of original documents. Several generic text summarization algorithms have been developed, each with its own advantages and disadvantages. For example, some algorithms are particularly good for summarizing short documents but not for long ones. Others perform well in identifying and summarizing single-topic documents but their precision degrades sharply with multi-topic documents. In this article we present a survey of the literature in text summarization. We also survey some of the most common evaluation methods for the quality of automated text summarization techniques. Lastly, we identify some of the challenging problems that are still open, in particular the need for a universal approach that yields good results for mixed types of documents.

  9. Pedagogical documentation: Preschool teachers’ perspective

    Directory of Open Access Journals (Sweden)

    Pavlović-Breneselović Dragana

    2012-01-01

    Full Text Available Educational policy shapes the positions of all stakeholders and their mutual relations in the system of preschool education through its attitude towards documentation. The attitude towards the function of pedagogical documentation in preschool education programmes reflects certain views on children, learning and the nature of the programmes. Although contemporary approaches to preschool education emphasise the issue of documentation, this problem is dealt with partially and technically in our country. The aim of our research was to explore preschool teachers’ perspective on documentation by investigating the current situation and teachers’ preferences related to documentation type, as well as to study the purpose, meaning and process of documentation. The research was conducted on a sample of 300 preschool teachers. The descriptive method, interviewing and scaling techniques were used. Research data suggest that the field of documentation is marked by contradictions in perceiving the meaning and function of documentation, as well as by discrepancy and lack of integration at the level of conceptions, practice and educational policy. Changing the current situation in the field of documentation is not a technical matter of elaboration of certain types and forms of documentation; it demands explication of the purpose and function of documentation in keeping with the conception of preschool education programmes and a systemic approach to changes originating from the given conception. [Project of the Ministry of Science of the Republic of Serbia, No. 179060: Models of assessment and strategies for improving the quality of education in Serbia]

  10. Electronic Braille Document Reader

    OpenAIRE

    Arif, Shahab; Holmes, Violeta

    2013-01-01

    This paper presents an investigation into developing a portable Braille device which would allow visually impaired individuals to read electronic documents by actuating Braille text on a finger. Braille books tend to be bulky in size due to the minimum size requirements for each Braille cell. E-books can be read in Braille using refreshable Braille displays connected to a computer. However, the refreshable Braille displays are expensive, bulky and are not portable. These factors restrict blin...

  11. La Documentation photographique

    Directory of Open Access Journals (Sweden)

    Magali Hamm

    2009-03-01

    Full Text Available The Documentation photographique, a magazine dedicated to teachers and students of history and geography, places the image at the heart of its editorial line. In order to follow current developments in geography, the collection presents an increasingly diversified iconography: maps, photographs, but also caricatures, newspaper front pages and advertisements, all of which are considered geographical documents in their own right. An image can act as a synthesis; conversely, it can present the different facets of a single object; often it makes geographical phenomena tangible. Combined with other documents, images help teachers introduce their students to complex geographical reasoning. But in order to learn how to read them, it is essential to contextualize them, comment on them and question their relationship with reality.

  12. Interconnectedness und digitale Texte

    Directory of Open Access Journals (Sweden)

    Detlev Doherr

    2013-04-01

    Full Text Available The multimedia information services on the Internet are becoming ever larger and more comprehensive, and even documents that exist only in printed form are being digitized by libraries and placed online. These documents can be located via online document management systems or search engines and are then provided in common formats such as PDF. This article describes how the Humboldt Digital Library works, which for more than ten years has made documents by Alexander von Humboldt freely available on the Web in English translation as the HDL (Humboldt Digital Library). Unlike a conventional digital library, however, it does not merely provide digitized documents as scans or PDFs; the text itself is made available in interlinked form. The system therefore resembles an information system more than a digital library, which is also reflected in the available functions for finding texts in different versions and translations, comparing paragraphs of different documents, and displaying images in their context. The development of dynamic hyperlinks based on the individual text paragraphs of Humboldt's works, in the form of media assets, makes it possible to use the Google Maps programming interface for geographic as well as content-based navigation. Going beyond the services of a digital library, the HDL offers the prototype of a multidimensional information system that works with dynamic structures and enables extensive thematic analyses and comparisons.

  13. Short Text Classification: A Survey

    Directory of Open Access Journals (Sweden)

    Ge Song

    2014-05-01

    Full Text Available With the recent explosive growth of e-commerce and online communication, a new genre of text, short text, has been extensively applied in many areas, and much research therefore focuses on short text mining. It is a challenge to classify short text owing to its natural characteristics, such as sparseness, large scale, immediacy and non-standardization. It is difficult for traditional methods to deal with short text classification, mainly because the limited number of words in a short text cannot adequately represent the feature space and the relationship between words and documents. Several studies and reviews on text classification have appeared in recent times; however, only a few focus on short text classification. This paper discusses the characteristics of short text and the difficulty of short text classification. We then introduce the existing popular work on short text classifiers and models, including short text classification using semantic analysis, semi-supervised short text classification, ensemble short text classification, and real-time classification. The evaluation of short text classification is also analyzed. Finally, we summarize the existing classification technology and consider development trends in short text classification.

  14. A Customizable Text Classifier for Text Mining

    Directory of Open Access Journals (Sweden)

    Yun-liang Zhang

    2007-12-01

    Full Text Available Text mining deals with complex and unstructured texts. Usually a particular collection of texts that is specific to one or more domains is necessary. We have developed a customizable text classifier with which users can mine such a collection automatically. It derives from the sentence category of the HNC theory and corresponding techniques. It can start with a few texts, and it can adjust automatically or be adjusted by the user. The user can also control the number of domains chosen and decide the standard with which to choose the texts, based on demand and the abundance of materials. The performance of the classifier varies with the user's choice.

  15. Working with text tools, techniques and approaches for text mining

    CERN Document Server

    Tourte, Gregory J L

    2016-01-01

    Text mining tools and technologies have long been a part of the repository world, where they have been applied to a variety of purposes, from pragmatic aims to support tools. Research areas as diverse as biology, chemistry, sociology and criminology have seen effective use made of text mining technologies. Working With Text collects a subset of the best contributions from the 'Working with text: Tools, techniques and approaches for text mining' workshop, alongside contributions from experts in the area. Text mining tools and technologies in support of academic research include supporting research on the basis of a large body of documents, facilitating access to and reuse of extant work, and bridging between the formal academic world and areas such as traditional and social media. Jisc have funded a number of projects, including NaCTem (the National Centre for Text Mining) and the ResDis programme. Contents are developed from workshop submissions and invited contributions, including: Legal considerations in te...

  16. Unstructured Documents Categorization: A Study

    Directory of Open Access Journals (Sweden)

    Debnath Bhattacharyya

    2008-12-01

    Full Text Available The main purpose of communication is to transfer information from one corner of the world to another. The information is basically stored in the form of documents or files created on the basis of requirements. The randomness of their creation and storage makes them unstructured in nature. As a consequence, data retrieval and modification become a hard nut to crack. Data that are required frequently should maintain a certain pattern; otherwise, problems such as retrieving erroneous data, anomalies in modification, or excessive time consumption in the retrieval process may arise. As every problem has its own solution, these unstructured documents have also been given a solution, named unstructured document categorization: the collected unstructured documents are categorized based on some given constraints. This paper is a review that deals with different techniques that have appeared in the literature, such as text and data mining, genetic algorithms, lexical chaining and binarization methods, to achieve the desired unstructured document categorization.

  17. Indexation de Documents Manuscrits

    OpenAIRE

    Vinciarelli, Alessandro

    2006-01-01

    Automatic handwriting recognition systems make it possible to transform collections of handwritten documents into archives of digital documents. The advantage is not so much the reduction of the space needed to store the data, but rather the possibility of applying the content management technologies normally used for digital texts such as web pages and e-mails. The main problem with such an approach is that the transcriptions are generally...

  18. Document cards: a top trumps visualization for documents.

    Science.gov (United States)

    Strobelt, Hendrik; Oelke, Daniela; Rohrdantz, Christian; Stoffel, Andreas; Keim, Daniel A; Deussen, Oliver

    2009-01-01

    Finding suitable, less space consuming views for a document's main content is crucial to provide convenient access to large document collections on display devices of different size. We present a novel compact visualization which represents the document's key semantic as a mixture of images and important key terms, similar to cards in a top trumps game. The key terms are extracted using an advanced text mining approach based on a fully automatic document structure extraction. The images and their captions are extracted using a graphical heuristic and the captions are used for a semi-semantic image weighting. Furthermore, we use the image color histogram for classification and show at least one representative from each non-empty image class. The approach is demonstrated for the IEEE InfoVis publications of a complete year. The method can easily be applied to other publication collections and sets of documents which contain images.

  19. Handwritten text line segmentation by spectral clustering

    Science.gov (United States)

    Han, Xuecheng; Yao, Hui; Zhong, Guoqiang

    2017-02-01

    Since handwritten text lines are generally skewed and not obviously separated, text line segmentation of handwritten document images is still a challenging problem. In this paper, we propose a novel text line segmentation algorithm based on spectral clustering. Given a handwritten document image, we first convert it to a binary image and then compute the adjacency matrix of the pixel points. We apply spectral clustering on this similarity metric and use the orthogonal k-means clustering algorithm to group the text lines. Experiments on a Chinese handwritten document database (HIT-MW) demonstrate the effectiveness of the proposed method.
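
    A minimal version of the idea, clustering the coordinates of foreground pixels with spectral clustering, is sketched below. It assumes scikit-learn, uses a synthetic two-line binary image, and fixes the number of lines in advance; it is not the authors' adjacency construction.

```python
# Illustrative sketch only (not the paper's algorithm): cluster foreground
# pixels of a binarized page into text lines with spectral clustering.
import numpy as np
from sklearn.cluster import SpectralClustering

# Synthetic binary image with two horizontal "text lines"
page = np.zeros((60, 120), dtype=np.uint8)
page[10:15, 5:110] = 1
page[40:45, 5:110] = 1

ys, xs = np.nonzero(page)                       # coordinates of ink pixels
points = np.column_stack([ys, xs]).astype(float)

labels = SpectralClustering(
    n_clusters=2,                               # expected number of text lines
    affinity="nearest_neighbors",
    n_neighbors=10,
    assign_labels="kmeans",
    random_state=0,
).fit_predict(points)

for line in range(2):
    rows = ys[labels == line]
    print(f"line {line}: spans image rows {rows.min()}-{rows.max()}")
```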

  20. ASCII Text File of the Original 1-m Bathymetry from National Oceanic and Atmospheric Administration (NOAA) Survey H11321 in Central Rhode Island Sound (H11321_1M_UTM19NAD83.TXT)

    Data.gov (United States)

    U.S. Geological Survey, Department of the Interior — The United States Geological Survey (USGS) is working cooperatively with the National Oceanic and Atmospheric Administration (NOAA) to interpret the surficial...

  1. Documentation of Cultural Heritage Objects

    Directory of Open Access Journals (Sweden)

    Jon Grobovšek

    2013-09-01

    Full Text Available EXTENDED ABSTRACT:The first and important phase of documentation of cultural heritage objects is to understand which objects need to be documented. The entire documentation process is determined by the characteristics and scope of the cultural heritage object. The next question to be considered is the expected outcome of the documentation process and the purpose for which it will be used. These two essential guidelines determine each stage of the documentation workflow: the choice of the most appropriate data capturing technology and data processing method, how detailed should the documentation be, what problems may occur, what the expected outcome is, what it will be used for, and the plan for storing data and results. Cultural heritage objects require diverse data capturing and data processing methods. It is important that even the first stages of raw data capturing are oriented towards the applicability of results. The selection of the appropriate working method can facilitate the data processing and the preparation of final documentation. Documentation of paintings requires different data capturing method than documentation of buildings or building areas. The purpose of documentation can also be the preservation of the contemporary cultural heritage to posterity or the basis for future projects and activities on threatened objects. Documentation procedures should be adapted to our needs and capabilities. Captured and unprocessed data are lost unless accompanied by additional analyses and interpretations. Information on tools, procedures and outcomes must be included into documentation. A thorough analysis of unprocessed but accessible documentation, if adequately stored and accompanied by additional information, enables us to gather useful data. In this way it is possible to upgrade the existing documentation and to avoid data duplication or unintentional misleading of users. The documentation should be archived safely and in a way to meet

  2. Text Analytics to Data Warehousing

    Directory of Open Access Journals (Sweden)

    Kalli Srinivasa Nageswara Prasad

    2010-09-01

    Full Text Available Information hidden or stored in unstructured data can play a critical role in making decisions, understanding and conducting other business functions. Integrating data stored in both structured and unstructured formats can add significant value to an organization. With the extent of development happening in text mining and in technologies to deal with unstructured and semi-structured data, like XML and MML (Mining Markup Language), to extract and analyze data, text analytics has evolved to handle unstructured data and helps unlock and predict business results via Business Intelligence and Data Warehousing. Text mining involves dealing with texts in documents and discovering hidden patterns, but text analytics enhances information retrieval in the form of search and enables clustering of results; moreover, text analytics is text mining plus visualization. In this paper we discuss handling unstructured data that reside in documents so that they fit into business applications like data warehouses for further analysis, and how this helps in the framework we have used for the solution.

  3. Documenting the Invisible

    DEFF Research Database (Denmark)

    Pedersen, Peter Ole

    2017-01-01

    Documenting the Invisible is a polemical text that examines the potentials of documentary-based art to create useful aesthetic representations of ‘The Anthropocene’. The article is a result of the practice-based collaboration between researcher and curator Peter Ole Pedersen and the artists...... of representing it in art and photography, as well as visually representing phenomena like deep time and radioactivity. The article discusses Bruno Latour’s reflections on agency and the Anthropocene as well as the relations between documentary and fiction put forth by Jacques Rancière....

  4. Generic safety documentation model

    Energy Technology Data Exchange (ETDEWEB)

    Mahn, J.A.

    1994-04-01

    This document is intended to be a resource for preparers of safety documentation for Sandia National Laboratories, New Mexico facilities. It provides standardized discussions of some topics that are generic to most, if not all, Sandia/NM facilities safety documents. The material provides a "core" upon which to develop facility-specific safety documentation. The use of the information in this document will reduce the cost of safety document preparation and improve consistency of information.

  5. Colored-sketch of Text Information

    OpenAIRE

    Beomjin Kim; Philip Johnson; Adam S. Huarng

    2002-01-01

    This paper presents an information visualization method, which transforms text into abstracted visual representations. The proposed color-coding algorithm converts text into a sequence of colored icons that inform users about the distributional patterns of given queries, as well as the structural overview of a document simultaneously. By presenting the compact, but instructive visual abstraction of texts concurrently, users can compare multiple documents intuitively while alleviating the need...

  6. Document Clustering based on Topic Maps

    CERN Document Server

    Rafi, Muhammad; Farooq, Amir; 10.5120/1640-2204

    2011-01-01

    The importance of document clustering is now widely acknowledged by researchers for better management, smart navigation, efficient filtering, and concise summarization of large collections of documents like the World Wide Web (WWW). The next challenge lies in performing clustering based on the semantic contents of the document. The problem of document clustering has two main components: (1) to represent the document in a form that inherently captures the semantics of the text, which may also help to reduce the dimensionality of the document; and (2) to define a similarity measure based on the semantic representation such that it assigns higher numerical values to document pairs which have a higher semantic relationship. The feature space of the documents can be very challenging for document clustering. A document may contain multiple topics, a large set of class-independent general words, and a handful of class-specific core words. With these features in mind, traditional agglomerative clustering algori...

  7. Near Duplicate Document Detection Survey

    Directory of Open Access Journals (Sweden)

    Bassma S. Alsulami

    2012-04-01

    Full Text Available Search engines are the major breakthrough on the web for retrieving information. But the list of retrieved documents contains a high percentage of duplicate and near-duplicate results, so there is a need to improve the quality of search results. Some current search engines use data filtering algorithms that can eliminate duplicate and near-duplicate documents to save users’ time and effort. The identification of similar or near-duplicate pairs in a large collection is a significant problem with wide-spread applications. This paper presents an up-to-date review of the existing literature on duplicate and near-duplicate detection on the Web.
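
    One widely used detection technique covered by such surveys is word shingling with Jaccard similarity. The sketch below is a generic illustration of that technique with an invented corpus and threshold; it does not reproduce any specific method from the survey.

```python
# Generic near-duplicate check: word shingles compared with Jaccard similarity.
def shingles(text, k=2):
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

docs = {
    "d1": "the quick brown fox jumps over the lazy dog",
    "d2": "the quick brown fox jumped over the lazy dog",
    "d3": "completely unrelated text about search engines",
}

sh = {name: shingles(text) for name, text in docs.items()}
names = list(docs)
for i in range(len(names)):
    for j in range(i + 1, len(names)):
        sim = jaccard(sh[names[i]], sh[names[j]])
        flag = "near-duplicate" if sim > 0.5 else "distinct"
        print(names[i], names[j], round(sim, 2), flag)
```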

  8. Working with Documents in Databases

    Directory of Open Access Journals (Sweden)

    Marian DARDALA

    2008-01-01

    Full Text Available The use of electronic documents on a larger and larger scale within organizations and public institutions requires their storage and unified exploitation by means of databases. The purpose of this article is to present the way of loading, exploiting and visualizing documents in a database, taking as an example the DBMS MS SQL Server. On the other hand, the modules for loading the documents into the database and for their visualization are presented through code sequences written in C#. The interoperability between environments is achieved by means of the ADO.NET database access technology.

  9. Contextualizing Data Warehouses with Documents

    DEFF Research Database (Denmark)

    Perez, Juan Manuel; Berlanga, Rafael; Aramburu, Maria Jose

    2008-01-01

    Current data warehouse and OLAP technologies are applied to analyze the structured data that companies store in databases. The context that helps to understand data over time is usually described separately in text-rich documents. This paper proposes to integrate the traditional corporate data...... warehouse with a document warehouse, resulting in a contextualized warehouse. Thus, the user first selects an analysis context by supplying some keywords. Then, the analysis is performed on a novel type of OLAP cube, called an R-cube, which is materialized by retrieving and ranking the documents...

  10. Basic Test Framework for the Evaluation of Text Line Segmentation and Text Parameter Extraction

    OpenAIRE

    Darko Brodić; Milivojević, Dragan R.; Zoran Milivojević

    2010-01-01

    Text line segmentation is an essential stage in off-line optical character recognition (OCR) systems. It is a key step because inaccurately segmented text lines will lead to OCR failure. Text line segmentation of handwritten documents is a complex and diverse problem, complicated by the nature of handwriting. Hence, text line segmentation is a leading challenge in handwritten document image processing. Due to inconsistencies in the measurement and evaluation of text segmentation algorithm quality, som...

  11. Identifying issue frames in text.

    Directory of Open Access Journals (Sweden)

    Eyal Sagi

    Full Text Available Framing, the effect of context on cognitive processes, is a prominent topic of research in psychology and public opinion research. Research on framing has traditionally relied on controlled experiments and manually annotated document collections. In this paper we present a method that allows for quantifying the relative strengths of competing linguistic frames based on corpus analysis. This method requires little human intervention and can therefore be efficiently applied to large bodies of text. We demonstrate its effectiveness by tracking changes in the framing of terror over time and comparing the framing of abortion by Democrats and Republicans in the U.S.

  12. A Survey on Web Text Information Retrieval in Text Mining

    Directory of Open Access Journals (Sweden)

    Tapaswini Nayak

    2015-08-01

    Full Text Available In this study we have analyzed different techniques for information retrieval in text mining. The aim of the study is to identify techniques for web text information retrieval. Text mining is closely related to text analytics, the process of deriving high-quality information from text. High-quality information is typically derived through the devising of patterns and trends by means such as statistical pattern learning. Typical text mining tasks include text categorization, text clustering, concept/entity extraction, creation of coarse taxonomies, sentiment analysis, document summarization and entity relation modeling. It is used to mine hidden information from unstructured or semi-structured data. This feature is necessary because a large amount of Web information is semi-structured due to the nested structure of HTML code, is linked and is redundant. Web content categorization with a content database is the most important tool for the efficient use of search engines. A customer requesting information on a particular subject or item would otherwise have to search through hundreds of results to find the most relevant information to his query. Such lists of hundreds of results are reduced by the use of text mining. This eliminates the aggravation and improves the navigation of information on the Web.

  13. Cognitive Temporal Document Priors

    NARCIS (Netherlands)

    Peetz, M.H.; de Rijke, M.

    2013-01-01

    Temporal information retrieval exploits temporal features of document collections and queries. Temporal document priors are used to adjust the score of a document based on its publication time. We consider a class of temporal document priors that is inspired by retention functions considered in cogn
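
    A very simple member of this class of priors is an exponential recency decay combined multiplicatively with a relevance score. The sketch below is only illustrative; the half-life, dates and scores are invented and the paper's retention functions are not reproduced.

```python
# Toy example of a temporal document prior (not the paper's retention functions).
import math
from datetime import date

def recency_prior(published, today, half_life_days=180.0):
    age = (today - published).days
    return math.exp(-math.log(2) * age / half_life_days)   # halves every half-life

docs = [("old report", date(2012, 1, 10), 0.9),
        ("fresh news", date(2013, 1, 5), 0.7)]
today = date(2013, 1, 10)

for name, published, relevance in docs:
    prior = recency_prior(published, today)
    print(name, round(relevance * prior, 3))   # prior-adjusted score
```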

  14. Automatic text summarization

    CERN Document Server

    Torres Moreno, Juan Manuel

    2014-01-01

    This new textbook examines the motivations and the different algorithms for automatic document summarization (ADS). We present a recent state of the art. The book shows the main problems of ADS, the difficulties involved and the solutions provided by the community. It presents recent advances in ADS, as well as current applications and trends. The approaches are statistical, linguistic and symbolic. Several examples are included in order to clarify the theoretical concepts. The books currently available in the area of Automatic Document Summarization are not recent. Powerful algorithms have been develop

  15. The Second Text Retrieval Conference (TREC-2) [and] Overview of the Second Text Retrieval Conference (TREC-2) [and] Reflections on TREC [and] Automatic Routing and Retrieval Using Smart: TREC-2 [and] TREC and TIPSTER Experiments with INQUIRY [and] Large Test Collection Experiments on an Operational Interactive System: Okapi at TREC [and] Efficient Retrieval of Partial Documents [and] TREC Routing Experiments with the TRW/Paracel Fast Data Finder [and] CLARIT-TREC Experiments.

    Science.gov (United States)

    Harman, Donna; And Others

    1995-01-01

    Presents an overview of the second Text Retrieval Conference (TREC-2), an opinion paper about the program, and nine papers by participants that show a range of techniques used in TREC. Topics include traditional text retrieval and information technology, efficiency, the use of language processing techniques, unusual approaches to text retrieval,…

  16. Audit of Orthopaedic Surgical Documentation

    Directory of Open Access Journals (Sweden)

    Fionn Coughlan

    2015-01-01

    Full Text Available Introduction. The Royal College of Surgeons in England published guidelines in 2008 outlining the information that should be documented at each surgery. St. James’s Hospital uses a standard operation sheet for all surgical procedures and these were examined to assess documentation standards. Objectives. To retrospectively audit the handwritten orthopaedic operative notes according to established guidelines. Methods. A total of 63 operation notes over seven months were audited in terms of date and time of surgery, surgeon, procedure, elective or emergency indication, operative diagnosis, incision details, signature, closure details, tourniquet time, postop instructions, complications, prosthesis, and serial numbers. Results. A consultant performed 71.4% of procedures; however, 85.7% of the operative notes were written by the registrar. The date and time of surgery, name of surgeon, procedure name, and signature were documented in all cases. The operative diagnosis and postoperative instructions were frequently not documented in the designated location. Incision details were included in 81.7% and prosthesis details in only 30% while the tourniquet time was not documented in any. Conclusion. Completion and documentation of operative procedures were excellent in some areas; improvement is needed in documenting tourniquet time, prosthesis and incision details, and the location of operative diagnosis and postoperative instructions.

  17. Handwritten Document Editor: An Approach

    Directory of Open Access Journals (Sweden)

    Sumit Nalawade

    2014-05-01

    Full Text Available With advancement in new technologies many individuals are moving towards personalization of the same. The same idea inspired us to develop a system which can provide a personal touch to all our documents including both electronic and paper media. In this article we are proposing a novel idea for creating an editor system which will take handwritten scanned document as the input, recognizes the characters from the document, then proceed with creating the font of recognized handwriting to allow user to edit the document. We have proposed use of genetic algorithm along with K-NN classifier for fast recognition of handwritten characters and use of marching squares algorithm for tracing contour points of characters to generate a handwritten font.
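
    The abstract names a genetic algorithm combined with a K-NN classifier for character recognition and marching squares for contour tracing. As a hedged illustration of just the K-NN recognition step, the sketch below uses scikit-learn's bundled digit glyphs as stand-in character images; the genetic-algorithm feature selection and contour tracing are not shown, and all parameters are placeholders.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Stand-in data: 8x8 glyph images flattened to 64-dimensional feature vectors.
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

knn = KNeighborsClassifier(n_neighbors=3)   # k is a placeholder choice
knn.fit(X_train, y_train)
print("recognition accuracy:", knn.score(X_test, y_test))
```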

  18. Handwritten Document Editor: An Approach

    Directory of Open Access Journals (Sweden)

    Sumit Nalawade

    2015-11-01

    Full Text Available With advancement in new technologies many individuals are moving towards personalization of the same. The same idea inspired us to develop a system which can provide a personal touch to all our documents including both electronic and paper media. In this article we are proposing a novel idea for creating an editor system which will take handwritten scanned document as the input, recognizes the characters from the document, then proceed with creating the font of recognized handwriting to allow user to edit the document. We have proposed use of genetic algorithm along with K-NN classifier for fast recognition of handwritten characters and use of marching squares algorithm for tracing contour points of characters to generate a handwritten font.

  19. The Ecological Approach to Text Visualization.

    Science.gov (United States)

    Wise, James A.

    1999-01-01

    Presents both theoretical and technical bases on which to build a "science of text visualization." The Spatial Paradigm for Information Retrieval and Exploration (SPIRE) text-visualization system, which images information from free-text documents as natural terrains, serves as an example of the "ecological approach" in its visual metaphor, its…

  20. Registration document 2005; Document de reference 2005

    Energy Technology Data Exchange (ETDEWEB)

    NONE

    2005-07-01

    This reference document of Gaz de France provides information and data on the Group's activities in 2005: financial information, business, activities, equipment, factories and real estate, trade, capital, organization charts, employment, contracts and research programs. (A.L.B.)

  1. 2002 reference document; Document de reference 2002

    Energy Technology Data Exchange (ETDEWEB)

    NONE

    2002-07-01

    This 2002 reference document of the Areva group provides information on the company. Organized in seven chapters, it presents the persons responsible for the reference document and for auditing the financial statements, information pertaining to the transaction, general information on the company and share capital, information on company operation, changes and future prospects, assets, financial position, financial performance, information on company management and executive board and supervisory board, recent developments and future prospects. (A.L.B.)

  2. Multilingual Topic Models for Unaligned Text

    CERN Document Server

    Boyd-Graber, Jordan

    2012-01-01

    We develop the multilingual topic model for unaligned text (MuTo), a probabilistic model of text that is designed to analyze corpora composed of documents in two languages. From these documents, MuTo uses stochastic EM to simultaneously discover both a matching between the languages and multilingual latent topics. We demonstrate that MuTo is able to find shared topics on real-world multilingual corpora, successfully pairing related documents across languages. MuTo provides a new framework for creating multilingual topic models without needing carefully curated parallel corpora and allows applications built using the topic model formalism to be applied to a much wider class of corpora.

  3. Text-Fabric

    NARCIS (Netherlands)

    Roorda, Dirk

    2016-01-01

    Text-Fabric is a Python3 package for Text plus Annotations. It provides a data model, a text file format, and a binary format for (ancient) text plus (linguistic) annotations. The emphasis of this all is on: data processing; sharing data; and contributing modules. A defining characteristic is that T

  4. Contextual Text Mining

    Science.gov (United States)

    Mei, Qiaozhu

    2009-01-01

    With the dramatic growth of text information, there is an increasing need for powerful text mining systems that can automatically discover useful knowledge from text. Text is generally associated with all kinds of contextual information. Those contexts can be explicit, such as the time and the location where a blog article is written, and the…

  5. Ascii grids of predicted pH in depth zones used by domestic and public drinking water supply depths, Central Valley, California

    Science.gov (United States)

    Zamora, Celia; Nolan, Bernard T.; Gronberg, JoAnn M.

    2017-01-01

    The ascii grids associated with this data release are predicted distributions of continuous pH at the drinking water depth zones in the groundwater of the Central Valley, California. The two prediction grids produced in this work represent predicted pH at the domestic supply and public supply drinking water depths, respectively, and are bound by the alluvial boundary that defines the Central Valley. A depth of 46 m was used to stratify wells into the shallow and deep aquifers and was derived from depth percentiles associated with domestic and public supply in previous work by Burow et al. (2013). In this work, the median well depth categorized as domestic supply was 30 meters below land surface and the median well depth categorized as public supply was 100 meters below land surface. Prediction grids were created using prediction modeling methods, specifically Boosted Regression Trees (BRT) with a gaussian error distribution, within a statistical learning framework in the R computing environment (http://www.r-project.org/). The statistical learning framework seeks to maximize the predictive performance of machine learning methods through model tuning by cross validation. The response variable was measured pH from 1337 wells, compiled from two sources: the US Geological Survey (USGS) National Water Information System (NWIS) Database (all data are publicly available from the USGS: http://waterdata.usgs.gov/ca/nwis/nwis) and the California State Water Resources Control Board Division of Drinking Water (SWRCB-DDW) database (water quality data are publicly available from the SWRCB: http://www.waterboards.ca.gov/gama/geotracker_gama.shtml). Only wells with measured pH and well depth data were selected, and for wells with multiple records, only the most recent sample in the period 1993-2014 was used. A total of 1003 wells (training dataset) were used to train the BRT model and 334 wells (hold-out dataset) were used to validate the prediction model. The training r-squared was
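
    As a rough illustration of the modeling workflow described above (boosted regression trees tuned by cross validation, then checked on a hold-out set), the sketch below uses scikit-learn's GradientBoostingRegressor with a squared-error loss on synthetic data; the original work was done in R, and the predictors, grid values and data here are placeholders only.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1337, 10))                               # placeholder predictors
y = 7.5 + 0.3 * X[:, 0] + rng.normal(scale=0.2, size=1337)    # synthetic "pH"

# Roughly mirror the 1003 training / 334 hold-out well split.
X_tr, X_ho, y_tr, y_ho = train_test_split(X, y, test_size=334, random_state=0)

# Tune the boosted trees by cross validation (the "statistical learning framework").
grid = GridSearchCV(
    GradientBoostingRegressor(loss="squared_error"),
    {"n_estimators": [100, 500], "learning_rate": [0.01, 0.1], "max_depth": [2, 4]},
    cv=5,
)
grid.fit(X_tr, y_tr)
print("training R^2:", grid.score(X_tr, y_tr))
print("hold-out R^2:", grid.score(X_ho, y_ho))
```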

  6. On Using Goldbach G0 Codes and Even-Rodeh Codes for Text Compression

    Science.gov (United States)

    Budiman, M. A.; Rachmawati, D.

    2017-03-01

    This research aims to study the efficiency of two variants of variable-length codes (i.e., Goldbach G0 codes and Even-Rodeh codes) in compressing texts. The parameters being examined are the ratio of compression, the space savings, and the bit rate. As a benchmark, all of the original (uncompressed) texts are assumed to be encoded in the American Standard Code for Information Interchange (ASCII). Several texts, including those derived from some corpora (the Artificial corpus, the Calgary corpus, the Canterbury corpus, the Large corpus, and the Miscellaneous corpus) are tested in the experiment. The overall result shows that the Even-Rodeh codes are consistently more efficient at compressing texts than the unoptimized Goldbach G0 codes.
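
    The three parameters named in the abstract (compression ratio, space savings and bit rate) can all be computed against the 8-bit ASCII baseline. The sketch below is a minimal helper that assumes the coder itself is available elsewhere and only reports the metrics; the example numbers are hypothetical.

```python
def compression_metrics(original_text, compressed_bits):
    """Compare a compressed bitstream length against an 8-bit ASCII baseline."""
    original_bits = len(original_text) * 8             # ASCII: 8 bits per character
    ratio = compressed_bits / original_bits             # compression ratio
    savings = 1.0 - ratio                                # space savings
    bit_rate = compressed_bits / len(original_text)      # bits per character
    return ratio, savings, bit_rate

# Hypothetical example: a 1000-character text compressed to 5200 bits.
print(compression_metrics("a" * 1000, 5200))   # (0.65, 0.35, 5.2)
```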

  7. Subject (of documents)

    DEFF Research Database (Denmark)

    Hjørland, Birger

    2017-01-01

    This article presents and discusses the concept “subject” or subject matter (of documents) as it has been examined in library and information science (LIS) for more than 100 years. Different theoretical positions are outlined, and it is found that the most important distinction is between document-oriented views and request-oriented views. The document-oriented view conceives of subject as something inherent in documents, whereas the request-oriented view (or the policy-based view) understands subject as an attribution made to documents in order to facilitate certain uses of them. Related concepts such as aboutness, topic, isness and ofness are also briefly presented. The conclusion is that the most fruitful way of defining the “subject” (of a document) is the document's informative or epistemological potentials, that is, the document's potential to inform users and advance the development...

  8. Enterprise Document Management

    Data.gov (United States)

    US Agency for International Development — The function of the operation is to provide e-Signature and document management support for Acquisition and Assistance (A&A) documents including vouchers in...

  9. Semantic Text Indexing

    Directory of Open Access Journals (Sweden)

    Zbigniew Kaleta

    2014-01-01

    Full Text Available This article presents a specific issue of the semantic analysis of texts in natural language – text indexing – and describes one field of its application (web browsing). The main part of this article describes a computer system that assigns a set of semantic indexes (similar to keywords) to a particular text. The indexing algorithm employs a semantic dictionary to find specific words in a text that represent the text's content. Furthermore, it compares two given sets of semantic indexes to determine the texts’ similarity (assigning a numerical value). The article describes the semantic dictionary – a tool essential to accomplish this task – and its usefulness, the main concepts of the algorithm and test results.
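
    As a hedged sketch of the two steps the abstract describes (look words up in a semantic dictionary to build an index set, then compare two index sets numerically), the snippet below uses a toy dictionary and Jaccard overlap; the article's actual dictionary and similarity measure are not specified here, so both are assumptions.

```python
# Toy semantic dictionary: surface word -> semantic index (concept label).
SEMANTIC_DICT = {
    "car": "vehicle", "truck": "vehicle",
    "apple": "fruit", "pear": "fruit",
}

def semantic_indexes(text):
    """Collect the semantic indexes of dictionary words found in the text."""
    return {SEMANTIC_DICT[w] for w in text.lower().split() if w in SEMANTIC_DICT}

def similarity(indexes_a, indexes_b):
    """Jaccard overlap of two index sets (a stand-in for the article's measure)."""
    if not (indexes_a or indexes_b):
        return 0.0
    return len(indexes_a & indexes_b) / len(indexes_a | indexes_b)

doc1 = semantic_indexes("the car passed a truck")
doc2 = semantic_indexes("an apple and a pear in a truck")
print(similarity(doc1, doc2))   # 0.5
```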

  10. Documenting Employee Conduct

    Science.gov (United States)

    Dalton, Jason

    2009-01-01

    One of the best ways for a child care program to lose an employment-related lawsuit is failure to document the performance of its employees. Documentation of an employee's performance can provide evidence of an employment-related decision such as discipline, promotion, or discharge. When properly implemented, documentation of employee performance…

  11. Informative document waste plastics

    NARCIS (Netherlands)

    Nagelhout D; Sein AA; Duvoort GL

    1989-01-01

    This "Informative document waste plastics" forms part of a series of "informative documents waste materials". These documents are conducted by RIVM on the indstruction of the Directorate General for the Environment, Waste Materials Directorate, in behalf of the program of acti

  12. La Documentacion Automatica (Automated Documentation).

    Science.gov (United States)

    Levery, Francis

    1971-01-01

    Documentation centers are needed to handle the vast amount of scientific and technical information currently being issued. Such centers should be concerned both with handling inquiries in a particular field and with producing a general catalog of current information. Automatic analysis of texts by computers will be the best way to handle material,…

  13. Measuring Interestingness of Political Documents

    NARCIS (Netherlands)

    Azarbonyad, H.

    2016-01-01

    Political texts are pervasive on the Web covering laws and policies in national and supranational jurisdictions. Access to this data is crucial for government transparency and accountability to the population. The main aim of our research is developing a ranking method for political documents which

  14. Integrating image data into biomedical text categorization.

    Science.gov (United States)

    Shatkay, Hagit; Chen, Nawei; Blostein, Dorothea

    2006-07-15

    Categorization of biomedical articles is a central task for supporting various curation efforts. It can also form the basis for effective biomedical text mining. Automatic text classification in the biomedical domain is thus an active research area. Contests organized by the KDD Cup (2002) and the TREC Genomics track (since 2003) defined several annotation tasks that involved document classification, and provided training and test data sets. So far, these efforts focused on analyzing only the text content of documents. However, as was noted in the KDD'02 text mining contest-where figure-captions proved to be an invaluable feature for identifying documents of interest-images often provide curators with critical information. We examine the possibility of using information derived directly from image data, and of integrating it with text-based classification, for biomedical document categorization. We present a method for obtaining features from images and for using them-both alone and in combination with text-to perform the triage task introduced in the TREC Genomics track 2004. The task was to determine which documents are relevant to a given annotation task performed by the Mouse Genome Database curators. We show preliminary results, demonstrating that the method has a strong potential to enhance and complement traditional text-based categorization methods.
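
    One common way to combine image-derived features with text features, in the spirit of the method described above, is simply to concatenate the two feature blocks before training a classifier. The sketch below assumes hypothetical caption text, labels and per-document image descriptors; it is not the authors' actual feature set.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

captions = ["gel electrophoresis of mutant protein", "mouse embryo staining"]
labels = [1, 0]                       # hypothetical triage labels
image_feats = np.array([[0.8, 0.1],   # hypothetical per-document image descriptors
                        [0.2, 0.9]])

text_feats = TfidfVectorizer().fit_transform(captions)
fused = hstack([text_feats, csr_matrix(image_feats)]).tocsr()   # text + image features

clf = LogisticRegression().fit(fused, labels)
print(clf.predict(fused))
```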

  15. Document Analysis by Crosscount Approach

    Institute of Scientific and Technical Information of China (English)

    王海琴; 戴汝为

    1998-01-01

    In this paper a new feature called crosscount for document analysis is introduced. The feature crosscount is a function of white line segments with their start on the edge of document images. It reflects not only the contour of the image, but also the periodicity of white lines (background) and text lines in the document images. In complex printed-page layouts, there are different blocks such as textual, graphical, tabular, and so on. Of these blocks, textual ones have the most obvious periodicity, with their homogeneous white lines arranged regularly. This important property of textual blocks can be extracted by crosscount functions. Here the document layouts are classified into three classes on the basis of their physical structures. Then the definition and properties of the crosscount function are described. According to the classification of document layouts, the application of this new feature to the analysis and understanding of different types of document images is discussed.
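
    The exact definition of the crosscount feature is given in the paper; as a hedged approximation of the idea (white runs measured from the document edge, reflecting the background/text-line periodicity), the sketch below profiles a binarized image row by row. The test image and the interpretation are assumptions for illustration.

```python
import numpy as np

def crosscount_profile(binary_image):
    """For each scanline, measure the white run starting at the left edge and
    count the white segments across the row (1 = white background, 0 = ink)."""
    edge_runs, segment_counts = [], []
    for row in binary_image:
        run = 0
        for pixel in row:              # leading white run from the document edge
            if pixel == 1:
                run += 1
            else:
                break
        edge_runs.append(run)
        padded = np.concatenate(([0], row))
        # Count transitions into white to get the number of white segments.
        segment_counts.append(int(np.sum((padded[:-1] == 0) & (padded[1:] == 1))))
    return edge_runs, segment_counts

img = np.array([[1, 1, 0, 1, 1, 1],
                [1, 0, 0, 0, 1, 1]])
print(crosscount_profile(img))   # ([2, 1], [2, 2])
```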

  16. Electronic Document Management Using Inverted Files System

    Directory of Open Access Journals (Sweden)

    Suhartono Derwin

    2014-03-01

    Full Text Available The number of documents is increasing rapidly. Those documents exist not only on paper but also in electronic form. This can be seen from the data sample taken by the SpringerLink publisher in 2010, which showed an increase in the number of digital document collections from 2003 to mid-2010. How to manage them well therefore becomes an important need. This paper describes a new method of managing documents called the inverted files system. For electronic documents, the inverted files system is used so that documents can be searched over the Internet using a search engine. It can improve both the document search mechanism and the document storage mechanism.
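
    As a minimal sketch of the inverted files idea (each term maps to the documents that contain it, so search does not scan every document), consider the toy index below; tokenization and ranking are deliberately simplified.

```python
from collections import defaultdict

def build_inverted_index(documents):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def search(index, query):
    """Return ids of documents containing every query term (AND semantics)."""
    postings = [index.get(term, set()) for term in query.lower().split()]
    return set.intersection(*postings) if postings else set()

docs = {1: "electronic document management", 2: "paper based document archive"}
idx = build_inverted_index(docs)
print(search(idx, "document management"))   # {1}
```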

  17. ODQ: A Fluid Office Document Query Language

    Directory of Open Access Journals (Sweden)

    Xuhong Liu

    2015-06-01

    Full Text Available Fluid office documents, as semi-structured data often represented by Extensible Markup Language (XML), are important parts of Big Data. These office documents have different formats, and their matching Application Programming Interfaces (APIs) depend on the developing platform and versions, which causes difficulty in custom development and information retrieval from them. To solve this problem, we have been developing an office document query (ODQ) language which provides a uniform method to retrieve content from documents with different formats and versions. ODQ builds a common document model ontology to conceal the format details of documents and provides a uniform operation interface to handle office documents with different formats. The results show that ODQ has advantages in format independence and can facilitate users in developing document processing systems with good interoperability.

  18. Scheme Program Documentation Tools

    DEFF Research Database (Denmark)

    Nørmark, Kurt

    2004-01-01

    This paper describes and discusses two different Scheme documentation tools. The first is SchemeDoc, which is intended for documentation of the interfaces of Scheme libraries (APIs). The second is the Scheme Elucidator, which is for internal documentation of Scheme programs. Although the tools are separate and intended for different documentation purposes they are related to each other in several ways. Both tools are based on XML languages for tool setup and for documentation authoring. In addition, both tools rely on the LAML framework which---in a systematic way---makes an XML language available...

  19. Starlink Document Styles

    Science.gov (United States)

    Lawden, M. D.

    This document describes the various styles which are recommended for Starlink documents. It also explains how to use the templates which are provided by Starlink to help authors create documents in a standard style. This paper is concerned mainly with conveying the "look and feel" of the various styles of Starlink document rather than describing the technical details of how to produce them. Other Starlink papers give recommendations for the detailed aspects of document production, design, layout, and typography. The only style that is likely to be used by most Starlink authors is the Standard style.

  20. Traceability Method for Software Engineering Documentation

    Directory of Open Access Journals (Sweden)

    Nur Adila Azram

    2012-03-01

    Full Text Available Traceability has been widely discussed in the research community and has been a topic of interest in software engineering. Traceability in software documentation is one of the interesting topics to be researched further. It is important in software documentation to trace out the flow or process in all the documents, whether they depend on one another or not. In this paper, we present a traceability method for software engineering documentation. The objective of this research is to facilitate the tracing of software documentation.

  1. Planning Argumentative Texts

    CERN Document Server

    Huang, X

    1994-01-01

    This paper presents \\proverb\\, a text planner for argumentative texts. \\proverb\\'s main feature is that it combines global hierarchical planning and unplanned organization of text with respect to local derivation relations in a complementary way. The former splits the task of presenting a particular proof into subtasks of presenting subproofs. The latter simulates how the next intermediate conclusion to be presented is chosen under the guidance of the local focus.

  2. Mining text data

    CERN Document Server

    Aggarwal, Charu C

    2012-01-01

    Text mining applications have experienced tremendous advances because of web 2.0 and social networking applications. Recent advances in hardware and software technology have led to a number of unique scenarios where text mining algorithms are learned. "Mining Text Data" introduces an important niche in the text analytics field, and is an edited volume contributed by leading international researchers and practitioners focused on social networks & data mining. This book covers a wide swath of topics across social networks & data mining. Each chapter contains a comprehensive survey including

  3. Instant Sublime Text starter

    CERN Document Server

    Haughee, Eric

    2013-01-01

    A starter which teaches the basic tasks to be performed with Sublime Text with the necessary practical examples and screenshots. This book requires only basic knowledge of the Internet and basic familiarity with any one of the three major operating systems, Windows, Linux, or Mac OS X. However, as Sublime Text 2 is primarily a text editor for writing software, many of the topics discussed will be specifically relevant to software development. That being said, the Sublime Text 2 Starter is also suitable for someone without a programming background who may be looking to learn one of the tools of

  4. A new graph based text segmentation using Wikipedia for automatic text summarization

    Directory of Open Access Journals (Sweden)

    Mohsen Pourvali

    2012-01-01

    Full Text Available The technology of automatic document summarization is maturing and may provide a solution to the information overload problem. Nowadays, document summarization plays an important role in information retrieval. With a large volume of documents, presenting the user with a summary of each document greatly facilitates the task of finding the desired documents. Document summarization is a process of automatically creating a compressed version of a given document that provides useful information to users, and multi-document summarization is to produce a summary delivering the majority of information content from a set of documents about an explicit or implicit main topic. In this paper we use the knowledge base of Wikipedia and the words of the input text to create independent graphs. We then determine the importance of these graphs and of the sentences whose topics have high importance, and finally extract the sentences with the highest importance. The experimental results on the open benchmark datasets DUC01 and DUC02 show that our proposed approach can improve the performance compared to state-of-the-art summarization approaches.
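
    As a hedged sketch of the sentence-graph ranking step (without the Wikipedia knowledge base the paper also uses), the snippet below builds a similarity graph over sentences and ranks them with PageRank; the sample sentences and the TF-IDF similarity are assumptions for illustration.

```python
import networkx as nx
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

sentences = [
    "Automatic summarization produces a short summary of a document.",
    "A good summary keeps the main topics of the document.",
    "Wikipedia concepts can link sentences that share a topic.",
]

# Build a sentence graph weighted by cosine similarity of TF-IDF vectors.
tfidf = TfidfVectorizer().fit_transform(sentences)
sim = cosine_similarity(tfidf)
graph = nx.Graph()
graph.add_nodes_from(range(len(sentences)))
for i in range(len(sentences)):
    for j in range(i + 1, len(sentences)):
        if sim[i, j] > 0:
            graph.add_edge(i, j, weight=float(sim[i, j]))

scores = nx.pagerank(graph, weight="weight")
best = max(scores, key=scores.get)
print("Most important sentence:", sentences[best])
```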

  5. TEXT CLASSIFICATION TOWARD A SCIENTIFIC FORUM

    Institute of Scientific and Technical Information of China (English)

    2007-01-01

    Text mining, also known as discovering knowledge from text, which has emerged as a possible solution to the current information explosion, refers to the process of extracting non-trivial and useful patterns from unstructured text. Among the general tasks of text mining such as text clustering, summarization, etc., text classification is a subtask of intelligent information processing, which employs supervised learning to construct a classifier from training text with which to predict the class of unlabeled text. Because of its simplicity and objectivity in performance evaluation, text classification is usually used as a standard tool to determine the advantage or weakness of a text processing method, such as text representation, text feature selection, etc. In this paper, text classification is carried out to classify the Web documents collected from the XSSC Website (http://www.xssc.ac.cn). The performance of support vector machine (SVM) and back propagation neural network (BPNN) is compared on this task. Specifically, binary text classification and multi-class text classification were conducted on the XSSC documents. Moreover, the classification results of both methods are combined to improve the accuracy of classification. An experiment is conducted to show that BPNN can compete with SVM in binary text classification; but for multi-class text classification, SVM performs much better. Furthermore, the classification is improved in both the binary and multi-class settings with the combined method.
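
    A hedged sketch of the comparison and the combination step is given below, using scikit-learn's SVC and MLPClassifier (as a stand-in for the BPNN) on a small public newsgroup corpus instead of the XSSC documents; the probability-averaging combination is one plausible reading of "combined", not necessarily the scheme used in the paper. The example downloads the 20 Newsgroups data on first run.

```python
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

# Stand-in corpus (the paper used Web documents collected from the XSSC site).
cats = ["sci.space", "rec.sport.baseball"]
train = fetch_20newsgroups(subset="train", categories=cats)
test = fetch_20newsgroups(subset="test", categories=cats)

vec = TfidfVectorizer(max_features=5000)
Xtr, Xte = vec.fit_transform(train.data), vec.transform(test.data)

svm = SVC(probability=True).fit(Xtr, train.target)
bpnn = MLPClassifier(hidden_layer_sizes=(64,), max_iter=300).fit(Xtr, train.target)

print("SVM accuracy: ", svm.score(Xte, test.target))
print("BPNN accuracy:", bpnn.score(Xte, test.target))

# Simple combination: average the two classifiers' class probabilities.
combined = (svm.predict_proba(Xte) + bpnn.predict_proba(Xte)) / 2
print("Combined accuracy:", (combined.argmax(axis=1) == test.target).mean())
```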

  6. Making Sense of Texts

    Science.gov (United States)

    Harper, Rebecca G.

    2014-01-01

    This article addresses the triadic nature regarding meaning construction of texts. Grounded in Rosenblatt's (1995; 1998; 2004) Transactional Theory, research conducted in an undergraduate Language Arts curriculum course revealed that when presented with unfamiliar texts, students used prior experiences, social interactions, and literary strategies…

  7. Systematic text condensation

    DEFF Research Database (Denmark)

    Malterud, Kirsti

    2012-01-01

    To present background, principles, and procedures for a strategy for qualitative analysis called systematic text condensation and discuss this approach compared with related strategies.

  8. Linguistics in Text Interpretation

    DEFF Research Database (Denmark)

    Togeby, Ole

    2011-01-01

    A model for how text interpretation proceeds from what is pronounced, through what is said, to what is communicated, and a definition of the concepts 'presupposition' and 'implicature'.

  9. Text mining for the biocuration workflow.

    Science.gov (United States)

    Hirschman, Lynette; Burns, Gully A P C; Krallinger, Martin; Arighi, Cecilia; Cohen, K Bretonnel; Valencia, Alfonso; Wu, Cathy H; Chatr-Aryamontri, Andrew; Dowell, Karen G; Huala, Eva; Lourenço, Anália; Nash, Robert; Veuthey, Anne-Lise; Wiegers, Thomas; Winter, Andrew G

    2012-01-01

    Molecular biology has become heavily dependent on biological knowledge encoded in expert curated biological databases. As the volume of biological literature increases, biocurators need help in keeping up with the literature; (semi-) automated aids for biocuration would seem to be an ideal application for natural language processing and text mining. However, to date, there have been few documented successes for improving biocuration throughput using text mining. Our initial investigations took place for the workshop on 'Text Mining for the BioCuration Workflow' at the third International Biocuration Conference (Berlin, 2009). We interviewed biocurators to obtain workflows from eight biological databases. This initial study revealed high-level commonalities, including (i) selection of documents for curation; (ii) indexing of documents with biologically relevant entities (e.g. genes); and (iii) detailed curation of specific relations (e.g. interactions); however, the detailed workflows also showed many variabilities. Following the workshop, we conducted a survey of biocurators. The survey identified biocurator priorities, including the handling of full text indexed with biological entities and support for the identification and prioritization of documents for curation. It also indicated that two-thirds of the biocuration teams had experimented with text mining and almost half were using text mining at that time. Analysis of our interviews and survey provide a set of requirements for the integration of text mining into the biocuration workflow. These can guide the identification of common needs across curated databases and encourage joint experimentation involving biocurators, text mining developers and the larger biomedical research community.

  10. Recognition of Text Image Using Multilayer Perceptron

    OpenAIRE

    Vijendra, Singh; Vasudeva, Nisha; Parashar, Hem Jyotsana

    2016-01-01

    The biggest challenge in the field of image processing is to recognize documents in both printed and handwritten format. Optical Character Recognition (OCR) is a type of document image analysis in which a scanned digital image containing either machine-printed or handwritten script is input into an OCR software engine and translated into an editable, machine-readable digital text format. A neural network is designed to model the way in which the brain performs a particular task or function of interest...

  11. Tobacco documents research methodology.

    Science.gov (United States)

    Anderson, Stacey J; McCandless, Phyra M; Klausner, Kim; Taketa, Rachel; Yerger, Valerie B

    2011-05-01

    Tobacco documents research has developed into a thriving academic enterprise since its inception in 1995. The technology supporting tobacco documents archiving, searching and retrieval has improved greatly since that time, and consequently tobacco documents researchers have considerably more access to resources than was the case when researchers had to travel to physical archives and/or electronically search poorly and incompletely indexed documents. The authors of the papers presented in this supplement all followed the same basic research methodology. Rather than leave the reader of the supplement to read the same discussion of methods in each individual paper, presented here is an overview of the methods all authors followed. In the individual articles that follow in this supplement, the authors present the additional methodological information specific to their topics. This brief discussion also highlights technological capabilities in the Legacy Tobacco Documents Library and updates methods for organising internal tobacco documents data and findings.

  12. Extracting Text from Video

    Directory of Open Access Journals (Sweden)

    Jayshree Ghorpade

    2011-09-01

    Full Text Available The text data present in images and video contain certain useful information for automatic annotation, indexing, and structuring of images. However, variations of the text due to differences in text style, font, size, orientation, and alignment, as well as low image contrast and complex backgrounds, make automatic text extraction an extremely difficult and challenging job. A large number of techniques have been proposed to address this problem, and the purpose of this paper is to design algorithms for each phase of extracting text from a video using Java libraries and classes. First we frame the input video into a stream of images using the Java Media Framework (JMF), with the input being real time or a video from the database. Then we apply preprocessing algorithms to convert the images to gray scale and remove disturbances like superimposed lines over the text, discontinuities, and dots. We then continue with the algorithms for localization, segmentation and recognition, for which we use the neural network pattern matching technique. The performance of our approach is demonstrated by presenting experimental results for a set of static images.
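
    The paper frames the video with the Java Media Framework; as a hedged, language-consistent sketch of the same first two stages (frame sampling and grayscale conversion), the snippet below uses OpenCV in Python with a hypothetical input file, leaving localization, segmentation and recognition out.

```python
import cv2

def frames_to_grayscale(video_path, step=30):
    """Sample every `step`-th frame of a video and convert it to grayscale,
    as a preprocessing stage before text localization and recognition."""
    capture = cv2.VideoCapture(video_path)
    gray_frames = []
    index = 0
    while True:
        ok, frame = capture.read()
        if not ok:
            break
        if index % step == 0:
            gray_frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY))
        index += 1
    capture.release()
    return gray_frames

frames = frames_to_grayscale("sample_video.mp4")   # hypothetical input file
print(len(frames), "frames sampled")
```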

  13. EXTRACTING TEXT FROM VIDEO

    Directory of Open Access Journals (Sweden)

    Jayshree Ghorpade

    2011-06-01

    Full Text Available The text data present in images and video contain certain useful information for automatic annotation, indexing, and structuring of images. However, variations of the text due to differences in text style, font, size, orientation, and alignment, as well as low image contrast and complex backgrounds, make automatic text extraction an extremely difficult and challenging job. A large number of techniques have been proposed to address this problem, and the purpose of this paper is to design algorithms for each phase of extracting text from a video using Java libraries and classes. First we frame the input video into a stream of images using the Java Media Framework (JMF), with the input being real time or a video from the database. Then we apply preprocessing algorithms to convert the images to gray scale and remove disturbances like superimposed lines over the text, discontinuities, and dots. We then continue with the algorithms for localization, segmentation and recognition, for which we use the neural network pattern matching technique. The performance of our approach is demonstrated by presenting experimental results for a set of static images.

  14. CFO Payment Document Management

    Data.gov (United States)

    US Agency for International Development — Paperless management will enable the CFO to create, store, and access various financial documents electronically. This capability will reduce time looking for...

  15. CAED Document Repository

    Data.gov (United States)

    U.S. Environmental Protection Agency — Compliance Assurance and Enforcement Division Document Repository (CAEDDOCRESP) provides internal and external access to Inspection Records, Enforcement Actions, and...

  16. Modeling Documents with Event Model

    Directory of Open Access Journals (Sweden)

    Longhui Wang

    2015-08-01

    Full Text Available Currently deep learning has made great breakthroughs in visual and speech processing, mainly because it draws lessons from the hierarchical way in which the brain deals with images and speech. In the field of NLP, topic models are one of the important ways of modeling documents. Topic models are built on a generative model that clearly does not match the way humans write. In this paper, we propose the Event Model, which is unsupervised and based on the language processing mechanisms of neurolinguistics, to model documents. In the Event Model, documents are descriptions of concrete or abstract events seen, heard, or sensed by people, and words are objects in the events. The Event Model has two stages: word learning and dimensionality reduction. Word learning learns the semantics of words based on deep learning. Dimensionality reduction is the process of representing a document as a low-dimensional vector by a linear model that is completely different from topic models. The Event Model achieves state-of-the-art results on document retrieval tasks.

  17. About CABI Full Text

    Institute of Scientific and Technical Information of China (English)

    2014-01-01

    Centre for Agriculture and Bioscience International (CABI) is a not-for-profit international Agricultural Information Institute with headquarters in Britain. It aims to improve people’s lives by providing information and applying scientific expertise to solve problems in agriculture and the environment. CABI Full-text is one of the publishing products of CABI. CABI’s full text repository is growing rapidly

  18. Colored-sketch of Text Information

    Directory of Open Access Journals (Sweden)

    Beomjin Kim

    2002-01-01

    Full Text Available This paper presents an information visualization method, which transforms text into abstracted visual representations. The proposed color-coding algorithm converts text into a sequence of colored icons that inform users about the distributional patterns of given queries, as well as the structural overview of a document simultaneously. By presenting the compact, but instructive visual abstraction of texts concurrently, users can compare multiple documents intuitively while alleviating the need to reference the underlying text. The system provides interactive navigation tools to support users' decision-making processes - including multi-level viewing, a tree hierarchy recording previous search activities, and suggestive words for refinement of the search scope. An experimental study evaluating this visual approach for delivering search results has been conducted on text corpora in comparison with a traditional information retrieval system. By informing search results to clientele in a perceptive form, the users' performance in obtaining desired information has been improved, while maintaining the accuracy.

  19. Enriching software architecture documentation

    NARCIS (Netherlands)

    Jansen, Anton; Avgeriou, Paris; Ven, Jan Salvador van der

    2009-01-01

    The effective documentation of Architectural Knowledge (AK) is one of the key factors in leveraging the paradigm shift toward sharing and reusing AK. However, current documentation approaches have severe shortcomings in capturing the knowledge of large and complex systems and subsequently facilitating

  20. Human Document Project

    NARCIS (Netherlands)

    Vries, de J.; Abelmann, L.; Manz, A.; Elwenspoek, M.C.

    2012-01-01

    “The Human Document Project” is a project which tries to answer all of the questions related to preserving information about the human race for tens of generations of humans to come, or maybe even for a future intelligence which can emerge in the coming thousands of years. This document mainly focuses

  1. IDC System Specification Document.

    Energy Technology Data Exchange (ETDEWEB)

    Clifford, David J.

    2014-12-01

    This document contains the system specifications derived to satisfy the system requirements found in the IDC System Requirements Document for the IDC Reengineering Phase 2 project. Revisions: Version V1.0, 12/2014, Author/Team: IDC Reengineering Project Team, Revision Description: Initial delivery, Authorized by: M. Harris.

  2. Urdu Text Classification using Majority Voting

    Directory of Open Access Journals (Sweden)

    Muhammad Usman

    2016-08-01

    Full Text Available Text classification is a tool to assign predefined categories to text documents using supervised machine learning algorithms. It has various practical applications such as spam detection, sentiment detection, and detection of a natural language. Based on this idea, we applied five well-known classification techniques to an Urdu language corpus and assigned a class to the documents using majority voting. The corpus contains 21769 news documents in seven categories (Business, Entertainment, Culture, Health, Sports, and Weird. The algorithms were not able to work directly on the data, so we applied preprocessing techniques like tokenization, stop word removal and a rule-based stemmer. After preprocessing, 93400 features were extracted from the data to apply machine learning algorithms. Furthermore, we achieved up to 94% precision and recall using majority voting.
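
    A hedged sketch of the majority-voting step is shown below using scikit-learn's VotingClassifier over five common classifiers on TF-IDF features; the tiny English stand-in corpus, the particular five classifiers and their settings are assumptions, not the configuration used on the Urdu corpus.

```python
from sklearn.ensemble import VotingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier

# Tiny stand-in corpus; the paper used 21769 Urdu news documents.
docs = ["match won by the home team", "stock prices fell sharply",
        "the singer released a new album", "the market opened higher"]
labels = ["Sports", "Business", "Entertainment", "Business"]

voter = make_pipeline(
    TfidfVectorizer(),
    VotingClassifier(
        estimators=[("nb", MultinomialNB()),
                    ("lr", LogisticRegression(max_iter=1000)),
                    ("svm", LinearSVC()),
                    ("dt", DecisionTreeClassifier()),
                    ("knn", KNeighborsClassifier(n_neighbors=1))],
        voting="hard",   # each document gets the class most classifiers vote for
    ),
)
voter.fit(docs, labels)
print(voter.predict(["the team lost the final match"]))
```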

  3. About CABI Full Text

    Institute of Scientific and Technical Information of China (English)

    2013-01-01

    Centre for Agriculture and Bioscience International (CABI) is a not-for-profit international Agricultural Information Institute with headquarters in Britain. It aims to improve people’s lives by providing information and applying scientific expertise to solve problems in agriculture and the environment. CABI Full-text is one of the publishing products of CABI. CABI’s full text repository is growing rapidly and has now been integrated into all our databases including CAB Abstracts, Global Health

  4. About CABI Full Text

    Institute of Scientific and Technical Information of China (English)

    2013-01-01

    Centre for Agriculture and Bioscience International (CABI) is a not-for-profit international Agricultural Information Institute with headquarters in Britain. It aims to improve people’s lives by providing information and applying scientific expertise to solve problems in agriculture and the environment. CABI Full-text is one of the publishing products of CABI. CABI’s full text repository is growing rapidly and has now been integrated into all our databases including CAB Abstracts, Global Health, our Internet Resources and Jour-

  5. How Much Handwritten Text Is Needed for Text-Independent Writer Verification and Identification

    NARCIS (Netherlands)

    Brink, Axel; Bulacu, Marius; Schomaker, Lambert

    2008-01-01

    The performance of off-line text-independent writer verification and identification increases when the documents contain more text. This relation was examined by repeatedly conducting writer verification and identification performance tests while gradually increasing the amount of text on the pages.

  6. Text Skimming: The Process and Effectiveness of Foraging through Text under Time Pressure

    Science.gov (United States)

    Duggan, Geoffrey B.; Payne, Stephen J.

    2009-01-01

    Is Skim reading effective? How do readers allocate their attention selectively? The authors report 3 experiments that use expository texts and allow readers only enough time to read half of each document. Experiment 1 found that, relative to reading half the text, skimming improved memory for important ideas from a text but did not improve memory…

  7. Modified Approach to Transform Arc Form Text to Linear Form Text: A Preprocessing Stage for OCR

    Directory of Open Access Journals (Sweden)

    Vijayashree C S

    2014-08-01

    Full Text Available Arc-form-text is an artistic text which is quite common in several documents such as certificates, advertisements and historical documents. OCRs fail to read such arc-form-text and it is necessary to transform it to linear-form-text at the preprocessing stage. In this paper, we present a modification to an existing transformation model for better readability by OCRs. The method takes the segmented arc-form-text as input. Initially two concentric ellipses are approximated to enclose the arc-form-text, and later the modified transformation model transforms the text from arc form to linear form. The proposed method is implemented on several upper semi-circular arc-form-text inputs and the readability of the transformed text is analyzed with an OCR
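
    The paper's transformation is built on two approximated concentric ellipses; as a simplified, hedged stand-in for that model, the sketch below unwraps an arc-shaped region with a circular polar mapping in OpenCV, which straightens upper semi-circular text when the arc is roughly circular. The image, centre and radii are placeholders.

```python
import cv2
import numpy as np

def unwrap_arc_text(image, center, max_radius, strip_size=(800, 200)):
    """Unwrap an arc-shaped region around `center` into a rectangular strip
    using a polar mapping (a simplified stand-in for the ellipse-based model)."""
    width, height = strip_size
    polar = cv2.warpPolar(image, (width, height), center, max_radius,
                          cv2.WARP_POLAR_LINEAR)
    # In the polar image each row is one radial line, so arc text runs vertically;
    # rotate the strip so the unwrapped text reads horizontally.
    return cv2.rotate(polar, cv2.ROTATE_90_COUNTERCLOCKWISE)

img = np.full((600, 600), 255, dtype=np.uint8)      # blank stand-in document
flat = unwrap_arc_text(img, center=(300, 300), max_radius=280)
print(flat.shape)
```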

  8. Reading Authentic Texts

    DEFF Research Database (Denmark)

    Balling, Laura Winther

    2013-01-01

    Most research on cognates has focused on words presented in isolation that are easily defined as cognate between L1 and L2. In contrast, this study investigates what counts as cognate in authentic texts and how such cognates are read. Participants with L1 Danish read news articles in their highly...

  9. Texts On-Line.

    Science.gov (United States)

    Thomas, Jean-Jacques

    1993-01-01

    Maintains that the study of signs is divided between those scholars who use the Saussurian binary sign (semiology) and those who prefer the Peirce tripartite sign (semiotics). Concludes that neither the Saussurian nor Peircian analysis methods can produce a semiotic interpretation based on a hierarchy of the text's various components. (CFR)

  10. About CABI Full Text

    Institute of Scientific and Technical Information of China (English)

    2013-01-01

    Centre for Agriculture and Bioscience International (CABI) is a not-for-profit international Agricultural Information Institute with headquarters in Britain. It aims to improve people’s lives by providing information and applying scientific expertise to solve problems in agriculture and the environment. CABI Full-text is one of the publishing products of CABI.

  11. About CABI Full Text

    Institute of Scientific and Technical Information of China (English)

    2013-01-01

    Centre for Agriculture and Bioscience International (CABI) is a not-for-profit international Agricultural Information Institute with headquarters in Britain. It aims to improve people’s lives by providing information and applying scientific expertise to solve problems in agriculture and the environment. CABI Full-text is one of the publishing products of CABI.

  12. About CABI Full Text

    Institute of Scientific and Technical Information of China (English)

    2011-01-01

    Centre for Agriculture and Bioscience International (CABI) is a not-for-profit international Agricultural Information Institute with headquarters in Britain. It aims to improve people's lives by providing information and applying scientific expertise to solve problems in agriculture and the environment. CABI Full-text is one of the publishing products of CABI.

  13. Summarizing Expository Texts

    Science.gov (United States)

    Westby, Carol; Culatta, Barbara; Lawrence, Barbara; Hall-Kenyon, Kendra

    2010-01-01

    Purpose: This article reviews the literature on students' developing skills in summarizing expository texts and describes strategies for evaluating students' expository summaries. Evaluation outcomes are presented for a professional development project aimed at helping teachers develop new techniques for teaching summarization. Methods: Strategies…

  14. Text analysis and computers

    OpenAIRE

    1995-01-01

    Content: Erhard Mergenthaler: Computer-assisted content analysis (3-32); Udo Kelle: Computer-aided qualitative data analysis: an overview (33-63); Christian Mair: Machine-readable text corpora and the linguistic description of languages (64-75); Jürgen Krause: Principles of content analysis for information retrieval systems (76-99); Conference Abstracts (100-131).

  15. Polymorphous Perversity in Texts

    Science.gov (United States)

    Johnson-Eilola, Johndan

    2012-01-01

    Here's the tricky part: If we teach ourselves and our students that texts are made to be broken apart, remixed, remade, do we lose the polymorphous perversity that brought us pleasure in the first place? Does the pleasure of transgression evaporate when the borders are opened?

  16. Text as Image.

    Science.gov (United States)

    Woal, Michael; Corn, Marcia Lynn

    As electronically mediated communication becomes more prevalent, print is regaining the original pictorial qualities which graphemes (written signs) lost when primitive pictographs (or picture writing) and ideographs (simplified graphemes used to communicate ideas as well as to represent objects) evolved into first written, then printed, texts of…

  17. The Emar Lexical Texts

    NARCIS (Netherlands)

    Gantzert, Merijn

    2011-01-01

    This four-part work provides a philological analysis and a theoretical interpretation of the cuneiform lexical texts found in the Late Bronze Age city of Emar, in present-day Syria. These word and sign lists, commonly dated to around 1100 BC, were almost all found in the archive of a single school.

  18. Text Classification: A Sequential Reading Approach

    CERN Document Server

    Dulac-Arnold, Gabriel; Gallinari, Patrick

    2011-01-01

    We propose to model the text classification process as a sequential decision process. In this process, an agent learns to classify documents into topics while reading the document sentences sequentially, and learns to stop as soon as enough information has been read for deciding. The proposed algorithm is based on modeling text classification as a Markov Decision Process and learns by using Reinforcement Learning. Experiments on four different classical mono-label corpora show that the proposed approach performs comparably to classical SVM approaches for large training sets, and better for small training sets. In addition, the model automatically adapts its reading process to the quantity of training information provided.

  19. Data Security by Preprocessing the Text with Secret Hiding

    Directory of Open Access Journals (Sweden)

    Ajit Singh

    2012-06-01

    Full Text Available With the advent of the Internet, an open forum, the massive increase in data travelling across networks has made secure transmission an issue. Cryptography covers the many encryption methods used to make data secure, but the transmission of secured data remains an intricate task. Steganography complements it by transmitting data without revealing that any secret data is present. This research paper provides a mechanism which enhances the security of data by using a crypto + stegano combination to increase the security level without revealing the fact that some secret data is being shared across networks. In the first phase, data is encrypted by manipulating the text using ASCII codes and some randomly generated strings for the codes, derived from a few parameters. Steganography, which is related to cryptography, forms the basis for many data hiding techniques. The data is encrypted using the proposed approach and the message is then hidden in N random images with the help of a perfect hashing scheme, which increases the security of the message before sending it across the medium. Thus the sending and receiving of the message will be safe and secure with increased confidentiality.
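
    As a hedged sketch of only the first phase (manipulating ASCII codes with a randomly generated key string), the snippet below uses a simple XOR of character codes; the paper's actual manipulation, and the later image-embedding step with perfect hashing, are not reproduced here.

```python
import secrets
import string

def generate_key(length):
    """Random key string drawn from ASCII letters and digits."""
    alphabet = string.ascii_letters + string.digits
    return "".join(secrets.choice(alphabet) for _ in range(length))

def ascii_encrypt(plaintext, key):
    """Manipulate the ASCII codes of the text with the key (XOR stand-in)."""
    return [ord(c) ^ ord(key[i % len(key)]) for i, c in enumerate(plaintext)]

def ascii_decrypt(codes, key):
    return "".join(chr(v ^ ord(key[i % len(key)])) for i, v in enumerate(codes))

key = generate_key(16)
cipher = ascii_encrypt("secret message", key)
print(cipher)
print(ascii_decrypt(cipher, key))   # "secret message"
```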

  20. ON-LINE DOCUMENTS CONTENT MANAGEMENT

    Directory of Open Access Journals (Sweden)

    VASILESCU RAMONA VIOLETA

    2010-05-01

    Full Text Available This paper outlines the steps and technologies used in developing an on-line application server with many desktop clients, and with high power processing for a wide range of input documents to obtain searchable documents on the highest portability standards, PDF and PDF/A.

  1. Illumination Compensation Algorithm for Unevenly Lighted Document Segmentation

    Directory of Open Access Journals (Sweden)

    Ju Zhiyong

    2013-07-01

    Full Text Available For the problem of segmenting unevenly lighted document images, this paper proposes an illumination compensation segmentation algorithm which can effectively segment unevenly lighted documents. The illumination compensation method converts an unevenly lighted document image into an equivalent evenly lighted document image, and the evenly lighted document is then segmented directly. Experimental results show that the proposed method can produce accurate evenly lighted document images, so that the document can be segmented accurately, and that it processes unevenly lighted document images more efficiently than traditional binarization methods. The algorithm effectively overcomes the difficulty in handling uneven lighting and enhances segmentation quality considerably.
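
    One plausible way to realize the compensation step (the paper's exact formulation may differ) is to estimate the slowly varying illumination with a heavy blur, divide it out, and then binarize; the sketch below does this with OpenCV and a hypothetical input file.

```python
import cv2

def compensate_and_segment(gray):
    """Estimate uneven illumination with a large blur, normalize it away,
    then segment with Otsu thresholding (one plausible compensation scheme)."""
    background = cv2.GaussianBlur(gray, (101, 101), 0)       # illumination estimate
    compensated = cv2.divide(gray, background, scale=255)    # flatten the lighting
    _, binary = cv2.threshold(compensated, 0, 255,
                              cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    return compensated, binary

gray = cv2.imread("uneven_document.png", cv2.IMREAD_GRAYSCALE)  # hypothetical file
if gray is not None:
    evenly_lit, segmented = compensate_and_segment(gray)
    cv2.imwrite("segmented.png", segmented)
```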

  2. Texts of presentation

    Energy Technology Data Exchange (ETDEWEB)

    Magnin, G.; Vidolov, K.; Dufour-Fallot, B.; Dewarrat, Th.; Rose, T.; Favatier, A.; Gazeley, D.; Pujol, T.; Worner, D.; Van de Wel, E.; Revaz, J.M.; Clerfayt, G.; Creedy, A.; Moisan, F.; Geissler, M.; Isbell, P.; Macaluso, M.; Litzka, V.; Gillis, W.; Jarvis, I.; Gorg, M.; Bebie, B.

    2004-07-01

    Implementing a sustainable local energy policy involves a long term reflection on the general interest, energy efficiency, distributed generation and environmental protection. Providing services on a market involves looking for activities that are profitable, if possible in the 'short-term'. The aim of this conference is to analyse the possibility of reconciling these apparently contradictory requirements and how this can be achieved. This conference brings together the best specialists from European municipalities as well as important partners for local authorities (energy agencies, service companies, institutions, etc.) in order to discuss the public-private partnerships concerning the various functions that municipalities may perform in the energy field as consumers and customers, planners and organizers of urban space and rousers as regards inhabitants and economic players of their areas. This document contains the summaries of the following presentations: 1 - Performance contracting: Bulgarian municipalities use private capital for energy efficiency improvement (K. VIDOLOV, Varna (BG)), Contracting experiences in Swiss municipalities: consistent energy policy thanks to the Energy-city label (B. DUFOUR-FALLOT and T. DEWARRAT (CH)), Experience of contracting in the domestic sector (T. ROSE (GB)); 2 - Public procurement: Multicolor electricity (A. FAVATIER (CH)), Tendering for new green electricity capacity (D. GAZELEY (GB)), The Barcelona solar thermal ordinance (T. PUJOL (ES)); 3 - Urban planning and schemes: Influencing energy issues through urban planning (D. WOERNER (DE)), Tendering for the supply of energy infrastructure (E. VAN DE WEL (NL)), Concessions and public utility warranty (J.M. REVAZ (CH)); 4 - Certificate schemes: the market of green certificates in Wallonia region in a liberalized power market (G. CLERFAYT (BE)), The Carbon Neutral(R) project: a voluntary certification scheme with opportunity for implementation in other European

  3. IR and OLAP in XML document warehouses

    DEFF Research Database (Denmark)

    Perez, Juan Manuel; Pedersen, Torben Bach; Berlanga, Rafael

    2005-01-01

    In this paper we propose to combine IR and OLAP (On-Line Analytical Processing) technologies to exploit a warehouse of text-rich XML documents. In the system we plan to develop, a multidimensional implementation of a relevance modeling document model will be used for interactively querying...... the warehouse by allowing navigation in the structure of documents and in a concept hierarchy of query terms. The facts described in the relevant documents will be ranked and analyzed in a novel OLAP cube model able to represent and manage facts with relevance indexes....

  4. Weaving with text

    DEFF Research Database (Denmark)

    Hagedorn-Rasmussen, Peter

    This paper explores how a school principal by means of practical authorship creates reservoirs of language that provide a possible context for collective sensemaking. The paper draws upon a field study in which a school principal, and his managerial team, was shadowed in a period of intensive changes. The paper explores how the manager weaves with text, extracted from stakeholders, administration, politicians, employees, public discourse etc., as a means of creating a new fabric, a texture, of diverse perspectives that aims for collective sensemaking.

  5. Transportation System Requirements Document

    Energy Technology Data Exchange (ETDEWEB)

    1993-09-01

    This Transportation System Requirements Document (Trans-SRD) describes the functions to be performed by and the technical requirements for the Transportation System to transport spent nuclear fuel (SNF) and high-level radioactive waste (HLW) from Purchaser and Producer sites to a Civilian Radioactive Waste Management System (CRWMS) site, and between CRWMS sites. The purpose of this document is to define the system-level requirements for Transportation consistent with the CRWMS Requirement Document (CRD). These requirements include design and operations requirements to the extent they impact on the development of the physical segments of Transportation. The document also presents an overall description of Transportation, its functions, its segments, and the requirements allocated to the segments and the system-level interfaces with Transportation. The interface identification and description are published in the CRWMS Interface Specification.

  6. Software Document Inventory Program

    Science.gov (United States)

    Merwarth, P. D.

    1984-01-01

    The program offers ways to file and locate sources of reference. The DOCLIB system consists of two parts to serve the needs of two types of users: the general user and the librarian. The DOCLIB system provides users with an interactive, menu-driven document inventory capability.

  7. NCDC Archive Documentation Manuals

    Data.gov (United States)

    National Oceanic and Atmospheric Administration, Department of Commerce — The National Climatic Data Center Tape Deck Documentation library is a collection of over 400 manuals describing NCDC's digital holdings (both historic and current)....

  8. Methodological Aspects of Architectural Documentation

    Directory of Open Access Journals (Sweden)

    Arivaldo Amorim

    2011-12-01

    Full Text Available This paper discusses the methodological approach that has been under development in the state of Bahia, Brazil, since 2003 for the documentation of architectural and urban sites using extensive digital technologies. Bahia has a vast territory with important architectural ensembles ranging from the sixteenth century to the present day. As part of this heritage is constructed of raw earth and wood, it is very sensitive to various deleterious agents, so it is critical to document this collection while it is under threat. To conduct these activities, diverse digital technologies that could be used in the documentation process are being tested. The task is being carried out as academic research, with few financial resources, by scholarship students and some volunteers. Several technologies are tested, ranging from the simplest to the more sophisticated, in the main stages of the documentation project: overall work planning, data acquisition, processing and management, and ultimately control and evaluation of the work. The activities that motivated this paper are being conducted in the cities of Rio de Contas and Lençóis in the Chapada Diamantina, located 420 km and 750 km from Salvador respectively, in the city of Cachoeira in the Recôncavo Baiano area, 120 km from Salvador, the capital of Bahia state, and in the Pelourinho neighbourhood of the historic capital. Part of the material produced can be consulted on the website: <www.lcad.ufba.br>.

  9. Document Flash Thermography

    OpenAIRE

    Larsen, Cory; Baker, Doran

    2011-01-01

    This paper presents an extension of flash thermography techniques to the analysis of documents. Motivation for this research is to develop the ability to reveal covered writings in archaeological artifacts such as the Codex Selden or Egyptian Cartonnage. An emphasis is placed on evaluating several common existing signal processing techniques for their effectiveness in enhancing subsurface writings found within a set of test documents. These processing techniques include: contrast stretching, ...

  10. Document Flash Thermography

    OpenAIRE

    Larsen, Cory A.

    2011-01-01

    This thesis presents the application of flash thermography techniques to the analysis of documents. The motivation for this research is to develop the ability to non-destructively reveal covered writings in archaeological artifacts such as the Codex Selden or Egyptian cartonnage. Current common signal processing techniques are evaluated for their effectiveness in enhancing subsurface writings found within a set of test documents. These processing techniques include: false colorization, contra...

  11. Metacomprehension of text material.

    Science.gov (United States)

    Maki, R H; Berry, S L

    1984-10-01

    Subjects' abilities to predict future multiple-choice test performance after reading sections of text were investigated in two experiments. In Experiment 1, subjects who scored above median test performance showed some accuracy in their predictions of that test performance. They gave higher mean ratings to material related to correct than to incorrect test answers. Subjects who scored below median test performance did not show this prediction accuracy. The retention interval between reading and the test was manipulated in Experiment 2. Subjects who were tested after at least a 24-hr delay showed results identical to those of Experiment 1. However, when subjects were tested immediately after reading, subjects above and below median test performance gave accurate predictions for the first immediate test. In contrast, both types of subjects gave inaccurate predictions for the second immediate test. Structural variables, such as length, serial position, and hierarchical level of the sections of text were related to subjects' predictions. These variables, in general, were not related to test performance, although the predictions were related to test performance in the conditions described above.

  12. Document reconstruction by layout analysis of snippets

    Science.gov (United States)

    Kleber, Florian; Diem, Markus; Sablatnig, Robert

    2010-02-01

    Document analysis is done to analyze entire forms (e.g. intelligent form analysis, table detection) or to describe the layout/structure of a document. Skew detection of scanned documents is also performed to support OCR algorithms that are sensitive to skew. In this paper document analysis is applied to snippets of torn documents to calculate features for the reconstruction. Documents can either be destroyed intentionally to make the printed content unavailable (e.g. tax fraud investigation, business crime) or due to time-induced degeneration of ancient documents (e.g. bad storage conditions). Current reconstruction methods for manually torn documents deal with the shape, inpainting and texture synthesis techniques. In this paper, the possibility of using document analysis techniques on snippets to support the matching algorithm with additional features is shown. This implies a rotational analysis, a color analysis and a line detection. As future work it is planned to extend the feature set with the paper type (blank, checked, lined), the type of the writing (handwritten vs. machine printed) and the text layout of a snippet (text size, line spacing). Preliminary results show that these pre-processing steps can be performed reliably on a real dataset consisting of 690 snippets.

  13. Automatic handwriting identification on medieval documents

    NARCIS (Netherlands)

    Bulacu, M.L.; Schomaker, L.R.B.

    2007-01-01

    In this paper, we evaluate the performance of text-independent writer identification methods on a handwriting dataset containing medieval English documents. Applicable identification rates are achieved by combining textural features (joint directional probability distributions) with allographic features.

  14. New Challenges of the Documentation in Media

    Directory of Open Access Journals (Sweden)

    Antonio García Jiménez

    2015-07-01

    Full Text Available This special issue, presented by index.comunicación, is focused on media-related information & documentation. This field undergoes constant and profound changes, especially visible in documentation processes: a situation characterized by the existence of tablets, smartphones and applications, by the almost complete digitization of traditional documents, and by the crisis of the press business model, which involves mutations in journalists’ tasks and in their relationship with Documentation. Papers included in this special issue focus on some of the concerns in this domain: the progressive autonomy of the journalist in accessing information sources, the role of press offices as documentation sources, the search for information on the web, the situation of media blogs, the viability of elements of information architecture in smart TV, and the development of social TV and its connection to Documentation.

  15. Improving text recall with multiple summaries

    NARCIS (Netherlands)

    Meij, van der Hans; Meij, van der Jan

    2012-01-01

    Background. QuikScan (QS) is an innovative design that aims to improve accessibility, comprehensibility, and subsequent recall of expository text by means of frequent within-document summaries that are formatted as numbered list items. The numbers in the QS summaries correspond to numbers placed in

  16. Automatic Syntactic Analysis of Free Text.

    Science.gov (United States)

    Schwarz, Christoph

    1990-01-01

    Discusses problems encountered with the syntactic analysis of free text documents in indexing. Postcoordination and precoordination of terms are discussed, an automatic indexing system called COPSY (context operator syntax) that uses natural language processing techniques is described, and future developments are explained. (60 references) (LRW)

  17. Polarity Analysis of Texts using Discourse Structure

    NARCIS (Netherlands)

    Heerschop, Bas; Goosen, Frank; Hogenboom, Alexander; Frasincar, Flavius; Kaymak, Uzay; Jong, de Franciska

    2011-01-01

    Sentiment analysis has applications in many areas and the exploration of its potential has only just begun. We propose Pathos, a framework which performs document sentiment analysis (partly) based on a document’s discourse structure. We hypothesize that by splitting a text into important and less im

  18. Automatic Induction of Rule Based Text Categorization

    Directory of Open Access Journals (Sweden)

    D.Maghesh Kumar

    2010-12-01

    Full Text Available The automated categorization of texts into predefined categories has witnessed a booming interest in the last 10 years, due to the increased availability of documents in digital form and the ensuing need to organize them. In the research community the dominant approach to this problem is based on machine learning techniques: a general inductive process automatically builds a classifier by learning, from a set of preclassified documents, the characteristics of the categories. This paper describes a novel method for the automatic induction of rule-based text classifiers. This method supports a hypothesis language of the form "if T1, … or Tn occurs in document d, and none of Tn+1, ..., Tn+m occurs in d, then classify d under category c," where each Ti is a conjunction of terms. This survey discusses the main approaches to text categorization that fall within the machine learning paradigm. Issues pertaining to three different problems, namely document representation, classifier construction, and classifier evaluation, are discussed in detail.
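
    The hypothesis language quoted above lends itself to a very small amount of code. The sketch below (Python; the rule contents, term names and sample document are illustrative assumptions, not taken from the paper) applies one rule of that form: if any positive conjunction occurs in d and none of the negative terms occurs in d, then classify d under c.

        # Minimal sketch of a rule of the form
        # "if T1, ... or Tn occurs in d, and none of Tn+1, ..., Tn+m occurs in d,
        #  then classify d under category c"; each positive Ti is a conjunction of terms.
        def matches_rule(doc_terms, positive_conjunctions, negative_terms):
            """doc_terms: set of terms occurring in the document."""
            has_positive = any(all(t in doc_terms for t in conj)
                               for conj in positive_conjunctions)
            has_negative = any(t in doc_terms for t in negative_terms)
            return has_positive and not has_negative

        # Hypothetical rule: classify under "sport" if ("football",) or ("world", "cup")
        # occurs and "stock" does not.
        rule = {"category": "sport",
                "positive": [("football",), ("world", "cup")],
                "negative": ["stock"]}

        doc = set("the world cup final drew a record audience".split())
        if matches_rule(doc, rule["positive"], rule["negative"]):
            print("classify under", rule["category"])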

  19. Visualizing the semantic content of large text databases using text maps

    Science.gov (United States)

    Combs, Nathan

    1993-01-01

    A methodology for generating text map representations of the semantic content of text databases is presented. Text maps provide a graphical metaphor for conceptualizing and visualizing the contents and data interrelationships of large text databases. Described are a set of experiments conducted against the TIPSTER corpora of Wall Street Journal articles. These experiments provide an introduction to current work in the representation and visualization of documents by way of their semantic content.

  20. Wilmar joint market model, Documentation

    Energy Technology Data Exchange (ETDEWEB)

    Meibom, P.; Larsen, Helge V. [Risoe National Lab. (Denmark); Barth, R.; Brand, H. [IER, Univ. of Stuttgart (Germany); Weber, C.; Voll, O. [Univ. of Duisburg-Essen (Germany)

    2006-01-15

    The Wilmar Planning Tool is developed in the project Wind Power Integration in Liberalised Electricity Markets (WILMAR) supported by EU (Contract No. ENK5-CT-2002-00663). A User Shell implemented in an Excel workbook controls the Wilmar Planning Tool. All data are contained in Access databases that communicate with various sub-models through text files that are exported from or imported to the databases. The Joint Market Model (JMM) constitutes one of these sub-models. This report documents the Joint Market model (JMM). The documentation describes: 1. The file structure of the JMM. 2. The sets, parameters and variables in the JMM. 3. The equations in the JMM. 4. The looping structure in the JMM. (au)

  1. Endangered Language Documentation and Transmission

    Directory of Open Access Journals (Sweden)

    D. Victoria Rau

    2007-01-01

    Full Text Available This paper describes an on-going project on the digital archiving of Yami language documentation (http://www.hrelp.org/grants/projects/index.php?projid=60). We present a cross-disciplinary approach, involving computer science and applied linguistics, to document the Yami language and prepare teaching materials. Our discussion begins with an introduction to an integrated framework for archiving, processing and developing learning materials for Yami (Yang and Rau 2005), followed by a historical account of Yami language teaching, from a grammatical syllabus (Dong and Rau 2000b) to a communicative syllabus using a multimedia CD as a resource (Rau et al. 2005), to the development of interactive on-line learning based on the digital archiving project. We discuss the methods used and the challenges of each stage of preparing Yami teaching materials, and present a proposal for rethinking pedagogical models for e-learning.

  2. A Proposed Arabic Handwritten Text Normalization Method

    Directory of Open Access Journals (Sweden)

    Tarik Abu-Ain

    2014-11-01

    Full Text Available Text normalization is an important technique in document image analysis and recognition. It consists of many preprocessing stages, which include slope correction, text padding, skew correction, and straightening of the writing line. In this respect, text normalization plays an important role in many procedures such as text segmentation, feature extraction and character recognition. In the present article, a new method for text baseline detection, straightening, and slant correction for Arabic handwritten texts is proposed. The method comprises a set of sequential steps: first, component segmentation is done, followed by component thinning; then, the direction features of the skeletons are extracted and the candidate baseline regions are determined. After that, the correct baseline region is selected and, finally, the baselines of all components are aligned with the writing line. The experiments are conducted on the IFN/ENIT benchmark Arabic dataset. The results show that the proposed method has a promising and encouraging performance.
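
    As a point of reference for the baseline-detection step, the sketch below shows a much simpler, classical approach than the skeleton-direction method described in the abstract: estimating the writing baseline of a binarized component from its horizontal projection profile. The toy image and the choice of the densest row are illustrative assumptions only.

        import numpy as np

        def baseline_row(binary_image):
            """Estimate the writing baseline as the row with the highest ink density
            (horizontal projection profile); a crude stand-in for skeleton-based detection."""
            profile = binary_image.sum(axis=1)
            return int(np.argmax(profile))

        # Toy binary component: ink (1s) concentrated around row 6.
        img = np.zeros((10, 30), dtype=int)
        img[5, 5:25] = 1
        img[6, 2:28] = 1
        img[7, 8:20] = 1
        print("estimated baseline row:", baseline_row(img))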

  3. Relation Based Mining Model for Enhancing Web Document Clustering

    Directory of Open Access Journals (Sweden)

    M.Reka

    2014-05-01

    Full Text Available The design of web information management systems is becoming more complex and more time-consuming. Information retrieval is a difficult task due to the huge volume of web documents, and clustering makes the retrieval easier and less time consuming. This algorithm introduces a web document clustering approach that uses the semantic relations between documents, which reduces the time complexity. It identifies the relations and concepts in a document and also computes the relation score between documents. The algorithm extracts the key concepts from the web documents by preprocessing, stemming, and stop-word removal. The identified concepts are used to compute the document relation score and the cluster relation score, with a domain ontology supporting both computations. Based on the document relation score and cluster relation score, the web document cluster is identified. This algorithm uses 200,000 web documents for evaluation, with 60 percent as the training set and 40 percent as the testing set.

  4. Problems and Methods of Source Study of Cinema Documents

    Directory of Open Access Journals (Sweden)

    Grigory N. Lanskoy

    2016-03-01

    Full Text Available The article is devoted to basic problems of the analysis and interpretation of cinema documents in historical studies, among them the possibility of a shared approach to the study of cinema and paper documents, the application of art-studies principles to the analysis of cinema documents, and the efficacy of a textual approach to the study of cinema documents. The forms of applying different scientific methods to the evaluation of cinema documents as historical sources are also discussed in the article.

  5. Customer Communication Document

    Science.gov (United States)

    2009-01-01

    This procedure communicates to the Customers of the Automation, Robotics and Simulation Division (AR&SD) Dynamics Systems Test Branch (DSTB) how to obtain services of the Six-Degrees-Of-Freedom Dynamic Test System (SDTS). The scope includes the major communication documents between the SDTS and its Customer. It establishes the initial communication and contact points and provides the initial documentation in electronic media for the customer. Contact the SDTS Manager (SM) for the names and numbers of the current contact points.

  6. Helios: Understanding Solar Evolution Through Text Analytics

    Energy Technology Data Exchange (ETDEWEB)

    Randazzese, Lucien [SRI International, Menlo Park, CA (United States)

    2016-12-02

    This proof-of-concept project focused on developing, testing, and validating a range of bibliometric, text analytic, and machine-learning based methods to explore the evolution of three photovoltaic (PV) technologies: Cadmium Telluride (CdTe), Dye-Sensitized solar cells (DSSC), and Multi-junction solar cells. The analytical approach to the work was inspired by previous work by the same team to measure and predict the scientific prominence of terms and entities within specific research domains. The goal was to create tools that could assist domain-knowledgeable analysts in investigating the history and path of technological developments in general, with a focus on analyzing step-function changes in performance, or “breakthroughs,” in particular. The text-analytics platform developed during this project was dubbed Helios. The project relied on computational methods for analyzing large corpora of technical documents. For this project we ingested technical documents from the following sources into Helios: Thomson Scientific Web of Science (papers), the U.S. Patent & Trademark Office (patents), the U.S. Department of Energy (technical documents), the U.S. National Science Foundation (project funding summaries), and a hand curated set of full-text documents from Thomson Scientific and other sources.

  7. SEMANTIC METADATA FOR HETEROGENEOUS SPATIAL PLANNING DOCUMENTS

    Directory of Open Access Journals (Sweden)

    A. Iwaniak

    2016-09-01

    Full Text Available Spatial planning documents contain information about the principles and rights of land use in different zones of a local authority. They are the basis for administrative decision making in support of sustainable development. In Poland these documents are published on the Web according to a prescribed non-extendable XML schema, designed for optimum presentation to humans in HTML web pages. There is no document standard, and limited functionality exists for adding references to external resources. The text in these documents is discoverable and searchable by general-purpose web search engines, but the semantics of the content cannot be discovered or queried. The spatial information in these documents is geographically referenced but not machine-readable. Major manual efforts are required to integrate such heterogeneous spatial planning documents from various local authorities for analysis, scenario planning and decision support. This article presents results of an implementation using machine-readable semantic metadata to identify relationships among regulations in the text, spatial objects in the drawings and links to external resources. A spatial planning ontology was used to annotate different sections of spatial planning documents with semantic metadata in the Resource Description Framework in Attributes (RDFa). The semantic interpretation of the content, links between document elements and links to external resources were embedded in XHTML pages. An example and use case from the spatial planning domain in Poland is presented to evaluate its efficiency and applicability. The solution enables the automated integration of spatial planning documents from multiple local authorities to assist decision makers with understanding and interpreting spatial planning information. The approach is equally applicable to legal documents from other countries and domains, such as cultural heritage and environmental management.

  8. ONTOLOGY BASED DOCUMENT CLUSTERING USING MAPREDUCE

    Directory of Open Access Journals (Sweden)

    Abdelrahman Elsayed

    2015-05-01

    Full Text Available Nowadays, document clustering is considered a data-intensive task due to the dramatic, fast increase in the number of available documents, and the feature sets that represent those documents are also very large. The most common method for representing documents is the vector space model, which represents document features as a bag of words and does not capture semantic relations between words. In this paper we introduce a distributed implementation of bisecting k-means using the MapReduce programming model. The aim behind our proposed implementation is to solve the problem of clustering intensive collections of documents. In addition, we propose integrating the WordNet ontology with bisecting k-means in order to utilize the semantic relations between words to enhance document clustering results. Our experimental results show that using lexical categories for nouns only enhances internal evaluation measures of document clustering and decreases the document features from thousands to tens. Our experiments were conducted using Amazon Elastic MapReduce to deploy the bisecting k-means algorithm.
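
    For readers unfamiliar with the bisecting variant of k-means, the sketch below shows its sequential core: the largest cluster is repeatedly split in two until the desired number of clusters is reached. It is a plain Python/NumPy illustration on assumed toy data; the MapReduce distribution and the WordNet-based feature reduction described in the abstract are not reproduced here.

        import numpy as np

        def two_means(X, iters=20, seed=0):
            """Split the rows of X into two clusters with a plain k-means (k=2)."""
            rng = np.random.default_rng(seed)
            centers = X[rng.choice(len(X), size=2, replace=False)]
            for _ in range(iters):
                d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
                labels = d.argmin(axis=1)
                for k in (0, 1):
                    if np.any(labels == k):
                        centers[k] = X[labels == k].mean(axis=0)
            return labels

        def bisecting_kmeans(X, n_clusters):
            """Repeatedly bisect the largest cluster until n_clusters are obtained."""
            clusters = [np.arange(len(X))]
            while len(clusters) < n_clusters:
                largest = max(range(len(clusters)), key=lambda i: len(clusters[i]))
                idx = clusters.pop(largest)
                labels = two_means(X[idx])
                clusters.append(idx[labels == 0])
                clusters.append(idx[labels == 1])
            return clusters

        # Toy document-term matrix (rows = documents, columns = term counts).
        X = np.array([[3, 0, 1], [2, 0, 0], [0, 4, 1], [0, 3, 2], [1, 1, 5]], float)
        print(bisecting_kmeans(X, 3))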

  9. Text Classification Using Sentential Frequent Itemsets

    Institute of Scientific and Technical Information of China (English)

    Shi-Zhu Liu; He-Ping Hu

    2007-01-01

    Text classification techniques mostly rely on single-term analysis of the document data set, while many concepts, especially the specific ones, are usually conveyed by sets of terms. To achieve a more accurate text classifier, more informative features, including frequently co-occurring words in the same sentence and their weights, are particularly important in such scenarios. In this paper, we propose a novel approach for text classification using sentential frequent itemsets, a concept that comes from association rule mining, which views a sentence rather than a document as a transaction, and uses a variable-precision rough set based method to evaluate each sentential frequent itemset's contribution to the classification. Experiments over the Reuters and newsgroup corpora are carried out, which validate the practicability of the proposed system.
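
    The core idea of treating each sentence as a transaction can be illustrated in a few lines. The sketch below (Python; the sample sentences, support threshold and itemset size limit are assumptions, and the variable-precision rough set evaluation from the paper is not reproduced) counts frequent term itemsets over sentence transactions.

        from itertools import combinations
        from collections import Counter

        def sentential_frequent_itemsets(sentences, min_support=2, max_size=2):
            """Treat each sentence as a transaction and count frequent term itemsets."""
            transactions = [set(s.lower().split()) for s in sentences]
            counts = Counter()
            for t in transactions:
                for size in range(1, max_size + 1):
                    for itemset in combinations(sorted(t), size):
                        counts[itemset] += 1
            return {i: c for i, c in counts.items() if c >= min_support}

        doc = [
            "interest rates rise as inflation climbs",
            "central bank raises interest rates",
            "rates rise again on inflation fears",
        ]
        for itemset, support in sorted(sentential_frequent_itemsets(doc).items()):
            print(itemset, support)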

  10. ARABIC TEXT CATEGORIZATION ALGORITHM USING VECTOR EVALUATION METHOD

    Directory of Open Access Journals (Sweden)

    Ashraf Odeh

    2014-12-01

    Full Text Available Text categorization is the process of grouping documents into categories based on their contents. This process is important because it makes information retrieval easier, and it has become more important due to the huge amount of textual information available online. The main problem in text categorization is how to improve the classification accuracy. Although Arabic text categorization is a new and promising field, there is little research in this field. This paper proposes a new method for Arabic text categorization using vector evaluation. The proposed method uses a categorized Arabic document corpus; the weights of the tested document's words are then calculated to determine the document keywords, which are compared with the keywords of the corpus categories to determine the tested document's best category.
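
    The abstract does not give the exact vector-evaluation formula, so the sketch below should be read only as one plausible rendering of the general scheme it describes: weight the tested document's words, build a keyword vector per category from a categorized corpus, and pick the closest category. The tf-idf weighting, cosine similarity and toy corpus are all assumptions.

        import math
        from collections import Counter

        def tfidf_vector(doc_tokens, all_docs):
            """Weight each term by tf * idf, with idf computed over the whole corpus."""
            n_docs = len(all_docs)
            tf = Counter(doc_tokens)
            return {t: tf[t] * math.log(n_docs / (1 + sum(t in d for d in all_docs)))
                    for t in tf}

        def cosine(u, v):
            dot = sum(u[t] * v.get(t, 0.0) for t in u)
            nu = math.sqrt(sum(x * x for x in u.values()))
            nv = math.sqrt(sum(x * x for x in v.values()))
            return dot / (nu * nv) if nu and nv else 0.0

        # Hypothetical categorized corpus: category -> list of tokenized documents.
        corpus = {
            "sport": [["match", "goal", "team"], ["team", "league", "goal"]],
            "economy": [["market", "price", "inflation"], ["bank", "market", "rates"]],
        }
        all_docs = [set(d) for docs in corpus.values() for d in docs]
        category_vectors = {c: tfidf_vector([t for d in docs for t in d], all_docs)
                            for c, docs in corpus.items()}

        test = ["goal", "team", "match", "price"]
        test_vec = tfidf_vector(test, all_docs)
        best = max(category_vectors, key=lambda c: cosine(test_vec, category_vectors[c]))
        print("best category:", best)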

  11. Biogas document; Dossier Biogaz

    Energy Technology Data Exchange (ETDEWEB)

    Verchin, J.C.; Servais, C. [Club BIOGAZ, 94 - Arcueil (France)

    2002-06-01

    In this document concerning biogas, the authors present the situation of this renewable energy in 2001-2002, the actors concerned, an inventory of the industrial methanization installations in France, the three main processing chains for industrial wastes, and two examples of methanization implementation, in a paper mill and in a dairy. (A.L.B.)

  12. Documents on Disarmament.

    Science.gov (United States)

    Arms Control and Disarmament Agency, Washington, DC.

    This publication, latest in a series of volumes issued annually since 1960, contains primary source documents on arms control and disarmament developments during 1969. The main chronological arrangement is supplemented by both chronological and topical lists of contents. Other reference aids include a subject/author index, and lists of…

  13. Course documentation report

    DEFF Research Database (Denmark)

    Buus, Lillian; Bygholm, Ann; Walther, Tina Dyngby Lyng

    A documentation report on the three pedagogical courses developed during the MVU project period. The report describes the three processes, taking its departure in the structure and material available in the virtual learning environment. The report also describes the way two of the courses developed...

  14. Hypertension Briefing: Technical documentation

    OpenAIRE

    Institute of Public Health in Ireland

    2012-01-01

    Blood pressure is the force exerted on artery walls as the heart pumps blood through the body. Hypertension, or high blood pressure, occurs when blood pressure is constantly higher than the pressure needed to carry blood through the body. This document details how the IPH uses a systematic and consistent method to produce prevalence data for hypertension on the island of Ireland.

  15. Extremely secure identification documents

    Energy Technology Data Exchange (ETDEWEB)

    Tolk, K.M. [Sandia National Labs., Albuquerque, NM (United States); Bell, M. [Sandia National Labs., Livermore, CA (United States)

    1997-09-01

    The technology developed in this project uses biometric information printed on the document and public key cryptography to ensure that an adversary cannot issue identification documents to unauthorized individuals or alter existing documents to allow their use by unauthorized individuals. This process can be used to produce many types of identification documents with much higher security than any currently in use. The system is demonstrated using a security badge as an example. This project focused on the technologies requiring development in order to make the approach viable with existing badge printing and laminating technologies. By far the most difficult was the image processing required to verify that the picture on the badge had not been altered. Another area that required considerable work was the high density printed data storage required to get sufficient data on the badge for verification of the picture. The image processing process was successfully tested, and recommendations are included to refine the badge system to ensure high reliability. A two dimensional data array suitable for printing the required data on the badge was proposed, but testing of the readability of the array had to be abandoned due to reallocation of the budgeted funds by the LDRD office.

  16. Using Primary Source Documents.

    Science.gov (United States)

    Mintz, Steven

    2003-01-01

    Explores the use of primary sources when teaching about U.S. slavery. Includes primary sources from the Gilder Lehrman Documents Collection (New York Historical Society) to teach about the role of slaves in the Revolutionary War, such as a proclamation from Lord Dunmore offering freedom to slaves who joined his army. (CMK)

  17. Motivation through Routine Documentation

    Science.gov (United States)

    Koth, Laurie J.

    2016-01-01

    This informed commentary article offers a simple, effective classroom management strategy in which the teacher uses routine documentation to motivate students both to perform academically and to behave in a manner consistent with established classroom rules and procedures. The pragmatic strategy is grounded in literature, free to implement,…

  18. Exploiting Surrounding Text for Retrieving Web Images

    Directory of Open Access Journals (Sweden)

    S. A. Noah

    2008-01-01

    Full Text Available Web documents contain useful textual information that can be exploited for describing images. Research has focused on representing images by means of their content (low-level descriptions such as color, shape and texture); little research has been directed to exploiting such textual information. The aim of this research was to systematically exploit the textual content of HTML documents for automatic indexing and ranking of images embedded in web documents. A heuristic approach for locating and weighting the text surrounding web images, together with a modified tf.idf weighting scheme, was proposed. Precision-recall evaluations were conducted for ten queries and promising results were achieved. The proposed approach showed a slightly better precision measure than a popular search engine, with average relative precision measures of 0.63 and 0.55 respectively.
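
    The paper's exact weighting heuristic is not spelled out in the abstract, so the sketch below only illustrates the general idea of giving terms a weight that decays with their distance from an image reference in the page; the window size, decay function and sample markup are assumptions. In a full system this proximity weight would be combined with the modified tf.idf scheme mentioned above.

        import re

        def surrounding_text_weights(html, window=20):
            """Give terms near an <img> tag a proximity weight that decays with distance."""
            tokens = re.findall(r"<img[^>]*>|\w+", html.lower())
            img_positions = [i for i, t in enumerate(tokens) if t.startswith("<img")]
            weights = {}
            for i, tok in enumerate(tokens):
                if tok.startswith("<img"):
                    continue
                dist = min((abs(i - p) for p in img_positions), default=None)
                if dist is not None and dist <= window:
                    w = 1.0 - dist / (window + 1)   # closer terms weigh more
                    weights[tok] = max(weights.get(tok, 0.0), w)
            return weights

        page = "<p>A red fox photographed at dawn</p> <img src='fox.jpg'> <p>wildlife gallery</p>"
        print(surrounding_text_weights(page))   # tag names like 'p' would be filtered in practice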

  19. Text Mining the History of Medicine.

    Directory of Open Access Journals (Sweden)

    Paul Thompson

    Full Text Available Historical text archives constitute a rich and diverse source of information, which is becoming increasingly readily accessible, due to large-scale digitisation efforts. However, it can be difficult for researchers to explore and search such large volumes of data in an efficient manner. Text mining (TM) methods can help, through their ability to recognise various types of semantic information automatically, e.g., instances of concepts (places, medical conditions, drugs, etc.), synonyms/variant forms of concepts, and relationships holding between concepts (which drugs are used to treat which medical conditions, etc.). TM analysis allows search systems to incorporate functionality such as automatic suggestions of synonyms of user-entered query terms, exploration of different concepts mentioned within search results or isolation of documents in which concepts are related in specific ways. However, applying TM methods to historical text can be challenging, according to differences and evolutions in vocabulary, terminology, language structure and style, compared to more modern text. In this article, we present our efforts to overcome the various challenges faced in the semantic analysis of published historical medical text dating back to the mid 19th century. Firstly, we used evidence from diverse historical medical documents from different periods to develop new resources that provide accounts of the multiple, evolving ways in which concepts, their variants and relationships amongst them may be expressed. These resources were employed to support the development of a modular processing pipeline of TM tools for the robust detection of semantic information in historical medical documents with varying characteristics. We applied the pipeline to two large-scale medical document archives covering wide temporal ranges as the basis for the development of a publicly accessible semantically-oriented search system. The novel resources are available for research

  20. Technical approach document

    Energy Technology Data Exchange (ETDEWEB)

    1989-12-01

    The Uranium Mill Tailings Radiation Control Act (UMTRCA) of 1978, Public Law 95-604 (PL95-604), grants the Secretary of Energy the authority and responsibility to perform such actions as are necessary to minimize radiation health hazards and other environmental hazards caused by inactive uranium mill sites. This Technical Approach Document (TAD) describes the general technical approaches and design criteria adopted by the US Department of Energy (DOE) in order to implement remedial action plans (RAPs) and final designs that comply with EPA standards. It does not address the technical approaches necessary for aquifer restoration at processing sites; a guidance document, currently in preparation, will describe aquifer restoration concerns and technical protocols. This document is a second revision to the original document issued in May 1986; the revision has been made in response to changes to the groundwater standards of 40 CFR 192, Subparts A--C, proposed by EPA as draft standards. New sections were added to define the design approaches and designs necessary to comply with the groundwater standards. These new sections are in addition to changes made throughout the document to reflect current procedures, especially in cover design, water resources protection, and alternate site selection; only minor revisions were made to some of the sections. Section 3.0 is a new section defining the approach taken in the design of disposal cells; Section 4.0 has been revised to include design of vegetated covers; Section 8.0 discusses design approaches necessary for compliance with the groundwater standards; and Section 9.0 is a new section dealing with nonradiological hazardous constituents. 203 refs., 18 figs., 26 tabs.

  1. Detection of Plagiarism in Arabic Documents

    Directory of Open Access Journals (Sweden)

    Mohamed El Bachir Menai

    2012-09-01

    Full Text Available Many language-sensitive tools for detecting plagiarism in natural language documents have been developed, particularly for English. Language-independent tools exist as well, but are considered restrictive as they usually do not take into account specific language features. Detecting plagiarism in Arabic documents is particularly a challenging task because of the complex linguistic structure of Arabic. In this paper, we present a plagiarism detection tool for comparison of Arabic documents to identify potential similarities. The tool is based on a new comparison algorithm that uses heuristics to compare suspect documents at different hierarchical levels to avoid unnecessary comparisons. We evaluate its performance in terms of precision and recall on a large data set of Arabic documents, and show its capability in identifying direct and sophisticated copying, such as sentence reordering and synonym substitution. We also demonstrate its advantages over other plagiarism detection tools, including Turnitin, the well-known language-independent tool.
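
    The idea of comparing suspect documents at different hierarchical levels to avoid unnecessary comparisons can be sketched generically: compare paragraph pairs with a cheap similarity first and descend to sentence-level comparison only for pairs that look similar. The Jaccard measure, thresholds and sample sentences below are illustrative assumptions, not the heuristics actually used by the tool described above.

        def jaccard(a, b):
            a, b = set(a), set(b)
            return len(a & b) / len(a | b) if a | b else 0.0

        def hierarchical_compare(doc_a, doc_b, para_threshold=0.3, sent_threshold=0.6):
            """Compare paragraphs first; descend to sentence level only for paragraph
            pairs that are already similar enough, skipping most sentence checks."""
            matches = []
            for i, pa in enumerate(doc_a):
                for j, pb in enumerate(doc_b):
                    words_a = [w for s in pa for w in s.split()]
                    words_b = [w for s in pb for w in s.split()]
                    if jaccard(words_a, words_b) < para_threshold:
                        continue                    # cheap rejection of dissimilar paragraphs
                    for si, sa in enumerate(pa):
                        for sj, sb in enumerate(pb):
                            if jaccard(sa.split(), sb.split()) >= sent_threshold:
                                matches.append(((i, si), (j, sj)))
            return matches

        # Documents as lists of paragraphs; each paragraph is a list of sentences.
        suspect = [["the results were obtained quickly", "errors were negligible"]]
        source = [["errors were negligible", "the results were quickly obtained"]]
        print(hierarchical_compare(suspect, source))   # catches the reordered sentences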

  2. Document Summarization Using Positive Pointwise Mutual Information

    Directory of Open Access Journals (Sweden)

    Aji S

    2012-05-01

    Full Text Available The degree of success in document summarization processes depends on the performance of the method used to identify significant sentences in the documents. The collection of unique words characterizes the major signature of the document and forms the basis for the Term-Sentence-Matrix (TSM). Positive Pointwise Mutual Information, which works well for measuring semantic similarity in the Term-Sentence-Matrix, is used in our method to assign a weight to each entry in the Term-Sentence-Matrix. The Sentence-Rank-Matrix generated from this weighted TSM is then used to extract a summary from the document. Our experiments show that such a method outperforms most of the existing methods in producing summaries from large documents.
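
    Positive pointwise mutual information over a term-sentence count matrix has a compact closed form: PPMI(t, s) = max(0, log(p(t, s) / (p(t) p(s)))). The sketch below computes these weights and then ranks sentences by their summed PPMI weight, which is one simple reading of the sentence-ranking step; the paper's actual Sentence-Rank-Matrix construction may differ, and the toy sentences are assumptions.

        import numpy as np

        def ppmi(matrix):
            """Positive pointwise mutual information for a term-sentence count matrix."""
            total = matrix.sum()
            p_ts = matrix / total
            p_t = p_ts.sum(axis=1, keepdims=True)     # term marginals
            p_s = p_ts.sum(axis=0, keepdims=True)     # sentence marginals
            with np.errstate(divide="ignore", invalid="ignore"):
                pmi = np.log(p_ts / (p_t @ p_s))
            pmi[~np.isfinite(pmi)] = 0.0
            return np.maximum(pmi, 0.0)

        sentences = [
            "the cat sat on the mat",
            "the dog chased the cat",
            "stock prices fell sharply today",
        ]
        vocab = sorted({w for s in sentences for w in s.split()})
        tsm = np.array([[s.split().count(w) for s in sentences] for w in vocab], float)

        weights = ppmi(tsm)
        scores = weights.sum(axis=0)                   # one score per sentence
        print("sentence ranking (best first):", np.argsort(scores)[::-1].tolist())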

  3. Using Stream Features for Instant Document Filtering

    OpenAIRE

    2012-01-01

    In this paper, we discuss how event processing technologies can be employed for real-time text stream processing and information filtering in the context of the TREC 2012 microblog task. After introducing basic characteristics of stream and event processing, the technical architecture of our text stream analysis engine is presented. Employing well-known term weighting schemes from document-centric text retrieval for temporally dynamic text streams is discussed next, giving details of the ESPE...

  4. NEW TECHNIQUES USED IN AUTOMATED TEXT ANALYSIS

    Directory of Open Access Journals (Sweden)

    M. Istrate

    2010-12-01

    Full Text Available Automated analysis of natural language texts is one of the most important knowledge discovery tasks for any organization. According to the Gartner Group, almost 90% of the knowledge available in an organization today is dispersed throughout piles of documents buried within unstructured text. Analyzing huge volumes of textual information is often involved in making informed and correct business decisions. Traditional analysis methods based on statistics fail to help in processing unstructured texts, and society is in search of new technologies for text analysis. There exist a variety of approaches to the analysis of natural language texts, but most of them do not provide results that can be successfully applied in practice. This article concentrates on recent ideas and practical implementations in this area.

  5. Semantic Metadata for Heterogeneous Spatial Planning Documents

    Science.gov (United States)

    Iwaniak, A.; Kaczmarek, I.; Łukowicz, J.; Strzelecki, M.; Coetzee, S.; Paluszyński, W.

    2016-09-01

    Spatial planning documents contain information about the principles and rights of land use in different zones of a local authority. They are the basis for administrative decision making in support of sustainable development. In Poland these documents are published on the Web according to a prescribed non-extendable XML schema, designed for optimum presentation to humans in HTML web pages. There is no document standard, and limited functionality exists for adding references to external resources. The text in these documents is discoverable and searchable by general-purpose web search engines, but the semantics of the content cannot be discovered or queried. The spatial information in these documents is geographically referenced but not machine-readable. Major manual efforts are required to integrate such heterogeneous spatial planning documents from various local authorities for analysis, scenario planning and decision support. This article presents results of an implementation using machine-readable semantic metadata to identify relationships among regulations in the text, spatial objects in the drawings and links to external resources. A spatial planning ontology was used to annotate different sections of spatial planning documents with semantic metadata in the Resource Description Framework in Attributes (RDFa). The semantic interpretation of the content, links between document elements and links to external resources were embedded in XHTML pages. An example and use case from the spatial planning domain in Poland is presented to evaluate its efficiency and applicability. The solution enables the automated integration of spatial planning documents from multiple local authorities to assist decision makers with understanding and interpreting spatial planning information. The approach is equally applicable to legal documents from other countries and domains, such as cultural heritage and environmental management.

  6. Text Mining the History of Medicine.

    Science.gov (United States)

    Thompson, Paul; Batista-Navarro, Riza Theresa; Kontonatsios, Georgios; Carter, Jacob; Toon, Elizabeth; McNaught, John; Timmermann, Carsten; Worboys, Michael; Ananiadou, Sophia

    2016-01-01

    Historical text archives constitute a rich and diverse source of information, which is becoming increasingly readily accessible, due to large-scale digitisation efforts. However, it can be difficult for researchers to explore and search such large volumes of data in an efficient manner. Text mining (TM) methods can help, through their ability to recognise various types of semantic information automatically, e.g., instances of concepts (places, medical conditions, drugs, etc.), synonyms/variant forms of concepts, and relationships holding between concepts (which drugs are used to treat which medical conditions, etc.). TM analysis allows search systems to incorporate functionality such as automatic suggestions of synonyms of user-entered query terms, exploration of different concepts mentioned within search results or isolation of documents in which concepts are related in specific ways. However, applying TM methods to historical text can be challenging, according to differences and evolutions in vocabulary, terminology, language structure and style, compared to more modern text. In this article, we present our efforts to overcome the various challenges faced in the semantic analysis of published historical medical text dating back to the mid 19th century. Firstly, we used evidence from diverse historical medical documents from different periods to develop new resources that provide accounts of the multiple, evolving ways in which concepts, their variants and relationships amongst them may be expressed. These resources were employed to support the development of a modular processing pipeline of TM tools for the robust detection of semantic information in historical medical documents with varying characteristics. We applied the pipeline to two large-scale medical document archives covering wide temporal ranges as the basis for the development of a publicly accessible semantically-oriented search system. The novel resources are available for research purposes, while

  7. Hiding Malicious Content in PDF Documents

    Directory of Open Access Journals (Sweden)

    Dan Sabin Popescu

    2011-09-01

    Full Text Available This paper is a proof-of-concept demonstration of a specific digital signature vulnerability that shows the ineffectiveness of the WYSIWYS (What You See Is What You Sign) concept. The algorithm is fairly simple: the attacker generates a polymorphic file that has two different types of content (text, as a PDF document for example, and image, as TIFF) – two of the most widely used file formats. When the victim signs the dual-content file, he/she only sees a PDF document and is unaware of the hidden content inside the file. After obtaining the legally signed document from the victim, the attacker simply has to change the extension to the other file format. This will not invalidate the digital signature, as no bits were altered. The destructive potential of the attack is considerable, as the Portable Document Format (PDF) is widely used in e-government and e-business contexts.

  8. Geometric Correction for Braille Document Images

    Directory of Open Access Journals (Sweden)

    Padmavathi.S

    2016-04-01

    Full Text Available The Braille system has been used by visually impaired people for reading. The shortage of Braille books has created a need for conversion of Braille to text. This paper addresses the geometric correction of Braille document images. Due to the standard measurement of the Braille cells, identification of Braille characters can be achieved by a simple cell overlapping procedure. The standard measurement varies in a scaled document, and fitting of the cells becomes difficult if the document is tilted. This paper proposes a line fitting algorithm for identifying the tilt (skew) angle. The horizontal and vertical scale factors are identified based on the ratio of the distance between characters to the distance between dots. These are used in a geometric transformation matrix for correction. Rotation correction is done prior to scale correction. This process aids in increased accuracy. The results for various Braille documents are tabulated.
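
    A minimal sketch of the skew-estimation step is given below: fit a least-squares line through the dot centroids of one Braille row and derive the rotation that undoes the measured tilt. The centroids are assumed to have been extracted already and the sample coordinates are illustrative; the paper's scale-factor estimation from character and dot spacing is not shown.

        import numpy as np

        def skew_angle_from_dots(centroids):
            """Fit a straight line (least squares) through dot centroids of one Braille row
            and return the skew angle in degrees."""
            xs, ys = centroids[:, 0], centroids[:, 1]
            slope, _intercept = np.polyfit(xs, ys, 1)
            return np.degrees(np.arctan(slope))

        def rotation_matrix(angle_deg):
            """2x2 matrix that rotates points by -angle to undo the measured skew."""
            a = np.radians(-angle_deg)
            return np.array([[np.cos(a), -np.sin(a)], [np.sin(a), np.cos(a)]])

        # Hypothetical centroids (x, y) of dots lying on a slightly tilted row.
        dots = np.array([[10, 50.2], [20, 51.1], [30, 52.0], [40, 52.9], [50, 53.8]], float)
        angle = skew_angle_from_dots(dots)
        corrected = dots @ rotation_matrix(angle).T
        print("estimated skew: %.2f degrees" % angle)
        print(np.round(corrected, 1))   # dots after rotation correction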

  9. [Clinically documented fungal infections].

    Science.gov (United States)

    Kakeya, Hiroshi; Kohno, Shigeru

    2008-12-01

    Proven fungal infections are diagnosed by histological/microbiological evidence of fungi at the site of infection and positive blood culture (fungemia). However, invasive diagnostic examinations are not always applied to all immunocompromised patients. Clinically documented invasive fungal infections are diagnosed by typical radiological findings, such as the halo sign on chest CT, plus positive serological/molecular evidence of fungi. Serological tests for Aspergillus galactomannan antigen and beta-glucan for aspergillosis, and for cryptococcal glucuronoxylomannan antigen for cryptococcosis, are useful. However, no reliable serological tests for zygomycosis are available so far. In this article, risk factors, signs and symptoms, and diagnostic methods for clinically documented cases of invasive aspergillosis, pulmonary cryptococcosis, and zygomycosis with diabetes are reviewed.

  10. SANSMIC design document.

    Energy Technology Data Exchange (ETDEWEB)

    Weber, Paula D. [Sandia National Laboratories (SNL-NM), Albuquerque, NM (United States); Rudeen, David Keith [GRAM, Inc., Albuquerque, NM (United States)

    2015-07-01

    The United States Strategic Petroleum Reserve (SPR) maintains an underground storage system consisting of caverns that were leached or solution mined in four salt domes located near the Gulf of Mexico in Texas and Louisiana. The SPR comprises more than 60 active caverns containing approximately 700 million barrels of crude oil. Sandia National Laboratories (SNL) is the geotechnical advisor to the SPR. As the most pressing need at the inception of the SPR was to create and fill storage volume with oil, the decision was made to leach the caverns and fill them simultaneously (leach-fill). Therefore, A.J. Russo developed SANSMIC in the early 1980s which allows for a transient oil-brine interface (OBI) making it possible to model leach-fill and withdrawal operations. As the majority of caverns are currently filled to storage capacity, the primary uses of SANSMIC at this time are related to the effects of small and large withdrawals, expansion of existing caverns, and projecting future pillar to diameter ratios. SANSMIC was identified by SNL as a priority candidate for qualification. This report continues the quality assurance (QA) process by documenting the "as built" mathematical and numerical models that comprise this document. The program flow is outlined and the models are discussed in detail. Code features that were added later or were not documented previously have been expounded. No changes in the code's physics have occurred since the original documentation (Russo, 1981, 1983) although recent experiments may yield improvements to the temperature and plume methods in the future.

  11. Content Documents Management

    Science.gov (United States)

    Muniz, R.; Hochstadt, J.; Boelke J.; Dalton, A.

    2011-01-01

    The Content Documents are created and managed under the System Software group within the Launch Control System (LCS) project. The System Software product group is led by the NASA Engineering Control and Data Systems branch (NEC3) at Kennedy Space Center. The team is working on creating Operating System Images (OSI) for different platforms (i.e. AIX, Linux, Solaris and Windows). Before the OSI can be created, the team must create a Content Document which provides the information of a workstation or server, with the list of all the software that is to be installed on it and also the set where the hardware belongs. This can be, for example, in the LDS, the ADS or the FR-l. The objective of this project is to create a User Interface Web application that can manage the information of the Content Documents, with all the correct validations and filters for administrator purposes. For this project we used one of the most excellent tools for agile development of applications, Ruby on Rails. This tool helps pragmatic programmers develop Web applications with the Rails framework and the Ruby programming language. It is amazing to see how a student can learn about OOP features with the Ruby language, manage the user interface with HTML and CSS, create associations and queries with gems, manage databases and run a server with MySQL, run shell commands with the command prompt and create Web frameworks with Rails. All of this in a real-world project and in just fifteen weeks!

  12. Document Delivery Services around the World

    Directory of Open Access Journals (Sweden)

    Ashrafosadat Foladi

    2008-04-01

    Full Text Available Given the importance of information access versus collection, the present study identified and investigated the ten most important document delivery websites, which had the highest frequency of citations in online directories and printed sources. The evaluation was based on the indicators and policies of the Iranian Scientific Information and Documentation Center (IRANDOC). These included document diversity, document request mechanisms, document delivery options, response time, payment options, costs, and copyright clearance. The findings were then processed statistically using SPSS. It was found that, based on document diversity, BLDSC, LHL and DocDeliver are the frontrunners. On account of subject comprehensiveness, DocDeliver, BLDSC, Infotrieve, Ingenta, ISI and UMI are at the same level. All ten sites studied covered basic sciences. BL is strong with respect to diversity of document delivery options, payment options and response time. ISI is most suitable when diversity in request options is required. Ingenta is suitable when diversity in payment options is required. NTIS is in the lead when special documents such as technical reports are required, while UMI is most suitable for dissertations and rare books.

  13. Teaching Text Structure: Examining the Affordances of Children's Informational Texts

    Science.gov (United States)

    Jones, Cindy D.; Clark, Sarah K.; Reutzel, D. Ray

    2016-01-01

    This study investigated the affordances of informational texts to serve as model texts for teaching text structure to elementary school children. Content analysis of a random sampling of children's informational texts from top publishers was conducted on text structure organization and on the inclusion of text features as signals of text…

  14. Text Character Extraction Implementation from Captured Handwritten Image to Text Conversion using Template Matching Technique

    Directory of Open Access Journals (Sweden)

    Barate Seema

    2016-01-01

    Full Text Available Images contain various types of useful information that should be extracted whenever required. Various algorithms and methods have been proposed to extract text from a given image, so that the user is able to access the text from any image. Variations in text may occur because of differences in size, style, orientation and alignment of text, while low image contrast and composite backgrounds make text extraction difficult. If we develop an application that extracts and recognizes these texts accurately in real time, it can be applied to many important applications like document analysis, vehicle license plate extraction and text-based image indexing, and many such applications have become realities in recent years. To address the above problems we develop an application that converts the image into text by using algorithms such as bounding box, HSV model, blob analysis, template matching and template generation.
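
    The template matching step named above can be illustrated with zero-mean normalized cross-correlation, one common formulation of the technique; whether the paper uses exactly this score is not stated in the abstract, and the toy images below are assumptions.

        import numpy as np

        def match_template(image, template):
            """Return the (row, col) of the image patch with the highest zero-mean
            normalized correlation with the template, plus the score itself."""
            th, tw = template.shape
            t = template - template.mean()
            best, best_pos = -np.inf, (0, 0)
            for r in range(image.shape[0] - th + 1):
                for c in range(image.shape[1] - tw + 1):
                    patch = image[r:r + th, c:c + tw]
                    p = patch - patch.mean()
                    denom = np.sqrt((p * p).sum() * (t * t).sum())
                    score = (p * t).sum() / denom if denom else 0.0
                    if score > best:
                        best, best_pos = score, (r, c)
            return best_pos, best

        # Toy binary "page" with a 3x3 character blob, and a template of the same blob
        # centred in a 5x5 window.
        page = np.zeros((10, 10))
        page[4:7, 5:8] = 1.0
        template = np.zeros((5, 5))
        template[1:4, 1:4] = 1.0
        print(match_template(page, template))   # expected best position: (3, 4)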

  15. Introducing Text Analytics as a Graduate Business School Course

    Science.gov (United States)

    Edgington, Theresa M.

    2011-01-01

    Text analytics refers to the process of analyzing unstructured data from documented sources, including open-ended surveys, blogs, and other types of web dialog. Text analytics has enveloped the concept of text mining, an analysis approach influenced heavily from data mining. While text mining has been covered extensively in various computer…

  16. SSC Safety Review Document

    Energy Technology Data Exchange (ETDEWEB)

    Toohig, T.E. [ed.

    1988-11-01

    The safety strategy of the Superconducting Super Collider (SSC) Central Design Group (CDG) is to mitigate potential hazards to personnel, as far as possible, through appropriate measures in the design and engineering of the facility. The Safety Review Document identifies, on the basis of the Conceptual Design Report (CDR) and related studies, potential hazards inherent in the SSC project independent of its site. Mitigative measures in the design of facilities and in the structuring of laboratory operations are described for each of the hazards identified.

  17. What Documents Permit

    OpenAIRE

    2012-01-01

    Along with archives, which they are often associated with, documents have a central place in exhibitions, as they do in present-day contemporary art publications. The aim of the books here considered is not to shed light on this huge mnemonic turning-point which seems to have taken hold of art praxis and art discourse since the beginning of this third millennium, even if the contributions of some of their authors pinpoint circumstantial (post 9/11) and technical (the digital age) factors whic...

  18. Analysis of Design Documentation

    DEFF Research Database (Denmark)

    Hansen, Claus Thorp

    1998-01-01

    In design practice a process where a satisfactory solution is created within limited resources is required. However, since the design process is not well understood, research into how engineering designers actually solve design problems is needed. As a contribution to that end a research project...... has been established where we seek to identify useful design work patterns by retrospective analyses of documentation created during design projects. This paper describes the analysis method, a tentatively defined metric to evaluate identified work patterns, and presents results from the first...... analysis accomplished....

  19. An Integrated Multimedia Approach to Cultural Heritage e-Documents

    NARCIS (Netherlands)

    Smeulders, A.W.M.; Hardman, H.L.; Schreiber, G.; Geusebroek, J.M.

    2002-01-01

    We discuss access to e-documents from three different perspectives beyond the plain keyword web-search of the entire document. The first one is the situation-dependent delivery of multimedia documents, adapting the preferred form (picture, text, speech) to the available information capacity or need e

  20. Aiding the Interpretation of Ancient Documents

    DEFF Research Database (Denmark)

    Roued-Cunliffe, Henriette

    How can Decision Support System (DSS) software aid the interpretation process involved in the reading of ancient documents? This paper discusses the development of a DSS prototype for the reading of ancient texts. In this context the term ‘ancient documents’ is used to describe mainly Greek...... and Latin texts and the term ‘scholars’ is used to describe readers of these documents (e.g. papyrologists, epigraphers, palaeographers). However, the results from this research can be applicable to many other texts ranging from Nordic runes to 18th Century love letters. In order to develop an appropriate...... tool it is important first to comprehend the interpretation process involved in reading ancient documents. This is not a linear process but rather a recursive process where the scholar moves between different levels of reading, such as ‘understanding the meaning of a character’ or ‘understanding...

  1. Towards Multi Label Text Classification through Label Propagation

    Directory of Open Access Journals (Sweden)

    Shweta C. Dharmadhikari

    2012-06-01

    Full Text Available Classifying text data has been an active area of research for a long time. A text document is a multifaceted object, often inherently ambiguous by nature, and multi-label learning deals with such ambiguous objects. Classification of such ambiguous text objects often makes the task of the classifier difficult when assigning relevant classes to an input document. Traditional single-label and multi-class text classification paradigms cannot efficiently classify such a multifaceted text corpus. In this paper we propose a novel label propagation approach based on semi-supervised learning for multi-label text classification. Our proposed approach models the relationships between class labels and also effectively represents input text documents. We use a semi-supervised learning technique for effective utilization of labeled and unlabeled data in classification. Our proposed approach promises better classification accuracy and handling of complexity, and is evaluated on standard datasets such as Enron, Slashdot and Bibtex.
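
    Label propagation itself is a standard semi-supervised technique: the labels of the few labeled documents are iteratively spread over a document similarity graph while the labeled rows stay clamped. The sketch below is a generic illustration of that idea; the similarity matrix, damping factor and toy labels are assumptions, and the paper's specific modelling of relationships between labels is not reproduced.

        import numpy as np

        def propagate_labels(similarity, labels, mask, iters=50, alpha=0.8):
            """Semi-supervised label propagation.
            similarity: (n, n) symmetric affinity matrix between documents.
            labels:     (n, k) initial multi-label matrix (rows of unlabeled docs are zero).
            mask:       (n,) boolean, True where the document is labeled."""
            W = similarity / similarity.sum(axis=1, keepdims=True)   # row-normalize
            F = labels.astype(float).copy()
            for _ in range(iters):
                F = alpha * (W @ F) + (1 - alpha) * labels
                F[mask] = labels[mask]                  # clamp the labeled documents
            return F

        # 4 documents, 2 labels; documents 0 and 1 are labeled, 2 and 3 are not.
        sim = np.array([[1.0, 0.9, 0.7, 0.1],
                        [0.9, 1.0, 0.6, 0.2],
                        [0.7, 0.6, 1.0, 0.3],
                        [0.1, 0.2, 0.3, 1.0]])
        labels = np.array([[1, 0], [1, 1], [0, 0], [0, 0]])
        mask = np.array([True, True, False, False])
        print(np.round(propagate_labels(sim, labels, mask), 2))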

  2. Tank waste remediation system functions and requirements document

    Energy Technology Data Exchange (ETDEWEB)

    Carpenter, K.E

    1996-10-03

    This is the Tank Waste Remediation System (TWRS) Functions and Requirements Document derived from the TWRS Technical Baseline. The document consists of several text sections that provide the purpose, scope, background information, and an explanation of how this document assists the application of Systems Engineering to the TWRS. The primary functions identified in the TWRS Functions and Requirements Document are shown in Figure 4.1 (Section 4.0). Currently, this document is part of the overall effort to develop the TWRS Functional Requirements Baseline, and contains the functions and requirements needed to properly define the top three TWRS function levels. The TWRS Technical Baseline information (RDD-100 database) included in the appendices of the attached document contains the TWRS functions, requirements, and architecture necessary to define the TWRS Functional Requirements Baseline. Document organization and user directions are provided in the introductory text. This document will continue to be modified during the TWRS life-cycle.

  3. Regulatory guidance document

    Energy Technology Data Exchange (ETDEWEB)

    NONE

    1994-05-01

    The Office of Civilian Radioactive Waste Management (OCRWM) Program Management System Manual requires preparation of the OCRWM Regulatory Guidance Document (RGD) that addresses licensing, environmental compliance, and safety and health compliance. The document provides: regulatory compliance policy; guidance to OCRWM organizational elements to ensure a consistent approach when complying with regulatory requirements; strategies to achieve policy objectives; organizational responsibilities for regulatory compliance; guidance with regard to Program compliance oversight; and guidance on the contents of a project-level Regulatory Compliance Plan. The scope of the RGD includes site suitability evaluation, licensing, environmental compliance, and safety and health compliance, in accordance with the direction provided by Section 4.6.3 of the PMS Manual. Site suitability evaluation and regulatory compliance during site characterization are significant activities, particularly with regard to the YW MSA. OCRWM's evaluation of whether the Yucca Mountain site is suitable for repository development must precede its submittal of a license application to the Nuclear Regulatory Commission (NRC). Accordingly, site suitability evaluation is discussed in Chapter 4, and the general statements of policy regarding site suitability evaluation are discussed in Section 2.1. Although much of the data and analyses may initially be similar, the licensing process is discussed separately in Chapter 5. Environmental compliance is discussed in Chapter 6. Safety and Health compliance is discussed in Chapter 7.

  4. ExactPack Documentation

    Energy Technology Data Exchange (ETDEWEB)

    Singleton, Jr., Robert [Los Alamos National Lab. (LANL), Los Alamos, NM (United States); Israel, Daniel M. [Los Alamos National Lab. (LANL), Los Alamos, NM (United States); Doebling, Scott William [Los Alamos National Lab. (LANL), Los Alamos, NM (United States); Woods, Charles Nathan [Los Alamos National Lab. (LANL), Los Alamos, NM (United States); Kaul, Ann [Los Alamos National Lab. (LANL), Los Alamos, NM (United States); Walter, Jr., John William [Los Alamos National Lab. (LANL), Los Alamos, NM (United States); Rogers, Michael Lloyd [Los Alamos National Lab. (LANL), Los Alamos, NM (United States)

    2016-05-09

    For code verification, one compares the code output against known exact solutions. There are many standard test problems used in this capacity, such as the Noh and Sedov problems. ExactPack is a utility that integrates many of these exact solution codes into a common API (application program interface), and can be used as a stand-alone code or as a python package. ExactPack consists of python driver scripts that access a library of exact solutions written in Fortran or Python. The spatial profiles of the relevant physical quantities, such as the density, fluid velocity, sound speed, or internal energy, are returned at a time specified by the user. The solution profiles can be viewed and examined by a command line interface or a graphical user interface, and a number of analysis tools and unit tests are also provided. We have documented the physics of each problem in the solution library, and provided complete documentation on how to extend the library to include additional exact solutions. ExactPack’s code architecture makes it easy to extend the solution-code library to include additional exact solutions in a robust, reliable, and maintainable manner.

  5. Important Text Characteristics for Early-Grades Text Complexity

    Science.gov (United States)

    Fitzgerald, Jill; Elmore, Jeff; Koons, Heather; Hiebert, Elfrieda H.; Bowen, Kimberly; Sanford-Moore, Eleanor E.; Stenner, A. Jackson

    2015-01-01

    The Common Core set a standard for all children to read increasingly complex texts throughout schooling. The purpose of the present study was to explore text characteristics specifically in relation to early-grades text complexity. Three hundred fifty primary-grades texts were selected and digitized. Twenty-two text characteristics were identified…

  6. Handwriting segmentation of unconstrained Oriya text

    Indian Academy of Sciences (India)

    N Tripathy; U Pal

    2006-12-01

    Segmentation of handwritten text into lines, words and characters is one of the important steps in the handwritten text recognition process. In this paper we propose a water reservoir concept-based scheme for segmentation of unconstrained Oriya handwritten text into individual characters. Here, at first, the text image is segmented into lines, and the lines are then segmented into individual words. For line segmentation, the document is divided into vertical stripes. Analysing the heights of the water reservoirs obtained from different components of the document, the width of a stripe is calculated. Stripe-wise horizontal histograms are then computed and the relationship of the peak–valley points of the histograms is used for line segmentation. Based on vertical projection profiles and structural features of Oriya characters, text lines are segmented into words. For character segmentation, at first, the isolated and connected (touching) characters in a word are detected. Using structural, topological and water reservoir concept-based features, characters of the word that touch are then segmented. From experiments we have observed that the proposed “touching character” segmentation module has 96·7% accuracy for two-character touching strings.

  7. “Dreamers Often Lie”: On “Compromise”, the subversive documentation of an Israeli- Palestinian political adaptation of Shakespeare’s Romeo and Juliet

    Directory of Open Access Journals (Sweden)

    Yael Munk

    2010-03-01

    Full Text Available Is Romeo and Juliet relevant to a description of the Middle-East conflict? This is the question raised in Compromise, an Israeli documentary that follows the…

  8. Chinese multi-document personal name disambiguation

    Institute of Scientific and Technical Information of China (English)

    2005-01-01

    This paper presents a new approach to determining whether an interested personal name across documents refers to the same entity. Firstly, three vectors for each text are formed: the personal name Boolean vectors denoting whether a personal name occurs in the text, the biographical word Boolean vector representing title, occupation and so forth, and the feature vector with real values. Then, by combining a heuristic strategy based on Boolean vectors with an agglomerative clustering algorithm based on feature vectors, it seeks to resolve multi-document personal name coreference. Experimental results show that this approach achieves a good performance by testing on "Wang Gang" corpus.

  9. Retinal locus for scanning text.

    Science.gov (United States)

    Timberlake, George T; Sharma, Manoj K; Grose, Susan A; Maino, Joseph H

    2006-01-01

    A method of mapping the retinal location of text during reading is described in which text position is plotted cumulatively on scanning laser ophthalmoscope retinal images. Retinal locations that contain text most often are the brightest in the cumulative plot, and locations that contain text least often are the darkest. In this way, the retinal area that most often contains text is determined. Text maps were plotted for eight control subjects without vision loss and eight subjects with central scotomas from macular degeneration. Control subjects' text maps showed that the fovea contained text most often. Text maps of five of the subjects with scotomas showed that they used the same peripheral retinal area to scan text and fixate. Text maps of the other three subjects with scotomas showed that they used separate areas to scan text and fixate. Retinal text maps may help evaluate rehabilitative strategies for training individuals with central scotomas to use a particular retinal area to scan text.

  10. Does pedagogical documentation support maternal reminiscing conversations?

    Directory of Open Access Journals (Sweden)

    Bethany Fleck

    2015-12-01

    Full Text Available When parents talk with their children about lessons learned in school, they are participating in reminiscing of an unshared event. This study sought to understand if pedagogical documentation, from the Reggio Approach to early childhood education, would support and enhance the conversation. Mother–child dyads reminisced two separate times about preschool lessons, one time with documentation available to them and one time without. Transcripts were coded extracting variables indicative of high and low maternal reminiscing styles. Results indicate that mother and child conversation characteristics were more highly elaborative when documentation was present than when it was not. In addition, children added more information to the conversation supporting the notion that such conversations enhanced memory for lessons. Documentation could be used as a support tool for conversations and children’s memory about lessons learned in school.

  11. Visual Similarity Based Document Layout Analysis

    Institute of Scientific and Technical Information of China (English)

    Di Wen; Xiao-Qing Ding

    2006-01-01

    In this paper, a visual similarity based document layout analysis (DLA) scheme is proposed, which, by using a clustering strategy, can adaptively deal with documents in different languages, with different layout structures and skew angles. Aiming at a robust and adaptive DLA approach, the authors first find a set of representative filters and statistics to characterize typical texture patterns in document images through a visual similarity testing process. Texture features are then extracted from these filters and passed into a dynamic clustering procedure called visual similarity clustering. Finally, text contents are located from the clustered results. Benefiting from this scheme, the algorithm demonstrates strong robustness and adaptability across a wide variety of documents, which previous traditional DLA approaches do not possess.

  12. Handwritten Text Image Authentication using Back Propagation

    CERN Document Server

    Chakravarthy, A S N; Avadhani, P S

    2011-01-01

    Authentication is the act of confirming the truth of an attribute of a datum or entity. This might involve confirming the identity of a person, tracing the origins of an artefact, ensuring that a product is what its packaging and labelling claim it to be, or assuring that a computer program is a trusted one. The authentication of information can pose special problems (especially man-in-the-middle attacks), and is often wrapped up with authenticating identity. Literary forgery can involve imitating the style of a famous author. If an original manuscript, typewritten text, or recording is available, then the medium itself (or its packaging - anything from a box to e-mail headers) can help prove or disprove the authenticity of the document. The use of digital images of handwritten historical documents has become more popular in recent years. Volunteers around the world now read thousands of these images as part of their indexing process. Handwritten text images of old documents are sometimes difficult to read or noisy du...

  13. Binary/BCD-to-ASCII data converter

    Science.gov (United States)

    Miller, A. J.

    1977-01-01

    Converter inputs multiple precision binary words, converts data to multiple precision binary-coded decimal, and routes data back to computer. Converter base can be readily changed without need for new gate structure for each base changeover.
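
    The record above describes a hardware converter; a minimal Python sketch of the same data path (binary word to packed BCD digits to ASCII characters) is given below for illustration. The function names and the fixed digit width are assumptions of the sketch, not details of the original device.

        def binary_to_bcd(value, digits=8):
            """Pack a non-negative integer into BCD: one decimal digit per 4-bit nibble."""
            bcd = 0
            for pos in range(digits):
                bcd |= (value % 10) << (4 * pos)
                value //= 10
            return bcd

        def bcd_to_ascii(bcd, digits=8):
            """Unpack BCD nibbles (most significant first) into an ASCII digit string."""
            chars = []
            for pos in reversed(range(digits)):
                nibble = (bcd >> (4 * pos)) & 0xF
                chars.append(chr(ord('0') + nibble))
            return ''.join(chars)

        if __name__ == '__main__':
            word = 0b1111101000          # the binary word 1000
            packed = binary_to_bcd(word) # 0x00001000 in packed BCD
            print(bcd_to_ascii(packed))  # '00001000' as ASCII characters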

  14. Multilingual documentation and classification.

    Science.gov (United States)

    Donnelly, Kevin

    2008-01-01

    Health care providers around the world have used classification systems for decades as a basis for documentation, communications, statistical reporting, reimbursement and research. In more recent years machine-readable medical terminologies have taken on greater importance with the adoption of electronic health records and the need for greater granularity of data in clinical systems. Use of a clinical terminology harmonised with classifications, implemented within a clinical information system, will enable the delivery of many patient health benefits including electronic clinical decision support, disease screening and enhanced patient safety. In order to be usable these systems must be translated into the language of use, without losing meaning. It is evident that today one system cannot meet all requirements which call for collaboration and harmonisation in order to achieve true interoperability on a multilingual basis.

  15. Revitalizing a documentation system.

    Science.gov (United States)

    DiBlasi, M; Savage, J

    1992-01-01

    The nursing department of a 154-bed acute rehabilitation facility, cognizant of the changing trends in health care and responding to feedback from staff, developed and implemented a comprehensive documentation system. The previous system had been fragmented, inconsistent, and inefficient. The development of the new system focused on the complex needs of the rehabilitation client and the equally complex standards required by the Joint Commission on Accreditation of Healthcare Organizations (JCAHO), the Commission on Accreditation of Rehabilitation Facilities (CARF), and insurance carriers. The final product, which was based on the nursing process and functional health patterns, encompassed the following areas from admission to discharge: providing feedback on clients' functional abilities and progress toward goals, satisfying requirements of the 1990 JCAHO standards, and, finally, using a flow sheet that saves nursing time and increases objectivity. This article describes the system from conceptualization to successful implementation.

  16. A Two Step Data Mining Approach for Amharic Text Classification

    Directory of Open Access Journals (Sweden)

    Seffi Gebeyehu

    2016-08-01

    Full Text Available Traditionally, text classifiers are built from labeled training examples (supervised). Labeling is usually done manually by human experts (or the users), which is a labor intensive and time consuming process. In the past few years, researchers have investigated various forms of semi-supervised learning to reduce the burden of manual labeling. This paper aims to show that the accuracy of learned text classifiers can be improved by augmenting a small number of labeled training documents with a large pool of unlabeled documents. This is important because in many text classification problems obtaining training labels is expensive, while large quantities of unlabeled documents are readily available. In this paper, we implement an algorithm for learning from labeled and unlabeled documents based on the combination of Expectation-Maximization (EM) and two classifiers: Naive Bayes (NB) and locally weighted learning (LWL). NB first trains a classifier using the available labeled documents and probabilistically labels the unlabeled documents, while LWL uses a class of function approximation to build a model around the current point of interest. An experiment conducted on a mixture of labeled and unlabeled Amharic text documents showed that the new method achieved significantly better performance in comparison with supervised LWL and NB. The result also pointed out that the use of unlabeled data with EM reduces the classification absolute error by 27.6%. In general, since unlabeled documents are much less expensive and easier to collect than labeled documents, this method will be useful for text categorization tasks including online data sources such as web pages, e-mails and news group postings. If one uses this method, building text categorization systems will be significantly faster and less expensive than the supervised learning approach.
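
    A hedged sketch of the labeled-plus-unlabeled idea in this record, assuming scikit-learn and toy English placeholder documents: Naive Bayes is trained on a few labeled texts and then refined with EM-style iterations over an unlabeled pool. The LWL component is omitted and hard labels stand in for probabilistic ones, so this is an illustration of the scheme rather than the paper's method.

        # Semi-supervised refinement of Naive Bayes with an unlabeled pool.
        from scipy.sparse import vstack
        import numpy as np
        from sklearn.feature_extraction.text import CountVectorizer
        from sklearn.naive_bayes import MultinomialNB

        labeled_docs   = ["price rise inflation", "goal match striker"]
        labels         = np.array([0, 1])     # 0 = economy, 1 = sport (toy labels)
        unlabeled_docs = ["inflation slows economy", "striker scores goal", "match ends in draw"]

        vec   = CountVectorizer()
        X_lab = vec.fit_transform(labeled_docs)
        X_unl = vec.transform(unlabeled_docs)

        nb = MultinomialNB()
        nb.fit(X_lab, labels)                 # step 1: train on the labeled documents only

        for _ in range(5):                    # step 2: EM-style refinement
            guessed = nb.predict(X_unl)       # E-step: label the unlabeled pool
            X_all = vstack([X_lab, X_unl])    # M-step: retrain on everything
            y_all = np.concatenate([labels, guessed])
            nb.fit(X_all, y_all)

        print(nb.predict(vec.transform(["late goal wins match"])))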

  17. Automatic Text Decomposition and Structuring.

    Science.gov (United States)

    Salton, Gerard; And Others

    1996-01-01

    Text similarity measurements are used to determine relationships between natural-language texts and text excerpts. The resulting linked hypertext maps can be broken down into text segments and themes used to identify different text types and structures, leading to improved information access and utilization. Examples are provided for text…

  18. Temporal Adverbials in Text Structuring: On Temporal Text Strategy.

    Science.gov (United States)

    Virtanen, Tuija

    This paper discusses clause-initial adverbials of time functioning as signals of the temporal text strategy. A chain of such markers creates cohesion and coherence by forming continuity in the text and also signals textual boundaries that occur on different hierarchic levels. The temporal text strategy is closely associated with narrative text.…

  19. Basic test framework for the evaluation of text line segmentation and text parameter extraction.

    Science.gov (United States)

    Brodić, Darko; Milivojević, Dragan R; Milivojević, Zoran

    2010-01-01

    Text line segmentation is an essential stage in off-line optical character recognition (OCR) systems. It is key because inaccurately segmented text lines will lead to OCR failure. Text line segmentation of handwritten documents is a complex and diverse problem, complicated by the nature of handwriting. Hence, text line segmentation is a leading challenge in handwritten document image processing. Due to inconsistencies in the measurement and evaluation of text segmentation algorithm quality, a basic set of measurement methods is required. Currently, there is no commonly accepted one, and all algorithm evaluation is custom oriented. In this paper, a basic test framework for the evaluation of text feature extraction algorithms is proposed. This test framework consists of a few experiments primarily linked to text line segmentation, skew rate and reference text line evaluation. Although they are mutually independent, the results obtained are strongly cross linked. In the end, its suitability for different types of letters and languages as well as its adaptability are its main advantages. Thus, the paper presents an efficient evaluation method for text analysis algorithms.

  20. Text analysis methods, text analysis apparatuses, and articles of manufacture

    Science.gov (United States)

    Whitney, Paul D; Willse, Alan R; Lopresti, Charles A; White, Amanda M

    2014-10-28

    Text analysis methods, text analysis apparatuses, and articles of manufacture are described according to some aspects. In one aspect, a text analysis method includes accessing information indicative of data content of a collection of text comprising a plurality of different topics, using a computing device, analyzing the information indicative of the data content, and using results of the analysis, identifying a presence of a new topic in the collection of text.

  1. Toward Documentation of Program Evolution

    DEFF Research Database (Denmark)

    Vestdam, Thomas; Nørmark, Kurt

    2005-01-01

    The documentation of a program often falls behind the evolution of the program source files. When this happens it may be attractive to shift the documentation mode from updating the documentation to documenting the evolution of the program. This paper describes tools that support the documentation...... of program evolution. The tools are refinements of the Elucidative Programming tools, which in turn are inspired from Literate Programming tools. The version-aware Elucidative Programming tools are able to process a set of program source files in different versions together with unversioned documentation...... files. The paper introduces a set of fine grained program evolution steps, which are supported directly by the documentation tools. The automatic discovery of the fine grained program evolution steps makes up a platform for documenting coarse grained and more high-level program evolution steps...

  2. Formation peculiarities of tourism documentation

    OpenAIRE

    Zhezhnych, Pavlo; Soprunyuk, Oksana

    2013-01-01

    The article describes the formation peculiarities of tourism documentation, the role of tourism data consolidation for unified format creation, and the need to use existing software tools to handle tourism information; the formation process of tourism documentation is presented.

  3. Automatic document classification of biological literature

    Directory of Open Access Journals (Sweden)

    Sternberg Paul W

    2006-08-01

    Full Text Available Abstract Background Document classification is a wide-spread problem with many applications, from organizing search engine snippets to spam filtering. We previously described Textpresso, a text-mining system for biological literature, which marks up full text according to a shallow ontology that includes terms of biological interest. This project investigates document classification in the context of biological literature, making use of the Textpresso markup of a corpus of Caenorhabditis elegans literature. Results We present a two-step text categorization algorithm to classify a corpus of C. elegans papers. Our classification method first uses a support vector machine-trained classifier, followed by a novel, phrase-based clustering algorithm. This clustering step autonomously creates cluster labels that are descriptive and understandable by humans. This clustering engine performed better on a standard test-set (Reuters 21578) compared to previously published results (F-value of 0.55 vs. 0.49), while producing cluster descriptions that appear more useful. A web interface allows researchers to quickly navigate through the hierarchy and look for documents that belong to a specific concept. Conclusion We have demonstrated a simple method to classify biological documents that embodies an improvement over current methods. While the classification results are currently optimized for Caenorhabditis elegans papers by human-created rules, the classification engine can be adapted to different types of documents. We have demonstrated this by presenting a web interface that allows researchers to quickly navigate through the hierarchy and look for documents that belong to a specific concept.
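
    A rough sketch of a two-step categorize-then-cluster pipeline in the spirit of this record, assuming scikit-learn; bigram features stand in for the paper's phrase-based clustering, and the toy documents and category names are invented for illustration.

        # Step 1 categorizes documents with an SVM; step 2 clusters the documents
        # inside each predicted category on bigram ("phrase") features.
        from collections import defaultdict
        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.svm import LinearSVC
        from sklearn.cluster import KMeans

        train_docs   = ["kinase pathway signalling", "neuron synapse firing",
                        "kinase inhibitor binding assay", "axon neuron growth cone"]
        train_labels = ["biochemistry", "neuroscience", "biochemistry", "neuroscience"]
        new_docs     = ["inhibitor blocks kinase pathway", "kinase signalling cascade assay",
                        "synapse plasticity in neuron", "neuron axon guidance cue"]

        # Step 1: supervised categorization with a linear SVM on word features.
        word_vec = TfidfVectorizer()
        svm = LinearSVC().fit(word_vec.fit_transform(train_docs), train_labels)
        categories = svm.predict(word_vec.transform(new_docs))

        # Step 2: cluster the documents of each category on bigram features.
        groups = defaultdict(list)
        for doc, cat in zip(new_docs, categories):
            groups[cat].append(doc)

        phrase_vec = TfidfVectorizer(ngram_range=(2, 2))
        for cat, docs_in_cat in groups.items():
            if len(docs_in_cat) >= 2:
                km = KMeans(n_clusters=2, n_init=10, random_state=0)
                print(cat, km.fit_predict(phrase_vec.fit_transform(docs_in_cat)))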

  4. Language Documentation in the Americas

    Science.gov (United States)

    Franchetto, Bruna; Rice, Keren

    2014-01-01

    In the last decades, the documentation of endangered languages has advanced greatly in the Americas. In this paper we survey the role that international funding programs have played in advancing documentation in this part of the world, with a particular focus on the growth of documentation in Brazil, and we examine some of the major opportunities…

  5. The Practicalities of Document Conversion.

    Science.gov (United States)

    Galbraith, Ian

    1993-01-01

    Describes steps involved in the conversion of source documents to scanned digital image format. Topics addressed include document preparation, including photographs and oversized material; indexing procedures, including automatic indexing possibilities; scanning documents, including resolution and throughput; quality control; backfile conversion;…

  6. Automated Postediting of Documents

    CERN Document Server

    Knight, K; Knight, Kevin; Chander, Ishwar

    1994-01-01

    Large amounts of low- to medium-quality English texts are now being produced by machine translation (MT) systems, optical character readers (OCR), and non-native speakers of English. Most of this text must be postedited by hand before it sees the light of day. Improving text quality is tedious work, but its automation has not received much research attention. Anyone who has postedited a technical report or thesis written by a non-native speaker of English knows the potential of an automated postediting system. For the case of MT-generated text, we argue for the construction of postediting modules that are portable across MT systems, as an alternative to hardcoding improvements inside any one system. As an example, we have built a complete self-contained postediting module for the task of article selection (a, an, the) for English noun phrases. This is a notoriously difficult problem for Japanese-English MT. Our system contains over 200,000 rules derived automatically from online text resources. We report on l...

  7. Text-Attentional Convolutional Neural Network for Scene Text Detection.

    Science.gov (United States)

    He, Tong; Huang, Weilin; Qiao, Yu; Yao, Jian

    2016-06-01

    Recent deep learning models have demonstrated strong capabilities for classifying text and non-text components in natural images. They extract a high-level feature globally computed from a whole image component (patch), where the cluttered background information may dominate true text features in the deep representation. This leads to less discriminative power and poorer robustness. In this paper, we present a new system for scene text detection by proposing a novel text-attentional convolutional neural network (Text-CNN) that particularly focuses on extracting text-related regions and features from the image components. We develop a new learning mechanism to train the Text-CNN with multi-level and rich supervised information, including text region mask, character label, and binary text/non-text information. The rich supervision information enables the Text-CNN with a strong capability for discriminating ambiguous texts, and also increases its robustness against complicated background components. The training process is formulated as a multi-task learning problem, where low-level supervised information greatly facilitates the main task of text/non-text classification. In addition, a powerful low-level detector called contrast-enhancement maximally stable extremal regions (MSERs) is developed, which extends the widely used MSERs by enhancing intensity contrast between text patterns and background. This allows it to detect highly challenging text patterns, resulting in a higher recall. Our approach achieved promising results on the ICDAR 2013 data set, with an F-measure of 0.82, substantially improving the state-of-the-art results.
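
    A minimal multi-task sketch suggested by this record, assuming PyTorch: a shared convolutional trunk feeds a main text/non-text head and an auxiliary character-label head, and the two cross-entropy losses are combined. The layer sizes, the loss weight and the random inputs are illustrative, not the authors' architecture or data.

        import torch
        import torch.nn as nn

        class TinyTextCNN(nn.Module):
            def __init__(self, n_chars=36):
                super().__init__()
                self.trunk = nn.Sequential(               # shared convolutional trunk
                    nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
                    nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
                    nn.AdaptiveAvgPool2d(1), nn.Flatten())
                self.text_head = nn.Linear(32, 2)         # main task: text / non-text
                self.char_head = nn.Linear(32, n_chars)   # auxiliary task: character label

            def forward(self, x):
                h = self.trunk(x)
                return self.text_head(h), self.char_head(h)

        model = TinyTextCNN()
        patches = torch.randn(8, 1, 32, 32)               # grey-level component patches
        text_labels = torch.randint(0, 2, (8,))
        char_labels = torch.randint(0, 36, (8,))

        text_logits, char_logits = model(patches)
        loss = nn.functional.cross_entropy(text_logits, text_labels) \
               + 0.5 * nn.functional.cross_entropy(char_logits, char_labels)
        loss.backward()                                   # multi-task training step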

  8. Tool Gear Documentation

    Energy Technology Data Exchange (ETDEWEB)

    May, J; Gyllenhaal, J

    2002-04-03

    Tool Gear is designed to allow tool developers to insert instrumentation code into target programs using the DPCL library. This code can gather data and send it back to the Client for display or analysis. Tools can use the Tool Gear client without using the DPCL Collector. Any collector using the right protocols can send data to the Client for display and analysis. However, this document will focus on how to gather data with the DPCL Collector. There are three parts to the task of using Tool Gear to gather data through DPCL: (1) Write the instrumentation code that will be loaded and run in the target program. The code should be in the form of one or more functions, which can pass data structures back to the Client by way of DPCL. The collection of functions is compiled into a library, as described in this report. (2) Write the code that tells the DPCL Collector about the instrumentation and how to forward data back to the Client. (3) Extend the client to accept data from the Collector and display it in a useful way. The rest of this report describes how to carry out each of these steps.

  9. Mining knowledge from text repositories using information extraction: A review

    Indian Academy of Sciences (India)

    Sandeep R Sirsat; Dr Vinay Chavan; Dr Shrinivas P Deshpande

    2014-02-01

    There are two approaches to mining text from online repositories. First, when the knowledge to be discovered is expressed directly in the documents to be mined, Information Extraction (IE) alone can serve as an effective tool for such text mining. Second, when the documents contain concrete data in unstructured form rather than abstract knowledge, Information Extraction (IE) can be used to first transform the unstructured data in the document corpus into a structured database, and then use some state-of-the-art data mining algorithms/tools to identify abstract patterns in this extracted data. This paper presents a review of several methods related to these two approaches.

  10. The Challenge of Challenging Text

    Science.gov (United States)

    Shanahan, Timothy; Fisher, Douglas; Frey, Nancy

    2012-01-01

    The Common Core State Standards emphasize the value of teaching students to engage with complex text. But what exactly makes a text complex, and how can teachers help students develop their ability to learn from such texts? The authors of this article discuss five factors that determine text complexity: vocabulary, sentence structure, coherence,…

  11. Exploring Multicultural Discourse in Information Technology Documents

    Science.gov (United States)

    Piazza, Carolyn L.; Wallat, Cynthia

    2008-01-01

    Purpose: The primary purpose of this paper is to illustrate strategies college students learn to practice in their analysis of multicultural documents located through use of the internet. Design/methodology/approach: Students explored a variety of meanings for diversity within and across texts readily available in public databases. Using…

  12. Using LSA and text segmentation to improve automatic Chinese dialogue text summarization

    Institute of Scientific and Technical Information of China (English)

    LIU Chuan-han; WANG Yong-cheng; ZHENG Fei; LIU De-rong

    2007-01-01

    Automatic Chinese text summarization for dialogue style is a relatively new research area. In this paper, Latent Semantic Analysis (LSA) is first used to extract semantic knowledge from a given document, all question paragraphs are identified, an automatic text segmentation approach analogous to TextTiling is exploited to improve the precision of correlating question paragraphs and answer paragraphs, and finally some "important" sentences are extracted from the generic content and the question-answer pairs to generate a complete summary. Experimental results showed that our approach is highly efficient and improves significantly the coherence of the summary while not compromising informativeness.
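
    A simplified sketch of LSA-based sentence scoring for extractive summarization, assuming scikit-learn; the TextTiling-like segmentation and question-answer pairing of the record are omitted, and the toy dialogue sentences are invented.

        import numpy as np
        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.decomposition import TruncatedSVD

        sentences = [
            "The customer asked how to reset the router.",
            "The agent explained the reset button procedure.",
            "They also discussed the weather briefly.",
            "Finally the customer confirmed the router was working again.",
        ]

        X = TfidfVectorizer().fit_transform(sentences)
        svd = TruncatedSVD(n_components=2, random_state=0)
        Z = svd.fit_transform(X)              # sentences in the latent semantic space

        scores = np.linalg.norm(Z, axis=1)    # salience = magnitude in LSA space
        top = np.argsort(scores)[::-1][:2]    # keep the two most salient sentences
        summary = " ".join(sentences[i] for i in sorted(top))
        print(summary)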

  13. Proxima: a presentation-oriented editor for structured documents

    NARCIS (Netherlands)

    Schrage, M.M.

    2004-01-01

    A typical computer user deals with a large variety of documents, such as text files, spreadsheets, and web pages. The applications for constructing and modifying these documents are called editors (e.g. text editors, spreadsheet applications, and HTML editors). Despite the apparent differences betwe

  14. Text analysis devices, articles of manufacture, and text analysis methods

    Science.gov (United States)

    Turner, Alan E; Hetzler, Elizabeth G; Nakamura, Grant C

    2013-05-28

    Text analysis devices, articles of manufacture, and text analysis methods are described according to some aspects. In one aspect, a text analysis device includes processing circuitry configured to analyze initial text to generate a measurement basis usable in analysis of subsequent text, wherein the measurement basis comprises a plurality of measurement features from the initial text, a plurality of dimension anchors from the initial text and a plurality of associations of the measurement features with the dimension anchors, and wherein the processing circuitry is configured to access a viewpoint indicative of a perspective of interest of a user with respect to the analysis of the subsequent text, and wherein the processing circuitry is configured to use the viewpoint to generate the measurement basis.

  15. MULTILABEL CLASSIFICATION OF DOCUMENTS WITH MAPREDUCE

    Directory of Open Access Journals (Sweden)

    P.Malarvizhi

    2013-04-01

    Full Text Available Multilabel classification is the problem of assigning a set of positive labels to an instance; recently it has been highly required in applications like protein function classification, music categorization, gene classification and document classification, for easy identification and retrieval of information. Labeling the documents of the web manually is a time-consuming and difficult task due to the size of the web, which is a huge information resource; to overcome this difficulty, we propose a MapReduce algorithm for assigning labels to the documents of the web. MapReduce is a parallel programming framework built around the map and reduce functions and suits a wide variety of applications. In our approach, the documents of the web are given to the MapReduce framework, which assigns the set of positive labels to the documents using binary classification with a binary classifier. On experimentation, our proposed approach satisfactorily assigns labels to the documents of the web.
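
    A plain-Python simulation of the map/reduce flow sketched in this record: the map step runs one binary classifier per label over each document and emits (document, label) pairs, and the reduce step gathers the positive labels per document. The keyword predicates below are illustrative stand-ins for trained binary classifiers.

        from collections import defaultdict

        binary_classifiers = {                 # label -> predicate (stand-in for a model)
            "sports":   lambda text: "match" in text or "goal" in text,
            "finance":  lambda text: "stock" in text or "market" in text,
            "politics": lambda text: "election" in text,
        }

        documents = {
            "doc1": "the stock market reacted to the election result",
            "doc2": "the striker scored a late goal in the match",
        }

        def map_phase(doc_id, text):
            for label, clf in binary_classifiers.items():
                if clf(text):
                    yield doc_id, label        # emit (key, value) pairs

        def reduce_phase(pairs):
            grouped = defaultdict(set)
            for doc_id, label in pairs:
                grouped[doc_id].add(label)     # collect all positive labels per document
            return dict(grouped)

        pairs = [p for doc_id, text in documents.items() for p in map_phase(doc_id, text)]
        print(reduce_phase(pairs))             # doc1 -> {finance, politics}, doc2 -> {sports}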

  16. Rank Based Clustering For Document Retrieval From Biomedical Databases

    Directory of Open Access Journals (Sweden)

    Jayanthi Manicassamy

    2009-09-01

    Full Text Available Nowadays, search engines are widely used for extracting information from various resources throughout the world, and a majority of searches lie in the biomedical field, retrieving related documents from various biomedical databases. Current search engines lack document clustering and do not represent the relatedness level of the documents extracted from the databases. In order to overcome these pitfalls, a text based search engine has been developed for retrieving documents from the Medline and PubMed biomedical databases. The search engine incorporates a page-ranking based clustering concept which automatically represents relatedness on a clustering basis. Apart from this, a graph tree is constructed to represent the level of relatedness of the documents that are networked together. Incorporating this advanced functionality into a biomedical document search engine was found to provide better results in reviewing related documents based on relatedness.

  17. Text-Attentional Convolutional Neural Networks for Scene Text Detection.

    Science.gov (United States)

    He, Tong; Huang, Weilin; Qiao, Yu; Yao, Jian

    2016-03-28

    Recent deep learning models have demonstrated strong capabilities for classifying text and non-text components in natural images. They extract a high-level feature computed globally from a whole image component (patch), where the cluttered background information may dominate true text features in the deep representation. This leads to less discriminative power and poorer robustness. In this work, we present a new system for scene text detection by proposing a novel Text-Attentional Convolutional Neural Network (Text-CNN) that particularly focuses on extracting text-related regions and features from the image components. We develop a new learning mechanism to train the Text-CNN with multi-level and rich supervised information, including text region mask, character label, and binary text/nontext information. The rich supervision information enables the Text-CNN with a strong capability for discriminating ambiguous texts, and also increases its robustness against complicated background components. The training process is formulated as a multi-task learning problem, where low-level supervised information greatly facilitates the main task of text/non-text classification. In addition, a powerful low-level detector called Contrast-Enhancement Maximally Stable Extremal Regions (CE-MSERs) is developed, which extends the widely-used MSERs by enhancing intensity contrast between text patterns and background. This allows it to detect highly challenging text patterns, resulting in a higher recall. Our approach achieved promising results on the ICDAR 2013 dataset, with an F-measure of 0.82, improving the state-of-the-art results substantially.

  18. Text-Attentional Convolutional Neural Network for Scene Text Detection

    Science.gov (United States)

    He, Tong; Huang, Weilin; Qiao, Yu; Yao, Jian

    2016-06-01

    Recent deep learning models have demonstrated strong capabilities for classifying text and non-text components in natural images. They extract a high-level feature computed globally from a whole image component (patch), where the cluttered background information may dominate true text features in the deep representation. This leads to less discriminative power and poorer robustness. In this work, we present a new system for scene text detection by proposing a novel Text-Attentional Convolutional Neural Network (Text-CNN) that particularly focuses on extracting text-related regions and features from the image components. We develop a new learning mechanism to train the Text-CNN with multi-level and rich supervised information, including text region mask, character label, and binary text/nontext information. The rich supervision information enables the Text-CNN with a strong capability for discriminating ambiguous texts, and also increases its robustness against complicated background components. The training process is formulated as a multi-task learning problem, where low-level supervised information greatly facilitates the main task of text/non-text classification. In addition, a powerful low-level detector called Contrast-Enhancement Maximally Stable Extremal Regions (CE-MSERs) is developed, which extends the widely-used MSERs by enhancing intensity contrast between text patterns and background. This allows it to detect highly challenging text patterns, resulting in a higher recall. Our approach achieved promising results on the ICDAR 2013 dataset, with an F-measure of 0.82, improving the state-of-the-art results substantially.

  19. Contrastive Study of Coherence in Chinese Text and English Text

    Institute of Scientific and Technical Information of China (English)

    王婷

    2013-01-01

    The paper presents the text-linguistic concepts on which the analysis of textual structure is based, including text and discourse, and coherence and cohesion. In addition, we try to discover different manifestations of coherence between English text (ET) and Chinese text (CT), including different coherence structures.

  20. Test of Picture-Text Amalgams in Procedural Texts.

    Science.gov (United States)

    Stone, David Edey

    Designed to assess how people read and comprehend information presented in picture-text amalgams in procedural texts, this instrument presents various combinations of text information and illustrative information on slides. Subjects are assigned to one of four conditions and directed to follow the instructions presented on the slides. Videotapes…

  1. Guideline based structured documentation: the final goal?

    Science.gov (United States)

    Bürkle, Thomas; Ganslandt, Thomas; Tübergen, Dirk; Menzel, Josef; Kucharzik, Torsten; Neumann, Klaus; Schlüter, Stefan; Müller, Marcel; Veltmann, Ursula; Prokosch, Hans-Ulrich

    2002-01-01

    Structured documentation of medical procedures facilitates information retrieval for research and therapy and may help to improve patient care. Most medical documents until today however consist mainly of unstructured narrative text. Here we present an application for endoscopy which is not only fully integrated into a comprehensive clinical information system, but which also supports various degrees of structuring examination reports. The application has been used routinely in a German university hospital since summer 2000. We present the first unstructured version, which permits storage of a free text report together with selected examination images. The next step added improved structure to the document using a catalogue of index terms. The practical advantages of selective patient retrieval are described. Today we use a version which supports fully structured, guideline based documentation of endoscopy reports in order to automatically generate essential classification codes and the narrative examination report. All versions have advantages and disadvantages and we conclude that guideline based documentation may not be suitable for all endoscopy cases.

  2. OVERLAPPING VIRTUAL CADASTRAL DOCUMENTATION

    Directory of Open Access Journals (Sweden)

    Madalina - Cristina Marian

    2013-12-01

    Full Text Available Two cadastral plans of buildings can overlap virtually. The overlap becomes apparent upon digital reception. According to Law no. 7/1996, as amended and supplemented, these problems are solved by updating the graphical database through repositioning. This paper addresses the issue of virtual overlaps in the cadastre over the period 1999-2012.

  3. Multi-perspective Event Detection in Texts Documenting the 1944 Battle of Arnhem

    NARCIS (Netherlands)

    Düring, M.D.; Bosch, A.P.J. van den

    2014-01-01

    We present a pilot project which combines the respective strengths of research practices in history, memory studies, and computational linguistics. We present a proof-of-concept workflow for the semi-automatic detection and linking of narratives referring to the same event based on references to loc

  4. Annotating image ROIs with text descriptions for multimodal biomedical document retrieval

    Science.gov (United States)

    You, Daekeun; Simpson, Matthew; Antani, Sameer; Demner-Fushman, Dina; Thoma, George R.

    2013-01-01

    Regions of interest (ROIs) that are pointed to by overlaid markers (arrows, asterisks, etc.) in biomedical images are expected to contain more important and relevant information than other regions for biomedical article indexing and retrieval. We have developed several algorithms that localize and extract the ROIs by recognizing markers on images. Cropped ROIs then need to be annotated with contents describing them best. In most cases accurate textual descriptions of the ROIs can be found from figure captions, and these need to be combined with image ROIs for annotation. The annotated ROIs can then be used to, for example, train classifiers that separate ROIs into known categories (medical concepts), or to build visual ontologies, for indexing and retrieval of biomedical articles. We propose an algorithm that pairs visual and textual ROIs that are extracted from images and figure captions, respectively. This algorithm based on dynamic time warping (DTW) clusters recognized pointers into groups, each of which contains pointers with identical visual properties (shape, size, color, etc.). Then a rule-based matching algorithm finds the best matching group for each textual ROI mention. Our method yields a precision and recall of 96% and 79%, respectively, when ground truth textual ROI data is used.

  5. Text mining from ontology learning to automated text processing applications

    CERN Document Server

    Biemann, Chris

    2014-01-01

    This book comprises a set of articles that specify the methodology of text mining, describe the creation of lexical resources in the framework of text mining and use text mining for various tasks in natural language processing (NLP). The analysis of large amounts of textual data is a prerequisite to build lexical resources such as dictionaries and ontologies and also has direct applications in automated text processing in fields such as history, healthcare and mobile applications, just to name a few. This volume gives an update in terms of the recent gains in text mining methods and reflects

  6. Hierarchical Classification of Chinese Documents Based on N-grams

    Institute of Scientific and Technical Information of China (English)

    2001-01-01

    We explore techniques for utilizing N-gram information to categorize Chinese text documents hierarchically, so that the classifier can shake off the burden of large dictionaries and complex segmentation processing, and subsequently be domain and time independent. A hierarchical Chinese text classifier is implemented. Experimental results show that hierarchically classifying Chinese text documents based on N-grams can achieve satisfactory performance and outperforms other traditional Chinese text classifiers.
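
    A small sketch of segmentation-free classification with character N-grams, assuming scikit-learn; the two-level hierarchy is reduced here to a coarse classifier plus one fine-grained classifier for a single branch, and the toy Chinese snippets and labels are invented.

        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.naive_bayes import MultinomialNB

        docs   = ["股票市场今天上涨", "央行调整利率政策", "球队赢得比赛冠军", "运动员打破世界纪录"]
        coarse = ["finance", "finance", "sports", "sports"]
        fine   = ["stocks", "monetary", "football", "athletics"]

        # Character bigrams/trigrams: no dictionary or word segmentation required.
        vec = TfidfVectorizer(analyzer="char", ngram_range=(2, 3))
        X = vec.fit_transform(docs)

        coarse_clf = MultinomialNB().fit(X, coarse)

        # One fine-grained classifier per coarse class; shown here for "finance" only.
        fin_idx = [i for i, c in enumerate(coarse) if c == "finance"]
        fine_clf_finance = MultinomialNB().fit(X[fin_idx], [fine[i] for i in fin_idx])

        query = vec.transform(["利率政策再次调整"])
        if coarse_clf.predict(query)[0] == "finance":
            print(fine_clf_finance.predict(query))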

  7. Methods for Mining and Summarizing Text Conversations

    CERN Document Server

    Carenini, Giuseppe; Murray, Gabriel

    2011-01-01

    Due to the Internet Revolution, human conversational data -- in written forms -- are accumulating at a phenomenal rate. At the same time, improvements in speech technology enable many spoken conversations to be transcribed. Individuals and organizations engage in email exchanges, face-to-face meetings, blogging, texting and other social media activities. The advances in natural language processing provide ample opportunities for these "informal documents" to be analyzed and mined, thus creating numerous new and valuable applications. This book presents a set of computational methods

  8. REVISION AND REWRITING IN OFFICIAL DOCUMENTS: CONCEPTS AND METHODOLOGICAL ORIENTATIONS

    Directory of Open Access Journals (Sweden)

    Renilson José MENEGASSI

    2014-12-01

    Full Text Available The text discusses how the concepts and methodological orientations concerning text revision and rewriting processes, in the teaching context, are conceived and presented, and how they guide the Portuguese Language teacher's work. To this end, the concepts of revision and rewriting are characterized in four Brazilian official documents, two of national scope and two from Paraná state. The information was organized around what the documents show about the teacher's and student's attitudes toward the investigated concepts, which determine the methodological orientations for text production work. The results show irregularities in the handling of these processes, highlighting one of the official documents of national scope as the one presenting the most suitable methodological and conceptual orientations. This shows that the documents guiding mother tongue teaching in the country still do not appropriately discuss the written text production process, specifically revision and rewriting, even in the more recent documents.

  9. Cluster Based Text Classification Model

    DEFF Research Database (Denmark)

    Nizamani, Sarwat; Memon, Nasrullah; Wiil, Uffe Kock

    2011-01-01

    We propose a cluster based classification model for suspicious email detection and other text classification tasks. The text classification tasks comprise many training examples that require a complex classification model. Using clusters for classification makes the model simpler and increases...
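
    A hedged sketch of the cluster-then-classify idea, assuming scikit-learn: training messages are grouped with k-means, a small classifier is trained inside each cluster, and a new message is routed to its nearest cluster's classifier. The toy data and the choice of k-means and logistic regression are assumptions of the sketch, not the authors' model.

        import numpy as np
        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.cluster import KMeans
        from sklearn.linear_model import LogisticRegression

        train_texts = ["win money now", "meeting at noon", "cheap loans offer",
                       "project deadline tomorrow", "claim your cash prize", "lunch with the team"]
        labels = np.array([1, 0, 1, 0, 1, 0])    # 1 = suspicious, 0 = normal (toy labels)

        vec = TfidfVectorizer()
        X = vec.fit_transform(train_texts)
        km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

        predict_in = {}                           # one predictor per cluster
        for c in range(km.n_clusters):
            idx = np.where(km.labels_ == c)[0]
            y_c = labels[idx]
            if len(set(y_c)) > 1:
                predict_in[c] = LogisticRegression().fit(X[idx], y_c).predict
            else:                                 # single-class cluster: constant answer
                predict_in[c] = lambda q, v=int(y_c[0]): np.array([v])

        query = vec.transform(["you won a cash prize"])
        print(predict_in[km.predict(query)[0]](query))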

  10. Politeness Strategies of the Provisional Agreement Document

    Directory of Open Access Journals (Sweden)

    Mohammad Shehadeh

    2017-02-01

    Full Text Available This work investigates the politeness strategies of the so-called provisional agreement document in the Jordanian context. It shows that such a document maintains several politeness strategies that help resolve any tension between the parties involved. It reflects that politeness is meant not only to elevate communication between people but is also an important clue in cataloguing this communication when it occurs in situations where the relationship between the two interlocutors is marked by a tragic event such as a murder or a car accident.

  11. Dynamic documents with R and knitr

    CERN Document Server

    Xie, Yihui

    2013-01-01

    Contents: Introduction; Reproducible Research; Literature; Good and Bad Practices; Barriers; A First Look; Setup; Minimal Examples; Quick Reporting; Extracting R Code; Editors; RStudio; LYX; Emacs/ESS; Other Editors; Document Formats; Input Syntax; Document Formats; Output Renderers; R Scripts; Text Output; Inline Output; Chunk Output; Tables; Themes; Graphics; Graphical Devices; Plot Recording; Plot Rearrangement; Plot Size in Output; Extra Output Options; The tikz Device; Figure Environment; Figure Path; Cache; Implementation; Write Cache; When to Update Cache; Side Effects; Chunk Dependencies; Cross Reference; Chunk Reference; Code Externalization; Chi…

  12. Text Signals Influence Team Artifacts

    Science.gov (United States)

    Clariana, Roy B.; Rysavy, Monica D.; Taricani, Ellen

    2015-01-01

    This exploratory quasi-experimental investigation describes the influence of text signals on team visual map artifacts. In two course sections, four-member teams were given one of two print-based text passage versions on the course-related topic "Social influence in groups" downloaded from Wikipedia; this text had two paragraphs, each…

  13. Too Dumb for Complex Texts?

    Science.gov (United States)

    Bauerlein, Mark

    2011-01-01

    High school students' lack of experience and practice with reading complex texts is a primary cause of their difficulties with college-level reading. Filling the syllabus with digital texts does little to address this deficiency. Complex texts demand three dispositions from readers: a willingness to probe works characterized by dense meanings, the…

  14. Multilingual Text Analysis for Text-to-Speech Synthesis

    CERN Document Server

    Sproat, R

    1996-01-01

    We present a model of text analysis for text-to-speech (TTS) synthesis based on (weighted) finite-state transducers, which serves as the text-analysis module of the multilingual Bell Labs TTS system. The transducers are constructed using a lexical toolkit that allows declarative descriptions of lexicons, morphological rules, numeral-expansion rules, and phonological rules, inter alia. To date, the model has been applied to eight languages: Spanish, Italian, Romanian, French, German, Russian, Mandarin and Japanese.
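
    One concrete text-analysis step a TTS front end performs is numeral expansion; a tiny rule-based Python routine is sketched below. The system described above compiles such rules into weighted finite-state transducers built from declarative lexicons, which this sketch does not attempt.

        import re

        ONES  = ["zero", "one", "two", "three", "four", "five", "six", "seven", "eight", "nine"]
        TEENS = ["ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
                 "sixteen", "seventeen", "eighteen", "nineteen"]
        TENS  = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy", "eighty", "ninety"]

        def expand_number(n):
            """Spell out 0..999 in English words."""
            if n < 10:
                return ONES[n]
            if n < 20:
                return TEENS[n - 10]
            if n < 100:
                return TENS[n // 10] + ("-" + ONES[n % 10] if n % 10 else "")
            rest = expand_number(n % 100) if n % 100 else ""
            return ONES[n // 100] + " hundred" + (" " + rest if rest else "")

        def expand_text(text):
            return re.sub(r"\d+", lambda m: expand_number(int(m.group())), text)

        print(expand_text("Flight 374 departs at gate 21"))
        # Flight three hundred seventy-four departs at gate twenty-one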

  15. Predicting Prosody from Text for Text-to-Speech Synthesis

    CERN Document Server

    Rao, K Sreenivasa

    2012-01-01

    Predicting Prosody from Text for Text-to-Speech Synthesis covers the specific aspects of prosody, mainly focusing on how to predict the prosodic information from linguistic text, and then how to exploit the predicted prosodic knowledge for various speech applications. Author K. Sreenivasa Rao discusses proposed methods along with state-of-the-art techniques for the acquisition and incorporation of prosodic knowledge for developing speech systems. Positional, contextual and phonological features are proposed for representing the linguistic and production constraints of the sound units present in the text. This book is intended for graduate students and researchers working in the area of speech processing.

  16. Text comprehension practice in school

    Directory of Open Access Journals (Sweden)

    Hernández, José Emilio

    2010-01-01

    Full Text Available The starting point of the study is the existence of relations between the two dimensions of text comprehension: the instrumental dimension and the cognitive dimension. The first one includes the system of actions, the second one the system of knowledge. A description of identifying, describing, inferring, appraising and creating actions is suggested for each type of text. Likewise, the importance of implementing text comprehension is outlined on the basis of the assumption that the text is a tool for preserving and communicating culture, one that allows human beings to widen their respective cultural horizons and develop the cognitive and affective processes that allow them to grasp universal moral values.

  17. Aparecida Document Sociopolitical aspects

    Directory of Open Access Journals (Sweden)

    P. Alberto Henriques, sdb

    2010-12-01

    Full Text Available Three years ago the Fifth General Conference of the Latin American and Caribbean Episcopate issued a Concluding Document that summarized the main positions of the various representatives of the Latin American Church regarding the future of Catholicism in our lands. While acknowledging that it is a text born of the action of the Holy Spirit in History, we cannot, however, overlook various social and political components that condition the orientation of the aforementioned publication.

  18. Text mining: A Brief survey

    Directory of Open Access Journals (Sweden)

    Falguni N. Patel, Neha R. Soni

    2012-12-01

    Full Text Available Unstructured texts, which contain a massive amount of information, cannot simply be used for further processing by computers. Therefore, specific processing methods and algorithms are required in order to extract useful patterns. The process of extracting interesting information and knowledge from unstructured text is accomplished using text mining. In this paper, we discuss text mining as a recent and interesting field, detailing the steps involved in the overall process. We also discuss different technologies that equip computers with natural language capabilities so that they may analyze, understand, and even generate text. In addition, we briefly discuss a number of successful applications of text mining which are in use currently or foreseen for the future.

  19. TEXT DEIXIS IN NARRATIVE SEQUENCES

    Directory of Open Access Journals (Sweden)

    Josep Rivera

    2007-06-01

    Full Text Available This study looks at demonstrative descriptions, regarding them as text-deictic procedures which contribute to weave discourse reference. Text deixis is thought of as a metaphorical referential device which maps the ground of utterance onto the text itself. Demonstrative expressions with textual antecedent-triggers, considered as the most important text-deictic units, are identified in a narrative corpus consisting of J. M. Barrie’s Peter Pan and its translation into Catalan. Some linguistic and discourse variables related to DemNPs are analysed to characterise adequately text deixis. It is shown that this referential device is usually combined with abstract nouns, thus categorising and encapsulating (non-nominal complex discourse entities as nouns, while performing a referential cohesive function by means of the text deixis + general noun type of lexical cohesion.

  20. A Noisy-Channel Model for Document Compression

    CERN Document Server

    Daumé, Hal

    2009-01-01

    We present a document compression system that uses a hierarchical noisy-channel model of text production. Our compression system first automatically derives the syntactic structure of each sentence and the overall discourse structure of the text given as input. The system then uses a statistical hierarchical model of text production in order to drop non-important syntactic and discourse constituents so as to generate coherent, grammatical document compressions of arbitrary length. The system outperforms both a baseline and a sentence-based compression system that operates by simplifying sequentially all sentences in a text. Our results support the claim that discourse knowledge plays an important role in document summarization.

  1. Document Clustering Based on Semi-Supervised Term Clustering

    Directory of Open Access Journals (Sweden)

    Hamid Mahmoodi

    2012-05-01

    Full Text Available The study proposes a multi-step feature (term) selection process and, in a semi-supervised fashion, provides initial centers for term clusters. The fuzzy c-means (FCM) clustering algorithm is then utilized for clustering terms. Finally, each document is assigned to its closest associated term cluster. While most text clustering algorithms directly use documents for clustering, we propose to first group the terms using the FCM algorithm and then cluster documents based on the term clusters. We evaluate the effectiveness of our technique on several standard text collections and compare our results with some classical text clustering algorithms.
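
    A rough sketch of clustering terms first and then assigning documents to term clusters, assuming scikit-learn; seed terms supply the initial centres (the semi-supervised hint), and hard k-means stands in for the paper's fuzzy c-means, so memberships are crisp rather than fuzzy. The toy documents and seed terms are invented.

        import numpy as np
        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.cluster import KMeans

        docs = ["the match ended with a late goal",
                "stock prices fell as the market opened",
                "the striker scored in the final match",
                "investors watched the market and stock indexes"]

        vec = TfidfVectorizer()
        X = vec.fit_transform(docs)                  # documents x terms
        terms = vec.get_feature_names_out()
        T = X.T.toarray()                            # term vectors: terms x documents

        seed_terms = ["match", "market"]             # the semi-supervised hint
        seed_idx = [list(terms).index(t) for t in seed_terms]
        km = KMeans(n_clusters=2, init=T[seed_idx], n_init=1).fit(T)

        # Assign each document to the term cluster carrying the most TF-IDF weight in it.
        for i, doc in enumerate(docs):
            w = X[i].toarray().ravel()
            weights = [w[km.labels_ == c].sum() for c in range(2)]
            print(doc, "->", int(np.argmax(weights)))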

  2. Document image analysis: A primer

    Indian Academy of Sciences (India)

    Rangachar Kasturi; Lawrence O’Gorman; Venu Govindaraju

    2002-02-01

    Document image analysis refers to algorithms and techniques that are applied to images of documents to obtain a computer-readable description from pixel data. A well-known document image analysis product is the Optical Character Recognition (OCR) software that recognizes characters in a scanned document. OCR makes it possible for the user to edit or search the document’s contents. In this paper we briefly describe various components of a document analysis system. Many of these basic building blocks are found in most document analysis systems, irrespective of the particular domain or language to which they are applied. We hope that this paper will help the reader by providing the background necessary to understand the detailed descriptions of specific techniques presented in other papers in this issue.

  3. Chemical-text hybrid search engines.

    Science.gov (United States)

    Zhou, Yingyao; Zhou, Bin; Jiang, Shumei; King, Frederick J

    2010-01-01

    As the amount of chemical literature increases, it is critical that researchers be enabled to accurately locate documents related to a particular aspect of a given compound. Existing solutions, based on text and chemical search engines alone, suffer from the inclusion of "false negative" and "false positive" results, and cannot accommodate the diverse repertoire of formats currently available for chemical documents. To address these concerns, we developed an approach called Entity-Canonical Keyword Indexing (ECKI), which converts a chemical entity embedded in a data source into its canonical keyword representation prior to being indexed by text search engines. We implemented ECKI using Microsoft Office SharePoint Server Search, and the resultant hybrid search engine not only supported complex mixed chemical and keyword queries but also was applied to both intranet and Internet environments. We envision that the adoption of ECKI will empower researchers to pose more complex search questions that were not readily attainable previously and to obtain answers at much improved speed and accuracy.

  4. Online Visual Analytics of Text Streams.

    Science.gov (United States)

    Liu, Shixia; Yin, Jialun; Wang, Xiting; Cui, Weiwei; Cao, Kelei; Pei, Jian

    2016-11-01

    We present an online visual analytics approach to helping users explore and understand hierarchical topic evolution in high-volume text streams. The key idea behind this approach is to identify representative topics in incoming documents and align them with the existing representative topics that they immediately follow (in time). To this end, we learn a set of streaming tree cuts from topic trees based on user-selected focus nodes. A dynamic Bayesian network model has been developed to derive the tree cuts in the incoming topic trees to balance the fitness of each tree cut and the smoothness between adjacent tree cuts. By connecting the corresponding topics at different times, we are able to provide an overview of the evolving hierarchical topics. A sedimentation-based visualization has been designed to enable the interactive analysis of streaming text data from global patterns to local details. We evaluated our method on real-world datasets and the results are generally favorable.

  5. Extraction of information from unstructured text

    Energy Technology Data Exchange (ETDEWEB)

    Irwin, N.H.; DeLand, S.M.; Crowder, S.V.

    1995-11-01

    Extracting information from unstructured text has become an emphasis in recent years due to the large amount of text now electronically available. This status report describes the findings and work done by the end of the first year of a two-year LDRD. Requirements of the approach included that it model the information in a domain independent way. This means that it would differ from current systems by not relying on previously built domain knowledge and that it would do more than keyword identification. Three areas that are discussed and expected to contribute to a solution include (1) identifying key entities through document level profiling and preprocessing, (2) identifying relationships between entities through sentence level syntax, and (3) combining the first two with semantic knowledge about the terms.

  6. Stamp Detection in Color Document Images

    DEFF Research Database (Denmark)

    Micenkova, Barbora; van Beusekom, Joost

    2011-01-01

    An automatic system for stamp segmentation and further verification is needed especially for environments like insurance companies where a huge volume of documents is processed daily. However, detection of a general stamp is not a trivial task as it can have different shapes and colors and, moreover, it can be imprinted with a variable quality and rotation. Previous methods were restricted to detection of stamps of particular shapes or colors. The method presented in the paper includes segmentation of the image by color clustering and subsequent classification of candidate solutions by geometrical and color-related features. The approach allows for differentiation of stamps from other color objects in the document such as logos or texts. For the purpose of evaluation, a data set of 400 document images has been collected, annotated and made public. With the proposed method, recall of 83...

  7. Probabilistic Aspects in Spoken Document Retrieval

    Directory of Open Access Journals (Sweden)

    Macherey Wolfgang

    2003-01-01

    Full Text Available Accessing information in multimedia databases encompasses a wide range of applications in which spoken document retrieval (SDR) plays an important role. In SDR, a set of automatically transcribed speech documents constitutes the files for retrieval, to which a user may address a request in natural language. This paper deals with two probabilistic aspects in SDR. The first part investigates the effect of recognition errors on retrieval performance and examines the question of why recognition errors have only a small effect on retrieval performance. In the second part, we present a new probabilistic approach to SDR that is based on interpolations between document representations. Experiments performed on the TREC-7 and TREC-8 SDR task show comparable or even better results for the new proposed method than other advanced heuristic and probabilistic retrieval metrics.

  8. Document Retrieval on Repetitive Collections

    OpenAIRE

    Navarro, Gonzalo; Puglisi, Simon J.; Sirén, Jouni

    2014-01-01

    Document retrieval aims at finding the most important documents where a pattern appears in a collection of strings. Traditional pattern-matching techniques yield brute-force document retrieval solutions, which has motivated the research on tailored indexes that offer near-optimal performance. However, an experimental study establishing which alternatives are actually better than brute force, and which perform best depending on the collection characteristics, has not been carried out. In this ...

  9. Data mining of text as a tool in authorship attribution

    Science.gov (United States)

    Visa, Ari J. E.; Toivonen, Jarmo; Autio, Sami; Maekinen, Jarno; Back, Barbro; Vanharanta, Hannu

    2001-03-01

    It is common that text documents are characterized and classified by keywords that the authors use to give them. Visa et al. have developed a new methodology based on prototype matching. The prototype is an interesting document or a part of an extracted, interesting text. This prototype is matched with the document database of the monitored document flow. The new methodology is capable of extracting the meaning of the document to a certain degree. Our claim is that the new methodology is also capable of authenticating the authorship. To verify this claim two tests were designed. The test hypothesis was that the words and the word order in the sentences could authenticate the author. In the first test three authors were selected. The selected authors were William Shakespeare, Edgar Allan Poe, and George Bernard Shaw. Three texts from each author were examined. Every text was one by one used as a prototype. The two nearest matches with the prototype were noted. The second test uses the Reuters-21578 financial news database. A group of 25 short financial news reports from five different authors is examined. Our new methodology and the interesting results from the two tests are reported in this paper. In the first test, for Shakespeare and for Poe all cases were successful. For Shaw one text was confused with Poe. In the second test the authors of the Reuters-21578 financial news reports were identified relatively well. The conclusion is that our text mining methodology seems to be capable of authorship attribution.
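
    A minimal sketch of prototype matching as described above: a prototype text is compared against a small document database and the two nearest matches are reported. The toy database, the TF-IDF representation and the cosine similarity measure are illustrative assumptions, not the authors' exact feature set.

      # Sketch of prototype matching for authorship attribution: a prototype text is
      # compared against a document database and the nearest matches are returned.
      # Assumes scikit-learn; the toy "database" below is illustrative only.
      from sklearn.feature_extraction.text import TfidfVectorizer
      from sklearn.metrics.pairwise import cosine_similarity

      database = {
          "author_A_text_1": "to be or not to be that is the question",
          "author_A_text_2": "all the world is a stage and the men merely players",
          "author_B_text_1": "once upon a midnight dreary while I pondered weak and weary",
      }

      prototype = "to die to sleep no more and by a sleep to say we end"

      vec = TfidfVectorizer(analyzer="word", ngram_range=(1, 2))
      matrix = vec.fit_transform(list(database.values()) + [prototype])

      sims = cosine_similarity(matrix[-1], matrix[:-1]).ravel()
      ranked = sorted(zip(database.keys(), sims), key=lambda p: p[1], reverse=True)
      print(ranked[:2])   # the two nearest matches, as in the paper's test protocol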

  10. Disaster documentation for the clinician.

    Science.gov (United States)

    Zoraster, Richard M; Burkle, Christopher M

    2013-08-01

    Documentation of the patient encounter is a traditional component of health care practice, a requirement of various regulatory agencies and hospital oversight committees, and a necessity for reimbursement. A disaster may create unexpected challenges to documentation. If patient volume and acuity overwhelm health care providers, what is the acceptable appropriate documentation? If alterations in scope of practice and environmental or resource limitations occur, to what degree should this be documented? The conflicts arising from allocation of limited resources create unfamiliar situations in which patient competition becomes a component of the medical decision making; should that be documented, and, if so, how? In addition to these challenges, ever-present liability worries are compounded by controversies over the standards to which health care providers will be held. Little guidance is available on how or what to document. We conducted a search of the literature and found no appropriate references for disaster documentation, and no guidelines from professional organizations. We review here the challenges affecting documentation during disasters and provide a rationale for specific patient care documentation that avoids regulatory and legal pitfalls.

  11. Document delivery services contrasting views

    CERN Document Server

    1999-01-01

    Design and maintain document delivery services that are ideal for academic patrons! In Document Delivery Services: Contrasting Views, you'll visit four university library systems to discover the considerations and challenges each library faced in bringing document delivery to its clientele. This book examines the questions about document delivery that are most pressing in the profession of library science. Despite their own unique experiences, you'll find common practices among all four, including planning, implementation of service, and evaluation of either user satisfaction and/or vendor performance.

  12. Knowledge Representation in Travelling Texts

    DEFF Research Database (Denmark)

    Mousten, Birthe; Locmele, Gunta

    2014-01-01

    Today, information travels fast. Texts travel, too. In a corporate context, the question is how to manage which knowledge elements should travel to a new language area or market and in which form? The decision to let knowledge elements travel or not travel highly depends on the limitation and the purpose of the text in a new context as well as on predefined parameters for text travel. For texts used in marketing and in technology, the question is whether culture-bound knowledge representation should be domesticated or kept as foreign elements, or should be mirrored or moulded—or should not travel at all! When should semantic and pragmatic elements in a text be replaced and by which other elements? The empirical basis of our work is marketing and technical texts in English, which travel into the Latvian and Danish markets, respectively.

  13. A NOVEL MULTIDICTIONARY BASED TEXT COMPRESSION

    Directory of Open Access Journals (Sweden)

    Y. Venkataramani

    2012-01-01

    Full Text Available The amount of digital content grows at an ever faster rate, and so does the demand to communicate it. On the other hand, the amount of storage and bandwidth increases at a slower rate. Thus powerful and efficient compression methods are required. The repetition of words and phrases makes reordered text much more compressible than the original text. On the whole, the system is fast and achieves close to the best results on the test files. In this study a novel fast dictionary-based text compression technique, MBRH (multidictionary with Burrows-Wheeler transform, run-length coding and Huffman coding), is proposed for the purpose of obtaining improved performance on various document sizes. The MBRH algorithm comprises two stages: the first stage converts the input text into a dictionary-based compressed form, and the second stage reduces the redundancy in the multidictionary-based compression by using BWT, RLE and Huffman coding. On the bib test file with an input size of 111,261 bytes, the algorithm achieves a compression ratio of 0.192 and a bit rate of 1.538 at high speed. The algorithm attains a good compression ratio, a reduced bit rate and an increase in execution speed.
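
    The back-end stages named above (BWT, RLE and Huffman coding) can be sketched as a short Python pipeline. This is a minimal illustration, not the MBRH implementation: the dictionary stage is omitted, the BWT is a naive rotation sort, and only Huffman code lengths are reported.

      # Sketch: BWT -> run-length coding -> Huffman coding, chained on a small string.
      import heapq
      from collections import Counter

      def bwt(s: str, eos: str = "\0") -> str:
          # Naive Burrows-Wheeler transform: sort all rotations, take the last column.
          s += eos
          rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
          return "".join(rot[-1] for rot in rotations)

      def rle(s: str):
          # Run-length encode as (symbol, run_length) pairs.
          out = []
          for ch in s:
              if out and out[-1][0] == ch:
                  out[-1][1] += 1
              else:
                  out.append([ch, 1])
          return [(c, n) for c, n in out]

      def huffman_code_lengths(symbols) -> dict:
          # Build Huffman code lengths from symbol frequencies (heap-based merge).
          freq = Counter(symbols)
          heap = [[w, [sym, ""]] for sym, w in freq.items()]
          heapq.heapify(heap)
          if len(heap) == 1:                      # degenerate case: one distinct symbol
              return {heap[0][1][0]: 1}
          while len(heap) > 1:
              lo = heapq.heappop(heap)
              hi = heapq.heappop(heap)
              for pair in lo[1:]:
                  pair[1] = "0" + pair[1]
              for pair in hi[1:]:
                  pair[1] = "1" + pair[1]
              heapq.heappush(heap, [lo[0] + hi[0]] + lo[1:] + hi[1:])
          return {sym: len(code) for sym, code in heap[0][1:]}

      text = "the quick brown fox and the lazy dog and the lazy fox"
      transformed = bwt(text)
      runs = rle(transformed)
      lengths = huffman_code_lengths(sym for sym, _ in runs)
      print(runs[:10], sorted(lengths.items())[:5])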

  14. ERRORS AND DIFFICULTIES IN TRANSLATING LEGAL TEXTS

    Directory of Open Access Journals (Sweden)

    Camelia, CHIRILA

    2014-11-01

    Full Text Available Nowadays the accurate translation of legal texts has become highly important as the mistranslation of a passage in a contract, for example, could lead to lawsuits and loss of money. Consequently, the translation of legal texts to other languages faces many difficulties and only professional translators specialised in legal translation should deal with the translation of legal documents and scholarly writings. The purpose of this paper is to analyze translation from three perspectives: translation quality, errors and difficulties encountered in translating legal texts and consequences of such errors in professional translation. First of all, the paper points out the importance of performing a good and correct translation, which is one of the most important elements to be considered when discussing translation. Furthermore, the paper presents an overview of the errors and difficulties in translating texts and of the consequences of errors in professional translation, with applications to the field of law. The paper is also an approach to the differences between languages (English and Romanian that can hinder comprehension for those who have embarked upon the difficult task of translation. The research method that I have used to achieve the objectives of the paper was the content analysis of various Romanian and foreign authors' works.

  15. Texting while driving: is speech-based text entry less risky than handheld text entry?

    Science.gov (United States)

    He, J; Chaparro, A; Nguyen, B; Burge, R J; Crandall, J; Chaparro, B; Ni, R; Cao, S

    2014-11-01

    Research indicates that using a cell phone to talk or text while maneuvering a vehicle impairs driving performance. However, few published studies directly compare the distracting effects of texting using a hands-free (i.e., speech-based interface) versus handheld cell phone, which is an important issue for legislation, automotive interface design and driving safety training. This study compared the effect of speech-based versus handheld text entries on simulated driving performance by asking participants to perform a car following task while controlling the duration of a secondary text-entry task. Results showed that both speech-based and handheld text entries impaired driving performance relative to the drive-only condition by causing more variation in speed and lane position. Handheld text entry also increased the brake response time and increased variation in headway distance. Text entry using a speech-based cell phone was less detrimental to driving performance than handheld text entry. Nevertheless, the speech-based text entry task still significantly impaired driving compared to the drive-only condition. These results suggest that speech-based text entry disrupts driving, but reduces the level of performance interference compared to text entry with a handheld device. In addition, the difference in the distraction effect caused by speech-based and handheld text entry is not simply due to the difference in task duration.

  16. Text Type and Translation Strategy

    Institute of Scientific and Technical Information of China (English)

    刘福娟

    2015-01-01

    Translation strategy and translation standards are undoubtedly the core problems translators are confronted with in translation. Many kinds of translation strategies have arisen in the history of translation, among which text type theory is considered an important breakthrough and a significant complement to traditional translation standards. This essay attempts to demonstrate the value of text typology (informative, expressive, and operative) to translation strategy, emphasizing the importance of text types and their communicative functions.

  17. Text Mining Applications and Theory

    CERN Document Server

    Berry, Michael W

    2010-01-01

    Text Mining: Applications and Theory presents the state-of-the-art algorithms for text mining from both the academic and industrial perspectives.  The contributors span several countries and scientific domains: universities, industrial corporations, and government laboratories, and demonstrate the use of techniques from machine learning, knowledge discovery, natural language processing and information retrieval to design computational models for automated text analysis and mining. This volume demonstrates how advancements in the fields of applied mathematics, computer science, machine learning

  18. Conservation Documentation and the Implications of Digitisation

    Directory of Open Access Journals (Sweden)

    Michelle Moore

    2001-11-01

    Full Text Available Conservation documentation can be defined as the textual and visual records collected during the care and treatment of an object. It can include records of the object's condition, any treatment done to the object, any observations or conclusions made by the conservator as well as details on the object's past and present environment. The form of documentation is not universally agreed upon nor has it always been considered an important aspect of the conservation profession. Good documentation tells the complete story of an object thus far and should provide as much information as possible for the future researcher, curator, or conservator. The conservation profession will benefit from digitising its documentation using software such as databases and hardware like digital cameras and scanners. Digital technology will make conservation documentation more easily accessible, cost/time efficient, and will increase consistency and accuracy of the recorded data, and reduce physical storage space requirements. The major drawback to digitising conservation records is maintaining access to the information for the future; the notorious pace of technological change has serious implications for retrieving data from any machine-readable medium.

  19. Hermeneutic reading of classic texts.

    Science.gov (United States)

    Koskinen, Camilla A-L; Lindström, Unni Å

    2013-09-01

    The purpose of this article is to broaden the understanding of the hermeneutic reading of classic texts. The aim is to show how the choice of a specific scientific tradition in conjunction with a methodological approach creates the foundation that clarifies the actual realization of the reading. This hermeneutic reading of classic texts is inspired by Gadamer's notion that it is the researcher's own research tradition and a clearly formulated theoretical fundamental order that shape the researcher's attitude towards texts and create the starting point that guides all reading, uncovering and interpretation. The researcher's ethical position originates in a will to openness towards what is different in the text and which constantly sets the researcher's preunderstanding and research tradition in movement. It is the researcher's attitude towards the text that allows the text to address, touch and arouse wonder. Through a flexible, lingering and repeated reading of classic texts, what is different emerges with a timeless value. The reading of classic texts is an act that may rediscover and create understanding for essential dimensions and of human beings' reality on a deeper level. The hermeneutic reading of classic texts thus brings to light constantly new possibilities of uncovering for a new envisioning and interpretation for a new understanding of the essential concepts and phenomena within caring science.

  20. GURMUKHI TEXT EXTRACTION FROM IMAGE USING SUPPORT VECTOR MACHINE (SVM

    Directory of Open Access Journals (Sweden)

    SUKHWINDER KAUR

    2011-04-01

    Full Text Available Extensive research has been done on image classification for different purposes like face recognition, identification of different objects and identification/extraction of text from images having some background. Text identification is an active research area whereby a system tries to identify the text area in a given image. The text area identified is then passed to an OCR system for further recognition of the text. This work is about classifying image areas into two classes, text and non-text, using an SVM (support vector machine). We identify the features and train a model based on the feature vector, which is then used to classify text and non-text areas in an image. The system reports 70.5% accuracy for caption text images, 70.43% for document text images and 50.40% for scene text images.
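
    A minimal sketch of the classification step described above: an SVM is trained on per-region feature vectors and used to label regions as text or non-text. The three-dimensional features and the synthetic data below are placeholders, not the features used in the paper.

      # Sketch: train an SVM to label image regions as text / non-text from
      # precomputed feature vectors. Feature values are synthetic placeholders.
      import numpy as np
      from sklearn.svm import SVC
      from sklearn.model_selection import train_test_split

      rng = np.random.default_rng(0)
      # Hypothetical 3-dimensional features per region: [edge_density, variance, transitions]
      text_regions = rng.normal(loc=[0.6, 0.4, 0.7], scale=0.1, size=(100, 3))
      non_text_regions = rng.normal(loc=[0.2, 0.1, 0.2], scale=0.1, size=(100, 3))

      X = np.vstack([text_regions, non_text_regions])
      y = np.array([1] * 100 + [0] * 100)       # 1 = text, 0 = non-text

      X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
      clf = SVC(kernel="rbf").fit(X_tr, y_tr)
      print("accuracy:", clf.score(X_te, y_te))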

  1. Rank Based Clustering For Document Retrieval From Biomedical Databases

    CERN Document Server

    Manicassamy, Jayanthi

    2009-01-01

    Nowadays, search engines are the most widely used means of extracting information from various resources throughout the world, and a majority of searches lie in the biomedical field, retrieving related documents from various biomedical databases. Current search engines lack document clustering and do not represent the degree of relatedness of the documents extracted from the databases. In order to overcome these pitfalls, a text-based search engine has been developed for retrieving documents from the Medline and PubMed biomedical databases. The search engine incorporates a page-ranking-based clustering concept which automatically represents relatedness on a clustering basis. Apart from this, a graph tree is constructed to represent the level of relatedness of the documents that are networked together. This advanced functionality for a biomedical document-based search engine was found to provide better results in reviewing related documents based on relatedness.

  2. Document Classification Using Expectation Maximization with Semi Supervised Learning

    CERN Document Server

    Nigam, Bhawna; Salve, Sonal; Vamney, Swati

    2011-01-01

    As the amount of online documents increases, the demand for document classification to aid their analysis and management is increasing. Text is cheap, but information, in the form of knowing what classes a document belongs to, is expensive. The main purpose of this paper is to explain the expectation maximization technique of data mining to classify documents and to learn how to improve the accuracy while using a semi-supervised approach. The expectation maximization algorithm is applied with both supervised and semi-supervised approaches. It is found that the semi-supervised approach is more accurate and effective. The main advantage of the semi-supervised approach is the dynamic generation of new classes. The algorithm first trains a classifier using the labeled documents and then probabilistically classifies the unlabeled documents. The car dataset used for evaluation was collected from the UCI repository, with some changes made on our side.
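
    A minimal sketch of the semi-supervised idea above, assuming scikit-learn and a toy corpus: a Naive Bayes classifier is trained on the labeled documents, the unlabeled documents are classified, and the classifier is refit on both sets. For brevity this uses hard pseudo-labels rather than the full probabilistic E-step.

      # Sketch of semi-supervised EM: fit on labelled data, pseudo-label the
      # unlabelled data (E-step), refit on everything (M-step), and repeat.
      from sklearn.feature_extraction.text import CountVectorizer
      from sklearn.naive_bayes import MultinomialNB

      labelled = ["cheap pills buy now", "meeting agenda attached",
                  "win money fast", "project status report"]
      labels = [1, 0, 1, 0]                      # 1 = spam, 0 = ham (toy classes)
      unlabelled = ["buy cheap meds now", "status of the project meeting", "fast cash win"]

      vec = CountVectorizer()
      X_l = vec.fit_transform(labelled)
      X_u = vec.transform(unlabelled)

      clf = MultinomialNB().fit(X_l, labels)     # initial classifier from labelled data only
      for _ in range(5):                         # EM iterations
          pseudo = clf.predict(X_u)              # E-step (hard assignment for simplicity)
          X_all = vec.transform(labelled + unlabelled)
          y_all = list(labels) + list(pseudo)
          clf = MultinomialNB().fit(X_all, y_all)  # M-step

      print(clf.predict(X_u))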

  3. A Fuzzy Similarity Based Concept Mining Model for Text Classification

    CERN Document Server

    Puri, Shalini

    2012-01-01

    Text classification is a challenging and highly active field and has great importance in text categorization applications. A lot of research work has been done in this field, but there is a need to categorize a collection of text documents into mutually exclusive categories by extracting the concepts or features using a supervised learning paradigm and different classification algorithms. In this paper, a new Fuzzy Similarity Based Concept Mining Model (FSCMM) is proposed to classify a set of text documents into pre-defined Category Groups (CG) by providing them training and preparing on the sentence, document and integrated corpora levels along with feature reduction and ambiguity removal on each level to achieve high system performance. The Fuzzy Feature Category Similarity Analyzer (FFCSA) is used to analyze each extracted feature of the Integrated Corpora Feature Vector (ICFV) with the corresponding categories or classes. This model uses a Support Vector Machine Classifier (SVMC) to classify correct...

  4. Inferring Group Processes from Computer-Mediated Affective Text Analysis

    Energy Technology Data Exchange (ETDEWEB)

    Schryver, Jack C [ORNL; Begoli, Edmon [ORNL; Jose, Ajith [Missouri University of Science and Technology; Griffin, Christopher [Pennsylvania State University

    2011-02-01

    Political communications in the form of unstructured text convey rich connotative meaning that can reveal underlying group social processes. Previous research has focused on sentiment analysis at the document level, but we extend this analysis to sub-document levels through a detailed analysis of affective relationships between entities extracted from a document. Instead of pure sentiment analysis, which is just positive or negative, we explore nuances of affective meaning in 22 affect categories. Our affect propagation algorithm automatically calculates and displays extracted affective relationships among entities in graphical form in our prototype (TEAMSTER), starting with seed lists of affect terms. Several useful metrics are defined to infer underlying group processes by aggregating affective relationships discovered in a text. Our approach has been validated with annotated documents from the MPQA corpus, achieving a performance gain of 74% over comparable random guessers.

  5. Replacement Attack: A New Zero Text Watermarking Attack

    Science.gov (United States)

    Bashardoost, Morteza; Mohd Rahim, Mohd Shafry; Saba, Tanzila; Rehman, Amjad

    2017-03-01

    The main objective of zero watermarking methods that are suggested for the authentication of textual properties is to increase the fragility of the produced watermarks against tampering attacks. On the other hand, zero watermarking attacks intend to alter the contents of a document without changing the watermark. In this paper, the Replacement attack is proposed, which focuses on maintaining the location of the words in the document. The proposed text watermarking attack is specifically effective on watermarking approaches that exploit word transitions in the document. The evaluation outcomes show that the tested word-based methods are unable to detect the existence of a replacement attack in the document. Moreover, the comparison results show that the size of the Replacement attack is estimated less accurately than that of other common types of zero text watermarking attacks.

  6. Development of digital library system on regulatory documents for nuclear power plants

    Energy Technology Data Exchange (ETDEWEB)

    Lee, K. H.; Kim, K. J.; Yoon, Y. H.; Kim, M. W.; Lee, J. I. [KINS, Taejon (Korea, Republic of)

    2001-10-01

    The main objective of this study is to establish a nuclear regulatory document retrieval system based on the internet. With the advancement of internet and information processing technology, information management patterns are going through a new paradigm. In keeping with this trend, it is the general tendency to transfer paper documents into electronic documents through document scanning and indexing. This system consists of nuclear regulatory documents, nuclear safety documents, a digital library, and an information system with index and full text.

  7. Improve Reading with Complex Texts

    Science.gov (United States)

    Fisher, Douglas; Frey, Nancy

    2015-01-01

    The Common Core State Standards have cast a renewed light on reading instruction, presenting teachers with the new requirements to teach close reading of complex texts. Teachers and administrators should consider a number of essential features of close reading: They are short, complex texts; rich discussions based on worthy questions; revisiting…

  8. Strategies for Translating Vocative Texts

    Directory of Open Access Journals (Sweden)

    Olga COJOCARU

    2014-12-01

    Full Text Available The paper deals with the linguistic and cultural elements of vocative texts and the techniques used in translating them by giving some examples of texts that are typically vocative (i.e. advertisements and instructions for use). Semantic and communicative strategies are popular in translation studies and each of them has its own advantages and disadvantages in translating vocative texts. The advantage of semantic translation is that it takes more account of the aesthetic value of the SL text, while communicative translation attempts to render the exact contextual meaning of the original text in such a way that both content and language are readily acceptable and comprehensible to the readership. Focus is laid on the strategies used in translating vocative texts, strategies that highlight and introduce a cultural context to the target audience, in order to achieve their overall purpose, that is to sell or persuade the reader to behave in a certain way. Thus, in order to do that, a number of advertisements from the field of cosmetics industry and electronic gadgets were selected for analysis. The aim is to gather insights into vocative text translation and to create new perspectives on this field of research, now considered a process of innovation and diversion, especially in areas as important as economy and marketing.

  9. Text Retrieval on a Microcomputer.

    Science.gov (United States)

    Giordano, Richard; And Others

    1988-01-01

    Presents description of the Generalized Automatic Text Organization and Retrieval system (GATOR), a database system that indexes and retrieves information from machine-readable texts such as interviews and case histories. Qualitative and quantitative analyses are discussed, and integrating GATOR with standard statistical packages is described.…

  10. Dangers of Texting While Driving

    Science.gov (United States)


  11. Text analysis for knowledge graphs

    NARCIS (Netherlands)

    Popping, Roel

    2007-01-01

    The concept of knowledge graphs is introduced as a method to represent the state of the art in a specific scientific discipline. Next the text analysis part in the construction of such graphs is considered. Here the 'translation' from text to graph takes place. The method that is used here is compar

  12. Using Corpus Statistics to Remove Redundant Words in Text Categorization.

    Science.gov (United States)

    Yang, Yiming; Wilbur, John

    1996-01-01

    Studies aggressive automated word removal in text categorization in large databases based on corpus statistics to reduce the noise in free texts and to enhance the computational efficiency of categorization. Topics include stop word identification, categorization methods for comparison, tests on four document collections, and evaluation…
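
    A minimal sketch of corpus-statistics-based word removal, assuming scikit-learn: terms with very high or very low document frequency are dropped before categorization. The thresholds and the toy corpus are illustrative assumptions.

      # Sketch: remove words by corpus statistics rather than a fixed stop list.
      # Terms appearing in too many or too few documents carry little signal;
      # scikit-learn's min_df/max_df drop them automatically.
      from sklearn.feature_extraction.text import CountVectorizer

      docs = [
          "the market rose on strong earnings",
          "the team won the final match",
          "earnings reports lifted the market again",
          "the match ended after extra time",
      ]

      # Keep terms appearing in at least 2 documents but in no more than 75% of them.
      vec = CountVectorizer(min_df=2, max_df=0.75)
      X = vec.fit_transform(docs)
      print(sorted(vec.vocabulary_))     # surviving vocabulary after aggressive removal
      print(sorted(vec.stop_words_))     # words removed by the corpus statistics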

  13. Sources of evidence for automatic indexing of political texts

    NARCIS (Netherlands)

    Dehghani, M.; Azarbonyad, H.; Marx, M.; Kamps, J.; Hanbury, A.; Kazai, G.; Rauber, A.; Fuhr, N.

    2015-01-01

    Political texts on the Web, documenting laws and policies and the process leading to them, are of key importance to government, industry, and every individual citizen. Yet access to such texts is difficult due to the ever increasing volume and complexity of the content, prompting the need for indexing.

  14. Linguistic Dating of Biblical Texts

    DEFF Research Database (Denmark)

    Ehrensvärd, Martin Gustaf

    2003-01-01

    For two centuries, scholars have pointed to consistent differences in the Hebrew of certain biblical texts and interpreted these differences as reflecting the date of composition of the texts. Until the 1980s, this was quite uncontroversial as the linguistic findings largely confirmed the chronology of the texts established by other means: the Hebrew of Genesis-2 Kings was judged to be early and that of Esther, Daniel, Ezra, Nehemiah, and Chronicles to be late. In the current debate where revisionists have questioned the traditional dating, linguistic arguments in the dating of texts have come more into focus. The study critically examines some linguistic arguments adduced to support the traditional position, and reviewing the arguments it points to weaknesses in the linguistic dating of EBH texts to pre-exilic times. When viewing the linguistic evidence in isolation it will be clear...

  15. Text mining for systems biology.

    Science.gov (United States)

    Fluck, Juliane; Hofmann-Apitius, Martin

    2014-02-01

    Scientific communication in biomedicine is, by and large, still text based. Text mining technologies for the automated extraction of useful biomedical information from unstructured text that can be directly used for systems biology modelling have been substantially improved over the past few years. In this review, we underline the importance of named entity recognition and relationship extraction as fundamental approaches that are relevant to systems biology. Furthermore, we emphasize the role of publicly organized scientific benchmarking challenges that reflect the current status of text-mining technology and are important in moving the entire field forward. Given further interdisciplinary development of systems biology-orientated ontologies and training corpora, we expect a steadily increasing impact of text-mining technology on systems biology in the future.

  16. EDCMS: A Content Management System for Engineering Documents

    Institute of Scientific and Technical Information of China (English)

    Shaofeng Liu; Chris McMahon; Mansur Darlington; Steve Culley; Peter Wild

    2007-01-01

    Engineers often need to look for the right pieces of information by sifting through long engineering documents. It is a very tiring and time-consuming job. To address this issue, researchers are increasingly devoting their attention to new ways to help information users, including engineers, to access and retrieve document content. The research reported in this paper explores how to use the key technologies of document decomposition (study of document structure), document mark-up (with Extensible Markup Language (XML), HyperText Mark-up Language (HTML), and Scalable Vector Graphics (SVG)), and a facetted classification mechanism. Document content extraction is implemented via computer programming (with Java). An Engineering Document Content Management System (EDCMS) developed in this research demonstrates that as information providers we can make document content available in a more accessible manner for information users including engineers. The main features of the EDCMS system are: 1) EDCMS is a system that enables users, especially engineers, to access and retrieve information at content rather than document level. In other words, it provides the right pieces of information that answer specific questions so that engineers don't need to waste time sifting through the whole document to obtain the required piece of information. 2) Users can use the EDCMS via both the data and metadata of a document to access engineering document content. 3) Users can use the EDCMS to access and retrieve content objects, i.e. text, images and graphics (including engineering drawings), via multiple views and at different granularities based on decomposition schemes. Experiments with the EDCMS have been conducted on semi-structured documents, a textbook of CADCAM, and a set of project posters in the Engineering Design domain. Experimental results show that the system provides information users with a powerful solution to access document content.

  17. Government Documents Departmental Operations Guide.

    Science.gov (United States)

    Wilson, John S.; And Others

    This manual for the operation and maintenance of the Government Documents Department at Baylor University's Moody Memorial Library is divided into 13 topical sections. The guide opens with the collection development policy statement, which covers the general collection, the maps division, and weeding government documents. Technical processing…

  18. ITK optical links backup document

    CERN Document Server

    Huffman, B T; The ATLAS collaboration; Flick, T; Ye, J

    2013-01-01

    This document describes the proposed optical links to be used for the ITK in the phase II upgrade. The current R&D for optical links pursued in the Versatile Link group is reviewed. In particular the results demonstrating the radiation tolerance of all the on-detector components are documented. The bandwidth requirements and the resulting numerology are given.

  19. SRS ecology: Environmental information document

    Energy Technology Data Exchange (ETDEWEB)

    Wike, L.D.; Shipley, R.W.; Bowers, J.A. [and others

    1993-09-01

    The purpose of this Document is to provide a source of ecological information based on the existing knowledge gained from research conducted at the Savannah River Site. This document provides a summary and synthesis of ecological research in the three main ecosystem types found at SRS and information on the threatened and endangered species residing there.

  20. Bulkloading and Maintaining XML Documents

    NARCIS (Netherlands)

    Schmidt, A.R.; Kersten, M.L.

    2002-01-01

    The popularity of XML as a exchange and storage format brings about massive amounts of documents to be stored, maintained and analyzed -- a challenge that traditionally has been tackled with Database Management Systems (DBMS). To open up the content of XML documents to analysis with declarative quer

  1. Storing XML Documents in Databases

    NARCIS (Netherlands)

    Schmidt, A.R.; Manegold, S.; Kersten, M.L.; Rivero, L.C.; Doorn, J.H.; Ferraggine, V.E.

    2005-01-01

    The authors introduce concepts for loading large amounts of XML documents into databases where the documents are stored and maintained. The goal is to make XML databases as unobtrusive in multi-tier systems as possible and at the same time provide as many services defined by the XML standards as possible.

  2. Automatic digital document processing and management problems, algorithms and techniques

    CERN Document Server

    Ferilli, Stefano

    2011-01-01

    This text reviews the issues involved in handling and processing digital documents. Examining the full range of a document's lifetime, this book covers acquisition, representation, security, pre-processing, layout analysis, understanding, analysis of single components, information extraction, filing, indexing and retrieval. This title: provides a list of acronyms and a glossary of technical terms; contains appendices covering key concepts in machine learning, and providing a case study on building an intelligent system for digital document and library management; discusses issues of security,

  3. Biomarker Identification Using Text Mining

    Directory of Open Access Journals (Sweden)

    Hui Li

    2012-01-01

    Full Text Available Identifying molecular biomarkers has become one of the important tasks for scientists to assess the different phenotypic states of cells or organisms correlated to the genotypes of diseases from large-scale biological data. In this paper, we propose a text-mining-based method to discover biomarkers from PubMed. First, we construct a database based on a dictionary, and then we use a finite state machine to identify the biomarkers. Our method of text mining provides a highly reliable approach to discover the biomarkers in the PubMed database.
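
    A minimal sketch of dictionary-driven biomarker spotting, where a compiled alternation regex (itself a finite automaton) stands in for the paper's finite state machine. The dictionary entries and abstracts below are hypothetical placeholders.

      # Sketch: dictionary-driven biomarker spotting in abstracts. A compiled
      # alternation regex acts as the finite state machine over the input text.
      import re
      from collections import defaultdict

      biomarker_dict = {"PSA", "HER2", "CA-125", "BRCA1"}   # illustrative entries

      pattern = re.compile(
          r"\b(" + "|".join(re.escape(b) for b in sorted(biomarker_dict, key=len, reverse=True)) + r")\b",
          flags=re.IGNORECASE,
      )

      abstracts = {
          "PMID:0000001": "Elevated PSA and HER2 expression were observed in the cohort.",
          "PMID:0000002": "Serum CA-125 levels correlated with treatment response.",
      }

      hits = defaultdict(list)
      for pmid, text in abstracts.items():
          for match in pattern.finditer(text):
              hits[pmid].append(match.group(1))

      print(dict(hits))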

  4. Complex document information processing: prototype, test collection, and evaluation

    Science.gov (United States)

    Agam, G.; Argamon, S.; Frieder, O.; Grossman, D.; Lewis, D.

    2006-01-01

    Analysis of large collections of complex documents is an increasingly important need for numerous applications. Complex documents are documents that typically start out on paper and are then electronically scanned. These documents have rich internal structure and might only be available in image form. Additionally, they may have been produced by a combination of printing technologies (or by handwriting); and include diagrams, graphics, tables and other non-textual elements. The state of the art today for a large document collection is essentially text search of OCR'd documents with no meaningful use of data found in images, signatures, logos, etc. Our prototype automatically generates rich metadata about a complex document and then applies query tools to integrate the metadata with text search. To ensure a thorough evaluation of the effectiveness of our prototype, we are also developing a roughly 42,000,000 page complex document test collection. The collection will include relevance judgments for queries at a variety of levels of detail and depending on a variety of content and structural characteristics of documents, as well as "known item" queries looking for particular documents.

  5. Enhancing Text Clustering Using Concept-based Mining Model

    Directory of Open Access Journals (Sweden)

    Lincy Liptha R.

    2012-03-01

    Full Text Available Text mining techniques are mostly based on the statistical analysis of a word or phrase. The statistical analysis of term frequency captures the importance of a term within a document only. But two terms can have the same frequency in the same document, while the meaning that one term contributes might be more appropriate than the meaning contributed by the other term. Hence, the terms that capture the semantics of the text should be given more importance. Here, a new concept-based mining model is introduced. It analyses the terms on the sentence, document and corpus levels. The model consists of sentence-based concept analysis which calculates the conceptual term frequency (ctf), document-based concept analysis which finds the term frequency (tf), corpus-based concept analysis which determines the document frequency (df), and a concept-based similarity measure. The process of calculating the ctf, tf and df measures in a corpus is carried out by the proposed algorithm, which is called the Concept-Based Analysis Algorithm. By doing so we cluster the web documents in an efficient way, and the quality of the clusters achieved by this model significantly surpasses that of traditional single-term-based approaches.
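
    The three statistics named above can be illustrated with a few lines of Python, assuming for simplicity that the "concepts" are plain terms: ctf is counted per sentence, tf per document, and df across the corpus. The toy corpus is an assumption.

      # Sketch: sentence-level ctf, document-level tf and corpus-level df.
      from collections import Counter

      corpus = [
          "text mining extracts concepts. concepts capture sentence semantics.",
          "term frequency alone ignores semantics of the sentence.",
      ]

      df = Counter()
      for doc_id, doc in enumerate(corpus):
          terms = doc.replace(".", "").split()
          tf = Counter(terms)                                               # document-level tf
          ctf = [Counter(s.split()) for s in doc.split(".") if s.strip()]   # sentence-level ctf
          df.update(set(terms))                                             # corpus-level df
          print(doc_id, "tf:", tf.most_common(3),
                "ctf per sentence:", [c.most_common(1) for c in ctf])

      print("df:", dict(df))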

  6. INTEGRATION OF COMPUTER TECHNOLOGIES SMK: AUTOMATION OF THE PRODUCTION CERTIFICATION PROCEDURE AND FORMING OF SHIPPING DOCUMENTS

    Directory of Open Access Journals (Sweden)

    S. A. Pavlenko

    2009-01-01

    Full Text Available The integration of informational computer technologies made it possible to reorganize and optimize some processes by decreasing the circulation of documents, unifying documentation forms, and other measures.

  7. Documentation of TRU biological transport model (BIOTRAN)

    Energy Technology Data Exchange (ETDEWEB)

    Gallegos, A.F.; Garcia, B.J.; Sutton, C.M.

    1980-01-01

    Inclusive of Appendices, this document describes the purpose, rationale, construction, and operation of a biological transport model (BIOTRAN). This model is used to predict the flow of transuranic elements (TRU) through specified plant and animal environments using biomass as a vector. The appendices are: (A) Flows of moisture, biomass, and TRU; (B) Intermediate variables affecting flows; (C) Mnemonic equivalents (code) for variables; (D) Variable library (code); (E) BIOTRAN code (Fortran); (F) Plants simulated; (G) BIOTRAN code documentation; (H) Operating instructions for BIOTRAN code. The main text is presented with a specific format which uses a minimum of space, yet is adequate for tracking most relationships from their first appearance to their formulation in the code. Because relationships are treated individually in this manner, and rely heavily on Appendix material for understanding, it is advised that the reader familiarize himself with these materials before proceeding with the main text.

  8. Anomaly Detection with Text Mining

    Data.gov (United States)

    National Aeronautics and Space Administration — Many existing complex space systems have a significant amount of historical maintenance and problem data bases that are stored in unstructured text forms. The...

  9. Functional Stylistics and Peripeteic Texts

    DEFF Research Database (Denmark)

    Borchmann, Simon

    2008-01-01

    Using a pragmatically based linguistic description apparatus on literary use of language is not unproblematic. Observations show that literary use of language violates the norms contained by this apparatus. With this paper I suggest how we can deal with this problem by setting up a frame for the use of a functional linguistic description apparatus on literary texts. As an extension of this suggestion I present a model for describing a specific type of literary texts.

  10. Adaptive Personality Recognition from Text

    OpenAIRE

    Celli, Fabio

    2012-01-01

    We address the issue of domain adaptation for automatic Personality Recognition from Text (PRT). The PRT task consists in the classification of the personality traits of some authors, given some pieces of text they wrote. The purpose of our work is to improve current approaches to PRT in order to extract personality information from social network sites, which is a really challenging task. We argue that current approaches, based on supervised learning, have several limitations for th...

  11. Intelligent bar chart plagiarism detection in documents.

    Science.gov (United States)

    Al-Dabbagh, Mohammed Mumtaz; Salim, Naomie; Rehman, Amjad; Alkawaz, Mohammed Hazim; Saba, Tanzila; Al-Rodhaan, Mznah; Al-Dhelaan, Abdullah

    2014-01-01

    This paper presents a novel feature mining approach for documents that cannot be mined via optical character recognition (OCR). By identifying the intimate relationship between the text and graphical components, the proposed technique pulls out the Start, End, and Exact values for each bar. Furthermore, the word 2-gram and Euclidean distance methods are used to accurately detect and determine plagiarism in bar charts.
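
    A minimal sketch of the two comparison steps mentioned above: word 2-grams for the textual part of a chart and Euclidean distance for the extracted bar values. The captions, bar heights and the Jaccard-style overlap score are illustrative assumptions, not the paper's exact measures.

      # Sketch: compare two charts by caption 2-grams and by bar-value distance.
      import math

      def word_bigrams(text):
          tokens = text.lower().split()
          return set(zip(tokens, tokens[1:]))

      def bigram_overlap(a, b):
          ga, gb = word_bigrams(a), word_bigrams(b)
          return len(ga & gb) / max(1, len(ga | gb))   # Jaccard-style overlap of 2-grams

      def euclidean(values_a, values_b):
          return math.sqrt(sum((x - y) ** 2 for x, y in zip(values_a, values_b)))

      caption_a = "quarterly revenue by region 2013"
      caption_b = "quarterly revenue by region 2014"
      bars_a = [12.0, 18.5, 9.0, 14.2]                 # Start/End-derived bar heights (toy data)
      bars_b = [12.0, 18.4, 9.1, 14.2]

      print("text similarity:", bigram_overlap(caption_a, caption_b))
      print("bar distance:", euclidean(bars_a, bars_b))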

  12. Layout-aware text extraction from full-text PDF of scientific articles

    Directory of Open Access Journals (Sweden)

    Ramakrishnan Cartic

    2012-05-01

    Full Text Available Abstract Background The Portable Document Format (PDF) is the most commonly used file format for online scientific publications. The absence of effective means to extract text from these PDF files in a layout-aware manner presents a significant challenge for developers of biomedical text mining or biocuration informatics systems that use published literature as an information source. In this paper we introduce the ‘Layout-Aware PDF Text Extraction’ (LA-PDFText) system to facilitate accurate extraction of text from PDF files of research articles for use in text mining applications. Results Our paper describes the construction and performance of an open source system that extracts text blocks from PDF-formatted full-text research articles and classifies them into logical units based on rules that characterize specific sections. The LA-PDFText system focuses only on the textual content of the research articles and is meant as a baseline for further experiments into more advanced extraction methods that handle multi-modal content, such as images and graphs. The system works in a three-stage process: (1) detecting contiguous text blocks using spatial layout processing to locate and identify blocks of contiguous text, (2) classifying text blocks into rhetorical categories using a rule-based method and (3) stitching classified text blocks together in the correct order, resulting in the extraction of text from section-wise grouped blocks. We show that our system can identify text blocks and classify them into rhetorical categories with Precision = 0.96, Recall = 0.89 and F1 = 0.91. We also present an evaluation of the accuracy of the block detection algorithm used in step 2. Additionally, we have compared the accuracy of the text extracted by LA-PDFText to the text from the Open Access subset of PubMed Central. We then compared this accuracy with that of the text extracted by the PDF2Text system, commonly used to extract text from PDF

  13. Analysing ESP Texts, but How?

    Directory of Open Access Journals (Sweden)

    Borza Natalia

    2015-03-01

    Full Text Available English as a second language (ESL) teachers instructing general English and English for specific purposes (ESP) in bilingual secondary schools face various challenges when it comes to choosing the main linguistic foci of language preparatory courses enabling non-native students to study academic subjects in English. ESL teachers intending to analyse English language subject textbooks written for secondary school students, with the aim of gaining information about what bilingual secondary school students need to know in terms of language to process academic textbooks, cannot avoid dealing with a dilemma. It needs to be decided which way is most appropriate to analyse the texts in question. Handbooks of English applied linguistics are not immensely helpful with regard to this problem as they tend not to give recommendations as to which major text analytical approaches are advisable to follow in a pre-college setting. The present theoretical research aims to address this lacuna. Respectively, the purpose of this pedagogically motivated theoretical paper is to investigate two major approaches of ESP text analysis, register analysis and genre analysis, in order to find the more suitable one for exploring the language use of secondary school subject texts from the point of view of an English as a second language teacher. Comparing and contrasting the merits and limitations of the two contrastive approaches allows for a better understanding of the nature of the two different perspectives of text analysis. The study examines the goals, the scope of analysis, and the achievements of the register perspective and those of the genre approach alike. The paper also investigates and reviews in detail the starkly different methods of ESP text analysis applied by the two perspectives. Discovering text analysis from a theoretical and methodological angle supports a practical aspect of English teaching, namely making an informed choice when setting out to analyse

  14. Practical vision based degraded text recognition system

    Science.gov (United States)

    Mohammad, Khader; Agaian, Sos; Saleh, Hani

    2011-02-01

    Rapid growth and progress in the medical, industrial, security and technology fields means more and more consideration for the use of camera-based optical character recognition (OCR). Applying OCR to scanned documents is quite mature, and there are many commercial and research products available on this topic. These products achieve acceptable recognition accuracy and reasonable processing times, especially with trained software and constrained text characteristics. Even though the application space for OCR is huge, it is quite challenging to design a single system that is capable of performing automatic OCR for text embedded in an image irrespective of the application. Challenges for OCR systems include images taken under natural real-world conditions, surface curvature, text orientation, font, size, lighting conditions, and noise. These and many other conditions make it extremely difficult to achieve reasonable character recognition. Performance of conventional OCR systems drops dramatically as the degradation level of the text image quality increases. In this paper, a new recognition method is proposed to recognize solid or dotted line degraded characters. The degraded text string is localized and segmented using a new algorithm. The new method was implemented and tested using a development framework system that is capable of performing OCR on camera captured images. The framework allows parameter tuning of the image-processing algorithm based on a training set of camera-captured text images. Novel methods were used for enhancement, text localization and the segmentation algorithm, which enables building a custom system that is capable of performing automatic OCR for different applications. The developed framework system includes new image enhancement, filtering, and segmentation techniques which enabled higher recognition accuracies, faster processing time, and lower energy consumption, compared with the best state of the art published

  15. A Hough Transform based Technique for Text Segmentation

    CERN Document Server

    Saha, Satadal; Nasipuri, Mita; Basu, Dipak Kr

    2010-01-01

    Text segmentation is an inherent part of an OCR system irrespective of its domain of application. The OCR system contains a segmentation module where the text lines, words and ultimately the characters must be segmented properly for successful recognition. The present work implements a Hough transform based technique for line and word segmentation from digitized images. The proposed technique is applied not only on a document image dataset but also on datasets for a business card reader system and a license plate recognition system. For standardization of the performance of the system, the technique is also applied on the public domain dataset published on the website of CMATER, Jadavpur University. The document images consist of multi-script printed and handwritten text lines with variety in script and line spacing in a single document image. The technique performs quite satisfactorily when applied on mobile camera captured business card images with low resolution. The usefulness of the technique is verified...
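
    A minimal sketch of Hough-based line detection, assuming scikit-image and a synthetic binary page: near-horizontal peaks in the Hough accumulator correspond to text lines. Word segmentation and the paper's datasets are not reproduced here.

      # Sketch: detect near-horizontal text lines with a straight-line Hough transform.
      import numpy as np
      from skimage.transform import hough_line, hough_line_peaks

      # Synthetic "page": three dark horizontal text lines on a white background.
      page = np.zeros((120, 200), dtype=bool)
      for row in (20, 60, 100):
          page[row:row + 3, 10:190] = True

      # Restrict the angle search to near-horizontal orientations (theta close to 90 degrees).
      angles = np.deg2rad(np.linspace(80, 90, 50))
      h, theta, d = hough_line(page, theta=angles)

      accum, found_angles, dists = hough_line_peaks(h, theta, d)
      for angle, dist in zip(found_angles, dists):
          print(f"line at distance {dist:.1f}px, angle {np.rad2deg(angle):.1f} deg")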

  16. Word-Based Text Compression

    CERN Document Server

    Platos, Jan

    2008-01-01

    Today there are many universal compression algorithms, but in most cases specific data are better handled by a specific algorithm - JPEG for images, MPEG for movies, etc. For textual documents there are special methods based on the PPM algorithm, or methods with non-character access, e.g. word-based compression. In the past, several papers describing variants of word-based compression using Huffman encoding or the LZW method were published. The subject of this paper is the description of a word-based compression variant based on the LZ77 algorithm. The LZ77 algorithm and its modifications are described in this paper. Moreover, various ways of sliding window implementation and various possibilities of output encoding are described as well. This paper also includes the implementation of an experimental application, testing of its efficiency and finding the best combination of all parts of the LZ77 coder. This is done to achieve the best compression ratio. In conclusion there is a comparison of this implemented application wi...
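
    A minimal sketch of the word-based LZ77 idea described above: the sliding window and the matches operate over word tokens rather than characters, and each step emits an (offset, length, next word) triple. The window size and the toy sentence are assumptions; output encoding is not shown.

      # Sketch of word-based LZ77: matches are found over word tokens in a sliding window.
      def lz77_words(tokens, window=16):
          out, i = [], 0
          while i < len(tokens):
              start = max(0, i - window)
              best_len, best_off = 0, 0
              for j in range(start, i):                     # search the sliding window
                  length = 0
                  while (i + length < len(tokens)
                         and tokens[j + length] == tokens[i + length]
                         and j + length < i):
                      length += 1
                  if length > best_len:
                      best_len, best_off = length, i - j
              nxt = tokens[i + best_len] if i + best_len < len(tokens) else ""
              out.append((best_off, best_len, nxt))         # (offset, length, next word)
              i += best_len + 1
          return out

      text = "the cat sat on the mat and the cat sat on the chair"
      print(lz77_words(text.split()))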

  17. Princess Brambilla - images/text

    Directory of Open Access Journals (Sweden)

    Maria Aparecida Barbosa

    2016-01-01

    Full Text Available Reading an illustrated literary text means simultaneously thinking pictures and words. This articulation between the written text and pictures adds potential, expands and becomes complex. It coincides with current discussions of Giorgio Agamben's "contemporary", which adds to whatever adheres to its own time the displacement and the distance needed to understand it, and shakes linear notions of historical chronology. Somehow the coincidence is related to the current interest in the concept of "Nachleben" (survival), which assumes the recovery of images of the past, postulated by the art historian Aby Warburg in his research on the survival of ancient characteristics of motion in Botticelli's Renaissance pictures. Such discussions were fundamental for the translation into Portuguese of Princesa Brambilla – um capriccio segundo Jakob Callot, de E. T. A. Hoffmann, com 8 gravuras cunhadas a partir de moldes originais de Callot (1820), as I try to present in this article.

  18. A Guide Text or Many Texts? "That is the Question”

    Directory of Open Access Journals (Sweden)

    Delgado de Valencia Sonia

    2001-08-01

    Full Text Available The use of supplementary materials in the classroom has always been an essential part of the teaching and learning process. To restrict our teaching to the scope of one single textbook means to stand behind the advances of knowledge, in any area and context. Young learners appreciate any new and varied support that expands their knowledge of the world: diaries, letters, panels, free texts, magazines, short stories, poems or literary excerpts, and articles taken from the Internet are materials that will allow learners to share more and work more collaboratively. In this article we are going to deal with some of these materials, with the criteria to select, adapt, and create them that may be of interest to the learner and that may promote reading and writing processes. Since no text can entirely satisfy the needs of students and teachers, the creativity of both parties will be necessary to improve the quality of teaching through the adequate use and adaptation of supplementary materials.

  19. Quality Inspection of Printed Texts

    DEFF Research Database (Denmark)

    Pedersen, Jesper Ballisager; Nasrollahi, Kamal; Moeslund, Thomas B.

    2016-01-01

    Inspecting the quality of printed texts has its own importance in many industrial applications. To do so, this paper proposes a grading system which evaluates the performance of the printing task using quality measures for each character and symbol. The purpose of this grading system is twofold: for customers of the printing and verification system, the overall grade is used to verify whether the text is of sufficient quality, while for the printer's manufacturer, the detailed character/symbol grades and quality measurements are used for the improvement and optimization of the printing task. The proposed system...

  20. Handwritten and printed text distinction by using stroke thickness features

    Science.gov (United States)

    Ding, Hong; Wu, Huiqun; Wang, Jun; Zhang, Xiaofeng

    2017-01-01

    This paper presents an algorithm to distinguish handwritten from printed text in document images. The characteristic of stroke thickness is used, and a calculation method is designed for this feature. The proposed method, which is clearly defined and easily realized, calculates the stroke thickness feature by counting edge pixels in a neighborhood. Document images are generally divided into text lines or characters; however, neither unit is well suited to the handwritten/printed distinction, since the line is too coarse and the character is too small. Using the stroke thickness characteristics, combined with layout analysis, each text line in the document image is further divided into areas of uniform thickness. Such an area is finer-grained than a text line and larger than a single character, so more stable features can be extracted from it. Finally, the features of these regions are classified using an SVM. The proposed algorithm obtained good performance on a document image database containing both handwritten and printed texts.
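
    One plausible reading of such a feature is sketched below: the ratio of ink pixels to edge pixels in a region approximates the average stroke width, and the resulting values can be fed to an SVM. The exact neighbourhood-based calculation used by the authors may differ.

      # Rough stroke-thickness estimate: ink area divided by edge length (sketch).
      import cv2
      import numpy as np

      def stroke_thickness(region):
          """region: grayscale image of one text area (dark text on light background)."""
          _, ink = cv2.threshold(region, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
          edges = cv2.Canny(region, 50, 150)
          ink_pixels = int(np.count_nonzero(ink))
          edge_pixels = int(np.count_nonzero(edges))
          return ink_pixels / edge_pixels if edge_pixels else 0.0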

  1. Sharing and Adaptation of Educational Documents in E-Learning

    Directory of Open Access Journals (Sweden)

    Chekry Abderrahman

    2012-03-01

    Full Text Available Few documents can be reused among the huge number of educational documents on the web. The exponential increase of these documents makes it almost impossible to search for relevant ones. In addition, e-learning is designed for public users who have different levels of knowledge and varied skills, so they should be given content that meets their needs. This work is about adapting learning content to learners' preferences and giving teachers the ability to reuse a given content.

  2. ARCHITECTURE SOFTWARE SOLUTION TO SUPPORT AND DOCUMENT MANAGEMENT QUALITY SYSTEM

    Directory of Open Access Journals (Sweden)

    Milan Eric

    2010-12-01

    Full Text Available One of the bases of the JUS ISO 9000 series of standards is quality system documentation. The architecture of the quality system documentation depends on the complexity of the business system. Establishing efficient management of quality system documentation is of great importance for the business system, both in the phase of introducing the quality system and in further stages of its improvement. The study describes the architecture and capability of software solutions to support and manage quality system documentation in accordance with the requirements of standards such as ISO 9001:2001, ISO 14001:2005 and HACCP.

  3. Automation in Developing Technical Documentation of Telecommunication Networks

    Directory of Open Access Journals (Sweden)

    Slavko Šarić

    2004-09-01

    Full Text Available During the last decade of the 20th century, intensive development of telecommunication infrastructure set high requirements regarding technical documentation, which started to have problems in following the changes in the network. In the last several years HT made a great shift regarding automation and presentation of technical documentation, that is, the introduction of GIS (Geographic Information System) as a precondition for the introduction of DIS (Documentation Information System), thus realising the necessary preconditions to use the gathered and organised spatial and attribute data for higher quality of analysis, processing and repair of interference in the telecommunication network. The aim of this paper is to inform about the segments and computer programs used for the development of technical documentation.

  4. TEXT SIGNAGE RECOGNITION IN ANDROID MOBILE DEVICES

    Directory of Open Access Journals (Sweden)

    Oi-Mean Foong

    2013-01-01

    Full Text Available This study presents a Text Signage Recognition (TSR) model in Android mobile devices for Visually Impaired People (VIP). Independent navigation is always a challenge to VIPs in unfamiliar indoor surroundings. Assistive technology such as Android smart devices has great potential to assist VIPs in indoor navigation using the built-in speech synthesizer. In contrast to previous TSR research, which was deployed in a standalone personal computer system using Otsu's algorithm, we have developed an affordable Text Signage Recognition system in Android mobile devices using the Tesseract OCR engine. The proposed TSR model used input images from the International Conference on Document Analysis and Recognition (ICDAR) 2003 dataset for system training and testing. The TSR model was tested by four volunteers who were blind-folded. The system performance of the TSR model was assessed using different metrics (i.e., Precision, Recall, F-Score and Recognition Formulas) to determine its accuracy. Experimental results show that the proposed TSR model achieved a satisfactory recognition rate.
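
    For illustration only, the Tesseract engine the model relies on can be driven from Python as shown below; the pytesseract wrapper and the sample file name are assumptions, not the authors' Android implementation.

      # Reading signage text with the Tesseract OCR engine (illustrative).
      from PIL import Image
      import pytesseract

      def read_signage(path):
          image = Image.open(path).convert("L")     # grayscale often helps OCR
          text = pytesseract.image_to_string(image)
          return " ".join(text.split())             # collapse whitespace

      print(read_signage("exit_sign.jpg"))          # hypothetical input image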

  5. Editorial: Well Documented Articles Achieve More Impact

    Directory of Open Access Journals (Sweden)

    Sönke Albers

    2009-05-01

    Full Text Available This issue offers for the first time supplementary material to articles. It consists of Excel files providing data the research is based on, calculation sheets, and optimization tools. Moreover, text files with pseudo code of algorithms and data of instances used for testing these algorithms as well as a video file with a demonstration of the use of a system in experiments are supplied for download. This material facilitates the understanding of the published articles and the replication of their findings to the benefit of scientific progress. And last but not least, such well documented articles will achieve much more impact.

  6. Seductive Texts with Serious Intentions.

    Science.gov (United States)

    Nielsen, Harriet Bjerrum

    1995-01-01

    Debates whether a text claiming to have scientific value is using seduction irresponsibly at the expense of the truth, and discusses who is the subject and who is the object of such seduction. It argues that, rather than being an assault against scientific ethics, seduction is a necessary premise for a sensible conversation to take place. (GR)

  7. Multilingual text induced spelling correction

    NARCIS (Netherlands)

    Reynaert, M.W.C.

    2004-01-01

    We present TISC, a multilingual, language-independent and context-sensitive spelling checking and correction system designed to facilitate the automatic removal of non-word spelling errors in large corpora. Its lexicon is derived from raw text corpora, without supervision, and contains word unigrams

  8. Text as an Autopoietic System

    DEFF Research Database (Denmark)

    Nicolaisen, Maria Skou

    2016-01-01

    The aim of the present research article is to discuss the possibilities and limitations in addressing text as an autopoietic system. The theory of autopoiesis originated in the field of biology in order to explain the dynamic processes entailed in sustaining living organisms at cellular level...

  9. Basic Chad Arabic: Comprehension Texts.

    Science.gov (United States)

    Absi, Samir Abu; Sinaud, Andre

    This text, principally designed for use in a three-volume course on Chad Arabic, complements the pre-speech and active phases of the course in that it provides the answers to comprehension exercises students are required to complete during the course. The comprehension exercises require that students listen to an instructor or tape and write…

  10. Comparison of Text Categorization Algorithms

    Institute of Scientific and Technical Information of China (English)

    SHI Yong-feng; ZHAO Yan-ping

    2004-01-01

    This paper summarizes several automatic text categorization algorithms in common use recently, analyzes and compares their advantages and disadvantages.It provides clues for making use of appropriate automatic classifying algorithms in different fields.Finally some evaluations and summaries of these algorithms are discussed, and directions to further research have been pointed out.

  11. Reviving "Walden": Mining the Text.

    Science.gov (United States)

    Hewitt Julia

    2000-01-01

    Describes how the author and her high school English students begin their study of Thoreau's "Walden" by mining the text for quotations to inspire their own writing and discussion on the topic, "How does Thoreau speak to you or how could he speak to someone you know?" (SR)

  12. COMPENDEX/TEXT-PAC: CIS.

    Science.gov (United States)

    Standera, Oldrich

    This report evaluates the engineering information services provided by the University of Calgary since implementation of the COMPENDEX (tape service of Engineering Index, Inc.) service using the IBM TEXT-PAC system. Evaluation was made by a survey of the users of the Current Information Selection (CIS) service, the interaction between the system…

  13. Reactor operation environmental information document

    Energy Technology Data Exchange (ETDEWEB)

    Bauer, L.R.; Hayes, D.W.; Hunter, C.H.; Marter, W.L.; Moyer, R.A.

    1989-12-01

    This volume is a reactor operation environmental information document for the Savannah River Plant. Topics include meteorology, surface hydrology, transport, environmental impacts, and radiation effects. 48 figs., 56 tabs. (KD)

  14. ENDF/B summary documentation

    Energy Technology Data Exchange (ETDEWEB)

    Kinsey, R. (comp.)

    1979-07-01

    This publication provides a localized source of descriptions for the evaluations contained in the ENDF/B Library. The summary documentation presented is intended to be a more detailed description than the (File 1) comments contained in the computer readable data files, but not so detailed as the formal reports describing each ENDF/B evaluation. The summary documentations were written by the CSEWB (Cross Section Evaluation Working Group) evaluators and compiled by NNDC (National Nuclear Data Center). This edition includes documentation for materials found on ENDF/B Version V tapes 501 to 516 (General Purpose File) excluding tape 504. ENDF/B-V also includes tapes containing partial evaluations for the Special Purpose Actinide (521, 522), Dosimetry (531), Activation (532), Gas Production (533), and Fission Product (541-546) files. The materials found on these tapes are documented elsewhere. Some of the evaluation descriptions in this report contain cross sections or energy level information. (RWR)

  15. A database for TMT interface control documents

    Science.gov (United States)

    Gillies, Kim; Roberts, Scott; Brighton, Allan; Rogers, John

    2016-08-01

    The TMT Software System consists of software components that interact with one another through a software infrastructure called TMT Common Software (CSW). CSW consists of software services and library code that is used by developers to create the subsystems and components that participate in the software system. CSW also defines the types of components that can be constructed and their roles. The use of common component types and shared middleware services allows standardized software interfaces for the components. A software system called the TMT Interface Database System was constructed to support the documentation of the interfaces for components based on CSW. The programmer describes a subsystem and each of its components using JSON-style text files. A command interface file describes each command a component can receive and any commands a component sends. The event interface files describe status, alarms, and events a component publishes and status and events subscribed to by a component. A web application was created to provide a user interface for the required features. Files are ingested into the software system's database. The user interface allows browsing subsystem interfaces, publishing versions of subsystem interfaces, and constructing and publishing interface control documents that consist of the intersection of two subsystem interfaces. All published subsystem interfaces and interface control documents are versioned for configuration control and follow the standard TMT change control processes. Subsystem interfaces and interface control documents can be visualized in the browser or exported as PDF files.

  16. Relation Inclusive Search for Hindi Documents

    Directory of Open Access Journals (Sweden)

    Pooja Arora

    2013-08-01

    Full Text Available Information retrieval (IR) techniques have become a challenge to researchers due to the huge growth of digital information. As a wide variety of Hindi data and literature is now available on the web, we have developed an information retrieval system for Hindi documents. This paper presents a new searching technique that shows promising results in terms of F-measure. Historically, there have been two major approaches to IR: keyword based search and concept based search. We have introduced a new relation inclusive search which searches documents using the case role, spatial and temporal relations of query terms and gives better results than the previously used approaches. In this method we have used a new indexing technique which stores information about the relations between terms along with their positions. We have compared four types of searching: Keyword Based search without Relation Inclusion, Keyword Based search with Relation Inclusion, Concept Based search without Relation Inclusion and Concept Based search with Relation Inclusion. Our proposed searching method gave a significant improvement in terms of F-measure. For the experiments we have used the Hindi document corpus Gyannidhi from C-DAC. This technique effectively improves search performance for documents in English as well.

  17. Invisible in Thailand: documenting the need for protection

    Directory of Open Access Journals (Sweden)

    Margaret Green

    2008-04-01

    Full Text Available The International Rescue Committee (IRC has conducted asurvey to document the experiences of Burmese people livingin border areas of Thailand and assess the degree to whichthey merit international protection as refugees.

  18. Photographic Documentation in Plastic Surgeon’s Practice

    Directory of Open Access Journals (Sweden)

    Kasielska-Trojan Anna

    2016-05-01

    Full Text Available The aim of the study was to analyze practices of clinical photographic documentation management among plastic surgeons in Poland as well as to gain their opinion about the characteristics of “ideal” software for images archiving.

  19. Vector space model for document representation in information retrieval

    Directory of Open Access Journals (Sweden)

    Dan MUNTEANU

    2007-12-01

    Full Text Available This paper presents the basics of information retrieval: the vector space model for document representation with Boolean and term weighted models, ranking methods based on the cosine factor and evaluation measures: recall, precision and combined measure.
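
    A compact sketch of the model described here, assuming scikit-learn and a toy corpus: TF-IDF weighted document vectors are ranked against a query by the cosine factor.

      # Vector space retrieval with TF-IDF weights and cosine ranking (sketch).
      from sklearn.feature_extraction.text import TfidfVectorizer
      from sklearn.metrics.pairwise import cosine_similarity

      docs = ["information retrieval with the vector space model",
              "boolean retrieval of documents",
              "cosine similarity ranks weighted term vectors"]
      vectorizer = TfidfVectorizer()
      doc_matrix = vectorizer.fit_transform(docs)
      query_vec = vectorizer.transform(["vector space retrieval"])
      scores = cosine_similarity(query_vec, doc_matrix).ravel()
      ranking = scores.argsort()[::-1]              # best-matching documents first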

  20. Using Collaborative Tagging for Text Classification: From Text Classification to Opinion Mining

    Directory of Open Access Journals (Sweden)

    Eric Charton

    2013-11-01

    Full Text Available Numerous initiatives have allowed users to share knowledge or opinions using collaborative platforms. In most cases, the users provide a textual description of their knowledge, following very limited or no constraints. Here, we tackle the classification of documents written in such an environment. As a use case, our study is made in the context of text mining evaluation campaign material, related to the classification of cooking recipes tagged by users from a collaborative website. This context makes some of the corpus specificities difficult to model for machine-learning-based systems and keyword or lexical-based systems. In particular, different authors might have different opinions on how to classify a given document. The systems presented hereafter were submitted to the D´Efi Fouille de Textes 2013 evaluation campaign, where they obtained the best overall results, ranking first on task 1 and second on task 2. In this paper, we explain our approach for building relevant and effective systems dealing with such a corpus.

  1. Semantic-Sensitive Web Information Retrieval Model for HTML Documents

    CERN Document Server

    Bassil, Youssef

    2012-01-01

    With the advent of the Internet, a new era of digital information exchange has begun. Currently, the Internet encompasses more than five billion online sites and this number is exponentially increasing every day. Fundamentally, Information Retrieval (IR) is the science and practice of storing documents and retrieving information from within these documents. Mathematically, IR systems are at the core based on a feature vector model coupled with a term weighting scheme that weights terms in a document according to their significance with respect to the context in which they appear. Practically, Vector Space Model (VSM), Term Frequency (TF), and Inverse Term Frequency (IDF) are among other long-established techniques employed in mainstream IR systems. However, present IR models only target generic-type text documents, in that, they do not consider specific formats of files such as HTML web documents. This paper proposes a new semantic-sensitive web information retrieval model for HTML documents. It consists of a...
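
    The general idea of weighting a term by the HTML context it appears in can be sketched roughly as follows; the tag boosts and the parser are illustrative assumptions, not the weighting scheme actually proposed in the paper.

      # Boosting term weights for structurally important HTML elements (sketch).
      from html.parser import HTMLParser
      from collections import Counter

      TAG_BOOST = {"title": 3.0, "h1": 2.0, "h2": 1.5}   # hypothetical boosts

      class WeightedTerms(HTMLParser):
          def __init__(self):
              super().__init__()
              self.stack, self.weights = [], Counter()
          def handle_starttag(self, tag, attrs):
              self.stack.append(tag)
          def handle_endtag(self, tag):
              if self.stack:
                  self.stack.pop()
          def handle_data(self, data):
              boost = max([TAG_BOOST.get(t, 1.0) for t in self.stack], default=1.0)
              for term in data.lower().split():
                  self.weights[term] += boost           # heavier weight inside title/h1/h2

      parser = WeightedTerms()
      parser.feed("<html><title>vector space model</title><p>term weighting</p></html>")
      print(parser.weights)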

  2. Information Seeking & Documentation as Communication: A Software Engineering Perspective

    Directory of Open Access Journals (Sweden)

    Michael O'Brien

    2015-02-01

    Full Text Available Effective communication of knowledge is paramount in every software organisation. Essentially, the role of documentation in a software engineering context is to communicate information and knowledge of the system it describes. Unfortunately, the current perception of documentation is that it is outdated, irrelevant and incomplete. Several studies to date have revealed that documentation is unfortunately often far from ideal. Problems tend to be diverse, ranging from incompleteness, to lack of clarity, to inaccuracy, obsolescence, difficulty of access, and lack of availability in local languages. This paper begins with a discussion of information seeking as an appropriate perspective for studying software maintenance activities. To this end, it examines the importance and centrality of documentation in this process. It finally concludes with a discussion on how software documentation practices can be improved to ensure software engineers communicate more effectively via the wide variety of documents that their projects require.

  3. History Document Image Background Noise and Removal Methods

    Directory of Open Access Journals (Sweden)

    Ganchimeg.G

    2015-12-01

    Full Text Available It is common for archive libraries to provide public access to historical and ancient document image collections. It is common for such document images to require specialized processing in order to remove background noise and become more legible. Document images may be contaminated with noise during transmission, scanning or conversion to digital form. We can categorize noises by identifying their features and can search for similar patterns in a document image to choose appropriate methods for their removal. In this paper, we propose a hybrid binarization approach for improving the quality of old documents using a combination of global and local thresholding. This article also reviews noises that might appear in scanned document images and discusses some noise removal methods.
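
    A rough sketch in the spirit of the hybrid approach described above, combining Otsu's global threshold with adaptive local thresholding in OpenCV; the combination rule and parameters are assumptions, not the authors' exact method.

      # Hybrid global/local binarization for degraded document images (sketch).
      import cv2

      def hybrid_binarize(path):
          gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
          _, global_bw = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
          local_bw = cv2.adaptiveThreshold(gray, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
                                           cv2.THRESH_BINARY, 31, 10)  # blockSize=31, C=10
          # Keep a pixel as background only where both views agree; otherwise treat it as ink.
          return cv2.bitwise_and(global_bw, local_bw)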

  4. DEVELOPMENT OF DOCUMENT DATA SYNCHRONIZATION TOOLS ON SHAREPOINT PLATFORM

    Directory of Open Access Journals (Sweden)

    Vasyl A. Petrushko

    2011-02-01

    Full Text Available The development of document management automation systems for managing educational processes is an important task in the modern information society. One of the popular platforms for developing such systems is Microsoft SharePoint Products and Technologies. SharePoint contains libraries for storing documents and convenient tools for editing and managing document versions. All workflow systems need to reuse data across different documents, yet SharePoint does not provide built-in tools for data synchronization across documents. This article describes the development of a tool for automatic data synchronization across documents in SharePoint. This method is used as the basis for the Internet-based information system for planning scientific research in the National Academy of Pedagogical Sciences of Ukraine.

  5. Star2HTML -- Converting Starlink Documents to Hypertext

    Science.gov (United States)

    Draper, P. W.; Chipperfield, A. J.; Lawden, M. D.

    Star2HTML lets you write (or convert) a Starlink document so that you can create two versions of it from a single source file. A paper version is produced by LaTeX, and a hypertext version (suitable for browsing on the web) is produced by latex2html. You can tailor each version to its own medium by marking selected text as LaTeX-only or HTML-only. Star2HTML also includes a set of document templates for producing Starlink documents in a standard style (such as Starlink User Note). They also define new LaTeX commands for adding extra links to the hypertext version of your document (without affecting the paper version). This document explains these new facilities, and gives advice on good practice and on how to deal with some specific formatting problems when converting a document to hypertext. You are assumed to be familiar with LaTeX.

  6. Locative inferences in medical texts.

    Science.gov (United States)

    Mayer, P S; Bailey, G H; Mayer, R J; Hillis, A; Dvoracek, J E

    1987-06-01

    Medical research relies on epidemiological studies conducted on a large set of clinical records that have been collected from physicians recording individual patient observations. These clinical records are recorded for the purpose of individual care of the patient with little consideration for their use by a biostatistician interested in studying a disease over a large population. Natural language processing of clinical records for epidemiological studies must deal with temporal, locative, and conceptual issues. This makes text understanding and data extraction of clinical records an excellent area for applied research. While much has been done in making temporal or conceptual inferences in medical texts, parallel work in locative inferences has not been done. This paper examines the locative inferences as well as the integration of temporal, locative, and conceptual issues in the clinical record understanding domain by presenting an application that utilizes two key concepts in its parsing strategy--a knowledge-based parsing strategy and a minimal lexicon.

  7. Text Segmentation Using Exponential Models

    CERN Document Server

    Beeferman, D; Lafferty, G D; Beeferman, Doug; Berger, Adam; Lafferty, John

    1997-01-01

    This paper introduces a new statistical approach to partitioning text automatically into coherent segments. Our approach enlists both short-range and long-range language models to help it sniff out likely sites of topic changes in text. To aid its search, the system consults a set of simple lexical hints it has learned to associate with the presence of boundaries through inspection of a large corpus of annotated data. We also propose a new probabilistically motivated error metric for use by the natural language processing and information retrieval communities, intended to supersede precision and recall for appraising segmentation algorithms. Qualitative assessment of our algorithm as well as evaluation using this new metric demonstrate the effectiveness of our approach in two very different domains, Wall Street Journal articles and the TDT Corpus, a collection of newswire articles and broadcast news transcripts.
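
    The probabilistically motivated error metric introduced in this line of work is commonly known as Pk. A minimal sketch, assuming both segmentations are supplied as per-sentence segment labels, is:

      # Pk segmentation error: disagreement on "same segment?" within a sliding window.
      def pk(reference, hypothesis, k=None):
          n = len(reference)
          if k is None:
              # conventional choice: half of the average reference segment length
              k = max(1, round(n / (2 * len(set(reference)))))
          errors = 0
          for i in range(n - k):
              same_ref = reference[i] == reference[i + k]
              same_hyp = hypothesis[i] == hypothesis[i + k]
              errors += same_ref != same_hyp
          return errors / (n - k)

      print(pk([0, 0, 0, 1, 1, 2, 2, 2], [0, 0, 1, 1, 1, 2, 2, 2]))   # ~0.29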

  8. Transitioning Existing Content: inferring organisation-specific documents

    Directory of Open Access Journals (Sweden)

    Arijit Sengupta

    2000-11-01

    Full Text Available A definition for a document type within an organization represents an organizational norm about the way the organizational actors represent products and supporting evidence of organizational processes. Generating a good organization-specific document structure is, therefore, important since it can capture a shared understanding among the organizational actors about how certain business processes should be performed. Current tools that generate document type definitions focus on the underlying technology, emphasizing tags created in a single instance document. The tools, thus, fall short of capturing the shared understanding between organizational actors about how a given document type should be represented. We propose a method for inferring organization-specific document structures using multiple instance documents as inputs. The method consists of heuristics that combine individual document definitions, which may have been compiled using standard algorithms. We propose a number of heuristics utilizing artificial intelligence and natural language processing techniques. As the research progresses, the heuristics will be tested on a suite of test cases representing multiple instance documents for different document types. The complete methodology will be implemented as a research prototype

  9. Text as an Autopoietic System

    DEFF Research Database (Denmark)

    Nicolaisen, Maria Skou

    2016-01-01

    The aim of the present research article is to discuss the possibilities and limitations in addressing text as an autopoietic system. The theory of autopoiesis originated in the field of biology in order to explain the dynamic processes entailed in sustaining living organisms at cellular level. Th....... By comparing the biological with the textual account of autopoietic agency, the end conclusion is that a newly derived concept of sociopoiesis might be better suited for discussing the architecture of textual systems....

  10. Graphic composite segmentation for PDF documents with complex layouts

    Science.gov (United States)

    Xu, Canhui; Tang, Zhi; Tao, Xin; Shi, Cao

    2013-01-01

    Converting the PDF books to re-flowable format has recently attracted various interests in the area of e-book reading. Robust graphic segmentation is highly desired for increasing the practicability of PDF converters. To cope with various layouts, a multi-layer concept is introduced to segment graphic composites including photographic images, drawings with text insets or surrounded with text elements. Both image based analysis and inherent digital born document advantages are exploited in this multi-layer based layout analysis method. By combining low-level page elements clustering applied on PDF documents and connected component analysis on synthetically generated PNG image document, graphic composites can be segmented for PDF documents with complex layouts. The experimental results on graphic composite segmentation of PDF document pages have shown satisfactory performance.

  11. Attitudes and emotions through written text: the case of textual deformation in internet chat rooms.

    Directory of Open Access Journals (Sweden)

    Francisco Yus Ramos

    2010-11-01

    Full Text Available Spanish Internet chat rooms are visited by many young people who use language in a highly creative way (e.g. repetition of letters and punctuation marks). This article evaluates several hypotheses about the use of textual deformation and its communicative effectiveness, the aim being to determine whether these deformations favour a more accurate identification and assessment of the attitudes (propositional or affective) and emotions of their authors. The answers to a questionnaire reveal that, despite the additional information that textual deformation provides, readers rarely agree on the exact quality of these attitudes and emotions, nor do they establish degrees of intensity related to the amount of text typed. Nevertheless, despite these results, textual deformation appears to play a role in the interpretation that is finally chosen for these messages posted to chat rooms.

  12. Text Mining for Protein Docking.

    Directory of Open Access Journals (Sweden)

    Varsha D Badal

    2015-12-01

    Full Text Available The rapidly growing amount of publicly available information from biomedical research is readily accessible on the Internet, providing a powerful resource for predictive biomolecular modeling. The accumulated data on experimentally determined structures transformed structure prediction of proteins and protein complexes. Instead of exploring the enormous search space, predictive tools can simply proceed to the solution based on similarity to the existing, previously determined structures. A similar major paradigm shift is emerging due to the rapidly expanding amount of information, other than experimentally determined structures, which still can be used as constraints in biomolecular structure prediction. Automated text mining has been widely used in recreating protein interaction networks, as well as in detecting small ligand binding sites on protein structures. Combining and expanding these two well-developed areas of research, we applied text mining to the structural modeling of protein-protein complexes (protein docking). Protein docking can be significantly improved when constraints on the docking mode are available. We developed a procedure that retrieves published abstracts on a specific protein-protein interaction and extracts information relevant to docking. The procedure was assessed on protein complexes from Dockground (http://dockground.compbio.ku.edu). The results show that correct information on binding residues can be extracted for about half of the complexes. The amount of irrelevant information was reduced by conceptual analysis of a subset of the retrieved abstracts, based on the bag-of-words (features) approach. Support Vector Machine models were trained and validated on the subset. The remaining abstracts were filtered by the best-performing models, which decreased the irrelevant information for ~25% of the complexes in the dataset. The extracted constraints were incorporated in the docking protocol and tested on the Dockground unbound
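
    The filtering step can be sketched as a bag-of-words plus linear SVM pipeline, as below; the example texts and labels are placeholders rather than data from the study.

      # Filtering retrieved abstracts with a bag-of-words SVM (sketch).
      from sklearn.feature_extraction.text import CountVectorizer
      from sklearn.svm import LinearSVC
      from sklearn.pipeline import make_pipeline

      abstracts = ["residues in the binding interface were mutated",
                   "the gene is broadly expressed in liver tissue"]
      labels = [1, 0]                       # 1 = relevant to docking constraints

      model = make_pipeline(CountVectorizer(stop_words="english"), LinearSVC())
      model.fit(abstracts, labels)
      print(model.predict(["mutation of interface residues abolished binding"]))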

  13. New Historicism: Text and Context

    Directory of Open Access Journals (Sweden)

    Violeta M. Vesić

    2016-02-01

    Full Text Available During most of the twentieth century history was seen as a phenomenon outside of literature that guaranteed the veracity of literary interpretation. History was unique and it functioned as a basis for reading literary works. During the seventies of the twentieth century there occurred a change of attitude towards history in American literary theory, and there appeared a new theoretical approach which soon became known as New Historicism. Since its inception, New Historicism has been identified with the study of Renaissance and Romanticism, but nowadays it has been increasingly involved in other literary trends. Although there are great differences in the arguments and practices at various representatives of this school, New Historicism has clearly recognizable features and many new historicists will agree with the statement of Walter Cohen that New Historicism, when it appeared in the eighties, represented something quite new in reference to the studies of theory, criticism and history (Cohen 1987, 33. Theoretical connection with Bakhtin, Foucault and Marx is clear, as well as a kind of uneasy tie with deconstruction and the work of Paul de Man. At the center of this approach is a renewed interest in the study of literary works in the light of historical and political circumstances in which they were created. Foucault encouraged readers to begin to move literary texts and to link them with discourses and representations that are not literary, as well as to examine the sociological aspects of the texts in order to take part in the social struggles of today. The study of literary works using New Historicism is the study of politics, history, culture and circumstances in which these works were created. With regard to one of the main fact which is located in the center of the criticism, that history cannot be viewed objectively and that reality can only be understood through a cultural context that reveals the work, re-reading and interpretation of

  14. Converting Relational Database Into Xml Document

    Directory of Open Access Journals (Sweden)

    Kanagaraj.S

    2012-03-01

    Full Text Available XML (Extensible Markup Language) is emerging and gradually being accepted as the standard for data interchange in the Internet world. Interoperation of relational databases and XML databases involves schema and data translations. Through the EER (extended entity relationship) model, the schema of a relational database can be converted into XML. The semantics of the relational database, captured in the EER diagram, are mapped to an XML schema using stepwise procedures and then mapped to an XML document under the definitions of that schema. Converting a relational database into an XML document is the process of converting existing databases into the XML file format. Existing conversion techniques convert a single database into XML. The proposed approach converts databases such as MS Access and MS SQL into the XML file format: it reads the table information from the corresponding database, generates code for the appropriate database and converts the tables into a flat XML file, which is then presented to the user.
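
    For illustration only (this is not the tool described in the paper), dumping one relational table into a flat XML document can look like the following standard-library sketch; the database path and table name are hypothetical.

      # Exporting a relational table as a flat XML document (sketch).
      import sqlite3
      import xml.etree.ElementTree as ET

      def table_to_xml(db_path, table):
          conn = sqlite3.connect(db_path)
          cursor = conn.execute(f"SELECT * FROM {table}")   # table name assumed trusted
          columns = [d[0] for d in cursor.description]      # column names become element tags
          root = ET.Element(table)
          for row in cursor:
              record = ET.SubElement(root, "record")
              for name, value in zip(columns, row):
                  ET.SubElement(record, name).text = "" if value is None else str(value)
          conn.close()
          return ET.tostring(root, encoding="unicode")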

  15. Citation analysis of Journal of Documentation

    Directory of Open Access Journals (Sweden)

    Navneet Kaur

    2011-06-01

    Full Text Available A citation analysis of all the journal articles published in the Journal of Documentation from 1996-2010 is carried out. 487 articles were published in the journal during these 15 years; the highest number of articles (44) was published in the year 2005. The journal contained 15587 citations from 1996-2010, and the average number of citations per article is highest in the year 2009. This study also covers the analysis of authorship patterns in the citing articles: single-author citations are the most common, numbering 201 (49%). The study also reveals that the Journal of Documentation is the most preferred journal used by authors in their citations. The paper concludes that only 10 core periodicals can cover more than 2951 (16%) of the references.

  16. ARCHAEOLOGICAL DOCUMENTATION OF A DEFUNCT IRAQI TOWN

    Directory of Open Access Journals (Sweden)

    J. Šedina

    2016-06-01

    Full Text Available The subject of this article is the possibilities of the documentation of a defunct town from the Pre-Islamic period to Early Islamic period. This town is located near the town Makhmur in Iraq. The Czech archaeological mission has worked at this dig site. This Cultural Heritage site is threatened by war because in the vicinity are positions of ISIS. For security reasons, the applicability of Pleiades satellite data has been tested. Moreover, this area is a no-fly zone. However, the DTM created from stereo-images was insufficient for the desired application in archeology. The subject of this paper is the testing of the usability of RPAS technology and terrestrial photogrammetry for documentation of the remains of buildings. RPAS is a very fast growing technology that combines the advantages of aerial photogrammetry and terrestrial photogrammetry. A probably defunct church is a sample object.

  17. Succincter Text Indexing with Wildcards

    CERN Document Server

    Thachuk, Chris

    2011-01-01

    We study the problem of indexing text with wildcard positions, motivated by the challenge of aligning sequencing data to large genomes that contain millions of single nucleotide polymorphisms (SNPs)---positions known to differ between individuals. SNPs modeled as wildcards can lead to more informed and biologically relevant alignments. We improve the space complexity of previous approaches by giving a succinct index requiring $(2 + o(1))n \\log \\sigma + O(n) + O(d \\log n) + O(k \\log k)$ bits for a text of length $n$ over an alphabet of size $\\sigma$ containing $d$ groups of $k$ wildcards. A key to the space reduction is a result we give showing how any compressed suffix array can be supplemented with auxiliary data structures occupying $O(n) + O(d \\log \\frac{n}{d})$ bits to also support efficient dictionary matching queries. The query algorithm for our wildcard index is faster than previous approaches using reasonable working space. More importantly our new algorithm greatly reduces the query working space to ...

  18. A Survey On Various Approaches Of Text Extraction In Images

    Directory of Open Access Journals (Sweden)

    C.P. Sumathi

    2012-09-01

    Full Text Available Text extraction plays a major role in finding vital and valuable information. Text extraction involves detection, localization, tracking, binarization, extraction, enhancement and recognition of the text from the given image. These text characters are difficult to detect and recognize due to their variation in size, font, style, orientation, alignment and contrast, and due to complex coloured or textured backgrounds. Owing to the rapid growth of available multimedia documents and the growing requirement for information identification, indexing and retrieval, much research has been done on text extraction in images. Several techniques have been developed for extracting the text from an image. The proposed methods were based on morphological operators, wavelet transform, artificial neural networks, skeletonization operations, edge detection algorithms, histogram techniques etc. All these techniques have their benefits and restrictions. This article discusses various schemes proposed earlier for extracting the text from an image. This paper also provides a performance comparison of several existing methods proposed by researchers for extracting text from an image.

  19. Terminologie de Base de la Documentation. (Basic Terminology of Documentation).

    Science.gov (United States)

    Commission des Communautes Europeennes (Luxembourg). Bureau de Terminologie.

    This glossary is designed to aid non-specialists whose activities require that they have some familiarity with the terminology of the modern methods of documentation. Definitions have been assembled from various dictionaries, manuals, etc., with particular attention being given to the publications of UNESCO and the International Standards…

  20. Reactor operation environmental information document

    Energy Technology Data Exchange (ETDEWEB)

    Haselow, J.S.; Price, V.; Stephenson, D.E.; Bledsoe, H.W.; Looney, B.B.

    1989-12-01

    The Savannah River Site (SRS) produces nuclear materials, primarily plutonium and tritium, to meet the requirements of the Department of Defense. These products have been formed in nuclear reactors that were built during 1950--1955 at the SRS. K, L, and P reactors are three of five reactors that have been used in the past to produce the nuclear materials. All three of these reactors discontinued operation in 1988. Currently, intense efforts are being extended to prepare these three reactors for restart in a manner that protects human health and the environment. To document that restarting the reactors will have minimal impacts to human health and the environment, a three-volume Reactor Operations Environmental Impact Document has been prepared. The document focuses on the impacts of restarting the K, L, and P reactors on both the SRS and surrounding areas. This volume discusses the geology, seismology, and subsurface hydrology. 195 refs., 101 figs., 16 tabs.

  1. Segmental Rescoring in Text Recognition

    Science.gov (United States)

    2014-02-04

    [Abstract partly illegible in the source scan] ...applying a Hidden Markov Model (HMM) recognition approach. Generating the plurality of text hypotheses for the image includes generating a first ... image. Applying segmental analysis to a segmentation determined by a first OCR engine, such as a segmentation determined by a Hidden Markov Model (HMM

  2. Linguistic dating of biblical texts

    DEFF Research Database (Denmark)

    Young, Ian; Rezetko, Robert; Ehrensvärd, Martin Gustaf

    and diglossia and textual criticism (Chapters 7, 13), and the significance of extra-biblical sources, including Amarna Canaanite, Ugaritic, Aramaic, Hebrew inscriptions of the monarchic period, Qumran and Mishnaic Hebrew, the Hebrew language of Ben Sira and Bar Kochba, and also Egyptian, Akkadian, Persian.... This is followed by a detailed synthesis of the topics introduced in the first volume, a series of detailed case studies on various linguistic issues, extensive tables of grammatical and lexical features, and a comprehensive bibliography. The authors argue that the scholarly use of language in dating biblical... texts, and even the traditional standpoint on the chronological development of biblical Hebrew, require a thorough re-evaluation, and propose a new perspective on linguistic variety in biblical Hebrew. Early Biblical Hebrew and Late Biblical Hebrew do not represent different chronological periods...

  3. Everyday Life as a Text

    Directory of Open Access Journals (Sweden)

    Michael Lahey

    2016-02-01

    Full Text Available This article explores how audience data are utilized in the tentative partnerships created between television and social media companies. Specially, it looks at the mutually beneficial relationship formed between the social media platform Twitter and television. It calls attention to how audience data are utilized as a way for the television industry to map itself onto the everyday lives of digital media audiences. I argue that the data-intensive monitoring of everyday life offers some measure of soft control over audiences in a digital media landscape. To do this, I explore “Social TV”—the relationships created between social media technologies and television—before explaining how Twitter leverages user data into partnerships with various television companies. Finally, the article explains what is fruitful about understanding the Twitter–television relationship as a form of soft control.

  4. SUPERVISED LEARNING METHODS FOR BANGLA WEB DOCUMENT CATEGORIZATION

    Directory of Open Access Journals (Sweden)

    Ashis Kumar Mandal

    2014-09-01

    Full Text Available This paper explores the use of machine learning approaches, or more specifically, four supervised learning methods, namely Decision Tree (C4.5), K-Nearest Neighbour (KNN), Naïve Bayes (NB), and Support Vector Machine (SVM), for categorization of Bangla web documents. This is the task of automatically sorting a set of documents into categories from a predefined set. Whereas a wide range of methods have been applied to English text categorization, relatively few studies have been conducted on Bangla language text categorization. Hence, we attempt to analyze the efficiency of those four methods for categorization of Bangla documents. In order to validate the approach, a Bangla corpus from various websites has been developed and used as examples for the experiment. For Bangla, the empirical results support that all four methods produce satisfactory performance, with SVM attaining good results on the high-dimensional and relatively noisy document feature vectors.
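
    A sketch of such a comparison using scikit-learn stand-ins for the four methods (a DecisionTreeClassifier in place of C4.5) over TF-IDF features; the Bangla corpus itself is not shown and the settings are illustrative.

      # Comparing four supervised text classifiers by cross-validation (sketch).
      from sklearn.feature_extraction.text import TfidfVectorizer
      from sklearn.tree import DecisionTreeClassifier
      from sklearn.neighbors import KNeighborsClassifier
      from sklearn.naive_bayes import MultinomialNB
      from sklearn.svm import LinearSVC
      from sklearn.pipeline import make_pipeline
      from sklearn.model_selection import cross_val_score

      def compare(texts, labels):
          candidates = {"Decision Tree": DecisionTreeClassifier(),
                        "KNN": KNeighborsClassifier(n_neighbors=3),
                        "Naive Bayes": MultinomialNB(),
                        "SVM": LinearSVC()}
          for name, clf in candidates.items():
              pipeline = make_pipeline(TfidfVectorizer(), clf)
              scores = cross_val_score(pipeline, texts, labels, cv=3)
              print(name, round(scores.mean(), 3))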

  5. A concept-based approach to text categorization

    NARCIS (Netherlands)

    Schijvenaars, B.J.A.; Schuemie, M.J.; Mulligen, E.M. van; Weeber, M.; Jelier, R.; Mons, B.; Kors, J.A.; Kraaij, W.

    2005-01-01

    The Biosemantics group (Erasmus University Medical Center, Rotterdam) participated in the text categorization task of the Genomics Track. We followed a thesaurus-based approach, using the Collexis indexing system, in combination with a simple classification algorithm to assign a document to one of t

  6. Extracting bimodal representations for language-based image text retrieval

    NARCIS (Netherlands)

    Westerveld, T.H.W.; Hiemstra, D.; Jong, de F.M.G.; Correia, N.; Chambel, T.; Davenport, G.

    2000-01-01

    This paper explores two approaches to multimedia indexing that might contribute to the advancement of text-based conceptual search for pictorial information. Insights from relatively mature retrieval areas (spoken document retrieval and cross-language retrieval) are taken as a starting point for an

  7. The Challenge of Violence. [Student Text and] Teacher's Guide.

    Science.gov (United States)

    Croddy, Marshall; Degelman, Charles; Hayes, Bill

    This document addresses violence as one of the key challenges facing the democratic and pluralistic republic under the framework of the Constitution and its Bill of Rights. Primary focus is on criminal violence and the factors and behaviors that contribute to violent crime. The text is organized into three chapters: (1) "The Problem of…

  8. Text Passage Retrieval Based on Colon Classification: Retrieval Performance.

    Science.gov (United States)

    Shepherd, Michael A.

    1981-01-01

    Reports the results of experiments using colon classification for the analysis, representation, and retrieval of primary information from the full text of documents. Recall, precision, and search length measures indicate colon classification did not perform significantly better than Boolean or simple word occurrence systems. Thirteen references…

  9. Signal Detection Framework Using Semantic Text Mining Techniques

    Science.gov (United States)

    Sudarsan, Sithu D.

    2009-01-01

    Signal detection is a challenging task for regulatory and intelligence agencies. Subject matter experts in those agencies analyze documents, generally containing narrative text in a time bound manner for signals by identification, evaluation and confirmation, leading to follow-up action e.g., recalling a defective product or public advisory for…

  10. Text Summarization Using FrameNet-Based Semantic Graph Model

    Directory of Open Access Journals (Sweden)

    Xu Han

    2016-01-01

    Full Text Available Text summarization aims to generate a condensed version of the original document. The major issues for text summarization are eliminating redundant information, identifying important differences among documents, and recovering the informative content. This paper proposes a FrameNet-based Semantic Graph Model (FSGM) which exploits the semantic information of sentences. FSGM treats sentences as vertices and the semantic relationships between them as edges, and uses FrameNet and word embeddings to calculate the similarity of sentences. The method assigns weights to both sentence nodes and edges. Finally, it proposes an improved method to rank these sentences, considering both internal and external information. The experimental results show that the model is feasible and effective for summarizing text.
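
    A simplified sketch of the sentence-graph idea follows: sentences become nodes, pairwise similarities become weighted edges, and a PageRank-style iteration scores the sentences. TF-IDF cosine similarity stands in here for the FrameNet and word-embedding similarity used by the authors.

      # Ranking sentences over a similarity graph with a PageRank-style iteration (sketch).
      import numpy as np
      from sklearn.feature_extraction.text import TfidfVectorizer
      from sklearn.metrics.pairwise import cosine_similarity

      def rank_sentences(sentences, damping=0.85, iterations=50):
          sim = cosine_similarity(TfidfVectorizer().fit_transform(sentences))
          np.fill_diagonal(sim, 0.0)                       # no self-links
          row_sums = sim.sum(axis=1, keepdims=True)
          transition = np.divide(sim, row_sums, out=np.zeros_like(sim), where=row_sums > 0)
          scores = np.full(len(sentences), 1.0 / len(sentences))
          for _ in range(iterations):
              scores = (1 - damping) / len(sentences) + damping * transition.T @ scores
          return scores.argsort()[::-1]                    # sentence indices, best first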

  11. TEXTS SENTIMENT-ANALYSIS APPLICATION FOR PUBLIC OPINION ASSESSMENT

    Directory of Open Access Journals (Sweden)

    I. A. Bessmertny

    2015-01-01

    Full Text Available The paper describes an approach to the emotional tonality assessment of natural language texts based on special dictionaries. A method is proposed for the automatic assessment of public opinion by means of sentiment analysis of the reviews and discussions that follow published Web documents. The method is based on word statistics in the documents. A pilot model of the software system implementing sentiment analysis of natural language text in Russian, based on a linear assessment scale, is developed. Syntactic analysis and word lemmatization are used to identify terms more correctly. The tonality dictionaries are presented in an editable format and are open for enhancement. A program system implementing sentiment analysis of Russian texts based on open tonality dictionaries is presented for the first time.
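
    A minimal sketch of dictionary-based tonality scoring on a linear scale is shown below; the tiny dictionary and the lack of lemmatization or syntactic analysis are simplifications relative to the described system.

      # Averaging dictionary tonality values over the words of a text (sketch).
      TONALITY = {"excellent": 2, "good": 1, "poor": -1, "terrible": -2}   # toy dictionary

      def tonality_score(text):
          words = [w.strip(".,!?").lower() for w in text.split()]
          hits = [TONALITY[w] for w in words if w in TONALITY]
          return sum(hits) / len(hits) if hits else 0.0    # position on the linear scale

      print(tonality_score("The service was good, the food was excellent."))   # 1.5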

  12. A TEXT MINING RESEARCH BASED ON LDA TOPIC MODELLING

    Directory of Open Access Journals (Sweden)

    Zhou Tong

    2016-05-01

    Full Text Available A large amount of digital text information is generated every day. Effectively searching, managing and exploring this text data has become a main task. In this paper, we first present an introduction to text mining and to the probabilistic topic model Latent Dirichlet Allocation. Then two experiments are proposed: topic modelling of Wikipedia articles and of users' tweets. The former builds a document topic model, aiming at a topic-perspective solution for searching, exploring and recommending articles. The latter sets up a user topic model, providing a full analysis of Twitter users' interests. The experimental process, including data collection, data pre-processing and model training, is fully documented and commented. Furthermore, the conclusions and applications of this paper could serve as a useful computational tool for social and business research.
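
    The topic-modelling step can be sketched with scikit-learn's LDA implementation as below; the toy corpus and parameters are placeholders rather than the Wikipedia or Twitter data used in the experiments.

      # Fitting an LDA topic model and listing the top words per topic (sketch).
      from sklearn.feature_extraction.text import CountVectorizer
      from sklearn.decomposition import LatentDirichletAllocation

      docs = ["the match ended with a late goal",
              "parliament passed the new budget law",
              "the striker scored twice in the final",
              "ministers debated the tax proposal"]
      vectorizer = CountVectorizer(stop_words="english")
      counts = vectorizer.fit_transform(docs)
      lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)

      terms = vectorizer.get_feature_names_out()
      for topic_id, weights in enumerate(lda.components_):
          top = [terms[i] for i in weights.argsort()[::-1][:5]]
          print(f"topic {topic_id}:", ", ".join(top))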

  13. Social Media Text Classification by Enhancing Well-Formed Text Trained Model

    Directory of Open Access Journals (Sweden)

    Phat Jotikabukkana

    2016-09-01

    Full Text Available Social media are a powerful communication tool in our era of digital information. The large amount of user-generated data is a useful novel source of data, even though it is not easy to extract the treasures from this vast and noisy trove. Since classification is an important part of text mining, many techniques have been proposed to classify this kind of information. We developed an effective technique of social media text classification by semi-supervised learning utilizing an online news source consisting of well-formed text. The computer first automatically extracts news categories, well-categorized by publishers, as classes for topic classification. A bag of words taken from news articles provides the initial keywords related to their category in the form of word vectors. The principal task is to retrieve a set of new productive keywords. Term Frequency-Inverse Document Frequency weighting (TF-IDF and Word Article Matrix (WAM are used as main methods. A modification of WAM is recomputed until it becomes the most effective model for social media text classification. The key success factor was enhancing our model with effective keywords from social media. A promising result of 99.50% accuracy was achieved, with more than 98.5% of Precision, Recall, and F-measure after updating the model three times.

  14. Project Documentation as a Risk for Public Projects

    Directory of Open Access Journals (Sweden)

    Vladěna Štěpánková

    2015-08-01

    Full Text Available Purpose of the article: The paper presents the different methodologies used for creating documentation and focuses on public projects and their documentation requirements. Since documentation is incorporated in the overall planning of the project and its duration is estimated by qualified expert judgement, any change in this documentation can lead to project delays or increase project costs through the administration it consumes; documentation is therefore seen as a risk which may threaten the project as a public contract that a company is trying to win, and generally any project. Methodology/methods: Several methods of obtaining information are used in this paper, mainly structured interviews combined with brainstorming, supplemented by a questionnaire for companies dealing with public procurement. Data were processed in MS Excel using basic statistical methods based on regression analysis. Scientific aim: The article deals with the construction market in the Czech Republic and examines the impact of changes in the project documentation of public projects on their turnover. Findings: In this paper we summarize the advantages and disadvantages of having project documentation. In the case of public contracts and changes in legislation it is necessary to focus on creating documentation in advance, follow the new requirements and try to meet them in the shortest possible time. Conclusions: The paper concludes with recommendations on how to proceed when such changes occur and how to reduce the costs that this documentation risk may cause.

  15. STANDARDIZATION OF MEDICAL DOCUMENT FLOW: PRINCIPLES AND FEATURES

    Directory of Open Access Journals (Sweden)

    Владимир Анатольевич Мелентьев

    2013-04-01

    Full Text Available This article considers questions connected with the general concepts and principles of the functioning of document flow within any economic entity (an enterprise, institution or organization). A GOST-standardized definition of document flow and a classification of types of documentary streams are given. The basic principles of constructing document flow are set out; following them makes it possible to create an optimal structure of the document flow and pattern of document movement, and to account for the interrelation of external and internal influences. The basic elements of medical document flow are then considered, and the main problems of medical document flow are identified, together with the major factors distinguishing medical document flow from the document flow of manufacturing enterprises or other economic entities. From consideration of these problems the conclusion is drawn that the initial stage of their solution is the standardization of medical document flow, which is also the first stage in creating a common information space for the medical sector. DOI: http://dx.doi.org/10.12731/2218-7405-2013-4-3

  16. STANDARDIZATION OF MEDICAL DOCUMENT FLOW: PRINCIPLES AND FEATURES

    Directory of Open Access Journals (Sweden)

    Melentev Vladimir Anatolevich

    2013-04-01

    Full Text Available This article considers questions connected with the general concepts and principles of the functioning of document flow within any economic entity (an enterprise, institution or organization). A GOST-standardized definition of document flow and a classification of types of documentary streams are given. The basic principles of constructing document flow are set out; following them makes it possible to create an optimal structure of the document flow and pattern of document movement, and to account for the interrelation of external and internal influences. The basic elements of medical document flow are then considered, and the main problems of medical document flow are identified, together with the major factors distinguishing medical document flow from the document flow of manufacturing enterprises or other economic entities. From consideration of these problems the conclusion is drawn that the initial stage of their solution is the standardization of medical document flow, which is also the first stage in creating a common information space for the medical sector.

  17. FacetAtlas: multifaceted visualization for rich text corpora.

    Science.gov (United States)

    Cao, Nan; Sun, Jimeng; Lin, Yu-Ru; Gotz, David; Liu, Shixia; Qu, Huamin

    2010-01-01

    Documents in rich text corpora usually contain multiple facets of information. For example, an article about a specific disease often consists of different facets such as symptom, treatment, cause, diagnosis, prognosis, and prevention. Thus, documents may have different relations based on different facets. Powerful search tools have been developed to help users locate lists of individual documents that are most related to specific keywords. However, there is a lack of effective analysis tools that reveal the multifaceted relations of documents within or across document clusters. In this paper, we present FacetAtlas, a multifaceted visualization technique for visually analyzing rich text corpora. FacetAtlas combines search technology with advanced visual analytical tools to convey both global and local patterns simultaneously. We describe several unique aspects of FacetAtlas, including (1) node cliques and multifaceted edges, (2) an optimized density map, (3) automated opacity pattern enhancement for highlighting visual patterns, and (4) interactive context switching between facets. In addition, we demonstrate the power of FacetAtlas through a case study that targets patient education in the health care domain. Our evaluation shows the benefits of this work, especially in support of complex multifaceted data analysis.

  18. A programmed text in statistics

    CERN Document Server

    Hine, J

    1975-01-01

    Exercises for Section 2: Physical sciences and engineering 42; Biological sciences 43; Social sciences 45. Solutions to Exercises, Section 1: Physical sciences and engineering 47; Biological sciences 49; Social sciences 49. Solutions to Exercises, Section 2: Physical sciences and engineering 51; Biological sciences 55; Social sciences 58. Tables 62: χ² tests involving variances 62; χ² one-tailed tests 63, 64; χ² two-tailed tests 65; F-distribution 66-69. Preface: This project started some years ago when the Nuffield Foundation kindly gave a grant for writing a programmed text to use with service courses in statistics. The work was carried out by Mrs. Joan Hine and Professor G. B. Wetherill at Bath University, together with some other help from time to time by colleagues at Bath University and elsewhere. Testing was done at various colleges and universities, and some helpful comments were received, but we particularly mention King Edwards School, Bath, who provided some sixth formers as 'guinea pigs' for the fir...

  19. Density Based Script Identification of a Multilingual Document Image

    Directory of Open Access Journals (Sweden)

    Rumaan Bashir

    2015-01-01

    Full Text Available The field of Automatic Pattern Recognition has witnessed enormous growth in the past few decades. As an essential element of Pattern Recognition, Document Image Analysis is the procedure of analyzing a document image with the intention of working out its contents so that they can be manipulated as required at various levels. It involves various procedures such as document classification, organization, conversion and identification, among many others. Since a document chiefly contains text, Script Identification has grown to be a very important area of this field. A script comprises the text of a document or a manuscript; it is a scheme of written characters and symbols used to write a particular language. Every language has its own set of symbols used for writing it, and sometimes different languages are written using the same script with marginal modification. Script Identification has been performed for unilingual, bilingual and multilingual document images, but negligible work has been reported for the Kashmiri script. In this paper, we analyze and experimentally test a statistical approach for identification of the Kashmiri script in a document image, along with the Roman, Devanagari and Urdu scripts. The identification is performed on offline machine-printed scripts and yields promising results.
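A minimal sketch of the kind of density statistic such a statistical approach could rely on: ink-pixel density per horizontal band of a binarized text block, matched against per-script reference profiles. The thresholding step, band count, reference values and nearest-profile rule are illustrative assumptions, not the paper's exact procedure.

```python
# Illustrative density-based script identification sketch (not the paper's exact method).
# Assumes a grayscale text-block image as a NumPy array; reference profiles are hypothetical.
import numpy as np

def density_profile(block: np.ndarray, bands: int = 8, threshold: int = 128) -> np.ndarray:
    """Fraction of ink (dark) pixels in each horizontal band of a text block."""
    ink = (block < threshold).astype(float)      # simple global binarization
    rows = np.array_split(ink, bands, axis=0)    # split the block into horizontal bands
    return np.array([band.mean() for band in rows])

def identify_script(block: np.ndarray, references: dict) -> str:
    """Pick the script whose reference density profile is closest (Euclidean distance)."""
    profile = density_profile(block)
    return min(references, key=lambda name: np.linalg.norm(profile - references[name]))

# Hypothetical reference profiles, e.g. averaged from labelled training blocks.
references = {
    "Roman":      np.array([0.02, 0.10, 0.22, 0.25, 0.24, 0.20, 0.08, 0.02]),
    "Devanagari": np.array([0.03, 0.30, 0.28, 0.20, 0.18, 0.15, 0.06, 0.02]),
    "Urdu":       np.array([0.02, 0.08, 0.18, 0.26, 0.27, 0.22, 0.10, 0.03]),
    "Kashmiri":   np.array([0.02, 0.09, 0.19, 0.25, 0.26, 0.23, 0.11, 0.03]),
}

block = (np.random.rand(64, 256) * 255).astype(np.uint8)   # stand-in for a real text block
print(identify_script(block, references))
```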

  20. Semantic retrieval and navigation in clinical document collections.

    Science.gov (United States)

    Kreuzthaler, Markus; Daumke, Philipp; Schulz, Stefan

    2015-01-01

    Patients with chronic diseases undergo numerous in- and outpatient treatment periods, and therefore many documents accumulate in their electronic records. We report on an on-going project focussing on the semantic enrichment of medical texts, in order to support recall-oriented navigation across a patient's complete documentation. A document pool of 1,696 de-identified discharge summaries was used for prototyping. A natural language processing toolset for document annotation (based on the text-mining framework UIMA) and indexing (Solr) was used to support a browser-based platform for document import, search and navigation. The integrated search engine combines free text and concept-based querying, supported by dynamically generated facets (diagnoses, procedures, medications, lab values, and body parts). The prototype demonstrates the feasibility of semantic document enrichment within document collections of a single patient. Originally conceived as an add-on for the clinical workplace, this technology could also be adapted to support personalised health record platforms, as well as cross-patient search for cohort building and other secondary use scenarios.
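A hedged sketch of how the combined free-text and concept-based querying could be issued against a Solr index of this kind; the core name, field names and facet fields are assumptions, since the project's actual schema is not given.

```python
# Illustrative faceted Solr query; the core name and field names are assumptions.
import requests

SOLR_SELECT = "http://localhost:8983/solr/discharge_summaries/select"

params = {
    "q": "text:anticoagulation",                    # free-text part of the query
    "fq": 'diagnoses:"atrial fibrillation"',        # concept-based filter on an annotated facet
    "facet": "true",
    "facet.field": ["diagnoses", "procedures", "medications"],  # dynamically generated facets
    "rows": 10,
    "wt": "json",
}

response = requests.get(SOLR_SELECT, params=params, timeout=10)
result = response.json()
print(result["response"]["numFound"], "matching documents")
# Solr returns each facet field as a flat [term, count, term, count, ...] list.
for field, term_counts in result.get("facet_counts", {}).get("facet_fields", {}).items():
    print(field, term_counts[:10])
```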

  1. Technical Documentation and Legal Liability.

    Science.gov (United States)

    Caher, John M.

    1995-01-01

    States that litigation over the interpretation and sufficiency of technical documentation is increasingly common as a number of suits have been filed in state and federal courts. Describes the case of "Martin versus Hacker," a recent case in which New York's highest court analyzed a technical writer's prose in the context of a lawsuit…

  2. Annex II technical documentation assessed.

    Science.gov (United States)

    van Drongelen, A W; Roszek, B; van Tienhoven, E A E; Geertsma, R E; Boumans, R T; Kraus, J J A M

    2005-12-01

    Annex II of the Medical Device Directive (MDD) is used frequently by manufacturers to obtain CE-marking. This procedure relies on a full quality assurance system and does not require an assessment of the individual medical device by a Notified Body. An investigation into the availability and the quality of technical documentation for Annex II devices revealed severe shortcomings, which are reported here.

  3. Synchronizing Web Documents with Style

    NARCIS (Netherlands)

    Guimarães, R.L.; Bulterman, D.C.A.; Cesar Garcia, P.S.; Jansen, A.J.

    2014-01-01

    In this paper we report on our efforts to define a set of document extensions to Cascading Style Sheets (CSS) that allow for structured timing and synchronization of elements within a Web page. Our work considers the scenario in which the temporal structure can be decoupled from the content of the W

  4. The Digital technical documentation handbook

    CERN Document Server

    Schultz, Susan I; Kavanagh, Frank X; Morse, Marjorie J

    1993-01-01

    The Digital Technical Documentation Handbook describes the process of developing and producing technical user information at Digital Equipment Corporation. * Discusses techniques for making user information more effective * Covers the draft and review process, the production and distribution of printed and electronic media, archiving, indexing, testing for usability, and many other topics * Provides quality assurance checklists, contains a glossary and a bibliography of resources for technical communicators

  5. Melter Disposal Strategic Planning Document

    Energy Technology Data Exchange (ETDEWEB)

    BURBANK, D.A.

    2000-09-25

    This document describes the proposed strategy for disposal of spent and failed melters from the tank waste treatment plant to be built by the Office of River Protection at the Hanford site in Washington. It describes program management activities, disposal and transportation systems, leachate management, permitting, and safety authorization basis approvals needed to execute the strategy.

  6. Population Education Documents, Reprint Series.

    Science.gov (United States)

    United Nations Educational, Scientific, and Cultural Organization, Bangkok (Thailand). Regional Office for Education in Asia and Oceania.

    This publication contains reprints of five documents that were either published in foreign journals or released in limited numbers by author or publisher. The papers are all concerned with population education, but deal more specifically with the role of population and the schools. Among the topics discussed are population education and the school…

  7. t-Plausibility: Generalizing Words to Desensitize Text

    Directory of Open Access Journals (Sweden)

    Balamurugan Anandan

    2012-12-01

    Full Text Available De-identified data has the potential to be shared widely to support decision making and research. While significant advances have been made in the anonymization of structured data, the anonymization of textual information is in its infancy. Document sanitization requires finding and removing personally identifiable information. While current tools are effective at removing specific types of information (names, addresses, dates), they fail on two counts. The first is that complete text redaction may not be necessary to prevent re-identification, since it can affect the readability and usability of the text. More serious is that identifying information, as well as sensitive information, can be quite subtle and still be present in the text even after the removal of obvious identifiers. Observe that a diagnosis such as "tuberculosis" is sensitive, but in some situations it can also be identifying. Replacing it with the less sensitive term "infectious disease" also reduces identifiability. That is, instead of simply removing sensitive terms, these terms can be hidden by more general but semantically related terms to protect sensitive and identifying information, without unnecessarily degrading the amount of information contained in the document. Based on this observation, the main contribution of this paper is to provide a novel information-theoretic approach to text sanitization and to develop efficient heuristics to sanitize text documents.
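A small sketch of the underlying idea of hiding a sensitive term behind a more general, semantically related term. WordNet hypernyms stand in here for the paper's information-theoretic term selection, so this is a simplified illustration rather than the authors' algorithm.

```python
# Sketch of term generalization via WordNet hypernyms (a stand-in for the paper's
# information-theoretic sanitization; requires `nltk` and the WordNet corpus).
import nltk
from nltk.corpus import wordnet as wn

nltk.download("wordnet", quiet=True)

def generalize(term: str, levels: int = 1) -> str:
    """Replace a term by a hypernym `levels` steps up the WordNet hierarchy, if one exists."""
    synsets = wn.synsets(term)
    if not synsets:
        return term
    synset = synsets[0]                 # naive sense choice; real systems disambiguate
    for _ in range(levels):
        hypernyms = synset.hypernyms()
        if not hypernyms:
            break
        synset = hypernyms[0]
    return synset.lemma_names()[0].replace("_", " ")

print(generalize("tuberculosis"))       # e.g. a more general disease term
print(generalize("tuberculosis", 2))    # an even more general term
```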

  8. International environmental law. Documents; 2. ed.

    Energy Technology Data Exchange (ETDEWEB)

    NONE

    1997-07-01

    The first edition of this collection of documents on International Environmental Law appeared in 1995 and proved to be highly successful, not only for educational purposes at universities but also with practitioners. Many suggestions as to revisions of and additions to the materials, received from its users in the past two years, have prompted this second edition. Texts have been brought up to date when amended, completed versions of two ILC (International Law Commission) drafts have been included, and new documents have been added. Furthermore, it has been decided to add some documents in the field of European law pertaining to environmental issues. Finally, the request for a different format has been honored, thereby turning this collection into a 'proper book', not only as far as the contents are concerned but also by way of appearance. The collection has been assembled through inter-university cooperation, co-ordinated by the T.M.C. Asser Institute, of Dutch law faculties teaching international law.

  9. Distributed and Conditional Documents: Conceptualizing Bibliographical Alterities

    Directory of Open Access Journals (Sweden)

    Johanna Drucker

    2014-11-01

    Full Text Available To conceptualize a future history of the book we have to recognize that our understanding of the bibliographical object of the past is challenged by the ontologically unbound, distributed, digital, and networked conditions of the present. As we draw on rich intellectual traditions, we must keep in view the need to let go of the object-centered approach that is at the heart of book history. My argument begins, therefore, with a few assertions. First, that we have much to learn from the scholarship on Old and New World contact that touches on bibliography, document studies, and book history for formulating a non-object centered conception of what a book is. Second, that the insights from these studies can be usefully combined with a theory of the “conditional” document to develop the model of the kinds of distributed artifacts we encounter on a daily basis in the networked conditions of current practices. Finally, I would suggest that this model provides a different conception of artifacts (books, documents, works of textual or graphic art, one in which reception is production and therefore all materiality is subject to performative engagement within varied, and specific, conditions of encounter.

  10. ELECTRICAL SUPPORT SYSTEM DESCRIPTION DOCUMENT

    Energy Technology Data Exchange (ETDEWEB)

    S. Roy

    2004-06-24

    The purpose of this revision of the System Design Description (SDD) is to establish requirements that drive the design of the electrical support system and their bases to allow the design effort to proceed to License Application. This SDD is a living document that will be revised at strategic points as the design matures over time. This SDD identifies the requirements and describes the system design as they exist at this time, with emphasis on those attributes of the design provided to meet the requirements. This SDD has been developed to be an engineering tool for design control. Accordingly, the primary audience/users are design engineers. This type of SDD both "leads" and "trails" the design process. It leads the design process with regard to the flow down of upper tier requirements onto the system. Knowledge of these requirements is essential in performing the design process. The SDD trails the design with regard to the description of the system. The description provided in the SDD is a reflection of the results of the design process to date. Functional and operational requirements applicable to electrical support systems are obtained from the "Project Functional and Operational Requirements" (F&OR) (Siddoway 2003). Other requirements to support the design process have been taken from higher-level requirements documents such as the "Project Design Criteria Document" (PDC) (Doraswamy 2004), and fire hazards analyses. The above-mentioned low-level documents address "Project Requirements Document" (PRD) (Canon and Leitner 2003) requirements. This SDD contains several appendices that include supporting information. Appendix B lists key system charts, diagrams, drawings, and lists, and Appendix C includes a list of system procedures.

  11. Coaltrans 2003 Vienna. Conference documentation and information

    Energy Technology Data Exchange (ETDEWEB)

    NONE

    2003-07-01

    The sessions of the conference are: overview of the international coal markets; coal's role in changing European market; overview of the shipping markets; coking coal in 2004 and beyond; risk management and hedging; steam coal forecast; and coal in a carbon constrained future. The conference documentation mainly includes the text of overheads/slides of most but not all of the papers. Six papers have been abstracted separately for the Coal Abstracts database. The complete set of papers is available, to conference delegates, on the website www.coaltrans.com.

  12. Orientalist discourse in media texts

    Directory of Open Access Journals (Sweden)

    Necla Mora

    2009-10-01

    Full Text Available By placing itself at the center of the world with a Eurocentric point of view, the West exploits other countries and communities through inflicting cultural change and transformation on them either from within via colonialist movements or from outside via “Orientalist” discourses in line with its imperialist objectives.The West has fictionalized the “image of the Orient” in terms of science by making use of social sciences like anthropology, history and philology and launched an intensive propaganda which covers literature, painting, cinema and other fields of art in order to actualize this fiction. Accordingly, the image of the Orient – which has been built firstly in terms of science then socially – has been engraved into the collective memory of both the Westerner and the Easterner.The internalized “Orientalist” point of view and discourse cause the Westerner to see and perceive the Easterner with the image formed in his/her memory while looking at them. The Easterner represents and expresses himself/herself from the eyes of the Westerner and with the image which the Westerner fictionalized for him/her. Hence, in order to gain acceptance from the West, the East tries to shape itself into the “Orientalist” mold which the Westerner fictionalized for it.Artists, intellectuals, writers and media professionals, who embrace and internalize the stereotypical hegemonic-driven “Orientalist” discourse of the Westerner and who rank among the elite group, reflect their internalized “Orientalist” discourse on their own actions. This condition causes the “Orientalist” clichés to be engraved in the memory of the society; causes the society to view itself with an “Orientalist” point of view and perceive itself with the clichés of the Westerner. Consequently, the second ring of the hegemony is reproduced by the symbolic elites who represent the power/authority within the country.The “Orientalist” discourse, which is

  13. ServCat Document Selection Guidelines

    Data.gov (United States)

    US Fish and Wildlife Service, Department of the Interior — The ServCat document selection guidelines were developed for selecting appropriate documents to upload into ServCat. When beginning to upload documents into ServCat,...

  14. Informative document halogenated hydrocarbon-containing waste

    NARCIS (Netherlands)

    Verhagen H

    1992-01-01

    This "Informative document halogenated hydrocarbon-containing waste" forms part of a series of "Informative documents waste materials". These documents are conducted by RIVM on the instructions of the Directorate General for the Environment, Waste Materials Directorate, in behal

  15. An Effective Concept Extraction Method for Improving Text Classification Performance

    Institute of Scientific and Technical Information of China (English)

    ZHANG Yuntao; GONG Ling; WANG Yongcheng; YIN Zhonghang

    2003-01-01

    This paper presents a new way to extract concepts that can be used to improve text classification performance (precision and recall). The computational measure is divided into two layers. The bottom layer, called the document layer, is concerned with extracting the concepts of a particular document, and the upper layer, called the category layer, is concerned with finding the description and subject concepts of a particular category. The relevant implementation algorithm, which dramatically decreases the search space, is discussed in detail. The experiment based on real-world data collected from InfoBank shows that the approach is superior to the traditional ones.

  16. FaDA: Fast Document Aligner using Word Embedding

    Directory of Open Access Journals (Sweden)

    Lohar Pintu

    2016-10-01

    Full Text Available FaDA is a free/open-source tool for aligning multilingual documents. It employs a novel crosslingual information retrieval (CLIR)-based document-alignment algorithm involving the distances between embedded word vectors in combination with the word overlap between the source-language and the target-language documents. In this approach, we initially construct a pseudo-query from a source-language document. We then represent the target-language documents and the pseudo-query as word vectors to find the average similarity measure between them. This word vector-based similarity measure is then combined with the term overlap-based similarity. Our initial experiments show that a standard Statistical Machine Translation (SMT)-based approach is outperformed by our CLIR-based approach in finding the correct alignment pairs. In addition to this, subsequent experiments with the word vector-based method show further improvements in the performance of the system.
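A minimal sketch of the combined similarity described above: cosine similarity of averaged word vectors mixed with term overlap. The embedding lookup, toy documents and the mixing weight are placeholders, not FaDA's actual configuration.

```python
# Sketch of combining word-vector similarity with term overlap for document alignment.
# `embeddings` is a placeholder word -> vector lookup (e.g. pre-trained crosslingual vectors).
import numpy as np

def avg_vector(tokens, embeddings):
    vecs = [embeddings[t] for t in tokens if t in embeddings]
    if not vecs:
        return np.zeros(len(next(iter(embeddings.values()))))
    return np.mean(vecs, axis=0)

def cosine(a, b):
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

def combined_similarity(src_tokens, tgt_tokens, embeddings, alpha=0.5):
    """alpha * word-vector similarity + (1 - alpha) * term overlap (alpha is an assumption)."""
    vec_sim = cosine(avg_vector(src_tokens, embeddings), avg_vector(tgt_tokens, embeddings))
    overlap = len(set(src_tokens) & set(tgt_tokens)) / max(1, len(set(src_tokens) | set(tgt_tokens)))
    return alpha * vec_sim + (1 - alpha) * overlap

def align(source_docs, target_docs, embeddings):
    """For each source document, return the index of the best-scoring target document."""
    return [max(range(len(target_docs)),
                key=lambda j: combined_similarity(src, target_docs[j], embeddings))
            for src in source_docs]

# Toy bilingual-style embedding table with placeholder vectors.
embeddings = {"cat": np.array([1.0, 0.0]), "chat": np.array([1.0, 0.1]),
              "dog": np.array([0.0, 1.0]), "chien": np.array([0.1, 1.0])}
print(align([["cat"], ["dog"]], [["chien"], ["chat"]], embeddings))   # expected: [1, 0]
```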

  17. PROJECT ENGINEERING DATA MANAGEMENT AT AUTOMATED PREPARATION OF DESIGN DOCUMENTATION

    Directory of Open Access Journals (Sweden)

    A. V. Guryanov

    2017-01-01

    Full Text Available We have developed and implemented tools for automated support of the end-to-end design process for the design documentation of a product at the programming level. The proposed solution is based on the processing of engineering project data contained in interdependent design documents: tactical and technical characteristics of products, data on the valuable metals contained in them, the list of components used in a product, and others. Processing of engineering data is based on their conversion to the form required by industry standards for the preparation of design documentation. The general graph of the design documentation developed for a product is provided, and the developed software product is described. The automated preparation of interdependent design documents is shown on the example of preparing the list of purchased products. The results of this work can be used in research and development activities aimed at creating advanced samples of ADP equipment.

  18. Experiments with result diversity and entity ranking: Text, anchors, links, and Wikipedia

    NARCIS (Netherlands)

    Kaptein, R.; Koolen, M.; Kamps, J.

    2009-01-01

    In this paper, we document our efforts in participating in the TREC 2009 Entity Ranking and Web Tracks. We had multiple aims: For the Web Track’s Adhoc task we experiment with document text and anchor text representation, and the use of the link structure. For the Web Track’s Diversity task we exper

  19. Result diversity and entity ranking experiments: anchors, links, text and Wikipedia

    NARCIS (Netherlands)

    Kaptein, R.; Koolen, M.; Kamps, J.

    2010-01-01

    In this paper, we document our efforts in participating in the TREC 2009 Entity Ranking and Web Tracks. We had multiple aims: For the Web Track’s Adhoc task we experiment with document text and anchor text representation, and the use of the link structure. For the Web Track’s Diversity task we exper

  20. 77 FR 60475 - Draft of SWGDOC Standard Classification of Typewritten Text

    Science.gov (United States)

    2012-10-03

    ... From the Federal Register Online via the Government Publishing Office DEPARTMENT OF JUSTICE Office of Justice Programs Draft of SWGDOC Standard Classification of Typewritten Text AGENCY: National... general public a draft document entitled, "SWGDOC Standard Classification of Typewritten Text"....

  1. The Role of Text Mining in Export Control

    Energy Technology Data Exchange (ETDEWEB)

    Tae, Jae-woong; Son, Choul-woong; Shin, Dong-hoon [Korea Institute of Nuclear Nonproliferation and Control, Daejeon (Korea, Republic of)

    2015-10-15

    The Korean government provides classification services to exporters. It is simple to copy technology such as documents and drawings, and it is also easy for new technology to be derived from existing technology. The diversity of technology makes classification difficult because the boundary between strategic and non-strategic technology is unclear and ambiguous. Reviewers should take previous classification cases sufficiently into account; however, the growing number of classification cases prevents consistent classifications. This makes other innovative and effective approaches necessary. IXCRS (Intelligent Export Control Review System) is proposed to meet these demands. IXCRS consists of an expert system, a semantic searching system, a full-text retrieval system, an image retrieval system and a document retrieval system. It is the aim of the present paper to describe the document retrieval system based on text mining and to discuss how to utilize it. This study has demonstrated how text mining techniques can be applied to export control. The document retrieval system supports reviewers in treating previous classification cases effectively. In particular, it is highly probable that similarity data will contribute to specifying classification criteria. However, an analysis of the system revealed a number of problems that remain to be explored, such as the multilanguage problem and the inclusion-relationship problem. Further research should be directed at solving these problems and at applying more data mining techniques so that the system can serve as one of the useful tools for export control.
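A sketch of the kind of similarity lookup such a document retrieval system could perform over previous classification cases; the example cases and the TF-IDF/cosine choice are illustrative, not details taken from IXCRS.

```python
# Sketch: retrieve previous classification cases most similar to a new export-control request.
# The example cases are invented for illustration.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

previous_cases = [
    "centrifuge rotor manufacturing drawing, maraging steel",
    "general-purpose machine tool operating manual",
    "frequency converter specification for motor drives",
]

vectorizer = TfidfVectorizer()
case_matrix = vectorizer.fit_transform(previous_cases)

def similar_cases(query: str, top_k: int = 3):
    """Rank stored cases by cosine similarity to the query document."""
    scores = cosine_similarity(vectorizer.transform([query]), case_matrix).ravel()
    ranked = scores.argsort()[::-1][:top_k]
    return [(previous_cases[i], round(float(scores[i]), 3)) for i in ranked]

print(similar_cases("drawing of a rotor for a gas centrifuge"))
```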

  2. Linguistically informed digital fingerprints for text

    Science.gov (United States)

    Uzuner, Özlem

    2006-02-01

    Digital fingerprinting, watermarking, and tracking technologies have gained importance in the recent years in response to growing problems such as digital copyright infringement. While fingerprints and watermarks can be generated in many different ways, use of natural language processing for these purposes has so far been limited. Measuring similarity of literary works for automatic copyright infringement detection requires identifying and comparing creative expression of content in documents. In this paper, we present a linguistic approach to automatically fingerprinting novels based on their expression of content. We use natural language processing techniques to generate "expression fingerprints". These fingerprints consist of both syntactic and semantic elements of language, i.e., syntactic and semantic elements of expression. Our experiments indicate that syntactic and semantic elements of expression enable accurate identification of novels and their paraphrases, providing a significant improvement over techniques used in text classification literature for automatic copy recognition. We show that these elements of expression can be used to fingerprint, label, or watermark works; they represent features that are essential to the character of works and that remain fairly consistent in the works even when works are paraphrased. These features can be directly extracted from the contents of the works on demand and can be used to recognize works that would not be correctly identified either in the absence of pre-existing labels or by verbatim-copy detectors.

  3. Mining Causality for Explanation Knowledge from Text

    Institute of Scientific and Technical Information of China (English)

    Chaveevan Pechsiri; Asanee Kawtrakul

    2007-01-01

    Mining causality is essential to provide a diagnosis. This research aims at extracting the causality existing within multiple sentences or EDUs (Elementary Discourse Units). The research emphasizes the use of causality verbs because they make explicit in a certain way the consequent events of a cause, e.g., "Aphids suck the sap from rice leaves. Then leaves will shrink. Later, they will become yellow and dry.". A verb can also be the causal-verb link between cause and effect within EDU(s), e.g., "Aphids suck the sap from rice leaves causing leaves to be shrunk" ("causing" is equivalent to a causal-verb link in Thai). The research confronts two main problems: identifying the interesting causality events from documents and identifying their boundaries. We then propose mining on verbs by using two different machine learning techniques, a Naive Bayes classifier and a Support Vector Machine. The resulting mining rules are used for the identification and extraction of causality over multiple EDUs from text. Our multiple-EDU extraction shows 0.88 precision with 0.75 recall from the Naive Bayes classifier and 0.89 precision with 0.76 recall from the Support Vector Machine.
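A compact sketch of the verb-centred classification step using the two learners named in the abstract (Naive Bayes and a Support Vector Machine); the toy EDU pairs, labels and bag-of-words features are invented for illustration.

```python
# Sketch: classify whether an EDU pair expresses causality, keyed on its verbs.
# Training examples are toy data, not the paper's corpus.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

edu_pairs = [
    "aphids suck sap ; leaves shrink",       # causal
    "leaves become yellow ; leaves dry",     # causal
    "farmer walks field ; sun rises",        # non-causal
    "rice grows ; market opens",             # non-causal
]
labels = [1, 1, 0, 0]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(edu_pairs)

for model in (MultinomialNB(), LinearSVC()):
    model.fit(X, labels)
    test = vectorizer.transform(["insects suck sap ; leaves shrink"])
    print(type(model).__name__, model.predict(test))
```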

  4. Context and Keyword Extraction in Plain Text Using a Graph Representation

    CERN Document Server

    Chahine, Carlo Abi; Kotowicz, Jean-Philippe; Pécuchet, Jean-Pierre

    2009-01-01

    Document indexation is an essential task achieved by archivists or automatic indexing tools. To retrieve documents relevant to a query, the keywords describing each document have to be carefully chosen. Archivists have to find out the right topic of a document before starting to extract the keywords. For an archivist indexing specialized documents, experience plays an important role, but indexing documents on different topics is much harder. This article proposes an innovative method for an indexing support system. The system takes as input an ontology and a plain text document and provides as output contextualized keywords of the document. The method has been evaluated by exploiting Wikipedia's category links as a termino-ontological resource.
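A generic illustration of ranking keywords over a word co-occurrence graph (a TextRank-style stand-in); the article's own method is ontology-driven, so this sketch only shows the graph-representation idea, with the co-occurrence window size as an assumption.

```python
# Generic graph-based keyword ranking over a co-occurrence graph (a TextRank-style
# stand-in, not the ontology-driven method of the article). Requires `networkx`.
import networkx as nx

text = ("document indexation is achieved by archivists or automatic indexing tools ; "
        "keywords describing a document have to be carefully chosen")
tokens = [t for t in text.split() if t.isalpha() and len(t) > 3]

graph = nx.Graph()
window = 3                                  # co-occurrence window size (an assumption)
for i, word in enumerate(tokens):
    for other in tokens[i + 1:i + window]:
        if word != other:
            graph.add_edge(word, other)

scores = nx.pagerank(graph)                 # rank words by graph centrality
keywords = sorted(scores, key=scores.get, reverse=True)[:5]
print(keywords)
```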

  5. Incorporating other texts: Intertextuality in Malaysian CSR reports

    Directory of Open Access Journals (Sweden)

    Kumaran Rajandran

    2016-11-01

    Full Text Available In Malaysia, corporate social responsibility (CSR) is relatively new, but corporations have been required to engage in and disclose their CSR. A typical genre for disclosure is CSR reports, and these reports often refer to other texts. The article investigates the act of referring to other texts, or intertextuality, in Malaysian CSR reports. It creates an archive of CEO Statements and Environment Sections in CSR reports and studies the archive for keywords which can identify the incorporated texts. The function of these texts is examined in relation to Malaysia’s corporate context. CSR reports contain explicit references to documents (policies, regulations, reports, research, standards) and to individuals/groups (CEOs, stakeholders, expert organizations). The incorporated texts display variation in corporate control, which organizes these texts along an intertextual cline. The cline helps to identify corporate and non-corporate sources among the texts. The selection of incorporated texts may reflect government and stock exchange demands. The texts are not standardized and are relevant for the CSR domain and corporations, where they monitor and justify CSR performance. Yet the incorporated texts may perpetuate inexact reporting, because corporations select the texts, and the parts of texts, to refer to. Since these texts have been employed to scrutinize initiatives and results, CSR reports can claim to represent the “truth” about a corporation’s CSR. Hence, intertextuality serves corporate interests.

  6. 13. Project Management Documentation and Communications

    DEFF Research Database (Denmark)

    Kampf, Constance Elizabeth

    2014-01-01

    This chapter discusses the relationship between documentation and communication practices in organizational contexts.

  7. Unsupervised mining of frequent tags for clinical eligibility text indexing.

    Science.gov (United States)

    Miotto, Riccardo; Weng, Chunhua

    2013-12-01

    Clinical text, such as clinical trial eligibility criteria, is largely underused in state-of-the-art medical search engines due to difficulties of accurate parsing. This paper proposes a novel methodology to derive a semantic index for clinical eligibility documents based on a controlled vocabulary of frequent tags, which are automatically mined from the text. We applied this method to eligibility criteria on ClinicalTrials.gov and report that frequent tags (1) define an effective and efficient index of clinical trials and (2) are unlikely to grow radically when the repository increases. We proposed to apply the semantic index to filter clinical trial search results and we concluded that frequent tags reduce the result space more efficiently than an uncontrolled set of UMLS concepts. Overall, unsupervised mining of frequent tags from clinical text leads to an effective semantic index for the clinical eligibility documents and promotes their computational reuse.
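A minimal sketch of mining frequent tags, here simply frequent word n-grams counted by document frequency across eligibility criteria; the sample criteria and the support threshold are illustrative only.

```python
# Sketch: mine frequent n-gram "tags" from clinical-trial eligibility criteria.
# Sample criteria and the minimum-support threshold are illustrative only.
from collections import Counter

criteria = [
    "age 18 years or older with type 2 diabetes",
    "history of myocardial infarction within 6 months",
    "type 2 diabetes treated with metformin",
    "no history of myocardial infarction",
]

def ngrams(text, n):
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

counts = Counter()
for criterion in criteria:
    for n in (1, 2, 3):
        counts.update(set(ngrams(criterion, n)))   # count document frequency, not raw counts

min_support = 2
frequent_tags = {tag: df for tag, df in counts.items() if df >= min_support}
print(sorted(frequent_tags.items(), key=lambda kv: -kv[1])[:10])
```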

  8. A Fuzzy Similarity Based Concept Mining Model for Text Classification

    Directory of Open Access Journals (Sweden)

    Shalini Puri

    2011-11-01

    Full Text Available Text Classification is a challenging and red-hot field in the current scenario and has great importance in text categorization applications. A lot of research work has been done in this field, but there is a need to categorize a collection of text documents into mutually exclusive categories by extracting the concepts or features using a supervised learning paradigm and different classification algorithms. In this paper, a new Fuzzy Similarity Based Concept Mining Model (FSCMM) is proposed to classify a set of text documents into pre-defined Category Groups (CG) by training and preparing them at the sentence, document and integrated corpora levels, along with feature reduction and ambiguity removal at each level, to achieve high system performance. A Fuzzy Feature Category Similarity Analyzer (FFCSA) is used to analyze each extracted feature of the Integrated Corpora Feature Vector (ICFV) with the corresponding categories or classes. The model uses a Support Vector Machine Classifier (SVMC) to classify the training data patterns correctly into two groups, i.e., +1 and −1, thereby producing accurate and correct results. The proposed model works efficiently and effectively with great performance and high-accuracy results.
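A small sketch of the final SVM step that separates documents into the two groups (+1 and −1); plain TF-IDF features stand in for the paper's fuzzy concept mining, and the toy documents and labels are invented.

```python
# Sketch of the SVM step: classify documents into +1 / -1 categories.
# Plain TF-IDF stands in for the paper's fuzzy concept features; data are toy examples.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC

docs = [
    "interest rates and stock markets fell sharply",
    "the central bank raised its policy rate",
    "the striker scored twice in the second half",
    "the team won the championship final",
]
labels = [+1, +1, -1, -1]                   # e.g. finance vs. sport

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)

classifier = SVC(kernel="linear").fit(X, labels)
print(classifier.predict(vectorizer.transform(["the goalkeeper saved a penalty"])))
```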

  9. Relevant XML Documents - Approach Based on Vectors and Weight Calculation of Terms

    Directory of Open Access Journals (Sweden)

    Abdeslem DENNAI

    2016-10-01

    Full Text Available Three classes of documents, based on their data, circulate on the web: unstructured documents (.doc, .html, .pdf, ...), semi-structured documents (.xml, .owl, ...) and structured documents (database tables, for example). A semi-structured document is organized around tags that are predefined or defined by its author. However, many studies classify documents by taking into account only their textual content and underestimate their structure. In this paper we propose a representation of these semi-structured web documents based on weighted vectors, allowing their content to be exploited for further processing. The weight of terms is calculated using the raw frequency for a single document, TF-IDF (Term Frequency - Inverse Document Frequency) and the logical (Boolean) frequency for a set of documents. To assess and demonstrate the relevance of our proposed approach, we carry out several experiments on different corpora.
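A short sketch of the three weighting schemes mentioned (raw term frequency, TF-IDF and Boolean frequency) computed for a toy collection; extraction of terms from the XML structure itself is omitted, and the TF-IDF formula shown is one common variant rather than necessarily the paper's.

```python
# Sketch: raw term frequency, TF-IDF and Boolean weights for a toy document collection.
import math
from collections import Counter

docs = [["xml", "document", "structure", "xml"],
        ["web", "document", "retrieval"],
        ["xml", "retrieval", "weights"]]

def weights(doc, collection):
    tf = Counter(doc)
    n_docs = len(collection)
    out = {}
    for term, freq in tf.items():
        df = sum(1 for d in collection if term in d)      # document frequency
        out[term] = {
            "tf": freq,                                   # raw frequency in this document
            "tf_idf": freq * math.log(n_docs / df),       # TF-IDF (one common variant)
            "boolean": 1,                                 # present / absent
        }
    return out

for term, w in weights(docs[0], docs).items():
    print(term, w)
```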

  10. Documentation requirements for radiation sterilization

    DEFF Research Database (Denmark)

    Miller, A.

    1995-01-01

    Several standards have recently been approved or are under development by the standards organizations ISO and CEN in the field of radiation sterilization. Particularly in Europe, these standards define new requirements on some issues, and on other issues they emphasize the necessary documentation for approval of radiation-sterilized products. The impact of these standards on radiation sterilization is discussed, with special attention given to a few special issues, mainly traceability and uncertainty of measurement results.

  11. Employing Metadata Standards in Electronic Records and Document Management a Path before Archives and Documentation and Information Centers

    Directory of Open Access Journals (Sweden)

    Ali Reza Saadat

    2006-10-01

    Full Text Available Archives and special documentation and information centers within government offices, companies and organizations house collections of paper documents. The rising number of these documents and storage space limitations on the one hand, and the current organizational trend towards e-government on the other, have caused these documents to be increasingly converted into electronic format, with a concomitant change in management and preservation strategy. Electronic Document and Records Management (EDRM) is one such management strategy. The most important management issues are consistency, authority, interface, description and retrieval. These issues emphasize the role of metadata, given their unique capabilities in this respect. The present paper, while introducing the international standards in Electronic Records Management, discusses the common metadata standards drafted, such as e-GMS, AGLS, GILS and DC.

  12. On Tangut Historical Documents Recognition*

    Science.gov (United States)

    Liu, Changqing

    As Tangut studies have made progress, a considerable number of copies of Tangut historical documents have been published. It is of great importance to carry out digitization and domestication of these copies. The paper first performs initial processing of the images by global thresholding, then dissects the photocopies by scanning, and finally adopts a recognition approach based on principal component analysis. The experiment shows that better recognition can be achieved by this calculation without extra time.
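A compact sketch of the pipeline the abstract outlines: global thresholding of glyph images, projection with principal component analysis, and a nearest-neighbour match. The random arrays, threshold and component count are placeholders rather than the paper's settings.

```python
# Sketch: global thresholding + PCA + nearest-neighbour recognition for glyph images.
# Random arrays stand in for real Tangut glyph scans; the threshold is a placeholder.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
train_images = rng.integers(0, 256, size=(50, 32, 32))   # 50 labelled glyph images (toy)
train_labels = rng.integers(0, 5, size=50)                # 5 glyph classes (toy)

def binarize(images, threshold=128):
    """Global thresholding, then flatten each image into a feature vector."""
    return (images < threshold).reshape(len(images), -1).astype(float)

pca = PCA(n_components=10)
train_features = pca.fit_transform(binarize(train_images))

def recognize(image):
    feature = pca.transform(binarize(image[None]))        # project one glyph onto components
    distances = np.linalg.norm(train_features - feature, axis=1)
    return train_labels[int(distances.argmin())]          # nearest training glyph's label

print(recognize(rng.integers(0, 256, size=(32, 32))))
```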

  13. Review Document: Full Software Trigger

    CERN Document Server

    Albrecht, J; Raven, G

    2014-01-01

    This document presents a trigger system for the upgraded LHCb detector, scheduled to begin operation in 2020. This document serves as input for the internal review towards the "DAQ, online and trigger TDR". The proposed trigger system is implemented entirely in software. In this document we show that track reconstruction of a similar quality to that available in the offline algorithms can be performed on the full inelastic $pp$-collision rate, without prior event selections implemented in custom hardware and without relying upon a partial event reconstruction. A track finding efficiency of 98.8 % relative to offline can be achieved for tracks with $p_T >$ 500 MeV/$c$. The CPU time required for this reconstruction is about 40 % of the available budget. Proof-of-principle selections are presented which demonstrate that excellent performance is achievable using an inclusive beauty trigger, in addition to exclusive beauty and charm triggers. Finally, it is shown that exclusive beauty and charm selections that do not intr...

  14. Improving the quality of nursing documentation: An action research project

    Directory of Open Access Journals (Sweden)

    Elisha M. Okaisu

    2014-10-01

    Full Text Available Background: Documentation is an important function of professional nursing practice. In spite of numerous improvement efforts globally, inadequate documentation continues to be reported as nurse authors investigate barriers and challenges. Objectives: The project aimed to improve nurses’ documentation of their patient assessments at the CURE Children’s Hospital of Uganda in order to enhance the quality of nursing practice. Method: An action research methodology, using repeated cycles of planning, intervention, reflection and modification, was used to establish best-practice approaches in this context for improving nurses’ efficacy in documenting assessments in the patient record. The researchers gathered data from chart audits, literature reviews and key informant interviews. Through analysis and critical reflection, these data informed three cycles of systems and practice modifications to improve the quality of documentation. Results: The initial cycle revealed that staff training alone was insufficient to achieve the project goal. To achieve improved documentation, broader changes were necessary, including building a critical mass of competent staff, redesigned orientation and continuing education, documentation form redesign, changes in nurse skill mix, and continuous leadership support. Conclusion: Improving nursing documentation involved complex challenges in this setting and demanded multiple approaches. Evidence-based practice was the foundation of changes in systems required to produce visible improvement in practice. The involved role of leadership in these efforts was very important.

  15. An improved Approach for Document Retrieval Using Suffix Trees

    Directory of Open Access Journals (Sweden)

    N. Sandhya

    2011-09-01

    Full Text Available Huge collections of documents are available within a few mouse clicks. The current World Wide Web is a web of pages. Users have to guess possible keywords that might lead, through search engines, to the pages that contain information of interest, and then browse hundreds or even thousands of the returned pages in order to obtain what they want. In our work we build a generalized suffix tree for our documents and propose a search technique for retrieving documents based on a sort of phrase called word sequences. Our proposed method efficiently searches for a given phrase (with missing or additional words in between) with better performance.
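A simplified stand-in for the generalized suffix tree: a word-level suffix array over all documents, searched by binary search for a phrase prefix. It illustrates phrase lookup over word sequences but is not the paper's data structure.

```python
# Simplified stand-in for a generalized suffix tree: a word-level suffix array
# over all documents, searched by binary search for a phrase prefix.
import bisect

docs = ["the quick brown fox jumps over the lazy dog",
        "a quick brown dog runs over the hill"]

# Build (suffix, doc_id) pairs for every word position in every document.
suffixes = sorted(
    (" ".join(words[i:]), doc_id)
    for doc_id, text in enumerate(docs)
    for words in [text.split()]
    for i in range(len(words))
)
keys = [s for s, _ in suffixes]

def search_phrase(phrase: str):
    """Return the ids of documents containing the phrase as a contiguous word sequence."""
    lo = bisect.bisect_left(keys, phrase)
    hits = set()
    while lo < len(keys) and keys[lo].startswith(phrase):
        hits.add(suffixes[lo][1])
        lo += 1
    return hits

print(search_phrase("quick brown"))   # -> {0, 1}
print(search_phrase("lazy dog"))      # -> {0}
```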

  16. Similarity Based Clustering with Indexing for Semi-Structured Document

    Directory of Open Access Journals (Sweden)

    S. Palanisamy

    2012-01-01

    Full Text Available Problem statement: To improve the performance of data retrieval in a large homogeneous XML document. Approach: Clustering of XML elements based on their content, combined with indexing. The element used for clustering is identified from the document and/or the XML schema and serves as the parameter for clustering; a suitable index is created after clustering. Results: The clustering combined with the indexing strategy supports efficient retrieval of XML elements from the document. Conclusion: The proposed method improves the efficiency of XML data manipulation and gives comparatively better performance than clustering or indexing alone.
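A minimal sketch of the idea: group the records of a homogeneous XML document by a chosen element and keep an index over the groups so that retrieval touches only one cluster. The element names and the grouping key are assumptions, not a real schema.

```python
# Sketch: group records of a homogeneous XML document by a chosen element ("category")
# and index the groups for retrieval. Element names are assumptions, not a real schema.
import xml.etree.ElementTree as ET
from collections import defaultdict

xml_data = """<library>
  <book><category>databases</category><title>XML indexing in practice</title></book>
  <book><category>retrieval</category><title>Query evaluation over XML</title></book>
  <book><category>databases</category><title>Semi-structured data stores</title></book>
</library>"""

root = ET.fromstring(xml_data)

index = defaultdict(list)                       # cluster key -> list of matching elements
for book in root.findall("book"):
    key = book.findtext("category")             # the element chosen as clustering parameter
    index[key].append(book.findtext("title"))

print(index["databases"])                       # retrieval touches only one cluster
```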

  17. Optimization of the Document Placement in the RFID Cabinet

    Directory of Open Access Journals (Sweden)

    Kiedrowicz Maciej

    2016-01-01

    Full Text Available The study is devoted to the issue of optimizing the document placement in a single RFID cabinet. It has been assumed that the optimization problem means the reduction of the archiving time for the information on all documents with RFID tags. Since the explicit form of the criterion function remains unknown, the regression analysis method has been used for its approximation. The method uses data from a computer simulation of the process of archiving data about documents. To solve the optimization problem, the modified gradient projection method has been used.
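A toy sketch of the two ingredients named in the abstract: fitting a regression surrogate to simulated archiving times and minimizing it with projected gradient steps onto box constraints. The simulated objective, bounds and step size are invented for illustration.

```python
# Toy sketch: regression surrogate of archiving time + projected gradient descent.
# The simulated objective, box constraints and step size are illustrative only.
import numpy as np

rng = np.random.default_rng(1)

def simulated_archiving_time(x):
    """Stand-in for the simulation: x = (placement parameter 1, placement parameter 2)."""
    return (x[0] - 0.3) ** 2 + 2 * (x[1] - 0.7) ** 2 + 0.01 * rng.normal()

# Fit a quadratic regression surrogate from simulated samples.
samples = rng.uniform(0, 1, size=(200, 2))
times = np.array([simulated_archiving_time(x) for x in samples])
features = np.column_stack([np.ones(len(samples)), samples, samples ** 2])
coef, *_ = np.linalg.lstsq(features, times, rcond=None)

def surrogate_grad(x):
    # Gradient of c0 + c1*x1 + c2*x2 + c3*x1^2 + c4*x2^2
    return np.array([coef[1] + 2 * coef[3] * x[0], coef[2] + 2 * coef[4] * x[1]])

x = np.array([0.9, 0.1])
for _ in range(100):
    x = np.clip(x - 0.1 * surrogate_grad(x), 0.0, 1.0)   # gradient step + projection onto [0,1]^2

print(np.round(x, 2))   # should approach (0.3, 0.7) for this toy objective
```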

  18. Software System for Vocal Rendering of Printed Documents

    Directory of Open Access Journals (Sweden)

    Marian DARDALA

    2008-01-01

    Full Text Available The objective of this paper is to present a software system architecture developed to render printed documents in vocal form. The paper also describes the software solutions that exist as software components and are necessary for document processing, as well as for controlling the multimedia devices used by the system. The system is useful for people with visual disabilities, who can access the contents of documents without the documents having to be printed in the Braille system or to already exist in audio form.
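A minimal sketch of the vocal-rendering step using an off-the-shelf text-to-speech component (pyttsx3 here is an assumption; the paper does not name its components). In the described architecture, the text would come from the document-processing (OCR) stage rather than a literal string.

```python
# Minimal vocal rendering sketch using pyttsx3 (an assumed off-the-shelf TTS component).
# In the described system, `document_text` would come from OCR of the printed document.
import pyttsx3

document_text = "Chapter one. This is the recognized content of the printed page."

engine = pyttsx3.init()
engine.setProperty("rate", 160)      # speaking rate in words per minute
engine.say(document_text)            # queue the text for speech
engine.runAndWait()                  # block until the utterance has been rendered
```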

  19. Extended Approach to Water Flow Algorithm for Text Line Segmentation

    Institute of Scientific and Technical Information of China (English)

    Darko Brodi(c)

    2012-01-01

    This paper proposes a new approach to the water flow algorithm for text line segmentation. In the basic method, hypothetical water flows under a few specified angles, defined by the water flow angle parameter. It is applied to the document image frame from left to right and vice versa. As a result, unwetted and wetted areas are established. These areas separate text from non-text elements in each text line, and hence represent the control areas that are of major importance for text line segmentation. The extended approach primarily means extraction of the connected components by bounding boxes over the text. In this way, each connected component is mutually separated, and the water flow angle, which defines the unwetted areas, is determined adaptively. By choosing an appropriate water flow angle, the unwetted areas are lengthened, which leads to better text line segmentation. Results of this approach are encouraging due to the improvement in text line segmentation, which is the most challenging step in document image processing.
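A short sketch of the extension's first step, extracting connected components and their bounding boxes from a binarized document image; scipy.ndimage is used as a stand-in implementation, and the water-flow step itself is not shown.

```python
# Sketch: connected components and bounding boxes over a binarized document image
# (the pre-processing step of the extended water-flow approach; scipy is a stand-in).
import numpy as np
from scipy import ndimage

image = np.zeros((20, 60), dtype=np.uint8)
image[3:8, 5:20] = 1      # a toy "word" of ink pixels
image[3:8, 30:50] = 1     # another toy word on the same line

labels, n_components = ndimage.label(image)          # default 4-connectivity labelling
boxes = ndimage.find_objects(labels)                 # one slice pair per component

for i, (rows, cols) in enumerate(boxes, start=1):
    print(f"component {i}: rows {rows.start}-{rows.stop - 1}, cols {cols.start}-{cols.stop - 1}")
```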

  20. Comparing document classification schemes using k-means clustering

    OpenAIRE

    Šivić, Artur; Žmak, Lovro; Dalbelo Bašić, Bojana; Moens, Marie-Francine

    2008-01-01

    In this work, we jointly apply several text mining methods to a corpus of legal documents in order to compare the separation quality of two inherently different document classification schemes. The classification schemes are compared with the clusters produced by the k-means algorithm. In the future, we believe that our comparison method will be coupled with semi-supervised and active learning techniques. Also, this paper presents the idea of combining k-means and Principal Component Analysis...
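A compact sketch of the comparison described: cluster TF-IDF vectors with k-means and score each classification scheme's labels against the resulting clusters. The toy corpus and labels are invented, and the adjusted Rand index is used here as one reasonable agreement measure rather than necessarily the authors' metric.

```python
# Sketch: compare two document labelings against k-means clusters of TF-IDF vectors.
# Toy corpus and labels; adjusted Rand index is one reasonable agreement measure.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

docs = [
    "contract law obligations of the seller",
    "breach of contract and damages",
    "criminal procedure and evidence rules",
    "rules of evidence in criminal trials",
]
scheme_a = [0, 0, 1, 1]     # labels from classification scheme A
scheme_b = [0, 1, 1, 1]     # labels from classification scheme B

X = TfidfVectorizer().fit_transform(docs)
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

for name, labels in (("scheme A", scheme_a), ("scheme B", scheme_b)):
    print(name, "agreement with k-means:", round(adjusted_rand_score(labels, clusters), 3))
```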