WorldWideScience

Sample records for accurate automatic speech

  1. Techniques for automatic speech recognition

    Moore, R. K.

    1983-05-01

    A brief insight is provided into some of the algorithms that lie behind current automatic speech recognition systems. Early phonetically based approaches were not particularly successful, due mainly to a lack of appreciation of the problems involved. These problems are summarized, and various recognition techniques are reviewed in the context of the solutions that they provide. It is pointed out that the majority of currently available speech recognition equipment employs a "whole-word" pattern matching approach which, although relatively simple, has proved particularly successful in its ability to recognize speech. The concept of time normalization plays a central role in this type of recognition process, and a family of such algorithms is described in detail. The technique of dynamic time warping is not only capable of providing good performance for isolated word recognition, but can also be extended to the recognition of connected speech (thereby removing one of the most severe limitations of early speech recognition equipment).
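
    The whole-word pattern matching idea described in this abstract can be illustrated with a minimal dynamic time warping sketch. This is not taken from the paper: the one-dimensional features, the absolute-difference local cost, the three-way step pattern, and the names `dtw_distance` and `recognize` are all assumptions chosen for illustration.

```python
import math


def dtw_distance(a, b):
    """Cumulative DTW cost between two sequences of scalar features."""
    n, m = len(a), len(b)
    # cost[i][j] is the cheapest cumulative cost of aligning a[:i] with b[:j].
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])  # assumed local distance
            cost[i][j] = local + min(
                cost[i - 1][j],      # stretch b (stay on the same b frame)
                cost[i][j - 1],      # stretch a (stay on the same a frame)
                cost[i - 1][j - 1],  # advance both sequences
            )
    return cost[n][m]


def recognize(utterance, templates):
    """Return the word whose stored template is closest under DTW."""
    return min(templates, key=lambda word: dtw_distance(utterance, templates[word]))


# Hypothetical templates: one short feature track per vocabulary word.
templates = {"yes": [1, 3, 1], "no": [5, 5, 5]}
print(recognize([1, 1, 3, 3, 1], templates))
```

    Because the warping path may repeat frames of either sequence, a slowly spoken utterance can still match its template exactly, which is the time-normalizing property the abstract refers to.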

  2. Automatic Speech Segmentation Based on HMM

    M. Kroul

    2007-01-01

    This contribution deals with the problem of automatic phoneme segmentation using HMMs. Automating the speech segmentation task is important for applications where a large amount of data must be processed, so manual segmentation is out of the question. In this paper we focus on automatic segmentation of recordings, which will be used for triphone synthesis unit database creation. For speech synthesis, the speech unit quality is a crucial aspect, so the maximal accuracy in segmentation is ...

  3. Dynamic Automatic Noisy Speech Recognition System (DANSR)

    Paul, Sheuli

    2014-01-01

    In this thesis we studied a very common and long-standing noise problem and provided a solution to it. The task is to deal with different types of noise that occur simultaneously, which we call hybrid noise. Although there are individual solutions for specific types, one cannot simply combine them because each solution affects the whole speech. We developed an automatic speech recognition system DANSR (Dynamic Automatic Noisy Speech Recognition System) for hybri...

  4. Automatic speech recognition: a deep learning approach

    Yu, Dong

    2015-01-01

    This book summarizes the recent advancement in the field of automatic speech recognition with a focus on discriminative and hierarchical models. This will be the first automatic speech recognition book to include a comprehensive coverage of recent developments such as conditional random field and deep learning techniques. It presents insights and theoretical foundation of a series of recent models such as conditional random field, semi-Markov and hidden conditional random field, deep neural network, deep belief network, and deep stacking models for sequential learning. It also discusses practical considerations of using these models in both acoustic and language modeling for continuous speech recognition.

  5. Personality in speech assessment and automatic classification

    Polzehl, Tim

    2015-01-01

    This work combines interdisciplinary knowledge and experience from research fields of psychology, linguistics, audio-processing, machine learning, and computer science. The work systematically explores a novel research topic devoted to automated modeling of personality expression from speech. For this aim, it introduces a novel personality assessment questionnaire and presents the results of extensive labeling sessions to annotate the speech data with personality assessments. It provides estimates of the Big 5 personality traits, i.e. openness, conscientiousness, extroversion, agreeableness, and neuroticism. Based on a database built on the questionnaire, the book presents models to tell apart different personality types or classes from speech automatically.

  6. Development of a System for Automatic Recognition of Speech

    Michal Kuba

    2003-01-01

    The article gives a review of research on processing and automatic recognition of speech signals (ARR) at the Department of Telecommunications of the Faculty of Electrical Engineering, University of Zilina. On-going research is oriented to speech parametrization using 2-dimensional cepstral analysis, and to an application of HMMs and neural networks for speech recognition in the Slovak language. The article summarizes achieved results and outlines the future orientation of our research in automatic speech recognition.

  7. Automatic Phonetic Transcription for Danish Speech Recognition

    Kirkedal, Andreas Søeborg

    to acquire and expensive to create. For languages with productive compounding or agglutinative languages like German and Finnish, respectively, phonetic dictionaries are also hard to maintain. For this reason, automatic phonetic transcription tools have been produced for many languages. The quality ... of automatic phonetic transcriptions vary greatly with respect to language and transcription strategy. For some languages where the difference between the graphemic and phonetic representations are small, graphemic transcriptions can be used to create ASR systems with acceptable performance. In other languages ..., syllabication, stød and several other suprasegmental features (Kirkedal, 2013). Simplifying the transcriptions by filtering out the symbols for suprasegmental features in a post-processing step produces a format that is suitable for ASR purposes. eSpeak is an open source speech synthesizer originally created...

  8. A New Structure for Automatic Speech Recognition

    Duchnowski, Paul

    Speech is a wideband signal with cues identifying a particular element distributed across frequency. To capture these cues, most ASR systems analyze the speech signal into spectral (or spectrally-derived) components prior to recognition. Traditionally, these components are integrated across frequency to form a vector of "acoustic evidence" on which a decision by the ASR system is based. This thesis develops an alternate approach, post-labeling integration. In this scheme, tentative decisions, or labels, of the identity of a given speech element are assigned in parallel by sub-recognizers, each operating on a band-limited portion of the speech waveform. Outputs of these independent channels are subsequently combined (integrated) to render the final decision. Remarkably good recognition of band-limited nonsense syllables by humans leads to the consideration of this method. It also allows potentially more accurate parameterization of the speech waveform and simultaneously robust estimation of parameter probabilities. The algorithm also represents an attempt to make explicit use of redundancies in speech. Three basic methods of parameterizing the band-limited input of the sub-recognizers were considered, focusing respectively on LPC coefficients, cepstrum coefficients, and parameters based on the autocorrelation function. Four sub-recognizers were implemented as discrete Hidden Markov Model (HMM) systems. A Maximum A Posteriori (MAP) hypothesis testing approach was applied to the problem of integrating the individual sub-recognizer decisions on a frame-by-frame basis. Final segmentation was achieved by a secondary HMM. Five methods of estimating the probabilities necessary for MAP integration were tested. The proposed structure was applied to the task of phonetic, speaker-independent, continuous speech recognition. Performance for several combinations of parameterization schemes and integration methods was measured. The best score of 58.5% on a 39 phone alphabet is roughly
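
    The frame-by-frame MAP integration of sub-recognizer labels might be sketched as follows. The abstract does not say how the required probabilities were estimated or combined, so this sketch assumes the sub-recognizers are conditionally independent given the true class and that per-band confusion probabilities P(label | class) are already available. The function name `map_integrate`, the probability floor, and the two-band toy numbers are hypothetical.

```python
import math


def map_integrate(labels, priors, confusions, floor=1e-12):
    """Pick the class maximizing P(class) * prod_k P(label_k | class).

    labels:     tentative label from each sub-recognizer for one frame
    priors:     dict mapping class -> prior probability
    confusions: one dict per sub-recognizer, class -> {label: P(label | class)}
    floor:      assumed lower bound that avoids log(0) for unseen labels
    """
    best_class, best_score = None, -math.inf
    for cls, prior in priors.items():
        # Work in the log domain so that many small factors do not underflow.
        score = math.log(max(prior, floor))
        for band, label in enumerate(labels):
            p = confusions[band][cls].get(label, 0.0)
            score += math.log(max(p, floor))
        if score > best_score:
            best_class, best_score = cls, score
    return best_class


# Toy example with two band-limited sub-recognizers and two phone classes.
priors = {"a": 0.5, "b": 0.5}
confusions = [
    {"a": {"a": 0.9, "b": 0.1}, "b": {"a": 0.4, "b": 0.6}},  # band 0
    {"a": {"a": 0.6, "b": 0.4}, "b": {"a": 0.2, "b": 0.8}},  # band 1
]
print(map_integrate(["a", "b"], priors, confusions))
```

    In the toy example the two bands disagree, and the decision goes to the class whose confusion statistics best explain both labels rather than to a simple vote.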

  9. Punjabi Automatic Speech Recognition Using HTK

    Mohit Dua

    2012-07-01

    This paper aims to discuss the implementation of an isolated word Automatic Speech Recognition (ASR) system for an Indian regional language, Punjabi. The HTK toolkit, based on the Hidden Markov Model (HMM), a statistical approach, is used to develop the system. Initially the system is trained for 115 distinct Punjabi words by collecting data from eight speakers and then is tested by using samples from six speakers in real time environments. To make the system more interactive and fast, a GUI has been developed using the Java platform for implementing the testing module. The paper also describes the role of each HTK tool, used in various phases of system development, by presenting a detailed architecture of an ASR system developed using HTK library modules and tools. The experimental results show that the overall system performance is 95.63% and 94.08%.

  10. Confidence Measures for Automatic and Interactive Speech Recognition

    Sánchez Cortina, Isaías

    2016-01-01

    This thesis work contributes to the field of Automatic Speech Recognition (ASR), and particularly to Interactive Speech Transcription (IST) and Confidence Measures (CM) for ASR. The main goals of this thesis work can be summarised as follows: 1. To design IST methods and tools to tackle the problem of improving automatically generated transcripts. 2. To assess the designed IST methods and tools on real-life tasks of transcription in large educational repositories of vide...

  11. Disordered Speech Assessment Using Automatic Methods Based on Quantitative Measures

    Shrivastav Rahul

    2005-01-01

    Speech quality assessment methods are necessary for evaluating and documenting treatment outcomes of patients suffering from degraded speech due to Parkinson's disease, stroke, or other disease processes. Subjective methods of speech quality assessment are more accurate and more robust than objective methods but are time-consuming and costly. We propose a novel objective measure of speech quality assessment that builds on traditional speech processing techniques such as dynamic time warping (DTW) and the Itakura-Saito (IS) distortion measure. Initial results show that our objective measure correlates well with the more expensive subjective methods.

  12. Automatic speech recognition for radiological reporting

    Large vocabulary speech recognition, together with its techniques and its software and hardware technology, is being developed with the aim of providing the office user with a tool that could significantly improve both the quantity and quality of his work: the dictation machine, which allows memos and documents to be input using voice and a microphone instead of fingers and a keyboard. The IBM Rome Science Center, together with the IBM Research Division, has built a prototype recognizer that accepts sentences in natural language from a 20,000-word Italian vocabulary. The unit runs on a personal computer equipped with special hardware that supplies all the necessary computing power. The first laboratory experiments yielded very interesting results and pointed out system characteristics that make its use possible in operational environments. To this purpose, the dictation of medical reports was considered a suitable application. In cooperation with the 2nd Radiology Department of S. Maria della Misericordia Hospital (Udine, Italy), the system was tried out by radiology department doctors during their everyday work. The doctors were able to dictate their reports directly to the unit. The text appeared immediately on the screen, and any errors could be corrected either by voice or by using the keyboard. At the end of report dictation, the doctors could both print and archive the text. The report could also be forwarded to the hospital information system, when the latter was available. Our results have been very encouraging: the system proved to be robust, simple to use, and accurate (over 95% average recognition rate). The experiment was valuable for the suggestions and comments it produced, and its results are useful for system evolution towards improved system management and efficiency.

  13. Modelling Errors in Automatic Speech Recognition for Dysarthric Speakers

    Caballero Morales, Santiago Omar; Cox, Stephen J.

    2009-12-01

    Dysarthria is a motor speech disorder characterized by weakness, paralysis, or poor coordination of the muscles responsible for speech. Although automatic speech recognition (ASR) systems have been developed for disordered speech, factors such as low intelligibility and limited phonemic repertoire decrease speech recognition accuracy, making conventional speaker adaptation algorithms perform poorly on dysarthric speakers. In this work, rather than adapting the acoustic models, we model the errors made by the speaker and attempt to correct them. For this task, two techniques have been developed: (1) a set of "metamodels" that incorporate a model of the speaker's phonetic confusion matrix into the ASR process; (2) a cascade of weighted finite-state transducers at the confusion matrix, word, and language levels. Both techniques attempt to correct the errors made at the phonetic level and make use of a language model to find the best estimate of the correct word sequence. Our experiments show that both techniques outperform standard adaptation techniques.

  14. Automatic Speech Recognition for Human-Machine Interaction

    Biundo, Giuseppina; Grassi Pauletti, Sara; Ansorge, Michael; Farine, Pierre-André

    2005-01-01

    Since the sixties, movies such as “2001: A Space Odyssey” have familiarized us with the idea of computers that can speak and hear just as a human being does. Automatic speech recognition (ASR) is the technology that allows machines to interpret human speech (i.e. to answer the question: What is being said?). The machine “speaks back”, either by playing pre-recorded messages or by using text-to-speech (TTS) technology.

  15. Automatic speech signal segmentation based on the innovation adaptive filter

    Makowski Ryszard

    2014-06-01

    Speech segmentation is an essential stage in designing automatic speech recognition systems and one can find several algorithms proposed in the literature. It is a difficult problem, as speech is immensely variable. The aim of the authors’ studies was to design an algorithm that could be employed at the stage of automatic speech recognition. This would make it possible to avoid some problems related to speech signal parametrization. Posing the problem in such a way requires the algorithm to be capable of working in real time. The only such algorithm was proposed by Tyagi et al. (2006), and it is a modified version of Brandt’s algorithm. The article presents a new algorithm for unsupervised automatic speech signal segmentation. It performs segmentation without access to information about the phonetic content of the utterances, relying exclusively on second-order statistics of a speech signal. The starting point for the proposed method is time-varying Schur coefficients of an innovation adaptive filter. The Schur algorithm is known to be fast, precise, stable and capable of rapidly tracking changes in second-order signal statistics. A transition from one phoneme to another in the speech signal always indicates a change in signal statistics caused by vocal tract changes. In order to allow for the properties of human hearing, detection of inter-phoneme boundaries is performed based on statistics defined on the mel spectrum determined from the reflection coefficients. The paper presents the structure of the algorithm, defines its properties, lists parameter values, describes detection efficiency results, and compares them with those for another algorithm. The obtained segmentation results are satisfactory.

  16. Automatic discrimination between laughter and speech

    Truong, K.; Leeuwen, D. van

    2007-01-01

    Emotions can be recognized by audible paralinguistic cues in speech. By detecting these paralinguistic cues that can consist of laughter, a trembling voice, coughs, changes in the intonation contour etc., information about the speaker’s state and emotion can be revealed. This paper describes the dev

  17. Speaker-Machine Interaction in Automatic Speech Recognition. Technical Report.

    Makhoul, John I.

    The feasibility and limitations of speaker adaptation in improving the performance of a "fixed" (speaker-independent) automatic speech recognition system were examined. A fixed vocabulary of 55 syllables is used in the recognition system which contains 11 stops and fricatives and five tense vowels. The results of an experiment on speaker…

  18. Automatic Emotion Recognition in Speech: Possibilities and Significance

    Milana Bojanić

    2009-12-01

    Automatic speech recognition and spoken language understanding are crucial steps towards natural human-machine interaction. The main task of the speech communication process is the recognition of the word sequence, but the recognition of prosody, emotion and stress tags may be of particular importance as well. This paper discusses the possibilities of recognizing emotion from the speech signal in order to improve ASR, and also provides an analysis of acoustic features that can be used for the detection of the speaker’s emotion and stress. The paper also provides a short overview of emotion and stress classification techniques. The importance and place of emotional speech recognition is shown in the domain of human-computer interactive systems and the transaction communication model. The directions for future work are given at the end of this work.

  19. On Automatic Voice Casting for Expressive Speech: Speaker Recognition vs. Speech Classification

    Obin, Nicolas; Roebel, Axel; Bachman, Grégoire

    2014-01-01

    This paper presents the first large-scale automatic voice casting system, and explores the adaptation of speaker recognition techniques to measure voice similarities. The proposed system is based on the representation of a voice by classes (e.g., age/gender, voice quality, emotion). First, a multi-label system is used to classify speech into classes. Then, the output probabilities for each class are concatenated to form a vector that represents the vocal signature of a speech recording. Final...

  20. Mixed Bayesian Networks with Auxiliary Variables for Automatic Speech Recognition

    Stephenson, Todd Andrew; Magimai.-Doss, Mathew; Bourlard, Hervé

    2001-01-01

    Standard hidden Markov models (HMMs), as used in automatic speech recognition (ASR), calculate their emission probabilities by an artificial neural network (ANN) or a Gaussian distribution conditioned on the hidden state variable, considering the emissions independent of any other variable in the model. Recent work showed the benefit of conditioning the emission distributions on a discrete auxiliary variable, which is observed in training and hidden in recognition. Related work has shown the ...

  1. Automatic Speech Signal Analysis for Clinical Diagnosis and Assessment of Speech Disorders

    Baghai-Ravary, Ladan

    2013-01-01

    Automatic Speech Signal Analysis for Clinical Diagnosis and Assessment of Speech Disorders provides a survey of methods designed to aid clinicians in the diagnosis and monitoring of speech disorders such as dysarthria and dyspraxia, with an emphasis on the signal processing techniques, statistical validity of the results presented in the literature, and the appropriateness of methods that do not require specialized equipment, rigorously controlled recording procedures or highly skilled personnel to interpret results. Such techniques offer the promise of a simple and cost-effective, yet objective, assessment of a range of medical conditions, which would be of great value to clinicians. The ideal scenario would begin with the collection of examples of the clients’ speech, either over the phone or using portable recording devices operated by non-specialist nursing staff. The recordings could then be analyzed initially to aid diagnosis of conditions, and subsequently to monitor the clients’ progress and res...

  2. The benefit obtained from visually displayed text from an automatic speech recognizer during listening to speech presented in noise

    Zekveld, A.A.; Kramer, S.E.; Kessens, J.M.; Vlaming, M.S.M.G.; Houtgast, T.

    2008-01-01

    OBJECTIVES: The aim of this study was to evaluate the benefit that listeners obtain from visually presented output from an automatic speech recognition (ASR) system during listening to speech in noise. DESIGN: Auditory-alone and audiovisual speech reception thresholds (SRTs) were measured. The SRT i

  3. Modelling Errors in Automatic Speech Recognition for Dysarthric Speakers

    Santiago Omar Caballero Morales

    2009-01-01

    Dysarthria is a motor speech disorder characterized by weakness, paralysis, or poor coordination of the muscles responsible for speech. Although automatic speech recognition (ASR) systems have been developed for disordered speech, factors such as low intelligibility and limited phonemic repertoire decrease speech recognition accuracy, making conventional speaker adaptation algorithms perform poorly on dysarthric speakers. In this work, rather than adapting the acoustic models, we model the errors made by the speaker and attempt to correct them. For this task, two techniques have been developed: (1) a set of “metamodels” that incorporate a model of the speaker's phonetic confusion matrix into the ASR process; (2) a cascade of weighted finite-state transducers at the confusion matrix, word, and language levels. Both techniques attempt to correct the errors made at the phonetic level and make use of a language model to find the best estimate of the correct word sequence. Our experiments show that both techniques outperform standard adaptation techniques.

  4. Automatic audiovisual integration in speech perception.

    Gentilucci, Maurizio; Cattaneo, Luigi

    2005-11-01

    Two experiments aimed to determine whether features of both the visual and acoustical inputs are always merged into the perceived representation of speech and whether this audiovisual integration is based on either cross-modal binding functions or on imitation. In a McGurk paradigm, observers were required to repeat aloud a string of phonemes uttered by an actor (acoustical presentation of phonemic string) whose mouth, in contrast, mimicked pronunciation of a different string (visual presentation). In a control experiment participants read the same printed strings of letters. This condition aimed to analyze the pattern of voice and the lip kinematics controlling for imitation. In the control experiment and in the congruent audiovisual presentation, i.e. when the articulation mouth gestures were congruent with the emission of the string of phones, the voice spectrum and the lip kinematics varied according to the pronounced strings of phonemes. In the McGurk paradigm the participants were unaware of the incongruence between visual and acoustical stimuli. The acoustical analysis of the participants' spoken responses showed three distinct patterns: the fusion of the two stimuli (the McGurk effect), repetition of the acoustically presented string of phonemes, and, less frequently, of the string of phonemes corresponding to the mouth gestures mimicked by the actor. However, the analysis of the latter two responses showed that the formant 2 of the participants' voice spectra always differed from the value recorded in the congruent audiovisual presentation. It approached the value of the formant 2 of the string of phonemes presented in the other modality, which was apparently ignored. The lip kinematics of the participants repeating the string of phonemes acoustically presented were influenced by the observation of the lip movements mimicked by the actor, but only when pronouncing a labial consonant. 
    The data are discussed in favor of the hypothesis that features of both

  5. Speech Acquisition and Automatic Speech Recognition for Integrated Spacesuit Audio Systems

    Huang, Yiteng; Chen, Jingdong; Chen, Shaoyan

    2010-01-01

    A voice-command human-machine interface system has been developed for spacesuit extravehicular activity (EVA) missions. A multichannel acoustic signal processing method has been created for distant speech acquisition in noisy and reverberant environments. This technology reduces noise by exploiting differences in the statistical nature of signal (i.e., speech) and noise that exist in the spatial and temporal domains. As a result, the automatic speech recognition (ASR) accuracy can be improved to the level at which crewmembers would find the speech interface useful. The developed speech human/machine interface will enable both crewmember usability and operational efficiency. It offers a fast rate of data/text entry and a small overall size, and can be lightweight. In addition, this design will free the hands and eyes of a suited crewmember. The system components and steps include beam forming/multi-channel noise reduction, single-channel noise reduction, speech feature extraction, feature transformation and normalization, feature compression, model adaptation, ASR HMM (Hidden Markov Model) training, and ASR decoding. A state-of-the-art phoneme recognizer can obtain an accuracy rate of 65 percent when the training and testing data are free of noise. When it is used in spacesuits, the rate drops to about 33 percent. With the developed microphone array speech-processing technologies, the performance is improved and the phoneme recognition accuracy rate rises to 44 percent. The recognizer can be further improved by combining the microphone array and HMM model adaptation techniques and using speech samples collected from inside spacesuits. In addition, arithmetic complexity models for the major HMM-based ASR components were developed. They can help real-time ASR system designers select proper tasks when faced with constraints on computational resources.

  6. Studies on inter-speaker variability in speech and its application in automatic speech recognition

    S Umesh

    2011-10-01

    In this paper, we give an overview of the problem of inter-speaker variability and its study in many diverse areas of speech signal processing. We first give an overview of vowel-normalization studies that minimize variations in the acoustic representation of vowel realizations by different speakers. We then describe the universal-warping approach to speaker normalization which unifies many of the vowel normalization approaches and also shows the relation between speech production, perception and auditory processing. We then address the problem of inter-speaker variability in automatic speech recognition (ASR) and describe techniques that are used to reduce these effects and thereby improve the performance of speaker-independent ASR systems.

  7. Robust Automatic Speech Recognition in Impulsive Noise Environment

    DING Pei; CAO Zhigang

    2005-01-01

    This paper presents an efficient method to directly suppress the effect of impulsive noise for robust automatic speech recognition (ASR). In this method, according to the noise sensitivity of each feature dimension, the observation vectors are divided into several parts, each of which is assigned a proper threshold. In the recognition stage, the unreliable probability preponderance of an incorrect competing path caused by impulsive noise is eliminated by flooring the observation probability (FOP) of each feature sub-vector at the Gaussian mixture level, so that the correct path recovers the priority of being chosen in decoding. Experimental results also demonstrate that the proposed method can significantly improve the recognition accuracy in both machine-gun noise and simulated impulsive noise environments, while maintaining high performance for clean speech recognition.
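
    The flooring idea can be sketched for a single diagonal-covariance Gaussian component. This is a simplification: the abstract applies flooring at the Gaussian mixture level and does not give the partition or threshold values, so the sub-vector boundaries, the log-domain floors, and the function names below are assumptions for illustration only.

```python
import math


def diag_gauss_loglik(x, mean, var):
    """Log-likelihood of x under a diagonal-covariance Gaussian."""
    return sum(
        -0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )


def floored_loglik(x, mean, var, parts, floors):
    """Sum per-sub-vector log-likelihoods, each clipped from below.

    parts:  list of (start, stop) index pairs splitting the feature vector
    floors: one log-likelihood floor per part (assumed, hand-set values)
    """
    total = 0.0
    for (start, stop), floor in zip(parts, floors):
        ll = diag_gauss_loglik(x[start:stop], mean[start:stop], var[start:stop])
        # A sub-vector hit by an impulse cannot drag the score below its floor.
        total += max(ll, floor)
    return total


mean, var = [0.0] * 4, [1.0] * 4
parts, floors = [(0, 2), (2, 4)], [-10.0, -10.0]
clean = [0.0, 0.0, 0.0, 0.0]
impulsive = [0.0, 0.0, 50.0, 50.0]  # the second sub-vector is corrupted
print(floored_loglik(clean, mean, var, parts, floors))
print(floored_loglik(impulsive, mean, var, parts, floors))
```

    With the floor in place, the corrupted frame costs the correct model at most a bounded penalty, so a competing path cannot win purely because of one impulsive outlier.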

  8. Computer-based automatic finger- and speech-tracking system.

    Breidegard, Björn

    2007-11-01

    This article presents the first technology ever for online registration and interactive and automatic analysis of finger movements during tactile reading (Braille and tactile pictures). Interactive software has been developed for registration (with two cameras and a microphone), MPEG-2 video compression and storage on disk or DVD as well as an interactive analysis program to aid human analysis. An automatic finger-tracking system has been implemented which also semiautomatically tracks the reading aloud speech on the syllable level. This set of tools opens the way for large scale studies of blind people reading Braille or tactile images. It has been tested in a pilot project involving congenitally blind subjects reading texts and pictures. PMID:18183897

  9. Speech recognition for embedded automatic positioner for laparoscope

    Chen, Xiaodong; Yin, Qingyun; Wang, Yi; Yu, Daoyin

    2014-07-01

    In this paper a novel speech recognition methodology based on the Hidden Markov Model (HMM) is proposed for an embedded Automatic Positioner for Laparoscope (APL), which includes a fixed-point ARM processor as the core. The APL system is designed to assist the doctor in laparoscopic surgery by implementing the specific doctor's vocal control of the laparoscope. Real-time response to voice commands requires a more efficient speech recognition algorithm for the APL. In order to reduce computation cost without significant loss in recognition accuracy, both arithmetic and algorithmic optimizations are applied in the method presented. First, relying mostly on arithmetic optimizations, a fixed-point front end for speech feature analysis is built according to the ARM processor's characteristics. Then a fast likelihood computation algorithm is used to reduce the computational complexity of the HMM-based recognition algorithm. The experimental results show that the method shortens the recognition time to within 0.5 s, with accuracy higher than 99%, demonstrating its ability to achieve real-time vocal control of the APL.

  10. Automatic Speech Recognition Systems for the Evaluation of Voice and Speech Disorders in Head and Neck Cancer

    Andreas Maier

    2010-01-01

    In patients suffering from head and neck cancer, speech intelligibility is often restricted. For assessment and outcome measurements, automatic speech recognition systems have previously been shown to be appropriate for objective and quick evaluation of intelligibility. In this study we investigate the applicability of the method to speech disorders caused by head and neck cancer. Intelligibility was quantified by speech recognition on recordings of a standard text read by 41 German laryngectomized patients with cancer of the larynx or hypopharynx and 49 German patients who had suffered from oral cancer. The speech recognition provides the percentage of correctly recognized words of a sequence, that is, the word recognition rate. Automatic evaluation was compared to perceptual ratings by a panel of experts and to an age-matched control group. Both patient groups showed significantly lower word recognition rates than the control group. Automatic speech recognition yielded word recognition rates which complied with experts' evaluation of intelligibility on a significant level. Automatic speech recognition serves as a good means with low effort to objectify and quantify the most important aspect of pathologic speech: intelligibility. The system was successfully applied to voice and speech disorders.
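
    A word recognition rate of the kind described here could be computed roughly as below. The abstract does not specify how recognized and reference words were aligned, so this sketch assumes a longest-common-subsequence alignment and counts the matched words as correct. The function name and the example sentences are hypothetical.

```python
def word_recognition_rate(reference, hypothesis):
    """Percentage of reference words matched, in order, by the hypothesis."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference text must contain at least one word")
    # lcs[i][j] is the longest common subsequence length of ref[:i] and hyp[:j].
    lcs = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i, r in enumerate(ref, 1):
        for j, h in enumerate(hyp, 1):
            if r == h:
                lcs[i][j] = lcs[i - 1][j - 1] + 1
            else:
                lcs[i][j] = max(lcs[i - 1][j], lcs[i][j - 1])
    return 100.0 * lcs[-1][-1] / len(ref)


# One substituted word and one deleted word out of six reference words.
print(word_recognition_rate("the north wind and the sun", "the north win and sun"))
```

    Inserted words in the recognizer output do not lower this figure, which is one reason a rate like this is usually reported alongside an accuracy measure that penalizes insertions.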

  11. Post-error Correction in Automatic Speech Recognition Using Discourse Information

    Kang, S.; Kim, J.-H.; Seo, J.

    2014-01-01

    Overcoming speech recognition errors in the field of human-computer interaction is important in ensuring a consistent user experience. This paper proposes a semantic-oriented post-processing approach for the correction of errors in speech recognition. The novelty of the model proposed here is that it re-ranks the n-best hypotheses of speech recognition based on the user's intention, which is analyzed from previous discourse information, while conventional automatic speech reco...

  12. Assessing the efficacy of benchmarks for automatic speech accent recognition

    Benjamin Bock

    2015-08-01

    Full Text Available Speech accents can possess valuable information about the speaker, and can be used in intelligent multimedia-based human-computer interfaces. The performance of algorithms for automatic classification of accents is often evaluated using audio datasets that include recording samples of different people, representing different accents. Here we describe a method that can detect bias in accent datasets, and apply the method to two accent identification datasets to reveal the existence of dataset bias, meaning that the datasets can be classified with accuracy higher than random even if the tested algorithm has no ability to analyze speech accent. We used the datasets by separating one second of silence from the beginning of each audio sample, such that the one-second sample did not contain voice, and therefore no information about the accent. An audio classification method was then applied to the datasets of silent audio samples, and provided classification accuracy significantly higher than random. These results indicate that the performance of accent classification algorithms measured using some accent classification benchmarks can be biased, and can be driven by differences in the background noise rather than the auditory features of the accents.

  13. Automatic speech recognition for report generation in computed tomography

    Purpose: A study was performed to compare the performance of automatic speech recognition (ASR) with conventional transcription. Materials and Methods: 100 CT reports were generated by using ASR and 100 CT reports were dictated and written by medical transcriptionists. The time for dictation and correction of errors by the radiologist was assessed and the type of mistakes was analysed. The text recognition rate was calculated in both groups and the average time between completion of the imaging study by the technologist and generation of the written report was assessed. A commercially available speech recognition technology (ASKA Software, IBM Via Voice) running on a personal computer was used. Results: The time for the dictation using digital voice recognition was 9.4±2.3 min compared to 4.5±3.6 min with an ordinary Dictaphone. The text recognition rate was 97% with digital voice recognition and 99% with medical transcriptionists. The average time from imaging completion to written report finalisation was reduced from 47.3 hours with medical transcriptionists to 12.7 hours with ASR. The analysis of misspellings demonstrated (ASR vs. medical transcriptionists): 3 vs. 4 for syntax errors, 0 vs. 37 orthographic mistakes, 16 vs. 22 mistakes in substance and 47 vs. erroneously applied terms. Conclusions: The use of digital voice recognition as a replacement for medical transcription is recommendable when an immediate availability of written reports is necessary. (orig.)

  14. Difficulties in Automatic Speech Recognition of Dysarthric Speakers and Implications for Speech-Based Applications Used by the Elderly: A Literature Review

    Young, Victoria; Mihailidis, Alex

    2010-01-01

    Despite their growing presence in home computer applications and various telephony services, commercial automatic speech recognition technologies are still not easily employed by everyone; especially individuals with speech disorders. In addition, relatively little research has been conducted on automatic speech recognition performance with older…

  15. Can automatic speech transcripts be used for large scale TV stream description and structuring?

    Guinaudeau, Camille; Gravier, Guillaume; Sébillot, Pascale

    2009-01-01

    International audience The increasing quantity of TV material requires methods to help users navigate such data streams. Automatically associating a short textual description with each program in a stream, is a first stage to navigating or structuring tasks. Speech contained in TV broadcasts--accessible by means of automatic speech recognition systems in the absence of closed caption--is a highly valuable semantic clue that might be used to link existing textual description such as program...

  16. Developing and Evaluating an Oral Skills Training Website Supported by Automatic Speech Recognition Technology

    Chen, Howard Hao-Jan

    2011-01-01

    Oral communication ability has become increasingly important to many EFL students. Several commercial software programs based on automatic speech recognition (ASR) technologies are available but their prices are not affordable for many students. This paper will demonstrate how the Microsoft Speech Application Software Development Kit (SASDK), a…

  17. Man-system interface based on automatic speech recognition: integration to a virtual control desk

    This work reports the implementation of a man-system interface based on automatic speech recognition, and its integration into a virtual nuclear power plant control desk. The latter is intended to reproduce a real control desk using virtual reality technology, for operator training and ergonomic evaluation purposes. An automatic speech recognition system was developed to serve as a new interface with users, replacing the computer keyboard and mouse. Users can operate this virtual control desk in front of a computer monitor or a projection screen through spoken commands. The automatic speech recognition interface developed is based on a well-known signal processing technique named cepstral analysis, and on artificial neural networks. The speech recognition interface is described, along with its integration with the virtual control desk, and results are presented. (author)

  18. Noise robust automatic speech recognition with adaptive quantile based noise estimation and speech band emphasizing filter bank

    Bonde, Casper Stork; Graversen, Carina; Gregersen, Andreas Gregers;

    2005-01-01

    An important topic in Automatic Speech Recognition (ASR) is to reduce the effect of noise, in particular when mismatch exists between the training and application conditions. Many noise robustness schemes within the feature processing domain use as a prerequisite a noise estimate prior to the appearance of the speech signal, which requires noise-robust voice activity detection and assumptions of stationary noise. However, both of these requirements are often not met and it is therefore of particular interest to investigate methods like the Quantile Based Noise Estimation (QBNE) method, which estimates the noise during speech and non-speech sections without the use of a voice activity detector. While the standard QBNE method uses a fixed pre-defined quantile across all frequency bands, this paper suggests adaptive QBNE (AQBNE), which adapts the quantile individually to each frequency band...
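
    The idea of quantile based noise estimation can be sketched as follows, assuming the common formulation in which the noise level of each frequency band is taken as a fixed quantile of that band's magnitudes over time. The function name, the floor-based quantile index and the toy data are assumptions for illustration. The optional per-band quantile list only shows where an adaptive scheme could plug in; the adaptation rule of AQBNE itself is not given in the abstract and is not implemented here.

```python
from typing import List, Sequence, Union


def qbne_noise_estimate(
    frames: Sequence[Sequence[float]],
    quantiles: Union[float, Sequence[float]] = 0.5,
) -> List[float]:
    """Estimate a per-band noise level from spectral magnitudes.

    frames[t][k] is the magnitude of frequency band k in time frame t.
    quantiles is either one value in [0, 1] shared by all bands (fixed
    quantile) or one value per band (hypothetical hook for adaptation).
    """
    if not frames:
        raise ValueError("at least one frame is required")
    n_bands = len(frames[0])
    if isinstance(quantiles, (int, float)):
        quantiles = [float(quantiles)] * n_bands
    if len(quantiles) != n_bands:
        raise ValueError("need exactly one quantile per frequency band")

    estimate = []
    for band, q in enumerate(quantiles):
        # Sort this band's magnitudes over time; no voice activity
        # detector is involved, speech and non-speech frames are mixed.
        values = sorted(frame[band] for frame in frames)
        index = int(q * (len(values) - 1))  # floor-based quantile index
        estimate.append(values[index])
    return estimate


if __name__ == "__main__":
    # Toy data: 5 frames, 2 bands; large values stand in for speech energy.
    frames = [[1.0, 10.0], [1.2, 0.5], [5.0, 0.4], [0.9, 0.6], [1.1, 9.0]]
    print(qbne_noise_estimate(frames))               # one quantile for all bands
    print(qbne_noise_estimate(frames, [0.5, 0.25]))  # a different quantile per band
```

    The sketch relies on the assumption that each band is free of speech energy in a sizeable share of frames, so a low or middle quantile falls on noise-dominated values.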

  19. Machine learning based sample extraction for automatic speech recognition using dialectal Assamese speech.

    Agarwalla, Swapna; Sarma, Kandarpa Kumar

    2016-06-01

    Automatic Speaker Recognition (ASR) and related issues are continuously evolving as inseparable elements of Human Computer Interaction (HCI). With assimilation of emerging concepts like big data and Internet of Things (IoT) as extended elements of HCI, ASR techniques are found to be passing through a paradigm shift. Of late, learning based techniques have started to receive greater attention from research communities related to ASR, owing to the fact that the former possess a natural ability to mimic biological behavior and in that way aid ASR modeling and processing. The current learning based ASR techniques are found to be evolving further with incorporation of big data and IoT-like concepts. In this paper, we report certain approaches based on machine learning (ML) used for extraction of relevant samples from a big data space and apply them for ASR using certain soft computing techniques for Assamese speech with dialectal variations. A class of ML techniques comprising the basic Artificial Neural Network (ANN) in feedforward (FF) and Deep Neural Network (DNN) forms using raw speech, extracted features and frequency domain forms is considered. The Multi Layer Perceptron (MLP) is configured with inputs in several forms to learn class information obtained using clustering and manual labeling. DNNs are also used to extract specific sentence types. Initially, from a large storage, relevant samples are selected and assimilated. Next, a few conventional methods are used for feature extraction of a few selected types. The features comprise both spectral and prosodic types. These are applied to Recurrent Neural Network (RNN) and Fully Focused Time Delay Neural Network (FFTDNN) structures to evaluate their performance in recognizing mood, dialect, speaker and gender variations in dialectal Assamese speech. The system is tested under several background noise conditions by considering the recognition rates (obtained using confusion matrices and manually) and computation time

  20. Physiologically Motivated Feature Extraction for Robust Automatic Speech Recognition

    Ibrahim Missaoui; Zied Lachiri

    2016-01-01

    In this paper, a new method is presented to extract robust speech features in the presence of external noise. The proposed method, based on two-dimensional Gabor filters, takes into account the spectro-temporal modulation frequencies and also limits the redundancy on the feature level. The performance of the proposed feature extraction method was evaluated on isolated speech words which are extracted from the TIMIT corpus and corrupted by background noise. The evaluation results demonstrate that ...

  1. Correcting Automatic Speech Recognition Errors in Real Time

    Wald, M; Boulain, P; Bell, J.; Doody, K; Gerrard, J

    2007-01-01

    Lectures can be digitally recorded and replayed to provide multimedia revision material for students who attended the class and a substitute learning experience for students unable to attend. Deaf and hard of hearing people can find it difficult to follow speech through hearing alone or to take notes while they are lip-reading or watching a sign-language interpreter. Synchronising the speech with text captions can ensure deaf students are not disadvantaged and assist all learners to search fo...

  2. Studies in automatic speech recognition and its application in aerospace

    Taylor, Michael Robinson

    Human communication is characterized in terms of the spectral and temporal dimensions of speech waveforms. Electronic speech recognition strategies based on Dynamic Time Warping and Markov Model algorithms are described and typical digit recognition error rates are tabulated. The application of Direct Voice Input (DVI) as an interface between man and machine is explored within the context of civil and military aerospace programmes. Sources of physical and emotional stress affecting speech production within military high performance aircraft are identified. Experimental results are reported which quantify fundamental frequency and coarse temporal dimensions of male speech as a function of the vibration, linear acceleration and noise levels typical of aerospace environments; preliminary indications of acoustic phonetic variability reported by other researchers are summarized. Connected whole-word pattern recognition error rates are presented for digits spoken under controlled Gz sinusoidal whole-body vibration. Correlations are made between significant increases in recognition error rate and resonance of the abdomen-thorax and head subsystems of the body. The phenomenon of vibrato style speech produced under low frequency whole-body Gz vibration is also examined. Interactive DVI system architectures and avionic data bus integration concepts are outlined together with design procedures for the efficient development of pilot-vehicle command and control protocols.
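
    The Dynamic Time Warping strategy named above can be illustrated with a minimal whole-word template matcher. This is a sketch under simplifying assumptions: each utterance is reduced to a one-dimensional feature sequence, the local distance is an absolute difference, and the step pattern is the basic symmetric one. The templates and labels are hypothetical, and the Markov model approach is not covered.

```python
from typing import Dict, Sequence


def dtw_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Accumulated cost of the best time alignment between a and b."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = best accumulated cost aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            # Allow stretching either sequence or advancing both.
            cost[i][j] = local + min(cost[i - 1][j],
                                     cost[i][j - 1],
                                     cost[i - 1][j - 1])
    return cost[n][m]


def recognise(utterance: Sequence[float],
              templates: Dict[str, Sequence[float]]) -> str:
    """Return the label of the template closest to the utterance."""
    return min(templates, key=lambda label: dtw_distance(utterance, templates[label]))


if __name__ == "__main__":
    # Hypothetical one-dimensional "feature" templates for two words.
    templates = {"one": [1.0, 2.0, 3.0], "two": [3.0, 2.0, 1.0]}
    # A slower rendition of "one": time warping absorbs the repeated frames.
    print(recognise([1.0, 1.0, 2.0, 3.0, 3.0], templates))
```

    Real systems of this kind compare sequences of spectral feature vectors rather than scalars, but the alignment recursion has the same shape.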

  3. Physiologically Motivated Feature Extraction for Robust Automatic Speech Recognition

    Ibrahim Missaoui

    2016-04-01

    Full Text Available In this paper, a new method is presented to extract robust speech features in the presence of external noise. The proposed method, based on two-dimensional Gabor filters, takes into account the spectro-temporal modulation frequencies and also limits the redundancy on the feature level. The performance of the proposed feature extraction method was evaluated on isolated speech words which are extracted from the TIMIT corpus and corrupted by background noise. The evaluation results demonstrate that the proposed feature extraction method outperforms classic methods such as Perceptual Linear Prediction, Linear Predictive Coding, Linear Prediction Cepstral Coefficients and Mel Frequency Cepstral Coefficients.

  4. Evaluating Automatic Speech Recognition-Based Language Learning Systems: A Case Study

    van Doremalen, Joost; Boves, Lou; Colpaert, Jozef; Cucchiarini, Catia; Strik, Helmer

    2016-01-01

    The purpose of this research was to evaluate a prototype of an automatic speech recognition (ASR)-based language learning system that provides feedback on different aspects of speaking performance (pronunciation, morphology and syntax) to students of Dutch as a second language. We carried out usability reviews, expert reviews and user tests to…

  5. Fusing Eye-gaze and Speech Recognition for Tracking in an Automatic Reading Tutor

    Rasmussen, Morten Højfeldt; Tan, Zheng-Hua

    2013-01-01

    In this paper we present a novel approach for automatically tracking the reading progress using a combination of eye-gaze tracking and speech recognition. The two are fused by first generating word probabilities based on eye-gaze information and then using these probabilities to augment the...

  6. Assessment of Severe Apnoea through Voice Analysis, Automatic Speech, and Speaker Recognition Techniques

    Fernández Pozo, Rubén; Blanco Murillo, Jose Luis; Hernández Gómez, Luis; López Gonzalo, Eduardo; Alcázar Ramírez, José; Toledano, Doroteo T.

    2009-12-01

    This study is part of an ongoing collaborative effort between the medical and the signal processing communities to promote research on applying standard Automatic Speech Recognition (ASR) techniques for the automatic diagnosis of patients with severe obstructive sleep apnoea (OSA). Early detection of severe apnoea cases is important so that patients can receive early treatment. Effective ASR-based detection could dramatically cut medical testing time. Working with a carefully designed speech database of healthy and apnoea subjects, we describe an acoustic search for distinctive apnoea voice characteristics. We also study abnormal nasalization in OSA patients by modelling vowels in nasal and nonnasal phonetic contexts using Gaussian Mixture Model (GMM) pattern recognition on speech spectra. Finally, we present experimental findings regarding the discriminative power of GMMs applied to severe apnoea detection. We have achieved an 81% correct classification rate, which is very promising and underpins the interest in this line of inquiry.

  7. Objective automatic assessment of rehabilitative speech treatment in Parkinson's disease

    Tsanas, A; Little, M.A.; Fox, C.; Ramig, L O

    2014-01-01

    Vocal performance degradation is a common symptom for the vast majority of Parkinson's disease (PD) subjects, who typically follow personalized one-to-one periodic rehabilitation meetings with speech experts over a long-term period. Recently, a novel computer program called Lee Silverman voice treatment (LSVT) Companion was developed to allow PD subjects to independently progress through a rehabilitative treatment session. This study is part of the assessment of the LSVT Companion, aiming to ...

  8. Automatic classification and accurate size measurement of blank mask defects

    Bhamidipati, Samir; Paninjath, Sankaranarayanan; Pereira, Mark; Buck, Peter

    2015-07-01

    A blank mask and its preparation stages, such as cleaning or resist coating, play an important role in the eventual yield obtained by using it. Blank mask defects' impact analysis directly depends on the amount of available information such as the number of defects observed, their accurate locations and sizes. Mask usability qualification at the start of the preparation process, is crudely based on number of defects. Similarly, defect information such as size is sought to estimate eventual defect printability on the wafer. Tracking of defect characteristics, specifically size and shape, across multiple stages, can further be indicative of process related information such as cleaning or coating process efficiencies. At the first level, inspection machines address the requirement of defect characterization by detecting and reporting relevant defect information. The analysis of this information though is still largely a manual process. With advancing technology nodes and reducing half-pitch sizes, a large number of defects are observed; and the detailed knowledge associated, make manual defect review process an arduous task, in addition to adding sensitivity to human errors. Cases where defect information reported by inspection machine is not sufficient, mask shops rely on other tools. Use of CDSEM tools is one such option. However, these additional steps translate into increased costs. Calibre NxDAT based MDPAutoClassify tool provides an automated software alternative to the manual defect review process. Working on defect images generated by inspection machines, the tool extracts and reports additional information such as defect location, useful for defect avoidance[4][5]; defect size, useful in estimating defect printability; and, defect nature e.g. particle, scratch, resist void, etc., useful for process monitoring. The tool makes use of smart and elaborate post-processing algorithms to achieve this. Their elaborateness is a consequence of the variety and

  9. Post-error Correction in Automatic Speech Recognition Using Discourse Information

    KANG, S.

    2014-05-01

    Full Text Available Overcoming speech recognition errors in the field of human-computer interaction is important in ensuring a consistent user experience. This paper proposes a semantic-oriented post-processing approach for the correction of errors in speech recognition. The novelty of the model proposed here is that it re-ranks the n-best hypotheses of speech recognition based on the user's intention, which is analyzed from previous discourse information, while conventional automatic speech recognition systems focus only on acoustic and language model scores for the current sentence. The proposed model successfully reduces the word error rate and semantic error rate by 3.65% and 8.61%, respectively.

  10. Suprasegmental Duration Modelling with Elastic Constraints in Automatic Speech Recognition

    Molloy, Laurence; Isard, Stephen

    1998-01-01

    In this paper a method of integrating a model of suprasegmental duration with a HMM-based recogniser at the post-processing level is presented. The N-Best utterance output is rescored using a suitable linear combination of acoustic log-likelihood (provided by a set of tied-state triphone HMMs) and duration log-likelihood (provided by a set of durational models). The durational model used in the post-processing imposes syllable-level elastic constraints on the durational behaviour of speech se...
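
    The rescoring step described above, a linear combination of acoustic and duration log-likelihoods applied to an N-best list, might look like the following sketch. The class name, the single weight parameter and all scores are hypothetical; the abstract does not state how the combination weights are chosen.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Hypothesis:
    text: str
    acoustic_loglik: float  # from the HMM recogniser
    duration_loglik: float  # from the durational model


def rescore(nbest: List[Hypothesis], duration_weight: float = 1.0) -> List[Hypothesis]:
    """Re-rank hypotheses by acoustic + weight * duration log-likelihood."""
    return sorted(
        nbest,
        key=lambda h: h.acoustic_loglik + duration_weight * h.duration_loglik,
        reverse=True,  # higher combined log-likelihood first
    )


if __name__ == "__main__":
    # Hypothetical 2-best list with made-up scores.
    nbest = [
        Hypothesis("recognise speech", -100.0, -12.0),
        Hypothesis("wreck a nice beach", -99.0, -20.0),
    ]
    print(rescore(nbest, duration_weight=0.0)[0].text)  # acoustic score only
    print(rescore(nbest, duration_weight=1.0)[0].text)  # duration model included
```

    With a weight of zero the acoustic ranking is unchanged, so the weight controls how strongly implausible durations can overturn the recogniser's first choice.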

  11. Arabic Language Learning Assisted by Computer, based on Automatic Speech Recognition

    Terbeh, Naim

    2012-01-01

    This work consists of creating a Computer Assisted Language Learning (CALL) system based on an Automatic Speech Recognition (ASR) system for the Arabic language, using the tool CMU Sphinx3 [1], which is based on the HMM approach. For this work, we constructed a corpus of six hours of speech recordings from nine speakers. Robustness to noise is one reason for the choice of the HMM approach [2]. The results achieved are encouraging given that our corpus comes from only nine speakers, but they also leave the door open for further improvement.

  12. Development an Automatic Speech to Facial Animation Conversion for Improve Deaf Lives

    S. Hamidreza Kasaei

    2011-05-01

    Full Text Available In this paper, we propose the design and initial implementation of a robust system which can automatically translate voice into text and text into sign language animations. Sign language translation systems could significantly improve deaf lives, especially in communication, exchange of information and the use of machines to translate conversations from one language to another. Considering these points, it seems necessary to study speech recognition. Usually, voice recognition algorithms address three major challenges. The first is extracting features from speech, the second is recognition when only a limited sound gallery is available, and the final challenge is moving from speaker-dependent to speaker-independent voice recognition. Extracting features from speech is an important stage in our method. Different procedures are available for extracting features from speech. One of the most common, used in speech recognition systems, is Mel-Frequency Cepstral Coefficients (MFCCs). The algorithm starts with preprocessing and signal conditioning. Next, features are extracted from the speech using cepstral coefficients. The result of this process is then sent to the segmentation part. Finally, the recognition part recognizes the words and then converts each recognized word to facial animation. The project is still in progress and some new interesting methods are described in the current report.

  13. Automatic analysis of slips of the tongue: Insights into the cognitive architecture of speech production.

    Goldrick, Matthew; Keshet, Joseph; Gustafson, Erin; Heller, Jordana; Needle, Jeremy

    2016-04-01

    Traces of the cognitive mechanisms underlying speaking can be found within subtle variations in how we pronounce sounds. While speech errors have traditionally been seen as categorical substitutions of one sound for another, acoustic/articulatory analyses show they partially reflect the intended sound. When "pig" is mispronounced as "big," the resulting /b/ sound differs from correct productions of "big," moving towards intended "pig"-revealing the role of graded sound representations in speech production. Investigating the origins of such phenomena requires detailed estimation of speech sound distributions; this has been hampered by reliance on subjective, labor-intensive manual annotation. Computational methods can address these issues by providing for objective, automatic measurements. We develop a novel high-precision computational approach, based on a set of machine learning algorithms, for measurement of elicited speech. The algorithms are trained on existing manually labeled data to detect and locate linguistically relevant acoustic properties with high accuracy. Our approach is robust, is designed to handle mis-productions, and overall matches the performance of expert coders. It allows us to analyze a very large dataset of speech errors (containing far more errors than the total in the existing literature), illuminating properties of speech sound distributions previously impossible to reliably observe. We argue that this provides novel evidence that two sources both contribute to deviations in speech errors: planning processes specifying the targets of articulation and articulatory processes specifying the motor movements that execute this plan. These findings illustrate how a much richer picture of speech provides an opportunity to gain novel insights into language processing. PMID:26779665

  14. Automatic evaluation of speech rhythm instability and acceleration in dysarthrias associated with basal ganglia dysfunction

    Jan Rusz

    2015-07-01

    Full Text Available Speech rhythm abnormalities are commonly present in patients with different neurodegenerative disorders. These alterations are hypothesized to be a consequence of disruption to the basal ganglia circuitry involving dysfunction of motor planning, programming and execution, which can be detected by a syllable repetition paradigm. Therefore, the aim of the present study was to design a robust signal processing technique that allows the automatic detection of spectrally-distinctive nuclei of syllable vocalizations and to determine speech features that represent rhythm instability and acceleration. A further aim was to elucidate specific patterns of dysrhythmia across various neurodegenerative disorders that share disruption of basal ganglia function. Speech samples based on repetition of the syllable /pa/ at a self-determined steady pace were acquired from 109 subjects, including 22 with Parkinson's disease (PD), 11 progressive supranuclear palsy (PSP), 9 multiple system atrophy (MSA), 24 ephedrone-induced parkinsonism (EP), 20 Huntington's disease (HD), and 23 healthy controls. Subsequently, an algorithm for the automatic detection of syllables as well as features representing rhythm instability and rhythm acceleration were designed. The proposed detection algorithm was able to correctly identify syllables and remove erroneous detections due to excessive inspiration and nonspeech sounds with a very high accuracy of 99.6%. Instability of vocal pace performance was observed in PSP, MSA, EP and HD groups. Significantly increased pace acceleration was observed only in the PD group. Although not significant, a tendency for pace acceleration was observed also in the PSP and MSA groups. Our findings underline the crucial role of the basal ganglia in the execution and maintenance of automatic speech motor sequences. We envisage the current approach to become the first step towards the development of acoustic technologies allowing automated assessment of rhythm

  15. An exploration of the potential of Automatic Speech Recognition to assist and enable receptive communication in higher education

    Mike Wald

    2006-12-01

    Full Text Available The potential use of Automatic Speech Recognition to assist receptive communication is explored. The opportunities and challenges that this technology presents to students and staff are also discussed and evaluated: providing captioning of speech online or in classrooms for deaf or hard of hearing students, and assisting blind, visually impaired or dyslexic learners to read and search learning material more readily by augmenting synthetic speech with natural recorded real speech. The automatic provision of online lecture notes, synchronised with speech, enables staff and students to focus on learning and teaching issues, while also benefiting learners unable to attend the lecture or who find it difficult or impossible to take notes at the same time as listening, watching and thinking.

  16. A HYBRID METHOD FOR AUTOMATIC SPEECH RECOGNITION PERFORMANCE IMPROVEMENT IN REAL WORLD NOISY ENVIRONMENT

    Urmila Shrawankar

    2013-01-01

    Full Text Available It is well known that speech recognition systems perform well when used in conditions similar to those used to train the acoustic models, and that mismatches degrade performance. In adverse environments with real-world noise, it is very difficult to predict the category of noise in advance and difficult to achieve environmental robustness. A rigorous experimental study shows that no single method is available that will both clean the noisy speech and preserve the quality of speech that has been corrupted by real natural environmental (mixed) noise. It is also observed that back-end techniques alone are not sufficient to improve the performance of a speech recognition system. It is necessary to implement performance improvement techniques at every step of the back end as well as the front end of the Automatic Speech Recognition (ASR) model. Current recognition systems solve this problem using a technique called adaptation. This experimental study has two aims. The first is to implement a hybrid method that cleans the speech signal as much as possible using all combinations of filters and enhancement techniques. The second is to develop a method for training on all categories of noise that can adapt the acoustic models to a new environment, which will help to improve the performance of the speech recognizer under real-world mismatched environmental conditions. The experiment confirms that hybrid adaptation methods improve ASR performance on both levels: Signal-to-Noise Ratio (SNR) improvement as well as word recognition accuracy in real-world noisy environments.

  17. Automatic transcription of continuous speech into syllable-like units for Indian languages

    G Lakshmi Sarada; A Lakshmi; Hema A Murthy; T Nagarajan

    2009-04-01

    The focus of this paper is to automatically segment and label continuous speech signal into syllable-like units for Indian languages. In this approach, the continuous speech signal is first automatically segmented into syllable-like units using group delay based algorithm. Similar syllable segments are then grouped together using an unsupervised and incremental training (UIT) technique. Isolated style HMM models are generated for each of the clusters during training. During testing, the speech signal is segmented into syllable-like units which are then tested against the HMMs obtained during training. This results in a syllable recognition performance of 42·6% and 39·94% for Tamil and Telugu. A new feature extraction technique that uses features extracted from multiple frame sizes and frame rates during both training and testing is explored for the syllable recognition task. This results in a recognition performance of 48·7% and 45·36%, for Tamil and Telugu respectively. The performance of segmentation followed by labelling is superior to that of a flat start syllable recogniser (27·8% and 28·8% for Tamil and Telugu respectively).

  18. A perception system for accurate automatic control of an articulated bus

    Salinas, Carlota; Montes, Héctor; Armada, Manuel

    2010-01-01

    This paper describes the perception system for an automatic articulated bus where an accurate tracking trajectory is desired. Among the most promising transport infrastructures of the autonomous or semi-autonomous transportation systems, the articulated bus is an interesting low cost and friendly option. This platform involves a mobile vehicle and a private circuit inside CSIC premises. The perception system, presented in this work, based on 2D laser scanner as a prime sensor generates local ...

  19. Deformable meshes for medical image segmentation accurate automatic segmentation of anatomical structures

    Kainmueller, Dagmar

    2014-01-01

    Segmentation of anatomical structures in medical image data is an essential task in clinical practice. Dagmar Kainmueller introduces methods for accurate fully automatic segmentation of anatomical structures in 3D medical image data. The author's core methodological contribution is a novel deformation model that overcomes limitations of state-of-the-art Deformable Surface approaches, hence allowing for accurate segmentation of tip- and ridge-shaped features of anatomical structures. As for practical contributions, she proposes application-specific segmentation pipelines for a range of anatom

  20. Towards an Intelligent Acoustic Front End for Automatic Speech Recognition: Built-in Speaker Normalization

    Umit H. Yapanel

    2008-08-01

    A proven method for achieving effective automatic speech recognition (ASR) due to speaker differences is to perform acoustic feature speaker normalization. More effective speaker normalization methods are needed which require limited computing resources for real-time performance. The most popular speaker normalization technique is vocal-tract length normalization (VTLN), despite the fact that it is computationally expensive. In this study, we propose a novel online VTLN algorithm entitled built-in speaker normalization (BISN), where normalization is performed on-the-fly within a newly proposed PMVDR acoustic front end. The novel algorithm aspect is that in conventional front-end processing with PMVDR and VTLN, two separate warping phases are needed, while in the proposed BISN method only a single speaker-dependent warp is used to achieve both the PMVDR perceptual warp and the VTLN warp simultaneously. This improved integration unifies the nonlinear warping performed in the front end and reduces computational requirements, thereby offering advantages for real-time ASR systems. Evaluations are performed for (i) an in-car extended digit recognition task, where an on-the-fly BISN implementation reduces the relative word error rate (WER) by 24%, and (ii) a diverse noisy speech task (SPINE 2), where the relative WER improvement was 9%, both relative to the baseline speaker normalization method.
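
    The relative WER figures quoted above can be read with a small arithmetic sketch. The abstract does not state the baseline WER, so the 10% baseline below is purely a hypothetical value chosen for illustration, and both function names are ours rather than anything from the paper.

```python
def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Fraction of the baseline word error rate that was removed."""
    return (baseline_wer - new_wer) / baseline_wer


def wer_after_relative_reduction(baseline_wer: float, relative_reduction: float) -> float:
    """Absolute WER that remains after a given relative reduction."""
    return baseline_wer * (1.0 - relative_reduction)


# Hypothetical baseline of 10% WER (not taken from the abstract).
baseline = 0.10
# A 24% relative reduction, as reported for the in-car digit task.
new_wer = wer_after_relative_reduction(baseline, 0.24)
recovered = relative_wer_reduction(baseline, new_wer)
```

    Under that assumed baseline, a 24% relative reduction would correspond to an absolute drop of 2.4 percentage points, which is why relative and absolute improvements should not be confused.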

  1. Towards an Intelligent Acoustic Front End for Automatic Speech Recognition: Built-in Speaker Normalization

    Yapanel UmitH

    2008-01-01

    A proven method for achieving effective automatic speech recognition (ASR) due to speaker differences is to perform acoustic feature speaker normalization. More effective speaker normalization methods are needed which require limited computing resources for real-time performance. The most popular speaker normalization technique is vocal-tract length normalization (VTLN), despite the fact that it is computationally expensive. In this study, we propose a novel online VTLN algorithm entitled built-in speaker normalization (BISN), where normalization is performed on-the-fly within a newly proposed PMVDR acoustic front end. The novel algorithm aspect is that in conventional front-end processing with PMVDR and VTLN, two separate warping phases are needed, while in the proposed BISN method only a single speaker-dependent warp is used to achieve both the PMVDR perceptual warp and the VTLN warp simultaneously. This improved integration unifies the nonlinear warping performed in the front end and reduces computational requirements, thereby offering advantages for real-time ASR systems. Evaluations are performed for (i) an in-car extended digit recognition task, where an on-the-fly BISN implementation reduces the relative word error rate (WER) by 24%, and (ii) a diverse noisy speech task (SPINE 2), where the relative WER improvement was 9%, both relative to the baseline speaker normalization method.

  2. Accurate and automatic extrinsic calibration method for blade measurement system integrated by different optical sensors

    He, Wantao; Li, Zhongwei; Zhong, Kai; Shi, Yusheng; Zhao, Can; Cheng, Xu

    2014-11-01

    Fast and precise 3D inspection systems are in great demand in modern manufacturing processes. At present, the available sensors each have their own pros and cons, and there is hardly an omnipotent sensor that can handle complex inspection tasks accurately and effectively. The prevailing solution is to integrate multiple sensors and take advantage of their respective strengths. To obtain a holistic 3D profile, the data from different sensors should be registered into a coherent coordinate system. However, for some complex-shaped objects with thin-wall features, such as blades, the ICP registration method becomes unstable. Therefore, it is very important to calibrate the extrinsic parameters of each sensor in the integrated measurement system. This paper proposes an accurate and automatic extrinsic parameter calibration method for a blade measurement system that integrates different optical sensors. In this system, a fringe projection sensor (FPS) and a conoscopic holography sensor (CHS) are integrated into a multi-axis motion platform, and the sensors can be optimally moved to any desired position at the object's surface. To simplify the calibration process, a special calibration artifact is designed according to the characteristics of the two sensors. An automatic registration procedure based on correlation and segmentation is used to roughly align the artifact datasets obtained by the FPS and CHS without any manual operation or data pre-processing, and then the Generalized Gauss-Markoff model is used to estimate the optimal transformation parameters. The experiments show the measurement result of a blade, where several sampled patches are merged into one point cloud, which verifies the performance of the proposed method.

  3. The effect of automatic gain control structure and release time on cochlear implant speech intelligibility.

    Phyu P Khing

    Nucleus cochlear implant systems incorporate a fast-acting front-end automatic gain control (AGC), sometimes called a compression limiter. The objective of the present study was to determine the effect of replacing the front-end compression limiter with a newly proposed envelope profile limiter. A secondary objective was to investigate the effect of AGC speed on cochlear implant speech intelligibility. The envelope profile limiter was located after the filter bank and reduced the gain when the largest of the filter bank envelopes exceeded the compression threshold. The compression threshold was set equal to the saturation level of the loudness growth function (i.e. the envelope level that mapped to the maximum comfortable current level), ensuring that no envelope clipping occurred. To preserve the spectral profile, the same gain was applied to all channels. Experiment 1 compared sentence recognition with the front-end limiter and with the envelope profile limiter, each with two release times (75 and 625 ms). Six implant recipients were tested in quiet and in four-talker babble noise, at a high presentation level of 89 dB SPL. Overall, release time had a larger effect than the AGC type. With both AGC types, speech intelligibility was lower for the 75 ms release time than for the 625 ms release time. With the shorter release time, the envelope profile limiter provided higher group mean scores than the front-end limiter in quiet, but there was no significant difference in noise. Experiment 2 measured sentence recognition in noise as a function of presentation level, from 55 to 89 dB SPL. The envelope profile limiter with 625 ms release time yielded better scores than the front-end limiter with 75 ms release time. A take-home study showed no clear pattern of preferences. It is concluded that the envelope profile limiter is a feasible alternative to a front-end compression limiter.
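
    The core idea of the envelope profile limiter described above (one gain, driven by the largest channel envelope, applied to every channel) can be sketched for a single analysis frame as follows. This is only an illustrative reading of the abstract: the gain rule `threshold / peak` is our assumption, the function names are hypothetical, and the attack and release dynamics that the study actually varies (75 and 625 ms) are deliberately left out.

```python
from typing import List, Sequence


def envelope_profile_gain(envelopes: Sequence[float], threshold: float) -> float:
    """Gain for one frame: unity unless the largest envelope exceeds the threshold."""
    peak = max(envelopes)
    if peak <= threshold:
        return 1.0
    # Assumed rule: scale so that the largest envelope lands on the threshold.
    return threshold / peak


def apply_limiter(envelopes: Sequence[float], threshold: float) -> List[float]:
    """Apply the same gain to all channels so the spectral profile is preserved."""
    gain = envelope_profile_gain(envelopes, threshold)
    return [gain * e for e in envelopes]


# Three filter-bank envelopes in arbitrary units; the middle one exceeds the threshold.
frame = [0.5, 2.0, 1.0]
limited = apply_limiter(frame, threshold=1.0)
```

    Because every channel is multiplied by the same factor, the ratios between channel envelopes are unchanged, which is the sense in which the spectral profile is preserved.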

  4. Morpho-syntactic post-processing of N-best lists for improved French automatic speech recognition

    Huet, Stéphane; Gravier, Guillaume; Sébillot, Pascale

    2010-01-01

    Many automatic speech recognition (ASR) systems rely solely on pronunciation dictionaries and language models to take into account information about language. Implicitly, morphology and syntax are to a certain extent embedded in the language models but the richness of such linguistic knowledge is not exploited. This paper studies the use of morpho-syntactic (MS) information in a post-processing stage of an ASR system, by reordering N-best lists. Each sentence hypothesis ...

  5. A new automatic blood pressure kit auscultates for accurate reading with a smartphone

    Wu, Hongjun; Wang, Bingjian; Zhu, Xinpu; Chu, Guang; Zhang, Zhi

    2016-01-01

    The accuracy of the widely used oscillometric automated blood pressure (BP) monitor has been continually questioned. A novel BP kit named Accutension, which adopts the Korotkoff auscultation method, was therefore devised. Accutension worked with a miniature microphone, a pressure sensor, and a smartphone. The BP values were automatically displayed on the smartphone screen through the installed App. Data recorded in the phone could be played back and reconfirmed after measurement. They could also be uploaded and saved to the iCloud. The accuracy and consistency of this novel electronic auscultatory sphygmomanometer was preliminarily verified here. Thirty-two subjects were included and 82 qualified readings were obtained. The mean differences ± SD for systolic and diastolic BP readings between Accutension and mercury sphygmomanometer were 0.87 ± 2.86 and −0.94 ± 2.93 mm Hg. Agreements between Accutension and mercury sphygmomanometer were highly significant for systolic (ICC = 0.993, 95% confidence interval (CI): 0.989–0.995) and diastolic (ICC = 0.987, 95% CI: 0.979–0.991). In conclusion, Accutension worked accurately based on our pilot study data. The difference was acceptable. ICC and Bland–Altman plot charts showed good agreement with manual measurements. Systolic readings of Accutension were slightly higher than those of manual measurement, while diastolic readings were slightly lower. One possible reason was that Accutension captured the first and the last Korotkoff sound more sensitively than the human ear during manual measurement and avoided missing sounds, so that it might be more accurate than the traditional mercury sphygmomanometer. By documenting and analyzing trends in BP values, Accutension helps the management of hypertension and therefore contributes to the mobile health service. PMID:27512876

  6. Rapid and automatic speech-specific learning mechanism in human neocortex.

    Kimppa, Lilli; Kujala, Teija; Leminen, Alina; Vainio, Martti; Shtyrov, Yury

    2015-09-01

    A unique feature of the human communication system is our ability to rapidly acquire new words and build large vocabularies. However, its neurobiological foundations remain largely unknown. In an electrophysiological study optimally designed to probe this rapid formation of new word memory circuits, we employed acoustically controlled novel word-forms incorporating native and non-native speech sounds, while manipulating the subjects' attention on the input. We found a robust index of neurolexical memory-trace formation: a rapid enhancement of the brain's activation elicited by novel words during a short (~30min) perceptual exposure, underpinned by fronto-temporal cortical networks, and, importantly, correlated with behavioural learning outcomes. Crucially, this neural memory trace build-up took place regardless of focused attention on the input or any pre-existing or learnt semantics. Furthermore, it was found only for stimuli with native-language phonology, but not for acoustically closely matching non-native words. These findings demonstrate a specialised cortical mechanism for rapid, automatic and phonology-dependent formation of neural word memory circuits. PMID:26074199

  7. Automatic speech recognizer based on the Spanish spoken in Valdivia, Chile

    Sanchez, Maria L.; Poblete, Victor H.; Sommerhoff, Jorge

    2001-05-01

    The performance of an automatic speech recognizer is affected by the training process (dependent on or independent of the speaker) and the size of the vocabulary. The language used in this study was the Spanish spoken in the city of Valdivia, Chile. A representative sample of 14 students and six professionals, all natives of Valdivia (ten women and ten men), was used to complete the study. The sample ranged in age between 20 and 30 years old. Two systems were programmed based on the classical principles: digitalizing, end point detection, linear prediction coding, cepstral coefficients, dynamic time warping, and a final decision stage with a previous step of training: (i) one speaker-dependent (15 words: five colors and ten numbers), (ii) one speaker-independent (30 words: ten verbs, ten nouns, and ten adjectives). A simple didactic application, with options to choose colors, numbers and drawings of the verbs, nouns and adjectives, was designed to be used with a personal computer. In both programs, the tests carried out showed a tendency towards errors in short, monosyllabic words like "flor" and "sol." The best results were obtained in words with three syllables like "disparar" and "mojado." [Work supported by Proyecto DID UACh N S-200278.]

  8. Robust Automatic Speech Recognition Features using Complex Wavelet Packet Transform Coefficients

    TjongWan Sen

    2009-11-01

    To improve the performance of phoneme-based Automatic Speech Recognition (ASR) in noisy environments, we developed a new technique that adds robustness to clean phoneme features. These robust features are obtained from Complex Wavelet Packet Transform (CWPT) coefficients. Since the CWPT coefficients represent all the different frequency bands of the input signal, decomposing the input signal into a complete CWPT tree also covers all frequencies involved in the recognition process. For time-overlapping signals with different frequency contents, e.g. a phoneme signal with noise, the CWPT coefficients are the combination of the CWPT coefficients of the phoneme signal and the CWPT coefficients of the noise. The CWPT coefficients of the phoneme signal would be changed according to the frequency components contained in the noise. Since the number of phonemes in every language is relatively small (limited) and already well known, one can easily derive principal component vectors from a clean training dataset using Principal Component Analysis (PCA). These principal component vectors can then be used to add robustness and minimize noise effects in the testing phase. Simulation results, using Alpha Numeric 4 (AN4) from Carnegie Mellon University and NOISEX-92 examples from Rice University, showed that this new technique can be used as a feature extractor that improves the robustness of phoneme-based ASR systems in various adverse noisy conditions and still preserves the performance in clean environments.

  9. Automatic Speech Recognition Using Template Model for Man-Machine Interface

    Mishra, Neema; Shrawankar, Urmila; Thakare, V. M

    2013-01-01

    Speech is a natural form of communication for human beings, and computers with the ability to understand speech and speak with a human voice are expected to contribute to the development of more natural man-machine interfaces. Computers with this kind of ability are gradually becoming a reality, through the evolution of speech recognition technologies. Speech is becoming an important mode of interaction with computers. In this paper, feature extraction is implemented using well-known Mel-Frequenc...

  10. Fast and Accurate Recurrent Neural Network Acoustic Models for Speech Recognition

    SAK, Haşim; Senior, Andrew; Rao, Kanishka; Beaufays, Françoise

    2015-01-01

    We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques tha...

  11. Novel Techniques for Dialectal Arabic Speech Recognition

    Elmahdy, Mohamed; Minker, Wolfgang

    2012-01-01

    Novel Techniques for Dialectal Arabic Speech describes approaches to improve automatic speech recognition for dialectal Arabic. Since speech resources for dialectal Arabic speech recognition are very sparse, the authors describe how existing Modern Standard Arabic (MSA) speech data can be applied to dialectal Arabic speech recognition, while assuming that MSA is always a second language for all Arabic speakers. In this book, Egyptian Colloquial Arabic (ECA) has been chosen as a typical Arabic dialect. ECA is the first ranked Arabic dialect in terms of number of speakers, and a high quality ECA speech corpus with accurate phonetic transcription has been collected. MSA acoustic models were trained using news broadcast speech. In order to cross-lingually use MSA in dialectal Arabic speech recognition, the authors have normalized the phoneme sets for MSA and ECA. After this normalization, they have applied state-of-the-art acoustic model adaptation techniques like Maximum Likelihood Linear Regression (MLLR) and M...

  12. Subjective Quality Measurement of Speech Its Evaluation, Estimation and Applications

    Kondo, Kazuhiro

    2012-01-01

    It is becoming crucial to accurately estimate and monitor speech quality in various ambient environments to guarantee high quality speech communication. This practical hands-on book shows speech intelligibility measurement methods so that the readers can start measuring or estimating speech intelligibility of their own system. The book also introduces subjective and objective speech quality measures, and describes in detail speech intelligibility measurement methods. It introduces a diagnostic rhyme test which uses rhyming word-pairs, and includes: An investigation into the effect of word familiarity on speech intelligibility. Speech intelligibility measurement of localized speech in virtual 3-D acoustic space using the rhyme test. Estimation of speech intelligibility using objective measures, including the ITU standard PESQ measures, and automatic speech recognizers.

  13. Contribution to automatic speech recognition. Analysis of the direct acoustical signal. Recognition of isolated words and phoneme identification

    This report deals with the acoustical-phonetic step of automatic speech recognition. The parameters used are the extrema of the acoustical signal (coded in amplitude and duration). This coding method, the properties of which are described, is simple and well adapted to digital processing. The quality and the intelligibility of the coded signal after reconstruction are particularly satisfactory. An experiment on the automatic recognition of isolated words has been carried out using this coding system. We have designed a filtering algorithm operating on the parameters of the coding. Thus the characteristics of the formants can be derived under certain conditions which are discussed. Using these characteristics, the identification of a large part of the phonemes for a given speaker was achieved. Carrying on the studies has required the development of a particular methodology of real-time processing which allowed immediate evaluation of the improvement of the programs. Such processing on temporal coding of the acoustical signal is extremely powerful and could represent, used in connection with other methods, an efficient tool for the automatic processing of speech. (author)

  14. Combining Statistical Parameteric Speech Synthesis and Unit-Selection for Automatic Voice Cloning

    Aylett, Matthew; Yamagishi, Junichi

    2008-01-01

    The ability to use the recorded audio of a subject's voice to produce an open-domain synthesis system has generated much interest both in academic research and in commercial speech technology. The ability to produce synthetic versions of a subject's voice has potential commercial applications, such as virtual celebrity actors, or potential clinical applications, such as offering a synthetic replacement voice in the case of a laryngectomy. Recent developments in HMM-based speech synthesis have ...

  15. Estimation of phoneme-specific HMM topologies for the automatic recognition of dysarthric speech.

    Caballero-Morales, Santiago-Omar

    2013-01-01

    Dysarthria is a frequently occurring motor speech disorder which can be caused by neurological trauma, cerebral palsy, or degenerative neurological diseases. Because dysarthria affects phonation, articulation, and prosody, spoken communication of dysarthric speakers gets seriously restricted, affecting their quality of life and confidence. Assistive technology has led to the development of speech applications to improve the spoken communication of dysarthric speakers. In this field, this paper presents an approach to improve the accuracy of HMM-based speech recognition systems. Because phonatory dysfunction is a main characteristic of dysarthric speech, the phonemes of a dysarthric speaker are affected at different levels. Thus, the approach consists in finding the most suitable type of HMM topology (Bakis, Ergodic) for each phoneme in the speaker's phonetic repertoire. The topology is further refined with a suitable number of states and Gaussian mixture components for acoustic modelling. This represents a difference when compared with studies where a single topology is assumed for all phonemes. Finding the suitable parameters (topology and mixtures components) is performed with a Genetic Algorithm (GA). Experiments with a well-known dysarthric speech database showed statistically significant improvements of the proposed approach when compared with the single topology approach, even for speakers with severe dysarthria. PMID:24222784

  16. Dynamic time warping applied to detection of confusable word pairs in automatic speech recognition

    Anguita Ortega, Jan; Hernando Pericás, Francisco Javier

    2005-01-01

    In this paper we present a method to predict if two words are likely to be confused by an Automatic Speech Recognition (ASR) system. This method is based on the classical Dynamic Time Warping (DTW) technique. This technique, which is usually used in ASR to measure the distance between two speech signals, is used here to calculate the distance between two words. With this distance the words are classified as confusable or not confusable using a threshold. We have te...
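
    As a rough illustration of the idea in this abstract, the sketch below computes a textbook DTW distance between two one-dimensional sequences and applies a threshold to the result. It is not the authors' method: the abstract does not say how words are represented or how the threshold is chosen, so the numeric sequences, the absolute-difference local cost, the function names and the threshold value are all assumptions made for illustration.

```python
from typing import Sequence


def dtw_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Classical dynamic time warping distance with |x - y| as the local cost."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] holds the best cumulative cost of aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # a[i-1] also aligned with an earlier element of b
                cost[i][j - 1],      # b[j-1] also aligned with an earlier element of a
                cost[i - 1][j - 1],  # both sequences advance together
            )
    return cost[n][m]


def is_confusable(a: Sequence[float], b: Sequence[float], threshold: float) -> bool:
    """Label a pair as confusable when its DTW distance falls below the threshold."""
    return dtw_distance(a, b) < threshold


# Toy sequences standing in for per-word feature tracks (hypothetical data).
same_shape = dtw_distance([1.0, 2.0, 3.0], [1.0, 2.0, 2.0, 3.0])
shifted = dtw_distance([0.0, 0.0], [1.0, 1.0])
```

    The first pair differs only in timing, so warping absorbs the difference; the second pair differs in value at every position, so no warping path can bring the distance to zero.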

  17. Silent Speech Interfaces

    Denby, B; Schultz, T.; Honda, K.; Hueber, T.; Gilbert, J.M.; Brumberg, J.S.

    2010-01-01

    The possibility of speech processing in the absence of an intelligible acoustic signal has given rise to the idea of a 'silent speech' interface, to be used as an aid for the speech handicapped, or as part of a communications system operating in silence-required or high-background-noise environments. The article first outlines the emergence of the silent speech interface from the fields of speech production, automatic speech processing, speech pathology research, and telec...

  18. Creating Accessible Educational Multimedia through Editing Automatic Speech Recognition Captioning in Real Time

    Wald, M

    2006-01-01

    Lectures can be digitally recorded and replayed to provide multimedia revision material for students who attended the class and a substitute learning experience for students unable to attend. Deaf and hard of hearing people can find it difficult to follow speech through hearing alone or to take notes while they are lip-reading or watching a sign-language interpreter. Notetakers can only summarise what is being said while qualified sign language interpreters with a good understanding of the re...

  19. Captioning for Deaf and Hard of Hearing People by Editing Automatic Speech Recognition in Real Time

    Wald, M

    2006-01-01

    Deaf and hard of hearing people can find it difficult to follow speech through hearing alone or to take notes when lip-reading or watching a sign-language interpreter. Notetakers summarise what is being said while qualified sign language interpreters with a good understanding of the relevant higher education subject content are in very scarce supply. Real time captioning/transcription is not normally available in UK higher education because of the shortage of real time stenographers. Lectures...

  20. Automatic and Accurate Conflation of Different Road-Network Vector Data towards Multi-Modal Navigation

    Meng Zhang

    2016-05-01

    With the rapid improvement of geospatial data acquisition and processing techniques, a variety of geospatial databases from public or private organizations have become available. Quite often, one dataset may be superior to other datasets in one, but not all aspects. In Germany, for instance, there were three major road-network vector datasets, viz. Tele Atlas (which is now “TOMTOM”), NAVTEQ (which is now “here”), and ATKIS. However, none of them was qualified for the purpose of multi-modal navigation (e.g., driving + walking): Tele Atlas and NAVTEQ consist of comprehensive routing-relevant information, but many pedestrian ways are missing; ATKIS covers more pedestrian areas but the road objects are not fully attributed. To satisfy the requirements of multi-modal navigation, an automatic approach has been proposed to conflate different road networks together, which involves five routines: (a) road-network matching between datasets; (b) identification of the pedestrian ways; (c) geometric transformation to eliminate geometric inconsistency; (d) topologic remodeling of the conflated road network; and (e) error checking and correction. The proposed approach demonstrates high performance in a number of large test areas and therefore has been successfully utilized for the real-world data production in the whole region of Germany. As a result, the conflated road network allows the multi-modal navigation of “driving + walking”.

  1. A FAST AND ACCURATE METHOD FOR AUTOMATIC CORONARY ARTERIAL TREE EXTRACTION IN ANGIOGRAMS

    Rohollah Moosavi Tayebi

    2014-01-01

    Coronary arterial tree extraction in angiograms is an essential component of every cardiac image processing system. Once physicians decide to examine coronary arteries from x-ray angiograms, extraction must be done precisely, quickly, and automatically, and must include the whole arterial tree, to help diagnosis or treatment during cardiac surgery. This application is very helpful for the surgeon in deciding on the target vessels prior to coronary artery bypass graft surgery. Several techniques and algorithms have been proposed for extracting coronary arteries in angiograms. However, most of them suffer from disadvantages such as high time complexity, low accuracy, extraction of only parts of the main arteries instead of the full coronary arterial tree, the need for manual segmentation, the appearance of artifacts and so forth. This study presents a new method for extracting the whole coronary arterial tree in angiography images using the Starlet wavelet transform. To this end, we first remove noise from the raw angiograms and then sharpen the coronary arteries. The coronary arterial tree is then extracted by applying a modified Starlet wavelet transform, and afterwards the residual noise and artifacts are cleaned. For evaluation, we measure the performance of the proposed method on our own data set created from 4932 Left Coronary Artery (LCA) and Right Coronary Artery (RCA) angiograms and compare it with some state-of-the-art approaches. The proposed method shows much higher accuracy (96% for LCA and 97% for RCA), higher sensitivity (86% for LCA and 89% for RCA), higher specificity (98% for LCA and 99% for RCA) and also higher precision (87% for LCA and 93% for RCA).
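
    The four figures reported in this abstract (accuracy, sensitivity, specificity, precision) follow the standard confusion-matrix definitions, sketched below. The pixel counts in the example are invented for illustration and are not taken from the study, and the abstract does not state whether the metrics were computed per pixel or per image.

```python
from typing import Dict


def confusion_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Standard binary-classification metrics from confusion-matrix counts."""
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),  # all correct decisions
        "sensitivity": tp / (tp + fn),                # vessel pixels that were found
        "specificity": tn / (tn + fp),                # background pixels kept as background
        "precision": tp / (tp + fp),                  # detected pixels that are truly vessel
    }


# Hypothetical counts for one segmented angiogram (vessel = positive class).
metrics = confusion_metrics(tp=80, fp=10, tn=900, fn=10)
```

    With vessels occupying a small share of the image, accuracy and specificity are dominated by the background class, which is one reason sensitivity and precision are usually reported alongside them.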

  2. Speech recognition and understanding

    Vintsyuk, T.K.

    1983-05-01

    This article discusses the automatic processing of speech signals with the aim of finding a sequence of words (speech recognition) or a concept (speech understanding) being transmitted by the speech signal. The goal of the research is to develop an automatic typewriter that will automatically edit and type text under voice control. A dynamic programming method is proposed in which all possible class signals are stored, after which the presented signal is compared to all the stored signals during the recognition phase. Topics considered include element-by-element recognition of words of speech, learning speech recognition, phoneme-by-phoneme speech recognition, the recognition of connected speech, understanding connected speech, and prospects for designing speech recognition and understanding systems. An application of the composition dynamic programming method for the solution of basic problems in the recognition and understanding of speech is presented.

  3. Fast, automatic, and accurate catheter reconstruction in HDR brachytherapy using an electromagnetic 3D tracking system

    Purpose: In high dose rate brachytherapy (HDR-B), current catheter reconstruction protocols are relatively slow and error prone. The purpose of this technical note is to evaluate the accuracy and the robustness of an electromagnetic (EM) tracking system for automated and real-time catheter reconstruction. Methods: For this preclinical study, a total of ten catheters were inserted in gelatin phantoms with different trajectories. Catheters were reconstructed using an 18G biopsy needle, used as an EM stylet and equipped with a miniaturized sensor, and the second generation Aurora® Planar Field Generator from Northern Digital Inc. The Aurora EM system provides position and orientation values with precisions of 0.7 mm and 0.2°, respectively. Phantoms were also scanned using a μCT (GE Healthcare) and Philips Big Bore clinical computed tomography (CT) system with a spatial resolution of 89 μm and 2 mm, respectively. Reconstructions using the EM stylet were compared to μCT and CT. To assess the robustness of the EM reconstruction, five catheters were reconstructed twice and compared. Results: Reconstruction time for one catheter was 10 s, leading to a total reconstruction time of under 3 min for a typical 17-catheter implant. When compared to the μCT, the mean EM tip identification error was 0.69 ± 0.29 mm while the CT error was 1.08 ± 0.67 mm. The mean 3D distance error was found to be 0.66 ± 0.33 mm and 1.08 ± 0.72 mm for the EM and CT, respectively. EM 3D catheter trajectories were found to be more accurate. A maximum difference of less than 0.6 mm was found between successive EM reconstructions. Conclusions: The EM reconstruction was found to be more accurate and precise than the conventional methods used for catheter reconstruction in HDR-B. This approach can be applied to any type of catheters and applicators.

  4. Fast, automatic, and accurate catheter reconstruction in HDR brachytherapy using an electromagnetic 3D tracking system

    Poulin, Eric; Racine, Emmanuel; Beaulieu, Luc, E-mail: Luc.Beaulieu@phy.ulaval.ca [Département de physique, de génie physique et d’optique et Centre de recherche sur le cancer de l’Université Laval, Université Laval, Québec, Québec G1V 0A6, Canada and Département de radio-oncologie et Axe Oncologie du Centre de recherche du CHU de Québec, CHU de Québec, 11 Côte du Palais, Québec, Québec G1R 2J6 (Canada); Binnekamp, Dirk [Integrated Clinical Solutions and Marketing, Philips Healthcare, Veenpluis 4-6, Best 5680 DA (Netherlands)

    2015-03-15

    Purpose: In high dose rate brachytherapy (HDR-B), current catheter reconstruction protocols are relatively slow and error prone. The purpose of this technical note is to evaluate the accuracy and the robustness of an electromagnetic (EM) tracking system for automated and real-time catheter reconstruction. Methods: For this preclinical study, a total of ten catheters were inserted in gelatin phantoms with different trajectories. Catheters were reconstructed using an 18G biopsy needle, used as an EM stylet and equipped with a miniaturized sensor, and the second generation Aurora® Planar Field Generator from Northern Digital Inc. The Aurora EM system provides position and orientation values with precisions of 0.7 mm and 0.2°, respectively. Phantoms were also scanned using a μCT (GE Healthcare) and Philips Big Bore clinical computed tomography (CT) system with a spatial resolution of 89 μm and 2 mm, respectively. Reconstructions using the EM stylet were compared to μCT and CT. To assess the robustness of the EM reconstruction, five catheters were reconstructed twice and compared. Results: Reconstruction time for one catheter was 10 s, leading to a total reconstruction time of under 3 min for a typical 17-catheter implant. When compared to the μCT, the mean EM tip identification error was 0.69 ± 0.29 mm while the CT error was 1.08 ± 0.67 mm. The mean 3D distance error was found to be 0.66 ± 0.33 mm and 1.08 ± 0.72 mm for the EM and CT, respectively. EM 3D catheter trajectories were found to be more accurate. A maximum difference of less than 0.6 mm was found between successive EM reconstructions. Conclusions: The EM reconstruction was found to be more accurate and precise than the conventional methods used for catheter reconstruction in HDR-B. This approach can be applied to any type of catheters and applicators.

  5. Using automatic speech processing to study French oral vowels Contributions du traitement automatique de la parole à l'étude des voyelles orales du français

    Martine Adda-Decker

    2009-10-01

    Automatic speech processing methods and tools can contribute to shedding light on many issues relating to phonemic variability in speech. Processing huge amounts of speech makes it possible to extract the main tendencies, whose detailed interpretation then requires both linguistic and methodological insights. The experimental study focuses on the variability of French oral vowels in the PFC and ESTER corpora, which are widely used both by linguists and by researchers in automatic speech processing. Duration and formant measures illustrate global variations depending on different parameters, which include speech style, syllable position and the speakers' regional origins. The last part addresses the phonetic realization of close-mid front vowels, using automatic classification in a Bayesian framework.

  6. Accurate and Fully Automatic Hippocampus Segmentation Using Subject-Specific 3D Optimal Local Maps Into a Hybrid Active Contour Model

    Zarpalas, Dimitrios; Gkontra, Polyxeni; Daras, Petros; Maglaveras, Nicos

    2014-01-01

    Assessing the structural integrity of the hippocampus (HC) is an essential step toward prevention, diagnosis, and follow-up of various brain disorders due to the implication of the structural changes of the HC in those disorders. In this respect, the development of automatic segmentation methods that can accurately, reliably, and reproducibly segment the HC has attracted considerable attention over the past decades. This paper presents an innovative 3-D fully automatic method to be used on to...

  7. Optimizing Automatic Speech Recognition for Low-Proficient Non-Native Speakers

    Catia Cucchiarini

    2010-01-01

    Computer-Assisted Language Learning (CALL) applications for improving the oral skills of low-proficient learners have to cope with non-native speech that is particularly challenging. Since unconstrained non-native ASR is still problematic, a possible solution is to elicit constrained responses from the learners. In this paper, we describe experiments aimed at selecting utterances from lists of responses. The first experiment on utterance selection indicates that the decoding process can be improved by optimizing the language model and the acoustic models, thus reducing the utterance error rate from 29–26% to 10–8%. Since giving feedback on incorrectly recognized utterances is confusing, we verify the correctness of the utterance before providing feedback. The results of the second experiment on utterance verification indicate that combining duration-related features with a likelihood ratio (LR) yields an equal error rate (EER) of 10.3%, which is significantly better than the EER for the other measures in isolation.
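
    The equal error rate mentioned above is the operating point at which falsely accepted and falsely rejected utterances occur at the same rate. The following is a minimal sketch of how such a value could be read off a set of verification scores; the function name, the score lists and the "accept when score >= threshold" rule are assumptions made for illustration and are not taken from the paper.

```python
def equal_error_rate(correct_scores, incorrect_scores):
    """Return (eer, threshold) at the point where the false-accept rate
    and the false-reject rate are closest to each other.

    correct_scores:   verification scores of correctly recognized utterances
    incorrect_scores: verification scores of incorrectly recognized utterances
    An utterance is accepted when its score is >= threshold (assumed rule).
    """
    best = None
    for t in sorted(set(correct_scores) | set(incorrect_scores)):
        # False accepts: incorrect utterances that pass the threshold.
        far = sum(s >= t for s in incorrect_scores) / len(incorrect_scores)
        # False rejects: correct utterances that fall below the threshold.
        frr = sum(s < t for s in correct_scores) / len(correct_scores)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2, t)
    return best[1], best[2]


# Hypothetical scores, for illustration only.
correct = [0.9, 0.8, 0.7, 0.4]
incorrect = [0.6, 0.3, 0.2, 0.1]
eer, threshold = equal_error_rate(correct, incorrect)
```

    With real data the two error rates rarely cross exactly at a sampled threshold, which is why the sketch reports the midpoint of the two rates at the closest point.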

  8. Automatic Understanding of Spontaneous Arabic Speech --- A Numerical Model Compréhension automatique de la parole arabe spontanée --- Une modélisation numérique

    Anis Zouaghi

    2009-01-01

    This work is part of a large research project entitled "Oreillodule" aimed at developing tools for automatic speech recognition, translation, and synthesis for the Arabic language. Our attention has mainly been focused on presenting the semantic analyzer developed for the automatic comprehension of standard spontaneous Arabic speech. The findings on the effectiveness of the semantic decoder are quite satisfactory.

  9. Call recognition and individual identification of fish vocalizations based on automatic speech recognition: An example with the Lusitanian toadfish.

    Vieira, Manuel; Fonseca, Paulo J; Amorim, M Clara P; Teixeira, Carlos J C

    2015-12-01

    The study of acoustic communication in animals often requires not only the recognition of species-specific acoustic signals but also the identification of individual subjects, all in a complex acoustic background. Moreover, when very long recordings are to be analyzed, automatic recognition and identification processes are invaluable tools to extract the relevant biological information. A pattern recognition methodology based on hidden Markov models is presented, inspired by successful results obtained in the most widely known and complex acoustical communication signal: human speech. This methodology was applied here for the first time to the detection and recognition of fish acoustic signals, specifically in a stream of round-the-clock recordings of Lusitanian toadfish (Halobatrachus didactylus) in their natural estuarine habitat. The results show that this methodology is able not only to detect the mating sounds (boatwhistles) but also to identify individual male toadfish, reaching an identification rate of ca. 95%. Moreover, this method also proved to be a powerful tool to assess signal durations in large data sets. However, the system failed to recognize other sound types. PMID:26723348

  10. Ranking of predictor variables based on effect size criterion provides an accurate means of automatically classifying opinion column articles

    Legara, Erika Fille; Monterola, Christopher; Abundo, Cheryl

    2011-01-01

    We demonstrate an accurate procedure based on linear discriminant analysis that allows automatic authorship classification of opinion column articles. First, we extract the following stylometric features of 157 column articles from four authors: statistics on high-frequency words, number of words per sentence, and number of sentences per paragraph. Then, by systematically ranking these features based on an effect size criterion, we show that we can achieve an average classification accuracy of 93% for the test set. In comparison, frequency size based ranking has an average accuracy of 80%. The highest possible average classification accuracy of our data merely relying on chance is ∼31%. By carrying out sensitivity analysis, we show that the effect size criterion is superior to frequency ranking because there exist low-frequency words that significantly contribute to successful author discrimination. Consistent results are seen when the procedure is applied in classifying the undisputed Federalist papers of Alexander Hamilton and James Madison. To the best of our knowledge, this work is the first attempt at classifying opinion column articles, which, by virtue of being shorter in length (as compared to novels or short stories), are more prone to over-fitting issues. The near perfect classification for the longer papers supports this claim. Our results provide an important insight on authorship attribution that has been overlooked in previous studies: that ranking discriminant variables based on word frequency counts is not necessarily an optimal procedure.

  11. Security and Hyper-accurate Positioning Monitoring with Automatic Dependent Surveillance-Broadcast (ADS-B) Project

    National Aeronautics and Space Administration — Lightning Ridge Technologies, working in collaboration with The Innovation Laboratory, Inc., extend Automatic Dependent Surveillance Broadcast (ADS-B) into a safe,...

  12. Security and Hyper-accurate Positioning Monitoring with Automatic Dependent Surveillance-Broadcast (ADS-B) Project

    National Aeronautics and Space Administration — Lightning Ridge Technologies, LLC, working in collaboration with The Innovation Laboratory, Inc., extend Automatic Dependent Surveillance-Broadcast (ADS-B) into a...

  13. Automatic pose initialization for accurate 2D/3D registration applied to abdominal aortic aneurysm endovascular repair

    Miao, Shun; Lucas, Joseph; Liao, Rui

    2012-02-01

    Minimally invasive abdominal aortic aneurysm (AAA) stenting can be greatly facilitated by overlaying the preoperative 3-D model of the abdominal aorta onto the intra-operative 2-D X-ray images. Accurate 2-D/3-D registration in 3-D space makes the 2-D/3-D overlay robust to changes in C-arm angulation. So far, the 2-D/3-D registration methods based on simulated X-ray projection images using multiple image planes have been shown to provide satisfactory 3-D registration accuracy. However, one drawback of the intensity-based 2-D/3-D registration methods is that the similarity measure is usually highly non-convex and hence the optimizer can easily be trapped in local minima. User interaction is therefore often needed to initialize the position of the 3-D model in order to get a successful 2-D/3-D registration. In this paper, a novel 3-D pose initialization technique is proposed, as an extension of our previously proposed bi-plane 2-D/3-D registration method for AAA intervention [4]. The proposed method detects vessel bifurcation points and the spine centerline in both 2-D and 3-D images, and utilizes landmark information to bring the 3-D volume into a 15 mm capture range. The proposed landmark detection method was validated on a real dataset, and is shown to provide a good initialization for the 2-D/3-D registration in [4], thus making the workflow fully automatic.

  14. A fully automatic tool to perform accurate flood mapping by merging remote sensing imagery and ancillary data

    D'Addabbo, Annarita; Refice, Alberto; Lovergine, Francesco; Pasquariello, Guido

    2016-04-01

    Flooding is one of the most frequent and expansive natural hazards. High-resolution flood mapping is an essential step in the monitoring and prevention of inundation hazard, both to gain insight into the processes involved in the generation of flooding events, and from the practical point of view of the precise assessment of inundated areas. Remote sensing data are recognized to be useful in this respect, thanks to the high resolution and regular revisit schedules of state-of-the-art satellites, which moreover offer a synoptic overview of the extent of flooding. In particular, Synthetic Aperture Radar (SAR) data present several favorable characteristics for flood mapping, such as their relative insensitivity to the meteorological conditions during acquisitions, as well as the possibility of acquiring independently of solar illumination, thanks to the active nature of the radar sensors [1]. However, flood scenarios are typical examples of complex situations in which different factors have to be considered to provide accurate and robust interpretation of the situation on the ground: the presence of many land cover types, each one with a particular signature in the presence of flood, requires modelling the behavior of different objects in the scene in order to associate them to flood or no flood conditions [2]. Generally, the fusion of multi-temporal, multi-sensor, multi-resolution and/or multi-platform Earth observation image data, together with other ancillary information, seems to have a key role in the pursuit of a consistent interpretation of complex scenes. In the case of flooding, distance from the river, terrain elevation, hydrologic information or some combination thereof can add useful information to remote sensing data. Suitable methods, able to manage and merge different kinds of data, are therefore particularly needed. In this work, a fully automatic tool, based on Bayesian Networks (BNs) [3] and able to perform data fusion, is presented. It supplies flood maps

  15. SPEECH PROCESSING – AN OVERVIEW

    A. Indumathi

    2012-06-01

    One of the earliest goals of speech processing was coding speech for efficient transmission. Later, research spread into various areas such as Automatic Speech Recognition (ASR), Speech Synthesis (TTS), Speech Enhancement, and Automatic Language Translation (ALT). Initially, ASR was used to recognize single words in a small vocabulary; later, many products were developed for continuous speech with large vocabularies. Speech Synthesis is used for synthesizing the speech corresponding to a given text. Speech Synthesis provides a way to communicate for persons unable to speak. When Speech Synthesis is used together with ASR, it allows a complete two-way spoken interaction between humans and machines. Speech Enhancement techniques are applied to improve the quality of the speech signal. Automatic Language Translation helps to convert one language into another. The basic concepts of speech processing are provided for beginners.

  16. Assessing the Performance of Automatic Speech Recognition Systems When Used by Native and Non-Native Speakers of Three Major Languages in Dictation Workflows

    Zapata, Julián; Kirkedal, Andreas Søeborg

    In this paper, we report on a two-part experiment aiming to assess and compare the performance of two types of automatic speech recognition (ASR) systems on two different computational platforms when used to augment dictation workflows. The experiment was performed with a sample of speakers of...... three major languages and with different linguistic profiles: non-native English speakers; non-native French speakers; and native Spanish speakers. The main objective of this experiment is to examine ASR performance in translation dictation (TD) and medical dictation (MD) workflows without manual...... transcription vs. with transcription. We discuss the advantages and drawbacks of a particular ASR approach in different computational platforms when used by various speakers of a given language, who may have different accents and levels of proficiency in that language, and who may have different levels of...

  17. Full automatic fiducial marker detection on coil arrays for accurate instrumentation placement during MRI guided breast interventions

    Filippatos, Konstantinos; Boehler, Tobias; Geisler, Benjamin; Zachmann, Harald; Twellmann, Thorsten

    2010-02-01

    With its high sensitivity, dynamic contrast-enhanced MR imaging (DCE-MRI) of the breast is today one of the first-line tools for early detection and diagnosis of breast cancer, particularly in the dense breast of young women. However, many relevant findings are very small or occult on targeted ultrasound images or mammography, so that MRI guided biopsy is the only option for a precise histological work-up [1]. State-of-the-art software tools for computer-aided diagnosis of breast cancer in DCE-MRI data also offer means for image-based planning of biopsy interventions. One step in the MRI guided biopsy workflow is the alignment of the patient position with the preoperative MR images. In these images, the location and orientation of the coil localization unit can be inferred from a number of fiducial markers, which for this purpose have to be manually or semi-automatically detected by the user. In this study, we propose a method for precise, fully automatic localization of fiducial markers, on the basis of which a virtual localization unit can subsequently be placed in the image volume for the purpose of determining the parameters for needle navigation. The method is based on adaptive thresholding for separating breast tissue from background followed by rigid registration of marker templates. In an evaluation of 25 clinical cases comprising 4 different commercial coil array models and 3 different MR imaging protocols, the method yielded a sensitivity of 0.96 at a false positive rate of 0.44 markers per case. The mean distance deviation between detected fiducial centers and ground truth information provided by a radiologist was 0.94 mm.

  18. Machines a Comprendre la Parole: Methodologie et Bilan de Recherche (Automatic Speech Recognition: Methodology and the State of the Research)

    Haton, Jean-Pierre

    1974-01-01

    Still no decisive result has been achieved in the automatic machine recognition of sentences of a natural language. Current research concentrates on developing algorithms for syntactic and semantic analysis. It is obvious that clues from all levels of perception have to be taken into account if a long term solution is ever to be found. (Author/MSE)

  19. Accurate and Fully Automatic Hippocampus Segmentation Using Subject-Specific 3D Optimal Local Maps Into a Hybrid Active Contour Model.

    Zarpalas, Dimitrios; Gkontra, Polyxeni; Daras, Petros; Maglaveras, Nicos

    2014-01-01

    Assessing the structural integrity of the hippocampus (HC) is an essential step toward prevention, diagnosis, and follow-up of various brain disorders due to the implication of the structural changes of the HC in those disorders. In this respect, the development of automatic segmentation methods that can accurately, reliably, and reproducibly segment the HC has attracted considerable attention over the past decades. This paper presents an innovative 3-D fully automatic method to be used on top of the multiatlas concept for the HC segmentation. The method is based on a subject-specific set of 3-D optimal local maps (OLMs) that locally control the influence of each energy term of a hybrid active contour model (ACM). The complete set of the OLMs for a set of training images is defined simultaneously via an optimization scheme. At the same time, the optimal ACM parameters are also calculated. Therefore, heuristic parameter fine-tuning is not required. Training OLMs are subsequently combined, by applying an extended multiatlas concept, to produce the OLMs that are anatomically more suitable to the test image. The proposed algorithm was tested on three different and publicly available data sets. Its accuracy was compared with that of state-of-the-art methods demonstrating the efficacy and robustness of the proposed method. PMID:27170866

  20. Study on automatic prediction of sentential stress for Chinese Putonghua Text-to-Speech system with natural style

    SHAO Yanqiu; HAN Jiqing; ZHAO Yongzhen; LIU Ting

    2007-01-01

    Stress is an important parameter for prosody processing in speech synthesis. In this paper, we compare the acoustic features of neutral tone syllables and strong stress syllables with moderate stress syllables, including pitch, syllable duration, intensity and pause length after syllable. The relation between duration and pitch, as well as the Third Tone (T3) and pitch are also studied. Three stress prediction models based on ANN, i.e. the acoustic model, the linguistic model and the mixed model, are presented for predicting Chinese sentential stress. The results show that the mixed model performs better than the other two models. In order to solve the problem of the diversity of manual labeling, an evaluation index of support ratio is proposed.

  1. A fast, accurate, and automatic 2D-3D image registration for image-guided cranial radiosurgery

    The authors developed a fast and accurate two-dimensional (2D)-three-dimensional (3D) image registration method to perform precise initial patient setup and frequent detection and correction for patient movement during image-guided cranial radiosurgery treatment. In this method, an approximate geometric relationship is first established to decompose a 3D rigid transformation in the 3D patient coordinate into in-plane transformations and out-of-plane rotations in two orthogonal 2D projections. Digitally reconstructed radiographs are generated offline from a preoperative computed tomography volume prior to treatment and used as the reference for patient position. A multiphase framework is designed to register the digitally reconstructed radiographs with the x-ray images periodically acquired during patient setup and treatment. The registration in each projection is performed independently; the results in the two projections are then combined and converted to a 3D rigid transformation by 2D-3D geometric backprojection. The in-plane transformation and the out-of-plane rotation are estimated using different search methods, including multiresolution matching, steepest descent minimization, and one-dimensional search. Two similarity measures, optimized pattern intensity and sum of squared difference, are applied at different registration phases to optimize accuracy and computation speed. Various experiments on an anthropomorphic head-and-neck phantom showed that, using fiducial registration as a gold standard, the registration errors were 0.33±0.16 mm (s.d.) in overall translation and 0.29 deg. ±0.11 deg. (s.d.) in overall rotation. The total targeting errors were 0.34±0.16 mm (s.d.), 0.40±0.2 mm (s.d.), and 0.51±0.26 mm (s.d.) for the targets at the distances of 2, 6, and 10 cm from the rotation center, respectively. The computation time was less than 3 s on a computer with an Intel Pentium 3.0 GHz dual processor
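
    To make the combination of a sum-of-squared-difference similarity measure with a one-dimensional search concrete, the following is a deliberately simplified sketch that aligns two 1-D intensity profiles by an integer shift. It is a toy illustration only: the function names, the zero padding and the exhaustive integer search are assumptions of this sketch and do not reproduce the authors' multiphase 2D-3D framework.

```python
def ssd(a, b):
    """Sum of squared differences between two equal-length profiles."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def shift(profile, k):
    """Shift a profile right by k samples (left if k < 0), padding with zeros."""
    n = len(profile)
    return [profile[i - k] if 0 <= i - k < n else 0.0 for i in range(n)]


def best_shift(reference, moving, max_shift):
    """Exhaustive one-dimensional search for the integer shift of `moving`
    that minimizes the SSD against `reference`."""
    candidates = range(-max_shift, max_shift + 1)
    return min(candidates, key=lambda k: ssd(reference, shift(moving, k)))


# Hypothetical data: the moving profile is the reference displaced by 2 samples.
reference = [0.0, 0.0, 1.0, 3.0, 1.0, 0.0, 0.0, 0.0]
moving = shift(reference, 2)
correction = best_shift(reference, moving, max_shift=3)  # shift that undoes the displacement
```

    A lower SSD means a better match, so the search keeps the candidate with the smallest value; a practical system would search over continuous parameters and image pairs rather than integer shifts of a single profile.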

  2. Integrating prosodic information into a speech recogniser

    López Soto, María Teresa

    2001-01-01

    In the last decade there has been an increasing tendency to incorporate language engineering strategies into speech technology. This technique combines linguistic and mathematical information in different applications: machine translation, natural language processing, speech synthesis and automatic speech recognition (ASR). In the field of speech synthesis, this hybrid approach (linguistic and mathematical/statistical) has led to the design of efficient models for reproducin...

  3. Application of Perceptual Filtering Models to Noisy Speech Signals Enhancement

    Novlene Zoghlami

    2012-01-01

    This paper describes a new speech enhancement approach using perceptually based noise reduction. The proposed approach is based on the application of two perceptual filtering models to noisy speech signals: the gammatone and the gammachirp filter banks with nonlinear resolution according to the equivalent rectangular bandwidth (ERB) scale. The perceptual filtering gives a number of subbands that are individually spectrally weighted and modified according to two different noise suppression rules. The importance of an accurate noise estimate is related to the reduction of the musical noise artifacts that appear in the processed speech after the classic subtractive process. In this context, we use continuous noise estimation algorithms. The performance of the proposed approach is evaluated on speech signals corrupted by real-world noises. Using objective tests based on the perceptual quality PESQ score and the quality rating of signal distortion (SIG), noise distortion (BAK) and overall quality (OVRL), and a subjective test based on the quality rating of automatic speech recognition (ASR), we demonstrate that our speech enhancement approach using filter banks modeling the human auditory system outperforms the conventional spectral modification algorithms in improving the quality and intelligibility of the enhanced speech signal.

  4. Pattern recognition in speech and language processing

    Chou, Wu

    2003-01-01

    Minimum Classification Error (MCE) Approach in Pattern Recognition, Wu Chou; Minimum Bayes-Risk Methods in Automatic Speech Recognition, Vaibhava Goel and William Byrne; A Decision Theoretic Formulation for Adaptive and Robust Automatic Speech Recognition, Qiang Huo; Speech Pattern Recognition Using Neural Networks, Shigeru Katagiri; Large Vocabulary Speech Recognition Based on Statistical Methods, Jean-Luc Gauvain; Toward Spontaneous Speech Recognition and Understanding, Sadaoki Furui; Speaker Authentication, Qi Li and Biing-Hwang Juang; HMMs for Language Processing Problems, Ri

  5. Speech Problems

    ... your treatment plan may include seeing a speech therapist , a person who is trained to treat speech disorders. How often you have to see the speech therapist will vary — you'll probably start out seeing ...

  6. Speech Segmentation Algorithm Based On Fuzzy Memberships

    Luis D. Huerta; Jose Antonio Huesca; Julio C. Contreras

    2010-01-01

    In this work, a text-independent automatic speech segmentation algorithm was implemented. In the algorithm, the use of fuzzy memberships on each characteristic in different speech sub-bands is proposed. Thus, the segmentation is performed in greater detail. Additionally, we tested with various speech signal frequencies and labeling, and we could observe how they affect the performance of the segmentation process in phonemes. The speech segmentation algorithm used is described. During th...

  7. Sparse representation in speech signal processing

    Lee, Te-Won; Jang, Gil-Jin; Kwon, Oh-Wook

    2003-11-01

    We review the sparse representation principle for processing speech signals. A transformation for encoding the speech signals is learned such that the resulting coefficients are as independent as possible. We use independent component analysis with an exponential prior to learn a statistical representation for speech signals. This representation leads to extremely sparse priors that can be used for encoding speech signals for a variety of purposes. We review applications of this method for speech feature extraction, automatic speech recognition and speaker identification. Furthermore, this method is also suited for tackling the difficult problem of separating two sounds given only a single microphone.

  8. Speech Development

    ... To download the PDF version of this factsheet, ...

  9. INTEGRATING MACHINE TRANSLATION AND SPEECH SYNTHESIS COMPONENT FOR ENGLISH TO DRAVIDIAN LANGUAGE SPEECH TO SPEECH TRANSLATION SYSTEM

    J. SANGEETHA

    2015-02-01

    This paper provides an interface between the machine translation and speech synthesis systems for converting English speech to Tamil text in an English-to-Tamil speech-to-speech translation system. The speech translation system consists of three modules: automatic speech recognition, machine translation and text-to-speech synthesis. Many procedures for the integration of speech recognition and machine translation have been proposed. However, the speech synthesis system has not yet been evaluated. In this paper, we focus on the integration of machine translation and speech synthesis, and report a subjective evaluation to investigate the impact of speech synthesis, machine translation and the integration of the machine translation and speech synthesis components. Here we implement a hybrid machine translation (a combination of rule-based and statistical machine translation) and a concatenative syllable-based speech synthesis technique. In order to retain the naturalness and intelligibility of synthesized speech, Auto Associative Neural Network (AANN) prosody prediction is used in this work. The results of this system investigation demonstrate that the naturalness and intelligibility of the synthesized speech are strongly influenced by the fluency and correctness of the translated text.

  10. GesRec3D: a real-time coded gesture-to-speech system with automatic segmentation and recognition thresholding using dissimilarity measures

    Craven, Michael P; Curtis, K. Mervyn

    2004-01-01

    A complete microcomputer system is described, GesRec3D, which facilitates the data acquisition, segmentation, learning, and recognition of 3-Dimensional arm gestures, with application as an Augmentative and Alternative Communication (AAC) aid for people with motor and speech disability. The gesture data is acquired from a Polhemus electro-magnetic tracker system, with sensors attached to the finger, wrist and elbow of one arm. Coded gestures are linked to user-defined text, to be spoken by a t...

  11. Feasibility of Technology Enabled Speech Disorder Screening.

    Duenser, Andreas; Ward, Lauren; Stefani, Alessandro; Smith, Daniel; Freyne, Jill; Morgan, Angela; Dodd, Barbara

    2016-01-01

    One in twenty Australian children suffers from a speech disorder. Early detection of such problems can significantly improve literacy and academic outcomes for these children, reduce health and educational burden and ongoing social costs. Here we present the development of a prototype and feasibility tests of a screening and decision support tool to assess speech disorders in young children. The prototype incorporates speech signal processing, machine learning and expert knowledge to automatically classify phonemes of normal and disordered speech. We discuss these results and our future work towards the development of a mobile tool to facilitate broad, early speech disorder screening by non-experts. PMID:27440284

  12. Survey On Speech Synthesis

    A. Indumathi

    2012-12-01

    The primary goal of this paper is to provide an overview of existing Text-To-Speech (TTS) techniques by highlighting their usage and advantages. First-generation techniques include Formant Synthesis and Articulatory Synthesis. Formant Synthesis works by using individually controllable formant filters, which can be set to produce accurate estimations of the vocal-tract transfer function. Articulatory Synthesis produces speech by direct modeling of human articulator behavior. Second-generation techniques incorporate Concatenative Synthesis and Sinusoidal Synthesis. Concatenative Synthesis generates speech output by concatenating segments of recorded speech. Generally, Concatenative Synthesis generates natural-sounding synthesized speech. Sinusoidal Synthesis uses a harmonic model and decomposes each frame into a set of harmonics of an estimated fundamental frequency. The model parameters are the amplitudes and periods of the harmonics. With these, the value of the fundamental can be changed while keeping the same basic spectral... In addition, third-generation techniques include Hidden Markov Model (HMM) and Unit Selection Synthesis. HMM synthesis trains the parameter module and produces high-quality speech. Finally, Unit Selection operates by selecting the best sequence of units from a large speech database which matches the specification.
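
    The harmonic model described for Sinusoidal Synthesis can be illustrated with a short sketch that rebuilds one frame as a sum of cosines at integer multiples of the fundamental frequency. This is a minimal illustration under stated assumptions: the function name, the per-harmonic phase values and the sample rate are hypothetical and are not taken from the surveyed paper.

```python
import math


def synthesize_frame(f0, amplitudes, phases, sample_rate, n_samples):
    """Synthesize one frame as a sum of harmonics of the fundamental f0.

    amplitudes[k] and phases[k] describe harmonic number k + 1, whose
    frequency is (k + 1) * f0. Phases are an assumption of this sketch.
    """
    frame = []
    for n in range(n_samples):
        t = n / sample_rate  # time of sample n in seconds
        frame.append(sum(
            a * math.cos(2 * math.pi * (k + 1) * f0 * t + p)
            for k, (a, p) in enumerate(zip(amplitudes, phases))
        ))
    return frame


# Same harmonic amplitudes rendered at two different fundamentals:
# only the pitch changes, the relative harmonic weights stay the same.
amps, phs = [1.0, 0.5], [0.0, 0.0]
low = synthesize_frame(100.0, amps, phs, sample_rate=8000, n_samples=81)
high = synthesize_frame(200.0, amps, phs, sample_rate=8000, n_samples=81)
```

    Keeping the amplitude list fixed while changing `f0` is the simplest form of the pitch modification mentioned in the abstract; a practical synthesizer would additionally resample the amplitudes so that the overall spectral shape is preserved.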

  13. Automatic Speaker Recognition System

    Parul, R. B. Dubey

    2012-12-01

    Spoken language is used by humans to convey many types of information. Primarily, speech conveys a message via words. Owing to advanced speech technologies, people's interactions with remote machines, such as phone banking, internet browsing, and secured information retrieval by voice, are becoming popular today. Speaker verification and speaker identification are important for authentication and verification for security purposes. Speaker identification methods can be divided into text-independent and text-dependent. Speaker recognition is the process of automatically recognizing a speaker's voice on the basis of individual information included in the input speech waves. It consists of comparing a speech signal from an unknown speaker to a set of stored data of known speakers. This process recognizes who has spoken by matching the input signal with pre-stored samples. The work is focused on improving the performance of speaker verification under noisy conditions.

  14. WE-A-17A-10: Fast, Automatic and Accurate Catheter Reconstruction in HDR Brachytherapy Using An Electromagnetic 3D Tracking System

    Poulin, E; Racine, E; Beaulieu, L [CHU de Quebec - Universite Laval, Quebec, Quebec (Canada); Binnekamp, D [Integrated Clinical Solutions and Marketing, Philips Healthcare, Best, DA (Netherlands)

    2014-06-15

    Purpose: In high dose rate brachytherapy (HDR-B), current catheter reconstruction protocols are slow and error prone. The purpose of this study was to evaluate the accuracy and robustness of an electromagnetic (EM) tracking system for improved catheter reconstruction in HDR-B protocols. Methods: For this proof-of-principle, a total of 10 catheters were inserted in gelatin phantoms with different trajectories. Catheters were reconstructed using a Philips-designed 18G biopsy needle (used as an EM stylet) and the second generation Aurora Planar Field Generator from Northern Digital Inc. The Aurora EM system exploits alternating current technology and generates 3D points at 40 Hz. Phantoms were also scanned using a μCT (GE Healthcare) and a Philips Big Bore clinical CT system with resolutions of 0.089 mm and 2 mm, respectively. Reconstructions using the EM stylet were compared to μCT and CT. To assess the robustness of the EM reconstruction, 5 catheters were reconstructed twice and compared. Results: Reconstruction time for one catheter was 10 seconds or less. This would imply that for a typical clinical implant of 17 catheters, the total reconstruction time would be less than 3 minutes. When compared to the μCT, the mean EM tip identification error was 0.69 ± 0.29 mm while the CT error was 1.08 ± 0.67 mm. The mean 3D distance error was found to be 0.92 ± 0.37 mm and 1.74 ± 1.39 mm for the EM and CT, respectively. EM 3D catheter trajectories were found to be significantly more accurate (unpaired t-test, p < 0.05). A mean difference of less than 0.5 mm was found between successive EM reconstructions. Conclusion: The EM reconstruction was found to be faster, more accurate and more robust than the conventional methods used for catheter reconstruction in HDR-B. This approach can be applied to any type of catheters and applicators. The equipment used in this study comes from a collaboration with Philips Medical.

  15. Robust speech recognition using articulatory information

    Kirchhoff, Katrin

    1999-01-01

    Current automatic speech recognition systems make use of a single source of information about their input, viz. a preprocessed form of the acoustic speech signal, which encodes the time-frequency distribution of signal energy. The goal of this thesis is to investigate the benefits of integrating articulatory information into state-of-the-art speech recognizers, either as a genuine alternative to standard acoustic representations, or as an additional source of information. Articulatory informa...

  16. Speech production in amplitude-modulated noise

    Macdonald, Ewen N; Raufer, Stefan

    2013-01-01

    The Lombard effect refers to the phenomenon where talkers automatically increase their level of speech in a noisy environment. While many studies have characterized how the Lombard effect influences different measures of speech production (e.g., F0, spectral tilt, etc.), few have investigated the...

  17. Annotating Speech Corpus for Prosody Modeling in Indian Language Text to Speech Systems

    Kiruthiga S

    2012-01-01

    A spoken language system, whether for speech synthesis or speech recognition, starts with building a speech corpus. We give a detailed survey of the issues and a methodology for selecting the appropriate speech unit when building a speech corpus for Indian-language text-to-speech systems. The paper ultimately aims to improve the intelligibility of the synthesized speech in text-to-speech synthesis systems. To begin with, an appropriate text file should be selected for building the speech corpus. Then a corresponding speech file is generated and stored. This speech file is the phonetic representation of the selected text file. The speech file is processed at different levels, viz. paragraphs, sentences, phrases, words, syllables and phones. These are called the speech units of the file. Research has been done taking each of these units as the basic unit for processing. This paper analyses the research done using phones, diphones, triphones, syllables and polysyllables as the basic unit for speech synthesis. The paper also provides a recommended set of combinations for polysyllables. Concatenative speech synthesis involves the concatenation of these basic units to synthesize intelligible, natural-sounding speech. The speech units are annotated with relevant prosodic information about each unit, manually or automatically, based on an algorithm. The database consisting of the units along with their annotated information is called the annotated speech corpus. A clustering technique is applied to the annotated speech corpus to provide a way of selecting the appropriate unit for concatenation, based on the lowest total join cost of the speech unit.
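
    The selection of units by lowest total join cost described above can be illustrated with a small dynamic-programming search. This is only a sketch under stated assumptions, not the method of the paper: the unit tuples `(label, start_pitch, end_pitch)`, the pitch-mismatch join cost, and the function name `select_units` are all hypothetical.

```python
def select_units(candidates, join_cost):
    """Pick one candidate per slot so that the summed join cost is minimal.

    candidates: list of lists; candidates[i] holds the units available for slot i.
    join_cost:  function (left_unit, right_unit) -> non-negative number.
    """
    # best[i][j] = (cheapest cost of reaching candidates[i][j], back-pointer)
    best = [[(0.0, None) for _ in candidates[0]]]
    for i in range(1, len(candidates)):
        row = []
        for unit in candidates[i]:
            options = [
                (best[i - 1][k][0] + join_cost(prev, unit), k)
                for k, prev in enumerate(candidates[i - 1])
            ]
            row.append(min(options))
        best.append(row)

    # Trace back from the cheapest unit in the final slot.
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    total = best[-1][j][0]
    path = []
    for i in range(len(candidates) - 1, -1, -1):
        path.append(candidates[i][j])
        j = best[i][j][1]
    return path[::-1], total


# Hypothetical units: (label, pitch at start in Hz, pitch at end in Hz).
candidates = [
    [("ka_1", 100, 120), ("ka_2", 100, 150)],
    [("ma_1", 118, 110), ("ma_2", 160, 140)],
    [("la_1", 112, 100), ("la_2", 139, 120)],
]

# Assumed join cost: pitch mismatch at the boundary between two units.
pitch_mismatch = lambda left, right: abs(left[2] - right[1])

path, total = select_units(candidates, pitch_mismatch)
labels = [unit[0] for unit in path]
```

    In a real system the join cost would typically combine several spectral and prosodic distances; the search itself stays the same.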

  18. The role of speech in the user interface : perspective and application

    Abewusi, A.B.

    1994-01-01

    Consideration must be given to the implications of speech as a communication medium before deciding to use speech input or output in an interactive environment. There are several effective control strategies for improving the quality of speech. The utility of speech has been demonstrated on several illustrative problems, where it has proved effective despite the limitations of synthetic speech output and automatic speech recognition systems. (Author abstract)

  19. Current trends in multilingual speech processing

    Hervé Bourlard; John Dines; Mathew Magimai-Doss; Philip N Garner; David Imseng; Petr Motlicek; Hui Liang; Lakshmi Saheer; Fabio Valente

    2011-10-01

    In this paper, we describe recent work at Idiap Research Institute in the domain of multilingual speech processing and provide some insights into emerging challenges for the research community. Multilingual speech processing has been a topic of ongoing interest to the research community for many years and the field is now receiving renewed interest owing to two strong driving forces. Firstly, technical advances in speech recognition and synthesis are posing new challenges and opportunities to researchers. For example, discriminative features are seeing wide application by the speech recognition community, but additional issues arise when using such features in a multilingual setting. Another example is the apparent convergence of speech recognition and speech synthesis technologies in the form of statistical parametric methodologies. This convergence enables the investigation of new approaches to unified modelling for automatic speech recognition and text-to-speech synthesis (TTS) as well as cross-lingual speaker adaptation for TTS. The second driving force is the impetus being provided by both government and industry for technologies to help break down domestic and international language barriers, these also being barriers to the expansion of policy and commerce. Speech-to-speech and speech-to-text translation are thus emerging as key technologies at the heart of which lies multilingual speech processing.

  20. The Design and Implementation of the Chinese Part-of-speech Automatic Tagging System

    王素格; 张水奎

    2001-01-01

    This paper presents the design and implementation of a Chinese part-of-speech automatic tagging system that combines statistics-based and rule-based tagging methods. It describes the overall structure of the system and the organization of the unambiguous word table, the ambiguous word table, the tag set and the POS tagging rules. The processing and storage of the sparse matrix are described in particular detail.
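
    The abstract does not say how the sparse matrix is stored. As a purely illustrative sketch, tag-bigram counts can be kept in a dictionary of dictionaries so that unseen tag pairs take no storage. The class `SparseCounts`, the helper `pick_tag`, and the tag names below are hypothetical and are not taken from the paper.

```python
from collections import defaultdict


class SparseCounts:
    """Tag-bigram counts stored only for pairs that were actually observed."""

    def __init__(self):
        self._rows = defaultdict(dict)  # prev_tag -> {tag: count}

    def add(self, prev_tag, tag, n=1):
        row = self._rows[prev_tag]
        row[tag] = row.get(tag, 0) + n

    def get(self, prev_tag, tag):
        # Unseen pairs are implicit zeros; .get() does not create new rows.
        return self._rows.get(prev_tag, {}).get(tag, 0)

    def stored_cells(self):
        return sum(len(row) for row in self._rows.values())


def pick_tag(counts, prev_tag, candidate_tags):
    """Statistical step: choose the candidate seen most often after prev_tag."""
    return max(candidate_tags, key=lambda t: counts.get(prev_tag, t))


# Hypothetical tags: "d" adverb, "u" auxiliary, "n" noun, "v" verb.
counts = SparseCounts()
for prev, tag, n in [("d", "v", 12), ("d", "n", 1), ("u", "n", 9)]:
    counts.add(prev, tag, n)

after_adverb = pick_tag(counts, "d", ["n", "v"])
after_auxiliary = pick_tag(counts, "u", ["n", "v"])
```

    A rule-based step, as in the combined approach the abstract describes, could then override the statistical choice for specific contexts.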

  1. Fully Automated Assessment of the Severity of Parkinson's Disease from Speech.

    Bayestehtashk, Alireza; Asgari, Meysam; Shafran, Izhak; McNames, James

    2015-01-01

    For several decades now, there has been sporadic interest in automatically characterizing the speech impairment due to Parkinson's disease (PD). Most early studies were confined to quantifying a few speech features that were easy to compute. More recent studies have adopted a machine learning approach where a large number of potential features are extracted and the models are learned automatically from the data. In the same vein, here we characterize the disease using a relatively large cohort of 168 subjects, collected from multiple (three) clinics. We elicited speech using three tasks - the sustained phonation task, the diadochokinetic task and a reading task, all within a time budget of 4 minutes, prompted by a portable device. From these recordings, we extracted 1582 features for each subject using openSMILE, a standard feature extraction tool. We compared the effectiveness of three strategies for learning a regularized regression and find that ridge regression performs better than lasso and support vector regression for our task. We refine the feature extraction to capture pitch-related cues, including jitter and shimmer, more accurately using a time-varying harmonic model of speech. Our results show that the severity of the disease can be inferred from speech with a mean absolute error of about 5.5, explaining 61% of the variance and consistently well-above chance across all clinics. Of the three speech elicitation tasks, we find that the reading task is significantly better at capturing cues than diadochokinetic or sustained phonation task. In all, we have demonstrated that the data collection and inference can be fully automated, and the results show that speech-based assessment has promising practical application in PD. The techniques reported here are more widely applicable to other paralinguistic tasks in clinical domain. PMID:25382935
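
    The two figures of merit quoted above, mean absolute error and explained variance, can be computed as follows. This is a minimal sketch: it assumes the explained variance is the coefficient of determination (1 - SS_res / SS_tot), which the abstract does not state, and the severity scores are invented for illustration only.

```python
def mean_absolute_error(y_true, y_pred):
    """Average magnitude of the prediction errors."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def explained_variance(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Invented clinical scores and model predictions, for illustration only.
y_true = [10, 20, 30, 40]
y_pred = [12, 18, 33, 37]

mae = mean_absolute_error(y_true, y_pred)
r2 = explained_variance(y_true, y_pred)
```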

  2. Speech Recognition on Mobile Devices

    Tan, Zheng-Hua; Lindberg, Børge

    2010-01-01

    The enthusiasm of deploying automatic speech recognition (ASR) on mobile devices is driven both by remarkable advances in ASR technology and by the demand for efficient user interfaces on such devices as mobile phones and personal digital assistants (PDAs). This chapter presents an overview of ASR...

  3. Epoch-based analysis of speech signals

    B Yegnanarayana; Suryakanth V Gangashetty

    2011-10-01

    Speech analysis is traditionally performed using short-time analysis to extract features in time and frequency domains. The window size for the analysis is fixed somewhat arbitrarily, mainly to account for the time varying vocal tract system during production. However, speech in its primary mode of excitation is produced due to impulse-like excitation in each glottal cycle. Anchoring the speech analysis around the glottal closure instants (epochs) yields significant benefits for speech analysis. Epoch-based analysis of speech helps not only to segment the speech signals based on speech production characteristics, but also helps in accurate analysis of speech. It enables extraction of important acoustic-phonetic features such as glottal vibrations, formants, instantaneous fundamental frequency, etc. Epoch sequence is useful to manipulate prosody in speech synthesis applications. Accurate estimation of epochs helps in characterizing voice quality features. Epoch extraction also helps in speech enhancement and multispeaker separation. In this tutorial article, the importance of epochs for speech analysis is discussed, and methods to extract the epoch information are reviewed. Applications of epoch extraction for some speech applications are demonstrated.
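
    One of the quantities mentioned above, the instantaneous fundamental frequency, follows directly from the epoch locations: it is the reciprocal of the interval between successive epochs. The sketch below assumes the epochs are already available as sample indices; the function name and the example values are hypothetical, and epoch extraction itself is not shown.

```python
def instantaneous_f0(epoch_samples, sample_rate):
    """Return one F0 estimate (Hz) per pair of consecutive epochs.

    epoch_samples: increasing sample indices of glottal closure instants.
    sample_rate:   sampling frequency in Hz.
    """
    return [
        sample_rate / (later - earlier)
        for earlier, later in zip(epoch_samples, epoch_samples[1:])
    ]


# Hypothetical epochs at 8 kHz: two 10 ms cycles followed by two 8 ms cycles.
f0_track = instantaneous_f0([0, 80, 160, 224, 288], 8000)
```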

  4. Automatic translation among spoken languages

    Walter, Sharon M.; Costigan, Kelly

    1994-02-01

    The Machine Aided Voice Translation (MAVT) system was developed in response to the shortage of experienced military field interrogators with both foreign language proficiency and interrogation skills. Combining speech recognition, machine translation, and speech generation technologies, the MAVT accepts an interrogator's spoken English question and translates it into spoken Spanish. The spoken Spanish response of the potential informant can then be translated into spoken English. Potential military and civilian applications for automatic spoken language translation technology are discussed in this paper.

  5. Cued Speech: A visual communication mode for the Deaf society

    Heracleous, Panikos; Beautemps, Denis

    2010-01-01

    Cued Speech is a visual mode of communication that uses handshapes and placements in combination with the mouth movements of speech to make the phonemes of a spoken language look different from each other and clearly understandable to deaf individuals. The aim of Cued Speech is to overcome the problems of lip reading and thus enable deaf persons to wholly understand spoken language. In this study, automatic phoneme recognition in Cued Speech for French based on hidden Markov models (HMMs) is i...

  6. Perception of Speech Sounds in School-Aged Children with Speech Sound Disorders.

    Preston, Jonathan L; Irwin, Julia R; Turcios, Jacqueline

    2015-11-01

    Children with speech sound disorders may perceive speech differently than children with typical speech development. The nature of these speech differences is reviewed with an emphasis on assessing phoneme-specific perception for speech sounds that are produced in error. Category goodness judgment, or the ability to judge accurate and inaccurate tokens of speech sounds, plays an important role in phonological development. The software Speech Assessment and Interactive Learning System, which has been effectively used to assess preschoolers' ability to perform goodness judgments, is explored for school-aged children with residual speech errors (RSEs). However, data suggest that this particular task may not be sensitive to perceptual differences in school-aged children. The need for the development of clinical tools for assessment of speech perception in school-aged children with RSE is highlighted, and clinical suggestions are provided. PMID:26458198

  7. Speech characteristics in depression.

    Stassen, H H; Bomben, G; Günther, E

    1991-01-01

    This study examined the relationship between speech characteristics and psychopathology throughout the course of affective disturbances. Our sample comprised 20 depressive, hospitalized patients who had been selected according to the following criteria: (1) first admission; (2) long-term patient; (3) early entry into study; (4) late entry into study; (5) low scorer; (6) high scorer, and (7) distinct retarded-depressive symptomatology. Since our principal goal was to model the course of affective disturbances in terms of speech parameters, a total of 6 repeated measurements had been carried out over a 2-week period, including 3 different psychopathological instruments and speech recordings from automatic speech as well as from reading out loud. It turned out that neither applicability nor efficiency of single-parameter models depend in any way on the given, clinically defined subgroups. On the other hand, however, no significant differences between the clinically defined subgroups showed up with regard to basic speech parameters, except for the fact that low scorers seemed to take their time when producing utterances (this in contrast to all other patients who, on the average, had a considerably shorter recording time). As to the relationship between psychopathology and speech parameters over time, we found significant correlations: (1) in 60% of cases between the apathic syndrome and energy/dynamics; (2) in 50% of cases between the retarded-depressive syndrome and energy/dynamics; (3) in 45% of cases between the apathic syndrome and mean vocal pitch, and (4) in 71% of low scores between the somatic-depressive syndrome and time duration of pauses. All in all, single parameter models turned out to cover only specific aspects of the individual courses of affective disturbances, thus speaking against a simple approach which applies in general. PMID:1886971

  8. The Phase Spectra Based Feature for Robust Speech Recognition

    Abbasian ALI

    2009-07-01

    Speech recognition in adverse environments is one of the major issues in automatic speech recognition today. Most current speech recognition systems are highly efficient in ideal environments, but their performance drops sharply in real environments because the speech is affected by noise. This paper proposes a new feature representation based on phase spectra and Perceptual Linear Prediction (PLP) that can be used for robust speech recognition. It is shown that the new features improve speech recognition performance over PLP features, not only in clean conditions but also at various noise levels.

  9. Speech coding

    Ravishankar, C., Hughes Network Systems, Germantown, MD

    1998-05-08

    Speech is the predominant means of communication between human beings, and since the invention of the telephone by Alexander Graham Bell in 1876, speech services have remained the core service in almost all telecommunication systems. Original analog methods of telephony had the disadvantage of the speech signal getting corrupted by noise, cross-talk and distortion. Long-haul transmissions, which use repeaters to compensate for the loss in signal strength on transmission links, also increase the associated noise and distortion. Digital transmission, on the other hand, is relatively immune to noise, cross-talk and distortion, primarily because of the capability to faithfully regenerate the digital signal at each repeater purely based on a binary decision. Hence the end-to-end performance of the digital link becomes essentially independent of the length and operating frequency bands of the link. From a transmission point of view, digital transmission has therefore been the preferred approach due to its higher immunity to noise. The need to carry digital speech became extremely important from a service provision point of view as well. Modern requirements have introduced the need for robust, flexible and secure services that can carry a multitude of signal types (such as voice, data and video) without a fundamental change in infrastructure. Such a requirement could not have been easily met without the advent of digital transmission systems, thereby requiring speech to be coded digitally. The term speech coding often refers to techniques that represent or code speech signals either directly as a waveform or as a set of parameters obtained by analyzing the speech signal. In either case, the codes are transmitted to the distant end, where speech is reconstructed or synthesized using the received set of codes. A more generic term that is applicable to these techniques and is often used interchangeably with speech coding is voice coding. This term is more generic in the sense that the

  10. HUMAN SPEECH EMOTION RECOGNITION

    Maheshwari Selvaraj

    2016-02-01

    Emotions play an extremely important role in human mental life. They are a medium for expressing one’s perspective or mental state to others. Speech Emotion Recognition (SER) can be defined as the extraction of the emotional state of the speaker from his or her speech signal. There are a few universal emotions, including Neutral, Anger, Happiness and Sadness, which any intelligent system with finite computational resources can be trained to identify or synthesize as required. In this work, spectral and prosodic features are used for speech emotion recognition because both contain emotional information. Mel-frequency cepstral coefficients (MFCC) are one of the spectral features. Fundamental frequency, loudness, pitch, speech intensity and glottal parameters are the prosodic features used to model different emotions. The potential features are extracted from each utterance for the computational mapping between emotions and speech patterns. Pitch can be detected from the selected features and used to classify gender. A Support Vector Machine (SVM) is used to classify gender in this work. A Radial Basis Function network and a Back Propagation Network are used to recognize the emotions based on the selected features, and the radial basis function network is shown to produce more accurate results for emotion recognition than the back propagation network.

  11. Speech recognition from spectral dynamics

    Hynek Hermansky

    2011-10-01

    Information is carried in changes of a signal. The paper starts by revisiting Dudley’s concept of the carrier nature of speech. It points to its close connection to modulation spectra of speech and argues against short-term spectral envelopes as dominant carriers of the linguistic information in speech. The history of spectral representations of speech is briefly discussed. Some of the history of the gradual infusion of the modulation spectrum concept into automatic recognition of speech (ASR) comes next, pointing to the relationship of modulation spectrum processing to well-accepted ASR techniques such as dynamic speech features or RelAtive SpecTrAl (RASTA) filtering. Next, the frequency domain perceptual linear prediction technique for deriving autoregressive models of temporal trajectories of spectral power in individual frequency bands is reviewed. Finally, posterior-based features, which allow for straightforward application of modulation frequency domain information, are described. The paper is tutorial in nature, aims at a historical global overview of attempts to use spectral dynamics in machine recognition of speech, and does not always provide enough detail of the described techniques. However, extensive references to earlier work are provided to compensate for the lack of detail in the paper.
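
    The dynamic speech features mentioned above are commonly computed as regression-based delta coefficients over the temporal trajectory of a feature. The sketch below uses that widely used formula, d[t] = sum_k k (c[t+k] - c[t-k]) / (2 sum_k k^2), as an illustration; it is not taken from the paper, and the window size and the repetition of edge frames are assumptions.

```python
def delta(trajectory, n=2):
    """Regression-based delta of one feature's trajectory over time.

    trajectory: values of a single feature, one per frame.
    n:          number of frames on each side used in the regression.
    Edge frames are repeated where the window runs past the ends.
    """
    denom = 2 * sum(k * k for k in range(1, n + 1))
    last = len(trajectory) - 1
    out = []
    for t in range(len(trajectory)):
        num = 0.0
        for k in range(1, n + 1):
            ahead = trajectory[min(t + k, last)]
            behind = trajectory[max(t - k, 0)]
            num += k * (ahead - behind)
        out.append(num / denom)
    return out


ramp_delta = delta([0, 1, 2, 3, 4, 5, 6])  # slope of one unit per frame
flat_delta = delta([5, 5, 5, 5])           # no change over time
```

    Away from the edges a linear ramp yields a delta equal to its slope, and a constant trajectory yields zero, which is the sense in which these features capture change rather than level.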

  12. Automatic differentiation bibliography

    Corliss, G.F. (comp.)

    1992-07-01

    This is a bibliography of work related to automatic differentiation. Automatic differentiation is a technique for the fast, accurate propagation of derivative values using the chain rule. It is neither symbolic nor numeric. Automatic differentiation is a fundamental tool for scientific computation, with applications in optimization, nonlinear equations, nonlinear least squares approximation, stiff ordinary differential equation, partial differential equations, continuation methods, and sensitivity analysis. This report is an updated version of the bibliography which originally appeared in Automatic Differentiation of Algorithms: Theory, Implementation, and Application.
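
    The idea of propagating derivative values with the chain rule can be illustrated with forward-mode automatic differentiation on dual numbers. This is a minimal sketch supporting only addition, multiplication and sine; the class and function names are hypothetical and do not come from the bibliography.

```python
import math


class Dual:
    """A value paired with its derivative with respect to one input."""

    def __init__(self, val, der=0.0):
        self.val, self.der = val, der

    def __add__(self, other):
        other = _lift(other)
        return Dual(self.val + other.val, self.der + other.der)

    __radd__ = __add__

    def __mul__(self, other):
        other = _lift(other)
        # Product rule: (uv)' = u'v + uv'
        return Dual(self.val * other.val,
                    self.der * other.val + self.val * other.der)

    __rmul__ = __mul__


def _lift(x):
    """Treat plain numbers as constants (derivative zero)."""
    return x if isinstance(x, Dual) else Dual(x)


def sin(x):
    # Chain rule: (sin u)' = cos(u) * u'
    return Dual(math.sin(x.val), math.cos(x.val) * x.der)


def derivative(f, x):
    """Evaluate f'(x) by seeding the input with derivative 1."""
    return f(Dual(x, 1.0)).der


slope = derivative(lambda x: x * x * x + 2 * x, 2.0)   # 3x^2 + 2 at x = 2
slope_sin = derivative(lambda x: x * sin(x), 1.0)      # sin(x) + x cos(x) at x = 1
```

    No symbolic expression is built and no finite difference is taken: each operation carries its exact local derivative forward, which is why the technique is described as neither symbolic nor numeric.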

  13. Predicting Speech Intelligibility

    HINES, ANDREW

    2012-01-01

    Hearing impairment, and specifically sensorineural hearing loss, is an increasingly prevalent condition, especially amongst the ageing population. It occurs primarily as a result of damage to hair cells that act as sound receptors in the inner ear and causes a variety of hearing perception problems, most notably a reduction in speech intelligibility. Accurate diagnosis of hearing impairments is a time consuming process and is complicated by the reliance on indirect measurements based on patie...

  14. Speech enhancement with LSTM recurrent neural networks and its application to noise-robust ASR

    Weninger, Felix; Erdogan, Hakan; Watanabe, Shinji; Vincent, Emmanuel; Le Roux, Jonathan; Hershey, John R.; Schuller, Björn

    2015-01-01

    We evaluate some recent developments in recurrent neural network (RNN) based speech enhancement in the light of noise-robust automatic speech recognition (ASR). The proposed framework is based on Long Short-Term Memory (LSTM) RNNs which are discriminatively trained according to an optimal speech reconstruction objective. We demonstrate that LSTM speech enhancement, even when used 'naïvely' as front-end processing, delivers competitive results on the CHiME-2 speech recognition task. Furtherm...

  15. An articulatorily constrained, maximum entropy approach to speech recognition and speech coding

    Hogden, J.

    1996-12-31

    Hidden Markov models (HMMs) are among the most popular tools for performing computer speech recognition. One of the primary reasons that HMMs typically outperform other speech recognition techniques is that the parameters used for recognition are determined by the data, not by preconceived notions of what the parameters should be. This makes HMMs better able to deal with intra- and inter-speaker variability despite the limited knowledge of how speech signals vary and despite the often limited ability to correctly formulate rules describing variability and invariance in speech. In fact, it is often the case that when HMM parameter values are constrained using the limited knowledge of speech, recognition performance decreases. However, the structure of an HMM has little in common with the mechanisms underlying speech production. Here, the author argues that by using probabilistic models that more accurately embody the process of speech production, he can create models that have all the advantages of HMMs, but that should more accurately capture the statistical properties of real speech samples, presumably leading to more accurate speech recognition. The model he will discuss uses the fact that speech articulators move smoothly and continuously. Before discussing how to use articulatory constraints, he will give a brief description of HMMs. This will allow him to highlight the similarities and differences between HMMs and the proposed technique.
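
    As background to the HMMs discussed above, the likelihood of an observation sequence under a discrete HMM can be computed with the standard forward algorithm. The sketch below shows only that baseline, not the articulatorily constrained model proposed in the report; the two-state model and its probabilities are invented for illustration.

```python
def forward_likelihood(obs, init, trans, emit):
    """P(obs) under a discrete HMM, summed over all hidden state paths.

    obs:   list of observation symbol indices.
    init:  init[s] = probability of starting in state s.
    trans: trans[p][s] = probability of moving from state p to state s.
    emit:  emit[s][o] = probability of state s emitting symbol o.
    """
    n = len(init)
    alpha = [init[s] * emit[s][obs[0]] for s in range(n)]
    for o in obs[1:]:
        alpha = [
            sum(alpha[p] * trans[p][s] for p in range(n)) * emit[s][o]
            for s in range(n)
        ]
    return sum(alpha)


# Invented two-state, two-symbol model.
init = [0.6, 0.4]
trans = [[0.7, 0.3], [0.4, 0.6]]
emit = [[0.9, 0.1], [0.2, 0.8]]

likelihood = forward_likelihood([0, 1], init, trans, emit)

# Sanity check: the likelihoods of all length-2 sequences sum to one.
total = sum(forward_likelihood([a, b], init, trans, emit)
            for a in (0, 1) for b in (0, 1))
```

    All three parameter tables are estimated from data in practice, which is the property the abstract credits for the robustness of HMMs.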

  16. Phonetic Alphabet for Speech Recognition of Czech

    J. Uhlir; Psutka, J.; J. Nouza

    1997-01-01

    In the paper we introduce and discuss an alphabet that has been proposed for phonemically oriented automatic speech recognition. The alphabet, denoted PAC (Phonetic Alphabet for Czech), consists of 48 basic symbols that allow for distinguishing all major events occurring in spoken Czech. The symbols can be used both for phonetic transcription of Czech texts and for labeling recorded speech signals. For practical reasons, the alphabet occurs in two versions; one utilizes Cze...

  17. Unsupervised Topic Adaptation for Lecture Speech Retrieval

    Fujii, Atsushi; Itou, Katunobu; Akiba, Tomoyosi; Ishikawa, Tetsuya

    2004-01-01

    We are developing a cross-media information retrieval system, in which users can view specific segments of lecture videos by submitting text queries. To produce a text index, the audio track is extracted from a lecture video and a transcription is generated by automatic speech recognition. In this paper, to improve the quality of our retrieval system, we extensively investigate the effects of adapting acoustic and language models on speech recognition. We perform an MLLR-based method to adapt...

  18. Hate speech

    Anne Birgitta Nilsen

    2014-03-01

    The manifesto of the Norwegian terrorist Anders Behring Breivik is based on the “Eurabia” conspiracy theory. This theory is a key starting point for hate speech amongst many right-wing extremists in Europe, but also has ramifications beyond these environments. In brief, proponents of the Eurabia theory claim that Muslims are occupying Europe and destroying Western culture, with the assistance of the EU and European governments. By contrast, members of Al-Qaeda and other extreme Islamists promote the conspiracy theory “the Crusade” in their hate speech directed against the West. Proponents of the latter theory argue that the West is leading a crusade to eradicate Islam and Muslims, a crusade that is similarly facilitated by their governments. This article presents analyses of texts written by right-wing extremists and Muslim extremists in an effort to shed light on how hate speech promulgates conspiracy theories in order to spread hatred and intolerance. The aim of the article is to contribute to a more thorough understanding of hate speech’s nature by applying rhetorical analysis. Rhetorical analysis is chosen because it offers a means of understanding the persuasive power of speech. It is thus a suitable tool to describe how hate speech works to convince and persuade. The concepts from rhetorical theory used in this article are ethos, logos and pathos. The concept of ethos is used to pinpoint factors that contributed to Osama bin Laden's impact, namely factors that lent credibility to his promotion of the conspiracy theory of the Crusade. In particular, Bin Laden projected common sense, good morals and good will towards his audience. He seemed to have coherent and relevant arguments; he appeared to possess moral credibility; and his use of language demonstrated that he wanted the best for his audience. The concept of pathos is used to define hate speech, since hate speech targets its audience's emotions. In hate speech it is the

  19. Recognizing intentions in infant-directed speech: evidence for universals.

    Bryant, Gregory A; Barrett, H Clark

    2007-08-01

    In all languages studied to date, distinct prosodic contours characterize different intention categories of infant-directed (ID) speech. This vocal behavior likely exists universally as a species-typical trait, but little research has examined whether listeners can accurately recognize intentions in ID speech using only vocal cues, without access to semantic information. We recorded native-English-speaking mothers producing four intention categories of utterances (prohibition, approval, comfort, and attention) as both ID and adult-directed (AD) speech, and we then presented the utterances to Shuar adults (South American hunter-horticulturalists). Shuar subjects were able to reliably distinguish ID from AD speech and were able to reliably recognize the intention categories in both types of speech, although performance was significantly better with ID speech. This is the first demonstration that adult listeners in an indigenous, nonindustrialized, and nonliterate culture can accurately infer intentions from both ID speech and AD speech in a language they do not speak. PMID:17680948

  20. Strategies for distant speech recognition in reverberant environments

    Delcroix, Marc; Yoshioka, Takuya; Ogawa, Atsunori; Kubo, Yotaro; Fujimoto, Masakiyo; Ito, Nobutaka; Kinoshita, Keisuke; Espi, Miquel; Araki, Shoko; Hori, Takaaki; Nakatani, Tomohiro

    2015-12-01

    Reverberation and noise are known to severely affect the automatic speech recognition (ASR) performance of speech recorded by distant microphones. Therefore, we must deal with reverberation if we are to realize high-performance hands-free speech recognition. In this paper, we review a recognition system that we developed at our laboratory to deal with reverberant speech. The system consists of a speech enhancement (SE) front-end that employs long-term linear prediction-based dereverberation followed by noise reduction. We combine our SE front-end with an ASR back-end that uses neural networks for acoustic and language modeling. The proposed system achieved top scores on the ASR task of the REVERB challenge. This paper describes the different technologies used in our system and presents detailed experimental results that justify our implementation choices and may provide hints for designing distant ASR systems.

  1. Speech Enhancement

    Benesty, Jacob; Jensen, Jesper Rindom; Christensen, Mads Græsbøll;

    Speech enhancement is a classical problem in signal processing, yet still largely unsolved. Two of the conventional approaches for solving this problem are linear filtering, like the classical Wiener filter, and subspace methods. These approaches have traditionally been treated as different classes ... and their performance bounded and assessed in terms of noise reduction and speech distortion. The book shows how various filter designs can be obtained in this framework, including the maximum SNR, Wiener, LCMV, and MVDR filters, and how these can be applied in various contexts, like in single...
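
    The classical Wiener filter named above has a simple textbook form in the frequency domain when speech and noise are uncorrelated and their power spectra are known: the gain in each bin is S / (S + N). The sketch below illustrates only that form, with invented power values; it is not the framework developed in the book.

```python
def wiener_gain(speech_power, noise_power):
    """Per-bin gain H = S / (S + N) for uncorrelated speech and noise.

    speech_power, noise_power: power per frequency bin (assumed known here;
    in practice both have to be estimated from the noisy signal).
    """
    return [s / (s + n) for s, n in zip(speech_power, noise_power)]


# Invented powers: a speech-dominated bin, an equal-power bin, a noise-only bin.
gains = wiener_gain([9.0, 1.0, 0.0], [1.0, 1.0, 1.0])
```

    Bins dominated by speech pass almost unchanged while noise-only bins are suppressed, which is the trade-off between noise reduction and speech distortion that the abstract refers to.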

  2. Speech enhancement

    Benesty, Jacob; Chen, Jingdong

    2006-01-01

    We live in a noisy world! In all applications (telecommunications, hands-free communications, recording, human-machine interfaces, etc.) that require at least one microphone, the signal of interest is usually contaminated by noise and reverberation. As a result, the microphone signal has to be "cleaned" with digital signal processing tools before it is played out, transmitted, or stored. This book is about speech enhancement. Different well-known and state-of-the-art methods for noise reduction, with one or multiple microphones, are discussed. By speech enhancement, we mean not only noise red

  3. Speech-enabled Computer-aided Translation

    Mesa-Lao, Bartolomé

    2014-01-01

    The present study has surveyed post-editor trainees’ views and attitudes before and after the introduction of speech technology as a front end to a computer-aided translation workbench. The aim of the survey was (i) to identify attitudes and perceptions among post-editor trainees before performing a post-editing task using automatic speech recognition (ASR); and (ii) to assess the degree to which post-editors’ attitudes and expectations to the use of speech technology changed after actually using it. The survey was based on two questionnaires: the first one administered before the...

  4. Towards robust speech acquisition using sensor arrays

    Maganti, Hari Krishna

    2007-01-01

    An integrated system approach was developed to address the problem of distant speech acquisition in multi-party meetings, using multiple microphones and cameras. Microphone array processing techniques have presented a potential alternative to close-talking microphones by providing speech enhancement through spatial filtering and directional discrimination. These techniques relied on accurate speaker locations for optimal performance. Tracking accurate speaker locations solely based on audio w...

  5. Toward Speech and Nonverbal Behaviors Integration for Humanoid Robot

    Wei Wang; Xiaodan Huang

    2012-01-01

    It is essential to integrate speeches and nonverbal behaviors for a humanoid robot in human‐robot interaction. This paper presents an approach using multi‐object genetic algorithm to match the speeches and behaviors automatically. Firstly, with humanoid robot’s emotion status, we construct a hierarchical structure to link voice characteristics and nonverbal behaviors. Secondly, these behaviors corresponding to speeches are matched and integrated into an action sequence based on genetic algori...

  6. The Role of Visual Spatial Attention in Audiovisual Speech Perception

    Andersen, Tobias; Tiippana, K.; Laarni, J.; Kojo, I.; Sams, M.

    2008-01-01

    Auditory and visual information is integrated when perceiving speech, as evidenced by the McGurk effect in which viewing an incongruent talking face categorically alters auditory speech perception. Audiovisual integration in speech perception has long been considered automatic and pre-attentive but recent reports have challenged this view. Here we study the effect of visual spatial attention on the McGurk effect. By presenting a movie of two faces symmetrically displaced to each side of a cen...

  7. The Use of Speech Recognition Technology in Automotive Applications

    Gellatly, Andrew William

    1997-01-01

    The research objectives were (1) to perform a detailed review of the literature on speech recognition technology and the attentional demands of driving; (2) to develop decision tools that assist designers of in-vehicle systems; (3) to experimentally examine automatic speech recognition (ASR) design parameters, input modalities, and driver ages; and (4) to provide human factors recommendations for the use of speech recognition technology in automotive applicatio...

  8. Language and Speech Processing

    Mariani, Joseph

    2008-01-01

    Speech processing addresses various scientific and technological areas. It includes speech analysis and variable rate coding, in order to store or transmit speech. It also covers speech synthesis, especially from text, speech recognition, including speaker and language identification, and spoken language understanding. This book covers the following topics: how to realize speech production and perception systems, how to synthesize and understand speech using state-of-the-art methods in signal processing, pattern recognition, stochastic modelling, computational linguistics and human factor studi

  9. Speech Clarity Index (Ψ): A Distance-Based Speech Quality Indicator and Recognition Rate Prediction for Dysarthric Speakers with Cerebral Palsy

    Kayasith, Prakasith; Theeramunkong, Thanaruk

    It is a tedious and subjective task to measure the severity of dysarthria by manually evaluating a speaker's speech using available standard assessment methods based on human perception. This paper presents an automated approach to assess the speech quality of a dysarthric speaker with cerebral palsy. With the consideration of two complementary factors, speech consistency and speech distinction, a speech quality indicator called the speech clarity index (Ψ) is proposed as a measure of the speaker's ability to produce a consistent speech signal for a certain word and distinguishable speech signals for different words. As an application, it can be used to assess speech quality and forecast the speech recognition rate of speech made by an individual dysarthric speaker before the actual exhaustive implementation of an automatic speech recognition system for the speaker. The effectiveness of Ψ as a speech recognition rate predictor is evaluated by rank-order inconsistency, correlation coefficient, and root-mean-square of difference. The evaluations were done by comparing its predicted recognition rates with those predicted by the standard methods, the articulatory and intelligibility tests, based on two recognition systems (HMM and ANN). The results show that Ψ is a promising indicator for predicting the recognition rate of dysarthric speech. All experiments were done on a speech corpus composed of speech data from eight normal speakers and eight dysarthric speakers.
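
Two of the evaluation measures named in this abstract, the correlation coefficient and the root-mean-square of difference, can be sketched as follows. This is only an illustration under assumptions: the function names are hypothetical, the Pearson form of the correlation coefficient is assumed, and the rank-order inconsistency measure is omitted because the abstract does not define it.

```python
import math

def pearson_correlation(predicted, actual):
    """Pearson correlation between predicted and actual recognition rates."""
    n = len(predicted)
    mean_p = sum(predicted) / n
    mean_a = sum(actual) / n
    cov = sum((p - mean_p) * (a - mean_a) for p, a in zip(predicted, actual))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_a = sum((a - mean_a) ** 2 for a in actual)
    return cov / math.sqrt(var_p * var_a)

def rms_difference(predicted, actual):
    """Root-mean-square of the differences between two rate sequences."""
    n = len(predicted)
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / n)
```

A predictor that tracks the measured recognition rates well would give a correlation close to 1 and a small RMS difference.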

  10. Experimental comparison between speech transmission index, rapid speech transmission index, and speech intelligibility index.

    Larm, Petra; Hongisto, Valtteri

    2006-02-01

    During the acoustical design of, e.g., auditoria or open-plan offices, it is important to know how speech can be perceived in various parts of the room. Different objective methods have been developed to measure and predict speech intelligibility, and these have been extensively used in various spaces. In this study, two such methods were compared, the speech transmission index (STI) and the speech intelligibility index (SII). Also the simplification of the STI, the room acoustics speech transmission index (RASTI), was considered. These quantities are all based on determining an apparent speech-to-noise ratio on selected frequency bands and summing them using a specific weighting. For comparison, some data were needed on the possible differences of these methods resulting from the calculation scheme and also measuring equipment. Their prediction accuracy was also of interest. Measurements were made in a laboratory having adjustable noise level and absorption, and in a real auditorium. It was found that the measurement equipment, especially the selection of the loudspeaker, can greatly affect the accuracy of the results. The prediction accuracy of the RASTI was found acceptable, if the input values for the prediction are accurately known, even though the studied space was not ideally diffuse. PMID:16521772
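
The shared idea described here, mapping an apparent speech-to-noise ratio in each frequency band to an index and combining the bands with weights, can be sketched as below. This is a simplified illustration, not the procedure used in the study: the clipping range of ±15 dB and the linear mapping follow the common STI formulation, while the function names and the band weights passed in are placeholders, and the derivation of the apparent ratio from the modulation transfer function is left out.

```python
def transmission_index(apparent_snr_db):
    """Map an apparent speech-to-noise ratio (dB) to an index in [0, 1].

    The ratio is clipped to the range -15..+15 dB and mapped linearly.
    """
    clipped = max(-15.0, min(15.0, apparent_snr_db))
    return (clipped + 15.0) / 30.0

def weighted_band_index(band_snrs_db, band_weights):
    """Weighted average of per-band indices (weights are placeholders)."""
    if len(band_snrs_db) != len(band_weights):
        raise ValueError("one weight is needed per frequency band")
    total_weight = sum(band_weights)
    weighted = sum(w * transmission_index(s)
                   for s, w in zip(band_snrs_db, band_weights))
    return weighted / total_weight
```

With this mapping, a band at or above +15 dB contributes fully, a band at or below -15 dB contributes nothing, and 0 dB sits at the midpoint.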

  11. Speech coding

    Gersho, Allen

    1990-05-01

    Recent advances in algorithms and techniques for speech coding now permit high quality voice reproduction at remarkably low bit rates. The advent of powerful single-chip signal processors has made it cost effective to implement these new and sophisticated speech coding algorithms for many important applications in voice communication and storage. Some of the main ideas underlying the algorithms of major interest today are reviewed. The concept of removing redundancy by linear prediction is reviewed, first in the context of predictive quantization or DPCM. Then linear predictive coding, adaptive predictive coding, and vector quantization are discussed. The concepts of excitation coding via analysis-by-synthesis, vector sum excitation codebooks, and adaptive postfiltering are explained. The main ideas of vector excitation coding (VXC) or code excited linear prediction (CELP) are presented. Finally low-delay VXC coding and phonetic segmentation for VXC are described.
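
The idea of removing redundancy by linear prediction in DPCM can be sketched with a first-order predictor and a uniform quantizer. This is a minimal illustration rather than any coder from the review: the function names, the predictor coefficient of 0.9 and the step size of 0.1 are assumptions chosen for the example.

```python
def dpcm_encode(samples, step=0.1, coeff=0.9):
    """Quantize the prediction error of a first-order linear predictor."""
    codes = []
    previous = 0.0  # previous reconstructed sample, as the decoder sees it
    for x in samples:
        prediction = coeff * previous
        code = round((x - prediction) / step)  # integer quantizer index
        codes.append(code)
        previous = prediction + code * step    # closed-loop reconstruction
    return codes

def dpcm_decode(codes, step=0.1, coeff=0.9):
    """Rebuild the signal by adding each decoded error to the prediction."""
    output = []
    previous = 0.0
    for code in codes:
        previous = coeff * previous + code * step
        output.append(previous)
    return output
```

Because the encoder predicts from its own reconstruction instead of the clean input, quantization error does not accumulate: each decoded sample stays within half a quantizer step of the original.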

  12. Hate speech

    Anne Birgitta Nilsen

    2014-01-01

    The manifesto of the Norwegian terrorist Anders Behring Breivik is based on the “Eurabia” conspiracy theory. This theory is a key starting point for hate speech amongst many right-wing extremists in Europe, but also has ramifications beyond these environments. In brief, proponents of the Eurabia theory claim that Muslims are occupying Europe and destroying Western culture, with the assistance of the EU and European governments. By contrast, members of Al-Qaeda and other extreme Islamists prom...

  13. Emotion Recognition from Persian Speech with Neural Network

    Mina Hamidi

    2012-09-01

    In this paper, we report an effort towards automatic recognition of emotional states from continuous Persian speech. Due to the unavailability of an appropriate database in the Persian language for emotion recognition, we first built a database of emotional speech in Persian. This database consists of 2400 wave clips modulated with anger, disgust, fear, sadness, happiness and normal emotions. Then we extract prosodic features, including features related to the pitch, intensity and global characteristics of the speech signal. Finally, we applied neural networks for automatic recognition of emotion. The resulting average accuracy was about 78%.

  14. Speech and Communication Disorders

    ... or understand speech. Causes include Hearing disorders and deafness Voice problems, such as dysphonia or those caused by cleft lip or palate Speech problems like stuttering Developmental disabilities Learning disorders Autism spectrum disorder Brain injury Stroke Some speech and ...

  15. Speech disorders - children

    ... of speech disorders may disappear on their own. Speech therapy may help with more severe symptoms or speech problems that do not improve. In therapy, the child will learn how to create certain sounds.

  16. Speech Enhancement based on Compressive Sensing Algorithm

    Sulong, Amart; Gunawan, Teddy S.; Khalifa, Othman O.; Chebil, Jalel

    2013-12-01

    Various methods of speech enhancement have been proposed over the years. Accurate speech enhancement design focuses mainly on quality and intelligibility. This work proposes a novel speech enhancement method using compressive sensing (CS), a new paradigm of acquiring signals that is fundamentally different from uniform-rate digitization followed by compression, as often used for transmission or storage. CS can reduce the number of degrees of freedom of a sparse/compressible signal by permitting only certain configurations of the large and zero/small coefficients, and structured sparsity models. CS therefore provides a way of reconstructing a compressed version of the speech in the original signal from only a small number of linear, non-adaptive measurements. The performance of the overall algorithms is evaluated in terms of speech quality using informal listening tests and Perceptual Evaluation of Speech Quality (PESQ). Experimental results show that the CS algorithm performs very well across a wide range of speech tests and gives better noise suppression than conventional approaches without obvious degradation of speech quality.

  17. Automatic sequences

    Haeseler, Friedrich

    2003-01-01

    Automatic sequences are sequences which are produced by a finite automaton. Although they are not random, they may look random. They are complicated in the sense of not being ultimately periodic, and they may look rather complicated in the sense that it may not be easy to name the rule by which the sequence is generated; however, there exists a rule which generates the sequence. The concept of automatic sequences has special applications in algebra, number theory, finite automata and formal languages, and combinatorics on words. The text deals with different aspects of automatic sequences, in particular: a general introduction to automatic sequences; the basic (combinatorial) properties of automatic sequences; the algebraic approach to automatic sequences; and geometric objects related to automatic sequences.
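
As an illustration of a sequence produced by a finite automaton, the sketch below generates the Thue-Morse sequence, a standard example of a 2-automatic sequence. The example is not taken from the book itself; the function names are hypothetical. A two-state automaton reads the binary digits of n, flips its state on every 1, and outputs the final state as the n-th term.

```python
def thue_morse_term(n):
    """Return term n by feeding the binary digits of n to a 2-state automaton."""
    state = 0
    for digit in bin(n)[2:]:   # binary expansion, most significant digit first
        if digit == "1":
            state = 1 - state  # a 1 switches state; a 0 leaves it unchanged
    return state               # the output of each state is its own label

def thue_morse_prefix(length):
    """Return the first `length` terms of the sequence."""
    return [thue_morse_term(n) for n in range(length)]
```

The result is fully determined by a simple rule, yet the sequence is not ultimately periodic, which matches the behaviour the abstract describes.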

  18. Segmentation of the speech signal based on changes in energy distribution in the spectrum

    Jassem, W.; Kudzdela, H.; Domagala, P.

    1983-08-01

    A simple algorithm is proposed for automatic phonetic segmentation of the acoustic speech signal on the MERA 303 desk-top minicomputer. The algorithm is verified with Polish linguistic material spoken by two subjects. The proposed algorithm detects approximately 80 percent of the boundaries between enunciated segments correctly, a result no worse than that obtained using more complex methods. Speech recognition programs are discussed as speech perception models, and the nature of categorical perception of human speech sounds is examined.

  19. Commercial applications of speech interface technology: an industry at the threshold.

    Oberteuffer, J A

    1995-01-01

    Speech interface technology, which includes automatic speech recognition, synthetic speech, and natural language processing, is beginning to have a significant impact on business and personal computer use. Today, powerful and inexpensive microprocessors and improved algorithms are driving commercial applications in computer command, consumer, data entry, speech-to-text, telephone, and voice verification. Robust speaker-independent recognition systems for command and navigation in personal com...

  20. Modeling speech imitation and ecological learning of auditory-motor maps

    Claudia Canevari; Leonardo Badino; Alessandro D'Ausilio; Luciano Fadiga; Giorgio Metta

    2013-01-01

    Classical models of speech consider an antero-posterior distinction between perceptive and productive functions. However, the selective alteration of neural activity in speech motor centers, via transcranial magnetic stimulation, was shown to affect speech discrimination. On the automatic speech recognition (ASR) side, the recognition systems have classically relied solely on acoustic data, achieving rather good performance in optimal listening conditions. The main limitations of current ASR ...

  1. A Research of Speech Emotion Recognition Based on Deep Belief Network and SVM

    Chenchen Huang

    2014-01-01

    Feature extraction is a very important part of speech emotion recognition. To address the feature extraction problem in speech emotion recognition, this paper proposes a new method that uses deep belief networks (DBNs) to extract emotional features from the speech signal automatically. A five-layer DBN is trained to extract speech emotion features, and multiple consecutive frames are incorporated to form a high-dimensional feature. The features produced by the trained DBN are the input of a nonlinear SVM classifier, yielding a multiple-classifier speech emotion recognition system. The speech emotion recognition rate of the system reached 86.5%, which was 7% higher than the original method.

  2. A Comprehensive Noise Robust Speech Parameterization Algorithm Using Wavelet Packet Decomposition-Based Denoising and Speech Feature Representation Techniques

    Kotnik Bojan

    2007-01-01

    This paper concerns the problem of automatic speech recognition in noise-intense and adverse environments. The main goal of the proposed work is the definition, implementation, and evaluation of a novel noise robust speech signal parameterization algorithm. The proposed procedure is based on time-frequency speech signal representation using wavelet packet decomposition. A new modified soft thresholding algorithm based on time-frequency adaptive threshold determination was developed to efficiently reduce the level of additive noise in the input noisy speech signal. A two-stage Gaussian mixture model (GMM)-based classifier was developed to perform speech/nonspeech as well as voiced/unvoiced classification. The adaptive topology of the wavelet packet decomposition tree based on voiced/unvoiced detection was introduced to separately analyze voiced and unvoiced segments of the speech signal. The main feature vector consists of a combination of log-root compressed wavelet packet parameters, and autoregressive parameters. The final output feature vector is produced using a two-staged feature vector postprocessing procedure. In the experimental framework, the noisy speech databases Aurora 2 and Aurora 3 were applied together with corresponding standardized acoustical model training/testing procedures. The automatic speech recognition performance achieved using the proposed noise robust speech parameterization procedure was compared to the standardized mel-frequency cepstral coefficient (MFCC) feature extraction procedures ETSI ES 201 108 and ETSI ES 202 050.
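
For orientation, the standard soft thresholding rule used in wavelet denoising can be sketched as below. This is the textbook rule with a single fixed threshold, not the modified, time-frequency adaptive variant the paper develops, and the function name is hypothetical.

```python
def soft_threshold(coefficients, threshold):
    """Shrink each coefficient toward zero by `threshold`.

    Coefficients whose magnitude does not exceed the threshold are treated
    as noise and set to zero; larger ones keep their sign and lose
    `threshold` in magnitude.
    """
    shrunk = []
    for c in coefficients:
        magnitude = abs(c) - threshold
        if magnitude <= 0:
            shrunk.append(0.0)
        else:
            shrunk.append(magnitude if c > 0 else -magnitude)
    return shrunk
```

In a denoising pipeline this rule would be applied to the wavelet packet coefficients before the signal is reconstructed or features are computed.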

  3. Prediction Method of Speech Recognition Performance Based on HMM-based Speech Synthesis Technique

    Terashima, Ryuta; Yoshimura, Takayoshi; Wakita, Toshihiro; Tokuda, Keiichi; Kitamura, Tadashi

    We describe an efficient method that uses an HMM-based speech synthesis technique as a test pattern generator for evaluating the word recognition rate. The recognition rates of each word and speaker can be evaluated from the synthesized speech by using this method. The parameter generation technique can be formulated as an algorithm that can determine the speech parameter vector sequence O by maximizing P(O|Q,λ) given the model parameter λ and the state sequence Q, under a dynamic acoustic feature constraint. We conducted recognition experiments to illustrate the validity of the method. Approximately 100 speakers were used to train the speaker dependent models for the speech synthesis used in these experiments, and the synthetic speech was generated as the test patterns for the target speech recognizer. As a result, the recognition rate of the HMM-based synthesized speech shows a good correlation with the recognition rate of the actual speech. Furthermore, we find that our method can predict the speaker recognition rate with approximately 2% error on average. Therefore the evaluation of the speaker recognition rate can be performed automatically by using the proposed method.

  4. Effective Prediction of Errors by Non-native Speakers Using Decision Tree for Speech Recognition-Based CALL System

    Wang, Hongcui; Kawahara, Tatsuya

    CALL (Computer Assisted Language Learning) systems using ASR (Automatic Speech Recognition) for second language learning have received increasing interest recently. However, it still remains a challenge to achieve high speech recognition performance, including accurate detection of erroneous utterances by non-native speakers. Conventionally, possible error patterns, based on linguistic knowledge, are added to the lexicon and language model, or the ASR grammar network. However, this approach easily falls in the trade-off of coverage of errors and the increase of perplexity. To solve the problem, we propose a method based on a decision tree to learn effective prediction of errors made by non-native speakers. An experimental evaluation with a number of foreign students learning Japanese shows that the proposed method can effectively generate an ASR grammar network, given a target sentence, to achieve both better coverage of errors and smaller perplexity, resulting in significant improvement in ASR accuracy.

  5. Multi-thread Parallel Speech Recognition for Mobile Applications

    LOJKA Martin

    2014-05-01

    In this paper, the server-based solution of the multi-thread large vocabulary automatic speech recognition engine is described along with the Android OS and HTML5 practical application examples. The basic idea was to make speech recognition available for a full variety of applications for computers and especially for mobile devices. The speech recognition engine should be independent of commercial products and services (where the dictionary could not be modified). Using third-party services could also be a security and privacy problem in specific applications, when the unsecured audio data could not be sent to uncontrolled environments (voice data transferred to servers around the globe). Using our experience with speech recognition applications, we have been able to construct a multi-thread speech recognition server-based solution designed for a simple application interface (API) to a speech recognition engine modified to the specific needs of a particular application.

  6. Phonetic Alphabet for Speech Recognition of Czech

    J. Uhlir

    1997-12-01

    In the paper we introduce and discuss an alphabet that has been proposed for phonemically oriented automatic speech recognition. The alphabet, denoted as PAC (Phonetic Alphabet for Czech), consists of 48 basic symbols that allow for distinguishing all major events occurring in the spoken Czech language. The symbols can be used both for phonetic transcription of Czech texts and for labeling recorded speech signals. For practical reasons, the alphabet occurs in two versions; one utilizes Czech native characters and the other employs symbols similar to those used for English in the DARPA and NIST alphabets.

  7. Sentence Clustering Using Parts-of-Speech

    Richard Khoury

    2012-02-01

    Clustering algorithms are used in many Natural Language Processing (NLP) tasks. They have proven to be popular and effective tools to use to discover groups of similar linguistic items. In this exploratory paper, we propose a new clustering algorithm to automatically cluster together similar sentences based on the sentences’ part-of-speech syntax. The algorithm generates and merges together the clusters using a syntactic similarity metric based on a hierarchical organization of the parts-of-speech. We demonstrate the features of this algorithm by implementing it in a question type classification system, in order to determine the positive or negative impact of different changes to the algorithm.

  8. Post-editing through Speech Recognition

    Mesa-Lao, Bartolomé

    In the past couple of years automatic speech recognition (ASR) software has quietly created a niche for itself in many situations of our lives. Nowadays it can be found at the other end of customer-support hotlines, it is built into operating systems and it is offered as an alternative text...... the most popular computer-aided translation workbenches in the market (i.e. MemoQ) together with one of the most well-known ASR packages (i.e. Dragon Naturally Speaking from Nuance). Two data correction modes will be considered: a) keyboard vs. b) keyboard and speech combined. These two different ways...

  9. Robust coarticulatory modeling for continuous speech recognition

    Schwartz, R.; Chow, Y. L.; Dunham, M. O.; Kimball, O.; Krasner, M.; Kubala, F.; Makhoul, J.; Price, P.; Roucos, S.

    1986-10-01

    The purpose of this project is to perform research into algorithms for the automatic recognition of individual sounds or phonemes in continuous speech. The algorithms developed should be appropriate for understanding large-vocabulary continuous speech input and are to be made available to the Strategic Computing Program for incorporation in a complete word recognition system. This report describes progress to date in developing phonetic models that are appropriate for continuous speech recognition. In continuous speech, the acoustic realization of each phoneme depends heavily on the preceding and following phonemes: a process known as coarticulation. Thus, while there are relatively few phonemes in English (on the order of fifty or so), the number of possible different acoustic realizations is in the thousands. Therefore, to develop high-accuracy recognition algorithms, one may need to develop literally thousands of relatively distinct phonetic models to represent the various phonetic contexts adequately. Developing a large number of models usually necessitates having a large amount of speech to provide reliable estimates of the model parameters. The major contributions of this work are the development of: (1) a simple but powerful formalism for modeling phonemes in context; (2) robust training methods for the reliable estimation of model parameters by utilizing the available speech training data in a maximally effective way; and (3) efficient search strategies for phonetic recognition while maintaining high recognition accuracy.

  10. Genetic Advances in the Study of Speech and Language Disorders

    Newbury, D.F.; Monaco, A P

    2010-01-01

    Developmental speech and language disorders cover a wide range of childhood conditions with overlapping but heterogeneous phenotypes and underlying etiologies. This characteristic heterogeneity hinders accurate diagnosis, can complicate treatment strategies, and causes difficulties in the identification of causal factors. Nonetheless, over the last decade, genetic variants have been identified that may predispose certain individuals to different aspects of speech and language difficul...

  11. Handling of errors for increasing automatic feedback reliability in foreign language prosody learning

    Denis Jouvet

    2013-06-01

    The success of future systems for computer-assisted foreign language learning relies on providing the learner with a personalized diagnosis and relevant corrections of their pronunciation. After a presentation of the problem of reliable automatic prosodic feedback in language learning, we present our work related to the processing of some errors stemming from the learner and from the system itself. The first part deals with the relevant rejection of incorrect entries (for example due to learner’s errors) while being tolerant to non-native speech deviations. The second part focuses on the automatic phonetic segmentation of non-native speech. A detailed analysis showed the benefit of taking into account non-native variants, and led to determining the classes of phonemes whose temporal boundaries are the most accurate and which should be favored in the design of exercises for language learning.

  12. Detection and Separation of Speech Events in Meeting Recordings Using a Microphone Array

    Yamada Miichi

    2007-01-01

    When applying automatic speech recognition (ASR) to meeting recordings including spontaneous speech, the performance of ASR is greatly reduced by the overlap of speech events. In this paper, a method of separating the overlapping speech events by using an adaptive beamforming (ABF) framework is proposed. The main feature of this method is that all the information necessary for the adaptation of ABF, including microphone calibration, is obtained from meeting recordings based on the results of speech-event detection. The performance of the separation is evaluated via ASR using real meeting recordings.

  13. Detection and Separation of Speech Events in Meeting Recordings Using a Microphone Array

    Futoshi Asano

    2007-07-01

    When applying automatic speech recognition (ASR) to meeting recordings including spontaneous speech, the performance of ASR is greatly reduced by the overlap of speech events. In this paper, a method of separating the overlapping speech events by using an adaptive beamforming (ABF) framework is proposed. The main feature of this method is that all the information necessary for the adaptation of ABF, including microphone calibration, is obtained from meeting recordings based on the results of speech-event detection. The performance of the separation is evaluated via ASR using real meeting recordings.

  14. Deep Denoising Auto-encoder for Statistical Speech Synthesis

    Wu, Zhenzhou; Takaki, Shinji; Yamagishi, Junichi

    2015-01-01

    This paper proposes a deep denoising auto-encoder technique to extract better acoustic features for speech synthesis. The technique allows us to automatically extract low-dimensional features from high dimensional spectral features in a non-linear, data-driven, unsupervised way. We compared the new stochastic feature extractor with conventional mel-cepstral analysis in analysis-by-synthesis and text-to-speech experiments. Our results confirm that the proposed method increases the quality of s...

  15. An Agent-based Framework for Speech Investigation

    Walsh, Michael; O'Hare, G.M.P.; Carson-Berndsen, Julie

    2005-01-01

    This paper presents a novel agent-based framework for investigating speech recognition which combines statistical data and explicit phonological knowledge in order to explore strategies aimed at augmenting the performance of automatic speech recognition (ASR) systems. This line of research is motivated by a desire to provide solutions to some of the more notable problems encountered, including in particular the problematic phenomena of coarticulation, underspecified input...

  16. Minimal Pair Distinctions and Intelligibility in Preschool Children with and without Speech Sound Disorders

    Hodge, Megan M.; Gotzke, Carrie L.

    2011-01-01

    Listeners' identification of young children's productions of minimally contrastive words and predictive relationships between accurately identified words and intelligibility scores obtained from a 100-word spontaneous speech sample were determined for 36 children with typically developing speech (TDS) and 36 children with speech sound disorders…

  17. Speech and Language Impairments

    ... easily be mistaken for other disabilities such as autism or learning disabilities, so it’s very important to ensure that the child receives a thorough evaluation by a certified speech-language pathologist. Back to top What Causes Speech ...

  18. Speech impairment (adult)

    ... impairment; Impairment of speech; Inability to speak; Aphasia; Dysarthria; Slurred speech; Dysphonia voice disorders ... in others the condition does not get better. DYSARTHRIA With dysarthria, the person has ongoing difficulty expressing ...

  19. Speech perception as categorization

    Holt, Lori L.; Lotto, Andrew J.

    2010-01-01

    Speech perception (SP) most commonly refers to the perceptual mapping from the highly variable acoustic speech signal to a linguistic representation, whether it be phonemes, diphones, syllables, or words. This is an example of categorization, in that potentially discriminable speech sounds are assigned to functionally equivalent classes. In this tutorial, we present some of the main challenges to our understanding of the categorization of speech sounds and the conceptualization of SP that has...

  20. Accurate Determination of Total Iron in Ores by Mercury-Free Potassium Dichromate Automatic Potentiometric Titration

    赵怀颖; 温宏利; 夏月莲; 巩爱华; 马生凤

    2012-01-01

    Iron ore samples were pretreated by alkali fusion with sodium peroxide (Na2O2), and the total iron content was determined accurately by automatic potentiometric titration. The fusion decomposes the ores completely and without splashing. Four ways of reducing Fe3+ to Fe2+ in the sample solution were examined (SnCl2-HgCl2, SnCl2, TiCl3, and SnCl2-TiCl3); combined reduction with stannous chloride and titanium trichloride (SnCl2-TiCl3) was selected because it not only avoids the use of toxic reagents but also gives a clear potential jump at the end point of the titration. Replacing manual titration with automatic potentiometric titration avoids manual errors, such as those arising from judging the end point by colour and from the analyst's level of experience. The relative error (RE) of the automatic method was 0.13% and the relative standard deviation (RSD) was 0.22%, better than manual titration. The resulting SnCl2-TiCl3-K2Cr2O7 automatic potentiometric titration method was applied to six national standard reference iron ore samples with iron content higher than 30%, giving RE < 0.2% and RSD < 0.3% (n = 10). The vanadium-titanium magnetite samples GBW07226a and GBW07224 can be determined directly without separation. The sample decomposition is simple and fast, and the method is suited to iron ore analyses requiring high accuracy, especially for samples with high iron content.

  1. Speech Recognition for Dental Electronic Health Record

    Nagy, Miroslav; Hanzlíček, Petr; Zvárová, Jana; Dostálová, T.; Seydlová, M.; Hippmann, R.; Smidl, L.; Trmal, J.; Psutka, J.

    Brno: VUTIUM Press, 2008 - (Jan, J.; Kozumplík, J.; Provazník, I.). s. 47-47 ISBN 978-80-214-3612-1. [Biosignal 2008. International EURASIP Conference /19./. 29.06.2008-01.07.2008, Brno] Institutional research plan: CEZ:AV0Z10300504 Keywords : automatic speech recognition * electronic health record * dental medicine Subject RIV: IN - Informatics, Computer Science

  2. Auto Spell Suggestion for High Quality Speech Synthesis in Hindi

    Kabra, Shikha; Agarwal, Ritika

    2014-02-01

    The goal of Text-to-Speech (TTS) synthesis in a particular language is to convert arbitrary input text to intelligible and natural-sounding speech. However, for a language like Hindi, in which many words have very close spellings and are easily confused, it is not easy to identify errors in the input text, and incorrect text degrades the quality of the output speech. This paper therefore contributes to the development of high-quality speech synthesis by adding a spellchecker that automatically generates spelling suggestions for misspelled words. The spellchecker would increase the efficiency of speech synthesis by providing spelling suggestions for incorrect input text. Furthermore, we provide a comparative study evaluating the effect on the phonetic text of adding the spellchecker to the input text.

  3. The role of visual spatial attention in audiovisual speech perception

    Andersen, Tobias; Tiippana, K.; Laarni, J.;

    2009-01-01

    Auditory and visual information is integrated when perceiving speech, as evidenced by the McGurk effect in which viewing an incongruent talking face categorically alters auditory speech perception. Audiovisual integration in speech perception has long been considered automatic and pre-attentive but recent reports have challenged this view. Here we study the effect of visual spatial attention on the McGurk effect. By presenting a movie of two faces symmetrically displaced to each side of a central fixation point and dubbed with a single auditory speech track, we were able to discern the influences from each of the faces and from the voice on the auditory speech percept. We found that directing visual spatial attention towards a face increased the influence of that face on auditory perception. However, the influence of the voice on auditory perception did not change suggesting that audiovisual...

  4. Speech-Language Pathologists

    ... Speech-Language Pathologists. Summary: What They ... workers and occupations. What Speech-Language Pathologists Do: Speech-language pathologists ...

  5. Talking Speech Input.

    Berliss-Vincent, Jane; Whitford, Gigi

    2002-01-01

    This article presents both the factors involved in successful speech input use and the potential barriers that may suggest that other access technologies could be more appropriate for a given individual. Speech input options that are available are reviewed and strategies for optimizing use of speech recognition technology are discussed. (Contains…

  6. THE BASIS FOR SPEECH PREVENTION

    Jordan JORDANOVSKI

    1997-06-01

    Speech is a tool for the accurate communication of ideas. When we talk about speech prevention as a practical realization of the language, we are referring to the fact that it should comprise the elements of the criterion as viewed from the perspective of the standards. This criterion, in the broad sense of the word, presupposes an exact realization of the thought expressed between the speaker and the recipient. The absence of this criterion is conspicuous in the practical realization of the language and brings consequences that are often hidden very deeply in the human psyche. Their outer manifestation already represents a delayed reaction of the social environment. The foundation for overcoming and standardizing this phenomenon must be the anatomical and physiological patterns of the body, achieved through methods in concordance with the nature of the body.

  7. Automatic readout micrometer

    A measuring system is disclosed for surveying and very accurately positioning objects with respect to a reference line. A principal use of this surveying system is for accurately aligning the electromagnets which direct a particle beam emitted from a particle accelerator. Prior art surveying systems require highly skilled surveyors. Prior art systems include, for example, optical surveying systems which are susceptible to operator reading errors, and celestial navigation-type surveying systems, with their inherent complexities. The present invention provides an automatic readout micrometer which can very accurately measure distances. The invention has a simplicity of operation which practically eliminates the possibilities of operator optical reading error, owing to the elimination of traditional optical alignments for making measurements. The invention has an extendable arm which carries a laser surveying target. The extendable arm can be continuously positioned over its entire length of travel by either a coarse or fine adjustment without having the fine adjustment outrun the coarse adjustment until a reference laser beam is centered on the target as indicated by a digital readout. The length of the micrometer can then be accurately and automatically read by a computer and compared with a standardized set of alignment measurements. Due to its construction, the micrometer eliminates any errors due to temperature changes when the system is operated within a standard operating temperature range.

  8. Automated Gesturing for Virtual Characters: Speech-driven and Text-driven Approaches

    Goranka Zoric

    2006-04-01

    We present two methods for automatic facial gesturing of graphically embodied animated agents. In the first, a conversational agent is driven by speech in an automatic lip-sync process: by analyzing the speech input, lip movements are determined from the speech signal. The second method provides a virtual speaker capable of reading plain English text and rendering it as speech accompanied by appropriate facial gestures. The proposed statistical model for generating the virtual speaker's facial gestures can also be applied as an addition to the lip synchronization process in order to obtain speech-driven facial gesturing. In this case the statistical model is triggered by the prosody of the input speech instead of by lexical analysis of the input text.

  9. Emotion Recognition from Persian Speech with Neural Network

    Mina Hamidi

    2012-10-01

    In this paper, we report an effort towards automatic recognition of emotional states from continuous Persian speech. Because no appropriate database for emotion recognition was available in the Persian language, we first built a database of emotional speech in Persian. This database consists of 2400 wave clips modulated with anger, disgust, fear, sadness, happiness and normal emotions. We then extracted prosodic features, including features related to the pitch, intensity and global characteristics of the speech signal. Finally, we applied neural networks for automatic recognition of emotion. The resulting average accuracy was about 78%.

  10. Adaptive Recognition of Phonemes from Speaker - Connected-Speech Using Alisa.

    Osella, Stephen Albert

    The purpose of this dissertation research is to investigate a novel approach to automatic speech recognition (ASR). The successes that have been achieved in ASR have relied heavily on the use of a language grammar, which significantly constrains the ASR process. By using grammar to provide most of the recognition ability, the ASR system does not have to be as accurate at the low-level recognition stage. The ALISA Phonetic Transcriber (APT) algorithm is proposed as a way to improve ASR by enhancing the lowest-level recognition stage. The objective of the APT algorithm is to classify speech frames (a short sequence of speech signal samples) into a small set of phoneme classes. The APT algorithm constructs the mapping from speech frames to phoneme labels through a multi-layer feedforward process. A design principle of APT is that final decisions are delayed as long as possible. Instead of attempting to optimize the decision making at each processing level individually, each level generates a list of candidate solutions that are passed on to the next level of processing. The later processing levels use these candidate solutions to resolve ambiguities. The scope of this dissertation is the design of the APT algorithm up to the speech-frame classification stage. In future research, the APT algorithm will be extended to the word recognition stage. In particular, the APT algorithm could serve as the front-end stage to a Hidden Markov Model (HMM) based word recognition system. In such a configuration, the APT algorithm would provide the HMM with the requisite phoneme state-probability estimates. To date, the APT algorithm has been tested with the TIMIT and NTIMIT speech databases. The APT algorithm has been trained and tested on the SX and SI sentence texts using both male and female speakers. Results indicate better performance than those results obtained using a neural network based speech-frame classifier. The performance of the APT algorithm has been evaluated for

  11. Unvoiced Speech Recognition Using Tissue-Conductive Acoustic Sensor

    Hiroshi Saruwatari

    2007-01-01

    We present the use of stethoscope and silicon NAM (nonaudible murmur) microphones in automatic speech recognition. NAM microphones are special acoustic sensors, which are attached behind the talker's ear and can capture not only normal (audible) speech, but also very quietly uttered speech (nonaudible murmur). As a result, NAM microphones can be applied in automatic speech recognition systems when privacy is desired in human-machine communication. Moreover, NAM microphones show robustness against noise and they might be used in special systems (speech recognition, speech transform, etc.) for sound-impaired people. Using adaptation techniques and a small amount of training data, we achieved for a 20k dictation task a 93.9% word accuracy for nonaudible murmur recognition in a clean environment. In this paper, we also investigate nonaudible murmur recognition in noisy environments and the effect of the Lombard reflex on nonaudible murmur recognition. We also propose three methods to integrate audible speech and nonaudible murmur recognition using a stethoscope NAM microphone with very promising results.

  12. Unvoiced Speech Recognition Using Tissue-Conductive Acoustic Sensor

    Heracleous Panikos

    2007-01-01

    We present the use of stethoscope and silicon NAM (nonaudible murmur) microphones in automatic speech recognition. NAM microphones are special acoustic sensors, which are attached behind the talker's ear and can capture not only normal (audible) speech, but also very quietly uttered speech (nonaudible murmur). As a result, NAM microphones can be applied in automatic speech recognition systems when privacy is desired in human-machine communication. Moreover, NAM microphones show robustness against noise and they might be used in special systems (speech recognition, speech transform, etc.) for sound-impaired people. Using adaptation techniques and a small amount of training data, we achieved for a 20k dictation task a 93.9% word accuracy for nonaudible murmur recognition in a clean environment. In this paper, we also investigate nonaudible murmur recognition in noisy environments and the effect of the Lombard reflex on nonaudible murmur recognition. We also propose three methods to integrate audible speech and nonaudible murmur recognition using a stethoscope NAM microphone with very promising results.

  13. A new method for extraction of speech features using spectral delta characteristics and invariant integration

    FARSI, Hassan; KUHIMOGHADAM, Samana

    2014-01-01

    We propose a new feature extraction algorithm that is robust against noise. Nonlinear filtering and temporal masking are used for the proposed algorithm. Since the current automatic speech recognition systems use invariant-integration and delta-delta techniques for speech feature extraction, the proposed algorithm improves speech recognition accuracy appropriately using a delta-spectral feature instead of invariant integration. One of the nonenvironmental factors that reduce recognitio...

  14. Employment of Spectral Voicing Information for Speech and Speaker Recognition in Noisy Conditions

    Jančovič, Peter; Köküer, Münevver

    2008-01-01

    This chapter described our recent research on representation and modelling of speech signals for automatic speech and speaker recognition in noisy conditions. The chapter consisted of three parts. In the first part, we presented a novel method for estimation of the voicing information of speech spectra in the presence of noise. The presented method is based on calculating a similarity between the shape of signal short-term spectrum and the spectrum of the frame-analysis window. It does not re...

  15. A Computer-Aided Evaluation of Error Patterns in Aphasic Speech

    Chan, Sharon; Tsigka, Styliani; Boschetti, Federico; Capasso, Rita

    2010-01-01

    The objective of this research is to provide an improved automated computational tool to study aphasic production. Using the speech production of Italian aphasic patients, the present study demonstrates the possibility of applying an integrated algorithm to automatically assess and generate error patterns typical of aphasic speech. Philological…

  16. Unobtrusive multimodal emotion detection in adaptive interfaces: speech and facial expressions

    Truong, K.P.; Leeuwen, D.A. van; Neerincx, M.A.

    2007-01-01

    Two unobtrusive modalities for automatic emotion recognition are discussed: speech and facial expressions. First, an overview is given of emotion recognition studies based on a combination of speech and facial expressions. We will identify difficulties concerning data collection, data fusion, system

  17. Helium Speech: An Application of Standing Waves

    Wentworth, Christopher D.

    2011-01-01

    Taking a breath of helium gas and then speaking or singing to the class is a favorite demonstration for an introductory physics course, as it usually elicits appreciative laughter, which serves to energize the class session. Students will usually report that the helium speech "raises the frequency" of the voice. A more accurate description of the…

  18. Automatic analysis of multiparty meetings

    Steve Renals

    2011-10-01

    This paper is about the recognition and interpretation of multiparty meetings captured as audio, video and other signals. This is a challenging task since the meetings consist of spontaneous and conversational interactions between a number of participants: it is a multimodal, multiparty, multistream problem. We discuss the capture and annotation of the Augmented Multiparty Interaction (AMI) meeting corpus, the development of a meeting speech recognition system, and systems for the automatic segmentation, summarization and social processing of meetings, together with some example applications based on these systems.

  19. Fifty years of progress in speech waveform coding

    Atal, Bishnu S.

    2004-10-01

    Over the past 50 years, sustained research in speech coding has made it possible to encode speech with high speech quality at rates as low as 4 kb/s. The technology is now used in many applications, such as digital cellular phones, personal computers, and packet telephony. The early research in speech coding was aimed at reproducing speech spectra using a small number of slowly varying parameters. The focus of research shifted later to accurate reproduction of speech waveforms at low bit rates. The introduction of linear predictive coding (LPC) led to the development of new algorithms, such as adaptive predictive coding, multipulse and code-excited LPC. Code-excited LPC has become the method of choice for low bit rate speech coding and is used in most voice transmission standards. Digital speech communication is rapidly moving away from traditional circuit-switched to packet-switched networks based on IP protocols (VoIP). The focus of speech coding research is now on providing low-cost, reliable, and secure transmission of high-quality speech on IP networks.

  20. Digital speech processing using Matlab

    Gopi, E S

    2014-01-01

    Digital Speech Processing Using Matlab deals with digital speech pattern recognition, speech production model, speech feature extraction, and speech compression. The book is written in a manner that is suitable for beginners pursuing basic research in digital speech processing. Matlab illustrations are provided for most topics to enable better understanding of concepts. This book also deals with the basic pattern recognition techniques (illustrated with speech signals using Matlab) such as PCA, LDA, ICA, SVM, HMM, GMM, BPN, and KSOM.

  1. Towards A Clinical Tool For Automatic Intelligibility Assessment.

    Berisha, Visar; Utianski, Rene; Liss, Julie

    2013-01-01

    An important, yet under-explored, problem in speech processing is the automatic assessment of intelligibility for pathological speech. In practice, intelligibility assessment is often done through subjective tests administered by speech pathologists; however research has shown that these tests are inconsistent, costly, and exhibit poor reliability. Although some automatic methods for intelligibility assessment for telecommunications exist, research specific to pathological speech has been limited. Here, we propose an algorithm that captures important multi-scale perceptual cues shown to correlate well with intelligibility. Nonlinear classifiers are trained at each time scale and a final intelligibility decision is made using ensemble learning methods from machine learning. Preliminary results indicate a marked improvement in intelligibility assessment over published baseline results. PMID:25004985

  2. Automatic Differentiation of Algorithms for Machine Learning

    Baydin, Atilim Gunes; Pearlmutter, Barak A.

    2014-01-01

    Automatic differentiation --- the mechanical transformation of numeric computer programs to calculate derivatives efficiently and accurately --- dates to the origin of the computer age. Reverse mode automatic differentiation both antedates and generalizes the method of backwards propagation of errors used in machine learning. Despite this, practitioners in a variety of fields, including machine learning, have been little influenced by automatic differentiation, and make scant use of available...
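    To make the link between reverse-mode automatic differentiation and backpropagation concrete, here is a minimal sketch of reverse mode on scalars. It is not taken from the paper: the `Var` class, the `sin` helper and the `backward` function are hypothetical names chosen for illustration, and only addition, multiplication and sine are supported.

```python
import math


class Var:
    """A scalar node in a computation graph that records how it was produced."""

    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents  # tuple of (parent_node, local_derivative) pairs
        self.grad = 0.0

    def __add__(self, other):
        return Var(self.value + other.value, ((self, 1.0), (other, 1.0)))

    def __mul__(self, other):
        return Var(self.value * other.value, ((self, other.value), (other, self.value)))


def sin(x):
    """Sine of a Var; the local derivative is cos(x)."""
    return Var(math.sin(x.value), ((x, math.cos(x.value)),))


def backward(output):
    """Accumulate d(output)/d(node) into node.grad, visiting nodes in reverse topological order."""
    order, seen = [], set()

    def visit(node):
        if id(node) not in seen:
            seen.add(id(node))
            for parent, _ in node.parents:
                visit(parent)
            order.append(node)  # a node is appended only after all of its inputs

    visit(output)
    output.grad = 1.0
    for node in reversed(order):
        for parent, local in node.parents:
            parent.grad += local * node.grad  # chain rule, summed over all uses


# f(x, y) = x * y + sin(x), so df/dx = y + cos(x) and df/dy = x
x, y = Var(2.0), Var(3.0)
f = x * y + sin(x)
backward(f)
print(x.grad, y.grad)
```

    A single backward sweep yields the derivative with respect to every input at once, which is why reverse mode suits functions with many inputs and one scalar output, such as a training loss.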

  3. Semantic and Phonetic Automatic Reconstruction of Medical Dictations

    Petrik, Stefan; Drexel, Christina; Fessler, Leo; Jancsary, Jeremy; Klein, Alexandra; Kubin, Gernot; Matiasek, Johannes; Pernkopf, Franz; Trost, Harald

    2010-01-01

    Automatic speech recognition (ASR) has become a valuable tool in large document production environments like medical dictation. While manual post-processing is still needed for correcting speech-recognition errors and for creating documents which adhere to various stylistic and formatting conventions, a large part of the document production process is carried out by the ASR system. For improving the quality of the system output, knowledge about the multi-layered relationsh...

  4. Likelihood-Maximizing-Based Multiband Spectral Subtraction for Robust Speech Recognition

    Bagher BabaAli

    2009-01-01

    Automatic speech recognition performance degrades significantly when speech is affected by environmental noise. Nowadays, the major challenge is to achieve good robustness in adverse noisy conditions so that automatic speech recognizers can be used in real situations. Spectral subtraction (SS) is a well-known and effective approach; it was originally designed for improving the quality of the speech signal as judged by human listeners. SS techniques usually improve the quality and intelligibility of the speech signal, whereas speech recognition systems need compensation techniques that reduce the mismatch between noisy speech features and the acoustic model trained on clean speech. Nevertheless, a correlation can be expected between speech quality improvement and the increase in recognition accuracy. This paper proposes a novel approach for solving this problem by considering SS and the speech recognizer not as two independent entities cascaded together, but rather as two interconnected components of a single system, sharing the common goal of improved speech recognition accuracy. This incorporates important information from the statistical models of the recognition engine as feedback for tuning SS parameters. By using this architecture, we overcome the drawbacks of previously proposed methods and achieve better recognition accuracy. Experimental evaluations show that the proposed method can achieve significant improvement of recognition rates across a wide range of signal-to-noise ratios.
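    For orientation, the basic spectral subtraction rule that such methods build on can be sketched in a few lines. This is the generic power-domain baseline with an over-subtraction factor and a spectral floor, not the likelihood-maximizing multiband method of the paper. The function names, the parameters `alpha` and `beta`, and the example spectra are assumptions made for illustration; parameters of this kind are what a recognizer-driven scheme might tune.

```python
def estimate_noise(frames):
    """Average power per frequency bin over frames assumed to contain noise only."""
    bins = len(frames[0])
    return [sum(frame[k] for frame in frames) / len(frames) for k in range(bins)]


def spectral_subtract(noisy_power, noise_power, alpha=2.0, beta=0.01):
    """Power spectral subtraction for one frame.

    noisy_power: power spectrum of the noisy frame, one value per frequency bin
    noise_power: estimated noise power spectrum for the same bins
    alpha: over-subtraction factor (assumed tunable parameter)
    beta: spectral floor as a fraction of the noise power (assumed tunable parameter)
    """
    if len(noisy_power) != len(noise_power):
        raise ValueError("spectra must have the same number of bins")
    cleaned = []
    for y, n in zip(noisy_power, noise_power):
        estimate = y - alpha * n  # subtract the scaled noise estimate
        floor = beta * n          # never drop below a small fraction of the noise
        cleaned.append(max(estimate, floor))
    return cleaned


# Hypothetical 4-bin power spectra: two noise-only frames, then one noisy speech frame.
noise_frames = [[1.0, 2.0, 1.0, 0.5], [3.0, 2.0, 1.0, 1.5]]
noise = estimate_noise(noise_frames)  # [2.0, 2.0, 1.0, 1.0]
speech_frame = [10.0, 3.0, 9.0, 2.5]
print(spectral_subtract(speech_frame, noise))
```

    The floor keeps bins where the noise estimate exceeds the observed power from going negative, which would otherwise produce invalid power values and audible artifacts.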

  5. Automatic Recognition of Element Classes and Boundaries in the Birdsong with Variable Sequences.

    Koumura, Takuya; Okanoya, Kazuo

    2016-01-01

    Research on sequential vocalization often requires analysis of vocalizations in long continuous sounds. In studies such as developmental ones or studies across generations, in which days or months of vocalizations must be analyzed, methods for automatic recognition are strongly desired. Although methods for automatic speech recognition for application purposes have been intensively studied, blindly applying them for biological purposes may not be an optimal solution. This is because, unlike human speech recognition, analysis of sequential vocalizations often requires accurate extraction of timing information. In the present study we propose automated systems suitable for recognizing birdsong, one of the most intensively investigated sequential vocalizations, focusing on three properties of birdsong. First, a song is a sequence of vocal elements, called notes, which can be grouped into categories. Second, the temporal structure of birdsong is precisely controlled, meaning that temporal information is important in song analysis. Finally, notes are produced according to certain probabilistic rules, which may facilitate accurate song recognition. We divided the procedure of song recognition into three sub-steps: local classification, boundary detection, and global sequencing, each corresponding to one of the three properties of birdsong. We compared the performance of several different ways of arranging these three steps. As a result, we demonstrated that a hybrid model of a deep convolutional neural network and a hidden Markov model was effective. We propose suitable arrangements of methods according to whether accurate boundary detection is needed. We also designed a new measure to jointly evaluate the accuracy of note classification and boundary detection. Our methods should be applicable, with small modification and tuning, to the songs of other species that share the three properties of sequential vocalization. PMID:27442240

  6. Emotional speech acoustic model for Malay: iterative versus isolated unit training.

    Mustafa, Mumtaz Begum; Ainon, Raja Noor

    2013-10-01

    The ability of a speech synthesis system to synthesize emotional speech enhances the user's experience when using this kind of system and its related applications. However, the development of an emotional speech synthesis system is a daunting task in view of the complexity of human emotional speech. The more recent state-of-the-art speech synthesis systems, such as the one based on hidden Markov models, can synthesize emotional speech with acceptable naturalness with the use of a good emotional speech acoustic model. However, building an emotional speech acoustic model requires adequate resources including segment-phonetic labels of emotional speech, which is a problem for many under-resourced languages, including Malay. This research shows how it is possible to build an emotional speech acoustic model for Malay with minimal resources. To achieve this objective, two forms of initialization methods were considered: iterative training using the deterministic annealing expectation maximization algorithm and the isolated unit training. The seed model for the automatic segmentation is a neutral speech acoustic model, which was transformed to target emotion using two transformation techniques: model adaptation and context-dependent boundary refinement. Two forms of evaluation have been performed: an objective evaluation measuring the prosody error and a listening evaluation to measure the naturalness of the synthesized emotional speech. PMID:24116440

  7. Indirect Speech Acts

    李威

    2001-01-01

    Indirect speech acts are frequently used in verbal communication, and interpreting them is of great importance for developing students' communicative competence. This paper therefore presents Searle's indirect speech acts and explores how indirect speech acts are interpreted in accordance with two influential theories. It consists of four parts. Part one gives a general introduction to the notion of speech act theory. Part two elaborates on the conception of indirect speech act theory proposed by Searle and his supplement to and development of illocutionary acts. Part three deals with the interpretation of indirect speech acts. Part four draws implications from the previous study and also serves as the conclusion of the dissertation.

  8. Esophageal speeches modified by the Speech Enhancer Program®

    Manochiopinig, Sriwimon; Boonpramuk, Panuthat

    2014-01-01

    Esophageal speech appears to be the first choice of speech treatment after a laryngectomy. However, many laryngectomized people are unable to speak well. The aim of this study was to evaluate post-modified speech quality of Thai esophageal speakers using the Speech Enhancer Program®. The method adopted was to approach five speech–language pathologists to assess the speech accuracy and intelligibility of the words and continuing speech of the seven laryngectomized people. A comparison study was conduc...

  9. Speech input and output

    Class, F.; Mangold, H.; Stall, D.; Zelinski, R.

    1981-12-01

    Possibilities for acoustical dialogs with electronic data processing equipment were investigated. Speech recognition is posed as recognizing word groups. An economical, multistage classifier for word string segmentation is presented and its reliability in dealing with continuous speech (problems of temporal normalization and context) is discussed. Speech synthesis is considered in terms of German linguistics and phonetics. Preprocessing algorithms for total synthesis of written texts were developed. A macrolanguage, MUSTER, is used to implement this processing in an acoustic data information system (ADES).

  10. Speech Alarms Pilot Study

    Sandor, Aniko; Moses, Haifa

    2016-01-01

    Speech alarms have been used extensively in aviation and included in International Building Codes (IBC) and National Fire Protection Association's (NFPA) Life Safety Code. However, they have not been implemented on space vehicles. Previous studies conducted at NASA JSC showed that speech alarms lead to faster identification and higher accuracy. This research evaluated updated speech and tone alerts in a laboratory environment and in the Human Exploration Research Analog (HERA) in a realistic setup.

  11. Context dependent speech recognition

    Andersson, Sebastian

    2006-01-01

    Poor speech recognition is a problem when developing spoken dialogue systems, but several studies have shown that speech recognition can be improved by post-processing the recognition output, using the dialogue context, acoustic properties of a user utterance and other available resources to train a statistical model that acts as a filter between the speech recogniser and the dialogue manager. In this thesis a corpus of logged interactions between users and a dialogue system was used...

  12. Advances in speech processing

    Ince, A. Nejat

    1992-10-01

    The field of speech processing is undergoing rapid growth in terms of both performance and applications, fueled by the advances being made in the areas of microelectronics, computation, and algorithm design. The use of voice for civil and military communications is discussed considering advantages and disadvantages including the effects of environmental factors such as acoustic and electrical noise and interference and propagation. The structure of the existing NATO communications network and the evolving Integrated Services Digital Network (ISDN) concept are briefly reviewed to show how they meet the present and future requirements. The paper then deals with the fundamental subject of speech coding and compression. Recent advances in techniques and algorithms for speech coding now permit high quality voice reproduction at remarkably low bit rates. The subject of speech synthesis is treated next, where the principal objective is to produce natural quality synthetic speech from unrestricted text input. Speech recognition, where the ultimate objective is to produce a machine which would understand conversational speech with unrestricted vocabulary from essentially any talker, is then discussed. Algorithms for speech recognition can be characterized broadly as pattern recognition approaches and acoustic phonetic approaches. To date, the greatest degree of success in speech recognition has been obtained using pattern recognition paradigms. It is for this reason that the paper is concerned primarily with this technique.

  13. Advances in Speech Recognition

    Neustein, Amy

    2010-01-01

    This volume comprises contributions from eminent leaders in the speech industry, and presents a comprehensive and in-depth analysis of the progress of speech technology in the topical areas of mobile settings, healthcare and call centers. The material addresses the technical aspects of voice technology within the framework of societal needs, such as the use of speech recognition software to produce up-to-date electronic health records, notwithstanding patients making changes to health plans and physicians. Included will be discussion of speech engineering, linguistics, human factors ana

  14. Ear, Hearing and Speech

    Poulsen, Torben

    2000-01-01

    An introduction is given to the anatomy and the function of the ear, basic psychoacoustic matters (hearing threshold, loudness, masking), the speech signal and speech intelligibility. The lecture note is written for the course: Fundamentals of Acoustics and Noise Control (51001).

  15. Principles of speech coding

    Ogunfunmi, Tokunbo

    2010-01-01

    It is becoming increasingly apparent that all forms of communication, including voice, will be transmitted through packet-switched networks based on the Internet Protocol (IP). Therefore, the design of modern devices that rely on speech interfaces, such as cell phones and PDAs, requires a complete and up-to-date understanding of the basics of speech coding. The book outlines key signal processing algorithms used to mitigate impairments to speech quality in VoIP networks. Offering a detailed yet easily accessible introduction to the field, Principles of Speech Coding provides an in-depth examination of the

  16. [The voice and speech].

    Pesák, J; Honová, J; Majtner, J; Vojtĕchovský, K

    1998-01-01

    Biophysics is the science comprising the sum of biophysical disciplines describing living systems. It also includes the biophysics of voice and speech, which deals with physiological acoustics, phonetics, phoniatry and logopaedics. In connection with the problems of voice and speech, including the problems of teaching them, a common language appropriate to all the interested scientific branches is often sought. As a result of our efforts to remove the existing barriers, we have tried to set up a University Society for the Study of Voice and Speech. One of its first activities, among other events, was the production of a video film, On voice and speech. PMID:10803289

  17. Suppression of the µ Rhythm during Speech and Non-Speech Discrimination Revealed by Independent Component Analysis: Implications for Sensorimotor Integration in Speech Processing

    Bowers, Andrew; Saltuklaroglu, Tim; Harkrider, Ashley; Cuellar, Megan

    2013-01-01

    Background: Constructivist theories propose that articulatory hypotheses about incoming phonetic targets may function to enhance perception by limiting the possibilities for sensory analysis. To provide evidence for this proposal, it is necessary to map ongoing, high-temporal resolution changes in sensorimotor activity (i.e., the sensorimotor μ rhythm) to accurate speech and non-speech discrimination performance (i.e., correct trials). Methods: Sixteen participants (15 female and 1 male) were a...

  18. 'What is it?' A functional MRI and SPECT study of ictal speech in a second language

    Neuronal networks involved in second language (L2) processing vary between normal subjects. Patients with epilepsy may have ictal speech automatisms in their second language. To delineate the brain systems involved in L2 ictal speech, we combined functional MRI during bilingual tasks and ictal - inter-ictal single-photon emission computed tomography in a patient who presented L2 ictal speech productions. These analyses showed that the networks activated by the seizure and those activated by L2 processing intersected in the right hippocampus. These results may provide some insights both into the pathophysiology of ictal speech and into the brain organization for L2. (authors)

  19. Analysis of vocal signal in its amplitude-time representation. Speech synthesis-by-rules

    In the first part of this dissertation, natural speech production and the resulting acoustic waveform are examined from several perspectives: communication, phonetics, and frequency and temporal analysis. Our own study of the direct signal is compared with other research in these fields, and fundamental features of vocal signals are described. The second part deals with the numerous methods already used for automatic text-to-speech synthesis. In the last part, we present the new speech synthesis-by-rule methods that we have worked out, and we describe in detail the structure of the real-time speech synthesiser that we have implemented on a mini-computer. (author)

  20. Speech-Language Therapy (For Parents)

    ... with speech and/or language disorders. Speech Disorders, Language Disorders, and Feeding Disorders A speech disorder refers ...

  1. Time-expanded speech and speech recognition in older adults.

    Vaughan, Nancy E; Furukawa, Izumi; Balasingam, Nirmala; Mortz, Margaret; Fausti, Stephen A

    2002-01-01

    Speech understanding deficits are common in older adults. In addition to hearing sensitivity, changes in certain cognitive functions may affect speech recognition. One such change that may impact the ability to follow a rapidly changing speech signal is processing speed. When speakers slow the rate of their speech naturally in order to speak clearly, speech recognition is improved. The acoustic characteristics of naturally slowed speech are of interest in developing time-expansion algorithms to improve speech recognition for older listeners. In this study, we tested younger normally hearing, older normally hearing, and older hearing-impaired listeners on time-expanded speech using increased duration and increased intensity of unvoiced consonants. Although all groups performed best on unprocessed speech, performance with processed speech was better with the consonant gain feature without time expansion in the noise condition and better at the slowest time-expanded rate in the quiet condition. The effects of signal processing on speech recognition are discussed. PMID:17642020

  2. Emotional State Categorization from Speech: Machine vs. Human

    Shaukat, Arslan

    2010-01-01

    This paper presents our investigations on emotional state categorization from speech signals, comparing a psychologically inspired computational model against human performance under the same experimental setup. Based on psychological studies, we propose a multistage categorization strategy which allows an automatic categorization model to be established flexibly for a given emotional speech categorization task. We apply the strategy to the Serbian Emotional Speech Corpus (GEES) and the Danish Emotional Speech Corpus (DES), where human performance was reported in previous psychological studies. Our work is the first attempt to apply machine learning to the GEES corpus, for which only human recognition rates were available prior to our study. Unlike the previous work on the DES corpus, our work focuses on a comparison to human performance under the same experimental settings. Our studies suggest that psychology-inspired systems yield behaviours that, to a great extent, resemble what humans perceived and their performance ...

  3. Speech Compression for Noise-Corrupted Thai Expressive Speech

    Suphattharachai Chomphan

    2011-01-01

    Problem statement: In speech communication, speech coding aims at preserving speech quality at a lower coding bitrate. In real communication environments, various types of noise degrade speech quality, and expressive speech in different speaking styles may yield different speech quality under the same coding method. Approach: This research studied speech compression for noise-corrupted Thai expressive speech using two coding methods, CS-ACELP and MP-CELP. The speech material included one hundred male and one hundred female speech utterances. Four speaking styles were included: enjoyable, sad, angry, and reading. Five sentences of Thai speech were chosen. Three types of noise were included (train, car, and air conditioner). Five levels of each type of noise were used, ranging from 0 to 20 dB. A subjective mean opinion score test was used in the evaluation. Results: The experimental results showed that CS-ACELP gave better speech quality than MP-CELP at all three bitrates of 6000, 8600, and 12600 bps. Across noise levels, the 20-dB noise gave the best speech quality, while the 0-dB noise gave the worst. Across genders, female speech gave better results than male speech. Across noise types, the air-conditioner noise gave the best speech quality, while the train noise gave the worst. Conclusion: The study shows that coding method, type of noise, level of noise, and speaker gender all influence coded speech quality.
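
    A minimal sketch of how a noise recording could be mixed with speech at a chosen level, assuming the 0 to 20 dB levels in the abstract are signal-to-noise ratios (the abstract does not state this explicitly). The function names, the synthetic signals, and the five level values are illustrative assumptions, not the study's material.

```python
import math

def mean_power(signal):
    """Average power (mean of squared samples) of a signal."""
    return sum(x * x for x in signal) / len(signal)

def scale_noise_for_snr(speech, noise, snr_db):
    """Return a scaled copy of `noise` whose power gives the requested SNR."""
    target_noise_power = mean_power(speech) / (10 ** (snr_db / 10))
    gain = math.sqrt(target_noise_power / mean_power(noise))
    return [gain * n for n in noise]

def measure_snr_db(speech, noise):
    """SNR in dB between a speech signal and a noise signal."""
    return 10 * math.log10(mean_power(speech) / mean_power(noise))

# Synthetic stand-ins (assumptions): a sine "speech" and an alternating "noise".
speech = [math.sin(2 * math.pi * 5 * t / 100) for t in range(100)]
noise = [0.3 * (-1) ** t for t in range(100)]

# Hypothetical levels; the abstract only gives the 0-20 dB range.
for level in (0, 5, 10, 15, 20):
    scaled = scale_noise_for_snr(speech, noise, level)
    noisy = [s + n for s, n in zip(speech, scaled)]
    print(level, round(measure_snr_db(speech, scaled), 3), len(noisy))
```

    Scaling the noise rather than the speech keeps the speech level constant across conditions, so differences in the coded output can be attributed to the noise level alone.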

  4. Private Speech in Ballet

    Johnston, Dale

    2006-01-01

    Authoritarian teaching practices in ballet inhibit the use of private speech. This paper highlights the critical importance of private speech in the cognitive development of young ballet students, within what is largely a non-verbal art form. It draws upon research by Russian psychologist Lev Vygotsky and contemporary socioculturalists, to…

  5. Improving Alaryngeal Speech Intelligibility.

    Christensen, John M.; Dwyer, Patricia E.

    1990-01-01

    Laryngectomized patients using esophageal speech or an electronic artificial larynx have difficulty producing correct voicing contrasts between homorganic consonants. This paper describes a therapy technique that emphasizes "pushing harder" on voiceless consonants to improve alaryngeal speech intelligibility and proposes focusing on the production…

  6. Speech and Language Delay

    ... child depends on the cause of the speech delay. Your doctor will tell you the cause of your child's problem and explain any treatments that might fix the problem or make it better. A speech and language pathologist might be helpful in making treatment plans. This ...

  7. Speech Situations and TEFL

    吴树奇; 高建国

    2008-01-01

    This paper deals with how speech situations, or rather speech implicatures, affect TEFL. As far as the writer is concerned, they have much influence on many aspects of language teaching. To illustrate this point explicitly, the writer focuses on the influence of speech situations upon pronunciation, intonation, lexical meanings, sentence comprehension, and the grammatical study of the English language.

  8. Speech processing standards

    Ince, A. Nejat

    1990-05-01

    Speech processing standards are given for 64, 32, 16 kb/s and lower rate speech and, more generally, speech-band signals which are or will be promulgated by CCITT and NATO. The International Telegraph and Telephone Consultative Committee (CCITT) is the international body which deals, among other things, with speech processing within the context of ISDN. Within NATO there are also bodies promulgating standards which make interoperability possible without complex and expensive interfaces. Some of the applications for low-bit rate voice are highlighted, together with the related work undertaken by the CCITT Study Groups responsible for developing standards on encoding algorithms, codec design objectives, and the assessment of speech quality.

  9. Speech recognition systems on the Cell Broadband Engine

    Liu, Y; Jones, H; Vaidya, S; Perrone, M; Tydlitat, B; Nanda, A

    2007-04-20

    In this paper we describe our design, implementation, and first results of a prototype connected-phoneme-based speech recognition system on the Cell Broadband Engine™ (Cell/B.E.). Automatic speech recognition decodes speech samples into plain text (other representations are possible) and must process samples at real-time rates. Fortunately, the computational tasks involved in this pipeline are highly data-parallel and can receive significant hardware acceleration from vector-streaming architectures such as the Cell/B.E. Identifying and exploiting these parallelism opportunities is challenging, but also critical to improving system performance. We observed, from our initial performance timings, that a single Cell/B.E. processor can recognize speech from thousands of simultaneous voice channels in real time, a channel density that is orders of magnitude greater than the capacity of existing software speech recognizers based on CPUs (central processing units). This result emphasizes the potential for Cell/B.E.-based speech recognition and will likely lead to the future development of production speech systems using Cell/B.E. clusters.

  10. Recognizing intentions in infant-directed speech: Evidence for universals

    Bryant, GA; Barrett, HC

    2007-01-01

    In all languages studied to date, distinct prosodic contours characterize different intention categories of infant-directed (ID) speech. This vocal behavior likely exists universally as a species-typical trait, but little research has examined whether listeners can accurately recognize intentions in ID speech using only vocal cues, without access to semantic information. We recorded native-English-speaking mothers producing four intention categories of utterances (prohibition, approval, comfo...

  11. Speech Acts In President Barack Obama Victory Speech 2012

    Januarini, Erna

    2016-01-01

    In this thesis, entitled Speech Acts In President Barack Obama's Victory Speech 2012, the author analyzes the illocutionary acts and the direct and indirect speech acts performed by Barack Obama as a speaker, classified as representative, directive, expressive, commissive, and declaration. The purpose of this thesis is to find the types of illocutionary acts and direct and indirect speech acts in Barack Obama's victory speech 2012. In writing this thesis, the author uses a qualitative method from Huberman...

  12. 78 FR 49693 - Speech-to-Speech and Internet Protocol (IP) Speech-to-Speech Telecommunications Relay Services...

    2013-08-15

    ... Rulemaking, published at 73 FR 47120, August 13, 2008 (2008 STS NPRM). The Commission sought comment on... Abbreviated Dialing Arrangements, CC Docket No. 92-105, Report and Order, published at 65 FR 54799, September... COMMISSION 47 CFR Part 64 Speech-to-Speech and Internet Protocol (IP) Speech-to-Speech...

  13. 78 FR 49717 - Speech-to-Speech and Internet Protocol (IP) Speech-to-Speech Telecommunications Relay Services...

    2013-08-15

    ..., Report and Order and Further Notice of Proposed Rulemaking, published at 77 FR 25609, May 1, 2012 (VRS... Nos. 03-123 and 08-15, Notice of Proposed Rulemaking, published at 73 FR 47120, August 13, 2008 (2008... COMMISSION 47 CFR Part 64 Speech-to-Speech and Internet Protocol (IP) Speech-to-Speech...

  14. Going to a Speech Therapist

    ... Some kids have trouble saying certain sounds or words. This can be frustrating ... speech therapists (also called speech-language pathologists). What ...

  15. Automated Intelligibility Assessment of Pathological Speech Using Phonological Features

    Catherine Middag

    2009-01-01

    It is commonly acknowledged that word or phoneme intelligibility is an important criterion in the assessment of the communication efficiency of a pathological speaker. People have therefore put a lot of effort into the design of perceptual intelligibility rating tests. These tests usually have the drawback that they employ unnatural speech material (e.g., nonsense words) and that they cannot fully exclude errors due to listener bias. Therefore, there is a growing interest in the application of objective automatic speech recognition technology to automate the intelligibility assessment. Current research is headed towards the design of automated methods which can be shown to produce ratings that correspond well with those emerging from a well-designed and well-performed perceptual test. In this paper, a novel methodology that is built on previous work (Middag et al., 2008) is presented. It utilizes phonological features, automatic speech alignment based on acoustic models that were trained on normal speech, context-dependent speaker feature extraction, and intelligibility prediction based on a small model that can be trained on pathological speech samples. The experimental evaluation of the new system reveals that the root mean squared error of the discrepancies between perceived and computed intelligibilities can be as low as 8 on a scale of 0 to 100.
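
    The root mean squared error quoted at the end of this abstract can be illustrated with a short sketch. The scores below are made-up numbers on the same 0 to 100 scale, not data from the paper, and the function name is an assumption.

```python
import math

def rmse(perceived, computed):
    """Root mean squared error between two equal-length score lists."""
    if not perceived or len(perceived) != len(computed):
        raise ValueError("score lists must be non-empty and of equal length")
    squared = [(p - c) ** 2 for p, c in zip(perceived, computed)]
    return math.sqrt(sum(squared) / len(squared))

# Hypothetical intelligibility scores (0-100) for four speakers.
perceived_scores = [80.0, 55.0, 92.0, 40.0]   # from a listening test
computed_scores = [74.0, 63.0, 90.0, 48.0]    # from an automatic system

print(rmse(perceived_scores, computed_scores))
```

    Because the errors are squared before averaging, a few large discrepancies raise the RMSE more than many small ones, which makes it a conservative summary of agreement between the two ratings.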

  16. Adverse Conditions and ASR Techniques for Robust Speech User Interface

    Urmila Shrawankar

    2011-09-01

    The main motivation for Automatic Speech Recognition (ASR) is efficient interfaces to computers, and for the interfaces to be natural and truly useful, they should provide coverage for a large group of users. The purpose of these tasks is to further improve man-machine communication. ASR systems exhibit unacceptable degradations in performance when the acoustical environments used for training and testing the system are not the same. The goal of this research is to increase the robustness of speech recognition systems with respect to changes in the environment. A system can be labeled as environment-independent if the recognition accuracy for a new environment is the same as or higher than that obtained when the system is retrained for that environment. Attaining such performance is the dream of researchers. This paper elaborates some of the difficulties with Automatic Speech Recognition (ASR). These difficulties are classified into speaker characteristics and environmental conditions, and some techniques are suggested to compensate for variations in the speech signal. This paper focuses on robustness with respect to speaker variations and changes in the acoustical environment. We discuss several external factors that change the environment and physiological differences that affect the performance of a speech recognition system, followed by techniques that are helpful in designing a robust ASR system.

  17. Fishing for meaningful units in connected speech

    Henrichsen, Peter Juel; Christiansen, Thomas Ulrich

    2009-01-01

    In many branches of spoken language analysis including ASR, the set of smallest meaningful units of speech is taken to coincide with the set of phones or phonemes. However, fishing for phones is difficult, error-prone, and computationally expensive. We present an experiment, based on machine...... far lower than for phonemic recognition. Our findings show that it is possible to automatically characterize a linguistic message, without detailed spectral information or presumptions about the target units. Further, fishing for simple meaningful cues and enhancing these selectively would potentially...

  18. Global Freedom of Speech

    Binderup, Lars Grassme

    2007-01-01

    opposed to a legal norm, that curbs exercises of the right to free speech that offend the feelings or beliefs of members from other cultural groups. The paper rejects the suggestion that acceptance of such a norm is in line with liberal egalitarian thinking. Following a review of the classical liberal...... egalitarian reasons for free speech - reasons from overall welfare, from autonomy and from respect for the equality of citizens - it is argued that these reasons outweigh the proposed reasons for curbing culturally offensive speech. Currently controversial cases such as that of the Danish Cartoon Controversy...

  19. Audio-Visual Speech Recognition Using MPEG-4 Compliant Visual Features

    Aleksic Petar S

    2002-01-01

    We describe an audio-visual automatic continuous speech recognition system, which significantly improves speech recognition performance over a wide range of acoustic noise levels, as well as under clean audio conditions. The system utilizes facial animation parameters (FAPs) supported by the MPEG-4 standard for the visual representation of speech. We also describe a robust and automatic algorithm we have developed to extract FAPs from visual data, which does not require hand labeling or extensive training procedures. Principal component analysis (PCA) was performed on the FAPs in order to decrease the dimensionality of the visual feature vectors, and the derived projection weights were used as visual features in the audio-visual automatic speech recognition (ASR) experiments. Both single-stream and multistream hidden Markov models (HMMs) were used to model the ASR system, integrate audio and visual information, and perform relatively large vocabulary (approximately 1000 words) speech recognition experiments. The experiments performed use clean audio data and audio data corrupted by stationary white Gaussian noise at various SNRs. The proposed system reduces the word error rate (WER) by 20% to 23% relative to audio-only speech recognition WERs at various SNRs (0–30 dB) with additive white Gaussian noise, and by 19% relative to the audio-only speech recognition WER under clean audio conditions.
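
    The relative WER reductions reported in this abstract can be made concrete with a small sketch. It computes WER as the word-level edit distance divided by the reference length, then the relative reduction between two systems. The transcripts and function names are invented for illustration and are not taken from the paper.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed ref prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, substitution))
        prev = curr
    return prev[-1] / len(ref)

def relative_reduction(baseline, improved):
    """Fraction of the baseline error removed by the improved system."""
    return (baseline - improved) / baseline

# Hypothetical transcripts (assumptions).
reference = "turn the volume down now"
audio_only = "turn a volume town now"         # two substitutions
audio_visual = "turn the volume town now"     # one substitution

wer_audio = word_error_rate(reference, audio_only)
wer_av = word_error_rate(reference, audio_visual)
print(wer_audio, wer_av, relative_reduction(wer_audio, wer_av))
```

    A relative reduction is measured against the baseline error, so a drop from 40% to 20% WER is a 50% relative reduction even though the absolute drop is 20 percentage points.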

  20. Automatic assessment of vowel space area.

    Sandoval, Steven; Berisha, Visar; Utianski, Rene L; Liss, Julie M; Spanias, Andreas

    2013-11-01

    Vowel space area (VSA) is an attractive metric for the study of speech production deficits and reductions in intelligibility, in addition to the traditional study of vowel distinctiveness. Traditional VSA estimates are not currently sufficiently sensitive to map to production deficits. The present report describes an automated algorithm using healthy, connected speech rather than single syllables and estimates the entire vowel working space rather than corner vowels. Analyses reveal a strong correlation between the traditional VSA and automated estimates. When the two methods diverge, the automated method seems to provide a more accurate area since it accounts for all vowels. PMID:24181994
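
    For orientation, the traditional corner-vowel VSA that this abstract compares against is usually computed as the area of the polygon formed by the corner vowels in the F1-F2 plane. The sketch below uses the shoelace formula; the formant values are hypothetical, and it does not reproduce the automated method, which the abstract does not specify.

```python
def polygon_area(points):
    """Area of a simple polygon given as ordered (x, y) vertices (shoelace formula)."""
    twice_area = 0.0
    count = len(points)
    for i in range(count):
        x1, y1 = points[i]
        x2, y2 = points[(i + 1) % count]   # wrap around to close the polygon
        twice_area += x1 * y2 - x2 * y1
    return abs(twice_area) / 2.0

# Hypothetical (F1, F2) values in Hz for four corner vowels, listed in order
# around the perimeter of the vowel quadrilateral.
corner_vowels = [(300, 2300), (700, 1800), (750, 1100), (320, 900)]

print(polygon_area(corner_vowels))  # area in Hz^2
```

    The vertices must be listed in order around the perimeter; an out-of-order list describes a self-intersecting shape and the formula no longer returns the enclosed area.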

  1. Speech dereverberation for enhancement and recognition using dynamic features constrained deep neural networks and feature adaptation

    Xiao, Xiong; Zhao, Shengkui; Ha Nguyen, Duc Hoang; Zhong, Xionghu; Jones, Douglas L.; Chng, Eng Siong; Li, Haizhou

    2016-01-01

    This paper investigates deep neural network (DNN) based nonlinear feature mapping and statistical linear feature adaptation approaches for reducing reverberation in speech signals. In the nonlinear feature mapping approach, a DNN is trained on a parallel clean/distorted speech corpus to map reverberant and noisy speech coefficients (such as the log magnitude spectrum) to the underlying clean speech coefficients. The constraint imposed by dynamic features (i.e., the time derivatives of the speech coefficients) is used to enhance the smoothness of predicted coefficient trajectories in two ways. One is to obtain the enhanced speech coefficients with a least squares estimation from the coefficients and dynamic features predicted by the DNN. The other is to incorporate the constraint of dynamic features directly into the DNN training process using a sequential cost function. In the linear feature adaptation approach, a sparse linear transform, called the cross transform, is used to transform multiple frames of speech coefficients to a new feature space. The transform is estimated to maximize the likelihood of the transformed coefficients given a model of clean speech coefficients. Unlike the DNN approach, no parallel corpus is used and no assumption on distortion types is made. The two approaches are evaluated on the REVERB Challenge 2014 tasks. Both speech enhancement and automatic speech recognition (ASR) results show that the DNN-based mappings significantly reduce the reverberation in speech and improve both speech quality and ASR performance. For the speech enhancement task, the proposed dynamic feature constraint helps to improve cepstral distance, frequency-weighted segmental signal-to-noise ratio (SNR), and log likelihood ratio metrics while moderately degrading the speech-to-reverberation modulation energy ratio. In addition, the cross transform feature adaptation improves the ASR performance significantly for clean-condition trained acoustic models.
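
    The dynamic features mentioned in this abstract are time derivatives of the speech coefficients. A common way to approximate them is the regression-based delta formula sketched below; the window size, edge handling, and function name are assumptions, since the abstract does not give the paper's exact definition.

```python
def delta(trajectory, window=2):
    """Regression-based delta of one coefficient trajectory (one value per frame).

    d[t] = sum_{n=1..N} n * (c[t+n] - c[t-n]) / (2 * sum_{n=1..N} n^2),
    with frames beyond the edges replaced by the first or last frame.
    """
    denominator = 2 * sum(n * n for n in range(1, window + 1))
    last = len(trajectory) - 1
    deltas = []
    for t in range(len(trajectory)):
        numerator = 0.0
        for n in range(1, window + 1):
            ahead = trajectory[min(t + n, last)]    # clamp at the final frame
            behind = trajectory[max(t - n, 0)]      # clamp at the first frame
            numerator += n * (ahead - behind)
        deltas.append(numerator / denominator)
    return deltas

# A linearly rising coefficient has a constant slope of 1 per frame.
ramp = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
print(delta(ramp))
```

    Away from the edges the delta of the ramp equals its slope, which is the smoothness information a dynamic feature constraint can exploit: static predictions that jump erratically between frames are inconsistent with the predicted deltas.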

  2. Deep Multimodal Learning for Audio-Visual Speech Recognition

    Mroueh, Youssef; Marcheret, Etienne; Goel, Vaibhava

    2015-01-01

    In this paper, we present methods in deep multimodal learning for fusing speech and visual modalities for Audio-Visual Automatic Speech Recognition (AV-ASR). First, we study an approach where uni-modal deep networks are trained separately and their final hidden layers fused to obtain a joint feature space in which another deep network is built. While the audio network alone achieves a phone error rate (PER) of 41% under clean condition on the IBM large vocabulary audio-visual studio datase...

  3. Speech and Swallowing

    ... Speech and Swallowing Problems ... How do I know if I have a swallowing problem? I have recently lost weight without trying. ...

  4. Speech disorders - children

    ... this page: //medlineplus.gov/ency/article/001430.htm Speech disorders - children ... PA: Elsevier Saunders; 2011:chap 32. Read More Autism spectrum disorder Cerebral palsy Hearing loss Intellectual disability ...

  5. Speech impairment (adult)

    ... ALS or Lou Gehrig disease), cerebral palsy, myasthenia gravis, or multiple sclerosis (MS) Facial trauma Facial weakness, ... provider will likely ask about the speech impairment. Questions may include when the problem developed, whether there ...

  6. One-against-all weighted dynamic time warping for language-independent and speaker-dependent speech recognition in adverse conditions.

    Xianglilan Zhang

    Considering personal privacy and the difficulty of obtaining training material for many seldom used English words and (often non-English) names, language-independent (LI) automatic speech recognition (ASR) with lightweight speaker-dependent (SD) models is a promising option to solve the problem. The dynamic time warping (DTW) algorithm is the state-of-the-art algorithm for small-footprint SD ASR applications with limited storage space and small vocabulary, such as voice dialing on mobile devices, menu-driven recognition, and voice control on vehicles and robotics. Even though we have successfully developed two fast and accurate DTW variations for clean speech data, speech recognition in adverse conditions is still a big challenge. In order to improve recognition accuracy in noisy environments and bad recording conditions, such as too high or too low volume, we introduce a novel one-against-all weighted DTW (OAWDTW). This method defines a one-against-all index (OAI) for each time frame of training data and applies the OAIs to the core DTW process. Given two speech signals, OAWDTW tunes their final alignment score by using the OAI in the DTW process. Our method achieves better accuracy than DTW and merge-weighted DTW (MWDTW): a 6.97% relative reduction of error rate (RRER) compared with DTW and a 15.91% RRER compared with MWDTW are observed in our extensive experiments on one representative SD dataset of four speakers' recordings. To the best of our knowledge, the OAWDTW approach is the first weighted DTW specially designed for speech data in adverse conditions.
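
    As background for this abstract, here is a minimal sketch of plain (unweighted) DTW on one-dimensional sequences, used for nearest-template word recognition. It is the baseline technique only: the OAI weighting of OAWDTW is not specified in the abstract and is not reproduced here. The template values, word labels, and function names are hypothetical.

```python
def dtw_distance(a, b):
    """Classic DTW alignment cost between two numeric sequences."""
    inf = float("inf")
    rows, cols = len(a), len(b)
    # cost[i][j] = best cumulative cost aligning a[:i] with b[:j].
    cost = [[inf] * (cols + 1) for _ in range(rows + 1)]
    cost[0][0] = 0.0
    for i in range(1, rows + 1):
        for j in range(1, cols + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(cost[i - 1][j],      # a advances, b repeats
                                     cost[i][j - 1],      # b advances, a repeats
                                     cost[i - 1][j - 1])  # both advance
    return cost[rows][cols]

def recognize(utterance, templates):
    """Return the label of the stored template with the lowest DTW cost."""
    return min(templates, key=lambda label: dtw_distance(utterance, templates[label]))

# Hypothetical one-dimensional "feature tracks" for two stored words.
templates = {"call": [1.0, 2.0, 3.0, 2.0], "stop": [3.0, 1.0, 1.0, 3.0]}
slow_call = [1.0, 1.0, 2.0, 2.0, 3.0, 3.0, 2.0]   # same shape, spoken more slowly

print(recognize(slow_call, templates))
```

    Real systems compare frames of multidimensional features (for example cepstral vectors) with a vector distance instead of `abs`, but the dynamic programming recursion is the same. A weighted variant would scale the local cost per frame before accumulation.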

  7. Computer-generated speech

    Aimthikul, Y.

    1981-12-01

    This thesis reviews the essential aspects of speech synthesis and distinguishes between the two prevailing techniques: compressed digital speech and phonemic synthesis. It then presents the hardware details of the five speech modules evaluated. FORTRAN programs were written to facilitate message creation and retrieval with four of the modules driven by a PDP-11 minicomputer. The fifth module was driven directly by a computer terminal. The compressed digital speech modules (T.I. 990/306, T.S.I. Series 3D and N.S. Digitalker) each contain a limited vocabulary produced by the manufacturers while both the phonemic synthesizers made by Votrax permit an almost unlimited set of sounds and words. A text-to-phoneme rules program was adapted for the PDP-11 (running under the RSX-11M operating system) to drive the Votrax Speech Pac module. However, the Votrax Type'N Talk unit has its own built-in translator. Comparison of these modules revealed that the compressed digital speech modules were superior in pronouncing words on an individual basis but lacked the inflection capability that permitted the phonemic synthesizers to generate more coherent phrases. These findings were necessarily highly subjective and dependent on the specific words and phrases studied. In addition, the rapid introduction of new modules by manufacturers will necessitate new comparisons. However, the results of this research verified that all of the modules studied do possess reasonable quality of speech that is suitable for man-machine applications. Furthermore, the development tools are now in place to permit the addition of computer speech output in such applications.

  8. SPEECH DISORDERS ENCOUNTERED DURING SPEECH THERAPY AND THERAPY TECHNIQUES

    İlhan ERDEM

    2013-06-01

    Speech is a physical and mental process that uses agreed signs and sounds to convey a message from one mind to another. To identify the sounds of speech, it is essential to know the structure and function of the various organs that make speech possible. Because speech is both a physical and a mental process, many factors can lead to speech disorders: a disorder may stem from language acquisition as well as from medical and psychological causes. Speaking is the collective work of many organs, like an orchestra. Because speech is a very complex skill with a mental dimension, it must be determined which of these obstacles inhibits speech. A speech disorder is a defect in the flow, rhythm, pitch, stress, composition, or vocalization of speech. This study considers speech disorders such as articulation disorders, stuttering, aphasia, dysarthria, local dialect speech, tongue and lip laziness, and rapid speech as speech defects in terms of language skills. The causes of these speech disorders were investigated, and suggestions for remedy are presented and discussed.

  9. Practical speech user interface design

    Lewis, James R

    2010-01-01

    Although speech is the most natural form of communication between humans, most people find using speech to communicate with machines anything but natural. Drawing from psychology, human-computer interaction, linguistics, and communication theory, Practical Speech User Interface Design provides a comprehensive yet concise survey of practical speech user interface (SUI) design. It offers practice-based and research-based guidance on how to design effective, efficient, and pleasant speech applications that people can really use. Focusing on the design of speech user interfaces for IVR application

  10. Social Expectation Improves Speech Perception in Noise.

    McGowan, Kevin B

    2015-12-01

    Listeners' use of social information during speech perception was investigated by measuring transcription accuracy of Chinese-accented speech in noise while listeners were presented with a congruent Chinese face, an incongruent Caucasian face, or an uninformative silhouette. When listeners were presented with a Chinese face they transcribed more accurately than when presented with the Caucasian face. This difference existed both for listeners with a relatively high level of experience and for listeners with a relatively low level of experience with Chinese-accented English. Overall, these results are inconsistent with a model of social speech perception in which listener bias reduces attendance to the acoustic signal. These results are generally consistent with exemplar models of socially indexed speech perception predicting that activation of a social category will raise base activation levels of socially appropriate episodic traces, but the similar performance of more and less experienced listeners suggests the need for a more nuanced view with a role for both detailed experience and listener stereotypes. PMID:27483742