WorldWideScience

Sample records for automated speech recognition

  1. Speech recognition and understanding

    Energy Technology Data Exchange (ETDEWEB)

    Vintsyuk, T.K.

    1983-05-01

    This article discusses the automatic processing of speech signals with the aim of finding a sequence of words (speech recognition) or a concept (speech understanding) being transmitted by the speech signal. The goal of the research is to develop an automatic typewriter that will automatically edit and type text under voice control. A dynamic programming method is proposed in which all possible class signals are stored, after which the presented signal is compared to all the stored signals during the recognition phase. Topics considered include element-by-element recognition of words of speech, learning speech recognition, phoneme-by-phoneme speech recognition, the recognition of connected speech, understanding connected speech, and prospects for designing speech recognition and understanding systems. An application of the composition dynamic programming method to the solution of basic problems in the recognition and understanding of speech is presented.

  2. Context dependent speech recognition

    OpenAIRE

    Andersson, Sebastian

    2006-01-01

    Poor speech recognition is a problem when developing spoken dialogue systems, but several studies have shown that speech recognition can be improved by post-processing of recognition output that uses the dialogue context, acoustic properties of a user utterance, and other available resources to train a statistical model to act as a filter between the speech recogniser and the dialogue manager. In this thesis a corpus of logged interactions between users and a dialogue system was used...

  3. Development of an automated speech recognition interface for personal emergency response systems

    Directory of Open Access Journals (Sweden)

    Mihailidis Alex

    2009-07-01

    Full Text Available Abstract Background Demands on long-term-care facilities are predicted to increase at an unprecedented rate as the baby boomer generation reaches retirement age. Aging-in-place (i.e. aging at home) is the desire of most seniors and is also a good option to reduce the burden on an over-stretched long-term-care system. Personal Emergency Response Systems (PERSs) help enable older adults to age in place by providing them with immediate access to emergency assistance. Traditionally they operate with push-button activators that connect the occupant via speaker-phone to a live emergency call-centre operator. If occupants do not wear the push button or cannot access the button, then the system is useless in the event of a fall or emergency. Additionally, a false alarm or failure to check in at a regular interval will trigger a connection to a live operator, which can be unwanted and intrusive to the occupant. This paper describes the development and testing of an automated, hands-free, dialogue-based PERS prototype. Methods The prototype system was built using a ceiling-mounted microphone array, an open-source automatic speech recognition engine, and a 'yes' and 'no' response dialogue modelled after an existing call-centre protocol. Testing compared a single microphone versus a microphone array with nine adults in both noisy and quiet conditions. Dialogue testing was completed with four adults. Results and discussion The microphone array demonstrated improvement over the single microphone. In all cases, dialogue testing resulted in the system reaching the correct decision about the kind of assistance the user was requesting. Further testing is required with elderly voices and under different noise conditions to ensure the appropriateness of the technology. Future developments include integration of the system with an emergency detection method as well as communication enhancement using features such as barge-in capability. Conclusion The use of an automated

  4. Advances in Speech Recognition

    CERN Document Server

    Neustein, Amy

    2010-01-01

    This volume is comprised of contributions from eminent leaders in the speech industry, and presents a comprehensive and in-depth analysis of the progress of speech technology in the topical areas of mobile settings, healthcare and call centers. The material addresses the technical aspects of voice technology within the framework of societal needs, such as the use of speech recognition software to produce up-to-date electronic health records, notwithstanding patients making changes to health plans and physicians. Included will be discussion of speech engineering, linguistics, human factors ana

  5. Speech recognition based on pattern recognition techniques

    Science.gov (United States)

    Rabiner, Lawrence R.

    1990-05-01

    Algorithms for speech recognition can be characterized broadly as pattern recognition approaches and acoustic-phonetic approaches. To date, the greatest degree of success in speech recognition has been obtained using pattern recognition paradigms. Pattern recognition techniques have been applied to the problems of isolated word (or discrete utterance) recognition, connected word recognition, and continuous speech recognition. It is shown that our understanding (and consequently the resulting recognizer performance) is best for the simplest recognition tasks and is considerably less well developed for large-scale recognition systems.

  6. HUMAN SPEECH EMOTION RECOGNITION

    Directory of Open Access Journals (Sweden)

    Maheshwari Selvaraj

    2016-02-01

    Full Text Available Emotions play an extremely important role in human mental life. They are a medium of expression of one's perspective or mental state to others. Speech Emotion Recognition (SER) can be defined as the extraction of the emotional state of the speaker from his or her speech signal. There are a few universal emotions, including Neutral, Anger, Happiness, and Sadness, which any intelligent system with finite computational resources can be trained to identify or synthesize as required. In this work spectral and prosodic features are used for speech emotion recognition because both of these features contain emotional information. Mel-frequency cepstral coefficients (MFCCs) are among the spectral features; fundamental frequency, loudness, pitch, speech intensity, and glottal parameters are the prosodic features used to model different emotions. The potential features are extracted from each utterance for the computational mapping between emotions and speech patterns. Pitch can be detected from the selected features, using which gender can be classified; a Support Vector Machine (SVM) is used to classify gender in this work. Radial Basis Function and Back Propagation networks are used to recognize the emotions based on the selected features, and it is shown that the radial basis function produces more accurate results for emotion recognition than the back propagation network.
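
    As a rough illustration of the feature pipeline this abstract describes, the sketch below extracts MFCC (spectral) and pitch/intensity (prosodic) statistics per utterance and feeds them to an SVM. It assumes the librosa and scikit-learn packages are available; the specific feature set, pooling statistics, and SVM settings are illustrative assumptions, not the authors' exact configuration.

```python
# Hedged sketch: spectral (MFCC) and prosodic (pitch, intensity) features
# pooled per utterance, then classified with an SVM.
import numpy as np
import librosa
from sklearn.svm import SVC

def extract_features(path):
    y, sr = librosa.load(path, sr=16000)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)   # spectral features
    f0 = librosa.yin(y, fmin=50, fmax=400, sr=sr)        # fundamental frequency
    rms = librosa.feature.rms(y=y)                       # intensity/loudness proxy
    # Summarize frame-level trajectories with utterance-level statistics.
    return np.hstack([mfcc.mean(axis=1), mfcc.std(axis=1),
                      [np.nanmean(f0), np.nanstd(f0)],
                      [rms.mean(), rms.std()]])

# Gender (or emotion) classification over the pooled features; X_train would
# hold rows of extract_features(...), y_train the corresponding labels.
clf = SVC(kernel="rbf", C=1.0)
# clf.fit(X_train, y_train); clf.predict(X_test)
```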

  7. Arabic Speech Recognition System using CMU-Sphinx4

    CERN Document Server

    Satori, H; Chenfour, N

    2007-01-01

    In this paper we present the creation of an Arabic version of an Automated Speech Recognition (ASR) system. This system is based on the open source Sphinx-4 from Carnegie Mellon University, a speech recognition system based on discrete hidden Markov models (HMMs). We investigate the changes that must be made to the model to adapt it to Arabic voice recognition. Keywords: Speech recognition, Acoustic model, Arabic language, HMMs, CMUSphinx-4, Artificial intelligence.

  8. Speech recognition in university classrooms

    OpenAIRE

    Wald, Mike; Bain, Keith; Basson, Sara H

    2002-01-01

    The LIBERATED LEARNING PROJECT (LLP) is an applied research project studying two core questions: 1) Can speech recognition (SR) technology successfully digitize lectures to display spoken words as text in university classrooms? 2) Can speech recognition technology be used successfully as an alternative to traditional classroom notetaking for persons with disabilities? This paper addresses these intriguing questions and explores the underlying complex relationship between speech recognition te...

  9. Techniques for automatic speech recognition

    Science.gov (United States)

    Moore, R. K.

    1983-05-01

    A brief insight into some of the algorithms that lie behind current automatic speech recognition systems is provided. Early phonetically based approaches were not particularly successful, due mainly to a lack of appreciation of the problems involved. These problems are summarized, and various recognition techniques are reviewed in the context of the solutions that they provide. It is pointed out that the majority of currently available speech recognition equipment employs a "whole-word" pattern-matching approach which, although relatively simple, has proved particularly successful in its ability to recognize speech. The concept of time normalization plays a central role in this type of recognition process, and a family of such algorithms is described in detail. The technique of dynamic time warping is not only capable of providing good performance for isolated word recognition but can also be extended to the recognition of connected speech (thereby removing one of the most severe limitations of early speech recognition equipment).

  10. Speech Recognition Technology: Applications & Future

    OpenAIRE

    Pankaj Pathak

    2010-01-01

    Voice or speech recognition is "the technology by which sounds, words or phrases spoken by humans are converted into electrical signals, and these signals are transformed into coding patterns to which meaning has been assigned". It is a technology that needs a combination of improved artificial intelligence and a more sophisticated speech-recognition engine. A primitive device that could recognize speech was developed by AT&T Bell Laboratories in the 1940s. According to...

  11. Introduction to Arabic Speech Recognition Using CMUSphinx System

    CERN Document Server

    Satori, H; Chenfour, N

    2007-01-01

    In this paper Arabic was investigated from the speech recognition point of view. We propose a novel approach to building an Arabic Automated Speech Recognition (ASR) system. This system is based on the open source CMU Sphinx-4 from Carnegie Mellon University. CMU Sphinx is a large-vocabulary, speaker-independent, continuous speech recognition system based on discrete Hidden Markov Models (HMMs). We build a model using utilities from the open-source CMU Sphinx and demonstrate the possible adaptability of this system to Arabic voice recognition.

  12. Emotion Recognition using Speech Features

    CERN Document Server

    Rao, K Sreenivasa

    2013-01-01

    “Emotion Recognition Using Speech Features” covers emotion-specific features present in speech and discussion of suitable models for capturing emotion-specific information for distinguishing different emotions.  The content of this book is important for designing and developing  natural and sophisticated speech systems. Drs. Rao and Koolagudi lead a discussion of how emotion-specific information is embedded in speech and how to acquire emotion-specific knowledge using appropriate statistical models. Additionally, the authors provide information about using evidence derived from various features and models. The acquired emotion-specific knowledge is useful for synthesizing emotions. Discussion includes global and local prosodic features at syllable, word and phrase levels, helpful for capturing emotion-discriminative information; use of complementary evidences obtained from excitation sources, vocal tract systems and prosodic features in order to enhance the emotion recognition performance;  and pro...

  13. Time-expanded speech and speech recognition in older adults.

    Science.gov (United States)

    Vaughan, Nancy E; Furukawa, Izumi; Balasingam, Nirmala; Mortz, Margaret; Fausti, Stephen A

    2002-01-01

    Speech understanding deficits are common in older adults. In addition to hearing sensitivity, changes in certain cognitive functions may affect speech recognition. One such change that may impact the ability to follow a rapidly changing speech signal is processing speed. When speakers slow the rate of their speech naturally in order to speak clearly, speech recognition is improved. The acoustic characteristics of naturally slowed speech are of interest in developing time-expansion algorithms to improve speech recognition for older listeners. In this study, we tested younger normally hearing, older normally hearing, and older hearing-impaired listeners on time-expanded speech using increased duration and increased intensity of unvoiced consonants. Although all groups performed best on unprocessed speech, performance with processed speech was better with the consonant gain feature without time expansion in the noise condition and better at the slowest time-expanded rate in the quiet condition. The effects of signal processing on speech recognition are discussed. PMID:17642020

  14. Lattice Parsing for Speech Recognition

    OpenAIRE

    Chappelier, Jean-Cédric; Rajman, Martin; Aragües, Ramon; Rozenknop, Antoine

    1999-01-01

    A lot of work remains to be done in the domain of a better integration of speech recognition and language processing systems. This paper gives an overview of several strategies for integrating linguistic models into speech understanding systems and investigates several ways of producing sets of hypotheses that include more "semantic" variability than usual language models. The main goal is to present and demonstrate by actual experiments that sequential coupling may be efficiently achieved by w...

  15. Speech recognition from spectral dynamics

    Indian Academy of Sciences (India)

    Hynek Hermansky

    2011-10-01

    Information is carried in changes of a signal. The paper starts with revisiting Dudley's concept of the carrier nature of speech. It points to its close connection to modulation spectra of speech and argues against short-term spectral envelopes as dominant carriers of the linguistic information in speech. The history of spectral representations of speech is briefly discussed. Some of the history of gradual infusion of the modulation spectrum concept into automatic recognition of speech (ASR) comes next, pointing to the relationship of modulation spectrum processing to well-accepted ASR techniques such as dynamic speech features or RelAtive SpecTrAl (RASTA) filtering. Next, the frequency domain perceptual linear prediction technique for deriving autoregressive models of temporal trajectories of spectral power in individual frequency bands is reviewed. Finally, posterior-based features, which allow for straightforward application of modulation frequency domain information, are described. The paper is tutorial in nature, aims at a historical global overview of attempts for using spectral dynamics in machine recognition of speech, and does not always provide enough detail of the described techniques. However, extensive references to earlier work are provided to compensate for the lack of detail in the paper.
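
    The RASTA filtering mentioned above band-pass filters each log band-energy trajectory over time, keeping modulations near typical speech rates and suppressing slowly varying channel effects. The sketch below uses the commonly published RASTA filter coefficients; the exact pole value (0.94 vs. 0.98) varies across implementations and is an assumption here.

```python
# Hedged sketch of RelAtive SpecTrAl (RASTA) filtering along the time axis.
import numpy as np
from scipy.signal import lfilter

def rasta_filter(log_energies):
    """Apply RASTA band-pass filtering over time.

    log_energies: array of shape (n_frames, n_bands), e.g. log critical-band
    energies; each band's trajectory is filtered independently.
    """
    b = np.array([0.2, 0.1, 0.0, -0.1, -0.2])  # FIR part: a smoothed differentiator
    a = np.array([1.0, -0.94])                  # IIR part: leaky-integrator pole
    return lfilter(b, a, log_energies, axis=0)  # filter each band along time
```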

  16. Discriminative learning for speech recognition

    CERN Document Server

    He, Xiadong

    2008-01-01

    In this book, we introduce the background and mainstream methods of probabilistic modeling and discriminative parameter optimization for speech recognition. The specific models treated in depth include the widely used exponential-family distributions and the hidden Markov model. A detailed study is presented on unifying the common objective functions for discriminative learning in speech recognition, namely maximum mutual information (MMI), minimum classification error, and minimum phone/word error. The unification is presented, with rigorous mathematical analysis, in a common rational-functio

  17. Pattern recognition in speech and language processing

    CERN Document Server

    Chou, Wu

    2003-01-01

    Contents include: Minimum Classification Error (MCE) Approach in Pattern Recognition, Wu Chou; Minimum Bayes-Risk Methods in Automatic Speech Recognition, Vaibhava Goel and William Byrne; A Decision Theoretic Formulation for Adaptive and Robust Automatic Speech Recognition, Qiang Huo; Speech Pattern Recognition Using Neural Networks, Shigeru Katagiri; Large Vocabulary Speech Recognition Based on Statistical Methods, Jean-Luc Gauvain; Toward Spontaneous Speech Recognition and Understanding, Sadaoki Furui; Speaker Authentication, Qi Li and Biing-Hwang Juang; HMMs for Language Processing Problems, Ri...

  18. Speech Recognition on Mobile Devices

    DEFF Research Database (Denmark)

    Tan, Zheng-Hua; Lindberg, Børge

    2010-01-01

    The enthusiasm for deploying automatic speech recognition (ASR) on mobile devices is driven both by remarkable advances in ASR technology and by the demand for efficient user interfaces on such devices as mobile phones and personal digital assistants (PDAs). This chapter presents an overview of ASR...

  19. On speech recognition during anaesthesia

    DEFF Research Database (Denmark)

    Alapetite, Alexandre

    2007-01-01

    This PhD thesis in human-computer interfaces (informatics) studies the case of the anaesthesia record used during medical operations and the possibility to supplement it with speech recognition facilities. Problems and limitations have been identified with the traditional paper-based anaesthesia record... inaccuracies in the anaesthesia record. Supplementing the electronic anaesthesia record interface with speech input facilities is proposed as one possible solution to a part of the problem. The testing of the various hypotheses has involved the development of a prototype of an electronic anaesthesia record... accuracy. Finally, the last part of the thesis looks at the acceptance and success of a speech recognition system introduced in a Danish hospital to produce patient records.

  20. Novel Techniques for Dialectal Arabic Speech Recognition

    CERN Document Server

    Elmahdy, Mohamed; Minker, Wolfgang

    2012-01-01

    Novel Techniques for Dialectal Arabic Speech Recognition describes approaches to improve automatic speech recognition for dialectal Arabic. Since speech resources for dialectal Arabic speech recognition are very sparse, the authors describe how existing Modern Standard Arabic (MSA) speech data can be applied to dialectal Arabic speech recognition, while assuming that MSA is always a second language for all Arabic speakers. In this book, Egyptian Colloquial Arabic (ECA) has been chosen as a typical Arabic dialect. ECA is the first-ranked Arabic dialect in terms of number of speakers, and a high-quality ECA speech corpus with accurate phonetic transcription has been collected. MSA acoustic models were trained using news broadcast speech. In order to use MSA cross-lingually in dialectal Arabic speech recognition, the authors have normalized the phoneme sets for MSA and ECA. After this normalization, they have applied state-of-the-art acoustic model adaptation techniques like Maximum Likelihood Linear Regression (MLLR) and M...

  1. Automated Speech Rate Measurement in Dysarthria

    Science.gov (United States)

    Martens, Heidi; Dekens, Tomas; Van Nuffelen, Gwen; Latacz, Lukas; Verhelst, Werner; De Bodt, Marc

    2015-01-01

    Purpose: In this study, a new algorithm for automated determination of speech rate (SR) in dysarthric speech is evaluated. We investigated how reliably the algorithm calculates the SR of dysarthric speech samples when compared with calculation performed by speech-language pathologists. Method: The new algorithm was trained and tested using Dutch…

  2. Recent Advances in Robust Speech Recognition Technology

    CERN Document Server

    Ramírez, Javier

    2011-01-01

    This E-book is a collection of articles that describe advances in speech recognition technology. Robustness in speech recognition refers to the need to maintain high speech recognition accuracy even when the quality of the input speech is degraded, or when the acoustical, articulatory, or phonetic characteristics of speech in the training and testing environments differ. Obstacles to robust recognition include acoustical degradations produced by additive noise, the effects of linear filtering, nonlinearities in transduction or transmission, as well as impulsive interfering sources, and diminishe

  3. On speech recognition during anaesthesia

    OpenAIRE

    Alapetite, Alexandre

    2007-01-01

    This PhD thesis in human-computer interfaces (HCI, informatics) studies the case of the anaesthesia record used during medical operations and the possibility to supplement it with speech recognition facilities. Problems and limitations have been identified with the traditional paper-based anaesthesia record, but also with newer electronic versions, in particular ergonomic issues and the fact that anaesthesiologists tend to postpone the registration of the medications and other events during b...

  4. Neural Network Based Hausa Language Speech Recognition

    Directory of Open Access Journals (Sweden)

    Matthew K Luka

    2012-05-01

    Full Text Available Speech recognition is a key element of diverse applications in communication systems, medical transcription systems, security systems, etc. However, there has been very little research in the domain of speech processing for African languages; hence the need to extend the frontier of research in order to port in the diverse applications based on speech recognition. The Hausa language is an important indigenous lingua franca in West and Central Africa, spoken as a first or second language by about fifty million people. Speech recognition of the Hausa language is presented in this paper. A pattern recognition neural network was used for developing the system.

  5. Dynamic Automatic Noisy Speech Recognition System (DANSR)

    OpenAIRE

    Paul, Sheuli

    2014-01-01

    In this thesis we studied and investigated a very common but long-standing noise problem, and we provide a solution to this problem. The task is to deal with different types of noise that occur simultaneously and which we call hybrid. Although there are individual solutions for specific types, one cannot simply combine them because each solution affects the whole speech. We developed an automatic speech recognition system DANSR (Dynamic Automatic Noisy Speech Recognition System) for hybri...

  6. Automatic speech recognition a deep learning approach

    CERN Document Server

    Yu, Dong

    2015-01-01

    This book summarizes the recent advancement in the field of automatic speech recognition with a focus on discriminative and hierarchical models. This will be the first automatic speech recognition book to include a comprehensive coverage of recent developments such as conditional random field and deep learning techniques. It presents insights and theoretical foundation of a series of recent models such as conditional random field, semi-Markov and hidden conditional random field, deep neural network, deep belief network, and deep stacking models for sequential learning. It also discusses practical considerations of using these models in both acoustic and language modeling for continuous speech recognition.

  7. Robust speech recognition using articulatory information

    OpenAIRE

    Kirchhoff, Katrin

    1999-01-01

    Current automatic speech recognition systems make use of a single source of information about their input, viz. a preprocessed form of the acoustic speech signal, which encodes the time-frequency distribution of signal energy. The goal of this thesis is to investigate the benefits of integrating articulatory information into state-of-the art speech recognizers, either as a genuine alternative to standard acoustic representations, or as an additional source of information. Articulatory informa...

  8. Speech Recognition in Natural Background Noise

    OpenAIRE

    Julien Meyer; Laure Dentel; Fanny Meunier

    2013-01-01

    In the real world, human speech recognition nearly always involves listening in background noise. The impact of such noise on speech signals and on intelligibility performance increases with the separation of the listener from the speaker. The present behavioral experiment provides an overview of the effects of such acoustic disturbances on speech perception in conditions approaching ecologically valid contexts. We analysed the intelligibility loss in spoken word lists with increasing listene...

  9. Man machine interface based on speech recognition

    International Nuclear Information System (INIS)

    This work reports the development of a man-machine interface based on speech recognition. The system must recognize spoken commands and execute the desired tasks without manual intervention by operators. The range of applications goes from the execution of commands in an industrial plant's control room to navigation and interaction in virtual environments. Results are reported for isolated word recognition, the isolated words corresponding to the spoken commands. In the pre-processing stage, relevant parameters are extracted from the speech signals using the cepstral analysis technique; these parameters are used for isolated word recognition and serve as the inputs of an artificial neural network that performs the recognition tasks. (author)

  10. Robust coarticulatory modeling for continuous speech recognition

    Science.gov (United States)

    Schwartz, R.; Chow, Y. L.; Dunham, M. O.; Kimball, O.; Krasner, M.; Kubala, F.; Makhoul, J.; Price, P.; Roucos, S.

    1986-10-01

    The purpose of this project is to perform research into algorithms for the automatic recognition of individual sounds or phonemes in continuous speech. The algorithms developed should be appropriate for understanding large-vocabulary continuous speech input and are to be made available to the Strategic Computing Program for incorporation in a complete word recognition system. This report describes progress to date in developing phonetic models that are appropriate for continuous speech recognition. In continuous speech, the acoustic realization of each phoneme depends heavily on the preceding and following phonemes: a process known as coarticulation. Thus, while there are relatively few phonemes in English (on the order of fifty or so), the number of possible different acoustic realizations is in the thousands. Therefore, to develop high-accuracy recognition algorithms, one may need to develop literally thousands of relatively distinct phonetic models to represent the various phonetic contexts adequately. Developing a large number of models usually necessitates having a large amount of speech to provide reliable estimates of the model parameters. The major contributions of this work are the development of: (1) a simple but powerful formalism for modeling phonemes in context; (2) robust training methods for the reliable estimation of model parameters by utilizing the available speech training data in a maximally effective way; and (3) efficient search strategies for phonetic recognition while maintaining high recognition accuracy.
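
    To make the combinatorics above concrete, the sketch below enumerates context-dependent (triphone) units from phoneme strings. The helper name and toy corpus are hypothetical, but the sketch shows why roughly fifty phonemes yield thousands of distinct models to train.

```python
# Hedged sketch: enumerating triphone contexts from phoneme transcriptions.
from itertools import chain

def triphones(phonemes, pad="sil"):
    """Return (left, center, right) context triples for a phoneme sequence."""
    seq = [pad] + list(phonemes) + [pad]
    return [(seq[i - 1], seq[i], seq[i + 1]) for i in range(1, len(seq) - 1)]

# Toy corpus: two words transcribed as phoneme sequences.
corpus = [["hh", "eh", "l", "ow"], ["w", "er", "l", "d"]]
contexts = set(chain.from_iterable(triphones(seq) for seq in corpus))
print(len(contexts), "distinct triphone contexts")  # grows rapidly with more data
```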

  11. PCA-Based Speech Enhancement for Distorted Speech Recognition

    Directory of Open Access Journals (Sweden)

    Tetsuya Takiguchi

    2007-09-01

    Full Text Available We investigated a robust speech feature extraction method using kernel PCA (Principal Component Analysis) for distorted speech recognition. Kernel PCA has been suggested for various image processing tasks requiring an image model, such as denoising, where a noise-free image is constructed from a noisy input image. Much research on robust speech feature extraction has been done, but it remains difficult to completely remove additive or convolution noise (distortion). The most commonly used noise-removal techniques are based on spectral-domain operations; for speech recognition, the MFCC (Mel Frequency Cepstral Coefficient) is then computed, where the DCT (Discrete Cosine Transform) is applied to the mel-scale filter bank output. This paper describes a new PCA-based speech enhancement algorithm using kernel PCA instead of the DCT, where the main speech element is projected onto low-order features, while the noise or distortion element is projected onto high-order features. Its effectiveness is confirmed by word recognition experiments on distorted speech.
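
    As a minimal sketch of the idea, the code below fits scikit-learn's KernelPCA (with inverse transform enabled) on clean filter-bank features, then reconstructs noisy frames from their low-order projections so that higher-order components carrying distortion are discarded. The dimensions, kernel parameters, and random placeholder data are assumptions for illustration.

```python
# Hedged sketch: kernel PCA as a feature-space "denoiser" for log mel
# filter-bank outputs, in the spirit of replacing the DCT step of MFCCs.
import numpy as np
from sklearn.decomposition import KernelPCA

# Placeholder for real (n_frames, n_mel_bands) log filter-bank features
# from clean training speech.
X_clean = np.random.randn(1000, 24)

kpca = KernelPCA(n_components=12, kernel="rbf", gamma=0.05,
                 fit_inverse_transform=True)  # enables reconstruction
kpca.fit(X_clean)

# Project noisy frames onto low-order components (mainly speech) and
# reconstruct; higher-order components (mainly distortion) are dropped.
X_noisy = X_clean + 0.3 * np.random.randn(*X_clean.shape)
X_enhanced = kpca.inverse_transform(kpca.transform(X_noisy))
```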

  12. Low SNR Speech Recognition using SMKL

    Directory of Open Access Journals (Sweden)

    Qin Yuan

    2014-05-01

    Full Text Available While traditional speech recognition methods have achieved great success in a number of real-world applications, their further application to some difficult situations, such as low Signal-to-Noise Ratio (SNR) signals and local languages, is still limited by their shortcomings in adaptation ability. In particular, their robustness to pronunciation-level noise is not satisfactory. To overcome these limitations, we propose in this paper a novel speech recognition approach for low signal-to-noise ratio signals. The general steps of our approach are signal preprocessing, feature extraction, and recognition with the simple multiple kernel learning (SMKL) method. The application of SMKL to speech recognition at low SNR is then presented. We evaluate the proposed approach on a standard data set. The experimental results show that the performance of the SMKL method for low-SNR speech recognition is significantly higher than that of methods based on other popular approaches. Further, the SMKL-based method can be straightforwardly applied to recognition problems with large-scale datasets, high-dimensional data, and a large amount of heterogeneous information.

  13. Connected digit speech recognition system for Malayalam language

    Indian Academy of Sciences (India)

    Cini Kurian; Kannan Balakrishnan

    2013-12-01

    A connected digit speech recognizer is important in many applications such as automated banking systems, catalogue dialing, and automatic data entry. This paper presents an optimum speaker-independent connected digit recognizer for the Malayalam language. The system employs Perceptual Linear Predictive (PLP) cepstral coefficients for speech parameterization and continuous-density Hidden Markov Models (HMMs) in the recognition process; the Viterbi algorithm is used for decoding. The training database contains utterances from 21 speakers in the age group of 20 to 40 years, recorded in a normal office environment, with each speaker asked to read 20 sets of continuous digits. The system obtained an accuracy of 99.5% on unseen data.
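
    For reference, a minimal sketch of the Viterbi decoding step the recognizer relies on is given below, for a discrete-emission HMM in log space. The function name and toy interface are hypothetical, not the paper's implementation.

```python
# Hedged sketch of Viterbi decoding for a discrete HMM, in log probabilities.
import numpy as np

def viterbi(log_init, log_trans, log_emit, obs):
    """Most likely state path for a discrete HMM.

    log_init: (S,) log initial state probabilities
    log_trans: (S, S) log transition probabilities
    log_emit: (S, V) log emission probabilities
    obs: sequence of observation symbol indices
    """
    delta = log_init + log_emit[:, obs[0]]
    backptr = []
    for o in obs[1:]:
        scores = delta[:, None] + log_trans        # scores[i, j]: from state i to j
        backptr.append(scores.argmax(axis=0))      # best predecessor for each j
        delta = scores.max(axis=0) + log_emit[:, o]
    path = [int(delta.argmax())]                   # backtrack the best path
    for bp in reversed(backptr):
        path.append(int(bp[path[-1]]))
    return path[::-1], float(delta.max())
```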

  14. Speech Clarity Index (Ψ): A Distance-Based Speech Quality Indicator and Recognition Rate Prediction for Dysarthric Speakers with Cerebral Palsy

    Science.gov (United States)

    Kayasith, Prakasith; Theeramunkong, Thanaruk

    It is a tedious and subjective task to measure the severity of dysarthria by manually evaluating a speaker's speech using available standard assessment methods based on human perception. This paper presents an automated approach to assess the speech quality of a dysarthric speaker with cerebral palsy. With the consideration of two complementary factors, speech consistency and speech distinction, a speech quality indicator called the speech clarity index (Ψ) is proposed as a measure of the speaker's ability to produce a consistent speech signal for a certain word and distinguishable speech signals for different words. As an application, it can be used to assess speech quality and forecast the speech recognition rate of an individual dysarthric speaker's speech before the exhaustive implementation of an automatic speech recognition system for that speaker. The effectiveness of Ψ as a speech recognition rate predictor is evaluated by rank-order inconsistency, correlation coefficient, and root-mean-square of difference. The evaluations were done by comparing its predicted recognition rates with those predicted by the standard methods, the articulatory and intelligibility tests, based on two recognition systems (HMM and ANN). The results show that Ψ is a promising indicator for predicting the recognition rate of dysarthric speech. All experiments were done on a speech corpus composed of speech data from eight normal speakers and eight dysarthric speakers.

  15. Hidden neural networks: application to speech recognition

    DEFF Research Database (Denmark)

    Riis, Søren Kamaric

    1998-01-01

    We evaluate the hidden neural network HMM/NN hybrid on two speech recognition benchmark tasks: (1) task-independent isolated word recognition on the Phonebook database, and (2) recognition of broad phoneme classes in continuous speech from the TIMIT database. It is shown how hidden neural networks (HNNs) with far fewer parameters than conventional HMMs and other hybrids can obtain comparable performance, and for the broad-class task it is illustrated how the HNN can be applied as a purely transition-based system, where acoustic context-dependent transition probabilities are estimated by neural networks...

  16. Dynamic Programming Algorithms in Speech Recognition

    Directory of Open Access Journals (Sweden)

    Titus Felix FURTUNA

    2008-01-01

    Full Text Available In a word-based speech recognition system, recognition requires the comparison between the input signal of the word and the various words of the dictionary. The problem can be solved efficiently by a dynamic comparison algorithm whose goal is to put the temporal scales of the two words into optimal correspondence. An algorithm of this type is Dynamic Time Warping. This paper presents two alternative implementations of the algorithm designed for recognition of isolated words.
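
    A minimal version of the dynamic comparison the abstract describes is sketched below: classic Dynamic Time Warping over frame-level feature sequences, with a nearest-template decision. The function names and the Euclidean local distance are illustrative assumptions.

```python
# Hedged sketch of Dynamic Time Warping for isolated-word template matching.
import numpy as np

def dtw_distance(a, b):
    """Accumulated DTW cost between feature sequences a (n, d) and b (m, d)."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])  # local frame distance
            D[i, j] = cost + min(D[i - 1, j],           # vertical step
                                 D[i, j - 1],           # horizontal step
                                 D[i - 1, j - 1])       # diagonal (match) step
    return D[n, m]

# Isolated-word recognition: pick the dictionary template with minimal cost.
# best_word = min(templates, key=lambda w: dtw_distance(features, templates[w]))
```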

  17. Emotion recognition from speech: tools and challenges

    Science.gov (United States)

    Al-Talabani, Abdulbasit; Sellahewa, Harin; Jassim, Sabah A.

    2015-05-01

    Human emotion recognition from speech is studied frequently for its importance in many applications, e.g. human-computer interaction. There is wide diversity and non-agreement about the basic emotions or emotion-related states on the one hand, and about where the emotion-related information lies in the speech signal on the other. These diversities motivate our investigation into extracting meta-features using the PCA approach or a non-adaptive random projection (RP), which significantly reduce the large-dimensional speech feature vectors that may contain a wide range of emotion-related information. Subsets of meta-features are fused to increase the performance of the recognition model that adopts the score-based LDC classifier. We shall demonstrate that our scheme outperforms state-of-the-art results when tested on non-prompted databases or acted databases (i.e. when subjects act specific emotions while uttering a sentence). However, the huge gap between accuracy rates achieved on the different types of speech datasets raises questions about the way emotions modulate speech. In particular, we shall argue that emotion recognition from speech should not be dealt with as a classification problem. We shall demonstrate the presence of a spectrum of different emotions in the same speech portion, especially in the non-prompted data sets, which tend to be more "natural" than the acted datasets where the subjects attempt to suppress all but one emotion.
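
    As a hedged sketch of the non-adaptive random projection (RP) step mentioned above, the snippet below reduces a large per-utterance feature vector with a data-independent Gaussian projection from scikit-learn; the dimensions and placeholder data are assumptions.

```python
# Hedged sketch: non-adaptive Gaussian random projection as a meta-feature
# extractor; unlike PCA, the projection does not depend on the data.
import numpy as np
from sklearn.random_projection import GaussianRandomProjection

# Placeholder: 200 utterances, each with a 5000-dimensional feature vector.
X = np.random.randn(200, 5000)

rp = GaussianRandomProjection(n_components=100, random_state=0)
X_meta = rp.fit_transform(X)  # data-independent meta-features
print(X_meta.shape)           # (200, 100)
```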

  18. Fifty years of progress in speech recognition

    Science.gov (United States)

    Reddy, Raj

    2004-10-01

    Human-level speech recognition has proved to be an elusive goal because of the many sources of variability that affect speech: from stationary and dynamic noise, microphone variability, and speaker variability to variability at phonetic, prosodic, and grammatical levels. Over the past 50 years, Jim Flanagan has been a continuous source of encouragement and inspiration to the speech recognition community. While early isolated word systems primarily used acoustic knowledge, systems in the 1970s found mechanisms to represent and utilize syntactic (e.g., information retrieval) and semantic knowledge (e.g., Chess) in speech recognition systems. As vocabularies became larger, leading to greater ambiguity and perplexity, we had to explore the use of task-specific and context-specific knowledge to reduce the branching factors. As the need arose for systems that can be used by open populations using telephone-quality speech, we developed learning techniques that use very large data sets and noise adaptation methods. We still have a long way to go before we can satisfactorily handle unrehearsed spontaneous speech, speech from non-native speakers, and dynamic learning of new words, phrases, and grammatical forms.

  19. Phonetic Alphabet for Speech Recognition of Czech

    OpenAIRE

    J. Uhlir; Psutka, J.; J. Nouza

    1997-01-01

    In the paper we introduce and discuss an alphabet that has been proposed for phonemically oriented automatic speech recognition. The alphabet, denoted as PAC (Phonetic Alphabet for Czech), consists of 48 basic symbols that allow for distinguishing all major events occurring in the spoken Czech language. The symbols can be used both for phonetic transcription of Czech texts as well as for labeling recorded speech signals. For practical reasons, the alphabet occurs in two versions; one utilizes Cze...

  20. Testing for robust speech recognition performance

    Science.gov (United States)

    Simpson, C. A.; Moore, C. A.; Ruth, J. C.

    Results are reported from two studies which evaluated speaker-dependent connected-speech template-matching algorithms. One study examined the recognition performance for vocabularies spoken within a spacesuit. Two token vocabularies were used that were recorded in different noise levels. The second study evaluated the rejection accuracy for two commercial speech recognizers. The spoken test tokens were variations on a single word. The tests underscored the inferiority of speech recognizers relative to the human capability for discerning among phonetically different words. However, one commercial recognizer exhibited over 96-percent rejection accuracy in a noisy environment.

  1. Novel acoustic features for speech emotion recognition

    Institute of Scientific and Technical Information of China (English)

    ROH Yong-Wan; KIM Dong-Ju; LEE Woo-Seok; HONG Kwang-Seok

    2009-01-01

    This paper focuses on acoustic features that effectively improve the recognition of emotion in human speech. The novel features in this paper are based on spectral-based entropy parameters such as fast Fourier transform (FFT) spectral entropy, delta FFT spectral entropy, Mel-frequency filter bank (MFB) spectral entropy, and delta MFB spectral entropy. Spectral-based entropy features are simple; they reflect the frequency characteristics of speech and how those characteristics change over time. We implement an emotion rejection module using the probability distributions of recognized-scores and rejected-scores. This reduces the false recognition rate and improves overall performance. Recognized-scores and rejected-scores refer to the probabilities of recognized and rejected emotion recognition results, respectively. These scores are first obtained from a pattern recognition procedure. The pattern recognition phase uses the Gaussian mixture model (GMM). We classify the four emotional states of anger, sadness, happiness and neutrality. The proposed method is evaluated using 45 sentences in each emotion for 30 subjects, 15 males and 15 females. Experimental results show that the proposed method is superior to the existing emotion recognition methods based on GMM using energy, Zero Crossing Rate (ZCR), linear prediction coefficient (LPC), and pitch parameters. We demonstrate the effectiveness of the proposed approach: one of the proposed features, the combined MFB and delta MFB spectral entropy, improves performance by approximately 10% compared to the existing feature parameters for speech emotion recognition methods. We also demonstrate a 4% performance improvement from the applied emotion rejection with low confidence scores.
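
    A minimal sketch of the entropy-style features described above follows: per-frame FFT spectral entropy plus its delta (frame-to-frame change). Framing, normalization constants, and function names are illustrative assumptions; the MFB variants would apply the same entropy to mel filter-bank energies instead of raw FFT bins.

```python
# Hedged sketch: FFT spectral entropy and its delta as frame-level features.
import numpy as np

def spectral_entropy(frames):
    """frames: (n_frames, frame_len) windowed samples -> (n_frames,) entropies."""
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    p = power / (power.sum(axis=1, keepdims=True) + 1e-12)  # spectrum as a PMF
    return -(p * np.log(p + 1e-12)).sum(axis=1)

def delta(feature_track):
    """Frame-to-frame change of a feature trajectory (the 'delta' variant)."""
    return np.diff(feature_track, prepend=feature_track[:1])
```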

  2. The Phase Spectra Based Feature for Robust Speech Recognition

    Directory of Open Access Journals (Sweden)

    Abbasian ALI

    2009-07-01

    Full Text Available Speech recognition in adverse environments is one of the major issues in automatic speech recognition today. While most current speech recognition systems prove highly efficient in ideal environments, their performance degrades drastically when they are applied in real environments because of noise-affected speech. In this paper a new feature representation based on phase spectra and Perceptual Linear Prediction (PLP) is suggested which can be used for robust speech recognition. It is shown that these new features can improve the performance of speech recognition not only in clean conditions but also at various noise levels when compared to PLP features.

  3. Automated Intelligibility Assessment of Pathological Speech Using Phonological Features

    Directory of Open Access Journals (Sweden)

    Catherine Middag

    2009-01-01

    Full Text Available It is commonly acknowledged that word or phoneme intelligibility is an important criterion in the assessment of the communication efficiency of a pathological speaker. People have therefore put a lot of effort into the design of perceptual intelligibility rating tests. These tests usually have the drawback that they employ unnatural speech material (e.g., nonsense words) and that they cannot fully exclude errors due to listener bias. Therefore, there is a growing interest in the application of objective automatic speech recognition technology to automate the intelligibility assessment. Current research is headed towards the design of automated methods which can be shown to produce ratings that correspond well with those emerging from a well-designed and well-performed perceptual test. In this paper, a novel methodology that is built on previous work (Middag et al., 2008) is presented. It utilizes phonological features, automatic speech alignment based on acoustic models that were trained on normal speech, context-dependent speaker feature extraction, and intelligibility prediction based on a small model that can be trained on pathological speech samples. The experimental evaluation of the new system reveals that the root mean squared error of the discrepancies between perceived and computed intelligibilities can be as low as 8 on a scale of 0 to 100.

  4. Phonological modeling for continuous speech recognition in Korean

    CERN Document Server

    Lee, W I; Lee, J H; Lee, WonIl; Lee, Geunbae; Lee, Jong-Hyeok

    1996-01-01

    A new scheme to represent phonological changes during continuous speech recognition is suggested. A phonological tag coupled with its morphological tag is designed to represent the conditions of Korean phonological changes. A pairwise language model of these morphological and phonological tags is implemented in Korean speech recognition system. Performance of the model is verified through the TDNN-based speech recognition experiments.

  5. Speech Recognition: A World of Opportunities

    Science.gov (United States)

    PACER Center, 2004

    2004-01-01

    Speech recognition technology helps people with disabilities interact with computers more easily. People with motor limitations, who cannot use a standard keyboard and mouse, can use their voices to navigate the computer and create documents. The technology is also useful to people with learning disabilities who experience difficulty with spelling…

  6. Bimodal Emotion Recognition from Speech and Text

    Directory of Open Access Journals (Sweden)

    Weilin Ye

    2014-01-01

    Full Text Available This paper presents an approach to emotion recognition from speech signals and textual content. In the analysis of speech signals, thirty-seven acoustic features are extracted from the speech input. Two different classifiers, Support Vector Machines (SVMs) and a BP neural network, are adopted to classify the emotional states. In text analysis, we use a two-step classification method to recognize the emotional states. The final emotional state is determined based on the emotion outputs from the acoustic and textual analyses. In this paper we have two parallel classifiers for acoustic information and two serial classifiers for textual information, and a final decision is made by combining these classifiers in decision-level fusion. Experimental results show that the emotion recognition accuracy of the integrated system is better than that of either of the two individual approaches.
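
    A possible decision-level fusion of the two modalities, as described above, can be as simple as a weighted combination of per-emotion scores; the weight and the helper below are illustrative assumptions, not the paper's reported rule.

```python
# Hedged sketch: weighted decision-level fusion of acoustic and textual scores.
def fuse_decisions(acoustic_scores, text_scores, w=0.6):
    """acoustic_scores, text_scores: dicts mapping emotion -> probability-like score."""
    fused = {e: w * acoustic_scores[e] + (1 - w) * text_scores[e]
             for e in acoustic_scores}
    return max(fused, key=fused.get)  # final emotional state

print(fuse_decisions({"anger": 0.7, "joy": 0.3}, {"anger": 0.4, "joy": 0.6}))
```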

  7. Objects Control through Speech Recognition Using LabVIEW

    OpenAIRE

    Ankush Sharma; Srinivas Perala; Priya Darshni

    2013-01-01

    Speech is the natural form of human communication, and speech processing is one of the most stimulating areas of signal processing. Speech recognition technology has made it possible for computers to follow human voice commands and understand human languages. In this paper, control of objects (LEDs, toggle switches, etc.) through human speech is designed by combining virtual instrumentation technology with speech recognition techniques; password authentication is also provided....

  8. Phonetic Alphabet for Speech Recognition of Czech

    Directory of Open Access Journals (Sweden)

    J. Uhlir

    1997-12-01

    Full Text Available In the paper we introduce and discuss an alphabet that has been proposed for phonemically oriented automatic speech recognition. The alphabet, denoted as PAC (Phonetic Alphabet for Czech), consists of 48 basic symbols that allow for distinguishing all major events occurring in the spoken Czech language. The symbols can be used both for phonetic transcription of Czech texts as well as for labeling recorded speech signals. For practical reasons, the alphabet occurs in two versions; one utilizes Czech native characters and the other employs symbols similar to those used for English in the DARPA and NIST alphabets.

  9. Post-editing through Speech Recognition

    DEFF Research Database (Denmark)

    Mesa-Lao, Bartolomé

    In the past couple of years automatic speech recognition (ASR) software has quietly created a niche for itself in many situations of our lives. Nowadays it can be found at the other end of customer-support hotlines, it is built into operating systems and it is offered as an alternative text... This study combines one of the most popular computer-aided translation workbenches on the market (i.e. MemoQ) with one of the most well-known ASR packages (i.e. Dragon Naturally Speaking from Nuance). Two data correction modes will be considered: a) keyboard vs. b) keyboard and speech combined. These two different ways...

  11. Speech recognition: Acoustic, phonetic and lexical knowledge

    Science.gov (United States)

    Zue, V. W.

    1985-08-01

    During this reporting period we continued to make progress on the acquisition of acoustic-phonetic and lexical knowledge. We completed development of a continuous digit recognition system. The system was constructed to investigate the use of acoustic-phonetic knowledge in a speech recognition system. The significant achievements of this study include the development of a soft-failure procedure for lexical access and the discovery of a set of acoustic-phonetic features for verification. We completed a study of the constraints that lexical stress imposes on word recognition. We found that lexical stress information alone can, on the average, reduce the number of word candidates from a large dictionary by more than 80 percent. In conjunction with this study, we successfully developed a system that automatically determines the stress pattern of a word from the acoustic signal. We performed an acoustic study on the characteristics of nasal consonants and nasalized vowels. We have also developed recognition algorithms for nasal murmurs and nasalized vowels in continuous speech. We finished the preliminary development of a system that aligns a speech waveform with the corresponding phonetic transcription.
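
    To illustrate the lexical-stress pruning reported above, the toy sketch below filters a dictionary by a detected stress pattern (e.g., "10" for stressed-unstressed); the lexicon and pattern encoding are hypothetical.

```python
# Hedged sketch: pruning word candidates by lexical stress pattern.
# Toy lexicon mapping words to stress patterns: '1' = stressed, '0' = unstressed.
LEXICON = {
    "record_noun": "10", "record_verb": "01",
    "window": "10", "about": "01", "banana": "010",
}

def stress_candidates(pattern, lexicon=LEXICON):
    """Words whose stress pattern matches the one detected from the signal."""
    return [word for word, p in lexicon.items() if p == pattern]

print(stress_candidates("01"))  # only a small fraction of the lexicon survives
```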

  12. Development of a System for Automatic Recognition of Speech

    Directory of Open Access Journals (Sweden)

    Michal Kuba

    2003-01-01

    Full Text Available The article gives a review of research on processing and automatic recognition of speech signals (ARR) at the Department of Telecommunications of the Faculty of Electrical Engineering, University of Zilina. On-going research is oriented to speech parametrization using 2-dimensional cepstral analysis, and to an application of HMMs and neural networks for speech recognition in the Slovak language. The article summarizes achieved results and outlines the future orientation of our research in automatic speech recognition.

  13. A Dialectal Chinese Speech Recognition Framework

    Institute of Scientific and Technical Information of China (English)

    Jing Li; Thomas Fang Zheng; William Byrne; Dan Jurafsky

    2006-01-01

    A framework for dialectal Chinese speech recognition is proposed and studied, in which a relatively small dialectal Chinese (or in other words Chinese influenced by the native dialect) speech corpus and dialect-related knowledge are adopted to transform a standard Chinese (or Putonghua, abbreviated as PTH) speech recognizer into a dialectal Chinese speech recognizer. Two kinds of knowledge sources are explored: one is expert knowledge and the other is a small dialectal Chinese corpus. These knowledge sources provide information at four levels: phonetic level, lexicon level, language level, and acoustic decoder level. This paper takes Wu dialectal Chinese (WDC) as an example target language. The goal is to establish a WDC speech recognizer from an existing PTH speech recognizer based on the Initial-Final structure of the Chinese language and a study of how dialectal Chinese speakers speak Putonghua. The authors propose to use context-independent PTH-IF mappings (where IF means either a Chinese Initial or a Chinese Final), context-independent WDC-IF mappings, and syllable-dependent WDC-IF mappings (obtained from either experts or data), and combine them with the supervised maximum likelihood linear regression (MLLR) acoustic model adaptation method. To reduce the size of the multi-pronunciation lexicon introduced by the IF mappings, which might also enlarge the lexicon confusion and hence lead to performance degradation, a Multi-Pronunciation Expansion (MPE) method based on the accumulated uni-gram probability (AUP) is proposed. In addition, some commonly used WDC words are selected and added to the lexicon. Compared with the original PTH speech recognizer, the resulting WDC speech recognizer achieves 10-18% absolute Character Error Rate (CER) reduction when recognizing WDC, with only a 0.62% CER increase when recognizing PTH. The proposed framework and methods are expected to work not only for Wu dialectal Chinese but also for other dialectal Chinese languages and

  14. Speech Recognition Technology for Hearing Disabled Community

    Directory of Open Access Journals (Sweden)

    Tanvi Dua

    2014-09-01

    Full Text Available As the number of people with hearing disabilities increases significantly around the world, technology is needed to fill the communication gap between Deaf and hearing communities. To fill this gap and to allow people with hearing disabilities to communicate, this paper suggests a framework that contributes to the efficient integration of people with hearing disabilities. This paper presents a robust speech recognition system which converts continuous speech into text and image. Results were obtained with an accuracy of 95% on a small vocabulary of 20 greeting sentences of continuous speech, tested in a speaker-independent mode. In this testing phase all these continuous sentences were given as live input to the proposed system.

  15. Speech emotion recognition with unsupervised feature learning

    Institute of Scientific and Technical Information of China (English)

    Zheng-wei HUANG; Wen-tao XUE; Qi-rong MAO

    2015-01-01

    Emotion-based features are critical for achieving high performance in a speech emotion recognition (SER) system. In general, it is difficult to develop these features due to the ambiguity of the ground-truth. In this paper, we apply several unsupervised feature learning algorithms (including K-means clustering, the sparse auto-encoder, and sparse restricted Boltzmann machines), which have promise for learning task-related features by using unlabeled data, to speech emotion recognition. We then evaluate the performance of the proposed approach and present a detailed analysis of the effect of two important factors in the model setup, the content window size and the number of hidden layer nodes. Experimental results show that larger content windows and more hidden nodes contribute to higher performance. We also show that the two-layer network cannot explicitly improve performance compared to a single-layer network.

  16. Speech Recognition for Dental Electronic Health Record

    Czech Academy of Sciences Publication Activity Database

    Nagy, Miroslav; Hanzlíček, Petr; Zvárová, Jana; Dostálová, T.; Seydlová, M.; Hippmann, R.; Smidl, L.; Trmal, J.; Psutka, J.

    Brno: VUTIUM Press, 2008 - (Jan, J.; Kozumplík, J.; Provazník, I.). s. 47-47 ISBN 978-80-214-3612-1. [Biosignal 2008. International EURASIP Conference /19./. 29.06.2008-01.07.2008, Brno] Institutional research plan: CEZ:AV0Z10300504 Keywords : automatic speech recognition * electronic health record * dental medicine Subject RIV: IN - Informatics, Computer Science

  17. Self Organizing Markov Map for Speech and Gesture Recognition

    Directory of Open Access Journals (Sweden)

    Ms. Nutan D. Sonwane, Prof. S.A. Chhabria, Dr. R.V. Dharaskar

    2012-04-01

    Full Text Available Gesture- and speech-based human-computer interaction is attracting attention across areas such as pattern recognition and computer vision, and finds many applications in multimodal HCI, robotics control, and sign language recognition. This paper presents a head and hand gesture as well as speech recognition system for human-computer interaction (HCI). Such a vision-based system demonstrates the capability of a computer to understand and respond to hand and head gestures, as well as to speech in the form of sentences. The recognition system consists of two main modules: 1. gesture recognition, comprising (i) image capturing, (ii) gesture feature extraction, and (iii) gesture modeling (direction, position, generalized); 2. speech recognition, comprising (i) acquiring voice signals, (ii) spectral coding, (iii) unit matching (best matching unit, BMU), (iv) lexical decoding, and (v) syntactic and semantic analysis. Compared with many existing algorithms for gesture and speech recognition, the SOM provides flexibility and robustness in noisy environments. Gesture detection is based on discrete predefined symbol sets, which are manually labeled during the training phase. The gesture-speech correlation is modelled by examining co-occurring speech and gesture patterns. This correlation can be used to fuse the gesture and speech modalities for edutainment applications (i.e., video games, 3-D animations) where natural gestures of talking avatars are animated from speech. A speech-driven gesture animation example has been implemented for demonstration.

  18. A New Structure for Automatic Speech Recognition

    Science.gov (United States)

    Duchnowski, Paul

    Speech is a wideband signal with cues identifying a particular element distributed across frequency. To capture these cues, most ASR systems analyze the speech signal into spectral (or spectrally-derived) components prior to recognition. Traditionally, these components are integrated across frequency to form a vector of "acoustic evidence" on which a decision by the ASR system is based. This thesis develops an alternate approach, post-labeling integration. In this scheme, tentative decisions, or labels, of the identity of a given speech element are assigned in parallel by sub-recognizers, each operating on a band-limited portion of the speech waveform. Outputs of these independent channels are subsequently combined (integrated) to render the final decision. Remarkably good recognition of bandlimited nonsense syllables by humans leads to the consideration of this method. It also allows potentially more accurate parameterization of the speech waveform and simultaneously robust estimation of parameter probabilities. The algorithm also represents an attempt to make explicit use of redundancies in speech. Three basic methods of parameterizing the bandlimited input of the sub-recognizers were considered, focusing respectively on LPC and cepstrum coefficients, and parameters based on the autocorrelation function. Four sub-recognizers were implemented as discrete Hidden Markov Model (HMM) systems. A Maximum A Posteriori (MAP) hypothesis-testing approach was applied to the problem of integrating the individual sub-recognizer decisions on a frame-by-frame basis. Final segmentation was achieved by a secondary HMM. Five methods of estimating the probabilities necessary for MAP integration were tested. The proposed structure was applied to the task of phonetic, speaker-independent, continuous speech recognition. Performance for several combinations of parameterization schemes and integration methods was measured. The best score of 58.5% on a 39 phone alphabet is roughly
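
    The MAP integration of independent sub-recognizer labels can be sketched briefly. The fragment below assumes (hypothetically) that each band's confusion statistics P(observed label | true phone) are known; it then picks the phone maximizing the prior times the product of per-band label likelihoods. The phone set, priors, and confusion matrices are invented for illustration.

      import numpy as np

      # Confusion statistics P(observed label | true phone) for each band,
      # here invented for a 3-phone alphabet and 4 sub-recognizers.
      rng = np.random.default_rng(1)
      n_bands, n_phones = 4, 3
      conf = rng.dirichlet(np.ones(n_phones), size=(n_bands, n_phones))
      # conf[b, true, observed]

      prior = np.array([0.5, 0.3, 0.2])         # phone priors
      observed = [0, 0, 2, 0]                   # label emitted by each band

      # MAP integration: pick the phone maximizing prior * prod_b P(obs_b | phone).
      log_post = np.log(prior).copy()
      for b, lab in enumerate(observed):
          log_post += np.log(conf[b, :, lab])
      print("integrated decision:", int(np.argmax(log_post)))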

  19. Compact Acoustic Models for Embedded Speech Recognition

    Directory of Open Access Journals (Sweden)

    Christophe Lévy

    2009-01-01

    Full Text Available Speech recognition applications are known to require a significant amount of resources. However, embedded speech recognition allows only a few KB of memory, a few MIPS, and a small amount of training data. In order to fit the resource constraints of embedded applications, an approach based on a semicontinuous HMM system using state-independent acoustic modelling is proposed. A transformation is computed and applied to the global model in order to obtain the state-dependent probability density functions of each HMM, so that only the transformation parameters need to be stored. This approach is evaluated on two tasks: digit and voice-command recognition. A fast adaptation technique for the acoustic models is also proposed. In order to significantly reduce computational costs, the adaptation is performed only on the global model (using techniques related to speaker recognition adaptation), with no need for state-dependent data. The whole approach results in a relative gain of more than 20% compared to a basic HMM-based system fitting the constraints.

  20. Cross-Word Modeling for Arabic Speech Recognition

    CERN Document Server

    AbuZeina, Dia

    2012-01-01

    "Cross-Word Modeling for Arabic Speech Recognition" utilizes phonological rules in order to model the cross-word problem, a merging of adjacent words in speech caused by continuous speech, to enhance the performance of continuous speech recognition systems. The author aims to provide an understanding of the cross-word problem and how it can be avoided, specifically focusing on Arabic phonology using an HHM-based classifier.

  1. Joint speech and speaker recognition using neural networks

    OpenAIRE

    Xue, Xiaoguo

    2013-01-01

    Speech is the main communication method between human beings. Since the invention of the computer, people have been trying to make computers understand natural speech. Speech recognition is a technology with close connections to computer science, signal processing, voice linguistics and intelligent systems. It has been a "hot" subject not only in research but also as a practical application. Especially in real life, speaker and speech recognition have been use...

  2. Multi-thread Parallel Speech Recognition for Mobile Applications

    Directory of Open Access Journals (Sweden)

    LOJKA Martin

    2014-05-01

    Full Text Available In this paper, a server-based solution for a multi-thread large-vocabulary automatic speech recognition engine is described, along with practical application examples for Android OS and HTML5. The basic idea was to make speech recognition available to a full variety of applications for computers, and especially for mobile devices. The speech recognition engine should be independent of commercial products and services (whose dictionaries cannot be modified). Using third-party services can also pose security and privacy problems in specific applications, where unsecured audio data must not be sent to uncontrolled environments (voice data transferred to servers around the globe). Using our experience with speech recognition applications, we have been able to construct a multi-thread server-based speech recognition solution with a simple application interface (API) to a speech recognition engine adapted to the specific needs of a particular application.

  3. Performance of current models of speech recognition and resulting challenges

    OpenAIRE

    Schubotz, Wiebke

    2015-01-01

    Speech is usually perceived in background noise (a masker) that can severely hamper its recognition. Nevertheless, there are mechanisms that enable speech recognition even in difficult listening conditions. Some of them, such as the combination of across-frequency information or binaural cues, are studied in this dissertation. Moreover, masking aspects such as energetic, amplitude-modulation or informational masking are considered. Speech recognition in complex maskers is investigated tha...

  4. The Use of Speech Recognition Technology in Automotive Applications

    OpenAIRE

    Gellatly, Andrew William

    1997-01-01

    The research objectives were (1) to perform a detailed review of the literature on speech recognition technology and the attentional demands of driving; (2) to develop decision tools that assist designers of in-vehicle systems; (3) to experimentally examine automatic speech recognition (ASR) design parameters, input modalities, and driver ages; and (4) to provide human factors recommendations for the use of speech recognition technology in automotive applicatio...

  5. Better speech recognition with cochlear implants

    Science.gov (United States)

    Wilson, Blake S.; Finley, Charles C.; Lawson, Dewey T.; Wolford, Robert D.; Eddington, Donald K.; Rabinowitz, William M.

    1991-07-01

    High levels of speech recognition have been achieved with a new sound processing strategy for multielectrode cochlear implants. A cochlear implant system consists of one or more implanted electrodes for direct electrical activation of the auditory nerve, an external speech processor that transforms a microphone input into stimuli for each electrode, and a transcutaneous (rf-link) or percutaneous (direct) connection between the processor and the electrodes. We report here the comparison of the new strategy and a standard clinical processor. The standard compressed analogue (CA) processor [1, 2] presented analogue waveforms simultaneously to all electrodes, whereas the new continuous interleaved sampling (CIS) strategy presented brief pulses to each electrode in a nonoverlapping sequence. Seven experienced implant users, selected for their excellent performance with the CA processor, participated as subjects. The new strategy produced large improvements in the scores of speech reception tests for all subjects. These results have important implications for the treatment of deafness and for minimal representations of speech at the auditory periphery.

  6. An Improved Hindi Speech Emotion Recognition System

    Directory of Open Access Journals (Sweden)

    Agnes Jacob

    2013-11-01

    Full Text Available This paper presents the results of investigations into speech emotion recognition in Hindi, using only the first four formants and their bandwidths. This research work was done on a female speech database of nearly 1600 utterances comprising neutral, happiness, surprise, anger, sadness, fear and disgust as the elicited emotions. The best of the statistically preprocessed formant and bandwidth features were first identified through K-Means, K-Nearest Neighbour and Naive Bayes classification of individual features. This was followed by artificial neural network classification based on the combination of the best formants and bandwidths. The highest overall emotion recognition accuracy obtained by the ANN method was 97.14%, based on the first four values of formants and bandwidths. A striking increase in recognition accuracy was observed when the number of emotion classes was reduced from seven. The results presented in this paper have not been reported so far for Hindi, using the proposed spectral features together with the adopted preprocessing and classification methods.
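
    A minimal sketch of the nearest-neighbour stage on formant/bandwidth features might look as follows; the feature values and emotion labels are randomly generated stand-ins, not the paper's Hindi data.

      import numpy as np
      from sklearn.neighbors import KNeighborsClassifier
      from sklearn.model_selection import train_test_split

      # Synthetic stand-in for the formant/bandwidth features (F1-F4, B1-B4).
      rng = np.random.default_rng(2)
      X = rng.normal(size=(300, 8))
      y = rng.integers(0, 7, size=300)          # 7 emotion classes

      X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
      knn = KNeighborsClassifier(n_neighbors=5).fit(X_tr, y_tr)
      print("accuracy on synthetic data:", knn.score(X_te, y_te))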

  7. Speech Recognition Technology Applied to Intelligent Mobile Navigation System

    Institute of Scientific and Technical Information of China (English)

    2002-01-01

    The capability of human-computer interaction reflects the degree of intelligence of a mobile navigation system. In this paper, the navigation data and functions of a mobile navigation system are divided into system commands and non-system commands, and a group of speech commands is then abstracted. The paper applies speech recognition technology to an intelligent mobile navigation system to process speech commands, and investigates the integration of speech recognition technology with mobile navigation systems. Navigation operations can be performed by speech commands, which makes human-computer interaction easy during navigation. The speech command interface of the navigation system is implemented with Dutty++ software, based on IBM's ViaVoice speech recognition system. Navigation experiments showed that navigation can be performed almost without a keyboard, proving that human-computer interaction via speech commands is very convenient and that reliability is also higher.

  8. A pattern recognition based esophageal speech enhancement system

    Directory of Open Access Journals (Sweden)

    A.Mantilla‐Caeiros

    2010-04-01

    Full Text Available A system for improving the intelligibility and quality of alaryngeal speech, based on the replacement of voiced segments of alaryngeal speech with the equivalent segments of normal speech, is proposed. To this end, the proposed system identifies the voiced segments of the alaryngeal speech signal by using isolated-word speech recognition methods and replaces them with their equivalent voiced segments of normal speech, keeping the silence and unvoiced segments unchanged. Evaluation results using objective and subjective methods show that the proposed system provides a fairly good improvement in the quality and intelligibility of alaryngeal speech signals.

  9. Speech and audio processing for coding, enhancement and recognition

    CERN Document Server

    Togneri, Roberto; Narasimha, Madihally

    2015-01-01

    This book describes the basic principles underlying the generation, coding, transmission and enhancement of speech and audio signals, including advanced statistical and machine learning techniques for speech and speaker recognition, with an overview of the key innovations in these areas. Key research undertaken in speech coding, speech enhancement, speech recognition, emotion recognition and speaker diarization is also presented, along with recent advances and new paradigms in these areas. The book offers readers a single-source reference on the significant applications of speech and audio processing to speech coding, speech enhancement and speech/speaker recognition; enables readers involved in algorithm development and implementation issues for speech coding to understand the historical development and future challenges in speech coding research; and discusses speech coding methods yielding bit-streams that are multi-rate and scalable for Voice-over-IP (VoIP) networks.

  10. How does real affect affect affect recognition in speech?

    NARCIS (Netherlands)

    Truong, Khiet Phuong

    2009-01-01

    The aim of the research described in this thesis was to develop speech-based affect recognition systems that can deal with spontaneous (‘real’) affect instead of acted affect. Several affect recognition experiments with spontaneous affective speech data were carried out to investigate what combinati

  11. Building DNN Acoustic Models for Large Vocabulary Speech Recognition

    OpenAIRE

    Maas, Andrew L.; Qi, Peng; Xie, Ziang; Hannun, Awni Y.; Lengerich, Christopher T.; Jurafsky, Daniel; Ng, Andrew Y.

    2014-01-01

    Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and co...

  12. Punjabi Automatic Speech Recognition Using HTK

    Directory of Open Access Journals (Sweden)

    Mohit Dua

    2012-07-01

    Full Text Available This paper aims to discuss the implementation of an isolated-word Automatic Speech Recognition (ASR) system for the Indian regional language Punjabi. The HTK toolkit, based on the Hidden Markov Model (HMM) statistical approach, is used to develop the system. The system is first trained on 115 distinct Punjabi words using data collected from eight speakers, and then tested using samples from six speakers in real-time environments. To make the system more interactive and fast, a GUI for the testing module has been developed on the Java platform. The paper also describes the role of each HTK tool used in the various phases of system development, presenting a detailed architecture of an ASR system developed using HTK library modules and tools. The experimental results show that the overall system performance is 95.63% and 94.08%.

  13. Unvoiced Speech Recognition Using Tissue-Conductive Acoustic Sensor

    Directory of Open Access Journals (Sweden)

    Hiroshi Saruwatari

    2007-01-01

    Full Text Available We present the use of stethoscope and silicon NAM (nonaudible murmur) microphones in automatic speech recognition. NAM microphones are special acoustic sensors, which are attached behind the talker's ear and can capture not only normal (audible) speech, but also very quietly uttered speech (nonaudible murmur). As a result, NAM microphones can be applied in automatic speech recognition systems when privacy is desired in human-machine communication. Moreover, NAM microphones show robustness against noise, and they might be used in special systems (speech recognition, speech transformation, etc.) for sound-impaired people. Using adaptation techniques and a small amount of training data, we achieved a 93.9% word accuracy for nonaudible murmur recognition on a 20k dictation task in a clean environment. In this paper, we also investigate nonaudible murmur recognition in noisy environments and the effect of the Lombard reflex on nonaudible murmur recognition. We also propose three methods to integrate audible speech and nonaudible murmur recognition using a stethoscope NAM microphone, with very promising results.

  15. Speech recognition algorithms based on weighted finite-state transducers

    CERN Document Server

    Hori, Takaaki

    2013-01-01

    This book introduces the theory, algorithms, and implementation techniques for efficient decoding in speech recognition mainly focusing on the Weighted Finite-State Transducer (WFST) approach. The decoding process for speech recognition is viewed as a search problem whose goal is to find a sequence of words that best matches an input speech signal. Since this process becomes computationally more expensive as the system vocabulary size increases, research has long been devoted to reducing the computational cost. Recently, the WFST approach has become an important state-of-the-art speech recogni

  16. Automatic Emotion Recognition in Speech: Possibilities and Significance

    Directory of Open Access Journals (Sweden)

    Milana Bojanić

    2009-12-01

    Full Text Available Automatic speech recognition and spoken language understanding are crucial steps towards natural human-machine interaction. The main task of the speech communication process is the recognition of the word sequence, but the recognition of prosody, emotion and stress tags may be of particular importance as well. This paper discusses the possibilities of recognizing emotion from the speech signal in order to improve ASR, and also provides an analysis of acoustic features that can be used for the detection of the speaker's emotion and stress. The paper also provides a short overview of emotion and stress classification techniques. The importance and place of emotional speech recognition is shown in the domain of human-computer interactive systems and the transaction communication model. Directions for future work are given at the end of this work.

  17. An articulatorily constrained, maximum entropy approach to speech recognition and speech coding

    Energy Technology Data Exchange (ETDEWEB)

    Hogden, J.

    1996-12-31

    Hidden Markov models (HMMs) are among the most popular tools for performing computer speech recognition. One of the primary reasons that HMMs typically outperform other speech recognition techniques is that the parameters used for recognition are determined by the data, not by preconceived notions of what the parameters should be. This makes HMMs better able to deal with intra- and inter-speaker variability despite the limited knowledge of how speech signals vary and despite the often limited ability to correctly formulate rules describing variability and invariance in speech. In fact, it is often the case that when HMM parameter values are constrained using the limited knowledge of speech, recognition performance decreases. However, the structure of an HMM has little in common with the mechanisms underlying speech production. Here, the author argues that by using probabilistic models that more accurately embody the process of speech production, he can create models that have all the advantages of HMMs, but that should more accurately capture the statistical properties of real speech samples--presumably leading to more accurate speech recognition. The model he will discuss uses the fact that speech articulators move smoothly and continuously. Before discussing how to use articulatory constraints, he will give a brief description of HMMs. This will allow him to highlight the similarities and differences between HMMs and the proposed technique.

  18. Wavelet Cepstral Coefficients for Isolated Speech Recognition

    Directory of Open Access Journals (Sweden)

    T. B. Adam

    2013-05-01

    Full Text Available The study proposes an improved feature extraction method called Wavelet Cepstral Coefficients (WCC). In traditional cepstral analysis, the cepstrums are calculated with the use of the Discrete Fourier Transform (DFT). Owing to the fact that the DFT calculation assumes signal stationarity within a frame, which in practice is not quite true, the WCC replaces the DFT block in the traditional cepstrum calculation with the Discrete Wavelet Transform (DWT), producing the WCC. To evaluate the proposed WCC, a speech recognition task of recognizing the 26 letters of the English alphabet was conducted. Comparisons with the traditional Mel-Frequency Cepstral Coefficients (MFCC) are made to further analyze the effectiveness of the WCCs. It is found that the WCCs give results comparable to the MFCCs, which is notable considering the WCCs' small vector dimension. The best recognition was found with WCCs at level 5 of the DWT decomposition, with a small difference of 1.19% and 3.21% compared to the MFCCs for speaker-independent and speaker-dependent tasks, respectively.
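
    One plausible reading of "replacing the DFT block with the DWT" is to take sub-band log-energies from a wavelet decomposition and decorrelate them with a DCT, as sketched below using PyWavelets. The wavelet choice, frame length, and energy-based formulation are assumptions for illustration, not the paper's exact recipe.

      import numpy as np
      import pywt
      from scipy.fftpack import dct

      def wcc(frame, wavelet="db4", level=5, n_coeffs=6):
          # Decompose the frame into wavelet sub-bands (level+1 bands).
          bands = pywt.wavedec(frame, wavelet, level=level)
          # Log-energy per sub-band stands in for the log-magnitude spectrum.
          log_e = np.array([np.log(np.sum(b ** 2) + 1e-10) for b in bands])
          # DCT decorrelates, mirroring the final step of MFCC extraction.
          return dct(log_e, type=2, norm="ortho")[:n_coeffs]

      frame = np.random.default_rng(3).normal(size=400)   # 25 ms at 16 kHz
      print(wcc(frame))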

  19. Prediction Method of Speech Recognition Performance Based on HMM-based Speech Synthesis Technique

    Science.gov (United States)

    Terashima, Ryuta; Yoshimura, Takayoshi; Wakita, Toshihiro; Tokuda, Keiichi; Kitamura, Tadashi

    We describe an efficient method that uses an HMM-based speech synthesis technique as a test pattern generator for evaluating the word recognition rate. With this method, the recognition rates of each word and speaker can be evaluated using synthesized speech. The parameter generation technique can be formulated as an algorithm that determines the speech parameter vector sequence O by maximizing P(O|Q,λ) given the model parameters λ and the state sequence Q, under a dynamic acoustic feature constraint. We conducted recognition experiments to illustrate the validity of the method. Approximately 100 speakers were used to train the speaker-dependent models for the speech synthesis used in these experiments, and the synthetic speech was generated as test patterns for the target speech recognizer. As a result, the recognition rate of the HMM-based synthesized speech shows a good correlation with the recognition rate of the actual speech. Furthermore, we find that our method can predict the speaker recognition rate with approximately 2% error on average. Therefore, the evaluation of the speaker recognition rate can be performed automatically using the proposed method.

  20. Emotion Recognition from Persian Speech with Neural Network

    Directory of Open Access Journals (Sweden)

    Mina Hamidi

    2012-09-01

    Full Text Available In this paper, we report an effort towards automatic recognition of emotional states from continuous Persian speech. Due to the unavailability of an appropriate database in the Persian language for emotion recognition, we first built a database of emotional speech in Persian. This database consists of 2400 wave clips modulated with anger, disgust, fear, sadness, happiness and normal emotions. We then extract prosodic features, including features related to the pitch, intensity and global characteristics of the speech signal. Finally, we applied neural networks for automatic recognition of emotion. The resulting average accuracy was about 78%.

  1. Emotion Recognition from Speech Signals and Perception of Music

    OpenAIRE

    Fernandez Pradier, Melanie

    2011-01-01

    This thesis deals with emotion recognition from speech signals. The feature extraction step shall be improved by looking at the perception of music. In music theory, different pitch intervals (consonant, dissonant) and chords are believed to invoke different feelings in listeners. The question is whether there is a similar mechanism between perception of music and perception of emotional speech. Our research will follow three stages. First, the relationship between speech and music at segment...

  2. Effects of Speech Clarity on Recognition Memory for Spoken Sentences

    OpenAIRE

    Van Engen, Kristin J.; Bharath Chandrasekaran; Rajka Smiljanic

    2012-01-01

    Extensive research shows that inter-talker variability (i.e., changing the talker) affects recognition memory for speech signals. However, relatively little is known about the consequences of intra-talker variability (i.e. changes in speaking style within a talker) on the encoding of speech signals in memory. It is well established that speakers can modulate the characteristics of their own speech and produce a listener-oriented, intelligibility-enhancing speaking style in response to communi...

  3. SPEECH EMOTION RECOGNITION USING MODIFIED QUADRATIC DISCRIMINATION FUNCTION

    Institute of Scientific and Technical Information of China (English)

    2008-01-01

    The Quadratic Discrimination Function (QDF) is commonly used in speech emotion recognition, and it proceeds on the premise that the input data follow a normal distribution. In this paper, we propose a transformation to normalize the emotional features and then derive a Modified QDF (MQDF) for speech emotion recognition. Features based on prosody and voice quality are extracted, and a Principal Component Analysis Neural Network (PCANN) is used to reduce the dimension of the feature vectors. The results show that voice quality features are an effective supplement for recognition, and that the method in this paper improves the recognition ratio effectively.
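
    The QDF itself is compact enough to sketch. The fragment below scores a feature vector against per-class Gaussian models using the standard quadratic discriminant g(x) = -(x-μ)ᵀΣ⁻¹(x-μ) - ln|Σ| + 2 ln P(ω); the class statistics are invented, and the paper's normalization transform and PCANN stage are omitted.

      import numpy as np

      def qdf_score(x, mean, cov, prior):
          """Quadratic discriminant g(x) = -(x-mu)^T S^-1 (x-mu) - ln|S| + 2 ln P."""
          diff = x - mean
          return (-diff @ np.linalg.solve(cov, diff)
                  - np.log(np.linalg.det(cov)) + 2.0 * np.log(prior))

      rng = np.random.default_rng(4)
      # Two invented emotion classes with Gaussian feature models.
      classes = [(rng.normal(size=3), np.eye(3) * 0.5, 0.6),
                 (rng.normal(size=3), np.eye(3) * 1.5, 0.4)]
      x = rng.normal(size=3)
      print("decision:", int(np.argmax([qdf_score(x, m, c, p) for m, c, p in classes])))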

  4. Automatic speech recognition for radiological reporting

    International Nuclear Information System (INIS)

    Large-vocabulary speech recognition, its techniques and its software and hardware technology, are being developed with the aim of providing the office user with a tool that could significantly improve both the quantity and quality of his work: the dictation machine, which allows memos and documents to be input using voice and a microphone instead of fingers and a keyboard. The IBM Rome Science Center, together with the IBM Research Division, has built a prototype recognizer that accepts natural-language sentences from a 20,000-word Italian vocabulary. The unit runs on a personal computer equipped with special hardware capable of providing all the necessary computing power. The first laboratory experiments yielded very interesting results and revealed system characteristics that make its use possible in operational environments. To this end, the dictation of medical reports was considered a suitable application. In cooperation with the 2nd Radiology Department of S. Maria della Misericordia Hospital (Udine, Italy), the system was tried by radiology department doctors during their everyday work. The doctors were able to dictate their reports directly to the unit. The text appeared immediately on the screen, and any errors could be corrected either by voice or by using the keyboard. At the end of report dictation, the doctors could both print and archive the text. The report could also be forwarded to the hospital information system, where available. Our results have been very encouraging: the system proved to be robust, simple to use, and accurate (over 95% average recognition rate). The experiment yielded valuable suggestions and comments, and its results are useful for system evolution towards improved system management and efficiency.

  5. Source Separation via Spectral Masking for Speech Recognition Systems

    Directory of Open Access Journals (Sweden)

    Gustavo Fernandes Rodrigues

    2012-12-01

    Full Text Available In this paper we present an insight into the use of spectral masking techniques in the time-frequency domain as a preprocessing step for speech signal recognition. Speech recognition systems have their performance negatively affected in noisy environments or in the presence of other speech signals. The limits of these masking techniques for different levels of the signal-to-noise ratio are discussed. We show the robustness of the spectral masking techniques against four types of noise: white, pink, brown and human speech noise (babble noise). The main contribution of this work is to analyze the performance limits of recognition systems using spectral masking. We obtain an increase of 18% in the speech hit rate when the speech signals were corrupted by other speech signals or babble noise, at signal-to-noise ratios of approximately 1, 10 and 20 dB. On the other hand, applying ideal binary masks to mixtures corrupted by white, pink and brown noise results in an average increase of 9% in the speech hit rate, at the same signal-to-noise ratios. The experimental results suggest that the spectral masking techniques are more suitable for babble noise, which is produced by human speech, than for white, pink and brown noise.
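
    The ideal binary mask used in such experiments is straightforward to sketch: keep a time-frequency cell of the mixture whenever the local speech power exceeds the noise power (a 0 dB criterion). The signals below are random stand-ins, and the STFT settings are assumptions for illustration.

      import numpy as np
      from scipy.signal import stft, istft

      rng = np.random.default_rng(5)
      fs = 16000
      speech = rng.normal(size=fs)              # stand-ins for speech and noise
      noise = rng.normal(size=fs)
      mix = speech + noise

      f, t, S = stft(speech, fs=fs, nperseg=512)
      _, _, N = stft(noise, fs=fs, nperseg=512)
      _, _, M = stft(mix, fs=fs, nperseg=512)

      # Ideal binary mask: keep a time-frequency cell when local SNR > 0 dB.
      ibm = (np.abs(S) ** 2) > (np.abs(N) ** 2)
      _, enhanced = istft(M * ibm, fs=fs, nperseg=512)
      print(enhanced.shape)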

  6. Speech recognition for 40 patients receiving multichannel cochlear implants.

    Science.gov (United States)

    Dowell, R C; Mecklenburg, D J; Clark, G M

    1986-10-01

    We collected data on 40 patients who received the Nucleus multichannel cochlear implant. Results were reviewed to determine whether the coding strategy is effective in transmitting the intended speech features and to assess patient benefit in terms of communication skills. All patients demonstrated significant improvement over preoperative results with a hearing aid, for both lipreading enhancement and speech recognition without lipreading. Of the patients, 50% demonstrated the ability to understand connected discourse with auditory input only. For the 23 patients who were tested 12 months postoperatively, there was substantial improvement in open-set speech recognition. PMID:3755975

  7. Speech Recognition Oriented Vowel Classification Using Temporal Radial Basis Functions

    CERN Document Server

    Guezouri, Mustapha; Benyettou, Abdelkader

    2009-01-01

    The recent resurgence of interest in spatio-temporal neural networks as speech recognition tools motivates the present investigation. In this paper an approach was developed based on temporal radial basis functions (TRBF), offering several advantages: few parameters, fast convergence and time invariance. This application aims to identify vowels taken from natural speech samples from the TIMIT corpus of American speech. We report a recognition accuracy of 98.06 percent in training and 90.13 percent in testing on a subset of 6 vowel phonemes, with the possibility to expand the vowel set in the future.

  8. Effects of speech clarity on recognition memory for spoken sentences.

    Directory of Open Access Journals (Sweden)

    Kristin J Van Engen

    Full Text Available Extensive research shows that inter-talker variability (i.e., changing the talker) affects recognition memory for speech signals. However, relatively little is known about the consequences of intra-talker variability (i.e., changes in speaking style within a talker) on the encoding of speech signals in memory. It is well established that speakers can modulate the characteristics of their own speech and produce a listener-oriented, intelligibility-enhancing speaking style in response to communication demands (e.g., when speaking to listeners with hearing impairment or non-native speakers of the language). Here we conducted two experiments to examine the role of speaking style variation in spoken language processing. First, we examined the extent to which clear speech provided benefits in challenging listening environments (i.e., speech-in-noise). Second, we compared recognition memory for sentences produced in conversational and clear speaking styles. In both experiments, semantically normal and anomalous sentences were included to investigate the role of higher-level linguistic information in the processing of speaking style variability. The results show that acoustic-phonetic modifications implemented in listener-oriented speech lead to improved speech recognition in challenging listening conditions and, crucially, to a substantial enhancement in recognition memory for sentences.

  9. Modelling Errors in Automatic Speech Recognition for Dysarthric Speakers

    Science.gov (United States)

    Caballero Morales, Santiago Omar; Cox, Stephen J.

    2009-12-01

    Dysarthria is a motor speech disorder characterized by weakness, paralysis, or poor coordination of the muscles responsible for speech. Although automatic speech recognition (ASR) systems have been developed for disordered speech, factors such as low intelligibility and limited phonemic repertoire decrease speech recognition accuracy, making conventional speaker adaptation algorithms perform poorly on dysarthric speakers. In this work, rather than adapting the acoustic models, we model the errors made by the speaker and attempt to correct them. For this task, two techniques have been developed: (1) a set of "metamodels" that incorporate a model of the speaker's phonetic confusion matrix into the ASR process; (2) a cascade of weighted finite-state transducers at the confusion matrix, word, and language levels. Both techniques attempt to correct the errors made at the phonetic level and make use of a language model to find the best estimate of the correct word sequence. Our experiments show that both techniques outperform standard adaptation techniques.

  10. Mandarin Digits Speech Recognition Using Support Vector Machines

    Institute of Scientific and Technical Information of China (English)

    XIE Xiang; KUANG Jing-ming

    2005-01-01

    A method of applying support vector machines (SVMs) to speech recognition was proposed, and a speech recognition system for Mandarin digits was built using SVMs. In the system, vectors were linearly extracted from the speech feature sequence to make up time-aligned input patterns for the SVM, and the decisions of several 2-class SVM classifiers were combined to construct an N-class classifier. Four kinds of SVM kernel functions were compared in speaker-independent recognition experiments on Mandarin digits. The radial basis function kernel achieved the highest accuracy, 99.33%, which is better than that of the baseline system based on hidden Markov models (HMMs) (97.08%). The experiments also show that the SVM can outperform the HMM, especially when the samples available for learning are very limited.
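
    A minimal sketch of the classification stage, with pairwise (one-vs-one) SVMs over fixed-length vectors extracted from feature sequences, might look as follows; the data is synthetic and the feature dimensionality is an assumption, not the paper's setup.

      import numpy as np
      from sklearn.svm import SVC

      rng = np.random.default_rng(6)
      # Stand-in for time-aligned, fixed-length vectors drawn from MFCC sequences.
      X = rng.normal(size=(500, 60))
      y = rng.integers(0, 10, size=500)         # ten Mandarin digits

      # RBF kernel; multi-class handled internally by pairwise (one-vs-one) SVMs.
      clf = SVC(kernel="rbf", decision_function_shape="ovo").fit(X, y)
      print(clf.predict(X[:5]))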

  11. Comparative wavelet, PLP, and LPC speech recognition techniques on the Hindi speech digits database

    Science.gov (United States)

    Mishra, A. N.; Shrotriya, M. C.; Sharan, S. N.

    2010-02-01

    In view of the growing use of automatic speech recognition in modern society, we study various alternative representations of the speech signal that have the potential to contribute to improved recognition performance. In this paper, wavelet-based features using different wavelets are used for Hindi digit recognition. The recognition performance of these features has been compared with Linear Prediction Coefficient (LPC) and Perceptual Linear Prediction (PLP) features. All features have been tested using a Hidden Markov Model (HMM) based classifier for speaker-independent Hindi digit recognition. The recognition performance of PLP features is 11.3% better than that of LPC features. The recognition performance with db10 features shows a further improvement of 12.55% over PLP features and is the best among all wavelet-based features.

  12. Speaker-Machine Interaction in Automatic Speech Recognition. Technical Report.

    Science.gov (United States)

    Makhoul, John I.

    The feasibility and limitations of speaker adaptation in improving the performance of a "fixed" (speaker-independent) automatic speech recognition system were examined. A fixed vocabulary of 55 syllables is used in the recognition system which contains 11 stops and fricatives and five tense vowels. The results of an experiment on speaker…

  13. Microphone Array Speech Recognition : Experiments on Overlapping Speech in Meetings

    OpenAIRE

    Moore, Darren; McCowan, Iain A.

    2002-01-01

    This paper investigates the use of microphone arrays to acquire and recognise speech in meetings. Meetings pose several interesting problems for speech processing, as they consist of multiple competing speakers within a small space, typically around a table. Due to their ability to provide hands-free acquisition and directional discrimination, microphone arrays present a potential alternative to close-talking microphones in such an application. We first propose an appropriate microphone array...

  14. A Multi-Modal Recognition System Using Face and Speech

    OpenAIRE

    Samir Akrouf; Belayadi Yahia; Mostefai Messaoud; Youssef Chahir

    2011-01-01

    Nowadays person recognition has attracted more and more interest, especially for security reasons. The recognition performed by a biometric system using a single modality tends to perform less well due to sensor data limitations, restricted degrees of freedom and unacceptable error rates. To alleviate some of these problems we use multimodal biometric systems, which provide better recognition results. By combining different modalities, such as speech, face, fingerprint, etc., we increase the performance of reco...

  15. Speech recognition systems on the Cell Broadband Engine

    Energy Technology Data Exchange (ETDEWEB)

    Liu, Y; Jones, H; Vaidya, S; Perrone, M; Tydlitat, B; Nanda, A

    2007-04-20

    In this paper we describe our design, implementation, and first results of a prototype connected-phoneme-based speech recognition system on the Cell Broadband Engine™ (Cell/B.E.). Automatic speech recognition decodes speech samples into plain text (other representations are possible) and must process samples at real-time rates. Fortunately, the computational tasks involved in this pipeline are highly data-parallel and can receive significant hardware acceleration from vector-streaming architectures such as the Cell/B.E. Identifying and exploiting these parallelism opportunities is challenging, but also critical to improving system performance. We observed, from our initial performance timings, that a single Cell/B.E. processor can recognize speech from thousands of simultaneous voice channels in real time--a channel density that is orders of magnitude greater than the capacity of existing software speech recognizers based on CPUs (central processing units). This result emphasizes the potential for Cell/B.E.-based speech recognition and will likely lead to the future development of production speech systems using Cell/B.E. clusters.

  16. A Multi-Modal Recognition System Using Face and Speech

    Directory of Open Access Journals (Sweden)

    Samir Akrouf

    2011-05-01

    Full Text Available Nowadays person recognition has attracted more and more interest, especially for security reasons. The recognition performed by a biometric system using a single modality tends to perform less well due to sensor data limitations, restricted degrees of freedom and unacceptable error rates. To alleviate some of these problems we use multimodal biometric systems, which provide better recognition results. By combining different modalities, such as speech, face, fingerprint, etc., we increase the performance of recognition systems. In this paper, we study the fusion of speech and face in a recognition system for taking a final decision (i.e., accept or reject an identity claim). We evaluate the performance of each system separately, then we fuse the results and compare the performances.

  17. Speech recognition: Acoustic phonetic and lexical knowledge representation

    Science.gov (United States)

    Zue, V. W.

    1984-02-01

    The purpose of this program is to develop a speech database facility under which the acoustic characteristics of speech sounds in various contexts can be studied conveniently; investigate the phonological properties of a large lexicon of, say, 10,000 words and determine to what extent phonotactic constraints can be utilized in speech recognition; study the acoustic cues that are used to mark word boundaries; develop a test bed in the form of a large-vocabulary, isolated-word recognition (IWR) system to study the interactions of acoustic, phonetic and lexical knowledge; and develop a limited continuous speech recognition system with the goal of recognizing any English word from its spelling, in order to assess the interactions of higher-level knowledge sources.

  18. Confidence Measures for Automatic and Interactive Speech Recognition

    OpenAIRE

    Sánchez Cortina, Isaías

    2016-01-01

    This thesis contributes to the field of Automatic Speech Recognition (ASR), and particularly to Interactive Speech Transcription (IST) and Confidence Measures (CM) for ASR. The main goals of this thesis can be summarised as follows: 1. To design IST methods and tools to tackle the problem of improving automatically generated transcripts. 2. To assess the designed IST methods and tools on real-life transcription tasks in large educational repositories of vide...

  19. Studies in automatic speech recognition and its application in aerospace

    Science.gov (United States)

    Taylor, Michael Robinson

    Human communication is characterized in terms of the spectral and temporal dimensions of speech waveforms. Electronic speech recognition strategies based on Dynamic Time Warping and Markov Model algorithms are described and typical digit recognition error rates are tabulated. The application of Direct Voice Input (DVI) as an interface between man and machine is explored within the context of civil and military aerospace programmes. Sources of physical and emotional stress affecting speech production within military high performance aircraft are identified. Experimental results are reported which quantify fundamental frequency and coarse temporal dimensions of male speech as a function of the vibration, linear acceleration and noise levels typical of aerospace environments; preliminary indications of acoustic phonetic variability reported by other researchers are summarized. Connected whole-word pattern recognition error rates are presented for digits spoken under controlled Gz sinusoidal whole-body vibration. Correlations are made between significant increases in recognition error rate and resonance of the abdomen-thorax and head subsystems of the body. The phenomenon of vibrato style speech produced under low frequency whole-body Gz vibration is also examined. Interactive DVI system architectures and avionic data bus integration concepts are outlined together with design procedures for the efficient development of pilot-vehicle command and control protocols.

  20. Emotion Recognition from Persian Speech with Neural Network

    Directory of Open Access Journals (Sweden)

    Mina Hamidi

    2012-10-01

    Full Text Available In this paper, we report an effort towards automatic recognition of emotional states from continuous Persian speech. Due to the unavailability of an appropriate database in the Persian language for emotion recognition, at first, we built a database of emotional speech in Persian. This database consists of 2400 wave clips modulated with anger, disgust, fear, sadness, happiness and normal emotions. Then we extract prosodic features, including features related to the pitch, intensity and global characteristics of the speech signal. Finally, we applied neural networks for automatic recognition of emotion. The resulting average accuracy was about 78%.

  1. Speech Recognition Method Based on Multilayer Chaotic Neural Network

    Institute of Scientific and Technical Information of China (English)

    REN Xiaolin; HU Guangrui

    2001-01-01

    In this paper, speech recognition using neural networks is investigated. In particular, chaotic dynamics is introduced to neurons, and a multilayer chaotic neural network (MLCNN) architecture is built. A learning algorithm is also derived to train the weights of the network. We apply the MLCNN to speech recognition and compare the performance of the network with those of the recurrent neural network (RNN) and time-delay neural network (TDNN). Experimental results show that the MLCNN method outperforms the other neural network methods with respect to average recognition rate.

  2. Integration of Metamodel and Acoustic Model for Dysarthric Speech Recognition

    Directory of Open Access Journals (Sweden)

    Hironori Matsumasa

    2009-08-01

    Full Text Available We investigated the speech recognition of a person with articulation disorders resulting from athetoid cerebral palsy. The articulation of the first words spoken tends to be unstable due to the strain placed on the speech-related muscles, and this causes degradation of speech recognition. Therefore, we proposed a robust feature extraction method based on PCA (Principal Component Analysis) instead of MFCC, where the main stable utterance element is projected onto low-order features and the fluctuation elements of speaking style are projected onto high-order features. The PCA-based filter is thus able to extract stable utterance features only. The fluctuation of speaking style may invoke phone fluctuations, such as substitutions, deletions and insertions. In this paper, we discuss our effort to integrate a metamodel and an acoustic model approach. Metamodels provide a technique for incorporating a model of a speaker's confusion matrix into the ASR process in such a way as to increase recognition accuracy. The integration of metamodels and acoustic models enables fluctuation suppression not only in feature extraction but also in recognition. The proposed method resulted in an improvement of 9.9% (from 79.1% to 89%) in the recognition rate compared to the conventional method.
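
    The PCA-based "filter" idea, keeping low-order components as the stable utterance element and discarding high-order fluctuation, can be sketched as follows; the frame dimensionality and the number of retained components are illustrative assumptions, not the paper's settings.

      import numpy as np
      from sklearn.decomposition import PCA

      rng = np.random.default_rng(7)
      frames = rng.normal(size=(2000, 24))      # stand-in filter-bank features

      # Fit PCA on training frames; keep only low-order components, which the
      # paper argues carry the stable utterance element, and discard the
      # high-order components carrying speaking-style fluctuation.
      pca = PCA(n_components=12).fit(frames)
      stable = pca.transform(frames)            # (2000, 12) "filtered" features
      print(pca.explained_variance_ratio_.sum())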

  3. Integrating HMM-Based Speech Recognition With Direct Manipulation In A Multimodal Korean Natural Language Interface

    CERN Document Server

    Lee, Geunbae; Lee, Jong-Hyeok; Kim, Sangeok

    1996-01-01

    This paper presents an HMM-based speech recognition engine and its integration into direct manipulation interfaces for a Korean document editor. Speech recognition can reduce the tedious and repetitive actions that are inevitable in standard GUIs (graphical user interfaces). Our system consists of a general speech recognition engine called ABrain (Auditory Brain) and a speech-commandable document editor called SHE (Simple Hearing Editor). ABrain is a phoneme-based speech recognition engine which shows up to a 97% discrete command recognition rate. SHE is a EuroBridge-widget-based document editor that supports speech commands as well as direct manipulation interfaces.

  4. Robust Automatic Speech Recognition in Impulsive Noise Environment

    Institute of Scientific and Technical Information of China (English)

    DING Pei; CAO Zhigang

    2005-01-01

    This paper presents an efficient method to directly suppress the effect of impulsive noise for robust Automatic Speech Recognition (ASR). In this method, according to the noise sensitivity of each feature dimension, the observation vectors are divided into several parts, each of which is assigned a proper threshold. In the recognition stage, the unreliable probability preponderance of incorrect competing paths caused by impulsive noise is eliminated by flooring the observation probability (FOP) of each feature sub-vector at the Gaussian mixture level, so that the correct path recovers its priority in decoding. Experimental results demonstrate that the proposed method significantly improves recognition accuracy in both machine-gun noise and simulated impulsive noise environments, while maintaining high performance for clean speech recognition.

  5. Mixed Bayesian Networks with Auxiliary Variables for Automatic Speech Recognition

    OpenAIRE

    Stephenson, Todd Andrew; Magimai.-Doss, Mathew; Bourlard, Hervé

    2001-01-01

    Standard hidden Markov models (HMMs), as used in automatic speech recognition (ASR), calculate their emission probabilities by an artificial neural network (ANN) or a Gaussian distribution conditioned on the hidden state variable, considering the emissions independent of any other variable in the model. Recent work showed the benefit of conditioning the emission distributions on a discrete auxiliary variable, which is observed in training and hidden in recognition. Related work has shown the ...

  6. Objective Gender and Age Recognition from Speech Sentences

    Directory of Open Access Journals (Sweden)

    Fatima K. Faek

    2015-10-01

    Full Text Available In this work, an automatic gender and age recognizer from speech is investigated. The features relevant to gender recognition are selected from the first four formant frequencies and twelve MFCCs and feed an SVM classifier, while the features relevant to age are used with a k-NN classifier for the age recognizer model, using MATLAB as a simulation tool. A special selection of robust features is used in this work to improve the results of the gender and age classifiers, based on the frequency range that each feature represents. The gender and age classification algorithms are evaluated using 114 (clean and noisy) speech samples uttered in the Kurdish language. The two-class gender recognition model (adult males and adult females) reached 96% recognition accuracy, while for the three-category classification (adult males, adult females, and children) the model achieved 94% recognition accuracy. For the age recognition model, seven groups are categorized according to their ages. The model performance after selecting the features relevant to age achieved 75.3%. For further improvement, a de-noising technique is applied to the noisy speech signals, followed by selecting the proper features affected by the de-noising process, resulting in 81.44% recognition accuracy.

  7. Speech recognition: Acoustic-phonetic knowledge acquisition and representation

    Science.gov (United States)

    Zue, Victor W.

    1988-09-01

    The long-term research goal is to develop and implement speaker-independent continuous speech recognition systems. It is believed that the proper utilization of speech-specific knowledge is essential for such advanced systems. This research is thus directed toward the acquisition, quantification, and representation of acoustic-phonetic and lexical knowledge, and the application of this knowledge to speech recognition algorithms. In addition, we are exploring new speech recognition alternatives based on artificial intelligence and connectionist techniques. We developed a statistical model for predicting the acoustic realization of stop consonants in various positions in the syllable template. A unification-based grammatical formalism was developed for incorporating this model into the lexical access algorithm. We provided an information-theoretic justification for the hierarchical structure of the syllable template. We analyzed segment durations for vowels and fricatives in continuous speech. Based on contextual information, we developed durational models for vowels and fricatives that account for over 70 percent of the variance, using data from multiple, unknown speakers. We rigorously evaluated the ability of human spectrogram readers to identify stop consonants spoken by many talkers and in a variety of phonetic contexts. Incorporating the declarative knowledge used by the readers, we developed a knowledge-based system for stop identification. We achieved system performance comparable to that of the readers.

  8. Pattern Recognition Methods and Features Selection for Speech Emotion Recognition System

    Directory of Open Access Journals (Sweden)

    Pavol Partila

    2015-01-01

    Full Text Available The impact of the classification method and feature selection on speech emotion recognition accuracy is discussed in this paper. Selecting the correct parameters in combination with the classifier is an important part of reducing the computational complexity of the system. This step is necessary especially for systems that will be deployed in real-time applications. The reason for developing and improving speech emotion recognition systems is their wide applicability in today's automatic voice-controlled systems. The Berlin database of emotional recordings was used in this experiment. The classification accuracy of artificial neural networks, k-nearest neighbours, and Gaussian mixture models is measured considering the selection of prosodic, spectral, and voice quality features. The purpose was to find an optimal combination of methods and feature groups for stress detection in human speech. The research contribution lies in the design of the speech emotion recognition system, owing to its accuracy and efficiency.

  9. On Automatic Voice Casting for Expressive Speech: Speaker Recognition vs. Speech Classification

    OpenAIRE

    Obin, Nicolas; Roebel, Axel; Bachman, Grégoire

    2014-01-01

    This paper presents the first large-scale automatic voice casting system, and explores the adaptation of speaker recognition techniques to measure voice similarities. The proposed system is based on the representation of a voice by classes (e.g., age/gender, voice quality, emotion). First, a multi-label system is used to classify speech into classes. Then, the output probabilities for each class are concatenated to form a vector that represents the vocal signature of a speech recording. Final...

  10. Improving user-friendliness by using visually supported speech recognition

    NARCIS (Netherlands)

    Waals, J.A.J.S.; Kooi, F.L.; Kriekaard, J.J.

    2002-01-01

    While speech recognition in principle may be one of the most natural interfaces, in practice it is not due to the lack of user-friendliness. Words are regularly interpreted wrong, and subjects tend to articulate in an exaggerated manner. We explored the potential of visually supported error correcti

  11. Speech emotion recognition based on statistical pitch model

    Institute of Scientific and Technical Information of China (English)

    WANG Zhiping; ZHAO Li; ZOU Cairong

    2006-01-01

    A modified Parzen-window method, which keeps high resolution at low frequencies and smoothness at high frequencies, is proposed to obtain the statistical model. A gender classification method utilizing this statistical model is then proposed, which achieves 98% gender classification accuracy when long sentences are processed. By separating male and female voices, the mean and standard deviation of speech training samples with different emotions are used to create the corresponding emotion models. The Bhattacharyya distances between the test sample and the statistical pitch models are then utilized for emotion recognition in speech. The normalization of pitch for male and female voices is also considered, in order to map them into a uniform space. Finally, a speech emotion recognition experiment based on K-Nearest Neighbor shows that a correct rate of 81% is achieved, compared with only 73.85% if the traditional parameters are utilized.
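
    For reference, the comparison step the abstract relies on can be sketched with the closed-form Bhattacharyya distance between two univariate Gaussians (mean, standard deviation); the pitch statistics below are invented.

        # Sketch: Bhattacharyya distance between univariate Gaussian
        # pitch models, used to pick the closest emotion model.
        import math

        def bhattacharyya_gauss(mu1, s1, mu2, s2):
            v1, v2 = s1 ** 2, s2 ** 2
            return (0.25 * math.log(0.25 * (v1 / v2 + v2 / v1 + 2))
                    + 0.25 * (mu1 - mu2) ** 2 / (v1 + v2))

        # Hypothetical normalized-pitch statistics (Hz) per emotion model:
        emotions = {"neutral": (120.0, 15.0), "anger": (180.0, 40.0)}
        test = (170.0, 35.0)
        best = min(emotions,
                   key=lambda e: bhattacharyya_gauss(*test, *emotions[e]))
        print("closest emotion model:", best)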

  12. EMOTIONAL SPEECH RECOGNITION BASED ON SVM WITH GMM SUPERVECTOR

    Institute of Scientific and Technical Information of China (English)

    Chen Yanxiang; Xie Jian

    2012-01-01

    Emotion recognition from speech is an important field of research in human-computer interaction. In this letter the framework of Support Vector Machines (SVM) with a Gaussian Mixture Model (GMM) supervector is introduced for emotional speech recognition. Because of the importance of variance in reflecting the distribution of speech, normalized mean vectors with the potential to exploit the information from the variance are adopted to form the GMM supervector. Comparative experiments from five aspects are conducted to study their corresponding effects on system performance. The experimental results, which indicate that the influence of the number of mixtures is strong while the influence of duration is weak, provide a basis for the training-set selection of the Universal Background Model (UBM).
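
    A simplified sketch of the supervector construction, assuming scikit-learn and synthetic frames; the usual MAP adaptation from a UBM is omitted for brevity, so this only illustrates the stacking of variance-normalized means.

        # Hedged sketch: GMM supervector of normalized means, fed to an SVM.
        import numpy as np
        from sklearn.mixture import GaussianMixture
        from sklearn.svm import SVC

        rng = np.random.default_rng(0)

        def supervector(frames, n_mix=4):
            gmm = GaussianMixture(n_components=n_mix, covariance_type="diag",
                                  random_state=0).fit(frames)
            # Normalize each mean by the mixture's standard deviation.
            return (gmm.means_ / np.sqrt(gmm.covariances_)).ravel()

        # One utterance = many MFCC-like frames; one supervector per utterance.
        X = np.array([supervector(rng.normal(c, 1.0, size=(200, 13)))
                      for c in (0.0, 0.0, 1.0, 1.0)])
        y = np.array([0, 0, 1, 1])
        print(SVC(kernel="linear").fit(X, y).predict(X))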

  13. POLISH EMOTIONAL SPEECH RECOGNITION USING ARTIFICIAL NEURAL NETWORK

    Directory of Open Access Journals (Sweden)

    Paweł Powroźnik

    2014-11-01

    Full Text Available The article presents the issue of emotion recognition based on Polish emotional speech analysis. The Polish database of emotional speech, prepared and shared by the Medical Electronics Division of the Lodz University of Technology, has been used for the research. The following parameters extracted from the sampled and normalised speech signal have been used for the analysis: energy of the signal, speaker's sex, average value of the speech signal, and both the minimum and maximum sample value for a given signal. As the emotional state classifier, a four-layer artificial neural network has been used. The achieved results reach 50% accuracy. The conducted research focused on six emotional states: a neutral state, sadness, joy, anger, fear and boredom.

  14. Temporal visual cues aid speech recognition

    DEFF Research Database (Denmark)

    Zhou, Xiang; Ross, Lars; Lehn-Schiøler, Tue;

    2006-01-01

    BACKGROUND: It is well known that under noisy conditions, viewing a speaker's articulatory movement aids the recognition of spoken words. Conventionally it is thought that the visual input disambiguates otherwise confusing auditory input. HYPOTHESIS: In contrast, we hypothesize that it is the temporal synchronicity of the visual input that aids parsing of the auditory stream. More specifically, we expected that purely temporal information, which does not convey information such as place of articulation, may facilitate word recognition. METHODS: To test this prediction we used temporal features of audio to generate an artificial talking-face video and measured word recognition performance on simple monosyllabic words. RESULTS: When presenting words together with the artificial video we find that word recognition is improved over purely auditory presentation. The effect is significant (p…

  15. Biologically inspired emotion recognition from speech

    OpenAIRE

    Buscicchio Cosimo; Caponetti Laura; Castellano Giovanna

    2011-01-01

    Emotion recognition has become a fundamental task in human-computer interaction systems. In this article, we propose an emotion recognition approach based on biologically inspired methods. Specifically, emotion classification is performed using a long short-term memory (LSTM) recurrent neural network which is able to recognize long-range dependencies between successive temporal patterns. We propose to represent data using features derived from two different models: mel-frequency ceps...

  16. A Research of Speech Emotion Recognition Based on Deep Belief Network and SVM

    Directory of Open Access Journals (Sweden)

    Chenchen Huang

    2014-01-01

    Full Text Available Feature extraction is a very important part of speech emotion recognition. To address feature extraction in speech emotion recognition, this paper proposes a new method that uses the DBNs of a DNN to extract emotional features from the speech signal automatically. A five-layer DBN is trained to extract speech emotion features, and multiple consecutive frames are incorporated to form a high-dimensional feature. The features after training in the DBNs are the input of a nonlinear SVM classifier, yielding a multiple-classifier speech emotion recognition system. The speech emotion recognition rate of the system reached 86.5%, which was 7% higher than the original method.
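
    A loose sketch of this pipeline using scikit-learn, with a single RBM layer standing in for the five-layer DBN and synthetic data in place of real speech features; all names and dimensions are assumptions.

        # Hedged sketch: unsupervised RBM features -> nonlinear SVM.
        import numpy as np
        from sklearn.neural_network import BernoulliRBM
        from sklearn.pipeline import Pipeline
        from sklearn.preprocessing import MinMaxScaler
        from sklearn.svm import SVC

        rng = np.random.default_rng(0)
        # 200 "utterances": 5 consecutive frames x 13 coefficients, flattened.
        X = rng.normal(0, 1, size=(200, 65)) + np.repeat([0, 1], 100)[:, None]
        y = np.repeat([0, 1], 100)

        model = Pipeline([("scale", MinMaxScaler()),       # RBM expects [0, 1]
                          ("rbm", BernoulliRBM(n_components=30, random_state=0)),
                          ("svm", SVC(kernel="rbf"))]).fit(X, y)
        print("training accuracy:", model.score(X, y))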

  17. Computer recognition of phonets in speech

    Science.gov (United States)

    Martin, D. D.

    1982-12-01

    This project generated phonetic units, termed 'phonets,' from digitized speech files. The time file was converted to feature space using the Fourier Transform, and phonet occurrences were detected using Minkowski One and Two distance measures. Phonet matches were detected and ranked for each phonet compared against a template file. Phonet short-time energy was included in the output files. An algorithm was developed to partition feature space and its performance was evaluated.
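
    The matching step reduces to ranking templates by Minkowski order-one (city-block) and order-two (Euclidean) distances; a sketch with invented templates:

        # Sketch: rank phonet templates by Minkowski distance to a frame.
        import numpy as np

        def minkowski(a, b, p):
            return np.sum(np.abs(a - b) ** p) ** (1.0 / p)

        templates = {"ph1": np.array([0.2, 0.7, 0.1]),
                     "ph2": np.array([0.6, 0.1, 0.3])}
        frame = np.array([0.25, 0.65, 0.10])
        for p in (1, 2):
            ranked = sorted(templates,
                            key=lambda k: minkowski(frame, templates[k], p))
            print(f"order-{p} ranking:", ranked)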

  18. Speech recognition for embedded automatic positioner for laparoscope

    Science.gov (United States)

    Chen, Xiaodong; Yin, Qingyun; Wang, Yi; Yu, Daoyin

    2014-07-01

    In this paper a novel speech recognition methodology based on the Hidden Markov Model (HMM) is proposed for an embedded Automatic Positioner for Laparoscope (APL), which includes a fixed-point ARM processor as its core. The APL system is designed to assist the doctor in laparoscopic surgery by implementing the specific doctor's vocal control of the laparoscope. Real-time response to voice commands requires a more efficient speech recognition algorithm for the APL. In order to reduce computation cost without significant loss in recognition accuracy, both arithmetic and algorithmic optimizations are applied in the presented method. First, relying mostly on arithmetic optimizations, a fixed-point frontend for speech feature analysis is built to match the ARM processor's characteristics. Then a fast likelihood computation algorithm is used to reduce the computational complexity of the HMM-based recognition algorithm. The experimental results show that the method keeps recognition time under 0.5 s with accuracy higher than 99%, demonstrating its ability to achieve real-time vocal control of the APL.

  19. Phonetic recognition of natural speech by nonstationary Markov models

    Science.gov (United States)

    Falaschi, Alessandro

    1988-04-01

    A speech recognition system based on statistical decision theory, viewing the problem as the classical design of a decoder in a communication-system framework, is outlined. Statistical properties of the language are used to characterize the allowable phonetic sequences inside words, while trying to capture allophonic phoneme features in function-dependent acoustical models with the aim of utilizing them as word segmentation cues. Experiments prove the utility of explicitly modeling the intrinsic nonstationarity of speech in a statistically based speech recognition system. The nonstationarity of phonetic-chain statistics and acoustical transition probabilities can easily be taken into account, yielding recognition improvements. The use of inside-syllable position-dependent phonetic models does not improve recognition performance, and the iterative Viterbi training algorithm seems unable to adequately exploit this kind of acoustical modeling. As a direct consequence of the system design, the recognized phonetic sequence exhibits word boundary marks even in the absence of pauses between words, thus giving anchor points to the higher-level parsing algorithms needed in a complete recognition system.

  20. Success potential of automated star pattern recognition

    Science.gov (United States)

    Van Bezooijen, R. W. H.

    1986-01-01

    A quasi-analytical model is presented for calculating the success probability of automated star pattern recognition systems for attitude control of spacecraft. The star data are gathered by an imaging star tracker (STR) with a circular FOV capable of detecting 20 stars. The success potential is evaluated in terms of the equivalent diameters of the FOV and the target star area ('uniqueness area'). Recognition is carried out as a function of the position and brightness of selected stars in an area around each guide star. The success of the system depends on the resultant pointing error, and is calculated by generating a probability distribution of reaching a threshold probability of an unacceptable pointing error. The method yields data which are equivalent to data available with Monte Carlo simulations. When applied to the recognition system intended for use on the Space IR Telescope Facility, it is shown that acceptable pointing, to a level of nearly 100 percent certainty, can be obtained using a single star tracker and about 4000 guide stars.

  1. New Ideas for Speech Recognition and Related Technologies

    Energy Technology Data Exchange (ETDEWEB)

    Holzrichter, J F

    2002-06-17

    The ideas relating to the use of organ motion sensors for the purposes of speech recognition were first described by the author in spring 1994. During the past year, a series of productive collaborations between the author, Tom McEwan and Larry Ng ensued and have led to demonstrations, new sensor ideas, and algorithmic descriptions of a large number of speech recognition concepts. This document summarizes the basic concepts of recognizing speech once organ motions have been obtained. Micro-power radars and their uses for the measurement of body organ motions, such as those of the heart and lungs, have been demonstrated by Tom McEwan over the past two years. McEwan and I conducted a series of experiments, using these instruments, on vocal organ motions beginning in late spring, during which we observed motions of vocal folds (i.e., cords), tongue, jaw, and related organs that are very useful for speech recognition and other purposes. These will be reviewed in a separate paper. Since late summer 1994, Lawrence Ng and I have worked to make many of the initial recognition ideas more rigorous and to investigate the applications of these new ideas to new speech recognition algorithms, to speech coding, and to speech synthesis. I introduce some of those ideas in section IV of this document, and we describe them more completely in the document following this one, UCRL-UR-120311. For the design and operation of micro-power radars and their application to body organ motions, the reader may contact Tom McEwan directly. The capability for using EM sensors (i.e., radar units) to measure body organ motions and positions has been available for decades. Impediments to their use appear to have been size, excessive power, lack of resolution, and lack of understanding of the value of organ motion measurements, especially as applied to speech related technologies. However, with the invention of very low power, portable systems as demonstrated by McEwan at LLNL, researchers have begun

  2. Automatic speech recognition for report generation in computed tomography

    International Nuclear Information System (INIS)

    Purpose: A study was performed to compare the performance of automatic speech recognition (ASR) with conventional transcription. Materials and Methods: 100 CT reports were generated by using ASR and 100 CT reports were dictated and written by medical transcriptionists. The time for dictation and correction of errors by the radiologist was assessed and the type of mistakes was analysed. The text recognition rate was calculated in both groups and the average time between completion of the imaging study by the technologist and generation of the written report was assessed. A commercially available speech recognition technology (ASKA Software, IBM Via Voice) running on a personal computer was used. Results: The time for dictation using digital voice recognition was 9.4±2.3 min compared to 4.5±3.6 min with an ordinary Dictaphone. The text recognition rate was 97% with digital voice recognition and 99% with medical transcriptionists. The average time from imaging completion to written report finalisation was reduced from 47.3 hours with medical transcriptionists to 12.7 hours with ASR. The analysis of misspellings demonstrated (ASR vs. medical transcriptionists): 3 vs. 4 syntax errors, 0 vs. 37 orthographic mistakes, 16 vs. 22 mistakes in substance, and 47 vs. erroneously applied terms. Conclusions: The use of digital voice recognition as a replacement for medical transcription is recommendable when immediate availability of written reports is necessary. (orig.)

  3. Biologically inspired emotion recognition from speech

    Directory of Open Access Journals (Sweden)

    Buscicchio Cosimo

    2011-01-01

    Full Text Available Emotion recognition has become a fundamental task in human-computer interaction systems. In this article, we propose an emotion recognition approach based on biologically inspired methods. Specifically, emotion classification is performed using a long short-term memory (LSTM) recurrent neural network which is able to recognize long-range dependencies between successive temporal patterns. We propose to represent data using features derived from two different models: mel-frequency cepstral coefficients (MFCC) and the Lyon cochlear model. In the experimental phase, results obtained from the LSTM network and the two different feature sets are compared, showing that features derived from the Lyon cochlear model give better recognition results in comparison with those obtained with the traditional MFCC representation.

  4. Biologically inspired emotion recognition from speech

    Science.gov (United States)

    Caponetti, Laura; Buscicchio, Cosimo Alessandro; Castellano, Giovanna

    2011-12-01

    Emotion recognition has become a fundamental task in human-computer interaction systems. In this article, we propose an emotion recognition approach based on biologically inspired methods. Specifically, emotion classification is performed using a long short-term memory (LSTM) recurrent neural network which is able to recognize long-range dependencies between successive temporal patterns. We propose to represent data using features derived from two different models: mel-frequency cepstral coefficients (MFCC) and the Lyon cochlear model. In the experimental phase, results obtained from the LSTM network and the two different feature sets are compared, showing that features derived from the Lyon cochlear model give better recognition results in comparison with those obtained with the traditional MFCC representation.
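
    A minimal sketch of such an LSTM classifier, written here in PyTorch (an assumption; the article does not specify an implementation), with MFCC-like input frames:

        # Hedged sketch: LSTM over acoustic feature frames -> emotion logits.
        import torch
        import torch.nn as nn

        class EmotionLSTM(nn.Module):
            def __init__(self, n_feats=13, hidden=64, n_classes=6):
                super().__init__()
                self.lstm = nn.LSTM(n_feats, hidden, batch_first=True)
                self.out = nn.Linear(hidden, n_classes)

            def forward(self, x):            # x: (batch, frames, n_feats)
                _, (h, _) = self.lstm(x)     # final hidden state
                return self.out(h[-1])       # logits over emotion classes

        model = EmotionLSTM()
        frames = torch.randn(8, 120, 13)     # 8 utterances, 120 frames each
        print(model(frames).shape)           # -> torch.Size([8, 6])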

  5. Modelling Errors in Automatic Speech Recognition for Dysarthric Speakers

    Directory of Open Access Journals (Sweden)

    Santiago Omar Caballero Morales

    2009-01-01

    Full Text Available Dysarthria is a motor speech disorder characterized by weakness, paralysis, or poor coordination of the muscles responsible for speech. Although automatic speech recognition (ASR) systems have been developed for disordered speech, factors such as low intelligibility and limited phonemic repertoire decrease speech recognition accuracy, making conventional speaker adaptation algorithms perform poorly on dysarthric speakers. In this work, rather than adapting the acoustic models, we model the errors made by the speaker and attempt to correct them. For this task, two techniques have been developed: (1) a set of “metamodels” that incorporate a model of the speaker's phonetic confusion matrix into the ASR process; (2) a cascade of weighted finite-state transducers at the confusion matrix, word, and language levels. Both techniques attempt to correct the errors made at the phonetic level and make use of a language model to find the best estimate of the correct word sequence. Our experiments show that both techniques outperform standard adaptation techniques.
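
    A toy sketch of the confusion-matrix idea behind the metamodels: treat the recognizer output as an observation and invert the speaker's confusion matrix with Bayes' rule. The phone set, matrix, and prior below are invented for illustration.

        # Hedged sketch: Bayesian correction through a phone confusion matrix.
        import numpy as np

        phones = ["p", "b", "t"]
        # conf[i, j] = P(recognizer outputs phones[j] | speaker intended phones[i])
        conf = np.array([[0.6, 0.3, 0.1],
                         [0.2, 0.7, 0.1],
                         [0.1, 0.1, 0.8]])
        prior = np.array([0.4, 0.3, 0.3])    # language-model-like phone prior

        recognized = "b"                      # what the ASR decoded
        j = phones.index(recognized)
        posterior = conf[:, j] * prior        # P(intended | recognized), unnormalized
        posterior /= posterior.sum()
        print(dict(zip(phones, posterior.round(3))))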

  6. Voice Activity Detector of Wake-Up-Word Speech Recognition System Design on FPGA

    OpenAIRE

    Veton Z. Këpuska; Mohamed M. Eljhani; Brian H. Hight

    2014-01-01

    A typical speech recognition system is push-to-talk operated and requires activation. However, for those who use hands-busy applications, movement may be restricted or impossible. One alternative is to use a speech-only interface. The proposed method, called Wake-Up-Word Speech Recognition (WUW-SR), utilizes a speech-only interface. A WUW-SR system would allow the user to activate systems (cell phone, computer, etc.) with only speech commands instead of manual activation. T...

  7. Post-error Correction in Automatic Speech Recognition Using Discourse Information

    OpenAIRE

    Kang, S.; Kim, J.-H.; Seo, J.

    2014-01-01

    Overcoming speech recognition errors in the field of human-computer interaction is important in ensuring a consistent user experience. This paper proposes a semantic-oriented post-processing approach for the correction of errors in speech recognition. The novelty of the model proposed here is that it re-ranks the n-best hypotheses of speech recognition based on the user's intention, which is analyzed from previous discourse information, while conventional automatic speech reco...

  8. EMOTION RECOGNITION FROM SPEECH SIGNAL: REALIZATION AND AVAILABLE TECHNIQUES

    Directory of Open Access Journals (Sweden)

    NILIM JYOTI GOGOI

    2014-05-01

    Full Text Available The ability to detect human emotion from speech will be a great addition to the field of human-robot interaction. The aim of the work is to build an emotion recognition system using Mel-frequency cepstral coefficients (MFCC) and a Gaussian mixture model (GMM) classifier. The work also aims at describing the best available methods for recognizing emotion from emotional speech. For that reason, existing techniques and methods used for feature extraction and pattern classification are reviewed and discussed in this paper.
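
    A compact sketch of the MFCC + GMM pipeline the abstract describes, assuming librosa for feature extraction and scikit-learn for the mixture models; file handling and the corpus itself are left schematic.

        # Hedged sketch: one GMM per emotion, classification by max likelihood.
        import numpy as np
        import librosa
        from sklearn.mixture import GaussianMixture

        def mfcc_frames(path):
            y, sr = librosa.load(path, sr=16000)
            return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13).T  # (frames, 13)

        def train_models(files_by_emotion, n_mix=8):
            # files_by_emotion: {"anger": [paths...], ...} (hypothetical layout)
            return {emo: GaussianMixture(n_mix, covariance_type="diag").fit(
                        np.vstack([mfcc_frames(f) for f in files]))
                    for emo, files in files_by_emotion.items()}

        def classify(models, path):
            frames = mfcc_frames(path)
            # score() returns average log-likelihood per frame.
            return max(models, key=lambda e: models[e].score(frames))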

  9. Deep Multimodal Learning for Audio-Visual Speech Recognition

    OpenAIRE

    Mroueh, Youssef; Marcheret, Etienne; Goel, Vaibhava

    2015-01-01

    In this paper, we present methods in deep multimodal learning for fusing speech and visual modalities for Audio-Visual Automatic Speech Recognition (AV-ASR). First, we study an approach where uni-modal deep networks are trained separately and their final hidden layers fused to obtain a joint feature space in which another deep network is built. While the audio network alone achieves a phone error rate (PER) of $41\\%$ under clean condition on the IBM large vocabulary audio-visual studio datase...

  10. Automatic Phonetic Transcription for Danish Speech Recognition

    DEFF Research Database (Denmark)

    Kirkedal, Andreas Søeborg

    to acquire and expensive to create. For languages with productive compounding or agglutinative languages like German and Finnish, respectively, phonetic dictionaries are also hard to maintain. For this reason, automatic phonetic transcription tools have been produced for many languages. The quality...... of automatic phonetic transcriptions vary greatly with respect to language and transcription strategy. For some languages where the difference between the graphemic and phonetic representations are small, graphemic transcriptions can be used to create ASR systems with acceptable performance. In other languages......, syllabication, stød and several other suprasegmental features (Kirkedal, 2013). Simplifying the transcriptions by filtering out the symbols for suprasegmental features in a post-processing step produces a format that is suitable for ASR purposes. eSpeak is an open source speech synthesizer originally created...

  11. Speech Acquisition and Automatic Speech Recognition for Integrated Spacesuit Audio Systems

    Science.gov (United States)

    Huang, Yiteng; Chen, Jingdong; Chen, Shaoyan

    2010-01-01

    A voice-command human-machine interface system has been developed for spacesuit extravehicular activity (EVA) missions. A multichannel acoustic signal processing method has been created for distant speech acquisition in noisy and reverberant environments. This technology reduces noise by exploiting differences in the statistical nature of signal (i.e., speech) and noise that exist in the spatial and temporal domains. As a result, the automatic speech recognition (ASR) accuracy can be improved to the level at which crewmembers would find the speech interface useful. The developed speech human/machine interface will enable both crewmember usability and operational efficiency. It offers a fast rate of data/text entry, a small overall size, and light weight. In addition, this design will free the hands and eyes of a suited crewmember. The system components and steps include beamforming/multi-channel noise reduction, single-channel noise reduction, speech feature extraction, feature transformation and normalization, feature compression, model adaptation, ASR HMM (Hidden Markov Model) training, and ASR decoding. A state-of-the-art phoneme recognizer can obtain an accuracy rate of 65 percent when the training and testing data are free of noise. When it is used in spacesuits, the rate drops to about 33 percent. With the developed microphone array speech-processing technologies, the performance is improved and the phoneme recognition accuracy rate rises to 44 percent. The recognizer can be further improved by combining the microphone array and HMM model adaptation techniques and using speech samples collected from inside spacesuits. In addition, arithmetic complexity models for the major HMM-based ASR components were developed. They can help real-time ASR system designers select proper tasks when facing constraints in computational resources.
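
    The first stage (multi-channel processing) can be illustrated with a plain delay-and-sum beamformer, a simple stand-in for the statistical method actually developed; the delays and noise levels below are invented.

        # Hedged sketch: delay-and-sum beamforming over microphone channels.
        import numpy as np

        def delay_and_sum(channels, delays_samples):
            """channels: (n_mics, n_samples); delays: integer sample delays."""
            n_mics, _ = channels.shape
            out = np.zeros(channels.shape[1])
            for ch, d in zip(channels, delays_samples):
                out += np.roll(ch, -d)       # advance each channel by its delay
            return out / n_mics

        rng = np.random.default_rng(0)
        clean = np.sin(2 * np.pi * 200 * np.arange(1600) / 16000)
        mics = np.stack([np.roll(clean, d) + rng.normal(0, 0.5, 1600)
                         for d in (0, 3, 6, 9)])
        enhanced = delay_and_sum(mics, [0, 3, 6, 9])
        print("single-mic residual:", np.std(mics[0] - clean).round(3))
        print("beamformed residual:", np.std(enhanced - clean).round(3))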

  12. Automatic Speech Recognition Systems for the Evaluation of Voice and Speech Disorders in Head and Neck Cancer

    Directory of Open Access Journals (Sweden)

    Andreas Maier

    2010-01-01

    Full Text Available In patients suffering from head and neck cancer, speech intelligibility is often restricted. For assessment and outcome measurements, automatic speech recognition systems have previously been shown to be appropriate for objective and quick evaluation of intelligibility. In this study we investigate the applicability of the method to speech disorders caused by head and neck cancer. Intelligibility was quantified by speech recognition on recordings of a standard text read by 41 German laryngectomized patients with cancer of the larynx or hypopharynx and 49 German patients who had suffered from oral cancer. The speech recognition provides the percentage of correctly recognized words of a sequence, that is, the word recognition rate. Automatic evaluation was compared to perceptual ratings by a panel of experts and to an age-matched control group. Both patient groups showed significantly lower word recognition rates than the control group. Automatic speech recognition yielded word recognition rates which agreed with the experts' evaluation of intelligibility at a significant level. Automatic speech recognition thus provides a low-effort means to objectify and quantify the most important aspect of pathologic speech—the intelligibility. The system was successfully applied to voice and speech disorders.

  13. Speech signal recognition with the homotopic representation method

    Science.gov (United States)

    Bianchi, F.; Pocci, P.; Prina-Ricotti, L.

    1981-02-01

    Speech recognition by a computer, using homotopic representation, is introduced, including the algorithm and the processing mode for the speech signal, the result of a vowel recognition experiment, and the result of a phonetic transcription experiment with simple words composed of four phonemes. The signal is stored in a delay line of M elements and N = M + 1 outputs. Homotopic defines a pair of outputs symmetrical to the exit located in the central element of the delay line. When the products of sample homotopic outputs of the first sequence of pair sampling are found, they are separately summed to the products of the following processing. This procedure is repeated continuously so that at every instant the transform function is the result of the last processing and the weighted sum of the previous result. In tests a female /o/ is recognized as /a/. Of 320 test phonemes, 15 are mistaken and 7 are dubious.

  14. Part-of-Speech Enhanced Context Recognition

    DEFF Research Database (Denmark)

    Madsen, Rasmus Elsborg; Larsen, Jan; Hansen, Lars Kai

    2004-01-01

    Language independent `bag-of-words' representations are surprisingly effective for text classification. In this communication our aim is to elucidate the synergy between language independent features and simple language model features. We consider term tag features estimated by a so-called part-of-speech tagger … probabilistic neural network classifier. Three medium size data-sets are analyzed and we find consistent synergy between the term and natural language features in all three sets for a range of training set sizes. The most significant enhancement is found for small text databases where high recognition...

  15. Initial evaluation of a continuous speech recognition program for radiology

    OpenAIRE

    Kanal, KM; Hangiandreou, NJ; Sykes, AM; Eklund, HE; Araoz, PA; Leon, JA; Erickson, BJ

    2001-01-01

    The aims of this work were to measure the accuracy of one continuous speech recognition product and dependence on the speaker's gender and status as a native or nonnative English speaker, and evaluate the product's potential for routine use in transcribing radiology reports. IBM MedSpeak/Radiology software, version 1.1 was evaluated by 6 speakers. Two were nonnative English speakers, and 3 were men. Each speaker dictated a set of 12 reports. The reports included neurologic and body imaging ex...

  16. On the Generalization of Shannon Entropy for Speech Recognition

    OpenAIRE

    Obin, Nicolas; Liuni, Marco

    2012-01-01

    This paper introduces an entropy-based spectral representation as a measure of the degree of noisiness in audio signals, complementary to the standard MFCCs for audio and speech recognition. The proposed representation is based on the Rényi entropy, which is a generalization of the Shannon entropy. In audio signal representation, Rényi entropy presents the advantage of focusing either on the harmonic content (prominent amplitude within a distribution) or on the noise content (equal distributi...
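
    For reference, the Rényi entropy of order alpha of a distribution p is H_alpha = log(sum p^alpha) / (1 - alpha), recovering the Shannon entropy as alpha approaches 1. A small sketch applied to a normalized magnitude spectrum (the test spectra are invented):

        # Sketch: Rényi entropy of a normalized spectrum as a noisiness measure.
        import numpy as np

        def renyi_entropy(spectrum, alpha=2.0, eps=1e-12):
            p = spectrum / (spectrum.sum() + eps)    # treat as a distribution
            if abs(alpha - 1.0) < 1e-6:              # Shannon limit
                return -np.sum(p * np.log2(p + eps))
            return np.log2(np.sum(p ** alpha) + eps) / (1.0 - alpha)

        harmonic = np.zeros(128); harmonic[[10, 20, 30]] = 1.0  # peaky spectrum
        noise = np.ones(128)                                    # flat spectrum
        print(renyi_entropy(harmonic), renyi_entropy(noise))    # low vs. high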

  17. Syntactic error modeling and scoring normalization in speech recognition: Error modeling and scoring normalization in the speech recognition task for adult literacy training

    Science.gov (United States)

    Olorenshaw, Lex; Trawick, David

    1991-01-01

    The purpose was to develop a speech recognition system to be able to detect speech which is pronounced incorrectly, given that the text of the spoken speech is known to the recognizer. Better mechanisms are provided for using speech recognition in a literacy tutor application. Using a combination of scoring normalization techniques and cheater-mode decoding, a reasonable acceptance/rejection threshold was provided. In continuous speech, the system was tested to be able to provide above 80 pct. correct acceptance of words, while correctly rejecting over 80 pct. of incorrectly pronounced words.

  18. Statistical modeling of speech Poincaré sections in combination of frequency analysis to improve speech recognition performance.

    Science.gov (United States)

    Jafari, Ayyoob; Almasganj, Farshad; Bidhendi, Maryam Nabi

    2010-09-01

    This paper introduces a combinational feature extraction approach to improve speech recognition systems. The main idea is to simultaneously benefit from some features obtained from Poincaré section applied to speech reconstructed phase space (RPS) and typical Mel frequency cepstral coefficients (MFCCs) which have a proved role in speech recognition field. With an appropriate dimension, the reconstructed phase space of speech signal is assured to be topologically equivalent to the dynamics of the speech production system, and could therefore include information that may be absent in linear analysis approaches. Moreover, complicated systems such as speech production system can present cyclic and oscillatory patterns and Poincaré sections could be used as an effective tool in analysis of such trajectories. In this research, a statistical modeling approach based on Gaussian mixture models (GMMs) is applied to Poincaré sections of speech RPS. A final pruned feature set is obtained by applying an efficient feature selection approach to the combination of the parameters of the GMM model and MFCC-based features. A hidden Markov model-based speech recognition system and TIMIT speech database are used to evaluate the performance of the proposed feature set by conducting isolated and continuous speech recognition experiments. By the proposed feature set, 5.7% absolute isolated phoneme recognition improvement is obtained against only MFCC-based features. PMID:20887046
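
    A minimal sketch of the RPS construction (time-delay embedding) and a simple Poincaré section; the embedding dimension, lag, and test signal are illustrative assumptions, not the paper's settings. A GMM could then be fitted to the resulting section points, as the paper does.

        # Hedged sketch: time-delay embedding and a Poincaré section.
        import numpy as np

        def embed(x, dim=3, tau=8):
            """Time-delay embedding: rows are points in the RPS."""
            n = len(x) - (dim - 1) * tau
            return np.stack([x[i * tau: i * tau + n] for i in range(dim)], axis=1)

        def poincare_section(points, coord=0, level=0.0):
            """Keep points where the chosen coordinate crosses `level` upward."""
            c = points[:, coord]
            crossing = (c[:-1] < level) & (c[1:] >= level)
            return points[1:][crossing]

        t = np.arange(4000)
        x = np.sin(2 * np.pi * 0.01 * t) + 0.3 * np.sin(2 * np.pi * 0.031 * t)
        section = poincare_section(embed(x))
        print("points on the Poincaré section:", section.shape)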

  19. Merge-Weighted Dynamic Time Warping for Speech Recognition

    Institute of Scientific and Technical Information of China (English)

    张湘莉兰; 骆志刚; 李明

    2014-01-01

    Obtaining training material for rarely used English words and common given names from countries where English is not spoken is difficult due to excessive time, storage and cost factors. Considering personal privacy, language-independent (LI) with lightweight speaker-dependent (SD) automatic speech recognition (ASR) is a convenient option to solve the problem. The dynamic time warping (DTW) algorithm is the state-of-the-art algorithm for small-footprint SD ASR for real-time applications with limited storage and small vocabularies. These applications include voice dialing on mobile devices, menu-driven recognition, and voice control on vehicles and robotics. However, traditional DTW has several limitations, such as high computational complexity, constraint-induced coarse approximation, and inaccuracy problems. In this paper, we introduce the merge-weighted dynamic time warping (MWDTW) algorithm. This method defines a template confidence index for measuring the similarity between merged training data and testing data, while following the core DTW process. MWDTW is simple, efficient, and easy to implement. With extensive experiments on three representative SD speech recognition datasets, we demonstrate that our method significantly outperforms DTW, DTW on merged speech data, and the hidden Markov model (HMM), and is also six times faster than DTW overall.
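
    For orientation, the core DTW recursion that MWDTW builds on is sketched below; the merge weighting and template confidence index are not reproduced, and the test sequences are synthetic.

        # Sketch: plain dynamic time warping between two feature sequences.
        import numpy as np

        def dtw(a, b):
            """DTW distance between sequences a, b of shape (frames, dims)."""
            n, m = len(a), len(b)
            D = np.full((n + 1, m + 1), np.inf)
            D[0, 0] = 0.0
            for i in range(1, n + 1):
                for j in range(1, m + 1):
                    cost = np.linalg.norm(a[i - 1] - b[j - 1])
                    D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
            return D[n, m]

        x = np.sin(np.linspace(0, 3, 40))[:, None]
        y = np.sin(np.linspace(0, 3, 55))[:, None]  # same word, different tempo
        print("DTW distance:", round(dtw(x, y), 3))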

  20. Emerging technologies with potential for objectively evaluating speech recognition skills.

    Science.gov (United States)

    Rawool, Vishakha Waman

    2016-02-01

    Work-related exposure to noise and other ototoxins can cause damage to the cochlea, synapses between the inner hair cells, the auditory nerve fibers, and higher auditory pathways, leading to difficulties in recognizing speech. Procedures designed to determine speech recognition scores (SRS) in an objective manner can be helpful in disability compensation cases where the worker claims to have poor speech perception due to exposure to noise or ototoxins. Such measures can also be helpful in determining SRS in individuals who cannot provide reliable responses to speech stimuli, including patients with Alzheimer's disease, traumatic brain injuries, and infants with and without hearing loss. Cost-effective neural monitoring hardware and software is being rapidly refined due to the high demand for neurogaming (games involving the use of brain-computer interfaces), health, and other applications. More specifically, two related advances in neuro-technology include relative ease in recording neural activity and availability of sophisticated analysing techniques. These techniques are reviewed in the current article and their applications for developing objective SRS procedures are proposed. Issues related to neuroaudioethics (ethics related to collection of neural data evoked by auditory stimuli including speech) and neurosecurity (preservation of a person's neural mechanisms and free will) are also discussed. PMID:26807789

  1. Age-Related Differences in Lexical Access Relate to Speech Recognition in Noise.

    Science.gov (United States)

    Carroll, Rebecca; Warzybok, Anna; Kollmeier, Birger; Ruigendijk, Esther

    2016-01-01

    Vocabulary size has been suggested as a useful measure of "verbal abilities" that correlates with speech recognition scores. Knowing more words is linked to better speech recognition. How vocabulary knowledge translates to general speech recognition mechanisms, how these mechanisms relate to offline speech recognition scores, and how they may be modulated by acoustical distortion or age, is less clear. Age-related differences in linguistic measures may predict age-related differences in speech recognition in noise performance. We hypothesized that speech recognition performance can be predicted by the efficiency of lexical access, which refers to the speed with which a given word can be searched and accessed relative to the size of the mental lexicon. We tested speech recognition in a clinical German sentence-in-noise test at two signal-to-noise ratios (SNRs), in 22 younger (18-35 years) and 22 older (60-78 years) listeners with normal hearing. We also assessed receptive vocabulary, lexical access time, verbal working memory, and hearing thresholds as measures of individual differences. Age group, SNR level, vocabulary size, and lexical access time were significant predictors of individual speech recognition scores, but working memory and hearing threshold were not. Interestingly, longer accessing times were correlated with better speech recognition scores. Hierarchical regression models for each subset of age group and SNR showed very similar patterns: the combination of vocabulary size and lexical access time contributed most to speech recognition performance; only for the younger group at the better SNR (yielding about 85% correct speech recognition) did vocabulary size alone predict performance. Our data suggest that successful speech recognition in noise is mainly modulated by the efficiency of lexical access. This suggests that older adults' poorer performance in the speech recognition task may have arisen from reduced efficiency in lexical access; with an

  2. Age-Related Differences in Lexical Access Relate to Speech Recognition in Noise

    Science.gov (United States)

    Carroll, Rebecca; Warzybok, Anna; Kollmeier, Birger; Ruigendijk, Esther

    2016-01-01

    Vocabulary size has been suggested as a useful measure of “verbal abilities” that correlates with speech recognition scores. Knowing more words is linked to better speech recognition. How vocabulary knowledge translates to general speech recognition mechanisms, how these mechanisms relate to offline speech recognition scores, and how they may be modulated by acoustical distortion or age, is less clear. Age-related differences in linguistic measures may predict age-related differences in speech recognition in noise performance. We hypothesized that speech recognition performance can be predicted by the efficiency of lexical access, which refers to the speed with which a given word can be searched and accessed relative to the size of the mental lexicon. We tested speech recognition in a clinical German sentence-in-noise test at two signal-to-noise ratios (SNRs), in 22 younger (18–35 years) and 22 older (60–78 years) listeners with normal hearing. We also assessed receptive vocabulary, lexical access time, verbal working memory, and hearing thresholds as measures of individual differences. Age group, SNR level, vocabulary size, and lexical access time were significant predictors of individual speech recognition scores, but working memory and hearing threshold were not. Interestingly, longer accessing times were correlated with better speech recognition scores. Hierarchical regression models for each subset of age group and SNR showed very similar patterns: the combination of vocabulary size and lexical access time contributed most to speech recognition performance; only for the younger group at the better SNR (yielding about 85% correct speech recognition) did vocabulary size alone predict performance. Our data suggest that successful speech recognition in noise is mainly modulated by the efficiency of lexical access. This suggests that older adults’ poorer performance in the speech recognition task may have arisen from reduced efficiency in lexical access

  3. Event-Synchronous Analysis for Connected-Speech Recognition.

    Science.gov (United States)

    Morgan, David Peter

    The motivation for event-synchronous speech analysis originates from linear system theory, where the speech-source transfer function is excited by an impulse-like driving function. In speech processing, the impulse response obtained from this linear system contains both semantic information and the vocal tract transfer function. Typically, an estimate of the transfer function is obtained via the spectrum by assuming a short-time stationary signal within some analysis window. However, this spectrum is often distorted by the periodic effects which occur when multiple (pitch) impulses are included in the analysis window. One method to remove these effects would be to deconvolve the excitation function from the speech signal to obtain the transfer function. The more attractive approach is to locate and identify the excitation function and synchronize the analysis frame with it. Event-synchronous analysis differs from pitch-synchronous analysis in that there are many events useful for speech recognition which are not pitch excited. In addition, event-synchronous analysis locates the important boundaries between speech events, such as voiced-to-unvoiced and silence-to-burst transitions. In asynchronous processing, an analysis frame which contains portions of two adjacent but dissimilar speech events is often so ambiguous as to distort or mask the important "phonetic" features of both events. Thus event-synchronous processing is employed to obtain an accurate spectral estimate and, in turn, enhance the estimate of the vocal-tract transfer function. Among the issues which have been addressed in implementing an event-synchronous recognition system are those of developing robust event (pitch, burst, etc.) detectors, synchronous-analysis methodologies, more meaningful feature sets, and dynamic programming algorithms for nonlinear time alignment. An advantage of event-synchronous processing is that the improved representation of the transfer function creates an opportunity for

  4. Hidden Conditional Neural Fields for Continuous Phoneme Speech Recognition

    Science.gov (United States)

    Fujii, Yasuhisa; Yamamoto, Kazumasa; Nakagawa, Seiichi

    In this paper, we propose Hidden Conditional Neural Fields (HCNF) for continuous phoneme speech recognition, which are a combination of Hidden Conditional Random Fields (HCRF) and a Multi-Layer Perceptron (MLP), and inherit their merits, namely, the discriminative property for sequences from HCRF and the ability to extract non-linear features from an MLP. HCNF can incorporate many types of features from which non-linear features can be extracted, and is trained by sequential criteria. We first present the formulation of HCNF and then examine three methods to further improve automatic speech recognition using HCNF: an objective function that explicitly considers training errors, a hierarchical tandem-style feature, and a deep non-linear feature extractor for the observation function. We show that HCNF can be trained realistically without any initial model and outperforms HCRF and the triphone hidden Markov model trained in the minimum phone error (MPE) manner, using experimental results for continuous English phoneme recognition on the TIMIT core test set and Japanese phoneme recognition on the IPA 100 test set.

  5. Error analysis to improve the speech recognition accuracy on Telugu language

    Indian Academy of Sciences (India)

    N Usha Rani; P N Girija

    2012-12-01

    Speech is one of the most important communication channels among people, and speech recognition occupies a prominent place in communication between humans and machines. Several factors affect the accuracy of a speech recognition system. Although much effort has been devoted to increasing accuracy, current speech recognition systems still generate erroneous output. Telugu is one of the most widely spoken south Indian languages. In the proposed Telugu speech recognition system, errors obtained from the decoder are analysed to improve the performance of the system. The static pronunciation dictionary plays a key role in speech recognition accuracy, so modifications are performed on the dictionary used in the decoder. This modification reduces the number of confusion pairs, which improves the performance of the speech recognition system. Language model scores also vary with this modification. The hit rate increases considerably and the false alarms change with the modification of the pronunciation dictionary. Variations are observed in different error measures such as F-measure, error rate and Word Error Rate (WER) by application of the proposed method.

  6. Studies on inter-speaker variability in speech and its application in automatic speech recognition

    Indian Academy of Sciences (India)

    S Umesh

    2011-10-01

    In this paper, we give an overview of the problem of inter-speaker variability and its study in many diverse areas of speech signal processing. We first give an overview of vowel-normalization studies that minimize variations in the acoustic representation of vowel realizations by different speakers. We then describe the universal-warping approach to speaker normalization which unifies many of the vowel normalization approaches and also shows the relation between speech production, perception and auditory processing. We then address the problem of inter-speaker variability in automatic speech recognition (ASR) and describe techniques that are used to reduce these effects and thereby improve the performance of speaker-independent ASR systems.

  7. Improving on hidden Markov models: An articulatorily constrained, maximum likelihood approach to speech recognition and speech coding

    Energy Technology Data Exchange (ETDEWEB)

    Hogden, J.

    1996-11-05

    The goal of the proposed research is to test a statistical model of speech recognition that incorporates the knowledge that speech is produced by relatively slow motions of the tongue, lips, and other speech articulators. This model is called Maximum Likelihood Continuity Mapping (Malcom). Many speech researchers believe that by using constraints imposed by articulator motions, we can improve or replace the current hidden Markov model based speech recognition algorithms. Unfortunately, previous efforts to incorporate information about articulation into speech recognition algorithms have suffered because (1) slight inaccuracies in our knowledge or the formulation of our knowledge about articulation may decrease recognition performance, (2) small changes in the assumptions underlying models of speech production can lead to large changes in the speech derived from the models, and (3) collecting measurements of human articulator positions in sufficient quantity for training a speech recognition algorithm is still impractical. The most interesting (and in fact, unique) quality of Malcom is that, even though Malcom makes use of a mapping between acoustics and articulation, Malcom can be trained to recognize speech using only acoustic data. By learning the mapping between acoustics and articulation using only acoustic data, Malcom avoids the difficulties involved in collecting articulator position measurements and does not require an articulatory synthesizer model to estimate the mapping between vocal tract shapes and speech acoustics. Preliminary experiments that demonstrate that Malcom can learn the mapping between acoustics and articulation are discussed. Potential applications of Malcom aside from speech recognition are also discussed. Finally, specific deliverables resulting from the proposed research are described.

  8. Robust Speech Recognition Using Factorial HMMs for Home Environments

    Directory of Open Access Journals (Sweden)

    Sadaoki Furui

    2007-01-01

    Full Text Available We focus on the problem of speech recognition in the presence of nonstationary sudden noise, which is very likely to happen in home environments. As a model compensation method for this problem, we investigated the use of a factorial hidden Markov model (FHMM) architecture developed from a clean-speech hidden Markov model (HMM) and a sudden-noise HMM. While in conventional studies this architecture is defined only for static features of the observation vector, we extended it to dynamic features. In addition, we performed home-environment adaptation of FHMMs to the characteristics of a given house. A database recorded by a personal robot called PaPeRo in home environments was used for the evaluation of the proposed method. Isolated word recognition experiments demonstrated the effectiveness of the proposed method under noisy conditions. Home-dependent word FHMMs (HD-FHMMs) reduced the word error rate by 20.5% compared to that of the clean-speech word HMMs.

  9. Robust Speech Recognition Using Factorial HMMs for Home Environments

    Directory of Open Access Journals (Sweden)

    Betkowska Agnieszka

    2007-01-01

    Full Text Available We focus on the problem of speech recognition in the presence of nonstationary sudden noise, which is very likely to happen in home environments. As a model compensation method for this problem, we investigated the use of a factorial hidden Markov model (FHMM) architecture developed from a clean-speech hidden Markov model (HMM) and a sudden-noise HMM. While in conventional studies this architecture is defined only for static features of the observation vector, we extended it to dynamic features. In addition, we performed home-environment adaptation of FHMMs to the characteristics of a given house. A database recorded by a personal robot called PaPeRo in home environments was used for the evaluation of the proposed method. Isolated word recognition experiments demonstrated the effectiveness of the proposed method under noisy conditions. Home-dependent word FHMMs (HD-FHMMs) reduced the word error rate by 20.5% compared to that of the clean-speech word HMMs.

  10. Automated recognition system for power quality disturbances

    Science.gov (United States)

    Abdelgalil, Tarek

    … amplitudes and frequencies, an Artificial Neural Network is employed to identify the switched capacitor by using the amplitudes and frequencies extracted from the transient signal. The new algorithms for detecting, tracking, and classifying power quality disturbances demonstrate the potential for further development of a fully automated recognition system for the assessment of power quality. This is possible because implementing the proposed algorithms in the power quality monitoring device becomes a straightforward process through modification of the device software.

  11. Likelihood-Maximizing-Based Multiband Spectral Subtraction for Robust Speech Recognition

    Directory of Open Access Journals (Sweden)

    Bagher BabaAli

    2009-01-01

    Full Text Available Automatic speech recognition performance degrades significantly when speech is affected by environmental noise. Nowadays, the major challenge is to achieve good robustness in adverse noisy conditions so that automatic speech recognizers can be used in real situations. Spectral subtraction (SS) is a well-known and effective approach; it was originally designed for improving the quality of the speech signal as judged by human listeners. SS techniques usually improve the quality and intelligibility of the speech signal, while speech recognition systems need compensation techniques to reduce the mismatch between noisy speech features and the clean trained acoustic model. Nevertheless, correlation can be expected between speech quality improvement and the increase in recognition accuracy. This paper proposes a novel approach for solving this problem by considering SS and the speech recognizer not as two independent entities cascaded together, but rather as two interconnected components of a single system, sharing the common goal of improved speech recognition accuracy. This incorporates important information from the statistical models of the recognition engine as feedback for tuning SS parameters. By using this architecture, we overcome the drawbacks of previously proposed methods and achieve better recognition accuracy. Experimental evaluations show that the proposed method can achieve significant improvement of recognition rates across a wide range of signal-to-noise ratios.
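
    A bare-bones spectral-subtraction sketch for orientation: the over-subtraction factor and spectral floor are exactly the kind of parameters a likelihood-maximizing feedback scheme, as proposed above, would tune (the values and synthetic STFT below are arbitrary).

        # Sketch: magnitude spectral subtraction with a noise-floor clamp.
        import numpy as np

        def spectral_subtraction(frames_stft, n_noise_frames=10,
                                 alpha=1.0, floor=0.02):
            mag, phase = np.abs(frames_stft), np.angle(frames_stft)
            # Estimate the noise spectrum from leading noise-only frames.
            noise = mag[:, :n_noise_frames].mean(axis=1, keepdims=True)
            clean = np.maximum(mag - alpha * noise, floor * mag)  # floor residual
            return clean * np.exp(1j * phase)

        rng = np.random.default_rng(0)
        stft = rng.normal(size=(257, 100)) + 1j * rng.normal(size=(257, 100))
        print(spectral_subtraction(stft).shape)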

  12. Speech recognition in individuals having a clinical complaint about understanding speech during noise or not

    Directory of Open Access Journals (Sweden)

    Becker, Karine Thaís

    2011-07-01

    Full Text Available Introduction: Clinical and experimental study. Individuals with normal hearing can be jeopardized in adverse communication situations, which negatively interferes with speech clearness. Objective: To check and compare the performance of normal-hearing young adults who do or do not have difficulty understanding speech in noise, using sentences as stimuli. Method: 50 normal-hearing individuals, 21 of whom were male and 29 female, aged between 19 and 32, were evaluated and divided into two groups: with and without a clinical complaint about understanding speech in noise. Using the Portuguese Sentence Lists test, the Recognition Threshold of Sentences in Noise was measured, through which the signal-to-noise (SN) ratios were obtained. The contrasting noise was introduced at 65 dB HL. Results: The average SN ratios achieved in the right ear, for the group without a complaint and the group with a complaint, were respectively 6.26 dB and 3.62 dB. For the left ear, the values were -7.12 dB and -4.12 dB. A statistically significant difference was noticed in both the right and left ears of the two groups. Conclusion: Normal-hearing individuals with a clinical complaint about understanding speech in noisy places have more difficulty in the task of recognizing sentences in noise than people who do not report such a difficulty. Accordingly, the customary audiologic evaluation should include tests using sentences against a contrasting noise, with a view to evaluating speech recognition performance more reliably and efficiently. ACTRN12610000822088

  13. Recognition of Emotions in German Speech Using Gaussian Mixture Models

    Czech Academy of Sciences Publication Activity Database

    Vondra, Martin; Vích, Robert

    Vol. 5398. Berlin: SPRINGER-VERLAG, 2009 - (Esposito, A.; Hussain, A.; Marinaro, M.; Martone, R.), s. 256-263. (Lecture Notes in Artificial Intelligence . 5398). ISBN 978-3-642-00524-4. ISSN 0302-9743. [euCognition International Training School on Multimodal Signals - Cognitive and Algorithmic Issues (European COST A2102). Vietri sul Mare (IT), 21.04.2008-26.04.2008] R&D Projects: GA MŠk OC08010 Institutional research plan: CEZ:AV0Z20670512 Keywords : emotion recognition * speech emotions Subject RIV: JA - Electronics ; Optoelectronics, Electrical Engineering

  14. Class of data-flow architectures for speech recognition

    Energy Technology Data Exchange (ETDEWEB)

    Bisiani, R.

    1983-01-01

    The behaviour of many speech recognition systems, and of some artificial intelligence programs, can be modeled as a data-driven computation in which the instructions are complex but inexpensive operations (e.g., the evaluation of the likelihood of a partial sentence hypothesis), and the data-flow graph is derived directly from the knowledge representation (e.g., the phoneme-level network in a Harpy system). The architecture of a machine that exploits this characteristic is presented, together with the results of the simulation of one possible implementation. 15 references.

  15. Real-world speech recognition with neural networks

    Science.gov (United States)

    Barnard, Etienne; Cole, Ronald; Fanty, Mark; Vermeulen, Pieter J. E.

    1995-04-01

    We describe a system based on neural networks that is designed to recognize speech transmitted through the telephone network. Context-dependent phonetic modeling is studied as a method of improving recognition accuracy, and a special training algorithm is introduced to make the training of these nets more manageable. Our system is designed for real-world applications, and we have therefore specialized our implementation for this goal; a pipelined DSP structure and a compact search algorithm are described as examples of this specialization. Preliminary results from a realistic test of the system (a field trial for the U.S. Census Bureau) are reported.

  16. Efficient Speech Recognition by Using Modular Neural Network

    Directory of Open Access Journals (Sweden)

    Dr.R.L.K.Venkateswarlu

    2011-05-01

    Full Text Available The modular approach and the neural network approach are well-known concepts in the research and engineering community. Combined, the Modular Neural Network approach is very effective in searching for solutions to complex problems in various fields. The aim of this study is to distribute the complexity of the ambiguous-word classification task over a set of modules, each of which is a single neural network characterized by a high degree of specialization. The number of interfaces, and therewith the possibilities for filtering in external acoustic-phonetic knowledge, increases in a modular architecture. A Modular Neural Network (MNN) for speech recognition is presented with speaker-dependent single-word recognition in this paper. Using this approach, and taking computational effort into account, the system performance can be assessed. The best performance is obtained for MFCC features when training with Modular Neural Network classifiers, at 99.88%; for LPCC features the best performance is 99.77%. It is found that MFCC performance is superior to LPCC performance when training the speech data with the Modular Neural Network classifier.

  17. Modeling words with subword units in an articulatorily constrained speech recognition algorithm

    Energy Technology Data Exchange (ETDEWEB)

    Hogden, J.

    1997-11-20

    The goal of speech recognition is to find the most probable word given the acoustic evidence, i.e. a string of VQ codes or acoustic features. Speech recognition algorithms typically take advantage of the fact that the probability of a word, given a sequence of VQ codes, can be calculated.
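
    In equation form (a standard restatement of this decision rule, not something taken from the report itself):

        \hat{W} = \arg\max_{W} P(W \mid A) = \arg\max_{W} P(A \mid W)\, P(W)

    where A is the acoustic evidence (e.g., the string of VQ codes), P(A|W) is the acoustic model likelihood, and P(W) the language-model prior; the constant P(A) can be dropped from the maximization.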

  18. On model architecture for a children's speech recognition interactive dialog system

    OpenAIRE

    Kraleva, Radoslava; Kralev, Velin

    2016-01-01

    This report presents a general model of the architecture of information systems for the speech recognition of children. It presents a model of the speech data stream and how it works. The result of these studies and the presented architectural model shows that research needs to be focused on acoustic-phonetic modeling in order to improve the quality of children's speech recognition and the robustness of the systems to noise and changes in the transmission environment. Another important aspe...

  19. Variable Frame Rate and Length Analysis for Data Compression in Distributed Speech Recognition

    DEFF Research Database (Denmark)

    Kraljevski, Ivan; Tan, Zheng-Hua

    2014-01-01

    This paper addresses the issue of data compression in distributed speech recognition on the basis of a variable frame rate and length analysis method. The method first conducts frame selection by using a posteriori signal-to-noise ratio weighted energy distance to find the right time resolution...... length for steady regions. The method is applied to scalable source coding in distributed speech recognition where the target bitrate is met by adjusting the frame rate. Speech recognition results show that the proposed approach outperforms other compression methods in terms of recognition accuracy...
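
    A hedged sketch of the frame-selection idea: accumulate an SNR-weighted energy distance frame by frame, and retain a frame whenever the accumulated distance crosses a threshold, so quiet steady regions are thinned out. The weighting, threshold, and data below are invented stand-ins, not the paper's exact criterion.

        import numpy as np

        def select_frames(log_energy, snr_db, threshold=2.0):
            """Keep a frame once the accumulated SNR-weighted energy distance
            since the last retained frame exceeds `threshold` (toy criterion)."""
            kept, acc, last = [0], 0.0, log_energy[0]
            for t in range(1, len(log_energy)):
                w = max(snr_db[t], 0.0) / 10.0           # a posteriori SNR weight
                acc += w * abs(log_energy[t] - last)
                last = log_energy[t]
                if acc >= threshold:
                    kept.append(t)
                    acc = 0.0
            return kept

        rng = np.random.default_rng(0)
        log_e = np.cumsum(rng.normal(0, 0.5, 100))       # synthetic log-energy track
        snr = rng.uniform(0, 20, 100)                    # synthetic per-frame SNR (dB)
        print(f"kept {len(select_frames(log_e, snr))} of 100 frames")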

  20. Automatic Speech Recognition Using Template Model for Man-Machine Interface

    OpenAIRE

    Mishra, Neema; Shrawankar, Urmila; Thakare, V. M

    2013-01-01

    Speech is a natural form of communication for human beings, and computers with the ability to understand speech and speak with a human voice are expected to contribute to the development of more natural man-machine interfaces. Computers with this kind of ability are gradually becoming a reality, through the evolution of speech recognition technologies. Speech is being an important mode of interaction with computers. In this paper Feature extraction is implemented using well-known Mel-Frequenc...

  1. Segment-based acoustic models for continuous speech recognition

    Science.gov (United States)

    Ostendorf, Mari; Rohlicek, J. R.

    1993-07-01

    This research aims to develop new and more accurate stochastic models for speaker-independent continuous speech recognition, by extending previous work in segment-based modeling and by introducing a new hierarchical approach to representing intra-utterance statistical dependencies. These techniques, which are more costly than traditional approaches because of the large search space associated with higher order models, are made feasible through rescoring a set of HMM-generated N-best sentence hypotheses. We expect these different modeling techniques to result in improved recognition performance over that achieved by current systems, which handle only frame-based observations and assume that these observations are independent given an underlying state sequence. In the fourth quarter of the project, we have completed the following: (1) ported our recognition system to the Wall Street Journal task, a standard task in the ARPA community; (2) developed an initial dependency-tree model of intra-utterance observation correlation; and (3) implemented baseline language model estimation software. Our initial results on the Wall Street Journal task are quite good and represent significantly improved performance over most HMM systems reporting on the Nov. 1992 5k vocabulary test set.

  2. Noise Adaptive Stream Weighting in Audio-Visual Speech Recognition

    Directory of Open Access Journals (Sweden)

    Martin Heckmann

    2002-11-01

    Full Text Available It has been shown that integration of acoustic and visual information, especially in noisy conditions, yields improved speech recognition results. This raises the question of how to weight the two modalities in different noise conditions. Throughout this paper we develop a weighting process adaptive to various background noise situations. In the presented recognition system, audio and video data are combined following a Separate Integration (SI) architecture. A hybrid Artificial Neural Network/Hidden Markov Model (ANN/HMM) system is used for the experiments. The neural networks were in all cases trained on clean data. Firstly, we evaluate the performance of different weighting schemes in a manually controlled recognition task with different types of noise. Next, we compare different criteria to estimate the reliability of the audio stream. Based on this, a mapping between the measurements and the free parameter of the fusion process is derived and its applicability is demonstrated. Finally, the possibilities and limitations of adaptive weighting are compared and discussed.
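
    The usual form of such stream weighting combines per-class log-likelihoods as lambda * log P_audio + (1 - lambda) * log P_video, with lambda tied to an estimate of audio reliability. The SNR-to-lambda mapping below is an invented stand-in for the mapping the paper derives:

        import numpy as np

        def fuse(log_p_audio, log_p_video, lam):
            """Weighted combination of per-class stream log-likelihoods."""
            return lam * log_p_audio + (1.0 - lam) * log_p_video

        def lambda_from_snr(snr_db, lo=-5.0, hi=25.0):
            """Toy reliability mapping: trust audio more as SNR improves."""
            return float(np.clip((snr_db - lo) / (hi - lo), 0.0, 1.0))

        log_pa = np.log([0.2, 0.5, 0.3])     # audio stream class likelihoods
        log_pv = np.log([0.6, 0.3, 0.1])     # video stream class likelihoods
        for snr in (0, 10, 20):
            lam = lambda_from_snr(snr)
            print(snr, "dB -> class", int(np.argmax(fuse(log_pa, log_pv, lam))))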

  3. Automated License Plate Recognition for Toll Booth Application

    OpenAIRE

    Ketan S. Shevale

    2014-01-01

    This paper describes the Smart Vehicle Screening System, which can be installed into a tollbooth for automated recognition of vehicle license plate information using a photograph of a vehicle. An automated system could then be implemented to control the payment of fees, parking areas, highways, bridges or tunnels, etc. An approach is considered to identify a vehicle through recognition of its license plate using image fusion, neural networks and threshold techniques, as wel...

  4. Difficulties in Automatic Speech Recognition of Dysarthric Speakers and Implications for Speech-Based Applications Used by the Elderly: A Literature Review

    Science.gov (United States)

    Young, Victoria; Mihailidis, Alex

    2010-01-01

    Despite their growing presence in home computer applications and various telephony services, commercial automatic speech recognition technologies are still not easily employed by everyone; especially individuals with speech disorders. In addition, relatively little research has been conducted on automatic speech recognition performance with older…

  5. Lexical decoder for continuous speech recognition: sequential neural network approach

    International Nuclear Information System (INIS)

    The work presented in this dissertation concerns the study of a connectionist architecture to treat sequential inputs. In this context, the model proposed by J.L. Elman, a recurrent multilayer network, is used. Its abilities and its limits are evaluated. Modifications are made in order to treat erroneous or noisy sequential inputs and to classify patterns. The application context of this study concerns the realisation of a lexical decoder for analytical multi-speaker continuous speech recognition. Lexical decoding is completed from lattices of phonemes, which are obtained after an acoustic-phonetic decoding stage relying on a K Nearest Neighbors search technique. Tests are done on sentences formed from a lexicon of 20 words. The results obtained show the ability of the proposed connectionist model to take into account the sequentiality at the input level, to memorize the context and to treat noisy or erroneous inputs. (author)

  6. Robust Digital Speech Watermarking For Online Speaker Recognition

    Directory of Open Access Journals (Sweden)

    Mohammad Ali Nematollahi

    2015-01-01

    Full Text Available A robust and blind digital speech watermarking technique has been proposed for online speaker recognition systems, based on the Discrete Wavelet Packet Transform (DWPT) and multiplication to embed the watermark in the amplitudes of the wavelet subbands. In order to minimize the degradation effect of the watermark, the subbands selected are those carrying less speaker-specific information (500 Hz–3500 Hz and 6000 Hz–7000 Hz). Experimental results on the Texas Instruments Massachusetts Institute of Technology (TIMIT), Massachusetts Institute of Technology (MIT), and Mobile Biometry (MOBIO) corpora show that the degradation for speaker verification and identification is 1.16% and 2.52%, respectively. Furthermore, the proposed watermark technique provides sufficient robustness against different signal processing attacks.

  7. Analyzing and Improving Statistical Language Models for Speech Recognition

    CERN Document Server

    Ueberla, J P

    1994-01-01

    In many current speech recognizers, a statistical language model is used to indicate how likely it is that a certain word will be spoken next, given the words recognized so far. How can statistical language models be improved so that more complex speech recognition tasks can be tackled? Since the knowledge of the weaknesses of any theory often makes improving the theory easier, the central idea of this thesis is to analyze the weaknesses of existing statistical language models in order to subsequently improve them. To that end, we formally define a weakness of a statistical language model in terms of the logarithm of the total probability, LTP, a term closely related to the standard perplexity measure used to evaluate statistical language models. We apply our definition of a weakness to a frequently used statistical language model, called a bi-pos model. This results, for example, in a new modeling of unknown words which improves the performance of the model by 14% to 21%. Moreover, one of the identified weak...
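
    LTP and perplexity are mechanical to compute once a language model assigns probabilities. A minimal sketch with an invented bigram table:

        import math

        # Toy bigram probabilities P(w2 | w1); numbers are invented for illustration.
        bigram = {("<s>", "the"): 0.4, ("the", "cat"): 0.2,
                  ("cat", "sat"): 0.3, ("sat", "</s>"): 0.5}

        sentence = ["<s>", "the", "cat", "sat", "</s>"]
        # Log of the total probability of the sentence under the model:
        ltp = sum(math.log(bigram[(w1, w2)]) for w1, w2 in zip(sentence, sentence[1:]))
        n = len(sentence) - 1                    # number of predicted tokens
        perplexity = math.exp(-ltp / n)
        print(f"LTP = {ltp:.3f}, perplexity = {perplexity:.2f}")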

  8. Benefits of spatial hearing to speech recognition in young people with normal hearing

    Institute of Scientific and Technical Information of China (English)

    SONG Peng-long; LI Hui-jun; WANG Ning-yu

    2011-01-01

    Background Many factors interfere with a listener attempting to grasp speech in noisy environments. The spatial hearing by which speech and noise can be spatially separated may play a crucial role in speech recognition in the presence of competing noise. This study aimed to assess whether, and to what degree, spatial hearing benefits speech recognition in young normal-hearing participants in both quiet and noisy environments. Methods Twenty-eight young participants were tested with the Mandarin Hearing In Noise Test (MHINT) in quiet and noisy environments. The assessment method used was characterized by modifications of speech and noise configurations, as well as by changes of speech presentation mode. The benefit of spatial hearing was measured by the speech recognition threshold (SRT) variation between speech condition 1 (SC1) and speech condition 2 (SC2). Results There was no significant difference found in the SRT between SC1 and SC2 in quiet. SRT in SC1 was about 4.2 dB lower than that in SC2, both in speech-shaped and four-babble noise conditions. SRTs measured in both SC1 and SC2 were lower in the speech-shaped noise condition than in the four-babble noise condition. Conclusion Spatial hearing in young normal-hearing participants contributes to speech recognition in noisy environments, but provides no benefit to speech recognition in quiet environments, which may be due to the offset of auditory extrinsic redundancy against the lack of spatial hearing.

  9. A Support Vector Machine-Based Dynamic Network for Visual Speech Recognition Applications

    Directory of Open Access Journals (Sweden)

    Gordan Mihaela

    2002-01-01

    Full Text Available Visual speech recognition is an emerging research field. In this paper, we examine the suitability of support vector machines for visual speech recognition. Each word is modeled as a temporal sequence of visemes corresponding to the different phones realized. One support vector machine is trained to recognize each viseme and its output is converted to a posterior probability through a sigmoidal mapping. To model the temporal character of speech, the support vector machines are integrated as nodes into a Viterbi lattice. We test the performance of the proposed approach on a small visual speech recognition task, namely the recognition of the first four digits in English. The word recognition rate obtained is at the level of the previous best reported rates.
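
    A sigmoidal mapping of this kind is commonly realized as Platt scaling, p(class | f) = 1 / (1 + exp(A f + B)), applied to the SVM decision value f. A minimal sketch with invented A and B (in practice they are fit on held-out data):

        import numpy as np

        def platt_posterior(f, A=-1.5, B=0.0):
            """Sigmoidal mapping from an SVM decision value f to a posterior
            probability (Platt scaling); A and B here are invented."""
            return 1.0 / (1.0 + np.exp(A * f + B))

        for margin in (-2.0, 0.0, 2.0):
            print(margin, "->", round(platt_posterior(margin), 3))
        # These per-viseme posteriors would then serve as node scores
        # in a Viterbi lattice over the viseme sequence.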

  10. Impact of noise and other factors on speech recognition in anaesthesia

    DEFF Research Database (Denmark)

    Alapetite, Alexandre

    2008-01-01

    Introduction: Speech recognition is currently being deployed in medical and anaesthesia applications. This article is part of a project to investigate and further develop a prototype of a speech-input interface in Danish for an electronic anaesthesia patient record, to be used in real time during...... operations. Objective: The aim of the experiment is to evaluate the relative impact of several factors affecting speech recognition when used in operating rooms, such as the type or loudness of background noises, type of microphone, type of recognition mode (free speech versus command mode), and type of...... predominant effect, recognition rates for common noises (e.g. ventilation, alarms) are only slightly below rates obtained in a quiet environment. Finally, a redundant architecture succeeds in improving the reliability of the recognitions. Conclusion: This study removes some uncertainties regarding the...

  11. Recognition of speaker-dependent continuous speech with KEAL

    Science.gov (United States)

    Mercier, G.; Bigorgne, D.; Miclet, L.; Le Guennec, L.; Querre, M.

    1989-04-01

    A description of the speaker-dependent continuous speech recognition system KEAL is given. An unknown utterance is recognized by means of the following procedures: acoustic analysis, phonetic segmentation and identification, word and sentence analysis. The combination of feature-based, speaker-independent coarse phonetic segmentation with speaker-dependent statistical classification techniques is one of the main design features of the acoustic-phonetic decoder. The lexical access component is essentially based on a statistical dynamic programming technique which aims at matching a phonemic lexical entry containing various phonological forms, against a phonetic lattice. Sentence recognition is achieved by use of a context-free grammar and a parsing algorithm derived from Earley's parser. A speaker adaptation module allows some of the system parameters to be adjusted by matching known utterances with their acoustical representation. The task to be performed, described by its vocabulary and its grammar, is given as a parameter of the system. Continuously spoken sentences extracted from a 'pseudo-Logo' language are analyzed and results are presented.

  12. Composite Wavelet Filters for Enhanced Automated Target Recognition

    Science.gov (United States)

    Chiang, Jeffrey N.; Zhang, Yuhan; Lu, Thomas T.; Chao, Tien-Hsin

    2012-01-01

    Automated Target Recognition (ATR) systems aim to automate target detection, recognition, and tracking. The current project applies a JPL ATR system to low-resolution sonar and camera videos taken from unmanned vehicles. These sonar images are inherently noisy and difficult to interpret, and pictures taken underwater are unreliable due to murkiness and inconsistent lighting. The ATR system breaks target recognition into three stages: 1) Videos of both sonar and camera footage are broken into frames and preprocessed to enhance images and detect Regions of Interest (ROIs). 2) Features are extracted from these ROIs in preparation for classification. 3) ROIs are classified as true or false positives using a standard Neural Network based on the extracted features. Several preprocessing, feature extraction, and training methods are tested and discussed in this paper.

  13. Advancing Electromyographic Continuous Speech Recognition: Signal Preprocessing and Modeling

    OpenAIRE

    Wand, Michael

    2014-01-01

    Speech is the natural medium of human communication, but audible speech can be overheard by bystanders and excludes speech-disabled people. This work presents a speech recognizer based on surface electromyography, where electric potentials of the facial muscles are captured by surface electrodes, allowing speech to be processed nonacoustically. A system which was state-of-the-art at the beginning of this thesis is substantially improved in terms of accuracy, flexibility, and robustness.

  14. Speech recognition interference by the temporal and spectral properties of a single competing talker.

    Science.gov (United States)

    Fogerty, Daniel; Xu, Jiaqian

    2016-08-01

    This study investigated how speech recognition during speech-on-speech masking may be impaired due to the interaction between amplitude modulations of the target and competing talker. Young normal-hearing adults were tested in a competing talker paradigm where the target and/or competing talker was processed to primarily preserve amplitude modulation cues. Effects of talker sex and linguistic interference were also examined. Results suggest that performance patterns for natural speech-on-speech conditions are largely consistent with the same masking patterns observed for signals primarily limited to temporal amplitude modulations. However, results also suggest a role for spectral cues in talker segregation and linguistic competition. PMID:27586780

  15. A Weighted Discrete KNN Method for Mandarin Speech and Emotion Recognition

    OpenAIRE

    Pao, Tsang-Long; Liao, Wen-Yuan; Chen, Yu-Te

    2008-01-01

    In this chapter, we present a speech emotion recognition system to compare several classifiers on clean speech and noisy speech. Our proposed WD-KNN classifier outperforms the other three KNN-based classifiers at every SNR level and achieves the highest accuracy from clean speech to 20 dB noisy speech when compared with all other classifiers. Similar to (Neiberg et al., 2006), GMM is a feasible technique for emotion classification on the frame level and the results of GMM are better than perfor...
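
    The abstract does not spell out the weighting scheme, but a distance-weighted KNN of the general WD-KNN kind can be sketched as follows; the inverse-distance weights and toy data are assumptions for illustration:

        import numpy as np
        from collections import defaultdict

        def weighted_knn(train_x, train_y, query, k=5):
            """Distance-weighted KNN: each of the k nearest neighbours votes
            with weight 1/(distance + eps); the authors' exact weights may differ."""
            d = np.linalg.norm(train_x - query, axis=1)
            idx = np.argsort(d)[:k]
            votes = defaultdict(float)
            for i in idx:
                votes[train_y[i]] += 1.0 / (d[i] + 1e-9)
            return max(votes, key=votes.get)

        rng = np.random.default_rng(1)
        x = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(3, 1, (20, 2))])
        y = np.array(["neutral"] * 20 + ["angry"] * 20)
        print(weighted_knn(x, y, np.array([2.5, 2.5])))   # -> "angry"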

  16. Significance of parametric spectral ratio methods in detection and recognition of whispered speech

    Science.gov (United States)

    Mathur, Arpit; Reddy, Shankar M.; Hegde, Rajesh M.

    2012-12-01

    In this article the significance of a new parametric spectral ratio method that can be used to detect whispered speech segments within normally phonated speech is described. Adaptation methods based on the maximum likelihood linear regression (MLLR) are then used to realize a mismatched train-test style speech recognition system. This proposed parametric spectral ratio method computes a ratio spectrum of the linear prediction (LP) and the minimum variance distortion-less response (MVDR) methods. The smoothed ratio spectrum is then used to detect whispered segments of speech within neutral speech segments effectively. The proposed LP-MVDR ratio method exhibits robustness at different SNRs as indicated by the whisper diarization experiments conducted on the CHAINS and the cell phone whispered speech corpus. The proposed method also performs reasonably better than the conventional methods for whisper detection. In order to integrate the proposed whisper detection method into a conventional speech recognition engine with minimal changes, adaptation methods based on the MLLR are used herein. The hidden Markov models corresponding to neutral mode speech are adapted to the whispered mode speech data in the whispered regions as detected by the proposed ratio method. The performance of this method is first evaluated on whispered speech data from the CHAINS corpus. The second set of experiments are conducted on the cell phone corpus of whispered speech. This corpus is collected using a set up that is used commercially for handling public transactions. The proposed whisper speech recognition system exhibits reasonably better performance when compared to several conventional methods. The results shown indicate the possibility of a whispered speech recognition system for cell phone based transactions.

  17. Audio-Visual Speech Recognition Using MPEG-4 Compliant Visual Features

    Directory of Open Access Journals (Sweden)

    Aleksic Petar S

    2002-01-01

    Full Text Available We describe an audio-visual automatic continuous speech recognition system which significantly improves speech recognition performance over a wide range of acoustic noise levels, as well as under clean audio conditions. The system utilizes facial animation parameters (FAPs) supported by the MPEG-4 standard for the visual representation of speech. We also describe a robust and automatic algorithm we have developed to extract FAPs from visual data, which does not require hand labeling or extensive training procedures. Principal component analysis (PCA) was performed on the FAPs in order to decrease the dimensionality of the visual feature vectors, and the derived projection weights were used as visual features in the audio-visual automatic speech recognition (ASR) experiments. Both single-stream and multistream hidden Markov models (HMMs) were used to model the ASR system, integrate audio and visual information, and perform relatively large vocabulary (approximately 1000 words) speech recognition experiments. The experiments performed use clean audio data and audio data corrupted by stationary white Gaussian noise at various SNRs. The proposed system reduces the word error rate (WER) by 20% to 23% relative to audio-only speech recognition WERs at various SNRs (0–30 dB) with additive white Gaussian noise, and by 19% relative to the audio-only speech recognition WER under clean audio conditions.

  18. Semi-automated contour recognition using DICOMautomaton

    Science.gov (United States)

    Clark, H.; Wu, J.; Moiseenko, V.; Lee, R.; Gill, B.; Duzenli, C.; Thomas, S.

    2014-03-01

    Purpose: A system has been developed which recognizes and classifies Digital Imaging and Communication in Medicine contour data with minimal human intervention. It allows researchers to overcome obstacles which tax analysis and mining systems, including inconsistent naming conventions and differences in data age or resolution. Methods: Lexicographic and geometric analysis is used for recognition. Well-known lexicographic methods implemented include Levenshtein-Damerau, bag-of-characters, Double Metaphone, Soundex, and (word and character) N-grams. Geometrical implementations include 3D Fourier Descriptors, probability spheres, boolean overlap, simple feature comparison (e.g. eccentricity, volume) and rule-based techniques. Both analyses implement custom, domain-specific modules (e.g. emphasis differentiating left/right organ variants). Contour labels from 60 head and neck patients are used for cross-validation. Results: Mixed-lexicographical methods show an effective improvement in more than 10% of recognition attempts compared with a pure Levenshtein-Damerau approach when withholding 70% of the lexicon. Domain-specific and geometrical techniques further boost performance. Conclusions: DICOMautomaton allows users to recognize contours semi-automatically. As usage increases and the lexicon is filled with additional structures, performance improves, increasing the overall utility of the system.
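
    Of the lexicographic methods listed, plain Levenshtein distance is the baseline; a minimal implementation, matched against a hypothetical contour-label lexicon (the paper's Levenshtein-Damerau variant additionally allows transpositions):

        def levenshtein(a: str, b: str) -> int:
            """Classic edit distance (insert/delete/substitute)."""
            prev = list(range(len(b) + 1))
            for i, ca in enumerate(a, 1):
                cur = [i]
                for j, cb in enumerate(b, 1):
                    cur.append(min(prev[j] + 1,                  # deletion
                                   cur[j - 1] + 1,               # insertion
                                   prev[j - 1] + (ca != cb)))    # substitution
                prev = cur
            return prev[-1]

        # Matching an inconsistently named contour label against a lexicon:
        lexicon = ["Left Parotid", "Right Parotid", "Spinal Cord"]
        query = "l parotid"
        print(min(lexicon, key=lambda s: levenshtein(query, s.lower())))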

  19. Semi-automated contour recognition using DICOMautomaton

    International Nuclear Information System (INIS)

    Purpose: A system has been developed which recognizes and classifies Digital Imaging and Communication in Medicine contour data with minimal human intervention. It allows researchers to overcome obstacles which tax analysis and mining systems, including inconsistent naming conventions and differences in data age or resolution. Methods: Lexicographic and geometric analysis is used for recognition. Well-known lexicographic methods implemented include Levenshtein-Damerau, bag-of-characters, Double Metaphone, Soundex, and (word and character)-N-grams. Geometrical implementations include 3D Fourier Descriptors, probability spheres, boolean overlap, simple feature comparison (e.g. eccentricity, volume) and rule-based techniques. Both analyses implement custom, domain-specific modules (e.g. emphasis differentiating left/right organ variants). Contour labels from 60 head and neck patients are used for cross-validation. Results: Mixed-lexicographical methods show an effective improvement in more than 10% of recognition attempts compared with a pure Levenshtein-Damerau approach when withholding 70% of the lexicon. Domain-specific and geometrical techniques further boost performance. Conclusions: DICOMautomaton allows users to recognize contours semi-automatically. As usage increases and the lexicon is filled with additional structures, performance improves, increasing the overall utility of the system.

  20. Visual abilities are important for auditory-only speech recognition: evidence from autism spectrum disorder.

    Science.gov (United States)

    Schelinski, Stefanie; Riedel, Philipp; von Kriegstein, Katharina

    2014-12-01

    In auditory-only conditions, for example when we listen to someone on the phone, it is essential to recognize quickly and accurately what is said (speech recognition). Previous studies have shown that speech recognition performance in auditory-only conditions is better if the speaker is known not only by voice, but also by face. Here, we tested the hypothesis that such an improvement in auditory-only speech recognition depends on the ability to lip-read. To test this we recruited a group of adults with autism spectrum disorder (ASD), a condition associated with difficulties in lip-reading, and typically developed controls. All participants were trained to identify six speakers by name and voice. Three speakers were learned by a video showing their face and three others were learned in a matched control condition without face. After training, participants performed an auditory-only speech recognition test that consisted of sentences spoken by the trained speakers. As a control condition, the test also included speaker identity recognition on the same auditory material. The results showed that, in the control group, performance in speech recognition was improved for speakers known by face in comparison to speakers learned in the matched control condition without face. The ASD group lacked such a performance benefit. For the ASD group auditory-only speech recognition was even worse for speakers known by face compared to speakers not known by face. In speaker identity recognition, the ASD group performed worse than the control group independent of whether the speakers were learned with or without face. Two additional visual experiments showed that the ASD group performed worse in lip-reading whereas face identity recognition was within the normal range. The findings support the view that auditory-only communication involves specific visual mechanisms. Further, they indicate that in ASD, speaker-specific dynamic visual information is not available to optimize auditory

  1. A speech recognition system for data collection in precision agriculture

    Science.gov (United States)

    Dux, David Lee

    Agricultural producers have shown interest in collecting detailed, accurate, and meaningful field data through field scouting, but scouting is labor intensive. They use yield monitor attachments to collect weed and other field data while driving equipment. However, distractions from using a keyboard or buttons while driving can lead to driving errors or missed data points. At Purdue University, researchers have developed an ASR system to allow equipment operators to collect georeferenced data while keeping hands and eyes on the machine during harvesting and to ease georeferencing of data collected during scouting. A notebook computer retrieved locations from a GPS unit and displayed and stored data in Excel. A headset microphone with a single earphone collected spoken input while allowing the operator to hear outside sounds. One-, two-, or three-word commands activated appropriate VBA macros. Four speech recognition products were chosen based on hardware requirements and ability to add new terms. After training, speech recognition accuracy was 100% for Kurzweil VoicePlus and Verbex Listen for the 132 vocabulary words tested, during tests walking outdoors or driving an ATV. Scouting tests were performed by carrying the system in a backpack while walking in soybean fields. The system recorded a point or a series of points with each utterance. Boundaries of points showed problem areas in the field and single points marked rocks and field corners. Data were displayed as an Excel chart to show a real-time map as data were collected. The information was later displayed in a GIS over remote sensed field images. Field corners and areas of poor stand matched, with voice data explaining anomalies in the image. The system was tested during soybean harvest by using voice to locate weed patches. A harvester operator with little computer experience marked points by voice when the harvester entered and exited weed patches or areas with poor crop stand. The operator found the

  2. Post-error Correction in Automatic Speech Recognition Using Discourse Information

    Directory of Open Access Journals (Sweden)

    KANG, S.

    2014-05-01

    Full Text Available Overcoming speech recognition errors in the field of human-computer interaction is important in ensuring a consistent user experience. This paper proposes a semantic-oriented post-processing approach for the correction of errors in speech recognition. The novelty of the model proposed here is that it re-ranks the n-best hypotheses of speech recognition based on the user's intention, which is analyzed from previous discourse information, while conventional automatic speech recognition systems focus only on acoustic and language model scores for the current sentence. The proposed model successfully reduces the word error rate and semantic error rate by 3.65% and 8.61%, respectively.
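
    The general shape of such post-processing is to re-rank the n-best list by adding a discourse score to the acoustic and language model scores. The hypotheses, weights, and intention words below are invented for illustration:

        # Each hypothesis: (text, acoustic log-score, LM log-score); numbers invented.
        nbest = [("turn off the light", -120.0, -8.0),
                 ("turn off the fight", -117.0, -9.5),
                 ("turn of the light",  -121.0, -9.5)]

        def discourse_score(text, expected_words):
            """Toy discourse model: reward overlap with words predicted from
            the previous dialogue turns (the paper's model is far richer)."""
            return 2.0 * len(set(text.split()) & expected_words)

        expected = {"light", "off"}              # inferred from prior discourse
        best = max(nbest, key=lambda h: h[1] + h[2] + discourse_score(h[0], expected))
        print(best[0])   # "turn off the light": discourse outweighs the acoustic edge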

  3. Man-system interface based on automatic speech recognition: integration to a virtual control desk

    International Nuclear Information System (INIS)

    This work reports the implementation of a man-system interface based on automatic speech recognition, and its integration into a virtual nuclear power plant control desk. The latter is intended to reproduce a real control desk using virtual reality technology, for operator training and ergonomic evaluation purposes. An automatic speech recognition system was developed to serve as a new interface with users, substituting for computer keyboard and mouse. Users can operate this virtual control desk in front of a computer monitor or a projection screen through spoken commands. The automatic speech recognition interface developed is based on a well-known signal processing technique named cepstral analysis, and on artificial neural networks. The speech recognition interface is described, along with its integration with the virtual control desk, and results are presented. (author)
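
    Cepstral analysis itself is compact: the real cepstrum is the inverse FFT of the log magnitude spectrum, and its low-quefrency coefficients make convenient neural-network inputs. A minimal sketch on a synthetic frame (not the authors' exact front end):

        import numpy as np

        def real_cepstrum(frame):
            """Real cepstrum: inverse FFT of the log magnitude spectrum.
            Low-quefrency bins describe the spectral envelope and are the
            usual inputs to a neural-network command classifier."""
            spectrum = np.abs(np.fft.rfft(frame))
            return np.fft.irfft(np.log(spectrum + 1e-12))

        sr = 8000
        t = np.arange(512) / sr
        frame = np.sin(2 * np.pi * 200 * t) * np.hamming(512)  # stand-in speech frame
        cep = real_cepstrum(frame)
        features = cep[1:14]                     # 13 low-quefrency coefficients
        print(features.shape)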

  4. Employment of Spectral Voicing Information for Speech and Speaker Recognition in Noisy Conditions

    OpenAIRE

    Jančovič, Peter; Köküer, Münevver

    2008-01-01

    This chapter describes our recent research on the representation and modelling of speech signals for automatic speech and speaker recognition in noisy conditions. The chapter consists of three parts. In the first part, we present a novel method for estimation of the voicing information of speech spectra in the presence of noise. The presented method is based on calculating a similarity between the shape of the signal short-term spectrum and the spectrum of the frame-analysis window. It does not re...

  5. Machine learning based sample extraction for automatic speech recognition using dialectal Assamese speech.

    Science.gov (United States)

    Agarwalla, Swapna; Sarma, Kandarpa Kumar

    2016-06-01

    Automatic Speaker Recognition (ASR) and related issues are continuously evolving as inseparable elements of Human Computer Interaction (HCI). With the assimilation of emerging concepts like big data and the Internet of Things (IoT) as extended elements of HCI, ASR techniques are passing through a paradigm shift. Of late, learning-based techniques have started to receive greater attention from research communities related to ASR, owing to the fact that the former possess a natural ability to mimic biological behavior and thereby aid ASR modeling and processing. Current learning-based ASR techniques are evolving further with the incorporation of big data and IoT-like concepts. In this paper, we report certain approaches based on machine learning (ML) for the extraction of relevant samples from a big data space, and apply them to ASR using soft computing techniques for Assamese speech with dialectal variations. A class of ML techniques is considered, comprising the basic Artificial Neural Network (ANN) in feedforward (FF) and Deep Neural Network (DNN) forms, using raw speech, extracted features, and frequency-domain forms. The Multi Layer Perceptron (MLP) is configured with inputs in several forms to learn class information obtained using clustering and manual labeling. DNNs are also used to extract specific sentence types. Initially, relevant samples are selected and assimilated from a large storage. Next, a few conventional methods are used for feature extraction of a few selected types. The features comprise both spectral and prosodic types. These are applied to Recurrent Neural Network (RNN) and Fully Focused Time Delay Neural Network (FFTDNN) structures to evaluate their performance in recognizing mood, dialect, speaker and gender variations in dialectal Assamese speech. The system is tested under several background noise conditions by considering the recognition rates (obtained using confusion matrices and manually) and computation time

  6. An RBFN-based system for speaker-independent speech recognition

    OpenAIRE

    Huliehel, Fakhralden A.

    1995-01-01

    A speaker-independent isolated-word small vocabulary system is developed for applications such as voice-driven menu systems. The design of a cascade of recognition layers is presented. Several feature sets are compared. Phone recognition is performed using a radial basis function network (RBFN). Dynamic time warping (DTW) is used for word recognition. The TIMIT database is used to design and test the automatic speech recognition (ASR) system. Several feature sets using mel-s...
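
    The DTW word-matching layer can be sketched in a few lines; this is the textbook formulation with Euclidean frame distances, not necessarily the feature set or step pattern used in the thesis:

        import numpy as np

        def dtw(a, b):
            """Textbook DTW with symmetric steps; returns the alignment cost
            between two (frames x dims) feature sequences."""
            n, m = len(a), len(b)
            D = np.full((n + 1, m + 1), np.inf)
            D[0, 0] = 0.0
            for i in range(1, n + 1):
                for j in range(1, m + 1):
                    cost = np.linalg.norm(a[i - 1] - b[j - 1])
                    D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
            return D[n, m] / (n + m)             # path-length normalised

        rng = np.random.default_rng(2)
        template = rng.normal(size=(40, 13))     # stored word template (e.g. cepstra)
        utterance = np.repeat(template, 2, axis=0) + 0.05 * rng.normal(size=(80, 13))
        print(round(dtw(utterance, template), 3))  # low cost: same word, slower tempo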

  7. Developing and Evaluating an Oral Skills Training Website Supported by Automatic Speech Recognition Technology

    Science.gov (United States)

    Chen, Howard Hao-Jan

    2011-01-01

    Oral communication ability has become increasingly important to many EFL students. Several commercial software programs based on automatic speech recognition (ASR) technologies are available but their prices are not affordable for many students. This paper will demonstrate how the Microsoft Speech Application Software Development Kit (SASDK), a…

  8. Introduction and Overview of the Vicens-Reddy Speech Recognition System.

    Science.gov (United States)

    Kameny, Iris; Ritea, H.

    The Vicens-Reddy System is unique in the sense that it approaches the problem of speech recognition as a whole, rather than treating particular aspects of the problem as in previous attempts. For example, where earlier systems treated only segmentation of speech into phoneme groups, or detected phonemes in a given context, the Vicens-Reddy System…

  9. Immediate and sustained benefits of a “total” implementation of speech recognition reporting

    OpenAIRE

    Hart, J. L.; McBride, A; Blunt, D.; Gishen, P; STRICKLAND, N.

    2010-01-01

    Speech recognition reporting was introduced in our institution to address the significant delay between report dictation and the appearance of a typed report on the Picture Archiving and Communication System (PACS). We report our experience of a “total” implementation of a speech recognition reporting (SRR) system, which became the sole means of radiology reporting from day 1 of introduction. Prospectively gathered Radiology Information System (RIS) data were examined to determine the monthly...

  10. Influence of GSM speech coding on the performance of text-independent speaker recognition

    OpenAIRE

    Grassi, Sara; Besacier, Laurent; DUFAUX, Alain; Ansorge, Michael; Pellandini, Fausto

    2006-01-01

    We have investigated the influence of GSM speech coding in the performance of a text-independent speaker recognition system based on Gaussian Mixture Models (GMM). The performance degradation due to the utilization of the three GSM speech coders was assessed, using three transcoded databases, obtained by passing the TIMIT through each GSM coder/decoder. The recognition performance was also assessed using the original TIMIT and its 8 kHz downsampled version. Then, different experiments were ca...

  11. A Cognitive Science Reasoning in Recognition of Emotions in Audio-Visual Speech

    OpenAIRE

    Slavova, Velina; Verhelst, Werner; Sahli, Hichem

    2008-01-01

    In this report we summarize the state of the art of speech emotion recognition from the signal processing point of view. On the basis of multi-corpus experiments with machine-learning classifiers, the observation is made that existing approaches for supervised machine learning lead to database-dependent classifiers which cannot be applied to multi-language speech emotion recognition without additional training, because they discriminate the emotion classes following the use...

  12. Automated Robot with Object Recognition and Handling Features

    Directory of Open Access Journals (Sweden)

    Amiraj Dhawan

    2013-06-01

    Full Text Available With the advent of new technologies, every industry is moving towards automation. A large number of jobs in industries, such as manufacturing, are performed repeatedly. These jobs require a lot of human effort. In such cases, there is a need for an automated robot which can perform the repetitive task more efficiently. This paper is about a robot which has object recognition and handling features. The robot will optically recognize the objects and pick and place them as per the hand gestures given by the user. It will have a camera to capture images of the objects and one arm to perform the pick and place function.

  13. Automated License Plate Recognition for Toll Booth Application

    Directory of Open Access Journals (Sweden)

    Ketan S. Shevale

    2014-10-01

    Full Text Available This paper describes the Smart Vehicle Screening System, which can be installed into a tollbooth for automated recognition of vehicle license plate information using a photograph of a vehicle. An automated system could then be implemented to control the payment of fees, parking areas, highways, bridges or tunnels, etc. An approach is considered to identify a vehicle through recognition of its license plate using image fusion, neural networks and threshold techniques, as well as some experimental results to recognize the license plate successfully.

  14. Influence of native and non-native multitalker babble on speech recognition in noise

    Directory of Open Access Journals (Sweden)

    Chandni Jain

    2014-03-01

    Full Text Available The aim of the study was to assess speech recognition in noise using multitalker babble of native and non-native languages at two different signal-to-noise ratios. Speech recognition in noise was assessed in 60 participants (18 to 30 years) with normal hearing sensitivity, having Malayalam or Kannada as their native language. For this purpose, 6- and 10-talker babble were generated in the Kannada and Malayalam languages. Speech recognition was assessed for native listeners of both languages in the presence of native and non-native multitalker babble. Results showed that speech recognition in noise was significantly higher at a 0 dB signal-to-noise ratio (SNR) compared to a -3 dB SNR for both languages. The performance of Kannada listeners was significantly higher in the presence of native (Kannada) babble compared to non-native (Malayalam) babble. However, this was not the same for the Malayalam listeners, who performed equally well with native (Malayalam) as well as non-native (Kannada) babble. The results of the present study highlight the importance of using native multitalker babble for Kannada listeners in lieu of non-native babble, and of considering the importance of each SNR for estimating speech recognition in noise scores. Further research is needed to assess speech recognition in Malayalam listeners in the presence of other non-native backgrounds of various types.

  15. Is Listening in Noise Worth It? The Neurobiology of Speech Recognition in Challenging Listening Conditions.

    Science.gov (United States)

    Eckert, Mark A; Teubner-Rhodes, Susan; Vaden, Kenneth I

    2016-01-01

    This review examines findings from functional neuroimaging studies of speech recognition in noise to provide a neural systems level explanation for the effort and fatigue that can be experienced during speech recognition in challenging listening conditions. Neuroimaging studies of speech recognition consistently demonstrate that challenging listening conditions engage neural systems that are used to monitor and optimize performance across a wide range of tasks. These systems appear to improve speech recognition in younger and older adults, but sustained engagement of these systems also appears to produce an experience of effort and fatigue that may affect the value of communication. When considered in the broader context of the neuroimaging and decision making literature, the speech recognition findings from functional imaging studies indicate that the expected value, or expected level of speech recognition given the difficulty of listening conditions, should be considered when measuring effort and fatigue. The authors propose that the behavioral economics or neuroeconomics of listening can provide a conceptual and experimental framework for understanding effort and fatigue that may have clinical significance. PMID:27355759

  16. Influences of Infant-Directed Speech on Early Word Recognition

    Science.gov (United States)

    Singh, Leher; Nestor, Sarah; Parikh, Chandni; Yull, Ashley

    2009-01-01

    When addressing infants, many adults adopt a particular type of speech, known as infant-directed speech (IDS). IDS is characterized by exaggerated intonation, as well as reduced speech rate, shorter utterance duration, and grammatical simplification. It is commonly asserted that IDS serves in part to facilitate language learning. Although…

  17. Adoption of Speech Recognition Technology in Community Healthcare Nursing.

    Science.gov (United States)

    Al-Masslawi, Dawood; Block, Lori; Ronquillo, Charlene

    2016-01-01

    Adoption of new health information technology is shown to be challenging. However, the degree to which new technology will be adopted can be predicted by measures of usefulness and ease of use. In this work these key determining factors are the focus for the design of a wound documentation tool. In the context of wound care at home, consistent with evidence in the literature from similar settings, the use of Speech Recognition Technology (SRT) for patient documentation has shown promise. To achieve a user-centred design, the results from ethnographic fieldwork are used to inform SRT features; furthermore, exploratory prototyping is used to collect feedback about the wound documentation tool from home care nurses. During this study, measures developed for healthcare applications of the Technology Acceptance Model will be used to identify SRT features that improve usefulness (e.g. increased accuracy, saving time) or ease of use (e.g. lowering mental/physical effort, easy-to-remember tasks). The identified features will be used to create a low-fidelity prototype that will be evaluated in future experiments. PMID:27332294

  18. An analytical approach to photonic reservoir computing - a network of SOA's - for noisy speech recognition

    Science.gov (United States)

    Salehi, Mohammad Reza; Abiri, Ebrahim; Dehyadegari, Louiza

    2013-10-01

    This paper seeks to investigate an approach to photonic reservoir computing for optical speech recognition on an isolated digit recognition task. An analytical approach to photonic reservoir computing is drawn on to decrease time consumption compared to numerical methods, which is very important in processing large signals such as speech. It is also observed that adjusting the reservoir parameters, along with a good nonlinear mapping of the input signal into the reservoir, boosts the recognition accuracy of the analytical approach. Perfect recognition accuracy (i.e. 100%) can be achieved for noiseless speech signals. For noisy signals with signal-to-noise ratios of 0-10 dB, however, the observed accuracy ranged between 92% and 98%. In fact, the photonic reservoir demonstrated a 9-18% improvement compared to classical reservoir networks with hyperbolic tangent nodes.
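
    For readers without a photonics background, the reservoir principle the paper builds on can be shown with a classical echo state network: a fixed random recurrent network maps the input nonlinearly, and only a linear readout is trained. All sizes and data below are invented:

        import numpy as np

        rng = np.random.default_rng(3)
        n_in, n_res = 4, 100
        W_in = rng.uniform(-0.5, 0.5, (n_res, n_in))
        W = rng.normal(0, 1, (n_res, n_res))
        W *= 0.9 / np.max(np.abs(np.linalg.eigvals(W)))  # spectral radius < 1

        def run_reservoir(u_seq):
            """Collect reservoir states for an input sequence."""
            x, states = np.zeros(n_res), []
            for u in u_seq:
                x = np.tanh(W_in @ u + W @ x)            # nonlinear input mapping
                states.append(x.copy())
            return np.array(states)

        # Linear readout trained by ridge regression on the collected states:
        U = rng.normal(size=(200, n_in))                 # stand-in for speech features
        target = (U[:, 0] > 0).astype(float)             # toy per-frame label
        S = run_reservoir(U)
        W_out = np.linalg.solve(S.T @ S + 1e-2 * np.eye(n_res), S.T @ target)
        print("train accuracy:", np.mean(((S @ W_out) > 0.5) == target))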

  19. Noise robust automatic speech recognition with adaptive quantile based noise estimation and speech band emphasizing filter bank

    DEFF Research Database (Denmark)

    Bonde, Casper Stork; Graversen, Carina; Gregersen, Andreas Gregers;

    2005-01-01

    An important topic in Automatic Speech Recognition (ASR) is to reduce the effect of noise, in particular when mismatch exists between the training and application conditions. Many noise robustness schemes within the feature processing domain use as a prerequisite a noise estimate prior to the appearance of the speech signal, which requires noise robust voice activity detection and assumptions of stationary noise. However, both of these requirements are often not met, and it is therefore of particular interest to investigate methods like the Quantile Based Noise Estimation (QBNE) method, which estimates the noise during speech and non-speech sections without the use of a voice activity detector. While the standard QBNE method uses a fixed pre-defined quantile across all frequency bands, this paper suggests adaptive QBNE (AQBNE), which adapts the quantile individually to each frequency band...
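
    The core of QBNE is small enough to sketch: per frequency band, a quantile of the power over time serves as the noise estimate, since speech is absent in most frames of most bands. The fixed q below corresponds to standard QBNE; AQBNE would pick q per band. Data are synthetic:

        import numpy as np

        def qbne(power_spec, q=0.5):
            """Quantile Based Noise Estimation: per frequency band, take the
            q-quantile of the power over time as the noise estimate; no voice
            activity detector is needed."""
            return np.quantile(power_spec, q, axis=1)            # shape: (freq,)

        rng = np.random.default_rng(4)
        noise = rng.exponential(1.0, (129, 300))                 # stationary noise power
        speech = np.zeros((129, 300))
        speech[:, 100:130] = rng.exponential(20.0, (129, 30))    # a burst of speech frames
        est = qbne(noise + speech, q=0.5)
        ref = np.quantile(noise, 0.5, axis=1)                    # quantile of noise alone
        print("median deviation from noise-only floor:",
              round(float(np.median(np.abs(est - ref))), 3))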

  20. Acceptance of speech recognition by physicians: A survey of expectations, experiences, and social influence

    DEFF Research Database (Denmark)

    Alapetite, Alexandre; Andersen, Henning Boje; Hertzum, Morten

    2009-01-01

    The present study has surveyed physician views and attitudes before and after the introduction of speech technology as a front end to an electronic medical record. At the hospital where the survey was made, speech technology recently (2006–2007) replaced traditional dictation and subsequent secretarial transcription for all physicians in clinical departments. The aim of the survey was (i) to identify how attitudes and perceptions among physicians affected the acceptance and success of the speech-recognition system and the new work procedures associated with it; and (ii) to assess the degree to which physicians’ attitudes and expectations to the use of speech technology changed after actually using it. The survey was based on two questionnaires—one administered when the physicians were about to begin training with the speech-recognition system and another, asking similar questions, when they...

  1. Use of intonation contours for speech recognition in noise by cochlear implant recipients.

    Science.gov (United States)

    Meister, Hartmut; Landwehr, Markus; Pyschny, Verena; Grugel, Linda; Walger, Martin

    2011-05-01

    The corruption of intonation contours has detrimental effects on sentence-based speech recognition in normal-hearing listeners [Binns and Culling (2007). J. Acoust. Soc. Am. 122, 1765-1776]. This paper examines whether this finding also applies to cochlear implant (CI) recipients. The subjects' F0-discrimination and speech perception in the presence of noise were measured, using sentences with regular and inverted F0-contours. The results revealed that speech recognition for regular contours was significantly better than for inverted contours. This difference was related to the subjects' F0-discrimination, providing further evidence that the perception of intonation patterns is important for CI-mediated speech recognition in noise. PMID:21568376

  2. Combining Semantic and Acoustic Features for Valence and Arousal Recognition in Speech

    DEFF Research Database (Denmark)

    Karadogan, Seliz; Larsen, Jan

    2012-01-01

    The recognition of affect in speech has attracted a lot of interest recently; especially in the area of cognitive and computer sciences. Most of the previous studies focused on the recognition of basic emotions (such as happiness, sadness and anger) using categorical approach. Recently, the focus...

  3. The Effect of Asymmetrical Signal Degradation on Binaural Speech Recognition in Children and Adults.

    Science.gov (United States)

    Rothpletz, Ann M.; Tharpe, Anne Marie; Grantham, D. Wesley

    2004-01-01

    To determine the effect of asymmetrical signal degradation on binaural speech recognition, 28 children and 14 adults were administered a sentence recognition task amidst multitalker babble. There were 3 listening conditions: (a) monaural, with mild degradation in 1 ear; (b) binaural, with mild degradation in both ears (symmetric degradation); and…

  4. Effects of Semantic Context and Fundamental Frequency Contours on Mandarin Speech Recognition by Second Language Learners

    Science.gov (United States)

    Zhang, Linjun; Li, Yu; Wu, Han; Li, Xin; Shu, Hua; Zhang, Yang; Li, Ping

    2016-01-01

    Speech recognition by second language (L2) learners in optimal and suboptimal conditions has been examined extensively with English as the target language in most previous studies. This study extended existing experimental protocols (Wang et al., 2013) to investigate Mandarin speech recognition by Japanese learners of Mandarin at two different levels (elementary vs. intermediate) of proficiency. The overall results showed that in addition to L2 proficiency, semantic context, F0 contours, and listening condition all affected the recognition performance on the Mandarin sentences. However, the effects of semantic context and F0 contours on L2 speech recognition diverged to some extent. Specifically, there was a significant modulation effect of listening condition on semantic context, indicating that L2 learners made use of semantic context less efficiently in the interfering background than in quiet. In contrast, no significant modulation effect of listening condition on F0 contours was found. Furthermore, there was a significant interaction between semantic context and F0 contours, indicating that semantic context becomes more important for L2 speech recognition when F0 information is degraded. None of these effects were found to be modulated by L2 proficiency. The discrepancy in the effects of semantic context and F0 contours on L2 speech recognition in the interfering background might be related to differences in processing capacities required by the two types of information in adverse listening conditions. PMID:27378997

  5. Speech recognition materials and ceiling effects: considerations for cochlear implant programs.

    Science.gov (United States)

    Gifford, René H; Shallop, Jon K; Peterson, Anna Mary

    2008-01-01

    Cochlear implant recipients have demonstrated remarkable increases in speech perception since US FDA approval was granted in 1984. Improved performance is due to a number of factors including improved cochlear implant technology, evolving speech coding strategies, and individuals with increasingly more residual hearing receiving implants. Despite this evolution, the same recommendations for pre- and postimplant speech recognition testing have been in place for over 10 years in the United States. To determine whether new recommendations are warranted, speech perception performance was assessed for 156 adult, postlingually deafened implant recipients as well as 50 hearing aid users on monosyllabic word recognition (CNC) and sentence recognition in quiet (HINT and AzBio sentences) and in noise (BKB-SIN). Results demonstrated that for HINT sentences in quiet, 28% of the subjects tested achieved maximum performance of 100% correct and that scores did not agree well with monosyllables (CNC) or sentence recognition in noise (BKB-SIN). For a more difficult sentence recognition material (AzBio), only 0.7% of the subjects achieved 100% performance and scores were in much better agreement with monosyllables and sentence recognition in noise. These results suggest that more difficult materials are needed to assess speech perception performance of postimplant patients - and perhaps also for determining implant candidacy. PMID:18212519

  6. Noisy Speech Recognition Based on Integration/Selection of Multiple Noise Suppression Methods Using Noise GMMs

    Science.gov (United States)

    Kitaoka, Norihide; Hamaguchi, Souta; Nakagawa, Seiichi

    To achieve high recognition performance for a wide variety of noises and for a wide range of signal-to-noise ratios, this paper presents methods for the integration of four noise reduction algorithms: spectral subtraction with smoothing of time direction, temporal domain SVD-based speech enhancement, GMM-based speech estimation and KLT-based comb-filtering. We propose two types of combination methods of noise suppression algorithms: selection of the front-end processor and combination of results from multiple recognition processes. Recognition results on the CENSREC-1 task showed the effectiveness of our proposed methods.

  7. Report generation using digital speech recognition in radiology.

    Science.gov (United States)

    Vorbeck, F; Ba-Ssalamah, A; Kettenbach, J; Huebsch, P

    2000-01-01

    The aim of this study was to evaluate whether the use of a digital continuous speech recognition (CSR) in the field of radiology could lead to relevant time savings in generating a report. A CSR system (SP6000, Philips, Eindhoven, The Netherlands) for German was used to transform fluently spoken sentences into text. Two radiologists dictated a total of 450 reports on five radiological topics. Two typists edited those reports by means of conventional typing using a text editor (WinWord 6.0, Microsoft, Redmond, Wash.) installed on an IBM-compatible personal computer (PC). The same reports were generated using the CSR system and the performance of both systems was then evaluated by comparing the time needed to generate the reports and the error rates of both systems. In addition, the error rate of the CSR system and the time needed to create the reports was evaluated. The mean error rate for the CSR system was 5.5%, and the mean error rate for conventional typing was 0.4%. Reports edited with the CSR, on average, were generated 19% faster compared with the conventional text-editing method. However, the amount of error rates and time savings were different and depended on topics, speakers, and typists. Using CSR the maximum time saving achieved was 28% for the topic sonography. The CSR system was never slower, under any circumstances, than conventional typing on a PC. When compared with a conventional manual typing method, the CSR system proved to be useful in a clinical setting and saved time in generating radiological reports. The amount of time saved, however, greatly depended on the performance of the typist, the speaker, and on stored vocabulary provided by the CSR system. PMID:11305581

  8. Report generation using digital speech recognition in radiology

    International Nuclear Information System (INIS)

    The aim of this study was to evaluate whether the use of a digital continuous speech recognition (CSR) in the field of radiology could lead to relevant time savings in generating a report. A CSR system (SP6000, Philips, Eindhoven, The Netherlands) for German was used to transform fluently spoken sentences into text. Two radiologists dictated a total of 450 reports on five radiological topics. Two typists edited those reports by means of conventional typing using a text editor (WinWord 6.0, Microsoft, Redmond, Wash.) installed on an IBM-compatible personal computer (PC). The same reports were generated using the CSR system and the performance of both systems was then evaluated by comparing the time needed to generate the reports and the error rates of both systems. In addition, the error rate of the CSR system and the time needed to create the reports was evaluated. The mean error rate for the CSR system was 5.5 %, and the mean error rate for conventional typing was 0.4 %. Reports edited with the CSR, on average, were generated 19 % faster compared with the conventional text-editing method. However, the amount of error rates and time savings were different and depended on topics, speakers, and typists. Using CSR the maximum time saving achieved was 28 % for the topic sonography. The CSR system was never slower, under any circumstances, than conventional typing on a PC. When compared with a conventional manual typing method, the CSR system proved to be useful in a clinical setting and saved time in generating radiological reports. The amount of time saved, however, greatly depended on the performance of the typist, the speaker, and on stored vocabulary provided by the CSR system. (orig.)

  9. Feature Fusion Algorithm for Multimodal Emotion Recognition from Speech and Facial Expression Signal

    Directory of Open Access Journals (Sweden)

    Han Zhiyan

    2016-01-01

    Full Text Available In order to overcome the limitations of single-mode emotion recognition, this paper describes a novel multimodal emotion recognition algorithm that takes the speech signal and the facial expression signal as its research subjects. First, the speech signal features and facial expression signal features are fused, sample sets are obtained by sampling with replacement, and classifiers are trained with back-propagation neural networks (BPNN). Second, the difference between two classifiers is measured by a double error difference selection strategy. Finally, the final recognition result is obtained by the majority voting rule. Experiments show the method improves the accuracy of emotion recognition by giving full play to the advantages of decision-level fusion and feature-level fusion, bringing the whole fusion process closer to human emotion recognition, with a recognition rate of 90.4%.
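
    The following is an illustrative sketch of the decision-level step described above: bootstrap-sampled training sets, an ensemble of neural-network classifiers, and a majority vote. It is a minimal approximation that uses scikit-learn's MLPClassifier as a stand-in for the authors' BPNN and omits the double error difference selection strategy; all names and data are placeholders.

```python
# Sketch of decision-level fusion by majority voting over an ensemble
# of classifiers trained on bootstrap samples of fused features.
# MLPClassifier stands in for the authors' BPNN (an assumption).
import numpy as np
from sklearn.neural_network import MLPClassifier

def train_ensemble(features, labels, n_classifiers=5, seed=0):
    """Train classifiers on bootstrap samples (sampling with replacement)."""
    rng = np.random.default_rng(seed)
    ensemble = []
    for _ in range(n_classifiers):
        idx = rng.integers(0, len(features), size=len(features))  # bootstrap
        clf = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500)
        clf.fit(features[idx], labels[idx])
        ensemble.append(clf)
    return ensemble

def majority_vote(ensemble, sample):
    """Final decision: the label predicted by most ensemble members."""
    votes = [clf.predict(sample[None, :])[0] for clf in ensemble]
    return max(set(votes), key=votes.count)

# Usage: fused feature vectors (speech + facial) with emotion labels.
X = np.random.rand(200, 20)          # placeholder fused features
y = np.random.randint(0, 4, 200)     # placeholder emotion labels
ensemble = train_ensemble(X, y)
print(majority_vote(ensemble, X[0]))
```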

  10. Speech dereverberation for enhancement and recognition using dynamic features constrained deep neural networks and feature adaptation

    Science.gov (United States)

    Xiao, Xiong; Zhao, Shengkui; Ha Nguyen, Duc Hoang; Zhong, Xionghu; Jones, Douglas L.; Chng, Eng Siong; Li, Haizhou

    2016-01-01

    This paper investigates deep neural network (DNN) based nonlinear feature mapping and statistical linear feature adaptation approaches for reducing reverberation in speech signals. In the nonlinear feature mapping approach, a DNN is trained from a parallel clean/distorted speech corpus to map reverberant and noisy speech coefficients (such as the log magnitude spectrum) to the underlying clean speech coefficients. The constraint imposed by dynamic features (i.e., the time derivatives of the speech coefficients) is used to enhance the smoothness of predicted coefficient trajectories in two ways. One is to obtain the enhanced speech coefficients with a least squares estimation from the coefficients and dynamic features predicted by the DNN. The other is to incorporate the constraint of dynamic features directly into the DNN training process using a sequential cost function. In the linear feature adaptation approach, a sparse linear transform, called the cross transform, is used to transform multiple frames of speech coefficients to a new feature space. The transform is estimated to maximize the likelihood of the transformed coefficients given a model of clean speech coefficients. Unlike the DNN approach, no parallel corpus is used and no assumption on distortion types is made. The two approaches are evaluated on the REVERB Challenge 2014 tasks. Both speech enhancement and automatic speech recognition (ASR) results show that the DNN-based mappings significantly reduce the reverberation in speech and improve both speech quality and ASR performance. For the speech enhancement task, the proposed dynamic feature constraint helps to improve cepstral distance, frequency-weighted segmental signal-to-noise ratio (SNR), and log likelihood ratio metrics while moderately degrading the speech-to-reverberation modulation energy ratio. In addition, the cross transform feature adaptation improves the ASR performance significantly for clean-condition trained acoustic models.
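
    A minimal sketch of the first smoothing strategy follows: given per-frame static and delta coefficients predicted by a DNN, a smooth static trajectory is recovered by least squares. It assumes the delta is computed as (c[t+1] - c[t-1]) / 2, which is one common definition; the paper's exact formulation may differ.

```python
# Recover a smooth static trajectory c from predicted static (y_s) and
# delta (y_d) coefficients by least squares, assuming
# delta_t = (c[t+1] - c[t-1]) / 2. Illustration only, not the paper's
# exact formulation.
import numpy as np

def smooth_trajectory(y_static, y_delta):
    """Least-squares estimate of one coefficient trajectory (length T)."""
    T = len(y_static)
    I = np.eye(T)                       # maps c -> static observations
    D = np.zeros((T, T))                # maps c -> delta observations
    for t in range(T):
        D[t, min(t + 1, T - 1)] += 0.5  # simple edge handling (assumed)
        D[t, max(t - 1, 0)] -= 0.5
    W = np.vstack([I, D])               # stacked observation matrix
    y = np.concatenate([y_static, y_delta])
    c, *_ = np.linalg.lstsq(W, y, rcond=None)
    return c

# Usage: noisy static predictions plus consistent delta predictions.
t = np.linspace(0, 1, 50)
clean = np.sin(2 * np.pi * t)
y_s = clean + 0.1 * np.random.randn(50)
y_d = np.gradient(clean)                # stand-in for predicted deltas
print(smooth_trajectory(y_s, y_d)[:5])
```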

  11. Using the FASST source separation toolbox for noise robust speech recognition

    OpenAIRE

    Ozerov, Alexey; Vincent, Emmanuel

    2011-01-01

    We describe our submission to the 2011 CHiME Speech Separation and Recognition Challenge. Our speech separation algorithm was built using the Flexible Audio Source Separation Toolbox (FASST) we developed recently. This toolbox is an implementation of a general flexible framework based on a library of structured source models that enable the incorporation of prior knowledge about a source separation problem via user-specifiable constraints. We show how to use FASST to develop an efficient spee...

  12. Voice Activity Detector of Wake-Up-Word Speech Recognition System Design on FPGA

    Directory of Open Access Journals (Sweden)

    Veton Z. Këpuska

    2014-12-01

    Full Text Available A typical speech recognition system is push-to-talk operated and requires manual activation. However, for users in hands-busy applications, such movement may be restricted or impossible. One alternative is to use a speech-only interface. The proposed method, called Wake-Up-Word Speech Recognition (WUW-SR), utilizes such an interface: a WUW-SR system allows the user to activate systems (cell phone, computer, etc.) with speech commands alone instead of manual activation. The trend in WUW-SR hardware design is towards implementing a complete system on a single chip intended for various applications. This paper presents an experimental FPGA design and implementation of a novel architecture of a real-time feature extraction processor that includes a Voice Activity Detector (VAD) and feature extraction (MFCC, LPC, and ENH_MFCC). In the WUW-SR system, the recognizer front-end with VAD is located at the terminal, which is typically connected over a data network (e.g., to a server) for remote back-end recognition. The VAD is responsible for segmenting the signal into speech-like and non-speech-like segments; for any given frame, the VAD reports one of two possible states: VAD_ON or VAD_OFF. The back-end is then responsible for scoring the features that are segmented during the VAD_ON stage. The most important characteristic of the presented design is that it should guarantee virtually 100% correct rejection for non-WUW (out-of-vocabulary words, OOV) while maintaining a correct acceptance rate of 99.9% or higher (in-vocabulary words, INV). This requirement sets WUW-SR apart from other speech recognition tasks, because no existing system can guarantee 100% reliability by any measure.
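
    As a software-only illustration of the VAD behaviour (not the paper's FPGA implementation), the sketch below labels each frame VAD_ON or VAD_OFF by comparing its log energy against an assumed noise floor plus a threshold; all constants are illustrative.

```python
# Simplified energy-based VAD: label each frame VAD_ON or VAD_OFF from
# its log energy relative to a crude noise-floor estimate. Thresholds
# and the percentile noise estimate are assumptions for illustration.
import numpy as np

def vad_states(signal, frame_len=256, threshold_db=10.0):
    """Return one of the states VAD_ON / VAD_OFF for each frame."""
    n_frames = len(signal) // frame_len
    frames = signal[:n_frames * frame_len].reshape(n_frames, frame_len)
    energy_db = 10 * np.log10(np.sum(frames ** 2, axis=1) + 1e-12)
    noise_floor = np.percentile(energy_db, 10)   # crude noise estimate
    return ["VAD_ON" if e > noise_floor + threshold_db else "VAD_OFF"
            for e in energy_db]

# Usage: silence with a burst of "speech-like" energy in the middle.
sig = 0.01 * np.random.randn(8000)
sig[3000:5000] += np.sin(np.linspace(0, 300 * np.pi, 2000))
states = vad_states(sig)
print(states.count("VAD_ON"), "of", len(states), "frames are speech-like")
```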

  13. Chinese Speech Recognition Model Based on Activation of the State Feedback Neural Network

    Institute of Scientific and Technical Information of China (English)

    李先志; 孙义和

    2001-01-01

    This paper proposes a simplified novel speech recognition model, the state feedback neural network activation model (SFNNAM), which is developed based on the characteristics of Chinese speech structure. The model assumes that the current state of speech is only a correction of the immediately preceding state. According to the "C-V" (Consonant-Vowel) structure of the Chinese language, a speech segmentation method is also implemented in the SFNNAM model. This model has a definite physical meaning grounded in the structure of the Chinese language and is easily implemented in very large scale integrated circuits (VLSI). In the speech recognition experiment, fewer calculations were needed than in the hidden Markov model (HMM) based algorithm. The recognition rate for Chinese numbers was 93.5% for the first candidate and 99.5% for the first two candidates.

  14. [Research on Barrier-free Home Environment System Based on Speech Recognition].

    Science.gov (United States)

    Zhu, Husheng; Yu, Hongliu; Shi, Ping; Fang, Youfang; Jian, Zhuo

    2015-10-01

    The number of people with physical disabilities is increasing year by year, and the trend of population aging is more and more serious. In order to improve their quality of life, a barrier-free home environment control system for patients with serious disabilities was developed to control home electrical devices with the patient's voice. The control system includes a central control platform, a speech recognition module, a terminal operation module, etc. The system combines speech recognition control technology and wireless information transmission technology with embedded mobile computing technology, and interconnects the lamps, electronic locks, alarms, TV and other electrical devices in the home environment as a whole system through wireless network nodes. The experimental results showed that the speech recognition success rate was more than 84% in the home environment. PMID:26964305

  15. Hybrid Approach for Language Identification Oriented to Multilingual Speech Recognition in the Basque Context

    Science.gov (United States)

    Barroso, N.; de Ipiña, K. López; Ezeiza, A.; Barroso, O.; Susperregi, U.

    The development of Multilingual Large Vocabulary Continuous Speech Recognition systems involves issues such as Language Identification, Acoustic-Phonetic Decoding, Language Modelling, and the development of appropriate Language Resources. Interest in multilingual systems arises because there are three official languages in the Basque Country (Basque, Spanish, and French), and there is much linguistic interaction among them, even though Basque has very different roots from the other two languages. This paper describes the development of a Language Identification (LID) system oriented to robust Multilingual Speech Recognition for the Basque context. The work presents hybrid strategies for LID, based on the selection of system elements by Support Vector Machine and Multilayer Perceptron classifiers and on stochastic methods for speech recognition tasks (Hidden Markov Models and n-grams).

  16. Frequency band-importance functions for auditory and auditory-visual speech recognition

    Science.gov (United States)

    Grant, Ken W.

    2005-04-01

    In many everyday listening environments, speech communication involves the integration of both acoustic and visual speech cues. This is especially true in noisy and reverberant environments where the speech signal is highly degraded, or when the listener has a hearing impairment. Understanding the mechanisms involved in auditory-visual integration is a primary interest of this work. Of particular interest is whether listeners are able to allocate their attention to various frequency regions of the speech signal differently under auditory-visual conditions and auditory-alone conditions. For auditory speech recognition, the most important frequency regions tend to be around 1500-3000 Hz, corresponding roughly to important acoustic cues for place of articulation. The purpose of this study is to determine the most important frequency region under auditory-visual speech conditions. Frequency band-importance functions for auditory and auditory-visual conditions were obtained by having subjects identify speech tokens under conditions where the speech-to-noise ratio of different parts of the speech spectrum is independently and randomly varied on every trial. Point biserial correlations were computed for each separate spectral region and the normalized correlations are interpreted as weights indicating the importance of each region. Relations among frequency-importance functions for auditory and auditory-visual conditions will be discussed.
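
    A sketch of the analysis described above is given below, under the assumption that the trial outcome is coded 0/1 so the point-biserial correlation reduces to a Pearson correlation; clipping negative correlations and normalizing the weights to sum to one are illustrative choices, and all data are synthetic placeholders.

```python
# Band-importance weights from point-biserial correlations between each
# band's per-trial SNR and the binary trial outcome (correct/incorrect).
import numpy as np

def band_importance(band_snrs, correct):
    """band_snrs: (trials, bands) SNR values; correct: (trials,) 0/1."""
    # Point-biserial correlation = Pearson correlation with a 0/1 variable.
    weights = np.array([np.corrcoef(band_snrs[:, b], correct)[0, 1]
                        for b in range(band_snrs.shape[1])])
    weights = np.clip(weights, 0, None)          # keep positive weights
    return weights / weights.sum()               # normalize to sum to 1

# Usage: simulate 1000 trials where band 2 drives intelligibility.
rng = np.random.default_rng(1)
snrs = rng.uniform(-10, 10, size=(1000, 5))
p_correct = 1 / (1 + np.exp(-0.4 * snrs[:, 2]))  # band 2 matters most
outcome = (rng.uniform(size=1000) < p_correct).astype(int)
print(band_importance(snrs, outcome))
```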

  17. Physiologically Motivated Feature Extraction for Robust Automatic Speech Recognition

    OpenAIRE

    Ibrahim Missaoui; Zied Lachiri

    2016-01-01

    In this paper, a new method is presented to extract robust speech features in the presence of external noise. The proposed method, based on two-dimensional Gabor filters, takes into account the spectro-temporal modulation frequencies and also limits the redundancy at the feature level. The performance of the proposed feature extraction method was evaluated on isolated speech words which are extracted from the TIMIT corpus and corrupted by background noise. The evaluation results demonstrate that ...

  18. Correcting Automatic Speech Recognition Errors in Real Time

    OpenAIRE

    Wald, M; Boulain, P; Bell, J.; Doody, K; Gerrard, J

    2007-01-01

    Lectures can be digitally recorded and replayed to provide multimedia revision material for students who attended the class and a substitute learning experience for students unable to attend. Deaf and hard of hearing people can find it difficult to follow speech through hearing alone or to take notes while they are lip-reading or watching a sign-language interpreter. Synchronising the speech with text captions can ensure deaf students are not disadvantaged and assist all learners to search fo...

  19. Automated target recognition and tracking using an optical pattern recognition neural network

    Science.gov (United States)

    Chao, Tien-Hsin

    1991-01-01

    The on-going development of an automatic target recognition and tracking system at the Jet Propulsion Laboratory is presented. This system is an optical pattern recognition neural network (OPRNN) that integrates an innovative optical parallel processor with a feature-extraction-based neural net training algorithm. The parallel optical processor provides high speed and vast parallelism as well as full shift invariance. The neural network algorithm enables simultaneous discrimination of multiple noisy targets in spite of their scales, rotations, perspectives, and various deformations. This fully developed OPRNN system can be effectively utilized for the automated spacecraft recognition and tracking that will lead to success in the Automated Rendezvous and Capture (AR&C) of the unmanned Cargo Transfer Vehicle (CTV). One of the most powerful optical parallel processors for automatic target recognition is the multichannel correlator. With the inherent advantages of parallel processing capability and shift invariance, multiple objects can be simultaneously recognized and tracked using this multichannel correlator. This target tracking capability can be greatly enhanced by utilizing a powerful feature-extraction-based neural network training algorithm such as the neocognitron. The OPRNN, currently under investigation at JPL, is constructed with an optical multichannel correlator where holographic filters have been prepared using the neocognitron training algorithm. The computation speed of the neocognitron-type OPRNN is up to 10^14 analog connections per second, enabling the OPRNN to outperform its state-of-the-art electronic counterpart by at least two orders of magnitude.

  20. Emotional recognition from the speech signal for a virtual education agent

    Science.gov (United States)

    Tickle, A.; Raghu, S.; Elshaw, M.

    2013-06-01

    This paper explores the extraction of features from the speech wave to perform intelligent emotion recognition. A feature extraction tool (openSMILE) was used to obtain a baseline set of 998 acoustic features from a set of emotional speech recordings captured with a microphone. The initial features were reduced to the most important ones so that recognition of emotions using a supervised neural network could be performed. Given that the future use of virtual education agents lies in making the agents more interactive, developing agents with the capability to recognise and adapt to the emotional state of humans is an important step.

  1. Dynamic HMM Model with Estimated Dynamic Property in Continuous Mandarin Speech Recognition

    Institute of Scientific and Technical Information of China (English)

    CHEN Feili; ZHU Jie

    2003-01-01

    A new dynamic HMM (hidden Markov model) is introduced in this paper, which describes the relationship between the dynamic property and the feature space. A method to estimate the dynamic property is discussed, which makes the dynamic HMM much more practical in real-time speech recognition. Experiments on a large-vocabulary continuous Mandarin speech recognition task have shown that the dynamic HMM can achieve about 10% error reduction for both tonal and toneless syllables. The estimated dynamic property can achieve nearly the same (or even better) performance as the extracted dynamic property.

  2. Simultaneous Blind Separation and Recognition of Speech Mixtures Using Two Microphones to Control a Robot Cleaner

    OpenAIRE

    Heungkyu Lee

    2013-01-01

    This paper proposes a method for the simultaneous separation and recognition of speech mixtures in noisy environments using two‐channel based independent vector analysis (IVA) on a home‐robot cleaner. The issues to be considered in our target application are speech recognition at a distance and noise removal to cope with a variety of noises, including TV sounds, air conditioners, babble, and so on, that can occur in a house, where people can utter a voice command to control a robot cleaner at...

  3. Researches of the Electrotechnical Laboratory. No. 955: Speech recognition by description of acoustic characteristic variations

    Science.gov (United States)

    Hayamizu, Satoru

    1993-09-01

    A new speech recognition technique is proposed. This technique systematically describes acoustic characteristic variations using a large-scale speech database, thereby obtaining high recognition accuracy. Rules representing knowledge of acoustic characteristic variations are extracted by observing the actual speech database. A general framework based on maps from the sets of variation factors to the acoustic feature spaces is proposed. A single recognition model is not used for each element of the descriptive units regardless of the states of the variation factors; instead, large-scale, systematically different recognition models are used for different states. A technique to structure the representation of acoustic characteristic variations by clustering recognition models depending on variation factors is proposed. To investigate acoustic characteristic variations for phonetic contexts efficiently, word sets for the read texts of the speech database are selected so that as many three-phoneme sequences as possible are covered in a small number of words. A selection algorithm is proposed in which the first criterion is to maximize the number of different three-phoneme sequences in the word set and the second criterion is to maximize the entropy of the three-phoneme sequences. Read speech data for the word sets were collected and labelled as acoustic-phonetic segments. Experiments on speaker-independent word recognition using this speech database were conducted to show the descriptive effectiveness of the acoustic characteristic variations using networks of acoustic-phonetic segments. The experiments show that recognition errors are reduced. A basic framework for estimating the acoustic characteristics in unknown phonetic contexts using decision trees is also proposed.
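
    A hedged sketch of the first selection criterion follows: greedily pick the word that adds the most previously uncovered three-phoneme sequences. The entropy-based second criterion is omitted, and the toy lexicon is a placeholder.

```python
# Greedy word-set selection maximizing coverage of distinct
# three-phoneme sequences (the first criterion only).
def triphones(phonemes):
    """All three-phoneme sequences in a word's phoneme list."""
    return {tuple(phonemes[i:i + 3]) for i in range(len(phonemes) - 2)}

def select_words(lexicon, max_words):
    """lexicon: dict word -> phoneme list. Greedy triphone coverage."""
    covered, chosen = set(), []
    for _ in range(max_words):
        if not lexicon:
            break
        word = max(lexicon, key=lambda w: len(triphones(lexicon[w]) - covered))
        gain = triphones(lexicon[word]) - covered
        if not gain:
            break                       # nothing new left to cover
        covered |= gain
        chosen.append(word)
        del lexicon[word]
    return chosen, covered

# Usage with a toy phoneme lexicon (placeholder entries).
lexicon = {"kata": list("kata"), "taka": list("taka"), "sake": list("sake")}
words, covered = select_words(dict(lexicon), max_words=2)
print(words, len(covered), "triphones covered")
```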

  4. A HYBRID METHOD FOR AUTOMATIC SPEECH RECOGNITION PERFORMANCE IMPROVEMENT IN REAL WORLD NOISY ENVIRONMENT

    Directory of Open Access Journals (Sweden)

    Urmila Shrawankar

    2013-01-01

    Full Text Available It is a well-known fact that speech recognition systems perform well when used in conditions similar to those used to train the acoustic models; mismatches degrade performance. In adverse environments it is very difficult to predict the category of real-world environmental noise in advance, and difficult to achieve environmental robustness. Rigorous experimental study shows that no single method is available that will both clean noisy speech corrupted by real, natural environmental (mixed) noise and preserve its quality. It is also observed that back-end techniques alone are not sufficient to improve the performance of a speech recognition system; performance improvement techniques must be implemented at every step of the back-end as well as the front-end of the Automatic Speech Recognition (ASR) model. Current recognition systems address this problem using a technique called adaptation. This study presents an experimental study with two aims. The first is to implement a hybrid method that cleans the speech signal as much as possible using combinations of filters and enhancement techniques. The second is to develop a method for training on all categories of noise that can adapt the acoustic models to a new environment, helping to improve the performance of the speech recognizer under real-world environmental mismatched conditions. The experiments confirm that hybrid adaptation methods improve ASR performance on both levels: signal-to-noise ratio (SNR) improvement as well as word recognition accuracy in real-world noisy environments.

  5. Physiologically Motivated Feature Extraction for Robust Automatic Speech Recognition

    Directory of Open Access Journals (Sweden)

    Ibrahim Missaoui

    2016-04-01

    Full Text Available In this paper, a new method is presented to extract robust speech features in the presence of external noise. The proposed method, based on two-dimensional Gabor filters, takes into account the spectro-temporal modulation frequencies and also limits the redundancy at the feature level. The performance of the proposed feature extraction method was evaluated on isolated speech words extracted from the TIMIT corpus and corrupted by background noise. The evaluation results demonstrate that the proposed feature extraction method outperforms classic methods such as Perceptual Linear Prediction, Linear Predictive Coding, Linear Prediction Cepstral Coefficients and Mel Frequency Cepstral Coefficients.
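
    A minimal sketch of the kind of two-dimensional Gabor filter such a method applies to a spectrogram is shown below; the kernel size, modulation frequencies, and envelope width are illustrative assumptions, not the paper's parameter set.

```python
# 2D Gabor kernel: a complex sinusoid with chosen temporal and spectral
# modulation frequencies, windowed by a Gaussian envelope.
import numpy as np
from scipy.signal import convolve2d

def gabor_2d(size=11, omega_t=0.3, omega_f=0.2, sigma=2.5):
    """2D Gabor kernel with temporal/spectral modulation frequencies."""
    half = size // 2
    t, f = np.meshgrid(np.arange(-half, half + 1),
                       np.arange(-half, half + 1))
    envelope = np.exp(-(t ** 2 + f ** 2) / (2 * sigma ** 2))
    carrier = np.exp(1j * (omega_t * t + omega_f * f))
    return envelope * carrier

# Usage: filter a (toy) log-mel spectrogram and keep the magnitude as
# a spectro-temporal feature map.
spec = np.random.rand(40, 100)                  # bands x frames placeholder
feat = np.abs(convolve2d(spec, gabor_2d().real, mode="same"))
print(feat.shape)
```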

  6. Speech recognition for the anaesthesia record during crisis scenarios

    DEFF Research Database (Denmark)

    Alapetite, Alexandre

    2008-01-01

    Introduction: This article describes the evaluation of a prototype speech-input interface to an anaesthesia patient record, conducted in a full-scale anaesthesia simulator involving six doctor-nurse anaesthetist teams. Objective: The aims of the experiment were, first, to assess the potential...... by a keyword; combination of command and free text modes); finally, to quantify some of the gains that could be provided by the speech input modality. Methods: Six anaesthesia teams composed of one doctor and one nurse were each confronted with two crisis scenarios in a full-scale anaesthesia...

  7. Automated recognition system for ELM classification in JET

    International Nuclear Information System (INIS)

    Edge localized modes (ELMs) are instabilities occurring in the edge of H-mode plasmas. Considerable efforts are being devoted to understanding the physics behind this non-linear phenomenon. A first characterization of ELMs is usually their identification as type I or type III. An automated pattern recognition system has been developed in JET for off-line ELM recognition and classification. The empirical method presented in this paper analyzes each individual ELM instead of starting from a temporal segment containing many ELM bursts. The ELM recognition and isolation is carried out using three signals: Dα, line integrated electron density and stored diamagnetic energy. A reduced set of characteristics (such as diamagnetic energy drop, ELM period or Dα shape) has been extracted to build supervised and unsupervised learning systems for classification purposes. The former are based on support vector machines (SVM). The latter have been developed with hierarchical and K-means clustering methods. The success rate of the classification systems is about 98% for a database of almost 300 ELMs.

  8. Predictors of aided speech recognition, with and without frequency compression, in older adults.

    OpenAIRE

    Ellis, Rachel J.; Munro, Kevin J.

    2015-01-01

    OBJECTIVE: The aim was to investigate whether cognitive and/or audiological measures predict aided speech recognition, both with and without frequency compression (FC). DESIGN: Participants wore hearing aids, with and without FC for a total of 12 weeks (six weeks in each signal processing condition, ABA design). Performance on a sentence-in-noise recognition test was assessed at the end of each six-week period. Audiological (severity of high frequency hearing loss, presence of dead regions) a...

  9. Implementation of Speech Recognition in Web Application for Sub Continental Language

    OpenAIRE

    Dilip Kumar; Abhishek Sachan; Malay Kumar

    2014-01-01

    Speech recognition is the ability of a mechanism in a web application to identify voice instructions that agree with patterns stored in a glossary. Mainly two concepts are summarized in this paper: conversion of Hindi voice into text, and searching that text in a web application such as Google. Currently, Hidden Markov Models (HMMs) are used for Hindi voice recognition, and the associated toolkit is used to identify the Hindi language. The system recognizes isolated words using an acoustic word model. The system is t...

  10. Implementation of a Tour Guide Robot System Using RFID Technology and Viterbi Algorithm-Based HMM for Speech Recognition

    Directory of Open Access Journals (Sweden)

    Neng-Sheng Pai

    2014-01-01

    Full Text Available This paper applied speech recognition and RFID technologies to develop an omni-directional mobile robot into a robot with voice control and guide introduction functions. For speech recognition, the speech signals were captured by short-time processing. The speaker first recorded isolated words for the robot to create a speech database of specific speakers. After pre-processing of this speech database, the feature parameters of cepstrum and delta-cepstrum were obtained using linear predictive coefficients (LPC). Then, the Hidden Markov Model (HMM) was used for model training on the speech database, and the Viterbi algorithm was used to find an optimal state sequence as the reference sample for speech recognition. The trained reference models were put into the industrial computer on the robot platform, and the user entered the isolated words to be tested. After processing the test utterance with the same front-end and comparing it against the reference models, the path of maximum total probability among the models, found using the Viterbi algorithm, was taken as the recognition result. Finally, the speech recognition and RFID systems were tested in an actual environment to prove their feasibility and stability, and were implemented on the omni-directional mobile robot.
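
    The sketch below is an illustrative log-domain Viterbi decoder of the kind described above: it returns the optimal state sequence and its score, so a recognizer can pick the word model whose best path has the maximum total probability. The toy model parameters are placeholders.

```python
# Log-domain Viterbi decoding for an HMM: find the best state sequence
# and its total log probability.
import numpy as np

def viterbi(log_A, log_pi, log_B):
    """log_A: (S,S) transition log-probs, log_pi: (S,) initial log-probs,
    log_B: (T,S) per-frame state log-likelihoods."""
    T, S = log_B.shape
    delta = log_pi + log_B[0]
    backptr = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + log_A          # all state-to-state moves
        backptr[t] = np.argmax(scores, axis=0)
        delta = scores[backptr[t], np.arange(S)] + log_B[t]
    path = [int(np.argmax(delta))]
    for t in range(T - 1, 0, -1):                # trace best path backwards
        path.append(backptr[t, path[-1]])
    return path[::-1], float(np.max(delta))

# Usage: 2-state toy model over 4 frames.
log_A = np.log([[0.7, 0.3], [0.4, 0.6]])
log_pi = np.log([0.6, 0.4])
log_B = np.log(np.random.rand(4, 2))
print(viterbi(log_A, log_pi, log_B))
```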

  11. Temporal acuity and speech recognition score in noise in patients with multiple sclerosis

    Directory of Open Access Journals (Sweden)

    Mehri Maleki

    2014-04-01

    Full Text Available Background and Aim: Multiple sclerosis (MS) is a central nervous system disease that can be associated with a variety of symptoms, including hearing disorders. The main consequence of hearing loss is poor speech perception, and temporal acuity has an important role in speech perception. We evaluated speech perception in silence and in the presence of noise, as well as temporal acuity, in patients with multiple sclerosis. Methods: Eighteen adults with multiple sclerosis with a mean age of 37.28 years and 18 age- and sex-matched controls with a mean age of 38.00 years participated in this study. Temporal acuity and speech perception were evaluated by the random gap detection test (GDT) and the word recognition score (WRS) at three different signal-to-noise ratios. Results: Statistical analysis of test results revealed significant differences between the two groups (p<0.05). Analysis of the gap detection test (at 4 sensation levels) and the word recognition score in both groups showed significant differences (p<0.001). Conclusion: According to this survey, the ability of patients with multiple sclerosis to process temporal features of a stimulus is impaired. It seems that this impairment is an important factor in decreased word recognition scores and speech perception.

  12. ANALYSIS OF MULTIMODAL FUSION TECHNIQUES FOR AUDIO-VISUAL SPEECH RECOGNITION

    Directory of Open Access Journals (Sweden)

    D.V. Ivanko

    2016-05-01

    Full Text Available The paper presents an analytical review covering the latest achievements in the field of audio-visual (AV) fusion (integration) of multimodal information. We discuss the main challenges and report on approaches to address them. One of the most important tasks of AV integration is to understand how the modalities interact and influence each other. The paper addresses this problem in the context of AV speech processing and speech recognition. In the first part of the review we set out the basic principles of AV speech recognition and give a classification of audio and visual speech features. Special attention is paid to the systematization of existing techniques and AV data fusion methods. In the second part we provide a consolidated list of tasks and applications that use AV fusion, based on our analysis of the research area. We also indicate the methods, techniques, and audio and video features used. We propose a classification of AV integration and discuss the advantages and disadvantages of different approaches. We draw conclusions and offer our assessment of the future of the field of AV fusion. In further research we plan to implement a system of audio-visual Russian continuous speech recognition using advanced methods of multimodal fusion.

  13. Evaluating Automatic Speech Recognition-Based Language Learning Systems: A Case Study

    Science.gov (United States)

    van Doremalen, Joost; Boves, Lou; Colpaert, Jozef; Cucchiarini, Catia; Strik, Helmer

    2016-01-01

    The purpose of this research was to evaluate a prototype of an automatic speech recognition (ASR)-based language learning system that provides feedback on different aspects of speaking performance (pronunciation, morphology and syntax) to students of Dutch as a second language. We carried out usability reviews, expert reviews and user tests to…

  14. Fusing Eye-gaze and Speech Recognition for Tracking in an Automatic Reading Tutor

    DEFF Research Database (Denmark)

    Rasmussen, Morten Højfeldt; Tan, Zheng-Hua

    2013-01-01

    In this paper we present a novel approach for automatically tracking the reading progress using a combination of eye-gaze tracking and speech recognition. The two are fused by first generating word probabilities based on eye-gaze information and then using these probabilities to augment the...

  15. ISOLATED SPEECH RECOGNITION SYSTEM FOR TAMIL LANGUAGE USING STATISTICAL PATTERN MATCHING AND MACHINE LEARNING TECHNIQUES

    Directory of Open Access Journals (Sweden)

    VIMALA C.

    2015-05-01

    Full Text Available In recent years, speech technology has become a vital part of our daily lives. Various techniques have been proposed for developing Automatic Speech Recognition (ASR) systems and have achieved great success in many applications. Among them, template matching techniques like Dynamic Time Warping (DTW), statistical pattern matching techniques such as Hidden Markov Models (HMM) and Gaussian Mixture Models (GMM), and machine learning techniques such as Neural Networks (NN), Support Vector Machines (SVM), and Decision Trees (DT) are most popular. The main objective of this paper is to design and develop a speaker-independent isolated speech recognition system for the Tamil language using the above speech recognition techniques. The background of ASR systems, the steps involved in ASR, the merits and demerits of the conventional and machine learning algorithms, and the observations made based on the experiments are presented in this paper. For the developed system, the highest word recognition accuracy was achieved with the HMM technique: it offered 100% accuracy during the training process and 97.92% during the testing process.
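
    For concreteness, a compact sketch of the DTW template-matching technique named above is given below; it aligns two feature sequences of different lengths and returns the cumulative alignment cost. The feature dimensions and sequences are placeholders.

```python
# Dynamic Time Warping: cumulative cost of the best alignment between
# two feature sequences (e.g., MFCC frames of a test word vs. template).
import numpy as np

def dtw_distance(seq_a, seq_b):
    """seq_a: (Ta, d), seq_b: (Tb, d) feature sequences."""
    Ta, Tb = len(seq_a), len(seq_b)
    D = np.full((Ta + 1, Tb + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, Ta + 1):
        for j in range(1, Tb + 1):
            cost = np.linalg.norm(seq_a[i - 1] - seq_b[j - 1])
            D[i, j] = cost + min(D[i - 1, j],      # insertion
                                 D[i, j - 1],      # deletion
                                 D[i - 1, j - 1])  # match
    return D[Ta, Tb]

# Usage: compare a test utterance against a stored word template.
template = np.random.rand(30, 13)   # e.g. 13-dim MFCC frames
test = np.random.rand(36, 13)
print(dtw_distance(test, template))
```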

  16. Phonotactics Constraints and the Spoken Word Recognition of Chinese Words in Speech

    Science.gov (United States)

    Yip, Michael C.

    2016-01-01

    Two word-spotting experiments were conducted to examine the question of whether native Cantonese listeners are constrained by phonotactics information in spoken word recognition of Chinese words in speech. Because no legal consonant clusters occurred within an individual Chinese word, this kind of categorical phonotactics information of Chinese…

  17. Speech-based recognition of self-reported and observed emotion in a dimensional space

    NARCIS (Netherlands)

    Truong, Khiet P.; Leeuwen, van David A.; Jong, de Franciska M.G.

    2012-01-01

    The differences between self-reported and observed emotion have only marginally been investigated in the context of speech-based automatic emotion recognition. We address this issue by comparing self-reported emotion ratings to observed emotion ratings and look at how differences between these two t

  18. The Affordance of Speech Recognition Technology for EFL Learning in an Elementary School Setting

    Science.gov (United States)

    Liaw, Meei-Ling

    2014-01-01

    This study examined the use of speech recognition (SR) technology to support a group of elementary school children's learning of English as a foreign language (EFL). SR technology has been used in various language learning contexts. Its application to EFL teaching and learning is still relatively recent, but a solid understanding of its…

  19. Comparative Evaluation of Three Continuous Speech Recognition Software Packages in the Generation of Medical Reports

    OpenAIRE

    Devine, Eric G.; Gaehde, Stephan A.; Curtis, Arthur C.

    2000-01-01

    Objective: To compare out-of-box performance of three commercially available continuous speech recognition software packages: IBM ViaVoice 98 with General Medicine Vocabulary; Dragon Systems NaturallySpeaking Medical Suite, version 3.0; and L&H Voice Xpress for Medicine, General Medicine Edition, version 1.2.

  20. Assessment of Severe Apnoea through Voice Analysis, Automatic Speech, and Speaker Recognition Techniques

    Science.gov (United States)

    Fernández Pozo, Rubén; Blanco Murillo, Jose Luis; Hernández Gómez, Luis; López Gonzalo, Eduardo; Alcázar Ramírez, José; Toledano, Doroteo T.

    2009-12-01

    This study is part of an ongoing collaborative effort between the medical and the signal processing communities to promote research on applying standard Automatic Speech Recognition (ASR) techniques for the automatic diagnosis of patients with severe obstructive sleep apnoea (OSA). Early detection of severe apnoea cases is important so that patients can receive early treatment. Effective ASR-based detection could dramatically cut medical testing time. Working with a carefully designed speech database of healthy and apnoea subjects, we describe an acoustic search for distinctive apnoea voice characteristics. We also study abnormal nasalization in OSA patients by modelling vowels in nasal and nonnasal phonetic contexts using Gaussian Mixture Model (GMM) pattern recognition on speech spectra. Finally, we present experimental findings regarding the discriminative power of GMMs applied to severe apnoea detection. We have achieved an 81% correct classification rate, which is very promising and underpins the interest in this line of inquiry.

  1. Automated Recognition of 3D Features in GPIR Images

    Science.gov (United States)

    Park, Han; Stough, Timothy; Fijany, Amir

    2007-01-01

    A method of automated recognition of three-dimensional (3D) features in images generated by ground-penetrating imaging radar (GPIR) is undergoing development. GPIR 3D images can be analyzed to detect and identify such subsurface features as pipes and other utility conduits. Until now, much of the analysis of GPIR images has been performed manually by expert operators who must visually identify and track each feature. The present method is intended to satisfy a need for more efficient and accurate analysis by means of algorithms that can automatically identify and track subsurface features, with minimal supervision by human operators. In this method, data from multiple sources (for example, data on different features extracted by different algorithms) are fused together for identifying subsurface objects. The algorithms of this method can be classified in several different ways. In one classification, the algorithms fall into three classes: (1) image-processing algorithms, (2) feature- extraction algorithms, and (3) a multiaxis data-fusion/pattern-recognition algorithm that includes a combination of machine-learning, pattern-recognition, and object-linking algorithms. The image-processing class includes preprocessing algorithms for reducing noise and enhancing target features for pattern recognition. The feature-extraction algorithms operate on preprocessed data to extract such specific features in images as two-dimensional (2D) slices of a pipe. Then the multiaxis data-fusion/ pattern-recognition algorithm identifies, classifies, and reconstructs 3D objects from the extracted features. In this process, multiple 2D features extracted by use of different algorithms and representing views along different directions are used to identify and reconstruct 3D objects. In object linking, which is an essential part of this process, features identified in successive 2D slices and located within a threshold radius of identical features in adjacent slices are linked in a

  2. Post-Editing Error Correction Algorithm for Speech Recognition using Bing Spelling Suggestion

    CERN Document Server

    Bassil, Youssef

    2012-01-01

    ASR, short for Automatic Speech Recognition, is the process of converting spoken speech into text that can be manipulated by a computer. Although ASR has several applications, it is still erroneous and imprecise, especially when used in harsh surroundings where the input speech is of low quality. This paper proposes a post-editing ASR error correction method and algorithm based on Bing's online spelling suggestion. In this approach, the ASR-recognized output text is spell-checked using Bing's spelling suggestion technology to detect and correct misrecognized words. More specifically, the proposed algorithm breaks down the ASR output text into several word-tokens that are submitted as search queries to the Bing search engine. A returned spelling suggestion implies that a query is misspelled, and thus it is replaced by the suggested correction; otherwise, no correction is performed and the algorithm continues with the next token until all tokens are validated. Experiments carried out on various speeches in differen...
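
    A schematic sketch of this post-editing loop follows. The function get_spelling_suggestion is a hypothetical stand-in for a call to an online spelling-suggestion service (the paper uses Bing); no real API signature is implied, and the toy correction table is a placeholder.

```python
# Post-editing loop: tokenize the ASR output, query a spelling
# suggestion service per token, and substitute any returned correction.
def get_spelling_suggestion(token: str):
    """Placeholder for querying a spelling-suggestion service.
    Returns a corrected string, or None when the token looks fine."""
    toy_corrections = {"recogntion": "recognition", "speach": "speech"}
    return toy_corrections.get(token)

def post_edit(asr_output: str) -> str:
    corrected = []
    for token in asr_output.split():       # break text into word-tokens
        suggestion = get_spelling_suggestion(token)
        # Replace the token only when a suggestion is returned.
        corrected.append(suggestion if suggestion else token)
    return " ".join(corrected)

print(post_edit("automatic speach recogntion output"))
# -> "automatic speech recognition output"
```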

  3. Audio-Visual Tibetan Speech Recognition Based on a Deep Dynamic Bayesian Network for Natural Human Robot Interaction

    Directory of Open Access Journals (Sweden)

    Yue Zhao

    2012-12-01

    Full Text Available Audio-visual speech recognition is a natural and robust approach to improving human-robot interaction in noisy environments. Although multi-stream Dynamic Bayesian Networks and coupled HMMs are widely used for audio-visual speech recognition, they fail to learn the shared features between modalities and ignore the dependency of features among the frames within each discrete state. In this paper, we propose a Deep Dynamic Bayesian Network (DDBN) to perform unsupervised extraction of spatial-temporal multimodal features from Tibetan audio-visual speech data and build an accurate audio-visual speech recognition model without a frame-independence assumption. The experimental results on Tibetan speech data from real-world environments show that the proposed DDBN outperforms state-of-the-art methods in word recognition accuracy.

  4. Integrating Stress Information in Large Vocabulary Continuous Speech Recognition

    OpenAIRE

    Ludusan, Bogdan; Ziegler, Stefan; Gravier, Guillaume

    2012-01-01

    In this paper we propose a novel method for integrating stress information in the decoding step of a speech recognizer. A multiscale rhythm model was used to determine the stress scores for each syllable, which are further used to reinforce paths during search. Two strategies for integrating the stress were employed: the first one reinforces paths through all the syllables with a value proportional to the their stress score, while the second one enhances paths passing only through stressed sy...

  5. High-level Approaches to Confidence Estimation in Speech Recognition

    OpenAIRE

    S. J. Cox; Dasmahapatra, S.

    2002-01-01

    We describe some high-level approaches to estimating confidence scores for the words output by a speech recognizer. By "high-level" we mean that the proposed measures do not rely on decoder-specific "side information" and so should find more general applicability than measures that have been developed for specific recognizers. Our main approach is to attempt to decouple the language modeling and acoustic modeling in the recognizer in order to generate independent information from these two sour...

  6. Iconic Gestures for Robot Avatars, Recognition and Integration with Speech.

    Science.gov (United States)

    Bremner, Paul; Leonards, Ute

    2016-01-01

    Co-verbal gestures are an important part of human communication, improving its efficiency and efficacy for information conveyance. One possible means by which such multi-modal communication might be realized remotely is through the use of a tele-operated humanoid robot avatar. Such avatars have been previously shown to enhance social presence and operator salience. We present a motion tracking based tele-operation system for the NAO robot platform that allows direct transmission of speech and gestures produced by the operator. To assess the capabilities of this system for transmitting multi-modal communication, we have conducted a user study that investigated if robot-produced iconic gestures are comprehensible, and are integrated with speech. Robot performed gesture outcomes were compared directly to those for gestures produced by a human actor, using a within participant experimental design. We show that iconic gestures produced by a tele-operated robot are understood by participants when presented alone, almost as well as when produced by a human. More importantly, we show that gestures are integrated with speech when presented as part of a multi-modal communication equally well for human and robot performances. PMID:26925010

  7. Iconic Gestures for Robot Avatars, Recognition and Integration with Speech

    Directory of Open Access Journals (Sweden)

    Paul Adam Bremner

    2016-02-01

    Full Text Available Co-verbal gestures are an important part of human communication, improving its efficiency and efficacy for information conveyance. One possible means by which such multi-modal communication might be realised remotely is through the use of a tele-operated humanoid robot avatar. Such avatars have been previously shown to enhance social presence and operator salience. We present a motion tracking based tele-operation system for the NAO robot platform that allows direct transmission of speech and gestures produced by the operator. To assess the capabilities of this system for transmitting multi-modal communication, we have conducted a user study that investigated if robot-produced iconic gestures are comprehensible, and are integrated with speech. Robot performed gesture outcomes were compared directly to those for gestures produced by a human actor, using a within participant experimental design. We show that iconic gestures produced by a tele-operated robot are understood by participants when presented alone, almost as well as when produced by a human. More importantly, we show that gestures are integrated with speech when presented as part of a multi-modal communication equally well for human and robot performances.

  8. Application of neural networks to speech recognition. Onsei ninshiki eno neural net oyo

    Energy Technology Data Exchange (ETDEWEB)

    Nitta, T.; Masai, Y. (Toshiba Corp., Tokyo (Japan))

    1991-12-01

    The application of neural networks to speech recognition is reviewed; it has produced good results, in particular in the fields of phoneme recognition and small-vocabulary spoken word recognition. Because the implementation of a training process is unnecessary in speaker-independent speech recognizers, a neural-based speech recognition board can easily be realized with a digital signal processor (DSP). Since a standard multilayer neural network cannot incorporate large pattern variability on the time axis, a hybrid algorithm composed of a neural network and a hidden Markov model (HMM) is expected to be the most promising solution. Speaker-independent large-vocabulary spoken word recognition is given as an example, in which phonemes are recognized by the neural network and the phoneme series is matched against that of each word by the HMM. In addition, since a recurrent neural network can represent transitions between states by itself, the same process as the HMM is expected to be realizable with such a network. 16 refs., 3 figs.

  9. Speech variability effects on recognition accuracy associated with concurrent task performance by pilots

    Science.gov (United States)

    Simpson, C. A.

    1985-01-01

    In the present study of the responses of pairs of pilots to aircraft warning classification tasks using an isolated word, speaker-dependent speech recognition system, the induced stress was manipulated by means of different scoring procedures for the classification task and by the inclusion of a competitive manual control task. Both speech patterns and recognition accuracy were analyzed, and recognition errors were recorded by type for an isolated word speaker-dependent system and by an offline technique for a connected word speaker-dependent system. While errors increased with task loading for the isolated word system, there was no such effect for task loading in the case of the connected word system.

  10. Adaptive Recognition of Phonemes from Speaker - Connected-Speech Using Alisa.

    Science.gov (United States)

    Osella, Stephen Albert

    The purpose of this dissertation research is to investigate a novel approach to automatic speech recognition (ASR). The successes that have been achieved in ASR have relied heavily on the use of a language grammar, which significantly constrains the ASR process. By using grammar to provide most of the recognition ability, the ASR system does not have to be as accurate at the low-level recognition stage. The ALISA Phonetic Transcriber (APT) algorithm is proposed as a way to improve ASR by enhancing the lowest-level recognition stage. The objective of the APT algorithm is to classify speech frames (short sequences of speech signal samples) into a small set of phoneme classes. The APT algorithm constructs the mapping from speech frames to phoneme labels through a multi-layer feedforward process. A design principle of APT is that final decisions are delayed as long as possible. Instead of attempting to optimize the decision making at each processing level individually, each level generates a list of candidate solutions that are passed on to the next level of processing. The later processing levels use these candidate solutions to resolve ambiguities. The scope of this dissertation is the design of the APT algorithm up to the speech-frame classification stage. In future research, the APT algorithm will be extended to the word recognition stage. In particular, the APT algorithm could serve as the front-end stage to a Hidden Markov Model (HMM) based word recognition system. In such a configuration, the APT algorithm would provide the HMM with the requisite phoneme state-probability estimates. To date, the APT algorithm has been tested with the TIMIT and NTIMIT speech databases. The APT algorithm has been trained and tested on the SX and SI sentence texts using both male and female speakers. Results indicate better performance than that obtained using a neural network based speech-frame classifier. The performance of the APT algorithm has been evaluated for

  11. Simultaneous Blind Separation and Recognition of Speech Mixtures Using Two Microphones to Control a Robot Cleaner

    Directory of Open Access Journals (Sweden)

    Heungkyu Lee

    2013-02-01

    Full Text Available This paper proposes a method for the simultaneous separation and recognition of speech mixtures in noisy environments using two-channel based independent vector analysis (IVA) on a home-robot cleaner. The issues to be considered in our target application are speech recognition at a distance and noise removal to cope with a variety of noises, including TV sounds, air conditioners, babble, and so on, that can occur in a house, where people can utter a voice command to control a robot cleaner at any time and at any location, even while the robot cleaner is moving. Thus, the system should always be in a recognition-ready state to promptly recognize a spoken word at any time, and the false acceptance rate should be low. To cope with these issues, the keyword spotting technique is applied. In addition, a microphone alignment method and a model-based real-time IVA approach are proposed to effectively and simultaneously process the speech and noise sources, as well as to cover 360-degree directions irrespective of distance. The experimental evaluations show that the proposed method is robust in terms of speech recognition accuracy, even when the speaker location is unfixed and changes all the time. In addition, the proposed method shows good performance in severely noisy environments.

  12. Suprasegmental Duration Modelling with Elastic Constraints in Automatic Speech Recognition

    OpenAIRE

    Molloy, Laurence; Isard, Stephen

    1998-01-01

    In this paper a method of integrating a model of suprasegmental duration with a HMM-based recogniser at the post-processing level is presented. The N-Best utterance output is rescored using a suitable linear combination of acoustic log-likelihood (provided by a set of tied-state triphone HMMs) and duration log-likelihood (provided by a set of durational models). The durational model used in the post-processing imposes syllable-level elastic constraints on the durational behaviour of speech se...

  13. Arabic Language Learning Assisted by Computer, based on Automatic Speech Recognition

    CERN Document Server

    Terbeh, Naim

    2012-01-01

    This work consists of creating a Computer Assisted Language Learning (CALL) system based on an Automatic Speech Recognition (ASR) system for the Arabic language, using the CMU Sphinx3 tool [1] and the HMM approach. For this work, we constructed a corpus of six hours of speech recordings from nine speakers. The robustness to noise is one ground for choosing the HMM approach [2]. The results achieved are encouraging given that our corpus contains only nine speakers, and they open the door for further improvement work.

  14. Dynamic time warping applied to detection of confusable word pairs in automatic speech recognition

    OpenAIRE

    Anguita Ortega, Jan; Hernando Pericás, Francisco Javier

    2005-01-01

    In this paper we present a method to predict if two words are likely to be confused by an Automatic Speech Recognition (ASR) system. This method is based on the classical Dynamic Time Warping (DTW) technique. This technique, which is usually used in ASR to measure the distance between two speech signals, is used here to calculate the distance between two words. With this distance the words are classified as confusable or not confusable using a threshold. We have te...

  15. Assessing the efficacy of benchmarks for automatic speech accent recognition

    Directory of Open Access Journals (Sweden)

    Benjamin Bock

    2015-08-01

    Full Text Available Speech accents can possess valuable information about the speaker, and can be used in intelligent multimedia-based human-computer interfaces. The performance of algorithms for automatic classification of accents is often evaluated using audio datasets that include recording samples of different people, representing different accents. Here we describe a method that can detect bias in accent datasets, and apply the method to two accent identification datasets to reveal the existence of dataset bias, meaning that the datasets can be classified with accuracy higher than random even if the tested algorithm has no ability to analyze speech accent. We used the datasets by separating one second of silence from the beginning of each audio sample, such that the one-second sample did not contain voice, and therefore no information about the accent. An audio classification method was then applied to the datasets of silent audio samples, and provided classification accuracy significantly higher than random. These results indicate that the performance of accent classification algorithms measured using some accent classification benchmarks can be biased, and can be driven by differences in the background noise rather than the auditory features of the accents.

  16. A Factored Language Model for Prosody Dependent Speech Recognition

    OpenAIRE

    Chen, Ken; Hasegawa-Johnson, Mark A.; Cole, Jennifer S.

    2007-01-01

    In this chapter, we proposed a novel approach that improves the robustness of prosody dependent language modeling by leveraging the dependence between prosody and syntax. In our experiments on Radio News Corpus, a factorial prosody dependent language model estimated using our proposed approach has achieved as much as 31% reduction of the joint perplexity over a prosody dependent language model estimated using the standard Maximum Likelihood approach. In recognition experiments, our approach r...

  17. Real-time contrast enhancement to improve speech recognition.

    Science.gov (United States)

    Alexander, Joshua M; Jenison, Rick L; Kluender, Keith R

    2011-01-01

    An algorithm that operates in real-time to enhance the salient features of speech is described and its efficacy is evaluated. The Contrast Enhancement (CE) algorithm implements dynamic compressive gain and lateral inhibitory sidebands across channels in a modified winner-take-all circuit, which together produce a form of suppression that sharpens the dynamic spectrum. Normal-hearing listeners identified spectrally smeared consonants (VCVs) and vowels (hVds) in quiet and in noise. Consonant and vowel identification, especially in noise, were improved by the processing. The amount of improvement did not depend on the degree of spectral smearing or talker characteristics. For consonants, when results were analyzed according to phonetic feature, the most consistent improvement was for place of articulation. This is encouraging for hearing aid applications because confusions between consonants differing in place are a persistent problem for listeners with sensorineural hearing loss. PMID:21949736

  18. Low-Complexity Variable Frame Rate Analysis for Speech Recognition and Voice Activity Detection

    DEFF Research Database (Denmark)

    Tan, Zheng-Hua; Lindberg, Børge

    2010-01-01

    present a low-complexity and effective frame selection approach based on a posteriori signal-to-noise ratio (SNR) weighted energy distance: The use of an energy distance, instead of e.g. a standard cepstral distance, makes the approach computationally efficient and enables fine granularity search......, and the use of a posteriori SNR weighting emphasizes the reliable regions in noisy speech signals. It is experimentally found that the approach is able to assign a higher frame rate to fast changing events such as consonants, a lower frame rate to steady regions like vowels and no frames to silence, even...... for very low SNR signals. The resulting variable frame rate analysis method is applied to three speech processing tasks that are essential to natural interaction with intelligent environments. First, it is used for improving speech recognition performance in noisy environments. Secondly, the method is used...
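
    A rough sketch of this frame-selection idea is given below, under assumed constants and a crude percentile noise-floor estimate: the energy distance to the last selected frame is weighted by an a posteriori SNR term, and a frame is kept once the accumulated weighted distance crosses a threshold.

```python
# Variable frame rate selection driven by an a posteriori SNR weighted
# energy distance. Thresholds and the noise-floor estimate are
# illustrative assumptions, not the paper's exact configuration.
import numpy as np

def select_frames(frames, threshold=1.0):
    """frames: (T, N) array of time-domain frames. Returns kept indices."""
    log_e = np.log(np.sum(frames ** 2, axis=1) + 1e-12)
    noise = np.percentile(log_e, 10)              # crude noise-energy floor
    snr_w = np.maximum(log_e - noise, 0.0)        # a posteriori SNR weight
    kept, acc = [0], 0.0
    for t in range(1, len(frames)):
        acc += snr_w[t] * abs(log_e[t] - log_e[kept[-1]])
        if acc >= threshold:                      # enough weighted change
            kept.append(t)
            acc = 0.0
    return kept

# Usage: frames whose energy ramps up over time keep more late frames.
sig = np.random.randn(100, 200) * np.linspace(0.1, 2, 100)[:, None]
print(len(select_frames(sig)), "of 100 frames kept")
```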

  19. Performance Evaluation of Speech Recognition Systems as a Next-Generation Pilot-Vehicle Interface Technology

    Science.gov (United States)

    Arthur, Jarvis J., III; Shelton, Kevin J.; Prinzel, Lawrence J., III; Bailey, Randall E.

    2016-01-01

    During the flight trials known as Gulfstream-V Synthetic Vision Systems Integrated Technology Evaluation (GV-SITE), a Speech Recognition System (SRS) was used by the evaluation pilots. The SRS system was intended to be an intuitive interface for display control (rather than knobs, buttons, etc.). This paper describes the performance of the current "state of the art" Speech Recognition System (SRS). The commercially available technology was evaluated as an application for possible inclusion in commercial aircraft flight decks as a crew-to-vehicle interface. Specifically, the technology is to be used as an interface from aircrew to the onboard displays, controls, and flight management tasks. A flight test of a SRS as well as a laboratory test was conducted.

  20. A Log—Index Weighted Cepstral Distance Measure for Speech Recognition

    Institute of Scientific and Technical Information of China (English)

    郑方; 吴文虎; et al.

    1997-01-01

    A log-index weighted cepstral distance measure is proposed and tested in speaker-independent and speaker-dependent isolated word recognition systems using statistical techniques. The weights for the cepstral coefficients of this measure equal the logarithm of the corresponding indices. The experimental results show that this kind of measure works better than other weighted Euclidean cepstral distance measures on three speech databases. The error rate obtained using this measure is about 1.8 percent on average across the three databases, which is a 25% reduction from that obtained using other measures, and a 40% reduction from that obtained using the Log Likelihood Ratio (LLR) measure. The experimental results also show that this kind of distance measure works well in both speaker-dependent and speaker-independent speech recognition systems.
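
    The weighting rule is stated explicitly above (the weight of the k-th cepstral coefficient equals the logarithm of its index), so the distance itself is easy to write down; note that log 1 = 0 removes the first coefficient's contribution.

```python
import numpy as np

def log_index_weighted_distance(c1, c2):
    """Weighted Euclidean cepstral distance where the weight of the k-th
    coefficient is log(k); log(1) = 0, so the first coefficient
    contributes nothing."""
    k = np.arange(1, len(c1) + 1, dtype=float)
    diff = np.asarray(c1) - np.asarray(c2)
    return float(np.sqrt(np.sum((np.log(k) * diff) ** 2)))

# Toy usage with two 12-dimensional cepstral vectors.
rng = np.random.default_rng(0)
a, b = rng.normal(size=12), rng.normal(size=12)
print(log_index_weighted_distance(a, b))
```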

  1. Robust Features for Speech Recognition using Temporal Filtering Technique in the Presence of Impulsive Noise

    Directory of Open Access Journals (Sweden)

    Hajer Rahali

    2014-10-01

    In this paper we introduce a robust feature extractor, dubbed Modified Function Cepstral Coefficients (MODFCC), based on a gammachirp filterbank, Relative Spectral (RASTA) filtering, and Autoregressive Moving-Average (ARMA) filtering. The goal of this work is to improve the robustness of speech recognition systems in additive noise and real-time reverberant environments. In speech recognition systems, Mel-Frequency Cepstral Coefficients (MFCC) and their RASTA- and ARMA-filtered variants (RASTA-MFCC and ARMA-MFCC) are the three main techniques used. This paper presents some modifications to the original MFCC method. In our work the effectiveness of the proposed changes to MFCC was tested and compared against the original RASTA-MFCC and ARMA-MFCC features. Prosodic features such as jitter and shimmer are added to the baseline spectral features. The above-mentioned techniques were tested with impulsive signals under various noisy conditions within the AURORA databases.

  2. Tone model integration based on discriminative weight training for Putonghua speech recognition

    Institute of Scientific and Technical Information of China (English)

    HUANG Hao; ZHU Jie

    2008-01-01

    A discriminative framework for tone model integration in continuous speech recognition was proposed. The method uses model-dependent weights to scale the probabilities of hidden Markov models based on spectral features and of tone models based on tonal features. The weights are discriminatively trained by the minimum phone error criterion. An update equation for the model weights based on the extended Baum-Welch algorithm is derived. Various schemes of model weight combination are evaluated, and a smoothing technique is introduced to make training robust to overfitting. The proposed method is evaluated on tonal syllable output and character output speech recognition tasks. The experimental results show that the proposed method obtained 9.5% and 4.7% relative error reductions over a global weight on the two tasks, due to a better interpolation of the given models. This proves the effectiveness of discriminatively trained model weights for tone model integration.
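
    The core integration step, stripped of the discriminative training, is a log-linear combination of the two model scores. A sketch with fixed illustrative weights follows; in the paper the weights are model dependent and trained by minimum phone error, a step omitted here.

```python
def combined_score(log_p_spectral, log_p_tonal, w_spectral=1.0, w_tonal=0.4):
    """Log-linear integration of spectral-HMM and tone-model scores with
    model weights. The weight values here are illustrative assumptions."""
    return w_spectral * log_p_spectral + w_tonal * log_p_tonal

# Rescore two competing tonal-syllable hypotheses (toy numbers).
print(combined_score(-42.7, -8.3))   # hypothesis A
print(combined_score(-41.9, -12.6))  # hypothesis B
```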

  3. Audio-Visual Speech Recognition Using Lip Information Extracted from Side-Face Images

    Directory of Open Access Journals (Sweden)

    Koji Iwano

    2007-03-01

    This paper proposes an audio-visual speech recognition method using lip information extracted from side-face images, as an attempt to increase noise robustness in mobile environments. Our proposed method assumes that lip images can be captured using a small camera installed in a handset. Two different kinds of lip features, lip-contour geometric features and lip-motion velocity features, are used individually or jointly, in combination with audio features. Phoneme HMMs modeling the audio and visual features are built based on the multistream HMM technique. Experiments conducted using Japanese connected-digit speech contaminated with white noise at various SNR conditions show the effectiveness of the proposed method. Recognition accuracy is improved by using the visual information in all SNR conditions. These visual features were confirmed to be effective even when the audio HMM was adapted to noise by the MLLR method.

  4. Combined Acoustic and Pronunciation Modelling for Non-Native Speech Recognition

    CERN Document Server

    Bouselmi, Ghazi; Illina, Irina

    2007-01-01

    In this paper, we present several adaptation methods for non-native speech recognition. We have tested pronunciation modelling, MLLR and MAP non-native pronunciation adaptation, and HMM model retraining on the HIWIRE foreign-accented English speech database. The "phonetic confusion" scheme we have developed consists of associating with each spoken phone several sequences of confused phones. In our experiments, we have used different combinations of acoustic models representing the canonical and the foreign pronunciations: spoken and native models, and models adapted to the non-native accent with MAP and MLLR. The joint use of pronunciation modelling and acoustic adaptation led to further improvements in recognition accuracy. The best combination of the above-mentioned techniques resulted in a relative word error reduction ranging from 46% to 71%.

  5. Coordinated control of an intelligent wheelchair based on a brain-computer interface and speech recognition

    Institute of Scientific and Technical Information of China (English)

    Hong-tao WANG; Yuan-qing LI; Tian-you YU

    2014-01-01

    An intelligent wheelchair is devised, which is controlled by a coordinated mechanism based on a brain-computer interface (BCI) and speech recognition. By performing appropriate activities, users can navigate the wheelchair with four steering behaviors (start, stop, turn left, and turn right). Five healthy subjects participated in an indoor experiment. The results demonstrate the efficiency of the coordinated control mechanism with satisfactory path and time optimality ratios, and show that speech recognition is a fast and accurate supplement for BCI-based control systems. The proposed intelligent wheelchair is especially suitable for patients suffering from paralysis (especially those with aphasia) who can learn to pronounce only a single sound (e.g., ‘ah’).

  6. Feeling backwards? How temporal order in speech affects the time course of vocal emotion recognition

    Directory of Open Access Journals (Sweden)

    Simon Rigoulot

    2013-06-01

    Recent studies suggest that the time course for recognizing vocal expressions of basic emotion in speech varies significantly by emotion type, implying that listeners uncover acoustic evidence about emotions at different rates in speech (e.g., fear is recognized most quickly whereas happiness and disgust are recognized relatively slowly; Pell and Kotz, 2011). To investigate whether vocal emotion recognition is largely dictated by the amount of time listeners are exposed to speech or by the position of critical emotional cues in the utterance, 40 English participants judged the meaning of emotionally-inflected pseudo-utterances presented in a gating paradigm, where utterances were gated as a function of their syllable structure in segments of increasing duration from the end of the utterance (i.e., gated ‘backwards’). Accuracy for detecting six target emotions in each gate condition and the mean identification point for each emotion in milliseconds were analyzed and compared to results from Pell and Kotz (2011). We again found significant emotion-specific differences in the time needed to accurately recognize emotions from speech prosody, and new evidence that utterance-final syllables tended to facilitate listeners’ accuracy in many conditions when compared to utterance-initial syllables. The time needed to recognize fear, anger, sadness, and neutral from speech cues was not influenced by how utterances were gated, although happiness and disgust were recognized significantly faster when listeners heard the end of utterances first. Our data provide new clues about the relative time course for recognizing vocally-expressed emotions within the 400-1200 millisecond time window, while highlighting that emotion recognition from prosody can be shaped by the temporal properties of speech.

  7. Analysis of Phonetic Transcriptions for Danish Automatic Speech Recognition

    DEFF Research Database (Denmark)

    Kirkedal, Andreas Søeborg

    The performance of an automatic speech recognition system depends heavily on the dictionary and the transcriptions therein. This paper presents an analysis of phonetic/phonemic features that are salient for current Danish ASR systems. This preliminary study consists of a series of experiments using an ASR system trained on the DK-PAROLE corpus. The analysis indicates that transcribing e.g. stress or vowel duration has a negative impact on performance. The best performance is obtained with coarse phonetic annotation, which improves performance by 1% word error rate and 3.8% sentence error rate.

  8. Improving speech recognition on a mobile robot platform through the use of top-down visual queues

    OpenAIRE

    Ross, Robert; O'Donoghue, R. P. S.; O'Hare, G. M. P.

    2003-01-01

    In many real-world environments, Automatic Speech Recognition (ASR) technologies fail to provide adequate performance for applications such as human robot dialog. Despite substantial evidence that speech recognition in humans is performed in a top-down as well as bottom-up manner, ASR systems typically fail to capitalize on this, instead relying on a purely statistical, bottom up methodology. In this paper we advocate the use of a knowledge based approach to improving ASR in domains such as m...

  9. A Computationally Efficient Mel-Filter Bank VAD Algorithm for Distributed Speech Recognition Systems

    Science.gov (United States)

    Vlaj, Damjan; Kotnik, Bojan; Horvat, Bogomir; Kačič, Zdravko

    2005-12-01

    This paper presents a novel computationally efficient voice activity detection (VAD) algorithm and emphasizes the importance of such algorithms in distributed speech recognition (DSR) systems. When using VAD algorithms in telecommunication systems, the required capacity of the speech transmission channel can be reduced if only the speech parts of the signal are transmitted. A similar objective can be adopted in DSR systems, where the nonspeech parameters are not sent over the transmission channel. A novel approach is proposed for VAD decisions based on mel-filter bank (MFB) outputs with the so-called hangover criterion. Comparative tests are presented between the presented MFB VAD algorithm and three VAD algorithms used in the G.729, G.723.1, and DSR (advanced front-end) standards. These tests were made on the Aurora 2 database with different signal-to-noise ratios (SNRs). In the speech recognition tests, the proposed MFB VAD outperformed all three of the standard VAD algorithms (G.723.1, G.729, and DSR VAD) at all SNRs.
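
    A toy sketch of an MFB-based VAD with a hangover criterion, in the spirit of the record above; the noise-floor estimate, the decision margin, and the hangover length are all illustrative assumptions.

```python
import numpy as np

def mfb_vad(mfb_energies, margin_db=6.0, hangover=8):
    """Frame-level VAD on mel-filter bank (MFB) outputs with a hangover.

    mfb_energies : (T, M) per-frame mel-filter bank energies.
    margin_db    : decision margin above the estimated noise floor (assumed).
    hangover     : frames speech stays active after the last firm detection,
                   so weak word endings are not clipped.
    """
    frame_db = 10.0 * np.log10(np.sum(mfb_energies, axis=1) + 1e-12)
    noise_floor = np.percentile(frame_db, 10)  # crude noise estimate (assumed)
    decisions = np.zeros(len(frame_db), dtype=bool)
    hold = 0
    for t, level in enumerate(frame_db):
        if level > noise_floor + margin_db:
            decisions[t] = True
            hold = hangover
        elif hold > 0:
            decisions[t] = True                # hangover criterion
            hold -= 1
    return decisions

rng = np.random.default_rng(5)
energies = np.vstack([rng.gamma(1.0, 0.01, (30, 23)),   # noise-only frames
                      rng.gamma(5.0, 1.00, (40, 23)),   # speech frames
                      rng.gamma(1.0, 0.01, (30, 23))])  # noise-only frames
print(mfb_vad(energies).astype(int))
```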

  10. A Computationally Efficient Mel-Filter Bank VAD Algorithm for Distributed Speech Recognition Systems

    Directory of Open Access Journals (Sweden)

    Vlaj Damjan

    2005-01-01

    This paper presents a novel computationally efficient voice activity detection (VAD) algorithm and emphasizes the importance of such algorithms in distributed speech recognition (DSR) systems. When using VAD algorithms in telecommunication systems, the required capacity of the speech transmission channel can be reduced if only the speech parts of the signal are transmitted. A similar objective can be adopted in DSR systems, where the nonspeech parameters are not sent over the transmission channel. A novel approach is proposed for VAD decisions based on mel-filter bank (MFB) outputs with the so-called hangover criterion. Comparative tests are presented between the presented MFB VAD algorithm and three VAD algorithms used in the G.729, G.723.1, and DSR (advanced front-end) standards. These tests were made on the Aurora 2 database with different signal-to-noise ratios (SNRs). In the speech recognition tests, the proposed MFB VAD outperformed all three of the standard VAD algorithms (G.723.1, G.729, and DSR VAD) at all SNRs.

  11. Searching for sources of variance in speech recognition: Young adults with normal hearing

    Science.gov (United States)

    Watson, Charles S.; Kidd, Gary R.

    2005-04-01

    In the present investigation, sensory-perceptual abilities of one thousand young adults with normal hearing are being evaluated with a range of auditory, visual, and cognitive measures. Four auditory measures were derived from factor-analytic analyses of previous studies with 18-20 speech and non-speech variables [G. R. Kidd et al., J. Acoust. Soc. Am. 108, 2641 (2000)]. Two measures of visual acuity are obtained to determine whether variation in sensory skills tends to exist primarily within or across sensory modalities. A working memory test, grade point average, and Scholastic Aptitude Test scores (Verbal and Quantitative) are also included. Preliminary multivariate analyses support previous studies of individual differences in auditory abilities [e.g., A. M. Surprenant and C. S. Watson, J. Acoust. Soc. Am. 110, 2085-2095 (2001)], which found that spectral and temporal resolving power obtained with pure tones and more complex unfamiliar stimuli have little or no correlation with measures of speech recognition under difficult listening conditions. The current findings show that visual acuity, working memory, and intellectual measures are also very poor predictors of speech recognition ability, supporting the independence of this processing skill. Remarkable performance by some exceptional listeners will be described. [Work supported by the Office of Naval Research, Award No. N000140310644.]

  12. Managing predefined templates and macros for a departmental speech recognition system using common software

    OpenAIRE

    Sistrom, Chris L.; Honeyman, Janice C.; Mancuso, Anthony; Quisling, Ronald G.

    2001-01-01

    The authors have developed a networked database system to create, store, and manage predefined radiology report definitions. This was prompted by complete departmental conversion to a computer speech recognition system (SRS) for clinical reporting. The software complements and extends the capabilities of the SRS, and 2 systems are integrated by means of a simple text file format and import/export functions within each program. This report describes the functional requirements, design consider...

  13. Authenticity affects the recognition of emotions in speech: behavioral and fMRI evidence

    OpenAIRE

    Drolet, Matthis; Schubotz, Ricarda I.; Fischer, Julia

    2011-01-01

    The aim of the present study was to determine how authenticity of emotion expression in speech modulates activity in the neuronal substrates involved in emotion recognition. Within an fMRI paradigm, participants judged either the authenticity (authentic or play acted) or emotional content (anger, fear, joy, or sadness) of recordings of spontaneous emotions and reenactments by professional actors. When contrasting between task types, active judgment of authenticity, more than active judgment o...

  14. Fast and Accurate Recurrent Neural Network Acoustic Models for Speech Recognition

    OpenAIRE

    SAK, Haşim; Senior, Andrew; Rao, Kanishka; Beaufays, Françoise

    2015-01-01

    We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques tha...

  15. Constructing Long Short-Term Memory based Deep Recurrent Neural Networks for Large Vocabulary Speech Recognition

    OpenAIRE

    Li, Xiangang; Wu, Xihong

    2014-01-01

    Long short-term memory (LSTM) based acoustic modeling methods have recently been shown to give state-of-the-art performance on some speech recognition tasks. To achieve a further performance improvement, in this research, deep extensions on LSTM are investigated considering that deep hierarchical model has turned out to be more efficient than a shallow one. Motivated by previous research on constructing deep recurrent neural networks (RNNs), alternative deep LSTM architectures are proposed an...

  16. A SPEECH RECOGNITION METHOD USING COMPETITIVE AND SELECTIVE LEARNING NEURAL NETWORKS

    Institute of Scientific and Technical Information of China (English)

    2000-01-01

    On the basis of Gersho's asymptotic theory, the isodistortion principle of vector clustering is discussed, and a competitive and selective learning method (CSL) is proposed that avoids local optima and gives excellent results when applied to the clustering of HMM models. By combining parallel self-organizing hierarchical neural networks (PSHNN) to reclassify the scores output by the HMM, the CSL speech recognition rate is clearly improved.

  17. Audiovisual benefit for recognition of speech presented with single-talker noise in older listeners

    OpenAIRE

    Jesse, A.; Janse, E.

    2012-01-01

    Older listeners are more affected than younger listeners in their recognition of speech in adverse conditions, such as when they also hear a single-competing speaker. In the present study, we investigated with a speeded response task whether older listeners with various degrees of hearing loss benefit under such conditions from also seeing the speaker they intend to listen to. We also tested, at the same time, whether older adults need postperceptual processing to obtain an audiovisual benefi...

  18. Visual Word Recognition is Accompanied by Covert Articulation: Evidence for a Speech-like Phonological Representation

    OpenAIRE

    Eiter, Brianna M.; Inhoff, Albrecht W.

    2008-01-01

    Two lexical decision task (LDT) experiments examined whether visual word recognition involves the use of a speech-like phonological code that may be generated via covert articulation. In Experiment 1, each visual item was presented with an irrelevant spoken word (ISW) that was either phonologically identical, similar, or dissimilar to it. An ISW delayed classification of a visual word when the two were phonologically similar, and it delayed the classification of a pseudoword when it was ident...

  19. Hidden Markov Model for Speech Recognition Using Modified Forward-Backward Re-estimation Algorithm

    OpenAIRE

    Balwant A. Sonkamble

    2012-01-01

    There are various practical implementation issues for the HMM. The use of a scaling factor is the main issue in HMM implementation; the scaling factor is used to obtain smoothed probabilities. The proposed technique, called the Modified Forward-Backward Re-estimation algorithm, is used to recognize speech patterns. The proposed algorithm has shown very good recognition accuracy compared to the conventional Forward-Backward Re-estimation algorithm.

  20. Implementing a Hidden Markov Model Speech Recognition System in Programmable Logic

    OpenAIRE

    Melnikoff, Stephen Jonathan; Quigley, Steven Francis; Russell, Martin

    2001-01-01

    Performing Viterbi decoding for continuous real-time speech recognition is a highly computationally-demanding task, but is one which can take good advantage of a parallel processing architecture. To this end, we describe a system which uses an FPGA for the decoding and a PC for pre- and post-processing, taking advantage of the properties of this kind of programmable logic device, specifically its ability to perform in parallel the large number of additions and comparisons required. We compare...

  1. Automated target recognition technique for image segmentation and scene analysis

    Science.gov (United States)

    Baumgart, Chris W.; Ciarcia, Christopher A.

    1994-03-01

    Automated target recognition (ATR) software has been designed to perform image segmentation and scene analysis. Specifically, this software was developed as a package for the Army's Minefield and Reconnaissance and Detector (MIRADOR) program. MIRADOR is an on/off road, remote control, multisensor system designed to detect buried and surface-emplaced metallic and nonmetallic antitank mines. The basic requirements for this ATR software were the following: (1) an ability to separate target objects from the background in low signal-to-noise conditions; (2) an ability to handle a relatively high dynamic range in imaging light levels; (3) the ability to compensate for or remove light source effects such as shadows; and (4) the ability to identify target objects as mines. The image segmentation and target evaluation were performed using an integrated and parallel processing approach. Three basic techniques (texture analysis, edge enhancement, and contrast enhancement) were used collectively to extract all potential mine target shapes from the basic image. Target evaluation was then performed using a combination of size, geometrical, and fractal characteristics, which resulted in a calculated probability for each target shape. Overall results with this algorithm were quite good, though there is a tradeoff between detection confidence and the number of false alarms. This technology also has applications in the areas of hazardous waste site remediation, archaeology, and law enforcement.

  2. Fully Automated Assessment of the Severity of Parkinson's Disease from Speech.

    Science.gov (United States)

    Bayestehtashk, Alireza; Asgari, Meysam; Shafran, Izhak; McNames, James

    2015-01-01

    For several decades now, there has been sporadic interest in automatically characterizing the speech impairment due to Parkinson's disease (PD). Most early studies were confined to quantifying a few speech features that were easy to compute. More recent studies have adopted a machine learning approach in which a large number of potential features are extracted and the models are learned automatically from the data. In the same vein, here we characterize the disease using a relatively large cohort of 168 subjects, collected from multiple (three) clinics. We elicited speech using three tasks - the sustained phonation task, the diadochokinetic task, and a reading task - all within a time budget of 4 minutes, prompted by a portable device. From these recordings, we extracted 1582 features for each subject using openSMILE, a standard feature extraction tool. We compared the effectiveness of three strategies for learning a regularized regression and found that ridge regression performs better than lasso and support vector regression for our task. We refined the feature extraction to capture pitch-related cues, including jitter and shimmer, more accurately using a time-varying harmonic model of speech. Our results show that the severity of the disease can be inferred from speech with a mean absolute error of about 5.5, explaining 61% of the variance and consistently well above chance across all clinics. Of the three speech elicitation tasks, we find that the reading task is significantly better at capturing cues than the diadochokinetic or sustained phonation tasks. In all, we have demonstrated that the data collection and inference can be fully automated, and the results show that speech-based assessment has promising practical application in PD. The techniques reported here are more widely applicable to other paralinguistic tasks in the clinical domain. PMID:25382935
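
    A sketch of the regression strategy the record describes, with random data standing in for the (unavailable) openSMILE features and clinical scores; the alpha value is an assumption and would normally be tuned.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Random data stands in for the corpus: 168 subjects, 1582 openSMILE
# features each, and a clinical severity score per subject (all assumed).
rng = np.random.default_rng(42)
X = rng.normal(size=(168, 1582))
y = rng.normal(loc=50.0, scale=10.0, size=168)

model = Ridge(alpha=10.0)  # alpha is illustrative; it would be tuned by CV
mae = -cross_val_score(model, X, y, cv=5,
                       scoring="neg_mean_absolute_error").mean()
print(f"cross-validated MAE on random data: {mae:.2f}")
```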

  3. Time-Varying Noise Estimation for Speech Enhancement and Recognition Using Sequential Monte Carlo Method

    Directory of Open Access Journals (Sweden)

    Kaisheng Yao

    2004-11-01

    We present a method for sequentially estimating time-varying noise parameters. Noise parameters are sequences of time-varying mean vectors representing the noise power in the log-spectral domain. The proposed sequential Monte Carlo method generates a set of particles in compliance with the prior distribution given by clean speech models. The noise parameters in this model evolve according to random walk functions and the model uses extended Kalman filters to update the weight of each particle as a function of observed noisy speech signals, speech model parameters, and the evolved noise parameters in each particle. Finally, the updated noise parameter is obtained by means of minimum mean square error (MMSE) estimation on these particles. For efficient computations, the residual resampling and Metropolis-Hastings smoothing are used. The proposed sequential estimation method is applied to noisy speech recognition and speech enhancement under strongly time-varying noise conditions. In both scenarios, this method outperforms some alternative methods.
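
    A heavily simplified structural sketch of the sequential Monte Carlo idea: random-walk particle evolution, likelihood-based reweighting, an MMSE estimate, and resampling. The extended Kalman filter update, residual resampling, and Metropolis-Hastings smoothing of the paper are replaced here by a plain Gaussian weighting and multinomial resampling, so this shows only the structure, not the published method.

```python
import numpy as np

def sequential_noise_estimate(noisy_logspec, n_particles=200,
                              walk_std=0.05, obs_std=0.5):
    """Track a time-varying noise mean in the log-spectral domain.

    noisy_logspec : (T, D) observed log-spectral frames.
    walk_std      : random-walk step of the noise evolution (assumed).
    obs_std       : observation noise of the toy likelihood (assumed).
    """
    rng = np.random.default_rng(0)
    T, D = noisy_logspec.shape
    particles = np.tile(noisy_logspec[0], (n_particles, 1))
    estimates = np.empty((T, D))
    for t in range(T):
        # Random-walk evolution of each particle's noise parameter.
        particles += rng.normal(scale=walk_std, size=particles.shape)
        # Reweight by how well each particle explains the observed frame.
        log_w = -0.5 * np.sum((noisy_logspec[t] - particles) ** 2,
                              axis=1) / obs_std ** 2
        w = np.exp(log_w - log_w.max())
        w /= w.sum()
        estimates[t] = w @ particles           # MMSE estimate over particles
        idx = rng.choice(n_particles, size=n_particles, p=w)
        particles = particles[idx]             # multinomial resampling
    return estimates

# Toy usage: slowly drifting synthetic "noise" observations.
rng = np.random.default_rng(3)
obs = np.cumsum(rng.normal(0.0, 0.05, (60, 8)), axis=0)
print(sequential_noise_estimate(obs).shape)
```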

  4. Estimation of phoneme-specific HMM topologies for the automatic recognition of dysarthric speech.

    Science.gov (United States)

    Caballero-Morales, Santiago-Omar

    2013-01-01

    Dysarthria is a frequently occurring motor speech disorder which can be caused by neurological trauma, cerebral palsy, or degenerative neurological diseases. Because dysarthria affects phonation, articulation, and prosody, spoken communication of dysarthric speakers gets seriously restricted, affecting their quality of life and confidence. Assistive technology has led to the development of speech applications to improve the spoken communication of dysarthric speakers. In this field, this paper presents an approach to improve the accuracy of HMM-based speech recognition systems. Because phonatory dysfunction is a main characteristic of dysarthric speech, the phonemes of a dysarthric speaker are affected at different levels. Thus, the approach consists of finding the most suitable type of HMM topology (Bakis, Ergodic) for each phoneme in the speaker's phonetic repertoire. The topology is further refined with a suitable number of states and Gaussian mixture components for acoustic modelling. This represents a difference when compared with studies where a single topology is assumed for all phonemes. Finding the suitable parameters (topology and mixture components) is performed with a Genetic Algorithm (GA). Experiments with a well-known dysarthric speech database showed statistically significant improvements of the proposed approach when compared with the single topology approach, even for speakers with severe dysarthria. PMID:24222784

  5. Environment-dependent denoising autoencoder for distant-talking speech recognition

    Science.gov (United States)

    Ueda, Yuma; Wang, Longbiao; Kai, Atsuhiko; Ren, Bo

    2015-12-01

    In this paper, we propose an environment-dependent denoising autoencoder (DAE) and automatic environment identification based on a deep neural network (DNN) with blind reverberation estimation for robust distant-talking speech recognition. Recently, DAEs have been shown to be effective in many noise reduction and reverberation suppression applications because higher-level representations and increased flexibility of the feature mapping function can be learned. However, a DAE is not adequate in mismatched training and test environments. In a conventional DAE, parameters are trained using pairs of reverberant speech and clean speech under various acoustic conditions (that is, an environment-independent DAE). To address the above problem, we propose two environment-dependent DAEs to reduce the influence of mismatches between training and test environments. In the first approach, we train various DAEs using speech from different acoustic environments, and the DAE for the condition that best matches the test condition is automatically selected (that is, a two-step environment-dependent DAE). To improve environment identification performance, we propose a DNN that uses both reverberant speech and estimated reverberation. In the second approach, we add estimated reverberation features to the input of the DAE (that is, a one-step environment-dependent DAE or a reverberation-aware DAE). The proposed method is evaluated using speech in simulated and real reverberant environments. Experimental results show that the environment-dependent DAE outperforms the environment-independent one in both simulated and real reverberant environments. For two-step environment-dependent DAE, the performance of environment identification based on the proposed DNN approach is also better than that of the conventional DNN approach, in which only reverberant speech is used and reverberation is not blindly estimated. And, the one-step environment-dependent DAE significantly outperforms the two

  6. Call recognition and individual identification of fish vocalizations based on automatic speech recognition: An example with the Lusitanian toadfish.

    Science.gov (United States)

    Vieira, Manuel; Fonseca, Paulo J; Amorim, M Clara P; Teixeira, Carlos J C

    2015-12-01

    The study of acoustic communication in animals often requires not only the recognition of species specific acoustic signals but also the identification of individual subjects, all in a complex acoustic background. Moreover, when very long recordings are to be analyzed, automatic recognition and identification processes are invaluable tools to extract the relevant biological information. A pattern recognition methodology based on hidden Markov models is presented inspired by successful results obtained in the most widely known and complex acoustical communication signal: human speech. This methodology was applied here for the first time to the detection and recognition of fish acoustic signals, specifically in a stream of round-the-clock recordings of Lusitanian toadfish (Halobatrachus didactylus) in their natural estuarine habitat. The results show that this methodology is able not only to detect the mating sounds (boatwhistles) but also to identify individual male toadfish, reaching an identification rate of ca. 95%. Moreover this method also proved to be a powerful tool to assess signal durations in large data sets. However, the system failed in recognizing other sound types. PMID:26723348

  7. Suprasegmental lexical stress cues in visual speech can guide spoken-word recognition.

    Science.gov (United States)

    Jesse, Alexandra; McQueen, James M

    2014-01-01

    Visual cues to the individual segments of speech and to sentence prosody guide speech recognition. The present study tested whether visual suprasegmental cues to the stress patterns of words can also constrain recognition. Dutch listeners use acoustic suprasegmental cues to lexical stress (changes in duration, amplitude, and pitch) in spoken-word recognition. We asked here whether they can also use visual suprasegmental cues. In two categorization experiments, Dutch participants saw a speaker say fragments of word pairs that were segmentally identical but differed in their stress realization (e.g., 'ca-vi from cavia "guinea pig" vs. 'ka-vi from kaviaar "caviar"). Participants were able to distinguish between these pairs from seeing a speaker alone. Only the presence of primary stress in the fragment, not its absence, was informative. Participants were able to distinguish visually primary from secondary stress on first syllables, but only when the fragment-bearing target word carried phrase-level emphasis. Furthermore, participants distinguished fragments with primary stress on their second syllable from those with secondary stress on their first syllable (e.g., pro-'jec from projector "projector" vs. 'pro-jec from projectiel "projectile"), independently of phrase-level emphasis. Seeing a speaker thus contributes to spoken-word recognition by providing suprasegmental information about the presence of primary lexical stress. PMID:24134065

  8. How does language model size affect speech recognition accuracy for the Turkish language?

    Directory of Open Access Journals (Sweden)

    Behnam ASEFİSARAY

    2016-05-01

    In this paper we aim at investigating the effect of Language Model (LM) size on Speech Recognition (SR) accuracy. We also provide details of our approach for obtaining the LM for Turkish. Since the LM is obtained by statistical processing of raw text, we expect that increasing the size of the data available for training the LM will improve SR accuracy. Since this study is based on recognition of Turkish, which is a highly agglutinative language, it is important to find out the appropriate size for the training data. The minimum required data size is expected to be much higher than the data needed to train a language model for a language with a low level of agglutination such as English. In the experiments we also tried to adjust the Language Model Weight (LMW) and Active Token Count (ATC) parameters of the LM, as these are expected to be different for a highly agglutinative language. We show that increasing the training data size to an appropriate level improves recognition accuracy, whereas changes to LMW and ATC did not have a positive effect on Turkish speech recognition accuracy.
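
    The size/perplexity relationship can be illustrated with a toy add-one-smoothed bigram model; this stands in for the much larger LMs of the paper, and the miniature Turkish word list is purely illustrative.

```python
from collections import Counter
import math

def bigram_perplexity(train_tokens, test_tokens):
    """Add-one-smoothed bigram perplexity: a small stand-in used only to
    show how perplexity responds to training-set size."""
    unigrams = Counter(train_tokens)
    bigrams = Counter(zip(train_tokens, train_tokens[1:]))
    vocab = len(unigrams) + 1
    log_prob = 0.0
    for prev, word in zip(test_tokens, test_tokens[1:]):
        p = (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab)
        log_prob += math.log(p)
    return math.exp(-log_prob / max(len(test_tokens) - 1, 1))

# Toy Turkish-flavoured corpus; perplexity drops as training size grows.
corpus = "bu bir deneme cümlesi bu da bir başka deneme".split() * 50
test = "bu bir deneme".split()
for frac in (0.1, 0.5, 1.0):
    n = int(len(corpus) * frac)
    print(frac, bigram_perplexity(corpus[:n], test))
```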

  9. Using Mutual Information Criterion to Design an Efficient Phoneme Set for Chinese Speech Recognition

    Science.gov (United States)

    Zhang, Jin-Song; Hu, Xin-Hui; Nakamura, Satoshi

    Chinese is a representative tonal language, and how to process tone information in a state-of-the-art large vocabulary speech recognition system has been an attractive topic. This paper presents a novel way to derive an efficient phoneme set of tone-dependent units for building a recognition system, by iteratively merging pairs of tone-dependent units according to the principle of minimal loss of Mutual Information (MI). The mutual information is measured between the word tokens and their phoneme transcriptions in a training text corpus, based on the system lexicon and language model. The approach has the capability to keep the discriminative tonal (and phoneme) contrasts that are most helpful for disambiguating homophone words that lack tone distinctions, and to merge those tonal (and phoneme) contrasts that are not important for word disambiguation in the recognition task. This enables a flexible selection of the phoneme set according to a balance between the amount of MI and the number of phonemes. We applied the method to the traditional phoneme set of Initials/Finals and derived several phoneme sets with different numbers of units. Speech recognition experiments using the derived sets showed the method's effectiveness.
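
    A toy sketch of the greedy merging loop: repeatedly merge the pair of tone-dependent units whose merge loses the least mutual information between words and units. Here the MI is computed from a simple joint count matrix rather than from a lexicon and language model as in the paper, so this is a structural illustration only.

```python
import numpy as np
from itertools import combinations

def mutual_information(joint):
    """MI (in nats) of a joint count matrix: words (rows) x units (columns)."""
    p = joint / joint.sum()
    px = p.sum(axis=1, keepdims=True)
    py = p.sum(axis=0, keepdims=True)
    nz = p > 0
    return float(np.sum(p[nz] * np.log(p[nz] / (px @ py)[nz])))

def greedy_merge(joint, units, target_size):
    """Merge unit pairs with minimal MI loss until target_size units remain."""
    while len(units) > target_size:
        best = None
        for i, j in combinations(range(len(units)), 2):
            merged = np.delete(joint, j, axis=1)   # j > i, so column i is safe
            merged[:, i] = joint[:, i] + joint[:, j]
            loss = mutual_information(joint) - mutual_information(merged)
            if best is None or loss < best[0]:
                best = (loss, i, j, merged)
        loss, i, j, joint = best
        units[i] = units[i] + "+" + units.pop(j)
        print(f"merged into {units[i]} (MI loss {loss:.4f})")
    return units, joint

# Toy joint counts: 3 words x 4 tonal units (e.g. a1..a4, all assumed).
counts = np.array([[30, 2, 1, 0], [1, 25, 3, 1], [0, 2, 20, 24]], float)
greedy_merge(counts, ["a1", "a2", "a3", "a4"], target_size=2)
```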

  10. Contribution to automatic speech recognition. Analysis of the direct acoustical signal. Recognition of isolated words and phoneme identification

    International Nuclear Information System (INIS)

    This report deals with the acoustic-phonetic step of automatic speech recognition. The parameters used are the extrema of the acoustic signal (coded in amplitude and duration). This coding method, whose properties are described, is simple and well adapted to digital processing. The quality and intelligibility of the coded signal after reconstruction are particularly satisfactory. An experiment on the automatic recognition of isolated words has been carried out using this coding system. We have designed a filtering algorithm operating on the parameters of the coding, from which the characteristics of the formants can be derived under certain conditions, which are discussed. Using these characteristics, the identification of a large part of the phonemes for a given speaker was achieved. Carrying on these studies required the development of a particular methodology of real-time processing which allowed immediate evaluation of program improvements. Such processing on temporal coding of the acoustic signal is extremely powerful and could represent, used in connection with other methods, an efficient tool for the automatic processing of speech. (author)

  11. Recognition of Emotions in Mexican Spanish Speech: An Approach Based on Acoustic Modelling of Emotion-Specific Vowels

    Directory of Open Access Journals (Sweden)

    Santiago-Omar Caballero-Morales

    2013-01-01

    An approach for the recognition of emotions in speech is presented. The target language is Mexican Spanish, and for this purpose a speech database was created. The approach consists of phoneme acoustic modelling of emotion-specific vowels. For this, a standard phoneme-based Automatic Speech Recognition (ASR) system was built with Hidden Markov Models (HMMs), where different phoneme HMMs were built for the consonants and for the emotion-specific vowels associated with four emotional states (anger, happiness, neutral, sadness). Then, estimation of the emotional state from a spoken sentence is performed by counting the number of emotion-specific vowels found in the ASR's output for the sentence. With this approach, accuracy of 87-100% was achieved for the recognition of emotional state of Mexican Spanish speech.

  12. Recognition of Speech of Normal-hearing Individuals with Tinnitus and Hyperacusis

    Directory of Open Access Journals (Sweden)

    Hennig, Tais Regina

    2011-01-01

    Introduction: Tinnitus and hyperacusis are increasingly frequent audiological symptoms that may occur in the absence of hearing impairment, but this does not make their impact on, or annoyance to, the affected individuals any smaller. The medial olivocochlear system helps in the recognition of speech in noise and may be connected to the presence of tinnitus and hyperacusis. Objective: To evaluate the speech recognition of normal-hearing individuals with and without complaints of tinnitus and hyperacusis, and to compare their results. Method: A descriptive, prospective, cross-sectional study in which 19 normal-hearing individuals with complaints of tinnitus and hyperacusis formed the Study Group (SG), and 23 normal-hearing individuals without audiological complaints formed the Control Group (CG). The individuals of both groups were given the List of Sentences in Portuguese test, prepared by Costa (1998), to determine the Sentence Recognition Threshold in Silence (LRSS) and the signal-to-noise (S/N) ratio. The SG also answered the Tinnitus Handicap Inventory for tinnitus analysis, and discomfort thresholds were measured to characterize hyperacusis. Results: The CG and SG presented average LRSS and S/N ratios of 7.34 dB HL and -6.77 dB, and of 7.20 dB HL and -4.89 dB, respectively. Conclusion: The normal-hearing individuals with and without audiological complaints of tinnitus and hyperacusis had similar performance in speech recognition in silence, which was not the case when evaluated in the presence of competing noise, since the SG had lower performance in this communication scenario, with a statistically significant difference.

  13. Dead regions in the cochlea: Implications for speech recognition and applicability of articulation index theory

    DEFF Research Database (Denmark)

    Vestergaard, Martin David

    2003-01-01

    Dead regions in the cochlea have been suggested to be responsible for the failure of hearing aid users to benefit, in terms of speech intelligibility, from apparently increased audibility. As an alternative to the more cumbersome psychoacoustic tuning curve measurement, threshold-equalizing noise (TEN) has been reported to enable the diagnosis of dead regions. The purpose of the present study was first to assess the feasibility of the TEN test protocol, and second to assess the ability of the procedure to reveal related functional impairment. The latter was done with a test for the recognition of low-pass-filtered speech items. Data were collected from 22 hearing-impaired subjects with moderate-to-profound sensorineural hearing losses. The results showed that 11 subjects exhibited abnormal psychoacoustic behaviour in the TEN test, indicative of a possible dead region. Estimates of audibility were used to assess...

  14. Performance of Czech Speech Recognition with Language Models Created from Public Resources

    Directory of Open Access Journals (Sweden)

    V. Prochazka

    2011-12-01

    In this paper, we investigate the usability of publicly available n-gram corpora for the creation of language models (LMs) applicable to Czech speech recognition systems. N-gram LMs with various parameters and settings were created from two publicly available sets, the Czech Web 1T 5-gram corpus provided by Google and a 5-gram corpus obtained from the Czech National Corpus Institute. For comparison, we also tested an LM made from a large private resource of newspaper and broadcast texts collected by a Czech media mining company. The LMs were analyzed and compared from the statistical point of view (mainly via their perplexity rates) and from the performance point of view when employed in large vocabulary continuous speech recognition systems. Our study shows that the Web1T-based LMs, even after intensive cleaning and normalization procedures, cannot compete with those made from smaller but more consistent corpora. The experiments done on large test data also illustrate the impact of Czech as a highly inflective language on the perplexity, OOV, and recognition accuracy rates.

  15. Automatic speech recognition (zero crossing method). Automatic recognition of isolated vowels

    International Nuclear Information System (INIS)

    This note describes a recognition method for isolated vowels using preprocessing of the vocal signal. The processing extracts the extrema of the vocal signal and the time intervals separating them (zero-crossing distances of the first derivative of the signal). The recognition of vowels uses normalized histograms of the values of these intervals. The program determines a distance between the histogram of the sound to be recognized and model histograms built during a learning phase. The results, processed in real time by a minicomputer, are relatively independent of the speaker, provided the fundamental frequency does not vary too much (i.e., speakers of the same sex). (author)
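
    A minimal sketch of the described pipeline: extract extrema (sign changes of the first derivative), histogram the intervals between them, and classify by distance to model histograms. The L1 distance and the bin layout are assumptions, since the note does not specify them.

```python
import numpy as np

def interval_histogram(signal, fs, bins):
    """Normalized histogram of the time intervals between successive extrema
    of the signal (zero crossings of its first derivative)."""
    d = np.diff(signal)
    sign_change = np.diff(np.signbit(d).astype(np.int8)) != 0
    extrema = np.where(sign_change)[0] + 1
    intervals = np.diff(extrema) / fs
    hist, _ = np.histogram(intervals, bins=bins)
    return hist / max(hist.sum(), 1)

def classify_vowel(signal, fs, models, bins):
    """Nearest model histogram wins; the L1 distance is an assumption."""
    h = interval_histogram(signal, fs, bins)
    return min(models, key=lambda vowel: np.abs(models[vowel] - h).sum())

# Toy usage on a synthetic vowel-like signal (F0 plus one "formant").
fs = 8000
t = np.arange(0, 0.05, 1 / fs)
sig = np.sin(2 * np.pi * 300 * t) + 0.3 * np.sin(2 * np.pi * 2300 * t)
print(interval_histogram(sig, fs, bins=np.linspace(0, 0.005, 20)))
```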

  16. A Digital Liquid State Machine With Biologically Inspired Learning and Its Application to Speech Recognition.

    Science.gov (United States)

    Zhang, Yong; Li, Peng; Jin, Yingyezhe; Choe, Yoonsuck

    2015-11-01

    This paper presents a bioinspired digital liquid-state machine (LSM) for low-power very-large-scale-integration (VLSI)-based machine learning applications. To the best of the authors' knowledge, this is the first work that employs a bioinspired spike-based learning algorithm for the LSM. With the proposed online learning, the LSM extracts information from input patterns on the fly without needing intermediate data storage as required in offline learning methods such as ridge regression. The proposed learning rule is local such that each synaptic weight update is based only upon the firing activities of the corresponding presynaptic and postsynaptic neurons without incurring global communications across the neural network. Compared with the backpropagation-based learning, the locality of computation in the proposed approach lends itself to efficient parallel VLSI implementation. We use subsets of the TI46 speech corpus to benchmark the bioinspired digital LSM. To reduce the complexity of the spiking neural network model without performance degradation for speech recognition, we study the impacts of synaptic models on the fading memory of the reservoir and hence the network performance. Moreover, we examine the tradeoffs between synaptic weight resolution, reservoir size, and recognition performance and present techniques to further reduce the overhead of hardware implementation. Our simulation results show that in terms of isolated word recognition evaluated using the TI46 speech corpus, the proposed digital LSM rivals the state-of-the-art hidden Markov-model-based recognizer Sphinx-4 and outperforms all other reported recognizers including the ones that are based upon the LSM or neural networks. PMID:25643415

  17. Using vector Taylor series with noise clustering for speech recognition in non-stationary noisy environments

    Institute of Scientific and Technical Information of China (English)

    2006-01-01

    The performance of an automatic speech recognizer degrades seriously when there are mismatches between the training and testing conditions. The Vector Taylor Series (VTS) approach has been used to compensate for mismatches caused by additive noise and convolutive channel distortion in the cepstral domain. In this paper, the conventional VTS is extended by incorporating noise clustering into its EM iteration procedure, improving its compensation effectiveness under non-stationary noisy environments. Recognition experiments under babble and exhibition noisy environments demonstrate that the new algorithm achieves a 35% average error rate reduction compared with the conventional VTS.

  18. Non-linear Spectral Contrast Stretching for In-car Speech Recognition

    OpenAIRE

    Li, Weifeng; Bourlard, Hervé

    2007-01-01

    In this paper, we present a novel feature normalization method in the log-scaled spectral domain for improving the noise robustness of speech recognition front-ends. In the proposed scheme, a non-linear contrast stretching is added to the outputs of log mel-filterbanks (MFB) to imitate the adaptation of the auditory system under adverse conditions. This is followed by a two-dimensional filter to smooth out the processing artifacts. The proposed MFCC front-ends perform remarkably well on CENSR...

  19. CAR2 - Czech Database of Car Speech

    Directory of Open Access Journals (Sweden)

    P. Sovka

    1999-12-01

    This paper presents a new Czech-language two-channel (stereo) speech database recorded in a car environment. The database was designed for experiments with speech enhancement for communication purposes and for the study and design of robust speech recognition systems. Tools for automated phoneme labelling based on Baum-Welch re-estimation were realised. A noise analysis of the car background environment was performed.

  20. High-order hidden Markov model for piecewise linear processes and applications to speech recognition.

    Science.gov (United States)

    Lee, Lee-Min; Jean, Fu-Rong

    2016-08-01

    The hidden Markov models have been widely applied to systems with sequential data. However, the conditional independence of the state outputs will limit the output of a hidden Markov model to be a piecewise constant random sequence, which is not a good approximation for many real processes. In this paper, a high-order hidden Markov model for piecewise linear processes is proposed to better approximate the behavior of a real process. A parameter estimation method based on the expectation-maximization algorithm was derived for the proposed model. Experiments on speech recognition of noisy Mandarin digits were conducted to examine the effectiveness of the proposed method. Experimental results show that the proposed method can reduce the recognition error rate compared to a baseline hidden Markov model. PMID:27586781

  1. Text Detection and Recognition with Speech Output for Visually Challenged Person: A Review

    Directory of Open Access Journals (Sweden)

    Ms. Rupali D. Dharmale

    2015-03-01

    Reading text from scenes, images, and text boards is a demanding task for visually challenged persons. This task has been proposed to be carried out with the help of image processing. Image processing has long assisted the field of object recognition and is still an emerging area of research. The proposed system reads the text encountered in images and on text boards with the aim of supporting visually challenged persons. Text detection and recognition in natural scenes can provide valuable information for many applications. In this work, an approach is presented to extract and recognize text from scene images and convert the recognized text into speech. This capability can be an empowering force in a visually challenged person's life and can help relieve the frustration of not being able to read, thus enhancing quality of life.

  2. Analysis of speech under stress using Linear techniques and Non-Linear techniques for emotion recognition system

    OpenAIRE

    A. A. Khulage; B. V. Pathak

    2012-01-01

    Analysis of speech for the recognition of stress is important for identifying the emotional state of a person. This can be done using linear techniques, which use parameters such as pitch, vocal tract spectrum, formant frequencies, duration, and MFCCs for the extraction of features from speech. TEO-CB-Auto-Env is a non-linear feature extraction method. Analysis is done using the TU-Berlin (Technical University of Berlin) German database. Here e...

  3. Automated Fourier space region-recognition filtering for off-axis digital holographic microscopy

    CERN Document Server

    He, Xuefei; Pratap, Mrinalini; Zheng, Yujie; Wang, Yi; Nisbet, David R; Williams, Richard J; Rug, Melanie; Maier, Alexander G; Lee, Woei Ming

    2016-01-01

    Automated label-free quantitative imaging of biological samples can greatly benefit high-throughput disease diagnosis. Digital holographic microscopy (DHM) is a powerful quantitative label-free imaging tool that retrieves structural details of cellular samples non-invasively. In off-axis DHM, a proper spatial filtering window in Fourier space is crucial to the quality of the reconstructed phase image. Here we describe a region-recognition approach that combines shape recognition with iterative thresholding to extract the optimal shape of frequency components. The region-recognition technique offers fully automated adaptive filtering that can operate with a variety of samples and imaging conditions. When imaging through an optically scattering biological hydrogel matrix, the technique surpasses previous histogram thresholding techniques without requiring any manual intervention. Finally, we automate the extraction of the statistical difference of optical height between malaria parasite infected and uninfected re...
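
    A sketch of the threshold/region part of such a pipeline, assuming an isodata-style iterative threshold on the log spectrum followed by connected-component selection; the published method adds shape recognition on top of this step, which is not reproduced here.

```python
import numpy as np
from scipy import ndimage

def auto_fourier_window(hologram, n_iter=50):
    """Pick a Fourier-space filtering window for an off-axis hologram by
    iterative thresholding plus largest-off-axis-region selection."""
    spec = np.fft.fftshift(np.fft.fft2(hologram))
    mag = np.log1p(np.abs(spec))
    t = mag.mean()
    for _ in range(n_iter):                      # isodata-style thresholding
        lo, hi = mag[mag <= t], mag[mag > t]
        if lo.size == 0 or hi.size == 0:
            break
        t_new = 0.5 * (lo.mean() + hi.mean())
        if abs(t_new - t) < 1e-6:
            break
        t = t_new
    labels, _ = ndimage.label(mag > t)
    sizes = np.bincount(labels.ravel())
    sizes[0] = 0                                 # ignore background
    dc_label = labels[tuple(s // 2 for s in mag.shape)]
    if dc_label:
        sizes[dc_label] = 0                      # suppress the central DC term
    return labels == sizes.argmax()              # boolean filtering window
```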

  4. An exploration of the potential of Automatic Speech Recognition to assist and enable receptive communication in higher education

    Directory of Open Access Journals (Sweden)

    Mike Wald

    2006-12-01

    The potential use of Automatic Speech Recognition to assist receptive communication is explored. The paper also discusses and evaluates the opportunities and challenges this technology presents to students and staff: providing captioning of speech online or in classrooms for deaf or hard-of-hearing students, and assisting blind, visually impaired, or dyslexic learners to read and search learning material more readily by augmenting synthetic speech with naturally recorded real speech. The automatic provision of online lecture notes, synchronised with speech, enables staff and students to focus on learning and teaching issues, while also benefiting learners unable to attend the lecture or who find it difficult or impossible to take notes at the same time as listening, watching and thinking.

  5. Compensating Acoustic Mismatch Using Class-Based Histogram Equalization for Robust Speech Recognition

    Science.gov (United States)

    Suh, Youngjoo; Kim, Sungtak; Kim, Hoirin

    2007-12-01

    A new class-based histogram equalization method is proposed for robust speech recognition. The proposed method aims at not only compensating for an acoustic mismatch between training and test environments but also reducing the two fundamental limitations of the conventional histogram equalization method: the discrepancy between the phonetic distributions of training and test speech data, and the nonmonotonic transformation caused by the acoustic mismatch. The algorithm employs multiple class-specific reference and test cumulative distribution functions, classifies noisy test features into their corresponding classes, and equalizes the features by using their corresponding class reference and test distributions. The minimum mean-square error log-spectral amplitude (MMSE-LSA)-based speech enhancement is added just prior to the baseline feature extraction to reduce the corruption by additive noise. The experiments on the Aurora2 database proved the effectiveness of the proposed method, reducing relative errors over both the mel-cepstral-based features and the conventional histogram equalization method.
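
    A sketch of the per-class equalization step for a single feature dimension, assuming the frame-to-class assignments and the reference class quantiles are already available (the paper obtains the class labels by classifying the noisy features, a step omitted here).

```python
import numpy as np

def class_hist_eq(test_feats, classes, ref_quantiles, n_q=100):
    """Class-based histogram equalization for one feature dimension.

    test_feats    : (T,) float array, one feature dimension of the utterance.
    classes       : (T,) int array, class index per frame (assumed given).
    ref_quantiles : dict mapping class -> n_q reference quantile values
                    computed from clean training data (assumed given).
    """
    q = np.linspace(0.0, 100.0, n_q)
    out = np.empty_like(test_feats, dtype=float)
    for c, ref_q in ref_quantiles.items():
        mask = classes == c
        if not mask.any():
            continue
        test_q = np.percentile(test_feats[mask], q)
        # Rank each value under the class test CDF, then read the reference CDF.
        ranks = np.interp(test_feats[mask], test_q, q)
        out[mask] = np.interp(ranks, q, ref_q)
    return out

# Toy usage: two classes whose reference distributions differ in mean.
rng = np.random.default_rng(7)
feats = rng.normal(0.0, 1.0, 200)
labels = (rng.random(200) > 0.5).astype(int)
refs = {0: np.percentile(rng.normal(-1, 1, 5000), np.linspace(0, 100, 100)),
        1: np.percentile(rng.normal(+1, 1, 5000), np.linspace(0, 100, 100))}
print(class_hist_eq(feats, labels, refs)[:5])
```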

  6. A Rapid Model Adaptation Technique for Emotional Speech Recognition with Style Estimation Based on Multiple-Regression HMM

    Science.gov (United States)

    Ijima, Yusuke; Nose, Takashi; Tachibana, Makoto; Kobayashi, Takao

    In this paper, we propose a rapid model adaptation technique for emotional speech recognition which enables us to extract paralinguistic information as well as linguistic information contained in speech signals. This technique is based on style estimation and style adaptation using a multiple-regression HMM (MRHMM). In the MRHMM, the mean parameters of the output probability density function are controlled by a low-dimensional parameter vector, called a style vector, which corresponds to a set of the explanatory variables of the multiple regression. The recognition process consists of two stages. In the first stage, the style vector that represents the emotional expression category and the intensity of its expressiveness for the input speech is estimated on a sentence-by-sentence basis. Next, the acoustic models are adapted using the estimated style vector, and then standard HMM-based speech recognition is performed in the second stage. We assess the performance of the proposed technique in the recognition of simulated emotional speech uttered by both professional narrators and non-professional speakers.

  7. Understanding speech in interactive narratives with crowd sourced data

    OpenAIRE

    Orkin, Jeff; Roy, Deb K.

    2012-01-01

    Speech recognition failures and limited vocabulary coverage pose challenges for speech interactions with characters in games. We describe an end-to-end system for automating characters from a large corpus of recorded human game logs, and demonstrate that inferring utterance meaning through a combination of plan recognition and surface-text similarity compensates for recognition and understanding failures significantly better than relying on surface similarity alone.

  8. Towards an Intelligent Acoustic Front End for Automatic Speech Recognition: Built-in Speaker Normalization

    Directory of Open Access Journals (Sweden)

    Umit H. Yapanel

    2008-08-01

    A proven method for achieving effective automatic speech recognition (ASR) in the face of speaker differences is to perform acoustic feature speaker normalization. More effective speaker normalization methods are needed which require limited computing resources for real-time performance. The most popular speaker normalization technique is vocal-tract length normalization (VTLN), despite the fact that it is computationally expensive. In this study, we propose a novel online VTLN algorithm entitled built-in speaker normalization (BISN), where normalization is performed on-the-fly within a newly proposed PMVDR acoustic front end. The novel algorithmic aspect is that conventional front-end processing with PMVDR and VTLN needs two separate warping phases, while the proposed BISN method uses only one single speaker-dependent warp to achieve both the PMVDR perceptual warp and the VTLN warp simultaneously. This improved integration unifies the nonlinear warping performed in the front end and reduces computational requirements, thereby offering advantages for real-time ASR systems. Evaluations are performed (i) on an in-car extended digit recognition task, where an on-the-fly BISN implementation reduces the relative word error rate (WER) by 24%, and (ii) on a diverse noisy speech task (SPINE 2), where the relative WER improvement was 9%, both relative to the baseline speaker normalization method.
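
    The BISN contribution is folding the warp into the PMVDR front end; the warp itself is typically a piecewise-linear VTLN-style mapping with one speaker-dependent factor, sketched below. The breakpoint at 0.85 of the Nyquist frequency is a common choice assumed here, not taken from the paper.

```python
import numpy as np

def warp_frequencies(freqs, alpha, f_nyquist):
    """Piecewise-linear VTLN-style frequency warp with one speaker-dependent
    factor alpha (alpha < 1 compresses, alpha > 1 stretches). The second
    segment bends back so that the Nyquist frequency maps to itself."""
    f0 = 0.85 * f_nyquist                      # breakpoint (assumed)
    return np.where(
        freqs <= f0,
        alpha * freqs,
        alpha * f0 + (f_nyquist - alpha * f0) / (f_nyquist - f0) * (freqs - f0),
    )

# Warp the mel filterbank centre frequencies for one speaker.
centres = np.linspace(0, 8000, 24)
print(warp_frequencies(centres, alpha=0.94, f_nyquist=8000.0))
```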

  10. Hierarchical polynomial network approach to automated target recognition

    Science.gov (United States)

    Kim, Richard Y.; Drake, Keith C.; Kim, Tony Y.

    1994-02-01

    A hierarchical recognition methodology using abductive networks at several levels of object recognition is presented. Abductive networks--an innovative numeric modeling technology using networks of polynomial nodes--result from nearly three decades of application research and development in areas including statistical modeling, uncertainty management, genetic algorithms, and traditional neural networks. The system uses pixel-registered multisensor target imagery provided by the Tri-Service Laser Radar sensor. Several levels of recognition are performed using detection, classification, and identification, each providing more detailed object information. Advanced feature extraction algorithms are applied at each recognition level for target characterization. Abductive polynomial networks process feature information and situational data at each recognition level, providing input for the next level of processing. An expert system coordinates the activities of the individual recognition modules and enables the employment of heuristic knowledge to overcome the limitations imposed by a purely numeric processing approach. The approach can potentially overcome limitations of current systems, such as catastrophic degradation during unanticipated operating conditions, while meeting strict processing requirements. These benefits result from the implementation of robust feature extraction algorithms that do not take explicit advantage of peculiar characteristics of the sensor imagery, and from the compact, real-time processing capability provided by abductive polynomial networks.
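
    As an illustration of the building block involved, the sketch below fits a single GMDH-style two-input quadratic node by least squares; abductive networks stack layers of such nodes, with each node's output feeding the next layer. The data and function names here are invented for the example.

      import numpy as np

      # One polynomial node of the kind abductive/GMDH networks stack into layers.
      def poly_node_features(x1, x2):
          """Two-input quadratic basis: 1, x1, x2, x1*x2, x1^2, x2^2."""
          return np.column_stack([np.ones_like(x1), x1, x2, x1 * x2, x1**2, x2**2])

      def fit_poly_node(x1, x2, target):
          """Fit the node's coefficients by ordinary least squares."""
          A = poly_node_features(x1, x2)
          coef, *_ = np.linalg.lstsq(A, target, rcond=None)
          return coef

      rng = np.random.default_rng(1)
      x1, x2 = rng.normal(size=200), rng.normal(size=200)
      target = 0.5 * x1 * x2 + x1**2 + rng.normal(scale=0.1, size=200)
      coef = fit_poly_node(x1, x2, target)
      pred = poly_node_features(x1, x2) @ coef     # node output feeds the next layer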

  11. Design of an Automated Secure Garage System Using License Plate Recognition Technique

    OpenAIRE

    Afaz Uddin Ahmed; Taufiq Mahmud Masum; Mohammad Mahbubur Rahman

    2014-01-01

    Modern technologies have reached our garages to secure cars and residential entrances, driven by the demand for high security and automated infrastructure. The concept of intelligent secure garage systems in modern transport management is a remarkable example of computer-interfaced controlling devices. The License Plate Recognition (LPR) process is one of the key elements of modern intelligent garage security setups. This paper presents a design of an automated secure garage system fea...

  12. Extraction of Prostatic Lumina and Automated Recognition for Prostatic Calculus Image Using PCA-SVM

    OpenAIRE

    D. Joshua Liao; Yusheng Huang; Xiaofen Xing; Hua Wang; Jian Liu; Hui Xiao; Zhuocai Wang; Xiaojun Ding; Xiangmin Xu

    2011-01-01

    Identification of prostatic calculi is an important basis for determining their tissue origin. Computer-assisted diagnosis of prostatic calculi may have promising potential but is currently understudied. We studied the extraction of prostatic lumina and automated recognition of calculus images. Extraction of lumina from prostate histology images was based on local entropy and Otsu thresholding; recognition used PCA-SVM based on the texture features of prostatic calculi. The SVM cla...

  13. ANALYSIS OF SPEECH UNDER STRESS USING LINEAR TECHNIQUES AND NON-LINEAR TECHNIQUES FOR EMOTION RECOGNITION SYSTEM

    Directory of Open Access Journals (Sweden)

    A. A. Khulage

    2012-07-01

    Full Text Available Analysis of speech for recognition of stress is important for identifying the emotional state of a person. This can be done using 'linear techniques', which use parameters such as pitch, vocal tract spectrum, formant frequencies, duration and MFCC for extracting features from speech. TEO-CB-Auto-Env is a non-linear feature extraction method. Analysis is done using the TU-Berlin (Technical University of Berlin) German database. Here, emotion recognition is performed for different emotions: neutral, happy, disgust, sad, boredom and anger. Emotion recognition is used in lie detectors, in database access systems, and in the military for identifying soldiers' emotional state during war.

  14. Psychometric Functions for Shortened Administrations of a Speech Recognition Approach Using Tri-Word Presentations and Phonemic Scoring

    Science.gov (United States)

    Gelfand, Stanley A.; Gelfand, Jessica T.

    2012-01-01

    Method: Complete psychometric functions for phoneme and word recognition scores at 8 signal-to-noise ratios from -15 dB to 20 dB were generated for the first 10, 20, and 25, as well as all 50, three-word presentations of the Tri-Word or Computer Assisted Speech Recognition Assessment (CASRA) Test (Gelfand, 1998) based on the results of 12…

  15. ZIGBEE BASED VOICE RECOGNITION WIRELESS HOME AUTOMATION SYSTEM

    Directory of Open Access Journals (Sweden)

    Akshay Sharma

    2014-07-01

    Full Text Available In past years, home automation systems have seen revolutionary changes due to the introduction of various wireless technologies. This revolution in wireless technology has seen the unveiling of many standards, especially in the industrial, scientific and medical (ISM) radio band. ZigBee is an IEEE 802.15.4 standard for data communications with business and consumer devices. The key to effective energy efficiency in the home is the employment of home automation. Home automation may include centralized control of lighting, HVAC (heating, ventilation and air conditioning), appliances, security locks of gates and doors, and other systems, to provide improved convenience, comfort, energy efficiency and security. Nowadays, such systems have to be updated with the rapid changes in technology to ensure wide coverage, remote control, reliability, and real-time operation. Unfolding wireless technologies for security and control in home automation systems offers attractive benefits along with a user-friendly interface. The proposed system consists of a control console interfaced with different relays using the ZigBee module Tarang F4 and an AVR microcontroller, a low-power, high-performance CMOS 8-bit microcontroller with 16K bytes of in-system programmable Flash memory. We then analyze the data rate and coverage area of ZigBee technology in indoor and outdoor applications.

  16. A Novel Algorithm for Acoustic and Visual Classifiers Decision Fusion in Audio-Visual Speech Recognition System

    Directory of Open Access Journals (Sweden)

    P.S. Sathidevi

    2010-03-01

    Full Text Available Audio-visual speech recognition (AVSR) using acoustic and visual signals of speech has received attention recently because of its robustness in noisy environments. Perceptual studies also support this approach by emphasizing the importance of visual information for speech recognition in humans. An important issue in decision-fusion-based AVSR systems is how to obtain the appropriate integration weight for the speech modalities, so that the combined AVSR system performs better than the audio-only and visual-only systems under various noise conditions. To solve this issue, we present a genetic algorithm (GA) based optimization scheme to obtain the appropriate integration weight from the relative reliability of each modality. The performance of the proposed GA-optimized, reliability-ratio-based weight estimation scheme is demonstrated via single-speaker, mobile-functions isolated-word recognition experiments. The results show that the proposed scheme improves robust recognition accuracy over the conventional unimodal systems and the baseline reliability-ratio-based AVSR system under various signal-to-noise ratio conditions.
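
    The fusion rule at the heart of such systems can be sketched compactly: audio and visual classifier scores are combined with an integration weight. In the sketch below, a simple grid search over the weight stands in for the paper's GA-and-reliability-ratio scheme, and the log-likelihoods are synthetic; none of this is the authors' implementation.

      import numpy as np

      # Decision fusion for AVSR: combine audio and visual log-likelihoods
      # with an integration weight lam chosen to maximize accuracy.
      def fuse(audio_ll, visual_ll, lam):
          """audio_ll, visual_ll: (n_samples, n_classes) log-likelihoods."""
          return lam * audio_ll + (1.0 - lam) * visual_ll

      def accuracy(scores, labels):
          return np.mean(np.argmax(scores, axis=1) == labels)

      def best_weight(audio_ll, visual_ll, labels, grid=np.linspace(0, 1, 101)):
          """Grid search standing in for the paper's genetic algorithm."""
          return max(grid, key=lambda lam: accuracy(fuse(audio_ll, visual_ll, lam), labels))

      rng = np.random.default_rng(2)
      labels = rng.integers(0, 10, size=500)          # isolated-word labels
      audio_ll = rng.normal(size=(500, 10))
      audio_ll[np.arange(500), labels] += 1.0         # audio is informative...
      visual_ll = rng.normal(size=(500, 10))
      visual_ll[np.arange(500), labels] += 0.5        # ...visual somewhat less so
      lam = best_weight(audio_ll, visual_ll, labels)  # reliability-dependent weight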

  17. Computer-Mediated Input, Output and Feedback in the Development of L2 Word Recognition from Speech

    Science.gov (United States)

    Matthews, Joshua; Cheng, Junyu; O'Toole, John Mitchell

    2015-01-01

    This paper reports on the impact of computer-mediated input, output and feedback on the development of second language (L2) word recognition from speech (WRS). A quasi-experimental pre-test/treatment/post-test research design was used involving three intact tertiary level English as a Second Language (ESL) classes. Classes were either assigned to…

  18. Investigating an Application of Speech-to-Text Recognition: A Study on Visual Attention and Learning Behaviour

    Science.gov (United States)

    Huang, Y-M.; Liu, C-J.; Shadiev, Rustam; Shen, M-H.; Hwang, W-Y.

    2015-01-01

    One major drawback of previous research on speech-to-text recognition (STR) is that most findings showing the effectiveness of STR for learning were based upon subjective evidence. Very few studies have used eye-tracking techniques to investigate visual attention of students on STR-generated text. Furthermore, not much attention was paid to…

  19. Object Type Recognition for Automated Analysis of Protein Subcellular Location

    OpenAIRE

    Zhao, Ting; Velliste, Meel; Boland, Michael V.; Murphy, Robert F.

    2005-01-01

    The new field of location proteomics seeks to provide a comprehensive, objective characterization of the subcellular locations of all proteins expressed in a given cell type. Previous work has demonstrated that automated classifiers can recognize the patterns of all major subcellular organelles and structures in fluorescence microscope images with high accuracy. However, since some proteins may be present in more than one organelle, this paper addresses a more difficult task: recognizing a pa...

  20. Robust Speaker Recognition against Synthetic Speech

    Institute of Scientific and Technical Information of China (English)

    陈联武; 郭武; 戴礼荣

    2011-01-01

    With the development of hidden Markov model (HMM) based speech synthesis technology, it is easy for impostors to produce synthetic speech with a target speaker's characteristics, which poses an enormous threat to existing speaker recognition systems. In this paper, the difference between natural speech and synthetic speech is investigated statistically on the real cepstrum, and a speaker recognition system robust against synthetic speech is proposed. Preliminary experimental results show that, with the equal error rate (EER) for natural speech unchanged, the false accept rate (FAR) for synthetic speech drops from 99.2% in a conventional speaker recognition system to 0% in the proposed system.

  1. Testing of a Composite Wavelet Filter to Enhance Automated Target Recognition in SONAR

    Science.gov (United States)

    Chiang, Jeffrey N.

    2011-01-01

    Automated Target Recognition (ATR) systems aim to automate target detection, recognition, and tracking. The current project applies a JPL ATR system to low resolution SONAR and camera videos taken from Unmanned Underwater Vehicles (UUVs). These SONAR images are inherently noisy and difficult to interpret, and pictures taken underwater are unreliable due to murkiness and inconsistent lighting. The ATR system breaks target recognition into three stages: 1) Videos of both SONAR and camera footage are broken into frames and preprocessed to enhance images and detect Regions of Interest (ROIs). 2) Features are extracted from these ROIs in preparation for classification. 3) ROIs are classified as true or false positives using a standard Neural Network based on the extracted features. Several preprocessing, feature extraction, and training methods are tested and discussed in this report.

  2. Wifi-based Indoor Navigation with Mobile GIS and Speech Recognition

    Directory of Open Access Journals (Sweden)

    Jiangfan Feng

    2012-11-01

    Full Text Available With the development of mobile communications and wireless networking technology, and people's increasing demand for wireless location services, positioning services have become increasingly important and a focus of research. The most common technology supporting outdoor positioning services is the global positioning system (GPS), but GPS is subject to restrictions in complex environments. WLAN (Wireless Local Area Network) technology has matured in recent years, so Wi-Fi-based positioning can assist GPS. In this paper, we describe two positioning techniques based on WLAN: triangulation and fingerprinting. Additionally, a method combining speech recognition with a spatial data model is presented. The experimental results are encouraging and indicate that the proposed approach is effective.
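
    Of the two techniques, fingerprinting is the easier to sketch: an offline radio map of RSSI vectors at known positions, then nearest-neighbour matching online. All access-point counts, readings and positions below are invented for illustration.

      import numpy as np

      # Minimal Wi-Fi fingerprinting sketch: offline radio map + k-NN matching.
      radio_map_rssi = np.array([   # offline phase: RSSI from 3 APs at 4 spots (dBm)
          [-40, -70, -80],
          [-55, -55, -75],
          [-70, -45, -60],
          [-80, -60, -45],
      ])
      radio_map_pos = np.array([[0, 0], [5, 0], [10, 0], [10, 5]])  # metres

      def locate(rssi, k=2):
          """Estimate position as the centroid of the k closest fingerprints."""
          dists = np.linalg.norm(radio_map_rssi - rssi, axis=1)
          nearest = np.argsort(dists)[:k]
          return radio_map_pos[nearest].mean(axis=0)

      print(locate(np.array([-50, -58, -76])))   # online phase: one RSSI reading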

  3. Improving the Syllable-Synchronous Network Search Algorithm for Word Decoding in Continuous Chinese Speech Recognition

    Institute of Scientific and Technical Information of China (English)

    郑方; 武健; 宋战江

    2000-01-01

    The previously proposed syllable-synchronous network search (SSNS) algorithm plays a very important role in the word decoding of continuous Chinese speech recognition and achieves satisfying performance. Several related key factors that may affect the overall word decoding performance are carefully studied in this paper, including the perfecting of the vocabulary, the big-discount Turing re-estimation of the N-gram probabilities, and the management of the search-path buffers. Based on these discussions, corresponding approaches to improving the SSNS algorithm are proposed. Compared with the previous version of the SSNS algorithm, the new version decreases the Chinese character error rate (CCER) in word decoding by 42.1% across a database consisting of a large number of test sentences (syllable strings).

  4. Data Collection in Zooarchaeology: Incorporating Touch-Screen, Speech-Recognition, Barcodes, and GIS

    Directory of Open Access Journals (Sweden)

    W. Flint Dibble

    2015-12-01

    Full Text Available When recording observations on specimens, zooarchaeologists typically use a pen and paper or a keyboard. However, the use of awkward terms and identification codes when recording thousands of specimens makes such data entry prone to human transcription errors. Improving the quantity and quality of the zooarchaeological data we collect can lead to more robust results and new research avenues. This paper presents design tools for building a customized zooarchaeological database that leverages accessible and affordable 21st-century technologies. Scholars interested in investing time in designing a custom database in common software (here, Microsoft Access) can take advantage of the affordable touch-screen, speech-recognition, and geographic information system (GIS) technologies described here. The efficiency that these approaches offer a research project far exceeds the time commitment a scholar must invest to deploy them.

  5. Hindi Digits Recognition System on Speech Data Collected in Different Natural Noise Environments

    Directory of Open Access Journals (Sweden)

    Babita Saxena

    2015-02-01

    Full Text Available This paper presents a baseline digits speech recognizer for the Hindi language. The recording environment differs across speakers, since the data were collected in their respective homes. The different environments include vehicle horn noise in some road-facing rooms, internal background noise in some rooms (such as opening doors), and silence in others. All these recordings are used for training the acoustic model. The acoustic model is trained on audio data from 8 speakers. The vocabulary size of the recognizer is 10 words. The HTK toolkit is used for building the acoustic model and evaluating the recognition rate of the recognizer. The efficiency of the recognizer developed on the recorded data is shown at the end of the paper, and possible directions for future research are suggested.

  6. Classifier Subset Selection for the Stacked Generalization Method Applied to Emotion Recognition in Speech

    Directory of Open Access Journals (Sweden)

    Aitor Álvarez

    2015-12-01

    Full Text Available In this paper, a new supervised classification paradigm, called classifier subset selection for stacked generalization (CSS stacking), is presented to deal with speech emotion recognition. The new approach improves a bi-level multi-classifier system known as stacked generalization by integrating an estimation of distribution algorithm (EDA) in the first layer to select the optimal subset of the standard base classifiers. The good performance of the proposed new paradigm was demonstrated over different configurations and datasets. First, several CSS stacking classifiers were constructed on the RekEmozio dataset, using some specific standard base classifiers and a total of 123 spectral, quality and prosodic features computed using in-house feature extraction algorithms. These initial CSS stacking classifiers were compared to other multi-classifier systems and to the employed standard classifiers built on the same set of speech features. Then, new CSS stacking classifiers were built on RekEmozio using a different set of both acoustic parameters (the extended version of the Geneva Minimalistic Acoustic Parameter Set, eGeMAPS) and standard classifiers, employing the best meta-classifier of the initial experiments. The performance of these two CSS stacking classifiers was evaluated and compared. Finally, the new paradigm was tested on the well-known Berlin Emotional Speech database. We compared the performance of single, standard stacking and CSS stacking systems using the same parametrization of the second phase. All of the classifications were performed at the categorical level, including the six primary emotions plus the neutral one.
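
    The core selection loop can be sketched with scikit-learn's stacking implementation. Here an exhaustive search over a tiny pool of base classifiers stands in for the EDA, and the data are synthetic rather than the RekEmozio features; this is a sketch of the idea, not the authors' system.

      from itertools import combinations

      import numpy as np
      from sklearn.datasets import make_classification
      from sklearn.ensemble import StackingClassifier
      from sklearn.linear_model import LogisticRegression
      from sklearn.model_selection import cross_val_score
      from sklearn.naive_bayes import GaussianNB
      from sklearn.neighbors import KNeighborsClassifier
      from sklearn.tree import DecisionTreeClassifier

      # Score each candidate subset of base classifiers by cross-validation
      # and keep the best-performing stacked ensemble.
      X, y = make_classification(n_samples=300, n_features=20, n_classes=3,
                                 n_informative=8, random_state=0)
      pool = [("nb", GaussianNB()),
              ("knn", KNeighborsClassifier()),
              ("tree", DecisionTreeClassifier(random_state=0))]

      best_score, best_subset = -np.inf, None
      for r in (2, 3):
          for subset in combinations(pool, r):
              stack = StackingClassifier(estimators=list(subset),
                                         final_estimator=LogisticRegression(max_iter=1000))
              score = cross_val_score(stack, X, y, cv=3).mean()
              if score > best_score:
                  best_score, best_subset = score, subset
      print(best_score, [name for name, _ in best_subset])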

  7. Speech Based Question Recognition of Interactive Ubiquitous Teaching Robot using Supervised Classifier

    Directory of Open Access Journals (Sweden)

    Umarani S D

    2011-06-01

    Full Text Available This paper presents research aimed at designing a speech-based interactive ubiquitous teaching robot for quality enhancement of higher education, ensuring better performance by enabling interaction between the robot teacher and the student/learner. Students/learners can ask technical questions, to which the robot replies; the replies are limited to the answers available in the database. The proposed speech-based interactive teaching system has two phases: training and testing. A feed-forward neural network is used as a supervised classifier after it is trained by means of a back-propagation (BP) algorithm using features of the sound signal from the student/learner. Recognition of in-database and out-of-database queries is presented with a receiver operating characteristic (ROC) curve. The ROC curve lies above the diagonal of the ROC plot, indicating that the proposed method of interaction between robot and learners is acceptable. Question recognition accuracy reached 74%.

  8. Influence of tinnitus on the percentage index of speech recognition in patients with normal hearing

    Directory of Open Access Journals (Sweden)

    Urnau, Daila

    2010-12-01

    Full Text Available Introduction: The understanding of speech is one of the most important measurable aspects of human auditory function. Tinnitus affects quality of life, impairing communication. Objective: To investigate possible changes in the Percentage Index of Speech Recognition (SDT) in individuals with tinnitus who have normal hearing, and to examine the relationship between tinnitus, gender and age. Methods: A retrospective study analyzed the records of 82 individuals of both genders, aged 21-70 years, totaling 128 ears with normal hearing. The ears were analyzed separately and divided into a control group, without complaints of tinnitus, and a study group, with complaints of tinnitus. The variables gender and age group were examined, along with the influence of tinnitus on the SDT. A score of 100% correct was considered normal, and values between 88% and 96% altered; these criteria were adopted because percentages below 88% correct are found in individuals with sensorineural hearing loss. Results: There was no statistically significant association between age and tinnitus, or between tinnitus and the SDT, only between gender and tinnitus. There was a higher prevalence of tinnitus in females (56%), a higher incidence of tinnitus in the 31-40 years age group (41.67%) and a lower incidence at 41-50 years (18.75%), and a greater percentage of altered SDT in individuals with tinnitus (61.11%). Conclusion: Tinnitus does not interfere with the SDT, and there is no relationship between tinnitus and age, only between tinnitus and gender.

  9. Silent Speech Recognition with Arabic and English Words for Vocally Disabled Persons

    Directory of Open Access Journals (Sweden)

    Sami Nassimi

    2014-05-01

    Full Text Available This paper presents the results of our research in silent speech recognition (SSR) using surface electromyography (sEMG), the technology of recording the electric activation potentials of the human articulatory muscles with surface electrodes in order to recognize speech. Though SSR is still in the experimental stage, a number of potential applications seem evident. Persons who have undergone a laryngectomy, or older people for whom speaking requires substantial effort, would be able to mouth words rather than actually pronouncing them. Our system was trained with 30 utterances from each of three subjects on a test vocabulary of 4 phrases, and then tested on 15 new utterances that were not part of the training list. The system achieved an average word accuracy of 91.11% using a Support Vector Machine (SVM) classifier with English as the base language, and an average of 89.44% with Standard Arabic.

  10. A comparison of speech recognition training programs for cochlear implant users: A simulation study

    Science.gov (United States)

    McCabe, Marie E.; Chiu, C.-Y. Peter

    2003-10-01

    The present simulation study compared two training programs with very different design features to explore how each might improve the ability of listeners with normal hearing to recognize speech generated by a cochlear implant simulator. The first program, which focused training on specific areas of difficulty for individual patients across multiple levels of linguistic content (e.g., vowels, consonants, words, and sentences), was modeled after a standard program prescribed by one of the US manufacturers of cochlear implants. The second program consisted of exposure to multiple sentences with feedback regardless of subjects' performance level, and had been used in previous studies from this laboratory. All speech materials were reduced spectrally to simulate an 8-channel CIS cochlear implant processor with a "6 mm frequency upshift" [Fu and Shannon, J. Acoust. Soc. Am. 105, 1889 (1999)]. Test sessions were administered to all subjects to assess recognition of sentences, consonants (/aCa/), and vowels (in /hVd/ and /bVt/ contexts) pre- and post-training. In a subset of subjects, a crossover design, in which subjects were trained first with one program and then with the other, was employed. Results will be discussed both in terms of theory and practice of therapeutic programs for cochlear implant users.

  11. Robust Automatic Speech Recognition Features using Complex Wavelet Packet Transform Coefficients

    Directory of Open Access Journals (Sweden)

    TjongWan Sen

    2009-11-01

    Full Text Available To improve the performance of phoneme-based Automatic Speech Recognition (ASR) in noisy environments, we developed a new technique that adds robustness to clean phoneme features. These robust features are obtained from Complex Wavelet Packet Transform (CWPT) coefficients. Since the CWPT coefficients represent all the frequency bands of the input signal, decomposing the input signal into a complete CWPT tree covers all frequencies involved in the recognition process. For time-overlapping signals with different frequency contents, e.g. a phoneme signal with noise, the CWPT coefficients are the combination of the CWPT coefficients of the phoneme signal and those of the noise, so the phoneme coefficients change according to the frequency components contained in the noise. Since the number of phonemes in any language is relatively small (limited) and already well known, one can easily derive principal component vectors from a clean training dataset using Principal Component Analysis (PCA). These principal component vectors can then be used to add robustness and minimize noise effects in the testing phase. Simulation results, using Alpha Numeric 4 (AN4) from Carnegie Mellon University and NOISEX-92 examples from Rice University, show that this new technique can be used as a feature extractor that improves the robustness of phoneme-based ASR systems in various adverse noisy conditions while preserving performance in clean environments.
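
    The PCA step that provides the robustness can be sketched independently of the CWPT front end: learn principal axes from clean features, then project noisy features onto that subspace and back. The data below are synthetic stand-ins, not CWPT coefficients.

      import numpy as np
      from sklearn.decomposition import PCA

      # Learn principal axes from clean training features, then project noisy
      # test features onto the clean subspace to suppress noise components.
      rng = np.random.default_rng(3)
      clean = rng.normal(size=(1000, 40)) @ rng.normal(size=(40, 40))  # clean features
      pca = PCA(n_components=12).fit(clean)       # principal axes of clean phonemes

      noisy = clean[:5] + rng.normal(scale=2.0, size=(5, 40))          # test features
      denoised = pca.inverse_transform(pca.transform(noisy))           # project back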

  12. Particle swarm optimization based feature enhancement and feature selection for improved emotion recognition in speech and glottal signals.

    Science.gov (United States)

    Muthusamy, Hariharan; Polat, Kemal; Yaacob, Sazali

    2015-01-01

    In recent years, many research works have been published using speech-related features for speech emotion recognition; however, recent studies show that there is a strong correlation between emotional states and glottal features. In this work, Mel-frequency cepstral coefficients (MFCCs), linear predictive cepstral coefficients (LPCCs), perceptual linear predictive (PLP) features, gammatone filter outputs, timbral texture features, stationary wavelet transform based timbral texture features, and relative wavelet packet energy and entropy features were extracted from emotional speech (ES) signals and their glottal waveforms (GW). Particle swarm optimization based clustering (PSOC) and wrapper-based particle swarm optimization (WPSO) were proposed to enhance the discerning ability of the features and to select the discriminating features, respectively. Three different emotional speech databases were utilized to gauge the proposed method. An extreme learning machine (ELM) was employed to classify the different types of emotions. Different experiments were conducted, and the results show that the proposed method significantly improves speech emotion recognition performance compared to previous works published in the literature. PMID:25799141

  14. On the Use of Evolutionary Algorithms to Improve the Robustness of Continuous Speech Recognition Systems in Adverse Conditions

    Directory of Open Access Journals (Sweden)

    Sid-Ahmed Selouani

    2003-07-01

    Full Text Available Limiting the decrease in performance due to acoustic environment changes remains a major challenge for continuous speech recognition (CSR) systems. We propose a novel approach which combines the Karhunen-Loève transform (KLT) in the mel-frequency domain with a genetic algorithm (GA) to enhance the data representing corrupted speech. The idea consists of projecting noisy speech parameters onto the space generated by the genetically optimized principal axes issued from the KLT. The enhanced parameters increase the recognition rate for highly interfering noise environments. The proposed hybrid technique, when included in the front end of an HTK-based CSR system, outperforms the conventional recognition process in severe interfering car noise environments for a wide range of signal-to-noise ratios (SNRs) varying from 16 dB to −4 dB. We also show the effectiveness of the KLT-GA method in recognizing speech subject to telephone channel degradations.

  15. Automated track recognition and event reconstruction in nuclear emulsion

    International Nuclear Information System (INIS)

    The major advantages of nuclear emulsion for detecting charged particles are its submicron position resolution and sensitivity to minimum ionizing particles. These must be balanced, however, against the difficult manual microscope measurement by skilled observers required for the analysis. We have developed an automated system to acquire and analyze the microscope images from emulsion chambers. Each emulsion plate is analyzed independently, allowing coincidence techniques to be used in order to reject background and estimate error rates. The system has been used to analyze a sample of high-multiplicity Pb-Pb interactions (charged particle multiplicities ≈ 1100) produced by the 158 GeV/c per nucleon 208Pb beam at CERN. Automatically measured events agree with our best manual measurements on 97% of all the tracks. We describe the image analysis and track reconstruction techniques, and discuss the measurement and reconstruction uncertainties. (orig.)

  16. Long-term outcomes on spatial hearing, speech recognition and receptive vocabulary after sequential bilateral cochlear implantation in children.

    Science.gov (United States)

    Sparreboom, Marloes; Langereis, Margreet C; Snik, Ad F M; Mylanus, Emmanuel A M

    2014-11-01

    Sequential bilateral cochlear implantation in profoundly deaf children often leads to primary advantages in spatial hearing and speech recognition. It is not yet known how these children develop in the long term and whether these primary advantages also lead to secondary advantages, e.g. better language skills. The aim of the present longitudinal cohort study was to assess the long-term effects of sequential bilateral cochlear implantation in children on spatial hearing, speech recognition in quiet and in noise, and receptive vocabulary. Twenty-four children with bilateral cochlear implants (BiCIs) were tested 5-6 years after sequential bilateral cochlear implantation. These children received their second implant between 2.4 and 8.5 years of age. Speech and language data were also gathered in a matched reference group of 26 children with a unilateral cochlear implant (UCI). Spatial hearing was assessed with a minimum audible angle (MAA) task with different stimulus types to gain global insight into the effective use of interaural level difference (ILD) and interaural timing difference (ITD) cues. In the long term, children still showed improvements in spatial acuity. Spatial acuity was higher for ILD cues than for ITD cues. For speech recognition in quiet and noise, and receptive vocabulary, children with BiCIs had significantly higher scores than children with a UCI. Results also indicate that attending a mainstream school has a significant positive effect on speech recognition and receptive vocabulary compared to attending a school for the deaf. Despite a period of unilateral deafness, children with BiCIs participating in mainstream education obtained age-appropriate language scores. PMID:25462493

  17. Feature Extraction and Selection Strategies for Automated Target Recognition

    Science.gov (United States)

    Greene, W. Nicholas; Zhang, Yuhan; Lu, Thomas T.; Chao, Tien-Hsin

    2010-01-01

    Several feature extraction and selection methods for an existing automatic target recognition (ATR) system using JPL's Grayscale Optical Correlator (GOC) and Optimal Trade-Off Maximum Average Correlation Height (OT-MACH) filter were tested using MATLAB. The ATR system is composed of three stages: a cursory region-of-interest (ROI) search using the GOC and OT-MACH filter, a feature extraction and selection stage, and a final classification stage. Feature extraction and selection concern transforming potential target data into more useful forms and selecting important subsets of that data which may aid in detection and classification. The strategies tested were built around two popular extraction methods: Principal Component Analysis (PCA) and Independent Component Analysis (ICA). Performance was measured based on the classification accuracy and free-response receiver operating characteristic (FROC) output of a support vector machine (SVM) and a neural net (NN) classifier.

  18. Modern prescription theory and application: realistic expectations for speech recognition with hearing aids.

    Science.gov (United States)

    Johnson, Earl E

    2013-01-01

    A major decision at the time of hearing aid fitting and dispensing is the amount of amplification to provide listeners (both adult and pediatric populations) for the appropriate compensation of sensorineural hearing impairment across a range of frequencies (e.g., 160-10000 Hz) and input levels (e.g., 50-75 dB sound pressure level). This article describes modern prescription theory for hearing aids within the context of a risk versus return trade-off and efficient frontier analyses. The expected return of amplification recommendations (i.e., generic prescriptions such as National Acoustic Laboratories-Non-Linear 2, NAL-NL2, and Desired Sensation Level Multiple Input/Output, DSL m[i/o]) for the Speech Intelligibility Index (SII) and high-frequency audibility were traded against a potential risk (i.e., loudness). The modeled performance of each prescription was compared one with another and with the efficient frontier of normal hearing sensitivity (i.e., a reference point for the most return with the least risk). For the pediatric population, NAL-NL2 was more efficient for SII, while DSL m[i/o] was more efficient for high-frequency audibility. For the adult population, NAL-NL2 was more efficient for SII, while the two prescriptions were similar with regard to high-frequency audibility. In terms of absolute return (i.e., not considering the risk of loudness), however, DSL m[i/o] prescribed more outright high-frequency audibility than NAL-NL2 for either aged population, particularly, as hearing loss increased. Given the principles and demonstrated accuracy of desensitization (reduced utility of audibility with increasing hearing loss) observed at the group level, additional high-frequency audibility beyond that of NAL-NL2 is not expected to make further contributions to speech intelligibility (recognition) for the average listener. PMID:24253361

  19. Development and Evaluation of a Speech Recognition Test for Persian Speaking Adults

    Directory of Open Access Journals (Sweden)

    Mohammad Mosleh

    2001-05-01

    Full Text Available Method and Materials: This research developed and evaluated 25 phonemically balanced word lists for Persian-speaking adults in two separate stages: development and evaluation. In the first stage, in order to balance the lists phonemically, the frequency of occurrence of each of the 29 phonemes (6 vowels and 23 consonants) of the Persian language in adult speech was determined; this showed some significant differences between phoneme frequencies. Then, all Persian monosyllabic words were extracted from the Mo'in Persian dictionary. Semantically difficult words were excluded, and appropriate words were chosen according to the judgment of 5 adult native speakers of Persian with a high school diploma. Twelve open-set 25-word lists were prepared and recorded on magnetic tape in an audio studio by a professional speaker of IRIB. In the second stage, in order to evaluate the test's validity and reliability, 60 normal-hearing adults (30 male, 30 female) were randomly selected and evaluated in test and retest sessions. Findings: 1- Normal-hearing adults obtained scores of 92-100 for each list at their MCL through test-retest. 2- No significant difference was observed (a) in test-retest scores for each list (P>0.05), (b) between the lists in test or retest scores (P>0.05), or (c) between sexes (P>0.05). Conclusion: The test is reliable and valid; the lists are phonemically balanced and equal in difficulty, and are valuable for evaluating the speech recognition of Persian-speaking adults.

  20. Automated defect recognition method based on neighbor layer slice images of ICT

    International Nuclear Information System (INIS)

    Current automated defect recognition for industrial computerized tomography (ICT) slice images is mostly carried out on individual images. Certain false detections persist because isolated noise cannot be removed without considering the information in neighbor-layer images. To solve this problem, a new automated defect recognition method is presented based on a two-step analysis of consecutive slice images. First, all potential defects are segmented in each image using a classic method. Second, real defects and false defects are distinguished by matching all potential defects across neighbor-layer images, exploiting the continuity of real defect characteristics and the non-continuity of false defects between neighboring images. The method is verified by experiments, and the results prove that real defects can be detected with high probability while false detections are reduced effectively. (authors)
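
    The neighbor-layer confirmation step can be sketched as a centroid-matching rule: a candidate in slice i survives only if a nearby candidate exists in slice i-1 or i+1. The threshold and coordinates below are invented for illustration, not taken from the paper.

      import numpy as np

      # Keep a candidate defect only if a neighbor slice contains a candidate
      # within max_dist of its centroid (continuity of real defects).
      def confirm_defects(candidates_by_slice, max_dist=3.0):
          """candidates_by_slice: one (n_i, 2) array of region centroids per slice."""
          confirmed = []
          for i, cands in enumerate(candidates_by_slice):
              keep = []
              for c in cands:
                  for j in (i - 1, i + 1):
                      if 0 <= j < len(candidates_by_slice):
                          neigh = candidates_by_slice[j]
                          if len(neigh) and np.linalg.norm(neigh - c, axis=1).min() <= max_dist:
                              keep.append(c)
                              break
              confirmed.append(np.array(keep))
          return confirmed

      slices = [np.array([[10.0, 10.0], [50.0, 80.0]]),   # slice 0: defect + noise
                np.array([[10.5, 10.2]]),                 # slice 1: defect persists
                np.array([[11.0, 10.5]])]                 # slice 2: defect persists
      print(confirm_defects(slices))                      # isolated [50, 80] is dropped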

  1. Automated recognition of forest patterns using aerial photographs

    Science.gov (United States)

    Barbezat, Vincent; Kreiss, Philippe; Sulzmann, Armin; Jacot, Jacques

    1996-12-01

    In Switzerland, aerial photos are indispensable tools for research into ecosystems and their management. Every six years since 1950, the whole of Switzerland has been systematically surveyed by aerial photos. In the forestry field, these documents not only provide invaluable information but also support field activities such as drawing up tree population maps, intervention planning, precise positioning of the upper forest limit, and evaluation of forest damage and rates of tree growth. Up to now, the analysis of aerial photos has been carried out by specialists who painstakingly examine every photograph, which makes it a very long, exacting and expensive job. The IMT-DMT of the EPFL and the Antenne romande of the FNP, aware of the special interest involved and the necessity of automated classification of aerial photos, have pooled their resources to develop a software program capable of differentiating between single trees, copses and dense forests. The developed algorithms detect the crowns of the trees and the surface of their orthogonal projection; from the shadow of each tree they calculate its height. They also determine the position of each tree in the Swiss national coordinate system thanks to the implementation of a numeric altitude model. Looking ahead, many new and better uses of aerial photos become possible, particularly where isolated stands are concerned and where changes must be assessed from a diachronic series of photos: from timberline monitoring in research on global change to the exploitation of wooded pastures on small surface areas.

  2. Language and Speech Processing

    CERN Document Server

    Mariani, Joseph

    2008-01-01

    Speech processing addresses various scientific and technological areas. It includes speech analysis and variable-rate coding, in order to store or transmit speech. It also covers speech synthesis, especially from text, speech recognition, including speaker and language identification, and spoken language understanding. This book covers the following topics: how to realize speech production and perception systems, and how to synthesize and understand speech using state-of-the-art methods in signal processing, pattern recognition, stochastic modelling, computational linguistics and human factor studies.

  3. Effective Prediction of Errors by Non-native Speakers Using Decision Tree for Speech Recognition-Based CALL System

    Science.gov (United States)

    Wang, Hongcui; Kawahara, Tatsuya

    CALL (Computer Assisted Language Learning) systems using ASR (Automatic Speech Recognition) for second language learning have received increasing interest recently. However, it still remains a challenge to achieve high speech recognition performance, including accurate detection of erroneous utterances by non-native speakers. Conventionally, possible error patterns, based on linguistic knowledge, are added to the lexicon and language model, or to the ASR grammar network. However, this approach quickly runs into a trade-off between error coverage and increased perplexity. To solve the problem, we propose a method based on a decision tree to learn effective prediction of errors made by non-native speakers. An experimental evaluation with a number of foreign students learning Japanese shows that the proposed method can effectively generate an ASR grammar network, given a target sentence, that achieves both better error coverage and smaller perplexity, resulting in significant improvement in ASR accuracy.
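
    A toy sketch of the underlying idea follows, with invented word-level features: a decision tree predicts which items a non-native speaker is likely to mispronounce, and only the flagged variants would be added to the grammar network, limiting the growth in perplexity. Nothing here reproduces the paper's features or data.

      from sklearn.tree import DecisionTreeClassifier

      # Invented features per word: (syllable count, contains long vowel,
      # contains geminate, learner proficiency level). Label 1 = an error
      # pattern was observed for this word in training data.
      X = [[2, 0, 0, 1], [4, 1, 0, 1], [3, 1, 1, 0], [2, 0, 1, 0],
           [5, 1, 1, 0], [3, 0, 0, 2], [4, 0, 1, 1], [2, 1, 0, 2]]
      y = [0, 1, 1, 1, 1, 0, 1, 0]

      tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
      # Only word variants the tree flags as likely errors are added to the
      # ASR grammar network for the target sentence.
      print(tree.predict([[4, 1, 1, 0]]))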

  4. Morpho-syntactic post-processing of N-best lists for improved French automatic speech recognition

    OpenAIRE

    Huet, Stéphane; Gravier, Guillaume; Sébillot, Pascale

    2010-01-01

    Many automatic speech recognition (ASR) systems rely solely on pronunciation dictionaries and language models to take into account information about language. Implicitly, morphology and syntax are to a certain extent embedded in the language models, but the richness of such linguistic knowledge is not exploited. This paper studies the use of morpho-syntactic (MS) information in a post-processing stage of an ASR system, by reordering N-best lists. Each sentence hypothesis ...
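
    The reranking step reduces to interpolating two scores per hypothesis and reordering the list, as in this sketch. The hypotheses and scores are invented for illustration; the MS score penalizes the agreement error even though it has the best recognizer score.

      # N-best reranking: interpolate the recognizer score with a
      # morpho-syntactic (MS) score, then reorder.
      nbest = [                          # (hypothesis, asr_log_score, ms_log_score)
          ("il mange une pomme",   -12.1, -3.2),
          ("il mangent une pomme", -11.8, -7.9),   # agreement error: MS penalizes it
          ("ils mangent une pomme", -12.3, -3.0),
      ]

      def rerank(nbest, alpha=0.7):
          """Higher alpha trusts the acoustic/LM score; (1 - alpha) the MS score."""
          return sorted(nbest, key=lambda h: -(alpha * h[1] + (1 - alpha) * h[2]))

      best, *_ = rerank(nbest)
      print(best[0])                     # -> "il mange une pomme"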

  5. Noise Estimation and Noise Removal Techniques for Speech Recognition in Adverse Environment

    OpenAIRE

    Shrawankar, Urmila; Thakare, Vilas

    2010-01-01

    Noise is ubiquitous in almost all acoustic environments. The speech signal recorded by a microphone is generally contaminated by noise originating from various sources. Such contamination can change the characteristics of the speech signal and degrade speech quality and intelligibility, thereby causing significant harm to human-to-machine communication systems. Noise detection and reduction for speech applications is often formulated as a digital filtering problem, where the clean s...

  6. Managing predefined templates and macros for a departmental speech recognition system using common software.

    Science.gov (United States)

    Sistrom, C L; Honeyman, J C; Mancuso, A; Quisling, R G

    2001-09-01

    The authors have developed a networked database system to create, store, and manage predefined radiology report definitions. This was prompted by complete departmental conversion to a computer speech recognition system (SRS) for clinical reporting. The software complements and extends the capabilities of the SRS, and the two systems are integrated by means of a simple text file format and import/export functions within each program. This report describes the functional requirements, design considerations, and implementation details of the structured report management software. The database and its interface are designed to allow all radiologists and division managers to define and update template structures relevant to their practice areas. Two key conceptual extensions supported by the template management system are the addition of a template type construct and the ability for individual radiologists to dynamically share common organ-system or modality-specific templates. In addition, the template manager software enables specifying predefined report structures that can be triggered at the time of dictation from printed lists of barcodes. Initial experience using the program in a regional, multisite, academic radiology practice has been positive. PMID:11720335

  7. Development of a two wheeled self balancing robot with speech recognition and navigation algorithm

    Science.gov (United States)

    Rahman, Md. Muhaimin; Ashik-E-Rasul; Haq, Nowab Md. Aminul; Hassan, Mehedi; Hasib, Irfan Mohammad Al; Hassan, K. M. Rafidh

    2016-07-01

    This paper discusses the modeling, construction and navigation-algorithm development of a two-wheeled self-balancing mobile robot in an enclosure. We discuss the design of the two main controller algorithms, namely PID algorithms, on the robot model. Simulation is performed in the SIMULINK environment. The controller is developed primarily for self-balancing of the robot and also for its positioning. For navigation in an enclosure, a template matching algorithm is proposed for precise measurement of the robot position. The navigation system needs to be calibrated before the navigation process starts. Almost all earlier template matching algorithms in the open literature can only trace the robot, but the proposed algorithm can also locate the positions of other objects in the enclosure, such as furniture and tables. This enables the robot to know the exact location of every stationary object in the enclosure. Moreover, additional features, such as speech recognition and object detection, are added. For object detection, the single-board computer Raspberry Pi is used. The system is programmed to analyze images captured via the camera, which are processed through background subtraction, followed by active noise reduction.
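
    The PID balancing loop can be sketched in a few lines. The gains and the one-line "plant" below are placeholders for illustration, not the authors' SIMULINK model or tuned values.

      # Discrete PID update of the kind used for self-balancing.
      class PID:
          def __init__(self, kp, ki, kd, dt):
              self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
              self.integral, self.prev_error = 0.0, 0.0

          def step(self, setpoint, measurement):
              error = setpoint - measurement
              self.integral += error * self.dt
              derivative = (error - self.prev_error) / self.dt
              self.prev_error = error
              return self.kp * error + self.ki * self.integral + self.kd * derivative

      pid = PID(kp=18.0, ki=0.5, kd=1.2, dt=0.01)   # placeholder gains
      tilt = 5.0                                    # degrees from vertical
      for _ in range(3):
          torque = pid.step(0.0, tilt)              # drive tilt back to zero
          tilt -= 0.01 * torque                     # crude stand-in for robot dynamics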

  8. A Comparative Study: Gammachirp Wavelets and Auditory Filter Using Prosodic Features of Speech Recognition In Noisy Environment

    Directory of Open Access Journals (Sweden)

    Hajer Rahali

    2014-04-01

    Full Text Available Modern automatic speech recognition (ASR) systems typically use a bank of linear filters as the first step in performing frequency analysis of speech. On the other hand, the cochlea, which is responsible for frequency analysis in the human auditory system, is known to have a compressive non-linear frequency response which depends on input stimulus level. This paper presents a new method based on the gammachirp auditory filter and a continuous wavelet analysis. The essential characteristic of this model is that it performs analysis by wavelet packet transformation over frequency bands that approximate the critical bands of the ear, which differs from the existing model based on short-term Fourier transform (STFT) analysis. Prosodic features such as pitch, formant frequency, jitter and shimmer are extracted from the fundamental frequency contour and added to baseline spectral features: specifically, Mel Frequency Cepstral Coefficients (MFCC) for human speech, Gammachirp Filterbank Cepstral Coefficients (GFCC) and Gammachirp Wavelet Frequency Cepstral Coefficients (GWFCC). The results show that the gammachirp wavelet gives results comparable to those obtained with MFCC and GFCC. Experimental results show the best performance of this architecture. This paper implements the GW and examines its application to a specific example of speech. Implications for noise-robust speech analysis are also discussed using the AURORA databases.
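
    The baseline feature side (MFCCs plus pitch-based prosodic features) can be sketched with librosa; the gammachirp filterbank and wavelet front ends (GFCC/GWFCC) are not standard library components and are omitted. The audio file is a stand-in, and the jitter formula is a rough cycle-to-cycle approximation, not the paper's exact definition.

      import librosa
      import numpy as np

      y, sr = librosa.load(librosa.example("trumpet"))    # stand-in for a speech file
      mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)  # baseline spectral features

      f0 = librosa.yin(y, fmin=60, fmax=400, sr=sr)       # frame-level pitch track
      voiced = f0[np.isfinite(f0)]
      # Approximate jitter: mean absolute frame-to-frame pitch change, normalized.
      jitter = np.mean(np.abs(np.diff(voiced))) / np.mean(voiced)

      features = {"mfcc_mean": mfcc.mean(axis=1),
                  "pitch_mean": voiced.mean(),
                  "jitter": jitter}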

  9. TreeRipper web application: towards a fully automated optical tree recognition software

    Science.gov (United States)

    2011-01-01

    Background Relationships between species, genes and genomes have been printed as trees for over a century. Whilst this may have been the best format for exchanging and sharing phylogenetic hypotheses during the 20th century, the worldwide web now provides faster and automated ways of transferring and sharing phylogenetic knowledge. However, novel software is needed to defrost these published phylogenies for the 21st century. Results TreeRipper is a simple website for the fully-automated recognition of multifurcating phylogenetic trees (http://linnaeus.zoology.gla.ac.uk/~jhughes/treeripper/). The program accepts a range of input image formats (PNG, JPG/JPEG or GIF). The underlying command-line C++ program follows a number of cleaning steps to detect lines, remove node labels, patch up broken lines and corners, and detect line edges. The edge contour is then traced to determine the branch lengths, tip label positions and topology of the tree. Optical Character Recognition (OCR) is used to convert the tip labels into text with the freely available tesseract-ocr software. 32% of images meeting the prerequisites for TreeRipper were successfully recognised; the largest tree had 115 leaves. Conclusions Despite the diversity of ways phylogenies have been illustrated, which makes the design of fully automated tree recognition software difficult, TreeRipper is a step towards automating the digitization of past phylogenies. We also provide a dataset of 100 tree images and associated tree files for training and/or benchmarking future software. TreeRipper is an open source project licensed under the GNU General Public Licence v3. PMID:21599881

  10. Speech recognition in reverberant and noisy environments employing multiple feature extractors and i-vector speaker adaptation

    Science.gov (United States)

    Alam, Md Jahangir; Gupta, Vishwa; Kenny, Patrick; Dumouchel, Pierre

    2015-12-01

    The REVERB challenge provides a common framework for the evaluation of feature extraction techniques in the presence of both reverberation and additive background noise. State-of-the-art speech recognition systems perform well in controlled environments, but their performance degrades in realistic acoustical conditions, especially in real as well as simulated reverberant environments. In this contribution, we utilize multiple feature extractors, including the conventional mel-filterbank, multi-taper spectrum estimation-based mel-filterbank, robust mel and compressive gammachirp filterbank, iterative deconvolution-based dereverberated mel-filterbank, and maximum likelihood inverse filtering-based dereverberated mel-frequency cepstral coefficient features, for speech recognition with multi-condition training data. In order to improve speech recognition performance, we combine their results using ROVER (Recognizer Output Voting Error Reduction). For the two- and eight-channel tasks, to benefit from the multi-channel data, we also use ROVER, instead of a multi-microphone signal processing method, to reduce the word error rate by selecting the best-scoring word at each channel. As in previous work, we also apply i-vector-based speaker adaptation, which was found effective. In the speech recognition task, speaker adaptation tries to reduce the mismatch between the training and test speakers. Speech recognition experiments are conducted on the REVERB challenge 2014 corpora using the Kaldi recognizer. In our experiments, we use both utterance-based batch processing and full batch processing. In the single-channel task, full batch processing reduced the word error rate (WER) from 10.0 to 9.3 % on SimData as compared to utterance-based batch processing. Using full batch processing, we obtained an average WER of 9.0 and 23.4 % on the SimData and RealData, respectively, for the two-channel task, whereas for the eight-channel task on the SimData and RealData, the average WERs found were 8

  11. Automated Gesturing for Virtual Characters: Speech-driven and Text-driven Approaches

    Directory of Open Access Journals (Sweden)

    Goranka Zoric

    2006-04-01

    Full Text Available We present two methods for automatic facial gesturing of graphically embodied animated agents. In the first, a conversational agent is driven by speech in an automatic lip-sync process: by analyzing the speech input, lip movements are determined from the speech signal. The second method provides a virtual speaker capable of reading plain English text and rendering it as speech accompanied by appropriate facial gestures. The proposed statistical model for generating the virtual speaker's facial gestures can also be applied as an addition to the lip synchronization process in order to obtain speech-driven facial gesturing; in this case, the statistical model is triggered by the prosody of the input speech instead of lexical analysis of the input text.

  12. Speech Recognition Performance in Children with Cochlear Implants Using Bimodal Stimulation

    OpenAIRE

    Rathna Kumar, S. B.; Mohanty, P.; Prakash, S. G. R.

    2010-01-01

    Cochlear implantees have considerably good speech understanding abilities in quiet surroundings. But, ambient noise poses significant difficulties in understanding speech for these individuals. Bimodal stimulation is still not used by many Indian implantees in spite of reports that bimodal stimulation is beneficial for speech understanding in noise as compared to cochlear implant alone and also prevents auditory deprivation in the un-implanted ear. The aim of the study is to evaluate the bene...

  13. Comparison of the South African Spondaic and CID W-1 wordlists for measuring speech recognition threshold

    Directory of Open Access Journals (Sweden)

    Tanya Hanekom

    2015-02-01

    Full Text Available Background: The home language of most audiologists in South Africa is either English or Afrikaans, whereas most South Africans speak an African language as their home language. The use of an English wordlist, the South African Spondaic (SAS wordlist, which is familiar to the English Second Language (ESL population, was developed by the author for testing the speech recognition threshold (SRT of ESL speakers. Objectives: The aim of this study was to compare the pure-tone average (PTA/SRT correlation results of ESL participants when using the SAS wordlist (list A and the CID W-1 spondaic wordlist (list B – less familiar; list C – more familiar CID W-1 words. Method: A mixed-group correlational, quantitative design was adopted. PTA and SRT measurements were compared for lists A, B and C for 101 (197 ears ESL participants with normal hearing or a minimal hearing loss (<26 dBHL; mean age 33.3. Results: The Pearson correlation analysis revealed a strong PTA/SRT correlation when using list A (right 0.65; left 0.58 and list C (right 0.63; left 0.56. The use of list B revealed weak correlations (right 0.30; left 0.32. Paired sample t-tests indicated a statistically significantly stronger PTA/SRT correlation when list A was used, rather than list B or list C, at a 95% level of confidence. Conclusions: The use of the SAS wordlist yielded a stronger PTA/SRT correlation than the use of the CID W-1 wordlist, when performing SRT testing on South African ESL speakers with normal hearing, or minimal hearing loss (<26 dBHL.

  14. Speech recognition software and electronic psychiatric progress notes: physicians' ratings and preferences

    Directory of Open Access Journals (Sweden)

    Derman Yaron D

    2010-08-01

    Full Text Available Abstract Background The context of the current study was mandatory adoption of electronic clinical documentation within a large mental health care organization. Psychiatric electronic documentation has unique needs by the nature of its dense narrative content. Our goal was to determine if speech recognition (SR) would ease the creation of electronic progress note (ePN) documents by physicians at our institution. Methods Subjects: Twelve physicians had access to SR software on their computers for a period of four weeks to create ePN. Measurements: We examined SR software in relation to its perceived usability, data entry time savings, impact on the quality of care and quality of documentation, and the impact on clinical and administrative workflow, as compared to existing methods for data entry. Data analysis: A series of Wilcoxon signed rank tests were used to compare pre- and post-SR measures. A qualitative study design was used. Results Six of twelve participants completing the study favoured the use of SR (five with SR alone, plus one with SR via hand-held digital recorder) for creating electronic progress notes over their existing mode of data entry. There was no clear perceived benefit from SR in terms of data entry time savings, quality of care, quality of documentation, or impact on clinical and administrative workflow. Conclusions Although our findings are mixed, SR may be a technology with some promise for mental health documentation. Future investigations of this nature should use more participants, a broader range of document types, and compare front- and back-end SR methods.

  15. Feature Selection Software to Improve Accuracy and Reduce Cost in Automated Recognition Systems

    Czech Academy of Sciences Publication Activity Database

    Somol, Petr

    2011-01-01

    Vol. 2011, No. 84 (2011), p. 54. ISSN 0926-4981. R&D Projects: GA MŠk 1M0572. Other grants: GA MŠk(CZ) 2C06019. Institutional research plan: CEZ:AV0Z10750506. Keywords: feature selection * software library * machine learning. Subject RIV: BD - Theory of Information. http://library.utia.cas.cz/separaty/2011/RO/somol-feature selection software to improve accuracy and reduce cost in automated recognition systems.pdf

  16. [Automated recognition of quasars based on adaptive radial basis function neural networks].

    Science.gov (United States)

    Zhao, Mei-Fang; Luo, A-Li; Wu, Fu-Chao; Hu, Zhan-Yi

    2006-02-01

    Recognizing and certifying quasars through the study of their spectra is an important task in astronomy. This paper presents a novel adaptive method for the automated recognition of quasars based on radial basis function neural networks (RBFN). The proposed method is composed of the following three parts: (1) the feature space is reduced by PCA (principal component analysis) on the normalized input spectra; (2) an adaptive RBFN is constructed and trained in this reduced space. First, K-means clustering is used for initialization; then, based on the sum of squared errors and a gradient descent optimization technique, the number of neurons in the hidden layer is adaptively increased to improve recognition performance; (3) quasar spectra recognition is effectively carried out by the trained RBFN. The proposed adaptive RBFN is shown not only to overcome the difficulty of selecting the number of hidden-layer neurons in the traditional RBFN algorithm, but also to increase the stability and accuracy of quasar recognition. Moreover, the proposed method is particularly useful for the automatic processing of the voluminous spectra produced by a large-scale sky survey project, such as our LAMOST, owing to its efficiency. PMID:16826929

  17. Reverberant speech recognition combining deep neural networks and deep autoencoders augmented with a phone-class feature

    Science.gov (United States)

    Mimura, Masato; Sakai, Shinsuke; Kawahara, Tatsuya

    2015-12-01

    We propose an approach to reverberant speech recognition adopting deep learning in the front-end as well as the back-end of a reverberant speech recognition system, and a novel method to improve the dereverberation performance of the front-end network using phone-class information. At the front-end, we adopt a deep autoencoder (DAE) for enhancing the speech feature parameters, and speech recognition is performed in the back-end using DNN-HMM acoustic models trained on multi-condition data. The system was evaluated through the ASR task in the Reverb Challenge 2014. The DNN-HMM system trained on the multi-condition training set achieved a conspicuously higher word accuracy compared to the MLLR-adapted GMM-HMM system trained on the same data. Furthermore, feature enhancement with the deep autoencoder contributed to the improvement of recognition accuracy especially in the more adverse conditions. While the mapping between reverberant and clean speech in DAE-based dereverberation is conventionally conducted only with the acoustic information, we presume the mapping is also dependent on the phone information. Therefore, we propose a new scheme (pDAE), which augments a phone-class feature to the standard acoustic features as input. Two types of the phone-class feature are investigated. One is the hard recognition result of monophones, and the other is a soft representation derived from the posterior outputs of monophone DNN. The augmented feature in either type results in a significant improvement (7-8 % relative) from the standard DAE.
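    A minimal PyTorch sketch of the pDAE idea: a feed-forward autoencoder that maps a reverberant feature frame, concatenated with a phone-class vector, to an enhanced feature frame. The layer sizes and the 40-dimensional feature / 38-phone dimensions are illustrative assumptions, not the paper's configuration:

        import torch
        import torch.nn as nn

        FEAT_DIM, PHONE_DIM, HIDDEN = 40, 38, 512  # assumed sizes

        class PhoneAwareDAE(nn.Module):
            def __init__(self):
                super().__init__()
                self.net = nn.Sequential(
                    nn.Linear(FEAT_DIM + PHONE_DIM, HIDDEN), nn.ReLU(),
                    nn.Linear(HIDDEN, HIDDEN), nn.ReLU(),
                    nn.Linear(HIDDEN, FEAT_DIM),  # predict clean features
                )

            def forward(self, reverb_feats, phone_post):
                # augment the acoustic input with monophone posteriors
                # (the "soft" phone-class feature type)
                return self.net(torch.cat([reverb_feats, phone_post], dim=-1))

        model = PhoneAwareDAE()
        x = torch.randn(8, FEAT_DIM)                          # reverberant frames
        p = torch.softmax(torch.randn(8, PHONE_DIM), dim=-1)  # phone posteriors
        clean_target = torch.randn(8, FEAT_DIM)               # stand-in clean targets
        loss = nn.functional.mse_loss(model(x, p), clean_target)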

  18. Design of an Automated Secure Garage System Using License Plate Recognition Technique

    Directory of Open Access Journals (Sweden)

    Afaz Uddin Ahmed

    2014-01-01

    Full Text Available Modern technologies have reached our garages to secure cars and residence entrances, meeting the demand for high security and automated infrastructure. The concept of intelligent secure garage systems in modern transport management is a remarkable example of computer-interfaced controlling devices. The License Plate Recognition (LPR) process is one of the key elements of modern intelligent garage security setups. This paper presents a design of an automated secure garage system featuring the LPR process. A template-matching approach using Optical Character Recognition (OCR) is implemented to carry out the LPR method. We also developed a prototype design of the secured garage system to verify the application for local use. The system allows only predefined, enlisted cars or vehicles to enter the garage while blocking the others, along with a central-alarm feature. Moreover, the system maintains an updated database of the cars that have entered and left the garage within a particular duration. Vehicles are distinguished by the system mainly based on the registration numbers on their license plates. The approach was tested on several samples of license plate images in both indoor and outdoor settings.

  19. The accuracy of radiology speech recognition reports in a multilingual South African teaching hospital

    International Nuclear Information System (INIS)

    Speech recognition (SR) technology, the process whereby spoken words are converted to digital text, has been used in radiology reporting since 1981. It was initially anticipated that SR would dominate radiology reporting, with claims of up to 99% accuracy, reduced turnaround times and significant cost savings. However, expectations have not yet been realised. The limited data available suggest SR reports have significantly higher levels of inaccuracy than traditional dictation transcription (DT) reports, as well as incurring greater aggregate costs. There has been little work on the clinical significance of such errors, however, and little is known of the impact of reporter seniority on the generation of errors, or the influence of system familiarity on reducing error rates. Furthermore, there have been conflicting findings on the accuracy of SR amongst users with English as first- and second-language respectively. The aim of the study was to compare the accuracy of SR and DT reports in a resource-limited setting. The first 300 SR and the first 300 DT reports generated during March 2010 were retrieved from the hospital’s PACS, and reviewed by a single observer. Text errors were identified, and then classified as either clinically significant or insignificant based on their potential impact on patient management. In addition, a follow-up analysis was conducted exactly 4 years later. Of the original 300 SR reports analysed, 25.6% contained errors, with 9.6% being clinically significant. Only 9.3% of the DT reports contained errors, 2.3% having potential clinical impact. Both the overall difference in SR and DT error rates, and the difference in ‘clinically significant’ error rates (9.6% vs. 2.3%) were statistically significant. In the follow-up study, the overall SR error rate was strikingly similar at 24.3%, 6% being clinically significant. Radiologists with second-language English were more likely to generate reports containing errors, but level of seniority

  20. Impact of a PACS/RIS-integrated speech recognition system on radiology reporting time and report availability

    International Nuclear Information System (INIS)

    Purpose: Quantification of the impact of a PACS/RIS-integrated speech recognition system (SRS) on the time expenditure for radiology reporting and on hospital-wide report availability (RA) in a university institution. Material and Methods: In a prospective pilot study, the following parameters were assessed for 669 radiographic examinations (CR): 1. the time requirement per report dictation (TED: dictation time (s) / number of images [examination] × number of words [report]) with either a combination of PACS and tape-based dictation (TD: analog dictation device/minicassette/transcription) or PACS/RIS/speech recognition system (RR: remote recognition/transcription; OR: online recognition/self-correction by radiologist), and 2. the report turnaround time (RTT) as the time interval from the entry of the first image into the PACS to the available RIS/HIS report. Two equal time periods were chosen retrospectively from the RIS database: 11/2002-2/2003 (only TD) and 11/2003-2/2004 (only RR or OR with the SRS). The mid-term (≥24 h, 24 h intervals) and short-term (<24 h, 1 h intervals) RA after examination completion were calculated for all modalities and for CR, CT, MR and XA/DS separately. The relative increase in the mid-term RA (RIMRA: related to the total number of examinations in each time period) and the increase in the short-term RA (ISRA: ratio of available reports during the 1st to 24th hour) were calculated. Results: Prospectively, there was a significant difference between TD/RR/OR (n=151/257/261) regarding mean TED (0.44/0.54/0.62 s [per word and image]) and mean RTT (10.47/6.65/1.27 h), respectively. Retrospectively, 37 898/39 680 reports were computed from the RIS database for the time periods 11/2002-2/2003 and 11/2003-2/2004. For CR/CT there was a shift of the short-term RA to the first 6 hours after examination completion (mean cumulative RA 20% higher) with a more than three-fold increase in the total number of available

  1. Vision-based obstacle recognition system for automated lawn mower robot development

    Science.gov (United States)

    Mohd Zin, Zalhan; Ibrahim, Ratnawati

    2011-06-01

    Digital image processing (DIP) techniques have recently been widely used in various types of applications. The classification and recognition of a specific object using a vision system require some challenging tasks in the fields of image processing and artificial intelligence. The ability and efficiency of a vision system to capture and process images is very important for any intelligent system such as an autonomous robot. This paper gives attention to the development of a vision system that could contribute to the development of an automated, vision-based lawn mower robot. The work involves the implementation of DIP techniques to detect and recognize three different types of obstacles that usually exist on a football field. The focus was on the study of different types and sizes of obstacles, the development of a vision-based obstacle recognition system, and the evaluation of the system's performance. Image processing techniques such as image filtering, segmentation, enhancement and edge detection have been applied in the system. The results show that the developed system is able to detect and recognize various types of obstacles on a football field with a recognition rate of more than 80%.
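    For illustration, the kind of processing chain the abstract lists (filtering, enhancement, edge detection) might look like the following OpenCV sketch; the parameter values and the contour-area cutoff are assumptions, not the authors' settings:

        import cv2
        import numpy as np

        # Stand-in for a camera frame from the robot's vision system.
        frame = (np.random.rand(240, 320, 3) * 255).astype(np.uint8)

        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        smoothed = cv2.GaussianBlur(gray, (5, 5), 0)   # noise filtering
        enhanced = cv2.equalizeHist(smoothed)          # contrast enhancement
        edges = cv2.Canny(enhanced, 50, 150)           # edge detection

        # Obstacle candidates: sufficiently large closed contours.
        contours, _ = cv2.findContours(edges, cv2.RETR_EXTERNAL,
                                       cv2.CHAIN_APPROX_SIMPLE)
        candidates = [c for c in contours if cv2.contourArea(c) > 100]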

  2. APPLYING RECOGNITION OF EMOTIONS IN SPEECH TO EXTEND THE IMPACT OF BRAND SLOGAN RESEARCH

    OpenAIRE

    Chien, Charles S.; Wan-Chen, Wang; Moutinho, Luiz; Cheng, Yun-Maw; Pao, Tsang-Long; Yu-Te, Chen; Jun-Heng, Yeh

    2007-01-01

    How brand slogans can influence and change the consumers' perception of image of products has been a topic of great interest to marketers. However, it is a non-trivial task to evaluate how brand slogans affect their customers' emotions and how the emotions influence the customers' perceptions of brand images. In this paper we demonstrate the Slogan Validator to evaluate brand slogans by analyzing the speech signals from customers' voiced slogans. It is arguably the first speech signal based a...

  3. Speech Enhancement and Recognition in Meetings with an Audio-Visual Sensor Array

    OpenAIRE

    Maganti, Hari Krishna; Gatica-Perez, Daniel; McCowan, Iain A.

    2006-01-01

    We address the problem of distant speech acquisition in multi-party meetings, using multiple microphones and cameras. Microphone array beamforming techniques present a potential alternative to close-talking microphones by providing speech enhancement through spatial filtering and directional discrimination. Beamforming techniques rely on the knowledge of a speaker location. In this paper, we present an integrated approach, in which an audio-visual multi-person tracker is used to track active ...

  4. A Methodology for Recognition of Emotions Based on Speech Analysis, for Applications to Human-Robot Interaction. An Exploratory Study

    Directory of Open Access Journals (Sweden)

    Rabiei Mohammad

    2014-01-01

    Full Text Available A system for the recognition of emotions based on speech analysis can have interesting applications in human-robot interaction. In this paper, we carry out an exploratory study on the possibility of using a proposed methodology to recognize basic emotions (sadness, surprise, happiness, anger, fear and disgust) based on the phonetic and acoustic properties of emotive speech, with minimal use of signal processing algorithms. We set up an experimental test consisting of three types of speakers, namely: (i) five adult European speakers, (ii) five adult Asian (Middle East) speakers and (iii) five adult American speakers. The speakers had to repeat 6 sentences in English (with durations typically between 1 s and 3 s) in order to emphasize rising-falling intonation and pitch movement. Intensity, peak and range of pitch, and speech rate were evaluated. The proposed methodology consists of generating and analyzing a graph of formants, pitch and intensity, using the open-source PRAAT program. From the experimental results, it was possible to recognize the basic emotions in most of the cases.

  5. Predicting the effect of spectral subtraction on the speech recognition threshold based on the signal-to-noise ratio in the envelope domain

    DEFF Research Database (Denmark)

    Jørgensen, Søren; Dau, Torsten

    2011-01-01

    Digital noise reduction strategies are important in technical devices such as hearing aids and mobile phones. One well-described noise reduction scheme is the spectral subtraction algorithm. Many versions of the spectral subtraction scheme have been presented in the literature, but the methods have rarely been evaluated perceptually in terms of speech intelligibility. This study analyzed the effects of the spectral subtraction strategy proposed by Berouti et al. [ICASSP 4 (1979), 208-211] on the speech recognition threshold (SRT) obtained with sentences presented in stationary speech-shaped noise...

  6. Pitch- and Formant-Based Order Adaptation of the Fractional Fourier Transform and Its Application to Speech Recognition

    Directory of Open Access Journals (Sweden)

    Yin Hui

    2009-01-01

    Full Text Available The fractional Fourier transform (FrFT) has been proposed to improve the time-frequency resolution in signal analysis and processing. However, selecting the FrFT transform order for the proper analysis of multicomponent signals like speech is still debated. In this work, we investigated several order adaptation methods. Firstly, FFT- and FrFT-based spectrograms of an artificially generated vowel are compared to demonstrate the methods. Secondly, an acoustic feature set combining MFCC and FrFT is proposed, and the transform orders for the FrFT are adaptively set according to various methods based on pitch and formants. A tonal vowel discrimination test is designed to compare the performance of these methods using the feature set. The results show that the FrFT-MFCC yields a better discriminability of tones and also of vowels, especially when using multi-transform-order methods. Thirdly, speech recognition experiments were conducted on the clean intervocalic English consonants provided by the Consonant Challenge. Experimental results show that the proposed features with different order adaptation methods can obtain slightly higher recognition rates compared to the reference MFCC-based recognizer.
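    For reference, the order-adaptation methods above choose the FrFT order p from pitch and formant estimates; the transform itself is standardly defined (a textbook form, not taken from the paper) as

        X_\alpha(u) = \int_{-\infty}^{\infty} x(t)\, K_\alpha(t,u)\, dt, \qquad \alpha = \frac{p\pi}{2},

        K_\alpha(t,u) = \sqrt{1 - i\cot\alpha}\;\exp\!\left[ i\pi \left( t^2\cot\alpha - 2ut\csc\alpha + u^2\cot\alpha \right) \right],

    so that p = 1 (\alpha = \pi/2) reduces to the ordinary Fourier transform, while intermediate orders tilt the time-frequency plane to better match chirp-like components such as moving formants.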

  7. Digital speech processing using Matlab

    CERN Document Server

    Gopi, E S

    2014-01-01

    Digital Speech Processing Using Matlab deals with digital speech pattern recognition, speech production model, speech feature extraction, and speech compression. The book is written in a manner that is suitable for beginners pursuing basic research in digital speech processing. Matlab illustrations are provided for most topics to enable better understanding of concepts. This book also deals with the basic pattern recognition techniques (illustrated with speech signals using Matlab) such as PCA, LDA, ICA, SVM, HMM, GMM, BPN, and KSOM.

  8. Automated, high accuracy classification of Parkinsonian disorders: a pattern recognition approach.

    Directory of Open Access Journals (Sweden)

    Andre F Marquand

    Full Text Available Progressive supranuclear palsy (PSP), multiple system atrophy (MSA) and idiopathic Parkinson's disease (IPD) can be clinically indistinguishable, especially in the early stages, despite distinct patterns of molecular pathology. Structural neuroimaging holds promise for providing objective biomarkers for discriminating these diseases at the single subject level but all studies to date have reported incomplete separation of disease groups. In this study, we employed multi-class pattern recognition to assess the value of anatomical patterns derived from a widely available structural neuroimaging sequence for automated classification of these disorders. To achieve this, 17 patients with PSP, 14 with IPD and 19 with MSA were scanned using structural MRI along with 19 healthy controls (HCs). An advanced probabilistic pattern recognition approach was employed to evaluate the diagnostic value of several pre-defined anatomical patterns for discriminating the disorders, including: (i) a subcortical motor network; (ii) each of its component regions and (iii) the whole brain. All disease groups could be discriminated simultaneously with high accuracy using the subcortical motor network. The region providing the most accurate predictions overall was the midbrain/brainstem, which discriminated all disease groups from one another and from HCs. The subcortical network also produced more accurate predictions than the whole brain and all of its constituent regions. PSP was accurately predicted from the midbrain/brainstem, cerebellum and all basal ganglia compartments; MSA from the midbrain/brainstem and cerebellum and IPD from the midbrain/brainstem only. This study demonstrates that automated analysis of structural MRI can accurately predict diagnosis in individual patients with Parkinsonian disorders, and identifies distinct patterns of regional atrophy particularly useful for this process.

  9. A novel approach to object recognition and localization in automation and handling engineering

    Science.gov (United States)

    Stotz, M.; Kuehnle, J. U.; Verl, A.

    2008-11-01

    The industry is in need of reliable, computer-aided object recognition and localization systems in automation and handling engineering. One possible application is bin picking, i.e. the task of grasping work pieces out of a storage container with a robot; the parts then do not have to be ordered or semi-ordered but can be totally unordered. 2D image processing techniques often cannot perform such sophisticated tasks, since the gray-scale or color information provided is just not enough. An alternative is the examination of the remaining dimension. In this paper we discuss a novel approach to a 3D object recognizer that localizes objects by looking at the primitive features within the objects. The basic idea of the system is that the geometric primitives usually carry enough information to make proper object recognition and localization possible. The algorithms use 3D best-fitting combined with clever 2.5D preprocessing. The feasibility of the approach is demonstrated and tested by means of a prototypical bin-picking system. The time taken to recognize and localize an object is < 0.5 s, and the accuracy of the result is of the order of magnitude of the measurement inaccuracy, < 0.5 mm.

  10. Dead regions in the cochlea: Implications for speech recognition and applicability of articulation index theory

    DEFF Research Database (Denmark)

    Vestergaard, Martin David

    2003-01-01

    Dead regions in the cochlea have been suggested to be responsible for the failure of hearing aid users to benefit from apparently increased audibility in terms of speech intelligibility. As an alternative to the more cumbersome psychoacoustic tuning curve measurement, threshold-equalizing noise (TEN

  11. Investigating Applications of Speech-to-Text Recognition Technology for a Face-to-Face Seminar to Assist Learning of Non-Native English-Speaking Participants

    Science.gov (United States)

    Shadiev, Rustam; Hwang, Wu-Yuin; Huang, Yueh-Min; Liu, Chia-Ju

    2016-01-01

    This study applied speech-to-text recognition (STR) technology to assist non-native English-speaking participants to learn at a seminar given in English. How participants used transcripts generated by the STR technology for learning and their perceptions toward the STR were explored. Three main findings are presented in this study. Most…

  12. A Pilot Investigation regarding Speech-Recognition Performance in Noise for Adults with Hearing Loss in the FM+HA Listening Condition

    Science.gov (United States)

    Lewis, M. Samantha; Gallun, Frederick J.; Gordon, Jane; Lilly, David J.; Crandell, Carl

    2010-01-01

    While the concurrent use of the hearing aid (HA) microphone with frequency modulation (FM) technology can decrease speech-recognition performance, the FM+HA condition is still an important setting for users of both HA and FM technology. The primary goal of this investigation was to evaluate the effect of attenuating HA gain in the FM+HA listening…

  13. SOFTWARE EFFORT ESTIMATION FRAMEWORK TO IMPROVE ORGANIZATION PRODUCTIVITY USING EMOTION RECOGNITION OF SOFTWARE ENGINEERS IN SPONTANEOUS SPEECH

    Directory of Open Access Journals (Sweden)

    B.V.A.N.S.S. Prabhakar Rao

    2015-10-01

    Full Text Available Productivity is a very important part of any organisation in general, and of the software industry in particular. Nowadays software effort estimation is a challenging task, and effort and productivity are inter-related. Productivity is achieved through the employees of the organisation, and every organisation requires emotionally stable employees for seamless and progressive working. In other industries this may be achievable without manpower, but software project development is a labour-intensive activity: each line of code must be delivered by a software engineer, with tools and techniques acting only as aids or supplements. Whatever the reason, the software industry has been suffering in its success rate, facing problems in delivering projects on time and within the estimated budget. If we want to estimate the required effort of a project, it is important to know the emotional state of the team members. The responsibility of ensuring emotional contentment falls on the human resource department, which can deploy a series of systems to carry out its survey. This analysis can be done using a variety of tools; one such tool is the study of emotion recognition. The data needed for this is readily available and collectable and can be an excellent source for feedback systems. The challenge of recognizing emotion in speech is convoluted primarily due to noisy recording conditions, variations in sentiment across the sample space, and the exhibition of multiple emotions in a single sentence. Ambiguity in the labels of the training set also increases the complexity of the problem addressed. Existing probabilistic models have dominated the study but present a flaw in scalability due to statistical inefficiency. The problem of sentiment prediction in spontaneous speech can thus be addressed using a hybrid system comprising a Convolution Neural Network and

  14. Relating hearing loss and executive functions to hearing aid users’ preference for, and speech recognition with, different combinations of binaural noise reduction and microphone directionality

    Directory of Open Access Journals (Sweden)

    Tobias eNeher

    2014-12-01

    Full Text Available Knowledge of how executive functions relate to preferred hearing aid (HA) processing is sparse and seemingly inconsistent with related knowledge for speech recognition outcomes. This study thus aimed to find out if (1) performance on a measure of reading span (RS) is related to preferred binaural noise reduction (NR) strength, (2) similar relations exist for two different, nonverbal measures of executive function, (3) pure-tone average hearing loss (PTA), signal-to-noise ratio (SNR), and microphone directionality (DIR) also influence preferred NR strength, and (4) preference and speech recognition outcomes are similar. Sixty elderly HA users took part. Six HA conditions consisting of omnidirectional or cardioid microphones followed by inactive, moderate, or strong binaural NR as well as linear amplification were tested. Outcome was assessed at fixed SNRs using headphone simulations of a frontal target talker in a busy cafeteria. Analyses showed positive effects of active NR and DIR on preference, and negative and positive effects of, respectively, strong NR and DIR on speech recognition. Also, while moderate NR was the most preferred NR setting overall, preference for strong NR increased with SNR. No relation between RS and preference was found. However, larger PTA was related to weaker preference for inactive NR and stronger preference for strong NR for both microphone modes. Equivalent (but weaker) relations between worse performance on one nonverbal measure of executive function and the HA conditions without DIR were found. For speech recognition, there were relations between HA condition, PTA, and RS, but their pattern differed from that for preference. Altogether, these results indicate that, while moderate NR works well in general, a notable proportion of HA users prefer stronger NR. Furthermore, PTA and executive functions can account for some of the variability in preference for, and speech recognition with, different binaural NR and DIR settings.

  15. Emotional intelligence, not music training, predicts recognition of emotional speech prosody.

    Science.gov (United States)

    Trimmer, Christopher G; Cuddy, Lola L

    2008-12-01

    Is music training associated with greater sensitivity to emotional prosody in speech? University undergraduates (n = 100) were asked to identify the emotion conveyed in both semantically neutral utterances and melodic analogues that preserved the fundamental frequency contour and intensity pattern of the utterances. Utterances were expressed in four basic emotional tones (anger, fear, joy, sadness) and in a neutral condition. Participants also completed an extended questionnaire about music education and activities, and a battery of tests to assess emotional intelligence, musical perception and memory, and fluid intelligence. Emotional intelligence, not music training or music perception abilities, successfully predicted identification of intended emotion in speech and melodic analogues. The ability to recognize cues of emotion accurately and efficiently across domains may reflect the operation of a cross-modal processor that does not rely on gains of perceptual sensitivity such as those related to music training. PMID:19102595

  16. Speech recognition by bilateral cochlear implant users in a cocktail-party setting

    OpenAIRE

    Loizou, Philipos C.; Hu, Yi; Litovsky, Ruth; Yu, Gongqiang; PETERS, Robert; Lake, Jennifer; Roland, Peter

    2009-01-01

    Unlike prior studies with bilateral cochlear implant users which considered only one interferer, the present study considered realistic listening situations wherein multiple interferers were present and in some cases originating from both hemifields. Speech reception thresholds were measured in bilateral users unilaterally and bilaterally in four different spatial configurations, with one and three interferers consisting of modulated noise or competing talkers. The data were analyzed in terms...

  17. Creating Accessible Educational Multimedia through Editing Automatic Speech Recognition Captioning in Real Time

    OpenAIRE

    Wald, M

    2006-01-01

    Lectures can be digitally recorded and replayed to provide multimedia revision material for students who attended the class and a substitute learning experience for students unable to attend. Deaf and hard of hearing people can find it difficult to follow speech through hearing alone or to take notes while they are lip-reading or watching a sign-language interpreter. Notetakers can only summarise what is being said while qualified sign language interpreters with a good understanding of the re...

  18. Captioning for Deaf and Hard of Hearing People by Editing Automatic Speech Recognition in Real Time

    OpenAIRE

    Wald, M

    2006-01-01

    Deaf and hard of hearing people can find it difficult to follow speech through hearing alone or to take notes when lip-reading or watching a sign-language interpreter. Notetakers summarise what is being said while qualified sign language interpreters with a good understanding of the relevant higher education subject content are in very scarce supply. Real time captioning/transcription is not normally available in UK higher education because of the shortage of real time stenographers. Lectures...

  19. Speech recall and word recognition depending on prosodic and musical cues as well as voice pitch

    OpenAIRE

    Rozanovskaya, Anna; Sokolova, Taisia

    2011-01-01

    Within this study, speech perception in different conditions was examined. The aim of the research was to compare perception results based on stimuli mode (plain spoken, rhythmic spoken or rhythmic sung stimuli) and pitch (normal, lower and higher). In the study, an experiment was conducted on 44 participants who had been asked to listen to 9 recorded sentences in Russian language (unknown to them) and write them down using Latin letters. These 9 sentences were specially prepared using differ...

  20. Speech emotion recognition in emotional feedback for Human-Robot Interaction

    Directory of Open Access Journals (Sweden)

    Javier G. Rázuri

    2015-02-01

    Full Text Available For robots to plan their actions autonomously and interact with people, recognizing human emotions is crucial. For most humans, nonverbal cues such as pitch, loudness, spectrum, and speech rate are efficient carriers of emotion. The features of the sound of a spoken voice probably contain crucial information on the emotional state of the speaker, and within this framework a machine might use such properties of sound to recognize emotions. This work evaluated six different kinds of classifiers for predicting six basic universal emotions from non-verbal features of human speech. The classification techniques used information from six audio files extracted from the eNTERFACE05 audio-visual emotion database. The information gain from a decision tree was also used to choose the most significant speech features from a set of acoustic features commonly extracted in emotion analysis. The classifiers were evaluated with the proposed features and with the features selected by the decision tree. With this feature selection, each of the compared classifiers increased its global accuracy and recall. The best performance was obtained with the Support Vector Machine and BayesNet.
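    A rough Python sketch of that pipeline shape: information-gain-style feature ranking followed by an SVM. The data here is synthetic, and scikit-learn's mutual-information selector stands in for the paper's decision-tree information gain:

        import numpy as np
        from sklearn.feature_selection import SelectKBest, mutual_info_classif
        from sklearn.model_selection import cross_val_score
        from sklearn.pipeline import make_pipeline
        from sklearn.svm import SVC

        # Synthetic stand-ins for per-utterance acoustic features and labels.
        X = np.random.randn(300, 40)
        y = np.random.randint(0, 6, 300)   # six basic emotions

        # Rank features by mutual information (an information-gain-style
        # criterion), keep the top 15, then classify with an SVM.
        clf = make_pipeline(SelectKBest(mutual_info_classif, k=15), SVC())
        print(cross_val_score(clf, X, y, cv=3).mean())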

  1. Automated urban features classification and recognition from combined RGB/lidar data

    Science.gov (United States)

    Elhifnawy Eid, Hassan Elsaid

    2011-12-01

    Although a Red, Green and Blue (RGB) image provides rich semantic information for different features, it is difficult to extract and separate features that share similar texture properties. The data provided by a LIght Detection And Ranging (LIDAR) system contain dense spatial information for terrain and non-terrain objects, but feature extraction poses difficulties in separating different features that share the same height information. The thesis objective is to introduce an automated urban classification technique using combined semantic and spatial information, leading to the ability to extract different features efficiently. RGB color channels are used to produce two color-invariant images for identifying vegetation and shadowy areas. Otsu segmentation is applied to these color-invariant images to distinguish shadow and vegetation candidates from each other. The RGB image is transformed into two other color spaces, YCbCr and HSV. The luminance channel is extracted from the YCbCr color space, while the hue and saturation channels are extracted from the HSV color space. Global thresholding is applied to these color channels, individually and collectively, for detecting sandy areas. The wavelet transform is used for detecting building boundaries from LIDAR height data. Final building candidates are identified after removing vegetation areas from the image of buildings extracted from the LIDAR data. After the successful extraction of buildings (using wavelets) and of vegetation, sandy and shadowy areas (from the RGB image), the remaining features are the roads. This filter combination introduces a highly efficient automatic urban classification approach from combined LIDAR/RGB data. The proposed urban classification algorithm produces classified libraries for several features, and in order to use this output efficiently an independent search algorithm is required. An efficient texture and boundary search algorithm is introduced for automatic object recognition of buildings using both

  2. Automated recognition of body cavities in torso X-ray CT images

    International Nuclear Information System (INIS)

    Recently, it has become even more important to understand the normal anatomical structures of the human body in the field of radiological image anatomy. The recognition of body cavities is useful when organ extraction is performed. However, it is difficult to perform organ extraction and to obtain a clear understanding of anatomical structures based on CT values. For example, the distributions of the CT numbers of muscle and organs overlap each other. Therefore, the recognition of the body cavity domain can reduce the range of internal organ extraction and simplify organ segmentation. The purpose of the present study was to develop an automated method for recognizing the body cavities in torso X-ray CT images. Our body cavity recognition method is based on skeletal positions and is performed by extracting the borders of the thoracic cavity, abdominal cavity, and pelvic cavity. This method detects the initial and final points of such borders and searches for the shortest path on the surfaces of the skeletal structures. It then generates the boundary surfaces of the thoracic, abdominal, and pelvic cavities by drawing straight lines between the shortest paths and the centers of the paths. The body cavities located between the superior thoracic aperture and the diaphragm, the diaphragm and the pelvic inlet, and the pelvic inlet and the pelvic outlet are regarded as the thoracic cavity, the abdominal cavity, and the pelvic cavity, respectively. The method was applied to 20 cases in which torso X-ray CT images were obtained. The results showed that the body cavities were extracted correctly in most cases: the thoracic cavity in 17 cases, the abdominal cavity in 19 cases, and the pelvic cavity in 18 cases. We confirmed that our proposed method is effective for recognizing these body cavities. As one of the applications of our study, segmentation of muscle, fat, and the rectum in CT images was performed using the information obtained for body cavity structures. The results

  3. One-against-all weighted dynamic time warping for language-independent and speaker-dependent speech recognition in adverse conditions.

    Directory of Open Access Journals (Sweden)

    Xianglilan Zhang

    Full Text Available Considering personal privacy and the difficulty of obtaining training material for many seldom-used English words and (often non-English) names, language-independent (LI) recognition with lightweight speaker-dependent (SD) automatic speech recognition (ASR) is a promising option to solve the problem. The dynamic time warping (DTW) algorithm is the state-of-the-art algorithm for small-footprint SD ASR applications with limited storage space and a small vocabulary, such as voice dialing on mobile devices, menu-driven recognition, and voice control on vehicles and robotics. Even though we have successfully developed two fast and accurate DTW variations for clean speech data, speech recognition for adverse conditions is still a big challenge. In order to improve recognition accuracy in noisy environments and bad recording conditions, such as too high or too low volume, we introduce a novel one-against-all weighted DTW (OAWDTW). This method defines a one-against-all index (OAI) for each time frame of the training data and applies the OAIs to the core DTW process. Given two speech signals, OAWDTW tunes their final alignment score by using the OAI in the DTW process. Our method achieves better accuracies than DTW and merge-weighted DTW (MWDTW); a 6.97% relative reduction of error rate (RRER) compared with DTW and a 15.91% RRER compared with MWDTW are observed in our extensive experiments on one representative SD dataset of four speakers' recordings. To the best of our knowledge, the OAWDTW approach is the first weighted DTW specially designed for speech data in adverse conditions.
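    For context, the unweighted DTW baseline that OAWDTW extends can be sketched in a few lines of Python; the one-against-all weights described above would scale the local distances d before accumulation:

        import numpy as np

        def dtw_cost(a, b):
            """Plain DTW alignment cost between two feature sequences
            (frames x dims). OAWDTW would weight the local distance of
            each training frame by its one-against-all index."""
            n, m = len(a), len(b)
            D = np.full((n + 1, m + 1), np.inf)
            D[0, 0] = 0.0
            for i in range(1, n + 1):
                for j in range(1, m + 1):
                    d = np.linalg.norm(a[i - 1] - b[j - 1])   # local distance
                    D[i, j] = d + min(D[i - 1, j],            # insertion
                                      D[i, j - 1],            # deletion
                                      D[i - 1, j - 1])        # match
            return D[n, m]

        template = np.random.randn(50, 13)   # e.g. MFCC frames of a stored word
        query = np.random.randn(60, 13)
        print(dtw_cost(template, query))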

  4. Laser Opto-Electronic Correlator for Robotic Vision Automated Pattern Recognition

    Science.gov (United States)

    Marzwell, Neville

    1995-01-01

    A compact laser opto-electronic correlator for pattern recognition has been designed, fabricated, and tested. Specifically, it is a translation-sensitivity-adjustable compact optical correlator (TSACOC) utilizing convergent laser beams for the holographic filter. Its properties and performance, including the location of the correlation peak and the effects of lateral and longitudinal displacements for both filters and input images, are systematically analyzed based on the nonparaxial approximation for the reference beam. The theoretical analyses have been verified in experiments. In applying the TSACOC to important practical problems, including fingerprint identification, we have found that the tolerance of the system to lateral displacement of the input can be conveniently increased by changing a geometric factor of the system. The system can be compactly packaged using miniature laser diode sources and can be used in space by the National Aeronautics and Space Administration (NASA) as well as in ground commercial applications, which include robotic vision and the industrial inspection of automated quality control operations. The personnel of Standard International will work closely with the Jet Propulsion Laboratory (JPL) to transfer the technology to the commercial market. Prototype systems will be fabricated to test the market and perfect the product. Large-scale production will follow after successful results are achieved.

  5. An integrated and interactive decision support system for automated melanoma recognition of dermoscopic images.

    Science.gov (United States)

    Rahman, M M; Bhattacharya, P

    2010-09-01

    This paper presents an integrated and interactive decision support system for the automated melanoma recognition of the dermoscopic images based on image retrieval by content and multiple expert fusion. In this context, the ultimate aim is to support the decision making by retrieving and displaying the relevant past cases as well as predicting the image categories (e.g., melanoma, benign and dysplastic nevi) by combining outputs from different classifiers. However, the most challenging aspect in this domain is to detect the lesion from the healthy background skin and extract the lesion-specific local image features. A thresholding-based segmentation method is applied on the intensity images generated from two different schemes to detect the lesion. For the fusion-based image retrieval and classification, the lesion-specific local color and texture features are extracted and represented in the form of the mean and variance-covariance of color channels and in a combined feature space. The performance is evaluated by using both the precision-recall and classification accuracies. Experimental results on a dermoscopic image collection demonstrate the effectiveness of the proposed system and show the viability of a real-time clinical application. PMID:19942406
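    In its simplest form, the lesion-detection step described above reduces to global thresholding of an intensity image; a sketch assuming scikit-image and Otsu's threshold (the paper's two intensity-generation schemes and the subsequent feature extraction are not reproduced here):

        import numpy as np
        from skimage.color import rgb2gray
        from skimage.filters import threshold_otsu

        def lesion_mask(rgb_image):
            """Global-threshold lesion detection: build an intensity
            image, pick an Otsu threshold, keep the darker region as
            the lesion candidate (lesions are typically darker than
            the surrounding skin)."""
            gray = rgb2gray(np.asarray(rgb_image, dtype=float))
            t = threshold_otsu(gray)
            return gray < t

        mask = lesion_mask(np.random.rand(128, 128, 3))  # stand-in image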

  6. A large-scale dataset of solar event reports from automated feature recognition modules

    Science.gov (United States)

    Schuh, Michael A.; Angryk, Rafal A.; Martens, Petrus C.

    2016-05-01

    The massive repository of images of the Sun captured by the Solar Dynamics Observatory (SDO) mission has ushered in the era of Big Data for Solar Physics. In this work, we investigate the entire public collection of events reported to the Heliophysics Event Knowledgebase (HEK) from automated solar feature recognition modules operated by the SDO Feature Finding Team (FFT). With the SDO mission recently surpassing five years of operations, and over 280,000 event reports for seven types of solar phenomena, we present the broadest and most comprehensive large-scale dataset of the SDO FFT modules to date. We also present numerous statistics on these modules, providing valuable contextual information for better understanding and validating of the individual event reports and the entire dataset as a whole. After extensive data cleaning through exploratory data analysis, we highlight several opportunities for knowledge discovery from data (KDD). Through these important prerequisite analyses presented here, the results of KDD from Solar Big Data will be overall more reliable and better understood. As the SDO mission remains operational over the coming years, these datasets will continue to grow in size and value. Future versions of this dataset will be analyzed in the general framework established in this work and maintained publicly online for easy access by the community.

  7. An automated target recognition technique for image segmentation and scene analysis

    Energy Technology Data Exchange (ETDEWEB)

    Baumgart, C.W.; Ciarcia, C.A.

    1994-02-01

    Automated target recognition software has been designed to perform image segmentation and scene analysis. Specifically, this software was developed as a package for the Army's Minefield and Reconnaissance and Detector (MIRADOR) program. MIRADOR is an on/off road, remote control, multi-sensor system designed to detect buried and surface-emplaced metallic and non-metallic anti-tank mines. The basic requirements for this ATR software were: (1) an ability to separate target objects from the background in low S/N conditions; (2) an ability to handle a relatively high dynamic range in imaging light levels; (3) the ability to compensate for or remove light source effects such as shadows; and (4) the ability to identify target objects as mines. The image segmentation and target evaluation was performed utilizing an integrated and parallel processing approach. Three basic techniques (texture analysis, edge enhancement, and contrast enhancement) were used collectively to extract all potential mine target shapes from the basic image. Target evaluation was then performed using a combination of size, geometrical, and fractal characteristics which resulted in a calculated probability for each target shape. Overall results with this algorithm were quite good, though there is a trade-off between detection confidence and the number of false alarms. This technology also has applications in the areas of hazardous waste site remediation, archaeology, and law enforcement.

  8. Comparative Study on Feature Selection and Fusion Schemes for Emotion Recognition from Speech

    Directory of Open Access Journals (Sweden)

    Santiago Planet

    2012-09-01

    Full Text Available The automatic analysis of speech to detect affective states may improve the way users interact with electronic devices. However, analysis at the acoustic level alone may not be enough to determine the emotion of a user in a realistic scenario. In this paper we analyzed the spontaneous speech recordings of the FAU Aibo Corpus at the acoustic and linguistic levels to extract two sets of features. The acoustic set was reduced by a greedy procedure selecting the most relevant features to optimize the learning stage. We compared two versions of this greedy selection algorithm, performing the search for the relevant features forwards and backwards. We experimented with three classification approaches: Naïve-Bayes, a support vector machine, and a logistic model tree, and two fusion schemes: decision-level fusion, merging the hard decisions of the acoustic and linguistic classifiers by means of a decision tree; and feature-level fusion, concatenating both sets of features before the learning stage. Despite the low performance achieved by the linguistic data, a dramatic improvement was achieved after its combination with the acoustic information, improving the results achieved by this second modality on its own. The results achieved by the classifiers using the parameters merged at the feature level outperformed the classification results of the decision-level fusion scheme, despite the simplicity of that scheme. Moreover, the extremely reduced set of acoustic features obtained by the greedy forward selection algorithm improved the results provided by the full set.
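    The greedy forward variant of the selection procedure described above can be sketched as follows; the classifier and the cross-validation setup are illustrative choices, not the paper's exact protocol (the backward variant starts from the full set and removes features instead):

        import numpy as np
        from sklearn.model_selection import cross_val_score
        from sklearn.naive_bayes import GaussianNB

        def greedy_forward_select(X, y, max_feats=10):
            """Repeatedly add the single feature that most improves
            cross-validated accuracy; stop when nothing helps."""
            selected, remaining = [], list(range(X.shape[1]))
            best_score = 0.0
            while remaining and len(selected) < max_feats:
                scores = [(cross_val_score(GaussianNB(), X[:, selected + [f]],
                                           y, cv=3).mean(), f)
                          for f in remaining]
                score, f = max(scores)
                if score <= best_score:
                    break                  # no single feature helps any more
                best_score = score
                selected.append(f)
                remaining.remove(f)
            return selected, best_score

        X, y = np.random.randn(120, 20), np.random.randint(0, 2, 120)
        print(greedy_forward_select(X, y))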

  9. A survey of speech emotion recognition in human-computer interaction

    Institute of Scientific and Technical Information of China (English)

    张石清; 李乐民; 赵知劲

    2013-01-01

    Speech emotion recognition is currently an active research topic in the fields of signal processing, pattern recognition, artificial intelligence, and human-computer interaction. The ultimate purpose of such research is to endow computers with emotional ability and make human-computer interaction genuinely harmonious and natural. This paper reviews recent advances in several key problems involved in speech emotion recognition, including emotion description theory, emotional speech databases, emotional acoustic feature analysis, and emotion recognition methods. In addition, the open research problems and directions for future development are presented.

  10. Research Status and Development Trend of Russian Speech Recognition Technology

    Institute of Scientific and Technical Information of China (English)

    马延周

    2015-01-01

    Technological advances in speech recognition have promoted intelligent human-computer interaction, and practical speech recognition technology has made communication between people easier and more immediate. Starting from the history of speech recognition and the current state of Russian speech recognition, this paper analyzes in detail the fundamental principles of speech recognition, HMM-based speech recognition technology, and the theoretical basis of large-vocabulary continuous speech recognition, and describes how Russian acoustic models and language models are built. It then discusses strategies for dealing with the difficult problems facing speech recognition technology, and concludes with an outlook on the development directions and application prospects of Russian speech recognition technology.

  11. Long Short-Term Memory Based Recurrent Neural Network Architectures for Large Vocabulary Speech Recognition

    OpenAIRE

    SAK, Haşim; Senior, Andrew; Beaufays, Françoise

    2014-01-01

    Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have been successfully used for sequence labeling and sequence prediction tasks, such as handwriting recognition, language modeling, phonetic labeling of acoustic frames. However, in contrast to the de...

  12. GesRec3D: a real-time coded gesture-to-speech system with automatic segmentation and recognition thresholding using dissimilarity measures

    OpenAIRE

    Craven, Michael P; Curtis, K. Mervyn

    2004-01-01

    A complete microcomputer system is described, GesRec3D, which facilitates the data acquisition, segmentation, learning, and recognition of 3-Dimensional arm gestures, with application as a Augmentative and Alternative Communication (AAC) aid for people with motor and speech disability. The gesture data is acquired from a Polhemus electro-magnetic tracker system, with sensors attached to the finger, wrist and elbow of one arm. Coded gestures are linked to user-defined text, to be spoken by a t...

  13. Logistic Kernel Function and its Application to Speech Recognition

    Institute of Scientific and Technical Information of China (English)

    刘晓峰; 张雪英; Zizhong John Wang

    2015-01-01

    The kernel function is the core of the support vector machine (SVM) and directly determines its performance. In order to improve the learning ability and generalization ability of SVM for speech recognition, this paper presents a Logistic kernel function and gives a theoretical proof that it is a Mercer kernel. Experimental results on the bi-spiral problem and on speech recognition show that the presented Logistic kernel function is effective and performs better than linear, polynomial, radial basis, and exponential radial basis kernel functions; in speech recognition in particular, it achieves better recognition performance.

  14. Optimizing Automatic Speech Recognition for Low-Proficient Non-Native Speakers

    Directory of Open Access Journals (Sweden)

    Catia Cucchiarini

    2010-01-01

    Full Text Available Computer-Assisted Language Learning (CALL) applications for improving the oral skills of low-proficiency learners have to cope with non-native speech that is particularly challenging. Since unconstrained non-native ASR is still problematic, a possible solution is to elicit constrained responses from the learners. In this paper, we describe experiments aimed at selecting utterances from lists of responses. The first experiment, on utterance selection, indicates that the decoding process can be improved by optimizing the language model and the acoustic models, thus reducing the utterance error rate from 29–26% to 10–8%. Since giving feedback on incorrectly recognized utterances is confusing, we verify the correctness of the utterance before providing feedback. The results of the second experiment, on utterance verification, indicate that combining duration-related features with a likelihood ratio (LR) yields an equal error rate (EER) of 10.3%, which is significantly better than the EER for the other measures in isolation.
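    The equal error rate used to report utterance verification above is the operating point where false accepts and false rejects balance; a small Python sketch, assuming one verification score (e.g. a duration-augmented likelihood ratio) per utterance:

        import numpy as np

        def equal_error_rate(target_scores, nontarget_scores):
            """Sweep a decision threshold over all observed scores and
            return the rate at the point where false-accept and
            false-reject rates are closest."""
            thresholds = np.sort(np.concatenate([target_scores,
                                                 nontarget_scores]))
            best_gap, eer = 2.0, 0.5
            for t in thresholds:
                far = np.mean(nontarget_scores >= t)  # false accepts
                frr = np.mean(target_scores < t)      # false rejects
                if abs(far - frr) < best_gap:
                    best_gap, eer = abs(far - frr), (far + frr) / 2
            return eer

        rng = np.random.default_rng(0)
        correct = rng.normal(1.0, 1.0, 500)     # scores of correct utterances
        incorrect = rng.normal(-1.0, 1.0, 500)  # scores of incorrect ones
        print(equal_error_rate(correct, incorrect))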

  15. Robust Speech Recognition Based on Vector Taylor Series

    Institute of Scientific and Technical Information of China (English)

    吕勇; 吴镇扬

    2011-01-01

    The vector Taylor series (VTS) expansion is an effective approach to noise-robust speech recognition. However, in the log-spectral domain there are strong correlations among the different channels of the mel filter bank, which makes it difficult to estimate the noise variance from noisy speech. This paper proposes a feature compensation algorithm in the cepstral domain based on the vector Taylor series. In this algorithm, the distribution of speech cepstral features is represented by a Gaussian mixture model (GMM), and the mean and variance of the noise are estimated from the noisy speech through the VTS approximation. Experimental results show that the proposed algorithm can significantly improve the performance of a speech recognition system and outperforms the VTS-based feature compensation method in the log-spectral domain.
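    For reference, the first-order VTS relation commonly used for this kind of feature compensation can be written in the cepstral domain (generic textbook notation with DCT matrix C; the paper's exact formulation may differ) as

        y = x + C \log\!\left( 1 + e^{\,C^{-1}(n - x)} \right) \approx y_0 + G\,(x - \mu_x) + (I - G)\,(n - \mu_n), \qquad G = \left.\frac{\partial y}{\partial x}\right|_{(\mu_x,\,\mu_n)},

    where x, n and y are the clean-speech, noise and noisy-speech cepstra; with a GMM over x, the expansion point \mu_x (and hence G) is computed per mixture component.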

  16. Study of the vocal signal in the amplitude-time representation. Speech segmentation and recognition algorithms

    International Nuclear Information System (INIS)

    This dissertation presents an acoustical and phonetical study of the vocal signal. The complex pattern of the signal is segmented into simple sub-patterns, and each of these sub-patterns may be segmented again into simpler, lower-level patterns. The application of pattern recognition techniques facilitates, on the one hand, this segmentation and, on the other hand, the definition of the structural relations between the sub-patterns. In particular, we have developed syntactic techniques in which the context-sensitive rewriting rules are controlled by predicates using parameters evaluated on the sub-patterns themselves. This allows a pure syntactic analysis to be generalized by adding semantic information. The system we present performs pre-classification and partial identification of the phonemes, as well as accurate detection of each pitch period. The voice signal is analysed directly using the amplitude-time representation. This system has been implemented on a mini-computer and works in real time. (author)

  17. Multiresolution analysis (discrete wavelet transform) through Daubechies family for emotion recognition in speech.

    Science.gov (United States)

    Campo, D.; Quintero, O. L.; Bastidas, M.

    2016-04-01

    We propose a study of the mathematical properties of voice as an audio signal. This work includes signals in which the channel conditions are not ideal for emotion recognition. Multiresolution analysis (discrete wavelet transform) was performed using the Daubechies wavelet family (Db1-Haar, Db6, Db8, Db10), allowing the decomposition of the initial audio signal into sets of coefficients from which a set of features was extracted and analyzed statistically in order to differentiate emotional states. ANNs proved to be a system that allows an appropriate classification of such states. This study shows that the features extracted using wavelet decomposition are sufficient to analyze and extract emotional content in audio signals, yielding a high accuracy rate in the classification of emotional states without the need for other kinds of classical frequency-time features. Accordingly, this paper seeks to characterize mathematically the six basic emotions in humans (boredom, disgust, happiness, anxiety, anger and sadness), with neutrality also included, for a total of seven states to identify.
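    A minimal PyWavelets sketch of the decomposition-plus-statistics scheme described above; the specific per-band statistics are assumptions (the paper's exact feature set is not reproduced here):

        import numpy as np
        import pywt

        def wavelet_band_features(signal, wavelet="db6", level=4):
            """Multiresolution decomposition with a Daubechies wavelet,
            then simple statistics per sub-band (approximation plus
            detail bands) as candidate emotion features."""
            coeffs = pywt.wavedec(signal, wavelet, level=level)
            feats = []
            for band in coeffs:
                feats += [band.mean(), band.std(), np.abs(band).max(),
                          np.sum(band ** 2)]   # band energy
            return np.array(feats)

        x = np.random.randn(16000)   # stand-in for one second of speech at 16 kHz
        print(wavelet_band_features(x).shape)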

  18. Talking Speech Input.

    Science.gov (United States)

    Berliss-Vincent, Jane; Whitford, Gigi

    2002-01-01

    This article presents both the factors involved in successful speech input use and the potential barriers that may suggest that other access technologies could be more appropriate for a given individual. Speech input options that are available are reviewed and strategies for optimizing use of speech recognition technology are discussed. (Contains…

  19. Time-Frequency Feature Representation Using Multi-Resolution Texture Analysis and Acoustic Activity Detector for Real-Life Speech Emotion Recognition

    Directory of Open Access Journals (Sweden)

    Kun-Ching Wang

    2015-01-01

    Full Text Available The classification of emotional speech is widely studied in speech-related research on human-computer interaction (HCI). The purpose of this paper is to present a novel feature extraction based on multi-resolution texture image information (MRTII). The MRTII feature set is derived from multi-resolution texture analysis for the characterization and classification of different emotions in a speech signal. The motivation is that emotions have different intensities in different frequency bands. In terms of human visual perception, the texture of the emotional speech spectrogram at multiple resolutions should therefore be a good feature set for emotion classification in speech, and multi-resolution texture analysis can discriminate between emotions more clearly than uniform-resolution analysis. In order to provide high emotional discrimination accuracy, especially in real-life conditions, an acoustic activity detection (AAD) algorithm is applied within the MRTII-based feature extraction. Considering the presence of many blended emotions in real life, this paper makes use of two corpora of naturally-occurring dialogs recorded in real-life call centers. Compared with the traditional Mel-scale Frequency Cepstral Coefficients (MFCC) and state-of-the-art features, the MRTII features improve the correct classification rates of the proposed systems across different language databases. Experimental results show that the proposed MRTII-based feature information, inspired by human visual perception of the spectrogram image, provides significant classification gains for real-life emotional recognition in speech.
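
    The record does not reproduce the exact MRTII indicators, but the underlying spectrogram-as-texture idea can be sketched with a local binary pattern descriptor computed at two resolutions; the LBP choice and all parameters below are stand-ins:

        import numpy as np
        from scipy.signal import spectrogram
        from skimage.feature import local_binary_pattern
        from skimage.transform import rescale

        def texture_features(audio, fs):
            """LBP histograms of the log-spectrogram at two resolutions."""
            _, _, S = spectrogram(audio, fs, nperseg=512, noverlap=384)
            img = np.log(S + 1e-10)
            feats = []
            for scale in (1.0, 0.5):                  # two "resolutions"
                level = rescale(img, scale, anti_aliasing=True)
                lbp = local_binary_pattern(level, P=8, R=1, method="uniform")
                hist, _ = np.histogram(lbp, bins=10, range=(0, 10), density=True)
                feats.append(hist)
            return np.concatenate(feats)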

  20. Speech recognition algorithm based on artificial bee colony modified DHMM

    Institute of Scientific and Technical Information of China (English)

    宁爱平; 张雪英

    2012-01-01

    The paper proposes a modified DHMM speech recognition algorithm that uses the Artificial Bee Colony (ABC) algorithm to cluster speech feature vectors and generate the optimal codebook in a Discrete Hidden Markov Model (DHMM) speech recognition system, addressing the LBG algorithm's dependence on the initial codebook and its tendency to fall into local optima. In the experiments, the feature vectors of the speech signal are first extracted. In the ABC algorithm, each food source represents a codebook, and the optimal codebook is obtained by iterating over the initial codebook with the bee-colony evolution scheme. The codeword labels of the optimal codebook then enter the DHMM for training and recognition. The experimental results show that the modified DHMM speech recognition algorithm achieves a higher recognition rate and better robustness than DHMM systems using either the traditional LBG algorithm or LBG with a particle-swarm-optimized initial codebook.
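
    The core of an ABC-style codebook search is easy to sketch: each food source is a candidate codebook scored by quantization distortion, and employed bees greedily accept perturbed solutions. Population size, iteration count, and the perturbation rule below are illustrative assumptions (the onlooker and scout phases are omitted):

        import numpy as np

        def distortion(codebook, X):
            """Mean squared distance from each vector to its nearest codeword."""
            d = ((X[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
            return d.min(axis=1).mean()

        def abc_codebook(X, n_codewords=16, n_sources=10, iters=100, rng=None):
            rng = rng if rng is not None else np.random.default_rng(0)
            # each food source = one candidate codebook seeded from random samples
            sources = [X[rng.choice(len(X), n_codewords, replace=False)].copy()
                       for _ in range(n_sources)]
            costs = [distortion(s, X) for s in sources]
            for _ in range(iters):
                for i in range(n_sources):          # employed-bee phase
                    j = int(rng.integers(n_sources))  # random partner source
                    phi = rng.uniform(-1, 1, sources[i].shape)
                    trial = sources[i] + phi * (sources[i] - sources[j])
                    c = distortion(trial, X)
                    if c < costs[i]:                 # greedy selection
                        sources[i], costs[i] = trial, c
            return sources[int(np.argmin(costs))]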

  1. Is talking to an automated teller machine natural and fun?

    Science.gov (United States)

    Chan, F Y; Khalid, H M

    Usability and affective issues of using automatic speech recognition technology to interact with an automated teller machine (ATM) are investigated in two experiments. The first uncovered the dialogue patterns of ATM users for the purpose of designing the user interface for a simulated speech ATM system. Applying the Wizard-of-Oz methodology together with multiple mapping and word-spotting techniques, the speech-driven ATM accommodates bilingual users of Bahasa Melayu and English. The second experiment evaluates the usability of a hybrid speech ATM, comparing it with a simulated manual ATM. The aim is to investigate how natural and fun talking to a speech ATM can be for these first-time users. Subjects performed withdrawal and balance enquiry tasks. An ANOVA was performed on the usability and affective data. The results showed significant differences between systems in the ability to complete the tasks as well as in transaction errors. Performance was measured by the time taken by subjects to complete the task and the number of speech recognition errors that occurred. On the basis of user emotions, it can be said that the hybrid speech system enabled pleasurable interaction. Despite the limitations of speech recognition technology, users are set to talk to the ATM when it becomes available for public use. PMID:14612327

  2. Towards Automation 2.0: A Neurocognitive Model for Environment Recognition, Decision-Making, and Action Execution

    Directory of Open Access Journals (Sweden)

    Zucker Gerhard

    2011-01-01

    Full Text Available The ongoing penetration of building automation by information technology is far from saturated. Today's systems must not only be reliable and fault tolerant, they also have to regard energy efficiency and flexibility in the overall consumption. Meeting the quality and comfort goals in building automation while at the same time optimizing towards energy, carbon footprint and cost-efficiency requires systems that are able to handle large amounts of information and negotiate system behaviour that resolves conflicting demands: a decision-making process. In recent years, research has started to focus on bionic principles for designing new concepts in this area. The information processing principles of the human mind have turned out to be of particular interest, as the mind is capable of processing huge amounts of sensory data and taking adequate decisions for (re)actions based on these analysed data. In this paper, we discuss how a bionic approach can solve the upcoming problems of energy-optimal systems. A recently developed model for environment recognition and decision-making processes, based on research findings from different disciplines of brain research, is introduced. This model is the foundation for applications in intelligent building automation that have to deal with information from home and office environments. All of these applications have in common that they consist of a combination of communicating nodes and have many, partly contradicting goals.

  3. ACQUA: Automated Cyanobacterial Quantification Algorithm for toxic filamentous genera using spline curves, pattern recognition and machine learning.

    Science.gov (United States)

    Gandola, Emanuele; Antonioli, Manuela; Traficante, Alessio; Franceschini, Simone; Scardi, Michele; Congestri, Roberta

    2016-05-01

    Toxigenic cyanobacteria are one of the main health risks associated with water resources worldwide, as their toxins can affect humans and fauna exposed via drinking water, aquaculture and recreation. Microscopy monitoring of cyanobacteria in water bodies and massive growth systems is a routine operation for cell abundance and growth estimation. Here we present ACQUA (Automated Cyanobacterial Quantification Algorithm), a new fully automated image analysis method designed for filamentous genera in bright-field microscopy. A pre-processing algorithm has been developed to highlight filaments of interest against background signals due to other phytoplankton and dust. A spline-fitting algorithm has been designed to recombine interrupted and crossing filaments in order to perform accurate morphometric analysis and to extract the surface pattern information of highlighted objects. In addition, 17 specific pattern indicators have been developed and used as input data for a machine-learning algorithm dedicated to discriminating among five widespread toxic or potentially toxic filamentous genera in freshwater: Aphanizomenon, Cylindrospermopsis, Dolichospermum, Limnothrix and Planktothrix. The method was validated using freshwater samples from three Italian volcanic lakes, comparing automated vs. manual results. ACQUA proved to be a fast and accurate tool to rapidly assess freshwater quality and to characterize cyanobacterial assemblages in aquatic environments. PMID:27012737

  4. Study of the Ability of the Articulation Index (AI) for Predicting the Unaided and Aided Speech Recognition Performance of 25 to 65 Years Old Hearing-Impaired Adults

    Directory of Open Access Journals (Sweden)

    Ghasem Mohammad Khani

    2001-05-01

    Full Text Available Background: In recent years there has been increased interest in the use of the AI for assessing hearing handicap and for measuring the potential effectiveness of amplification systems. The AI is an expression of the proportion of the average speech signal that is audible to a given patient, and it can vary between 0.0 and 1.0. Method and Materials: This cross-sectional analytical study was carried out in the department of audiology, rehabilitation faculty, IUMS, from 31 Oct 1998 to 7 March 1999, on 40 normal-hearing persons (80 ears; 19 males and 21 females) and 40 hearing-impaired persons (61 ears; 36 males and 25 females), 25-65 years old, with moderate to moderately severe SNHL. The Pavlovic procedure (1988) for calculating the AI, open-set taped standard monosyllabic word lists, and a real-ear probe-tube microphone system to measure insertion gain were used, with test-retest. Results: 1) A significant correlation was shown between the AI scores and the speech recognition scores of the normal-hearing and hearing-impaired groups with and without the hearing aid (P<0.05). 2) There were no significant differences across age groups and sex; 3) nor in test-retest measures of insertion gain in each test; and 4) no significant difference in test-retest speech recognition scores. Conclusion: According to these results, the AI can predict unaided and aided monosyllabic recognition test scores very well, and age and sex variables have no effect on its ability. Therefore, given the high reliability of the AI results and its simplicity, ease of use, low cost, and the little time its calculation requires, wide use of the AI is recommended, especially in clinical situations.
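
    The AI itself is an importance-weighted sum of per-band audibilities. The sketch below shows the arithmetic; the band-importance weights are placeholders, since the record does not reproduce Pavlovic's tables:

        import numpy as np

        def articulation_index(speech_spectrum, threshold, importance):
            """AI = sum_i w_i * audibility_i, each audibility clipped to [0, 1].
            speech_spectrum, threshold: per-band levels in dB; importance:
            weights summing to 1 (placeholders; see Pavlovic, 1988, for real
            values). A 30-dB dynamic range of speech is assumed per band."""
            audible_db = speech_spectrum - threshold
            audibility = np.clip(audible_db / 30.0, 0.0, 1.0)
            return float(np.sum(importance * audibility))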

  5. Automated Facial Expression Recognition Using Gradient-Based Ternary Texture Patterns

    OpenAIRE

    Faisal Ahmed; Emam Hossain

    2013-01-01

    Recognition of human expression from facial image is an interesting research area, which has received increasing attention in the recent years. A robust and effective facial feature descriptor is the key to designing a successful expression recognition system. Although much progress has been made, deriving a face feature descriptor that can perform consistently under changing environment is still a difficult and challenging task. In this paper, we present the gradient local ternary pattern (G...

  6. Advances in speech processing

    Science.gov (United States)

    Ince, A. Nejat

    1992-10-01

    The field of speech processing is undergoing rapid growth in terms of both performance and applications, fueled by advances in the areas of microelectronics, computation, and algorithm design. The use of voice for civil and military communications is discussed, considering advantages and disadvantages including the effects of environmental factors such as acoustic and electrical noise, interference, and propagation. The structure of the existing NATO communications network and the evolving Integrated Services Digital Network (ISDN) concept are briefly reviewed to show how they meet present and future requirements. The paper then deals with the fundamental subject of speech coding and compression. Recent advances in techniques and algorithms for speech coding now permit high quality voice reproduction at remarkably low bit rates. The subject of speech synthesis is treated next, where the principal objective is to produce natural quality synthetic speech from unrestricted text input. Speech recognition, where the ultimate objective is to produce a machine which would understand conversational speech with unrestricted vocabulary, from essentially any talker, is then discussed. Algorithms for speech recognition can be characterized broadly as pattern recognition approaches and acoustic-phonetic approaches. To date, the greatest degree of success in speech recognition has been obtained using pattern recognition paradigms. It is for this reason that the paper is concerned primarily with this technique.

  7. Comparing the effects of reverberation and of noise on speech recognition in simulated electric-acoustic listening

    OpenAIRE

    Helms Tillery, Kate; Brown, Christopher A; Bacon, Sid P.

    2012-01-01

    Cochlear implant users report difficulty understanding speech in both noisy and reverberant environments. Electric-acoustic stimulation (EAS) is known to improve speech intelligibility in noise. However, little is known about the potential benefits of EAS in reverberation, or about how such benefits relate to those observed in noise. The present study used EAS simulations to examine these questions. Sentences were convolved with impulse responses from a model of a room whose estimated reverbe...

  8. Forensic speaker recognition

    NARCIS (Netherlands)

    Meuwly, Didier

    2009-01-01

    The aim of forensic speaker recognition is to establish links between individuals and criminal activities, through audio speech recordings. This field is multidisciplinary, combining predominantly phonetics, linguistics, speech signal processing, and forensic statistics. On these bases, expert-based

  9. Automated Facial Expression Recognition Using Gradient-Based Ternary Texture Patterns

    Directory of Open Access Journals (Sweden)

    Faisal Ahmed

    2013-01-01

    Full Text Available Recognition of human expression from facial images is an interesting research area, which has received increasing attention in recent years. A robust and effective facial feature descriptor is the key to designing a successful expression recognition system. Although much progress has been made, deriving a face feature descriptor that can perform consistently under changing environments is still a difficult and challenging task. In this paper, we present the gradient local ternary pattern (GLTP), a discriminative local texture feature for representing facial expression. The proposed GLTP operator encodes the local texture of an image by computing the gradient magnitudes of the local neighborhood and quantizing those values into three discrimination levels. The location and occurrence information of the resulting micropatterns is then used as the face feature descriptor. The performance of the proposed method has been evaluated for the person-independent face expression recognition task. Experiments with prototypic expression images from the Cohn-Kanade (CK) face expression database validate that the GLTP feature descriptor can effectively encode the facial texture and thus achieves better recognition performance than some well-known appearance-based facial features.
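
    A minimal sketch of the GLTP idea: compute gradient magnitudes, code each 3x3 neighbor against the center pixel into three levels with a tolerance t, split the ternary code into two binary patterns, and histogram them. The tolerance and neighborhood below are illustrative assumptions, not the paper's exact operator:

        import numpy as np
        from scipy import ndimage

        def gltp_histogram(image, t=10):
            """Gradient local ternary pattern over 3x3 neighborhoods (simplified)."""
            gx = ndimage.sobel(image.astype(float), axis=1)
            gy = ndimage.sobel(image.astype(float), axis=0)
            g = np.hypot(gx, gy)                  # gradient magnitude image
            offsets = [(-1,-1),(-1,0),(-1,1),(0,1),(1,1),(1,0),(1,-1),(0,-1)]
            c = g[1:-1, 1:-1]
            upper = np.zeros_like(c)
            lower = np.zeros_like(c)
            for k, (dy, dx) in enumerate(offsets):
                n = g[1+dy:g.shape[0]-1+dy, 1+dx:g.shape[1]-1+dx]
                upper += (n > c + t) * (1 << k)   # ternary code +1 -> upper bits
                lower += (n < c - t) * (1 << k)   # ternary code -1 -> lower bits
            h_u, _ = np.histogram(upper, bins=256, range=(0, 256))
            h_l, _ = np.histogram(lower, bins=256, range=(0, 256))
            return np.concatenate([h_u, h_l])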

  10. Automated inspection of micro-defect recognition system for color filter

    Science.gov (United States)

    Jeffrey Kuo, Chung-Feng; Peng, Kai-Ching; Wu, Han-Cheng; Wang, Ching-Chin

    2015-07-01

    This study focused on micro-defect recognition and classification in color filters. First, six types of defects were examined, namely grain, black matrix hole (BMH), indium tin oxide (ITO) defect, missing edge and shape (MES), highlights, and particle. Orthogonal projection was applied to locate each pixel in a test image. Then, an image comparison was performed to mark similar blocks on the test image. The block that best resembled the template was chosen as the new template (the matching adaptive template). Afterwards, image subtraction was applied to subtract the pixels at the same location in each block of the test image from the matching adaptive template. A control-limit rule employed logic operations to separate the defect from the background region. The complete defect structure was obtained by the morphology method. Next, feature values, including defect gray value, red, green, and blue (RGB) color components, and aspect ratio, were obtained as the classifier input. The experimental results showed that defect recognition could be completed in as little as 0.154 s using the proposed recognition system and software. In micro-defect classification, a back-propagation neural network (BPNN) and a minimum distance classifier (MDC) served as the decision theories for classifying the five acquired feature values. To validate the proposed system, this study used 41 defects as training samples and treated the feature values of 307 test samples as the BPNN classifier inputs. The total recognition rate was 93.7%. When the MDC was used, the total recognition rate was 96.8%, indicating that the MDC method is feasible for applying automatic optical inspection technology to classify micro-defects of color filters. The proposed system is proven to successfully improve production yield and lower costs.
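
    The minimum distance classifier compared against the BPNN here is only a few lines: store one mean feature vector per defect class and assign each sample to the class with the nearest mean. A generic sketch, assuming feature extraction has already happened:

        import numpy as np

        class MinimumDistanceClassifier:
            """Assigns each sample to the class whose mean feature vector is nearest."""
            def fit(self, X, y):
                self.classes_ = np.unique(y)
                self.means_ = np.stack([X[y == c].mean(axis=0)
                                        for c in self.classes_])
                return self

            def predict(self, X):
                d = ((X[:, None, :] - self.means_[None, :, :]) ** 2).sum(-1)
                return self.classes_[d.argmin(axis=1)]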

  11. A preliminary study on automated freshwater algae recognition and classification system

    OpenAIRE

    Mosleh Mogeeb AA; Manssor Hayat; Malek Sorayya; Milow Pozi; Salleh Aishah

    2012-01-01

    Abstract Background Freshwater algae can be used as indicators to monitor freshwater ecosystem condition. Algae react quickly and predictably to a broad range of pollutants. Thus they provide early signals of worsening environment. This study was carried out to develop a computer-based image processing technique to automatically detect, recognize, and identify algae genera from the divisions Bacillariophyta, Chlorophyta and Cyanobacteria in Putrajaya Lake. Literature shows that most automated...

  12. Automated recognition of cell phenotypes in histology images based on membrane- and nuclei-targeting biomarkers

    International Nuclear Information System (INIS)

    Three-dimensional in vitro cultures of cancer cells are used to predict the effects of prospective anti-cancer drugs in vivo. In this study, we present an automated image analysis protocol for detailed morphological protein marker profiling of tumoroid cross-section images. Histologic cross sections of breast tumoroids developed in co-culture suspensions of breast cancer cell lines, stained for E-cadherin and progesterone receptor, were digitized, and the pixels in these images were classified into five categories using k-means clustering. Automated segmentation was used to identify image regions composed of cells expressing a given biomarker. Synthesized images were created to check the accuracy of the image processing system. The accuracy of automated segmentation was over 95% in identifying regions of interest in synthesized images. Image analysis of adjacent histology slides stained, respectively, for Ecad and PR accurately predicted regions of different cell phenotypes. Image analysis of tumoroid cross sections from different tumoroids obtained under the same co-culture conditions indicated the variation of cellular composition from one tumoroid to another. Variations in the compositions of cross sections obtained from the same tumoroid were established by parallel analysis of Ecad- and PR-stained cross-section images. The proposed image analysis methods offer standardized high-throughput profiling of the molecular anatomy of tumoroids based on both membrane and nuclei markers, suitable for rapid large-scale investigations of anti-cancer compounds for drug development.
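
    Pixel classification by k-means, as used above, can be reproduced with scikit-learn by clustering RGB values and mapping each pixel back to its cluster label; five clusters mirrors the study, while everything else below is generic:

        import numpy as np
        from sklearn.cluster import KMeans

        def classify_pixels(image, n_classes=5, seed=0):
            """Cluster RGB pixels into n_classes and return a label image."""
            h, w, _ = image.shape
            pixels = image.reshape(-1, 3).astype(float)
            labels = KMeans(n_clusters=n_classes, random_state=seed,
                            n_init=10).fit_predict(pixels)
            return labels.reshape(h, w)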

  13. Automation of the novel object recognition task for use in adolescent rats

    OpenAIRE

    Silvers, Janelle M; Harrod, Steven B.; Mactutus, Charles F.; Booze, Rosemarie M.

    2007-01-01

    The novel object recognition task is gaining popularity for its ability to test a complex behavior which relies on the integrity of memory and attention systems without placing undue stress upon the animal. While the task places few requirements upon the animal, it traditionally requires the experimenter to observe the test phase directly and record behavior. This approach can severely limit the number of subjects which can be tested in a reasonable period of time, as training and testing occ...

  14. Automation of Inter-Networked Banking and Teller Machine Operations Using Face Recognition

    OpenAIRE

    Prof. Mazher Khan; Dr. Sayyad Ajij; Majed Ahmed Khan

    2014-01-01

    In this article about biometric systems the general idea is to use of facial recognition to reinforce security on one of the oldest and most secure piece of technology that is still in use to date thus an Automatic Teller Machine. The main use for any biometric system is to authenticate an input by Identifying and verifying it in an existing database. Security in ATM’s has changed little since their introduction in the late 70’s. This puts them in a very vulnerable state as techno...

  15. Reliability of an Automated High-Resolution Manometry Analysis Program across Expert Users, Novice Users, and Speech-Language Pathologists

    Science.gov (United States)

    Jones, Corinne A.; Hoffman, Matthew R.; Geng, Zhixian; Abdelhalim, Suzan M.; Jiang, Jack J.; McCulloch, Timothy M.

    2014-01-01

    Purpose: The purpose of this study was to investigate inter- and intrarater reliability among expert users, novice users, and speech-language pathologists with a semiautomated high-resolution manometry analysis program. We hypothesized that all users would have high intrarater reliability and high interrater reliability. Method: Three expert…

  16. Individual Differences in Language Ability Are Related to Variation in Word Recognition, Not Speech Perception: Evidence from Eye Movements

    Science.gov (United States)

    McMurray, Bob; Munson, Cheyenne; Tomblin, J. Bruce

    2014-01-01

    Purpose: The authors examined speech perception deficits associated with individual differences in language ability, contrasting auditory, phonological, or lexical accounts by asking whether lexical competition is differentially sensitive to fine-grained acoustic variation. Method: Adolescents with a range of language abilities (N = 74, including…

  17. Automated recognition and tracking of aerosol threat plumes with an IR camera pod

    Science.gov (United States)

    Fauth, Ryan; Powell, Christopher; Gruber, Thomas; Clapp, Dan

    2012-06-01

    Protection of fixed sites from chemical, biological, or radiological aerosol plume attacks depends on early warning so that there is time to take mitigating actions. Early warning requires continuous, autonomous, and rapid coverage of large surrounding areas; however, this must be done at an affordable cost. Once a potential threat plume is detected, a different type of sensor (e.g., a more expensive, slower sensor) may be cued for identification purposes, but the problem is to quickly identify all of the potential threats around the fixed site of interest. To address this problem of low-cost, persistent, wide-area surveillance, an IR camera pod and multi-image stitching and processing algorithms have been developed for automatic recognition and tracking of aerosol plumes. A rugged, modular, static pod design, which accommodates as many as four micro-bolometer IR cameras for 45 to 180 degrees of azimuth coverage, is presented. Various OpenCV-based image-processing algorithms, including stitching of multiple adjacent FOVs, recognition of aerosol plume objects, and tracking of aerosol plumes, are presented using process block diagrams and sample field test results, including chemical and biological simulant plumes. Methods for dealing with background removal, brightness equalization between images, and focus quality for optimal plume tracking are also discussed.
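
    The record names OpenCV-based recognition and tracking; one plausible core step is detecting moving plume-like regions with background subtraction and contour extraction, sketched below. The parameter choices are assumptions, not the authors' pipeline:

        import cv2

        def detect_plumes(frames, min_area=500):
            """Yield bounding boxes of moving regions in an IR frame sequence."""
            subtractor = cv2.createBackgroundSubtractorMOG2(history=200)
            kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5))
            for frame in frames:
                mask = subtractor.apply(frame)
                mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)
                contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL,
                                               cv2.CHAIN_APPROX_SIMPLE)
                yield [cv2.boundingRect(c) for c in contours
                       if cv2.contourArea(c) >= min_area]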

  18. Speech input and output

    Science.gov (United States)

    Class, F.; Mangold, H.; Stall, D.; Zelinski, R.

    1981-12-01

    Possibilities for acoustical dialogs with electronic data processing equipment were investigated. Speech recognition is posed as recognizing word groups. An economical, multistage classifier for word string segmentation is presented and its reliability in dealing with continuous speech (problems of temporal normalization and context) is discussed. Speech synthesis is considered in terms of German linguistics and phonetics. Preprocessing algorithms for total synthesis of written texts were developed. A macrolanguage, MUSTER, is used to implement this processing in an acoustic data information system (ADES).

  19. Automation of Inter-Networked Banking and Teller Machine Operations Using Face Recognition

    Directory of Open Access Journals (Sweden)

    Prof. Mazher Khan

    2014-11-01

    Full Text Available In this article about biometric systems, the general idea is to use facial recognition to reinforce security on one of the oldest and most secure pieces of technology still in use to date, the Automatic Teller Machine. The main use of any biometric system is to authenticate an input by identifying and verifying it against an existing database. Security in ATMs has changed little since their introduction in the late 70s. This puts them in a very vulnerable state, as technology has brought in a new breed of thieves who use the advancement of technology to their advantage. With this in mind, it is high time something was done about the security of this technology; besides, there cannot be too much security when it comes to people's money.

  20. The effect of using integrated signal processing hearing aids on the speech recognition abilities of hearing impaired Arabic-speaking children

    Directory of Open Access Journals (Sweden)

    Somaia Tawfik

    2014-11-01

    Results and conclusions: Significant improvement in aided sound-field threshold levels and speech recognition in noise tests was recorded using ISP HAs over time. As regards consonant manner, glides and stop consonants showed the greatest improvement. Though voiced and voiceless consonants were equally well transmitted through digital HAs, voiced consonants were easier to perceive using ISP HAs. Middle and back consonants were easier to perceive than front consonants using both HAs. Application of the WILSI self-assessment questionnaire revealed that parents reported better performance in different listening situations. In conclusion, the results of the present study support the use of ISP HAs in children with moderate to severe hearing loss, given the significant improvement recorded in both subjective and objective measures.

  1. Assessing the Performance of Automatic Speech Recognition Systems When Used by Native and Non-Native Speakers of Three Major Languages in Dictation Workflows

    DEFF Research Database (Denmark)

    Zapata, Julián; Kirkedal, Andreas Søeborg

    In this paper, we report on a two-part experiment aiming to assess and compare the performance of two types of automatic speech recognition (ASR) systems on two different computational platforms when used to augment dictation workflows. The experiment was performed with a sample of speakers of three major languages and with different linguistic profiles: non-native English speakers; non-native French speakers; and native Spanish speakers. The main objective of this experiment is to examine ASR performance in translation dictation (TD) and medical dictation (MD) workflows without manual transcription vs. with transcription. We discuss the advantages and drawbacks of a particular ASR approach in different computational platforms when used by various speakers of a given language, who may have different accents and levels of proficiency in that language, and who may have different levels of...

  2. Automated pattern recognition to support geological mapping and exploration target generation - A case study from southern Namibia

    Science.gov (United States)

    Eberle, Detlef; Hutchins, David; Das, Sonali; Majumdar, Anandamayee; Paasche, Hendrik

    2015-06-01

    This paper demonstrates a methodology for the automatic joint interpretation of high resolution airborne geophysical and space-borne remote sensing data to support geological mapping in a largely automated, fast and objective manner. At the request of the Geological Survey of Namibia (GSN), part of the Gordonia Subprovince of the Namaqua Metamorphic Belt situated in southern Namibia was selected for this study. All data - covering an area of 120 km by 100 km in size - were gridded, with a spacing of adjacent data points of only 200 m. The data points were coincident for all data sets. Published criteria were used to characterize the airborne magnetic data and to establish a set of attributes suitable for the recognition of linear features and their pattern within the study area. This multi-attribute analysis of the airborne magnetic data provided the magnetic lineament pattern of the study area. To obtain a (pseudo-) lithology map of the area, the high resolution airborne gamma-ray data were integrated with selected Landsat band data using unsupervised fuzzy partitioning clustering. The outcome of this unsupervised clustering is a classified (zonal) map which in terms of the power of spatial resolution is superior to any regional geological mapping. The classified zones are then assigned geological/geophysical parameters and attributes known from the study area, e.g. lithology, physical rock properties, age, chemical composition, geophysical field characteristics, etc. This information is obtained from the examination of archived geological reports, borehole logs, any kind of existing geological/geophysical data and maps as well as ground truth controls where deemed necessary. To obtain a confidence measure validating the unsupervised fuzzy clustering results and receive a quality criterion of the classified zones, stepwise linear discriminant analysis was chosen. Only a small percentage (8%) of the samples was misclassified by discriminant analysis when compared

  3. Does computer-synthesized speech manifest personality? Experimental tests of recognition, similarity-attraction, and consistency-attraction.

    Science.gov (United States)

    Nass, C; Lee, K M

    2001-09-01

    Would people exhibit similarity-attraction and consistency-attraction toward unambiguously computer-generated speech even when personality is clearly not relevant? In Experiment 1, participants (extrovert or introvert) heard a synthesized voice (extrovert or introvert) on a book-buying Web site. Participants accurately recognized personality cues in text to speech and showed similarity-attraction in their evaluation of the computer voice, the book reviews, and the reviewer. Experiment 2, in a Web auction context, added personality of the text to the previous design. The results replicated Experiment 1 and demonstrated consistency (voice and text personality)-attraction. To maximize liking and trust, designers should set parameters, for example, words per minute or frequency range, that create a personality that is consistent with the user and the content being presented. PMID:11676096

  4. Automated Recognition of Railroad Infrastructure in Rural Areas from LIDAR Data

    Directory of Open Access Journals (Sweden)

    Mostafa Arastounia

    2015-11-01

    Full Text Available This study is aimed at developing automated methods to recognize railroad infrastructure from 3D LIDAR data. Railroad infrastructure includes rail tracks, contact cables, catenary cables, return current cables, masts, and cantilevers. The LIDAR dataset used in this study is acquired by placing an Optech Lynx mobile mapping system on a railcar, operating at 125 km/h. The acquired dataset covers 550 meters of Austrian rural railroad corridor comprising 31 railroad key elements and containing only spatial information. The proposed methodology recognizes key components of the railroad corridor based on their physical shape, geometrical properties, and the topological relationships among them. The developed algorithms managed to recognize all key components of the railroad infrastructure, including two rail tracks, thirteen masts, thirteen cantilevers, one contact cable, one catenary cable, and one return current cable. The results are presented and discussed both at object level and at point cloud level. The results indicate that 100% accuracy and 100% precision at the object level and an average of 96.4% accuracy and an average of 97.1% precision at point cloud level are achieved.

  5. Investigation into diagnostic agreement using automated computer-assisted histopathology pattern recognition image analysis

    Directory of Open Access Journals (Sweden)

    Joshua D Webster

    2012-01-01

    Full Text Available The extent to which histopathology pattern recognition image analysis (PRIA agrees with microscopic assessment has not been established. Thus, a commercial PRIA platform was evaluated in two applications using whole-slide images. Substantial agreement, lacking significant constant or proportional errors, between PRIA and manual morphometric image segmentation was obtained for pulmonary metastatic cancer areas (Passing/Bablok regression. Bland-Altman analysis indicated heteroscedastic measurements and tendency toward increasing variance with increasing tumor burden, but no significant trend in mean bias. The average between-methods percent tumor content difference was -0.64. Analysis of between-methods measurement differences relative to the percent tumor magnitude revealed that method disagreement had an impact primarily in the smallest measurements (tumor burden 0.988, indicating high reproducibility for both methods, yet PRIA reproducibility was superior (C.V.: PRIA = 7.4, manual = 17.1. Evaluation of PRIA on morphologically complex teratomas led to diagnostic agreement with pathologist assessments of pluripotency on subsets of teratomas. Accommodation of the diversity of teratoma histologic features frequently resulted in detrimental trade-offs, increasing PRIA error elsewhere in images. PRIA error was nonrandom and influenced by variations in histomorphology. File-size limitations encountered while training algorithms and consequences of spectral image processing dominance contributed to diagnostic inaccuracies experienced for some teratomas. PRIA appeared better suited for tissues with limited phenotypic diversity. Technical improvements may enhance diagnostic agreement, and consistent pathologist input will benefit further development and application of PRIA.

  6. Integrating prosodic information into a speech recogniser

    OpenAIRE

    López Soto, María Teresa

    2001-01-01

    In the last decade there has been an increasing tendency to incorporate language engineering strategies into speech technology. This technique combines linguistic and mathematical information in different applications: machine translation, natural language processing, speech synthesis and automatic speech recognition (ASR). In the field of speech synthesis, this hybrid approach (linguistic and mathematical/statistical) has led to the design of efficient models for reproducin...

  7. Psychological Motivation Strategy of Speech Recognition in the Popularization of Marxism

    Institute of Scientific and Technical Information of China (English)

    邓瑞琴

    2015-01-01

    Speech recognition is a powerful speech support system and discourse guarantee in the course of the popularization of Marxism. Psychological motivation is an effective means of promoting speech recognition. Correctly grasping and handling the coupling relationship between speech recognition and psychological motivation, the paper explores ways of realizing the popularization of Marxism through three basic strategies of psychological motivation: building a speech recognition mechanism, shaping the psychological identification of the masses, and respecting the dominant position of the public. In the concrete implementation of the psychological motivation strategy, attention should be paid to correctly understanding the relationship between speech and psychology and to mastering the art of verbal expression.

  8. SPEECH PROCESSING –AN OVERVIEW

    Directory of Open Access Journals (Sweden)

    A.INDUMATHI

    2012-06-01

    Full Text Available One of the earliest goals of speech processing was coding speech for efficient transmission. Later, the research spread to various areas such as Automatic Speech Recognition (ASR), Speech Synthesis (TTS), Speech Enhancement and Automatic Language Translation (ALT). Initially, ASR was used to recognize single words from a small vocabulary; later, many products were developed for continuous speech with large vocabularies. Speech Synthesis is used for synthesizing the speech corresponding to a given text and provides a way to communicate for persons unable to speak. When Speech Synthesis is used together with ASR, it allows a complete two-way spoken interaction between humans and machines. Speech Enhancement techniques are applied to improve the quality of the speech signal. Automatic Language Translation helps to convert one language into another. Basic concepts of speech processing are provided for beginners.

  9. Machines a Comprendre la Parole: Methodologie et Bilan de Recherche (Automatic Speech Recognition: Methodology and the State of the Research)

    Science.gov (United States)

    Haton, Jean-Pierre

    1974-01-01

    Still no decisive result has been achieved in the automatic machine recognition of sentences of a natural language. Current research concentrates on developing algorithms for syntactic and semantic analysis. It is obvious that clues from all levels of perception have to be taken into account if a long term solution is ever to be found. (Author/MSE)

  10. Application of speech recognition methods to underwater target classification

    Institute of Scientific and Technical Information of China (English)

    曾渊; 李钢虎; 赵亚楠; 苗雨

    2012-01-01

    In submarine battle, underwater target recognition is a key technology for finding the enemy early and then taking effective acoustic countermeasures. However, a problem that has puzzled people for a long time is how to classify and identify three kinds of ship targets based on the ship-radiated noise received by a sonar system. This paper studies the effects of applying four frequently-used speech recognition methods to underwater target classification. The four methods are the linear prediction coefficient (LPC), linear prediction cepstrum coefficient (LPCC), Mel frequency cepstrum coefficient (MFCC) and Minimum Variance Distortionless Response (MVDR). The paper verifies the results by comparing the recognition rates for different target samples under different SNRs, and identifies the best method with and without noise. Experiments show that, without noise, the overall recognition rate of MFCC is highest; the MFCC method has the highest recognition rate for the first kind of targets; MFCC and MVDR methods have similar recognition
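
    Two of the four feature types compared above can be extracted in a few lines with librosa; the coefficient counts below are generic defaults, not the paper's configuration:

        import librosa

        def lpc_mfcc_features(path, n_mfcc=13, lpc_order=12):
            """Extract MFCC and LPC features from one recording."""
            y, sr = librosa.load(path, sr=None)
            mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)  # (n_mfcc, frames)
            lpc = librosa.lpc(y, order=lpc_order)  # LPC polynomial coefficients
            return mfcc, lpc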

  11. Building Searchable Collections of Enterprise Speech Data.

    Science.gov (United States)

    Cooper, James W.; Viswanathan, Mahesh; Byron, Donna; Chan, Margaret

    The study has applied speech recognition and text-mining technologies to a set of recorded outbound marketing calls and analyzed the results. Since speaker-independent speech recognition technology results in a significantly lower recognition rate than that found when the recognizer is trained for a particular speaker, a number of post-processing…

  12. Clinical and audiological features of a syndrome with deterioration in speech recognition out of proportion to pure hearing loss

    Directory of Open Access Journals (Sweden)

    Abdi S

    2007-04-01

    Full Text Available Background: The objective of this study was to describe the audiologic and related characteristics of a group of patients whose speech perception is affected out of proportion to their pure tone hearing loss. A case series of patients was referred for evaluation and management to the Hearing Research Center. The key clinical feature is hearing loss for pure tones with a reduction in speech discrimination out of proportion to the pure tone loss, meeting some of the criteria of auditory neuropathy (i.e., normal otoacoustic emissions, OAE, and abnormal auditory brainstem evoked potentials, ABR) while lacking others (e.g., present auditory reflexes). Methods: Hearing abilities were measured by Pure Tone Audiometry (PTA) and Speech Discrimination Scores (SDS), measured in all patients using a standardized list of 25 monosyllabic Farsi words at MCL in quiet. Auditory pathway integrity was assessed using the Auditory Brainstem Response (ABR) and Otoacoustic Emissions (OAE), and anatomical lesions by Computed Tomography (CT) and Magnetic Resonance Imaging (MRI) of the brain and retrocochlea. Patients included in the series were 35 patients who had SDS disproportionately low with regard to PTA, absent ABR waves and normal OAE. Results: All patients reported the beginning of their problem around adolescence. None of them had an anatomical lesion in imaging studies, nor any finding suggestive of a conductive hearing lesion. Although in most of the cases the hearing loss was more apparent at the lower frequencies (i.e., 1000 Hz and below), a stronger correlation was found between SDS and hearing threshold at higher frequencies. These patients may not benefit from hearing aids, as the outer hair cells are functional and amplification does not seem to help; though it was tried for all. Conclusion: These patients share a pattern of sensorineural loss with no detectable lesion. The age of onset and the gradual

  13. Commercial speech and off-label drug uses: what role for wide acceptance, general recognition and research incentives?

    Science.gov (United States)

    Gilhooley, Margaret

    2011-01-01

    This article provides an overview of how the constitutional protections for commercial speech affect the Food and Drug Administration's (FDA) regulation of drugs, and the emerging issues about the scope of these protections. A federal district court has already found that commercial speech allows manufacturers to distribute reprints of medical articles about a new off-label use of a drug as long as it contains disclosures to prevent deception and to inform readers about the lack of FDA review. This paper summarizes the current agency guidance that accepts the manufacturer's distribution of reprints with disclosures. Allergan, the maker of Botox, recently maintained in a lawsuit that the First Amendment permits drug companies to provide "truthful information" to doctors about "widely accepted" off-label uses of a drug. While the case was settled as part of a fraud and abuse case on other grounds, extending constitutional protections generally to "widely accepted" uses is not warranted, especially if it covers the use of a drug for a new purpose that needs more proof of efficacy, and that can involve substantial risks. A health law academic pointed out in an article examining a fraud and abuse case that off-label use of drugs is common, and that practitioners may lack adequate dosage information about the off-label uses. Drug companies may obtain approval of a drug for a narrow use, such as for a specific type of pain, but practitioners use the drug for similar uses based on their experience. The writer maintained that a controlled study may not be necessary to establish efficacy for an expanded use of a drug for pain. Even if this is the case, as discussed below in this paper, added safety risks may exist if the expansion covers a longer period of time and use by a wider number of patients. The protections for commercial speech should not be extended to allow manufacturers to distribute information about practitioner use with a disclosure about the lack of FDA

  14. Speech in Mobile and Pervasive Environments

    CERN Document Server

    Rajput, Nitendra

    2012-01-01

    This book brings together the latest research in one comprehensive volume that deals with issues related to speech processing on resource-constrained, wireless, and mobile devices, such as speech recognition in noisy environments, specialized hardware for speech recognition and synthesis, the use of context to enhance recognition, the emerging and new standards required for interoperability, speech applications on mobile devices, distributed processing between the client and the server, and the relevance of Speech in Mobile and Pervasive Environments for developing regions--an area of explosiv

  15. Emotion Recognition in Speech Based on HMM and PNN

    Institute of Scientific and Technical Information of China (English)

    叶斌

    2011-01-01

    Speech emotion recognition aims to give computers the ability to understand emotional features from the voice, ultimately enabling them to interact as naturally, warmly and vividly as people do. A speech emotion recognition method combining a hidden Markov model (HMM) and a probabilistic neural network (PNN) is proposed. In the designed recognition system, basic prosodic and spectral parameters are extracted first; the PNN is then used to model the statistical features of the acoustic parameters and the HMM to model their temporal features. Sum and product rules are used to fuse the recognition results from the statistical and temporal features for the final decision. Experimental results demonstrate the effectiveness of the proposed method for speech emotion recognition.
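
    The sum and product fusion rules mentioned above are simple to state in code: average the two classifiers' per-class posteriors, or multiply and renormalize them, before taking the argmax. A sketch assuming the posteriors are already available:

        import numpy as np

        def fuse_decisions(p_hmm, p_pnn, rule="sum"):
            """Combine two classifiers' per-class posteriors (1-D arrays)."""
            if rule == "sum":
                p = 0.5 * (p_hmm + p_pnn)      # sum (average) rule
            else:
                p = p_hmm * p_pnn              # product rule
                p = p / p.sum()                # renormalize
            return int(np.argmax(p)), p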

  16. A Kinect-Based Sign Language Hand Gesture Recognition System for Hearing- and Speech-Impaired: A Pilot Study of Pakistani Sign Language.

    Science.gov (United States)

    Halim, Zahid; Abbas, Ghulam

    2015-01-01

    Sign language provides hearing- and speech-impaired individuals with an interface to communicate with other members of society. Unfortunately, sign language is not understood by most people. For this, a gadget based on image processing and pattern recognition can provide a vital aid for detecting and translating sign language into a vocal language. This work presents a system for detecting and understanding sign language gestures with a custom-built software tool and later translating each gesture into a vocal language. For the purpose of recognizing a particular gesture, the system employs a Dynamic Time Warping (DTW) algorithm, and an off-the-shelf software tool is employed for vocal language generation. A Microsoft® Kinect is the primary tool used to capture the video stream of a user. The proposed method is capable of successfully detecting gestures stored in the dictionary with an accuracy of 91%. The proposed system has the ability to define and add custom-made gestures. Based on an experiment in which 10 individuals with impairments used the system to communicate with 5 people with no disability, 87% agreed that the system was useful. PMID:26132224
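
    DTW, the matching algorithm used here, aligns two variable-length feature sequences by minimizing cumulative frame-to-frame distance. A standard textbook implementation follows; the Euclidean local distance is an assumption, not necessarily the paper's metric over skeletal features:

        import numpy as np

        def dtw_distance(a, b):
            """Classic DTW between two sequences of feature vectors (n, d), (m, d)."""
            n, m = len(a), len(b)
            D = np.full((n + 1, m + 1), np.inf)
            D[0, 0] = 0.0
            for i in range(1, n + 1):
                for j in range(1, m + 1):
                    cost = np.linalg.norm(a[i - 1] - b[j - 1])
                    D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
            return D[n, m]

        # Recognition: compare a captured gesture against each dictionary
        # template and pick the template with the smallest DTW distance.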

  17. Directionality and speech recognition in noise: study of four cases

    Directory of Open Access Journals (Sweden)

    Mariane Turella Mazzochi

    2012-01-01

    Full Text Available The difficulty of understanding speech in background noise is perceived as one of the main disabilities of hearing aid users. The purpose of this study was to compare the hearing performance of subjects with mild to moderate bilateral sensorineural hearing loss using omnidirectional, fixed directional, and automatically activated adaptive directional microphones, measured through the signal-to-noise ratios (S/N) at which the Sentence Recognition Thresholds in Noise (SRTN) are obtained. Beltone Reach hearing aids, model RCH62, were used in the omnidirectional, fixed directional and automatically activated adaptive directional microphone modes. The following presentation conditions of the acoustic stimuli were tested: speech at 0° azimuth and noise at 180° azimuth (0°/180°), speech at 90° azimuth and noise at 270° azimuth (90°/270°), and speech at 270° azimuth and noise at 90° azimuth (270°/90°). The mean signal-to-noise ratios varied from 6.6 dB to -6.9 dB. The microphone with the best mean signal-to-noise ratio across the three presentation conditions was the automatically activated adaptive directional one. However, since the sample was small, there was large individual variability. Further studies should be conducted to provide scientific support for selecting the most appropriate microphones.

  18. SAM: speech-aware applications in medicine to support structured data entry.

    OpenAIRE

    Wormek, A. K.; Ingenerf, J; Orthner, H. F.

    1997-01-01

    In the last two years, improvements in speech recognition technology have directed the medical community's interest to porting and using such innovations in clinical systems. The acceptance of speech recognition systems in clinical domains increases with recognition speed, a large medical vocabulary, high accuracy, continuous speech recognition, and speaker independence. Although some commercial speech engines approach these requirements, the greatest benefit can be achieved in adapting a speech ...

  19. A Study to Increase the Quality of Financial and Operational Performances of Call Centers using Speech Technology

    Directory of Open Access Journals (Sweden)

    R. Manoharan

    2015-04-01

    Full Text Available Everyone knows technology and automation are not the solutions to every business problem. But when used for the right reasons, and deployed and maintained wisely, speech-based contact center applications can be good for customers as well as for the business that spends the money and time to implement and maintain them. The speech-based application considered here is an experimental conversational speech system. Experience with redesigning the system based on user feedback indicates the importance of adhering to conversational conventions when designing speech interfaces, particularly in the face of speech recognition errors. Study results also suggest that speech-only interfaces should be designed from scratch rather than directly translated from their graphical counterparts. This paper examines a set of challenging issues facing speech interface designers and describes approaches to address some of these challenges. We highlight some of the specific constraints involved in using speech technology in mainstream business in general, and in a call center in particular, and how they are being resolved through the industrial process. The real challenge is developing a new business process model for an industry application, specifically for a call center, which paves the way to design and analyze the financial and operational performance of call centers through a business process model using speech technology.

  20. Speaker Recognition Algorithm for Abnormal Speech Based on Abnormal Feature Weighting

    Institute of Scientific and Technical Information of China (English)

    何俊; 李艳雄; 贺前华; 李威

    2012-01-01

    As the commonly-used weighting algorithms are inefficient in tracking the abnormal features of abnormal speech, a speaker recognition algorithm for abnormal speech based on abnormal feature weighting is proposed. In this algorithm, a feature template of normal speech is first established by computing the probability distribution of each order of MFCC features over a large number of normal speech samples. Then, the K-L distance and the Euclidean distance are used to measure the differences between a given test speech and the normal speech template and to determine the K-L and Euclidean weighting factors. Finally, the two weighting factors are used to weight the MFCC features of the test speech, and the weighted MFCC features are input into a Gaussian mixture model for speaker recognition with abnormal speech. Experimental results show that the overall recognition rates of the speaker recognition algorithms based on K-L weighting and Euclidean weighting are 46.61% and 42.25% respectively, while those of the algorithm weighted by each order feature's contribution to speaker recognition and of the unweighted algorithm are only 39.68% and 36.36% respectively.
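
    The weighting idea can be sketched by assuming Gaussian statistics per MFCC dimension: compute a symmetric K-L distance between the test utterance's per-dimension statistics and the normal-speech template, then map the distances to feature weights. The normalization and weight mapping below are illustrative assumptions:

        import numpy as np

        def kl_weights(mfcc_test, template_mean, template_var):
            """Per-dimension symmetric KL distance between test statistics and
            a normal-speech template, mapped to weights summing to the number
            of dimensions. mfcc_test has shape (n_dims, n_frames)."""
            m, v = mfcc_test.mean(axis=1), mfcc_test.var(axis=1) + 1e-8
            tm, tv = template_mean, template_var + 1e-8
            kl = 0.5 * (v / tv + tv / v + (m - tm) ** 2 * (1 / tv + 1 / v) - 2)
            return kl / kl.sum() * len(kl)   # larger deviation -> larger weight

        # weighted = kl_weights(mfcc, mu0, var0)[:, None] * mfcc  # apply per row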

  1. Towards Automation 2.0: A Neurocognitive Model for Environment Recognition, Decision-Making, and Action Execution

    OpenAIRE

    Zucker Gerhard; Dietrich Dietmar; Velik Rosemarie

    2011-01-01

    The ongoing penetration of building automation by information technology is by far not saturated. Today's systems need not only be reliable and fault tolerant, they also have to regard energy efficiency and flexibility in the overall consumption. Meeting the quality and comfort goals in building automation while at the same time optimizing towards energy, carbon footprint and cost-efficiency requires systems that are able to handle large amounts of information and negotiate system behaviour ...

  2. INTEGRATING MACHINE TRANSLATION AND SPEECH SYNTHESIS COMPONENT FOR ENGLISH TO DRAVIDIAN LANGUAGE SPEECH TO SPEECH TRANSLATION SYSTEM

    Directory of Open Access Journals (Sweden)

    J. SANGEETHA

    2015-02-01

    Full Text Available This paper provides an interface between the machine translation and speech synthesis components of an English-to-Tamil speech-to-speech translation system. The speech translation system consists of three modules: automatic speech recognition, machine translation and text-to-speech synthesis. Many procedures for integrating speech recognition and machine translation have been proposed, but the speech synthesis component has not yet received the same attention. In this paper, we focus on the integration of machine translation and speech synthesis, and report a subjective evaluation investigating the impact of the speech synthesis and machine translation components and of their integration. We implement a hybrid machine translation system (a combination of rule-based and statistical machine translation) and a concatenative syllable-based speech synthesis technique. In order to retain the naturalness and intelligibility of the synthesized speech, Auto Associative Neural Network (AANN) prosody prediction is used in this work. The results of this investigation demonstrate that the naturalness and intelligibility of the synthesized speech are strongly influenced by the fluency and correctness of the translated text.

  3. Current trends in multilingual speech processing

    Indian Academy of Sciences (India)

    Hervé Bourlard; John Dines; Mathew Magimai-Doss; Philip N Garner; David Imseng; Petr Motlicek; Hui Liang; Lakshmi Saheer; Fabio Valente

    2011-10-01

    In this paper, we describe recent work at Idiap Research Institute in the domain of multilingual speech processing and provide some insights into emerging challenges for the research community. Multilingual speech processing has been a topic of ongoing interest to the research community for many years and the field is now receiving renewed interest owing to two strong driving forces. Firstly, technical advances in speech recognition and synthesis are posing new challenges and opportunities to researchers. For example, discriminative features are seeing wide application by the speech recognition community, but additional issues arise when using such features in a multilingual setting. Another example is the apparent convergence of speech recognition and speech synthesis technologies in the form of statistical parametric methodologies. This convergence enables the investigation of new approaches to unified modelling for automatic speech recognition and text-to-speech synthesis (TTS) as well as cross-lingual speaker adaptation for TTS. The second driving force is the impetus being provided by both government and industry for technologies to help break down domestic and international language barriers, these also being barriers to the expansion of policy and commerce. Speech-to-speech and speech-to-text translation are thus emerging as key technologies at the heart of which lies multilingual speech processing.

  4. Speech Problems

    Science.gov (United States)

    ... your treatment plan may include seeing a speech therapist , a person who is trained to treat speech disorders. How often you have to see the speech therapist will vary — you'll probably start out seeing ...

  5. Speech Development

    Science.gov (United States)

    ... Information for Parents & Individuals: Speech Development. To download the PDF version of this factsheet, ...

  6. Learning Representations of Affect from Speech

    OpenAIRE

    Ghosh, Sayan; Laksana, Eugene; Morency, Louis-Philippe; Scherer, Stefan

    2015-01-01

    There has been a lot of prior work on representation learning for speech recognition applications, but not much emphasis has been given to an investigation of effective representations of affect from speech, where the paralinguistic elements of speech are separated out from the verbal content. In this paper, we explore denoising autoencoders for learning paralinguistic attributes i.e. categorical and dimensional affective traits from speech. We show that the representations learnt by the bott...

  7. Neural bases of accented speech perception

    OpenAIRE

    Adank, Patti; Nuttall, Helen E.; Banks, Briony; Kennedy-Higgins, Daniel

    2015-01-01

    The recognition of unfamiliar regional and foreign accents represents a challenging task for the speech perception system (Floccia et al., 2006; Adank et al., 2009). Despite the frequency with which we encounter such accents, the neural mechanisms supporting successful perception of accented speech are poorly understood. Nonetheless, candidate neural substrates involved in processing speech in challenging listening conditions, including accented speech, are beginning to be identified. This re...

  8. Sparse representation in speech signal processing

    Science.gov (United States)

    Lee, Te-Won; Jang, Gil-Jin; Kwon, Oh-Wook

    2003-11-01

    We review the sparse representation principle for processing speech signals. A transformation for encoding the speech signals is learned such that the resulting coefficients are as independent as possible. We use independent component analysis with an exponential prior to learn a statistical representation for speech signals. This representation leads to extremely sparse priors that can be used for encoding speech signals for a variety of purposes. We review applications of this method to speech feature extraction, automatic speech recognition and speaker identification. Furthermore, this method is also suited for tackling the difficult problem of separating two sounds given only a single microphone.
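
    The learned-transform encoding described above can be sketched with a generic ICA implementation; the authors use ICA with an exponential prior, for which scikit-learn's FastICA is only an assumed stand-in.

        import numpy as np
        from sklearn.decomposition import FastICA

        def learn_sparse_speech_codes(signal, frame_len=160, hop=80, n_basis=40):
            # Cut the waveform into short overlapping frames.
            n_frames = (len(signal) - frame_len) // hop + 1
            frames = np.stack([signal[i * hop:i * hop + frame_len]
                               for i in range(n_frames)])
            # Learn a transform whose output coefficients are maximally
            # independent; for speech, such coefficients have strongly
            # sparse (peaky) distributions.
            ica = FastICA(n_components=n_basis, max_iter=500)
            codes = ica.fit_transform(frames)
            return codes, ica.components_  # sparse codes and the learned basis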

  9. Impact of a PACS/RIS-integrated speech recognition system on radiology reporting time and report availability; Einfluss eines PACS-/RIS-integrierten Spracherkennungssystems auf den Zeitaufwand der Erstellung und die Verfuegbarkeit radiologischer Befunde

    Energy Technology Data Exchange (ETDEWEB)

    Trumm, C.G.; Glaser, C.; Paasche, V.; Kuettner, B.; Francke, M.; Nissen-Meyer, S.; Reiser, M. [Klinikum Grosshadern, Muenchen (Germany). Inst. fuer Klinische Radiologie; Crispin, A. [Klinikum Grosshadern, Muenchen (Germany). Inst. fuer Medizinische Informationsverarbeitung, Biometrie und Epidemiologie; Popp, P. [Siemens AG Erlangen (Germany). Medizinische Loesungen

    2006-04-15

    Purpose: Quantification of the impact of a PACS/RIS-integrated speech recognition system (SRS) on the time expenditure for radiology reporting and on hospital-wide report availability (RA) in a university institution. Material and Methods: In a prospective pilot study, the following parameters were assessed for 669 radiographic examinations (CR): 1. time requirement per report dictation (TED: dictation time (s)/number of images [examination] x number of words [report]) with either a combination of PACS/tape-based dictation (TD: analog dictation device/minicassette/transcription) or PACS/RIS/speech recognition system (RR: remote recognition/transcription and OR: online recognition/self-correction by radiologist), respectively, and 2. the report turnaround time (RTT) as the time interval from the entry of the first image into the PACS to the available RIS/HIS report. Two equal time periods were chosen retrospectively from the RIS database: 11/2002-2/2003 (only TD) and 11/2003-2/2004 (only RR or OR with the speech recognition system [SRS]). The midterm (≥24 h, 24 h intervals) and short-term (<24 h, 1 h intervals) RA after examination completion were calculated for all modalities and for CR, CT, MR and XA/DS separately. The relative increase in the mid-term RA (RIMRA: related to the total number of examinations in each time period) and the increase in the short-term RA (ISRA: ratio of available reports during the 1st to 24th hour) were calculated. Results: Prospectively, there was a significant difference between TD/RR/OR (n=151/257/261) regarding mean TED (0.44/0.54/0.62 s [per word and image]) and mean RTT (10.47/6.65/1.27 h), respectively. Retrospectively, 37 898/39 680 reports were computed from the RIS database for the time periods 11/2002-2/2003 and 11/2003-2/2004. For CR/CT there was a shift of the short-term RA to the first 6 hours after examination completion (mean cumulative RA 20% higher) with a more than three-fold increase in the total number of available

  10. Integration of speech with natural language understanding.

    OpenAIRE

    Moore, R C

    1995-01-01

    The integration of speech recognition with natural language understanding raises issues of how to adapt natural language processing to the characteristics of spoken language; how to cope with errorful recognition output, including the use of natural language information to reduce recognition errors; and how to use information from the speech signal, beyond just the sequence of words, as an aid to understanding. This paper reviews current research addressing these questions in the Spoken Langu...

  11. 面向手语自动翻译的基于Kinect的手势识别%GESTURE RECOGNITION WITH KINECT FOR AUTOMATED SIGN LANGUAGE TRANSLATION

    Institute of Scientific and Technical Information of China (English)

    付倩; 沈俊辰; 张茜颖; 武仲科; 周明全

    2013-01-01

    For a long time, barriers have existed in communication with people with hearing disabilities. For them, being able to communicate effectively with other people has been a dream that seemed as if it might never come true. With the invention of the computer and subsequent technological advances, this dream may become possible. Researchers have long tried to achieve automated sign language translation by utilizing Human-Computer Interaction (HCI), computer graphics, computer vision and pattern recognition. Most previous work is based on videos. This paper proposes a gesture recognition method for automated sign language translation based on depth images provided by Kinect, a three-dimensional (3D) motion-sensing input device. The 3D coordinates of the palm center were computed using the NITE middleware and the open-source framework OpenNI. Fingers were then recognized and the hand gesture detected by adopting a contour tracing algorithm and a three-point alignment algorithm, and the names of the fingers were identified using a vector fitting method. Three layers of classifiers were designed to achieve static hand gesture recognition. Compared with traditional methods using data gloves and monocular cameras, the present method is more accurate. Our experiments showed that the method is concise and effective.

  12. 言语感知中词汇识别的句子语境效应研究%Effect of Sentential Contexts on Word Recognition in Speech Perception

    Institute of Scientific and Technical Information of China (English)

    柳鑫淼

    2014-01-01

    Phonemes, words and sentences are interconnected in speech perception. Besides phonetic features, phonemes and words, sentences are also engaged in speech perception. In this process, sentential contexts exert influence on word recognition both syntactically and semantically. Syntactically, the sentence level exerts a top-down feedback effect on the word level according to syntactic rules, screening the candidates at the word level by constraining their part of speech or checking their inflectional features. Semantically, the sentence level activates or inhibits the candidates by exerting semantic constraints.

  13. Head movements encode emotions during speech and song.

    Science.gov (United States)

    Livingstone, Steven R; Palmer, Caroline

    2016-04-01

    When speaking or singing, vocalists often move their heads in an expressive fashion, yet the influence of emotion on vocalists' head motion is unknown. Using a comparative speech/song task, we examined whether vocalists' intended emotions influence head movements and whether those movements influence the perceived emotion. In Experiment 1, vocalists were recorded with motion capture while speaking and singing each statement with different emotional intentions (very happy, happy, neutral, sad, very sad). Functional data analyses showed that head movements differed in translational and rotational displacement across emotional intentions, yet were similar across speech and song, transcending differences in F0 (varied freely in speech, fixed in song) and lexical variability. Head motion specific to emotional state occurred before and after vocalizations, as well as during sound production, confirming that some aspects of movement were not simply a by-product of sound production. In Experiment 2, observers accurately identified vocalists' intended emotion on the basis of silent, face-occluded videos of head movements during speech and song. These results provide the first evidence that head movements encode a vocalist's emotional intent and that observers decode emotional information from these movements. We discuss implications for models of head motion during vocalizations and applied outcomes in social robotics and automated emotion recognition. PMID:26501928

  14. Audio-visual gender recognition

    Science.gov (United States)

    Liu, Ming; Xu, Xun; Huang, Thomas S.

    2007-11-01

    Combining different modalities for pattern recognition tasks is a very promising field. Humans routinely fuse information from different modalities to recognize objects and perform inference. Audio-visual gender recognition is one of the most common tasks in human social communication: humans can identify gender by facial appearance, by speech, and also by body gait. Indeed, human gender recognition is a multi-modal data acquisition and processing procedure. However, computational multimodal gender recognition has not been extensively investigated in the literature. In this paper, speech and facial images are fused to perform multi-modal gender recognition and to explore the improvement gained by combining different modalities.

  15. Cortisol, Chromogranin A, and Pupillary Responses Evoked by Speech Recognition Tasks in Normally Hearing and Hard-of-Hearing Listeners: A Pilot Study.

    Science.gov (United States)

    Kramer, Sophia E; Teunissen, Charlotte E; Zekveld, Adriana A

    2016-01-01

    Pupillometry is one method that has been used to measure processing load expended during speech understanding. Notably, speech perception (in noise) tasks can evoke a pupil response. It is not known if there is concurrent activation of the sympathetic nervous system as indexed by salivary cortisol and chromogranin A (CgA) and whether such activation differs between normally hearing (NH) and hard-of-hearing (HH) adults. Ten NH and 10 adults with mild-to-moderate hearing loss (mean age 52 years) participated. Two speech perception tests were administered in random order: one in quiet targeting 100% correct performance and one in noise targeting 50% correct performance. Pupil responses and salivary samples for cortisol and CgA analyses were collected four times: before testing, after the two speech perception tests, and at the end of the session. Participants rated their perceived accuracy, effort, and motivation. Effects were examined using repeated-measures analyses of variance. Correlations between outcomes were calculated. HH listeners had smaller peak pupil dilations (PPDs) than NH listeners in the speech in noise condition only. No group or condition effects were observed for the cortisol data, but HH listeners tended to have higher cortisol levels across conditions. CgA levels were larger at the pretesting time than at the three other test times. Hearing impairment did not affect CgA. Self-rated motivation correlated most often with cortisol or PPD values. The three physiological indicators of cognitive load and stress (PPD, cortisol, and CgA) are not equally affected by speech testing or hearing impairment. Each of them seem to capture a different dimension of sympathetic nervous system activity. PMID:27355762

  16. Robust Recognition Method of Speech Under Stress%心理紧张情况下的Robust语音识别方法

    Institute of Scientific and Technical Information of China (English)

    韩纪庆; 张磊; 王承发

    2000-01-01

    Abstract There are many stressful environments which deteriorate the performance of speech recognition systems. Techniques for compensating for the influence of stress can help neutralize stressed speech and improve the robustness of speech recognition systems. In this paper, we summarize the approaches for robust recognition of speech under stress and review recent advances in the area.

  17. Speech perception and production in severe environments

    Science.gov (United States)

    Pisoni, David B.

    1990-09-01

    The goal was to acquire new knowledge about speech perception and production in severe environments such as high masking noise, increased cognitive load or sustained attentional demands. Changes were examined in speech production under these adverse conditions through acoustic analysis techniques. One set of studies focused on the effects of noise on speech production. The experiments in this group were designed to generate a database of speech obtained in noise and in quiet. A second set of experiments was designed to examine the effects of cognitive load on the acoustic-phonetic properties of speech. Talkers were required to carry out a demanding perceptual motor task while they read lists of test words. A final set of experiments explored the effects of vocal fatigue on the acoustic-phonetic properties of speech. Both cognitive load and vocal fatigue are present in many applications where speech recognition technology is used, yet their influence on speech production is poorly understood.

  18. Speech perception of noise with binary gains

    DEFF Research Database (Denmark)

    Wang, DeLiang; Kjems, Ulrik; Pedersen, Michael Syskind;

    2008-01-01

    For a given mixture of speech and noise, an ideal binary time-frequency mask is constructed by comparing speech energy and noise energy within local time-frequency units. It is observed that listeners achieve nearly perfect speech recognition from gated noise with binary gains prescribed by the ideal binary mask. Only 16 filter channels and a frame rate of 100 Hz are sufficient for high intelligibility. The results show that, despite a dramatic reduction of speech information, a pattern of binary gains provides an adequate basis for speech perception.
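
    The ideal binary mask construction is simple enough to state in a few lines. The sketch below uses an STFT in place of the 16-channel filterbank of the study (an assumption), with a 10-ms hop to approximate the 100 Hz frame rate.

        import numpy as np
        from scipy.signal import stft, istft

        def gated_noise_from_ibm(speech, noise, fs, lc_db=0.0):
            # 20-ms frames with the default 50% overlap give a ~100 Hz frame rate.
            nperseg = int(0.02 * fs)
            _, _, S = stft(speech, fs, nperseg=nperseg)
            _, _, N = stft(noise, fs, nperseg=nperseg)
            # Keep a time-frequency unit when speech energy exceeds noise energy
            # by the local criterion (in dB): this is the ideal binary mask.
            ibm = (np.abs(S) ** 2) > (np.abs(N) ** 2) * 10 ** (lc_db / 10)
            # Apply the binary gains to the noise alone, as in the experiments.
            _, gated = istft(N * ibm, fs, nperseg=nperseg)
            return gated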

  19. Automatic recognition of type III solar radio bursts. Automated radio burst identification system method and first observations

    International Nuclear Information System (INIS)

    Complete text of publication follows. Because of the rapidly increasing role of technology, including complicated electronic systems, spacecraft, etc., modern society has become more vulnerable to a set of extraterrestrial influences (space weather) and requires continuous observation and forecasts of space weather. The major space weather events like solar flares and coronal mass ejections are usually accompanied by solar radio bursts, which can be used for a real-time space weather forecast. Coronal type III radio bursts are produced near the local electron plasma frequency and near its harmonic by fast electrons ejected from the solar active regions and moving through the corona and solar wind. These bursts have dynamic spectra with frequency rapidly falling with time, the typical duration of the coronal burst being about 1-3 s. This paper presents a new method developed to detect coronal type III bursts automatically and its implementation in a new Automated Radio Burst Identification System (ARBIS), which is working in real-time. The central idea of the implementation is to use the Radon transform for more objective detection of the bursts as approximately straight lines in dynamic spectra. Preliminary tests of the method with the use of the spectra obtained during 13 days show that the performance of the current implementation is quite high, ∼84%, while no false positives are observed and 23 events not listed previously are found. The first automatically detected coronal type III radio bursts are presented.
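
    The core of the method, detecting a burst as an approximately straight line in the dynamic spectrum via the Radon transform, can be sketched as follows; the peak-to-mean detection statistic is an assumed stand-in, since ARBIS's exact criterion is not given in the abstract.

        import numpy as np
        from skimage.transform import radon

        def type_iii_line_score(dyn_spec):
            # dyn_spec: 2-D dynamic spectrum (frequency x time) intensities.
            img = (dyn_spec - dyn_spec.mean()) / (dyn_spec.std() + 1e-12)
            angles = np.linspace(0.0, 180.0, 180, endpoint=False)
            # A fast-drifting burst is an approximately straight bright line in
            # the spectrum, which maps to a single bright peak in Radon space.
            sinogram = radon(img, theta=angles, circle=False)
            return sinogram.max() / (np.abs(sinogram).mean() + 1e-12)

        # A burst is declared when the score exceeds a threshold tuned on
        # labelled spectra.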

  20. Speech enhancement with LSTM recurrent neural networks and its application to noise-robust ASR

    OpenAIRE

    Weninger, Felix; Erdogan, Hakan; Watanabe, Shinji; Vincent, Emmanuel; Le Roux, Jonathan; Hershey, John R.; Schuller, Björn

    2015-01-01

    We evaluate some recent developments in recurrent neural network (RNN) based speech enhancement in the light of noise-robust automatic speech recognition (ASR). The proposed framework is based on Long Short-Term Memory (LSTM) RNNs which are discriminatively trained according to an optimal speech reconstruction objective. We demonstrate that LSTM speech enhancement, even when used 'naïvely' as front-end processing, delivers competitive results on the CHiME-2 speech recognition task. Furtherm...
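
    A minimal sketch of LSTM-based enhancement in this spirit, written here as a PyTorch mask estimator trained with a signal-approximation loss; the actual CHiME-2 system's architecture and training objective are more elaborate, so this only illustrates the idea.

        import torch
        import torch.nn as nn

        class MaskLSTM(nn.Module):
            # Maps log-magnitude features of noisy speech to a [0, 1] mask that
            # is multiplied onto the noisy magnitudes before feature extraction.
            def __init__(self, n_freq=257, hidden=256):
                super().__init__()
                self.lstm = nn.LSTM(n_freq, hidden, num_layers=2, batch_first=True)
                self.proj = nn.Linear(hidden, n_freq)

            def forward(self, noisy_logmag):           # (batch, frames, n_freq)
                h, _ = self.lstm(noisy_logmag)
                return torch.sigmoid(self.proj(h))

        # Discriminative "signal approximation" training: make the masked noisy
        # magnitudes match the clean magnitudes.
        # loss = ((model(noisy_logmag) * noisy_mag - clean_mag) ** 2).mean()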

  1. Strategies for distant speech recognition in reverberant environments

    Science.gov (United States)

    Delcroix, Marc; Yoshioka, Takuya; Ogawa, Atsunori; Kubo, Yotaro; Fujimoto, Masakiyo; Ito, Nobutaka; Kinoshita, Keisuke; Espi, Miquel; Araki, Shoko; Hori, Takaaki; Nakatani, Tomohiro

    2015-12-01

    Reverberation and noise are known to severely affect the automatic speech recognition (ASR) performance of speech recorded by distant microphones. Therefore, we must deal with reverberation if we are to realize high-performance hands-free speech recognition. In this paper, we review a recognition system that we developed at our laboratory to deal with reverberant speech. The system consists of a speech enhancement (SE) front-end that employs long-term linear prediction-based dereverberation followed by noise reduction. We combine our SE front-end with an ASR back-end that uses neural networks for acoustic and language modeling. The proposed system achieved top scores on the ASR task of the REVERB challenge. This paper describes the different technologies used in our system and presents detailed experimental results that justify our implementation choices and may provide hints for designing distant ASR systems.
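
    The long-term linear-prediction dereverberation in the front-end can be illustrated with a single-channel sketch; the actual system is multi-channel and statistically more refined, so the following only shows the underlying idea under those simplifying assumptions.

        import numpy as np
        from scipy.signal import stft, istft

        def delayed_lp_dereverb(x, fs, order=20, delay=3, nperseg=512):
            # Long-term linear prediction in the STFT domain: in each frequency
            # bin, predict the current frame from frames at least `delay` frames
            # in the past and subtract the prediction.  The prediction captures
            # late reverberation; the delay protects the direct sound.
            _, _, X = stft(x, fs, nperseg=nperseg)
            Y = np.empty_like(X)
            for k in range(X.shape[0]):
                xk = X[k]
                A = np.stack([np.roll(xk, delay + i) for i in range(order)], axis=1)
                A[:delay + order, :] = 0.0      # discard wrapped-around samples
                g, *_ = np.linalg.lstsq(A, xk, rcond=None)
                Y[k] = xk - A @ g
            _, y = istft(Y, fs, nperseg=nperseg)
            return y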

  2. Design of self-tracking smart car based on speech recognition and infrared photoelectric sensor%基于语音识别和红外光电传感器的自循迹智能小车设计

    Institute of Scientific and Technical Information of China (English)

    李新科; 高潮; 郭永彩; 何卫华

    2011-01-01

    A self-tracking smart car has been designed based on infrared photoelectric sensors and speech recognition technology. The car adopts the 16-bit single-chip SPCE061A from Sunplus as the core processor of the control circuit and obtains path information from reflective infrared photoelectric sensors. The car adjusts its direction and speed according to the location of the black line in the path information, thereby implementing self-tracking. A speech-processing API was written around the SPCE061A's on-chip resources to provide intelligent human-machine voice interaction for navigation control. Experiments indicate that the smart car meets the requirements and runs reliably and stably. The technique can be applied in fields such as smart wheelchairs for the physically disabled, service robots, intelligent toys, unmanned vehicles, and warehouses.

  3. Testing of Haar-Like Feature in Region of Interest Detection for Automated Target Recognition (ATR) System

    Science.gov (United States)

    Zhang, Yuhan; Lu, Dr. Thomas

    2010-01-01

    The objectives of this project were to develop a ROI (Region of Interest) detector using Haar-like features, similar to the face detection in Intel's OpenCV library, to implement it in Matlab code, and to test the performance of the new ROI detector against the existing ROI detector that uses the Optimal Trade-off Maximum Average Correlation Height (OTMACH) filter. The ROI detector comprised three parts: 1) automated Haar-like feature selection, finding a small set of the most relevant Haar-like features for detecting ROIs that contained a target; 2) given the small set of Haar-like features from the previous step, training a neural network to recognize ROIs with targets by taking the Haar-like features as inputs; 3) using the trained neural network, developing a filtering method to process the neural network responses into a small set of regions of interest. All three parts needed to be coded in Matlab. The parameters in the detector were trained by machine learning and tested with specific datasets. Since the OpenCV library and Haar-like features were not available in Matlab, the Haar-like feature calculation had to be implemented in Matlab. Matlab code for Adaptive Boosting and max/min filters could be found on the Internet but needed to be integrated to serve the purpose of this project. The performance of the new detector was tested by comparing its accuracy and speed against the existing OTMACH detector. Speed referred to the average time taken to find the regions of interest in an image; accuracy was measured by the number of false positives (false alarms) at the same detection rate for the two detectors.
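
    Since Haar-like features had to be re-implemented outside OpenCV, the key ingredient is the integral image, which turns any rectangle sum into four array lookups. A small sketch of that computation follows, in Python rather than the project's Matlab.

        import numpy as np

        def integral_image(img):
            # Summed-area table padded with a zero row/column so that any
            # rectangle sum costs exactly four lookups.
            ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1))
            ii[1:, 1:] = img.cumsum(axis=0).cumsum(axis=1)
            return ii

        def rect_sum(ii, r, c, h, w):
            return ii[r + h, c + w] - ii[r, c + w] - ii[r + h, c] + ii[r, c]

        def haar_two_rect(ii, r, c, h, w):
            # Two-rectangle Haar-like feature: difference between the sums of
            # the left and right halves of the window.
            half = w // 2
            return rect_sum(ii, r, c, h, half) - rect_sum(ii, r, c + half, h, half)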

  4. Annotating Speech Corpus for Prosody Modeling in Indian Language Text to Speech Systems

    Directory of Open Access Journals (Sweden)

    Kiruthiga S

    2012-01-01

    Full Text Available A spoken language system, whether a speech synthesis or a speech recognition system, starts with building a speech corpus. We give a detailed survey of issues and a methodology for selecting the appropriate speech unit when building a speech corpus for Indian language text-to-speech systems. The paper ultimately aims to improve the intelligibility of the synthesized speech in text-to-speech synthesis systems. To begin with, an appropriate text file should be selected for building the speech corpus. Then a corresponding speech file is generated and stored. This speech file is the phonetic representation of the selected text file. The speech file is processed at different levels, viz. paragraphs, sentences, phrases, words, syllables and phones. These are called the speech units of the file. Research has been done taking these units as the basic unit for processing. This paper analyses the research done using phones, diphones, triphones, syllables and polysyllables as the basic unit for speech synthesis, and also provides a recommended set of combinations for polysyllables. Concatenative speech synthesis involves the concatenation of these basic units to synthesize intelligible, natural-sounding speech. The speech units are annotated with relevant prosodic information about each unit, manually or automatically, based on an algorithm. The database consisting of the units along with their annotated information is called the annotated speech corpus. A clustering technique is used on the annotated speech corpus to provide a way of selecting the appropriate unit for concatenation, based on the lowest total join cost of the speech unit.

  5. Development of a Danish speech intelligibility test

    DEFF Research Database (Denmark)

    Nielsen, Jens Bo; Dau, Torsten

    2009-01-01

    Abstract A Danish speech intelligibility test for assessing the speech recognition threshold in noise (SRTN) has been developed. The test consists of 180 sentences distributed in 18 phonetically balanced lists. The sentences are based on an open word-set and represent everyday language. The sente...

  6. Unsupervised Topic Adaptation for Lecture Speech Retrieval

    OpenAIRE

    Fujii, Atsushi; Itou, Katunobu; Akiba, Tomoyosi; Ishikawa, Tetsuya

    2004-01-01

    We are developing a cross-media information retrieval system, in which users can view specific segments of lecture videos by submitting text queries. To produce a text index, the audio track is extracted from a lecture video and a transcription is generated by automatic speech recognition. In this paper, to improve the quality of our retrieval system, we extensively investigate the effects of adapting acoustic and language models on speech recognition. We perform an MLLR-based method to adapt...

  7. Emotion Classification from Noisy Speech - A Deep Learning Approach

    OpenAIRE

    Rana, Rajib

    2016-01-01

    This paper investigates the performance of Deep Learning for speech emotion classification when the speech is compounded with noise. It reports on the classification accuracy and concludes with the future directions for achieving greater robustness for emotion recognition from noisy speech.

  8. A New Database for Speaker Recognition

    OpenAIRE

    Feng, Ling; Hansen, Lars Kai

    2005-01-01

    In this paper we discuss properties of speech databases used for speaker recognition research and evaluation, and we characterize some popular standard databases. The paper presents a new database called ELSDSR dedicated to speaker recognition applications. The main characteristics of this database are: English spoken by non-native speakers, a single session of sentence reading and relatively extensive speech samples suitable for learning person specific speech characteristics.

  9. A New Database for Speaker Recognition

    DEFF Research Database (Denmark)

    Feng, Ling; Hansen, Lars Kai

    2005-01-01

    In this paper we discuss properties of speech databases used for speaker recognition research and evaluation, and we characterize some popular standard databases. The paper presents a new database called ELSDSR dedicated to speaker recognition applications. The main characteristics of this database are: English spoken by non-native speakers, a single session of sentence reading and relatively extensive speech samples suitable for learning person specific speech characteristics.

  10. The role of speech in the user interface : perspective and application

    OpenAIRE

    Abewusi, A.B.

    1994-01-01

    Consideration must be given to the implications of speech as a communication medium before deciding to use speech input or output in an interactive environment. There are several effective control strategies for improving the quality of speech. The utility of speech has been demonstrated by application to several illustrative problems, where it has proved effective despite all the limitations of synthetic speech output and automatic speech recognition systems. (Résumé d'auteur)

  11. The benefit obtained from visually displayed text from an automatic speech recognizer during listening to speech presented in noise

    NARCIS (Netherlands)

    Zekveld, A.A.; Kramer, S.E.; Kessens, J.M.; Vlaming, M.S.M.G.; Houtgast, T.

    2008-01-01

    OBJECTIVES: The aim of this study was to evaluate the benefit that listeners obtain from visually presented output from an automatic speech recognition (ASR) system during listening to speech in noise. DESIGN: Auditory-alone and audiovisual speech reception thresholds (SRTs) were measured. The SRT i

  12. Bimodal Hearing and Speech Perception with a Competing Talker

    Science.gov (United States)

    Pyschny, Verena; Landwehr, Markus; Hahn, Moritz; Walger, Martin; von Wedel, Hasso; Meister, Hartmut

    2011-01-01

    Purpose: The objective of the study was to investigate the influence of bimodal stimulation upon hearing ability for speech recognition in the presence of a single competing talker. Method: Speech recognition was measured in 3 listening conditions: hearing aid (HA) alone, cochlear implant (CI) alone, and both devices together (CI + HA). To examine…

  13. Comparison of speech recognition with different speech coding strategies (SPEAK, CIS, and ACE) and their relationship to telemetric measures of compound action potentials in the nucleus CI 24M cochlear implant system.

    Science.gov (United States)

    Kiefer, J; Hohl, S; Stürzebecher, E; Pfennigdorff, T; Gstöettner, W

    2001-01-01

    Speech understanding and subjective preference for three different speech coding strategies (spectral peak coding [SPEAK], continuous interleaved sampling [CIS], and advanced combination encoders [ACE]) were investigated in 11 post-lingually deaf adult subjects, using the Nucleus CI 24M cochlear implant system. Subjects were randomly assigned to two groups in a balanced crossover study design. The first group was initially fitted with SPEAK and the second group with CIS. The remaining strategies were tested sequentially over 8 to 10 weeks with systematic variations of number of channels and rate of stimulation. Following a further interval of 3 months, during which subjects were allowed to listen with their preferred strategy, they were tested again with all three strategies. Compound action potentials (CAPs) were recorded using neural response telemetry. Input/output functions in relation to increasing stimulus levels and inter-stimulus intervals between masker and probe were established to assess the physiological status of the cochlear nerve. Objective results and subjective rating showed significant differences in favour of the ACE strategy. Ten of the 11 subjects preferred the ACE strategy at the end of the study. The estimate of the refractory period based on the inter-stimulus interval correlated significantly with the overall performance with all three strategies, but CAP measures could not be related to individual preference of strategy or differences in performance between strategies. Based on these results, the ACE strategy can be recommended as an initial choice specifically for the Nucleus CI 24M cochlear implant system. Nevertheless, access to the other strategies may help to increase performance in individual patients. PMID:11296939

  14. Speech and gesture interfaces for squad-level human-robot teaming

    Science.gov (United States)

    Harris, Jonathan; Barber, Daniel

    2014-06-01

    As the military increasingly adopts semi-autonomous unmanned systems for military operations, utilizing redundant and intuitive interfaces for communication between Soldiers and robots is vital to mission success. Currently, Soldiers use a common lexicon to verbally and visually communicate maneuvers between teammates. In order for robots to be seamlessly integrated within mixed-initiative teams, they must be able to understand this lexicon. Recent innovations in gaming platforms have led to advancements in speech and gesture recognition technologies, but the reliability of these technologies for enabling communication in human robot teaming is unclear. The purpose for the present study is to investigate the performance of Commercial-Off-The-Shelf (COTS) speech and gesture recognition tools in classifying a Squad Level Vocabulary (SLV) for a spatial navigation reconnaissance and surveillance task. The SLV for this study was based on findings from a survey conducted with Soldiers at Fort Benning, GA. The items of the survey focused on the communication between the Soldier and the robot, specifically in regards to verbally instructing them to execute reconnaissance and surveillance tasks. Resulting commands, identified from the survey, were then converted to equivalent arm and hand gestures, leveraging existing visual signals (e.g. U.S. Army Field Manual for Visual Signaling). A study was then run to test the ability of commercially available automated speech recognition technologies and a gesture recognition glove to classify these commands in a simulated intelligence, surveillance, and reconnaissance task. This paper presents classification accuracy of these devices for both speech and gesture modalities independently.

  15. Speech Processing Application Based on Phonetics and Phonology of the Polish Language

    Science.gov (United States)

    Kłosowski, Piotr

    The article presents methods of improving speech processing based on the phonetics and phonology of the Polish language. The presented method for speech recognition is based on the detection of distinctive acoustic parameters of phonemes in Polish. Distinctiveness was assumed to be the most important criterion in selecting the parameters that represent objects from the recognized classes. Speech recognition is widely used in telecommunications applications.

  16. 基于语音识别的农产品价格信息采集方法%The Agricultural Price Information Acquisition Method Based on Speech Recognition

    Institute of Scientific and Technical Information of China (English)

    许金普; 诸叶平

    2015-01-01

    [Objective] In this research, speech recognition technology was applied to collect agricultural price information. The aim of the research is to recognize continuous speech that is limited in vocabulary and uttered by independent Chinese Mandarin speakers, and to propose a robust speech recognition method suitable for the environment in which agricultural product prices are collected. On the basis of the Hidden Markov Model (HMM), we train acoustic models for this environment, so as to relieve the decrease in recognition rate caused by the mismatch between the test environment and the training environment, and to further improve the recognition rate. [Method] In the stage of acquiring and processing data, we first built a transformation grammar according to certain rules to recognize the limited vocabulary; this grammar was used to guide the recording of both training and test data. We then selected different environments and different speakers for collecting agricultural product prices. Experiments were carried out with the same test set, yielding the sentence recognition rate, word recognition rate and accuracy of the different methods. [Results] The recognition performance of the triphone acoustic model is clearly superior to that of the monophone model, and the separate female and male models both outperform the mixed-gender model. Decision-tree clustering does not noticeably improve the recognition rate but markedly reduces the number of triphone models, while increasing the number of Gaussian mixture components improves the recognition rate at the cost of additional computation. Cepstral mean normalization (CMN) and cepstral variance normalization (CVN) clearly improve the recognition performance of the system. Tests across different locations and speakers gave final recognition rates of 95.04% for male and 97.62% for female speakers. [Conclusion] Applying speech recognition technology to the collection of agricultural product price information is feasible. This paper proposes a method for improving the speech recognition rate in the price-collection environment; experiments show that models trained with this method have good recognition performance, laying a foundation for the development of future application systems.
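
    The CMN/CVN normalisation credited with part of the robustness gain is a one-liner per utterance; a minimal sketch, assuming a (frames x coefficients) feature matrix:

        import numpy as np

        def cmvn(feats):
            # Per-utterance cepstral mean (CMN) and variance (CVN) normalisation:
            # subtracting the mean removes stationary channel effects, and scaling
            # to unit variance reduces the train/test environment mismatch that
            # hurts recognition in field recordings.
            mu = feats.mean(axis=0)
            sigma = feats.std(axis=0) + 1e-8
            return (feats - mu) / sigma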

  17. Real-time speech gisting for ATC applications

    Science.gov (United States)

    Dunkelberger, Kirk A.

    1995-06-01

    Command and control within the ATC environment remains primarily voice-based. Hence, automatic real time, speaker independent, continuous speech recognition (CSR) has many obvious applications and implied benefits to the ATC community: automated target tagging, aircraft compliance monitoring, controller training, automatic alarm disabling, display management, and many others. However, while current state-of-the-art CSR systems provide upwards of 98% word accuracy in laboratory environments, recent low-intrusion experiments in ATCT environments demonstrated less than 70% word accuracy in spite of significant investments in recognizer tuning. Acoustic channel irregularities and varieties of controller/pilot grammar impact current CSR algorithms at their weakest points. It will be shown herein, however, that real time context- and environment-sensitive gisting can provide key command phrase recognition rates of greater than 95% using the same low-intrusion approach. The combination of real time inexact syntactic pattern recognition techniques and a tight integration of CSR, gisting, and ATC database accessor system components is the key to these high phrase recognition rates. A system concept for real time gisting in the ATC context is presented herein. After establishing an application context, the discussion presents a minimal CSR technology context, then focuses on the gisting mechanism, desirable interfaces into the ATCT database environment, and data and control flow within the prototype system. Results of recent tests for a subset of the functionality are presented together with suggestions for further research.
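
    A toy illustration of gisting as inexact matching of an errorful recogniser hypothesis against known command phrases; the command list below is hypothetical and the string-similarity matcher is only a stand-in for the real-time inexact syntactic pattern recognition described above.

        import difflib

        # Hypothetical command templates; a real system would use the ATC lexicon.
        COMMANDS = ["cleared for takeoff", "contact tower", "hold short of runway",
                    "turn left heading", "climb and maintain"]

        def gist(hypothesis, threshold=0.75):
            # Inexact matching tolerates recogniser word errors instead of
            # requiring an exact parse of the hypothesis.
            scores = [(difflib.SequenceMatcher(None, hypothesis, c).ratio(), c)
                      for c in COMMANDS]
            score, best = max(scores)
            return best if score >= threshold else None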

  18. Integrated speech and morphological processing in a connectionist continuous speech understanding for Korean

    CERN Document Server

    Lee, G; Lee, Geunbae; Lee, Jong-Hyeok

    1996-01-01

    A new tightly coupled speech and natural language integration model is presented for a TDNN-based continuous, possibly large-vocabulary speech recognition system for Korean. Unlike popular n-best techniques developed for integrating mainly HMM-based speech recognition and natural language processing at the word level, which is obviously inadequate for morphologically complex agglutinative languages, our model constructs a spoken language system based on morpheme-level speech and language integration. With this integration scheme, the spoken Korean processing engine (SKOPE) is designed and implemented using a TDNN-based diphone recognition module integrated with Viterbi-based lexical decoding and symbolic phonological/morphological co-analysis. Our experimental results show that speaker-dependent continuous speech recognition can be achieved with an over 80.6% success rate directly from speech inputs for middle-level vocabularies.

  19. Embedding speech into virtual realities

    Science.gov (United States)

    Bohn, Christian-Arved; Krueger, Wolfgang

    1993-05-01

    In this work a speaker-independent speech recognition system is presented which is suitable for implementation in Virtual Reality applications. The use of an artificial neural network in connection with a special compression of the acoustic input leads to a system which is robust, fast, easy to use, and needs no additional hardware besides common VR equipment.

  20. The Kiel Corpora of "Speech & Emotion" - A Summary

    DEFF Research Database (Denmark)

    Niebuhr, Oliver; Peters, Benno; Landgraf, Rabea; Schmidt, Gerhard

    technology applications that sneak into every corner of our life. Apart from the fact that speech corpora seem to become constantly larger (for example, in order to properly train self-learning speech synthesis/recognition algorithms), the content of speech corpora also changes. In particular, recordings of isolated logatomes, words or sentences are successively supplanted by more realistic, interactive, and informal speech-production tasks.

  1. Cued Speech: A visual communication mode for the Deaf society

    OpenAIRE

    Heracleous, Panikos; Beautemps, Denis

    2010-01-01

    Cued Speech is a visual mode of communication that uses handshapes and placements in combination with the mouth movements of speech to make the phonemes of a spoken language look different from each other and clearly understandable to deaf individuals. The aim of Cued Speech is to overcome the problems of lip reading and thus enable deaf persons to wholly understand spoken language. In this study, automatic phoneme recognition in Cued Speech for French based on hidden Markov models (HMMs) is i...

  2. Human Lips-Contour Recognition and Tracing

    Directory of Open Access Journals (Sweden)

    Md. Hasan Tareque

    2014-01-01

    Full Text Available Human-lip detection is an important component of many modern automated systems, such as computerized speech reading and face recognition; such systems can work more precisely if the lips can be detected accurately. There are many processes for detecting human lips. In this paper an approach is developed to detect the region of a human lip, which we call the lip contour. For this, a region-based Active Contour Model (ACM) is introduced with watershed segmentation. In this model we use global energy terms instead of local energy terms, because global energy gives a better convergence rate in adverse environments. With the ACM initialized using H∞ based on Lyapunov stability theory, the system gives more accurate and stable results.

  3. The intelligibility of interrupted speech depends upon its uninterrupted intelligibility.

    Science.gov (United States)

    Ardoint, Marine; Green, Tim; Rosen, Stuart

    2014-10-01

    Recognition of sentences containing periodic, 5-Hz, silent interruptions of differing duty cycles was assessed for three types of processed speech. Processing conditions employed different combinations of spectral resolution and the availability of fundamental frequency (F0) information, chosen to yield similar, below-ceiling performance for uninterrupted speech. Performance declined with decreasing duty cycle similarly for each processing condition, suggesting that, at least for certain forms of speech processing and interruption rates, performance with interrupted speech may reflect that obtained with uninterrupted speech. This highlights the difficulty in interpreting differences in interrupted speech performance across conditions for which uninterrupted performance is at ceiling. PMID:25324110
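
    The stimulus manipulation described above, periodic 5-Hz silent interruption at a given duty cycle, is easy to reproduce; a short sketch:

        import numpy as np

        def interrupt(speech, fs, rate_hz=5.0, duty=0.5):
            # Periodic silent interruptions: `rate_hz` gating cycles per second,
            # keeping the first `duty` fraction of each cycle, silencing the rest.
            t = np.arange(len(speech)) / fs
            gate = np.mod(t * rate_hz, 1.0) < duty
            return speech * gate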

  4. Speech coding

    Energy Technology Data Exchange (ETDEWEB)

    Ravishankar, C., Hughes Network Systems, Germantown, MD

    1998-05-08

    Speech is the predominant means of communication between human beings, and since the invention of the telephone by Alexander Graham Bell in 1876, speech services have remained the core service in almost all telecommunication systems. Original analog methods of telephony had the disadvantage of the speech signal getting corrupted by noise, cross-talk and distortion. Long-haul transmissions, which use repeaters to compensate for the loss in signal strength on transmission links, also increase the associated noise and distortion. On the other hand, digital transmission is relatively immune to noise, cross-talk and distortion, primarily because of the capability to faithfully regenerate the digital signal at each repeater purely based on a binary decision. Hence the end-to-end performance of the digital link essentially becomes independent of the length and operating frequency bands of the link. From a transmission point of view, digital transmission has therefore been the preferred approach due to its higher immunity to noise. The need to carry digital speech became extremely important from a service provision point of view as well. Modern requirements have introduced the need for robust, flexible and secure services that can carry a multitude of signal types (such as voice, data and video) without a fundamental change in infrastructure. Such a requirement could not have been easily met without the advent of digital transmission systems, thereby requiring speech to be coded digitally. The term Speech Coding often refers to techniques that represent or code speech signals either directly as a waveform or as a set of parameters obtained by analyzing the speech signal. In either case, the codes are transmitted to the distant end where speech is reconstructed or synthesized using the received set of codes. A more generic term that is applicable to these techniques and is often used interchangeably with speech coding is the term voice coding. This term is more generic in the sense that the

  5. Multimedia content with a speech track: ACM multimedia 2010 workshop on searching spontaneous conversational speech

    NARCIS (Netherlands)

    Larson, M.; Ordelman, R.; Metze, F.; Kraaij, W.; Jong, F. de

    2010-01-01

    When multimedia content has a speech track, a whole array of techniques involving speech recognition and analysis becomes available for indexing and structuring and can provide users with improved access and search. The set of new domains standing to benefit from these techniques encompasses talksho

  6. Automatic Speaker Recognition System

    Directory of Open Access Journals (Sweden)

    Parul; R. B. Dubey

    2012-12-01

    Full Text Available Spoken language is used by humans to convey many types of information. Primarily, speech conveys messages via words. Owing to advanced speech technologies, people's interactions with remote machines, such as phone banking, internet browsing, and secured information retrieval by voice, are becoming popular today. Speaker verification and speaker identification are important for authentication and verification in security applications. Speaker identification methods can be divided into text-independent and text-dependent. Speaker recognition is the process of automatically recognizing a speaker's voice on the basis of individual information included in the input speech waves. It consists of comparing a speech signal from an unknown speaker to a set of stored data of known speakers. This process recognizes who has spoken by matching the input signal with pre-stored samples. The work is focused on improving the performance of speaker verification under noisy conditions.

  7. Modeling speech imitation and ecological learning of auditory-motor maps

    OpenAIRE

    Claudia Canevari; Leonardo Badino; Alessandro D'Ausilio; Luciano Fadiga; Giorgio Metta

    2013-01-01

    Classical models of speech consider an antero-posterior distinction between perceptive and productive functions. However, the selective alteration of neural activity in speech motor centers, via transcranial magnetic stimulation, was shown to affect speech discrimination. On the automatic speech recognition (ASR) side, the recognition systems have classically relied solely on acoustic data, achieving rather good performance in optimal listening conditions. The main limitations of current ASR ...

  8. The relative phonetic contributions of a cochlear implant and residual acoustic hearing to bimodal speech perception

    OpenAIRE

    Sheffield, Benjamin M.; Zeng, Fan-Gang

    2012-01-01

    The addition of low-passed (LP) speech or even a tone following the fundamental frequency (F0) of speech has been shown to benefit speech recognition for cochlear implant (CI) users with residual acoustic hearing. The mechanisms underlying this benefit are still unclear. In this study, eight bimodal subjects (CI users with acoustic hearing in the non-implanted ear) and eight simulated bimodal subjects (using vocoded and LP speech) were tested on vowel and consonant recognition to determine th...

  9. Commercial applications of speech interface technology: an industry at the threshold.

    OpenAIRE

    Oberteuffer, J A

    1995-01-01

    Speech interface technology, which includes automatic speech recognition, synthetic speech, and natural language processing, is beginning to have a significant impact on business and personal computer use. Today, powerful and inexpensive microprocessors and improved algorithms are driving commercial applications in computer command, consumer, data entry, speech-to-text, telephone, and voice verification. Robust speaker-independent recognition systems for command and navigation in personal com...

  10. 基于改进谱减法的语音识别系统去噪%Denoising in the Speech Recognition System Based on the Improved Spectral Subtraction Method

    Institute of Scientific and Technical Information of China (English)

    田莎莎; 田艳

    2012-01-01

      The paper studied the traditional spectral subtraction method and an improved spectral subtraction method for eliminating noise interference in a speech recognition system, describing the principles and characteristics of both algorithms. The two algorithms were programmed and implemented. Matlab simulations show that the improved algorithm can not only eliminate noise but also suppress "musical noise", giving it an advantage over the traditional method.
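
    Over-subtraction with a spectral floor is one standard improvement over basic spectral subtraction that suppresses musical noise; since the abstract does not specify the paper's exact modification, the sketch below shows that common variant (alpha = 1, beta = 0 recovers the basic method).

        import numpy as np
        from scipy.signal import stft, istft

        def spectral_subtraction(noisy, fs, noise_frames=10, alpha=4.0, beta=0.01):
            # alpha > 1 subtracts more than the estimated noise power; the
            # spectral floor beta masks the isolated residual peaks that are
            # heard as "musical noise".
            _, _, X = stft(noisy, fs, nperseg=512)
            # Estimate the noise power from the first (assumed speech-free) frames.
            noise_psd = (np.abs(X[:, :noise_frames]) ** 2).mean(axis=1, keepdims=True)
            clean_psd = np.abs(X) ** 2 - alpha * noise_psd
            clean_psd = np.maximum(clean_psd, beta * noise_psd)   # spectral floor
            X_hat = np.sqrt(clean_psd) * np.exp(1j * np.angle(X)) # keep noisy phase
            _, out = istft(X_hat, fs, nperseg=512)
            return out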

  11. The Cortical Organization of Speech Processing: Feedback Control and Predictive Coding the Context of a Dual-Stream Model

    Science.gov (United States)

    Hickok, Gregory

    2012-01-01

    Speech recognition is an active process that involves some form of predictive coding. This statement is relatively uncontroversial. What is less clear is the source of the prediction. The dual-stream model of speech processing suggests that there are two possible sources of predictive coding in speech perception: the motor speech system and the…

  12. An Abnormal Speech Detection Algorithm Based on GMM-UBM

    Directory of Open Access Journals (Sweden)

    Jun He

    2014-05-01

    Full Text Available To overcome the defects of commonly used model-based algorithms for abnormal speech recognition, namely insufficient training data and the difficulty of fitting each type of abnormal characteristic, an abnormal speech detection method based on GMM-UBM is proposed in this paper, compensating for the inability of purely model-based methods to deal with highly variable speech. First, many normal utterances and abnormal utterances of unknown type, collected from different speakers, were used to train a GMM-UBM for normal speech and for abnormal speech, respectively. Second, the trained GMM-UBMs were used to score the test utterances. The results show that, compared with GMM and GMM-SVM methods using 24 Gaussians and a 6:4 ratio of training to test speech, the correct classification rate of the proposed method improves by 6.1% and 4.4%, respectively.
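
    A minimal sketch of the two-model GMM-UBM detector with scikit-learn, using the 24-Gaussian configuration mentioned in the results; the scoring here is a per-utterance average log-likelihood ratio, an assumed but standard choice.

        import numpy as np
        from sklearn.mixture import GaussianMixture

        def train_models(normal_feats, abnormal_feats, n_mix=24):
            # One background model per class, fitted on pooled frame features.
            gmm_norm = GaussianMixture(n_components=n_mix,
                                       covariance_type="diag").fit(normal_feats)
            gmm_abn = GaussianMixture(n_components=n_mix,
                                      covariance_type="diag").fit(abnormal_feats)
            return gmm_norm, gmm_abn

        def detect_abnormal(test_feats, gmm_norm, gmm_abn, threshold=0.0):
            # .score() returns the average per-frame log-likelihood, so this is
            # an average log-likelihood ratio over the test utterance.
            llr = gmm_abn.score(test_feats) - gmm_norm.score(test_feats)
            return llr > threshold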

  13. Hate speech

    Directory of Open Access Journals (Sweden)

    Anne Birgitta Nilsen

    2014-03-01

    Full Text Available The manifesto of the Norwegian terrorist Anders Behring Breivik is based on the “Eurabia” conspiracy theory. This theory is a key starting point for hate speech amongst many right-wing extremists in Europe, but also has ramifications beyond these environments. In brief, proponents of the Eurabia theory claim that Muslims are occupying Europe and destroying Western culture, with the assistance of the EU and European governments. By contrast, members of Al-Qaeda and other extreme Islamists promote the conspiracy theory “the Crusade” in their hate speech directed against the West. Proponents of the latter theory argue that the West is leading a crusade to eradicate Islam and Muslims, a crusade that is similarly facilitated by their governments. This article presents analyses of texts written by right-wing extremists and Muslim extremists in an effort to shed light on how hate speech promulgates conspiracy theories in order to spread hatred and intolerance. The aim of the article is to contribute to a more thorough understanding of hate speech’s nature by applying rhetorical analysis. Rhetorical analysis is chosen because it offers a means of understanding the persuasive power of speech. It is thus a suitable tool to describe how hate speech works to convince and persuade. The concepts from rhetorical theory used in this article are ethos, logos and pathos. The concept of ethos is used to pinpoint factors that contributed to Osama bin Laden's impact, namely factors that lent credibility to his promotion of the conspiracy theory of the Crusade. In particular, Bin Laden projected common sense, good morals and good will towards his audience. He seemed to have coherent and relevant arguments; he appeared to possess moral credibility; and his use of language demonstrated that he wanted the best for his audience. The concept of pathos is used to define hate speech, since hate speech targets its audience's emotions. In hate speech it is the

  14. Personality in speech assessment and automatic classification

    CERN Document Server

    Polzehl, Tim

    2015-01-01

    This work combines interdisciplinary knowledge and experience from research fields of psychology, linguistics, audio-processing, machine learning, and computer science. The work systematically explores a novel research topic devoted to automated modeling of personality expression from speech. For this aim, it introduces a novel personality assessment questionnaire and presents the results of extensive labeling sessions to annotate the speech data with personality assessments. It provides estimates of the Big 5 personality traits, i.e. openness, conscientiousness, extroversion, agreeableness, and neuroticism. Based on a database built on the questionnaire, the book presents models to tell apart different personality types or classes from speech automatically.

  15. Towards Quranic reader controlled by speech

    OpenAIRE

    Yacine Yekache; Yekhlef Mekelleche; Belkacem Kouninef

    2012-01-01

    In this paper we describe the process of designing a task-oriented continuous speech recognition system for Arabic, based on CMU Sphinx4, to be used in the voice interface of a Quranic reader. The concept of the Quranic reader controlled by speech is presented, and the collection of the corpus and creation of the acoustic model are described in detail, taking into account the specificities of the Arabic language and the desired application.

  16. Towards Quranic reader controlled by speech

    Directory of Open Access Journals (Sweden)

    Yacine Yekache

    2011-11-01

    Full Text Available In this paper we describe the process of designing a task-oriented continuous speech recognition system for Arabic, based on CMU Sphinx4, to be used in the voice interface of a Quranic reader. The concept of the Quranic reader controlled by speech is presented, and the collection of the corpus and creation of the acoustic model are described in detail, taking into account the specificities of the Arabic language and the desired application.

  17. Speech Enhancement

    DEFF Research Database (Denmark)

    Benesty, Jacob; Jensen, Jesper Rindom; Christensen, Mads Græsbøll;

    Speech enhancement is a classical problem in signal processing, yet still largely unsolved. Two of the conventional approaches for solving this problem are linear filtering, like the classical Wiener filter, and subspace methods. These approaches have traditionally been treated as different classes of methods, and their performance bounded and assessed in terms of noise reduction and speech distortion. The book shows how various filter designs can be obtained in this framework, including the maximum SNR, Wiener, LCMV, and MVDR filters, and how these can be applied in various contexts, like in single...

  18. Speech enhancement

    CERN Document Server

    Benesty, Jacob; Chen, Jingdong

    2006-01-01

    We live in a noisy world! In all applications (telecommunications, hands-free communications, recording, human-machine interfaces, etc.) that require at least one microphone, the signal of interest is usually contaminated by noise and reverberation. As a result, the microphone signal has to be "cleaned" with digital signal processing tools before it is played out, transmitted, or stored. This book is about speech enhancement. Different well-known and state-of-the-art methods for noise reduction, with one or multiple microphones, are discussed. By speech enhancement, we mean not only noise reduction...

  19. A Danish open-set speech corpus for competing-speech studies

    DEFF Research Database (Denmark)

    Nielsen, Jens Bo; Dau, Torsten; Neher, Tobias

    2014-01-01

    Studies investigating speech-on-speech masking effects commonly use closed-set speech materials such as the coordinate response measure [Bolia et al. (2000). J. Acoust. Soc. Am. 107, 1065-1066]. However, these studies typically result in very low (i.e., negative) speech recognition thresholds (SRTs) when the competing speech signals are spatially separated. To achieve higher SRTs that correspond more closely to natural communication situations, an open-set, low-context, multi-talker speech corpus was developed. Three sets of 268 unique Danish sentences were created, and each set was recorded... in a setup with a frontal target sentence and two concurrent masker sentences at ±50 degrees azimuth. For a group of 16 normal-hearing listeners and a group of 15 elderly (linearly aided) hearing-impaired listeners, overall SRTs of, respectively, +1.3 dB and +6.3 dB target-to-masker ratio were obtained...
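
    SRTs like those reported here are commonly estimated with a simple adaptive track that converges on the 50%-correct target-to-masker ratio; the sketch below illustrates a generic 1-up/1-down rule only, since the record does not specify the paper's tracking procedure:

```python
# Hypothetical sketch of a 1-up/1-down adaptive track converging on the
# 50%-correct SRT; the paper's actual measurement procedure may differ.
def measure_srt(present_trial, start_tmr=4.0, step=2.0, n_trials=20):
    """present_trial(tmr) plays one sentence at the given target-to-masker
    ratio in dB and returns True if the listener repeats it correctly."""
    tmr, track = start_tmr, []
    for _ in range(n_trials):
        track.append(tmr)
        # Correct -> make it harder (lower TMR); wrong -> make it easier.
        tmr += -step if present_trial(tmr) else step
    # Average the second half of the track as the SRT estimate.
    tail = track[n_trials // 2:]
    return sum(tail) / len(tail)
```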

  20. Speech recognition at the maximum comfort level in adult patients with sensorineural hearing loss (Reconhecimento de fala no nível de máximo conforto em pacientes adultos com perda auditiva neurossensorial)

    Directory of Open Access Journals (Sweden)

    Zuleica Costa Zaboni

    2009-01-01

    Full Text Available PURPOSE: To investigate the Percentage Index of Speech Recognition (PISR) at the maximum comfort level in adults with mild to moderately severe sensorineural hearing loss of up to 60 dB HL. METHODS: The subjects were grouped according to their degree of hearing loss (Groups I, II and III), and each group was further divided into subgroups (IA, IB, IIA, IIB, IIIA, IIIB) as follows. In subgroup A of each group, the PISR was determined at 40 dB SL, with testing starting at the right ear. At this presentation level, patients were asked to report how comfortable the sound was, using a scale with four descriptors: soft, comfortable, loud, and too loud. The maximum comfort level (MCL) was then determined, and the PISR was obtained at that intensity. In subgroup B of each group, the same procedure was carried out, but the PISR was obtained first at the MCL and then at 40 dB SL; in this subgroup, testing started at the left ear. RESULTS: Across the three groups, the mean hearing level at which subjects reported greatest comfort ranged from 25 to 32.95 dB SL. At the MCL of stimulus presentation, a higher percentage of words was correctly recognized. CONCLUSION: Assessing the PISR at the maximum comfort level in individuals with mild to moderately severe sensorineural hearing loss yields better speech recognition results.
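
    The index reported here is, at bottom, percent-correct scoring of a word list at a fixed presentation level; a minimal sketch (hypothetical helper, not the study's software):

```python
# Minimal sketch of the percent-correct scoring behind an index like the
# PISR: the share of list words the listener repeats correctly at a given
# presentation level (e.g. 40 dB SL or the measured MCL).
def percent_speech_recognition(presented, repeated):
    """presented/repeated are equal-length word lists for one ear."""
    correct = sum(p == r for p, r in zip(presented, repeated))
    return 100.0 * correct / len(presented)

# Hypothetical three-word excerpt: two of three words correct -> 66.7%.
print(percent_speech_recognition(["casa", "pato", "mel"],
                                 ["casa", "gato", "mel"]))
```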