WorldWideScience

Sample records for automated speech recognition

  1. Health Care in Home Automation Systems with Speech Recognition and Mobile Technology

    Directory of Open Access Journals (Sweden)

    Jasmin Kurti

    2016-08-01

    Full Text Available - Home automation systems use technology to facilitate the lives of the people using them, and they are especially useful for assisting the elderly and persons with special needs. These kinds of systems have been a popular research subject in the last few years. In this work, I present the design and development of a system that provides a life-assistant service in a home environment: a smart home-based healthcare system controlled with speech recognition and mobile technology. This includes developing software with speech recognition, speech synthesis, face recognition, controls for Arduino hardware, and a smartphone application for remotely controlling the system. With the developed system, the elderly and persons with special needs can stay independently in their own homes, secure and with care facilities. This system is tailored towards the elderly and disabled, but it can also be embedded in any home and used by anybody. It provides healthcare, security, entertainment, and total local and remote control of the home.

  2. Acoustic diagnosis of pulmonary hypertension: automated speech-recognition-inspired classification algorithm outperforms physicians

    Science.gov (United States)

    Kaddoura, Tarek; Vadlamudi, Karunakar; Kumar, Shine; Bobhate, Prashant; Guo, Long; Jain, Shreepal; Elgendi, Mohamed; Coe, James Y.; Kim, Daniel; Taylor, Dylan; Tymchak, Wayne; Schuurmans, Dale; Zemp, Roger J.; Adatia, Ian

    2016-09-01

    We hypothesized that an automated speech-recognition-inspired classification algorithm could differentiate between the heart sounds in subjects with and without pulmonary hypertension (PH) and outperform physicians. Heart sounds, electrocardiograms, and mean pulmonary artery pressures (mPAp) were recorded simultaneously. Heart sound recordings were digitized to train and test speech-recognition-inspired classification algorithms. We used mel-frequency cepstral coefficients to extract features from the heart sounds. Gaussian-mixture models classified the features as PH (mPAp ≥ 25 mmHg) or normal (mPAp < 25 mmHg). Physicians blinded to patient data listened to the same heart sound recordings and attempted a diagnosis. We studied 164 subjects: 86 with mPAp ≥ 25 mmHg (mPAp 41 ± 12 mmHg) and 78 with mPAp < 25 mmHg (mPAp 17 ± 5 mmHg) (p < 0.005). The correct diagnostic rate of the automated speech-recognition-inspired algorithm was 74% compared to 56% by physicians (p = 0.005). The false positive rate for the algorithm was 34% versus 50% (p = 0.04) for clinicians. The false negative rate for the algorithm was 23% versus 68% (p = 0.0002) for physicians. We developed an automated speech-recognition-inspired classification algorithm for the acoustic diagnosis of PH that outperforms physicians and could be used to screen for PH and encourage earlier specialist referral.
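
    As an illustration of the kind of pipeline the abstract describes (mel-frequency cepstral coefficients classified by Gaussian-mixture models), a minimal sketch using librosa and scikit-learn follows; the file names, sampling rate, and mixture sizes are placeholders, not the authors' settings.

    ```python
    # Sketch of an MFCC + Gaussian-mixture classifier in the spirit of the study
    # (not the authors' implementation). Assumes librosa and scikit-learn.
    import numpy as np
    import librosa
    from sklearn.mixture import GaussianMixture

    def mfcc_features(wav_path, sr=4000, n_mfcc=13):
        """Load a heart-sound recording and return per-frame MFCC vectors."""
        y, sr = librosa.load(wav_path, sr=sr)
        return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).T  # (frames, n_mfcc)

    # Hypothetical training lists: recordings labelled by catheter-measured mPAp.
    ph_train = ["ph_case_01.wav"]          # mPAp >= 25 mmHg
    normal_train = ["normal_case_01.wav"]  # mPAp < 25 mmHg

    gmm_ph = GaussianMixture(n_components=8, covariance_type="diag", random_state=0)
    gmm_normal = GaussianMixture(n_components=8, covariance_type="diag", random_state=0)
    gmm_ph.fit(np.vstack([mfcc_features(f) for f in ph_train]))
    gmm_normal.fit(np.vstack([mfcc_features(f) for f in normal_train]))

    def classify(wav_path):
        """Label a recording by which class GMM gives the higher mean log-likelihood."""
        x = mfcc_features(wav_path)
        return "PH" if gmm_ph.score(x) > gmm_normal.score(x) else "normal"
    ```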

  3. Fully Automated Non-Native Speech Recognition Using Confusion-Based Acoustic Model Integration

    OpenAIRE

    Bouselmi, Ghazi; Fohr, Dominique; Illina, Irina; Haton, Jean-Paul

    2005-01-01

    This paper presents a fully automated approach for the recognition of non-native speech based on acoustic model modification. For a native language (L1) and a spoken language (L2), pronunciation variants of the phones of L2 are automatically extracted from an existing non-native database as a confusion matrix with sequences of phones of L1. This is done using L1's and L2's ASR systems. This confusion concept deals with the problem of the non-existence of a match between some L2 and L1 phones. The c...
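
    The confusion idea can be illustrated with a small counting sketch: for each canonical L2 phone, the L1 phone sequences produced by the L1 recognizer are tallied and the most frequent ones kept as pronunciation variants. The alignment is assumed to be given; this is an illustration, not the authors' exact procedure.

    ```python
    # Sketch: accumulate, for each L2 phone, the L1 phone sequences an L1 recognizer
    # produced for it, and keep the most frequent variants.
    from collections import Counter, defaultdict

    def count_confusions(aligned_pairs):
        """aligned_pairs: iterable of (l2_phone, l1_phone_sequence) tuples."""
        table = defaultdict(Counter)
        for l2_phone, l1_seq in aligned_pairs:
            table[l2_phone][tuple(l1_seq)] += 1
        return table

    def pronunciation_variants(table, top_k=3):
        """Keep the top-k L1 sequences per L2 phone as extra pronunciation variants."""
        return {p: [seq for seq, _ in c.most_common(top_k)] for p, c in table.items()}

    # Example with made-up phone labels:
    pairs = [("TH", ["S"]), ("TH", ["S"]), ("TH", ["T"]), ("P", ["P"])]
    print(pronunciation_variants(count_confusions(pairs)))
    ```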

  4. Development of an automated speech recognition interface for personal emergency response systems

    Directory of Open Access Journals (Sweden)

    Mihailidis Alex

    2009-07-01

    Full Text Available Abstract Background Demands on long-term-care facilities are predicted to increase at an unprecedented rate as the baby boomer generation reaches retirement age. Aging-in-place (i.e. aging at home) is the desire of most seniors and is also a good option to reduce the burden on an over-stretched long-term-care system. Personal Emergency Response Systems (PERSs) help enable older adults to age-in-place by providing them with immediate access to emergency assistance. Traditionally they operate with push-button activators that connect the occupant via speaker-phone to a live emergency call-centre operator. If occupants do not wear the push button or cannot access the button, then the system is useless in the event of a fall or emergency. Additionally, a false alarm or failure to check-in at a regular interval will trigger a connection to a live operator, which can be unwanted and intrusive to the occupant. This paper describes the development and testing of an automated, hands-free, dialogue-based PERS prototype. Methods The prototype system was built using a ceiling-mounted microphone array, an open-source automatic speech recognition engine, and a 'yes' and 'no' response dialog modelled after an existing call-centre protocol. Testing compared a single microphone versus a microphone array with nine adults in both noisy and quiet conditions. Dialogue testing was completed with four adults. Results and discussion The microphone array demonstrated improvement over the single microphone. In all cases, dialog testing resulted in the system reaching the correct decision about the kind of assistance the user was requesting. Further testing is required with elderly voices and under different noise conditions to ensure the appropriateness of the technology. Future developments include integration of the system with an emergency detection method as well as communication enhancement using features such as barge-in capability. Conclusion The use of an automated

  5. DEVELOPMENT OF AUTOMATED SPEECH RECOGNITION SYSTEM FOR EGYPTIAN ARABIC PHONE CONVERSATIONS

    Directory of Open Access Journals (Sweden)

    A. N. Romanenko

    2016-07-01

    Full Text Available The paper describes several speech recognition systems for Egyptian Colloquial Arabic. The research is based on the CALLHOME Egyptian corpus. Both systems are described: a classic one based on hidden Markov and Gaussian mixture models, and a state-of-the-art one based on deep neural network acoustic models. We have demonstrated the contribution of speaker-dependent bottleneck features; for their extraction, three extractors based on neural networks were trained on datasets in three languages: Russian, English, and different Arabic dialects. We have studied the possibility of applying a small Modern Standard Arabic (MSA) corpus to derive phonetic transcriptions. The experiments have shown that the extractor trained on the Russian dataset significantly increases the quality of Arabic speech recognition. We have also found that phonetic transcriptions based on Modern Standard Arabic decrease recognition quality; nevertheless, the system's results remain usable in practice. In addition, we have studied the application of the obtained models to the keyword search problem. The systems obtained demonstrate good results compared to those published before. Some ways to improve speech recognition are offered.
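
    The bottleneck-feature idea mentioned above can be sketched as reading out the activations of a deliberately narrow hidden layer of a trained network; the layer sizes and the random weights below are placeholders for a real trained extractor.

    ```python
    # Sketch of bottleneck feature extraction: a trained network's narrow hidden
    # layer is read out as the feature vector. Weights here are random placeholders;
    # in the paper the extractors were trained on Russian, English and Arabic data.
    import numpy as np

    rng = np.random.default_rng(0)
    W1, b1 = rng.normal(size=(440, 1024)), np.zeros(1024)  # input: 40-dim fbank x 11 frames
    W2, b2 = rng.normal(size=(1024, 40)), np.zeros(40)     # 40-dim bottleneck layer

    def bottleneck_features(spliced_fbank):
        """spliced_fbank: (frames, 440) context-spliced filterbank features."""
        h1 = np.maximum(0.0, spliced_fbank @ W1 + b1)  # hidden layer (ReLU)
        return h1 @ W2 + b2                            # bottleneck activations as features
    ```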

  6. Isolated Speech Recognition Using Artificial Neural Networks

    Science.gov (United States)

    2007-11-02

    In this project Artificial Neural Networks are used as a research tool to accomplish automated speech recognition of normal speech. A small size...the first stage of this work are satisfactory and thus the application of artificial neural networks in conjunction with cepstral analysis in isolated word recognition holds promise.

  7. Amharic Speech Recognition for Speech Translation

    OpenAIRE

    Melese, Michael; Besacier, Laurent; Meshesha, Million

    2016-01-01

    International audience; State-of-the-art speech translation can be seen as a cascade of Automatic Speech Recognition, Statistical Machine Translation and Text-To-Speech synthesis. In this study an attempt is made to experiment on Amharic speech recognition for Amharic-English speech translation in the tourism domain. Since there is no Amharic speech corpus, we developed a read-speech corpus of 7.43 hours in the tourism domain. The Amharic speech corpus has been recorded after translating standard Bas...

  8. Advances in Speech Recognition

    CERN Document Server

    Neustein, Amy

    2010-01-01

    This volume is comprised of contributions from eminent leaders in the speech industry, and presents a comprehensive and in-depth analysis of the progress of speech technology in the topical areas of mobile settings, healthcare and call centers. The material addresses the technical aspects of voice technology within the framework of societal needs, such as the use of speech recognition software to produce up-to-date electronic health records, notwithstanding patients making changes to health plans and physicians. Included will be discussion of speech engineering, linguistics, human factors ana

  9. Arabic Speech Recognition System using CMU-Sphinx4

    CERN Document Server

    Satori, H; Chenfour, N

    2007-01-01

    In this paper we present the creation of an Arabic version of an Automated Speech Recognition (ASR) system. This system is based on the open-source Sphinx-4 from Carnegie Mellon University, a speech recognition system based on discrete hidden Markov models (HMMs). We investigate the changes that must be made to the model to adapt it to Arabic voice recognition. Keywords: Speech recognition, Acoustic model, Arabic language, HMMs, CMUSphinx-4, Artificial intelligence.

  10. Speech recognition in university classrooms

    OpenAIRE

    Wald, Mike; Bain, Keith; Basson, Sara H

    2002-01-01

    The LIBERATED LEARNING PROJECT (LLP) is an applied research project studying two core questions: 1) Can speech recognition (SR) technology successfully digitize lectures to display spoken words as text in university classrooms? 2) Can speech recognition technology be used successfully as an alternative to traditional classroom notetaking for persons with disabilities? This paper addresses these intriguing questions and explores the underlying complex relationship between speech recognition te...

  11. Speech Recognition on Mobile Devices

    DEFF Research Database (Denmark)

    Tan, Zheng-Hua; Lindberg, Børge

    2010-01-01

    The enthusiasm for deploying automatic speech recognition (ASR) on mobile devices is driven both by remarkable advances in ASR technology and by the demand for efficient user interfaces on such devices as mobile phones and personal digital assistants (PDAs). This chapter presents an overview of ASR in the mobile context covering motivations, challenges, fundamental techniques and applications. Three ASR architectures are introduced: embedded speech recognition, distributed speech recognition and network speech recognition. Their pros and cons and implementation issues are discussed. Applications within command and control, text entry and search are presented with an emphasis on mobile text entry.

  12. Introduction to Arabic Speech Recognition Using CMUSphinx System

    CERN Document Server

    Satori, H; Chenfour, N

    2007-01-01

    In this paper, Arabic is investigated from the speech recognition point of view. We propose a novel approach to build an Arabic Automated Speech Recognition (ASR) system. This system is based on the open-source CMU Sphinx-4 from Carnegie Mellon University. CMU Sphinx is a large-vocabulary, speaker-independent, continuous speech recognition system based on discrete Hidden Markov Models (HMMs). We build a model using utilities from the open-source CMU Sphinx. We will demonstrate the possible adaptability of this system to Arabic voice recognition.

  13. Emotion Recognition using Speech Features

    CERN Document Server

    Rao, K Sreenivasa

    2013-01-01

    “Emotion Recognition Using Speech Features” covers emotion-specific features present in speech and discussion of suitable models for capturing emotion-specific information for distinguishing different emotions.  The content of this book is important for designing and developing  natural and sophisticated speech systems. Drs. Rao and Koolagudi lead a discussion of how emotion-specific information is embedded in speech and how to acquire emotion-specific knowledge using appropriate statistical models. Additionally, the authors provide information about using evidence derived from various features and models. The acquired emotion-specific knowledge is useful for synthesizing emotions. Discussion includes global and local prosodic features at syllable, word and phrase levels, helpful for capturing emotion-discriminative information; use of complementary evidences obtained from excitation sources, vocal tract systems and prosodic features in order to enhance the emotion recognition performance;  and pro...

  14. Speech recognition from spectral dynamics

    Indian Academy of Sciences (India)

    Hynek Hermansky

    2011-10-01

    Information is carried in changes of a signal. The paper starts with revisiting Dudley's concept of the carrier nature of speech. It points to its close connection to modulation spectra of speech and argues against short-term spectral envelopes as dominant carriers of the linguistic information in speech. The history of spectral representations of speech is briefly discussed. Some of the history of gradual infusion of the modulation spectrum concept into automatic recognition of speech (ASR) comes next, pointing to the relationship of modulation spectrum processing to well-accepted ASR techniques such as dynamic speech features or RelAtive SpecTrAl (RASTA) filtering. Next, the frequency domain perceptual linear prediction technique for deriving autoregressive models of temporal trajectories of spectral power in individual frequency bands is reviewed. Finally, posterior-based features, which allow for straightforward application of modulation frequency domain information, are described. The paper is tutorial in nature, aims at a historical global overview of attempts for using spectral dynamics in machine recognition of speech, and does not always provide enough detail of the described techniques. However, extensive references to earlier work are provided to compensate for the lack of detail in the paper.
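
    Of the techniques named above, the "dynamic speech features" are the easiest to illustrate: the standard delta features are a windowed regression slope of each cepstral trajectory. A minimal sketch follows; the window size K is a common but arbitrary choice, not taken from the paper.

    ```python
    # Sketch of the standard delta ("dynamic") feature computation: a least-squares
    # slope of each cepstral trajectory over a +/-K frame window.
    import numpy as np

    def delta(features, K=2):
        """features: (frames, dims) array of e.g. cepstral coefficients."""
        T = features.shape[0]
        padded = np.pad(features, ((K, K), (0, 0)), mode="edge")
        denom = 2 * sum(k * k for k in range(1, K + 1))
        out = np.zeros_like(features, dtype=float)
        for k in range(1, K + 1):
            out += k * (padded[K + k : K + k + T] - padded[K - k : K - k + T])
        return out / denom
    ```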

  15. Discriminative learning for speech recognition

    CERN Document Server

    He, Xiaodong

    2008-01-01

    In this book, we introduce the background and mainstream methods of probabilistic modeling and discriminative parameter optimization for speech recognition. The specific models treated in depth include the widely used exponential-family distributions and the hidden Markov model. A detailed study is presented on unifying the common objective functions for discriminative learning in speech recognition, namely maximum mutual information (MMI), minimum classification error, and minimum phone/word error. The unification is presented, with rigorous mathematical analysis, in a common rational-functio

  16. Pattern recognition in speech and language processing

    CERN Document Server

    Chou, Wu

    2003-01-01

    Contents include: Minimum Classification Error (MCE) Approach in Pattern Recognition, Wu Chou; Minimum Bayes-Risk Methods in Automatic Speech Recognition, Vaibhava Goel and William Byrne; A Decision Theoretic Formulation for Adaptive and Robust Automatic Speech Recognition, Qiang Huo; Speech Pattern Recognition Using Neural Networks, Shigeru Katagiri; Large Vocabulary Speech Recognition Based on Statistical Methods, Jean-Luc Gauvain; Toward Spontaneous Speech Recognition and Understanding, Sadaoki Furui; Speaker Authentication, Qi Li and Biing-Hwang Juang; HMMs for Language Processing Problems, Ri

  17. Automated Speech Rate Measurement in Dysarthria

    Science.gov (United States)

    Martens, Heidi; Dekens, Tomas; Van Nuffelen, Gwen; Latacz, Lukas; Verhelst, Werner; De Bodt, Marc

    2015-01-01

    Purpose: In this study, a new algorithm for automated determination of speech rate (SR) in dysarthric speech is evaluated. We investigated how reliably the algorithm calculates the SR of dysarthric speech samples when compared with calculation performed by speech-language pathologists. Method: The new algorithm was trained and tested using Dutch…

  18. Automatic speech recognition: An evaluation of Google Speech

    OpenAIRE

    Stenman, Magnus

    2015-01-01

    The use of speech recognition is increasing rapidly and is now available in smart TVs, desktop computers, every new smart phone, etc. allowing us to talk to computers naturally. With the use in home appliances, education and even in surgical procedures accuracy and speed becomes very important. This thesis aims to give an introduction to speech recognition and discuss its use in robotics. An evaluation of Google Speech, using Google’s speech API, in regards to word error rate and translation ...

  19. Novel Techniques for Dialectal Arabic Speech Recognition

    CERN Document Server

    Elmahdy, Mohamed; Minker, Wolfgang

    2012-01-01

    Novel Techniques for Dialectal Arabic Speech describes approaches to improve automatic speech recognition for dialectal Arabic. Since speech resources for dialectal Arabic speech recognition are very sparse, the authors describe how existing Modern Standard Arabic (MSA) speech data can be applied to dialectal Arabic speech recognition, while assuming that MSA is always a second language for all Arabic speakers. In this book, Egyptian Colloquial Arabic (ECA) has been chosen as a typical Arabic dialect. ECA is the first ranked Arabic dialect in terms of number of speakers, and a high quality ECA speech corpus with accurate phonetic transcription has been collected. MSA acoustic models were trained using news broadcast speech. In order to cross-lingually use MSA in dialectal Arabic speech recognition, the authors have normalized the phoneme sets for MSA and ECA. After this normalization, they have applied state-of-the-art acoustic model adaptation techniques like Maximum Likelihood Linear Regression (MLLR) and M...

  20. Recent Advances in Robust Speech Recognition Technology

    CERN Document Server

    Ramírez, Javier

    2011-01-01

    This E-book is a collection of articles that describe advances in speech recognition technology. Robustness in speech recognition refers to the need to maintain high speech recognition accuracy even when the quality of the input speech is degraded, or when the acoustical, articulatory, or phonetic characteristics of speech in the training and testing environments differ. Obstacles to robust recognition include acoustical degradations produced by additive noise, the effects of linear filtering, nonlinearities in transduction or transmission, as well as impulsive interfering sources, and diminishe

  1. Speech recognition with amplitude and frequency modulations

    Science.gov (United States)

    Zeng, Fan-Gang; Nie, Kaibao; Stickney, Ginger S.; Kong, Ying-Yee; Vongphoe, Michael; Bhargave, Ashish; Wei, Chaogang; Cao, Keli

    2005-02-01

    Amplitude modulation (AM) and frequency modulation (FM) are commonly used in communication, but their relative contributions to speech recognition have not been fully explored. To bridge this gap, we derived slowly varying AM and FM from speech sounds and conducted listening tests using stimuli with different modulations in normal-hearing and cochlear-implant subjects. We found that although AM from a limited number of spectral bands may be sufficient for speech recognition in quiet, FM significantly enhances speech recognition in noise, as well as speaker and tone recognition. Additional speech reception threshold measures revealed that FM is particularly critical for speech recognition with a competing voice and is independent of spectral resolution and similarity. These results suggest that AM and FM provide independent yet complementary contributions to support robust speech recognition under realistic listening situations. Encoding FM may improve auditory scene analysis, cochlear-implant, and audio-coding performance. Keywords: auditory analysis | cochlear implant | neural code | phase | scene analysis
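
    A rough sketch of how slowly varying AM and FM can be derived from one spectral band (band-pass filtering, Hilbert envelope for AM, instantaneous-frequency deviation for FM); the filter order and cut-offs are illustrative, not the stimulus-generation code used in the study.

    ```python
    # Sketch: per-band AM (envelope) and FM (instantaneous-frequency deviation).
    import numpy as np
    from scipy.signal import butter, sosfiltfilt, hilbert

    def band_am_fm(x, fs, lo, hi):
        sos = butter(4, [lo, hi], btype="bandpass", fs=fs, output="sos")
        band = sosfiltfilt(sos, x)
        analytic = hilbert(band)
        am = np.abs(analytic)                          # amplitude envelope
        phase = np.unwrap(np.angle(analytic))
        inst_freq = np.diff(phase) * fs / (2 * np.pi)  # instantaneous frequency (len N-1)
        fm = inst_freq - (lo + hi) / 2.0               # deviation from the band centre
        return am, fm
    ```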

  2. On speech recognition during anaesthesia

    DEFF Research Database (Denmark)

    Alapetite, Alexandre

    2007-01-01

    This PhD thesis in human-computer interfaces (informatics) studies the case of the anaesthesia record used during medical operations and the possibility to supplement it with speech recognition facilities. Problems and limitations have been identified with the traditional paper-based anaesthesia record, but also with newer electronic versions typically based on touch-screen and keyboard, in particular ergonomic issues and the fact that anaesthesiologists tend to postpone the registration of the medications and other events during busy periods of anaesthesia, which in turn may lead to gaps

  3. An effective cluster-based model for robust speech detection and speech recognition in noisy environments.

    Science.gov (United States)

    Górriz, J M; Ramírez, J; Segura, J C; Puntonet, C G

    2006-07-01

    This paper shows an accurate speech detection algorithm for improving the performance of speech recognition systems working in noisy environments. The proposed method is based on a hard-decision clustering approach where a set of prototypes is used to characterize the noisy channel. Detecting the presence of speech is enabled by a decision rule formulated in terms of an averaged distance between the observation vector and a cluster-based noise model. The algorithm benefits from using contextual information, a strategy that considers not only a single speech frame but also a neighborhood of data in order to smooth the decision function and improve speech detection robustness. The proposed scheme exhibits reduced computational cost, making it adequate for real-time applications such as automated speech recognition systems. An exhaustive analysis is conducted on the AURORA 2 and AURORA 3 databases in order to assess the performance of the algorithm and to compare it to existing standard voice activity detection (VAD) methods. The results show significant improvements in detection accuracy and speech recognition rate over standard VADs such as ITU-T G.729, ETSI GSM AMR, and ETSI AFE for distributed speech recognition, and over a representative set of recently reported VAD algorithms.
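
    A minimal sketch of the decision rule described (noise prototypes from clustering, an averaged distance over a context window, a threshold), assuming scikit-learn's KMeans; the prototype count, window length, and threshold are illustrative.

    ```python
    # Sketch of a cluster-based VAD: learn K noise prototypes from noise-only frames,
    # then label a frame as speech when the smoothed distance of its neighbourhood to
    # the nearest prototype exceeds a threshold.
    import numpy as np
    from sklearn.cluster import KMeans

    def train_noise_model(noise_features, n_prototypes=4):
        return KMeans(n_clusters=n_prototypes, n_init=10, random_state=0).fit(noise_features)

    def vad(features, noise_model, context=4, threshold=2.0):
        """features: (frames, dims). Returns a boolean speech/non-speech decision per frame."""
        # distance of every frame to its nearest noise prototype
        d = np.min(np.linalg.norm(
            features[:, None, :] - noise_model.cluster_centers_[None, :, :], axis=-1), axis=1)
        # smooth the decision function over a neighbourhood of frames
        kernel = np.ones(2 * context + 1) / (2 * context + 1)
        return np.convolve(d, kernel, mode="same") > threshold
    ```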

  4. Speech Emotion Recognition Using Fuzzy Logic Classifier

    Directory of Open Access Journals (Sweden)

    Daniar Aghsanavard

    2016-01-01

    Full Text Available Over the last two decades, detecting emotions from speech has been one of the most significant issues in speech recognition and signal processing, and each detection technique has its own advantages and disadvantages. This paper proposes fuzzy speech emotion recognition based on the classification of speech signals, aiming at better recognition together with higher speed. The system uses a five-layer fuzzy logic system, a combination of a progressive neural network with firefly algorithm optimization: speech samples are first given to the input of the fuzzy system, and the signals are then examined and given a preliminary classification within a fuzzy framework. In this model, a pattern of signals is created for each class of signals, which reduces the dimensionality of the signal data and makes speech recognition easier. The experimental results show that the proposed method (with firefly-based categorization) improves the recognition of utterances.

  5. Phoneme vs Grapheme Based Automatic Speech Recognition

    OpenAIRE

    Magimai.-Doss, Mathew; Dines, John; Bourlard, Hervé; Hermansky, Hynek

    2004-01-01

    In recent literature, different approaches have been proposed to use graphemes as subword units with implicit source of phoneme information for automatic speech recognition. The major advantage of using graphemes as subword units is that the definition of lexicon is easy. In previous studies, results comparable to phoneme-based automatic speech recognition systems have been reported using context-independent graphemes or context-dependent graphemes with decision trees. In this paper, we study...

  6. Connected digit speech recognition system for Malayalam language

    Indian Academy of Sciences (India)

    Cini Kurian; Kannan Balakrishnan

    2013-12-01

    Connected digit speech recognition is important in many applications such as automated banking systems, catalogue dialing, and automatic data entry. This paper presents an optimum speaker-independent connected digit recognizer for the Malayalam language. The system employs Perceptual Linear Predictive (PLP) cepstral coefficients for speech parameterization and a continuous density Hidden Markov Model (HMM) in the recognition process. The Viterbi algorithm is used for decoding. The training database has utterances from 21 speakers in the age group of 20 to 40 years, recorded in a normal office environment, where each speaker was asked to read 20 sets of continuous digits. The system obtained an accuracy of 99.5% with unseen data.
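
    The Viterbi decoding step mentioned in the abstract can be sketched in a few lines of numpy; the initial, transition, and emission log-probabilities would come from the trained HMMs and are placeholders here.

    ```python
    # Sketch of Viterbi decoding: given per-frame log-likelihoods of each HMM state,
    # recover the most likely state path and its score.
    import numpy as np

    def viterbi(log_pi, log_A, log_B):
        """log_pi: (S,) initial, log_A: (S,S) transition, log_B: (T,S) emission log-probs."""
        T, S = log_B.shape
        delta = np.full((T, S), -np.inf)
        psi = np.zeros((T, S), dtype=int)
        delta[0] = log_pi + log_B[0]
        for t in range(1, T):
            scores = delta[t - 1][:, None] + log_A       # (from_state, to_state)
            psi[t] = np.argmax(scores, axis=0)
            delta[t] = scores[psi[t], np.arange(S)] + log_B[t]
        path = [int(np.argmax(delta[-1]))]
        for t in range(T - 1, 0, -1):                    # backtrace
            path.append(int(psi[t][path[-1]]))
        return path[::-1], float(np.max(delta[-1]))
    ```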

  7. PCA-Based Speech Enhancement for Distorted Speech Recognition

    Directory of Open Access Journals (Sweden)

    Tetsuya Takiguchi

    2007-09-01

    Full Text Available We investigated a robust speech feature extraction method using kernel PCA (Principal Component Analysis) for distorted speech recognition. Kernel PCA has been suggested for various image processing tasks requiring an image model, such as denoising, where a noise-free image is constructed from a noisy input image. Much research for robust speech feature extraction has been done, but it remains difficult to completely remove additive or convolution noise (distortion). The most commonly used noise-removal techniques are based on the spectral-domain operation, and then for speech recognition, the MFCC (Mel Frequency Cepstral Coefficient) is computed, where DCT (Discrete Cosine Transform) is applied to the mel-scale filter bank output. This paper describes a new PCA-based speech enhancement algorithm using kernel PCA instead of DCT, where the main speech element is projected onto low-order features, while the noise or distortion element is projected onto high-order features. Its effectiveness is confirmed by word recognition experiments on distorted speech.
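
    A minimal sketch of the kernel-PCA substitution described (projecting frames onto low-order kernel principal components, assumed to carry the speech element, and reconstructing from them), using scikit-learn's KernelPCA; the kernel, gamma, and component count are illustrative.

    ```python
    # Sketch: keep only the low-order kernel principal components of distorted
    # speech frames and reconstruct, discarding high-order (distortion) components.
    from sklearn.decomposition import KernelPCA

    def kpca_enhance(frames, n_components=12):
        """frames: (n_frames, dims), e.g. mel filter-bank outputs of distorted speech."""
        kpca = KernelPCA(n_components=n_components, kernel="rbf",
                         fit_inverse_transform=True, gamma=1e-3)
        low_order = kpca.fit_transform(frames)    # low-order features (speech-dominated)
        return low_order, kpca.inverse_transform(low_order)  # features and reconstruction
    ```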

  8. Auditory-Spectrum Quantization Based Speech Recognition

    Institute of Scientific and Technical Information of China (English)

    Wu Yuanqing; Hao Jie; et al.

    1997-01-01

    Based on an analysis of the physiological and psychological characteristics of the human auditory system [1], we can classify the human auditory process into two hearing modes: an active one and a passive one. A novel approach to robust speech recognition, Auditory-spectrum Quantization Based Speech Recognition (AQBSR), is proposed. In this method, we intend to simulate the human active hearing mode and locate the effective areas of speech signals in the temporal domain and in the frequency domain. Adaptive filter banks are used in place of fixed-band filters to extract feature parameters. The effective speech components and their corresponding frequency areas of each word in the vocabulary can be found during training. In the recognition stage, comparison between the unknown sound and the current template is maintained only in the effective areas of the template word. The control experiments show that the AQBSR method is more robust than traditional systems.

  9. Speech Clarity Index (Ψ): A Distance-Based Speech Quality Indicator and Recognition Rate Prediction for Dysarthric Speakers with Cerebral Palsy

    Science.gov (United States)

    Kayasith, Prakasith; Theeramunkong, Thanaruk

    It is a tedious and subjective task to measure the severity of dysarthria by manually evaluating a speaker's speech using the available standard assessment methods based on human perception. This paper presents an automated approach to assess the speech quality of a dysarthric speaker with cerebral palsy. With the consideration of two complementary factors, speech consistency and speech distinction, a speech quality indicator called the speech clarity index (Ψ) is proposed as a measure of the speaker's ability to produce a consistent speech signal for a certain word and distinguishable speech signals for different words. As an application, it can be used to assess speech quality and forecast the speech recognition rate of speech made by an individual dysarthric speaker before actual exhaustive implementation of an automatic speech recognition system for that speaker. The effectiveness of Ψ as a speech recognition rate predictor is evaluated by rank-order inconsistency, correlation coefficient, and root-mean-square of difference. The evaluations were done by comparing its predicted recognition rates with those predicted by the standard methods, called the articulatory and intelligibility tests, based on two recognition systems (HMM and ANN). The results show that Ψ is a promising indicator for predicting the recognition rate of dysarthric speech. All experiments were done on a speech corpus composed of speech data from eight normal speakers and eight dysarthric speakers.
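
    The abstract does not give the formula for Ψ, so the sketch below only conveys the consistency-versus-distinction idea: within-word scatter of repeated utterances against between-word separation of their centroids. It is an assumed stand-in, not the published definition.

    ```python
    # Illustrative sketch only: an assumed clarity-style ratio of between-word
    # distinction to within-word inconsistency, computed on fixed-length feature vectors.
    import numpy as np

    def clarity_index(word_to_repetitions):
        """word_to_repetitions: dict word -> list of fixed-length feature vectors (>= 2 words)."""
        centroids = {w: np.mean(reps, axis=0) for w, reps in word_to_repetitions.items()}
        within = np.mean([np.linalg.norm(r - centroids[w])
                          for w, reps in word_to_repetitions.items() for r in reps])
        cents = list(centroids.values())
        between = np.mean([np.linalg.norm(a - b)
                           for i, a in enumerate(cents) for b in cents[i + 1:]])
        return between / (within + 1e-9)
    ```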

  10. Hidden neural networks: application to speech recognition

    DEFF Research Database (Denmark)

    Riis, Søren Kamaric

    1998-01-01

    We evaluate the hidden neural network HMM/NN hybrid on two speech recognition benchmark tasks; (1) task-independent isolated word recognition on the Phonebook database, and (2) recognition of broad phoneme classes in continuous speech from the TIMIT database. It is shown how hidden neural networks (HNNs) with much fewer parameters than conventional HMMs and other hybrids can obtain comparable performance, and for the broad class task it is illustrated how the HNN can be applied as a purely transition-based system, where acoustic context-dependent transition probabilities are estimated by neural...

  11. Emotion recognition from speech: tools and challenges

    Science.gov (United States)

    Al-Talabani, Abdulbasit; Sellahewa, Harin; Jassim, Sabah A.

    2015-05-01

    Human emotion recognition from speech is studied frequently for its importance in many applications, e.g. human-computer interaction. There is wide diversity and no agreement about the basic emotions or emotion-related states on the one hand, and about where the emotion-related information lies in the speech signal on the other. These diversities motivate our investigations into extracting meta-features using the PCA approach, or using a non-adaptive random projection (RP), which significantly reduce the high-dimensional speech feature vectors that may contain a wide range of emotion-related information. Subsets of meta-features are fused to increase the performance of the recognition model that adopts the score-based LDC classifier. We shall demonstrate that our scheme outperforms state-of-the-art results when tested on non-prompted databases or acted databases (i.e. when subjects act specific emotions while uttering a sentence). However, the huge gap between accuracy rates achieved on the different types of speech datasets raises questions about the way emotions modulate speech. In particular we shall argue that emotion recognition from speech should not be dealt with as a classification problem. We shall demonstrate the presence of a spectrum of different emotions in the same speech portion, especially in the non-prompted datasets, which tend to be more "natural" than the acted datasets where the subjects attempt to suppress all but one emotion.
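
    A minimal sketch of the meta-feature pipeline described (a non-adaptive random projection followed by a linear discriminant classifier), using scikit-learn; the feature dimensionality, projection size, and data are placeholders.

    ```python
    # Sketch: reduce a large emotion feature vector with a random projection and
    # classify with a linear discriminant classifier. Data here is synthetic.
    import numpy as np
    from sklearn.random_projection import GaussianRandomProjection
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
    from sklearn.pipeline import make_pipeline

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 1582))   # placeholder for a large per-utterance feature vector
    y = rng.integers(0, 4, size=200)   # placeholder emotion labels

    model = make_pipeline(GaussianRandomProjection(n_components=60, random_state=0),
                          LinearDiscriminantAnalysis())
    model.fit(X, y)
    print(model.score(X, y))
    ```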

  12. Issues in acoustic modeling of speech for automatic speech recognition

    OpenAIRE

    Gong, Yifan; Haton, Jean-Paul; Mari, Jean-François

    1994-01-01

    RFIA project; Stochastic modeling is a flexible method for handling the large variability in speech for recognition applications. In contrast to dynamic time warping, where heuristic training methods for estimating word templates are used, stochastic modeling allows a probabilistic and automatic training for estimating models. This paper deals with the improvement of stochastic techniques, especially for a better representation of time-varying phenomena.

  13. Hidden Markov models in automatic speech recognition

    Science.gov (United States)

    Wrzoskowicz, Adam

    1993-11-01

    This article describes a method for constructing an automatic speech recognition system based on hidden Markov models (HMMs). The author discusses the basic concepts of HMM theory and the application of these models to the analysis and recognition of speech signals. The author provides algorithms which make it possible to train the ASR system and recognize signals on the basis of distinct stochastic models of selected speech sound classes. The author describes the specific components of the system and the procedures used to model and recognize speech. The author discusses problems associated with the choice of optimal signal detection and parameterization characteristics and their effect on the performance of the system. The author presents different options for the choice of speech signal segments and their consequences for the ASR process. The author gives special attention to the use of lexical, syntactic, and semantic information for the purpose of improving the quality and efficiency of the system. The author also describes an ASR system developed by the Speech Acoustics Laboratory of the IBPT PAS. The author discusses the results of experiments on the effect of noise on the performance of the ASR system and describes methods of constructing HMM's designed to operate in a noisy environment. The author also describes a language for human-robot communications which was defined as a complex multilevel network from an HMM model of speech sounds geared towards Polish inflections. The author also added mandatory lexical and syntactic rules to the system for its communications vocabulary.

  14. Novel acoustic features for speech emotion recognition

    Institute of Scientific and Technical Information of China (English)

    ROH Yong-Wan; KIM Dong-Ju; LEE Woo-Seok; HONG Kwang-Seok

    2009-01-01

    This paper focuses on acoustic features that effectively improve the recognition of emotion in human speech. The novel features in this paper are based on spectral-based entropy parameters such as fast Fourier transform (FFT) spectral entropy, delta FFT spectral entropy, Mel-frequency filter bank (MFB) spectral entropy, and delta MFB spectral entropy. Spectral-based entropy features are simple. They reflect the frequency characteristic and the changing characteristic in frequency of speech. We implement an emotion rejection module using the probability distribution of recognized-scores and rejected-scores. This reduces the false recognition rate to improve overall performance. Recognized-scores and rejected-scores refer to probabilities of recognized and rejected emotion recognition results, respectively. These scores are first obtained from a pattern recognition procedure. The pattern recognition phase uses the Gaussian mixture model (GMM). We classify the four emotional states as anger, sadness, happiness and neutrality. The proposed method is evaluated using 45 sentences in each emotion for 30 subjects, 15 males and 15 females. Experimental results show that the proposed method is superior to the existing emotion recognition methods based on GMM using energy, Zero Crossing Rate (ZCR), linear prediction coefficient (LPC), and pitch parameters. We demonstrate the effectiveness of the proposed approach. One of the proposed features, the combined MFB and delta MFB spectral entropy, improves performance by approximately 10% compared to the existing feature parameters for speech emotion recognition methods. We demonstrate a 4% performance improvement in the applied emotion rejection with low confidence score.
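
    The FFT spectral-entropy family of features is simple to illustrate: normalise each frame's power spectrum to a probability distribution and take its entropy, with the delta version as the frame-to-frame difference. The frame length and hop below are conventional choices, not necessarily the authors'.

    ```python
    # Sketch of FFT spectral entropy and delta spectral entropy per frame.
    import numpy as np

    def fft_spectral_entropy(signal, fs, frame_len=0.025, hop=0.010):
        n, step = int(frame_len * fs), int(hop * fs)
        window = np.hamming(n)
        ent = []
        for start in range(0, len(signal) - n, step):
            spec = np.abs(np.fft.rfft(window * signal[start:start + n])) ** 2
            p = spec / (np.sum(spec) + 1e-12)              # spectrum as a distribution
            ent.append(-np.sum(p * np.log2(p + 1e-12)))    # spectral entropy
        ent = np.asarray(ent)
        delta_ent = np.diff(ent, prepend=ent[:1])          # delta spectral entropy
        return ent, delta_ent
    ```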

  15. The Phase Spectra Based Feature for Robust Speech Recognition

    Directory of Open Access Journals (Sweden)

    Abbasian Ali

    2009-07-01

    Full Text Available Speech recognition in adverse environments is one of the major issues in automatic speech recognition today. Most current speech recognition systems are highly efficient in ideal environments, but their performance degrades severely when they are applied in real environments because of noise-affected speech. In this paper a new feature representation based on phase spectra and Perceptual Linear Prediction (PLP) is suggested, which can be used for robust speech recognition. It is shown that these new features can improve the performance of speech recognition not only in clean conditions but also at various levels of noise when compared to PLP features.

  16. Phonological modeling for continuous speech recognition in Korean

    CERN Document Server

    Lee, W I; Lee, J H; Lee, WonIl; Lee, Geunbae; Lee, Jong-Hyeok

    1996-01-01

    A new scheme to represent phonological changes during continuous speech recognition is suggested. A phonological tag coupled with its morphological tag is designed to represent the conditions of Korean phonological changes. A pairwise language model of these morphological and phonological tags is implemented in a Korean speech recognition system. Performance of the model is verified through TDNN-based speech recognition experiments.

  17. Effects of Cognitive Load on Speech Recognition

    Science.gov (United States)

    Mattys, Sven L.; Wiget, Lukas

    2011-01-01

    The effect of cognitive load (CL) on speech recognition has received little attention despite the prevalence of CL in everyday life, e.g., dual-tasking. To assess the effect of CL on the interaction between lexically-mediated and acoustically-mediated processes, we measured the magnitude of the "Ganong effect" (i.e., lexical bias on phoneme…

  18. Speech recognition employing biologically plausible receptive fields

    DEFF Research Database (Denmark)

    Fereczkowski, Michal; Bothe, Hans-Heinrich

    2011-01-01

    The main idea of the project is to build a widely speaker-independent, biologically motivated automatic speech recognition (ASR) system. The two main differences between our approach and current state-of-the-art ASRs are that i) the features used here are based on the responses of neuronlike spec...

  19. Bimodal Emotion Recognition from Speech and Text

    Directory of Open Access Journals (Sweden)

    Weilin Ye

    2014-01-01

    Full Text Available This paper presents an approach to emotion recognition from speech signals and textual content. In the analysis of speech signals, thirty-seven acoustic features are extracted from the speech input. Two different classifiers, Support Vector Machines (SVMs) and a BP neural network, are adopted to classify the emotional states. In text analysis, we use a two-step classification method to recognize the emotional states. The final emotional state is determined based on the emotion outputs from the acoustic and textual analyses. In this paper we have two parallel classifiers for acoustic information and two serial classifiers for textual information, and a final decision is made by combining these classifiers in decision-level fusion. Experimental results show that the emotion recognition accuracy of the integrated system is better than that of either of the two individual approaches.
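
    The decision-level fusion step can be sketched as a weighted combination of class posteriors from the acoustic classifiers and the text classifier; the weights and posterior vectors below are placeholders.

    ```python
    # Sketch of decision-level fusion over emotion-class posteriors.
    import numpy as np

    EMOTIONS = ["anger", "happiness", "sadness", "neutral"]

    def fuse(p_svm, p_bp, p_text, w=(0.35, 0.35, 0.30)):
        """Each p_* is a probability vector over EMOTIONS; w are fusion weights."""
        combined = w[0] * np.asarray(p_svm) + w[1] * np.asarray(p_bp) + w[2] * np.asarray(p_text)
        return EMOTIONS[int(np.argmax(combined))]

    print(fuse([0.5, 0.2, 0.2, 0.1], [0.4, 0.3, 0.2, 0.1], [0.1, 0.6, 0.2, 0.1]))
    ```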

  20. Robust Speech Recognition Using a Harmonic Model

    Institute of Scientific and Technical Information of China (English)

    许超; 曹志刚

    2004-01-01

    Automatic speech recognition under conditions of a noisy environment remains a challenging problem. Traditionally, methods focused on noise structure, such as spectral subtraction, have been employed to address this problem, and thus the performance of such methods depends on the accuracy of noise estimation. In this paper, an alternative method, using a harmonic-based spectral reconstruction algorithm, is proposed for the enhancement of robust automatic speech recognition. Neither noise estimation nor noise-model training is required in the proposed approach. A spectral-subtraction-integrated autocorrelation function is proposed to determine the pitch for the harmonic model. Recognition results show that the harmonic-based spectral reconstruction approach outperforms spectral subtraction in the middle and low signal-to-noise ratio (SNR) ranges. The advantage of the proposed method is more manifest for non-stationary noise, as the algorithm does not require an assumption of stationary noise.

  1. Phonetic Alphabet for Speech Recognition of Czech

    Directory of Open Access Journals (Sweden)

    J. Uhlir

    1997-12-01

    Full Text Available In this paper we introduce and discuss an alphabet that has been proposed for phonemically oriented automatic speech recognition. The alphabet, denoted as PAC (Phonetic Alphabet for Czech), consists of 48 basic symbols that allow for distinguishing all major events occurring in spoken Czech. The symbols can be used both for phonetic transcription of Czech texts as well as for labeling recorded speech signals. For practical reasons, the alphabet exists in two versions; one utilizes Czech native characters and the other employs symbols similar to those used for English in the DARPA and NIST alphabets.

  2. Novel acoustic features for speech emotion recognition

    Institute of Scientific and Technical Information of China (English)

    ROH Yong-Wan; KIM Dong-Ju; LEE Woo-Seok; HONG Kwang-Seok

    2009-01-01

    This paper focuses on acoustic features that effectively improve the recognition of emotion in human speech. The novel features in this paper are based on spectral-based entropy parameters such as fast Fourier transform (FFT) spectral entropy, delta FFT spectral entropy, Mel-frequency filter bank (MFB) spectral entropy, and delta MFB spectral entropy. Spectral-based entropy features are simple. They reflect the frequency characteristic and the changing characteristic in frequency of speech. We implement an emotion rejection module using the probability distribution of recognized-scores and rejected-scores. This reduces the false recognition rate to improve overall performance. Recognized-scores and rejected-scores refer to probabilities of recognized and rejected emotion recognition results, respectively. These scores are first obtained from a pattern recognition procedure. The pattern recognition phase uses the Gaussian mixture model (GMM). We classify the four emotional states as anger, sadness, happiness and neutrality. The proposed method is evaluated using 45 sentences in each emotion for 30 subjects, 15 males and 15 females. Experimental results show that the proposed method is superior to the existing emotion recognition methods based on GMM using energy, Zero Crossing Rate (ZCR), linear prediction coefficient (LPC), and pitch parameters. We demonstrate the effectiveness of the proposed approach. One of the proposed features, the combined MFB and delta MFB spectral entropy, improves performance by approximately 10% compared to the existing feature parameters for speech emotion recognition methods. We demonstrate a 4% performance improvement in the applied emotion rejection with low confidence score.

  3. Speech Recognition in 7 Languages

    Science.gov (United States)

    2000-08-01

    Three approaches are compared, namely portation, cross-lingual, and simultaneous multilingual recognition, in which two languages are recognized at the same time. Experiments and results are presented for the different approaches to multilingual recognition with different baseline systems, although the best monolingual and cross-lingual recognizers could not always be tested. [Cited: F. Weng, H. Bratt, L. Neumeyer, and A. Stolcke. A Study of Multilingual ...]

  4. Indonesian Automatic Speech Recognition For Command Speech Controller Multimedia Player

    Directory of Open Access Journals (Sweden)

    Vivien Arief Wardhany

    2014-12-01

    Full Text Available The purpose of this multimedia device development is control through voice. Nowadays the voice commands that can be recognized are only in English. To overcome this issue, recognition is performed using an Indonesian language model, acoustic model, and dictionary. The automatic speech recognizer is built using the CMU Sphinx engine with the English language database modified for Indonesian, and XBMC is used as the multimedia player. The experiment uses 10 volunteers testing items based on 7 commands. The volunteers are split by gender, 5 male and 5 female. Ten samples are taken for each command, and each volunteer performs 10 test utterances per command, covering all 7 commands provided. Based on the classification table, the word "Kanan" was recognized correctly most often (83%), while "Pilih" was the lowest. The word most often misclassified was "Kembali" (67%), while "Kanan" was misclassified least. The recognition-rate (RR) results for male speakers show that several commands, such as "Kembali", "Utama", "Atas" and "Bawah", have low recognition rates. In particular, "Kembali" could not be recognized at all in the female voices, and in the male voices it reached only 4% RR, because the command has no similar-sounding English word close to "Kembali", so the system failed to recognize it. The command "Pilih" reached 80% RR with female voices but only 4% with male voices. These problems are mostly due to the different voice characteristics of adult males and females: males have lower voice frequencies (85 to 180 Hz) than women (165 to 255 Hz). The results of the experiment showed that each speaker had a different recognition rate caused by differences in tone, pronunciation, and speed of speech. Further work is needed to improve the accuracy of the Indonesian Automatic Speech Recognition system.

  5. A Dialectal Chinese Speech Recognition Framework

    Institute of Scientific and Technical Information of China (English)

    Jing Li; Thomas Fang Zheng; William Byrne; Dan Jurafsky

    2006-01-01

    A framework for dialectal Chinese speech recognition is proposed and studied, in which a relatively small dialectal Chinese (or in other words Chinese influenced by the native dialect) speech corpus and dialect-related knowledge are adopted to transform a standard Chinese (or Putonghua, abbreviated as PTH) speech recognizer into a dialectal Chinese speech recognizer. Two kinds of knowledge sources are explored: one is expert knowledge and the other is a small dialectal Chinese corpus. These knowledge sources provide information at four levels: phonetic level, lexicon level, language level, and acoustic decoder level. This paper takes Wu dialectal Chinese (WDC) as an example target language. The goal is to establish a WDC speech recognizer from an existing PTH speech recognizer based on the Initial-Final structure of the Chinese language and a study of how dialectal Chinese speakers speak Putonghua. The authors propose to use context-independent PTH-IF mappings (where IF means either a Chinese Initial or a Chinese Final), context-independent WDC-IF mappings, and syllable-dependent WDC-IF mappings (obtained from either experts or data), and combine them with the supervised maximum likelihood linear regression (MLLR) acoustic model adaptation method. To reduce the size of the multipronunciation lexicon introduced by the IF mappings, which might also enlarge the lexicon confusion and hence lead to the performance degradation, a Multi-Pronunciation Expansion (MPE) method based on the accumulated uni-gram probability (AUP) is proposed. In addition, some commonly used WDC words are selected and added to the lexicon. Compared with the original PTH speech recognizer, the resulting WDC speech recognizer achieves 10-18% absolute Character Error Rate (CER) reduction when recognizing WDC, with only a 0.62% CER increase when recognizing PTH. The proposed framework and methods are expected to work not only for Wu dialectal Chinese but also for other dialectal Chinese languages and

  6. Post-editing through Speech Recognition

    DEFF Research Database (Denmark)

    Mesa-Lao, Bartolomé

    In the past couple of years automatic speech recognition (ASR) software has quietly created a niche for itself in many situations of our lives. Nowadays it can be found at the other end of customer-support hotlines, it is built into operating systems and it is offered as an alternative text-input method for smartphones. On another front, given the significant improvements in Machine Translation (MT) quality and the increasing demand for translations, post-editing of MT is becoming a popular practice in the translation industry, since it has been shown to allow for larger volumes of translations to be produced saving time and costs. The translation industry is at a deeply transformative point in its evolution and the coming years herald an era of convergence where speech technology could make a difference. As post-editing services are becoming a common practice among language service providers and speech

  7. Speech Recognition Technology for Hearing Disabled Community

    Directory of Open Access Journals (Sweden)

    Tanvi Dua

    2014-09-01

    Full Text Available As the number of people with hearing disabilities is increasing significantly in the world, technology is always required to fill the communication gap between the Deaf and hearing communities. To fill this gap and to allow people with hearing disabilities to communicate, this paper suggests a framework that contributes to the efficient integration of people with hearing disabilities. This paper presents a robust speech recognition system, which converts continuous speech into text and images. The results are obtained with an accuracy of 95% on a small vocabulary of 20 greeting sentences of continuous speech, tested in a speaker-independent mode. In this testing phase all these continuous sentences were given as live input to the proposed system.

  8. Speech emotion recognition with unsupervised feature learning

    Institute of Scientific and Technical Information of China (English)

    Zheng-wei HUANG; Wen-tao XUE; Qi-rong MAO

    2015-01-01

    Emotion-based features are critical for achieving high performance in a speech emotion recognition (SER) system. In general, it is difficult to develop these features due to the ambiguity of the ground-truth. In this paper, we apply several unsupervised feature learning algorithms (including K-means clustering, the sparse auto-encoder, and sparse restricted Boltzmann machines), which have promise for learning task-related features by using unlabeled data, to speech emotion recognition. We then evaluate the performance of the proposed approach and present a detailed analysis of the effect of two important factors in the model setup, the content window size and the number of hidden layer nodes. Experimental results show that larger content windows and more hidden nodes contribute to higher performance. We also show that the two-layer network cannot explicitly improve performance compared to a single-layer network.
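
    A minimal sketch of the K-means variant of unsupervised feature learning (learn a codebook from unlabeled frames, then encode each window by soft distances to the centroids), assuming scikit-learn; the codebook size and the "triangle" encoding are illustrative choices, not necessarily those of the paper.

    ```python
    # Sketch: K-means codebook learned on unlabeled content windows, then a soft
    # distance-based encoding used as learned features for emotion recognition.
    import numpy as np
    from sklearn.cluster import KMeans

    def learn_codebook(unlabeled_windows, n_codes=256):
        return KMeans(n_clusters=n_codes, n_init=4, random_state=0).fit(unlabeled_windows)

    def encode(windows, codebook):
        d = codebook.transform(windows)                             # distances to every centroid
        return np.maximum(0.0, d.mean(axis=1, keepdims=True) - d)   # "triangle" soft encoding
    ```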

  9. Multilingual Vocabularies in Automatic Speech Recognition

    Science.gov (United States)

    2000-08-01

    The report starts from monolingual vocabularies, whose size (a few thousand words) is an obstacle to a full generalization of the inventories, and then moves to the multilingual case; a difference between the multilingual models and the monolingual models was specifically observed in the test with Spanish utterances. (DTIC compilation part notice ADP010389: Multilingual Vocabularies in Automatic Speech Recognition.)

  10. Compact Acoustic Models for Embedded Speech Recognition

    Directory of Open Access Journals (Sweden)

    Lévy Christophe

    2009-01-01

    Full Text Available Speech recognition applications are known to require a significant amount of resources. However, embedded speech recognition only allows a few KB of memory, a few MIPS, and a small amount of training data. In order to fit the resource constraints of embedded applications, an approach based on a semicontinuous HMM system using state-independent acoustic modelling is proposed. A transformation is computed and applied to the global model in order to obtain each HMM state-dependent probability density function, so that only the transformation parameters need to be stored. This approach is evaluated on two tasks: digit and voice-command recognition. A fast adaptation technique for the acoustic models is also proposed. In order to significantly reduce computational costs, the adaptation is performed only on the global model (using related speaker recognition adaptation techniques), with no need for state-dependent data. The whole approach results in a relative gain of more than 20% compared to a basic HMM-based system fitting the constraints.

  11. Brain-inspired speech segmentation for automatic speech recognition using the speech envelope as a temporal reference

    Science.gov (United States)

    Lee, Byeongwook; Cho, Kwang-Hyun

    2016-11-01

    Speech segmentation is a crucial step in automatic speech recognition because additional speech analyses are performed for each framed speech segment. Conventional segmentation techniques primarily segment speech using a fixed frame size for computational simplicity. However, this approach is insufficient for capturing the quasi-regular structure of speech, which causes substantial recognition failure in noisy environments. How does the brain handle quasi-regular structured speech and maintain high recognition performance under any circumstance? Recent neurophysiological studies have suggested that the phase of neuronal oscillations in the auditory cortex contributes to accurate speech recognition by guiding speech segmentation into smaller units at different timescales. A phase-locked relationship between neuronal oscillation and the speech envelope has recently been obtained, which suggests that the speech envelope provides a foundation for multi-timescale speech segmental information. In this study, we quantitatively investigated the role of the speech envelope as a potential temporal reference to segment speech using its instantaneous phase information. We evaluated the proposed approach by the achieved information gain and recognition performance in various noisy environments. The results indicate that the proposed segmentation scheme not only extracts more information from speech but also provides greater robustness in a recognition test.
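
    A rough sketch of using the envelope's instantaneous phase as a segmentation reference: compute a slow amplitude envelope, take its analytic phase, and place candidate boundaries at phase wraps. The cut-off frequency and the wrap criterion are assumptions for illustration, not the authors' algorithm.

    ```python
    # Sketch: candidate segment boundaries from the instantaneous phase of the
    # low-pass speech envelope.
    import numpy as np
    from scipy.signal import butter, sosfiltfilt, hilbert

    def envelope_phase_boundaries(x, fs, env_cutoff=10.0):
        envelope = np.abs(hilbert(x))                               # amplitude envelope
        sos = butter(4, env_cutoff, btype="lowpass", fs=fs, output="sos")
        slow_env = sosfiltfilt(sos, envelope)                       # slow (syllable-rate) envelope
        phase = np.angle(hilbert(slow_env - np.mean(slow_env)))     # instantaneous phase
        wraps = np.where(np.diff(phase) < -np.pi)[0]                # -pi..pi wrap points
        return wraps / fs                                           # boundary times in seconds
    ```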

  12. Improved Open-Microphone Speech Recognition

    Science.gov (United States)

    Abrash, Victor

    2002-01-01

    Many current and future NASA missions make extreme demands on mission personnel both in terms of work load and in performing under difficult environmental conditions. In situations where hands are impeded or needed for other tasks, eyes are busy attending to the environment, or tasks are sufficiently complex that ease of use of the interface becomes critical, spoken natural language dialog systems offer unique input and output modalities that can improve efficiency and safety. They also offer new capabilities that would not otherwise be available. For example, many NASA applications require astronauts to use computers in micro-gravity or while wearing space suits. Under these circumstances, command and control systems that allow users to issue commands or enter data in hands-and eyes-busy situations become critical. Speech recognition technology designed for current commercial applications limits the performance of the open-ended state-of-the-art dialog systems being developed at NASA. For example, today's recognition systems typically listen to user input only during short segments of the dialog, and user input outside of these short time windows is lost. Mistakes detecting the start and end times of user utterances can lead to mistakes in the recognition output, and the dialog system as a whole has no way to recover from this, or any other, recognition error. Systems also often require the user to signal when that user is going to speak, which is impractical in a hands-free environment, or only allow a system-initiated dialog requiring the user to speak immediately following a system prompt. In this project, SRI has developed software to enable speech recognition in a hands-free, open-microphone environment, eliminating the need for a push-to-talk button or other signaling mechanism. The software continuously captures a user's speech and makes it available to one or more recognizers. By constantly monitoring and storing the audio stream, it provides the spoken

  13. Multi-thread Parallel Speech Recognition for Mobile Applications

    Directory of Open Access Journals (Sweden)

    LOJKA Martin

    2014-05-01

    Full Text Available In this paper, a server-based solution for a multi-thread large-vocabulary automatic speech recognition engine is described, along with practical application examples for Android OS and HTML5. The basic idea was to make speech recognition available to a full variety of applications for computers and especially for mobile devices. The speech recognition engine should be independent of commercial products and services (where the dictionary cannot be modified). Using third-party services can also pose a security and privacy problem in specific applications, where unsecured audio data must not be sent to uncontrolled environments (voice data transferred to servers around the globe). Using our experience with speech recognition applications, we constructed a multi-thread server-based speech recognition solution with a simple application programming interface (API) to a recognition engine tailored to the specific needs of a particular application.

  14. Speech recognition in natural background noise.

    Directory of Open Access Journals (Sweden)

    Julien Meyer

    Full Text Available In the real world, human speech recognition nearly always involves listening in background noise. The impact of such noise on speech signals and on intelligibility performance increases with the separation of the listener from the speaker. The present behavioral experiment provides an overview of the effects of such acoustic disturbances on speech perception in conditions approaching ecologically valid contexts. We analysed the intelligibility loss in spoken word lists with increasing listener-to-speaker distance in a typical low-level natural background noise. The noise was combined with the simple spherical amplitude attenuation due to distance, basically changing the signal-to-noise ratio (SNR). Therefore, our study draws attention to some of the most basic environmental constraints that have pervaded spoken communication throughout human history. We evaluated the ability of native French participants to recognize French monosyllabic words (spoken at 65.3 dB(A), reference at 1 meter) at distances from 11 to 33 meters, which corresponded to the SNRs most revealing of the progressive effect of the selected natural noise (-8.8 dB to -18.4 dB). Our results showed that in such conditions the identity of vowels is mostly preserved, with strikingly few vowel confusions. The results also confirmed the functional role of consonants during lexical identification. The extensive analysis of recognition scores, confusion patterns and associated acoustic cues revealed that sonorant, sibilant and burst properties were the most important parameters influencing phoneme recognition. Altogether these analyses allowed us to extract a resistance scale from consonant recognition scores. We also identified specific perceptual consonant confusion groups depending on their position in the word (onset vs. coda). Finally our data suggested that listeners may access some acoustic cues of the CV transition, opening interesting perspectives for
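
    Since the study combines a fixed reference speech level with simple spherical attenuation, the quoted SNRs follow directly from a 20·log10(d) level drop. A short check of the figures (the 65.3 dB(A) reference and the distances come from the abstract; the noise floor is inferred from the 11 m SNR):

```python
import numpy as np

ref_level = 65.3          # speech level in dB(A) at 1 m (from the abstract)
snr_11m = -8.8            # reported SNR at 11 m

# Spherical spreading: level(d) = ref_level - 20*log10(d / 1 m)
level = lambda d: ref_level - 20 * np.log10(d)

# Infer the (constant) noise floor from the 11 m figure, then predict 33 m
noise_floor = level(11) - snr_11m          # ~53.3 dB(A)
snr_33m = level(33) - noise_floor
print(f"noise floor ~ {noise_floor:.1f} dB(A), SNR at 33 m ~ {snr_33m:.1f} dB")
# -> SNR at 33 m ~ -18.3 dB, matching the -18.4 dB quoted above
```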

  15. Automated Discovery of Speech Act Categories in Educational Games

    Science.gov (United States)

    Rus, Vasile; Moldovan, Cristian; Niraula, Nobal; Graesser, Arthur C.

    2012-01-01

    In this paper we address the important task of automated discovery of speech act categories in dialogue-based, multi-party educational games. Speech acts are important in dialogue-based educational systems because they help infer the student speaker's intentions (the task of speech act classification) which in turn is crucial to providing adequate…

  16. Automated leukocyte recognition using fuzzy divergence.

    Science.gov (United States)

    Ghosh, Madhumala; Das, Devkumar; Chakraborty, Chandan; Ray, Ajoy K

    2010-10-01

    This paper aims at introducing an automated approach to leukocyte recognition using fuzzy divergence and modified thresholding techniques. The recognition is done through the segmentation of nuclei, where Gamma, Gaussian and Cauchy types of fuzzy membership functions are studied for the image pixels. It is in fact found that the Cauchy function leads to better segmentation than the others. In addition, image thresholding is modified for better recognition. Results are studied and discussed.

  17. Speech Recognition Technology Applied to Intelligent Mobile Navigation System

    Institute of Scientific and Technical Information of China (English)

    2002-01-01

    The capability of human-computer interaction reflects the degree of intelligence of a mobile navigation system. In this paper, the navigation data and functions of a mobile navigation system are divided into system commands and non-system commands, and a group of speech commands is then abstracted. Speech recognition technology is applied to the intelligent mobile navigation system to process speech commands, and the integration of speech recognition technology with the mobile navigation system is studied in depth. Navigation operations can be performed by speech commands, which makes human-computer interaction easy during navigation. The speech command interface of the navigation system is implemented with Dutty++ software, which is based on the IBM ViaVoice speech recognition system. Navigation experiments showed that navigation can be performed almost without the keyboard, which proved that human-computer interaction via speech commands is very convenient and that reliability is also high.

  18. Automatic Speech Recognition from Neural Signals: A Focused Review

    Directory of Open Access Journals (Sweden)

    Christian Herff

    2016-09-01

    Full Text Available Speech interfaces have become widely accepted and are nowadays integrated in various real-life applications and devices; they have become a part of our daily life. However, speech interfaces presume the ability to produce intelligible speech, which might be impossible due to loud environments, concern about disturbing bystanders, or an inability to produce speech (i.e., patients suffering from locked-in syndrome). For these reasons it would be highly desirable not to speak, but to simply imagine saying words or sentences. Interfaces based on imagined speech would enable fast and natural communication without the need for audible speech and would give a voice to otherwise mute people. This focused review analyzes the potential of different brain imaging techniques to recognize speech from neural signals by applying Automatic Speech Recognition technology. We argue that modalities based on metabolic processes, such as functional Near Infrared Spectroscopy and functional Magnetic Resonance Imaging, are less suited for Automatic Speech Recognition from neural signals due to their low temporal resolution, but are very useful for investigating the underlying neural mechanisms involved in speech processes. In contrast, electrophysiologic activity is fast enough to capture speech processes and is therefore better suited for ASR. Our experimental results indicate the potential of these signals for speech recognition from neural data, with a focus on invasively measured brain activity (electrocorticography). As a first example of Automatic Speech Recognition techniques applied to neural signals, we discuss the Brain-to-text system.

  19. Combined Hand Gesture — Speech Model for Human Action Recognition

    Directory of Open Access Journals (Sweden)

    Sheng-Tzong Cheng

    2013-12-01

    Full Text Available This study proposes a dynamic hand gesture detection technology to effectively detect dynamic hand gesture areas, and a hand gesture recognition technology to improve the dynamic hand gesture recognition rate. Meanwhile, the corresponding relationship between state sequences in hand gesture and speech models is considered by integrating speech recognition technology with a multimodal model, thus improving the accuracy of human behavior recognition. The experimental results proved that the proposed method can effectively improve human behavior recognition accuracy and the feasibility of system applications. Experimental results verified that the multimodal gesture-speech model provided superior accuracy when compared to the single modal versions.

  20. Combined hand gesture--speech model for human action recognition.

    Science.gov (United States)

    Cheng, Sheng-Tzong; Hsu, Chih-Wei; Li, Jian-Pan

    2013-12-12

    This study proposes a dynamic hand gesture detection technology to effectively detect dynamic hand gesture areas, and a hand gesture recognition technology to improve the dynamic hand gesture recognition rate. Meanwhile, the corresponding relationship between state sequences in hand gesture and speech models is considered by integrating speech recognition technology with a multimodal model, thus improving the accuracy of human behavior recognition. The experimental results proved that the proposed method can effectively improve human behavior recognition accuracy and the feasibility of system applications. Experimental results verified that the multimodal gesture-speech model provided superior accuracy when compared to the single modal versions.

  1. Speech and audio processing for coding, enhancement and recognition

    CERN Document Server

    Togneri, Roberto; Narasimha, Madihally

    2015-01-01

    This book describes the basic principles underlying the generation, coding, transmission and enhancement of speech and audio signals, including advanced statistical and machine learning techniques for speech and speaker recognition with an overview of the key innovations in these areas. Key research undertaken in speech coding, speech enhancement, speech recognition, emotion recognition and speaker diarization are also presented, along with recent advances and new paradigms in these areas. Offers readers a single-source reference on the significant applications of speech and audio processing to speech coding, speech enhancement and speech/speaker recognition; enables readers involved in algorithm development and implementation issues for speech coding to understand the historical development and future challenges in speech coding research; discusses speech coding methods yielding bit-streams that are multi-rate and scalable for Voice-over-IP (VoIP) networks; ...

  2. An HMM-Like Dynamic Time Warping Scheme for Automatic Speech Recognition

    Directory of Open Access Journals (Sweden)

    Ing-Jr Ding

    2014-01-01

    Full Text Available In the past, the kernel of automatic speech recognition (ASR) was dynamic time warping (DTW), a feature-based template matching technique belonging to the category of dynamic programming (DP). Although DTW is an early ASR technique, it remains popular in many applications and now plays an important role in the well-known Kinect-based gesture recognition application. This paper proposes an intelligent speech recognition system using an improved DTW approach for multimedia and home automation services. The improved DTW presented in this work, called HMM-like DTW, is essentially a hidden Markov model (HMM)-like method in which the concept of the typical HMM statistical model is brought into the design of DTW. The developed HMM-like DTW method, transforming feature-based DTW recognition into model-based DTW recognition, behaves like the HMM recognition technique and therefore, with the HMM-like recognition model, gains the capability to further perform model adaptation (also known as speaker adaptation). A series of experimental results in home automation-based multimedia access service environments demonstrated the superiority and effectiveness of the developed smart speech recognition system based on HMM-like DTW.
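
    For reference, the core DTW recurrence that the HMM-like variant builds on can be written in a few lines; the sketch below is a generic textbook formulation with a Euclidean local cost, not the authors' adapted method.

```python
import numpy as np

def dtw_distance(X, Y):
    """Dynamic time warping cost between feature sequences X (n, d) and Y (m, d)."""
    n, m = len(X), len(Y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(X[i - 1] - Y[j - 1])   # local distance
            # best of match, insertion, deletion
            D[i, j] = cost + min(D[i - 1, j - 1], D[i - 1, j], D[i, j - 1])
    return D[n, m]

# Template matching: pick the reference word whose template warps most cheaply
templates = {"yes": np.random.randn(40, 13), "no": np.random.randn(35, 13)}
query = np.random.randn(42, 13)                     # e.g. MFCC frames
best = min(templates, key=lambda w: dtw_distance(query, templates[w]))
print("recognized:", best)
```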

  3. Ensemble Feature Extraction Modules for Improved Hindi Speech Recognition System

    Directory of Open Access Journals (Sweden)

    Malay Kumar

    2012-05-01

    Full Text Available Speech is the most natural way of communication between human beings. The field of speech recognition raises the prospect of man-machine conversation, and because of its versatile applications, automatic speech recognition systems have been designed. In this paper we present a novel approach to Hindi speech recognition that ensembles the feature extraction modules of several ASR systems and combines their outputs using the ROVER voting technique. Experimental results show that the proposed system produces better results than traditional ASR systems.
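
    ROVER aligns the word sequences produced by several recognizers and keeps the majority vote in each aligned slot. The toy sketch below assumes the hypotheses are already aligned word for word (the full technique also performs the alignment), so it only illustrates the voting step.

```python
from collections import Counter

def vote(aligned_hypotheses):
    """Majority vote over word-aligned ASR outputs ('' marks a deletion)."""
    combined = []
    for slot in zip(*aligned_hypotheses):
        word, _ = Counter(slot).most_common(1)[0]
        if word:                       # drop slots where a deletion wins
            combined.append(word)
    return combined

# Three recognizers' outputs for the same Hindi utterance (illustrative)
hyps = [["मेरा", "नाम", "राम", "है"],
        ["मेरा", "नाम", "राम", "हैं"],
        ["मेरा", "नाथ", "राम", "है"]]
print(" ".join(vote(hyps)))            # -> मेरा नाम राम है
```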

  4. Deep Complementary Bottleneck Features for Visual Speech Recognition

    NARCIS (Netherlands)

    Petridis, Stavros; Pantic, Maja

    2016-01-01

    Deep bottleneck features (DBNFs) have been used successfully in the past for acoustic speech recognition from audio. However, research on extracting DBNFs for visual speech recognition is very limited. In this work, we present an approach to extract deep bottleneck visual features based on deep auto

  5. Mispronunciation Detection for Language Learning and Speech Recognition Adaptation

    Science.gov (United States)

    Ge, Zhenhao

    2013-01-01

    The areas of "mispronunciation detection" (or "accent detection" more specifically) within the speech recognition community are receiving increased attention now. Two application areas, namely language learning and speech recognition adaptation, are largely driving this research interest and are the focal points of this work.…

  6. Automatic Phonetic Transcription for Danish Speech Recognition

    DEFF Research Database (Denmark)

    Kirkedal, Andreas Søeborg

    Automatic speech recognition (ASR) uses dictionaries that map orthographic words to their phonetic representation. To minimize the occurrence of out-of-vocabulary words, ASR requires large phonetic dictionaries to model pronunciation. Hand-crafted high-quality phonetic dictionaries are difficult...... of automatic phonetic transcriptions vary greatly with respect to language and transcription strategy. For some languages, where the difference between the graphemic and phonetic representations is small, graphemic transcriptions can be used to create ASR systems with acceptable performance. In other languages......, like Danish, the graphemic and phonetic representations are very dissimilar and more complex rewriting rules must be applied to create the correct phonetic representation. Automatic phonetic transcribers use different strategies, from deep analysis to shallow rewriting rules, to produce phonetic...

  7. Speech recognition using articulatory and excitation source features

    CERN Document Server

    Rao, K Sreenivasa

    2017-01-01

    This book discusses the contribution of articulatory and excitation source information in discriminating sound units. The authors focus on excitation source component of speech -- and the dynamics of various articulators during speech production -- for enhancement of speech recognition (SR) performance. Speech recognition is analyzed for read, extempore, and conversation modes of speech. Five groups of articulatory features (AFs) are explored for speech recognition, in addition to conventional spectral features. Each chapter provides the motivation for exploring the specific feature for SR task, discusses the methods to extract those features, and finally suggests appropriate models to capture the sound unit specific knowledge from the proposed features. The authors close by discussing various combinations of spectral, articulatory and source features, and the desired models to enhance the performance of SR systems.

  8. Unvoiced Speech Recognition Using Tissue-Conductive Acoustic Sensor

    Directory of Open Access Journals (Sweden)

    Heracleous Panikos

    2007-01-01

    Full Text Available We present the use of stethoscope and silicon NAM (nonaudible murmur) microphones in automatic speech recognition. NAM microphones are special acoustic sensors, which are attached behind the talker's ear and can capture not only normal (audible) speech, but also very quietly uttered speech (nonaudible murmur). As a result, NAM microphones can be applied in automatic speech recognition systems when privacy is desired in human-machine communication. Moreover, NAM microphones show robustness against noise and they might be used in special systems (speech recognition, speech transformation, etc.) for sound-impaired people. Using adaptation techniques and a small amount of training data, we achieved, for a 20 k dictation task, a word accuracy for nonaudible murmur recognition in a clean environment. In this paper, we also investigate nonaudible murmur recognition in noisy environments and the effect of the Lombard reflex on nonaudible murmur recognition. We also propose three methods to integrate audible speech and nonaudible murmur recognition using a stethoscope NAM microphone with very promising results.

  9. Speech recognition algorithms based on weighted finite-state transducers

    CERN Document Server

    Hori, Takaaki

    2013-01-01

    This book introduces the theory, algorithms, and implementation techniques for efficient decoding in speech recognition mainly focusing on the Weighted Finite-State Transducer (WFST) approach. The decoding process for speech recognition is viewed as a search problem whose goal is to find a sequence of words that best matches an input speech signal. Since this process becomes computationally more expensive as the system vocabulary size increases, research has long been devoted to reducing the computational cost. Recently, the WFST approach has become an important state-of-the-art speech recogni

  10. Automatic Emotion Recognition in Speech: Possibilities and Significance

    Directory of Open Access Journals (Sweden)

    Milana Bojanić

    2009-12-01

    Full Text Available Automatic speech recognition and spoken language understanding are crucial steps towards a natural human-machine interaction. The main task of the speech communication process is the recognition of the word sequence, but the recognition of prosody, emotion and stress tags may be of particular importance as well. This paper discusses the possibilities of recognizing emotion from the speech signal in order to improve ASR, and also provides an analysis of the acoustic features that can be used for the detection of the speaker's emotion and stress. The paper also provides a short overview of emotion and stress classification techniques. The importance and place of emotional speech recognition is shown in the domain of human-computer interactive systems and the transaction communication model. Directions for future work are given at the end of this work.

  11. An articulatorily constrained, maximum entropy approach to speech recognition and speech coding

    Energy Technology Data Exchange (ETDEWEB)

    Hogden, J.

    1996-12-31

    Hidden Markov models (HMMs) are among the most popular tools for performing computer speech recognition. One of the primary reasons that HMMs typically outperform other speech recognition techniques is that the parameters used for recognition are determined by the data, not by preconceived notions of what the parameters should be. This makes HMMs better able to deal with intra- and inter-speaker variability despite the limited knowledge of how speech signals vary and despite the often limited ability to correctly formulate rules describing variability and invariance in speech. In fact, it is often the case that when HMM parameter values are constrained using the limited knowledge of speech, recognition performance decreases. However, the structure of an HMM has little in common with the mechanisms underlying speech production. Here, the author argues that by using probabilistic models that more accurately embody the process of speech production, he can create models that have all the advantages of HMMs, but that should more accurately capture the statistical properties of real speech samples--presumably leading to more accurate speech recognition. The model he will discuss uses the fact that speech articulators move smoothly and continuously. Before discussing how to use articulatory constraints, he will give a brief description of HMMs. This will allow him to highlight the similarities and differences between HMMs and the proposed technique.

  12. Confidence and rejection in automatic speech recognition

    Science.gov (United States)

    Colton, Larry Don

    Automatic speech recognition (ASR) is performed imperfectly by computers. For some designated part (e.g., word or phrase) of the ASR output, rejection is deciding (yes or no) whether it is correct, and confidence is the probability (0.0 to 1.0) of it being correct. This thesis presents new methods of rejecting errors and estimating confidence for telephone speech. These are also called word or utterance verification and can be used in wordspotting or voice-response systems. Open-set or out-of-vocabulary situations are a primary focus. Language models are not considered. In vocabulary-dependent rejection all words in the target vocabulary are known in advance and a strategy can be developed for confirming each word. A word-specific artificial neural network (ANN) is shown to discriminate well, and scores from such ANNs are shown on a closed-set recognition task to reorder the N-best hypothesis list (N=3) for improved recognition performance. Segment-based duration and perceptual linear prediction (PLP) features are shown to perform well for such ANNs. The majority of the thesis concerns vocabulary- and task-independent confidence and rejection based on phonetic word models. These can be computed for words even when no training examples of those words have been seen. New techniques are developed using phoneme ranks instead of probabilities in each frame. These are shown to perform as well as the best other methods examined despite the data reduction involved. Certain new weighted averaging schemes are studied but found to give no performance benefit. Hierarchical averaging is shown to improve performance significantly: frame scores combine to make segment (phoneme state) scores, which combine to make phoneme scores, which combine to make word scores. Use of intermediate syllable scores is shown to not affect performance. Normalizing frame scores by an average of the top probabilities in each frame is shown to improve performance significantly. Perplexity of the wrong
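
    The hierarchical averaging described above can be expressed compactly: frame scores average into state scores, states into phoneme scores, and phonemes into a word score. A minimal sketch, assuming the segmentation (which frames belong to which state and phoneme) is already known:

```python
import numpy as np

def word_confidence(frame_scores, state_spans, phoneme_spans):
    """Average frame scores -> state scores -> phoneme scores -> word score.
    state_spans: (start, end) frame indices per state;
    phoneme_spans: (start, end) state indices per phoneme."""
    state_scores = [np.mean(frame_scores[s:e]) for s, e in state_spans]
    phone_scores = [np.mean(state_scores[s:e]) for s, e in phoneme_spans]
    return float(np.mean(phone_scores))

# Example: 10 frame scores, 4 states, 2 phonemes of 2 states each
frames = np.array([0.9, 0.8, 0.7, 0.9, 0.2, 0.3, 0.8, 0.9, 0.95, 0.9])
states = [(0, 3), (3, 5), (5, 7), (7, 10)]
phones = [(0, 2), (2, 4)]
print(f"word confidence ~ {word_confidence(frames, states, phones):.2f}")
```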

  13. Continuous speech recognition based on convolutional neural network

    Science.gov (United States)

    Zhang, Qing-qing; Liu, Yong; Pan, Jie-lin; Yan, Yong-hong

    2015-07-01

    Convolutional Neural Networks (CNNs), which showed success in achieving translation invariance for many image processing tasks, are investigated for continuous speech recognition in this paper. Compared to Deep Neural Networks (DNNs), which have been proven successful in many speech recognition tasks, CNNs can reduce the model size significantly and at the same time achieve even better recognition accuracy. Experiments on the standard TIMIT speech corpus showed that CNNs outperformed DNNs in terms of accuracy even with a smaller model size.
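
    A minimal PyTorch sketch of the kind of convolutional acoustic model compared with DNNs above; the layer sizes, pooling along frequency only, and the 183 TIMIT-style state targets are illustrative assumptions rather than the paper's configuration.

```python
import torch
import torch.nn as nn

class SmallCNN(nn.Module):
    """Frame classifier over log-mel patches (freq x time), e.g. 40 x 11."""
    def __init__(self, n_states=183):              # e.g. 61 phones x 3 states
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=(8, 3)), nn.ReLU(),
            nn.MaxPool2d((3, 1)),                   # pool along frequency only
            nn.Conv2d(32, 32, kernel_size=(4, 3)), nn.ReLU(),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(), nn.Linear(32 * 8 * 7, 256), nn.ReLU(),
            nn.Linear(256, n_states),
        )

    def forward(self, x):                           # x: (batch, 1, 40, 11)
        return self.classifier(self.features(x))

model = SmallCNN()
logits = model(torch.randn(4, 1, 40, 11))           # 4 context windows
print(logits.shape)                                 # torch.Size([4, 183])
```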

  14. Novel Extended Phonemic Set for Mandarin Continuous Speech Recognition

    Institute of Scientific and Technical Information of China (English)

    谢湘; 匡镜明

    2003-01-01

    An extended phonemic set for Mandarin is proposed from the viewpoint of speech recognition. This set absorbs most principles of other existing phonemic sets for Mandarin, such as Worldbet and SAMPA-C, and also takes advantage of practical experience from speech recognition research to increase the discriminability between word models. Experiments in speaker-independent continuous speech recognition show that hidden Markov models defined by this phonemic set perform better than those based on initial/final units of Mandarin and have a very compact size.

  15. SPEECH EMOTION RECOGNITION USING MODIFIED QUADRATIC DISCRIMINATION FUNCTION

    Institute of Scientific and Technical Information of China (English)

    2008-01-01

    The Quadratic Discrimination Function (QDF) is commonly used in speech emotion recognition and proceeds on the premise that the input data are normally distributed. In this paper, we propose a transformation to normalize the emotional features and then derive a Modified QDF (MQDF) for speech emotion recognition. Features based on prosody and voice quality are extracted, and a Principal Component Analysis Neural Network (PCANN) is used to reduce the dimension of the feature vectors. The results show that voice quality features are an effective supplement for recognition and that the proposed method improves the recognition rate effectively.

  16. Source Separation via Spectral Masking for Speech Recognition Systems

    Directory of Open Access Journals (Sweden)

    Gustavo Fernandes Rodrigues

    2012-12-01

    Full Text Available In this paper we present an insight into the use of spectral masking techniques in the time-frequency domain as a preprocessing step for speech signal recognition. Speech recognition systems have their performance negatively affected in noisy environments or in the presence of other speech signals. The limits of these masking techniques for different levels of the signal-to-noise ratio are discussed. We show the robustness of the spectral masking techniques against four types of noise: white, pink, brown and human speech noise (babble noise). The main contribution of this work is to analyze the performance limits of recognition systems using spectral masking. We obtain an increase of 18% in the speech hit rate when the speech signals were corrupted by other speech signals or babble noise, at signal-to-noise ratios of approximately 1, 10 and 20 dB. On the other hand, applying the ideal binary masks to mixtures corrupted by white, pink and brown noise results in an average increase of 9% in the speech hit rate at the same signal-to-noise ratios. The experimental results suggest that the spectral masking techniques are more suitable for babble noise, which is produced by human speech, than for white, pink and brown noise.
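
    The ideal binary mask used in such experiments keeps a time-frequency cell when the local speech-to-noise ratio exceeds a threshold and zeroes it otherwise. A minimal STFT-based sketch with SciPy (the 0 dB threshold, window settings, and toy signals are assumptions):

```python
import numpy as np
from scipy.signal import stft, istft

def ideal_binary_mask(speech, noise, fs, thr_db=0.0, nperseg=512):
    """Apply the ideal binary mask to the mixture speech + noise."""
    _, _, S = stft(speech, fs, nperseg=nperseg)
    _, _, N = stft(noise, fs, nperseg=nperseg)
    local_snr = 20 * np.log10(np.abs(S) + 1e-12) - 20 * np.log10(np.abs(N) + 1e-12)
    mask = (local_snr > thr_db).astype(float)       # 1 = keep cell, 0 = discard
    _, _, M = stft(speech + noise, fs, nperseg=nperseg)
    _, masked = istft(M * mask, fs, nperseg=nperseg)
    return masked

fs = 16000
t = np.arange(fs) / fs
speech = np.sin(2 * np.pi * 300 * t)                # stand-in for a speech signal
noise = 0.5 * np.random.randn(fs)                   # white noise
enhanced = ideal_binary_mask(speech, noise, fs)
print(enhanced.shape)
```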

  17. Speech-recognition interfaces for music information retrieval

    Science.gov (United States)

    Goto, Masataka

    2005-09-01

    This paper describes two hands-free music information retrieval (MIR) systems that enable a user to retrieve and play back a musical piece by saying its title or the artist's name. Although various interfaces for MIR have been proposed, speech-recognition interfaces suitable for retrieving musical pieces have not been studied. Our MIR-based jukebox systems employ two different speech-recognition interfaces for MIR, speech completion and speech spotter, which exploit intentionally controlled nonverbal speech information in original ways. The first is a music retrieval system with the speech-completion interface that is suitable for music stores and car-driving situations. When a user only remembers part of the name of a musical piece or an artist and utters only a remembered fragment, the system helps the user recall and enter the name by completing the fragment. The second is a background-music playback system with the speech-spotter interface that can enrich human-human conversation. When a user is talking to another person, the system allows the user to enter voice commands for music playback control by spotting a special voice-command utterance in face-to-face or telephone conversations. Experimental results from use of these systems have demonstrated the effectiveness of the speech-completion and speech-spotter interfaces. (Video clips: http://staff.aist.go.jp/m.goto/MIR/speech-if.html)

  18. A Review on Speech Corpus Development for Automatic Speech Recognition in Indian Languages

    Directory of Open Access Journals (Sweden)

    Cini kurian

    2015-05-01

    Full Text Available Corpus development has gained much attention due to recent statistics-based natural language processing. It has new applications in language technology, linguistic research, language education and information exchange. Corpus-based language research has an innovative outlook that moves beyond aged linguistic theories. A speech corpus is an essential resource for building a speech recognizer, and one of the main challenges faced by speech scientists is the unavailability of these resources. Far fewer efforts have been made in Indian languages, compared to English, to make these resources available to the public. In this paper we review the efforts made in Indian languages to develop speech corpora for automatic speech recognition.

  19. Objects Control through Speech Recognition Using LabVIEW

    Directory of Open Access Journals (Sweden)

    Ankush Sharma

    2013-01-01

    Full Text Available Speech is the natural form of human communication, and speech processing is one of the most stimulating areas of signal processing. Speech recognition technology has made it possible for computers to follow human voice commands and understand human languages. In this paper, the control of objects (LEDs, toggle switches, etc.) through human speech is designed by combining virtual instrumentation technology with speech recognition techniques; password authentication is also provided. This is done with the help of LabVIEW programming concepts. A microphone is used to take voice commands from the user, and the microphone signals are interfaced with the LabVIEW code, which generates the appropriate control signals to control the objects. The entire work is done on the LabVIEW platform.

  20. Histogram Equalization to Model Adaptation for Robust Speech Recognition

    Directory of Open Access Journals (Sweden)

    Hoirin Kim

    2010-01-01

    Full Text Available We propose a new model adaptation method based on the histogram equalization technique for providing robustness in noisy environments. The trained acoustic mean models of a speech recognizer are adapted into environmentally matched conditions by using the histogram equalization algorithm on a single utterance basis. For more robust speech recognition in the heavily noisy conditions, trained acoustic covariance models are efficiently adapted by the signal-to-noise ratio-dependent linear interpolation between trained covariance models and utterance-level sample covariance models. Speech recognition experiments on both the digit-based Aurora2 task and the large vocabulary-based task showed that the proposed model adaptation approach provides significant performance improvements compared to the baseline speech recognizer trained on the clean speech data.
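
    Histogram equalization in this context maps each feature dimension through its empirical CDF onto a reference distribution. The abstract applies the idea on the model side (adapting acoustic means); the sketch below shows the more common feature-side formulation, per utterance and per dimension, with a standard-normal reference and a rank-based CDF estimate as assumptions.

```python
import numpy as np
from scipy.stats import norm

def histogram_equalize(features):
    """Map each feature dimension of one utterance (T x D) onto N(0, 1)."""
    T, D = features.shape
    equalized = np.empty_like(features, dtype=float)
    for d in range(D):
        # Empirical CDF value of each frame in this dimension (ranks in (0, 1))
        ranks = np.argsort(np.argsort(features[:, d]))
        cdf = (ranks + 0.5) / T
        # Transform through the inverse CDF of the reference distribution
        equalized[:, d] = norm.ppf(cdf)
    return equalized

noisy_mfcc = np.random.gamma(shape=2.0, size=(200, 13))      # skewed "noisy" features
print(histogram_equalize(noisy_mfcc).mean(axis=0).round(2))  # ~0 per dimension
```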

  1. Mandarin Digits Speech Recognition Using Support Vector Machines

    Institute of Scientific and Technical Information of China (English)

    XIE Xiang; KUANG Jing-ming

    2005-01-01

    A method of applying support vector machines (SVMs) to speech recognition is proposed, and a speech recognition system for Mandarin digits was built with SVMs. In the system, vectors were linearly extracted from the speech feature sequence to make up time-aligned input patterns for the SVM, and the decisions of several 2-class SVM classifiers were combined to construct an N-class classifier. Four kinds of SVM kernel functions were compared in speaker-independent speech recognition experiments on Mandarin digits. The radial basis function kernel achieved the highest accuracy, 99.33%, which is better than that of the baseline system based on hidden Markov models (HMMs) (97.08%). The experiments also show that SVMs can outperform HMMs, especially when the training samples are very limited.

  2. Comparative wavelet, PLP, and LPC speech recognition techniques on the Hindi speech digits database

    Science.gov (United States)

    Mishra, A. N.; Shrotriya, M. C.; Sharan, S. N.

    2010-02-01

    In view of the growing use of automatic speech recognition in modern society, we study various alternative representations of the speech signal that have the potential to contribute to improved recognition performance. In this paper, wavelet-based features using different wavelets are used for Hindi digit recognition. The recognition performance of these features has been compared with Linear Prediction Coefficient (LPC) and Perceptual Linear Prediction (PLP) features. All features have been tested with a Hidden Markov Model (HMM) based classifier for speaker-independent Hindi digit recognition. The recognition performance of PLP features is 11.3% better than that of LPC features, and db10 features show a further improvement of 12.55% over PLP features. The recognition performance with db10 is the best among all wavelet-based features.
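
    A rough illustration of db10 wavelet features of the sort compared above, using PyWavelets: a multi-level decomposition per frame with the log energy of each sub-band as the feature vector (the framing and decomposition depth are assumptions, not the paper's setup).

```python
import numpy as np
import pywt

def wavelet_features(signal, fs, frame_ms=25, hop_ms=10, wavelet="db10", level=4):
    """Log sub-band energies from a db10 wavelet decomposition of each frame."""
    frame, hop = int(fs * frame_ms / 1000), int(fs * hop_ms / 1000)
    feats = []
    for start in range(0, len(signal) - frame, hop):
        coeffs = pywt.wavedec(signal[start:start + frame], wavelet, level=level)
        feats.append([np.log(np.sum(c ** 2) + 1e-10) for c in coeffs])
    return np.array(feats)            # (n_frames, level + 1)

fs = 16000
x = np.random.randn(fs)              # stand-in for a spoken Hindi digit
print(wavelet_features(x, fs).shape) # e.g. (98, 5)
```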

  3. Lexicon Optimization for Dutch Speech Recognition in Spoken Document Retrieval

    NARCIS (Netherlands)

    Ordelman, Roeland; Hessen, van Arjan; Jong, de Franciska

    2001-01-01

    In this paper, ongoing work concerning the language modelling and lexicon optimization of a Dutch speech recognition system for Spoken Document Retrieval is described: the collection and normalization of a training data set and the optimization of our recognition lexicon. Effects on lexical coverage

  4. Lexicon optimization for Dutch speech recognition in spoken document retrieval

    NARCIS (Netherlands)

    Ordelman, Roeland; Hessen, van Arjan; Jong, de Franciska

    2001-01-01

    In this paper, ongoing work concerning the language modelling and lexicon optimization of a Dutch speech recognition system for Spoken Document Retrieval is described: the collection and normalization of a training data set and the optimization of our recognition lexicon. Effects on lexical coverage

  5. Modelling context in automatic speech recognition

    NARCIS (Netherlands)

    Wiggers, P.

    2008-01-01

    Speech is at the core of human communication. Speaking and listening come so naturally to us that we do not have to think about them at all. The underlying cognitive processes are very rapid and almost completely subconscious. It is hard, if not impossible, not to understand speech. For computers on the o

  6. Speech recognition systems on the Cell Broadband Engine

    Energy Technology Data Exchange (ETDEWEB)

    Liu, Y; Jones, H; Vaidya, S; Perrone, M; Tydlitat, B; Nanda, A

    2007-04-20

    In this paper we describe our design, implementation, and first results of a prototype connected-phoneme-based speech recognition system on the Cell Broadband Engine™ (Cell/B.E.). Automatic speech recognition decodes speech samples into plain text (other representations are possible) and must process samples at real-time rates. Fortunately, the computational tasks involved in this pipeline are highly data-parallel and can receive significant hardware acceleration from vector-streaming architectures such as the Cell/B.E. Identifying and exploiting these parallelism opportunities is challenging, but also critical to improving system performance. We observed, from our initial performance timings, that a single Cell/B.E. processor can recognize speech from thousands of simultaneous voice channels in real time--a channel density that is orders-of-magnitude greater than the capacity of existing software speech recognizers based on CPUs (central processing units). This result emphasizes the potential for Cell/B.E.-based speech recognition and will likely lead to the future development of production speech systems using Cell/B.E. clusters.

  7. Exploiting temporal correlation of speech for error robust and bandwidth flexible distributed speech recognition

    DEFF Research Database (Denmark)

    Tan, Zheng-Hua; Dalsgaard, Paul; Lindberg, Børge

    2007-01-01

    In this paper the temporal correlation of speech is exploited in front-end feature extraction, client based error recovery and server based error concealment (EC) for distributed speech recognition. First, the paper investigates a half frame rate (HFR) front-end that uses double frame shifting at...... Lastly, to understand the effects of applying various EC techniques, this paper introduces three approaches consisting of speech feature, dynamic programming distance and hidden Markov model state duration comparison.

  8. Noise Robust Speech Recognition Applied to Voice-Driven Wheelchair

    Science.gov (United States)

    Sasou, Akira; Kojima, Hiroaki

    2009-12-01

    Conventional voice-driven wheelchairs usually employ headset microphones that are capable of achieving sufficient recognition accuracy, even in the presence of surrounding noise. However, such interfaces require users to wear sensors such as a headset microphone, which can be an impediment, especially for the hand disabled. Conversely, it is also well known that the speech recognition accuracy drastically degrades when the microphone is placed far from the user. In this paper, we develop a noise robust speech recognition system for a voice-driven wheelchair. This system can achieve almost the same recognition accuracy as the headset microphone without wearing sensors. We verified the effectiveness of our system in experiments in different environments, and confirmed that our system can achieve almost the same recognition accuracy as the headset microphone without wearing sensors.

  9. A Multi-Modal Recognition System Using Face and Speech

    Directory of Open Access Journals (Sweden)

    Samir Akrouf

    2011-05-01

    Full Text Available Nowadays person recognition has attracted more and more interest, especially for security reasons. Recognition performed by a biometric system using a single modality tends to perform worse due to sensor data limitations, restricted degrees of freedom and unacceptable error rates. To alleviate some of these problems we use multimodal biometric systems, which provide better recognition results. By combining different modalities, such as speech, face, fingerprint, etc., we increase the performance of recognition systems. In this paper, we study the fusion of speech and face in a recognition system for taking a final decision (i.e., accept or reject an identity claim). We evaluate the performance of each system separately, then we fuse the results and compare the performances.

  10. Duration-Distribution-Based HMM for Speech Recognition

    Institute of Scientific and Technical Information of China (English)

    WANG Zuo-ying; XIAO Xi

    2006-01-01

    To overcome the defects of the duration modeling in the homogeneous Hidden Markov Model (HMM) for speech recognition, a duration-distribution-based HMM (DDBHMM) is proposed in this paper based on a formalized definition of a left-to-right inhomogeneous Markov model. It has been demonstrated that it can be identically defined by either the state duration or the state transition probability. The speaker-independent continuous speech recognition experiments show that by only modeling the state duration in DDBHMM, a significant improvement (17.8% error rate reduction) can be achieved compared with the classical HMM. The ideal properties of DDBHMM give promise to many aspects of speech modeling, such as the modeling of the state duration, speed variation, speech discontinuity, and interframe correlation.

  11. Speech Recognition Method Based on Multilayer Chaotic Neural Network

    Institute of Scientific and Technical Information of China (English)

    REN Xiaolin; HU Guangrui

    2001-01-01

    In this paper, speech recognition using neural networks is investigated. In particular, chaotic dynamics is introduced to neurons, and a multilayer chaotic neural network (MLCNN) architecture is built. A learning algorithm is also derived to train the weights of the network. We apply the MLCNN to speech recognition and compare the performance of the network with those of a recurrent neural network (RNN) and a time-delay neural network (TDNN). Experimental results show that the MLCNN method outperforms the other neural network methods with respect to average recognition rate.

  12. Bayesian estimation of keyword confidence in Chinese continuous speech recognition

    Institute of Scientific and Technical Information of China (English)

    HAO Jie; LI Xing

    2003-01-01

    In a syllable-based speaker-independent Chinese continuous speech recognition system based on classical Hidden Markov Model (HMM), a Bayesian approach of keyword confidence estimation is studied, which utilizes both acoustic layer scores and syllable-based statistical language model (LM) score. The Maximum a posteriori (MAP) confidence measure is proposed, and the forward-backward algorithm calculating the MAP confidence scores is deduced. The performance of the MAP confidence measure is evaluated in keyword spotting application and the experiment results show that the MAP confidence scores provide high discriminability for keyword candidates. Furthermore, the MAP confidence measure can be applied to various speech recognition applications.

  13. Emotion Recognition from Persian Speech with Neural Network

    Directory of Open Access Journals (Sweden)

    Mina Hamidi

    2012-10-01

    Full Text Available In this paper, we report an effort towards automatic recognition of emotional states from continuous Persian speech. Due to the unavailability of an appropriate database in the Persian language for emotion recognition, we first built a database of emotional speech in Persian. This database consists of 2400 wave clips modulated with anger, disgust, fear, sadness, happiness and normal emotions. We then extract prosodic features, including features related to the pitch, intensity and global characteristics of the speech signal. Finally, we applied neural networks for automatic recognition of emotion. The resulting average accuracy was about 78%.

  14. Subspace Distribution Clustering HMM for Chinese Digit Speech Recognition

    Institute of Scientific and Technical Information of China (English)

    2006-01-01

    As a kind of statistical method, the technique of Hidden Markov Model (HMM) is widely used for speech recognition. In order to train the HMM to be more effective with much less amount of data, the Subspace Distribution Clustering Hidden Markov Model (SDCHMM), derived from the Continuous Density Hidden Markov Model (CDHMM), is introduced. With parameter tying, a new method to train SDCHMMs is described. Compared with the conventional training method, an SDCHMM recognizer trained by means of the new method achieves higher accuracy and speed. Experiment results show that the SDCHMM recognizer outperforms the CDHMM recognizer on speech recognition of Chinese digits.

  15. Integrating HMM-Based Speech Recognition With Direct Manipulation In A Multimodal Korean Natural Language Interface

    CERN Document Server

    Lee, G; Kim, S; Lee, Geunbae; Lee, Jong-Hyeok; Kim, Sangeok

    1996-01-01

    This paper presents an HMM-based speech recognition engine and its integration into direct manipulation interfaces for a Korean document editor. Speech recognition can reduce the tedious and repetitive actions that are inevitable in standard GUIs (graphical user interfaces). Our system consists of a general speech recognition engine called ABrain (Auditory Brain) and a speech-commandable document editor called SHE (Simple Hearing Editor). ABrain is a phoneme-based speech recognition engine which shows up to a 97% discrete command recognition rate. SHE is a EuroBridge widget-based document editor that supports speech commands as well as direct manipulation interfaces.

  16. Robust Automatic Speech Recognition in Impulsive Noise Environment

    Institute of Scientific and Technical Information of China (English)

    DING Pei; CAO Zhigang

    2005-01-01

    This paper presents an efficient method to directly suppress the effect of impulsive noise for robust automatic speech recognition (ASR). In this method, according to the noise sensitivity of each feature dimension, the observation vectors are divided into several parts, each of which is assigned a proper threshold. In the recognition stage, the unreliable probability preponderance of an incorrect competing path caused by impulsive noise is eliminated by flooring the observation probability (FOP) of each feature sub-vector at the Gaussian mixture level, so that the correct path recovers its priority in decoding. Experimental results demonstrate that the proposed method can significantly improve recognition accuracy in both machine-gun noise and simulated impulsive noise environments, while maintaining high performance for clean speech recognition.

  17. Writing and Speech Recognition : Observing Error Correction Strategies of Professional Writers

    OpenAIRE

    Leijten, M.A.J.C.

    2007-01-01

    In this thesis we describe the organization of speech recognition based writing processes. Writing can be seen as a visual representation of spoken language: a combination that speech recognition takes full advantage of. In the field of writing research, speech recognition is a new writing instrument that may cause a shift in writing process research because the underlying processes are changing. In addition to this, we take advantage of one of the weak points of speech recognition, namely the...

  18. Finding Acoustic Regularities in Speech: Applications to Phonetic Recognition

    Science.gov (United States)

    1988-12-01

    Phonetic recognition can be viewed as a process through which the acoustic signal is mapped to a set of phonological units used to represent a lexicon... the phonological interpretation of the acoustic organization, only 5 regions which aligned with the phonetic transcription were used as training data... (RLE Technical Report No. 536, December 1988)

  19. Objective Gender and Age Recognition from Speech Sentences

    Directory of Open Access Journals (Sweden)

    Fatima K. Faek

    2015-10-01

    Full Text Available In this work, an automatic gender and age recognizer from speech is investigated. The features relevant to gender recognition are selected from the first four formant frequencies and twelve MFCCs and are fed to an SVM classifier, while the features relevant to age are used with a k-NN classifier for the age recognizer model, using MATLAB as a simulation tool. A special selection of robust features is used in this work to improve the results of the gender and age classifiers, based on the frequency range that each feature represents. The gender and age classification algorithms are evaluated using 114 (clean and noisy) speech samples uttered in the Kurdish language. The two-class gender recognition model (adult males and adult females) reached 96% recognition accuracy, while for three-category classification (adult males, adult females, and children) the model achieved 94% recognition accuracy. For the age recognition model, seven groups are categorized according to their ages; the model's performance after selecting the features relevant to age reached 75.3%. For further improvement, a de-noising technique is applied to the noisy speech signals, followed by selecting the proper features affected by the de-noising process, resulting in 81.44% recognition accuracy.

  20. Writing and Speech Recognition : Observing Error Correction Strategies of Professional Writers

    NARCIS (Netherlands)

    Leijten, M.A.J.C.

    2007-01-01

    In this thesis we describe the organization of speech recognition based writing processes. Writing can be seen as a visual representation of spoken language: a combination that speech recognition takes full advantage of. In the field of writing research, speech recognition is a new writing instrumen

  1. Pattern Recognition Methods and Features Selection for Speech Emotion Recognition System

    Science.gov (United States)

    Partila, Pavol; Voznak, Miroslav; Tovarek, Jaromir

    2015-01-01

    The impact of the classification method and feature selection on speech emotion recognition accuracy is discussed in this paper. Selecting the correct parameters in combination with the classifier is an important part of reducing the complexity of system computing. This step is necessary especially for systems that will be deployed in real-time applications. The reason for the development and improvement of speech emotion recognition systems is their wide usability in today's automatic voice-controlled systems. The Berlin database of emotional recordings was used in this experiment. The classification accuracy of artificial neural networks, k-nearest neighbours, and Gaussian mixture models is measured considering the selection of prosodic, spectral, and voice quality features. The purpose was to find an optimal combination of methods and group of features for stress detection in human speech. The research contribution lies in the design of the speech emotion recognition system due to its accuracy and efficiency. PMID:26346654

  2. Pattern Recognition Methods and Features Selection for Speech Emotion Recognition System

    Directory of Open Access Journals (Sweden)

    Pavol Partila

    2015-01-01

    Full Text Available The impact of the classification method and feature selection on speech emotion recognition accuracy is discussed in this paper. Selecting the correct parameters in combination with the classifier is an important part of reducing the complexity of system computing. This step is necessary especially for systems that will be deployed in real-time applications. The reason for the development and improvement of speech emotion recognition systems is their wide usability in today's automatic voice-controlled systems. The Berlin database of emotional recordings was used in this experiment. The classification accuracy of artificial neural networks, k-nearest neighbours, and Gaussian mixture models is measured considering the selection of prosodic, spectral, and voice quality features. The purpose was to find an optimal combination of methods and group of features for stress detection in human speech. The research contribution lies in the design of the speech emotion recognition system due to its accuracy and efficiency.

  3. Development of a speech recognition system for Spanish broadcast news

    NARCIS (Netherlands)

    Niculescu, Andreea; Jong, de Franciska

    2008-01-01

    This paper reports on the development process of a speech recognition system for Spanish broadcast news within the MESH FP6 project. The system uses the SONIC recognizer developed at the Center for Spoken Language Research (CSLR), University of Colorado. Acoustic and language models were trained usi

  4. How Aging Affects the Recognition of Emotional Speech

    Science.gov (United States)

    Paulmann, Silke; Pell, Marc D.; Kotz, Sonja A.

    2008-01-01

    To successfully infer a speaker's emotional state, diverse sources of emotional information need to be decoded. The present study explored to what extent emotional speech recognition of "basic" emotions (anger, disgust, fear, happiness, pleasant surprise, sadness) differs between different sex (male/female) and age (young/middle-aged) groups in a…

  5. Speech Recognition Using Neural Nets and Dynamic Time Warping

    Science.gov (United States)

    1988-12-01

    "Phonetic Typewriter," Computer, 21: 11-22 (March 1988). 7. Lippmann, Richard P. "An Introduction to Computing with Neural Nets," IEEE ASSP Magazine, 4: 4-22 (April 1987). 8. Kohonen, Teuvo, and others. "Phonotopic Maps - Insightful Representation of Phonological Features for Speech Recognition"

  6. Intonation and Dialog Context as Constraints for Speech Recognition.

    Science.gov (United States)

    Taylor, Paul; King, Simon; Isard, Stephen; Wright, Helen

    1998-01-01

    Describes how to use intonation and dialog context to improve the performance of an automatic speech-recognition system. Experiments utilized the DCIEM Maptask corpus, using a separate bigram language model for each type of move and showing that, with the correct move-specific language model for each utterance in the test set, the recognizer's…

  7. Spoken Word Recognition of Chinese Words in Continuous Speech

    Science.gov (United States)

    Yip, Michael C. W.

    2015-01-01

    The present study examined the role that the positional probability of syllables plays in the recognition of spoken words in continuous Cantonese speech. Because some sounds occur more frequently at the beginning or ending position of Cantonese syllables than others, this kind of probabilistic information about syllables may cue the locations…

  8. Improving user-friendliness by using visually supported speech recognition

    NARCIS (Netherlands)

    Waals, J.A.J.S.; Kooi, F.L.; Kriekaard, J.J.

    2002-01-01

    While speech recognition in principle may be one of the most natural interfaces, in practice it is not due to the lack of user-friendliness. Words are regularly interpreted wrong, and subjects tend to articulate in an exaggerated manner. We explored the potential of visually supported error correcti

  9. Automatic Speech Recognition: Reliability and Pedagogical Implications for Teaching Pronunciation

    Science.gov (United States)

    Kim, In-Seok

    2006-01-01

    This study examines the reliability of automatic speech recognition (ASR) software used to teach English pronunciation, focusing on one particular piece of software, "FluSpeak," as a typical example. Thirty-six Korean English as a Foreign Language (EFL) college students participated in an experiment in which they listened to 15 sentences…

  10. Speech emotion recognition based on statistical pitch model

    Institute of Scientific and Technical Information of China (English)

    WANG Zhiping; ZHAO Li; ZOU Cairong

    2006-01-01

    A modified Parzen-window method, which keeps high resolution at low frequencies and smoothness at high frequencies, is proposed to obtain the statistical model. Then, a gender classification method utilizing the statistical model is proposed, which achieves 98% accuracy in gender classification when long sentences are dealt with. After separating male and female voices, the mean and standard deviation of the speech training samples with different emotions are used to create the corresponding emotion models. Then the Bhattacharyya distance between the test sample and the statistical models of pitch is utilized for emotion recognition in speech. The normalization of pitch for male and female voices is also considered, in order to map them into a uniform space. Finally, a speech emotion recognition experiment based on K Nearest Neighbor shows that a correct rate of 81% is achieved, whereas it is only 73.85% if the traditional parameters are utilized.
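
    The Bhattacharyya distance between two univariate Gaussians (here standing in for the pitch statistics of a test sample and of an emotion model) has a closed form; a small sketch under the assumption that each model is summarized by a mean and a standard deviation, with illustrative numbers:

```python
import numpy as np

def bhattacharyya(mu1, sigma1, mu2, sigma2):
    """Bhattacharyya distance between two 1-D Gaussians."""
    v1, v2 = sigma1 ** 2, sigma2 ** 2
    return (0.25 * np.log(0.25 * (v1 / v2 + v2 / v1 + 2))
            + 0.25 * (mu1 - mu2) ** 2 / (v1 + v2))

# Pitch statistics (Hz) of a test utterance vs. two emotion models (illustrative)
test = (210.0, 35.0)
models = {"anger": (250.0, 50.0), "sadness": (180.0, 20.0)}
scores = {e: bhattacharyya(*test, *m) for e, m in models.items()}
print(min(scores, key=scores.get), scores)   # nearest emotion model wins
```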

  11. EMOTIONAL SPEECH RECOGNITION BASED ON SVM WITH GMM SUPERVECTOR

    Institute of Scientific and Technical Information of China (English)

    Chen Yanxiang; Xie Jian

    2012-01-01

    Emotion recognition from speech is an important field of research in human-computer interaction. In this letter, the framework of Support Vector Machines (SVM) with Gaussian Mixture Model (GMM) supervectors is introduced for emotional speech recognition. Because of the importance of variance in reflecting the distribution of speech, normalized mean vectors with the potential to exploit the information from the variance are adopted to form the GMM supervector. Comparative experiments from five aspects are conducted to study their corresponding effects on system performance. The experimental results, which indicate that the influence of the number of mixtures is strong while the influence of duration is weak, provide a basis for the training set selection of the Universal Background Model (UBM).
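
    One common way to build such a GMM supervector is to posterior-weight the means of a universal background model towards one utterance and stack the variance-normalized means into a single fixed-length vector for the SVM. A rough scikit-learn sketch; the weighting scheme, model sizes, and random data are illustrative, not the letter's exact recipe.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def gmm_supervector(ubm, frames):
    """Stack posterior-weighted, variance-normalized component means (C*D,)."""
    post = ubm.predict_proba(frames)                     # (T, C) responsibilities
    counts = post.sum(axis=0) + 1e-8                     # soft frame counts
    means = (post.T @ frames) / counts[:, None]          # per-component means
    norm = (means - ubm.means_) / np.sqrt(ubm.covariances_)
    return norm.ravel()

# Train a small diagonal-covariance UBM, then build one supervector per utterance
ubm = GaussianMixture(n_components=8, covariance_type="diag", random_state=0)
ubm.fit(np.random.randn(2000, 13))                       # pooled training frames
utterance = np.random.randn(150, 13)                     # frames of one utterance
sv = gmm_supervector(ubm, utterance)
print(sv.shape)                                          # (104,) = 8 * 13
```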

  12. Temporal visual cues aid speech recognition

    DEFF Research Database (Denmark)

    Zhou, Xiang; Ross, Lars; Lehn-Schiøler, Tue;

    2006-01-01

    that it is the temporal synchronicity of the visual input that aids parsing of the auditory stream. More specifically, we expected that purely temporal information, which does not convey information such as place of articulation, may facilitate word recognition. METHODS: To test this prediction we used temporal features...... of audio to generate an artificial talking-face video and measured word recognition performance on simple monosyllabic words. RESULTS: When presenting words together with the artificial video we find that word recognition is improved over purely auditory presentation. The effect is significant (p...

  13. Robust Speech Recognition Method Based on Discriminative Environment Feature Extraction

    Institute of Scientific and Technical Information of China (English)

    HAN Jiqing; GAO Wen

    2001-01-01

    It is an effective approach to learn the influence of environmental parameters, such as additive noise and channel distortions, from training data for robust speech recognition. Most of the previous methods are based on the maximum likelihood estimation criterion. However, these methods do not lead to a minimum error rate result. In this paper, a novel discriminative learning method for environmental parameters, based on the Minimum Classification Error (MCE) criterion, is proposed. In the method, a simple classifier and the Generalized Probabilistic Descent (GPD) algorithm are adopted to iteratively learn the environmental parameters. Consequently, the clean speech features are estimated from the noisy speech features with the estimated environmental parameters, and then the estimates of the clean speech features are utilized in the back-end HMM classifier. Experiments on a task of 18 confusable isolated Korean words show that a best error rate reduction of 32.1% is obtained relative to a conventional HMM system.

  14. A Research of Speech Emotion Recognition Based on Deep Belief Network and SVM

    Directory of Open Access Journals (Sweden)

    Chenchen Huang

    2014-01-01

    Full Text Available Feature extraction is a very important part of speech emotion recognition. Addressing the feature extraction problem, this paper proposes a new method that uses DBNs to extract emotional features from the speech signal automatically. A five-layer DBN is trained to extract speech emotion features, and multiple consecutive frames are incorporated to form a high-dimensional feature. The features trained in the DBN are the input of a nonlinear SVM classifier, and finally a speech emotion recognition system with multiple classifiers was achieved. The speech emotion recognition rate of the system reached 86.5%, which was 7% higher than that of the original method.
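
    A compact way to approximate the pipeline above — unsupervised layers that learn an emotional feature representation, followed by a nonlinear SVM — is to stack restricted Boltzmann machines in a scikit-learn pipeline. This substitutes BernoulliRBM layers for the paper's five-layer DBN, and the layer sizes and SVM settings are illustrative only.

        from sklearn.neural_network import BernoulliRBM
        from sklearn.pipeline import Pipeline
        from sklearn.preprocessing import MinMaxScaler
        from sklearn.svm import SVC

        # X: rows are utterance-level vectors built from several consecutive
        # frames of low-level features; y: emotion labels (assumed to exist).
        model = Pipeline([
            ("scale", MinMaxScaler()),            # RBMs expect inputs in [0, 1]
            ("rbm1", BernoulliRBM(n_components=256, learning_rate=0.05, n_iter=20)),
            ("rbm2", BernoulliRBM(n_components=128, learning_rate=0.05, n_iter=20)),
            ("svm", SVC(kernel="rbf", C=10.0)),   # nonlinear SVM on learned features
        ])
        # model.fit(X_train, y_train); model.score(X_test, y_test)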

  15. New Ideas for Speech Recognition and Related Technologies

    Energy Technology Data Exchange (ETDEWEB)

    Holzrichter, J F

    2002-06-17

    The ideas relating to the use of organ motion sensors for the purposes of speech recognition were first described by the author in spring 1994. During the past year, a series of productive collaborations between the author, Tom McEwan and Larry Ng ensued and have led to demonstrations, new sensor ideas, and algorithmic descriptions of a large number of speech recognition concepts. This document summarizes the basic concepts of recognizing speech once organ motions have been obtained. Micro power radars and their uses for the measurement of body organ motions, such as those of the heart and lungs, have been demonstrated by Tom McEwan over the past two years. McEwan and I conducted a series of experiments, using these instruments, on vocal organ motions beginning in late spring, during which we observed motions of vocal folds (i.e., cords), tongue, jaw, and related organs that are very useful for speech recognition and other purposes. These will be reviewed in a separate paper. Since late summer 1994, Lawrence Ng and I have worked to make many of the initial recognition ideas more rigorous and to investigate the applications of these new ideas to new speech recognition algorithms, to speech coding, and to speech synthesis. I introduce some of those ideas in section IV of this document, and we describe them more completely in the document following this one, UCRL-UR-120311. For the design and operation of micro-power radars and their application to body organ motions, the reader may contact Tom McEwan directly. The capability for using EM sensors (i.e., radar units) to measure body organ motions and positions has been available for decades. Impediments to their use appear to have been size, excessive power, lack of resolution, and lack of understanding of the value of organ motion measurements, especially as applied to speech related technologies. However, with the invention of very low power, portable systems, as demonstrated by McEwan at LLNL, researchers have begun…

  16. Biologically inspired emotion recognition from speech

    Science.gov (United States)

    Caponetti, Laura; Buscicchio, Cosimo Alessandro; Castellano, Giovanna

    2011-12-01

    Emotion recognition has become a fundamental task in human-computer interaction systems. In this article, we propose an emotion recognition approach based on biologically inspired methods. Specifically, emotion classification is performed using a long short-term memory (LSTM) recurrent neural network which is able to recognize long-range dependencies between successive temporal patterns. We propose to represent data using features derived from two different models: mel-frequency cepstral coefficients (MFCC) and the Lyon cochlear model. In the experimental phase, results obtained from the LSTM network and the two different feature sets are compared, showing that features derived from the Lyon cochlear model give better recognition results in comparison with those obtained with the traditional MFCC representation.
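
    A minimal sketch of the classification stage — an LSTM reading a sequence of frame-level features (MFCC here; the Lyon cochlear features would be swapped in the same way) and emitting one emotion label per utterance — might look like this in PyTorch; the feature dimension, hidden size and number of emotion classes are assumptions.

        import torch
        import torch.nn as nn

        class EmotionLSTM(nn.Module):
            """Classify an utterance from its sequence of MFCC frames."""
            def __init__(self, n_mfcc=13, hidden=64, n_emotions=6):
                super().__init__()
                self.lstm = nn.LSTM(input_size=n_mfcc, hidden_size=hidden,
                                    batch_first=True)
                self.out = nn.Linear(hidden, n_emotions)

            def forward(self, x):                 # x: (batch, frames, n_mfcc)
                _, (h_n, _) = self.lstm(x)        # h_n: (1, batch, hidden)
                return self.out(h_n[-1])          # logits per emotion

        logits = EmotionLSTM()(torch.randn(2, 100, 13))   # two dummy utterances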

  17. Biologically inspired emotion recognition from speech

    Directory of Open Access Journals (Sweden)

    Buscicchio Cosimo

    2011-01-01

    Full Text Available Emotion recognition has become a fundamental task in human-computer interaction systems. In this article, we propose an emotion recognition approach based on biologically inspired methods. Specifically, emotion classification is performed using a long short-term memory (LSTM) recurrent neural network which is able to recognize long-range dependencies between successive temporal patterns. We propose to represent data using features derived from two different models: mel-frequency cepstral coefficients (MFCC) and the Lyon cochlear model. In the experimental phase, results obtained from the LSTM network and the two different feature sets are compared, showing that features derived from the Lyon cochlear model give better recognition results in comparison with those obtained with the traditional MFCC representation.

  18. Text Independent Speaker Recognition and Speaker Independent Speech Recognition Using Iterative Clustering Approach

    Directory of Open Access Journals (Sweden)

    A.Revathi

    2009-11-01

    Full Text Available This paper presents the effectiveness of perceptual features and an iterative clustering approach for performing both speech and speaker recognition. The procedure used for formation of the training speech differs between the training models for speaker-independent speech recognition and text-independent speaker recognition. This work therefore mainly emphasizes the utilization of clustering models developed for the training data, obtaining accuracies of 91%, 91% and 99.5% for mel frequency perceptual linear predictive cepstrum with respect to three categories, namely speaker identification, isolated digit recognition and continuous speech recognition. This feature also produces a low equal error rate of 9%, which is used as a performance measure for speaker verification. The work is experimentally evaluated on the set of isolated digits and continuous speeches from the TI digits_1 and TI digits_2 databases for speech recognition, and on speeches of 50 speakers randomly chosen from the TIMIT database for speaker recognition. A noteworthy feature of the speaker recognition algorithm is that the testing procedure is evaluated on identical messages of all 50 speakers, with theoretical validation of the results using the F-ratio and statistical validation using the χ² distribution.

  19. Working Papers in Speech Recognition, 3

    Science.gov (United States)

    1974-04-01

    …as having a ternary value (+, -, or 0). Other than being ternary, as opposed to binary, these features bear some resemblance to the Jakobson… Waverly Press, Baltimore. Jakobson, R., G. Fant, and M. Halle (1951), Preliminaries to Speech Analysis, MIT. Postal, P. (1968a), Aspects of Phonological…

  20. Speech Recognition Using HMM with MFCC - An Analysis Using Frequency Spectral Decomposition Technique

    Directory of Open Access Journals (Sweden)

    Ibrahim Patel

    2010-12-01

    Full Text Available This paper presents an approach to the recognition of speech signals using frequency spectral information with the Mel frequency scale, to improve speech feature representation in an HMM-based recognition approach. Frequency spectral information is incorporated into the conventional Mel-spectrum-based speech recognition approach. The Mel frequency approach exploits the frequency observations for the speech signal at a given resolution, which results in overlapping resolution features and limits recognition. Resolution decomposition with frequency separation and mapping is therefore proposed for an HMM-based speech recognition system. Simulation results show an improvement in the quality metrics of speech recognition with respect to computational time and learning accuracy.
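
    Leaving aside the paper's specific frequency-decomposition front end, the generic MFCC-plus-HMM pipeline it builds on can be sketched as follows, using librosa for the features and hmmlearn for the Gaussian HMM; paths, state counts and iteration limits are placeholders.

        import librosa
        import numpy as np
        from hmmlearn.hmm import GaussianHMM

        def mfcc_frames(wav_path, n_mfcc=13):
            y, sr = librosa.load(wav_path, sr=None)
            return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).T   # (frames, coeffs)

        def train_word_model(wav_paths, n_states=5):
            feats = [mfcc_frames(p) for p in wav_paths]
            hmm = GaussianHMM(n_components=n_states, covariance_type="diag", n_iter=20)
            hmm.fit(np.vstack(feats), lengths=[len(f) for f in feats])
            return hmm

        # Recognition: score a test utterance against one HMM per word and
        # pick the word whose model gives the highest log-likelihood.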

  1. Design of an expert system for phonetic speech recognition

    Energy Technology Data Exchange (ETDEWEB)

    Carbonell, N.; Haton, J.P.; Pierrel, J.M.; Lonchamp, F.

    1983-07-01

    Expert systems have been extensively used as a means for integrating the expertise of a human being into an artificial intelligence system. The authors are presently designing an expert system which will integrate the strategy and the knowledge of a phonetician reading a speech spectrogram. Their goal is twofold, firstly to obtain a better insight into the acoustic-decoding of speech, and, secondly, to improve the efficiency of present automatic phonetic recognition systems. This paper presents a preliminary description of the project, especially the overall strategy of the expert and the role of duration parameters in the segmentation and identification processes.

  2. EMOTION RECOGNITION FROM SPEECH SIGNAL: REALIZATION AND AVAILABLE TECHNIQUES

    Directory of Open Access Journals (Sweden)

    NILIM JYOTI GOGOI

    2014-05-01

    Full Text Available The ability to detect human emotion from speech will be a great addition to the field of human-robot interaction. The aim of this work is to build an emotion recognition system using Mel-frequency cepstral coefficients (MFCC) and a Gaussian mixture model (GMM) classifier. The purpose of the work is to describe the best available methods for recognizing emotion from emotional speech. To that end, existing techniques and methods used for feature extraction and pattern classification are reviewed and discussed in this paper.

  3. An automatic speech recognition system with speaker-independent identification support

    Science.gov (United States)

    Caranica, Alexandru; Burileanu, Corneliu

    2015-02-01

    The novelty of this work relies on the application of an open source research software toolkit (CMU Sphinx) to train, build and evaluate a speech recognition system, with speaker-independent support, for voice-controlled hardware applications. Moreover, we propose to use the trained acoustic model to successfully decode offline voice commands on embedded hardware, such as an ARMv6 low-cost SoC, Raspberry PI. This type of single-board computer, mainly used for educational and research activities, can serve as a proof-of-concept software and hardware stack for low cost voice automation systems.
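
    For reference, decoding with the CMU Sphinx engine from Python can be as simple as the loop below, which assumes the older pocketsphinx-python bindings that ship a LiveSpeech iterator and the default English acoustic model; other versions expose a lower-level Decoder API instead.

        from pocketsphinx import LiveSpeech   # older pocketsphinx-python bindings

        # Listen on the default microphone and print each decoded utterance.
        # On a Raspberry Pi, restricting the decoder to a small keyword list
        # or grammar keeps recognition fast enough for voice control.
        for phrase in LiveSpeech():
            print("heard:", phrase)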

  4. An overview of the SPHINX speech recognition system

    Science.gov (United States)

    Lee, Kai-Fu; Hon, Hsiao-Wuen; Reddy, Raj

    1990-01-01

    A description is given of SPHINX, a system that demonstrates the feasibility of accurate, large-vocabulary, speaker-independent, continuous speech recognition. SPHINX is based on discrete hidden Markov models (HMMs) with linear-predictive-coding derived parameters. To provide speaker independence, knowledge was added to these HMMs in several ways: multiple codebooks of fixed-width parameters, and an enhanced recognizer with carefully designed models and word-duration modeling. To deal with coarticulation in continuous speech, yet still adequately represent a large vocabulary, two new subword speech units are introduced: function-word-dependent phone models and generalized triphone models. With grammars of perplexity 997, 60, and 20, SPHINX attained word accuracies of 71, 94, and 96 percent, respectively, on a 997-word task.

  5. Dynamic Bayesian Networks for Audio-Visual Speech Recognition

    Directory of Open Access Journals (Sweden)

    Liang Luhong

    2002-01-01

    Full Text Available The use of visual features in audio-visual speech recognition (AVSR) is justified by both the speech generation mechanism, which is essentially bimodal in audio and visual representation, and by the need for features that are invariant to acoustic noise perturbation. As a result, current AVSR systems demonstrate significant accuracy improvements in environments affected by acoustic noise. In this paper, we describe the use of two statistical models for audio-visual integration, the coupled HMM (CHMM) and the factorial HMM (FHMM), and compare the performance of these models with existing models used in speaker-dependent audio-visual isolated word recognition. The statistical properties of both the CHMM and the FHMM make it possible to model the state asynchrony of the audio and visual observation sequences while preserving their natural correlation over time. In our experiments, the CHMM performs best overall, outperforming all the existing models and the FHMM.

  6. Approximated mutual information training for speech recognition using myoelectric signals.

    Science.gov (United States)

    Guo, Hua J; Chan, A D C

    2006-01-01

    A new training algorithm called approximated maximum mutual information (AMMI) is proposed to improve the accuracy of myoelectric speech recognition using hidden Markov models (HMMs). Previous studies have demonstrated that automatic speech recognition can be performed using myoelectric signals from articulatory muscles of the face. Classification of facial myoelectric signals can be performed using HMMs that are trained using the maximum likelihood (ML) algorithm; however, this algorithm maximizes the likelihood of the observations in the training sequence, which is not directly associated with optimal classification accuracy. The AMMI training algorithm attempts to maximize the mutual information, thereby training the HMMs to optimize their parameters for discrimination. Our results show that AMMI training consistently reduces the error rates compared to those obtained with ML training, increasing the accuracy by approximately 3% on average.

  7. An audio-visual corpus for multimodal speech recognition in Dutch language

    NARCIS (Netherlands)

    Wojdel, J.; Wiggers, P.; Rothkrantz, L.J.M.

    2002-01-01

    This paper describes the gathering and availability of an audio-visual speech corpus for the Dutch language. The corpus was prepared with multi-modal speech recognition in mind and it is currently used in our research on lip-reading and bimodal speech recognition. It contains the prompts used also i…

  8. Study on Unequal Error Protection for Distributed Speech Recognition System

    Institute of Scientific and Technical Information of China (English)

    XIE Xiang; WANG Si-yao; LIU Jia-kang

    2006-01-01

    The unequal error protection (UEP) approach is applied in a distributed speech recognition (DSR) system and three schemes are proposed. All three schemes are evaluated on a GSM simulation platform for recognizing Mandarin digit strings and compared with the equal error protection (EEP) scheme. Experiments show that UEP can protect the data transmitted in the DSR system more effectively, which results in a higher word accuracy rate for the DSR system.

  9. Vocabulary and Environment Adaptation in Vocabulary-Independent Speech Recognition

    Science.gov (United States)

    1992-01-01

    …normalization (ISDCN) proposed by Acero et al. [2] for microphone adaptation are incorporated into our VI system to achieve environmental… reverberation from surface reflections, etc. Acero et al. [1,2] proposed a series of environment normalization algorithms based on joint… support. References: [1] Acero, A., Acoustical and Environmental Robustness in Automatic Speech Recognition, Department of Electrical Engineering…

  10. Syntactic error modeling and scoring normalization in speech recognition: Error modeling and scoring normalization in the speech recognition task for adult literacy training

    Science.gov (United States)

    Olorenshaw, Lex; Trawick, David

    1991-01-01

    The purpose was to develop a speech recognition system to be able to detect speech which is pronounced incorrectly, given that the text of the spoken speech is known to the recognizer. Better mechanisms are provided for using speech recognition in a literacy tutor application. Using a combination of scoring normalization techniques and cheater-mode decoding, a reasonable acceptance/rejection threshold was provided. In continuous speech, the system was tested to be able to provide above 80 pct. correct acceptance of words, while correctly rejecting over 80 pct. of incorrectly pronounced words.

  11. Merge-Weighted Dynamic Time Warping for Speech Recognition

    Institute of Scientific and Technical Information of China (English)

    张湘莉兰; 骆志刚; 李明

    2014-01-01

    Obtaining training material for rarely used English words and common given names from countries where English is not spoken is difficult due to excessive time, storage and cost factors. By considering personal privacy, language-independent (LI) with lightweight speaker-dependent (SD) automatic speech recognition (ASR) is a convenient option to solve the problem. The dynamic time warping (DTW) algorithm is the state-of-the-art algorithm for small-footprint SD ASR for real-time applications with limited storage and small vocabularies. These applications include voice dialing on mobile devices, menu-driven recognition, and voice control on vehicles and robotics. However, traditional DTW has several limitations, such as high computational complexity, constraint-induced coarse approximation, and inaccuracy problems. In this paper, we introduce the merge-weighted dynamic time warping (MWDTW) algorithm. This method defines a template confidence index for measuring the similarity between merged training data and testing data, while following the core DTW process. MWDTW is simple, efficient, and easy to implement. With extensive experiments on three representative SD speech recognition datasets, we demonstrate that our method significantly outperforms DTW, DTW on merged speech data, and the hidden Markov model (HMM), and is also about six times faster than DTW overall.
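
    The core DTW recurrence that MWDTW builds on is short enough to state directly; the sketch below is the plain (unweighted, unconstrained) dynamic time warping distance between two feature sequences, not the merge-weighted variant proposed in the paper.

        import numpy as np

        def dtw_distance(a, b):
            """Classic DTW between two feature sequences a (n x d) and b (m x d)."""
            n, m = len(a), len(b)
            D = np.full((n + 1, m + 1), np.inf)
            D[0, 0] = 0.0
            for i in range(1, n + 1):
                for j in range(1, m + 1):
                    cost = np.linalg.norm(a[i - 1] - b[j - 1])
                    D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
            return D[n, m]

        # Template matching: the test utterance is assigned the label of the
        # training template with the smallest warped distance.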

  12. Automatic Speech Acquisition and Recognition for Spacesuit Audio Systems

    Science.gov (United States)

    Ye, Sherry

    2015-01-01

    NASA has a widely recognized but unmet need for novel human-machine interface technologies that can facilitate communication during astronaut extravehicular activities (EVAs), when loud noises and strong reverberations inside spacesuits make communication challenging. WeVoice, Inc., has developed a multichannel signal-processing method for speech acquisition in noisy and reverberant environments that enables automatic speech recognition (ASR) technology inside spacesuits. The technology reduces noise by exploiting differences between the statistical nature of signals (i.e., speech) and noise that exists in the spatial and temporal domains. As a result, ASR accuracy can be improved to the level at which crewmembers will find the speech interface useful. System components and features include beam forming/multichannel noise reduction, single-channel noise reduction, speech feature extraction, feature transformation and normalization, feature compression, and ASR decoding. Arithmetic complexity models were developed and will help designers of real-time ASR systems select proper tasks when confronted with constraints in computational resources. In Phase I of the project, WeVoice validated the technology. The company further refined the technology in Phase II and developed a prototype for testing and use by suited astronauts.

  13. EXTENDED SPEECH EMOTION RECOGNITION AND PREDICTION

    Directory of Open Access Journals (Sweden)

    Theodoros Anagnostopoulos

    2014-11-01

    Full Text Available Humans are considered to reason and act rationally, and that is believed to be their fundamental difference from the rest of living entities. Furthermore, modern approaches in the science of psychology underline that humans, as thinking creatures, are also sentimental and emotional organisms. There are fifteen universal extended emotions plus a neutral emotion: hot anger, cold anger, panic, fear, anxiety, despair, sadness, elation, happiness, interest, boredom, shame, pride, disgust, contempt and the neutral position. The scope of the current research is to understand the emotional state of a human being by capturing the speech utterances used during a common conversation. It is shown that, given enough acoustic evidence, the emotional state of a person can be classified by a set of majority voting classifiers. The proposed set of classifiers is based on three main classifiers: kNN, C4.5 and SVM with an RBF kernel. This set achieves better performance than each basic classifier taken separately. It is compared with two other sets of classifiers: one-against-all (OAA) multiclass SVM with hybrid kernels, and a set consisting of two basic classifiers, C5.0 and a neural network. The proposed variant achieves better performance than the other two sets of classifiers. The paper deals with emotion classification by a set of majority voting classifiers that combines three types of basic classifiers with low computational complexity. The basic classifiers stem from different theoretical backgrounds in order to avoid bias and redundancy, which gives the proposed set of classifiers the ability to generalize in the emotion domain space.
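
    The majority-voting combination described above maps naturally onto scikit-learn's VotingClassifier; in the sketch below a CART decision tree stands in for C4.5, and the feature matrices X_train/y_train are assumed to hold the acoustic evidence and emotion labels.

        from sklearn.ensemble import VotingClassifier
        from sklearn.neighbors import KNeighborsClassifier
        from sklearn.svm import SVC
        from sklearn.tree import DecisionTreeClassifier   # stand-in for C4.5

        # Hard (majority) voting over three heterogeneous base classifiers,
        # mirroring the kNN / decision-tree / RBF-SVM combination in the abstract.
        ensemble = VotingClassifier(
            estimators=[("knn", KNeighborsClassifier(n_neighbors=5)),
                        ("tree", DecisionTreeClassifier()),
                        ("svm", SVC(kernel="rbf", gamma="scale"))],
            voting="hard")
        # ensemble.fit(X_train, y_train); ensemble.predict(X_test)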

  14. Error analysis to improve the speech recognition accuracy on Telugu language

    Indian Academy of Sciences (India)

    N Usha Rani; P N Girija

    2012-12-01

    Speech is one of the most important communication channels among people, and speech recognition occupies a prominent place in communication between humans and machines. Several factors affect the accuracy of a speech recognition system. Although much effort has been invested in increasing accuracy, current speech recognition systems still generate erroneous output. Telugu is one of the most widely spoken south Indian languages. In the proposed Telugu speech recognition system, errors obtained from the decoder are analysed to improve the performance of the system. The static pronunciation dictionary plays a key role in speech recognition accuracy, so modifications are performed in the dictionary used by the decoder. This modification reduces the number of confusion pairs, which improves the performance of the speech recognition system. Language model scores also vary with this modification, the hit rate increases considerably, and the false alarms change with the modification of the pronunciation dictionary. Variations are observed in different error measures such as F-measure, error rate and Word Error Rate (WER) with the application of the proposed method.

  15. A study of speech emotion recognition based on hybrid algorithm

    Science.gov (United States)

    Zhu, Ju-xia; Zhang, Chao; Lv, Zhao; Rao, Yao-quan; Wu, Xiao-pei

    2011-10-01

    To effectively improve the recognition accuracy of a speech emotion recognition system, a hybrid algorithm which combines the Continuous Hidden Markov Model (CHMM), the All-Class-in-One Neural Network (ACON) and the Support Vector Machine (SVM) is proposed. In the SVM and ACON methods, global statistics are used as emotional features, while in the CHMM method, instantaneous features are employed. The recognition rate of the proposed method is 92.25%, with a rejection rate of 0.78%. Furthermore, it obtains relative improvements of 8.53%, 4.69% and 0.78% compared with the ACON, CHMM and SVM methods, respectively. The experimental results confirm its efficiency in distinguishing the anger, happiness, neutral and sadness emotional states.

  16. Studies on inter-speaker variability in speech and its application in automatic speech recognition

    Indian Academy of Sciences (India)

    S Umesh

    2011-10-01

    In this paper, we give an overview of the problem of inter-speaker variability and its study in many diverse areas of speech signal processing. We first give an overview of vowel-normalization studies that minimize variations in the acoustic representation of vowel realizations by different speakers. We then describe the universal-warping approach to speaker normalization which unifies many of the vowel normalization approaches and also shows the relation between speech production, perception and auditory processing. We then address the problem of inter-speaker variability in automatic speech recognition (ASR) and describe techniques that are used to reduce these effects and thereby improve the performance of speaker-independent ASR systems.

  17. Improving on hidden Markov models: An articulatorily constrained, maximum likelihood approach to speech recognition and speech coding

    Energy Technology Data Exchange (ETDEWEB)

    Hogden, J.

    1996-11-05

    The goal of the proposed research is to test a statistical model of speech recognition that incorporates the knowledge that speech is produced by relatively slow motions of the tongue, lips, and other speech articulators. This model is called Maximum Likelihood Continuity Mapping (Malcom). Many speech researchers believe that by using constraints imposed by articulator motions, we can improve or replace the current hidden Markov model based speech recognition algorithms. Unfortunately, previous efforts to incorporate information about articulation into speech recognition algorithms have suffered because (1) slight inaccuracies in our knowledge or the formulation of our knowledge about articulation may decrease recognition performance, (2) small changes in the assumptions underlying models of speech production can lead to large changes in the speech derived from the models, and (3) collecting measurements of human articulator positions in sufficient quantity for training a speech recognition algorithm is still impractical. The most interesting (and in fact, unique) quality of Malcom is that, even though Malcom makes use of a mapping between acoustics and articulation, Malcom can be trained to recognize speech using only acoustic data. By learning the mapping between acoustics and articulation using only acoustic data, Malcom avoids the difficulties involved in collecting articulator position measurements and does not require an articulatory synthesizer model to estimate the mapping between vocal tract shapes and speech acoustics. Preliminary experiments that demonstrate that Malcom can learn the mapping between acoustics and articulation are discussed. Potential applications of Malcom aside from speech recognition are also discussed. Finally, specific deliverables resulting from the proposed research are described.

  18. Comparing Speech Recognition Systems (Microsoft API, Google API and CMU Sphinx)

    Directory of Open Access Journals (Sweden)

    Veton Këpuska

    2017-03-01

    Full Text Available The idea of this paper is to design a tool that will be used to test and compare commercial speech recognition systems, such as the Microsoft Speech API and the Google Speech API, with open-source speech recognition systems such as Sphinx-4. The best way to compare automatic speech recognition systems in different environments is by using audio recordings selected from different sources and calculating the word error rate (WER). Although the WERs of the three aforementioned systems were acceptable, it was observed that the Google API is superior.
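
    The comparison metric itself, word error rate, is the word-level Levenshtein distance between the reference transcript and the recognizer output, normalized by the reference length; a small self-contained sketch:

        def word_error_rate(reference, hypothesis):
            """Levenshtein distance over words, normalized by reference length."""
            ref, hyp = reference.split(), hypothesis.split()
            d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
            for i in range(len(ref) + 1):
                d[i][0] = i
            for j in range(len(hyp) + 1):
                d[0][j] = j
            for i in range(1, len(ref) + 1):
                for j in range(1, len(hyp) + 1):
                    sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
                    d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
            return d[len(ref)][len(hyp)] / len(ref)

        print(word_error_rate("turn the light on", "turn light off"))  # 0.5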

  19. Variable Frame Rate and Length Analysis for Data Compression in Distributed Speech Recognition

    DEFF Research Database (Denmark)

    Kraljevski, Ivan; Tan, Zheng-Hua

    2014-01-01

    This paper addresses the issue of data compression in distributed speech recognition on the basis of a variable frame rate and length analysis method. The method first conducts frame selection by using an a posteriori signal-to-noise ratio weighted energy distance to find the right time resolution… length for steady regions. The method is applied to scalable source coding in distributed speech recognition where the target bitrate is met by adjusting the frame rate. Speech recognition results show that the proposed approach outperforms other compression methods in terms of recognition accuracy… for noisy speech while achieving higher compression rates.

  20. Speech recognition based on a combination of acoustic features with articulatory information

    Institute of Scientific and Technical Information of China (English)

    LU Xugang; DANG Jianwu

    2005-01-01

    The contributions of static and dynamic articulatory information to speech recognition were evaluated, and recognition approaches combining articulatory information with acoustic features were discussed. Articulatory movements were observed with an Electromagnetic Articulographic system for read speech, and the speech signals were recorded simultaneously. First, we conducted several speech recognition experiments using articulatory features alone, consisting of a number of specific articulatory channels, to evaluate the contribution of each observation point on the articulators. Then, the displacement information of the articulatory data was combined directly with acoustic features and adopted in speech recognition. The results show that articulatory information provides additional information for speech recognition which is not encoded in the acoustic features. Furthermore, the contribution of the dynamic information of the articulatory data was evaluated by combining it in speech recognition. It is found that the second derivative of the articulatory information contributes considerably more to speech recognition than the second derivative of the acoustic information. Finally, methods of combining articulatory and acoustic features were investigated for speech recognition. The basic approach is that a Bayesian Network (BN) is added to each state of the HMM, where the articulatory information is represented by the BN as a factor of the observed signals during model training and is marginalized as a hidden variable in the recognition stage. Results based on this HMM/BN framework show a better performance than the traditional method.

  1. Speaker-Adaptive Speech Recognition Based on Surface Electromyography

    Science.gov (United States)

    Wand, Michael; Schultz, Tanja

    We present our recent advances in silent speech interfaces using electromyographic signals that capture the movements of the human articulatory muscles at the skin surface for recognizing continuously spoken speech. Previous systems were limited to speaker- and session-dependent recognition tasks on small amounts of training and test data. In this article we present speaker-independent and speaker-adaptive training methods which allow us to use a large corpus of data from many speakers to train acoustic models more reliably. We use the speaker-dependent system as baseline, carefully tuning the data preprocessing and acoustic modeling. Then on our corpus we compare the performance of speaker-dependent and speaker-independent acoustic models and carry out model adaptation experiments.

  2. Adaptive Compensation Algorithm in Open Vocabulary Mandarin Speaker-Independent Speech Recognition

    Institute of Scientific and Technical Information of China (English)

    2002-01-01

    In speech recognition systems, the physiological characteristics of the speech production model cause the voiced sections of the speech signal to have an attenuation of approximately 20 dB per decade. Many speech recognition algorithms have been developed to solve this problem by filtering the input signal with a single-zero high pass filter. Unfortunately, this technique increases the noise energy at high frequencies above 4 kHz, which in some cases degrades the recognition accuracy. This paper solves the problem using a pre-emphasis filter in the front end of the recognizer. The aim is to develop a modified parameterization approach taking into account the whole energy zone in the spectrum to improve the performance of the existing baseline recognition system in the acoustic phase. The results show that a large vocabulary speaker-independent continuous speech recognition system using this approach has a greatly improved recognition rate.
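
    The single-zero high-pass (pre-emphasis) filter discussed above is one line of signal processing; a sketch with the commonly used coefficient 0.97 (the paper's exact parameterization may differ):

        import numpy as np

        def pre_emphasize(signal, alpha=0.97):
            """First-order high-pass pre-emphasis: y[n] = x[n] - alpha * x[n-1].

            Boosts high frequencies by roughly +6 dB/octave to counteract the
            spectral tilt of voiced speech before feature extraction.
            """
            return np.append(signal[0], signal[1:] - alpha * signal[:-1])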

  3. The influence of age, hearing, and working memory on the speech comprehension benefit derived from an automatic speech recognition system

    NARCIS (Netherlands)

    Zekveld, A.A.; Kramer, S.E.; Kessens, J.M.; Vlaming, M.S.M.G.; Houtgast, T.

    2009-01-01

    Objective: The aim of the current study was to examine whether partly incorrect subtitles that are automatically generated by an Automatic Speech Recognition (ASR) system improve speech comprehension by listeners with hearing impairment. In an earlier study (Zekveld et al. 2008), we showed that spe…

  4. Statistic Model Based Dynamic Channel Compensation for Telephony Speech Recognition

    Institute of Scientific and Technical Information of China (English)

    ZHANG Huayun; HAN Zhaobing; XU Bo

    2004-01-01

    The degradation of speech recognition performance in real-life environments and over transmission channels is a major obstacle for many speech-based applications around the world, especially when nonstationary noise and changing channels exist. Previous works have shown that the main reason for this performance degradation is the mismatch caused by different telephone channels between the testing and training sets. In this paper, we propose a statistical-model-based implementation to dynamically compensate for this mismatch. Firstly, we focus on a Maximum-likelihood (ML) estimation algorithm for telephone channels. In experiments on Mandarin Large vocabulary continuous speech recognition (LVCSR) over telephone lines, the Character error rate (CER) decreases by more than 20%. The average delay is about 300-400 ms. Secondly, we extend it by introducing a phone-conditioned prior statistical model for the channels and applying the Maximum a posteriori (MAP) estimation technique. Compared to the ML-based method, the MAP-based algorithm tracks the variations within channels more effectively. The average delay of the algorithm is decreased to 200 ms. An additional 7-8% relative CER reduction is observed in LVCSR.

  5. Distributed Speech Recognition Systems and Some Key Factors Affecting Their Performance

    Institute of Scientific and Technical Information of China (English)

    YE Lei; YANG Zhen

    2003-01-01

    In this paper we first analyze the Distributed Speech Recognition (DSR) system and the key factors that affect its performance, and then focus on the relationship between the length of the testing speech and the recognition accuracy of the system. Some experimental results are given at the end.

  6. Modeling words with subword units in an articulatorily constrained speech recognition algorithm

    Energy Technology Data Exchange (ETDEWEB)

    Hogden, J.

    1997-11-20

    The goal of speech recognition is to find the most probable word given the acoustic evidence, i.e. a string of VQ codes or acoustic features. Speech recognition algorithms typically take advantage of the fact that the probability of a word, given a sequence of VQ codes, can be calculated.
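
    Concretely, "the most probable word given the acoustic evidence" is the Bayes decision rule: choose the word W maximizing P(A|W)P(W). A toy log-domain sketch with made-up scores:

        import math

        # log P(acoustics | word): per-word acoustic model scores (hypothetical)
        log_acoustic = {"yes": -42.0, "yet": -45.5, "jet": -47.0}
        # log P(word): language-model prior (hypothetical)
        log_prior = {"yes": math.log(0.6), "yet": math.log(0.3), "jet": math.log(0.1)}

        # argmax_W  log P(A | W) + log P(W)
        best = max(log_acoustic, key=lambda w: log_acoustic[w] + log_prior[w])
        print(best)  # -> "yes"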

  7. Supporting Dictation Speech Recognition Error Correction: The Impact of External Information

    Science.gov (United States)

    Shi, Yongmei; Zhou, Lina

    2011-01-01

    Although speech recognition technology has made remarkable progress, its wide adoption is still restricted by notable effort made and frustration experienced by users while correcting speech recognition errors. One of the promising ways to improve error correction is by providing user support. Although support mechanisms have been proposed for…

  8. Suprasegmental lexical stress cues in visual speech can guide spoken-word recognition

    NARCIS (Netherlands)

    Jesse, A.; McQueen, J.M.

    2014-01-01

    Visual cues to the individual segments of speech and to sentence prosody guide speech recognition. The present study tested whether visual suprasegmental cues to the stress patterns of words can also constrain recognition. Dutch listeners use acoustic suprasegmental cues to lexical stress (changes i…

  9. On model architecture for a children's speech recognition interactive dialog system

    OpenAIRE

    Kraleva, Radoslava; Kralev, Velin

    2016-01-01

    This report presents a general model of the architecture of information systems for the speech recognition of children. It presents a model of the speech data stream and how it works. The results of these studies and the presented architectural model show that research needs to be focused on acoustic-phonetic modeling in order to improve the quality of children's speech recognition and the robustness of the systems to noise and changes in the transmission environment. Another important aspe...

  10. Audibility-based predictions of speech recognition for children and adults with normal hearing.

    Science.gov (United States)

    McCreery, Ryan W; Stelmachowicz, Patricia G

    2011-12-01

    This study investigated the relationship between audibility and predictions of speech recognition for children and adults with normal hearing. The Speech Intelligibility Index (SII) is used to quantify the audibility of speech signals and can be applied to transfer functions to predict speech recognition scores. Although the SII is used clinically with children, relatively few studies have evaluated SII predictions of children's speech recognition directly. Children have required more audibility than adults to reach maximum levels of speech understanding in previous studies. Furthermore, children may require greater bandwidth than adults for optimal speech understanding, which could influence frequency-importance functions used to calculate the SII. Speech recognition was measured for 116 children and 19 adults with normal hearing. Stimulus bandwidth and background noise level were varied systematically in order to evaluate speech recognition as predicted by the SII and derive frequency-importance functions for children and adults. Results suggested that children required greater audibility to reach the same level of speech understanding as adults. However, differences in performance between adults and children did not vary across frequency bands.

  11. Difficulties in Automatic Speech Recognition of Dysarthric Speakers and Implications for Speech-Based Applications Used by the Elderly: A Literature Review

    Science.gov (United States)

    Young, Victoria; Mihailidis, Alex

    2010-01-01

    Despite their growing presence in home computer applications and various telephony services, commercial automatic speech recognition technologies are still not easily employed by everyone; especially individuals with speech disorders. In addition, relatively little research has been conducted on automatic speech recognition performance with older…

  12. Speech Recognition for Environmental Control: Effect of Microphone Type, Dysarthria, and Severity on Recognition Results.

    Science.gov (United States)

    Fager, Susan Koch; Burnfield, Judith M

    2015-01-01

    This study examines the use of commercially available automatic speech recognition (ASR) across microphone options as access to environmental control for individuals with and without dysarthria. A study of two groups of speakers (typical speech and dysarthria) was conducted to understand their performance using ASR and various microphones for environmental control. Specifically, the dependent variables examined included attempts per command, recognition accuracy, frequency of error type, and perceived workload. A further sub-analysis of the group of participants with dysarthria examined the impact of severity. Results indicated that a significantly larger number of attempts was required (P = 0.007), and significantly lower recognition accuracies were achieved, by the participants with dysarthria (P = 0.010). A sub-analysis examining severity demonstrated no significant differences between the typical speakers and participants with mild dysarthria. However, significant differences were evident (P = 0.007, P = 0.008) between participants with mild and moderate-severe dysarthria. No significant differences existed across microphones. Threshold errors occurred more frequently for typical participants, and no-response errors for participants with moderate-severe dysarthria. There were no significant differences on the NASA Task Load Index.

  13. Automated Defect Inspection Systems by Pattern Recognition

    Directory of Open Access Journals (Sweden)

    Mira Park

    2009-06-01

    Full Text Available Visual inspection and classification of cigarettes packaged in a tin container is very important in manufacturing cigarette products that require high quality package presentation. For accurate automated inspection and classification, computer vision has been deployed widely in manufacturing. We present the detection of defective packaging of tins of cigarettes by identifying individual objects in the cigarette tins. Object identification information is used for the classification of acceptable cases (correctly packaged tins) and defective cases (incorrectly packaged tins). This paper investigates the problem of identifying the individual cigarettes and a paper spoon in the packaged tin using image processing and morphology operations. The segmentation performance was evaluated on 500 images including examples of both good cases and defective cases.

  14. A Support Vector Machine-Based Dynamic Network for Visual Speech Recognition Applications

    Directory of Open Access Journals (Sweden)

    Mihaela Gordan

    2002-11-01

    Full Text Available Visual speech recognition is an emerging research field. In this paper, we examine the suitability of support vector machines for visual speech recognition. Each word is modeled as a temporal sequence of visemes corresponding to the different phones realized. One support vector machine is trained to recognize each viseme and its output is converted to a posterior probability through a sigmoidal mapping. To model the temporal character of speech, the support vector machines are integrated as nodes into a Viterbi lattice. We test the performance of the proposed approach on a small visual speech recognition task, namely the recognition of the first four digits in English. The word recognition rate obtained is at the level of the previous best reported rates.
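
    The sigmoidal mapping mentioned above is Platt-style calibration of the SVM decision value into a posterior probability; in scikit-learn this is available through CalibratedClassifierCV, as in the sketch below (classifier choice and cv setting are illustrative).

        from sklearn.calibration import CalibratedClassifierCV
        from sklearn.svm import LinearSVC

        # Platt-style sigmoid mapping: an SVM's raw decision value f(x) is turned
        # into a posterior P(viseme | x) = 1 / (1 + exp(A * f(x) + B)), with A, B
        # fitted on held-out folds; the posteriors can then feed a Viterbi lattice.
        svm_posterior = CalibratedClassifierCV(LinearSVC(), method="sigmoid", cv=3)
        # svm_posterior.fit(X_train, y_train)
        # viterbi_emissions = svm_posterior.predict_proba(X_test)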

  15. Automated Robot with Object Recognition and Handling Features

    Directory of Open Access Journals (Sweden)

    Amiraj Dhawan

    2013-06-01

    Full Text Available With the advent of new technologies, every industry is moving towards automation. A large number of jobs in industries, such as Manufacturing, are performed repeatedly. These jobs require a lot of human effort. In such cases, there is a need of an automated robot which can perform the repetitive task more efficiently. This paper is about a robot which has object recognition and handling features. The robot will optically recognize the objects and pick and place them as per the hand gestures given by the user. It will have a camera to capture image of the objects and one arm to perform the pick and place function.

  16. Automated License Plate Recognition for Toll Booth Application

    Directory of Open Access Journals (Sweden)

    Ketan S. Shevale

    2014-10-01

    Full Text Available This paper describes the Smart Vehicle Screening System, which can be installed into a tollbooth for automated recognition of vehicle license plate information using a photograph of a vehicle. An automated system could then be implemented to control the payment of fees, parking areas, highways, bridges or tunnels, etc. An approach is presented for identifying a vehicle by recognizing its license plate using image fusion, neural networks and threshold techniques, together with some experimental results on successful license plate recognition.

  17. A Weighted Discrete KNN Method for Mandarin Speech and Emotion Recognition

    OpenAIRE

    Pao, Tsang-Long; Liao, Wen-Yuan; Chen, Yu-Te

    2008-01-01

    In this chapter, we present a speech emotion recognition system to compare several classifiers on the clean speech and noisy speech. Our proposed WD-KNN classifier outperforms the other three KNN-based classifiers at every SNR level and achieves highest accuracy from clean speech to 20dB noisy speech when compared with all other classifiers. Similar to (Neiberg et al, 2006), GMM is a feasible technique for emotion classification on the frame level and the results of GMM are better than perfor...

  18. Man-system interface based on automatic speech recognition: integration to a virtual control desk

    Energy Technology Data Exchange (ETDEWEB)

    Jorge, Carlos Alexandre F.; Mol, Antonio Carlos A.; Pereira, Claudio M.N.A.; Aghina, Mauricio Alves C., E-mail: calexandre@ien.gov.b, E-mail: mol@ien.gov.b, E-mail: cmnap@ien.gov.b, E-mail: mag@ien.gov.b [Instituto de Engenharia Nuclear (IEN/CNEN-RJ), Rio de Janeiro, RJ (Brazil); Nomiya, Diogo V., E-mail: diogonomiya@gmail.co [Universidade Federal do Rio de Janeiro (UFRJ), RJ (Brazil)

    2009-07-01

    This work reports the implementation of a man-system interface based on automatic speech recognition, and its integration into a virtual nuclear power plant control desk. The latter is aimed at reproducing a real control desk using virtual reality technology, for operator training and ergonomic evaluation purposes. An automatic speech recognition system was developed to serve as a new interface with users, substituting for computer keyboard and mouse. Users can operate this virtual control desk in front of a computer monitor or a projection screen through spoken commands. The automatic speech recognition interface developed is based on a well-known signal processing technique named cepstral analysis, and on artificial neural networks. The speech recognition interface is described, along with its integration with the virtual control desk, and results are presented. (author)
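
    The cepstral analysis front end referred to above amounts to taking the inverse FFT of the log magnitude spectrum of each windowed frame; a minimal sketch (window choice and coefficient count are assumptions, not the authors' settings):

        import numpy as np

        def real_cepstrum(frame):
            """Real cepstrum of one windowed speech frame:
            c = IFFT( log |FFT(x)| ).  Low-quefrency coefficients describe the
            vocal-tract envelope and are typical inputs to a neural classifier."""
            spectrum = np.fft.rfft(frame * np.hamming(len(frame)))
            log_mag = np.log(np.abs(spectrum) + 1e-10)   # avoid log(0)
            return np.fft.irfft(log_mag)

        # e.g. keep the first 12-16 coefficients of each frame as command features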

  19. Machine learning based sample extraction for automatic speech recognition using dialectal Assamese speech.

    Science.gov (United States)

    Agarwalla, Swapna; Sarma, Kandarpa Kumar

    2016-06-01

    Automatic Speaker Recognition (ASR) and related issues are continuously evolving as inseparable elements of Human Computer Interaction (HCI). With the assimilation of emerging concepts like big data and the Internet of Things (IoT) as extended elements of HCI, ASR techniques are passing through a paradigm shift. Of late, learning-based techniques have started to receive greater attention from research communities related to ASR, owing to the fact that they possess a natural ability to mimic biological behavior and in that way aid ASR modeling and processing. Current learning-based ASR techniques are evolving further with the incorporation of big data and IoT-like concepts. In this paper, we report certain approaches based on machine learning (ML) used for the extraction of relevant samples from a big data space, and apply them to ASR using certain soft computing techniques for Assamese speech with dialectal variations. A class of ML techniques comprising the basic Artificial Neural Network (ANN) in feedforward (FF) and Deep Neural Network (DNN) forms, using raw speech, extracted features and frequency domain forms, is considered. The Multi Layer Perceptron (MLP) is configured with inputs in several forms to learn class information obtained using clustering and manual labeling. DNNs are also used to extract specific sentence types. Initially, relevant samples are selected and assimilated from a large storage. Next, a few conventional methods are used for feature extraction of a few selected types. The features comprise both spectral and prosodic types. These are applied to Recurrent Neural Network (RNN) and Fully Focused Time Delay Neural Network (FFTDNN) structures to evaluate their performance in recognizing mood, dialect, speaker and gender variations in dialectal Assamese speech. The system is tested under several background noise conditions by considering the recognition rates (obtained using confusion matrices and manually) and computation time…

  20. Developing and Evaluating an Oral Skills Training Website Supported by Automatic Speech Recognition Technology

    Science.gov (United States)

    Chen, Howard Hao-Jan

    2011-01-01

    Oral communication ability has become increasingly important to many EFL students. Several commercial software programs based on automatic speech recognition (ASR) technologies are available but their prices are not affordable for many students. This paper will demonstrate how the Microsoft Speech Application Software Development Kit (SASDK), a…

  1. Influence of native and non-native multitalker babble on speech recognition in noise

    Directory of Open Access Journals (Sweden)

    Chandni Jain

    2014-03-01

    Full Text Available The aim of the study was to assess speech recognition in noise using multitalker babble of native and non-native languages at two different signal to noise ratios. Speech recognition in noise was assessed in 60 participants (18 to 30 years) with normal hearing sensitivity, having Malayalam or Kannada as their native language. For this purpose, 6- and 10-talker babble were generated in the Kannada and Malayalam languages. Speech recognition was assessed for native listeners of both languages in the presence of native and non-native multitalker babble. Results showed that speech recognition in noise was significantly higher at 0 dB signal to noise ratio (SNR) compared to -3 dB SNR for both languages. The performance of Kannada listeners was significantly higher in the presence of native (Kannada) babble compared to non-native (Malayalam) babble. However, this was not the case for the Malayalam listeners, who performed equally well with native (Malayalam) and non-native (Kannada) babble. The results of the present study highlight the importance of using native multitalker babble for Kannada listeners in lieu of non-native babble, and of considering the importance of each SNR for estimating speech recognition in noise scores. Further research is needed to assess speech recognition in Malayalam listeners in the presence of other non-native backgrounds of various types.

  2. Adoption of Speech Recognition Technology in Community Healthcare Nursing.

    Science.gov (United States)

    Al-Masslawi, Dawood; Block, Lori; Ronquillo, Charlene

    2016-01-01

    Adoption of new health information technology is known to be challenging. However, the degree to which new technology will be adopted can be predicted by measures of usefulness and ease of use. In this work, these key determining factors are the focus of the design of a wound documentation tool. In the context of wound care at home, and consistent with evidence in the literature from similar settings, the use of Speech Recognition Technology (SRT) for patient documentation has shown promise. To achieve a user-centred design, the results of an ethnographic fieldwork study are used to inform SRT features; furthermore, exploratory prototyping is used to collect feedback about the wound documentation tool from home care nurses. During this study, measures developed for healthcare applications of the Technology Acceptance Model will be used to identify SRT features that improve usefulness (e.g. increased accuracy, saving time) or ease of use (e.g. lowering mental/physical effort, easy to remember tasks). The identified features will be used to create a low fidelity prototype that will be evaluated in future experiments.

  3. Effect of Speaker Age on Speech Recognition and Perceived Listening Effort in Older Adults with Hearing Loss

    Science.gov (United States)

    McAuliffe, Megan J.; Wilding, Phillipa J.; Rickard, Natalie A.; O'Beirne, Greg A.

    2012-01-01

    Purpose: Older adults exhibit difficulty understanding speech that has been experimentally degraded. Age-related changes to the speech mechanism lead to natural degradations in signal quality. We tested the hypothesis that older adults with hearing loss would exhibit declines in speech recognition when listening to the speech of older adults,…

  4. Impact of noise and other factors on speech recognition in anaesthesia

    DEFF Research Database (Denmark)

    Alapetite, Alexandre

    2008-01-01

    …operations. Objective: The aim of the experiment is to evaluate the relative impact of several factors affecting speech recognition when used in operating rooms, such as the type or loudness of background noises, type of microphone, type of recognition mode (free speech versus command mode), and type of training. Methods: Eight volunteers read aloud a total of about 3 600 typical short anaesthesia comments to be transcribed by a continuous speech recognition system. Background noises were collected in an operating room and reproduced. A regression analysis and descriptive statistics were done to evaluate the effect; recognition rates for common noises (e.g. ventilation, alarms) are only slightly below rates obtained in a quiet environment. Finally, a redundant architecture succeeds in improving the reliability of the recognitions. Conclusion: This study removes some uncertainties regarding the feasibility…

  5. A Russian Keyword Spotting System Based on Large Vocabulary Continuous Speech Recognition and Linguistic Knowledge

    Directory of Open Access Journals (Sweden)

    Valentin Smirnov

    2016-01-01

    Full Text Available The paper describes the key concepts of a word spotting system for Russian based on large vocabulary continuous speech recognition. Key algorithms and system settings are described, including the pronunciation variation algorithm, and the experimental results on the real-life telecom data are provided. The description of system architecture and the user interface is provided. The system is based on CMU Sphinx open-source speech recognition platform and on the linguistic models and algorithms developed by Speech Drive LLC. The effective combination of baseline statistic methods, real-world training data, and the intensive use of linguistic knowledge led to a quality result applicable to industrial use.

  6. Prediction of Speech Recognition in Cochlear Implant Users by Adapting Auditory Models to Psychophysical Data

    Directory of Open Access Journals (Sweden)

    Svante Stadler

    2009-01-01

    Full Text Available Users of cochlear implants (CIs) vary widely in their ability to recognize speech in noisy conditions. There are many factors that may influence their performance. We have investigated to what degree it can be explained by the users' ability to discriminate spectral shapes. A speech recognition task was simulated using both a simple and a complex model of CI hearing. The models were individualized by adapting their parameters to fit the results of a spectral discrimination test. The predicted speech recognition performance was compared to experimental results, and they were significantly correlated. The presented framework may be used to simulate the effects of changing the CI encoding strategy.

  7. Noise robust automatic speech recognition with adaptive quantile based noise estimation and speech band emphasizing filter bank

    DEFF Research Database (Denmark)

    Bonde, Casper Stork; Graversen, Carina; Gregersen, Andreas Gregers;

    2005-01-01

    An important topic in Automatic Speech Recognition (ASR) is to reduce the effect of noise, in particular when mismatch exists between the training and application conditions. Many noise robustness schemes within the feature processing domain use as a prerequisite a noise estimate prior to the appearance of the speech signal, which requires noise-robust voice activity detection and assumptions of stationary noise. However, both of these requirements are often not met, and it is therefore of particular interest to investigate methods like the Quantile Based Noise Estimation (QBNE) method, which estimates the noise during speech and non-speech sections without the use of a voice activity detector. While the standard QBNE method uses a fixed pre-defined quantile across all frequency bands, this paper suggests adaptive QBNE (AQBNE), which adapts the quantile individually to each frequency band...
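
    As a rough illustration of the quantile idea (not the AQBNE adaptation rule, which the excerpt does not fully specify), the sketch below estimates the noise power in each frequency bin as a fixed quantile of the short-time power spectrum over time, so no voice activity detector is needed. The quantile value and STFT settings are adjustable parameters chosen for illustration.

```python
# Quantile-based noise estimation (QBNE) sketch using NumPy/SciPy.
# Noise power per frequency bin is taken as a fixed quantile of the
# short-time power spectrum over time, so no VAD is required.
import numpy as np
from scipy.signal import stft

def qbne_noise_estimate(x, fs, q=0.5, nperseg=400, noverlap=240):
    """Return a per-frequency noise power estimate for signal x."""
    _, _, X = stft(x, fs=fs, nperseg=nperseg, noverlap=noverlap)
    power = np.abs(X) ** 2                    # shape (freq, frames)
    return np.quantile(power, q, axis=1)      # one value per frequency bin

if __name__ == "__main__":
    fs = 16000
    t = np.arange(fs * 2) / fs
    clean = np.sin(2 * np.pi * 440 * t) * (t % 0.5 < 0.25)   # bursty "speech"
    noisy = clean + 0.05 * np.random.randn(t.size)           # stationary noise
    noise_psd = qbne_noise_estimate(noisy, fs)
    print(noise_psd.shape, float(noise_psd.mean()))
```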

  8. Neural speech recognition: continuous phoneme decoding using spatiotemporal representations of human cortical activity

    Science.gov (United States)

    Moses, David A.; Mesgarani, Nima; Leonard, Matthew K.; Chang, Edward F.

    2016-10-01

    Objective. The superior temporal gyrus (STG) and neighboring brain regions play a key role in human language processing. Previous studies have attempted to reconstruct speech information from brain activity in the STG, but few of them incorporate the probabilistic framework and engineering methodology used in modern speech recognition systems. In this work, we describe the initial efforts toward the design of a neural speech recognition (NSR) system that performs continuous phoneme recognition on English stimuli with arbitrary vocabulary sizes using the high gamma band power of local field potentials in the STG and neighboring cortical areas obtained via electrocorticography. Approach. The system implements a Viterbi decoder that incorporates phoneme likelihood estimates from a linear discriminant analysis model and transition probabilities from an n-gram phonemic language model. Grid searches were used in an attempt to determine optimal parameterizations of the feature vectors and Viterbi decoder. Main results. The performance of the system was significantly improved by using spatiotemporal representations of the neural activity (as opposed to purely spatial representations) and by including language modeling and Viterbi decoding in the NSR system. Significance. These results emphasize the importance of modeling the temporal dynamics of neural responses when analyzing their variations with respect to varying stimuli and demonstrate that speech recognition techniques can be successfully leveraged when decoding speech from neural signals. Guided by the results detailed in this work, further development of the NSR system could have applications in the fields of automatic speech recognition and neural prosthetics.
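
    The Viterbi decoding step described above can be illustrated with a small, self-contained sketch. The frame-wise log-likelihoods below are random stand-ins for the output of an LDA phoneme classifier, and the bigram transition matrix stands in for the phonemic language model; nothing here reproduces the NSR system's actual parameters.

```python
# Viterbi decoding of a phoneme sequence from frame-wise likelihoods
# combined with bigram transition probabilities (illustrative sketch;
# the likelihoods here are random stand-ins for a classifier's output).
import numpy as np

def viterbi(log_lik, log_trans, log_prior):
    """log_lik: (T, N) frame log-likelihoods; log_trans: (N, N); log_prior: (N,)."""
    T, N = log_lik.shape
    delta = np.full((T, N), -np.inf)
    back = np.zeros((T, N), dtype=int)
    delta[0] = log_prior + log_lik[0]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_trans        # (from, to)
        back[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + log_lik[t]
    path = np.zeros(T, dtype=int)
    path[-1] = delta[-1].argmax()
    for t in range(T - 2, -1, -1):
        path[t] = back[t + 1, path[t + 1]]
    return path

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    T, N = 50, 5                              # 50 frames, 5 phoneme classes
    log_lik = np.log(rng.dirichlet(np.ones(N), size=T))
    log_trans = np.log(rng.dirichlet(np.ones(N), size=N))
    log_prior = np.log(np.full(N, 1.0 / N))
    print(viterbi(log_lik, log_trans, log_prior))
```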

  9. Joint variable frame rate and length analysis for speech recognition under adverse conditions

    DEFF Research Database (Denmark)

    Tan, Zheng-Hua; Kraljevski, Ivan

    2014-01-01

    This paper presents a method that combines variable frame length and rate analysis for speech recognition in noisy environments, together with an investigation of the effect of different frame lengths on speech recognition performance. The method adopts frame selection using an a posteriori signal-to-noise ratio (SNR) weighted energy distance and increases the length of the selected frames, according to the number of non-selected preceding frames. It assigns a higher frame rate and a normal frame length to a rapidly changing and high SNR region of a speech signal, and a lower frame rate and an increased frame length to a steady or low SNR region. The speech recognition results show that the proposed variable frame rate and length method outperforms fixed frame rate and length analysis, as well as standalone variable frame rate analysis, in terms of noise robustness.
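
    A minimal sketch of the frame-selection idea follows: compute an energy distance between consecutive frames, weight it by a rough a posteriori SNR estimate, and keep a frame only when the accumulated weighted distance crosses a threshold, so rapidly changing, high-SNR regions keep more frames. The threshold, frame settings, and noise-power input are illustrative assumptions, not the paper's values.

```python
# Variable frame rate sketch: keep a frame only when the accumulated,
# SNR-weighted energy distance since the last kept frame crosses a threshold.
# The threshold and frame settings are illustrative, not the paper's values.
import numpy as np

def frame_signal(x, frame_len=400, hop=160):
    n = 1 + (len(x) - frame_len) // hop
    return np.stack([x[i * hop:i * hop + frame_len] for i in range(n)])

def select_frames(x, noise_power, frame_len=400, hop=160, threshold=2.0):
    frames = frame_signal(x, frame_len, hop)
    energy = (frames ** 2).mean(axis=1)
    snr_weight = np.maximum(energy / (noise_power + 1e-12), 1.0)   # a posteriori SNR
    dist = np.abs(np.diff(np.log(energy + 1e-12),
                          prepend=np.log(energy[0] + 1e-12)))
    keep, acc = [0], 0.0
    for i in range(1, len(frames)):
        acc += snr_weight[i] * dist[i]
        if acc >= threshold:
            keep.append(i)
            acc = 0.0
    return np.array(keep)

if __name__ == "__main__":
    fs = 16000
    t = np.arange(fs) / fs
    x = np.sin(2 * np.pi * 300 * t) * (t > 0.5) + 0.01 * np.random.randn(fs)
    kept = select_frames(x, noise_power=1e-4)
    print(f"kept {len(kept)} of {len(frame_signal(x))} frames")
```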

  10. An open-set detection evaluation methodology for automatic emotion recognition in speech

    NARCIS (Netherlands)

    Truong, K.P.; Leeuwen, D.A. van

    2007-01-01

    In this paper, we present a detection approach and an ‘open-set’ detection evaluation methodology for automatic emotion recognition in speech. The traditional classification approach does not seem to be suitable and flexible enough for typical emotion recognition tasks. For example, classification d

  11. AUTOMATIC SPEECH RECOGNITION SYSTEM CONCERNING THE MOROCCAN DIALECTE (Darija and Tamazight)

    OpenAIRE

    A. EL GHAZI; Daoui, C.; Idrissi, N

    2012-01-01

    In this work we present an automatic speech recognition system for Moroccan dialects, mainly Darija (an Arabic dialect) and Tamazight. Many approaches have been used to model Arabic and Tamazight phonetic units. In this paper, we propose to use hidden Markov models (HMMs) for modeling these phonetic units. Experimental results show that the proposed approach further improves recognition.
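
    The modeling idea (one HMM per phonetic unit, classification by maximum log-likelihood) can be sketched with the third-party hmmlearn package. The synthetic feature vectors below stand in for real MFCCs, and the two "units" and their parameters are illustrative assumptions rather than the authors' system.

```python
# One GaussianHMM per phonetic unit, classification by maximum log-likelihood
# (a minimal sketch with synthetic feature vectors in place of real MFCCs;
# requires the third-party hmmlearn package).
import numpy as np
from hmmlearn import hmm

def train_unit_models(training_data, n_states=3):
    """training_data: dict unit -> list of (T, D) feature arrays."""
    models = {}
    for unit, seqs in training_data.items():
        X = np.vstack(seqs)
        lengths = [len(s) for s in seqs]
        m = hmm.GaussianHMM(n_components=n_states, covariance_type="diag",
                            n_iter=20, random_state=0)
        m.fit(X, lengths)
        models[unit] = m
    return models

def classify(models, seq):
    return max(models, key=lambda u: models[u].score(seq))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Two synthetic "units" with different feature means.
    data = {"a": [rng.normal(0.0, 1.0, (30, 13)) for _ in range(5)],
            "u": [rng.normal(2.0, 1.0, (30, 13)) for _ in range(5)]}
    models = train_unit_models(data)
    test = rng.normal(2.0, 1.0, (30, 13))
    print(classify(models, test))   # expected: 'u'
```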

  12. Use of Authentic-Speech Technique for Teaching Sound Recognition to EFL Students

    Science.gov (United States)

    Sersen, William J.

    2011-01-01

    The main objective of this research was to test an authentic-speech technique for improving the sound-recognition skills of EFL (English as a foreign language) students at Roi-Et Rajabhat University. The secondary objective was to determine the correlation, if any, between students' self-evaluation of sound-recognition progress and the actual…

  13. Conversation electrified: ERP correlates of speech act recognition in underspecified utterances.

    Directory of Open Access Journals (Sweden)

    Rosa S Gisladottir

    Full Text Available The ability to recognize speech acts (verbal actions) in conversation is critical for everyday interaction. However, utterances are often underspecified for the speech act they perform, requiring listeners to rely on the context to recognize the action. The goal of this study was to investigate the time-course of auditory speech act recognition in action-underspecified utterances and explore how sequential context (the prior action) impacts this process. We hypothesized that speech acts are recognized early in the utterance to allow for quick transitions between turns in conversation. Event-related potentials (ERPs) were recorded while participants listened to spoken dialogues and performed an action categorization task. The dialogues contained target utterances, each of which could deliver three distinct speech acts depending on the prior turn. The targets were identical across conditions, but differed in the type of speech act performed and how it fit into the larger action sequence. The ERP results show an early effect of action type, reflected by frontal positivities as early as 200 ms after target utterance onset. This indicates that speech act recognition begins early in the turn when the utterance has only been partially processed. Providing further support for early speech act recognition, actions in highly constraining contexts did not elicit an ERP effect to the utterance-final word. We take this to show that listeners can recognize the action before the final word through predictions at the speech act level. However, additional processing based on the complete utterance is required in more complex actions, as reflected by a posterior negativity at the final word when the speech act is in a less constraining context and a new action sequence is initiated. These findings demonstrate that sentence comprehension in conversational contexts crucially involves recognition of verbal action, which begins as soon as it can.

  14. Effects of Semantic Context and Fundamental Frequency Contours on Mandarin Speech Recognition by Second Language Learners

    Science.gov (United States)

    Zhang, Linjun; Li, Yu; Wu, Han; Li, Xin; Shu, Hua; Zhang, Yang; Li, Ping

    2016-01-01

    Speech recognition by second language (L2) learners in optimal and suboptimal conditions has been examined extensively with English as the target language in most previous studies. This study extended existing experimental protocols (Wang et al., 2013) to investigate Mandarin speech recognition by Japanese learners of Mandarin at two different levels (elementary vs. intermediate) of proficiency. The overall results showed that in addition to L2 proficiency, semantic context, F0 contours, and listening condition all affected the recognition performance on the Mandarin sentences. However, the effects of semantic context and F0 contours on L2 speech recognition diverged to some extent. Specifically, there was a significant modulation effect of listening condition on semantic context, indicating that L2 learners made use of semantic context less efficiently in the interfering background than in quiet. In contrast, no significant modulation effect of listening condition on F0 contours was found. Furthermore, there was a significant interaction between semantic context and F0 contours, indicating that semantic context becomes more important for L2 speech recognition when F0 information is degraded. None of these effects were found to be modulated by L2 proficiency. The discrepancy in the effects of semantic context and F0 contours on L2 speech recognition in the interfering background might be related to differences in the processing capacities required by the two types of information in adverse listening conditions. PMID:27378997

  15. Relative Contributions of Spectral and Temporal Cues for Speech Recognition in Patients with Sensorineural Hearing Loss

    Institute of Scientific and Technical Information of China (English)

    XU Li; ZHOU Ning; Rebecca Brashears; Katherine Rife

    2008-01-01

    The present study was designed to examine speech recognition in patients with sensorineural hearing loss when the temporal and spectral information in the speech signals were co-varied. Four subjects with mild to moderate sensorineural hearing loss were recruited to participate in consonant and vowel recognition tests that used speech stimuli processed through a noise-excited vocoder. The number of channels was varied between 2 and 32, which defined the spectral information. The lowpass cutoff frequency of the temporal envelope extractor was varied from 1 to 512 Hz, which defined the temporal information. Results indicate that performance of subjects with sensorineural hearing loss varied tremendously among the subjects. For consonant recognition, patterns of relative contributions of spectral and temporal information were similar to those in normal-hearing subjects. The utility of temporal envelope information appeared to be normal in the hearing-impaired listeners. For vowel recognition, which depended predominantly on spectral information, the performance plateau was achieved with numbers of channels as high as 16-24, much higher than expected, given that the frequency selectivity in patients with sensorineural hearing loss might be compromised. In order to understand the mechanisms by which hearing-impaired listeners utilize spectral and temporal cues for speech recognition, future studies that involve a large sample of patients with sensorineural hearing loss will be necessary to elucidate the relationship between frequency selectivity as well as central processing capability and speech recognition performance using vocoded signals.
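
    A noise-excited (channel) vocoder of the kind described can be sketched as follows: band-pass filter the signal into N channels, extract each channel's temporal envelope with a low-pass filter at the chosen cutoff, and use the envelopes to modulate band-limited noise. The band edges, filter orders, and parameters below are illustrative choices, not the stimuli used in the study.

```python
# Noise-excited vocoder sketch: N analysis bands, envelope extraction by
# rectification + low-pass filtering, envelopes modulate band-limited noise.
# Band edges and filter orders are illustrative choices.
import numpy as np
from scipy.signal import butter, filtfilt

def bandpass(x, lo, hi, fs, order=4):
    b, a = butter(order, [lo / (fs / 2), hi / (fs / 2)], btype="band")
    return filtfilt(b, a, x)

def lowpass(x, cutoff, fs, order=4):
    b, a = butter(order, cutoff / (fs / 2), btype="low")
    return filtfilt(b, a, x)

def noise_vocoder(x, fs, n_channels=8, env_cutoff=50.0):
    edges = np.logspace(np.log10(100), np.log10(7000), n_channels + 1)
    rng = np.random.default_rng(0)
    out = np.zeros_like(x)
    for lo, hi in zip(edges[:-1], edges[1:]):
        band = bandpass(x, lo, hi, fs)
        env = lowpass(np.abs(band), env_cutoff, fs)      # temporal envelope
        carrier = bandpass(rng.standard_normal(len(x)), lo, hi, fs)
        out += env * carrier
    return out / (np.max(np.abs(out)) + 1e-12)

if __name__ == "__main__":
    fs = 16000
    t = np.arange(fs) / fs
    speechlike = np.sin(2 * np.pi * 220 * t) * (0.5 + 0.5 * np.sin(2 * np.pi * 3 * t))
    print(noise_vocoder(speechlike, fs, n_channels=8, env_cutoff=50.0).shape)
```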

  16. Visual face-movement sensitive cortex is relevant for auditory-only speech recognition.

    Science.gov (United States)

    Riedel, Philipp; Ragert, Patrick; Schelinski, Stefanie; Kiebel, Stefan J; von Kriegstein, Katharina

    2015-07-01

    It is commonly assumed that the recruitment of visual areas during audition is not relevant for performing auditory tasks ('auditory-only view'). According to an alternative view, however, the recruitment of visual cortices is thought to optimize auditory-only task performance ('auditory-visual view'). This alternative view is based on functional magnetic resonance imaging (fMRI) studies. These studies have shown, for example, that even if there is only auditory input available, face-movement sensitive areas within the posterior superior temporal sulcus (pSTS) are involved in understanding what is said (auditory-only speech recognition). This is particularly the case when speakers are known audio-visually, that is, after brief voice-face learning. Here we tested whether the left pSTS involvement is causally related to performance in auditory-only speech recognition when speakers are known by face. To test this hypothesis, we applied cathodal transcranial direct current stimulation (tDCS) to the pSTS during (i) visual-only speech recognition of a speaker known only visually to participants and (ii) auditory-only speech recognition of speakers they learned by voice and face. We defined the cathode as the active electrode to down-regulate cortical excitability by hyperpolarization of neurons. tDCS to the pSTS interfered with visual-only speech recognition performance compared to a control group without pSTS stimulation (tDCS to BA6/44 or sham). Critically, compared to controls, pSTS stimulation additionally decreased auditory-only speech recognition performance selectively for voice-face learned speakers. These results are important in two ways. First, they provide direct evidence that the pSTS is causally involved in visual-only speech recognition; this confirms a long-standing prediction of current face-processing models. Secondly, they show that visual face-sensitive pSTS is causally involved in optimizing auditory-only speech recognition. These results are in line…

  17. Feature Fusion Algorithm for Multimodal Emotion Recognition from Speech and Facial Expression Signal

    Directory of Open Access Journals (Sweden)

    Han Zhiyan

    2016-01-01

    Full Text Available In order to overcome the limitations of single-mode emotion recognition, this paper describes a novel multimodal emotion recognition algorithm that takes the speech signal and the facial expression signal as its research subjects. First, the speech signal features and facial expression features are fused, sample sets are obtained by sampling with replacement, and classifiers are then obtained with BP neural networks (BPNN). Second, the difference between two classifiers is measured by a double error difference selection strategy. Finally, the final recognition result is obtained by the majority voting rule. Experiments show that the method improves the accuracy of emotion recognition by giving full play to the advantages of decision-level fusion and feature-level fusion, bringing the whole fusion process closer to human emotion recognition, with a recognition rate of 90.4%.

  18. Recognition of voice commands using adaptation of foreign language speech recognizer via selection of phonetic transcriptions

    Science.gov (United States)

    Maskeliunas, Rytis; Rudzionis, Vytautas

    2011-06-01

    In recent years various commercial speech recognizers have become available. These recognizers provide the possibility to develop applications incorporating various speech recognition techniques easily and quickly. All of these commercial recognizers are typically targeted to widely spoken languages having large market potential; however, it may be possible to adapt available commercial recognizers for use in environments where less widely spoken languages are used. Since most commercial recognition engines are closed systems the single avenue for the adaptation is to try set ways for the selection of proper phonetic transcription methods between the two languages. This paper deals with the methods to find the phonetic transcriptions for Lithuanian voice commands to be recognized using English speech engines. The experimental evaluation showed that it is possible to find phonetic transcriptions that will enable the recognition of Lithuanian voice commands with recognition accuracy of over 90%.

  19. Towards Robustness to Speech Rate in Mandarin All-Syllable Recognition

    Institute of Scientific and Technical Information of China (English)

    CHEN YiNing (陈一宁); ZHU Xuan (朱璇); LIU Jia (刘加); LIU RunSheng (刘润生)

    2003-01-01

    In Mandarin all-syllable recognition, many insertion errors occur due to the influence of non-consonant syllables. Introducing a duration model into the recognition process is a direct way to lessen these errors, but it usually does not work as well as expected, because duration is sensitive to speech rate. Hence, aiming at this problem, a novel context-dependent duration distribution normalized by speech rate is proposed in this paper and applied to a speech recognition system based on the framework of an improved Hidden Markov Model (HMM). To realize this algorithm, the authors employ a new method to estimate the speech rate of a sentence, then compute the duration probability combined with speech rate, and finally implement this duration information in the post-processing stage. With little change in the recognition process and resource demand, the duration model is adopted efficiently in the system. The experimental results indicate that the syllable error rates decrease significantly in two different speech corpora. Especially for insertions, the error rates are reduced by about sixty to eighty percent.

  20. Why has (reasonably accurate) Automatic Speech Recognition been so hard to achieve?

    CERN Document Server

    Wegmann, Steven

    2010-01-01

    Hidden Markov models (HMMs) have been successfully applied to automatic speech recognition for more than 35 years in spite of the fact that a key HMM assumption -- the statistical independence of frames -- is obviously violated by speech data. In fact, this data/model mismatch has inspired many attempts to modify or replace HMMs with alternative models that are better able to take into account the statistical dependence of frames. However it is fair to say that in 2010 the HMM is the consensus model of choice for speech recognition and that HMMs are at the heart of both commercially available products and contemporary research systems. In this paper we present a preliminary exploration aimed at understanding how speech data depart from HMMs and what effect this departure has on the accuracy of HMM-based speech recognition. Our analysis uses standard diagnostic tools from the field of statistics -- hypothesis testing, simulation and resampling -- which are rarely used in the field of speech recognition. Our ma...

  1. Cognitive resources related to speech recognition with a competing talker in young and older listeners.

    Science.gov (United States)

    Meister, H; Schreitmüller, S; Grugel, L; Ortmann, M; Beutner, D; Walger, M; Meister, I G

    2013-03-01

    Speech recognition in a multi-talker situation poses high demands on attentional and other central resources. This study examines the relationship between age, cognition and speech recognition in tasks that require selective or divided attention in a multi-talker setting. Two groups of normal-hearing adults (one younger and one older group) were asked to repeat utterances from either one or two concurrent speakers. Cognitive abilities were then inspected by neuropsychological tests. Speech recognition scores approached ceiling and did not significantly differ between age groups for tasks that demanded selective attention. However, when divided attention was required, performance in older listeners was reduced as compared to the younger group. When selective attention was required, speech recognition was strongly related to working memory skills, as determined by a regression model. In comparison, speech recognition for tests requiring divided attention was more strongly determined by neuropsychological probes of fluid intelligence. The findings of this study indicate that - apart from hearing impairment - cognitive aspects account for the typical difficulties of older listeners in a multi-speaker setting. Our results are discussed in the context of evidence showing that frontal lobe functions in terms of working memory and fluid intelligence generally decline with age.

  2. Spectrum warping based on sub-glottal resonances in speaker-independent speech recognition

    Institute of Scientific and Technical Information of China (English)

    HOU Limin; HUANG Zhenhua; XIE Juanmin

    2011-01-01

    To reduce degradation in speech recognition due to the varied characteristics of different speakers, a method of perceptual frequency warping based on subglottal resonances for speaker normalization is proposed. The warping factor is extracted from the second subglottal resonance using the acoustic coupling between the subglottal system and the vocal tract. The second subglottal resonance is independent of the speech content and reflects the speaker characteristics better than the third formant. The perceptual minimum variance distortionless response (PMVDR) coefficients are normalized, which makes them more robust and gives better anti-noise capability than MFCCs. The normalized coefficients are used in speech-model training and speech recognition. Experiments show that the word error rate, as compared with MFCC and spectrum warping by the third formant, decreases by 4% and 3% respectively in clean speech recognition, and by 9% and 5% respectively in a noisy environment. The results indicate that the proposed method can improve word recognition accuracy in a speaker-independent recognition system.

  3. Language modeling for automatic speech recognition of inflective languages an applications-oriented approach using lexical data

    CERN Document Server

    Donaj, Gregor

    2017-01-01

    This book covers language modeling and automatic speech recognition for inflective languages (e.g. Slavic languages), which represent roughly half of the languages spoken in Europe. These languages do not perform as well as English in speech recognition systems and it is therefore harder to develop an application with sufficient quality for the end user. The authors describe the most important language features for the development of a speech recognition system. This is then presented through the analysis of errors in the system and the development of language models and their inclusion in speech recognition systems, which specifically address the errors that are relevant for targeted applications. The error analysis is done with regard to morphological characteristics of the word in the recognized sentences. The book is oriented towards speech recognition with large vocabularies and continuous and even spontaneous speech. Today such applications work with a rather small number of languages compared to the nu...

  4. Pattern recognition

    CERN Document Server

    Theodoridis, Sergios

    2003-01-01

    Pattern recognition is a scientific discipline that is becoming increasingly important in the age of automation and information handling and retrieval. Pattern Recognition, 2e covers the entire spectrum of pattern recognition applications, from image analysis to speech recognition and communications. This book presents cutting-edge material on neural networks (a set of linked microprocessors that can form associations and use pattern recognition to "learn") and enhances student motivation by approaching pattern recognition from the designer's point of view. A direct result of more than 10…

  5. Towards Contactless Silent Speech Recognition Based on Detection of Active and Visible Articulators Using IR-UWB Radar.

    Science.gov (United States)

    Shin, Young Hoon; Seo, Jiwon

    2016-10-29

    People with hearing or speaking disabilities are deprived of the benefits of conventional speech recognition technology because it is based on acoustic signals. Recent research has focused on silent speech recognition systems that are based on the motions of a speaker's vocal tract and articulators. Because most silent speech recognition systems use contact sensors that are very inconvenient to users or optical systems that are susceptible to environmental interference, a contactless and robust solution is hence required. Toward this objective, this paper presents a series of signal processing algorithms for a contactless silent speech recognition system using an impulse radio ultra-wide band (IR-UWB) radar. The IR-UWB radar is used to remotely and wirelessly detect motions of the lips and jaw. In order to extract the necessary features of lip and jaw motions from the received radar signals, we propose a feature extraction algorithm. The proposed algorithm noticeably improved speech recognition performance compared to the existing algorithm during our word recognition test with five speakers. We also propose a speech activity detection algorithm to automatically select speech segments from continuous input signals. Thus, speech recognition processing is performed only when speech segments are detected. Our testbed consists of commercial off-the-shelf radar products, and the proposed algorithms are readily applicable without designing specialized radar hardware for silent speech processing.

  6. Towards Contactless Silent Speech Recognition Based on Detection of Active and Visible Articulators Using IR-UWB Radar

    Directory of Open Access Journals (Sweden)

    Young Hoon Shin

    2016-10-01

    Full Text Available People with hearing or speaking disabilities are deprived of the benefits of conventional speech recognition technology because it is based on acoustic signals. Recent research has focused on silent speech recognition systems that are based on the motions of a speaker's vocal tract and articulators. Because most silent speech recognition systems use contact sensors that are very inconvenient to users or optical systems that are susceptible to environmental interference, a contactless and robust solution is hence required. Toward this objective, this paper presents a series of signal processing algorithms for a contactless silent speech recognition system using an impulse radio ultra-wide band (IR-UWB) radar. The IR-UWB radar is used to remotely and wirelessly detect motions of the lips and jaw. In order to extract the necessary features of lip and jaw motions from the received radar signals, we propose a feature extraction algorithm. The proposed algorithm noticeably improved speech recognition performance compared to the existing algorithm during our word recognition test with five speakers. We also propose a speech activity detection algorithm to automatically select speech segments from continuous input signals. Thus, speech recognition processing is performed only when speech segments are detected. Our testbed consists of commercial off-the-shelf radar products, and the proposed algorithms are readily applicable without designing specialized radar hardware for silent speech processing.

  7. Acceptance of speech recognition by physicians: A survey of expectations, experiences, and social influence

    DEFF Research Database (Denmark)

    Alapetite, Alexandre; Andersen, Henning Boje; Hertzum, Morten

    2009-01-01

    The present study has surveyed physician views and attitudes before and after the introduction of speech technology as a front end to an electronic medical record. At the hospital where the survey was made, speech technology recently (2006–2007) replaced traditional dictation and subsequent secretarial transcription for all physicians in clinical departments. The aim of the survey was (i) to identify how attitudes and perceptions among physicians affected the acceptance and success of the speech-recognition system and the new work procedures associated with it; and (ii) to assess the degree … they had had some experience with the system. The survey data were supplemented with performance data from the speech-recognition system. The results show that the surveyed physicians tended to report a more negative view of the system after having used it for some months than before. When judging…

  8. Voice Activity Detector of Wake-Up-Word Speech Recognition System Design on FPGA

    Directory of Open Access Journals (Sweden)

    Veton Z. Këpuska

    2014-12-01

    Full Text Available A typical speech recognition system is push-to-talk operated and requires activation. However, for those in hands-busy applications, movement may be restricted or impossible. One alternative is a speech-only interface. The proposed method, called Wake-Up-Word Speech Recognition (WUW-SR), utilizes such a speech-only interface. A WUW-SR system would allow the user to activate systems (cell phone, computer, etc.) with speech commands alone instead of manual activation. The trend in WUW-SR hardware design is towards implementing a complete system on a single chip intended for various applications. This paper presents an experimental FPGA design and implementation of a novel architecture of a real-time feature extraction processor that includes a Voice Activity Detector (VAD) and feature extraction: MFCC, LPC, and ENH_MFCC. In the WUW-SR system, the recognizer front-end with VAD is located at the terminal, which is typically connected over a data network (e.g., to a server) for remote back-end recognition. VAD is responsible for segmenting the signal into speech-like and non-speech-like segments. For any given frame, VAD reports one of two possible states: VAD_ON or VAD_OFF. The back-end is then responsible for scoring the features segmented during the VAD_ON stage. The most important characteristic of the presented design is that it should guarantee virtually 100% correct rejection for non-WUW (out-of-vocabulary words, OOV) while maintaining a correct acceptance rate of 99.9% or higher (in-vocabulary words, INV). This requirement sets WUW-SR apart from other speech recognition tasks because no existing system can guarantee 100% reliability by any measure.
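
    The VAD_ON/VAD_OFF framing can be illustrated with a very small frame-energy sketch. It is a deliberate simplification of a front-end VAD, not the FPGA design described above; the threshold estimation, hangover length, and signal below are illustrative assumptions.

```python
# Frame-energy VAD sketch reporting VAD_ON / VAD_OFF per frame.
# Threshold estimation and hangover are illustrative simplifications.
import numpy as np

def energy_vad(x, fs, frame_ms=25, hop_ms=10, margin_db=6.0, hangover=5):
    frame, hop = int(fs * frame_ms / 1000), int(fs * hop_ms / 1000)
    n = 1 + (len(x) - frame) // hop
    energy_db = np.array([10 * np.log10((x[i*hop:i*hop+frame] ** 2).mean() + 1e-12)
                          for i in range(n)])
    threshold = np.percentile(energy_db, 10) + margin_db   # noise floor + margin
    states, count = [], 0
    for e in energy_db:
        if e > threshold:
            count = hangover            # refresh the hangover counter
        else:
            count = max(count - 1, 0)
        states.append("VAD_ON" if count > 0 else "VAD_OFF")
    return states

if __name__ == "__main__":
    fs = 8000
    t = np.arange(fs) / fs
    sig = np.where(t > 0.5, np.sin(2 * np.pi * 400 * t), 0.0) + 0.01 * np.random.randn(fs)
    states = energy_vad(sig, fs)
    print(states.count("VAD_ON"), "of", len(states), "frames are VAD_ON")
```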

  9. A Review on Speech Corpus Development for Automatic Speech Recognition in Indian Languages

    OpenAIRE

    Cini kurian

    2015-01-01

    Corpus development gained much attention due to recent statistics based natural language processing. It has new applications in Language Technology, linguistic research, language education and information exchange. Corpus based Language research has an innovative outlook which will discard the aged linguistic theories. Speech corpus is the essential resources for building a speech recognizer. One of the main challenges faced by speech scientist is the unavailability of these resources. Very f...

  10. A Novel DBN Feature Fusion Model for Cross-Corpus Speech Emotion Recognition

    Directory of Open Access Journals (Sweden)

    Zou Cairong

    2016-01-01

    Full Text Available Feature fusion from separate sources is a current technical difficulty in cross-corpus speech emotion recognition. The purpose of this paper is, based on Deep Belief Nets (DBN) in deep learning, to use the emotional information hidden in the speech spectrogram as image features and then implement feature fusion with traditional emotion features. First, based on spectrogram analysis with the STB/Itti model, new spectrogram features are extracted from the color, the brightness, and the orientation, respectively; then two alternative DBN models fuse the traditional and the spectrogram features, which increases the scale of the feature subset and its ability to characterize emotion. In experiments on the ABC database and Chinese corpora, the new feature subset, compared with traditional speech emotion features, distinctly advances the cross-corpus recognition result by 8.8%. The proposed method provides a new idea for feature fusion in emotion recognition.

  11. Design and Implementation of Monophones and Triphones-Based Speech Recognition Systems for Voice Activated Telephony

    Directory of Open Access Journals (Sweden)

    Rupayan Das

    2013-07-01

    Full Text Available Speech recognition is the ability of a machine or program to convert spoken words into their equivalent text form. Nowadays, most recognition systems use Hidden Markov Models for modeling the spoken utterances. In this paper we have implemented two speaker-independent speech recognition systems which include all the words required for dialing a phone. The systems contain 42 words, including the digits from zero to nine and the names of 20 persons. A total of 16,800 utterances have been used for training each system. The two systems are able to recognize continuous speech and are implemented with the help of monophones and triphones, respectively, using HTK. Experimental results show an accuracy of 74.11% for the monophone-based models and 93.77% for the triphone-based models.

  12. Hybrid Approach for Language Identification Oriented to Multilingual Speech Recognition in the Basque Context

    Science.gov (United States)

    Barroso, N.; de Ipiña, K. López; Ezeiza, A.; Barroso, O.; Susperregi, U.

    The development of Multilingual Large Vocabulary Continuous Speech Recognition systems involves issues such as Language Identification, Acoustic-Phonetic Decoding, Language Modelling, and the development of appropriate Language Resources. The interest in multilingual systems arises because there are three official languages in the Basque Country (Basque, Spanish, and French), and there is much linguistic interaction among them, even though Basque has very different roots from the other two languages. This paper describes the development of a Language Identification (LID) system oriented to robust Multilingual Speech Recognition for the Basque context. The work presents hybrid strategies for LID, based on the selection of system elements by Support Vector Machine and Multilayer Perceptron classifiers and stochastic methods for speech recognition tasks (Hidden Markov Models and n-grams).
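
    One classifier-side ingredient of such a hybrid scheme, an SVM deciding among language labels from utterance-level feature vectors, can be sketched with scikit-learn. The synthetic vectors below stand in for averaged acoustic features, and the labels and parameters are illustrative; this is not the paper's LID system.

```python
# Language identification sketch: an SVM over utterance-level feature vectors
# (synthetic stand-ins for averaged acoustic features). Labels are illustrative.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
languages = ["basque", "spanish", "french"]
# 100 synthetic 20-dimensional "utterance" vectors per language.
X = np.vstack([rng.normal(loc=i, scale=1.0, size=(100, 20)) for i in range(3)])
y = np.repeat(languages, 100)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
clf.fit(X_tr, y_tr)
print(f"LID accuracy on synthetic data: {clf.score(X_te, y_te):.2f}")
```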

  13. Chinese Speech Recognition Model Based on Activation of the State Feedback Neural Network

    Institute of Scientific and Technical Information of China (English)

    李先志; 孙义和

    2001-01-01

    This paper proposes a simplified novel speech recognition model, the state feedback neural network activation model (SFNNAM), which is developed based on the characteristics of Chinese speech structure. The model assumes that the current state of speech is only a correction of the previous state. According to the "C-V" (Consonant-Vowel) structure of the Chinese language, a speech segmentation method is also implemented in the SFNNAM model. This model has a definite physical meaning grounded in the structure of the Chinese language and is easily implemented in very large scale integrated circuits (VLSI). In the speech recognition experiment, fewer calculations were needed than in the hidden Markov model (HMM) based algorithm. The recognition rate for Chinese numbers was 93.5% for the first candidate and 99.5% for the first two candidates.

  14. Comparison of Forced-Alignment Speech Recognition and Humans for Generating Reference VAD

    DEFF Research Database (Denmark)

    Kraljevski, Ivan; Tan, Zheng-Hua; Paola Bissiri, Maria

    2015-01-01

    This paper aims to answer the question whether forced-alignment speech recognition can be used as an alternative to humans in generating reference Voice Activity Detection (VAD) transcriptions. An investigation of the level of agreement between automatic/manual VAD transcriptions and the reference ones produced by a human expert was carried out. Thereafter, statistical analysis was employed on the automatically produced and the collected manual transcriptions. Experimental results confirmed that forced-alignment speech recognition can provide accurate and consistent VAD labels.

  15. Low-cost speech recognition system for small vocabulary and independent speaker

    Science.gov (United States)

    Teh, Chih Chiang; Jong, Ching C.; Siek, Liter

    2000-10-01

    In this paper an ASIC implementation of a low-cost, speaker-independent speech recognition system for a small vocabulary of 15 isolated words is presented. The IC is a digital block that receives a 12-bit sample with a sampling rate of 11.025 kHz as its input. The IC runs at a 10 MHz system clock and is targeted at a 0.35 micrometer CMOS process. The whole chip, which includes the speech recognition system core, RAM and ROM, contains about 61000 gates. The die size is 1.5 mm by 3 mm. The current design has been coded in VHDL for hardware implementation and its functionality is identical with the Matlab simulation. The average speech recognition rate for this IC is 89 percent for 15 isolated words.

  16. The Combined Effects of Aging and Hearing Loss on Temporal Resolution and Recognition of Reverberant Speech

    Science.gov (United States)

    Halling, Dan C.

    Listeners perform more poorly on a speech-recognition task when in a reverberant listening condition than a non-reverberant one. Elderly listeners experience even greater difficulty than young listeners. It has been suggested that this greater difficulty can be almost entirely explained by taking into account the hearing impairment that typically accompanies the aging process. Nevertheless, existing evidence suggests that elderly hearing-impaired listeners still experience greater difficulty than young hearing-impaired listeners. Some have suggested that elderly listeners, in addition to the hearing loss, also exhibit poorer temporal resolution than young listeners, poorer than even young hearing-impaired listeners. Temporal resolution and speech-recognition performance were evaluated in 8 young normal-hearing listeners, 8 elderly normal-hearing listeners, and 12 elderly listeners with hearing impairment of varying degree. The results suggested that there was an effect of both age and hearing loss on temporal resolution and speech-recognition performance. Additional analyses indicated that the age effects may have actually been caused by slight elevations in the quiet thresholds for the elderly normal-hearing subjects relative to the young normal-hearing subjects. The results also suggested that individual differences in hearing loss and temporal resolution underlie individual differences in speech-recognition performance. Finally, an objective measure of predicting speech intelligibility, the Speech Transmission Index (STI), was evaluated as to its adequacy as a tool for predicting speech-recognition performance in young and elderly, normal-hearing and hearing-impaired listeners in anechoic or reverberant conditions. Several derivations of the STI provided tight-fitting functions relating percent correct to STI, one of which requires only knowledge of the listener's quiet thresholds and the acoustical properties of the room.

  17. Sensitivity Based Segmentation and Identification in Automatic Speech Recognition.

    Science.gov (United States)

    1984-03-30

    …by a network constructed from phonemic, phonetic, and phonological rules. Regardless of the speech processing system used, Klatt has described … analysis, and its use in the segmentation and identification of the phonetic units of speech, that was initiated during the 1982 Summer Faculty Research … practicable framework for incorporation of acoustic-phonetic variance as well as time and talker normalization.

  18. Self-organizing map classifier for stressed speech recognition

    Science.gov (United States)

    Partila, Pavol; Tovarek, Jaromir; Voznak, Miroslav

    2016-05-01

    This paper presents a method for detecting speech under stress using Self-Organizing Maps. Most people who are exposed to stressful situations cannot respond adequately to stimuli. The army, police, and fire services account for the largest part of the workforce exposed to an increased number of stressful situations. Personnel in action are directed by a control center, and control commands should be adapted to the psychological state of the person in the field. It is known that psychological changes in the human body are also reflected physiologically, which consequently means that stress affects speech. Therefore, it is clear that a system for recognizing stress from speech is required by the security forces. One possible classifier, popular for its flexibility, is the self-organizing map, a type of artificial neural network. Flexibility here means that the classifier is independent of the character of the input data, a feature that is suitable for speech processing. Human stress can be seen as a kind of emotional state. Mel-frequency cepstral coefficients, LPC coefficients, and prosody features were selected as input data because of their sensitivity to emotional changes. The parameters were calculated on speech recordings, which can be divided into two classes, namely stress-state recordings and normal-state recordings. The benefit of the experiment is a method using a SOM classifier for stressed speech detection. Results showed the advantage of this method, namely its flexibility with respect to the input data.
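
    The SOM-based two-class scheme can be sketched with the third-party minisom package: train a map on labeled feature vectors, label each map cell by majority vote of the training samples it wins, and classify new samples by their winning cell. The synthetic features, map size, and training settings below are illustrative assumptions, not the authors' configuration.

```python
# Self-organizing map sketch for two-class (stress vs. neutral) speech features.
# Requires the third-party minisom package; features are synthetic stand-ins
# for MFCC/LPC/prosody vectors, and the map size is illustrative.
from collections import Counter, defaultdict
import numpy as np
from minisom import MiniSom

rng = np.random.default_rng(0)
neutral = rng.normal(0.0, 1.0, (200, 16))
stressed = rng.normal(1.5, 1.0, (200, 16))
X = np.vstack([neutral, stressed])
y = np.array(["neutral"] * 200 + ["stress"] * 200)

som = MiniSom(8, 8, X.shape[1], sigma=1.5, learning_rate=0.5, random_seed=0)
som.train_random(X, 2000)

# Label each map cell by majority vote of the training samples it wins.
cell_votes = defaultdict(Counter)
for xi, yi in zip(X, y):
    cell_votes[som.winner(xi)][yi] += 1

def classify(sample):
    votes = cell_votes.get(som.winner(sample))
    return votes.most_common(1)[0][0] if votes else "unknown"

test = rng.normal(1.5, 1.0, 16)
print(classify(test))   # expected: 'stress'
```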

  19. Multi-Stage Recognition of Speech Emotion Using Sequential Forward Feature Selection

    Directory of Open Access Journals (Sweden)

    Liogienė Tatjana

    2016-07-01

    Full Text Available Intensive research on speech emotion recognition has introduced a huge collection of speech emotion features. Large feature sets complicate the speech emotion recognition task. Among the various feature selection and transformation techniques for one-stage classification, multiple classifier systems have been proposed. The main idea of multiple classifiers is to arrange the emotion classification process in stages. Besides parallel and serial cases, the hierarchical arrangement of multi-stage classification is most widely used for speech emotion recognition. In this paper, we present a sequential-forward-feature-selection-based multi-stage classification scheme. The Sequential Forward Selection (SFS) and Sequential Floating Forward Selection (SFFS) techniques were employed for every stage of the multi-stage classification scheme. Experimental testing of the proposed scheme was performed using the German and Lithuanian emotional speech datasets. Sequential-feature-selection-based multi-stage classification outperformed the single-stage scheme by 12–42 % for different emotion sets. The multi-stage scheme has shown higher robustness to the growth of the emotion set: its decrease in recognition rate with an increasing number of emotions was lower by 10–20 % in comparison with the single-stage case. Differences between SFS and SFFS employment for feature selection were negligible.
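
    Plain sequential forward selection (the SFS step; the floating SFFS variant is not in scikit-learn) can be sketched with scikit-learn's SequentialFeatureSelector. The synthetic data, classifier, and target feature count below are illustrative choices, not the paper's setup.

```python
# Sequential forward feature selection sketch using scikit-learn's
# SequentialFeatureSelector. Data are synthetic stand-ins for emotion features.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=300, n_features=40, n_informative=6,
                           n_classes=3, random_state=0)
selector = SequentialFeatureSelector(KNeighborsClassifier(n_neighbors=5),
                                     n_features_to_select=6,
                                     direction="forward", cv=5)
selector.fit(X, y)
print("selected feature indices:", np.where(selector.get_support())[0])
```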

  20. Studying the Speech Recognition Scores of Hearing Impaired Children by Using Nonsense Syllables

    Directory of Open Access Journals (Sweden)

    Mohammad Reza Keyhani

    1998-09-01

    Full Text Available Background: The current article is aimed at evaluating speech recognition scores in hearing aid wearers to determine whether nonsense syllables are suitable speech materials to evaluate the effectiveness of their hearing aids. Method: Subjects were 60 children (15 males and 15 females) with bilateral moderate to moderately severe sensorineural hearing impairment, aged between 7.7 and 14 years. Gain prescription was fitted by the NAL method. Then speech evaluation was performed in a quiet place, with and without the hearing aid, using a list of 25 monosyllabic words recorded on tape. A list was prepared on which the subjects checked the correct responses. The same method was used to obtain results for normal subjects. Results: The results revealed that the subjects using hearing aids achieved significantly higher speech recognition scores than when not wearing them. Although speech recognition ability was not compensated completely (the maximum score obtained was 60%), it was also revealed that syllable recognition ability at the less amplified frequencies was decreased. The scores were much higher in normal subjects (an average of 88%). Conclusion: It seems that speech recognition scores can provide audiologists with a more comprehensive method for evaluating hearing aid benefits.

  1. Dynamic HMM Model with Estimated Dynamic Property in Continuous Mandarin Speech Recognition

    Institute of Scientific and Technical Information of China (English)

    CHEN Feili; ZHU Jie

    2003-01-01

    A new dynamic HMM (hidden Markov model) is introduced in this paper, which describes the relationship between the dynamic property and the feature space. The method to estimate the dynamic property is discussed, which makes the dynamic HMM much more practical in real-time speech recognition. Experiments on a large vocabulary continuous Mandarin speech recognition task have shown that the dynamic HMM can achieve about 10% error reduction for both tonal and toneless syllables. The estimated dynamic property achieves nearly the same (or even better) performance as the extracted dynamic property.

  2. Fusing Eye-gaze and Speech Recognition for Tracking in an Automatic Reading Tutor

    DEFF Research Database (Denmark)

    Rasmussen, Morten Højfeldt; Tan, Zheng-Hua

    2013-01-01

    In this paper we present a novel approach for automatically tracking reading progress using a combination of eye-gaze tracking and speech recognition. The two are fused by first generating word probabilities based on eye-gaze information and then using these probabilities to augment the language model…

  3. Emotional recognition from the speech signal for a virtual education agent

    Science.gov (United States)

    Tickle, A.; Raghu, S.; Elshaw, M.

    2013-06-01

    This paper explores the extraction of features from the speech wave to perform intelligent emotion recognition. A feature extraction tool (openSMILE) was used to obtain a baseline set of 998 acoustic features from a set of emotional speech recordings made with a microphone. The initial features were reduced to the most important ones so that recognition of emotions using a supervised neural network could be performed. Given that the future use of virtual education agents lies in making the agents more interactive, developing agents with the capability to recognise and adapt to the emotional state of humans is an important step.

  4. A Computational Auditory Scene Analysis System for Speech Segregation and Robust Speech Recognition

    Science.gov (United States)

    2007-01-01

    Droppo, J., Acero, A., 2005. Dynamic compensation of HMM variances using the feature enhancement uncertainty computed from a parametric model of speech … analysis. IEEE Trans. on Audio, Speech, and Language Processing 15, 396–405. Huang, X., Acero, A., Hon, H., 2001. Spoken Language Processing. Prentice Hall.

  5. A HYBRID METHOD FOR AUTOMATIC SPEECH RECOGNITION PERFORMANCE IMPROVEMENT IN REAL WORLD NOISY ENVIRONMENT

    Directory of Open Access Journals (Sweden)

    Urmila Shrawankar

    2013-01-01

    Full Text Available It is a well-known fact that speech recognition systems perform well when the system is used in conditions similar to those used to train the acoustic models. However, mismatches degrade the performance. In an adverse environment it is very difficult to predict the category of real-world environmental noise in advance, and difficult to achieve environmental robustness. After a rigorous experimental study it is observed that no single method is available that will both clean noisy speech and preserve the quality of speech that has been corrupted by real, natural environmental (mixed) noise. It is also observed that back-end techniques alone are not sufficient to improve the performance of a speech recognition system. It is necessary to implement performance-improvement techniques at every step of the back-end as well as the front-end of the Automatic Speech Recognition (ASR) model. Current recognition systems address this problem using a technique called adaptation. This study presents an experimental investigation with two aims. The first is to implement a hybrid method that cleans the speech signal as much as possible using combinations of filters and enhancement techniques. The second is to develop a method for training on all categories of noise that can adapt the acoustic models to a new environment, which helps to improve the performance of the speech recognizer under real-world mismatched environmental conditions. This experiment confirms that hybrid adaptation methods improve ASR performance on both levels: signal-to-noise ratio (SNR) improvement as well as word recognition accuracy in real-world noisy environments.

  6. Space discriminative function for microphone array robust speech recognition

    Institute of Scientific and Technical Information of China (English)

    Zhao Xianyu; Ou Zhijian; Wang Zuoying

    2005-01-01

    Based on the W-disjoint orthogonality of speech mixtures, a space discriminative function was proposed to enumerate and localize competing speakers in the surrounding environment. Then, a Wiener-like post-filter was developed to adaptively suppress interference. Experimental results with a hands-free speech recognizer under various SNR and competing-speaker settings show that nearly 69% error reduction can be obtained with a two-channel small-aperture microphone array compared with the conventional single-microphone baseline system. Comparisons were made against traditional delay-and-sum and Griffiths-Jim adaptive beamforming techniques to further assess the effectiveness of this method.

  7. Speech Emotion Recognition Based on Parametric Filter and Fractal Dimension

    Science.gov (United States)

    Mao, Xia; Chen, Lijiang

    In this paper, we propose a new method that employs two novel features, correlation density (Cd) and fractal dimension (Fd), to recognize emotional states contained in speech. The former feature obtained by a list of parametric filters reflects the broad frequency components and the fine structure of lower frequency components, contributed by unvoiced phones and voiced phones, respectively; the latter feature indicates the non-linearity and self-similarity of a speech signal. Comparative experiments based on Hidden Markov Model and K Nearest Neighbor methods are carried out. The results show that Cd and Fd are much more closely related with emotional expression than the features commonly used.
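
    The excerpt does not specify which fractal dimension definition the authors used, so the sketch below uses Higuchi's method, a common choice for estimating the fractal dimension of a 1-D signal, purely as an illustration of what such a feature measures (noise-like signals score higher than smooth tones).

```python
# Higuchi fractal dimension sketch for a 1-D signal (one common way to
# estimate the fractal dimension of speech; an illustrative choice only).
import numpy as np

def higuchi_fd(x, k_max=10):
    x = np.asarray(x, dtype=float)
    n = len(x)
    lengths = []
    for k in range(1, k_max + 1):
        lk = []
        for m in range(k):
            idx = np.arange(m, n, k)              # subsampled series
            if len(idx) < 2:
                continue
            dist = np.abs(np.diff(x[idx])).sum()
            norm = (n - 1) / ((len(idx) - 1) * k)  # length normalization
            lk.append(dist * norm / k)
        lengths.append(np.mean(lk))
    # Slope of log(curve length) vs. log(1/k) estimates the fractal dimension.
    slope, _ = np.polyfit(np.log(1.0 / np.arange(1, k_max + 1)), np.log(lengths), 1)
    return slope

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    tone = np.sin(2 * np.pi * 5 * np.linspace(0, 1, 1000))
    noise = rng.standard_normal(1000)
    print(f"tone FD ~ {higuchi_fd(tone):.2f}, noise FD ~ {higuchi_fd(noise):.2f}")
```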

  8. Implementation of a Tour Guide Robot System Using RFID Technology and Viterbi Algorithm-Based HMM for Speech Recognition

    Directory of Open Access Journals (Sweden)

    Neng-Sheng Pai

    2014-01-01

    Full Text Available This paper applied speech recognition and RFID technologies to develop an omni-directional mobile robot into a robot with voice control and guide introduction functions. For speech recognition, the speech signals were captured by short-time processing. The speaker first recorded isolated words for the robot to create a speech database of specific speakers. After pre-processing of this speech database, the feature parameters of cepstrum and delta-cepstrum were obtained using linear predictive coefficients (LPC). Then, Hidden Markov Models (HMM) were used for model training on the speech database, and the Viterbi algorithm was used to find an optimal state sequence as the reference sample for speech recognition. The trained reference models were put into the industrial computer on the robot platform, and the user uttered the isolated words to be tested. After processing by the same front end and comparison with the previously trained reference models, the path with the maximum total probability among the models, found using the Viterbi algorithm, was taken as the recognition result. Finally, the speech recognition and RFID systems were deployed in an actual environment to prove their feasibility and stability, and were implemented on the omni-directional mobile robot.

  9. ANALYSIS OF MULTIMODAL FUSION TECHNIQUES FOR AUDIO-VISUAL SPEECH RECOGNITION

    Directory of Open Access Journals (Sweden)

    D.V. Ivanko

    2016-05-01

    Full Text Available The paper presents an analytical review covering the latest achievements in the field of audio-visual (AV) fusion (integration) of multimodal information. We discuss the main challenges and report on approaches to address them. One of the most important tasks of AV integration is to understand how the modalities interact and influence each other. The paper addresses this problem in the context of AV speech processing and speech recognition. In the first part of the review we set out the basic principles of AV speech recognition and give a classification of the audio and visual features of speech. Special attention is paid to the systematization of existing techniques and AV data fusion methods. In the second part we provide a consolidated list of tasks and applications that use AV fusion, based on our analysis of the research area. We also indicate the methods, techniques, and audio and video features used. We propose a classification of AV integration and discuss the advantages and disadvantages of different approaches. We draw conclusions and offer our assessment of the future of the field of AV fusion. In further research we plan to implement a system for audio-visual Russian continuous speech recognition using advanced methods of multimodal fusion.

  10. Speech recognition using Kohonen neural networks, dynamic programming, and multi-feature fusion

    Science.gov (United States)

    Stowe, Francis S.

    1990-12-01

    The purpose of this thesis was to develop and evaluate the performance of a three-feature speech recognition system. The three features used were the LPC spectrum, formants (F1/F2), and cepstrum. The system uses Kohonen neural networks, dynamic programming, and a rule-based feature-fusion process which integrates the three input features into one output result. The first half of this research involved evaluating the system in a speaker-dependent setting. For this, the 70-word F-16 cockpit command vocabulary was used, and both isolated and connected speech were tested. Results obtained are compared to a two-feature system with the same system configuration. Isolated-speech testing yielded 98.7 percent accuracy. Connected-speech testing yielded 75.0 percent accuracy. The three-feature system performed an average of 1.7 percent better than the two-feature system for isolated speech. The second half of this research was concerned with the speaker-independent performance of the system. First, cross-speaker testing was performed using an updated 86-word library. In general, this testing yielded less than 50 percent accuracy. Then, testing was performed using averaged templates. This testing yielded an overall average in-template recognition rate of approximately 90 percent and an out-of-template recognition rate of approximately 75 percent.
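
    As a small illustration of the dynamic-programming component of such a template-based system, the sketch below computes a dynamic time warping (DTW) distance between a test feature sequence and stored word templates and picks the nearest template. The synthetic sequences and the two word labels are illustrative assumptions, not the thesis's vocabulary or features.

```python
# Dynamic time warping (DTW) sketch for template-based word matching.
# Feature sequences are synthetic stand-ins for per-frame feature vectors.
import numpy as np

def dtw_distance(a, b):
    """a: (Ta, D), b: (Tb, D) feature sequences; returns accumulated DTW cost."""
    ta, tb = len(a), len(b)
    cost = np.full((ta + 1, tb + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, ta + 1):
        for j in range(1, tb + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return cost[ta, tb]

def classify(templates, test):
    return min(templates, key=lambda w: dtw_distance(templates[w], test))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    templates = {"left": rng.normal(0, 1, (40, 12)),
                 "right": rng.normal(2, 1, (35, 12))}
    test = rng.normal(2, 1, (38, 12))
    print(classify(templates, test))   # expected: 'right'
```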

  11. Recognition of Rapid Speech by Blind and Sighted Older Adults

    Science.gov (United States)

    Gordon-Salant, Sandra; Friedman, Sarah A.

    2011-01-01

    Purpose: To determine whether older blind participants recognize time-compressed speech better than older sighted participants. Method: Three groups of adults with normal hearing participated (n = 10/group): (a) older sighted, (b) older blind, and (c) younger sighted listeners. Low-predictability sentences that were uncompressed (0% time…

  12. Speech recognition for the anaesthesia record during crisis scenarios

    DEFF Research Database (Denmark)

    Alapetite, Alexandre

    2008-01-01

    … by a keyword; combination of command and free-text modes); finally, to quantify some of the gains that could be provided by the speech input modality. Methods: Six anaesthesia teams composed of one doctor and one nurse were each confronted with two crisis scenarios in a full-scale anaesthesia simulator. Each…

  13. Working Papers in Speech Recognition. IV. The Hearsay II System

    Science.gov (United States)

    1976-02-01

    to the overall goal of the problem solver, (4) The efficiency principle: more processing should be given to KSs which perform most reliably and...duration of the speech, actions which can produce such hypotheses or support them will be most preferred. The objective of the efficiency principle, to

  14. Audiovisual cues benefit recognition of accented speech in noise but not perceptual adaptation.

    Science.gov (United States)

    Banks, Briony; Gowen, Emma; Munro, Kevin J; Adank, Patti

    2015-01-01

    Perceptual adaptation allows humans to recognize different varieties of accented speech. We investigated whether perceptual adaptation to accented speech is facilitated if listeners can see a speaker's facial and mouth movements. In Study 1, participants listened to sentences in a novel accent and underwent a period of training with audiovisual or audio-only speech cues, presented in quiet or in background noise. A control group also underwent training with visual-only (speech-reading) cues. We observed no significant difference in perceptual adaptation between any of the groups. To address a number of remaining questions, we carried out a second study using a different accent, speaker and experimental design, in which participants listened to sentences in a non-native (Japanese) accent with audiovisual or audio-only cues, without separate training. Participants' eye gaze was recorded to verify that they looked at the speaker's face during audiovisual trials. Recognition accuracy was significantly better for audiovisual than for audio-only stimuli; however, no statistical difference in perceptual adaptation was observed between the two modalities. Furthermore, Bayesian analysis suggested that the data supported the null hypothesis. Our results suggest that although the availability of visual speech cues may be immediately beneficial for recognition of unfamiliar accented speech in noise, it does not improve perceptual adaptation.

  15. Comparing Three Methods to Create Multilingual Phone Models for Vocabulary Independent Speech Recognition Tasks

    Science.gov (United States)

    2000-08-01

    Defense Technical Information Center Compilation Part Notice ADP010392. TITLE: Comparing Three Methods to Create Multilingual Phone Models for Vocabulary Independent Speech Recognition Tasks. Joachim Köhler, German National Research Center for... [Fragmentary OCR snippets, including a citation to Glass et al., "Multilingual Spoken Language Understanding in the MIT VOYAGER System", and a mention of multilingual clusters and 5280 monolingual clusters.]

  16. Evaluating Automatic Speech Recognition-Based Language Learning Systems: A Case Study

    Science.gov (United States)

    van Doremalen, Joost; Boves, Lou; Colpaert, Jozef; Cucchiarini, Catia; Strik, Helmer

    2016-01-01

    The purpose of this research was to evaluate a prototype of an automatic speech recognition (ASR)-based language learning system that provides feedback on different aspects of speaking performance (pronunciation, morphology and syntax) to students of Dutch as a second language. We carried out usability reviews, expert reviews and user tests to…

  17. Review of Speech-to-Text Recognition Technology for Enhancing Learning

    Science.gov (United States)

    Shadiev, Rustam; Hwang, Wu-Yuin; Chen, Nian-Shing; Huang, Yueh-Min

    2014-01-01

    This paper reviewed literature from 1999 to 2014 inclusively on how Speech-to-Text Recognition (STR) technology has been applied to enhance learning. The first aim of this review is to understand how STR technology has been used to support learning over the past fifteen years, and the second is to analyze all research evidence to understand how…

  18. ISOLATED SPEECH RECOGNITION SYSTEM FOR TAMIL LANGUAGE USING STATISTICAL PATTERN MATCHING AND MACHINE LEARNING TECHNIQUES

    Directory of Open Access Journals (Sweden)

    VIMALA C.

    2015-05-01

    Full Text Available In recent years, speech technology has become a vital part of our daily lives. Various techniques have been proposed for developing Automatic Speech Recognition (ASR) systems and have achieved great success in many applications. Among them, Template Matching techniques like Dynamic Time Warping (DTW), Statistical Pattern Matching techniques such as Hidden Markov Models (HMM) and Gaussian Mixture Models (GMM), and Machine Learning techniques such as Neural Networks (NN), Support Vector Machines (SVM), and Decision Trees (DT) are the most popular. The main objective of this paper is to design and develop a speaker-independent isolated speech recognition system for the Tamil language using the above speech recognition techniques. The background of ASR systems, the steps involved in ASR, the merits and demerits of the conventional and machine learning algorithms, and the observations made based on the experiments are presented in this paper. For the developed system, the highest word recognition accuracy is achieved with the HMM technique, which offered 100% accuracy during training and 97.92% during testing.

  19. Recognition of temporally interrupted and spectrally degraded sentences with additional unprocessed low-frequency speech

    NARCIS (Netherlands)

    Baskent, Deniz; Chatterjeec, Monita

    2010-01-01

    Recognition of periodically interrupted sentences (with an interruption rate of 1.5 Hz, 50% duty cycle) was investigated under conditions of spectral degradation, implemented with a noiseband vocoder, with and without additional unprocessed low-pass filtered speech (cutoff frequency 500 Hz). Intelli

  20. Automatic Speech Recognition Technology as an Effective Means for Teaching Pronunciation

    Science.gov (United States)

    Elimat, Amal Khalil; AbuSeileek, Ali Farhan

    2014-01-01

    This study aimed to explore the effect of using automatic speech recognition technology (ASR) on the third grade EFL students' performance in pronunciation, whether teaching pronunciation through ASR is better than regular instruction, and the most effective teaching technique (individual work, pair work, or group work) in teaching pronunciation…

  1. Serial audiometry and speech recognition findings in Finnish Usher syndrome type III patients.

    NARCIS (Netherlands)

    Plantinga, R.F.; Kleemola, L.; Huygen, P.L.M.; Joensuu, T.; Sankila, E.M.; Pennings, R.J.E.; Cremers, C.W.R.J.

    2005-01-01

    Audiometric features, evaluated by serial pure tone audiometry and speech recognition tests (n = 31), were analysed in 59 Finnish Usher syndrome type III patients (USH3) with Finmajor/Finmajor (n = 55) and Finmajor/Finminor (n = 4) USH3A mutations. These patients showed a highly variable type and de

  2. The Affordance of Speech Recognition Technology for EFL Learning in an Elementary School Setting

    Science.gov (United States)

    Liaw, Meei-Ling

    2014-01-01

    This study examined the use of speech recognition (SR) technology to support a group of elementary school children's learning of English as a foreign language (EFL). SR technology has been used in various language learning contexts. Its application to EFL teaching and learning is still relatively recent, but a solid understanding of its…

  3. Errors in Automatic Speech Recognition versus Difficulties in Second Language Listening

    Science.gov (United States)

    Mirzaei, Maryam Sadat; Meshgi, Kourosh; Akita, Yuya; Kawahara, Tatsuya

    2015-01-01

    Automatic Speech Recognition (ASR) technology has become a part of contemporary Computer-Assisted Language Learning (CALL) systems. ASR systems, however, are criticized for their erroneous performance, especially when utilized as a means to develop skills in a Second Language (L2), where errors are not tolerated. Nevertheless, these errors can…

  4. User Experience of a Mobile Speaking Application with Automatic Speech Recognition for EFL Learning

    Science.gov (United States)

    Ahn, Tae youn; Lee, Sangmin-Michelle

    2016-01-01

    With the spread of mobile devices, mobile phones have enormous potential regarding their pedagogical use in language education. The goal of this study is to analyse user experience of a mobile-based learning system that is enhanced by speech recognition technology for the improvement of EFL (English as a foreign language) learners' speaking…

  5. Speech-based recognition of self-reported and observed emotion in a dimensional space

    NARCIS (Netherlands)

    Truong, Khiet P.; Leeuwen, van David A.; Jong, de Franciska M.G.

    2012-01-01

    The differences between self-reported and observed emotion have only marginally been investigated in the context of speech-based automatic emotion recognition. We address this issue by comparing self-reported emotion ratings to observed emotion ratings and look at how differences between these two t

  6. Development of coffee maker service robot using speech and face recognition systems using POMDP

    Science.gov (United States)

    Budiharto, Widodo; Meiliana; Santoso Gunawan, Alexander Agung

    2016-07-01

    There have been many developments of intelligent service robots intended to interact with users naturally. This can be achieved by embedding speech and face recognition abilities for specific tasks in the robot. In this research, we propose an Intelligent Coffee Maker Robot whose speech recognition is based on the Indonesian language and powered by statistical dialogue systems. This kind of robot can be used in offices, supermarkets, or restaurants. In our scenario, the robot recognizes the user's face and then accepts commands from the user to perform an action, specifically making a coffee. Based on our previous work, the accuracy of speech recognition is about 86% and of face recognition about 93% in laboratory experiments. The main problem is determining the user's intention regarding how sweet the coffee should be. The intelligent coffee maker robot should infer the user's intention through conversation under unreliable automatic speech recognition in a noisy environment. In this paper, this spoken dialogue problem is treated as a partially observable Markov decision process (POMDP). Empirical results show that this formulation establishes a promising framework. Dialogue simulations are presented that demonstrate significant quantitative outcomes.
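    To make the POMDP treatment of the sweetness-intention problem concrete, the sketch below performs a simple Bayesian belief update over a small set of hypothetical user goals given a noisy observation model; the states, observations, probabilities, and confirmation policy are invented for illustration and are not taken from the paper.

```python
import numpy as np

# Hypothetical user-intention states (coffee sweetness) and ASR observations.
states = ["no_sugar", "normal", "extra_sweet"]
observations = ["heard_no_sugar", "heard_normal", "heard_extra_sweet"]

# P(observation | state): rows are states, columns are observations.
# Off-diagonal mass models ASR confusions in a noisy environment.
obs_model = np.array([
    [0.70, 0.20, 0.10],
    [0.15, 0.70, 0.15],
    [0.10, 0.20, 0.70],
])

def belief_update(belief, obs_index):
    """One POMDP belief update for a static user goal (no transition model)."""
    unnormalised = belief * obs_model[:, obs_index]
    return unnormalised / unnormalised.sum()

belief = np.full(len(states), 1.0 / len(states))  # uniform prior
for obs in ["heard_normal", "heard_normal", "heard_extra_sweet"]:
    belief = belief_update(belief, observations.index(obs))
    print(dict(zip(states, belief.round(3))))

# A toy policy: keep asking for confirmation until one goal is confident enough.
if belief.max() < 0.8:
    print("Robot: could you repeat how sweet you want the coffee?")
else:
    print("Robot proceeds with:", states[int(belief.argmax())])
```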

  7. Cued Speech Gesture Recognition: A First Prototype Based on Early Reduction

    Directory of Open Access Journals (Sweden)

    Caplier Alice

    2007-01-01

    Full Text Available Cued Speech is a specific linguistic code for hearing-impaired people. It is based on both lip reading and manual gestures. In the context of THIMP (Telephony for the Hearing-IMpaired Project), we work on automatic cued speech translation. In this paper, we only address the problem of automatic cued speech manual gesture recognition. Such a gesture recognition issue is really common from a theoretical point of view, but we approach it with respect to its particularities in order to derive an original method. This method is essentially built around a bioinspired method called early reduction. Prior to a complete analysis of each image of a sequence, the early reduction process automatically extracts a restricted number of key images which summarize the whole sequence. Only the key images are studied from a temporal point of view with lighter computation than the complete sequence.

  8. Cued Speech Gesture Recognition: A First Prototype Based on Early Reduction

    Directory of Open Access Journals (Sweden)

    Pascal Perret

    2008-01-01

    Full Text Available Cued Speech is a specific linguistic code for hearing-impaired people. It is based on both lip reading and manual gestures. In the context of THIMP (Telephony for the Hearing-IMpaired Project), we work on automatic cued speech translation. In this paper, we only address the problem of automatic cued speech manual gesture recognition. Such a gesture recognition issue is really common from a theoretical point of view, but we approach it with respect to its particularities in order to derive an original method. This method is essentially built around a bioinspired method called early reduction. Prior to a complete analysis of each image of a sequence, the early reduction process automatically extracts a restricted number of key images which summarize the whole sequence. Only the key images are studied from a temporal point of view with lighter computation than the complete sequence.

  9. A commercial large-vocabulary discrete speech recognition system: DragonDictate.

    Science.gov (United States)

    Mandel, M A

    1992-01-01

    DragonDictate is currently the only commercially available general-purpose, large-vocabulary speech recognition system. It uses discrete speech and is speaker-dependent, adapting to the speaker's voice and language model with every word. Its acoustic adaptability is based in a three-level phonology and a stochastic model of production. The phonological levels are phonemes, augmented triphones (phonemes-in-context or PICs), and steady-state spectral slices that are concatenated to approximate the spectra of these PICs (phonetic elements or PELs) and thus of words. Production is treated as a hidden Markov process, which the recognizer has to identify from its output, the spoken word. Findings of practical value to speech recognition are presented from research on six European languages.

  10. Preliminary Analysis of Automatic Speech Recognition and Synthesis Technology.

    Science.gov (United States)

    1983-05-01

    but also suprasegmental (prosodic) information, such as appropriate stress levels and intonation patterns, to improve their output. This is a definite...Society of America Paper, 1975. "Suprasegmentals." Cambridge, Mass.: MIT Press, 1970. Makhoul, J., Viswanathan, R., and Huggins, W. F. "A Mixed...the "suprasegmentals" or influences of duration, fundamental frequency, and speech production power upon basic phonemes. These...influences

  11. Dyadic Wavelet Features for Isolated Word Speaker Dependent Speech Recognition

    Science.gov (United States)

    1994-03-01

    contains ten examples of each of the spoken digits ("zero" through "nine") for eight different speakers; four male and four female. The speech recordings...there were no overlapping windows. Once the feature vector was determined, the features were level normalized. This was achieved by subtracting each

  12. Audio-Visual Tibetan Speech Recognition Based on a Deep Dynamic Bayesian Network for Natural Human Robot Interaction

    Directory of Open Access Journals (Sweden)

    Yue Zhao

    2012-12-01

    Full Text Available Audio-visual speech recognition is a natural and robust approach to improving human-robot interaction in noisy environments. Although multi-stream Dynamic Bayesian Networks and coupled HMMs are widely used for audio-visual speech recognition, they fail to learn the shared features between modalities and ignore the dependency of features among the frames within each discrete state. In this paper, we propose a Deep Dynamic Bayesian Network (DDBN) to perform unsupervised extraction of spatial-temporal multimodal features from Tibetan audio-visual speech data and build an accurate audio-visual speech recognition model without the frame-independence assumption. The experimental results on Tibetan speech data from real-world environments show that the proposed DDBN outperforms state-of-the-art methods in word recognition accuracy.

  13. Iconic Gestures for Robot Avatars, Recognition and Integration with Speech

    Directory of Open Access Journals (Sweden)

    Paul Adam Bremner

    2016-02-01

    Full Text Available Co-verbal gestures are an important part of human communication, improving its efficiency and efficacy for information conveyance. One possible means by which such multi-modal communication might be realised remotely is through the use of a tele-operated humanoid robot avatar. Such avatars have been previously shown to enhance social presence and operator salience. We present a motion tracking based tele-operation system for the NAO robot platform that allows direct transmission of speech and gestures produced by the operator. To assess the capabilities of this system for transmitting multi-modal communication, we have conducted a user study that investigated if robot-produced iconic gestures are comprehensible, and are integrated with speech. Robot performed gesture outcomes were compared directly to those for gestures produced by a human actor, using a within participant experimental design. We show that iconic gestures produced by a tele-operated robot are understood by participants when presented alone, almost as well as when produced by a human. More importantly, we show that gestures are integrated with speech when presented as part of a multi-modal communication equally well for human and robot performances.

  14. Iconic Gestures for Robot Avatars, Recognition and Integration with Speech.

    Science.gov (United States)

    Bremner, Paul; Leonards, Ute

    2016-01-01

    Co-verbal gestures are an important part of human communication, improving its efficiency and efficacy for information conveyance. One possible means by which such multi-modal communication might be realized remotely is through the use of a tele-operated humanoid robot avatar. Such avatars have been previously shown to enhance social presence and operator salience. We present a motion tracking based tele-operation system for the NAO robot platform that allows direct transmission of speech and gestures produced by the operator. To assess the capabilities of this system for transmitting multi-modal communication, we have conducted a user study that investigated if robot-produced iconic gestures are comprehensible, and are integrated with speech. Robot performed gesture outcomes were compared directly to those for gestures produced by a human actor, using a within participant experimental design. We show that iconic gestures produced by a tele-operated robot are understood by participants when presented alone, almost as well as when produced by a human. More importantly, we show that gestures are integrated with speech when presented as part of a multi-modal communication equally well for human and robot performances.

  15. Iconic Gestures for Robot Avatars, Recognition and Integration with Speech

    Science.gov (United States)

    Bremner, Paul; Leonards, Ute

    2016-01-01

    Co-verbal gestures are an important part of human communication, improving its efficiency and efficacy for information conveyance. One possible means by which such multi-modal communication might be realized remotely is through the use of a tele-operated humanoid robot avatar. Such avatars have been previously shown to enhance social presence and operator salience. We present a motion tracking based tele-operation system for the NAO robot platform that allows direct transmission of speech and gestures produced by the operator. To assess the capabilities of this system for transmitting multi-modal communication, we have conducted a user study that investigated if robot-produced iconic gestures are comprehensible, and are integrated with speech. Robot performed gesture outcomes were compared directly to those for gestures produced by a human actor, using a within participant experimental design. We show that iconic gestures produced by a tele-operated robot are understood by participants when presented alone, almost as well as when produced by a human. More importantly, we show that gestures are integrated with speech when presented as part of a multi-modal communication equally well for human and robot performances. PMID:26925010

  16. An FFT-Based Companding Front End for Noise-Robust Automatic Speech Recognition

    Directory of Open Access Journals (Sweden)

    Turicchia Lorenzo

    2007-01-01

    Full Text Available We describe an FFT-based companding algorithm for preprocessing speech before recognition. The algorithm mimics tone-to-tone suppression and masking in the auditory system to improve automatic speech recognition performance in noise. Moreover, it is also very computationally efficient and suited to digital implementations due to its use of the FFT. In an automotive digits recognition task with the CU-Move database recorded in real environmental noise, the algorithm improves the relative word error by 12.5% at -5 dB signal-to-noise ratio (SNR) and by 6.2% across all SNRs (-5 dB SNR to +5 dB SNR). In the Aurora-2 database recorded with artificially added noise in several environments, the algorithm improves the relative word error rate in almost all situations.

  17. An FFT-Based Companding Front End for Noise-Robust Automatic Speech Recognition

    Directory of Open Access Journals (Sweden)

    Bhiksha Raj

    2007-06-01

    Full Text Available We describe an FFT-based companding algorithm for preprocessing speech before recognition. The algorithm mimics tone-to-tone suppression and masking in the auditory system to improve automatic speech recognition performance in noise. Moreover, it is also very computationally efficient and suited to digital implementations due to its use of the FFT. In an automotive digits recognition task with the CU-Move database recorded in real environmental noise, the algorithm improves the relative word error by 12.5% at −5 dB signal-to-noise ratio (SNR) and by 6.2% across all SNRs (−5 dB SNR to +15 dB SNR). In the Aurora-2 database recorded with artificially added noise in several environments, the algorithm improves the relative word error rate in almost all situations.
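    The compress-then-expand (companding) idea described in the two records above can be sketched, very loosely, in the FFT domain: each channel is compressed using a broad envelope that includes its neighbours and re-expanded using its own narrow envelope, so that isolated channels pass unchanged while channels that are weak relative to a strong neighbour are attenuated. The channel layout, exponent, and envelope definitions below are illustrative assumptions, not the paper's filterbank design.

```python
import numpy as np

def companding_frame(frame, n_channels=16, exponent=0.3):
    """Loose FFT-domain emulation of a compress-then-expand (companding) front end.

    For each spectral channel, compression is driven by a broad envelope that
    includes neighbouring channels and expansion by the channel's own narrow
    envelope.  Isolated channels pass through with unit gain, while channels
    that are weak relative to a strong neighbour are suppressed, loosely
    mimicking two-tone suppression.
    """
    frame = np.asarray(frame, dtype=float)
    spectrum = np.fft.rfft(frame * np.hanning(len(frame)))
    mag, phase = np.abs(spectrum), np.angle(spectrum)

    edges = np.linspace(0, len(mag), n_channels + 1, dtype=int)
    narrow = np.array([mag[lo:hi].mean()
                       for lo, hi in zip(edges[:-1], edges[1:])]) + 1e-12
    # Broad envelope: the maximum over the channel and its two neighbours.
    broad = np.array([narrow[max(0, c - 1):c + 2].max()
                      for c in range(n_channels)])

    out = mag.copy()
    for c, (lo, hi) in enumerate(zip(edges[:-1], edges[1:])):
        compress_gain = broad[c] ** (exponent - 1.0)
        expand_gain = (narrow[c] * compress_gain) ** (1.0 / exponent - 1.0)
        out[lo:hi] = mag[lo:hi] * compress_gain * expand_gain
    return np.fft.irfft(out * np.exp(1j * phase), n=len(frame))

# Hypothetical usage on a frame containing a strong tone and a weak neighbour.
t = np.arange(400) / 8000.0
frame = np.sin(2 * np.pi * 500 * t) + 0.05 * np.sin(2 * np.pi * 700 * t)
print(companding_frame(frame).shape)  # (400,)
```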

  18. Adaptive Recognition of Phonemes from Speaker - Connected-Speech Using Alisa.

    Science.gov (United States)

    Osella, Stephen Albert

    The purpose of this dissertation research is to investigate a novel approach to automatic speech recognition (ASR). The successes that have been achieved in ASR have relied heavily on the use of a language grammar, which significantly constrains the ASR process. By using grammar to provide most of the recognition ability, the ASR system does not have to be as accurate at the low-level recognition stage. The ALISA Phonetic Transcriber (APT) algorithm is proposed as a way to improve ASR by enhancing the lowest -level recognition stage. The objective of the APT algorithm is to classify speech frames (a short sequence of speech signal samples) into a small set of phoneme classes. The APT algorithm constructs the mapping from speech frames to phoneme labels through a multi-layer feedforward process. A design principle of APT is that final decisions are delayed as long as possible. Instead of attempting to optimize the decision making at each processing level individually, each level generates a list of candidate solutions that are passed on to the next level of processing. The later processing levels use these candidate solutions to resolve ambiguities. The scope of this dissertation is the design of the APT algorithm up to the speech-frame classification stage. In future research, the APT algorithm will be extended to the word recognition stage. In particular, the APT algorithm could serve as the front-end stage to a Hidden Markov Model (HMM) based word recognition system. In such a configuration, the APT algorithm would provide the HMM with the requisite phoneme state-probability estimates. To date, the APT algorithm has been tested with the TIMIT and NTIMIT speech databases. The APT algorithm has been trained and tested on the SX and SI sentence texts using both male and female speakers. Results indicate better performance than those results obtained using a neural network based speech-frame classifier. The performance of the APT algorithm has been evaluated for

  19. Automated detection and recognition of wildlife using thermal cameras.

    Science.gov (United States)

    Christiansen, Peter; Steen, Kim Arild; Jørgensen, Rasmus Nyholm; Karstoft, Henrik

    2014-01-01

    In agricultural mowing operations, thousands of animals are injured or killed each year, due to the increased working widths and speeds of agricultural machinery. Detection and recognition of wildlife within the agricultural fields is important to reduce wildlife mortality and, thereby, promote wildlife-friendly farming. The work presented in this paper contributes to the automated detection and classification of animals in thermal imaging. The methods and results are based on top-view images taken manually from a lift to motivate work towards unmanned aerial vehicle-based detection and recognition. Hot objects are detected based on a threshold dynamically adjusted to each frame. For the classification of animals, we propose a novel thermal feature extraction algorithm. For each detected object, a thermal signature is calculated using morphological operations. The thermal signature describes heat characteristics of objects and is partly invariant to translation, rotation, scale and posture. The discrete cosine transform (DCT) is used to parameterize the thermal signature and, thereby, calculate a feature vector, which is used for subsequent classification. Using a k-nearest-neighbor (kNN) classifier, animals are discriminated from non-animals with a balanced classification accuracy of 84.7% in an altitude range of 3-10 m and an accuracy of 75.2% for an altitude range of 10-20 m. To incorporate temporal information in the classification, a tracking algorithm is proposed. Using temporal information improves the balanced classification accuracy to 93.3% in an altitude range of 3-10 m and 77.7% in an altitude range of 10-20 m.

  20. Simultaneous Blind Separation and Recognition of Speech Mixtures Using Two Microphones to Control a Robot Cleaner

    Directory of Open Access Journals (Sweden)

    Heungkyu Lee

    2013-02-01

    Full Text Available This paper proposes a method for the simultaneous separation and recognition of speech mixtures in noisy environments using two‐channel based independent vector analysis (IVA on a home‐robot cleaner. The issues to be considered in our target application are speech recognition at a distance and noise removal to cope with a variety of noises, including TV sounds, air conditioners, babble, and so on, that can occur in a house, where people can utter a voice command to control a robot cleaner at any time and at any location, even while a robot cleaner is moving. Thus, the system should always be in a recognition‐ready state to promptly recognize a spoken word at any time, and the false acceptance rate should be lower. To cope with these issues, the keyword spotting technique is applied. In addition, a microphone alignment method and a model‐based real‐time IVA approach are proposed to effectively and simultaneously process the speech and noise sources, as well as to cover 360‐degree directions irrespective of distance. From the experimental evaluations, we show that the proposed method is robust in terms of speech recognition accuracy, even when the speaker location is unfixed and changes all the time. In addition, the proposed method shows good performance in severely noisy environments.

  1. Joint training of DNNs by incorporating an explicit dereverberation structure for distant speech recognition

    Science.gov (United States)

    Gao, Tian; Du, Jun; Xu, Yong; Liu, Cong; Dai, Li-Rong; Lee, Chin-Hui

    2016-12-01

    We explore joint training strategies of DNNs for simultaneous dereverberation and acoustic modeling to improve the performance of distant speech recognition. There are two key contributions. First, a new DNN structure incorporating both dereverberated and original reverberant features is shown to effectively improve recognition accuracy over the conventional one using only dereverberated features as the input. Second, in most of the simulated reverberant environments for training data collection and DNN-based dereverberation, the resource data and learning targets are high-quality clean speech. With our joint training strategy, we can relax this constraint by using large-scale, diversified, real close-talking data as the targets, which are easy to collect via many speech-enabled applications from mobile internet users, and we find this scenario even more effective. Our experiments on a Mandarin speech recognition task with 2000 h of training data show that the proposed framework achieves relative word error rate reductions of 9.7% and 8.6% over the multi-condition training systems for the single-channel and multi-channel (with beamforming) cases, respectively. Furthermore, significant gains are consistently observed over the pre-processing approach that simply uses DNN-based dereverberation.

  2. Spectro-Temporal Analysis of Speech for Spanish Phoneme Recognition

    DEFF Research Database (Denmark)

    Sharifzadeh, Sara; Serrano, Javier; Carrabina, Jordi

    2012-01-01

    are considered. This has improved recognition performance, especially in noisy situations and for phonemes with time-domain modulations such as stops. In this method, the 2D Discrete Cosine Transform (DCT) is applied to small, overlapped, 2D Hamming-windowed patches of the spectrograms of Spanish phonemes…
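    A rough sketch of the patch-based 2D-DCT feature extraction described above is given below; the STFT settings, patch size, overlap, and number of retained coefficients are illustrative assumptions, and zig-zag coefficient selection is approximated by a simple flatten-and-truncate.

```python
import numpy as np
from scipy.fft import dctn
from scipy.signal import get_window, stft

def patch_dct_features(signal, fs, patch_shape=(8, 8), n_coeffs=12):
    """2D-DCT features from overlapped, 2D-Hamming-weighted spectrogram patches.

    Compute a log spectrogram, slide a small window over it with 50% overlap,
    weight each patch with a 2D Hamming window, take its 2D DCT, and keep the
    lowest-order coefficients.
    """
    _, _, spec = stft(signal, fs=fs, nperseg=256, noverlap=128)
    log_spec = np.log(np.abs(spec) + 1e-10)

    win = np.outer(get_window("hamming", patch_shape[0]),
                   get_window("hamming", patch_shape[1]))
    step = (patch_shape[0] // 2, patch_shape[1] // 2)  # 50% overlap

    features = []
    for i in range(0, log_spec.shape[0] - patch_shape[0] + 1, step[0]):
        for j in range(0, log_spec.shape[1] - patch_shape[1] + 1, step[1]):
            patch = log_spec[i:i + patch_shape[0], j:j + patch_shape[1]] * win
            features.append(dctn(patch, norm="ortho").flatten()[:n_coeffs])
    return np.array(features)

# Hypothetical usage on one second of noise standing in for a phoneme segment.
rng = np.random.default_rng(0)
print(patch_dct_features(rng.normal(size=16000), fs=16000).shape)
```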

  3. Arabic Language Learning Assisted by Computer, based on Automatic Speech Recognition

    CERN Document Server

    Terbeh, Naim

    2012-01-01

    This work consists of creating a Computer Assisted Language Learning (CALL) system based on an Automatic Speech Recognition (ASR) system for the Arabic language, using the CMU Sphinx3 tool [1] and the HMM approach. For this work, we constructed a corpus of six hours of speech recordings from nine speakers. The robustness to noise is one of the grounds for choosing the HMM approach [2]. The results achieved are encouraging given that our corpus was made with only nine speakers, and they open the door to further improvement work.

  4. Towards a Global Optimization Scheme for Multi-Band Speech Recognition

    OpenAIRE

    Cerisara, Christophe; Haton, Jean-Paul; Fohr, Dominique

    1999-01-01

    Conference with proceedings and review committee. In this paper, we deal with a new method to globally optimize a Multi-Band Speech Recognition (MBSR) system. We have tested our algorithm on the TIMIT database and obtained a significant improvement in accuracy over a basic HMM system for clean speech. The goal of this work is not to prove the effectiveness of MBSR, which has already been done, but to improve the training scheme by introducing a global optimization procedure. A consequence of this me...

  5. Discriminative tonal feature extraction method in mandarin speech recognition

    Institute of Scientific and Technical Information of China (English)

    HUANG Hao; ZHU Jie

    2007-01-01

    To utilize the supra-segmental nature of Mandarin tones, this article proposes a feature extraction method for hidden Markov model (HMM)-based tone modeling. The method uses linear transforms to project the F0 (fundamental frequency) features of neighboring syllables as compensations, and adds them to the original F0 features of the current syllable. The transforms are discriminatively trained using an objective function termed "minimum tone error", which is a smooth approximation of tone recognition accuracy. Experiments show that the new tonal features achieve a 3.82% improvement in tone recognition rate compared with the baseline of maximum-likelihood-trained HMMs on the normal F0 features. Further experiments show that discriminative HMM training on the new features is 8.78% better than the baseline.
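    The feature-construction step described above (before any discriminative training) can be sketched as follows: the F0 vectors of the neighbouring syllables are projected by linear transforms and added to the current syllable's F0 vector. The dimensions, boundary handling, and transform values below are hypothetical; in the paper the transforms are trained with the minimum tone error objective.

```python
import numpy as np

def compensated_f0_features(f0_syllables, left_transform, right_transform):
    """Add linearly transformed neighbouring-syllable F0 vectors to the current one.

    f0_syllables: array of shape (num_syllables, dim), one F0 feature vector
    per syllable.  left_transform / right_transform: (dim, dim) matrices that
    project the previous / next syllable's F0 features as compensation terms.
    Boundary syllables reuse themselves as neighbours for simplicity.
    """
    padded = np.vstack([f0_syllables[:1], f0_syllables, f0_syllables[-1:]])
    prev_part = padded[:-2] @ left_transform.T
    next_part = padded[2:] @ right_transform.T
    return f0_syllables + prev_part + next_part

# Hypothetical usage with 5 syllables and 3-dimensional F0 features.
rng = np.random.default_rng(1)
f0 = rng.normal(size=(5, 3))
A_left, A_right = 0.1 * np.eye(3), 0.1 * np.eye(3)
print(compensated_f0_features(f0, A_left, A_right).shape)  # (5, 3)
```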

  6. Study on Acoustic Modeling in a Mandarin Continuous Speech Recognition

    Institute of Scientific and Technical Information of China (English)

    PENG Di; LIU Gang; GUO Jun

    2007-01-01

    The design of acoustic models is of vital importance for building a reliable connection between the acoustic waveform and linguistic messages in terms of individual speech units. According to the characteristics of Chinese phonemes, the base set of acoustic phoneme units is chosen and refined, and a decision-tree-based state-tying approach is explored. Since one of the advantages of the top-down tying method is flexibility in maintaining a balance between model accuracy and complexity, relevant adjustments are made, such as to the stopping criterion for decision-tree node splitting, for which optimal thresholds are found. Better results are achieved in improving acoustic modeling accuracy as well as keeping the scale of the model trainable.

  7. Dead regions in the cochlea: Implications for speech recognition and applicability of articulation index theory

    DEFF Research Database (Denmark)

    Vestergaard, Martin David

    2003-01-01

    Dead regions in the cochlea have been suggested to be responsible for failure of hearing aid users to benefit from apparently increased audibility in terms of speech intelligibility. As an alternative to the more cumbersome psychoacoustic tuning curve measurement, threshold-equalizing noise (TEN......-pass-filtered speech items. Data were collected from 22 hearing-impaired subjects with moderate-to-profound sensorineural hearing losses. The results showed that 11 subjects exhibited abnormal psychoacoustic behaviour in the TEN test, indicative of a possible dead region. Estimates of audibility were used to assess...... the possible connection between dead-region candidacy and ability to recognize low-pass-filtered speech. Large variability was observed with regard to the ability of audibility to predict recognition scores for both dead-region and no-dead-region subjects. Furthermore, the results indicate that dead

  8. Robust multi-stream speech recognition based on weighting the output probabilities of feature components

    Institute of Scientific and Technical Information of China (English)

    ZHANG Jun; WEI Gang; YU Hua; NING Genxin

    2009-01-01

    In the traditional multi-stream fusion methods of speech recognition, all the feature components in a data stream share the same stream weight, while their distortion levels are usually different when the speech recognizer works in noisy environments. To overcome this limitation of the traditional multi-stream frameworks, the current study proposes a new stream fusion method that weights not only the stream outputs, but also the output probabilities of feature components. How the stream and feature component weights in the new fusion method affect the decision is analyzed, and two stream fusion schemes based on the marginalisation and soft-decision models from the missing data techniques are proposed. Experimental results on the hybrid sub-band multi-stream speech recognizer show that the proposed schemes can adjust the stream influences on the decision adaptively and outperform the traditional multi-stream methods in various noisy environments.
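    A minimal sketch of the core idea, weighting the output probabilities of individual feature components rather than whole streams, is given below; the shapes, weights, and the simple log-linear combination are illustrative assumptions rather than the paper's exact fusion rule.

```python
import numpy as np

def weighted_component_log_likelihood(component_log_likes, component_weights):
    """Combine per-component log-likelihoods with individual exponent weights.

    component_log_likes: array of shape (num_states, num_components) holding
    log p(x_k | state) for each feature component k.  component_weights: array
    of shape (num_components,) in [0, 1], e.g. reflecting how badly each
    component is distorted by noise.  A conventional multi-stream system would
    use one weight per stream; here every component gets its own exponent.
    """
    return component_log_likes @ component_weights  # sum_k w_k * log p(x_k | s)

# Hypothetical usage: 4 HMM states, 6 feature components, with the noisier
# components down-weighted.
rng = np.random.default_rng(2)
log_likes = rng.normal(loc=-5.0, size=(4, 6))
weights = np.array([1.0, 1.0, 0.9, 0.7, 0.4, 0.2])
scores = weighted_component_log_likelihood(log_likes, weights)
print(int(scores.argmax()))  # index of the best-scoring state
```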

  9. Coordinated control of an intelligent wheelchair based on a brain-computer interface and speech recognition

    Institute of Scientific and Technical Information of China (English)

    Hong-tao WANG; Yuan-qing LI; Tian-you YU

    2014-01-01

    An intelligent wheelchair is devised, which is controlled by a coordinated mechanism based on a brain-computer interface (BCI) and speech recognition. By performing appropriate activities, users can navigate the wheelchair with four steering behaviors (start, stop, turn left, and turn right). Five healthy subjects participated in an indoor experiment. The results demonstrate the efficiency of the coordinated control mechanism with satisfactory path and time optimality ratios, and show that speech recognition is a fast and accurate supplement for BCI-based control systems. The proposed intelligent wheelchair is especially suitable for patients suffering from paralysis (especially those with aphasia) who can learn to pronounce only a single sound (e.g.,‘ah’).

  10. Independent Component Analysis and Time-Frequency Masking for Speech Recognition in Multitalker Conditions

    Directory of Open Access Journals (Sweden)

    Reinhold Orglmeister

    2010-01-01

    Full Text Available When a number of speakers are simultaneously active, for example in meetings or noisy public places, the sources of interest need to be separated from interfering speakers and from each other in order to be robustly recognized. Independent component analysis (ICA has proven a valuable tool for this purpose. However, ICA outputs can still contain strong residual components of the interfering speakers whenever noise or reverberation is high. In such cases, nonlinear postprocessing can be applied to the ICA outputs, for the purpose of reducing remaining interferences. In order to improve robustness to the artefacts and loss of information caused by this process, recognition can be greatly enhanced by considering the processed speech feature vector as a random variable with time-varying uncertainty, rather than as deterministic. The aim of this paper is to show the potential to improve recognition of multiple overlapping speech signals through nonlinear postprocessing together with uncertainty-based decoding techniques.

  11. Combined Acoustic and Pronunciation Modelling for Non-Native Speech Recognition

    CERN Document Server

    Bouselmi, Ghazi; Illina, Irina

    2007-01-01

    In this paper, we present several adaptation methods for non-native speech recognition. We have tested pronunciation modelling, MLLR and MAP non-native pronunciation adaptation and HMM models retraining on the HIWIRE foreign accented English speech database. The "phonetic confusion" scheme we have developed consists in associating to each spoken phone several sequences of confused phones. In our experiments, we have used different combinations of acoustic models representing the canonical and the foreign pronunciations: spoken and native models, models adapted to the non-native accent with MAP and MLLR. The joint use of pronunciation modelling and acoustic adaptation led to further improvements in recognition accuracy. The best combination of the above mentioned techniques resulted in a relative word error reduction ranging from 46% to 71%.

  12. Data-Driven Temporal Filtering on Teager Energy Time Trajectory for Robust Speech Recognition

    Institute of Scientific and Technical Information of China (English)

    ZHAO Jun-hui; XIE Xiang; KUANG Jing-ming

    2006-01-01

    A data-driven temporal filtering technique is applied to the time trajectory of Teager energy operator (TEO)-based feature parameters to improve the robustness of speech recognition systems against noise. Three kinds of data-driven temporal filters are investigated with the aim of alleviating the harmful effects of environmental factors on speech. The filters include: principal component analysis (PCA)-based filters, linear discriminant analysis (LDA)-based filters and minimum classification error (MCE)-based filters. A detailed comparative analysis of these temporal filtering approaches applied in the Teager energy domain is presented. It is shown that while all of them can improve the recognition performance of the original TEO-based feature parameters in adverse environments, MCE-based temporal filtering provides the lowest error rate of all the algorithms as the SNR decreases.
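    For reference, the Teager energy operator itself is a simple three-sample nonlinearity, and the data-driven temporal filtering amounts to FIR-filtering its time trajectory. The sketch below uses a hypothetical smoothing filter in place of the PCA-, LDA-, or MCE-trained coefficients.

```python
import numpy as np

def teager_energy(signal):
    """Discrete Teager energy operator: psi[n] = x[n]^2 - x[n-1] * x[n+1].

    Returns one value per interior sample; the trajectory of these values
    (per channel, frame by frame) is what the data-driven temporal filters
    described above would be applied to.
    """
    x = np.asarray(signal, dtype=float)
    return x[1:-1] ** 2 - x[:-2] * x[2:]

def apply_temporal_filter(teo_trajectory, fir_coefficients):
    """Filter a TEO time trajectory with a data-driven FIR filter.

    In the paper the coefficients come from PCA, LDA, or MCE training; here a
    hypothetical 5-tap averaging filter stands in for them.
    """
    return np.convolve(teo_trajectory, fir_coefficients, mode="same")

# Hypothetical usage: a noisy 100 Hz tone sampled at 8 kHz.
t = np.arange(0, 1, 1 / 8000)
sig = np.sin(2 * np.pi * 100 * t) + 0.1 * np.random.default_rng(3).normal(size=t.size)
traj = teager_energy(sig)
print(apply_temporal_filter(traj, np.ones(5) / 5).shape)
```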

  13. Audio-Visual Speech Recognition Using Lip Information Extracted from Side-Face Images

    Directory of Open Access Journals (Sweden)

    Koji Iwano

    2007-03-01

    Full Text Available This paper proposes an audio-visual speech recognition method using lip information extracted from side-face images as an attempt to increase noise robustness in mobile environments. Our proposed method assumes that lip images can be captured using a small camera installed in a handset. Two different kinds of lip features, lip-contour geometric features and lip-motion velocity features, are used individually or jointly, in combination with audio features. Phoneme HMMs modeling the audio and visual features are built based on the multistream HMM technique. Experiments conducted using Japanese connected digit speech contaminated with white noise in various SNR conditions show effectiveness of the proposed method. Recognition accuracy is improved by using the visual information in all SNR conditions. These visual features were confirmed to be effective even when the audio HMM was adapted to noise by the MLLR method.

  14. Audio-Visual Speech Recognition Using Lip Information Extracted from Side-Face Images

    Directory of Open Access Journals (Sweden)

    Iwano Koji

    2007-01-01

    Full Text Available This paper proposes an audio-visual speech recognition method using lip information extracted from side-face images as an attempt to increase noise robustness in mobile environments. Our proposed method assumes that lip images can be captured using a small camera installed in a handset. Two different kinds of lip features, lip-contour geometric features and lip-motion velocity features, are used individually or jointly, in combination with audio features. Phoneme HMMs modeling the audio and visual features are built based on the multistream HMM technique. Experiments conducted using Japanese connected digit speech contaminated with white noise in various SNR conditions show effectiveness of the proposed method. Recognition accuracy is improved by using the visual information in all SNR conditions. These visual features were confirmed to be effective even when the audio HMM was adapted to noise by the MLLR method.

  15. Tone model integration based on discriminative weight training for Putonghua speech recognition

    Institute of Scientific and Technical Information of China (English)

    HUANG Hao; ZHU Jie

    2008-01-01

    A discriminative framework of tone model integration in continuous speech recognition was proposed. The method uses model-dependent weights to scale the probabilities of the hidden Markov models based on spectral features and the tone models based on tonal features. The weights are discriminatively trained with the minimum phone error criterion. An update equation for the model weights based on the extended Baum-Welch algorithm is derived. Various schemes of model weight combination are evaluated, and a smoothing technique is introduced to make training robust to overfitting. The proposed method is evaluated on tonal syllable output and character output speech recognition tasks. The experimental results show the proposed method obtains 9.5% and 4.7% relative error reductions over the global weight on the two tasks due to a better interpolation of the given models. This proves the effectiveness of discriminatively trained model weights for tone model integration.

  16. A Log-Index Weighted Cepstral Distance Measure for Speech Recognition

    Institute of Scientific and Technical Information of China (English)

    ZHENG Fang; WU Wenhu; et al.

    1997-01-01

    A log-index weighted cepstral distance measure is proposed and tested in speaker-independent and speaker-dependent isolated word recognition systems using statistical techniques. The weights for the cepstral coefficients of this measure equal the logarithm of the corresponding indices. The experimental results show that this kind of measure works better than any other weighted Euclidean cepstral distance measure on three speech databases. The error rate obtained using this measure is about 1.8 percent for the three databases on average, which is a 25% reduction from that obtained using other measures, and a 40% reduction from that obtained using the Log Likelihood Ratio (LLR) measure. The experimental results also show that this kind of distance measure works well in both speaker-dependent and speaker-independent speech recognition systems.
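    Taking the description literally, the measure can be sketched as a weighted Euclidean distance whose weight on coefficient i is log(i); the indexing convention (and hence whether the first coefficient is effectively zero-weighted) is an assumption here.

```python
import numpy as np

def log_index_weighted_distance(c1, c2):
    """Weighted Euclidean cepstral distance with weight log(i) on coefficient i.

    c1, c2: cepstral vectors indexed from 1 (i.e. c1[0] is the first cepstral
    coefficient; the energy term c0 is assumed to be excluded).  With this
    literal reading the first coefficient gets weight log(1) = 0; the exact
    indexing convention in the paper may differ.
    """
    c1, c2 = np.asarray(c1, dtype=float), np.asarray(c2, dtype=float)
    weights = np.log(np.arange(1, len(c1) + 1))
    return float(np.sqrt(np.sum((weights * (c1 - c2)) ** 2)))

# Hypothetical usage with two 12-dimensional cepstral vectors.
rng = np.random.default_rng(4)
a, b = rng.normal(size=12), rng.normal(size=12)
print(log_index_weighted_distance(a, b))
```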

  17. Performance Evaluation of Speech Recognition Systems as a Next-Generation Pilot-Vehicle Interface Technology

    Science.gov (United States)

    Arthur, Jarvis J., III; Shelton, Kevin J.; Prinzel, Lawrence J., III; Bailey, Randall E.

    2016-01-01

    During the flight trials known as Gulfstream-V Synthetic Vision Systems Integrated Technology Evaluation (GV-SITE), a Speech Recognition System (SRS) was used by the evaluation pilots. The SRS system was intended to be an intuitive interface for display control (rather than knobs, buttons, etc.). This paper describes the performance of the current "state of the art" Speech Recognition System (SRS). The commercially available technology was evaluated as an application for possible inclusion in commercial aircraft flight decks as a crew-to-vehicle interface. Specifically, the technology is to be used as an interface from aircrew to the onboard displays, controls, and flight management tasks. A flight test of a SRS as well as a laboratory test was conducted.

  18. Integrated search technique for parameter determination of SVM for speech recognition

    Institute of Scientific and Technical Information of China (English)

    Teena Mittal; R K Sharma

    2016-01-01

    Support vector machine (SVM) has a good application prospect for speech recognition problems; still, optimum parameter selection is a vital issue for it. To improve the learning ability of SVM, a method for searching the optimal parameters based on the integration of predator prey optimization (PPO) and the Hooke-Jeeves method has been proposed. In the PPO technique, the population consists of prey and predator particles. The prey particles search for the optimum solution and the predator always attacks the globally best prey particle. The solution obtained by PPO is further improved by applying the Hooke-Jeeves method. The proposed method is applied to recognize isolated words in a Hindi speech database and words in the benchmark TI-20 database in clean and noisy environments. A recognition rate of 81.5% for the Hindi database and 92.2% for the TI-20 database has been achieved using the proposed technique.
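    The local-refinement stage can be illustrated with a small Hooke-Jeeves-style pattern search over SVM hyperparameters using scikit-learn; the dataset, parameter ranges, and starting point are stand-ins (the digits dataset replaces speech features, and the PPO global search is omitted), so this is a sketch of the refinement idea rather than the paper's method.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)  # stand-in for a speech feature set

def accuracy(log_c, log_gamma):
    """Cross-validated accuracy for one (C, gamma) pair, searched in log space."""
    clf = SVC(C=10.0 ** log_c, gamma=10.0 ** log_gamma)
    return cross_val_score(clf, X, y, cv=3).mean()

def hooke_jeeves(start, step=1.0, min_step=0.1, max_evals=60):
    """Hooke-Jeeves-style pattern search (exploratory moves only) over
    (log10 C, log10 gamma), shrinking the step when no neighbour improves."""
    best = np.array(start, dtype=float)
    best_score = accuracy(*best)
    evals = 1
    while step >= min_step and evals < max_evals:
        improved = False
        for d in [(step, 0.0), (-step, 0.0), (0.0, step), (0.0, -step)]:
            cand = best + d
            score = accuracy(*cand)
            evals += 1
            if score > best_score:
                best, best_score, improved = cand, score, True
        if not improved:
            step /= 2.0
    return best, best_score

params, score = hooke_jeeves(start=(0.0, -3.0))
print(params, round(score, 4))
```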

  19. Feeling backwards? How temporal order in speech affects the time course of vocal emotion recognition

    Directory of Open Access Journals (Sweden)

    Simon eRigoulot

    2013-06-01

    Full Text Available Recent studies suggest that the time course for recognizing vocal expressions of basic emotion in speech varies significantly by emotion type, implying that listeners uncover acoustic evidence about emotions at different rates in speech (e.g., fear is recognized most quickly whereas happiness and disgust are recognized relatively slowly, Pell and Kotz, 2011). To investigate whether vocal emotion recognition is largely dictated by the amount of time listeners are exposed to speech or the position of critical emotional cues in the utterance, 40 English participants judged the meaning of emotionally-inflected pseudo-utterances presented in a gating paradigm, where utterances were gated as a function of their syllable structure in segments of increasing duration from the end of the utterance (i.e., gated ‘backwards’). Accuracy for detecting six target emotions in each gate condition and the mean identification point for each emotion in milliseconds were analyzed and compared to results from Pell & Kotz (2011). We again found significant emotion-specific differences in the time needed to accurately recognize emotions from speech prosody, and new evidence that utterance-final syllables tended to facilitate listeners’ accuracy in many conditions when compared to utterance-initial syllables. The time needed to recognize fear, anger, sadness, and neutral from speech cues was not influenced by how utterances were gated, although happiness and disgust were recognized significantly faster when listeners heard the end of utterances first. Our data provide new clues about the relative time course for recognizing vocally-expressed emotions within the 400-1200 millisecond time window, while highlighting that emotion recognition from prosody can be shaped by the temporal properties of speech.

  20. Analysis of Phonetic Transcriptions for Danish Automatic Speech Recognition

    DEFF Research Database (Denmark)

    Kirkedal, Andreas Søeborg

    2013-01-01

    recognition system depends heavily on the dictionary and the transcriptions therein. This paper presents an analysis of phonetic/phonemic features that are salient for current Danish ASR systems. This preliminary study consists of a series of experiments using an ASR system trained on the DK-PAROLE corpus....... The analysis indicates that transcribing e.g. stress or vowel duration has a negative impact on performance. The best performance is obtained with coarse phonetic annotation, which improves performance by 1% word error rate and 3.8% sentence error rate....

  1. Combining speech recognition software with Digital Imaging and Communications in Medicine (DICOM) workstation software on a Microsoft Windows platform.

    Science.gov (United States)

    Ernst, R; Carpenter, W; Torres, W; Wheeler, S

    2001-06-01

    This presentation describes our experience in combining speech recognition software, clinical review software, and other software products on a single computer. Different processor speeds, random access memory (RAM), and computer costs were evaluated. We found that combining continuous speech recognition software with Digital Imaging and Communications in Medicine (DICOM) workstation software on the same platform is feasible and can lead to substantial savings of hardware cost. This combination optimizes use of limited workspace and can improve radiology workflow.

  2. Combining speech recognition software with digital imaging and communications in medicine (DICOM) workstation software on a microsoft windows platform

    OpenAIRE

    Ernst, Randy; Carpenter, Walter; Torres, William; Wheeler, Scott

    2001-01-01

    This presentation describes our experience in combining speech recognition software, clinical review software, and other software products on a single computer. Different processor speeds, random access memory (RAM), and computer costs were evaluated. We found that combining continuous speech recognition software with Digital Imaging and Communications in Medicine (DICOM) workstation software on the same platform is feasible and can lead to substantial savings of hardware cost. This combinati...

  3. Constructing Long Short-Term Memory based Deep Recurrent Neural Networks for Large Vocabulary Speech Recognition

    OpenAIRE

    Li, Xiangang; Wu, Xihong

    2014-01-01

    Long short-term memory (LSTM) based acoustic modeling methods have recently been shown to give state-of-the-art performance on some speech recognition tasks. To achieve a further performance improvement, in this research, deep extensions on LSTM are investigated considering that deep hierarchical model has turned out to be more efficient than a shallow one. Motivated by previous research on constructing deep recurrent neural networks (RNNs), alternative deep LSTM architectures are proposed an...
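    As a minimal illustration of a stacked (deep) LSTM acoustic model of the kind discussed above, the PyTorch sketch below maps a sequence of acoustic feature frames to per-frame logits over HMM states; the layer sizes, feature dimension, and state inventory are illustrative, not the architectures proposed in the paper.

```python
import torch
from torch import nn

class DeepLSTMAcousticModel(nn.Module):
    """Stacked (deep) LSTM mapping acoustic feature frames to state posteriors."""

    def __init__(self, n_features=40, hidden=320, layers=3, n_states=3000):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, num_layers=layers, batch_first=True)
        self.output = nn.Linear(hidden, n_states)

    def forward(self, frames):          # frames: (batch, time, n_features)
        hidden_seq, _ = self.lstm(frames)
        return self.output(hidden_seq)  # per-frame logits over HMM states

# Hypothetical forward pass on a batch of two 100-frame utterances.
model = DeepLSTMAcousticModel()
print(model(torch.randn(2, 100, 40)).shape)  # torch.Size([2, 100, 3000])
```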

  4. A SPEECH RECOGNITION METHOD USING COMPETITIVE AND SELECTIVE LEARNING NEURAL NETWORKS

    Institute of Scientific and Technical Information of China (English)

    2000-01-01

    On the basis of Gersho's asymptotic theory, the isodistortion principle of vector clustering is discussed, and a competitive and selective learning (CSL) method which can avoid local optima and gives excellent results when applied to the clustering of HMM models is proposed. By combining parallel, self-organizational hierarchical neural networks (PSHNN) to reclassify the scores output by the HMMs, the CSL speech recognition rate is markedly improved.

  5. A continuous speech recognition approach for the design of a dictation machine

    OpenAIRE

    Smaïli, Kamel; Charpillet, François; Pierrel, Jean-Mari; Haton, Jean-Paul

    1991-01-01

    International audience; The oral entry of texts (dictation machine) remains an important potential field of application for automatic speech recognition. The RFIA group of CRIN/INRIA has been investigating this research area for the French language during the past ten years. We propose in this paper a general presentation of the present state of our MAUD system which is based upon four major interacting components: an acoustic phonetic decoder, a lexical component, a linguistic model and a us...

  6. Using commercial-off-the-shelf speech recognition software for conning U.S. warships

    OpenAIRE

    Tamez, Dorothy J.

    2003-01-01

    Approved for public release; distribution is unlimited. The U.S. Navy's Transformation Roadmap is leading the fleet in a smaller, faster, and more technologically advanced direction. Smaller platforms and reduced manpower resources create opportunities to fill important positions, including ship-handling control, with technology. This thesis investigates the feasibility of using commercial-off-the-shelf (COTS) speech recognition software (SRS) for conning a Navy ship. Dragon NaturallySpeaki...

  7. Development of a Mandarin-English Bilingual Speech Recognition System for Real World Music Retrieval

    Science.gov (United States)

    Zhang, Qingqing; Pan, Jielin; Lin, Yang; Shao, Jian; Yan, Yonghong

    In recent decades, there has been a great deal of research into the problem of bilingual speech recognition: to develop a recognizer that can handle inter- and intra-sentential language switching between two languages. This paper presents our recent work on the development of a grammar-constrained, Mandarin-English bilingual Speech Recognition System (MESRS) for real world music retrieval. Two of the main difficult issues in handling bilingual speech recognition systems for real-world applications are tackled in this paper. One is to balance the performance and the complexity of the bilingual speech recognition system; the other is to effectively deal with matrix language accents in the embedded language. In order to process the intra-sentential language switching and reduce the amount of data required to robustly estimate statistical models, a compact single set of bilingual acoustic models derived by phone set merging and clustering is developed instead of using two separate monolingual models for each language. In our study, a novel Two-pass phone clustering method based on Confusion Matrix (TCM) is presented and compared with the log-likelihood measure method. Experiments testify that TCM can achieve better performance. Since potential system users' native language is Mandarin, which is regarded as the matrix language in our application, their pronunciations of English as the embedded language usually contain Mandarin accents. In order to deal with the matrix language accents in the embedded language, different non-native adaptation approaches are investigated. Experiments show that the model retraining method outperforms the other common adaptation methods such as Maximum A Posteriori (MAP). With the effective incorporation of approaches on phone clustering and non-native adaptation, the Phrase Error Rate (PER) of MESRS for English utterances was reduced by 24.47% relatively compared to the baseline monolingual English system while the PER on Mandarin utterances was

  8. Speech Recognition System For Robotic Control And Movement

    Directory of Open Access Journals (Sweden)

    Biraja Nalini Rout

    2015-08-01

    Full Text Available Abstract In the current scenario, voice and data recognition is one of the most sought-after fields in artificial intelligence and robotic engineering. The idea centres on deriving a voice-to-voice intelligent system which operates purely on audio/voice instructions, using a specialized voice recognition (VR) module, a microcontroller, a set of wheels, and a movable arm. In operation, real-time voice inputs are fed to the VR module, which processes the audio signals and produces output in audio format. The system includes an IDE for both Windows and UNIX-based operating systems for manipulating and processing instructions at both the software and hardware levels. The system can also perform a basic set of manual operations decided through the expert system. The VR module processes the data using a multilayer perceptron to generate the required result. The movable arm picks and places objects as per the given voice instructions. Its usability involves substituting for manual work at both personal and professional levels.

  9. AUTOMATIC SPEECH RECOGNITION – THE MAIN STAGES OVER LAST 50 YEARS

    Directory of Open Access Journals (Sweden)

    I. B. Tampel

    2015-11-01

    Full Text Available The main stages in the development of automatic speech recognition systems over the last 50 years are reviewed. An attempt is made to evaluate different methods in the context of how closely they approach the functioning of biological systems. The method based on a dynamic programming algorithm, implemented in 1968, is taken as a benchmark. Shortcomings of the method, which make it usable only for command recognition, are considered. The next method considered is based on the formalism of Markov chains. Based on the notion of coarticulation, the necessity of applying context-dependent triphones and biphones instead of context-independent phonemes is shown. The problem of insufficient speech databases for triphone training, which leads to state-tying methods, is explained. The importance of model adaptation and feature normalization methods, which provide better invariance to speakers, communication channels and additive noise, is shown. Deep neural networks and recurrent networks are considered the most up-to-date methods. The similarity between deep (multilayer) neural networks and biological systems is noted. In conclusion, the problems and drawbacks of modern automatic speech recognition systems are described and a prognosis of their development is given.

  10. Multiclassifier fusion of an ultrasonic lip reader in automatic speech recognition

    Science.gov (United States)

    Jennings, David L.

    1994-12-01

    This thesis investigates the use of two active ultrasonic devices in collecting lip information for performing and enhancing automatic speech recognition. The two devices explored are called the 'Ultrasonic Mike' and the 'Lip Lock Loop.' The devices are tested in a speaker-dependent isolated-word recognition task with a vocabulary consisting of the spoken digits from zero to nine. Two automatic lip readers are designed and tested based on the output of the ultrasonic devices. The automatic lip readers use template matching and dynamic time warping to determine the best candidate for a given test utterance. The automatic lip readers alone achieve accuracies of 65-89%, depending on the number of reference templates used. Next, the automatic lip reader is combined with a conventional automatic speech recognizer. Both classifier-level fusion and feature-level fusion are investigated. Feature fusion is based on combining the feature vectors prior to dynamic time warping. Classifier fusion is based on a pseudo probability mass function derived from the dynamic time warping distances. The combined systems are tested with various levels of acoustic noise added. In one typical test, at a signal-to-noise ratio of 0 dB, the acoustic recognizer's accuracy alone was 78%, the automatic lip reader's accuracy was 69%, but the combined accuracy was 93%. This experiment demonstrates that a simple ultrasonic lip motion detector, with an output data rate 12,500 times lower than that of a typical video camera, can significantly improve the accuracy of automatic speech recognition in noise.
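
    The lip readers described above score an utterance against stored reference templates with dynamic time warping (DTW). A minimal DTW distance plus nearest-template classifier might look like the sketch below; the toy 1-D lip-aperture trajectories are placeholders standing in for the thesis's actual ultrasonic features.

```python
import numpy as np

def dtw_distance(a, b):
    """Classic DTW between two feature sequences a (m x d) and b (n x d)."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    m, n = len(a), len(b)
    D = np.full((m + 1, n + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[m, n]

def classify(utterance, templates):
    """Return the label of the reference template with the smallest DTW distance."""
    return min(templates, key=lambda label: dtw_distance(utterance, templates[label]))

# Toy example with 1-D lip-aperture trajectories for two digits.
templates = {"zero": np.linspace(0, 1, 20)[:, None],
             "one":  np.linspace(1, 0, 25)[:, None]}
test = np.linspace(0, 1, 22)[:, None]
print(classify(test, templates))  # -> "zero"
```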

  11. Estimation of Phoneme-Specific HMM Topologies for the Automatic Recognition of Dysarthric Speech

    Directory of Open Access Journals (Sweden)

    Santiago-Omar Caballero-Morales

    2013-01-01

    Full Text Available Dysarthria is a frequently occurring motor speech disorder which can be caused by neurological trauma, cerebral palsy, or degenerative neurological diseases. Because dysarthria affects phonation, articulation, and prosody, spoken communication of dysarthric speakers becomes seriously restricted, affecting their quality of life and confidence. Assistive technology has led to the development of speech applications to improve the spoken communication of dysarthric speakers. In this field, this paper presents an approach to improve the accuracy of HMM-based speech recognition systems. Because phonatory dysfunction is a main characteristic of dysarthric speech, the phonemes of a dysarthric speaker are affected at different levels. Thus, the approach consists of finding the most suitable type of HMM topology (Bakis, Ergodic) for each phoneme in the speaker's phonetic repertoire. The topology is further refined with a suitable number of states and Gaussian mixture components for acoustic modelling. This differs from studies where a single topology is assumed for all phonemes. Finding the suitable parameters (topology, states and mixture components) is performed with a Genetic Algorithm (GA). Experiments with a well-known dysarthric speech database showed statistically significant improvements of the proposed approach when compared with the single-topology approach, even for speakers with severe dysarthria.
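
    The search the paper describes, over topology type, number of states and number of Gaussian mixtures per phoneme, can be sketched as a small genetic algorithm. In the sketch below, evaluate_accuracy is a stand-in for training and testing an HMM with the candidate configuration, and the gene ranges and GA settings are assumptions rather than the paper's values.

```python
import random

# Candidate search space (assumed ranges, not the paper's exact values).
TOPOLOGIES = ["bakis", "ergodic"]
STATES = range(3, 8)
MIXTURES = range(1, 5)

def random_individual():
    return (random.choice(TOPOLOGIES), random.choice(STATES), random.choice(MIXTURES))

def mutate(ind, rate=0.3):
    topo, n_states, n_mix = ind
    if random.random() < rate:
        topo = random.choice(TOPOLOGIES)
    if random.random() < rate:
        n_states = random.choice(STATES)
    if random.random() < rate:
        n_mix = random.choice(MIXTURES)
    return (topo, n_states, n_mix)

def crossover(a, b):
    point = random.randint(1, 2)       # one-point crossover over the three genes
    return a[:point] + b[point:]

def ga_search(evaluate_accuracy, pop_size=10, generations=20):
    """Evolve (topology, n_states, n_mixtures) maximizing a user-supplied fitness."""
    population = [random_individual() for _ in range(pop_size)]
    for _ in range(generations):
        scored = sorted(population, key=evaluate_accuracy, reverse=True)
        parents = scored[: pop_size // 2]                       # truncation selection
        children = [mutate(crossover(random.choice(parents), random.choice(parents)))
                    for _ in range(pop_size - len(parents))]
        population = parents + children
    return max(population, key=evaluate_accuracy)

# Toy fitness: pretend a 5-state Bakis model with 2 mixtures is best for this phoneme.
def toy_fitness(ind):
    topo, n_states, n_mix = ind
    return -(abs(n_states - 5) + abs(n_mix - 2)) - (0 if topo == "bakis" else 1)

print(ga_search(toy_fitness))
```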

  12. Using sample entropy for automated sign language recognition on sEMG and accelerometer data.

    Science.gov (United States)

    Kosmidou, Vasiliki E; Hadjileontiadis, Leontios I

    2010-03-01

    Communication using sign language (SL) provides an alternative means of information transmission among the deaf. Automated recognition of the gestures involved in SL, however, could further expand this communication channel to the world of hearers. In this study, data from a five-channel surface electromyogram and a three-dimensional accelerometer on signers' dominant hand were subjected to a feature extraction process. The latter consisted of sample entropy (SampEn)-based analysis, while time-frequency feature (TFF) analysis was also performed as a baseline method, for the automated recognition of isolated signs from a 60-word lexicon of Greek SL (GSL). Experimental results showed mean classification accuracies of 66% and 92% using TFF and SampEn, respectively. These results justify the superiority of SampEn over conventional methods, such as TFF, in providing high recognition hit ratios combined with feature vector dimension reduction, toward fast and reliable automated GSL gesture recognition.
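
    Sample entropy itself is a standard, well-defined complexity measure; a straightforward (unoptimized) implementation for a 1-D signal is sketched below, with the usual parameters m (embedding dimension) and r (tolerance as a fraction of the signal's standard deviation). The example signals are synthetic stand-ins for sEMG or accelerometer channels.

```python
import numpy as np

def sample_entropy(x, m=2, r=0.2):
    """SampEn(m, r) of a 1-D signal using the Chebyshev distance.

    Counts template matches of length m and m+1 (excluding self-matches)
    and returns -log(A / B).
    """
    x = np.asarray(x, dtype=float)
    r *= x.std()
    N = len(x)

    def count_matches(length):
        templates = np.array([x[i:i + length] for i in range(N - m)])
        count = 0
        for i in range(len(templates)):
            dist = np.max(np.abs(templates - templates[i]), axis=1)
            count += np.sum(dist <= r) - 1      # exclude the self-match
        return count

    B = count_matches(m)
    A = count_matches(m + 1)
    return -np.log(A / B) if A > 0 and B > 0 else np.inf

# Example: a noisy signal should yield higher SampEn than a smooth sinusoid.
t = np.linspace(0, 1, 500)
print(sample_entropy(np.sin(2 * np.pi * 5 * t)))                   # low complexity
print(sample_entropy(np.random.default_rng(0).normal(size=500)))   # high complexity
```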

  13. Speaker diarization and speech recognition in the semi-automatization of audio description: An exploratory study on future possibilities?

    Directory of Open Access Journals (Sweden)

    Héctor Delgado

    2015-12-01

    Full Text Available This article presents an overview of the technological components used in the process of audio description, and suggests a new scenario in which speech recognition, machine translation, and text-to-speech, with the corresponding human revision, could be used to increase audio description provision. The article focuses on a process in which both speaker diarization and speech recognition are used in order to obtain a semi-automatic transcription of the audio description track. The technical process is presented and experimental results are summarized.

  14. The software for automatic creation of the formal grammars used by speech recognition, computer vision, editable text conversion systems, and some new functions

    Science.gov (United States)

    Kardava, Irakli; Tadyszak, Krzysztof; Gulua, Nana; Jurga, Stefan

    2017-02-01

    For more flexible environmental perception by artificial intelligence, supporting software modules are needed that can automate the creation of specific language syntax and perform further analysis for relevant decisions based on semantic functions. With the implementation of our proposed approach, it is possible to create pairs of formal rules for given sentences (in the case of natural languages) or statements (in the case of special languages) with the help of computer vision, speech recognition or an editable text conversion system, for further automatic improvement. In other words, we have developed an approach by which the automation of the artificial intelligence training process can be significantly improved, which as a result gives a higher level of self-developing skills independent of users. Based on our approach we have developed a demo version of the software, which includes the algorithm and software code for implementing all of the above-mentioned components (computer vision, speech recognition and an editable text conversion system). The program has the ability to work in a multi-stream mode and simultaneously create a syntax based on information received from several sources.

  15. How does language model size affect speech recognition accuracy for the Turkish language?

    Directory of Open Access Journals (Sweden)

    Behnam ASEFİSARAY

    2016-05-01

    Full Text Available In this paper we aimed at investigating the effect of Language Model (LM) size on Speech Recognition (SR) accuracy. We also provide details of our approach for obtaining the LM for Turkish. Since the LM is obtained by statistical processing of raw text, we expect that increasing the size of the data available for training the LM will improve SR accuracy. Since this study is based on recognition of Turkish, which is a highly agglutinative language, it is important to find out the appropriate size for the training data. The minimum required data size is expected to be much higher than the data needed to train a language model for a language with a low level of agglutination such as English. In the experiments we also tried to adjust the Language Model Weight (LMW) and Active Token Count (ATC) parameters of the LM, as these are expected to be different for a highly agglutinative language. We show that increasing the training data size to an appropriate level improves recognition accuracy, whereas changes to LMW and ATC did not have a positive effect on Turkish speech recognition accuracy.
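
    As a toy illustration of how LM training-data size feeds into recognition scoring, the sketch below estimates an add-one-smoothed bigram model from raw text and scores a candidate word sequence under a small and a larger corpus. The corpora and test sentence are placeholders; a real Turkish system would typically model sub-word units to cope with agglutination.

```python
import math
from collections import Counter

def train_bigram_lm(sentences):
    """Add-one smoothed bigram model built from a list of tokenized sentences."""
    unigrams, bigrams = Counter(), Counter()
    for sent in sentences:
        tokens = ["<s>"] + sent + ["</s>"]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    vocab_size = len(unigrams)

    def logprob(sentence):
        tokens = ["<s>"] + sentence + ["</s>"]
        lp = 0.0
        for prev, word in zip(tokens, tokens[1:]):
            lp += math.log((bigrams[(prev, word)] + 1) /
                           (unigrams[prev] + vocab_size))
        return lp

    return logprob

# Hypothetical corpora of different sizes.
small_corpus = [["bugün", "hava", "güzel"]]
large_corpus = small_corpus * 50 + [["hava", "çok", "güzel"], ["bugün", "hava", "soğuk"]]

score_small = train_bigram_lm(small_corpus)
score_large = train_bigram_lm(large_corpus)
test = ["bugün", "hava", "soğuk"]
# The LM trained on more data assigns the test sentence a higher log-probability.
print(score_small(test), score_large(test))
```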

  16. Suprasegmental lexical stress cues in visual speech can guide spoken-word recognition.

    Science.gov (United States)

    Jesse, Alexandra; McQueen, James M

    2014-01-01

    Visual cues to the individual segments of speech and to sentence prosody guide speech recognition. The present study tested whether visual suprasegmental cues to the stress patterns of words can also constrain recognition. Dutch listeners use acoustic suprasegmental cues to lexical stress (changes in duration, amplitude, and pitch) in spoken-word recognition. We asked here whether they can also use visual suprasegmental cues. In two categorization experiments, Dutch participants saw a speaker say fragments of word pairs that were segmentally identical but differed in their stress realization (e.g., 'ca-vi from cavia "guinea pig" vs. 'ka-vi from kaviaar "caviar"). Participants were able to distinguish between these pairs from seeing a speaker alone. Only the presence of primary stress in the fragment, not its absence, was informative. Participants were able to distinguish visually primary from secondary stress on first syllables, but only when the fragment-bearing target word carried phrase-level emphasis. Furthermore, participants distinguished fragments with primary stress on their second syllable from those with secondary stress on their first syllable (e.g., pro-'jec from projector "projector" vs. 'pro-jec from projectiel "projectile"), independently of phrase-level emphasis. Seeing a speaker thus contributes to spoken-word recognition by providing suprasegmental information about the presence of primary lexical stress.

  17. Recognition of Emotions in Mexican Spanish Speech: An Approach Based on Acoustic Modelling of Emotion-Specific Vowels

    Directory of Open Access Journals (Sweden)

    Santiago-Omar Caballero-Morales

    2013-01-01

    Full Text Available An approach for the recognition of emotions in speech is presented. The target language is Mexican Spanish, and for this purpose a speech database was created. The approach consists in the phoneme acoustic modelling of emotion-specific vowels. For this, a standard phoneme-based Automatic Speech Recognition (ASR) system was built with Hidden Markov Models (HMMs), where different phoneme HMMs were built for the consonants and emotion-specific vowels associated with four emotional states (anger, happiness, neutral, sadness). Then, estimation of the emotional state from a spoken sentence is performed by counting the number of emotion-specific vowels found in the ASR's output for the sentence. With this approach, accuracy of 87–100% was achieved for the recognition of the emotional state of Mexican Spanish speech.

  18. Speech emotion recognition based on LS-SVM

    Institute of Scientific and Technical Information of China (English)

    周慧; 魏霖静

    2012-01-01

    An approach to emotional speech recognition based on LS-SVM is proposed. First, pitch frequency, energy and speech rate parameters are extracted from the speech signals as emotional features; then the corresponding emotional speech signals are modelled with the LS-SVM method and recognized. Experimental results show that high recognition rates are obtained for basic emotion recognition with LS-SVM.

  19. Recognition of Speech of Normal-hearing Individuals with Tinnitus and Hyperacusis

    Directory of Open Access Journals (Sweden)

    Hennig, Tais Regina

    2011-01-01

    Full Text Available Introduction: Tinnitus and hyperacusis are increasingly frequent audiological symptoms that may occur in the absence of hearing loss, but this does not make them any less bothersome to the affected individuals. The medial olivocochlear system helps with speech recognition in noise and may be connected to the presence of tinnitus and hyperacusis. Objective: To evaluate the speech recognition of normal-hearing individuals with and without complaints of tinnitus and hyperacusis, and to compare their results. Method: A descriptive, prospective, cross-sectional study in which 19 normal-hearing individuals with complaints of tinnitus and hyperacusis formed the Study Group (SG) and 23 normal-hearing individuals without audiological complaints formed the Control Group (CG). The individuals of both groups were given the List of Sentences in Portuguese test, prepared by Costa (1998), to determine the Sentence Recognition Threshold in Silence (LRSS) and the signal-to-noise (S/N) ratio. The SG also answered the Tinnitus Handicap Inventory for tinnitus analysis, and discomfort thresholds were determined to characterize hyperacusis. Results: The CG and SG presented average LRSS and S/N ratios of 7.34 dB HL and -6.77 dB, and of 7.20 dB HL and -4.89 dB, respectively. Conclusion: The normal-hearing individuals with and without audiological complaints of tinnitus and hyperacusis had similar performance in speech recognition in silence, which was not the case when evaluated in the presence of competing noise, since the SG had lower performance in this communication scenario, with a statistically significant difference.

  20. Robust Speech Recognition Using Temporal Pattern Feature Extracted From MTMLP Structure

    Directory of Open Access Journals (Sweden)

    Yasser Shekofteh

    2014-10-01

    Full Text Available Temporal pattern features of a speech signal can be extracted either from the time domain or from its front-end feature vectors. These features capture long-term information about variations across connected speech units. In this paper, the second approach is followed: temporal patterns are computed from spectral-based (LFBE) and cepstrum-based (MFCC) feature vectors. To extract these features, we use the posterior-probability-based output of the proposed MTMLP neural networks. The combination of the temporal patterns, which represent the long-term dynamics of the speech signal, with traditional features composed of MFCCs and their first and second derivatives is evaluated in an ASR task. It is shown that the use of such a combined feature vector increases phoneme recognition accuracy by more than 1 percent over the baseline system, which does not benefit from the long-term temporal patterns. In addition, it is shown that the features extracted by the proposed method give robust recognition under different noise conditions (by 13 percent) and, therefore, the proposed method is a robust feature extraction method.

  1. Speech recognition interface to a hospital information system using a self-designed visual basic program: initial experience.

    Science.gov (United States)

    Callaway, Edward C; Sweet, Clifford F; Siegel, Eliot; Reiser, John M; Beall, Douglas P

    2002-03-01

    Speech recognition (SR) in the radiology department setting is viewed as a method of decreasing overhead expenses by reducing or eliminating transcription services and improving care by reducing report turnaround times incurred by transcription backlogs. The purpose of this study was to show the ability to integrate off-the-shelf speech recognition software into a Hospital Information System in 3 types of military medical facilities using the Windows programming language Visual Basic 6.0 (Microsoft, Redmond, WA). Report turnaround times and costs were calculated for a medium-sized medical teaching facility, a medium-sized nonteaching facility, and a medical clinic. Results of speech recognition versus contract transcription services were assessed between July and December 2000. In the teaching facility, 2042 reports were dictated on 2 computers equipped with the speech recognition program, saving a total of US $3,319 in transcription costs. Turnaround times were calculated for 4 first-year radiology residents in 4 imaging categories. Despite requiring 2 separate electronic signatures, we achieved an average reduction in turnaround time from 15.7 hours to 4.7 hours. In the nonteaching facility, 26,600 reports were dictated, with average turnaround time improving from 89 hours for transcription to 19 hours for speech recognition, saving US $45,500 over the same 6 months. The medical clinic generated 5109 reports for a cost savings of US $10,650. Total cost to implement this speech recognition was approximately US $3,000 per workstation, mostly for hardware. It is possible to design and implement an affordable speech recognition system without a large-scale expensive commercial solution.

  2. Dynamic detection model and its application for perimeter security, intruder detection, and automated target recognition

    Science.gov (United States)

    Koltunov, Joseph; Koltunov, Alexander

    2003-09-01

    Under unsteady weather conditions (gusty wind and partial cloudiness), the pixel intensities measured by infrared or optical imaging sensors may change considerably within minutes. This poses a principal obstacle to automated target detection and recognition in real, outdoor settings. Currently existing automated recognition algorithms require strong similarity between the weather conditions at training and at recognition. Empirical attempts to normalize image intensities do not lead to reliable detection in practice (e.g. for scenes with complex relief). Even if the weather is relatively stable (weak wind, rare clouds), a delay as short as 15-20 minutes between the training survey and the recognition survey may badly affect target recognition or detection, unless the targets are well separable from the background. Thermal IR technologies based on invariants such as emissivity and thermal inertia are expensive and ineffective in automating recognition. Our approach to overcoming the problem is to take advantage of multitemporal prior surveying. It exploits the fact that any new infrared or optical image of a scene can be accurately predicted from sufficiently many scene images acquired previously. This removes the above severe constraints on the variability of the weather conditions, while neither meteorological measurement nor radiometric calibration of the sensor is required. The present paper further generalizes the approach and addresses several points that are important for putting the ideas into practice. Two experimental examples, intruder detection and recognition of a suspicious target, illustrate the potential of our method.

  3. Intrinsic mode entropy: an enhanced classification means for automated Greek Sign Language gesture recognition.

    Science.gov (United States)

    Kosmidou, Vasiliki E; Hadjileontiadis, Leontios J

    2008-01-01

    Sign language forms a communication channel among the deaf; however, automated gesture recognition could further expand their communication with the hearers. In this work, data from a three-dimensional accelerometer and a five-channel surface electromyogram of the user's dominant forearm are analyzed using intrinsic mode entropy (IMEn) for the automated recognition of Greek Sign Language (GSL) gestures. IMEn was estimated for various window lengths and evaluated by the Mahalanobis distance criterion. Discriminant analysis was used to identify the effective scales of the intrinsic mode functions and the window length for the calculation of the IMEn that contributes to the correct classification of the GSL gestures. Experimental results from the IMEn analysis of GSL gestures corresponding to ten words have shown 100% classification accuracy using IMEn as the only classification feature. This provides a promising test bed for automated GSL gesture recognition.

  4. Automated Fourier space region-recognition filtering for off-axis digital holographic microscopy

    CERN Document Server

    He, Xuefei; Pratap, Mrinalini; Zheng, Yujie; Wang, Yi; Nisbet, David R; Williams, Richard J; Rug, Melanie; Maier, Alexander G; Lee, Woei Ming

    2016-01-01

    Automated label-free quantitative imaging of biological samples can greatly benefit high-throughput disease diagnosis. Digital holographic microscopy (DHM) is a powerful quantitative label-free imaging tool that retrieves structural details of cellular samples non-invasively. In off-axis DHM, a proper spatial filtering window in Fourier space is crucial to the quality of the reconstructed phase image. Here we describe a region-recognition approach that combines shape recognition with iterative thresholding to extract the optimal shape of the frequency components. The region-recognition technique offers fully automated adaptive filtering that can operate with a variety of samples and imaging conditions. When imaging through an optically scattering biological hydrogel matrix, the technique surpasses previous histogram thresholding techniques without requiring any manual intervention. Finally, we automate the extraction of the statistical difference of optical height between malaria parasite infected and uninfected re...

  5. Dynamic Relation Between Working Memory Capacity and Speech Recognition in Noise During the First 6 Months of Hearing Aid Use

    Directory of Open Access Journals (Sweden)

    Elaine H. N. Ng

    2014-11-01

    Full Text Available The present study aimed to investigate the changing relationship between aided speech recognition and cognitive function during the first 6 months of hearing aid use. Twenty-seven first-time hearing aid users with symmetrical mild to moderate sensorineural hearing loss were recruited. Aided speech recognition thresholds in noise were obtained in the hearing aid fitting session as well as at 3 and 6 months postfitting. Cognitive abilities were assessed using a reading span test, which is a measure of working memory capacity, and a cognitive test battery. Results showed a significant correlation between reading span and speech reception threshold during the hearing aid fitting session. This relation was significantly weakened over the first 6 months of hearing aid use. Multiple regression analysis showed that reading span was the main predictor of speech recognition thresholds in noise when hearing aids were first fitted, but that the pure-tone average hearing threshold was the main predictor 6 months later. One way of explaining the results is that working memory capacity plays a more important role in speech recognition in noise initially rather than after 6 months of use. We propose that new hearing aid users engage working memory capacity to recognize unfamiliar processed speech signals because the phonological form of these signals cannot be automatically matched to phonological representations in long-term memory. As familiarization proceeds, the mismatch effect is alleviated, and the engagement of working memory capacity is reduced.

  6. A Hybrid Acoustic and Pronunciation Model Adaptation Approach for Non-native Speech Recognition

    Science.gov (United States)

    Oh, Yoo Rhee; Kim, Hong Kook

    In this paper, we propose a hybrid model adaptation approach in which pronunciation and acoustic models are adapted by incorporating the pronunciation and acoustic variabilities of non-native speech in order to improve the performance of non-native automatic speech recognition (ASR). Specifically, the proposed hybrid model adaptation can be performed at either the state-tying or triphone-modeling level, depending on the level at which acoustic model adaptation is performed. In both methods, we first analyze the pronunciation variant rules of non-native speakers and then classify each rule as either a pronunciation variant or an acoustic variant. The state-tying level hybrid method then adapts pronunciation models and acoustic models by accommodating the pronunciation variants in the pronunciation dictionary and by clustering the states of triphone acoustic models using the acoustic variants, respectively. On the other hand, the triphone-modeling level hybrid method initially adapts pronunciation models in the same way as in the state-tying level hybrid method; however, for the acoustic model adaptation, the triphone acoustic models are then re-estimated based on the adapted pronunciation models and the states of the re-estimated triphone acoustic models are clustered using the acoustic variants. From the Korean-spoken English speech recognition experiments, it is shown that ASR systems employing the state-tying and triphone-modeling level adaptation methods can relatively reduce the average word error rates (WERs) by 17.1% and 22.1% for non-native speech, respectively, when compared to a baseline ASR system.

  7. CAR2 - Czech Database of Car Speech

    Directory of Open Access Journals (Sweden)

    P. Sovka

    1999-12-01

    Full Text Available This paper presents a new Czech-language two-channel (stereo) speech database recorded in a car environment. The database was designed for experiments with speech enhancement for communication purposes and for the study and design of robust speech recognition systems. Tools for automated phoneme labelling based on Baum-Welch re-estimation were realised. A noise analysis of the car background environment was also performed.

  8. An Introduction to the Chinese Speech Recognition Front-End of the NICT/ATR Multi-Lingual Speech Translation System

    Institute of Scientific and Technical Information of China (English)

    ZHANG Jinsong; Takatoshi Jitsuhiro; Hirofumi Yamamoto; HU Xinhui; Satoshi Nakamura

    2008-01-01

    This paper introduces several important features of the Chinese large vocabulary continuous speech recognition system in the NICT/ATR multi-lingual speech-to-speech translation system. The features include: (1) a flexible way to derive an information-rich phoneme set based on mutual information between a text corpus and its phoneme set; (2) a hidden Markov network acoustic model and a successive state splitting algorithm to generate its model topology based on a minimum description length criterion; and (3) advanced language modeling using multi-class composite N-grams. These features allow a recognition performance of 90% character accuracy in tourism-related dialogue with real-time response speed.

  9. An Automated Size Recognition Technique for Acetabular Implant in Total Hip Replacement

    CERN Document Server

    Shapi'i, Azrulhizam; Hasan, Mohammad Khatim; Kassim, Abdul Yazid Mohd; 10.5121/ijcsit.2011.3218

    2011-01-01

    Preoperative templating in Total Hip Replacement (THR) is a method to estimate the optimal size and position of the implant. Today, observational (manual) size recognition techniques are still used to find a suitable implant for the patient. Therefore, a digital and automated technique should be developed so that the implant size recognition process can be effectively implemented. For this purpose, we have introduced a new technique for acetabular implant size recognition in THR preoperative planning based on the diameter of the acetabulum. This technique enables the surgeon to recognise the digital acetabular implant size automatically. Ten randomly selected X-rays of unidentified patients were used to test the accuracy and utility of the automated implant size recognition technique. Based on the test results, the new technique yielded results very close to those obtained by the observational method in nine of the cases (90%).

  10. Robust Speaker Recognition with Combined Use of Acoustic and Throat Microphone Speech

    DEFF Research Database (Denmark)

    Sahidullah, Md; Gonzalez Hautamäki, Rosa; Thomsen, Dennis Alexander Lehmann;

    2016-01-01

    Accuracy of automatic speaker recognition (ASV) systems degrades severely in the presence of background noise. In this paper, we study the use of additional side information provided by a body-conducted sensor, the throat microphone. The throat microphone signal is much less affected by background noise ... of this additional information for both speech activity detection, feature extraction and fusion of the acoustic and throat microphone signals. We collect a pilot database consisting of 38 subjects including both clean and noisy sessions. We carry out speaker verification experiments using Gaussian mixture model...

  11. Using vector Taylor series with noise clustering for speech recognition in non-stationary noisy environments

    Institute of Scientific and Technical Information of China (English)

    2006-01-01

    The performance of an automatic speech recognizer degrades seriously when there are mismatches between the training and testing conditions. The Vector Taylor Series (VTS) approach has been used to compensate for mismatches caused by additive noise and convolutive channel distortion in the cepstral domain. In this paper, the conventional VTS is extended by incorporating noise clustering into its EM iteration procedure, improving its compensation effectiveness in non-stationary noisy environments. Recognition experiments in babble and exhibition noise environments demonstrate that the new algorithm achieves a 35% average error rate reduction compared with the conventional VTS.

  12. Rule-Based Approach for Arabic Part of Speech Tagging and Named Entity Recognition

    Directory of Open Access Journals (Sweden)

    Mohammad Hjouj Btoush

    2016-06-01

    Full Text Available The aim of this study is to build a tool for Part of Speech (POS) tagging and Named Entity Recognition for the Arabic language; the approach used to build this tool is a rule-based technique. The POS tagger contains two phases: the first phase passes each word through a lexicon phase, and the second level is the morphological phase; the tagset is (Noun, Verb and Determiner). The named-entity detector applies rules to the text and gives the correct label for each word; the labels are Person (PERS), Location (LOC) and Organization (ORG).
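
    A minimal sketch of the lexicon-then-morphology pipeline described above, with a gazetteer-style named-entity pass; the lexicon entries, affix rules and gazetteers are illustrative placeholders (shown transliterated rather than in Arabic script), not the authors' actual rules.

```python
# Phase 1: lexicon lookup; Phase 2: morphological fallback rules (illustrative only).
LEXICON = {"kataba": "Verb", "kitab": "Noun", "al": "Determiner"}
PERSON_GAZETTEER = {"muhammad"}
LOCATION_GAZETTEER = {"amman"}

def pos_tag(word):
    w = word.lower()
    if w in LEXICON:                     # lexicon phase
        return LEXICON[w]
    if w.startswith("al"):               # morphological phase: definite-article prefix
        return "Noun"
    if w.endswith("a"):                  # crude verbal-pattern heuristic
        return "Verb"
    return "Noun"                        # default tag

def ner_tag(word):
    w = word.lower()
    if w in PERSON_GAZETTEER:
        return "PERS"
    if w in LOCATION_GAZETTEER:
        return "LOC"
    if w.startswith("sharikat"):         # 'company of ...' taken as an organization cue
        return "ORG"
    return "O"

sentence = ["kataba", "muhammad", "kitab", "fi", "amman"]
print([(w, pos_tag(w), ner_tag(w)) for w in sentence])
```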

  13. Fuzzy C-Means Clustering Based Phonetic Tied-Mixture HMM in Speech Recognition

    Institute of Scientific and Technical Information of China (English)

    XU Xiang-hua; ZHU Jie; GUO Qiang

    2005-01-01

    A fuzzy clustering analysis based phonetic tied-mixture HMM (FPTM) is presented to decrease parameter size and improve the robustness of parameter training. The FPTM is synthesized from state-tied HMMs by a modified fuzzy C-means clustering algorithm. Each Gaussian codebook of the FPTM is built from Gaussian components within the same root node of the phonetic decision tree. Experimental results on large-vocabulary Mandarin speech recognition show that, compared with a conventional phonetic tied-mixture HMM and a state-tied HMM with approximately the same number of Gaussian mixtures, the FPTM achieves word error rate reductions of 4.84% and 13.02%, respectively. By combining the two schemes of mixture weight pruning and fuzzy merging of Gaussian centers, a significant parameter size reduction was achieved with little impact on recognition accuracy.
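
    The fuzzy C-means update equations underlying the clustering step are compact enough to sketch directly; the toy 2-D points below stand in for the Gaussian-component parameter vectors that a real system would cluster, and the fuzzifier m and iteration count are conventional defaults, not the paper's settings.

```python
import numpy as np

def fuzzy_c_means(X, n_clusters, m=2.0, n_iter=100, seed=0):
    """Minimal fuzzy C-means: returns cluster centers and the membership matrix U."""
    rng = np.random.default_rng(seed)
    n = len(X)
    U = rng.random((n, n_clusters))
    U /= U.sum(axis=1, keepdims=True)               # memberships sum to 1 per point
    for _ in range(n_iter):
        W = U ** m
        centers = (W.T @ X) / W.sum(axis=0)[:, None]
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-12
        # Standard FCM membership update: u_ik = 1 / sum_j (d_ik / d_ij)^(2/(m-1)).
        U = 1.0 / (dist ** (2 / (m - 1)) *
                   np.sum(dist ** (-2 / (m - 1)), axis=1, keepdims=True))
    return centers, U

# Toy data: two blobs standing in for Gaussian-component parameter vectors.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(3, 0.5, (50, 2))])
centers, U = fuzzy_c_means(X, n_clusters=2)
print(centers)
```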

  14. Deficits in audiovisual speech perception in normal aging emerge at the level of whole-word recognition.

    Science.gov (United States)

    Stevenson, Ryan A; Nelms, Caitlin E; Baum, Sarah H; Zurkovsky, Lilia; Barense, Morgan D; Newhouse, Paul A; Wallace, Mark T

    2015-01-01

    Over the next 2 decades, a dramatic shift in the demographics of society will take place, with a rapid growth in the population of older adults. One of the most common complaints with healthy aging is a decreased ability to successfully perceive speech, particularly in noisy environments. In such noisy environments, the presence of visual speech cues (i.e., lip movements) provide striking benefits for speech perception and comprehension, but previous research suggests that older adults gain less from such audiovisual integration than their younger peers. To determine at what processing level these behavioral differences arise in healthy-aging populations, we administered a speech-in-noise task to younger and older adults. We compared the perceptual benefits of having speech information available in both the auditory and visual modalities and examined both phoneme and whole-word recognition across varying levels of signal-to-noise ratio. For whole-word recognition, older adults relative to younger adults showed greater multisensory gains at intermediate SNRs but reduced benefit at low SNRs. By contrast, at the phoneme level both younger and older adults showed approximately equivalent increases in multisensory gain as signal-to-noise ratio decreased. Collectively, the results provide important insights into both the similarities and differences in how older and younger adults integrate auditory and visual speech cues in noisy environments and help explain some of the conflicting findings in previous studies of multisensory speech perception in healthy aging. These novel findings suggest that audiovisual processing is intact at more elementary levels of speech perception in healthy-aging populations and that deficits begin to emerge only at the more complex word-recognition level of speech signals.

  15. Deficits in audiovisual speech perception in normal aging emerge at the level of whole-word recognition

    Science.gov (United States)

    Stevenson, Ryan A.; Nelms, Caitlin; Baum, Sarah H.; Zurkovsky, Lilia; Barense, Morgan D.; Newhouse, Paul A.; Wallace, Mark T.

    2014-01-01

    Over the next two decades, a dramatic shift in the demographics of society will take place, with a rapid growth in the population of older adults. One of the most common complaints with healthy aging is a decreased ability to successfully perceive speech, particularly in noisy environments. In such noisy environments, the presence of visual speech cues (i.e., lip movements) provide striking benefits for speech perception and comprehension, but previous research suggests that older adults gain less from such audiovisual integration than their younger peers. To determine at what processing level these behavioral differences arise in healthy-aging populations, we administered a speech-in-noise task to younger and older adults. We compared the perceptual benefits of having speech information available in both the auditory and visual modalities and examined both phoneme and whole-word recognition across varying levels of signal-to-noise ratio (SNR). For whole-word recognition, older relative to younger adults showed greater multisensory gains at intermediate SNRs, but reduced benefit at low SNRs. By contrast, at the phoneme level both younger and older adults showed approximately equivalent increases in multisensory gain as SNR decreased. Collectively, the results provide important insights into both the similarities and differences in how older and younger adults integrate auditory and visual speech cues in noisy environments, and help explain some of the conflicting findings in previous studies of multisensory speech perception in healthy aging. These novel findings suggest that audiovisual processing is intact at more elementary levels of speech perception in healthy aging populations, and that deficits begin to emerge only at the more complex, word-recognition level of speech signals. PMID:25282337

  16. Peripheral Nonlinear Time Spectrum Features Algorithm for Large Vocabulary Mandarin Automatic Speech Recognition

    Institute of Scientific and Technical Information of China (English)

    Fadhil H. T. Al-dulaimy; WANG Zuoying

    2005-01-01

    This work describes an improved feature extraction algorithm that extracts the peripheral features of a point x(ti,fj) using a nonlinear algorithm to compute the nonlinear time spectrum (NL-TS) pattern. The algorithm observes n×n neighborhoods of the point in all directions, and then incorporates the peripheral features into the Mel frequency cepstral coefficient (MFCC)-based feature extractor of the Tsinghua electronic engineering speech processing (THEESP) Mandarin automatic speech recognition (MASR) system as replacements for the dynamic features, with different feature combinations. In this algorithm, the orthogonal bases are extracted directly from the speech data using the discrete cosine transform (DCT) with 3×3 blocks on an NL-TS pattern as the peripheral features. The new primal bases are then selected and simplified in the form of an operator in the time direction and an operator in the frequency direction. The algorithm yields a 23.29% relative error rate improvement in comparison with the standard MFCC feature set with dynamic features, in tests using THEESP with the duration distribution-based hidden Markov model (DDBHMM) MASR system.

  17. An exploration of the potential of Automatic Speech Recognition to assist and enable receptive communication in higher education

    Directory of Open Access Journals (Sweden)

    Mike Wald

    2006-12-01

    Full Text Available The potential use of Automatic Speech Recognition to assist receptive communication is explored. The opportunities and challenges that this technology presents to students and staff are also discussed and evaluated: providing captioning of speech online or in classrooms for deaf or hard-of-hearing students, and helping blind, visually impaired or dyslexic learners read and search learning material more readily by augmenting synthetic speech with naturally recorded real speech. The automatic provision of online lecture notes, synchronised with speech, enables staff and students to focus on learning and teaching issues, while also benefiting learners unable to attend the lecture or who find it difficult or impossible to take notes at the same time as listening, watching and thinking.

  18. Towards an Intelligent Acoustic Front End for Automatic Speech Recognition: Built-in Speaker Normalization

    Directory of Open Access Journals (Sweden)

    Umit H. Yapanel

    2008-08-01

    Full Text Available A proven method for achieving effective automatic speech recognition (ASR) despite speaker differences is to perform acoustic feature speaker normalization. More effective speaker normalization methods are needed which require limited computing resources for real-time performance. The most popular speaker normalization technique is vocal-tract length normalization (VTLN), despite the fact that it is computationally expensive. In this study, we propose a novel online VTLN algorithm entitled built-in speaker normalization (BISN), where normalization is performed on the fly within a newly proposed PMVDR acoustic front end. The novel aspect of the algorithm is that conventional front-end processing with PMVDR and VTLN needs two separate warping phases, while the proposed BISN method uses only one single speaker-dependent warp to achieve both the PMVDR perceptual warp and the VTLN warp simultaneously. This improved integration unifies the nonlinear warping performed in the front end and reduces computational requirements, thereby offering advantages for real-time ASR systems. Evaluations are performed for (i) an in-car extended digit recognition task, where an on-the-fly BISN implementation reduces the relative word error rate (WER) by 24%, and (ii) a diverse noisy speech task (SPINE 2), where the relative WER improvement was 9%, both relative to the baseline speaker normalization method.
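
    The VTLN component mentioned above is conventionally realized as a piecewise-linear warp of the frequency axis before filterbank analysis. The sketch below shows such a generic warp applied to filter center frequencies; it illustrates the VTLN idea only, not the paper's BISN/PMVDR front end, and the warp factors and band edge are assumed values.

```python
import numpy as np

def piecewise_linear_warp(freqs, alpha, f_high=4000.0, f_max=8000.0):
    """Warp frequencies by factor alpha below f_high, then interpolate linearly
    up to f_max so the warped axis still ends at f_max (a common VTLN scheme)."""
    freqs = np.asarray(freqs, dtype=float)
    return np.where(
        freqs <= f_high,
        alpha * freqs,
        alpha * f_high + (f_max - alpha * f_high) * (freqs - f_high) / (f_max - f_high),
    )

# Example: warp mel-style filter center frequencies for two different speakers.
centers = np.linspace(100, 7900, 24)
print(piecewise_linear_warp(centers, alpha=0.9)[:5])   # longer vocal tract
print(piecewise_linear_warp(centers, alpha=1.1)[:5])   # shorter vocal tract
```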

  19. Towards an Intelligent Acoustic Front End for Automatic Speech Recognition: Built-in Speaker Normalization

    Directory of Open Access Journals (Sweden)

    Yapanel, Umit H.

    2008-01-01

    Full Text Available A proven method for achieving effective automatic speech recognition (ASR) despite speaker differences is to perform acoustic feature speaker normalization. More effective speaker normalization methods are needed which require limited computing resources for real-time performance. The most popular speaker normalization technique is vocal-tract length normalization (VTLN), despite the fact that it is computationally expensive. In this study, we propose a novel online VTLN algorithm entitled built-in speaker normalization (BISN), where normalization is performed on the fly within a newly proposed PMVDR acoustic front end. The novel aspect of the algorithm is that conventional front-end processing with PMVDR and VTLN needs two separate warping phases, while the proposed BISN method uses only one single speaker-dependent warp to achieve both the PMVDR perceptual warp and the VTLN warp simultaneously. This improved integration unifies the nonlinear warping performed in the front end and reduces computational requirements, thereby offering advantages for real-time ASR systems. Evaluations are performed for (i) an in-car extended digit recognition task, where an on-the-fly BISN implementation reduces the relative word error rate (WER) by 24%, and (ii) a diverse noisy speech task (SPINE 2), where the relative WER improvement was 9%, both relative to the baseline speaker normalization method.

  20. Automated transformation-invariant shape recognition through wavelet multiresolution

    Science.gov (United States)

    Brault, Patrice; Mounier, Hugues

    2001-12-01

    We present here new results in Wavelet Multi-Resolution Analysis (W-MRA) applied to shape recognition in automatic vehicle driving applications. Different types of shapes have to be recognized in this framework. They pertain to most of the objects entering a car's sensor field: road signs, lane separation lines, moving or static obstacles, other automotive vehicles, or visual beacons. The recognition process must be invariant to global transformations, affine or not, namely rotation, translation and scaling. It also has to be invariant to more local, elastic deformations such as perspective (in particular with wide-angle camera lenses), deformations due to environmental conditions (weather: rain, mist, light reverberation), and optical and electrical signal noise. To demonstrate our method, an initial shape, with a known contour, is compared to the same contour altered by rotation, translation, scaling and perspective. The curvature computed at each contour point is used as the main criterion in the shape matching process. The original part of this work is the use of wavelet descriptors, generated with a fast orthonormal W-MRA, rather than Fourier descriptors, to provide a multi-resolution description of the contour to be analyzed. In this way, the intrinsic spatial localization property of wavelet descriptors can be exploited and the recognition process can be sped up. The most important part of this work is to demonstrate the potential performance of wavelet MRA in this shape recognition application.
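
    A minimal sketch of building multi-resolution wavelet descriptors from a contour's curvature signal, using the PyWavelets package; the contour, wavelet family and decomposition depth are assumptions meant only to illustrate replacing Fourier descriptors with wavelet coefficients.

```python
import numpy as np
import pywt

def curvature(contour):
    """Discrete curvature of a closed 2-D contour given as an (N, 2) array."""
    x, y = contour[:, 0], contour[:, 1]
    dx, dy = np.gradient(x), np.gradient(y)
    ddx, ddy = np.gradient(dx), np.gradient(dy)
    return (dx * ddy - dy * ddx) / np.power(dx ** 2 + dy ** 2, 1.5)

def wavelet_descriptors(contour, wavelet="db4", level=3):
    """Multi-resolution descriptor: wavelet decomposition of the curvature signal."""
    coeffs = pywt.wavedec(curvature(contour), wavelet, mode="periodic", level=level)
    return np.concatenate(coeffs)

# Toy closed contour: an ellipse sampled at 256 points.
t = np.linspace(0, 2 * np.pi, 256, endpoint=False)
ellipse = np.stack([2 * np.cos(t), np.sin(t)], axis=1)
print(wavelet_descriptors(ellipse)[:5])
```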

  1. Extraction of prostatic lumina and automated recognition for prostatic calculus image using PCA-SVM.

    Science.gov (United States)

    Wang, Zhuocai; Xu, Xiangmin; Ding, Xiaojun; Xiao, Hui; Huang, Yusheng; Liu, Jian; Xing, Xiaofen; Wang, Hua; Liao, D Joshua

    2011-01-01

    Identification of prostatic calculi is an important basis for determining the tissue origin. Computer-assisted diagnosis of prostatic calculi may have promising potential but has so far received little study. We studied the extraction of prostatic lumina and the automated recognition of calculus images. Extraction of lumina from prostate histology images was based on local entropy and Otsu thresholding; recognition used PCA-SVM based on the texture features of prostatic calculi. The SVM classifier showed an average time of 0.1432 seconds, an average training accuracy of 100%, an average test accuracy of 93.12%, a sensitivity of 87.74%, and a specificity of 94.82%. We concluded that the algorithm, based on texture features and PCA-SVM, can easily recognize the concentric structure and its visual features. Therefore, this method is effective for the automated recognition of prostatic calculi.
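
    The lumen-extraction step named in the abstract (local entropy followed by Otsu thresholding) can be roughly illustrated with scikit-image; the synthetic image and disk radius below are placeholders, so this is a sketch of the named operations rather than the authors' pipeline.

```python
import numpy as np
from skimage import img_as_ubyte
from skimage.filters import threshold_otsu
from skimage.filters.rank import entropy
from skimage.morphology import disk

# Synthetic grayscale "histology" image: smooth background with a textured region.
rng = np.random.default_rng(0)
image = np.full((128, 128), 0.4)
image[40:90, 40:90] += rng.normal(0, 0.15, (50, 50))     # textured (lumen-like) area
image = img_as_ubyte(np.clip(image, 0, 1))

# Local entropy highlights textured regions; Otsu picks a global threshold on the map.
entropy_map = entropy(image, disk(5))
mask = entropy_map > threshold_otsu(entropy_map)

print(mask.sum(), "pixels flagged as candidate lumen region")
```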

  2. Testing of a Composite Wavelet Filter to Enhance Automated Target Recognition in SONAR

    Science.gov (United States)

    Chiang, Jeffrey N.

    2011-01-01

    Automated Target Recognition (ATR) systems aim to automate target detection, recognition, and tracking. The current project applies a JPL ATR system to low resolution SONAR and camera videos taken from Unmanned Underwater Vehicles (UUVs). These SONAR images are inherently noisy and difficult to interpret, and pictures taken underwater are unreliable due to murkiness and inconsistent lighting. The ATR system breaks target recognition into three stages: 1) Videos of both SONAR and camera footage are broken into frames and preprocessed to enhance images and detect Regions of Interest (ROIs). 2) Features are extracted from these ROIs in preparation for classification. 3) ROIs are classified as true or false positives using a standard Neural Network based on the extracted features. Several preprocessing, feature extraction, and training methods are tested and discussed in this report.

  3. Computer-Mediated Input, Output and Feedback in the Development of L2 Word Recognition from Speech

    Science.gov (United States)

    Matthews, Joshua; Cheng, Junyu; O'Toole, John Mitchell

    2015-01-01

    This paper reports on the impact of computer-mediated input, output and feedback on the development of second language (L2) word recognition from speech (WRS). A quasi-experimental pre-test/treatment/post-test research design was used involving three intact tertiary level English as a Second Language (ESL) classes. Classes were either assigned to…

  4. Investigating an Application of Speech-to-Text Recognition: A Study on Visual Attention and Learning Behaviour

    Science.gov (United States)

    Huang, Y-M.; Liu, C-J.; Shadiev, Rustam; Shen, M-H.; Hwang, W-Y.

    2015-01-01

    One major drawback of previous research on speech-to-text recognition (STR) is that most findings showing the effectiveness of STR for learning were based upon subjective evidence. Very few studies have used eye-tracking techniques to investigate visual attention of students on STR-generated text. Furthermore, not much attention was paid to…

  5. A Novel Algorithm for Acoustic and Visual Classifiers Decision Fusion in Audio-Visual Speech Recognition System

    Directory of Open Access Journals (Sweden)

    P.S. Sathidevi

    2010-03-01

    Full Text Available Audio-visual speech recognition (AVSR) using acoustic and visual signals of speech has received attention recently because of its robustness in noisy environments. Perceptual studies also support this approach by emphasizing the importance of visual information for speech recognition in humans. An important issue in decision-fusion-based AVSR systems is how to obtain the appropriate integration weight for the speech modalities so that the combined AVSR system performs better than the audio-only and visual-only systems under various noise conditions. To solve this issue, we present a genetic algorithm (GA) based optimization scheme to obtain the appropriate integration weight from the relative reliability of each modality. The performance of the proposed GA-optimized reliability-ratio-based weight estimation scheme is demonstrated via single-speaker, mobile-functions isolated-word recognition experiments. The results show that the proposed scheme improves robust recognition accuracy over the conventional unimodal systems and the baseline reliability-ratio-based AVSR system under various signal-to-noise ratio conditions.

  6. Speech Emotion Recognition Algorithm Based on SVM

    Institute of Scientific and Technical Information of China (English)

    朱菊霞; 吴小培; 吕钊

    2011-01-01

    To effectively improve the recognition accuracy of a speech emotion recognition system, a speech emotion recognition algorithm based on SVM (Support Vector Machine) is proposed. The algorithm extracts parameters such as energy, pitch frequency and formants from the speech signal as emotional features, and the emotional signals are modelled and recognized with the SVM method. In simulated emotion recognition experiments, the recognition accuracy of the proposed algorithm increased by 7.06% and 7.21% relative to the ACON (All Class in One Network, "one-to-many") and OCON (One Class in One Network, "one-to-one") artificial neural network methods, respectively. The experimental results show that the SVM-based speech emotion recognition algorithm can recognize emotional speech signals well and effectively improve the performance of the emotion recognition system.
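
    A minimal sketch of the feature-vector-plus-SVM pipeline the abstract describes, using scikit-learn; the synthetic energy/pitch/speech-rate values are placeholders for real prosodic measurements, and the RBF kernel and scaling step are assumptions rather than the paper's exact setup.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Placeholder class means for [mean energy, pitch (Hz), speech rate (syllables/s)].
MEANS = {"anger": [0.9, 260, 5.5], "happiness": [0.7, 240, 5.0],
         "neutral": [0.4, 180, 4.0], "sadness": [0.3, 150, 3.0]}

rng = np.random.default_rng(0)
X, y = [], []
for emotion, mu in MEANS.items():
    X.append(rng.normal(mu, [0.05, 10, 0.3], size=(40, 3)))   # 40 synthetic utterances
    y += [emotion] * 40
X = np.vstack(X)

# Scale the features, then train an RBF-kernel SVM emotion classifier.
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
clf.fit(X, y)
print(clf.predict([[0.85, 255, 5.4]]))   # an 'anger'-like utterance
```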

  7. Feature Extraction and Selection Strategies for Automated Target Recognition

    Science.gov (United States)

    Greene, W. Nicholas; Zhang, Yuhan; Lu, Thomas T.; Chao, Tien-Hsin

    2010-01-01

    Several feature extraction and selection methods for an existing automatic target recognition (ATR) system using JPL's Grayscale Optical Correlator (GOC) and Optimal Trade-Off Maximum Average Correlation Height (OT-MACH) filter were tested using MATLAB. The ATR system is composed of three stages: a cursory region-of-interest (ROI) search using the GOC and OT-MACH filter, a feature extraction and selection stage, and a final classification stage. Feature extraction and selection concerns transforming potential target data into more useful forms as well as selecting important subsets of that data which may aid in detection and classification. The strategies tested were built around two popular extraction methods: Principal Component Analysis (PCA) and Independent Component Analysis (ICA). Performance was measured based on the classification accuracy and free-response receiver operating characteristic (FROC) output of a support vector machine (SVM) and a neural net (NN) classifier.
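
    A rough sketch of comparing PCA- and ICA-based feature extraction ahead of an SVM classifier, written in Python with scikit-learn rather than the MATLAB environment the authors used; the random "ROI" vectors, labels and component counts are placeholders.

```python
import numpy as np
from sklearn.decomposition import PCA, FastICA
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Placeholder ROI feature vectors (e.g. flattened image chips) and true/false labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 64))
y = rng.integers(0, 2, size=200)
X[y == 1, :8] += 1.0                      # give the "target" class some structure

for name, extractor in [("PCA", PCA(n_components=10)),
                        ("ICA", FastICA(n_components=10, random_state=0))]:
    pipeline = make_pipeline(extractor, SVC(kernel="rbf"))
    scores = cross_val_score(pipeline, X, y, cv=5)
    print(f"{name}: mean accuracy {scores.mean():.3f}")
```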

  8. Comparison of Image Transform-Based Features for Visual Speech Recognition in Clean and Corrupted Videos

    Directory of Open Access Journals (Sweden)

    Ji Ming

    2008-03-01

    Full Text Available We present results of a study into the performance of a variety of different image transform-based feature types for speaker-independent visual speech recognition of isolated digits. This includes the first reported use of features extracted using a discrete curvelet transform. The study will show a comparison of some methods for selecting features of each feature type and show the relative benefits of both static and dynamic visual features. The performance of the features will be tested on both clean video data and also video data corrupted in a variety of ways to assess each feature type's robustness to potential real-world conditions. One of the test conditions involves a novel form of video corruption we call jitter which simulates camera and/or head movement during recording.

  9. Data Collection in Zooarchaeology: Incorporating Touch-Screen, Speech-Recognition, Barcodes, and GIS

    Directory of Open Access Journals (Sweden)

    W. Flint Dibble

    2015-12-01

    Full Text Available When recording observations on specimens, zooarchaeologists typically use a pen and paper or a keyboard. However, the use of awkward terms and identification codes when recording thousands of specimens makes such data entry prone to human transcription errors. Improving the quantity and quality of the zooarchaeological data we collect can lead to more robust results and new research avenues. This paper presents design tools for building a customized zooarchaeological database that leverages accessible and affordable 21st century technologies. Scholars interested in investing time in designing a custom-database in common software (here, Microsoft Access can take advantage of the affordable touch-screen, speech-recognition, and geographic information system (GIS technologies described here. The efficiency that these approaches offer a research project far exceeds the time commitment a scholar must invest to deploy them.

  10. Improving the Syllable-Synchronous Network Search Algorithm for Word Decoding in Continuous Chinese Speech Recognition

    Institute of Scientific and Technical Information of China (English)

    郑方; 武健; 宋战江

    2000-01-01

    The previously proposed syllable-synchronous network search (SSNS) algorithm plays a very important role in word decoding for continuous Chinese speech recognition and achieves satisfactory performance. Several related key factors that may affect the overall word decoding are carefully studied in this paper, including the refinement of the vocabulary, the big-discount Turing re-estimation of the N-gram probabilities, and the management of the search path buffers. Based on these discussions, corresponding approaches to improving the SSNS algorithm are proposed. Compared with the previous version of the SSNS algorithm, the new version decreases the Chinese character error rate (CCER) in word decoding by 42.1% across a database consisting of a large number of test sentences (syllable strings).

  11. Recognition of 3D objects for autonomous mobile robot's navigation in automated shipbuilding

    Science.gov (United States)

    Lee, Hyunki; Cho, Hyungsuck

    2007-10-01

    Nowadays many parts of the shipbuilding process are automated, but the painting process is not, because of the difficulty of automated on-line painting quality measurement, the harsh painting environment, and the difficulty of robot navigation. However, painting automation is necessary, because it can provide consistent paint film thickness. Furthermore, autonomous mobile robots are strongly required for flexible painting work. However, the main problem for autonomous mobile robot navigation is that there are many obstacles which are not represented in the CAD data. To overcome this problem, obstacle detection and recognition are necessary in order to avoid obstacles and carry out painting work effectively. Many object recognition algorithms have been studied to date; in particular, 2D object recognition methods using intensity images have been widely studied. However, in our case there is no environmental illumination, so these methods cannot be used. To overcome this, 3D range data must be used, but the problems of using 3D range data are high computational cost and long recognition times due to the huge database. In this paper, we propose a 3D object recognition algorithm based on PCA (Principal Component Analysis) and NN (Neural Network). The novelty of the algorithm is that the measured 3D range data are transformed into intensity information, and the PCA and NN algorithms are then applied to the transformed intensity information; this reduces the processing time and makes the data easier to handle, addressing the disadvantages of previous 3D object recognition research. A set of experimental results is shown to verify the effectiveness of the proposed algorithm.

  12. Learning to read shapes the activation of neural lexical representations in the speech recognition pathway.

    Science.gov (United States)

    Schild, Ulrike; Röder, Brigitte; Friedrich, Claudia K

    2011-04-01

    It has been demonstrated that written and spoken language processing are tightly linked. Here we focus on the development of this relationship at the time children start reading and writing. We hypothesize that the newly acquired knowledge about graphemes shapes lexical access in neural spoken word recognition. A group of preliterate children (six years old) and two groups of beginning readers (six and eight years old) were tested in a spoken word identification task. Using word onset priming we compared behavioural and neural facilitation for target words in identical prime-target pairs (e.g., mon-monster) and in prime-target pairs that varied in the first speech sound (e.g., non-monster; Variation condition). In both groups of beginning readers priming was less effective in the Variation condition than in the Identity condition. This was indexed by less behavioural facilitation and enhanced P350 amplitudes in the event-related potentials (ERPs). In the group of preliterate children, by contrast, the two conditions did not differ. Together these results reveal that lexical access in beginning readers is based on more acoustic detail than lexical access in preliterate children. The results are discussed in the light of bidirectional speech and print interactions in readers.

  13. Influence of tinnitus on the percentage index of speech recognition in patients with normal hearing

    Directory of Open Access Journals (Sweden)

    Urnau, Daila

    2010-12-01

    Full Text Available Introduction: The understanding of speech is one of the most important measurable aspects of human auditory function. Tinnitus affects quality of life and impairs communication. Objective: To investigate possible changes in the Percentage Index of Speech Recognition (SDT) in individuals with tinnitus who have normal hearing, and to examine the relationship between tinnitus, gender and age. Methods: A retrospective study analyzing the records of 82 individuals of both genders, aged 21-70 years, totaling 128 ears with normal hearing. The ears were analyzed separately and divided into a control group, without complaints of tinnitus, and a study group, with complaints of tinnitus. The variables gender and age group were examined, as well as the influence of tinnitus on the SDT. A score of 100% correct was considered normal, and values between 88% and 96% were considered altered. These criteria were adopted since percentages below 88% correct are found in individuals with sensorineural hearing loss. Results: There was no statistically significant difference between the variables age and tinnitus, or between tinnitus and the SDT, only between gender and tinnitus. There was a higher prevalence of tinnitus in females (56%), a higher incidence of tinnitus in the age group 31-40 years (41.67%) and the lowest from 41 to 50 years (18.75%), and on the SDT there was a greater percentage of altered results in individuals with tinnitus (61.11%). Conclusion: Tinnitus does not interfere with the SDT, and there is no relationship between tinnitus and age, only between tinnitus and gender.

  14. Exploiting independent filter bandwidth of human factor cepstral coefficients in automatic speech recognition

    Science.gov (United States)

    Skowronski, Mark D.; Harris, John G.

    2004-09-01

    Mel frequency cepstral coefficients (MFCC) are the most widely used speech features in automatic speech recognition systems, primarily because the coefficients fit well with the assumptions used in hidden Markov models and because of the superior noise robustness of MFCC over alternative feature sets such as linear prediction-based coefficients. The authors have recently introduced human factor cepstral coefficients (HFCC), a modification of MFCC that uses the known relationship between center frequency and critical bandwidth from human psychoacoustics to decouple filter bandwidth from filter spacing. In this work, the authors introduce a variation of HFCC called HFCC-E in which filter bandwidth is linearly scaled in order to investigate the effects of wider filter bandwidth on noise robustness. Experimental results show an increase in signal-to-noise ratio of 7 dB over traditional MFCC algorithms when filter bandwidth increases in HFCC-E. An important attribute of both HFCC and HFCC-E is that the algorithms only differ from MFCC in the filter bank coefficients: increased noise robustness using wider filters is achieved with no additional computational cost.
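    For reference, the baseline MFCC front end that HFCC modifies can be computed with librosa; the file name and parameter values are illustrative, and this is the standard MFCC pipeline, not the authors' HFCC or HFCC-E filter bank.

```python
import librosa

# Load any speech file (path is a placeholder) and compute 13 MFCCs from a
# mel filter bank; HFCC/HFCC-E would change only the filter-bank weights.
y, sr = librosa.load("speech.wav", sr=16000)
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                            n_fft=400, hop_length=160, n_mels=26)
print(mfcc.shape)  # (13, number_of_frames)
```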

  15. An Additive and Convolutive Bias Compensation Algorithm for Telephone Speech Recognition

    Institute of Scientific and Technical Information of China (English)

    HAN Zhao-Bing; ZHANG Shu-Wu; XU Bo; HUANG Tai-Yi

    2004-01-01

    A vector piecewise polynomial (VPP) approximation algorithm is proposed for environment compensation of speech signals degraded by both additive and convolutive noises. By investigating the model of the telephone environment, we propose a piecewise polynomial, namely two linear polynomials and a quadratic polynomial, to approximate the environment function precisely. The VPP is applied either to stationary noise or to non-stationary noise. In the first case, batch EM is used in the log-spectral domain; in the second case, recursive EM with iterative stochastic approximation is developed in the cepstral domain. Both approaches are based on the minimum mean squared error (MMSE) sense. Experimental results are presented on the application of this approach to improving the performance of Mandarin large vocabulary continuous speech recognition (LVCSR) degraded by background noises and different transmission channels (such as fixed telephone lines and GSM). The method can reduce the average character error rate (CER) by about 18%.

  16. A PRELIMINARY APPROACH FOR THE AUTOMATED RECOGNITION OF MALIGNANT MELANOMA

    Directory of Open Access Journals (Sweden)

    Ezzeddine Zagrouba

    2011-05-01

    Full Text Available In this work, we are motivated by the desire to classify skin lesions as malignant or benign from color photographic slides of the lesions. Thus, we use color images of skin lesions, image processing techniques and an artificial neural network classifier to distinguish melanoma from benign pigmented lesions. As the first step of the data set analysis, a preprocessing sequence is implemented to remove noise and undesired structures from the color image. Second, an automated segmentation approach localizes suspicious lesion regions by region growing after a preliminary step based on fuzzy sets. Then, we rely on quantitative image analysis to measure a series of candidate attributes hoped to contain enough information to differentiate melanomas from benign lesions. Finally, the selected features are supplied to an artificial neural network for classification of the tumor lesion as malignant or benign. For a preliminary balanced training/testing set, our approach obtains 79.1% correct classification of malignant and benign lesions on real skin lesion images.

  17. Automated recognition of forest patterns using aerial photographs

    Science.gov (United States)

    Barbezat, Vincent; Kreiss, Philippe; Sulzmann, Armin; Jacot, Jacques

    1996-12-01

    In Switzerland, aerial photos are indispensable tools for research into ecosystems and their management. Every six years since 1950, the whole of Switzerland has been systematically surveyed by aerial photos. In the forestry field, these documents not only provide invaluable information but also support field activities such as the drawing up of tree population maps, intervention planning, precise positioning of the upper forest limit, evaluation of forest damage and rates of tree growth. Up to now, the analysis of aerial photos has been carried out by specialists who painstakingly examine every photograph, which makes it a very long, exacting and expensive job. The IMT-DMT of the EPFL and the Antenne romande of the FNP, aware of the special interest involved and the necessity of automated classification of aerial photos, have pooled their resources to develop a software program capable of differentiating between single trees, copses and dense forests. The developed algorithms detect the crowns of the trees and the surface of their orthogonal projection. From the shadow of each tree they calculate its height. They also determine the position of each tree in the Swiss national coordinate system thanks to the implementation of a digital elevation model. In the future, many new and better uses of aerial photos will become possible, particularly where isolated stands are concerned and where changes must be assessed from a diachronic series of photos: from timberline monitoring in research on global change to the exploitation of wooded pastures on small surface areas.

  18. Robust Automatic Speech Recognition Features using Complex Wavelet Packet Transform Coefficients

    Directory of Open Access Journals (Sweden)

    TjongWan Sen

    2009-11-01

    Full Text Available To improve the performance of phoneme-based Automatic Speech Recognition (ASR) in noisy environments, we developed a new technique that adds robustness to clean phoneme features. These robust features are obtained from Complex Wavelet Packet Transform (CWPT) coefficients. Since the CWPT coefficients represent all the different frequency bands of the input signal, decomposing the input signal into a complete CWPT tree also covers all frequencies involved in the recognition process. For time-overlapping signals with different frequency contents, e.g. a phoneme signal with noise, the CWPT coefficients are the combination of the CWPT coefficients of the phoneme signal and the CWPT coefficients of the noise. The CWPT coefficients of the phoneme signal change according to the frequency components contained in the noise. Since the number of phonemes in every language is relatively small (limited) and already well known, one can easily derive principal component vectors from a clean training dataset using Principal Component Analysis (PCA). These principal component vectors can then be used to add robustness and minimize noise effects in the testing phase. Simulation results, using Alpha Numeric 4 (AN4) from Carnegie Mellon University and NOISEX-92 examples from Rice University, showed that this new technique can be used as a feature extractor that improves the robustness of phoneme-based ASR systems in various adverse noisy conditions while still preserving performance in clean environments.

  19. Speaking to the trained ear: musical expertise enhances the recognition of emotions in speech prosody.

    Science.gov (United States)

    Lima, César F; Castro, São Luís

    2011-10-01

    Language and music are closely related in our minds. Does musical expertise enhance the recognition of emotions in speech prosody? Forty highly trained musicians were compared with 40 musically untrained adults (controls) in the recognition of emotional prosody. For purposes of generalization, the participants were from two age groups, young (18-30 years) and middle adulthood (40-60 years). They were presented with short sentences expressing six emotions-anger, disgust, fear, happiness, sadness, surprise-and neutrality, by prosody alone. In each trial, they performed a forced-choice identification of the expressed emotion (reaction times, RTs, were collected) and an intensity judgment. General intelligence, cognitive control, and personality traits were also assessed. A robust effect of expertise was found: musicians were more accurate than controls, similarly across emotions and age groups. This effect cannot be attributed to socioeducational background, general cognitive or personality characteristics, because these did not differ between musicians and controls; perceived intensity and RTs were also similar in both groups. Furthermore, basic acoustic properties of the stimuli like fundamental frequency and duration were predictive of the participants' responses, and musicians and controls were similarly efficient in using them. Musical expertise was thus associated with cross-domain benefits to emotional prosody. These results indicate that emotional processing in music and in language engages shared resources.

  20. A Robust Method for Speech Emotion Recognition Based on Infinite Student’s t-Mixture Model

    Directory of Open Access Journals (Sweden)

    Xinran Zhang

    2015-01-01

    Full Text Available The speech emotion classification method proposed in this paper is based on a Student's t-mixture model with an infinite component number (iSMM) and can directly conduct effective recognition for various kinds of speech emotion samples. Compared with the traditional GMM (Gaussian mixture model), a speech emotion model based on a Student's t-mixture can effectively handle speech sample outliers that exist in the emotion feature space. Moreover, the t-mixture model remains robust to atypical emotion test data. To address the high data complexity caused by the high-dimensional space and the problem of insufficient training samples, a global latent space is added to the emotion model. Such an approach allows the number of components to become infinite and forms an iSMM emotion model, which can automatically determine the best number of components with lower complexity to classify various kinds of emotion characteristic data. Evaluated on one spontaneous (FAU Aibo Emotion Corpus) and two acted (DES and EMO-DB) universal speech emotion databases, which have high-dimensional feature samples and diverse data distributions, the iSMM maintains better recognition performance than the comparison methods. Thus, its effectiveness and generalization to high-dimensional data and outliers are verified. Hereby, the iSMM emotion model is verified as a robust method that is valid and generalizes to outliers and high-dimensional emotional features.
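    The infinite Student's t-mixture model itself is not available in common toolkits, but the GMM baseline it is compared against can be sketched with scikit-learn; the per-emotion feature matrices below are synthetic placeholders.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Hypothetical training features per emotion class (rows = utterance-level
# feature vectors). One GMM is fit per emotion; a test vector is assigned
# to the emotion whose GMM gives the highest log-likelihood.
rng = np.random.default_rng(1)
train = {"anger": rng.normal(0.0, 1.0, (100, 12)),
         "sadness": rng.normal(1.0, 1.0, (100, 12))}

models = {emo: GaussianMixture(n_components=4, covariance_type="diag").fit(X)
          for emo, X in train.items()}

test_vec = rng.normal(0.5, 1.0, (1, 12))
scores = {emo: gmm.score(test_vec) for emo, gmm in models.items()}
print(max(scores, key=scores.get))
```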

  1. On the Use of Evolutionary Algorithms to Improve the Robustness of Continuous Speech Recognition Systems in Adverse Conditions

    Science.gov (United States)

    Selouani, Sid-Ahmed; O'Shaughnessy, Douglas

    2003-12-01

    Limiting the decrease in performance due to acoustic environment changes remains a major challenge for continuous speech recognition (CSR) systems. We propose a novel approach which combines the Karhunen-Loève transform (KLT) in the mel-frequency domain with a genetic algorithm (GA) to enhance the data representing corrupted speech. The idea consists of projecting noisy speech parameters onto the space generated by the genetically optimized principal axis issued from the KLT. The enhanced parameters increase the recognition rate for highly interfering noise environments. The proposed hybrid technique, when included in the front-end of an HTK-based CSR system, outperforms the conventional recognition process in severe interfering car noise environments for a wide range of signal-to-noise ratios (SNRs) varying from 16 dB to −4 dB. We also showed the effectiveness of the KLT-GA method in recognizing speech subject to telephone channel degradations.

  2. On the Use of Evolutionary Algorithms to Improve the Robustness of Continuous Speech Recognition Systems in Adverse Conditions

    Directory of Open Access Journals (Sweden)

    Sid-Ahmed Selouani

    2003-07-01

    Full Text Available Limiting the decrease in performance due to acoustic environment changes remains a major challenge for continuous speech recognition (CSR) systems. We propose a novel approach which combines the Karhunen-Loève transform (KLT) in the mel-frequency domain with a genetic algorithm (GA) to enhance the data representing corrupted speech. The idea consists of projecting noisy speech parameters onto the space generated by the genetically optimized principal axis issued from the KLT. The enhanced parameters increase the recognition rate for highly interfering noise environments. The proposed hybrid technique, when included in the front-end of an HTK-based CSR system, outperforms the conventional recognition process in severe interfering car noise environments for a wide range of signal-to-noise ratios (SNRs) varying from 16 dB to −4 dB. We also showed the effectiveness of the KLT-GA method in recognizing speech subject to telephone channel degradations.
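    The KLT step of this method, projecting mel-frequency parameters onto the leading eigenvectors of their covariance matrix, can be sketched in NumPy as follows; the genetic-algorithm optimization of the axes is omitted, and the data are synthetic.

```python
import numpy as np

def klt_enhance(frames, n_axes=8):
    """Project feature frames (rows) onto the n_axes leading principal axes
    of their covariance matrix and reconstruct, discarding low-variance
    directions that are assumed to carry mostly noise."""
    mean = frames.mean(axis=0)
    centered = frames - mean
    cov = np.cov(centered, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    axes = eigvecs[:, np.argsort(eigvals)[::-1][:n_axes]]  # leading eigenvectors
    return centered @ axes @ axes.T + mean

# Synthetic noisy mel-frequency parameters: 500 frames x 20 coefficients.
noisy = np.random.default_rng(2).normal(size=(500, 20))
print(klt_enhance(noisy, n_axes=8).shape)
```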

  3. Long-term outcomes on spatial hearing, speech recognition and receptive vocabulary after sequential bilateral cochlear implantation in children.

    Science.gov (United States)

    Sparreboom, Marloes; Langereis, Margreet C; Snik, Ad F M; Mylanus, Emmanuel A M

    2014-11-05

    Sequential bilateral cochlear implantation in profoundly deaf children often leads to primary advantages in spatial hearing and speech recognition. It is not yet known how these children develop in the long term and whether these primary advantages will also lead to secondary advantages, e.g. better language skills. The aim of the present longitudinal cohort study was to assess the long-term effects of sequential bilateral cochlear implantation in children on spatial hearing, speech recognition in quiet and in noise, and receptive vocabulary. Twenty-four children with bilateral cochlear implants (BiCIs) were tested 5-6 years after sequential bilateral cochlear implantation. These children received their second implant between 2.4 and 8.5 years of age. Speech and language data were also gathered in a matched reference group of 26 children with a unilateral cochlear implant (UCI). Spatial hearing was assessed with a minimum audible angle (MAA) task with different stimulus types to gain global insight into the effective use of interaural level difference (ILD) and interaural timing difference (ITD) cues. In the long term, children still showed improvements in spatial acuity. Spatial acuity was higher for ILD cues than for ITD cues. For speech recognition in quiet and noise, and for receptive vocabulary, children with BiCIs had significantly higher scores than children with a UCI. Results also indicate that attending a mainstream school has a significant positive effect on speech recognition and receptive vocabulary compared to attending a school for the deaf. Despite a period of unilateral deafness, children with BiCIs participating in mainstream education obtained age-appropriate language scores.

  4. Automated Three-Dimensional Microbial Sensing and Recognition Using Digital Holography and Statistical Sampling

    Directory of Open Access Journals (Sweden)

    Inkyu Moon

    2010-09-01

    Full Text Available We overview an approach to providing automated three-dimensional (3D) sensing and recognition of biological micro/nanoorganisms, integrating Gabor digital holographic microscopy and statistical sampling methods. For 3D data acquisition of biological specimens, a coherent beam propagates through the specimen, and its transversely and longitudinally magnified diffraction pattern observed through the microscope objective is optically recorded with an image sensor array interfaced with a computer. 3D visualization of the biological specimen from the magnified diffraction pattern is accomplished by using the computational Fresnel propagation algorithm. For 3D recognition of the biological specimen, a watershed image segmentation algorithm is applied to automatically remove the unnecessary background parts in the reconstructed holographic image. Statistical estimation and inference algorithms are applied to the automatically segmented holographic image. Overviews of preliminary experimental results illustrate how the holographic image reconstructed from the Gabor digital hologram of a biological specimen contains important information for microbial recognition.

   5. Modern prescription theory and application: realistic expectations for speech recognition with hearing aids.

    Science.gov (United States)

    Johnson, Earl E

    2013-01-01

    A major decision at the time of hearing aid fitting and dispensing is the amount of amplification to provide listeners (both adult and pediatric populations) for the appropriate compensation of sensorineural hearing impairment across a range of frequencies (e.g., 160-10000 Hz) and input levels (e.g., 50-75 dB sound pressure level). This article describes modern prescription theory for hearing aids within the context of a risk versus return trade-off and efficient frontier analyses. The expected return of amplification recommendations (i.e., generic prescriptions such as National Acoustic Laboratories-Non-Linear 2, NAL-NL2, and Desired Sensation Level Multiple Input/Output, DSL m[i/o]) for the Speech Intelligibility Index (SII) and high-frequency audibility was traded against a potential risk (i.e., loudness). The modeled performance of each prescription was compared with the others and with the efficient frontier of normal hearing sensitivity (i.e., a reference point for the most return with the least risk). For the pediatric population, NAL-NL2 was more efficient for SII, while DSL m[i/o] was more efficient for high-frequency audibility. For the adult population, NAL-NL2 was more efficient for SII, while the two prescriptions were similar with regard to high-frequency audibility. In terms of absolute return (i.e., not considering the risk of loudness), however, DSL m[i/o] prescribed more outright high-frequency audibility than NAL-NL2 for either age group, particularly as hearing loss increased. Given the principles and demonstrated accuracy of desensitization (reduced utility of audibility with increasing hearing loss) observed at the group level, additional high-frequency audibility beyond that of NAL-NL2 is not expected to make further contributions to speech intelligibility (recognition) for the average listener.

   6. Key Technologies in Speech Emotion Recognition

    Institute of Scientific and Technical Information of China (English)

    张雪英; 孙颖; 张卫; 畅江

    2015-01-01

    Emotional information in speech signals is an important information resource, and speech emotion recognition based purely on mathematical model construction and computation has proven insufficient. Emotion is a perceptual state toward people or things, expressed through the physiological and psychological changes triggered by external stimuli; combining cognitive psychology with speech signal processing therefore helps to process emotional speech better. This paper first introduces the relationship between speech emotion and human cognition and summarizes recent progress and research results in the field, including the construction of emotion databases, the extraction of emotional features, and emotion recognition networks. It then introduces the application of fuzzy cognitive map networks, built on cognitive psychology, to emotional speech recognition. In addition, the mechanism by which the human brain perceives emotional speech is explored, and the integration of event-related potentials into speech emotion recognition is attempted in order to improve recognition accuracy. Finally, ideas and prospects for the future cross-disciplinary development of speech emotion recognition and cognitive psychology are presented.

   7. Research of Speech Recognition System Based on Matlab

    Institute of Scientific and Technical Information of China (English)

    王彪

    2011-01-01

    A speech recognition system based on Matlab software is designed; its main functions are recording, playing back and preprocessing voice signals, segmented filtering, feature extraction, and speech recognition. Experiments verify that the system meets the requirement of recognizing simple speech, but some aspects still need improvement, such as whether more complex speech can be recognized in complex environments.

  8. Effective Prediction of Errors by Non-native Speakers Using Decision Tree for Speech Recognition-Based CALL System

    Science.gov (United States)

    Wang, Hongcui; Kawahara, Tatsuya

    CALL (Computer Assisted Language Learning) systems using ASR (Automatic Speech Recognition) for second language learning have received increasing interest recently. However, it still remains a challenge to achieve high speech recognition performance, including accurate detection of erroneous utterances by non-native speakers. Conventionally, possible error patterns, based on linguistic knowledge, are added to the lexicon and language model, or to the ASR grammar network. However, this approach quickly runs into a trade-off between the coverage of errors and the increase in perplexity. To solve the problem, we propose a method based on a decision tree to learn effective prediction of errors made by non-native speakers. An experimental evaluation with a number of foreign students learning Japanese shows that the proposed method can effectively generate an ASR grammar network, given a target sentence, achieving both better coverage of errors and smaller perplexity, and resulting in a significant improvement in ASR accuracy.
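    As a rough illustration of predicting learner errors with a decision tree (the per-word features and labels below are invented, not the study's actual lexical or phonetic attributes):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Hypothetical per-word features for non-native utterances, e.g.
# [word length, phone difficulty score, learner proficiency level],
# with a binary label: 1 if the learner mispronounced the word.
X = np.array([[3, 0.2, 2], [7, 0.8, 1], [5, 0.6, 1], [4, 0.1, 3],
              [6, 0.9, 1], [2, 0.3, 3], [8, 0.7, 2], [5, 0.4, 2]])
y = np.array([0, 1, 1, 0, 1, 0, 1, 0])

tree = DecisionTreeClassifier(max_depth=3).fit(X, y)
# The predicted error probability for a new word could then guide which
# error arcs to add to the ASR grammar network for a target sentence.
print(tree.predict_proba([[6, 0.75, 1]]))
```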

  9. Noise Estimation and Noise Removal Techniques for Speech Recognition in Adverse Environment

    OpenAIRE

    Shrawankar, Urmila; Thakare, Vilas

    2010-01-01

    Noise is ubiquitous in almost all acoustic environments. The speech signal that is recorded by a microphone is generally infected by noise originating from various sources. Such contamination can change the characteristics of the speech signals and degrade the speech quality and intelligibility, thereby causing significant harm to human-to-machine communication systems. Noise detection and reduction for speech applications is often formulated as a digital filtering pr...

  10. Automatic Speech Recognition and Training for Severely Dysarthric Users of Assistive Technology: The STARDUST Project

    Science.gov (United States)

    Parker, Mark; Cunningham, Stuart; Enderby, Pam; Hawley, Mark; Green, Phil

    2006-01-01

    The STARDUST project developed robust computer speech recognizers for use by eight people with severe dysarthria and concomitant physical disability to access assistive technologies. Independent computer speech recognizers trained with normal speech are of limited functional use by those with severe dysarthria due to limited and inconsistent…

  11. What’s Wrong With Automatic Speech Recognition (ASR) and How Can We Fix It?

    Science.gov (United States)

    2013-03-01

    [Fragmentary record excerpt: cites related work by Alex Acero and colleagues, including a Proc. ICASSP 2005 paper using an underlying hidden generative model with improved phonetic performance, work with Gunawardana and Acero at the International Conference on Speech Communication and Technology (International Speech Communication Association), work with Geoffrey Zweig and Acero at ASRU 2011, and "The Subspace Gaussian Mixture Model - a Structured Model for Speech".]

  12. Development of a two wheeled self balancing robot with speech recognition and navigation algorithm

    Science.gov (United States)

    Rahman, Md. Muhaimin; Ashik-E-Rasul, Haq, Nowab. Md. Aminul; Hassan, Mehedi; Hasib, Irfan Mohammad Al; Hassan, K. M. Rafidh

    2016-07-01

    This paper discusses the modeling, construction and development of the navigation algorithm of a two wheeled self balancing mobile robot in an enclosure. We discuss the design of two of the main controllers, both PID algorithms, on the robot model. Simulation is performed in the SIMULINK environment. The controller is developed primarily for self-balancing of the robot and also for its positioning. For navigation in an enclosure, a template matching algorithm is proposed for precise measurement of the robot position. The navigation system needs to be calibrated before the navigation process starts. Almost all of the earlier template matching algorithms that can be found in the open literature can only trace the robot, but the proposed algorithm can also locate the position of other objects in an enclosure, such as furniture and tables. This enables the robot to know the exact location of every stationary object in the enclosure. Moreover, some additional features, such as speech recognition and object detection, are added. For object detection, the single-board computer Raspberry Pi is used. The system is programmed to analyze images captured via the camera, which are then processed through background subtraction, followed by active noise reduction.
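    A minimal discrete-time PID loop of the kind used for self-balancing can be sketched as follows; the gains and sampling period are placeholders, whereas the paper tunes its own controllers in SIMULINK.

```python
class PID:
    """Textbook discrete PID controller: u = Kp*e + Ki*sum(e)*dt + Kd*de/dt."""

    def __init__(self, kp, ki, kd, dt):
        self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
        self.integral = 0.0
        self.prev_error = 0.0

    def step(self, setpoint, measurement):
        error = setpoint - measurement
        self.integral += error * self.dt
        derivative = (error - self.prev_error) / self.dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative

# Drive the tilt angle toward 0 degrees (upright); gains are illustrative.
pid = PID(kp=12.0, ki=0.5, kd=0.8, dt=0.01)
print(pid.step(setpoint=0.0, measurement=3.2))  # motor command for a 3.2 deg tilt
```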

  13. The Usefulness of Automatic Speech Recognition (ASR Eyespeak Software in Improving Iraqi EFL Students’ Pronunciation

    Directory of Open Access Journals (Sweden)

    Lina Fathi Sidig Sidgi

    2017-02-01

    Full Text Available The present study focuses on determining whether automatic speech recognition (ASR) technology is reliable for improving the English pronunciation of Iraqi EFL students. Non-native learners of English are generally concerned about improving their pronunciation skills, and Iraqi students face difficulties in pronouncing English sounds that are not found in their native language (Arabic). This study is concerned with ASR and its effectiveness in overcoming this difficulty. The data were obtained from twenty participants randomly selected from first-year college students at Al-Turath University College, Department of English, in Baghdad, Iraq. The students participated in a two-month pronunciation instruction course using the ASR Eyespeak software. At the end of the course, the students completed a questionnaire to collect their opinions about the usefulness of ASR Eyespeak in improving their pronunciation. The findings of the study revealed that the students found the ASR Eyespeak software very useful in improving their pronunciation and helping them realise their pronunciation mistakes. They also reported that learning pronunciation with ASR Eyespeak was enjoyable.

  14. Authenticity affects the recognition of emotions in speech: behavioral and fMRI evidence.

    Science.gov (United States)

    Drolet, Matthis; Schubotz, Ricarda I; Fischer, Julia

    2012-03-01

    The aim of the present study was to determine how authenticity of emotion expression in speech modulates activity in the neuronal substrates involved in emotion recognition. Within an fMRI paradigm, participants judged either the authenticity (authentic or play acted) or emotional content (anger, fear, joy, or sadness) of recordings of spontaneous emotions and reenactments by professional actors. When contrasting between task types, active judgment of authenticity, more than active judgment of emotion, indicated potential involvement of the theory of mind (ToM) network (medial prefrontal cortex, temporoparietal cortex, retrosplenium) as well as areas involved in working memory and decision making (BA 47). Subsequently, trials with authentic recordings were contrasted with those of reenactments to determine the modulatory effects of authenticity. Authentic recordings were found to enhance activity in part of the ToM network (medial prefrontal cortex). This effect of authenticity suggests that individuals integrate recollections of their own experiences more for judgments involving authentic stimuli than for those involving play-acted stimuli. The behavioral and functional results show that authenticity of emotional prosody is an important property influencing human responses to such stimuli, with implications for studies using play-acted emotions.

  15. Automated Gesturing for Virtual Characters: Speech-driven and Text-driven Approaches

    Directory of Open Access Journals (Sweden)

    Goranka Zoric

    2006-04-01

    Full Text Available We present two methods for automatic facial gesturing of graphically embodied animated agents. In the first, a conversational agent is driven by speech in an automatic lip sync process: by analyzing the speech input, lip movements are determined from the speech signal. The second method provides a virtual speaker capable of reading plain English text and rendering it as speech accompanied by appropriate facial gestures. The proposed statistical model for generating the virtual speaker's facial gestures can also be applied as an addition to the lip synchronization process in order to obtain speech-driven facial gesturing. In this case, the statistical model is triggered by the prosody of the input speech instead of a lexical analysis of the input text.

  16. Automated recognition of spikes in 1 Hz data recorded at the Easter Island magnetic observatory

    Science.gov (United States)

    Soloviev, Anatoly; Chulliat, Arnaud; Bogoutdinov, Shamil; Gvishiani, Alexei; Agayan, Sergey; Peltier, Aline; Heumez, Benoit

    2012-09-01

    In the present paper we apply a recently developed pattern recognition algorithm, SPs, to the problem of automated detection of artificial disturbances in one-second magnetic observatory data. The SPs algorithm relies on the theory of discrete mathematical analysis, which has been developed by some of the authors for more than 10 years. It continues the authors' research in the morphological analysis of time series using fuzzy logic techniques. We show that, after a learning phase, this algorithm is able to recognize artificial spikes uniformly with low probabilities of target miss and false alarm. In particular, a 94% spike recognition rate and a 6% false alarm rate were achieved when the algorithm was applied to raw one-second data acquired at the Easter Island magnetic observatory. This capability is critical and opens the possibility of using the SPs algorithm in an operational environment.
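    The SPs algorithm itself builds on fuzzy-logic measures from discrete mathematical analysis; as a much cruder stand-in that only illustrates the basic idea of flagging isolated spikes in one-second data, a robust z-score detector might look like this (window length and threshold are arbitrary):

```python
import numpy as np

def detect_spikes(x, window=60, threshold=6.0):
    """Flag samples whose deviation from a local median exceeds `threshold`
    robust (MAD-based) standard deviations. A crude stand-in for the SPs
    algorithm, not a reimplementation of it."""
    x = np.asarray(x, dtype=float)
    spikes = []
    for i in range(len(x)):
        lo, hi = max(0, i - window), min(len(x), i + window)
        seg = x[lo:hi]
        med = np.median(seg)
        mad = np.median(np.abs(seg - med)) + 1e-12
        if abs(x[i] - med) / (1.4826 * mad) > threshold:
            spikes.append(i)
    return spikes

data = np.sin(np.linspace(0, 20, 600))   # smooth synthetic magnetic signal
data[123] += 15.0                         # injected artificial spike
print(detect_spikes(data))                # -> [123]
```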

  17. OLIVE: Speech-Based Video Retrieval

    NARCIS (Netherlands)

    Jong, de Franciska; Gauvain, Jean-Luc; Hartog, den Jurgen; Netter, Klaus

    1999-01-01

    This paper describes the Olive project which aims to support automated indexing of video material by use of human language technologies. Olive is making use of speech recognition to automatically derive transcriptions of the sound tracks, generating time-coded linguistic elements which serve as the

  18. Speech recognition in reverberant and noisy environments employing multiple feature extractors and i-vector speaker adaptation

    Science.gov (United States)

    Alam, Md Jahangir; Gupta, Vishwa; Kenny, Patrick; Dumouchel, Pierre

    2015-12-01

    The REVERB challenge provides a common framework for the evaluation of feature extraction techniques in the presence of both reverberation and additive background noise. State-of-the-art speech recognition systems perform well in controlled environments, but their performance degrades in realistic acoustical conditions, especially in real as well as simulated reverberant environments. In this contribution, we utilize multiple feature extractors, including the conventional mel-filterbank, multi-taper spectrum estimation-based mel-filterbank, robust mel and compressive gammachirp filterbank, iterative deconvolution-based dereverberated mel-filterbank, and maximum likelihood inverse filtering-based dereverberated mel-frequency cepstral coefficient features for speech recognition with multi-condition training data. In order to improve speech recognition performance, we combine their results using ROVER (Recognizer Output Voting Error Reduction). For the two- and eight-channel tasks, to benefit from the multi-channel data, we also use ROVER, instead of a multi-microphone signal processing method, to reduce the word error rate by selecting the best scoring word at each channel. As in previous work, we also apply i-vector-based speaker adaptation, which was found effective. In the speech recognition task, speaker adaptation tries to reduce the mismatch between the training and test speakers. Speech recognition experiments are conducted on the REVERB challenge 2014 corpora using the Kaldi recognizer. In our experiments, we use both utterance-based batch processing and full batch processing. In the single-channel task, full batch processing reduced the word error rate (WER) from 10.0 to 9.3% on SimData as compared to utterance-based batch processing. Using full batch processing, we obtained an average WER of 9.0 and 23.4% on the SimData and RealData, respectively, for the two-channel task, whereas for the eight-channel task on the SimData and RealData, the average WERs found were 8

  19. The Long Road to Automation: Neurocognitive Development of Letter-Speech Sound Processing

    Science.gov (United States)

    Froyen, Dries J. W.; Bonte, Milene L.; van Atteveldt, Nienke; Blomert, Leo

    2009-01-01

    In transparent alphabetic languages, the expected standard for complete acquisition of letter-speech sound associations is within one year of reading instruction. The neural mechanisms underlying the acquisition of letter-speech sound associations have, however, hardly been investigated. The present article describes an ERP study with beginner and…

  20. Design of an Automated Secure Garage System Using License Plate Recognition Technique

    Directory of Open Access Journals (Sweden)

    Afaz Uddin Ahmed

    2014-01-01

    Full Text Available Modern technologies have reached our garages to secure cars and the entrances to residences, driven by the demand for high security and automated infrastructure. The concept of intelligent secure garage systems in modern transport management is a remarkable example of computer-interfaced control devices. The License Plate Recognition (LPR) process is one of the key elements of modern intelligent garage security setups. This paper presents the design of an automated secure garage system featuring the LPR process. A template matching approach using Optical Character Recognition (OCR) is implemented to carry out the LPR method. We also developed a prototype design of the secured garage system to verify the application for local use. The system allows only predefined, enlisted cars or vehicles to enter the garage while blocking the others, along with a central-alarm feature. Moreover, the system maintains an updated database of the cars that have left and entered the garage within a particular duration. Vehicles are distinguished by the system mainly based on the registration number on their license plates. The approach was tested on several samples of license plate images in both indoor and outdoor settings.
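    A minimal OpenCV sketch of the template-matching step is given below; the character templates, file names, and match threshold are placeholders, not the authors' implementation.

```python
import cv2

def match_character(plate_img, templates, threshold=0.7):
    """Slide each character template over the plate region and keep the
    best normalized cross-correlation score above `threshold`."""
    best_char, best_score = None, threshold
    for char, tmpl in templates.items():
        result = cv2.matchTemplate(plate_img, tmpl, cv2.TM_CCOEFF_NORMED)
        _, max_val, _, _ = cv2.minMaxLoc(result)
        if max_val > best_score:
            best_char, best_score = char, max_val
    return best_char, best_score

# Hypothetical grayscale inputs: a cropped plate segment and per-character templates.
plate = cv2.imread("plate_segment.png", cv2.IMREAD_GRAYSCALE)
templates = {c: cv2.imread(f"templates/{c}.png", cv2.IMREAD_GRAYSCALE)
             for c in "0123456789"}
print(match_character(plate, templates))
```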

   1. Speech recognition system based on LPCC parameters

    Institute of Scientific and Technical Information of China (English)

    王彪

    2012-01-01

    In order to recognize simple speech, a speech recognition system based on LPCC parameters is designed. Its main functions are recording, playing back and preprocessing voice signals, segmented filtering, feature extraction, and speech recognition. Simulation experiments verify that the system meets the requirement of recognizing simple speech, but some aspects still need improvement, such as whether more complex speech can be recognized in complex environments.

  2. Speech recognition software and electronic psychiatric progress notes: physicians' ratings and preferences

    Directory of Open Access Journals (Sweden)

    Derman Yaron D

    2010-08-01

    Full Text Available Abstract Background The context of the current study was mandatory adoption of electronic clinical documentation within a large mental health care organization. Psychiatric electronic documentation has unique needs by the nature of its dense narrative content. Our goal was to determine whether speech recognition (SR) would ease the creation of electronic progress note (ePN) documents by physicians at our institution. Methods Subjects: Twelve physicians had access to SR software on their computers for a period of four weeks to create ePN. Measurements: We examined SR software in relation to its perceived usability, data entry time savings, impact on the quality of care and quality of documentation, and impact on clinical and administrative workflow, as compared to existing methods for data entry. Data analysis: A series of Wilcoxon signed rank tests were used to compare pre- and post-SR measures. A qualitative study design was used. Results Six of twelve participants completing the study favoured the use of SR (five with SR alone plus one with SR via a hand-held digital recorder) for creating electronic progress notes over their existing mode of data entry. There was no clear perceived benefit from SR in terms of data entry time savings, quality of care, quality of documentation, or impact on clinical and administrative workflow. Conclusions Although our findings are mixed, SR may be a technology with some promise for mental health documentation. Future investigations of this nature should use more participants, a broader range of document types, and compare front- and back-end SR methods.

   3. Speaker-sensitive emotion recognition via ranking: Studies on acted and spontaneous speech

    Science.gov (United States)

    Cao, Houwei; Verma, Ragini; Nenkova, Ani

    2015-01-01

    We introduce a ranking approach for emotion recognition which naturally incorporates information about the general expressivity of speakers. We demonstrate that our approach leads to substantial gains in accuracy compared to conventional approaches. We train ranking SVMs for individual emotions, treating the data from each speaker as a separate query, and combine the predictions from all rankers to perform multi-class prediction. The ranking method provides two natural benefits. It captures speaker-specific information even in speaker-independent training/testing conditions. It also incorporates the intuition that each utterance can express a mix of possible emotions and that considering the degree to which each emotion is expressed can be productively exploited to identify the dominant emotion. We compare the performance of the rankers and their combination to standard SVM classification approaches on two publicly available datasets of acted emotional speech, Berlin and LDC, as well as on spontaneous emotional data from the FAU Aibo dataset. On acted data, ranking approaches exhibit significantly better performance compared to SVM classification both in distinguishing a specific emotion from all others and in multi-class prediction. On the spontaneous data, which contains mostly neutral utterances with a relatively small portion of less intense emotional utterances, ranking-based classifiers again achieve much higher precision in identifying emotional utterances than conventional SVM classifiers. In addition, we discuss the complementarity of conventional SVM and ranking-based classifiers. On all three datasets we find dramatically higher accuracy for the test items on whose prediction the two methods agree compared to the accuracy of the individual methods. Furthermore, on the spontaneous data the ranking and standard classification are complementary and we obtain marked improvement when we combine the two classifiers by late-stage fusion.
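    One common way to realize a ranking SVM is the pairwise transform: within each query (here, each speaker), difference vectors between utterances with different relevance to a target emotion are classified by a linear SVM. The sketch below assumes synthetic features and binary relevance labels and is not the authors' exact setup.

```python
import numpy as np
from sklearn.svm import LinearSVC

def pairwise_transform(X, relevance, speakers):
    """Build difference vectors X[i]-X[j] for utterance pairs of the same
    speaker with different relevance to the target emotion."""
    Xp, yp = [], []
    for s in np.unique(speakers):
        idx = np.where(speakers == s)[0]
        for i in idx:
            for j in idx:
                if relevance[i] > relevance[j]:
                    Xp.append(X[i] - X[j]); yp.append(1)
                    Xp.append(X[j] - X[i]); yp.append(-1)
    return np.array(Xp), np.array(yp)

rng = np.random.default_rng(3)
X = rng.normal(size=(40, 10))              # acoustic features per utterance
relevance = rng.integers(0, 2, size=40)    # 1 if the utterance expresses the target emotion
speakers = rng.integers(0, 5, size=40)     # speaker/query identifier

Xp, yp = pairwise_transform(X, relevance, speakers)
ranker = LinearSVC().fit(Xp, yp)
scores = X @ ranker.coef_.ravel()          # higher score = more of the target emotion
print(scores[:5])
```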

  4. Robust Speech Processing & Recognition: Speaker ID, Language ID, Speech Recognition/Keyword Spotting, Diarization/Co-Channel/Environmental Characterization, Speaker State Assessment

    Science.gov (United States)

    2015-10-01

    [Fragmentary record excerpt: discusses physical task stress and environmental factors (e.g., workplace noise versus random noise in a gym), within-speaker factors such as breathing and fatigue, and across-speaker factors such as muscle control; cites related work on sex-specific behaviour (Language and Speech 38, 267-287) and on speaker height estimation combining GMM and linear methods (Williams, K. and Hansen, J., 2013).]

  5. Predicting the effect of spectral subtraction on the speech recognition threshold based on the signal-to-noise ratio in the envelope domain

    DEFF Research Database (Denmark)

    Jørgensen, Søren; Dau, Torsten

    2011-01-01

    Noise reduction by spectral subtraction has rarely been evaluated perceptually in terms of speech intelligibility. This study analyzed the effects of the spectral subtraction strategy proposed by Berouti et al. [ICASSP 4 (1979), 208-211] on the speech recognition threshold (SRT) obtained with sentences presented in stationary speech-shaped noise. The SRT was measured in five normal-hearing listeners in six conditions of spectral subtraction. The results showed an increase of the SRT after processing, i.e. decreased speech intelligibility, in contrast to what is predicted by the Speech Transmission Index (STI). Here, another approach is proposed, denoted the speech-based envelope power spectrum model (sEPSM), which predicts intelligibility based on the signal-to-noise ratio in the envelope domain. In contrast to the STI, the sEPSM is sensitive to the increased amount of noise envelope power that arises as a consequence of the spectral subtraction.
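    The Berouti-style spectral subtraction evaluated here can be sketched as follows; the oversubtraction factor, spectral floor, STFT settings, and the assumption of a noise-only lead-in segment are illustrative choices.

```python
import numpy as np
from scipy.signal import stft, istft

def spectral_subtraction(noisy, fs, noise_seconds=0.5, alpha=4.0, beta=0.01):
    """Power spectral subtraction with oversubtraction (alpha) and a spectral
    floor (beta), in the spirit of Berouti et al. (1979)."""
    f, t, Y = stft(noisy, fs=fs, nperseg=512)
    noise_frames = int(noise_seconds * fs / (512 // 2))          # frames in the noise-only lead-in
    noise_power = np.mean(np.abs(Y[:, :max(noise_frames, 1)]) ** 2, axis=1, keepdims=True)
    clean_power = np.abs(Y) ** 2 - alpha * noise_power
    clean_power = np.maximum(clean_power, beta * noise_power)    # spectral floor
    S = np.sqrt(clean_power) * np.exp(1j * np.angle(Y))          # keep the noisy phase
    _, enhanced = istft(S, fs=fs, nperseg=512)
    return enhanced

fs = 16000
noisy = np.random.default_rng(4).normal(size=fs)                 # placeholder noisy signal
print(spectral_subtraction(noisy, fs).shape)
```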

  6. Testing Speech Recognition in Spanish-English Bilingual Children with the Computer-Assisted Speech Perception Assessment (CASPA): Initial Report.

    Science.gov (United States)

    García, Paula B; Rosado Rogers, Lydia; Nishi, Kanae

    2016-01-01

    This study evaluated the English version of Computer-Assisted Speech Perception Assessment (E-CASPA) with Spanish-English bilingual children. E-CASPA has been evaluated with monolingual English speakers ages 5 years and older, but it is unknown whether a separate norm is necessary for bilingual children. Eleven Spanish-English bilingual and 12 English monolingual children (6 to 12 years old) with normal hearing participated. Responses were scored by word, phoneme, consonant, and vowel. Regardless of scores, performance across three signal-to-noise ratio conditions was similar between groups, suggesting that the same norm can be used for both bilingual and monolingual children.

  7. Vision-based obstacle recognition system for automated lawn mower robot development

    Science.gov (United States)

    Mohd Zin, Zalhan; Ibrahim, Ratnawati

    2011-06-01

    Digital image processing (DIP) techniques have been widely used in various types of applications recently. Classification and recognition of a specific object using a vision system requires some challenging tasks in the field of image processing and artificial intelligence. The ability and efficiency of a vision system to capture and process images is very important for any intelligent system such as an autonomous robot. This paper focuses on the development of a vision system that could contribute to the development of an automated vision-based lawn mower robot. The work involves the implementation of DIP techniques to detect and recognize three different types of obstacles that usually exist on a football field. The focus was on the study of different types and sizes of obstacles, the development of a vision-based obstacle recognition system, and the evaluation of the system's performance. Image processing techniques such as image filtering, segmentation, enhancement and edge detection have been applied in the system. The results show that the developed system is able to detect and recognize various types of obstacles on a football field with a recognition rate of more than 80%.
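    One of the building blocks mentioned above, edge detection after filtering, can be sketched with OpenCV; the image path, blur kernel, Canny thresholds, and minimum contour area are placeholders.

```python
import cv2

# Load a field image (placeholder path), suppress noise, find edges, and
# keep contours large enough to be candidate obstacles.
image = cv2.imread("field.png", cv2.IMREAD_GRAYSCALE)
blurred = cv2.GaussianBlur(image, (5, 5), 0)
edges = cv2.Canny(blurred, 50, 150)
contours, _ = cv2.findContours(edges, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
obstacles = [c for c in contours if cv2.contourArea(c) > 500]
print(f"candidate obstacles: {len(obstacles)}")
```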

  8. Nonlinear Time-Frequency Distributions of Spectrum Energy Operator in Large Vocabulary Mandarin Speaker Independent Speech Recognition System

    Institute of Scientific and Technical Information of China (English)

    Fadhil H. T. Al-dulaimy; WANG Zuoying(王作英)

    2003-01-01

    This work demonstrates the use of the nonlinear time-frequency distribution (NLTFD) of a discrete time energy operator (DTEO), based on amplitude modulation-frequency modulation demodulation techniques, as a feature in speech recognition. The duration distribution based hidden Markov model in a speaker independent large vocabulary Mandarin speech recognition system was reconstructed from the feature vectors in the front-end detection stage. The goal was to improve the performance of the existing system by combining new features with the baseline feature vector. This paper also deals with errors associated with using a pre-emphasis filter in the front-end processing of the present scheme, which causes an increase in the noise energy at high frequencies above 4 kHz and in some cases degrades the recognition accuracy. The experimental results show that eliminating the pre-emphasis filters from the pre-processing stage and using NLTFD with compensated DTEO combined with mel frequency cepstrum components gives a 21.95% reduction in the relative error rate compared to the conventional technique with 25 candidates used in the test.
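    The discrete-time energy operator at the core of this feature is the Teager-Kaiser operator, psi[n] = x[n]^2 - x[n-1]*x[n+1]; a minimal sketch with a synthetic test tone:

```python
import numpy as np

def teager_energy(x):
    """Discrete-time Teager-Kaiser energy operator:
    psi[n] = x[n]^2 - x[n-1] * x[n+1], defined for 1 <= n <= len(x) - 2."""
    x = np.asarray(x, dtype=float)
    return x[1:-1] ** 2 - x[:-2] * x[2:]

# For a pure tone A*cos(w*n), the operator equals A^2 * sin(w)^2.
n = np.arange(1000)
tone = 0.5 * np.cos(0.2 * n)
print(teager_energy(tone).mean())  # close to 0.25 * sin(0.2)**2, about 0.0099
```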

  9. Automatic recognition of spontaneous emotions in speech using acoustic and lexical features

    NARCIS (Netherlands)

    Raaijmakers, S.; Truong, K.P.

    2008-01-01

    We developed acoustic and lexical classifiers, based on a boosting algorithm, to assess the separability on arousal and valence dimensions in spontaneous emotional speech. The spontaneous emotional speech data was acquired by inviting subjects to play a first-person shooter video game. Our acoustic

  10. A Transducer/Equipment System for Capturing Speech Information for Subsequent Processing by Computer Systems

    Science.gov (United States)

    1994-01-07

    ...have shown interest in a speech capture system that would operate in a noisy lobby, casino, airport or shopping mall floor for access to the Automated... control or selection. Vending machines, shopping dispenser kiosks, and entertainment virtual reality games of the future will all be voice activated... High speech recognition accuracy for commercial applications; automated drive-thru fast food ordering.

  11. Research progress on feature parameters of speech emotion recognition

    Institute of Scientific and Technical Information of China (English)

    李杰; 周萍

    2012-01-01

    Speech emotion recognition is one of the newer research topics; the extraction of feature parameters directly influences the final recognition rate and efficiency, and dimension reduction can extract the feature parameters that best distinguish different emotions. This paper points out the importance of feature parameters in speech emotion recognition, introduces the basic components of a speech emotion recognition system, and reviews the state of research on feature parameters in detail. The common dimension reduction methods used in emotion recognition are compared and analyzed, and possible future developments of speech emotion recognition are discussed.

  12. An Automated Recognition of Fake or Destroyed Indian Currency Notes in Machine Vision

    Directory of Open Access Journals (Sweden)

    Sanjana

    2012-04-01

    Full Text Available Almost every country in the world faces the problem of counterfeit currency notes, but in India the problem is acute, as the country is hit hard by this evil practice. Fake notes in India in denominations of Rs. 100, 500 and 1000 are being flooded into the system. In order to deal with this problem, an automated recognition of currency notes is introduced with the help of feature extraction, classification based on SVM, neural nets, and a heuristic approach. This technique also relies on computer vision, where all processing of the image is done by machine. The machine is fitted with a CCD camera which scans the image of the currency note, considering the dimensions of the banknote, and software processes the image segments with the help of SVM and character recognition methods. An ANN is also introduced in this paper to train the data and classify the segments using its datasets. To implement this design we use the MATLAB tool.

  13. Automated, high accuracy classification of Parkinsonian disorders: a pattern recognition approach.

    Directory of Open Access Journals (Sweden)

    Andre F Marquand

    Full Text Available Progressive supranuclear palsy (PSP), multiple system atrophy (MSA) and idiopathic Parkinson's disease (IPD) can be clinically indistinguishable, especially in the early stages, despite distinct patterns of molecular pathology. Structural neuroimaging holds promise for providing objective biomarkers for discriminating these diseases at the single subject level, but all studies to date have reported incomplete separation of disease groups. In this study, we employed multi-class pattern recognition to assess the value of anatomical patterns derived from a widely available structural neuroimaging sequence for automated classification of these disorders. To achieve this, 17 patients with PSP, 14 with IPD and 19 with MSA were scanned using structural MRI along with 19 healthy controls (HCs). An advanced probabilistic pattern recognition approach was employed to evaluate the diagnostic value of several pre-defined anatomical patterns for discriminating the disorders, including: (i) a subcortical motor network; (ii) each of its component regions; and (iii) the whole brain. All disease groups could be discriminated simultaneously with high accuracy using the subcortical motor network. The region providing the most accurate predictions overall was the midbrain/brainstem, which discriminated all disease groups from one another and from HCs. The subcortical network also produced more accurate predictions than the whole brain and all of its constituent regions. PSP was accurately predicted from the midbrain/brainstem, cerebellum and all basal ganglia compartments; MSA from the midbrain/brainstem and cerebellum; and IPD from the midbrain/brainstem only. This study demonstrates that automated analysis of structural MRI can accurately predict diagnosis in individual patients with Parkinsonian disorders, and identifies distinct patterns of regional atrophy particularly useful for this process.

  14. "Rate My Therapist": Automated Detection of Empathy in Drug and Alcohol Counseling via Speech and Language Processing.

    Science.gov (United States)

    Xiao, Bo; Imel, Zac E; Georgiou, Panayiotis G; Atkins, David C; Narayanan, Shrikanth S

    2015-01-01

    The technology for evaluating patient-provider interactions in psychotherapy, observational coding, has not changed in 70 years. It is labor-intensive, error-prone, and expensive, limiting its use in evaluating psychotherapy in the real world. Engineering solutions from speech and language processing provide new methods for the automatic evaluation of provider ratings from session recordings. The primary data are 200 Motivational Interviewing (MI) sessions from a study on MI training methods with observer ratings of counselor empathy. Automatic Speech Recognition (ASR) was used to transcribe sessions, and the resulting words were used in a text-based predictive model of empathy. Two supporting datasets trained the speech processing tasks, including ASR (1200 transcripts from heterogeneous psychotherapy sessions, and 153 transcripts and session recordings from 5 MI clinical trials). The accuracy of computationally derived empathy ratings was evaluated against human ratings for each provider. Computationally derived empathy scores and classifications (high vs. low) were highly accurate against human-based codes and classifications, with a correlation of 0.65 and an F-score (a weighted average of sensitivity and specificity) of 0.86, respectively. Empathy prediction using human transcription as input (as opposed to ASR) resulted in a slight increase in prediction accuracy, suggesting that the fully automatic system with ASR is relatively robust. Using speech and language processing methods, it is possible to generate accurate predictions of provider performance in psychotherapy from audio recordings alone. This technology can support large-scale evaluation of psychotherapy for dissemination and process studies.
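    A text-based predictive model of the kind described, though not the authors' exact formulation, can be sketched as TF-IDF features feeding a linear regressor; the transcripts and empathy scores below are invented placeholders.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline

# Toy session transcripts (invented) with observer empathy scores on a 1-7 scale.
transcripts = [
    "tell me more about how that felt for you",
    "you just need to stop drinking it is that simple",
    "it sounds like this has been really hard on your family",
    "why did you do that again after last time",
]
empathy_scores = [6.0, 2.0, 6.5, 3.0]

# Word and bigram TF-IDF features feed a ridge regressor that predicts
# the observer empathy rating for an unseen transcript.
model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), Ridge(alpha=1.0))
model.fit(transcripts, empathy_scores)
print(model.predict(["it sounds like you have been trying really hard"]))
```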

  15. Application of a Back-Propagation Neural Network to Isolated-Word Speech Recognition

    Science.gov (United States)

    1993-06-01

    ...discusses the limitations of the proposed BNN system, and offers ideas for further research... Besides the syntactic and semantic issues in linguistic theories, speech segmentation is a big concern. Boundaries between words and phonemes are... can be estimated by a sudden large variation in the speech spectrum; this method is not very reliable due to coarticulation, i.e., the changes in the

  16. Nonlinear spectro-temporal features based on a cochlear model for automatic speech recognition in a noisy situation.

    Science.gov (United States)

    Choi, Yong-Sun; Lee, Soo-Young

    2013-09-01

    A nonlinear speech feature extraction algorithm was developed by modeling human cochlear functions, and demonstrated as a noise-robust front-end for speech recognition systems. The algorithm was based on a model of the Organ of Corti in the human cochlea with such features as the basilar membrane (BM), outer hair cells (OHCs), and inner hair cells (IHCs). Frequency-dependent nonlinear compression and amplification of OHCs were modeled by lateral inhibition to enhance spectral contrasts. In particular, the compression coefficients had frequency dependency based on psychoacoustic evidence. Spectral subtraction and temporal adaptation were applied in the time-frame domain. With long-term and short-term adaptation characteristics, these factors remove stationary or slowly varying components and amplify temporal changes such as onset or offset. The proposed features were evaluated with a noisy speech database and showed better performance than baseline methods such as mel-frequency cepstral coefficients (MFCCs) and RASTA-PLP in unknown noisy conditions.
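    For reference, the MFCC baseline mentioned above can be computed along the following lines. This is a hedged sketch using librosa; the file name is a placeholder, and the cochlear-model features themselves are not reproduced here.

```python
# Sketch of the MFCC baseline features, not the proposed cochlear features.
import librosa
import numpy as np

y, sr = librosa.load("utterance.wav", sr=16000)      # placeholder file name
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)   # 13 static cepstral coefficients
delta = librosa.feature.delta(mfcc)                  # first-order temporal derivatives
features = np.vstack([mfcc, delta])                  # (26, n_frames) feature matrix
```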

  17. Differences in Speech Recognition Between Children with Attention Deficits and Typically Developed Children Disappear when Exposed to 65 dB of Auditory Noise

    Directory of Open Access Journals (Sweden)

    Göran B W Söderlund

    2016-01-01

    Full Text Available The most common neuropsychiatric condition in children is attention deficit hyperactivity disorder (ADHD), affecting approximately 6-9% of the population. ADHD is distinguished by inattention and hyperactive, impulsive behaviors as well as poor performance in various cognitive tasks, often leading to failures at school. Sensory and perceptual dysfunctions have also been noticed. Prior research has mainly focused on limitations in executive functioning, where differences are often explained by deficits in pre-frontal cortex activation. Less notice has been given to sensory perception and subcortical functioning in ADHD. Recent research has shown that children with an ADHD diagnosis have a deviant auditory brain stem response compared to healthy controls. The aim of the present study was to investigate whether the speech recognition threshold differs between attentive children and children with ADHD symptoms in two environmental sound conditions, with and without external noise. Previous research has shown that children with attention deficits can benefit from white noise exposure during cognitive tasks, and here we investigate whether a noise benefit is present during an auditory perceptual task. For this purpose we used a modified Hagerman's speech recognition test, in which children with and without attention deficits performed a binaural speech recognition task to assess the speech recognition threshold in no-noise and noise conditions (65 dB). Results showed that the inattentive group displayed a higher speech recognition threshold than typically developed children (TDC) and that the difference in speech recognition threshold disappeared when exposed to noise at supra-threshold level. From this we conclude that inattention can partly be explained by sensory perceptual limitations that can possibly be ameliorated through noise exposure.

  18. Support vector machines for automated recognition of obstructive sleep apnea syndrome from ECG recordings.

    Science.gov (United States)

    Khandoker, Ahsan H; Palaniswami, Marimuthu; Karmakar, Chandan K

    2009-01-01

    Obstructive sleep apnea syndrome (OSAS) is associated with cardiovascular morbidity as well as excessive daytime sleepiness and poor quality of life. In this study, we apply a machine learning technique [support vector machines (SVMs)] for automated recognition of OSAS types from their nocturnal ECG recordings. A total of 125 sets of nocturnal ECG recordings acquired from normal subjects (OSAS - ) and subjects with OSAS (OSAS +), each of approximately 8 h in duration, were analyzed. Features extracted from successive wavelet coefficient levels after wavelet decomposition of signals due to heart rate variability (HRV) from RR intervals and ECG-derived respiration (EDR) from R waves of QRS amplitudes were used as inputs to the SVMs to recognize OSAS +/- subjects. Using leave-one-out technique, the maximum accuracy of classification for 83 training sets was found to be 100% for SVMs using a subset of selected combination of HRV and EDR features. Independent test results on 42 subjects showed that it correctly recognized 24 out of 26 OSAS + subjects and 15 out of 16 OSAS - subjects (accuracy = 92.85%; Cohen's kappa value of 0.85). For estimating the relative severity of OSAS, the posterior probabilities of SVM outputs were calculated and compared with respective apnea/hypopnea index. These results suggest superior performance of SVMs in OSAS recognition supported by wavelet-based features of ECG. The results demonstrate considerable potential in applying SVMs in an ECG-based screening device that can aid a sleep specialist in the initial assessment of patients with suspected OSAS.
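    The classification stage described above might look roughly like the following sketch, assuming the wavelet-based HRV/EDR features have already been extracted into a subject-by-feature matrix. The random data stand in for real features and the SVM settings are assumptions, not the paper's configuration.

```python
# Sketch of SVM classification of OSAS+/- subjects with leave-one-out evaluation.
# X would hold wavelet-based HRV/EDR features (one row per subject); here it is
# placeholder data. probability=True allows posterior probabilities for severity.
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import LeaveOneOut, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(125, 20))        # placeholder features, 125 subjects
y = rng.integers(0, 2, size=125)      # placeholder labels (1 = OSAS+, 0 = OSAS-)

clf = SVC(kernel="rbf", probability=True)
acc = cross_val_score(clf, X, y, cv=LeaveOneOut()).mean()
print(f"leave-one-out accuracy: {acc:.3f}")
```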

  19. 基于Asterisk的语音识别技术研究和实现%Research and Implementation of Speech Recognition Technology Based on Asterisk

    Institute of Scientific and Technical Information of China (English)

    陈可新; 黄伟民

    2015-01-01

    This paper analyzes the problems existing in traditional IVR systems in call centers, introduces the role of speech recognition technology in the call center, and expounds the principles and procedure for developing speech recognition functionality using Asterisk dial plans and the AGI interface. Finally, it presents an implementation in which a speech recognition engine is called from an AGI program to recognize inbound users' speech.

  20. Pitch- and Formant-Based Order Adaptation of the Fractional Fourier Transform and Its Application to Speech Recognition

    Directory of Open Access Journals (Sweden)

    Yin Hui

    2009-01-01

    Full Text Available Fractional Fourier transform (FrFT) has been proposed to improve the time-frequency resolution in signal analysis and processing. However, selecting the FrFT transform order for the proper analysis of multicomponent signals like speech is still debated. In this work, we investigated several order adaptation methods. Firstly, FFT- and FrFT-based spectrograms of an artificially-generated vowel are compared to demonstrate the methods. Secondly, an acoustic feature set combining MFCC and FrFT is proposed, and the transform orders for the FrFT are adaptively set according to various methods based on pitch and formants. A tonal vowel discrimination test is designed to compare the performance of these methods using the feature set. The results show that the FrFT-MFCC yields a better discriminability of tones and also of vowels, especially by using multi-transform-order methods. Thirdly, speech recognition experiments were conducted on the clean intervocalic English consonants provided by the Consonant Challenge. Experimental results show that the proposed features with different order adaptation methods can obtain slightly higher recognition rates compared to the reference MFCC-based recognizer.

  1. Effects of Active and Passive Hearing Protection Devices on Sound Source Localization, Speech Recognition, and Tone Detection.

    Directory of Open Access Journals (Sweden)

    Andrew D Brown

    Full Text Available Hearing protection devices (HPDs) such as earplugs offer to mitigate noise exposure and reduce the incidence of hearing loss among persons frequently exposed to intense sound. However, distortions of spatial acoustic information and reduced audibility of low-intensity sounds caused by many existing HPDs can make their use untenable in high-risk (e.g., military or law enforcement) environments where auditory situational awareness is imperative. Here we assessed (1) sound source localization accuracy using a head-turning paradigm, (2) speech-in-noise recognition using a modified version of the QuickSIN test, and (3) tone detection thresholds using a two-alternative forced-choice task. Subjects were 10 young normal-hearing males. Four different HPDs were tested (two active, two passive), including two new and previously untested devices. Relative to unoccluded (control) performance, all tested HPDs significantly degraded performance across tasks, although one active HPD slightly improved high-frequency tone detection thresholds and did not degrade speech recognition. Behavioral data were examined with respect to head-related transfer functions measured using a binaural manikin with and without tested HPDs in place. Data reinforce previous reports that HPDs significantly compromise a variety of auditory perceptual facilities, particularly sound localization due to distortions of high-frequency spectral cues that are important for the avoidance of front-back confusions.

  2. Digital speech processing using Matlab

    CERN Document Server

    Gopi, E S

    2014-01-01

    Digital Speech Processing Using Matlab deals with digital speech pattern recognition, speech production model, speech feature extraction, and speech compression. The book is written in a manner that is suitable for beginners pursuing basic research in digital speech processing. Matlab illustrations are provided for most topics to enable better understanding of concepts. This book also deals with the basic pattern recognition techniques (illustrated with speech signals using Matlab) such as PCA, LDA, ICA, SVM, HMM, GMM, BPN, and KSOM.

  3. Measuring Prevalence of Other-Oriented Transactive Contributions Using an Automated Measure of Speech Style Accommodation

    Science.gov (United States)

    Gweon, Gahgene; Jain, Mahaveer; McDonough, John; Raj, Bhiksha; Rose, Carolyn P.

    2013-01-01

    This paper contributes to a theory-grounded methodological foundation for automatic collaborative learning process analysis. It does this by illustrating how insights from the social psychology and sociolinguistics of speech style provide a theoretical framework to inform the design of a computational model. The purpose of that model is to detect…

  4. Low-Complexity Variable Frame Rate Analysis for Speech Recognition and Voice Activity Detection

    DEFF Research Database (Denmark)

    Tan, Zheng-Hua; Lindberg, Børge

    2010-01-01

    Frame-based speech processing inherently assumes a stationary behavior of speech signals in a short period of time. Over a long time, the characteristics of the signals can change significantly and frames are not equally important, underscoring the need for frame selection. In this paper, we present a low-complexity and effective frame selection approach based on a posteriori signal-to-noise ratio (SNR) weighted energy distance: the use of an energy distance, instead of e.g. a standard cepstral distance, makes the approach computationally efficient and enables fine granularity search, and the use of a posteriori SNR weighting emphasizes the reliable regions in noisy speech signals. It is experimentally found that the approach is able to assign a higher frame rate to fast changing events such as consonants, a lower frame rate to steady regions like vowels, and no frames to silence.
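    A minimal sketch of such SNR-weighted frame selection is given below; the weighting function and threshold are illustrative assumptions rather than the paper's exact formulation.

```python
# Illustrative frame selection driven by an a posteriori SNR-weighted energy
# distance: keep a frame when its weighted distance to the last selected frame
# exceeds a threshold. Weighting and threshold values are assumptions.
import numpy as np

def select_frames(frame_energy, noise_energy, threshold=1.0):
    frame_energy = np.asarray(frame_energy, dtype=float)
    snr = np.maximum(frame_energy / (noise_energy + 1e-10), 1.0)   # a posteriori SNR (>= 1)
    weights = np.log(snr)                                          # emphasize reliable regions
    selected = [0]
    for t in range(1, len(frame_energy)):
        dist = weights[t] * abs(np.log(frame_energy[t] + 1e-10)
                                - np.log(frame_energy[selected[-1]] + 1e-10))
        if dist > threshold:
            selected.append(t)                                     # variable frame rate output
    return selected
```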

  5. Speech recognition with dynamic range reduction: (1) deaf and normal subjects in laboratory conditions.

    Science.gov (United States)

    Drysdale, A E; Gregory, R L

    1978-08-01

    Processing to reduce the dynamic range of speech should increase intelligibility and protect the impaired ear from overloading. There are theoretical and practical objections to using AGC devices to reduce dynamic range. These are overcome by using recently available signal processing employing high frequency carrier clipping. An increase in intelligibility of speech with this HFCC has been demonstrated, for normal subjects with simulated deafness, and for most partially hearing patients. Intelligibility is not improved for some patients; possibly due to their having learned to extract features which are lost. These patients may also benefit after training.

  6. Towards Real-Time Speech Emotion Recognition for Affective E-Learning

    Science.gov (United States)

    Bahreini, Kiavash; Nadolski, Rob; Westera, Wim

    2016-01-01

    This paper presents the voice emotion recognition part of the FILTWAM framework for real-time emotion recognition in affective e-learning settings. FILTWAM (Framework for Improving Learning Through Webcams And Microphones) intends to offer timely and appropriate online feedback based upon learner's vocal intonations and facial expressions in order…

  7. Automated Understanding of Selected Voice Tract Pathologies Based on the Speech Signal Analysis

    Science.gov (United States)

    2007-11-02

    Studies of speech articulation have been carried out for persons treated for larynx cancer (men after various ... types of operations). Depending on the stage of the tumour, various types of partial larynx surgery have been applied. In the recorded and studied ... material the following cases have been present: subtotal larynx removal (laryngectomia subtotalis), unilateral vertical laryngectomy (hemilaryngectomia

  8. Hints About Some Baseful but Indispensable Elements in Speech Recognition and Reconstruction

    Directory of Open Access Journals (Sweden)

    Mihaela Costin

    2002-07-01

    Full Text Available The cochlear implant (CI) is a device used to reconstruct the hearing capabilities of a person diagnosed with total cophosis. This impairment may occur after some accidents, chemotherapy etc., the person still having an intact hearing nerve. The cochlear implant has two parts: a programmable external part, the Digital Signal Processing (DSP) device, which processes and transforms the speech signal, and a surgically implanted part with a certain number of electrodes (depending on brand) used to stimulate the hearing nerve. The speech signal is fully processed in the external DSP device, resulting in the "coded" information on speech. This is modulated with the support of the fundamental frequency F0 and the energy impulses are inductively sent to the hearing nerve. The correct detection of this frequency is very important, determining the manner of hearing and making the difference between a "computer" voice and a natural one. The results are applicable not only in the medical domain, but also in Romanian speech synthesis.

  9. An empirical investigation of sparse distributed memory using discrete speech recognition

    Science.gov (United States)

    Danforth, Douglas G.

    1990-01-01

    Presented here is a step by step analysis of how the basic Sparse Distributed Memory (SDM) model can be modified to enhance its generalization capabilities for classification tasks. Data is taken from speech generated by a single talker. Experiments are used to investigate the theory of associative memories and the question of generalization from specific instances.

  10. Research on speech emotion recognition based on SVM%基于SVM的语音情感识别研究

    Institute of Scientific and Technical Information of China (English)

    胡洋; 吴黎慧; 高磊; 蒲南江

    2011-01-01

    With the development of computer technology, the requirements for harmonious human-machine interaction keep increasing. This requires that the computer can understand the speaker's emotion information, i.e., carry out speech emotion recognition. In this paper, a Support Vector Machine (SVM) method for speech emotion recognition is presented. Six basic emotions are studied: happiness, surprise, anger, sadness, fear and calm. First, features are extracted from the emotional sentences of a self-built emotional speech database; then the optimal feature set is selected using the Sequential Forward Selection (SFS) algorithm; finally, experiments are performed with a binary-tree support vector machine (BT-SVM), obtaining a quite satisfactory recognition rate and confirming the feasibility of the method.
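    A rough sketch of that pipeline is shown below, with scikit-learn's SequentialFeatureSelector standing in for the SFS step and a standard multi-class SVC approximating the binary-tree SVM. The feature matrix and labels are placeholders, not the paper's data.

```python
# Hedged sketch: sequential forward feature selection followed by SVM
# classification of six emotion classes on placeholder acoustic features.
import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 40))            # placeholder: 120 utterances, 40 acoustic features
y = rng.integers(0, 6, size=120)          # placeholder: 6 emotion classes

svm = SVC(kernel="rbf")
sfs = SequentialFeatureSelector(svm, n_features_to_select=10, direction="forward")
X_sel = sfs.fit_transform(X, y)           # keep the 10 most useful features
svm.fit(X_sel, y)                         # final emotion classifier on the selected subset
```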

  11. 基于Julius的机器人语音识别系统构建%Robot Speech Recognition System Based on Julius

    Institute of Scientific and Technical Information of China (English)

    付维; 刘冬; 闵华松

    2011-01-01

    With the continuous development of robot technology, speech recognition is proposed as an intelligent human-computer interaction mode for robots. After studying the basic principles of HMM-based speech recognition, an isolated-word speech recognition system is built on the laboratory robot platform using the open-source HTK and Julius toolkits. With this speech recognition system, voice commands can be extracted for robot control.

  12. A Comparison of Accuracy and Rate of Transcription by Adults with Learning Disabilities Using a Continuous Speech Recognition System and a Traditional Computer Keyboard

    Science.gov (United States)

    Millar, Diane C.; McNaughton, David B.; Light, Janice C.

    2005-01-01

    A single-subject, alternating-treatments design was implemented for three adults with learning disabilities to compare the transcription of college-level texts using a speech recognition system and a traditional keyboard. The accuracy and rate of transcribing after editing was calculated for each transcribed passage. The results provide evidence…

  13. ChoiceKey: a real-time speech recognition program for psychology experiments with a small response set.

    Science.gov (United States)

    Donkin, Christopher; Brown, Scott D; Heathcote, Andrew

    2009-02-01

    Psychological experiments often collect choice responses using buttonpresses. However, spoken responses are useful in many cases-for example, when working with special clinical populations, or when a paradigm demands vocalization, or when accurate response time measurements are desired. In these cases, spoken responses are typically collected using a voice key, which usually involves manual coding by experimenters in a tedious and error-prone manner. We describe ChoiceKey, an open-source speech recognition package for MATLAB. It can be optimized by training for small response sets and different speakers. We show ChoiceKey to be reliable with minimal training for most participants in experiments with two different responses. Problems presented by individual differences, and occasional atypical responses, are examined, and extensions to larger response sets are explored. The ChoiceKey source files and instructions may be downloaded as supplemental materials for this article from brm.psychonomic-journals.org/content/supplemental.

  14. A large-scale dataset of solar event reports from automated feature recognition modules

    Science.gov (United States)

    Schuh, Michael A.; Angryk, Rafal A.; Martens, Petrus C.

    2016-05-01

    The massive repository of images of the Sun captured by the Solar Dynamics Observatory (SDO) mission has ushered in the era of Big Data for Solar Physics. In this work, we investigate the entire public collection of events reported to the Heliophysics Event Knowledgebase (HEK) from automated solar feature recognition modules operated by the SDO Feature Finding Team (FFT). With the SDO mission recently surpassing five years of operations, and over 280,000 event reports for seven types of solar phenomena, we present the broadest and most comprehensive large-scale dataset of the SDO FFT modules to date. We also present numerous statistics on these modules, providing valuable contextual information for better understanding and validating of the individual event reports and the entire dataset as a whole. After extensive data cleaning through exploratory data analysis, we highlight several opportunities for knowledge discovery from data (KDD). Through these important prerequisite analyses presented here, the results of KDD from Solar Big Data will be overall more reliable and better understood. As the SDO mission remains operational over the coming years, these datasets will continue to grow in size and value. Future versions of this dataset will be analyzed in the general framework established in this work and maintained publicly online for easy access by the community.

  15. A large-scale dataset of solar event reports from automated feature recognition modules

    Directory of Open Access Journals (Sweden)

    Schuh Michael A.

    2016-01-01

    Full Text Available The massive repository of images of the Sun captured by the Solar Dynamics Observatory (SDO) mission has ushered in the era of Big Data for Solar Physics. In this work, we investigate the entire public collection of events reported to the Heliophysics Event Knowledgebase (HEK) from automated solar feature recognition modules operated by the SDO Feature Finding Team (FFT). With the SDO mission recently surpassing five years of operations, and over 280,000 event reports for seven types of solar phenomena, we present the broadest and most comprehensive large-scale dataset of the SDO FFT modules to date. We also present numerous statistics on these modules, providing valuable contextual information for better understanding and validating of the individual event reports and the entire dataset as a whole. After extensive data cleaning through exploratory data analysis, we highlight several opportunities for knowledge discovery from data (KDD). Through these important prerequisite analyses presented here, the results of KDD from Solar Big Data will be overall more reliable and better understood. As the SDO mission remains operational over the coming years, these datasets will continue to grow in size and value. Future versions of this dataset will be analyzed in the general framework established in this work and maintained publicly online for easy access by the community.

  16. Speech emotion recognition based on MF-DFA%基于MF-DFA的语音情感识别

    Institute of Scientific and Technical Information of China (English)

    叶吉祥; 张密霞; 龚希龄

    2011-01-01

    To overcome the inadequacy of conventional linear parameters in characterizing different types of emotions, this paper introduces multifractal theory into speech emotion recognition. By analyzing the multifractal features of speech in different emotional states, the multifractal spectrum parameters and the generalized Hurst exponent are proposed as new emotional feature parameters. Combined with traditional acoustic features, a support vector machine (SVM) is used for speech emotion recognition. The results show that, compared with methods using only the traditional linear features, the accuracy and stability of the recognition system are effectively improved by the non-linear parameters. This provides a new idea for speech emotion recognition.

  17. SOFTWARE EFFORT ESTIMATION FRAMEWORK TO IMPROVE ORGANIZATION PRODUCTIVITY USING EMOTION RECOGNITION OF SOFTWARE ENGINEERS IN SPONTANEOUS SPEECH

    Directory of Open Access Journals (Sweden)

    B.V.A.N.S.S. Prabhakar Rao

    2015-10-01

    Full Text Available Productivity is a very important part of any organisation in general and the software industry in particular. Nowadays, software effort estimation is a challenging task, and effort and productivity are inter-related. Productivity is achieved through the employees of the organization, and every organisation requires emotionally stable employees for seamless and progressive working. In other industries this may be achieved without manpower, but software project development is a labour-intensive activity: each line of code is delivered by a software engineer, while tools and techniques may be helpful only as aids or supplements. Whatever the reason, the software industry has been struggling with its success rate, facing many problems in delivering projects on time and within the estimated budget. If we want to estimate the required effort of a project, it is significant to know the emotional state of the team members. The responsibility of ensuring emotional contentment falls on the human resource department, and the department can deploy a series of systems to carry out its survey. This analysis can be done using a variety of tools; one such tool is the study of emotion recognition. The data needed for this is readily available and collectable and can be an excellent source for feedback systems. The challenge of recognition of emotion in speech is complicated primarily by noisy recording conditions, the variations in sentiment across the sample space, and the exhibition of multiple emotions in a single sentence. The ambiguity in the labels of the training set also increases the complexity of the problem addressed. Existing approaches using probabilistic models have dominated the field but present a flaw in scalability due to statistical inefficiency. The problem of sentiment prediction in spontaneous speech can thus be addressed using a hybrid system comprising a Convolutional Neural Network and

  18. Relating hearing loss and executive functions to hearing aid users’ preference for, and speech recognition with, different combinations of binaural noise reduction and microphone directionality

    Directory of Open Access Journals (Sweden)

    Tobias eNeher

    2014-12-01

    Full Text Available Knowledge of how executive functions relate to preferred hearing aid (HA) processing is sparse and seemingly inconsistent with related knowledge for speech recognition outcomes. This study thus aimed to find out if (1) performance on a measure of reading span (RS) is related to preferred binaural noise reduction (NR) strength, (2) similar relations exist for two different, nonverbal measures of executive function, (3) pure-tone average hearing loss (PTA), signal-to-noise ratio (SNR), and microphone directionality (DIR) also influence preferred NR strength, and (4) preference and speech recognition outcomes are similar. Sixty elderly HA users took part. Six HA conditions consisting of omnidirectional or cardioid microphones followed by inactive, moderate, or strong binaural NR as well as linear amplification were tested. Outcome was assessed at fixed SNRs using headphone simulations of a frontal target talker in a busy cafeteria. Analyses showed positive effects of active NR and DIR on preference, and negative and positive effects of, respectively, strong NR and DIR on speech recognition. Also, while moderate NR was the most preferred NR setting overall, preference for strong NR increased with SNR. No relation between RS and preference was found. However, larger PTA was related to weaker preference for inactive NR and stronger preference for strong NR for both microphone modes. Equivalent (but weaker) relations between worse performance on one nonverbal measure of executive function and the HA conditions without DIR were found. For speech recognition, there were relations between HA condition, PTA, and RS, but their pattern differed from that for preference. Altogether, these results indicate that, while moderate NR works well in general, a notable proportion of HA users prefer stronger NR. Furthermore, PTA and executive functions can account for some of the variability in preference for, and speech recognition with, different binaural NR and DIR settings.

  19. Emotional intelligence, not music training, predicts recognition of emotional speech prosody.

    Science.gov (United States)

    Trimmer, Christopher G; Cuddy, Lola L

    2008-12-01

    Is music training associated with greater sensitivity to emotional prosody in speech? University undergraduates (n = 100) were asked to identify the emotion conveyed in both semantically neutral utterances and melodic analogues that preserved the fundamental frequency contour and intensity pattern of the utterances. Utterances were expressed in four basic emotional tones (anger, fear, joy, sadness) and in a neutral condition. Participants also completed an extended questionnaire about music education and activities, and a battery of tests to assess emotional intelligence, musical perception and memory, and fluid intelligence. Emotional intelligence, not music training or music perception abilities, successfully predicted identification of intended emotion in speech and melodic analogues. The ability to recognize cues of emotion accurately and efficiently across domains may reflect the operation of a cross-modal processor that does not rely on gains of perceptual sensitivity such as those related to music training.

  20. Hindi Digits Recognition System on Speech Data Collected in Different Natural Noise Environments

    OpenAIRE

    2015-01-01

    This paper presents a baseline digits speech recognizer for the Hindi language. The recording environment is different for all speakers, since the data is collected in their respective homes. The different environments refer to vehicle horn noises in some road-facing rooms, internal background noises in some rooms like opening doors, silence in some rooms, etc. All these recordings are used for training acoustic m...

  1. Experimental investigation of the effects of the acoustical conditions in a simulated classroom on speech recognition and learning in children.

    Science.gov (United States)

    Valente, Daniel L; Plevinsky, Hallie M; Franco, John M; Heinrichs-Graham, Elizabeth C; Lewis, Dawna E

    2012-01-01

    The potential effects of acoustical environment on speech understanding are especially important as children enter school where students' ability to hear and understand complex verbal information is critical to learning. However, this ability is compromised because of widely varied and unfavorable classroom acoustics. The extent to which unfavorable classroom acoustics affect children's performance on longer learning tasks is largely unknown as most research has focused on testing children using words, syllables, or sentences as stimuli. In the current study, a simulated classroom environment was used to measure comprehension performance of two classroom learning activities: a discussion and lecture. Comprehension performance was measured for groups of elementary-aged students in one of four environments with varied reverberation times and background noise levels. The reverberation time was either 0.6 or 1.5 s, and the signal-to-noise level was either +10 or +7 dB. Performance is compared to adult subjects as well as to sentence-recognition in the same condition. Significant differences were seen in comprehension scores as a function of age and condition; both increasing background noise and reverberation degraded performance in comprehension tasks compared to minimal differences in measures of sentence-recognition.

  2. A Robust Front-End Processor combining Mel Frequency Cepstral Coefficient and Sub-band Spectral Centroid Histogram methods for Automatic Speech Recognition

    Directory of Open Access Journals (Sweden)

    R. Thangarajan

    2009-06-01

    Full Text Available Environmental robustness is an important area of research in speech recognition. Mismatch between trained speech models and actual speech to be recognized is due to factors like background noise. It can cause severe degradation in the accuracy of recognizers which are based on commonly used features like mel-frequency cepstral coefficients (MFCC) and linear predictive coding (LPC). It is well understood that all previous auditory-based feature extraction methods perform extremely well in terms of robustness due to the dominant-frequency information present in them. But these methods suffer from high computational cost. Another method called sub-band spectral centroid histograms (SSCH) integrates dominant-frequency information with sub-band power information. This method is based on sub-band spectral centroids (SSC), which are closely related to spectral peaks for both clean and noisy speech. Since SSC can be computed efficiently from the short-term speech power spectrum estimate, the SSCH method is quite robust to background additive noise at a lower computational cost. It has been noted that the MFCC method outperforms the SSCH method in the case of clean speech. However, in the case of speech with additive noise, the MFCC method degrades substantially. In this paper, both MFCC and SSCH feature extraction have been implemented in Carnegie Mellon University (CMU) Sphinx 4.0 and trained and tested on the AN4 database for clean and noisy speech. Finally, a robust speech recognizer which automatically employs either the MFCC or SSCH feature extraction method, based on the variance of short-term power of the input utterance, is suggested.
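    The sub-band spectral centroids at the heart of the SSCH features can be computed from a frame's power spectrum roughly as follows; the band layout and band count are illustrative assumptions, not the paper's configuration.

```python
# Illustrative computation of sub-band spectral centroids (SSC) for one frame.
import numpy as np

def subband_spectral_centroids(power_spectrum, sr, n_bands=7):
    """power_spectrum: magnitude-squared FFT bins of one analysis frame."""
    power_spectrum = np.asarray(power_spectrum, dtype=float)
    freqs = np.linspace(0, sr / 2, len(power_spectrum))
    edges = np.linspace(0, sr / 2, n_bands + 1)            # uniform bands for simplicity
    centroids = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        idx = (freqs >= lo) & (freqs < hi)
        p = power_spectrum[idx]
        centroids.append(np.sum(freqs[idx] * p) / (np.sum(p) + 1e-10))
    return np.array(centroids)                             # one centroid per sub-band
```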

  3. Intermodal timing relations and audio-visual speech recognition by normal-hearing adults.

    Science.gov (United States)

    McGrath, M; Summerfield, Q

    1985-02-01

    Audio-visual identification of sentences was measured as a function of audio delay in untrained observers with normal hearing; the soundtrack was replaced by rectangular pulses originally synchronized to the closing of the talker's vocal folds and then subjected to delay. When the soundtrack was delayed by 160 ms, identification scores were no better than when no acoustical information at all was provided. Delays of up to 80 ms had little effect on group-mean performance, but a separate analysis of a subgroup of better lipreaders showed a significant trend of reduced scores with increased delay in the range from 0-80 ms. A second experiment tested the interpretation that, although the main disruptive effect of the delay occurred on a syllabic time scale, better lipreaders might be attempting to use intermodal timing cues at a phonemic level. Normal-hearing observers determined whether a 120-Hz complex tone started before or after the opening of a pair of liplike Lissajou figures. Group-mean difference limens (70.7% correct DLs) were - 79 ms (sound leading) and + 138 ms (sound lagging), with no significant correlation between DLs and sentence lipreading scores. It was concluded that most observers, whether good lipreaders or not, possess insufficient sensitivity to intermodal timing cues in audio-visual speech for them to be used analogously to voice onset time in auditory speech perception. The results of both experiments imply that delays of up to about 40 ms introduced by signal-processing algorithms in aids to lipreading should not materially affect audio-visual speech understanding.

  4. One-against-all weighted dynamic time warping for language-independent and speaker-dependent speech recognition in adverse conditions.

    Directory of Open Access Journals (Sweden)

    Xianglilan Zhang

    Full Text Available Considering personal privacy and the difficulty of obtaining training material for many seldom used English words and (often non-English) names, language-independent (LI) with lightweight speaker-dependent (SD) automatic speech recognition (ASR) is a promising option to solve the problem. The dynamic time warping (DTW) algorithm is the state-of-the-art algorithm for small foot-print SD ASR applications with limited storage space and small vocabulary, such as voice dialing on mobile devices, menu-driven recognition, and voice control on vehicles and robotics. Even though we have successfully developed two fast and accurate DTW variations for clean speech data, speech recognition for adverse conditions is still a big challenge. In order to improve recognition accuracy in noisy environments and bad recording conditions such as too high or low volume, we introduce a novel one-against-all weighted DTW (OAWDTW). This method defines a one-against-all index (OAI) for each time frame of training data and applies the OAIs to the core DTW process. Given two speech signals, OAWDTW tunes their final alignment score by using OAI in the DTW process. Our method achieves better accuracies than DTW and merge-weighted DTW (MWDTW), as 6.97% relative reduction of error rate (RRER) compared with DTW and 15.91% RRER compared with MWDTW are observed in our extensive experiments on one representative SD dataset of four speakers' recordings. To the best of our knowledge, the OAWDTW approach is the first weighted DTW specially designed for speech data in adverse conditions.
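    For context, the core DTW recursion that OAWDTW builds on can be sketched as below; the one-against-all weighting itself is not reproduced, since its exact form is specific to the paper.

```python
# Standard DTW alignment cost between two feature sequences (e.g., MFCC frames).
# The OAI weighting proposed in the paper would modify the local cost; this
# sketch only shows the unweighted recursion it is built on.
import numpy as np

def dtw_distance(a, b):
    """a, b: arrays of shape (n_frames, n_features)."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])                 # local frame distance
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]                                                     # total alignment cost
```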

  5. Spoken-word recognition in foreign-accented speech by L2 listeners

    NARCIS (Netherlands)

    Weber, A.C.; Broersma, M.E.; Aoyagi, M.

    2011-01-01

    Two cross-modal priming studies investigated the recognition of English words spoken with a foreign accent. Auditory English primes were either typical of a Dutch accent or typical of a Japanese accent in English and were presented to both Dutch and Japanese L2 listeners. Lexical-decision times to s

  6. Automated Sound Recognition Provides Insights into the Behavioral Ecology of a Tropical Bird

    Science.gov (United States)

    Jahn, Olaf; Ganchev, Todor D.; Marques, Marinez I.; Schuchmann, Karl-L.

    2017-01-01

    Computer-assisted species recognition facilitates the analysis of relevant biological information in continuous audio recordings. In the present study, we assess the suitability of this approach for determining distinct life-cycle phases of the Southern Lapwing Vanellus chilensis lampronotus based on adult vocal activity. For this purpose we use passive 14-min and 30-min soundscape recordings (n = 33 201) collected in 24/7 mode between November 2012 and October 2013 in Brazil’s Pantanal wetlands. Time-stamped detections of V. chilensis call events (n = 62 292) were obtained with a species-specific sound recognizer. We demonstrate that the breeding season fell in a three-month period from mid-May to early August 2013, between the end of the flood cycle and the height of the dry season. Several phases of the lapwing’s life history were identified with presumed error margins of a few days: pre-breeding, territory establishment and egg-laying, incubation, hatching, parental defense of chicks, and post-breeding. Diurnal time budgets confirm high acoustic activity levels during midday hours in June and July, indicative of adults defending young. By August, activity patterns had reverted to nonbreeding mode, with peaks around dawn and dusk and low call frequency during midday heat. We assess the current technological limitations of the V. chilensis recognizer through a comprehensive performance assessment and scrutinize the usefulness of automated acoustic recognizers in studies on the distribution pattern, ecology, life history, and conservation status of sound-producing animal species. PMID:28085893

  7. Modeling and simulation of speech emotional recognition%语音情感智能识别的建模与仿真

    Institute of Scientific and Technical Information of China (English)

    黄晓峰; 彭远芳

    2012-01-01

    Speech emotion information has nonlinear, redundant and high-dimensional characteristics, and the data contain a large amount of noise. Traditional recognition models cannot eliminate the redundant information and noise, so speech emotion recognition accuracy is quite low. In order to improve the accuracy of speech emotion recognition, this paper puts forward a speech emotion recognition model based on the strong nonlinear processing ability of process neural networks and wavelet-analysis denoising. The noise of the speech signal is eliminated by wavelet analysis, the redundant information is eliminated by principal component analysis, and the speech emotional signal is recognized by the process neural network. Simulation results show that the recognition rate of the process neural network is 13% higher than K-nearest neighbor and 8.75% higher than the support vector machine; therefore the proposed model is an effective speech emotion recognition tool.

  8. Comparative Study on Feature Selection and Fusion Schemes for Emotion Recognition from Speech

    Directory of Open Access Journals (Sweden)

    Santiago Planet

    2012-09-01

    Full Text Available The automatic analysis of speech to detect affective states may improve the way users interact with electronic devices. However, analysis only at the acoustic level may not be enough to determine the emotion of a user in a realistic scenario. In this paper we analyzed the spontaneous speech recordings of the FAU Aibo Corpus at the acoustic and linguistic levels to extract two sets of features. The acoustic set was reduced by a greedy procedure selecting the most relevant features to optimize the learning stage. We compared two versions of this greedy selection algorithm by performing the search of the relevant features forwards and backwards. We experimented with three classification approaches: Naïve-Bayes, a support vector machine and a logistic model tree, and two fusion schemes: decision-level fusion, merging the hard decisions of the acoustic and linguistic classifiers by means of a decision tree; and feature-level fusion, concatenating both sets of features before the learning stage. Despite the low performance achieved by the linguistic data, a dramatic improvement was achieved after its combination with the acoustic information, improving the results achieved by this second modality on its own. The results achieved by the classifiers using the parameters merged at feature level outperformed the classification results of the decision-level fusion scheme, despite the simplicity of the scheme. Moreover, the extremely reduced set of acoustic features obtained by the greedy forward search selection algorithm improved the results provided by the full set.

  9. 人机交互中的语音情感识别研究进展%A survey of speech emotion recognition in human computer interaction

    Institute of Scientific and Technical Information of China (English)

    张石清; 李乐民; 赵知劲

    2013-01-01

    Speech emotion recognition is currently an active research topic in the fields of signal processing, pattern recognition, artificial intelligence, human-computer interaction, etc. The ultimate purpose of such research is to endow computers with emotional ability and make human-computer interaction genuinely harmonious and natural. This paper reviews recent advances on several key problems involved in speech emotion recognition, including emotional description theory, emotional speech databases, emotional acoustic analysis, and emotion recognition methods. In addition, the existing research problems and future directions are presented.

  10. Integrating Automatic Speech Recognition and Machine Translation for Better Translation Outputs

    DEFF Research Database (Denmark)

    Liyanapathirana, Jeevanthi

    Dictating translations can be faster than typing them, making the translation process faster. The spoken translation is analyzed and combined with the machine translation output of the same sentence using different methods. We study a number of different translation models in the context of n-best list rescoring methods. As an alternative to the n-best list rescoring, we also use word graphs with the expectation of arriving at a tighter integration of ASR and MT models. Integration methods include constraining ASR models using language and translation models of MT, and vice versa. We currently develop and experiment with different methods on the Danish-English language pair, with the use of a speech corpus and parallel text. The methods are investigated to check ways that the accuracy of the translator's spoken translation can be increased with the use of machine translation outputs, which would be useful for potential computer...

  11. Speech processing in mobile environments

    CERN Document Server

    Rao, K Sreenivasa

    2014-01-01

    This book focuses on speech processing in the presence of low-bit-rate coding and varying background environments. The methods presented in the book exploit the speech events which are robust in noisy environments. Accurate estimation of these crucial events will be useful for carrying out various speech tasks such as speech recognition, speaker recognition and speech rate modification in mobile environments. The authors provide insights into designing and developing robust methods to process speech in mobile environments, covering temporal and spectral enhancement methods to minimize the effect of noise, and examining methods and models for speech and speaker recognition applications in mobile environments.

  12. Speech emotion recognition based on Intrinsic Time-scale Decomposition%ITD在语音情感识别中的研究

    Institute of Scientific and Technical Information of China (English)

    叶吉祥; 刘亚

    2014-01-01

    In order to better express the speech emotional state, this paper applies the Intrinsic Time-scale Decomposition (ITD) to speech emotion feature extraction. The emotional speech is decomposed into a sum of Proper Rotation (PR) components, and instantaneous characteristic parameters and the correlation dimension of the PR components are extracted as new emotional feature parameters. These are combined with traditional features, and a Support Vector Machine (SVM) is used for speech emotion recognition experiments. The results show that recognition accuracy is improved obviously after introducing the PR feature parameters, compared with the scheme using traditional features only.

  13. Optimizing Automatic Speech Recognition for Low-Proficient Non-Native Speakers

    Directory of Open Access Journals (Sweden)

    Catia Cucchiarini

    2010-01-01

    Full Text Available Computer-Assisted Language Learning (CALL) applications for improving the oral skills of low-proficient learners have to cope with non-native speech that is particularly challenging. Since unconstrained non-native ASR is still problematic, a possible solution is to elicit constrained responses from the learners. In this paper, we describe experiments aimed at selecting utterances from lists of responses. The first experiment on utterance selection indicates that the decoding process can be improved by optimizing the language model and the acoustic models, thus reducing the utterance error rate from 29-26% to 10-8%. Since giving feedback on incorrectly recognized utterances is confusing, we verify the correctness of the utterance before providing feedback. The results of the second experiment on utterance verification indicate that combining duration-related features with a likelihood ratio (LR) yields an equal error rate (EER) of 10.3%, which is significantly better than the EER for the other measures in isolation.

  14. Logistic Kernel Function and its Application to Speech Recognition%Logistic 核函数及其在语音识别中的应用

    Institute of Scientific and Technical Information of China (English)

    刘晓峰; 张雪英; Zizhong John Wang

    2015-01-01

    Kernel function is the core of the support vector machine (SVM) and directly affects the performance of SVM. In order to improve the learning ability and generalization ability of SVM for speech recognition, a Logistic kernel function, which is proved to be a Mercer kernel function, is presented. Experimental results on bi-spiral and speech recognition problems show that the presented Logistic kernel function is effective and performs better than linear, polynomial, radial basis and exponential radial basis kernel functions, especially in the case of speech recognition.
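    The abstract does not give the exact form of the Logistic kernel, but the sketch below shows how a custom kernel of this flavor (here, a logistic function of squared Euclidean distance, purely an assumption) can be plugged into an SVM via scikit-learn's callable-kernel interface.

```python
# Hedged sketch of a custom kernel in an SVM. The kernel form below is an
# illustrative assumption, NOT the paper's Logistic kernel definition.
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import euclidean_distances

def logistic_like_kernel(X, Y, gamma=0.1):
    d2 = euclidean_distances(X, Y, squared=True)
    return 2.0 / (1.0 + np.exp(gamma * d2))     # 1 for identical points, decays toward 0

rng = np.random.default_rng(0)
X, y = rng.normal(size=(60, 12)), rng.integers(0, 2, size=60)   # placeholder data
clf = SVC(kernel=logistic_like_kernel).fit(X, y)                # callable kernel interface
```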

  15. Intelligent Home Speech Recognition System Based on NL6621%语音识别技术在智能家居中的应用

    Institute of Scientific and Technical Information of China (English)

    王爱芸

    2015-01-01

    Research on a practical intelligent home speech recognition system is very important for the development of the smart home. Through analysis of embedded speech recognition technology and smart home control technology, voice is recorded on a platform using the NL6621 board with the VS1003 audio decoding chip. A Hidden Markov Model (HMM) algorithm is used to carry out voice model training and voice matching, so as to realize a smart home voice control system. Experiments prove that the speech control system has a high recognition rate and good real-time performance.

  16. SecurityAuthentication System based on Speech Recognition%基于语音识别的安全认证系统

    Institute of Scientific and Technical Information of China (English)

    毕俊浩; 叶翰嘉; 王笑臣; 孙国梓

    2012-01-01

    Based on an analysis of smart terminal security requirements, speech recognition and sandbox protection technology are used to verify whether the user is authorized. Whether the user has obtained authorization is a key question analyzed in this paper. We design and implement a security authentication system based on speech recognition on the widely used Android system, and analyze the key technologies of the system in detail, including recognition of the user's speech and the interactive interface and interaction protocols of the sandbox protection.

  17. Model Compensation Approach Based on Nonuniform Spectral Compression Features for Noisy Speech Recognition

    Directory of Open Access Journals (Sweden)

    Ning Geng-Xin

    2007-01-01

    Full Text Available This paper presents a novel model compensation (MC) method for the features of mel-frequency cepstral coefficients (MFCCs) with signal-to-noise-ratio- (SNR-) dependent nonuniform spectral compression (SNSC). Though these new MFCCs derived from a SNSC scheme have been shown to be robust features under the matched case, they suffer from serious mismatch when the reference models are trained at different SNRs and in different environments. To solve this drawback, a compressed mismatch function is defined for the static observations with nonuniform spectral compression. The means and variances of the static features with spectral compression are derived according to this mismatch function. Experimental results show that the proposed method is able to provide recognition accuracy better than conventional MC methods when using uncompressed features, especially at very low SNR under different noises. Moreover, the new compensation method has a computational complexity slightly above that of conventional MC methods.

  18. 基于改进型SVM算法的语音情感识别%Speech emotion recognition algorithm based on modified SVM

    Institute of Scientific and Technical Information of China (English)

    李书玲; 刘蓉; 张鎏钦; 刘红

    2013-01-01

    In order to effectively improve the recognition accuracy of the speech emotion recognition system, an improved speech emotion recognition algorithm based on the Support Vector Machine (SVM) was proposed. In the proposed algorithm, the SVM parameters, the penalty factor and the kernel function parameter, were optimized with a genetic algorithm. Furthermore, an emotion recognition model was established with the SVM method. The performance of this algorithm was assessed by computer simulations, and recognition rates of 91.03% and 96.59% were achieved respectively in seven-emotion recognition experiments and common five-emotion recognition experiments on the Berlin database. When the Chinese emotional database was used, the rate increased to 97.67%. The obtained simulation results demonstrate the validity of the proposed algorithm.

  19. 基于HMM和ANN的语音情感识别研究%Research on emotion recognition of speech signal based on HMM and ANN

    Institute of Scientific and Technical Information of China (English)

    胡洋; 蒲南江; 吴黎慧; 高磊

    2011-01-01

    Speech emotion recognition is not only an important part of speech recognition but also a basic theory of harmonious human-computer interaction. Since a single classifier has limitations in speech emotion recognition, we put forward a method combining a Hidden Markov Model (HMM) and an Artificial Neural Network (ANN). For the six emotions of happiness, surprise, anger, sadness, fear and calm, we design one HMM model per emotion, and through this method obtain the best matching sequence for each emotion. Then, a posterior ANN classifier is used to classify the test samples, and the fusion of the two classifiers improves the speech emotion recognition rate. Experiments on an emotional speech database established by induced recording indicate that the recognition rate is considerably improved.
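    One plausible realization of this two-stage scheme is sketched below using hmmlearn and scikit-learn: one Gaussian HMM per emotion produces a vector of log-likelihoods for an utterance, which a small neural network then classifies. Model sizes and data shapes are assumptions, not the paper's settings.

```python
# Hedged sketch of an HMM + ANN hybrid: per-emotion HMM log-likelihoods are fed
# to a neural-network classifier. Sizes and the training data layout are assumed.
import numpy as np
from hmmlearn.hmm import GaussianHMM
from sklearn.neural_network import MLPClassifier

EMOTIONS = ["happy", "surprise", "anger", "sad", "fear", "calm"]

def train_hmms(train_seqs):
    """train_seqs: dict emotion -> list of (n_frames, n_features) feature arrays."""
    hmms = {}
    for emo in EMOTIONS:
        X = np.vstack(train_seqs[emo])
        lengths = [len(s) for s in train_seqs[emo]]
        hmms[emo] = GaussianHMM(n_components=5, covariance_type="diag").fit(X, lengths)
    return hmms

def loglik_vector(hmms, seq):
    """Score one utterance against every emotion HMM."""
    return np.array([hmms[emo].score(seq) for emo in EMOTIONS])

# Posterior ANN stage: trained on log-likelihood vectors of the training utterances.
ann = MLPClassifier(hidden_layer_sizes=(16,))  # ann.fit(loglik_matrix, labels) once data exist
```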

  20. 基于RBF神经网络的语音情感识别%Speech Emotion Recognition Based on RBF Neural Network

    Institute of Scientific and Technical Information of China (English)

    张海燕; 唐建芳

    2011-01-01

    The principle of the radial basis function (RBF) neural network and its training algorithm are introduced in this paper, and a model of speech emotion recognition based on the RBF neural network is established. In the recognition experiments, the BP neural network and the RBF neural network are compared in the same testing environment. The recognition rate of the RBF neural network is 3% higher than that of the BP neural network. The results show that the speech emotion recognition method based on the RBF neural network is effective.

  1. Robust Speech Recognition Based on Vector Taylor Series%基于矢量泰勒级数的鲁棒语音识别

    Institute of Scientific and Technical Information of China (English)

    吕勇; 吴镇扬

    2011-01-01

    The vector Taylor series (VTS) expansion is an effective approach to noise-robust speech recognition. However, in the log-spectral domain, there exist strong correlations among the different channels of the Mel filter bank, and thus it is difficult to estimate the noise variance from noisy speech. This paper proposes a feature compensation algorithm in the cepstral domain based on the vector Taylor series. In this algorithm, the distribution of speech cepstral features is represented by a Gaussian mixture model (GMM), and the mean and variance of the noise are estimated from noisy speech by the VTS approximation. The experimental results show that the proposed algorithm can significantly improve the performance of a speech recognition system, and it outperforms the VTS-based feature compensation algorithm in the log-spectral domain.

  2. Multiresolution analysis (discrete wavelet transform) through Daubechies family for emotion recognition in speech.

    Science.gov (United States)

    Campo, D.; Quintero, O. L.; Bastidas, M.

    2016-04-01

    We propose a study of the mathematical properties of voice as an audio signal. This work includes signals in which the channel conditions are not ideal for emotion recognition. Multiresolution analysis (discrete wavelet transform) was performed through the use of the Daubechies wavelet family (Db1-Haar, Db6, Db8, Db10), allowing the decomposition of the initial audio signal into sets of coefficients on which a set of features was extracted and analyzed statistically in order to differentiate emotional states. ANNs proved to be a system that allows an appropriate classification of such states. This study shows that the features extracted using wavelet decomposition are enough to analyze and extract emotional content in audio signals, presenting a high accuracy rate in classification of emotional states without the need to use other kinds of classical frequency-time features. Accordingly, this paper seeks to characterize mathematically the six basic emotions in humans: boredom, disgust, happiness, anxiety, anger and sadness, with neutrality also included, for a total of seven states to identify.
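    In the same spirit, a hedged sketch of Daubechies-based feature extraction with PyWavelets is shown below; the per-band statistics are illustrative and not the exact feature set used in the study.

```python
# Illustrative wavelet-based feature extraction: Daubechies multilevel
# decomposition followed by simple per-band statistics.
import numpy as np
import pywt

def wavelet_features(signal, wavelet="db6", level=5):
    coeffs = pywt.wavedec(signal, wavelet, level=level)      # [cA5, cD5, ..., cD1]
    feats = []
    for c in coeffs:
        feats.extend([np.mean(c), np.std(c), np.mean(np.abs(c)),
                      np.sum(c ** 2) / len(c)])               # mean energy per sample
    return np.array(feats)

# An ANN (e.g., sklearn.neural_network.MLPClassifier) could then be trained on
# these feature vectors to separate the seven emotional states.
```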

  3. Continuous speech recognition by convolutional neural networks%基于卷积神经网络的连续语音识别

    Institute of Scientific and Technical Information of China (English)

    张晴晴; 刘勇; 潘接林; 颜永红

    2015-01-01

    Convolutional neural networks (CNNs), which have been successful in achieving translation invariance for many image processing tasks, are investigated for continuous speech recognition. Compared to deep neural networks (DNNs), which have proven successful in many speech recognition tasks, CNNs can reduce the neural network model size significantly while achieving even better recognition accuracy. Experiments on the standard TIMIT corpus and on a large-vocabulary, speaker-independent conversational telephone speech corpus show that CNNs outperform DNNs in both accuracy and generalization ability.
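
    A minimal sketch of a CNN acoustic model of the kind investigated above, assuming 40-dimensional filterbank inputs with an 11-frame context window; the layer sizes and the number of output states are illustrative and not taken from the paper.

    import torch
    import torch.nn as nn

    class CNNAcousticModel(nn.Module):
        def __init__(self, n_states=1000):
            super().__init__()
            self.conv = nn.Sequential(
                nn.Conv2d(1, 64, kernel_size=(8, 3), padding=(0, 1)),   # convolve along frequency
                nn.ReLU(),
                nn.MaxPool2d(kernel_size=(3, 1)),                       # pool along frequency only
                nn.Conv2d(64, 128, kernel_size=(4, 3), padding=(0, 1)),
                nn.ReLU(),
            )
            self.fc = nn.Sequential(
                nn.Flatten(),
                nn.Linear(128 * 8 * 11, 1024), nn.ReLU(),
                nn.Linear(1024, n_states),                              # HMM-state (senone) posteriors
            )

        def forward(self, x):            # x: (batch, 1, 40 mel bins, 11 frames)
            return self.fc(self.conv(x))

    # logits = CNNAcousticModel()(torch.randn(4, 1, 40, 11))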

  4. 基于 PAD 情绪模型的情感语音识别%Emotional Speech Recognition Based on PAD Emotion Model

    Institute of Scientific and Technical Information of China (English)

    宋静; 张雪英; 孙颖; 张卫

    2016-01-01

    Five kinds of features are described and applied to emotional speech recognition: Mel-frequency cepstral coefficients (MFCC), linear predictor coefficients (LPC), prosodic features, formant frequencies, and zero crossings with peak amplitudes (ZCPA). According to the recognition results, weight coefficients for the features are obtained by correlation analysis along the three dimensions of the PAD emotion model, and the recognition results are fused and mapped into the three-dimensional PAD emotional space, yielding the PAD values of the emotional speech. These PAD values allow emotional speech to be described and analyzed in terms of continuous emotion theory, and quantitative analysis reveals the positions of, and relations among, the emotion categories in the emotional space.

  5. Is talking to an automated teller machine natural and fun?

    Science.gov (United States)

    Chan, F Y; Khalid, H M

    Usability and affective issues of using automatic speech recognition technology to interact with an automated teller machine (ATM) are investigated in two experiments. The first uncovered dialogue patterns of ATM users for the purpose of designing the user interface for a simulated speech ATM system. Applying the Wizard-of-Oz methodology, multiple mapping and word spotting techniques, the speech driven ATM accommodates bilingual users of Bahasa Melayu and English. The second experiment evaluates the usability of a hybrid speech ATM, comparing it with a simulated manual ATM. The aim is to investigate how natural and fun can talking to a speech ATM be for these first-time users. Subjects performed the withdrawal and balance enquiry tasks. The ANOVA was performed on the usability and affective data. The results showed significant differences between systems in the ability to complete the tasks as well as in transaction errors. Performance was measured on the time taken by subjects to complete the task and the number of speech recognition errors that occurred. On the basis of user emotions, it can be said that the hybrid speech system enabled pleasurable interaction. Despite the limitations of speech recognition technology, users are set to talk to the ATM when it becomes available for public use.

  6. A single word speech recognition system with GUI-dependent vocabulary selection for MMI applications; Ein Einzelworterkennungssystem mit GUI-basierter Wortschatzumschaltung fuer industrielle Mensch-Maschine-Schnittstellen

    Energy Technology Data Exchange (ETDEWEB)

    Hengen, H. [Ingenieurbuero Hengen GbR, Kandel (Germany); Izak, M.; Liu, S. [Technische Univ. Kaiserslautern (Germany). Lehrstuhl fuer Regelungssysteme

    2007-07-01

    The following article deals with a newly developed method for single word speech recognition for application in MMI. The main idea is to reduce the active vocabulary of the recognition system to the vocabulary required by the active input dialog element (e.g. GUI window, dropdown field, etc.). By keeping the vocabulary intentionally small, the recognition rate becomes very high, so that other input devices (keyboard, mouse, pen) can be replaced or kept free for other tasks. The described methodology is especially engineered to suit handheld applications as well as industrial or medical MMI applications in which the user has to use his hands for purposes other than the interface. (orig.)

  7. Towards Automation 2.0: A Neurocognitive Model for Environment Recognition, Decision-Making, and Action Execution

    Directory of Open Access Journals (Sweden)

    Zucker Gerhard

    2011-01-01

    The ongoing penetration of building automation by information technology is by far not saturated. Today's systems need not only be reliable and fault tolerant, they also have to regard energy efficiency and flexibility in the overall consumption. Meeting the quality and comfort goals in building automation while at the same time optimizing towards energy, carbon footprint and cost-efficiency requires systems that are able to handle large amounts of information and negotiate system behaviour that resolves conflicting demands, a decision-making process. In recent years, research has started to focus on bionic principles for designing new concepts in this area. The information processing principles of the human mind have turned out to be of particular interest, as the mind is capable of processing huge amounts of sensory data and taking adequate decisions for (re)actions based on these analysed data. In this paper, we discuss how a bionic approach can solve the upcoming problems of energy optimal systems. A recently developed model for environment recognition and decision-making processes, based on research findings from different disciplines of brain research, is introduced. This model is the foundation for applications in intelligent building automation that have to deal with information from home and office environments. All of these applications have in common that they consist of a combination of communicating nodes and have many, partly contradicting goals.

  8. 车载语音识别系统可靠性设计的关键技术研究%Research on Key Technology of Reliability Design of Vehicular Speech Recognition

    Institute of Scientific and Technical Information of China (English)

    张方伟; 丁武俊; 陈文强; 潘之杰; 赵福全

    2012-01-01

    At present, Bluetooth hands-free and speech control systems based on speech recognition technology are widely used in more and more vehicle types, but the reliability of the speech recognition is often poor. From the perspective of system development, and taking the complex in-vehicle environment as a basis, this paper discusses how to improve the reliability of a speech recognition system in terms of recognition logic, keyword definition, recognition technology, and harness and microphone placement, so as to maximize the value of the system and offer more user-friendly service. The results indicate that the reliability of speech recognition is greatly improved by these methods.

  9. 基于Fisher准则与SVM的分层语音情感识别%Multi-Level Speech Emotion Recognition Based on Fisher Criterion and SVM

    Institute of Scientific and Technical Information of China (English)

    陈立江; 毛峡; Mitsuru ISHIZUKA

    2012-01-01

    To address speaker-independent emotion recognition, a multi-level speech emotion recognition system is proposed that classifies six speech emotions, including sadness, anger, surprise, fear, happiness and disgust, from coarse to fine. The key point is that the emotions separated at each layer are closely related to the emotional features of speech. For each level, appropriate features are selected from 288 candidates by the Fisher ratio, which is also used as an input parameter for training the support vector machine (SVM). Based on the Beihang emotional speech database and the Berlin emotional speech database, four comparative experiments are designed: Fisher+SVM, PCA+SVM, Fisher+ANN and PCA+ANN, where principal component analysis (PCA) is used for dimension reduction and an artificial neural network (ANN) for classification. The experimental results show that the Fisher criterion is better than PCA for dimension reduction, and that SVM generalizes better than ANN for speaker-independent speech emotion recognition. Similar results on the two databases suggest a degree of cross-cultural adaptability.
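
    A minimal sketch of Fisher-ratio feature selection followed by an SVM, as used at each level above; the two-class case is shown, scikit-learn supplies the SVM, and the number of retained features is an illustrative assumption.

    import numpy as np
    from sklearn.svm import SVC

    def fisher_ratio(X, y):
        """Per-feature Fisher ratio for a binary labeling y in {0, 1}."""
        X0, X1 = X[y == 0], X[y == 1]
        num = (X0.mean(axis=0) - X1.mean(axis=0)) ** 2
        den = X0.var(axis=0) + X1.var(axis=0) + 1e-12
        return num / den

    def train_fisher_svm(X, y, n_features=50):
        ratios = fisher_ratio(X, y)
        selected = np.argsort(ratios)[::-1][:n_features]    # keep the most discriminative features
        clf = SVC(kernel="rbf").fit(X[:, selected], y)
        return clf, selected

    # clf, idx = train_fisher_svm(features, labels); predictions = clf.predict(test_features[:, idx])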

  10. Time-Frequency Feature Representation Using Multi-Resolution Texture Analysis and Acoustic Activity Detector for Real-Life Speech Emotion Recognition

    Directory of Open Access Journals (Sweden)

    Kun-Ching Wang

    2015-01-01

    The classification of emotional speech is mostly considered in speech-related research on human-computer interaction (HCI). The purpose of this paper is to present a novel feature extraction based on multi-resolution texture image information (MRTII). The MRTII feature set is derived from multi-resolution texture analysis for the characterization and classification of different emotions in a speech signal. The motivation is that emotions have different intensity values in different frequency bands; in terms of human visual perception, the texture of the multi-resolution spectrogram of emotional speech should therefore be a good feature set for emotion classification, and multi-resolution texture analysis gives a clearer discrimination between emotions than uniform-resolution analysis. To provide high accuracy of emotional discrimination, especially in real life, an acoustic activity detection (AAD) algorithm is applied in the MRTII-based feature extraction. Considering the presence of many blended emotions in real life, this paper makes use of two corpora of naturally occurring dialogs recorded in real-life call centers. Compared with traditional Mel-scale frequency cepstral coefficients (MFCC) and state-of-the-art features, the MRTII features improve the correct classification rates of the proposed systems across different language databases. Experimental results show that the proposed MRTII-based features, inspired by human visual perception of the spectrogram image, provide significant classification gains for real-life emotion recognition in speech.

  11. Localization and recognition of traffic signs for automated vehicle control systems

    Science.gov (United States)

    Zadeh, Mahmoud M.; Kasvand, T.; Suen, Ching Y.

    1998-01-01

    We present a computer vision system for the detection and recognition of traffic signs. Such systems are required to assist drivers and for the guidance and control of autonomous vehicles on roads and city streets. For experiments we use sequences of digitized photographs and off-line analysis. The system contains four stages. First, region segmentation based on color pixel classification, called SRSM, limits the search to regions of interest in the scene. Second, edge tracing finds parts of the outer edges of signs which are circular or straight, corresponding to the geometrical shapes of traffic signs. The third step is geometrical analysis of the outer edge and preliminary recognition of each candidate region, which may be a potential traffic sign. The final recognition step uses color combinations within each region and model matching. This system may be used for recognition of other types of objects, provided that the geometrical shape and color content remain reasonably constant. The method is reliable, easy to implement, and fast. This differs from the road sign recognition method in PROMETHEUS. The overall structure of the approach is sketched.

  12. 基于BP神经网络的语音情感识别研究%Speech Emotion Recognition Based on BP Neural Network

    Institute of Scientific and Technical Information of China (English)

    徐照松; 元建

    2014-01-01

    With the rapid development of technology, human-computer interaction is receiving more and more attention, and speech emotion recognition is a hot research topic in academia. In this article, the BP neural network algorithm is applied to speech emotion recognition and experiments are conducted on a Chinese emotional data set. The recognition accuracy reaches 91.5%, an improvement of 5% over the SVM classifier.

  13. Study of the Ability of the Articulation Index (AI) for Predicting the Unaided and Aided Speech Recognition Performance of 25 to 65 Years Old Hearing-Impaired Adults

    Directory of Open Access Journals (Sweden)

    Ghasem Mohammad Khani

    2001-05-01

    Background: In recent years there has been increased interest in the use of the AI for assessing hearing handicap and for measuring the potential effectiveness of amplification systems. The AI is an expression of the proportion of the average speech signal that is audible to a given patient, and it can vary between 0.0 and 1.0. Method and Materials: This cross-sectional analytical study was carried out in the department of audiology, rehabilitation faculty, IUMS, from 31 Oct 1998 to 7 March 1999, on 40 normal-hearing persons (80 ears; 19 males and 21 females) and 40 hearing-impaired persons (61 ears; 36 males and 25 females), 25-65 years old, with moderate to moderately severe SNHL. The Pavlovic procedure (1988) for calculating the AI, open-set taped standard monosyllabic word lists, and a real-ear probe-tube microphone system to measure insertion gain were used, with test-retest. Results: 1) a significant correlation was shown between the AI scores and the speech recognition scores of the normal-hearing and hearing-impaired groups with and without the hearing aid (P<0.05); 2) there were no significant differences by age group or sex; 3) there were no significant differences in test-retest measures of the insertion gain in each test; and 4) there were no significant differences in test-retest speech recognition scores. Conclusion: According to these results the AI can predict the unaided and aided monosyllabic recognition test scores very well, and the age and sex variables have no effect on its ability. Therefore, given the high reliability of the AI results and its simplicity, ease of use, cost effectiveness and the little time needed for calculation, wide use of the AI is recommended, especially in clinical situations.

  14. Automated Facial Expression Recognition Using Gradient-Based Ternary Texture Patterns

    Directory of Open Access Journals (Sweden)

    Faisal Ahmed

    2013-01-01

    Recognition of human expression from facial images is an interesting research area which has received increasing attention in recent years. A robust and effective facial feature descriptor is the key to designing a successful expression recognition system. Although much progress has been made, deriving a face feature descriptor that performs consistently under changing environments is still a difficult and challenging task. In this paper, we present the gradient local ternary pattern (GLTP), a discriminative local texture feature for representing facial expression. The proposed GLTP operator encodes the local texture of an image by computing the gradient magnitudes of the local neighborhood and quantizing those values in three discrimination levels. The location and occurrence information of the resulting micropatterns is then used as the face feature descriptor. The performance of the proposed method has been evaluated for the person-independent facial expression recognition task. Experiments with prototypic expression images from the Cohn-Kanade (CK) facial expression database validate that the GLTP feature descriptor can effectively encode the facial texture and thus achieves better recognition performance than some well-known appearance-based facial features.

  15. 基于粗神经网络的语音情感识别%Speech Emotion Recognition Based on Rough Set and ANN

    Institute of Scientific and Technical Information of China (English)

    曾光菊

    2011-01-01

    Speech emotion recognition extracts effective acoustic features from speech signals and recognizes the speaker's emotional state using intelligent computation. Related research on emotional speech databases, feature extraction and recognition methods is reviewed, from which feature extraction is found to have a large effect on the recognition rate. 1050 sentences were recorded and 30 features extracted from every sentence, forming a 1050×30 database. The information consistency of rough set theory is applied to reduce the 30 features of the database to 12. An artificial neural network (a BP network) is then used to recognize the emotional state of 525 sentences, attaining a highest recognition rate of 84%. The results show that recognizing different emotions with different methods gives better results.

  16. Assessing the Performance of Automatic Speech Recognition Systems When Used by Native and Non-Native Speakers of Three Major Languages in Dictation Workflows

    DEFF Research Database (Denmark)

    Zapata, Julián; Kirkedal, Andreas Søeborg

    2015-01-01

    In this paper, we report on a two-part experiment aiming to assess and compare the performance of two types of automatic speech recognition (ASR) systems on two different computational platforms when used to augment dictation workflows. The experiment was performed with a sample of speakers...... of three major languages and with different linguistic profiles: non-native English speakers; non-native French speakers; and native Spanish speakers. The main objective of this experiment is to examine ASR performance in translation dictation (TD) and medical dictation (MD) workflows without manual...

  17. Automated recognition of cell phenotypes in histology images based on membrane- and nuclei-targeting biomarkers

    Directory of Open Access Journals (Sweden)

    Tözeren Aydın

    2007-09-01

    Background: Three-dimensional in vitro cultures of cancer cells are used to predict the effects of prospective anti-cancer drugs in vivo. In this study, we present an automated image analysis protocol for detailed morphological protein marker profiling of tumoroid cross section images. Methods: Histologic cross sections of breast tumoroids developed in co-culture suspensions of breast cancer cell lines, stained for E-cadherin and progesterone receptor, were digitized and pixels in these images were classified into five categories using k-means clustering. Automated segmentation was used to identify image regions composed of cells expressing a given biomarker. Synthesized images were created to check the accuracy of the image processing system. Results: Accuracy of automated segmentation was over 95% in identifying regions of interest in synthesized images. Image analysis of adjacent histology slides stained, respectively, for Ecad and PR accurately predicted regions of different cell phenotypes. Image analysis of tumoroid cross sections from different tumoroids obtained under the same co-culture conditions indicated the variation of cellular composition from one tumoroid to another. Variations in the compositions of cross sections obtained from the same tumoroid were established by parallel analysis of Ecad- and PR-stained cross section images. Conclusion: The proposed image analysis methods offer standardized high throughput profiling of the molecular anatomy of tumoroids based on both membrane and nuclei markers, suitable for rapid large scale investigations of anti-cancer compounds for drug development.

  18. Forensic speaker recognition

    NARCIS (Netherlands)

    Meuwly, Didier

    2009-01-01

    The aim of forensic speaker recognition is to establish links between individuals and criminal activities, through audio speech recordings. This field is multidisciplinary, combining predominantly phonetics, linguistics, speech signal processing, and forensic statistics. On these bases, expert-based

  19. Spectrogram feature extraction algorithm for speech emotion recognition%面向语音情感识别的语谱图特征提取算法

    Institute of Scientific and Technical Information of China (English)

    陶华伟; 查诚; 梁瑞宇; 张昕然; 赵力; 王青云

    2015-01-01

    In order to study the role of signal correlation in emotional speech recognition, a spectrogram feature extraction algorithm for speech emotion recognition is proposed. First, the speech signal is converted into a normalized gray-level spectrogram image after preprocessing. Then, Gabor spectrogram images with different scales and different orientations are calculated, and texture features are extracted from them with the local binary pattern (LBP). Finally, the LBP features of the Gabor spectrogram images at the different scales and orientations are concatenated to form a new feature for emotion recognition. Experimental results on EMO-DB and FAU AiBo show that the recognition rate of the proposed features is at least 3% higher than that of conventional prosodic, frequency-domain and voice-quality features, and that, after fusion with acoustic features, the recognition rate is at least 5% higher than that of the earlier acoustic features alone. The proposed features can therefore effectively identify different kinds of emotional speech.
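
    A minimal sketch of the spectrogram texture feature described above: a normalized log-spectrogram is filtered with Gabor kernels at several scales and orientations, and a uniform-LBP histogram is extracted from each response and concatenated. SciPy and scikit-image supply the building blocks; the frequencies, orientations and window sizes are illustrative.

    import numpy as np
    from scipy.signal import spectrogram
    from skimage.filters import gabor
    from skimage.feature import local_binary_pattern

    def gabor_lbp_features(audio, fs=16000,
                           freqs=(0.1, 0.2, 0.4),
                           thetas=(0, np.pi / 4, np.pi / 2, 3 * np.pi / 4)):
        _, _, spec = spectrogram(audio, fs=fs, nperseg=400, noverlap=240)
        img = np.log(spec + 1e-10)
        img = (img - img.min()) / (img.max() - img.min() + 1e-12)    # normalized gray-level image
        feats = []
        for f in freqs:
            for t in thetas:
                real, _ = gabor(img, frequency=f, theta=t)            # one scale/orientation response
                lbp = local_binary_pattern(real, P=8, R=1, method="uniform")
                hist, _ = np.histogram(lbp, bins=10, range=(0, 10), density=True)
                feats.append(hist)
        return np.concatenate(feats)                                   # concatenated texture descriptor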

  20. 基于ABC优化MVDR的语音情感识别研究%Speech emotion recognition based on ABC optimization MVDR

    Institute of Scientific and Technical Information of China (English)

    孙志锋

    2016-01-01

    Extracting and selecting speech emotion features is a crucial problem in speech emotion recognition. To address the shortcomings of linear prediction (LP) in modelling the spectral envelope of emotional speech, this paper proposes extracting speech emotion features with the minimum variance distortionless response (MVDR) spectrum method. To eliminate redundant information, the artificial bee colony (ABC) algorithm is used to obtain an optimal feature subset. Four speech emotions, namely anger, neutral, happiness and fear, are then recognized on the CASIA Chinese emotion corpus with a radial basis function (RBF) neural network. The results show that the proposed approach achieves a higher recognition rate and better robustness than linear prediction.

  1. Automated species recognition of antbirds in a Mexican rainforest using hidden Markov models.

    Science.gov (United States)

    Trifa, Vlad M; Kirschel, Alexander N G; Taylor, Charles E; Vallejo, Edgar E

    2008-04-01

    Behavioral and ecological studies would benefit from the ability to automatically identify species from acoustic recordings. The work presented in this article explores the ability of hidden Markov models to distinguish songs from five species of antbirds that share the same territory in a rainforest environment in Mexico. When only clean recordings were used, species recognition was nearly perfect, 99.5%. With noisy recordings, performance was lower but generally exceeding 90%. Besides the quality of the recordings, performance has been found to be heavily influenced by a multitude of factors, such as the size of the training set, the feature extraction method used, and number of states in the Markov model. In general, training with noisier data also improved recognition in test recordings, because of an increased ability to generalize. Considerations for improving performance, including beamforming with sensor arrays and design of preprocessing methods particularly suited for bird songs, are discussed. Combining sensor network technology with effective event detection and species identification algorithms will enable observation of species interactions at a spatial and temporal resolution that is simply impossible with current tools. Analysis of animal behavior through real-time tracking of individuals and recording of large amounts of data with embedded devices in remote locations is thus a realistic goal.

  2. Automated recognition and tracking of aerosol threat plumes with an IR camera pod

    Science.gov (United States)

    Fauth, Ryan; Powell, Christopher; Gruber, Thomas; Clapp, Dan

    2012-06-01

    Protection of fixed sites from chemical, biological, or radiological aerosol plume attacks depends on early warning so that there is time to take mitigating actions. Early warning requires continuous, autonomous, and rapid coverage of large surrounding areas; however, this must be done at an affordable cost. Once a potential threat plume is detected though, a different type of sensor (e.g., a more expensive, slower sensor) may be cued for identification purposes, but the problem is to quickly identify all of the potential threats around the fixed site of interest. To address this problem of low cost, persistent, wide area surveillance, an IR camera pod and multi-image stitching and processing algorithms have been developed for automatic recognition and tracking of aerosol plumes. A rugged, modular, static pod design, which accommodates as many as four micro-bolometer IR cameras for 45deg to 180deg of azimuth coverage, is presented. Various OpenCV1 based image-processing algorithms, including stitching of multiple adjacent FOVs, recognition of aerosol plume objects, and the tracking of aerosol plumes, are presented using process block diagrams and sample field test results, including chemical and biological simulant plumes. Methods for dealing with the background removal, brightness equalization between images, and focus quality for optimal plume tracking are also discussed.

  3. Reliability of an Automated High-Resolution Manometry Analysis Program across Expert Users, Novice Users, and Speech-Language Pathologists

    Science.gov (United States)

    Jones, Corinne A.; Hoffman, Matthew R.; Geng, Zhixian; Abdelhalim, Suzan M.; Jiang, Jack J.; McCulloch, Timothy M.

    2014-01-01

    Purpose: The purpose of this study was to investigate inter- and intrarater reliability among expert users, novice users, and speech-language pathologists with a semiautomated high-resolution manometry analysis program. We hypothesized that all users would have high intrarater reliability and high interrater reliability. Method: Three expert…

  4. Technique for Automated Recognition of Sunspots on Full-Disk Solar Images

    Directory of Open Access Journals (Sweden)

    Zharkov S

    2005-01-01

    A new robust technique is presented for automated identification of sunspots on full-disk white-light (WL) solar images obtained from the SOHO/MDI instrument and Ca II K1 line images from the Meudon Observatory. Edge-detection methods are applied to find sunspot candidates, followed by local thresholding using statistical properties of the region around sunspots. Possible initial oversegmentation of images is remedied with a median filter. The features are smoothed by morphological closing operations and filled by applying a watershed transform, followed by a dilation operator to define regions of interest containing sunspots. A number of physical and geometrical parameters of detected sunspot features are extracted and stored in a relational database along with umbra-penumbra information in the form of pixel run-length data within a bounding rectangle. The detection results reveal very good agreement with the manual synoptic maps and a very high correlation with those produced manually by the NOAA Observatory, USA.

  5. Speech emotion recognition based on phase space reconstruction%相空间重构在语音情感识别中的研究

    Institute of Scientific and Technical Information of China (English)

    叶吉祥; 陈鑫

    2014-01-01

    To characterize speech emotion states more completely and compensate for the limitations of linear features in describing different emotion types, phase space reconstruction theory is introduced into speech emotion recognition. By analyzing the chaotic characteristics of speech under different emotional states, the Kolmogorov entropy and the correlation dimension are extracted as new emotional feature parameters and, combined with traditional acoustic features, a support vector machine (SVM) is used for speech emotion recognition. Experimental results show that introducing the chaotic parameters improves accuracy compared with schemes using only traditional linear features, providing a new research direction for speech emotion recognition.
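
    A minimal sketch of the phase space reconstruction underlying the chaotic features mentioned above (the correlation dimension is estimated from the scaling of the correlation sum over delay vectors); the delay and embedding dimension are illustrative and would normally be estimated, e.g. by mutual information and false-nearest-neighbour analysis.

    import numpy as np

    def delay_embed(x, dim=5, tau=8):
        """Return the (N, dim) matrix of delay vectors [x[i], x[i+tau], ..., x[i+(dim-1)*tau]]."""
        n = len(x) - (dim - 1) * tau
        return np.stack([x[i * tau:i * tau + n] for i in range(dim)], axis=1)

    def correlation_sum(emb, r):
        """Fraction of distinct point pairs closer than r; its log-log slope in r estimates the correlation dimension."""
        d = np.linalg.norm(emb[:, None, :] - emb[None, :, :], axis=-1)   # O(N^2): subsample long frames first
        mask = ~np.eye(len(emb), dtype=bool)
        return np.mean(d[mask] < r)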

  6. 基于嵌入式Linux语音识别系统的设计%Design of Speech Recognition System Based on Embedded Linux

    Institute of Scientific and Technical Information of China (English)

    钟豪; 张常年; 徐成波

    2014-01-01

    This paper presents the hardware and software design of a speech recognition system based on Samsung's S3C2440 combined with ICRoute's high-performance speech recognition chip LD3320. Under an embedded Linux operating system, a multi-process mechanism is used to control the speech recognition chip, ultrasonic ranging and a pan-tilt unit, applying speech recognition technology to a multi-angle ultrasonic ranging system. Testing shows that the system can control the measurement direction by recognizing voice commands, without manual intervention, and finally plays the measurement results back as speech.

  7. Automated pattern recognition to support geological mapping and exploration target generation - A case study from southern Namibia

    Science.gov (United States)

    Eberle, Detlef; Hutchins, David; Das, Sonali; Majumdar, Anandamayee; Paasche, Hendrik

    2015-06-01

    This paper demonstrates a methodology for the automatic joint interpretation of high resolution airborne geophysical and space-borne remote sensing data to support geological mapping in a largely automated, fast and objective manner. At the request of the Geological Survey of Namibia (GSN), part of the Gordonia Subprovince of the Namaqua Metamorphic Belt situated in southern Namibia was selected for this study. All data - covering an area of 120 km by 100 km in size - were gridded, with a spacing of adjacent data points of only 200 m. The data points were coincident for all data sets. Published criteria were used to characterize the airborne magnetic data and to establish a set of attributes suitable for the recognition of linear features and their pattern within the study area. This multi-attribute analysis of the airborne magnetic data provided the magnetic lineament pattern of the study area. To obtain a (pseudo-) lithology map of the area, the high resolution airborne gamma-ray data were integrated with selected Landsat band data using unsupervised fuzzy partitioning clustering. The outcome of this unsupervised clustering is a classified (zonal) map which in terms of the power of spatial resolution is superior to any regional geological mapping. The classified zones are then assigned geological/geophysical parameters and attributes known from the study area, e.g. lithology, physical rock properties, age, chemical composition, geophysical field characteristics, etc. This information is obtained from the examination of archived geological reports, borehole logs, any kind of existing geological/geophysical data and maps as well as ground truth controls where deemed necessary. To obtain a confidence measure validating the unsupervised fuzzy clustering results and receive a quality criterion of the classified zones, stepwise linear discriminant analysis was chosen. Only a small percentage (8%) of the samples was misclassified by discriminant analysis when compared

  8. 基于语义细胞的语音情感识别%Speech emotion recognition based on information cell

    Institute of Scientific and Technical Information of China (English)

    孙凌云; 何博伟; 刘征; 杨智渊

    2015-01-01

    The information cell model is applied to speech emotion recognition to address the high space complexity of speech emotion classifiers. A single-layered information cell (IC-S) algorithm and a speaker-emotion-based dual-layered information cell (IC-D) algorithm are proposed on the basis of the information cell mixture model. Cross-validation tests on the CASIA (Chinese) and SAVEE (English) corpora were conducted, using the F-score as the indicator of recognition performance. The results show that the IC-S algorithm has advantages in both time and space complexity compared with common algorithms such as SVM, while the IC-D algorithm achieves recognition performance similar to SVM. The IC-D algorithm can reduce the space complexity significantly and is suitable for scenarios with few or fixed speakers.

  9. Modeling and simulation of speech emotional recognition based on process neural net-work%基于过程神经元的语音情感识别的建模与仿真

    Institute of Scientific and Technical Information of China (English)

    叶吉祥; 陈晋芳

    2014-01-01

    To address the low recognition accuracy caused by the shortcomings of traditional speech emotion recognition models, process neural networks are introduced into speech emotion recognition. Fundamental frequency, amplitude and voice quality parameters are extracted as speech emotion features, wavelet analysis is used for denoising, and principal component analysis (PCA) removes redundancy; a process neural network then recognizes the four emotions of anger, happiness, sadness and surprise. The experimental results show that, compared with traditional recognition models, the process neural network achieves better recognition performance on the four emotions.

  10. Automated Recognition of Railroad Infrastructure in Rural Areas from LIDAR Data

    Directory of Open Access Journals (Sweden)

    Mostafa Arastounia

    2015-11-01

    This study is aimed at developing automated methods to recognize railroad infrastructure from 3D LIDAR data. Railroad infrastructure includes rail tracks, contact cables, catenary cables, return current cables, masts, and cantilevers. The LIDAR dataset used in this study is acquired by placing an Optech Lynx mobile mapping system on a railcar, operating at 125 km/h. The acquired dataset covers 550 meters of Austrian rural railroad corridor comprising 31 railroad key elements and containing only spatial information. The proposed methodology recognizes key components of the railroad corridor based on their physical shape, geometrical properties, and the topological relationships among them. The developed algorithms managed to recognize all key components of the railroad infrastructure, including two rail tracks, thirteen masts, thirteen cantilevers, one contact cable, one catenary cable, and one return current cable. The results are presented and discussed both at object level and at point cloud level. The results indicate that 100% accuracy and 100% precision at the object level and an average of 96.4% accuracy and an average of 97.1% precision at point cloud level are achieved.

  11. Automated recognition of obstructive sleep apnea syndrome using support vector machine classifier.

    Science.gov (United States)

    Al-Angari, Haitham M; Sahakian, Alan V

    2012-05-01

    Obstructive sleep apnea (OSA) is a common sleep disorder that causes pauses of breathing due to repetitive obstruction of the upper airways of the respiratory system. The effect of this phenomenon can be observed in other physiological signals like the heart rate variability, oxygen saturation, and the respiratory effort signals. In this study, features from these signals were extracted from 50 control and 50 OSA patients from the Sleep Heart Health Study database and implemented for minute and subject classifications. A support vector machine (SVM) classifier was used with linear and second-order polynomial kernels. For the minute classification, the respiratory features had the highest sensitivity while the oxygen saturation gave the highest specificity. The polynomial kernel always had better performance and the highest accuracy of 82.4% (Sen: 69.9%, Spec: 91.4%) was achieved using the combined-feature classifier. For subject classification, the polynomial kernel had a clear improvement in the oxygen saturation accuracy as the highest accuracy of 95% was achieved by both the oxygen saturation (Sen: 100%, Spec: 90.2%) and the combined-feature (Sen: 91.8%, Spec: 98.0%). Further analysis of the SVM with other kernel types might be useful for optimizing the classifier with the appropriate features for an OSA automated detection algorithm.

  12. Self-Assessed Hearing Handicap in Older Adults with Poorer-than-Predicted Speech Recognition in Noise

    Science.gov (United States)

    Eckert, Mark A.; Matthews, Lois J.; Dubno, Judy R.

    2017-01-01

    Purpose: Even older adults with relatively mild hearing loss report hearing handicap, suggesting that hearing handicap is not completely explained by reduced speech audibility. Method: We examined the extent to which self-assessed ratings of hearing handicap using the Hearing Handicap Inventory for the Elderly (HHIE; Ventry & Weinstein, 1982)…

  13. Application of Hilbert marginal energy spectrum in speech emotion recognition%Hilbert边际能量谱在语音情感识别中的应用

    Institute of Scientific and Technical Information of China (English)

    叶吉祥; 胡海翔

    2014-01-01

    Emotional feature extraction plays an important role in speech emotion recognition. Because of the limitations of traditional signal processing methods, traditional acoustic features, especially frequency-domain features, cannot precisely reflect the emotional characteristics of speech, which leads to a low emotion recognition rate. This paper proposes a new method. First, the Hilbert-Huang transform (HHT) is used to process the speech signal and obtain its Hilbert marginal energy spectrum. Then, a comparative analysis based on the Mel scale is carried out, and a new set of emotional features is obtained, consisting of the Mel-frequency marginal energy coefficient (MFEC), the Mel-frequency sub-band spectral centroid (MSSC) and the Mel-frequency sub-band spectral flatness (MSSF). Finally, five speech emotions, namely sadness, happiness, boredom, anger and neutral, are recognized with a support vector machine (SVM). The experimental results show that the new emotional features extracted by this method give good recognition performance.

  14. Automated recognition of bird song elements from continuous recordings using dynamic time warping and hidden Markov models: a comparative study.

    Science.gov (United States)

    Kogan, J A; Margoliash, D

    1998-04-01

    The performance of two techniques is compared for automated recognition of bird song units from continuous recordings. The advantages and limitations of dynamic time warping (DTW) and hidden Markov models (HMMs) are evaluated on a large database of male songs of zebra finches (Taeniopygia guttata) and indigo buntings (Passerina cyanea), which have different types of vocalizations and have been recorded under different laboratory conditions. Depending on the quality of recordings and complexity of song, the DTW-based technique gives excellent to satisfactory performance. Under challenging conditions such as noisy recordings or presence of confusing short-duration calls, good performance of the DTW-based technique requires careful selection of templates that may demand expert knowledge. Because HMMs are trained, equivalent or even better performance of HMMs can be achieved based only on segmentation and labeling of constituent vocalizations, albeit with many more training examples than DTW templates. One weakness in HMM performance is the misclassification of short-duration vocalizations or song units with more variable structure (e.g., some calls, and syllables of plastic songs). To address these and other limitations, new approaches for analyzing bird vocalizations are discussed.

  15. Recognition of 3-D symmetric objects from range images in automated assembly tasks

    Science.gov (United States)

    Alvertos, Nicolas; Dcunha, Ivan

    1990-01-01

    A new technique is presented for the three dimensional recognition of symmetric objects from range images. Beginning from the implicit representation of quadrics, a set of ten coefficients is determined for symmetric objects like spheres, cones, cylinders, ellipsoids, and parallelepipeds. Instead of using these ten coefficients trying to fit them to smooth surfaces (patches) based on the traditional way of determining curvatures, a new approach based on two dimensional geometry is used. For each symmetric object, a unique set of two dimensional curves is obtained from the various angles at which the object is intersected with a plane. Using the same ten coefficients obtained earlier and based on the discriminant method, each of these curves is classified as a parabola, circle, ellipse, or hyperbola. Each symmetric object is found to possess a unique set of these two dimensional curves whereby it can be differentiated from the others. It is shown that instead of using the three dimensional discriminant which involves evaluation of the rank of its matrix, it is sufficient to use the two dimensional discriminant which only requires three arithmetic operations.
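
    A minimal sketch of the two-dimensional discriminant test referred to above: a planar cross-section curve A x^2 + B xy + C y^2 + D x + E y + F = 0 is classified by the sign of B^2 - 4AC (degenerate cases are ignored here for brevity).

    def classify_conic(A, B, C, tol=1e-9):
        disc = B * B - 4.0 * A * C
        if abs(disc) < tol:
            return "parabola"
        if disc < 0:
            return "circle" if abs(A - C) < tol and abs(B) < tol else "ellipse"
        return "hyperbola"

    # classify_conic(1, 0, 1) -> 'circle';  classify_conic(1, 0, -1) -> 'hyperbola'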

  16. Investigation into diagnostic agreement using automated computer-assisted histopathology pattern recognition image analysis

    Directory of Open Access Journals (Sweden)

    Joshua D Webster

    2012-01-01

    The extent to which histopathology pattern recognition image analysis (PRIA) agrees with microscopic assessment has not been established. Thus, a commercial PRIA platform was evaluated in two applications using whole-slide images. Substantial agreement, lacking significant constant or proportional errors, between PRIA and manual morphometric image segmentation was obtained for pulmonary metastatic cancer areas (Passing/Bablok regression). Bland-Altman analysis indicated heteroscedastic measurements and a tendency toward increasing variance with increasing tumor burden, but no significant trend in mean bias. The average between-methods percent tumor content difference was -0.64. Analysis of between-methods measurement differences relative to the percent tumor magnitude revealed that method disagreement had an impact primarily in the smallest measurements (tumor burden 0.988, indicating high reproducibility for both methods, yet PRIA reproducibility was superior (C.V.: PRIA = 7.4, manual = 17.1). Evaluation of PRIA on morphologically complex teratomas led to diagnostic agreement with pathologist assessments of pluripotency on subsets of teratomas. Accommodation of the diversity of teratoma histologic features frequently resulted in detrimental trade-offs, increasing PRIA error elsewhere in images. PRIA error was nonrandom and influenced by variations in histomorphology. File-size limitations encountered while training algorithms and consequences of spectral image processing dominance contributed to diagnostic inaccuracies experienced for some teratomas. PRIA appeared better suited for tissues with limited phenotypic diversity. Technical improvements may enhance diagnostic agreement, and consistent pathologist input will benefit further development and application of PRIA.

  17. Mathematical Modelling for the Evaluation of Automated Speech Recognition Systems--Research Area 3.3.1 (c)

    Science.gov (United States)

    2016-01-07

    ...automatically generated transcript, and predicts task-specific performance. This measure is less conservative and less labour-intensive.

  18. Speech Compression and Synthesis

    Science.gov (United States)

    1980-10-01

    ...phonological rules combined with diphone... improved the algorithms used by the phonetic synthesis program for gain normalization and time... phonetic vocoder, spectral template. This report covers work for the past two years on speech compression and synthesis. Since there was an... From Block 19: speech recognition, phoneme recognition. Initial design for a phonetic recognition program. We also recorded and partially labeled a...

  19. Recognizing GSM Digital Speech

    OpenAIRE

    2005-01-01

    The Global System for Mobile (GSM) environment encompasses three main problems for automatic speech recognition (ASR) systems: noisy scenarios, source coding distortion, and transmission errors. The first one has already received much attention; however, source coding distortion and transmission errors must be explicitly addressed. In this paper, we propose an alternative front-end for speech recognition over GSM networks. This front-end is specially conceived to be effective against source coding distortion.

  20. Research of speech emotion recognition based on emotion features classification%基于情感特征分类的语音情感识别研究

    Institute of Scientific and Technical Information of China (English)

    周晓凤; 肖南峰; 文翰

    2012-01-01

    Because speech signals are real-time and uncertain, this paper proposes an evidence trust entropy and dynamic prior weights to improve the basic probability assignment function of traditional Dempster-Shafer (D-S) evidence theory. Since individual emotion features recognize different emotional states with different accuracy, the speech emotion features are divided into classes. Using the recognition results of each feature class, the improved D-S evidence theory performs decision-level data fusion, realizing speech emotion recognition based on multiple classes of emotion features and achieving fine-grained recognition. A numerical example verifies the rapid convergence and noise immunity of the improved algorithm, and comparative experiments demonstrate the effectiveness and stability of the classified-emotion-feature approach.
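
    A minimal sketch of Dempster's rule of combination, the decision-level fusion step that the improved method above builds on; each mass function is a dict from frozensets of emotion labels to belief mass, and the example values are purely illustrative.

    from itertools import product

    def dempster_combine(m1, m2):
        combined, conflict = {}, 0.0
        for (A, a), (B, b) in product(m1.items(), m2.items()):
            inter = A & B
            if inter:
                combined[inter] = combined.get(inter, 0.0) + a * b
            else:
                conflict += a * b                        # mass that falls on the empty set
        if conflict >= 1.0:
            raise ValueError("total conflict: combination undefined")
        return {A: v / (1.0 - conflict) for A, v in combined.items()}

    # m1 = {frozenset({"angry"}): 0.6, frozenset({"angry", "sad"}): 0.4}
    # m2 = {frozenset({"angry"}): 0.5, frozenset({"sad"}): 0.5}
    # dempster_combine(m1, m2)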

  1. Speech emotion recognition based on multifractal%多重分形在语音情感识别中的研究

    Institute of Scientific and Technical Information of China (English)

    叶吉祥; 王聪慧

    2012-01-01

    To overcome the inadequacy of linear parameters in characterizing different speech emotion types, multifractal theory is introduced into speech emotion recognition. By analyzing the multifractal characteristics of speech under different emotional states, multifractal spectrum parameters and the generalized Hurst exponent are extracted as new emotional feature parameters and, combined with traditional acoustic features, a support vector machine (SVM) is used for speech emotion recognition. The results show that, compared with recognition using only traditional linear features, the accuracy and stability of the recognition system are effectively improved by introducing the nonlinear parameters, providing a new idea for speech emotion recognition.

  2. 维吾尔语语音识别中发音变异现象%Uyghur pronunciation variations in automatic speech recognition systems

    Institute of Scientific and Technical Information of China (English)

    杨雅婷; 马博; 王磊; 吐尔洪·吾司曼; 李晓

    2011-01-01

    Many phonemes in spoken Uyghur exhibit pronunciation variations relative to the standard language, so recognition systems trained on standard speech achieve relatively low recognition rates on spoken corpora containing such variations. The assimilation, weakening, deletion and vowel harmony in Uyghur speech are analyzed to provide a better understanding of its phonetic and prosodic characteristics. Uyghur accent pronunciation variation rules are summarized by combining knowledge-based and data-driven methods, mapping vowel and consonant variant pairs and building a phoneme confusion matrix, laying a foundation for research on Uyghur dialectal spoken-language recognition.

  3. Research and design of parallel speech recognition system%并行化语音识别系统的研究与设计

    Institute of Scientific and Technical Information of China (English)

    王硕; 刘文

    2012-01-01

    Handling large volumes of voice data is an important problem in speech recognition applications. Replacing traditional standalone processing with parallel computing raises several issues: if the parallel scheduling is not controlled properly, the merged result can be wrong; if the data are segmented unreasonably, semantic continuity is lost and accuracy declines; and the network transmission cost of file fragments must also be considered. To solve these problems, a speech recognition system based on Hadoop is proposed, which uses HDFS and MapReduce to handle file fragment transfer and parallel scheduling control, and uses silence detection to split files at reasonable boundaries. Experiments verify the effectiveness of the system.

  4. Does computer-synthesized speech manifest personality? Experimental tests of recognition, similarity-attraction, and consistency-attraction.

    Science.gov (United States)

    Nass, C; Lee, K M

    2001-09-01

    Would people exhibit similarity-attraction and consistency-attraction toward unambiguously computer-generated speech even when personality is clearly not relevant? In Experiment 1, participants (extrovert or introvert) heard a synthesized voice (extrovert or introvert) on a book-buying Web site. Participants accurately recognized personality cues in text to speech and showed similarity-attraction in their evaluation of the computer voice, the book reviews, and the reviewer. Experiment 2, in a Web auction context, added personality of the text to the previous design. The results replicated Experiment 1 and demonstrated consistency (voice and text personality)-attraction. To maximize liking and trust, designers should set parameters, for example, words per minute or frequency range, that create a personality that is consistent with the user and the content being presented.

  5. 语音情感的维度特征提取与识别%Dimensional Feature Extraction and Recognition of Speech Emotion

    Institute of Scientific and Technical Information of China (English)

    李嘉; 黄程韦; 余华

    2012-01-01

    The relation between the emotion dimension space and speech features is studied, and the problem of automatic speech emotion recognition is addressed. A dimensional space model of basic emotions is introduced, and speech emotion features are extracted according to the arousal and valence dimensions, with global statistical features used to reduce the influence of text variation on the emotional features. Anger, happiness, sadness and the neutral state are studied; a Gaussian mixture model (GMM) is adopted for modeling and recognizing the four categories of emotion, and the number of Gaussian mixtures is optimized experimentally to fit the probability distribution of the four emotions in the feature space. The experimental results show that the chosen features are suitable for recognizing basic emotions, that the Gaussian mixture model achieves satisfactory classification results, and that the valence-dimension features in the two-dimensional emotion space play an important role in speech emotion recognition.
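
    A minimal sketch of the GMM classification scheme described above: one Gaussian mixture is fitted per emotion on global statistical feature vectors, and a test utterance is assigned to the emotion whose mixture gives the highest log-likelihood. scikit-learn supplies the GMM; the mixture order, which the paper tunes experimentally, is an illustrative value here.

    import numpy as np
    from sklearn.mixture import GaussianMixture

    EMOTIONS = ["anger", "happiness", "sadness", "neutral"]

    def train_gmms(train_feats, n_components=8):
        """train_feats: dict emotion -> (n_utterances, n_features) array of global statistics."""
        return {e: GaussianMixture(n_components=n_components, covariance_type="diag").fit(train_feats[e])
                for e in EMOTIONS}

    def classify(gmms, feat_vec):
        scores = {e: gmms[e].score(feat_vec.reshape(1, -1)) for e in EMOTIONS}
        return max(scores, key=scores.get)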

  6. 一种基于模板匹配的语音识别算法%A speech recognition algorithms based on pattern matching

    Institute of Scientific and Technical Information of China (English)

    聂晓飞; 赵禹; 詹庆才

    2011-01-01

    Speech recognition is an important research direction in speech signal processing, involving physiology, psychology, linguistics, computer science, signal processing and many other fields, and it is widely used in control, communications, consumer electronics and other industries. This article describes a simple speech recognition algorithm whose main stages are pre-processing, endpoint detection, feature extraction and pattern matching. Endpoint detection is based on amplitude, critical-band feature vectors are used as the features, and pattern matching uses the DTW algorithm. The main design idea is to match the incoming speech signal against stored templates. The algorithm is implemented on a hardware system built around the TMS320VC5402 chip.
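
    A minimal sketch of the template-matching stage described above: feature sequences are compared with stored templates by dynamic time warping (DTW) and the closest template wins. Endpoint detection and feature extraction are assumed to have been done already, and all names are illustrative.

    import numpy as np

    def dtw_distance(a, b):
        """a: (n, d) and b: (m, d) feature sequences; returns the accumulated DTW cost."""
        n, m = len(a), len(b)
        D = np.full((n + 1, m + 1), np.inf)
        D[0, 0] = 0.0
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                cost = np.linalg.norm(a[i - 1] - b[j - 1])
                D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
        return D[n, m]

    def recognize(features, templates):
        """templates: dict word -> stored feature sequence; returns the best-matching word."""
        return min(templates, key=lambda w: dtw_distance(features, templates[w]))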

  7. NICT/ATR Chinese-Japanese-English Speech-to-Speech Translation System

    Institute of Scientific and Technical Information of China (English)

    Tohru Shimizu; Yutaka Ashikari; Eiichiro Sumita; ZHANG Jinsong; Satoshi Nakamura

    2008-01-01

    This paper describes the latest version of the Chinese-Japanese-English handheld speech-to-speech translation system developed by NICT/ATR, which is now ready to be deployed for travelers. With the entire speech-to-speech translation function implemented in one terminal, it realizes real-time, location-free speech-to-speech translation. A new noise-suppression technique notably improves the speech recognition performance. Corpus-based approaches to speech recognition, machine translation, and speech synthesis enable coverage of a wide variety of topics and portability to other languages. Test results show that the character accuracy of speech recognition is 82%-94% for Chinese speech, and the bilingual evaluation understudy (BLEU) score of machine translation is 0.55-0.74 for Chinese-Japanese and Chinese-English.

  8. Application of a model of the auditory primal sketch to cross-linguistic differences in speech rhythm: Implications for the acquisition and recognition of speech

    Science.gov (United States)

    Todd, Neil P. M.; Lee, Christopher S.

    2002-05-01

    It has long been noted that the world's languages vary considerably in their rhythmic organization. Different languages seem to privilege different phonological units as their basic rhythmic unit, and there is now a large body of evidence that such differences have important consequences for crucial aspects of language acquisition and processing. The most fundamental finding is that the rhythmic structure of a language strongly influences the process of spoken-word recognition. This finding, together with evidence that infants are sensitive from birth to rhythmic differences between languages, and exploit rhythmic cues to segmentation at an earlier developmental stage than other cues prompted the claim that rhythm is the key which allows infants to begin building a lexicon and then go on to acquire syntax. It is therefore of interest to determine how differences in rhythmic organization arise at the acoustic/auditory level. In this paper, it is shown how an auditory model of the primitive representation of sound provides just such an account of rhythmic differences. Its performance is evaluated on a data set of French and English sentences and compared with the results yielded by the phonetic accounts of Frank Ramus and his colleagues and Esther Grabe and her colleagues.

  9. Speech Emotion Recognition Based on Fusion of Sample Entropy and MFCC

    Institute of Scientific and Technical Information of China (English)

    屠彬彬; 于凤芹

    2012-01-01

    This paper proposes a method of speech emotion recognition based on the fusion of sample entropy and Mel-frequency cepstral coefficients (MFCC). Sample entropy statistics and MFCC are modeled with support vector machines (SVM) separately to obtain the probabilities of the emotions happy, angry, bored and afraid. The sum and product rules are used to fuse the probabilities and reach the final decision. Simulation results demonstrate that the recognition rate obtained with the proposed method is high.
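
    A minimal sketch of sum/product-rule fusion of two probabilistic classifiers, assuming scikit-learn SVMs trained separately on sample-entropy statistics and MFCC features (function and variable names are illustrative, not the paper's code):

        import numpy as np
        from sklearn.svm import SVC

        def train(entropy_feats, mfcc_feats, labels):
            """Train one probability-calibrated SVM per feature stream on the same labels."""
            clf_ent = SVC(probability=True).fit(entropy_feats, labels)
            clf_mfcc = SVC(probability=True).fit(mfcc_feats, labels)
            return clf_ent, clf_mfcc

        def fuse_predict(clf_ent, clf_mfcc, ent_x, mfcc_x, rule="sum"):
            """Combine per-class probabilities with the sum or product rule and take the argmax."""
            p1 = clf_ent.predict_proba(ent_x)       # shape (n_samples, n_classes)
            p2 = clf_mfcc.predict_proba(mfcc_x)
            fused = p1 + p2 if rule == "sum" else p1 * p2
            return clf_ent.classes_[np.argmax(fused, axis=1)]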

  10. Speech Recognition Based on Deep Neural Networks on Tibetan Corpus

    Institute of Scientific and Technical Information of China (English)

    袁胜龙; 郭武; 戴礼荣

    2015-01-01

    Large-vocabulary continuous speech recognition of conversational telephone Tibetan is addressed for the first time in this paper. As a minority language, the major difficulty in Tibetan speech recognition is data deficiency. In this paper, the acoustic model for Tibetan is trained with deep neural networks (DNN). To address the data deficiency, DNN models trained on other, better-resourced languages are used as the initial networks for the target Tibetan DNN model. In addition, phonetic question sets for Tibetan produced by phonetic experts are unavailable because the phonetics of the language is not well studied; to reduce the number of tri-phone hidden Markov models (HMM), question sets generated automatically in a data-driven manner are used to tie the tri-phone HMM states. Different clusterings of tri-phone states are tested, and the word accuracy on the test corpus is about 30.86% with Gaussian mixture model (GMM) acoustic modeling. When the acoustic model is trained with DNNs, three DNN models trained on different large corpora are adopted. The experimental results show that the proposed methods improve the recognition performance, with a word accuracy of about 43.26% on the test corpus.

  11. Psychological Motivation Strategy of Speech Recognition in the Popularization of Marxism

    Institute of Scientific and Technical Information of China (English)

    邓瑞琴

    2015-01-01

    Speech recognition serves as a powerful speech support system and discourse safeguard in the popularization of Marxism, and psychological motivation is an effective means of promoting it. By correctly grasping and handling the coupling relationship between speech recognition and psychological motivation, the paper explores ways of realizing the popularization of Marxism through three basic motivational strategies: building a mechanism of speech recognition, shaping the psychological identification of the masses, and respecting the dominant position of the public. In the concrete implementation of these strategies, attention should be paid to correctly understanding the relationship between speech and psychology and to mastering the art of verbal expression.

  12. The Network Account Identity Authentication System based on Voiceprint Recognition and Speech Recognition

    Institute of Scientific and Technical Information of China (English)

    李秋华; 唐鉴波; 柏森; 朱桂斌

    2013-01-01

    The network account identity authentication system based on voiceprint recognition and speech recognition designed and developed in this paper consists of two parts: voiceprint enrollment and database construction, and voiceprint discrimination. When a user registers, the system collects the user's voiceprint; when the user logs in again, the system compares the user's voiceprint with the voiceprints in the database, verifying the user's identity through voiceprint recognition to ensure the security of user data. The system is built on the server side, is quick and easy to install, and is highly secure. It places few demands on users, who only need a microphone to complete registration. The system is convenient and simple to operate, offers good security and confidentiality, and has broad market prospects.

  13. Clinical and audiological features of a syndrome with deterioration in speech recognition out of proportion to pure hearing loss

    Directory of Open Access Journals (Sweden)

    Abdi S

    2007-04-01

    Full Text Available Background: The objective of this study was to describe the audiologic and related characteristics of a group of patients with speech perception affected out of proportion to pure-tone hearing loss. The case series comprised patients referred for evaluation and management to the Hearing Research Center. The key clinical feature was hearing loss for pure tones with a reduction in speech discrimination out of proportion to the pure-tone loss, meeting some of the criteria of auditory neuropathy (i.e., normal otoacoustic emissions, OAE, and abnormal auditory brainstem evoked potentials, ABR) while lacking others (e.g., present auditory reflexes). Methods: Hearing abilities were measured by Pure Tone Audiometry (PTA) and Speech Discrimination Scores (SDS), measured in all patients using a standardized list of 25 monosyllabic Farsi words at MCL in quiet. Auditory pathway integrity was assessed using Auditory Brainstem Response (ABR) and Otoacoustic Emission (OAE) testing, and anatomical lesions with Computed Tomography (CT) and Magnetic Resonance Imaging (MRI) of the brain and retrocochlea. The series included 35 patients who had SDS disproportionately low with regard to PTA, absent ABR waves and normal OAE. Results: All patients reported the onset of their problem around adolescence. None of them had an anatomical lesion in the imaging studies, and none had any finding suggestive of a conductive hearing lesion. Although in most of the cases the hearing loss was more apparent at the lower frequencies (i.e., 1000 Hz and below), a stronger correlation was found between SDS and the hearing threshold at higher frequencies. These patients may not benefit from hearing aids, as the outer hair cells are functional and amplification does not seem to help, though it was tried for all. Conclusion: These patients share a pattern of sensorineural loss with no detectable lesion. The age of onset and the gradual

  14. Building Searchable Collections of Enterprise Speech Data.

    Science.gov (United States)

    Cooper, James W.; Viswanathan, Mahesh; Byron, Donna; Chan, Margaret

    The study has applied speech recognition and text-mining technologies to a set of recorded outbound marketing calls and analyzed the results. Since speaker-independent speech recognition technology results in a significantly lower recognition rate than that found when the recognizer is trained for a particular speaker, a number of post-processing…

  15. Research on tone recognition of Chinese speech

    Institute of Scientific and Technical Information of China (English)

    李源; 周莹

    2012-01-01

    With the development of modern technology, computers and tablet PCs, voice interaction will become the main form of man-machine communication, and in Chinese speech synthesis tone is an indispensable component. In the tone extraction process, an improved short-time autocorrelation function method is first used for pitch detection; to obtain the pitch period of voiced sounds more accurately, a variable-length framing method is used to extract the pitch period sequence, and the tone curves of the four Mandarin tones are then obtained with Matlab. The simulation results show that the tone curves obtained by this method are consistent with the typical tone curves of Mandarin.
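
    A simplified sketch of short-time autocorrelation pitch tracking of the kind described above (fixed-length frames rather than the paper's variable-length framing; all parameter values and names are illustrative):

        import numpy as np

        def autocorr_pitch(frame, fs, fmin=60.0, fmax=400.0):
            """Estimate the pitch (Hz) of one voiced frame from the peak of its autocorrelation."""
            frame = frame - frame.mean()
            ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]   # lags 0..n-1
            lo, hi = int(fs / fmax), int(fs / fmin)                         # plausible lag range
            lag = lo + np.argmax(ac[lo:hi])
            return fs / lag

        def pitch_contour(signal, fs, frame_len=0.03, hop=0.01):
            """Frame the signal and return one pitch value per frame (the raw tone contour)."""
            n, h = int(frame_len * fs), int(hop * fs)
            return [autocorr_pitch(signal[i:i + n], fs)
                    for i in range(0, len(signal) - n, h)]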

  16. Commercial speech and off-label drug uses: what role for wide acceptance, general recognition and research incentives?

    Science.gov (United States)

    Gilhooley, Margaret

    2011-01-01

    This article provides an overview of how the constitutional protections for commercial speech affect the Food and Drug Administration's (FDA) regulation of drugs, and the emerging issues about the scope of these protections. A federal district court has already found that commercial speech allows manufacturers to distribute reprints of medical articles about a new off-label use of a drug as long as it contains disclosures to prevent deception and to inform readers about the lack of FDA review. This paper summarizes the current agency guidance that accepts the manufacturer's distribution of reprints with disclosures. Allergan, the maker of Botox, recently maintained in a lawsuit that the First Amendment permits drug companies to provide "truthful information" to doctors about "widely accepted" off-label uses of a drug. While the case was settled as part of a fraud and abuse case on other grounds, extending constitutional protections generally to "widely accepted" uses is not warranted, especially if it covers the use of a drug for a new purpose that needs more proof of efficacy, and that can involve substantial risks. A health law academic pointed out in an article examining a fraud and abuse case that off-label use of drugs is common, and that practitioners may lack adequate dosage information about the off-label uses. Drug companies may obtain approval of a drug for a narrow use, such as for a specific type of pain, but practitioners use the drug for similar uses based on their experience. The writer maintained that a controlled study may not be necessary to establish efficacy for an expanded use of a drug for pain. Even if this is the case, as discussed below in this paper, added safety risks may exist if the expansion covers a longer period of time and use by a wider number of patients. The protections for commercial speech should not be extended to allow manufacturers to distribute information about practitioner use with a disclosure about the lack of FDA

  17. Automated Face Recognition System

    Science.gov (United States)

    1992-12-01

  18. Automated Program Recognition.

    Science.gov (United States)

    1987-02-01

    documentation module which is used to demonstrate the Recognizer's output takes basically this approach. It gives as output an English description of the…

  19. Feature Selection for Speech Emotion Recognition Based on the Ant Colony Optimization Algorithm

    Institute of Scientific and Technical Information of China (English)

    杨鸿章

    2013-01-01

    Speech emotion information is high-dimensional and redundant. To improve the accuracy of speech emotion recognition, this paper puts forward a speech emotion recognition model that selects features with an ant colony optimization algorithm. The classification accuracy of a KNN classifier weighted against the selected feature dimension forms the fitness function, and the ant colony optimization algorithm provides good global search capability and multiple sub-optimal solutions. A local refinement search scheme is designed to exclude redundant features and improve the convergence rate. The performance of the method was tested on a Chinese emotional speech database and the Danish Emotional Speech database. The simulation results show that the proposed method not only eliminates redundant and useless features and thus reduces the feature dimension, but also improves the speech emotion recognition rate; it is therefore an effective method for speech emotion recognition.
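
    A minimal sketch of the kind of weighted objective such a wrapper search might use to score candidate feature subsets, trading KNN accuracy against subset size (the weights, classifier settings and cross-validation scheme are assumptions, not the paper's):

        import numpy as np
        from sklearn.neighbors import KNeighborsClassifier
        from sklearn.model_selection import cross_val_score

        def subset_fitness(X, y, mask, w_acc=0.9, w_dim=0.1):
            """mask: boolean vector selecting feature columns; higher fitness means a better subset."""
            if not mask.any():
                return 0.0
            acc = cross_val_score(KNeighborsClassifier(n_neighbors=5), X[:, mask], y, cv=3).mean()
            # Reward accuracy and penalize large subsets (smaller dimension -> larger second term).
            return w_acc * acc + w_dim * (1.0 - mask.sum() / mask.size)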

  20. Speech in Mobile and Pervasive Environments

    CERN Document Server

    Rajput, Nitendra

    2012-01-01

    This book brings together the latest research in one comprehensive volume that deals with issues related to speech processing on resource-constrained, wireless, and mobile devices, such as speech recognition in noisy environments, specialized hardware for speech recognition and synthesis, the use of context to enhance recognition, the emerging and new standards required for interoperability, speech applications on mobile devices, distributed processing between the client and the server, and the relevance of Speech in Mobile and Pervasive Environments for developing regions--an area of explosiv

  1. The Design of Speech Recognition Module in Infant-Caring Intelligent System

    Institute of Scientific and Technical Information of China (English)

    张荣刚

    2012-01-01

    An efficient module is designed to realize real-time monitoring of an infant's sleep and to pacify the infant promptly and effectively by means of speech recognition and intelligent control.

  2. The binaural masking-level difference of mandarin tone detection and the binaural intelligibility-level difference of mandarin tone recognition in the presence of speech-spectrum noise.

    Science.gov (United States)

    Ho, Cheng-Yu; Li, Pei-Chun; Chiang, Yuan-Chuan; Young, Shuenn-Tsong; Chu, Woei-Chyn

    2015-01-01

    Binaural hearing involves using information relating to the differences between the signals that arrive at the two ears, and it can make it easier to detect and recognize signals in a noisy environment. This phenomenon of binaural hearing is quantified in laboratory studies as the binaural masking-level difference (BMLD). Mandarin is one of the most commonly used languages, but there are no published values of BMLD or BILD based on Mandarin tones. Therefore, this study investigated the BMLD and BILD of Mandarin tones. The BMLDs of Mandarin tone detection were measured based on the detection threshold differences for the four tones of the voiced vowels /i/ (i.e., /i1/, /i2/, /i3/, and /i4/) and /u/ (i.e., /u1/, /u2/, /u3/, and /u4/) in the presence of speech-spectrum noise when presented interaurally in phase (S0N0) and interaurally in antiphase (SπN0). The BILDs of Mandarin tone recognition in speech-spectrum noise were determined as the differences in the target-to-masker ratio (TMR) required for 50% correct tone recognition between the S0N0 and SπN0 conditions. The detection thresholds for the four tones of /i/ and /u/ differed significantly between conditions: the thresholds were all lower in the SπN0 condition than in the S0N0 condition, and the BMLDs ranged from 7.3 to 11.5 dB. The TMR for 50% correct Mandarin tone recognition also differed significantly between the two conditions. Mandarin tone detection and recognition in the presence of speech-spectrum noise are thus improved when phase inversion is applied to the target speech. The average BILDs of Mandarin tones are smaller than the average BMLDs of Mandarin tones.

  3. Emotion Recognition in Speech Based on HMM and PNN

    Institute of Scientific and Technical Information of China (English)

    叶斌

    2011-01-01

    The aim of speech emotion recognition is to give the computer the capacity to understand emotion from voice characteristics, so that it can ultimately interact with people in a natural, warm and lively way. A speech emotion recognition algorithm based on the hidden Markov model (HMM) and the probabilistic neural network (PNN) was developed. In this system, basic prosodic parameters and spectral parameters are extracted first; the PNN is then used to model the statistical features and the HMM to model the temporal features. The sum and product rules are used to combine the probabilities from each group of features for the final decision. Experimental results confirm the capability and efficiency of the proposed method.

  4. A Danish open-set speech corpus for competing-speech studies

    DEFF Research Database (Denmark)

    Nielsen, Jens Bo; Dau, Torsten; Neher, Tobias

    2014-01-01

    Studies investigating speech-on-speech masking effects commonly use closed-set speech materials such as the coordinate response measure [Bolia et al. (2000). J. Acoust. Soc. Am. 107, 1065-1066]. However, these studies typically result in very low (i.e., negative) speech recognition thresholds (SR...

  5. A Kinect-Based Sign Language Hand Gesture Recognition System for Hearing- and Speech-Impaired: A Pilot Study of Pakistani Sign Language.

    Science.gov (United States)

    Halim, Zahid; Abbas, Ghulam

    2015-01-01

    Sign language provides hearing and speech impaired individuals with an interface to communicate with other members of the society. Unfortunately, sign language is not understood by most of the common people. For this, a gadget based on image processing and pattern recognition can provide with a vital aid for detecting and translating sign language into a vocal language. This work presents a system for detecting and understanding the sign language gestures by a custom built software tool and later translating the gesture into a vocal language. For the purpose of recognizing a particular gesture, the system employs a Dynamic Time Warping (DTW) algorithm and an off-the-shelf software tool is employed for vocal language generation. Microsoft(®) Kinect is the primary tool used to capture video stream of a user. The proposed method is capable of successfully detecting gestures stored in the dictionary with an accuracy of 91%. The proposed system has the ability to define and add custom made gestures. Based on an experiment in which 10 individuals with impairments used the system to communicate with 5 people with no disability, 87% agreed that the system was useful.

  6. Speech Equilibrium Recognition in Preschool Children with Functional Articulation Disorder

    Institute of Scientific and Technical Information of China (English)

    赵云静; 孙洪伟; 麻宏伟; 李书娟

    2012-01-01

    Objective: To evaluate the speech equilibrium (phonetically balanced) recognition ability of children with functional articulation disorder (FAD) and provide new evidence for research on its pathogenesis. Methods: Sixty-eight children with FAD aged 4 to 5 years were selected as the case group, and fifty age-matched normally speaking children were selected as the control group. Consonant and vowel recognition abilities were assessed in both groups using a speech equilibrium recognition scale for children. Results: The scores for consonant recognition and vowel recognition were significantly lower in the case group than in the control group (P<0.05), and the scores were significantly lower in children with moderate-to-severe FAD than in children with mild FAD (P<0.05). Conclusion: The speech equilibrium recognition ability of children with functional articulation disorder lags clearly behind that of normal children, and this lag may be one of the causes of functional articulation disorder.

  7. A Study to Increase the Quality of Financial and Operational Performances of Call Centers using Speech Technology

    Directory of Open Access Journals (Sweden)

    R. Manoharan

    2015-04-01

    Full Text Available Everyone knows technology and automation are not the solutions to every business problem. But when used for the right reasons, and deployed and maintained wisely, speech-based contact center applications can be good for customers as well as for the business, justifying the money and time spent to implement and maintain them. The speech-based application considered here is an experimental conversational speech system. Experience with redesigning the system based on user feedback indicates the importance of adhering to conversational conventions when designing speech interfaces, particularly in the face of speech recognition errors. Study results also suggest that speech-only interfaces should be designed from scratch rather than directly translated from their graphical counterparts. This paper examines a set of challenging issues facing speech interface designers and describes approaches to address some of these challenges. It highlights some of the specific constraints involved in using speech technology in mainstream business in general, and in a call center in particular, and how they are resolved in the industrial process. The real challenge is developing a new business process model for an industry application, specifically for a call center, which paves the way to designing and analyzing the financial and operational performance of call centers through a business process model using speech technology.

  8. Tackling the complexity in speech

    DEFF Research Database (Denmark)

    section includes four carefully selected chapters. They deal with facets of speech production, speech acoustics, and/or speech perception or recognition, place them in an integrated phonetic-phonological perspective, and relate them in more or less explicit ways to aspects of speech technology. Therefore, we hope that this volume can help speech scientists with traditional training in phonetics and phonology to keep up with the latest developments in speech technology. In the opposite direction, speech researchers starting from a technological perspective will hopefully get inspired by reading about the questions, phenomena, and communicative functions that are currently addressed in phonetics and phonology. Either way, the future of speech research lies in international, interdisciplinary collaborations, and our volume is meant to reflect and facilitate such collaborations...

  9. An enhanced relative spectral processing of speech

    Institute of Scientific and Technical Information of China (English)

    ZHEN Bin; WU Xihong; LIU Zhimin; CHI Huisheng

    2002-01-01

    An enhanced relative spectral (E_RASTA) technique for speech and speaker recognition is proposed. The new method consists of classical RASTA filtering in the logarithmic spectral domain followed by another additive RASTA filtering in the same domain. In this manner, both channel distortion and additive noise are removed effectively. In speaker identification and speech recognition experiments on the TI46 database, E_RASTA performs as well as or better than the J_RASTA method in both tasks. E_RASTA requires neither an estimate of the speech SNR to determine the optimal value of J in J_RASTA, nor information about how the speech is degraded. The choice of the E_RASTA filter also indicates that the low temporal modulation components in speech can deteriorate the performance of both recognition tasks. In addition, speaker recognition needs a narrower band of temporal modulation frequencies than speech recognition.

  10. Mandarin digit speech recognition based on locally linear embedding algorithm

    Institute of Scientific and Technical Information of China (English)

    高文曦; 于凤芹

    2012-01-01

    The dimensionality of a speech signal is high once it is transformed to the frequency domain, and manifold learning methods can automatically discover the low-dimensional structure hidden in such high-dimensional data. A manifold learning approach is therefore proposed to reduce the dimensionality of the high-dimensional data for Mandarin digit speech recognition. The locally linear embedding (LLE) algorithm is used to extract the low-dimensional manifold structure from the high-dimensional frequency-domain data, and the resulting low-dimensional features are fed into a dynamic time warping (DTW) recognizer. Simulation results demonstrate that the LLE features have fewer dimensions than the commonly used MFCC features, the recognition rate for Mandarin digits increases by 1.2%, and the recognition speed is effectively improved.
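
    A minimal sketch of locally linear embedding applied to frame-level spectral features before template matching, assuming scikit-learn (the dimensions, neighbour counts and toy data are illustrative):

        import numpy as np
        from sklearn.manifold import LocallyLinearEmbedding

        def reduce_frames(spectral_frames, n_components=12, n_neighbors=10):
            """Map high-dimensional frequency-domain frames onto a low-dimensional manifold."""
            lle = LocallyLinearEmbedding(n_neighbors=n_neighbors, n_components=n_components)
            return lle.fit_transform(spectral_frames)      # shape: (n_frames, n_components)

        # Toy usage: 200 frames of 257-point magnitude spectra.
        frames = np.abs(np.random.randn(200, 257))
        low_dim = reduce_frames(frames)
        print(low_dim.shape)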

  11. Dimensional emotion recognition in whispered speech signal based on cognitive performance evaluation

    Institute of Scientific and Technical Information of China (English)

    吴晨健; 黄程韦; 陈虹

    2015-01-01

    Cognitive performance-based dimensional emotion recognition in whispered speech is studied. First, whispered speech emotion databases and data collection methods are compared, and the characteristics of emotion expression in whispered speech are studied, especially for the basic emotion types. Secondly, the emotion features for whispered speech are analyzed, and the related valence and arousal features are summarized from the recent literature; the effectiveness of valence and arousal features in whispered speech emotion classification is studied. Finally, the Gaussian mixture model is studied and applied to whispered speech emotion recognition. Cognitive performance is also considered in the recognition process so that recognition errors of whispered speech emotion can be corrected, and based on the cognitive scores the emotion recognition results can be improved. The results show that formant features are not significantly related to the arousal dimension, while short-term energy features are related to emotion changes along the arousal dimension. Using the cognitive scores, the recognition

  12. Automatic Recognition of Improperly Pronounced Initial 'r' Consonant in Romanian

    Directory of Open Access Journals (Sweden)

    VELICAN, V.

    2012-08-01

    Full Text Available Correctly assessing the degree of mispronunciation and deciding upon the necessary treatment are fundamental activities for all speech disorder specialists. Obviously, the experience and the availability of the specialists are essential in order to assure an efficient therapy for the speech impaired. A more objective approach would include a tool that, independent of the specialist's abilities, could be used to establish the diagnosis. A completely automated system based on speech processing algorithms capable of performing the recognition task is therefore thoroughly justified and can be viewed as a goal that will bring many benefits to the field of speech pronunciation correction. This paper presents further results of the authors' work on developing speech processing algorithms able to identify mispronunciations in the Romanian language; more exactly, we propose the use of the Walsh-Hadamard Transform (WHT) as a feature selection tool for identifying rhotacism. The results are encouraging, with a best recognition rate of 92.55%.

  13. INTEGRATING MACHINE TRANSLATION AND SPEECH SYNTHESIS COMPONENT FOR ENGLISH TO DRAVIDIAN LANGUAGE SPEECH TO SPEECH TRANSLATION SYSTEM

    Directory of Open Access Journals (Sweden)

    J. SANGEETHA

    2015-02-01

    Full Text Available This paper provides an interface between the machine translation and speech synthesis components for converting English speech to Tamil in an English-to-Tamil speech-to-speech translation system. The speech translation system consists of three modules: automatic speech recognition, machine translation and text-to-speech synthesis. Many procedures for integrating speech recognition and machine translation have been proposed, but the speech synthesis component has so far received little consideration. In this paper, we focus on the integration of machine translation and speech synthesis, and report a subjective evaluation to investigate the impact of the speech synthesis component, the machine translation component, and their integration. We implement a hybrid machine translation approach (a combination of rule-based and statistical machine translation) and a concatenative, syllable-based speech synthesis technique. In order to retain the naturalness and intelligibility of the synthesized speech, Auto Associative Neural Network (AANN) prosody prediction is used in this work. The results of this investigation demonstrate that the naturalness and intelligibility of the synthesized speech are strongly influenced by the fluency and correctness of the translated text.

  14. Application of EMD-SDC in an airborne connected-word speech recognition system

    Institute of Scientific and Technical Information of China (English)

    严家明; 李永恒

    2012-01-01

    Compared with traditional speech recognition systems, an airborne connected-word speech recognition system faces heavy background noise and requires a high recognition rate. Based on these characteristics, this paper proposes an EMD-SDC method that combines empirical mode decomposition enhancement with shifted delta cepstral features. Empirical mode decomposition, with its AM-FM characteristics, can substantially increase endpoint detection accuracy in the complex airborne noise environment. The shifted delta cepstra, formed by concatenating first-order difference spectra of the speech frames, capture the temporal information that depends on the structure of the language. The method is tested on the prompt speech database of an airborne traffic collision avoidance system; the experimental results show that an airborne connected-word speech recognition system using the EMD-SDC method can overcome cabin background noise and achieve a high recognition rate at low SNR.
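
    A minimal sketch of shifted delta cepstra built from an MFCC matrix, using the common d/P/k parameterisation (the specific parameter values, and the EMD-based endpoint detection, are assumptions rather than details taken from the paper):

        import numpy as np

        def sdc(cepstra, d=1, P=3, k=7):
            """cepstra: (n_frames, N) MFCC matrix -> (n_frames, N*k) SDC matrix (edges zero-padded)."""
            n_frames, N = cepstra.shape
            out = np.zeros((n_frames, N * k))
            for t in range(n_frames):
                blocks = []
                for i in range(k):
                    lo, hi = t + i * P - d, t + i * P + d      # delta window shifted by i*P frames
                    if 0 <= lo and hi < n_frames:
                        blocks.append(cepstra[hi] - cepstra[lo])
                    else:
                        blocks.append(np.zeros(N))
                out[t] = np.concatenate(blocks)
            return out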

  15. Current trends in multilingual speech processing

    Indian Academy of Sciences (India)

    Hervé Bourlard; John Dines; Mathew Magimai-Doss; Philip N Garner; David Imseng; Petr Motlicek; Hui Liang; Lakshmi Saheer; Fabio Valente

    2011-10-01

    In this paper, we describe recent work at Idiap Research Institute in the domain of multilingual speech processing and provide some insights into emerging challenges for the research community. Multilingual speech processing has been a topic of ongoing interest to the research community for many years and the field is now receiving renewed interest owing to two strong driving forces. Firstly, technical advances in speech recognition and synthesis are posing new challenges and opportunities to researchers. For example, discriminative features are seeing wide application by the speech recognition community, but additional issues arise when using such features in a multilingual setting. Another example is the apparent convergence of speech recognition and speech synthesis technologies in the form of statistical parametric methodologies. This convergence enables the investigation of new approaches to unified modelling for automatic speech recognition and text-to-speech synthesis (TTS) as well as cross-lingual speaker adaptation for TTS. The second driving force is the impetus being provided by both government and industry for technologies to help break down domestic and international language barriers, these also being barriers to the expansion of policy and commerce. Speech-to-speech and speech-to-text translation are thus emerging as key technologies at the heart of which lies multilingual speech processing.

  16. How should a speech recognizer work?

    NARCIS (Netherlands)

    Scharenborg, O.E.; Norris, D.G.; Bosch, L.F.M. ten; McQueen, J.M.

    2005-01-01

    Although researchers studying human speech recognition (HSR) and automatic speech recognition (ASR) share a common interest in how information processing systems (human or machine) recognize spoken language, there is little communication between the two disciplines. We suggest that this lack of comm

  17. Design of a digital speech recognition system used for wireless communication

    Institute of Scientific and Technical Information of China (English)

    王艳芬

    2016-01-01

    Interference such as the environment, the user's accent and non-target vocabulary exists in the digital voice recording process, which makes previously developed digital speech recognition systems for wireless communication low in accuracy and poor in portability. Therefore, an optimized design of a digital speech recognition system for wireless communication was carried out. The core components of the system are the C6727 DSP chip, the QGDH710 speech recognition chip and the CC2520 RF transceiver. The C6727 DSP chip performs the early-stage processing of the digital speech; the QGDH710 speech recognition chip recognizes the processed speech and feeds the recognized instruction back to the CC2520 RF transceiver; the CC2520 RF transceiver converts the instruction format and transmits the instructions to the user's wireless communication equipment, realizing effective use of the digital speech recognition system in wireless communication. To allow users to operate the system conveniently, a virtual function diagram of the user's wireless communication equipment is provided in software. Experimental verification shows that the designed system has high accuracy and good portability.

  18. A Survey on Speech Enhancement Methodologies

    Directory of Open Access Journals (Sweden)

    Ravi Kumar. K

    2016-12-01

    Full Text Available Speech enhancement is a technique which processes noisy speech signals. The aim of speech enhancement is to improve the perceived quality of speech and/or to improve its intelligibility. Due to its vast applications in mobile telephony, VOIP, hearing aids, Skype and speaker recognition, the challenges in speech enhancement have grown over the years. It is especially challenging to suppress background noise that affects human communication in noisy environments like airports, road works, traffic, and cars. The objective of this survey paper is to outline the single-channel speech enhancement methodologies used for enhancing speech signals corrupted with additive background noise, and to discuss the challenges and opportunities of single-channel speech enhancement. The paper mainly focuses on transform-domain techniques and supervised (NMF, HMM) speech enhancement techniques, and gives a framework for developments in speech enhancement methodologies.

  19. Speech rate effects on the processing of conversational speech across the adult life span.

    Science.gov (United States)

    Koch, Xaver; Janse, Esther

    2016-04-01

    This study investigates the effect of speech rate on spoken word recognition across the adult life span. Contrary to previous studies, conversational materials with a natural variation in speech rate were used rather than lab-recorded stimuli that are subsequently artificially time-compressed. It was investigated whether older adults' speech recognition is more adversely affected by increased speech rate compared to younger and middle-aged adults, and which individual listener characteristics (e.g., hearing, fluid cognitive processing ability) predict the size of the speech rate effect on recognition performance. In an eye-tracking experiment, participants indicated with a mouse-click which visually presented words they recognized in a conversational fragment. Click response times, gaze, and pupil size data were analyzed. As expected, click response times and gaze behavior were affected by speech rate, indicating that word recognition is more difficult if speech rate is faster. Contrary to earlier findings, increased speech rate affected the age groups to the same extent. Fluid cognitive processing ability predicted general recognition performance, but did not modulate the speech rate effect. These findings emphasize that earlier results of age by speech rate interactions mainly obtained with artificially speeded materials may not generalize to speech rate variation as encountered in conversational speech.

  20. Speech Problems

    Science.gov (United States)

    ... of your treatment plan may include seeing a speech therapist , a person who is trained to treat speech disorders. How often you have to see the speech therapist will vary — you'll probably start out seeing ...

  1. The Effectiveness of Clear Speech as a Masker

    Science.gov (United States)

    Calandruccio, Lauren; Van Engen, Kristin; Dhar, Sumitrajit; Bradlow, Ann R.

    2010-01-01

    Purpose: It is established that speaking clearly is an effective means of enhancing intelligibility. Because any signal-processing scheme modeled after known acoustic-phonetic features of clear speech will likely affect both target and competing speech, it is important to understand how speech recognition is affected when a competing speech signal…

  2. Whispered Speech Emotion Recognition Embedded with Markov Networks and Multi-Scale Decision Fusion

    Institute of Scientific and Technical Information of China (English)

    黄程韦; 金赟; 包永强; 余华; 赵力

    2013-01-01

    In this paper we propose a multi-scale framework in the time domain that combines the Gaussian mixture model with a Markov network, and apply it to whispered speech emotion recognition. Based on the Gaussian mixture model, speech emotion recognition is carried out on both long and short utterances in continuous speech signals. According to the dimensional emotion model, emotion in whispered speech should be continuous in the time domain, so the context dependency in whispered speech is modeled with a Markov network. A spring model is adopted to model the higher-order variation in the emotion dimensional space, and fuzzy entropy is used to convert the Gaussian mixture model likelihoods into the unary energies of the Markov network. Experimental results show that the proposed algorithm performs well on continuous whispered speech data, with a recognition rate for anger of 64.3%. The results further show that, unlike the findings for normal speech, the recognition of happiness is relatively difficult in whispered speech, while anger and sadness are relatively easy to

  3. Cleft Audit Protocol for Speech (CAPS-A): A Comprehensive Training Package for Speech Analysis

    Science.gov (United States)

    Sell, D.; John, A.; Harding-Bell, A.; Sweeney, T.; Hegarty, F.; Freeman, J.

    2009-01-01

    Background: The previous literature has largely focused on speech analysis systems and ignored process issues, such as the nature of adequate speech samples, data acquisition, recording and playback. Although there has been recognition of the need for training on tools used in speech analysis associated with cleft palate, little attention has been…

  4. A Study on Feature Analysis and Recognition of Practical Speech Emotion

    Institute of Scientific and Technical Information of China (English)

    黄程韦; 赵艳; 金赟; 于寅骅; 赵力

    2011-01-01

    Practical speech emotions such as impatience and happiness are studied, especially for evaluating emotional well-being in real-world applications. Induced natural speech emotion data are collected with a computer game and 74 emotion features are extracted; prosodic features and voice quality features are analyzed according to a dimensional emotion model, and the acoustic features are evaluated and selected for the practical emotions. A method of practical speech emotion classification with a rejection decision is proposed for real-world conditions. The experimental results show that the speech features analyzed in this paper are suitable for classifying practical speech emotions such as impatience and happiness, with an average recognition rate above 75%, and that classification with a rejection decision is necessary for making proper recognition decisions on ambiguous or unknown emotion samples, which is important for real-world applications.

  5. Design and implementation of intelligent wheelchair based on speech recognition

    Institute of Scientific and Technical Information of China (English)

    巴金融

    2014-01-01

    An intelligent wheelchair based on the STC10L08XE microcontroller is designed and implemented using speech recognition. The LD3320 chip, a dedicated speech recognition chip made by ICRoute, is used, with the STC10L08XE microcontroller acting as the main controller. The designed intelligent wheelchair supports voice commands for moving forward, backward, turning left, turning right and stopping, as well as an infrared obstacle-avoidance function.

  6. Effect of Sentence Contexts on Word Recognition in Speech Perception

    Institute of Scientific and Technical Information of China (English)

    柳鑫淼

    2014-01-01

    Phonemes, words and sentences are interconnected in speech perception. Besides phonetic features, phonemes and words, the sentence is also a unit engaged in speech perception. In this process, the sentence context influences word recognition both syntactically and semantically. Syntactically, the sentence level exerts a top-down feedback effect on the word level according to syntactic rules, screening the candidate words by constraining their part of speech or checking their inflectional features. Semantically, the sentence level activates or inhibits the candidate words through semantic constraints.

  7. Description and recognition of regular and distorted secondary structures in proteins using the automated protein structure analysis method.

    Science.gov (United States)

    Ranganathan, Sushilee; Izotov, Dmitry; Kraka, Elfi; Cremer, Dieter

    2009-08-01

    The Automated Protein Structure Analysis (APSA) method, which describes the protein backbone as a smooth line in three-dimensional space and characterizes it by curvature kappa and torsion tau as a function of arc length s, was applied on 77 proteins to determine all secondary structural units via specific kappa(s) and tau(s) patterns. A total of 533 alpha-helices and 644 beta-strands were recognized by APSA, whereas DSSP gives 536 and 651 units, respectively. Kinks and distortions were quantified and the boundaries (entry and exit) of secondary structures were classified. Similarity between proteins can be easily quantified using APSA, as was demonstrated for the roll architecture of proteins ubiquitin and spinach ferridoxin. A twenty-by-twenty comparison of all alpha domains showed that the curvature-torsion patterns generated by APSA provide an accurate and meaningful similarity measurement for secondary, super secondary, and tertiary protein structure. APSA is shown to accurately reflect the conformation of the backbone effectively reducing three-dimensional structure information to two-dimensional representations that are easy to interpret and understand.
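
    A minimal numerical sketch of computing the curvature and torsion of a sampled 3-D backbone curve, using finite differences rather than APSA's own spline-based formulation (the input format is an assumption):

        import numpy as np

        def curvature_torsion(points):
            """points: (n, 3) samples of a space curve; returns per-point kappa and tau."""
            r1 = np.gradient(points, axis=0)      # r'
            r2 = np.gradient(r1, axis=0)          # r''
            r3 = np.gradient(r2, axis=0)          # r'''
            cross = np.cross(r1, r2)
            speed = np.linalg.norm(r1, axis=1)
            # kappa = |r' x r''| / |r'|^3, tau = (r' x r'') . r''' / |r' x r''|^2
            kappa = np.linalg.norm(cross, axis=1) / np.clip(speed ** 3, 1e-12, None)
            tau = np.einsum("ij,ij->i", cross, r3) / np.clip(np.linalg.norm(cross, axis=1) ** 2, 1e-12, None)
            return kappa, tau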

  8. Automating the Process of Work-Piece Recognition and Location for a Pick-and-Place Robot in a SFMS

    Directory of Open Access Journals (Sweden)

    R. V. Sharan

    2014-03-01

    Full Text Available This paper reports the development of a vision system to automatically classify work-pieces with respect to their shape and color together with determining their location for manipulation by an in-house developed pick-and-place robot from its work-plane. The vision-based pick-and-place robot has been developed as part of a smart flexible manufacturing system for unloading work-pieces for drilling operations at a drilling workstation from an automatic guided vehicle designed to transport the work-pieces in the manufacturing work-cell. Work-pieces with three different shapes and five different colors are scattered on the work-plane of the robot and manipulated based on the shape and color specification by the user through a graphical user interface. The number of corners and the hue, saturation, and value of the colors are used for shape and color recognition respectively in this work. Due to the distinct nature of the feature vectors for the fifteen work-piece classes, all work-pieces were successfully classified using minimum distance classification during repeated experimentations with work-pieces scattered randomly on the work-plane.

  9. Robust Recognition Method of Speech Under Stress

    Institute of Scientific and Technical Information of China (English)

    韩纪庆; 张磊; 王承发

    2000-01-01

    There are many stressful environments which deteriorate the performance of speech recognition systems. Techniques for compensating for the influence of stress can help neutralize stressed speech and improve the robustness of speech recognition systems. In this paper, we summarize the approaches for robust recognition of speech under stress and review the advances in the area.

  10. Testing of Haar-Like Feature in Region of Interest Detection for Automated Target Recognition (ATR) System

    Science.gov (United States)

    Zhang, Yuhan; Lu, Dr. Thomas

    2010-01-01

    The objectives of this project were to develop a ROI (Region of Interest) detector using Haar-like features, similar to the face detection in Intel's OpenCV library, implement it in Matlab code, and test the performance of the new ROI detector against the existing ROI detector that uses an Optimal Trade-off Maximum Average Correlation Height (OTMACH) filter. The ROI detector included three parts: (1) automated Haar-like feature selection, to find a small set of the most relevant Haar-like features for detecting ROIs that contain a target; (2) a neural network trained to recognize ROIs with targets by taking the selected Haar-like features as inputs; and (3) a filtering method to process the neural network responses into a small set of regions of interest. All three parts needed to be coded in Matlab, and the parameters of the detector were trained by machine learning and tested on specific datasets. Since the OpenCV library and its Haar-like features were not available in Matlab, the Haar-like feature calculation had to be implemented in Matlab; code for Adaptive Boosting and max/min filters could be found on the Internet but needed to be integrated to serve the purpose of this project. The performance of the new detector was tested by comparing its accuracy and speed against the existing OTMACH detector, where speed refers to the average time needed to find the regions of interest in an image and accuracy is measured by the number of false positives (false alarms) at the same detection rate between the two detectors.
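
    A minimal sketch of an integral image and a two-rectangle Haar-like feature of the kind referred to above (the region layout and names are illustrative, not the project's Matlab implementation):

        import numpy as np

        def integral_image(img):
            """Cumulative sums over rows and columns so any rectangle sum costs O(1)."""
            return img.cumsum(axis=0).cumsum(axis=1)

        def rect_sum(ii, r0, c0, r1, c1):
            """Sum of img[r0:r1, c0:c1] from the integral image ii."""
            total = ii[r1 - 1, c1 - 1]
            if r0 > 0:
                total -= ii[r0 - 1, c1 - 1]
            if c0 > 0:
                total -= ii[r1 - 1, c0 - 1]
            if r0 > 0 and c0 > 0:
                total += ii[r0 - 1, c0 - 1]
            return total

        def haar_two_rect_vertical(ii, r, c, h, w):
            """Left (white) minus right (black) half of an h x 2w window placed at (r, c)."""
            return rect_sum(ii, r, c, r + h, c + w) - rect_sum(ii, r, c + w, r + h, c + 2 * w)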

  11. An Approach to Hide Secret Speech Information

    Institute of Scientific and Technical Information of China (English)

    WU Zhi-jun; DUAN Hai-xin; LI Xing

    2006-01-01

    This paper presented an approach to hide secret speech information in code excited linear prediction(CELP)-based speech coding scheme by adopting the analysis-by-synthesis (ABS)-based algorithm of speech information hiding and extracting for the purpose of secure speech communication. The secret speech is coded in 2.4Kb/s mixed excitation linear prediction (MELP), which is embedded in CELP type public speech. The ABS algorithm adopts speech synthesizer in speech coder. Speech embedding and coding are synchronous, i.e. a fusion of speech information data of public and secret. The experiment of embedding 2.4 Kb/s MELP secret speech in G.728 scheme coded public speech transmitted via public switched telephone network (PSTN) shows that the proposed approach satisfies the requirements of information hiding, meets the secure communication speech quality constraints, and achieves high hiding capacity of average 3.2 Kb/s with an excellent speech quality and complicating speakers' recognition.

  12. Speech perception of noise with binary gains

    DEFF Research Database (Denmark)

    Wang, DeLiang; Kjems, Ulrik; Pedersen, Michael Syskind;

    2008-01-01

    For a given mixture of speech and noise, an ideal binary time-frequency mask is constructed by comparing speech energy and noise energy within local time-frequency units. It is observed that listeners achieve nearly perfect speech recognition from gated noise with binary gains prescribed by the ideal binary mask. Only 16 filter channels and a frame rate of 100 Hz are sufficient for high intelligibility. The results show that, despite a dramatic reduction of speech information, a pattern of binary gains provides an adequate basis for speech perception.
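
    A minimal sketch of constructing an ideal binary mask from separately available speech and noise signals and applying it to their mixture (a common STFT formulation with a 0 dB local criterion; the gammatone filterbank and gated-noise stimuli of the study are not reproduced here):

        import numpy as np
        from scipy.signal import stft, istft

        def ideal_binary_mask(speech, noise, fs, nperseg=512):
            """speech and noise are time-aligned arrays of equal length; returns the mask and the masked mixture."""
            _, _, S = stft(speech, fs, nperseg=nperseg)
            _, _, N = stft(noise, fs, nperseg=nperseg)
            mask = (np.abs(S) ** 2 > np.abs(N) ** 2).astype(float)    # 1 where speech energy dominates
            _, _, M = stft(speech + noise, fs, nperseg=nperseg)
            _, gated = istft(M * mask, fs, nperseg=nperseg)           # resynthesised masked mixture
            return mask, gated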

  13. Design of self-tracking smart car based on speech recognition and infrared photoelectric sensor

    Institute of Scientific and Technical Information of China (English)

    李新科; 高潮; 郭永彩; 何卫华

    2011-01-01

    A self-tracking smart car has been designed based on infrared photoelectric sensors and speech recognition technology. The car adopts the 16-bit Sunplus SPCE061A single-chip microcontroller as the core processor of the control circuit and obtains path information from reflective infrared photoelectric sensors. The car adjusts its direction and speed according to the position of the black line in the path information, thereby implementing self-tracking. Speech-processing API functions were written for the SPCE061A to provide voice-based human-machine interaction for intelligent navigation control. Experiments indicate that the smart car meets the functional requirements and runs reliably and stably. The technique can be applied in fields such as smart wheelchairs for the physically disabled, service robots, intelligent toys, driverless vehicles and warehouses.

  14. Research on Tone Recognition of Mandarin Continuous Speech Based on Multi-space Probability Distribution

    Institute of Scientific and Technical Information of China (English)

    倪崇嘉; 刘文举; 徐波

    2011-01-01

    Mandarin Chinese is a tonal language, and tone information is important for Mandarin speech recognition. We propose a method for recognizing the tones of Mandarin continuous speech that combines an embedded tone model with an explicit tone model, which makes it possible to fuse short-term and longer-term fundamental frequency information. Experiments on the "863-Test" and "TestCorpus98" test sets show that the proposed method achieves tone recognition accuracies of 96.12% and 93.78%, respectively.

  15. Research on Continuous Speech Recognition Based on HTK by MatLab Programming

    Institute of Scientific and Technical Information of China (English)

    李理; 王冬霞

    2014-01-01

    Based on the basic principles of HTK (the HMM Toolkit), small-vocabulary continuous speech is recognized by calling HTK from MatLab programs in this work. HTK is used to build the hidden Markov models (HMM), and MatLab loops are programmed to run the recognition experiments, which avoids the redundancy of running individual HTK commands one by one and reduces the operational complexity.

  16. Speech Recognition System Based on the Embedded WinCE OS

    Institute of Scientific and Technical Information of China (English)

    张晶; 李心广; 王金矿

    2008-01-01

    In this paper, the WinCE operating system is customized and ported on a development platform based on the Intel PXA270 embedded microprocessor. Combined with the WinCE 5.0 speech interface, the Speech Application Programming Interface (SAPI 5.0), an embedded speech recognition system is successfully developed using Embedded Visual C++ 4.0 (EVC).

  17. Annotating Speech Corpus for Prosody Modeling in Indian Language Text to Speech Systems

    Directory of Open Access Journals (Sweden)

    Kiruthiga S

    2012-01-01

    Full Text Available A spoken language system, whether a speech synthesis or a speech recognition system, starts with building a speech corpus. We give a detailed survey of the issues and a methodology for selecting the appropriate speech unit when building a speech corpus for Indian-language text-to-speech systems. The paper ultimately aims to improve the intelligibility of the synthesized speech in text-to-speech synthesis systems. To begin with, an appropriate text file should be selected for building the speech corpus. Then a corresponding speech file is generated and stored; this speech file is the phonetic representation of the selected text file. The speech file is processed at different levels, viz. paragraphs, sentences, phrases, words, syllables and phones, which are called the speech units of the file. Research has been done taking each of these units as the basic unit for processing. This paper analyses work using phones, diphones, triphones, syllables and polysyllables as the basic unit for speech synthesis, and also provides a recommended set of combinations for polysyllables. Concatenative speech synthesis involves the concatenation of these basic units to synthesize intelligible, natural-sounding speech. The speech units are annotated with relevant prosodic information about each unit, manually or automatically, based on an algorithm. The database consisting of the units along with their annotated information is called the annotated speech corpus. A clustering technique is used in the annotated speech corpus to provide a way to select the appropriate unit for concatenation, based on the lowest total join cost of the speech unit.

  18. Speaker Recognition

    DEFF Research Database (Denmark)

    Mølgaard, Lasse Lohilahti; Jørgensen, Kasper Winther

    2005-01-01

    Speaker recognition is basically divided into speaker identification and speaker verification. Verification is the task of automatically determining if a person really is the person he or she claims to be. This technology can be used as a biometric feature for verifying the identity of a person...... in applications like banking by telephone and voice mail. The focus of this project is speaker identification, which consists of mapping a speech signal from an unknown speaker to a database of known speakers, i.e. the system has been trained with a number of speakers which the system can recognize....

  19. A New Database for Speaker Recognition

    DEFF Research Database (Denmark)

    Feng, Ling; Hansen, Lars Kai

    2005-01-01

    In this paper we discuss properties of speech databases used for speaker recognition research and evaluation, and we characterize some popular standard databases. The paper presents a new database called ELSDSR dedicated to speaker recognition applications. The main characteristics of this database...... are: English spoken by non-native speakers, a single session of sentence reading and relatively extensive speech samples suitable for learning person specific speech characteristics....

  20. Speech and gesture interfaces for squad-level human-robot teaming

    Science.gov (United States)

    Harris, Jonathan; Barber, Daniel

    2014-06-01

    As the military increasingly adopts semi-autonomous unmanned systems for military operations, utilizing redundant and intuitive interfaces for communication between Soldiers and robots is vital to mission success. Currently, Soldiers use a common lexicon to verbally and visually communicate maneuvers between teammates. In order for robots to be seamlessly integrated within mixed-initiative teams, they must be able to understand this lexicon. Recent innovations in gaming platforms have led to advancements in speech and gesture recognition technologies, but the reliability of these technologies for enabling communication in human robot teaming is unclear. The purpose for the present study is to investigate the performance of Commercial-Off-The-Shelf (COTS) speech and gesture recognition tools in classifying a Squad Level Vocabulary (SLV) for a spatial navigation reconnaissance and surveillance task. The SLV for this study was based on findings from a survey conducted with Soldiers at Fort Benning, GA. The items of the survey focused on the communication between the Soldier and the robot, specifically in regards to verbally instructing them to execute reconnaissance and surveillance tasks. Resulting commands, identified from the survey, were then converted to equivalent arm and hand gestures, leveraging existing visual signals (e.g. U.S. Army Field Manual for Visual Signaling). A study was then run to test the ability of commercially available automated speech recognition technologies and a gesture recognition glove to classify these commands in a simulated intelligence, surveillance, and reconnaissance task. This paper presents classification accuracy of these devices for both speech and gesture modalities independently.