WorldWideScience

Sample records for automated speech recognition

  1. Health Care in Home Automation Systems with Speech Recognition and Mobile Technology

    Directory of Open Access Journals (Sweden)

    Jasmin Kurti

    2016-08-01

    Home automation systems use technology to facilitate the lives of the people using them, and they are especially useful for assisting the elderly and persons with special needs. These kinds of systems have been a popular research subject in recent years. In this work, I present the design and development of a system that provides a life-assistant service in a home environment: a smart home-based healthcare system controlled with speech recognition and mobile technology. This includes developing software with speech recognition, speech synthesis, face recognition, controls for Arduino hardware, and a smartphone application for remotely controlling the system. With the developed system, the elderly and persons with special needs can stay independently in their own home, secure and with care facilities at hand. The system is tailored towards the elderly and disabled, but it can also be embedded in any home and used by anybody. It provides healthcare, security, entertainment, and total local and remote control of the home.
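
    The record above does not name its recognition engine, so as a rough illustration of the command-dispatch pattern such a system implies, here is a minimal Python sketch using the open-source SpeechRecognition package; the command phrases and handler actions are hypothetical placeholders, and the Arduino calls are stubbed with prints.

```python
# Minimal sketch only: the record does not name its recognition engine.
# Assumes the open-source SpeechRecognition package (pip install SpeechRecognition);
# command phrases and handler actions below are hypothetical placeholders.
import speech_recognition as sr

COMMANDS = {
    "lights on": lambda: print("-> Arduino: lights ON"),    # stub for a serial write
    "lights off": lambda: print("-> Arduino: lights OFF"),
    "call help": lambda: print("-> dialing emergency contact"),
}

def listen_and_dispatch():
    recognizer = sr.Recognizer()
    with sr.Microphone() as source:
        recognizer.adjust_for_ambient_noise(source)  # calibrate to room noise
        audio = recognizer.listen(source)
    try:
        text = recognizer.recognize_google(audio).lower()  # any ASR backend works here
    except sr.UnknownValueError:
        return  # nothing intelligible was heard
    for phrase, action in COMMANDS.items():
        if phrase in text:
            action()

if __name__ == "__main__":
    listen_and_dispatch()
```

    A real deployment would replace the print stubs with serial writes to the Arduino and run the dispatch loop continuously.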

  2. Acoustic diagnosis of pulmonary hypertension: automated speech-recognition-inspired classification algorithm outperforms physicians

    Science.gov (United States)

    Kaddoura, Tarek; Vadlamudi, Karunakar; Kumar, Shine; Bobhate, Prashant; Guo, Long; Jain, Shreepal; Elgendi, Mohamed; Coe, James Y.; Kim, Daniel; Taylor, Dylan; Tymchak, Wayne; Schuurmans, Dale; Zemp, Roger J.; Adatia, Ian

    2016-09-01

    We hypothesized that an automated speech-recognition-inspired classification algorithm could differentiate between the heart sounds in subjects with and without pulmonary hypertension (PH) and outperform physicians. Heart sounds, electrocardiograms, and mean pulmonary artery pressures (mPAp) were recorded simultaneously. Heart sound recordings were digitized to train and test speech-recognition-inspired classification algorithms. We used mel-frequency cepstral coefficients to extract features from the heart sounds. Gaussian-mixture models classified the features as PH (mPAp ≥ 25 mmHg) or normal (mPAp < 25 mmHg). Physicians blinded to patient data listened to the same heart sound recordings and attempted a diagnosis. We studied 164 subjects: 86 with mPAp ≥ 25 mmHg (mPAp 41 ± 12 mmHg) and 78 with mPAp < 25 mmHg (mPAp 17 ± 5 mmHg) (p < 0.005). The correct diagnostic rate of the automated speech-recognition-inspired algorithm was 74% compared to 56% by physicians (p = 0.005). The false positive rate for the algorithm was 34% versus 50% (p = 0.04) for clinicians. The false negative rate for the algorithm was 23% versus 68% (p = 0.0002) for physicians. We developed an automated speech-recognition-inspired classification algorithm for the acoustic diagnosis of PH that outperforms physicians and could be used to screen for PH and encourage earlier specialist referral.
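
    As a rough sketch of the pipeline this abstract describes (MFCC features classified by per-class Gaussian mixture models), the following Python fragment uses librosa and scikit-learn; the sampling rate, mixture count and file lists are illustrative assumptions, not the authors' settings.

```python
# Minimal sketch of the record's pipeline: MFCC features + one Gaussian
# mixture model per class, assuming librosa and scikit-learn. Thresholds,
# mixture counts and file lists are placeholders, not the authors' settings.
import numpy as np
import librosa
from sklearn.mixture import GaussianMixture

def mfcc_features(path, sr=4000, n_mfcc=13):
    y, sr = librosa.load(path, sr=sr)            # heart sounds are low-frequency
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).T  # frames x coefficients

def train_class_model(paths, n_components=8):
    X = np.vstack([mfcc_features(p) for p in paths])
    return GaussianMixture(n_components=n_components).fit(X)

def classify(path, gmm_ph, gmm_normal):
    X = mfcc_features(path)
    # Compare average frame log-likelihood under each class model
    return "PH" if gmm_ph.score(X) > gmm_normal.score(X) else "normal"
```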

  3. Development of an automated speech recognition interface for personal emergency response systems

    Directory of Open Access Journals (Sweden)

    Mihailidis Alex

    2009-07-01

    Abstract Background Demands on long-term-care facilities are predicted to increase at an unprecedented rate as the baby boomer generation reaches retirement age. Aging-in-place (i.e., aging at home) is the desire of most seniors and is also a good option to reduce the burden on an over-stretched long-term-care system. Personal Emergency Response Systems (PERSs) help enable older adults to age in place by providing them with immediate access to emergency assistance. Traditionally they operate with push-button activators that connect the occupant via speaker-phone to a live emergency call-centre operator. If occupants do not wear the push button or cannot access the button, then the system is useless in the event of a fall or emergency. Additionally, a false alarm or failure to check in at a regular interval will trigger a connection to a live operator, which can be unwanted and intrusive to the occupant. This paper describes the development and testing of an automated, hands-free, dialogue-based PERS prototype. Methods The prototype system was built using a ceiling-mounted microphone array, an open-source automatic speech recognition engine, and a 'yes'/'no' response dialogue modelled after an existing call-centre protocol. Testing compared a single microphone versus a microphone array with nine adults in both noisy and quiet conditions. Dialogue testing was completed with four adults. Results and discussion The microphone array demonstrated improvement over the single microphone. In all cases, dialogue testing resulted in the system reaching the correct decision about the kind of assistance the user was requesting. Further testing is required with elderly voices and under different noise conditions to ensure the appropriateness of the technology. Future developments include integration of the system with an emergency detection method as well as communication enhancement using features such as barge-in capability. Conclusion The use of an automated

  4. Advances in Speech Recognition

    CERN Document Server

    Neustein, Amy

    2010-01-01

    This volume comprises contributions from eminent leaders in the speech industry, and presents a comprehensive and in-depth analysis of the progress of speech technology in the topical areas of mobile settings, healthcare and call centers. The material addresses the technical aspects of voice technology within the framework of societal needs, such as the use of speech recognition software to produce up-to-date electronic health records, notwithstanding patients making changes to health plans and physicians. Included will be discussion of speech engineering, linguistics, human factors ana...

  5. The application of manifold based visual speech units for visual speech recognition

    OpenAIRE

    Yu, Dahai

    2008-01-01

    This dissertation presents a new learning-based representation that is referred to as a Visual Speech Unit for visual speech recognition (VSR). The automated recognition of human speech using only features from the visual domain has become a significant research topic that plays an essential role in the development of many multimedia systems such as audio-visual speech recognition (AVSR), mobile phone applications, human-computer interaction (HCI) and sign language recognition. The inclusio...

  6. Speech recognition in university classrooms

    OpenAIRE

    Wald, Mike; Bain, Keith; Basson, Sara H

    2002-01-01

    The LIBERATED LEARNING PROJECT (LLP) is an applied research project studying two core questions: 1) Can speech recognition (SR) technology successfully digitize lectures to display spoken words as text in university classrooms? 2) Can speech recognition technology be used successfully as an alternative to traditional classroom notetaking for persons with disabilities? This paper addresses these intriguing questions and explores the underlying complex relationship between speech recognition te...

  7. Emotion Recognition using Speech Features

    CERN Document Server

    Rao, K Sreenivasa

    2013-01-01

    “Emotion Recognition Using Speech Features” covers emotion-specific features present in speech and discussion of suitable models for capturing emotion-specific information for distinguishing different emotions.  The content of this book is important for designing and developing  natural and sophisticated speech systems. Drs. Rao and Koolagudi lead a discussion of how emotion-specific information is embedded in speech and how to acquire emotion-specific knowledge using appropriate statistical models. Additionally, the authors provide information about using evidence derived from various features and models. The acquired emotion-specific knowledge is useful for synthesizing emotions. Discussion includes global and local prosodic features at syllable, word and phrase levels, helpful for capturing emotion-discriminative information; use of complementary evidences obtained from excitation sources, vocal tract systems and prosodic features in order to enhance the emotion recognition performance;  and pro...

  8. Time-expanded speech and speech recognition in older adults.

    Science.gov (United States)

    Vaughan, Nancy E; Furukawa, Izumi; Balasingam, Nirmala; Mortz, Margaret; Fausti, Stephen A

    2002-01-01

    Speech understanding deficits are common in older adults. In addition to hearing sensitivity, changes in certain cognitive functions may affect speech recognition. One such change that may impact the ability to follow a rapidly changing speech signal is processing speed. When speakers slow the rate of their speech naturally in order to speak clearly, speech recognition is improved. The acoustic characteristics of naturally slowed speech are of interest in developing time-expansion algorithms to improve speech recognition for older listeners. In this study, we tested younger normally hearing, older normally hearing, and older hearing-impaired listeners on time-expanded speech using increased duration and increased intensity of unvoiced consonants. Although all groups performed best on unprocessed speech, performance with processed speech was better with the consonant gain feature without time expansion in the noise condition and better at the slowest time-expanded rate in the quiet condition. The effects of signal processing on speech recognition are discussed. PMID:17642020
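
    The consonant-selective processing used in the study is not reproduced here, but uniform time expansion itself is easy to demonstrate; the sketch below assumes librosa's phase-vocoder time stretch and a hypothetical input file speech.wav.

```python
# Uniform time expansion via phase-vocoder time stretching (librosa).
# Note: the study expanded durations and boosted unvoiced consonants
# selectively; this global stretch is only a simplified stand-in.
import librosa
import soundfile as sf

y, sr = librosa.load("speech.wav", sr=None)          # assumed input file
slowed = librosa.effects.time_stretch(y, rate=0.7)   # rate < 1 lengthens the signal
sf.write("speech_expanded.wav", slowed, sr)
```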

  9. Lattice Parsing for Speech Recognition

    OpenAIRE

    Chappelier, Jean-Cédric; Rajman, Martin; Aragües, Ramon; Rozenknop, Antoine

    1999-01-01

    A lot of work remains to be done in the domain of a better integration of speech recognition and language processing systems. This paper gives an overview of several strategies for integrating linguistic models into speech understanding systems and investigates several ways of producing sets of hypotheses that include more "semantic" variability than usual language models. The main goal is to present and demonstrate by actual experiments that sequential coupling may be efficiently achieved by w...

  10. Speech recognition from spectral dynamics

    Indian Academy of Sciences (India)

    Hynek Hermansky

    2011-10-01

    Information is carried in changes of a signal. The paper starts by revisiting Dudley’s concept of the carrier nature of speech. It points to its close connection to modulation spectra of speech and argues against short-term spectral envelopes as dominant carriers of the linguistic information in speech. The history of spectral representations of speech is briefly discussed. Some of the history of the gradual infusion of the modulation spectrum concept into automatic recognition of speech (ASR) comes next, pointing to the relationship of modulation spectrum processing to well-accepted ASR techniques such as dynamic speech features or RelAtive SpecTrAl (RASTA) filtering. Next, the frequency-domain perceptual linear prediction technique for deriving autoregressive models of temporal trajectories of spectral power in individual frequency bands is reviewed. Finally, posterior-based features, which allow for straightforward application of modulation frequency domain information, are described. The paper is tutorial in nature, aims at a historical global overview of attempts at using spectral dynamics in machine recognition of speech, and does not always provide enough detail of the described techniques. However, extensive references to earlier work are provided to compensate for the lack of detail in the paper.
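
    To make the RASTA idea concrete, here is a small sketch that band-pass filters log-spectral trajectories along time using the filter commonly cited from Hermansky and Morgan (1994); the exact pole value (0.98 here; 0.94 appears in some implementations) is an assumption.

```python
# RASTA-style band-pass filtering of log-spectral trajectories, per band,
# using the commonly cited filter H(z) = 0.1*(2 + z^-1 - z^-3 - 2 z^-4) /
# (1 - 0.98 z^-1); pole value varies between implementations.
import numpy as np
from scipy.signal import lfilter

def rasta_filter(log_spectrum):
    """log_spectrum: array (n_frames, n_bands) of log band energies."""
    b = 0.1 * np.array([2.0, 1.0, 0.0, -1.0, -2.0])
    a = np.array([1.0, -0.98])
    return lfilter(b, a, log_spectrum, axis=0)  # filter along the time axis
```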

  11. Discriminative learning for speech recognition

    CERN Document Server

    He, Xiaodong

    2008-01-01

    In this book, we introduce the background and mainstream methods of probabilistic modeling and discriminative parameter optimization for speech recognition. The specific models treated in depth include the widely used exponential-family distributions and the hidden Markov model. A detailed study is presented on unifying the common objective functions for discriminative learning in speech recognition, namely maximum mutual information (MMI), minimum classification error, and minimum phone/word error. The unification is presented, with rigorous mathematical analysis, in a common rational-functio

  12. Pattern recognition in speech and language processing

    CERN Document Server

    Chou, Wu

    2003-01-01

    Minimum Classification Error (MCE) Approach in Pattern Recognition, Wu Chou; Minimum Bayes-Risk Methods in Automatic Speech Recognition, Vaibhava Goel and William Byrne; A Decision Theoretic Formulation for Adaptive and Robust Automatic Speech Recognition, Qiang Huo; Speech Pattern Recognition Using Neural Networks, Shigeru Katagiri; Large Vocabulary Speech Recognition Based on Statistical Methods, Jean-Luc Gauvain; Toward Spontaneous Speech Recognition and Understanding, Sadaoki Furui; Speaker Authentication, Qi Li and Biing-Hwang Juang; HMMs for Language Processing Problems, Ri...

  13. Speech Recognition on Mobile Devices

    DEFF Research Database (Denmark)

    Tan, Zheng-Hua; Lindberg, Børge

    2010-01-01

    The enthusiasm of deploying automatic speech recognition (ASR) on mobile devices is driven both by remarkable advances in ASR technology and by the demand for efficient user interfaces on such devices as mobile phones and personal digital assistants (PDAs). This chapter presents an overview of ASR...

  14. On speech recognition during anaesthesia

    DEFF Research Database (Denmark)

    Alapetite, Alexandre

    2007-01-01

    This PhD thesis in human-computer interfaces (informatics) studies the case of the anaesthesia record used during medical operations and the possibility to supplement it with speech recognition facilities. Problems and limitations have been identified with the traditional paper-based anaesthesia record, in particular inaccuracies in the anaesthesia record. Supplementing the electronic anaesthesia record interface with speech input facilities is proposed as one possible solution to a part of the problem. The testing of the various hypotheses has involved the development of a prototype of an electronic anaesthesia record ... accuracy. Finally, the last part of the thesis looks at the acceptance and success of a speech recognition system introduced in a Danish hospital to produce patient records.

  15. Novel Techniques for Dialectal Arabic Speech Recognition

    CERN Document Server

    Elmahdy, Mohamed; Minker, Wolfgang

    2012-01-01

    Novel Techniques for Dialectal Arabic Speech describes approaches to improve automatic speech recognition for dialectal Arabic. Since speech resources for dialectal Arabic speech recognition are very sparse, the authors describe how existing Modern Standard Arabic (MSA) speech data can be applied to dialectal Arabic speech recognition, while assuming that MSA is always a second language for all Arabic speakers. In this book, Egyptian Colloquial Arabic (ECA) has been chosen as a typical Arabic dialect. ECA is the first ranked Arabic dialect in terms of number of speakers, and a high quality ECA speech corpus with accurate phonetic transcription has been collected. MSA acoustic models were trained using news broadcast speech. In order to cross-lingually use MSA in dialectal Arabic speech recognition, the authors have normalized the phoneme sets for MSA and ECA. After this normalization, they have applied state-of-the-art acoustic model adaptation techniques like Maximum Likelihood Linear Regression (MLLR) and M...

  16. On speech recognition during anaesthesia

    OpenAIRE

    Alapetite, Alexandre

    2007-01-01

    This PhD thesis in human-computer interfaces (HCI, informatics) studies the case of the anaesthesia record used during medical operations and the possibility to supplement it with speech recognition facilities. Problems and limitations have been identified with the traditional paper-based anaesthesia record, but also with newer electronic versions, in particular ergonomic issues and the fact that anaesthesiologists tend to postpone the registration of the medications and other events during b...

  17. Comparison of Speech Features on the Speech Recognition Task

    Directory of Open Access Journals (Sweden)

    Iosif Mporas

    2007-01-01

    In the present work we overview some recently proposed discrete Fourier transform (DFT)- and discrete wavelet packet transform (DWPT)-based speech parameterization methods and evaluate their performance on the speech recognition task. Specifically, in order to assess the practical value of these less-studied speech parameterization methods, we evaluate them in a common experimental setup and compare their performance against traditional techniques, such as the Mel-frequency cepstral coefficients (MFCC) and perceptual linear predictive (PLP) cepstral coefficients, which presently dominate the speech recognition field. In particular, utilizing the well-established TIMIT speech corpus and employing the Sphinx-III speech recognizer, we present comparative results for 8 different speech parameterization techniques.

  18. Automatic speech recognition a deep learning approach

    CERN Document Server

    Yu, Dong

    2015-01-01

    This book summarizes the recent advancement in the field of automatic speech recognition with a focus on discriminative and hierarchical models. This will be the first automatic speech recognition book to include a comprehensive coverage of recent developments such as conditional random field and deep learning techniques. It presents insights and theoretical foundation of a series of recent models such as conditional random field, semi-Markov and hidden conditional random field, deep neural network, deep belief network, and deep stacking models for sequential learning. It also discusses practical considerations of using these models in both acoustic and language modeling for continuous speech recognition.

  19. Speech Recognition in Natural Background Noise

    OpenAIRE

    Julien Meyer; Laure Dentel; Fanny Meunier

    2013-01-01

    In the real world, human speech recognition nearly always involves listening in background noise. The impact of such noise on speech signals and on intelligibility performance increases with the separation of the listener from the speaker. The present behavioral experiment provides an overview of the effects of such acoustic disturbances on speech perception in conditions approaching ecologically valid contexts. We analysed the intelligibility loss in spoken word lists with increasing listene...

  20. Robust speech recognition using articulatory information

    OpenAIRE

    Kirchhoff, Katrin

    1999-01-01

    Current automatic speech recognition systems make use of a single source of information about their input, viz. a preprocessed form of the acoustic speech signal, which encodes the time-frequency distribution of signal energy. The goal of this thesis is to investigate the benefits of integrating articulatory information into state-of-the art speech recognizers, either as a genuine alternative to standard acoustic representations, or as an additional source of information. Articulatory informa...

  1. PCA-Based Speech Enhancement for Distorted Speech Recognition

    Directory of Open Access Journals (Sweden)

    Tetsuya Takiguchi

    2007-09-01

    We investigated a robust speech feature extraction method using kernel PCA (principal component analysis) for distorted speech recognition. Kernel PCA has been suggested for various image processing tasks requiring an image model, such as denoising, where a noise-free image is constructed from a noisy input image. Much research on robust speech feature extraction has been done, but it remains difficult to completely remove additive or convolutional noise (distortion). The most commonly used noise-removal techniques are based on spectral-domain operations; for speech recognition, the MFCC (Mel-frequency cepstral coefficient) is then computed, where the DCT (discrete cosine transform) is applied to the mel-scale filter bank output. This paper describes a new PCA-based speech enhancement algorithm using kernel PCA instead of DCT, where the main speech element is projected onto low-order features, while the noise or distortion element is projected onto high-order features. Its effectiveness is confirmed by word recognition experiments on distorted speech.
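
    The following is a conceptual sketch of kernel-PCA denoising of feature vectors using scikit-learn's pre-image estimation; note the record's actual algorithm substitutes kernel PCA for the DCT inside MFCC computation, which is not reproduced here, and the synthetic data merely stand in for filter-bank frames.

```python
# Conceptual sketch of kernel-PCA denoising of feature vectors (scikit-learn).
# The record's actual system replaces the DCT inside MFCC computation with
# kernel PCA; that step is not reproduced here. Synthetic data stand in for
# mel filter-bank frames.
import numpy as np
from sklearn.decomposition import KernelPCA

rng = np.random.default_rng(0)
clean = rng.normal(size=(500, 24))              # stand-in for clean feature frames
noisy = clean + 0.3 * rng.normal(size=clean.shape)

kpca = KernelPCA(n_components=8, kernel="rbf", gamma=0.05,
                 fit_inverse_transform=True)    # enables pre-image estimation
kpca.fit(clean)                                 # learn the low-order subspace
denoised = kpca.inverse_transform(kpca.transform(noisy))
print("MSE before:", float(np.mean((noisy - clean) ** 2)),
      "after:", float(np.mean((denoised - clean) ** 2)))
```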

  2. Connected digit speech recognition system for Malayalam language

    Indian Academy of Sciences (India)

    Cini Kurian; Kannan Balakrishnan

    2013-12-01

    Connected digit speech recognition is important in many applications, such as automated banking systems, catalogue dialing, and automatic data entry. This paper presents an optimum speaker-independent connected digit recognizer for the Malayalam language. The system employs Perceptual Linear Predictive (PLP) cepstral coefficients for speech parameterization and continuous-density Hidden Markov Models (HMM) in the recognition process. The Viterbi algorithm is used for decoding. The training database has utterances from 21 speakers in the age group of 20 to 40 years, recorded in a normal office environment, where each speaker was asked to read 20 sets of continuous digits. The system obtained an accuracy of 99.5% with unseen data.
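
    A whole-word HMM recognizer in the spirit of this record can be sketched with hmmlearn and librosa as below; MFCCs stand in for the paper's PLP features (librosa ships no PLP front end), and the model sizes and file layout are assumptions.

```python
# Whole-word HMM recognizer sketch (hmmlearn + librosa). MFCCs stand in for
# the paper's PLP features; hmmlearn's score()/decode() run the forward and
# Viterbi passes internally. Model sizes and file layout are illustrative.
import numpy as np
import librosa
from hmmlearn.hmm import GaussianHMM

def features(path):
    y, sr = librosa.load(path, sr=16000)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13).T   # frames x coefficients

def train_digit_models(training_files):
    """training_files: dict mapping digit label -> list of wav paths."""
    models = {}
    for digit, paths in training_files.items():
        feats = [features(p) for p in paths]
        X, lengths = np.vstack(feats), [len(f) for f in feats]
        models[digit] = GaussianHMM(n_components=5, covariance_type="diag",
                                    n_iter=20).fit(X, lengths)
    return models

def recognize(path, models):
    X = features(path)
    return max(models, key=lambda d: models[d].score(X))   # highest log-likelihood
```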

  3. Auditory—Spectrum Quantization Based Speech Recognition

    Institute of Scientific and Technical Information of China (English)

    Wu Yuanqing; Hao Jie; et al.

    1997-01-01

    Based on the analysis of the physiological and psychological characteristics of the human auditory system [1], we can classify human auditory processing into two hearing modes: an active one and a passive one. A novel approach to robust speech recognition, Auditory-spectrum Quantization Based Speech Recognition (AQBSR), is proposed. In this method, we intend to simulate the human active hearing mode and locate the effective areas of speech signals in the temporal domain and in the frequency domain. Adaptive filter banks are used in place of fixed-band filters to extract feature parameters. The effective speech components and their corresponding frequency areas of each word in the vocabulary can be found out during training. In the recognition stage, comparison between the unknown sound and the current template is maintained only in the effective areas of the template word. The control experiments show that the AQBSR method is more robust than traditional systems.

  4. Hidden neural networks: application to speech recognition

    DEFF Research Database (Denmark)

    Riis, Søren Kamaric

    1998-01-01

    We evaluate the hidden neural network HMM/NN hybrid on two speech recognition benchmark tasks: (1) task-independent isolated word recognition on the Phonebook database, and (2) recognition of broad phoneme classes in continuous speech from the TIMIT database. It is shown how hidden neural networks (HNNs) with many fewer parameters than conventional HMMs and other hybrids can obtain comparable performance, and for the broad-class task it is illustrated how the HNN can be applied as a purely transition-based system, where acoustic context-dependent transition probabilities are estimated by neural networks...

  5. Emotion recognition from speech: tools and challenges

    Science.gov (United States)

    Al-Talabani, Abdulbasit; Sellahewa, Harin; Jassim, Sabah A.

    2015-05-01

    Human emotion recognition from speech is studied frequently for its importance in many applications, e.g., human-computer interaction. There is wide diversity and disagreement about the basic emotions or emotion-related states on the one hand, and about where the emotion-related information lies in the speech signal on the other. These diversities motivate our investigation into extracting meta-features using the PCA approach, or using a non-adaptive random projection (RP), which significantly reduces the large-dimensional speech feature vectors that may contain a wide range of emotion-related information. Subsets of meta-features are fused to increase the performance of the recognition model that adopts the score-based LDC classifier. We shall demonstrate that our scheme outperforms state-of-the-art results when tested on non-prompted or acted databases (i.e., when subjects act specific emotions while uttering a sentence). However, the huge gap between accuracy rates achieved on the different types of speech datasets raises questions about the way emotions modulate speech. In particular, we shall argue that emotion recognition from speech should not be dealt with as a classification problem. We shall demonstrate the presence of a spectrum of different emotions in the same speech portion, especially in the non-prompted datasets, which tend to be more "natural" than the acted datasets, where the subjects attempt to suppress all but one emotion.

  6. Phonetic Alphabet for Speech Recognition of Czech

    OpenAIRE

    J. Uhlir; Psutka, J.; J. Nouza

    1997-01-01

    In the paper we introduce and discuss an alphabet that has been proposed for phonemically oriented automatic speech recognition. The alphabet, denoted as PAC (Phonetic Alphabet for Czech), consists of 48 basic symbols that allow for distinguishing all major events occurring in spoken Czech. The symbols can be used both for phonetic transcription of Czech texts as well as for labeling recorded speech signals. For practical reasons, the alphabet occurs in two versions; one utilizes Cze...

  7. Testing for robust speech recognition performance

    Science.gov (United States)

    Simpson, C. A.; Moore, C. A.; Ruth, J. C.

    Results are reported from two studies which evaluated speaker-dependent connected-speech template-matching algorithms. One study examined the recognition performance for vocabularies spoken within a spacesuit. Two token vocabularies were used that were recorded in different noise levels. The second study evaluated the rejection accuracy for two commercial speech recognizers. The spoken test tokens were variations on a single word. The tests underscored the inferiority of speech recognizers relative to the human capability for discerning among phonetically different words. However, one commercial recognizer exhibited over 96-percent rejection accuracy in a noisy environment.

  8. Novel acoustic features for speech emotion recognition

    Institute of Scientific and Technical Information of China (English)

    ROH Yong-Wan; KIM Dong-Ju; LEE Woo-Seok; HONG Kwang-Seok

    2009-01-01

    This paper focuses on acoustic features that effectively improve the recognition of emotion in human speech. The novel features in this paper are based on spectral-based entropy parameters such as fast Fourier transform (FFT) spectral entropy, delta FFT spectral entropy, Mel-frequency filter bank (MFB) spectral entropy, and delta MFB spectral entropy. Spectral-based entropy features are simple. They reflect the frequency characteristics of speech and how those characteristics change over time. We implement an emotion rejection module using the probability distributions of recognized-scores and rejected-scores. This reduces the false recognition rate to improve overall performance. Recognized-scores and rejected-scores refer to the probabilities of recognized and rejected emotion recognition results, respectively. These scores are first obtained from a pattern recognition procedure. The pattern recognition phase uses the Gaussian mixture model (GMM). We classify the four emotional states as anger, sadness, happiness and neutrality. The proposed method is evaluated using 45 sentences in each emotion for 30 subjects, 15 males and 15 females. Experimental results show that the proposed method is superior to existing emotion recognition methods based on GMM using energy, zero crossing rate (ZCR), linear prediction coefficients (LPC), and pitch parameters. We demonstrate the effectiveness of the proposed approach. One of the proposed features, combined MFB and delta MFB spectral entropy, improves performance by approximately 10% compared to the existing feature parameters for speech emotion recognition methods. We also demonstrate a 4% performance improvement from the applied emotion rejection with low-confidence scores.
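
    The entropy features themselves are straightforward to compute; the sketch below derives per-frame FFT spectral entropy and a regression-style delta in NumPy, with frame and hop sizes as illustrative assumptions.

```python
# Per-frame FFT spectral entropy and a regression-style delta feature,
# sketched in NumPy; frame/hop sizes are illustrative, not the paper's.
import numpy as np

def fft_spectral_entropy(signal, frame_len=400, hop=160):
    n_frames = 1 + (len(signal) - frame_len) // hop
    ent = np.empty(n_frames)
    for i in range(n_frames):
        frame = signal[i * hop: i * hop + frame_len] * np.hamming(frame_len)
        power = np.abs(np.fft.rfft(frame)) ** 2
        p = power / (power.sum() + 1e-12)        # normalize to a distribution
        ent[i] = -np.sum(p * np.log2(p + 1e-12)) # Shannon entropy of the spectrum
    return ent

def delta(feature, k=2):
    # Standard regression-style delta over +/- k frames (edges wrap; a sketch)
    num = sum(j * (np.roll(feature, -j) - np.roll(feature, j)) for j in range(1, k + 1))
    return num / (2 * sum(j * j for j in range(1, k + 1)))
```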

  9. Speech recognition: Acoustic, phonetic and lexical

    Science.gov (United States)

    Zue, V. W.

    1985-10-01

    Our long-term research goal is the development and implementation of speaker-independent continuous speech recognition systems. It is our conviction that proper utilization of speech-specific knowledge is essential for advanced speech recognition systems. With this in mind, we have continued to make progress on the acquisition of acoustic-phonetic and lexical knowledge. We have completed the development of a continuous digit recognition system. The system was constructed to investigate the utilization of acoustic-phonetic knowledge in a speech recognition system. Significant developments of this study include a soft-failure procedure for lexical access and the discovery of a set of acoustic-phonetic features for verification. We have completed a study of the constraints provided by lexical stress on word recognition. We found that lexical stress information alone can, on average, reduce the number of word candidates from a large dictionary by more than 80%. In conjunction with this study, we successfully developed a system that automatically determines the stress pattern of a word from the acoustic signal.

  10. Speech recognition employing biologically plausible receptive fields

    DEFF Research Database (Denmark)

    Fereczkowski, Michal; Bothe, Hans-Heinrich

    2011-01-01

    The main idea of the project is to build a widely speaker-independent, biologically motivated automatic speech recognition (ASR) system. The two main differences between our approach and current state-of-the-art ASRs are that i) the features used here are based on the responses of neuronlike spec...

  11. Effects of Cognitive Load on Speech Recognition

    Science.gov (United States)

    Mattys, Sven L.; Wiget, Lukas

    2011-01-01

    The effect of cognitive load (CL) on speech recognition has received little attention despite the prevalence of CL in everyday life, e.g., dual-tasking. To assess the effect of CL on the interaction between lexically-mediated and acoustically-mediated processes, we measured the magnitude of the "Ganong effect" (i.e., lexical bias on phoneme…

  12. Bimodal Emotion Recognition from Speech and Text

    Directory of Open Access Journals (Sweden)

    Weilin Ye

    2014-01-01

    This paper presents an approach to emotion recognition from speech signals and textual content. In the analysis of speech signals, thirty-seven acoustic features are extracted from the speech input. Two different classifiers, Support Vector Machines (SVMs) and a BP neural network, are adopted to classify the emotional states. In text analysis, we use a two-step classification method to recognize the emotional states. The final emotional state is determined based on the emotion outputs from the acoustic and textual analyses. In this paper we have two parallel classifiers for acoustic information and two serial classifiers for textual information, and a final decision is made by combining these classifiers in decision-level fusion. Experimental results show that the emotion recognition accuracy of the integrated system is better than that of either of the two individual approaches.
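
    Decision-level fusion of the kind described reduces to combining per-class scores; a toy Python sketch follows, in which the weight and the label set are placeholders rather than the paper's values.

```python
# Toy sketch of decision-level fusion: per-class scores from the acoustic and
# textual classifiers are combined with a weight w; the weight and label set
# are placeholders, not the paper's.
import numpy as np

EMOTIONS = ["anger", "happiness", "sadness", "neutral"]

def fuse(acoustic_scores, text_scores, w=0.6):
    """Each input: array of per-class posteriors summing to 1."""
    combined = w * np.asarray(acoustic_scores) + (1 - w) * np.asarray(text_scores)
    return EMOTIONS[int(np.argmax(combined))]

print(fuse([0.5, 0.2, 0.2, 0.1], [0.1, 0.6, 0.2, 0.1]))  # -> "happiness"
```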

  13. Robust Speech Recognition Using a Harmonic Model

    Institute of Scientific and Technical Information of China (English)

    XU Chao; CAO Zhigang

    2004-01-01

    Automatic speech recognition under conditions of a noisy environment remains a challenging problem. Traditionally, methods focused on noise structure, such as spectral subtraction, have been employed to address this problem, and thus the performance of such methods depends on the accuracy of noise estimation. In this paper, an alternative method using a harmonic-based spectral reconstruction algorithm is proposed to enhance robust automatic speech recognition. Neither noise estimation nor noise-model training is required in the proposed approach. A spectral-subtraction-integrated autocorrelation function is proposed to determine the pitch for the harmonic model. Recognition results show that the harmonic-based spectral reconstruction approach outperforms spectral subtraction in the middle and low signal-to-noise ratio (SNR) ranges. The advantage of the proposed method is more evident for non-stationary noise, as the algorithm does not require an assumption of stationary noise.
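
    For reference, the spectral-subtraction baseline that the harmonic approach is compared against can be sketched in a few lines with SciPy's STFT; the leading-frame noise estimate and the spectral floor are standard textbook choices, not the paper's exact settings.

```python
# Baseline spectral subtraction (the method the paper compares against),
# using SciPy's STFT. The noise magnitude is estimated from leading frames
# assumed to be speech-free; the floor prevents negative magnitudes.
import numpy as np
from scipy.signal import stft, istft

def spectral_subtraction(x, fs, noise_frames=10, floor=0.02):
    f, t, X = stft(x, fs, nperseg=512)
    mag, phase = np.abs(X), np.angle(X)
    noise = mag[:, :noise_frames].mean(axis=1, keepdims=True)  # noise estimate
    clean = np.maximum(mag - noise, floor * mag)               # subtract with floor
    _, y = istft(clean * np.exp(1j * phase), fs, nperseg=512)
    return y
```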

  14. Phonetic Alphabet for Speech Recognition of Czech

    Directory of Open Access Journals (Sweden)

    J. Uhlir

    1997-12-01

    In the paper we introduce and discuss an alphabet that has been proposed for phonemically oriented automatic speech recognition. The alphabet, denoted as PAC (Phonetic Alphabet for Czech), consists of 48 basic symbols that allow for distinguishing all major events occurring in spoken Czech. The symbols can be used both for phonetic transcription of Czech texts as well as for labeling recorded speech signals. For practical reasons, the alphabet occurs in two versions; one utilizes Czech native characters and the other employs symbols similar to those used for English in the DARPA and NIST alphabets.

  15. Novel acoustic features for speech emotion recognition

    Institute of Scientific and Technical Information of China (English)

    ROH Yong-Wan; KIM Dong-Ju; LEE Woo-Seok; HONG Kwang-Seok

    2009-01-01

    This paper focuses on acoustic features that effectively improve the recognition of emotion in human speech. The novel features in this paper are based on spectral-based entropy parameters such as fast Fourier transform (FFT) spectral entropy, delta FFT spectral entropy, Mel-frequency filter bank (MFB) spectral entropy, and delta MFB spectral entropy. Spectral-based entropy features are simple. They reflect the frequency characteristics of speech and how those characteristics change over time. We implement an emotion rejection module using the probability distributions of recognized-scores and rejected-scores. This reduces the false recognition rate to improve overall performance. Recognized-scores and rejected-scores refer to the probabilities of recognized and rejected emotion recognition results, respectively. These scores are first obtained from a pattern recognition procedure. The pattern recognition phase uses the Gaussian mixture model (GMM). We classify the four emotional states as anger, sadness, happiness and neutrality. The proposed method is evaluated using 45 sentences in each emotion for 30 subjects, 15 males and 15 females. Experimental results show that the proposed method is superior to existing emotion recognition methods based on GMM using energy, zero crossing rate (ZCR), linear prediction coefficients (LPC), and pitch parameters. We demonstrate the effectiveness of the proposed approach. One of the proposed features, combined MFB and delta MFB spectral entropy, improves performance by approximately 10% compared to the existing feature parameters for speech emotion recognition methods. We also demonstrate a 4% performance improvement from the applied emotion rejection with low-confidence scores.

  16. Speech recognition: Acoustic, phonetic and lexical knowledge

    Science.gov (United States)

    Zue, V. W.

    1985-08-01

    During this reporting period we continued to make progress on the acquisition of acoustic-phonetic and lexical knowledge. We completed development of a continuous digit recognition system. The system was constructed to investigate the use of acoustic-phonetic knowledge in a speech recognition system. The significant achievements of this study include the development of a soft-failure procedure for lexical access and the discovery of a set of acoustic-phonetic features for verification. We completed a study of the constraints that lexical stress imposes on word recognition. We found that lexical stress information alone can, on the average, reduce the number of word candidates from a large dictionary by more than 80 percent. In conjunction with this study, we successfully developed a system that automatically determines the stress pattern of a word from the acoustic signal. We performed an acoustic study on the characteristics of nasal consonants and nasalized vowels. We have also developed recognition algorithms for nasal murmurs and nasalized vowels in continuous speech. We finished the preliminary development of a system that aligns a speech waveform with the corresponding phonetic transcription.

  17. Phoneme fuzzy characterization in speech recognition systems

    Science.gov (United States)

    Beritelli, Francesco; Borrometi, Luca; Cuce, Antonino

    1997-10-01

    The acoustic approach to speech recognition has an important advantage compared with the pattern recognition approach: it presents lower complexity because it doesn't require explicit structures such as the hidden Markov model. In this work, we show how to characterize some phonetic classes of the Italian language in order to obtain a speaker- and vocabulary-independent speech recognition system. A phonetic database is built from 200 continuous speech sentences of 12 speakers, 6 females and 6 males. The sentences are sampled at 8000 Hz and manually labelled with Asystem Sound Impression Software to obtain about 1600 units. We analyzed several speech parameters such as formants, LPC and reflection coefficients, energy, normal/differential zero crossing rate, and cepstral and autocorrelation coefficients. The aim is the achievement of a phonetic recognizer to facilitate the so-called lexical access problem, that is, to decode phonetic units into complete-sense word strings. The knowledge is supplied to the recognizer in terms of fuzzy systems. The software utilized is called Adaptive Fuzzy Modeler and belongs to the rule-generator family. A procedure has been implemented to integrate 'expert' knowledge into the fuzzy system in order to obtain significant improvements in recognition accuracy. Up to this point, the tests show a recognition rate of 92% for the vowel class, 89% for the fricative class and 94% for the nasal class, utilizing 1000 phonemes in the learning phase and 600 phonemes in the testing phase. Our intention is to complete the fuzzy recognizer by extending this work to the other phonetic classes.

  18. A Dialectal Chinese Speech Recognition Framework

    Institute of Scientific and Technical Information of China (English)

    Jing Li; Thomas Fang Zheng; William Byrne; Dan Jurafsky

    2006-01-01

    A framework for dialectal Chinese speech recognition is proposed and studied, in which a relatively small dialectal Chinese (that is, Chinese influenced by the speaker's native dialect) speech corpus and dialect-related knowledge are adopted to transform a standard Chinese (Putonghua, abbreviated as PTH) speech recognizer into a dialectal Chinese speech recognizer. Two kinds of knowledge sources are explored: one is expert knowledge and the other is a small dialectal Chinese corpus. These knowledge sources provide information at four levels: the phonetic level, lexicon level, language level, and acoustic decoder level. This paper takes Wu dialectal Chinese (WDC) as an example target language. The goal is to establish a WDC speech recognizer from an existing PTH speech recognizer based on the Initial-Final structure of the Chinese language and a study of how dialectal Chinese speakers speak Putonghua. The authors propose to use context-independent PTH-IF mappings (where IF means either a Chinese Initial or a Chinese Final), context-independent WDC-IF mappings, and syllable-dependent WDC-IF mappings (obtained from either experts or data), and combine them with the supervised maximum likelihood linear regression (MLLR) acoustic model adaptation method. To reduce the size of the multi-pronunciation lexicon introduced by the IF mappings, which might also enlarge lexicon confusion and hence lead to performance degradation, a Multi-Pronunciation Expansion (MPE) method based on the accumulated unigram probability (AUP) is proposed. In addition, some commonly used WDC words are selected and added to the lexicon. Compared with the original PTH speech recognizer, the resulting WDC speech recognizer achieves 10-18% absolute Character Error Rate (CER) reduction when recognizing WDC, with only a 0.62% CER increase when recognizing PTH. The proposed framework and methods are expected to work not only for Wu dialectal Chinese but also for other dialectal Chinese languages and

  19. Speech Recognition Technology for Hearing Disabled Community

    Directory of Open Access Journals (Sweden)

    Tanvi Dua

    2014-09-01

    As the number of people with hearing disabilities is increasing significantly in the world, technology is always needed to fill the communication gap between the Deaf and Hearing communities. To fill this gap and to allow people with hearing disabilities to communicate, this paper suggests a framework that contributes to the efficient integration of people with hearing disabilities. This paper presents a robust speech recognition system which converts continuous speech into text and images. Results are obtained with an accuracy of 95% on a small vocabulary of 20 greeting sentences of continuous speech, tested in a speaker-independent mode. In this testing phase, all these continuous sentences were given as live input to the proposed system.

  20. Speech emotion recognition with unsupervised feature learning

    Institute of Scientific and Technical Information of China (English)

    Zheng-wei HUANG; Wen-tao XUE; Qi-rong MAO

    2015-01-01

    Emotion-based features are critical for achieving high performance in a speech emotion recognition (SER) system. In general, it is difficult to develop these features due to the ambiguity of the ground-truth. In this paper, we apply several unsupervised feature learning algorithms (including K-means clustering, the sparse auto-encoder, and sparse restricted Boltzmann machines), which have promise for learning task-related features by using unlabeled data, to speech emotion recognition. We then evaluate the performance of the proposed approach and present a detailed analysis of the effect of two important factors in the model setup, the content window size and the number of hidden layer nodes. Experimental results show that larger content windows and more hidden nodes contribute to higher performance. We also show that the two-layer network cannot explicitly improve performance compared to a single-layer network.
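
    As a concrete instance of the K-means variant mentioned above, the sketch below clusters unlabeled feature frames and re-encodes each frame by soft distances to the learned centroids (the "triangle" encoding of Coates and Ng); dimensions and cluster count are illustrative assumptions.

```python
# K-means feature learning on unlabeled frames: cluster the frames, then
# encode each frame by soft ("triangle") distances to the learned centroids.
# Sizes are illustrative; random data stand in for unlabeled MFCC+delta frames.
import numpy as np
from sklearn.cluster import KMeans

def learn_codebook(unlabeled_frames, k=64):
    return KMeans(n_clusters=k, n_init=4).fit(unlabeled_frames)

def encode(frames, km):
    d = km.transform(frames)                    # frame-to-centroid distances
    return np.maximum(0.0, d.mean(axis=1, keepdims=True) - d)  # triangle encoding

rng = np.random.default_rng(1)
frames = rng.normal(size=(2000, 39))            # stand-in for unlabeled frames
km = learn_codebook(frames)
features = encode(frames, km)                   # 2000 x 64 learned representation
```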

  1. Compact Acoustic Models for Embedded Speech Recognition

    Directory of Open Access Journals (Sweden)

    Christophe Lévy

    2009-01-01

    Speech recognition applications are known to require a significant amount of resources. However, embedded speech recognition allows only a few KB of memory, a few MIPS, and a small amount of training data. In order to fit the resource constraints of embedded applications, an approach based on a semicontinuous HMM system using state-independent acoustic modelling is proposed. A transformation is computed and applied to the global model in order to obtain each HMM state-dependent probability density function, so that only the transformation parameters need to be stored. This approach is evaluated on two tasks: digit and voice-command recognition. A fast adaptation technique for the acoustic models is also proposed. In order to significantly reduce computational costs, the adaptation is performed only on the global model (using techniques related to speaker recognition adaptation), with no need for state-dependent data. The whole approach results in a relative gain of more than 20% compared to a basic HMM-based system fitting the constraints.

  2. Improved Open-Microphone Speech Recognition

    Science.gov (United States)

    Abrash, Victor

    2002-01-01

    Many current and future NASA missions make extreme demands on mission personnel both in terms of work load and in performing under difficult environmental conditions. In situations where hands are impeded or needed for other tasks, eyes are busy attending to the environment, or tasks are sufficiently complex that ease of use of the interface becomes critical, spoken natural language dialog systems offer unique input and output modalities that can improve efficiency and safety. They also offer new capabilities that would not otherwise be available. For example, many NASA applications require astronauts to use computers in micro-gravity or while wearing space suits. Under these circumstances, command and control systems that allow users to issue commands or enter data in hands-and eyes-busy situations become critical. Speech recognition technology designed for current commercial applications limits the performance of the open-ended state-of-the-art dialog systems being developed at NASA. For example, today's recognition systems typically listen to user input only during short segments of the dialog, and user input outside of these short time windows is lost. Mistakes detecting the start and end times of user utterances can lead to mistakes in the recognition output, and the dialog system as a whole has no way to recover from this, or any other, recognition error. Systems also often require the user to signal when that user is going to speak, which is impractical in a hands-free environment, or only allow a system-initiated dialog requiring the user to speak immediately following a system prompt. In this project, SRI has developed software to enable speech recognition in a hands-free, open-microphone environment, eliminating the need for a push-to-talk button or other signaling mechanism. The software continuously captures a user's speech and makes it available to one or more recognizers. By constantly monitoring and storing the audio stream, it provides the spoken

  3. Joint speech and speaker recognition using neural networks

    OpenAIRE

    Xue, Xiaoguo

    2013-01-01

    Speech is the main communication method between human beings. Since the invention of the computer, people have been trying to make computers understand natural speech. Speech recognition is a technology which has close connections with computer science, signal processing, voice linguistics and intelligent systems. It has been a "hot" subject not only in the field of research but also as a practical application. Especially in real life, speaker and speech recognition have been use...

  4. Speech recognition in natural background noise.

    Directory of Open Access Journals (Sweden)

    Julien Meyer

    In the real world, human speech recognition nearly always involves listening in background noise. The impact of such noise on speech signals and on intelligibility performance increases with the separation of the listener from the speaker. The present behavioral experiment provides an overview of the effects of such acoustic disturbances on speech perception in conditions approaching ecologically valid contexts. We analysed the intelligibility loss in spoken word lists with increasing listener-to-speaker distance in a typical low-level natural background noise. The noise was combined with the simple spherical amplitude attenuation due to distance, basically changing the signal-to-noise ratio (SNR). Therefore, our study draws attention to some of the most basic environmental constraints that have pervaded spoken communication throughout human history. We evaluated the ability of native French participants to recognize French monosyllabic words (spoken at 65.3 dB(A), reference at 1 meter) at distances between 11 and 33 meters, which corresponded to the SNRs most revealing of the progressive effect of the selected natural noise (-8.8 dB to -18.4 dB). Our results showed that in such conditions, the identity of vowels is mostly preserved, with the striking peculiarity of an absence of confusion between vowels. The results also confirmed the functional role of consonants during lexical identification. The extensive analysis of recognition scores, confusion patterns and associated acoustic cues revealed that sonorant, sibilant and burst properties were the most important parameters influencing phoneme recognition. Altogether these analyses allowed us to extract a resistance scale from consonant recognition scores. We also identified specific perceptual consonant confusion groups depending on the position in the words (onset vs. coda). Finally, our data suggested that listeners may access some acoustic cues of the CV transition, opening interesting perspectives for
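
    The distance manipulation reduces to simple level arithmetic: spherical spreading attenuates the speech level by 20·log10(d) dB relative to the 1 m reference while the noise floor stays fixed, so the SNR drops by the same amount. The sketch below reproduces the reported SNR endpoints under an assumed noise floor of about 53.3 dB(A), a value inferred here from the abstract's numbers rather than stated in it.

```python
# Worked example: spherical spreading lowers the speech level by 20*log10(d) dB
# relative to the 1 m reference; with a fixed noise floor the SNR drops by the
# same amount. The ~53.3 dB(A) noise floor is an inferred assumption.
import math

def snr_at_distance(level_at_1m_dba, noise_dba, d_m):
    speech = level_at_1m_dba - 20.0 * math.log10(d_m)  # inverse-square spreading
    return speech - noise_dba

for d in (11, 33):
    print(d, "m ->", round(snr_at_distance(65.3, 53.3, d), 1), "dB SNR")
# prints: 11 m -> -8.8 dB SNR and 33 m -> -18.4 dB SNR, matching the abstract
```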

  5. Multi-thread Parallel Speech Recognition for Mobile Applications

    Directory of Open Access Journals (Sweden)

    LOJKA Martin

    2014-05-01

    In this paper, a server-based solution for a multi-thread large-vocabulary automatic speech recognition engine is described, along with practical Android OS and HTML5 application examples. The basic idea was to make speech recognition available to a full variety of applications for computers and especially for mobile devices. The speech recognition engine should be independent of commercial products and services (where the dictionary cannot be modified). The use of third-party services could also pose security and privacy problems in specific applications, where unsecured audio data should not be sent to uncontrolled environments (voice data transferred to servers around the globe). Using our experience with speech recognition applications, we have been able to construct a multi-thread speech recognition server-based solution designed with a simple application interface (API) to a speech recognition engine modified to the specific needs of a particular application.

  6. Automated Discovery of Speech Act Categories in Educational Games

    Science.gov (United States)

    Rus, Vasile; Moldovan, Cristian; Niraula, Nobal; Graesser, Arthur C.

    2012-01-01

    In this paper we address the important task of automated discovery of speech act categories in dialogue-based, multi-party educational games. Speech acts are important in dialogue-based educational systems because they help infer the student speaker's intentions (the task of speech act classification) which in turn is crucial to providing adequate…

  7. Performance of current models of speech recognition and resulting challenges

    OpenAIRE

    Schubotz, Wiebke

    2015-01-01

    Speech is usually perceived in background noise (masker) that can severely hamper its recognition. Nevertheless, there are mechanisms that enable speech recognition even in difficult listening conditions. Some of them, such as e.g., the combination of across-frequency information or binaural cues, are studied in this dissertation. Moreover, masking aspects such as energetic, amplitude modulation or informational masking are considered. Speech recognition in complex maskers is investigated tha...

  8. The Use of Speech Recognition Technology in Automotive Applications

    OpenAIRE

    Gellatly, Andrew William

    1997-01-01

    The research objectives were (1) to perform a detailed review of the literature on speech recognition technology and the attentional demands of driving; (2) to develop decision tools that assist designers of in-vehicle systems; (3) to experimentally examine automatic speech recognition (ASR) design parameters, input modalities, and driver ages; and (4) to provide human factors recommendations for the use of speech recognition technology in automotive applicatio...

  9. Automated leukocyte recognition using fuzzy divergence.

    Science.gov (United States)

    Ghosh, Madhumala; Das, Devkumar; Chakraborty, Chandan; Ray, Ajoy K

    2010-10-01

    This paper aims at introducing an automated approach to leukocyte recognition using fuzzy divergence and modified thresholding techniques. The recognition is done through the segmentation of nuclei, where Gamma, Gaussian and Cauchy types of fuzzy membership functions are studied for the image pixels. It is in fact found that Cauchy leads to better segmentation compared to the others. In addition, image thresholding is modified for better recognition. Results are studied and discussed.

  10. Speech Recognition Technology Applied to Intelligent Mobile Navigation System

    Institute of Scientific and Technical Information of China (English)

    2002-01-01

    The capability of human-computer interaction reflects the intelligence of a mobile navigation system. In this paper, the navigation data and functions of a mobile navigation system are divided into system commands and non-system commands, and then a group of speech commands is abstracted. This paper applies speech recognition technology to an intelligent mobile navigation system to process speech commands and carries out deeper research on the integration of speech recognition technology with a mobile navigation system. Navigation operations can be performed by speech commands, which makes human-computer interaction easy during navigation. The speech command interface of the navigation system is implemented with Dutty++ software, which is based on IBM's ViaVoice speech recognition system. Navigation experiments showed that navigation can be done almost without a keyboard, which proved that human-computer interaction is very convenient with speech commands and that reliability is also higher.

  11. Recognition of Isolated Words using Zernike and MFCC features for Audio Visual Speech Recognition

    OpenAIRE

    Borde, Prashant; Varpe, Amarsinh; Manza, Ramesh; Yannawar, Pravin

    2014-01-01

    Automatic Speech Recognition (ASR) by machine is an attractive research topic in the signal processing domain and has attracted many researchers to contribute to this area. In recent years, there have been many advances in automatic speech-reading systems with the inclusion of audio and visual speech features to recognize words under noisy conditions. The objective of an audio-visual speech recognition system is to improve recognition accuracy. In this paper we computed visual features using Zernike m...

  12. Post-editing through Speech Recognition

    DEFF Research Database (Denmark)

    Mesa-Lao, Bartolomé

    As automatic speech recognition is gaining momentum, it seems reasonable to explore the interplay between both fields in a feasibility study. In the context of machine-aided human translation (MAHT), different scenarios have been investigated where human translators interact with a computer through a variety of input modalities. This study uses one of the computer-aided translation workbenches on the market (i.e., MemoQ) together with one of the most well-known ASR packages (i.e., Dragon NaturallySpeaking from Nuance). Two data correction modes will be considered: (a) keyboard vs. (b) keyboard and speech combined. These two different ways of verifying

  13. A pattern recognition based esophageal speech enhancement system

    Directory of Open Access Journals (Sweden)

    A. Mantilla-Caeiros

    2010-04-01

    A system for improving the intelligibility and quality of alaryngeal speech, based on the replacement of voiced segments of alaryngeal speech with the equivalent segments of normal speech, is proposed. To this end, the proposed system identifies the voiced segments of the alaryngeal speech signal by using isolated-word speech recognition methods, and replaces them with their equivalent voiced segments of normal speech, keeping the silence and unvoiced segments unchanged. Evaluation results using objective and subjective evaluation methods show that the proposed system provides a fairly good improvement in the quality and intelligibility of alaryngeal speech signals.

  14. Speech and audio processing for coding, enhancement and recognition

    CERN Document Server

    Togneri, Roberto; Narasimha, Madihally

    2015-01-01

    This book describes the basic principles underlying the generation, coding, transmission and enhancement of speech and audio signals, including advanced statistical and machine learning techniques for speech and speaker recognition, with an overview of the key innovations in these areas. Key research undertaken in speech coding, speech enhancement, speech recognition, emotion recognition and speaker diarization is also presented, along with recent advances and new paradigms in these areas. The book offers readers a single-source reference on the significant applications of speech and audio processing to speech coding, speech enhancement and speech/speaker recognition; enables readers involved in algorithm development and implementation issues for speech coding to understand the historical development and future challenges in speech coding research; and discusses speech coding methods yielding bit-streams that are multi-rate and scalable for Voice-over-IP (VoIP) networks...

  15. How does real affect affect affect recognition in speech?

    NARCIS (Netherlands)

    Truong, Khiet Phuong

    2009-01-01

    The aim of the research described in this thesis was to develop speech-based affect recognition systems that can deal with spontaneous (‘real’) affect instead of acted affect. Several affect recognition experiments with spontaneous affective speech data were carried out to investigate what combinati

  16. An HMM-Like Dynamic Time Warping Scheme for Automatic Speech Recognition

    Directory of Open Access Journals (Sweden)

    Ing-Jr Ding

    2014-01-01

    Full Text Available In the past, the kernel of automatic speech recognition (ASR) was dynamic time warping (DTW), which is feature-based template matching and belongs to the category of dynamic programming (DP) techniques. Although DTW is an early ASR technique, it remains popular in many applications and now plays an important role in the well-known Kinect-based gesture recognition application. This paper proposes an intelligent speech recognition system using an improved DTW approach for multimedia and home automation services. The improved DTW presented in this work, called HMM-like DTW, is essentially a hidden Markov model (HMM)-like method in which the concept of the typical HMM statistical model is brought into the design of DTW. The developed HMM-like DTW method, transforming feature-based DTW recognition into model-based DTW recognition, can behave like the HMM recognition technique, and therefore the proposed HMM-like DTW with its HMM-like recognition model is capable of further model adaptation (also known as speaker adaptation). A series of experimental results in home automation-based multimedia access service environments demonstrated the superiority and effectiveness of the developed smart speech recognition system based on HMM-like DTW.
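
    The HMM-like extension above is specific to that paper, but the DTW template-matching kernel it builds on is standard. A minimal sketch in Python, assuming utterances are given as (frames x dimensions) feature arrays; all names are illustrative:

        import numpy as np

        def dtw_distance(seq_a, seq_b):
            # Dynamic time warping: accumulate the cheapest frame-to-frame
            # alignment cost between two (frames x dims) feature sequences.
            n, m = len(seq_a), len(seq_b)
            local = np.linalg.norm(seq_a[:, None, :] - seq_b[None, :, :], axis=2)
            acc = np.full((n + 1, m + 1), np.inf)
            acc[0, 0] = 0.0
            for i in range(1, n + 1):
                for j in range(1, m + 1):
                    acc[i, j] = local[i - 1, j - 1] + min(
                        acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1])
            return acc[n, m]

        # Template matching: the test utterance is labeled with the word
        # whose stored template yields the smallest warped distance.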

  17. Part-of-Speech Enhanced Context Recognition

    DEFF Research Database (Denmark)

    Madsen, Rasmus Elsborg; Larsen, Jan; Hansen, Lars Kai

    2004-01-01

    Language independent 'bag-of-words' representations are surprisingly effective for text classification. In this communication our aim is to elucidate the synergy between language independent features and simple language model features. We consider term tag features estimated by a so-called part-of-speech tagger. The feature sets are combined in an early binding design with an optimized binding coefficient that allows weighting of the relative variance contributions of the participating feature sets. With the combined features, documents are classified using a latent semantic indexing representation ... and a probabilistic neural network classifier. Three medium size data-sets are analyzed and we find consistent synergy between the term and natural language features in all three sets for a range of training set sizes. The most significant enhancement is found for small text databases where high recognition...

  18. Unvoiced Speech Recognition Using Tissue-Conductive Acoustic Sensor

    Directory of Open Access Journals (Sweden)

    Heracleous Panikos

    2007-01-01

    Full Text Available We present the use of stethoscope and silicon NAM (nonaudible murmur) microphones in automatic speech recognition. NAM microphones are special acoustic sensors which are attached behind the talker's ear and can capture not only normal (audible) speech, but also very quietly uttered speech (nonaudible murmur). As a result, NAM microphones can be applied in automatic speech recognition systems when privacy is desired in human-machine communication. Moreover, NAM microphones show robustness against noise, and they might be used in special systems (speech recognition, speech transformation, etc.) for sound-impaired people. Using adaptation techniques and a small amount of training data, we achieved, for a 20 k dictation task, a word accuracy for nonaudible murmur recognition in a clean environment. In this paper, we also investigate nonaudible murmur recognition in noisy environments and the effect of the Lombard reflex on nonaudible murmur recognition. We also propose three methods to integrate audible speech and nonaudible murmur recognition using a stethoscope NAM microphone, with very promising results.

  19. Speech recognition algorithms based on weighted finite-state transducers

    CERN Document Server

    Hori, Takaaki

    2013-01-01

    This book introduces the theory, algorithms, and implementation techniques for efficient decoding in speech recognition mainly focusing on the Weighted Finite-State Transducer (WFST) approach. The decoding process for speech recognition is viewed as a search problem whose goal is to find a sequence of words that best matches an input speech signal. Since this process becomes computationally more expensive as the system vocabulary size increases, research has long been devoted to reducing the computational cost. Recently, the WFST approach has become an important state-of-the-art speech recogni

  20. An articulatorily constrained, maximum entropy approach to speech recognition and speech coding

    Energy Technology Data Exchange (ETDEWEB)

    Hogden, J.

    1996-12-31

    Hidden Markov models (HMMs) are among the most popular tools for performing computer speech recognition. One of the primary reasons that HMMs typically outperform other speech recognition techniques is that the parameters used for recognition are determined by the data, not by preconceived notions of what the parameters should be. This makes HMMs better able to deal with intra- and inter-speaker variability despite the limited knowledge of how speech signals vary and despite the often limited ability to correctly formulate rules describing variability and invariance in speech. In fact, it is often the case that when HMM parameter values are constrained using the limited knowledge of speech, recognition performance decreases. However, the structure of an HMM has little in common with the mechanisms underlying speech production. Here, the author argues that by using probabilistic models that more accurately embody the process of speech production, he can create models that have all the advantages of HMMs, but that should more accurately capture the statistical properties of real speech samples--presumably leading to more accurate speech recognition. The model he will discuss uses the fact that speech articulators move smoothly and continuously. Before discussing how to use articulatory constraints, he will give a brief description of HMMs. This will allow him to highlight the similarities and differences between HMMs and the proposed technique.

  1. Emotion Recognition from Speech Signals and Perception of Music

    OpenAIRE

    Fernandez Pradier, Melanie

    2011-01-01

    This thesis deals with emotion recognition from speech signals. The feature extraction step shall be improved by looking at the perception of music. In music theory, different pitch intervals (consonant, dissonant) and chords are believed to invoke different feelings in listeners. The question is whether there is a similar mechanism between perception of music and perception of emotional speech. Our research will follow three stages. First, the relationship between speech and music at segment...

  2. Effects of Speech Clarity on Recognition Memory for Spoken Sentences

    OpenAIRE

    Van Engen, Kristin J.; Bharath Chandrasekaran; Rajka Smiljanic

    2012-01-01

    Extensive research shows that inter-talker variability (i.e., changing the talker) affects recognition memory for speech signals. However, relatively little is known about the consequences of intra-talker variability (i.e. changes in speaking style within a talker) on the encoding of speech signals in memory. It is well established that speakers can modulate the characteristics of their own speech and produce a listener-oriented, intelligibility-enhancing speaking style in response to communi...

  3. SPEECH EMOTION RECOGNITION USING MODIFIED QUADRATIC DISCRIMINATION FUNCTION

    Institute of Scientific and Technical Information of China (English)

    2008-01-01

    The Quadratic Discrimination Function (QDF) is commonly used in speech emotion recognition, which proceeds on the premise that the input data follow a normal distribution. In this paper, we propose a transformation to normalize the emotional features and then derive a Modified QDF (MQDF) for speech emotion recognition. Features based on prosody and voice quality are extracted, and a Principal Component Analysis Neural Network (PCANN) is used to reduce the dimension of the feature vectors. The results show that voice quality features are an effective supplement for recognition, and that the method in this paper improves the recognition ratio effectively.
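
    For reference, the unmodified QDF that the MQDF departs from is just the Gaussian log-likelihood discriminant. A minimal sketch (the paper's normalizing transformation is not reproduced here):

        import numpy as np

        def qdf_score(x, mean, cov, log_prior=0.0):
            # Quadratic discriminant for one emotion class: Gaussian
            # log-likelihood up to a constant; the highest score wins.
            diff = x - mean
            _, logdet = np.linalg.slogdet(cov)
            return -diff @ np.linalg.solve(cov, diff) - logdet + 2.0 * log_prior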

  4. Source Separation via Spectral Masking for Speech Recognition Systems

    Directory of Open Access Journals (Sweden)

    Gustavo Fernandes Rodrigues

    2012-12-01

    Full Text Available In this paper we present an insight into the use of spectral masking techniques in the time-frequency domain as a preprocessing step for speech signal recognition. Speech recognition systems have their performance negatively affected in noisy environments or in the presence of other speech signals. The limits of these masking techniques for different levels of the signal-to-noise ratio are discussed. We show the robustness of the spectral masking techniques against four types of noise: white, pink, brown and human speech noise (babble noise). The main contribution of this work is to analyze the performance limits of recognition systems using spectral masking. We obtain an increase of 18% in the speech hit rate when the speech signals were corrupted by other speech signals or babble noise, at signal-to-noise ratios of approximately 1, 10 and 20 dB. On the other hand, applying ideal binary masks to mixtures corrupted by white, pink and brown noise results in an average growth of 9% in the speech hit rate at the same signal-to-noise ratios. The experimental results suggest that the spectral masking techniques are more suitable for babble noise, which is produced by human speech, than for white, pink and brown noise.
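
    The ideal binary mask used in such experiments has a standard construction when the clean speech and the interference are available separately, as they are in synthetically mixed test sets. A minimal sketch; the 0 dB local criterion is a common default, not necessarily the paper's setting:

        import numpy as np

        def ideal_binary_mask(speech_mag, noise_mag, lc_db=0.0):
            # Keep a time-frequency cell when its local SNR exceeds the
            # local criterion (in dB); zero it otherwise.
            eps = 1e-10
            local_snr = 20.0 * np.log10((speech_mag + eps) / (noise_mag + eps))
            return (local_snr > lc_db).astype(float)

        # Preprocessing: masked = ideal_binary_mask(s, n) * mixture_mag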

  5. Speech-recognition interfaces for music information retrieval

    Science.gov (United States)

    Goto, Masataka

    2005-09-01

    This paper describes two hands-free music information retrieval (MIR) systems that enable a user to retrieve and play back a musical piece by saying its title or the artist's name. Although various interfaces for MIR have been proposed, speech-recognition interfaces suitable for retrieving musical pieces have not been studied. Our MIR-based jukebox systems employ two different speech-recognition interfaces for MIR, speech completion and speech spotter, which exploit intentionally controlled nonverbal speech information in original ways. The first is a music retrieval system with the speech-completion interface that is suitable for music stores and car-driving situations. When a user only remembers part of the name of a musical piece or an artist and utters only a remembered fragment, the system helps the user recall and enter the name by completing the fragment. The second is a background-music playback system with the speech-spotter interface that can enrich human-human conversation. When a user is talking to another person, the system allows the user to enter voice commands for music playback control by spotting a special voice-command utterance in face-to-face or telephone conversations. Experimental results from use of these systems have demonstrated the effectiveness of the speech-completion and speech-spotter interfaces. (Video clips: http://staff.aist.go.jp/m.goto/MIR/speech-if.html)

  6. Speech recognition for 40 patients receiving multichannel cochlear implants.

    Science.gov (United States)

    Dowell, R C; Mecklenburg, D J; Clark, G M

    1986-10-01

    We collected data on 40 patients who received the Nucleus multichannel cochlear implant. Results were reviewed to determine if the coding strategy is effective in transmitting the intended speech features and to assess patient benefit in terms of communication skills. All patients demonstrated significant improvement over preoperative results with a hearing aid for both lipreading enhancement and speech recognition without lipreading. Of the patients, 50% demonstrated ability to understand connected discourse with auditory input only. For the 23 patients who were tested 12 months postoperatively, there was substantial improvement in open-set speech recognition. PMID:3755975

  7. Cost-Efficient Development of Acoustic Models for Speech Recognition of Related Languages

    Directory of Open Access Journals (Sweden)

    J. Nouza

    2013-09-01

    Full Text Available When adapting an existing speech recognition system to a new language, major development costs are associated with the creation of an appropriate acoustic model (AM). For its training, a certain amount of recorded and annotated speech is required. In this paper, we show that not only the annotation process, but also the process of speech acquisition can be automated to minimize the need for human and expert work. We demonstrate the proposed methodology on the Croatian language, for which the target AM has been built via cross-lingual adaptation of a Czech AM in two ways: a) using the commercially available GlobalPhone database, and b) by automatic speech data mining from the HRT radio archive. The latter approach is cost-free, yet it yields comparable or better results in LVCSR experiments conducted on three Croatian test sets.

  8. Effects of speech clarity on recognition memory for spoken sentences.

    Directory of Open Access Journals (Sweden)

    Kristin J Van Engen

    Full Text Available Extensive research shows that inter-talker variability (i.e., changing the talker) affects recognition memory for speech signals. However, relatively little is known about the consequences of intra-talker variability (i.e., changes in speaking style within a talker) on the encoding of speech signals in memory. It is well established that speakers can modulate the characteristics of their own speech and produce a listener-oriented, intelligibility-enhancing speaking style in response to communication demands (e.g., when speaking to listeners with hearing impairment or non-native speakers of the language). Here we conducted two experiments to examine the role of speaking style variation in spoken language processing. First, we examined the extent to which clear speech provided benefits in challenging listening environments (i.e., speech-in-noise). Second, we compared recognition memory for sentences produced in conversational and clear speaking styles. In both experiments, semantically normal and anomalous sentences were included to investigate the role of higher-level linguistic information in the processing of speaking style variability. The results show that acoustic-phonetic modifications implemented in listener-oriented speech lead to improved speech recognition in challenging listening conditions and, crucially, to a substantial enhancement in recognition memory for sentences.

  9. Histogram Equalization to Model Adaptation for Robust Speech Recognition

    Directory of Open Access Journals (Sweden)

    Hoirin Kim

    2010-01-01

    Full Text Available We propose a new model adaptation method based on the histogram equalization technique for providing robustness in noisy environments. The trained acoustic mean models of a speech recognizer are adapted into environmentally matched conditions by using the histogram equalization algorithm on a single utterance basis. For more robust speech recognition in the heavily noisy conditions, trained acoustic covariance models are efficiently adapted by the signal-to-noise ratio-dependent linear interpolation between trained covariance models and utterance-level sample covariance models. Speech recognition experiments on both the digit-based Aurora2 task and the large vocabulary-based task showed that the proposed model adaptation approach provides significant performance improvements compared to the baseline speech recognizer trained on the clean speech data.
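
    Histogram equalization is, at its core, a per-dimension quantile mapping. The paper applies the mapping to the acoustic model means on a single-utterance basis; the sketch below shows the same operation on the feature side, which is the more common formulation (the quantile count and other details are assumptions):

        import numpy as np

        def histogram_equalize(features, reference):
            # Quantile mapping per dimension: transform each feature so its
            # empirical distribution matches the reference (training) data.
            out = np.empty_like(features)
            q = np.linspace(0.0, 1.0, 101)
            for d in range(features.shape[1]):
                src = np.quantile(features[:, d], q)
                ref = np.quantile(reference[:, d], q)
                out[:, d] = np.interp(features[:, d], src, ref)
            return out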

  10. Mandarin Digits Speech Recognition Using Support Vector Machines

    Institute of Scientific and Technical Information of China (English)

    XIE Xiang; KUANG Jing-ming

    2005-01-01

    A method of applying support vector machines (SVM) to speech recognition was proposed, and a speech recognition system for Mandarin digits was built with SVMs. In the system, vectors were linearly extracted from the speech feature sequence to make up time-aligned input patterns for the SVM, and the decisions of several two-class SVM classifiers were combined to construct an N-class classifier. Four kinds of SVM kernel functions were compared in experiments on speaker-independent speech recognition of Mandarin digits. The radial basis function kernel achieved the highest accuracy rate of 99.33%, which is better than that of the baseline system based on hidden Markov models (HMM) (97.08%). The experiments also show that SVM can outperform HMM especially when the samples available for learning are very limited.
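
    The N-class construction from two-class SVMs described above is the usual one-vs-one voting scheme. A minimal sketch with scikit-learn, assuming integer digit labels; the RBF kernel is the one the abstract reports as best:

        from itertools import combinations

        import numpy as np
        from sklearn.svm import SVC

        def train_pairwise_svms(X, y, kernel="rbf"):
            # One binary SVM per pair of digit classes (one-vs-one).
            models = {}
            for a, b in combinations(np.unique(y), 2):
                idx = np.isin(y, [a, b])
                models[(a, b)] = SVC(kernel=kernel).fit(X[idx], y[idx])
            return models

        def predict_by_voting(models, X):
            # Each pairwise classifier votes; the majority label wins.
            votes = np.stack([m.predict(X) for m in models.values()])
            return np.array([np.bincount(col.astype(int)).argmax()
                             for col in votes.T])

    In practice, scikit-learn's SVC already performs one-vs-one voting internally; the explicit loop only mirrors the construction described in the abstract.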

  11. Comparative wavelet, PLP, and LPC speech recognition techniques on the Hindi speech digits database

    Science.gov (United States)

    Mishra, A. N.; Shrotriya, M. C.; Sharan, S. N.

    2010-02-01

    In view of the growing use of automatic speech recognition in modern society, we study various alternative representations of the speech signal that have the potential to contribute to the improvement of recognition performance. In this paper, wavelet-based features using different wavelets are used for Hindi digit recognition. The recognition performance of these features has been compared with Linear Prediction Coefficients (LPC) and Perceptual Linear Prediction (PLP) features. All features have been tested using a Hidden Markov Model (HMM) based classifier for speaker-independent Hindi digit recognition. The recognition performance of PLP features is 11.3% better than that of LPC features. The recognition performance with db10 features has shown a further improvement of 12.55% over PLP features, and db10 performs best among all wavelet-based features.
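
    The record does not give the exact wavelet feature recipe; sub-band log-energies from a db10 decomposition are one plausible reading. A minimal sketch with PyWavelets (the decomposition level and the energy-based recipe are assumptions):

        import numpy as np
        import pywt

        def db10_features(frame, level=4):
            # Multilevel db10 decomposition of one speech frame; the
            # feature vector is the log energy of each sub-band (a common
            # recipe, assumed here -- the paper's variant may differ).
            coeffs = pywt.wavedec(frame, "db10", level=level)
            return np.log(np.array([np.mean(c ** 2) for c in coeffs]) + 1e-10)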

  12. Lexicon Optimization for Dutch Speech Recognition in Spoken Document Retrieval

    NARCIS (Netherlands)

    Ordelman, Roeland; Hessen, van Arjan; Jong, de Franciska

    2001-01-01

    In this paper, ongoing work concerning the language modelling and lexicon optimization of a Dutch speech recognition system for Spoken Document Retrieval is described: the collection and normalization of a training data set and the optimization of our recognition lexicon. Effects on lexical coverage

  13. Lexicon optimization for Dutch speech recognition in spoken document retrieval

    NARCIS (Netherlands)

    Ordelman, Roeland; Hessen, van Arjan; Jong, de Franciska

    2001-01-01

    In this paper, ongoing work concerning the language modelling and lexicon optimization of a Dutch speech recognition system for Spoken Document Retrieval is described: the collection and normalization of a training data set and the optimization of our recognition lexicon. Effects on lexical coverage

  14. Modelling context in automatic speech recognition

    NARCIS (Netherlands)

    Wiggers, P.

    2008-01-01

    Speech is at the core of human communication. Speaking and listening come so naturally to us that we do not have to think about them at all. The underlying cognitive processes are very rapid and almost completely subconscious. It is hard, if not impossible, not to understand speech. For computers on the o

  15. Speech recognition systems on the Cell Broadband Engine

    Energy Technology Data Exchange (ETDEWEB)

    Liu, Y; Jones, H; Vaidya, S; Perrone, M; Tydlitat, B; Nanda, A

    2007-04-20

    In this paper we describe our design, implementation, and first results of a prototype connected-phoneme-based speech recognition system on the Cell Broadband Engine™ (Cell/B.E.). Automatic speech recognition decodes speech samples into plain text (other representations are possible) and must process samples at real-time rates. Fortunately, the computational tasks involved in this pipeline are highly data-parallel and can receive significant hardware acceleration from vector-streaming architectures such as the Cell/B.E. Identifying and exploiting these parallelism opportunities is challenging, but also critical to improving system performance. We observed, from our initial performance timings, that a single Cell/B.E. processor can recognize speech from thousands of simultaneous voice channels in real time--a channel density that is orders-of-magnitude greater than the capacity of existing software speech recognizers based on CPUs (central processing units). This result emphasizes the potential for Cell/B.E.-based speech recognition and will likely lead to the future development of production speech systems using Cell/B.E. clusters.

  16. Experiments on Automatic Recognition of Nonnative Arabic Speech

    Directory of Open Access Journals (Sweden)

    Selouani Sid-Ahmed

    2008-01-01

    Full Text Available The automatic recognition of foreign-accented Arabic speech is a challenging task since it involves a large number of nonnative accents. As well, the nonnative speech data available for training are generally insufficient. Moreover, as compared to other languages, the Arabic language has sparked a relatively small number of research efforts. In this paper, we are concerned with the problem of nonnative speech in a speaker-independent, large-vocabulary speech recognition system for modern standard Arabic (MSA). We analyze some major differences at the phonetic level in order to determine which phonemes have a significant part in the recognition performance for both native and nonnative speakers. Special attention is given to specific Arabic phonemes. The performance of an HMM-based Arabic speech recognition system is analyzed with respect to speaker gender and native origin. The WestPoint modern standard Arabic database from the Linguistic Data Consortium (LDC) and the Hidden Markov Model Toolkit (HTK) are used throughout all experiments. Our study shows that the best performance in overall phoneme recognition is obtained when nonnative speakers are involved in both the training and testing phases. This is not the case when a language model and phonetic lattice networks are incorporated in the system. At the phonetic level, the results show that female nonnative speakers perform better than male nonnative speakers, and that emphatic phonemes yield a significant decrease in performance when uttered by both male and female nonnative speakers.

  17. A Multi-Modal Recognition System Using Face and Speech

    Directory of Open Access Journals (Sweden)

    Samir Akrouf

    2011-05-01

    Full Text Available Nowadays person recognition has received more and more interest, especially for security reasons. The recognition performed by a biometric system using a single modality tends to perform less well due to sensor data limitations, restricted degrees of freedom and unacceptable error rates. To alleviate some of these problems, we use multimodal biometric systems, which provide better recognition results. By combining different modalities, such as speech, face, fingerprint, etc., we increase the performance of recognition systems. In this paper, we study the fusion of speech and face in a recognition system for taking a final decision (i.e., accept or reject an identity claim). We evaluate the performance of each system separately, then we fuse the results and compare the performances.

  18. Noise Robust Speech Recognition Applied to Voice-Driven Wheelchair

    Science.gov (United States)

    Sasou, Akira; Kojima, Hiroaki

    2009-12-01

    Conventional voice-driven wheelchairs usually employ headset microphones that are capable of achieving sufficient recognition accuracy, even in the presence of surrounding noise. However, such interfaces require users to wear sensors such as a headset microphone, which can be an impediment, especially for the hand disabled. Conversely, it is also well known that the speech recognition accuracy drastically degrades when the microphone is placed far from the user. In this paper, we develop a noise robust speech recognition system for a voice-driven wheelchair. This system can achieve almost the same recognition accuracy as the headset microphone without wearing sensors. We verified the effectiveness of our system in experiments in different environments, and confirmed that our system can achieve almost the same recognition accuracy as the headset microphone without wearing sensors.

  19. Speech recognition: Acoustic phonetic and lexical knowledge representation

    Science.gov (United States)

    Zue, V. W.

    1984-02-01

    The purpose of this program is to develop a speech data base facility under which the acoustic characteristics of speech sounds in various contexts can be studied conveniently; investigate the phonological properties of a large lexicon of, say, 10,000 words and determine to what extent the phonotactic constraints can be utilized in speech recognition; study the acoustic cues that are used to mark word boundaries; develop a test bed in the form of a large-vocabulary IWR system to study the interactions of acoustic, phonetic and lexical knowledge; and develop a limited continuous speech recognition system with the goal of recognizing any English word from its spelling in order to assess the interactions of higher-level knowledge sources.

  20. Analysis of Phonetic Transcriptions for Danish Automatic Speech Recognition

    DEFF Research Database (Denmark)

    Kirkedal, Andreas Søeborg

    2013-01-01

    Automatic speech recognition (ASR) relies on three resources: audio, orthographic transcriptions and a pronunciation dictionary. The dictionary or lexicon maps orthographic words to sequences of phones or phonemes that represent the pronunciation of the corresponding word. The quality of a speech....... The analysis indicates that transcribing e.g. stress or vowel duration has a negative impact on performance. The best performance is obtained with coarse phonetic annotation, which improves performance by 1% in word error rate and 3.8% in sentence error rate.

  1. Studies in automatic speech recognition and its application in aerospace

    Science.gov (United States)

    Taylor, Michael Robinson

    Human communication is characterized in terms of the spectral and temporal dimensions of speech waveforms. Electronic speech recognition strategies based on Dynamic Time Warping and Markov Model algorithms are described and typical digit recognition error rates are tabulated. The application of Direct Voice Input (DVI) as an interface between man and machine is explored within the context of civil and military aerospace programmes. Sources of physical and emotional stress affecting speech production within military high performance aircraft are identified. Experimental results are reported which quantify fundamental frequency and coarse temporal dimensions of male speech as a function of the vibration, linear acceleration and noise levels typical of aerospace environments; preliminary indications of acoustic phonetic variability reported by other researchers are summarized. Connected whole-word pattern recognition error rates are presented for digits spoken under controlled Gz sinusoidal whole-body vibration. Correlations are made between significant increases in recognition error rate and resonance of the abdomen-thorax and head subsystems of the body. The phenomenon of vibrato style speech produced under low frequency whole-body Gz vibration is also examined. Interactive DVI system architectures and avionic data bus integration concepts are outlined together with design procedures for the efficient development of pilot-vehicle command and control protocols.

  2. Emotion Recognition from Persian Speech with Neural Network

    Directory of Open Access Journals (Sweden)

    Mina Hamidi

    2012-10-01

    Full Text Available In this paper, we report an effort towards automatic recognition of emotional states from continuous Persian speech. Due to the unavailability of an appropriate database in the Persian language for emotion recognition, we first built a database of emotional speech in Persian. This database consists of 2400 wave clips modulated with anger, disgust, fear, sadness, happiness and normal emotions. Then we extract prosodic features, including features related to the pitch, intensity and global characteristics of the speech signal. Finally, we applied neural networks for automatic recognition of emotion. The resulting average accuracy was about 78%.

  3. Bayesian estimation of keyword confidence in Chinese continuous speech recognition

    Institute of Scientific and Technical Information of China (English)

    HAO Jie; LI Xing

    2003-01-01

    In a syllable-based, speaker-independent Chinese continuous speech recognition system based on the classical Hidden Markov Model (HMM), a Bayesian approach to keyword confidence estimation is studied which utilizes both acoustic-layer scores and a syllable-based statistical language model (LM) score. The Maximum a posteriori (MAP) confidence measure is proposed, and the forward-backward algorithm for calculating the MAP confidence scores is derived. The performance of the MAP confidence measure is evaluated in a keyword spotting application, and the experimental results show that the MAP confidence scores provide high discriminability for keyword candidates. Furthermore, the MAP confidence measure can be applied to various speech recognition applications.
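
    The MAP confidence amounts to a posterior that combines the two knowledge sources. A minimal sketch of the combination step, with the forward-backward normalizer assumed to be already computed; all names are illustrative:

        import numpy as np

        def map_confidence(log_acoustic, log_lm, log_evidence):
            # Posterior P(keyword | observations) in the log domain:
            # acoustic likelihood * LM prior / total path evidence.
            return np.exp(log_acoustic + log_lm - log_evidence)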

  4. Speech Recognition Method Based on Multilayer Chaotic Neural Network

    Institute of Scientific and Technical Information of China (English)

    REN Xiaolin; HU Guangrui

    2001-01-01

    In this paper, speech recognition using neural networks is investigated. In particular, chaotic dynamics is introduced into the neurons, and a multilayer chaotic neural network (MLCNN) architecture is built. A learning algorithm is also derived to train the weights of the network. We apply the MLCNN to speech recognition and compare the performance of the network with those of a recurrent neural network (RNN) and a time-delay neural network (TDNN). Experimental results show that the MLCNN method outperforms the other neural network methods with respect to average recognition rate.

  5. Integration of Metamodel and Acoustic Model for Dysarthric Speech Recognition

    Directory of Open Access Journals (Sweden)

    Hironori Matsumasa

    2009-08-01

    Full Text Available We investigated the speech recognition of a person with articulation disorders resulting from athetoid cerebral palsy. The articulation of the first words spoken tends to be unstable due to the strain placed on the speech-related muscles, and this causes degradation of speech recognition. Therefore, we proposed a robust feature extraction method based on PCA (Principal Component Analysis) instead of MFCC, where the main stable utterance element is projected onto low-order features and fluctuation elements of speech style are projected onto high-order features. The PCA-based filter is thus able to extract stable utterance features only. The fluctuation of speaking style may invoke phone fluctuations, such as substitutions, deletions and insertions. In this paper, we discuss our effort to integrate a metamodel and an acoustic model approach. Metamodels provide a technique for incorporating a model of a speaker's confusion matrix into the ASR process in such a way as to increase recognition accuracy. The integration of metamodels and acoustic models enables fluctuation suppression not only in feature extraction but also in recognition. The proposed method resulted in an improvement of 9.9% (from 79.1% to 89%) in the recognition rate compared to the conventional method.

  6. Success potential of automated star pattern recognition

    Science.gov (United States)

    Van Bezooijen, R. W. H.

    1986-01-01

    A quasi-analytical model is presented for calculating the success probability of automated star pattern recognition systems for attitude control of spacecraft. The star data are gathered by an imaging star tracker (STR) with a circular FOV capable of detecting 20 stars. The success potential is evaluated in terms of the equivalent diameters of the FOV and the target star area ('uniqueness area'). Recognition is carried out as a function of the position and brightness of selected stars in an area around each guide star. The success of the system depends on the resultant pointing error and is calculated by generating a probability distribution for reaching a threshold probability of an unacceptable pointing error. The method yields data which are equivalent to data available with Monte Carlo simulations. When applied to the recognition system intended for use on the Space IR Telescope Facility, it is shown that acceptable pointing, to a level of nearly 100 percent certainty, can be obtained using a single star tracker and about 4000 guide stars.

  7. Robust Automatic Speech Recognition in Impulsive Noise Environment

    Institute of Scientific and Technical Information of China (English)

    DING Pei; CAO Zhigang

    2005-01-01

    This paper presents an efficient method to directly suppress the effect of impulsive noise for robust Automatic speech recognition (ASR). In this method, according to the noise sensitivity of each feature dimension, the observation vectors are divided into several parts, each of which is assigned a proper threshold. In the recognition stage, the unreliable probability preponderance of incorrect competing paths caused by impulsive noise is eliminated by Flooring the observation probability (FOP) of each feature sub-vector at the Gaussian mixture level, so that the correct path recovers its priority of being chosen in decoding. Experimental results also demonstrate that the proposed method can significantly improve recognition accuracy both in machine-gun noise and simulated impulsive noise environments, while maintaining high performance for clean speech recognition.

  8. Mixed Bayesian Networks with Auxiliary Variables for Automatic Speech Recognition

    OpenAIRE

    Stephenson, Todd Andrew; Magimai.-Doss, Mathew; Bourlard, Hervé

    2001-01-01

    Standard hidden Markov models (HMMs), as used in automatic speech recognition (ASR), calculate their emission probabilities by an artificial neural network (ANN) or a Gaussian distribution conditioned on the hidden state variable, considering the emissions independent of any other variable in the model. Recent work showed the benefit of conditioning the emission distributions on a discrete auxiliary variable, which is observed in training and hidden in recognition. Related work has shown the ...

  9. Objective Gender and Age Recognition from Speech Sentences

    Directory of Open Access Journals (Sweden)

    Fatima K. Faek

    2015-10-01

    Full Text Available In this work, an automatic gender and age recognizer from speech is investigated. The features relevant to gender recognition are selected from the first four formant frequencies and twelve MFCCs and feed an SVM classifier, while the features relevant to age are used with a k-NN classifier for the age recognizer model, using MATLAB as a simulation tool. A special selection of robust features is used in this work to improve the results of the gender and age classifiers, based on the frequency range that each feature represents. The gender and age classification algorithms are evaluated using 114 (clean and noisy) speech samples uttered in the Kurdish language. The two-class gender recognition model (adult males and adult females) reached 96% recognition accuracy, while for three-category classification (adult males, adult females, and children) the model achieved 94% recognition accuracy. For the age recognition model, seven groups are categorized according to their ages. The model performance after selecting the features relevant to age achieved 75.3%. For further improvement, a de-noising technique is applied to the noisy speech signals, followed by selecting the proper features that are affected by the de-noising process, resulting in 81.44% recognition accuracy.

  10. Pattern Recognition Methods and Features Selection for Speech Emotion Recognition System

    Directory of Open Access Journals (Sweden)

    Pavol Partila

    2015-01-01

    Full Text Available The impact of the classification method and feature selection on speech emotion recognition accuracy is discussed in this paper. Selecting the correct parameters in combination with the classifier is an important part of reducing the complexity of system computing. This step is necessary especially for systems that will be deployed in real-time applications. The motivation for the development and improvement of speech emotion recognition systems is their wide usability in today's automatic voice-controlled systems. The Berlin database of emotional recordings was used in this experiment. The classification accuracy of artificial neural networks, k-nearest neighbours, and Gaussian mixture models is measured considering the selection of prosodic, spectral, and voice quality features. The purpose was to find an optimal combination of methods and group of features for stress detection in human speech. The research contribution lies in the design of the speech emotion recognition system with respect to its accuracy and efficiency.

  11. Writing and Speech Recognition : Observing Error Correction Strategies of Professional Writers

    NARCIS (Netherlands)

    Leijten, M.A.J.C.

    2007-01-01

    In this thesis we describe the organization of speech recognition based writing processes. Writing can be seen as a visual representation of spoken language: a combination that speech recognition takes full advantage of. In the field of writing research, speech recognition is a new writing instrumen

  12. Improving user-friendliness by using visually supported speech recognition

    NARCIS (Netherlands)

    Waals, J.A.J.S.; Kooi, F.L.; Kriekaard, J.J.

    2002-01-01

    While speech recognition in principle may be one of the most natural interfaces, in practice it is not due to the lack of user-friendliness. Words are regularly interpreted wrong, and subjects tend to articulate in an exaggerated manner. We explored the potential of visually supported error correcti

  13. Speech emotion recognition based on statistical pitch model

    Institute of Scientific and Technical Information of China (English)

    WANG Zhiping; ZHAO Li; ZOU Cairong

    2006-01-01

    A modified Parzen-window method, which keeps high resolution in low frequencies and smoothness in high frequencies, is proposed to obtain the statistical model. Then, a gender classification method utilizing the statistical model is proposed, which achieves 98% gender classification accuracy when long sentences are dealt with. By separating the male and female voices, the mean and standard deviation of speech training samples with different emotions are used to create the corresponding emotion models. Then the Bhattacharyya distances between the test sample and the statistical pitch models are utilized for emotion recognition in speech. The normalization of pitch for the male and female voices is also considered, in order to map them into a uniform space. Finally, a speech emotion recognition experiment based on K Nearest Neighbor shows that a correct rate of 81% is achieved, compared with only 73.85% if the traditional parameters are utilized.
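
    For Gaussian pitch statistics, the Bhattacharyya distance has a closed form. A minimal one-dimensional sketch; the test utterance is assigned the emotion whose model lies closest:

        import numpy as np

        def bhattacharyya_1d(mu1, var1, mu2, var2):
            # Closed-form Bhattacharyya distance between two 1-D Gaussians;
            # a small distance to an emotion's pitch model is evidence for
            # that emotion.
            avg = 0.5 * (var1 + var2)
            return (0.125 * (mu1 - mu2) ** 2 / avg
                    + 0.5 * np.log(avg / np.sqrt(var1 * var2)))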

  14. EMOTIONAL SPEECH RECOGNITION BASED ON SVM WITH GMM SUPERVECTOR

    Institute of Scientific and Technical Information of China (English)

    Chen Yanxiang; Xie Jian

    2012-01-01

    Emotion recognition from speech is an important field of research in human computer interaction. In this letter the framework of Support Vector Machines (SVM) with a Gaussian Mixture Model (GMM) supervector is introduced for emotional speech recognition. Because of the importance of variance in reflecting the distribution of speech, the normalized mean vectors with the potential to exploit the information from the variance are adopted to form the GMM supervector. Comparative experiments from five aspects are conducted to study their corresponding effect on system performance. The experimental results, which indicate that the influence of the number of mixtures is strong while the influence of duration is weak, provide a basis for the training set selection of the Universal Background Model (UBM).
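
    A GMM supervector stacks per-mixture statistics of a single utterance against a universal background model. A minimal sketch with scikit-learn; the variance-based normalization paraphrases the abstract's 'normalized mean vectors' and should be treated as an assumption:

        import numpy as np
        from sklearn.mixture import GaussianMixture

        def gmm_supervector(frames, ubm):
            # Posterior-weighted utterance means per mixture, shifted by
            # the UBM means and scaled by the UBM standard deviations,
            # then stacked into one long vector for the SVM.
            post = ubm.predict_proba(frames)              # (n_frames, n_mix)
            counts = post.sum(axis=0)[:, None] + 1e-10
            utt_means = (post.T @ frames) / counts
            return ((utt_means - ubm.means_) / np.sqrt(ubm.covariances_)).ravel()

        # ubm = GaussianMixture(64, covariance_type="diag").fit(train_frames)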

  15. POLISH EMOTIONAL SPEECH RECOGNITION USING ARTIFICIAL NEURAL NETWORK

    Directory of Open Access Journals (Sweden)

    Paweł Powroźnik

    2014-11-01

    Full Text Available The article presents the issue of emotion recognition based on Polish emotional speech analysis. The Polish database of emotional speech, prepared and shared by the Medical Electronics Division of the Lodz University of Technology, has been used for the research. The following parameters extracted from the sampled and normalised speech signal have been used for the analysis: energy of the signal, speaker's sex, average value of the speech signal, and both the minimum and maximum sample value for a given signal. As the emotional state classifier, a four-layer artificial neural network has been used. The achieved results reach 50% accuracy. The conducted research focused on six emotional states: a neutral state, sadness, joy, anger, fear and boredom.

  16. A Research of Speech Emotion Recognition Based on Deep Belief Network and SVM

    Directory of Open Access Journals (Sweden)

    Chenchen Huang

    2014-01-01

    Full Text Available Feature extraction is a very important part of speech emotion recognition. Addressing the feature extraction problem in speech emotion recognition, this paper proposes a new feature extraction method that uses deep belief networks (DBNs) to extract emotional features from the speech signal automatically. By training a five-layer-deep DBN, speech emotion features are extracted, and multiple consecutive frames are incorporated to form a high-dimensional feature. The features after training in the DBN were the input of a nonlinear SVM classifier, and finally a multiple-classifier speech emotion recognition system was achieved. The speech emotion recognition rate of the system reached 86.5%, which was 7% higher than the original method.

  17. New Ideas for Speech Recognition and Related Technologies

    Energy Technology Data Exchange (ETDEWEB)

    Holzrichter, J F

    2002-06-17

    The ideas relating to the use of organ motion sensors for the purposes of speech recognition were first described by the author in spring 1994. During the past year, a series of productive collaborations between the author, Tom McEwan and Larry Ng ensued and have led to demonstrations, new sensor ideas, and algorithmic descriptions of a large number of speech recognition concepts. This document summarizes the basic concepts of recognizing speech once organ motions have been obtained. Micro power radars and their uses for the measurement of body organ motions, such as those of the heart and lungs, have been demonstrated by Tom McEwan over the past two years. McEwan and I conducted a series of experiments, using these instruments, on vocal organ motions beginning in late spring, during which we observed motions of vocal folds (i.e., cords), tongue, jaw, and related organs that are very useful for speech recognition and other purposes. These will be reviewed in a separate paper. Since late summer 1994, Lawrence Ng and I have worked to make many of the initial recognition ideas more rigorous and to investigate the applications of these new ideas to new speech recognition algorithms, to speech coding, and to speech synthesis. I introduce some of those ideas in section IV of this document, and we describe them more completely in the document following this one, UCRL-UR-120311. For the design and operation of micro-power radars and their application to body organ motions, the reader may contact Tom McEwan directly. The capability for using EM sensors (i.e., radar units) to measure body organ motions and positions has been available for decades. Impediments to their use appear to have been size, excessive power, lack of resolution, and lack of understanding of the value of organ motion measurements, especially as applied to speech related technologies. However, with the invention of very low power, portable systems as demonstrated by McEwan at LLNL, researchers have begun

  18. Temporal visual cues aid speech recognition

    DEFF Research Database (Denmark)

    Zhou, Xiang; Ross, Lars; Lehn-Schiøler, Tue;

    2006-01-01

    BACKGROUND: It is well known that under noisy conditions, viewing a speaker's articulatory movement aids the recognition of spoken words. Conventionally it is thought that the visual input disambiguates otherwise confusing auditory input. HYPOTHESIS: In contrast, we hypothesize that it is the temporal synchronicity of the visual input that aids parsing of the auditory stream. More specifically, we expected that purely temporal information, which does not convey information such as place of articulation, may facilitate word recognition. METHODS: To test this prediction we used temporal features of audio to generate an artificial talking-face video and measured word recognition performance on simple monosyllabic words. RESULTS: When presenting words together with the artificial video we find that word recognition is improved over purely auditory presentation. The effect is significant (p...

  19. Biologically inspired emotion recognition from speech

    Directory of Open Access Journals (Sweden)

    Buscicchio Cosimo

    2011-01-01

    Full Text Available Emotion recognition has become a fundamental task in human-computer interaction systems. In this article, we propose an emotion recognition approach based on biologically inspired methods. Specifically, emotion classification is performed using a long short-term memory (LSTM) recurrent neural network which is able to recognize long-range dependencies between successive temporal patterns. We propose to represent data using features derived from two different models: mel-frequency cepstral coefficients (MFCC) and the Lyon cochlear model. In the experimental phase, results obtained from the LSTM network and the two different feature sets are compared, showing that features derived from the Lyon cochlear model give better recognition results in comparison with those obtained with the traditional MFCC representation.

  20. Biologically inspired emotion recognition from speech

    Science.gov (United States)

    Caponetti, Laura; Buscicchio, Cosimo Alessandro; Castellano, Giovanna

    2011-12-01

    Emotion recognition has become a fundamental task in human-computer interaction systems. In this article, we propose an emotion recognition approach based on biologically inspired methods. Specifically, emotion classification is performed using a long short-term memory (LSTM) recurrent neural network which is able to recognize long-range dependencies between successive temporal patterns. We propose to represent data using features derived from two different models: mel-frequency cepstral coefficients (MFCC) and the Lyon cochlear model. In the experimental phase, results obtained from the LSTM network and the two different feature sets are compared, showing that features derived from the Lyon cochlear model give better recognition results in comparison with those obtained with the traditional MFCC representation.

  1. Text Independent Speaker Recognition and Speaker Independent Speech Recognition Using Iterative Clustering Approach

    Directory of Open Access Journals (Sweden)

    A.Revathi

    2009-11-01

    Full Text Available This paper presents the effectiveness of perceptual features and an iterative clustering approach for performing both speech and speaker recognition. The procedure used for forming the training speech differs between the training models for speaker-independent speech recognition and for text-independent speaker recognition. This work mainly emphasizes the utilization of clustering models developed from the training data to obtain better accuracy, namely 91%, 91% and 99.5% for mel frequency perceptual linear predictive cepstrum with respect to three categories: speaker identification, isolated digit recognition and continuous speech recognition. This feature also produces a low equal error rate of 9%, which is used as a performance measure for speaker verification. The work is experimentally evaluated on the set of isolated digits and continuous speeches from the TI digits_1 and TI digits_2 databases for speech recognition, and on the speeches of 50 speakers randomly chosen from the TIMIT database for speaker recognition. The noteworthy feature of the speaker recognition algorithm is that the testing procedure is evaluated on identical messages of all 50 speakers, with theoretical validation of the results using the F-ratio and statistical validation using the χ2 distribution.

  2. Voice Activity Detector of Wake-Up-Word Speech Recognition System Design on FPGA

    OpenAIRE

    Veton Z. Këpuska; Mohamed M. Eljhani; Brian H. Hight

    2014-01-01

    A typical speech recognition system is push-to-talk operated and requires activation. However, for those who use hands-busy applications, movement may be restricted or impossible. One alternative is to use a speech-only interface. The proposed method, called Wake-Up-Word Speech Recognition (WUW-SR), utilizes a speech-only interface. A WUW-SR system would allow the user to activate systems (cell phone, computer, etc.) with only speech commands instead of manual activation. T...

  3. EMOTION RECOGNITION FROM SPEECH SIGNAL: REALIZATION AND AVAILABLE TECHNIQUES

    Directory of Open Access Journals (Sweden)

    NILIM JYOTI GOGOI

    2014-05-01

    Full Text Available The ability to detect human emotion from speech is going to be a great addition to the field of human-robot interaction. The aim of the work is to build an emotion recognition system using Mel-frequency cepstral coefficients (MFCC) and a Gaussian mixture model (GMM) classifier. Basically, the purpose of the work is to describe the best available methods for recognizing emotion from emotional speech. For that reason, existing techniques and methods used for feature extraction and pattern classification have been reviewed and discussed in this paper.

  4. Speech Recognition Using HMM with MFCC - An Analysis Using Frequency Spectral Decomposition Technique

    Directory of Open Access Journals (Sweden)

    Ibrahim Patel

    2010-12-01

    Full Text Available This paper presents an approach to the recognition of speech signals using frequency spectral information with Mel frequency, for the improvement of speech feature representation in an HMM-based recognition approach. Frequency spectral information is incorporated into the conventional Mel-spectrum-based speech recognition approach. The Mel frequency approach exploits the frequency observations for the speech signal at a given resolution, which results in overlapping resolution features and thus limits recognition. Resolution decomposition with frequency separation is the mapping approach used for the HMM-based speech recognition system. Simulation results show an improvement in the quality metrics of speech recognition with respect to computational time and learning accuracy for a speech recognition system.
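
    Independent of the paper's specific decomposition, the Mel-frequency front end it starts from follows a standard pipeline. A minimal single-frame sketch (windowing and pre-emphasis omitted; the filter and coefficient counts are common defaults, not the paper's):

        import numpy as np
        from scipy.fftpack import dct

        def mfcc(frame, sr, n_filt=26, n_ceps=13):
            # Power spectrum -> triangular Mel filterbank -> log -> DCT.
            power = np.abs(np.fft.rfft(frame)) ** 2
            n_fft = len(frame)
            mel_max = 2595.0 * np.log10(1.0 + sr / 2.0 / 700.0)
            hz = 700.0 * (10.0 ** (np.linspace(0.0, mel_max, n_filt + 2)
                                   / 2595.0) - 1.0)
            bins = np.floor((n_fft + 1) * hz / sr).astype(int)
            fbank = np.zeros((n_filt, len(power)))
            for i in range(n_filt):
                left, center, right = bins[i], bins[i + 1], bins[i + 2]
                fbank[i, left:center] = ((np.arange(left, center) - left)
                                         / max(center - left, 1))
                fbank[i, center:right] = ((right - np.arange(center, right))
                                          / max(right - center, 1))
            return dct(np.log(fbank @ power + 1e-10), norm="ortho")[:n_ceps]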

  5. Noise robust speech recognition with support vector learning algorithms

    Science.gov (United States)

    Namarvar, Hassan H.; Berger, Theodore W.

    2001-05-01

    We propose a new noise robust speech recognition system using time-frequency domain analysis and radial basis function (RBF) support vector machines (SVM). Here, we ignore the effects of correlated and nonstationary noise and focus only on continuous additive Gaussian white noise. We then develop an isolated digit/command recognizer and compare its performance to two other systems, in which the SVM classifier has been replaced by multilayer perceptron (MLP) and RBF neural networks. All systems are trained under the low signal-to-noise ratio (SNR) condition. We obtained best correct classification rates of 83% and 52% for digit recognition on the TI-46 corpus for the SVM and MLP systems, respectively, under SNR = 0 dB, while we could not train the RBF network for the same dataset. The newly developed speech recognition system appears to be noise robust for medium-size speech recognition problems under continuous, stationary background noise. However, the system still needs to be tested in realistic noisy environments to determine whether it keeps its adaptability and robustness under such conditions. [Work supported in part by grants from DARPA CBS, NASA, and ONR.]

  6. An automatic speech recognition system with speaker-independent identification support

    Science.gov (United States)

    Caranica, Alexandru; Burileanu, Corneliu

    2015-02-01

    The novelty of this work relies on the application of an open source research software toolkit (CMU Sphinx) to train, build and evaluate a speech recognition system, with speaker-independent support, for voice-controlled hardware applications. Moreover, we propose to use the trained acoustic model to successfully decode offline voice commands on embedded hardware, such as an ARMv6 low-cost SoC, Raspberry PI. This type of single-board computer, mainly used for educational and research activities, can serve as a proof-of-concept software and hardware stack for low cost voice automation systems.

  7. Speech Acquisition and Automatic Speech Recognition for Integrated Spacesuit Audio Systems

    Science.gov (United States)

    Huang, Yiteng; Chen, Jingdong; Chen, Shaoyan

    2010-01-01

    A voice-command human-machine interface system has been developed for spacesuit extravehicular activity (EVA) missions. A multichannel acoustic signal processing method has been created for distant speech acquisition in noisy and reverberant environments. This technology reduces noise by exploiting differences in the statistical nature of signal (i.e., speech) and noise that exist in the spatial and temporal domains. As a result, the automatic speech recognition (ASR) accuracy can be improved to the level at which crewmembers would find the speech interface useful. The developed speech human/machine interface will enable both crewmember usability and operational efficiency. It offers a fast rate of data/text entry and a small, lightweight overall design. In addition, this design will free the hands and eyes of a suited crewmember. The system components and steps include beamforming/multi-channel noise reduction, single-channel noise reduction, speech feature extraction, feature transformation and normalization, feature compression, model adaptation, ASR HMM (Hidden Markov Model) training, and ASR decoding. A state-of-the-art phoneme recognizer can obtain an accuracy rate of 65 percent when the training and testing data are free of noise. When it is used in spacesuits, the rate drops to about 33 percent. With the developed microphone array speech-processing technologies, the performance is improved and the phoneme recognition accuracy rate rises to 44 percent. The recognizer can be further improved by combining the microphone array and HMM model adaptation techniques and using speech samples collected from inside spacesuits. In addition, arithmetic complexity models for the major HMM-based ASR components were developed. They can help real-time ASR system designers select proper tasks when facing constraints in computational resources.

  8. Automatic Speech Recognition Systems for the Evaluation of Voice and Speech Disorders in Head and Neck Cancer

    Directory of Open Access Journals (Sweden)

    Andreas Maier

    2010-01-01

    Full Text Available In patients suffering from head and neck cancer, speech intelligibility is often restricted. For assessment and outcome measurements, automatic speech recognition systems have previously been shown to be appropriate for objective and quick evaluation of intelligibility. In this study we investigate the applicability of the method to speech disorders caused by head and neck cancer. Intelligibility was quantified by speech recognition on recordings of a standard text read by 41 German laryngectomized patients with cancer of the larynx or hypopharynx and 49 German patients who had suffered from oral cancer. The speech recognition provides the percentage of correctly recognized words of a sequence, that is, the word recognition rate. Automatic evaluation was compared to perceptual ratings by a panel of experts and to an age-matched control group. Both patient groups showed significantly lower word recognition rates than the control group. Automatic speech recognition yielded word recognition rates which agreed with the experts' evaluation of intelligibility at a significant level. Automatic speech recognition serves as a good means, with low effort, to objectify and quantify the most important aspect of pathologic speech: the intelligibility. The system was successfully applied to voice and speech disorders.
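
    The word recognition rate reduces to a word-level alignment between the reference text and the recognizer output. A minimal sketch of word accuracy via edit distance, which is one plausible reading of the record's 'percentage of correctly recognized words':

        import numpy as np

        def word_accuracy(reference, hypothesis):
            # Levenshtein alignment at the word level; accuracy in percent
            # is 100 * (1 - (S + D + I) / N).
            ref, hyp = reference.split(), hypothesis.split()
            d = np.zeros((len(ref) + 1, len(hyp) + 1), dtype=int)
            d[:, 0] = np.arange(len(ref) + 1)
            d[0, :] = np.arange(len(hyp) + 1)
            for i in range(1, len(ref) + 1):
                for j in range(1, len(hyp) + 1):
                    cost = 0 if ref[i - 1] == hyp[j - 1] else 1
                    d[i, j] = min(d[i - 1, j] + 1,         # deletion
                                  d[i, j - 1] + 1,         # insertion
                                  d[i - 1, j - 1] + cost)  # substitution
            return 100.0 * (1.0 - d[-1, -1] / max(len(ref), 1))

        print(word_accuracy("the quick brown fox", "the quick brawn"))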

  9. An overview of the SPHINX speech recognition system

    Science.gov (United States)

    Lee, Kai-Fu; Hon, Hsiao-Wuen; Reddy, Raj

    1990-01-01

    A description is given of SPHINX, a system that demonstrates the feasibility of accurate, large-vocabulary, speaker-independent, continuous speech recognition. SPHINX is based on discrete hidden Markov models (HMMs) with linear-predictive-coding derived parameters. To provide speaker independence, knowledge was added to these HMMs in several ways: multiple codebooks of fixed-width parameters, and an enhanced recognizer with carefully designed models and word-duration modeling. To deal with coarticulation in continuous speech, yet still adequately represent a large vocabulary, two new subword speech units are introduced: function-word-dependent phone models and generalized triphone models. With grammars of perplexity 997, 60, and 20, SPHINX attained word accuracies of 71, 94, and 96 percent, respectively, on a 997-word task.
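
    The perplexity figures quoted above measure the average branching factor a grammar leaves the recognizer per word; lower perplexity means an easier task. A small sketch of the computation under a hypothetical bigram language model:

        import math

        def perplexity(sentence, bigram_prob):
            """2 ** (per-word cross-entropy) of a word sequence under a bigram model."""
            words = ["<s>"] + sentence.split()
            log_prob = sum(math.log2(bigram_prob(prev, word))
                           for prev, word in zip(words, words[1:]))
            return 2.0 ** (-log_prob / (len(words) - 1))

        # A toy grammar that always allows 20 equally likely next words
        # has perplexity 20, like the easiest SPHINX condition above.
        print(perplexity("show all resource management tasks", lambda p, w: 1 / 20.0))  # 20.0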

  10. Initial evaluation of a continuous speech recognition program for radiology

    OpenAIRE

    Kanal, KM; Hangiandreou, NJ; Sykes, AM; Eklund, HE; Araoz, PA; Leon, JA; Erickson, BJ

    2001-01-01

    The aims of this work were to measure the accuracy of one continuous speech recognition product and dependence on the speaker's gender and status as a native or nonnative English speaker, and evaluate the product's potential for routine use in transcribing radiology reports. IBM MedSpeak/Radiology software, version 1.1 was evaluated by 6 speakers. Two were nonnative English speakers, and 3 were men. Each speaker dictated a set of 12 reports. The reports included neurologic and body imaging ex...

  11. An audio-visual corpus for multimodal speech recognition in Dutch language

    NARCIS (Netherlands)

    Wojdel, J.; Wiggers, P.; Rothkrantz, L.J.M.

    2002-01-01

    This paper describes the gathering and availability of an audio-visual speech corpus for Dutch language. The corpus was prepared with the multi-modal speech recognition in mind and it is currently used in our research on lip-reading and bimodal speech recognition. It contains the prompts used also i

  12. Syntactic error modeling and scoring normalization in speech recognition: Error modeling and scoring normalization in the speech recognition task for adult literacy training

    Science.gov (United States)

    Olorenshaw, Lex; Trawick, David

    1991-01-01

    The purpose was to develop a speech recognition system able to detect incorrectly pronounced speech, given that the text of the spoken utterance is known to the recognizer. Better mechanisms are thus provided for using speech recognition in a literacy tutor application. Using a combination of scoring normalization techniques and cheater-mode decoding, a reasonable acceptance/rejection threshold was obtained. In continuous speech, the system correctly accepted more than 80 percent of correctly pronounced words while correctly rejecting more than 80 percent of incorrectly pronounced words.
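
    The abstract does not spell out the normalization used, but a common form of such an acceptance/rejection test is a per-frame log-likelihood ratio between the expected word's model and a background (filler) model, compared against a tuned threshold. A hedged sketch, with a purely illustrative threshold:

        def accept_word(target_logprob, filler_logprob, n_frames, threshold=-0.5):
            """Accept the pronunciation if the per-frame log-likelihood ratio
            of the expected word vs. a background model clears the threshold."""
            score = (target_logprob - filler_logprob) / n_frames
            return score >= threshold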

  13. Merge-Weighted Dynamic Time Warping for Speech Recognition

    Institute of Scientific and Technical Information of China (English)

    张湘莉兰; 骆志刚; 李明

    2014-01-01

    Obtaining training material for rarely used English words and common given names from countries where English is not spoken is difficult due to excessive time, storage and cost factors. Considering personal privacy, a language-independent (LI) system with lightweight speaker-dependent (SD) automatic speech recognition (ASR) is a convenient option to solve the problem. The dynamic time warping (DTW) algorithm is the state-of-the-art algorithm for small-footprint SD ASR for real-time applications with limited storage and small vocabularies. These applications include voice dialing on mobile devices, menu-driven recognition, and voice control on vehicles and robotics. However, traditional DTW has several limitations, such as high computational complexity, constraint-induced coarse approximation, and inaccuracy problems. In this paper, we introduce the merge-weighted dynamic time warping (MWDTW) algorithm. This method defines a template confidence index for measuring the similarity between merged training data and testing data, while following the core DTW process. MWDTW is simple, efficient, and easy to implement. With extensive experiments on three representative SD speech recognition datasets, we demonstrate that our method significantly outperforms DTW, DTW on merged speech data, and the hidden Markov model (HMM), and is also six times faster than DTW overall.
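
    For reference, the core DTW recursion that MWDTW follows can be sketched in a few lines; per-frame feature vectors and a Euclidean local distance are common choices rather than specifics of this paper.

        import numpy as np

        def dtw_distance(x, y):
            """Dynamic time warping cost between feature sequences x (n, d) and y (m, d)."""
            n, m = len(x), len(y)
            cost = np.full((n + 1, m + 1), np.inf)
            cost[0, 0] = 0.0
            for i in range(1, n + 1):
                for j in range(1, m + 1):
                    d = np.linalg.norm(x[i - 1] - y[j - 1])   # local frame distance
                    cost[i, j] = d + min(cost[i - 1, j],      # insertion
                                         cost[i, j - 1],      # deletion
                                         cost[i - 1, j - 1])  # match
            return cost[n, m]

    A speaker-dependent recognizer then labels a test utterance with the template whose DTW cost is lowest; MWDTW adds its merged-template confidence index on top of this recursion.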

  14. Exploiting temporal correlation of speech for error robust and bandwidth flexible distributed speech recognition

    DEFF Research Database (Denmark)

    Tan, Zheng-Hua; Dalsgaard, Paul; Lindberg, Børge

    2007-01-01

    In this paper the temporal correlation of speech is exploited in front-end feature extraction, client based error recovery and server based error concealment (EC) for distributed speech recognition. First, the paper investigates a half frame rate (HFR) front-end that uses double frame shifting...... features creates a set of error recovery techniques encompassing multiple description coding and interleaving schemes where interleaving has the advantage of not introducing a delay when there are no transmission errors. Thirdly, a sub-vector based EC technique is presented where error detection...

  15. Age-Related Differences in Lexical Access Relate to Speech Recognition in Noise.

    Science.gov (United States)

    Carroll, Rebecca; Warzybok, Anna; Kollmeier, Birger; Ruigendijk, Esther

    2016-01-01

    Vocabulary size has been suggested as a useful measure of "verbal abilities" that correlates with speech recognition scores. Knowing more words is linked to better speech recognition. How vocabulary knowledge translates to general speech recognition mechanisms, how these mechanisms relate to offline speech recognition scores, and how they may be modulated by acoustical distortion or age, is less clear. Age-related differences in linguistic measures may predict age-related differences in speech recognition in noise performance. We hypothesized that speech recognition performance can be predicted by the efficiency of lexical access, which refers to the speed with which a given word can be searched and accessed relative to the size of the mental lexicon. We tested speech recognition in a clinical German sentence-in-noise test at two signal-to-noise ratios (SNRs), in 22 younger (18-35 years) and 22 older (60-78 years) listeners with normal hearing. We also assessed receptive vocabulary, lexical access time, verbal working memory, and hearing thresholds as measures of individual differences. Age group, SNR level, vocabulary size, and lexical access time were significant predictors of individual speech recognition scores, but working memory and hearing threshold were not. Interestingly, longer accessing times were correlated with better speech recognition scores. Hierarchical regression models for each subset of age group and SNR showed very similar patterns: the combination of vocabulary size and lexical access time contributed most to speech recognition performance; only for the younger group at the better SNR (yielding about 85% correct speech recognition) did vocabulary size alone predict performance. Our data suggest that successful speech recognition in noise is mainly modulated by the efficiency of lexical access. This suggests that older adults' poorer performance in the speech recognition task may have arisen from reduced efficiency in lexical access; with an

  16. Error analysis to improve the speech recognition accuracy on Telugu language

    Indian Academy of Sciences (India)

    N Usha Rani; P N Girija

    2012-12-01

    Speech is one of the most important communication channels among people, and speech recognition occupies a prominent place in communication between humans and machines. Several factors affect the accuracy of a speech recognition system; despite much effort to increase accuracy, current systems still generate erroneous output. Telugu is one of the most widely spoken south Indian languages. In the proposed Telugu speech recognition system, errors obtained from the decoder are analysed to improve the performance of the system. The static pronunciation dictionary plays a key role in recognition accuracy, so modifications are performed on the dictionary used in the decoder. This modification reduces the number of confusion pairs, which improves the performance of the speech recognition system; language model scores also vary with it. The hit rate increases considerably and the false alarms change as the pronunciation dictionary is modified. Variations are observed in different error measures, such as F-measure, error rate and Word Error Rate (WER), on application of the proposed method.

  17. A study of speech emotion recognition based on hybrid algorithm

    Science.gov (United States)

    Zhu, Ju-xia; Zhang, Chao; Lv, Zhao; Rao, Yao-quan; Wu, Xiao-pei

    2011-10-01

    To effectively improve the recognition accuracy of a speech emotion recognition system, a hybrid algorithm is proposed which combines a Continuous Hidden Markov Model (CHMM), an All-Class-in-One Neural Network (ACON) and a Support Vector Machine (SVM). In the SVM and ACON methods, global statistics are used as emotional features, while in the CHMM method, instantaneous features are employed. The recognition rate of the proposed method is 92.25%, with a rejection rate of 0.78%. Furthermore, it obtains relative improvements of 8.53%, 4.69% and 0.78% over the ACON, CHMM and SVM methods, respectively. The experimental results confirm its efficiency in distinguishing the anger, happiness, neutral and sadness emotional states.

  18. Studies on inter-speaker variability in speech and its application in automatic speech recognition

    Indian Academy of Sciences (India)

    S Umesh

    2011-10-01

    In this paper, we give an overview of the problem of inter-speaker variability and its study in many diverse areas of speech signal processing. We first give an overview of vowel-normalization studies that minimize variations in the acoustic representation of vowel realizations by different speakers. We then describe the universal-warping approach to speaker normalization which unifies many of the vowel normalization approaches and also shows the relation between speech production, perception and auditory processing. We then address the problem of inter-speaker variability in automatic speech recognition (ASR) and describe techniques that are used to reduce these effects and thereby improve the performance of speaker-independent ASR systems.
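
    A widely used instance of the normalization techniques surveyed here is vocal tract length normalization with a piecewise-linear frequency warp applied before filterbank analysis. The sketch below is generic; the cut-off and Nyquist frequencies are illustrative, and warp factors are typically searched over roughly 0.88-1.12 per speaker.

        import numpy as np

        def vtln_warp(freqs, alpha, f_cut=4800.0, f_nyq=8000.0):
            """Scale frequencies by alpha below f_cut, then map linearly so the
            Nyquist frequency stays fixed (piecewise-linear VTLN warp)."""
            freqs = np.asarray(freqs, dtype=float)
            upper = alpha * f_cut + (f_nyq - alpha * f_cut) * (freqs - f_cut) / (f_nyq - f_cut)
            return np.where(freqs <= f_cut, alpha * freqs, upper)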

  19. Improving on hidden Markov models: An articulatorily constrained, maximum likelihood approach to speech recognition and speech coding

    Energy Technology Data Exchange (ETDEWEB)

    Hogden, J.

    1996-11-05

    The goal of the proposed research is to test a statistical model of speech recognition that incorporates the knowledge that speech is produced by relatively slow motions of the tongue, lips, and other speech articulators. This model is called Maximum Likelihood Continuity Mapping (Malcom). Many speech researchers believe that by using constraints imposed by articulator motions, we can improve or replace the current hidden Markov model based speech recognition algorithms. Unfortunately, previous efforts to incorporate information about articulation into speech recognition algorithms have suffered because (1) slight inaccuracies in our knowledge or the formulation of our knowledge about articulation may decrease recognition performance, (2) small changes in the assumptions underlying models of speech production can lead to large changes in the speech derived from the models, and (3) collecting measurements of human articulator positions in sufficient quantity for training a speech recognition algorithm is still impractical. The most interesting (and in fact, unique) quality of Malcom is that, even though Malcom makes use of a mapping between acoustics and articulation, Malcom can be trained to recognize speech using only acoustic data. By learning the mapping between acoustics and articulation using only acoustic data, Malcom avoids the difficulties involved in collecting articulator position measurements and does not require an articulatory synthesizer model to estimate the mapping between vocal tract shapes and speech acoustics. Preliminary experiments that demonstrate that Malcom can learn the mapping between acoustics and articulation are discussed. Potential applications of Malcom aside from speech recognition are also discussed. Finally, specific deliverables resulting from the proposed research are described.

  1. Speaker-Adaptive Speech Recognition Based on Surface Electromyography

    Science.gov (United States)

    Wand, Michael; Schultz, Tanja

    We present our recent advances in silent speech interfaces using electromyographic signals that capture the movements of the human articulatory muscles at the skin surface for recognizing continuously spoken speech. Previous systems were limited to speaker- and session-dependent recognition tasks on small amounts of training and test data. In this article we present speaker-independent and speaker-adaptive training methods which allow us to use a large corpus of data from many speakers to train acoustic models more reliably. We use the speaker-dependent system as baseline, carefully tuning the data preprocessing and acoustic modeling. Then on our corpus we compare the performance of speaker-dependent and speaker-independent acoustic models and carry out model adaptation experiments.

  2. Speech recognition in individuals having a clinical complaint about understanding speech during noise or not

    Directory of Open Access Journals (Sweden)

    Becker, Karine Thaís

    2011-07-01

    Full Text Available Introduction: Clinical and experimental study. Individuals with normal hearing can be jeopardized in adverse communication situations, which negatively interferes with speech clearness. Objective: To check and compare the performance of normal-hearing young adults who do or do not have difficulty understanding speech in noise, using sentences as stimuli. Method: 50 normal-hearing individuals, 21 of whom were male and 29 female, aged between 19 and 32, were evaluated and divided into two groups: with and without a clinical complaint about understanding speech in noise. Using the Portuguese Sentence Lists test, the Speech Recognition Threshold of Sentences in Noise was measured, through which the signal-to-noise (SN) ratios were obtained. The competing noise was presented at 65 dB HL. Results: the average SN ratios in the right ear, for the group without a complaint and the group with a complaint, were respectively -6.26 dB and -3.62 dB. For the left ear, the values were -7.12 dB and -4.12 dB. A statistically significant difference between the two groups was observed in both the right and left ears. Conclusion: normal-hearing individuals with a clinical complaint about understanding speech in noisy places have more difficulty in the task of recognizing sentences in noise than people who do not report such a difficulty. Accordingly, the customary audiologic evaluation should include tests using sentences in competing noise, with a view to evaluating speech recognition performance more reliably and efficiently. ACTRN12610000822088

  3. Adaptive Compensation Algorithm in Open Vocabulary Mandarin Speaker-Independent Speech Recognition

    Institute of Scientific and Technical Information of China (English)

    2002-01-01

    In speech recognition systems, the physiological characteristics of the speech production model cause the voiced sections of the speech signal to have an attenuation of approximately 20 dB per decade. Many speech recognition algorithms compensate for this by filtering the input signal with a single-zero high-pass filter. Unfortunately, this technique increases the noise energy at high frequencies above 4 kHz, which in some cases degrades the recognition accuracy. This paper addresses the problem using a pre-emphasis filter in the front end of the recognizer. The aim is to develop a modified parameterization approach that takes the whole energy zone of the spectrum into account, to improve the performance of the existing baseline recognition system in the acoustic phase. The results show that a large-vocabulary speaker-independent continuous speech recognition system using this approach has a greatly improved recognition rate.
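
    The single-zero filter discussed above is the conventional pre-emphasis step. A minimal sketch (the coefficient 0.97 is a common textbook choice, not a value from this paper):

        import numpy as np

        def pre_emphasis(signal, coeff=0.97):
            """Single-zero high-pass filter y[n] = x[n] - coeff * x[n-1], which
            boosts high frequencies to offset the spectral tilt of voiced speech."""
            signal = np.asarray(signal, dtype=float)
            return np.append(signal[0], signal[1:] - coeff * signal[:-1])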

  4. The influence of age, hearing, and working memory on the speech comprehension benefit derived from an automatic speech recognition system

    NARCIS (Netherlands)

    Zekveld, A.A.; Kramer, S.E.; Kessens, J.M.; Vlaming, M.S.M.G.; Houtgast, T.

    2009-01-01

    Objective: The aim of the current study was to examine whether partly incorrect subtitles that are automatically generated by an Automatic Speech Recognition (ASR) system, improve speech comprehension by listeners with hearing impairment. In an earlier study (Zekveld et al. 2008), we showed that spe

  5. Efficient Speech Recognition by Using Modular Neural Network

    Directory of Open Access Journals (Sweden)

    Dr.R.L.K.Venkateswarlu

    2011-05-01

    Full Text Available The modular approach and the neural network approach are well-known concepts in the research and engineering community. By combining the two, the Modular Neural Network approach becomes very effective in searching for solutions to complex problems in various fields. The aim of this study is to distribute the complexity of the ambiguous-word classification task over a set of modules, each of which is a single neural network characterized by a high degree of specialization. A modular architecture also increases the number of interfaces, and therewith the possibilities for incorporating external acoustic-phonetic knowledge. Modular Neural Network (MNN) speech recognition is presented with speaker-dependent single-word recognition in this paper. Using this approach and taking computational effort into account, the system performance can be assessed. The best performance while training with Modular Neural Network classifiers was 99.88% for MFCC features and 99.77% for LPCC features. It is found that MFCC performance is superior to LPCC performance when training the speech data with the Modular Neural Network classifier.
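
    As a brief, hedged sketch of how the two compared front ends can be computed with the librosa library (the file name, sampling rate and analysis orders are illustrative; LPCC features follow from the LPC coefficients via the standard LPC-to-cepstrum recursion, omitted here):

        import librosa

        y, sr = librosa.load("word.wav", sr=16000)          # hypothetical utterance
        mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)  # (13, n_frames) MFCC matrix
        lpc = librosa.lpc(y, order=12)                      # LPC polynomial coefficients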

  6. On model architecture for a children's speech recognition interactive dialog system

    OpenAIRE

    Kraleva, Radoslava; Kralev, Velin

    2016-01-01

    This report presents a general model of the architecture of information systems for the speech recognition of children, together with a model of the speech data stream and how it works. The results of these studies and the presented architectural model show that research needs to focus on acoustic-phonetic modeling in order to improve the quality of children's speech recognition and the robustness of the systems to noise and changes in the transmission environment. Another important aspe...

  7. Audibility-based predictions of speech recognition for children and adults with normal hearing.

    Science.gov (United States)

    McCreery, Ryan W; Stelmachowicz, Patricia G

    2011-12-01

    This study investigated the relationship between audibility and predictions of speech recognition for children and adults with normal hearing. The Speech Intelligibility Index (SII) is used to quantify the audibility of speech signals and can be applied to transfer functions to predict speech recognition scores. Although the SII is used clinically with children, relatively few studies have evaluated SII predictions of children's speech recognition directly. Children have required more audibility than adults to reach maximum levels of speech understanding in previous studies. Furthermore, children may require greater bandwidth than adults for optimal speech understanding, which could influence frequency-importance functions used to calculate the SII. Speech recognition was measured for 116 children and 19 adults with normal hearing. Stimulus bandwidth and background noise level were varied systematically in order to evaluate speech recognition as predicted by the SII and derive frequency-importance functions for children and adults. Results suggested that children required greater audibility to reach the same level of speech understanding as adults. However, differences in performance between adults and children did not vary across frequency bands.
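
    At its core the SII is an importance-weighted sum of band audibilities. A simplified sketch of that computation; the band importance weights would come from a standard or a derived frequency-importance function such as those in this study, and the values below are placeholders.

        import numpy as np

        def simple_sii(band_snr_db, band_importance):
            """Clip each band's SNR to [-15, +15] dB, map it to an audibility
            value in [0, 1], and weight by the band's importance."""
            snr = np.clip(np.asarray(band_snr_db, dtype=float), -15.0, 15.0)
            audibility = (snr + 15.0) / 30.0
            w = np.asarray(band_importance, dtype=float)
            return float(np.sum(w / w.sum() * audibility))

        print(simple_sii([12, 6, 0, -18], [0.2, 0.35, 0.3, 0.15]))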

  8. Automatic Speech Recognition Using Template Model for Man-Machine Interface

    OpenAIRE

    Mishra, Neema; Shrawankar, Urmila; Thakare, V. M

    2013-01-01

    Speech is a natural form of communication for human beings, and computers with the ability to understand speech and speak with a human voice are expected to contribute to the development of more natural man-machine interfaces. Computers with this kind of ability are gradually becoming a reality, through the evolution of speech recognition technologies. Speech is being an important mode of interaction with computers. In this paper Feature extraction is implemented using well-known Mel-Frequenc...

  9. Composite Wavelet Filters for Enhanced Automated Target Recognition

    Science.gov (United States)

    Chiang, Jeffrey N.; Zhang, Yuhan; Lu, Thomas T.; Chao, Tien-Hsin

    2012-01-01

    Automated Target Recognition (ATR) systems aim to automate target detection, recognition, and tracking. The current project applies a JPL ATR system to low-resolution sonar and camera videos taken from unmanned vehicles. These sonar images are inherently noisy and difficult to interpret, and pictures taken underwater are unreliable due to murkiness and inconsistent lighting. The ATR system breaks target recognition into three stages: 1) Videos of both sonar and camera footage are broken into frames and preprocessed to enhance images and detect Regions of Interest (ROIs). 2) Features are extracted from these ROIs in preparation for classification. 3) ROIs are classified as true or false positives using a standard Neural Network based on the extracted features. Several preprocessing, feature extraction, and training methods are tested and discussed in this paper.

  10. Robust Digital Speech Watermarking For Online Speaker Recognition

    Directory of Open Access Journals (Sweden)

    Mohammad Ali Nematollahi

    2015-01-01

    Full Text Available A robust and blind digital speech watermarking technique has been proposed for online speaker recognition systems, based on the Discrete Wavelet Packet Transform (DWPT) and multiplication to embed the watermark in the amplitudes of the wavelet subbands. In order to minimize the degradation effect of the watermark, subbands are selected where less speaker-specific information is available (500 Hz–3500 Hz and 6000 Hz–7000 Hz). Experimental results on the Texas Instruments Massachusetts Institute of Technology (TIMIT), Massachusetts Institute of Technology (MIT), and Mobile Biometry (MOBIO) corpora show that the degradation for speaker verification and identification is 1.16% and 2.52%, respectively. Furthermore, the proposed watermark technique provides enough robustness against different signal processing attacks.
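
    A toy version of multiplicative wavelet-packet embedding using the PyWavelets library. The wavelet, decomposition depth, node path and embedding strength are illustrative and do not reproduce the paper's subband selection.

        import numpy as np
        import pywt

        def embed_watermark(speech, bits, node_path="aad", alpha=0.01):
            """Scale the coefficients of one wavelet-packet subband up or down
            by a small factor according to the watermark bits (0/1 sequence)."""
            wp = pywt.WaveletPacket(data=speech, wavelet="db8", maxlevel=3)
            coeffs = wp[node_path].data            # modified in place
            for i, bit in enumerate(bits[:len(coeffs)]):
                coeffs[i] *= (1 + alpha) if bit else (1 - alpha)
            return wp.reconstruct(update=False)

    A blind detector would re-run the DWPT on the received signal and compare subband amplitudes against the expected scaling pattern.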

  11. Impact of noise and other factors on speech recognition in anaesthesia

    DEFF Research Database (Denmark)

    Alapetite, Alexandre

    2008-01-01

    Introduction: Speech recognition is currently being deployed in medical and anaesthesia applications. This article is part of a project to investigate and further develop a prototype of a speech-input interface in Danish for an electronic anaesthesia patient record, to be used in real time during operations. Objective: The aim of the experiment is to evaluate the relative impact of several factors affecting speech recognition when used in operating rooms, such as the type or loudness of background noises, type of microphone, type of recognition mode (free speech versus command mode), and type of training. Methods: Eight volunteers read aloud a total of about 3 600 typical short anaesthesia comments to be transcribed by a continuous speech recognition system. Background noises were collected in an operating room and reproduced. A regression analysis and descriptive statistics were done to evaluate...

  12. Benefits of spatial hearing to speech recognition in young people with normal hearing

    Institute of Scientific and Technical Information of China (English)

    SONG Peng-long; LI Hui-jun; WANG Ning-yu

    2011-01-01

    Background Many factors interfere with a listener attempting to grasp speech in noisy environments. The spatial hearing by which speech and noise can be spatially separated may play a crucial role in speech recognition in the presence of competing noise. This study aimed to assess whether, and to what degree, spatial hearing benefits speech recognition in young normal-hearing participants in quiet and noisy environments. Methods Twenty-eight young participants were tested with the Mandarin Hearing In Noise Test (MHINT) in quiet and noisy environments. The assessment method was characterized by modifications of the speech and noise configurations, as well as by changes of the speech presentation mode. The benefit of spatial hearing was measured by the speech recognition threshold (SRT) variation between speech condition 1 (SC1) and speech condition 2 (SC2). Results There was no significant difference in the SRT between SC1 and SC2 in quiet. SRT in SC1 was about 4.2 dB lower than that in SC2 in both the speech-shaped and four-babble noise conditions. SRTs measured in both SC1 and SC2 were lower in the speech-shaped noise condition than in the four-babble noise condition. Conclusion Spatial hearing in young normal-hearing participants contributes to speech recognition in noisy environments, but provides no benefit to speech recognition in quiet environments, which may be because auditory extrinsic redundancy offsets the lack of spatial hearing.

  13. A Support Vector Machine-Based Dynamic Network for Visual Speech Recognition Applications

    Directory of Open Access Journals (Sweden)

    Mihaela Gordan

    2002-11-01

    Full Text Available Visual speech recognition is an emerging research field. In this paper, we examine the suitability of support vector machines for visual speech recognition. Each word is modeled as a temporal sequence of visemes corresponding to the different phones realized. One support vector machine is trained to recognize each viseme and its output is converted to a posterior probability through a sigmoidal mapping. To model the temporal character of speech, the support vector machines are integrated as nodes into a Viterbi lattice. We test the performance of the proposed approach on a small visual speech recognition task, namely the recognition of the first four digits in English. The word recognition rate obtained is at the level of the previous best reported rates.
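
    A minimal sketch of one such viseme node using scikit-learn: a binary SVM whose margin distance is squashed through a sigmoid so the Viterbi lattice can treat it as a posterior. The sigmoid parameters a and b are placeholders; in Platt scaling they are fitted on held-out data.

        import numpy as np
        from sklearn.svm import LinearSVC

        def fit_viseme_detector(features, labels):
            """labels: 1 for frames of this viseme, 0 for all other visemes."""
            return LinearSVC().fit(features, labels)

        def viseme_posterior(svm, x, a=-1.0, b=0.0):
            """Sigmoid mapping p = 1 / (1 + exp(a * f(x) + b)) of the SVM margin."""
            f = svm.decision_function(x.reshape(1, -1))[0]
            return 1.0 / (1.0 + np.exp(a * f + b))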

  14. Automated Robot with Object Recognition and Handling Features

    Directory of Open Access Journals (Sweden)

    Amiraj Dhawan

    2013-06-01

    Full Text Available With the advent of new technologies, every industry is moving towards automation. A large number of jobs in industries such as manufacturing are performed repeatedly, and these jobs require a lot of human effort. In such cases, there is a need for an automated robot which can perform the repetitive task more efficiently. This paper is about a robot which has object recognition and handling features. The robot optically recognizes objects and picks and places them as per the hand gestures given by the user. It has a camera to capture images of the objects and one arm to perform the pick-and-place function.

  15. Automated License Plate Recognition for Toll Booth Application

    Directory of Open Access Journals (Sweden)

    Ketan S. Shevale

    2014-10-01

    Full Text Available This paper describes the Smart Vehicle Screening System, which can be installed into a tollbooth for automated recognition of vehicle license plate information from a photograph of a vehicle. An automated system could then be implemented to control the payment of fees, parking areas, highways, bridges or tunnels, etc. An approach is considered that identifies a vehicle by recognizing its license plate using image fusion, neural networks and threshold techniques, and experimental results on successful license plate recognition are presented.

  16. Advancing Electromyographic Continuous Speech Recognition: Signal Preprocessing and Modeling

    OpenAIRE

    Wand, Michael

    2014-01-01

    Speech is the natural medium of human communication, but audible speech can be overheard by bystanders and excludes speech-disabled people. This work presents a speech recognizer based on surface electromyography, where electric potentials of the facial muscles are captured by surface electrodes, allowing speech to be processed nonacoustically. A system which was state-of-the-art at the beginning of this thesis is substantially improved in terms of accuracy, flexibility, and robustness.

  17. Speech recognition interference by the temporal and spectral properties of a single competing talker.

    Science.gov (United States)

    Fogerty, Daniel; Xu, Jiaqian

    2016-08-01

    This study investigated how speech recognition during speech-on-speech masking may be impaired due to the interaction between amplitude modulations of the target and competing talker. Young normal-hearing adults were tested in a competing talker paradigm where the target and/or competing talker was processed to primarily preserve amplitude modulation cues. Effects of talker sex and linguistic interference were also examined. Results suggest that performance patterns for natural speech-on-speech conditions are largely consistent with the same masking patterns observed for signals primarily limited to temporal amplitude modulations. However, results also suggest a role for spectral cues in talker segregation and linguistic competition. PMID:27586780

  18. Significance of parametric spectral ratio methods in detection and recognition of whispered speech

    Science.gov (United States)

    Mathur, Arpit; Reddy, Shankar M.; Hegde, Rajesh M.

    2012-12-01

    In this article the significance of a new parametric spectral ratio method that can be used to detect whispered speech segments within normally phonated speech is described. Adaptation methods based on maximum likelihood linear regression (MLLR) are then used to realize a mismatched train-test style speech recognition system. The proposed parametric spectral ratio method computes a ratio spectrum of the linear prediction (LP) and the minimum variance distortionless response (MVDR) methods. The smoothed ratio spectrum is then used to detect whispered segments of speech within neutral speech segments effectively. The proposed LP-MVDR ratio method exhibits robustness at different SNRs, as indicated by the whisper diarization experiments conducted on the CHAINS and the cell phone whispered speech corpora. The proposed method also performs reasonably better than conventional methods for whisper detection. In order to integrate the proposed whisper detection method into a conventional speech recognition engine with minimal changes, adaptation methods based on the MLLR are used herein. The hidden Markov models corresponding to neutral-mode speech are adapted to the whispered-mode speech data in the whispered regions as detected by the proposed ratio method. The performance of this method is first evaluated on whispered speech data from the CHAINS corpus. The second set of experiments is conducted on the cell phone corpus of whispered speech. This corpus is collected using a setup that is used commercially for handling public transactions. The proposed whisper speech recognition system exhibits reasonably better performance when compared to several conventional methods. The results indicate the possibility of a whispered speech recognition system for cell phone based transactions.
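
    A sketch of the core ratio-spectrum computation for a single frame, assuming standard autocorrelation-based LP and MVDR estimates (the analysis order and FFT length are illustrative, and the LP envelope is computed up to its gain constant, which cancels in relative comparisons):

        import numpy as np
        from scipy.linalg import solve_toeplitz, toeplitz

        def lp_mvdr_ratio(frame, order=12, n_fft=512):
            """Ratio of the LP spectral envelope to the MVDR envelope."""
            r = np.correlate(frame, frame, mode="full")[len(frame) - 1:][:order + 1]
            a = solve_toeplitz((r[:order], r[:order]), r[1:order + 1])  # Yule-Walker
            lp_spec = 1.0 / np.abs(np.fft.rfft(np.concatenate(([1.0], -a)), n_fft)) ** 2
            # MVDR spectrum: P(f) = 1 / (e(f)^H R^(-1) e(f))
            R_inv = np.linalg.inv(toeplitz(r[:order + 1]))
            mvdr_spec = np.empty(n_fft // 2 + 1)
            for k in range(n_fft // 2 + 1):
                e = np.exp(-2j * np.pi * k * np.arange(order + 1) / n_fft)
                mvdr_spec[k] = 1.0 / np.real(np.conj(e) @ R_inv @ e)
            return lp_spec / mvdr_spec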

  19. A speech recognition system for data collection in precision agriculture

    Science.gov (United States)

    Dux, David Lee

    Agricultural producers have shown interest in collecting detailed, accurate, and meaningful field data through field scouting, but scouting is labor intensive. They use yield monitor attachments to collect weed and other field data while driving equipment. However, distractions from using a keyboard or buttons while driving can lead to driving errors or missed data points. At Purdue University, researchers have developed an ASR system to allow equipment operators to collect georeferenced data while keeping hands and eyes on the machine during harvesting and to ease georeferencing of data collected during scouting. A notebook computer retrieved locations from a GPS unit and displayed and stored data in Excel. A headset microphone with a single earphone collected spoken input while allowing the operator to hear outside sounds. One-, two-, or three-word commands activated appropriate VBA macros. Four speech recognition products were chosen based on hardware requirements and ability to add new terms. After training, speech recognition accuracy was 100% for Kurzweil VoicePlus and Verbex Listen for the 132 vocabulary words tested, during tests walking outdoors or driving an ATV. Scouting tests were performed by carrying the system in a backpack while walking in soybean fields. The system recorded a point or a series of points with each utterance. Boundaries of points showed problem areas in the field and single points marked rocks and field corners. Data were displayed as an Excel chart to show a real-time map as data were collected. The information was later displayed in a GIS over remote sensed field images. Field corners and areas of poor stand matched, with voice data explaining anomalies in the image. The system was tested during soybean harvest by using voice to locate weed patches. A harvester operator with little computer experience marked points by voice when the harvester entered and exited weed patches or areas with poor crop stand. The operator found the

  20. Visual abilities are important for auditory-only speech recognition: evidence from autism spectrum disorder.

    Science.gov (United States)

    Schelinski, Stefanie; Riedel, Philipp; von Kriegstein, Katharina

    2014-12-01

    In auditory-only conditions, for example when we listen to someone on the phone, it is essential to recognize quickly and accurately what is said (speech recognition). Previous studies have shown that speech recognition performance in auditory-only conditions is better if the speaker is known not only by voice, but also by face. Here, we tested the hypothesis that such an improvement in auditory-only speech recognition depends on the ability to lip-read. To test this we recruited a group of adults with autism spectrum disorder (ASD), a condition associated with difficulties in lip-reading, and typically developed controls. All participants were trained to identify six speakers by name and voice. Three speakers were learned by a video showing their face and three others were learned in a matched control condition without face. After training, participants performed an auditory-only speech recognition test that consisted of sentences spoken by the trained speakers. As a control condition, the test also included speaker identity recognition on the same auditory material. The results showed that, in the control group, performance in speech recognition was improved for speakers known by face in comparison to speakers learned in the matched control condition without face. The ASD group lacked such a performance benefit. For the ASD group auditory-only speech recognition was even worse for speakers known by face compared to speakers not known by face. In speaker identity recognition, the ASD group performed worse than the control group independent of whether the speakers were learned with or without face. Two additional visual experiments showed that the ASD group performed worse in lip-reading whereas face identity recognition was within the normal range. The findings support the view that auditory-only communication involves specific visual mechanisms. Further, they indicate that in ASD, speaker-specific dynamic visual information is not available to optimize auditory

  1. Employment of Spectral Voicing Information for Speech and Speaker Recognition in Noisy Conditions

    OpenAIRE

    Jančovič, Peter; Köküer, Münevver

    2008-01-01

    This chapter described our recent research on representation and modelling of speech signals for automatic speech and speaker recognition in noisy conditions. The chapter consisted of three parts. In the first part, we presented a novel method for estimation of the voicing information of speech spectra in the presence of noise. The presented method is based on calculating a similarity between the shape of signal short-term spectrum and the spectrum of the frame-analysis window. It does not re...

  2. Machine learning based sample extraction for automatic speech recognition using dialectal Assamese speech.

    Science.gov (United States)

    Agarwalla, Swapna; Sarma, Kandarpa Kumar

    2016-06-01

    Automatic Speaker Recognition (ASR) and related issues are continuously evolving as inseparable elements of Human Computer Interaction (HCI). With the assimilation of emerging concepts like big data and the Internet of Things (IoT) as extended elements of HCI, ASR techniques are passing through a paradigm shift. Of late, learning-based techniques have started to receive greater attention from research communities related to ASR, owing to the fact that the former possess a natural ability to mimic biological behavior and in that way aid ASR modeling and processing. The current learning-based ASR techniques are evolving further with the incorporation of big data and IoT-like concepts. Here, in this paper, we report certain approaches based on machine learning (ML) used for the extraction of relevant samples from a big data space, and apply them to ASR using certain soft computing techniques for Assamese speech with dialectal variations. A class of ML techniques comprising the basic Artificial Neural Network (ANN) in feedforward (FF) and Deep Neural Network (DNN) forms, using raw speech, extracted features and frequency-domain forms, is considered. The Multi Layer Perceptron (MLP) is configured with inputs in several forms to learn class information obtained using clustering and manual labeling. DNNs are also used to extract specific sentence types. Initially, from a large storage, relevant samples are selected and assimilated. Next, a few conventional methods are used for feature extraction of a few selected types. The features comprise both spectral and prosodic types. These are applied to Recurrent Neural Network (RNN) and Fully Focused Time Delay Neural Network (FFTDNN) structures to evaluate their performance in recognizing mood, dialect, speaker and gender variations in dialectal Assamese speech. The system is tested under several background noise conditions by considering the recognition rates (obtained using confusion matrices and manually) and computation time

  3. Developing and Evaluating an Oral Skills Training Website Supported by Automatic Speech Recognition Technology

    Science.gov (United States)

    Chen, Howard Hao-Jan

    2011-01-01

    Oral communication ability has become increasingly important to many EFL students. Several commercial software programs based on automatic speech recognition (ASR) technologies are available but their prices are not affordable for many students. This paper will demonstrate how the Microsoft Speech Application Software Development Kit (SASDK), a…

  4. Introduction and Overview of the Vicens-Reddy Speech Recognition System.

    Science.gov (United States)

    Kameny, Iris; Ritea, H.

    The Vicens-Reddy System is unique in the sense that it approaches the problem of speech recognition as a whole, rather than treating particular aspects of the problem, as in previous attempts. For example, where earlier systems treated only segmentation of speech into phoneme groups, or detected phonemes in a given context, the Vicens-Reddy System…

  5. Influences of Infant-Directed Speech on Early Word Recognition

    Science.gov (United States)

    Singh, Leher; Nestor, Sarah; Parikh, Chandni; Yull, Ashley

    2009-01-01

    When addressing infants, many adults adopt a particular type of speech, known as infant-directed speech (IDS). IDS is characterized by exaggerated intonation, as well as reduced speech rate, shorter utterance duration, and grammatical simplification. It is commonly asserted that IDS serves in part to facilitate language learning. Although…

  6. Is Listening in Noise Worth It? The Neurobiology of Speech Recognition in Challenging Listening Conditions.

    Science.gov (United States)

    Eckert, Mark A; Teubner-Rhodes, Susan; Vaden, Kenneth I

    2016-01-01

    This review examines findings from functional neuroimaging studies of speech recognition in noise to provide a neural systems level explanation for the effort and fatigue that can be experienced during speech recognition in challenging listening conditions. Neuroimaging studies of speech recognition consistently demonstrate that challenging listening conditions engage neural systems that are used to monitor and optimize performance across a wide range of tasks. These systems appear to improve speech recognition in younger and older adults, but sustained engagement of these systems also appears to produce an experience of effort and fatigue that may affect the value of communication. When considered in the broader context of the neuroimaging and decision making literature, the speech recognition findings from functional imaging studies indicate that the expected value, or expected level of speech recognition given the difficulty of listening conditions, should be considered when measuring effort and fatigue. The authors propose that the behavioral economics or neuroeconomics of listening can provide a conceptual and experimental framework for understanding effort and fatigue that may have clinical significance. PMID:27355759

  7. Combined Feature Extraction Techniques and Naive Bayes Classifier for Speech Recognition

    Directory of Open Access Journals (Sweden)

    Sonia Sunny

    2013-07-01

    Full Text Available Speech processing and consequent recognition are important areas of Digital Signal Processing, since speech allows people to communicate more naturally and efficiently. In this work, a speech recognition system is developed for recognizing digits in Malayalam. For recognizing speech, features are to be extracted from speech, and hence the feature extraction method plays an important role in speech recognition. Here, front-end processing for extracting the features is performed using two wavelet-based methods, namely Discrete Wavelet Transforms (DWT) and Wavelet Packet Decomposition (WPD). A Naive Bayes classifier is used for classification. After classification using the Naive Bayes classifier, DWT produced a recognition accuracy of 83.5% and WPD produced an accuracy of 80.7%. This paper is intended to devise a new feature extraction method which improves the recognition accuracy. So, a new method called Discrete Wavelet Packet Decomposition (DWPD) is introduced, which utilizes the hybrid features of both DWT and WPD. The performance of this new approach is evaluated, and it produced an improved recognition accuracy of 86.2% with the Naive Bayes classifier.
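
    A compact sketch of the DWT front end with a Naive Bayes back end, using PyWavelets and scikit-learn; representing each subband by its mean energy is one simple choice of feature, and the wavelet and depth are illustrative.

        import numpy as np
        import pywt
        from sklearn.naive_bayes import GaussianNB

        def dwt_features(signal, wavelet="db4", level=4):
            """Mean energy of each DWT subband as a compact feature vector."""
            coeffs = pywt.wavedec(signal, wavelet, level=level)
            return np.array([np.mean(c ** 2) for c in coeffs])

        # train_signals / train_labels: hypothetical digit recordings and labels
        # clf = GaussianNB().fit([dwt_features(s) for s in train_signals], train_labels)
        # digit = clf.predict([dwt_features(test_signal)])[0]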

  8. Influence of native and non-native multitalker babble on speech recognition in noise

    Directory of Open Access Journals (Sweden)

    Chandni Jain

    2014-03-01

    Full Text Available The aim of the study was to assess speech recognition in noise using multitalker babble in the native and a non-native language at two different signal-to-noise ratios. Speech recognition in noise was assessed in 60 participants (18 to 30 years) with normal hearing sensitivity, having Malayalam or Kannada as their native language. For this purpose, six- and ten-talker babble were generated in the Kannada and Malayalam languages. Speech recognition was assessed for native listeners of both languages in the presence of native and non-native multitalker babble. Results showed that speech recognition in noise was significantly higher at 0 dB signal-to-noise ratio (SNR) than at -3 dB SNR for both languages. Performance of Kannada listeners was significantly higher in the presence of native (Kannada) babble than non-native (Malayalam) babble. However, this was not the case for the Malayalam listeners, who performed equally well with native (Malayalam) and non-native (Kannada) babble. The results of the present study highlight the importance of using native multitalker babble for Kannada listeners in lieu of non-native babble, and of considering each SNR when estimating speech recognition in noise scores. Further research is needed to assess speech recognition in Malayalam listeners in the presence of other non-native backgrounds of various types.

  9. Adoption of Speech Recognition Technology in Community Healthcare Nursing.

    Science.gov (United States)

    Al-Masslawi, Dawood; Block, Lori; Ronquillo, Charlene

    2016-01-01

    Adoption of new health information technology is shown to be challenging. However, the degree to which new technology will be adopted can be predicted by measures of usefulness and ease of use. In this work, these key determining factors are the focus of the design of a wound documentation tool. In the context of wound care at home, and consistent with evidence in the literature from similar settings, the use of Speech Recognition Technology (SRT) for patient documentation has shown promise. To achieve a user-centred design, the results from ethnographic fieldwork are used to inform SRT features; furthermore, exploratory prototyping is used to collect feedback about the wound documentation tool from home care nurses. During this study, measures developed for healthcare applications of the Technology Acceptance Model will be used to identify SRT features that improve usefulness (e.g. increased accuracy, saving time) or ease of use (e.g. lowering mental/physical effort, easy-to-remember tasks). The identified features will be used to create a low-fidelity prototype that will be evaluated in future experiments. PMID:27332294

  10. Effect of Speaker Age on Speech Recognition and Perceived Listening Effort in Older Adults with Hearing Loss

    Science.gov (United States)

    McAuliffe, Megan J.; Wilding, Phillipa J.; Rickard, Natalie A.; O'Beirne, Greg A.

    2012-01-01

    Purpose: Older adults exhibit difficulty understanding speech that has been experimentally degraded. Age-related changes to the speech mechanism lead to natural degradations in signal quality. We tested the hypothesis that older adults with hearing loss would exhibit declines in speech recognition when listening to the speech of older adults,…

  11. An analytical approach to photonic reservoir computing - a network of SOA's - for noisy speech recognition

    Science.gov (United States)

    Salehi, Mohammad Reza; Abiri, Ebrahim; Dehyadegari, Louiza

    2013-10-01

    This paper seeks to investigate an approach to photonic reservoir computing for optical speech recognition on an isolated digit recognition task. An analytical approach to photonic reservoir computing is further drawn on to decrease time consumption compared to numerical methods, which is very important when processing large signals such as speech. It is also observed that adjusting the reservoir parameters, along with a good nonlinear mapping of the input signal into the reservoir, boosts the recognition accuracy of the analytical approach. Perfect recognition accuracy (i.e. 100%) can be achieved for noiseless speech signals. For noisy signals with signal-to-noise ratios of 0-10 dB, however, the observed accuracy ranged between 92% and 98%. In fact, the photonic reservoir demonstrated a 9-18% improvement compared to classical reservoir networks with hyperbolic tangent nodes.
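
    The classical baseline mentioned above can be written down compactly. This is a minimal numerical echo state network with hyperbolic tangent nodes, not the photonic SOA reservoir itself; the sizes and scalings are illustrative.

        import numpy as np

        rng = np.random.default_rng(0)

        class EchoStateNetwork:
            def __init__(self, n_in, n_res=200, spectral_radius=0.9):
                self.w_in = rng.uniform(-0.1, 0.1, (n_res, n_in))
                w = rng.standard_normal((n_res, n_res))
                # Rescale so the largest eigenvalue magnitude is spectral_radius.
                self.w = w * (spectral_radius / np.max(np.abs(np.linalg.eigvals(w))))

            def states(self, inputs):  # inputs: (T, n_in) feature sequence
                x = np.zeros(self.w.shape[0])
                collected = []
                for u in inputs:
                    x = np.tanh(self.w_in @ u + self.w @ x)
                    collected.append(x.copy())
                return np.array(collected)

    For digit recognition, a linear readout (e.g. ridge regression) is trained from the collected states to class targets; only the readout is trained, which is what makes reservoir computing cheap.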

  12. Combining Semantic and Acoustic Features for Valence and Arousal Recognition in Speech

    DEFF Research Database (Denmark)

    Karadogan, Seliz; Larsen, Jan

    2012-01-01

    The recognition of affect in speech has attracted a lot of interest recently, especially in the areas of cognitive and computer sciences. Most previous studies focused on the recognition of basic emotions (such as happiness, sadness and anger) using a categorical approach. Recently, the focus has been shifting towards dimensional affect recognition, based on the idea that emotional states are not independent from one another but related in a systematic manner. In this paper, we design a continuous dimensional speech affect recognition model that combines acoustic and semantic features. We show that combining semantic and acoustic information for dimensional speech affect recognition improves the results. Moreover, we show that valence is better estimated using semantic features, while arousal is better estimated using acoustic features.

  13. Prediction of Speech Recognition in Cochlear Implant Users by Adapting Auditory Models to Psychophysical Data

    Directory of Open Access Journals (Sweden)

    Svante Stadler

    2009-01-01

    Full Text Available Users of cochlear implants (CIs) vary widely in their ability to recognize speech in noisy conditions. There are many factors that may influence their performance. We have investigated to what degree it can be explained by the users' ability to discriminate spectral shapes. A speech recognition task has been simulated using both a simple and a complex model of CI hearing. The models were individualized by adapting their parameters to fit the results of a spectral discrimination test. The predicted speech recognition performance was compared to experimental results, and they were significantly correlated. The presented framework may be used to simulate the effects of changing the CI encoding strategy.

  16. Neural speech recognition: continuous phoneme decoding using spatiotemporal representations of human cortical activity

    Science.gov (United States)

    Moses, David A.; Mesgarani, Nima; Leonard, Matthew K.; Chang, Edward F.

    2016-10-01

    Objective. The superior temporal gyrus (STG) and neighboring brain regions play a key role in human language processing. Previous studies have attempted to reconstruct speech information from brain activity in the STG, but few of them incorporate the probabilistic framework and engineering methodology used in modern speech recognition systems. In this work, we describe the initial efforts toward the design of a neural speech recognition (NSR) system that performs continuous phoneme recognition on English stimuli with arbitrary vocabulary sizes using the high gamma band power of local field potentials in the STG and neighboring cortical areas obtained via electrocorticography. Approach. The system implements a Viterbi decoder that incorporates phoneme likelihood estimates from a linear discriminant analysis model and transition probabilities from an n-gram phonemic language model. Grid searches were used in an attempt to determine optimal parameterizations of the feature vectors and Viterbi decoder. Main results. The performance of the system was significantly improved by using spatiotemporal representations of the neural activity (as opposed to purely spatial representations) and by including language modeling and Viterbi decoding in the NSR system. Significance. These results emphasize the importance of modeling the temporal dynamics of neural responses when analyzing their variations with respect to varying stimuli and demonstrate that speech recognition techniques can be successfully leveraged when decoding speech from neural signals. Guided by the results detailed in this work, further development of the NSR system could have applications in the fields of automatic speech recognition and neural prosthetics.
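
    The decoding scheme described, per-frame phoneme likelihood estimates combined with n-gram transition probabilities in a Viterbi search, can be illustrated with a minimal sketch. The bigram case is shown; the array shapes and names are assumptions for illustration, not the authors' code.

    ```python
    import numpy as np

    def viterbi(log_emission, log_trans, log_prior):
        """log_emission: (T, P) per-frame phoneme log-likelihoods (e.g. from an
        LDA model); log_trans: (P, P) bigram phoneme log-probabilities;
        log_prior: (P,) initial log-probabilities.
        Returns the most likely phoneme index sequence."""
        T, P = log_emission.shape
        delta = log_prior + log_emission[0]
        back = np.zeros((T, P), dtype=int)
        for t in range(1, T):
            scores = delta[:, None] + log_trans        # (prev, cur) pairs
            back[t] = np.argmax(scores, axis=0)        # best predecessor
            delta = scores[back[t], np.arange(P)] + log_emission[t]
        path = [int(np.argmax(delta))]                 # backtrace from the end
        for t in range(T - 1, 0, -1):
            path.append(int(back[t, path[-1]]))
        return path[::-1]
    ```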

  17. An open-set detection evaluation methodology for automatic emotion recognition in speech

    NARCIS (Netherlands)

    Truong, K.P.; Leeuwen, D.A. van

    2007-01-01

    In this paper, we present a detection approach and an ‘open-set’ detection evaluation methodology for automatic emotion recognition in speech. The traditional classification approach does not seem to be suitable and flexible enough for typical emotion recognition tasks. For example, classification d

  18. Relative Contributions of Spectral and Temporal Cues for Speech Recognition in Patients with Sensorineural Hearing Loss

    Institute of Scientific and Technical Information of China (English)

    XU Li; ZHOU Ning; Rebecca Brashears; Katherine Rife

    2008-01-01

    The present study was designed to examine speech recognition in patients with sensorineural hearing loss when the temporal and spectral information in the speech signals were co-varied. Four subjects with mild to moderate sensorineural hearing loss were recruited to participate in consonant and vowel recognition tests that used speech stimuli processed through a noise-excited vocoder. The number of channels was varied between 2 and 32, which defined spectral information. The lowpass cutoff frequency of the temporal envelope extractor was varied from 1 to 512 Hz, which defined temporal information. Results indicate that performance varied tremendously across the subjects with sensorineural hearing loss. For consonant recognition, patterns of relative contributions of spectral and temporal information were similar to those in normal-hearing subjects. The utility of temporal envelope information appeared to be normal in the hearing-impaired listeners. For vowel recognition, which depended predominantly on spectral information, the performance plateau was reached with numbers of channels as high as 16-24, much higher than expected, given that frequency selectivity in patients with sensorineural hearing loss might be compromised. In order to understand the mechanisms by which hearing-impaired listeners utilize spectral and temporal cues for speech recognition, future studies that involve a large sample of patients with sensorineural hearing loss will be necessary to elucidate the relationship between frequency selectivity as well as central processing capability and speech recognition performance using vocoded signals.
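
    A noise-excited vocoder of the kind used here can be sketched as follows: the signal is split into n bandpass channels, each channel's temporal envelope is extracted with a low-pass filter (the cutoff controlling temporal information), and the envelopes modulate bandpass noise. Band edges, filter orders, and the default envelope cutoff below are illustrative assumptions, not the study's exact parameters.

    ```python
    # Sketch of a noise-excited vocoder; assumes fs is high enough that the
    # top band edge stays below the Nyquist frequency.
    import numpy as np
    from scipy.signal import butter, sosfiltfilt

    def noise_vocoder(x, fs, n_channels=8, env_cutoff=160.0,
                      f_lo=100.0, f_hi=7000.0):
        rng = np.random.default_rng(0)
        # Logarithmically spaced band edges between f_lo and f_hi
        edges = np.geomspace(f_lo, f_hi, n_channels + 1)
        env_sos = butter(4, env_cutoff, btype="low", fs=fs, output="sos")
        out = np.zeros_like(x, dtype=float)
        for lo, hi in zip(edges[:-1], edges[1:]):
            band_sos = butter(4, [lo, hi], btype="band", fs=fs, output="sos")
            band = sosfiltfilt(band_sos, x)
            env = sosfiltfilt(env_sos, np.abs(band))   # temporal envelope
            carrier = sosfiltfilt(band_sos, rng.standard_normal(len(x)))
            out += np.clip(env, 0, None) * carrier     # modulated noise carrier
        return out
    ```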

  19. Speech recognition materials and ceiling effects: considerations for cochlear implant programs.

    Science.gov (United States)

    Gifford, René H; Shallop, Jon K; Peterson, Anna Mary

    2008-01-01

    Cochlear implant recipients have demonstrated remarkable increases in speech perception since US FDA approval was granted in 1984. Improved performance is due to a number of factors including improved cochlear implant technology, evolving speech coding strategies, and individuals with increasingly more residual hearing receiving implants. Despite this evolution, the same recommendations for pre- and postimplant speech recognition testing have been in place for over 10 years in the United States. To determine whether new recommendations are warranted, speech perception performance was assessed for 156 adult, postlingually deafened implant recipients as well as 50 hearing aid users on monosyllabic word recognition (CNC) and sentence recognition in quiet (HINT and AzBio sentences) and in noise (BKB-SIN). Results demonstrated that for HINT sentences in quiet, 28% of the subjects tested achieved maximum performance of 100% correct and that scores did not agree well with monosyllables (CNC) or sentence recognition in noise (BKB-SIN). For a more difficult sentence recognition material (AzBio), only 0.7% of the subjects achieved 100% performance and scores were in much better agreement with monosyllables and sentence recognition in noise. These results suggest that more difficult materials are needed to assess speech perception performance of postimplant patients - and perhaps also for determining implant candidacy. PMID:18212519

  20. Effects of Semantic Context and Fundamental Frequency Contours on Mandarin Speech Recognition by Second Language Learners

    Science.gov (United States)

    Zhang, Linjun; Li, Yu; Wu, Han; Li, Xin; Shu, Hua; Zhang, Yang; Li, Ping

    2016-01-01

    Speech recognition by second language (L2) learners in optimal and suboptimal conditions has been examined extensively, with English as the target language in most previous studies. This study extended existing experimental protocols (Wang et al., 2013) to investigate Mandarin speech recognition by Japanese learners of Mandarin at two different levels of proficiency (elementary vs. intermediate). The overall results showed that in addition to L2 proficiency, semantic context, F0 contours, and listening condition all affected recognition performance on the Mandarin sentences. However, the effects of semantic context and F0 contours on L2 speech recognition diverged to some extent. Specifically, there was a significant modulation effect of listening condition on semantic context, indicating that L2 learners made use of semantic context less efficiently in the interfering background than in quiet. In contrast, no significant modulation effect of listening condition on F0 contours was found. Furthermore, there was a significant interaction between semantic context and F0 contours, indicating that semantic context becomes more important for L2 speech recognition when F0 information is degraded. None of these effects were found to be modulated by L2 proficiency. The discrepancy in the effects of semantic context and F0 contours on L2 speech recognition in the interfering background might be related to differences in the processing capacities required by the two types of information in adverse listening conditions. PMID:27378997

  1. Feature Fusion Algorithm for Multimodal Emotion Recognition from Speech and Facial Expression Signal

    Directory of Open Access Journals (Sweden)

    Han Zhiyan

    2016-01-01

    Full Text Available In order to overcome the limitations of single-mode emotion recognition, this paper describes a novel multimodal emotion recognition algorithm that takes speech signals and facial expression signals as its research subjects. First, the speech signal features and facial expression signal features are fused, sample sets are generated by sampling with replacement, and classifiers are trained with a BP neural network (BPNN). Second, the difference between two classifiers is measured by a double error difference selection strategy. Finally, the final recognition result is obtained by the majority voting rule. Experiments show the method improves the accuracy of emotion recognition by giving full play to the advantages of decision-level and feature-level fusion, bringing the fusion process closer to human emotion recognition, with a recognition rate of 90.4%.
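
    The decision-level step, several classifiers trained on bootstrap resamples voting on each test sample, can be sketched as below. `MLPClassifier` stands in for the paper's BP neural network, integer class labels are assumed, and the double error difference selection step is omitted.

    ```python
    # Toy sketch of bootstrap training plus majority-vote fusion.
    import numpy as np
    from sklearn.neural_network import MLPClassifier

    def train_ensemble(X, y, n_members=5, seed=0):
        rng = np.random.default_rng(seed)
        members = []
        for _ in range(n_members):
            idx = rng.integers(0, len(X), size=len(X))  # bootstrap resample
            clf = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500)
            members.append(clf.fit(X[idx], y[idx]))
        return members

    def majority_vote(members, X):
        # votes: one row of predicted integer labels per ensemble member
        votes = np.stack([clf.predict(X) for clf in members])
        # Most frequent label per column wins
        return np.apply_along_axis(
            lambda col: np.bincount(col).argmax(), 0, votes)
    ```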

  2. Recognition of voice commands using adaptation of foreign language speech recognizer via selection of phonetic transcriptions

    Science.gov (United States)

    Maskeliunas, Rytis; Rudzionis, Vytautas

    2011-06-01

    In recent years various commercial speech recognizers have become available. These recognizers make it possible to develop applications incorporating various speech recognition techniques easily and quickly. All of these commercial recognizers are typically targeted at widely spoken languages with large market potential; however, it may be possible to adapt them for use in environments where less widely spoken languages are used. Since most commercial recognition engines are closed systems, the single avenue for adaptation is to find suitable ways of selecting proper phonetic transcriptions between the two languages. This paper deals with methods to find phonetic transcriptions for Lithuanian voice commands to be recognized using English speech engines. The experimental evaluation showed that it is possible to find phonetic transcriptions that enable the recognition of Lithuanian voice commands with an accuracy of over 90%.

  3. Automated target recognition and tracking using an optical pattern recognition neural network

    Science.gov (United States)

    Chao, Tien-Hsin

    1991-01-01

    The ongoing development of an automatic target recognition and tracking system at the Jet Propulsion Laboratory is presented. This system is an optical pattern recognition neural network (OPRNN), an integration of an innovative optical parallel processor and a feature-extraction-based neural net training algorithm. The parallel optical processor provides high speed and vast parallelism as well as full shift invariance. The neural network algorithm enables simultaneous discrimination of multiple noisy targets in spite of their scales, rotations, perspectives, and various deformations. This fully developed OPRNN system can be effectively utilized for automated spacecraft recognition and tracking, which will lead to success in the Automated Rendezvous and Capture (AR&C) of the unmanned Cargo Transfer Vehicle (CTV). One of the most powerful optical parallel processors for automatic target recognition is the multichannel correlator. With the inherent advantages of parallel processing capability and shift invariance, multiple objects can be simultaneously recognized and tracked using this multichannel correlator. This target tracking capability can be greatly enhanced by utilizing a powerful feature-extraction-based neural network training algorithm such as the neocognitron. The OPRNN, currently under investigation at JPL, is constructed with an optical multichannel correlator in which holographic filters have been prepared using the neocognitron training algorithm. The computation speed of the neocognitron-type OPRNN is up to 10^14 analog connections/sec, enabling the OPRNN to outperform its state-of-the-art electronic counterpart by at least two orders of magnitude.

  4. Pattern recognition

    CERN Document Server

    Theodoridis, Sergios

    2003-01-01

    Pattern recognition is a scientific discipline that is becoming increasingly important in the age of automation and information handling and retrieval. Pattern Recognition, 2e covers the entire spectrum of pattern recognition applications, from image analysis to speech recognition and communications. This book presents cutting-edge material on neural networks (a set of linked microprocessors that can form associations and use pattern recognition to "learn") and enhances student motivation by approaching pattern recognition from the designer's point of view. A direct result of more than 10

  5. Voice Activity Detector of Wake-Up-Word Speech Recognition System Design on FPGA

    Directory of Open Access Journals (Sweden)

    Veton Z. Këpuska

    2014-12-01

    Full Text Available A typical speech recognition system is push-to-talk operated and requires manual activation. However, for users of hands-busy applications, movement may be restricted or impossible. One alternative is a speech-only interface. The proposed method, called Wake-Up-Word Speech Recognition (WUW-SR), utilizes such a speech-only interface. A WUW-SR system allows the user to activate systems (cell phone, computer, etc.) with speech commands alone instead of manual activation. The trend in WUW-SR hardware design is towards implementing a complete system on a single chip intended for various applications. This paper presents an experimental FPGA design and implementation of a novel architecture for a real-time feature extraction processor that includes a Voice Activity Detector (VAD) and feature extraction: MFCC, LPC, and ENH_MFCC. In the WUW-SR system, the recognizer front-end with VAD is located at the terminal, which is typically connected over a data network (e.g., to a server) for remote back-end recognition. The VAD is responsible for segmenting the signal into speech-like and non-speech-like segments. For any given frame, the VAD reports one of two possible states: VAD_ON or VAD_OFF. The back-end is then responsible for scoring the features segmented during the VAD_ON stage. The most important characteristic of the presented design is that it should guarantee virtually 100% correct rejection of non-WUW (out-of-vocabulary words, OOV) while maintaining a correct acceptance rate of 99.9% or higher (in-vocabulary words, INV). This requirement sets WUW-SR apart from other speech recognition tasks because no existing system can guarantee 100% reliability by any measure.
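
    The frame-level VAD decision described (VAD_ON versus VAD_OFF for each frame) is commonly based on comparing short-time energy against a noise-floor estimate. A minimal software sketch follows; the FPGA design itself is not reproduced here, and all thresholds and frame sizes are illustrative assumptions.

    ```python
    # Minimal energy-based voice activity detector over fixed-size frames.
    import numpy as np

    def vad(x, fs, frame_ms=25, hop_ms=10, threshold_db=9.0):
        frame = int(fs * frame_ms / 1000)
        hop = int(fs * hop_ms / 1000)
        n_frames = 1 + (len(x) - frame) // hop
        # Short-time log energy per frame (small epsilon avoids log of zero)
        energy_db = np.array([
            10 * np.log10(np.mean(x[i * hop:i * hop + frame] ** 2) + 1e-12)
            for i in range(n_frames)])
        noise_floor = np.percentile(energy_db, 10)   # crude noise estimate
        return energy_db > noise_floor + threshold_db  # True means VAD_ON
    ```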

  6. Using the FASST source separation toolbox for noise robust speech recognition

    OpenAIRE

    Ozerov, Alexey; Vincent, Emmanuel

    2011-01-01

    We describe our submission to the 2011 CHiME Speech Separation and Recognition Challenge. Our speech separation algorithm was built using the Flexible Audio Source Separation Toolbox (FASST) we developed recently. This toolbox is an implementation of a general flexible framework based on a library of structured source models that enable the incorporation of prior knowledge about a source separation problem via user-specifiable constraints. We show how to use FASST to develop an efficient spee...

  7. Physiologically Motivated Feature Extraction for Robust Automatic Speech Recognition

    OpenAIRE

    Ibrahim Missaoui; Zied Lachiri

    2016-01-01

    In this paper, a new method is presented to extract robust speech features in the presence of external noise. The proposed method, based on two-dimensional Gabor filters, takes into account the spectro-temporal modulation frequencies and also limits redundancy at the feature level. The performance of the proposed feature extraction method was evaluated on isolated speech words extracted from the TIMIT corpus and corrupted by background noise. The evaluation results demonstrate that ...

  8. Chinese Speech Recognition Model Based on Activation of the State Feedback Neural Network

    Institute of Scientific and Technical Information of China (English)

    李先志; 孙义和

    2001-01-01

    This paper proposes a simplified novel speech recognition model, the state feedback neural network activation model (SFNNAM), which is developed based on the characteristics of Chinese speech structure. The model assumes that the current state of speech is only a correction of the immediately preceding state. According to the "C-V" (Consonant-Vowel) structure of the Chinese language, a speech segmentation method is also implemented in the SFNNAM model. The model has a definite physical meaning grounded in the structure of the Chinese language and is easily implemented in very large scale integrated circuits (VLSI). In the speech recognition experiment, fewer calculations were needed than in the hidden Markov model (HMM) based algorithm. The recognition rate for Chinese numbers was 93.5% for the first candidate and 99.5% for the first two candidates.

  9. [Research on Barrier-free Home Environment System Based on Speech Recognition].

    Science.gov (United States)

    Zhu, Husheng; Yu, Hongliu; Shi, Ping; Fang, Youfang; Jian, Zhuo

    2015-10-01

    The number of people with physical disabilities is increasing year by year, and the trend of population aging is more and more serious. In order to improve quality of life, a control system for an accessible home environment was developed to let patients with serious disabilities control home electrical devices by voice. The control system includes a central control platform, a speech recognition module, a terminal operation module, etc. The system combines speech recognition control technology and wireless information transmission technology with embedded mobile computing technology, and interconnects the lamps, electronic locks, alarms, TV and other electrical devices in the home environment into a whole system through wireless network nodes. The experimental results showed that the speech recognition success rate was more than 84% in the home environment. PMID:26964305

  10. Hybrid Approach for Language Identification Oriented to Multilingual Speech Recognition in the Basque Context

    Science.gov (United States)

    Barroso, N.; de Ipiña, K. López; Ezeiza, A.; Barroso, O.; Susperregi, U.

    The development of Multilingual Large Vocabulary Continuous Speech Recognition systems involves issues such as Language Identification, Acoustic-Phonetic Decoding, Language Modelling, and the development of appropriate Language Resources. Interest in multilingual systems arises because there are three official languages in the Basque Country (Basque, Spanish, and French) and much linguistic interaction among them, even though Basque has very different roots from the other two languages. This paper describes the development of a Language Identification (LID) system oriented to robust Multilingual Speech Recognition for the Basque context. The work presents hybrid strategies for LID, based on the selection of system elements by Support Vector Machine and Multilayer Perceptron classifiers, and stochastic methods for speech recognition tasks (Hidden Markov Models and n-grams).

  11. Low-cost speech recognition system for small vocabulary and independent speaker

    Science.gov (United States)

    Teh, Chih Chiang; Jong, Ching C.; Siek, Liter

    2000-10-01

    In this paper an ASIC implementation of a low-cost, speaker-independent speech recognition system for a small vocabulary of 15 isolated words is presented. The IC is a digital block that receives 12-bit samples at a sampling rate of 11.025 kHz as its input. The IC runs at a 10 MHz system clock and is targeted at a 0.35 micrometer CMOS process. The whole chip, which includes the speech recognition core, RAM and ROM, contains about 61000 gates. The die size is 1.5 mm by 3 mm. The design has been coded in VHDL for hardware implementation and its functionality is identical to the Matlab simulation. The average speech recognition rate of this IC is 89 percent for the 15 isolated words.

  12. Frequency band-importance functions for auditory and auditory-visual speech recognition

    Science.gov (United States)

    Grant, Ken W.

    2005-04-01

    In many everyday listening environments, speech communication involves the integration of both acoustic and visual speech cues. This is especially true in noisy and reverberant environments where the speech signal is highly degraded, or when the listener has a hearing impairment. Understanding the mechanisms involved in auditory-visual integration is a primary interest of this work. Of particular interest is whether listeners are able to allocate their attention to various frequency regions of the speech signal differently under auditory-visual conditions and auditory-alone conditions. For auditory speech recognition, the most important frequency regions tend to be around 1500-3000 Hz, corresponding roughly to important acoustic cues for place of articulation. The purpose of this study is to determine the most important frequency region under auditory-visual speech conditions. Frequency band-importance functions for auditory and auditory-visual conditions were obtained by having subjects identify speech tokens under conditions where the speech-to-noise ratio of different parts of the speech spectrum is independently and randomly varied on every trial. Point biserial correlations were computed for each separate spectral region and the normalized correlations are interpreted as weights indicating the importance of each region. Relations among frequency-importance functions for auditory and auditory-visual conditions will be discussed.

  13. Self-organizing map classifier for stressed speech recognition

    Science.gov (United States)

    Partila, Pavol; Tovarek, Jaromir; Voznak, Miroslav

    2016-05-01

    This paper presents a method for detecting speech under stress using Self-Organizing Maps. Most people exposed to stressful situations cannot respond adequately to stimuli. The army, police, and fire departments account for the largest part of the workforce that typically faces an increased number of stressful situations. Personnel in action are directed by a control center, and control commands should be adapted to the psychological state of the person in action. It is known that psychological changes in the human body are also reflected physiologically, which consequently means that stress affects speech. Therefore, it is clear that a stressed-speech recognition system is needed in the security forces. One possible classifier, popular for its flexibility, is the self-organizing map, a type of artificial neural network. Flexibility here means that the classifier is independent of the character of the input data, a feature well suited to speech processing. Human stress can be seen as a kind of emotional state. Mel-frequency cepstral coefficients, LPC coefficients, and prosody features were selected as input data for their sensitivity to emotional changes. The parameters were calculated on speech recordings divided into two classes, namely stressed-state recordings and normal-state recordings. The contribution of the experiment is a method using a SOM classifier for stressed-speech detection. Results demonstrated the advantage of this method: its flexibility with respect to the input data.
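
    A self-organizing map over speech feature vectors (e.g. MFCCs, LPC coefficients, and prosody features, as selected here) can be sketched as follows. Map size, the learning-rate schedule, and the neighborhood width are illustrative assumptions; the subsequent step of labeling map units by the class of the recordings they attract is omitted.

    ```python
    # Tiny self-organizing map trained with the classic neighborhood update.
    import numpy as np

    class SOM:
        def __init__(self, rows=8, cols=8, dim=16, seed=0):
            self.rng = np.random.default_rng(seed)
            self.w = self.rng.standard_normal((rows, cols, dim))
            # Grid coordinates of each unit, used for neighborhood distances
            self.grid = np.stack(np.meshgrid(np.arange(rows), np.arange(cols),
                                             indexing="ij"), axis=-1)

        def bmu(self, x):
            """Best-matching unit: the unit whose weights are closest to x."""
            d = np.linalg.norm(self.w - x, axis=-1)
            return np.unravel_index(np.argmin(d), d.shape)

        def train(self, X, epochs=20, lr0=0.5, sigma0=3.0):
            for e in range(epochs):
                lr = lr0 * (1 - e / epochs)                  # decaying rate
                sigma = max(sigma0 * (1 - e / epochs), 0.5)  # shrinking radius
                for x in self.rng.permutation(X):
                    b = np.array(self.bmu(x))
                    dist2 = ((self.grid - b) ** 2).sum(axis=-1)
                    h = np.exp(-dist2 / (2 * sigma ** 2))[..., None]
                    self.w += lr * h * (x - self.w)  # pull neighborhood to x
    ```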

  14. Studying the Speech Recognition Scores of Hearing Impaired Children by Using Nonsense Syllables

    Directory of Open Access Journals (Sweden)

    Mohammad Reza Keyhani

    1998-09-01

    Full Text Available Background: The current article aims to evaluate speech recognition scores in hearing aid wearers to determine whether nonsense syllables are suitable speech materials for evaluating the effectiveness of their hearing aids. Method: Subjects were 60 children (15 males and 15 females) with bilateral moderate to moderately severe sensorineural hearing impairment, aged between 7.7 and 14 years. Gain prescription was fitted by the NAL method. Speech evaluation was then performed in a quiet place, with and without the hearing aid, using a list of 25 monosyllabic words recorded on tape. A checklist was prepared for the subjects to mark the correct response. The same method was used to obtain results for normal-hearing subjects. Results: The results revealed that subjects using hearing aids achieved significantly higher speech recognition scores than without them, although speech recognition ability was not compensated completely (the maximum score obtained was 60%). It was also revealed that syllable recognition ability decreased in the less amplified frequencies. The score was much higher in normal subjects (with an average of 88%). Conclusion: It seems that the speech recognition score can provide audiologists with a more comprehensive method to evaluate hearing aid benefits.

  15. Emotional recognition from the speech signal for a virtual education agent

    Science.gov (United States)

    Tickle, A.; Raghu, S.; Elshaw, M.

    2013-06-01

    This paper explores the extraction of features from the speech wave to perform intelligent emotion recognition. A feature extraction tool (openSMILE) was used to obtain a baseline set of 998 acoustic features from a set of emotional speech recordings made with a microphone. The initial features were reduced to the most important ones so that recognition of emotions using a supervised neural network could be performed. Given that the future use of virtual education agents lies in making the agents more interactive, developing agents with the capability to recognise and adapt to the emotional state of humans is an important step.

  16. Dynamic HMM Model with Estimated Dynamic Property in Continuous Mandarin Speech Recognition

    Institute of Scientific and Technical Information of China (English)

    CHENFeili; ZHUJie

    2003-01-01

    A new dynamic HMM (hidden Markov model) is introduced in this paper, which describes the relationship between the dynamic property and the feature space. The method to estimate the dynamic property is discussed, which makes the dynamic HMM much more practical in real-time speech recognition. Experiments on a large vocabulary continuous Mandarin speech recognition task have shown that the dynamic HMM can achieve about 10% error reduction for both tonal and toneless syllables. The estimated dynamic property achieves nearly the same (or even better) performance as the extracted dynamic property.

  17. Fusing Eye-gaze and Speech Recognition for Tracking in an Automatic Reading Tutor

    DEFF Research Database (Denmark)

    Rasmussen, Morten Højfeldt; Tan, Zheng-Hua

    2013-01-01

    In this paper we present a novel approach for automatically tracking reading progress using a combination of eye-gaze tracking and speech recognition. The two are fused by first generating word probabilities based on eye-gaze information and then using these probabilities to augment the language model of the speech recognizer.

  18. Simultaneous Blind Separation and Recognition of Speech Mixtures Using Two Microphones to Control a Robot Cleaner

    OpenAIRE

    Heungkyu Lee

    2013-01-01

    This paper proposes a method for the simultaneous separation and recognition of speech mixtures in noisy environments using two‐channel based independent vector analysis (IVA) on a home‐robot cleaner. The issues to be considered in our target application are speech recognition at a distance and noise removal to cope with a variety of noises, including TV sounds, air conditioners, babble, and so on, that can occur in a house, where people can utter a voice command to control a robot cleaner at...

  19. Automated Recognition of 3D Features in GPIR Images

    Science.gov (United States)

    Park, Han; Stough, Timothy; Fijany, Amir

    2007-01-01

    A method of automated recognition of three-dimensional (3D) features in images generated by ground-penetrating imaging radar (GPIR) is undergoing development. GPIR 3D images can be analyzed to detect and identify such subsurface features as pipes and other utility conduits. Until now, much of the analysis of GPIR images has been performed manually by expert operators who must visually identify and track each feature. The present method is intended to satisfy a need for more efficient and accurate analysis by means of algorithms that can automatically identify and track subsurface features, with minimal supervision by human operators. In this method, data from multiple sources (for example, data on different features extracted by different algorithms) are fused together for identifying subsurface objects. The algorithms of this method can be classified in several different ways. In one classification, the algorithms fall into three classes: (1) image-processing algorithms, (2) feature-extraction algorithms, and (3) a multiaxis data-fusion/pattern-recognition algorithm that includes a combination of machine-learning, pattern-recognition, and object-linking algorithms. The image-processing class includes preprocessing algorithms for reducing noise and enhancing target features for pattern recognition. The feature-extraction algorithms operate on preprocessed data to extract such specific features in images as two-dimensional (2D) slices of a pipe. Then the multiaxis data-fusion/pattern-recognition algorithm identifies, classifies, and reconstructs 3D objects from the extracted features. In this process, multiple 2D features extracted by use of different algorithms and representing views along different directions are used to identify and reconstruct 3D objects. In object linking, which is an essential part of this process, features identified in successive 2D slices and located within a threshold radius of identical features in adjacent slices are linked in a

  20. Speech recognition in noise as a function of the number of spectral channels : Comparison of acoustic hearing and cochlear implants

    NARCIS (Netherlands)

    Friesen, LM; Shannon, RV; Baskent, D; Wang, YB

    2001-01-01

    Speech recognition was measured as a function of spectral resolution (number of spectral channels) and speech-to-noise ratio in normal-hearing (NH) and cochlear-implant (CI) listeners. Vowel, consonant, word, and sentence recognition were measured in five normal-hearing listeners, ten listeners wit

  1. Speech recognition in normal hearing and sensorineural hearing loss as a function of the number of spectral channels

    NARCIS (Netherlands)

    Baskent, Deniz

    2006-01-01

    Speech recognition by normal-hearing listeners improves as a function of the number of spectral channels when tested with a noiseband vocoder simulating cochlear implant signal processing. Speech recognition by the best cochlear implant users, however, saturates around eight channels and does not im

  2. Speech recognition for the anaesthesia record during crisis scenarios

    DEFF Research Database (Denmark)

    Alapetite, Alexandre

    2008-01-01

    Introduction: This article describes the evaluation of a prototype speech-input interface to an anaesthesia patient record, conducted in a full-scale anaesthesia simulator involving six doctor-nurse anaesthetist teams. Objective: The aims of the experiment were, first, to assess the potential advantages and disadvantages of a vocal interface compared to the traditional touch-screen and keyboard interface to an electronic anaesthesia record during crisis situations; second, to assess the usability in a realistic work environment of some speech input strategies (hands-free vocal interface activated by a keyword; combination of command and free text modes); finally, to quantify some of the gains that could be provided by the speech input modality. Methods: Six anaesthesia teams composed of one doctor and one nurse were each confronted with two crisis scenarios in a full-scale anaesthesia simulator. Each

  3. Physiologically Motivated Feature Extraction for Robust Automatic Speech Recognition

    Directory of Open Access Journals (Sweden)

    Ibrahim Missaoui

    2016-04-01

    Full Text Available In this paper, a new method is presented to extract robust speech features in the presence of external noise. The proposed method, based on two-dimensional Gabor filters, takes into account the spectro-temporal modulation frequencies and also limits redundancy at the feature level. The performance of the proposed feature extraction method was evaluated on isolated speech words extracted from the TIMIT corpus and corrupted by background noise. The evaluation results demonstrate that the proposed feature extraction method outperforms classic methods such as Perceptual Linear Prediction, Linear Predictive Coding, Linear Prediction Cepstral coefficients and Mel Frequency Cepstral Coefficients.

  4. Speech Emotion Recognition Based on Parametric Filter and Fractal Dimension

    Science.gov (United States)

    Mao, Xia; Chen, Lijiang

    In this paper, we propose a new method that employs two novel features, correlation density (Cd) and fractal dimension (Fd), to recognize emotional states contained in speech. The former feature, obtained by a list of parametric filters, reflects the broad frequency components and the fine structure of lower frequency components, contributed by unvoiced phones and voiced phones, respectively; the latter feature indicates the non-linearity and self-similarity of a speech signal. Comparative experiments based on Hidden Markov Model and K-Nearest-Neighbor methods were carried out. The results show that Cd and Fd are much more closely related to emotional expression than the commonly used features.
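
    The record does not spell out how Fd is computed; Higuchi's estimator is a common fractal-dimension measure for speech frames, capturing exactly the non-linearity and self-similarity mentioned, and is shown here purely as an illustration.

    ```python
    # Higuchi fractal dimension of a 1-D signal (Higuchi, 1988).
    import numpy as np

    def higuchi_fd(x, k_max=10):
        n = len(x)
        log_k, log_l = [], []
        for k in range(1, k_max + 1):
            lengths = []
            for m in range(k):
                idx = np.arange(m, n, k)     # subsampled curve with offset m
                if len(idx) < 2:
                    continue
                # Normalized curve length for this offset and scale k
                l_m = np.abs(np.diff(x[idx])).sum() * (n - 1) / (len(idx) - 1) / k
                lengths.append(l_m / k)
            log_k.append(np.log(1.0 / k))
            log_l.append(np.log(np.mean(lengths)))
        # Fd is the slope of log L(k) against log(1/k)
        return np.polyfit(log_k, log_l, 1)[0]
    ```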

  5. Space discriminative function for microphone array robust speech recognition

    Institute of Scientific and Technical Information of China (English)

    Zhao Xianyu; Ou Zhijian; Wang Zuoying

    2005-01-01

    Based on the W-disjoint orthogonality of speech mixtures, a space discriminative function was proposed to enumerate and localize competing speakers in the surrounding environment. A Wiener-like post-filter was then developed to adaptively suppress interference. Experimental results with a hands-free speech recognizer under various SNR and competing-speaker settings show that nearly 69% error reduction can be obtained with a two-channel small-aperture microphone array against the conventional single-microphone baseline system. Comparisons were made against traditional delay-and-sum and Griffiths-Jim adaptive beamforming techniques to further assess the effectiveness of the method.
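
    The delay-and-sum beamformer used as a comparison system can be sketched for the two-channel case: estimate the inter-microphone delay toward the target speaker by cross-correlation, then align and average the channels. The integer-delay estimate and the wrap-around alignment below are simplifying assumptions for illustration.

    ```python
    # Two-channel delay-and-sum baseline; assumes len(ch1) > 2 * max_lag.
    import numpy as np

    def delay_and_sum(ch1, ch2, max_lag=32):
        # Estimate the integer delay by cross-correlation within +/- max_lag
        lags = np.arange(-max_lag, max_lag + 1)
        xc = [np.sum(ch1[max_lag:-max_lag] *
                     np.roll(ch2, lag)[max_lag:-max_lag]) for lag in lags]
        delay = lags[int(np.argmax(xc))]
        # Align the second channel and average (np.roll wraps at the edges,
        # which is acceptable for a short-delay sketch)
        return 0.5 * (ch1 + np.roll(ch2, delay))
    ```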

  6. A HYBRID METHOD FOR AUTOMATIC SPEECH RECOGNITION PERFORMANCE IMPROVEMENT IN REAL WORLD NOISY ENVIRONMENT

    Directory of Open Access Journals (Sweden)

    Urmila Shrawankar

    2013-01-01

    Full Text Available It is a well-known fact that speech recognition systems perform well when used in conditions similar to those used to train the acoustic models; mismatches degrade performance. In adverse environments it is very difficult to predict the category of real-world noise in advance, and environmental robustness is hard to achieve. Rigorous experimental study shows that no single method is available that will both clean noisy speech and preserve the quality of speech corrupted by real, natural environmental (mixed) noise. It is also observed that back-end techniques alone are not sufficient to improve the performance of a speech recognition system; performance improvement techniques must be applied at every step of both the back-end and the front-end of the Automatic Speech Recognition (ASR) model. Current recognition systems address this problem using a technique called adaptation. This study presents an experimental investigation with two aims: first, to implement a hybrid method that clarifies the speech signal as much as possible using combinations of filters and enhancement techniques; second, to develop a method for training on all categories of noise that can adapt the acoustic models to a new environment, helping to improve recognizer performance under mismatched real-world conditions. The experiments confirm that hybrid adaptation methods improve ASR performance on both levels: signal-to-noise ratio (SNR) improvement as well as word recognition accuracy in real-world noisy environments.
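
    One classic enhancement filter that such a hybrid front-end can include is magnitude spectral subtraction. The sketch below assumes the first few frames are speech-free for the noise estimate; the over-subtraction factor and spectral floor are illustrative choices, and this is one candidate component, not the paper's full hybrid method.

    ```python
    # Magnitude spectral subtraction with a leading-frame noise estimate.
    import numpy as np
    from scipy.signal import stft, istft

    def spectral_subtraction(x, fs, noise_frames=10, over_sub=1.5, floor=0.02):
        f, t, X = stft(x, fs=fs, nperseg=512)
        mag, phase = np.abs(X), np.angle(X)
        # Noise magnitude estimated from the first (assumed silent) frames
        noise = mag[:, :noise_frames].mean(axis=1, keepdims=True)
        # Over-subtract, then apply a spectral floor to limit musical noise
        clean = np.maximum(mag - over_sub * noise, floor * mag)
        _, y = istft(clean * np.exp(1j * phase), fs=fs, nperseg=512)
        return y
    ```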

  7. Strategies for the automated recognition of marks in forensic science

    Science.gov (United States)

    Heizmann, Michael

    2002-07-01

    To enable the efficient comparison of striation marks in forensic science, tools for the automated detection of similarities between them are necessary. Such marks show a groove-like texture which can be considered a fingerprint of the associated tool, so that reliable detection of connections between different toolmarks from the identical tool can be established. In order to avoid time-consuming visual inspection of toolmarks, automated approaches to the evaluation of marks are essential. Such approaches are commonly based on meaningful characteristics extracted from images of the marks to be examined. Besides a high recognition rate, the required computation time plays an important role in the design of an adequate comparison strategy. The cross-correlation function presented in this paper provides a faithful quantitative measure of the degree of similarity. It is shown that appropriate modeling of the signal characteristics considerably improves the performance of methods based on the cross-correlation function. A strategy for quantitative assessment of comparison strategies is introduced; it is based on processing a test archive of marks and analyzing the comparison results statistically. For a convenient description of the assessment results, meaningful index numbers are discussed.
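
    The core similarity measure described, a normalized cross-correlation whose peak value quantifies the degree of similarity between two marks, can be sketched for one-dimensional profiles as follows; treating each mark as a 1-D striation signature is a simplifying assumption.

    ```python
    # Normalized cross-correlation between two 1-D striation profiles.
    import numpy as np

    def ncc_similarity(a, b):
        # Zero-mean, unit-variance normalization of each profile
        a = (a - a.mean()) / (a.std() + 1e-12)
        b = (b - b.mean()) / (b.std() + 1e-12)
        xc = np.correlate(a, b, mode="full") / min(len(a), len(b))
        best = int(np.argmax(xc))
        # Peak value = degree of similarity; its offset = best alignment
        return xc[best], best - (len(b) - 1)
    ```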

  8. Implementation of a Tour Guide Robot System Using RFID Technology and Viterbi Algorithm-Based HMM for Speech Recognition

    Directory of Open Access Journals (Sweden)

    Neng-Sheng Pai

    2014-01-01

    Full Text Available This paper applies speech recognition and RFID technologies to develop an omni-directional mobile robot with voice control and guided-tour functions. For speech recognition, the speech signals were captured by short-time processing. The speaker first recorded isolated words so that the robot could create a speech database for specific speakers. After pre-processing of this speech database, the cepstrum and delta-cepstrum feature parameters were obtained using linear predictive coding (LPC). The Hidden Markov Model (HMM) was then used for model training on the speech database, and the Viterbi algorithm was used to find an optimal state sequence as the reference sample for speech recognition. The trained reference model was loaded into the industrial computer on the robot platform, and the user uttered the isolated words to be tested. After processing by the same front-end and comparison with the reference models, the path with the maximum total probability among the models, found using the Viterbi algorithm, gave the recognition result. Finally, the speech recognition and RFID systems were demonstrated in an actual environment to prove their feasibility and stability, and were implemented in the omni-directional mobile robot.

  9. Recognition of Rapid Speech by Blind and Sighted Older Adults

    Science.gov (United States)

    Gordon-Salant, Sandra; Friedman, Sarah A.

    2011-01-01

    Purpose: To determine whether older blind participants recognize time-compressed speech better than older sighted participants. Method: Three groups of adults with normal hearing participated (n = 10/group): (a) older sighted, (b) older blind, and (c) younger sighted listeners. Low-predictability sentences that were uncompressed (0% time…

  10. Temporal acuity and speech recognition score in noise in patients with multiple sclerosis

    Directory of Open Access Journals (Sweden)

    Mehri Maleki

    2014-04-01

    Full Text Available Background and Aim: Multiple sclerosis (MS) is a central nervous system disease that can be associated with a variety of symptoms such as hearing disorders. The main consequence of hearing loss is poor speech perception, and temporal acuity plays an important role in speech perception. We evaluated speech perception in silence and in the presence of noise, as well as temporal acuity, in patients with multiple sclerosis. Methods: Eighteen adults with multiple sclerosis with a mean age of 37.28 years and 18 age- and sex-matched controls with a mean age of 38.00 years participated in this study. Temporal acuity and speech perception were evaluated by the random gap detection test (GDT) and word recognition score (WRS) at three different signal-to-noise ratios. Results: Statistical analysis of the test results revealed significant differences between the two groups (p<0.05). Analysis of the gap detection test (at 4 sensation levels) and word recognition score in both groups showed significant differences (p<0.001). Conclusion: According to this survey, the ability of patients with multiple sclerosis to process the temporal features of a stimulus is impaired. It seems that this impairment is an important factor in reduced word recognition score and speech perception.

  11. ANALYSIS OF MULTIMODAL FUSION TECHNIQUES FOR AUDIO-VISUAL SPEECH RECOGNITION

    Directory of Open Access Journals (Sweden)

    D.V. Ivanko

    2016-05-01

    Full Text Available The paper presents an analytical review covering the latest achievements in the field of audio-visual (AV) fusion (integration) of multimodal information. We discuss the main challenges and report on approaches to address them. One of the most important tasks of AV integration is to understand how the modalities interact and influence each other. The paper addresses this problem in the context of AV speech processing and speech recognition. In the first part of the review we set out the basic principles of AV speech recognition and give a classification of audio and visual speech features. Special attention is paid to the systematization of existing techniques and AV data fusion methods. In the second part we provide a consolidated list of tasks and applications that use AV fusion, based on our analysis of the research area. We also indicate the methods, techniques, and audio and video features used. We propose a classification of AV integration and discuss the advantages and disadvantages of different approaches. We draw conclusions and offer our assessment of the future of the field of AV fusion. In further research we plan to implement a system for audio-visual Russian continuous speech recognition using advanced methods of multimodal fusion.

  12. Acoustic Feature Optimization Based on F-Ratio for Robust Speech Recognition

    Science.gov (United States)

    Sun, Yanqing; Zhou, Yu; Zhao, Qingwei; Yan, Yonghong

    This paper focuses on the problem of performance degradation in mismatched speech recognition. The F-Ratio analysis method is utilized to analyze the significance of different frequency bands for speech unit classification, and we find that frequencies around 1 kHz and 3 kHz, which are the upper bounds of the first and second formants for most vowels, should be emphasized in comparison to the Mel-frequency cepstral coefficients (MFCC). The analysis result is further observed to be stable in several typical mismatched situations. Similar to the Mel-frequency scale, another frequency scale called the F-Ratio scale is thus proposed to optimize the filter bank design for the MFCC features, so that each subband contains equal significance for speech unit classification. Under comparable conditions, the modified features give a relative 43.20% decrease in sentence error rate compared with the MFCC for emotion-affected speech recognition, 35.54% and 23.03% for noisy speech recognition at 15 dB and 0 dB SNR (signal-to-noise ratio) respectively, and 64.50% for three years of 863 test data. The application of the F-Ratio analysis on the clean training set of the Aurora2 database demonstrates its robustness over languages, texts and sampling rates.
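
    The F-Ratio itself, the ratio of between-class to within-class variance of band energies, is straightforward to compute per band; a sketch follows. The subsequent mapping from these ratios to the proposed F-Ratio-scale filter bank is not reproduced here, and the input layout is an assumption for illustration.

    ```python
    # Per-band F-Ratio over labeled speech frames.
    import numpy as np

    def f_ratio(band_energies, labels):
        """band_energies: (n_frames, n_bands) energies per frame and band;
        labels: (n_frames,) integer class ids (e.g. speech unit classes)."""
        classes = np.unique(labels)
        grand_mean = band_energies.mean(axis=0)
        between = np.zeros(band_energies.shape[1])
        within = np.zeros(band_energies.shape[1])
        for c in classes:
            grp = band_energies[labels == c]
            between += len(grp) * (grp.mean(axis=0) - grand_mean) ** 2
            within += ((grp - grp.mean(axis=0)) ** 2).sum(axis=0)
        return between / (within + 1e-12)   # high ratio = discriminative band
    ```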

  13. Audiovisual cues benefit recognition of accented speech in noise but not perceptual adaptation.

    Science.gov (United States)

    Banks, Briony; Gowen, Emma; Munro, Kevin J; Adank, Patti

    2015-01-01

    Perceptual adaptation allows humans to recognize different varieties of accented speech. We investigated whether perceptual adaptation to accented speech is facilitated if listeners can see a speaker's facial and mouth movements. In Study 1, participants listened to sentences in a novel accent and underwent a period of training with audiovisual or audio-only speech cues, presented in quiet or in background noise. A control group also underwent training with visual-only (speech-reading) cues. We observed no significant difference in perceptual adaptation between any of the groups. To address a number of remaining questions, we carried out a second study using a different accent, speaker and experimental design, in which participants listened to sentences in a non-native (Japanese) accent with audiovisual or audio-only cues, without separate training. Participants' eye gaze was recorded to verify that they looked at the speaker's face during audiovisual trials. Recognition accuracy was significantly better for audiovisual than for audio-only stimuli; however, no statistical difference in perceptual adaptation was observed between the two modalities. Furthermore, Bayesian analysis suggested that the data supported the null hypothesis. Our results suggest that although the availability of visual speech cues may be immediately beneficial for recognition of unfamiliar accented speech in noise, it does not improve perceptual adaptation.

  14. Probabilistic SVM/GMM Classifier for Speaker-Independent Vowel Recognition in Continues Speech

    CERN Document Server

    Nazari, Mohammad; Valiollahzadeh, SeyedMajid

    2008-01-01

    In this paper, we discuss the issues in automatic recognition of vowels in the Persian language. The present work focuses on a new statistical method for recognition of vowels as a basic unit of syllables. First we describe a vowel detection system, then briefly discuss how the detected vowels can be fed to the recognition unit. In pattern recognition, Support Vector Machines (SVM) as discriminative classifiers and Gaussian mixture models (GMM) as generative classifiers are the two most popular techniques. Current state-of-the-art systems try to combine them to achieve more classification power and improve the performance of recognition systems. The main idea of the study is to combine probabilistic SVM and traditional GMM pattern classification with speech characteristics such as band-pass energy to achieve a better classification rate. This idea has been analytically formulated and tested on a FarsDat-based vowel recognition system. The results show inconceivable increases in recogniti...

  15. Comparative Evaluation of Three Continuous Speech Recognition Software Packages in the Generation of Medical Reports

    OpenAIRE

    Devine, Eric G.; Gaehde, Stephan A.; Curtis, Arthur C.

    2000-01-01

    Objective: To compare out-of-box performance of three commercially available continuous speech recognition software packages: IBM ViaVoice 98 with General Medicine Vocabulary; Dragon Systems NaturallySpeaking Medical Suite, version 3.0; and L&H Voice Xpress for Medicine, General Medicine Edition, version 1.2.

  16. Speech-based recognition of self-reported and observed emotion in a dimensional space

    NARCIS (Netherlands)

    Truong, Khiet P.; Leeuwen, van David A.; Jong, de Franciska M.G.

    2012-01-01

    The differences between self-reported and observed emotion have only marginally been investigated in the context of speech-based automatic emotion recognition. We address this issue by comparing self-reported emotion ratings to observed emotion ratings and look at how differences between these two t

  17. Phonotactics Constraints and the Spoken Word Recognition of Chinese Words in Speech

    Science.gov (United States)

    Yip, Michael C.

    2016-01-01

    Two word-spotting experiments were conducted to examine the question of whether native Cantonese listeners are constrained by phonotactics information in spoken word recognition of Chinese words in speech. Because no legal consonant clusters occurred within an individual Chinese word, this kind of categorical phonotactics information of Chinese…

  18. Evaluating Automatic Speech Recognition-Based Language Learning Systems: A Case Study

    Science.gov (United States)

    van Doremalen, Joost; Boves, Lou; Colpaert, Jozef; Cucchiarini, Catia; Strik, Helmer

    2016-01-01

    The purpose of this research was to evaluate a prototype of an automatic speech recognition (ASR)-based language learning system that provides feedback on different aspects of speaking performance (pronunciation, morphology and syntax) to students of Dutch as a second language. We carried out usability reviews, expert reviews and user tests to…

  19. Review of Speech-to-Text Recognition Technology for Enhancing Learning

    Science.gov (United States)

    Shadiev, Rustam; Hwang, Wu-Yuin; Chen, Nian-Shing; Huang, Yueh-Min

    2014-01-01

    This paper reviewed literature from 1999 to 2014 inclusively on how Speech-to-Text Recognition (STR) technology has been applied to enhance learning. The first aim of this review is to understand how STR technology has been used to support learning over the past fifteen years, and the second is to analyze all research evidence to understand how…

  20. The Affordance of Speech Recognition Technology for EFL Learning in an Elementary School Setting

    Science.gov (United States)

    Liaw, Meei-Ling

    2014-01-01

    This study examined the use of speech recognition (SR) technology to support a group of elementary school children's learning of English as a foreign language (EFL). SR technology has been used in various language learning contexts. Its application to EFL teaching and learning is still relatively recent, but a solid understanding of its…

  1. Recognition of temporally interrupted and spectrally degraded sentences with additional unprocessed low-frequency speech

    NARCIS (Netherlands)

    Baskent, Deniz; Chatterjeec, Monita

    2010-01-01

    Recognition of periodically interrupted sentences (with an interruption rate of 1.5 Hz, 50% duty cycle) was investigated under conditions of spectral degradation, implemented with a noiseband vocoder, with and without additional unprocessed low-pass filtered speech (cutoff frequency 500 Hz). Intelli

  2. Variable Frame Rate and Length Analysis for Data Compression in Distributed Speech Recognition

    DEFF Research Database (Denmark)

    Kraljevski, Ivan; Tan, Zheng-Hua

    2014-01-01

    This paper addresses the issue of data compression in distributed speech recognition on the basis of a variable frame rate and length analysis method. The method first conducts frame selection by using a posteriori signal-to-noise ratio weighted energy distance to find the right time resolution...

  3. User Experience of a Mobile Speaking Application with Automatic Speech Recognition for EFL Learning

    Science.gov (United States)

    Ahn, Tae youn; Lee, Sangmin-Michelle

    2016-01-01

    With the spread of mobile devices, mobile phones have enormous potential regarding their pedagogical use in language education. The goal of this study is to analyse user experience of a mobile-based learning system that is enhanced by speech recognition technology for the improvement of EFL (English as a foreign language) learners' speaking…

  4. Assessment of Severe Apnoea through Voice Analysis, Automatic Speech, and Speaker Recognition Techniques

    Science.gov (United States)

    Fernández Pozo, Rubén; Blanco Murillo, Jose Luis; Hernández Gómez, Luis; López Gonzalo, Eduardo; Alcázar Ramírez, José; Toledano, Doroteo T.

    2009-12-01

    This study is part of an ongoing collaborative effort between the medical and the signal processing communities to promote research on applying standard Automatic Speech Recognition (ASR) techniques for the automatic diagnosis of patients with severe obstructive sleep apnoea (OSA). Early detection of severe apnoea cases is important so that patients can receive early treatment. Effective ASR-based detection could dramatically cut medical testing time. Working with a carefully designed speech database of healthy and apnoea subjects, we describe an acoustic search for distinctive apnoea voice characteristics. We also study abnormal nasalization in OSA patients by modelling vowels in nasal and nonnasal phonetic contexts using Gaussian Mixture Model (GMM) pattern recognition on speech spectra. Finally, we present experimental findings regarding the discriminative power of GMMs applied to severe apnoea detection. We have achieved an 81% correct classification rate, which is very promising and underpins the interest in this line of inquiry.

  5. Assessment of Severe Apnoea through Voice Analysis, Automatic Speech, and Speaker Recognition Techniques

    Directory of Open Access Journals (Sweden)

    Rubén Fernández Pozo

    2009-01-01

    Full Text Available This study is part of an ongoing collaborative effort between the medical and the signal processing communities to promote research on applying standard Automatic Speech Recognition (ASR) techniques for the automatic diagnosis of patients with severe obstructive sleep apnoea (OSA). Early detection of severe apnoea cases is important so that patients can receive early treatment. Effective ASR-based detection could dramatically cut medical testing time. Working with a carefully designed speech database of healthy and apnoea subjects, we describe an acoustic search for distinctive apnoea voice characteristics. We also study abnormal nasalization in OSA patients by modelling vowels in nasal and nonnasal phonetic contexts using Gaussian Mixture Model (GMM) pattern recognition on speech spectra. Finally, we present experimental findings regarding the discriminative power of GMMs applied to severe apnoea detection. We have achieved an 81% correct classification rate, which is very promising and underpins the interest in this line of inquiry.

  6. Post-Editing Error Correction Algorithm for Speech Recognition using Bing Spelling Suggestion

    CERN Document Server

    Bassil, Youssef

    2012-01-01

    ASR, short for Automatic Speech Recognition, is the process of converting spoken speech into text that can be manipulated by a computer. Although ASR has several applications, it is still erroneous and imprecise, especially if used in a harsh surrounding wherein the input speech is of low quality. This paper proposes a post-editing ASR error correction method and algorithm based on Bing's online spelling suggestion. In this approach, the ASR-recognized output text is spell-checked using Bing's spelling suggestion technology to detect and correct misrecognized words. More specifically, the proposed algorithm breaks down the ASR output text into several word tokens that are submitted as search queries to the Bing search engine. A returned spelling suggestion implies that a query is misspelled, and thus it is replaced by the suggested correction; otherwise, no correction is performed and the algorithm continues with the next token until all tokens are validated. Experiments carried out on various speeches in differen...
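
    The token-by-token loop described can be sketched as below. `suggest()` is a hypothetical stub standing in for the Bing spelling-suggestion query (whose actual API is not reproduced here); a returned suggestion is taken to imply a misrecognized token, exactly as in the description above.

    ```python
    # Post-editing loop over ASR output tokens using a suggestion service.
    from typing import Optional

    def suggest(token: str) -> Optional[str]:
        """Return a spelling suggestion for `token`, or None if none is
        offered. Hypothetical stub: replace with a real service call."""
        raise NotImplementedError("plug in a spelling-suggestion service")

    def post_edit(asr_output: str) -> str:
        corrected = []
        for token in asr_output.split():
            suggestion = suggest(token)
            # A returned suggestion implies a likely misrecognition;
            # otherwise the token is kept unchanged.
            corrected.append(suggestion if suggestion is not None else token)
        return " ".join(corrected)
    ```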

  7. Integrating Stress Information in Large Vocabulary Continuous Speech Recognition

    OpenAIRE

    Ludusan, Bogdan; Ziegler, Stefan; Gravier, Guillaume

    2012-01-01

    In this paper we propose a novel method for integrating stress information into the decoding step of a speech recognizer. A multiscale rhythm model was used to determine stress scores for each syllable, which are then used to reinforce paths during search. Two strategies for integrating the stress were employed: the first one reinforces paths through all syllables with a value proportional to their stress score, while the second one enhances paths passing only through stressed sy...

  8. Iconic Gestures for Robot Avatars, Recognition and Integration with Speech.

    Science.gov (United States)

    Bremner, Paul; Leonards, Ute

    2016-01-01

    Co-verbal gestures are an important part of human communication, improving its efficiency and efficacy for information conveyance. One possible means by which such multi-modal communication might be realized remotely is through the use of a tele-operated humanoid robot avatar. Such avatars have been previously shown to enhance social presence and operator salience. We present a motion tracking based tele-operation system for the NAO robot platform that allows direct transmission of speech and gestures produced by the operator. To assess the capabilities of this system for transmitting multi-modal communication, we have conducted a user study that investigated if robot-produced iconic gestures are comprehensible, and are integrated with speech. Robot-performed gesture outcomes were compared directly to those for gestures produced by a human actor, using a within-participant experimental design. We show that iconic gestures produced by a tele-operated robot are understood by participants when presented alone, almost as well as when produced by a human. More importantly, we show that gestures are integrated with speech when presented as part of multi-modal communication equally well for human and robot performances.

  10. Iconic Gestures for Robot Avatars, Recognition and Integration with Speech

    Directory of Open Access Journals (Sweden)

    Paul Adam Bremner

    2016-02-01

    Full Text Available Co-verbal gestures are an important part of human communication, improving its efficiency and efficacy for information conveyance. One possible means by which such multi-modal communication might be realised remotely is through the use of a tele-operated humanoid robot avatar. Such avatars have been previously shown to enhance social presence and operator salience. We present a motion tracking based tele-operation system for the NAO robot platform that allows direct transmission of speech and gestures produced by the operator. To assess the capabilities of this system for transmitting multi-modal communication, we have conducted a user study that investigated if robot-produced iconic gestures are comprehensible, and are integrated with speech. Robot-performed gesture outcomes were compared directly to those for gestures produced by a human actor, using a within-participant experimental design. We show that iconic gestures produced by a tele-operated robot are understood by participants when presented alone, almost as well as when produced by a human. More importantly, we show that gestures are integrated with speech when presented as part of multi-modal communication equally well for human and robot performances.

  11. Audio-Visual Tibetan Speech Recognition Based on a Deep Dynamic Bayesian Network for Natural Human Robot Interaction

    Directory of Open Access Journals (Sweden)

    Yue Zhao

    2012-12-01

    Full Text Available Audio-visual speech recognition is a natural and robust approach to improving human-robot interaction in noisy environments. Although the multi-stream Dynamic Bayesian Network and coupled HMM are widely used for audio-visual speech recognition, they fail to learn the shared features between modalities and ignore the dependency of features among the frames within each discrete state. In this paper, we propose a Deep Dynamic Bayesian Network (DDBN) to perform unsupervised extraction of spatial-temporal multimodal features from Tibetan audio-visual speech data and build an accurate audio-visual speech recognition model under a no frame-independency assumption. The experimental results on Tibetan speech data from real-world environments show that the proposed DDBN outperforms state-of-the-art methods in word recognition accuracy.

  12. An FFT-Based Companding Front End for Noise-Robust Automatic Speech Recognition

    Directory of Open Access Journals (Sweden)

    Turicchia Lorenzo

    2007-01-01

    Full Text Available We describe an FFT-based companding algorithm for preprocessing speech before recognition. The algorithm mimics tone-to-tone suppression and masking in the auditory system to improve automatic speech recognition performance in noise. Moreover, it is also very computationally efficient and suited to digital implementations due to its use of the FFT. In an automotive digits recognition task with the CU-Move database recorded in real environmental noise, the algorithm improves the relative word error rate by 12.5% at -5 dB signal-to-noise ratio (SNR) and by 6.2% across all SNRs (-5 dB SNR to +15 dB SNR). In the Aurora-2 database recorded with artificially added noise in several environments, the algorithm improves the relative word error rate in almost all situations.

  13. An FFT-Based Companding Front End for Noise-Robust Automatic Speech Recognition

    Directory of Open Access Journals (Sweden)

    Bhiksha Raj

    2007-06-01

    Full Text Available We describe an FFT-based companding algorithm for preprocessing speech before recognition. The algorithm mimics tone-to-tone suppression and masking in the auditory system to improve automatic speech recognition performance in noise. Moreover, it is also very computationally efficient and suited to digital implementations due to its use of the FFT. In an automotive digits recognition task with the CU-Move database recorded in real environmental noise, the algorithm improves the relative word error rate by 12.5% at −5 dB signal-to-noise ratio (SNR) and by 6.2% across all SNRs (−5 dB SNR to +15 dB SNR). In the Aurora-2 database recorded with artificially added noise in several environments, the algorithm improves the relative word error rate in almost all situations.
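
    The two records above describe the same front end; the Python sketch below captures the companding idea in a deliberately simplified frequency-domain form: each spectral bin is compressed by a wide-neighbourhood envelope and re-expanded by a narrow one, so strong components suppress weaker neighbours. The window sizes and exponent are assumptions, not the paper's multi-channel filterbank design.

```python
import numpy as np

def local_rms(x, w):
    """RMS over a sliding window, used as a local spectral envelope."""
    kernel = np.ones(w) / w
    return np.sqrt(np.convolve(x ** 2, kernel, mode='same') + 1e-12)

def compand(mag, wide=15, narrow=3, p=0.3):
    """Compress each bin by a wide-neighbourhood envelope, then re-expand
    by a narrow one, giving tone-to-tone suppression of weak components."""
    compressed = mag * local_rms(mag, wide) ** (p - 1.0)
    return compressed * local_rms(compressed, narrow) ** (1.0 / p - 1.0)

frame = np.random.default_rng(1).normal(size=400)
mag = np.abs(np.fft.rfft(frame * np.hamming(len(frame)), 512))
print(compand(mag).shape)  # (257,) companded spectrum for feature extraction
```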

  14. Adaptive Recognition of Phonemes from Speaker-Connected-Speech Using ALISA.

    Science.gov (United States)

    Osella, Stephen Albert

    The purpose of this dissertation research is to investigate a novel approach to automatic speech recognition (ASR). The successes that have been achieved in ASR have relied heavily on the use of a language grammar, which significantly constrains the ASR process. By using grammar to provide most of the recognition ability, the ASR system does not have to be as accurate at the low-level recognition stage. The ALISA Phonetic Transcriber (APT) algorithm is proposed as a way to improve ASR by enhancing the lowest-level recognition stage. The objective of the APT algorithm is to classify speech frames (a short sequence of speech signal samples) into a small set of phoneme classes. The APT algorithm constructs the mapping from speech frames to phoneme labels through a multi-layer feedforward process. A design principle of APT is that final decisions are delayed as long as possible. Instead of attempting to optimize the decision making at each processing level individually, each level generates a list of candidate solutions that are passed on to the next level of processing. The later processing levels use these candidate solutions to resolve ambiguities. The scope of this dissertation is the design of the APT algorithm up to the speech-frame classification stage. In future research, the APT algorithm will be extended to the word recognition stage. In particular, the APT algorithm could serve as the front-end stage to a Hidden Markov Model (HMM) based word recognition system. In such a configuration, the APT algorithm would provide the HMM with the requisite phoneme state-probability estimates. To date, the APT algorithm has been tested with the TIMIT and NTIMIT speech databases. The APT algorithm has been trained and tested on the SX and SI sentence texts using both male and female speakers. Results indicate better performance than those obtained using a neural network based speech-frame classifier. The performance of the APT algorithm has been evaluated for

  15. Automated detection and recognition of wildlife using thermal cameras.

    Science.gov (United States)

    Christiansen, Peter; Steen, Kim Arild; Jørgensen, Rasmus Nyholm; Karstoft, Henrik

    2014-01-01

    In agricultural mowing operations, thousands of animals are injured or killed each year, due to the increased working widths and speeds of agricultural machinery. Detection and recognition of wildlife within the agricultural fields is important to reduce wildlife mortality and, thereby, promote wildlife-friendly farming. The work presented in this paper contributes to the automated detection and classification of animals in thermal imaging. The methods and results are based on top-view images taken manually from a lift to motivate work towards unmanned aerial vehicle-based detection and recognition. Hot objects are detected based on a threshold dynamically adjusted to each frame. For the classification of animals, we propose a novel thermal feature extraction algorithm. For each detected object, a thermal signature is calculated using morphological operations. The thermal signature describes heat characteristics of objects and is partly invariant to translation, rotation, scale and posture. The discrete cosine transform (DCT) is used to parameterize the thermal signature and, thereby, calculate a feature vector, which is used for subsequent classification. Using a k-nearest-neighbor (kNN) classifier, animals are discriminated from non-animals with a balanced classification accuracy of 84.7% in an altitude range of 3-10 m and an accuracy of 75.2% for an altitude range of 10-20 m. To incorporate temporal information in the classification, a tracking algorithm is proposed. Using temporal information improves the balanced classification accuracy to 93.3% in an altitude range of 3-10 m and 77.7% in an altitude range of 10-20 m.
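
    A minimal Python sketch of the classification stage described above: a thermal signature is parameterized by its first DCT coefficients and classified with kNN. The synthetic signatures stand in for the morphologically extracted ones, and the feature length and k are illustrative choices.

```python
import numpy as np
from scipy.fftpack import dct
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(2)

def feature_vector(thermal_signature, n_coeffs=10):
    """Parameterize a 1-D thermal signature with its first DCT coefficients.
    The signature extraction itself (morphological operations) is omitted."""
    return dct(thermal_signature, norm='ortho')[:n_coeffs]

# Stand-in signatures: animals assumed warmer/peakier than non-animals.
animals = [feature_vector(np.exp(-np.linspace(-3, 3, 64) ** 2)
                          + rng.normal(0, .05, 64)) for _ in range(40)]
non_animals = [feature_vector(0.3 * np.ones(64)
                              + rng.normal(0, .05, 64)) for _ in range(40)]

X = np.vstack([animals, non_animals])
y = [1] * 40 + [0] * 40
clf = KNeighborsClassifier(n_neighbors=5).fit(X, y)
print(clf.predict([feature_vector(np.exp(-np.linspace(-3, 3, 64) ** 2))]))  # -> [1]
```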

  16. Simultaneous Blind Separation and Recognition of Speech Mixtures Using Two Microphones to Control a Robot Cleaner

    Directory of Open Access Journals (Sweden)

    Heungkyu Lee

    2013-02-01

    Full Text Available This paper proposes a method for the simultaneous separation and recognition of speech mixtures in noisy environments using two-channel-based independent vector analysis (IVA) on a home-robot cleaner. The issues to be considered in our target application are speech recognition at a distance and noise removal to cope with a variety of noises, including TV sounds, air conditioners, babble, and so on, that can occur in a house, where people can utter a voice command to control a robot cleaner at any time and at any location, even while the robot cleaner is moving. Thus, the system should always be in a recognition-ready state to promptly recognize a spoken word at any time, and the false acceptance rate should be low. To cope with these issues, the keyword spotting technique is applied. In addition, a microphone alignment method and a model-based real-time IVA approach are proposed to effectively and simultaneously process the speech and noise sources, as well as to cover 360-degree directions irrespective of distance. From the experimental evaluations, we show that the proposed method is robust in terms of speech recognition accuracy, even when the speaker location is unfixed and changes all the time. In addition, the proposed method shows good performance in severely noisy environments.

  17. Dynamic time warping applied to detection of confusable word pairs in automatic speech recognition

    OpenAIRE

    Anguita Ortega, Jan; Hernando Pericás, Francisco Javier

    2005-01-01

    In this paper we present a method to predict if two words are likely to be confused by an Automatic Speech Recognition (ASR) system. This method is based on the classical Dynamic Time Warping (DTW) technique. This technique, which is usually used in ASR to measure the distance between two speech signals, is used here to calculate the distance between two words. With this distance the words are classified as confusable or not confusable using a threshold. We have te...
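
    A sketch of the underlying idea in Python: classical DTW computes a length-normalized distance between two feature sequences representing the word pair, and a threshold turns that distance into a confusable/not-confusable decision. The threshold value and random feature sequences below are placeholders.

```python
import numpy as np

def dtw_distance(a, b):
    """Classical DTW between two feature sequences (rows = frames),
    length-normalized so word pairs of different durations compare fairly."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m] / (n + m)

CONFUSABILITY_THRESHOLD = 0.8  # assumed value, tuned on held-out pairs
rng = np.random.default_rng(3)
word_a, word_b = rng.normal(size=(30, 12)), rng.normal(size=(34, 12))
verdict = dtw_distance(word_a, word_b) < CONFUSABILITY_THRESHOLD
print('confusable' if verdict else 'not confusable')
```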

  18. Fully Automated Assessment of the Severity of Parkinson's Disease from Speech.

    Science.gov (United States)

    Bayestehtashk, Alireza; Asgari, Meysam; Shafran, Izhak; McNames, James

    2015-01-01

    For several decades now, there has been sporadic interest in automatically characterizing the speech impairment due to Parkinson's disease (PD). Most early studies were confined to quantifying a few speech features that were easy to compute. More recent studies have adopted a machine learning approach where a large number of potential features are extracted and the models are learned automatically from the data. In the same vein, here we characterize the disease using a relatively large cohort of 168 subjects, collected from multiple (three) clinics. We elicited speech using three tasks - the sustained phonation task, the diadochokinetic task and a reading task, all within a time budget of 4 minutes, prompted by a portable device. From these recordings, we extracted 1582 features for each subject using openSMILE, a standard feature extraction tool. We compared the effectiveness of three strategies for learning a regularized regression and find that ridge regression performs better than lasso and support vector regression for our task. We refine the feature extraction to capture pitch-related cues, including jitter and shimmer, more accurately using a time-varying harmonic model of speech. Our results show that the severity of the disease can be inferred from speech with a mean absolute error of about 5.5, explaining 61% of the variance and consistently well above chance across all clinics. Of the three speech elicitation tasks, we find that the reading task is significantly better at capturing cues than the diadochokinetic or sustained phonation tasks. In all, we have demonstrated that the data collection and inference can be fully automated, and the results show that speech-based assessment has promising practical application in PD. The techniques reported here are more widely applicable to other paralinguistic tasks in the clinical domain.
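
    The regression stage lends itself to a compact sketch. Below, ridge regression with cross-validated regularization (scikit-learn's RidgeCV) maps a 1582-dimensional feature vector per subject to a severity score; the random data is a stand-in for the openSMILE features and clinical scores, and the standardization step is an assumption.

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(4)

# Stand-ins for the paper's data: 168 subjects x 1582 openSMILE features,
# with a clinical severity score as the regression target.
X = rng.normal(size=(168, 1582))
severity = rng.normal(loc=30, scale=10, size=168)

model = make_pipeline(StandardScaler(),
                      RidgeCV(alphas=np.logspace(-3, 3, 13)))
model.fit(X, severity)
pred = model.predict(X)
print('MAE:', np.mean(np.abs(pred - severity)))
```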

  19. Discriminative tonal feature extraction method in mandarin speech recognition

    Institute of Scientific and Technical Information of China (English)

    HUANG Hao; ZHU Jie

    2007-01-01

    To utilize the supra-segmental nature of Mandarin tones, this article proposes a feature extraction method for hidden Markov model (HMM) based tone modeling. The method uses linear transforms to project F0 (fundamental frequency) features of neighboring syllables as compensations, and adds them to the original F0 features of the current syllable. The transforms are discriminatively trained using an objective function termed "minimum tone error", which is a smooth approximation of tone recognition accuracy. Experiments show that the new tonal features achieve a 3.82% tone recognition rate improvement over the baseline, using maximum likelihood trained HMMs on the normal F0 features. Further experiments show that discriminative HMM training on the new features is 8.78% better than the baseline.
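
    In Python, the compensation step described above amounts to adding linearly transformed neighbour F0 vectors to the current syllable's F0 vector. The transforms below are random stand-ins; in the article they are trained discriminatively with the minimum tone error objective, which this sketch does not implement.

```python
import numpy as np

def compensated_f0(f0_prev, f0_curr, f0_next, W_prev, W_next):
    """Add linearly transformed neighbour F0 features to the current
    syllable's F0 features; W_prev/W_next stand in for the
    discriminatively trained transforms."""
    return f0_curr + W_prev @ f0_prev + W_next @ f0_next

d = 5                                    # F0 feature dimension per syllable
rng = np.random.default_rng(5)
W_prev = 0.1 * rng.normal(size=(d, d))
W_next = 0.1 * rng.normal(size=(d, d))
f0 = rng.normal(size=(3, d))             # previous, current, next syllables
print(compensated_f0(f0[0], f0[1], f0[2], W_prev, W_next))
```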

  20. A Factored Language Model for Prosody Dependent Speech Recognition

    OpenAIRE

    Chen, Ken; Hasegawa-Johnson, Mark A.; Cole, Jennifer S.

    2007-01-01

    In this chapter, we proposed a novel approach that improves the robustness of prosody dependent language modeling by leveraging the dependence between prosody and syntax. In our experiments on Radio News Corpus, a factorial prosody dependent language model estimated using our proposed approach has achieved as much as 31% reduction of the joint perplexity over a prosody dependent language model estimated using the standard Maximum Likelihood approach. In recognition experiments, our approach r...

  1. Study on Acoustic Modeling in a Mandarin Continuous Speech Recognition

    Institute of Scientific and Technical Information of China (English)

    PENG Di; LIU Gang; GUO Jun

    2007-01-01

    The design of acoustic models is of vital importance to build a reliable connection between acoustic waveforms and linguistic messages in terms of individual speech units. According to the characteristics of Chinese phonemes, the base set of acoustic phoneme units is decided and refined, and a decision tree based state tying approach is explored. Since one of the advantages of the top-down tying method is flexibility in maintaining a balance between model accuracy and complexity, relevant adjustments are conducted, such as the stopping criterion of decision tree node splitting, during which optimal thresholds are captured. Better results are achieved in improving acoustic modeling accuracy as well as minimizing the scale of the model to a trainable extent.

  2. Robust multi-stream speech recognition based on weighting the output probabilities of feature components

    Institute of Scientific and Technical Information of China (English)

    ZHANG Jun; WEI Gang; YU Hua; NING Genxin

    2009-01-01

    In the traditional multi-stream fusion methods of speech recognition, all the feature components in a data stream share the same stream weight, while their distortion levels are usually different when the speech recognizer works in noisy environments. To overcome this limitation of the traditional multi-stream frameworks, the current study proposes a new stream fusion method that weights not only the stream outputs, but also the output probabilities of feature components. How the stream and feature component weights in the new fusion method affect the decision is analyzed, and two stream fusion schemes based on the marginalisation and soft decision models in the missing data techniques are proposed. Experimental results on the hybrid sub-band multi-stream speech recognizer show that the proposed schemes can adjust the stream influences on the decision adaptively and outperform the traditional multi-stream methods in various noisy environments.
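
    The fusion rule itself is simple to state in code: instead of one weight per stream, each feature component's output log-probability receives its own weight. The weights below are illustrative values; in the proposed schemes they would follow from marginalisation or soft-decision reliability estimates.

```python
import numpy as np

def fused_log_prob(component_log_probs, weights):
    """Combine per-component output log-probabilities with individual
    weights, rather than one weight per stream."""
    return float(np.dot(weights, component_log_probs))

# Two streams with 3 and 2 feature components; distorted components
# (e.g. low-SNR sub-bands) receive smaller weights.
log_probs = np.array([-1.2, -0.8, -3.5, -0.9, -1.1])
weights = np.array([1.0, 1.0, 0.2, 0.8, 0.9])
print(fused_log_prob(log_probs, weights))
```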

  3. Low-Complexity Variable Frame Rate Analysis for Speech Recognition and Voice Activity Detection

    DEFF Research Database (Denmark)

    Tan, Zheng-Hua; Lindberg, Børge

    2010-01-01

    We present a low-complexity and effective frame selection approach based on a posteriori signal-to-noise ratio (SNR) weighted energy distance: the use of an energy distance, instead of e.g. a standard cepstral distance, makes the approach computationally efficient and enables fine-granularity search, and the use of a posteriori SNR weighting emphasizes the reliable regions in noisy speech signals. It is experimentally found that the approach is able to assign a higher frame rate to fast changing events such as consonants, a lower frame rate to steady regions like vowels, and no frames to silence, even for very low SNR signals. The resulting variable frame rate analysis method is applied to three speech processing tasks that are essential to natural interaction with intelligent environments. First, it is used for improving speech recognition performance in noisy environments. Secondly, the method is used...
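
    One possible reading of the selection rule, sketched in Python: a frame is kept only when its a posteriori SNR weighted energy distance from the last selected frame exceeds a threshold. The exact distance definition and the constants here are assumptions made for illustration, not the authors' specification.

```python
import numpy as np

def select_frames(frames, noise_energy, thresh=0.5):
    """Keep a frame only if its a posteriori SNR weighted energy distance
    from the last selected frame exceeds a threshold."""
    selected = [0]
    for t in range(1, len(frames)):
        e_t = np.sum(frames[t] ** 2)
        e_ref = np.sum(frames[selected[-1]] ** 2)
        post_snr = max(e_t / (noise_energy + 1e-12), 1.0)  # a posteriori SNR
        dist = np.log(post_snr) * abs(np.log(e_t + 1e-12) - np.log(e_ref + 1e-12))
        if dist > thresh:
            selected.append(t)
    return selected

rng = np.random.default_rng(6)
frames = rng.normal(size=(100, 160)) * np.linspace(0.1, 2.0, 100)[:, None]
print(select_frames(frames, noise_energy=4.0))
```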

  4. Coordinated control of an intelligent wheelchair based on a brain-computer interface and speech recognition

    Institute of Scientific and Technical Information of China (English)

    Hong-tao WANG; Yuan-qing LI; Tian-you YU

    2014-01-01

    An intelligent wheelchair is devised, which is controlled by a coordinated mechanism based on a brain-computer interface (BCI) and speech recognition. By performing appropriate activities, users can navigate the wheelchair with four steering behaviors (start, stop, turn left, and turn right). Five healthy subjects participated in an indoor experiment. The results demonstrate the efficiency of the coordinated control mechanism with satisfactory path and time optimality ratios, and show that speech recognition is a fast and accurate supplement for BCI-based control systems. The proposed intelligent wheelchair is especially suitable for patients suffering from paralysis (especially those with aphasia) who can learn to pronounce only a single sound (e.g., 'ah').

  5. Audio-Visual Speech Recognition Using Lip Information Extracted from Side-Face Images

    Directory of Open Access Journals (Sweden)

    Koji Iwano

    2007-03-01

    Full Text Available This paper proposes an audio-visual speech recognition method using lip information extracted from side-face images as an attempt to increase noise robustness in mobile environments. Our proposed method assumes that lip images can be captured using a small camera installed in a handset. Two different kinds of lip features, lip-contour geometric features and lip-motion velocity features, are used individually or jointly, in combination with audio features. Phoneme HMMs modeling the audio and visual features are built based on the multistream HMM technique. Experiments conducted using Japanese connected digit speech contaminated with white noise in various SNR conditions show effectiveness of the proposed method. Recognition accuracy is improved by using the visual information in all SNR conditions. These visual features were confirmed to be effective even when the audio HMM was adapted to noise by the MLLR method.

  6. A Log-Index Weighted Cepstral Distance Measure for Speech Recognition

    Institute of Scientific and Technical Information of China (English)

    郑方; 吴文虎; et al.

    1997-01-01

    A log-index weighted cepstral distance measure is proposed and tested in speaker-independent and speaker-dependent isolated word recognition systems using statistical techniques. The weights for the cepstral coefficients of this measure equal the logarithm of the corresponding indices. The experimental results show that this kind of measure works better than other weighted Euclidean cepstral distance measures on three speech databases. The error rate obtained using this measure is about 1.8 percent for the three databases on average, which is a 25% reduction from that obtained using other measures, and a 40% reduction from that obtained using the Log Likelihood Ratio (LLR) measure. The experimental results also show that this kind of distance measure works well in both speaker-dependent and speaker-independent speech recognition systems.
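
    The measure is easy to state concretely; the Python sketch below applies the stated rule, a weight of log(i) on the i-th cepstral coefficient difference (indexing from 1, so the first coefficient receives zero weight). This is a minimal reading of the abstract, not the paper's full formulation.

```python
import numpy as np

def log_index_weighted_distance(c1, c2):
    """Weighted Euclidean cepstral distance with w_i = log(i);
    note log(1) = 0 discards the first coefficient."""
    i = np.arange(1, len(c1) + 1, dtype=float)
    w = np.log(i)
    diff = np.asarray(c1) - np.asarray(c2)
    return float(np.sqrt(np.sum((w * diff) ** 2)))

a, b = np.random.default_rng(7).normal(size=(2, 12))
print(log_index_weighted_distance(a, b))
```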

  7. Tone model integration based on discriminative weight training for Putonghua speech recognition

    Institute of Scientific and Technical Information of China (English)

    HUANG Hao; ZHU Jie

    2008-01-01

    A discriminative framework of tone model integration in continuous speech recognition is proposed. The method uses model dependent weights to scale the probabilities of the hidden Markov models based on spectral features and the tone models based on tonal features. The weights are discriminatively trained by the minimum phone error criterion. The update equation of the model weights, based on the extended Baum-Welch algorithm, is derived. Various schemes of model weight combination are evaluated and a smoothing technique is introduced to make training robust to overfitting. The proposed method is evaluated on tonal syllable output and character output speech recognition tasks. The experimental results show the proposed method obtains 9.5% and 4.7% relative error reduction over global weighting on the two tasks, due to a better interpolation of the given models. This proves the effectiveness of discriminatively trained model weights for tone model integration.

  8. Robust Features for Speech Recognition using Temporal Filtering Technique in the Presence of Impulsive Noise

    Directory of Open Access Journals (Sweden)

    Hajer Rahali

    2014-10-01

    Full Text Available In this paper we introduce a robust feature extractor, dubbed Modified Function Cepstral Coefficients (MODFCC), based on a gammachirp filterbank, Relative Spectral (RASTA) and Autoregressive Moving-Average (ARMA) filtering. The goal of this work is to improve the robustness of speech recognition systems in additive noise and real-time reverberant environments. In speech recognition systems Mel-Frequency Cepstral Coefficients (MFCC), RASTA and ARMA Frequency Cepstral Coefficients (RASTA-MFCC and ARMA-MFCC) are the three main techniques used. This paper presents some modifications to the original MFCC method. In our work the effectiveness of the proposed changes to MFCC was tested and compared against the original RASTA-MFCC and ARMA-MFCC features. Prosodic features such as jitter and shimmer are added to the baseline spectral features. The above-mentioned techniques were tested with impulsive signals under various noisy conditions within the AURORA databases.

  9. Combined Acoustic and Pronunciation Modelling for Non-Native Speech Recognition

    CERN Document Server

    Bouselmi, Ghazi; Illina, Irina

    2007-01-01

    In this paper, we present several adaptation methods for non-native speech recognition. We have tested pronunciation modelling, MLLR and MAP non-native pronunciation adaptation, and HMM model retraining on the HIWIRE foreign-accented English speech database. The "phonetic confusion" scheme we have developed consists of associating with each spoken phone several sequences of confused phones. In our experiments, we have used different combinations of acoustic models representing the canonical and the foreign pronunciations: spoken and native models, and models adapted to the non-native accent with MAP and MLLR. The joint use of pronunciation modelling and acoustic adaptation led to further improvements in recognition accuracy. The best combination of the above-mentioned techniques resulted in a relative word error reduction ranging from 46% to 71%.

  10. Audio-Visual Speech Recognition Using Lip Information Extracted from Side-Face Images

    Directory of Open Access Journals (Sweden)

    Iwano Koji

    2007-01-01

    Full Text Available This paper proposes an audio-visual speech recognition method using lip information extracted from side-face images as an attempt to increase noise robustness in mobile environments. Our proposed method assumes that lip images can be captured using a small camera installed in a handset. Two different kinds of lip features, lip-contour geometric features and lip-motion velocity features, are used individually or jointly, in combination with audio features. Phoneme HMMs modeling the audio and visual features are built based on the multistream HMM technique. Experiments conducted using Japanese connected digit speech contaminated with white noise in various SNR conditions show effectiveness of the proposed method. Recognition accuracy is improved by using the visual information in all SNR conditions. These visual features were confirmed to be effective even when the audio HMM was adapted to noise by the MLLR method.

  11. Performance Evaluation of Speech Recognition Systems as a Next-Generation Pilot-Vehicle Interface Technology

    Science.gov (United States)

    Arthur, Jarvis J., III; Shelton, Kevin J.; Prinzel, Lawrence J., III; Bailey, Randall E.

    2016-01-01

    During the flight trials known as the Gulfstream-V Synthetic Vision Systems Integrated Technology Evaluation (GV-SITE), a Speech Recognition System (SRS) was used by the evaluation pilots. The SRS was intended to be an intuitive interface for display control (rather than knobs, buttons, etc.). This paper describes the performance of a current "state of the art" Speech Recognition System (SRS). The commercially available technology was evaluated as an application for possible inclusion in commercial aircraft flight decks as a crew-to-vehicle interface. Specifically, the technology is to be used as an interface from the aircrew to the onboard displays, controls, and flight management tasks. A flight test of an SRS as well as a laboratory test was conducted.

  12. Automated recognition of malignancy mentions in biomedical literature

    Directory of Open Access Journals (Sweden)

    Liberman Mark Y

    2006-11-01

    Full Text Available Abstract Background The rapid proliferation of biomedical text makes it increasingly difficult for researchers to identify, synthesize, and utilize developed knowledge in their fields of interest. Automated information extraction procedures can assist in the acquisition and management of this knowledge. Previous efforts in biomedical text mining have focused primarily upon named entity recognition of well-defined molecular objects such as genes, but less work has been performed to identify disease-related objects and concepts. Furthermore, promise has been tempered by an inability to efficiently scale approaches in ways that minimize manual efforts and still perform with high accuracy. Here, we have applied a machine-learning approach previously successful for identifying molecular entities to a disease concept to determine if the underlying probabilistic model effectively generalizes to unrelated concepts with minimal manual intervention for model retraining. Results We developed a named entity recognizer (MTag), an entity tagger for recognizing clinical descriptions of malignancy presented in text. The application uses the machine-learning technique Conditional Random Fields with additional domain-specific features. MTag was tested with 1,010 training and 432 evaluation documents pertaining to cancer genomics. Overall, our experiments resulted in 0.85 precision, 0.83 recall, and 0.84 F-measure on the evaluation set. Compared with a baseline system using string matching of text with a neoplasm term list, MTag performed with a much higher recall rate (92.1% vs. 42.1%) and demonstrated the ability to learn new patterns. Application of MTag to all MEDLINE abstracts yielded the identification of 580,002 unique and 9,153,340 overall mentions of malignancy. Significantly, addition of an extensive lexicon of malignancy mentions as a feature set for extraction had minimal impact in performance. Conclusion Together, these results suggest that the
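
    A toy version of the CRF tagging setup, using the sklearn-crfsuite package rather than the authors' implementation; the token features and two-sentence training set are placeholders for MTag's domain-specific feature set and corpus.

```python
import sklearn_crfsuite

def token_features(sent, i):
    """Simple per-token features; MTag's real feature set is domain-specific."""
    word = sent[i]
    return {'lower': word.lower(), 'is_title': word.istitle(),
            'suffix3': word[-3:], 'prev': sent[i - 1].lower() if i else '<s>'}

train_sents = [['The', 'patient', 'had', 'a', 'malignant', 'melanoma'],
               ['No', 'carcinoma', 'was', 'found']]
train_tags = [['O', 'O', 'O', 'O', 'B-MAL', 'I-MAL'],
              ['O', 'B-MAL', 'O', 'O']]

X = [[token_features(s, i) for i in range(len(s))] for s in train_sents]
crf = sklearn_crfsuite.CRF(algorithm='lbfgs', max_iterations=50)
crf.fit(X, train_tags)

test = ['She', 'had', 'melanoma']
print(crf.predict([[token_features(test, i) for i in range(len(test))]]))
```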

  13. A Computationally Efficient Mel-Filter Bank VAD Algorithm for Distributed Speech Recognition Systems

    Science.gov (United States)

    Vlaj, Damjan; Kotnik, Bojan; Horvat, Bogomir; Kačič, Zdravko

    2005-12-01

    This paper presents a novel computationally efficient voice activity detection (VAD) algorithm and emphasizes the importance of such algorithms in distributed speech recognition (DSR) systems. When using VAD algorithms in telecommunication systems, the required capacity of the speech transmission channel can be reduced if only the speech parts of the signal are transmitted. A similar objective can be adopted in DSR systems, where the nonspeech parameters are not sent over the transmission channel. A novel approach is proposed for VAD decisions based on mel-filter bank (MFB) outputs with the so-called Hangover criterion. Comparative tests are presented between the proposed MFB VAD algorithm and three VAD algorithms used in the G.729, G.723.1, and DSR (advanced front-end) Standards. These tests were made on the Aurora 2 database with different signal-to-noise ratios (SNRs). In the speech recognition tests, the proposed MFB VAD outperformed all three VAD algorithms used in the standards (the G.723.1, G.729, and DSR VADs) in all SNRs.

  14. A Computationally Efficient Mel-Filter Bank VAD Algorithm for Distributed Speech Recognition Systems

    Directory of Open Access Journals (Sweden)

    Vlaj Damjan

    2005-01-01

    Full Text Available This paper presents a novel computationally efficient voice activity detection (VAD) algorithm and emphasizes the importance of such algorithms in distributed speech recognition (DSR) systems. When using VAD algorithms in telecommunication systems, the required capacity of the speech transmission channel can be reduced if only the speech parts of the signal are transmitted. A similar objective can be adopted in DSR systems, where the nonspeech parameters are not sent over the transmission channel. A novel approach is proposed for VAD decisions based on mel-filter bank (MFB) outputs with the so-called Hangover criterion. Comparative tests are presented between the proposed MFB VAD algorithm and three VAD algorithms used in the G.729, G.723.1, and DSR (advanced front-end) Standards. These tests were made on the Aurora 2 database with different signal-to-noise ratios (SNRs). In the speech recognition tests, the proposed MFB VAD outperformed all three VAD algorithms used in the standards (the G.723.1, G.729, and DSR VADs) in all SNRs.
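
    A minimal sketch of an MFB-based VAD decision with a hangover, in Python. The log-energy threshold and hangover length are assumed values; the Hangover criterion in the paper may differ in detail.

```python
import numpy as np

def mfb_vad(mfb_energies, threshold=2.0, hangover=5):
    """Frame-wise VAD from mel-filter bank (MFB) outputs: once speech is
    detected, the decision is held for `hangover` frames to bridge weak
    word endings."""
    decisions, hold = [], 0
    for frame in mfb_energies:             # frame: vector of MFB outputs
        if np.log(np.sum(frame) + 1e-12) > threshold:
            hold = hangover
        decisions.append(hold > 0)
        hold = max(hold - 1, 0)
    return decisions

rng = np.random.default_rng(8)
energies = np.vstack([rng.uniform(0, .5, (20, 23)),   # noise frames
                      rng.uniform(5, 9, (30, 23)),    # speech frames
                      rng.uniform(0, .5, (20, 23))])
print(sum(mfb_vad(energies)), 'of', len(energies), 'frames flagged as speech')
```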

  15. Improving speech recognition on a mobile robot platform through the use of top-down visual queues

    OpenAIRE

    Ross, Robert; O'Donoghue, R. P. S.; O'Hare, G. M. P.

    2003-01-01

    In many real-world environments, Automatic Speech Recognition (ASR) technologies fail to provide adequate performance for applications such as human-robot dialog. Despite substantial evidence that speech recognition in humans is performed in a top-down as well as bottom-up manner, ASR systems typically fail to capitalize on this, instead relying on a purely statistical, bottom-up methodology. In this paper we advocate the use of a knowledge-based approach to improving ASR in domains such as m...

  16. Managing predefined templates and macros for a departmental speech recognition system using common software

    OpenAIRE

    Sistrom, Chris L.; Honeyman, Janice C.; Mancuso, Anthony; Quisling, Ronald G.

    2001-01-01

    The authors have developed a networked database system to create, store, and manage predefined radiology report definitions. This was prompted by a complete departmental conversion to a computer speech recognition system (SRS) for clinical reporting. The software complements and extends the capabilities of the SRS, and the two systems are integrated by means of a simple text file format and import/export functions within each program. This report describes the functional requirements, design consider...

  17. Development of a Mandarin-English Bilingual Speech Recognition System for Real World Music Retrieval

    Science.gov (United States)

    Zhang, Qingqing; Pan, Jielin; Lin, Yang; Shao, Jian; Yan, Yonghong

    In recent decades, there has been a great deal of research into the problem of bilingual speech recognition: developing a recognizer that can handle inter- and intra-sentential language switching between two languages. This paper presents our recent work on the development of a grammar-constrained, Mandarin-English bilingual Speech Recognition System (MESRS) for real world music retrieval. Two of the main difficult issues in handling bilingual speech recognition systems for real world applications are tackled in this paper: one is to balance the performance and the complexity of the bilingual speech recognition system; the other is to effectively deal with the matrix language accents in the embedded language. In order to process the intra-sentential language switching and reduce the amount of data required to robustly estimate statistical models, a compact single set of bilingual acoustic models derived by phone set merging and clustering is developed instead of using two separate monolingual models for each language. In our study, a novel Two-pass phone clustering method based on Confusion Matrix (TCM) is presented and compared with the log-likelihood measure method. Experiments testify that TCM can achieve better performance. Since potential system users' native language is Mandarin, which is regarded as the matrix language in our application, their pronunciations of English as the embedded language usually contain Mandarin accents. In order to deal with the matrix language accents in the embedded language, different non-native adaptation approaches are investigated. Experiments show that the model retraining method outperforms other common adaptation methods such as Maximum A Posteriori (MAP). With the effective incorporation of approaches on phone clustering and non-native adaptation, the Phrase Error Rate (PER) of MESRS for English utterances was reduced by 24.47% relatively compared to the baseline monolingual English system while the PER on Mandarin utterances was

  18. Visual Word Recognition is Accompanied by Covert Articulation: Evidence for a Speech-like Phonological Representation

    OpenAIRE

    Eiter, Brianna M.; INHOFF, ALBRECHT W.

    2008-01-01

    Two lexical decision task (LDT) experiments examined whether visual word recognition involves the use of a speech-like phonological code that may be generated via covert articulation. In Experiment 1, each visual item was presented with an irrelevant spoken word (ISW) that was either phonologically identical, similar, or dissimilar to it. An ISW delayed classification of a visual word when the two were phonologically similar, and it delayed the classification of a pseudoword when it was ident...

  19. Audiovisual benefit for recognition of speech presented with single-talker noise in older listeners

    OpenAIRE

    Jesse, A.; Janse, E.

    2012-01-01

    Older listeners are more affected than younger listeners in their recognition of speech in adverse conditions, such as when they also hear a single competing speaker. In the present study, we investigated with a speeded response task whether older listeners with various degrees of hearing loss benefit under such conditions from also seeing the speaker they intend to listen to. We also tested, at the same time, whether older adults need postperceptual processing to obtain an audiovisual benefi...

  20. A SPEECH RECOGNITION METHOD USING COMPETITIVE AND SELECTIVE LEARNING NEURAL NETWORKS

    Institute of Scientific and Technical Information of China (English)

    2000-01-01

    On the basis of Gersho's asymptotic theory, the isodistortion principle of vector clustering is discussed, and a competitive and selective learning (CSL) method is proposed that avoids local optima and gives excellent results when applied to the clustering of HMM models. By combining parallel self-organizing hierarchical neural networks (PSHNN) to reclassify the scores output by the HMM, the CSL speech recognition rate is clearly improved.

  1. Estimation of Phoneme-Specific HMM Topologies for the Automatic Recognition of Dysarthric Speech

    Directory of Open Access Journals (Sweden)

    Santiago-Omar Caballero-Morales

    2013-01-01

    Full Text Available Dysarthria is a frequently occurring motor speech disorder which can be caused by neurological trauma, cerebral palsy, or degenerative neurological diseases. Because dysarthria affects phonation, articulation, and prosody, spoken communication of dysarthric speakers gets seriously restricted, affecting their quality of life and confidence. Assistive technology has led to the development of speech applications to improve the spoken communication of dysarthric speakers. In this field, this paper presents an approach to improve the accuracy of HMM-based speech recognition systems. Because phonatory dysfunction is a main characteristic of dysarthric speech, the phonemes of a dysarthric speaker are affected at different levels. Thus, the approach consists in finding the most suitable type of HMM topology (Bakis, Ergodic) for each phoneme in the speaker's phonetic repertoire. The topology is further refined with a suitable number of states and Gaussian mixture components for acoustic modelling. This represents a difference when compared with studies where a single topology is assumed for all phonemes. Finding the suitable parameters (topology and mixture components) is performed with a Genetic Algorithm (GA). Experiments with a well-known dysarthric speech database showed statistically significant improvements of the proposed approach when compared with the single topology approach, even for speakers with severe dysarthria.
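
    The search loop can be sketched as a tiny genetic algorithm over per-phoneme configurations (topology, number of states, number of mixture components). The fitness function below is a placeholder; in the paper it would involve training the candidate HMM and scoring recognition accuracy, which this sketch does not do.

```python
import random
random.seed(0)

TOPOLOGIES = ['bakis', 'ergodic']

def random_genome():
    # (topology, number of states, number of Gaussian mixture components)
    return (random.choice(TOPOLOGIES), random.randint(3, 7), random.randint(1, 8))

def fitness(genome):
    """Placeholder objective; a real system would train the HMM with these
    settings and return recognition accuracy on held-out speech."""
    topo, states, mixes = genome
    return -(abs(states - 5) + abs(mixes - 4)) + (1 if topo == 'bakis' else 0)

def mutate(g):
    topo, states, mixes = g
    return (random.choice(TOPOLOGIES),
            min(7, max(3, states + random.choice((-1, 0, 1)))),
            min(8, max(1, mixes + random.choice((-1, 0, 1)))))

pop = [random_genome() for _ in range(20)]
for _ in range(30):                       # generations
    pop.sort(key=fitness, reverse=True)
    pop = pop[:10] + [mutate(random.choice(pop[:10])) for _ in range(10)]
print('best per-phoneme configuration:', max(pop, key=fitness))
```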

  2. Environment-dependent denoising autoencoder for distant-talking speech recognition

    Science.gov (United States)

    Ueda, Yuma; Wang, Longbiao; Kai, Atsuhiko; Ren, Bo

    2015-12-01

    In this paper, we propose an environment-dependent denoising autoencoder (DAE) and automatic environment identification based on a deep neural network (DNN) with blind reverberation estimation for robust distant-talking speech recognition. Recently, DAEs have been shown to be effective in many noise reduction and reverberation suppression applications because higher-level representations and increased flexibility of the feature mapping function can be learned. However, a DAE is not adequate in mismatched training and test environments. In a conventional DAE, parameters are trained using pairs of reverberant speech and clean speech under various acoustic conditions (that is, an environment-independent DAE). To address the above problem, we propose two environment-dependent DAEs to reduce the influence of mismatches between training and test environments. In the first approach, we train various DAEs using speech from different acoustic environments, and the DAE for the condition that best matches the test condition is automatically selected (that is, a two-step environment-dependent DAE). To improve environment identification performance, we propose a DNN that uses both reverberant speech and estimated reverberation. In the second approach, we add estimated reverberation features to the input of the DAE (that is, a one-step environment-dependent DAE or a reverberation-aware DAE). The proposed method is evaluated using speech in simulated and real reverberant environments. Experimental results show that the environment-dependent DAE outperforms the environment-independent one in both simulated and real reverberant environments. For two-step environment-dependent DAE, the performance of environment identification based on the proposed DNN approach is also better than that of the conventional DNN approach, in which only reverberant speech is used and reverberation is not blindly estimated. And, the one-step environment-dependent DAE significantly outperforms the two

  3. Call recognition and individual identification of fish vocalizations based on automatic speech recognition: An example with the Lusitanian toadfish.

    Science.gov (United States)

    Vieira, Manuel; Fonseca, Paulo J; Amorim, M Clara P; Teixeira, Carlos J C

    2015-12-01

    The study of acoustic communication in animals often requires not only the recognition of species specific acoustic signals but also the identification of individual subjects, all in a complex acoustic background. Moreover, when very long recordings are to be analyzed, automatic recognition and identification processes are invaluable tools to extract the relevant biological information. A pattern recognition methodology based on hidden Markov models is presented inspired by successful results obtained in the most widely known and complex acoustical communication signal: human speech. This methodology was applied here for the first time to the detection and recognition of fish acoustic signals, specifically in a stream of round-the-clock recordings of Lusitanian toadfish (Halobatrachus didactylus) in their natural estuarine habitat. The results show that this methodology is able not only to detect the mating sounds (boatwhistles) but also to identify individual male toadfish, reaching an identification rate of ca. 95%. Moreover this method also proved to be a powerful tool to assess signal durations in large data sets. However, the system failed in recognizing other sound types.

  4. Contribution to automatic speech recognition. Analysis of the direct acoustical signal. Recognition of isolated words and phoneme identification

    International Nuclear Information System (INIS)

    This report deals with the acoustical-phonetic step of automatic speech recognition. The parameters used are the extrema of the acoustical signal (coded in amplitude and duration). This coding method, whose properties are described, is simple and well adapted to digital processing. The quality and the intelligibility of the coded signal after reconstruction are particularly satisfactory. An experiment on the automatic recognition of isolated words has been carried out using this coding system. We have designed a filtering algorithm operating on the parameters of the coding. Thus the characteristics of the formants can be derived under certain conditions, which are discussed. Using these characteristics, the identification of a large part of the phonemes for a given speaker was achieved. Carrying on these studies required the development of a particular real-time processing methodology, which allowed immediate evaluation of program improvements. Such processing on temporal coding of the acoustical signal is extremely powerful and could, used in connection with other methods, represent an efficient tool for the automatic processing of speech. (author)

  5. Recognition of Emotions in Mexican Spanish Speech: An Approach Based on Acoustic Modelling of Emotion-Specific Vowels

    Directory of Open Access Journals (Sweden)

    Santiago-Omar Caballero-Morales

    2013-01-01

    Full Text Available An approach for the recognition of emotions in speech is presented. The target language is Mexican Spanish, and for this purpose a speech database was created. The approach consists of the phoneme acoustic modelling of emotion-specific vowels. For this, a standard phoneme-based Automatic Speech Recognition (ASR) system was built with Hidden Markov Models (HMMs), where different phoneme HMMs were built for the consonants and emotion-specific vowels associated with four emotional states (anger, happiness, neutral, sadness). Then, estimation of the emotional state from a spoken sentence is performed by counting the number of emotion-specific vowels found in the ASR's output for the sentence. With this approach, accuracy of 87–100% was achieved for the recognition of the emotional state of Mexican Spanish speech.
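
    The decision rule reduces to counting, as the Python sketch below shows; the emotion-tagged vowel symbols ('a_ang', 'o_hap', ...) are fabricated names for illustration, not the authors' phone set.

```python
from collections import Counter

def estimate_emotion(asr_phoneme_output):
    """Label the utterance with the emotion whose emotion-specific vowels
    occur most often in the ASR output; plain consonants are ignored."""
    counts = Counter(sym.split('_')[1]                # 'a_ang' -> 'ang'
                     for sym in asr_phoneme_output if '_' in sym)
    return counts.most_common(1)[0][0] if counts else 'neutral'

print(estimate_emotion(['k', 'a_ang', 's', 'a_ang', 'o_hap']))  # -> 'ang'
```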

  6. Speech emotion recognition based on LS-SVM

    Institute of Scientific and Technical Information of China (English)

    周慧; 魏霖静

    2012-01-01

    An approach for emotional speech recognition based on LS-SVM is proposed. First, pitch frequency, energy, and speech rate parameters are extracted from speech signals as emotional features. Emotional speech models are then built with the LS-SVM method. Experimental results show that high recognition rates can be obtained for basic emotion recognition.
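
    Unlike the standard SVM, the LS-SVM dual reduces to a single linear system, which makes a self-contained sketch short. The RBF kernel, regularization constant, and synthetic prosodic features below are illustrative assumptions, not the paper's settings.

```python
import numpy as np

def rbf_kernel(X1, X2, gamma=0.5):
    d = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d)

def lssvm_train(X, y, C=10.0):
    """Least-squares SVM: the dual reduces to the linear system
    [[0, 1^T], [1, K + I/C]] [b; alpha] = [0; y]."""
    n = len(y)
    A = np.zeros((n + 1, n + 1))
    A[0, 1:] = 1.0
    A[1:, 0] = 1.0
    A[1:, 1:] = rbf_kernel(X, X) + np.eye(n) / C
    sol = np.linalg.solve(A, np.concatenate(([0.0], y)))
    return sol[0], sol[1:]                 # bias b, coefficients alpha

def lssvm_predict(X_train, b, alpha, X_test):
    return np.sign(rbf_kernel(X_test, X_train) @ alpha + b)

rng = np.random.default_rng(9)
# Stand-ins for prosodic features (pitch, energy, speech rate) of two emotions.
X = np.vstack([rng.normal(0, 1, (30, 3)), rng.normal(2, 1, (30, 3))])
y = np.array([-1.0] * 30 + [1.0] * 30)
b, alpha = lssvm_train(X, y)
print((lssvm_predict(X, b, alpha, X) == y).mean())   # training accuracy
```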

  7. The Effects of Noise on Speech Recognition in Cochlear Implant Subjects: Predictions and Analysis Using Acoustic Models

    Science.gov (United States)

    Remus, Jeremiah J.; Collins, Leslie M.

    2005-12-01

    Cochlear implants can provide partial restoration of hearing, even with limited spectral resolution and loss of fine temporal structure, to severely deafened individuals. Studies have indicated that background noise has significant deleterious effects on the speech recognition performance of cochlear implant patients. This study investigates the effects of noise on speech recognition using acoustic models of two cochlear implant speech processors and several predictive signal-processing-based analyses. The results of a listening test for vowel and consonant recognition in noise are presented and analyzed using the rate of phonemic feature transmission for each acoustic model. Three methods for predicting patterns of consonant and vowel confusion that are based on signal processing techniques calculating a quantitative difference between speech tokens are developed and tested using the listening test results. Results of the listening test and confusion predictions are discussed in terms of comparisons between acoustic models and confusion prediction performance.

  8. Recognition of Speech of Normal-hearing Individuals with Tinnitus and Hyperacusis

    Directory of Open Access Journals (Sweden)

    Hennig, Tais Regina

    2011-01-01

    Full Text Available Introduction: Tinnitus and hyperacusis are increasingly frequent audiological symptoms that may occur in the absence of hearing loss, but this does not lessen their impact on, or bother to, the affected individuals. The Medial Olivocochlear System helps with speech recognition in noise and may be connected to the presence of tinnitus and hyperacusis. Objective: To evaluate the speech recognition of normal-hearing individuals with and without complaints of tinnitus and hyperacusis, and to compare their results. Method: A descriptive, prospective and cross-sectional study in which 19 normal-hearing individuals with complaints of tinnitus and hyperacusis formed the Study Group (SG), and 23 normal-hearing individuals without audiological complaints formed the Control Group (CG). The individuals of both groups were submitted to the List of Sentences in Portuguese test, prepared by Costa (1998), to determine the Sentences Recognition Threshold in Silence (LRSS) and the signal-to-noise (S/N) ratio. The SG also answered the Tinnitus Handicap Inventory for tinnitus analysis, and discomfort thresholds were determined to characterize hyperacusis. Results: The CG and SG presented average LRSS and S/N ratios of 7.34 dB NA and -6.77 dB, and of 7.20 dB NA and -4.89 dB, respectively. Conclusion: The normal-hearing individuals with and without audiological complaints of tinnitus and hyperacusis had similar performance in speech recognition in silence, which was not the case in the presence of competing noise, since the SG had lower performance in this communication scenario, with a statistically significant difference.

  9. Performance of Czech Speech Recognition with Language Models Created from Public Resources

    Directory of Open Access Journals (Sweden)

    V. Prochazka

    2011-12-01

    Full Text Available In this paper, we investigate the usability of publicly available n-gram corpora for the creation of language models (LMs) applicable to Czech speech recognition systems. N-gram LMs with various parameters and settings were created from two publicly available sets, the Czech Web 1T 5-gram corpus provided by Google and a 5-gram corpus obtained from the Czech National Corpus Institute. For comparison, we also tested an LM made from a large private resource of newspaper and broadcast texts collected by a Czech media mining company. The LMs were analyzed and compared from the statistical point of view (mainly via their perplexity rates) and from the performance point of view when employed in large vocabulary continuous speech recognition systems. Our study shows that the Web1T-based LMs, even after intensive cleaning and normalization procedures, cannot compete with those made from smaller but more consistent corpora. The experiments done on large test data also illustrate the impact of Czech as a highly inflective language on the perplexity, OOV, and recognition accuracy rates.
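
    For readers unfamiliar with the perplexity comparison used here, the Python sketch below builds an add-alpha smoothed bigram LM from a toy corpus and evaluates perplexity on held-out text; real experiments would instead use the Web1T or national-corpus n-gram counts.

```python
import math
from collections import Counter

def train_bigram_lm(corpus_tokens, alpha=1.0):
    """Add-alpha smoothed bigram LM; returns a log-probability function."""
    unigrams = Counter(corpus_tokens)
    bigrams = Counter(zip(corpus_tokens, corpus_tokens[1:]))
    V = len(unigrams)
    def logprob(prev, word):
        return math.log((bigrams[(prev, word)] + alpha) /
                        (unigrams[prev] + alpha * V))
    return logprob

def perplexity(logprob, test_tokens):
    lp = [logprob(p, w) for p, w in zip(test_tokens, test_tokens[1:])]
    return math.exp(-sum(lp) / len(lp))

train = 'the cat sat on the mat the cat ate'.split()
test = 'the cat sat'.split()
print(perplexity(train_bigram_lm(train), test))
```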

  10. Automated Fourier space region-recognition filtering for off-axis digital holographic microscopy

    CERN Document Server

    He, Xuefei; Pratap, Mrinalini; Zheng, Yujie; Wang, Yi; Nisbet, David R; Williams, Richard J; Rug, Melanie; Maier, Alexander G; Lee, Woei Ming

    2016-01-01

    Automated label-free quantitative imaging of biological samples can greatly benefit high-throughput disease diagnosis. Digital holographic microscopy (DHM) is a powerful quantitative label-free imaging tool that retrieves structural details of cellular samples non-invasively. In off-axis DHM, a proper spatial filtering window in Fourier space is crucial to the quality of the reconstructed phase image. Here we describe a region-recognition approach that combines shape recognition with iterative thresholding to extract the optimal shape of frequency components. The region-recognition technique offers fully automated adaptive filtering that can operate with a variety of samples and imaging conditions. When imaging through an optically scattering biological hydrogel matrix, the technique surpasses previous histogram thresholding techniques without requiring any manual intervention. Finally, we automate the extraction of the statistical difference of optical height between malaria parasite infected and uninfected re...

  11. Dynamic Relation Between Working Memory Capacity and Speech Recognition in Noise During the First 6 Months of Hearing Aid Use

    Directory of Open Access Journals (Sweden)

    Elaine H. N. Ng

    2014-11-01

    Full Text Available The present study aimed to investigate the changing relationship between aided speech recognition and cognitive function during the first 6 months of hearing aid use. Twenty-seven first-time hearing aid users with symmetrical mild to moderate sensorineural hearing loss were recruited. Aided speech recognition thresholds in noise were obtained in the hearing aid fitting session as well as at 3 and 6 months postfitting. Cognitive abilities were assessed using a reading span test, which is a measure of working memory capacity, and a cognitive test battery. Results showed a significant correlation between reading span and speech reception threshold during the hearing aid fitting session. This relation was significantly weakened over the first 6 months of hearing aid use. Multiple regression analysis showed that reading span was the main predictor of speech recognition thresholds in noise when hearing aids were first fitted, but that the pure-tone average hearing threshold was the main predictor 6 months later. One way of explaining the results is that working memory capacity plays a more important role in speech recognition in noise initially rather than after 6 months of use. We propose that new hearing aid users engage working memory capacity to recognize unfamiliar processed speech signals because the phonological form of these signals cannot be automatically matched to phonological representations in long-term memory. As familiarization proceeds, the mismatch effect is alleviated, and the engagement of working memory capacity is reduced.

  12. An Introduction to the Chinese Speech Recognition Front-End of the NICT/ATR Multi-Lingual Speech Translation System

    Institute of Scientific and Technical Information of China (English)

    ZHANG Jinsong; Takatoshi Jitsuhiro; Hirofumi Yamamoto; HU Xinhui; Satoshi Nakamura

    2008-01-01

    This paper introduces several important features of the Chinese large vocabulary continuous speech recognition system in the NICT/ATR multi-lingual speech-to-speech translation system. The features include: (1) a flexible way to derive an information-rich phoneme set based on mutual information between a text corpus and its phoneme set; (2) a hidden Markov network acoustic model and a successive state splitting algorithm to generate its model topology based on a minimum description length criterion; and (3) advanced language modeling using multi-class composite N-grams. These features allow a recognition performance of 90% character accuracy in tourism-related dialogue with a real-time response speed.

  13. Recognition of handprinted characters for automated cartography A progress report

    Science.gov (United States)

    Lybanon, M.; Brown, R. M.; Gronmeyer, L. K.

    1980-01-01

    A research program for developing handwritten character recognition techniques is reported. The generation of cartographic/hydrographic manuscripts is overviewed. The performance of hardware/software systems is discussed, along with future research problem areas and planned approaches.

  14. CAR2 - Czech Database of Car Speech

    Directory of Open Access Journals (Sweden)

    P. Sovka

    1999-12-01

    Full Text Available This paper presents a new Czech-language two-channel (stereo) speech database recorded in a car environment. The database was designed for experiments with speech enhancement for communication purposes and for the study and design of robust speech recognition systems. Tools for automated phoneme labelling based on Baum-Welch re-estimation were realised. A noise analysis of the car background environment was also performed.

  15. Using vector Taylor series with noise clustering for speech recognition in non-stationary noisy environments

    Institute of Scientific and Technical Information of China (English)

    2006-01-01

    The performance of automatic speech recognizers degrades seriously when there are mismatches between the training and testing conditions. The Vector Taylor Series (VTS) approach has been used to compensate for mismatches caused by additive noise and convolutive channel distortion in the cepstral domain. In this paper, conventional VTS is extended by incorporating noise clustering into its EM iteration procedure, improving its compensation effectiveness in non-stationary noisy environments. Recognition experiments in babble and exhibition noise environments demonstrate that the new algorithm achieves a 35% average error rate reduction compared with conventional VTS.

  16. Robust Speaker Recognition with Combined Use of Acoustic and Throat Microphone Speech

    DEFF Research Database (Denmark)

    Sahidullah, Md; Gonzalez Hautamäki, Rosa; Thomsen, Dennis Alexander Lehmann;

    2016-01-01

    Accuracy of automatic speaker verification (ASV) systems degrades severely in the presence of background noise. In this paper, we study the use of additional side information provided by a body-conducted sensor, the throat microphone. The throat microphone signal is much less affected by background noise... of this additional information for speech activity detection, feature extraction and fusion of the acoustic and throat microphone signals. We collected a pilot database consisting of 38 subjects including both clean and noisy sessions. We carry out speaker verification experiments using a Gaussian mixture model...

  17. An Automated Size Recognition Technique for Acetabular Implant in Total Hip Replacement

    CERN Document Server

    Shapi'i, Azrulhizam; Hasan, Mohammad Khatim; Kassim, Abdul Yazid Mohd; 10.5121/ijcsit.2011.3218

    2011-01-01

    Preoperative templating in Total Hip Replacement (THR) is a method to estimate the optimal size and position of the implant. Today, observational (manual) size recognition techniques are still used to find a suitable implant for the patient. Therefore, a digital and automated technique should be developed so that the implant size recognition process can be effectively implemented. For this purpose, we have introduced the new technique for acetabular implant size recognition in THR preoperative planning based on the diameter of acetabulum size. This technique enables the surgeon to recognise a digital acetabular implant size automatically. Ten randomly selected X-rays of unidentified patients were used to test the accuracy and utility of an automated implant size recognition technique. Based on the testing result, the new technique yielded very close results to those obtained by the observational method in nine studies (90%).

  18. High-order hidden Markov model for piecewise linear processes and applications to speech recognition.

    Science.gov (United States)

    Lee, Lee-Min; Jean, Fu-Rong

    2016-08-01

    Hidden Markov models have been widely applied to systems with sequential data. However, the conditional independence of the state outputs limits the output of a hidden Markov model to a piecewise constant random sequence, which is not a good approximation of many real processes. In this paper, a high-order hidden Markov model for piecewise linear processes is proposed to better approximate the behavior of a real process. A parameter estimation method based on the expectation-maximization algorithm was derived for the proposed model. Experiments on speech recognition of noisy Mandarin digits were conducted to examine the effectiveness of the proposed method. Experimental results show that the proposed method can reduce the recognition error rate compared to a baseline hidden Markov model. PMID:27586781

  19. Analysis of speech under stress using Linear techniques and Non-Linear techniques for emotion recognition system

    OpenAIRE

    A. A. Khulage; B. V. Pathak

    2012-01-01

    Analysis of speech for recognition of stress is important for identification of the emotional state of a person. This can be done using 'linear techniques', which use parameters such as pitch, vocal tract spectrum, formant frequencies, duration, and MFCC for extraction of features from speech. TEO-CB-Auto-Env is a non-linear feature extraction method. Analysis is done using the TU-Berlin (Technical University of Berlin) German database. Here e...

  20. An exploration of the potential of Automatic Speech Recognition to assist and enable receptive communication in higher education

    Directory of Open Access Journals (Sweden)

    Mike Wald

    2006-12-01

    Full Text Available The potential use of Automatic Speech Recognition to assist receptive communication is explored. The opportunities and challenges that this technology presents to students and staff are discussed and evaluated: providing captioning of speech online or in classrooms for deaf or hard-of-hearing students, and assisting blind, visually impaired or dyslexic learners to read and search learning material more readily by augmenting synthetic speech with natural recorded real speech. The automatic provision of online lecture notes, synchronised with speech, enables staff and students to focus on learning and teaching issues, while also benefiting learners unable to attend the lecture or who find it difficult or impossible to take notes at the same time as listening, watching and thinking.

  1. Towards an Intelligent Acoustic Front End for Automatic Speech Recognition: Built-in Speaker Normalization

    Directory of Open Access Journals (Sweden)

    Umit H. Yapanel

    2008-08-01

    Full Text Available A proven method for achieving effective automatic speech recognition (ASR) in the presence of speaker differences is to perform acoustic feature speaker normalization. More effective speaker normalization methods are needed which require limited computing resources for real-time performance. The most popular speaker normalization technique is vocal-tract length normalization (VTLN), despite the fact that it is computationally expensive. In this study, we propose a novel online VTLN algorithm entitled built-in speaker normalization (BISN), where normalization is performed on-the-fly within a newly proposed PMVDR acoustic front end. The novel aspect of the algorithm is that conventional front-end processing with PMVDR and VTLN needs two separate warping phases, while the proposed BISN method uses only a single speaker-dependent warp to achieve both the PMVDR perceptual warp and the VTLN warp simultaneously. This improved integration unifies the nonlinear warping performed in the front end and reduces computational requirements, thereby offering advantages for real-time ASR systems. Evaluations are performed for (i) an in-car extended digit recognition task, where an on-the-fly BISN implementation reduces the relative word error rate (WER) by 24%, and (ii) a diverse noisy speech task (SPINE 2), where the relative WER improvement was 9%, both relative to the baseline speaker normalization method.

  2. Towards an Intelligent Acoustic Front End for Automatic Speech Recognition: Built-in Speaker Normalization

    Directory of Open Access Journals (Sweden)

    Yapanel UmitH

    2008-01-01

    Full Text Available A proven method for achieving effective automatic speech recognition (ASR) in the presence of speaker differences is to perform acoustic feature speaker normalization. More effective speaker normalization methods are needed which require limited computing resources for real-time performance. The most popular speaker normalization technique is vocal-tract length normalization (VTLN), despite the fact that it is computationally expensive. In this study, we propose a novel online VTLN algorithm entitled built-in speaker normalization (BISN), where normalization is performed on-the-fly within a newly proposed PMVDR acoustic front end. The novel aspect of the algorithm is that conventional front-end processing with PMVDR and VTLN needs two separate warping phases, while the proposed BISN method uses only a single speaker-dependent warp to achieve both the PMVDR perceptual warp and the VTLN warp simultaneously. This improved integration unifies the nonlinear warping performed in the front end and reduces computational requirements, thereby offering advantages for real-time ASR systems. Evaluations are performed for (i) an in-car extended digit recognition task, where an on-the-fly BISN implementation reduces the relative word error rate (WER) by 24%, and (ii) a diverse noisy speech task (SPINE 2), where the relative WER improvement was 9%, both relative to the baseline speaker normalization method.

  3. Object Type Recognition for Automated Analysis of Protein Subcellular Location

    OpenAIRE

    Zhao, Ting; Velliste, Meel; Boland, Michael V.; Murphy, Robert F.

    2005-01-01

    The new field of location proteomics seeks to provide a comprehensive, objective characterization of the subcellular locations of all proteins expressed in a given cell type. Previous work has demonstrated that automated classifiers can recognize the patterns of all major subcellular organelles and structures in fluorescence microscope images with high accuracy. However, since some proteins may be present in more than one organelle, this paper addresses a more difficult task: recognizing a pa...

  4. Automated Recognition of Algorithmic Patterns in DSP Programs

    OpenAIRE

    Shafiee Sarvestani, Amin

    2011-01-01

    We introduce an extensible knowledge-based tool for idiom (pattern) recognition in DSP (digital signal processing) programs. Our tool utilizes functionality provided by the Cetus compiler infrastructure for detecting certain computation patterns that frequently occur in DSP code. We focus on recognizing patterns for for-loops and statements in their bodies, as these often are the performance-critical constructs in DSP applications for which replacement by highly optimized, target-specific parallel al...

  5. Testing of a Composite Wavelet Filter to Enhance Automated Target Recognition in SONAR

    Science.gov (United States)

    Chiang, Jeffrey N.

    2011-01-01

    Automated Target Recognition (ATR) systems aim to automate target detection, recognition, and tracking. The current project applies a JPL ATR system to low resolution SONAR and camera videos taken from Unmanned Underwater Vehicles (UUVs). These SONAR images are inherently noisy and difficult to interpret, and pictures taken underwater are unreliable due to murkiness and inconsistent lighting. The ATR system breaks target recognition into three stages: 1) Videos of both SONAR and camera footage are broken into frames and preprocessed to enhance images and detect Regions of Interest (ROIs). 2) Features are extracted from these ROIs in preparation for classification. 3) ROIs are classified as true or false positives using a standard Neural Network based on the extracted features. Several preprocessing, feature extraction, and training methods are tested and discussed in this report.

  6. ANALYSIS OF SPEECH UNDER STRESS USING LINEAR TECHNIQUES AND NON-LINEAR TECHNIQUES FOR EMOTION RECOGNITION SYSTEM

    Directory of Open Access Journals (Sweden)

    A. A. Khulage

    2012-07-01

    Full Text Available Analysis of speech for recognition of stress is important for identification of the emotional state of a person. This can be done using 'linear techniques', which use parameters such as pitch, vocal tract spectrum, formant frequencies, duration, and MFCC for extraction of features from speech. TEO-CB-Auto-Env is a non-linear feature extraction method. Analysis is done using the TU-Berlin (Technical University of Berlin) German database. Here emotion recognition is done for different emotions such as neutral, happy, disgust, sad, boredom and anger. Emotion recognition is used in lie detectors, database access systems, and in the military for identifying soldiers' emotional states during war.

  7. A Novel Algorithm for Acoustic and Visual Classifiers Decision Fusion in Audio-Visual Speech Recognition System

    Directory of Open Access Journals (Sweden)

    P.S. Sathidevi

    2010-03-01

    Full Text Available Audio-visual speech recognition (AVSR) using acoustic and visual signals of speech has received attention recently because of its robustness in noisy environments. Perceptual studies also support this approach by emphasizing the importance of visual information for speech recognition in humans. An important issue in decision-fusion-based AVSR systems is how to obtain the appropriate integration weight for the speech modalities, and to ensure that the combined AVSR system performs better than the audio-only and visual-only systems under various noise conditions. To solve this issue, we present a genetic algorithm (GA) based optimization scheme to obtain the appropriate integration weight from the relative reliability of each modality. The performance of the proposed GA-optimized reliability-ratio based weight estimation scheme is demonstrated via single-speaker, isolated word recognition experiments on mobile functions. The results show that the proposed scheme improves robust recognition accuracy over the conventional unimodal systems and the baseline reliability-ratio-based AVSR system under various signal-to-noise ratio conditions.
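
    The reliability-ratio fusion described above can be illustrated with a toy genetic algorithm that searches for the scalar integration weight maximizing fused classification accuracy on held-out data. This is a minimal sketch under assumed inputs (per-class log-likelihood matrices from the acoustic and visual classifiers); the encoding, selection and mutation details are illustrative, not the authors' exact scheme.

```python
import numpy as np

rng = np.random.default_rng(0)

def fused_accuracy(lam, audio_ll, visual_ll, labels):
    """Accuracy of decision fusion with weight lam on the audio stream.

    audio_ll, visual_ll: (n_samples, n_classes) log-likelihood matrices.
    """
    combined = lam * audio_ll + (1.0 - lam) * visual_ll
    return np.mean(np.argmax(combined, axis=1) == labels)

def ga_weight(audio_ll, visual_ll, labels, pop=20, gens=30, mut=0.05):
    """Toy genetic algorithm searching the scalar integration weight."""
    population = rng.random(pop)                     # weights in [0, 1]
    for _ in range(gens):
        fitness = np.array([fused_accuracy(w, audio_ll, visual_ll, labels)
                            for w in population])
        # Select the better half as parents.
        parents = population[np.argsort(fitness)[-pop // 2:]]
        # Crossover: average random parent pairs, then mutate.
        children = (rng.choice(parents, pop // 2) +
                    rng.choice(parents, pop // 2)) / 2.0
        children = np.clip(children + rng.normal(0, mut, pop // 2), 0, 1)
        population = np.concatenate([parents, children])
    return max(population,
               key=lambda w: fused_accuracy(w, audio_ll, visual_ll, labels))
```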

  8. Investigating an Application of Speech-to-Text Recognition: A Study on Visual Attention and Learning Behaviour

    Science.gov (United States)

    Huang, Y-M.; Liu, C-J.; Shadiev, Rustam; Shen, M-H.; Hwang, W-Y.

    2015-01-01

    One major drawback of previous research on speech-to-text recognition (STR) is that most findings showing the effectiveness of STR for learning were based upon subjective evidence. Very few studies have used eye-tracking techniques to investigate visual attention of students on STR-generated text. Furthermore, not much attention was paid to…

  9. Computer-Mediated Input, Output and Feedback in the Development of L2 Word Recognition from Speech

    Science.gov (United States)

    Matthews, Joshua; Cheng, Junyu; O'Toole, John Mitchell

    2015-01-01

    This paper reports on the impact of computer-mediated input, output and feedback on the development of second language (L2) word recognition from speech (WRS). A quasi-experimental pre-test/treatment/post-test research design was used involving three intact tertiary level English as a Second Language (ESL) classes. Classes were either assigned to…

  10. Speech Emotion Recognition Algorithm Based on SVM

    Institute of Scientific and Technical Information of China (English)

    朱菊霞; 吴小培; 吕钊

    2011-01-01

    In order to improve the recognition accuracy of speech emotion recognition systems, a speech emotion recognition algorithm based on SVM is proposed. In the proposed algorithm, parameters extracted from the speech signal, such as energy, pitch frequency and formants, are used as emotional features, and an emotion recognition model is built with the SVM (Support Vector Machine) method. In emotion recognition experiments in a simulation environment, the proposed algorithm improved recognition accuracy by 7.06% and 7.21% relative to the ACON (All Class in One Network, "one to many") and OCON (One Class in One Network, "one to one") artificial neural network methods, respectively. The experimental results show that the SVM-based speech emotion recognition algorithm can recognize emotional speech signals effectively.
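
    A minimal version of such a classifier is easy to assemble with scikit-learn; the sketch below stands in random numbers for real prosodic/spectral features (energy, pitch, formants) and is illustrative only, not the authors' configuration.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

# X: one row per utterance with prosodic/spectral features such as
# energy statistics, pitch (F0) statistics, and formant frequencies;
# y: integer emotion labels. Random data stands in for a real corpus.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 12))
y = rng.integers(0, 4, size=200)

clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=10.0, gamma="scale"))
scores = cross_val_score(clf, X, y, cv=5)
print("cross-validated accuracy: %.3f" % scores.mean())
```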

  11. Robust Speaker Recognition against Synthetic Speech

    Institute of Scientific and Technical Information of China (English)

    陈联武; 郭武; 戴礼荣

    2011-01-01

    With the development of hidden Markov model (HMM) based speech synthesis technology, it is easy for impostors to produce synthetic speech with a target speaker's characteristics, which poses an enormous threat to existing speaker recognition systems. To address this problem, this paper analyzes the difference between natural and synthetic speech on the real part of the cepstrum from a statistical point of view, and proposes a speaker recognition system that is robust against synthetic speech. Experimental results preliminarily demonstrate that, compared with a traditional speaker recognition system, the false accept rate (FAR) for synthetic speech drops from 99.2% to 0% in the proposed system, while the equal error rate (EER) for natural speech remains unchanged.
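
    The real cepstrum on which the natural/synthetic comparison is made is simply the inverse Fourier transform of the log magnitude spectrum; below is a sketch of computing it per frame, plus one plausible (assumed, not necessarily the paper's) discriminating statistic.

```python
import numpy as np

def real_cepstrum(frame, n_fft=512):
    """Real cepstrum of one speech frame: IFFT of the log magnitude spectrum."""
    spectrum = np.fft.rfft(frame * np.hanning(len(frame)), n=n_fft)
    log_mag = np.log(np.abs(spectrum) + 1e-12)   # avoid log(0)
    return np.fft.irfft(log_mag, n=n_fft)

def cepstral_variance_statistic(frames):
    """Per-bin variance of the real cepstrum across frames.

    HMM-synthesized speech tends to be over-smoothed, so reduced variance
    in the cepstral bins is one plausible discriminating statistic (an
    illustrative choice, not necessarily the statistic used in the paper).
    """
    ceps = np.array([real_cepstrum(f) for f in frames])
    return ceps.var(axis=0)
```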

  12. Feature Extraction and Selection Strategies for Automated Target Recognition

    Science.gov (United States)

    Greene, W. Nicholas; Zhang, Yuhan; Lu, Thomas T.; Chao, Tien-Hsin

    2010-01-01

    Several feature extraction and selection methods for an existing automatic target recognition (ATR) system using JPL's Grayscale Optical Correlator (GOC) and Optimal Trade-Off Maximum Average Correlation Height (OT-MACH) filter were tested using MATLAB. The ATR system is composed of three stages: a cursory region-of-interest (ROI) search using the GOC and OT-MACH filter, a feature extraction and selection stage, and a final classification stage. Feature extraction and selection concern transforming potential target data into more useful forms as well as selecting important subsets of that data which may aid in detection and classification. The strategies tested were built around two popular extraction methods: Principal Component Analysis (PCA) and Independent Component Analysis (ICA). Performance was measured based on the classification accuracy and free-response receiver operating characteristic (FROC) output of a support vector machine (SVM) and a neural net (NN) classifier.

  13. Improving the Syllable-Synchronous Network Search Algorithm for Word Decoding in Continuous Chinese Speech Recognition

    Institute of Scientific and Technical Information of China (English)

    郑方; 武健; 宋战江

    2000-01-01

    The previously proposed syllable-synchronous network search (SSNS) algorithm plays a very important role in word decoding for continuous Chinese speech recognition and achieves satisfactory performance. Several related key factors that may affect the overall word decoding performance are carefully studied in this paper, including the perfecting of the vocabulary, the big-discount Turing re-estimation of the N-gram probabilities, and the management of the search path buffers. Based on these discussions, corresponding approaches to improving the SSNS algorithm are proposed. Compared with the previous version of the SSNS algorithm, the new version decreases the Chinese character error rate (CCER) in word decoding by 42.1% across a database consisting of a large number of test sentences (syllable strings).
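
    The "big-discount Turing re-estimation" mentioned above belongs to the family of Good-Turing count discounting for N-gram probabilities. A minimal sketch of the classic Good-Turing re-estimate follows; the "big-discount" variant presumably modifies these adjusted counts, with details not given in the record.

```python
from collections import Counter

def good_turing_counts(ngram_counts):
    """Classic Good-Turing re-estimated counts r* = (r + 1) * N_{r+1} / N_r.

    ngram_counts: mapping from an n-gram to its observed count r.
    Returns adjusted counts (falling back to r where N_{r+1} = 0).
    """
    freq_of_freq = Counter(ngram_counts.values())     # N_r
    adjusted = {}
    for gram, r in ngram_counts.items():
        n_r, n_r1 = freq_of_freq[r], freq_of_freq.get(r + 1, 0)
        adjusted[gram] = (r + 1) * n_r1 / n_r if n_r1 else r
    return adjusted

counts = Counter([("ni", "hao"), ("ni", "hao"), ("hao", "ma")])
print(good_turing_counts(counts))
```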

  14. Hindi Digits Recognition System on Speech Data Collected in Different Natural Noise Environments

    Directory of Open Access Journals (Sweden)

    Babita Saxena

    2015-02-01

    Full Text Available This paper presents a baseline digit speech recognizer for the Hindi language. The recording environment differs across speakers, since the data were collected in their respective homes. The differing environments involve vehicle horn noise in some road-facing rooms, internal background noise in some rooms such as opening doors, silence in other rooms, etc. All these recordings are used for training the acoustic model. The acoustic model is trained on 8 speakers' audio data, and the vocabulary size of the recognizer is 10 words. The HTK toolkit is used for building the acoustic model and evaluating the recognition rate of the recognizer. The efficiency of the recognizer developed on the recorded data is shown at the end of the paper, and possible directions for future research are suggested.

  15. Exploiting independent filter bandwidth of human factor cepstral coefficients in automatic speech recognition

    Science.gov (United States)

    Skowronski, Mark D.; Harris, John G.

    2004-09-01

    Mel frequency cepstral coefficients (MFCC) are the most widely used speech features in automatic speech recognition systems, primarily because the coefficients fit well with the assumptions used in hidden Markov models and because of the superior noise robustness of MFCC over alternative feature sets such as linear prediction-based coefficients. The authors have recently introduced human factor cepstral coefficients (HFCC), a modification of MFCC that uses the known relationship between center frequency and critical bandwidth from human psychoacoustics to decouple filter bandwidth from filter spacing. In this work, the authors introduce a variation of HFCC called HFCC-E in which filter bandwidth is linearly scaled in order to investigate the effects of wider filter bandwidth on noise robustness. Experimental results show an increase in signal-to-noise ratio of 7 dB over traditional MFCC algorithms when filter bandwidth increases in HFCC-E. An important attribute of both HFCC and HFCC-E is that the algorithms only differ from MFCC in the filter bank coefficients: increased noise robustness using wider filters is achieved with no additional computational cost.
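
    The decoupling of filter bandwidth from filter spacing that distinguishes HFCC can be sketched directly: centers stay mel-spaced as in MFCC, but each triangle's width comes from the equivalent rectangular bandwidth (ERB) at its center frequency, scaled by a linear E-factor as in HFCC-E. The constants below are illustrative, not the published design.

```python
import numpy as np

def erb(f_hz):
    """Moore-Glasberg equivalent rectangular bandwidth (Hz) at center f_hz."""
    return 24.7 * (4.37 * f_hz / 1000.0 + 1.0)

def hfcc_filterbank(n_filters=24, n_fft=512, sr=16000, e_factor=1.0):
    """Triangular filters: mel-spaced centers, ERB-based widths.

    e_factor linearly scales bandwidth independently of spacing, which is
    the HFCC-E idea; the exact constants here are illustrative.
    """
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    imel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    centers = imel(np.linspace(mel(100.0), mel(sr / 2 - 500.0), n_filters))
    freqs = np.linspace(0.0, sr / 2.0, n_fft // 2 + 1)

    bank = np.zeros((n_filters, freqs.size))
    for i, fc in enumerate(centers):
        half = e_factor * erb(fc)              # half-width scaled by E-factor
        lo, hi = fc - half, fc + half
        rise = (freqs - lo) / (fc - lo)
        fall = (hi - freqs) / (hi - fc)
        bank[i] = np.clip(np.minimum(rise, fall), 0.0, None)
    return bank
```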

  16. An Additive and Convolutive Bias Compensation Algorithm for Telephone Speech Recognition

    Institute of Scientific and Technical Information of China (English)

    HAN Zhao-Bing; ZHANG Shu-Wu; XU Bo; HUANG Tai-Yi

    2004-01-01

    A vector piecewise polynomial (VPP) approximation algorithm is proposed for environment compensation of speech signals degraded by both additive and convolutive noises. By investigating the model of the telephone environment, we propose a piecewise polynomial, namely two linear polynomials and a quadratic polynomial, to approximate the environment function precisely. The VPP is applied either to stationary noise or to non-stationary noise. In the first case, batch EM is used in the log-spectral domain; in the second case, recursive EM with iterative stochastic approximation is developed in the cepstral domain. Both approaches are based on the minimum mean squared error (MMSE) sense. Experimental results are presented on the application of this approach to improving the performance of Mandarin large vocabulary continuous speech recognition (LVCSR) degraded by background noises and different transmission channels (such as fixed telephone lines and GSM). The method can reduce the average character error rate (CER) by about 18%.

  17. Influence of tinnitus on the percentage index of speech recognition in patients with normal hearing

    Directory of Open Access Journals (Sweden)

    Urnau, Daila

    2010-12-01

    Full Text Available Introduction: The understanding of speech is one of the most important measurable aspects of human auditory function. Tinnitus affects quality of life and impairs communication. Objective: To investigate possible changes in the Percentage Index of Speech Recognition (SDT) in individuals with tinnitus and normal hearing, and to examine the relationship between tinnitus, gender and age. Methods: A retrospective study analyzing the records of 82 individuals of both genders, aged 21-70 years, totaling 128 ears with normal hearing. The ears were analyzed separately and divided into a control group, without complaints of tinnitus, and a study group, with complaints of tinnitus. The influence of the variables gender and age group and of tinnitus on the SDT was examined. A score of 100% correct was considered normal, and values between 88% and 96% were considered altered. These criteria were adopted since percentages below 88% correct are found in individuals with sensorineural hearing loss. Results: There was no statistically significant relationship between age and tinnitus or between tinnitus and the SDT, only between gender and tinnitus. There was a predominance of tinnitus in females (56%), a higher incidence of tinnitus in the 31-40 years age group (41.67%) and a lower incidence at 41-50 years (18.75%), and on the SDT there was a greater percentage of altered results in individuals with tinnitus (61.11%). Conclusion: Tinnitus does not interfere with the SDT, and there is no relationship between tinnitus and age, only between tinnitus and gender.

  18. A PRELIMINARY APPROACH FOR THE AUTOMATED RECOGNITION OF MALIGNANT MELANOMA

    Directory of Open Access Journals (Sweden)

    Ezzeddine Zagrouba

    2011-05-01

    Full Text Available In this work, we are motivated by the desire to classify skin lesions as malignant or benign from color photographic slides of the lesions. Thus, we use color images of skin lesions, image processing techniques and an artificial neural network classifier to distinguish melanoma from benign pigmented lesions. As the first step of the data set analysis, a preprocessing sequence is implemented to remove noise and undesired structures from the color image. Second, an automated segmentation approach localizes suspicious lesion regions by region growing after a preliminary step based on fuzzy sets. Then, we rely on quantitative image analysis to measure a series of candidate attributes hoped to contain enough information to differentiate melanomas from benign lesions. At last, the selected features are supplied to an artificial neural network for classification of the tumor lesion as malignant or benign. For a preliminary balanced training/testing set, our approach is able to obtain 79.1% correct classification of malignant and benign lesions on real skin lesion images.

  19. Automated recognition of forest patterns using aerial photographs

    Science.gov (United States)

    Barbezat, Vincent; Kreiss, Philippe; Sulzmann, Armin; Jacot, Jacques

    1996-12-01

    In Switzerland, aerial photos are indispensable tools for research into ecosystems and their management. Every six years since 1950, the whole of Switzerland has been systematically surveyed by aerial photos. In the forestry field, these documents not only provide invaluable information but also give support to field activities such as the drawing up of tree population maps, intervention planning, precise positioning of the upper forest limit, evaluation of forest damage and rates of tree growth. Up to now, the analysis of aerial photos has been carried out by specialists who painstakingly examine every photograph, which makes it a very long, exacting and expensive job. The IMT-DMT of the EPFL and Antenne romande of FNP, aware of the special interest involved and the necessity of automated classification of aerial photos, have pooled their resources to develop a software program capable of differentiating between single trees, copses and dense forests. The developed algorithms detect the crowns of the trees and the surface of their orthogonal projection. From the shadow of each tree they calculate its height. They also determine the position of each tree in the Swiss national coordinate system thanks to the implementation of a numeric altitude model. For the future, we have the prospect of many new and better uses of aerial photos being available to us, particularly where isolated stands are concerned and also when evolutions based on a diachronic series of photos have to be assessed: from timberline monitoring in the research on global change to the exploitation of wooded pastures on small surface areas.
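
    The height-from-shadow step admits a one-line formula: on level ground, a tree of height h casts a shadow of length L = h / tan(alpha), where alpha is the sun's elevation angle at the time of the photograph, so h = L * tan(alpha). A minimal sketch, assuming level terrain (the system additionally corrects positions with the altitude model):

```python
import math

def tree_height_from_shadow(shadow_len_m, sun_elevation_deg):
    """Height of a tree from its shadow length, assuming level ground.

    h = L * tan(alpha); alpha is recoverable from the date, time and
    latitude recorded with the aerial photograph.
    """
    return shadow_len_m * math.tan(math.radians(sun_elevation_deg))

print(tree_height_from_shadow(12.0, 40.0))  # ~10.1 m
```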

  20. Robust Automatic Speech Recognition Features using Complex Wavelet Packet Transform Coefficients

    Directory of Open Access Journals (Sweden)

    TjongWan Sen

    2009-11-01

    Full Text Available To improve the performance of phoneme-based Automatic Speech Recognition (ASR) in noisy environments, we developed a new technique that adds robustness to clean phoneme features. These robust features are obtained from Complex Wavelet Packet Transform (CWPT) coefficients. Since the CWPT coefficients represent all the different frequency bands of the input signal, decomposing the input signal into a complete CWPT tree also covers all frequencies involved in the recognition process. For time-overlapping signals with different frequency contents, e.g., a phoneme signal with noise, the CWPT coefficients are the combination of the CWPT coefficients of the phoneme signal and the CWPT coefficients of the noise. The CWPT coefficients of the phoneme signal are changed according to the frequency components contained in the noise. Since the number of phonemes in every language is relatively small (limited) and already well known, one can easily derive principal component vectors from a clean training dataset using Principal Component Analysis (PCA). These principal component vectors can then be used to add robustness and minimize noise effects in the testing phase. Simulation results, using Alpha Numeric 4 (AN4) from Carnegie Mellon University and NOISEX-92 examples from Rice University, showed that this new technique can be used as a feature extractor that improves the robustness of phoneme-based ASR systems in various adverse noisy conditions while still preserving performance in clean environments.
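
    The robustness step described here, deriving principal components from clean training features and using them to suppress noise effects at test time, can be sketched as subspace projection. The dimensions and data below are illustrative, not the authors' CWPT front end.

```python
import numpy as np
from sklearn.decomposition import PCA

# Fit principal components on clean training feature vectors (standing in
# for CWPT coefficients of clean phonemes), then project noisy test
# vectors onto that subspace and reconstruct: simple subspace denoising.
rng = np.random.default_rng(0)
clean = rng.normal(size=(500, 64))            # stand-in for clean features
noisy = clean[:100] + 0.5 * rng.normal(size=(100, 64))

pca = PCA(n_components=16).fit(clean)
denoised = pca.inverse_transform(pca.transform(noisy))
print("residual noise energy:", np.mean((denoised - clean[:100]) ** 2))
```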

  1. A Robust Method for Speech Emotion Recognition Based on Infinite Student’s t-Mixture Model

    Directory of Open Access Journals (Sweden)

    Xinran Zhang

    2015-01-01

    Full Text Available The speech emotion classification method proposed in this paper is based on a Student's t-mixture model with an infinite component number (iSMM) and can directly conduct effective recognition for various kinds of speech emotion samples. Compared with the traditional GMM (Gaussian mixture model), a speech emotion model based on a Student's t-mixture can effectively handle speech sample outliers that exist in the emotion feature space. Moreover, the t-mixture model remains robust to atypical emotion test data. To address the high data complexity caused by the high-dimensional space and the problem of insufficient training samples, a global latent space is added to the emotion model. Such an approach allows an unbounded number of components and forms an iSMM emotion model, which can automatically determine the best number of components with lower complexity to classify various kinds of emotion data. In experiments conducted on one spontaneous (FAU Aibo Emotion Corpus) and two acted (DES and EMO-DB) universal speech emotion databases, which have high-dimensional feature samples and diverse data distributions, the iSMM maintains better recognition performance than the comparison methods. Thus, its effectiveness and generalization to high-dimensional data and outliers are verified, and the iSMM emotion model is confirmed as a robust method that remains valid for, and generalizes to, outliers and high-dimensional emotion features.

  2. On the Use of Evolutionary Algorithms to Improve the Robustness of Continuous Speech Recognition Systems in Adverse Conditions

    Directory of Open Access Journals (Sweden)

    Sid-Ahmed Selouani

    2003-07-01

    Full Text Available Limiting the decrease in performance due to acoustic environment changes remains a major challenge for continuous speech recognition (CSR) systems. We propose a novel approach which combines the Karhunen-Loève transform (KLT) in the mel-frequency domain with a genetic algorithm (GA) to enhance the data representing corrupted speech. The idea consists of projecting noisy speech parameters onto the space generated by the genetically optimized principal axes issued from the KLT. The enhanced parameters increase the recognition rate for highly interfering noise environments. The proposed hybrid technique, when included in the front end of an HTK-based CSR system, outperforms the conventional recognition process in severe interfering car noise environments for a wide range of signal-to-noise ratios (SNRs) varying from 16 dB to −4 dB. We also showed the effectiveness of the KLT-GA method in recognizing speech subject to telephone channel degradations.

  3. On the Use of Evolutionary Algorithms to Improve the Robustness of Continuous Speech Recognition Systems in Adverse Conditions

    Science.gov (United States)

    Selouani, Sid-Ahmed; O'Shaughnessy, Douglas

    2003-12-01

    Limiting the decrease in performance due to acoustic environment changes remains a major challenge for continuous speech recognition (CSR) systems. We propose a novel approach which combines the Karhunen-Loève transform (KLT) in the mel-frequency domain with a genetic algorithm (GA) to enhance the data representing corrupted speech. The idea consists of projecting noisy speech parameters onto the space generated by the genetically optimized principal axes issued from the KLT. The enhanced parameters increase the recognition rate for highly interfering noise environments. The proposed hybrid technique, when included in the front end of an HTK-based CSR system, outperforms the conventional recognition process in severe interfering car noise environments for a wide range of signal-to-noise ratios (SNRs) varying from 16 dB to −4 dB. We also showed the effectiveness of the KLT-GA method in recognizing speech subject to telephone channel degradations.

  4. Automated Three-Dimensional Microbial Sensing and Recognition Using Digital Holography and Statistical Sampling

    Directory of Open Access Journals (Sweden)

    Inkyu Moon

    2010-09-01

    Full Text Available We overview an approach to providing automated three-dimensional (3D) sensing and recognition of biological micro/nanoorganisms integrating Gabor digital holographic microscopy and statistical sampling methods. For 3D data acquisition of biological specimens, a coherent beam propagates through the specimen, and its transversely and longitudinally magnified diffraction pattern observed by the microscope objective is optically recorded with an image sensor array interfaced with a computer. 3D visualization of the biological specimen from the magnified diffraction pattern is accomplished by using the computational Fresnel propagation algorithm. For 3D recognition of the biological specimen, a watershed image segmentation algorithm is applied to automatically remove the unnecessary background parts in the reconstructed holographic image. Statistical estimation and inference algorithms are applied to the automatically segmented holographic image. Overviews of preliminary experimental results illustrate how the holographic image reconstructed from the Gabor digital hologram of a biological specimen contains important information for microbial recognition.

  5. Modern prescription theory and application: realistic expectations for speech recognition with hearing aids.

    Science.gov (United States)

    Johnson, Earl E

    2013-01-01

    A major decision at the time of hearing aid fitting and dispensing is the amount of amplification to provide listeners (both adult and pediatric populations) for the appropriate compensation of sensorineural hearing impairment across a range of frequencies (e.g., 160-10000 Hz) and input levels (e.g., 50-75 dB sound pressure level). This article describes modern prescription theory for hearing aids within the context of a risk versus return trade-off and efficient frontier analyses. The expected return of amplification recommendations (i.e., generic prescriptions such as National Acoustic Laboratories-Non-Linear 2, NAL-NL2, and Desired Sensation Level Multiple Input/Output, DSL m[i/o]) for the Speech Intelligibility Index (SII) and high-frequency audibility were traded against a potential risk (i.e., loudness). The modeled performance of each prescription was compared one with another and with the efficient frontier of normal hearing sensitivity (i.e., a reference point for the most return with the least risk). For the pediatric population, NAL-NL2 was more efficient for SII, while DSL m[i/o] was more efficient for high-frequency audibility. For the adult population, NAL-NL2 was more efficient for SII, while the two prescriptions were similar with regard to high-frequency audibility. In terms of absolute return (i.e., not considering the risk of loudness), however, DSL m[i/o] prescribed more outright high-frequency audibility than NAL-NL2 for either aged population, particularly, as hearing loss increased. Given the principles and demonstrated accuracy of desensitization (reduced utility of audibility with increasing hearing loss) observed at the group level, additional high-frequency audibility beyond that of NAL-NL2 is not expected to make further contributions to speech intelligibility (recognition) for the average listener. PMID:24253361

  6. Development and Evaluation of a Speech Recognition Test for Persian Speaking Adults

    Directory of Open Access Journals (Sweden)

    Mohammad Mosleh

    2001-05-01

    Full Text Available Method and Materials: This research was carried out to develop and evaluate 25 phonemically balanced word lists for Persian-speaking adults in two separate stages: development and evaluation. In the first stage, in order to balance the lists phonemically, the frequency of occurrence of each of the 29 phonemes (6 vowels and 23 consonants) of the Persian language in adult speech was determined. This section showed some significant differences between some phonemes' frequencies. Then, all Persian monosyllabic words were extracted from the Mo'in Persian dictionary. Semantically difficult words were rejected and appropriate words were chosen according to the judgment of 5 adult native speakers of Persian with a high school diploma. Twelve open-set 25-word lists were prepared. The lists were recorded on magnetic tape in an audio studio by a professional speaker of IRIB. In the second stage, in order to evaluate the test's validity and reliability, 60 normal-hearing adults (30 male, 30 female) were randomly selected and evaluated in test and retest sessions. Findings: 1- Normal-hearing adults obtained scores of 92-100 for each list at their MCL through test-retest. 2- No significant difference was observed (a) in test-retest scores for each list (P>0.05), (b) between the lists in test or retest scores (P>0.05), or (c) between sexes (P>0.05). Conclusion: This test is reliable and valid; the lists are phonemically balanced and equal in difficulty, and are valuable for the evaluation of Persian-speaking adults' speech recognition.

  7. Key Technologies in Speech Emotion Recognition

    Institute of Scientific and Technical Information of China (English)

    张雪英; 孙颖; 张卫; 畅江

    2015-01-01

    Emotional information in speech signals is an important information resource, and speech emotion recognition that relies solely on mathematical model building and computation has proven insufficient. Emotion is a perceptual state, expressed through physiological and psychological changes triggered by external stimuli; combining cognitive psychology with speech signal processing therefore helps to process emotional speech better. This paper first introduces the relevance of speech emotion to human cognition and summarizes recent progress and research results in the field, including the construction of emotion databases, the extraction of emotional features, and emotion recognition networks. It then introduces the application of fuzzy cognitive map networks, built on cognitive psychology, to emotional speech recognition. In addition, the mechanism by which the human brain perceives emotional speech is explored, and an attempt is made to integrate event-related potentials into speech emotion recognition in order to improve recognition accuracy. Finally, ideas and prospects for the future cross-fertilization of speech emotion recognition and cognitive psychology are presented.

  8. Language and Speech Processing

    CERN Document Server

    Mariani, Joseph

    2008-01-01

    Speech processing addresses various scientific and technological areas. It includes speech analysis and variable rate coding, in order to store or transmit speech. It also covers speech synthesis, especially from text, speech recognition, including speaker and language identification, and spoken language understanding. This book covers the following topics: how to realize speech production and perception systems, and how to synthesize and understand speech using state-of-the-art methods in signal processing, pattern recognition, stochastic modelling, computational linguistics and human factor studies.

  9. Research of Speech Recognition System Based on Matlab

    Institute of Scientific and Technical Information of China (English)

    王彪

    2011-01-01

    A speech recognition system based on Matlab software is designed; its main functions are recording, playback, preprocessing, segmented filtering, feature extraction, and recognition of voice signals. Experiments verify that this system meets the requirements for recognizing simple speech, but some aspects still need improvement, such as whether relatively complex speech can be recognized in complex environments.

  10. Automatic Speech Recognition and Training for Severely Dysarthric Users of Assistive Technology: The STARDUST Project

    Science.gov (United States)

    Parker, Mark; Cunningham, Stuart; Enderby, Pam; Hawley, Mark; Green, Phil

    2006-01-01

    The STARDUST project developed robust computer speech recognizers for use by eight people with severe dysarthria and concomitant physical disability to access assistive technologies. Independent computer speech recognizers trained with normal speech are of limited functional use by those with severe dysarthria due to limited and inconsistent…

  11. Managing predefined templates and macros for a departmental speech recognition system using common software.

    Science.gov (United States)

    Sistrom, C L; Honeyman, J C; Mancuso, A; Quisling, R G

    2001-09-01

    The authors have developed a networked database system to create, store, and manage predefined radiology report definitions. This was prompted by a complete departmental conversion to a computer speech recognition system (SRS) for clinical reporting. The software complements and extends the capabilities of the SRS, and the two systems are integrated by means of a simple text file format and import/export functions within each program. This report describes the functional requirements, design considerations, and implementation details of the structured report management software. The database and its interface are designed to allow all radiologists and division managers to define and update template structures relevant to their practice areas. Two key conceptual extensions supported by the template management system are the addition of a template type construct and allowing individual radiologists to dynamically share common organ-system- or modality-specific templates. In addition, the template manager software enables specifying predefined report structures that can be triggered at the time of dictation from printed lists of barcodes. Initial experience using the program in a regional, multisite, academic radiology practice has been positive. PMID:11720335

  12. Authenticity affects the recognition of emotions in speech: behavioral and fMRI evidence.

    Science.gov (United States)

    Drolet, Matthis; Schubotz, Ricarda I; Fischer, Julia

    2012-03-01

    The aim of the present study was to determine how authenticity of emotion expression in speech modulates activity in the neuronal substrates involved in emotion recognition. Within an fMRI paradigm, participants judged either the authenticity (authentic or play acted) or emotional content (anger, fear, joy, or sadness) of recordings of spontaneous emotions and reenactments by professional actors. When contrasting between task types, active judgment of authenticity, more than active judgment of emotion, indicated potential involvement of the theory of mind (ToM) network (medial prefrontal cortex, temporoparietal cortex, retrosplenium) as well as areas involved in working memory and decision making (BA 47). Subsequently, trials with authentic recordings were contrasted with those of reenactments to determine the modulatory effects of authenticity. Authentic recordings were found to enhance activity in part of the ToM network (medial prefrontal cortex). This effect of authenticity suggests that individuals integrate recollections of their own experiences more for judgments involving authentic stimuli than for those involving play-acted stimuli. The behavioral and functional results show that authenticity of emotional prosody is an important property influencing human responses to such stimuli, with implications for studies using play-acted emotions.

  13. Development of a two wheeled self balancing robot with speech recognition and navigation algorithm

    Science.gov (United States)

    Rahman, Md. Muhaimin; Ashik-E-Rasul, Haq, Nowab. Md. Aminul; Hassan, Mehedi; Hasib, Irfan Mohammad Al; Hassan, K. M. Rafidh

    2016-07-01

    This paper discusses the modeling, construction and development of the navigation algorithm of a two-wheeled self-balancing mobile robot in an enclosure. We discuss the design of two of the main controller algorithms, namely PID algorithms, on the robot model. Simulation is performed in the SIMULINK environment. The controller is developed primarily for self-balancing of the robot and also for its positioning. As for navigation in an enclosure, a template matching algorithm is proposed for precise measurement of the robot position. The navigation system needs to be calibrated before the navigation process starts. Almost all of the earlier template matching algorithms that can be found in the open literature can only trace the robot, but the proposed algorithm can also locate the position of other objects in an enclosure, like furniture, tables, etc. This enables the robot to know the exact location of every stationary object in the enclosure. Moreover, some additional features, such as speech recognition and object detection, are added. For object detection, the single-board computer Raspberry Pi is used. The system is programmed to analyze images captured via the camera, which are then processed through background subtraction, followed by active noise reduction.
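
    A textbook discrete PID loop of the kind used for self-balancing is short enough to show in full; the gains and time step below are placeholders to be tuned on the actual robot, not values from the paper.

```python
class PID:
    """Textbook discrete PID controller, as commonly used for balancing."""

    def __init__(self, kp, ki, kd, dt):
        self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
        self.integral = 0.0
        self.prev_error = 0.0

    def update(self, setpoint, measurement):
        error = setpoint - measurement
        self.integral += error * self.dt
        derivative = (error - self.prev_error) / self.dt
        self.prev_error = error
        return (self.kp * error + self.ki * self.integral
                + self.kd * derivative)

# Drive the tilt angle toward zero (upright); gains are illustrative only.
controller = PID(kp=40.0, ki=1.0, kd=2.0, dt=0.01)
motor_cmd = controller.update(setpoint=0.0, measurement=0.05)
```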

  14. A Comparative Study: Gammachirp Wavelets and Auditory Filter Using Prosodic Features of Speech Recognition In Noisy Environment

    Directory of Open Access Journals (Sweden)

    Hajer Rahali

    2014-04-01

    Full Text Available Modern automatic speech recognition (ASR) systems typically use a bank of linear filters as the first step in performing frequency analysis of speech. On the other hand, the cochlea, which is responsible for frequency analysis in the human auditory system, is known to have a compressive non-linear frequency response which depends on input stimulus level. This paper presents a new method based on the gammachirp auditory filter and a continuous wavelet analysis. The essential characteristic of this model is that it performs the analysis by wavelet packet transformation on frequency bands that approximate the critical bands of the ear, which differs from the existing model based on short-term Fourier transform (STFT) analysis. Prosodic features such as pitch, formant frequency, jitter and shimmer are extracted from the fundamental frequency contour and added to baseline spectral features, specifically Mel Frequency Cepstral Coefficients (MFCC) for human speech, Gammachirp Filterbank Cepstral Coefficients (GFCC) and Gammachirp Wavelet Frequency Cepstral Coefficients (GWFCC). The results show that the gammachirp wavelet gives results comparable to those obtained with MFCC and GFCC. Experimental results show the best performance of this architecture. This paper implements the GW and examines its application to a specific example of speech. Implications for noise-robust speech analysis are also discussed within the AURORA databases.

  15. Noise robust automatic speech recognition with adaptive quantile based noise estimation and speech band emphasizing filter bank

    DEFF Research Database (Denmark)

    Bonde, Casper Stork; Graversen, Carina; Gregersen, Andreas Gregers;

    2005-01-01

    to the appearance of the speech signal, which requires noise-robust voice activity detection and assumptions of stationary noise. However, both of these requirements are often not met, and it is therefore of particular interest to investigate methods like the Quantile Based Noise Estimation (QBNE) method which...
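
    The idea behind QBNE is easy to sketch: because speech energy is absent in any given frequency bin for much of the time, a fixed quantile of each bin's power trajectory tracks the noise floor without voice activity detection or a stationarity assumption. A minimal sketch follows; the quantile value and any smoothing are assumptions, not the paper's settings.

```python
import numpy as np

def qbne(power_spectrogram, q=0.5):
    """Quantile-based noise estimate: per-frequency quantile over time.

    power_spectrogram: array of shape (n_frames, n_bins). Because speech
    is absent in each bin for a good fraction of frames, a low-to-middle
    quantile of each bin's trajectory approximates the noise floor.
    """
    return np.quantile(power_spectrogram, q, axis=0)

# Usage: subtract (or Wiener-filter with) the estimate per frame, e.g.
# noise_psd = qbne(np.abs(stft) ** 2, q=0.5)
```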

  16. Automated Gesturing for Virtual Characters: Speech-driven and Text-driven Approaches

    Directory of Open Access Journals (Sweden)

    Goranka Zoric

    2006-04-01

    Full Text Available We present two methods for automatic facial gesturing of graphically embodied animated agents. In the first, a conversational agent is driven by speech in an automatic lip-sync process: by analyzing the speech input, lip movements are determined from the speech signal. The second method provides a virtual speaker capable of reading plain English text and rendering it as speech accompanied by appropriate facial gestures. The proposed statistical model for generating the virtual speaker's facial gestures can also be applied as an addition to the lip synchronization process in order to obtain speech-driven facial gesturing. In this case the statistical model is triggered by the input speech prosody instead of lexical analysis of the input text.

  17. Speech recognition in reverberant and noisy environments employing multiple feature extractors and i-vector speaker adaptation

    Science.gov (United States)

    Alam, Md Jahangir; Gupta, Vishwa; Kenny, Patrick; Dumouchel, Pierre

    2015-12-01

    The REVERB challenge provides a common framework for the evaluation of feature extraction techniques in the presence of both reverberation and additive background noise. State-of-the-art speech recognition systems perform well in controlled environments, but their performance degrades in realistic acoustical conditions, especially in real as well as simulated reverberant environments. In this contribution, we utilize multiple feature extractors including the conventional mel-filterbank, multi-taper spectrum estimation-based mel-filterbank, robust mel and compressive gammachirp filterbank, iterative deconvolution-based dereverberated mel-filterbank, and maximum likelihood inverse filtering-based dereverberated mel-frequency cepstral coefficient features for speech recognition with multi-condition training data. In order to improve speech recognition performance, we combine their results using ROVER (Recognizer Output Voting Error Reduction). For the two- and eight-channel tasks, to benefit from the multi-channel data, we also use ROVER, instead of a multi-microphone signal processing method, to reduce the word error rate by selecting the best scoring word at each channel. As in previous work, we also apply i-vector-based speaker adaptation, which was found effective. In a speech recognition task, speaker adaptation tries to reduce the mismatch between the training and test speakers. Speech recognition experiments are conducted on the REVERB challenge 2014 corpora using the Kaldi recognizer. In our experiments, we use both utterance-based batch processing and full batch processing. In the single-channel task, full batch processing reduced the word error rate (WER) from 10.0 to 9.3% on SimData as compared to utterance-based batch processing. Using full batch processing, we obtained an average WER of 9.0 and 23.4% on the SimData and RealData, respectively, for the two-channel task, whereas for the eight-channel task on the SimData and RealData, the average WERs found were 8...

  18. The Long Road to Automation: Neurocognitive Development of Letter-Speech Sound Processing

    Science.gov (United States)

    Froyen, Dries J. W.; Bonte, Milene L.; van Atteveldt, Nienke; Blomert, Leo

    2009-01-01

    In transparent alphabetic languages, the expected standard for complete acquisition of letter-speech sound associations is within one year of reading instruction. The neural mechanisms underlying the acquisition of letter-speech sound associations have, however, hardly been investigated. The present article describes an ERP study with beginner and…

  19. Speech Recognition Performance in Children with Cochlear Implants Using Bimodal Stimulation

    OpenAIRE

    Rathna Kumar, S. B.; Mohanty, P.; Prakash, S. G. R.

    2010-01-01

    Cochlear implantees have considerably good speech understanding abilities in quiet surroundings. But, ambient noise poses significant difficulties in understanding speech for these individuals. Bimodal stimulation is still not used by many Indian implantees in spite of reports that bimodal stimulation is beneficial for speech understanding in noise as compared to cochlear implant alone and also prevents auditory deprivation in the un-implanted ear. The aim of the study is to evaluate the bene...

  20. Design of an Automated Secure Garage System Using License Plate Recognition Technique

    Directory of Open Access Journals (Sweden)

    Afaz Uddin Ahmed

    2014-01-01

    Full Text Available Modern technologies have reached our garages to secure cars and residence entrances, driven by the demand for high security and automated infrastructure. The concept of intelligent secure garage systems in modern transport management is a remarkable example of computer-interfaced controlling devices. The License Plate Recognition (LPR) process is one of the key elements of modern intelligent garage security setups. This paper presents the design of an automated secure garage system featuring the LPR process. A template-matching approach using Optical Character Recognition (OCR) is implemented to carry out the LPR method. We also developed a prototype of the secured garage system to verify the application for local use. The system allows only predefined, enlisted cars or vehicles to enter the garage while blocking all others, along with a central alarm feature. Moreover, the system maintains an updated database of the cars that have entered and left the garage within a particular duration. Vehicles are distinguished by the system mainly based on the registration numbers on their license plates. The approach was tested on several samples of license plate images in both indoor and outdoor settings.
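
    The template-matching OCR at the heart of such an LPR stage can be sketched with OpenCV's matchTemplate; the glyph templates, the 0.8 confidence threshold, and the omission of non-maximum suppression are simplifications for illustration, not the authors' implementation.

```python
import cv2
import numpy as np

def read_plate(plate_gray, templates):
    """Toy template-matching OCR over a segmented license plate.

    plate_gray: grayscale plate image; templates: dict mapping a character
    to its glyph image (same scale as the plate characters). Characters
    are located with cv2.matchTemplate and read left to right. Overlapping
    detections are not merged (non-maximum suppression omitted for brevity).
    """
    hits = []
    for ch, tpl in templates.items():
        res = cv2.matchTemplate(plate_gray, tpl, cv2.TM_CCOEFF_NORMED)
        for y, x in zip(*np.where(res > 0.8)):      # confidence threshold
            hits.append((x, ch))
    hits.sort()                                      # left-to-right order
    return "".join(ch for _, ch in hits)
```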

  1. Speech recognition system based on LPCC parameters

    Institute of Scientific and Technical Information of China (English)

    王彪

    2012-01-01

    To recognize simple speech, a speech recognition system based on LPCC parameters is designed. Its main functions are recording, playback, and preprocessing of voice signals, segmented filtering, feature extraction, and speech recognition. Simulation experiments verify that the system meets the requirement of recognizing simple speech, but there is still room for improvement, such as whether relatively complex speech can be recognized in complex environments.
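
    For readers unfamiliar with LPCC features, the following sketch computes them in the standard way: linear prediction coefficients via the autocorrelation method and Levinson-Durbin recursion, then the usual LPC-to-cepstrum recursion. It assumes a pre-emphasized, windowed, non-silent frame; this is a generic implementation, not the paper's code.

```python
import numpy as np

def lpc(frame, order):
    """LPC coefficients (a[0] = 1) via the autocorrelation method and
    Levinson-Durbin recursion; assumes a non-silent, windowed frame."""
    r = np.correlate(frame, frame, mode='full')[len(frame) - 1:len(frame) + order]
    a = np.zeros(order + 1)
    a[0], e = 1.0, r[0]
    for i in range(1, order + 1):
        k = -(r[i] + np.dot(a[1:i], r[i - 1:0:-1])) / e  # reflection coefficient
        a[1:i + 1] = a[1:i + 1] + k * a[i - 1::-1][:i]
        e *= 1.0 - k * k                                 # residual energy
    return a

def lpcc(a, n_ceps):
    """Standard LPC-to-cepstrum recursion (Rabiner & Juang)."""
    p = len(a) - 1
    alpha = -a[1:]                                       # predictor coefficients
    c = np.zeros(n_ceps + 1)
    for m in range(1, n_ceps + 1):
        acc = alpha[m - 1] if m <= p else 0.0
        for k in range(1, m):
            if m - k <= p:
                acc += (k / m) * c[k] * alpha[m - k - 1]
        c[m] = acc
    return c[1:]

fs = 16000
t = np.arange(400) / fs
frame = np.hamming(400) * np.sin(2 * np.pi * 300 * t)    # toy voiced frame
print(lpcc(lpc(frame, order=12), n_ceps=16).shape)       # (16,)
```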

  2. Testing Speech Recognition in Spanish-English Bilingual Children with the Computer-Assisted Speech Perception Assessment (CASPA): Initial Report.

    Science.gov (United States)

    García, Paula B; Rosado Rogers, Lydia; Nishi, Kanae

    2016-01-01

    This study evaluated the English version of the Computer-Assisted Speech Perception Assessment (E-CASPA) with Spanish-English bilingual children. E-CASPA has been evaluated with monolingual English speakers ages 5 years and older, but it is unknown whether a separate norm is necessary for bilingual children. Eleven Spanish-English bilingual and 12 English monolingual children (6 to 12 years old) with normal hearing participated. Responses were scored by word, phoneme, consonant, and vowel. Regardless of scoring unit, performance across three signal-to-noise ratio conditions was similar between groups, suggesting that the same norm can be used for both bilingual and monolingual children.

  3. Automatic recognition of spontaneous emotions in speech using acoustic and lexical features

    NARCIS (Netherlands)

    Raaijmakers, S.; Truong, K.P.

    2008-01-01

    We developed acoustic and lexical classifiers, based on a boosting algorithm, to assess the separability on arousal and valence dimensions in spontaneous emotional speech. The spontaneous emotional speech data was acquired by inviting subjects to play a first-person shooter video game. Our acoustic

  4. Acceptance of speech recognition by physicians: A survey of expectations, experiences, and social influence

    DEFF Research Database (Denmark)

    Alapetite, Alexandre; Andersen, Henning Boje; Hertzum, Morten

    2009-01-01

    The present study has surveyed physician views and attitudes before and after the introduction of speech technology as a front end to an electronic medical record. At the hospital where the survey was made, speech technology recently (2006–2007) replaced traditional dictation and subsequent secre...

  5. Vision-based obstacle recognition system for automated lawn mower robot development

    Science.gov (United States)

    Mohd Zin, Zalhan; Ibrahim, Ratnawati

    2011-06-01

    Digital image processing (DIP) techniques have been widely used in various types of applications recently. Classification and recognition of a specific object using a vision system involve challenging tasks in the fields of image processing and artificial intelligence. The ability and efficiency of a vision system to capture and process images are very important for any intelligent system, such as an autonomous robot. This paper focuses on the development of a vision system that could contribute to an automated, vision-based lawn mower robot. The work involves implementing DIP techniques to detect and recognize three different types of obstacles that usually exist on a football field. The focus was on studying different types and sizes of obstacles, developing the vision-based obstacle recognition system, and evaluating the system's performance. Image processing techniques such as image filtering, segmentation, enhancement, and edge detection have been applied in the system. The results show that the developed system is able to detect and recognize various types of obstacles on a football field with a recognition rate of more than 80%.

  6. Reverberant speech recognition combining deep neural networks and deep autoencoders augmented with a phone-class feature

    Science.gov (United States)

    Mimura, Masato; Sakai, Shinsuke; Kawahara, Tatsuya

    2015-12-01

    We propose an approach to reverberant speech recognition adopting deep learning in the front-end as well as back-end of a reverberant speech recognition system, and a novel method to improve the dereverberation performance of the front-end network using phone-class information. At the front-end, we adopt a deep autoencoder (DAE) for enhancing the speech feature parameters, and speech recognition is performed in the back-end using DNN-HMM acoustic models trained on multi-condition data. The system was evaluated through the ASR task in the Reverb Challenge 2014. The DNN-HMM system trained on the multi-condition training set achieved a conspicuously higher word accuracy compared to the MLLR-adapted GMM-HMM system trained on the same data. Furthermore, feature enhancement with the deep autoencoder contributed to the improvement of recognition accuracy especially in the more adverse conditions. While the mapping between reverberant and clean speech in DAE-based dereverberation is conventionally conducted only with the acoustic information, we presume the mapping is also dependent on the phone information. Therefore, we propose a new scheme (pDAE), which augments a phone-class feature to the standard acoustic features as input. Two types of the phone-class feature are investigated. One is the hard recognition result of monophones, and the other is a soft representation derived from the posterior outputs of monophone DNN. The augmented feature in either type results in a significant improvement (7-8 % relative) from the standard DAE.
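
    A minimal sketch of the pDAE idea, using PyTorch as a stand-in (the paper does not specify a toolkit): the reverberant feature vector is concatenated with a phone-class feature (e.g., monophone posteriors) and the network is trained to regress the corresponding clean features. Layer sizes and data here are illustrative only.

```python
import torch
import torch.nn as nn

class PhoneAwareDAE(nn.Module):
    """Dereverberation autoencoder whose input is the reverberant feature
    vector concatenated with a phone-class feature (e.g., monophone
    posteriors); dimensions are illustrative, not those of the paper."""

    def __init__(self, feat_dim=40, phone_dim=40, hidden=1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim + phone_dim, hidden), nn.Sigmoid(),
            nn.Linear(hidden, hidden), nn.Sigmoid(),
            nn.Linear(hidden, feat_dim),       # regress clean features
        )

    def forward(self, reverb_feats, phone_post):
        return self.net(torch.cat([reverb_feats, phone_post], dim=-1))

# One training step against parallel clean features (random stand-ins):
model = PhoneAwareDAE()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
reverb = torch.randn(32, 40)
phones = torch.softmax(torch.randn(32, 40), dim=-1)  # soft phone-class feature
clean = torch.randn(32, 40)
loss = nn.functional.mse_loss(model(reverb, phones), clean)
opt.zero_grad(); loss.backward(); opt.step()
```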

  7. The accuracy of radiology speech recognition reports in a multilingual South African teaching hospital

    International Nuclear Information System (INIS)

    Speech recognition (SR) technology, the process whereby spoken words are converted to digital text, has been used in radiology reporting since 1981. It was initially anticipated that SR would dominate radiology reporting, with claims of up to 99% accuracy, reduced turnaround times and significant cost savings. However, expectations have not yet been realised. The limited data available suggest SR reports have significantly higher levels of inaccuracy than traditional dictation transcription (DT) reports, as well as incurring greater aggregate costs. There has been little work on the clinical significance of such errors, however, and little is known of the impact of reporter seniority on the generation of errors, or the influence of system familiarity on reducing error rates. Furthermore, there have been conflicting findings on the accuracy of SR amongst users with English as first- and second-language respectively. The aim of the study was to compare the accuracy of SR and DT reports in a resource-limited setting. The first 300 SR and the first 300 DT reports generated during March 2010 were retrieved from the hospital’s PACS, and reviewed by a single observer. Text errors were identified, and then classified as either clinically significant or insignificant based on their potential impact on patient management. In addition, a follow-up analysis was conducted exactly 4 years later. Of the original 300 SR reports analysed, 25.6% contained errors, with 9.6% being clinically significant. Only 9.3% of the DT reports contained errors, 2.3% having potential clinical impact. Both the overall difference in SR and DT error rates, and the difference in ‘clinically significant’ error rates (9.6% vs. 2.3%) were statistically significant. In the follow-up study, the overall SR error rate was strikingly similar at 24.3%, 6% being clinically significant. Radiologists with second-language English were more likely to generate reports containing errors, but level of seniority

  8. Establishment of a Chinese Speech Recognition Database Based on SQLite Technology

    Institute of Scientific and Technical Information of China (English)

    刘祥楼; 李辉; 吴香艳; 高丙坤

    2011-01-01

    Building a Chinese speech recognition database suited to a specific speaker recognition system is of great significance for advancing speaker recognition research and applications. During the research and development of an SVM-based speaker recognition system, a Chinese speech recognition database based on SQLite was constructed, with database control operations implemented through the LabVIEW platform. Comparison experiments were conducted using unordered samples and samples from the speech database, respectively. The results show that, on the one hand, the recognition rate of the speaker recognition system is unchanged whether database samples or unordered samples are used, which fully demonstrates the high stability and reliability of the Chinese speech recognition database built for this system; on the other hand, recognition time is significantly reduced when database samples are used, which is an effective way to improve the performance of the SVM-based speaker recognition system.
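
    A minimal sketch of such a speech-sample table, using Python's sqlite3 module in place of the paper's LabVIEW interface; the schema and sample values are assumptions for illustration. The indexed lookup at the end is the kind of query that shortens recognition time compared with scanning unordered sample files.

```python
import sqlite3

# Create a minimal utterance table and exercise an indexed lookup.
con = sqlite3.connect("speech_corpus.db")
con.execute("""
    CREATE TABLE IF NOT EXISTS utterances (
        id       INTEGER PRIMARY KEY,
        speaker  TEXT NOT NULL,   -- speaker label
        text     TEXT,            -- Chinese transcript
        features BLOB             -- serialized feature vector
    )""")
con.execute("CREATE INDEX IF NOT EXISTS idx_speaker ON utterances(speaker)")
con.execute("INSERT INTO utterances (speaker, text, features) VALUES (?, ?, ?)",
            ("spk001", "你好", b"\x00\x01"))      # placeholder feature bytes
con.commit()

rows = con.execute("SELECT id, text FROM utterances WHERE speaker = ?",
                   ("spk001",)).fetchall()
print(rows)
```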

  9. Design of a Speech-Recognition Intelligent Trash Can Based on LD3320

    Institute of Scientific and Technical Information of China (English)

    何侃; 田亚清; 李强; 胡洲荣; 张静

    2015-01-01

    With the continuous development of Internet and automation technology, the smart home has become one of the important hot directions in the development of the Internet of Things. This design is based on the speaker-independent speech recognition chip LD3320 produced by ICRoute and uses a speaker-independent speech recognition algorithm to implement intelligent voice recognition and voice control of a trash can, covering voice-controlled movement in all directions, non-contact intelligent opening and closing, and capacity detection. Tests of the correct recognition rate in a simulated working environment show that the rate reaches 88.4% under normal working conditions, and the design can effectively complete its actions and functions within a distance of 2 m.

  10. Research Progress on Feature Parameters of Speech Emotion Recognition

    Institute of Scientific and Technical Information of China (English)

    李杰; 周萍

    2012-01-01

    Speech emotion recognition is one of the newer research topics. The extraction of feature parameters directly influences the final recognition rate and efficiency, and dimension reduction can extract the feature parameters that best distinguish different emotions. This paper points out the importance of feature parameters in speech emotion recognition, introduces the basic composition of a speech emotion recognition system, reviews the research status of feature parameters in detail, describes the dimension-reduction methods commonly applied to emotion recognition, and compares and analyzes them. Possible future trends in speech emotion recognition are also discussed.
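
    As a small illustration of the dimension-reduction step discussed in the survey, the sketch below applies PCA (one of the common methods) to a hypothetical matrix of emotion feature vectors, keeping enough components to explain 95% of the variance.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical matrix: one row of acoustic emotion features per utterance.
X = np.random.randn(200, 60)

# Keep the components explaining 95% of the variance; the reduced matrix
# is what a downstream emotion classifier would consume.
pca = PCA(n_components=0.95)
X_red = pca.fit_transform(X)
print(X.shape, "->", X_red.shape)
```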

  11. An Automated Recognition of Fake or Destroyed Indian Currency Notes in Machine Vision

    Directory of Open Access Journals (Sweden)

    Sanjana

    2012-04-01

    Full Text Available Almost every country in the world faces the problem of counterfeit currency notes, but in India the problem is acute, as the country is hit hard by this evil practice. Fake notes in India in denominations of Rs. 100, 500 and 1000 are being flooded into the system. In order to deal with such problems, an automated recognition of currency notes is introduced with the help of feature extraction, classification based on SVM, neural nets, and a heuristic approach. This technique also relies on computer vision, where all processing of the image is done by machine. The machine is fitted with a CCD camera which scans the image of the currency note considering the dimensions of the banknote, and software processes the image segments with the help of SVM and character recognition methods. An ANN is also introduced in this paper to train the data and classify the segments using its datasets. To implement this design we use the MATLAB tool.

  12. Automated, high accuracy classification of Parkinsonian disorders: a pattern recognition approach.

    Directory of Open Access Journals (Sweden)

    Andre F Marquand

    Full Text Available Progressive supranuclear palsy (PSP), multiple system atrophy (MSA) and idiopathic Parkinson's disease (IPD) can be clinically indistinguishable, especially in the early stages, despite distinct patterns of molecular pathology. Structural neuroimaging holds promise for providing objective biomarkers for discriminating these diseases at the single subject level but all studies to date have reported incomplete separation of disease groups. In this study, we employed multi-class pattern recognition to assess the value of anatomical patterns derived from a widely available structural neuroimaging sequence for automated classification of these disorders. To achieve this, 17 patients with PSP, 14 with IPD and 19 with MSA were scanned using structural MRI along with 19 healthy controls (HCs). An advanced probabilistic pattern recognition approach was employed to evaluate the diagnostic value of several pre-defined anatomical patterns for discriminating the disorders, including: (i) a subcortical motor network; (ii) each of its component regions; and (iii) the whole brain. All disease groups could be discriminated simultaneously with high accuracy using the subcortical motor network. The region providing the most accurate predictions overall was the midbrain/brainstem, which discriminated all disease groups from one another and from HCs. The subcortical network also produced more accurate predictions than the whole brain and all of its constituent regions. PSP was accurately predicted from the midbrain/brainstem, cerebellum and all basal ganglia compartments; MSA from the midbrain/brainstem and cerebellum and IPD from the midbrain/brainstem only. This study demonstrates that automated analysis of structural MRI can accurately predict diagnosis in individual patients with Parkinsonian disorders, and identifies distinct patterns of regional atrophy particularly useful for this process.

  13. Differences in Speech Recognition Between Children with Attention Deficits and Typically Developed Children Disappear when Exposed to 65 dB of Auditory Noise

    Directory of Open Access Journals (Sweden)

    Göran B W Söderlund

    2016-01-01

    Full Text Available The most common neuropsychiatric condition in children is attention deficit hyperactivity disorder (ADHD), affecting approximately 6-9% of the population. ADHD is distinguished by inattention and hyperactive, impulsive behaviors as well as poor performance in various cognitive tasks, often leading to failures at school. Sensory and perceptual dysfunctions have also been noticed. Prior research has mainly focused on limitations in executive functioning, where differences are often explained by deficits in pre-frontal cortex activation. Less notice has been given to sensory perception and subcortical functioning in ADHD. Recent research has shown that children with an ADHD diagnosis have a deviant auditory brainstem response compared to healthy controls. The aim of the present study was to investigate whether the speech recognition threshold differs between typically attentive children and children with ADHD symptoms in two environmental sound conditions, with and without external noise. Previous research has shown that children with attention deficits can benefit from white noise exposure during cognitive tasks, and here we investigate whether a noise benefit is present during an auditory perceptual task. For this purpose we used a modified Hagerman's speech recognition test, where children with and without attention deficits performed a binaural speech recognition task to assess the speech recognition threshold in no-noise and noise conditions (65 dB). Results showed that the inattentive group displayed a higher speech recognition threshold than typically developed children (TDC) and that the difference in speech recognition threshold disappeared when exposed to noise at supra-threshold level. From this we conclude that inattention can partly be explained by sensory perceptual limitations that can possibly be ameliorated through noise exposure.

  14. Pitch- and Formant-Based Order Adaptation of the Fractional Fourier Transform and Its Application to Speech Recognition

    Directory of Open Access Journals (Sweden)

    Yin Hui

    2009-01-01

    Full Text Available Fractional Fourier transform (FrFT) has been proposed to improve the time-frequency resolution in signal analysis and processing. However, selecting the FrFT transform order for the proper analysis of multicomponent signals like speech is still debated. In this work, we investigated several order adaptation methods. Firstly, FFT- and FrFT-based spectrograms of an artificially-generated vowel are compared to demonstrate the methods. Secondly, an acoustic feature set combining MFCC and FrFT is proposed, and the transform orders for the FrFT are adaptively set according to various methods based on pitch and formants. A tonal vowel discrimination test is designed to compare the performance of these methods using the feature set. The results show that the FrFT-MFCC yields a better discriminability of tones and also of vowels, especially by using multitransform-order methods. Thirdly, speech recognition experiments were conducted on the clean intervocalic English consonants provided by the Consonant Challenge. Experimental results show that the proposed features with different order adaptation methods can obtain slightly higher recognition rates compared to the reference MFCC-based recognizer.

  15. Digital speech processing using Matlab

    CERN Document Server

    Gopi, E S

    2014-01-01

    Digital Speech Processing Using Matlab deals with digital speech pattern recognition, speech production model, speech feature extraction, and speech compression. The book is written in a manner that is suitable for beginners pursuing basic research in digital speech processing. Matlab illustrations are provided for most topics to enable better understanding of concepts. This book also deals with the basic pattern recognition techniques (illustrated with speech signals using Matlab) such as PCA, LDA, ICA, SVM, HMM, GMM, BPN, and KSOM.

  16. Speech recognition with dynamic range reduction: (1) deaf and normal subjects in laboratory conditions.

    Science.gov (United States)

    Drysdale, A E; Gregory, R L

    1978-08-01

    Processing to reduce the dynamic range of speech should increase intelligibility and protect the impaired ear from overloading. There are theoretical and practical objections to using AGC devices to reduce dynamic range. These are overcome by using recently available signal processing employing high frequency carrier clipping. An increase in intelligibility of speech with this HFCC has been demonstrated, for normal subjects with simulated deafness, and for most partially hearing patients. Intelligibility is not improved for some patients; possibly due to their having learned to extract features which are lost. These patients may also benefit after training.

  17. Hints About Some Baseful but Indispensable Elements in Speech Recognition and Reconstruction

    Directory of Open Access Journals (Sweden)

    Mihaela Costin

    2002-07-01

    Full Text Available The cochlear implant (CI) is a device used to reconstruct the hearing capabilities of a person diagnosed with total cophosis. This impairment may occur after some accidents, chemotherapy, etc., the person still having an intact hearing nerve. The cochlear implant has two parts: a programmable external part, the Digital Signal Processing (DSP) device, which processes and transforms the speech signal, and a surgically implanted part with a certain number of electrodes (depending on the brand) used to stimulate the hearing nerve. The speech signal is fully processed in the external DSP device, resulting in the 'coded' information on speech. This is modulated with the support of the fundamental frequency F0, and the energy impulses are inductively sent to the hearing nerve. The correct detection of this frequency is very important, determining the manner of hearing and making the difference between a 'computer' voice and a natural one. The results are applicable not only in the medical domain, but also in Romanian speech synthesis.
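
    Since correct F0 detection is singled out as critical, here is a minimal autocorrelation-based F0 estimator in Python; it is a generic textbook method, not the implant's actual algorithm, and the synthetic 120 Hz frame is only a test signal.

```python
import numpy as np

def estimate_f0(frame, fs, fmin=70.0, fmax=400.0):
    """Estimate F0 of a voiced frame by picking the autocorrelation peak
    inside the plausible pitch-lag range."""
    frame = frame - frame.mean()
    ac = np.correlate(frame, frame, mode='full')[len(frame) - 1:]
    lo, hi = int(fs / fmax), int(fs / fmin)
    lag = lo + int(np.argmax(ac[lo:hi]))
    return fs / lag

fs = 16000
t = np.arange(int(0.04 * fs)) / fs
frame = np.sin(2 * np.pi * 120 * t) + 0.3 * np.sin(2 * np.pi * 240 * t)
print(round(estimate_f0(frame, fs), 1))   # close to 120.0
```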

  18. Dead regions in the cochlea: Implications for speech recognition and applicability of articulation index theory

    DEFF Research Database (Denmark)

    Vestergaard, Martin David

    2003-01-01

    Dead regions in the cochlea have been suggested to be responsible for failure by hearing aid users to benefit from apparently increased audibility in terms of speech intelligibility. As an alternative to the more cumbersome psychoacoustic tuning curve measurement, threshold-equalizing noise (TEN...

  19. Automated urban features classification and recognition from combined RGB/lidar data

    Science.gov (United States)

    Elhifnawy Eid, Hassan Elsaid

    2011-12-01

    Although a Red, Green and Blue (RGB) image provides rich semantic information for different features, it is difficult to extract and separate features which share similar texture properties. The data provided by a LIght Detection And Ranging (LIDAR) system contain dense spatial information for terrain and non-terrain objects, but feature extraction poses difficulties in separating different features sharing the same height information. The thesis objective is to introduce an automated urban classification technique using combined semantic and spatial information leading to the ability to extract different features efficiently. RGB color channels are used to produce two color invariant images for vegetation and shadowy areas identification. Otsu segmentation is applied on these color invariant images to identify shadows and vegetation candidates from each other. An RGB image is transformed into two other color spaces, YCbCr and HSV. Luminance color channel is extracted from YCbCr color space, while hue and saturation color channels are extracted from HSV color space. Global thresholding is applied on these color channels individually and collectively for detecting sandy areas. Wavelet transform is used for detecting building boundaries from LIDAR height data. Final building candidates are identified after removing vegetation areas from the resulting image of extracted buildings from LIDAR data. After successful building extraction using wavelets and vegetation, sandy and shadowy areas from an RGB, remaining features will be the roads. This new filter combination introduces a highly efficient automatic urban classification approach from combined LIDAR/RGB data. The proposed urban classification algorithm will introduce classified libraries for several features and in order to use this output efficiently an independent search algorithm is required. An efficient texture and boundary search algorithm is introduced for automatic object recognition of buildings using both
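
    As a small illustration of the Otsu segmentation step applied to the color-invariant images, the sketch below implements Otsu's threshold (maximizing between-class variance) over an 8-bit channel; the random image is a stand-in for a real color-invariant channel.

```python
import numpy as np

def otsu_threshold(channel):
    """Otsu's method: the threshold maximizing between-class variance of
    an 8-bit single-channel image."""
    hist, _ = np.histogram(channel.ravel(), bins=256, range=(0, 256))
    p = hist / hist.sum()
    omega = np.cumsum(p)                  # class-0 probability
    mu = np.cumsum(p * np.arange(256))    # cumulative mean
    with np.errstate(divide='ignore', invalid='ignore'):
        sigma_b = (mu[-1] * omega - mu) ** 2 / (omega * (1.0 - omega))
    return int(np.nanargmax(sigma_b))

img = (np.random.rand(64, 64) * 255).astype(np.uint8)  # stand-in channel
mask = img > otsu_threshold(img)          # foreground candidates
print(mask.mean())
```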

  20. Robot Speech Recognition System Based on Julius

    Institute of Scientific and Technical Information of China (English)

    付维; 刘冬; 闵华松

    2011-01-01

    With the continuous development of robot technology, speech recognition is proposed as an intelligent human-computer interaction mode for robots. After studying the basic principles of HMM-based speech recognition, an isolated-word speech recognition system was built on the laboratory's robot platform using the open-source HTK and Julius toolkits. With this speech recognition system, voice commands can be extracted and used to control the robot.

  1. Speech Emotion Recognition Based on MF-DFA

    Institute of Scientific and Technical Information of China (English)

    叶吉祥; 张密霞; 龚希龄

    2011-01-01

    To overcome the inadequacy of conventional linear speech parameters in characterizing different emotion types, this paper introduces multifractal theory into speech emotion recognition. By analyzing the multifractal characteristics of speech under different emotional states, multifractal spectrum parameters and the generalized Hurst exponent are extracted as new emotional feature parameters and, combined with traditional acoustic features, used with a support vector machine (SVM) for speech emotion recognition. The results show that the accuracy and stability of the recognition system are effectively improved by introducing the nonlinear parameters, compared with methods using traditional linear features alone. The introduction of nonlinear parameters provides a new idea for speech emotion recognition.

  2. A large-scale dataset of solar event reports from automated feature recognition modules

    Science.gov (United States)

    Schuh, Michael A.; Angryk, Rafal A.; Martens, Petrus C.

    2016-05-01

    The massive repository of images of the Sun captured by the Solar Dynamics Observatory (SDO) mission has ushered in the era of Big Data for Solar Physics. In this work, we investigate the entire public collection of events reported to the Heliophysics Event Knowledgebase (HEK) from automated solar feature recognition modules operated by the SDO Feature Finding Team (FFT). With the SDO mission recently surpassing five years of operations, and over 280,000 event reports for seven types of solar phenomena, we present the broadest and most comprehensive large-scale dataset of the SDO FFT modules to date. We also present numerous statistics on these modules, providing valuable contextual information for better understanding and validating of the individual event reports and the entire dataset as a whole. After extensive data cleaning through exploratory data analysis, we highlight several opportunities for knowledge discovery from data (KDD). Through these important prerequisite analyses presented here, the results of KDD from Solar Big Data will be overall more reliable and better understood. As the SDO mission remains operational over the coming years, these datasets will continue to grow in size and value. Future versions of this dataset will be analyzed in the general framework established in this work and maintained publicly online for easy access by the community.

  3. Laser Opto-Electronic Correlator for Robotic Vision Automated Pattern Recognition

    Science.gov (United States)

    Marzwell, Neville

    1995-01-01

    A compact laser opto-electronic correlator for pattern recognition has been designed, fabricated, and tested. Specifically it is a translation sensitivity adjustable compact optical correlator (TSACOC) utilizing convergent laser beams for the holographic filter. Its properties and performance, including the location of the correlation peak and the effects of lateral and longitudinal displacements for both filters and input images, are systematically analyzed based on the nonparaxial approximation for the reference beam. The theoretical analyses have been verified in experiments. In applying the TSACOC to important practical problems including fingerprint identification, we have found that the tolerance of the system to the input lateral displacement can be conveniently increased by changing a geometric factor of the system. The system can be compactly packaged using the miniature laser diode sources and can be used in space by the National Aeronautics and Space Administration (NASA) and ground commercial applications which include robotic vision, and industrial inspection of automated quality control operations. The personnel of Standard International will work closely with the Jet Propulsion Laboratory (JPL) to transfer the technology to the commercial market. Prototype systems will be fabricated to test the market and perfect the product. Large production will follow after successful results are achieved.

  4. Emotional intelligence, not music training, predicts recognition of emotional speech prosody.

    Science.gov (United States)

    Trimmer, Christopher G; Cuddy, Lola L

    2008-12-01

    Is music training associated with greater sensitivity to emotional prosody in speech? University undergraduates (n = 100) were asked to identify the emotion conveyed in both semantically neutral utterances and melodic analogues that preserved the fundamental frequency contour and intensity pattern of the utterances. Utterances were expressed in four basic emotional tones (anger, fear, joy, sadness) and in a neutral condition. Participants also completed an extended questionnaire about music education and activities, and a battery of tests to assess emotional intelligence, musical perception and memory, and fluid intelligence. Emotional intelligence, not music training or music perception abilities, successfully predicted identification of intended emotion in speech and melodic analogues. The ability to recognize cues of emotion accurately and efficiently across domains may reflect the operation of a cross-modal processor that does not rely on gains of perceptual sensitivity such as those related to music training. PMID:19102595

  6. SOFTWARE EFFORT ESTIMATION FRAMEWORK TO IMPROVE ORGANIZATION PRODUCTIVITY USING EMOTION RECOGNITION OF SOFTWARE ENGINEERS IN SPONTANEOUS SPEECH

    Directory of Open Access Journals (Sweden)

    B.V.A.N.S.S. Prabhakar Rao

    2015-10-01

    Full Text Available Productivity is a very important part of any organisation in general and the software industry in particular. Nowadays software effort estimation is a challenging task, and effort and productivity are inter-related. Every organisation requires emotionally stable employees for seamless and progressive working. In other industries this may be achievable without manpower, but software project development is a labour-intensive activity: each line of code must be delivered by a software engineer, with tools and techniques acting only as aids or supplements. Whatever the reason, the software industry has been struggling with its success rate, facing problems in delivering projects on time and within the estimated budget. To estimate the required effort of a project, it is significant to know the emotional state of the team members. The responsibility of ensuring emotional contentment falls on the human resource department, which can deploy a series of systems to carry out its survey. This analysis can be done using a variety of tools; one such tool is the study of emotion recognition. The data needed for this is readily available and collectable and can be an excellent source for feedback systems. The challenge of recognising emotion in speech is convoluted primarily due to noisy recording conditions, variations in sentiment in the sample space, and the exhibition of multiple emotions in a single sentence. The ambiguity in the labels of the training set also increases the complexity of the problem addressed. Existing models using probabilistic methods have dominated the study but present a flaw in scalability due to statistical inefficiency. The problem of sentiment prediction in spontaneous speech can thus be addressed using a hybrid system comprising a Convolution Neural Network and

  7. Speech recognition by bilateral cochlear implant users in a cocktail-party setting

    OpenAIRE

    Loizou, Philipos C.; Hu, Yi; Litovsky, Ruth; Yu, Gongqiang; PETERS, Robert; Lake, Jennifer; Roland, Peter

    2009-01-01

    Unlike prior studies with bilateral cochlear implant users which considered only one interferer, the present study considered realistic listening situations wherein multiple interferers were present and in some cases originating from both hemifields. Speech reception thresholds were measured in bilateral users unilaterally and bilaterally in four different spatial configurations, with one and three interferers consisting of modulated noise or competing talkers. The data were analyzed in terms...

  8. Speech recall and word recognition depending on prosodic and musical cues as well as voice pitch

    OpenAIRE

    Rozanovskaya, Anna; Sokolova, Taisia

    2011-01-01

    Within this study, speech perception in different conditions was examined. The aim of the research was to compare perception results based on stimuli mode (plain spoken, rhythmic spoken or rhythmic sung stimuli) and pitch (normal, lower and higher). In the study, an experiment was conducted on 44 participants who had been asked to listen to 9 recorded sentences in Russian language (unknown to them) and write them down using Latin letters. These 9 sentences were specially prepared using differ...

  9. Relating hearing loss and executive functions to hearing aid users’ preference for, and speech recognition with, different combinations of binaural noise reduction and microphone directionality

    Directory of Open Access Journals (Sweden)

    Tobias Neher

    2014-12-01

    Full Text Available Knowledge of how executive functions relate to preferred hearing aid (HA) processing is sparse and seemingly inconsistent with related knowledge for speech recognition outcomes. This study thus aimed to find out if (1) performance on a measure of reading span (RS) is related to preferred binaural noise reduction (NR) strength, (2) similar relations exist for two different, nonverbal measures of executive function, (3) pure-tone average hearing loss (PTA), signal-to-noise ratio (SNR), and microphone directionality (DIR) also influence preferred NR strength, and (4) preference and speech recognition outcomes are similar. Sixty elderly HA users took part. Six HA conditions consisting of omnidirectional or cardioid microphones followed by inactive, moderate, or strong binaural NR as well as linear amplification were tested. Outcome was assessed at fixed SNRs using headphone simulations of a frontal target talker in a busy cafeteria. Analyses showed positive effects of active NR and DIR on preference, and negative and positive effects of, respectively, strong NR and DIR on speech recognition. Also, while moderate NR was the most preferred NR setting overall, preference for strong NR increased with SNR. No relation between RS and preference was found. However, larger PTA was related to weaker preference for inactive NR and stronger preference for strong NR for both microphone modes. Equivalent (but weaker) relations between worse performance on one nonverbal measure of executive function and the HA conditions without DIR were found. For speech recognition, there were relations between HA condition, PTA, and RS, but their pattern differed from that for preference. Altogether, these results indicate that, while moderate NR works well in general, a notable proportion of HA users prefer stronger NR. Furthermore, PTA and executive functions can account for some of the variability in preference for, and speech recognition with, different binaural NR and DIR settings.

  10. Speech emotion recognition in emotional feedback for Human-Robot Interaction

    Directory of Open Access Journals (Sweden)

    Javier G. Rázuri

    2015-02-01

    Full Text Available For robots to plan their actions autonomously and interact with people, recognizing human emotions is crucial. For most humans, nonverbal cues such as pitch, loudness, spectrum, and speech rate are efficient carriers of emotion. The features of the sound of a spoken voice probably contain crucial information on the emotional state of the speaker; within this framework, a machine might use such properties of sound to recognize emotions. This work evaluated six different kinds of classifiers to predict six basic universal emotions from non-verbal features of human speech. The classification techniques used information from six audio files extracted from the eNTERFACE05 audio-visual emotion database. The information gain from a decision tree was also used to choose the most significant speech features from a set of acoustic features commonly extracted in emotion analysis. The classifiers were evaluated with the proposed features and with the features selected by the decision tree. With this feature selection, each of the compared classifiers increased its global accuracy and recall. The best performance was obtained with Support Vector Machine and bayesNet.
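
    A rough sketch of the feature-selection step in Python, using scikit-learn's mutual-information estimator as a stand-in for the decision-tree information gain used in the paper; the data, label count, and the choice to keep the top 12 features are illustrative assumptions.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

# Hypothetical data: acoustic features per utterance, six emotion labels.
X = np.random.randn(300, 40)
y = np.random.randint(0, 6, size=300)

# Rank features by estimated mutual information with the label and keep
# the top 12 for the downstream classifiers.
mi = mutual_info_classif(X, y, random_state=0)
top = np.argsort(mi)[::-1][:12]
X_sel = X[:, top]
print(X_sel.shape)
```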

  11. Intermodal timing relations and audio-visual speech recognition by normal-hearing adults.

    Science.gov (United States)

    McGrath, M; Summerfield, Q

    1985-02-01

    Audio-visual identification of sentences was measured as a function of audio delay in untrained observers with normal hearing; the soundtrack was replaced by rectangular pulses originally synchronized to the closing of the talker's vocal folds and then subjected to delay. When the soundtrack was delayed by 160 ms, identification scores were no better than when no acoustical information at all was provided. Delays of up to 80 ms had little effect on group-mean performance, but a separate analysis of a subgroup of better lipreaders showed a significant trend of reduced scores with increased delay in the range from 0-80 ms. A second experiment tested the interpretation that, although the main disruptive effect of the delay occurred on a syllabic time scale, better lipreaders might be attempting to use intermodal timing cues at a phonemic level. Normal-hearing observers determined whether a 120-Hz complex tone started before or after the opening of a pair of liplike Lissajous figures. Group-mean difference limens (70.7% correct DLs) were -79 ms (sound leading) and +138 ms (sound lagging), with no significant correlation between DLs and sentence lipreading scores. It was concluded that most observers, whether good lipreaders or not, possess insufficient sensitivity to intermodal timing cues in audio-visual speech for them to be used analogously to voice onset time in auditory speech perception. The results of both experiments imply that delays of up to about 40 ms introduced by signal-processing algorithms in aids to lipreading should not materially affect audio-visual speech understanding.

  12. One-against-all weighted dynamic time warping for language-independent and speaker-dependent speech recognition in adverse conditions.

    Directory of Open Access Journals (Sweden)

    Xianglilan Zhang

    Full Text Available Considering personal privacy and the difficulty of obtaining training material for many seldom-used English words and (often non-English) names, language-independent (LI) with lightweight speaker-dependent (SD) automatic speech recognition (ASR) is a promising option to solve the problem. The dynamic time warping (DTW) algorithm is the state-of-the-art algorithm for small-footprint SD ASR applications with limited storage space and small vocabulary, such as voice dialing on mobile devices, menu-driven recognition, and voice control on vehicles and robotics. Even though we have successfully developed two fast and accurate DTW variations for clean speech data, speech recognition for adverse conditions is still a big challenge. In order to improve recognition accuracy in noisy environments and bad recording conditions such as too high or low volume, we introduce a novel one-against-all weighted DTW (OAWDTW). This method defines a one-against-all index (OAI) for each time frame of training data and applies the OAIs to the core DTW process. Given two speech signals, OAWDTW tunes their final alignment score by using OAI in the DTW process. Our method achieves better accuracies than DTW and merge-weighted DTW (MWDTW), as a 6.97% relative reduction of error rate (RRER) compared with DTW and a 15.91% RRER compared with MWDTW are observed in our extensive experiments on one representative SD dataset of four speakers' recordings. To the best of our knowledge, the OAWDTW approach is the first weighted DTW specially designed for speech data in adverse conditions.
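
    To illustrate where per-frame weights enter, here is a minimal frame-weighted DTW in Python. The paper's OAI weights would be supplied through `w`; with all-ones weights it reduces to plain DTW. The exact OAI definition is not reproduced here.

```python
import numpy as np

def weighted_dtw(a, b, w=None):
    """Frame-weighted DTW distance between two feature sequences; `w`
    holds one weight per frame of `a` (all ones gives plain DTW)."""
    n, m = len(a), len(b)
    w = np.ones(n) if w is None else w
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = w[i - 1] * np.linalg.norm(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

x = np.array([[0.0, 1.0], [1.0, 1.0], [2.0, 0.0]])   # toy "frames"
y = np.array([[0.0, 1.0], [2.0, 0.0]])
print(weighted_dtw(x, y))
```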

  13. Spoken-word recognition in foreign-accented speech by L2 listeners

    NARCIS (Netherlands)

    Weber, A.C.; Broersma, M.E.; Aoyagi, M.

    2011-01-01

    Two cross-modal priming studies investigated the recognition of English words spoken with a foreign accent. Auditory English primes were either typical of a Dutch accent or typical of a Japanese accent in English and were presented to both Dutch and Japanese L2 listeners. Lexical-decision times to s

  14. Separable spectro-temporal Gabor filter bank features: Reducing the complexity of robust features for automatic speech recognition.

    Science.gov (United States)

    Schädler, Marc René; Kollmeier, Birger

    2015-04-01

    To test if simultaneous spectral and temporal processing is required to extract robust features for automatic speech recognition (ASR), the robust spectro-temporal two-dimensional-Gabor filter bank (GBFB) front-end from Schädler, Meyer, and Kollmeier [J. Acoust. Soc. Am. 131, 4134-4151 (2012)] was decomposed into a spectral one-dimensional-Gabor filter bank and a temporal one-dimensional-Gabor filter bank. A feature set that is extracted with these separate spectral and temporal modulation filter banks was introduced, the separate Gabor filter bank (SGBFB) features, and evaluated on the CHiME (Computational Hearing in Multisource Environments) keywords-in-noise recognition task. From the perspective of robust ASR, the results showed that spectral and temporal processing can be performed independently and are not required to interact with each other. Using SGBFB features permitted the signal-to-noise ratio (SNR) to be lowered by 1.2 dB while still performing as well as the GBFB-based reference system, which corresponds to a relative improvement of the word error rate by 12.8%. Additionally, the real time factor of the spectro-temporal processing could be reduced by more than an order of magnitude. Compared to human listeners, the SNR needed to be 13 dB higher when using Mel-frequency cepstral coefficient features, 11 dB higher when using GBFB features, and 9 dB higher when using SGBFB features to achieve the same recognition performance. PMID:25920855

  15. Comparative Study on Feature Selection and Fusion Schemes for Emotion Recognition from Speech

    Directory of Open Access Journals (Sweden)

    Santiago Planet

    2012-09-01

    Full Text Available The automatic analysis of speech to detect affective states may improve the way users interact with electronic devices. However, analysis at the acoustic level alone may not be enough to determine the emotion of a user in a realistic scenario. In this paper we analyzed the spontaneous speech recordings of the FAU Aibo Corpus at the acoustic and linguistic levels to extract two sets of features. The acoustic set was reduced by a greedy procedure selecting the most relevant features to optimize the learning stage. We compared two versions of this greedy selection algorithm by performing the search of the relevant features forwards and backwards. We experimented with three classification approaches: Naïve-Bayes, a support vector machine and a logistic model tree, and two fusion schemes: decision-level fusion, merging the hard decisions of the acoustic and linguistic classifiers by means of a decision tree; and feature-level fusion, concatenating both sets of features before the learning stage. Despite the low performance achieved by the linguistic data, a dramatic improvement was achieved after its combination with the acoustic information, improving the results achieved by this second modality on its own. The results achieved by the classifiers using the parameters merged at feature level outperformed the classification results of the decision-level fusion scheme, despite the simplicity of the scheme. Moreover, the extremely reduced set of acoustic features obtained by the greedy forward search selection algorithm improved the results provided by the full set.
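
    A compact sketch of the two fusion schemes on hypothetical data: feature-level fusion concatenates the acoustic and linguistic blocks before learning, while decision-level fusion merges the hard decisions of per-modality classifiers with a decision tree, mirroring the paper's scheme. For brevity the meta-classifier is fit on in-sample decisions; held-out folds should be used in practice.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# Hypothetical acoustic and linguistic feature blocks with emotion labels.
Xa, Xl = np.random.randn(200, 30), np.random.randn(200, 10)
y = np.random.randint(0, 5, 200)

# Feature-level fusion: concatenate both blocks before learning.
clf_feat = SVC().fit(np.hstack([Xa, Xl]), y)

# Decision-level fusion: per-modality classifiers, hard decisions merged
# by a decision tree (in-sample here only for brevity).
pa = SVC().fit(Xa, y).predict(Xa)
pl = LogisticRegression(max_iter=1000).fit(Xl, y).predict(Xl)
meta = DecisionTreeClassifier().fit(np.column_stack([pa, pl]), y)
fused = meta.predict(np.column_stack([pa, pl]))
print(fused[:10])
```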

  16. Modeling and Simulation of Speech Emotion Recognition

    Institute of Scientific and Technical Information of China (English)

    黄晓峰; 彭远芳

    2012-01-01

    Speech emotion information is nonlinear, redundant, and high-dimensional, and the data contain a large amount of noise; traditional recognition models cannot eliminate the redundant and noisy information, so speech emotion recognition accuracy is quite low. To improve the accuracy of speech emotion recognition, this paper proposes an intelligent speech emotion recognition model based on process neural networks, exploiting the denoising ability of wavelet analysis and the strong nonlinear processing ability of neural networks. The noise of the speech signal is eliminated by wavelet analysis, redundant information in the emotional features is removed by principal component analysis, and the speech emotion is classified by a process neural network. Simulation results show that the recognition rate of the process neural network model is 13% higher than K-nearest neighbors and 8.75% higher than a support vector machine, so the proposed model is an effective tool for intelligent speech emotion recognition.
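
    A minimal sketch of the wavelet denoising step with PyWavelets: soft-threshold the detail coefficients and reconstruct. The universal-threshold rule and db8/level-4 settings are assumptions for illustration, since the paper's exact choices are not given in the abstract.

```python
import numpy as np
import pywt

def wavelet_denoise(signal, wavelet='db8', level=4):
    """Soft-threshold the detail coefficients and reconstruct; the
    universal threshold below is an assumed rule."""
    coeffs = pywt.wavedec(signal, wavelet, level=level)
    sigma = np.median(np.abs(coeffs[-1])) / 0.6745       # noise estimate
    thr = sigma * np.sqrt(2.0 * np.log(len(signal)))
    coeffs[1:] = [pywt.threshold(c, thr, mode='soft') for c in coeffs[1:]]
    return pywt.waverec(coeffs, wavelet)[:len(signal)]

noisy = np.random.randn(8000)    # stand-in noisy emotional utterance
clean = wavelet_denoise(noisy)
print(clean.shape)
```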

  17. A Survey of Speech Emotion Recognition in Human-Computer Interaction

    Institute of Scientific and Technical Information of China (English)

    张石清; 李乐民; 赵知劲

    2013-01-01

    Speech emotion recognition is currently an active research topic in the fields of signal processing, pattern recognition, artificial intelligence, human-computer interaction, etc. The ultimate purpose of such research is to endow computers with emotion ability and make human-computer interaction genuinely harmonious and natural. This paper reviews recent advances on several key problems involved in speech emotion recognition, including emotion description theory, emotional speech databases, emotional acoustic analysis, and emotion recognition methods, and points out existing research problems and directions for future development.

  18. Research Status and Development Trend of Russian Speech Recognition Technology

    Institute of Scientific and Technical Information of China (English)

    马延周

    2015-01-01

    Advances in speech recognition technology have promoted intelligent human-computer interaction, and practical speech recognition applications have made communication between people more convenient and immediate. Starting from the history of speech recognition and the current state of Russian speech recognition, this paper analyzes in detail the basic principles of speech recognition, HMM-based speech recognition technology, and the theoretical foundations of large-vocabulary continuous speech recognition, and introduces methods for building Russian acoustic and language models. It discusses strategies for dealing with the difficult problems facing speech recognition technology, and concludes with an outlook on the development direction and application prospects of Russian speech recognition technology.

  19. Integrating Automatic Speech Recognition and Machine Translation for Better Translation Outputs

    DEFF Research Database (Denmark)

    Liyanapathirana, Jeevanthi

    than typing, making the translation process faster. The spoken translation is analyzed and combined with the machine translation output of the same sentence using different methods. We study a number of different translation models in the context of n-best list rescoring methods. As an alternative...... to the n-best list rescoring, we also use word graphs with the expectation of arriving at a tighter integration of ASR and MT models. Integration methods include constraining ASR models using language and translation models of MT, and vice versa. We currently develop and experiment different methods...... on the Danish – English language pair, with the use of a speech corpora and parallel text. The methods are investigated to check ways that the accuracy of the spoken translation of the translator can be increased with the use of machine translation outputs, which would be useful for potential computer...

  20. Speech Emotion Recognition Based on Intrinsic Time-scale Decomposition

    Institute of Scientific and Technical Information of China (English)

    叶吉祥; 刘亚

    2014-01-01

    In order to better express speech emotional states, this paper applies Intrinsic Time-scale Decomposition (ITD) to speech emotion feature extraction. The emotional speech is decomposed into a sum of Proper Rotation (PR) components, and instantaneous characteristic parameters and the correlation dimension of the PR components are extracted as new emotional feature parameters, which are combined with traditional features for speech emotion recognition using a Support Vector Machine (SVM). The results show that recognition accuracy is clearly improved by introducing the PR feature parameters, compared with schemes using traditional features alone.

  1. GesRec3D: a real-time coded gesture-to-speech system with automatic segmentation and recognition thresholding using dissimilarity measures

    OpenAIRE

    Craven, Michael P; Curtis, K. Mervyn

    2004-01-01

    A complete microcomputer system is described, GesRec3D, which facilitates the data acquisition, segmentation, learning, and recognition of 3-dimensional arm gestures, with application as an Augmentative and Alternative Communication (AAC) aid for people with motor and speech disabilities. The gesture data are acquired from a Polhemus electro-magnetic tracker system, with sensors attached to the finger, wrist, and elbow of one arm. Coded gestures are linked to user-defined text, to be spoken by a t...

  2. Optimizing Automatic Speech Recognition for Low-Proficient Non-Native Speakers

    Directory of Open Access Journals (Sweden)

    Catia Cucchiarini

    2010-01-01

    Full Text Available Computer-Assisted Language Learning (CALL) applications for improving the oral skills of low-proficient learners have to cope with non-native speech that is particularly challenging. Since unconstrained non-native ASR is still problematic, a possible solution is to elicit constrained responses from the learners. In this paper, we describe experiments aimed at selecting utterances from lists of responses. The first experiment on utterance selection indicates that the decoding process can be improved by optimizing the language model and the acoustic models, thus reducing the utterance error rate from 29-26% to 10-8%. Since giving feedback on incorrectly recognized utterances is confusing, we verify the correctness of the utterance before providing feedback. The results of the second experiment on utterance verification indicate that combining duration-related features with a likelihood ratio (LR) yields an equal error rate (EER) of 10.3%, which is significantly better than the EER for the other measures in isolation.
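
    For reference, the equal error rate used in the second experiment can be computed by sweeping a decision threshold over verifier scores until the false-accept and false-reject rates cross; the sketch below does this on hypothetical likelihood-ratio scores.

```python
import numpy as np

def equal_error_rate(scores_true, scores_false):
    """Sweep a threshold over verifier scores of correct (true) and
    incorrect (false) utterances; return the rate where false-accept
    and false-reject curves cross."""
    best_gap, eer = np.inf, None
    for t in np.sort(np.concatenate([scores_true, scores_false])):
        frr = np.mean(scores_true < t)     # correct utterances rejected
        far = np.mean(scores_false >= t)   # incorrect utterances accepted
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2.0
    return eer

rng = np.random.default_rng(0)             # hypothetical LR scores
print(round(equal_error_rate(rng.normal(1.0, 0.5, 500),
                             rng.normal(0.0, 0.5, 500)), 3))
```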

  3. Logistic Kernel Function and Its Application to Speech Recognition

    Institute of Scientific and Technical Information of China (English)

    刘晓峰; 张雪英; Zizhong John Wang

    2015-01-01

    The kernel function is the core of the support vector machine (SVM) and directly determines its performance. To improve the learning ability and generalization ability of SVM for speech recognition, a Logistic kernel function is presented, together with a theoretical proof that this Logistic kernel is a Mercer kernel. Experimental results on the bi-spiral and speech recognition problems show that the Logistic kernel function is effective and performs better than the linear, polynomial, radial basis, and exponential radial basis kernel functions; in speech recognition in particular, it achieves better recognition performance.
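
    The abstract does not reproduce the kernel's formula, so the sketch below uses a logistic-shaped function of the pairwise squared distance purely as an assumed stand-in, to show how a custom kernel plugs into an SVM (scikit-learn's SVC accepts a Gram-matrix callable).

```python
import numpy as np
from sklearn.svm import SVC

def logistic_like_kernel(A, B, gamma=0.5):
    """Illustrative logistic-shaped kernel of the pairwise squared
    distance -- an assumed stand-in, NOT the paper's proven Mercer
    kernel. Returns the (len(A), len(B)) Gram matrix SVC expects."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return 2.0 / (1.0 + np.exp(np.minimum(gamma * sq, 50.0)))

X = np.random.randn(100, 12)               # hypothetical speech features
y = np.random.randint(0, 2, 100)
clf = SVC(kernel=logistic_like_kernel).fit(X, y)
print(clf.score(X, y))
```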

  4. Intelligent Home Speech Recognition System Based on NL6621

    Institute of Scientific and Technical Information of China (English)

    王爱芸

    2015-01-01

    Research on practical intelligent home speech recognition systems is of great significance for the development of the smart home. By analyzing embedded speech recognition technology and smart home control technology, voice is recorded on a platform built around an NL6621 board with a VS1003 audio decoding chip. A Hidden Markov Model (HMM) algorithm is used for voice model training and voice matching, yielding a smart home voice control system. Experiments prove that this speech control system has a high recognition rate and good real-time performance.

  5. Security Authentication System Based on Speech Recognition

    Institute of Scientific and Technical Information of China (English)

    毕俊浩; 叶翰嘉; 王笑臣; 孙国梓

    2012-01-01

    Based on an analysis of the security requirements of smart terminals, speech recognition and sandbox protection technology are applied to verify whether a user is authorized. Verifying whether the user has been granted authorization is a key problem analyzed in this paper. A security authentication system based on speech recognition is designed and implemented on the widely used Android system, and the key technologies of the system are analyzed in detail, covering user speech recognition as well as the interaction interface and interaction protocols of the sandbox protection.

  6. Speech Emotion Recognition Algorithm Based on Modified SVM

    Institute of Scientific and Technical Information of China (English)

    李书玲; 刘蓉; 张鎏钦; 刘红

    2013-01-01

    In order to improve the recognition accuracy of speech emotion recognition systems, an improved speech emotion recognition algorithm based on the Support Vector Machine (SVM) is proposed. In the proposed algorithm, the SVM parameters, the penalty factor and the kernel function parameter, are optimized with a genetic algorithm, and an emotion recognition model is then built with the optimized SVM. The performance of the algorithm was assessed by computer simulations: recognition rates of 91.03% and 96.59% were achieved in seven-emotion and common five-emotion recognition experiments on the Berlin database, respectively, and the rate rose to 97.67% on a Chinese emotional database. These results demonstrate the validity of the proposed algorithm.
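    A toy sketch of the GA-over-SVM-parameters idea (the dataset, population size and mutation scale are illustrative stand-ins, not the paper's setup):

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import load_digits
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X, y = load_digits(return_X_y=True)          # stand-in for emotion features

def fitness(ind):
    C, gamma = np.exp(ind)                   # individuals live in log-space
    return cross_val_score(SVC(C=C, gamma=gamma), X, y, cv=3).mean()

pop = rng.uniform([-2.0, -9.0], [5.0, -2.0], size=(12, 2))   # [log C, log gamma]
for gen in range(10):
    scores = np.array([fitness(ind) for ind in pop])
    parents = pop[np.argsort(scores)[-6:]]                   # selection: best half
    pairs = parents[rng.integers(0, 6, size=(6, 2))]
    children = pairs.mean(axis=1) + rng.normal(0, 0.3, (6, 2))  # crossover + mutation
    pop = np.vstack([parents, children])

best = pop[np.argmax([fitness(ind) for ind in pop])]
print("best C, gamma:", np.exp(best))
```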

  7. Research on Emotion Recognition of Speech Signals Based on HMM and ANN

    Institute of Scientific and Technical Information of China (English)

    胡洋; 蒲南江; 吴黎慧; 高磊

    2011-01-01

    Speech emotion recognition is an important branch of speech recognition and a theoretical basis of harmonious human-computer interaction. Since a single classifier has limitations in speech emotion recognition, this paper proposes a method combining the Hidden Markov Model (HMM) and an Artificial Neural Network (ANN). For the six emotions of happiness, surprise, anger, sadness, fear and calm, one HMM is designed per emotion, yielding the best matching sequence for each emotion; an ANN is then used as a posterior classifier on the test samples, and the fusion of the two classifiers improves the speech emotion recognition rate. Experiments on an emotional speech database built by induced recording show a considerable improvement in the recognition rate.

  8. Speech Emotion Recognition Based on RBF Neural Network

    Institute of Scientific and Technical Information of China (English)

    张海燕; 唐建芳

    2011-01-01

    The principle of the radial basis function (RBF) neural network and its training algorithm are introduced, and a speech emotion recognition model based on the RBF neural network is established. In the recognition experiments, BP and RBF neural networks are compared in the same testing environment; the average recognition rate of the RBF network is 3% higher than that of the BP network. The results show that the RBF-network-based speech emotion recognition method is effective.
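    An RBF network of the kind described can be sketched as k-means centers, Gaussian hidden activations, and a linear readout solved in closed form (the dataset, center count and ridge regularizer below are illustrative assumptions):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)          # stand-in for emotion features
Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)

km = KMeans(n_clusters=40, n_init=10, random_state=0).fit(Xtr)  # RBF centers
width = np.mean(np.linalg.norm(Xtr - km.cluster_centers_[km.labels_], axis=1))

def hidden(X):
    """Gaussian activation of each sample with respect to each center."""
    d = np.linalg.norm(X[:, None, :] - km.cluster_centers_[None, :, :], axis=-1)
    return np.exp(-(d ** 2) / (2 * width ** 2))

H = hidden(Xtr)
T = np.eye(10)[ytr]                          # one-hot targets
W = np.linalg.solve(H.T @ H + 1e-3 * np.eye(H.shape[1]), H.T @ T)  # ridge readout

pred = hidden(Xte) @ W
print("accuracy:", np.mean(pred.argmax(1) == yte))
```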

  9. Multiresolution analysis (discrete wavelet transform) through Daubechies family for emotion recognition in speech.

    Science.gov (United States)

    Campo, D.; Quintero, O. L.; Bastidas, M.

    2016-04-01

    We propose a study of the mathematical properties of voice as an audio signal, including signals recorded under channel conditions that are not ideal for emotion recognition. Multiresolution analysis (the discrete wavelet transform) was performed with the Daubechies wavelet family (Db1/Haar, Db6, Db8, Db10), decomposing the initial audio signal into sets of coefficients from which features were extracted and analyzed statistically in order to differentiate emotional states. Artificial neural networks (ANNs) proved to classify these states appropriately. The study shows that the features extracted by wavelet decomposition are sufficient to analyze and extract the emotional content of audio signals, achieving a high classification accuracy without the need for classical frequency-time features. Accordingly, the paper mathematically characterizes the six basic human emotions, boredom, disgust, happiness, anxiety, anger and sadness, plus neutrality, for a total of seven states to identify.
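    A sketch of the wavelet feature extraction described (pywt is assumed; the band statistics and decomposition level are illustrative):

```python
import numpy as np
import pywt

def wavelet_features(signal, wavelet="db6", level=5):
    """Decompose a 1-D audio signal and summarize each coefficient band."""
    coeffs = pywt.wavedec(signal, wavelet, level=level)
    feats = []
    for band in coeffs:  # one approximation band plus `level` detail bands
        feats += [band.mean(), band.std(),
                  np.sum(band ** 2),                            # band energy
                  np.mean(np.abs(np.diff(np.sign(band)))) / 2]  # zero-crossing rate
    return np.array(feats)

rng = np.random.default_rng(0)
x = rng.normal(size=16000)        # stand-in for one second of speech at 16 kHz
print(wavelet_features(x).shape)  # (level + 1) bands x 4 statistics
```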

  11. Study of the vocal signal in the amplitude-time representation. Speech segmentation and recognition algorithms

    International Nuclear Information System (INIS)

    This dissertation presents an acoustical and phonetical study of the vocal signal. The complex pattern of the signal is segmented into simpler sub-patterns, each of which may in turn be segmented into still simpler, lower-level patterns. Pattern recognition techniques facilitate both this segmentation and the definition of the structural relations between the sub-patterns. In particular, we developed syntactic techniques in which context-sensitive rewriting rules are controlled by predicates over parameters evaluated on the sub-patterns themselves, generalizing a purely syntactic analysis by adding semantic information. The system presented performs pre-classification and partial identification of the phonemes, as well as accurate detection of each pitch period. The voice signal is analysed directly in the amplitude-time representation. The system has been implemented on a minicomputer and works in real time. (author)
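    The pitch-period detection described operates directly on the amplitude-time waveform; a generic time-domain autocorrelation sketch (not the dissertation's own algorithm) looks like this:

```python
import numpy as np

def pitch_period(frame, sr=16000, fmin=60, fmax=400):
    """Estimate the pitch period of a voiced frame by autocorrelation."""
    frame = frame - frame.mean()
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo, hi = sr // fmax, sr // fmin           # plausible lag range for voicing
    lag = lo + np.argmax(ac[lo:hi])
    return lag / sr                           # period in seconds

sr = 16000
t = np.arange(1024) / sr
frame = np.sin(2 * np.pi * 120 * t)           # synthetic 120 Hz voiced frame
print(f"{1 / pitch_period(frame, sr):.1f} Hz")  # approximately 120 Hz
```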

  12. Robust Speech Recognition Based on Vector Taylor Series

    Institute of Scientific and Technical Information of China (English)

    吕勇; 吴镇扬

    2011-01-01

    The vector Taylor series (VTS) expansion is an effective approach to noise-robust speech recognition. However, in the log-spectral domain there are strong correlations among the different channels of the Mel filter bank, so it is difficult to estimate the noise variance from noisy speech. This paper proposes a feature compensation algorithm in the cepstral domain based on the vector Taylor series. In this algorithm, the distribution of speech cepstral features is represented by a Gaussian mixture model (GMM), and the mean and variance of the noise are estimated from noisy speech by the VTS approximation. Experimental results show that the proposed algorithm significantly improves the performance of the speech recognition system and outperforms VTS-based feature compensation in the log-spectral domain.
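    For orientation, a compressed numpy sketch of the zeroth-order part of the cepstral-domain VTS mismatch function, y = x + C·log(1 + exp(C⁺(n − x))); the dimensions and the DCT/pseudo-inverse pairing are illustrative, and the full algorithm also linearizes this map to update variances:

```python
import numpy as np
from scipy.fft import dct, idct

D, M = 13, 24   # cepstral dimension and Mel filter-bank channels (illustrative)

def to_log_spec(c):
    """Pseudo-inverse DCT: truncated cepstra -> log filter-bank energies."""
    return idct(np.pad(c, (0, M - D)), norm="ortho")

def to_cepstra(l):
    """DCT: log filter-bank energies -> truncated cepstra."""
    return dct(l, norm="ortho")[:D]

def vts_noisy_mean(mu_x, mu_n):
    """Zeroth-order VTS mismatch: y = x + C log(1 + exp(C+(n - x)))."""
    diff = to_log_spec(mu_n) - to_log_spec(mu_x)
    return mu_x + to_cepstra(np.log1p(np.exp(diff)))

rng = np.random.default_rng(0)
mu_x = rng.normal(size=D)             # one clean-speech GMM component mean
mu_n = rng.normal(-2.0, 0.1, size=D)  # noise mean estimated from noisy speech
print(vts_noisy_mean(mu_x, mu_n).round(3))
```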

  13. Continuous Speech Recognition by Convolutional Neural Networks

    Institute of Scientific and Technical Information of China (English)

    张晴晴; 刘勇; 潘接林; 颜永红

    2015-01-01

    Convolutional neural networks (CNNs), which have succeeded in achieving translation invariance for many image processing tasks, are investigated for continuous speech recognition. Compared with deep neural networks (DNNs), which have proven successful in many speech recognition tasks, CNNs can reduce the model size significantly while achieving even better recognition accuracy. Experiments on the standard TIMIT corpus and on a large-vocabulary, speaker-independent conversational telephone speech corpus show that CNNs outperform DNNs in both accuracy and generalization ability.
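    A minimal PyTorch sketch of a CNN acoustic model of this flavor (the layer sizes, 11-frame context window, 40 mel bands and output count are illustrative assumptions):

```python
import torch
import torch.nn as nn

class SpeechCNN(nn.Module):
    """Tiny CNN over log-mel patches (frames x mel bands) -> state posteriors."""
    def __init__(self, n_classes=1000):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=(5, 5)), nn.ReLU(),
            nn.MaxPool2d((1, 3)),                  # pool along frequency only
            nn.Conv2d(32, 64, kernel_size=(3, 3)), nn.ReLU(),
        )
        self.head = nn.Sequential(
            nn.Flatten(), nn.LazyLinear(512), nn.ReLU(), nn.Linear(512, n_classes),
        )

    def forward(self, x):              # x: (batch, 1, frames, mel_bands)
        return self.head(self.conv(x))

x = torch.randn(8, 1, 11, 40)          # 11-frame context window, 40 mel bands
print(SpeechCNN()(x).shape)            # torch.Size([8, 1000])
```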

  14. Emotional Speech Recognition Based on PAD Emotion Model

    Institute of Scientific and Technical Information of China (English)

    宋静; 张雪英; 孙颖; 张卫

    2016-01-01

    Five feature extraction approaches, the Mel-frequency cepstral coefficients (MFCC), linear predictor coefficients (LPC), prosodic features, formant frequencies and the zero crossings with peak amplitudes (ZCPA), are described and applied to emotional speech recognition. From the recognition results, weight coefficients for the features are obtained by correlation analysis along the three dimensions of the PAD emotion model, and the recognition results are fused and mapped into the three-dimensional PAD emotion space, yielding PAD values for the emotional speech. These PAD values allow emotional speech to be described and analyzed under continuous emotion theory, and the quantitative analysis reveals the position of, and the relations between, emotional categories in the emotion space.

  15. Talking Speech Input.

    Science.gov (United States)

    Berliss-Vincent, Jane; Whitford, Gigi

    2002-01-01

    This article presents both the factors involved in successful speech input use and the potential barriers that may suggest that other access technologies could be more appropriate for a given individual. Available speech input options are reviewed, and strategies for optimizing the use of speech recognition technology are discussed. (Contains…

  16. Towards Automation 2.0: A Neurocognitive Model for Environment Recognition, Decision-Making, and Action Execution

    Directory of Open Access Journals (Sweden)

    Zucker Gerhard

    2011-01-01

    Full Text Available The ongoing penetration of building automation by information technology is by far not saturated. Today's systems need not only be reliable and fault tolerant, they also have to regard energy efficiency and flexibility in the overall consumption. Meeting the quality and comfort goals in building automation while at the same time optimizing towards energy, carbon footprint and cost-efficiency requires systems that are able to handle large amounts of information and negotiate system behaviour that resolves conflicting demands, i.e. a decision-making process. In recent years, research has started to focus on bionic principles for designing new concepts in this area. The information processing principles of the human mind have turned out to be of particular interest, as the mind is capable of processing huge amounts of sensory data and taking adequate decisions for (re)actions based on these analysed data. In this paper, we discuss how a bionic approach can solve the upcoming problems of energy-optimal systems. A recently developed model for environment recognition and decision-making processes, based on research findings from different disciplines of brain research, is introduced. This model is the foundation for applications in intelligent building automation that have to deal with information from home and office environments. All of these applications have in common that they consist of a combination of communicating nodes and have many, partly contradicting goals.

  17. Time-Frequency Feature Representation Using Multi-Resolution Texture Analysis and Acoustic Activity Detector for Real-Life Speech Emotion Recognition

    Directory of Open Access Journals (Sweden)

    Kun-Ching Wang

    2015-01-01

    Full Text Available The classification of emotional speech is widely considered in speech-related research on human-computer interaction (HCI). This paper presents a novel feature extraction based on multi-resolution texture image information (MRTII). The MRTII feature set is derived from multi-resolution texture analysis of the speech spectrogram for characterizing and classifying different emotions in a speech signal. The motivation is that emotions have different intensity values in different frequency bands; in terms of human visual perception, the texture of the multi-resolution spectrogram of emotional speech should therefore be a good feature set for emotion classification, and multi-resolution texture analysis discriminates between emotions more clearly than uniform-resolution analysis. To provide high accuracy of emotional discrimination, especially in real-life conditions, an acoustic activity detection (AAD) algorithm is applied within the MRTII-based feature extraction. Considering the presence of many blended emotions in real life, this paper makes use of two corpora of naturally-occurring dialogs recorded in real-life call centers. Compared with traditional Mel-scale frequency cepstral coefficients (MFCC) and state-of-the-art features, the MRTII features improve the correct classification rates of the proposed systems across different language databases. Experimental results show that the proposed MRTII-based feature information, inspired by human visual perception of the spectrogram image, provides significant classification gains for real-life emotional recognition in speech.
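    MRTII itself is not specified here; as a simplified stand-in, the flavor of spectrogram-texture features can be sketched with local binary patterns over a log-spectrogram (librosa and scikit-image assumed, and the audio clip is a stand-in):

```python
import numpy as np
import librosa
from skimage.feature import local_binary_pattern

def spectrogram_texture(y):
    """Texture histogram of a log-spectrogram (a simplified stand-in for MRTII)."""
    S = librosa.amplitude_to_db(np.abs(librosa.stft(y, n_fft=512)))
    img = ((S - S.min()) / (S.max() - S.min()) * 255).astype(np.uint8)
    lbp = local_binary_pattern(img, P=8, R=1, method="uniform")
    hist, _ = np.histogram(lbp, bins=10, range=(0, 10), density=True)
    return hist

y, sr = librosa.load(librosa.example("trumpet"))   # stand-in audio clip
print(spectrogram_texture(y).round(3))
```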

  18. Multi-Level Speech Emotion Recognition Based on Fisher Criterion and SVM

    Institute of Scientific and Technical Information of China (English)

    陈立江; 毛峡; Mitsuru ISHIZUKA

    2012-01-01

    To address speaker-independent emotion recognition, a multi-level speech emotion recognition system is proposed to classify six speech emotions, sadness, anger, surprise, fear, happiness and disgust, from coarse to fine. The key point is that the emotions separated at each layer are closely related to the emotional features of speech. For each level, appropriate features are selected from 288 candidates by the Fisher ratio, which is also used as an input parameter for training the support vector machine (SVM) classifier of that layer. Based on the Beihang emotional speech database and the Berlin emotional speech database, four comparative experiments are designed: Fisher+SVM, PCA+SVM, Fisher+ANN and PCA+ANN, where principal component analysis (PCA) is used for dimension reduction and an artificial neural network (ANN) for classification. The experimental results show that the Fisher criterion is better than PCA for dimension reduction and that SVM generalizes better than ANN for speaker-independent speech emotion recognition. Similar results on the two databases suggest that the classification model has a degree of cross-cultural adaptability.
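    A sketch of Fisher-ratio feature selection feeding an SVM (a multi-class variant of the two-class Fisher ratio; the dataset and the number of retained features are stand-ins):

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import load_digits
from sklearn.model_selection import cross_val_score

X, y = load_digits(return_X_y=True)         # stand-in for 288 candidate features

def fisher_ratio(X, y):
    """Variance of class means over mean within-class variance, per feature."""
    classes = np.unique(y)
    means = np.array([X[y == c].mean(0) for c in classes])
    variances = np.array([X[y == c].var(0) for c in classes])
    return means.var(0) / (variances.mean(0) + 1e-12)

top = np.argsort(fisher_ratio(X, y))[-30:]  # keep the 30 best-separating features
print("all features   :", cross_val_score(SVC(), X, y, cv=5).mean())
print("Fisher-selected:", cross_val_score(SVC(), X[:, top], y, cv=5).mean())
```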

  19. Speech Recognition Algorithm Based on Artificial Bee Colony Modified DHMM

    Institute of Scientific and Technical Information of China (English)

    宁爱平; 张雪英

    2012-01-01

    To address the LBG algorithm's dependence on the initial codebook and its tendency to fall into local optima in discrete hidden Markov model (DHMM) speech recognition systems, a modified DHMM method for isolated-word recognition is proposed in which the artificial bee colony (ABC) algorithm performs vector quantization of the speech feature vectors to obtain an optimal codebook. Feature vectors are first extracted from the speech signal; in the ABC algorithm, each food source represents a codebook, and the optimal codebook is obtained by iterating the initial codebook with the bee-colony evolution scheme. The codeword labels of the optimal codebook are then fed into the DHMM for training and recognition. Experimental results show that the ABC-modified DHMM method achieves a higher recognition rate and better robustness than DHMM using the traditional LBG algorithm or LBG with a particle-swarm-optimized initial codebook.
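    A compact sketch of ABC-style codebook optimization (fitness is total quantization distortion; the onlooker phase is omitted for brevity and all sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=(2000, 13))          # stand-in speech feature vectors
K, n_sources, limit = 16, 10, 5             # codebook size, food sources, scout limit

def distortion(cb):
    """Total quantization error of the data under a codebook (lower is better)."""
    d = np.linalg.norm(data[:, None, :] - cb[None, :, :], axis=-1)
    return d.min(axis=1).sum()

sources = rng.normal(size=(n_sources, K, 13))   # each food source = one codebook
trials = np.zeros(n_sources)

for it in range(100):
    for i in range(n_sources):                  # employed-bee phase
        j = rng.integers(n_sources)
        cand = sources[i] + rng.uniform(-1, 1) * (sources[i] - sources[j])
        if distortion(cand) < distortion(sources[i]):
            sources[i], trials[i] = cand, 0
        else:
            trials[i] += 1
        if trials[i] > limit:                   # scout phase: abandon and restart
            sources[i] = rng.normal(size=(K, 13))
            trials[i] = 0

best = min(sources, key=distortion)
print("final distortion:", distortion(best))
```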

  1. Speech Emotion Recognition Based on BP Neural Network

    Institute of Scientific and Technical Information of China (English)

    徐照松; 元建

    2014-01-01

    With the rapid development of technology, human-computer interaction is receiving more and more attention, and speech emotion recognition is a research hotspot. In this article, a BP neural network algorithm is applied to speech emotion recognition and evaluated on a Chinese emotional data set; the recognition accuracy reaches 91.5%, a 5% improvement over the classification accuracy of the SVM algorithm.

  2. Design of a Self-Balancing Robot System Based on Speech Recognition

    Institute of Scientific and Technical Information of China (English)

    王洪涛; 宋一标; 陈水标

    2014-01-01

    A two-wheel self-balancing robot controlled by speech recognition was designed around an STC11L04E and a Freescale Kinetis 60. The speech recognition module consists mainly of the STC11L04E and an LD3320; using a real-time speech recognition algorithm, speaker-independent voice control of the robot's movements is achieved. To realize the self-balancing function, a three-axis accelerometer (MMA7260) and a gyroscope (ENC-03M) acquire acceleration and angular velocity values in real time, and the master control chip (Freescale Kinetis 60) fuses the data to determine the robot's attitude. The master chip drives the motors forward and in reverse through a PID algorithm to keep the robot in a stable upright state, with motor speed fed back in real time by infrared sensor pairs. Experiments show that the robot balances itself quickly and stably, and can move forward and backward, turn left and right, and accelerate and decelerate according to speech commands.
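    The balance loop described is a textbook PID controller; a minimal sketch (the gains and the plant response are made up for illustration, not the robot's tuning):

```python
class PID:
    """Textbook PID controller over a fixed time step."""
    def __init__(self, kp, ki, kd, dt):
        self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
        self.integral, self.prev_err = 0.0, 0.0

    def step(self, setpoint, measured):
        err = setpoint - measured
        self.integral += err * self.dt
        deriv = (err - self.prev_err) / self.dt
        self.prev_err = err
        return self.kp * err + self.ki * self.integral + self.kd * deriv

pid = PID(kp=25.0, ki=0.5, kd=1.2, dt=0.01)
tilt = 5.0                        # degrees, from fused accelerometer/gyro data
for _ in range(3):
    u = pid.step(0.0, tilt)       # drive motors toward an upright (0 deg) attitude
    tilt *= 0.5                   # stand-in for the plant response
    print(f"motor command {u:+.1f}, tilt {tilt:.2f}")
```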

  3. Study of the Ability of the Articulation Index (AI) for Predicting the Unaided and Aided Speech Recognition Performance of 25- to 65-Year-Old Hearing-Impaired Adults

    Directory of Open Access Journals (Sweden)

    Ghasem Mohammad Khani

    2001-05-01

    Full Text Available Background: In recent years there has been increased interest in the use of the AI for assessing hearing handicap and for measuring the potential effectiveness of amplification systems. The AI is an expression of the proportion of the average speech signal that is audible to a given patient, and it can vary between 0.0 and 1.0. Method and Materials: This cross-sectional analytical study was carried out in the department of audiology, rehabilitation faculty, IUMS, from 31 Oct 1998 to 7 March 1999, on 40 normal-hearing persons (80 ears; 19 males and 21 females) and 40 hearing-impaired persons (61 ears; 36 males and 25 females), 25-65 years old, with moderate to moderately severe SNHL. The Pavlovic procedure (1988) for calculating the AI, open-set taped standard monosyllabic word lists, and a real-ear probe-tube microphone system to measure insertion gain were used, through test-retest. Results: 1) A significant correlation was shown between the AI scores and the speech recognition scores of the normal-hearing and hearing-impaired groups with and without the hearing aid (P<0.05); 2) there were no significant differences across age groups and sex; 3) there were no significant differences in test-retest measures of the insertion gain in each test; and 4) no significant difference in test-retest speech recognition scores. Conclusion: According to these results the AI can predict unaided and aided monosyllabic recognition scores very well, and age and sex variables do not affect its ability. Given the high reliability of AI results and its simplicity, ease of use, low cost and short calculation time, wide use of the AI is recommended, especially in clinical settings.
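    The AI itself is a weighted sum of band audibilities; a sketch of the band calculation (the weights and levels below are made-up numbers for illustration, not Pavlovic's published values):

```python
import numpy as np

# Illustrative octave-band articulation index:
# AI = sum over bands of (importance weight x audibility), audibility in [0, 1].
bands_hz     = [250, 500, 1000, 2000, 4000]
importance   = np.array([0.10, 0.20, 0.25, 0.25, 0.20])   # sums to 1.0
speech_peaks = np.array([55.0, 60.0, 58.0, 52.0, 48.0])   # dB HL, speech spectrum
thresholds   = np.array([30.0, 35.0, 45.0, 60.0, 70.0])   # patient's pure-tone thresholds

audibility = np.clip((speech_peaks - thresholds) / 30.0, 0.0, 1.0)  # 30 dB speech range
ai = float(importance @ audibility)
print(f"AI = {ai:.2f}")   # 0.0 = nothing audible, 1.0 = full speech band audible
```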

  4. Automated Facial Expression Recognition Using Gradient-Based Ternary Texture Patterns

    Directory of Open Access Journals (Sweden)

    Faisal Ahmed

    2013-01-01

    Full Text Available Recognition of human expression from facial images is an interesting research area, which has received increasing attention in recent years. A robust and effective facial feature descriptor is the key to designing a successful expression recognition system. Although much progress has been made, deriving a face feature descriptor that performs consistently under changing environments is still a difficult and challenging task. In this paper, we present the gradient local ternary pattern (GLTP), a discriminative local texture feature for representing facial expression. The proposed GLTP operator encodes the local texture of an image by computing the gradient magnitudes of the local neighborhood and quantizing those values into three discrimination levels. The location and occurrence information of the resulting micropatterns is then used as the face feature descriptor. The performance of the proposed method has been evaluated for the person-independent face expression recognition task. Experiments with prototypic expression images from the Cohn-Kanade (CK) face expression database validate that the GLTP feature descriptor can effectively encode the facial texture and thus achieves better recognition performance than some well-known appearance-based facial features.
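    A simplified reading of the GLTP idea: gradient magnitudes ternary-coded against a threshold and histogrammed (tau, the neighborhood coding and the bin count are assumptions, not the paper's exact operator):

```python
import numpy as np
from scipy import ndimage

def gradient_ternary_hist(img, tau=10.0):
    """Histogram of ternary-coded gradient magnitudes in 3x3 neighborhoods."""
    gx = ndimage.sobel(img.astype(float), axis=1)
    gy = ndimage.sobel(img.astype(float), axis=0)
    grad = np.hypot(gx, gy)
    codes = np.zeros(img.shape, dtype=int)
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]
    for k, (dy, dx) in enumerate(offsets):
        neigh = np.roll(np.roll(grad, dy, axis=0), dx, axis=1)
        tern = np.where(neigh > grad + tau, 2, np.where(neigh < grad - tau, 0, 1))
        codes += tern * 3 ** k              # base-3 encoding of the 8 neighbors
    hist, _ = np.histogram(codes, bins=64, range=(0, 3 ** 8))
    return hist / hist.sum()

rng = np.random.default_rng(0)
print(gradient_ternary_hist(rng.integers(0, 255, (64, 64))).round(3))
```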

  5. Speech Emotion Recognition Based on Rough Set and ANN

    Institute of Scientific and Technical Information of China (English)

    曾光菊

    2011-01-01

    Speech emotion recognition extracts effective acoustic features from speech signals and recognizes the speaker's emotional state by intelligent computation. Domestic and foreign research on emotional speech databases, feature extraction and recognition methods is reviewed; from this, feature extraction is found to have an important effect on the recognition rate. In this work, 1050 sentences were recorded and 30 features extracted from each sentence, forming a 1050x30 database. The information-consistency notion of rough set theory is applied to reduce the 30 features of the database to 12, and an artificial neural network is then used to recognize the emotional state of 525 sentences, attaining a highest recognition rate of 84%. The results show that recognizing different emotions with different methods gives better results.

  6. Automated inspection of micro-defect recognition system for color filter

    Science.gov (United States)

    Jeffrey Kuo, Chung-Feng; Peng, Kai-Ching; Wu, Han-Cheng; Wang, Ching-Chin

    2015-07-01

    This study focused on micro-defect recognition and classification in color filters. First, six types of defects were examined, namely grain, black matrix hole (BMH), indium tin oxide (ITO) defect, missing edge and shape (MES), highlights, and particle. Orthogonal projection was applied to locate each pixel in a test image. Then, an image comparison was performed to mark similar blocks on the test image, and the block that best resembled the template was chosen as the new template (the matching adaptive template). Afterwards, image subtraction was applied to subtract the pixels at the same location in each block of the test image from the matching adaptive template. The control limit law employed logic operations to separate the defect from the background region, and the complete defect structure was obtained by the morphology method. Next, feature values, including defect gray value, red, green, and blue (RGB) color components, and aspect ratio, were obtained as the classifier input. The experimental results showed that defect recognition could be completed in as little as 0.154 s using the proposed recognition system and software. In micro-defect classification, a back-propagation neural network (BPNN) and a minimum distance classifier (MDC) served as the classification decision theories for the five acquired feature values. To validate the proposed system, this study used 41 defects as training samples and treated the feature values of 307 test samples as the BPNN classifier inputs; the total recognition rate was 93.7%. When an MDC was used, the total recognition rate was 96.8%, indicating that the MDC method is feasible for applying automatic optical inspection technology to classify micro-defects in color filters. The proposed system is shown to improve production yield and lower costs.
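    The MDC used above is simply nearest-class-mean classification; a minimal sketch with illustrative five-dimensional defect features:

```python
import numpy as np

class MinimumDistanceClassifier:
    """Assign each sample to the class with the nearest mean feature vector."""
    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.means_ = np.array([X[y == c].mean(axis=0) for c in self.classes_])
        return self

    def predict(self, X):
        d = np.linalg.norm(X[:, None, :] - self.means_[None, :, :], axis=-1)
        return self.classes_[d.argmin(axis=1)]

# Illustrative 5-D defect features (gray value, R, G, B, aspect ratio)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(m, 0.5, (50, 5)) for m in (0, 2, 4)])
y = np.repeat(["grain", "BMH", "particle"], 50)
mdc = MinimumDistanceClassifier().fit(X, y)
print("training accuracy:", np.mean(mdc.predict(X) == y))
```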

  7. A preliminary study on automated freshwater algae recognition and classification system

    OpenAIRE

    Mosleh Mogeeb AA; Manssor Hayat; Malek Sorayya; Milow Pozi; Salleh Aishah

    2012-01-01

    Abstract Background Freshwater algae can be used as indicators to monitor freshwater ecosystem condition. Algae react quickly and predictably to a broad range of pollutants and thus provide early signals of a worsening environment. This study was carried out to develop a computer-based image processing technique to automatically detect, recognize, and identify algae genera from the divisions Bacillariophyta, Chlorophyta and Cyanobacteria in Putrajaya Lake. The literature shows that most automated...

  8. Advances in speech processing

    Science.gov (United States)

    Ince, A. Nejat

    1992-10-01

    The field of speech processing is undergoing rapid growth in terms of both performance and applications, fueled by advances in microelectronics, computation, and algorithm design. The use of voice for civil and military communications is discussed, considering advantages and disadvantages including the effects of environmental factors such as acoustic and electrical noise, interference, and propagation. The structure of the existing NATO communications network and the evolving Integrated Services Digital Network (ISDN) concept are briefly reviewed to show how they meet present and future requirements. The paper then deals with the fundamental subject of speech coding and compression. Recent advances in techniques and algorithms for speech coding now permit high-quality voice reproduction at remarkably low bit rates. The subject of speech synthesis is treated next, where the principal objective is to produce natural-quality synthetic speech from unrestricted text input. Speech recognition, where the ultimate objective is to produce a machine which would understand conversational speech with unrestricted vocabulary from essentially any talker, is also discussed. Algorithms for speech recognition can be characterized broadly as pattern recognition approaches and acoustic-phonetic approaches. To date, the greatest degree of success in speech recognition has been obtained using pattern recognition paradigms, and it is for this reason that the paper is concerned primarily with this technique.

  9. Automation of the novel object recognition task for use in adolescent rats

    OpenAIRE

    Silvers, Janelle M; Harrod, Steven B.; Mactutus, Charles F.; Booze, Rosemarie M.

    2007-01-01

    The novel object recognition task is gaining popularity for its ability to test a complex behavior which relies on the integrity of memory and attention systems without placing undue stress upon the animal. While the task places few requirements upon the animal, it traditionally requires the experimenter to observe the test phase directly and record behavior. This approach can severely limit the number of subjects which can be tested in a reasonable period of time, as training and testing occ...

  10. Comparing the effects of reverberation and of noise on speech recognition in simulated electric-acoustic listening

    OpenAIRE

    Helms Tillery, Kate; Brown, Christopher A; Bacon, Sid P.

    2012-01-01

    Cochlear implant users report difficulty understanding speech in both noisy and reverberant environments. Electric-acoustic stimulation (EAS) is known to improve speech intelligibility in noise. However, little is known about the potential benefits of EAS in reverberation, or about how such benefits relate to those observed in noise. The present study used EAS simulations to examine these questions. Sentences were convolved with impulse responses from a model of a room whose estimated reverbe...

  11. Multilevel Analysis in Analyzing Speech Data

    Science.gov (United States)

    Guddattu, Vasudeva; Krishna, Y.

    2011-01-01

    The speech produced by human vocal tract is a complex acoustic signal, with diverse applications in phonetics, speech synthesis, automatic speech recognition, speaker identification, communication aids, speech pathology, speech perception, machine translation, hearing research, rehabilitation and assessment of communication disorders and many…

  12. Forensic speaker recognition

    NARCIS (Netherlands)

    Meuwly, Didier

    2009-01-01

    The aim of forensic speaker recognition is to establish links between individuals and criminal activities, through audio speech recordings. This field is multidisciplinary, combining predominantly phonetics, linguistics, speech signal processing, and forensic statistics. On these bases, expert-based

  13. Speech Emotion Recognition Based on ABC-Optimized MVDR

    Institute of Scientific and Technical Information of China (English)

    孙志锋

    2016-01-01

    Extracting and selecting speech emotion features is a crucial problem in speech emotion recognition. To address the shortcomings of the linear prediction (LP) model for the spectral envelope of emotional speech, this paper proposes extracting speech emotion features with the minimum variance distortionless response (MVDR) spectrum method. To eliminate redundant information, the artificial bee colony (ABC) algorithm is used to find an optimal subset of the features. Four speech emotions, angry, neutral, happy and fear, from the CASIA Chinese emotion corpus are then recognized with a radial basis function (RBF) neural network. The results show that the approach achieves a higher recognition rate and better robustness than linear prediction.

  14. Automated recognition and tracking of aerosol threat plumes with an IR camera pod

    Science.gov (United States)

    Fauth, Ryan; Powell, Christopher; Gruber, Thomas; Clapp, Dan

    2012-06-01

    Protection of fixed sites from chemical, biological, or radiological aerosol plume attacks depends on early warning so that there is time to take mitigating actions. Early warning requires continuous, autonomous, and rapid coverage of large surrounding areas; however, this must be done at an affordable cost. Once a potential threat plume is detected, a different type of sensor (e.g., a more expensive, slower sensor) may be cued for identification purposes, but the problem is to quickly identify all of the potential threats around the fixed site of interest. To address this problem of low-cost, persistent, wide-area surveillance, an IR camera pod and multi-image stitching and processing algorithms have been developed for automatic recognition and tracking of aerosol plumes. A rugged, modular, static pod design, which accommodates as many as four micro-bolometer IR cameras for 45° to 180° of azimuth coverage, is presented. Various OpenCV-based image-processing algorithms, including stitching of multiple adjacent FOVs, recognition of aerosol plume objects, and tracking of aerosol plumes, are presented using process block diagrams and sample field test results, including chemical and biological simulant plumes. Methods for dealing with background removal, brightness equalization between images, and focus quality for optimal plume tracking are also discussed.
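    Not the authors' pipeline, but a generic OpenCV sketch of the recognize-and-track stages they describe (background subtraction, morphology, contour tracking; the video path is hypothetical):

```python
import cv2
import numpy as np

cap = cv2.VideoCapture("ir_camera_pod.mp4")          # hypothetical frame source
bg = cv2.createBackgroundSubtractorMOG2(history=200, varThreshold=25)

while True:
    ok, frame = cap.read()
    if not ok:
        break
    mask = bg.apply(frame)                           # plume = moving foreground
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, np.ones((5, 5), np.uint8))
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    for c in contours:
        if cv2.contourArea(c) > 500:                 # ignore small noise blobs
            x, y, w, h = cv2.boundingRect(c)
            cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 0, 255), 2)
    cv2.imshow("plume tracking", frame)
    if cv2.waitKey(1) == 27:                         # Esc to quit
        break
cap.release()
```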

  15. Speech Emotion Recognition Based on Phase Space Reconstruction

    Institute of Scientific and Technical Information of China (English)

    叶吉祥; 陈鑫

    2014-01-01

    To characterize the emotional state of speech more completely, and to make up for the inadequacy of linear feature parameters in depicting different emotion types, phase space reconstruction theory is introduced into speech emotion recognition. By analyzing the chaotic characteristics of different emotional states, the Kolmogorov entropy and the correlation dimension are extracted as new emotional feature parameters and combined with traditional acoustic features in a support vector machine (SVM) for speech emotion recognition. The results show that recognition accuracy improves when the chaotic parameters are included, compared with schemes using only traditional features, providing a new research approach for speech emotion recognition.
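    A sketch of the phase-space (time-delay) embedding and the correlation sum from which the correlation dimension is estimated (the delay, dimension and radii are illustrative):

```python
import numpy as np

def embed(x, dim=3, tau=8):
    """Time-delay (Takens) embedding of a scalar series into phase space."""
    n = len(x) - (dim - 1) * tau
    return np.column_stack([x[i * tau : i * tau + n] for i in range(dim)])

def correlation_sum(Y, r):
    """Fraction of point pairs closer than r (basis of the correlation dimension)."""
    d = np.linalg.norm(Y[:, None, :] - Y[None, :, :], axis=-1)
    iu = np.triu_indices(len(Y), k=1)
    return np.mean(d[iu] < r)

x = np.sin(0.3 * np.arange(800)) + 0.05 * np.random.default_rng(0).normal(size=800)
Y = embed(x)
# The correlation dimension is the slope of log C(r) vs log r at small r.
for r in (0.1, 0.2, 0.4):
    print(r, correlation_sum(Y, r))
```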

  16. Analyzing the relevance of shape descriptors in automated recognition of facial gestures in 3D images

    Science.gov (United States)

    Rodriguez A., Julian S.; Prieto, Flavio

    2013-03-01

    This paper presents and explains the results of analyzing two shape descriptors (DESIRE and the spherical spin image) for facial gesture recognition in 3D images. DESIRE is a descriptor composed of depth images, silhouettes and rays extended from a polygonal mesh, whereas the spherical spin image (SSI) associated with a polygonal mesh point is a 2D histogram built from neighboring points using position information that captures features of the local shape. The database used contains facial expression images which on average were recognized at 88.16% using a neural network and 91.11% with a Bayesian classifier in the case of the first descriptor; in contrast, the second descriptor only achieves on average 32% and 23.6% with the same classifiers, respectively.

  17. Modeling and Simulation of Speech Emotion Recognition Based on Process Neural Network

    Institute of Scientific and Technical Information of China (English)

    叶吉祥; 陈晋芳

    2014-01-01

    To overcome the low recognition accuracy caused by the defects of traditional speech emotion recognition models, process neural networks are introduced into speech emotion recognition. Fundamental frequency, amplitude and sound quality parameters are extracted as speech emotion features; wavelet analysis is used to reduce noise and principal component analysis (PCA) to eliminate redundancy; and a process neural network then recognizes the four emotions of anger, happiness, sadness and surprise. Experimental results show that the process neural network achieves better recognition of the four emotions than traditional recognition models.

  18. Individual Differences in Language Ability Are Related to Variation in Word Recognition, Not Speech Perception: Evidence from Eye Movements

    Science.gov (United States)

    McMurray, Bob; Munson, Cheyenne; Tomblin, J. Bruce

    2014-01-01

    Purpose: The authors examined speech perception deficits associated with individual differences in language ability, contrasting auditory, phonological, or lexical accounts by asking whether lexical competition is differentially sensitive to fine-grained acoustic variation. Method: Adolescents with a range of language abilities (N = 74, including…

  19. Application of Hilbert Marginal Energy Spectrum in Speech Emotion Recognition

    Institute of Scientific and Technical Information of China (English)

    叶吉祥; 胡海翔

    2014-01-01

    Emotional feature extraction plays an important role in speech emotion recognition. Owing to the limitations of traditional signal processing methods, traditional acoustic features, especially frequency-domain features, cannot precisely reflect the emotional characteristics of speech, which leads to low recognition rates. This paper proposes a new method: the Hilbert-Huang transform (HHT) is first applied to the speech signal to obtain its Hilbert marginal energy spectrum; a comparative analysis of the marginal energy spectra of different emotions on the Mel scale then yields a new set of emotional features consisting of the Mel-frequency marginal energy coefficients (MFEC), the Mel-frequency sub-band spectral centroid (MSSC) and the Mel-frequency sub-band spectral flatness (MSSF). Five speech emotions, sadness, happiness, boredom, anger and neutral, are recognized with a support vector machine (SVM). The experimental results show that the new emotional features extracted by this method give good recognition performance.
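    A sketch of computing a Hilbert marginal energy spectrum via empirical mode decomposition (the PyEMD package is assumed as the EMD implementation; the signal and binning are illustrative):

```python
import numpy as np
from scipy.signal import hilbert
from PyEMD import EMD            # assumed EMD implementation (pip: EMD-signal)

sr = 16000
t = np.arange(sr) / sr
x = np.sin(2*np.pi*120*t) + 0.5*np.sin(2*np.pi*800*t)  # stand-in speech-like signal

imfs = EMD().emd(x)                                    # empirical mode decomposition
inst_freq, energy = [], []
for imf in imfs:
    analytic = hilbert(imf)
    phase = np.unwrap(np.angle(analytic))
    inst_freq.append(np.diff(phase) * sr / (2 * np.pi))  # instantaneous frequency
    energy.append(np.abs(analytic)[1:] ** 2)             # instantaneous energy

# Marginal energy spectrum: accumulate energy over time per frequency bin
bins = np.linspace(0, 2000, 101)
marginal, _ = np.histogram(np.concatenate(inst_freq), bins=bins,
                           weights=np.concatenate(energy))
print("strongest bin center (Hz):", bins[marginal.argmax()] + 10)
```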

  20. The effect of using integrated signal processing hearing aids on the speech recognition abilities of hearing impaired Arabic-speaking children

    Directory of Open Access Journals (Sweden)

    Somaia Tawfik

    2014-11-01

    Results and conclusions: Significant improvement in aided sound-field threshold levels and speech recognition in noise tests was recorded with ISP HAs over time. Regarding consonant manner, glides and stop consonants showed the greatest improvement. Though voiced and voiceless consonants were equally well transmitted through digital HAs, voiced consonants were easier to perceive using ISP HAs. Middle and back consonants were easier to perceive than front consonants using both HAs. The WILSI self-assessment questionnaire revealed that parents reported better performance in different listening situations. In conclusion, the results of the present study support the use of ISP HAs in children with moderate to severe hearing loss, given the significant improvement recorded in both subjective and objective measures.

  1. Assessing the Performance of Automatic Speech Recognition Systems When Used by Native and Non-Native Speakers of Three Major Languages in Dictation Workflows

    DEFF Research Database (Denmark)

    Zapata, Julián; Kirkedal, Andreas Søeborg

    2015-01-01

    In this paper, we report on a two-part experiment aiming to assess and compare the performance of two types of automatic speech recognition (ASR) systems on two different computational platforms when used to augment dictation workflows. The experiment was performed with a sample of speakers of three major languages and with different linguistic profiles: non-native English speakers; non-native French speakers; and native Spanish speakers. The main objective of this experiment is to examine ASR performance in translation dictation (TD) and medical dictation (MD) workflows without manual transcription vs. with transcription. We discuss the advantages and drawbacks of a particular ASR approach in different computational platforms when used by various speakers of a given language, who may have different accents and levels of proficiency in that language, and who may have different levels

  2. Research on Speech Emotion Recognition Based on Emotion Feature Classification

    Institute of Scientific and Technical Information of China (English)

    周晓凤; 肖南峰; 文翰

    2012-01-01

    Since speech signals are highly real-time and uncertain, evidence trust entropy and dynamic prior weights are proposed to improve the basic probability assignment function of traditional Dempster-Shafer (D-S) evidence theory. Since different emotion features recognize different emotional states with different effectiveness, a classification of the speech emotion features is also proposed. Using the recognition results of each feature class, the improved D-S theory performs decision-level data fusion, realizing speech emotion recognition based on multiple classes of emotion features and thus achieving fine-grained recognition. Simulations verify the fast convergence and anti-interference of the improved algorithm, and comparative experimental results demonstrate that the classified emotion features are effective and stable.
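    The decision-level fusion rests on Dempster's rule of combination; a minimal sketch restricted to singleton hypotheses (the paper's trust-entropy and dynamic-weight modifications are omitted, and the masses are illustrative):

```python
import numpy as np

def dempster_combine(m1, m2):
    """Combine two basic probability assignments over singleton hypotheses.
    m1, m2: dicts mapping emotion label -> mass (masses sum to 1)."""
    combined, conflict = {}, 0.0
    for a, pa in m1.items():
        for b, pb in m2.items():
            if a == b:
                combined[a] = combined.get(a, 0.0) + pa * pb
            else:
                conflict += pa * pb          # mass assigned to incompatible pairs
    if conflict >= 1.0:
        raise ValueError("total conflict; sources cannot be combined")
    return {k: v / (1.0 - conflict) for k, v in combined.items()}

# Posterior-like masses from two feature classes (illustrative numbers)
prosodic = {"angry": 0.6, "happy": 0.3, "sad": 0.1}
spectral = {"angry": 0.5, "happy": 0.2, "sad": 0.3}
print(dempster_combine(prosodic, spectral))
```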

  3. Speech Emotion Recognition Based on Multifractal Analysis

    Institute of Scientific and Technical Information of China (English)

    叶吉祥; 王聪慧

    2012-01-01

    To overcome the inadequacy of linear parameters in characterizing different types of speech emotion, multifractal theory is introduced into speech emotion recognition. By analyzing the multifractal features of speech under different emotional states, the multifractal spectrum parameters and the generalized Hurst exponent are extracted as new emotional feature parameters and combined with traditional acoustic features in a support vector machine (SVM) for speech emotion recognition. The results show that, with the nonlinear parameters included, the accuracy and stability of the recognition system improve effectively compared with methods using only traditional linear features, providing a new idea for speech emotion recognition.

  4. Research and Design of a Parallel Speech Recognition System

    Institute of Scientific and Technical Information of China (English)

    王硕; 刘文

    2012-01-01

    Handling large volumes of voice data is an important problem in speech recognition applications. Parallel computing can replace traditional standalone processing, but if the parallel scheduling is not controlled well, the merged result will be assembled in the wrong order; if the data are split unreasonably, semantic continuity is lost and accuracy declines; and the cost of transmitting file fragments over the network must also be considered. To solve these problems, a speech recognition system based on Hadoop is proposed that uses HDFS and MapReduce to handle file fragment transmission and parallel scheduling control, and introduces a silence detection algorithm to split files sensibly. Experiments verify the effectiveness of the system.
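    A toy version of the silence-detection split described above (the frame size, energy threshold and minimum silence run are illustrative):

```python
import numpy as np

def split_on_silence(samples, sr, frame_ms=20, threshold=0.02, min_silence_ms=200):
    """Cut an audio stream at long low-energy runs so no chunk splits a word."""
    frame = int(sr * frame_ms / 1000)
    n_frames = len(samples) // frame
    energy = np.array([np.sqrt(np.mean(samples[i*frame:(i+1)*frame] ** 2))
                       for i in range(n_frames)])
    silent = energy < threshold
    min_run = min_silence_ms // frame_ms
    cuts, run = [], 0
    for i, s in enumerate(silent):
        run = run + 1 if s else 0
        if run == min_run:                  # middle of a long silence: safe cut
            cuts.append((i - min_run // 2) * frame)
    bounds = [0] + cuts + [len(samples)]
    return [samples[a:b] for a, b in zip(bounds, bounds[1:])]

rng = np.random.default_rng(0)
speech = np.concatenate([rng.normal(0, 0.3, 16000), np.zeros(8000),
                         rng.normal(0, 0.3, 16000)])
print([len(c) for c in split_on_silence(speech, sr=16000)])
```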

  5. Dimensional Feature Extraction and Recognition of Speech Emotion

    Institute of Scientific and Technical Information of China (English)

    李嘉; 黄程韦; 余华

    2012-01-01

    The relation between the emotion dimension space and speech features is studied, and automatic speech emotion recognition is addressed. A dimensional space model of basic emotions is introduced, and speech emotion features are extracted according to the arousal and valence dimensions, with global statistic features used to reduce the influence of text variation on the emotional features. Anger, happiness, sadness and the neutral state are studied, with a Gaussian mixture model (GMM) adopted for modeling and recognizing the four emotion categories; the Gaussian mixture number is optimized experimentally to fit the probability distribution of the four categories in the feature space. The experimental results show that the chosen features are suitable for recognizing basic emotions, that the GMM achieves satisfactory classification results, and that the valence-dimension features of the two-dimensional emotion space play a particularly important role in speech emotion recognition.
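    A sketch of the one-GMM-per-emotion scheme (scikit-learn assumed; the dataset is a stand-in and the mixture count is the quantity the paper tunes experimentally):

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)          # stand-in for emotion feature vectors
Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)

# One GMM per class, scored by log-likelihood at prediction time
gmms = {c: GaussianMixture(n_components=4, covariance_type="diag",
                           random_state=0).fit(Xtr[ytr == c])
        for c in np.unique(ytr)}

def predict(X):
    scores = np.column_stack([gmms[c].score_samples(X) for c in sorted(gmms)])
    return np.array(sorted(gmms))[scores.argmax(axis=1)]

print("accuracy:", np.mean(predict(Xte) == yte))
```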

  6. Application of a model of the auditory primal sketch to cross-linguistic differences in speech rhythm: Implications for the acquisition and recognition of speech

    Science.gov (United States)

    Todd, Neil P. M.; Lee, Christopher S.

    2002-05-01

    It has long been noted that the world's languages vary considerably in their rhythmic organization. Different languages seem to privilege different phonological units as their basic rhythmic unit, and there is now a large body of evidence that such differences have important consequences for crucial aspects of language acquisition and processing. The most fundamental finding is that the rhythmic structure of a language strongly influences the process of spoken-word recognition. This finding, together with evidence that infants are sensitive from birth to rhythmic differences between languages and exploit rhythmic cues to segmentation at an earlier developmental stage than other cues, prompted the claim that rhythm is the key which allows infants to begin building a lexicon and then go on to acquire syntax. It is therefore of interest to determine how differences in rhythmic organization arise at the acoustic/auditory level. In this paper, it is shown how an auditory model of the primitive representation of sound provides just such an account of rhythmic differences. Its performance is evaluated on a data set of French and English sentences and compared with the results yielded by the phonetic accounts of Frank Ramus and his colleagues and Esther Grabe and her colleagues.

  7. NICT/ATR Chinese-Japanese-English Speech-to-Speech Translation System

    Institute of Scientific and Technical Information of China (English)

    Tohru Shimizu; Yutaka Ashikari; Eiichiro Sumita; ZHANG Jinsong; Satoshi Nakamura

    2008-01-01

    This paper describes the latest version of the Chinese-Japanese-English handheld speech-to-speech translation system developed by NICT/ATR, which is now ready to be deployed for travelers. With the entire speech-to-speech translation function implemented in a single terminal, it realizes real-time, location-free speech-to-speech translation. A new noise-suppression technique notably improves the speech recognition performance. Corpus-based approaches to speech recognition, machine translation, and speech synthesis enable coverage of a wide variety of topics and portability to other languages. Test results show that the character accuracy of speech recognition is 82%-94% for Chinese speech, and the bilingual evaluation understudy (BLEU) score of machine translation is 0.55-0.74 for Chinese-Japanese and Chinese-English.

  8. Integrating Prosodic Information into a Speech Recogniser

    OpenAIRE

    López Soto, María Teresa

    2001-01-01

    In the last decade there has been an increasing tendency to incorporate language engineering strategies into speech technology. This technique combines linguistic and mathematical information in different applications: machine translation, natural language processing, speech synthesis and automatic speech recognition (ASR). In the field of speech synthesis, this hybrid approach (linguistic and mathematical/statistical) has led to the design of efficient models for reproducin...

  9. Application of Speech Recognition Technology in the Ship's VHF Simulation System

    Institute of Scientific and Technical Information of China (English)

    陈大军; 任鸿翔; 肖方兵

    2014-01-01

    Taking the VHF equipment in the GMDSS simulator as an example, key aspects such as the man-machine interface, operating menus and data input were designed, and the simulation equipment was implemented with Visual Studio 2010. Based on key technologies of speech signal processing and recognition, a speech recognition module was developed with the Microsoft Speech SDK, enabling voice control of the VHF simulator. This is the first application of speech recognition technology in a marine simulator; the results obtained were satisfactory and offer a good reference for applying speech technology in other marine simulation systems.

  10. Psychological Motivation Strategies for Verbal Identification in the Popularization of Marxism

    Institute of Scientific and Technical Information of China (English)

    邓瑞琴

    2015-01-01

    Verbal identification provides a powerful verbal support system and discourse guarantee in the course of the popularization of Marxism, and psychological motivation is an effective means of promoting it. By correctly grasping and handling the coupling relationship between verbal identification and psychological motivation, ways of realizing the popularization of Marxism are explored through three basic motivation strategies: building a verbal identification mechanism, shaping the psychological identification of the masses, and respecting the dominant position of the public. In implementing these psychological motivation strategies, attention must be paid to correctly understanding the relationship between speech and psychology and to mastering the art of verbal expression.

  12. Application of Speech Recognition Methods to Underwater Target Classification

    Institute of Scientific and Technical Information of China (English)

    曾渊; 李钢虎; 赵亚楠; 苗雨

    2012-01-01

    In submarine battle, underwater target recognition is a key technology for finding the enemy early and taking effective acoustic countermeasures. However, how to classify three kinds of ship targets from the ship-radiated noise received by a sonar system has long been a puzzle. This paper studies the effect of applying four frequently-used speech recognition methods to underwater target classification: the linear prediction coefficients (LPC), linear prediction cepstrum coefficients (LPCC), Mel-frequency cepstrum coefficients (MFCC) and the minimum variance distortionless response (MVDR). The paper compares the recognition rates for different target samples without noise and under different SNRs, and identifies the best method with and without noise. Experiments show that, without noise, the overall recognition rate of MFCC is highest; MFCC has the highest recognition rate for the first kind of targets; MFCC and MVDR methods have similar recognition
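    A sketch of two of the four feature types: LPC via librosa and LPCC via the standard recursion c_n = a_n + Σ (k/n)·c_k·a_{n−k} (the audio clip and orders are stand-ins):

```python
import numpy as np
import librosa

def lpcc(a, n_ceps=12):
    """Convert LPC coefficients to LPC cepstral coefficients (standard recursion)."""
    a = -a[1:]                     # librosa returns [1, -a1, -a2, ...]
    c = np.zeros(n_ceps)
    for n in range(1, n_ceps + 1):
        acc = a[n - 1] if n <= len(a) else 0.0
        for k in range(1, n):
            if n - k <= len(a):
                acc += (k / n) * c[k - 1] * a[n - k - 1]
        c[n - 1] = acc
    return c

y, sr = librosa.load(librosa.example("trumpet"), duration=1.0)  # stand-in signal
a = librosa.lpc(y, order=12)       # linear prediction coefficients
print("LPC :", a[:5].round(3))
print("LPCC:", lpcc(a)[:5].round(3))
```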

  13. Clinical and audiological features of a syndrome with deterioration in speech recognition out of proportion to pure hearing loss

    Directory of Open Access Journals (Sweden)

    Abdi S

    2007-04-01

    Full Text Available Background: The objective of this study was to describe the audiologic and related characteristics of a group of patients whose speech perception was affected out of proportion to their pure-tone hearing loss, based on a case series of patients referred for evaluation and management to the Hearing Research Center. The key clinical feature was hearing loss for pure tones with a reduction in speech discrimination out of proportion to the pure-tone loss, meeting some of the criteria of auditory neuropathy (i.e. normal otoacoustic emissions, OAE, and abnormal auditory brainstem evoked potentials, ABR) while lacking others (e.g. present auditory reflexes). Methods: Hearing abilities were measured by pure tone audiometry (PTA) and speech discrimination scores (SDS), obtained in all patients using a standardized list of 25 monosyllabic Farsi words at MCL in quiet. Auditory pathway integrity was assessed using the auditory brainstem response (ABR) and otoacoustic emissions (OAE), and anatomical lesions by computed tomography (CT) and magnetic resonance imaging (MRI) of the brain and retrocochlea. The series included 35 patients who had SDS disproportionately low with regard to PTA, absent ABR waves and normal OAE. Results: All patients reported the beginning of their problem around adolescence. None had anatomical lesions in imaging studies, and none had any finding suggestive of a conductive hearing lesion. Although in most cases the hearing loss was more apparent at the lower frequencies (i.e. 1000 Hz and below), a stronger correlation was found between SDS and hearing thresholds at higher frequencies. These patients may not benefit from hearing aids: the outer hair cells are functional and amplification does not seem to help, though it was tried for all. Conclusion: These patients share a pattern of sensory-neural loss with no detectable lesion. The age of onset and the gradual

  14. Building Searchable Collections of Enterprise Speech Data.

    Science.gov (United States)

    Cooper, James W.; Viswanathan, Mahesh; Byron, Donna; Chan, Margaret

    The study has applied speech recognition and text-mining technologies to a set of recorded outbound marketing calls and analyzed the results. Since speaker-independent speech recognition technology results in a significantly lower recognition rate than that found when the recognizer is trained for a particular speaker, a number of post-processing…

  15. Commercial speech and off-label drug uses: what role for wide acceptance, general recognition and research incentives?

    Science.gov (United States)

    Gilhooley, Margaret

    2011-01-01

    This article provides an overview of how the constitutional protections for commercial speech affect the Food and Drug Administration's (FDA) regulation of drugs, and the emerging issues about the scope of these protections. A federal district court has already found that commercial speech allows manufacturers to distribute reprints of medical articles about a new off-label use of a drug as long as it contains disclosures to prevent deception and to inform readers about the lack of FDA review. This paper summarizes the current agency guidance that accepts the manufacturer's distribution of reprints with disclosures. Allergan, the maker of Botox, recently maintained in a lawsuit that the First Amendment permits drug companies to provide "truthful information" to doctors about "widely accepted" off-label uses of a drug. While the case was settled as part of a fraud and abuse case on other grounds, extending constitutional protections generally to "widely accepted" uses is not warranted, especially if it covers the use of a drug for a new purpose that needs more proof of efficacy, and that can involve substantial risks. A health law academic pointed out in an article examining a fraud and abuse case that off-label use of drugs is common, and that practitioners may lack adequate dosage information about the off-label uses. Drug companies may obtain approval of a drug for a narrow use, such as for a specific type of pain, but practitioners use the drug for similar uses based on their experience. The writer maintained that a controlled study may not be necessary to establish efficacy for an expanded use of a drug for pain. Even if this is the case, as discussed below in this paper, added safety risks may exist if the expansion covers a longer period of time and use by a wider number of patients. The protections for commercial speech should not be extended to allow manufacturers to distribute information about practitioner use with a disclosure about the lack of FDA

  16. SPEECH CLASSIFICATION USING ZERNIKE MOMENTS

    Directory of Open Access Journals (Sweden)

    Manisha Pacharne

    2011-07-01

    Full Text Available Speech recognition is a very popular field of research, and speech classification improves the performance of speech recognition. Different patterns are identified using various characteristics or features of speech for their classification. A typical speech feature set consists of many parameters, such as standard deviation, magnitude, and zero crossings, representing the speech signal. Considering all of these parameters greatly increases the system's computation load and time, so there is a need to reduce them by selecting the important features. Feature selection aims to obtain an optimal subset of features from a given space, leading to high classification performance; feature selection methods should therefore derive features that reduce the amount of data used for classification. High recognition accuracy is in demand for speech recognition systems. In this paper, Zernike moments of the speech signal are extracted and used as features. Zernike moments are shape descriptors generally used to describe the shape of a region; to extract them, the one-dimensional audio signal is converted into a two-dimensional image file. Various feature selection and ranking algorithms, such as t-Test, Chi Square, Fisher Score, ReliefF, Gini Index, and Information Gain, are then used to select the important features of the speech signal. The performance of the algorithms is evaluated using classifier accuracy. A Support Vector Machine (SVM) is used as the learning algorithm of the classifier, and it is observed that accuracy improves considerably after removing unwanted features.
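    A minimal sketch of the select-then-classify loop described above, assuming scikit-learn. Not all of the paper's rankers (ReliefF, Gini Index, etc.) are available there, so a univariate ANOVA F-score stands in for the ranking step; X and y are hypothetical feature and label arrays.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

def accuracy_with_top_k(X, y, k):
    """Keep the k best features by an ANOVA F-score ranking (a stand-in
    for rankers like Fisher Score or t-Test), then report 5-fold
    cross-validated SVM accuracy on the reduced feature set."""
    clf = make_pipeline(SelectKBest(f_classif, k=k), SVC(kernel="rbf"))
    return cross_val_score(clf, X, y, cv=5).mean()

# Sweeping k shows how accuracy changes as low-ranked features are dropped:
# scores = {k: accuracy_with_top_k(X, y, k) for k in (5, 10, 20, 40)}
```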

  17. Feature Selection for Speech Emotion Recognition Based on an Ant Colony Optimization Algorithm

    Institute of Scientific and Technical Information of China (English)

    杨鸿章

    2013-01-01

    Speech emotion features are high-dimensional and redundant. To improve the accuracy of speech emotion recognition, this paper proposes a speech emotion recognition model that selects features with an ant colony optimization algorithm. The fitness function weights the KNN classification accuracy against the dimension of the selected feature subset, and the ant colony optimization algorithm provides good global search capability and multiple sub-optimal solutions. A local refinement search scheme is designed to exclude redundant features and improve the convergence rate. The performance of the method was tested on a Chinese emotional speech database and the Danish Emotional Speech database. The simulation results show that the proposed method not only eliminates redundant and useless features, reducing the feature dimension, but also improves the speech emotion recognition rate; it is therefore an effective method for intelligent speech emotion recognition.
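    The following is a simplified sketch of binary ant-colony feature selection with the weighted fitness described above (KNN accuracy traded off against subset size), assuming scikit-learn and NumPy. The pheromone update rule and all parameter values are illustrative choices, not the paper's.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def fitness(X, y, mask, alpha=0.9):
    """Weighted objective: KNN accuracy traded off against subset size."""
    if not mask.any():
        return 0.0
    acc = cross_val_score(KNeighborsClassifier(), X[:, mask], y, cv=3).mean()
    return alpha * acc + (1 - alpha) * (1 - mask.mean())

def aco_select(X, y, n_ants=20, n_iter=30, rho=0.1, seed=0):
    """Binary ant-colony search: each ant samples a feature subset with
    probability proportional to per-feature pheromone; the best subset
    found so far is reinforced, and pheromone evaporates each round."""
    d = X.shape[1]
    tau = np.ones(d)
    rng = np.random.default_rng(seed)
    best_mask, best_fit = np.ones(d, dtype=bool), -1.0
    for _ in range(n_iter):
        p = np.clip(tau / tau.sum() * d * 0.5, 0.0, 1.0)  # inclusion probs
        for _ in range(n_ants):
            mask = rng.random(d) < p
            f = fitness(X, y, mask)
            if f > best_fit:
                best_mask, best_fit = mask.copy(), f
        tau *= 1 - rho              # evaporation
        tau[best_mask] += best_fit  # reinforce the incumbent subset
    return best_mask
```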

  18. Speech in Mobile and Pervasive Environments

    CERN Document Server

    Rajput, Nitendra

    2012-01-01

    This book brings together the latest research in one comprehensive volume that deals with issues related to speech processing on resource-constrained, wireless, and mobile devices, such as speech recognition in noisy environments, specialized hardware for speech recognition and synthesis, the use of context to enhance recognition, the emerging and new standards required for interoperability, speech applications on mobile devices, distributed processing between the client and the server, and the relevance of Speech in Mobile and Pervasive Environments for developing regions--an area of explosiv

  19. Emotion Recognition in Speech Based on HMM and PNN

    Institute of Scientific and Technical Information of China (English)

    叶斌

    2011-01-01

    Speech emotion recognition aims to give computers the ability to understand emotion from the voice, ultimately enabling natural, warm, and lively human-computer interaction. A speech emotion recognition method combining a hidden Markov model (HMM) and a probabilistic neural network (PNN) is proposed. In the designed recognition system, basic prosodic parameters and spectral parameters are extracted first; the PNN is then used to model the statistical features of the acoustic parameters and the HMM to model their temporal features. Sum and product rules are used to fuse the recognition results from the statistical and temporal features for the final decision. Experimental results confirm the effectiveness and efficiency of the proposed method for speech emotion recognition.

  20. A Danish open-set speech corpus for competing-speech studies

    DEFF Research Database (Denmark)

    Nielsen, Jens Bo; Dau, Torsten; Neher, Tobias

    2014-01-01

    Studies investigating speech-on-speech masking effects commonly use closed-set speech materials such as the coordinate response measure [Bolia et al. (2000). J. Acoust. Soc. Am. 107, 1065-1066]. However, these studies typically result in very low (i.e., negative) speech recognition thresholds (SRTs...

  1. The Design of the Speech Recognition Module in an Infant-Caring Intelligent System

    Institute of Scientific and Technical Information of China (English)

    张荣刚

    2012-01-01

    A module is designed that not only monitors an infant's sleeping state in real time but also pacifies the infant promptly and effectively by means of speech recognition and intelligent control.

  2. A Kinect-Based Sign Language Hand Gesture Recognition System for Hearing- and Speech-Impaired: A Pilot Study of Pakistani Sign Language.

    Science.gov (United States)

    Halim, Zahid; Abbas, Ghulam

    2015-01-01

    Sign language provides hearing- and speech-impaired individuals with an interface to communicate with other members of society. Unfortunately, sign language is not understood by most people. A gadget based on image processing and pattern recognition can therefore provide a vital aid for detecting and translating sign language into a vocal language. This work presents a system for detecting and understanding sign language gestures with a custom-built software tool and then translating the gesture into a vocal language. To recognize a particular gesture, the system employs a Dynamic Time Warping (DTW) algorithm, and an off-the-shelf software tool is employed for vocal language generation. Microsoft(®) Kinect is the primary tool used to capture the video stream of a user. The proposed method successfully detects gestures stored in the dictionary with an accuracy of 91%. The proposed system also has the ability to define and add custom-made gestures. Based on an experiment in which 10 individuals with impairments used the system to communicate with 5 people with no disability, 87% agreed that the system was useful. PMID:26132224
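    The gesture-matching step rests on dynamic time warping; a bare-bones NumPy version, with a hypothetical template dictionary, might look like the sketch below (the paper's actual feature vectors come from Kinect skeleton tracking).

```python
import numpy as np

def dtw_distance(a, b):
    """Dynamic time warping distance between two sequences of feature
    vectors (e.g. per-frame joint coordinates from skeleton tracking)."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(np.asarray(a[i - 1]) - np.asarray(b[j - 1]))
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

def recognize(gesture, templates):
    """Return the name of the stored template closest to the input gesture."""
    return min(templates, key=lambda name: dtw_distance(gesture, templates[name]))
```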

  3. Towards Automation 2.0: A Neurocognitive Model for Environment Recognition, Decision-Making, and Action Execution

    OpenAIRE

    Zucker Gerhard; Dietrich Dietmar; Velik Rosemarie

    2011-01-01

    The penetration of building automation by information technology is far from saturated. Today's systems must not only be reliable and fault-tolerant, they must also account for energy efficiency and flexibility in overall consumption. Meeting the quality and comfort goals in building automation while at the same time optimizing for energy, carbon footprint, and cost-efficiency requires systems that are able to handle large amounts of information and negotiate system behaviour ...

  4. SAM: speech-aware applications in medicine to support structured data entry.

    OpenAIRE

    Wormek, A. K.; Ingenerf, J; Orthner, H. F.

    1997-01-01

    In the last two years, improvement in speech recognition technology has directed the medical community's interest to porting and using such innovations in clinical systems. The acceptance of speech recognition systems in clinical domains increases with recognition speed, large medical vocabulary, high accuracy, continuous speech recognition, and speaker independence. Although some commercial speech engines approach these requirements, the greatest benefit can be achieved in adapting a speech ...

  5. Speaker Recognition Algorithm for Abnormal Speech Based on Abnormal Feature Weighting

    Institute of Scientific and Technical Information of China (English)

    何俊; 李艳雄; 贺前华; 李威

    2012-01-01

    As commonly used weighting algorithms have difficulty tracking the feature variations of abnormal speech, a speaker recognition algorithm for abnormal speech based on abnormal-feature weighting is proposed. First, a feature template of normal speech is established by computing the probability distribution of each order of MFCC features over a large number of normal speech samples. The K-L distance and the Euclidean distance are then used to measure the differences between a given test speech and the normal speech template and to determine the corresponding K-L and Euclidean weighting factors. Finally, the two weighting factors are used to weight the MFCC features of the test speech, and the weighted MFCC features are input to a Gaussian mixture model for speaker recognition with abnormal speech. Experimental results show that the overall recognition rates of the K-L-weighted and Euclidean-weighted algorithms are 46.61% and 42.25%, respectively, while those of a weighting algorithm based on the speaker-recognition contribution of each feature order and of the unweighted algorithm are only 39.68% and 36.36%.
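    A small sketch of the K-L weighting idea, assuming NumPy and per-coefficient Gaussian statistics; the template construction and the symmetric form of the divergence are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

def gaussian_kl(mu0, var0, mu1, var1):
    """KL divergence between two univariate Gaussians, element-wise."""
    return 0.5 * (np.log(var1 / var0) + (var0 + (mu0 - mu1) ** 2) / var1 - 1.0)

def kl_weights(normal_mfcc, test_mfcc, eps=1e-8):
    """Per-coefficient weights from the symmetric divergence between a
    test utterance's statistics and a 'normal speech' template; rows are
    frames, columns are MFCC orders."""
    mu_n, var_n = normal_mfcc.mean(0), normal_mfcc.var(0) + eps
    mu_t, var_t = test_mfcc.mean(0), test_mfcc.var(0) + eps
    d = gaussian_kl(mu_t, var_t, mu_n, var_n) + gaussian_kl(mu_n, var_n, mu_t, var_t)
    return d / d.sum()

# weighted = test_mfcc * kl_weights(normal_mfcc, test_mfcc)  # then feed a GMM
```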

  6. Dimensional Emotion Recognition in Whispered Speech Based on Cognitive Performance Evaluation

    Institute of Scientific and Technical Information of China (English)

    吴晨健; 黄程韦; 陈虹

    2015-01-01

    Cognitive-performance-based dimensional emotion recognition in whispered speech is studied. First, whispered speech emotion databases and data collection methods are compared, and the characteristics of emotion expression in whispered speech are studied, especially for the basic emotion types. Second, the emotion features of whispered speech are analyzed; drawing on recent literature, the related valence and arousal features are summarized, and the effectiveness of valence and arousal features for distinguishing emotions in whispered speech is studied. Finally, emotion recognition algorithms are studied and a Gaussian mixture model is applied to whispered-speech emotion recognition. Evaluation of cognitive performance is also integrated into the recognition process so that recognition errors can be corrected; based on the cognitive scores, the recognition results can be improved. The results show that formant features are not significantly correlated with the arousal dimension, while short-term energy features are correlated with emotion changes along the arousal dimension. Combining the cognitive scores improves the speech emotion recognition results.

  7. INTEGRATING MACHINE TRANSLATION AND SPEECH SYNTHESIS COMPONENT FOR ENGLISH TO DRAVIDIAN LANGUAGE SPEECH TO SPEECH TRANSLATION SYSTEM

    Directory of Open Access Journals (Sweden)

    J. SANGEETHA

    2015-02-01

    Full Text Available This paper provides an interface between the machine translation and speech synthesis systems for converting English speech to Tamil text in an English-to-Tamil speech-to-speech translation system. The speech translation system consists of three modules: automatic speech recognition, machine translation, and text-to-speech synthesis. Many procedures for integrating speech recognition and machine translation have been proposed, but the speech synthesis component has received little attention. In this paper, we focus on the integration of machine translation and speech synthesis, and report a subjective evaluation investigating the impact of the speech synthesis, the machine translation, and their integration. We implement a hybrid machine translation (a combination of rule-based and statistical machine translation) and a concatenative, syllable-based speech synthesis technique. To retain the naturalness and intelligibility of the synthesized speech, Auto Associative Neural Network (AANN) prosody prediction is used in this work. The results of this system investigation demonstrate that the naturalness and intelligibility of the synthesized speech are strongly influenced by the fluency and correctness of the translated text.

  8. Application of EMD-SDC in an Airborne Connected-Word Speech Recognition System

    Institute of Scientific and Technical Information of China (English)

    严家明; 李永恒

    2012-01-01

    Compared with traditional speech recognition systems, an airborne connected-word speech recognition system faces heavy background noise and requires a high recognition rate. Accordingly, this paper proposes an EMD-SDC connected-word recognition method based on empirical mode decomposition enhancement and shifted delta cepstral features. Empirical mode decomposition, with its AM-FM characteristics, can substantially increase endpoint detection accuracy in a complex airborne noise environment. Shifted delta cepstral features, formed by concatenating first-order difference spectra across speech frames, better capture the temporal information that depends on the structure of the language. The method was tested on the prompt-speech database of an airborne traffic collision avoidance system. Experimental results show that a connected-word recognition system using the EMD-SDC method overcomes cabin background noise well and achieves a high recognition rate under low-SNR conditions.
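    Shifted delta cepstra follow a standard N-d-P-k recipe; a NumPy sketch is given below (the defaults follow the common 7-1-3-7-style parametrization and are not necessarily the paper's choices).

```python
import numpy as np

def shifted_delta_cepstra(c, d=1, P=3, k=7):
    """Shifted delta cepstra for an (n_frames, n_coeffs) cepstral array c:
    block i of the output at frame t is c[t + i*P + d] - c[t + i*P - d],
    i.e. k delta blocks taken at shifts of P frames."""
    T, N = c.shape
    t0, t1 = d, T - (k - 1) * P - d  # frames with full context on both sides
    out = np.zeros((t1 - t0, k * N))
    for i in range(k):
        hi = c[t0 + i * P + d : t1 + i * P + d]
        lo = c[t0 + i * P - d : t1 + i * P - d]
        out[:, i * N:(i + 1) * N] = hi - lo
    return out
```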

  9. Current trends in multilingual speech processing

    Indian Academy of Sciences (India)

    Hervé Bourlard; John Dines; Mathew Magimai-Doss; Philip N Garner; David Imseng; Petr Motlicek; Hui Liang; Lakshmi Saheer; Fabio Valente

    2011-10-01

    In this paper, we describe recent work at Idiap Research Institute in the domain of multilingual speech processing and provide some insights into emerging challenges for the research community. Multilingual speech processing has been a topic of ongoing interest to the research community for many years and the field is now receiving renewed interest owing to two strong driving forces. Firstly, technical advances in speech recognition and synthesis are posing new challenges and opportunities to researchers. For example, discriminative features are seeing wide application by the speech recognition community, but additional issues arise when using such features in a multilingual setting. Another example is the apparent convergence of speech recognition and speech synthesis technologies in the form of statistical parametric methodologies. This convergence enables the investigation of new approaches to unified modelling for automatic speech recognition and text-to-speech synthesis (TTS) as well as cross-lingual speaker adaptation for TTS. The second driving force is the impetus being provided by both government and industry for technologies to help break down domestic and international language barriers, these also being barriers to the expansion of policy and commerce. Speech-to-speech and speech-to-text translation are thus emerging as key technologies at the heart of which lies multilingual speech processing.

  10. Design of a Digital Speech Recognition System for Wireless Communication

    Institute of Scientific and Technical Information of China (English)

    王艳芬

    2016-01-01

    Interference such as the environment, user accent, and non-target vocabulary in the digital voice recording process makes previously developed digital speech recognition systems for wireless communication inaccurate and poorly portable. Therefore, an optimized design of a digital speech recognition system for wireless communication was carried out. The core components of the system are the C6727 DSP chip, the QGDH710 speech recognition chip, and the CC2520 RF transceiver. The C6727 DSP performs early-stage processing of the digital speech; the QGDH710 speech recognition chip recognizes the processed digital speech and feeds the recognized instructions back to the CC2520 RF transceiver; the CC2520 RF transceiver converts the instruction format and transmits the instructions to the user's wireless communication equipment, realizing effective use of digital speech recognition in wireless communication. For convenient operation, a virtual function diagram of the user's wireless communication equipment is provided in software. Experimental verification shows that the designed system has high accuracy and good portability.

  11. Sign Facilitation in Word Recognition.

    Science.gov (United States)

    Wauters, Loes N.; Knoors, Harry E. T.; Vervloed, Mathijs P. J.; Aarnoutse, Cor A. J.

    2001-01-01

    This study examined whether the use of sign language would facilitate word recognition in reading by 16 deaf children (6- to 1 years-old) in the Netherlands. Results indicated that if words were learned through speech accompanied by the relevant sign, word recognition accuracy was greater than if words were learned solely through speech.

  12. Speech Development

    Science.gov (United States)


  13. Usefulness of Speech Coding in Voice Banking

    Directory of Open Access Journals (Sweden)

    M Satya Sai Ram

    2009-10-01

    Full Text Available Voice banking is a telephone banking service by which a user can access his or her account at any time of day, every day of the year. The speech techniques involved in voice banking are speech coding and speech recognition. This paper investigates the performance of a speech recognizer on output coded at 20 bits/frame obtained using various vector quantization techniques, namely Split Vector Quantization, Multi Stage Vector Quantization, Split-Multi Stage Vector Quantization, Switched Split Vector Quantization with a hard decision scheme, Switched Multi Stage Vector Quantization with a soft decision scheme, and Multi Switched Split Vector Quantization with a hard decision scheme. The speech recognition technique used to recognize the coded speech signal is the Hidden Markov Model technique, and the enhancement technique used to enhance the coded speech signal is spectral subtraction. The performance of vector quantization is measured in terms of spectral distortion in decibels, computational complexity in kflops/frame, and memory requirements in floats. The performance of the speech recognizer on coded outputs at 20 bits/frame was examined, and the recognizer showed the best probability of recognition for the output coded with Multi Switched Split Vector Quantization using the hard decision scheme. The probability of recognition across the various coding techniques varied from 80% to 100%.
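    As a rough illustration of the single-stage case underlying all these variants, the sketch below trains a codebook with k-means and measures a simple distortion proxy; scikit-learn is assumed, and the bit budget is illustrative. The paper's split, multi-stage, and switched schemes decompose this one codebook into several smaller ones to cut complexity and memory.

```python
import numpy as np
from sklearn.cluster import KMeans

def train_codebook(train_vectors, bits=8):
    """Single-stage codebook with 2**bits entries, trained by k-means."""
    return KMeans(n_clusters=2 ** bits, n_init=1).fit(train_vectors)

def quantize(codebook, vectors):
    """Encode each vector as its nearest codeword index and reconstruct."""
    idx = codebook.predict(vectors)        # these indices are what is transmitted
    return codebook.cluster_centers_[idx]  # decoder-side reconstruction

def mse_distortion(original, reconstructed):
    """Mean-squared-error proxy; the paper instead reports spectral
    distortion in dB computed from the LPC spectra."""
    return float(np.mean((original - reconstructed) ** 2))
```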

  14. Learning Representations of Affect from Speech

    OpenAIRE

    Ghosh, Sayan; Laksana, Eugene; Morency, Louis-Philippe; Scherer, Stefan

    2015-01-01

    There has been a lot of prior work on representation learning for speech recognition applications, but not much emphasis has been given to an investigation of effective representations of affect from speech, where the paralinguistic elements of speech are separated out from the verbal content. In this paper, we explore denoising autoencoders for learning paralinguistic attributes i.e. categorical and dimensional affective traits from speech. We show that the representations learnt by the bott...

  15. Whispered Speech Emotion Recognition Embedded with Markov Networks and Multi-Scale Decision Fusion

    Institute of Scientific and Technical Information of China (English)

    黄程韦; 金赟; 包永强; 余华; 赵力

    2013-01-01

    In this paper we propose a multi-scale framework in the time domain that combines Gaussian mixture models with a Markov network, and apply it to whispered-speech emotion recognition. Based on Gaussian mixture models, speech emotion recognition is carried out on both long and short utterances in continuous speech signals. According to the dimensional emotion model, emotion in whispered speech should be continuous in the time domain, so the context dependency in whispered speech is modeled with a third-order Markov network. A spring model is adopted to model higher-order variation in the two-dimensional emotion space, and fuzzy entropy evaluation is used to convert the Gaussian mixture model likelihoods into the unary energies of the Markov network. Experimental results show that the proposed algorithm achieves good recognition results on continuous whispered-speech data, with a recognition rate for anger of 64.3%. The results further show that, unlike findings for normal speech, happiness is relatively difficult to recognize in whispered speech, while anger and sadness are discriminated well, consistent with the human listening study conducted by Cirillo et al.

  16. Gesture Recognition with Kinect for Automated Sign Language Translation

    Institute of Scientific and Technical Information of China (English)

    付倩; 沈俊辰; 张茜颖; 武仲科; 周明全

    2013-01-01

    For a long time, people with hearing disabilities have faced barriers in communication. For them, being able to communicate effectively with other people has been a dream that might never come true. With the invention of the computer and subsequent technological advances, this dream could become possible. Researchers have long tried to achieve automated sign language translation using human-computer interaction (HCI), computer graphics, computer vision, and pattern recognition; most previous work is based on video. This paper proposes a gesture recognition method for automated sign language translation based on depth images provided by Kinect, a three-dimensional (3D) motion-sensing input device. The 3D coordinates of the palm center are computed using the NITE middleware and the open-source OpenNI framework. Fingers are then recognized and hand gestures detected through a contour tracing algorithm and a three-point alignment algorithm, and the names of the fingers are recognized using a vector fitting method. Three layers of classifiers are designed to achieve static hand gesture recognition. Compared with traditional methods based on data gloves or monocular cameras, the present method is more accurate. Experiments show that the method is concise and effective.

  17. Speech synthesis, speech simulation and speech science

    OpenAIRE

    Huckvale, M.

    2002-01-01

    Speech synthesis research has been transformed in recent years through the exploitation of speech corpora – both for statistical modelling and as a source of signals for concatenative synthesis. This revolution in methodology and the new techniques it brings calls into question the received wisdom that better computer voice output will come from a better understanding of how humans produce speech. This paper discusses the relationship between this new technology of simulated speech and the tr...

  18. Neural bases of accented speech perception

    Directory of Open Access Journals (Sweden)

    Patti eAdank

    2015-10-01

    Full Text Available The recognition of unfamiliar regional and foreign accents represents a challenging task for the speech perception system (Adank, Evans, Stuart-Smith, & Scott, 2009; Floccia, Goslin, Girard, & Konopczynski, 2006). Despite the frequency with which we encounter such accents, the neural mechanisms supporting successful perception of accented speech are poorly understood. Nonetheless, candidate neural substrates involved in processing speech in challenging listening conditions, including accented speech, are beginning to be identified. This review outlines the neural bases associated with perception of accented speech in light of current models of speech perception, and compares these data to brain areas associated with processing other speech distortions. We subsequently evaluate competing models of speech processing with regard to the neural processing of accented speech. See Cristia et al. (2012) for an in-depth overview of behavioural aspects of accent processing.

  19. A Study on Feature Analysis and Recognition of Practical Speech Emotion

    Institute of Scientific and Technical Information of China (English)

    黄程韦; 赵艳; 金赟; 于寅骅; 赵力

    2011-01-01

    Practical speech emotions such as impatience and happiness are studied, especially for evaluating emotional well-being in real-world applications. Natural emotional speech data were induced and collected with a computer game; 74 emotion features were extracted, and prosodic and voice quality features were analyzed according to a dimensional emotion model. Acoustic features were evaluated and selected for practical emotions such as impatience, and a practical speech emotion recognition method with rejection decision is proposed for real-world conditions. The experimental results show that the speech features analyzed here are suitable for classifying practical speech emotions such as impatience and happiness, with an average recognition rate above 75%. The recognition method with rejection decision makes reasonable decisions for ambiguous and unknown emotion categories, which is important for real-world applications of speech emotion recognition.

  20. Design and Implementation of an Intelligent Wheelchair Based on Speech Recognition

    Institute of Scientific and Technical Information of China (English)

    巴金融

    2014-01-01

    An intelligent wheelchair based on the STC10L08XE and intelligent speech recognition is designed and implemented. The design uses the LD3320, a dedicated speech recognition chip made by ICRoute, controlled by an STC10L08XE microcontroller, and implements voice commands for moving the wheelchair forward and backward, turning left and right, and stopping, as well as an infrared obstacle-avoidance function.

  1. Head movements encode emotions during speech and song.

    Science.gov (United States)

    Livingstone, Steven R; Palmer, Caroline

    2016-04-01

    When speaking or singing, vocalists often move their heads in an expressive fashion, yet the influence of emotion on vocalists' head motion is unknown. Using a comparative speech/song task, we examined whether vocalists' intended emotions influence head movements and whether those movements influence the perceived emotion. In Experiment 1, vocalists were recorded with motion capture while speaking and singing each statement with different emotional intentions (very happy, happy, neutral, sad, very sad). Functional data analyses showed that head movements differed in translational and rotational displacement across emotional intentions, yet were similar across speech and song, transcending differences in F0 (varied freely in speech, fixed in song) and lexical variability. Head motion specific to emotional state occurred before and after vocalizations, as well as during sound production, confirming that some aspects of movement were not simply a by-product of sound production. In Experiment 2, observers accurately identified vocalists' intended emotion on the basis of silent, face-occluded videos of head movements during speech and song. These results provide the first evidence that head movements encode a vocalist's emotional intent and that observers decode emotional information from these movements. We discuss implications for models of head motion during vocalizations and applied outcomes in social robotics and automated emotion recognition. PMID:26501928

  2. Effect of Sentence Context on Word Recognition in Speech Perception

    Institute of Scientific and Technical Information of China (English)

    柳鑫淼

    2014-01-01

    In speech perception, phonemes, words, and sentences are interconnected. Besides phonetic features, phonemes, and words, sentences are also engaged as a perceptual unit in speech perception. In this process, sentence context influences word recognition both syntactically and semantically. Syntactically, the sentence level exerts a top-down feedback effect on the word level according to syntactic rules, screening the candidate words at the word level by constraining their part of speech and checking their inflectional features. Semantically, the sentence level activates or inhibits candidate words by applying semantic constraints.

  3. Shuffled Frog-Leaping Algorithm Based Neural Network and Its Application in Speech Emotion Recognition

    Institute of Scientific and Technical Information of China (English)

    余华; 黄程韦; 张潇丹; 金赟; 赵力

    2011-01-01

    The shuffled frog-leaping algorithm (SFLA) is applied to neural network training for speech emotion recognition. Features of six speech emotions are extracted and recognized, and the variation of harmonics-to-noise ratio (HNR) features across emotion categories is studied. The SFLA is used to optimize the connection weights and thresholds of the neural network from random initial data, allowing the network to converge quickly. The recognition capabilities of BP, RBF, and SFLA neural networks are compared experimentally. The results show that the average recognition rate of the SFLA neural network is 4.7% higher than that of the BP neural network and 4.3% higher than that of the RBF neural network.

  4. Audio-visual gender recognition

    Science.gov (United States)

    Liu, Ming; Xu, Xun; Huang, Thomas S.

    2007-11-01

    Combining different modalities for pattern recognition tasks is a very promising field. Humans routinely fuse information from different modalities to recognize objects and perform inference. Gender recognition is one of the most common tasks in human social communication: humans can identify gender by facial appearance, by speech, and also by body gait. Indeed, human gender recognition is a multi-modal data acquisition and processing procedure. However, computational multimodal gender recognition has not been extensively investigated in the literature. In this paper, speech and facial images are fused to perform multi-modal gender recognition and to explore the improvement gained from combining different modalities.
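    Score-level (late) fusion is one simple way to combine the two modalities; a minimal sketch with hypothetical class-probability inputs from the audio and face classifiers:

```python
import numpy as np

def fuse_gender_scores(p_audio, p_face, w_audio=0.5):
    """Weighted sum of class-probability vectors over (female, male)
    from the two single-modality classifiers, then argmax."""
    p = w_audio * np.asarray(p_audio) + (1 - w_audio) * np.asarray(p_face)
    return ("female", "male")[int(np.argmax(p))]

# e.g. fuse_gender_scores([0.35, 0.65], [0.60, 0.40], w_audio=0.6) -> "male"
```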

  5. Cortisol, Chromogranin A, and Pupillary Responses Evoked by Speech Recognition Tasks in Normally Hearing and Hard-of-Hearing Listeners: A Pilot Study.

    Science.gov (United States)

    Kramer, Sophia E; Teunissen, Charlotte E; Zekveld, Adriana A

    2016-01-01

    Pupillometry is one method that has been used to measure processing load expended during speech understanding. Notably, speech perception (in noise) tasks can evoke a pupil response. It is not known if there is concurrent activation of the sympathetic nervous system as indexed by salivary cortisol and chromogranin A (CgA) and whether such activation differs between normally hearing (NH) and hard-of-hearing (HH) adults. Ten NH and 10 adults with mild-to-moderate hearing loss (mean age 52 years) participated. Two speech perception tests were administered in random order: one in quiet targeting 100% correct performance and one in noise targeting 50% correct performance. Pupil responses and salivary samples for cortisol and CgA analyses were collected four times: before testing, after the two speech perception tests, and at the end of the session. Participants rated their perceived accuracy, effort, and motivation. Effects were examined using repeated-measures analyses of variance. Correlations between outcomes were calculated. HH listeners had smaller peak pupil dilations (PPDs) than NH listeners in the speech in noise condition only. No group or condition effects were observed for the cortisol data, but HH listeners tended to have higher cortisol levels across conditions. CgA levels were larger at the pretesting time than at the three other test times. Hearing impairment did not affect CgA. Self-rated motivation correlated most often with cortisol or PPD values. The three physiological indicators of cognitive load and stress (PPD, cortisol, and CgA) are not equally affected by speech testing or hearing impairment. Each of them seem to capture a different dimension of sympathetic nervous system activity. PMID:27355762

  6. Predicting the effect of spectral subtraction on the speech recognition threshold based on the signal-to-noise ratio in the envelope domain

    DEFF Research Database (Denmark)

    Jørgensen, Søren; Dau, Torsten

    2011-01-01

    Digital noise reduction strategies are important in technical devices such as hearing aids and mobile phones. One well-described noise reduction scheme is the spectral subtraction algorithm. Many versions of the spectral subtraction scheme have been presented in the literature, but the methods have ... The SRT was measured in five normal-hearing listeners in six conditions of spectral subtraction. The results showed an increase of the SRT after processing, i.e. a decreased speech intelligibility, in contrast to what is predicted by the Speech Transmission Index (STI). Here, another approach is proposed ...

  7. Automating the Process of Work-Piece Recognition and Location for a Pick-and-Place Robot in a SFMS

    Directory of Open Access Journals (Sweden)

    R. V. Sharan

    2014-03-01

    Full Text Available This paper reports the development of a vision system to automatically classify work-pieces with respect to their shape and color together with determining their location for manipulation by an in-house developed pick-and-place robot from its work-plane. The vision-based pick-and-place robot has been developed as part of a smart flexible manufacturing system for unloading work-pieces for drilling operations at a drilling workstation from an automatic guided vehicle designed to transport the work-pieces in the manufacturing work-cell. Work-pieces with three different shapes and five different colors are scattered on the work-plane of the robot and manipulated based on the shape and color specification by the user through a graphical user interface. The number of corners and the hue, saturation, and value of the colors are used for shape and color recognition respectively in this work. Due to the distinct nature of the feature vectors for the fifteen work-piece classes, all work-pieces were successfully classified using minimum distance classification during repeated experimentations with work-pieces scattered randomly on the work-plane.
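    A compressed sketch of the shape-plus-color feature extraction and minimum distance classification described above, assuming OpenCV (cv2) and NumPy; the thresholds and the polygon-approximation stand-in for corner counting are illustrative choices, not the paper's implementation.

```python
import cv2
import numpy as np

def workpiece_features(bgr_roi):
    """Corner count (shape) plus mean hue/saturation/value (color)."""
    gray = cv2.cvtColor(bgr_roi, cv2.COLOR_BGR2GRAY)
    _, bw = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    contours, _ = cv2.findContours(bw, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    c = max(contours, key=cv2.contourArea)
    # Vertices of a polygon approximation stand in for the corner count.
    approx = cv2.approxPolyDP(c, 0.02 * cv2.arcLength(c, True), True)
    h, s, v = cv2.mean(cv2.cvtColor(bgr_roi, cv2.COLOR_BGR2HSV))[:3]
    return np.array([len(approx), h, s, v])

def classify(features, class_means):
    """Minimum-distance classification against stored class mean vectors."""
    return min(class_means, key=lambda k: np.linalg.norm(features - class_means[k]))
```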

  8. Testing of Haar-Like Feature in Region of Interest Detection for Automated Target Recognition (ATR) System

    Science.gov (United States)

    Zhang, Yuhan; Lu, Dr. Thomas

    2010-01-01

    The objectives of this project were to develop a region-of-interest (ROI) detector using Haar-like features, similar to the face detection in Intel's OpenCV library, to implement it in Matlab code, and to test its performance against the existing ROI detector based on the Optimal Trade-off Maximum Average Correlation Height (OTMACH) filter. The ROI detector comprised three parts: (1) automated Haar-like feature selection, finding a small set of the most relevant Haar-like features for detecting ROIs that contain a target; (2) training a neural network to recognize ROIs with targets, taking the selected Haar-like features as inputs; and (3) a filtering method that processes the neural network responses into a small set of regions of interest. All three parts were coded in Matlab, and the detector's parameters were trained by machine learning and tested on specific datasets. Since the OpenCV library and Haar-like features were not available in Matlab, the Haar-like feature calculation had to be implemented directly; Matlab code for Adaptive Boosting and max/min filters could be found on the Internet but had to be integrated to serve the purpose of this project. The performance of the new detector was tested by comparing its accuracy and speed against the existing OTMACH detector, where speed was the average time to find the regions of interest in an image and accuracy was measured by the number of false positives (false alarms) at the same detection rate between the two detectors.
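    Haar-like features are cheap to evaluate because of the integral image; a NumPy sketch of that core computation follows (the two-rectangle feature shown is a generic example, not the project's selected feature set).

```python
import numpy as np

def integral_image(img):
    """ii[y, x] holds the sum of img[:y, :x] (extra zero row/column)."""
    ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1))
    ii[1:, 1:] = img.cumsum(axis=0).cumsum(axis=1)
    return ii

def rect_sum(ii, y, x, h, w):
    """Any rectangle sum in four lookups -- the point of integral images."""
    return ii[y + h, x + w] - ii[y, x + w] - ii[y + h, x] + ii[y, x]

def haar_two_rect(ii, y, x, h, w):
    """Two-rectangle Haar-like feature: left half minus right half."""
    half = w // 2
    return rect_sum(ii, y, x, h, half) - rect_sum(ii, y, x + half, h, half)
```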

  9. Robust Recognition Method of Speech Under Stress

    Institute of Scientific and Technical Information of China (English)

    韩纪庆; 张磊; 王承发

    2000-01-01

    Abstract There are many stressful environments that deteriorate the performance of speech recognition systems. Techniques for compensating for the influence of stress can help neutralize stressed speech and improve the robustness of speech recognition systems. In this paper, we summarize the approaches for robust recognition of speech under stress and review advances in the area.

  10. “心爱飞扬”中文言语测听平台在儿童人工耳蜗术后言语识别测试中的应用%The Application of Computer-aided Chinese Speech Audiometry Platform to Speech Recognition Text in Children after Cochlear Implantation

    Institute of Scientific and Technical Information of China (English)

    罗琼; 黄艳艳; 冯艳梅; 时海波

    2016-01-01

    Objective: To evaluate the sentence recognition rates of children with cochlear implants in quiet and in noise using the computer-aided Chinese speech audiometry platform "心爱飞扬", and to explore the rules of auditory and speech development in these children, while assessing the feasibility of the test tool. Methods: Eighteen children who had used cochlear implants for more than one year received aided hearing threshold tests in the sound field, and the platform was used to test their sentence recognition rates in quiet and in noise. Results: (1) The average aided hearing threshold of the 18 children was (33±5) dB HL. (2) The sentence recognition rate in quiet was (71±24)% overall; the rates of children implanted for 1-4 years and for over 4 years were (53±25)% and (85±9)%, respectively, a statistically significant difference (P<0.01). (3) The sentence recognition rate in noise was (51±28)% overall; the rates of children implanted for 1-4 years and for over 4 years were (31±24)% and (68±19)%, respectively, also a statistically significant difference (P<0.01). Conclusion: Rehabilitation time after cochlear implantation is an important factor affecting children's speech perception. The "心爱飞扬" Chinese speech audiometry platform supports long-term follow-up assessment of auditory and speech rehabilitation in children with cochlear implants.

  11. An Approach to Hide Secret Speech Information

    Institute of Scientific and Technical Information of China (English)

    WU Zhi-jun; DUAN Hai-xin; LI Xing

    2006-01-01

    This paper presents an approach to hiding secret speech information in a code-excited linear prediction (CELP)-based speech coding scheme by adopting an analysis-by-synthesis (ABS)-based algorithm for hiding and extracting speech information, for the purpose of secure speech communication. The secret speech is coded at 2.4 kb/s with mixed excitation linear prediction (MELP) and embedded in CELP-type public speech. The ABS algorithm reuses the speech synthesizer in the speech coder, so speech embedding and coding are synchronous, i.e. a fusion of the public and secret speech information data. An experiment embedding 2.4 kb/s MELP secret speech in G.728-coded public speech transmitted via the public switched telephone network (PSTN) shows that the proposed approach satisfies the requirements of information hiding, meets the speech quality constraints of secure communication, and achieves a high average hiding capacity of 3.2 kb/s with excellent speech quality while complicating speaker recognition.

  12. Speech perception of noise with binary gains

    DEFF Research Database (Denmark)

    Wang, DeLiang; Kjems, Ulrik; Pedersen, Michael Syskind;

    2008-01-01

    For a given mixture of speech and noise, an ideal binary time-frequency mask is constructed by comparing speech energy and noise energy within local time-frequency units. It is observed that listeners achieve nearly perfect speech recognition from gated noise with binary gains prescribed by the ideal binary mask. Only 16 filter channels and a frame rate of 100 Hz are sufficient for high intelligibility. The results show that, despite a dramatic reduction of speech information, a pattern of binary gains provides an adequate basis for speech perception.
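    A sketch of mask construction, assuming SciPy and access to the premixed speech and noise signals (which is what makes the mask "ideal"); note that the study gated the noise itself, whereas this sketch applies the mask to a signal in the usual way.

```python
import numpy as np
from scipy.signal import stft, istft

def ideal_binary_mask(speech, noise, fs, lc_db=0.0, nperseg=512):
    """1/0 gains per time-frequency unit: 1 where the local speech-to-noise
    ratio exceeds the criterion lc_db. The two signals are assumed
    time-aligned and equally long."""
    _, _, S = stft(speech, fs, nperseg=nperseg)
    _, _, N = stft(noise, fs, nperseg=nperseg)
    snr_db = 20.0 * np.log10((np.abs(S) + 1e-12) / (np.abs(N) + 1e-12))
    return (snr_db > lc_db).astype(float)

def apply_mask(signal, mask, fs, nperseg=512):
    """Gate a signal's STFT with the binary mask and resynthesize."""
    _, _, X = stft(signal, fs, nperseg=nperseg)
    _, out = istft(X * mask, fs, nperseg=nperseg)
    return out
```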

  13. Strategies for distant speech recognition in reverberant environments

    Science.gov (United States)

    Delcroix, Marc; Yoshioka, Takuya; Ogawa, Atsunori; Kubo, Yotaro; Fujimoto, Masakiyo; Ito, Nobutaka; Kinoshita, Keisuke; Espi, Miquel; Araki, Shoko; Hori, Takaaki; Nakatani, Tomohiro

    2015-12-01

    Reverberation and noise are known to severely affect the automatic speech recognition (ASR) performance of speech recorded by distant microphones. Therefore, we must deal with reverberation if we are to realize high-performance hands-free speech recognition. In this paper, we review a recognition system that we developed at our laboratory to deal with reverberant speech. The system consists of a speech enhancement (SE) front-end that employs long-term linear prediction-based dereverberation followed by noise reduction. We combine our SE front-end with an ASR back-end that uses neural networks for acoustic and language modeling. The proposed system achieved top scores on the ASR task of the REVERB challenge. This paper describes the different technologies used in our system and presents detailed experimental results that justify our implementation choices and may provide hints for designing distant ASR systems.

  14. Design of a Self-Tracking Smart Car Based on Speech Recognition and Infrared Photoelectric Sensors

    Institute of Scientific and Technical Information of China (English)

    李新科; 高潮; 郭永彩; 何卫华

    2011-01-01

    A self-tracking smart car has been designed based on infrared photoelectric sensors and speech recognition technology. The car uses the 16-bit Sunplus SPCE061A single-chip microcontroller as the core processor of the control circuit and obtains path information from reflective infrared photoelectric sensors. It adjusts its direction and speed according to the position of the black line in the path information, thereby implementing self-tracking. Speech-processing API functions were written around the SPCE061A's on-chip resources to provide voice-based human-machine interaction for intelligent navigation control. Experiments show that the smart car meets the design requirements and runs reliably and stably. The technique can be applied in fields such as smart wheelchairs for the physically disabled, service robots, intelligent toys, driverless vehicles, and warehouses.

  15. Research on Continuous Speech Recognition Based on HTK by MatLab Programming

    Institute of Scientific and Technical Information of China (English)

    李理; 王冬霞

    2014-01-01

    Based on the basic principles of HTK (the HMM Toolkit), small-vocabulary continuous speech recognition is implemented by calling HTK commands from MatLab. HTK is used to build the hidden Markov models (HMMs), and MatLab loops drive the recognition experiments, which avoids the redundancy of running each HTK command individually and reduces operational complexity.

  16. A Design of an Intelligent Video Surveillance System Based on Speech Recognition Technology

    Institute of Scientific and Technical Information of China (English)

    孙大飞; 袁诚; 高勇

    2011-01-01

    Aimed at the shortcomings of traditional video surveillance systems, a video surveillance system design is put forward that uses speech recognition technology as an auxiliary means of monitoring, providing functions such as active early warning and intelligent switching of monitored screens. Cepstral mean subtraction (CMS) is used to curb the impact on speech recognition of channel distortion caused by differing transmission equipment, and a model combining and re-estimation method solves the problem of retraining on all data whenever the training samples are updated. Simulation tests show that the system gives accurate alarms and switches screens intelligently, offering useful guidance for improving the traditional video surveillance systems on the market.
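    Cepstral mean subtraction itself is nearly a one-liner: a fixed channel multiplies the spectrum and therefore adds a constant in the cepstral domain, so removing the per-utterance mean cancels it. A NumPy sketch:

```python
import numpy as np

def cepstral_mean_subtraction(cepstra):
    """Subtract the per-utterance mean of each cepstral coefficient
    (rows are frames, columns are coefficients)."""
    return cepstra - cepstra.mean(axis=0, keepdims=True)
```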

  17. Speech Recognition System Based on the Embedded WinCE OS

    Institute of Scientific and Technical Information of China (English)

    张晶; 李心广; 王金矿

    2008-01-01

    This paper implements the customization and porting of the WinCE operating system on an Intel PXA270 embedded microprocessor development platform. Combined with the WinCE 5.0 Speech Application Programming Interface (SAPI 5.0), an embedded speech recognition system was successfully developed using Embedded Visual C++ 4.0 (EVC).

  18. Annotating Speech Corpus for Prosody Modeling in Indian Language Text to Speech Systems

    Directory of Open Access Journals (Sweden)

    Kiruthiga S

    2012-01-01

    Full Text Available A spoken language system, whether a speech synthesis or a speech recognition system, starts with building a speech corpus. We give a detailed survey of the issues and a methodology for selecting the appropriate speech unit when building a speech corpus for Indian-language text-to-speech systems. The paper ultimately aims to improve the intelligibility of the synthesized speech in text-to-speech synthesis systems. To begin with, an appropriate text file is selected for building the speech corpus, and a corresponding speech file, the phonetic realization of the selected text, is generated and stored. The speech file is processed at different levels, viz. paragraphs, sentences, phrases, words, syllables, and phones; these are called the speech units of the file. Research has been done taking each of these units as the basic unit for processing. This paper analyses work using phones, diphones, triphones, syllables, and polysyllables as the basic unit for speech synthesis, and also provides a recommended set of combinations for polysyllables. Concatenative speech synthesis involves the concatenation of these basic units to synthesize intelligible, natural-sounding speech. The speech units are annotated with relevant prosodic information about each unit, manually or automatically based on an algorithm; the database of units together with their annotations is called the annotated speech corpus. A clustering technique applied to the annotated speech corpus provides a way to select the appropriate unit for concatenation, based on the lowest total join cost of the speech units.
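    Selecting units by lowest total join cost is a shortest-path problem over the candidate lattice; below is a minimal Viterbi-style sketch with a caller-supplied join cost function. Target costs and the paper's clustering step are omitted, and all names are illustrative.

```python
import numpy as np

def select_units(candidates, join_cost):
    """Pick one candidate unit per position so that the total join cost of
    adjacent units is minimal. candidates: list of lists of unit feature
    vectors; join_cost: function of two adjacent units."""
    n = len(candidates)
    cost = [np.zeros(len(candidates[0]))]  # cumulative cost per candidate
    back = []                              # backpointers per position
    for t in range(1, n):
        ct = np.empty(len(candidates[t]))
        bt = np.empty(len(candidates[t]), dtype=int)
        for j, u in enumerate(candidates[t]):
            c = [cost[t - 1][i] + join_cost(p, u)
                 for i, p in enumerate(candidates[t - 1])]
            bt[j] = int(np.argmin(c))
            ct[j] = c[bt[j]]
        cost.append(ct)
        back.append(bt)
    # Trace back the cheapest path through the lattice.
    idx = [int(np.argmin(cost[-1]))]
    for bt in reversed(back):
        idx.append(int(bt[idx[-1]]))
    return list(reversed(idx))
```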

  19. Development of a Danish speech intelligibility test

    DEFF Research Database (Denmark)

    Nielsen, Jens Bo; Dau, Torsten

    2009-01-01

    Abstract A Danish speech intelligibility test for assessing the speech recognition threshold in noise (SRTN) has been developed. The test consists of 180 sentences distributed in 18 phonetically balanced lists. The sentences are based on an open word-set and represent everyday language. The sente...

  20. Emotion Classification from Noisy Speech - A Deep Learning Approach

    OpenAIRE

    Rana, Rajib

    2016-01-01

    This paper investigates the performance of Deep Learning for speech emotion classification when the speech is compounded with noise. It reports on the classification accuracy and concludes with the future directions for achieving greater robustness for emotion recognition from noisy speech.

  1. The Kiel Corpora of "Speech & Emotion" - A Summary

    DEFF Research Database (Denmark)

    Niebuhr, Oliver; Peters, Benno; Landgraf, Rabea;

    2015-01-01

    technology applications that sneak into every corner of our life. Apart from the fact that speech corpora seem to become constantly larger (for example, in order to properly train self-learning speech synthesis/recognition algorithms), the content of speech corpora is also changing. In particular, recordings

  2. A neural mechanism for recognizing speech spoken by different speakers

    NARCIS (Netherlands)

    Kreitewolf, Jens; Gaudrain, Etienne; von Kriegstein, Katharina

    2014-01-01

    Understanding speech from different speakers is a sophisticated process, particularly because the same acoustic parameters convey important information about both the speech message and the person speaking. How the human brain accomplishes speech recognition under such conditions is unknown. One vie

  3. Speech Recognition of Agricultural Market Information Based on MMSE Spectral Subtraction

    Institute of Scientific and Technical Information of China (English)

    许金普

    2015-01-01

    To address the inconvenient operation of traditional portable devices for agricultural market information collection and their susceptibility to the environment, we propose using speech recognition technology to collect the information, increasing the flexibility of the operator interface. To strengthen the noise robustness of speech recognition under the particular working conditions of agricultural market information collection, a training set was recorded from 20 male and 20 female speakers. Minimum mean square error (MMSE) spectral subtraction is first applied as front-end enhancement of the noisy speech; MFCC features are then extracted from the enhanced signal to train HMM acoustic models. The acoustic unit is a context-dependent triphone model, and decision-tree state clustering and incremental Gaussian mixture splitting are adopted during training to improve model accuracy. The trained models were tested with speech sentences at different SNRs in three different environments, and the results show that the recognition rate of this method improves markedly over basic spectral subtraction (SS) and multi-band spectral subtraction (MB), especially at low SNR.
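    A sketch of plain magnitude spectral subtraction, assuming SciPy; the MMSE variant used in the paper replaces the fixed subtraction with a statistically optimal gain, and the noise-estimation and floor parameters here are illustrative.

```python
import numpy as np
from scipy.signal import stft, istft

def spectral_subtraction(noisy, fs, noise_frames=10, beta=0.02, nperseg=512):
    """Subtract a noise magnitude estimate (from the first, assumed
    speech-free, frames) from each frame's magnitude spectrum, with a
    spectral floor to limit musical noise, then resynthesize."""
    _, _, X = stft(noisy, fs, nperseg=nperseg)
    mag, phase = np.abs(X), np.angle(X)
    noise_mag = mag[:, :noise_frames].mean(axis=1, keepdims=True)
    clean_mag = np.maximum(mag - noise_mag, beta * mag)  # spectral floor
    _, out = istft(clean_mag * np.exp(1j * phase), fs, nperseg=nperseg)
    return out
```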

  4. Speech and gesture interfaces for squad-level human-robot teaming

    Science.gov (United States)

    Harris, Jonathan; Barber, Daniel

    2014-06-01

    As the military increasingly adopts semi-autonomous unmanned systems for military operations, utilizing redundant and intuitive interfaces for communication between Soldiers and robots is vital to mission success. Currently, Soldiers use a common lexicon to verbally and visually communicate maneuvers between teammates. In order for robots to be seamlessly integrated within mixed-initiative teams, they must be able to understand this lexicon. Recent innovations in gaming platforms have led to advancements in speech and gesture recognition technologies, but the reliability of these technologies for enabling communication in human-robot teaming is unclear. The purpose of the present study is to investigate the performance of Commercial-Off-The-Shelf (COTS) speech and gesture recognition tools in classifying a Squad Level Vocabulary (SLV) for a spatial navigation reconnaissance and surveillance task. The SLV for this study was based on findings from a survey conducted with Soldiers at Fort Benning, GA. The items of the survey focused on the communication between the Soldier and the robot, specifically in regard to verbally instructing it to execute reconnaissance and surveillance tasks. Resulting commands, identified from the survey, were then converted to equivalent arm and hand gestures, leveraging existing visual signals (e.g. the U.S. Army Field Manual for Visual Signaling). A study was then run to test the ability of commercially available automated speech recognition technologies and a gesture recognition glove to classify these commands in a simulated intelligence, surveillance, and reconnaissance task. This paper presents the classification accuracy of these devices for both speech and gesture modalities independently.

  5. Comparison of speech recognition with different speech coding strategies (SPEAK, CIS, and ACE) and their relationship to telemetric measures of compound action potentials in the nucleus CI 24M cochlear implant system.

    Science.gov (United States)

    Kiefer, J; Hohl, S; Stürzebecher, E; Pfennigdorff, T; Gstöettner, W

    2001-01-01

    Speech understanding and subjective preference for three different speech coding strategies (spectral peak coding [SPEAK], continuous interleaved sampling [CIS], and advanced combination encoders [ACE]) were investigated in 11 post-lingually deaf adult subjects, using the Nucleus CI 24M cochlear implant system. Subjects were randomly assigned to two groups in a balanced crossover study design. The first group was initially fitted with SPEAK and the second group with CIS. The remaining strategies were tested sequentially over 8 to 10 weeks with systematic variations of number of channels and rate of stimulation. Following a further interval of 3 months, during which subjects were allowed to listen with their preferred strategy, they were tested again with all three strategies. Compound action potentials (CAPs) were recorded using neural response telemetry. Input/output functions in relation to increasing stimulus levels and inter-stimulus intervals between masker and probe were established to assess the physiological status of the cochlear nerve. Objective results and subjective rating showed significant differences in favour of the ACE strategy. Ten of the 11 subjects preferred the ACE strategy at the end of the study. The estimate of the refractory period based on the inter-stimulus interval correlated significantly with the overall performance with all three strategies, but CAP measures could not be related to individual preference of strategy or differences in performance between strategies. Based on these results, the ACE strategy can be recommended as an initial choice specifically for the Nucleus CI 24M cochlear implant system. Nevertheless, access to the other strategies may help to increase performance in individual patients. PMID:11296939

  6. The role of speech in the user interface : perspective and application

    OpenAIRE

    Abewusi, A.B.

    1994-01-01

    Consideration must be given to the implications of speech as a communication medium before deciding to use speech input or output in an interactive environment. There are several effective control strategies for improving the quality of speech. The utility of speech has been demonstrated by application to several illustrative problems, where it has proved effective despite all the limitations of synthetic speech output and automatic speech recognition systems. (Author's abstract)

  7. The benefit obtained from visually displayed text from an automatic speech recognizer during listening to speech presented in noise

    NARCIS (Netherlands)

    Zekveld, A.A.; Kramer, S.E.; Kessens, J.M.; Vlaming, M.S.M.G.; Houtgast, T.

    2008-01-01

    OBJECTIVES: The aim of this study was to evaluate the benefit that listeners obtain from visually presented output from an automatic speech recognition (ASR) system during listening to speech in noise. DESIGN: Auditory-alone and audiovisual speech reception thresholds (SRTs) were measured. The SRT i...

  8. Bimodal Hearing and Speech Perception with a Competing Talker

    Science.gov (United States)

    Pyschny, Verena; Landwehr, Markus; Hahn, Moritz; Walger, Martin; von Wedel, Hasso; Meister, Hartmut

    2011-01-01

    Purpose: The objective of the study was to investigate the influence of bimodal stimulation upon hearing ability for speech recognition in the presence of a single competing talker. Method: Speech recognition was measured in 3 listening conditions: hearing aid (HA) alone, cochlear implant (CI) alone, and both devices together (CI + HA). To examine…

  9. Real-time speech gisting for ATC applications

    Science.gov (United States)

    Dunkelberger, Kirk A.

    1995-06-01

    Command and control within the ATC environment remains primarily voice-based. Hence, automatic real-time, speaker-independent, continuous speech recognition (CSR) has many obvious applications and implied benefits to the ATC community: automated target tagging, aircraft compliance monitoring, controller training, automatic alarm disabling, display management, and many others. However, while current state-of-the-art CSR systems provide upwards of 98% word accuracy in laboratory environments, recent low-intrusion experiments in ATCT environments demonstrated less than 70% word accuracy in spite of significant investments in recognizer tuning. Acoustic channel irregularities and controller/pilot grammar varieties impact current CSR algorithms at their weakest points. It will be shown herein, however, that real-time context- and environment-sensitive gisting can provide key command phrase recognition rates of greater than 95% using the same low-intrusion approach. The combination of real-time inexact syntactic pattern recognition techniques and a tight integration of CSR, gisting, and ATC database accessor system components is the key to these high phrase recognition rates. A system concept for real-time gisting in the ATC context is presented herein. After establishing an application context, the discussion presents a minimal CSR technology context and then focuses on the gisting mechanism, desirable interfaces into the ATCT database environment, and data and control flow within the prototype system. Results of recent tests for a subset of the functionality are presented together with suggestions for further research.
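    A hedged illustration of the gisting idea: rather than demanding a fully correct transcript, the system scans recognizer output for a small set of key command phrases. The patterns and phraseology below are invented for the sketch and are far simpler than a real ATC grammar.

```python
# Toy "gisting" pass: spot ATC-style command phrases in a (possibly errorful)
# recognizer transcript. Patterns and phraseology are invented for illustration.
import re

COMMANDS = {
    "turn":    re.compile(r"turn (left|right) heading (\d{3})"),
    "climb":   re.compile(r"(climb|descend) and maintain (\d+(?:,\d{3})?)"),
    "contact": re.compile(r"contact (tower|ground|approach) on ([\d.]+)"),
}

def gist(transcript):
    """Return the command phrases found, ignoring unparseable filler words."""
    hits = []
    for name, pattern in COMMANDS.items():
        for m in pattern.finditer(transcript.lower()):
            hits.append((name, m.groups()))
    return hits

print(gist("Delta four five one turn left heading 270 uh contact tower on 118.3"))
# -> [('turn', ('left', '270')), ('contact', ('tower', '118.3'))]
```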

  10. Whispered speech emotion recognition based on MD-CM-SFLA neural network%基于MD-CM-SFLA神经网络的耳语音情感识别

    Institute of Scientific and Technical Information of China (English)

    张潇丹; 包永强; 奚吉; 赵力; 邹采荣

    2012-01-01

    A molecular dynamics simulation and cloud model theory based shuffled frog leaping algorithm (MD-CM-SFLA) is proposed. In this algorithm, an individual frog is treated as a molecule, and only the attractive force between the worst individual and the global best individual is considered. A new intermolecular force is adopted instead of the classic two-body Lennard-Jones force, and the Velocity-Verlet algorithm and a normal cloud generator are substituted for the update strategy of the shuffled frog leaping algorithm (SFLA), effectively balancing population diversity against search efficiency. An MD-CM-SFLA neural network is then constructed by combining MD-CM-SFLA with a back-propagation (BP) neural network, and it is applied to whispered speech emotion recognition. The experimental results indicate that, compared with the BP neural network, the MD-CM-SFLA neural network has obvious advantages: under the same test conditions, its average recognition rate is 5.2% higher than that of the BP neural network. Utilizing the MD-CM-SFLA algorithm to optimize the parameters of a BP neural network thus yields fast convergence and good learning ability, providing a new idea for whispered speech emotion recognition.
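    The Velocity-Verlet update that this record borrows from molecular dynamics is the standard one below, where x_i, v_i and a_i are the position, velocity and acceleration of frog/molecule i; here a_i would come from the attractive force toward the global best rather than a physical potential (our summary notation, not the paper's):

```latex
\mathbf{x}_i(t+\Delta t) = \mathbf{x}_i(t) + \mathbf{v}_i(t)\,\Delta t
  + \tfrac{1}{2}\,\mathbf{a}_i(t)\,\Delta t^2, \qquad
\mathbf{v}_i(t+\Delta t) = \mathbf{v}_i(t)
  + \tfrac{1}{2}\bigl[\mathbf{a}_i(t) + \mathbf{a}_i(t+\Delta t)\bigr]\,\Delta t .
```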

  11. Human Lips-Contour Recognition and Tracing

    Directory of Open Access Journals (Sweden)

    Md. Hasan Tareque

    2014-01-01

    Full Text Available Human-lip detection is an important component of many modern automated systems, such as computerized speech reading and face recognition; such systems can work more precisely if the human lip is detected accurately. There are many approaches to detecting the human lip. In this paper an approach is developed to detect the region of a human lip, which we call the lip contour. For this, a region-based Active Contour Model (ACM) is introduced together with watershed segmentation. In this model we use global energy terms instead of local energy terms because global energy gives a better convergence rate in adverse environments. With the ACM initialized using an H∞ approach based on Lyapunov stability theory, the system gives more accurate and stable results.

  12. Speech coding

    Energy Technology Data Exchange (ETDEWEB)

    Ravishankar, C., Hughes Network Systems, Germantown, MD

    1998-05-08

    Speech is the predominant means of communication between human beings, and since the invention of the telephone by Alexander Graham Bell in 1876, speech services have remained the core service in almost all telecommunication systems. Original analog methods of telephony had the disadvantage of the speech signal getting corrupted by noise, cross-talk and distortion. Long-haul transmissions, which use repeaters to compensate for the loss in signal strength on transmission links, also increase the associated noise and distortion. Digital transmission, on the other hand, is relatively immune to noise, cross-talk and distortion, primarily because of the capability to faithfully regenerate the digital signal at each repeater purely on the basis of a binary decision; the end-to-end performance of a digital link is therefore essentially independent of the length and operating frequency bands of the link. Hence, from a transmission point of view, digital transmission has been the preferred approach due to its higher immunity to noise. The need to carry digital speech became extremely important from a service-provision point of view as well. Modern requirements have introduced the need for robust, flexible and secure services that can carry a multitude of signal types (such as voice, data and video) without a fundamental change in infrastructure. Such a requirement could not have been easily met without the advent of digital transmission systems, thereby requiring speech to be coded digitally. The term speech coding often refers to techniques that represent or code speech signals either directly as a waveform or as a set of parameters obtained by analyzing the speech signal. In either case, the codes are transmitted to the distant end, where speech is reconstructed or synthesized using the received set of codes. A more generic term that is often used interchangeably with speech coding is voice coding. This term is more generic in the sense that the
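    As a concrete taste of waveform-style coding, the sketch below implements μ-law companding, the non-uniform amplitude compression used in G.711 telephony. This is offered as an illustration of the waveform-coding family mentioned above, not as a technique taken from this record.

```python
import numpy as np

MU = 255.0  # value used by G.711 mu-law telephony

def mu_law_encode(x):
    """Compress a [-1, 1] waveform sample non-uniformly before quantization."""
    return np.sign(x) * np.log1p(MU * np.abs(x)) / np.log1p(MU)

def mu_law_decode(y):
    """Invert the companding at the receiver."""
    return np.sign(y) * np.expm1(np.abs(y) * np.log1p(MU)) / MU

x = np.linspace(-1, 1, 5)
assert np.allclose(mu_law_decode(mu_law_encode(x)), x)  # round-trips exactly
```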

  13. Speech-on-speech masking with variable access to the linguistic content of the masker speech.

    Science.gov (United States)

    Calandruccio, Lauren; Dhar, Sumitrajit; Bradlow, Ann R

    2010-08-01

    It has been reported that listeners can benefit from a release in masking when the masker speech is spoken in a language that differs from the target speech compared to when the target and masker speech are spoken in the same language [Freyman, R. L. et al. (1999). J. Acoust. Soc. Am. 106, 3578-3588; Van Engen, K., and Bradlow, A. (2007), J. Acoust. Soc. Am. 121, 519-526]. It is unclear whether listeners benefit from this release in masking due to the lack of linguistic interference of the masker speech, from acoustic and phonetic differences between the target and masker languages, or a combination of these differences. In the following series of experiments, listeners' sentence recognition was evaluated using speech and noise maskers that varied in the amount of linguistic content, including native-English, Mandarin-accented English, and Mandarin speech. Results from three experiments indicated that the majority of differences observed between the linguistic maskers could be explained by spectral differences between the masker conditions. However, when the recognition task increased in difficulty, i.e., at a more challenging signal-to-noise ratio, a greater decrease in performance was observed for the maskers with more linguistically relevant information than what could be explained by spectral differences alone. PMID:20707455

  14. Cued Speech: A visual communication mode for the Deaf society

    OpenAIRE

    Heracleous, Panikos; Beautemps, Denis

    2010-01-01

    Cued Speech is a visual mode of communication that uses handshapes and placements in combination with the mouth movements of speech to make the phonemes of a spoken language look different from each other and clearly understandable to deaf individuals. The aim of Cued Speech is to overcome the problems of lip reading and thus enable deaf persons to wholly understand spoken language. In this study, automatic phoneme recognition in Cued Speech for French based on hidden Markov models (HMMs) is i...

  15. The intelligibility of interrupted speech depends upon its uninterrupted intelligibility.

    Science.gov (United States)

    Ardoint, Marine; Green, Tim; Rosen, Stuart

    2014-10-01

    Recognition of sentences containing periodic, 5-Hz, silent interruptions of differing duty cycles was assessed for three types of processed speech. Processing conditions employed different combinations of spectral resolution and the availability of fundamental frequency (F0) information, chosen to yield similar, below-ceiling performance for uninterrupted speech. Performance declined with decreasing duty cycle similarly for each processing condition, suggesting that, at least for certain forms of speech processing and interruption rates, performance with interrupted speech may reflect that obtained with uninterrupted speech. This highlights the difficulty in interpreting differences in interrupted speech performance across conditions for which uninterrupted performance is at ceiling. PMID:25324110
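    The stimulus manipulation described here is simple enough to sketch. A hedged example follows; the signal is a random placeholder, not the study's sentence material.

```python
import numpy as np

def interrupt(signal, sr, rate_hz=5.0, duty=0.5):
    """Impose periodic silent gaps: keep `duty` of each cycle, zero the rest."""
    t = np.arange(len(signal)) / sr
    phase = (t * rate_hz) % 1.0            # position within each interruption cycle
    mask = (phase < duty).astype(signal.dtype)
    return signal * mask

sr = 16000
speech = np.random.randn(sr)               # placeholder for a sentence recording
gated = interrupt(speech, sr, 5.0, 0.5)    # 5 Hz interruptions, 50% duty cycle
```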

  16. Spatial release from masking for Chinese-native listeners in English speech recognition

    Institute of Scientific and Technical Information of China (English)

    陈妍; 邱小军

    2011-01-01

    Spatial release from masking for Chinese-native listeners in an English environment is investigated in psychoacoustic experiments with two kinds of interference from different directions and at different signal-to-noise ratios (SNRs). Both targets and interferences were produced by loudspeakers in an anechoic room, and the correct recognition rates of the targets were obtained from listeners. The results showed that the correct rate was above 99% when only speech targets were presented in front of listeners, about 57% when speech targets and speech interferences were both in front of listeners, and about 96% when speech targets were in front and speech interferences came randomly from a direction of ±60°. With speech targets and interferences both in front of listeners, the correct rate dropped from 96% to 34% as the SNR decreased from 0 dB to -12 dB when the interference was noise; when the interference was speech, the correct rate first dropped, then rose by 27%, and after that resumed its decline. When the noise and speech interferences were produced from a direction of 60°, the correct rate dropped from 99% to 80% and from 98% to 91%, respectively, as the SNR decreased from -4 dB to -16 dB. The conclusions are that spatial separation has an obvious gain effect on the English speech intelligibility of Chinese-native listeners, and that the correct rate drops with decreasing SNR in most situations, which agrees well with the conclusions of related research on other native languages.

  17. Multimedia content with a speech track: ACM multimedia 2010 workshop on searching spontaneous conversational speech

    NARCIS (Netherlands)

    Larson, M.; Ordelman, R.; Metze, F.; Kraaij, W.; Jong, F. de

    2010-01-01

    When multimedia content has a speech track, a whole array of techniques involving speech recognition and analysis becomes available for indexing and structuring and can provide users with improved access and search. The set of new domains standing to benefit from these techniques encompasses talkshows...

  18. Speech recognition thresholds in noisy areas: reference values for normal hearing adults

    Directory of Open Access Journals (Sweden)

    Marília Oliveira Henriques

    2008-04-01

    Full Text Available In audiology clinics, complaints about difficulties in speech comprehension in noisy environments are frequent, even for normal-hearing individuals. Thus, the audiologist must not only identify a hearing loss, but also analyze speech comprehension under communication conditions close to those found in daily life. AIM: To determine the reference value for speech recognition thresholds for sentences in noise, in the free field, for normal-hearing adults. MATERIALS AND METHODS: The experiment was carried out in 2005 and 2006. 150 normal-hearing adults participated, aged between 18 and 64 years, assessed in an acoustically treated booth. The evaluation was based on the Portuguese Sentence Lists test. The sentence lists were presented in the free field in the presence of competing noise at a fixed intensity of 65 dB A; both stimuli were presented at 0°-0° azimuth. RESULTS AND CONCLUSION: The free-field sentence recognition thresholds were obtained at a signal-to-noise ratio of -8.14 dB A, this being the reference value for normal-hearing individuals.

  19. A Survey on Statistical Based Single Channel Speech Enhancement Techniques

    Directory of Open Access Journals (Sweden)

    Sunnydayal. V

    2014-11-01

    Full Text Available Speech enhancement is a long-standing problem with various applications such as hearing aids, automatic recognition and coding of speech signals. Single-channel speech enhancement techniques are used to enhance speech degraded by additive background noise. Background noise can have an adverse impact on our ability to converse smoothly and without hindrance in very noisy environments, such as busy streets, a car, or the cockpit of an airplane, and such noise can affect both the quality and the intelligibility of speech. This survey paper provides an overview of speech enhancement algorithms, mainly statistically based approaches, that enhance noisy speech signals corrupted by additive noise. Different estimators are compared, and challenges and opportunities of speech enhancement are also discussed. The paper helps in choosing the best statistically based technique for speech enhancement.
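    For orientation, the magnitude-domain rule underlying the basic subtractive methods covered by such surveys is commonly written as below, with an over-subtraction factor α and a spectral floor β; this generic textbook form is our own summary, not a formula quoted from the paper:

```latex
|\hat{S}(\omega)| \;=\; \max\bigl(\,|Y(\omega)| - \alpha\,|\hat{N}(\omega)|,\;
\beta\,|Y(\omega)|\,\bigr),
```

    where |Y| is the noisy-speech magnitude spectrum and |N̂| the estimated noise magnitude.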

  20. Online handwriting recognition in a form-filling task: evaluating the impact of context-awareness

    Science.gov (United States)

    Seni, Giovanni; Rice, Kimberly; Mayoraz, Eddy

    2003-12-01

    Guiding a recognition task using a language model is commonly accepted as having a positive effect on accuracy and is routinely used in automated speech processing. This paper presents a quantitative study of the impact of the use of word models in online handwriting recognition applied to form-filling tasks on handheld devices. Two types of word models are considered: a dictionary, typically from a few thousand up to a hundred thousand words; and a grammar or regular expression generating a language several orders of magnitude bigger than the dictionary. It is reported that the improvement in accuracy obtained by the use of a grammar compares with the gain provided by the use of a dictionary. Finally, the impact of the word models on user acceptance of online handwriting recognition in a specific form-filling application is presented.
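    The effect of a dictionary word model can be shown with a toy rescorer. In this hedged sketch (the per-character distributions and the lexicon are invented), hypotheses are restricted to strings the lexicon generates, so a misread character is repaired:

```python
import math

DICTIONARY = {"cat", "car", "cart"}  # toy lexicon; real ones hold 10^3-10^5 words

def best_word(char_probs, dictionary):
    """Pick the dictionary word with the highest summed log-probability."""
    best, best_score = None, -math.inf
    for word in dictionary:
        if len(word) != len(char_probs):
            continue
        score = sum(math.log(p.get(c, 1e-9)) for p, c in zip(char_probs, word))
        if score > best_score:
            best, best_score = word, score
    return best

# The recognizer slightly prefers the non-word "caf"; the dictionary repairs it.
probs = [{"c": 0.9}, {"a": 0.9}, {"f": 0.5, "t": 0.4, "r": 0.1}]
print(best_word(probs, DICTIONARY))  # -> "cat"
```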

  1. Design of household control system based on speech recognition

    Institute of Scientific and Technical Information of China (English)

    黄辉健; 程良鸿; 黄明杰; 林垣华; 李志杰

    2014-01-01

    This paper studies speaker-dependent speech recognition and control based on the Sunplus SPCE061A, applying voice recognition technology to a home control system. A control scheme is proposed that is convenient to operate, easy to expand, and suitable for home applications. The system is analyzed from the perspectives of hardware circuitry and software design. In addition, a control application for Android smartphones based on Bluetooth communication was built on the Google App Inventor platform. Test results show that the system successfully realizes voice control of home appliances and remote control from Android smartphones.

  2. The relative phonetic contributions of a cochlear implant and residual acoustic hearing to bimodal speech perceptiona

    OpenAIRE

    Sheffield, Benjamin M.; Zeng, Fan-Gang

    2012-01-01

    The addition of low-passed (LP) speech or even a tone following the fundamental frequency (F0) of speech has been shown to benefit speech recognition for cochlear implant (CI) users with residual acoustic hearing. The mechanisms underlying this benefit are still unclear. In this study, eight bimodal subjects (CI users with acoustic hearing in the non-implanted ear) and eight simulated bimodal subjects (using vocoded and LP speech) were tested on vowel and consonant recognition to determine th...

  3. Modeling speech imitation and ecological learning of auditory-motor maps

    OpenAIRE

    Claudia eCanevari; Leonardo eBadino; Alessandro eD'Ausilio; Luciano eFadiga; Giorgio eMetta

    2013-01-01

    Classical models of speech consider an antero-posterior distinction between perceptive and productive functions. However, the selective alteration of neural activity in speech motor centers, via transcranial magnetic stimulation, was shown to affect speech discrimination. On the automatic speech recognition (ASR) side, the recognition systems have classically relied solely on acoustic data, achieving rather good performance in optimal listening conditions. The main limitations of current ASR ...

  4. Speech is Golden

    DEFF Research Database (Denmark)

    Juel Henrichsen, Peter

    2014-01-01

    Most of the Danish municipalities are ready to begin to adopt automatic speech recognition, but at the same time remain nervous following a long series of bad business cases in the recent past. Complaints are voiced over costly licences and low service levels, typical effects of a de facto monopoly on the supply side. The present article reports on a new public action strategy which has taken shape in the course of 2013-14. While Denmark is a small language area, our public sector is well organised and has considerable purchasing power. Across this past year, Danish local authorities have organised around...

  5. Denoising in the Speech Recognition System Based on the Improved Spectral Subtraction Method

    Institute of Scientific and Technical Information of China (English)

    田莎莎; 田艳

    2012-01-01

    The paper studies the classic (traditional) spectral subtraction method and an improved spectral subtraction method for eliminating noise interference in a speech recognition system, describing the principles and characteristics of both algorithms. Both algorithms were programmed and implemented, and MATLAB simulations show that the improved algorithm not only eliminates noise but also suppresses "musical noise", giving it an advantage over the classic method.

  6. Feature Space Nonlinear Manifold Based Acoustic Model for Speech Recognition

    Institute of Scientific and Technical Information of China (English)

    张文林; 牛铜; 屈丹; 李弼程; 裴喜龙

    2015-01-01

    Based on the nonlinear manifold structure of the acoustic feature space of the speech signal, a new type of acoustic model for speech recognition is developed using compressive sensing on the manifold. The feature space is divided into multiple local areas, with each area approximated by a low-dimensional factor analysis model, so that a mixture of factor analyzers is obtained. By restricting the observation vectors to lie on that nonlinear low-dimensional manifold, the observation probability model of each context-dependent state can be derived. Each state is ultimately determined by a sparsity-constrained weight vector and several low-dimensional local factor vectors that follow standard Gaussian distributions. The principle for selecting the latent dimension of each local area is given, and iterative estimation methods for the model parameters are presented. Continuous speech recognition experiments on the RM corpus show that, compared with the conventional Gaussian mixture model (GMM) and the subspace Gaussian mixture model (SGMM), the new acoustic model reduces the average word error rate (WER) on the test set by a relative 33.1% and 9.2%, respectively.
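    For readers unfamiliar with mixtures of factor analyzers, the observation density sketched in this abstract takes the standard form below (our notation, not the paper's): region k contributes a Gaussian whose covariance combines a low-dimensional loading matrix Λ_k with a diagonal term Ψ_k, and the weight vector w is sparse:

```latex
p(\mathbf{x}) \;=\; \sum_{k=1}^{K} w_k\,
\mathcal{N}\!\bigl(\mathbf{x};\; \boldsymbol{\mu}_k,\;
\boldsymbol{\Lambda}_k \boldsymbol{\Lambda}_k^{\top} + \boldsymbol{\Psi}_k\bigr),
\qquad \|\mathbf{w}\|_0 \ \text{small}.
```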

  7. Talker Variability in Audiovisual Speech Perception

    Directory of Open Access Journals (Sweden)

    Shannon eHeald

    2014-07-01

    Full Text Available A change in talker is a change in the context for the phonetic interpretation of acoustic patterns of speech. Different talkers have different mappings between acoustic patterns and phonetic categories, and listeners need to adapt to these differences. Despite this complexity, listeners are adept at comprehending speech in multiple-talker contexts, albeit at a slight but measurable performance cost (e.g., slower recognition). So far, this talker-variability cost has been demonstrated only in audio-only speech. Other research in single-talker contexts has shown, however, that when listeners are able to see a talker's face, speech recognition is improved under adverse listening conditions (e.g., noise or distortion) that can increase uncertainty in the mapping between acoustic patterns and phonetic categories. Does seeing a talker's face reduce the cost of word recognition in multiple-talker contexts? We used a speeded word-monitoring task in which listeners make quick judgments about target-word recognition in single- and multiple-talker contexts. Results show faster recognition performance in single-talker conditions compared to multiple-talker conditions for both audio-only and audio-visual speech. However, recognition time in a multiple-talker context was slower in the audio-visual condition compared to the audio-only condition. These results suggest that seeing a talker's face during speech perception may slow recognition by increasing the importance of talker identification, signaling to the listener that a change in talker has occurred.

  8. The Agricultural Price Information Acquisition Method Based on Speech Recognition

    Institute of Scientific and Technical Information of China (English)

    许金普; 诸叶平

    2015-01-01

    [Objective] In this research, speech recognition technology was applied to collect agricultural price information. The aim of the research is to recognize continuous speech that is limited in vocabulary and uttered by independent Chinese Mandarin speakers, and to propose a robust speech recognition method suitable for the environment where agricultural product prices are collected. On the basis of hidden Markov models (HMMs), we train acoustic models for this environment, so as to relieve the decrease in recognition rate caused by the mismatch between the test environment and the training environment, and to further improve the recognition rate. [Method] In the stage of acquiring and processing data, we first built a transformation grammar according to certain rules to recognize the limited vocabulary, and this grammar was used to guide the recording of both training and test data. We then selected different environments in which agricultural product prices were collected by different speakers, and on this basis built a speech ... The same test set was used in experiments to obtain the sentence recognition rate, word recognition rate and accuracy of the different methods. [Results] The recognition performance of the triphone acoustic model was clearly superior to that of the monophone model, and the female and male models each outperformed the mixed-gender acoustic model. Decision-tree clustering did not noticeably improve the recognition rate but clearly reduced the number of triphone models; increasing the number of Gaussian mixture components improved the recognition rate at the cost of extra computation; and the CMN and CVN methods clearly improved the recognition performance of the system. Testing across different locations and different speakers, the final recognition rate was 95.04% for male and 97.62% for female speakers. [Conclusion] Applying speech recognition technology to the collection of agricultural product price information is feasible. This paper proposes a method for improving the speech recognition rate in the agricultural price collection environment; experiments prove that models trained by this method have good recognition performance, laying a foundation for the development of future application systems.
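    Of the techniques credited with clear gains here, cepstral mean and variance normalization (CMN/CVN) is easy to make concrete. A minimal per-utterance sketch follows; the feature values are placeholders:

```python
import numpy as np

def cmn_cvn(feats, eps=1e-8):
    """Cepstral mean and variance normalization over one utterance.
    feats: (n_frames, n_coeffs) MFCC matrix."""
    mu = feats.mean(axis=0)
    sigma = feats.std(axis=0)
    return (feats - mu) / (sigma + eps)

mfcc = np.random.randn(200, 13) * 3.0 + 5.0   # placeholder utterance features
norm = cmn_cvn(mfcc)
print(norm.mean(axis=0).round(6), norm.std(axis=0).round(3))  # ~0 and ~1 per coefficient
```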

  9. Personality in speech assessment and automatic classification

    CERN Document Server

    Polzehl, Tim

    2015-01-01

    This work combines interdisciplinary knowledge and experience from research fields of psychology, linguistics, audio-processing, machine learning, and computer science. The work systematically explores a novel research topic devoted to automated modeling of personality expression from speech. For this aim, it introduces a novel personality assessment questionnaire and presents the results of extensive labeling sessions to annotate the speech data with personality assessments. It provides estimates of the Big 5 personality traits, i.e. openness, conscientiousness, extroversion, agreeableness, and neuroticism. Based on a database built on the questionnaire, the book presents models to tell apart different personality types or classes from speech automatically.

  10. Utility of TMS to understand the neurobiology of speech.

    Science.gov (United States)

    Murakami, Takenobu; Ugawa, Yoshikazu; Ziemann, Ulf

    2013-01-01

    According to a traditional view, speech perception and production are processed largely separately in sensory and motor brain areas. Recent psycholinguistic and neuroimaging studies provide novel evidence that the sensory and motor systems dynamically interact in speech processing, by demonstrating that speech perception and imitation share regional brain activations. However, the exact nature and mechanisms of these sensorimotor interactions are not completely understood yet. Transcranial magnetic stimulation (TMS) has often been used in the cognitive neurosciences, including speech research, as a complementary technique to behavioral and neuroimaging studies. Here we provide an up-to-date review focusing on TMS studies that explored speech perception and imitation. Single-pulse TMS of the primary motor cortex (M1) demonstrated a speech specific and somatotopically specific increase of excitability of the M1 lip area during speech perception (listening to speech or lip reading). A paired-coil TMS approach showed increases in effective connectivity from brain regions that are involved in speech processing to the M1 lip area when listening to speech. TMS in virtual lesion mode applied to speech processing areas modulated performance of phonological recognition and imitation of perceived speech. In summary, TMS is an innovative tool to investigate processing of speech perception and imitation. TMS studies have provided strong evidence that the sensory system is critically involved in mapping sensory input onto motor output and that the motor system plays an important role in speech perception.

  11. Utility of TMS to understand the neurobiology of speech

    Directory of Open Access Journals (Sweden)

    Takenobu eMurakami

    2013-07-01

    Full Text Available According to a traditional view, speech perception and production are processed largely separately in sensory and motor brain areas. Recent psycholinguistic and neuroimaging studies provide novel evidence that the sensory and motor systems dynamically interact in speech processing, by demonstrating that speech perception and imitation share regional brain activations. However, the exact nature and mechanisms of these sensorimotor interactions are not completely understood yet. Transcranial magnetic stimulation (TMS) has often been used in the cognitive neurosciences, including speech research, as a complementary technique to behavioral and neuroimaging studies. Here we provide an up-to-date review focusing on TMS studies that explored speech perception and imitation. Single-pulse TMS of the primary motor cortex (M1) demonstrated a speech-specific and somatotopically specific increase of excitability of the M1 lip area during speech perception (listening to speech or lip reading). A paired-coil TMS approach showed increases in effective connectivity from brain regions that are involved in speech processing to the M1 lip area when listening to speech. TMS in virtual lesion mode applied to speech processing areas modulated performance of phonological recognition and imitation of perceived speech. In summary, TMS is an innovative tool to investigate processing of speech perception and imitation. TMS studies have provided strong evidence that the sensory system is critically involved in mapping sensory input onto motor output and that the motor system plays an important role in speech perception.

  12. Hate speech

    Directory of Open Access Journals (Sweden)

    Anne Birgitta Nilsen

    2014-03-01

    Full Text Available The manifesto of the Norwegian terrorist Anders Behring Breivik is based on the "Eurabia" conspiracy theory. This theory is a key starting point for hate speech amongst many right-wing extremists in Europe, but also has ramifications beyond these environments. In brief, proponents of the Eurabia theory claim that Muslims are occupying Europe and destroying Western culture, with the assistance of the EU and European governments. By contrast, members of Al-Qaeda and other extreme Islamists promote the conspiracy theory "the Crusade" in their hate speech directed against the West. Proponents of the latter theory argue that the West is leading a crusade to eradicate Islam and Muslims, a crusade that is similarly facilitated by their governments. This article presents analyses of texts written by right-wing extremists and Muslim extremists in an effort to shed light on how hate speech promulgates conspiracy theories in order to spread hatred and intolerance. The aim of the article is to contribute to a more thorough understanding of hate speech's nature by applying rhetorical analysis. Rhetorical analysis is chosen because it offers a means of understanding the persuasive power of speech. It is thus a suitable tool to describe how hate speech works to convince and persuade. The concepts from rhetorical theory used in this article are ethos, logos and pathos. The concept of ethos is used to pinpoint factors that contributed to Osama bin Laden's impact, namely factors that lent credibility to his promotion of the conspiracy theory of the Crusade. In particular, Bin Laden projected common sense, good morals and good will towards his audience. He seemed to have coherent and relevant arguments; he appeared to possess moral credibility; and his use of language demonstrated that he wanted the best for his audience. The concept of pathos is used to define hate speech, since hate speech targets its audience's emotions. In hate speech it is the

  13. A Novel Morphometry-Based Protocol of Automated Video-Image Analysis for Species Recognition and Activity Rhythms Monitoring in Deep-Sea Fauna

    Directory of Open Access Journals (Sweden)

    Paolo Menesatti

    2009-10-01

    Full Text Available The understanding of ecosystem dynamics in deep-sea areas is to date limited by technical constraints on sampling repetition. We have elaborated a morphometry-based protocol for automated video-image analysis in which animal movement tracking (by frame subtraction) is accompanied by species identification from the animals' outlines using Fourier descriptors and standard K-nearest-neighbours methods. One week of footage from a permanent video station located at 1,100 m depth in Sagami Bay (Central Japan) was analysed. Out of 150,000 frames (1 per 4 s), a subset of 10,000 was analyzed by a trained operator to increase the efficiency of the automated procedure. Error estimation for the automated and the trained-operator procedures was computed as a measure of protocol performance. Three moving species were identified as the most recurrent: zoarcid fishes (eelpouts), red crabs (Paralomis multispina), and snails (Buccinum soyomaruae). Species identification with KNN thresholding produced better results in automated motion detection. Results were discussed assuming that this technological bottleneck is at present deeply conditioning the exploration of the deep sea.
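    A compact sketch of the outline-classification step (Fourier descriptors fed to a K-nearest-neighbours vote) is given below; the normalization choices are one common convention and not necessarily those of the protocol:

```python
import numpy as np

def fourier_descriptors(contour, n_desc=16):
    """Translation/scale/rotation-invariant shape signature of a closed outline.
    contour: (N, 2) array of x, y boundary points."""
    z = contour[:, 0] + 1j * contour[:, 1]   # complex boundary representation
    F = np.fft.fft(z)
    F[0] = 0                                  # drop DC term -> translation invariance
    mags = np.abs(F)                          # discard phase -> rotation invariance
    mags = mags / (mags[1] + 1e-12)           # normalize -> scale invariance
    return mags[1:n_desc + 1]

def knn_classify(query, train_descs, train_labels, k=3):
    """Majority vote among the k nearest stored descriptors."""
    d = np.linalg.norm(train_descs - query, axis=1)
    votes = [train_labels[i] for i in np.argsort(d)[:k]]
    return max(set(votes), key=votes.count)

# Toy usage: the descriptor of a circular outline.
theta = np.linspace(0, 2 * np.pi, 64, endpoint=False)
circle = np.c_[np.cos(theta), np.sin(theta)]
print(fourier_descriptors(circle)[:4])
```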

  14. The Cortical Organization of Speech Processing: Feedback Control and Predictive Coding in the Context of a Dual-Stream Model

    Science.gov (United States)

    Hickok, Gregory

    2012-01-01

    Speech recognition is an active process that involves some form of predictive coding. This statement is relatively uncontroversial. What is less clear is the source of the prediction. The dual-stream model of speech processing suggests that there are two possible sources of predictive coding in speech perception: the motor speech system and the…

  15. Speech enhancement

    CERN Document Server

    Benesty, Jacob; Chen, Jingdong

    2006-01-01

    We live in a noisy world! In all applications (telecommunications, hands-free communications, recording, human-machine interfaces, etc.) that require at least one microphone, the signal of interest is usually contaminated by noise and reverberation. As a result, the microphone signal has to be "cleaned" with digital signal processing tools before it is played out, transmitted, or stored. This book is about speech enhancement. Different well-known and state-of-the-art methods for noise reduction, with one or multiple microphones, are discussed. By speech enhancement, we mean not only noise reduction...

  16. Towards Quranic reader controlled by speech

    OpenAIRE

    Yacine Yekache; Yekhlef Mekelleche; Belkacem Kouninef

    2012-01-01

    In this paper we describe the process of designing a task-oriented continuous speech recognition system for Arabic, based on CMU Sphinx4, to be used in the voice interface of a Quranic reader. The concept of the Quranic reader controlled by speech is presented, and the collection of the corpus and the creation of the acoustic model are described in detail, taking into account the specificities of the Arabic language and the desired application.

  17. Towards Quranic reader controlled by speech

    Directory of Open Access Journals (Sweden)

    Yacine Yekache

    2011-11-01

    Full Text Available In this paper we describe the process of designing a task-oriented continuous speech recognition system for Arabic, based on CMU Sphinx4, to be used in the voice interface of a Quranic reader. The concept of the Quranic reader controlled by speech is presented, and the collection of the corpus and the creation of the acoustic model are described in detail, taking into account the specificities of the Arabic language and the desired application.

  18. Automatic License Plate Recognition

    OpenAIRE

    Ronak P Patel; Narendra M Patel; Keyur Brahmbhatt

    2013-01-01

    This paper describes the Smart Vehicle Screening System, which can be installed into a tollbooth for automated recognition of vehicle license plate information using a photograph of a vehicle. An automated system could then be implemented to control the payment of fees, parking areas, highways, bridges or tunnels, etc. This paper contains a new algorithm for number plate recognition using morphological operations, thresholding, edge detection, and bounding-box analysis for number plate extract...

  19. Normal and Time-Compressed Speech

    Science.gov (United States)

    Lemke, Ulrike; Kollmeier, Birger; Holube, Inga

    2016-01-01

    Short-term and long-term learning effects were investigated for the German Oldenburg sentence test (OLSA) using original and time-compressed fast speech in noise. Normal-hearing and hearing-impaired participants completed six lists of the OLSA in five sessions. Two groups of normal-hearing listeners (24 and 12 listeners) and two groups of hearing-impaired listeners (9 listeners each) performed the test with original or time-compressed speech. In general, original speech resulted in better speech recognition thresholds than time-compressed speech. Thresholds decreased with repetition for both speech materials. Confirming earlier results, the largest improvements were observed within the first measurements of the first session, indicating a rapid initial adaptation phase. The improvements were larger for time-compressed than for original speech. The novel results on long-term learning effects when using the OLSA indicate a longer phase of ongoing learning, especially for time-compressed speech, which seems to be limited by a floor effect. In addition, for normal-hearing participants, no complete transfer of learning benefits from time-compressed to original speech was observed. These effects should be borne in mind when inviting listeners repeatedly, for example, in research settings.

  20. A novel speech emotion recognition algorithm based on combination of emotion data field and ant colony search strategy

    Institute of Scientific and Technical Information of China (English)

    查诚; 陶华伟; 张昕然; 周琳; 赵力; 杨平

    2016-01-01

    In order to effectively conduct emotion recognition from spontaneous, non-prototypical and unsegmented speech, and so create more natural human-machine interaction, a novel speech emotion recognition algorithm based on the combination of the emotion data field (EDF) and the ant colony search (ACS) strategy, called the EDF-ACS algorithm, is proposed. More specifically, the inter-relationships among the turn-based acoustic feature vectors of different labels are established by using the potential function in the EDF. To perform spontaneous speech emotion recognition, an artificial ant colony is used to mimic the turn-based acoustic feature vectors. The canonical ACS strategy is then used to investigate the movement direction of each artificial ant in the EDF, which is regarded as the emotional label of the corresponding turn-based acoustic feature vector. The proposed EDF-ACS algorithm is evaluated on the continuous audio/visual emotion challenge (AVEC) 2012 dataset, which contains spontaneous, non-prototypical and unsegmented speech emotion data. The experimental results show that the proposed EDF-ACS algorithm outperforms the existing state-of-the-art algorithms in turn-based speech emotion recognition.
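    The record does not give the exact kernel of the potential function, but in data-field theory it is typically a Gaussian-decay field of the form below (illustrative only): each labelled feature vector x_i radiates a potential of strength m_i that decays with distance, and an artificial ant at position x feels the summed field

```latex
\varphi(\mathbf{x}) \;=\; \sum_{i=1}^{n} m_i\,
\exp\!\Bigl(-\,\|\mathbf{x}-\mathbf{x}_i\|^2 / (2\sigma^2)\Bigr),
```

    with σ an impact (radiation) factor controlling how far each vector's influence reaches.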

  1. Spectral subtraction-based speech enhancement for cochlear implant patients in background noise

    Science.gov (United States)

    Yang, Li-Ping; Fu, Qian-Jie

    2005-03-01

    A single-channel speech enhancement algorithm utilizing speech pause detection and nonlinear spectral subtraction is proposed for cochlear implant patients in the present study. The spectral subtraction algorithm estimates the short-time spectral magnitude of speech by subtracting the estimated noise spectral magnitude from the noisy speech spectral magnitude. The artifacts produced by spectral subtraction (such as "musical noise") were significantly reduced by combining a variance-reduced gain function and spectral flooring. Sentence recognition by seven cochlear implant subjects was tested under different noisy listening conditions (speech-shaped noise and 6-talker speech babble at +9, +6, +3, and 0 dB SNR) with and without the speech enhancement algorithm. For speech-shaped noise, performance for all subjects at all SNRs was significantly improved by the speech enhancement algorithm; for speech babble, performance was only modestly improved. The results suggest that the proposed speech enhancement algorithm may be beneficial for implant users in noisy listening conditions.

  2. Speech Intelligibility

    Science.gov (United States)

    Brand, Thomas

    Speech intelligibility (SI) is important for several fields of research, engineering, and diagnostics, in order to quantify very different phenomena such as the quality of recordings, communication and playback devices, the reverberation of auditoria, characteristics of hearing impairment, the benefit of using hearing aids, or combinations of these.

  3. The Value of Commercial Speech

    OpenAIRE

    Munro, Colin

    2003-01-01

    Recent decisions in the courts have encouraged discussion of the extent to which the common law does or should place a high or higher value on political expression. Some scholars argue for a more explicit recognition of the high value of political speech, and would seek, for example, to 'constitutionalise' defamation laws. Others have adopted a more sceptical attitude to the desirability of importing American approaches to freedom of expression generally or to the privileging of political spe...

  4. Talker variability in audio-visual speech perception.

    Science.gov (United States)

    Heald, Shannon L M; Nusbaum, Howard C

    2014-01-01

    A change in talker is a change in the context for the phonetic interpretation of acoustic patterns of speech. Different talkers have different mappings between acoustic patterns and phonetic categories, and listeners need to adapt to these differences. Despite this complexity, listeners are adept at comprehending speech in multiple-talker contexts, albeit at a slight but measurable performance cost (e.g., slower recognition). So far, this talker variability cost has been demonstrated only in audio-only speech. Other research in single-talker contexts has shown, however, that when listeners are able to see a talker's face, speech recognition is improved under adverse listening (e.g., noise or distortion) conditions that can increase uncertainty in the mapping between acoustic patterns and phonetic categories. Does seeing a talker's face reduce the cost of word recognition in multiple-talker contexts? We used a speeded word-monitoring task in which listeners make quick judgments about target word recognition in single- and multiple-talker contexts. Results show faster recognition performance in single-talker conditions compared to multiple-talker conditions for both audio-only and audio-visual speech. However, recognition time in a multiple-talker context was slower in the audio-visual condition compared to the audio-only condition. These results suggest that seeing a talker's face during speech perception may slow recognition by increasing the importance of talker identification, signaling to the listener that a change in talker has occurred. PMID:25076919

  5. Speech-enabled Computer-aided Translation

    DEFF Research Database (Denmark)

    Mesa-Lao, Bartolomé

    2014-01-01

    The present study has surveyed post-editor trainees' views and attitudes before and after the introduction of speech technology as a front end to a computer-aided translation workbench. The aim of the survey was (i) to identify attitudes and perceptions among post-editor trainees before performing a post-editing task using automatic speech recognition (ASR); and (ii) to assess the degree to which post-editors' attitudes and expectations to the use of speech technology changed after actually using it. The survey was based on two questionnaires: the first one administered before the participants...

  6. Role of neural network models for developing speech systems

    Indian Academy of Sciences (India)

    K Sreenivasa Rao

    2011-10-01

    This paper discusses the application of neural networks for developing different speech systems. Prosodic parameters of speech at syllable level depend on positional, contextual and phonological features of the syllables. In this paper, neural networks are explored to model the prosodic parameters of the syllables from their positional, contextual and phonological features. The prosodic parameters considered in this work are duration and sequence of pitch $(F_0)$ values of the syllables. These prosody models are further examined for applications such as text to speech synthesis, speech recognition, speaker recognition and language identification. Neural network models in voice conversion system are explored for capturing the mapping functions between source and target speakers at source, system and prosodic levels. We have also used neural network models for characterizing the emotions present in speech. For identification of dialects in Hindi, neural network models are used to capture the dialect specific information from spectral and prosodic features of speech.
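    As an illustration of the regression view of prosody modelling described here, a hedged sketch with synthetic data follows; the feature layout and the linear ground truth are invented, and scikit-learn is assumed to be available:

```python
# A small feed-forward network mapping syllable features to duration, in the
# spirit of the prosody models above. Data is synthetic, for illustration only.
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.random((500, 8))                 # e.g. positional/contextual/phonological codes
y = 80 + 120 * X[:, 0] + 10 * rng.standard_normal(500)  # syllable duration in ms

model = MLPRegressor(hidden_layer_sizes=(16,), max_iter=2000, random_state=0)
model.fit(X[:400], y[:400])
print(model.score(X[400:], y[400:]))     # R^2 on held-out "syllables"
```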

  7. Silent Speech Interfaces

    OpenAIRE

    Denby, B; Schultz, T.; Honda, K.; Hueber, T.; Gilbert, J.M.; Brumberg, J.S.

    2010-01-01

    Abstract The possibility of speech processing in the absence of an intelligible acoustic signal has given rise to the idea of a 'silent speech' interface, to be used as an aid for the speech handicapped, or as part of a communications system operating in silence-required or high-background-noise environments. The article first outlines the emergence of the silent speech interface from the fields of speech production, automatic speech processing, speech pathology research, and telec...

  8. Acoustic modeling for emotion recognition

    CERN Document Server

    Anne, Koteswara Rao; Vankayalapati, Hima Deepthi

    2015-01-01

    This book presents state-of-the-art research in speech emotion recognition. Readers are first presented with basic research and applications; gradually more advanced information is provided, giving readers comprehensive guidance for classifying emotions through speech. Simulated databases are used and results extensively compared, with the features and the algorithms implemented using MATLAB. Various emotion recognition models such as Linear Discriminant Analysis (LDA), Regularized Discriminant Analysis (RDA), Support Vector Machines (SVM) and K-Nearest Neighbor (KNN) are explored in detail using prosody and spectral features, and feature fusion techniques.

  9. Handbook of Face Recognition

    CERN Document Server

    Li, Stan Z

    2011-01-01

    This highly anticipated new edition provides a comprehensive account of face recognition research and technology, spanning the full range of topics needed for designing operational face recognition systems. After a thorough introductory chapter, each of the following chapters focus on a specific topic, reviewing background information, up-to-date techniques, and recent results, as well as offering challenges and future directions. Features: fully updated, revised and expanded, covering the entire spectrum of concepts, methods, and algorithms for automated face detection and recognition systems

  10. Audiovisual system for recognition of commands

    Directory of Open Access Journals (Sweden)

    Alexander Ceballos

    2011-08-01

    Full Text Available We present the development of an automatic audiovisual speech recognition system focused on the recognition of commands. The audio signal representation was done using Mel cepstral coefficients and their first- and second-order time derivatives. In order to characterize the video signal, a set of high-level visual features was tracked throughout the sequences. Automatic initialization of the algorithm was performed using color transformations and active contour models based on Gradient Vector Flow ("GVF snakes") on the lip region, whereas visual tracking used similarity measures across neighborhoods and morphological constraints defined in the MPEG-4 standard. We first present the design of the automatic speech recognition system using only audio information (ASR), based on Hidden Markov Models (HMMs) and an isolated-word approach; we then show the design of systems using only video features (VSR), and using combined audio and video features (AVSR). Finally, the results of the three systems are compared on our own database in Spanish and French, and the influence of acoustic noise is shown, demonstrating that the AVSR system is more robust than ASR and VSR.

  11. A Java speech implementation of the Mini Mental Status Exam.

    OpenAIRE

    Wang, S S; Starren, J.

    1999-01-01

    The Folstein Mini Mental Status Exam (MMSE) is a simple, widely used, verbally administered test to assess cognitive function. The Java Speech Application Programming Interface (JSAPI) is a new, cross-platform interface for both speech recognition and speech synthesis in the Java environment. To evaluate the suitability of the JSAPI for interactive, patient interview applications, a JSAPI implementation of the MMSE was developed. The MMSE contains questions that vary in structure in order to ...

  12. Considering relative order of emotional degree in dimensional speech emotion recognition

    Institute of Scientific and Technical Information of China (English)

    韩文静; 李海峰; 马琳

    2011-01-01

    Dimensional speech emotion recognition (Dim-SER) is a rising branch of the affective computing field. It views emotion from a dimensional, continuous perspective and formalizes the SER problem as a regression task. Current Dim-SER research does not consider the relative order of emotional degree between utterances, which can give a human-machine interface the wrong picture of the speaker's emotion variation trend. Starting from this need, this paper constructs an order-sensitive Dim-SER system with human emotion cognition characteristics as reference, and employs the Gamma statistic to evaluate emotion recognition performance. Specifically, a Top-rank probability distribution is developed to describe the emotional ordering of utterances, and the Kullback-Leibler divergence is used to measure the loss of order consistency caused by emotion recognition. Finally, an Order-Sensitive Network (OSNet) algorithm is proposed to minimize the prediction loss. Experimental results show that, compared with the commonly used K-Nearest Neighbor (K-NN) and Support Vector Regression (SVR) approaches, the proposed system effectively improves the correctness of emotional relative order between utterances.
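    To make the order-consistency loss concrete, the following sketch assumes (the abstract does not spell this out) that the Top-rank distribution is a softmax over emotion-degree scores, so the Kullback-Leibler divergence between the distributions induced by annotated and predicted degrees penalizes rank disagreement:

      # Hedged sketch of an order-consistency loss: softmax "top-rank"
      # distributions over utterance emotion degrees, compared with KL.
      import numpy as np

      def top_rank_distribution(scores):
          """Probability of each utterance being ranked first (softmax)."""
          e = np.exp(scores - scores.max())   # shift by max for stability
          return e / e.sum()

      def kl_divergence(p, q, eps=1e-12):
          return float(np.sum(p * np.log((p + eps) / (q + eps))))

      true_degree = np.array([0.9, 0.4, 0.1])   # annotated degrees (toy)
      pred_degree = np.array([0.7, 0.5, 0.2])   # model predictions (toy)
      loss = kl_divergence(top_rank_distribution(true_degree),
                           top_rank_distribution(pred_degree))
      print(f"order-consistency loss: {loss:.4f}")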

  13. An audiovisual emotion recognition system

    Science.gov (United States)

    Han, Yi; Wang, Guoyin; Yang, Yong; He, Kun

    2007-12-01

    Human emotions can be expressed through many bio-symbols, speech and facial expression among them. Both are regarded as emotional information that plays an important role in human-computer interaction. Building on our previous studies of emotion recognition, an audiovisual emotion recognition system is developed and presented in this paper. The system is designed for real-time practice and is supported by several integrated modules: speech enhancement for eliminating noise, rapid face detection for locating the face in the background image, example-based shape learning for facial feature alignment, and an optical-flow-based tracking algorithm for facial feature tracking. Since irrelevant features and high dimensionality of the data can hurt classifier performance, rough set-based feature selection is used for dimension reduction: 13 of 37 speech features and 10 of 33 facial features are selected to represent emotional information, and 52 audiovisual features are selected after synchronization when speech and video are fused. The experimental results demonstrate that the system performs well in real-time practice and has a high recognition rate. Our results also suggest that multimodal fused recognition will become the trend of emotion recognition in the future.
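    The rough-set feature selection itself is not reproduced here; as a clearly labeled stand-in, this sketch selects 13 of 37 speech features and 10 of 33 facial features by mutual information (scikit-learn) and concatenates them for feature-level fusion, on synthetic toy data:

      # Stand-in for rough-set selection: mutual-information feature selection
      # followed by audio-video feature concatenation. Data are synthetic.
      import numpy as np
      from sklearn.feature_selection import SelectKBest, mutual_info_classif

      rng = np.random.default_rng(0)
      X_audio = rng.normal(size=(200, 37))   # 37 speech features per sample
      X_video = rng.normal(size=(200, 33))   # 33 facial features per sample
      y = rng.integers(0, 6, size=200)       # six emotion classes (toy labels)

      audio_sel = SelectKBest(mutual_info_classif, k=13).fit_transform(X_audio, y)
      video_sel = SelectKBest(mutual_info_classif, k=10).fit_transform(X_video, y)
      X_fused = np.hstack([audio_sel, video_sel])   # feature-level fusion
      print(X_fused.shape)                          # (200, 23)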

  14. Using a Fuzzy Emotion Model in Computer Assisted Speech Therapy

    OpenAIRE

    Schipor, Ovidiu Andrei; Pentiuc, Stefan Gheorghe; Schipor, Maria Doina

    2011-01-01

    Affective computing – a machine's ability to recognize and simulate human affects – has become a major research field within human-computer interaction. This paper deals with emotion recognition within a CBST (Computer Based Speech Therapy System) for preschoolers and young schoolchildren. Identifying the emotions of children with speech disorders during assisted therapy sessions requires an adaptation of classical recognition techniques. That is why, in our article, we focus on finding and testi...

  15. An Agent-based Framework for Speech Investigation

    OpenAIRE

    Walsh, Michael; O'Hare, G.M.P.; Carson-Berndsen, Julie

    2005-01-01

    This paper presents a novel agent-based framework for investigating speech recognition which combines statistical data and explicit phonological knowledge in order to explore strategies aimed at augmenting the performance of automatic speech recognition (ASR) systems. This line of research is motivated by a desire to provide solutions to some of the more notable problems encountered, including in particular the problematic phenomena of coarticulation, underspecified input...

  16. Strategies for focal accent detection in spontaneous speech

    OpenAIRE

    Petzold, Anja

    1995-01-01

    In this paper a new method for the detection of focus is developed. The speech data consist of German spontaneous speech from several speakers. At present the algorithm uses only fundamental frequency values. By computing a nonlinear reference line through significant anchor points in the F0 contour, the points of highest prominence are determined. The global recognition rate is 78.5% and the mean recognition rate is 66.6%.
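    A toy sketch of the idea, assuming local minima of the F0 contour serve as anchor points and a fixed deviation threshold marks prominence; the paper's actual anchor selection and threshold are not given here:

      # Focal-accent detection sketch: reference line through F0 anchor points;
      # frames far above the line are marked prominent. Values are assumed.
      import numpy as np
      from scipy.signal import argrelextrema

      t = np.linspace(0, 2, 200)                    # 2 s of F0 frames (toy)
      f0 = (120 + 10*np.sin(4*np.pi*t)
            + 40*np.exp(-((t - 1.2)/0.05)**2))      # contour with one accent

      anchors = argrelextrema(f0, np.less, order=10)[0]        # local minima
      anchors = np.concatenate(([0], anchors, [len(f0) - 1]))  # add endpoints
      reference = np.interp(np.arange(len(f0)), anchors, f0[anchors])

      prominent = np.where(f0 - reference > 20)[0]  # 20 Hz threshold (assumed)
      print(f"prominent frames: {prominent.min()}..{prominent.max()}")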

  17. Comparison of manual and automatic evaluation of speech recognition threshold using Mandarin disyllabic test (手动与自动取值对普通话双音节测试中言语识别阈的影响)

    Institute of Scientific and Technical Information of China (English)

    郑中伟; 张华; 王越

    2014-01-01

    Objective: To compare the speech recognition thresholds (SRTs) for Mandarin disyllabic word lists obtained by manual testing with those recorded automatically by software, and to explore the significance for clinical application. Methods: 128 people with normal hearing (normal-hearing group) and 57 workers exposed to noise in an automobile manufacturing plant (noise group) were selected; all used Mandarin as their daily means of communication. A Madsen Conera (Denmark) clinical diagnostic audiometer was used, with a group of disyllabic word lists of equivalent difficulty as test material; the initial presentation level was 20 dB above the PTA. The SRTs obtained manually were compared with the SRTs generated automatically by the Conera audiometer software. Results: In the normal-hearing group, the mean speech-frequency threshold was 7.63 ± 5.78 dB HL, the automatically recorded SRT was 7.84 ± 3.98 dB HL, and the manually tested SRT was 9.19 ± 4.47 dB HL. In the noise group, the mean speech-frequency threshold was 27.18 ± 19.13 dB HL, the automatically recorded SRT was 16.10 ± 8.40 dB HL, and the manually tested SRT was 18.81 ± 9.52 dB HL. In both groups the manually tested SRT was higher than the automatically recorded SRT (P < 0.01). Conclusion: Automatically recorded SRTs differ from manually tested SRTs. For people with normal hearing, SRT can be measured automatically, which is convenient for screening; for people with hearing impairment, the manual testing method is more suitable.
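    A sketch of the kind of group comparison reported, using simulated values with roughly the published means; the paired t-test is an assumption, since the abstract does not name the statistic used:

      # Paired comparison of manual vs. automatic SRT on simulated toy data.
      import numpy as np
      from scipy import stats

      rng = np.random.default_rng(1)
      auto_srt = rng.normal(7.84, 3.98, size=128)         # automatic SRT, dB HL
      manual_srt = auto_srt + rng.normal(1.35, 2.0, 128)  # manual runs higher

      t_stat, p_value = stats.ttest_rel(manual_srt, auto_srt)
      print(f"t = {t_stat:.2f}, p = {p_value:.4f}")       # expect p < 0.01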

  18. Assessment of PD Speech Anomalies @ Home

    OpenAIRE

    Khan, Taha; Westin, Jerker

    2011-01-01

    Background: Voice processing in real time is challenging. A drawback of previous work on Hypokinetic Dysarthria (HKD) recognition is the requirement of controlled settings in a laboratory environment. A personal digital assistant (PDA) has been developed for home assessment of PD patients. The PDA offers sound processing capabilities, which allow for developing a module for the recognition and quantification of HKD. Objective: To compose an algorithm for assessment of PD speech severity in the home...

  19. Prediction and constraint in audiovisual speech perception.

    Science.gov (United States)

    Peelle, Jonathan E; Sommers, Mitchell S

    2015-07-01

    During face-to-face conversational speech, listeners must efficiently process a rapid and complex stream of multisensory information. Visual speech can serve as a critical complement to auditory information because it provides cues to both the timing of the incoming acoustic signal (the amplitude envelope, influencing attention and perceptual sensitivity) and its content (place and manner of articulation, constraining lexical selection). Here we review behavioral and neurophysiological evidence regarding listeners' use of visual speech information. Multisensory integration of audiovisual speech cues improves recognition accuracy, particularly for speech in noise. Even when speech is intelligible based solely on auditory information, adding visual information may reduce the cognitive demands placed on listeners by increasing the precision of prediction. Electrophysiological studies demonstrate that oscillatory cortical entrainment to speech in auditory cortex is enhanced when visual speech is present, increasing sensitivity to important acoustic cues. Neuroimaging studies also suggest increased activity in auditory cortex when congruent visual information is available, but additionally emphasize the involvement of heteromodal regions of posterior superior temporal sulcus as playing a role in integrative processing. We interpret these findings in a framework of temporally focused lexical competition in which visual speech information affects auditory processing to increase sensitivity to acoustic information through an early integration mechanism, and a late integration stage that incorporates specific information about a speaker's articulators to constrain the number of possible candidates in a spoken utterance. Ultimately it is words compatible with both auditory and visual information that most strongly determine successful speech perception during everyday listening. Thus, audiovisual speech perception is accomplished through multiple stages of integration

  20. Speech perception as an active cognitive process

    Directory of Open Access Journals (Sweden)

    Shannon eHeald

    2014-03-01

    Full Text Available One view of speech perception is that acoustic signals are transformed into representations for pattern matching to determine linguistic structure. This process can be taken as a statistical pattern-matching problem, assuming relatively stable linguistic categories are characterized by neural representations related to auditory properties of speech that can be compared to speech input. This kind of pattern matching can be termed a passive process, which implies rigidity of processing with few demands on cognitive processing. An alternative view is that speech recognition, even in early stages, is an active process in which speech analysis is attentionally guided. Note that this does not mean consciously guided, but that information-contingent changes in early auditory encoding can occur as a function of context and experience. Active processing assumes that attention, plasticity, and listening goals are important in considering how listeners cope with adverse circumstances that impair hearing, whether masking noise in the environment or hearing loss. Although theories of speech perception have begun to incorporate some active processing, they seldom treat early speech encoding as plastic and attentionally guided. Recent research has suggested that speech perception is the product of both feedforward and feedback interactions between a number of brain regions that include descending projections perhaps as far downstream as the cochlea. It is important to understand how the ambiguity of the speech signal and constraints of context dynamically determine the cognitive resources recruited during perception, including focused attention, learning, and working memory. Theories of speech perception need to go beyond the current corticocentric approach in order to account for the intrinsic dynamics of the auditory encoding of speech. In doing so, this may provide new insights into ways in which hearing disorders and loss may be treated either through augmentation or