WorldWideScience

Sample records for robust speech recognition

  1. Recent Advances in Robust Speech Recognition Technology

    CERN Document Server

    Ramírez, Javier

    2011-01-01

    This E-book is a collection of articles that describe advances in speech recognition technology. Robustness in speech recognition refers to the need to maintain high speech recognition accuracy even when the quality of the input speech is degraded, or when the acoustical, articulatory, or phonetic characteristics of speech in the training and testing environments differ. Obstacles to robust recognition include acoustical degradations produced by additive noise, the effects of linear filtering, nonlinearities in transduction or transmission, as well as impulsive interfering sources, and diminished …

  2. Robust Speech Recognition Using a Harmonic Model

    Institute of Scientific and Technical Information of China (English)

    许超; 曹志刚

    2004-01-01

    Automatic speech recognition under noisy conditions remains a challenging problem. Traditionally, methods focused on noise structure, such as spectral subtraction, have been employed to address this problem, so their performance depends on the accuracy of the noise estimate. In this paper, an alternative method, a harmonic-based spectral reconstruction algorithm, is proposed to improve the robustness of automatic speech recognition. Neither noise estimation nor noise-model training is required in the proposed approach. A spectral-subtraction-integrated autocorrelation function is proposed to determine the pitch for the harmonic model. Recognition results show that the harmonic-based spectral reconstruction approach outperforms spectral subtraction in the middle and low signal-to-noise ratio (SNR) ranges. The advantage of the proposed method is more pronounced for non-stationary noise, as the algorithm does not assume stationary noise.
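
    The two building blocks this abstract names are easy to illustrate. Below is a minimal numpy sketch of magnitude spectral subtraction and of an autocorrelation pitch estimator of the kind a harmonic model could be seeded with; the function names, flooring constant and pitch search range are illustrative assumptions, not taken from the paper.

      import numpy as np

      def spectral_subtraction(noisy_mag, noise_mag, floor=0.02):
          # Subtract a per-band noise magnitude estimate from every frame,
          # flooring the result to avoid negative magnitudes (musical noise).
          clean = noisy_mag - noise_mag[:, None]        # (bands, frames)
          return np.maximum(clean, floor * noisy_mag)

      def pitch_autocorr(frame, fs, fmin=60.0, fmax=400.0):
          # Pick the autocorrelation peak inside the plausible pitch-lag range;
          # assumes the frame is longer than fs / fmin samples.
          frame = frame - frame.mean()
          ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
          lo, hi = int(fs / fmax), int(fs / fmin)
          lag = lo + np.argmax(ac[lo:hi])
          return fs / lag                                # pitch in Hz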

  3. The Phase Spectra Based Feature for Robust Speech Recognition

    Directory of Open Access Journals (Sweden)

    Abbasian ALI

    2009-07-01

    Speech recognition in adverse environments is one of the major issues in automatic speech recognition today. Most current speech recognition systems are highly accurate in ideal environments, but their performance degrades drastically in real environments because of noise-corrupted speech. In this paper a new feature representation based on phase spectra and Perceptual Linear Prediction (PLP) is suggested which can be used for robust speech recognition. It is shown that the new features improve the performance of speech recognition, compared to PLP features, not only in clean conditions but also at various noise levels.

  4. Histogram Equalization to Model Adaptation for Robust Speech Recognition

    Directory of Open Access Journals (Sweden)

    Hoirin Kim

    2010-01-01

    We propose a new model adaptation method based on the histogram equalization technique for providing robustness in noisy environments. The trained acoustic mean models of a speech recognizer are adapted to environmentally matched conditions by applying the histogram equalization algorithm on a single-utterance basis. For more robust speech recognition in heavily noisy conditions, trained acoustic covariance models are efficiently adapted by signal-to-noise-ratio-dependent linear interpolation between trained covariance models and utterance-level sample covariance models. Speech recognition experiments on both the digit-based Aurora2 task and a large-vocabulary task showed that the proposed model adaptation approach provides significant performance improvements over the baseline speech recognizer trained on clean speech data.
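
    The core histogram equalization operation is a CDF match. The sketch below equalizes utterance features toward a standard normal reference, which is the usual formulation; the paper applies the resulting mapping to model means rather than to features, and the Gaussian reference here is an assumption for illustration.

      import numpy as np
      from scipy.stats import norm

      def histogram_equalize(feats):
          # feats: (frames, dims) for one utterance. Replace each value by the
          # Gaussian quantile of its empirical CDF position, per dimension.
          n = feats.shape[0]
          ranks = feats.argsort(axis=0).argsort(axis=0)  # rank of each value
          return norm.ppf((ranks + 0.5) / n)             # map CDF -> N(0, 1)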

  5. Noise Robust Speech Recognition Applied to Voice-Driven Wheelchair

    Science.gov (United States)

    Sasou, Akira; Kojima, Hiroaki

    2009-12-01

    Conventional voice-driven wheelchairs usually employ headset microphones that can achieve sufficient recognition accuracy, even in the presence of surrounding noise. However, such interfaces require users to wear a sensor such as a headset microphone, which can be an impediment, especially for the hand disabled. Conversely, it is well known that speech recognition accuracy degrades drastically when the microphone is placed far from the user. In this paper, we develop a noise robust speech recognition system for a voice-driven wheelchair that achieves almost the same recognition accuracy as a headset microphone without requiring the user to wear sensors. We verified the effectiveness of our system in experiments in different environments.

  6. Robust Automatic Speech Recognition in Impulsive Noise Environment

    Institute of Scientific and Technical Information of China (English)

    DING Pei; CAO Zhigang

    2005-01-01

    This paper presents an efficient method to directly suppress the effect of impulsive noise for robust automatic speech recognition (ASR). In this method, according to the noise sensitivity of each feature dimension, the observation vectors are divided into several parts, each of which is assigned a proper threshold. In the recognition stage, the unreliable probability preponderance of incorrect competing paths caused by impulsive noise is eliminated by flooring the observation probability (FOP) of each feature sub-vector at the Gaussian mixture level, so that the correct path recovers its priority in decoding. Experimental results demonstrate that the proposed method significantly improves recognition accuracy in both machine-gun noise and simulated impulsive noise environments, while maintaining high performance for clean speech.
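
    A minimal sketch of the flooring idea for a diagonal-Gaussian state model: each sub-vector contributes no less than its floor, so a burst that wrecks one sub-band cannot dominate the frame score. The slicing and floor values below are illustrative assumptions, not the paper's thresholds.

      import numpy as np

      def floored_loglike(x, mean, var, subvec_slices, floors):
          # Per-dimension diagonal-Gaussian log-likelihoods.
          ll = -0.5 * (np.log(2 * np.pi * var) + (x - mean) ** 2 / var)
          total = 0.0
          for sl, fl in zip(subvec_slices, floors):
              total += max(ll[sl].sum(), fl)   # floor each sub-vector's score
          return total

      # e.g. static, delta and delta-delta sub-vectors with their own floors:
      # floored_loglike(x, m, v, [slice(0, 13), slice(13, 26), slice(26, 39)],
      #                 floors=[-60.0, -40.0, -40.0])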

  7. Robust Speech Recognition Method Based on Discriminative Environment Feature Extraction

    Institute of Scientific and Technical Information of China (English)

    HAN Jiqing; GAO Wen

    2001-01-01

    Learning the influence of environmental parameters, such as additive noise and channel distortion, from training data is an effective approach to robust speech recognition. Most previous methods are based on the maximum likelihood estimation criterion; however, these methods do not lead to a minimum error rate. In this paper, a novel discriminative learning method for environmental parameters, based on the Minimum Classification Error (MCE) criterion, is proposed. In this method, a simple classifier and the Generalized Probabilistic Descent (GPD) algorithm are adopted to iteratively learn the environmental parameters. The clean speech features are then estimated from the noisy speech features using the estimated environmental parameters, and these estimates are passed to the back-end HMM classifier. Experiments on a task of 18 isolated confusable Korean words show a best error rate reduction of 32.1% relative to a conventional HMM system.

  8. Auditory-model based robust feature selection for speech recognition.

    Science.gov (United States)

    Koniaris, Christos; Kuropatwinski, Marcin; Kleijn, W Bastiaan

    2010-02-01

    It is shown that robust dimension-reduction of a feature set for speech recognition can be based on a model of the human auditory system. Whereas conventional methods optimize classification performance, the proposed method exploits knowledge implicit in the auditory periphery, inheriting its robustness. Features are selected to maximize the similarity of the Euclidean geometry of the feature domain and the perceptual domain. Recognition experiments using mel-frequency cepstral coefficients (MFCCs) confirm the effectiveness of the approach, which does not require labeled training data. For noisy data the method outperforms commonly used discriminant-analysis based dimension-reduction methods that rely on labeling. The results indicate that selecting MFCCs in their natural order results in subsets with good performance.

  9. An effective cluster-based model for robust speech detection and speech recognition in noisy environments.

    Science.gov (United States)

    Górriz, J M; Ramírez, J; Segura, J C; Puntonet, C G

    2006-07-01

    This paper shows an accurate speech detection algorithm for improving the performance of speech recognition systems working in noisy environments. The proposed method is based on a hard decision clustering approach where a set of prototypes is used to characterize the noisy channel. Detecting the presence of speech is enabled by a decision rule formulated in terms of an averaged distance between the observation vector and a cluster-based noise model. The algorithm benefits from using contextual information, a strategy that considers not only a single speech frame but also a neighborhood of data in order to smooth the decision function and improve speech detection robustness. The proposed scheme exhibits reduced computational cost making it adequate for real time applications, i.e., automated speech recognition systems. An exhaustive analysis is conducted on the AURORA 2 and AURORA 3 databases in order to assess the performance of the algorithm and to compare it to existing standard voice activity detection (VAD) methods. The results show significant improvements in detection accuracy and speech recognition rate over standard VADs such as ITU-T G.729, ETSI GSM AMR, and ETSI AFE for distributed speech recognition and a representative set of recently reported VAD algorithms.
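
    A minimal sketch of the decision rule described above: characterize the noise with a handful of k-means prototypes, then declare speech when the distance from a context window of observations to the nearest prototypes, averaged over the window, exceeds a threshold. The prototype count, window size and threshold are illustrative assumptions, not the paper's settings.

      import numpy as np

      def noise_prototypes(noise_feats, k=4, iters=20, seed=0):
          # Plain k-means over noise-only frames to characterise the channel.
          rng = np.random.default_rng(seed)
          protos = noise_feats[rng.choice(len(noise_feats), k,
                                          replace=False)].astype(float)
          for _ in range(iters):
              d = ((noise_feats[:, None, :] - protos[None]) ** 2).sum(-1)
              lab = d.argmin(1)
              for j in range(k):
                  if (lab == j).any():
                      protos[j] = noise_feats[lab == j].mean(0)
          return protos

      def vad_decision(frames, protos, t, context=4, thresh=10.0):
          # Average distance-to-noise-model over a neighbourhood of frame t;
          # declare speech when it exceeds the threshold.
          lo, hi = max(0, t - context), min(len(frames), t + context + 1)
          d = ((frames[lo:hi, None, :] - protos[None]) ** 2).sum(-1).min(1)
          return d.mean() > thresh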

  10. Robust relationship between reading span and speech recognition in noise.

    Science.gov (United States)

    Souza, Pamela; Arehart, Kathryn

    2015-01-01

    Working memory refers to a cognitive system that manages information processing and temporary storage. Recent work has demonstrated that individual differences in working memory capacity, measured using a reading span task, are related to the ability to recognize speech in noise. In this project, we investigated whether the specific implementation of the reading span task influences the strength of the relationship between working memory capacity and speech recognition. The relationship was examined for two working memory tests that varied in approach, using a within-subject design. Data consisted of audiometric results, the two working memory tests, a speech-in-noise test, and a reading comprehension test. The test group included 94 older adults with varying hearing loss and 30 younger adults with normal hearing. Listeners with poorer working memory capacity had more difficulty understanding speech in noise after accounting for age and degree of hearing loss. That relationship did not differ significantly between the two implementations of reading span. Our findings suggest that different implementations of a verbal reading span task do not affect the strength of the relationship between working memory capacity and speech recognition.

  11. Exploiting temporal correlation of speech for error robust and bandwidth flexible distributed speech recognition

    DEFF Research Database (Denmark)

    Tan, Zheng-Hua; Dalsgaard, Paul; Lindberg, Børge

    2007-01-01

    In this paper the temporal correlation of speech is exploited in front-end feature extraction, client based error recovery and server based error concealment (EC) for distributed speech recognition. First, the paper investigates a half frame rate (HFR) front-end that uses double frame shifting...

  12. Exploiting temporal correlation of speech for error robust and bandwidth flexible distributed speech recognition

    DEFF Research Database (Denmark)

    Tan, Zheng-Hua; Dalsgaard, Paul; Lindberg, Børge

    2007-01-01

    In this paper the temporal correlation of speech is exploited in front-end feature extraction, client-based error recovery and server-based error concealment (EC) for distributed speech recognition. First, the paper investigates a half frame rate (HFR) front-end that uses double frame shifting… Lastly, to understand the effects of applying various EC techniques, this paper introduces three approaches consisting of speech feature, dynamic programming distance and hidden Markov model state duration comparison.

  13. Space discriminative function for microphone array robust speech recognition

    Institute of Scientific and Technical Information of China (English)

    Zhao Xianyu; Ou Zhijian; Wang Zuoying

    2005-01-01

    Based on the W-disjoint orthogonality of speech mixtures, a space discriminative function is proposed to enumerate and localize competing speakers in the surrounding environment. A Wiener-like post-filter is then developed to adaptively suppress interference. Experimental results with a hands-free speech recognizer under various SNR and competing-speaker settings show that nearly 69% error reduction can be obtained with a two-channel small-aperture microphone array over a conventional single-microphone baseline system. Comparisons are made against traditional delay-and-sum and Griffiths-Jim adaptive beamforming techniques to further assess the effectiveness of the method.

  14. Noise robust automatic speech recognition with adaptive quantile based noise estimation and speech band emphasizing filter bank

    DEFF Research Database (Denmark)

    Bonde, Casper Stork; Graversen, Carina; Gregersen, Andreas Gregers;

    2005-01-01

    An important topic in Automatic Speech Recognition (ASR) is to reduce the effect of noise, in particular when there is a mismatch between training and application conditions. Many noise robustness schemes in the feature processing domain presuppose a noise estimate obtained before the speech signal appears, which requires noise robust voice activity detection and an assumption of stationary noise. Both requirements are often not met, and it is therefore of particular interest to investigate methods like Quantile Based Noise Estimation (QBNE), which estimates the noise during speech and non-speech sections without a voice activity detector. While the standard QBNE method uses a fixed pre-defined quantile across all frequency bands, this paper suggests adaptive QBNE (AQBNE), which adapts the quantile individually to each frequency band…
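
    Standard QBNE is nearly a one-liner: per frequency band, the noise level is taken as a quantile of the power trajectory over time, on the assumption that speech is absent in each band often enough for that quantile to reflect the noise floor. The sketch below also shows one conceivable per-band adaptation rule; the actual AQBNE rule is not described in this excerpt, so the spikiness heuristic is purely an illustrative assumption.

      import numpy as np

      def qbne(power_spec, q=0.5):
          # power_spec: (bands, frames). Per-band q-quantile over time is the
          # noise estimate; no voice activity detector is needed.
          return np.quantile(power_spec, q, axis=1)

      def adaptive_qbne(power_spec, q_lo=0.25, q_hi=0.75):
          # Hypothetical adaptation: bands with spiky (speech-dominated)
          # trajectories get a lower quantile, steadier bands a higher one.
          spikiness = power_spec.max(axis=1) / (power_spec.mean(axis=1) + 1e-12)
          q = np.where(spikiness > 10.0, q_lo, q_hi)
          return np.array([np.quantile(power_spec[b], q[b])
                           for b in range(power_spec.shape[0])])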

  15. Data-Driven Temporal Filtering on Teager Energy Time Trajectory for Robust Speech Recognition

    Institute of Scientific and Technical Information of China (English)

    ZHAO Jun-hui; XIE Xiang; KUANG Jing-ming

    2006-01-01

    A data-driven temporal filtering technique is integrated into the time trajectory of the Teager energy operator (TEO) based feature parameter to improve the robustness of speech recognition against noise. Three kinds of data-driven temporal filters are investigated with the aim of alleviating the harmful effects of environmental factors on speech: principal component analysis (PCA) based filters, linear discriminant analysis (LDA) based filters and minimum classification error (MCE) based filters. A detailed comparative analysis of these temporal filtering approaches in the Teager energy domain is presented. It is shown that while all of them improve the recognition performance of the original TEO-based feature parameter in adverse environments, MCE-based temporal filtering provides the lowest error rate of the three as SNR decreases.
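
    The Teager energy operator itself is only three samples wide, and a data-driven temporal filter is simply an FIR filter applied along the time trajectory of each feature dimension. A minimal numpy sketch; the edge handling is an illustrative assumption, and in the paper the filter taps come from PCA, LDA or MCE training rather than being hand-picked.

      import numpy as np

      def teager_energy(x):
          # Discrete TEO: psi[n] = x[n]^2 - x[n-1] * x[n+1]
          psi = np.empty_like(x, dtype=float)
          psi[1:-1] = x[1:-1] ** 2 - x[:-2] * x[2:]
          psi[0], psi[-1] = psi[1], psi[-2]        # replicate the edges
          return psi

      def temporal_filter(trajectory, taps):
          # Filter one feature dimension along time with data-derived taps.
          return np.convolve(trajectory, taps, mode="same")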

  16. Robust Speech Recognition Using Temporal Pattern Feature Extracted From MTMLP Structure

    Directory of Open Access Journals (Sweden)

    Yasser Shekofteh

    2014-10-01

    Temporal pattern features of a speech signal can be extracted either from the time domain or from front-end feature vectors. These features capture long-term information about variations across connected speech units. In this paper, the second approach is followed: features derived from temporal computations over Spectral-based (LFBE) and Cepstrum-based (MFCC) feature vectors are considered. To extract these features, we use the posterior-probability-based output of the proposed MTMLP neural networks. The combination of the temporal patterns, which represent the long-term dynamics of the speech signal, with traditional features composed of MFCCs and their first and second derivatives is evaluated in an ASR task. It is shown that the combined feature vector increases phoneme recognition accuracy by more than 1 percent over a baseline system that does not use the long-term temporal patterns. In addition, the features extracted by the proposed method give robust recognition under different noise conditions (by 13 percent), demonstrating that the proposed method is a robust feature extraction method.

  17. Robust Automatic Speech Recognition Features using Complex Wavelet Packet Transform Coefficients

    Directory of Open Access Journals (Sweden)

    Tjong Wan Sen

    2013-09-01

    To improve the performance of phoneme-based Automatic Speech Recognition (ASR) in noisy environments, we developed a new technique that adds robustness to clean phoneme features. These robust features are obtained from Complex Wavelet Packet Transform (CWPT) coefficients. Since the CWPT coefficients represent all frequency bands of the input signal, decomposing the input signal into a complete CWPT tree covers all frequencies involved in the recognition process. For time-overlapping signals with different frequency contents, e.g. a phoneme signal with noise, the CWPT coefficients are the combination of the CWPT coefficients of the phoneme signal and those of the noise, and the phoneme coefficients change according to the frequency components of the noise. Since the number of phonemes in every language is relatively small (limited) and well known, one can easily derive principal component vectors from a clean training dataset using Principal Component Analysis (PCA). These principal component vectors can then be used to add robustness and minimize noise effects in the testing phase. Simulation results, using Alpha Numeric 4 (AN4) from Carnegie Mellon University and NOISEX-92 examples from Rice University, showed that this technique can be used as a feature extractor that improves the robustness of phoneme-based ASR systems in various adverse noisy conditions while preserving performance in clean environments.

  18. Robust Automatic Speech Recognition Features using Complex Wavelet Packet Transform Coefficients

    Directory of Open Access Journals (Sweden)

    TjongWan Sen

    2009-11-01

    To improve the performance of phoneme-based Automatic Speech Recognition (ASR) in noisy environments, we developed a new technique that adds robustness to clean phoneme features. These robust features are obtained from Complex Wavelet Packet Transform (CWPT) coefficients. Since the CWPT coefficients represent all frequency bands of the input signal, decomposing the input signal into a complete CWPT tree covers all frequencies involved in the recognition process. For time-overlapping signals with different frequency contents, e.g. a phoneme signal with noise, the CWPT coefficients are the combination of the CWPT coefficients of the phoneme signal and those of the noise, and the phoneme coefficients change according to the frequency components of the noise. Since the number of phonemes in every language is relatively small (limited) and well known, one can easily derive principal component vectors from a clean training dataset using Principal Component Analysis (PCA). These principal component vectors can then be used to add robustness and minimize noise effects in the testing phase. Simulation results, using Alpha Numeric 4 (AN4) from Carnegie Mellon University and NOISEX-92 examples from Rice University, showed that this technique can be used as a feature extractor that improves the robustness of phoneme-based ASR systems in various adverse noisy conditions while preserving performance in clean environments.

  19. A Computational Auditory Scene Analysis System for Speech Segregation and Robust Speech Recognition

    Science.gov (United States)

    2007-01-01

  20. Acoustic Model Training Using Pseudo-Speaker Features Generated by MLLR Transformations for Robust Speaker-Independent Speech Recognition

    OpenAIRE

    Arata Itoh; Sunao Hara; Norihide Kitaoka; Kazuya Takeda

    2012-01-01

    A novel speech feature generation-based acoustic model training method for robust speaker-independent speech recognition is proposed. For decades, speaker adaptation methods have been widely used. All of these adaptation methods need adaptation data. However, our proposed method aims to create speaker-independent acoustic models that cover not only known but also unknown speakers. We achieve this by adopting inverse maximum likelihood linear regression (MLLR) transformation-based feature gene...

  1. Towards Robustness to Speech Rate in Mandarin All-Syllable Recognition

    Institute of Scientific and Technical Information of China (English)

    CHEN YiNing (陈一宁); ZHU Xuan (朱璇); LIU Jia (刘加); LIU RunSheng (刘润生)

    2003-01-01

    In Mandarin all-syllable recognition, many insertion errors occur due to the influence of non-consonant syllables. Introducing a duration model into the recognition process is a direct way to reduce these errors, but it usually does not work as well as expected because duration is sensitive to speech rate. Aiming at this problem, a novel context-dependent duration distribution normalized by speech rate is proposed in this paper and applied to a speech recognition system based on an improved Hidden Markov Model (HMM) framework. To realize this algorithm, the authors employ a new method to estimate the speech rate of a sentence, then compute the duration probability combined with the speech rate, and finally apply this duration information in the post-processing stage. With little change to the recognition process and resource demands, the duration model is adopted efficiently in the system. The experimental results indicate that syllable error rates decrease significantly on two different speech corpora. For insertions especially, the error rates drop by about sixty to eighty percent.

  2. Sparse coding of the modulation spectrum for noise-robust automatic speech recognition

    NARCIS (Netherlands)

    Ahmadi, S.; Ahadi, S.M.; Cranen, B.; Boves, L.W.J.

    2014-01-01

    The full modulation spectrum is a high-dimensional representation of one-dimensional audio signals. Most previous research in automatic speech recognition converted this very rich representation into the equivalent of a sequence of short-time power spectra, mainly to simplify the computation of the…

  3. A Robust Method for Speech Emotion Recognition Based on Infinite Student’s t-Mixture Model

    Directory of Open Access Journals (Sweden)

    Xinran Zhang

    2015-01-01

    The speech emotion classification method proposed in this paper is based on a Student's t-mixture model with an infinite component number (iSMM) and can directly and effectively recognize various kinds of speech emotion samples. Compared with the traditional GMM (Gaussian mixture model), a speech emotion model based on a Student's t-mixture can effectively handle outliers that exist in the emotion feature space, and the t-mixture model remains robust to atypical emotion test data. To address the high data complexity caused by the high-dimensional space and the problem of insufficient training samples, a global latent space is added to the emotion model. This approach allows an infinite number of components and forms the iSMM emotion model, which can automatically determine the best number of components with lower complexity for classifying various kinds of emotion data. Evaluated on one spontaneous (FAU Aibo Emotion Corpus) and two acted (DES and EMO-DB) universal speech emotion databases, which have high-dimensional feature samples and diverse data distributions, the iSMM maintains better recognition performance than the comparison methods. The effectiveness of the model and its generalization to high-dimensional data and outliers are thus verified.

  4. An FFT-Based Companding Front End for Noise-Robust Automatic Speech Recognition

    Directory of Open Access Journals (Sweden)

    Turicchia Lorenzo

    2007-01-01

    We describe an FFT-based companding algorithm for preprocessing speech before recognition. The algorithm mimics tone-to-tone suppression and masking in the auditory system to improve automatic speech recognition performance in noise. Moreover, it is very computationally efficient and suited to digital implementations due to its use of the FFT. In an automotive digits recognition task with the CU-Move database recorded in real environmental noise, the algorithm improves the relative word error by 12.5% at -5 dB signal-to-noise ratio (SNR) and by 6.2% across all SNRs (-5 dB SNR to +5 dB SNR). In the Aurora-2 database recorded with artificially added noise in several environments, the algorithm improves the relative word error rate in almost all situations.

  5. An FFT-Based Companding Front End for Noise-Robust Automatic Speech Recognition

    Directory of Open Access Journals (Sweden)

    Bhiksha Raj

    2007-06-01

    We describe an FFT-based companding algorithm for preprocessing speech before recognition. The algorithm mimics tone-to-tone suppression and masking in the auditory system to improve automatic speech recognition performance in noise. Moreover, it is very computationally efficient and suited to digital implementations due to its use of the FFT. In an automotive digits recognition task with the CU-Move database recorded in real environmental noise, the algorithm improves the relative word error by 12.5% at −5 dB signal-to-noise ratio (SNR) and by 6.2% across all SNRs (−5 dB SNR to +15 dB SNR). In the Aurora-2 database recorded with artificially added noise in several environments, the algorithm improves the relative word error rate in almost all situations.
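
    A crude sketch of what a companding front end does in the FFT domain: compress the magnitude in each channel, let neighbouring channels inhibit one another so that strong components suppress weaker ones (two-tone suppression), then expand back. The compression exponent, inhibition kernel and weighting below are illustrative assumptions, not the authors' filter design.

      import numpy as np
      from scipy.signal import stft, istft

      def companding_frontend(x, fs, p=0.3, nperseg=256):
          f, t, X = stft(x, fs, nperseg=nperseg)
          mag = np.abs(X)
          comp = mag ** p                          # compressive nonlinearity
          kernel = np.array([0.25, 0.5, 0.25])     # local cross-channel average
          inhib = np.apply_along_axis(
              lambda m: np.convolve(m, kernel, mode="same"), 0, comp)
          out = np.clip(comp - 0.5 * inhib, 0.0, None) ** (1.0 / p)  # expand
          _, y = istft(out * np.exp(1j * np.angle(X)), fs, nperseg=nperseg)
          return y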

  6. Robust multi-stream speech recognition based on weighting the output probabilities of feature components

    Institute of Scientific and Technical Information of China (English)

    ZHANG Jun; WEI Gang; YU Hua; NING Genxin

    2009-01-01

    In traditional multi-stream fusion methods for speech recognition, all the feature components in a data stream share the same stream weight, although their distortion levels usually differ when the recognizer works in noisy environments. To overcome this limitation of traditional multi-stream frameworks, the current study proposes a new stream fusion method that weights not only the stream outputs but also the output probabilities of the feature components. How the stream and feature component weights in the new fusion method affect the decision is analyzed, and two stream fusion schemes based on the marginalisation and soft decision models from the missing data techniques are proposed. Experimental results on a hybrid sub-band multi-stream speech recognizer show that the proposed schemes adjust the stream influences on the decision adaptively and outperform traditional multi-stream methods in various noisy environments.

  7. Robust Speaker Recognition with Combined Use of Acoustic and Throat Microphone Speech

    DEFF Research Database (Denmark)

    Sahidullah, Md; Gonzalez Hautamäki, Rosa; Thomsen, Dennis Alexander Lehmann

    2016-01-01

    The accuracy of automatic speaker verification (ASV) systems degrades severely in the presence of background noise. In this paper, we study the use of additional side information provided by a body-conducted sensor, the throat microphone. The throat microphone signal is much less affected by background noise than the acoustic microphone signal, which makes throat microphones potentially useful for feature extraction or speech activity detection. This paper, firstly, proposes a new prototype system for simultaneous data acquisition of acoustic and throat microphone signals. Secondly, we study the use of this additional information for speech activity detection, feature extraction and fusion of the acoustic and throat microphone signals. We collect a pilot database consisting of 38 subjects including both clean and noisy sessions. We carry out speaker verification experiments using Gaussian mixture model…

  8. Pattern recognition in speech and language processing

    CERN Document Server

    Chou, Wu

    2003-01-01

    Minimum Classification Error (MCE) Approach in Pattern Recognition, Wu Chou; Minimum Bayes-Risk Methods in Automatic Speech Recognition, Vaibhava Goel and William Byrne; A Decision Theoretic Formulation for Adaptive and Robust Automatic Speech Recognition, Qiang Huo; Speech Pattern Recognition Using Neural Networks, Shigeru Katagiri; Large Vocabulary Speech Recognition Based on Statistical Methods, Jean-Luc Gauvain; Toward Spontaneous Speech Recognition and Understanding, Sadaoki Furui; Speaker Authentication, Qi Li and Biing-Hwang Juang; HMMs for Language Processing Problems, Ri…

  9. Speech recognition with amplitude and frequency modulations

    Science.gov (United States)

    Zeng, Fan-Gang; Nie, Kaibao; Stickney, Ginger S.; Kong, Ying-Yee; Vongphoe, Michael; Bhargave, Ashish; Wei, Chaogang; Cao, Keli

    2005-02-01

    Amplitude modulation (AM) and frequency modulation (FM) are commonly used in communication, but their relative contributions to speech recognition have not been fully explored. To bridge this gap, we derived slowly varying AM and FM from speech sounds and conducted listening tests using stimuli with different modulations in normal-hearing and cochlear-implant subjects. We found that although AM from a limited number of spectral bands may be sufficient for speech recognition in quiet, FM significantly enhances speech recognition in noise, as well as speaker and tone recognition. Additional speech reception threshold measures revealed that FM is particularly critical for speech recognition with a competing voice and is independent of spectral resolution and similarity. These results suggest that AM and FM provide independent yet complementary contributions to support robust speech recognition under realistic listening situations. Encoding FM may improve auditory scene analysis, cochlear-implant, and audio coding performance.

  10. Advances in Speech Recognition

    CERN Document Server

    Neustein, Amy

    2010-01-01

    This volume comprises contributions from eminent leaders in the speech industry and presents a comprehensive and in-depth analysis of the progress of speech technology in the topical areas of mobile settings, healthcare and call centers. The material addresses the technical aspects of voice technology within the framework of societal needs, such as the use of speech recognition software to produce up-to-date electronic health records, notwithstanding patients making changes to health plans and physicians. Included is discussion of speech engineering, linguistics, human factors ana…

  11. Amharic Speech Recognition for Speech Translation

    OpenAIRE

    Melese, Michael; Besacier, Laurent; Meshesha, Million

    2016-01-01

    State-of-the-art speech translation can be seen as a cascade of Automatic Speech Recognition, Statistical Machine Translation and Text-To-Speech synthesis. In this study an attempt is made to experiment on Amharic speech recognition for Amharic-English speech translation in the tourism domain. Since there was no Amharic speech corpus, we developed a read-speech corpus of 7.43 hours in the tourism domain. The Amharic speech corpus has been recorded after translating standard Bas…

  12. A Robust Front-End Processor combining Mel Frequency Cepstral Coefficient and Sub-band Spectral Centroid Histogram methods for Automatic Speech Recognition

    Directory of Open Access Journals (Sweden)

    R. Thangarajan

    2009-06-01

    Environmental robustness is an important area of research in speech recognition. Mismatch between trained speech models and the actual speech to be recognized, due to factors like background noise, can cause severe degradation in the accuracy of recognizers based on commonly used features like mel-frequency cepstral coefficients (MFCC) and linear predictive coding (LPC). It is well understood that auditory-based feature extraction methods perform extremely well in terms of robustness because of the dominant-frequency information present in them, but these methods suffer from high computational cost. Another method, sub-band spectral centroid histograms (SSCH), integrates dominant-frequency information with sub-band power information. This method is based on sub-band spectral centroids (SSC), which are closely related to spectral peaks for both clean and noisy speech. Since SSC can be computed efficiently from a short-term speech power spectrum estimate, the SSCH method is quite robust to additive background noise at a lower computational cost. It has been noted that the MFCC method outperforms the SSCH method on clean speech, but degrades substantially on speech with additive noise. In this paper, both MFCC and SSCH feature extraction have been implemented in Carnegie Mellon University's Sphinx 4.0 and trained and tested on the AN4 database for clean and noisy speech. Finally, a robust speech recognizer that automatically employs either MFCC or SSCH feature extraction, based on the variance of the short-term power of the input utterance, is suggested.
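
    Sub-band spectral centroids are cheap to compute from a short-term power spectrum: each sub-band contributes the power-weighted mean frequency inside its range, which tracks the dominant spectral peak even under additive noise. A minimal numpy sketch, with the band edges and power exponent as illustrative assumptions; SSCH would then histogram these centroids, weighted by sub-band power, over time.

      import numpy as np

      def subband_spectral_centroids(power, freqs, band_edges, gamma=1.0):
          # power, freqs: one frame's power spectrum and its bin frequencies.
          sscs = []
          for lo, hi in band_edges:
              idx = (freqs >= lo) & (freqs < hi)
              w = power[idx] ** gamma              # power weighting
              sscs.append((freqs[idx] * w).sum() / (w.sum() + 1e-12))
          return np.array(sscs)

      # e.g. four bands up to 4 kHz:
      # subband_spectral_centroids(p, f, [(0, 1000), (1000, 2000),
      #                                   (2000, 3000), (3000, 4000)])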

  13. On the Use of Evolutionary Algorithms to Improve the Robustness of Continuous Speech Recognition Systems in Adverse Conditions

    Directory of Open Access Journals (Sweden)

    Sid-Ahmed Selouani

    2003-07-01

    Limiting the decrease in performance due to acoustic environment changes remains a major challenge for continuous speech recognition (CSR) systems. We propose a novel approach which combines the Karhunen-Loève transform (KLT) in the mel-frequency domain with a genetic algorithm (GA) to enhance the data representing corrupted speech. The idea consists of projecting noisy speech parameters onto the space generated by the genetically optimized principal axes issued from the KLT. The enhanced parameters increase the recognition rate for highly interfering noise environments. The proposed hybrid technique, when included in the front-end of an HTK-based CSR system, outperforms the conventional recognition process in severe interfering car noise environments for a wide range of signal-to-noise ratios (SNRs) varying from 16 dB to −4 dB. We also show the effectiveness of the KLT-GA method in recognizing speech subject to telephone channel degradations.

  14. On the Use of Evolutionary Algorithms to Improve the Robustness of Continuous Speech Recognition Systems in Adverse Conditions

    Science.gov (United States)

    Selouani, Sid-Ahmed; O'Shaughnessy, Douglas

    2003-12-01

    Limiting the decrease in performance due to acoustic environment changes remains a major challenge for continuous speech recognition (CSR) systems. We propose a novel approach which combines the Karhunen-Loève transform (KLT) in the mel-frequency domain with a genetic algorithm (GA) to enhance the data representing corrupted speech. The idea consists of projecting noisy speech parameters onto the space generated by the genetically optimized principal axes issued from the KLT. The enhanced parameters increase the recognition rate for highly interfering noise environments. The proposed hybrid technique, when included in the front-end of an HTK-based CSR system, outperforms the conventional recognition process in severe interfering car noise environments for a wide range of signal-to-noise ratios (SNRs) varying from 16 dB to −4 dB. We also show the effectiveness of the KLT-GA method in recognizing speech subject to telephone channel degradations.

  15. Robustness-related issues in speaker recognition

    CERN Document Server

    Zheng, Thomas Fang

    2017-01-01

    This book presents an overview of speaker recognition technologies with an emphasis on dealing with robustness issues. Firstly, the book gives an overview of speaker recognition, such as the basic system framework, categories under different criteria, performance evaluation and its development history. Secondly, with regard to robustness issues, the book presents three categories, including environment-related issues, speaker-related issues and application-oriented issues. For each category, the book describes the current hot topics, existing technologies, and potential research focuses in the future. The book is a useful reference book and self-learning guide for early researchers working in the field of robust speech recognition.

  16. Auditory-Spectrum Quantization Based Speech Recognition

    Institute of Scientific and Technical Information of China (English)

    Wu Yuanqing; Hao Jie; et al.

    1997-01-01

    Based on an analysis of the physiological and psychological characteristics of the human auditory system [1], we can classify the human auditory process into two hearing modes: an active one and a passive one. A novel approach to robust speech recognition, Auditory-spectrum Quantization Based Speech Recognition (AQBSR), is proposed. In this method, we intend to simulate the human active hearing mode and locate the effective areas of speech signals in the temporal and frequency domains. Adaptive filter banks are used in place of fixed-band filters to extract feature parameters. The effective speech components and their corresponding frequency areas for each word in the vocabulary can be found during training. In the recognition stage, comparison between the unknown sound and the current template is maintained only in the effective areas of the template word. Control experiments show that the AQBSR method is more robust than traditional systems.

  17. Speech recognition in university classrooms

    OpenAIRE

    Wald, Mike; Bain, Keith; Basson, Sara H

    2002-01-01

    The LIBERATED LEARNING PROJECT (LLP) is an applied research project studying two core questions: 1) Can speech recognition (SR) technology successfully digitize lectures to display spoken words as text in university classrooms? 2) Can speech recognition technology be used successfully as an alternative to traditional classroom notetaking for persons with disabilities? This paper addresses these intriguing questions and explores the underlying complex relationship between speech recognition technology…

  18. Speech Recognition on Mobile Devices

    DEFF Research Database (Denmark)

    Tan, Zheng-Hua; Lindberg, Børge

    2010-01-01

    The enthusiasm for deploying automatic speech recognition (ASR) on mobile devices is driven both by remarkable advances in ASR technology and by the demand for efficient user interfaces on devices such as mobile phones and personal digital assistants (PDAs). This chapter presents an overview of ASR in the mobile context, covering motivations, challenges, fundamental techniques and applications. Three ASR architectures are introduced: embedded speech recognition, distributed speech recognition and network speech recognition. Their pros and cons and implementation issues are discussed. Applications within command and control, text entry and search are presented with an emphasis on mobile text entry.

  19. Robust Speech Recognition Based on Vector Taylor Series%基于矢量泰勒级数的鲁棒语音识别

    Institute of Scientific and Technical Information of China (English)

    吕勇; 吴镇扬

    2011-01-01

    The vector Taylor series (VTS) expansion is an effective approach to noise robust speech recognition. However, in the log-spectral domain there are strong correlations among the different channels of the Mel filter bank, and it is therefore difficult to estimate the noise variance from noisy speech. This paper proposes a feature compensation algorithm in the cepstral domain based on the vector Taylor series. In this algorithm, the distribution of speech cepstral features is represented by a Gaussian mixture model (GMM), and the mean and variance of the noise are estimated from noisy speech by the VTS approximation. Experimental results show that the proposed algorithm significantly improves the performance of the speech recognition system and outperforms the VTS-based feature compensation algorithm in the log-spectral domain.
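
    The first-order VTS relations are compact enough to show. In the log-mel domain the mismatch function for additive noise is mu_y = mu_x + log(1 + exp(mu_n - mu_x)); the paper works in the cepstral domain, where a DCT/inverse-DCT pair wraps the same nonlinearity. A minimal numpy sketch of compensating one clean Gaussian given a noise estimate (additive noise only, channel term omitted):

      import numpy as np

      def vts_compensate(mu_x, var_x, mu_n, var_n):
          # Jacobian of the mismatch function w.r.t. the clean mean.
          g = 1.0 / (1.0 + np.exp(mu_n - mu_x))
          mu_y = np.logaddexp(mu_x, mu_n)          # mu_x + log1p(exp(mu_n - mu_x))
          var_y = g ** 2 * var_x + (1.0 - g) ** 2 * var_n
          return mu_y, var_y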

  20. PCA-Based Speech Enhancement for Distorted Speech Recognition

    Directory of Open Access Journals (Sweden)

    Tetsuya Takiguchi

    2007-09-01

    We investigated a robust speech feature extraction method using kernel PCA (Principal Component Analysis) for distorted speech recognition. Kernel PCA has been suggested for various image processing tasks requiring an image model, such as denoising, where a noise-free image is constructed from a noisy input image. Much research on robust speech feature extraction has been done, but it remains difficult to completely remove additive or convolutional noise (distortion). The most commonly used noise-removal techniques are based on spectral-domain operations, after which, for speech recognition, the MFCC (Mel Frequency Cepstral Coefficient) is computed by applying the DCT (Discrete Cosine Transform) to the mel-scale filter bank output. This paper describes a new PCA-based speech enhancement algorithm using kernel PCA instead of the DCT, where the main speech element is projected onto low-order features while the noise or distortion element is projected onto high-order features. Its effectiveness is confirmed by word recognition experiments on distorted speech.
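
    The denoising use of kernel PCA is: fit on clean training vectors, keep the low-order components, and reconstruct a pre-image for each noisy vector from those components. A minimal scikit-learn sketch; the random stand-in data, kernel choice and component count are illustrative assumptions, not the paper's configuration.

      import numpy as np
      from sklearn.decomposition import KernelPCA

      rng = np.random.default_rng(0)
      clean = rng.standard_normal((500, 24))        # stand-in for clean features
      noisy = clean + 0.3 * rng.standard_normal(clean.shape)

      kpca = KernelPCA(n_components=8, kernel="rbf", gamma=0.05,
                       fit_inverse_transform=True)  # enables pre-image estimates
      kpca.fit(clean)
      enhanced = kpca.inverse_transform(kpca.transform(noisy))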

  1. Speech Recognition: How Do We Teach It?

    Science.gov (United States)

    Barksdale, Karl

    2002-01-01

    States that growing use of speech recognition software has made voice writing an essential computer skill. Describes how to present the topic, develop basic speech recognition skills, and teach speech recognition outlining, writing, proofreading, and editing. (Contains 14 references.) (SK)

  2. Methods of Teaching Speech Recognition

    Science.gov (United States)

    Rader, Martha H.; Bailey, Glenn A.

    2010-01-01

    Objective: This article introduces the history and development of speech recognition, addresses its role in the business curriculum, outlines related national and state standards, describes instructional strategies, and discusses the assessment of student achievement in speech recognition classes. Methods: Research methods included a synthesis of…

  3. Service Robot SCORPIO with Robust Speech Interface

    Directory of Open Access Journals (Sweden)

    Stanislav Ondas

    2013-01-01

    The SCORPIO is a small-size mini-teleoperator mobile service robot for booby-trap disposal. It can be manually controlled by an operator through a portable briefcase remote control device using a joystick, keyboard and buttons. In this paper, the speech interface is described. As an auxiliary modality, the speech interface allows the human operator to keep sight and/or hands on other, more important operation activities. The developed speech interface is based on HMM-based acoustic models trained on the SpeechDatE-SK database, a small-vocabulary language model based on fixed connected words, a grammar, and a speech recognition setup adapted for low-resource devices. To improve the robustness of the speech interface in an outdoor environment, the working area of the SCORPIO service robot, speech enhancement based on the spectral subtraction method, as well as a unique combination of an iterative approach and a modified LIMA framework, were researched, developed and tested on simulated and real outdoor recordings.

  4. Robust speech features representation based on computational auditory model

    Institute of Scientific and Technical Information of China (English)

    LU Xugang; JIA Chuan; DANG Jianwu

    2004-01-01

    A speech signal processing and feature extraction method based on a computational auditory model is proposed. The computational model is built on psychological and physiological knowledge and digital signal processing methods: for each stage of the hearing perception system, there is a corresponding computational model to simulate its function. Based on this model, speech features are extracted at different levels at each stage. A further processing step for the primary auditory spectrum, based on lateral inhibition, is proposed to extract much more robust speech features. All these features can be regarded as internal representations of speech stimuli in the hearing system. Robust speech recognition experiments were conducted to test the robustness of the features. Results show that the representations based on the proposed computational auditory model are robust representations of speech signals.

  5. Brain-inspired speech segmentation for automatic speech recognition using the speech envelope as a temporal reference

    Science.gov (United States)

    Lee, Byeongwook; Cho, Kwang-Hyun

    2016-11-01

    Speech segmentation is a crucial step in automatic speech recognition because additional speech analyses are performed for each framed speech segment. Conventional segmentation techniques primarily segment speech using a fixed frame size for computational simplicity. However, this approach is insufficient for capturing the quasi-regular structure of speech, which causes substantial recognition failure in noisy environments. How does the brain handle quasi-regular structured speech and maintain high recognition performance under any circumstance? Recent neurophysiological studies have suggested that the phase of neuronal oscillations in the auditory cortex contributes to accurate speech recognition by guiding speech segmentation into smaller units at different timescales. A phase-locked relationship between neuronal oscillation and the speech envelope has recently been obtained, which suggests that the speech envelope provides a foundation for multi-timescale speech segmental information. In this study, we quantitatively investigated the role of the speech envelope as a potential temporal reference to segment speech using its instantaneous phase information. We evaluated the proposed approach by the achieved information gain and recognition performance in various noisy environments. The results indicate that the proposed segmentation scheme not only extracts more information from speech but also provides greater robustness in a recognition test.
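
    The quantity this study builds on is easy to compute: the speech envelope via the Hilbert transform, band-limited to syllabic rates, and its instantaneous phase as a temporal reference for segment boundaries. A minimal scipy sketch; the band limits and the phase-wrap boundary rule are illustrative assumptions, not the paper's exact scheme.

      import numpy as np
      from scipy.signal import hilbert, butter, filtfilt

      def envelope_phase(x, fs, band=(2.0, 10.0)):
          env = np.abs(hilbert(x))                     # amplitude envelope
          b, a = butter(2, [band[0] / (fs / 2), band[1] / (fs / 2)],
                        btype="band")
          env_bl = filtfilt(b, a, env)                 # syllabic-rate band
          phase = np.angle(hilbert(env_bl))            # instantaneous phase
          # candidate segment boundaries where the phase wraps around -pi
          bounds = np.flatnonzero(np.diff(phase) < -np.pi)
          return env_bl, phase, bounds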

  6. Emotion Recognition using Speech Features

    CERN Document Server

    Rao, K Sreenivasa

    2013-01-01

    “Emotion Recognition Using Speech Features” covers emotion-specific features present in speech and discusses suitable models for capturing emotion-specific information to distinguish different emotions. The content of this book is important for designing and developing natural and sophisticated speech systems. Drs. Rao and Koolagudi lead a discussion of how emotion-specific information is embedded in speech and how to acquire emotion-specific knowledge using appropriate statistical models. Additionally, the authors provide information about using evidence derived from various features and models. The acquired emotion-specific knowledge is useful for synthesizing emotions. Discussion includes global and local prosodic features at syllable, word and phrase levels, helpful for capturing emotion-discriminative information; the use of complementary evidence obtained from excitation sources, vocal tract systems and prosodic features to enhance emotion recognition performance; and pro…

  7. Discriminative learning for speech recognition

    CERN Document Server

    He, Xiaodong

    2008-01-01

    In this book, we introduce the background and mainstream methods of probabilistic modeling and discriminative parameter optimization for speech recognition. The specific models treated in depth include the widely used exponential-family distributions and the hidden Markov model. A detailed study is presented on unifying the common objective functions for discriminative learning in speech recognition, namely maximum mutual information (MMI), minimum classification error, and minimum phone/word error. The unification is presented, with rigorous mathematical analysis, in a common rational-functio…

  8. Speech recognition from spectral dynamics

    Indian Academy of Sciences (India)

    Hynek Hermansky

    2011-10-01

    Information is carried in changes of a signal. The paper starts by revisiting Dudley's concept of the carrier nature of speech. It points to its close connection to modulation spectra of speech and argues against short-term spectral envelopes as dominant carriers of the linguistic information in speech. The history of spectral representations of speech is briefly discussed. Some of the history of the gradual infusion of the modulation spectrum concept into automatic speech recognition (ASR) comes next, pointing to the relationship of modulation spectrum processing to well-accepted ASR techniques such as dynamic speech features or RelAtive SpecTrAl (RASTA) filtering. Next, the frequency-domain perceptual linear prediction technique for deriving autoregressive models of temporal trajectories of spectral power in individual frequency bands is reviewed. Finally, posterior-based features, which allow for straightforward application of modulation frequency domain information, are described. The paper is tutorial in nature, aims at a historical global overview of attempts to use spectral dynamics in machine recognition of speech, and does not always provide full detail of the described techniques. However, extensive references to earlier work are provided to compensate for the lack of detail in the paper.

  9. Speech Recognition: Its Place in Business Education.

    Science.gov (United States)

    Szul, Linda F.; Bouder, Michele

    2003-01-01

    Suggests uses of speech recognition devices in the classroom for students with disabilities. Compares speech recognition software packages and provides guidelines for selection and teaching. (Contains 14 references.) (SK)

  10. Speech Recognition Technology for Hearing Disabled Community

    Directory of Open Access Journals (Sweden)

    Tanvi Dua

    2014-09-01

    As the number of people with hearing disabilities increases significantly around the world, technology is needed to fill the communication gap between Deaf and Hearing communities. To fill this gap and allow people with hearing disabilities to communicate, this paper suggests a framework that contributes to their efficient integration. The paper presents a robust speech recognition system which converts continuous speech into text and images. Results are obtained with an accuracy of 95% on a small vocabulary of 20 greeting sentences of continuous speech, tested in speaker-independent mode. In this testing phase, all the continuous sentences were given as live input to the proposed system.

  11. Robust Recognition Method of Speech Under Stress%心理紧张情况下的Robust语音识别方法

    Institute of Scientific and Technical Information of China (English)

    韩纪庆; 张磊; 王承发

    2000-01-01

    There are many stressful environments which deteriorate the performance of speech recognition systems. Techniques for compensating for the influence of stress can help neutralize stressed speech and improve the robustness of speech recognition systems. In this paper, we summarize the approaches to robust recognition of speech under stress and review recent advances in the area.

  12. Speech-based Annotation of Heterogeneous Multimedia Content Using Automatic Speech Recognition

    NARCIS (Netherlands)

    Huijbregts, M.A.H.; Ordelman, Roeland J.F.; de Jong, Franciska M.G.

    2007-01-01

    This paper reports on the setup and evaluation of robust speech recognition system parts, geared towards transcript generation for heterogeneous, real-life media collections. The system is deployed for generating speech transcripts for the NIST/TRECVID-2007 test collection, part of a Dutch real-life…

  13. Under-resourced speech recognition based on the speech manifold

    CSIR Research Space (South Africa)

    Sahraeian, R

    2015-09-01

    Conventional acoustic modeling involves estimating many parameters to effectively model feature distributions. The sparseness of speech and text data, however, degrades the reliability of the estimation process and makes speech recognition a…

  14. Novel Techniques for Dialectal Arabic Speech Recognition

    CERN Document Server

    Elmahdy, Mohamed; Minker, Wolfgang

    2012-01-01

    Novel Techniques for Dialectal Arabic Speech describes approaches to improve automatic speech recognition for dialectal Arabic. Since speech resources for dialectal Arabic speech recognition are very sparse, the authors describe how existing Modern Standard Arabic (MSA) speech data can be applied to dialectal Arabic speech recognition, while assuming that MSA is always a second language for all Arabic speakers. In this book, Egyptian Colloquial Arabic (ECA) has been chosen as a typical Arabic dialect. ECA is the first ranked Arabic dialect in terms of number of speakers, and a high quality ECA speech corpus with accurate phonetic transcription has been collected. MSA acoustic models were trained using news broadcast speech. In order to cross-lingually use MSA in dialectal Arabic speech recognition, the authors have normalized the phoneme sets for MSA and ECA. After this normalization, they have applied state-of-the-art acoustic model adaptation techniques like Maximum Likelihood Linear Regression (MLLR) and M...

  15. Automatic speech recognition: An evaluation of Google Speech

    OpenAIRE

    Stenman, Magnus

    2015-01-01

    The use of speech recognition is increasing rapidly, and it is now available in smart TVs, desktop computers, every new smartphone, etc., allowing us to talk to computers naturally. With its use in home appliances, education and even surgical procedures, accuracy and speed become very important. This thesis aims to give an introduction to speech recognition and discuss its use in robotics. An evaluation of Google Speech, using Google's speech API, in regards to word error rate and translation…

  16. Unvoiced Speech Recognition Using Tissue-Conductive Acoustic Sensor

    Directory of Open Access Journals (Sweden)

    Heracleous Panikos

    2007-01-01

    We present the use of stethoscope and silicon NAM (nonaudible murmur) microphones in automatic speech recognition. NAM microphones are special acoustic sensors, attached behind the talker's ear, that can capture not only normal (audible) speech but also very quietly uttered speech (nonaudible murmur). As a result, NAM microphones can be applied in automatic speech recognition systems when privacy is desired in human-machine communication. Moreover, NAM microphones are robust against noise and might be used in special systems (speech recognition, speech transformation, etc.) for sound-impaired people. Using adaptation techniques and a small amount of training data, we achieved a word accuracy of … for a 20 k dictation task for nonaudible murmur recognition in a clean environment. In this paper, we also investigate nonaudible murmur recognition in noisy environments and the effect of the Lombard reflex on nonaudible murmur recognition. We also propose three methods to integrate audible speech and nonaudible murmur recognition using a stethoscope NAM microphone, with very promising results.

  17. Human Emotion Recognition From Speech

    Directory of Open Access Journals (Sweden)

    Miss. Aparna P. Wanare

    2014-07-01

    Full Text Available Speech Emotion Recognition is a recent research topic in the Human Computer Interaction (HCI) field. The need has arisen for a more natural communication interface between humans and computers, as computers have become an integral part of our lives. A lot of work is currently going on to improve the interaction between humans and computers. To achieve this goal, a computer would have to be able to assess its present situation and respond differently depending on that observation. Part of this process involves understanding a user's emotional state. To make human-computer interaction more natural, the objective is that the computer should be able to recognize emotional states in the same way a human does. The efficiency of an emotion recognition system depends on the type of features extracted and the classifier used for detection of emotions. The proposed system aims at identification of basic emotional states such as anger, joy, neutral and sadness from human speech. While classifying different emotions, features like MFCC (Mel Frequency Cepstral Coefficients) and energy are used. In this paper, a standard emotional database, i.e. an English database, is used, which gives more satisfactory detection of emotions than recorded samples of emotions. This methodology describes and compares the performances of Learning Vector Quantization Neural Networks (LVQ NN), Multiclass Support Vector Machines (SVM) and their combination for emotion recognition.
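
    As a rough illustration of the pipeline this record describes, the sketch below extracts MFCC and energy features and trains an SVM classifier. It is a minimal sketch assuming the third-party librosa and scikit-learn packages; the file names and labels are invented placeholders, and the paper's LVQ network and combination scheme are not reproduced.

      # Hypothetical sketch: MFCC + energy features with an SVM emotion
      # classifier; file names and labels are placeholders.
      import numpy as np
      import librosa
      from sklearn.svm import SVC

      def emotion_features(path, n_mfcc=13):
          y, sr = librosa.load(path, sr=16000)
          mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)  # (n_mfcc, frames)
          energy = librosa.feature.rms(y=y)                       # (1, frames)
          feats = np.vstack([mfcc, energy])
          # Summarize frame-level features as an utterance-level mean/std vector.
          return np.concatenate([feats.mean(axis=1), feats.std(axis=1)])

      X = np.array([emotion_features(f) for f in ["angry_01.wav", "sad_01.wav"]])
      y = np.array(["anger", "sadness"])
      clf = SVC(kernel="rbf").fit(X, y)   # predict with clf.predict(...)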

  18. On speech recognition during anaesthesia

    DEFF Research Database (Denmark)

    Alapetite, Alexandre

    2007-01-01

    This PhD thesis in human-computer interfaces (informatics) studies the case of the anaesthesia record used during medical operations and the possibility of supplementing it with speech recognition facilities. Problems and limitations have been identified with the traditional paper-based anaesthesia record, but also with newer electronic versions typically based on touch-screen and keyboard, in particular ergonomic issues and the fact that anaesthesiologists tend to postpone the registration of medications and other events during busy periods of anaesthesia, which in turn may lead to gaps...

  19. Isolated Speech Recognition Using Artificial Neural Networks

    Science.gov (United States)

    2007-11-02

    In this project Artificial Neural Networks are used as a research tool to accomplish Automated Speech Recognition of normal speech. A small size... [the results of] the first stage of this work are satisfactory, and thus the application of artificial neural networks in conjunction with cepstral analysis to isolated word recognition holds promise.

  20. Source Separation via Spectral Masking for Speech Recognition Systems

    Directory of Open Access Journals (Sweden)

    Gustavo Fernandes Rodrigues

    2012-12-01

    Full Text Available In this paper we present an insight into the use of spectral masking techniques in the time-frequency domain as a preprocessing step for speech signal recognition. Speech recognition systems have their performance negatively affected in noisy environments or in the presence of other speech signals. The limits of these masking techniques for different levels of the signal-to-noise ratio are discussed. We show the robustness of the spectral masking techniques against four types of noise: white, pink, brown and human speech noise (bubble noise). The main contribution of this work is to analyze the performance limits of recognition systems using spectral masking. We obtain an increase of 18% in the speech hit rate when the speech signals were corrupted by other speech signals or bubble noise, at signal-to-noise ratios of approximately 1, 10 and 20 dB. On the other hand, applying the ideal binary masks to mixtures corrupted by white, pink and brown noise results in an average growth of 9% in the speech hit rate at the same signal-to-noise ratios. The experimental results suggest that the spectral masking techniques are more suitable for bubble noise, which is produced by human speech, than for white, pink and brown noise.
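
    The ideal binary mask mentioned above can be sketched as follows: keep a time-frequency cell when its local SNR clears a threshold, zero it otherwise. This is a minimal sketch using scipy; the 512-sample window and the 0 dB threshold are assumptions, not the paper's settings.

      # Illustrative ideal binary mask (IBM) over STFT cells.
      import numpy as np
      from scipy.signal import stft, istft

      def ideal_binary_mask(clean, noise, fs=16000, thr_db=0.0):
          _, _, S = stft(clean, fs=fs, nperseg=512)
          _, _, N = stft(noise, fs=fs, nperseg=512)
          local_snr_db = 20 * np.log10((np.abs(S) + 1e-12) / (np.abs(N) + 1e-12))
          return (local_snr_db > thr_db).astype(float)   # 1 = speech-dominated cell

      def apply_mask(mixture, mask, fs=16000):
          _, _, M = stft(mixture, fs=fs, nperseg=512)
          _, masked = istft(M * mask, fs=fs, nperseg=512)
          return masked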

  1. Speech Emotion Recognition Using Fuzzy Logic Classifier

    Directory of Open Access Journals (Sweden)

    Daniar aghsanavard

    2016-01-01

    Full Text Available Over the last two decades, emotion detection from speech and the associated signal processing have been among the most significant issues in the adoption of recognition techniques, and each method has its advantages and disadvantages. This paper proposes fuzzy speech emotion recognition based on the classification of speech signals, aiming at better recognition together with higher speed. The system uses a five-layer fuzzy logic system combining a progressive neural network with firefly optimization: speech samples are first presented to the input of the fuzzy system, and the signals are then examined and given a preliminary classification within a fuzzy framework. In this model, a pattern is created for each class of signals, which reduces the dimension of the signal data and makes speech recognition easier. The experimental results show that the proposed method (with classification optimized by the firefly algorithm) improves the recognition of utterances.

  2. Modelling Human Speech Recognition using Automatic Speech Recognition Paradigms in SpeM

    NARCIS (Netherlands)

    Scharenborg, O.E.; McQueen, J.M.; Bosch, L.F.M. ten; Norris, D.

    2003-01-01

    We have recently developed a new model of human speech recognition, based on automatic speech recognition techniques [1]. The present paper has two goals. First, we show that the new model performs well in the recognition of lexically ambiguous input. These demonstrations suggest that the model is...

  3. Automatic speech recognition a deep learning approach

    CERN Document Server

    Yu, Dong

    2015-01-01

    This book summarizes the recent advancement in the field of automatic speech recognition with a focus on discriminative and hierarchical models. This will be the first automatic speech recognition book to include a comprehensive coverage of recent developments such as conditional random field and deep learning techniques. It presents insights and theoretical foundation of a series of recent models such as conditional random field, semi-Markov and hidden conditional random field, deep neural network, deep belief network, and deep stacking models for sequential learning. It also discusses practical considerations of using these models in both acoustic and language modeling for continuous speech recognition.

  4. Industrial Applications of Automatic Speech Recognition Systems

    Directory of Open Access Journals (Sweden)

    Dr. Jayashri Vajpai

    2016-03-01

    Full Text Available Current trends in developing technologies form important bridges to the future, fortified by the early and productive use of technology for enriching human life. Speech signal processing, which includes automatic speech recognition, synthetic speech, and natural language processing, is beginning to have a significant impact on business, industry and the ease of operation of personal computers. Apart from this, it facilitates a deeper understanding of the complex mechanisms of the human brain. Advances in speech recognition technology over the past five decades have enabled a wide range of industrial applications. Yet today's applications provide only a small preview of a rich future for speech and voice interface technology that will eventually replace keyboards with microphones, enabling human-machine interfaces that provide easy access to increasingly intelligent machines. The paper also shows how the capabilities of speech recognition systems in industrial applications are evolving over time to usher in the next generation of voice-enabled services. This paper aims to present an effective survey of the speech recognition technology described in the available literature and to integrate the insights gained during the study of individual research and development efforts. The current applications of speech recognition in the real world and in industry are also outlined, with special reference to applications in the areas of medicine, industrial robotics, forensics, defence and aviation.

  5. Robust digital processing of speech signals

    CERN Document Server

    Kovacevic, Branko; Veinović, Mladen; Marković, Milan

    2017-01-01

    This book focuses on speech signal phenomena, presenting a robustification of the usual speech generation models with regard to the presumed types of excitation signals, which is equivalent to the introduction of a class of nonlinear models and the corresponding criterion functions for parameter estimation. Compared to the general class of nonlinear models, such as various neural networks, these models possess good properties of controlled complexity, the option of working in “online” mode, as well as a low information volume for efficient speech encoding and transmission. Providing comprehensive insights, the book is based on the authors’ research, which has already been published, supplemented by additional texts discussing general considerations of speech modeling, linear predictive analysis and robust parameter estimation.

  6. Current trends in small vocabulary speech recognition for equipment control

    Science.gov (United States)

    Doukas, Nikolaos; Bardis, Nikolaos G.

    2017-09-01

    Speech recognition systems allow human-machine communication to acquire an intuitive nature that approaches the simplicity of inter-human communication. Small vocabulary speech recognition is a subset of the overall speech recognition problem, where only a small number of words need to be recognized. Speaker-independent small vocabulary recognition can find significant applications in field equipment used by military personnel. Such equipment may typically be controlled by a small number of commands that need to be given quickly and accurately, under conditions where delicate manual operations are difficult to achieve. This type of application could hence significantly benefit from the use of robust voice-operated control components, as they would facilitate the interaction with their users and render it much more reliable in times of crisis. This paper presents current challenges involved in attaining efficient and robust small vocabulary speech recognition. These challenges concern feature selection, classification techniques, speaker diversity and noise effects. A state machine approach is presented that facilitates the voice guidance of different equipment in a variety of situations.
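
    The state-machine idea can be made concrete in a few lines: only the transitions that are legal in the current state are accepted, so out-of-context commands are rejected cheaply. The states and command words below are invented placeholders, not the paper's actual command set.

      # Toy command state machine for voice-guided equipment control.
      VALID = {
          "idle":   {"start": "armed"},
          "armed":  {"engage": "active", "stop": "idle"},
          "active": {"stop": "idle"},
      }

      def step(state, recognized_word):
          # Unknown or out-of-context words leave the state unchanged.
          return VALID[state].get(recognized_word, state)

      state = "idle"
      for word in ["start", "engage", "stop"]:
          state = step(state, word)   # idle -> armed -> active -> idle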

  7. Speech Recognition System Architecture for Gujarati Language

    National Research Council Canada - National Science Library

    Jinal H Tailor; Dipti B Shah

    2016-01-01

    .... Achieving good accuracy and efficiency in an Automatic Speech Recognition (ASR) system for the Indian Gujarati language is a challenging task due to its morphology, language barriers, different dialects, and the unavailability of resources...

  8. Speech-in-Speech Recognition: A Training Study

    Science.gov (United States)

    Van Engen, Kristin J.

    2012-01-01

    This study aims to identify aspects of speech-in-noise recognition that are susceptible to training, focusing on whether listeners can learn to adapt to target talkers ("tune in") and learn to better cope with various maskers ("tune out") after short-term training. Listeners received training on English sentence recognition in…

  9. Phoneme vs Grapheme Based Automatic Speech Recognition

    OpenAIRE

    Magimai.-Doss, Mathew; Dines, John; Bourlard, Hervé; Hermansky, Hynek

    2004-01-01

    In recent literature, different approaches have been proposed to use graphemes as subword units with implicit source of phoneme information for automatic speech recognition. The major advantage of using graphemes as subword units is that the definition of lexicon is easy. In previous studies, results comparable to phoneme-based automatic speech recognition systems have been reported using context-independent graphemes or context-dependent graphemes with decision trees. In this paper, we study...

  10. Pronunciation Modeling for Large Vocabulary Speech Recognition

    Science.gov (United States)

    Kantor, Arthur

    2010-01-01

    The large pronunciation variability of words in conversational speech is one of the major causes of low accuracy in automatic speech recognition (ASR). Many pronunciation modeling approaches have been developed to address this problem. Some explicitly manipulate the pronunciation dictionary as well as the set of the units used to define the…

  11. Hidden neural networks: application to speech recognition

    DEFF Research Database (Denmark)

    Riis, Søren Kamaric

    1998-01-01

    We evaluate the hidden neural network HMM/NN hybrid on two speech recognition benchmark tasks: (1) task-independent isolated word recognition on the Phonebook database, and (2) recognition of broad phoneme classes in continuous speech from the TIMIT database. It is shown how hidden neural networks (HNNs) with far fewer parameters than conventional HMMs and other hybrids can obtain comparable performance, and for the broad-class task it is illustrated how the HNN can be applied as a purely transition-based system, where acoustic context-dependent transition probabilities are estimated by neural...

  12. Human phoneme recognition depending on speech-intrinsic variability.

    Science.gov (United States)

    Meyer, Bernd T; Jürgens, Tim; Wesker, Thorsten; Brand, Thomas; Kollmeier, Birger

    2010-11-01

    The influence of different sources of speech-intrinsic variation (speaking rate, effort, style and dialect or accent) on human speech perception was investigated. In listening experiments with 16 listeners, confusions of consonant-vowel-consonant (CVC) and vowel-consonant-vowel (VCV) sounds in speech-weighted noise were analyzed. Experiments were based on the OLLO logatome speech database, which was designed for a man-machine comparison. It contains utterances spoken by 50 speakers from five dialect/accent regions and covers several intrinsic variations. By comparing results depending on intrinsic and extrinsic variations (i.e., different levels of masking noise), the degradation induced by variabilities can be expressed in terms of the SNR. The spectral level distance between the respective speech segment and the long-term spectrum of the masking noise was found to be a good predictor for recognition rates, while phoneme confusions were influenced by the distance to spectrally close phonemes. An analysis based on transmitted information of articulatory features showed that voicing and manner of articulation are comparatively robust cues in the presence of intrinsic variations, whereas the coding of place is more degraded. The database and detailed results have been made available for comparisons between human speech recognition (HSR) and automatic speech recognizers (ASR).

  13. Contextual variability during speech-in-speech recognition.

    Science.gov (United States)

    Brouwer, Susanne; Bradlow, Ann R

    2014-07-01

    This study examined the influence of background language variation on speech recognition. English listeners performed an English sentence recognition task in either "pure" background conditions in which all trials had either English or Dutch background babble or in mixed background conditions in which the background language varied across trials (i.e., a mix of English and Dutch or one of these background languages mixed with quiet trials). This design allowed the authors to compare performance on identical trials across pure and mixed conditions. The data reveal that speech-in-speech recognition is sensitive to contextual variation in terms of the target-background language (mis)match depending on the relative ease/difficulty of the test trials in relation to the surrounding trials.

  14. Robust recognition via information theoretic learning

    CERN Document Server

    He, Ran; Yuan, Xiaotong; Wang, Liang

    2014-01-01

    This Springer Brief represents a comprehensive review of information theoretic methods for robust recognition. A variety of information theoretic methods have been proffered in the past decade, in a large variety of computer vision applications; this work brings them together and attempts to impart the theory, optimization and usage of information entropy. The authors resort to a new information theoretic concept, correntropy, as a robust measure and apply it to solve robust face recognition and object recognition problems. For computational efficiency, the brief introduces the additive and multip...

  15. Emotion recognition from speech: tools and challenges

    Science.gov (United States)

    Al-Talabani, Abdulbasit; Sellahewa, Harin; Jassim, Sabah A.

    2015-05-01

    Human emotion recognition from speech is studied frequently for its importance in many applications, e.g. human-computer interaction. There is wide diversity and no agreement about the basic emotions or emotion-related states on the one hand, and about where the emotion-related information lies in the speech signal on the other. These diversities motivate our investigations into extracting meta-features using the PCA approach, or using a non-adaptive random projection (RP), which significantly reduce the large-dimensional speech feature vectors that may contain a wide range of emotion-related information. Subsets of meta-features are fused to increase the performance of the recognition model that adopts the score-based LDC classifier. We shall demonstrate that our scheme outperforms state-of-the-art results when tested on non-prompted databases or acted databases (i.e. where subjects act specific emotions while uttering a sentence). However, the huge gap between accuracy rates achieved on the different types of speech datasets raises questions about the way emotions modulate speech. In particular, we shall argue that emotion recognition from speech should not be dealt with as a classification problem. We shall demonstrate the presence of a spectrum of different emotions in the same speech portion, especially in the non-prompted datasets, which tend to be more "natural" than the acted datasets, where the subjects attempt to suppress all but one emotion.
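
    The two dimensionality-reduction routes mentioned above, PCA-based meta-features and a non-adaptive random projection, can be sketched as below. The feature dimension and component count are illustrative numbers, not the paper's configuration; scikit-learn is assumed.

      # PCA vs. Gaussian random projection for reducing large feature vectors.
      import numpy as np
      from sklearn.decomposition import PCA
      from sklearn.random_projection import GaussianRandomProjection

      X = np.random.randn(200, 10000)   # 200 utterances, large feature vectors
      X_pca = PCA(n_components=60).fit_transform(X)                       # adaptive
      X_rp = GaussianRandomProjection(n_components=60).fit_transform(X)   # non-adaptive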

  16. Robust emotion recognition using spectral and prosodic features

    CERN Document Server

    Rao, K Sreenivasa

    2013-01-01

    In this brief, the authors discuss recently explored spectral (sub-segmental and pitch synchronous) and prosodic (global and local features at word and syllable levels in different parts of the utterance) features for discerning emotions in a robust manner. The authors also delve into the complementary evidences obtained from excitation source, vocal tract system and prosodic features for the purpose of enhancing emotion recognition performance. Features based on speaking rate characteristics are explored with the help of multi-stage and hybrid models for further improving emotion recognition performance. Proposed spectral and prosodic features are evaluated on real life emotional speech corpus.

  17. Objective Gender and Age Recognition from Speech Sentences

    Directory of Open Access Journals (Sweden)

    Fatima K. Faek

    2015-10-01

    Full Text Available In this work, an automatic gender and age recognizer from speech is investigated. The features relevant to gender recognition are selected from the first four formant frequencies and twelve MFCCs and fed to an SVM classifier, while the features relevant to age are used with a k-NN classifier for the age recognizer model, using MATLAB as a simulation tool. A special selection of robust features is used in this work to improve the results of the gender and age classifiers, based on the frequency range that each feature represents. The gender and age classification algorithms are evaluated using 114 (clean and noisy) speech samples uttered in the Kurdish language. The two-class gender recognition model (adult males and adult females) reached 96% recognition accuracy, while for three-category classification (adult males, adult females, and children) the model achieved 94% recognition accuracy. For the age recognition model, seven groups are categorized according to their ages. The model performance after selecting the features relevant to age reached 75.3%. For further improvement, a de-noising technique is applied to the noisy speech signals, followed by selection of the features affected by the de-noising process, resulting in 81.44% recognition accuracy.

  18. Vocabulary and Environment Adaptation in Vocabulary-Independent Speech Recognition

    Science.gov (United States)

    1992-01-01

    normalization (ISDCN) proposed by Acero et al. [2] for microphone adaptation is incorporated into our VI system to achieve environmental ... reverberation from surface reflections, etc. Acero et al. [1,2] proposed a series of environment normalization algorithms based on joint ... support. References: [1] Acero, A. Acoustical and Environmental Robustness in Automatic Speech Recognition. Department of Electrical Engineering...

  19. Issues in acoustic modeling of speech for automatic speech recognition

    OpenAIRE

    Gong, Yifan; Haton, Jean-Paul; Mari, Jean-François

    1994-01-01

    Projet RFIA; Stochastic modeling is a flexible method for handling the large variability in speech for recognition applications. In contrast to dynamic time warping, where heuristic training methods for estimating word templates are used, stochastic modeling allows a probabilistic and automatic training for estimating models. This paper deals with the improvement of stochastic techniques, especially for a better representation of time-varying phenomena.

  20. Novel acoustic features for speech emotion recognition

    Institute of Scientific and Technical Information of China (English)

    ROH Yong-Wan; KIM Dong-Ju; LEE Woo-Seok; HONG Kwang-Seok

    2009-01-01

    This paper focuses on acoustic features that effectively improve the recognition of emotion in human speech. The novel features in this paper are based on spectral entropy parameters such as fast Fourier transform (FFT) spectral entropy, delta FFT spectral entropy, Mel-frequency filter bank (MFB) spectral entropy, and delta MFB spectral entropy. Spectral entropy features are simple: they reflect the frequency characteristics of speech and how those characteristics change over time. We implement an emotion rejection module using the probability distributions of recognized scores and rejected scores, which reduces the false recognition rate and improves overall performance. Recognized scores and rejected scores refer to the probabilities of recognized and rejected emotion recognition results, respectively. These scores are first obtained from a pattern recognition procedure. The pattern recognition phase uses a Gaussian mixture model (GMM). We classify four emotional states: anger, sadness, happiness and neutrality. The proposed method is evaluated using 45 sentences per emotion for 30 subjects, 15 male and 15 female. Experimental results show that the proposed method is superior to existing emotion recognition methods based on GMMs using energy, zero crossing rate (ZCR), linear prediction coefficients (LPC), and pitch parameters, demonstrating the effectiveness of the proposed approach. One of the proposed features, the combined MFB and delta MFB spectral entropy, improves performance by approximately 10% compared to existing feature parameters for speech emotion recognition. We also demonstrate a 4% performance improvement when applying emotion rejection to low-confidence scores.
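
    An FFT spectral-entropy feature and its delta, as described above, might be computed roughly as follows; the framing and the delta window size are assumptions rather than the authors' exact recipe.

      # Per-frame FFT spectral entropy and a regression-style delta.
      import numpy as np

      def fft_spectral_entropy(frames):
          # frames: (n_frames, frame_len) windowed signal frames
          spec = np.abs(np.fft.rfft(frames, axis=1)) ** 2
          p = spec / (spec.sum(axis=1, keepdims=True) + 1e-12)  # per-frame PMF
          return -(p * np.log2(p + 1e-12)).sum(axis=1)          # entropy per frame

      def delta(feature, k=2):
          # Regression-style delta over +/- k neighbouring frames.
          padded = np.pad(feature, k, mode="edge")
          num = sum(i * (padded[k + i:len(feature) + k + i]
                         - padded[k - i:len(feature) + k - i])
                    for i in range(1, k + 1))
          return num / (2 * sum(i * i for i in range(1, k + 1)))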

  1. Hidden Markov models in automatic speech recognition

    Science.gov (United States)

    Wrzoskowicz, Adam

    1993-11-01

    This article describes a method for constructing an automatic speech recognition system based on hidden Markov models (HMMs). The author discusses the basic concepts of HMM theory and the application of these models to the analysis and recognition of speech signals. The author provides algorithms which make it possible to train the ASR system and recognize signals on the basis of distinct stochastic models of selected speech sound classes. The author describes the specific components of the system and the procedures used to model and recognize speech. The author discusses problems associated with the choice of optimal signal detection and parameterization characteristics and their effect on the performance of the system. The author presents different options for the choice of speech signal segments and their consequences for the ASR process. The author gives special attention to the use of lexical, syntactic, and semantic information for the purpose of improving the quality and efficiency of the system. The author also describes an ASR system developed by the Speech Acoustics Laboratory of the IBPT PAS. The author discusses the results of experiments on the effect of noise on the performance of the ASR system and describes methods of constructing HMMs designed to operate in a noisy environment. The author also describes a language for human-robot communications, defined as a complex multilevel network built from an HMM model of speech sounds and geared towards Polish inflections. The author also added mandatory lexical and syntactic rules to the system for its communications vocabulary.
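
    For readers who want to experiment, an isolated-word recognizer in the spirit of the article can be sketched with the third-party hmmlearn package (this is not the IBPT PAS system described above): train one GMM-HMM per word and pick the model with the highest log-likelihood.

      # Hedged sketch: isolated-word recognition with per-word GMM-HMMs.
      import numpy as np
      from hmmlearn.hmm import GMMHMM

      def train_word_model(feature_sequences, n_states=5, n_mix=2):
          X = np.vstack(feature_sequences)              # stacked MFCC-like frames
          lengths = [len(s) for s in feature_sequences]
          model = GMMHMM(n_components=n_states, n_mix=n_mix, covariance_type="diag")
          model.fit(X, lengths)
          return model

      def recognize(features, word_models):
          # Pick the word whose HMM assigns the highest log-likelihood.
          return max(word_models, key=lambda w: word_models[w].score(features))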

  2. Robust speaker recognition in noisy environments

    CERN Document Server

    Rao, K Sreenivasa

    2014-01-01

    This book discusses speaker recognition methods that deal with realistic, variable noisy environments. The text covers authentication systems for robust noisy background environments that function in real time and can be incorporated in mobile devices. The book focuses on different approaches to enhance the accuracy of speaker recognition in the presence of varying background environments. The authors examine: (a) feature compensation using multiple background models, (b) feature mapping using data-driven stochastic models, (c) design of a supervector-based GMM-SVM framework for robust speaker recognition, (d) total variability modeling (i-vectors) in a discriminative framework, and (e) a boosting method to fuse evidence from multiple SVM models.

  3. Robust facial expression recognition via compressive sensing.

    Science.gov (United States)

    Zhang, Shiqing; Zhao, Xiaoming; Lei, Bicheng

    2012-01-01

    Recently, compressive sensing (CS) has attracted increasing attention in the areas of signal processing, computer vision and pattern recognition. In this paper, a new method based on the CS theory is presented for robust facial expression recognition. The CS theory is used to construct a sparse representation classifier (SRC). The effectiveness and robustness of the SRC method is investigated on clean and occluded facial expression images. Three typical facial features, i.e., the raw pixels, Gabor wavelets representation and local binary patterns (LBP), are extracted to evaluate the performance of the SRC method. Compared with the nearest neighbor (NN), linear support vector machines (SVM) and the nearest subspace (NS), experimental results on the popular Cohn-Kanade facial expression database demonstrate that the SRC method obtains better performance and stronger robustness to corruption and occlusion on robust facial expression recognition tasks.

  4. Brain-inspired speech segmentation for automatic speech recognition using the speech envelope as a temporal reference

    OpenAIRE

    Byeongwook Lee; Kwang-Hyun Cho

    2016-01-01

    Speech segmentation is a crucial step in automatic speech recognition because additional speech analyses are performed for each framed speech segment. Conventional segmentation techniques primarily segment speech using a fixed frame size for computational simplicity. However, this approach is insufficient for capturing the quasi-regular structure of speech, which causes substantial recognition failure in noisy environments. How does the brain handle quasi-regular structured speech and maintai...

  5. Toddlers' recognition of noise-vocoded speech.

    Science.gov (United States)

    Newman, Rochelle; Chatterjee, Monita

    2013-01-01

    Despite their remarkable clinical success, cochlear-implant listeners today still receive spectrally degraded information. Much research has examined normally hearing adult listeners' ability to interpret spectrally degraded signals, primarily using noise-vocoded speech to simulate cochlear implant processing. Far less research has explored infants' and toddlers' ability to interpret spectrally degraded signals, despite the fact that children in this age range are frequently implanted. This study examines 27-month-old typically developing toddlers' recognition of noise-vocoded speech in a language-guided looking study. Children saw two images on each trial and heard a voice instructing them to look at one item ("Find the cat!"). Full-spectrum sentences or their noise-vocoded versions were presented with varying numbers of spectral channels. Toddlers showed equivalent proportions of looking to the target object with full-speech and 24- or 8-channel noise-vocoded speech; they failed to look appropriately with 2-channel noise-vocoded speech and showed variable performance with 4-channel noise-vocoded speech. Despite accurate looking performance for speech with at least eight channels, children were slower to respond appropriately as the number of channels decreased. These results indicate that 2-yr-olds have developed the ability to interpret vocoded speech, even without practice, but that doing so requires additional processing. These findings have important implications for pediatric cochlear implantation.
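
    Noise-vocoded stimuli like those used in this study are commonly produced by band-pass analysis, envelope extraction, and envelope-modulated noise; the sketch below follows that generic recipe with scipy, with band edges and filter order chosen for illustration rather than taken from the paper.

      # Generic noise vocoder: envelopes of analysis bands modulate noise.
      import numpy as np
      from scipy.signal import butter, sosfiltfilt, hilbert

      def noise_vocode(x, fs, n_channels=8, lo=100.0, hi=7000.0):
          edges = np.geomspace(lo, hi, n_channels + 1)   # log-spaced band edges
          out = np.zeros(len(x))
          for f1, f2 in zip(edges[:-1], edges[1:]):
              sos = butter(4, [f1, f2], btype="band", fs=fs, output="sos")
              band = sosfiltfilt(sos, x)
              env = np.abs(hilbert(band))                # amplitude envelope
              carrier = sosfiltfilt(sos, np.random.randn(len(x)))
              out += env * carrier                       # noise carries the envelope
          return out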

  6. Phonological modeling for continuous speech recognition in Korean

    CERN Document Server

    Lee, W I; Lee, J H; Lee, WonIl; Lee, Geunbae; Lee, Jong-Hyeok

    1996-01-01

    A new scheme to represent phonological changes during continuous speech recognition is suggested. A phonological tag coupled with its morphological tag is designed to represent the conditions of Korean phonological changes. A pairwise language model of these morphological and phonological tags is implemented in Korean speech recognition system. Performance of the model is verified through the TDNN-based speech recognition experiments.

  7. Effects of Cognitive Load on Speech Recognition

    Science.gov (United States)

    Mattys, Sven L.; Wiget, Lukas

    2011-01-01

    The effect of cognitive load (CL) on speech recognition has received little attention despite the prevalence of CL in everyday life, e.g., dual-tasking. To assess the effect of CL on the interaction between lexically-mediated and acoustically-mediated processes, we measured the magnitude of the "Ganong effect" (i.e., lexical bias on phoneme…

  8. Speech recognition employing biologically plausible receptive fields

    DEFF Research Database (Denmark)

    Fereczkowski, Michal; Bothe, Hans-Heinrich

    2011-01-01

    The main idea of the project is to build a widely speaker-independent, biologically motivated automatic speech recognition (ASR) system. The two main differences between our approach and current state-of-the-art ASRs are that i) the features used here are based on the responses of neuronlike spec...

  9. Speech Recognition, Disability, and College Composition

    Science.gov (United States)

    Nelson, Lorna M.; Reynolds, Thomas W., Jr.

    2015-01-01

    This study examined the composing processes of five postsecondary students who used or were learning to use speech recognition software (SR) for college-level writing. The study analyzed their composing processes through observation, interviews, and analysis of written products over a series of composing sessions. This investigation was prompted…

  11. Speech Recognition: A World of Opportunities

    Science.gov (United States)

    PACER Center, 2004

    2004-01-01

    Speech recognition technology helps people with disabilities interact with computers more easily. People with motor limitations, who cannot use a standard keyboard and mouse, can use their voices to navigate the computer and create documents. The technology is also useful to people with learning disabilities who experience difficulty with spelling…

  12. Bimodal Emotion Recognition from Speech and Text

    Directory of Open Access Journals (Sweden)

    Weilin Ye

    2014-01-01

    Full Text Available This paper presents an approach to emotion recognition from speech signals and textual content. In the analysis of speech signals, thirty-seven acoustic features are extracted from the speech input. Two different classifiers, Support Vector Machines (SVMs) and a BP neural network, are adopted to classify the emotional states. In text analysis, we use a two-step classification method to recognize the emotional states. The final emotional state is determined based on the emotion outputs from the acoustic and textual analyses. In this paper we have two parallel classifiers for acoustic information and two serial classifiers for textual information, and a final decision is made by combining these classifiers in decision-level fusion. Experimental results show that the emotion recognition accuracy of the integrated system is better than that of either of the two individual approaches.
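
    The decision-level fusion step can be sketched as a weighted sum of the per-emotion scores from the acoustic and textual classifiers; the weights, label set and scores below are placeholders, not the paper's values.

      # Toy decision-level fusion of acoustic and textual emotion scores.
      import numpy as np

      EMOTIONS = ["anger", "joy", "neutral", "sadness"]   # assumed label set

      def fuse(acoustic_scores, text_scores, w_acoustic=0.6):
          fused = (w_acoustic * np.asarray(acoustic_scores)
                   + (1.0 - w_acoustic) * np.asarray(text_scores))
          return EMOTIONS[int(np.argmax(fused))]

      print(fuse([0.1, 0.5, 0.2, 0.2], [0.2, 0.3, 0.4, 0.1]))   # -> 'joy'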

  13. SPECTRAL METHODS IN POLISH EMOTIONAL SPEECH RECOGNITION

    Directory of Open Access Journals (Sweden)

    Paweł Powroźnik

    2016-12-01

    Full Text Available In this article the issue of emotion recognition based on Polish emotional speech signal analysis is presented. The Polish database of emotional speech, prepared and shared by the Medical Electronics Division of the Lodz University of Technology, was used for the research. The speech signal was processed by Artificial Neural Networks (ANN), whose inputs were information obtained from the signal spectrogram. Experiments were conducted for three different spectrogram divisions. The ANN consists of four layers, but the number of neurons in each layer depends on the spectrogram division. The research focused on six emotional states: a neutral state, sadness, joy, anger, fear and boredom. The average effectiveness of emotion recognition was about 80%.

  15. Phonetic Alphabet for Speech Recognition of Czech

    Directory of Open Access Journals (Sweden)

    J. Uhlir

    1997-12-01

    Full Text Available In the paper we introduce and discuss an alphabet that has been proposed for phonemically oriented automatic speech recognition. The alphabet, denoted as PAC (Phonetic Alphabet for Czech), consists of 48 basic symbols that allow all major events occurring in spoken Czech to be distinguished. The symbols can be used both for phonetic transcription of Czech texts and for labeling recorded speech signals. For practical reasons, the alphabet comes in two versions: one utilizes Czech native characters and the other employs symbols similar to those used for English in the DARPA and NIST alphabets.

  16. Relationship between speech recognition in noise and sparseness.

    Science.gov (United States)

    Li, Guoping; Lutman, Mark E; Wang, Shouyan; Bleeck, Stefan

    2012-02-01

    Established methods for predicting speech recognition in noise require knowledge of clean speech signals, placing limitations on their application. The study evaluates an alternative approach based on characteristics of noisy speech, specifically its sparseness as represented by the statistic kurtosis. Experiments 1 and 2 involved acoustic analysis of vowel-consonant-vowel (VCV) syllables in babble noise, comparing kurtosis, glimpsing areas, and extended speech intelligibility index (ESII) of noisy speech signals with one another and with pre-existing speech recognition scores. Experiment 3 manipulated kurtosis of VCV syllables and investigated effects on speech recognition scores in normal-hearing listeners. Pre-existing speech recognition data for Experiments 1 and 2; seven normal-hearing participants for Experiment 3. Experiments 1 and 2 demonstrated that kurtosis calculated in the time-domain from noisy speech is highly correlated (r > 0.98) with established prediction models: glimpsing and ESII. All three measures predicted speech recognition scores well. The final experiment showed a clear monotonic relationship between speech recognition scores and kurtosis. Speech recognition performance in noise is closely related to the sparseness (kurtosis) of the noisy speech signal, at least for the types of speech and noise used here and for listeners with normal hearing.
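
    The sparseness statistic itself is easy to reproduce: kurtosis of the noisy waveform, computed in the time domain. A minimal sketch with scipy follows; the Gaussian-noise example simply shows the reference value of about 3, against which sparser, more speech-like signals score higher.

      # Time-domain kurtosis as a sparseness measure of noisy speech.
      import numpy as np
      from scipy.stats import kurtosis

      def waveform_kurtosis(x):
          # fisher=False gives "plain" kurtosis (Gaussian = 3).
          return kurtosis(x, fisher=False)

      noise = np.random.randn(16000)        # 1 s of Gaussian noise at 16 kHz
      print(waveform_kurtosis(noise))       # close to 3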

  17. Speech Recognition in 7 Languages

    Science.gov (United States)

    2000-08-01

    The paper presents experiments and results for different approaches to multilingual recognition, namely portation, cross-lingual and simultaneous multilingual recognition (two languages recognized at the same time), for different baseline systems; the best monolingual and cross-lingual recognizers could not always be tested. ... [10] F. Weng, H. Bratt, L. Neumeyer, and A. Stolcke. A Study of Multilingual...

  18. A Dialectal Chinese Speech Recognition Framework

    Institute of Scientific and Technical Information of China (English)

    Jing Li; Thomas Fang Zheng; William Byrne; Dan Jurafsky

    2006-01-01

    A framework for dialectal Chinese speech recognition is proposed and studied, in which a relatively small dialectal Chinese (or in other words Chinese influenced by the native dialect) speech corpus and dialect-related knowledge are adopted to transform a standard Chinese (or Putonghua, abbreviated as PTH) speech recognizer into a dialectal Chinese speech recognizer. Two kinds of knowledge sources are explored: one is expert knowledge and the other is a small dialectal Chinese corpus. These knowledge sources provide information at four levels: phonetic level, lexicon level, language level, and acoustic decoder level. This paper takes Wu dialectal Chinese (WDC) as an example target language. The goal is to establish a WDC speech recognizer from an existing PTH speech recognizer based on the Initial-Final structure of the Chinese language and a study of how dialectal Chinese speakers speak Putonghua. The authors propose to use context-independent PTH-IF mappings (where IF means either a Chinese Initial or a Chinese Final), context-independent WDC-IF mappings, and syllable-dependent WDC-IF mappings (obtained from either experts or data), and combine them with the supervised maximum likelihood linear regression (MLLR) acoustic model adaptation method. To reduce the size of the multi-pronunciation lexicon introduced by the IF mappings, which might also enlarge the lexicon confusion and hence lead to performance degradation, a Multi-Pronunciation Expansion (MPE) method based on the accumulated uni-gram probability (AUP) is proposed. In addition, some commonly used WDC words are selected and added to the lexicon. Compared with the original PTH speech recognizer, the resulting WDC speech recognizer achieves a 10-18% absolute Character Error Rate (CER) reduction when recognizing WDC, with only a 0.62% CER increase when recognizing PTH. The proposed framework and methods are expected to work not only for Wu dialectal Chinese but also for other dialectal Chinese languages and...
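
    The Multi-Pronunciation Expansion idea, pruning pronunciation variants by accumulated unigram probability, might look roughly like the sketch below; the data layout and the 0.9 mass threshold are invented for illustration, since the paper's exact AUP formulation is not reproduced here.

      # Illustrative MPE-style pruning: keep the most probable variants
      # until their accumulated probability mass clears a threshold.
      def prune_lexicon(variants, keep_mass=0.9):
          # variants: {word: [(pronunciation, probability), ...]}
          pruned = {}
          for word, prons in variants.items():
              kept, mass = [], 0.0
              for pron, p in sorted(prons, key=lambda wp: -wp[1]):
                  kept.append(pron)
                  mass += p
                  if mass >= keep_mass:   # enough mass accumulated
                      break
              pruned[word] = kept
          return pruned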

  19. Arabic Speech Recognition System using CMU-Sphinx4

    CERN Document Server

    Satori, H; Chenfour, N

    2007-01-01

    In this paper we present the creation of an Arabic version of an Automated Speech Recognition (ASR) system. The system is based on the open source Sphinx-4 from Carnegie Mellon University, a speech recognition system based on discrete hidden Markov models (HMMs). We investigate the changes that must be made to the model to adapt it to Arabic voice recognition. Keywords: speech recognition, acoustic model, Arabic language, HMMs, CMU Sphinx-4, artificial intelligence.

  20. Indonesian Automatic Speech Recognition For Command Speech Controller Multimedia Player

    Directory of Open Access Journals (Sweden)

    Vivien Arief Wardhany

    2014-12-01

    Full Text Available The purpose of this multimedia-device development is control through voice. Nowadays, voice commands can typically be recognized only in English; to overcome this, recognition is performed using an Indonesian language model, acoustic model and dictionary. The Automatic Speech Recognizer is built using the CMU Sphinx engine, with the English language database modified to Indonesian, and XBMC is used as the multimedia player. The experiment uses 10 volunteers testing items based on 7 commands. The volunteers are classified by gender, 5 male and 5 female. Ten samples are taken for each command, and each volunteer performs 10 test utterances, trying all 7 of the commands provided. Based on the classification table, the word "Kanan" was recognized most often, with 83%, while "Pilih" was the lowest. The word with the most incorrect classifications was "Kembali", with 67%, while "Kanan" had the fewest. The Recognition Rate (RR) results for male speakers show that several commands, such as "Kembali", "Utama", "Atas" and "Bawah", have low Recognition Rates. In particular, "Kembali" could not be recognized at all in the female voices, while in the male voices that command reached only 4% RR; this is because the command has no similar-sounding English word near "kembali", so the system fails to recognize it. Also, the command "Pilih" reached 80% RR with female voices but only 4% with male voices. This is mostly because of the different voice characteristics of adult males and females: males have lower voice frequencies (from 85 to 180 Hz) than women (165 to 255 Hz). The results of the experiment showed that each speaker had a different recognition rate, caused by differences in tone, pronunciation, and speed of speech. Further work needs to be done to improve the accuracy of the Indonesian Automatic Speech Recognition system.

  1. Post-editing through Speech Recognition

    DEFF Research Database (Denmark)

    Mesa-Lao, Bartolomé

    In the past couple of years automatic speech recognition (ASR) software has quietly created a niche for itself in many situations of our lives. Nowadays it can be found at the other end of customer-support hotlines, it is built into operating systems and it is offered as an alternative text-input method for smartphones. On another front, given the significant improvements in Machine Translation (MT) quality and the increasing demand for translations, post-editing of MT is becoming a popular practice in the translation industry, since it has been shown to allow larger volumes of translations to be produced while saving time and costs. The translation industry is at a deeply transformative point in its evolution, and the coming years herald an era of convergence where speech technology could make a difference. As post-editing services become common practice among language service providers and speech...

  2. Speech emotion recognition with unsupervised feature learning

    Institute of Scientific and Technical Information of China (English)

    Zheng-wei HUANG; Wen-tao XUE; Qi-rong MAO

    2015-01-01

    Emotion-based features are critical for achieving high performance in a speech emotion recognition (SER) system. In general, it is difficult to develop these features due to the ambiguity of the ground-truth. In this paper, we apply several unsupervised feature learning algorithms (including K-means clustering, the sparse auto-encoder, and sparse restricted Boltzmann machines), which have promise for learning task-related features by using unlabeled data, to speech emotion recognition. We then evaluate the performance of the proposed approach and present a detailed analysis of the effect of two important factors in the model setup, the content window size and the number of hidden layer nodes. Experimental results show that larger content windows and more hidden nodes contribute to higher performance. We also show that the two-layer network cannot explicitly improve performance compared to a single-layer network.
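
    Of the algorithms listed, K-means feature learning is the simplest to sketch: learn a codebook from unlabeled feature patches, then encode each patch by its distances to the centroids. The patch and codebook sizes below are assumptions, and the paper's auto-encoder and RBM variants are not shown.

      # K-means codebook learning on unlabeled spectrogram patches.
      import numpy as np
      from sklearn.cluster import KMeans

      patches = np.random.rand(1000, 40 * 9)   # e.g. 40 mel bins x 9 frames, flattened
      codebook = KMeans(n_clusters=64, n_init=10).fit(patches)
      features = codebook.transform(patches)   # (1000, 64) distances to centroids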

  3. Multilingual Vocabularies in Automatic Speech Recognition

    Science.gov (United States)

    2000-08-01

    ... monolingual (a few thousand) is an obstacle to a full generalization of the inventories; the approach then moved to the multilingual case ... of multilingual models than the monolingual models, as was specifically observed in the test with Spanish utterances ...

  4. Robust video foreground segmentation and face recognition

    Institute of Scientific and Technical Information of China (English)

    GUAN Ye-peng

    2009-01-01

    Face recognition provides a natural visual interface for human computer interaction (HCI) applications. The process of face recognition, however, is inhibited by variations in the appearance of face images caused by changes in lighting, expression, viewpoint, aging and the introduction of occlusion. Although various algorithms have been presented for face recognition, it remains a very challenging topic. A novel approach to real-time face recognition for HCI is proposed in the paper. In view of the limits of the popular approaches to foreground segmentation, wavelet multi-scale transform based background subtraction is developed to extract foreground objects. The optimal selection of the threshold is automatically determined, which does not require any complex supervised training or manual experimental calibration. A robust real-time face recognition algorithm is presented, which combines the projection matrixes without iteration and kernel Fisher discriminant analysis (KFDA) to overcome some difficulties existing in real face recognition. Superior performance of the proposed algorithm is demonstrated by comparing it with other algorithms through experiments. The proposed algorithm can also be applied to the video image sequences of natural HCI.

  5. Compact Acoustic Models for Embedded Speech Recognition

    Directory of Open Access Journals (Sweden)

    Lévy Christophe

    2009-01-01

    Full Text Available Speech recognition applications are known to require a significant amount of resources. However, embedded speech recognition allows only a few KB of memory, a few MIPS, and a small amount of training data. In order to fit the resource constraints of embedded applications, an approach based on a semicontinuous HMM system using state-independent acoustic modelling is proposed. A transformation is computed and applied to the global model in order to obtain each HMM state-dependent probability density function, so that only the transformation parameters need to be stored. This approach is evaluated on two tasks: digit and voice-command recognition. A fast adaptation technique for the acoustic models is also proposed. In order to significantly reduce computational costs, the adaptation is performed only on the global model (using related speaker recognition adaptation techniques), with no need for state-dependent data. The whole approach results in a relative gain of more than 20% compared to a basic HMM-based system fitting the constraints.

  6. Tone realisation in a Yoruba speech recognition corpus

    CSIR Research Space (South Africa)

    Van Niekerk, D

    2012-05-01

    Full Text Available The authors investigate the acoustic realisation of tone in short continuous utterances in Yoruba. Fundamental frequency contours are extracted for automatically aligned syllables from a speech corpus of 33 speakers collected for speech recognition...

  7. Improved Open-Microphone Speech Recognition

    Science.gov (United States)

    Abrash, Victor

    2002-01-01

    Many current and future NASA missions make extreme demands on mission personnel both in terms of work load and in performing under difficult environmental conditions. In situations where hands are impeded or needed for other tasks, eyes are busy attending to the environment, or tasks are sufficiently complex that ease of use of the interface becomes critical, spoken natural language dialog systems offer unique input and output modalities that can improve efficiency and safety. They also offer new capabilities that would not otherwise be available. For example, many NASA applications require astronauts to use computers in micro-gravity or while wearing space suits. Under these circumstances, command and control systems that allow users to issue commands or enter data in hands-and eyes-busy situations become critical. Speech recognition technology designed for current commercial applications limits the performance of the open-ended state-of-the-art dialog systems being developed at NASA. For example, today's recognition systems typically listen to user input only during short segments of the dialog, and user input outside of these short time windows is lost. Mistakes detecting the start and end times of user utterances can lead to mistakes in the recognition output, and the dialog system as a whole has no way to recover from this, or any other, recognition error. Systems also often require the user to signal when that user is going to speak, which is impractical in a hands-free environment, or only allow a system-initiated dialog requiring the user to speak immediately following a system prompt. In this project, SRI has developed software to enable speech recognition in a hands-free, open-microphone environment, eliminating the need for a push-to-talk button or other signaling mechanism. The software continuously captures a user's speech and makes it available to one or more recognizers. By constantly monitoring and storing the audio stream, it provides the spoken
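
    The continuous-capture idea, a constantly monitored audio stream from which recognizers can read, can be approximated with a ring buffer; the sketch below uses the third-party sounddevice package and is an illustration, not SRI's actual software.

      # Hedged sketch: keep the last 10 s of audio available to a recognizer.
      import collections
      import numpy as np
      import sounddevice as sd

      FS = 16000
      ring = collections.deque(maxlen=FS * 10)     # last 10 s of mono samples

      def callback(indata, frames, time, status):
          ring.extend(indata[:, 0])                # channel 0 into the buffer

      with sd.InputStream(samplerate=FS, channels=1, callback=callback):
          sd.sleep(2000)                           # capture for 2 s
      audio = np.array(ring)                       # hand this to a recognizer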

  8. Speech recognition in natural background noise.

    Directory of Open Access Journals (Sweden)

    Julien Meyer

    Full Text Available In the real world, human speech recognition nearly always involves listening in background noise. The impact of such noise on speech signals and on intelligibility performance increases with the separation of the listener from the speaker. The present behavioral experiment provides an overview of the effects of such acoustic disturbances on speech perception in conditions approaching ecologically valid contexts. We analysed the intelligibility loss in spoken word lists with increasing listener-to-speaker distance in a typical low-level natural background noise. The noise was combined with the simple spherical amplitude attenuation due to distance, basically changing the signal-to-noise ratio (SNR). Therefore, our study draws attention to some of the most basic environmental constraints that have pervaded spoken communication throughout human history. We evaluated the ability of native French participants to recognize French monosyllabic words (spoken at 65.3 dB(A), reference at 1 meter) at distances between 11 and 33 meters, which corresponded to the SNRs most revealing of the progressive effect of the selected natural noise (-8.8 dB to -18.4 dB). Our results showed that in such conditions the identity of vowels is mostly preserved, with the striking peculiarity of an absence of vowel confusions. The results also confirmed the functional role of consonants during lexical identification. The extensive analysis of recognition scores, confusion patterns and associated acoustic cues revealed that sonorant, sibilant and burst properties were the most important parameters influencing phoneme recognition. Altogether these analyses allowed us to extract a resistance scale from consonant recognition scores. We also identified specific perceptual consonant confusion groups depending on the position in the word (onset vs. coda). Finally, our data suggested that listeners may access some acoustic cues of the CV transition, opening interesting perspectives for...
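
    The distance-to-SNR arithmetic behind this design is simple spherical attenuation: the level falls by 20*log10(d) dB relative to the 1 m reference. The sketch below reproduces the reported SNR range under the assumption (not stated in the abstract) that the background noise sat near 53.3 dB(A).

      # Spherical attenuation: SNR at distance d for 65.3 dB(A) speech at 1 m.
      import numpy as np

      def snr_at_distance(d_m, speech_db_1m=65.3, noise_db=53.3):
          # noise_db is an assumed fixed background level.
          return speech_db_1m - 20 * np.log10(d_m) - noise_db

      for d in (11, 22, 33):
          print(d, round(snr_at_distance(d), 1))   # about -8.8, -14.8, -18.4 dB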

  9. Multi-thread Parallel Speech Recognition for Mobile Applications

    Directory of Open Access Journals (Sweden)

    LOJKA Martin

    2014-05-01

    Full Text Available In this paper, a server-based solution for a multi-thread large-vocabulary automatic speech recognition engine is described, along with practical Android OS and HTML5 application examples. The basic idea was to make speech recognition available for a full variety of applications for computers and especially for mobile devices. The speech recognition engine should be independent of commercial products and services (where the dictionary cannot be modified). Using third-party services can also pose security and privacy problems in specific applications, where unsecured audio data must not be sent to uncontrolled environments (voice data transferred to servers around the globe). Drawing on our experience with speech recognition applications, we have constructed a multi-thread, server-based speech recognition solution with a simple application programming interface (API) to a speech recognition engine that can be modified to the specific needs of a particular application.

  10. Cross-Word Modeling for Arabic Speech Recognition

    CERN Document Server

    AbuZeina, Dia

    2012-01-01

    "Cross-Word Modeling for Arabic Speech Recognition" utilizes phonological rules in order to model the cross-word problem, a merging of adjacent words in speech caused by continuous speech, to enhance the performance of continuous speech recognition systems. The author aims to provide an understanding of the cross-word problem and how it can be avoided, specifically focusing on Arabic phonology using an HHM-based classifier.

  11. Microphone Array Design Measures for Hands-Free Speech Recognition

    OpenAIRE

    井上 雅晶; 山田 武志; 中村, 哲; 鹿野 清宏

    1998-01-01

    One of the key technologies for a natural man-machine interface is hands-free speech recognition. The performance of hands-free distant-talking speech recognition will be seriously degraded by noise and reverberation in real environments. A microphone array is applied to solve the problem. When applying a microphone array to speech recognition, parameters such as the number of microphone elements and their spacing interval affect the performance. In order to optimize these parameters, a measure wh...

  12. Factors influencing recognition of interrupted speech.

    Science.gov (United States)

    Wang, Xin; Humes, Larry E

    2010-10-01

    This study examined the effect of interruption parameters (e.g., interruption rate, on-duration and proportion), linguistic factors, and other general factors, on the recognition of interrupted consonant-vowel-consonant (CVC) words in quiet. Sixty-two young adults with normal hearing were randomly assigned to one of three test groups, "male65," "female65" and "male85," that differed in talker (male/female) and presentation level (65/85 dB SPL), with about 20 subjects per group. A total of 13 stimulus conditions, representing different interruption patterns within the words (i.e., various combinations of three interruption parameters), in combination with two values (easy and hard) of lexical difficulty were examined (i.e., 13×2=26 test conditions) within each group. Results showed that, overall, the proportion of speech and lexical difficulty had major effects on the integration and recognition of interrupted CVC words, while the other variables had small effects. Interactions between interruption parameters and linguistic factors were observed: to reach the same degree of word-recognition performance, less acoustic information was required for lexically easy words than for hard words. Implications of the findings of the current study for models of the temporal integration of speech are discussed.
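
    For readers unfamiliar with the stimulus manipulation, periodic interruption is fully determined by two of the three parameters (rate, on-duration, proportion), since on-duration = proportion / rate. A toy sketch of such gating, not the study's stimulus-generation code:

```python
import numpy as np

def interrupt_speech(signal: np.ndarray, fs: int, rate_hz: float, proportion: float) -> np.ndarray:
    # Periodic gating: rate_hz interruption cycles per second; within each cycle the
    # first `proportion` fraction is kept (on-duration = proportion / rate_hz seconds).
    t = np.arange(len(signal)) / fs
    phase = (t * rate_hz) % 1.0
    gate = (phase < proportion).astype(signal.dtype)
    return signal * gate

fs = 16000
word = np.random.randn(fs)  # stand-in for a 1-second recording of a CVC word
gated = interrupt_speech(word, fs, rate_hz=4.0, proportion=0.5)  # keep 50% of the speech
```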

  13. Automatic Phonetic Transcription for Danish Speech Recognition

    DEFF Research Database (Denmark)

    Kirkedal, Andreas Søeborg

    Automatic speech recognition (ASR) uses dictionaries that map orthographic words to their phonetic representation. To minimize the occurrence of out-of-vocabulary words, ASR requires large phonetic dictionaries to model pronunciation. Hand-crafted high-quality phonetic dictionaries are difficult......, like Danish, the graphemic and phonetic representations are very dissimilar and more complex rewriting rules must be applied to create the correct phonetic representation. Automatic phonetic transcribers use different strategies, from deep analysis to shallow rewriting rules, to produce phonetic......, syllabication, stød and several other suprasegmental features (Kirkedal, 2013). Simplifying the transcriptions by filtering out the symbols for suprasegmental features in a post-processing step produces a format that is suitable for ASR purposes. eSpeak is an open source speech synthesizer originally created...

  14. From Birdsong to Human Speech Recognition: Bayesian Inference on a Hierarchy of Nonlinear Dynamical Systems

    Science.gov (United States)

    Yildiz, Izzet B.; von Kriegstein, Katharina; Kiebel, Stefan J.

    2013-01-01

    Our knowledge about the computational mechanisms underlying human learning and recognition of sound sequences, especially speech, is still very limited. One difficulty in deciphering the exact means by which humans recognize speech is that there are scarce experimental findings at a neuronal, microscopic level. Here, we show that our neuronal-computational understanding of speech learning and recognition may be vastly improved by looking at an animal model, i.e., the songbird, which faces the same challenge as humans: to learn and decode complex auditory input, in an online fashion. Motivated by striking similarities between the human and songbird neural recognition systems at the macroscopic level, we assumed that the human brain uses the same computational principles at a microscopic level and translated a birdsong model into a novel human sound learning and recognition model with an emphasis on speech. We show that the resulting Bayesian model with a hierarchy of nonlinear dynamical systems can learn speech samples such as words rapidly and recognize them robustly, even in adverse conditions. In addition, we show that recognition can be performed even when words are spoken by different speakers and with different accents—an everyday situation in which current state-of-the-art speech recognition models often fail. The model can also be used to qualitatively explain behavioral data on human speech learning and derive predictions for future experiments. PMID:24068902

  15. Speech Recognition Technology Applied to Intelligent Mobile Navigation System

    Institute of Scientific and Technical Information of China (English)

    2002-01-01

    The capability of human-computer interaction reflects the degree of intelligence of a mobile navigation system. In this paper, the navigation data and functions of a mobile navigation system are divided into system commands and non-system commands, and a group of speech commands is then abstracted. We apply speech recognition technology to an intelligent mobile navigation system to process speech commands, and investigate the integration of speech recognition technology with mobile navigation systems. Navigation operations can be performed by speech commands, which makes human-computer interaction easy during navigation. The speech command interface of the navigation system is implemented with Dutty++ software, which is based on the IBM ViaVoice speech recognition system. Navigation experiments showed that navigation can be performed almost without the keyboard, demonstrating that human-computer interaction via speech commands is very convenient and reliable.

  16. Combined Hand Gesture — Speech Model for Human Action Recognition

    Directory of Open Access Journals (Sweden)

    Sheng-Tzong Cheng

    2013-12-01

    Full Text Available This study proposes a dynamic hand gesture detection technology to effectively detect dynamic hand gesture areas, and a hand gesture recognition technology to improve the dynamic hand gesture recognition rate. Meanwhile, the corresponding relationship between state sequences in the hand gesture and speech models is considered by integrating speech recognition technology with a multimodal model, thus improving the accuracy of human behavior recognition. The experimental results showed that the proposed method effectively improves human behavior recognition accuracy and demonstrated the feasibility of system applications. The results also verified that the multimodal gesture-speech model provides superior accuracy compared to the single-modal versions.

  17. Combined hand gesture--speech model for human action recognition.

    Science.gov (United States)

    Cheng, Sheng-Tzong; Hsu, Chih-Wei; Li, Jian-Pan

    2013-12-12

    This study proposes a dynamic hand gesture detection technology to effectively detect dynamic hand gesture areas, and a hand gesture recognition technology to improve the dynamic hand gesture recognition rate. Meanwhile, the corresponding relationship between state sequences in the hand gesture and speech models is considered by integrating speech recognition technology with a multimodal model, thus improving the accuracy of human behavior recognition. The experimental results showed that the proposed method effectively improves human behavior recognition accuracy and demonstrated the feasibility of system applications. The results also verified that the multimodal gesture-speech model provides superior accuracy compared to the single-modal versions.

  18. Automatic Speech Recognition from Neural Signals: A Focused Review

    Directory of Open Access Journals (Sweden)

    Christian Herff

    2016-09-01

    Full Text Available Speech interfaces have become widely accepted and are nowadays integrated in various real-life applications and devices. They have become a part of our daily life. However, speech interfaces presume the ability to produce intelligible speech, which might be impossible due to loud environments, the need not to disturb bystanders, or an inability to produce speech (e.g., patients suffering from locked-in syndrome). For these reasons it would be highly desirable not to speak but to simply envision oneself saying words or sentences. Interfaces based on imagined speech would enable fast and natural communication without the need for audible speech and would give a voice to otherwise mute people. This focused review analyzes the potential of different brain imaging techniques to recognize speech from neural signals by applying Automatic Speech Recognition technology. We argue that modalities based on metabolic processes, such as functional Near Infrared Spectroscopy and functional Magnetic Resonance Imaging, are less suited for Automatic Speech Recognition from neural signals due to their low temporal resolution, but are very useful for the investigation of the underlying neural mechanisms involved in speech processes. In contrast, electrophysiologic activity is fast enough to capture speech processes and is therefore better suited for ASR. Our experimental results indicate the potential of these signals for speech recognition from neural data, with a focus on invasively measured brain activity (electrocorticography). As a first example of Automatic Speech Recognition techniques used with neural signals, we discuss the Brain-to-text system.

  19. Recognition of time-compressed speech does not predict recognition of natural fast-rate speech by older listeners.

    Science.gov (United States)

    Gordon-Salant, Sandra; Zion, Danielle J; Espy-Wilson, Carol

    2014-10-01

    This study investigated whether recognition of time-compressed speech predicts recognition of natural fast-rate speech, and whether this relationship is influenced by listener age. High and low context sentences were presented to younger and older normal-hearing adults at a normal speech rate, naturally fast speech rate, and fast rate implemented by time compressing the normal-rate sentences. Recognition of time-compressed sentences over-estimated recognition of natural fast sentences for both groups, especially for older listeners. The findings suggest that older listeners are at a much greater disadvantage when listening to natural fast speech than would be predicted by recognition performance for time-compressed speech.

  20. Transforming Binary Uncertainties for Robust Speech Recognition

    Science.gov (United States)

    2006-08-01

  1. Robust Bioinformatics Recognition with VLSI Biochip Microsystem

    Science.gov (United States)

    Lue, Jaw-Chyng L.; Fang, Wai-Chi

    2006-01-01

    A microsystem architecture for real-time, on-site, robust bioinformatic pattern recognition and analysis has been proposed. This system is compatible with on-chip DNA analysis means such as polymerase chain reaction (PCR) amplification. A corresponding novel artificial neural network (ANN) learning algorithm, using a new sigmoid-logarithmic transfer function based on the error backpropagation (EBP) algorithm, is introduced. Our results show that the trained new ANN can recognize low-fluorescence patterns better than the conventional sigmoidal ANN does. A differential logarithmic imaging chip is designed for calculating the logarithm of the relative intensities of fluorescence signals. The single-rail logarithmic circuit and a prototype ANN chip are designed, fabricated and characterized.

  2. Visual Speech Primes Open-Set Recognition of Spoken Words

    Science.gov (United States)

    Buchwald, Adam B.; Winters, Stephen J.; Pisoni, David B.

    2009-01-01

    Visual speech perception has become a topic of considerable interest to speech researchers. Previous research has demonstrated that perceivers neurally encode and use speech information from the visual modality, and this information has been found to facilitate spoken word recognition in tasks such as lexical decision (Kim, Davis, & Krins,…

  3. Speech and audio processing for coding, enhancement and recognition

    CERN Document Server

    Togneri, Roberto; Narasimha, Madihally

    2015-01-01

    This book describes the basic principles underlying the generation, coding, transmission and enhancement of speech and audio signals, including advanced statistical and machine learning techniques for speech and speaker recognition, with an overview of the key innovations in these areas. Key research undertaken in speech coding, speech enhancement, speech recognition, emotion recognition and speaker diarization is also presented, along with recent advances and new paradigms in these areas. · Offers readers a single-source reference on the significant applications of speech and audio processing to speech coding, speech enhancement and speech/speaker recognition; · Enables readers involved in algorithm development and implementation issues for speech coding to understand the historical development and future challenges in speech coding research; · Discusses speech coding methods yielding bit-streams that are multi-rate and scalable for Voice-over-IP (VoIP) networks; …

  4. Mispronunciation Detection for Language Learning and Speech Recognition Adaptation

    Science.gov (United States)

    Ge, Zhenhao

    2013-01-01

    The areas of "mispronunciation detection" (or "accent detection" more specifically) within the speech recognition community are receiving increased attention now. Two application areas, namely language learning and speech recognition adaptation, are largely driving this research interest and are the focal points of this work.…

  5. Training Implications of Airborne Applications of Automated Speech Recognition Technology.

    Science.gov (United States)

    1980-10-01

  6. Deep Complementary Bottleneck Features for Visual Speech Recognition

    NARCIS (Netherlands)

    Petridis, Stavros; Pantic, Maja

    2016-01-01

    Deep bottleneck features (DBNFs) have been used successfully in the past for acoustic speech recognition from audio. However, research on extracting DBNFs for visual speech recognition is very limited. In this work, we present an approach to extract deep bottleneck visual features based on deep auto

  7. Mispronunciation Detection for Language Learning and Speech Recognition Adaptation

    Science.gov (United States)

    Ge, Zhenhao

    2013-01-01

    The areas of "mispronunciation detection" (or "accent detection" more specifically) within the speech recognition community are receiving increased attention now. Two application areas, namely language learning and speech recognition adaptation, are largely driving this research interest and are the focal points of this work.…

  8. Deep Complementary Bottleneck Features for Visual Speech Recognition

    NARCIS (Netherlands)

    Petridis, Stavros; Pantic, Maja

    Deep bottleneck features (DBNFs) have been used successfully in the past for acoustic speech recognition from audio. However, research on extracting DBNFs for visual speech recognition is very limited. In this work, we present an approach to extract deep bottleneck visual features based on deep

  9. Ensemble Feature Extraction Modules for Improved Hindi Speech Recognition System

    Directory of Open Access Journals (Sweden)

    Malay Kumar

    2012-05-01

    Full Text Available Speech is the most natural way of communication between human beings. The field of speech recognition has generated interest in man-machine conversation, and due to its versatile applications, automatic speech recognition systems have been designed. In this paper we present a novel approach to Hindi speech recognition that ensembles the feature extraction modules of ASR systems and combines their outputs using the voting technique ROVER. Experimental results show that the proposed system produces better results than traditional ASR systems.

  10. Towards Contactless Silent Speech Recognition Based on Detection of Active and Visible Articulators Using IR-UWB Radar.

    Science.gov (United States)

    Shin, Young Hoon; Seo, Jiwon

    2016-10-29

    People with hearing or speaking disabilities are deprived of the benefits of conventional speech recognition technology because it is based on acoustic signals. Recent research has focused on silent speech recognition systems that are based on the motions of a speaker's vocal tract and articulators. Because most silent speech recognition systems use contact sensors that are very inconvenient to users or optical systems that are susceptible to environmental interference, a contactless and robust solution is hence required. Toward this objective, this paper presents a series of signal processing algorithms for a contactless silent speech recognition system using an impulse radio ultra-wide band (IR-UWB) radar. The IR-UWB radar is used to remotely and wirelessly detect motions of the lips and jaw. In order to extract the necessary features of lip and jaw motions from the received radar signals, we propose a feature extraction algorithm. The proposed algorithm noticeably improved speech recognition performance compared to the existing algorithm during our word recognition test with five speakers. We also propose a speech activity detection algorithm to automatically select speech segments from continuous input signals. Thus, speech recognition processing is performed only when speech segments are detected. Our testbed consists of commercial off-the-shelf radar products, and the proposed algorithms are readily applicable without designing specialized radar hardware for silent speech processing.

  11. Towards Contactless Silent Speech Recognition Based on Detection of Active and Visible Articulators Using IR-UWB Radar

    Directory of Open Access Journals (Sweden)

    Young Hoon Shin

    2016-10-01

    Full Text Available People with hearing or speaking disabilities are deprived of the benefits of conventional speech recognition technology because it is based on acoustic signals. Recent research has focused on silent speech recognition systems that are based on the motions of a speaker’s vocal tract and articulators. Because most silent speech recognition systems use contact sensors that are very inconvenient to users or optical systems that are susceptible to environmental interference, a contactless and robust solution is hence required. Toward this objective, this paper presents a series of signal processing algorithms for a contactless silent speech recognition system using an impulse radio ultra-wide band (IR-UWB radar. The IR-UWB radar is used to remotely and wirelessly detect motions of the lips and jaw. In order to extract the necessary features of lip and jaw motions from the received radar signals, we propose a feature extraction algorithm. The proposed algorithm noticeably improved speech recognition performance compared to the existing algorithm during our word recognition test with five speakers. We also propose a speech activity detection algorithm to automatically select speech segments from continuous input signals. Thus, speech recognition processing is performed only when speech segments are detected. Our testbed consists of commercial off-the-shelf radar products, and the proposed algorithms are readily applicable without designing specialized radar hardware for silent speech processing.
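
    The paper's radar-domain speech activity detector is not spelled out in this record; the sketch below is only a generic energy-threshold stand-in to make the "process only detected speech segments" idea concrete. The frame length and threshold are arbitrary assumptions:

```python
import numpy as np

def detect_speech_segments(signal: np.ndarray, frame_len: int, thresh: float) -> list:
    # Generic energy-based activity detection: return (start, end) frame-index pairs
    # for runs of frames whose short-time energy exceeds `thresh`.
    n = len(signal) // frame_len
    energy = (signal[:n * frame_len].reshape(n, frame_len) ** 2).mean(axis=1)
    active = energy > thresh
    segments, start = [], None
    for i, on in enumerate(active):
        if on and start is None:
            start = i
        elif not on and start is not None:
            segments.append((start, i))
            start = None
    if start is not None:
        segments.append((start, len(active)))
    return segments   # recognition would run only on these segments
```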

  12. Joint variable frame rate and length analysis for speech recognition under adverse conditions

    DEFF Research Database (Denmark)

    Tan, Zheng-Hua; Kraljevski, Ivan

    2014-01-01

    This paper presents a method that combines variable frame length and rate analysis for speech recognition in noisy environments, together with an investigation of the effect of different frame lengths on speech recognition performance. The method adopts frame selection using an a posteriori signal-to-noise ratio (SNR) weighted energy distance and increases the length of the selected frames according to the number of non-selected preceding frames. It assigns a higher frame rate and a normal frame length to a rapidly changing and high SNR region of a speech signal, and a lower frame rate and an increased frame length to a steady or low SNR region. The speech recognition results show that the proposed variable frame rate and length method outperforms fixed frame rate and length analysis, as well as standalone variable frame rate analysis, in terms of noise-robustness.
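
    A simplified reading of the frame-selection step, assuming the SNR-weighted energy distance is accumulated between consecutive frames and a frame is kept whenever the accumulator crosses a threshold; the paper's exact weighting and the frame-lengthening step are not reproduced here:

```python
import numpy as np

def select_frames(frames: np.ndarray, noise_power: float, threshold: float) -> list:
    # Toy SNR-weighted energy-distance frame selection over frames of shape
    # (n_frames, frame_len): high-SNR, fast-changing regions are sampled densely.
    energies = (frames ** 2).sum(axis=1) + 1e-12
    snr = energies / noise_power                   # crude a posteriori SNR estimate
    weights = snr / (1.0 + snr)                    # emphasize high-SNR regions
    selected, acc = [0], 0.0
    for i in range(1, len(frames)):
        acc += weights[i] * abs(np.log(energies[i]) - np.log(energies[i - 1]))
        if acc >= threshold:
            selected.append(i)
            acc = 0.0
    return selected
```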

  13. Method and apparatus for obtaining complete speech signals for speech recognition applications

    Science.gov (United States)

    Abrash, Victor (Inventor); Cesari, Federico (Inventor); Franco, Horacio (Inventor); George, Christopher (Inventor); Zheng, Jing (Inventor)

    2009-01-01

    The present invention relates to a method and apparatus for obtaining complete speech signals for speech recognition applications. In one embodiment, the method continuously records an audio stream comprising a sequence of frames to a circular buffer. When a user command to commence or terminate speech recognition is received, the method obtains a number of frames of the audio stream occurring before or after the user command in order to identify an augmented audio signal for speech recognition processing. In further embodiments, the method analyzes the augmented audio signal in order to locate starting and ending speech endpoints that bound at least a portion of speech to be processed for recognition. At least one of the speech endpoints is located using a Hidden Markov Model.
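
    The core of the patent's idea is a circular (ring) buffer that always holds the most recent audio, so frames preceding the user command can be recovered. A minimal sketch of that mechanism, not the patented implementation; the frame type and capacity are placeholders:

```python
from collections import deque

class FrameRingBuffer:
    """Continuously records audio frames; on a start-recognition command,
    returns the last `lookback` frames so speech onsets occurring before
    the command are not lost."""

    def __init__(self, capacity_frames: int):
        self.buf = deque(maxlen=capacity_frames)

    def push(self, frame: bytes) -> None:
        self.buf.append(frame)             # oldest frame is dropped automatically

    def augmented_signal(self, lookback: int) -> list:
        return list(self.buf)[-lookback:]  # frames from before the user command
```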

  14. Introduction to Arabic Speech Recognition Using CMUSphinx System

    CERN Document Server

    Satori, H; Chenfour, N

    2007-01-01

    In this paper, Arabic is investigated from the speech recognition point of view. We propose a novel approach to building an Arabic Automatic Speech Recognition (ASR) system. This system is based on the open source CMU Sphinx-4 from Carnegie Mellon University, a large-vocabulary, speaker-independent, continuous speech recognition system based on discrete Hidden Markov Models (HMMs). We build a model using utilities from the open source CMU Sphinx and demonstrate the possible adaptability of this system to Arabic voice recognition.

  15. Recognition memory in noise for speech of varying intelligibility.

    Science.gov (United States)

    Gilbert, Rachael C; Chandrasekaran, Bharath; Smiljanic, Rajka

    2014-01-01

    This study investigated the extent to which noise impacts normal-hearing young adults' speech processing of sentences that vary in intelligibility. Intelligibility and recognition memory in noise were examined for conversational and clear speech sentences recorded in quiet (quiet speech, QS) and in response to the environmental noise (noise-adapted speech, NAS). Results showed that (1) increased intelligibility through conversational-to-clear speech modifications led to improved recognition memory and (2) NAS presented a more naturalistic speech adaptation to noise compared to QS, leading to more accurate word recognition and enhanced sentence recognition memory. These results demonstrate that acoustic-phonetic modifications implemented in listener-oriented speech enhance speech-in-noise processing beyond word recognition. Effortful speech processing in challenging listening environments can thus be improved by speaking style adaptations on the part of the talker. In addition to enhanced intelligibility, a substantial improvement in recognition memory can be achieved through speaker adaptations to the environment and to the listener when in adverse conditions.

  16. Speech recognition using articulatory and excitation source features

    CERN Document Server

    Rao, K Sreenivasa

    2017-01-01

    This book discusses the contribution of articulatory and excitation source information in discriminating sound units. The authors focus on excitation source component of speech -- and the dynamics of various articulators during speech production -- for enhancement of speech recognition (SR) performance. Speech recognition is analyzed for read, extempore, and conversation modes of speech. Five groups of articulatory features (AFs) are explored for speech recognition, in addition to conventional spectral features. Each chapter provides the motivation for exploring the specific feature for SR task, discusses the methods to extract those features, and finally suggests appropriate models to capture the sound unit specific knowledge from the proposed features. The authors close by discussing various combinations of spectral, articulatory and source features, and the desired models to enhance the performance of SR systems.

  17. Automatic Emotion Recognition in Speech: Possibilities and Significance

    Directory of Open Access Journals (Sweden)

    Milana Bojanić

    2009-12-01

    Full Text Available Automatic speech recognition and spoken language understanding are crucial steps towards natural human-machine interaction. The main task of the speech communication process is the recognition of the word sequence, but the recognition of prosody, emotion and stress tags may be of particular importance as well. This paper discusses the possibilities of recognizing emotion from the speech signal in order to improve ASR, and provides an analysis of acoustic features that can be used for the detection of a speaker's emotion and stress. The paper also provides a short overview of emotion and stress classification techniques. The importance and place of emotional speech recognition is shown in the domain of human-computer interactive systems and the transaction communication model. Directions for future work are given at the end of this work.

  18. Speech recognition algorithms based on weighted finite-state transducers

    CERN Document Server

    Hori, Takaaki

    2013-01-01

    This book introduces the theory, algorithms, and implementation techniques for efficient decoding in speech recognition mainly focusing on the Weighted Finite-State Transducer (WFST) approach. The decoding process for speech recognition is viewed as a search problem whose goal is to find a sequence of words that best matches an input speech signal. Since this process becomes computationally more expensive as the system vocabulary size increases, research has long been devoted to reducing the computational cost. Recently, the WFST approach has become an important state-of-the-art speech recogni

  19. Confidence and rejection in automatic speech recognition

    Science.gov (United States)

    Colton, Larry Don

    Automatic speech recognition (ASR) is performed imperfectly by computers. For some designated part (e.g., word or phrase) of the ASR output, rejection is deciding (yes or no) whether it is correct, and confidence is the probability (0.0 to 1.0) of it being correct. This thesis presents new methods of rejecting errors and estimating confidence for telephone speech. These are also called word or utterance verification and can be used in wordspotting or voice-response systems. Open-set or out-of-vocabulary situations are a primary focus. Language models are not considered. In vocabulary-dependent rejection all words in the target vocabulary are known in advance and a strategy can be developed for confirming each word. A word-specific artificial neural network (ANN) is shown to discriminate well, and scores from such ANNs are shown on a closed-set recognition task to reorder the N-best hypothesis list (N=3) for improved recognition performance. Segment-based duration and perceptual linear prediction (PLP) features are shown to perform well for such ANNs. The majority of the thesis concerns vocabulary- and task-independent confidence and rejection based on phonetic word models. These can be computed for words even when no training examples of those words have been seen. New techniques are developed using phoneme ranks instead of probabilities in each frame. These are shown to perform as well as the best other methods examined despite the data reduction involved. Certain new weighted averaging schemes are studied but found to give no performance benefit. Hierarchical averaging is shown to improve performance significantly: frame scores combine to make segment (phoneme state) scores, which combine to make phoneme scores, which combine to make word scores. Use of intermediate syllable scores is shown to not affect performance. Normalizing frame scores by an average of the top probabilities in each frame is shown to improve performance significantly. Perplexity of the wrong
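
    The hierarchical averaging described above (frames to states, states to phonemes, phonemes to words) can be written in a few lines. A sketch assuming scores are already available as per-frame, per-state numbers; the thesis's frame-score normalizations are omitted:

```python
def word_confidence(phonemes: list) -> float:
    # `phonemes` is a list of phonemes; each phoneme is a list of states (segments);
    # each state is a list of per-frame scores. Averages are taken level by level:
    # frames -> state scores -> phoneme scores -> one word score.
    def mean(xs):
        return sum(xs) / len(xs)
    phoneme_scores = []
    for phoneme in phonemes:
        state_scores = [mean(frames) for frames in phoneme]   # frames -> state
        phoneme_scores.append(mean(state_scores))             # states -> phoneme
    return mean(phoneme_scores)                               # phonemes -> word

# Example: a two-phoneme word, each phoneme with two states of frame scores.
print(word_confidence([[[0.9, 0.8], [0.7]], [[0.6, 0.5], [0.8, 0.9]]]))
```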

  20. Continuous speech recognition based on convolutional neural network

    Science.gov (United States)

    Zhang, Qing-qing; Liu, Yong; Pan, Jie-lin; Yan, Yong-hong

    2015-07-01

    Convolutional Neural Networks (CNNs), which showed success in achieving translation invariance for many image processing tasks, are investigated for continuous speech recognitions in the paper. Compared to Deep Neural Networks (DNNs), which have been proven to be successful in many speech recognition tasks nowadays, CNNs can reduce the NN model sizes significantly, and at the same time achieve even better recognition accuracies. Experiments on standard speech corpus TIMIT showed that CNNs outperformed DNNs in the term of the accuracy when CNNs had even smaller model size.

  1. Novel Extended Phonemic Set for Mandarin Continuous Speech Recognition

    Institute of Scientific and Technical Information of China (English)

    谢湘; 匡镜明

    2003-01-01

    An extended phonemic set of mandarin from the view of speech recognition is proposed. This set absorbs most principles of some other existing phonemic sets for mandarin, like Worldbet and SAMPA-C, and also takes advantage of some practical experiences from speech recognition research for increasing the discriminability between word models. And the experiments in speaker independent continuous speech recognition show that hidden Markov models defined by this phonemic set have a better performance than those based on initial/final units of mandarin and have a very compact size.

  2. An articulatorily constrained, maximum entropy approach to speech recognition and speech coding

    Energy Technology Data Exchange (ETDEWEB)

    Hogden, J.

    1996-12-31

    Hidden Markov models (HMMs) are among the most popular tools for performing computer speech recognition. One of the primary reasons that HMMs typically outperform other speech recognition techniques is that the parameters used for recognition are determined by the data, not by preconceived notions of what the parameters should be. This makes HMMs better able to deal with intra- and inter-speaker variability despite the limited knowledge of how speech signals vary and despite the often limited ability to correctly formulate rules describing variability and invariance in speech. In fact, it is often the case that when HMM parameter values are constrained using the limited knowledge of speech, recognition performance decreases. However, the structure of an HMM has little in common with the mechanisms underlying speech production. Here, the author argues that by using probabilistic models that more accurately embody the process of speech production, he can create models that have all the advantages of HMMs, but that should more accurately capture the statistical properties of real speech samples--presumably leading to more accurate speech recognition. The model he will discuss uses the fact that speech articulators move smoothly and continuously. Before discussing how to use articulatory constraints, he will give a brief description of HMMs. This will allow him to highlight the similarities and differences between HMMs and the proposed technique.
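
    As a reference point for the HMM background the author mentions, the sketch below is the standard log-domain forward algorithm for scoring an observation sequence; it is textbook material, not code from the report:

```python
import numpy as np

def forward_loglik(obs_logp: np.ndarray, log_trans: np.ndarray, log_init: np.ndarray) -> float:
    # obs_logp[t, s] = log P(o_t | state s); log_trans[s_prev, s_cur]; log_init[s].
    # Returns log P(o_1..o_T) under the HMM via the forward recursion.
    alpha = log_init + obs_logp[0]
    for t in range(1, len(obs_logp)):
        # log-sum-exp over previous states for each current state
        alpha = obs_logp[t] + np.logaddexp.reduce(alpha[:, None] + log_trans, axis=0)
    return float(np.logaddexp.reduce(alpha))
```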

  3. Tackling speaking mode varieties in EMG-based speech recognition.

    Science.gov (United States)

    Wand, Michael; Janke, Matthias; Schultz, Tanja

    2014-10-01

    An electromyographic (EMG) silent speech recognizer is a system that recognizes speech by capturing the electric potentials of the human articulatory muscles, thus enabling the user to communicate silently. After having established a baseline EMG-based continuous speech recognizer, in this paper, we investigate speaking mode variations, i.e., discrepancies between audible and silent speech that deteriorate recognition accuracy. We introduce multimode systems that allow seamless switching between audible and silent speech, investigate different measures which quantify speaking mode differences, and present the spectral mapping algorithm, which improves the word error rate (WER) on silent speech by up to 14.3% relative. Our best average silent speech WER is 34.7%, and our best WER on audibly spoken speech is 16.8%.

  4. Audio-Visual Tibetan Speech Recognition Based on a Deep Dynamic Bayesian Network for Natural Human Robot Interaction

    Directory of Open Access Journals (Sweden)

    Yue Zhao

    2012-12-01

    Full Text Available Audio-visual speech recognition is a natural and robust approach to improving human-robot interaction in noisy environments. Although multi-stream Dynamic Bayesian Networks and coupled HMMs are widely used for audio-visual speech recognition, they fail to learn the shared features between modalities and ignore the dependency of features among the frames within each discrete state. In this paper, we propose a Deep Dynamic Bayesian Network (DDBN) to perform unsupervised extraction of spatial-temporal multimodal features from Tibetan audio-visual speech data and build an accurate audio-visual speech recognition model under a no frame-independency assumption. The experimental results on Tibetan speech data from real-world environments showed that the proposed DDBN outperforms state-of-the-art methods in word recognition accuracy.

  5. Hybrid Approach for Language Identification Oriented to Multilingual Speech Recognition in the Basque Context

    Science.gov (United States)

    Barroso, N.; de Ipiña, K. López; Ezeiza, A.; Barroso, O.; Susperregi, U.

    The development of Multilingual Large Vocabulary Continuous Speech Recognition systems involves issues such as Language Identification, Acoustic-Phonetic Decoding, Language Modelling and the development of appropriate Language Resources. Interest in multilingual systems arises because there are three official languages in the Basque Country (Basque, Spanish, and French), and there is much linguistic interaction among them, even though Basque has very different roots from the other two languages. This paper describes the development of a Language Identification (LID) system oriented to robust Multilingual Speech Recognition for the Basque context. The work presents hybrid strategies for LID, based on the selection of system elements by Support Vector Machine and Multilayer Perceptron classifiers, and stochastic methods for speech recognition tasks (Hidden Markov Models and n-grams).

  6. SPEECH EMOTION RECOGNITION USING MODIFIED QUADRATIC DISCRIMINATION FUNCTION

    Institute of Scientific and Technical Information of China (English)

    2008-01-01

    The Quadratic Discrimination Function (QDF) is commonly used in speech emotion recognition, and it proceeds on the premise that the input data follows a normal distribution. In this paper, we propose a transformation to normalize the emotional features and then derive a Modified QDF (MQDF) for speech emotion recognition. Features based on prosody and voice quality are extracted, and a Principal Component Analysis Neural Network (PCANN) is used to reduce the dimension of the feature vectors. The results show that voice quality features are an effective supplement for recognition, and that the method in this paper improves the recognition rate effectively.
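
    For concreteness, the classical QDF that the paper starts from is the Gaussian quadratic discriminant sketched below; the MQDF adds a feature-normalizing transformation and modifies this form, which is not reproduced here:

```python
import numpy as np

def qdf_score(x: np.ndarray, mean: np.ndarray, cov: np.ndarray, prior: float) -> float:
    # Quadratic discriminant g(x) for one emotion class under a Gaussian assumption.
    diff = x - mean
    inv = np.linalg.inv(cov)
    return (-0.5 * diff @ inv @ diff
            - 0.5 * np.log(np.linalg.det(cov))
            + np.log(prior))

# Classification picks the class with the highest discriminant:
# label = argmax_k qdf_score(x, mean_k, cov_k, prior_k) over the emotion classes.
```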

  7. Spectrum warping based on sub-glottal resonances in speaker-independent speech recognition

    Institute of Scientific and Technical Information of China (English)

    HOU Limin; HUANG Zhenhua; XIE Juanmin

    2011-01-01

    To reduce degradation in speech recognition due to varied characteristics of different speakers, a method of perceptual frequency warping based on subglottal resonances for speaker normalization is proposed. The warping factor is extracted from the second subglottal resonance using acoustic coupling between subglottis and vocal tract. The second subglottal resonance is independent of the speech content, which reflects the speaker characteristics more than the third formant. The perceptual minimum variation distortionless response (PMVDR) coefficient is normalized, which is more robust and has better anti-noise capability than MFCC. The normalized coefficients are used in the speech-mode training and speech recognition. Experiments show that the word error rate, as compared with MFCC and the spectrum warping by the third formant, decreases by 4% and 3% respectively in clean speech recognition, and by 9% and 5% respectively in a noisy environment. The results indicate that the proposed method can improve the word recognition accuracy in a speaker-independent recognition system.
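
    A rough sketch of linear frequency warping by a speaker-specific factor, here assumed to be derived from the second subglottal resonance relative to a reference speaker; the paper applies warping within PMVDR feature extraction, which this plain-spectrum version does not capture:

```python
import numpy as np

def warp_spectrum(spectrum: np.ndarray, alpha: float) -> np.ndarray:
    # Resample the magnitude spectrum along the frequency axis by factor alpha:
    # warped[k] takes its value from bin alpha*k (np.interp clamps at the edges).
    bins = np.arange(len(spectrum))
    return np.interp(bins * alpha, bins, spectrum)

# alpha > 1 shifts the speaker's spectral content toward lower bins (e.g., mapping
# a shorter vocal tract onto the reference); alpha < 1 stretches it upward.
```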

  8. Vocal Tract Representation in the Recognition of Cerebral Palsied Speech

    Science.gov (United States)

    Rudzicz, Frank; Hirst, Graeme; van Lieshout, Pascal

    2012-01-01

    Purpose: In this study, the authors explored articulatory information as a means of improving the recognition of dysarthric speech by machine. Method: Data were derived chiefly from the TORGO database of dysarthric articulation (Rudzicz, Namasivayam, & Wolff, 2011) in which motions of various points in the vocal tract are measured during speech.…

  9. Speech-recognition interfaces for music information retrieval

    Science.gov (United States)

    Goto, Masataka

    2005-09-01

    This paper describes two hands-free music information retrieval (MIR) systems that enable a user to retrieve and play back a musical piece by saying its title or the artist's name. Although various interfaces for MIR have been proposed, speech-recognition interfaces suitable for retrieving musical pieces have not been studied. Our MIR-based jukebox systems employ two different speech-recognition interfaces for MIR, speech completion and speech spotter, which exploit intentionally controlled nonverbal speech information in original ways. The first is a music retrieval system with the speech-completion interface that is suitable for music stores and car-driving situations. When a user only remembers part of the name of a musical piece or an artist and utters only a remembered fragment, the system helps the user recall and enter the name by completing the fragment. The second is a background-music playback system with the speech-spotter interface that can enrich human-human conversation. When a user is talking to another person, the system allows the user to enter voice commands for music playback control by spotting a special voice-command utterance in face-to-face or telephone conversations. Experimental results from use of these systems have demonstrated the effectiveness of the speech-completion and speech-spotter interfaces. (Video clips: http://staff.aist.go.jp/m.goto/MIR/speech-if.html)

  10. Automatic speech recognition (ASR) based approach for speech therapy of aphasic patients: A review

    Science.gov (United States)

    Jamal, Norezmi; Shanta, Shahnoor; Mahmud, Farhanahani; Sha'abani, MNAH

    2017-09-01

    This paper reviews the state-of-the-art in automatic speech recognition (ASR) based approaches for the speech therapy of aphasic patients. Aphasia is a condition in which the affected person suffers from a speech and language disorder resulting from a stroke or brain injury. Since there is a growing body of evidence indicating the possibility of improving the symptoms at an early stage, ASR based solutions are increasingly being researched for speech and language therapy. ASR is a technology that transcribes human speech into text by matching it against the system's library. This is particularly useful in speech rehabilitation therapies, as it provides accurate, real-time evaluation of speech input from an individual with a speech disorder. ASR based approaches for speech therapy recognize the speech input from the aphasic patient and provide real-time feedback on their mistakes. However, the accuracy of ASR depends on many factors, such as phoneme recognition, speech continuity, speaker and environmental differences, as well as our depth of knowledge of human language understanding. Hence, the review examines recent developments in ASR technologies and their performance for individuals with speech and language disorders.

  11. Auditory Sparse Representation for Robust Speaker Recognition Based on Tensor Structure

    Directory of Open Access Journals (Sweden)

    Wu Qiang

    2008-01-01

    Full Text Available This paper investigates the problem of speaker recognition in noisy conditions. A new approach called nonnegative tensor principal component analysis (NTPCA) with a sparse constraint is proposed for speech feature extraction. We encode speech as a general higher-order tensor in order to extract discriminative features in the spectrotemporal domain. First, speech signals are represented by cochlear features based on the frequency selectivity characteristics of the basilar membrane and inner hair cells; then, low-dimension sparse features are extracted by NTPCA for robust speaker modeling. The useful information of each subspace in the higher-order tensor can be preserved. An alternating projection algorithm is used to obtain a stable solution. Experimental results demonstrate that our method can increase recognition accuracy, specifically in noisy environments.

  12. Auditory Sparse Representation for Robust Speaker Recognition Based on Tensor Structure

    Directory of Open Access Journals (Sweden)

    Liqing Zhang

    2008-11-01

    Full Text Available This paper investigates the problem of speaker recognition in noisy conditions. A new approach called nonnegative tensor principal component analysis (NTPCA) with a sparse constraint is proposed for speech feature extraction. We encode speech as a general higher-order tensor in order to extract discriminative features in the spectrotemporal domain. First, speech signals are represented by cochlear features based on the frequency selectivity characteristics of the basilar membrane and inner hair cells; then, low-dimension sparse features are extracted by NTPCA for robust speaker modeling. The useful information of each subspace in the higher-order tensor can be preserved. An alternating projection algorithm is used to obtain a stable solution. Experimental results demonstrate that our method can increase recognition accuracy, specifically in noisy environments.

  13. A Review on Speech Corpus Development for Automatic Speech Recognition in Indian Languages

    Directory of Open Access Journals (Sweden)

    Cini kurian

    2015-05-01

    Full Text Available Corpus development has gained much attention due to recent statistics-based natural language processing. It has new applications in language technology, linguistic research, language education and information exchange. Corpus-based language research has an innovative outlook that will displace aged linguistic theories. A speech corpus is an essential resource for building a speech recognizer, and one of the main challenges faced by speech scientists is the unavailability of these resources. Compared to English, far fewer efforts have been made in Indian languages to make these resources available to the public. In this paper we review the efforts made in Indian languages to develop speech corpora for automatic speech recognition.

  14. Effects of speech clarity on recognition memory for spoken sentences.

    Science.gov (United States)

    Van Engen, Kristin J; Chandrasekaran, Bharath; Smiljanic, Rajka

    2012-01-01

    Extensive research shows that inter-talker variability (i.e., changing the talker) affects recognition memory for speech signals. However, relatively little is known about the consequences of intra-talker variability (i.e. changes in speaking style within a talker) on the encoding of speech signals in memory. It is well established that speakers can modulate the characteristics of their own speech and produce a listener-oriented, intelligibility-enhancing speaking style in response to communication demands (e.g., when speaking to listeners with hearing impairment or non-native speakers of the language). Here we conducted two experiments to examine the role of speaking style variation in spoken language processing. First, we examined the extent to which clear speech provided benefits in challenging listening environments (i.e. speech-in-noise). Second, we compared recognition memory for sentences produced in conversational and clear speaking styles. In both experiments, semantically normal and anomalous sentences were included to investigate the role of higher-level linguistic information in the processing of speaking style variability. The results show that acoustic-phonetic modifications implemented in listener-oriented speech lead to improved speech recognition in challenging listening conditions and, crucially, to a substantial enhancement in recognition memory for sentences.

  15. Relationship between multipulse integration and speech recognition with cochlear implants.

    Science.gov (United States)

    Zhou, Ning; Pfingst, Bryan E

    2014-09-01

    Comparisons of performance with cochlear implants and postmortem conditions in the cochlea in humans have shown mixed results. The limitations in those studies favor the use of within-subject designs and non-invasive measures to estimate cochlear conditions. One non-invasive correlate of cochlear health is multipulse integration, established in an animal model. The present study used this measure to relate neural health in human cochlear implant users to their speech recognition performance. The multipulse-integration slopes were derived based on psychophysical detection thresholds measured for two pulse rates (80 and 640 pulses per second). A within-subject design was used in eight subjects with bilateral implants where the direction and magnitude of ear differences in the multipulse-integration slopes were compared with those of the speech-recognition results. The speech measures included speech reception threshold for sentences and phoneme recognition in noise. The magnitude of ear difference in the integration slopes was significantly correlated with the magnitude of ear difference in speech reception thresholds, consonant recognition in noise, and transmission of place of articulation of consonants. These results suggest that multipulse integration predicts speech recognition in noise and perception of features that use dynamic spectral cues.
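
    The multipulse-integration slope itself is simple arithmetic: the change in detection threshold per decade of pulse rate between the two rates used. A sketch with hypothetical threshold values; the study's exact units and normalization may differ:

```python
import math

def integration_slope(thr_80_db: float, thr_640_db: float) -> float:
    # Threshold-versus-rate slope in dB per decade between 80 and 640 pulses/s.
    # Steeper (more negative) slopes are taken as a sign of better neural health.
    return (thr_640_db - thr_80_db) / (math.log10(640) - math.log10(80))

# Ear difference used in the within-subject comparison (hypothetical thresholds):
left = integration_slope(-40.0, -52.0)
right = integration_slope(-41.0, -48.0)
ear_difference = left - right   # compared against the ear difference in speech scores
```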

  16. Mandarin Digits Speech Recognition Using Support Vector Machines

    Institute of Scientific and Technical Information of China (English)

    XIE Xiang; KUANG Jing-ming

    2005-01-01

    A method of applying support vector machines (SVMs) to speech recognition is proposed, and a speech recognition system for mandarin digits is built with SVMs. In the system, vectors are linearly extracted from the speech feature sequence to make up time-aligned input patterns for the SVM, and the decisions of several 2-class SVM classifiers are employed to construct an N-class classifier. Four kinds of SVM kernel functions were compared in speaker-independent speech recognition experiments on mandarin digits. The radial basis function kernel achieved the highest accuracy rate, 99.33%, which is better than that of the baseline system based on hidden Markov models (HMMs) (97.08%). The experiments also show that SVMs can outperform HMMs especially when the training samples are very limited.
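
    A minimal sketch of the classifier side using scikit-learn, whose SVC builds the N-class decision from pairwise 2-class SVMs, matching the construction described above. The feature extraction and time alignment are replaced by random placeholder data:

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical fixed-length patterns linearly extracted from frame sequences
# (the time-alignment step is not reproduced here).
X_train = np.random.randn(200, 120)           # 200 digit utterances, 120-dim patterns
y_train = np.random.randint(0, 10, size=200)  # mandarin digits 0-9

clf = SVC(kernel="rbf", gamma="scale")  # RBF kernel; SVC forms the N-class decision
clf.fit(X_train, y_train)               # internally from pairwise 2-class SVMs
pred = clf.predict(X_train[:5])
```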

  17. Lexicon Optimization for Dutch Speech Recognition in Spoken Document Retrieval

    NARCIS (Netherlands)

    Ordelman, Roeland; Hessen, van Arjan; Jong, de Franciska

    2001-01-01

    In this paper, ongoing work concerning the language modelling and lexicon optimization of a Dutch speech recognition system for Spoken Document Retrieval is described: the collection and normalization of a training data set and the optimization of our recognition lexicon. Effects on lexical coverage

  18. Lexicon optimization for Dutch speech recognition in spoken document retrieval

    NARCIS (Netherlands)

    Ordelman, Roeland; Hessen, van Arjan; Jong, de Franciska

    2001-01-01

    In this paper, ongoing work concerning the language modelling and lexicon optimization of a Dutch speech recognition system for Spoken Document Retrieval is described: the collection and normalization of a training data set and the optimization of our recognition lexicon. Effects on lexical coverage

  19. Objects Control through Speech Recognition Using LabVIEW

    Directory of Open Access Journals (Sweden)

    Ankush Sharma

    2013-01-01

    Full Text Available Speech is the natural form of human communication, and speech processing is one of the most stimulating areas of signal processing. Speech recognition technology has made it possible for computers to follow human voice commands and understand human languages. In this paper, the control of objects (an LED, a toggle switch, etc.) through human speech is designed by combining virtual instrumentation technology with speech recognition techniques; password authentication is also provided. This is done with the help of LabVIEW programming concepts. A microphone is used to take voice commands from the human user, and the microphone signal is interfaced with the LabVIEW code, which generates the appropriate control signals to control the objects. The entire work is done on the LabVIEW platform.

  20. Comparative wavelet, PLP, and LPC speech recognition techniques on the Hindi speech digits database

    Science.gov (United States)

    Mishra, A. N.; Shrotriya, M. C.; Sharan, S. N.

    2010-02-01

    In view of the growing use of automatic speech recognition in modern society, we study various alternative representations of the speech signal that have the potential to contribute to improved recognition performance. In this paper, wavelet-based features using different wavelets are used for Hindi digit recognition. The recognition performance of these features has been compared with Linear Prediction Coefficient (LPC) and Perceptual Linear Prediction (PLP) features. All features have been tested with a Hidden Markov Model (HMM) based classifier for speaker-independent Hindi digit recognition. The recognition performance of PLP features is 11.3% better than that of LPC features. The recognition performance with db10 features shows a further improvement of 12.55% over PLP features and is the best among all the wavelet-based features.
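
    One plausible form of the wavelet features, assuming log sub-band energies of a db10 decomposition (the record does not specify the exact feature recipe); this sketch uses the PyWavelets package:

```python
import numpy as np
import pywt  # PyWavelets

def db10_features(frame: np.ndarray, level: int = 4) -> np.ndarray:
    # Decompose one frame with the db10 wavelet and return the log energy of the
    # approximation band plus each detail band: a (level + 1)-dimensional vector.
    coeffs = pywt.wavedec(frame, "db10", level=level)
    energies = np.array([np.sum(c ** 2) + 1e-12 for c in coeffs])
    return np.log(energies)

frame = np.random.randn(400)   # stand-in for one 25 ms frame at 16 kHz
print(db10_features(frame))
```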

  1. Multi-Stage Recognition of Speech Emotion Using Sequential Forward Feature Selection

    Directory of Open Access Journals (Sweden)

    Liogienė Tatjana

    2016-07-01

    Full Text Available The intensive research of speech emotion recognition has introduced a huge collection of speech emotion features. Large feature sets complicate the speech emotion recognition task. Among various feature selection and transformation techniques for one-stage classification, multiple classifier systems have been proposed. The main idea of multiple classifiers is to arrange the emotion classification process in stages. Besides parallel and serial cases, the hierarchical arrangement of multi-stage classification is most widely used for speech emotion recognition. In this paper, we present a sequential-forward-feature-selection-based multi-stage classification scheme. The Sequential Forward Selection (SFS) and Sequential Floating Forward Selection (SFFS) techniques were employed for every stage of the multi-stage classification scheme. Experimental testing of the proposed scheme was performed using the German and Lithuanian emotional speech datasets. Sequential-feature-selection-based multi-stage classification outperformed the single-stage scheme by 12–42 % for different emotion sets. The multi-stage scheme showed higher robustness to the growth of the emotion set: the decrease in recognition rate with an increasing emotion set was 10–20 % lower for the multi-stage scheme than in the single-stage case. Differences between SFS and SFFS employment for feature selection were negligible.
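
    For one stage of such a scheme, scikit-learn's SequentialFeatureSelector implements the SFS step; the sketch below uses random placeholder data and an arbitrary feature budget, and the paper's SFS/SFFS implementation is not necessarily this one:

```python
import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.svm import SVC

# Hypothetical stage of the multi-stage scheme: pick the few features that best
# separate one group of emotions from the rest.
X = np.random.randn(300, 50)            # 300 utterances, 50 prosodic/spectral features
y = np.random.randint(0, 2, size=300)   # stage label: emotion group A vs group B

sfs = SequentialFeatureSelector(SVC(kernel="linear"), n_features_to_select=8,
                                direction="forward", cv=3)
sfs.fit(X, y)
stage_features = sfs.get_support(indices=True)   # feature subset used at this stage
```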

  2. Modelling context in automatic speech recognition

    NARCIS (Netherlands)

    Wiggers, P.

    2008-01-01

    Speech is at the core of human communication. Speaking and listening come so naturally to us that we do not have to think about it at all. The underlying cognitive processes are very rapid and almost completely subconscious. It is hard, if not impossible, not to understand speech. For computers on the o

  3. Speech recognition systems on the Cell Broadband Engine

    Energy Technology Data Exchange (ETDEWEB)

    Liu, Y; Jones, H; Vaidya, S; Perrone, M; Tydlitat, B; Nanda, A

    2007-04-20

    In this paper we describe our design, implementation, and first results of a prototype connected-phoneme-based speech recognition system on the Cell Broadband Engine™ (Cell/B.E.). Automatic speech recognition decodes speech samples into plain text (other representations are possible) and must process samples at real-time rates. Fortunately, the computational tasks involved in this pipeline are highly data-parallel and can receive significant hardware acceleration from vector-streaming architectures such as the Cell/B.E. Identifying and exploiting these parallelism opportunities is challenging, but also critical to improving system performance. We observed, from our initial performance timings, that a single Cell/B.E. processor can recognize speech from thousands of simultaneous voice channels in real time--a channel density that is orders-of-magnitude greater than the capacity of existing software speech recognizers based on CPUs (central processing units). This result emphasizes the potential for Cell/B.E.-based speech recognition and will likely lead to the future development of production speech systems using Cell/B.E. clusters.

  4. Robust F0 Estimation Using ELS-Based Robust Complex Speech Analysis

    Science.gov (United States)

    Funaki, Keiichi; Kinjo, Tatsuhiko

    Complex speech analysis of an analytic speech signal can accurately estimate the spectrum at low frequencies, since the analytic signal provides a spectrum only over positive frequencies. This remarkable feature makes it possible to realize more accurate F0 estimation using the complex residual signal extracted by complex-valued speech analysis. We have already proposed F0 estimation using the complex LPC residual, in which the autocorrelation function weighted by the AMDF was adopted as the criterion. That method adopted MMSE-based complex LPC analysis, and it has been reported that it can estimate F0 more accurately for IRS-filtered speech corrupted by white Gaussian noise, although it does not work as well for IRS-filtered speech corrupted by pink noise. In this paper, robust complex speech analysis based on the ELS (Extended Least Squares) method is introduced in order to overcome this drawback. Experimental results for additive white Gaussian and pink noise demonstrate that the proposed algorithm based on robust ELS-based complex AR analysis performs better than other methods.
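
    The ACF/AMDF criterion mentioned above can be sketched as follows: the autocorrelation of the residual (here a generic signal) is weighted by the inverse average magnitude difference function before peak picking. This is a generic illustration of the criterion, not the ELS-based analysis itself:

```python
import numpy as np

def estimate_f0(residual: np.ndarray, fs: int, fmin: float = 60.0, fmax: float = 400.0) -> float:
    # Score each candidate lag by autocorrelation divided by the AMDF, then pick
    # the lag with the highest score (AMDF-weighted autocorrelation criterion).
    x = residual - residual.mean()
    lags = np.arange(int(fs / fmax), int(fs / fmin))
    score = np.empty(len(lags))
    for i, lag in enumerate(lags):
        a, b = x[:-lag], x[lag:]
        score[i] = np.dot(a, b) / (np.abs(a - b).mean() + 1e-8)
    return fs / lags[int(np.argmax(score))]

fs = 8000
tone = np.sin(2 * np.pi * 125 * np.arange(800) / fs)  # 100 ms, 125 Hz (period = 64 samples)
print(estimate_f0(tone, fs))                           # 125.0
```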

  5. Independent Component Analysis and Time-Frequency Masking for Speech Recognition in Multitalker Conditions

    Directory of Open Access Journals (Sweden)

    Reinhold Orglmeister

    2010-01-01

    Full Text Available When a number of speakers are simultaneously active, for example in meetings or noisy public places, the sources of interest need to be separated from interfering speakers and from each other in order to be robustly recognized. Independent component analysis (ICA has proven a valuable tool for this purpose. However, ICA outputs can still contain strong residual components of the interfering speakers whenever noise or reverberation is high. In such cases, nonlinear postprocessing can be applied to the ICA outputs, for the purpose of reducing remaining interferences. In order to improve robustness to the artefacts and loss of information caused by this process, recognition can be greatly enhanced by considering the processed speech feature vector as a random variable with time-varying uncertainty, rather than as deterministic. The aim of this paper is to show the potential to improve recognition of multiple overlapping speech signals through nonlinear postprocessing together with uncertainty-based decoding techniques.
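
    As a toy illustration of the ICA stage, the sketch below separates an instantaneous two-channel mixture with scikit-learn's FastICA; real meeting audio is convolutive and requires frequency-domain ICA, which is not attempted here:

```python
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)
s = rng.laplace(size=(2, 16000))            # two super-Gaussian "speech" sources
A = np.array([[1.0, 0.6], [0.4, 1.0]])      # mixing matrix (microphone gains)
x = A @ s                                    # observed microphone signals

ica = FastICA(n_components=2, random_state=0)
est = ica.fit_transform(x.T).T               # separated outputs, up to scale and order
# Residual interference left in `est` is what the nonlinear postprocessing and
# uncertainty-based decoding described above are meant to handle.
```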

  6. A Multi-Modal Recognition System Using Face and Speech

    Directory of Open Access Journals (Sweden)

    Samir Akrouf

    2011-05-01

    Full Text Available Nowadays Person Recognition has received more and more interest, especially for security reasons. Recognition performed by a biometric system using a single modality tends to perform poorly due to sensor data limitations, restricted degrees of freedom, and unacceptable error rates. To alleviate some of these problems we use multimodal biometric systems, which provide better recognition results. By combining different modalities, such as speech, face, and fingerprint, we increase the performance of recognition systems. In this paper, we study the fusion of speech and face in a recognition system for taking a final decision (i.e., accept or reject an identity claim). We evaluate the performance of each system separately, then we fuse the results and compare the performances.

  7. Arabic Language Learning Assisted by Computer, based on Automatic Speech Recognition

    CERN Document Server

    Terbeh, Naim

    2012-01-01

    This work consists of creating a Computer Assisted Language Learning (CALL) system based on an Automatic Speech Recognition (ASR) system for the Arabic language, using the CMU Sphinx3 tool [1], which is based on the HMM approach. For this work, we constructed a corpus of six hours of speech recordings from nine speakers. We find in its robustness to noise a justification for the choice of the HMM approach [2]. The results achieved are encouraging given that our corpus contains only nine speakers, and they open the door to further improvement work.

  8. Duration-Distribution-Based HMM for Speech Recognition

    Institute of Scientific and Technical Information of China (English)

    WANG Zuo-ying; XIAO Xi

    2006-01-01

    To overcome the defects of duration modeling in the homogeneous Hidden Markov Model (HMM) for speech recognition, a duration-distribution-based HMM (DDBHMM) is proposed in this paper, based on a formalized definition of a left-to-right inhomogeneous Markov model. It has been demonstrated that the model can be identically defined by either the state duration or the state transition probability. Speaker-independent continuous speech recognition experiments show that by modeling only the state duration in DDBHMM, a significant improvement (17.8% error rate reduction) can be achieved compared with the classical HMM. The ideal properties of DDBHMM hold promise for many aspects of speech modeling, such as the modeling of state duration, speed variation, speech discontinuity, and interframe correlation.
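
    For reference, the defect being addressed is that a homogeneous HMM implicitly imposes a geometric state-duration distribution; one standard way to relate an explicit duration distribution to equivalent inhomogeneous self-transition probabilities (generic notation, not necessarily the paper's) is:

```latex
% Implicit duration distribution of state i in a homogeneous HMM
% with self-transition probability a_{ii} (geometric):
P_i(d) = a_{ii}^{\,d-1}\,(1 - a_{ii}), \qquad d = 1, 2, \dots

% Conversely, an explicit duration distribution P_i(d) induces an
% inhomogeneous self-transition probability after \tau frames in state i:
a_{ii}(\tau) = \frac{\Pr(D_i > \tau)}{\Pr(D_i \ge \tau)}
             = \frac{\sum_{d=\tau+1}^{\infty} P_i(d)}{\sum_{d=\tau}^{\infty} P_i(d)}
```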

  9. Leveraging Automatic Speech Recognition Errors to Detect Challenging Speech Segments in TED Talks

    Science.gov (United States)

    Mirzaei, Maryam Sadat; Meshgi, Kourosh; Kawahara, Tatsuya

    2016-01-01

    This study investigates the use of Automatic Speech Recognition (ASR) systems to epitomize second language (L2) listeners' problems in perception of TED talks. ASR-generated transcripts of videos often involve recognition errors, which may indicate difficult segments for L2 listeners. This paper aims to discover the root-causes of the ASR errors…

  10. Speech Rate Control for Improving Elderly Speech Recognition of Smart Devices

    Directory of Open Access Journals (Sweden)

    SON, G.

    2017-05-01

    Full Text Available Although smart devices have become a widely adopted tool for communication in modern society, they still present a steep learning curve for the elderly. By introducing a voice-based interface for smart devices using voice recognition technology, smart devices can become more user-friendly and useful to the elderly. However, the voice recognition technology used in current devices is attuned to the voice patterns of the young, so speech recognition falters when an elderly user speaks into the device. This paper identifies the elderly's improper per-syllable speech rate as a contributor to failures of the voice recognition system. Upon modifying the speech rate of each syllable, the voice recognition rate increased by 12.3%. This paper demonstrates that by simply modifying the per-syllable speech rate, which is one of the factors that causes errors in voice recognition, the recognition rate can be substantially increased. Such improvements in voice recognition technology can make it easier for the elderly to operate smart devices, allowing them to be more socially connected in a mobile world and to access information at their fingertips. It may also help bridge the communication divide between generations.
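
    A sketch of the per-syllable rate modification described above follows; the target duration, the function name `normalize_syllable_rate`, the assumption of an external syllable segmenter, and the use of librosa's time stretcher are all illustrative, not the paper's exact method.

```python
import numpy as np
import librosa

def normalize_syllable_rate(y, sr, syllables, target_dur=0.25):
    """Time-stretch each syllable segment toward a target duration
    (in seconds) and concatenate. `syllables` is a list of
    (start, end) sample indices; segments are assumed long enough
    for librosa's default STFT settings."""
    out = []
    for start, end in syllables:
        seg = y[start:end]
        rate = (len(seg) / sr) / target_dur   # >1 shortens, <1 lengthens
        out.append(librosa.effects.time_stretch(seg, rate=rate))
    return np.concatenate(out)
```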

  11. Detection and Separation of Speech Event Using Audio and Video Information Fusion and Its Application to Robust Speech Interface

    Directory of Open Access Journals (Sweden)

    Futoshi Asano

    2004-09-01

    Full Text Available A method of detecting speech events in a multiple-sound-source condition using audio and video information is proposed. For detecting speech events, sound localization using a microphone array and human tracking by stereo vision are combined by a Bayesian network. From the inference results of the Bayesian network, information on the time and location of speech events can be obtained. The information on the detected speech events is then utilized in the robust speech interface. A maximum likelihood adaptive beamformer is employed as a preprocessor of the speech recognizer to separate the speech signal from environmental noise. The coefficients of the beamformer are kept updated based on the information about the speech events. The information on the speech events is also used by the speech recognizer for extracting the speech segment.
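
    The preprocessor's weight computation can be sketched per frequency bin in the usual minimum-variance (maximum-likelihood) form w = R⁻¹d / (dᴴR⁻¹d); the assumption that a noise spatial covariance R and a steering vector d toward the detected speech event are available reflects the event-detection stage described above, and all names are illustrative.

```python
import numpy as np

def mvdr_weights(R_noise, steering):
    """One-bin MVDR/maximum-likelihood beamformer weights:
    w = R^{-1} d / (d^H R^{-1} d). R_noise: (M, M) complex noise
    spatial covariance; steering: (M,) steering vector toward the
    detected speech source."""
    r_inv_d = np.linalg.solve(R_noise, steering)
    return r_inv_d / (steering.conj() @ r_inv_d)

# Hypothetical per-bin usage: output = w.conj() @ x, for the
# M-channel STFT vector x of that frequency bin.
```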

  12. Speaker Independent Speech Recognition of Isolated Words in Room Environment

    OpenAIRE

    M. Tabassum; M. A. Aziz Jahan; Rahman, M M; Mohamed, S. B.; M. A. Rashid

    2017-01-01

    In this paper, the process of recognizing some important words from a large vocabulary is demonstrated based on the combination of dynamic and instantaneous features of the speech spectrum. There are many procedures for recognizing a word by its vowels, but this paper presents highly effective speaker-independent speech recognition under typical room-environment noise conditions. To distinguish several isolated words containing different vowel sounds, two important features known as Pitch an...

  13. A HYBRID METHOD FOR AUTOMATIC SPEECH RECOGNITION PERFORMANCE IMPROVEMENT IN REAL WORLD NOISY ENVIRONMENT

    Directory of Open Access Journals (Sweden)

    Urmila Shrawankar

    2013-01-01

    Full Text Available It is a well-known fact that speech recognition systems perform well when used in conditions similar to those used to train the acoustic models; mismatches degrade performance. In adverse environments it is very difficult to predict the category of real-world environmental noise in advance, and difficult to achieve environmental robustness. After a rigorous experimental study, it is observed that no single method is available that will both clean noisy speech and preserve the quality of speech corrupted by real, natural environmental (mixed) noise. It is also observed that back-end techniques alone are not sufficient to improve the performance of a speech recognition system; performance-improvement techniques must be implemented at every step of the back-end as well as the front-end of the Automatic Speech Recognition (ASR) model. Current recognition systems address this problem using a technique called adaptation. This study presents an experimental study with two aims: first, to implement a hybrid method that cleans the speech signal as much as possible using combinations of filters and enhancement techniques; second, to develop a method for training on all categories of noise that can adapt the acoustic models to a new environment, improving the performance of the speech recognizer under real-world mismatched conditions. The experiments confirm that hybrid adaptation methods improve ASR performance on both levels: signal-to-noise ratio (SNR) improvement as well as word recognition accuracy in real-world noisy environments.
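
    As one concrete example of the kind of front-end enhancement such a hybrid can include (spectral subtraction is a classic choice; the paper combines several filters and enhancement techniques, and this sketch is not its exact pipeline):

```python
import numpy as np

def spectral_subtraction(noisy, frame=512, hop=256, noise_frames=10, floor=0.05):
    """Minimal magnitude spectral subtraction with a noise estimate
    taken from the first few frames (assumed speech-free)."""
    win = np.hanning(frame)                            # ~COLA at 50% overlap
    starts = range(0, len(noisy) - frame + 1, hop)
    spec = np.stack([np.fft.rfft(win * noisy[i:i + frame]) for i in starts])
    mag, phase = np.abs(spec), np.angle(spec)
    noise_mag = mag[:noise_frames].mean(axis=0)        # noise spectrum estimate
    clean = np.maximum(mag - noise_mag, floor * mag)   # subtract with spectral floor
    out = np.zeros(len(noisy))
    for i, s in zip(starts, clean * np.exp(1j * phase)):
        out[i:i + frame] += np.fft.irfft(s, n=frame)   # overlap-add resynthesis
    return out
```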

  14. Matrix sentence intelligibility prediction using an automatic speech recognition system.

    Science.gov (United States)

    Schädler, Marc René; Warzybok, Anna; Hochmuth, Sabine; Kollmeier, Birger

    2015-01-01

    The feasibility of predicting the outcome of the German matrix sentence test for different types of stationary background noise using an automatic speech recognition (ASR) system was studied. Speech reception thresholds (SRT) of 50% intelligibility were predicted in seven noise conditions. The ASR system used Mel-frequency cepstral coefficients as a front-end and employed whole-word Hidden Markov models on the back-end side. The ASR system was trained and tested with noisy matrix sentences over a broad range of signal-to-noise ratios. The ASR-based predictions were compared to data from the literature (Hochmuth et al., 2015) obtained with 10 native German listeners with normal hearing, and to predictions of the speech intelligibility index (SII). The ASR-based predictions showed a high and significant correlation with the empirical data (R² = 0.95), requiring only the speech and noise signals. Minimal assumptions were made about human speech processing beyond those already incorporated in a reference-free, ordinary ASR system.

  15. Improving Robustness of Deep Neural Network Acoustic Models via Speech Separation and Joint Adaptive Training

    Science.gov (United States)

    Narayanan, Arun; Wang, DeLiang

    2015-01-01

    Although deep neural network (DNN) acoustic models are known to be inherently noise robust, especially with matched training and testing data, the use of speech separation as a frontend and for deriving alternative feature representations has been shown to improve performance in challenging environments. We first present a supervised speech separation system that significantly improves automatic speech recognition (ASR) performance in realistic noise conditions. The system performs separation via ratio time-frequency masking; the ideal ratio mask (IRM) is estimated using DNNs. We then propose a framework that unifies separation and acoustic modeling via joint adaptive training. Since the modules for acoustic modeling and speech separation are implemented using DNNs, unification is done by introducing additional hidden layers with fixed weights and appropriate network architecture. On the CHiME-2 medium-large vocabulary ASR task, and with log mel spectral features as input to the acoustic model, an independently trained ratio masking frontend improves word error rates by 10.9% (relative) compared to the noisy baseline. In comparison, the jointly trained system improves performance by 14.4%. We also experiment with alternative feature representations to augment the standard log mel features, like the noise and speech estimates obtained from the separation module, and the standard feature set used for IRM estimation. Our best system obtains a word error rate of 15.4% (absolute), an improvement of 4.6 percentage points over the next best result on this corpus. PMID:26973851
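
    The IRM training target mentioned above is commonly defined per time-frequency unit from parallel clean-speech and noise spectrograms; a minimal sketch follows (the DNN that estimates the mask at test time is omitted, and the exact parameterization in the paper may differ):

```python
import numpy as np

def ideal_ratio_mask(speech_spec, noise_spec):
    """IRM target: square root of speech energy over total energy per
    time-frequency unit, in [0, 1]; at separation time the estimated
    mask is multiplied onto the noisy magnitude spectrogram."""
    s2 = np.abs(speech_spec) ** 2
    n2 = np.abs(noise_spec) ** 2
    return np.sqrt(s2 / (s2 + n2 + 1e-10))
```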

  16. Emotion Recognition from Persian Speech with Neural Network

    Directory of Open Access Journals (Sweden)

    Mina Hamidi

    2012-10-01

    Full Text Available In this paper, we report an effort towards automatic recognition of emotional states from continuous Persian speech. Due to the unavailability of an appropriate database in the Persian language for emotion recognition, at first we built a database of emotional speech in Persian. This database consists of 2400 wave clips modulated with anger, disgust, fear, sadness, happiness and normal emotions. Then we extract prosodic features, including features related to the pitch, intensity and global characteristics of the speech signal. Finally, we applied neural networks for automatic recognition of emotion. The resulting average accuracy was about 78%.

  17. Bayesian estimation of keyword confidence in Chinese continuous speech recognition

    Institute of Scientific and Technical Information of China (English)

    HAO Jie; LI Xing

    2003-01-01

    In a syllable-based, speaker-independent Chinese continuous speech recognition system based on the classical Hidden Markov Model (HMM), a Bayesian approach to keyword confidence estimation is studied which utilizes both acoustic-layer scores and a syllable-based statistical language model (LM) score. The maximum a posteriori (MAP) confidence measure is proposed, and the forward-backward algorithm for calculating the MAP confidence scores is derived. The performance of the MAP confidence measure is evaluated in a keyword spotting application, and the experimental results show that the MAP confidence scores provide high discriminability for keyword candidates. Furthermore, the MAP confidence measure can be applied to various speech recognition applications.
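
    In generic lattice-posterior form, such a MAP confidence is the posterior probability of the keyword given the whole observation sequence, computed with forward and backward scores; the normalization and scaling below are a standard formulation, not necessarily the paper's exact one:

```latex
% MAP confidence of keyword w spanning frames t_s..t_e, with lattice
% forward score \alpha, backward score \beta, acoustic scale \lambda,
% and syllable-based LM probability P_{LM}:
C(w) = P(w \mid O_{1:T})
     = \frac{\alpha(t_s)\; p(O_{t_s:t_e} \mid w)^{\lambda}\; P_{LM}(w)\; \beta(t_e)}
            {p(O_{1:T})}
```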

  18. Speech Recognition Method Based on Multilayer Chaotic Neural Network

    Institute of Scientific and Technical Information of China (English)

    REN Xiaolin; HU Guangrui

    2001-01-01

    In this paper, speech recognition using neural networks is investigated. In particular, chaotic dynamics is introduced into the neurons, and a multilayer chaotic neural network (MLCNN) architecture is built. A learning algorithm is also derived to train the weights of the network. We apply the MLCNN to speech recognition and compare the performance of the network with those of a recurrent neural network (RNN) and a time-delay neural network (TDNN). Experimental results show that the MLCNN method outperforms the other neural network methods with respect to average recognition rate.

  19. Subspace Distribution Clustering HMM for Chinese Digit Speech Recognition

    Institute of Scientific and Technical Information of China (English)

    2006-01-01

    As a statistical method, the Hidden Markov Model (HMM) technique is widely used for speech recognition. In order to train the HMM effectively with much less data, the Subspace Distribution Clustering Hidden Markov Model (SDCHMM), derived from the Continuous Density Hidden Markov Model (CDHMM), is introduced. With parameter tying, a new method to train SDCHMMs is described. Compared with the conventional training method, an SDCHMM recognizer trained by the new method achieves higher accuracy and speed. Experimental results show that the SDCHMM recognizer outperforms the CDHMM recognizer on speech recognition of Chinese digits.

  20. Collecting and evaluating speech recognition corpora for 11 South African languages

    CSIR Research Space (South Africa)

    Badenhorst, J

    2011-08-01

    Full Text Available The authors describe the Lwazi corpus for automatic speech recognition (ASR), a new telephone speech corpus which contains data from the eleven official languages of South Africa. Because of practical constraints, the amount of speech per language...

  1. Collecting and evaluating speech recognition corpora for nine Southern Bantu languages

    CSIR Research Space (South Africa)

    Badenhorst, JAC

    2009-03-01

    Full Text Available The authors describe the Lwazi corpus for automatic speech recognition (ASR), a new telephone speech corpus which includes data from nine Southern Bantu languages. Because of practical constraints, the amount of speech per language is relatively...

  2. Integrating HMM-Based Speech Recognition With Direct Manipulation In A Multimodal Korean Natural Language Interface

    CERN Document Server

    Lee, G; Kim, S; Lee, Geunbae; Lee, Jong-Hyeok; Kim, Sangeok

    1996-01-01

    This paper presents an HMM-based speech recognition engine and its integration into direct manipulation interfaces for a Korean document editor. Speech recognition can reduce the tedious and repetitive actions that are inevitable in standard GUIs (graphical user interfaces). Our system consists of a general speech recognition engine called ABrain (Auditory Brain) and a speech-commandable document editor called SHE (Simple Hearing Editor). ABrain is a phoneme-based speech recognition engine which achieves up to 97% discrete command recognition. SHE is a EuroBridge widget-based document editor that supports speech commands as well as direct manipulation interfaces.

  3. Finding Acoustic Regularities in Speech: Applications to Phonetic Recognition

    Science.gov (United States)

    1988-12-01

    Phonetic recognition can be viewed as a process through which the acoustic signal is mapped to a set of phonological units used to represent a lexicon...the phonological interpretation of the acoustic organization; only 5 regions which aligned with the phonetic transcription were used as training data... (RLE Technical Report No. 536, December 1988)

  4. Writing and Speech Recognition : Observing Error Correction Strategies of Professional Writers

    OpenAIRE

    Leijten, M.A.J.C.

    2007-01-01

    In this thesis we describe the organization of speech recognition based writing processes. Writing can be seen as a visual representation of spoken language: a combination that speech recognition takes full advantage of. In the field of writing research, speech recognition is a new writing instrument that may cause a shift in writing process research because the underlying processes are changing. In addition to this, we take advantage of one of the weak points of speech recognition, namely the...

  5. Pattern Recognition Methods and Features Selection for Speech Emotion Recognition System

    Science.gov (United States)

    Partila, Pavol; Voznak, Miroslav; Tovarek, Jaromir

    2015-01-01

    The impact of the classification method and features selection for the speech emotion recognition accuracy is discussed in this paper. Selecting the correct parameters in combination with the classifier is an important part of reducing the complexity of system computing. This step is necessary especially for systems that will be deployed in real-time applications. The reason for the development and improvement of speech emotion recognition systems is wide usability in nowadays automatic voice controlled systems. Berlin database of emotional recordings was used in this experiment. Classification accuracy of artificial neural networks, k-nearest neighbours, and Gaussian mixture model is measured considering the selection of prosodic, spectral, and voice quality features. The purpose was to find an optimal combination of methods and group of features for stress detection in human speech. The research contribution lies in the design of the speech emotion recognition system due to its accuracy and efficiency. PMID:26346654

  6. Pattern Recognition Methods and Features Selection for Speech Emotion Recognition System

    Directory of Open Access Journals (Sweden)

    Pavol Partila

    2015-01-01

    Full Text Available The impact of the classification method and features selection for the speech emotion recognition accuracy is discussed in this paper. Selecting the correct parameters in combination with the classifier is an important part of reducing the complexity of system computing. This step is necessary especially for systems that will be deployed in real-time applications. The reason for the development and improvement of speech emotion recognition systems is wide usability in nowadays automatic voice controlled systems. Berlin database of emotional recordings was used in this experiment. Classification accuracy of artificial neural networks, k-nearest neighbours, and Gaussian mixture model is measured considering the selection of prosodic, spectral, and voice quality features. The purpose was to find an optimal combination of methods and group of features for stress detection in human speech. The research contribution lies in the design of the speech emotion recognition system due to its accuracy and efficiency.

  7. Basic speech recognition for spoken dialogues

    CSIR Research Space (South Africa)

    Van Heerden, C

    2009-09-01

    Full Text Available ...speech recognisers for a diverse multitude of languages. The paper investigates the feasibility of developing small-vocabulary speaker-independent ASR systems designed for use in a telephone-based information system, using ten resource-scarce languages...

  8. Writing and Speech Recognition : Observing Error Correction Strategies of Professional Writers

    NARCIS (Netherlands)

    Leijten, M.A.J.C.

    2007-01-01

    In this thesis we describe the organization of speech recognition based writing processes. Writing can be seen as a visual representation of spoken language: a combination that speech recognition takes full advantage of. In the field of writing research, speech recognition is a new writing instrument

  9. Writing and Speech Recognition : Observing Error Correction Strategies of Professional Writers

    NARCIS (Netherlands)

    Leijten, M.A.J.C.

    2007-01-01

    In this thesis we describe the organization of speech recognition based writing processes. Writing can be seen as a visual representation of spoken language: a combination that speech recognition takes full advantage of. In the field of writing research, speech recognition is a new writing instrument

  10. Temporal visual cues aid speech recognition

    DEFF Research Database (Denmark)

    Zhou, Xiang; Ross, Lars; Lehn-Schiøler, Tue

    2006-01-01

    ...that it is the temporal synchronicity of the visual input that aids parsing of the auditory stream. More specifically, we expected that purely temporal information, which does not convey information such as place of articulation, may facilitate word recognition. METHODS: To test this prediction we used temporal features... of audio to generate an artificial talking-face video and measured word recognition performance on simple monosyllabic words. RESULTS: When presenting words together with the artificial video we find that word recognition is improved over purely auditory presentation. The effect is significant (p...

  11. Spoken Word Recognition of Chinese Words in Continuous Speech

    Science.gov (United States)

    Yip, Michael C. W.

    2015-01-01

    The present study examined the role that the positional probability of syllables plays in the recognition of spoken words in continuous Cantonese speech. Because some sounds occur more frequently at the beginning or ending positions of Cantonese syllables than others, this kind of probabilistic information about syllables may cue the locations…

  12. Improving user-friendliness by using visually supported speech recognition

    NARCIS (Netherlands)

    Waals, J.A.J.S.; Kooi, F.L.; Kriekaard, J.J.

    2002-01-01

    While speech recognition in principle may be one of the most natural interfaces, in practice it is not, due to a lack of user-friendliness. Words are regularly interpreted incorrectly, and subjects tend to articulate in an exaggerated manner. We explored the potential of visually supported error correction

  13. Appropriate baseline values for HMM-based speech recognition

    CSIR Research Space (South Africa)

    Barnard, E

    2004-11-01

    Full Text Available A number of issues realted to the development of speech-recognition systems with Hidden Markov Models (HMM) are discussed. A set of systematic experiments using the HTK toolkit and the TMIT database are used to elucidate matters such as the number...

  14. Improving user-friendliness by using visually supported speech recognition

    NARCIS (Netherlands)

    Waals, J.A.J.S.; Kooi, F.L.; Kriekaard, J.J.

    2002-01-01

    While speech recognition in principle may be one of the most natural interfaces, in practice it is not, due to a lack of user-friendliness. Words are regularly interpreted incorrectly, and subjects tend to articulate in an exaggerated manner. We explored the potential of visually supported error correction

  15. Speech Recognition Using Neural Nets and Dynamic Time Warping

    Science.gov (United States)

    1988-12-01

    ..."Phonetic Typewriter", Computer, 21: 11-22 (March 1988). 7. Lippmann, Richard P., "An Introduction to Computing with Neural Nets," IEEE ASSP Magazine, 4: 4-22 (April 1987). 8. Kohonen, Teuvo, and others, "Phonotopic Maps: Insightful Representation of Phonological Features for Speech Recognition"
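
    Since the title pairs neural nets with dynamic time warping, a minimal DTW template-matching sketch of the kind classically used for isolated-word recognition follows; the step pattern, distance measure, and names are illustrative assumptions, not taken from the report.

```python
import numpy as np

def dtw_distance(a, b):
    """Dynamic time warping between two (T, d) feature sequences:
    returns the accumulated alignment cost under a symmetric step
    pattern with Euclidean local frame distance."""
    Ta, Tb = len(a), len(b)
    D = np.full((Ta + 1, Tb + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, Ta + 1):
        for j in range(1, Tb + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])   # local distance
            D[i, j] = cost + min(D[i - 1, j],            # insertion
                                 D[i, j - 1],            # deletion
                                 D[i - 1, j - 1])        # match
    return D[Ta, Tb]
```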

  16. Automatic Speech Recognition: Reliability and Pedagogical Implications for Teaching Pronunciation

    Science.gov (United States)

    Kim, In-Seok

    2006-01-01

    This study examines the reliability of automatic speech recognition (ASR) software used to teach English pronunciation, focusing on one particular piece of software, "FluSpeak," as a typical example. Thirty-six Korean English as a Foreign Language (EFL) college students participated in an experiment in which they listened to 15 sentences…

  17. Shortlist B: A Bayesian model of continuous speech recognition

    NARCIS (Netherlands)

    Norris, D.; McQueen, J.M.

    2008-01-01

    A Bayesian model of continuous speech recognition is presented. It is based on Shortlist (D. Norris, 1994; D. Norris, J. M. McQueen, A. Cutler, & S. Butterfield, 1997) and shares many of its key assumptions: parallel competitive evaluation of multiple lexical hypotheses, phonologically abstract prelexical and lexical representations…

  18. Development of a speech recognition system for Spanish broadcast news

    NARCIS (Netherlands)

    Niculescu, Andreea; Jong, de Franciska

    2008-01-01

    This paper reports on the development process of a speech recognition system for Spanish broadcast news within the MESH FP6 project. The system uses the SONIC recognizer developed at the Center for Spoken Language Research (CSLR), University of Colorado. Acoustic and language models were trained using…

  19. Speech Recognition by Goats, Wolves, Sheep and Non-Natives

    Science.gov (United States)

    2000-08-01

    existent, will lead to vowel insertions, and diphthongs are likely to be replaced by a single vowel. [Bona97] P. Bonaventura, F. Gallocchio, G. Micca... [Bona98] P. Bonaventura, F. Gallocchio, J. Mari, G. Micca, "Speech recognition methods for non

  20. Spoken Word Recognition of Chinese Words in Continuous Speech

    Science.gov (United States)

    Yip, Michael C. W.

    2015-01-01

    The present study examined the role that the positional probability of syllables plays in the recognition of spoken words in continuous Cantonese speech. Because some sounds occur more frequently at the beginning or ending positions of Cantonese syllables than others, this kind of probabilistic information about syllables may cue the locations…

  1. Speech Recognition Issues for Dutch Spoken Document Retrieval

    NARCIS (Netherlands)

    Ordelman, Roeland J.F.; van Hessen, Adrianus J.; de Jong, Franciska M.G.; Matousek, Vaclav; Mautner, Pavel; Moucek, Roman; Tauser, Karel

    2001-01-01

    In this paper, ongoing work on the development of the speech recognition modules of a multimedia retrieval environment for Dutch is described. The work on the generation of acoustic models and language models along with their current performance is presented. Some characteristics of the Dutch

  2. Shortlist B: A Bayesian Model of Continuous Speech Recognition

    Science.gov (United States)

    Norris, Dennis; McQueen, James M.

    2008-01-01

    A Bayesian model of continuous speech recognition is presented. It is based on Shortlist (D. Norris, 1994; D. Norris, J. M. McQueen, A. Cutler, & S. Butterfield, 1997) and shares many of its key assumptions: parallel competitive evaluation of multiple lexical hypotheses, phonologically abstract prelexical and lexical representations, a feedforward…

  3. Speech Recognition for Students with Disabilities in Writing

    Science.gov (United States)

    Gardner, Teresa J.

    2008-01-01

    The role of technology in education is ever increasing. This article looks at students with disabilities and the problem of writing independently. Speech recognition technology offers an option, or solution, for students who have physical and/or learning disabilities and for students who cannot access and use computer keyboards or switches.…

  4. Using Automatic Speech Recognition Technology with Elicited Oral Response Testing

    Science.gov (United States)

    Cox, Troy L.; Davies, Randall S.

    2012-01-01

    This study examined the use of automatic speech recognition (ASR) scored elicited oral response (EOR) tests to assess the speaking ability of English language learners. It also examined the relationship between ASR-scored EOR and other language proficiency measures and the ability of the ASR to rate speakers without bias to gender or native…

  5. How Aging Affects the Recognition of Emotional Speech

    Science.gov (United States)

    Paulmann, Silke; Pell, Marc D.; Kotz, Sonja A.

    2008-01-01

    To successfully infer a speaker's emotional state, diverse sources of emotional information need to be decoded. The present study explored to what extent emotional speech recognition of "basic" emotions (anger, disgust, fear, happiness, pleasant surprise, sadness) differs between different sex (male/female) and age (young/middle-aged) groups in a…

  6. Intonation and Dialog Context as Constraints for Speech Recognition.

    Science.gov (United States)

    Taylor, Paul; King, Simon; Isard, Stephen; Wright, Helen

    1998-01-01

    Describes how to use intonation and dialog context to improve the performance of an automatic speech-recognition system. Experiments utilized the DCIEM Maptask corpus, using a separate bigram language model for each type of move and showing that, with the correct move-specific language model for each utterance in the test set, the recognizer's…

  7. Channel normalization technique for speech recognition in mismatched conditions

    CSIR Research Space (South Africa)

    Kleynhans, N

    2008-11-01

    Full Text Available Research into a new channel normalization (CN) technique for channel-mismatched speech recognition is presented, motivated by settings where one wishes to use any available training data for a variety of purposes. A process of inverse linear filtering is used in order...

  8. Improving user-friendliness by using visually supported speech recognition

    NARCIS (Netherlands)

    Waals, J.A.J.S.; Kooi, F.L.; Kriekaard, J.J.

    2002-01-01

    While speech recognition in principle may be one of the most natural interfaces, in practice it is not, due to a lack of user-friendliness. Words are regularly interpreted incorrectly, and subjects tend to articulate in an exaggerated manner. We explored the potential of visually supported error correction

  9. Robust Face Recognition through Local Graph Matching

    Directory of Open Access Journals (Sweden)

    Ehsan Fazl-Ersi

    2007-09-01

    Full Text Available A novel face recognition method is proposed, in which face images are represented by a set of local labeled graphs, each containing information about the appearance and geometry of a 3-tuple of face feature points extracted using the Local Feature Analysis (LFA) technique. Our method automatically learns a model set and builds a graph space for each individual. A two-stage method for optimal matching between the graphs extracted from a probe image and the trained model graphs is proposed. Each probe face image is recognized by assigning it to the trained individual with the maximum number of references. Our approach achieves perfect results on the ORL face set and an accuracy rate of 98.4% on the FERET face set, which shows the superiority of our method over all considered state-of-the-art methods.

  10. EMOTIONAL SPEECH RECOGNITION BASED ON SVM WITH GMM SUPERVECTOR

    Institute of Scientific and Technical Information of China (English)

    Chen Yanxiang; Xie Jian

    2012-01-01

    Emotion recognition from speech is an important field of research in human-computer interaction. In this letter, the framework of Support Vector Machines (SVM) with a Gaussian Mixture Model (GMM) supervector is introduced for emotional speech recognition. Because of the importance of variance in reflecting the distribution of speech, normalized mean vectors with the potential to exploit information from the variance are adopted to form the GMM supervector. Comparative experiments from five aspects are conducted to study their corresponding effects on system performance. The experimental results, which indicate that the influence of the number of mixtures is strong while the influence of duration is weak, provide a basis for the training-set selection of the Universal Background Model (UBM).
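
    The GMM-supervector construction can be sketched as MAP adaptation of UBM means to one utterance followed by stacking; the relevance factor, the use of scikit-learn's GaussianMixture as the UBM, and the omission of the letter's variance-based normalization are all illustrative simplifications.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def gmm_supervector(ubm: GaussianMixture, feats: np.ndarray, r: float = 16.0):
    """MAP-adapt the UBM means to one utterance's features (T, d)
    and stack them into a single (M*d,) vector to feed an SVM."""
    post = ubm.predict_proba(feats)             # (T, M) responsibilities
    n = post.sum(axis=0)                        # soft counts per mixture
    fx = post.T @ feats                         # (M, d) first-order stats
    alpha = (n / (n + r))[:, None]              # per-mixture adaptation weight
    mu = alpha * (fx / np.maximum(n[:, None], 1e-8)) + (1 - alpha) * ubm.means_
    return mu.ravel()                           # GMM supervector
```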

  11. Speech emotion recognition based on statistical pitch model

    Institute of Scientific and Technical Information of China (English)

    WANG Zhiping; ZHAO Li; ZOU Cairong

    2006-01-01

    A modified Parzen-window method, which keeps high resolution at low frequencies and smoothness at high frequencies, is proposed to obtain the statistical model. Then, a gender classification method utilizing the statistical model is proposed, which achieves 98% gender classification accuracy when long sentences are dealt with. By separating the male and female voices, the means and standard deviations of speech training samples with different emotions are used to create the corresponding emotion models. The Bhattacharyya distances between the test sample and the statistical pitch models are then utilized for emotion recognition in speech. The normalization of pitch for male and female voices is also considered, in order to map them into a uniform space. Finally, a speech emotion recognition experiment based on K Nearest Neighbors shows that a correct rate of 81% is achieved, compared with only 73.85% if the traditional parameters are utilized.
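
    For the simplest case of univariate Gaussian pitch models, the Bhattacharyya distance used above has the standard closed form below (the paper's statistical pitch models are more general):

```latex
% Bhattacharyya distance between N(\mu_1, \sigma_1^2) and N(\mu_2, \sigma_2^2):
D_B = \frac{(\mu_1 - \mu_2)^2}{4(\sigma_1^2 + \sigma_2^2)}
    + \frac{1}{2}\,\ln\frac{\sigma_1^2 + \sigma_2^2}{2\,\sigma_1\sigma_2}
```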

  12. Mechanisms of enhancing visual-speech recognition by prior auditory information.

    Science.gov (United States)

    Blank, Helen; von Kriegstein, Katharina

    2013-01-15

    Speech recognition from visual-only faces is difficult, but can be improved by prior information about what is said. Here, we investigated how the human brain uses prior information from auditory speech to improve visual-speech recognition. In a functional magnetic resonance imaging study, participants performed a visual-speech recognition task, indicating whether the word spoken in visual-only videos matched the preceding auditory-only speech, and a control task (face-identity recognition) containing exactly the same stimuli. We localized a visual-speech processing network by contrasting activity during visual-speech recognition with the control task. Within this network, the left posterior superior temporal sulcus (STS) showed increased activity and interacted with auditory-speech areas if prior information from auditory speech did not match the visual speech. This mismatch-related activity and the functional connectivity to auditory-speech areas were specific for speech, i.e., they were not present in the control task. The mismatch-related activity correlated positively with performance, indicating that posterior STS was behaviorally relevant for visual-speech recognition. In line with predictive coding frameworks, these findings suggest that prediction error signals are produced if visually presented speech does not match the prediction from preceding auditory speech, and that this mechanism plays a role in optimizing visual-speech recognition by prior information. Copyright © 2012 Elsevier Inc. All rights reserved.

  13. How noise and language proficiency influence speech recognition by individual non-native listeners.

    Science.gov (United States)

    Zhang, Jin; Xie, Lingli; Li, Yongjun; Chatterjee, Monita; Ding, Nai

    2014-01-01

    This study investigated how speech recognition in noise is affected by language proficiency for individual non-native speakers. The recognition of English and Chinese sentences was measured as a function of the signal-to-noise ratio (SNR) in sixty native Chinese speakers who had never lived in an English-speaking environment. The recognition score for speech in quiet (which varied from 15% to 92%) was found to be uncorrelated with the speech recognition threshold SRT(Q/2), i.e., the SNR at which the recognition score drops to 50% of the recognition score in quiet. This result demonstrates separable contributions of language proficiency and auditory processing to speech recognition in noise.

  14. Simultaneous Blind Separation and Recognition of Speech Mixtures Using Two Microphones to Control a Robot Cleaner

    Directory of Open Access Journals (Sweden)

    Heungkyu Lee

    2013-02-01

    Full Text Available This paper proposes a method for the simultaneous separation and recognition of speech mixtures in noisy environments using two‐channel independent vector analysis (IVA) on a home robot cleaner. The issues to be considered in our target application are speech recognition at a distance and noise removal to cope with a variety of noises, including TV sounds, air conditioners, and babble, that can occur in a house, where people can utter a voice command to control a robot cleaner at any time and at any location, even while the robot cleaner is moving. Thus, the system should always be in a recognition‐ready state to promptly recognize a spoken word at any time, and the false acceptance rate should be low. To cope with these issues, the keyword spotting technique is applied. In addition, a microphone alignment method and a model‐based real‐time IVA approach are proposed to effectively and simultaneously process the speech and noise sources, as well as to cover 360‐degree directions irrespective of distance. Experimental evaluations show that the proposed method is robust in terms of speech recognition accuracy, even when the speaker location is unfixed and changes over time. In addition, the proposed method shows good performance in severely noisy environments.

  15. Speech Recognition Supported by Lip Analysis

    OpenAIRE

    Butt, Waqqas ur Rehman

    2016-01-01

    Computers have become more pervasive than ever, with a wide range of devices and multiple ways of interaction. Traditional ways of human-computer interaction using keyboards, mice, and display monitors are being replaced by more natural modes such as speech, touch, and gesture. The continuous progress of technology brings about an irreversible change in the paradigms of interaction between human and machine. These modes are now used in daily life in many devices that have revolutionized the way users interac...

  16. Incorporating Speech Recognition into a Natural User Interface

    Science.gov (United States)

    Chapa, Nicholas

    2017-01-01

    The Augmented/Virtual Reality (AVR) Lab has been working to study the applicability of recent virtual and augmented reality hardware and software to KSC operations. This includes the Oculus Rift, HTC Vive, Microsoft HoloLens, and the Unity game engine. My project in this lab is to integrate voice recognition and voice commands into an easy-to-modify system that can be added to an existing portion of a Natural User Interface (NUI). A NUI is an intuitive and simple-to-use interface incorporating visual, touch, and speech recognition. The inclusion of speech recognition capability will allow users to perform actions or make inquiries using only their voice. The simplicity of needing only to speak to control an on-screen object or enact some digital action means that any user can quickly become accustomed to using this system. Multiple programs were tested for use in a speech command and recognition system. Sphinx4 translates speech to text using a Hidden Markov Model (HMM) based language model, an acoustic model, and a word dictionary, running on Java. PocketSphinx has similar functionality to Sphinx4 but runs on C. However, neither of these programs was ideal, as building a Java or C wrapper slowed performance. The best speech recognition system tested was the Unity Engine Grammar Recognizer. A Context Free Grammar (CFG) structure is written in an XML file to specify the structure of phrases and words that will be recognized by the Unity Grammar Recognizer. Using the Speech Recognition Grammar Specification (SRGS) 1.0 makes modifying the recognized combinations of words and phrases very simple and quick. With SRGS 1.0, semantic information can also be added to the XML file, which allows for even more control over how spoken words and phrases are interpreted by Unity. Additionally, using a CFG with SRGS 1.0 produces Finite State Machine (FSM) functionality, limiting the potential for incorrectly heard words or phrases. The purpose of my project was to

  17. A Research of Speech Emotion Recognition Based on Deep Belief Network and SVM

    Directory of Open Access Journals (Sweden)

    Chenchen Huang

    2014-01-01

    Full Text Available Feature extraction is a very important part of speech emotion recognition, and to address feature extraction in speech emotion recognition, this paper proposes a new method that uses the DBNs of a DNN to extract emotional features from the speech signal automatically. By training a 5-layer-deep DBN, speech emotion features are extracted, and multiple consecutive frames are incorporated to form a high-dimensional feature. The features after training in the DBN are the input of a nonlinear SVM classifier, and finally a speech emotion recognition multi-classifier system is achieved. The speech emotion recognition rate of the system reached 86.5%, which was 7% higher than the original method.

  18. New Ideas for Speech Recognition and Related Technologies

    Energy Technology Data Exchange (ETDEWEB)

    Holzrichter, J F

    2002-06-17

    The ideas relating to the use of organ motion sensors for the purposes of speech recognition were first described by the author in spring 1994. During the past year, a series of productive collaborations between the author, Tom McEwan, and Larry Ng ensued and have led to demonstrations, new sensor ideas, and algorithmic descriptions of a large number of speech recognition concepts. This document summarizes the basic concepts of recognizing speech once organ motions have been obtained. Micro-power radars and their uses for the measurement of body organ motions, such as those of the heart and lungs, have been demonstrated by Tom McEwan over the past two years. McEwan and I conducted a series of experiments, using these instruments, on vocal organ motions beginning in late spring, during which we observed motions of vocal folds (i.e., cords), tongue, jaw, and related organs that are very useful for speech recognition and other purposes. These will be reviewed in a separate paper. Since late summer 1994, Lawrence Ng and I have worked to make many of the initial recognition ideas more rigorous and to investigate the applications of these new ideas to new speech recognition algorithms, to speech coding, and to speech synthesis. I introduce some of those ideas in section IV of this document, and we describe them more completely in the document following this one, UCRL-UR-120311. For the design and operation of micro-power radars and their application to body organ motions, the reader may contact Tom McEwan directly. The capability of using EM sensors (i.e., radar units) to measure body organ motions and positions has been available for decades. Impediments to their use appear to have been size, excessive power, lack of resolution, and lack of understanding of the value of organ motion measurements, especially as applied to speech-related technologies. However, with the invention of very low-power, portable systems, as demonstrated by McEwan at LLNL, researchers have begun

  19. Temporal visual cues aid speech recognition

    DEFF Research Database (Denmark)

    Zhou, Xiang; Ross, Lars; Lehn-Schiøler, Tue

    2006-01-01

    BACKGROUND: It is well known that under noisy conditions, viewing a speaker's articulatory movement aids the recognition of spoken words. Conventionally it is thought that the visual input disambiguates otherwise confusing auditory input. HYPOTHESIS: In contrast we hypothesize that it is the temporal synchronicity of the visual input that aids parsing of the auditory stream. More specifically, we expected that purely temporal information, which does not convey information such as place of articulation, may facilitate word recognition. METHODS: To test this prediction we used temporal features... of audio to generate an artificial talking-face video and measured word recognition performance on simple monosyllabic words. RESULTS: When presenting words together with the artificial video we find that word recognition is improved over purely auditory presentation. The effect is significant (p...

  20. Tone model integration based on discriminative weight training for Putonghua speech recognition

    Institute of Scientific and Technical Information of China (English)

    HUANG Hao; ZHU Jie

    2008-01-01

    A discriminative framework for tone model integration in continuous speech recognition is proposed. The method uses model-dependent weights to scale the probabilities of the hidden Markov models based on spectral features and of the tone models based on tonal features. The weights are discriminatively trained by the minimum phone error criterion, and the update equation for the model weights, based on the extended Baum-Welch algorithm, is derived. Various schemes of model weight combination are evaluated, and a smoothing technique is introduced to make training robust to overfitting. The proposed method is evaluated on tonal-syllable output and character output speech recognition tasks. The experimental results show that the proposed method obtains 9.5% and 4.7% relative error reductions over global weights on the two tasks, owing to a better interpolation of the given models. This proves the effectiveness of discriminatively trained model weights for tone model integration.
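
    The combination can be written as a log-linear interpolation with model-dependent weights; the form below is generic and consistent with the description, though the paper's exact parameterization of which scores are scaled may differ:

```latex
% Log-linear integration of the spectral HMM score and the tone-model
% score for word (or tonal syllable) w, with model-dependent weight
% \lambda_w trained by minimum phone error:
\log P(O \mid w) = \log P_{\mathrm{HMM}}(O^{\mathrm{spec}} \mid w)
                 + \lambda_w \,\log P_{\mathrm{tone}}(O^{\mathrm{tonal}} \mid w)
```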

  1. Audio-Visual Speech Recognition Using Lip Information Extracted from Side-Face Images

    Directory of Open Access Journals (Sweden)

    Iwano Koji

    2007-01-01

    Full Text Available This paper proposes an audio-visual speech recognition method using lip information extracted from side-face images as an attempt to increase noise robustness in mobile environments. Our proposed method assumes that lip images can be captured using a small camera installed in a handset. Two different kinds of lip features, lip-contour geometric features and lip-motion velocity features, are used individually or jointly, in combination with audio features. Phoneme HMMs modeling the audio and visual features are built based on the multistream HMM technique. Experiments conducted using Japanese connected digit speech contaminated with white noise in various SNR conditions show effectiveness of the proposed method. Recognition accuracy is improved by using the visual information in all SNR conditions. These visual features were confirmed to be effective even when the audio HMM was adapted to noise by the MLLR method.

  2. Audio-Visual Speech Recognition Using Lip Information Extracted from Side-Face Images

    Directory of Open Access Journals (Sweden)

    Koji Iwano

    2007-03-01

    Full Text Available This paper proposes an audio-visual speech recognition method using lip information extracted from side-face images as an attempt to increase noise robustness in mobile environments. Our proposed method assumes that lip images can be captured using a small camera installed in a handset. Two different kinds of lip features, lip-contour geometric features and lip-motion velocity features, are used individually or jointly, in combination with audio features. Phoneme HMMs modeling the audio and visual features are built based on the multistream HMM technique. Experiments conducted using Japanese connected digit speech contaminated with white noise in various SNR conditions show effectiveness of the proposed method. Recognition accuracy is improved by using the visual information in all SNR conditions. These visual features were confirmed to be effective even when the audio HMM was adapted to noise by the MLLR method.

  3. Robust Behavior Recognition in Intelligent Surveillance Environments

    Science.gov (United States)

    Batchuluun, Ganbayar; Kim, Yeong Gon; Kim, Jong Hyun; Hong, Hyung Gil; Park, Kang Ryoung

    2016-01-01

    Intelligent surveillance systems have been studied by many researchers. These systems should be operated in both daytime and nighttime, but objects are invisible in images captured by a visible light camera during the night. Therefore, near infrared (NIR) cameras and thermal cameras (based on medium-wavelength infrared (MWIR) and long-wavelength infrared (LWIR) light) have been considered as alternatives for nighttime use. Because the system must work during both daytime and nighttime, and because NIR cameras require an additional NIR illuminator that must illuminate a wide area over a great distance at night, a dual system of visible light and thermal cameras is used in our research, and we propose a new behavior recognition method for intelligent surveillance environments. Twelve datasets were compiled by collecting data in various environments, and they were used to obtain experimental results. The recognition accuracy of our method was found to be 97.6%, thereby confirming the ability of our method to outperform previous methods. PMID:27376288

  4. Robust Behavior Recognition in Intelligent Surveillance Environments

    Directory of Open Access Journals (Sweden)

    Ganbayar Batchuluun

    2016-06-01

    Full Text Available Intelligent surveillance systems have been studied by many researchers. These systems should be operated in both daytime and nighttime, but objects are invisible in images captured by a visible light camera during the night. Therefore, near infrared (NIR) cameras and thermal cameras (based on medium-wavelength infrared (MWIR) and long-wavelength infrared (LWIR) light) have been considered as alternatives for nighttime use. Because the system must work during both daytime and nighttime, and because NIR cameras require an additional NIR illuminator that must illuminate a wide area over a great distance at night, a dual system of visible light and thermal cameras is used in our research, and we propose a new behavior recognition method for intelligent surveillance environments. Twelve datasets were compiled by collecting data in various environments, and they were used to obtain experimental results. The recognition accuracy of our method was found to be 97.6%, thereby confirming the ability of our method to outperform previous methods.

  5. Text Independent Speaker Recognition and Speaker Independent Speech Recognition Using Iterative Clustering Approach

    Directory of Open Access Journals (Sweden)

    A.Revathi

    2009-11-01

    Full Text Available This paper presents the effectiveness of perceptual features and an iterative clustering approach for performing both speech and speaker recognition. The procedure used for forming the training speech differs between the training models for speaker-independent speech recognition and text-independent speaker recognition. This work mainly emphasizes the utilization of clustering models developed from the training data to obtain better accuracies of 91%, 91%, and 99.5% for mel-frequency perceptual linear predictive cepstrum with respect to three categories: speaker identification, isolated digit recognition, and continuous speech recognition. This feature also produces a low equal error rate of 9%, which is used as a performance measure for speaker verification. The work is experimentally evaluated on sets of isolated digits and continuous speech from the TI digits_1 and TI digits_2 databases for speech recognition, and on the speech of 50 speakers randomly chosen from the TIMIT database for speaker recognition. The noteworthy feature of the speaker recognition algorithm is that the testing procedure is evaluated on identical messages from all 50 speakers, with theoretical validation of the results using the F-ratio and statistical validation using the χ² distribution.
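
    The iterative clustering can be sketched as one vector-quantization codebook per class, with classification by minimum average quantization error; the codebook size, iteration count, and all names are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np

def train_codebook(X, k=64, iters=20, seed=0):
    """Plain k-means codebook over one class's (N, d) feature vectors
    (speaker, digit, or phrase), trained by iterative reassignment."""
    rng = np.random.default_rng(seed)
    C = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(iters):
        assign = np.argmin(((X[:, None, :] - C[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(assign == j):
                C[j] = X[assign == j].mean(axis=0)   # update non-empty cells
    return C

def classify(feats, codebooks):
    """Assign an utterance to the class whose codebook quantizes its
    perceptual feature vectors with the lowest average error."""
    err = {label: np.mean(np.min(((feats[:, None, :] - C[None]) ** 2).sum(-1),
                                 axis=1))
           for label, C in codebooks.items()}
    return min(err, key=err.get)
```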

  6. Sine-wave speech recognition in a tonal language.

    Science.gov (United States)

    Feng, Yan-Mei; Xu, Li; Zhou, Ning; Yang, Guang; Yin, Shan-Kai

    2012-02-01

    It is hypothesized that in sine-wave replicas of natural speech, lexical tone recognition would be severely impaired due to the loss of F0 information, but the linguistic information at the sentence level could be retrieved even with limited tone information. Forty-one native Mandarin-Chinese-speaking listeners participated in the experiments. Results showed that sine-wave tone-recognition performance was on average only 32.7% correct. However, sine-wave sentence-recognition performance was very accurate, approximately 92% correct on average. Therefore the functional load of lexical tones on sentence recognition is limited, and the high-level recognition of sine-wave sentences is likely attributed to the perceptual organization that is influenced by top-down processes. © 2012 Acoustical Society of America

  7. Biologically inspired emotion recognition from speech

    Science.gov (United States)

    Caponetti, Laura; Buscicchio, Cosimo Alessandro; Castellano, Giovanna

    2011-12-01

    Emotion recognition has become a fundamental task in human-computer interaction systems. In this article, we propose an emotion recognition approach based on biologically inspired methods. Specifically, emotion classification is performed using a long short-term memory (LSTM) recurrent neural network which is able to recognize long-range dependencies between successive temporal patterns. We propose to represent data using features derived from two different models: mel-frequency cepstral coefficients (MFCC) and the Lyon cochlear model. In the experimental phase, results obtained from the LSTM network and the two different feature sets are compared, showing that features derived from the Lyon cochlear model give better recognition results in comparison with those obtained with the traditional MFCC representation.

  8. Biologically inspired emotion recognition from speech

    Directory of Open Access Journals (Sweden)

    Buscicchio Cosimo

    2011-01-01

    Full Text Available Abstract Emotion recognition has become a fundamental task in human-computer interaction systems. In this article, we propose an emotion recognition approach based on biologically inspired methods. Specifically, emotion classification is performed using a long short-term memory (LSTM recurrent neural network which is able to recognize long-range dependencies between successive temporal patterns. We propose to represent data using features derived from two different models: mel-frequency cepstral coefficients (MFCC and the Lyon cochlear model. In the experimental phase, results obtained from the LSTM network and the two different feature sets are compared, showing that features derived from the Lyon cochlear model give better recognition results in comparison with those obtained with the traditional MFCC representation.

  9. Undersampled face recognition via robust auxiliary dictionary learning.

    Science.gov (United States)

    Wei, Chia-Po; Wang, Yu-Chiang Frank

    2015-06-01

    In this paper, we address the problem of robust face recognition with undersampled training data. Given only one or a few training images available per subject, we present a novel recognition approach which not only handles test images with large intraclass variations such as illumination and expression, but also handles images corrupted by occlusion or disguise not present during training. This is achieved by learning a robust auxiliary dictionary from subjects not of interest. Together with the undersampled training data, both intraclass and interclass variations can thus be successfully handled, while unseen occlusions can be automatically disregarded for improved recognition. Our experiments on four face image datasets confirm the effectiveness and robustness of our approach, which is shown to outperform state-of-the-art sparse representation-based methods.

  10. Hemispheric lateralization of linguistic prosody recognition in comparison to speech and speaker recognition.

    Science.gov (United States)

    Kreitewolf, Jens; Friederici, Angela D; von Kriegstein, Katharina

    2014-11-15

    Hemispheric specialization for linguistic prosody is a controversial issue. While it is commonly assumed that linguistic prosody and emotional prosody are preferentially processed in the right hemisphere, neuropsychological work directly comparing processes of linguistic prosody and emotional prosody suggests a predominant role of the left hemisphere for linguistic prosody processing. Here, we used two functional magnetic resonance imaging (fMRI) experiments to clarify the role of left and right hemispheres in the neural processing of linguistic prosody. In the first experiment, we sought to confirm previous findings showing that linguistic prosody processing compared to other speech-related processes predominantly involves the right hemisphere. Unlike previous studies, we controlled for stimulus influences by employing a prosody and speech task using the same speech material. The second experiment was designed to investigate whether a left-hemispheric involvement in linguistic prosody processing is specific to contrasts between linguistic prosody and emotional prosody or whether it also occurs when linguistic prosody is contrasted against other non-linguistic processes (i.e., speaker recognition). Prosody and speaker tasks were performed on the same stimulus material. In both experiments, linguistic prosody processing was associated with activity in temporal, frontal, parietal and cerebellar regions. Activation in temporo-frontal regions showed differential lateralization depending on whether the control task required recognition of speech or speaker: recognition of linguistic prosody predominantly involved right temporo-frontal areas when it was contrasted against speech recognition; when contrasted against speaker recognition, recognition of linguistic prosody predominantly involved left temporo-frontal areas. The results show that linguistic prosody processing involves functions of both hemispheres and suggest that recognition of linguistic prosody is based on

  11. Robust face recognition algorithm for identification of disaster victims

    Science.gov (United States)

    Gevaert, Wouter J. R.; de With, Peter H. N.

    2013-02-01

    We present a robust face recognition algorithm for the identification of occluded, injured and mutilated faces with a limited training set per person. In such cases, conventional face recognition methods fall short due to specific aspects of the classification. The proposed algorithm involves recursive Principal Component Analysis for reconstruction of affected facial parts, followed by a feature extractor based on Gabor wavelets and uniform multi-scale Local Binary Patterns. As a classifier, a Radial Basis Neural Network is employed. In terms of robustness to facial abnormalities, tests show that the proposed algorithm outperforms conventional face recognition algorithms like the Eigenfaces approach, Local Binary Patterns and the Gabor magnitude method. To mimic real-life conditions in which the algorithm would have to operate, specific databases were constructed and merged with subsets of existing databases. Experiments on these particular databases show that the proposed algorithm achieves recognition rates beyond 95%.

  12. Robust Convolutional Neural Networks for Image Recognition

    Directory of Open Access Journals (Sweden)

    Hayder M. Albeahdili

    2015-11-01

    Full Text Available Image recognition has recently become a vital task addressed by several methods. One of the most interesting is the Convolutional Neural Network (CNN), which is widely used for this purpose. However, some tasks contain small features that are essential to the task, and classification using a CNN can be inefficient because most of those features diminish before reaching the final classification stage. In this work, we analyze and explore essential parameters that can influence model performance. Furthermore, several prior contemporary models are recruited to introduce a new leveraging model. Finally, a new CNN architecture is proposed which achieves state-of-the-art classification results on different challenge benchmarks. The experiments are conducted on the MNIST, CIFAR-10, and CIFAR-100 datasets. Experimental results show that the proposed architecture achieves superior results compared to the most contemporary approaches.

  13. Robust multi-camera view face recognition

    CERN Document Server

    Kisku, Dakshina Ranjan; Gupta, Phalguni; Sing, Jamuna Kanta

    2010-01-01

    This paper presents multi-appearance fusion of Principal Component Analysis (PCA) and a generalization of Linear Discriminant Analysis (LDA) for a multi-camera view offline face recognition (verification) system. The generalization of LDA has been extended to establish correlations between the face classes in the transformed representation; this is called a canonical covariate. The proposed system uses Gabor filter banks for characterization of facial features by spatial frequency, spatial locality and orientation, to compensate for the variations of face instances caused by illumination, pose and facial expression changes. Convolution of a Gabor filter bank with face images produces Gabor face representations with high-dimensional feature vectors. PCA and canonical covariates are then applied on the Gabor face representations to reduce the high-dimensional feature spaces into low-dimensional Gabor eigenfaces and Gabor canonical faces. Reduced eigenface vector and canonical face vector are fused together usi...

  14. Working Papers in Speech Recognition, 3

    Science.gov (United States)

    1974-04-01

    [OCR fragments only; the recoverable content includes a discussion of phonetic features taking ternary values (+, -, or 0) rather than binary ones, and references such as Jakobson, R., G. Fant, and M. Halle (1951), Preliminaries to Speech Analysis, MIT, and Postal, P. (1968), Aspects of Phonological Theory.]

  15. Part-of-Speech Enhanced Context Recognition

    DEFF Research Database (Denmark)

    Madsen, Rasmus Elsborg; Larsen, Jan; Hansen, Lars Kai

    2004-01-01

    and a probabilistic neural network classifier. Three medium size data-sets are analyzed and we find consistent synergy between the term and natural language features in all three sets for a range of training set sizes. The most significant enhancement is found for small text databases where high recognition...

  16. Robust place recognition with an application to semantic topological mapping

    Science.gov (United States)

    Siddiqui, J. R.; Khatibi, S.

    2013-12-01

    The problem of robust and invariant representation of places is addressed. A place recognition technique is proposed, followed by an application to semantic topological mapping. The proposed technique is evaluated on a robot localization database which consists of a large set of images taken under various weather conditions. The results show that the proposed method can robustly recognize places and is invariant to geometric transformations, brightness changes and noise. The comparative analysis with state-of-the-art semantic place description methods shows that the method outperforms the competing methods and exhibits better average recognition rates.

  17. Speech Recognition Using HMM with MFCC-An Analysis Using Frequency Spectral Decomposition Technique

    Directory of Open Access Journals (Sweden)

    Ibrahim Patel

    2010-12-01

    Full Text Available This paper presents an approach to speech recognition using frequency spectral information with the Mel frequency scale to improve speech feature representation in an HMM-based recognition approach. Frequency spectral information is incorporated into the conventional Mel-spectrum-based speech recognition approach. The Mel frequency approach exploits the frequency observations for the speech signal at a given resolution, which results in overlapping resolution features and hence limits recognition. Resolution decomposition with frequency separation is used as the mapping approach for the HMM-based speech recognition system. Simulation results show an improvement in the quality metrics of speech recognition with respect to computational time and learning accuracy for a speech recognition system.
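
    As background for the Mel-frequency features used above, the following sketch extracts MFCCs and their deltas with librosa; the file name, sampling rate, and frame settings are hypothetical choices, not the paper's configuration.

        import numpy as np
        import librosa  # assumed available

        y, sr = librosa.load("utterance.wav", sr=16000)  # hypothetical recording
        mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                                    n_fft=400, hop_length=160)  # (13, frames)
        delta = librosa.feature.delta(mfcc)                     # dynamic features
        features = np.vstack([mfcc, delta]).T                   # (frames, 26)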

  18. Spectro-Temporal Analysis of Speech for Spanish Phoneme Recognition

    DEFF Research Database (Denmark)

    Sharifzadeh, Sara; Serrano, Javier; Carrabina, Jordi

    2012-01-01

    State-of-the-art speech recognition systems (ASR) mostly use Mel-frequency cepstral coefficients (MFCC) as acoustic features. In this paper, we propose a new discriminative analysis of acoustic features based on spectrogram analysis. Both spectral and temporal variations of the speech signal...... and enhanced by means of bi-cubic interpolation. An adaptive strategy is proposed for the size of patches over time to construct unique-length vectors for different phonemes. These vectors are classified based on K-nearest neighbor (KNN) and linear discriminant analysis (LDA) and reduced rank LDA (RLDA...

  19. Design of an expert system for phonetic speech recognition

    Energy Technology Data Exchange (ETDEWEB)

    Carbonell, N.; Haton, J.P.; Pierrel, J.M.; Lonchamp, F.

    1983-07-01

    Expert systems have been extensively used as a means for integrating the expertise of a human being into an artificial intelligence system. The authors are presently designing an expert system which will integrate the strategy and the knowledge of a phonetician reading a speech spectrogram. Their goal is twofold, firstly to obtain a better insight into the acoustic-decoding of speech, and, secondly, to improve the efficiency of present automatic phonetic recognition systems. This paper presents a preliminary description of the project, especially the overall strategy of the expert and the role of duration parameters in the segmentation and identification processes.

  20. EMOTION RECOGNITION FROM SPEECH SIGNAL: REALIZATION AND AVAILABLE TECHNIQUES

    Directory of Open Access Journals (Sweden)

    NILIM JYOTI GOGOI

    2014-05-01

    Full Text Available The ability to detect human emotion from speech will be a great addition in the field of human-robot interaction. The aim of the work is to build an emotion recognition system using Mel-frequency cepstral coefficients (MFCC) and a Gaussian mixture model (GMM) classifier. The purpose of the work is to describe the best available methods for recognizing emotion from emotional speech. For that reason, existing techniques and methods for feature extraction and pattern classification are reviewed and discussed in this paper.
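
    A minimal sketch of the MFCC-plus-GMM scheme the abstract outlines: one Gaussian mixture per emotion, with classification by the highest average frame log-likelihood. The helper names and mixture size are illustrative assumptions, not the authors' setup.

        import numpy as np
        from sklearn.mixture import GaussianMixture

        def train_gmms(features_by_emotion, n_components=8):
            """features_by_emotion: dict mapping emotion -> (n_frames, n_mfcc)."""
            return {emo: GaussianMixture(n_components=n_components,
                                         covariance_type="diag").fit(X)
                    for emo, X in features_by_emotion.items()}

        def classify(gmms, X):
            """Pick the emotion whose GMM scores the utterance's frames highest."""
            return max(gmms, key=lambda emo: gmms[emo].score(X))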

  1. Post-error Correction in Automatic Speech Recognition Using Discourse Information

    OpenAIRE

    Kang,S.; Kim, J. -H.; Seo, J.

    2014-01-01

    Overcoming speech recognition errors in the field of human-computer interaction is important in ensuring a consistent user experience. This paper proposes a semantic-oriented post-processing approach for the correction of errors in speech recognition. The novelty of the model proposed here is that it re-ranks the n-best hypothesis of speech recognition based on the user's intention, which is analyzed from previous discourse information, while conventional automatic speech reco...

  2. Joint Sparsity-Based Robust Multimodal Biometrics Recognition

    Science.gov (United States)

    2012-10-07

    joint sparsity-based algorithm for multimodal biometrics recognition. Our method is based on the well-known regularized regression method, multi-task...multivariate Lasso [7, 8]. Figure 1 presents an overview of our method. This paper makes the following contributions: - We present a robust feature...reduces to the conventional Lasso [10] when D = 1 and d = 1. For D = 1 (1), it is equivalent to the multivariate Lasso [7]. 2.2 Robust Multimodal

  3. Speech recognition for the anaesthesia record during crisis scenarios.

    Science.gov (United States)

    Alapetite, Alexandre

    2008-07-01

    improvements in speech recognition rates are needed. A vocal interface leads to shorter time between the events to be registered and the actual registration in the electronic anaesthesia record; therefore, this type of interface would likely lead to greater accuracy of items recorded and a reduction of mental workload associated with memorization of events to be registered, especially during time constrained situations. At the same time, current speech recognition technology and speech interfaces require user training and user dedication if a speech interface is to be used successfully.

  4. Increase in Speech Recognition Due to Linguistic Mismatch between Target and Masker Speech: Monolingual and Simultaneous Bilingual Performance

    Science.gov (United States)

    Calandruccio, Lauren; Zhou, Haibo

    2014-01-01

    Purpose: To examine whether improved speech recognition during linguistically mismatched target-masker experiments is due to linguistic unfamiliarity of the masker speech or linguistic dissimilarity between the target and masker speech. Method: Monolingual English speakers (n = 20) and English-Greek simultaneous bilinguals (n = 20) listened to…

  5. An overview of the SPHINX speech recognition system

    Science.gov (United States)

    Lee, Kai-Fu; Hon, Hsiao-Wuen; Reddy, Raj

    1990-01-01

    A description is given of SPHINX, a system that demonstrates the feasibility of accurate, large-vocabulary, speaker-independent, continuous speech recognition. SPHINX is based on discrete hidden Markov models (HMMs) with linear-predictive-coding derived parameters. To provide speaker independence, knowledge was added to these HMMs in several ways: multiple codebooks of fixed-width parameters, and an enhanced recognizer with carefully designed models and word-duration modeling. To deal with coarticulation in continuous speech, yet still adequately represent a large vocabulary, two new subword speech units are introduced: function-word-dependent phone models and generalized triphone models. With grammars of perplexity 997, 60, and 20, SPHINX attained word accuracies of 71, 94, and 96 percent, respectively, on a 997-word task.

  6. Approximated mutual information training for speech recognition using myoelectric signals.

    Science.gov (United States)

    Guo, Hua J; Chan, A D C

    2006-01-01

    A new training algorithm called the approximated maximum mutual information (AMMI) is proposed to improve the accuracy of myoelectric speech recognition using hidden Markov models (HMMs). Previous studies have demonstrated that automatic speech recognition can be performed using myoelectric signals from articulatory muscles of the face. Classification of facial myoelectric signals can be performed using HMMs that are trained using the maximum likelihood (ML) algorithm; however, this algorithm maximizes the likelihood of the observations in the training sequence, which is not directly associated with optimal classification accuracy. The AMMI training algorithm attempts to maximize the mutual information, thereby training the HMMs to optimize their parameters for discrimination. Our results show that AMMI training consistently reduces the error rates compared to those obtained with ML training, increasing the accuracy by approximately 3% on average.

  7. Connected digit speech recognition system for Malayalam language

    Indian Academy of Sciences (India)

    Cini Kurian; Kannan Balakrishnan

    2013-12-01

    Connected digit speech recognition is important in many applications such as automated banking systems, catalogue dialing and automatic data entry. This paper presents an optimum speaker-independent connected digit recognizer for the Malayalam language. The system employs Perceptual Linear Predictive (PLP) cepstral coefficients for speech parameterization and continuous density Hidden Markov Models (HMM) in the recognition process. The Viterbi algorithm is used for decoding. The training database has utterances from 21 speakers in the age group of 20 to 40 years, recorded in a normal office environment, where each speaker was asked to read 20 sets of continuous digits. The system obtained an accuracy of 99.5% with unseen data.
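
    Since decoding here is done with the Viterbi algorithm, a compact log-domain Viterbi over precomputed emission scores may help fix ideas; this is a generic textbook sketch, not the system's implementation.

        import numpy as np

        def viterbi(log_A, log_B, log_pi):
            """Most likely state path. log_A: (S, S) transition log-probs,
            log_B: (T, S) per-frame emission log-likelihoods, log_pi: (S,)."""
            T, S = log_B.shape
            delta = log_pi + log_B[0]
            psi = np.zeros((T, S), dtype=int)
            for t in range(1, T):
                scores = delta[:, None] + log_A       # (prev, cur)
                psi[t] = scores.argmax(axis=0)
                delta = scores.max(axis=0) + log_B[t]
            path = [int(delta.argmax())]
            for t in range(T - 1, 0, -1):             # backtrace
                path.append(int(psi[t, path[-1]]))
            return path[::-1]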

  8. Dynamic Bayesian Networks for Audio-Visual Speech Recognition

    Directory of Open Access Journals (Sweden)

    Liang Luhong

    2002-01-01

    Full Text Available The use of visual features in audio-visual speech recognition (AVSR) is justified both by the speech generation mechanism, which is essentially bimodal in its audio and visual representations, and by the need for features that are invariant to acoustic noise perturbation. As a result, current AVSR systems demonstrate significant accuracy improvements in environments affected by acoustic noise. In this paper, we describe the use of two statistical models for audio-visual integration, the coupled HMM (CHMM) and the factorial HMM (FHMM), and compare the performance of these models with existing models used in speaker-dependent audio-visual isolated word recognition. The statistical properties of both the CHMM and FHMM make it possible to model the state asynchrony of the audio and visual observation sequences while preserving their natural correlation over time. In our experiments, the CHMM performs best overall, outperforming all the existing models and the FHMM.

  9. Automatic Speech Recognition Systems for the Evaluation of Voice and Speech Disorders in Head and Neck Cancer

    Directory of Open Access Journals (Sweden)

    Andreas Maier

    2010-01-01

    Full Text Available In patients suffering from head and neck cancer, speech intelligibility is often restricted. For assessment and outcome measurements, automatic speech recognition systems have previously been shown to be appropriate for objective and quick evaluation of intelligibility. In this study we investigate the applicability of the method to speech disorders caused by head and neck cancer. Intelligibility was quantified by speech recognition on recordings of a standard text read by 41 German laryngectomized patients with cancer of the larynx or hypopharynx and 49 German patients who had suffered from oral cancer. The speech recognition provides the percentage of correctly recognized words of a sequence, that is, the word recognition rate. Automatic evaluation was compared to perceptual ratings by a panel of experts and to an age-matched control group. Both patient groups showed significantly lower word recognition rates than the control group. Automatic speech recognition yielded word recognition rates which agreed with the experts' evaluation of intelligibility at a significant level. Automatic speech recognition serves as a good, low-effort means to objectify and quantify the most important aspect of pathologic speech: intelligibility. The system was successfully applied to voice and speech disorders.
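
    Word recognition rates of this kind are typically derived from an edit-distance alignment between the reference text and the recognizer output; below is a generic sketch of the standard word-error computation (not the study's code), from which a word recognition rate can be derived.

        def word_error_rate(reference, hypothesis):
            """Levenshtein distance over words, normalized by reference length."""
            r, h = reference.split(), hypothesis.split()
            d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
            for i in range(len(r) + 1):
                d[i][0] = i                      # deletions
            for j in range(len(h) + 1):
                d[0][j] = j                      # insertions
            for i in range(1, len(r) + 1):
                for j in range(1, len(h) + 1):
                    sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
                    d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
            return d[len(r)][len(h)] / max(len(r), 1)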

  10. Speech Acquisition and Automatic Speech Recognition for Integrated Spacesuit Audio Systems

    Science.gov (United States)

    Huang, Yiteng; Chen, Jingdong; Chen, Shaoyan

    2010-01-01

    A voice-command human-machine interface system has been developed for spacesuit extravehicular activity (EVA) missions. A multichannel acoustic signal processing method has been created for distant speech acquisition in noisy and reverberant environments. This technology reduces noise by exploiting differences in the statistical nature of signal (i.e., speech) and noise that exist in the spatial and temporal domains. As a result, the automatic speech recognition (ASR) accuracy can be improved to the level at which crewmembers would find the speech interface useful. The developed speech human/machine interface will enable both crewmember usability and operational efficiency. It offers a fast rate of data/text entry, a small overall size, and light weight. In addition, this design frees the hands and eyes of a suited crewmember. The system components and steps include beamforming/multi-channel noise reduction, single-channel noise reduction, speech feature extraction, feature transformation and normalization, feature compression, model adaptation, ASR HMM (Hidden Markov Model) training, and ASR decoding. A state-of-the-art phoneme recognizer can obtain an accuracy rate of 65 percent when the training and testing data are free of noise. When it is used in spacesuits, the rate drops to about 33 percent. With the developed microphone array speech-processing technologies, the performance is improved and the phoneme recognition accuracy rate rises to 44 percent. The recognizer can be further improved by combining the microphone array and HMM model adaptation techniques and using speech samples collected from inside spacesuits. In addition, arithmetic complexity models for the major HMM-based ASR components were developed. They can help real-time ASR system designers select proper tasks when facing constraints in computational resources.
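
    The beamforming/multi-channel noise reduction step can be illustrated with a frequency-domain delay-and-sum beamformer. The sketch below assumes the per-channel delays are already known from the array geometry and look direction, which is a simplification of the actual system.

        import numpy as np

        def delay_and_sum(mics, sr, delays):
            """mics: (n_ch, n_samples) microphone signals; delays: seconds
            by which each channel lags the look direction (assumed known)."""
            n_ch, n = mics.shape
            freqs = np.fft.rfftfreq(n, d=1.0 / sr)
            acc = np.zeros(len(freqs), dtype=complex)
            for c in range(n_ch):
                spec = np.fft.rfft(mics[c])
                acc += spec * np.exp(2j * np.pi * freqs * delays[c])  # undo delay
            return np.fft.irfft(acc / n_ch, n=n)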

  11. Combining Multiple Knowledge Sources for Continuous Speech Recognition

    Science.gov (United States)

    1989-08-01

    phonetic, phonological, and grammatical knowledge. The complete system, called BYBLOS, has been shown to achieve the highest recognition accuracy to... speech which is capable of incorporating knowledge from several sources, including lexical, phonetic, phonological, and grammatical knowledge. The... accuracy. Some of these include: a phonetic lexicon specifying the most likely pronunciations for each word, extended by a set of phonological rules

  12. Study on Unequal Error Protection for Distributed Speech Recognition System

    Institute of Scientific and Technical Information of China (English)

    XIE Xiang; WANG Si-yao; LIU Jia-kang

    2006-01-01

    The unequal error protection (UEP) is applied in a distributed speech recognition (DSR) system and three schemes are proposed. All three schemes are evaluated on a GSM simulation platform for recognizing Mandarin digit strings and compared with the equal error protection (EEP) scheme. Experiments show that UEP can protect the data transmitted in the DSR system more effectively, which results in a higher word accuracy rate for the DSR system.

  13. Error Correction Using Long Context Match for Smartphone Speech Recognition

    OpenAIRE

    梁, 原; Liang, Yuan; 岩野, 公司; Iwano, Koji; 篠田, 浩一; Shinoda, Koichi

    2015-01-01

    Most error correction interfaces for speech recognition applications on smartphones require the user to first mark an error region and choose the correct word from a candidate list. We propose a simple multimodal interface to make the process more efficient. We develop Long Context Match (LCM) to get candidates that complement the conventional word confusion network (WCN). Assuming that not only the preceding words but also the succeeding words of the error region are validated by users, we u...

  14. An audio-visual corpus for multimodal speech recognition in Dutch language

    NARCIS (Netherlands)

    Wojdel, J.; Wiggers, P.; Rothkrantz, L.J.M.

    2002-01-01

    This paper describes the gathering and availability of an audio-visual speech corpus for Dutch language. The corpus was prepared with the multi-modal speech recognition in mind and it is currently used in our research on lip-reading and bimodal speech recognition. It contains the prompts used also i

  17. Speech processing in mobile environments

    CERN Document Server

    Rao, K Sreenivasa

    2014-01-01

    This book focuses on speech processing in the presence of low-bit-rate coding and varying background environments. The methods presented in the book exploit speech events which are robust in noisy environments. Accurate estimation of these crucial events will be useful for carrying out various speech tasks such as speech recognition, speaker recognition and speech rate modification in mobile environments. The authors provide insights into designing and developing robust methods to process speech in mobile environments, covering temporal and spectral enhancement methods that minimize the effect of noise, and examining methods and models for speech and speaker recognition applications in mobile environments.
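
    As an illustration of the spectral enhancement methods covered, here is a minimal spectral-subtraction-style sketch; the frame sizes, the noise estimate (leading frames assumed noise-only), and the spectral floor are illustrative simplifications, not the book's algorithms.

        import numpy as np

        def spectral_subtract(noisy, frame=400, hop=160, noise_frames=10, floor=0.1):
            """Subtract an estimated noise magnitude spectrum frame by frame."""
            win = np.hanning(frame)
            starts = range(0, len(noisy) - frame, hop)
            spec = np.array([np.fft.rfft(win * noisy[i:i + frame]) for i in starts])
            mag, phase = np.abs(spec), np.angle(spec)
            noise_mag = mag[:noise_frames].mean(axis=0)   # noise-only lead-in
            clean = np.maximum(mag - noise_mag, floor * mag)
            frames = np.fft.irfft(clean * np.exp(1j * phase), n=frame)
            out = np.zeros(len(noisy))
            for k, i in enumerate(starts):
                out[i:i + frame] += frames[k] * win       # overlap-add
            return out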

  18. Merge-Weighted Dynamic Time Warping for Speech Recognition

    Institute of Scientific and Technical Information of China (English)

    张湘莉兰; 骆志刚; 李明

    2014-01-01

    Obtaining training material for rarely used English words and common given names from countries where English is not spoken is difficult due to excessive time, storage and cost factors. Considering personal privacy, language-independent (LI) recognition with lightweight speaker-dependent (SD) automatic speech recognition (ASR) is a convenient option to solve the problem. The dynamic time warping (DTW) algorithm is the state-of-the-art algorithm for small-footprint SD ASR for real-time applications with limited storage and small vocabularies. These applications include voice dialing on mobile devices, menu-driven recognition, and voice control on vehicles and robotics. However, traditional DTW has several limitations, such as high computational complexity, constraint-induced coarse approximation, and inaccuracy problems. In this paper, we introduce the merge-weighted dynamic time warping (MWDTW) algorithm. This method defines a template confidence index for measuring the similarity between merged training data and testing data, while following the core DTW process. MWDTW is simple, efficient, and easy to implement. With extensive experiments on three representative SD speech recognition datasets, we demonstrate that our method significantly outperforms DTW, DTW on merged speech data, and the hidden Markov model (HMM), and is also six times faster than DTW overall.
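
    For reference, the core DTW process that MWDTW builds on computes an optimal alignment cost by dynamic programming. A plain, unweighted sketch follows; the merge weighting and template confidence index of MWDTW are not shown.

        import numpy as np

        def dtw_distance(x, y):
            """x: (n, d) and y: (m, d) feature sequences; returns alignment cost."""
            n, m = len(x), len(y)
            D = np.full((n + 1, m + 1), np.inf)
            D[0, 0] = 0.0
            for i in range(1, n + 1):
                for j in range(1, m + 1):
                    cost = np.linalg.norm(x[i - 1] - y[j - 1])
                    D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
            return D[n, m]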

  19. Syntactic error modeling and scoring normalization in speech recognition: Error modeling and scoring normalization in the speech recognition task for adult literacy training

    Science.gov (United States)

    Olorenshaw, Lex; Trawick, David

    1991-01-01

    The purpose was to develop a speech recognition system able to detect speech which is pronounced incorrectly, given that the text of the spoken speech is known to the recognizer. Better mechanisms are provided for using speech recognition in a literacy tutor application. Using a combination of scoring normalization techniques and cheater-mode decoding, a reasonable acceptance/rejection threshold was provided. In continuous speech, the system provided above 80 percent correct acceptance of words, while correctly rejecting over 80 percent of incorrectly pronounced words.

  20. Does cochlear implantation improve speech recognition in children with auditory neuropathy spectrum disorder? A systematic review.

    Science.gov (United States)

    Humphriss, Rachel; Hall, Amanda; Maddocks, Jennefer; Macleod, John; Sawaya, Kathleen; Midgley, Elizabeth

    2013-07-01

    Cochlear implantation (CI) is a standard treatment for severe-profound sensorineural hearing loss (SNHL). However, consensus has yet to be reached on its effectiveness for hearing loss caused by auditory neuropathy spectrum disorder (ANSD). This review aims to summarize and synthesize current evidence of the effectiveness of CI in improving speech recognition in children with ANSD. Systematic review. A total of 27 studies from an initial selection of 237. All selected studies were observational in design, including case studies, cohort studies, and comparisons between children with ANSD and SNHL. Most children with ANSD achieved open-set speech recognition with their CI. Speech recognition ability was found to be equivalent in CI users (who previously performed poorly with hearing aids) and hearing-aid users. Outcomes following CI generally appeared similar in children with ANSD and SNHL. Assessment of study quality, however, suggested substantial methodological concerns, particularly in relation to issues of bias and confounding, limiting the robustness of any conclusions around effectiveness. Currently available evidence is compatible with favourable outcomes from CI in children with ANSD. However, this evidence is weak. Stronger evidence is needed to support cost-effective clinical policy and practice in this area.

  1. Age-Related Differences in Lexical Access Relate to Speech Recognition in Noise

    Science.gov (United States)

    Carroll, Rebecca; Warzybok, Anna; Kollmeier, Birger; Ruigendijk, Esther

    2016-01-01

    Vocabulary size has been suggested as a useful measure of “verbal abilities” that correlates with speech recognition scores. Knowing more words is linked to better speech recognition. How vocabulary knowledge translates to general speech recognition mechanisms, how these mechanisms relate to offline speech recognition scores, and how they may be modulated by acoustical distortion or age, is less clear. Age-related differences in linguistic measures may predict age-related differences in speech recognition in noise performance. We hypothesized that speech recognition performance can be predicted by the efficiency of lexical access, which refers to the speed with which a given word can be searched and accessed relative to the size of the mental lexicon. We tested speech recognition in a clinical German sentence-in-noise test at two signal-to-noise ratios (SNRs), in 22 younger (18–35 years) and 22 older (60–78 years) listeners with normal hearing. We also assessed receptive vocabulary, lexical access time, verbal working memory, and hearing thresholds as measures of individual differences. Age group, SNR level, vocabulary size, and lexical access time were significant predictors of individual speech recognition scores, but working memory and hearing threshold were not. Interestingly, longer accessing times were correlated with better speech recognition scores. Hierarchical regression models for each subset of age group and SNR showed very similar patterns: the combination of vocabulary size and lexical access time contributed most to speech recognition performance; only for the younger group at the better SNR (yielding about 85% correct speech recognition) did vocabulary size alone predict performance. Our data suggest that successful speech recognition in noise is mainly modulated by the efficiency of lexical access. This suggests that older adults’ poorer performance in the speech recognition task may have arisen from reduced efficiency in lexical access

  2. Automatic Speech Acquisition and Recognition for Spacesuit Audio Systems

    Science.gov (United States)

    Ye, Sherry

    2015-01-01

    NASA has a widely recognized but unmet need for novel human-machine interface technologies that can facilitate communication during astronaut extravehicular activities (EVAs), when loud noises and strong reverberations inside spacesuits make communication challenging. WeVoice, Inc., has developed a multichannel signal-processing method for speech acquisition in noisy and reverberant environments that enables automatic speech recognition (ASR) technology inside spacesuits. The technology reduces noise by exploiting differences between the statistical nature of signals (i.e., speech) and noise that exists in the spatial and temporal domains. As a result, ASR accuracy can be improved to the level at which crewmembers will find the speech interface useful. System components and features include beam forming/multichannel noise reduction, single-channel noise reduction, speech feature extraction, feature transformation and normalization, feature compression, and ASR decoding. Arithmetic complexity models were developed and will help designers of real-time ASR systems select proper tasks when confronted with constraints in computational resources. In Phase I of the project, WeVoice validated the technology. The company further refined the technology in Phase II and developed a prototype for testing and use by suited astronauts.

  3. Speaker Independent Speech Recognition of Isolated Words in Room Environment

    Directory of Open Access Journals (Sweden)

    M. Tabassum

    2017-04-01

    Full Text Available In this paper, the process of recognizing some important words from a large vocabulary is demonstrated based on the combination of dynamic and instantaneous features of the speech spectrum. There are many procedures to recognize a word by its vowel, but this paper presents a highly effective speaker-independent speech recognition approach for typical room-environment noise conditions. To distinguish several isolated words containing different vowel sounds, two important features, known as pitch and formants, are extracted from speech signals collected from a number of random male and female speakers. The extracted features are then analysed for the particular utterances to train the system. The specific objectives of this work are to implement an isolated, automatic word speech recognizer capable of recognizing and responding to speech, and an audio interfacing system between human and machine for effective human-machine interaction. The whole system has been tested using computer code, and the result was satisfactory in almost 90% of cases. However, the system might occasionally be confused by similar vowel sounds.
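
    The pitch feature mentioned here is commonly estimated from the autocorrelation of a voiced frame. A minimal sketch under that standard approach (the search range and the absence of voicing checks are simplifying assumptions):

        import numpy as np

        def pitch_autocorr(frame, sr, fmin=60.0, fmax=400.0):
            """Estimate F0 of a voiced frame from its autocorrelation peak."""
            frame = frame - frame.mean()
            ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
            lo, hi = int(sr / fmax), int(sr / fmin)   # plausible lag range
            lag = lo + int(np.argmax(ac[lo:hi]))
            return sr / lag                           # F0 in Hz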

  4. EXTENDED SPEECH EMOTION RECOGNITION AND PREDICTION

    Directory of Open Access Journals (Sweden)

    Theodoros Anagnostopoulos

    2014-11-01

    Full Text Available Humans are considered to reason and act rationally, and that is believed to be their fundamental difference from the rest of living entities. Furthermore, modern approaches in the science of psychology underline that humans, as thinking creatures, are also sentimental and emotional organisms. There are fifteen universal extended emotions plus a neutral emotion: hot anger, cold anger, panic, fear, anxiety, despair, sadness, elation, happiness, interest, boredom, shame, pride, disgust, contempt and the neutral position. The scope of the current research is to understand the emotional state of a human being by capturing the speech utterances used during a common conversation. It is shown that, with enough acoustic evidence available, the emotional state of a person can be classified by a set of majority voting classifiers. The proposed set of classifiers is based on three main classifiers: kNN, C4.5 and SVM with an RBF kernel. This set achieves better performance than each basic classifier taken separately. It is compared with two other sets of classifiers: a one-against-all (OAA) multiclass SVM with hybrid kernels, and a set of classifiers consisting of the following two basic classifiers: C5.0 and a neural network. The proposed variant achieves better performance than the other two sets of classifiers. The paper deals with emotion classification by a set of majority voting classifiers that combines three types of basic classifiers with low computational complexity. The basic classifiers stem from different theoretical backgrounds in order to avoid bias and redundancy, which gives the proposed set of classifiers the ability to generalize in the emotion domain space.
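
    A majority-voting set of this kind can be sketched with scikit-learn; note that C4.5 is approximated below by an entropy-criterion decision tree, and all hyperparameters are illustrative rather than the paper's.

        from sklearn.ensemble import VotingClassifier
        from sklearn.neighbors import KNeighborsClassifier
        from sklearn.svm import SVC
        from sklearn.tree import DecisionTreeClassifier

        ensemble = VotingClassifier(
            estimators=[("knn", KNeighborsClassifier(n_neighbors=5)),
                        ("tree", DecisionTreeClassifier(criterion="entropy")),
                        ("svm", SVC(kernel="rbf", gamma="scale"))],
            voting="hard")  # majority vote over the three predictions
        # ensemble.fit(X_train, y_train); ensemble.predict(X_test)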

  5. The self-advantage in visual speech processing enhances audiovisual speech recognition in noise.

    Science.gov (United States)

    Tye-Murray, Nancy; Spehar, Brent P; Myerson, Joel; Hale, Sandra; Sommers, Mitchell S

    2015-08-01

    Individuals lip read themselves more accurately than they lip read others when only the visual speech signal is available (Tye-Murray et al., Psychonomic Bulletin & Review, 20, 115-119, 2013). This self-advantage for vision-only speech recognition is consistent with the common-coding hypothesis (Prinz, European Journal of Cognitive Psychology, 9, 129-154, 1997), which posits (1) that observing an action activates the same motor plan representation as actually performing that action and (2) that observing one's own actions activates motor plan representations more than the others' actions because of greater congruity between percepts and corresponding motor plans. The present study extends this line of research to audiovisual speech recognition by examining whether there is a self-advantage when the visual signal is added to the auditory signal under poor listening conditions. Participants were assigned to sub-groups for round-robin testing in which each participant was paired with every member of their subgroup, including themselves, serving as both talker and listener/observer. On average, the benefit participants obtained from the visual signal when they were the talker was greater than when the talker was someone else and also was greater than the benefit others obtained from observing as well as listening to them. Moreover, the self-advantage in audiovisual speech recognition was significant after statistically controlling for individual differences in both participants' ability to benefit from a visual speech signal and the extent to which their own visual speech signal benefited others. These findings are consistent with our previous finding of a self-advantage in lip reading and with the hypothesis of a common code for action perception and motor plan representation.

  6. Development of a Mandarin-English Bilingual Speech Recognition System for Real World Music Retrieval

    Science.gov (United States)

    Zhang, Qingqing; Pan, Jielin; Lin, Yang; Shao, Jian; Yan, Yonghong

    In recent decades, there has been a great deal of research into the problem of bilingual speech recognition: developing a recognizer that can handle inter- and intra-sentential language switching between two languages. This paper presents our recent work on the development of a grammar-constrained, Mandarin-English bilingual Speech Recognition System (MESRS) for real world music retrieval. Two of the main difficult issues in handling bilingual speech recognition systems for real world applications are tackled in this paper. One is to balance the performance and the complexity of the bilingual speech recognition system; the other is to effectively deal with the matrix language accents in the embedded language. In order to process the intra-sentential language switching and reduce the amount of data required to robustly estimate statistical models, a compact single set of bilingual acoustic models derived by phone set merging and clustering is developed instead of using two separate monolingual models for each language. In our study, a novel Two-pass phone clustering method based on Confusion Matrix (TCM) is presented and compared with the log-likelihood measure method. Experiments testify that TCM can achieve better performance. Since potential system users' native language is Mandarin, which is regarded as the matrix language in our application, their pronunciations of English as the embedded language usually contain Mandarin accents. In order to deal with the matrix language accents in the embedded language, different non-native adaptation approaches are investigated. Experiments show that the model retraining method outperforms other common adaptation methods such as Maximum A Posteriori (MAP). With the effective incorporation of approaches on phone clustering and non-native adaptation, the Phrase Error Rate (PER) of MESRS for English utterances was reduced by 24.47% relatively compared to the baseline monolingual English system while the PER on Mandarin utterances was

  7. VFM:Visual Feedback Model for Robust Object Recognition

    Institute of Scientific and Technical Information of China (English)

    王冲; 黄凯奇

    2015-01-01

    Object recognition, which consists of classification and detection, has two important attributes for robustness: 1) closeness: detection windows should be as close to object locations as possible, and 2) adaptiveness: object matching should be adaptive to object variations within an object class. It is difficult to satisfy both attributes using traditional methods which consider classification and detection separately; thus recent studies propose to combine them based on confidence contextualization and foreground modeling. However, these combinations neglect feature saliency and object structure, and biological evidence suggests that the feature saliency and object structure can be important in guiding the recognition from low level to high level. In fact, object recognition originates in the mechanism of “what” and “where” pathways in human visual systems. More importantly, these pathways have feedback to each other and exchange useful information, which may improve closeness and adaptiveness. Inspired by the visual feedback, we propose a robust object recognition framework by designing a computational visual feedback model (VFM) between classification and detection. In the “what” feedback, the feature saliency from classification is exploited to rectify detection windows for better closeness; while in the “where” feedback, object parts from detection are used to match object structure for better adaptiveness. Experimental results show that the “what” and “where” feedback is effective in improving closeness and adaptiveness for object recognition, and encouraging improvements are obtained on the challenging PASCAL VOC 2007 dataset.

  8. Error analysis to improve the speech recognition accuracy on Telugu language

    Indian Academy of Sciences (India)

    N Usha Rani; P N Girija

    2012-12-01

    Speech is one of the most important communication channels among people. Speech recognition occupies a prominent place in communication between humans and machines. Several factors affect the accuracy of a speech recognition system. Much effort has been put into increasing the accuracy of speech recognition systems, yet erroneous output is still generated by current systems. Telugu is one of the most widely spoken south Indian languages. In the proposed Telugu speech recognition system, errors obtained from the decoder are analysed to improve the performance of the speech recognition system. The static pronunciation dictionary plays a key role in speech recognition accuracy, and modifications are performed on the dictionary used in the decoder of the speech recognition system. These modifications reduce the number of confusion pairs, which improves the performance of the speech recognition system. Language model scores also vary with these modifications. The hit rate is considerably increased by this modification, and the false alarms change as the pronunciation dictionary is modified. Variations are observed in different error measures such as the F-measure, error rate and Word Error Rate (WER) with the application of the proposed method.

  9. A study of speech emotion recognition based on hybrid algorithm

    Science.gov (United States)

    Zhu, Ju-xia; Zhang, Chao; Lv, Zhao; Rao, Yao-quan; Wu, Xiao-pei

    2011-10-01

    To effectively improve the recognition accuracy of a speech emotion recognition system, a hybrid algorithm which combines the Continuous Hidden Markov Model (CHMM), the All-Class-in-One Neural Network (ACON) and the Support Vector Machine (SVM) is proposed. In the SVM and ACON methods, some global statistics are used as emotional features, while in the CHMM method, instantaneous features are employed. The recognition rate of the proposed method is 92.25%, with a rejection rate of 0.78%. Furthermore, it obtains relative improvements of 8.53%, 4.69% and 0.78% compared with the ACON, CHMM and SVM methods, respectively. The experimental results confirm its efficiency in distinguishing the anger, happiness, neutral and sadness emotional states.

  10. A ROBUST EYE LOCALIZATION ALGORITHM FOR FACE RECOGNITION

    Institute of Scientific and Technical Information of China (English)

    Zhang Wencong; Li Xin; Yao Peng; Li Bin; Zhuang Zhenquan

    2008-01-01

    The accuracy of face alignment greatly affects the performance of a face recognition system. Since face alignment is usually conducted using eye positions, an algorithm for accurate eye localization is essential for accurate face recognition. In this paper, an algorithm is proposed for eye localization. First, an AdaBoost detector is adaptively trained to segment the eye region based on the particular gray-level distribution in that region. After that, a fast radial symmetry operator is used to precisely locate the center of the eyes. Experimental results show that the method can accurately locate the eyes, and it is robust to variations of face poses, illuminations, expressions, and accessories.

  11. Methods and apparatus for non-acoustic speech characterization and recognition

    Energy Technology Data Exchange (ETDEWEB)

    Holzrichter, J.F.

    1999-12-21

    By simultaneously recording EM wave reflections and acoustic speech information, the positions and velocities of the speech organs as speech is articulated can be defined for each acoustic speech unit. Well defined time frames and feature vectors describing the speech, to the degree required, can be formed. Such feature vectors can uniquely characterize the speech unit being articulated each time frame. The onset of speech, rejection of external noise, vocalized pitch periods, articulator conditions, accurate timing, the identification of the speaker, acoustic speech unit recognition, and organ mechanical parameters can be determined.

  12. Methods and apparatus for non-acoustic speech characterization and recognition

    Energy Technology Data Exchange (ETDEWEB)

    Holzrichter, John F. (Berkeley, CA)

    1999-01-01

    By simultaneously recording EM wave reflections and acoustic speech information, the positions and velocities of the speech organs as speech is articulated can be defined for each acoustic speech unit. Well defined time frames and feature vectors describing the speech, to the degree required, can be formed. Such feature vectors can uniquely characterize the speech unit being articulated each time frame. The onset of speech, rejection of external noise, vocalized pitch periods, articulator conditions, accurate timing, the identification of the speaker, acoustic speech unit recognition, and organ mechanical parameters can be determined.

  13. Is the Speech Transmission Index (STI) a robust measure of sound system speech intelligibility performance?

    Science.gov (United States)

    Mapp, Peter

    2002-11-01

    Although RaSTI is a good indicator of the speech intelligibility capability of auditoria and similar spaces, during the past 2-3 years it has been shown that RaSTI is not a robust predictor of sound system intelligibility performance. Instead, it is now recommended, within both national and international codes and standards, that full STI measurement and analysis be employed. However, new research is reported that indicates that STI is not as flawless or robust as many believe. The paper highlights a number of potential error mechanisms. It is shown that the measurement technique and signal excitation stimulus can have a significant effect on the overall result and accuracy, particularly where DSP-based equipment is employed. It is also shown that in its current state of development, STI is not capable of appropriately accounting for a number of fundamental speech and system attributes, including typical sound system frequency response variations and anomalies. This is particularly shown to be the case when a system is operating under reverberant conditions. Comparisons between actual system measurements and corresponding word score data are reported, where errors of up to 50% are observed. The implications for VA and PA system performance verification will be discussed.

  14. Studies on inter-speaker variability in speech and its application in automatic speech recognition

    Indian Academy of Sciences (India)

    S Umesh

    2011-10-01

    In this paper, we give an overview of the problem of inter-speaker variability and its study in many diverse areas of speech signal processing. We first give an overview of vowel-normalization studies that minimize variations in the acoustic representation of vowel realizations by different speakers. We then describe the universal-warping approach to speaker normalization which unifies many of the vowel normalization approaches and also shows the relation between speech production, perception and auditory processing. We then address the problem of inter-speaker variability in automatic speech recognition (ASR) and describe techniques that are used to reduce these effects and thereby improve the performance of speaker-independent ASR systems.

  15. Improving on hidden Markov models: An articulatorily constrained, maximum likelihood approach to speech recognition and speech coding

    Energy Technology Data Exchange (ETDEWEB)

    Hogden, J.

    1996-11-05

    The goal of the proposed research is to test a statistical model of speech recognition that incorporates the knowledge that speech is produced by relatively slow motions of the tongue, lips, and other speech articulators. This model is called Maximum Likelihood Continuity Mapping (Malcom). Many speech researchers believe that by using constraints imposed by articulator motions, we can improve or replace the current hidden Markov model based speech recognition algorithms. Unfortunately, previous efforts to incorporate information about articulation into speech recognition algorithms have suffered because (1) slight inaccuracies in our knowledge or the formulation of our knowledge about articulation may decrease recognition performance, (2) small changes in the assumptions underlying models of speech production can lead to large changes in the speech derived from the models, and (3) collecting measurements of human articulator positions in sufficient quantity for training a speech recognition algorithm is still impractical. The most interesting (and in fact, unique) quality of Malcom is that, even though Malcom makes use of a mapping between acoustics and articulation, Malcom can be trained to recognize speech using only acoustic data. By learning the mapping between acoustics and articulation using only acoustic data, Malcom avoids the difficulties involved in collecting articulator position measurements and does not require an articulatory synthesizer model to estimate the mapping between vocal tract shapes and speech acoustics. Preliminary experiments that demonstrate that Malcom can learn the mapping between acoustics and articulation are discussed. Potential applications of Malcom aside from speech recognition are also discussed. Finally, specific deliverables resulting from the proposed research are described.

  16. Comparing Speech Recognition Systems (Microsoft API, Google API And CMU Sphinx

    Directory of Open Access Journals (Sweden)

    Veton Këpuska

    2017-03-01

    Full Text Available The idea of this paper is to design a tool that will be used to test and compare commercial speech recognition systems, such as the Microsoft Speech API and the Google Speech API, with open-source speech recognition systems such as Sphinx-4. The best way to compare automatic speech recognition systems in different environments is by using audio recordings selected from different sources and calculating the word error rate (WER). Although the WERs of the three aforementioned systems were acceptable, it was observed that the Google API is superior.

  17. Robust Gait Recognition by Integrating Inertial and RGBD Sensors.

    Science.gov (United States)

    Zou, Qin; Ni, Lihao; Wang, Qian; Li, Qingquan; Wang, Song

    2017-03-29

    Gait has been considered as a promising and unique biometric for person identification. Traditionally, gait data are collected using either color sensors, such as a CCD camera, depth sensors, such as a Microsoft Kinect, or inertial sensors, such as an accelerometer. However, a single type of sensors may only capture part of the dynamic gait features and make the gait recognition sensitive to complex covariate conditions, leading to fragile gait-based person identification systems. In this paper, we propose to combine all three types of sensors for gait data collection and gait recognition, which can be used for important identification applications, such as identity recognition to access a restricted building or area. We propose two new algorithms, namely EigenGait and TrajGait, to extract gait features from the inertial data and the RGBD (color and depth) data, respectively. Specifically, EigenGait extracts general gait dynamics from the accelerometer readings in the eigenspace and TrajGait extracts more detailed subdynamics by analyzing 3-D dense trajectories. Finally, both extracted features are fed into a supervised classifier for gait recognition and person identification. Experiments on 50 subjects, with comparisons to several other state-of-the-art gait-recognition approaches, show that the proposed approach can achieve higher recognition accuracy and robustness.

  18. Robust Face Recognition via Block Sparse Bayesian Learning

    Directory of Open Access Journals (Sweden)

    Taiyong Li

    2013-01-01

    Full Text Available Face recognition (FR) is an important task in pattern recognition and computer vision. Sparse representation (SR) has been demonstrated to be a powerful framework for FR. In general, an SR algorithm treats each face in a training dataset as a basis function and tries to find a sparse representation of a test face under these basis functions. The sparse representation coefficients then provide a recognition hint. Early SR algorithms are based on a basic sparse model. Recently, it has been found that algorithms based on a block sparse model can achieve better recognition rates. Based on this model, in this study, we use block sparse Bayesian learning (BSBL) to find a sparse representation of a test face for recognition. BSBL is a recently proposed framework, which has many advantages over existing block-sparse-model-based algorithms. Experimental results on the Extended Yale B, the AR, and the CMU PIE face databases show that using BSBL can achieve better recognition rates and higher robustness than state-of-the-art algorithms in most cases.
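
    For orientation, the basic sparse-representation classification that BSBL improves on codes a test face over the training dictionary and assigns the class with the smallest reconstruction residual. In the generic sketch below, an L1 solver stands in for the paper's block sparse Bayesian learning.

        import numpy as np
        from sklearn.linear_model import Lasso

        def src_classify(D, labels, y, alpha=0.01):
            """D: (d, n) dictionary of training faces (columns normalized),
            labels: length-n numpy array of classes, y: (d,) test face."""
            code = Lasso(alpha=alpha, max_iter=5000).fit(D, y).coef_
            residuals = {}
            for c in set(labels):
                mask = labels == c
                residuals[c] = np.linalg.norm(y - D[:, mask] @ code[mask])
            return min(residuals, key=residuals.get)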

  19. Speech recognition based on a combination of acoustic features with articulatory information

    Institute of Scientific and Technical Information of China (English)

    LU Xugang; DANG Jianwu

    2005-01-01

    The contributions of static and dynamic articulatory information to speech recognition were evaluated, and recognition approaches combining articulatory information with acoustic features were discussed. Articulatory movements were observed with an Electromagnetic Articulographic system during read speech, and the speech signals were recorded simultaneously. First, we conducted several speech recognition experiments using articulatory features alone, consisting of a number of specific articulatory channels, to evaluate the contribution of each observation point on the articulators. Then, the displacement information of the articulatory data was combined directly with acoustic features and adopted in speech recognition. The results show that articulatory information provides additional information for speech recognition that is not encoded in acoustic features. Furthermore, the contribution of the dynamic information of the articulatory data was evaluated by combining it in speech recognition. It is found that the second derivative of the articulatory information provided a considerably larger contribution to speech recognition compared with the second derivative of the acoustic information. Finally, combination methods for articulatory and acoustic features were investigated for speech recognition. The basic approach is that a Bayesian Network (BN) is added to each state of the HMM, where the articulatory information is represented by the BN as a factor of the observed signals during model training and is marginalized as a hidden variable in the recognition stage. Results based on this HMM/BN framework show a better performance than the traditional method.

  20. Variable Frame Rate and Length Analysis for Data Compression in Distributed Speech Recognition

    DEFF Research Database (Denmark)

    Kraljevski, Ivan; Tan, Zheng-Hua

    2014-01-01

    This paper addresses the issue of data compression in distributed speech recognition on the basis of a variable frame rate and length analysis method. The method first conducts frame selection by using an a posteriori signal-to-noise ratio weighted energy distance to find the right time resolution … length for steady regions. The method is applied to scalable source coding in distributed speech recognition, where the target bitrate is met by adjusting the frame rate. Speech recognition results show that the proposed approach outperforms other compression methods in terms of recognition accuracy … for noisy speech while achieving higher compression rates…
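
    The frame-selection idea can be sketched as follows: a frame is retained only when the accumulated, a posteriori SNR-weighted energy distance since the last retained frame crosses a threshold, so rapidly changing speech gets a higher frame rate than steady regions. The accumulation rule and threshold are illustrative assumptions, not the paper's exact formulation.

```python
# Variable frame rate selection driven by an SNR-weighted energy distance.
import numpy as np

def select_frames(log_energy, snr_post, threshold=1.0):
    kept = [0]
    acc = 0.0
    for t in range(1, len(log_energy)):
        # weight the frame-to-frame energy distance by a posteriori SNR
        acc += snr_post[t] * abs(log_energy[t] - log_energy[t - 1])
        if acc >= threshold:        # enough spectral change: keep this frame
            kept.append(t)
            acc = 0.0
    return kept                     # raising `threshold` lowers the bitrate
```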

  1. Speaker-Adaptive Speech Recognition Based on Surface Electromyography

    Science.gov (United States)

    Wand, Michael; Schultz, Tanja

    We present our recent advances in silent speech interfaces using electromyographic signals that capture the movements of the human articulatory muscles at the skin surface for recognizing continuously spoken speech. Previous systems were limited to speaker- and session-dependent recognition tasks on small amounts of training and test data. In this article we present speaker-independent and speaker-adaptive training methods which allow us to use a large corpus of data from many speakers to train acoustic models more reliably. We use the speaker-dependent system as a baseline, carefully tuning the data preprocessing and acoustic modeling. Then, on our corpus, we compare the performance of speaker-dependent and speaker-independent acoustic models and carry out model adaptation experiments.

  2. Robust Tomato Recognition for Robotic Harvesting Using Feature Images Fusion

    Directory of Open Access Journals (Sweden)

    Yuanshen Zhao

    2016-01-01

    Full Text Available Automatic recognition of mature fruits in a complex agricultural environment is still a challenge for an autonomous harvesting robot, due to various disturbances in the image background. The bottleneck to robust fruit recognition is reducing the influence of the two main disturbances: illumination and overlapping. In order to recognize tomatoes in the tree canopy using a low-cost camera, a robust tomato recognition algorithm based on multiple feature images and image fusion was studied in this paper. Firstly, two novel feature images, the a*-component image and the I-component image, were extracted from the L*a*b* color space and the luminance, in-phase, quadrature-phase (YIQ) color space, respectively. Secondly, a wavelet transformation was adopted to fuse the two feature images at the pixel level, combining the feature information of the two source images. Thirdly, in order to segment the target tomato from the background, an adaptive threshold algorithm was used to obtain the optimal threshold, and the final segmentation result was processed by a morphological operation to reduce a small amount of noise. In the detection tests, 93% of the 200 target tomato samples were recognized. This indicates that the proposed tomato recognition method is suitable for robotic tomato harvesting in an uncontrolled environment at low cost.
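
    A compact sketch of the two feature images and a pixel-level wavelet fusion. The average/max-abs fusion rule used for the approximation and detail coefficients is a common default and an assumption here, not necessarily the paper's exact rule.

```python
# Extract the a* (L*a*b*) and I (YIQ) feature images, then fuse them at the
# pixel level through a single-level 2-D discrete wavelet transform.
import numpy as np
import pywt
from skimage import color

def feature_images(rgb):                       # rgb: float image in [0, 1]
    a_star = color.rgb2lab(rgb)[..., 1]        # a* channel of L*a*b*
    i_comp = (0.596 * rgb[..., 0] - 0.274 * rgb[..., 1]
              - 0.322 * rgb[..., 2])           # I channel of NTSC YIQ
    return a_star, i_comp

def wavelet_fuse(img1, img2, wavelet="db2"):
    c1, c2 = pywt.dwt2(img1, wavelet), pywt.dwt2(img2, wavelet)
    approx = (c1[0] + c2[0]) / 2               # average the approximations
    details = tuple(np.where(np.abs(d1) > np.abs(d2), d1, d2)
                    for d1, d2 in zip(c1[1], c2[1]))   # keep stronger detail
    return pywt.idwt2((approx, details), wavelet)
```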

  3. Robust Face Recognition Via Gabor Feature and Sparse Representation

    Directory of Open Access Journals (Sweden)

    Hao Yu-Juan

    2016-01-01

    Full Text Available Sparse representation based on compressed sensing theory has been widely used in the field of face recognition and has achieved good recognition results, but face feature extraction based on sparse representation is too simple, and the resulting coefficients are not sufficiently sparse. In this paper, we improve the classification algorithm by fusing sparse representation with Gabor features; the improved algorithm overcomes the problem of the large dimensionality of the Gabor feature vector, reduces computation and storage cost, and enhances the robustness of the algorithm to changes in the environment. Since the classification efficiency of sparse representation is determined by the collaborative representation, we simplify the L1-norm sparsity constraint to a least-squares constraint, which constrains the coefficients and reduces the complexity of the algorithm. Experimental results show that the proposed method is robust to illumination, facial expression, and pose variations in face recognition, and that the recognition rate of the algorithm is improved.
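
    For reference, a minimal Gabor magnitude feature extractor over a small filter bank, with block averaging standing in as the dimensionality reduction step; the paper's exact reduction and fusion scheme is not reproduced here.

```python
# Gabor magnitude features over a bank of frequencies and orientations,
# pooled by block averaging to keep the feature vector small.
import numpy as np
from skimage.filters import gabor

def gabor_features(img, freqs=(0.1, 0.2, 0.3), n_orient=4, block=8):
    feats = []
    for f in freqs:
        for k in range(n_orient):
            real, imag = gabor(img, frequency=f, theta=k * np.pi / n_orient)
            mag = np.hypot(real, imag)          # Gabor magnitude response
            hb, wb = mag.shape[0] // block, mag.shape[1] // block
            pooled = (mag[: hb * block, : wb * block]
                      .reshape(hb, block, wb, block).mean(axis=(1, 3)))
            feats.append(pooled.ravel())
    return np.concatenate(feats)
```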

  4. The Complete Gabor-Fisher Classifier for Robust Face Recognition

    Directory of Open Access Journals (Sweden)

    Štruc Vitomir

    2010-01-01

    Full Text Available This paper develops a novel face recognition technique called the Complete Gabor Fisher Classifier (CGFC). Different from existing techniques that use Gabor filters for deriving the Gabor face representation, the proposed approach does not rely solely on Gabor magnitude information but effectively uses features computed from Gabor phase information as well. It represents one of the few successful attempts found in the literature to combine Gabor magnitude and phase information for robust face recognition. The novelty of the proposed CGFC technique comes from (1) the introduction of a Gabor phase-based face representation and (2) the combination of the recognition technique using the proposed representation with classical Gabor magnitude-based methods into a unified framework. The proposed face recognition framework is assessed in a series of face verification and identification experiments performed on the XM2VTS, Extended YaleB, FERET, and AR databases. The results of the assessment suggest that the proposed technique clearly outperforms state-of-the-art face recognition techniques from the literature and that its performance is almost unaffected by the presence of partial occlusions of the facial area, changes in facial expression, or severe illumination changes.

  5. Fuzzy C-Means Clustering Based Phonetic Tied-Mixture HMM in Speech Recognition

    Institute of Scientific and Technical Information of China (English)

    XU Xiang-hua; ZHU Jie; GUO Qiang

    2005-01-01

    A fuzzy clustering analysis based phonetic tied-mixture HMM (FPTM) was presented to decrease parameter size and improve the robustness of parameter training. The FPTM was synthesized from state-tied HMMs by a modified fuzzy C-means clustering algorithm. Each Gaussian codebook of the FPTM was built from the Gaussian components within the same root node of the phonetic decision tree. Experimental results on large-vocabulary Mandarin speech recognition show that, compared with a conventional phonetic tied-mixture HMM and a state-tied HMM with approximately the same number of Gaussian mixtures, the FPTM achieves word error rate reductions of 4.84% and 13.02%, respectively. By combining the two schemes of mixing-weight pruning and fuzzy merging of Gaussian centers, a significant reduction in parameter size was achieved with little impact on recognition accuracy.
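
    For readers unfamiliar with the clustering step, a compact standard fuzzy C-means in NumPy follows; the paper uses a modified variant, so treat this as the baseline algorithm only.

```python
# Standard fuzzy C-means: soft memberships replace the hard assignments of
# k-means, which is what allows Gaussian components to be tied softly.
import numpy as np

def fuzzy_cmeans(X, c, m=2.0, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    U = rng.random((c, len(X)))
    U /= U.sum(axis=0)                                 # memberships sum to 1
    for _ in range(n_iter):
        W = U ** m                                     # fuzzified memberships
        centers = W @ X / W.sum(axis=1, keepdims=True)
        d = np.linalg.norm(X[None] - centers[:, None], axis=2) + 1e-12
        U = d ** (-2.0 / (m - 1.0))                    # membership update
        U /= U.sum(axis=0)
    return centers, U
```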

  6. Speech recognition in individuals having a clinical complaint about understanding speech during noise or not

    Directory of Open Access Journals (Sweden)

    Becker, Karine Thaís

    2011-07-01

    Full Text Available Introduction: Clinical and experimental study. Individuals with normal hearing can still be compromised in adverse communication situations, which negatively interferes with speech intelligibility. Objective: To check and compare the performance of normal-hearing young adults with and without a complaint of difficulty understanding speech in noise, using sentences as stimuli. Method: 50 normal-hearing individuals, 21 male and 29 female, aged between 19 and 32, were evaluated and divided into two groups: with and without a clinical complaint of difficulty understanding speech in noise. Using the Portuguese Sentence Lists test, the recognition threshold of sentences in noise was measured, from which signal-to-noise (S/N) ratios were obtained. The competing noise was presented at 65 dB HL. Results: The average S/N ratios in the right ear, for the group without a complaint and the group with a complaint, were −6.26 dB and −3.62 dB, respectively. For the left ear, the values were −7.12 dB and −4.12 dB. A statistically significant difference was found in both the right and left ears between the two groups. Conclusion: Normal-hearing individuals with a clinical complaint of understanding speech in noisy places have more difficulty in the task of recognizing sentences in noise than people who do not report such difficulty. Accordingly, the customary audiological evaluation should include tests using sentences in competing noise, with a view to evaluating speech recognition performance more reliably and efficiently. ACTRN12610000822088

  7. Adaptive Compensation Algorithm in Open Vocabulary Mandarin Speaker-Independent Speech Recognition

    Institute of Scientific and Technical Information of China (English)

    2002-01-01

    In speech recognition systems, the physiological characteristics of the speech production model cause the voiced sections of the speech signal to have an attenuation of approximately 20 dB per decade. Many speech recognition algorithms have been developed to solve this problem by filtering the input signal with a single-zero high pass filter. Unfortunately, this technique increases the noise energy at high frequencies above 4 kHz, which in some cases degrades the recognition accuracy. This paper solves the problem using a pre-emphasis filter in the front end of the recognizer. The aim is to develop a modified parameterization approach taking into account the whole energy zone in the spectrum to improve the performance of the existing baseline recognition system in the acoustic phase. The results show that a large vocabulary speaker-independent continuous speech recognition system using this approach has a greatly improved recognition rate.
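
    The pre-emphasis front end discussed above is a one-line first-order filter, y[n] = x[n] - a·x[n-1]; a coefficient around 0.95-0.97 is a typical assumption.

```python
# First-order pre-emphasis: lifts the high-frequency energy attenuated by
# the roughly -20 dB/decade tilt of voiced speech.
import numpy as np

def pre_emphasis(x, a=0.97):
    return np.append(x[0], x[1:] - a * x[:-1])
```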

  8. Comparison of fluctuating maskers for speech recognition tests.

    Science.gov (United States)

    Francart, Tom; van Wieringen, Astrid; Wouters, Jan

    2011-01-01

    To investigate the extent to which temporal gaps, temporal fine structure, and comprehensibility of the masker affect masking strength in speech recognition experiments. Seven different masker types with Dutch speech materials were evaluated, among them the ICRA-5 fluctuating noise, the international speech test signal (ISTS), and competing talkers in Dutch and Swedish. Participants were normal-hearing and hearing-impaired subjects. The normal-hearing subjects benefited from both temporal gaps and temporal fine structure in the fluctuating maskers. When the competing talker was comprehensible, performance decreased. The ISTS masker appeared to cause a large informational masking component. The stationary maskers yielded the steepest slopes of the psychometric function, followed by the modulated noises, followed by the competing talkers. Although the hearing-impaired group was heterogeneous, their data showed similar tendencies, but sometimes to a lesser extent, depending on the individual's hearing impairment. If measurement time is of primary concern, non-modulated maskers are advised. If it is useful to assess release from masking by the use of temporal gaps, a fluctuating noise is advised. If perception of temporal fine structure is being investigated, a foreign-language competing talker is advised.

  9. Supervised linear dimensionality reduction with robust margins for object recognition

    Science.gov (United States)

    Dornaika, F.; Assoum, A.

    2013-01-01

    Linear Dimensionality Reduction (LDR) techniques have become increasingly important in computer vision and pattern recognition since they permit a relatively simple mapping of data onto a lower-dimensional subspace, leading to simple and computationally efficient classification strategies. Recently, many linear discriminant methods have been developed in order to reduce the dimensionality of visual data and to enhance the discrimination between different groups or classes. Many existing linear embedding techniques rely on the use of local margins in order to achieve good discrimination performance. However, dealing with outliers and within-class diversity has not been addressed by margin-based embedding methods. In this paper, we explore the use of different margin-based linear embedding methods. More precisely, we propose to use the concepts of Median miss and Median hit for building robust margin-based criteria. Based on such margins, we seek the projection directions (linear embedding) such that the sum of local margins is maximized. Our proposed approach has been applied to the problem of appearance-based face recognition. Experiments performed on four public face databases show that the proposed approach can give better generalization performance than the classic Average Neighborhood Margin Maximization (ANMM). Moreover, thanks to the use of robust margins, the proposed method degrades gracefully when label outliers contaminate the training data set. In particular, we show that the concept of Median hit is crucial for obtaining robust performance in the presence of outliers.
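
    The median-based margin is easy to make concrete: for each sample, take the median distance to other-class samples (Median miss) minus the median distance to same-class samples (Median hit). The embedding optimization that maximizes the summed margins is omitted from this sketch.

```python
# Robust per-sample margins from class-wise median distances; medians make
# the margins far less sensitive to label outliers than nearest neighbors.
import numpy as np

def median_margins(X, y):
    margins = np.empty(len(X))
    for i, (xi, yi) in enumerate(zip(X, y)):
        d = np.linalg.norm(X - xi, axis=1)
        same = (y == yi) & (d > 0)            # exclude the point itself
        margins[i] = np.median(d[y != yi]) - np.median(d[same])
    return margins                            # larger = better separated
```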

  10. Statistic Model Based Dynamic Channel Compensation for Telephony Speech Recognition

    Institute of Scientific and Technical Information of China (English)

    ZHANGHuayun; HANZhaobing; XUBo

    2004-01-01

    The degradation of speech recognition performance in real-life environments and over transmission channels is a major obstacle for many speech-based applications around the world, especially when nonstationary noise and changing channels exist. Previous work has shown that the main reason for this performance degradation is the varying mismatch caused by different telephone channels between the testing and training sets. In this paper, we propose a statistical-model-based implementation to dynamically compensate for this mismatch. First, we focus on a Maximum-likelihood (ML) estimation algorithm for telephone channels. In experiments on Mandarin Large vocabulary continuous speech recognition (LVCSR) over telephone lines, the Character error rate (CER) decreases by more than 20%, with an average delay of about 300-400 ms. Second, we extend this by introducing a phone-conditioned prior statistical model for the channels and applying a Maximum a posteriori (MAP) estimation technique. Compared with the ML-based method, the MAP-based algorithm tracks within-channel variations more effectively, and its average delay is decreased to 200 ms. An additional 7-8% relative CER reduction is observed in LVCSR.
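
    As background intuition only (not the paper's estimator): a convolutive telephone channel is approximately additive in the cepstral domain, so even a running cepstral-mean subtraction tracks a slowly varying channel. The forgetting factor below is an illustrative assumption.

```python
# Running cepstral-mean subtraction as a simple dynamic channel compensator.
import numpy as np

def running_cmn(cepstra, alpha=0.995):
    out = np.empty_like(cepstra)
    mean = cepstra[0].copy()
    for t, frame in enumerate(cepstra):
        mean = alpha * mean + (1 - alpha) * frame   # slow channel estimate
        out[t] = frame - mean                       # compensated frame
    return out
```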

  11. Emotion Recognition of Speech Signals Based on Filter Methods

    Directory of Open Access Journals (Sweden)

    Narjes Yazdanian

    2016-10-01

    Full Text Available Speech is the basic means of communication among human beings. With the increase in interaction between humans and machines, the need for automatic dialogue without human intervention has been considered. The aim of this study was to determine a set of affective features of the speech signal for emotion recognition. A system was designed that includes three main sections: feature extraction, feature selection, and classification. After extraction of useful features such as mel frequency cepstral coefficients (MFCC), linear prediction cepstral coefficients (LPC), perceptual linear prediction coefficients (PLP), formant frequency, zero crossing rate, cepstral coefficients, pitch frequency, mean, jitter, shimmer, energy, minimum, maximum, amplitude, and standard deviation, filter methods such as the Pearson correlation coefficient, t-test, relief, and information gain were used at a later stage to rank and select effective features for emotion recognition. The selected features are then given to the classification system as a subset of the input. In the classification stage, a multiclass support vector machine is used to classify seven types of emotion. According to the results, the relief method, together with the multiclass support vector machine, has the highest classification accuracy, with an emotion recognition rate of 93.94%.
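
    A minimal sketch of the filter-method ranking stage, assuming a Pearson-correlation score; the t-test, relief, and information-gain criteria evaluated in the paper would slot in as alternative scoring functions.

```python
# Rank features by absolute correlation with the label and keep the top k
# before training the multiclass SVM.
import numpy as np
from sklearn.svm import SVC

def top_k_by_correlation(X, y, k=20):
    scores = np.array([abs(np.corrcoef(X[:, j], y)[0, 1])
                       for j in range(X.shape[1])])
    return np.argsort(scores)[::-1][:k]      # indices of the k best features

# usage: idx = top_k_by_correlation(X, y); clf = SVC().fit(X[:, idx], y)
```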

  12. Speech recognition in the radiology department: a systematic review.

    Science.gov (United States)

    Hammana, Imane; Lepanto, Luigi; Poder, Thomas; Bellemare, Christian; Ly, My-Sandra

    2015-01-01

    To conduct a systematic review of the literature describing the impact of speech recognition systems on report error rates and productivity in radiology departments. The search covered relevant papers published from January 1992 to October 2013. Comparative studies reporting any of the following outcomes were selected: error rates, departmental productivity, and radiologist productivity. The retrieved studies were assessed for quality and risk of bias. The literature search identified 85 potentially relevant publications, but, based on the inclusion and exclusion criteria, only 20 were included. Most studies were before-and-after assessments with no control group. There was a large amount of heterogeneity due to differences in the imaging modalities assessed and the outcomes measured. The percentage of reports containing at least one error varied from 4.8% to 89% for speech recognition, and from 2.1% to 22% for transcription. Departmental productivity improved, with decreases in report turnaround times varying from 35% to 99%. Most studies found a lengthening of radiologist dictation time. Overall gains in departmental productivity were high, but radiologist productivity, as measured by the time to produce a report, was diminished.

  13. The influence of age, hearing, and working memory on the speech comprehension benefit derived from an automatic speech recognition system

    NARCIS (Netherlands)

    Zekveld, A.A.; Kramer, S.E.; Kessens, J.M.; Vlaming, M.S.M.G.; Houtgast, T.

    2009-01-01

    Objective: The aim of the current study was to examine whether partly incorrect subtitles that are automatically generated by an Automatic Speech Recognition (ASR) system, improve speech comprehension by listeners with hearing impairment. In an earlier study (Zekveld et al. 2008), we showed that spe

  14. Exploiting independent filter bandwidth of human factor cepstral coefficients in automatic speech recognition

    Science.gov (United States)

    Skowronski, Mark D.; Harris, John G.

    2004-09-01

    Mel frequency cepstral coefficients (MFCC) are the most widely used speech features in automatic speech recognition systems, primarily because the coefficients fit well with the assumptions used in hidden Markov models and because of the superior noise robustness of MFCC over alternative feature sets such as linear prediction-based coefficients. The authors have recently introduced human factor cepstral coefficients (HFCC), a modification of MFCC that uses the known relationship between center frequency and critical bandwidth from human psychoacoustics to decouple filter bandwidth from filter spacing. In this work, the authors introduce a variation of HFCC called HFCC-E in which filter bandwidth is linearly scaled in order to investigate the effects of wider filter bandwidth on noise robustness. Experimental results show an increase in signal-to-noise ratio of 7 dB over traditional MFCC algorithms when filter bandwidth increases in HFCC-E. An important attribute of both HFCC and HFCC-E is that the algorithms only differ from MFCC in the filter bank coefficients: increased noise robustness using wider filters is achieved with no additional computational cost.
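
    Since HFCC differs from MFCC only in the filter-bank bandwidths, the key computation fits in a few lines: bandwidths come from the Moore-Glasberg equivalent rectangular bandwidth (ERB) at each center frequency, scaled by the E-factor. Treating E = 1 as plain HFCC is an assumption of this sketch.

```python
# HFCC-E bandwidths: ERB-derived and decoupled from mel filter spacing.
import numpy as np

def erb(f_hz):
    """Glasberg & Moore equivalent rectangular bandwidth, in Hz."""
    return 24.7 * (4.37 * f_hz / 1000.0 + 1.0)

def hfcc_bandwidths(center_freqs_hz, E=1.0):
    """E > 1 widens every filter; wider filters gave the robustness gains."""
    return E * erb(np.asarray(center_freqs_hz, dtype=float))
```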

  15. Speech Clarity Index (Ψ): A Distance-Based Speech Quality Indicator and Recognition Rate Prediction for Dysarthric Speakers with Cerebral Palsy

    Science.gov (United States)

    Kayasith, Prakasith; Theeramunkong, Thanaruk

    It is a tedious and subjective task to measure the severity of dysarthria by manually evaluating a speaker's speech using available standard assessment methods based on human perception. This paper presents an automated approach to assess the speech quality of a dysarthric speaker with cerebral palsy. With consideration of two complementary factors, speech consistency and speech distinction, a speech quality indicator called the speech clarity index (Ψ) is proposed as a measure of the speaker's ability to produce consistent speech signals for a given word and distinct speech signals for different words. As an application, it can be used to assess speech quality and forecast the speech recognition rate of an individual dysarthric speaker before the exhaustive implementation of an automatic speech recognition system for that speaker. The effectiveness of Ψ as a speech recognition rate predictor is evaluated by rank-order inconsistency, correlation coefficient, and root-mean-square of difference. The evaluations were done by comparing its predicted recognition rates with those predicted by the standard methods, the articulatory and intelligibility tests, based on two recognition systems (HMM and ANN). The results show that Ψ is a promising indicator for predicting the recognition rate of dysarthric speech. All experiments were done on a speech corpus composed of speech data from eight normal speakers and eight dysarthric speakers.

  16. Suprasegmental lexical stress cues in visual speech can guide spoken-word recognition

    NARCIS (Netherlands)

    Jesse, A.; McQueen, J.M.

    2014-01-01

    Visual cues to the individual segments of speech and to sentence prosody guide speech recognition. The present study tested whether visual suprasegmental cues to the stress patterns of words can also constrain recognition. Dutch listeners use acoustic suprasegmental cues to lexical stress (changes i

  17. Supporting Dictation Speech Recognition Error Correction: The Impact of External Information

    Science.gov (United States)

    Shi, Yongmei; Zhou, Lina

    2011-01-01

    Although speech recognition technology has made remarkable progress, its wide adoption is still restricted by notable effort made and frustration experienced by users while correcting speech recognition errors. One of the promising ways to improve error correction is by providing user support. Although support mechanisms have been proposed for…

  18. Modeling words with subword units in an articulatorily constrained speech recognition algorithm

    Energy Technology Data Exchange (ETDEWEB)

    Hogden, J.

    1997-11-20

    The goal of speech recognition is to find the most probable word given the acoustic evidence, i.e. a string of VQ codes or acoustic features. Speech recognition algorithms typically take advantage of the fact that the probability of a word, given a sequence of VQ codes, can be calculated.

  19. Distributed Speech Recognition Systems and Some Key Factors Affecting Its Performance

    Institute of Scientific and Technical Information of China (English)

    YE Lei; YANG Zhen

    2003-01-01

    In this paper we first analyze the Distributed Speech Recognition (DSR) system and the key factors that affect its performance, and then focus on the relationship between the length of the test speech and the recognition accuracy of the system. Experimental results are given at the end.

  1. Measuring Speech Recognition Proficiency: A Psychometric Analysis of Speed and Accuracy

    Science.gov (United States)

    Rader, Martha H.; Bailey, Glenn A.; Kurth, Linda A.

    2008-01-01

    This study examined the validity of various measures of speed and accuracy for assessing proficiency in speech recognition. The study specifically compared two different word-count indices for speed and accuracy (the 5-stroke word and the 1.4-syllable standard word) on a timing administered to 114 speech recognition students measured at 1-, 2-,…

  2. A robust HOG-based descriptor for pattern recognition

    Science.gov (United States)

    Diaz-Escobar, Julia; Kober, Vitaly

    2016-09-01

    The Histogram of Oriented Gradients (HOG) is a popular feature descriptor used in computer vision and image processing. The technique counts occurrences of gradient orientations in localized portions of an image. The descriptor is, however, sensitive to noise, nonuniform illumination, and low contrast in images. In this work, we propose a robust HOG-based descriptor using the local energy model and the phase congruency approach. Computer simulation results are presented for recognition of objects in images affected by additive noise, nonuniform illumination, and geometric distortions, using the proposed and conventional HOG descriptors.
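
    For reference, conventional HOG features are a single call in scikit-image; the paper's robust variant replaces the gradient computation with phase-congruency-based local energy, which is not shown here.

```python
# Baseline HOG descriptor on a sample image.
from skimage import data
from skimage.feature import hog

img = data.camera()
features = hog(img, orientations=9, pixels_per_cell=(8, 8),
               cells_per_block=(2, 2))
print(features.shape)   # one long vector of block-normalized histograms
```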

  3. A comparison of techniques for robust gender recognition

    OpenAIRE

    2011-01-01

    Reprinted, with permission, from: Rojas Bello, R.N., Lago Fernández, L.F., Martínez Muñoz, G., and Sánchez Montañés, M.A., "A comparison of techniques for robust gender recognition," IEEE International Conference on Image Processing, ICIP 2011.

  4. Discriminative Local Sparse Representations for Robust Face Recognition

    CERN Document Server

    Chen, Yi; Do, Thong T; Monga, Vishal; Tran, Trac D

    2011-01-01

    A key recent advance in face recognition models a test face image as a sparse linear combination of a set of training face images. The resulting sparse representations have been shown to possess robustness against a variety of distortions like random pixel corruption, occlusion and disguise. This approach however makes the restrictive (in many scenarios) assumption that test faces must be perfectly aligned (or registered) to the training data prior to classification. In this paper, we propose a simple yet robust local block-based sparsity model, using adaptively-constructed dictionaries from local features in the training data, to overcome this misalignment problem. Our approach is inspired by human perception: we analyze a series of local discriminative features and combine them to arrive at the final classification decision. We propose a probabilistic graphical model framework to explicitly mine the conditional dependencies between these distinct sparse local features. In particular, we learn discriminative...

  5. On model architecture for a children's speech recognition interactive dialog system

    OpenAIRE

    Kraleva, Radoslava; Kralev, Velin

    2016-01-01

    This report presents a general model of the architecture of information systems for children's speech recognition, together with a model of the speech data stream and how it works. The results of these studies and the presented architectural model show that research needs to focus on acoustic-phonetic modeling in order to improve the quality of children's speech recognition and the robustness of such systems to noise and changes in the transmission environment. Another important aspe...

  6. Audibility-based predictions of speech recognition for children and adults with normal hearing.

    Science.gov (United States)

    McCreery, Ryan W; Stelmachowicz, Patricia G

    2011-12-01

    This study investigated the relationship between audibility and predictions of speech recognition for children and adults with normal hearing. The Speech Intelligibility Index (SII) is used to quantify the audibility of speech signals and can be applied to transfer functions to predict speech recognition scores. Although the SII is used clinically with children, relatively few studies have evaluated SII predictions of children's speech recognition directly. Children have required more audibility than adults to reach maximum levels of speech understanding in previous studies. Furthermore, children may require greater bandwidth than adults for optimal speech understanding, which could influence frequency-importance functions used to calculate the SII. Speech recognition was measured for 116 children and 19 adults with normal hearing. Stimulus bandwidth and background noise level were varied systematically in order to evaluate speech recognition as predicted by the SII and derive frequency-importance functions for children and adults. Results suggested that children required greater audibility to reach the same level of speech understanding as adults. However, differences in performance between adults and children did not vary across frequency bands.
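
    Per band, the SII reduces to a clipped audibility term weighted by a band-importance function. The band levels and importance weights below are made up for illustration; the standard (ANSI S3.5) tabulates the real ones, and, as the study argues, children may warrant different weights than adults.

```python
# Toy SII-style computation over a handful of frequency bands.
import numpy as np

def sii(speech_db, noise_db, importance):
    # audibility: how far speech sits above the noise, mapped to [0, 1]
    audibility = np.clip((speech_db - noise_db + 15.0) / 30.0, 0.0, 1.0)
    return float(np.sum(importance * audibility))

speech = np.array([50.0, 55.0, 52.0, 48.0, 45.0])   # band speech levels, dB
noise = np.array([40.0, 45.0, 50.0, 46.0, 30.0])    # band noise levels, dB
importance = np.array([0.2, 0.25, 0.25, 0.2, 0.1])  # sums to 1
print(sii(speech, noise, importance))
```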

  7. Noise Adaptive Stream Weighting in Audio-Visual Speech Recognition

    Directory of Open Access Journals (Sweden)

    Martin Heckmann

    2002-11-01

    Full Text Available It has been shown that integration of acoustic and visual information especially in noisy conditions yields improved speech recognition results. This raises the question of how to weight the two modalities in different noise conditions. Throughout this paper we develop a weighting process adaptive to various background noise situations. In the presented recognition system, audio and video data are combined following a Separate Integration (SI architecture. A hybrid Artificial Neural Network/Hidden Markov Model (ANN/HMM system is used for the experiments. The neural networks were in all cases trained on clean data. Firstly, we evaluate the performance of different weighting schemes in a manually controlled recognition task with different types of noise. Next, we compare different criteria to estimate the reliability of the audio stream. Based on this, a mapping between the measurements and the free parameter of the fusion process is derived and its applicability is demonstrated. Finally, the possibilities and limitations of adaptive weighting are compared and discussed.
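
    At its core, the adaptive weighting is an exponent weight on the two streams' log-likelihoods. The SNR-to-weight mapping below is a placeholder assumption; the paper derives its mapping from measured reliability criteria on the audio stream.

```python
# Weighted combination of audio and video stream log-likelihoods.
import numpy as np

def combined_loglik(ll_audio, ll_video, snr_db):
    lam = np.clip(snr_db / 20.0, 0.0, 1.0)   # crude reliability-to-weight map
    return lam * ll_audio + (1.0 - lam) * ll_video   # per state and frame
```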

  8. Automatic Speech Recognition and Training for Severely Dysarthric Users of Assistive Technology: The STARDUST Project

    Science.gov (United States)

    Parker, Mark; Cunningham, Stuart; Enderby, Pam; Hawley, Mark; Green, Phil

    2006-01-01

    The STARDUST project developed robust computer speech recognizers for use by eight people with severe dysarthria and concomitant physical disability to access assistive technologies. Independent computer speech recognizers trained with normal speech are of limited functional use by those with severe dysarthria due to limited and inconsistent…

  9. Noise-robust cortical tracking of attended speech in real-world acoustic scenes

    DEFF Research Database (Denmark)

    Fuglsang, Søren; Dau, Torsten; Hjortkjær, Jens

    2017-01-01

    Selectively attending to one speaker in a multi-speaker scenario is thought to synchronize low-frequency cortical activity to the attended speech signal. In recent studies, reconstruction of speech from single-trial electroencephalogram (EEG) data has been used to decode which talker a listener … is attending to in a two-talker situation. It is currently unclear how this generalizes to more complex sound environments. Behaviorally, speech perception is robust to the acoustic distortions that listeners typically encounter in everyday life, but it is unknown whether this is mirrored by a noise-robust neural tracking of attended speech. Here we used advanced acoustic simulations to recreate real-world acoustic scenes in the laboratory. In virtual acoustic realities with varying amounts of reverberation and number of interfering talkers, listeners selectively attended to the speech stream…

  10. Robust Point Set Matching for Partial Face Recognition.

    Science.gov (United States)

    Weng, Renliang; Lu, Jiwen; Tan, Yap-Peng

    2016-03-01

    Over the past three decades, a number of face recognition methods have been proposed in computer vision, and most of them use holistic face images for person identification. In many real-world scenarios, especially in some unconstrained environments, human faces might be occluded by other objects, and it is difficult to obtain fully holistic face images for recognition. To address this, we propose a new partial face recognition approach to recognize persons of interest from their partial faces. Given a gallery image and a probe face patch, we first detect keypoints and extract their local textural features. Then, we propose a robust point set matching method to discriminatively match these two extracted local feature sets, where both the textural information and the geometrical information of the local features are explicitly used for matching simultaneously. Finally, the similarity of two faces is computed as the distance between these two aligned feature sets. Experimental results on four public face data sets show the effectiveness of the proposed approach.

  11. Memristive Computational Architecture of an Echo State Network for Real-Time Speech Emotion Recognition

    Science.gov (United States)

    2015-05-28

    Speech-based emotion recognition is simpler and requires fewer computational resources than other inputs such as facial expressions. The Berlin database of emotional speech … domains such as image and video analysis, anomaly detection, and speech recognition. In this research, a hardware architecture was explored for a memristive echo state network for real-time speech emotion recognition.

  12. Segmentation cues in conversational speech: robust semantics and fragile phonotactics.

    Science.gov (United States)

    White, Laurence; Mattys, Sven L; Wiget, Lukas

    2012-01-01

    Multiple cues influence listeners' segmentation of connected speech into words, but most previous studies have used stimuli elicited in careful readings rather than natural conversation. Discerning word boundaries in conversational speech may differ from the laboratory setting. In particular, a speaker's articulatory effort - hyperarticulation vs. hypoarticulation (H&H) - may vary according to communicative demands, suggesting a compensatory relationship whereby acoustic-phonetic cues are attenuated when other information sources strongly guide segmentation. We examined how listeners' interpretation of segmentation cues is affected by speech style (spontaneous conversation vs. read), using cross-modal identity priming. To elicit spontaneous stimuli, we used a map task in which speakers discussed routes around stylized landmarks. These landmarks were two-word phrases in which the strength of potential segmentation cues - semantic likelihood and cross-boundary diphone phonotactics - was systematically varied. Landmark-carrying utterances were transcribed and later re-recorded as read speech. Independent of speech style, we found an interaction between cue valence (favorable/unfavorable) and cue type (phonotactics/semantics). Thus, there was an effect of semantic plausibility, but no effect of cross-boundary phonotactics, indicating that the importance of phonotactic segmentation may have been overstated in studies where lexical information was artificially suppressed. These patterns were unaffected by whether the stimuli were elicited in a spontaneous or read context, even though the difference in speech styles was evident in a main effect. Durational analyses suggested speaker-driven cue trade-offs congruent with an H&H account, but these modulations did not impact on listener behavior. We conclude that previous research exploiting read speech is reliable in indicating the primacy of lexically based cues in the segmentation of natural conversational speech.

  13. Segmentation cues in conversational speech: Robust semantics and fragile phonotactics

    Directory of Open Access Journals (Sweden)

    Laurence eWhite

    2012-10-01

    Full Text Available Multiple cues influence listeners' segmentation of connected speech into words, but most previous studies have used stimuli elicited in careful readings rather than natural conversation. Discerning word boundaries in conversational speech may differ from the laboratory setting. In particular, a speaker's articulatory effort – hyperarticulation vs hypoarticulation (H&H) – may vary according to communicative demands, suggesting a compensatory relationship whereby acoustic-phonetic cues are attenuated when other information sources strongly guide segmentation. We examined how listeners' interpretation of segmentation cues is affected by speech style (spontaneous conversation vs read), using cross-modal identity priming. To elicit spontaneous stimuli, we used a map task in which speakers discussed routes around stylised landmarks. These landmarks were two-word phrases in which the strength of potential segmentation cues – semantic likelihood and cross-boundary diphone phonotactics – was systematically varied. Landmark-carrying utterances were transcribed and later re-recorded as read speech. Independent of speech style, we found an interaction between cue valence (favourable/unfavourable) and cue type (phonotactics/semantics). Thus, there was an effect of semantic plausibility, but no effect of cross-boundary phonotactics, indicating that the importance of phonotactic segmentation may have been overstated in studies where lexical information was artificially suppressed. These patterns were unaffected by whether the stimuli were elicited in a spontaneous or read context, even though the difference in speech styles was evident in a main effect. Durational analyses suggested speaker-driven cue trade-offs congruent with an H&H account, but these modulations did not impact on listener behaviour. We conclude that previous research exploiting read speech is reliable in indicating the primacy of lexically-based cues in the segmentation of natural conversational speech.

  14. Difficulties in Automatic Speech Recognition of Dysarthric Speakers and Implications for Speech-Based Applications Used by the Elderly: A Literature Review

    Science.gov (United States)

    Young, Victoria; Mihailidis, Alex

    2010-01-01

    Despite their growing presence in home computer applications and various telephony services, commercial automatic speech recognition technologies are still not easily employed by everyone; especially individuals with speech disorders. In addition, relatively little research has been conducted on automatic speech recognition performance with older…

  15. Speech Recognition for Environmental Control: Effect of Microphone Type, Dysarthria, and Severity on Recognition Results.

    Science.gov (United States)

    Fager, Susan Koch; Burnfield, Judith M

    2015-01-01

    This study examines the use of commercially available automatic speech recognition (ASR) across microphone options as a means of access to environmental control for individuals with and without dysarthria. A study of two groups of speakers (typical speech and dysarthria) was conducted to understand their performance using ASR and various microphones for environmental control. Specifically, the dependent variables examined included attempts per command, recognition accuracy, frequency of error type, and perceived workload. A further sub-analysis of the group of participants with dysarthria examined the impact of severity. Results indicated that a significantly larger number of attempts was required (P = 0.007), and significantly lower recognition accuracies were achieved, by the participants with dysarthria (P = 0.010). A sub-analysis examining severity demonstrated no significant differences between the typical speakers and participants with mild dysarthria. However, significant differences were evident (P = 0.007, P = 0.008) between participants with mild and moderate-severe dysarthria. No significant differences existed across microphones. Threshold errors occurred more frequently for typical participants, and no-response errors for participants with moderate-severe dysarthria. There were no significant differences on the NASA Task Load Index.

  16. Fish Recognition Based on Robust Features Extraction from Size and Shape Measurements Using Neural Network

    Directory of Open Access Journals (Sweden)

    Mutasem K. Alsmadi

    2010-01-01

    Full Text Available Problem statement: Image recognition is a challenging problem that researchers have been investigating for a long time, especially in recent years, owing to distortion, noise, segmentation errors, and overlap and occlusion of objects in digital images. Many fields are concerned with pattern recognition, for example, fingerprint verification, face recognition, iris discrimination, chromosome shape discrimination, optical character recognition, texture discrimination, and speech recognition. A system for recognizing an isolated pattern of interest may be an approach for dealing with such applications. Scientists and engineers with interests in image processing and pattern recognition have developed various approaches to digital image recognition problems, such as neural networks, contour matching, and statistics. Approach: In this study, our aim was to recognize an isolated pattern of interest in an image based on robust feature extraction, relying on size and shape measurements extracted through distance and geometrical measurements. Results: We presented a system prototype for dealing with this problem. The system starts by acquiring an image containing a pattern of fish; feature extraction is then performed based on size and shape measurements. Our system was applied to 20 different fish families, each with a different number of fish types, and the sample consists of 350 distinct fish images. These images were divided into two datasets: 257 training images and 93 testing images. An overall accuracy of 86% on the test dataset was obtained using a neural network trained with the back-propagation algorithm. Conclusion: We developed a classifier for fish image recognition and efficiently chose a feature extraction method to fit our demands. Our classifier successfully…

  17. A Support Vector Machine-Based Dynamic Network for Visual Speech Recognition Applications

    Directory of Open Access Journals (Sweden)

    Mihaela Gordan

    2002-11-01

    Full Text Available Visual speech recognition is an emerging research field. In this paper, we examine the suitability of support vector machines for visual speech recognition. Each word is modeled as a temporal sequence of visemes corresponding to the different phones realized. One support vector machine is trained to recognize each viseme and its output is converted to a posterior probability through a sigmoidal mapping. To model the temporal character of speech, the support vector machines are integrated as nodes into a Viterbi lattice. We test the performance of the proposed approach on a small visual speech recognition task, namely the recognition of the first four digits in English. The word recognition rate obtained is at the level of the previous best reported rates.
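
    One node of the lattice can be sketched with scikit-learn, where probability=True fits the sigmoidal (Platt) mapping from SVM outputs to posteriors internally; the toy features and the one-viseme-versus-rest setup are assumptions.

```python
# An SVM viseme detector whose output posterior would feed a Viterbi lattice.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 30))       # toy lip-region feature vectors
y = rng.integers(0, 2, size=200)     # 1 = this viseme, 0 = any other

svm = SVC(kernel="rbf", probability=True).fit(X, y)   # Platt-scaled SVM
posteriors = svm.predict_proba(X[:5])[:, 1]           # lattice node scores
print(posteriors)
```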

  18. Maximum likelihood polynomial regression for robust speech recognition

    Institute of Scientific and Technical Information of China (English)

    LU Yong; WU Zhenyang

    2011-01-01

    The linearity assumption is the main disadvantage of maximum likelihood linear regression (MLLR). This paper applies the polynomial regression method to model adaptation and establishes a nonlinear model adaptation algorithm using maximum likelihood polynomial regression…

  19. Comparison of Image Transform-Based Features for Visual Speech Recognition in Clean and Corrupted Videos

    Directory of Open Access Journals (Sweden)

    Ji Ming

    2008-03-01

    Full Text Available We present results of a study into the performance of a variety of different image transform-based feature types for speaker-independent visual speech recognition of isolated digits. This includes the first reported use of features extracted using a discrete curvelet transform. The study will show a comparison of some methods for selecting features of each feature type and show the relative benefits of both static and dynamic visual features. The performance of the features will be tested on both clean video data and also video data corrupted in a variety of ways to assess each feature type's robustness to potential real-world conditions. One of the test conditions involves a novel form of video corruption we call jitter which simulates camera and/or head movement during recording.

  1. Data Collection in Zooarchaeology: Incorporating Touch-Screen, Speech-Recognition, Barcodes, and GIS

    Directory of Open Access Journals (Sweden)

    W. Flint Dibble

    2015-12-01

    Full Text Available When recording observations on specimens, zooarchaeologists typically use a pen and paper or a keyboard. However, the use of awkward terms and identification codes when recording thousands of specimens makes such data entry prone to human transcription errors. Improving the quantity and quality of the zooarchaeological data we collect can lead to more robust results and new research avenues. This paper presents design tools for building a customized zooarchaeological database that leverages accessible and affordable 21st century technologies. Scholars interested in investing time in designing a custom database in common software (here, Microsoft Access) can take advantage of the affordable touch-screen, speech-recognition, and geographic information system (GIS) technologies described here. The efficiency that these approaches offer a research project far exceeds the time commitment a scholar must invest to deploy them.

  2. A Novel Feature Extraction for Robust EMG Pattern Recognition

    CERN Document Server

    Phinyomark, Angkoon; Phukpattaranont, Pornchai

    2009-01-01

    Varieties of noise are a major problem in the recognition of electromyography (EMG) signals; hence, methods to remove noise are most significant in EMG signal analysis. White Gaussian noise (WGN) is used to represent interference in this paper. Generally, WGN is difficult to remove using typical filtering, and solutions for removing it are limited. In addition, noise removal is an important step before performing feature extraction for EMG-based recognition. This research aims to present novel features that tolerate WGN, so that a noise removal algorithm is not needed. Two novel features, the modified mean frequency (MMNF) and the modified median frequency (MMDF), are presented for robust feature extraction. Sixteen existing features and the two novel ones are evaluated in a noisy environment. WGN with various signal-to-noise ratios (SNRs), i.e. 20-0 dB, was added to the original EMG signal. The results showed that MMNF performed very well, especially for weak EMG signals, compared with the others. The error of MMNF in weak EMG signal with...
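
    Mean and median frequency, the quantities MMNF and MMDF modify, take only a few lines of NumPy; as we read the paper, the modification computes them from the amplitude rather than the power spectrum, as noted in the final comment.

```python
# Mean (MNF) and median (MDF) frequency of an EMG segment.
import numpy as np

def mean_median_freq(x, fs):
    spec = np.abs(np.fft.rfft(x)) ** 2               # power spectrum
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    mnf = np.sum(freqs * spec) / np.sum(spec)        # spectral centroid
    cum = np.cumsum(spec)
    mdf = freqs[np.searchsorted(cum, cum[-1] / 2)]   # splits power in half
    return mnf, mdf

# MMNF/MMDF would replace `spec` with the amplitude spectrum np.abs(...).
```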

  3. An objective auditory measure to assess speech recognition in adult cochlear implant users.

    Science.gov (United States)

    Turgeon, C; Lazzouni, L; Lepore, F; Ellemberg, D

    2014-04-01

    To verify whether a mismatch negativity (MMN) paradigm based on speech syllables can differentiate between good and poorer cochlear implant (CI) users on a speech recognition task. Twenty adults with a CI and 11 normal-hearing adults participated in the study. Based on a speech recognition test, ten CI users were classified as good performers and ten as poor performers. We measured the MMN with /da/ as the standard stimulus and /ba/ and /ga/ as the deviants. Separate analyses were conducted on the amplitude and latency of the MMN. An MMN was evoked by both deviant stimuli in all normal-hearing participants and in well-performing CI users, with similar amplitudes for both groups. However, the amplitude of the MMN was significantly reduced for the poorer CI users compared with the normal-hearing group and the good CI users. The latency was longer for both groups of cochlear implant users. A bivariate correlation showed a significant positive correlation between the speech recognition score and the amplitude of the MMN. The MMN can thus distinguish between CI users who have good versus poor speech recognition as assessed with conventional tasks. Our findings suggest that the MMN can be used to assess speech recognition proficiency in CI users who cannot be tested with regular speech recognition tasks, such as infants and other non-verbal populations. Copyright © 2013 International Federation of Clinical Neurophysiology. Published by Elsevier Ireland Ltd. All rights reserved.

  4. A Weighted Discrete KNN Method for Mandarin Speech and Emotion Recognition

    OpenAIRE

    Pao, Tsang-Long; Liao, Wen-Yuan; Chen, Yu-Te

    2008-01-01

    In this chapter, we present a speech emotion recognition system and compare several classifiers on clean and noisy speech. Our proposed WD-KNN classifier outperforms the other three KNN-based classifiers at every SNR level and achieves the highest accuracy from clean speech to 20 dB noisy speech when compared with all other classifiers. Similar to Neiberg et al. (2006), GMM is a feasible technique for emotion classification at the frame level, and the results of GMM are better than perfor...

  5. Vowel recognition by fuzzy inference and application to recognition of continuous Korean speech. Fuzzy suiron ni yoru boin ninshiki to kankokugo renzoku onsei eno oyo

    Energy Technology Data Exchange (ETDEWEB)

    Choi, W.K.; Akizuki, K. (Waseda Univ., Tokyo (Japan)); Lee, H.H. (Fukuoka Inst. of Tech., Fukuoka (Japan))

    1991-05-20

    The target of voice recognition is to recognize continuous speech, which is effective for speech recognition of unspecified speakers. As a new matching method, the variations in the feature parameters of speakers are represented as fuzzy variables, with the variation expressed by membership functions. It is a new pattern-matching method based on fuzzy inference, using feature parameters, fuzzy relations and synthesis for each formant, and fuzzy rules. It is a recognition method that infers the best-matching formant by providing each characteristic quantity and fuzzy rule for composite calculation. For consonant recognition, pitch, logarithmic energies, zero-crossing rates, etc., which represent the features of each formant, are used. KOSRES 2, a recognition system for continuous Korean speech, was constructed using this method and subjected to recognition experiments on continuous Korean speech; the recognition method based on fuzzy inference was found to be effective for speech recognition of unspecified speakers. 8 refs., 9 figs., 3 tabs.

  6. Visual abilities are important for auditory-only speech recognition: evidence from autism spectrum disorder.

    Science.gov (United States)

    Schelinski, Stefanie; Riedel, Philipp; von Kriegstein, Katharina

    2014-12-01

    In auditory-only conditions, for example when we listen to someone on the phone, it is essential to fast and accurately recognize what is said (speech recognition). Previous studies have shown that speech recognition performance in auditory-only conditions is better if the speaker is known not only by voice, but also by face. Here, we tested the hypothesis that such an improvement in auditory-only speech recognition depends on the ability to lip-read. To test this we recruited a group of adults with autism spectrum disorder (ASD), a condition associated with difficulties in lip-reading, and typically developed controls. All participants were trained to identify six speakers by name and voice. Three speakers were learned by a video showing their face and three others were learned in a matched control condition without face. After training, participants performed an auditory-only speech recognition test that consisted of sentences spoken by the trained speakers. As a control condition, the test also included speaker identity recognition on the same auditory material. The results showed that, in the control group, performance in speech recognition was improved for speakers known by face in comparison to speakers learned in the matched control condition without face. The ASD group lacked such a performance benefit. For the ASD group auditory-only speech recognition was even worse for speakers known by face compared to speakers not known by face. In speaker identity recognition, the ASD group performed worse than the control group independent of whether the speakers were learned with or without face. Two additional visual experiments showed that the ASD group performed worse in lip-reading whereas face identity recognition was within the normal range. The findings support the view that auditory-only communication involves specific visual mechanisms. Further, they indicate that in ASD, speaker-specific dynamic visual information is not available to optimize auditory

  7. A speech recognition system for data collection in precision agriculture

    Science.gov (United States)

    Dux, David Lee

    Agricultural producers have shown interest in collecting detailed, accurate, and meaningful field data through field scouting, but scouting is labor intensive. They use yield monitor attachments to collect weed and other field data while driving equipment. However, distractions from using a keyboard or buttons while driving can lead to driving errors or missed data points. At Purdue University, researchers have developed an ASR system to allow equipment operators to collect georeferenced data while keeping hands and eyes on the machine during harvesting and to ease georeferencing of data collected during scouting. A notebook computer retrieved locations from a GPS unit and displayed and stored data in Excel. A headset microphone with a single earphone collected spoken input while allowing the operator to hear outside sounds. One-, two-, or three-word commands activated appropriate VBA macros. Four speech recognition products were chosen based on hardware requirements and ability to add new terms. After training, speech recognition accuracy was 100% for Kurzweil VoicePlus and Verbex Listen for the 132 vocabulary words tested, during tests walking outdoors or driving an ATV. Scouting tests were performed by carrying the system in a backpack while walking in soybean fields. The system recorded a point or a series of points with each utterance. Boundaries of points showed problem areas in the field and single points marked rocks and field corners. Data were displayed as an Excel chart to show a real-time map as data were collected. The information was later displayed in a GIS over remote sensed field images. Field corners and areas of poor stand matched, with voice data explaining anomalies in the image. The system was tested during soybean harvest by using voice to locate weed patches. A harvester operator with little computer experience marked points by voice when the harvester entered and exited weed patches or areas with poor crop stand. The operator found the

  8. Analysis and prediction of acoustic speech features from mel-frequency cepstral coefficients in distributed speech recognition architectures.

    Science.gov (United States)

    Darch, Jonathan; Milner, Ben; Vaseghi, Saeed

    2008-12-01

    The aim of this work is to develop methods that enable acoustic speech features to be predicted from mel-frequency cepstral coefficient (MFCC) vectors as may be encountered in distributed speech recognition architectures. The work begins with a detailed analysis of the multiple correlation between acoustic speech features and MFCC vectors. This confirms the existence of correlation, which is found to be higher when measured within specific phonemes rather than globally across all speech sounds. The correlation analysis leads to the development of a statistical method of predicting acoustic speech features from MFCC vectors that utilizes a network of hidden Markov models (HMMs) to localize prediction to specific phonemes. Within each HMM, the joint density of acoustic features and MFCC vectors is modeled and used to make a maximum a posteriori prediction. Experimental results are presented across a range of conditions, such as with speaker-dependent, gender-dependent, and gender-independent constraints, and these show that acoustic speech features can be predicted from MFCC vectors with good accuracy. A comparison is also made against an alternative scheme that substitutes the higher-order MFCCs with acoustic features for transmission. This delivers accurate acoustic features but at the expense of a significant reduction in speech recognition accuracy.
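
    Within a single HMM state, the MAP prediction reduces to the conditional mean of a joint Gaussian over the acoustic features y and the MFCC vector x, as sketched below; the HMM network that localizes prediction to specific phonemes is omitted.

```python
# Conditional-Gaussian prediction of acoustic features from an MFCC vector.
import numpy as np

def predict_y_given_x(x, mu, sigma, dim_y):
    """mu, sigma: mean and covariance of the stacked vector [y; x]."""
    mu_y, mu_x = mu[:dim_y], mu[dim_y:]
    s_yx = sigma[:dim_y, dim_y:]                 # cross-covariance block
    s_xx = sigma[dim_y:, dim_y:]
    return mu_y + s_yx @ np.linalg.solve(s_xx, x - mu_x)
```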

  9. Post-error Correction in Automatic Speech Recognition Using Discourse Information

    Directory of Open Access Journals (Sweden)

    KANG, S.

    2014-05-01

    Full Text Available Overcoming speech recognition errors in the field of human-computer interaction is important in ensuring a consistent user experience. This paper proposes a semantic-oriented post-processing approach for the correction of errors in speech recognition. The novelty of the proposed model is that it re-ranks the n-best hypotheses of speech recognition based on the user's intention, which is inferred from previous discourse information, whereas conventional automatic speech recognition systems rely only on acoustic and language model scores for the current sentence. The proposed model successfully reduces the word error rate and semantic error rate by 3.65% and 8.61%, respectively.
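
    A rough illustration of the re-ranking idea (the interpolation weight and score names are assumptions, not the paper's exact formulation): each hypothesis's ASR score is combined with a discourse-derived intention score before re-sorting the n-best list.

```python
def rerank_nbest(hypotheses, asr_scores, intent_scores, alpha=0.7):
    """Re-rank n-best ASR hypotheses by interpolating the ASR score
    (acoustic + language model) with a discourse-based intention score.
    Higher scores are better; alpha is a tunable weight."""
    combined = [alpha * a + (1 - alpha) * s
                for a, s in zip(asr_scores, intent_scores)]
    order = sorted(range(len(hypotheses)), key=lambda i: -combined[i])
    return [hypotheses[i] for i in order]
```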

  10. Man-system interface based on automatic speech recognition: integration to a virtual control desk

    Energy Technology Data Exchange (ETDEWEB)

    Jorge, Carlos Alexandre F.; Mol, Antonio Carlos A.; Pereira, Claudio M.N.A.; Aghina, Mauricio Alves C., E-mail: calexandre@ien.gov.b, E-mail: mol@ien.gov.b, E-mail: cmnap@ien.gov.b, E-mail: mag@ien.gov.b [Instituto de Engenharia Nuclear (IEN/CNEN-RJ), Rio de Janeiro, RJ (Brazil); Nomiya, Diogo V., E-mail: diogonomiya@gmail.co [Universidade Federal do Rio de Janeiro (UFRJ), RJ (Brazil)

    2009-07-01

    This work reports the implementation of a man-system interface based on automatic speech recognition and its integration into a virtual nuclear power plant control desk. The latter reproduces a real control desk using virtual reality technology, for operator training and ergonomic evaluation purposes. An automatic speech recognition system was developed to serve as a new interface with users, substituting for the computer keyboard and mouse. Users can operate this virtual control desk in front of a computer monitor or a projection screen through spoken commands. The automatic speech recognition interface is based on a well-known signal processing technique, cepstral analysis, and on artificial neural networks. The speech recognition interface is described, along with its integration with the virtual control desk, and results are presented. (author)
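
    A sketch of the cepstral-features-plus-neural-network pipeline the interface is built on, using librosa and scikit-learn as stand-ins for the authors' implementation (paths and labels are placeholders):

```python
import numpy as np
import librosa
from sklearn.neural_network import MLPClassifier

def command_features(wav_path, n_mfcc=13):
    """Cepstral (MFCC) representation of one spoken command,
    averaged over time to a fixed-length vector."""
    y, sr = librosa.load(wav_path, sr=16000)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    return mfcc.mean(axis=1)

# Placeholder labelled set of command recordings.
train_paths, train_labels = ['cmd1.wav'], ['open valve']
X = np.array([command_features(p) for p in train_paths])
clf = MLPClassifier(hidden_layer_sizes=(32,), max_iter=1000).fit(X, train_labels)
```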

  11. The effects of reverberant self- and overlap-masking on speech recognition in cochlear implant listeners.

    Science.gov (United States)

    Desmond, Jill M; Collins, Leslie M; Throckmorton, Chandra S

    2014-06-01

    Many cochlear implant (CI) listeners experience decreased speech recognition in reverberant environments [Kokkinakis et al., J. Acoust. Soc. Am. 129(5), 3221-3232 (2011)], which may be caused by a combination of self- and overlap-masking [Bolt and MacDonald, J. Acoust. Soc. Am. 21(6), 577-580 (1949)]. Determining the extent to which these effects decrease speech recognition for CI listeners may influence reverberation mitigation algorithms. This study compared speech recognition with ideal self-masking mitigation, with ideal overlap-masking mitigation, and with no mitigation. Under these conditions, mitigating either self- or overlap-masking resulted in significant improvements in speech recognition for both normal hearing subjects utilizing an acoustic model and for CI listeners using their own devices.

  12. Optical Character Recognition Based Speech Synthesis System Using LabVIEW

    Directory of Open Access Journals (Sweden)

    S.K. Singla

    2014-10-01

    Full Text Available Knowledge extraction by just listening to sounds is a distinctive human ability. A speech signal is a more effective means of communication than text because blind and visually impaired persons can also respond to sounds. This paper aims to develop a cost-effective and user-friendly optical character recognition (OCR) based speech synthesis system. The system has been developed using the Laboratory Virtual Instrument Engineering Workbench (LabVIEW 7.1).
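
    The paper's system is implemented in LabVIEW; as a language-neutral sketch of the same OCR-to-speech pipeline, the following uses the pytesseract and pyttsx3 Python packages:

```python
import pytesseract
from PIL import Image
import pyttsx3

def read_aloud(image_path):
    """OCR a page image, then synthesize the recognized text as speech."""
    text = pytesseract.image_to_string(Image.open(image_path))
    engine = pyttsx3.init()
    engine.say(text)
    engine.runAndWait()

# read_aloud('scanned_page.png')
```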

  13. Speech Recognition Software for Language Learning: Toward an Evaluation of Validity and Student Perceptions

    Science.gov (United States)

    Cordier, Deborah

    2009-01-01

    A renewed focus on foreign language (FL) learning and speech for communication has resulted in computer-assisted language learning (CALL) software developed with Automatic Speech Recognition (ASR). ASR features for FL pronunciation (Lafford, 2004) are functional components of CALL designs used for FL teaching and learning. The ASR features…

  14. Developing and Evaluating an Oral Skills Training Website Supported by Automatic Speech Recognition Technology

    Science.gov (United States)

    Chen, Howard Hao-Jan

    2011-01-01

    Oral communication ability has become increasingly important to many EFL students. Several commercial software programs based on automatic speech recognition (ASR) technologies are available but their prices are not affordable for many students. This paper will demonstrate how the Microsoft Speech Application Software Development Kit (SASDK), a…

  15. Speech Recognition and Acoustic Features in Combined Electric and Acoustic Stimulation

    Science.gov (United States)

    Yoon, Yang-soo; Li, Yongxin; Fu, Qian-Jie

    2012-01-01

    Purpose: In this study, the authors aimed to identify speech information processed by a hearing aid (HA) that is additive to information processed by a cochlear implant (CI) as a function of signal-to-noise ratio (SNR). Method: Speech recognition was measured with CI alone, HA alone, and CI + HA. Ten participants were separated into 2 groups; good…

  17. Using Automatic Speech Recognition to Dictate Mathematical Expressions: The Development of the "TalkMaths" Application at Kingston University

    Science.gov (United States)

    Wigmore, Angela; Hunter, Gordon; Pflugel, Eckhard; Denholm-Price, James; Binelli, Vincent

    2009-01-01

    Speech technology--especially automatic speech recognition--has now advanced to a level where it can be of great benefit both to able-bodied people and those with various disabilities. In this paper we describe an application "TalkMaths" which, using the output from a commonly-used conventional automatic speech recognition system,…

  18. Machine learning based sample extraction for automatic speech recognition using dialectal Assamese speech.

    Science.gov (United States)

    Agarwalla, Swapna; Sarma, Kandarpa Kumar

    2016-06-01

    Automatic Speech Recognition (ASR) and related issues are continuously evolving as inseparable elements of Human-Computer Interaction (HCI). With the assimilation of emerging concepts like big data and the Internet of Things (IoT) as extended elements of HCI, ASR techniques are passing through a paradigm shift. Of late, learning-based techniques have received greater attention from ASR research communities, owing to their natural ability to mimic biological behavior, which aids ASR modeling and processing. Current learning-based ASR techniques are evolving further with the incorporation of big data and IoT-like concepts. In this paper, we report approaches based on machine learning (ML) for the extraction of relevant samples from a big data space and apply them to ASR using soft computing techniques for Assamese speech with dialectal variations. A class of ML techniques comprising the basic Artificial Neural Network (ANN) in feedforward (FF) and Deep Neural Network (DNN) forms, using raw speech, extracted features, and frequency-domain forms, is considered. The Multi-Layer Perceptron (MLP) is configured with inputs in several forms to learn class information obtained using clustering and manual labeling. DNNs are also used to extract specific sentence types. Initially, relevant samples are selected and assimilated from a large storage. Next, a few conventional methods are used for feature extraction of selected types. The features comprise both spectral and prosodic types. These are applied to Recurrent Neural Network (RNN) and Fully Focused Time Delay Neural Network (FFTDNN) structures to evaluate their performance in recognizing mood, dialect, speaker and gender variations in dialectal Assamese speech. The system is tested under several background noise conditions by considering the recognition rates (obtained using confusion matrices and manually) and computation time
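
    A toy version of the sample-selection step (choosing representative samples from a large feature space before training); scikit-learn stands in for the paper's ANN/DNN machinery and all names are illustrative:

```python
import numpy as np
from sklearn.cluster import KMeans

def select_representative(X, n_clusters=50):
    """Cluster the pooled feature vectors and keep the sample nearest
    each centroid, a crude stand-in for ML-based selection of relevant
    samples from a large speech corpus. Returns the selected indices."""
    km = KMeans(n_clusters=n_clusters, n_init=10).fit(X)
    idx = [int(np.argmin(np.linalg.norm(X - c, axis=1)))
           for c in km.cluster_centers_]
    return np.array(idx)

# A classifier (e.g. an MLP) would then be trained only on X[idx].
```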

  19. Influence of native and non-native multitalker babble on speech recognition in noise

    Directory of Open Access Journals (Sweden)

    Chandni Jain

    2014-03-01

    Full Text Available The aim of the study was to assess speech recognition in noise using multitalker babble in native and non-native languages at two different signal-to-noise ratios. Speech recognition in noise was assessed in 60 participants (18 to 30 years) with normal hearing sensitivity whose native language was Malayalam or Kannada. For this purpose, six- and ten-talker babble was generated in Kannada and in Malayalam. Speech recognition was assessed for native listeners of both languages in the presence of native and non-native multitalker babble. Results showed that speech recognition in noise was significantly higher at 0 dB signal-to-noise ratio (SNR) than at -3 dB SNR for both languages. Performance of Kannada listeners was significantly higher in the presence of native (Kannada) babble than non-native (Malayalam) babble. However, this was not the case for the Malayalam listeners, who performed equally well with native (Malayalam) and non-native (Kannada) babble. The results of the present study highlight the importance of using native multitalker babble for Kannada listeners in lieu of non-native babble, and of considering each SNR when estimating speech recognition in noise scores. Further research is needed to assess speech recognition in Malayalam listeners in the presence of other non-native backgrounds of various types.
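
    Stimuli of this kind are built by mixing target speech with babble at a controlled SNR. A minimal sketch of the level computation, assuming both signals are arrays at the same sample rate:

```python
import numpy as np

def mix_at_snr(speech, babble, snr_db):
    """Scale the babble so that the speech-to-babble power ratio equals
    snr_db, then mix:  gain^2 = P_speech / (P_babble * 10^(SNR/10))."""
    babble = babble[:len(speech)]
    p_s = np.mean(speech ** 2)
    p_b = np.mean(babble ** 2)
    gain = np.sqrt(p_s / (p_b * 10 ** (snr_db / 10)))
    return speech + gain * babble
```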

  20. Adoption of Speech Recognition Technology in Community Healthcare Nursing.

    Science.gov (United States)

    Al-Masslawi, Dawood; Block, Lori; Ronquillo, Charlene

    2016-01-01

    Adoption of new health information technology has been shown to be challenging. However, the degree to which new technology will be adopted can be predicted by measures of usefulness and ease of use. In this work, these key determining factors are the focus of the design of a wound documentation tool. In the context of wound care at home, and consistent with evidence from similar settings in the literature, the use of Speech Recognition Technology (SRT) for patient documentation has shown promise. To achieve a user-centred design, results from ethnographic fieldwork are used to inform SRT features; furthermore, exploratory prototyping is used to collect feedback about the wound documentation tool from home care nurses. During this study, measures developed for healthcare applications of the Technology Acceptance Model will be used to identify SRT features that improve usefulness (e.g. increased accuracy, saving time) or ease of use (e.g. lowering mental/physical effort, easy-to-remember tasks). The identified features will be used to create a low-fidelity prototype that will be evaluated in future experiments.

  1. Location and acoustic scale cues in concurrent speech recognition.

    Science.gov (United States)

    Ives, D Timothy; Vestergaard, Martin D; Kistler, Doris J; Patterson, Roy D

    2010-06-01

    Location and acoustic scale cues have both been shown to have an effect on the recognition of speech in multi-speaker environments. This study examines the interaction of these variables. Subjects were presented with concurrent triplets of syllables from a target voice and a distracting voice, and asked to recognize a specific target syllable. The task was made more or less difficult by changing (a) the location of the distracting speaker, (b) the scale difference between the two speakers, and/or (c) the relative level of the two speakers. Scale differences were produced by changing the vocal tract length and glottal pulse rate during syllable synthesis: 32 acoustic scale differences were used. Location cues were produced by convolving head-related transfer functions with the stimulus. The angle between the target speaker and the distracter was 0 degrees, 4 degrees, 8 degrees, 16 degrees, or 32 degrees on the 0 degrees horizontal plane. The relative level of the target to the distracter was 0 or -6 dB. The results show that location and scale difference interact, and the interaction is greatest when one of these cues is small. Increasing either the acoustic scale or the angle between target and distracter speakers quickly elevates performance to ceiling levels.
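
    Location cues of this kind are produced by convolving the stimulus with a pair of head-related impulse responses (HRIRs) for the desired azimuth. A minimal sketch, assuming the HRIRs are loaded from a measured set:

```python
import numpy as np
from scipy.signal import fftconvolve

def spatialize(signal, hrir_left, hrir_right):
    """Render a mono stimulus at the position encoded by a pair of
    head-related impulse responses, one per ear. Returns a (2, N)
    stereo array."""
    left = fftconvolve(signal, hrir_left)
    right = fftconvolve(signal, hrir_right)
    return np.stack([left, right], axis=0)
```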

  2. Impact of noise and other factors on speech recognition in anaesthesia

    DEFF Research Database (Denmark)

    Alapetite, Alexandre

    2008-01-01

    Objective: The aim of the experiment is to evaluate the relative impact of several factors affecting speech recognition when used in operating rooms, such as the type or loudness of background noises, type of microphone, type of recognition mode (free speech versus command mode), and type of training. Methods: Eight volunteers read aloud a total of about 3 600 typical short anaesthesia comments to be transcribed by a continuous speech recognition system. Background noises were collected in an operating room and reproduced. A regression analysis and descriptive statistics were done to evaluate … Results: … effect; recognition rates for common noises (e.g. ventilation, alarms) are only slightly below rates obtained in a quiet environment. Finally, a redundant architecture succeeds in improving the reliability of the recognitions. Conclusion: This study removes some uncertainties regarding the feasibility … operations.

  3. Combining Semantic and Acoustic Features for Valence and Arousal Recognition in Speech

    DEFF Research Database (Denmark)

    Karadogan, Seliz; Larsen, Jan

    2012-01-01

    The recognition of affect in speech has attracted a lot of interest recently, especially in the areas of cognitive and computer sciences. Most of the previous studies focused on the recognition of basic emotions (such as happiness, sadness and anger) using a categorical approach. Recently, the focus has been shifting towards dimensional affect recognition, based on the idea that emotional states are not independent of one another but related in a systematic manner. In this paper, we design a continuous dimensional speech affect recognition model that combines acoustic and semantic features. We show that combining semantic and acoustic information for dimensional speech affect recognition improves the results. Moreover, we show that valence is better estimated using semantic features, while arousal is better estimated using acoustic features.

  4. Effect of Speaker Age on Speech Recognition and Perceived Listening Effort in Older Adults with Hearing Loss

    Science.gov (United States)

    McAuliffe, Megan J.; Wilding, Phillipa J.; Rickard, Natalie A.; O'Beirne, Greg A.

    2012-01-01

    Purpose: Older adults exhibit difficulty understanding speech that has been experimentally degraded. Age-related changes to the speech mechanism lead to natural degradations in signal quality. We tested the hypothesis that older adults with hearing loss would exhibit declines in speech recognition when listening to the speech of older adults,…

  6. A Russian Keyword Spotting System Based on Large Vocabulary Continuous Speech Recognition and Linguistic Knowledge

    Directory of Open Access Journals (Sweden)

    Valentin Smirnov

    2016-01-01

    Full Text Available The paper describes the key concepts of a word spotting system for Russian based on large vocabulary continuous speech recognition. Key algorithms and system settings are described, including the pronunciation variation algorithm, and the experimental results on the real-life telecom data are provided. The description of system architecture and the user interface is provided. The system is based on CMU Sphinx open-source speech recognition platform and on the linguistic models and algorithms developed by Speech Drive LLC. The effective combination of baseline statistic methods, real-world training data, and the intensive use of linguistic knowledge led to a quality result applicable to industrial use.
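
    For readers who want to experiment, CMU Sphinx also offers a much simpler acoustic keyphrase mode (not the LVCSR-based pipeline described above), accessible through the older pocketsphinx Python wrapper; the keyphrase and threshold below are placeholders:

```python
from pocketsphinx import LiveSpeech

# Keyphrase spotting with the (older) pocketsphinx Python wrapper;
# 'forward' and the detection threshold are placeholder settings.
speech = LiveSpeech(lm=False, keyphrase='forward', kws_threshold=1e-20)
for phrase in speech:
    print('keyword detected:', phrase.segments(detailed=True))
```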

  7. Prediction of Speech Recognition in Cochlear Implant Users by Adapting Auditory Models to Psychophysical Data

    Directory of Open Access Journals (Sweden)

    Svante Stadler

    2009-01-01

    Full Text Available Users of cochlear implants (CIs) vary widely in their ability to recognize speech in noisy conditions. There are many factors that may influence their performance. We have investigated to what degree it can be explained by the users' ability to discriminate spectral shapes. A speech recognition task was simulated using both a simple and a complex model of CI hearing. The models were individualized by adapting their parameters to fit the results of a spectral discrimination test. The predicted speech recognition performance was compared to experimental results, and they were significantly correlated. The presented framework may be used to simulate the effects of changing the CI encoding strategy.

  8. How Linguistic Closure and Verbal Working Memory Relate to Speech Recognition in Noise—A Review

    Science.gov (United States)

    Koelewijn, Thomas; Zekveld, Adriana A.; Kramer, Sophia E.; Festen, Joost M.

    2013-01-01

    The ability to recognize masked speech, commonly measured with a speech reception threshold (SRT) test, is associated with cognitive processing abilities. Two cognitive factors frequently assessed in speech recognition research are the capacity of working memory (WM), measured by means of a reading span (Rspan) or listening span (Lspan) test, and the ability to read masked text (linguistic closure), measured by the text reception threshold (TRT). The current article provides a review of recent hearing research that examined the relationship of TRT and WM span to SRTs in various maskers. Furthermore, modality differences in WM capacity assessed with the Rspan compared to the Lspan test were examined and related to speech recognition abilities in an experimental study with young adults with normal hearing (NH). Span scores were strongly associated with each other, but were higher in the auditory modality. The results of the reviewed studies suggest that TRT and WM span are related to each other, but differ in their relationships with SRT performance. In NH adults of middle age or older, both TRT and Rspan were associated with SRTs in speech maskers, whereas TRT better predicted speech recognition in fluctuating nonspeech maskers. The associations with SRTs in steady-state noise were inconclusive for both measures. WM span was positively related to benefit from contextual information in speech recognition, but better TRTs related to less interference from unrelated cues. Data for individuals with impaired hearing are limited, but larger WM span seems to give a general advantage in various listening situations. PMID:23945955

  10. Neural speech recognition: continuous phoneme decoding using spatiotemporal representations of human cortical activity

    Science.gov (United States)

    Moses, David A.; Mesgarani, Nima; Leonard, Matthew K.; Chang, Edward F.

    2016-10-01

    Objective. The superior temporal gyrus (STG) and neighboring brain regions play a key role in human language processing. Previous studies have attempted to reconstruct speech information from brain activity in the STG, but few of them incorporate the probabilistic framework and engineering methodology used in modern speech recognition systems. In this work, we describe the initial efforts toward the design of a neural speech recognition (NSR) system that performs continuous phoneme recognition on English stimuli with arbitrary vocabulary sizes using the high gamma band power of local field potentials in the STG and neighboring cortical areas obtained via electrocorticography. Approach. The system implements a Viterbi decoder that incorporates phoneme likelihood estimates from a linear discriminant analysis model and transition probabilities from an n-gram phonemic language model. Grid searches were used in an attempt to determine optimal parameterizations of the feature vectors and Viterbi decoder. Main results. The performance of the system was significantly improved by using spatiotemporal representations of the neural activity (as opposed to purely spatial representations) and by including language modeling and Viterbi decoding in the NSR system. Significance. These results emphasize the importance of modeling the temporal dynamics of neural responses when analyzing their variations with respect to varying stimuli and demonstrate that speech recognition techniques can be successfully leveraged when decoding speech from neural signals. Guided by the results detailed in this work, further development of the NSR system could have applications in the fields of automatic speech recognition and neural prosthetics.
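
    The decoding stage combines per-frame phoneme likelihoods with language-model transition probabilities through a Viterbi search. A generic sketch of that search (array names are illustrative; in the paper the likelihoods come from an LDA model over high-gamma features):

```python
import numpy as np

def viterbi(log_like, log_trans, log_init):
    """Viterbi decoding. log_like[t, q]: per-frame phoneme
    log-likelihoods; log_trans[q, q']: n-gram-style transition
    log-probabilities; log_init[q]: initial log-probabilities.
    Returns the most probable phoneme-state sequence."""
    T, Q = log_like.shape
    delta = log_init + log_like[0]
    back = np.zeros((T, Q), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + log_trans   # (prev, current)
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + log_like[t]
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```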

  11. Lipreading and audiovisual speech recognition across the adult lifespan: Implications for audiovisual integration.

    Science.gov (United States)

    Tye-Murray, Nancy; Spehar, Brent; Myerson, Joel; Hale, Sandra; Sommers, Mitchell

    2016-06-01

    In this study of visual (V-only) and audiovisual (AV) speech recognition in adults aged 22-92 years, the rate of age-related decrease in V-only performance was more than twice that in AV performance. Both auditory-only (A-only) and V-only performance were significant predictors of AV speech recognition, but age did not account for additional (unique) variance. Blurring the visual speech signal decreased speech recognition, and in AV conditions involving stimuli associated with equivalent unimodal performance for each participant, speech recognition remained constant from 22 to 92 years of age. Finally, principal components analysis revealed separate visual and auditory factors, but no evidence of an AV integration factor. Taken together, these results suggest that the benefit that comes from being able to see as well as hear a talker remains constant throughout adulthood and that changes in this AV advantage are entirely driven by age-related changes in unimodal visual and auditory speech recognition. (PsycINFO Database Record (c) 2016 APA, all rights reserved).

  12. Microscopic prediction of speech recognition for listeners with normal hearing in noise using an auditory model.

    Science.gov (United States)

    Jürgens, Tim; Brand, Thomas

    2009-11-01

    This study compares the phoneme recognition performance in speech-shaped noise of a microscopic model for speech recognition with the performance of normal-hearing listeners. "Microscopic" is defined in terms of this model twofold. First, the speech recognition rate is predicted on a phoneme-by-phoneme basis. Second, microscopic modeling means that the signal waveforms to be recognized are processed by mimicking elementary parts of human auditory processing. The model is based on an approach by Holube and Kollmeier [J. Acoust. Soc. Am. 100, 1703-1716 (1996)] and consists of a psychoacoustically and physiologically motivated preprocessing stage and a simple dynamic-time-warp speech recognizer. The model is evaluated by presenting nonsense speech in a closed-set paradigm. Averaged phoneme recognition rates, specific phoneme recognition rates, and phoneme confusions are analyzed. The influence of different perceptual distance measures and of the model's a priori knowledge is investigated. The results show that human performance can be predicted by this model using an optimal detector, i.e., identical speech waveforms for both training of the recognizer and testing. The best model performance is yielded by distance measures that focus mainly on small perceptual distances and neglect outliers.
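
    The recognizer component is a plain dynamic-time-warp template matcher. A straightforward sketch of the DTW distance such a recognizer relies on:

```python
import numpy as np

def dtw_distance(a, b):
    """Dynamic-time-warping distance between two feature sequences
    (frames x dims), as used by simple template-based recognizers."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]
```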

  13. An open-set detection evaluation methodology for automatic emotion recognition in speech

    NARCIS (Netherlands)

    Truong, K.P.; Leeuwen, D.A. van

    2007-01-01

    In this paper, we present a detection approach and an ‘open-set’ detection evaluation methodology for automatic emotion recognition in speech. The traditional classification approach does not seem to be suitable and flexible enough for typical emotion recognition tasks. For example, classification

  15. Northeast Artificial Intelligence Consortium Annual Report 1987. Volume 6. Artificial Intelligence Applications to Speech Recognition

    Science.gov (United States)

    1989-03-01

    Waibel (1981), in addition to connected word recognition by Myers et al. (1981), Ney (1984), and Watari (1986). An unknown utterance and a reference...Techniques in Isolated Word Speech Recognition Systems", Research paper, Carnegie-Mellon University, CMU-CS-81-125, June 1981. Watari, M. "New DP

  16. AUTOMATIC SPEECH RECOGNITION SYSTEM CONCERNING THE MOROCCAN DIALECTE (Darija and Tamazight)

    OpenAIRE

    A. EL GHAZI; Daoui, C.; Idrissi, N

    2012-01-01

    In this work we present an automatic speech recognition system for Moroccan dialects, mainly Darija (an Arabic dialect) and Tamazight. Many approaches have been used to model the Arabic and Tamazight phonetic units. In this paper, we propose to use the hidden Markov model (HMM) for modeling these phonetic units. Experimental results show that the proposed approach further improves recognition.

  17. Speech Recognition in Adults with Cochlear Implants: The Effects of Working Memory, Phonological Sensitivity, and Aging

    Science.gov (United States)

    Moberly, Aaron C.; Harris, Michael S.; Boyce, Lauren; Nittrouer, Susan

    2017-01-01

    Purpose: Models of speech recognition suggest that "top-down" linguistic and cognitive functions, such as use of phonotactic constraints and working memory, facilitate recognition under conditions of degradation, such as in noise. The question addressed in this study was what happens to these functions when a listener who has experienced…

  18. Use of Authentic-Speech Technique for Teaching Sound Recognition to EFL Students

    Science.gov (United States)

    Sersen, William J.

    2011-01-01

    The main objective of this research was to test an authentic-speech technique for improving the sound-recognition skills of EFL (English as a foreign language) students at Roi-Et Rajabhat University. The secondary objective was to determine the correlation, if any, between students' self-evaluation of sound-recognition progress and the actual…

  19. Acceptance of speech recognition by physicians: A survey of expectations, experiences, and social influence

    DEFF Research Database (Denmark)

    Alapetite, Alexandre; Andersen, Henning Boje; Hertzum, Morten

    2009-01-01

    The present study has surveyed physician views and attitudes before and after the introduction of speech technology as a front end to an electronic medical record. At the hospital where the survey was made, speech technology recently (2006–2007) replaced traditional dictation and subsequent … to which physicians' attitudes and expectations of the use of speech technology changed after actually using it. The survey was based on two questionnaires: one administered when the physicians were about to begin training with the speech-recognition system and another, asking similar questions, when they had had some experience with the system. The survey data were supplemented with performance data from the speech-recognition system. The results show that the surveyed physicians tended to report a more negative view of the system after having used it for some months than before. When judging...

  20. Speaking to the trained ear: musical expertise enhances the recognition of emotions in speech prosody.

    Science.gov (United States)

    Lima, César F; Castro, São Luís

    2011-10-01

    Language and music are closely related in our minds. Does musical expertise enhance the recognition of emotions in speech prosody? Forty highly trained musicians were compared with 40 musically untrained adults (controls) in the recognition of emotional prosody. For purposes of generalization, the participants were from two age groups, young (18-30 years) and middle adulthood (40-60 years). They were presented with short sentences expressing six emotions (anger, disgust, fear, happiness, sadness, and surprise) plus neutrality, conveyed by prosody alone. In each trial, they performed a forced-choice identification of the expressed emotion (reaction times, RTs, were collected) and an intensity judgment. General intelligence, cognitive control, and personality traits were also assessed. A robust effect of expertise was found: musicians were more accurate than controls, similarly across emotions and age groups. This effect cannot be attributed to socioeducational background, general cognitive or personality characteristics, because these did not differ between musicians and controls; perceived intensity and RTs were also similar in both groups. Furthermore, basic acoustic properties of the stimuli like fundamental frequency and duration were predictive of the participants' responses, and musicians and controls were similarly efficient in using them. Musical expertise was thus associated with cross-domain benefits to emotional prosody. These results indicate that emotional processing in music and in language engages shared resources.

  1. Conversation electrified: ERP correlates of speech act recognition in underspecified utterances.

    Science.gov (United States)

    Gisladottir, Rosa S; Chwilla, Dorothee J; Levinson, Stephen C

    2015-01-01

    The ability to recognize speech acts (verbal actions) in conversation is critical for everyday interaction. However, utterances are often underspecified for the speech act they perform, requiring listeners to rely on the context to recognize the action. The goal of this study was to investigate the time-course of auditory speech act recognition in action-underspecified utterances and explore how sequential context (the prior action) impacts this process. We hypothesized that speech acts are recognized early in the utterance to allow for quick transitions between turns in conversation. Event-related potentials (ERPs) were recorded while participants listened to spoken dialogues and performed an action categorization task. The dialogues contained target utterances, each of which could deliver three distinct speech acts depending on the prior turn. The targets were identical across conditions, but differed in the type of speech act performed and how it fit into the larger action sequence. The ERP results show an early effect of action type, reflected by frontal positivities as early as 200 ms after target utterance onset. This indicates that speech act recognition begins early in the turn when the utterance has only been partially processed. Providing further support for early speech act recognition, actions in highly constraining contexts did not elicit an ERP effect to the utterance-final word. We take this to show that listeners can recognize the action before the final word through predictions at the speech act level. However, additional processing based on the complete utterance is required in more complex actions, as reflected by a posterior negativity at the final word when the speech act is in a less constraining context and a new action sequence is initiated. These findings demonstrate that sentence comprehension in conversational contexts crucially involves recognition of verbal action which begins as soon as it can.

  3. Effects of Semantic Context and Fundamental Frequency Contours on Mandarin Speech Recognition by Second Language Learners.

    Science.gov (United States)

    Zhang, Linjun; Li, Yu; Wu, Han; Li, Xin; Shu, Hua; Zhang, Yang; Li, Ping

    2016-01-01

    Speech recognition by second language (L2) learners in optimal and suboptimal conditions has been examined extensively with English as the target language in most previous studies. This study extended existing experimental protocols (Wang et al., 2013) to investigate Mandarin speech recognition by Japanese learners of Mandarin at two different levels (elementary vs. intermediate) of proficiency. The overall results showed that in addition to L2 proficiency, semantic context, F0 contours, and listening condition all affected the recognition performance on the Mandarin sentences. However, the effects of semantic context and F0 contours on L2 speech recognition diverged to some extent. Specifically, there was significant modulation effect of listening condition on semantic context, indicating that L2 learners made use of semantic context less efficiently in the interfering background than in quiet. In contrast, no significant modulation effect of listening condition on F0 contours was found. Furthermore, there was significant interaction between semantic context and F0 contours, indicating that semantic context becomes more important for L2 speech recognition when F0 information is degraded. None of these effects were found to be modulated by L2 proficiency. The discrepancy in the effects of semantic context and F0 contours on L2 speech recognition in the interfering background might be related to differences in processing capacities required by the two types of information in adverse listening conditions.

  4. Effect of speech-intrinsic variations on human and automatic recognition of spoken phonemes.

    Science.gov (United States)

    Meyer, Bernd T; Brand, Thomas; Kollmeier, Birger

    2011-01-01

    The aim of this study is to quantify the gap between the recognition performance of human listeners and an automatic speech recognition (ASR) system with special focus on intrinsic variations of speech, such as speaking rate and effort, altered pitch, and the presence of dialect and accent. Second, it is investigated if the most common ASR features contain all information required to recognize speech in noisy environments by using resynthesized ASR features in listening experiments. For the phoneme recognition task, the ASR system achieved the human performance level only when the signal-to-noise ratio (SNR) was increased by 15 dB, which is an estimate for the human-machine gap in terms of the SNR. The major part of this gap is attributed to the feature extraction stage, since human listeners achieve comparable recognition scores when the SNR difference between unaltered and resynthesized utterances is 10 dB. Intrinsic variabilities result in strong increases of error rates, both in human speech recognition (HSR) and ASR (with a relative increase of up to 120%). An analysis of phoneme duration and recognition rates indicates that human listeners are better able to identify temporal cues than the machine at low SNRs, which suggests incorporating information about the temporal dynamics of speech into ASR systems.

  5. Relative Contributions of Spectral and Temporal Cues for Speech Recognition in Patients with Sensorineural Hearing Loss

    Institute of Scientific and Technical Information of China (English)

    XU Li; ZHOU Ning; Rebecca Brashears; Katherine Rife

    2008-01-01

    The present study was designed to examine speech recognition in patients with sensorineural hearing loss when the temporal and spectral information in the speech signals were co-varied. Four subjects with mild to moderate sensorineural hearing loss were recruited to participate in consonant and vowel recognition tests that used speech stimuli processed through a noise-excited vocoder. The number of channels was varied between 2 and 32, which defined spectral information. The lowpass cutoff frequency of the temporal envelope extractor was varied from 1 to 512 Hz, which defined temporal information. Results indicate that performance of subjects with sensorineural hearing loss varied tremendously among the subjects. For consonant recognition, patterns of relative contributions of spectral and temporal information were similar to those in normal-hearing subjects. The utility of temporal envelope information appeared to be normal in the hearing-impaired listeners. For vowel recognition, which depended predominately on spectral information, the performance plateau was achieved with numbers of channels as high as 16-24, much higher than expected, given that the frequency selectivity in patients with sensorineural hearing loss might be compromised. In order to understand the mechanisms on how hearing-impaired listeners utilize spectral and temporal cues for speech recognition, future studies that involve a large sample of patients with sensorineural hearing loss will be necessary to elucidate the relationship between frequency selectivity as well as central processing capability and speech recognition performance using vocoded signals.
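
    The noise-excited vocoder manipulation can be sketched as follows; the filter orders and the 100 Hz lower band edge are assumptions, while the channel count and envelope cutoff are the study's two experimental knobs:

```python
import numpy as np
from scipy.signal import butter, filtfilt

def noise_vocoder(x, sr, n_channels=8, env_cutoff=160.0):
    """Noise-excited envelope vocoder: split the signal into log-spaced
    bands, extract each band's temporal envelope with a lowpass filter,
    and use it to modulate bandpass-filtered noise."""
    edges = np.logspace(np.log10(100.0), np.log10(sr / 2 * 0.9),
                        n_channels + 1)
    noise = np.random.randn(len(x))
    out = np.zeros_like(x)
    for lo, hi in zip(edges[:-1], edges[1:]):
        b, a = butter(3, [lo / (sr / 2), hi / (sr / 2)], btype='band')
        band = filtfilt(b, a, x)
        be, ae = butter(3, env_cutoff / (sr / 2), btype='low')
        env = np.maximum(filtfilt(be, ae, np.abs(band)), 0.0)
        out += env * filtfilt(b, a, noise)   # envelope-modulated noise
    return out
```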

  7. Feature Fusion Algorithm for Multimodal Emotion Recognition from Speech and Facial Expression Signal

    Directory of Open Access Journals (Sweden)

    Han Zhiyan

    2016-01-01

    Full Text Available To overcome the limitations of single-mode emotion recognition, this paper describes a novel multimodal emotion recognition algorithm that takes speech and facial expression signals as its research subjects. First, the speech signal features and facial expression features are fused, and sample sets are generated by sampling with replacement; classifiers are then obtained with a BP neural network (BPNN). Second, the difference between two classifiers is measured with a double-error-difference selection strategy. Finally, the final recognition result is obtained by majority voting. Experiments show that the method improves the accuracy of emotion recognition by exploiting the advantages of both decision-level and feature-level fusion, bringing the fusion process closer to human emotion recognition, with a recognition rate of 90.4%.
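
    A stripped-down illustration of the two fusion levels involved, feature concatenation before training and majority voting after classification; this is a generic sketch, not the paper's BPNN ensemble:

```python
import numpy as np
from collections import Counter

def fuse_features(speech_feat, face_feat):
    """Feature-level fusion: concatenate per-sample modality features."""
    return np.concatenate([speech_feat, face_feat])

def majority_vote(labels):
    """Decision-level fusion: the most frequent label among the
    ensemble's predictions wins (ties broken by first occurrence)."""
    return Counter(labels).most_common(1)[0][0]

# majority_vote(['happy', 'happy', 'neutral']) -> 'happy'
```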

  8. Visual face-movement sensitive cortex is relevant for auditory-only speech recognition.

    Science.gov (United States)

    Riedel, Philipp; Ragert, Patrick; Schelinski, Stefanie; Kiebel, Stefan J; von Kriegstein, Katharina

    2015-07-01

    It is commonly assumed that the recruitment of visual areas during audition is not relevant for performing auditory tasks ('auditory-only view'). According to an alternative view, however, the recruitment of visual cortices is thought to optimize auditory-only task performance ('auditory-visual view'). This alternative view is based on functional magnetic resonance imaging (fMRI) studies. These studies have shown, for example, that even if there is only auditory input available, face-movement sensitive areas within the posterior superior temporal sulcus (pSTS) are involved in understanding what is said (auditory-only speech recognition). This is particularly the case when speakers are known audio-visually, that is, after brief voice-face learning. Here we tested whether the left pSTS involvement is causally related to performance in auditory-only speech recognition when speakers are known by face. To test this hypothesis, we applied cathodal transcranial direct current stimulation (tDCS) to the pSTS during (i) visual-only speech recognition of a speaker known only visually to participants and (ii) auditory-only speech recognition of speakers they learned by voice and face. We defined the cathode as active electrode to down-regulate cortical excitability by hyperpolarization of neurons. tDCS to the pSTS interfered with visual-only speech recognition performance compared to a control group without pSTS stimulation (tDCS to BA6/44 or sham). Critically, compared to controls, pSTS stimulation additionally decreased auditory-only speech recognition performance selectively for voice-face learned speakers. These results are important in two ways. First, they provide direct evidence that the pSTS is causally involved in visual-only speech recognition; this confirms a long-standing prediction of current face-processing models. Secondly, they show that visual face-sensitive pSTS is causally involved in optimizing auditory-only speech recognition. These results are in line

  9. Does quality of life depend on speech recognition performance for adult cochlear implant users?

    Science.gov (United States)

    Capretta, Natalie R; Moberly, Aaron C

    2016-03-01

    Current postoperative clinical outcome measures for adults receiving cochlear implants (CIs) consist of testing speech recognition, primarily under quiet conditions. However, it is strongly suspected that results on these measures may not adequately reflect patients' quality of life (QOL) using their implants. This study aimed to evaluate whether QOL for CI users depends on speech recognition performance. Twenty-three postlingually deafened adults with CIs were assessed. Participants were tested for speech recognition (Central Institute for the Deaf word and AzBio sentence recognition in quiet) and completed three QOL measures-the Nijmegen Cochlear Implant Questionnaire; either the Hearing Handicap Inventory for Adults or the Hearing Handicap Inventory for the Elderly; and the Speech, Spatial and Qualities of Hearing Scale questionnaires-to assess a variety of QOL factors. Correlations were sought between speech recognition and QOL scores. Demographics, audiologic history, language, and cognitive skills were also examined as potential predictors of QOL. Only a few QOL scores significantly correlated with postoperative sentence or word recognition in quiet, and correlations were primarily isolated to speech-related subscales on QOL measures. Poorer pre- and postoperative unaided hearing predicted better QOL. Socioeconomic status, duration of deafness, age at implantation, duration of CI use, reading ability, vocabulary size, and cognitive status did not consistently predict QOL scores. For adult, postlingually deafened CI users, clinical speech recognition measures in quiet do not correlate broadly with QOL. Results suggest the need for additional outcome measures of the benefits and limitations of cochlear implantation. 4. Laryngoscope, 126:699-706, 2016. © 2015 The American Laryngological, Rhinological and Otological Society, Inc.

  10. Recognition of voice commands using adaptation of foreign language speech recognizer via selection of phonetic transcriptions

    Science.gov (United States)

    Maskeliunas, Rytis; Rudzionis, Vytautas

    2011-06-01

    In recent years various commercial speech recognizers have become available. These recognizers make it possible to develop applications incorporating various speech recognition techniques easily and quickly. All of these commercial recognizers are typically targeted at widely spoken languages with large market potential; however, it may be possible to adapt them for use in environments where less widely spoken languages are used. Since most commercial recognition engines are closed systems, the single avenue for adaptation is to devise ways of selecting proper phonetic transcriptions that map between the two languages. This paper deals with methods to find phonetic transcriptions for Lithuanian voice commands to be recognized using English speech engines. The experimental evaluation showed that it is possible to find phonetic transcriptions that enable the recognition of Lithuanian voice commands with an accuracy of over 90%.
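
    In practice the adaptation amounts to maintaining several candidate English-orthography transcriptions per Lithuanian command and keeping whichever one the closed engine scores best; the lexicon entries and scoring function below are hypothetical:

```python
# Hypothetical variant lexicon: each Lithuanian command is paired with
# several English-orthography transcriptions the engine can score.
VARIANTS = {
    'pirmyn': ['peer min', 'pir meen', 'peer mean'],
    'atgal':  ['ut gull', 'at gahl'],
}

def best_variant(command, score_fn):
    """Keep the transcription variant the English engine recognizes most
    reliably; score_fn returns accuracy on held-out recordings."""
    return max(VARIANTS[command], key=score_fn)
```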

  11. Melodic contour identification and sentence recognition using sung speech.

    Science.gov (United States)

    Crew, Joseph D; Galvin, John J; Fu, Qian-Jie

    2015-09-01

    For bimodal cochlear implant users, acoustic and electric hearing has been shown to contribute differently to speech and music perception. However, differences in test paradigms and stimuli in speech and music testing can make it difficult to assess the relative contributions of each device. To address these concerns, the Sung Speech Corpus (SSC) was created. The SSC contains 50 monosyllable words sung over an octave range and can be used to test both speech and music perception using the same stimuli. Here SSC data are presented with normal hearing listeners and any advantage of musicianship is examined.

  12. Effect of speech rate variation on acoustic phone stability in Afrikaans speech recognition

    CSIR Research Space (South Africa)

    Badenhorst, JAC

    2007-11-01

    Full Text Available The authors analyse the effect of speech rate variation on Afrikaans phone stability from an acoustic perspective. Specifically they introduce two techniques for the acoustic analysis of speech rate variation, apply these techniques to an Afrikaans...

  13. Dopamine receptor D4 (DRD4) gene modulates the influence of informational masking on speech recognition.

    Science.gov (United States)

    Xie, Zilong; Maddox, W Todd; Knopik, Valerie S; McGeary, John E; Chandrasekaran, Bharath

    2015-01-01

    Listeners vary substantially in their ability to recognize speech in noisy environments. Here we examined the role of genetic variation on individual differences in speech recognition in various noise backgrounds. Background noise typically varies in the levels of energetic masking (EM) and informational masking (IM) imposed on target speech. Relative to EM, release from IM is hypothesized to place greater demand on executive function to selectively attend to target speech while ignoring competing noises. Recent evidence suggests that the long allele variant in exon III of the DRD4 gene, primarily expressed in the prefrontal cortex, may be associated with enhanced selective attention to goal-relevant high-priority information even in the face of interference. We investigated the extent to which this polymorphism is associated with speech recognition in IM and EM conditions. In an unscreened adult sample (Experiment 1) and a larger screened replication sample (Experiment 2), we demonstrate that individuals with the DRD4 long variant show better recognition performance in noise conditions involving significant IM, but not in EM conditions. In Experiment 2, we also obtained neuropsychological measures to assess the underlying mechanisms. Mediation analysis revealed that this listening condition-specific advantage was mediated by enhanced executive attention/working memory capacity in individuals with the long allele variant. These findings suggest that DRD4 may contribute specifically to individual differences in speech recognition ability in noise conditions that place demands on executive function. Copyright © 2014 Elsevier Ltd. All rights reserved.

  14. Why has (reasonably accurate) Automatic Speech Recognition been so hard to achieve?

    CERN Document Server

    Wegmann, Steven

    2010-01-01

    Hidden Markov models (HMMs) have been successfully applied to automatic speech recognition for more than 35 years in spite of the fact that a key HMM assumption -- the statistical independence of frames -- is obviously violated by speech data. In fact, this data/model mismatch has inspired many attempts to modify or replace HMMs with alternative models that are better able to take into account the statistical dependence of frames. However it is fair to say that in 2010 the HMM is the consensus model of choice for speech recognition and that HMMs are at the heart of both commercially available products and contemporary research systems. In this paper we present a preliminary exploration aimed at understanding how speech data depart from HMMs and what effect this departure has on the accuracy of HMM-based speech recognition. Our analysis uses standard diagnostic tools from the field of statistics -- hypothesis testing, simulation and resampling -- which are rarely used in the field of speech recognition. Our ma...

  15. Developing a broadband automatic speech recognition system for Afrikaans

    CSIR Research Space (South Africa)

    De Wet, Febe

    2011-08-01

    Full Text Available Afrikaans is one of the eleven official languages of South Africa. It is classified as an under-resourced language. No annotated broadband speech corpora currently exist for Afrikaans. This article reports on the development of speech resources...

  16. Robust Multi biometric Recognition Using Face and Ear Images

    CERN Document Server

    Boodoo, Nazmeen Bibi

    2009-01-01

    This study investigates the use of the ear as a biometric for authentication and shows experimental results obtained on a newly created dataset of 420 images. Images are passed to a quality module in order to reduce the false rejection rate. The Principal Component Analysis (eigen-ear) approach was used, obtaining a 90.7 percent recognition rate. Recognition improves when the ear biometric is fused with the face biometric. The fusion is done at decision level, achieving a recognition rate of 96 percent.
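
    An eigen-ear recognizer of the kind described can be sketched as PCA projection plus nearest-neighbour matching; scikit-learn is used for brevity and dataset handling is omitted:

```python
import numpy as np
from sklearn.decomposition import PCA

def eigen_ear_fit(train_images, n_components=40):
    """Project flattened ear images onto principal components
    ('eigen-ears'); recognition is nearest neighbour in that space."""
    X = np.array([im.ravel() for im in train_images], dtype=float)
    pca = PCA(n_components=n_components).fit(X)
    return pca, pca.transform(X)

def identify(pca, gallery, probe_image):
    """Return the gallery index of the closest enrolled ear."""
    z = pca.transform(probe_image.ravel().astype(float)[None, :])
    return int(np.argmin(np.linalg.norm(gallery - z, axis=1)))
```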

  17. How does susceptibility to proactive interference relate to speech recognition in aided and unaided conditions?

    Directory of Open Access Journals (Sweden)

    Rachel Jane Ellis

    2015-08-01

    Full Text Available Proactive interference (PI) is the capacity to resist interference to the acquisition of new memories from information stored in long-term memory. Previous research has shown that PI correlates significantly with the speech-in-noise recognition scores of younger adults with normal hearing. In this study, we report the results of an experiment designed to investigate the extent to which tests of visual PI relate to the speech-in-noise recognition scores of older adults with hearing loss, in aided and unaided conditions. The results suggest that measures of PI correlate significantly with speech-in-noise recognition only in the unaided condition. Furthermore, the relation between PI and speech-in-noise recognition differs from that observed in younger listeners without hearing loss. The findings suggest that the relation between PI tests and the speech-in-noise recognition scores of older adults with hearing loss reflects the capability of the test to index cognitive flexibility.

  18. Auditory training of speech recognition with interrupted and continuous noise maskers by children with hearing impairment.

    Science.gov (United States)

    Sullivan, Jessica R; Thibodeau, Linda M; Assmann, Peter F

    2013-01-01

    Previous studies have indicated that individuals with normal hearing (NH) experience a perceptual advantage for speech recognition in interrupted noise compared to continuous noise. In contrast, adults with hearing impairment (HI) and younger children with NH receive minimal benefit. The objective of this investigation was to assess whether auditory training in interrupted noise would improve speech recognition in noise for children with HI and perhaps enhance their utilization of glimpsing skills. A partially repeated measures design was used to evaluate the effectiveness of seven 1-h sessions of auditory training in interrupted and continuous noise. Speech recognition scores in interrupted and continuous noise were obtained pre-, post-, and 3 months post-training from 24 children with moderate-to-severe hearing loss. Children who participated in auditory training in interrupted noise demonstrated significantly greater improvement in speech recognition compared to those who trained in continuous noise. Those who trained in interrupted noise demonstrated similar improvements in both noise conditions, while those who trained in continuous noise showed only modest improvements in the interrupted noise condition. This study presents direct evidence that auditory training in interrupted noise can be beneficial in improving speech recognition in noise for children with HI.

  19. Does cognitive function predict frequency compressed speech recognition in listeners with normal hearing and normal cognition?

    Science.gov (United States)

    Ellis, Rachel J; Munro, Kevin J

    2013-01-01

    The aim was to investigate the relationship between cognitive ability and frequency compressed speech recognition in listeners with normal hearing and normal cognition. Speech-in-noise recognition was measured using Institute of Electrical and Electronic Engineers sentences presented over earphones at 65 dB SPL and a range of signal-to-noise ratios. There were three conditions: unprocessed, and at frequency compression ratios of 2:1 and 3:1 (cut-off frequency, 1.6 kHz). Working memory and cognitive ability were measured using the reading span test and the trail making test, respectively. Participants were 15 young normally-hearing adults with normal cognition. There was a statistically significant reduction in mean speech recognition from around 80% when unprocessed to 40% for 2:1 compression and 30% for 3:1 compression. There was a statistically significant relationship between speech recognition and cognition for the unprocessed condition but not for the frequency-compressed conditions. The relationship between cognitive functioning and recognition of frequency compressed speech-in-noise was not statistically significant. The findings may have been different if the participants had been provided with training and/or time to 'acclimatize' to the frequency-compressed conditions.

  20. Cognitive resources related to speech recognition with a competing talker in young and older listeners.

    Science.gov (United States)

    Meister, H; Schreitmüller, S; Grugel, L; Ortmann, M; Beutner, D; Walger, M; Meister, I G

    2013-03-01

    Speech recognition in a multi-talker situation poses high demands on attentional and other central resources. This study examines the relationship between age, cognition and speech recognition in tasks that require selective or divided attention in a multi-talker setting. Two groups of normal-hearing adults (one younger and one older group) were asked to repeat utterances from either one or two concurrent speakers. Cognitive abilities were then inspected by neuropsychological tests. Speech recognition scores approached ceiling and did not significantly differ between age groups for tasks that demanded selective attention. However, when divided attention was required, performance in older listeners was reduced compared to the younger group. When selective attention was required, speech recognition was strongly related to working memory skills, as determined by a regression model. In comparison, speech recognition in tests requiring divided attention was more strongly determined by neuropsychological probes of fluid intelligence. The findings of this study indicate that - apart from hearing impairment - cognitive aspects account for the typical difficulties of older listeners in a multi-speaker setting. Our results are discussed in the context of evidence showing that frontal lobe functions in terms of working memory and fluid intelligence generally decline with age.

  1. Speech recognition in reverberant and noisy environments employing multiple feature extractors and i-vector speaker adaptation

    Science.gov (United States)

    Alam, Md Jahangir; Gupta, Vishwa; Kenny, Patrick; Dumouchel, Pierre

    2015-12-01

    The REVERB challenge provides a common framework for the evaluation of feature extraction techniques in the presence of both reverberation and additive background noise. State-of-the-art speech recognition systems perform well in controlled environments, but their performance degrades in realistic acoustical conditions, especially in real as well as simulated reverberant environments. In this contribution, we utilize multiple feature extractors including the conventional mel-filterbank, multi-taper spectrum estimation-based mel-filterbank, robust mel and compressive gammachirp filterbank, iterative deconvolution-based dereverberated mel-filterbank, and maximum likelihood inverse filtering-based dereverberated mel-frequency cepstral coefficient features for speech recognition with multi-condition training data. In order to improve speech recognition performance, we combine their results using ROVER (Recognizer Output Voting Error Reduction). For the two- and eight-channel tasks, to benefit from the multi-channel data, we also use ROVER, instead of a multi-microphone signal processing method, to reduce the word error rate by selecting the best scoring word at each channel. As in previous work, we also apply i-vector-based speaker adaptation, which was found to be effective. In a speech recognition task, speaker adaptation tries to reduce the mismatch between the training and test speakers. Speech recognition experiments are conducted on the REVERB challenge 2014 corpora using the Kaldi recognizer. In our experiments, we use both utterance-based batch processing and full batch processing. In the single-channel task, full batch processing reduced the word error rate (WER) from 10.0 to 9.3 % on SimData as compared to utterance-based batch processing. Using full batch processing, we obtained an average WER of 9.0 and 23.4 % on the SimData and RealData, respectively, for the two-channel task, whereas for the eight-channel task on the SimData and RealData, the average WERs found were 8
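
    ROVER, as used above, aligns the word sequences from several recognizers into a word transition network and votes slot by slot. The sketch below is a heavily simplified illustration that assumes the hypotheses are already aligned word-for-word (with '@' marking a deletion slot); real ROVER performs the alignment itself via iterative dynamic programming and can weight votes by confidence scores.

    ```python
    from collections import Counter

    def rover_vote(aligned_hyps):
        """aligned_hyps: equal-length word lists from N recognizers."""
        combined = []
        for slot in zip(*aligned_hyps):
            word, _ = Counter(slot).most_common(1)[0]  # majority vote per slot
            if word != '@':                            # drop deletion marker
                combined.append(word)
        return combined

    hyps = [["call", "nine", "one", "one"],
            ["call", "nine", "@",   "one"],
            ["tall", "nine", "one", "one"]]
    print(" ".join(rover_vote(hyps)))   # -> "call nine one one"
    ```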

  2. Language modeling for automatic speech recognition of inflective languages an applications-oriented approach using lexical data

    CERN Document Server

    Donaj, Gregor

    2017-01-01

    This book covers language modeling and automatic speech recognition for inflective languages (e.g. Slavic languages), which represent roughly half of the languages spoken in Europe. These languages do not perform as well as English in speech recognition systems and it is therefore harder to develop an application with sufficient quality for the end user. The authors describe the most important language features for the development of a speech recognition system. This is then presented through the analysis of errors in the system and the development of language models and their inclusion in speech recognition systems, which specifically address the errors that are relevant for targeted applications. The error analysis is done with regard to morphological characteristics of the word in the recognized sentences. The book is oriented towards speech recognition with large vocabularies and continuous and even spontaneous speech. Today such applications work with a rather small number of languages compared to the nu...

  3. An HMM-Like Dynamic Time Warping Scheme for Automatic Speech Recognition

    Directory of Open Access Journals (Sweden)

    Ing-Jr Ding

    2014-01-01

    Full Text Available In the past, the kernel of automatic speech recognition (ASR) was dynamic time warping (DTW), which is feature-based template matching and belongs to the category of dynamic programming (DP) techniques. Although DTW is an early ASR technique, it remains popular in many applications and now plays an important role in the well-known Kinect-based gesture recognition application. This paper proposes an intelligent speech recognition system using an improved DTW approach for multimedia and home automation services. The improved DTW presented in this work, called HMM-like DTW, is essentially a hidden Markov model- (HMM-) like method where the concept of the typical HMM statistical model is brought into the design of DTW. The developed HMM-like DTW method transforms feature-based DTW recognition into model-based DTW recognition, allowing it to behave like the HMM recognition technique; the HMM-like recognition model therefore has the capability to further perform model adaptation (also known as speaker adaptation). A series of experimental results in home automation-based multimedia access service environments demonstrated the superiority and effectiveness of the developed smart speech recognition system using HMM-like DTW.
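
    For reference, a baseline feature-based DTW matcher of the kind the paper builds on might look like the following sketch (the HMM-like extension itself is not reproduced). The feature sequences, dimensions, and Euclidean local distance are illustrative assumptions.

    ```python
    import numpy as np

    def dtw_distance(a, b):
        """a: (n, d), b: (m, d) feature sequences; returns alignment cost."""
        n, m = len(a), len(b)
        D = np.full((n + 1, m + 1), np.inf)
        D[0, 0] = 0.0
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                cost = np.linalg.norm(a[i - 1] - b[j - 1])  # local distance
                # best of insertion, deletion, match
                D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
        return D[n, m]

    # Template matching: pick the reference word with the lowest DTW cost.
    rng = np.random.default_rng(1)
    templates = {w: rng.normal(size=(20, 13)) for w in ["yes", "no"]}
    utterance = templates["yes"] + 0.1 * rng.normal(size=(20, 13))
    print(min(templates, key=lambda w: dtw_distance(utterance, templates[w])))
    ```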

  4. Design and Implementation of Monophones and Triphones-Based Speech Recognition Systems for Voice Activated Telephony

    Directory of Open Access Journals (Sweden)

    Rupayan Das

    2013-07-01

    Full Text Available Speech recognition is the ability of a machine or program to convert spoken words into their equivalent text form. Nowadays, most recognition systems use Hidden Markov Models for modeling the spoken utterances. In this paper we have implemented two speaker-independent speech recognition systems which include all the words required for dialing a phone. The vocabulary contains 42 words, including the digits from zero to nine and the names of 20 persons. A total of 16,800 utterances have been used for training each system. The two systems are able to recognize continuous speech and are implemented with monophones and triphones, respectively, using HTK. Experimental results show an accuracy of 74.11% for the monophone-based models and 93.77% for the triphone-based models.

  5. Comparison of Forced-Alignment Speech Recognition and Humans for Generating Reference VAD

    DEFF Research Database (Denmark)

    Kraljevski, Ivan; Tan, Zheng-Hua; Paola Bissiri, Maria

    2015-01-01

    This paper aims to answer the question whether forced-alignment speech recognition can be used as an alternative to humans in generating reference Voice Activity Detection (VAD) transcriptions. An investigation of the level of agreement between automatic/manual VAD transcriptions and the reference ones produced by a human expert was carried out. Thereafter, statistical analysis was employed on the automatically produced and the collected manual transcriptions. Experimental results confirmed that forced-alignment speech recognition can provide accurate and consistent VAD labels.

  6. A Novel DBN Feature Fusion Model for Cross-Corpus Speech Emotion Recognition

    Directory of Open Access Journals (Sweden)

    Zou Cairong

    2016-01-01

    Full Text Available Feature fusion from separate sources is a current technical difficulty in cross-corpus speech emotion recognition. The purpose of this paper is to use the emotional information hidden in the speech spectrum diagram (spectrogram) as image features, based on Deep Belief Nets (DBNs) from Deep Learning, and then fuse them with traditional emotion features. First, based on spectrogram analysis by the STB/Itti model, new spectrogram features are extracted from the color, the brightness, and the orientation, respectively; then two alternative DBN models fuse the traditional and the spectrogram features, which increases the scale of the feature subset and its ability to characterize emotion. In experiments on the ABC database and Chinese corpora, the new feature subset improves cross-corpus recognition distinctly, by 8.8%, compared with traditional speech emotion features. The proposed method provides a new idea for feature fusion in emotion recognition.

  7. [Research on Barrier-free Home Environment System Based on Speech Recognition].

    Science.gov (United States)

    Zhu, Husheng; Yu, Hongliu; Shi, Ping; Fang, Youfang; Jian, Zhuo

    2015-10-01

    The number of people with physical disabilities is increasing year by year, and the trend of population aging is more and more serious. In order to improve quality of life, a barrier-free home environment control system was developed for patients with serious disabilities, allowing them to control home electrical devices by voice. The control system includes a central control platform, a speech recognition module, a terminal operation module, etc. The system combines speech recognition control technology and wireless information transmission technology with embedded mobile computing technology, and interconnects the lamps, electronic locks, alarms, TV and other electrical devices in the home environment as a whole system through wireless network nodes. The experimental results showed that the speech recognition success rate was more than 84% in the home environment.

  8. Health Care in Home Automation Systems with Speech Recognition and Mobile Technology

    Directory of Open Access Journals (Sweden)

    Jasmin Kurti

    2016-08-01

    Full Text Available Home automation systems use technology to facilitate the lives of the people using them, and they are especially useful for assisting the elderly and persons with special needs. These kinds of systems have been a popular research subject in the last few years. In this work, I present the design and development of a system that provides a life assistant service in a home environment: a smart home-based healthcare system controlled with speech recognition and mobile technology. This includes developing software with speech recognition, speech synthesis, face recognition, controls for Arduino hardware, and a smartphone application for remotely controlling the system. With the developed system, the elderly and persons with special needs can live independently and securely in their own homes, with care facilities. The system is tailored towards the elderly and disabled, but it can also be embedded in any home and used by anybody. It provides healthcare, security, entertainment, and total local and remote control of the home.

  9. Chinese Speech Recognition Model Based on Activation of the State Feedback Neural Network

    Institute of Scientific and Technical Information of China (English)

    李先志; 孙义和

    2001-01-01

    This paper proposes a simplified novel speech recognition model, the state feedback neural network activation model (SFNNAM), which is developed based on the characteristics of Chinese speech structure. The model assumes that the current state of speech is only a correction of the immediately preceding state. According to the "C-V" (Consonant-Vowel) structure of the Chinese language, a speech segmentation method is also implemented in the SFNNAM model. This model has a definite physical meaning grounded in the structure of the Chinese language and is easily implemented in very large scale integrated circuits (VLSI). In the speech recognition experiment, fewer calculations were needed than in the hidden Markov model (HMM) based algorithm. The recognition rate for Chinese numbers was 93.5% for the first candidate and 99.5% for the first two candidates.

  10. Low-cost speech recognition system for small vocabulary and independent speaker

    Science.gov (United States)

    Teh, Chih Chiang; Jong, Ching C.; Siek, Liter

    2000-10-01

    In this paper an ASIC implementation of a low-cost, speaker-independent speech recognition system for a small vocabulary of 15 isolated words is presented. The IC is a digital block that receives 12-bit samples at a sampling rate of 11.025 kHz as its input. The IC runs at a 10 MHz system clock and is targeted at a 0.35 micrometer CMOS process. The whole chip, which includes the speech recognition system core, RAM and ROM, contains about 61000 gates. The die size is 1.5 mm by 3 mm. The design has been coded in VHDL for hardware implementation and its functionality is identical to the Matlab simulation. The average speech recognition rate for this IC is 89 percent for the 15 isolated words.

  11. Comparison of Forced-Alignment Speech Recognition and Humans for Generating Reference VAD

    DEFF Research Database (Denmark)

    Kraljevski, Ivan; Tan, Zheng-Hua; Paola Bissiri, Maria

    2015-01-01

    This paper aims to answer the question whether forced-alignment speech recognition can be used as an alternative to humans in generating reference Voice Activity Detection (VAD) transcriptions. An investigation of the level of agreement between automatic/manual VAD transcriptions and the reference ones produced by a human expert was carried out. Thereafter, statistical analysis was employed on the automatically produced and the collected manual transcriptions. Experimental results confirmed that forced-alignment speech recognition can provide accurate and consistent VAD labels.

  12. Voice Activity Detector of Wake-Up-Word Speech Recognition System Design on FPGA

    Directory of Open Access Journals (Sweden)

    Veton Z. Këpuska

    2014-12-01

    Full Text Available A typical speech recognition system is push-to-talk operated and requires manual activation. However, for those engaged in hands-busy applications, movement may be restricted or impossible. One alternative is a speech-only interface. The proposed method, called Wake-Up-Word Speech Recognition (WUW-SR), utilizes such an interface. A WUW-SR system would allow the user to activate systems (cell phone, computer, etc.) with only speech commands instead of manual activation. The trend in WUW-SR hardware design is towards implementing a complete system on a single chip intended for various applications. This paper presents an experimental FPGA design and implementation of a novel architecture for a real-time feature extraction processor that includes a Voice Activity Detector (VAD) and extraction of MFCC, LPC, and ENH_MFCC features. In the WUW-SR system, the recognizer front-end with VAD is located at the terminal, which is typically connected over a data network (e.g., to a server) for remote back-end recognition. The VAD is responsible for segmenting the signal into speech-like and non-speech-like segments. For any given frame the VAD reports one of two possible states: VAD_ON or VAD_OFF. The back-end is then responsible for scoring the features segmented during the VAD_ON stage. The most important characteristic of the presented design is that it should guarantee virtually 100% correct rejection of non-WUW (out-of-vocabulary words - OOV) while maintaining a correct acceptance rate of 99.9% or higher (in-vocabulary words - INV). This requirement sets WUW-SR apart from other speech recognition tasks because no existing system can guarantee 100% reliability by any measure.
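
    As a software illustration of the VAD stage described above (the actual design is an FPGA feature-extraction processor), here is a simple short-time-energy detector that labels each frame VAD_ON or VAD_OFF. The frame size, hop, noise-floor estimate, and 6 dB margin are assumptions for the sketch, not the paper's parameters.

    ```python
    import numpy as np

    def vad_labels(signal, frame_len=256, hop=128, margin_db=6.0):
        """Label each frame VAD_ON/VAD_OFF by short-time energy."""
        frames = np.lib.stride_tricks.sliding_window_view(signal, frame_len)[::hop]
        energy_db = 10 * np.log10(np.mean(frames ** 2, axis=1) + 1e-12)
        # Assume the quietest 10% of frames are noise; threshold above that.
        noise_floor = np.percentile(energy_db, 10)
        return np.where(energy_db > noise_floor + margin_db, "VAD_ON", "VAD_OFF")

    fs = 8000
    t = np.arange(fs) / fs
    sig = 0.01 * np.random.default_rng(2).normal(size=fs)           # noise floor
    sig[3000:5000] += 0.5 * np.sin(2 * np.pi * 440 * t[3000:5000])  # "speech" burst
    print(vad_labels(sig)[:30])
    ```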

  13. The Combined Effects of Aging and Hearing Loss on Temporal Resolution and Recognition of Reverberant Speech

    Science.gov (United States)

    Halling, Dan C.

    Listeners perform more poorly on a speech-recognition task in a reverberant listening condition than in a non-reverberant one. Elderly listeners experience even greater difficulty than young listeners. It has been suggested that this greater difficulty can be almost entirely explained by taking into account the hearing impairment that typically accompanies the aging process. Nevertheless, existing evidence suggests that elderly hearing-impaired listeners still experience greater difficulty than young hearing-impaired listeners. Some have suggested that elderly listeners, in addition to the hearing loss, also exhibit poorer temporal resolution than young listeners, poorer even than young hearing-impaired listeners. Temporal resolution and speech-recognition performance were evaluated in 8 young normal-hearing listeners, 8 elderly normal-hearing listeners, and 12 elderly listeners with hearing impairment of varying degree. The results suggested that there was an effect of both age and hearing loss on temporal resolution and speech-recognition performance. Additional analyses indicated that the age effects may have actually been caused by slight elevations in the quiet thresholds of the elderly normal-hearing subjects relative to the young normal-hearing subjects. The results also suggested that individual differences in hearing loss and temporal resolution underlie individual differences in speech-recognition performance. Finally, an objective measure for predicting speech intelligibility, the Speech Transmission Index (STI), was evaluated as to its adequacy as a tool for predicting speech-recognition performance in young and elderly, normal-hearing and hearing-impaired, listeners in anechoic or reverberant conditions. Several derivations of the STI provided tight-fitting functions relating percent correct to STI, one of which requires only knowledge of the listener's quiet thresholds and the acoustical properties of the room.

  14. Sensitivity Based Segmentation and Identification in Automatic Speech Recognition.

    Science.gov (United States)

    1984-03-30

    by a network constructed from phonemic, phonetic, and phonological rules. Regardless of the speech processing system used, Klatt has described... analysis, and its use in the segmentation and identification of the phonetic units of speech, that was initiated during the 1982 Summer Faculty Research... practicable framework for incorporation of acoustic-phonetic variance as well as time and talker normalization.

  15. Marginalised Stacked Denoising Autoencoders for Robust Representation of Real-Time Multi-View Action Recognition

    National Research Council Canada - National Science Library

    Gu, Feng; Flórez-Revuelta, Francisco; Monekosso, Dorothy; Remagnino, Paolo

    2015-01-01

    ...) representation in terms of robustness and usefulness for multi-view action recognition. The resulting representations are fed into three simple fusion strategies as well as a multiple kernel learning algorithm at the classification stage...

  16. Self-organizing map classifier for stressed speech recognition

    Science.gov (United States)

    Partila, Pavol; Tovarek, Jaromir; Voznak, Miroslav

    2016-05-01

    This paper presents a method for detecting speech under stress using Self-Organizing Maps. Most people who are exposed to stressful situations cannot respond adequately to stimuli. The army, police, and fire departments account for the largest share of occupations that typically face an increased number of stressful situations. Personnel in action are directed by a control center, and control commands should be adapted to the psychological state of the person in action. It is known that psychological changes in the human body are also reflected physiologically, which means that stress affects speech. It is therefore clear that a system for recognizing stressed speech is needed in the security forces. One possible classifier, popular for its flexibility, is the self-organizing map (SOM), a type of artificial neural network. Flexibility here means that the classifier is independent of the character of the input data, a feature that is suitable for speech processing. Human stress can be seen as a kind of emotional state. Mel-frequency cepstral coefficients, LPC coefficients, and prosody features were selected as input data because of their sensitivity to emotional changes. The parameters were calculated on speech recordings that can be divided into two classes, namely stressed-state recordings and normal-state recordings. The contribution of the experiment is a method using a SOM classifier for stressed speech detection. Results showed the advantage of this method, namely its input data flexibility.
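
    A minimal numpy self-organizing map, of the general kind used above, is sketched below. The grid size, learning-rate and neighbourhood schedules, and the synthetic two-cluster "normal vs. stressed" data are invented for illustration; the paper's actual inputs would be MFCC, LPC, and prosody vectors.

    ```python
    import numpy as np

    def train_som(data, grid=(6, 6), iters=2000, lr0=0.5, sigma0=2.0):
        """Train a 2-D SOM: prototypes are pulled toward random samples
        with a Gaussian neighbourhood that shrinks over training."""
        rng = np.random.default_rng(0)
        w = rng.normal(size=grid + (data.shape[1],))      # prototype grid
        gy, gx = np.mgrid[0:grid[0], 0:grid[1]]
        for t in range(iters):
            x = data[rng.integers(len(data))]
            frac = t / iters
            lr, sigma = lr0 * (1 - frac), sigma0 * (1 - frac) + 0.5
            # Best-matching unit and Gaussian neighbourhood around it.
            d = np.linalg.norm(w - x, axis=2)
            by, bx = np.unravel_index(np.argmin(d), grid)
            h = np.exp(-((gy - by) ** 2 + (gx - bx) ** 2) / (2 * sigma ** 2))
            w += lr * h[..., None] * (x - w)
        return w

    # Two synthetic feature clusters (e.g. normal vs. stressed speech).
    rng = np.random.default_rng(3)
    data = np.vstack([rng.normal(0, 1, (100, 12)), rng.normal(4, 1, (100, 12))])
    som = train_som(data)
    print(som.shape)   # (6, 6, 12) trained prototype grid
    ```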

  17. Frequency band-importance functions for auditory and auditory-visual speech recognition

    Science.gov (United States)

    Grant, Ken W.

    2005-04-01

    In many everyday listening environments, speech communication involves the integration of both acoustic and visual speech cues. This is especially true in noisy and reverberant environments where the speech signal is highly degraded, or when the listener has a hearing impairment. Understanding the mechanisms involved in auditory-visual integration is a primary interest of this work. Of particular interest is whether listeners are able to allocate their attention to various frequency regions of the speech signal differently under auditory-visual conditions and auditory-alone conditions. For auditory speech recognition, the most important frequency regions tend to be around 1500-3000 Hz, corresponding roughly to important acoustic cues for place of articulation. The purpose of this study is to determine the most important frequency region under auditory-visual speech conditions. Frequency band-importance functions for auditory and auditory-visual conditions were obtained by having subjects identify speech tokens under conditions where the speech-to-noise ratio of different parts of the speech spectrum is independently and randomly varied on every trial. Point biserial correlations were computed for each separate spectral region and the normalized correlations are interpreted as weights indicating the importance of each region. Relations among frequency-importance functions for auditory and auditory-visual conditions will be discussed.
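
    The correlational weighting procedure described above can be sketched as follows. Everything here is simulated under stated assumptions (4 bands, random per-trial band SNRs, a logistic listener relying mostly on one band, binary trial scores); the normalized point-biserial correlations play the role of band-importance weights.

    ```python
    import numpy as np
    from scipy.stats import pointbiserialr

    rng = np.random.default_rng(4)
    n_trials, n_bands = 500, 4
    band_snr = rng.uniform(-10, 10, size=(n_trials, n_bands))

    # Simulated listener relying mostly on band 2 (cf. 1500-3000 Hz).
    true_w = np.array([0.1, 0.2, 1.0, 0.1])
    p_correct = 1 / (1 + np.exp(-(band_snr @ true_w) / 5))
    correct = rng.random(n_trials) < p_correct

    # Point-biserial correlation of each band's SNR with trial correctness.
    r = np.array([pointbiserialr(correct, band_snr[:, b])[0]
                  for b in range(n_bands)])
    print(r / r.sum())    # normalized importance weight per band
    ```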

  18. A computer model of auditory efferent suppression: implications for the recognition of speech in noise.

    Science.gov (United States)

    Brown, Guy J; Ferry, Robert T; Meddis, Ray

    2010-02-01

    The neural mechanisms underlying the ability of human listeners to recognize speech in the presence of background noise are still imperfectly understood. However, there is mounting evidence that the medial olivocochlear system plays an important role, via efferents that exert a suppressive effect on the response of the basilar membrane. The current paper presents a computer modeling study that investigates the possible role of this activity on speech intelligibility in noise. A model of auditory efferent processing [Ferry, R. T., and Meddis, R. (2007). J. Acoust. Soc. Am. 122, 3519-3526] is used to provide acoustic features for a statistical automatic speech recognition system, thus allowing the effects of efferent activity on speech intelligibility to be quantified. Performance of the "basic" model (without efferent activity) on a connected digit recognition task is good when the speech is uncorrupted by noise but falls when noise is present. However, recognition performance is much improved when efferent activity is applied. Furthermore, optimal performance is obtained when the amount of efferent activity is proportional to the noise level. The results obtained are consistent with the suggestion that efferent suppression causes a "release from adaptation" in the auditory-nerve response to noisy speech, which enhances its intelligibility.

  19. Across-site patterns of modulation detection: relation to speech recognition.

    Science.gov (United States)

    Garadat, Soha N; Zwolan, Teresa A; Pfingst, Bryan E

    2012-05-01

    The aim of this study was to identify across-site patterns of modulation detection thresholds (MDTs) in subjects with cochlear implants and to determine if removal of sites with the poorest MDTs from speech processor programs would result in improved speech recognition. Five-hundred-millisecond trains of symmetric biphasic pulses were modulated sinusoidally at 10 Hz and presented at a rate of 900 pps using monopolar stimulation. Subjects were asked to discriminate a modulated pulse train from an unmodulated pulse train for all electrodes in quiet and in the presence of an interleaved unmodulated masker presented on the adjacent site. Across-site patterns of masked MDTs were then used to construct two 10-channel MAPs such that one MAP consisted of sites with the best masked MDTs and the other MAP consisted of sites with the worst masked MDTs. Subjects' speech recognition skills were compared when they used these two different MAPs. Results showed that MDTs were variable across sites and were elevated in the presence of a masker by various amounts across sites. Better speech recognition was observed when the processor MAP consisted of sites with the best masked MDTs, suggesting that temporal modulation sensitivity makes important contributions to speech recognition with a cochlear implant.

  20. Studying the Speech Recognition Scores of Hearing Impaired Children by Using Nonsense Syllables

    Directory of Open Access Journals (Sweden)

    Mohammad Reza Keyhani

    1998-09-01

    Full Text Available Background: The current article is aimed at evaluating the speech recognition scores of hearing aid wearers to determine whether nonsense syllables are suitable speech materials for evaluating the effectiveness of their hearing aids. Method: Subjects were 60 children (15 males and 15 females) with bilateral moderate to moderately severe sensorineural hearing impairment, aged between 7.7 and 14 years. Gain prescription was fitted by the NAL method. Speech evaluation was then performed in a quiet place, with and without the hearing aid, using a list of 25 monosyllabic words recorded on tape. A list was prepared on which the subjects checked the correct response. The same method was used to obtain results for normal subjects. Results: The results revealed that the subjects achieved significantly higher speech recognition scores (SRS) when using their hearing aids than when not wearing them, although speech recognition ability was not compensated completely (the maximum score obtained was 60%). It was also revealed that syllable recognition ability decreased in the less amplified frequencies. The SRS was much higher in normal subjects (with an average of 88%). Conclusion: It seems that speech recognition scores can provide audiologists with a more comprehensive method to evaluate hearing aid benefits.

  1. A Review on Speech Corpus Development for Automatic Speech Recognition in Indian Languages

    OpenAIRE

    Cini kurian

    2015-01-01

    Corpus development gained much attention due to recent statistics-based natural language processing. It has new applications in language technology, linguistic research, language education and information exchange. Corpus-based language research has an innovative outlook that will discard aged linguistic theories. A speech corpus is an essential resource for building a speech recognizer, and one of the main challenges faced by speech scientists is the unavailability of these resources. Very f...

  2. A Spatiotemporal Robust Approach for Human Activity Recognition

    Directory of Open Access Journals (Sweden)

    Md. Zia Uddin

    2013-11-01

    Full Text Available Nowadays, human activity recognition is considered to be one of the fundamental topics in computer vision research areas, including human-robot interaction. In this work, a novel method is proposed that utilizes the depth and optical flow motion information of human silhouettes from video for human activity recognition. The method applies enhanced independent component analysis (EICA) to depth silhouettes, uses optical flow motion features, and employs hidden Markov models (HMMs) for recognition. The local features are extracted from the collection of depth silhouettes exhibiting various human activities. Optical flow-based motion features are also extracted from the depth silhouette area and used in an augmented form to form the spatiotemporal features. Next, the augmented features are enhanced by generalized discriminant analysis (GDA) for better activity representation. These features are then fed into HMMs to model human activities and recognize them. The experimental results show the superiority of the proposed approach over conventional ones.

  3. Emotional recognition from the speech signal for a virtual education agent

    Science.gov (United States)

    Tickle, A.; Raghu, S.; Elshaw, M.

    2013-06-01

    This paper explores the extraction of features from the speech wave to perform intelligent emotion recognition. A feature extraction tool (openSMILE) was used to obtain a baseline set of 998 acoustic features from a set of emotional speech recordings from a microphone. The initial features were reduced to the most important ones so that recognition of emotions using a supervised neural network could be performed. Given that the future use of virtual education agents lies in making the agents more interactive, developing agents with the capability to recognise and adapt to the emotional state of humans is an important step.

  4. Fusing Eye-gaze and Speech Recognition for Tracking in an Automatic Reading Tutor

    DEFF Research Database (Denmark)

    Rasmussen, Morten Højfeldt; Tan, Zheng-Hua

    2013-01-01

    In this paper we present a novel approach for automatically tracking the reading progress using a combination of eye-gaze tracking and speech recognition. The two are fused by first generating word probabilities based on eye-gaze information and then using these probabilities to augment the langu...

  5. EEG frequency-amplitude characteristics of the successful recognition of emotional speech.

    Science.gov (United States)

    Kislova, O O; Rusalova, M N

    2010-07-01

    EEG frequency-amplitude characteristics were studied in two groups of subjects, with high and low "emotional hearing" measures. Comparison of power over the whole EEG range between the two groups of subjects led to the conclusion that the EEG activation level was significantly higher in subjects with low "emotional hearing" measures than in those with high levels. This group also showed a higher level of activation in the posterior temporal areas of the cortex of the right hemisphere on recognition of emotions in speech. Thus, high initial levels of cortical activation and greater EEG reactivity on hearing emotional phrases are factors hindering the recognition of emotional expression in speech.

  6. Dynamic HMM Model with Estimated Dynamic Property in Continuous Mandarin Speech Recognition

    Institute of Scientific and Technical Information of China (English)

    CHENFeili; ZHUJie

    2003-01-01

    A new dynamic HMM (hidden Markov model) is introduced in this paper, which describes the relationship between the dynamic property and the feature space. A method to estimate the dynamic property is discussed, which makes the dynamic HMM much more practical in real-time speech recognition. Experiments on a large vocabulary continuous Mandarin speech recognition task have shown that the dynamic HMM model can achieve about 10% error reduction for both tonal and toneless syllables. The estimated dynamic property can achieve nearly the same (or even better) performance as the extracted dynamic property.

  7. Speech recognition in noise as a function of the number of spectral channels : Comparison of acoustic hearing and cochlear implants

    NARCIS (Netherlands)

    Friesen, LM; Shannon, RV; Baskent, D; Wang, YB

    Speech recognition was measured as a function of spectral resolution (number of spectral channels) and speech-to-noise ratio in normal-hearing (NH) and cochlear-implant (CI) listeners. Vowel, consonant, word, and sentence recognition were measured in five normal-hearing listeners, ten listeners

  8. Design and implementation of a user-oriented speech recognition interface: the synergy of technology and human factors

    NARCIS (Netherlands)

    Kloosterman, Sietse H.

    1994-01-01

    The design and implementation of a user-oriented speech recognition interface are described. The interface enables the use of speech recognition in so-called interactive voice response systems which can be accessed via a telephone connection. In the design of the interface a synergy of technology

  9. DESIGN AND IMPLEMENTATION OF A USER-ORIENTED SPEECH RECOGNITION INTERFACE - THE SYNERGY OF TECHNOLOGY AND HUMAN-FACTORS

    NARCIS (Netherlands)

    KLOOSTERMAN, SH

    The design and implementation of a user-oriented speech recognition interface are described. The interface enables the use of speech recognition in so-called interactive voice response systems which can be accessed via a telephone connection. In the design of the interface a synergy of technology

  10. An Automatic Prolongation Detection Approach in Continuous Speech With Robustness Against Speaking Rate Variations

    Science.gov (United States)

    Esmaili, Iman; Dabanloo, Nader Jafarnia; Vali, Mansour

    2017-01-01

    In recent years, many methods for the automatic detection of prolongation in the speech of people who stutter have been introduced to support the diagnosis of stuttering. However, less attention has been paid to treatment processes in which clients learn to speak more slowly. The aim of this study was to develop a method to help speech-language pathologists (SLPs) during diagnosis and treatment sessions. To this end, speech signals were first parameterized into perceptual linear predictive (PLP) features. To detect the prolonged segments, the similarities between successive frames of the speech signal were calculated using correlation similarity measures. Segments were labeled as prolongation when the duration of highly similar successive frames exceeded a threshold specified by the speaking rate. The proposed method was evaluated on the UCLASS and self-recorded Persian speech databases, and the results were compared with three high-performance studies in automatic prolongation detection. The best prolongation detection accuracies were 99 and 97.1% for the UCLASS and Persian databases, respectively. The proposed method also indicated promising robustness against artificial variation of speaking rate from 70 to 130% of the normal speaking rate. PMID:28487827
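
    The similarity-run idea is easy to sketch: compute a similarity between successive feature frames and flag spans whose run length exceeds a duration threshold tied to speaking rate. Cosine similarity, the 0.98 threshold, and the 30-frame minimum below are illustrative assumptions, not the paper's tuned values (which use correlation measures on PLP features).

    ```python
    import numpy as np

    def find_prolongations(feats, sim_thresh=0.98, min_frames=30):
        """feats: (T, d) per-frame features. Returns (start, end) spans."""
        a, b = feats[:-1], feats[1:]
        # Cosine similarity between each frame and its successor.
        sim = np.sum(a * b, axis=1) / (
            np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1) + 1e-12)
        spans, start = [], None
        for i, s in enumerate(sim):
            if s >= sim_thresh and start is None:
                start = i
            elif s < sim_thresh and start is not None:
                if i - start >= min_frames:       # long run => prolongation
                    spans.append((start, i))
                start = None
        if start is not None and len(sim) - start >= min_frames:
            spans.append((start, len(sim)))
        return spans

    # Toy test: 40 nearly identical frames embedded in varying frames.
    rng = np.random.default_rng(5)
    feats = rng.normal(size=(120, 13))
    feats[50:90] = feats[50] + 0.001 * rng.normal(size=(40, 13))
    print(find_prolongations(feats))   # expect a span near (50, 89)
    ```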

  11. Representation, Classification and Information Fusion for Robust and Efficient Multimodal Human States Recognition

    Science.gov (United States)

    Li, Ming

    2013-01-01

    The goal of this work is to enhance the robustness and efficiency of the multimodal human states recognition task. Human states recognition can be considered a joint term for identifying/verifying various kinds of human-related states, such as biometric identity, language spoken, age, gender, emotion, intoxication level, physical activity, vocal…

  12. Recognition of Emotions in Mexican Spanish Speech: An Approach Based on Acoustic Modelling of Emotion-Specific Vowels

    OpenAIRE

    Santiago-Omar Caballero-Morales

    2013-01-01

    An approach for the recognition of emotions in speech is presented. The target language is Mexican Spanish, and for this purpose a speech database was created. The approach consists in the phoneme acoustic modelling of emotion-specific vowels. For this, a standard phoneme-based Automatic Speech Recognition (ASR) system was built with Hidden Markov Models (HMMs), where different phoneme HMMs were built for the consonants and emotion-specific vowels associated with four emotional states (anger,...

  13. Microphone-Independent Robust Signal Processing Using Probabilistic Optimum Filtering

    Science.gov (United States)

    1994-01-01

    11. A. Acero, "Acoustical and Environmental Robustness in Automatic Speech Recognition," Ph.D. Thesis, Carnegie Mellon University, September 1990... 12. R.M. Stern, F.-H. Liu, Y. Ohshima, T.M. Sullivan, and A. Acero, "Multiple Approaches to Robust Speech Recognition," 1992 International

  14. Speech Emotion Recognition Based on Parametric Filter and Fractal Dimension

    Science.gov (United States)

    Mao, Xia; Chen, Lijiang

    In this paper, we propose a new method that employs two novel features, correlation density (Cd) and fractal dimension (Fd), to recognize the emotional states contained in speech. The former feature, obtained from a bank of parametric filters, reflects the broad frequency components and the fine structure of the lower frequency components, contributed by unvoiced phones and voiced phones, respectively; the latter feature indicates the non-linearity and self-similarity of a speech signal. Comparative experiments based on Hidden Markov Model and K Nearest Neighbor methods are carried out. The results show that Cd and Fd are much more closely related to emotional expression than the commonly used features.

  15. Local Illumination Normalization and Facial Feature Point Selection for Robust Face Recognition

    Directory of Open Access Journals (Sweden)

    Song HAN

    2013-03-01

    Full Text Available Face recognition systems must be robust to variation in factors such as facial expression, illumination, head pose and aging. In particular, robustness against illumination variation is one of the most important problems to be solved for the practical use of face recognition systems. The Gabor wavelet is widely used in face detection and recognition because it can simulate the function of the human visual system. In this paper, we propose a method for extracting Gabor wavelet features that is stable under local illumination variation, and we show experimental results demonstrating its effectiveness.

  16. Implementation of a Tour Guide Robot System Using RFID Technology and Viterbi Algorithm-Based HMM for Speech Recognition

    Directory of Open Access Journals (Sweden)

    Neng-Sheng Pai

    2014-01-01

    Full Text Available This paper applied speech recognition and RFID technologies to develop an omni-directional mobile robot with voice control and guide introduction functions. For speech recognition, the speech signals were captured by short-time processing. The speaker first recorded isolated words for the robot to create a speech database of specific speakers. After pre-processing of this speech database, the feature parameters of cepstrum and delta-cepstrum were obtained using linear predictive coding (LPC). The Hidden Markov Model (HMM) was then used for model training on the speech database, and the Viterbi algorithm was used to find an optimal state sequence as the reference sample for speech recognition. The trained reference models were loaded onto the industrial computer on the robot platform, and the user spoke the isolated words to be tested. After processing with the same front-end and comparison against the reference models, the model whose Viterbi path yielded the maximum total probability was taken as the recognition result. Finally, the speech recognition and RFID systems were tested in an actual environment to prove their feasibility and stability, and were implemented on the omni-directional mobile robot.
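
    For reference, textbook Viterbi decoding over a trained HMM looks like the following sketch; the two-state model parameters are invented for illustration and stand in for the word models described above.

    ```python
    import numpy as np

    def viterbi(log_pi, log_A, log_B):
        """log_B: (T, S) per-frame state log-likelihoods. Returns best path."""
        T, S = log_B.shape
        delta = log_pi + log_B[0]                 # best score ending in state s
        psi = np.zeros((T, S), dtype=int)         # backpointers
        for t in range(1, T):
            scores = delta[:, None] + log_A       # (prev state, current state)
            psi[t] = np.argmax(scores, axis=0)
            delta = scores[psi[t], np.arange(S)] + log_B[t]
        path = [int(np.argmax(delta))]
        for t in range(T - 1, 0, -1):             # trace back through psi
            path.append(int(psi[t][path[-1]]))
        return path[::-1], float(np.max(delta))

    log_pi = np.log([0.7, 0.3])
    log_A = np.log([[0.8, 0.2], [0.3, 0.7]])
    log_B = np.log(np.array([[0.9, 0.1], [0.6, 0.4], [0.2, 0.8]]))
    print(viterbi(log_pi, log_A, log_B))
    ```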

  17. The Role of Somatosensory Information in Speech Perception: Imitation Improves Recognition of Disordered Speech.

    Science.gov (United States)

    Borrie, Stephanie A; Schäfer, Martina C M

    2015-12-01

    Perceptual learning paradigms involving written feedback appear to be a viable clinical tool to reduce the intelligibility burden of dysarthria. The underlying theoretical assumption is that pairing the degraded acoustics with the intended lexical targets facilitates a remapping of existing mental representations in the lexicon. This study investigated whether ties to mental representations can be strengthened by way of a somatosensory motor trace. Following an intelligibility pretest, 100 participants were assigned to 1 of 5 experimental groups. The control group received no training, but the other 4 groups received training with dysarthric speech under conditions involving a unique combination of auditory targets, written feedback, and/or a vocal imitation task. All participants then completed an intelligibility posttest. Training improved intelligibility of dysarthric speech, with the largest improvements observed when the auditory targets were accompanied by both written feedback and an imitation task. Further, a significant relationship between intelligibility improvement and imitation accuracy was identified. This study suggests that somatosensory information can strengthen the activation of speech sound maps of dysarthric speech. The findings, therefore, implicate a bidirectional relationship between speech perception and speech production as well as advance our understanding of the mechanisms that underlie perceptual learning of degraded speech.

  18. Speech recognition using Kohonen neural networks, dynamic programming, and multi-feature fusion

    Science.gov (United States)

    Stowe, Francis S.

    1990-12-01

    The purpose of this thesis was to develop and evaluate the performance of a three-feature speech recognition system. The three features used were the LPC spectrum, formants (F1/F2), and cepstrum. The system uses Kohonen neural networks, dynamic programming, and a rule-based feature-fusion process which integrates the three input features into one output result. The first half of this research involved evaluating the system in a speaker-dependent setting. For this, the 70-word F-16 cockpit command vocabulary was used and both isolated and connected speech were tested. The results obtained are compared to a two-feature system with the same configuration. Isolated-speech testing yielded 98.7 percent accuracy; connected-speech testing yielded 75.0 percent accuracy. The three-feature system performed an average of 1.7 percent better than the two-feature system for isolated speech. The second half of this research was concerned with the speaker-independent performance of the system. First, cross-speaker testing was performed using an updated 86-word library. In general, this testing yielded less than 50 percent accuracy. Then, testing was performed using averaged templates. This testing yielded an overall average in-template recognition rate of approximately 90 percent and an out-of-template recognition rate of approximately 75 percent.

  19. Effect of a Bluetooth-implemented hearing aid on speech recognition performance: subjective and objective measurement.

    Science.gov (United States)

    Kim, Min-Beom; Chung, Won-Ho; Choi, Jeesun; Hong, Sung Hwa; Cho, Yang-Sun; Park, Gyuseok; Lee, Sangmin

    2014-06-01

    The objective was to evaluate speech perception improvement through Bluetooth-implemented hearing aids in hearing-impaired adults. Thirty subjects with bilateral symmetric moderate sensorineural hearing loss participated in this study. A Bluetooth-implemented hearing aid was fitted unilaterally in all study subjects. Objective speech recognition scores and subjective satisfaction were measured with a Bluetooth-implemented hearing aid replacing the acoustic connection from either a cellular phone or a loudspeaker system. For each system, participants were assigned to 4 conditions: wireless speech signal transmission into the hearing aid (wireless mode) in a quiet or noisy environment, and conventional speech signal transmission using the external microphone of the hearing aid (conventional mode) in a quiet or noisy environment. Participants also completed questionnaires to investigate subjective satisfaction. In both the cellular phone and loudspeaker system situations, participants showed improvements in sentence and word recognition scores with the wireless mode compared to the conventional mode, in both quiet and noise conditions. Bluetooth-implemented hearing aids helped to improve subjective and objective speech recognition performance in quiet and noisy environments during the use of electronic audio devices.

  20. ANALYSIS OF MULTIMODAL FUSION TECHNIQUES FOR AUDIO-VISUAL SPEECH RECOGNITION

    Directory of Open Access Journals (Sweden)

    D.V. Ivanko

    2016-05-01

    Full Text Available The paper presents an analytical review covering the latest achievements in the field of audio-visual (AV) fusion (integration) of multimodal information. We discuss the main challenges and report on approaches to address them. One of the most important tasks of AV integration is to understand how the modalities interact and influence each other. The paper addresses this problem in the context of AV speech processing and speech recognition. In the first part of the review we set out the basic principles of AV speech recognition and give a classification of the audio and visual features of speech. Special attention is paid to the systematization of existing techniques and AV data fusion methods. In the second part we provide a consolidated list of tasks and applications that use AV fusion, based on our analysis of the research area. We also indicate the methods, techniques, and audio and video features used. We propose a classification of AV integration and discuss the advantages and disadvantages of different approaches. We draw conclusions and offer our assessment of the future of the field of AV fusion. In further research we plan to implement a system for audio-visual Russian continuous speech recognition using advanced methods of multimodal fusion.
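
    As one concrete example of the fusion techniques surveyed above (not a method proposed by the paper), here is a minimal decision-level sketch that combines per-class posteriors from audio and visual classifiers with a reliability weight before the final decision:

    ```python
    import numpy as np

    def late_fusion(audio_post, video_post, audio_weight=0.7):
        """Posteriors over classes from each modality; returns fused class.
        Weighted log-linear combination, a common late-fusion scheme."""
        log_fused = (audio_weight * np.log(audio_post + 1e-12) +
                     (1 - audio_weight) * np.log(video_post + 1e-12))
        return int(np.argmax(log_fused))

    audio_post = np.array([0.5, 0.3, 0.2])   # audio is ambiguous (noisy)
    video_post = np.array([0.1, 0.8, 0.1])   # lips clearly favour class 1
    # Down-weight audio when the acoustic channel is unreliable.
    print(late_fusion(audio_post, video_post, audio_weight=0.4))  # -> 1
    ```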

  1. Invariant Robust 3-D Face Recognition based on the Hilbert Transform in Spectral Space

    Directory of Open Access Journals (Sweden)

    Eric Paquet

    2006-04-01

    Full Text Available One of the main objectives of face recognition is to determine whether an acquired face belongs to a reference database and to subsequently identify the corresponding individual. Face recognition has applications in, for instance, forensic science and security. A face recognition algorithm, to be useful in real applications, must discriminate between individuals, process data in real-time and be robust against occlusion, facial expression and noise. A new method for robust recognition of three-dimensional faces is presented. The method is based on harmonic coding, the Hilbert transform and spectral analysis of 3-D depth distributions. Experimental results with three-dimensional faces, which were scanned with a laser scanner, are presented. The proposed method recognises a face with various facial expressions in the presence of occlusion, has good discrimination, is able to compare a face against a large database of faces in real-time and is robust against shot noise and additive noise.

  2. Subject independent facial expression recognition with robust face detection using a convolutional neural network.

    Science.gov (United States)

    Matsugu, Masakazu; Mori, Katsuhiko; Mitari, Yusuke; Kaneda, Yuji

    2003-01-01

    Reliable detection of ordinary facial expressions (e.g. smiles) despite variability among individuals as well as in face appearance is an important step toward the realization of a perceptual user interface with autonomous perception of persons. We describe a rule-based algorithm for robust facial expression recognition combined with robust face detection using a convolutional neural network. In this study, we address the problem of subject independence as well as translation, rotation, and scale invariance in the recognition of facial expressions. The results show reliable detection of smiles with a recognition rate of 97.6% for 5600 still images of more than 10 subjects. The proposed algorithm demonstrated the ability to discriminate smiling from talking based on the saliency score obtained from voting visual cues. To the best of our knowledge, it is the first facial expression recognition model with the property of subject independence combined with robustness to variability in facial appearance.

  3. Divided attention disrupts perceptual encoding during speech recognition.

    Science.gov (United States)

    Mattys, Sven L; Palmer, Shekeila D

    2015-03-01

    Performing a secondary task while listening to speech has a detrimental effect on speech processing, but the locus of the disruption within the speech system is poorly understood. Recent research has shown that cognitive load imposed by a concurrent visual task increases dependency on lexical knowledge during speech processing, but it does not affect lexical activation per se. This suggests that "lexical drift" under cognitive load occurs either as a post-lexical bias at the decisional level or as a secondary consequence of reduced perceptual sensitivity. This study aimed to adjudicate between these alternatives using a forced-choice task that required listeners to identify noise-degraded spoken words with or without the addition of a concurrent visual task. Adding cognitive load increased the likelihood that listeners would select a word acoustically similar to the target even though its frequency was lower than that of the target. Thus, there was no evidence that cognitive load led to a high-frequency response bias. Rather, cognitive load seems to disrupt sublexical encoding, possibly by impairing perceptual acuity at the auditory periphery.

  4. Recognition of Rapid Speech by Blind and Sighted Older Adults

    Science.gov (United States)

    Gordon-Salant, Sandra; Friedman, Sarah A.

    2011-01-01

    Purpose: To determine whether older blind participants recognize time-compressed speech better than older sighted participants. Method: Three groups of adults with normal hearing participated (n = 10/group): (a) older sighted, (b) older blind, and (c) younger sighted listeners. Low-predictability sentences that were uncompressed (0% time…

  5. Working Papers in Speech Recognition. IV. The Hearsay II System

    Science.gov (United States)

    1976-02-01

    to the overall goal of the problem solver, (4) The efficiency principle: more processing should be given to KSs which perform most reliably and... duration of the speech, actions which can produce such hypotheses or support them will be most preferred. The objective of the efficiency principle, to

  6. The digits-in-noise test: assessing auditory speech recognition abilities in noise.

    Science.gov (United States)

    Smits, Cas; Theo Goverts, S; Festen, Joost M

    2013-03-01

    A speech-in-noise test which uses digit triplets in steady-state speech noise was developed. The test measures primarily the auditory, or bottom-up, speech recognition abilities in noise. Digit triplets were formed by concatenating single digits spoken by a male speaker. Level corrections were made to individual digits to create a set of homogeneous digit triplets with steep speech recognition functions. The test measures the speech reception threshold (SRT) in long-term average speech-spectrum noise via a 1-up, 1-down adaptive procedure with a measurement error of 0.7 dB. One training list is needed for naive listeners. No further learning effects were observed in 24 subsequent SRT measurements. The test was validated by comparing results on the test with results on the standard sentences-in-noise test. To avoid the confounding of hearing loss, age, and linguistic skills, these measurements were performed in normal-hearing subjects with simulated hearing loss. The signals were spectrally smeared and/or low-pass filtered at varying cutoff frequencies. After correction for measurement error the correlation coefficient between SRTs measured with both tests equaled 0.96. Finally, the feasibility of the test was demonstrated in a study where reference SRT values were gathered in a representative set of 1386 listeners over 60 years of age.
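
    The adaptive procedure described above can be illustrated with a short simulation. The sketch below implements a generic 1-up, 1-down staircase on SNR in Python; the 2 dB step size, 24-trial track, toy psychometric listener, and scoring rule are illustrative assumptions, not the published test parameters.

    import random

    random.seed(0)

    def respond(snr_db, true_srt_db=-8.0, slope=0.2):
        """Toy listener: probability of a correct triplet rises with SNR."""
        p_correct = 1.0 / (1.0 + 10 ** (-slope * (snr_db - true_srt_db)))
        return random.random() < p_correct

    def measure_srt(start_snr_db=0.0, step_db=2.0, n_trials=24):
        snr, presented = start_snr_db, []
        for _ in range(n_trials):
            presented.append(snr)
            # 1-up, 1-down: harder (lower SNR) after a correct response,
            # easier (higher SNR) after an error -> converges near 50% correct.
            snr += -step_db if respond(snr) else step_db
        # Estimate the SRT as the mean SNR over the later, converged trials.
        return sum(presented[4:]) / len(presented[4:])

    print(f"Estimated SRT: {measure_srt():.1f} dB SNR")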

  7. Audiovisual cues benefit recognition of accented speech in noise but not perceptual adaptation.

    Science.gov (United States)

    Banks, Briony; Gowen, Emma; Munro, Kevin J; Adank, Patti

    2015-01-01

    Perceptual adaptation allows humans to recognize different varieties of accented speech. We investigated whether perceptual adaptation to accented speech is facilitated if listeners can see a speaker's facial and mouth movements. In Study 1, participants listened to sentences in a novel accent and underwent a period of training with audiovisual or audio-only speech cues, presented in quiet or in background noise. A control group also underwent training with visual-only (speech-reading) cues. We observed no significant difference in perceptual adaptation between any of the groups. To address a number of remaining questions, we carried out a second study using a different accent, speaker and experimental design, in which participants listened to sentences in a non-native (Japanese) accent with audiovisual or audio-only cues, without separate training. Participants' eye gaze was recorded to verify that they looked at the speaker's face during audiovisual trials. Recognition accuracy was significantly better for audiovisual than for audio-only stimuli; however, no statistical difference in perceptual adaptation was observed between the two modalities. Furthermore, Bayesian analysis suggested that the data supported the null hypothesis. Our results suggest that although the availability of visual speech cues may be immediately beneficial for recognition of unfamiliar accented speech in noise, it does not improve perceptual adaptation.

  8. Errors in Automatic Speech Recognition versus Difficulties in Second Language Listening

    Science.gov (United States)

    Mirzaei, Maryam Sadat; Meshgi, Kourosh; Akita, Yuya; Kawahara, Tatsuya

    2015-01-01

    Automatic Speech Recognition (ASR) technology has become a part of contemporary Computer-Assisted Language Learning (CALL) systems. ASR systems however are being criticized for their erroneous performance especially when utilized as a mean to develop skills in a Second Language (L2) where errors are not tolerated. Nevertheless, these errors can…

  9. User Experience of a Mobile Speaking Application with Automatic Speech Recognition for EFL Learning

    Science.gov (United States)

    Ahn, Tae youn; Lee, Sangmin-Michelle

    2016-01-01

    With the spread of mobile devices, mobile phones have enormous potential regarding their pedagogical use in language education. The goal of this study is to analyse user experience of a mobile-based learning system that is enhanced by speech recognition technology for the improvement of EFL (English as a foreign language) learners' speaking…

  10. Automatic Speech Recognition Technology as an Effective Means for Teaching Pronunciation

    Science.gov (United States)

    Elimat, Amal Khalil; AbuSeileek, Ali Farhan

    2014-01-01

    This study aimed to explore the effect of using automatic speech recognition technology (ASR) on the third grade EFL students' performance in pronunciation, whether teaching pronunciation through ASR is better than regular instruction, and the most effective teaching technique (individual work, pair work, or group work) in teaching pronunciation…

  11. Recognition of temporally interrupted and spectrally degraded sentences with additional unprocessed low-frequency speech

    NARCIS (Netherlands)

    Baskent, Deniz; Chatterjeec, Monita

    2010-01-01

    Recognition of periodically interrupted sentences (with an interruption rate of 1.5 Hz, 50% duty cycle) was investigated under conditions of spectral degradation, implemented with a noiseband vocoder, with and without additional unprocessed low-pass filtered speech (cutoff frequency 500 Hz).

  12. Speech-based recognition of self-reported and observed emotion in a dimensional space

    NARCIS (Netherlands)

    Truong, Khiet P.; Leeuwen, van David A.; Jong, de Franciska M.G.

    2012-01-01

    The differences between self-reported and observed emotion have only marginally been investigated in the context of speech-based automatic emotion recognition. We address this issue by comparing self-reported emotion ratings to observed emotion ratings and look at how differences between these two types of ratings…

  13. Serial audiometry and speech recognition findings in Finnish Usher syndrome type III patients.

    NARCIS (Netherlands)

    Plantinga, R.F.; Kleemola, L.; Huygen, P.L.M.; Joensuu, T.; Sankila, E.M.; Pennings, R.J.E.; Cremers, C.W.R.J.

    2005-01-01

    Audiometric features, evaluated by serial pure tone audiometry and speech recognition tests (n = 31), were analysed in 59 Finnish Usher syndrome type III patients (USH3) with Finmajor/Finmajor (n = 55) and Finmajor/Finminor (n = 4) USH3A mutations. These patients showed a highly variable type and degree…

  14. The Effects of Auditory-Visual Vowel Identification Training on Speech Recognition under Difficult Listening Conditions

    Science.gov (United States)

    Richie, Carolyn; Kewley-Port, Diane

    2008-01-01

    Purpose: The effective use of visual cues to speech provides benefit for adults with normal hearing in noisy environments and for adults with hearing loss in everyday communication. The purpose of this study was to examine the effects of a computer-based, auditory-visual vowel identification training program on sentence recognition under difficult…

  15. ISOLATED SPEECH RECOGNITION SYSTEM FOR TAMIL LANGUAGE USING STATISTICAL PATTERN MATCHING AND MACHINE LEARNING TECHNIQUES

    Directory of Open Access Journals (Sweden)

    VIMALA C.

    2015-05-01

    Full Text Available In recent years, speech technology has become a vital part of our daily lives. Various techniques have been proposed for developing Automatic Speech Recognition (ASR) systems and have achieved great success in many applications. Among them, Template Matching techniques like Dynamic Time Warping (DTW), Statistical Pattern Matching techniques such as Hidden Markov Models (HMM) and Gaussian Mixture Models (GMM), and Machine Learning techniques such as Neural Networks (NN), Support Vector Machines (SVM), and Decision Trees (DT) are most popular. The main objective of this paper is to design and develop a speaker-independent isolated speech recognition system for the Tamil language using the above speech recognition techniques. The background of ASR systems, the steps involved in ASR, the merits and demerits of the conventional and machine learning algorithms, and the observations made based on the experiments are presented in this paper. For the above developed system, the highest word recognition accuracy is achieved with the HMM technique. It offered 100% accuracy during the training process and 97.92% during testing.
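
    Of the techniques listed above, Dynamic Time Warping is the simplest to demonstrate in code. The following minimal sketch matches an utterance against stored word templates by the classic O(T1*T2) dynamic program over frame-wise Euclidean distances; the random "templates" stand in for real MFCC sequences of Tamil words.

    import numpy as np

    def dtw_distance(a: np.ndarray, b: np.ndarray) -> float:
        """a: (T1, D) and b: (T2, D) feature matrices; returns alignment cost."""
        t1, t2 = len(a), len(b)
        cost = np.full((t1 + 1, t2 + 1), np.inf)
        cost[0, 0] = 0.0
        for i in range(1, t1 + 1):
            for j in range(1, t2 + 1):
                d = np.linalg.norm(a[i - 1] - b[j - 1])
                cost[i, j] = d + min(cost[i - 1, j],      # insertion
                                     cost[i, j - 1],      # deletion
                                     cost[i - 1, j - 1])  # match
        return cost[t1, t2]

    def recognize(utterance, templates):
        """Nearest-template classification over a dictionary of words."""
        return min(templates, key=lambda w: dtw_distance(utterance, templates[w]))

    rng = np.random.default_rng(0)
    templates = {"onru": rng.normal(size=(30, 13)),
                 "irandu": rng.normal(size=(40, 13))}
    test = templates["onru"] + 0.1 * rng.normal(size=(30, 13))
    print(recognize(test, templates))  # -> "onru"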

  16. Review of Speech-to-Text Recognition Technology for Enhancing Learning

    Science.gov (United States)

    Shadiev, Rustam; Hwang, Wu-Yuin; Chen, Nian-Shing; Huang, Yueh-Min

    2014-01-01

    This paper reviewed literature from 1999 to 2014 inclusively on how Speech-to-Text Recognition (STR) technology has been applied to enhance learning. The first aim of this review is to understand how STR technology has been used to support learning over the past fifteen years, and the second is to analyze all research evidence to understand how…

  17. The Affordance of Speech Recognition Technology for EFL Learning in an Elementary School Setting

    Science.gov (United States)

    Liaw, Meei-Ling

    2014-01-01

    This study examined the use of speech recognition (SR) technology to support a group of elementary school children's learning of English as a foreign language (EFL). SR technology has been used in various language learning contexts. Its application to EFL teaching and learning is still relatively recent, but a solid understanding of its…

  18. Enhancing Speech Recognition Using Improved Particle Swarm Optimization Based Hidden Markov Model

    Directory of Open Access Journals (Sweden)

    Lokesh Selvaraj

    2014-01-01

    Full Text Available Enhancing speech recognition is the primary intention of this work. In this paper a novel speech recognition method based on vector quantization and improved particle swarm optimization (IPSO) is suggested. The suggested methodology contains four stages, namely, (i) denoising, (ii) feature extraction, (iii) vector quantization, and (iv) an IPSO-based hidden Markov model (HMM) technique (IP-HMM). At first, the speech signals are denoised using a median filter. Next, characteristics such as peak, pitch spectrum, Mel-frequency cepstral coefficients (MFCC), mean, standard deviation, and minimum and maximum of the signal are extracted from the denoised signal. Following that, to accomplish the training process, the extracted characteristics are given to genetic-algorithm-based codebook generation in vector quantization. The initial populations for the genetic algorithm are created by selecting random code vectors from the training set, and IP-HMM performs the recognition. Novelty is introduced through the crossover genetic operation. The proposed speech recognition technique offers 97.14% accuracy.
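
    For illustration, the sketch below builds a vector-quantization codebook and quantizes feature frames into the discrete symbols a discrete HMM would consume. The paper's genetic-algorithm codebook training and IPSO tuning are not reproduced; plain k-means (the classic LBG-style approach) is used as a stand-in, and the random features stand in for real MFCC frames.

    import numpy as np

    def train_codebook(features: np.ndarray, k: int = 16, n_iter: int = 20):
        rng = np.random.default_rng(0)
        codebook = features[rng.choice(len(features), k, replace=False)]
        for _ in range(n_iter):
            # Assign each frame to its nearest code vector.
            dists = np.linalg.norm(features[:, None] - codebook[None], axis=2)
            labels = dists.argmin(axis=1)
            # Re-estimate each code vector as the centroid of its frames.
            for c in range(k):
                if np.any(labels == c):
                    codebook[c] = features[labels == c].mean(axis=0)
        return codebook

    def quantize(features, codebook):
        """Map frames to discrete symbols, as consumed by a discrete HMM."""
        return np.linalg.norm(features[:, None] - codebook[None], axis=2).argmin(1)

    feats = np.random.default_rng(1).normal(size=(500, 13))
    cb = train_codebook(feats)
    print(quantize(feats[:10], cb))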

  19. Comparing Three Methods to Create Multilingual Phone Models for Vocabulary Independent Speech Recognition Tasks

    Science.gov (United States)

    2000-08-01

    Comparing Three Methods to Create Multilingual Phone Models for Vocabulary Independent Speech Recognition Tasks. Joachim Köhler, German National Research Center for… Only fragments of the scanned DTIC compilation part notice (ADP010392) are recoverable, among them a reference to Glass et al., "Multilingual Spoken Language Understanding in the MIT VOYAGER System," and a comparison involving multilingual clusters and 5280 monolingual clusters.

  20. Dealing with Phrase Level Co-Articulation (PLC) in speech recognition: a first approach

    NARCIS (Netherlands)

    Ordelman, Roeland J.F.; van Hessen, Adrianus J.; van Leeuwen, David A.; Robinson, Tony; Renals, Steve

    1999-01-01

    Whereas nowadays within-word co-articulation effects are usually sufficiently dealt with in automatic speech recognition, this is not always the case with phrase level co-articulation effects (PLC). This paper describes a first approach in dealing with phrase level co-articulation by applying these

  1. Evaluating Automatic Speech Recognition-Based Language Learning Systems: A Case Study

    Science.gov (United States)

    van Doremalen, Joost; Boves, Lou; Colpaert, Jozef; Cucchiarini, Catia; Strik, Helmer

    2016-01-01

    The purpose of this research was to evaluate a prototype of an automatic speech recognition (ASR)-based language learning system that provides feedback on different aspects of speaking performance (pronunciation, morphology and syntax) to students of Dutch as a second language. We carried out usability reviews, expert reviews and user tests to…

  2. The Effect of Automatic Speech Recognition Eyespeak Software on Iraqi Students' English Pronunciation: A Pilot Study

    Science.gov (United States)

    Sidgi, Lina Fathi Sidig; Shaari, Ahmad Jelani

    2017-01-01

    The use of technology, such as computer-assisted language learning (CALL), is used in teaching and learning in the foreign language classrooms where it is most needed. One promising emerging technology that supports language learning is automatic speech recognition (ASR). Integrating such technology, especially in the instruction of pronunciation…

  3. The Usefulness of Automatic Speech Recognition (ASR) Eyespeak Software in Improving Iraqi EFL Students' Pronunciation

    Science.gov (United States)

    Sidgi, Lina Fathi Sidig; Shaari, Ahmad Jelani

    2017-01-01

    The present study focuses on determining whether automatic speech recognition (ASR) technology is reliable for improving English pronunciation to Iraqi EFL students. Non-native learners of English are generally concerned about improving their pronunciation skills, and Iraqi students face difficulties in pronouncing English sounds that are not…

  4. Investigating an Innovative Computer Application to Improve L2 Word Recognition from Speech

    Science.gov (United States)

    Matthews, Joshua; O'Toole, John Mitchell

    2015-01-01

    The ability to recognise words from the aural modality is a critical aspect of successful second language (L2) listening comprehension. However, little research has been reported on computer-mediated development of L2 word recognition from speech in L2 learning contexts. This report describes the development of an innovative computer application…

  5. User Experience of a Mobile Speaking Application with Automatic Speech Recognition for EFL Learning

    Science.gov (United States)

    Ahn, Tae youn; Lee, Sangmin-Michelle

    2016-01-01

    With the spread of mobile devices, mobile phones have enormous potential regarding their pedagogical use in language education. The goal of this study is to analyse user experience of a mobile-based learning system that is enhanced by speech recognition technology for the improvement of EFL (English as a foreign language) learners' speaking…

  6. The effect of speech recognition on working postures, productivity and the perception of user friendliness

    NARCIS (Netherlands)

    Korte, E.M. de; Lingen, P. van

    2006-01-01

    A comparative, experimental study with repeated measures has been conducted to evaluate the effect of the use of speech recognition on working postures, productivity and the perception of user friendliness. Fifteen subjects performed a standardised task, first with keyboard and mouse and, after a

  7. Acoustic diagnosis of pulmonary hypertension: automated speech- recognition-inspired classification algorithm outperforms physicians

    Science.gov (United States)

    Kaddoura, Tarek; Vadlamudi, Karunakar; Kumar, Shine; Bobhate, Prashant; Guo, Long; Jain, Shreepal; Elgendi, Mohamed; Coe, James Y.; Kim, Daniel; Taylor, Dylan; Tymchak, Wayne; Schuurmans, Dale; Zemp, Roger J.; Adatia, Ian

    2016-09-01

    We hypothesized that an automated speech-recognition-inspired classification algorithm could differentiate between the heart sounds in subjects with and without pulmonary hypertension (PH) and outperform physicians. Heart sounds, electrocardiograms, and mean pulmonary artery pressures (mPAp) were recorded simultaneously. Heart sound recordings were digitized to train and test speech-recognition-inspired classification algorithms. We used mel-frequency cepstral coefficients to extract features from the heart sounds. Gaussian-mixture models classified the features as PH (mPAp ≥ 25 mmHg) or normal (mPAp < 25 mmHg). The correct classification rate of the speech-recognition-inspired algorithm was 74% compared to 56% by physicians (p = 0.005). The false positive rate for the algorithm was 34% versus 50% (p = 0.04) for clinicians. The false negative rate for the algorithm was 23% and 68% (p = 0.0002) for physicians. We developed an automated speech-recognition-inspired classification algorithm for the acoustic diagnosis of PH that outperforms physicians and could be used to screen for PH and encourage earlier specialist referral.
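
    A rough outline of such a pipeline can be sketched as follows: one Gaussian mixture model per class fitted on MFCC frames, with a likelihood comparison for the decision. The synthetic feature matrices, model sizes, and decision rule below are illustrative assumptions, not the authors' configuration.

    import numpy as np
    from sklearn.mixture import GaussianMixture

    rng = np.random.default_rng(0)
    # Stand-in MFCC frame matrices (frames x coefficients) for the two classes.
    train_ph = rng.normal(loc=0.5, size=(2000, 13))
    train_normal = rng.normal(loc=-0.5, size=(2000, 13))

    gmm_ph = GaussianMixture(n_components=4, random_state=0).fit(train_ph)
    gmm_normal = GaussianMixture(n_components=4, random_state=0).fit(train_normal)

    def classify(mfcc_frames: np.ndarray) -> str:
        """Compare average per-frame log-likelihood under each class model."""
        ll_ph = gmm_ph.score(mfcc_frames)
        ll_normal = gmm_normal.score(mfcc_frames)
        return "PH (mPAp >= 25 mmHg)" if ll_ph > ll_normal else "normal"

    test = rng.normal(loc=0.5, size=(300, 13))  # an unseen "PH-like" recording
    print(classify(test))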

  8. N-best: The Northern- and Southern-Dutch Benchmark Evaluation of Speech recognition Technology

    NARCIS (Netherlands)

    Kessens, J.M.; Leeuwen, D.A. van

    2007-01-01

    In this paper, we describe N-best 2008, the first Large Vocabulary Continuous Speech Recognition (LVCSR) benchmark evaluation held for the Dutch language. Both the accent as spoken in the Netherlands (Northern-Dutch) and in Belgium (Southern-Dutch or Flemish) will be evaluated. The evaluation tasks are…

  9. Review of Speech-to-Text Recognition Technology for Enhancing Learning

    Science.gov (United States)

    Shadiev, Rustam; Hwang, Wu-Yuin; Chen, Nian-Shing; Huang, Yueh-Min

    2014-01-01

    This paper reviewed literature from 1999 to 2014 inclusively on how Speech-to-Text Recognition (STR) technology has been applied to enhance learning. The first aim of this review is to understand how STR technology has been used to support learning over the past fifteen years, and the second is to analyze all research evidence to understand how…

  10. The Affordance of Speech Recognition Technology for EFL Learning in an Elementary School Setting

    Science.gov (United States)

    Liaw, Meei-Ling

    2014-01-01

    This study examined the use of speech recognition (SR) technology to support a group of elementary school children's learning of English as a foreign language (EFL). SR technology has been used in various language learning contexts. Its application to EFL teaching and learning is still relatively recent, but a solid understanding of its…

  11. Using Speech Recognition Software to Increase Writing Fluency for Individuals with Physical Disabilities

    Science.gov (United States)

    Garrett, Jennifer Tumlin; Heller, Kathryn Wolff; Fowler, Linda P.; Alberto, Paul A.; Fredrick, Laura D.; O'Rourke, Colleen M.

    2011-01-01

    Students with physical disabilities often have difficulty with writing fluency, despite the use of various strategies, adaptations, and assistive technology (AT). One possible intervention is the use of speech recognition software, although there is little research on its impact on students with physical disabilities. This study used an…

  12. Transcribe Your Class: Using Speech Recognition to Improve Access for At-Risk Students

    Science.gov (United States)

    Bain, Keith; Lund-Lucas, Eunice; Stevens, Janice

    2012-01-01

    Through a project supported by Canada's Social Development Partnerships Program, a team of leading National Disability Organizations, universities, and industry partners are piloting a prototype Hosted Transcription Service that uses speech recognition to automatically create multimedia transcripts that can be used by students for study purposes.…

  13. Speech Recognition, Working Memory and Conversation in Children with Cochlear Implants

    Science.gov (United States)

    Ibertsson, Tina; Hansson, Kristina; Asker-Arnason, Lena; Sahlen, Birgitta

    2009-01-01

    This study examined the relationship between speech recognition, working memory and conversational skills in a group of 13 children/adolescents with cochlear implants (CIs) between 11 and 19 years of age. Conversational skills were assessed in a referential communication task where the participants interacted with a hearing peer of the same age…

  14. The role of binary mask patterns in automatic speech recognition in background noise.

    Science.gov (United States)

    Narayanan, Arun; Wang, DeLiang

    2013-05-01

    Processing noisy signals using the ideal binary mask improves automatic speech recognition (ASR) performance. This paper presents the first study that investigates the role of binary mask patterns in ASR under various noises, signal-to-noise ratios (SNRs), and vocabulary sizes. Binary masks are computed either by comparing the SNR within a time-frequency unit of a mixture signal with a local criterion (LC), or by comparing the local target energy with the long-term average spectral energy of speech. ASR results show that (1) akin to human speech recognition, binary masking significantly improves ASR performance even when the SNR is as low as -60 dB; (2) the ASR performance profiles are qualitatively similar to those obtained in human intelligibility experiments; (3) the difference between the LC and the mixture SNR is more strongly correlated with recognition accuracy than LC alone; (4) the LC at which performance peaks is lower than 0 dB, the threshold that maximizes the SNR gain of processed signals. This broad agreement with human performance is rather surprising. The results also indicate that maximizing the SNR gain is probably not an appropriate goal for improving either human or machine recognition of noisy speech.
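
    The ideal binary mask computation described above reduces to comparing the per-unit SNR against the local criterion. The sketch below shows this in a minimal form, assuming access to the premixed speech and noise signals (which is what makes the mask "ideal"); the STFT parameters and the -6 dB criterion are illustrative.

    import numpy as np
    from scipy.signal import stft, istft

    def ideal_binary_mask(speech, noise, fs, lc_db=-6.0, nperseg=512):
        _, _, S = stft(speech, fs, nperseg=nperseg)
        _, _, N = stft(noise, fs, nperseg=nperseg)
        # Local SNR per time-frequency unit, compared against the LC.
        local_snr_db = 10 * np.log10((np.abs(S) ** 2) /
                                     (np.abs(N) ** 2 + 1e-12) + 1e-12)
        return (local_snr_db > lc_db).astype(float)

    def apply_mask(mixture, mask, fs, nperseg=512):
        _, _, M = stft(mixture, fs, nperseg=nperseg)
        _, out = istft(M * mask, fs, nperseg=nperseg)
        return out

    fs = 16000
    t = np.arange(fs) / fs
    speech = np.sin(2 * np.pi * 300 * t)               # toy "speech"
    noise = np.random.default_rng(0).normal(0, 1, fs)
    mask = ideal_binary_mask(speech, noise, fs)
    enhanced = apply_mask(speech + noise, mask, fs)
    print(mask.mean())  # fraction of retained time-frequency units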

  15. Evaluating Automatic Speech Recognition-Based Language Learning Systems: A Case Study

    Science.gov (United States)

    van Doremalen, Joost; Boves, Lou; Colpaert, Jozef; Cucchiarini, Catia; Strik, Helmer

    2016-01-01

    The purpose of this research was to evaluate a prototype of an automatic speech recognition (ASR)-based language learning system that provides feedback on different aspects of speaking performance (pronunciation, morphology and syntax) to students of Dutch as a second language. We carried out usability reviews, expert reviews and user tests to…

  16. Recognition of temporally interrupted and spectrally degraded sentences with additional unprocessed low-frequency speech

    NARCIS (Netherlands)

    Baskent, Deniz; Chatterjeec, Monita

    2010-01-01

    Recognition of periodically interrupted sentences (with an interruption rate of 1.5 Hz, 50% duty cycle) was investigated under conditions of spectral degradation, implemented with a noiseband vocoder, with and without additional unprocessed low-pass filtered speech (cutoff frequency 500 Hz). Intelligibility of interrupted speech decreased with increasing spectral degradation…

  17. Development of coffee maker service robot using speech and face recognition systems using POMDP

    Science.gov (United States)

    Budiharto, Widodo; Meiliana; Santoso Gunawan, Alexander Agung

    2016-07-01

    There have been many developments of intelligent service robots intended to interact with users naturally. This can be achieved by embedding speech and face recognition abilities for specific tasks into the robot. In this research, we propose an Intelligent Coffee Maker Robot whose speech recognition is based on the Indonesian language and powered by statistical dialogue systems. This kind of robot can be used in the office, supermarket or restaurant. In our scenario, the robot recognizes the user's face and then accepts commands from the user to perform an action, specifically making a coffee. Based on our previous work, the accuracy for speech recognition is about 86% and for face recognition about 93% in laboratory experiments. The main problem here is to determine the user's intention regarding how sweet the coffee should be. The intelligent coffee maker robot must infer the user's intention through conversation under unreliable automatic speech recognition in a noisy environment. In this paper, this spoken dialog problem is treated as a partially observable Markov decision process (POMDP). We describe how this formulation establishes a promising framework, supported by empirical results. Dialog simulations are presented which demonstrate significant quantitative outcomes.
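
    The POMDP treatment sketched in the abstract rests on Bayesian belief updating over hidden user intents. The following is a minimal sketch of that update for a toy sweetness-intent model; the transition and observation probabilities are invented for illustration, not the authors' learned models.

    import numpy as np

    states = ["low_sugar", "medium_sugar", "high_sugar"]
    belief = np.full(3, 1 / 3)                 # uniform prior over user intents

    # The user's intent is assumed stable while the robot asks questions,
    # so a single identity transition matrix is used for every action.
    T = np.eye(3)
    # O[o][s]: probability of hearing observation o given the true intent,
    # reflecting unreliable ASR in a noisy environment.
    O = {"heard_sweet":     np.array([0.1, 0.3, 0.8]),
         "heard_not_sweet": np.array([0.8, 0.3, 0.1])}

    def update_belief(belief, observation):
        """b'(s') is proportional to O(o|s') * sum_s T(s, s') b(s)."""
        b = O[observation] * (T.T @ belief)
        return b / b.sum()

    belief = update_belief(belief, "heard_sweet")
    belief = update_belief(belief, "heard_sweet")
    print(dict(zip(states, belief.round(3))))  # mass shifts toward high_sugar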

  18. Cued Speech Gesture Recognition: A First Prototype Based on Early Reduction

    Directory of Open Access Journals (Sweden)

    Caplier Alice

    2007-01-01

    Full Text Available Cued Speech is a specific linguistic code for hearing-impaired people. It is based on both lip reading and manual gestures. In the context of THIMP (Telephony for the Hearing-IMpaired Project), we work on automatic cued speech translation. In this paper, we only address the problem of automatic cued speech manual gesture recognition. Such a gesture recognition issue is really common from a theoretical point of view, but we approach it with respect to its particularities in order to derive an original method. This method is essentially built around a bioinspired method called early reduction. Prior to a complete analysis of each image of a sequence, the early reduction process automatically extracts a restricted number of key images which summarize the whole sequence. Only the key images are studied from a temporal point of view with lighter computation than the complete sequence.

  19. Cued Speech Gesture Recognition: A First Prototype Based on Early Reduction

    Directory of Open Access Journals (Sweden)

    Pascal Perret

    2008-01-01

    Full Text Available Cued Speech is a specific linguistic code for hearing-impaired people. It is based on both lip reading and manual gestures. In the context of THIMP (Telephony for the Hearing-IMpaired Project), we work on automatic cued speech translation. In this paper, we only address the problem of automatic cued speech manual gesture recognition. Such a gesture recognition issue is really common from a theoretical point of view, but we approach it with respect to its particularities in order to derive an original method. This method is essentially built around a bioinspired method called early reduction. Prior to a complete analysis of each image of a sequence, the early reduction process automatically extracts a restricted number of key images which summarize the whole sequence. Only the key images are studied from a temporal point of view with lighter computation than the complete sequence.

  20. A commercial large-vocabulary discrete speech recognition system: DragonDictate.

    Science.gov (United States)

    Mandel, M A

    1992-01-01

    DragonDictate is currently the only commercially available general-purpose, large-vocabulary speech recognition system. It uses discrete speech and is speaker-dependent, adapting to the speaker's voice and language model with every word. Its acoustic adaptability is based on a three-level phonology and a stochastic model of production. The phonological levels are phonemes, augmented triphones (phonemes-in-context, or PICs), and steady-state spectral slices that are concatenated to approximate the spectra of these PICs (phonetic elements, or PELs) and thus of words. Production is treated as a hidden Markov process, which the recognizer has to identify from its output, the spoken word. Findings of practical value to speech recognition are presented from research on six European languages.

  1. The effect of network degradation on speech recognition

    CSIR Research Space (South Africa)

    Joubert, G

    2005-11-01

    Full Text Available … are found to enhance recognition accuracy when network jitter occurs, but do not generally compensate for packet loss successfully. They also confirm that packet loss is a more significant problem for larger loss burst-lengths…

  2. Nonlinear spectro-temporal features based on a cochlear model for automatic speech recognition in a noisy situation.

    Science.gov (United States)

    Choi, Yong-Sun; Lee, Soo-Young

    2013-09-01

    A nonlinear speech feature extraction algorithm was developed by modeling human cochlear functions, and demonstrated as a noise-robust front-end for speech recognition systems. The algorithm was based on a model of the Organ of Corti in the human cochlea with such features as the basilar membrane (BM), outer hair cells (OHCs), and inner hair cells (IHCs). Frequency-dependent nonlinear compression and amplification of OHCs were modeled by lateral inhibition to enhance spectral contrasts. In particular, the compression coefficients had frequency dependency based on psychoacoustic evidence. Spectral subtraction and temporal adaptation were applied in the time-frame domain. With long-term and short-term adaptation characteristics, these factors remove stationary or slowly varying components and amplify temporal changes such as onsets or offsets. The proposed features were evaluated with a noisy speech database and showed better performance than baseline methods such as mel-frequency cepstral coefficients (MFCCs) and RASTA-PLP in unknown noisy conditions.
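
    Of the processing stages mentioned above, the spectral-subtraction component is the easiest to isolate. The sketch below shows a minimal magnitude-domain spectral subtraction with a spectral floor, assuming the leading frames are noise-only; the cochlear OHC/IHC stages of the paper are not reproduced, and the frame and floor parameters are illustrative.

    import numpy as np
    from scipy.signal import stft, istft

    def spectral_subtraction(x, fs, n_noise_frames=10, floor=0.05, nperseg=400):
        _, _, X = stft(x, fs, nperseg=nperseg)
        mag, phase = np.abs(X), np.angle(X)
        # Estimate the noise magnitude from the assumed noise-only lead-in.
        noise_mag = mag[:, :n_noise_frames].mean(axis=1, keepdims=True)
        # Subtract, but keep a small spectral floor to limit musical noise.
        clean_mag = np.maximum(mag - noise_mag, floor * mag)
        _, y = istft(clean_mag * np.exp(1j * phase), fs, nperseg=nperseg)
        return y

    # Usage: cleaned = spectral_subtraction(noisy_signal, 16000)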

  3. A novel pose and illumination robust face recognition with a single training image per person algorithm

    Institute of Scientific and Technical Information of China (English)

    Junbao Li; Jeng-Shyang Pan

    2008-01-01

    @@ In the real-world application of face recognition system, owing to the difficulties of collecting samples or storage space of systems, only one sample image per person is stored in the system, which is so-called one sample per person problem. Moreover, pose and illumination have impact on recognition performance. We propose a novel pose and illumination robust algorithm for face recognition with a single training image per person to solve the above limitations. Experimental results show that the proposed algorithm is an efficient and practical approach for face recognition.

  4. Speech Waveform Compression Using Robust Adaptive Voice Activity Detection for Nonstationary Noise

    Directory of Open Access Journals (Sweden)

    Hsiao-Chun Wu

    2008-03-01

    Full Text Available The voice activity detection (VAD) is crucial in all kinds of speech applications. However, almost all existing VAD algorithms suffer from the nonstationarity of both speech and noise. To combat this difficulty, we propose a new voice activity detector, which is based on the Mel-energy features and an adaptive threshold related to the signal-to-noise ratio (SNR) estimates. In this paper, we first justify the robustness of the Bayes classifier using the Mel-energy features over that using the Fourier spectral features in various noise environments. Then, we design an algorithm using the dynamic Mel-energy estimator and the adaptive threshold, which depends on the SNR estimates. In addition, a realignment scheme is incorporated to correct the sparse-and-spurious noise estimates. Numerous simulations are carried out to evaluate the performance of our proposed VAD method and the comparisons are made with a couple of existing representative schemes, namely, the VAD using the likelihood ratio test with Fourier spectral energy features and that based on the enhanced time-frequency parameters. Three types of noises, namely, white noise (stationary), babble noise (nonstationary), and vehicular noise (nonstationary), were artificially added by the computer for our experiments. As a result, our proposed VAD algorithm significantly outperforms other existing methods as illustrated by the corresponding receiver operating characteristics (ROC) curves. Finally, we demonstrate one of the major applications, namely, speech waveform compression associated with our new robust VAD scheme and quantify the effectiveness in terms of compression efficiency.
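
    In the same spirit, a minimal energy-based voice activity detector with an adaptive, noise-floor-relative threshold can be sketched as follows. The Mel filterbank, Bayes classifier, and realignment stages of the paper are omitted, and the smoothing constants and margin are illustrative assumptions.

    import numpy as np

    def vad(x, fs, frame_ms=20, alpha=0.98, margin_db=6.0):
        frame = int(fs * frame_ms / 1000)
        n = len(x) // frame
        energy_db = 10 * np.log10(
            np.maximum((x[:n * frame].reshape(n, frame) ** 2).mean(axis=1),
                       1e-12))
        noise_db = energy_db[0]       # bootstrap noise floor from first frame
        decisions = []
        for e in energy_db:
            speech = e > noise_db + margin_db  # threshold tracks noise floor
            if not speech:                     # update noise estimate in pauses
                noise_db = alpha * noise_db + (1 - alpha) * e
            decisions.append(speech)
        return np.array(decisions)

    # Usage: flags = vad(signal, 16000); flags[i] is True for speech frames.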

  5. Joint sparse representation for robust multimodal biometrics recognition.

    Science.gov (United States)

    Shekhar, Sumit; Patel, Vishal M; Nasrabadi, Nasser M; Chellappa, Rama

    2014-01-01

    Traditional biometric recognition systems rely on a single biometric signature for authentication. While the advantage of using multiple sources of information for establishing the identity has been widely recognized, computational models for multimodal biometrics recognition have only recently received attention. We propose a multimodal sparse representation method, which represents the test data by a sparse linear combination of training data, while constraining the observations from different modalities of the test subject to share their sparse representations. Thus, we simultaneously take into account correlations as well as coupling information among biometric modalities. A multimodal quality measure is also proposed to weigh each modality as it gets fused. Furthermore, we also kernelize the algorithm to handle nonlinearity in data. The optimization problem is solved using an efficient alternating direction method. Various experiments show that the proposed method compares favorably with competing fusion-based methods.

  6. Robust Face Recognition via Occlusion Detection and Masking

    Directory of Open Access Journals (Sweden)

    Guo Tan

    2016-01-01

    Full Text Available The sparse representation-based classification (SRC) method has demonstrated promising results in face recognition (FR). In this paper, we consider the problem of face recognition with occlusion. In the sparse representation-based classification method, the reconstruction residual of the test sample over the training set is usually heterogeneous with the training samples, highlighting the occluded part of the test sample. We detect the occluded part by extracting a mask from the reconstruction residual through a threshold operation. The mask is then applied in the representation-based classification framework to eliminate the impact of occlusion in FR. The method does not assume any prior knowledge about the occlusion, and extensive experiments on publicly available databases show the efficacy of the method.
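
    The residual-based masking idea can be sketched compactly. Below, the sparse coding step is replaced by ridge-regularized least squares for brevity; D is a (pixels x training samples) dictionary and y a test face, with large per-pixel residuals flagged as occlusion. The threshold rule is an illustrative assumption.

    import numpy as np

    def occlusion_mask(D, y, lam=0.1, tau=None):
        # Code the test image over the training dictionary (ridge regression).
        a = np.linalg.solve(D.T @ D + lam * np.eye(D.shape[1]), D.T @ y)
        residual = np.abs(y - D @ a)
        # Threshold the residual: large per-pixel error flags likely occlusion.
        if tau is None:
            tau = residual.mean() + 2 * residual.std()
        mask = residual < tau       # True = pixel kept for classification
        return mask, a

    rng = np.random.default_rng(0)
    D = rng.normal(size=(1024, 50))
    y = D @ rng.normal(size=50)
    y[:100] += 5.0                  # synthetic occlusion on the first pixels
    mask, _ = occlusion_mask(D, y)
    print(mask[:100].mean(), mask[100:].mean())  # low vs. high keep rates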

  7. Cost-Efficient Development of Acoustic Models for Speech Recognition of Related Languages

    Directory of Open Access Journals (Sweden)

    J. Nouza

    2013-09-01

    Full Text Available When adapting an existing speech recognition system to a new language, major development costs are associated with the creation of an appropriate acoustic model (AM). For its training, a certain amount of recorded and annotated speech is required. In this paper, we show that not only the annotation process, but also the process of speech acquisition can be automated to minimize the need for human and expert work. We demonstrate the proposed methodology on the Croatian language, for which the target AM has been built via cross-lingual adaptation of a Czech AM in two ways: (a) using the commercially available GlobalPhone database, and (b) by automatic speech data mining from the HRT radio archive. The latter approach is cost-free, yet it yields comparable or better results in LVCSR experiments conducted on 3 Croatian test sets.

  8. Post-Editing Error Correction Algorithm for Speech Recognition using Bing Spelling Suggestion

    CERN Document Server

    Bassil, Youssef

    2012-01-01

    ASR, short for Automatic Speech Recognition, is the process of converting spoken speech into text that can be manipulated by a computer. Although ASR has several applications, it is still erroneous and imprecise, especially if used in a harsh environment where the input speech is of low quality. This paper proposes a post-editing ASR error correction method and algorithm based on Bing's online spelling suggestion. In this approach, the ASR-recognized output text is spell-checked using Bing's spelling suggestion technology to detect and correct misrecognized words. More specifically, the proposed algorithm breaks down the ASR output text into several word tokens that are submitted as search queries to the Bing search engine. A returned spelling suggestion implies that a query is misspelled, and thus it is replaced by the suggested correction; otherwise, no correction is performed and the algorithm continues with the next token until all tokens get validated. Experiments carried out on various speeches in different…
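
    The token-by-token loop described above can be sketched as follows. The spelling_suggestion callable is a hypothetical stand-in for the search engine's suggestion service (the real Bing API call is not reproduced here); it returns a corrected token, or None when the token is accepted as-is.

    from typing import Callable, Optional

    def post_edit(asr_output: str,
                  spelling_suggestion: Callable[[str], Optional[str]]) -> str:
        corrected = []
        for token in asr_output.split():
            suggestion = spelling_suggestion(token)
            # A returned suggestion implies a misrecognized word: replace it.
            corrected.append(suggestion if suggestion else token)
        return " ".join(corrected)

    # Toy suggestion backend for demonstration.
    toy_lexicon = {"recogniton": "recognition", "speach": "speech"}
    print(post_edit("automatic speach recogniton works",
                    lambda w: toy_lexicon.get(w)))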

  9. Preliminary Analysis of Automatic Speech Recognition and Synthesis Technology.

    Science.gov (United States)

    1983-05-01

    but also suprasegmental (prosodic) information, such as appropriate stress levels and intonation patterns, to improve their output. This is a definite… the "suprasegmentals," or influences of duration, fundamental frequency, and speech production power upon basic phonemes. These… Recoverable citations from the scan: Acoustical Society of America paper, 1975; "Suprasegmentals," Cambridge, Mass.: MIT Press, 1970; Makhoul, J., Viswanathan, R., and Huggins, W. F., "A Mixed…"

  10. Critical Issues in Airborne Applications of Speech Recognition.

    Science.gov (United States)

    1979-01-01

    fatigue, etc.). Airborne applications of speech recognizers are worthy of the considerable attention recently focused on them (Curran, 1978; Coler et al., 1978)… simulations of cockpit communications (Curran, 1978; Coler et al., 1978). Current technology also includes industrial development projects… training applications has already been well acknowledged (Breaux, 1978; Coler et al., 1978; Curran, 1978; Grady et al., 1978; Feuge and Geer, 1978).

  11. Dyadic Wavelet Features for Isolated Word Speaker Dependent Speech Recognition

    Science.gov (United States)

    1994-03-01

    …contains ten examples of each of the spoken digits ("zero" through "nine") for eight different speakers, four male and four female. The speech recordings… there were no overlapping windows. Once the feature vector was determined, the features were level normalized; this was achieved by subtracting each…

  12. Iconic Gestures for Robot Avatars, Recognition and Integration with Speech.

    Science.gov (United States)

    Bremner, Paul; Leonards, Ute

    2016-01-01

    Co-verbal gestures are an important part of human communication, improving its efficiency and efficacy for information conveyance. One possible means by which such multi-modal communication might be realized remotely is through the use of a tele-operated humanoid robot avatar. Such avatars have been previously shown to enhance social presence and operator salience. We present a motion tracking based tele-operation system for the NAO robot platform that allows direct transmission of speech and gestures produced by the operator. To assess the capabilities of this system for transmitting multi-modal communication, we have conducted a user study that investigated if robot-produced iconic gestures are comprehensible, and are integrated with speech. Robot performed gesture outcomes were compared directly to those for gestures produced by a human actor, using a within participant experimental design. We show that iconic gestures produced by a tele-operated robot are understood by participants when presented alone, almost as well as when produced by a human. More importantly, we show that gestures are integrated with speech when presented as part of a multi-modal communication equally well for human and robot performances.

  13. Iconic Gestures for Robot Avatars, Recognition and Integration with Speech

    Science.gov (United States)

    Bremner, Paul; Leonards, Ute

    2016-01-01

    Co-verbal gestures are an important part of human communication, improving its efficiency and efficacy for information conveyance. One possible means by which such multi-modal communication might be realized remotely is through the use of a tele-operated humanoid robot avatar. Such avatars have been previously shown to enhance social presence and operator salience. We present a motion tracking based tele-operation system for the NAO robot platform that allows direct transmission of speech and gestures produced by the operator. To assess the capabilities of this system for transmitting multi-modal communication, we have conducted a user study that investigated if robot-produced iconic gestures are comprehensible, and are integrated with speech. Robot performed gesture outcomes were compared directly to those for gestures produced by a human actor, using a within participant experimental design. We show that iconic gestures produced by a tele-operated robot are understood by participants when presented alone, almost as well as when produced by a human. More importantly, we show that gestures are integrated with speech when presented as part of a multi-modal communication equally well for human and robot performances. PMID:26925010

  14. Iconic Gestures for Robot Avatars, Recognition and Integration with Speech

    Directory of Open Access Journals (Sweden)

    Paul Adam Bremner

    2016-02-01

    Full Text Available Co-verbal gestures are an important part of human communication, improving its efficiency and efficacy for information conveyance. One possible means by which such multi-modal communication might be realised remotely is through the use of a tele-operated humanoid robot avatar. Such avatars have been previously shown to enhance social presence and operator salience. We present a motion tracking based tele-operation system for the NAO robot platform that allows direct transmission of speech and gestures produced by the operator. To assess the capabilities of this system for transmitting multi-modal communication, we have conducted a user study that investigated if robot-produced iconic gestures are comprehensible, and are integrated with speech. Robot performed gesture outcomes were compared directly to those for gestures produced by a human actor, using a within participant experimental design. We show that iconic gestures produced by a tele-operated robot are understood by participants when presented alone, almost as well as when produced by a human. More importantly, we show that gestures are integrated with speech when presented as part of a multi-modal communication equally well for human and robot performances.

  15. Recognition of temporally interrupted and spectrally degraded sentences with additional unprocessed low-frequency speech

    Science.gov (United States)

    Başkent, Deniz; Chatterjee, Monita

    2010-01-01

    Recognition of periodically interrupted sentences (with an interruption rate of 1.5 Hz, 50% duty cycle) was investigated under conditions of spectral degradation, implemented with a noiseband vocoder, with and without additional unprocessed low-pass filtered speech (cutoff frequency 500 Hz). Intelligibility of interrupted speech decreased with increasing spectral degradation. For all spectral-degradation conditions, however, adding the unprocessed low-pass filtered speech enhanced the intelligibility. The improvement at 4 and 8 channels was higher than the improvement at 16 and 32 channels: 19% and 8%, on average, respectively. The Articulation Index predicted an improvement of 0.09, in a scale from 0 to 1. Thus, the improvement at poorest spectral-degradation conditions was larger than what would be expected from additional speech information. Therefore, the results implied that the fine temporal cues from the unprocessed low-frequency speech, such as the additional voice pitch cues, helped perceptual integration of temporally interrupted and spectrally degraded speech, especially when the spectral degradations were severe. Considering the vocoder processing as a cochlear-implant simulation, where implant users’ performance is closest to 4 and 8-channel vocoder performance, the results support additional benefit of low-frequency acoustic input in combined electric-acoustic stimulation for perception of temporally degraded speech. PMID:20817081

  16. Recognition of temporally interrupted and spectrally degraded sentences with additional unprocessed low-frequency speech.

    Science.gov (United States)

    Başkent, Deniz; Chatterjee, Monita

    2010-12-01

    Recognition of periodically interrupted sentences (with an interruption rate of 1.5 Hz, 50% duty cycle) was investigated under conditions of spectral degradation, implemented with a noiseband vocoder, with and without additional unprocessed low-pass filtered speech (cutoff frequency 500 Hz). Intelligibility of interrupted speech decreased with increasing spectral degradation. For all spectral degradation conditions, however, adding the unprocessed low-pass filtered speech enhanced the intelligibility. The improvement at 4 and 8 channels was higher than the improvement at 16 and 32 channels: 19% and 8%, on average, respectively. The Articulation Index predicted an improvement of 0.09, in a scale from 0 to 1. Thus, the improvement at poorest spectral degradation conditions was larger than what would be expected from additional speech information. Therefore, the results implied that the fine temporal cues from the unprocessed low-frequency speech, such as the additional voice pitch cues, helped perceptual integration of temporally interrupted and spectrally degraded speech, especially when the spectral degradations were severe. Considering the vocoder processing as a cochlear implant simulation, where implant users' performance is closest to 4 and 8-channel vocoder performance, the results support additional benefit of low-frequency acoustic input in combined electric-acoustic stimulation for perception of temporally degraded speech. Copyright © 2010 Elsevier B.V. All rights reserved.
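
    The noiseband vocoder used in these studies can be sketched in a few lines: split the signal into analysis bands, extract each band's temporal envelope, and use it to modulate band-limited noise. The band count, band edges, and filter orders below are illustrative assumptions, not the studies' exact parameters.

    import numpy as np
    from scipy.signal import butter, sosfilt, hilbert

    def noiseband_vocoder(x, fs, n_channels=8, lo=100.0, hi=7000.0):
        edges = np.geomspace(lo, hi, n_channels + 1)  # log-spaced band edges
        rng = np.random.default_rng(0)
        out = np.zeros(len(x))
        for k in range(n_channels):
            sos = butter(4, [edges[k], edges[k + 1]], btype="band",
                         fs=fs, output="sos")
            band = sosfilt(sos, x)
            envelope = np.abs(hilbert(band))          # temporal envelope
            # Band-limited noise carrier, modulated by the band's envelope.
            carrier = sosfilt(sos, rng.normal(size=len(x)))
            out += envelope * carrier
        return out

    # Usage: processed = noiseband_vocoder(signal, 16000, n_channels=8)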

  17. Robust time delay estimation for speech signals using information theory: A comparison study

    Directory of Open Access Journals (Sweden)

    Wen Fei

    2011-01-01

    Full Text Available Time delay estimation (TDE) is a fundamental subsystem for a speaker localization and tracking system. Most of the traditional TDE methods are based on second-order statistics (SOS) under a Gaussian assumption for the source. This article resolves the TDE problem using two information-theoretic measures, joint entropy and mutual information (MI), which can be considered to indirectly include higher-order statistics (HOS). The TDE solutions using the two measures are presented for both Gaussian and Laplacian models. We show that, for stationary signals, the two measures are equivalent for TDE. However, for non-stationary signals (e.g., noisy speech signals), maximizing MI gives a more consistent estimate than minimizing joint entropy. Moreover, an existing idea of using modified MI to embed information about reverberation is generalized to the multiple-microphone case. From the experimental results for speech signals, this scheme with the Gaussian model shows the most robust performance in various noisy and reverberant environments.
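
    For comparison, the classic correlation-based baseline that such information-theoretic estimators are usually measured against can be sketched as follows (GCC-PHAT); the article's entropy- and MI-based estimators are not reproduced here.

    import numpy as np

    def gcc_phat(x1, x2, fs):
        n = len(x1) + len(x2)
        X1, X2 = np.fft.rfft(x1, n), np.fft.rfft(x2, n)
        cross = X1 * np.conj(X2)
        cross /= np.abs(cross) + 1e-12     # PHAT weighting: keep phase only
        cc = np.fft.irfft(cross, n)
        # Reorder so negative lags precede positive ones.
        cc = np.concatenate((cc[-(len(x2) - 1):], cc[:len(x1)]))
        lag = cc.argmax() - (len(x2) - 1)
        return lag / fs                    # delay in seconds

    fs, delay = 16000, 40                  # 40-sample true delay (2.5 ms)
    rng = np.random.default_rng(0)
    s = rng.normal(size=fs)
    x1, x2 = s, np.concatenate((np.zeros(delay), s[:-delay]))
    print(gcc_phat(x1, x2, fs))            # ~ -0.0025 s: x1 leads x2 by 2.5 ms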

  18. Robust Transmission of Speech LSFs Using Hidden Markov Model-Based Multiple Description Index Assignments

    Directory of Open Access Journals (Sweden)

    Rondeau Paul

    2008-01-01

    Full Text Available Speech coding techniques capable of generating encoded representations which are robust against channel losses play an important role in enabling reliable voice communication over packet networks and mobile wireless systems. In this paper, we investigate the use of multiple description index assignments (MDIAs for loss-tolerant transmission of line spectral frequency (LSF coefficients, typically generated by state-of-the-art speech coders. We propose a simulated annealing-based approach for optimizing MDIAs for Markov-model-based decoders which exploit inter- and intraframe correlations in LSF coefficients to reconstruct the quantized LSFs from coded bit streams corrupted by channel losses. Experimental results are presented which compare the performance of a number of novel LSF transmission schemes. These results clearly demonstrate that Markov-model-based decoders, when used in conjunction with optimized MDIA, can yield average spectral distortion much lower than that produced by methods such as interleaving/interpolation, commonly used to combat the packet losses.

  19. Automatic speech recognition (ASR) and its use as a tool for assessment or therapy of voice, speech, and language disorders.

    Science.gov (United States)

    Kitzing, Peter; Maier, Andreas; Ahlander, Viveka Lyberg

    2009-01-01

    In general opinion, computerized automatic speech recognition (ASR) is regarded merely as a method for transcribing spoken language into written text, and as such rather unreliable and cumbersome. However, due to great advances in computer technology and informatics methodology, ASR has nowadays become quite dependable and easier to handle, and the number of applications has increased considerably. After some introductory background information on ASR, a number of applications of great interest for professionals in voice, speech, and language therapy are pointed out. In the foreseeable future, the keyboard and mouse will, by means of ASR technology, be replaced in many functions by a microphone as the human-computer interface, and the computer will talk back via its loudspeaker. It seems important that professionals engaged in the care of oral communication disorders take part in this development so that their clients may get optimal benefit from this new technology.

  20. Adaptive Recognition of Phonemes from Speaker-Connected Speech Using ALISA.

    Science.gov (United States)

    Osella, Stephen Albert

    The purpose of this dissertation research is to investigate a novel approach to automatic speech recognition (ASR). The successes that have been achieved in ASR have relied heavily on the use of a language grammar, which significantly constrains the ASR process. By using grammar to provide most of the recognition ability, the ASR system does not have to be as accurate at the low-level recognition stage. The ALISA Phonetic Transcriber (APT) algorithm is proposed as a way to improve ASR by enhancing the lowest-level recognition stage. The objective of the APT algorithm is to classify speech frames (a short sequence of speech signal samples) into a small set of phoneme classes. The APT algorithm constructs the mapping from speech frames to phoneme labels through a multi-layer feedforward process. A design principle of APT is that final decisions are delayed as long as possible. Instead of attempting to optimize the decision making at each processing level individually, each level generates a list of candidate solutions that are passed on to the next level of processing. The later processing levels use these candidate solutions to resolve ambiguities. The scope of this dissertation is the design of the APT algorithm up to the speech-frame classification stage. In future research, the APT algorithm will be extended to the word recognition stage. In particular, the APT algorithm could serve as the front-end stage to a Hidden Markov Model (HMM) based word recognition system. In such a configuration, the APT algorithm would provide the HMM with the requisite phoneme state-probability estimates. To date, the APT algorithm has been tested with the TIMIT and NTIMIT speech databases. The APT algorithm has been trained and tested on the SX and SI sentence texts using both male and female speakers. Results indicate better performance than that obtained using a neural network based speech-frame classifier. The performance of the APT algorithm has been evaluated for…

  1. Joint training of DNNs by incorporating an explicit dereverberation structure for distant speech recognition

    Science.gov (United States)

    Gao, Tian; Du, Jun; Xu, Yong; Liu, Cong; Dai, Li-Rong; Lee, Chin-Hui

    2016-12-01

    We explore joint training strategies of DNNs for simultaneous dereverberation and acoustic modeling to improve the performance of distant speech recognition. There are two key contributions. First, a new DNN structure incorporating both dereverberated and original reverberant features is shown to effectively improve recognition accuracy over the conventional one using only dereverberated features as the input. Second, in most of the simulated reverberant environments for training data collection and DNN-based dereverberation, the resource data and learning targets are high-quality clean speech. With our joint training strategy, we can relax this constraint by using large-scale diversified real close-talking data as the targets which are easy to be collected via many speech-enabled applications from mobile internet users, and find the scenario even more effective. Our experiments on a Mandarin speech recognition task with 2000-h training data show that the proposed framework achieves relative word error rate reductions of 9.7 and 8.6 % over the multi-condition training systems for the cases of single-channel and multi-channel with beamforming, respectively. Furthermore, significant gains are consistently observed over the pre-processing approach using simply DNN-based dereverberation.
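
    The joint-training idea can be sketched as a dereverberation front-end network feeding an acoustic-model network, optimized end-to-end with a combined objective. Layer sizes, the loss weighting, and the senone count below are illustrative assumptions, not the authors' configuration.

    import torch
    import torch.nn as nn

    class JointModel(nn.Module):
        def __init__(self, feat_dim=40, n_states=500):
            super().__init__()
            self.dereverb = nn.Sequential(nn.Linear(feat_dim, 1024), nn.ReLU(),
                                          nn.Linear(1024, feat_dim))
            # The acoustic model sees both dereverberated and original
            # reverberant features, as in the first contribution above.
            self.acoustic = nn.Sequential(nn.Linear(2 * feat_dim, 1024),
                                          nn.ReLU(),
                                          nn.Linear(1024, n_states))

        def forward(self, reverb_feats):
            enhanced = self.dereverb(reverb_feats)
            both = torch.cat([reverb_feats, enhanced], dim=-1)
            return enhanced, self.acoustic(both)

    model = JointModel()
    ce, mse = nn.CrossEntropyLoss(), nn.MSELoss()
    x = torch.randn(32, 40)               # reverberant feature frames
    clean = torch.randn(32, 40)           # close-talking (clean) targets
    states = torch.randint(0, 500, (32,)) # senone labels
    enhanced, logits = model(x)
    # Joint objective: recognition loss plus a dereverberation term.
    loss = ce(logits, states) + 0.1 * mse(enhanced, clean)
    loss.backward()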

  2. Superior Speech Acquisition and Robust Automatic Speech Recognition for Integrated Spacesuit Audio Systems Project

    Data.gov (United States)

    National Aeronautics and Space Administration — Astronauts suffer from poor dexterity of their hands due to the clumsy spacesuit gloves during Extravehicular Activity (EVA) operations and NASA has had a widely...

  3. Superior Speech Acquisition and Robust Automatic Speech Recognition for Integrated Spacesuit Audio Systems Project

    Data.gov (United States)

    National Aeronautics and Space Administration — Astronauts suffer from poor dexterity of their hands due to the clumsy spacesuit gloves during Extravehicular Activity (EVA) operations and NASA has had a widely...

  4. Speech recognition for the anaesthesia record during crisis scenarios

    DEFF Research Database (Denmark)

    Alapetite, Alexandre

    2008-01-01

    Introduction: This article describes the evaluation of a prototype speech-input interface to an anaesthesia patient record, conducted in a full-scale anaesthesia simulator involving six doctor-nurse anaesthetist teams. Objective: The aims of the experiment were, first, to assess the potential… gathered from a questionnaire. Analysis of data was made by a method inspired by queuing theory in order to compare the delays associated with the two interfaces and to quantify the workload inherent to the memorisation of items to be entered into the anaesthesia record. Results: The experiment showed…

  5. Spectro-Temporal Analysis of Speech for Spanish Phoneme Recognition

    DEFF Research Database (Denmark)

    Sharifzadeh, Sara; Serrano, Javier; Carrabina, Jordi

    2012-01-01

    … are considered. This has improved the recognition performance, especially in the case of noisy conditions and phonemes with time-domain modulations such as stops. In this method, the 2D Discrete Cosine Transform (DCT) is applied on small overlapped 2D Hamming-windowed patches of the spectrogram of Spanish phonemes…
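
    A minimal sketch of the patch-based spectro-temporal features follows: a 2D DCT over small Hamming-windowed spectrogram patches, keeping the low-order coefficients. The patch size, overlap, and number of retained coefficients are illustrative assumptions.

    import numpy as np
    from scipy.fft import dctn
    from scipy.signal import stft, get_window

    def patch_dct_features(x, fs, patch=(8, 8), keep=3):
        _, _, S = stft(x, fs, nperseg=256)
        logmag = np.log(np.abs(S) + 1e-9)
        # Separable 2D Hamming window over (frequency, time) patches.
        w = np.outer(get_window("hamming", patch[0]),
                     get_window("hamming", patch[1]))
        feats = []
        # 50%-overlapped patches over the (frequency x time) plane.
        for i in range(0, logmag.shape[0] - patch[0], patch[0] // 2):
            for j in range(0, logmag.shape[1] - patch[1], patch[1] // 2):
                c = dctn(logmag[i:i + patch[0], j:j + patch[1]] * w,
                         norm="ortho")
                feats.append(c[:keep, :keep].ravel())  # low-order coefficients
        return np.array(feats)

    x = np.random.default_rng(0).normal(size=16000)    # 1 s stand-in signal
    print(patch_dct_features(x, 16000).shape)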

  6. How does real affect affect affect recognition in speech?

    NARCIS (Netherlands)

    Truong, Khiet Phuong

    2009-01-01

    The automatic analysis of affect is a relatively new and challenging multidisciplinary research area that has gained a lot of interest over the past few years. The research and development of affect recognition systems has opened many opportunities for improving the interaction between man and machine…

  7. Bio-Inspired Microsystem for Robust Genetic Assay Recognition

    Directory of Open Access Journals (Sweden)

    Jaw-Chyng Lue

    2008-01-01

    Full Text Available A compact integrated system-on-chip (SoC) architecture solution for robust, real-time, and on-site genetic analysis has been proposed. This microsystem solution is noise-tolerant and suitable for analyzing the weak fluorescence patterns from a PCR-prepared dual-labeled DNA microchip assay. In the architecture, a preceding VLSI differential logarithm microchip is designed for effectively computing the logarithm of the normalized input fluorescence signals. A posterior VLSI artificial neural network (ANN) processor chip is used for analyzing the processed signals from the differential logarithm stage. A single-channel logarithmic circuit was fabricated and characterized. A prototype ANN chip with an unsupervised winner-take-all (WTA) function was designed, fabricated, and tested. An ANN learning algorithm using a novel sigmoid-logarithmic transfer function based on the supervised backpropagation (BP) algorithm is proposed for robustly recognizing low-intensity patterns. Our results show that the trained new ANN can recognize low-fluorescence patterns better than an ANN using the conventional sigmoid function.

  8. Towards a Global Optimization Scheme for Multi-Band Speech Recognition

    OpenAIRE

    Cerisara, Christophe; Haton, Jean-Paul; Fohr, Dominique

    1999-01-01

    Conference paper with proceedings and peer review. In this paper, we deal with a new method to globally optimize a Multi-Band Speech Recognition (MBSR) system. We have tested our algorithm with the TIMIT database and obtained a significant improvement in accuracy over a basic HMM system for clean speech. The goal of this work is not to prove the effectiveness of MBSR, which has already been done, but to improve the training scheme by introducing a global optimization procedure. A consequence of this me…

  9. Mandarin-Speaking Children’s Speech Recognition: Developmental Changes in the Influences of Semantic Context and F0 Contours

    Science.gov (United States)

    Zhou, Hong; Li, Yu; Liang, Meng; Guan, Connie Qun; Zhang, Linjun; Shu, Hua; Zhang, Yang

    2017-01-01

    The goal of this developmental speech perception study was to assess whether and how age group modulated the influences of high-level semantic context and low-level fundamental frequency (F0) contours on the recognition of Mandarin speech by elementary and middle-school-aged children in quiet and interference backgrounds. The results revealed different patterns for semantic and F0 information. On the one hand, age group modulated significantly the use of F0 contours, indicating that elementary school children relied more on natural F0 contours than middle school children during Mandarin speech recognition. On the other hand, there was no significant modulation effect of age group on semantic context, indicating that children of both age groups used semantic context to assist speech recognition to a similar extent. Furthermore, the significant modulation effect of age group on the interaction between F0 contours and semantic context revealed that younger children could not make better use of semantic context in recognizing speech with flat F0 contours compared with natural F0 contours, while older children could benefit from semantic context even when natural F0 contours were altered, thus confirming the important role of F0 contours in Mandarin speech recognition by elementary school children. The developmental changes in the effects of high-level semantic and low-level F0 information on speech recognition might reflect the differences in auditory and cognitive resources associated with processing of the two types of information in speech perception. PMID:28701990

  10. DEVELOPMENT OF AUTOMATED SPEECH RECOGNITION SYSTEM FOR EGYPTIAN ARABIC PHONE CONVERSATIONS

    Directory of Open Access Journals (Sweden)

    A. N. Romanenko

    2016-07-01

    Full Text Available The paper deals with the description of several speech recognition systems for Egyptian Colloquial Arabic. The research is based on the CALLHOME Egyptian corpus. A description is given of both systems: a classic one, based on Hidden Markov and Gaussian Mixture Models, and a state-of-the-art one with deep neural network acoustic models. We have demonstrated the contribution from the usage of speaker-dependent bottleneck features; for their extraction, three extractors based on neural networks were trained. For their training, three datasets in several languages were used: Russian, English and different Arabic dialects. We have studied the possibility of applying a small Modern Standard Arabic (MSA) corpus to derive phonetic transcriptions. The experiments have shown that application of the extractor obtained on the basis of the Russian dataset significantly increases the quality of Arabic speech recognition. We have also found that the usage of phonetic transcriptions based on Modern Standard Arabic decreases recognition quality. Nevertheless, system operation results remain applicable in practice. In addition, we have carried out a study of applying the obtained models to the keyword search problem. The systems obtained demonstrate good results as compared to those published before. Some ways to improve speech recognition are offered.
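
    The bottleneck features mentioned above are simply the activations of a deliberately narrow hidden layer of a trained network. A minimal sketch with placeholder (untrained) weights; layer sizes are assumptions:

```python
# Sketch of bottleneck feature extraction: the narrow hidden layer of a
# trained network is read out as a compact feature vector. Layer sizes and
# random weights are placeholders; the cited work trains extractors on
# Russian, English and Arabic data.
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical weights: 440 inputs (stacked frames) -> 1024 -> 40 (bottleneck)
W1, b1 = rng.standard_normal((1024, 440)) * 0.01, np.zeros(1024)
W2, b2 = rng.standard_normal((40, 1024)) * 0.01, np.zeros(40)

def bottleneck_features(frames):
    """Forward pass up to the bottleneck layer; its activations are the features."""
    h = np.tanh(W1 @ frames + b1)
    return np.tanh(W2 @ h + b2)      # 40-dimensional bottleneck output

print(bottleneck_features(rng.standard_normal(440)).shape)  # (40,)
```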

  11. Fast and robust recognition and localization of 2D objects

    Science.gov (United States)

    Otterbach, Rainer; Gerdes, Rolf; Kammueller, R.

    1994-11-01

    The paper presents a vision system which provides robust model-based identification and localization of 2-D objects in industrial scenes. A symbolic image description based on the polygonal approximation of the object silhouettes is extracted in video real time by the use of dedicated hardware. Candidate objects are selected from the model database using a time- and memory-efficient hashing algorithm. Each candidate object is submitted to the next computation stage, which generates pose hypotheses by assigning model contours to image contours. Corresponding continuous measures of similarity are derived from the turning functions of the curves. Finally, the previously generated hypotheses are verified using a voting scheme in transformation space. Experimental results reveal the fault tolerance of the vision system with regard to noisy and split image contours as well as partial occlusion of objects. The short cycle time and the easy adaptability of the vision system make it well suited for a wide variety of applications in industrial automation.
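
    A turning function, as used above for the similarity measure, maps a closed polygon to its cumulative turning angle as a function of normalized arc length. The sketch below compares two shapes by the sampled L2 distance between their turning functions, which is one common choice; the paper's exact measure may differ.

```python
# Sketch: turning function of a closed polygon and a simple L2 comparison.
import numpy as np

def turning_function(poly, n_samples=256):
    """Cumulative turning angle vs. normalized arc length for a closed polygon."""
    poly = np.asarray(poly, dtype=float)
    edges = np.roll(poly, -1, axis=0) - poly
    seglen = np.linalg.norm(edges, axis=1)
    headings = np.arctan2(edges[:, 1], edges[:, 0])
    turns = np.diff(headings, prepend=headings[0])
    turns = (turns + np.pi) % (2 * np.pi) - np.pi        # wrap to (-pi, pi]
    cumturn = np.cumsum(turns)
    s = np.cumsum(seglen) / seglen.sum()                 # arc-length positions
    grid = np.linspace(0, 1, n_samples, endpoint=False)
    idx = np.searchsorted(s, grid, side='right').clip(max=len(s) - 1)
    return cumturn[idx]

square = [(0, 0), (1, 0), (1, 1), (0, 1)]
rect = [(0, 0), (2, 0), (2, 1), (0, 1)]
d = np.linalg.norm(turning_function(square) - turning_function(rect)) / 16
print(d)  # modest distance: same turn sequence, different edge proportions
```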

  12. Effects of contextual cues on speech recognition in simulated electric-acoustic stimulation.

    Science.gov (United States)

    Kong, Ying-Yee; Donaldson, Gail; Somarowthu, Ala

    2015-05-01

    Low-frequency acoustic cues have been shown to improve speech perception in cochlear-implant listeners. However, the mechanisms underlying this benefit are still not well understood. This study investigated the extent to which low-frequency cues can facilitate listeners' use of linguistic knowledge in simulated electric-acoustic stimulation (EAS). Experiment 1 examined differences in the magnitude of EAS benefit at the phoneme, word, and sentence levels. Speech materials were processed via noise-channel vocoding and lowpass (LP) filtering. The amount of spectral degradation in the vocoder speech was varied by applying different numbers of vocoder channels. Normal-hearing listeners were tested on vocoder-alone, LP-alone, and vocoder + LP conditions. Experiment 2 further examined factors that underlie the context effect on EAS benefit at the sentence level by limiting the low-frequency cues to temporal envelope and periodicity (AM + FM). Results showed that EAS benefit was greater for higher-context than for lower-context speech materials even when the LP ear received only low-frequency AM + FM cues. Possible explanations for the greater EAS benefit observed with higher-context materials may lie in the interplay between perceptual and expectation-driven processes for EAS speech recognition, and/or the band-importance functions for different types of speech materials.
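
    The stimulus processing described here is straightforward to approximate: a noise-channel vocoder for the "electric" ear and a lowpass filter for the "acoustic" ear. In the sketch below, the channel count, band edges, and 500 Hz cutoff are illustrative assumptions rather than the study's parameters.

```python
# Sketch of simulated EAS processing: noise-channel vocoding plus lowpass
# filtering. All parameters (bands, cutoffs, filter orders) are assumptions.
import numpy as np
from scipy.signal import butter, sosfilt, sosfiltfilt, hilbert

def noise_vocoder(x, fs, n_channels=8, lo=80.0, hi=6000.0):
    """Replace each analysis band's fine structure with envelope-modulated noise."""
    edges = np.geomspace(lo, hi, n_channels + 1)
    out = np.zeros_like(x)
    for f1, f2 in zip(edges[:-1], edges[1:]):
        sos = butter(4, [f1, f2], btype='band', fs=fs, output='sos')
        band = sosfilt(sos, x)
        env = np.abs(hilbert(band))                    # temporal envelope
        noise = sosfilt(sos, np.random.randn(len(x)))  # band-limited carrier
        out += env * noise
    return out

def lowpass_speech(x, fs, cutoff=500.0):
    sos = butter(6, cutoff, btype='low', fs=fs, output='sos')
    return sosfiltfilt(sos, x)

fs, x = 16000, np.random.randn(16000)   # stand-in for a speech signal
electric, acoustic = noise_vocoder(x, fs), lowpass_speech(x, fs)
```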

  13. Discriminative tonal feature extraction method in mandarin speech recognition

    Institute of Scientific and Technical Information of China (English)

    HUANG Hao; ZHU Jie

    2007-01-01

    To utilize the supra-segmental nature of Mandarin tones, this article proposes a feature extraction method for hidden Markov model (HMM) based tone modeling. The method uses linear transforms to project F0 (fundamental frequency) features of neighboring syllables as compensations, and adds them to the original F0 features of the current syllable. The transforms are discriminatively trained by using an objective function termed "minimum tone error", which is a smooth approximation of tone recognition accuracy. Experiments show that the new tonal features achieve a 3.82% tone recognition rate improvement over the baseline, a maximum likelihood trained HMM on the normal F0 features. Further experiments show that discriminative HMM training on the new features is 8.78% better than the baseline.
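
    The compensation scheme reduces to a simple linear-algebra step: transformed neighbor F0 features are added to the current syllable's features. The sketch below uses random placeholder matrices where the paper trains them under the minimum tone error criterion.

```python
# Sketch of tonal feature compensation: F0 features of neighboring syllables
# are linearly transformed and added to the current syllable's F0 features.
# The transform matrices here are random placeholders, not trained ones.
import numpy as np

rng = np.random.default_rng(1)
D = 5                                   # F0 feature dimension per syllable
A_prev = rng.standard_normal((D, D)) * 0.1
A_next = rng.standard_normal((D, D)) * 0.1

def compensated_f0(f0_prev, f0_cur, f0_next):
    """Current syllable's F0 features plus projected neighbor features."""
    return f0_cur + A_prev @ f0_prev + A_next @ f0_next

f0 = rng.standard_normal((3, D))        # previous, current, next syllable
print(compensated_f0(*f0))
```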

  14. Haar-like Features for Robust Real-Time Face Recognition

    DEFF Research Database (Denmark)

    Nasrollahi, Kamal; Moeslund, Thomas B.

    2013-01-01

    Face recognition is still a very challenging task when the input face image is noisy, occluded by some obstacles, of very low-resolution, not facing the camera, and not properly illuminated. These problems make the feature extraction and consequently the face recognition system unstable....... The proposed system in this paper introduces the novel idea of using Haar-like features, which have commonly been used for object detection, along with a probabilistic classifier for face recognition. The proposed system is simple, real-time, effective and robust against most of the mentioned problems...
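
    Haar-like features are rectangle-sum differences computed in constant time from an integral image. Below is a minimal sketch of one standard two-rectangle feature; the paper's exact feature set is not specified in the abstract.

```python
# Sketch: Haar-like feature computation via an integral image.
import numpy as np

def integral_image(img):
    """Summed-area table with a zero row/column prepended."""
    return np.pad(img, ((1, 0), (1, 0))).cumsum(0).cumsum(1)

def rect_sum(ii, r, c, h, w):
    """Sum of pixels in the h-by-w rectangle with top-left corner (r, c)."""
    return ii[r + h, c + w] - ii[r, c + w] - ii[r + h, c] + ii[r, c]

def haar_two_rect_horizontal(ii, r, c, h, w):
    """Left half minus right half: responds to vertical edges."""
    return rect_sum(ii, r, c, h, w // 2) - rect_sum(ii, r, c + w // 2, h, w // 2)

img = np.random.rand(24, 24)            # stand-in for a face patch
ii = integral_image(img)
print(haar_two_rect_horizontal(ii, 4, 4, 8, 12))
```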

  15. Study on Acoustic Modeling in a Mandarin Continuous Speech Recognition

    Institute of Scientific and Technical Information of China (English)

    PENG Di; LIU Gang; GUO Jun

    2007-01-01

    The design of acoustic models is of vital importance for building a reliable connection between the acoustic waveform and linguistic messages in terms of individual speech units. According to the characteristics of Chinese phonemes, the set of base acoustic phoneme units is chosen and refined, and a decision tree based state tying approach is explored. Since one of the advantages of the top-down tying method is flexibility in maintaining a balance between model accuracy and complexity, relevant adjustments are made, such as tuning the stopping criterion for decision tree node splitting, for which optimal thresholds are found. Better results are achieved in improving acoustic modeling accuracy as well as keeping the scale of the model to a trainable extent.
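
    The stopping criterion discussed above typically combines a minimum log-likelihood gain with a minimum state occupancy. A hedged sketch under the usual single-Gaussian approximation, with arbitrary threshold values:

```python
# Sketch of the node-splitting decision in decision-tree based state tying:
# accept a split only if the log-likelihood gain and the child occupancies
# clear thresholds. Thresholds below are arbitrary, not the study's values.
import numpy as np

def node_loglike(feats):
    """Approximate log-likelihood of pooled frames under a single Gaussian."""
    var = feats.var(axis=0) + 1e-6
    n, d = feats.shape
    return -0.5 * n * (d * np.log(2 * np.pi) + np.log(var).sum() + d)

def accept_split(feats, mask, min_gain=200.0, min_occ=100):
    """Split if the likelihood gain exceeds min_gain and both children are occupied."""
    left, right = feats[mask], feats[~mask]
    if len(left) < min_occ or len(right) < min_occ:
        return False
    gain = node_loglike(left) + node_loglike(right) - node_loglike(feats)
    return gain > min_gain

rng = np.random.default_rng(2)
feats = np.concatenate([rng.normal(0, 1, (300, 13)), rng.normal(3, 1, (300, 13))])
mask = np.arange(600) < 300              # hypothetical question answer
print(accept_split(feats, mask))         # True: the split separates two clusters
```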

  16. Dead regions in the cochlea: Implications for speech recognition and applicability of articulation index theory

    DEFF Research Database (Denmark)

    Vestergaard, Martin David

    2003-01-01

    Dead regions in the cochlea have been suggested to be responsible for failure by hearing aid users to benefit from apparently increased audibility in terms of speech intelligibility. As an alternative to the more cumbersome psychoacoustic tuning curve measurement, threshold-equalizing noise (TEN......-pass-filtered speech items. Data were collected from 22 hearing-impaired subjects with moderate-to-profound sensorineural hearing losses. The results showed that 11 subjects exhibited abnormal psychoacoustic behaviour in the TEN test, indicative of a possible dead region. Estimates of audibility were used to assess...... the possible connection between dead-region candidacy and the ability to recognize low-pass-filtered speech. Large variability was observed with regard to the ability of audibility to predict recognition scores for both dead-region and no-dead-region subjects. Furthermore, the results indicate that dead...
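
    Articulation index theory, whose applicability the study questions for dead-region listeners, predicts intelligibility from band-weighted audibility. A schematic sketch with made-up band levels and importance weights:

```python
# Sketch of an articulation-index style audibility computation: each band
# contributes its importance weight scaled by how much of the speech dynamic
# range is audible. Levels and weights below are illustrative only.
import numpy as np

def articulation_index(speech_db, threshold_db, importance, drange=30.0):
    """AI = sum_i importance_i * clamp((speech_i - threshold_i)/drange, 0, 1)."""
    audibility = np.clip((np.asarray(speech_db) -
                          np.asarray(threshold_db)) / drange, 0.0, 1.0)
    return float(np.dot(importance, audibility))

speech = [55, 52, 50, 48, 45]            # band speech peak levels (dB SPL)
thresh = [30, 35, 55, 70, 80]            # elevated thresholds, worst at high freqs
weight = [0.15, 0.25, 0.25, 0.20, 0.15]  # band importance weights, sum to 1
print(articulation_index(speech, thresh, weight))
```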

  17. Analysis of Phonetic Transcriptions for Danish Automatic Speech Recognition

    DEFF Research Database (Denmark)

    Kirkedal, Andreas Søeborg

    2013-01-01

    recognition system depends heavily on the dictionary and the transcriptions therein. This paper presents an analysis of phonetic/phonemic features that are salient for current Danish ASR systems. This preliminary study consists of a series of experiments using an ASR system trained on the DK-PAROLE corpus....... The analysis indicates that transcribing e.g. stress or vowel duration has a negative impact on performance. The best performance is obtained with coarse phonetic annotation, which improves performance by 1% in word error rate and 3.8% in sentence error rate....
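
    The word and sentence error rates reported above are standard metrics; WER is computed from a Levenshtein alignment of reference and hypothesis word sequences, as in this sketch:

```python
# Sketch of word error rate (WER) via Levenshtein edit distance over words.
def word_error_rate(ref, hyp):
    """WER = (substitutions + deletions + insertions) / len(reference)."""
    r, h = ref.split(), hyp.split()
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i                       # deleting all reference words
    for j in range(len(h) + 1):
        d[0][j] = j                       # inserting all hypothesis words
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(r)][len(h)] / len(r)

print(word_error_rate("the cat sat", "the cat sat down"))  # 1 insertion / 3 words
```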

  18. A Log-Index Weighted Cepstral Distance Measure for Speech Recognition

    Institute of Scientific and Technical Information of China (English)

    ZHENG Fang; WU Wenhu; et al.

    1997-01-01

    A log-index weighted cepstral distance measure is proposed and tested in speaker-independent and speaker-dependent isolated word recognition systems using statistical techniques. The weights for the cepstral coefficients in this measure equal the logarithm of the corresponding indices. The experimental results show that this kind of measure works better than other weighted Euclidean cepstral distance measures on three speech databases. The error rate obtained using this measure is about 1.8 percent averaged over the three databases, which is a 25% reduction from that obtained using other measures, and a 40% reduction from that obtained using the Log Likelihood Ratio (LLR) measure. The experimental results also show that this kind of distance measure works well in both speaker-dependent and speaker-independent speech recognition systems.
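
    The measure itself is compact enough to state in code: a weighted Euclidean distance over cepstral coefficients with weight log(i) on coefficient i. Note that with 1-based indexing log(1) = 0 silently discards the first coefficient; whether the original work shifts the indexing is not stated in the abstract, so treat this as one plausible reading:

```python
# Sketch of a log-index weighted cepstral distance: weighted Euclidean
# distance with w_i = log(i) for i = 1..p. With this indexing the first
# coefficient gets zero weight; the original indexing convention is assumed.
import numpy as np

def log_index_cepstral_distance(c1, c2):
    """Weighted Euclidean cepstral distance with w_i = log(i)."""
    c1, c2 = np.asarray(c1), np.asarray(c2)
    w = np.log(np.arange(1, len(c1) + 1))
    return float(np.sqrt(np.sum((w * (c1 - c2)) ** 2)))

a = np.array([1.2, -0.5, 0.3, 0.1, -0.2])
b = np.array([1.0, -0.4, 0.5, 0.0, -0.1])
print(log_index_cepstral_distance(a, b))
```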

  19. Coordinated control of an intelligent wheelchair based on a brain-computer interface and speech recognition

    Institute of Scientific and Technical Information of China (English)

    Hong-tao WANG; Yuan-qing LI; Tian-you YU

    2014-01-01

    An intelligent wheelchair is devised, which is controlled by a coordinated mechanism based on a brain-computer interface (BCI) and speech recognition. By performing appropriate activities, users can navigate the wheelchair with four steering behaviors (start, stop, turn left, and turn right). Five healthy subjects participated in an indoor experiment. The results demonstrate the efficiency of the coordinated control mechanism with satisfactory path and time optimality ratios, and show that speech recognition is a fast and accurate supplement to BCI-based control systems. The proposed intelligent wheelchair is especially suitable for patients suffering from paralysis (especially those with aphasia) who can learn to pronounce only a single sound (e.g., 'ah').
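
    The coordination logic is not detailed in the abstract; the arbiter below, which lets a confident speech command override the BCI stream, is only one plausible reading of "speech recognition is a fast and accurate supplement", not the paper's exact rule.

```python
# Hedged sketch of a command arbiter combining a BCI command stream with
# speech recognition output. The priority rule and threshold are assumptions.
from typing import Optional

COMMANDS = {"start", "stop", "turn_left", "turn_right"}

def arbitrate(bci_cmd: Optional[str], speech_cmd: Optional[str],
              speech_conf: float, conf_threshold: float = 0.8) -> Optional[str]:
    """Return the wheelchair steering command, preferring confident speech input."""
    if speech_cmd in COMMANDS and speech_conf >= conf_threshold:
        return speech_cmd
    if bci_cmd in COMMANDS:
        return bci_cmd
    return None  # no actionable command this cycle

print(arbitrate("turn_left", "stop", 0.93))  # -> 'stop'
```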