WorldWideScience

Sample records for modeling speech signals

  1. Enhancement of speech signals - with a focus on voiced speech models

    DEFF Research Database (Denmark)

    Nørholm, Sidsel Marie

    This thesis deals with speech enhancement, i.e., noise reduction in speech signals. This has applications in, e.g., hearing aids and teleconference systems. We consider a signal-driven approach to speech enhancement where a model of the speech is assumed and filters are generated based on this model...

  2. Mathematical modeling and signal processing in speech and hearing sciences

    CERN Document Server

    Xin, Jack

    2014-01-01

    The aim of the book is to give an accessible introduction to mathematical models and signal processing methods in speech and hearing sciences for senior undergraduate and beginning graduate students with basic knowledge of linear algebra, differential equations, numerical analysis, and probability. Speech and hearing sciences are fundamental to numerous technological advances of the digital world in the past decade, from music compression in MP3 to digital hearing aids, from network-based voice-enabled services to speech interaction with mobile phones. Mathematics and computation are intimately related to these leaps and bounds. On the other hand, speech and hearing are strongly interdisciplinary areas where dissimilar scientific and engineering publications and approaches often coexist, which makes it difficult for newcomers to enter the field.

  3. Phonetic perspectives on modelling information in the speech signal

    Indian Academy of Sciences (India)

    S Hawkins

    2011-10-01

    This paper reassesses conventional assumptions about the informativeness of the acoustic speech signal, and shows how recent research on systematic variability in the acoustic signal is consistent with an alternative linguistic model that is more biologically plausible and compatible with recent advances in modelling embodied visual perception and action. Standard assumptions about the information available from the speech signal, especially strengths and limitations of phonological features and phonemes, are reviewed, and compared with an alternative approach based on Firthian prosodic analysis (FPA). FPA places more emphasis than standard models on the linguistic and interactional function of an utterance, de-emphasizes the need to identify phonemes, and uses formalisms that force us to recognize that every perceptual decision is context- and task-dependent. Examples of perceptually-significant phonetic detail that is neglected by standard models are discussed. Similarities between the theoretical approach recommended and current work on perception–action robots are explored.

  4. Sparse gammatone signal model optimized for English speech does not match the human auditory filters.

    Science.gov (United States)

    Strahl, Stefan; Mertins, Alfred

    2008-07-18

    Evidence that neurosensory systems use sparse signal representations, together with the improved performance of signal processing algorithms based on sparse signal models, has raised interest in sparse signal coding in recent years. For natural audio signals like speech and environmental sounds, gammatone atoms have been derived as expansion functions that generate a nearly optimal sparse signal model (Smith, E., Lewicki, M., 2006. Efficient auditory coding. Nature 439, 978-982). Furthermore, gammatone functions are established models for the human auditory filters. Thus far, a practical application of a sparse gammatone signal model has been prevented by the fact that deriving the sparsest representation is, in general, computationally intractable. In this paper, we applied an accelerated version of the matching pursuit algorithm for gammatone dictionaries, allowing real-time and large data set applications. We show that a sparse signal model in general has advantages in audio coding and that a sparse gammatone signal model encodes speech more efficiently in terms of sparseness than a sparse modified discrete cosine transform (MDCT) signal model. We also show that the optimal gammatone parameters derived for English speech do not match the human auditory filters, suggesting that signal processing applications should derive the parameters individually for each signal class instead of using psychometrically derived parameters. For brain research, this means that care should be taken when directly transferring findings of optimality from technical to biological systems.
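
    To make the decomposition concrete, here is a minimal sketch of plain (unaccelerated) matching pursuit over a small gammatone dictionary. The center-frequency grid, atom duration, and ERB formula are illustrative assumptions, not the paper's optimized parameters or its accelerated algorithm.

```python
import numpy as np

def gammatone_atom(fc, fs, duration=0.025, order=4, b=1.019):
    """Unit-energy gammatone: t^(n-1) * exp(-2*pi*b*ERB(fc)*t) * cos(2*pi*fc*t)."""
    t = np.arange(int(duration * fs)) / fs
    erb = 24.7 + fc / 9.265                       # Glasberg & Moore ERB approximation
    g = t ** (order - 1) * np.exp(-2 * np.pi * b * erb * t) * np.cos(2 * np.pi * fc * t)
    return g / np.linalg.norm(g)

def matching_pursuit(x, atoms, n_iter=100):
    """Greedy sparse coding: at each step subtract the best-correlated shifted atom."""
    residual = np.asarray(x, dtype=float).copy()
    code = []
    for _ in range(n_iter):
        best_val, best = -1.0, None
        for k, g in enumerate(atoms):
            corr = np.correlate(residual, g, mode='valid')
            s = int(np.argmax(np.abs(corr)))
            if abs(corr[s]) > best_val:
                best_val, best = abs(corr[s]), (k, s, corr[s])
        k, s, coef = best
        residual[s:s + len(atoms[k])] -= coef * atoms[k]
        code.append((k, s, coef))                 # (atom index, shift, coefficient)
    return code, residual

fs = 16000
centers = 100 * 2 ** (np.arange(25) / 4)          # illustrative log-spaced grid from 100 Hz
atoms = [gammatone_atom(fc, fs) for fc in centers]
```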

  5. Degradation decomposition of the perceived quality of speech signals on the basis of a perceptual modeling approach

    NARCIS (Netherlands)

    Beerends, J.G.; Busz, B.; Oudshoorn, P.; Vugt, J. van; Ahmed, K.; Niamut, O.

    2007-01-01

    The authors discuss the way we perceive the quality of a speech signal and how different degradations contribute to the overall perceived speech (listening) quality. More specifically, ITU-T Recommendation P.862 (perceptual evaluation of speech quality, PESQ), which provides a perceptual modeling approach...

  7. Volterra prediction model for speech signal series

    Institute of Scientific and Technical Information of China (English)

    张玉梅; 胡小俊; 吴晓军; 白树林; 路纲

    2015-01-01

    The given English phonemes, words and sentences are sampled and preprocessed. For these real measured speech signal series, the time delay and embedding dimension are determined by using the mutual information method and Cao's method, respectively, so as to perform phase space reconstruction of the speech signal series. By using the small data set method, the largest Lyapunov exponent of the speech signal series is calculated; the fact that its value is greater than zero indicates the chaotic characteristics of the speech signal series. This, in fact, performs the chaotic characteristic identification of the speech signal series. By introducing a second-order Volterra series, we put forward a type of nonlinear prediction model with an explicit structure. To overcome some intrinsic shortcomings caused by improper parameter selection when the least mean square (LMS) algorithm is used to update the Volterra model coefficients, a variable convergence factor technique based on a posteriori error assumptions is applied on top of the LMS algorithm, and a novel Davidon-Fletcher-Powell-based second-order Volterra filter (DFPSOVF) is constructed and used to predict speech signal series of the given English phonemes, words and sentences with chaotic characteristics. Simulation results in the MATLAB 7.0 environment show that the proposed nonlinear model DFPSOVF guarantees stability and convergence, avoiding the divergence problems that arise when the LMS algorithm is used; for single-frame and multi-frame measured speech signals, with root mean square error (RMSE) as the evaluation criterion, the prediction accuracy of the proposed nonlinear prediction model DFPSOVF is better than that of the traditionally employed linear prediction (LP). Primary results of single-frame and multi-frame predictions are given. The proposed DFPSOVF model can therefore substitute for a linear prediction model under certain conditions. Meanwhile, it can better reflect trends and regularity...
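
    As a reference point, below is a minimal second-order Volterra one-step predictor trained with normalized LMS. The memory length and step size are arbitrary choices, and the DFP-based variable-convergence-factor update described in the abstract is not reproduced here.

```python
import numpy as np

def volterra_lms_predict(x, memory=4, mu=0.5):
    """One-step prediction with a second-order Volterra filter trained by
    normalized LMS. phi holds the linear terms x[n-1..n-M] plus all
    quadratic cross-terms x[n-i]*x[n-j], i <= j."""
    x = np.asarray(x, dtype=float)
    M = memory
    quad = [(i, j) for i in range(M) for j in range(i, M)]
    w = np.zeros(M + len(quad))
    pred = np.zeros_like(x)
    for n in range(M, len(x)):
        u = x[n - M:n][::-1]                      # most recent sample first
        phi = np.concatenate([u, [u[i] * u[j] for i, j in quad]])
        pred[n] = w @ phi
        err = x[n] - pred[n]
        w += mu * err * phi / (1e-8 + phi @ phi)  # normalized step keeps LMS stable
    return pred

# e.g. rmse = np.sqrt(np.mean((x[4:] - volterra_lms_predict(x)[4:]) ** 2))
```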

  8. Modeling speech intelligibility based on the signal-to-noise envelope power ratio

    DEFF Research Database (Denmark)

    Jørgensen, Søren

    The intelligibility of speech depends on factors related to the auditory processes involved in sound perception as well as on the acoustic properties of the sound entering the ear. However, a clear understanding of speech perception in complex acoustic conditions and, in particular, a quantitative description of the involved auditory processes provides a major challenge in speech and hearing research. This thesis presents a computational model that attempts to predict the speech intelligibility obtained by normal-hearing listeners in various adverse conditions. The model combines the concept... The model was also evaluated on speech transmitted through three commercially available mobile phones. It successfully accounts for the performance across the phones in conditions with a stationary speech-shaped background noise, whereas deviations were observed in conditions with “Traffic” and “Pub” noise. Overall, the results of this thesis...

  9. Robust digital processing of speech signals

    CERN Document Server

    Kovacevic, Branko; Veinović, Mladen; Marković, Milan

    2017-01-01

    This book focuses on speech signal phenomena, presenting a robustification of the usual speech generation models with regard to the presumed types of excitation signals, which is equivalent to the introduction of a class of nonlinear models and the corresponding criterion functions for parameter estimation. Compared to the general class of nonlinear models, such as various neural networks, these models possess good properties of controlled complexity, the option of working in “online” mode, as well as a low information volume for efficient speech encoding and transmission. Providing comprehensive insights, the book is based on the authors’ research, which has already been published, supplemented by additional texts discussing general considerations of speech modeling, linear predictive analysis and robust parameter estimation.

  10. Model-Based Speech Signal Coding Using Optimized Temporal Decomposition for Storage and Broadcasting Applications

    Directory of Open Access Journals (Sweden)

    Chandranath R. N. Athaudage

    2003-09-01

    A dynamic programming-based optimization strategy for a temporal decomposition (TD) model of speech and its application to low-rate speech coding in storage and broadcasting is presented. In previous work with the spectral stability-based event localizing (SBEL) TD algorithm, the event localization was performed based on a spectral stability criterion. Although this approach gave reasonably good results, there was no assurance of the optimality of the event locations. In the present work, we have optimized the event localizing task using a dynamic programming-based optimization strategy. Simulation results show that an improved TD model accuracy can be achieved. A methodology for incorporating the optimized TD algorithm within the standard MELP speech coder for the efficient compression of speech spectral information is also presented. The performance evaluation results revealed that the proposed speech coding scheme achieves 50%–60% compression of speech spectral information with negligible degradation in the decoded speech quality.

  11. Array Signal Processing Under Model Errors With Application to Speech Separation

    Science.gov (United States)

    1992-10-31

  12. Heart Rate Extraction from Vowel Speech Signals

    Institute of Scientific and Technical Information of China (English)

    Abdelwadood Mesleh; Dmitriy Skopin; Sergey Baglikov; Anas Quteishat

    2012-01-01

    This paper presents a novel non-contact heart rate extraction method from vowel speech signals. The proposed method is based on modeling the relationship between speech production of vowel speech signals and heart activities for humans, where it is observed that the moment of a heart beat causes a short increment (evolution) of the vowel speech formants. The short-time Fourier transform (STFT) is used to detect the formant maximum peaks so as to accurately estimate the heart rate. Compared with a traditional contact pulse oximeter, the average accuracy of the proposed non-contact heart rate extraction method exceeds 95%. The proposed non-contact heart rate extraction method is expected to play an important role in modern medical applications.
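
    The abstract does not spell out the peak-tracking procedure, so the following is only a rough sketch under stated assumptions: the dominant spectral peak is tracked frame by frame with an STFT, and the heart rate is read off as the strongest slow periodicity (0.7–3 Hz) of that trajectory.

```python
import numpy as np
from scipy.signal import stft

def heart_rate_from_vowel(x, fs):
    """Estimate heart rate from the slow modulation of a vowel's dominant spectral peak."""
    f, t, Z = stft(x, fs=fs, nperseg=1024, noverlap=768)
    peak_freq = f[np.argmax(np.abs(Z), axis=0)]   # dominant peak frequency per frame
    peak_freq = peak_freq - peak_freq.mean()      # keep only the variation around the mean
    frame_rate = 1.0 / (t[1] - t[0])
    spec = np.abs(np.fft.rfft(peak_freq))
    freqs = np.fft.rfftfreq(len(peak_freq), d=1.0 / frame_rate)
    band = (freqs >= 0.7) & (freqs <= 3.0)        # 42-180 beats per minute
    return 60.0 * freqs[band][np.argmax(spec[band])]
```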

  13. Method and apparatus for obtaining complete speech signals for speech recognition applications

    Science.gov (United States)

    Abrash, Victor (Inventor); Cesari, Federico (Inventor); Franco, Horacio (Inventor); George, Christopher (Inventor); Zheng, Jing (Inventor)

    2009-01-01

    The present invention relates to a method and apparatus for obtaining complete speech signals for speech recognition applications. In one embodiment, the method continuously records an audio stream comprising a sequence of frames to a circular buffer. When a user command to commence or terminate speech recognition is received, the method obtains a number of frames of the audio stream occurring before or after the user command in order to identify an augmented audio signal for speech recognition processing. In further embodiments, the method analyzes the augmented audio signal in order to locate starting and ending speech endpoints that bound at least a portion of speech to be processed for recognition. At least one of the speech endpoints is located using a Hidden Markov Model.
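
    A minimal sketch of the circular-buffer idea: audio frames are recorded continuously, so speech occurring shortly before the user command can still be handed to the recognizer. The FrameRing class and frame sizes are hypothetical; the patent's HMM-based endpoint location is not shown.

```python
from collections import deque

class FrameRing:
    """Ring buffer keeping the most recent audio frames so that speech preceding
    a push-to-talk command remains available for recognition."""
    def __init__(self, max_frames):
        self.frames = deque(maxlen=max_frames)    # oldest frames fall off automatically
    def push(self, frame):
        self.frames.append(frame)
    def snapshot(self, n_pre):
        """Return the last n_pre frames recorded before the user command."""
        return list(self.frames)[-n_pre:]

# usage: keep 2 s of 10 ms frames; grab 0.5 s of pre-command audio
ring = FrameRing(max_frames=200)
# for frame in audio_stream: ring.push(frame)
# augmented = ring.snapshot(n_pre=50) + frames_after_command
```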

  14. Automatic Smoker Detection from Telephone Speech Signals

    DEFF Research Database (Denmark)

    Alavijeh, Amir Hossein Poorjam; Hesaraki, Soheila; Safavi, Saeid

    2017-01-01

    This paper proposes automatic smoking habit detection from spontaneous telephone speech signals. In this method, each utterance is modeled using the i-vector and non-negative factor analysis (NFA) frameworks, which yield low-dimensional representations of utterances by applying factor analysis on Gaussian mixture model means and weights, respectively. Each framework is evaluated using different classification algorithms to detect the smoker speakers. Finally, score-level fusion of the i-vector-based and the NFA-based recognizers is considered to improve the classification accuracy. The proposed method is evaluated on telephone speech signals of speakers whose smoking habits are known, drawn from the National Institute of Standards and Technology (NIST) 2008 and 2010 Speaker Recognition Evaluation databases. Experimental results over 1194 utterances show the effectiveness of the proposed approach...

  15. Negative blood oxygen level dependent signals during speech comprehension.

    Science.gov (United States)

    Rodriguez Moreno, Diana; Schiff, Nicholas D; Hirsch, Joy

    2015-05-01

    Speech comprehension studies have generally focused on the isolation and function of regions with positive blood oxygen level dependent (BOLD) signals with respect to a resting baseline. Although regions with negative BOLD signals in comparison to a resting baseline have been reported in language-related tasks, their relationship to regions of positive signals is not fully appreciated. Based on the emerging notion that the negative signals may represent an active function in language tasks, the authors test the hypothesis that negative BOLD signals during receptive language are more associated with comprehension than content-free versions of the same stimuli. Regions associated with comprehension of speech were isolated by comparing responses to passive listening to natural speech to two incomprehensible versions of the same speech: one that was digitally time reversed and one that was muffled by removal of high frequencies. The signal polarity was determined by comparing the BOLD signal during each speech condition to the BOLD signal during a resting baseline. As expected, stimulation-induced positive signals relative to resting baseline were observed in the canonical language areas with varying signal amplitudes for each condition. Negative BOLD responses relative to resting baseline were observed primarily in frontoparietal regions and were specific to the natural speech condition. However, the BOLD signal remained indistinguishable from baseline for the unintelligible speech conditions. Variations in connectivity between brain regions with positive and negative signals were also specifically related to the comprehension of natural speech. These observations of anticorrelated signals related to speech comprehension are consistent with emerging models of cooperative roles represented by BOLD signals of opposite polarity.

  16. Epoch-based analysis of speech signals

    Indian Academy of Sciences (India)

    B Yegnanarayana; Suryakanth V Gangashetty

    2011-10-01

    Speech analysis is traditionally performed using short-time analysis to extract features in time and frequency domains. The window size for the analysis is fixed somewhat arbitrarily, mainly to account for the time varying vocal tract system during production. However, speech in its primary mode of excitation is produced due to impulse-like excitation in each glottal cycle. Anchoring the speech analysis around the glottal closure instants (epochs) yields significant benefits for speech analysis. Epoch-based analysis of speech helps not only to segment the speech signals based on speech production characteristics, but also helps in accurate analysis of speech. It enables extraction of important acoustic-phonetic features such as glottal vibrations, formants, instantaneous fundamental frequency, etc. Epoch sequence is useful to manipulate prosody in speech synthesis applications. Accurate estimation of epochs helps in characterizing voice quality features. Epoch extraction also helps in speech enhancement and multispeaker separation. In this tutorial article, the importance of epochs for speech analysis is discussed, and methods to extract the epoch information are reviewed. Applications of epoch extraction for some speech applications are demonstrated.
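
    One widely used approach from this line of work is zero-frequency filtering, in which epochs (glottal closure instants) appear as positive-going zero crossings after the signal is passed through cascaded zero-frequency resonators and the slowly growing trend is removed. The sketch below is a simplified offline version; the trend-removal window, set near an assumed average pitch period, is a free parameter.

```python
import numpy as np

def epochs_zero_frequency(x, fs, pitch_period_ms=8.0):
    """Epochs as positive-going zero crossings of the zero-frequency filtered signal."""
    d = np.diff(np.asarray(x, dtype=float), prepend=x[0])   # remove DC drift
    y = d
    for _ in range(2):                            # two cascaded zero-frequency resonators
        out = np.zeros_like(y)
        for n in range(2, len(y)):
            out[n] = y[n] + 2.0 * out[n - 1] - out[n - 2]
        y = out
    w = int(fs * pitch_period_ms / 1000)
    kernel = np.ones(2 * w + 1) / (2 * w + 1)
    for _ in range(3):                            # repeated local-mean removal kills the trend
        y = y - np.convolve(y, kernel, mode='same')
    return np.where((y[:-1] < 0) & (y[1:] >= 0))[0]
```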

  17. Comparing Infants' Preference for Correlated Audiovisual Speech with Signal-Level Computational Models

    Science.gov (United States)

    Hollich, George; Prince, Christopher G.

    2009-01-01

    How much of infant behaviour can be accounted for by signal-level analyses of stimuli? The current paper directly compares the moment-by-moment behaviour of 8-month-old infants in an audiovisual preferential looking task with that of several computational models that use the same video stimuli as presented to the infants. One type of model…

  18. Modeling speech intelligibility in adverse conditions

    DEFF Research Database (Denmark)

    Dau, Torsten

    2012-01-01

    ...by the normal as well as impaired auditory system. Jørgensen and Dau [(2011). J. Acoust. Soc. Am. 130, 1475-1487] proposed the speech-based envelope power spectrum model (sEPSM) in an attempt to overcome the limitations of the classical speech transmission index (STI) and speech intelligibility index (SII) in conditions with nonlinearly processed speech. Instead of considering the reduction of the temporal modulation energy as the intelligibility metric, as assumed in the STI, the sEPSM applies the signal-to-noise ratio in the envelope domain (SNRenv). This metric was shown to be the key for predicting the intelligibility of reverberant speech as well as noisy speech processed by spectral subtraction. However, the sEPSM cannot account for speech subjected to phase jitter, a condition in which the spectral structure of speech is destroyed, while the broadband temporal envelope is kept largely intact. In contrast...

  19. A DISTRIBUTED COMPRESSED SENSING APPROACH FOR SPEECH SIGNAL DENOISING

    Institute of Scientific and Technical Information of China (English)

    Ji Yunyun; Yang Zhen

    2011-01-01

    Compressed sensing, a new area of signal processing that has risen in recent years, seeks to minimize the number of samples that must be taken from a signal for precise reconstruction. The precondition of compressed sensing theory is the sparsity of signals. In this paper, two methods to estimate the sparsity level of the signal are formulated, and an approach to estimate the sparsity level directly from the noisy signal is then presented. Moreover, a scheme based on distributed compressed sensing for speech signal denoising is described, which exploits multiple measurements of the noisy speech signal to construct block-sparse data and then reconstructs the original speech signal using the block-sparse model-based Compressive Sampling Matching Pursuit (CoSaMP) algorithm. Several simulation results demonstrate the accuracy of the estimated sparsity level and show that this denoising system for noisy speech signals can achieve favorable performance, especially when speech signals suffer severe noise.
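
    For orientation, here is the standard (non-block) CoSaMP recovery loop; the paper's block-sparse variant replaces the per-coefficient support selection with block-wise selection, which this sketch does not implement.

```python
import numpy as np

def cosamp(Phi, y, k, n_iter=30):
    """Basic CoSaMP: recover a k-sparse x from y = Phi @ x (+ noise)."""
    m, n = Phi.shape
    x = np.zeros(n)
    residual = np.asarray(y, dtype=float).copy()
    for _ in range(n_iter):
        proxy = Phi.T @ residual
        omega = np.argsort(np.abs(proxy))[-2 * k:]            # 2k strongest correlations
        support = np.union1d(omega, np.nonzero(x)[0]).astype(int)
        sol = np.linalg.lstsq(Phi[:, support], y, rcond=None)[0]
        b = np.zeros(n)
        b[support] = sol
        x = np.zeros(n)                                       # prune to the k largest entries
        keep = np.argsort(np.abs(b))[-k:]
        x[keep] = b[keep]
        residual = y - Phi @ x
        if np.linalg.norm(residual) < 1e-9 * np.linalg.norm(y):
            break
    return x
```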

  20. Modelling speech intelligibility in adverse conditions

    DEFF Research Database (Denmark)

    Jørgensen, Søren; Dau, Torsten

    2013-01-01

    Jørgensen and Dau (J Acoust Soc Am 130:1475-1487, 2011) proposed the speech-based envelope power spectrum model (sEPSM) in an attempt to overcome the limitations of the classical speech transmission index (STI) and speech intelligibility index (SII) in conditions with nonlinearly processed speech. Instead of considering the reduction of the temporal modulation energy as the intelligibility metric, as assumed in the STI, the sEPSM applies the signal-to-noise ratio in the envelope domain (SNRenv). This metric was shown to be the key for predicting the intelligibility of reverberant speech as well as noisy speech processed by spectral subtraction. However, the sEPSM cannot account for speech subjected to phase jitter, a condition in which the spectral structure of the speech signal is strongly affected, while the broadband temporal envelope is kept largely intact. In contrast, the effects of this distortion can be predicted successfully by the spectro-temporal modulation...

  1. Hidden Markov models in automatic speech recognition

    Science.gov (United States)

    Wrzoskowicz, Adam

    1993-11-01

    This article describes a method for constructing an automatic speech recognition system based on hidden Markov models (HMMs). The author discusses the basic concepts of HMM theory and the application of these models to the analysis and recognition of speech signals. The author provides algorithms which make it possible to train the ASR system and recognize signals on the basis of distinct stochastic models of selected speech sound classes. The author describes the specific components of the system and the procedures used to model and recognize speech. The author discusses problems associated with the choice of optimal signal detection and parameterization characteristics and their effect on the performance of the system. The author presents different options for the choice of speech signal segments and their consequences for the ASR process. The author gives special attention to the use of lexical, syntactic, and semantic information for the purpose of improving the quality and efficiency of the system. The author also describes an ASR system developed by the Speech Acoustics Laboratory of the IBPT PAS. The author discusses the results of experiments on the effect of noise on the performance of the ASR system and describes methods of constructing HMMs designed to operate in a noisy environment. The author also describes a language for human-robot communication, defined as a complex multilevel network built from HMM models of speech sounds and geared towards Polish inflections. Mandatory lexical and syntactic rules were also added to the system for its communications vocabulary.
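
    For reference, recognition with trained HMMs reduces to finding the most probable state path for a sequence of acoustic frames. Below is a generic log-space Viterbi decoder (not the IBPT PAS system; the matrices are assumed to be given).

```python
import numpy as np

def viterbi(log_A, log_B, log_pi):
    """Most likely HMM state path. log_A: (S,S) transition log-probs;
    log_B: (T,S) per-frame emission log-probs; log_pi: (S,) initial log-probs."""
    T, S = log_B.shape
    delta = log_pi + log_B[0]
    back = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + log_A           # scores[i, j]: best path ending i -> j
        back[t] = np.argmax(scores, axis=0)
        delta = scores[back[t], np.arange(S)] + log_B[t]
    path = [int(np.argmax(delta))]
    for t in range(T - 1, 0, -1):                 # trace the best path backwards
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```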

  2. Modelling speech intelligibility in adverse conditions.

    Science.gov (United States)

    Jørgensen, Søren; Dau, Torsten

    2013-01-01

    Jørgensen and Dau (J Acoust Soc Am 130:1475-1487, 2011) proposed the speech-based envelope power spectrum model (sEPSM) in an attempt to overcome the limitations of the classical speech transmission index (STI) and speech intelligibility index (SII) in conditions with nonlinearly processed speech. Instead of considering the reduction of the temporal modulation energy as the intelligibility metric, as assumed in the STI, the sEPSM applies the signal-to-noise ratio in the envelope domain (SNRenv). This metric was shown to be the key for predicting the intelligibility of reverberant speech as well as noisy speech processed by spectral subtraction. The key role of the SNRenv metric is further supported here by the ability of a short-term version of the sEPSM to predict speech masking release for different speech materials and modulated interferers. However, the sEPSM cannot account for speech subjected to phase jitter, a condition in which the spectral structure of the speech signal is strongly affected, while the broadband temporal envelope is kept largely intact. In contrast, the effects of this distortion can be predicted successfully by the spectro-temporal modulation index (STMI) (Elhilali et al., Speech Commun 41:331-348, 2003), which assumes an explicit analysis of the spectral "ripple" structure of the speech signal. However, since the STMI applies the same decision metric as the STI, it fails to account for spectral subtraction. The results from this study suggest that the SNRenv might reflect a powerful decision metric, while some explicit across-frequency analysis seems crucial in some conditions. How such across-frequency analysis is "realized" in the auditory system remains unresolved.
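
    To illustrate the SNRenv idea, here is a deliberately crude single-band version: envelope power of the noisy mixture in a modulation band relative to the envelope power of the noise alone, with the speech envelope power taken as the difference. The full sEPSM adds gammatone and modulation filterbanks, which this sketch omits.

```python
import numpy as np
from scipy.signal import hilbert, welch

def snr_env_db(mixture, noise, fs, mod_band=(1.0, 64.0)):
    """Crude single-band SNRenv in dB (simplified sEPSM-style metric)."""
    def env_power(sig):
        env = np.abs(hilbert(sig))
        env = env - env.mean()                    # keep AC modulation power only
        f, p = welch(env, fs=fs, nperseg=min(4096, len(env)))
        band = (f >= mod_band[0]) & (f <= mod_band[1])
        return np.trapz(p[band], f[band])
    p_mix, p_noise = env_power(mixture), env_power(noise)
    p_speech = max(p_mix - p_noise, 1e-12)        # envelope power attributed to speech
    return 10.0 * np.log10(p_speech / p_noise)
```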

  3. Model-based speech enhancement using a bone-conducted signal

    NARCIS (Netherlands)

    Kechichian, P.; Srinivasan, S.

    2012-01-01

    Common single-microphone noise suppressors perform well when the background noise is stationary but exhibit unsatisfactory performance for nonstationary noise signals, since properly tracking and estimating the noise becomes difficult. Codebook-based methods, which exploit prior knowledge about the

  4. Does Signal Degradation Affect Top-Down Processing of Speech?

    NARCIS (Netherlands)

    Wagner, Anita; Pals, Carina; de Blecourt, Charlotte M; Sarampalis, Anastasios; Başkent, Deniz

    2016-01-01

    Speech perception is formed based on both the acoustic signal and listeners' knowledge of the world and semantic context. Access to semantic information can facilitate interpretation of degraded speech, such as speech in background noise or the speech signal transmitted via cochlear implants (CIs).

  5. A Model for the representation of Speech Signals in Normal and Impaired Ears

    DEFF Research Database (Denmark)

    Christiansen, Thomas Ulrich

    2004-01-01

    A model of human auditory periphery, ranging from the outer ear to the auditory nerve, was developed. The model consists of the following components: outer ear transfer function, middle ear transfer function, basilar membrane velocity, inner hair cell receptor potential, inner hair cell probabili...

  6. Modeling Speech Intelligibility in Hearing Impaired Listeners

    DEFF Research Database (Denmark)

    Scheidiger, Christoph; Jørgensen, Søren; Dau, Torsten

    2014-01-01

    Models of speech intelligibility (SI) have a long history, starting with the articulation index (AI, [17]), followed by the SI index (SII, [18]) and the speech transmission index (STI, [7]), to name only a few. However, these models fail to accurately predict SI with nonlinearly processed noisy speech, e.g., phase jitter or spectral subtraction. Recent studies predict SI for normal-hearing (NH) listeners based on a signal-to-noise ratio measure in the envelope domain (SNRenv), in the framework of the speech-based envelope power spectrum model (sEPSM, [20, 21]). These models have shown good agreement with measured data under a broad range of conditions, including stationary and modulated interferers, reverberation, and spectral subtraction. Despite the advances in modeling intelligibility in NH listeners, a broadly applicable model that can predict SI in hearing-impaired (HI) listeners...

  7. Predicting speech intelligibility in adverse conditions: evaluation of the speech-based envelope power spectrum model

    DEFF Research Database (Denmark)

    2011-01-01

    The speech-based envelope power spectrum model (sEPSM) [Jørgensen and Dau (2011). J. Acoust. Soc. Am., 130 (3), 1475–1487] estimates the envelope signal-to-noise ratio (SNRenv) of distorted speech and accurately describes the speech recognition thresholds (SRT) for normal-hearing listeners... conditions by comparing predictions to measured data from [Kjems et al. (2009). J. Acoust. Soc. Am. 126 (3), 1415-1426], where speech is mixed with four different interferers, including speech-shaped noise, bottle noise, car noise, and cafe noise. The model accounts well for the differences in intelligibility...

  8. Prediction of speech intelligibility based on an auditory preprocessing model

    DEFF Research Database (Denmark)

    Christiansen, Claus Forup Corlin; Pedersen, Michael Syskind; Dau, Torsten

    2010-01-01

    Classical speech intelligibility models, such as the speech transmission index (STI) and the speech intelligibility index (SII), are based on calculations on the physical acoustic signals. The present study predicts speech intelligibility by combining a psychoacoustically validated model of auditory preprocessing [Dau et al., 1997. J. Acoust. Soc. Am. 102, 2892-2905] with a simple central stage that describes the similarity of the test signal with the corresponding reference signal at the level of the internal representation of the signals. The model was compared with previous approaches, whereby a speech-in-noise experiment was used for training and an ideal binary mask experiment was used for evaluation. All three models were able to capture the trends in the speech-in-noise training data well, but the proposed model provides a better prediction of the binary mask test data, particularly when the binary...

  9. Robust speech features representation based on computational auditory model

    Institute of Scientific and Technical Information of China (English)

    LU Xugang; JIA Chuan; DANG Jianwu

    2004-01-01

    A speech signal processing and feature extraction method based on a computational auditory model is proposed. The computational model is based on psychological and physiological knowledge and on digital signal processing methods. For each stage of the hearing perception system, there is a corresponding computational model to simulate its function. Based on this model, speech features are extracted; at each stage, features at different levels are extracted. A further processing step for the primary auditory spectrum, based on lateral inhibition, is proposed to extract much more robust speech features. All these features can be regarded as internal representations of speech stimulation in the hearing system. Robust speech recognition experiments are conducted to test the robustness of the features. Results show that the representations based on the proposed computational auditory model are robust representations for speech signals.

  10. Automatic Speech Recognition from Neural Signals: A Focused Review

    Directory of Open Access Journals (Sweden)

    Christian Herff

    2016-09-01

    Speech interfaces have become widely accepted and are nowadays integrated in various real-life applications and devices. They have become a part of our daily life. However, speech interfaces presume the ability to produce intelligible speech, which might be impossible due to loud environments, the risk of bothering bystanders, or an inability to produce speech (i.e., patients suffering from locked-in syndrome). For these reasons it would be highly desirable to not speak but to simply envision oneself saying words or sentences. Interfaces based on imagined speech would enable fast and natural communication without the need for audible speech and would give a voice to otherwise mute people. This focused review analyzes the potential of different brain imaging techniques to recognize speech from neural signals by applying Automatic Speech Recognition technology. We argue that modalities based on metabolic processes, such as functional Near Infrared Spectroscopy and functional Magnetic Resonance Imaging, are less suited for Automatic Speech Recognition from neural signals due to low temporal resolution, but are very useful for the investigation of the underlying neural mechanisms involved in speech processes. In contrast, electrophysiologic activity is fast enough to capture speech processes and is therefore better suited for ASR. Our experimental results indicate the potential of these signals for speech recognition from neural data, with a focus on invasively measured brain activity (electrocorticography). As a first example of Automatic Speech Recognition techniques applied to neural signals, we discuss the Brain-to-text system.

  11. ARMA Modelling for Whispered Speech

    Institute of Scientific and Technical Information of China (English)

    Xue-li LI; Wei-dong ZHOU

    2010-01-01

    The Autoregressive Moving Average (ARMA) model for whispered speech is proposed. Compared with normal speech, whispered speech has no fundamental frequency because of the glottis being semi-opened and turbulent flow being created, and formant shifting exists in the lower frequency region due to the narrowing of the tract in the false vocal fold regions and weak acoustic coupling with the subglottal system. Analysis shows that the effect of the subglottal system is to introduce additional pole-zero pairs into the vocal tract transfer function. Theoretically, the method based on an ARMA process is superior to that based on an AR process in the spectral analysis of the whispered speech. Two methods, the least squared modified Yule-Walker likelihood estimate (LSMY) algorithm and the Frequency-Domain Steiglitz-Mcbride (FDSM) algorithm, are applied to the ARMA model for the whispered speech. The performance evaluation shows that the ARMA model is much more appropriate for representing the whispered speech than the AR model, and the FDSM algorithm provides a more accurate estimation of the whispered speech spectral envelope than the LSMY algorithm with higher computational complexity.

  12. System and method for characterizing voiced excitations of speech and acoustic signals, removing acoustic noise from speech, and synthesizing speech

    Energy Technology Data Exchange (ETDEWEB)

    Burnett, Greg C.; Holzrichter, John F.; Ng, Lawrence C.

    2006-02-14

    The present invention is a system and method for characterizing human (or animate) speech voiced excitation functions and acoustic signals, for removing unwanted acoustic noise which often occurs when a speaker uses a microphone in common environments, and for synthesizing personalized or modified human (or other animate) speech upon command from a controller. A low power EM sensor is used to detect the motions of windpipe tissues in the glottal region of the human speech system before, during, and after voiced speech is produced by a user. From these tissue motion measurements, a voiced excitation function can be derived. Further, the excitation function provides speech production information to enhance noise removal from human speech and it enables accurate transfer functions of speech to be obtained. Previously stored excitation and transfer functions can be used for synthesizing personalized or modified human speech. Configurations of EM sensor and acoustic microphone systems are described to enhance noise cancellation and to enable multiple articulator measurements.

  13. System and method for characterizing voiced excitations of speech and acoustic signals, removing acoustic noise from speech, and synthesizing speech

    Energy Technology Data Exchange (ETDEWEB)

    Burnett, Greg C. (Livermore, CA); Holzrichter, John F. (Berkeley, CA); Ng, Lawrence C. (Danville, CA)

    2006-08-08

    The present invention is a system and method for characterizing human (or animate) speech voiced excitation functions and acoustic signals, for removing unwanted acoustic noise which often occurs when a speaker uses a microphone in common environments, and for synthesizing personalized or modified human (or other animate) speech upon command from a controller. A low power EM sensor is used to detect the motions of windpipe tissues in the glottal region of the human speech system before, during, and after voiced speech is produced by a user. From these tissue motion measurements, a voiced excitation function can be derived. Further, the excitation function provides speech production information to enhance noise removal from human speech and it enables accurate transfer functions of speech to be obtained. Previously stored excitation and transfer functions can be used for synthesizing personalized or modified human speech. Configurations of EM sensor and acoustic microphone systems are described to enhance noise cancellation and to enable multiple articulator measurements.

  14. System and method for characterizing voiced excitations of speech and acoustic signals, removing acoustic noise from speech, and synthesizing speech

    Energy Technology Data Exchange (ETDEWEB)

    Burnett, Greg C.; Holzrichter, John F.; Ng, Lawrence C.

    2004-03-23

    The present invention is a system and method for characterizing human (or animate) speech voiced excitation functions and acoustic signals, for removing unwanted acoustic noise which often occurs when a speaker uses a microphone in common environments, and for synthesizing personalized or modified human (or other animate) speech upon command from a controller. A low power EM sensor is used to detect the motions of windpipe tissues in the glottal region of the human speech system before, during, and after voiced speech is produced by a user. From these tissue motion measurements, a voiced excitation function can be derived. Further, the excitation function provides speech production information to enhance noise removal from human speech and it enables accurate transfer functions of speech to be obtained. Previously stored excitation and transfer functions can be used for synthesizing personalized or modified human speech. Configurations of EM sensor and acoustic microphone systems are described to enhance noise cancellation and to enable multiple articulator measurements.

  15. Switched Scalar Quantization with Adaptation Performed on both the Power and the Distribution of Speech Signal

    Directory of Open Access Journals (Sweden)

    L. V. Stoimenov

    2011-11-01

    This paper analyzes models for switched scalar quantization of a source with the Laplacian and Gaussian distribution. We have analyzed the results of real telephone speech and propose a model of switched scalar quantization which, in addition to adaptation to the power of speech, includes adaptation to the distribution of the signal (Gaussian or Laplacian), resulting in a better quality of the voice signal as expressed by the signal-to-quantization-noise ratio.

  16. Approximated mutual information training for speech recognition using myoelectric signals.

    Science.gov (United States)

    Guo, Hua J; Chan, A D C

    2006-01-01

    A new training algorithm called approximated maximum mutual information (AMMI) is proposed to improve the accuracy of myoelectric speech recognition using hidden Markov models (HMMs). Previous studies have demonstrated that automatic speech recognition can be performed using myoelectric signals from articulatory muscles of the face. Classification of facial myoelectric signals can be performed using HMMs that are trained using the maximum likelihood (ML) algorithm; however, this algorithm maximizes the likelihood of the observations in the training sequence, which is not directly associated with optimal classification accuracy. The AMMI training algorithm attempts to maximize the mutual information, thereby training the HMMs to optimize their parameters for discrimination. Our results show that AMMI training consistently reduces the error rates compared to those obtained with ML training, increasing the accuracy by approximately 3% on average.

  17. The Study of Full Light Speech Signal Collection System

    Institute of Scientific and Technical Information of China (English)

    JIA Bo; JIN Yaqiu; ZHANG Wei; HU Li; YE Kunzhen

    2001-01-01

    The demodulation characteristics of 3×3 optical fiber couplers are analyzed, and their application to coherent communication systems and speech signal collection is pointed out. Experiments verify the feasibility of an all-optical speech signal collection system.

  18. Start/End Delays of Voiced and Unvoiced Speech Signals

    Energy Technology Data Exchange (ETDEWEB)

    Herrnstein, A

    1999-09-24

    Recent experiments using low power EM-radar-like sensors (e.g., GEMs) have demonstrated a new method for measuring vocal fold activity and the onset times of voiced speech, as vocal fold contact begins to take place. Similarly, the end time of a voiced speech segment can be measured. Secondly, it appears that in most normal uses of American English speech, unvoiced-speech segments directly precede or directly follow voiced-speech segments. For many applications, it is useful to know typical duration times of these unvoiced speech segments. A corpus of spoken TIMIT words, phrases, and sentences, assembled earlier and recorded from 16 male speakers with simultaneously measured acoustic and EM-sensor glottal signals, was used for this study. By inspecting the onset (or end) of unvoiced speech using the acoustic signal, and the onset (or end) of voiced speech using the EM sensor signal, the average duration times for unvoiced segments preceding the onset of vocalization were found to be 300 ms, and for following segments, 500 ms. An unvoiced speech period is then defined in time, first by using the onset of the EM-sensed glottal signal as the onset-time marker for the voiced speech segment and the end marker for the unvoiced segment. Then, by subtracting 300 ms from the onset time mark of voicing, the unvoiced speech segment start time is found. Similarly, the times for a following unvoiced speech segment can be found. While data of this nature have proven to be useful for work in our laboratory, a great deal of additional work remains to validate such data for use with general populations of users. These procedures have been useful for applying optimal processing algorithms over time segments of unvoiced, voiced, and non-speech acoustic signals. For example, these data appear to be of use in speaker validation, in vocoding, and in denoising algorithms.
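
    The timing rule itself is simple arithmetic; a minimal sketch using the average durations reported above (300 ms before voicing onset, 500 ms after voicing end) follows. The function name and interface are hypothetical.

```python
def unvoiced_windows(voiced_onset_s, voiced_end_s, pre_ms=300.0, post_ms=500.0):
    """Bracket the likely unvoiced segments around an EM-sensed voiced segment,
    using the reported average durations as defaults."""
    pre = (voiced_onset_s - pre_ms / 1000.0, voiced_onset_s)
    post = (voiced_end_s, voiced_end_s + post_ms / 1000.0)
    return pre, post

# e.g. voiced speech detected from 1.20 s to 1.85 s:
# unvoiced_windows(1.20, 1.85) -> ((0.90, 1.20), (1.85, 2.35))
```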

  19. Does Signal Degradation Affect Top-Down Processing of Speech?

    Science.gov (United States)

    Wagner, Anita; Pals, Carina; de Blecourt, Charlotte M; Sarampalis, Anastasios; Başkent, Deniz

    2016-01-01

    Speech perception is formed based on both the acoustic signal and listeners' knowledge of the world and semantic context. Access to semantic information can facilitate interpretation of degraded speech, such as speech in background noise or the speech signal transmitted via cochlear implants (CIs). This paper focuses on the latter, and investigates the time course of understanding words, and how sentential context reduces listeners' dependency on the acoustic signal for natural and degraded speech via an acoustic CI simulation. In an eye-tracking experiment we combined recordings of listeners' gaze fixations with pupillometry to capture effects of semantic information on both the time course and effort of speech processing. Normal-hearing listeners were presented with sentences with or without a semantically constraining verb (e.g., crawl) preceding the target (baby), and their ocular responses were recorded to four pictures, including the target, a phonological competitor (bay), a semantic competitor (worm), and an unrelated distractor. The results show that in natural speech, listeners' gazes reflect their uptake of acoustic information and integration of preceding semantic context. Degradation of the signal leads to a later disambiguation of phonologically similar words, and to a delay in integration of semantic information. Complementary to this, the pupil dilation data show that early semantic integration reduces the effort in disambiguating phonologically similar words. Processing degraded speech comes with increased effort due to the impoverished nature of the signal. Delayed integration of semantic information further constrains listeners' ability to compensate for inaudible signals.

  20. Histogram Equalization to Model Adaptation for Robust Speech Recognition

    Directory of Open Access Journals (Sweden)

    Hoirin Kim

    2010-01-01

    We propose a new model adaptation method based on the histogram equalization technique for providing robustness in noisy environments. The trained acoustic mean models of a speech recognizer are adapted to environmentally matched conditions by using the histogram equalization algorithm on a single-utterance basis. For more robust speech recognition in heavily noisy conditions, trained acoustic covariance models are efficiently adapted by signal-to-noise-ratio-dependent linear interpolation between trained covariance models and utterance-level sample covariance models. Speech recognition experiments on both the digit-based Aurora2 task and a large-vocabulary task showed that the proposed model adaptation approach provides significant performance improvements compared to the baseline speech recognizer trained on clean speech data.
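
    The core histogram-equalization operation can be sketched as mapping each feature dimension through its empirical CDF onto a reference distribution. This generic per-dimension version is an illustration only; the paper applies the idea to acoustic mean models on a single-utterance basis, which this sketch does not reproduce.

```python
import numpy as np

def histogram_equalize(features, reference):
    """Map each column of `features` so its empirical CDF matches `reference`."""
    out = np.empty_like(features, dtype=float)
    for d in range(features.shape[1]):
        src = features[:, d]
        ranks = np.argsort(np.argsort(src))                  # empirical CDF positions
        quantiles = (ranks + 0.5) / len(src)
        out[:, d] = np.quantile(reference[:, d], quantiles)  # inverse reference CDF
    return out
```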

  1. The effects of noise on speech and warning signals

    Science.gov (United States)

    Suter, Alice H.

    1989-06-01

    To assess the effects of noise on speech communication, it is necessary to examine certain characteristics of the speech signal. Speech level can be measured by a variety of methods, none of which has yet been standardized, and it should be kept in mind that vocal effort increases with background noise level and with different types of activity. Noise and filtering commonly degrade the speech signal, especially as it is transmitted through communications systems. Intelligibility is also adversely affected by distance, reverberation, and monaural listening. Communication systems currently in use may cause strain and delays on the part of the listener, but there are many possibilities for improvement. Individuals who need to communicate in noise may be subject to voice disorders. Shouted speech becomes progressively less intelligible at high voice levels, but improvements can be realized when talkers use clear speech. Tolerable listening levels are lower for negative than for positive S/Ns, and comfortable listening levels should be at an S/N of at least 5 dB, and preferably above 10 dB. Popular methods to predict speech intelligibility in noise include the Articulation Index, Speech Interference Level, Speech Transmission Index, and the sound level meter's A-weighting network. This report describes these methods, discussing certain advantages and disadvantages of each, and shows their interrelations.

  2. Study on adaptive compressed sensing & reconstruction of quantized speech signals

    Science.gov (United States)

    Yunyun, Ji; Zhen, Yang

    2012-12-01

    Compressed sensing (CS) has been a rising focus in recent years for its simultaneous sampling and compression of sparse signals. Speech signals can be considered approximately sparse or compressible in some domains owing to their natural characteristics. Thus, there is great promise in applying compressed sensing to speech signals. This paper is involved in three aspects. Firstly, the sparsity and sparsifying matrix for speech signals are analyzed, and a kind of adaptive sparsifying matrix based on the long-term prediction of voiced speech signals is constructed. Secondly, a CS matrix called the two-block diagonal (TBD) matrix is constructed for speech signals based on existing block diagonal matrix theory, and its performance is found to be empirically superior to that of the dense Gaussian random matrix when the sparsifying matrix is the DCT basis. Finally, we consider the quantization effect on the projections. Two corollaries about the impact of adaptive and nonadaptive quantization on reconstruction performance with two different matrices, the TBD matrix and the dense Gaussian random matrix, are derived. We find that adaptive quantization and the TBD matrix are two effective ways to mitigate the quantization effect on the reconstruction of speech signals in the framework of CS.
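
    The abstract does not spell out the TBD construction, so the following is only one plausible reading: two independent Gaussian blocks placed on the diagonal of the measurement matrix, with zeros elsewhere. The dimensions and normalization are assumptions for illustration.

```python
import numpy as np

def two_block_diagonal(m, n, seed=0):
    """A 'two-block diagonal' CS matrix under stated assumptions: two independent
    Gaussian blocks on the diagonal, zeros elsewhere."""
    rng = np.random.default_rng(seed)
    m1, n1 = m // 2, n // 2
    Phi = np.zeros((m, n))
    Phi[:m1, :n1] = rng.standard_normal((m1, n1)) / np.sqrt(m1)
    Phi[m1:, n1:] = rng.standard_normal((m - m1, n - n1)) / np.sqrt(m - m1)
    return Phi

# y = two_block_diagonal(128, 512) @ theta   # theta: DCT coefficients of a speech frame
```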

  3. Physical and perceptual factors shape the neural mechanisms that integrate audiovisual signals in speech comprehension.

    Science.gov (United States)

    Lee, HweeLing; Noppeney, Uta

    2011-08-01

    Face-to-face communication challenges the human brain to integrate information from auditory and visual senses with linguistic representations. Yet the role of bottom-up physical (spectrotemporal structure) input and top-down linguistic constraints in shaping the neural mechanisms specialized for integrating audiovisual speech signals are currently unknown. Participants were presented with speech and sinewave speech analogs in visual, auditory, and audiovisual modalities. Before the fMRI study, they were trained to perceive physically identical sinewave speech analogs as speech (SWS-S) or nonspeech (SWS-N). Comparing audiovisual integration (interactions) of speech, SWS-S, and SWS-N revealed a posterior-anterior processing gradient within the left superior temporal sulcus/gyrus (STS/STG): Bilateral posterior STS/STG integrated audiovisual inputs regardless of spectrotemporal structure or speech percept; in left mid-STS, the integration profile was primarily determined by the spectrotemporal structure of the signals; more anterior STS regions discarded spectrotemporal structure and integrated audiovisual signals constrained by stimulus intelligibility and the availability of linguistic representations. In addition to this "ventral" processing stream, a "dorsal" circuitry encompassing posterior STS/STG and left inferior frontal gyrus differentially integrated audiovisual speech and SWS signals. Indeed, dynamic causal modeling and Bayesian model comparison provided strong evidence for a parallel processing structure encompassing a ventral and a dorsal stream with speech intelligibility training enhancing the connectivity between posterior and anterior STS/STG. In conclusion, audiovisual speech comprehension emerges in an interactive process with the integration of auditory and visual signals being progressively constrained by stimulus intelligibility along the STS and spectrotemporal structure in a dorsal fronto-temporal circuitry.

  4. Effects of manipulating the signal-to-noise envelope power ratio on speech intelligibility

    DEFF Research Database (Denmark)

    Jørgensen, Søren; Decorsière, Remi Julien Blaise; Dau, Torsten

    2015-01-01

    Jørgensen and Dau [(2011). J. Acoust. Soc. Am. 130, 1475–1487] suggested a metric for speech intelligibility prediction based on the signal-to-noise envelope power ratio (SNRenv), calculated at the output of a modulation-frequency selective process. In the framework of the speech-based envelope power spectrum model (sEPSM), the SNRenv was demonstrated to account for speech intelligibility data in various conditions with linearly and nonlinearly processed noisy speech, as well as for conditions with stationary and fluctuating interferers. Here, the relation between the SNRenv and speech intelligibility was investigated further by systematically varying the modulation power of either the speech or the noise before mixing the two components, while keeping the overall power ratio of the two components constant. A good correspondence between the data and the corresponding sEPSM predictions...

  5. Robust Speech Recognition Using a Harmonic Model

    Institute of Scientific and Technical Information of China (English)

    许超; 曹志刚

    2004-01-01

    Automatic speech recognition under conditions of a noisy environment remains a challenging problem. Traditionally, methods focused on noise structure, such as spectral subtraction, have been employed to address this problem, and thus the performance of such methods depends on the accuracy of noise estimation. In this paper, an alternative method, using a harmonic-based spectral reconstruction algorithm, is proposed to enhance robust automatic speech recognition. Neither noise estimation nor noise-model training is required in the proposed approach. A spectral subtraction integrated autocorrelation function is proposed to determine the pitch for the harmonic model. Recognition results show that the harmonic-based spectral reconstruction approach outperforms spectral subtraction in the middle and low signal-to-noise ratio (SNR) ranges. The advantage of the proposed method is more manifest for non-stationary noise, as the algorithm does not require an assumption of stationary noise.
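
    One way to picture the "spectral subtraction integrated autocorrelation" pitch step: subtract an average noise magnitude spectrum, resynthesize the frame with the noisy phase, and pick the pitch from the autocorrelation peak. This is a hedged sketch, not the paper's exact procedure; the noise magnitude is assumed to be estimated elsewhere (e.g., from leading non-speech frames).

```python
import numpy as np

def pitch_after_spectral_subtraction(frame, noise_mag, fs, fmin=60.0, fmax=400.0):
    """Pitch from the autocorrelation of a spectrally subtracted frame.
    noise_mag must have len(frame)//2 + 1 bins (rfft resolution)."""
    spec = np.fft.rfft(frame * np.hanning(len(frame)))
    mag = np.maximum(np.abs(spec) - noise_mag, 0.0)          # half-wave rectified
    clean = np.fft.irfft(mag * np.exp(1j * np.angle(spec)), n=len(frame))
    ac = np.correlate(clean, clean, mode='full')[len(clean) - 1:]
    lo, hi = int(fs / fmax), int(fs / fmin)                  # plausible pitch lag range
    return fs / (lo + int(np.argmax(ac[lo:hi])))
```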

  6. The concept of signal-to-noise ratio in the modulation domain and speech intelligibility.

    Science.gov (United States)

    Dubbelboer, Finn; Houtgast, Tammo

    2008-12-01

    A new concept is proposed that relates to intelligibility of speech in noise. The concept combines traditional estimations of signal-to-noise ratios (S/N) with elements from the modulation transfer function model, which results in the definition of the signal-to-noise ratio in the modulation domain: the (S/N)mod. It is argued that this (S/N)mod, quantifying the strength of speech modulations relative to a floor of spurious modulations arising from the speech-noise interaction, is the key factor in relation to speech intelligibility. It is shown that, by using a specific test signal, the strength of these spurious modulations can be measured, allowing an estimation of the (S/N)mod for various conditions of additive noise, noise suppression, and amplitude compression. By relating these results to intelligibility data for these same conditions, the relevance of the (S/N)mod as the key factor underlying speech intelligibility is clearly illustrated. For instance, it is shown that the commonly observed limited effect of noise suppression on speech intelligibility is correctly "predicted" by the (S/N)mod, whereas traditional measures such as the speech transmission index, considering only the changes in the speech modulations, fall short in this respect. It is argued that the (S/N)mod may provide a relevant tool in the design of successful noise-suppression systems.

  7. Effect of continuous speech and non-speech signals on stuttering frequency in adults who stutter.

    Science.gov (United States)

    Dayalu, Vikram N; Guntupalli, Vijaya K; Kalinowski, Joseph; Stuart, Andrew; Saltuklaroglu, Tim; Rastatter, Michael P

    2011-10-01

    The inhibitory effects of continuously presented audio signals (/a/, /s/, 1,000 Hz pure tone) on stuttering were examined. Eleven adults who stutter participated. Participants read four 300-syllable passages (i.e., in the presence and absence of the audio signals). All of the audio signals induced a significant reduction in stuttering frequency relative to the control condition (P = 0.005). A significantly greater reduction in stuttering occurred in the /a/ condition (P < 0.05). These findings are consistent with the notion that the percept of a second signal as speech or non-speech can respectively augment or attenuate the potency for reducing stuttering frequency.

  8. Automatic speech signal segmentation based on the innovation adaptive filter

    Directory of Open Access Journals (Sweden)

    Makowski Ryszard

    2014-06-01

    Speech segmentation is an essential stage in designing automatic speech recognition systems, and one can find several algorithms proposed in the literature. It is a difficult problem, as speech is immensely variable. The aim of the authors' studies was to design an algorithm that could be employed at the stage of automatic speech recognition. This would make it possible to avoid some problems related to speech signal parametrization. Posing the problem in such a way requires the algorithm to be capable of working in real time. The only such algorithm was proposed by Tyagi et al. (2006), and it is a modified version of Brandt's algorithm. The article presents a new algorithm for unsupervised automatic speech signal segmentation. It performs segmentation without access to information about the phonetic content of the utterances, relying exclusively on second-order statistics of a speech signal. The starting point for the proposed method is the time-varying Schur coefficients of an innovation adaptive filter. The Schur algorithm is known to be fast, precise, stable and capable of rapidly tracking changes in second-order signal statistics. A transition from one phoneme to another in the speech signal always indicates a change in signal statistics caused by vocal tract changes. In order to allow for the properties of human hearing, detection of inter-phoneme boundaries is performed based on statistics defined on the mel spectrum determined from the reflection coefficients. The paper presents the structure of the algorithm, defines its properties, lists parameter values, describes detection efficiency results, and compares them with those for another algorithm. The obtained segmentation results are satisfactory.
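
    To give a flavor of reflection-coefficient-based boundary detection, the sketch below computes frame-wise reflection (Schur/PARCOR) coefficients via the Levinson-Durbin recursion and flags candidate boundaries where they change sharply between adjacent frames. This is a simplified offline illustration; the paper's algorithm uses adaptively tracked time-varying Schur coefficients and mel-spectrum statistics, neither of which is reproduced here.

```python
import numpy as np

def reflection_coeffs(frame, order=10):
    """Reflection (PARCOR/Schur) coefficients via the Levinson-Durbin recursion."""
    r = np.correlate(frame, frame, mode='full')[len(frame) - 1:len(frame) + order]
    a = np.zeros(order + 1); a[0] = 1.0
    e = r[0]
    k = np.zeros(order)
    for i in range(1, order + 1):
        acc = r[i] + a[1:i] @ r[1:i][::-1]
        k[i - 1] = -acc / e
        a[1:i + 1] = a[1:i + 1] + k[i - 1] * np.concatenate([a[1:i][::-1], [1.0]])
        e *= (1.0 - k[i - 1] ** 2)
    return k

def boundary_scores(x, fs, frame_ms=20, hop_ms=10, order=10):
    """Distance between reflection coefficients of adjacent frames;
    peaks suggest phoneme transitions."""
    flen, hop = int(fs * frame_ms / 1000), int(fs * hop_ms / 1000)
    ks = [reflection_coeffs(x[i:i + flen] * np.hanning(flen), order)
          for i in range(0, len(x) - flen, hop)]
    return np.array([np.linalg.norm(ks[i] - ks[i - 1]) for i in range(1, len(ks))])
```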

  9. Automatic Speech Signal Analysis for Clinical Diagnosis and Assessment of Speech Disorders

    CERN Document Server

    Baghai-Ravary, Ladan

    2013-01-01

    Automatic Speech Signal Analysis for Clinical Diagnosis and Assessment of Speech Disorders provides a survey of methods designed to aid clinicians in the diagnosis and monitoring of speech disorders such as dysarthria and dyspraxia, with an emphasis on the signal processing techniques, statistical validity of the results presented in the literature, and the appropriateness of methods that do not require specialized equipment, rigorously controlled recording procedures or highly skilled personnel to interpret results. Such techniques offer the promise of a simple and cost-effective, yet objective, assessment of a range of medical conditions, which would be of great value to clinicians. The ideal scenario would begin with the collection of examples of the clients’ speech, either over the phone or using portable recording devices operated by non-specialist nursing staff. The recordings could then be analyzed initially to aid diagnosis of conditions, and subsequently to monitor the clients’ progress and res...

  10. MMSE based Noise Tracking and Echo Cancellation of Speech Signals

    Directory of Open Access Journals (Sweden)

    Praveen. N

    2014-03-01

    In recent years there have been many studies on automatic audio classification and segmentation using a variety of features and techniques. The signal recorded at a microphone in a room contains the direct arrival of the sound from the source as well as multiple weaker copies of the same signal created by sound reflections off the room walls; this is known as acoustic reverberation. Because exact tracking of a target in a three-dimensional environment is complicated, in many applications such as source localization and speech recognition the reverberant patterns imposed by the environment are undesirable. In an auditorium, the common disturbances affecting public speech are noise and acoustic echo. The MMSE (Minimum Mean Square Error) estimator is one of the algorithms proposed for the removal of additive background noise: a single-channel speech enhancement technique for enhancing speech degraded by additive background noise. Background noise can affect conversation in a noisy environment such as a street or a car, or when sending speech from the cockpit of an airplane to the ground or to the cabin, and it can degrade both the quality and the intelligibility of speech. Over time, spectral subtraction has undergone many modifications. This is a review paper whose objective is to provide an overview of the MMSE estimators proposed for enhancement of speech degraded by additive background noise during the past decades.
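
    A minimal sketch of this family of single-channel enhancers is given below; it estimates the noise spectrum from an assumed speech pause at the start of the recording (a crude stand-in for a VAD) and applies a Wiener-type gain per frequency bin, which is one simple member of the MMSE/spectral-subtraction family rather than the exact estimator reviewed here.

        import numpy as np
        from scipy.signal import stft, istft

        def enhance(noisy, fs, noise_seconds=0.25, floor=0.05):
            f, t, X = stft(noisy, fs, nperseg=512)
            # noise PSD from the assumed speech-free leading frames
            n_noise = np.searchsorted(t, noise_seconds)
            noise_psd = np.mean(np.abs(X[:, :n_noise]) ** 2, axis=1, keepdims=True)
            snr = np.maximum(np.abs(X) ** 2 / (noise_psd + 1e-12) - 1.0, 0.0)
            gain = np.maximum(snr / (snr + 1.0), floor)  # Wiener gain with a floor
            _, y = istft(gain * X, fs, nperseg=512)
            return y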

  11. Speech Signals Measurements Sequence Modeling in Compressed Sensing Based on GEP

    Institute of Scientific and Technical Information of China (English)

    郭海亮

    2015-01-01

    In order to further reduce the amount of data transmitted for the measurement sequences of speech signals in compressed sensing, a Gene Expression Programming (GEP) algorithm is used to model and predict the measurement sequence of the speech signal, and a compressed sensing framework incorporating this modeling and prediction step is introduced to solve the problem. First, the relevant characteristics of the measurement sequences of speech signals in compressed sensing are analyzed. Then, using the GEP algorithm, an accurate nonlinear model structure is built for the measurement sequences, and finally the original speech signal is reconstructed. Experimental results show that the algorithm can further reduce the amount of measurement data to be transmitted while preserving the quality of the reconstructed speech, achieving a secondary compression of the speech signal.

  12. Issues in acoustic modeling of speech for automatic speech recognition

    OpenAIRE

    Gong, Yifan; Haton, Jean-Paul; Mari, Jean-François

    1994-01-01

    Projet RFIA; Stochastic modeling is a flexible method for handling the large variability in speech for recognition applications. In contrast to dynamic time warping where heuristic training methods for estimating word templates are used, stochastic modeling allows a probabilistic and automatic training for estimating models. This paper deals with the improvement of stochastic techniques, especially for a better representation of time varying phenomena.

  13. Signal-to-Signal Ratio Independent Speaker Identification for Co-channel Speech Signals

    DEFF Research Database (Denmark)

    Saeidi, Rahim; Mowlaee, Pejman; Kinnunen, Tomi

    2010-01-01

    In this paper, we consider speaker identification for the co-channel scenario in which speech mixture from speakers is recorded by one microphone only. The goal is to identify both of the speakers from their mixed signal. High recognition accuracies have already been reported when an accurately...

  14. The effects of temporal modification of second speech signals on stuttering inhibition at two speech rates in adults.

    Science.gov (United States)

    Guntupalli, Vijaya K; Kalinowski, Joseph; Saltuklaroglu, Tim; Nanjundeswaran, Chayadevie

    2005-09-02

    The recovery of 'gestural' speech information via the engagement of mirror neurons has been suggested to be the key agent in stuttering inhibition during the presentation of exogenous second speech signals. Based on this hypothesis, we expect the amount of stuttering inhibition to depend on the ease of recovery of exogenous speech gestures. To examine this possibility, linguistically non-congruent second speech signals were temporally compressed and expanded in two experiments. In Experiment 1, 12 participants who stutter read passages aloud at normal and fast speech rates while listening to second speech signals that were 0, 40, and 80% compressed, and 40 and 80% expanded. Except for the 80% compressed speech signal, all other stimuli induced significant stuttering inhibition relative to the control condition. The 80% compressed speech signal was the first exogenously presented speech signal that failed to significantly reduce stuttering frequency by the 60-70% that has been the case in our research over the years. It was hypothesized that at a compression ratio of 80%, exogenous speech signals generated too many gestures per unit time to allow for adequate gestural recovery via mirror neurons. However, considering that the 80% compressed signal was also highly unintelligible, a second experiment was conducted to further examine whether the effects of temporal compression on stuttering inhibition are mediated by speech intelligibility. In Experiment 2, 10 participants who stutter read passages at a normal rate while listening to linguistically non-congruent second speech signals that were compressed by 0, 20, 40, 60, and 80%. Results revealed that 0 and 20% compressed speech signals induced approximately 52% stuttering inhibition. In contrast, compression ratios of 40% and beyond induced only 27% stuttering inhibition, although the 40 and 60% compressed signals were perceptually intelligible. Our findings suggest that recovery of gestural information is affected by temporal

  15. Behavioral Signal Processing: Deriving Human Behavioral Informatics From Speech and Language: Computational techniques are presented to analyze and model expressed and perceived human behavior-variedly characterized as typical, atypical, distressed, and disordered-from speech and language cues and their applications in health, commerce, education, and beyond.

    Science.gov (United States)

    Narayanan, Shrikanth; Georgiou, Panayiotis G

    2013-02-01

    The expression and experience of human behavior are complex and multimodal and characterized by individual and contextual heterogeneity and variability. Speech and spoken language communication cues offer an important means for measuring and modeling human behavior. Observational research and practice across a variety of domains from commerce to healthcare rely on speech- and language-based informatics for crucial assessment and diagnostic information and for planning and tracking response to an intervention. In this paper, we describe some of the opportunities as well as emerging methodologies and applications of human behavioral signal processing (BSP) technology and algorithms for quantitatively understanding and modeling typical, atypical, and distressed human behavior with a specific focus on speech- and language-based communicative, affective, and social behavior. We describe the three important BSP components of acquiring behavioral data in an ecologically valid manner across laboratory to real-world settings, extracting and analyzing behavioral cues from measured data, and developing models offering predictive and decision-making support. We highlight both the foundational speech and language processing building blocks as well as the novel processing and modeling opportunities. Using examples drawn from specific real-world applications ranging from literacy assessment and autism diagnostics to psychotherapy for addiction and marital well being, we illustrate behavioral informatics applications of these signal processing techniques that contribute to quantifying higher level, often subjectively described, human behavior in a domain-sensitive fashion.

  16. An implementation of anger detection in speech signals

    NARCIS (Netherlands)

    Mohamoud, A.A.; Maris, M.G.

    2008-01-01

    In this paper, an emotion classification system based on speech signals is presented. The classifier can identify the most common emotions, namely anger, neutral, happiness and fear. The algorithm computes a number of acoustic features which are fed into the classifier based on a pattern recognition

  17. Predicting speech intelligibility based on the signal-to-noise envelope power ratio after modulation-frequency selective processing.

    Science.gov (United States)

    Jørgensen, Søren; Dau, Torsten

    2011-09-01

    A model for predicting the intelligibility of processed noisy speech is proposed. The speech-based envelope power spectrum model has a similar structure as the model of Ewert and Dau [(2000). J. Acoust. Soc. Am. 108, 1181-1196], developed to account for modulation detection and masking data. The model estimates the speech-to-noise envelope power ratio, SNRenv, at the output of a modulation filterbank and relates this metric to speech intelligibility using the concept of an ideal observer. Predictions were compared to data on the intelligibility of speech presented in stationary speech-shaped noise. The model was further tested in conditions with noisy speech subjected to reverberation and spectral subtraction. Good agreement between predictions and data was found in all cases. For spectral subtraction, an analysis of the model's internal representation of the stimuli revealed that the predicted decrease of intelligibility was caused by the estimated noise envelope power exceeding that of the speech. The classical concept of the speech transmission index fails in this condition. The results strongly suggest that the signal-to-noise ratio at the output of a modulation frequency selective process provides a key measure of speech intelligibility.
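
    A simplified sketch of the SNRenv computation is shown below, assuming the noise signal is available separately; the modulation band centers, filter orders, and across-band combination are illustrative choices in the spirit of the model, not its published constants.

        import numpy as np
        from scipy.signal import hilbert, butter, filtfilt

        def envelope(x, fs, fs_env=400):
            """Hilbert envelope, low-pass filtered and downsampled to fs_env."""
            env = np.abs(hilbert(x))
            b, a = butter(2, (fs_env / 2) / (fs / 2), btype="low")
            step = int(fs // fs_env)
            return filtfilt(b, a, env)[::step], fs / step

        def band_power(env, fs_env, fc):
            """AC envelope power in an octave band around fc, normalized by DC."""
            b, a = butter(2, [fc / np.sqrt(2) / (fs_env / 2),
                              fc * np.sqrt(2) / (fs_env / 2)], btype="band")
            mod = filtfilt(b, a, env - np.mean(env))
            return np.mean(mod ** 2) / (np.mean(env) ** 2 + 1e-12)

        def snr_env_db(noisy_speech, noise, fs, centers=(1, 2, 4, 8, 16, 32, 64)):
            env_sn, fse = envelope(noisy_speech, fs)
            env_n, _ = envelope(noise, fs)
            total = 0.0
            for fc in centers:
                p_sn = band_power(env_sn, fse, fc)
                p_n = band_power(env_n, fse, fc)
                snr = max(p_sn - p_n, 1e-4) / (p_n + 1e-12)  # excess speech envelope power
                total += snr ** 2
            return 10 * np.log10(np.sqrt(total))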

  18. Research on Auditory Model and Speech Signal Processing Method

    Institute of Scientific and Technical Information of China (English)

    刘婧婕; 张刚; 武淑红

    2012-01-01

    In the field of communication, speech coding is an important branch of speech signal processing. To be suitable for channel transmission, speech must be transformed in form while preserving the information-bearing signal as far as possible. In today's communication and computer-network applications, speech coding algorithms with low delay and low bit rate play a decisive role. In speech coding, linear prediction analysis is the key technique, applied mainly in perceptual weighting filters, synthesis filters, and logarithmic gain filters. The work presented here proposes a hybrid LPC coefficient (Auditory-Acoustic-Hybrid-LPC) that combines acoustic characteristics with auditory characteristics in order to improve the perceived quality of the synthesized speech after coding, which is of positive significance for research on coding algorithms.

  19. Pronunciation Modeling for Large Vocabulary Speech Recognition

    Science.gov (United States)

    Kantor, Arthur

    2010-01-01

    The large pronunciation variability of words in conversational speech is one of the major causes of low accuracy in automatic speech recognition (ASR). Many pronunciation modeling approaches have been developed to address this problem. Some explicitly manipulate the pronunciation dictionary as well as the set of the units used to define the…

  1. Tools for signal compression applications to speech and audio coding

    CERN Document Server

    Moreau, Nicolas

    2013-01-01

    This book presents tools and algorithms required to compress/uncompress signals such as speech and music. These algorithms are largely used in mobile phones, DVD players, HDTV sets, etc. In a first rather theoretical part, this book presents the standard tools used in compression systems: scalar and vector quantization, predictive quantization, transform quantization, entropy coding. In particular we show the consistency between these different tools. The second part explains how these tools are used in the latest speech and audio coders. The third part gives Matlab programs simulating t

  2. Speech Repairs, Intonational Boundaries and Discourse Markers Modeling Speakers' Utterances in Spoken Dialog

    CERN Document Server

    Heeman, P A

    1999-01-01

    In this thesis, we present a statistical language model for resolving speech repairs, intonational boundaries and discourse markers. Rather than finding the best word interpretation for an acoustic signal, we redefine the speech recognition problem so that it also identifies the POS tags, discourse markers, speech repairs and intonational phrase endings (a major cue in determining utterance units). Adding these extra elements to the speech recognition problem actually allows it to better predict the words involved, since we are able to make use of the predictions of boundary tones, discourse markers and speech repairs to better account for what word will occur next. Furthermore, we can take advantage of acoustic information, such as silence information, which tends to co-occur with speech repairs and intonational phrase endings, that current language models can only regard as noise in the acoustic signal. The output of this language model is a much fuller account of the speaker's turn, with part-of-speech ...

  3. System And Method For Characterizing Voiced Excitations Of Speech And Acoustic Signals, Removing Acoustic Noise From Speech, And Synthesizi

    Energy Technology Data Exchange (ETDEWEB)

    Burnett, Greg C. (Livermore, CA); Holzrichter, John F. (Berkeley, CA); Ng, Lawrence C. (Danville, CA)

    2006-04-25

    The present invention is a system and method for characterizing human (or animate) speech voiced excitation functions and acoustic signals, for removing unwanted acoustic noise which often occurs when a speaker uses a microphone in common environments, and for synthesizing personalized or modified human (or other animate) speech upon command from a controller. A low power EM sensor is used to detect the motions of windpipe tissues in the glottal region of the human speech system before, during, and after voiced speech is produced by a user. From these tissue motion measurements, a voiced excitation function can be derived. Further, the excitation function provides speech production information to enhance noise removal from human speech and it enables accurate transfer functions of speech to be obtained. Previously stored excitation and transfer functions can be used for synthesizing personalized or modified human speech. Configurations of EM sensor and acoustic microphone systems are described to enhance noise cancellation and to enable multiple articulator measurements.

  4. Robust time delay estimation for speech signals using information theory: A comparison study

    Directory of Open Access Journals (Sweden)

    Wen Fei

    2011-01-01

    Time delay estimation (TDE) is a fundamental subsystem for a speaker localization and tracking system. Most of the traditional TDE methods are based on second-order statistics (SOS) under a Gaussian assumption for the source. This article resolves the TDE problem using two information-theoretic measures, joint entropy and mutual information (MI), which can be considered to indirectly include higher-order statistics (HOS). The TDE solutions using the two measures are presented for both Gaussian and Laplacian models. We show that, for stationary signals, the two measures are equivalent for TDE. However, for non-stationary signals (e.g., noisy speech signals), maximizing MI gives a more consistent estimate than minimizing joint entropy. Moreover, an existing idea of using modified MI to embed information about reverberation is generalized to the multiple-microphone case. From the experimental results for speech signals, this scheme with the Gaussian model shows the most robust performance in various noisy and reverberant environments.
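
    The following sketch estimates a delay by maximizing mutual information over candidate lags, using a 2-D histogram MI estimator as a simple stand-in for the parametric Gaussian/Laplacian formulations discussed in the article.

        import numpy as np

        def mutual_information(x, y, bins=32):
            """Histogram estimate of MI between two sample sequences."""
            pxy, _, _ = np.histogram2d(x, y, bins=bins)
            pxy /= pxy.sum()
            px = pxy.sum(axis=1, keepdims=True)
            py = pxy.sum(axis=0, keepdims=True)
            nz = pxy > 0
            return np.sum(pxy[nz] * np.log(pxy[nz] / (px @ py)[nz]))

        def estimate_delay(x1, x2, max_lag):
            """Scan candidate lags in samples; return the MI-maximizing one."""
            lags = list(range(-max_lag, max_lag + 1))
            scores = []
            for d in lags:
                if d >= 0:
                    a, b = x1[d:], x2[:len(x2) - d or None]
                else:
                    a, b = x1[:d], x2[-d:]
                n = min(len(a), len(b))
                scores.append(mutual_information(a[:n], b[:n]))
            return lags[int(np.argmax(scores))]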

  5. A New Method to Represent Speech Signals Via Predefined Signature and Envelope Sequences

    Directory of Open Access Journals (Sweden)

    Yarman Binboga Sıddık

    2007-01-01

    A novel systematic procedure referred to as "SYMPES" to model speech signals is introduced. The structure of SYMPES is based on the creation of the so-called predefined "signature S={SR(n)} and envelope E={EK(n)}" sets. These sets are speaker and language independent. Once the speech signals are divided into frames with selected lengths, then each frame sequence Xi(n) is reconstructed by means of the mathematical form Xi(n)=Ci·EK(n)·SR(n). In this representation, Ci is called the gain factor, and SR(n) and EK(n) are properly assigned from the predefined signature and envelope sets, respectively. Examples are given to exhibit the implementation of SYMPES. It is shown that for the same compression ratio or better, SYMPES yields considerably better speech quality over commercially available coders such as G.726 (ADPCM) at 16 kbps and voice-excited LPC-10E (FS1015) at 2.4 kbps.
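
    A toy Python illustration of the frame model Xi(n)=Ci·EK(n)·SR(n) follows; the random "predefined" sets are placeholders (in SYMPES they are trained offline and shared by encoder and decoder), and the exhaustive search is only meant to make the gain/signature/envelope assignment concrete.

        import numpy as np

        rng = np.random.default_rng(0)
        FRAME = 128
        SIGNATURES = rng.standard_normal((16, FRAME))        # S_R(n) patterns
        ENVELOPES = np.abs(rng.standard_normal((8, FRAME)))  # E_K(n) >= 0

        def encode_frame(x):
            """Exhaustively pick (R, K, C) minimizing the reconstruction error."""
            best = None
            for r, s in enumerate(SIGNATURES):
                for k, e in enumerate(ENVELOPES):
                    basis = e * s
                    c = np.dot(x, basis) / np.dot(basis, basis)  # least-squares gain
                    err = np.sum((x - c * basis) ** 2)
                    if best is None or err < best[0]:
                        best = (err, r, k, c)
            _, r, k, c = best
            return r, k, c  # only three indices/values are transmitted

        def decode_frame(r, k, c):
            return c * ENVELOPES[k] * SIGNATURES[r]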

  6. A Generative Model of Speech Production in Broca's and Wernicke's Areas.

    Science.gov (United States)

    Price, Cathy J; Crinion, Jenny T; Macsweeney, Mairéad

    2011-01-01

    Speech production involves the generation of an auditory signal from the articulators and vocal tract. When the intended auditory signal does not match the produced sounds, subsequent articulatory commands can be adjusted to reduce the difference between the intended and produced sounds. This requires an internal model of the intended speech output that can be compared to the produced speech. The aim of this functional imaging study was to identify brain activation related to the internal model of speech production after activation related to vocalization, auditory feedback, and movement in the articulators had been controlled. There were four conditions: silent articulation of speech, non-speech mouth movements, finger tapping, and visual fixation. In the speech conditions, participants produced the mouth movements associated with the words "one" and "three." We eliminated auditory feedback from the spoken output by instructing participants to articulate these words without producing any sound. The non-speech mouth movement conditions involved lip pursing and tongue protrusions to control for movement in the articulators. The main difference between our speech and non-speech mouth movement conditions is that prior experience producing speech sounds leads to the automatic and covert generation of auditory and phonological associations that may play a role in predicting auditory feedback. We found that, relative to non-speech mouth movements, silent speech activated Broca's area in the left dorsal pars opercularis and Wernicke's area in the left posterior superior temporal sulcus. We discuss these results in the context of a generative model of speech production and propose that Broca's and Wernicke's areas may be involved in predicting the speech output that follows articulation. These predictions could provide a mechanism by which rapid movement of the articulators is precisely matched to the intended speech outputs during future articulations.

  7. A generative model of speech production in Broca’s and Wernicke’s areas

    Directory of Open Access Journals (Sweden)

    Cathy J Price

    2011-09-01

    Speech production involves the generation of an auditory signal from the articulators and vocal tract. When the intended auditory signal does not match the produced sounds, subsequent articulatory commands can be adjusted to reduce the difference between the intended and produced sounds. This requires an internal model of the intended speech output that can be compared to the produced speech. The aim of this functional imaging study was to identify brain activation related to the internal model of speech production after activation related to vocalisation, auditory feedback and movement in the articulators had been controlled. There were four conditions: silent articulation of speech, non-speech mouth movements, finger tapping and visual fixation. In the speech conditions, participants produced the mouth movements associated with the words "one" and "three". We eliminated auditory feedback from the spoken output by instructing participants to articulate these words without producing any sound. The non-speech mouth movement conditions involved lip pursing and tongue protrusions to control for movement in the articulators. The main difference between our speech and non-speech mouth movement conditions is that prior experience producing speech sounds leads to the automatic and covert generation of auditory and phonological associations that may play a role in predicting auditory feedback. We found that, relative to non-speech mouth movements, silent speech activated Broca's area in the left dorsal pars opercularis and Wernicke's area in the left posterior superior temporal sulcus. We discuss these results in the context of a generative model of speech production and propose that Broca's and Wernicke's areas may be involved in predicting the speech output that follows articulation. These predictions could provide a mechanism by which rapid movement of the articulators is precisely matched to the intended speech outputs during future articulations.

  8. EMOTION RECOGNITION FROM SPEECH SIGNAL: REALIZATION AND AVAILABLE TECHNIQUES

    Directory of Open Access Journals (Sweden)

    NILIM JYOTI GOGOI

    2014-05-01

    The ability to detect human emotion from speech would be a great addition to the field of human-robot interaction. The aim of this work is to build an emotion recognition system using Mel-frequency cepstral coefficients (MFCC) and a Gaussian mixture model (GMM) classifier. The purpose of the work is to describe the best available methods for recognizing emotion from emotional speech. To that end, existing techniques and methods for feature extraction and pattern classification are reviewed and discussed in this paper.
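
    A compact sketch of the MFCC + GMM recipe is given below, assuming librosa and scikit-learn are available; train_files, a mapping from emotion labels to lists of WAV paths, is a hypothetical data layout.

        import numpy as np
        import librosa
        from sklearn.mixture import GaussianMixture

        def mfcc_frames(path, n_mfcc=13):
            y, sr = librosa.load(path, sr=16000)
            return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).T  # (frames, n_mfcc)

        def train(train_files, n_components=8):
            """One GMM per emotion, fit on pooled frame-level MFCCs."""
            models = {}
            for emotion, paths in train_files.items():
                feats = np.vstack([mfcc_frames(p) for p in paths])
                models[emotion] = GaussianMixture(n_components).fit(feats)
            return models

        def classify(models, path):
            """Pick the emotion whose GMM gives the highest mean log-likelihood."""
            feats = mfcc_frames(path)
            return max(models, key=lambda e: models[e].score(feats))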

  9. Speech Therapy for Children with Cleft Lip and Palate Using a Community-Based Speech Therapy Model with Speech Assistants.

    Science.gov (United States)

    Makarabhirom, Kalyanee; Prathanee, Benjamas; Suphawatjariyakul, Ratchanee; Yoodee, Phanomwan

    2015-08-01

    To evaluate speech services using a Community-Based Speech Therapy model with trained speech assistants (SAs) for the improvement of articulation in children with cleft palate. Seventeen children with repaired cleft palates who lived in Chiang Rai and Phayao provinces were registered for the camp. They received speech therapy in a 4-day intensive camp and five follow-up camps at Chiang Rai's Young Men's Christian Association (YMCA). Eight speech assistants (SAs) were trained to correct articulation errors with specific modeling by the speech-language pathologists (SLPs). SAs encouraged family members to stimulate their children every day with speech exercises at home. Each camp covered the main speech therapy and other services supported by the multidisciplinary team, as well as discussion among SLPs, SAs, and the caregivers regarding feedback and difficulties. The results showed this to be a sufficient method for treating persistent speech disorders associated with cleft palate. Perceptual analyses showed significant improvement of misarticulated sounds at both word and sentence levels after the speech camp (mean difference = 1.5, 95% confidence interval = 0.5-2.5, p-value < 0.05). The Community-Based Speech Therapy model is a valid and efficient method for providing speech therapy to children with cleft palate.

  10. Behavioral Signal Processing: Deriving Human Behavioral Informatics From Speech and Language

    Science.gov (United States)

    Narayanan, Shrikanth; Georgiou, Panayiotis G.

    2013-01-01

    The expression and experience of human behavior are complex and multimodal and characterized by individual and contextual heterogeneity and variability. Speech and spoken language communication cues offer an important means for measuring and modeling human behavior. Observational research and practice across a variety of domains from commerce to healthcare rely on speech- and language-based informatics for crucial assessment and diagnostic information and for planning and tracking response to an intervention. In this paper, we describe some of the opportunities as well as emerging methodologies and applications of human behavioral signal processing (BSP) technology and algorithms for quantitatively understanding and modeling typical, atypical, and distressed human behavior with a specific focus on speech- and language-based communicative, affective, and social behavior. We describe the three important BSP components of acquiring behavioral data in an ecologically valid manner across laboratory to real-world settings, extracting and analyzing behavioral cues from measured data, and developing models offering predictive and decision-making support. We highlight both the foundational speech and language processing building blocks as well as the novel processing and modeling opportunities. Using examples drawn from specific real-world applications ranging from literacy assessment and autism diagnostics to psychotherapy for addiction and marital well being, we illustrate behavioral informatics applications of these signal processing techniques that contribute to quantifying higher level, often subjectively described, human behavior in a domain-sensitive fashion. PMID:24039277

  11. Digital speech processing using Matlab

    CERN Document Server

    Gopi, E S

    2014-01-01

    Digital Speech Processing Using Matlab deals with digital speech pattern recognition, speech production model, speech feature extraction, and speech compression. The book is written in a manner that is suitable for beginners pursuing basic research in digital speech processing. Matlab illustrations are provided for most topics to enable better understanding of concepts. This book also deals with the basic pattern recognition techniques (illustrated with speech signals using Matlab) such as PCA, LDA, ICA, SVM, HMM, GMM, BPN, and KSOM.

  12. Modeling the Development of Audiovisual Cue Integration in Speech Perception

    Science.gov (United States)

    Getz, Laura M.; Nordeen, Elke R.; Vrabic, Sarah C.; Toscano, Joseph C.

    2017-01-01

    Adult speech perception is generally enhanced when information is provided from multiple modalities. In contrast, infants do not appear to benefit from combining auditory and visual speech information early in development. This is true despite the fact that both modalities are important to speech comprehension even at early stages of language acquisition. How then do listeners learn how to process auditory and visual information as part of a unified signal? In the auditory domain, statistical learning processes provide an excellent mechanism for acquiring phonological categories. Is this also true for the more complex problem of acquiring audiovisual correspondences, which require the learner to integrate information from multiple modalities? In this paper, we present simulations using Gaussian mixture models (GMMs) that learn cue weights and combine cues on the basis of their distributional statistics. First, we simulate the developmental process of acquiring phonological categories from auditory and visual cues, asking whether simple statistical learning approaches are sufficient for learning multi-modal representations. Second, we use this time course information to explain audiovisual speech perception in adult perceivers, including cases where auditory and visual input are mismatched. Overall, we find that domain-general statistical learning techniques allow us to model the developmental trajectory of audiovisual cue integration in speech, and in turn, allow us to better understand the mechanisms that give rise to unified percepts based on multiple cues. PMID:28335558

  13. Fundamental Frequency Estimation of the Speech Signal Compressed by MP3 Algorithm Using PCC Interpolation

    Directory of Open Access Journals (Sweden)

    MILIVOJEVIC, Z. N.

    2010-02-01

    In this paper the fundamental frequency estimation results for the MP3-modeled speech signal are analyzed. The estimation of the fundamental frequency was performed by the Picking-Peaks algorithm with the implemented Parametric Cubic Convolution (PCC) interpolation. The efficiency of PCC was tested for the Catmull-Rom, Greville, and Greville two-parametric kernels. Based on the MSE, a window that gives optimal results was chosen.
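
    The following sketch shows peak-picking F0 estimation with sub-sample refinement, using parabolic interpolation as a simple stand-in for the parametric cubic convolution kernels compared in the paper; the frame is assumed to span at least two pitch periods.

        import numpy as np

        def estimate_f0(frame, fs, f_lo=60.0, f_hi=400.0):
            """F0 from the autocorrelation peak within the plausible lag range."""
            r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
            lo, hi = int(fs / f_hi), int(fs / f_lo)
            lag = lo + int(np.argmax(r[lo:hi]))
            # refine the integer lag with a parabola through three points
            y0, y1, y2 = r[lag - 1], r[lag], r[lag + 1]
            delta = 0.5 * (y0 - y2) / (y0 - 2 * y1 + y2 + 1e-12)
            return fs / (lag + delta)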

  14. Algorithms and Software for Predictive and Perceptual Modeling of Speech

    CERN Document Server

    Atti, Venkatraman

    2010-01-01

    From the early pulse code modulation-based coders to some of the recent multi-rate wideband speech coding standards, the area of speech coding made several significant strides with an objective to attain high quality of speech at the lowest possible bit rate. This book presents some of the recent advances in linear prediction (LP)-based speech analysis that employ perceptual models for narrow- and wide-band speech coding. The LP analysis-synthesis framework has been successful for speech coding because it fits well the source-system paradigm for speech synthesis. Limitations associated with th

  15. Prediction of speech masking release for fluctuating interferers based on the envelope power signal-to-noise ratio

    DEFF Research Database (Denmark)

    Jørgensen, Søren; Dau, Torsten

    2012-01-01

    The speech-based envelope power spectrum model (sEPSM) presented by Jørgensen and Dau [(2011). J. Acoust. Soc. Am. 130, 1475-1487] estimates the envelope signal-to-noise ratio (SNRenv) after modulation-frequency selective processing, which accurately predicts the speech intelligibility for normal......-hearing listeners in conditions with additive stationary noise, reverberation, and nonlinear processing with spectral subtraction. The latter condition represents a case in which the standardized speech intelligibility index and speech transmission index fail. However, the sEPSM is limited to conditions...... with stationary interferers due to the long-term estimation of the envelope power and cannot account for the well known phenomenon of speech masking release. Here, a short-term version of the sEPSM is presented, estimating the envelope SNR in 10-ms time frames. Predictions obtained with the short-term s...

  16. Prediction of speech masking release for fluctuating interferers based on the envelope power signal-to-noise ratio

    DEFF Research Database (Denmark)

    Jørgensen, Søren; Dau, Torsten

    2012-01-01

    The speech-based envelope power spectrum model (sEPSM) presented by Jørgensen and Dau [(2011). J. Acoust. Soc. Am. 130, 1475-1487] estimates the envelope signal-to-noise ratio (SNRenv) after modulation-frequency selective processing. This approach accurately predicts the speech intelligibility...... for normal-hearing listeners in conditions with additive stationary noise, reverberation, and nonlinear processing with spectral subtraction. The latter condition represents a case in which the standardized speech intelligibility index and the speech transmission index fail. However, the sEPSM is limited...... to conditions with stationary interferers due to the long-term estimation of the envelope power and cannot account for the well-known phenomenon of speech masking release. Here, a short-term version of the sEPSM is described [Jørgensen and Dau, 2012, in preparation], which estimates the SNRenv in short temporal...

  17. Ordinal models of audiovisual speech perception

    DEFF Research Database (Denmark)

    Andersen, Tobias

    2011-01-01

    Audiovisual information is integrated in speech perception. One manifestation of this is the McGurk illusion, in which watching the articulating face alters the auditory phonetic percept. Understanding this phenomenon fully requires a computational model with predictive power. Here, we describe ordinal models that can account for the McGurk illusion. We compare this type of model to the Fuzzy Logical Model of Perception (FLMP), in which the response categories are not ordered. While the FLMP generally fit the data better than the ordinal model, it also employs more free parameters in complex experiments when the number of response categories is high, as it is for speech perception in general. Testing the predictive power of the models using a form of cross-validation, we found that ordinal models perform better than the FLMP. Based on these findings we suggest that ordinal models generally have

  18. Emotion Recognition of Speech Signals Based on Filter Methods

    Directory of Open Access Journals (Sweden)

    Narjes Yazdanian

    2016-10-01

    Speech is the basic means of communication among human beings. With the increase of interaction between humans and machines, the need for automatic dialogue without a human factor has been considered. The aim of this study was to determine a set of affective features of the speech signal that are indicative of emotions. The designed system includes three main sections: feature extraction, feature selection, and classification. After extraction of useful features such as Mel-frequency cepstral coefficients (MFCC), linear prediction cepstral coefficients (LPC), perceptual linear prediction coefficients (PLP), formant frequency, zero crossing rate, cepstral coefficients, pitch frequency, mean, jitter, shimmer, energy, minimum, maximum, amplitude, and standard deviation, filter methods such as the Pearson correlation coefficient, t-test, Relief, and information gain were used to rank and select the features effective for emotion recognition. The selected features are then given to the classification system as a subset of its input. In the classification stage, a multiclass support vector machine is used to classify seven types of emotion. According to the results, the Relief method together with the multiclass support vector machine has the highest classification accuracy, with an emotion recognition rate of 93.94%.
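
    A minimal sketch of the filter-then-classify pipeline follows, using the Pearson correlation filter and an RBF-kernel multiclass SVM from scikit-learn; X (utterances by features) and y (emotion labels) are assumed inputs, and the number of retained features is illustrative.

        import numpy as np
        from sklearn.svm import SVC
        from sklearn.pipeline import make_pipeline
        from sklearn.preprocessing import StandardScaler

        def pearson_rank(X, y, keep=20):
            """Rank features by |correlation| with a numeric encoding of the label."""
            y_num = np.unique(y, return_inverse=True)[1].astype(float)
            scores = [abs(np.corrcoef(X[:, j], y_num)[0, 1]) for j in range(X.shape[1])]
            return np.argsort(scores)[::-1][:keep]

        def train_emotion_svm(X, y, keep=20):
            idx = pearson_rank(X, y, keep)
            clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))  # one-vs-one multiclass
            clf.fit(X[:, idx], y)
            return clf, idx  # keep idx to mask features of test utterances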

  19. Subspace-Based Noise Reduction for Speech Signals via Diagonal and Triangular Matrix Decompositions

    DEFF Research Database (Denmark)

    Hansen, Per Christian; Jensen, Søren Holdt

    2007-01-01

    We survey the definitions and use of rank-revealing matrix decompositions in single-channel noise reduction algorithms for speech signals. Our algorithms are based on the rank-reduction paradigm and, in particular, signal subspace techniques. The focus is on practical working algorithms, using bo...... with working Matlab code and applications in speech processing....

  1. Phonological modeling for continuous speech recognition in Korean

    CERN Document Server

    Lee, W I; Lee, J H; Lee, WonIl; Lee, Geunbae; Lee, Jong-Hyeok

    1996-01-01

    A new scheme to represent phonological changes during continuous speech recognition is suggested. A phonological tag coupled with its morphological tag is designed to represent the conditions of Korean phonological changes. A pairwise language model of these morphological and phonological tags is implemented in a Korean speech recognition system. Performance of the model is verified through TDNN-based speech recognition experiments.

  2. Speech-To-Text Conversion (STT) System Using Hidden Markov Model (HMM)

    Directory of Open Access Journals (Sweden)

    Su Myat Mon

    2015-06-01

    Speech is the easiest way to communicate with each other. Speech processing is widely used in many applications such as security devices, household appliances, cellular phones, ATM machines, and computers. Human-computer interfaces have been developed so that people with some kind of disability can communicate or interact conveniently. Speech-to-Text conversion (STT) systems have many benefits for deaf or mute people and find applications in our daily lives. Accordingly, the aim of this system is to convert input speech signals into text output for deaf or mute students in educational settings. This paper presents an approach to extract features from the speech signals of isolated spoken words by using Mel-frequency cepstral coefficients (MFCC), and the Hidden Markov Model (HMM) method is applied to train and test the audio files to obtain the recognized spoken word. The speech database is created by using MATLAB. The original speech signals are then preprocessed, and these speech samples are converted into feature vectors, which are used as the observation sequences of the Hidden Markov Model (HMM) recognizer. The feature vectors are analyzed in the HMM depending on the number of states.
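
    A minimal sketch of the described structure, with librosa and hmmlearn standing in for the MATLAB implementation, is shown below; word_examples, a mapping from vocabulary words to lists of WAV paths, is a hypothetical data layout.

        import numpy as np
        import librosa
        from hmmlearn.hmm import GaussianHMM

        def mfcc_seq(path):
            """MFCC observation sequence for one utterance, shape (frames, 13)."""
            y, sr = librosa.load(path, sr=16000)
            return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13).T

        def train_word_models(word_examples, n_states=5):
            """One Gaussian HMM per vocabulary word."""
            models = {}
            for word, paths in word_examples.items():
                seqs = [mfcc_seq(p) for p in paths]
                X = np.vstack(seqs)
                lengths = [len(s) for s in seqs]
                models[word] = GaussianHMM(n_components=n_states).fit(X, lengths)
            return models

        def recognize(models, path):
            """Return the word whose HMM scores the observation sequence highest."""
            obs = mfcc_seq(path)
            return max(models, key=lambda w: models[w].score(obs))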

  3. APPLICATION OF PARTIAL LEAST SQUARES REGRESSION FOR AUDIO-VISUAL SPEECH PROCESSING AND MODELING

    Directory of Open Access Journals (Sweden)

    A. L. Oleinik

    2015-09-01

    Subject of Research. The paper deals with the problem of lip-region image reconstruction from the speech signal by means of Partial Least Squares regression. Such problems arise in connection with the development of audio-visual speech processing methods. Audio-visual speech consists of acoustic and visual components (called modalities). Applications of audio-visual speech processing methods include joint modeling of voice and lip-movement dynamics, synchronization of audio and video streams, emotion recognition, and liveness detection. Method. Partial Least Squares regression was applied to solve the posed problem. This method extracts components of the initial data with high covariance, and these components are used to build the regression model. The advantage of this approach lies in the possibility of achieving two goals: identification of latent interrelations between initial data components (e.g., the speech signal and the lip-region image) and approximation of one initial data component as a function of another. Main Results. Experimental research on reconstruction of lip-region images from the speech signal was carried out on the VidTIMIT audio-visual speech database. The results of the experiment showed that Partial Least Squares regression is capable of solving the reconstruction problem. Practical Significance. The obtained findings give grounds to assert that Partial Least Squares regression is successfully applicable to a wide variety of audio-visual speech processing problems, from synchronization of audio and video streams to liveness detection.
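
    The core of the method can be sketched in a few lines with scikit-learn's PLSRegression; the feature dimensions and random data below are placeholders for the MFCC-like acoustic features and vectorized lip-region images extracted from VidTIMIT.

        import numpy as np
        from sklearn.cross_decomposition import PLSRegression

        rng = np.random.default_rng(1)
        acoustic = rng.standard_normal((500, 13))   # e.g. 13 acoustic features per frame
        lips = rng.standard_normal((500, 32 * 32))  # flattened 32x32 lip-region images

        # latent components chosen to maximize covariance between the two modalities
        pls = PLSRegression(n_components=10)
        pls.fit(acoustic, lips)

        # reconstruct a lip image from one frame of acoustic features
        reconstructed = pls.predict(acoustic[:1]).reshape(32, 32)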

  4. Cross-Word Modeling for Arabic Speech Recognition

    CERN Document Server

    AbuZeina, Dia

    2012-01-01

    "Cross-Word Modeling for Arabic Speech Recognition" utilizes phonological rules in order to model the cross-word problem, a merging of adjacent words in speech caused by continuous speech, to enhance the performance of continuous speech recognition systems. The author aims to provide an understanding of the cross-word problem and how it can be avoided, specifically focusing on Arabic phonology using an HHM-based classifier.

  5. Selection of individual features of a speech signal using genetic algorithms

    Directory of Open Access Journals (Sweden)

    Kamil Kamiński

    2016-03-01

    The paper presents an automatic speaker recognition system, implemented in the Matlab environment, and demonstrates how to achieve and optimize various elements of the system. The main emphasis was put on feature selection for the speech signal using a genetic algorithm which takes into account the synergy of features. The results of optimization of selected elements of the classifier are also shown, including the number of Gaussian distributions used to model each of the voices. In addition, for creating the voice models, a universal voice model has been used. Keywords: biometrics, automatic speaker recognition, genetic algorithms, feature selection
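
    A hedged sketch of genetic feature selection follows: individuals are binary masks over the feature set, fitness is cross-validated accuracy on the masked features, and uniform crossover with bit-flip mutation produces each new generation; the k-NN fitness classifier and all GA constants are illustrative stand-ins for the paper's GMM-based system.

        import numpy as np
        from sklearn.model_selection import cross_val_score
        from sklearn.neighbors import KNeighborsClassifier

        def fitness(mask, X, y):
            """Cross-validated accuracy using only the features selected by mask."""
            if not mask.any():
                return 0.0
            return cross_val_score(KNeighborsClassifier(3), X[:, mask], y, cv=3).mean()

        def ga_select(X, y, pop=20, gens=30, p_mut=0.05, seed=2):
            rng = np.random.default_rng(seed)
            n = X.shape[1]
            population = rng.random((pop, n)) < 0.5
            for _ in range(gens):
                scores = np.array([fitness(m, X, y) for m in population])
                order = np.argsort(scores)[::-1]
                parents = population[order[:pop // 2]]  # truncation selection
                children = []
                for _ in range(pop - len(parents)):
                    a, b = parents[rng.integers(len(parents), size=2)]
                    cut = rng.random(n) < 0.5           # uniform crossover
                    child = np.where(cut, a, b)
                    child ^= rng.random(n) < p_mut      # bit-flip mutation
                    children.append(child)
                population = np.vstack([parents, children])
            best = max(population, key=lambda m: fitness(m, X, y))
            return np.flatnonzero(best)  # indices of the selected features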

  6. Modelling Human Speech Recognition using Automatic Speech Recognition Paradigms in SpeM

    NARCIS (Netherlands)

    Scharenborg, O.E.; McQueen, J.M.; Bosch, L.F.M. ten; Norris, D.

    2003-01-01

    We have recently developed a new model of human speech recognition, based on automatic speech recognition techniques [1]. The present paper has two goals. First, we show that the new model performs well in the recognition of lexically ambiguous input. These demonstrations suggest that the model is

  7. Denoising of human speech using combined acoustic and em sensor signal processing

    Energy Technology Data Exchange (ETDEWEB)

    Ng, L C; Burnett, G C; Holzrichter, J F; Gable, T J

    1999-11-29

    Low-power EM radar-like sensors have made it possible to measure properties of the human speech production system in real time, without acoustic interference. This greatly enhances the quality and quantity of information for many speech-related applications. See Holzrichter, Burnett, Ng, and Lea, J. Acoust. Soc. Am. 103 (1) 622 (1998). By using combined glottal-EM-sensor and acoustic signals, segments of voiced, unvoiced, and no-speech can be reliably defined. Real-time denoising filters can be constructed to remove noise from the user's corresponding speech signal.

  8. A Novel Uncertainty Parameter SR (Signal to Residual Spectrum Ratio) Evaluation Approach for Speech Enhancement

    Directory of Open Access Journals (Sweden)

    M. Ravichandra Kumar

    2014-10-01

    Hearing-impaired people commonly use hearing aids implemented with speech enhancement algorithms. Estimation of speech and estimation of noise are the components of a single-channel speech enhancement system. The main objective of any speech enhancement algorithm is the estimation of the noise power spectrum in non-stationary environments. A VAD (Voice Activity Detector) is used to identify speech pauses, during which the noise is estimated. The MMSE (Minimum Mean Square Error) speech enhancement algorithm does not enhance intelligibility, quality, and listener fatigue, which are the perceptual aspects of speech. A novel evaluation approach based on the uncertainty parameter SR (Signal to Residual spectrum ratio) is introduced, for the benefit of hearing-impaired people in non-stationary environments, to control distortions. Noise is estimated and updated by dividing the original signal into three parts, namely pure speech, quasi-speech, and non-speech frames, based on multiple threshold conditions. Different values of SR and LLR demonstrate the amounts of attenuation and amplification distortion. The proposed method is compared with the WAT (Weighted Average Technique) method, using the parameters SR (signal to residual spectrum ratio) and LLR (log-likelihood ratio) together with the MMSE, in terms of segmental SNR and LLR.

  9. Phrase-level speech simulation with an airway modulation model of speech production.

    Science.gov (United States)

    Story, Brad H

    2013-06-01

    Artificial talkers and speech synthesis systems have long been used as a means of understanding both speech production and speech perception. The development of an airway modulation model is described that simulates the time-varying changes of the glottis and vocal tract, as well as acoustic wave propagation, during speech production. The result is a type of artificial talker that can be used to study various aspects of how sound is generated by humans and how that sound is perceived by a listener. The primary components of the model are introduced and simulations of words and phrases are demonstrated.

  10. Speech motor development in childhood apraxia of speech: generating testable hypotheses by neurocomputational modeling.

    NARCIS (Netherlands)

    Terband, H.R.; Maassen, B.A.M.

    2010-01-01

    Childhood apraxia of speech (CAS) is a highly controversial clinical entity, with respect to both clinical signs and underlying neuromotor deficit. In the current paper, we advocate a modeling approach in which a computational neural model of speech acquisition and production is utilized in order to

  11. Speech Motor Development in Childhood Apraxia of Speech : Generating Testable Hypotheses by Neurocomputational Modeling

    NARCIS (Netherlands)

    Terband, H.; Maassen, B.

    2010-01-01

    Childhood apraxia of speech (CAS) is a highly controversial clinical entity, with respect to both clinical signs and underlying neuromotor deficit. In the current paper, we advocate a modeling approach in which a computational neural model of speech acquisition and production is utilized in order to

  12. Processing of Speech Signals for Physical and Sensory Disabilities

    Science.gov (United States)

    Levitt, Harry

    1995-10-01

    Assistive technology involving voice communication is used primarily by people who are deaf, hard of hearing, or who have speech and/or language disabilities. It is also used to a lesser extent by people with visual or motor disabilities. A very wide range of devices has been developed for people with hearing loss. These devices can be categorized not only by the modality of stimulation [i.e., auditory, visual, tactile, or direct electrical stimulation of the auditory nerve (auditory-neural)] but also in terms of the degree of speech processing that is used. At least four such categories can be distinguished: assistive devices (a) that are not designed specifically for speech, (b) that take the average characteristics of speech into account, (c) that process articulatory or phonetic characteristics of speech, and (d) that embody some degree of automatic speech recognition. Assistive devices for people with speech and/or language disabilities typically involve some form of speech synthesis or symbol generation for severe forms of language disability. Speech synthesis is also used in text-to-speech systems for sightless persons. Other applications of assistive technology involving voice communication include voice control of wheelchairs and other devices for people with mobility disabilities.

  13. A New Method to Represent Speech Signals Via Predefined Signature and Envelope Sequences

    Directory of Open Access Journals (Sweden)

    Binboga Sıddık Yarman

    2007-01-01

    A novel systematic procedure referred to as "SYMPES" to model speech signals is introduced. The structure of SYMPES is based on the creation of the so-called predefined "signature S={SR(n)} and envelope E={EK(n)}" sets. These sets are speaker and language independent. Once the speech signals are divided into frames with selected lengths, then each frame sequence Xi(n) is reconstructed by means of the mathematical form Xi(n)=Ci·EK(n)·SR(n). In this representation, Ci is called the gain factor, and SR(n) and EK(n) are properly assigned from the predefined signature and envelope sets, respectively. Examples are given to exhibit the implementation of SYMPES. It is shown that for the same compression ratio or better, SYMPES yields considerably better speech quality over commercially available coders such as G.726 (ADPCM) at 16 kbps and voice-excited LPC-10E (FS1015) at 2.4 kbps.

  14. COMPRESSED SPEECH SIGNAL SENSING BASED ON THE STRUCTURED BLOCK SPARSITY WITH PARTIAL KNOWLEDGE OF SUPPORT

    Institute of Scientific and Technical Information of China (English)

    Ji Yunyun; Yang Zhen; Xu Qian

    2012-01-01

    Structural and statistical characteristics of signals can improve the performance of Compressed Sensing (CS). Two kinds of features of the Discrete Cosine Transform (DCT) coefficients of voiced speech signals are discussed in this paper. The first is the block sparsity of the DCT coefficients of voiced speech, formulated from two different aspects: the distribution of the DCT coefficients of voiced speech, and the comparison of reconstruction performance between the mixed l2/l1 program and Basis Pursuit (BP). The block sparsity of the DCT coefficients of voiced speech means that algorithms for block-sparse CS can be used to improve the recovery performance for speech signals. This is demonstrated by simulation results for the l2/reweighted-l1 mixed program, which is an improved version of the mixed l2/l1 program. The second is the well-known concentration of the large DCT coefficients of voiced speech at low frequencies. In line with this feature, a special Gaussian and Partial Identity Joint (GPIJ) matrix is constructed as the sensing matrix for voiced speech signals. Simulation results show that the GPIJ matrix outperforms the classical Gaussian matrix for speech signals of male and female adults.
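
    A toy end-to-end example is sketched below: a frame with a low-frequency block of DCT coefficients is sensed with a Gaussian matrix and recovered with orthogonal matching pursuit, a simple stand-in for the block-sparse and GPIJ-based solvers studied in the paper.

        import numpy as np
        from scipy.fft import idct
        from sklearn.linear_model import OrthogonalMatchingPursuit

        rng = np.random.default_rng(3)
        N, M, K = 256, 96, 12
        coeffs = np.zeros(N)
        coeffs[:K] = rng.standard_normal(K)   # low-frequency block of DCT coefficients
        frame = idct(coeffs, norm="ortho")    # time-domain "voiced" frame

        Phi = rng.standard_normal((M, N)) / np.sqrt(M)  # Gaussian sensing matrix
        Psi = idct(np.eye(N), axis=0, norm="ortho")     # DCT synthesis basis
        y = Phi @ frame                                 # compressed measurements

        # recover sparse DCT coefficients, then resynthesize the frame
        omp = OrthogonalMatchingPursuit(n_nonzero_coefs=K, fit_intercept=False)
        omp.fit(Phi @ Psi, y)
        recovered = idct(omp.coef_, norm="ortho")
        print(np.max(np.abs(recovered - frame)))  # near-zero reconstruction error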

  15. Phoneme Compression: processing of the speech signal and effects on speech intelligibility in hearing-impaired listeners

    NARCIS (Netherlands)

    A. Goedegebure (Andre)

    2005-01-01

    Hearing-aid users often continue to have problems with poor speech understanding in difficult acoustical conditions. Another generally acknowledged problem is that certain sounds become too loud whereas other sounds are still not audible. Dynamic range compression is a signal processing tec

  16. Bandwidth Extension of Speech Signals: A Comprehensive Review

    Directory of Open Access Journals (Sweden)

    N.Prasad

    2016-02-01

    Telephone systems commonly transmit narrowband (NB) speech with an audio bandwidth limited to the traditional telephone band of 300-3400 Hz. To improve the quality and intelligibility of speech degraded by the narrow bandwidth, researchers have tried to standardize the telephone networks by introducing wideband (50-7000 Hz) speech codecs. Wideband (WB) speech transmission requires the transmission network and terminal devices at both ends to be upgraded to wideband, which turns out to be time-consuming. In this situation, novel bandwidth extension (BWE) techniques have been developed to overcome the limitations of NB speech. This paper discusses the basic principles, realization, and applications of BWE. Challenges and limitations of BWE are also addressed.

  17. Role of neural network models for developing speech systems

    Indian Academy of Sciences (India)

    K Sreenivasa Rao

    2011-10-01

    This paper discusses the application of neural networks for developing different speech systems. Prosodic parameters of speech at syllable level depend on positional, contextual and phonological features of the syllables. In this paper, neural networks are explored to model the prosodic parameters of the syllables from their positional, contextual and phonological features. The prosodic parameters considered in this work are duration and sequence of pitch $(F_0)$ values of the syllables. These prosody models are further examined for applications such as text to speech synthesis, speech recognition, speaker recognition and language identification. Neural network models in voice conversion system are explored for capturing the mapping functions between source and target speakers at source, system and prosodic levels. We have also used neural network models for characterizing the emotions present in speech. For identification of dialects in Hindi, neural network models are used to capture the dialect specific information from spectral and prosodic features of speech.
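
    As a toy illustration of such a prosody model, the sketch below maps invented positional/contextual/phonological descriptors of a syllable to its duration with a small feedforward network; the four-feature encoding and the synthetic data are assumptions for demonstration only.

        import numpy as np
        from sklearn.neural_network import MLPRegressor

        rng = np.random.default_rng(4)
        # toy features: [position in word, position in phrase, #phones, is_stressed]
        X = rng.random((2000, 4))
        # synthetic target: syllable duration in ms with a stress and length effect
        durations = 80 + 120 * X[:, 3] + 40 * X[:, 2] + rng.normal(0, 10, 2000)

        model = MLPRegressor(hidden_layer_sizes=(32, 16), max_iter=2000)
        model.fit(X, durations)
        print(model.predict(X[:3]))  # predicted syllable durations in ms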

  18. Compact Acoustic Models for Embedded Speech Recognition

    Directory of Open Access Journals (Sweden)

    Lévy Christophe

    2009-01-01

    Speech recognition applications are known to require a significant amount of resources. However, embedded speech recognition allows only a few KB of memory, a few MIPS, and a small amount of training data. In order to fit the resource constraints of embedded applications, an approach based on a semicontinuous HMM system using state-independent acoustic modelling is proposed. A transformation is computed and applied to the global model in order to obtain each HMM state-dependent probability density function, so that only the transformation parameters need to be stored. This approach is evaluated on two tasks: digit and voice-command recognition. A fast adaptation technique for acoustic models is also proposed. In order to significantly reduce computational costs, the adaptation is performed only on the global model (using related speaker recognition adaptation techniques) with no need for state-dependent data. The whole approach results in a relative gain of more than 20% compared to a basic HMM-based system fitting the constraints.

  19. Predicting the effect of spectral subtraction on the speech recognition threshold based on the signal-to-noise ratio in the envelope domain

    DEFF Research Database (Denmark)

    Jørgensen, Søren; Dau, Torsten

    2011-01-01

    Noise reduction techniques have rarely been evaluated perceptually in terms of speech intelligibility. This study analyzed the effects of the spectral subtraction strategy proposed by Berouti et al. [ICASSP 4 (1979), 208-211] on the speech recognition threshold (SRT) obtained with sentences presented in stationary speech-shaped noise. The SRT was measured in five normal-hearing listeners in six conditions of spectral subtraction. The results showed an increase of the SRT after processing, i.e., a decreased speech intelligibility, in contrast to what is predicted by the Speech Transmission Index (STI). Here, another approach is proposed, denoted the speech-based envelope power spectrum model (sEPSM), which predicts the intelligibility based on the signal-to-noise ratio in the envelope domain. In contrast to the STI, the sEPSM is sensitive to the increased amount of noise envelope power as a consequence of the spectral subtraction...

  1. [The acoustic characteristics of the speech signal as an indicator of the human functional state].

    Science.gov (United States)

    Lebedeva, N N; Karimova, E D

    2014-01-01

    This review focuses on the capabilities and efficiency of functional state assessment based on the extralinguistic (nonsemantic) properties of the speech signal. The article provides an overview of studies that used such speech signal characteristics to investigate the human state during operator activity, in the simulation of emotions, under emotional stress, and in the presence of various mental disorders.

  2. Model based Binaural Enhancement of Voiced and Unvoiced Speech

    DEFF Research Database (Denmark)

    Kavalekalam, Mathew Shaji; Christensen, Mads Græsbøll; Boldt, Jesper B.

    2017-01-01

    This paper deals with the enhancement of speech in the presence of non-stationary babble noise. A binaural speech enhancement framework is proposed which takes into account both the voiced and unvoiced speech production models. Using these models for enhancement requires the short-term predictor (STP) parameters and the pitch information to be estimated. This paper uses a codebook-based approach for estimating the STP parameters, and a parametric binaural method is proposed for estimating the pitch parameters. Improvements in objective scores are shown when using the voiced-unvoiced speech model.

  3. The Effects of the Acute Hypoxia to the Fundamental Frequency of the Speech Signal

    Directory of Open Access Journals (Sweden)

    MILIVOJEVIC, Z. N.

    2012-05-01

    Full Text Available When people who live at low altitudes (up to 400 m above sea level) climb a mountain, they are exposed to the effects of acute hypoxia, and the oxygen concentration in their tissue decreases. This paper presents an analysis of the effects of acute hypoxia on the speech signal at altitudes up to 2600 m above sea level. For the experiment, the articulation of the vowels (A, E, I, O, U) by a test group was recorded at different altitudes, creating a speech signal database. The speech signals from the database were processed by an original algorithm, yielding the fundamental frequency and the energy of the dissonant intervals of the speech signal. Furthermore, the effect of acute hypoxia on the energy distribution in the dissonant intervals is analyzed. Finally, a comparative analysis shows that the level of hypoxia can be determined from the change of the fundamental frequency and the energy of the dissonant intervals. Hence, it is possible to estimate the degree of hypoxia, which in many situations can be important for avoiding catastrophic consequences.

  4. Neural correlates of quality perception for complex speech signals

    CERN Document Server

    Antons, Jan-Niklas

    2015-01-01

    This book interconnects two essential disciplines to study the perception of speech: Neuroscience and Quality of Experience, which to date have rarely been used together for the purposes of research on speech quality perception. In five key experiments, the book demonstrates the application of standard clinical methods in neurophysiology on the one hand, and of methods used in fields of research concerned with speech quality perception on the other. Using this combination, the book shows that speech stimuli with different lengths and different quality impairments are accompanied by physiological reactions related to quality variations, e.g., a positive peak in an event-related potential. Furthermore, it demonstrates that – in most cases – quality impairment intensity has an impact on the intensity of physiological reactions.

  5. An effective cluster-based model for robust speech detection and speech recognition in noisy environments.

    Science.gov (United States)

    Górriz, J M; Ramírez, J; Segura, J C; Puntonet, C G

    2006-07-01

    This paper presents an accurate speech detection algorithm for improving the performance of speech recognition systems working in noisy environments. The proposed method is based on a hard-decision clustering approach in which a set of prototypes is used to characterize the noisy channel. Detecting the presence of speech is enabled by a decision rule formulated in terms of an averaged distance between the observation vector and a cluster-based noise model. The algorithm benefits from using contextual information, a strategy that considers not only a single speech frame but also a neighborhood of data in order to smooth the decision function and improve speech detection robustness. The proposed scheme exhibits reduced computational cost, making it adequate for real-time applications such as automated speech recognition systems. An exhaustive analysis is conducted on the AURORA 2 and AURORA 3 databases in order to assess the performance of the algorithm and to compare it to existing standard voice activity detection (VAD) methods. The results show significant improvements in detection accuracy and speech recognition rate over standard VADs such as ITU-T G.729, ETSI GSM AMR, and ETSI AFE for distributed speech recognition, as well as over a representative set of recently reported VAD algorithms.
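    A minimal sketch of the decision rule described above: noise is summarized by K prototype vectors and a frame is labelled speech when its averaged distance to the noise prototypes exceeds a threshold. K, the threshold, and the feature representation are illustrative assumptions, not values from the paper.

```python
# Sketch of a cluster-based speech/non-speech decision with contextual
# smoothing over a neighborhood of frames.
import numpy as np
from sklearn.cluster import KMeans

def fit_noise_prototypes(noise_feats, k=4):
    """noise_feats: (frames, dims) feature vectors from noise-only segments."""
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(noise_feats)
    return km.cluster_centers_

def is_speech(frame_feats, prototypes, threshold):
    """frame_feats: (context, dims) block of frames around the decision point."""
    # Distance of each frame in the neighborhood to every noise prototype.
    d = np.linalg.norm(frame_feats[:, None, :] - prototypes[None, :, :], axis=-1)
    # Averaged distance to the nearest prototype over the neighborhood.
    return d.min(axis=1).mean() > threshold
```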

  6. A 94-GHz Millimeter-Wave Sensor for Speech Signal Acquisition

    Directory of Open Access Journals (Sweden)

    Jianqi Wang

    2013-10-01

    Full Text Available High-frequency millimeter-wave (MMW) radar-like sensors enable the detection of speech signals. This novel non-acoustic speech detection method has some special advantages not offered by traditional microphones, such as immunity to strong acoustic interference, high directional sensitivity with penetration, and long detection distance. A 94-GHz MMW radar sensor was employed in this study to test its speech acquisition ability. A 34-GHz zero intermediate frequency radar, a 34-GHz superheterodyne radar, and a microphone were also used for comparison purposes. A short-time phase-spectrum-compensation algorithm was used to enhance the detected speech. The results reveal that the 94-GHz radar sensor showed the highest sensitivity and obtained the highest subjective speech quality score. This suggests that the MMW radar sensor performs better than a traditional microphone in terms of speech detection for distances longer than 1 m. As a substitute for traditional speech acquisition, this novel method demonstrates large potential for many speech-related applications.

  7. Tracking the speech signal--time-locked MEG signals during perception of ultra-fast and moderately fast speech in blind and in sighted listeners.

    Science.gov (United States)

    Hertrich, Ingo; Dietrich, Susanne; Ackermann, Hermann

    2013-01-01

    Blind people can learn to understand speech at ultra-high syllable rates (ca. 20 syllables/s), a capability associated with hemodynamic activation of the central-visual system. To further elucidate the neural mechanisms underlying this skill, magnetoencephalographic (MEG) measurements during listening to sentence utterances were cross-correlated with time courses derived from the speech signal (envelope, syllable onsets and pitch periodicity) to capture phase-locked MEG components (14 blind, 12 sighted subjects; speech rate=8 or 16 syllables/s, pre-defined source regions: auditory and visual cortex, inferior frontal gyrus). Blind individuals showed stronger phase locking in auditory cortex than sighted controls, and right-hemisphere visual cortex activity correlated with syllable onsets in case of ultra-fast speech. Furthermore, inferior-frontal MEG components time-locked to pitch periodicity displayed opposite lateralization effects in sighted (towards right hemisphere) and blind subjects (left). Thus, ultra-fast speech comprehension in blind individuals appears associated with changes in early signal-related processing mechanisms both within and outside the central-auditory terrain.

  8. Microscopic prediction of speech recognition for listeners with normal hearing in noise using an auditory model.

    Science.gov (United States)

    Jürgens, Tim; Brand, Thomas

    2009-11-01

    This study compares the phoneme recognition performance in speech-shaped noise of a microscopic model for speech recognition with the performance of normal-hearing listeners. "Microscopic" is defined in terms of this model twofold. First, the speech recognition rate is predicted on a phoneme-by-phoneme basis. Second, microscopic modeling means that the signal waveforms to be recognized are processed by mimicking elementary parts of human auditory processing. The model is based on an approach by Holube and Kollmeier [J. Acoust. Soc. Am. 100, 1703-1716 (1996)] and consists of a psychoacoustically and physiologically motivated preprocessing stage and a simple dynamic-time-warp speech recognizer. The model is evaluated by presenting nonsense speech in a closed-set paradigm. Averaged phoneme recognition rates, specific phoneme recognition rates, and phoneme confusions are analyzed. The influence of different perceptual distance measures and of the model's a priori knowledge is investigated. The results show that human performance can be predicted by this model using an optimal detector, i.e., identical speech waveforms for both training of the recognizer and testing. The best model performance is yielded by distance measures that focus mainly on small perceptual distances and neglect outliers.
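    The dynamic-time-warp back end mentioned above can be illustrated with a textbook DTW distance between two feature sequences. This is a generic sketch, not the authors' implementation; in the paper the sequences are internal auditory representations and the local distance is a perceptual measure.

```python
import numpy as np

def dtw_distance(a, b):
    """Classic DTW between feature sequences a (m, d) and b (n, d)."""
    m, n = len(a), len(b)
    D = np.full((m + 1, n + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])  # local distance
            # Best of insertion, deletion, and match transitions.
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[m, n]
```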

  9. Hearing aid processing of loud speech and noise signals: Consequences for loudness perception and listening comfort

    DEFF Research Database (Denmark)

    Schmidt, Erik

    2007-01-01

    In the first experiment, subjects rated compressed speech and noise signals with regard to perceived level variation, loudness and overall acceptance. In the second experiment, two signals containing speech and noise at 75 dB SPL RMS level were compressed with six compression ratios from 1:1 to 10:1 and three release times from 40 ms to 4000 ms. In this experiment, subjects rated the signals with regard to loudness, speech clarity, noisiness and overall acceptance. Based on the results, a criterion is suggested for selecting compression parameters that yield some level variation in the output signal while still keeping the overall user acceptance at a tolerable level. It is also discussed how differences between the speech and noise components seem to influence listeners' ratings of the test signals. General recommendations for a fitting rule that takes into account the spectral and temporal characteristics of the input signal are given, together with suggestions for further studies.

  10. Cuckoo search based optimal mask generation for noise suppression and enhancement of speech signal

    Directory of Open Access Journals (Sweden)

    Anil Garg

    2015-07-01

    Full Text Available In this paper, an effective noise suppression technique for the enhancement of speech signals using an optimized mask is proposed. Initially, the noisy speech signal is broken down into various time-frequency (TF) units and features are extracted by computing the Amplitude Magnitude Spectrogram (AMS). The signals are then classified based on a quality ratio into different classes to generate the initial set of solutions. Subsequently, the optimal mask for each class is generated using the Cuckoo search algorithm. In the waveform synthesis stage, filtered waveforms are windowed, multiplied by the optimal mask value, and summed to obtain the enhanced target signal. The proposed technique was evaluated on various datasets and its performance was compared with previous techniques in terms of SNR. The results demonstrate the effectiveness of the proposed technique and its ability to suppress noise and enhance the speech signal.

  11. An Improved Technique for Speech Signal De-noising Based on Wavelet Threshold and Invasive Weed Optimization Algorithm

    Directory of Open Access Journals (Sweden)

    Ali Shaban

    2017-07-01

    Full Text Available Speech signals play a significant role in the area of digital signal processing. When these signals propagate through air, they interact with noise, which therefore needs to be removed from the corrupted signal without distorting the speech. De-noising is a compromise between removing the largest possible amount of noise and preserving signal integrity. To improve the performance on speech, which displays high power fluctuations, a new speech de-noising method based on Invasive Weed Optimization (IWO) is proposed. In addition, a theoretical model is modified to estimate the threshold value without any prior knowledge. This is done by applying the IWO algorithm to the kurtosis of the residual noise signal in order to find the optimum threshold value at which the kurtosis function is maximal. The proposed method showed better performance than other methods under the same conditions. Moreover, the results show that the proposed IWO algorithm offered a better mean square error (MSE) than the Particle Swarm Optimization (PSO) algorithm for both one-level and multilevel decomposition. For instance, IWO brought an improvement in MSE in the range of 0.01 compared with PSO for multilevel decomposition.
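    A rough sketch of the threshold-selection idea: soft-threshold the wavelet detail coefficients and pick the threshold that maximizes the kurtosis of the removed (residual) noise. For brevity, a plain grid search stands in for the Invasive Weed Optimization loop used in the paper; wavelet family, level and search range are assumptions.

```python
import numpy as np
import pywt
from scipy.stats import kurtosis

def denoise(signal, wavelet="db4", level=3):
    coeffs = pywt.wavedec(signal, wavelet, level=level)
    best_t, best_k = None, -np.inf
    for t in np.linspace(0.01, 1.0, 50):  # grid search stands in for IWO
        thr = [coeffs[0]] + [pywt.threshold(c, t, mode="soft") for c in coeffs[1:]]
        residual = signal - pywt.waverec(thr, wavelet)[: len(signal)]
        k = kurtosis(residual)            # objective maximized by IWO in the paper
        if k > best_k:
            best_k, best_t = k, t
    thr = [coeffs[0]] + [pywt.threshold(c, best_t, mode="soft") for c in coeffs[1:]]
    return pywt.waverec(thr, wavelet)[: len(signal)]
```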

  12. Enhancement and Bandwidth Compression of Noisy Speech by Estimation of Speech and Its Model Parameters

    Science.gov (United States)

    1978-08-01

    ... Correlation Subtraction Method; II.2.7 SPAC and SPOC; II.2.8 Wiener Filtering Method; II.2.9 Summary; II.3 Summary of Performance ... underlying speech model. The objective of this dissertation is, of course, to develop speech enhancement and bandwidth compression systems that are ... special cases. II.2.7 SPAC and SPOC: Suzuki developed [20] a speech enhancement system called SPAC, or "Splicing of Autocorrelation Function"; SPOC, or ...

  13. Application of a short-time version of the Equalization-Cancellation model to speech intelligibility experiments with speech maskers.

    Science.gov (United States)

    Wan, Rui; Durlach, Nathaniel I; Colburn, H Steven

    2014-08-01

    A short-time-processing version of the Equalization-Cancellation (EC) model of binaural processing is described and applied to speech intelligibility tasks in the presence of multiple maskers, including multiple speech maskers. This short-time EC model, called the STEC model, extends the model described by Wan et al. [J. Acoust. Soc. Am. 128, 3678-3690 (2010)] to allow the EC model's equalization parameters τ and α to be adjusted as a function of time, resulting in improved masker cancellation when the dominant masker location varies in time. Using the Speech Intelligibility Index, the STEC model is applied to speech intelligibility with maskers that vary in number, type, and spatial arrangements. Most notably, when maskers are located on opposite sides of the target, this STEC model predicts improved thresholds when the maskers are modulated independently with speech-envelope modulators; this includes the most relevant case of independent speech maskers. The STEC model describes the spatial dependence of the speech reception threshold with speech maskers better than the steady-state model. Predictions are also improved for independently speech-modulated noise maskers but are poorer for reversed-speech maskers. In general, short-term processing is useful, but much remains to be done in the complex task of understanding speech in speech maskers.

  14. Anthropomorphic Coding of Speech and Audio: A Model Inversion Approach

    Directory of Open Access Journals (Sweden)

    W. Bastiaan Kleijn

    2005-06-01

    Full Text Available Auditory modeling is a well-established methodology that provides insight into human perception and that facilitates the extraction of signal features that are most relevant to the listener. The aim of this paper is to provide a tutorial on perceptual speech and audio coding using an invertible auditory model. In this approach, the audio signal is converted into an auditory representation using an invertible auditory model. The auditory representation is quantized and coded. Upon decoding, it is then transformed back into the acoustic domain. This transformation converts a complex distortion criterion into a simple one, thus facilitating quantization with low complexity. We briefly review past work on auditory models and describe in more detail the components of our invertible model and its inversion procedure, that is, the method to reconstruct the signal from the output of the auditory model. We summarize attempts to use the auditory representation for low-bit-rate coding. Our approach also allows the exploitation of the inherent redundancy of the human auditory system for the purpose of multiple description (joint source-channel) coding.

  15. Development of the speech test signal in Brazilian Portuguese for real-ear measurement.

    Science.gov (United States)

    Garolla, Luciana P; Scollie, Susan D; Martinelli Iório, Maria Cecília

    2013-08-01

    Recommended practice is to verify the gain and/or output of hearing aids with speech or speech-shaped signals. This study had the purpose of developing a speech test signal in Brazilian Portuguese that is electroacoustically similar to the international long-term average speech spectrum (ILTASS) for use in real-ear verification systems. A Brazilian Portuguese speech passage was recorded using standardized equipment and procedures for one female talker and compared to the International Speech Test Signal (ISTS). The passage consisted of simple, declarative sentences totalling 148 words. The recording was filtered to the ILTASS and compared to the ISTS. Aided recordings were made at three test levels, for three audiograms, for both the Brazilian Portuguese passage and the ISTS. The unaided test signals were spectrally matched to within 0.5 dB. Aided evaluation revealed that the Brazilian Portuguese passage produced aided spectra that were within 1 dB on average, within about 2 dB per audiogram, and within about 3 dB per frequency for 95% of fittings. Results indicate that the Brazilian Portuguese passage developed in this study provides hearing-aid evaluations electroacoustically similar to those expected from the standard ISTS passage.
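    The spectral-matching step can be sketched as follows: estimate the long-term spectrum of the passage with Welch's method and design a linear-phase correction filter toward a target LTASS. The study followed the standardized ISTS/ILTASS procedures; the band layout, filter length and level handling here are illustrative assumptions.

```python
import numpy as np
from scipy.signal import welch, firwin2, lfilter

def match_to_target(passage, target_ltass_db, freqs_hz, fs, numtaps=1025):
    """Filter `passage` so its long-term spectrum approaches the target LTASS.
    target_ltass_db is given at frequencies freqs_hz (relative dB; overall
    level would be normalized separately)."""
    f, pxx = welch(passage, fs=fs, nperseg=4096)
    passage_db = 10 * np.log10(pxx + 1e-12)
    gain_db = np.interp(f, freqs_hz, target_ltass_db) - passage_db
    gain = 10 ** (gain_db / 20)
    gain[0], gain[-1] = gain[1], gain[-2]        # sane endpoint values
    h = firwin2(numtaps, f / (fs / 2), gain)     # normalized freqs in [0, 1]
    return lfilter(h, [1.0], passage)
```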

  16. Estimation of glottal source features from the spectral envelope of the acoustic speech signal

    Science.gov (United States)

    Torres, Juan Felix

    Speech communication encompasses diverse types of information, including phonetics, affective state, voice quality, and speaker identity. From a speech production standpoint, the acoustic speech signal can be divided mainly into glottal source and vocal tract components, which play distinct roles in rendering the various types of information it contains. Most deployed speech analysis systems, however, do not explicitly represent these two components as distinct entities, as their joint estimation from the acoustic speech signal becomes an ill-defined blind deconvolution problem. Nevertheless, because of the desire to understand glottal behavior and how it relates to perceived voice quality, there has been continued interest in explicitly estimating the glottal component of the speech signal. To this end, several inverse filtering (IF) algorithms have been proposed, but they are unreliable in practice because of the blind formulation of the separation problem. In an effort to develop a method that can bypass the challenging IF process, this thesis proposes a new glottal source information extraction method that relies on supervised machine learning to transform smoothed spectral representations of speech, which are already used in some of the most widely deployed and successful speech analysis applications, into a set of glottal source features. A transformation method based on Gaussian mixture regression (GMR) is presented and compared to current IF methods in terms of feature similarity, reliability, and speaker discrimination capability on a large speech corpus, and potential representations of the spectral envelope of speech are investigated for their ability to represent glottal source variation in a predictable manner. The proposed system was found to produce glottal source features that reasonably matched their IF counterparts in many cases, while being less susceptible to spurious errors. The development of the proposed method entailed a study into the aspects ...

  17. Predicting speech intelligibility in conditions with nonlinearly processed noisy speech

    DEFF Research Database (Denmark)

    Jørgensen, Søren; Dau, Torsten

    2013-01-01

    The speech-based envelope power spectrum model (sEPSM; [1]) was proposed in order to overcome the limitations of the classical speech transmission index (STI) and speech intelligibility index (SII). The sEPSM applies the signal-to-noise ratio in the envelope domain (SNRenv), which was demonstrated to successfully predict speech intelligibility in conditions with nonlinearly processed noisy speech, such as processing with spectral subtraction. Moreover, a multi-resolution version (mr-sEPSM) was demonstrated to account for speech intelligibility in various conditions with stationary and fluctuating interferers. The results are consistent with findings from computational auditory scene analysis and further support the hypothesis that the SNRenv is a powerful metric for speech intelligibility prediction.

  18. Phone Duration Modeling of Affective Speech Using Support Vector Regression

    Directory of Open Access Journals (Sweden)

    Alexandros Lazaridis

    2012-07-01

    Full Text Available In speech synthesis, accurate modeling of prosody is important for producing high-quality synthetic speech, and one of its main aspects is phone duration. Robust phone duration modeling is a prerequisite for synthesizing natural-sounding emotional speech. In this work, ten phone duration models are evaluated. These models belong to well-known and widely used categories of algorithms, such as decision trees, linear regression, lazy-learning algorithms and meta-learning algorithms. Furthermore, we investigate the effectiveness of Support Vector Regression (SVR) for phone duration modeling in the context of emotional speech. The evaluation of the eleven models is performed on a Modern Greek emotional speech database which consists of four categories of emotional speech (anger, fear, joy, sadness) plus neutral speech. The experimental results demonstrate that the SVR-based modeling outperforms the other ten models across all four emotion categories. Specifically, the SVR model achieved an average relative reduction of 8% in terms of root mean square error (RMSE) across all emotional categories.

  19. A Statistical Quality Model for Data-Driven Speech Animation.

    Science.gov (United States)

    Ma, Xiaohan; Deng, Zhigang

    2012-11-01

    In recent years, data-driven speech animation approaches have achieved significant successes in terms of animation quality. However, how to automatically evaluate the realism of novel synthesized speech animations has been an important yet unsolved research problem. In this paper, we propose a novel statistical model (called SAQP) to automatically predict the quality of on-the-fly synthesized speech animations by various data-driven techniques. Its essential idea is to construct a phoneme-based, Speech Animation Trajectory Fitting (SATF) metric to describe speech animation synthesis errors and then build a statistical regression model to learn the association between the obtained SATF metric and the objective speech animation synthesis quality. Through delicately designed user studies, we evaluate the effectiveness and robustness of the proposed SAQP model. To the best of our knowledge, this work is the first-of-its-kind, quantitative quality model for data-driven speech animation. We believe it is the important first step to remove a critical technical barrier for applying data-driven speech animation techniques to numerous online or interactive talking avatar applications.

  20. Modelling context in automatic speech recognition

    NARCIS (Netherlands)

    Wiggers, P.

    2008-01-01

    Speech is at the core of human communication. Speaking and listening come so naturally to us that we do not have to think about them at all. The underlying cognitive processes are very rapid and almost completely subconscious. It is hard, if not impossible, not to understand speech. For computers, on the other hand, ...

  1. A multi-resolution envelope-power based model for speech intelligibility

    DEFF Research Database (Denmark)

    Jørgensen, Søren; Ewert, Stephan D.; Dau, Torsten

    2013-01-01

    The speech-based envelope power spectrum model (sEPSM) presented by Jørgensen and Dau [(2011). J. Acoust. Soc. Am. 130, 1475-1487] estimates the envelope power signal-to-noise ratio (SNRenv) after modulation-frequency selective processing. Changes in this metric were shown to account well for changes of speech intelligibility for normal-hearing listeners in conditions with additive stationary noise, reverberation, and nonlinear processing with spectral subtraction. In the latter condition, the standardized speech transmission index [(2003). IEC 60268-16] fails. However, the sEPSM is limited to conditions with stationary interferers. Here, a multi-resolution version is presented, in which the SNRenv is estimated in temporal segments with a modulation-filter dependent duration. The multi-resolution sEPSM is demonstrated to account for intelligibility obtained in conditions with stationary and fluctuating interferers, and for noisy speech distorted by reverberation or spectral subtraction. The results support the hypothesis that the SNRenv is a powerful metric for speech intelligibility prediction.

  2. Combined Hand Gesture — Speech Model for Human Action Recognition

    Directory of Open Access Journals (Sweden)

    Sheng-Tzong Cheng

    2013-12-01

    Full Text Available This study proposes a dynamic hand gesture detection technology to effectively detect dynamic hand gesture areas, and a hand gesture recognition technology to improve the dynamic hand gesture recognition rate. In addition, the correspondence between state sequences in the hand gesture and speech models is exploited by integrating speech recognition with a multimodal model, thus improving the accuracy of human behavior recognition. The experimental results show that the proposed method effectively improves human behavior recognition accuracy and demonstrates the feasibility of system applications; the multimodal gesture-speech model provided superior accuracy compared to the single-modal versions.

  3. Combined hand gesture--speech model for human action recognition.

    Science.gov (United States)

    Cheng, Sheng-Tzong; Hsu, Chih-Wei; Li, Jian-Pan

    2013-12-12

    This study proposes a dynamic hand gesture detection technology to effectively detect dynamic hand gesture areas, and a hand gesture recognition technology to improve the dynamic hand gesture recognition rate. In addition, the correspondence between state sequences in the hand gesture and speech models is exploited by integrating speech recognition with a multimodal model, thus improving the accuracy of human behavior recognition. The experimental results show that the proposed method effectively improves human behavior recognition accuracy and demonstrates the feasibility of system applications; the multimodal gesture-speech model provided superior accuracy compared to the single-modal versions.

  4. Joint spatial-spectral feature space clustering for speech activity detection from ECoG signals.

    Science.gov (United States)

    Kanas, Vasileios G; Mporas, Iosif; Benz, Heather L; Sgarbas, Kyriakos N; Bezerianos, Anastasios; Crone, Nathan E

    2014-04-01

    Brain-machine interfaces for speech restoration have been extensively studied for more than two decades. The success of such a system will depend in part on selecting the best brain recording sites and signal features corresponding to speech production. The purpose of this study was to detect speech activity automatically from electrocorticographic signals based on joint spatial-frequency clustering of the ECoG feature space. For this study, the ECoG signals were recorded while a subject performed two different syllable repetition tasks. We found that the optimal frequency resolution to detect speech activity from ECoG signals was 8 Hz, achieving 98.8% accuracy by employing support vector machines as a classifier. We also defined the cortical areas that held the most information about the discrimination of speech and nonspeech time intervals. Additionally, the results shed light on the distinct cortical areas associated with the two syllables repetition tasks and may contribute to the development of portable ECoG-based communication.

  5. The Effects of the Active Hypoxia to the Speech Signal Inharmonicity

    Directory of Open Access Journals (Sweden)

    Z. N. Milivojevic

    2014-06-01

    Full Text Available When people climb a mountain, they are exposed to a decreased oxygen concentration in the tissue, commonly called active hypoxia. This paper addresses the problem of acute hypoxia affecting the speech signal at altitudes up to 2500 m. For the experiment, a speech signal database containing the articulation of vowels was recorded at different altitudes. The speech signals were processed by an originally developed algorithm that extracts the fundamental frequency and the inharmonicity coefficient, which were then analyzed in order to derive the effects of acute hypoxia. The results showed that the hypoxia level can be determined from the change of the inharmonicity coefficient. Accordingly, the degree of hypoxia can be estimated.

  6. A visual or tactile signal makes auditory speech detection more efficient by reducing uncertainty.

    Science.gov (United States)

    Tjan, Bosco S; Chao, Ewen; Bernstein, Lynne E

    2014-04-01

    Acoustic speech is easier to detect in noise when the talker can be seen. This finding could be explained by integration of multisensory inputs or refinement of auditory processing from visual guidance. In two experiments, we studied two-interval forced-choice detection of an auditory 'ba' in acoustic noise, paired with various visual and tactile stimuli that were identically presented in the two observation intervals. Detection thresholds were reduced under the multisensory conditions vs. the auditory-only condition, even though the visual and/or tactile stimuli alone could not inform the correct response. Results were analysed relative to an ideal observer for which intrinsic (internal) noise and efficiency were independent contributors to detection sensitivity. Across experiments, intrinsic noise was unaffected by the multisensory stimuli, arguing against the merging (integrating) of multisensory inputs into a unitary speech signal, but sampling efficiency was increased to varying degrees, supporting refinement of knowledge about the auditory stimulus. The steepness of the psychometric functions decreased with increasing sampling efficiency, suggesting that the 'task-irrelevant' visual and tactile stimuli reduced uncertainty about the acoustic signal. Visible speech was not superior for enhancing auditory speech detection. Our results reject multisensory neuronal integration and speech-specific neural processing as explanations for the enhanced auditory speech detection under noisy conditions. Instead, they support a more rudimentary form of multisensory interaction: the otherwise task-irrelevant sensory systems inform the auditory system about when to listen.

  7. ADAPTIVE LEARNING OF HIDDEN MARKOV MODELS FOR EMOTIONAL SPEECH

    Directory of Open Access Journals (Sweden)

    A. V. Tkachenia

    2014-01-01

    Full Text Available An on-line unsupervised algorithm for estimating the hidden Markov model (HMM) parameters is presented, solving the problem of adapting hidden Markov models to emotional speech. To increase the reliability of the estimated HMM parameters, a forgetting-and-updating mechanism is proposed. A functional block diagram of the adaptation algorithm is provided, together with results that demonstrate improved accuracy of emotional speech recognition.

  8. Applications of sub-audible speech recognition based upon electromyographic signals

    Science.gov (United States)

    Jorgensen, C. Charles (Inventor); Betts, Bradley J. (Inventor)

    2009-01-01

    Method and system for generating electromyographic or sub-audible signals ("SAWPs") and for transmitting and recognizing the SAWPs that represent the original words and/or phrases. The SAWPs may be generated in an environment that interferes excessively with normal speech or that requires stealth communications, and may be transmitted using encoded, enciphered or otherwise transformed signals that are less subject to signal distortion or degradation in the ambient environment.

  9. Cognitive Bias for Learning Speech Sounds From a Continuous Signal Space Seems Nonlinguistic

    Directory of Open Access Journals (Sweden)

    Sabine van der Ham

    2015-10-01

    Full Text Available When learning language, humans have a tendency to produce more extreme distributions of speech sounds than those observed most frequently: In rapid, casual speech, vowel sounds are centralized, yet cross-linguistically, peripheral vowels occur almost universally. We investigate whether adults’ generalization behavior reveals selective pressure for communication when they learn skewed distributions of speech-like sounds from a continuous signal space. The domain-specific hypothesis predicts that the emergence of sound categories is driven by a cognitive bias to make these categories maximally distinct, resulting in more skewed distributions in participants’ reproductions. However, our participants showed more centered distributions, which goes against this hypothesis, indicating that there are no strong innate linguistic biases that affect learning these speech-like sounds. The centralization behavior can be explained by a lack of communicative pressure to maintain categories.

  10. Cognitive Bias for Learning Speech Sounds From a Continuous Signal Space Seems Nonlinguistic.

    Science.gov (United States)

    van der Ham, Sabine; de Boer, Bart

    2015-10-01

    When learning language, humans have a tendency to produce more extreme distributions of speech sounds than those observed most frequently: In rapid, casual speech, vowel sounds are centralized, yet cross-linguistically, peripheral vowels occur almost universally. We investigate whether adults' generalization behavior reveals selective pressure for communication when they learn skewed distributions of speech-like sounds from a continuous signal space. The domain-specific hypothesis predicts that the emergence of sound categories is driven by a cognitive bias to make these categories maximally distinct, resulting in more skewed distributions in participants' reproductions. However, our participants showed more centered distributions, which goes against this hypothesis, indicating that there are no strong innate linguistic biases that affect learning these speech-like sounds. The centralization behavior can be explained by a lack of communicative pressure to maintain categories.

  11. Subspace-Based Noise Reduction for Speech Signals via Diagonal and Triangular Matrix Decompositions

    DEFF Research Database (Denmark)

    Hansen, Per Christian; Jensen, Søren Holdt

    2007-01-01

    We survey the definitions and use of rank-revealing matrix decompositions in single-channel noise reduction algorithms for speech signals. Our algorithms are based on the rank-reduction paradigm and, in particular, signal subspace techniques. The focus is on practical working algorithms, using both diagonal (eigenvalue and singular value) decompositions and rank-revealing triangular decompositions (ULV, URV, VSV, ULLV and ULLIV). In addition we show how the subspace-based algorithms can be evaluated and compared by means of simple FIR filter interpretations. The algorithms are illustrated with working Matlab code and applications in speech processing.
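    The diagonal-decomposition special case of the rank-reduction paradigm can be sketched in a few lines: build a Hankel-structured data matrix from a noisy frame, keep the dominant singular directions, and average the result back into a signal. Frame length, model order and rank are illustrative; the paper also covers the triangular (ULV/URV/VSV, ULLV/ULLIV) decompositions, which are not shown here.

```python
import numpy as np

def svd_denoise_frame(x, order=20, rank=8):
    """Rank-reduction noise reduction on one frame x via the SVD."""
    n = len(x) - order + 1
    # Rows x[i:i+order] form a Hankel-structured trajectory matrix.
    H = np.lib.stride_tricks.sliding_window_view(x, order)[:n]
    U, s, Vt = np.linalg.svd(H, full_matrices=False)
    Hk = (U[:, :rank] * s[:rank]) @ Vt[:rank]    # rank-k signal estimate
    out = np.zeros(len(x))
    cnt = np.zeros(len(x))
    for i in range(n):                            # average the anti-diagonals
        out[i:i + order] += Hk[i]
        cnt[i:i + order] += 1
    return out / cnt
```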

  12. Research on Volterra and Wiener Model of Compressed Sensing Measurement of Speech Signal Based on Row Echelon Matrix

    Institute of Scientific and Technical Information of China (English)

    叶蕾; 杨震; 孙林慧; 郭海燕

    2013-01-01

    Under the compressed sensing framework, the measurement sequence obtained by projecting a speech signal onto a random Gaussian matrix is highly random and therefore difficult to model. To address this, this paper proposes a Volterra model of compressed sensing measurements of speech based on a special row-echelon measurement matrix, and uses this model to predict the measurement sequence. The effects of the input dimension and the order of the Volterra model on prediction performance are studied, and a Wiener filter is used to further reduce the prediction error. For the same amount of known data, reconstruction based on part of the compressed sensing measurements, the Volterra model and the Wiener filter achieves better performance than reconstruction from Gaussian random measurements. This model offers a useful reference for combining compressed sensing with speech signal processing techniques.
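    A hedged sketch of a second-order Volterra predictor fitted by least squares. The memory length, order and stand-in data are illustrative; the paper studies how these choices affect prediction of the measurement sequence and follows the predictor with a Wiener filter, which is omitted here.

```python
import numpy as np
from itertools import combinations_with_replacement

def volterra_features(y, memory):
    """Linear + quadratic (second-order Volterra) regressors over a sliding window."""
    rows = []
    for t in range(memory, len(y)):
        lin = y[t - memory:t]
        quad = [a * b for a, b in combinations_with_replacement(lin, 2)]
        rows.append(np.concatenate([lin, quad]))
    return np.array(rows), y[memory:]

y = np.random.default_rng(1).standard_normal(400)   # stand-in for measurements
X, target = volterra_features(y, memory=4)
w, *_ = np.linalg.lstsq(X, target, rcond=None)       # least-squares kernel fit
print("prediction MSE:", np.mean((X @ w - target) ** 2))
```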

  13. Acceptance Noise Level: Effects of the Speech Signal, Babble, and Listener Language

    Science.gov (United States)

    Shi, Lu-Feng; Azcona, Gabrielly; Buten, Lupe

    2015-01-01

    Purpose: The acceptable noise level (ANL) measure has gained much research/clinical interest in recent years. The present study examined how the characteristics of the speech signal and the babble used in the measure may affect the ANL in listeners with different native languages. Method: Fifteen English monolingual, 16 Russian-English bilingual,…

  14. A Binaural Grouping Model for Predicting Speech Intelligibility in Multitalker Environments

    Directory of Open Access Journals (Sweden)

    Jing Mi

    2016-09-01

    Full Text Available Spatially separating speech maskers from target speech often leads to a large intelligibility improvement. Modeling this phenomenon has long been of interest to binaural-hearing researchers for uncovering brain mechanisms and for improving signal-processing algorithms in hearing-assistive devices. Much of the previous binaural modeling work focused on the unmasking enabled by binaural cues at the periphery, and little quantitative modeling has been directed toward the grouping or source-separation benefits of binaural processing. In this article, we propose a binaural model that focuses on grouping, specifically on the selection of time-frequency units that are dominated by signals from the direction of the target. The proposed model uses Equalization-Cancellation (EC processing with a binary decision rule to estimate a time-frequency binary mask. EC processing is carried out to cancel the target signal and the energy change between the EC input and output is used as a feature that reflects target dominance in each time-frequency unit. The processing in the proposed model requires little computational resources and is straightforward to implement. In combination with the Coherence-based Speech Intelligibility Index, the model is applied to predict the speech intelligibility data measured by Marrone et al. The predicted speech reception threshold matches the pattern of the measured data well, even though the predicted intelligibility improvements relative to the colocated condition are larger than some of the measured data, which may reflect the lack of internal noise in this initial version of the model.
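    An illustrative sketch of the EC-based masking idea described above: steer an equalization-cancellation operation at the target direction in each time-frequency unit and use the input/output energy change as a target-dominance feature. The gain/delay handling and the threshold value are assumptions, not the paper's actual parameterization.

```python
import numpy as np

def ec_binary_mask(L, R, tau, freqs, alpha=1.0, thresh_db=3.0):
    """L, R: STFTs of left/right ear signals, shape (freq, time);
    tau: target interaural delay in seconds; freqs: bin frequencies in Hz."""
    steer = alpha * np.exp(-2j * np.pi * freqs[:, None] * tau)
    out = L - steer * R                    # cancels energy from the target direction
    e_in = np.abs(L) ** 2 + np.abs(R) ** 2
    change_db = 10 * np.log10(e_in / (np.abs(out) ** 2 + 1e-12) + 1e-12)
    return change_db > thresh_db           # True = target-dominated T-F unit
```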

  15. A Binaural Grouping Model for Predicting Speech Intelligibility in Multitalker Environments.

    Science.gov (United States)

    Mi, Jing; Colburn, H Steven

    2016-10-03

    Spatially separating speech maskers from target speech often leads to a large intelligibility improvement. Modeling this phenomenon has long been of interest to binaural-hearing researchers for uncovering brain mechanisms and for improving signal-processing algorithms in hearing-assistive devices. Much of the previous binaural modeling work focused on the unmasking enabled by binaural cues at the periphery, and little quantitative modeling has been directed toward the grouping or source-separation benefits of binaural processing. In this article, we propose a binaural model that focuses on grouping, specifically on the selection of time-frequency units that are dominated by signals from the direction of the target. The proposed model uses Equalization-Cancellation (EC) processing with a binary decision rule to estimate a time-frequency binary mask. EC processing is carried out to cancel the target signal and the energy change between the EC input and output is used as a feature that reflects target dominance in each time-frequency unit. The processing in the proposed model requires little computational resources and is straightforward to implement. In combination with the Coherence-based Speech Intelligibility Index, the model is applied to predict the speech intelligibility data measured by Marrone et al. The predicted speech reception threshold matches the pattern of the measured data well, even though the predicted intelligibility improvements relative to the colocated condition are larger than some of the measured data, which may reflect the lack of internal noise in this initial version of the model.

  16. Improved single-channel speech separation using sinusoidal modeling

    DEFF Research Database (Denmark)

    Mowlaee, Pejman; Christensen, Mads Græsbøll; Jensen, Søren Holdt

    2010-01-01

    Unlike binary mask (hardmask) and Wiener filter (softmask) approaches, the proposed approach works independently of pitch estimates. Furthermore, it is observed that it can achieve acceptable perceptual speech quality with less cross-talk at different signal-to-signal ratios, while bringing down the complexity by replacing the STFT ...

  17. Feature Fusion Algorithm for Multimodal Emotion Recognition from Speech and Facial Expression Signal

    Directory of Open Access Journals (Sweden)

    Han Zhiyan

    2016-01-01

    Full Text Available In order to overcome the limitations of single-mode emotion recognition, this paper describes a novel multimodal emotion recognition algorithm that takes the speech signal and the facial expression signal as its research subjects. First, the speech and facial expression features are fused, sample sets are obtained by sampling with replacement, and classifiers are trained with a BP neural network (BPNN). Second, the difference between two classifiers is measured by a double-error difference selection strategy. Finally, the recognition result is obtained by majority voting. Experiments show that the method improves the accuracy of emotion recognition by exploiting the advantages of both decision-level and feature-level fusion, bringing the whole fusion process closer to human emotion recognition, with a recognition rate of 90.4%.

  18. A model of serial order problems in fluent, stuttered and agrammatic speech

    OpenAIRE

    Howell, Peter

    2007-01-01

    Many models of speech production have attempted to explain dysfluent speech. Most models assume that the disruptions that occur when speech is dysfluent arise because the speakers make errors while planning an utterance. In this contribution, a model of the serial order of speech is described that does not make this assumption. It involves the coordination or ‘interlocking’ of linguistic planning and execution stages at the language–speech interface. The model is examined to determine whether...

  20. A speech processing study using an acoustic model of a multiple-channel cochlear implant

    Science.gov (United States)

    Xu, Ying

    1998-10-01

    A cochlear implant is an electronic device designed to provide sound information for adults and children who have bilateral profound hearing loss. The task of representing speech signals as electrical stimuli is central to the design and performance of cochlear implants. Studies have shown that current speech-processing strategies provide significant benefits to cochlear implant users. However, the evaluation and development of speech-processing strategies have been complicated by hardware limitations and large variability in user performance. To alleviate these problems, an acoustic model of a cochlear implant with the SPEAK strategy is implemented in this study, in which a set of acoustic stimuli whose psychophysical characteristics are as close as possible to those produced by a cochlear implant are presented to normal-hearing subjects. To test the effectiveness and feasibility of this acoustic model, a psychophysical experiment was conducted to match the performance of a normal-hearing listener using model-processed signals to that of a cochlear implant user. Good agreement was found between an implanted patient and an age-matched normal-hearing subject in a dynamic signal discrimination experiment, indicating that this acoustic model is a reasonably good approximation of a cochlear implant with the SPEAK strategy. The acoustic model was then used to examine the potential of the SPEAK strategy in terms of its temporal and frequency encoding of speech. It was hypothesized that better temporal and frequency encoding of speech can be accomplished by higher stimulation rates and a larger number of activated channels. Vowel and consonant recognition tests were conducted on normal-hearing subjects using speech tokens processed by the acoustic model, with different combinations of stimulation rate and number of activated channels. The results showed that vowel recognition was best at 600 pps and 8 activated channels, but further increases in stimulation rate and ...

  1. Modeling the development of pronunciation in infant speech acquisition.

    OpenAIRE

    Howard, IS; Messum, P

    2011-01-01

    Pronunciation is an important part of speech acquisition, but little attention has been given to the mechanism or mechanisms by which it develops. Speech sound qualities, for example, have just been assumed to develop by simple imitation. In most accounts this is then assumed to be by acoustic matching, with the infant comparing his output to that of his caregiver. There are theoretical and empirical problems with both of these assumptions, and we present a computational model, Elija, that does not learn to pronounce speech sounds this way.

  2. Multilingual Phoneme Models for Rapid Speech Processing System Development

    Science.gov (United States)

    2006-09-01

    ... clusters. It was found that multilingual bootstrapping methods outperform monolingual English bootstrapping methods on the Arabic evaluation data. ... 2.3.1 International Phonetic Alphabet; 2.3.2 Multilingual vs. Monolingual Speech Recognition; 2.3.3 Data-Driven Approaches ... one set of models and monolingual speech recognition systems that draw from a multilingual training space. The second approach is the focus of this ...

  3. Predicting speech intelligibility based on the signal-to-noise envelope power ratio after modulation-frequency selective processing

    DEFF Research Database (Denmark)

    Jørgensen, Søren; Dau, Torsten

    2011-01-01

    A model for predicting the intelligibility of processed noisy speech is proposed. The speech-based envelope power spectrum model has a structure similar to that of the model of Ewert and Dau [(2000). J. Acoust. Soc. Am. 108, 1181-1196], developed to account for modulation detection and masking data. The model estimates the speech-to-noise envelope power ratio, SNRenv, at the output of a modulation filterbank and relates this metric to speech intelligibility using the concept of an ideal observer. Predictions were compared to data on the intelligibility of speech presented in stationary speech-shaped noise. The model was further tested in conditions with noisy speech subjected to reverberation and spectral subtraction. Good agreement between predictions and data was found in all cases. For spectral subtraction, an analysis of the model's internal representation of the stimuli revealed ...

  4. Hearing aid measurements with speech and noise signals

    DEFF Research Database (Denmark)

    Dyrlund, Ole; Ludvigsen, Carl; Olofsson, Åke

    1994-01-01

    Speech and noise signals were used to evaluate three different broad-band measuring methods. The results revealed that these methods were meaningful in estimating the average frequency response obtained with a specific input signal, but none of the three methods used in the study was able to evaluate separately the effects of the most important signal modifications: memoryless non-linearity such as peak clipping, time-varying gain from AGC, and additive internal noise.

  5. Modeling the development of pronunciation in infant speech acquisition.

    Science.gov (United States)

    Howard, Ian S; Messum, Piers

    2011-01-01

    Pronunciation is an important part of speech acquisition, but little attention has been given to the mechanism or mechanisms by which it develops. Speech sound qualities, for example, have just been assumed to develop by simple imitation. In most accounts this is then assumed to be by acoustic matching, with the infant comparing his output to that of his caregiver. There are theoretical and empirical problems with both of these assumptions, and we present a computational model, Elija, that does not learn to pronounce speech sounds this way. Elija starts by exploring the sound-making capabilities of his vocal apparatus. Then he uses the natural responses he gets from a caregiver to learn equivalence relations between his vocal actions and his caregiver's speech. We show that Elija progresses from a babbling stage to learning the names of objects. This demonstrates the viability of a non-imitative mechanism in learning to pronounce.

  6. Improving on hidden Markov models: An articulatorily constrained, maximum likelihood approach to speech recognition and speech coding

    Energy Technology Data Exchange (ETDEWEB)

    Hogden, J.

    1996-11-05

    The goal of the proposed research is to test a statistical model of speech recognition that incorporates the knowledge that speech is produced by relatively slow motions of the tongue, lips, and other speech articulators. This model is called Maximum Likelihood Continuity Mapping (Malcom). Many speech researchers believe that by using constraints imposed by articulator motions, we can improve or replace the current hidden Markov model based speech recognition algorithms. Unfortunately, previous efforts to incorporate information about articulation into speech recognition algorithms have suffered because (1) slight inaccuracies in our knowledge or the formulation of our knowledge about articulation may decrease recognition performance, (2) small changes in the assumptions underlying models of speech production can lead to large changes in the speech derived from the models, and (3) collecting measurements of human articulator positions in sufficient quantity for training a speech recognition algorithm is still impractical. The most interesting (and in fact, unique) quality of Malcom is that, even though Malcom makes use of a mapping between acoustics and articulation, Malcom can be trained to recognize speech using only acoustic data. By learning the mapping between acoustics and articulation using only acoustic data, Malcom avoids the difficulties involved in collecting articulator position measurements and does not require an articulatory synthesizer model to estimate the mapping between vocal tract shapes and speech acoustics. Preliminary experiments that demonstrate that Malcom can learn the mapping between acoustics and articulation are discussed. Potential applications of Malcom aside from speech recognition are also discussed. Finally, specific deliverables resulting from the proposed research are described.

  7. Modeling Co-evolution of Speech and Biology.

    Science.gov (United States)

    de Boer, Bart

    2016-04-01

    Two computer simulations are investigated that model interaction of cultural evolution of language and biological evolution of adaptations to language. Both are agent-based models in which a population of agents imitates each other using realistic vowels. The agents evolve under selective pressure for good imitation. In one model, the evolution of the vocal tract is modeled; in the other, a cognitive mechanism for perceiving speech accurately is modeled. In both cases, biological adaptations to using and learning speech evolve, even though the system of speech sounds itself changes at a more rapid time scale than biological evolution. However, the fact that the available acoustic space is used maximally (a self-organized result of cultural evolution) is constant, and therefore biological evolution does have a stable target. This work shows that when cultural and biological traits are continuous, their co-evolution may lead to cognitive adaptations that are strong enough to detect empirically.

  8. Neural Spike-Train Analyses of the Speech-Based Envelope Power Spectrum Model

    Directory of Open Access Journals (Sweden)

    Varsha H. Rallapalli

    2016-10-01

    Full Text Available Diagnosing and treating hearing impairment is challenging because people with similar degrees of sensorineural hearing loss (SNHL) often have different speech-recognition abilities. The speech-based envelope power spectrum model (sEPSM) has demonstrated that the signal-to-noise ratio (SNRENV) from a modulation filter bank provides a robust speech-intelligibility measure across a wider range of degraded conditions than many long-standing models. In the sEPSM, noise (N) is assumed to: (a) reduce S + N envelope power by filling in dips within clean speech (S) and (b) introduce an envelope noise floor from intrinsic fluctuations in the noise itself. While the promise of SNRENV has been demonstrated for normal-hearing listeners, it has not been thoroughly extended to hearing-impaired listeners because of limited physiological knowledge of how SNHL affects speech-in-noise envelope coding relative to noise alone. Here, envelope coding to speech-in-noise stimuli was quantified from auditory-nerve model spike trains using shuffled correlograms, which were analyzed in the modulation-frequency domain to compute modulation-band estimates of neural SNRENV. Preliminary spike-train analyses show strong similarities to the sEPSM, demonstrating the feasibility of neural SNRENV computations. Results suggest that individual differences can occur based on differential degrees of outer- and inner-hair-cell dysfunction in listeners currently diagnosed into the single audiological SNHL category. The predicted acoustic-SNR dependence in individual differences suggests that the SNR-dependent rate of susceptibility could be an important metric in diagnosing individual differences. Future measurements of the neural SNRENV in animal studies with various forms of SNHL will provide valuable insight for understanding individual differences in speech-in-noise intelligibility.

  9. Searching-and-averaging method of underdetermined blind speech signal separation in time domain

    Institute of Scientific and Technical Information of China (English)

    2007-01-01

    Underdetermined blind signal separation (BSS), with fewer observed mixtures than sources, is discussed. A novel searching-and-averaging method in the time domain (SAMTD) is proposed. It can solve a class of problems that is very hard to solve using sparse representation in the frequency domain. Bypassing the disadvantages of traditional clustering (e.g., K-means or potential-function clustering), the durative sparsity of a speech signal in the time domain is used. To recover the mixing matrix, our method deletes those samples that do not lie in the same or the opposite direction of the basis vectors. To recover the sources, an improved geometric approach to overcomplete ICA (Independent Component Analysis) is presented. Several speech signal experiments demonstrate the good performance of the proposed method.

  10. Speech perception of sine-wave signals by children with cochlear implants.

    Science.gov (United States)

    Nittrouer, Susan; Kuess, Jamie; Lowenstein, Joanna H

    2015-05-01

    Children need to discover linguistically meaningful structures in the acoustic speech signal. Being attentive to recurring, time-varying formant patterns helps in that process. However, that kind of acoustic structure may not be available to children with cochlear implants (CIs), thus hindering development. The major goal of this study was to examine whether children with CIs are as sensitive to time-varying formant structure as children with normal hearing (NH) by asking them to recognize sine-wave speech. The same materials were presented as speech in noise, as well, to evaluate whether any group differences might simply reflect general perceptual deficits on the part of children with CIs. Vocabulary knowledge, phonemic awareness, and "top-down" language effects were all also assessed. Finally, treatment factors were examined as possible predictors of outcomes. Results showed that children with CIs were as accurate as children with NH at recognizing sine-wave speech, but poorer at recognizing speech in noise. Phonemic awareness was related to that recognition. Top-down effects were similar across groups. Having had a period of bimodal stimulation near the time of receiving a first CI facilitated these effects. Results suggest that children with CIs have access to the important time-varying structure of vocal-tract formants.

  11. Perceptual Wavelet packet transform based Wavelet Filter Banks Modeling of Human Auditory system for improving the intelligibility of voiced and unvoiced speech: A Case Study of a system development

    OpenAIRE

    Ranganadh Narayanam

    2015-01-01

    The objective of this project is to discuss a versatile speech enhancement method based on the human auditory model. The project describes a speech enhancement scheme that meets the demand for quality noise reduction algorithms capable of operating at a very low signal-to-noise ratio. We discuss how the proposed speech enhancement system is capable of reducing noise with little speech degradation in diverse noise environments. In this model, to reduce the resi...

  12. Speech emotion recognition based on statistical pitch model

    Institute of Scientific and Technical Information of China (English)

    WANG Zhiping; ZHAO Li; ZOU Cairong

    2006-01-01

    A modified Parzen-window method, which keeps high resolution at low frequencies and smoothness at high frequencies, is proposed to obtain the statistical pitch model. Then, a gender classification method utilizing the statistical model is proposed, which achieves 98% accuracy in gender classification when long sentences are dealt with. After separating male and female voices, the means and standard deviations of the pitch of speech training samples with different emotions are used to create the corresponding emotion models. The Bhattacharyya distance between the test sample and the statistical pitch models is then utilized for emotion recognition in speech. Normalization of pitch for male and female voices is also considered, in order to map them into a uniform space. Finally, a speech emotion recognition experiment based on K Nearest Neighbor shows that a correct rate of 81% is achieved, compared with only 73.85% when the traditional parameters are utilized.
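
    The abstract leaves the distance computation implicit; for univariate Gaussian pitch models, the Bhattacharyya distance has a standard closed form, sketched below. The means and variances are made-up placeholders, not values from the paper.

    ```python
    import numpy as np

    def bhattacharyya_gaussian(mu1, var1, mu2, var2):
        """Closed-form Bhattacharyya distance between two 1-D Gaussians."""
        return (0.25 * (mu1 - mu2) ** 2 / (var1 + var2)
                + 0.5 * np.log((var1 + var2) / (2.0 * np.sqrt(var1 * var2))))

    # Hypothetical emotion models: (mean pitch in Hz, pitch variance).
    models = {"neutral": (120.0, 15.0 ** 2), "angry": (180.0, 35.0 ** 2)}
    test_mu, test_var = 170.0, 30.0 ** 2   # statistics of an unknown utterance
    scores = {emo: bhattacharyya_gaussian(test_mu, test_var, mu, var)
              for emo, (mu, var) in models.items()}
    print(min(scores, key=scores.get))     # smallest distance -> "angry"
    ```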

  13. Enhancing Speech Recognition Using Improved Particle Swarm Optimization Based Hidden Markov Model

    Directory of Open Access Journals (Sweden)

    Lokesh Selvaraj

    2014-01-01

    Full Text Available Enhancing speech recognition is the primary intention of this work. In this paper a novel speech recognition method based on vector quantization and improved particle swarm optimization (IPSO) is suggested. The suggested methodology contains four stages, namely, (i) denoising, (ii) feature extraction, (iii) vector quantization, and (iv) the IPSO-based hidden Markov model (HMM) technique (IP-HMM). At first, the speech signals are denoised using a median filter. Next, features such as peak, pitch spectrum, Mel frequency cepstral coefficients (MFCC), mean, standard deviation, and minimum and maximum of the signal are extracted from the denoised signal. Following that, to accomplish the training process, the extracted features are given to genetic-algorithm-based codebook generation in vector quantization. The initial populations are created by selecting random code vectors from the training set for the codebooks, and IP-HMM performs the recognition; the novelty is introduced in the crossover step of the genetic operations. The proposed speech recognition technique achieves 97.14% accuracy.

  14. Khon Kaen: a community-based speech therapy model for an area lacking in speech services for clefts.

    Science.gov (United States)

    Prathaneel, Benjamas; Makarabhirom, Kalyanee; Jaiyong, Pechcharat; Pradubwong, Suteera

    2014-09-01

    Absence of speech rehabilitation services is one of the critical difficulties in care for clefts in Thailand and some other developing countries. The objective of this study was to determine the effectiveness of the "Khon Kaen Community-Based Speech Therapy Model" in decreasing the number of articulation defects in children with cleft palate and/or lip. Sixteen children with cleft palate and/or lip in 6 districts of Maha Sarakham Province were enrolled in the study. A three-day intensive speech camp was held in Srinagarind Hospital and followed by an outreach program of six one-day follow-up speech camps in Maha Sarakham Hospital. Six paraprofessionals, speech assistants, provided home- or community-based speech correction every week for one year. Numbers of various articulation errors were compared pre- and post-treatment using the Wilcoxon signed-rank test. The number of articulation defects showed a statistically significant reduction (mean difference = 10; Z = -3.52). The "Khon Kaen Community-Based Speech Therapy Model" is one of the best models for solving speech therapy problems in areas of Thailand lacking speech services and can be applied to other developing countries.

  15. Self-Modelling as a Relapse Intervention Following Speech-Restructuring Treatment for Stuttering

    Science.gov (United States)

    Cream, Angela; O'Brian, Sue; Onslow, Mark; Packman, Ann; Menzies, Ross

    2009-01-01

    Background: Speech restructuring is an efficacious method for the alleviation of stuttered speech. However, post-treatment relapse is common. Aims: To investigate whether the use of video self-modelling using restructured stutter-free speech reduces stuttering in adults who had learnt a speech-restructuring technique and subsequently relapsed.…

  16. Modeling speech intelligibility in quiet and noise in listeners with normal and impaired hearing

    NARCIS (Netherlands)

    K.S. Rhebergen; J. Lyzenga; W.A. Dreschler; J.M. Festen

    2010-01-01

    The speech intelligibility index (SII) is an often-used calculation method for estimating the proportion of audible speech in noise. For speech reception thresholds (SRTs), measured in normally hearing listeners using various types of stationary noise, this model predicts a fairly constant speech proportion…

  17. Improvement for Speech Signal based on Post Wiener Filter and Adjustable Beam-Former

    Directory of Open Access Journals (Sweden)

    Xiaorong Tong

    2013-06-01

    Full Text Available In this study, a two-stage filter structure is introduced for speech enhancement. The first stage is an adjustable filter-and-sum beamformer with a four-microphone array. The beamforming filter is controlled by adjusting only a single variable. Unlike an adaptive beamforming filter, the proposed structure does not introduce adaptation error noise and therefore does not burden the second stage of the speech signal processing. The second stage is a Wiener filter. The signal power spectrum for the Wiener filter is estimated by cross-correlation between the primary outputs of two adjacent directional beams. This estimation rests on the assumption that the noise outputs of the two adjacent directional beams come from independent noise sources, while the speech outputs come from the same speech source. Simulation results show that the proposed algorithm can improve the Signal-to-Noise Ratio (SNR) by about 6 dB.
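
    A rough sketch of the two-stage idea under stated assumptions: a delay-and-sum stand-in for the adjustable filter-and-sum beamformer (the steering angle acts as the single control variable), and a Wiener post-filter whose speech PSD estimate comes from the cross-spectrum of two adjacent beam outputs. Function names and parameter values are illustrative, not the authors' implementation.

    ```python
    import numpy as np

    def delay_and_sum(x, fs, d=0.05, c=343.0, theta=0.0):
        """Steer a 4-mic linear array (rows of x, spacing d metres) toward
        angle theta (rad) with integer-sample delays, then average."""
        delays = np.arange(x.shape[0]) * d * np.sin(theta) / c      # seconds
        shifts = np.round(delays * fs).astype(int)
        return np.mean([np.roll(ch, -k) for ch, k in zip(x, shifts)], axis=0)

    def wiener_postfilter(beam_a, beam_b, nfft=512, hop=256):
        """Post Wiener filter: the speech PSD is estimated from the
        cross-spectrum of two adjacent beams (speech assumed common to
        both beams, noise assumed independent across them)."""
        w = np.hanning(nfft)
        out = np.zeros(len(beam_a))
        wsum = np.full(len(beam_a), 1e-12)
        for i in range(0, len(beam_a) - nfft, hop):
            A = np.fft.rfft(w * beam_a[i:i + nfft])
            B = np.fft.rfft(w * beam_b[i:i + nfft])
            gain = np.abs(A * np.conj(B)) / np.maximum(np.abs(A) ** 2, 1e-12)
            out[i:i + nfft] += w * np.fft.irfft(np.clip(gain, 0.0, 1.0) * A)
            wsum[i:i + nfft] += w ** 2                # overlap-add weights
        return out / wsum
    ```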

  18. Sub-Audible Speech Recognition Based upon Electromyographic Signals

    Science.gov (United States)

    Jorgensen, Charles C. (Inventor); Lee, Diana D. (Inventor); Agabon, Shane T. (Inventor)

    2012-01-01

    Method and system for processing and identifying a sub-audible signal formed by a source of sub-audible sounds. Sequences of samples of sub-audible sound patterns ("SASPs") for known words/phrases in a selected database are received for overlapping time intervals, and Signal Processing Transforms ("SPTs") are formed for each sample, as part of a matrix of entry values. The matrix is decomposed into contiguous, non-overlapping two-dimensional cells of entries, and neural net analysis is applied to estimate reference sets of weight coefficients that provide sums with optimal matches to reference sets of values. The reference sets of weight coefficients are used to determine a correspondence between a new (unknown) word/phrase and a word/phrase in the database.

  19. Refining a model of hearing impairment using speech psychophysics

    DEFF Research Database (Denmark)

    Jepsen, Morten Løve; Dau, Torsten; Ghitza, Oded

    2014-01-01

    The premise of this study is that models of hearing, in general, and of individual hearing impairment, in particular, can be improved by using speech test results as an integral part of the modeling process. A conceptual iterative procedure is presented which, for an individual, considers measures...

  1. Prediction of speech intelligibility based on an auditory preprocessing model

    DEFF Research Database (Denmark)

    Christiansen, Claus Forup Corlin; Pedersen, Michael Syskind; Dau, Torsten

    2010-01-01

    A speech-in-noise experiment was used for training and an ideal binary mask experiment was used for evaluation. All three models were able to capture the trends in the speech-in-noise training data well, but the proposed model provides a better prediction of the binary mask test data, particularly when the binary masks degenerate to a noise vocoder....

  2. Intonational Boundaries, Speech Repairs and Discourse Markers Modeling Spoken Dialog

    CERN Document Server

    Heeman, Peter A.; Allen, James F.

    1997-01-01

    To understand a speaker's turn of a conversation, one needs to segment it into intonational phrases, clean up any speech repairs that might have occurred, and identify discourse markers. In this paper, we argue that these problems must be resolved together, and that they must be resolved early in the processing stream. We put forward a statistical language model that resolves these problems, does POS tagging, and can be used as the language model of a speech recognizer. We find that by accounting for the interactions between these tasks, the performance on each task improves, as do POS tagging accuracy and perplexity.

  3. A multi-resolution envelope-power based model for speech intelligibility.

    Science.gov (United States)

    Jørgensen, Søren; Ewert, Stephan D; Dau, Torsten

    2013-07-01

    The speech-based envelope power spectrum model (sEPSM) presented by Jørgensen and Dau [(2011). J. Acoust. Soc. Am. 130, 1475-1487] estimates the envelope power signal-to-noise ratio (SNRenv) after modulation-frequency selective processing. Changes in this metric were shown to account well for changes of speech intelligibility for normal-hearing listeners in conditions with additive stationary noise, reverberation, and nonlinear processing with spectral subtraction. In the latter condition, the standardized speech transmission index [(2003). IEC 60268-16] fails. However, the sEPSM is limited to conditions with stationary interferers, due to the long-term integration of the envelope power, and cannot account for increased intelligibility typically obtained with fluctuating maskers. Here, a multi-resolution version of the sEPSM is presented where the SNRenv is estimated in temporal segments with a modulation-filter dependent duration. The multi-resolution sEPSM is demonstrated to account for intelligibility obtained in conditions with stationary and fluctuating interferers, and noisy speech distorted by reverberation or spectral subtraction. The results support the hypothesis that the SNRenv is a powerful objective metric for speech intelligibility prediction.
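
    A toy rendering of the SNRenv metric itself, omitting the model's peripheral (gammatone) filterbank and the modulation-filter-dependent segment durations that make it multi-resolution. The band edges and envelope sampling rate below are assumptions.

    ```python
    import numpy as np
    from scipy.signal import hilbert, butter, sosfilt

    def mod_band_power(x, fs, f_lo, f_hi, fs_env=1000):
        """AC envelope power in one modulation band, normalised by the
        squared mean envelope (DC power), following the sEPSM definition."""
        env = np.abs(hilbert(x))
        env = env[::max(int(fs // fs_env), 1)]        # crude downsampling
        sos = butter(2, [f_lo, f_hi], btype='band', fs=fs_env, output='sos')
        ac = sosfilt(sos, env - env.mean())
        return np.mean(ac ** 2) / max(env.mean() ** 2, 1e-12)

    def snr_env(mix, noise, fs, bands=((1, 2), (2, 4), (4, 8), (8, 16))):
        """Per-band SNRenv: excess envelope power of speech+noise over the
        noise-alone envelope floor, relative to that floor."""
        out = []
        for lo, hi in bands:
            p_mix, p_n = (mod_band_power(s, fs, lo, hi) for s in (mix, noise))
            out.append(max(p_mix - p_n, 1e-12) / max(p_n, 1e-12))
        return np.array(out)
    ```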

  4. Feature extraction and models for speech: An overview

    Science.gov (United States)

    Schroeder, Manfred

    2002-11-01

    Modeling of speech has a long history, beginning with Count von Kempelen's 1770 mechanical speaking machine. Even then human vowel production was seen as resulting from a source (the vocal cords) driving a physically separate resonator (the vocal tract). Homer Dudley's 1928 frequency-channel vocoder and many of its descendants are based on the same successful source-filter paradigm. For linguistic studies as well as practical applications in speech recognition, compression, and synthesis (see M. R. Schroeder, Computer Speech), the extant models require the (often difficult) extraction of numerous parameters such as the fundamental and formant frequencies and various linguistic distinctive features. Some of these difficulties were obviated by the introduction of linear predictive coding (LPC) in 1967, in which the filter part is an all-pole filter, reflecting the fact that for non-nasalized vowels the vocal tract is well approximated by an all-pole transfer function. In the now ubiquitous code-excited linear prediction (CELP), the source part is replaced by a code book which (together with a perceptual error criterion) permits speech compression to very low bit rates at high speech quality for the Internet and cell phones.
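
    The all-pole (LPC) analysis described here is concrete enough to sketch. Below is the standard autocorrelation method with the Levinson-Durbin recursion; the roots of the resulting A(z) approximate formant poles for non-nasalized vowels. Frame windowing and model order are conventional choices, not prescriptions from this overview.

    ```python
    import numpy as np

    def lpc(frame, order=12):
        """LPC coefficients a (with a[0] = 1) and residual energy via the
        autocorrelation method and the Levinson-Durbin recursion."""
        x = frame * np.hamming(len(frame))
        r = np.correlate(x, x, mode='full')[len(x) - 1:len(x) + order]
        a = np.zeros(order + 1)
        a[0], err = 1.0, r[0]
        for i in range(1, order + 1):
            k = -(r[i] + np.dot(a[1:i], r[i - 1:0:-1])) / err  # reflection coeff
            a[1:i] = a[1:i] + k * a[i - 1:0:-1]                # update predictor
            a[i] = k
            err *= 1.0 - k * k                                 # shrink error
        return a, err
    ```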

  5. Speech encoding strategies for multielectrode cochlear implants: a digital signal processor approach.

    Science.gov (United States)

    Dillier, N; Bögli, H; Spillmann, T

    1993-01-01

    The following processing strategies have been implemented on an experimental laboratory system of a cochlear implant digital speech processor (CIDSP) for the Nucleus 22-channel cochlear prosthesis. The first approach (PES, Pitch Excited Sampler) is based on the maximum peak channel vocoder concept whereby the time-varying spectral energy of a number of frequency bands is transformed into electrical stimulation parameters for up to 22 electrodes. The pulse rate at any electrode is controlled by the voice pitch of the input speech signal. The second approach (CIS, Continuous Interleaved Sampler) uses a stimulation pulse rate which is independent of the input signal. The algorithm continuously scans all specified frequency bands (typically between four and 22) and samples their energy levels. As only one electrode can be stimulated at any instant in time, the maximally achievable rate of stimulation is limited by the required stimulus pulse widths (determined individually for each subject) and some additional constraints and parameters. A number of variations of the CIS approach have, therefore, been implemented which either maximize the number of quasi-simultaneous stimulation channels or the pulse rate on a reduced number of electrodes. Evaluation experiments with five experienced cochlear implant users showed significantly better performance in consonant identification tests with the new processing strategies than with the subjects' own wearable speech processors; improvements in vowel identification tasks were rarely observed. Modifications of the basic PES and CIS strategies resulted in large variations of identification scores. Information transmission analysis of confusion matrices revealed a rather complex pattern across conditions and speech features. Optimization and fine-tuning of processing parameters for these coding strategies will require more data both from speech identification and discrimination evaluations and from psychophysical experiments.

  6. Noise Reduction in Car Speech

    OpenAIRE

    V. Bolom

    2009-01-01

    This paper presents properties of chosen multichannel algorithms for speech enhancement in a noisy environment. These methods are suitable for hands-free communication in a car cabin. Criteria for evaluation of these systems are also presented. The criteria consider both the level of noise suppression and the level of speech distortion. The performance of multichannel algorithms is investigated for a mixed model of speech signals and car noise and for real signals recorded in a car. 

  7. Noise Reduction in Car Speech

    Directory of Open Access Journals (Sweden)

    V. Bolom

    2009-01-01

    Full Text Available This paper presents properties of chosen multichannel algorithms for speech enhancement in a noisy environment. These methods are suitable for hands-free communication in a car cabin. Criteria for evaluation of these systems are also presented. The criteria consider both the level of noise suppression and the level of speech distortion. The performance of multichannel algorithms is investigated for a mixed model of speech signals and car noise and for real signals recorded in a car. 

  8. Sub-vocal speech pattern recognition of Hindi alphabet with surface electromyography signal

    Directory of Open Access Journals (Sweden)

    Munna Khan

    2016-09-01

    Full Text Available Recently, electromyography (EMG)-based speech signals have been used for phoneme pattern recognition, vocal frequency estimation, browser interfaces, and classification of speech-related problems. Attempts have been made to use the EMG signal for sub-vocal speech pattern recognition of the Hindi phonemes /ka/, /kha/, /ga/, /gha/ and of Hindi words, so as to provide commands sub-vocally to control devices. Sub-vocal EMG data were collected from more than 10 healthy subjects aged between 25 and 30 years. The EMG-based sub-vocal database was acquired with a four-channel BIOPAC MP-30 acquisition system, with four pairs of Ag-AgCl electrodes placed on the skin of the participant's neck area. AR coefficients and cepstral coefficients were computed as features of the EMG-based sub-vocal signal. Furthermore, these features were classified by an HMM classifier; the H2M MATLAB toolbox was used to develop the HMM classifier for the classification of phonemes. Results were averaged over 10 subjects. The average classification accuracy for /ka/ was found to be 85%, whereas the classification accuracy for /kha/ and /gha/ was between 88% and 90%. The classification accuracy for /ga/ was found to be 78%, lower than for /kha/ and /gha/.

  9. Annotating Speech Corpus for Prosody Modeling in Indian Language Text to Speech Systems

    Directory of Open Access Journals (Sweden)

    Kiruthiga S

    2012-01-01

    Full Text Available A spoken language system, whether a speech synthesis or a speech recognition system, starts with building a speech corpus. We give a detailed survey of issues and a methodology for selecting the appropriate speech unit when building a speech corpus for Indian-language Text to Speech systems. The paper ultimately aims to improve the intelligibility of the synthesized speech in Text to Speech synthesis systems. To begin with, an appropriate text file should be selected for building the speech corpus. Then a corresponding speech file is generated and stored; this speech file is the phonetic representation of the selected text file. The speech file is processed at different levels, viz., paragraphs, sentences, phrases, words, syllables, and phones, which are called the speech units of the file. Research has been done taking each of these units as the basic unit for processing. This paper analyses the research done using phones, diphones, triphones, syllables, and polysyllables as the basic unit for speech synthesis, and also provides a recommended set of combinations for polysyllables. Concatenative speech synthesis involves the concatenation of these basic units to synthesize intelligible, natural-sounding speech. The speech units are annotated with relevant prosodic information about each unit, manually or automatically, based on an algorithm. The database consisting of the units along with their annotated information is called the annotated speech corpus. A clustering technique used in the annotated speech corpus provides a way to select the appropriate unit for concatenation, based on the lowest total join cost of the speech unit.

  10. Shortlist B: A Bayesian model of continuous speech recognition

    NARCIS (Netherlands)

    Norris, D.; McQueen, J.M.

    2008-01-01

    A Bayesian model of continuous speech recognition is presented. It is based on Shortlist (D. Norris, 1994; D. Norris, J. M. McQueen, A. Cutler, & S. Butterfield, 1997) and shares many of its key assumptions: parallel competitive evaluation of multiple lexical hypotheses, phonologically abstract prelexical and lexical representations…

  11. Shortlist B: A Bayesian Model of Continuous Speech Recognition

    Science.gov (United States)

    Norris, Dennis; McQueen, James M.

    2008-01-01

    A Bayesian model of continuous speech recognition is presented. It is based on Shortlist (D. Norris, 1994; D. Norris, J. M. McQueen, A. Cutler, & S. Butterfield, 1997) and shares many of its key assumptions: parallel competitive evaluation of multiple lexical hypotheses, phonologically abstract prelexical and lexical representations, a feedforward…

  12. Development and testing of an audio forensic software for enhancing speech signals masked by loud music

    Science.gov (United States)

    Dobre, Robert A.; Negrescu, Cristian; Stanomir, Dumitru

    2016-12-01

    In many situations audio recordings can decide the fate of a trial when accepted as evidence. But before they can be taken into account, they must first be authenticated, and the quality of the targeted content (speech in most cases) must be good enough to remove any doubt. In this scope, two main directions of multimedia forensics come into play: content authentication and noise reduction. This paper presents an application of the latter. If someone wanted to conceal a conversation, the easiest way to do it would be to turn up the nearest audio system. In this situation, if a microphone were placed close by, the recorded signal would be apparently useless because the speech signal would be masked by the loud music signal. The paper proposes an adaptive-filter-based solution to remove the musical content from the previously described signal mixture in order to recover the masked vocal signal. Two adaptive filtering algorithms were tested in the proposed solution: Normalised Least Mean Squares (NLMS) and Recursive Least Squares (RLS). Their performances in the described situation were evaluated using Simulink, compared, and included in the paper.
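
    When a clean reference of the masking music is available (e.g., the original track), the NLMS variant is easy to sketch as an adaptive noise canceller. The filter order and step size below are assumptions, not the paper's settings.

    ```python
    import numpy as np

    def nlms_cancel(primary, reference, order=64, mu=0.5, eps=1e-8):
        """primary = speech + filtered music; reference = music alone.
        The adaptive FIR learns the music path; the error is the speech."""
        w = np.zeros(order)
        out = np.zeros(len(primary))
        for n in range(order, len(primary)):
            u = reference[n - order:n][::-1]     # most recent reference taps
            e = primary[n] - w @ u               # cancel the predicted music
            w += mu * e * u / (eps + u @ u)      # normalised LMS update
            out[n] = e
        return out
    ```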

  13. The Effectiveness of Clear Speech as a Masker

    Science.gov (United States)

    Calandruccio, Lauren; Van Engen, Kristin; Dhar, Sumitrajit; Bradlow, Ann R.

    2010-01-01

    Purpose: It is established that speaking clearly is an effective means of enhancing intelligibility. Because any signal-processing scheme modeled after known acoustic-phonetic features of clear speech will likely affect both target and competing speech, it is important to understand how speech recognition is affected when a competing speech signal…

  14. Novel speech signal processing algorithms for high-accuracy classification of Parkinson's disease.

    Science.gov (United States)

    Tsanas, Athanasios; Little, Max A; McSharry, Patrick E; Spielman, Jennifer; Ramig, Lorraine O

    2012-05-01

    There has been considerable recent research into the connection between Parkinson's disease (PD) and speech impairment. Recently, a wide range of speech signal processing algorithms (dysphonia measures) aiming to predict PD symptom severity using speech signals have been introduced. In this paper, we test how accurately these novel algorithms can be used to discriminate PD subjects from healthy controls. In total, we compute 132 dysphonia measures from sustained vowels. Then, we select four parsimonious subsets of these dysphonia measures using four feature selection algorithms, and map these feature subsets to a binary classification response using two statistical classifiers: random forests and support vector machines. We use an existing database consisting of 263 samples from 43 subjects, and demonstrate that these new dysphonia measures can outperform state-of-the-art results, reaching almost 99% overall classification accuracy using only ten dysphonia features. We find that some of the recently proposed dysphonia measures complement existing algorithms in maximizing the ability of the classifiers to discriminate healthy controls from PD subjects. We see these results as an important step toward noninvasive diagnostic decision support in PD.
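
    A sketch of the classification stage with scikit-learn. The random arrays below are placeholders for the real 263 x 132 dysphonia feature table (without which the reported ~99% is of course not reproducible), and the forest's impurity-based importance ranking stands in for the paper's dedicated feature-selection algorithms.

    ```python
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(0)
    X = rng.normal(size=(263, 132))        # placeholder dysphonia measures
    y = rng.integers(0, 2, size=263)       # placeholder labels: 1 = PD

    clf = RandomForestClassifier(n_estimators=500, random_state=0)
    acc_all = cross_val_score(clf, X, y, cv=10).mean()

    # Keep the ten most informative measures (stand-in feature selection).
    top10 = np.argsort(clf.fit(X, y).feature_importances_)[::-1][:10]
    acc_top10 = cross_val_score(clf, X[:, top10], y, cv=10).mean()
    print(acc_all, acc_top10)
    ```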

  15. Performance Comparison of Sub Phonetic Model with Input Signal Processing

    Directory of Open Access Journals (Sweden)

    Dr E. Ramaraj

    2006-01-01

    Full Text Available The quest for a better model of signal transformation for speech has resulted in efforts to develop better signal representations and algorithms. This article explores the word model, a concatenation of state-dependent senones, as an alternative to the phoneme. The research aims to combine the senone with Input Signal Processing (ISP), an algorithm that has been tried with phonemes quite successfully, to compare the performance of senones with ISP against phonemes with ISP, and to supply the result analysis. The research model uses the SPHINX IV [4] speech engine for its implementation owing to its flexibility toward new algorithms, its robustness, and performance considerations.

  16. Multiscale Signal Analysis and Modeling

    CERN Document Server

    Zayed, Ahmed

    2013-01-01

    Multiscale Signal Analysis and Modeling presents recent advances in multiscale analysis and modeling using wavelets and other systems. This book also presents applications in digital signal processing using sampling theory and techniques from various function spaces, filter design, feature extraction and classification, signal and image representation/transmission, coding, nonparametric statistical signal processing, and statistical learning theory. This book also: Discusses recently developed signal modeling techniques, such as the multiscale method for complex time series modeling, multiscale positive density estimations, Bayesian Shrinkage Strategies, and algorithms for data adaptive statistics Introduces new sampling algorithms for multidimensional signal processing Provides comprehensive coverage of wavelets with presentations on waveform design and modeling, wavelet analysis of ECG signals and wavelet filters Reviews features extraction and classification algorithms for multiscale signal and image proce...

  17. Detection of Voice Pathology using Fractal Dimension in a Multiresolution Analysis of Normal and Disordered Speech Signals.

    Science.gov (United States)

    Ali, Zulfiqar; Elamvazuthi, Irraivan; Alsulaiman, Mansour; Muhammad, Ghulam

    2016-01-01

    Voice disorders are associated with irregular vibrations of the vocal folds. Based on the source-filter theory of speech production, these irregular vibrations can be detected in a non-invasive way by analyzing the speech signal. In this paper we present a multiband approach for the detection of voice disorders, given that the voice source generally interacts with the vocal tract in a non-linear way. In normal phonation, and assuming sustained phonation of a vowel, the lower frequencies of speech are heavily source-dependent due to the low-frequency glottal formant, while the higher frequencies are less dependent on the source signal. This still holds during abnormal phonation, but turbulent noise at the source, caused by the irregular vibration, also affects the higher frequencies. Motivated by such a model, we suggest a multiband approach based on a three-level discrete wavelet transformation (DWT), where in each band the fractal dimension (FD) of the estimated power spectrum is computed. The experiments suggest that the frequency band 1-1562 Hz (the lower frequencies after level 3) exhibits a significant difference between the spectra of normal and pathological subjects. With this band, a detection rate of 91.28% is obtained with one feature, higher than for all other frequency bands. Moreover, an accuracy of 92.45% and an area under the receiver operating characteristic curve (AUC) of 95.06% are acquired when the FD of all levels is fused. Likewise, when the FD of all levels is combined with 22 Multi-Dimensional Voice Program (MDVP) parameters, an improvement of 2.26% in accuracy and 1.45% in AUC is observed.
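
    A sketch of the described feature chain: a three-level DWT, then the fractal dimension of each band's estimated power spectrum. Katz's FD estimator is used as a stand-in, since the abstract does not name the estimator; the wavelet choice is also an assumption.

    ```python
    import numpy as np
    import pywt

    def katz_fd(curve):
        """Katz fractal dimension of a 1-D curve."""
        L = np.abs(np.diff(curve)).sum()          # total curve length
        d = np.abs(curve - curve[0]).max()        # max spread from start
        n = len(curve) - 1
        return np.log10(n) / (np.log10(n)
                              + np.log10(max(d, 1e-12) / max(L, 1e-12)))

    def band_fractal_dims(frame, wavelet='db4', level=3):
        """FD of the power spectrum of every DWT band (the approximation
        band, i.e. the lowest frequencies, comes first in pywt's output)."""
        return np.array([katz_fd(np.abs(np.fft.rfft(c)) ** 2)
                         for c in pywt.wavedec(frame, wavelet, level=level)])
    ```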

  18. Modeling auditory processing and speech perception in hearing-impaired listeners

    DEFF Research Database (Denmark)

    Jepsen, Morten Løve

    A better understanding of how the human auditory system represents and analyzes sounds and how hearing impairment affects such processing is of great interest for researchers in the fields of auditory neuroscience, audiology, and speech communication as well as for applications in hearing-instrument and speech technology. In this thesis, the primary focus was on the development and evaluation of a computational model of human auditory signal-processing and perception. The model was initially designed to simulate the normal-hearing auditory system with particular focus on the nonlinear processing...... aimed at experimentally characterizing the effects of cochlear damage on listeners' auditory processing, in terms of sensitivity loss and reduced temporal and spectral resolution. The results showed that listeners with comparable audiograms can have very different estimated cochlear input...

  19. Mapping from Speech to Images Using Continuous State Space Models

    DEFF Research Database (Denmark)

    Lehn-Schiøler, Tue; Hansen, Lars Kai; Larsen, Jan

    2005-01-01

    In this paper a system that transforms speech waveforms to animated faces is proposed. The system relies on continuous state space models to perform the mapping, which makes it possible to ensure video with no sudden jumps and allows continuous control of the parameters in 'face space'...... The performance of the system is critically dependent on the number of hidden variables: with too few variables the model cannot represent data, and with too many, overfitting is noticed. Simulations are performed on recordings of 3-5 sec. video sequences with sentences from the Timit database. From a subjective point of view the model is able to construct an image sequence from an unknown noisy speech sequence even though the number of training examples is limited....

  20. Speech signal denoising with wavelet-transforms and the mean opinion score characterizing the filtering quality

    Science.gov (United States)

    Yaseen, Alauldeen S.; Pavlov, Alexey N.; Hramov, Alexander E.

    2016-03-01

    Speech signal processing is widely used to reduce noise impact in acquired data. During the last decades, wavelet-based filtering techniques are often applied in communication systems due to their advantages in signal denoising as compared with Fourier-based methods. In this study we consider applications of a 1-D double density complex wavelet transform (1D-DDCWT) and compare the results with the standard 1-D discrete wavelet transform (1D-DWT). The performances of the considered techniques are compared using the mean opinion score (MOS) being the primary metric for the quality of the processed signals. A two-dimensional extension of this approach can be used for effective image denoising.
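
    The double-density complex transform is not in standard toolkits, but the baseline 1D-DWT denoising it is compared against can be sketched with PyWavelets: soft thresholding of the detail coefficients with the universal threshold. Wavelet family and depth are assumptions.

    ```python
    import numpy as np
    import pywt

    def wavelet_denoise(x, wavelet='db8', level=4):
        """Classic 1D-DWT denoising: estimate the noise level from the
        finest detail band, soft-threshold all detail bands, reconstruct."""
        coeffs = pywt.wavedec(x, wavelet, level=level)
        sigma = np.median(np.abs(coeffs[-1])) / 0.6745     # robust noise std
        thr = sigma * np.sqrt(2.0 * np.log(len(x)))        # universal threshold
        coeffs[1:] = [pywt.threshold(c, thr, mode='soft') for c in coeffs[1:]]
        return pywt.waverec(coeffs, wavelet)[:len(x)]
    ```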

  1. A MAP Criterion for Detecting the Number of Speakers at frame level in Model-based Single-Channel Speech Separation

    DEFF Research Database (Denmark)

    Mowlaee, Pejman; Christensen, Mads Græsbøll; Tan, Zheng-Hua

    2010-01-01

    The problem of detecting the number of speakers for a particular segment occurs in many different speech applications. In single channel speech separation, for example, this information is often used to simplify the separation process, as the signal has to be treated differently depending on the number of speakers. Inspired by the asymptotic maximum a posteriori rule proposed for model selection, we pose the problem as a model selection problem. More specifically, we derive a multiple hypotheses test for determining the number of speakers at a frame level in an observed signal based on underlying parametric speaker models, trained a priori. The experimental results indicate that the suggested method improves the quality of the separated signals in a single-channel speech separation scenario at different signal-to-signal ratio levels.

  2. Esophageal aerodynamics in an idealized experimental model of tracheoesophageal speech

    Science.gov (United States)

    Erath, Byron D.; Hemsing, Frank S.

    2016-03-01

    Flow behavior is investigated in the esophageal tract in an idealized experimental model of tracheoesophageal speech. The tracheoesophageal prosthesis is idealized as a first-order approximation using a straight, constant diameter tube. The flow is scaled according to Reynolds, Strouhal, and Euler numbers to ensure dynamic similarity. Flow pulsatility is produced by a driven orifice that approximates the kinematics of the pharyngoesophageal segment during tracheoesophageal speech. Particle image velocimetry data are acquired in three orthogonal planes as the flow exits the model prosthesis and enters the esophageal tract. Contrary to prior investigations performed in steady flow with the prosthesis oriented in-line with the flow direction, the fluid dynamics are shown to be highly unsteady, suggesting that the esophageal pressure field will be similarly complex. A large vortex ring is formed at the inception of each phonatory cycle, followed by the formation of a persistent jet. This vortex ring appears to remain throughout the entire cycle due to the continued production of vorticity resulting from entrainment between the prosthesis jet and the curved esophageal walls. Mean flow in the axial direction of the esophagus produces significant stretching of the vortex throughout the phonatory cycle. The stagnation point created by the jet impinging on the esophageal wall varies throughout the cycle due to fluctuations in the jet trajectory, which most likely arises due to flow separation within the model prosthesis. Applications to tracheoesophageal speech, including shortcomings of the model and proposed future plans, are discussed.

  3. Algorithms, hardware, and software for a digital signal processor microcomputer-based speech processor in a multielectrode cochlear implant system.

    Science.gov (United States)

    Morris, L R; Barszczewski, P

    1989-06-01

    Software and hardware have been developed to create a powerful, inexpensive, compact digital signal processing system which in real time extracts a low-bit-rate linear predictive coding (LPC) speech system model. The model parameters derived include accurate spectral envelope, formant, pitch, and amplitude information. The system is based on the Texas Instruments TMS320 family, and the most compact realization requires only three chips (TMS320E17, A/D-D/A, op-amp), consuming a total of less than 0.5 W. The processor is part of a programmable cochlear implant system under development by a multiuniversity Canadian team, but also has other applications in aids for the hearing handicapped.

  4. Hybrid model decomposition of speech and noise in a radial basis function neural model framework

    DEFF Research Database (Denmark)

    Sørensen, Helge Bjarup Dissing; Hartmann, Uwe

    1994-01-01

    The approach applied is based on a combination of the hidden Markov model (HMM) decomposition method, for speech recognition in noise, developed by Varga and Moore (1990) from DRA and the hybrid (HMM/RBF) recognizer containing hidden Markov models and radial basis function (RBF) neural networks, developed by Singer and Lippmann (1992) from MIT Lincoln Lab. The present authors modified the hybrid recognizer to fit into the decomposition method to achieve high performance speech recognition in noisy environments. The approach has been denoted the hybrid model decomposition method and it provides an optimal method for decomposition of speech and noise by using a set of speech pattern models and a noise model(s), each realized as an HMM/RBF pattern model...

  5. 4D time-frequency representation for binaural speech signal processing

    Science.gov (United States)

    Mikhael, Raed; Szu, Harold H.

    2006-04-01

    Hearing is the ability to detect and process auditory information, produced by the vibrating hair cilia residing in the corti of the ears and relayed to the auditory cortex of the brain via the auditory nerve. The primary and secondary auditory cortices of the brain interact with one another to distinguish and correlate the received information by distinguishing the varying spectrum of arriving frequencies. Binaural hearing is nature's way of employing the power inherent in working in pairs to process information, enhance sound perception, and reduce undesired noise. One ear might play a prominent role in sound recognition, while the other reinforces their perceived mutual information. Developing binaural hearing aid devices can be crucial in emulating the working powers of two ears and may be a step closer to significantly alleviating hearing loss of the inner ear. This can be accomplished by combining current speech research with already existing technologies such as RF communication between PDAs and Bluetooth. The Ear Level Instrument (ELI) developed by Micro-tech Hearing Instruments and Starkey Laboratories is a good example of a digital bi-directional signal communicating between a PDA/mobile phone and Bluetooth. Agreement and disagreement of arriving auditory information at the Bluetooth device can be classified as sound and noise, respectively. By finding common features of arriving sound using a four-coordinate system for sound analysis (a four-dimensional time-frequency representation), noise can be greatly reduced and hearing aids would become more efficient. Techniques developed by Szu within Artificial Neural Networks (ANN), Blind Source Separation (BSS), the Adaptive Wavelet Transform (AWT), and Independent Component Analysis (ICA) hold many possibilities for the improvement of acoustic segmentation of phonemes, all of which will be discussed in this paper. The transmitted and perceived acoustic speech signal will improve, as the binaural hearing aid will emulate two ears in sound…

  6. The Use of Model Speeches: The Research Base for Live, Taped, and Written Speeches as Models for Improving Public Speaking Skills.

    Science.gov (United States)

    Friedrich, Gustav W.

    A review of the role of theory and research in the teaching of public speaking reveals that although speech models have been an important pedagogical tool since the beginning of systematic instruction in public speaking, research investigating the value of model speeches is limited. A 1966 survey of 861 instructors in public speaking indicated…

  7. Dimension-based quality modeling of transmitted speech

    CERN Document Server

    Wältermann, Marcel

    2013-01-01

    In this book, speech transmission quality is modeled on the basis of perceptual dimensions. The author identifies those dimensions that are relevant for today's public-switched and packet-based telecommunication systems, regarding the complete transmission path from the mouth of the speaker to the ear of the listener. Both narrowband (300-3400 Hz) as well as wideband (50-7000 Hz) speech transmission is taken into account. A new analytical assessment method is presented that allows the dimensions to be rated by non-expert listeners in a direct way. Due to the efficiency of the test method, a relatively large number of stimuli can be assessed in auditory tests. The test method is applied in two auditory experiments. The book gives the evidence that this test method provides meaningful and reliable results. The resulting dimension scores together with respective overall quality ratings form the basis for a new parametric model for the quality estimation of transmitted speech based on the perceptual dimensions. I...

  8. Modeling consonant-vowel coarticulation for articulatory speech synthesis.

    Directory of Open Access Journals (Sweden)

    Peter Birkholz

    Full Text Available A central challenge for articulatory speech synthesis is the simulation of realistic articulatory movements, which is critical for the generation of highly natural and intelligible speech. This includes modeling coarticulation, i.e., the context-dependent variation of the articulatory and acoustic realization of phonemes, especially of consonants. Here we propose a method to simulate the context-sensitive articulation of consonants in consonant-vowel syllables. To achieve this, the vocal tract target shape of a consonant in the context of a given vowel is derived as the weighted average of three measured and acoustically-optimized reference vocal tract shapes for that consonant in the context of the corner vowels /a/, /i/, and /u/. The weights are determined by mapping the target shape of the given context vowel into the vowel subspace spanned by the corner vowels. The model was applied for the synthesis of consonant-vowel syllables with the consonants /b/, /d/, /g/, /l/, /r/, /m/, /n/ in all combinations with the eight long German vowels. In a perception test, the mean recognition rate for the consonants in the isolated syllables was 82.4%. This demonstrates the potential of the approach for highly intelligible articulatory speech synthesis.
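
    The weighting scheme is concrete enough to sketch: express the context vowel in the subspace spanned by the corner vowels and reuse those weights to blend the three measured consonant shapes. The array shapes below are assumptions about how the shapes are parameterized, not the authors' data format.

    ```python
    import numpy as np

    def consonant_target(refs_aiu, corners_aiu, vowel):
        """refs_aiu:    (3, m) consonant tract shapes measured in /a/, /i/, /u/
        corners_aiu: (3, d) coordinates of /a/, /i/, /u/ in vowel space
        vowel:       (d,)   coordinates of the context vowel
        Returns the (m,) blended consonant target shape."""
        A = np.vstack([corners_aiu.T, np.ones(3)])   # affine/barycentric system
        b = np.append(vowel, 1.0)
        w, *_ = np.linalg.lstsq(A, b, rcond=None)    # weights summing to ~1
        return w @ refs_aiu
    ```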

  9. The Combined Effect of Signal Strength and Background Traffic Load on Speech Quality in IEEE 802.11 WLAN

    Directory of Open Access Journals (Sweden)

    P. Pocta

    2011-04-01

    Full Text Available This paper deals with measurements of the combined effect of signal strength and background traffic load on speech quality in IEEE 802.11 WLAN. The ITU-T G.729AB encoding scheme is deployed in this study and the Distributed Internet Traffic Generator (D-ITG) is used for the purpose of background traffic generation. The speech quality and background traffic load are assessed by means of the established PESQ algorithm and the Wireshark network analyzer, respectively. The results show that background traffic load has a somewhat higher impact on speech quality than signal strength when both effects are present together. Moreover, background traffic load also partially masks the impact of signal strength. The reasons for these findings are discussed in detail. The results also suggest some implications for designers of wireless networks providing VoIP service.

  10. Modelling vocal anatomy's significant effect on speech

    NARCIS (Netherlands)

    de Boer, B.

    2010-01-01

    This paper investigates the effect of larynx position on the articulatory abilities of a humanlike vocal tract. Previous work has investigated models that were built to resemble the anatomy of existing species or fossil ancestors. This has led to conflicting conclusions about the relation between…

  11. A Simple Correlation-Based Model of Intelligibility for Nonlinear Speech Enhancement and Separation

    DEFF Research Database (Denmark)

    Boldt, Jesper; Ellis, Daniel P. W.

    2009-01-01

    Applying a binary mask to a pure noise signal can result in speech that is highly intelligible, despite the absence of any of the target speech signal. Therefore, to estimate the intelligibility benefit of highly nonlinear speech enhancement techniques, we contend that SNR is not useful; instead we propose a measure based on the similarity between the time-varying spectral envelopes of target speech and system output, as measured by correlation. As with previous correlation-based intelligibility measures, our system can broadly match subjective intelligibility for a range of enhanced signals. Our system, however, is notably simpler and we explain the practical motivation behind each stage. This measure, freely available as a small Matlab implementation, can provide a more meaningful evaluation measure for nonlinear speech enhancement systems, as well as providing a transparent objective…
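
    The core of the measure can be sketched directly: band envelopes from magnitude spectrograms of clean target and system output, correlated per band and averaged. Frame size, band count, and the linear band grouping are assumptions; the original is a Matlab implementation with its own stages.

    ```python
    import numpy as np
    from scipy.signal import stft

    def envelope_correlation(target, output, fs, nbands=15, nperseg=512):
        """Average per-band correlation between the time-varying spectral
        envelopes of clean target speech and enhanced system output."""
        _, _, T = stft(target, fs, nperseg=nperseg)
        _, _, O = stft(output, fs, nperseg=nperseg)
        edges = np.linspace(0, T.shape[0], nbands + 1).astype(int)
        rs = []
        for lo, hi in zip(edges[:-1], edges[1:]):
            et = np.abs(T[lo:hi]).mean(axis=0)    # target band envelope
            eo = np.abs(O[lo:hi]).mean(axis=0)    # output band envelope
            if et.std() > 0 and eo.std() > 0:
                rs.append(np.corrcoef(et, eo)[0, 1])
        return float(np.mean(rs))
    ```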

  12. On model architecture for a children's speech recognition interactive dialog system

    OpenAIRE

    Kraleva, Radoslava; Kralev, Velin

    2016-01-01

    This report presents a general model of the architecture of information systems for the speech recognition of children, together with a model of the speech data stream and how it works. The results of these studies and the presented architectural model show that research needs to be focused on acoustic-phonetic modeling in order to improve the quality of children's speech recognition and the robustness of such systems to noise and changes in the transmission environment. Another important aspe...

  13. The impact of compression of speech signal, background noise and acoustic disturbances on the effectiveness of speaker identification

    Science.gov (United States)

    Kamiński, K.; Dobrowolski, A. P.

    2017-04-01

    The paper presents the architecture and the results of optimization of selected elements of an Automatic Speaker Recognition (ASR) system that uses Gaussian Mixture Models (GMM) in the classification process. Optimization was performed on the selection of individual features, using a genetic algorithm, and on the parameters of the Gaussian distributions used to describe individual voices. The developed system was tested in order to evaluate the impact of different compression methods, used among others in landline, mobile, and VoIP telephony systems, on the effectiveness of speaker identification. Results are also presented for the effectiveness of speaker identification at specific levels of noise in the speech signal and in the presence of other disturbances that can appear during phone calls, which makes it possible to specify the range of applications of the presented ASR system.
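
    The GMM backbone of such a system is standard; here is a minimal enrollment/identification sketch with scikit-learn. Feature extraction and the genetic feature selection are out of scope, and the mixture size is an assumption.

    ```python
    from sklearn.mixture import GaussianMixture

    def train_speaker_models(features_per_speaker, n_components=16):
        """Fit one diagonal-covariance GMM per enrolled speaker on that
        speaker's feature frames (frames x dims, e.g. MFCC vectors)."""
        return {spk: GaussianMixture(n_components, covariance_type='diag',
                                     random_state=0).fit(frames)
                for spk, frames in features_per_speaker.items()}

    def identify(models, test_frames):
        """Maximum-likelihood decision: the speaker whose GMM yields the
        highest average per-frame log-likelihood."""
        return max(models, key=lambda spk: models[spk].score(test_frames))
    ```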

  14. Study of wavelet packet energy entropy for emotion classification in speech and glottal signals

    Science.gov (United States)

    He, Ling; Lech, Margaret; Zhang, Jing; Ren, Xiaomei; Deng, Lihua

    2013-07-01

    Automatic speech emotion recognition has important applications in human-machine communication. The majority of current research in this area is focused on finding optimal feature parameters. In recent studies, several glottal features were examined as potential cues for emotion differentiation. In this study, a new type of feature parameter is proposed, which calculates energy entropy on values within selected Wavelet Packet frequency bands. The modeling and classification tasks are conducted using the classical GMM algorithm. The experiments use two data sets: the Speech Under Simulated Emotion (SUSE) data set, annotated with three different emotions (angry, neutral and soft), and the Berlin Emotional Speech (BES) database, annotated with seven different emotions (angry, bored, disgust, fear, happy, sad and neutral). The average classification accuracy achieved for the SUSE data (74%-76%) is significantly higher than the accuracy achieved for the BES data (51%-54%). In both cases, the accuracy was significantly higher than the respective random guessing levels (33% for SUSE and 14.3% for BES).
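
    A sketch of the proposed feature under stated assumptions: the wavelet family, decomposition depth, and the number of energy segments per band below are illustrative, since the abstract fixes none of them.

    ```python
    import numpy as np
    import pywt

    def wp_energy_entropy(frame, wavelet='db4', level=4, nseg=8):
        """Energy entropy per wavelet-packet band: split each band's
        coefficients into segments, turn the segment energies into a
        probability distribution, and take its Shannon entropy."""
        wp = pywt.WaveletPacket(frame, wavelet, maxlevel=level)
        feats = []
        for node in wp.get_level(level, order='freq'):
            energies = np.array([np.sum(s ** 2) for s in
                                 np.array_split(node.data, nseg)]) + 1e-12
            p = energies / energies.sum()
            feats.append(-np.sum(p * np.log2(p)))
        return np.array(feats)      # one entropy value per frequency band
    ```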

  15. Subspace-Based Noise Reduction for Speech Signals via Diagonal and Triangular Matrix Decompositions: Survey and Analysis

    Directory of Open Access Journals (Sweden)

    Søren Holdt Jensen

    2007-01-01

    Full Text Available We survey the definitions and use of rank-revealing matrix decompositions in single-channel noise reduction algorithms for speech signals. Our algorithms are based on the rank-reduction paradigm and, in particular, signal subspace techniques. The focus is on practical working algorithms, using both diagonal (eigenvalue and singular value) decompositions and rank-revealing triangular decompositions (ULV, URV, VSV, ULLV, and ULLIV). In addition, we show how the subspace-based algorithms can be analyzed and compared by means of simple FIR filter interpretations. The algorithms are illustrated with working Matlab code and applications in speech processing.
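
    The simplest diagonal-decomposition member of this family can be sketched in a few lines (NumPy here rather than the paper's Matlab): embed a frame in a Hankel matrix, keep the strongest singular directions as the signal subspace, and average anti-diagonals back into a waveform. The embedding order and rank are assumptions.

    ```python
    import numpy as np
    from scipy.linalg import hankel

    def svd_enhance(frame, order=20, rank=8):
        """Rank-reduction noise reduction via truncated SVD of a Hankel
        embedding; H[i, j] = frame[i + j]."""
        H = hankel(frame[:order], frame[order - 1:])
        U, s, Vt = np.linalg.svd(H, full_matrices=False)
        Hr = (U[:, :rank] * s[:rank]) @ Vt[:rank]     # signal-subspace part
        out = np.zeros(len(frame))
        cnt = np.zeros(len(frame))
        for i in range(Hr.shape[0]):                  # anti-diagonal averaging
            for j in range(Hr.shape[1]):
                out[i + j] += Hr[i, j]
                cnt[i + j] += 1
        return out / cnt
    ```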

  16. New Results on Single-Channel Speech Separation Using Sinusoidal Modeling

    DEFF Research Database (Denmark)

    Mowlaee, Pejman; Christensen, Mads Græsbøll; Jensen, Søren Holdt

    2011-01-01

    We present new results on single-channel speech separation and suggest a new separation approach to improve the speech quality of separated signals from an observed mixture. The key idea is to derive a mixture estimator based on sinusoidal parameters. The proposed estimator is aimed at finding...

  17. Applying independent component analysis to detect silent speech in magnetic resonance imaging signals.

    Science.gov (United States)

    Abe, Kazuhiro; Takahashi, Toshimitsu; Takikawa, Yoriko; Arai, Hajime; Kitazawa, Shigeru

    2011-10-01

    Independent component analysis (ICA) can be usefully applied to functional imaging studies to evaluate the spatial extent and temporal profile of task-related brain activity. It requires no a priori assumptions about the anatomical areas that are activated or the temporal profile of the activity. We applied spatial ICA to detect a voluntary but hidden response of silent speech. To validate the method against a standard model-based approach, we used the silent speech of a tongue twister as a 'Yes' response to single questions that were delivered at given times. In the first task, we attempted to estimate one number that was chosen by a participant from 10 possibilities. In the second task, we increased the possibilities to 1000. In both tasks, spatial ICA was as effective as the model-based method for determining the number in the subject's mind (80-90% correct per digit), but spatial ICA outperformed the model-based method in terms of time, especially in the 1000-possibility task. In the model-based method, calculation time increased by 30-fold, to 15 h, because of the necessity of testing 1000 possibilities. In contrast, the calculation time for spatial ICA remained as short as 30 min. In addition, spatial ICA detected an unexpected response that occurred by mistake. This advantage was validated in a third task, with 13 500 possibilities, in which participants had the freedom to choose when to make one of four responses. We conclude that spatial ICA is effective for detecting the onset of silent speech, especially when it occurs unexpectedly.

  18. Speech communications in noise

    Science.gov (United States)

    1984-07-01

    The physical characteristics of speech, the methods of speech masking measurement, and the effects of noise on speech communication are investigated. Topics include the speech signal and intelligibility, the effects of noise on intelligibility, the articulation index, and various devices for evaluating speech systems.

  19. Degraded neural and behavioral processing of speech sounds in a rat model of Rett syndrome.

    Science.gov (United States)

    Engineer, Crystal T; Rahebi, Kimiya C; Borland, Michael S; Buell, Elizabeth P; Centanni, Tracy M; Fink, Melyssa K; Im, Kwok W; Wilson, Linda G; Kilgard, Michael P

    2015-11-01

    Individuals with Rett syndrome have greatly impaired speech and language abilities. Auditory brainstem responses to sounds are normal, but cortical responses are highly abnormal. In this study, we used the novel rat Mecp2 knockout model of Rett syndrome to document the neural and behavioral processing of speech sounds. We hypothesized that both speech discrimination ability and the neural response to speech sounds would be impaired in Mecp2 rats. We expected that extensive speech training would improve speech discrimination ability and the cortical response to speech sounds. Our results reveal that speech responses across all four auditory cortex fields of Mecp2 rats were hyperexcitable, responded slower, and were less able to follow rapidly presented sounds. While Mecp2 rats could accurately perform consonant and vowel discrimination tasks in quiet, they were significantly impaired at speech sound discrimination in background noise. Extensive speech training improved discrimination ability. Training shifted cortical responses in both Mecp2 and control rats to favor the onset of speech sounds. While training increased the response to low frequency sounds in control rats, the opposite occurred in Mecp2 rats. Although neural coding and plasticity are abnormal in the rat model of Rett syndrome, extensive therapy appears to be effective. These findings may help to explain some aspects of communication deficits in Rett syndrome and suggest that extensive rehabilitation therapy might prove beneficial.

  20. Analytical Study on Fundamental Frequency Contours of Thai Expressive Speech Using Fujisaki's Model

    Directory of Open Access Journals (Sweden)

    Suphattharachai Chomphan

    2010-01-01

    Full Text Available Problem statement: In spontaneous speech communication, prosody is an important factor that must be taken into account, since prosody affects not only the naturalness but also the intelligibility of speech. Focusing on the synthesis of Thai expressive speech, a number of systems have been developed over the years. However, expressive speech with various speaking styles has not been accomplished. To achieve the generation of expressive speech, we need to model the fundamental frequency (F0) contours accurately to preserve the speech prosody. Approach: This study therefore proposes an analysis of model parameters for Thai speech prosody with three speaking styles and two genders, as preliminary work for speech synthesis. Fujisaki's model, a powerful tool for modeling the F0 contour, has been adopted, and the speaking styles of happiness, sadness and reading have been considered. Seven parameters derived from Fujisaki's model are as follows. The first parameter is the baseline frequency, the lowest level of the F0 contour. The second and third parameters are the numbers of phrase commands and tone commands, which reflect the frequencies of surges of the utterance at the global and local levels, respectively. The fourth and fifth parameters are the phrase command and tone command durations, which reflect the speed of speaking and the length of a syllable, respectively. The sixth and seventh parameters are the amplitudes of the phrase command and tone command, which reflect the energy of the global speech and the energy of the local syllable. Results: In the experiments, each speaking style includes 200 samples of one sentence with male and female speech, so our speech database contains 1200 utterances in total. The results show that most of the proposed parameters can distinguish the three speaking styles explicitly. Conclusion: From these findings, there is strong evidence to further apply the successful parameters in speech synthesis systems or…
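
    The seven parameters map directly onto the standard Fujisaki formulation, which is compact enough to state in code: log F0 is a baseline plus phrase components (impulse responses of a second-order system) and tone/accent components (step responses). The time constants alpha and beta and the ceiling gamma below are conventional defaults, not values fitted to Thai.

    ```python
    import numpy as np

    def fujisaki_f0(t, fb, phrases, tones, alpha=2.0, beta=20.0, gamma=0.9):
        """phrases: list of (onset T0, amplitude Ap);
        tones:   list of (onset T1, offset T2, amplitude Aa)."""
        def Gp(x):                                  # phrase control response
            x = np.maximum(x, 0.0)
            return alpha ** 2 * x * np.exp(-alpha * x)
        def Ga(x):                                  # accent control response
            x = np.maximum(x, 0.0)
            return np.minimum(1.0 - (1.0 + beta * x) * np.exp(-beta * x), gamma)
        ln_f0 = np.log(fb) * np.ones_like(t)        # baseline frequency Fb
        for t0, ap in phrases:
            ln_f0 += ap * Gp(t - t0)
        for t1, t2, aa in tones:
            ln_f0 += aa * (Ga(t - t1) - Ga(t - t2))
        return np.exp(ln_f0)

    t = np.linspace(0.0, 2.0, 400)
    f0 = fujisaki_f0(t, fb=110.0, phrases=[(0.0, 0.5)], tones=[(0.3, 0.6, 0.3)])
    ```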

  1. A Hybrid Approach for Co-Channel Speech Segregation based on CASA, HMM Multipitch Tracking, and Medium Frame Harmonic Model

    Directory of Open Access Journals (Sweden)

    Ashraf M. Mohy Eldin

    2013-08-01

    This paper proposes a hybrid approach for co-channel speech segregation. A hidden Markov model (HMM) is used to track the pitches of two talkers. The resulting pitch tracks are then enriched with the prominent pitch and correctly grouped using pitch continuity. Medium-frame harmonics are used to extract the second pitch for frames in which the previous steps deduced only one pitch. Finally, the pitch tracks are input to computational auditory scene analysis (CASA) to segregate the mixed speech. The center-frequency range of the gammatone filterbank is maximized to reduce overlap between the filtered channels for better segregation. Experiments were conducted with this hybrid approach on the speech separation challenge database and compared to the single (non-hybrid) approaches, i.e., signal processing and CASA. Results show that the hybrid approach outperforms the single approaches.
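
    The gammatone filterbank used as the CASA front end here (and in the sparse coding work noted earlier in these records) can be sketched directly from its impulse response. The fourth-order filter, the Glasberg-Moore ERB formula, and ERB-rate-spaced center frequencies below are standard textbook choices, not parameters taken from this paper.

```python
import numpy as np

def erb(fc):
    """Equivalent rectangular bandwidth (Glasberg & Moore) at center frequency fc (Hz)."""
    return 24.7 * (4.37 * fc / 1000.0 + 1.0)

def gammatone_ir(fc, fs, duration=0.05, order=4):
    """Impulse response g(t) = t^(n-1) exp(-2 pi b ERB(fc) t) cos(2 pi fc t)."""
    t = np.arange(0, duration, 1.0 / fs)
    b = 1.019  # bandwidth scaling for a 4th-order gammatone
    g = t**(order - 1) * np.exp(-2 * np.pi * b * erb(fc) * t) * np.cos(2 * np.pi * fc * t)
    return g / np.max(np.abs(g))

def filterbank(x, fs, fmin=80.0, fmax=5000.0, n_channels=32):
    """Filter x through n_channels gammatone filters with ERB-rate-spaced centers."""
    # ERB-rate scale: E(f) = 21.4 log10(1 + 0.00437 f); space centers uniformly on it.
    e = np.linspace(21.4 * np.log10(1 + 0.00437 * fmin),
                    21.4 * np.log10(1 + 0.00437 * fmax), n_channels)
    centers = (10**(e / 21.4) - 1) / 0.00437
    return np.stack([np.convolve(x, gammatone_ir(fc, fs), mode='same') for fc in centers])
```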

  2. Some Interactions of Speech Rate, Signal Distortion, and Certain Linguistic Factors in Listening Comprehension. Professional Paper No. 39-68.

    Science.gov (United States)

    Sticht, Thomas G.

    This experiment was designed to determine the relative effects of speech rate and signal distortion due to the time-compression process on listening comprehension. In addition, linguistic factors--including sequencing of random words into story form, and inflection and phraseology--were qualitatively considered for their effects on listening…

  3. Models of calcium signalling

    CERN Document Server

    Dupont, Geneviève; Kirk, Vivien; Sneyd, James

    2016-01-01

    This book discusses the ways in which mathematical, computational, and modelling methods can be used to help understand the dynamics of intracellular calcium. The concentration of free intracellular calcium is vital for controlling a wide range of cellular processes, and is thus of great physiological importance. However, because of the complex ways in which the calcium concentration varies, it is also of great mathematical interest. This book presents the general modelling theory as well as a large number of specific case examples, to show how mathematical modelling can interact with experimental approaches, in an interdisciplinary and multifaceted approach to the study of an important physiological control mechanism. Geneviève Dupont is FNRS Research Director at the Unit of Theoretical Chronobiology of the Université Libre de Bruxelles; Martin Falcke is head of the Mathematical Cell Physiology group at the Max Delbrück Center for Molecular Medicine, Berlin; Vivien Kirk is an Associate Professor in the Depar...

  4. Application of an extended equalization-cancellation model to speech intelligibility with spatially distributed maskers.

    Science.gov (United States)

    Wan, Rui; Durlach, Nathaniel I; Colburn, H Steven

    2010-12-01

    An extended version of the equalization-cancellation (EC) model of binaural processing is described and applied to speech intelligibility tasks in the presence of multiple maskers. The model incorporates time-varying jitters, both in time and amplitude, and implements the equalization and cancellation operations in each frequency band independently. The model is consistent with the original EC model in predicting tone-detection performance for a large set of configurations. When the model is applied to speech, the speech intelligibility index is used to predict speech intelligibility performance in a variety of conditions. Specific conditions addressed include different types of maskers, different numbers of maskers, and different spatial locations of maskers. Model predictions are compared with empirical measurements reported by Hawley et al. [J. Acoust. Soc. Am. 115, 833-843 (2004)] and by Marrone et al. [J. Acoust. Soc. Am. 124, 1146-1158 (2008)]. The model succeeds in predicting speech intelligibility performance when maskers are speech-shaped noise or broadband-modulated speech-shaped noise but fails when the maskers are speech or reversed speech.
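
    The band-by-band equalization-cancellation step itself is simple: find the interaural delay and gain that best align the masker across the two ears, subtract, and let internal time and amplitude jitters limit the achievable cancellation. A minimal sketch follows; the jitter standard deviations are illustrative values in the spirit of the classical EC model, not parameters from this paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def ec_cancel(left, right, fs, max_delay_s=0.7e-3,
              time_jitter_s=100e-6, amp_jitter=0.25):
    """One-band equalization-cancellation: search the interaural delay and gain
    that minimize residual power after subtraction, then apply jittered values."""
    best_d, best_gain, best_power = 0, 1.0, np.inf
    for d in range(-int(max_delay_s * fs), int(max_delay_s * fs) + 1):
        r = np.roll(right, d)
        gain = np.dot(left, r) / (np.dot(r, r) + 1e-12)   # least-squares gain match
        power = np.mean((left - gain * r) ** 2)
        if power < best_power:
            best_d, best_gain, best_power = d, gain, power
    # Internal time/amplitude jitter limits the achievable cancellation.
    d_j = best_d + int(round(rng.normal(0.0, time_jitter_s * fs)))
    gain_j = best_gain * (1.0 + rng.normal(0.0, amp_jitter))
    return left - gain_j * np.roll(right, d_j)
```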

  5. Study on Acoustic Modeling in a Mandarin Continuous Speech Recognition

    Institute of Scientific and Technical Information of China (English)

    PENG Di; LIU Gang; GUO Jun

    2007-01-01

    The design of acoustic models is of vital importance to build a reliable connection between acoustic waveforms and linguistic messages in terms of individual speech units. According to the characteristics of Chinese phonemes, the set of base acoustic phoneme units is decided and refined, and a decision-tree-based state-tying approach is explored. Since one of the advantages of the top-down tying method is flexibility in maintaining a balance between model accuracy and complexity, relevant adjustments are conducted, such as to the stopping criterion of decision-tree node splitting, during which optimal thresholds are captured. Better results are achieved in improving acoustic modeling accuracy as well as minimizing the scale of the model to a trainable extent.

  6. Modeling speech intelligibility in quiet and noise in listeners with normal and impaired hearing.

    Science.gov (United States)

    Rhebergen, Koenraad S; Lyzenga, Johannes; Dreschler, Wouter A; Festen, Joost M

    2010-03-01

    The speech intelligibility index (SII) is an often used calculation method for estimating the proportion of audible speech in noise. For speech reception thresholds (SRTs), measured in normally hearing listeners using various types of stationary noise, this model predicts a fairly constant speech proportion of about 0.33, necessary for Dutch sentence intelligibility. However, when the SII model is applied for SRTs in quiet, the estimated speech proportions are often higher, and show a larger inter-subject variability, than found for speech in noise near normal speech levels [65 dB sound pressure level (SPL)]. The present model attempts to alleviate this problem by including cochlear compression. It is based on a loudness model for normally hearing and hearing-impaired listeners of Moore and Glasberg [(2004). Hear. Res. 188, 70-88]. It estimates internal excitation levels for speech and noise and then calculates the proportion of speech above noise and threshold using similar spectral weighting as used in the SII. The present model and the standard SII were used to predict SII values in quiet and in stationary noise for normally hearing and hearing-impaired listeners. The present model predicted SIIs for three listener types (normal hearing, noise-induced, and age-induced hearing loss) with markedly less variability than the standard SII.
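
    In its simplest form, the SII machinery referenced here reduces to a band-importance-weighted average of band audibilities, where each band's audibility is its speech-to-noise ratio clipped to [-15, +15] dB and mapped to [0, 1]. A sketch with uniform band-importance weights assumed for illustration (the ANSI S3.5 tables define the actual weights):

```python
import numpy as np

def sii(speech_spectrum_db, noise_spectrum_db, band_importance=None):
    """Simplified speech intelligibility index over frequency bands.

    speech_spectrum_db, noise_spectrum_db: per-band levels in dB.
    """
    snr = np.asarray(speech_spectrum_db) - np.asarray(noise_spectrum_db)
    audibility = np.clip((snr + 15.0) / 30.0, 0.0, 1.0)   # -15 dB -> 0, +15 dB -> 1
    if band_importance is None:
        band_importance = np.full(len(audibility), 1.0 / len(audibility))
    return float(np.dot(band_importance, audibility))

# Example: speech at a flat 65 dB SPL spectrum against a sloping noise.
speech = np.full(18, 65.0)
noise = np.linspace(70.0, 40.0, 18)
print(sii(speech, noise))
```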

  7. Speech Signal and Facial Image Processing for Obstructive Sleep Apnea Assessment

    Directory of Open Access Journals (Sweden)

    Fernando Espinoza-Cuadros

    2015-01-01

    Obstructive sleep apnea (OSA) is a common sleep disorder characterized by recurring breathing pauses during sleep caused by a blockage of the upper airway (UA). OSA is generally diagnosed through a costly procedure requiring an overnight stay of the patient at the hospital. This has led to proposals for less costly procedures based on the analysis of patients' facial images and voice recordings to help in OSA detection and severity assessment. In this paper we investigate the use of both image and speech processing to estimate the apnea-hypopnea index, AHI (which describes the severity of the condition), over a population of 285 male Spanish subjects suspected to suffer from OSA and referred to a Sleep Disorders Unit. Photographs and voice recordings were collected in a supervised but not highly controlled way, trying to test a scenario close to an OSA assessment application running on a mobile device (i.e., smartphones or tablets). Spectral information in speech utterances is modeled by a state-of-the-art low-dimensional acoustic representation, called an i-vector. A set of local craniofacial features related to OSA is extracted from images after detecting facial landmarks using Active Appearance Models (AAMs). Support vector regression (SVR) is applied to the facial features and i-vectors to estimate the AHI.
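
    The regression stage is straightforward once per-subject feature vectors are available. A minimal scikit-learn sketch of fitting SVR to concatenated speech (i-vector) and facial features follows; the array shapes and synthetic data are hypothetical.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

rng = np.random.default_rng(0)
n_subjects = 285
ivectors = rng.normal(size=(n_subjects, 400))      # speech i-vectors (hypothetical dim.)
facial = rng.normal(size=(n_subjects, 20))         # craniofacial measurements
ahi = rng.gamma(2.0, 10.0, size=n_subjects)        # apnea-hypopnea index targets

X = np.hstack([ivectors, facial])                  # feature-level fusion
model = make_pipeline(StandardScaler(), SVR(kernel='rbf', C=10.0, epsilon=1.0))
model.fit(X, ahi)
print(model.predict(X[:5]))
```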

  8. Statistic Model Based Dynamic Channel Compensation for Telephony Speech Recognition

    Institute of Scientific and Technical Information of China (English)

    ZHANGHuayun; HANZhaobing; XUBo

    2004-01-01

    The degradation of speech recognition performance in real-life environments and over transmission channels is a major obstacle for many speech-based applications around the world, especially when nonstationary noise and changing channels exist. Previous work has shown that the main reason for this performance degradation is the variational mismatch caused by different telephone channels between the testing and training sets. In this paper, we propose a statistic-model-based implementation to dynamically compensate for this mismatch. Firstly, we focus on a maximum-likelihood (ML) estimation algorithm for telephone channels. In experiments on Mandarin large-vocabulary continuous speech recognition (LVCSR) over telephone lines, the character error rate (CER) decreases by more than 20%, with an average delay of about 300-400 ms. Secondly, we extend it by introducing a phone-conditioned prior statistic model for the channels and applying maximum a posteriori (MAP) estimation. Compared to the ML-based method, the MAP-based algorithm tracks variations within channels more effectively, and the average delay decreases to 200 ms. An additional 7-8% relative CER reduction is observed in LVCSR.
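
    In the cepstral domain, a linear channel appears as an additive bias, so the simplest ML-style channel estimate from the test data alone is a running cepstral mean, subtracted dynamically; this is the skeleton that the paper's ML and MAP estimators refine. A sketch (the smoothing constant is an assumption, not a value from the paper):

```python
import numpy as np

def dynamic_channel_compensation(cepstra, smoothing=0.995):
    """Online channel compensation: subtract a recursively updated estimate of the
    additive cepstral bias introduced by the telephone channel.

    cepstra: (n_frames, n_coeffs) matrix of cepstral features from the front end.
    """
    channel = np.zeros(cepstra.shape[1])
    compensated = np.empty_like(cepstra)
    for t, frame in enumerate(cepstra):
        channel = smoothing * channel + (1.0 - smoothing) * frame  # running mean estimate
        compensated[t] = frame - channel
    return compensated
```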

  9. Speech Emotion Recognition Using Fuzzy Logic Classifier

    Directory of Open Access Journals (Sweden)

    Daniar aghsanavard

    2016-01-01

    Over the last two decades, emotion, speech recognition, and signal processing have been among the most significant areas for developing detection techniques, each method having its own advantages and disadvantages. This paper proposes fuzzy speech emotion recognition based on the classification of speech signals, aiming at better recognition together with higher speed. The system uses a five-layer fuzzy logic classifier that combines a progressive neural network with firefly-algorithm optimization: speech samples are first given to the fuzzy input layer, and the signals are then analyzed and given a preliminary classification within the fuzzy framework. In this model, a pattern is created for each class of signals, which reduces the dimensionality of the signal data and eases speech recognition. The experimental results show that the proposed method (with firefly-based categorization) improves the recognition of utterances.

  10. A signal subspace dimension estimator based on F-norm with application to subspace-based multi-channel speech enhancement

    Institute of Scientific and Technical Information of China (English)

    LI Chao; LIU Wenju

    2012-01-01

    Although the signal subspace approach has been studied extensively for speech enhancement, no good solution has been found for identifying the signal subspace dimension in the multi-channel situation. This paper presents a signal subspace dimension estimator based on the F-norm of the correlation matrix, with which subspace-based multi-channel speech enhancement is robust to adverse acoustic environments such as room reverberation and low input signal-to-noise ratio (SNR). Experiments demonstrate that the presented method leads to more noise reduction and less speech distortion compared with traditional methods.
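
    The paper's exact F-norm criterion is not reproduced in the abstract, but the idea can be illustrated as picking the smallest number of eigenvalues of the correlation matrix whose energy accounts for a target fraction of the matrix's squared Frobenius norm (which, for a symmetric matrix, equals the sum of squared eigenvalues). The threshold below is an assumption.

```python
import numpy as np

def subspace_dimension(frames, energy_fraction=0.95):
    """Estimate the signal subspace dimension from the correlation matrix.

    frames: (n_observations, n_dims) matrix of (multi-channel) signal vectors.
    """
    R = frames.T @ frames / frames.shape[0]           # sample correlation matrix
    eigvals = np.sort(np.linalg.eigvalsh(R))[::-1]    # descending eigenvalues
    # ||R||_F^2 = sum of squared eigenvalues; keep the fewest reaching the target.
    cumulative = np.cumsum(eigvals**2) / np.sum(eigvals**2)
    return int(np.searchsorted(cumulative, energy_fraction) + 1)
```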

  11. Compressed sensing for speech signal based on linear prediction analysis

    Institute of Scientific and Technical Information of China (English)

    王红柱; 陈砚圃; 高悦; 王浩

    2012-01-01

    Based on the features of speech signals, this paper presents a new sparse domain for speech, a synthesis matrix built by linear prediction analysis, and verifies the sparsity of speech signals in this new domain. Observations are constructed from the speech signal and a random Gaussian matrix, and orthogonal matching pursuit (OMP) is used to reconstruct the original speech signal. Experiments demonstrate that speech recovered by compressed sensing based on linear prediction analysis outperforms reconstruction in the DCT domain, with higher segmental signal-to-noise ratio and mean opinion score.
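
    Orthogonal matching pursuit itself takes only a few lines: greedily select the dictionary atom most correlated with the residual, then refit all selected atoms by least squares. A generic sketch follows, with an arbitrary unit-norm dictionary D standing in for the paper's LP-based synthesis matrix.

```python
import numpy as np

def omp(D, y, n_nonzero):
    """Recover a sparse x with y ~= D @ x by orthogonal matching pursuit.

    D: (m, n) dictionary/sensing matrix with unit-norm columns; y: (m,) observation.
    """
    residual = y.copy()
    support = []
    x = np.zeros(D.shape[1])
    for _ in range(n_nonzero):
        correlations = np.abs(D.T @ residual)
        correlations[support] = 0.0                    # do not reselect atoms
        support.append(int(np.argmax(correlations)))
        coef, *_ = np.linalg.lstsq(D[:, support], y, rcond=None)
        residual = y - D[:, support] @ coef            # refit, then update residual
    x[support] = coef
    return x
```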

  12. A Novel Method for Classifying Body Mass Index on the Basis of Speech Signals for Future Clinical Applications: A Pilot Study

    Directory of Open Access Journals (Sweden)

    Bum Ju Lee

    2013-01-01

    Obesity is a serious public health problem because of its risk factors for disease and psychological problems. The focus of this study is to diagnose a patient's BMI (body mass index) status without weight and height measurements, for use in future clinical applications. In this paper, we first propose a method for classifying normal versus overweight status using only speech signals. We also perform a statistical analysis of the features derived from the speech signals. Based on 1830 subjects, the accuracy and AUC (area under the ROC curve) of age- and gender-specific classifications ranged from 60.4 to 73.8% and from 0.628 to 0.738, respectively. We identified several features that were significantly different between normal and overweight subjects (P < 0.05). We also found compact and discriminatory feature subsets for building models for diagnosing normal or overweight individuals through wrapper-based feature subset selection. Our results showed that predicting BMI status is possible using a combination of speech features, even though significant features are rare and weak in age- and gender-specific groups, and that the classification accuracy with feature selection was higher than without it. Our method has the potential to be used in future clinical applications such as automatic BMI diagnosis in telemedicine or remote healthcare.

  13. Classifying acoustic signals into phoneme categories: average and dyslexic readers make use of complex dynamical patterns and multifractal scaling properties of the speech signal

    Directory of Open Access Journals (Sweden)

    Fred Hasselman

    2015-03-01

    Several competing aetiologies of developmental dyslexia suggest that the problems with acquiring literacy skills are causally entailed by low-level auditory and/or speech perception processes. The purpose of this study is to evaluate the diverging claims about the specific deficient perceptual processes under conditions of strong inference. Theoretically relevant acoustic features were extracted from a set of artificial speech stimuli that lie on a /bAk/-/dAk/ continuum. The features were tested on their ability to enable a simple classifier (Quadratic Discriminant Analysis) to reproduce the observed classification performance of average and dyslexic readers in a speech perception experiment. The 'classical' features examined were based on component-process accounts of developmental dyslexia, such as the supposed deficit in Envelope Rise Time detection and the deficit in the detection of rapid changes in the distribution of energy in the frequency spectrum (formant transitions). Studies examining these temporal processing deficit hypotheses do not employ measures that quantify the temporal dynamics of stimuli. It is shown that measures based on quantification of the dynamics of complex, interaction-dominant systems (Recurrence Quantification Analysis and the multifractal spectrum) enable QDA to classify the stimuli almost identically as observed in dyslexic and average reading participants. It seems unlikely that participants used any of the features that are traditionally associated with accounts of (impaired) speech perception. The nature of the variables quantifying the temporal dynamics of the speech stimuli implies that the classification of speech stimuli cannot be regarded as a linear aggregate of component processes that each parse the acoustic signal independently of one another, as is assumed by the 'classical' aetiologies of developmental dyslexia. It is suggested that the results imply that the differences in speech perception

  14. Classifying acoustic signals into phoneme categories: average and dyslexic readers make use of complex dynamical patterns and multifractal scaling properties of the speech signal.

    Science.gov (United States)

    Hasselman, Fred

    2015-01-01

    Several competing aetiologies of developmental dyslexia suggest that the problems with acquiring literacy skills are causally entailed by low-level auditory and/or speech perception processes. The purpose of this study is to evaluate the diverging claims about the specific deficient perceptual processes under conditions of strong inference. Theoretically relevant acoustic features were extracted from a set of artificial speech stimuli that lie on a /bAk/-/dAk/ continuum. The features were tested on their ability to enable a simple classifier (Quadratic Discriminant Analysis) to reproduce the observed classification performance of average and dyslexic readers in a speech perception experiment. The 'classical' features examined were based on component process accounts of developmental dyslexia such as the supposed deficit in Envelope Rise Time detection and the deficit in the detection of rapid changes in the distribution of energy in the frequency spectrum (formant transitions). Studies examining these temporal processing deficit hypotheses do not employ measures that quantify the temporal dynamics of stimuli. It is shown that measures based on quantification of the dynamics of complex, interaction-dominant systems (Recurrence Quantification Analysis and the multifractal spectrum) enable QDA to classify the stimuli almost identically as observed in dyslexic and average reading participants. It seems unlikely that participants used any of the features that are traditionally associated with accounts of (impaired) speech perception. The nature of the variables quantifying the temporal dynamics of the speech stimuli implies that the classification of speech stimuli cannot be regarded as a linear aggregate of component processes that each parse the acoustic signal independent of one another, as is assumed by the 'classical' aetiologies of developmental dyslexia. It is suggested that the results imply that the differences in speech perception performance between average and
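
    Recurrence quantification analysis starts from a thresholded distance matrix of delay-embedded signal states; recurrence rate and determinism (the fraction of recurrent points lying on diagonal line structures) are the two most common measures. A minimal sketch, with embedding parameters chosen purely for illustration and the line of identity included for simplicity:

```python
import numpy as np

def embed(x, dim=3, delay=5):
    """Time-delay embedding of a 1-D signal into dim-dimensional state vectors."""
    n = len(x) - (dim - 1) * delay
    return np.stack([x[i * delay : i * delay + n] for i in range(dim)], axis=1)

def rqa(x, dim=3, delay=5, radius=0.1):
    """Recurrence rate and determinism of the signal's recurrence plot."""
    states = embed(x, dim, delay)
    dist = np.linalg.norm(states[:, None, :] - states[None, :, :], axis=-1)
    R = dist < radius * np.std(x)                     # recurrence matrix
    n = len(R)
    diag_points = 0
    for k in range(-(n - 1), n):                      # every diagonal of the plot
        line = np.diagonal(R, offset=k).astype(int)
        padded = np.concatenate(([0], line, [0]))
        edges = np.flatnonzero(np.diff(padded))       # run starts/ends come in pairs
        lengths = edges[1::2] - edges[0::2]
        diag_points += lengths[lengths >= 2].sum()    # points on lines of length >= 2
    recurrence_rate = R.mean()
    determinism = diag_points / max(R.sum(), 1)
    return recurrence_rate, determinism
```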

  15. Particle Swarm Optimization Based Feature Enhancement and Feature Selection for Improved Emotion Recognition in Speech and Glottal Signals

    Science.gov (United States)

    Muthusamy, Hariharan; Polat, Kemal; Yaacob, Sazali

    2015-01-01

    In recent years, many research works have been published using speech-related features for speech emotion recognition; however, recent studies show that there is a strong correlation between emotional states and glottal features. In this work, mel-frequency cepstral coefficients (MFCCs), linear predictive cepstral coefficients (LPCCs), perceptual linear predictive (PLP) features, gammatone filter outputs, timbral texture features, stationary wavelet transform based timbral texture features, and relative wavelet packet energy and entropy features were extracted from emotional speech (ES) signals and their glottal waveforms (GW). Particle swarm optimization based clustering (PSOC) and wrapper-based particle swarm optimization (WPSO) were proposed to enhance the discerning ability of the features and to select the discriminating features, respectively. Three different emotional speech databases were utilized to evaluate the proposed method. An extreme learning machine (ELM) was employed to classify the different types of emotions. Different experiments were conducted, and the results show that the proposed method significantly improves speech emotion recognition performance compared to previous works published in the literature. PMID:25799141

  16. Auditory-motor interactions in pediatric motor speech disorders: neurocomputational modeling of disordered development.

    Science.gov (United States)

    Terband, H; Maassen, B; Guenther, F H; Brumberg, J

    2014-01-01

    Differentiating the symptom complex due to phonological-level disorders, speech delay and pediatric motor speech disorders is a controversial issue in the field of pediatric speech and language pathology. The present study investigated the developmental interaction between neurological deficits in auditory and motor processes using computational modeling with the DIVA model. In a series of computer simulations, we investigated the effect of a motor processing deficit alone (MPD), and the effect of a motor processing deficit in combination with an auditory processing deficit (MPD+APD) on the trajectory and endpoint of speech motor development in the DIVA model. Simulation results showed that a motor programming deficit predominantly leads to deterioration on the phonological level (phonemic mappings) when auditory self-monitoring is intact, and on the systemic level (systemic mapping) if auditory self-monitoring is impaired. These findings suggest a close relation between quality of auditory self-monitoring and the involvement of phonological vs. motor processes in children with pediatric motor speech disorders. It is suggested that MPD+APD might be involved in typically apraxic speech output disorders and MPD in pediatric motor speech disorders that also have a phonological component. Possibilities to verify these hypotheses using empirical data collected from human subjects are discussed. The reader will be able to: (1) identify the difficulties in studying disordered speech motor development; (2) describe the differences in speech motor characteristics between SSD and subtype CAS; (3) describe the different types of learning that occur in the sensory-motor system during babbling and early speech acquisition; (4) identify the neural control subsystems involved in speech production; (5) describe the potential role of auditory self-monitoring in developmental speech disorders. Copyright © 2014 Elsevier Inc. All rights reserved.

  17. Mathematical Modelling Plant Signalling Networks

    KAUST Repository

    Muraro, D.

    2013-01-01

    During the last two decades, molecular genetic studies and the completion of the sequencing of the Arabidopsis thaliana genome have increased knowledge of hormonal regulation in plants. These signal transduction pathways act in concert through gene regulatory and signalling networks whose main components have begun to be elucidated. Our understanding of the resulting cellular processes is hindered by the complex, and sometimes counter-intuitive, dynamics of the networks, which may be interconnected through feedback controls and cross-regulation. Mathematical modelling provides a valuable tool to investigate such dynamics and to perform in silico experiments that may not be easily carried out in a laboratory. In this article, we firstly review general methods for modelling gene and signalling networks and their application in plants. We then describe specific models of hormonal perception and cross-talk in plants. This mathematical analysis of sub-cellular molecular mechanisms paves the way for more comprehensive modelling studies of hormonal transport and signalling in a multi-scale setting. © EDP Sciences, 2013.

  18. Relationship between speech recognition in noise and sparseness.

    Science.gov (United States)

    Li, Guoping; Lutman, Mark E; Wang, Shouyan; Bleeck, Stefan

    2012-02-01

    Established methods for predicting speech recognition in noise require knowledge of clean speech signals, placing limitations on their application. The study evaluates an alternative approach based on characteristics of noisy speech, specifically its sparseness as represented by the statistic kurtosis. Experiments 1 and 2 involved acoustic analysis of vowel-consonant-vowel (VCV) syllables in babble noise, comparing kurtosis, glimpsing areas, and extended speech intelligibility index (ESII) of noisy speech signals with one another and with pre-existing speech recognition scores. Experiment 3 manipulated kurtosis of VCV syllables and investigated effects on speech recognition scores in normal-hearing listeners. The study samples were pre-existing speech recognition data for Experiments 1 and 2, and seven normal-hearing participants for Experiment 3. Experiments 1 and 2 demonstrated that kurtosis calculated in the time domain from noisy speech is highly correlated (r > 0.98) with established prediction models: glimpsing and ESII. All three measures predicted speech recognition scores well. The final experiment showed a clear monotonic relationship between speech recognition scores and kurtosis. Speech recognition performance in noise is closely related to the sparseness (kurtosis) of the noisy speech signal, at least for the types of speech and noise used here and for listeners with normal hearing.
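
    Kurtosis as a sparseness statistic is computed directly on the noisy time-domain samples; a peakier, sparser amplitude distribution yields higher kurtosis and, per these results, predicts better recognition. A short sketch:

```python
import numpy as np
from scipy.stats import kurtosis

def speech_sparseness(noisy_speech, fs, frame_s=0.02):
    """Frame-wise excess kurtosis of a noisy speech waveform (time domain)."""
    frame_len = int(frame_s * fs)
    n_frames = len(noisy_speech) // frame_len
    frames = noisy_speech[: n_frames * frame_len].reshape(n_frames, frame_len)
    return kurtosis(frames, axis=1, fisher=True)   # ~0 for a Gaussian frame

# A pure Gaussian noise signal gives ~0; sparse speech-like bursts raise it.
rng = np.random.default_rng(0)
noise = rng.normal(size=16000)
print(np.mean(speech_sparseness(noise, fs=16000)))
```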

  19. Speech coding

    Energy Technology Data Exchange (ETDEWEB)

    Ravishankar, C., Hughes Network Systems, Germantown, MD

    1998-05-08

    Speech is the predominant means of communication between human beings, and since the invention of the telephone by Alexander Graham Bell in 1876, speech services have remained the core service in almost all telecommunication systems. Original analog methods of telephony had the disadvantage of the speech signal being corrupted by noise, cross-talk, and distortion. Long-haul transmissions, which use repeaters to compensate for the loss in signal strength on transmission links, also increase the associated noise and distortion. Digital transmission, on the other hand, is relatively immune to noise, cross-talk, and distortion, primarily because of the capability to faithfully regenerate the digital signal at each repeater purely based on a binary decision. Hence the end-to-end performance of the digital link essentially becomes independent of the length and operating frequency bands of the link, and from a transmission point of view digital transmission has been the preferred approach due to its higher immunity to noise. The need to carry digital speech became extremely important from a service provision point of view as well. Modern requirements have introduced the need for robust, flexible and secure services that can carry a multitude of signal types (such as voice, data and video) without a fundamental change in infrastructure. Such a requirement could not have been easily met without the advent of digital transmission systems, thereby requiring speech to be coded digitally. The term speech coding often refers to techniques that represent or code speech signals either directly as a waveform or as a set of parameters by analyzing the speech signal. In either case, the codes are transmitted to the distant end, where speech is reconstructed or synthesized using the received set of codes. A more generic term that is often used interchangeably with speech coding is voice coding. This term is more generic in the sense that the
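
    The simplest waveform coder of the kind surveyed here is logarithmic companding, e.g., the mu-law curve used in 64 kbit/s PCM telephony. A sketch of encoding to and decoding from 8-bit codes (the quantization is simplified relative to the actual G.711 tables):

```python
import numpy as np

MU = 255.0  # mu-law constant used in North American/Japanese PCM telephony

def mulaw_encode(x):
    """Compress samples in [-1, 1] and quantize to 8-bit codes (simplified G.711)."""
    compressed = np.sign(x) * np.log1p(MU * np.abs(x)) / np.log1p(MU)
    return np.round((compressed + 1.0) / 2.0 * 255.0).astype(np.uint8)

def mulaw_decode(codes):
    """Invert the quantization and the mu-law compression."""
    compressed = codes.astype(np.float64) / 255.0 * 2.0 - 1.0
    return np.sign(compressed) * ((1.0 + MU) ** np.abs(compressed) - 1.0) / MU

x = np.sin(2 * np.pi * np.arange(160) * 440.0 / 8000.0)   # 440 Hz tone at 8 kHz
print(np.max(np.abs(x - mulaw_decode(mulaw_encode(x)))))  # small reconstruction error
```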

  20. Comparison of Linear Prediction Models for Audio Signals

    Directory of Open Access Journals (Sweden)

    2009-03-01

    While linear prediction (LP) has become immensely popular in speech modeling, it does not seem to provide a good approach for modeling audio signals. This is somewhat surprising, since a tonal signal consisting of a number of sinusoids can be perfectly predicted based on an (all-pole) LP model with a model order that is twice the number of sinusoids. We provide an explanation why this result cannot simply be extrapolated to LP of audio signals. If noise is taken into account in the tonal signal model, a low-order all-pole model appears to be appropriate only when the tonal components are uniformly distributed over the Nyquist interval. Based on this observation, different alternatives to the conventional LP model can be suggested: either the model should be changed to a pole-zero, a high-order all-pole, or a pitch prediction model, or the conventional LP model should be preceded by an appropriate frequency transform, such as a frequency warping or downsampling. By comparing these alternative LP models to the conventional LP model in terms of frequency estimation accuracy, residual spectral flatness, and perceptual frequency resolution, we obtain several new and promising approaches to LP-based audio modeling.

  1. Comparison of Linear Prediction Models for Audio Signals

    Directory of Open Access Journals (Sweden)

    van Waterschoot Toon

    2008-01-01

    While linear prediction (LP) has become immensely popular in speech modeling, it does not seem to provide a good approach for modeling audio signals. This is somewhat surprising, since a tonal signal consisting of a number of sinusoids can be perfectly predicted based on an (all-pole) LP model with a model order that is twice the number of sinusoids. We provide an explanation why this result cannot simply be extrapolated to LP of audio signals. If noise is taken into account in the tonal signal model, a low-order all-pole model appears to be appropriate only when the tonal components are uniformly distributed over the Nyquist interval. Based on this observation, different alternatives to the conventional LP model can be suggested: either the model should be changed to a pole-zero, a high-order all-pole, or a pitch prediction model, or the conventional LP model should be preceded by an appropriate frequency transform, such as a frequency warping or downsampling. By comparing these alternative LP models to the conventional LP model in terms of frequency estimation accuracy, residual spectral flatness, and perceptual frequency resolution, we obtain several new and promising approaches to LP-based audio modeling.
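
    Conventional LP fits the all-pole model by solving the autocorrelation normal equations, whose Toeplitz structure makes the solve a one-liner (equivalently, the Levinson-Durbin recursion). The two-sinusoid claim above can be checked by setting the order to twice the number of sinusoids:

```python
import numpy as np
from scipy.linalg import solve_toeplitz

def lpc(x, order):
    """All-pole LP coefficients a[1..p] via the autocorrelation method: solve R a = r."""
    r = np.correlate(x, x, mode='full')[len(x) - 1 : len(x) + order]  # lags 0..order
    return solve_toeplitz((r[:-1], r[:-1]), r[1:])    # symmetric Toeplitz system

fs = 8000
n = np.arange(1024)
x = np.sin(2 * np.pi * 700 * n / fs) + 0.5 * np.sin(2 * np.pi * 1900 * n / fs)
a = lpc(x, order=4)                                   # order = 2 x number of sinusoids
pred = np.convolve(x, np.concatenate(([0.0], a)))[: len(x)]  # x[n] ~ sum_k a_k x[n-k]
print(np.max(np.abs(x[4:] - pred[4:])))               # small prediction error
```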

  2. A Speech Transformation System for Emotion Conversion using MFCC and Modeling using GMM

    Directory of Open Access Journals (Sweden)

    Ashish R. Panat

    2013-07-01

    Transformation of sound using statistical techniques is a significant method for a new range of digital audio effect applications. This paper describes a data-driven voice transformation algorithm used to alter the timbre of a neutral (non-emotional) voice in order to reproduce a particular emotional vocal timbre (here, anger). Perception-based mel log spectral approximation filters are used to represent the speech timbre in the form of mel-frequency cepstral coefficients (MFCCs). The transformation function adopts a Gaussian mixture model (GMM) parameterization of the spectral envelope, with the expectation-maximization algorithm used to find the parameters of the Gaussian mixture model. Finally, a polynomial transformation function is developed for emotion transformation of the speech signal. Objective evaluation is performed by calculating mel-cepstral distance, prediction error, and normalized prediction error. The mel-cepstral distance between angry and transformed-angry speech is found to be much smaller than that between neutral and angry speech, and the plots of the two distances overlap at several locations.
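
    GMM-based spectral conversion in this tradition fits a joint GMM over paired source-target frames and converts each source frame by posterior-weighted regression. The compact scikit-learn sketch below is a generic version of that scheme, not the paper's exact polynomial variant; the dimensions and toy training pairs are hypothetical.

```python
import numpy as np
from scipy.stats import multivariate_normal
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
d = 13                                                 # MFCC dimensionality
src = rng.normal(size=(2000, d))                       # aligned neutral frames (toy data)
tgt = src + 0.5 + 0.1 * rng.normal(size=src.shape)     # aligned angry frames (toy data)

gmm = GaussianMixture(n_components=8, covariance_type='full', random_state=0)
gmm.fit(np.hstack([src, tgt]))                         # joint model over [x; y]

def convert(x):
    """Map a source frame x: sum_k p(k|x) * (mu_y,k + S_yx,k S_xx,k^-1 (x - mu_x,k))."""
    mu_x, mu_y = gmm.means_[:, :d], gmm.means_[:, d:]
    S_xx, S_yx = gmm.covariances_[:, :d, :d], gmm.covariances_[:, d:, :d]
    lik = np.array([multivariate_normal.pdf(x, mu_x[k], S_xx[k])
                    for k in range(gmm.n_components)])
    post = gmm.weights_ * lik
    post /= post.sum()                                 # p(k | x) on the source marginal
    return sum(p * (my + S_yx[k] @ np.linalg.solve(S_xx[k], x - mx))
               for k, (p, mx, my) in enumerate(zip(post, mu_x, mu_y)))

print(convert(src[0]))
```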

  3. Auditory-model based robust feature selection for speech recognition.

    Science.gov (United States)

    Koniaris, Christos; Kuropatwinski, Marcin; Kleijn, W Bastiaan

    2010-02-01

    It is shown that robust dimension-reduction of a feature set for speech recognition can be based on a model of the human auditory system. Whereas conventional methods optimize classification performance, the proposed method exploits knowledge implicit in the auditory periphery, inheriting its robustness. Features are selected to maximize the similarity of the Euclidean geometry of the feature domain and the perceptual domain. Recognition experiments using mel-frequency cepstral coefficients (MFCCs) confirm the effectiveness of the approach, which does not require labeled training data. For noisy data the method outperforms commonly used discriminant-analysis based dimension-reduction methods that rely on labeling. The results indicate that selecting MFCCs in their natural order results in subsets with good performance.

  4. Perceptual centres in speech - an acoustic analysis

    Science.gov (United States)

    Scott, Sophie Kerttu

    Perceptual centres, or P-centres, represent the perceptual moments of occurrence of acoustic signals - the 'beat' of a sound. P-centres underlie the perception and production of rhythm in perceptually regular speech sequences, and have been modelled in both speech and non-speech (music) domains. The three aims of this thesis were: (a) to test current P-centre models to determine which best accounted for the experimental data; (b) to identify a candidate parameter to map P-centres onto (a local approach), as opposed to the previous global models, which rely upon the whole signal to determine the P-centre; and (c) to develop a model of P-centre location that could be applied to both speech and non-speech signals. The first aim was investigated in a series of experiments examining (a) speech from different speakers, to determine whether different models could account for variation between speakers; (b) whether rendering the amplitude-time plot of a speech signal affects its P-centre; and (c) whether increasing the amplitude at the offset of a speech signal alters P-centres in the production and perception of speech. The second aim was pursued by (a) manipulating the rise time of different speech signals to determine whether the P-centre was affected, and whether the type of speech sound ramped affected the P-centre shift; (b) manipulating the rise time and decay time of a synthetic vowel to determine whether the onset alteration had more effect on the P-centre than the offset manipulation; and (c) testing whether the duration of a vowel affected the P-centre when other attributes (amplitude, spectral content) were held constant. The third aim - modelling P-centres - was based on these results. The Frequency-dependent Amplitude Increase Model of P-centre location (FAIM) was developed using a modelling protocol, the APU GammaTone Filterbank, and the speech from different speakers. The P-centres of the stimulus corpus were highly predicted by attributes of

  5. Predicting binaural speech intelligibility using the signal-to-noise ratio in the envelope power spectrum domain

    DEFF Research Database (Denmark)

    Chabot-Leclerc, Alexandre; MacDonald, Ewen; Dau, Torsten

    2016-01-01

    time difference of the target and masker. The Pearson correlation coefficient between the simulated speech reception thresholds and the data across all experiments was 0.91. A model version that considered only BE processing performed similarly (correlation coefficient of 0.86) to the complete model...

  6. Computational neural modeling of speech motor control in childhood apraxia of speech (CAS)

    NARCIS (Netherlands)

    Terband, H.R.; Maassen, B.A.M.; Guenther, F.H.; Brumberg, J.

    2009-01-01

    PURPOSE: Childhood apraxia of speech (CAS) has been associated with a wide variety of diagnostic descriptions and has been shown to involve different symptoms during successive stages of development. In the present study, the authors attempted to associate the symptoms of CAS in a particular develop

  7. Academic Freedom in Classroom Speech: A Heuristic Model for U.S. Catholic Higher Education

    Science.gov (United States)

    Jacobs, Richard M.

    2010-01-01

    As the nation's Catholic universities and colleges continually clarify their identity, this article examines academic freedom in classroom speech, offering a heuristic model for use as board members, academic administrators, and faculty leaders discuss, evaluate, and judge allegations of misconduct in classroom speech. Focusing upon the practice…

  8. Speech-Language Pathologist and General Educator Collaboration: A Model for Tier 2 Service Delivery

    Science.gov (United States)

    Watson, Gina D.; Bellon-Harn, Monica L.

    2014-01-01

    Tier 2 supplemental instruction within a response to intervention framework provides a unique opportunity for developing partnerships between speech-language pathologists and classroom teachers. Speech-language pathologists may participate in Tier 2 instruction via a consultative or collaborative service delivery model depending on district needs.…

  10. Model-Based Synthesis of Visual Speech Movements from 3D Video

    Directory of Open Access Journals (Sweden)

    James D. Edge

    2009-01-01

    We describe a method for the synthesis of visual speech movements using a hybrid unit selection/model-based approach. Speech lip movements are captured using a 3D stereo face capture system and split up into phonetic units. A dynamic parameterisation of this data is constructed which maintains the relationship between lip shapes and velocities; within this parameterisation a model of how lips move is built and is used in the animation of visual speech movements from speech audio input. The mapping from audio parameters to lip movements is disambiguated by selecting only the most similar stored phonetic units to the target utterance during synthesis. By combining properties of model-based synthesis (e.g., HMMs, neural nets) with unit selection we improve the quality of our speech synthesis.

  11. A Pitch Detection Algorithm for Continuous Speech Signals Using Viterbi Traceback with Temporal Forgetting

    Directory of Open Access Journals (Sweden)

    J. Bartošek

    2011-01-01

    This paper presents a pitch-detection algorithm (PDA) for application to signals containing continuous speech. The core of the method is based on merged normalized forward-backward correlation (MNFBC), working in the time domain, with the ability to make basic voicing decisions. In addition, a Viterbi traceback procedure is used for post-processing the MNFBC output, considering the three best fundamental frequency (F0) candidates in each step. This should make the final pitch contour smoother, and should also prevent octave errors. In computing transition probabilities between F0 candidates, two major improvements were made over existing post-processing methods. Firstly, we compare pitch distance in musical cent units. Secondly, temporal forgetting is applied in order to avoid penalizing pitch jumps after prosodic pauses of one speaker or changes in pitch connected with turn-taking in dialogs. Results computed on a pitch-reference database clearly show the benefit of the first improvement, but they have not yet proved any benefit of the temporal modification. We assume this happened due to the nature of the reference corpus, which had a small amount of suprasegmental content.
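
    The Viterbi post-processing stage can be sketched as dynamic programming over the per-frame F0 candidates, with an emission score from the correlation merits and a transition cost measured in cents; the cost weighting below is an illustrative assumption.

```python
import numpy as np

def viterbi_pitch(candidates, scores, jump_weight=0.01):
    """Smooth a pitch track by Viterbi search over per-frame F0 candidates.

    candidates: (n_frames, k) F0 candidates in Hz; scores: (n_frames, k)
    correlation-based merits in (0, 1]. The transition cost is the pitch
    distance in musical cents, |1200 * log2(f_t / f_{t-1})|.
    """
    n, k = candidates.shape
    cost = -np.log(scores + 1e-9)                  # emission cost per candidate
    acc = cost[0].copy()
    back = np.zeros((n, k), dtype=int)
    for t in range(1, n):
        cents = 1200.0 * np.abs(np.log2(candidates[t][None, :]
                                        / candidates[t - 1][:, None]))
        total = acc[:, None] + jump_weight * cents  # (prev, cur) transition matrix
        back[t] = np.argmin(total, axis=0)
        acc = total[back[t], np.arange(k)] + cost[t]
    path = np.empty(n, dtype=int)
    path[-1] = int(np.argmin(acc))
    for t in range(n - 1, 0, -1):                  # traceback
        path[t - 1] = back[t, path[t]]
    return candidates[np.arange(n), path]
```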

  12. Interactions between distal speech rate, linguistic knowledge, and speech environment.

    Science.gov (United States)

    Morrill, Tuuli; Baese-Berk, Melissa; Heffner, Christopher; Dilley, Laura

    2015-10-01

    During lexical access, listeners use both signal-based and knowledge-based cues, and information from the linguistic context can affect the perception of acoustic speech information. Recent findings suggest that the various cues used in lexical access are implemented with flexibility and may be affected by information from the larger speech context. We conducted 2 experiments to examine effects of a signal-based cue (distal speech rate) and a knowledge-based cue (linguistic structure) on lexical perception. In Experiment 1, we manipulated distal speech rate in utterances where an acoustically ambiguous critical word was either obligatory for the utterance to be syntactically well formed (e.g., Conner knew that bread and butter (are) both in the pantry) or optional (e.g., Don must see the harbor (or) boats). In Experiment 2, we examined identical target utterances as in Experiment 1 but changed the distribution of linguistic structures in the fillers. The results of the 2 experiments demonstrate that speech rate and linguistic knowledge about critical word obligatoriness can both influence speech perception. In addition, it is possible to alter the strength of a signal-based cue by changing information in the speech environment. These results provide support for models of word segmentation that include flexible weighting of signal-based and knowledge-based cues.

  13. Phonological representations are unconsciously used when processing complex, non-speech signals.

    Directory of Open Access Journals (Sweden)

    Mahan Azadpour

    Neuroimaging studies of speech processing increasingly rely on artificial speech-like sounds whose perceptual status as speech or non-speech is assigned by simple subjective judgments; brain activation patterns are interpreted according to these status assignments. The naïve perceptual status of one such stimulus, spectrally-rotated speech (not consciously perceived as speech by naïve subjects), was evaluated in discrimination and forced identification experiments. Discrimination of variation in spectrally-rotated syllables in one group of naïve subjects was strongly related to the pattern of similarities in phonological identification of the same stimuli provided by a second, independent group of naïve subjects, suggesting either (1) that naïve rotated-syllable perception involves phonetic-like processing, or (2) that perception is solely based on physical acoustic similarity, and similar sounds are given similar phonetic identities. Analysis of acoustic (Euclidean distances of center frequency values of formants) and phonetic similarities in the perception of the vowel portions of the rotated syllables revealed that discrimination was significantly and independently influenced by both acoustic and phonological information. We conclude that simple subjective assessments of artificial speech-like sounds can be misleading, as perception of such sounds may initially and unconsciously utilize speech-like, phonological processing.
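
    Spectral rotation flips the spectrum about the middle of the band: in the classic scheme, ring modulation by a carrier at the band edge maps frequency f to (carrier - f), and low-pass filtering removes the sum component. A digital sketch, with the 4 kHz band edge chosen as an assumption:

```python
import numpy as np
from scipy.signal import butter, lfilter

def spectrally_rotate(x, fs, band_top=4000.0):
    """Invert the spectrum of x within [0, band_top]: frequency f maps to band_top - f."""
    b, a = butter(6, band_top / (fs / 2), btype='low')
    x = lfilter(b, a, x)                               # band-limit the input
    t = np.arange(len(x)) / fs
    modulated = x * np.cos(2 * np.pi * band_top * t)   # images at band_top - f and band_top + f
    return lfilter(b, a, modulated)                    # keep only the reflected image
```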

  14. Measuring speech sound development : An item response model approach

    NARCIS (Netherlands)

    Priester, Gertrude H.; Goorhuis - Brouwer, Siena

    2013-01-01

    Research aim: The primary aim of our study is to investigate if there is an ordering in the speech sound development of children aged 3-6, similar to the ordering in general language development. Method: The speech sound development of 1035 children was tested with a revised version of Logo-Articula

  15. Audiovisual Matching in Speech and Nonspeech Sounds: A Neurodynamical Model

    Science.gov (United States)

    Loh, Marco; Schmid, Gabriele; Deco, Gustavo; Ziegler, Wolfram

    2010-01-01

    Audiovisual speech perception provides an opportunity to investigate the mechanisms underlying multimodal processing. By using nonspeech stimuli, it is possible to investigate the degree to which audiovisual processing is specific to the speech domain. It has been shown in a match-to-sample design that matching across modalities is more difficult…

  16. The Linear Model Research on Tibetan Six-Character Poetry's Respiratory Signal

    Science.gov (United States)

    Yonghong, Li; Yangrui, Yang; Lei, Guo; Hongzhi, Yu

    In this paper, we studied the respiratory signal of Tibetan six-character poems during reading from a physiological perspective. The main contents are: 1) we selected 40 representative Tibetan six-character, four-line poems from "The Love-songs of the 6th Dalai Lama Tshang•yang Gya•tsho", and recorded speech sounds, voice, and respiratory signals; 2) we designed a set of respiratory signal parameters for the study of poetry; 3) we extracted the relevant parameters of the poetry respiratory signal using a well-established respiratory signal processing platform; 4) we studied the type of breathing pattern and established a linear model of the poetry respiratory signal.

  17. Impaired motor inhibition in adults who stutter - evidence from speech-free stop-signal reaction time tasks.

    Science.gov (United States)

    Markett, Sebastian; Bleek, Benjamin; Reuter, Martin; Prüss, Holger; Richardt, Kirsten; Müller, Thilo; Yaruss, J Scott; Montag, Christian

    2016-10-01

    Idiopathic stuttering is a fluency disorder characterized by impairments during speech production. Deficits in the motor control circuits of the basal ganglia have been implicated in idiopathic stuttering but it is unclear how these impairments relate to the disorder. Previous work has indicated a possible deficiency in motor inhibition in children who stutter. To extend these findings to adults, we designed two experiments to probe executive motor control in people who stutter using manual reaction time tasks that do not rely on speech production. We used two versions of the stop-signal reaction time task, a measure for inhibitory motor control that has been shown to rely on the basal ganglia circuits. We show increased stop-signal reaction times in two independent samples of adults who stutter compared to age- and sex-matched control groups. Additional measures involved simple reaction time measurements and a task-switching task where no group difference was detected. Results indicate a deficiency in inhibitory motor control in people who stutter in a task that does not rely on overt speech production and cannot be explained by general deficits in executive control or speeded motor execution. This finding establishes the stop-signal reaction time as a possible target for future experimental and neuroimaging studies on fluency disorders and is a further step towards unraveling the contribution of motor control deficits to idiopathic stuttering. Copyright © 2016 Elsevier Ltd. All rights reserved.

  18. Reconstructing Boolean Models of Signaling

    Science.gov (United States)

    Karp, Richard M.

    2013-01-01

    Since the first emergence of protein–protein interaction networks more than a decade ago, they have been viewed as static scaffolds of the signaling–regulatory events taking place in cells, and their analysis has been mainly confined to topological aspects. Recently, functional models of these networks have been suggested, ranging from Boolean to constraint-based methods. However, learning such models from large-scale data remains a formidable task, and most modeling approaches rely on extensive human curation. Here we provide a generic approach to learning Boolean models automatically from data. We apply our approach to growth and inflammatory signaling systems in humans and show how the learning phase can improve the fit of the model to experimental data, remove spurious interactions, and lead to better understanding of the system at hand. PMID:23286509

  19. Bayesian model of categorical effects in L1 and L2 speech perception

    Science.gov (United States)

    Kronrod, Yakov

    In this dissertation I present a model that captures categorical effects in both first language (L1) and second language (L2) speech perception. In L1 perception, categorical effects range from extremely strong for consonants to nearly continuous for vowels. I treat the problem of speech perception as a statistical inference problem, and by quantifying categoricity I obtain a unified model of both strong and weak categorical effects. In this optimal inference mechanism, the listener uses their knowledge of categories and the acoustics of the signal to infer the intended productions of the speaker. The model splits up speech variability into meaningful category variance and perceptual noise variance. The ratio of these two variances, which I call Tau, directly correlates with the degree of categorical effects for a given phoneme or continuum. By fitting the model to behavioral data from different phonemes, I show how a single parametric quantitative variation can lead to the different degrees of categorical effects seen in perception experiments with different phonemes. In L2 perception, L1 categories have been shown to exert an effect on how L2 sounds are identified and how well the listener is able to discriminate them. Various models have been developed to relate the state of L1 categories with both the initial and eventual ability to process the L2. These models largely lacked a formalized metric to measure perceptual distance, a means of making a-priori predictions of behavior for a new contrast, and a way of describing non-discrete gradient effects. In the second part of my dissertation, I apply the same computational model that I used to unify L1 categorical effects to examining L2 perception. I show that we can use the model to make the same type of predictions as other SLA models, but also provide a quantitative framework while formalizing all measures of similarity and bias. Further, I show how using this model to consider L2 learners at
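
    The inference mechanism described can be sketched with Gaussian categories and Gaussian perceptual noise: the listener hears a stimulus S and estimates the intended target as a posterior-weighted shrinkage of S toward each category mean, with the category-to-noise variance ratio governing how categorical the percept is. All numbers below are illustrative.

```python
import numpy as np
from scipy.stats import norm

def perceived(S, cat_means, cat_var, noise_var):
    """Optimal estimate E[T | S] under Gaussian categories and perceptual noise.

    Within a category c: E[T | S, c] = (cat_var*S + noise_var*mu_c) / (cat_var + noise_var);
    the per-category estimates are mixed by the posterior p(c | S).
    """
    S = np.atleast_1d(S)
    lik = np.stack([norm.pdf(S, mu, np.sqrt(cat_var + noise_var)) for mu in cat_means])
    post = lik / lik.sum(axis=0)                       # p(c | S), uniform prior
    shrunk = np.stack([(cat_var * S + noise_var * mu) / (cat_var + noise_var)
                       for mu in cat_means])
    return (post * shrunk).sum(axis=0)

# Small noise variance relative to category variance -> near-continuous (vowel-like)
# perception; large noise variance -> strong warping toward means (consonant-like).
print(perceived(np.linspace(-2, 2, 5), cat_means=[-1.0, 1.0], cat_var=0.5, noise_var=0.1))
```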

  20. Using EEG and stimulus context to probe the modelling of auditory-visual speech.

    Science.gov (United States)

    Paris, Tim; Kim, Jeesun; Davis, Chris

    2016-02-01

    We investigated whether internal models of the relationship between lip movements and corresponding speech sounds [Auditory-Visual (AV) speech] could be updated via experience. AV associations were indexed by early and late event related potentials (ERPs) and by oscillatory power and phase locking. Different AV experience was produced via a context manipulation. Participants were presented with valid (the conventional pairing) and invalid AV speech items in either a 'reliable' context (80% AVvalid items) or an 'unreliable' context (80% AVinvalid items). The results showed that for the reliable context, there was N1 facilitation for AV compared to auditory-only speech. This N1 facilitation was not affected by AV validity. Later ERPs showed a difference in amplitude between valid and invalid AV speech, and there was significant enhancement of power for valid versus invalid AV speech. These response patterns did not change over the context manipulation, suggesting that the internal models of AV speech were not updated by experience. The results also showed that the facilitation of N1 responses did not vary as a function of the salience of visual speech (as previously reported); in post-hoc analyses, it appeared instead that N1 facilitation varied according to the relative time of the acoustic onset, suggesting that for AV events the N1 may be more sensitive to AV timing than to form. Crown Copyright © 2015. Published by Elsevier Ltd. All rights reserved.

  1. Perceptual organization of speech signals by children with and without dyslexia

    Science.gov (United States)

    Nittrouer, Susan; Lowenstein, Joanna H.

    2013-01-01

    Developmental dyslexia is a condition in which children encounter difficulty learning to read in spite of adequate instruction. Although considerable effort has been expended trying to identify the source of the problem, no single solution has been agreed upon. The current study explored a new hypothesis, that developmental dyslexia may be due to faulty perceptual organization of linguistically relevant sensory input. To test that idea, sentence-length speech signals were processed to create either sine-wave or noise-vocoded analogs. Seventy children between 8 and 11 years of age, with and without dyslexia, participated. Children with dyslexia were selected to have phonological awareness deficits, although those without such deficits were retained in the study. The processed sentences were presented for recognition, and measures of reading, phonological awareness, and expressive vocabulary were collected. Results showed that children with dyslexia, regardless of phonological subtype, had poorer recognition scores than children without dyslexia for both kinds of degraded sentences. Older children with dyslexia recognized the sine-wave sentences better than younger children with dyslexia, but no such effect of age was found for the vocoded materials. Recognition scores were used as predictor variables in regression analyses with reading, phonological awareness, and vocabulary measures used as dependent variables. Scores for both sorts of sentence materials were strong predictors of performance on all three dependent measures when all children were included, but only performance for the sine-wave materials explained significant proportions of variance when only children with dyslexia were included. Finally, matching young, typical readers with older children with dyslexia on reading abilities did not mitigate the group difference in recognition of vocoded sentences. Conclusions were that children with dyslexia have difficulty organizing linguistically relevant sensory

  2. Perceptual Wavelet packet transform based Wavelet Filter Banks Modeling of Human Auditory system for improving the intelligibility of voiced and unvoiced speech: A Case Study of a system development

    Directory of Open Access Journals (Sweden)

    Ranganadh Narayanam

    2015-10-01

    The objective of this project is to discuss a versatile speech enhancement method based on the human auditory model. We describe a speech enhancement scheme that meets the demand for quality noise-reduction algorithms capable of operating at very low signal-to-noise ratios, and we discuss how the proposed system reduces noise with little speech degradation in diverse noise environments. To reduce residual noise and improve the intelligibility of speech, a psychoacoustic model is incorporated into a generalized perceptual wavelet denoising method. This is a generalized time-frequency subtraction algorithm that advantageously exploits the wavelet multirate signal representation to preserve critical transient information. Simultaneous masking and temporal masking of the human auditory system are modeled by the perceptual wavelet packet transform via the frequency and temporal localization of speech components. The wavelet coefficients are used to calculate the Bark spreading energy and temporal spreading energy, from which a time-frequency masking threshold is deduced to adaptively adjust the subtraction parameters of the discussed method. To increase the intelligibility of speech, an unvoiced speech enhancement algorithm is also integrated into the system.
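
    Stripped of the psychoacoustic thresholding, the wavelet-denoising backbone is: decompose into subbands, soft-threshold the coefficients, reconstruct. A minimal PyWavelets sketch with a universal threshold standing in for the masking-based thresholds described above:

```python
import numpy as np
import pywt

def wavelet_denoise(noisy, wavelet='db4', level=4):
    """Soft-threshold wavelet denoising with a universal threshold."""
    coeffs = pywt.wavedec(noisy, wavelet, level=level)
    sigma = np.median(np.abs(coeffs[-1])) / 0.6745      # noise estimate from finest band
    thresh = sigma * np.sqrt(2.0 * np.log(len(noisy)))  # universal threshold
    denoised = [coeffs[0]] + [pywt.threshold(c, thresh, mode='soft') for c in coeffs[1:]]
    return pywt.waverec(denoised, wavelet)
```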

  3. Speech sound discrimination training improves auditory cortex responses in a rat model of autism

    Directory of Open Access Journals (Sweden)

    Crystal T Engineer

    2014-08-01

    Children with autism often have language impairments and degraded cortical responses to speech. Extensive behavioral interventions can improve language outcomes and cortical responses. Prenatal exposure to the antiepileptic drug valproic acid (VPA) increases the risk for autism and language impairment. Prenatal exposure to VPA also causes weaker and delayed auditory cortex responses in rats. In this study, we document speech sound discrimination ability in VPA-exposed rats and the effect of extensive speech training on auditory cortex responses. VPA-exposed rats were significantly impaired at consonant, but not vowel, discrimination. Extensive speech training resulted in both stronger and faster anterior auditory field responses compared to untrained VPA-exposed rats, and restored responses to control levels. This neural response improvement generalized to non-trained sounds. The rodent VPA model of autism may be used to improve the understanding of speech processing in autism and contribute to improving language outcomes.

  4. Speech sound discrimination training improves auditory cortex responses in a rat model of autism

    Science.gov (United States)

    Engineer, Crystal T.; Centanni, Tracy M.; Im, Kwok W.; Kilgard, Michael P.

    2014-01-01

    Children with autism often have language impairments and degraded cortical responses to speech. Extensive behavioral interventions can improve language outcomes and cortical responses. Prenatal exposure to the antiepileptic drug valproic acid (VPA) increases the risk for autism and language impairment. Prenatal exposure to VPA also causes weaker and delayed auditory cortex responses in rats. In this study, we document speech sound discrimination ability in VPA exposed rats and document the effect of extensive speech training on auditory cortex responses. VPA exposed rats were significantly impaired at consonant, but not vowel, discrimination. Extensive speech training resulted in both stronger and faster anterior auditory field (AAF) responses compared to untrained VPA exposed rats, and restored responses to control levels. This neural response improvement generalized to non-trained sounds. The rodent VPA model of autism may be used to improve the understanding of speech processing in autism and contribute to improving language outcomes. PMID:25140133

  5. Prediction of Speech Recognition in Cochlear Implant Users by Adapting Auditory Models to Psychophysical Data

    Directory of Open Access Journals (Sweden)

    Svante Stadler

    2009-01-01

    Users of cochlear implants (CIs) vary widely in their ability to recognize speech in noisy conditions. Many factors may influence their performance. We have investigated to what degree this variability can be explained by the users' ability to discriminate spectral shapes. A speech recognition task was simulated using both a simple and a complex model of CI hearing. The models were individualized by adapting their parameters to fit the results of a spectral discrimination test. The predicted speech recognition performance was compared to experimental results, and the two were significantly correlated. The presented framework may be used to simulate the effects of changing the CI encoding strategy.

  6. The Bases of Speech Pathology and Audiology: Selecting a Therapy Model

    Science.gov (United States)

    Schultz, Martin C.; Carpenter, Mary A.

    1973-01-01

    Described is the basis for therapeutic relationships in speech and hearing and presented are two models of clinical interaction, one directed to behavior change and the other directed to attitude change. (Author)

  7. Hidden Hearing Loss and Computational Models of the Auditory Pathway: Predicting Speech Intelligibility Decline

    Science.gov (United States)

    2016-11-28

    Title: Hidden Hearing Loss and Computational Models of the Auditory Pathway: Predicting Speech Intelligibility Decline. Christopher J. Smalt. The goal is to utilize computational models of the auditory periphery and auditory cortex to study the effect of low spontaneous rate auditory nerve fiber (ANF) loss on the cortical representation of speech. A common complaint of listeners with normal clinical hearing thresholds is difficulty in understanding speech in noise. Recent animal studies have shown that noise exposure causes selective loss of low spontaneous rate ANFs.

  8. Speech enhancement

    CERN Document Server

    Benesty, Jacob; Chen, Jingdong

    2006-01-01

    We live in a noisy world! In all applications (telecommunications, hands-free communications, recording, human-machine interfaces, etc.) that require at least one microphone, the signal of interest is usually contaminated by noise and reverberation. As a result, the microphone signal has to be "cleaned" with digital signal processing tools before it is played out, transmitted, or stored. This book is about speech enhancement. Different well-known and state-of-the-art methods for noise reduction, with one or multiple microphones, are discussed. By speech enhancement, we mean not only noise reduction…

  9. Speech Enhancement based on Compressive Sensing Algorithm

    Science.gov (United States)

    Sulong, Amart; Gunawan, Teddy S.; Khalifa, Othman O.; Chebil, Jalel

    2013-12-01

    Various methods of speech enhancement have been proposed over the years. Accurate speech enhancement design focuses mainly on quality and intelligibility, and the proposed method targets a high performance level. A novel speech enhancement method using compressive sensing (CS) is presented. CS is a new paradigm of acquiring signals, fundamentally different from uniform-rate digitization followed by compression, often used for transmission or storage. CS reduces the number of degrees of freedom of a sparse or compressible signal by permitting only certain configurations of large and zero/small coefficients and structured sparsity models. CS therefore provides a way of reconstructing a compressed version of the speech in the original signal by taking only a small number of linear, non-adaptive measurements. The performance of the overall algorithm is evaluated in terms of speech quality using informal listening tests and the Perceptual Evaluation of Speech Quality (PESQ). Experimental results show that the CS algorithm performs very well across a wide range of speech tests and gives significantly good performance for speech enhancement, with better noise suppression than conventional approaches and without obvious degradation of speech quality.
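
As a hedged illustration of the compressive sensing paradigm the abstract describes (not the authors' system, and without the PESQ evaluation step), the sketch below measures a DCT-sparse signal with a small random matrix and reconstructs it with orthogonal matching pursuit from scikit-learn.

```python
# Compressive sensing toy: a signal sparse in the DCT domain is measured with
# m << n random projections and recovered by orthogonal matching pursuit.
import numpy as np
from scipy.fft import idct
from sklearn.linear_model import OrthogonalMatchingPursuit

rng = np.random.default_rng(1)
n, m, k = 256, 80, 8                 # signal length, measurements, sparsity

s = np.zeros(n)                      # k nonzero DCT coefficients
s[rng.choice(n, k, replace=False)] = rng.standard_normal(k)
Psi = idct(np.eye(n), axis=0, norm="ortho")     # DCT synthesis basis
x = Psi @ s                                     # time-domain signal

Phi = rng.standard_normal((m, n)) / np.sqrt(m)  # random measurement matrix
y = Phi @ x                                     # linear, non-adaptive measurements

omp = OrthogonalMatchingPursuit(n_nonzero_coefs=k, fit_intercept=False)
omp.fit(Phi @ Psi, y)                           # recover sparse coefficients
x_hat = Psi @ omp.coef_

print("relative error:", np.linalg.norm(x - x_hat) / np.linalg.norm(x))
```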

  10. Two-Microphone Separation of Speech Mixtures

    DEFF Research Database (Denmark)

    Pedersen, Michael Syskind; Wang, DeLiang; Larsen, Jan

    2008-01-01

    Separation of speech mixtures, often referred to as the cocktail party problem, has been studied for decades. In many source separation tasks, the separation method is limited by the assumption of at least as many sensors as sources. Further, many methods require that the number of signals within...... the recorded mixtures be known in advance. In many real-world applications, these limitations are too restrictive. We propose a novel method for underdetermined blind source separation using an instantaneous mixing model which assumes closely spaced microphones. Two source separation techniques have been...... similar signals. Using two microphones, we can separate, in principle, an arbitrary number of mixed speech signals. We show separation results for mixtures with as many as seven speech signals under instantaneous conditions. We also show that the proposed method is applicable to segregate speech signals...

  11. Emotional recognition from the speech signal for a virtual education agent

    Science.gov (United States)

    Tickle, A.; Raghu, S.; Elshaw, M.

    2013-06-01

    This paper explores the extraction of features from the speech wave to perform intelligent emotion recognition. A feature extraction tool (openSMILE) was used to obtain a baseline set of 998 acoustic features from a set of emotional speech recordings made with a microphone. The initial features were reduced to the most important ones so that recognition of emotions using a supervised neural network could be performed. Given that the future use of virtual education agents lies in making the agents more interactive, developing agents with the capability to recognise and adapt to the emotional state of humans is an important step.
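
A minimal sketch of the pipeline described above, under the assumption of a generic supervised setup: a large feature set is reduced to the most informative features and fed to a small neural network. The data and labels are synthetic placeholders, not the openSMILE baseline set.

```python
# Feature reduction followed by supervised emotion classification (toy data).
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)
X = rng.standard_normal((300, 998))   # 300 utterances x 998 acoustic features
y = rng.integers(0, 4, 300)           # 4 hypothetical emotion classes

clf = make_pipeline(
    StandardScaler(),
    SelectKBest(f_classif, k=50),     # keep the 50 most discriminative features
    MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=0),
)
clf.fit(X[:250], y[:250])
print("toy accuracy:", clf.score(X[250:], y[250:]))
```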

  12. Blind estimation of the number of speech source in reverberant multisource scenarios based on binaural signals

    DEFF Research Database (Denmark)

    May, Tobias; van de Par, Steven

    2012-01-01

    position. Then, each candidate position is characterized by a set of features. In addition to exploiting the overall spectral shape, a new set of mask-based features is proposed which aims at characterizing the pattern of the estimated binary mask. The decision stage for detecting a speech source is based...... on a support vector machine (SVM) classifier. A systematic analysis shows that the proposed algorithm is able to blindly determine the number and the corresponding spatial positions of speech sources in multisource scenarios and generalizes well to unknown acoustic conditions...

  13. Features of the Speech Signal During the Cumulative Action of Coriolis Accelerations,

    Science.gov (United States)

    1976-11-01

    [Figure labels: horizontal axis, time of activity in minutes; vertical axis, intensity of pronunciation of vowel sound in decibels; a dot indicates the mean value, a line the observed scatter.] …the vowels of the Russian alphabet. This structure of the speech phrase makes it possible to determine those vowel sounds which are subject to the…pronunciation of the vowel sounds u (у), i (и), o (о) (up to 2 to 2.5 dB). Analysis of the speech phrase showed that the most significant changes are…

  14. Blind estimation of the number of speech source in reverberant multisource scenarios based on binaural signals

    DEFF Research Database (Denmark)

    May, Tobias; van de Par, Steven

    2012-01-01

    In this paper we present a new approach for estimating the number of active speech sources in the presence of interfering noise sources and reverberation. First, a binaural front-end is used to detect the spatial positions of all active sound sources, resulting in a binary mask for each candidate...... on a support vector machine (SVM) classifier. A systematic analysis shows that the proposed algorithm is able to blindly determine the number and the corresponding spatial positions of speech sources in multisource scenarios and generalizes well to unknown acoustic conditions...

  15. Hidden Markov Model Classification of Myoelectric Signals in Speech

    Science.gov (United States)

    2007-11-02

    Department of Biomedical Engineering and Department of Electrical and Computer Engineering, University of New Brunswick, Fredericton, Canada. Proceedings of the 23rd Annual Conference of the IEEE/EMBS, Oct. 25-28, 2001, Istanbul, Turkey.

  16. A computer model of auditory efferent suppression: implications for the recognition of speech in noise.

    Science.gov (United States)

    Brown, Guy J; Ferry, Robert T; Meddis, Ray

    2010-02-01

    The neural mechanisms underlying the ability of human listeners to recognize speech in the presence of background noise are still imperfectly understood. However, there is mounting evidence that the medial olivocochlear system plays an important role, via efferents that exert a suppressive effect on the response of the basilar membrane. The current paper presents a computer modeling study that investigates the possible role of this activity on speech intelligibility in noise. A model of auditory efferent processing [Ferry, R. T., and Meddis, R. (2007). J. Acoust. Soc. Am. 122, 3519-3526] is used to provide acoustic features for a statistical automatic speech recognition system, thus allowing the effects of efferent activity on speech intelligibility to be quantified. Performance of the "basic" model (without efferent activity) on a connected digit recognition task is good when the speech is uncorrupted by noise but falls when noise is present. However, recognition performance is much improved when efferent activity is applied. Furthermore, optimal performance is obtained when the amount of efferent activity is proportional to the noise level. The results obtained are consistent with the suggestion that efferent suppression causes a "release from adaptation" in the auditory-nerve response to noisy speech, which enhances its intelligibility.

  17. Optimal speech motor control and token-to-token variability: a Bayesian modeling approach.

    Science.gov (United States)

    Patri, Jean-François; Diard, Julien; Perrier, Pascal

    2015-12-01

    The remarkable capacity of the speech motor system to adapt to various speech conditions is due to an excess of degrees of freedom, which enables producing similar acoustical properties with different sets of control strategies. To explain how the central nervous system selects one of the possible strategies, a common approach, in line with optimal motor control theories, is to model speech motor planning as the solution of an optimality problem based on cost functions. Despite the success of this approach, one of its drawbacks is the intrinsic contradiction between the concept of optimality and the observed experimental intra-speaker token-to-token variability. The present paper proposes an alternative approach by formulating feedforward optimal control in a probabilistic Bayesian modeling framework. This is illustrated by controlling a biomechanical model of the vocal tract for speech production and by comparing it with an existing optimal control model (GEPPETO). The essential elements of this optimal control model are presented first. From them the Bayesian model is constructed in a progressive way. Performance of the Bayesian model is evaluated based on computer simulations and compared to the optimal control model. This approach is shown to be appropriate for solving the speech planning problem while accounting for variability in a principled way.
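
One way to see how a Bayesian formulation yields token-to-token variability is the following toy, which is not the GEPPETO model or the paper's vocal tract model: a scalar motor command is mapped to acoustics by a hypothetical function, planning is posterior sampling conditioned on the acoustics landing in a target region, and each production is one posterior sample rather than a single optimum.

```python
# Toy Bayesian planning: variability arises because each "token" is a sample
# from the posterior over motor commands, not a single optimal command.
import numpy as np

rng = np.random.default_rng(3)
f = lambda m: 1000.0 + 600.0 * np.tanh(m)        # hypothetical motor-to-formant map (Hz)
lo, hi = 1250.0, 1350.0                          # target acoustic region

m_prior = rng.normal(0.0, 1.0, 200_000)          # prior over motor commands
a = f(m_prior) + rng.normal(0.0, 20.0, m_prior.size)   # noisy acoustic outcome
posterior = m_prior[(a > lo) & (a < hi)]         # condition on hitting the target

tokens = rng.choice(posterior, 10)               # ten simulated productions
print("token-to-token variability (std of m): %.3f" % posterior.std())
print("sampled motor commands:", np.round(tokens, 3))
```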

  18. Chinese Speech Recognition Model Based on Activation of the State Feedback Neural Network

    Institute of Scientific and Technical Information of China (English)

    李先志; 孙义和

    2001-01-01

    This paper proposes a simplified novel speech recognition model, the state feedback neural network activation model (SFNNAM), which is developed based on the characteristics of Chinese speech structure. The model assumes that the current state of speech is only a correction of the immediately preceding state. According to the "C-V" (Consonant-Vowel) structure of the Chinese language, a speech segmentation method is also implemented in the SFNNAM model. The model has a definite physical meaning grounded in the structure of the Chinese language and is easily implemented in very large scale integrated (VLSI) circuits. In the speech recognition experiment, fewer calculations were needed than in the hidden Markov model (HMM) based algorithm. The recognition rate for Chinese numbers was 93.5% for the first candidate and 99.5% for the first two candidates.

  19. Cost-Efficient Development of Acoustic Models for Speech Recognition of Related Languages

    Directory of Open Access Journals (Sweden)

    J. Nouza

    2013-09-01

    When adapting an existing speech recognition system to a new language, major development costs are associated with the creation of an appropriate acoustic model (AM). For its training, a certain amount of recorded and annotated speech is required. In this paper, we show that not only the annotation process, but also the process of speech acquisition can be automated to minimize the need for human and expert work. We demonstrate the proposed methodology on Croatian, for which the target AM has been built via cross-lingual adaptation of a Czech AM in two ways: (a) using the commercially available GlobalPhone database, and (b) by automatic speech data mining from the HRT radio archive. The latter approach is cost-free, yet it yields comparable or better results in LVCSR experiments conducted on three Croatian test sets.

  20. Analysis of Decision Trees in Context Clustering of Hidden Markov Model Based Thai Speech Synthesis

    Directory of Open Access Journals (Sweden)

    Suphattharachai Chomphan

    2011-01-01

    Problem statement: In Thai speech synthesis using a Hidden Markov model (HMM) based synthesis system, tonal speech quality is degraded by tone distortion. This problem must be treated appropriately to preserve the tone characteristics of each syllable unit, since tone affects the intelligibility of the synthesized speech. The tone questions and other phonetic questions in the tree-based context clustering process must therefore be established accordingly. Approach: This study describes the analysis of questions in the tree-based context clustering process of an HMM-based speech synthesis system for Thai. In the system, spectrum, pitch (F0), and state duration are modeled simultaneously in a unified HMM framework, and their parameter distributions are clustered independently by a decision-tree based context clustering technique. The contextual factors which affect spectrum, pitch, and duration, i.e., part of speech, position and number of phones in a syllable, position and number of syllables in a word, position and number of words in a sentence, phone type, and tone type, are taken into account when constructing the questions of the decision tree. In all, thirteen sets of questions are analyzed in comparison. Results: In the experiment, we analyzed the decision trees by counting the number of questions in each node coming from those thirteen sets and by calculating the dominance score given to each question as the reciprocal of the distance from the root node to the question node. The highest number and dominance score belong to the phone-type set, while the second and third highest belong to the part-of-speech and tone-type sets. Conclusion: By counting the number of questions in each node and calculating the dominance score, we can set the priority of each question set. The analysis results support further development of Thai speech synthesis with an efficient context clustering process.
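
The dominance score is simple to compute. Below is a small sketch that walks a toy decision tree (a hypothetical nested dict, not HTS output) and accumulates, per question set, the reciprocal of each question node's distance from the root; the root's distance is counted as 1 here, an assumption, since the abstract does not define the root's own distance.

```python
# Accumulate per-question-set dominance scores over a toy decision tree.
def dominance_scores(tree, depth=1, scores=None):
    """tree: {"question_set": str, "yes": subtree-or-None, "no": subtree-or-None}"""
    if scores is None:
        scores = {}
    if tree is None:
        return scores
    qset = tree["question_set"]
    scores[qset] = scores.get(qset, 0.0) + 1.0 / depth   # reciprocal of root distance
    dominance_scores(tree["yes"], depth + 1, scores)
    dominance_scores(tree["no"], depth + 1, scores)
    return scores

toy_tree = {"question_set": "phone type",
            "yes": {"question_set": "tone type", "yes": None, "no": None},
            "no": {"question_set": "part of speech",
                   "yes": {"question_set": "tone type", "yes": None, "no": None},
                   "no": None}}
print(dominance_scores(toy_tree))
# {'phone type': 1.0, 'tone type': 0.833..., 'part of speech': 0.5}
```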

  1. PCA-Based Compressed Speech Signal Sensing

    Institute of Scientific and Technical Information of China (English)

    季云云; 杨震

    2011-01-01

    Compressed sensing theory is a new research focus that has arisen in recent years. Before compressed sensing theory can be applied to the speech signal processing field, a suitable sparse representation for speech signals must be found. Based on principal component analysis theory and a large number of signal blocks, features of the speech signal are extracted. According to compressed sensing theory, methods of dictionary construction, and the characteristics of the speech signal, a redundant dictionary for the sparse representation of speech signals, formed as the concatenation of several orthogonal bases, is presented in this paper. For a more objective description of the advantages of such a sparse representation, the average Gini index is applied to compare the sparsity of speech signals in the DCT basis, the Gabor basis, and this redundant dictionary, for male and female voices and for voiced and unvoiced speech. Simulation results show that, for male and female speech alike and for both voiced and unvoiced speech, sparsity in this redundant dictionary is much better than in the DCT basis and close to that in the Gabor basis. However, since its number of atoms is far smaller than that of the Gabor basis, with low computational complexity and storage, this redundant dictionary is more applicable to speech signals than the Gabor basis.
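
The Gini index used above as a sparsity measure can be sketched as follows, using the Hurley-Rickard definition (an assumption about which variant the authors averaged): coefficients are sorted by magnitude and rank-weighted, and values near 1 indicate high sparsity.

```python
# Gini sparsity index: 0 for a flat coefficient vector, close to 1 for a
# vector dominated by a single coefficient.
import numpy as np

def gini_index(c):
    c = np.sort(np.abs(np.asarray(c, dtype=float)))   # ascending magnitudes
    n = c.size
    if c.sum() == 0:
        return 0.0
    k = np.arange(1, n + 1)
    return 1.0 - 2.0 * np.sum((c / c.sum()) * (n - k + 0.5) / n)

dense = np.ones(64)                    # perfectly non-sparse
sparse = np.zeros(64); sparse[3] = 1.0 # one active coefficient
print(gini_index(dense), gini_index(sparse))   # ~0.0 vs ~0.98
```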

  2. An articulatorily constrained, maximum entropy approach to speech recognition and speech coding

    Energy Technology Data Exchange (ETDEWEB)

    Hogden, J.

    1996-12-31

    Hidden Markov models (HMMs) are among the most popular tools for performing computer speech recognition. One of the primary reasons that HMMs typically outperform other speech recognition techniques is that the parameters used for recognition are determined by the data, not by preconceived notions of what the parameters should be. This makes HMMs better able to deal with intra- and inter-speaker variability despite the limited knowledge of how speech signals vary and despite the often limited ability to correctly formulate rules describing variability and invariance in speech. In fact, it is often the case that when HMM parameter values are constrained using the limited knowledge of speech, recognition performance decreases. However, the structure of an HMM has little in common with the mechanisms underlying speech production. Here, the author argues that by using probabilistic models that more accurately embody the process of speech production, he can create models that have all the advantages of HMMs, but that should more accurately capture the statistical properties of real speech samples--presumably leading to more accurate speech recognition. The model he will discuss uses the fact that speech articulators move smoothly and continuously. Before discussing how to use articulatory constraints, he will give a brief description of HMMs. This will allow him to highlight the similarities and differences between HMMs and the proposed technique.

  3. A cocktail party model of spatial release from masking by both noise and speech interferers.

    Science.gov (United States)

    Jones, Gary L; Litovsky, Ruth Y

    2011-09-01

    A mathematical formula for estimating spatial release from masking (SRM) in a cocktail party environment would be useful as a simpler alternative to computationally intensive algorithms and may enhance understanding of underlying mechanisms. The experiment presented herein was designed to provide a strong test of a model that divides SRM into contributions of asymmetry and angular separation [Bronkhorst (2000). Acustica 86, 117-128] and to examine whether that model can be extended to include speech maskers. Across masker types the contribution to SRM of angular separation of maskers from the target was found to grow at a diminishing rate as angular separation increased within the frontal hemifield, contrary to predictions of the model. Speech maskers differed from noise maskers in the overall magnitude of SRM and in the contribution of angular separation (both greater for speech). These results were used to develop a modified model that achieved good fits to data for noise maskers (ρ=0.93) and for speech maskers (ρ=0.94) while using the same functions to describe separation and asymmetry components of SRM for both masker types. These findings suggest that this approach can be used to accurately model SRM for speech maskers in addition to primarily "energetic" noise maskers.

  4. A cocktail party model of spatial release from masking by both noise and speech interferers a)

    Science.gov (United States)

    Jones, Gary L.; Litovsky, Ruth Y.

    2011-01-01

    A mathematical formula for estimating spatial release from masking (SRM) in a cocktail party environment would be useful as a simpler alternative to computationally intensive algorithms and may enhance understanding of underlying mechanisms. The experiment presented herein was designed to provide a strong test of a model that divides SRM into contributions of asymmetry and angular separation [Bronkhorst (2000). Acustica 86, 117–128] and to examine whether that model can be extended to include speech maskers. Across masker types the contribution to SRM of angular separation of maskers from the target was found to grow at a diminishing rate as angular separation increased within the frontal hemifield, contrary to predictions of the model. Speech maskers differed from noise maskers in the overall magnitude of SRM and in the contribution of angular separation (both greater for speech). These results were used to develop a modified model that achieved good fits to data for noise maskers (ρ = 0.93) and for speech maskers (ρ = 0.94) while using the same functions to describe separation and asymmetry components of SRM for both masker types. These findings suggest that this approach can be used to accurately model SRM for speech maskers in addition to primarily “energetic” noise maskers. PMID:21895087

  5. Automated Understanding of Selected Voice Tract Pathologies Based on the Speech Signal Analysis

    Science.gov (United States)

    2007-11-02

    THE MATERIAL OF THE STUDY: Studies of speech articulation have been carried out for persons treated for larynx cancer (men after various types of operations). Depending on the stage of the tumour, various types of partial larynx surgery have been applied. In the recorded and studied material the following cases have been present: subtotal larynx removal (laryngectomia subtotalis), unilateral vertical laryngectomy (hemilaryngectomia…

  6. Effect of Simultaneous Bilingualism on Speech Intelligibility across Different Masker Types, Modalities, and Signal-to-Noise Ratios in School-Age Children

    Science.gov (United States)

    Reetzke, Rachel; Lam, Boji Pak-Wing; Xie, Zilong; Sheng, Li; Chandrasekaran, Bharath

    2016-01-01

    Recognizing speech in adverse listening conditions is a significant cognitive, perceptual, and linguistic challenge, especially for children. Prior studies have yielded mixed results on the impact of bilingualism on speech perception in noise. Methodological variations across studies make it difficult to converge on a conclusion regarding the effect of bilingualism on speech-in-noise performance. Moreover, there is a dearth of speech-in-noise evidence for bilingual children who learn two languages simultaneously. The aim of the present study was to examine the extent to which various adverse listening conditions modulate differences in speech-in-noise performance between monolingual and simultaneous bilingual children. To that end, sentence recognition was assessed in twenty-four school-aged children (12 monolinguals; 12 simultaneous bilinguals, age of English acquisition ≤ 3 yrs.). We implemented a comprehensive speech-in-noise battery to examine recognition of English sentences across different modalities (audio-only, audiovisual), masker types (steady-state pink noise, two-talker babble), and a range of signal-to-noise ratios (SNRs; 0 to -16 dB). Results revealed no difference in performance between monolingual and simultaneous bilingual children across each combination of modality, masker, and SNR. Our findings suggest that when English age of acquisition and socioeconomic status are similar between groups, monolingual and bilingual children exhibit comparable speech-in-noise performance across a range of conditions analogous to everyday listening environments. PMID:27936212

  7. A Comparative Analysis of Speech Signal Processing and Image Signal Processing

    Institute of Scientific and Technical Information of China (English)

    雷红; 韩建文

    2013-01-01

    Voice and image are the main media of human information exchange. Research shows that more than 80% of human information is received through vision and hearing, while smell, touch, and taste together account for less than 20%. Therefore, as the main carriers of visual and auditory information, image and speech signals are very important. At present, research on the processing of these two kinds of signals has become a key topic in the information communication field. This paper presents a comparative analysis of speech signal processing and image signal processing, with a view to the better development of both.

  8. Recognition of Emotions in Mexican Spanish Speech: An Approach Based on Acoustic Modelling of Emotion-Specific Vowels

    OpenAIRE

    Santiago-Omar Caballero-Morales

    2013-01-01

    An approach for the recognition of emotions in speech is presented. The target language is Mexican Spanish, and for this purpose a speech database was created. The approach consists in the phoneme acoustic modelling of emotion-specific vowels. For this, a standard phoneme-based Automatic Speech Recognition (ASR) system was built with Hidden Markov Models (HMMs), where different phoneme HMMs were built for the consonants and emotion-specific vowels associated with four emotional states (anger,...

  9. Modeling the effects of a single reflection on binaural speech intelligibility.

    Science.gov (United States)

    Rennies, Jan; Warzybok, Anna; Brand, Thomas; Kollmeier, Birger

    2014-03-01

    Recently the influence of delay and azimuth of a single speech reflection on speech reception thresholds (SRTs) was systematically investigated using frontal, diffuse, and lateral noise [Warzybok et al. (2013). J. Acoust. Soc. Am. 133, 269-282]. The experiments showed that the benefit of an early reflection was independent of its azimuth and mostly independent of noise type, but that the detrimental effect of a late reflection depended on its direction relative to the noise. This study tests if different extensions of a binaural speech intelligibility model can predict these data. The extensions differ in the order in which binaural processing and temporal integration of early reflections take place. Models employing a correction for the detrimental effects of reverberation on speech intelligibility after performing the binaural processing predict SRTs in symmetric masking conditions (frontal, diffuse), but cannot predict the measured interaction of temporal and spatial integration. In contrast, a model extension accounting for the distinction between useful and detrimental reflections before the binaural processing stage predicts the data with an overall R(2) of 0.95. This indicates that any model framework predicting speech intelligibility in rooms should incorporate an interaction between binaural and temporal integration of reflections at a comparatively early stage.
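
The distinction between useful and detrimental reflections can be illustrated with a short sketch: energy of a room impulse response arriving within an early window after the direct sound counts as useful for intelligibility, later energy as detrimental. The 50 ms boundary below is a common convention assumed for illustration, not a fitted parameter of the model discussed above.

```python
# Split a room impulse response into useful (early) and detrimental (late)
# parts and report their energy ratio in dB.
import numpy as np

def useful_to_detrimental_db(rir, fs, boundary_s=0.050):
    onset = int(np.argmax(np.abs(rir)))            # direct-sound arrival
    split = onset + int(boundary_s * fs)
    useful = np.sum(rir[:split] ** 2)
    detrimental = np.sum(rir[split:] ** 2) + 1e-12
    return 10.0 * np.log10(useful / detrimental)

fs = 16000
t = np.arange(int(0.4 * fs)) / fs
rir = np.exp(-t / 0.08) * np.random.default_rng(4).standard_normal(t.size)
rir[0] = 3.0                                       # strong direct sound
print("U/D ratio: %.1f dB" % useful_to_detrimental_db(rir, fs))
```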

  10. Improving Robustness of Deep Neural Network Acoustic Models via Speech Separation and Joint Adaptive Training

    Science.gov (United States)

    Narayanan, Arun; Wang, DeLiang

    2015-01-01

    Although deep neural network (DNN) acoustic models are known to be inherently noise robust, especially with matched training and testing data, the use of speech separation as a frontend and for deriving alternative feature representations has been shown to improve performance in challenging environments. We first present a supervised speech separation system that significantly improves automatic speech recognition (ASR) performance in realistic noise conditions. The system performs separation via ratio time-frequency masking; the ideal ratio mask (IRM) is estimated using DNNs. We then propose a framework that unifies separation and acoustic modeling via joint adaptive training. Since the modules for acoustic modeling and speech separation are implemented using DNNs, unification is done by introducing additional hidden layers with fixed weights and appropriate network architecture. On the CHiME-2 medium-large vocabulary ASR task, and with log mel spectral features as input to the acoustic model, an independently trained ratio masking frontend improves word error rates by 10.9% (relative) compared to the noisy baseline. In comparison, the jointly trained system improves performance by 14.4%. We also experiment with alternative feature representations to augment the standard log mel features, like the noise and speech estimates obtained from the separation module, and the standard feature set used for IRM estimation. Our best system obtains a word error rate of 15.4% (absolute), an improvement of 4.6 percentage points over the next best result on this corpus. PMID:26973851
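
The ideal ratio mask (IRM) that the separation frontend targets has a simple closed form. The sketch below computes it from known speech and noise signals for illustration; in the paper it is estimated by DNNs from noisy observations.

```python
# Compute the IRM from parallel speech and noise, then apply it to the
# mixture spectrogram and resynthesize.
import numpy as np
from scipy.signal import stft, istft

def ideal_ratio_mask(speech, noise, fs=16000, nperseg=400):
    _, _, S = stft(speech, fs=fs, nperseg=nperseg)
    _, _, N = stft(noise, fs=fs, nperseg=nperseg)
    return np.sqrt(np.abs(S) ** 2 / (np.abs(S) ** 2 + np.abs(N) ** 2 + 1e-12))

rng = np.random.default_rng(5)
fs = 16000
t = np.arange(fs) / fs
speech = np.sin(2 * np.pi * 150 * t) * (1 + 0.5 * np.sin(2 * np.pi * 3 * t))
noise = 0.5 * rng.standard_normal(fs)

mask = ideal_ratio_mask(speech, noise, fs)
_, _, Y = stft(speech + noise, fs=fs, nperseg=400)
_, enhanced = istft(mask * Y, fs=fs, nperseg=400)   # masked resynthesis
print("mask shape:", mask.shape)
```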

  11. Compressed sensing of speech signal using sparse diagonal matrix

    Institute of Scientific and Technical Information of China (English)

    马春; 孙南; 程涛军; 李新华

    2012-01-01

    The application of compressed sensing to speech signal reconstruction is explored, and the effect of the choice of measurement matrix on the quality of speech signal reconstruction is studied. Traditional measurement matrices, such as random, Toeplitz, and circulant matrices, are improved, and a sparse diagonal matrix is applied as the measurement matrix to perform incoherent measurement of the speech signal. Experiments on speech signals compare the sparse diagonal structured measurement matrix with traditional measurement matrices, reconstructing the signal with the StOMP algorithm in each case. The results show that reconstruction with the improved sparse diagonal circulant matrix is clearly more accurate than reconstruction with traditional matrices, and the running time is also clearly shorter.

  12. Modeling speech imitation and ecological learning of auditory-motor maps

    Directory of Open Access Journals (Sweden)

    Claudia Canevari

    2013-06-01

    Classical models of speech consider an antero-posterior distinction between perceptive and productive functions. However, the selective alteration of neural activity in speech motor centers, via transcranial magnetic stimulation, was shown to affect speech discrimination. On the automatic speech recognition (ASR) side, recognition systems have classically relied solely on acoustic data, achieving rather good performance in optimal listening conditions. The main limitations of current ASR become evident in realistic use of such systems. These limitations can be partly reduced by using normalization strategies that minimize inter-speaker variability, by either explicitly removing speakers' peculiarities or adapting different speakers to a reference model. In this paper we aim at modeling a motor-based imitation learning mechanism in ASR. We tested the utility of a speaker normalization strategy that uses motor representations of speech and compared it with strategies that ignore the motor domain. Specifically, we first trained a regressor, through state-of-the-art machine learning techniques, to build an auditory-motor mapping, in a sense mimicking a human learner who tries to reproduce utterances produced by other speakers. This auditory-motor mapping maps the speech acoustics of a speaker into the motor plans of a reference speaker. Since only speech acoustics are available during recognition, the mapping is necessary to recover motor information. Subsequently, in a phone classification task, we tested the system on either one of the speakers used during training or a new one. Results show that in both cases the motor-based speaker normalization strategy almost always outperforms all other strategies in which only acoustics is taken into account.
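
A toy rendering of the normalization idea, with synthetic data standing in for real articulatory recordings and stronger learners: a ridge regression learns the auditory-motor mapping from a speaker's acoustics to reference "motor" features, and phone classification is then performed in the mapped space.

```python
# Learn an acoustic-to-motor mapping, then classify phones in motor space.
import numpy as np
from sklearn.linear_model import LogisticRegression, Ridge

rng = np.random.default_rng(6)
motor_ref = rng.standard_normal((400, 4))              # reference motor plans
labels = rng.integers(0, 5, 400)                       # 5 toy phone classes
motor_ref += labels[:, None] * 0.8                     # inject class structure

A = rng.standard_normal((4, 12))                       # speaker-specific motor->acoustic map
acoustic = motor_ref @ A + 0.1 * rng.standard_normal((400, 12))

inverse_map = Ridge(alpha=1.0).fit(acoustic, motor_ref)     # auditory-motor mapping
phone_clf = LogisticRegression(max_iter=1000).fit(
    inverse_map.predict(acoustic[:300]), labels[:300])       # train in motor space

acc = phone_clf.score(inverse_map.predict(acoustic[300:]), labels[300:])
print("toy phone accuracy in motor space:", acc)
```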

  13. Enhancement of Non-Stationary Speech using Harmonic Chirp Filters

    DEFF Research Database (Denmark)

    Nørholm, Sidsel Marie; Jensen, Jesper Rindom; Christensen, Mads Græsbøll

    2015-01-01

    the non-stationarity nature of voiced speech into account via linear constraints. This is facilitated by imposing a harmonic chirp model on the speech signal. As an implicit part of the filter design, the noise statistics are also estimated based on the observed signal and parameters of the harmonic chirp...... model. Simulations on real speech show that the chirp based filters perform better than their harmonic counterparts. Further, it is seen that the gain of using the chirp model increases when the estimated chirp parameter is big corresponding to periods in the signal where the instantaneous fundamental...
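
The harmonic chirp model referred to above can be written as x(t) = sum_k a_k cos(2*pi*k*(f0*t + 0.5*c*t^2)), so that all harmonics share a fundamental that varies linearly over the analysis segment. A minimal synthesis sketch, with arbitrary placeholder amplitudes:

```python
# Synthesize a harmonic chirp: every harmonic k tracks k times a linearly
# varying fundamental f0 + c*t.
import numpy as np

def harmonic_chirp(f0, chirp_rate, amps, dur, fs):
    t = np.arange(int(dur * fs)) / fs
    x = np.zeros_like(t)
    for k, a in enumerate(amps, start=1):
        x += a * np.cos(2 * np.pi * k * (f0 * t + 0.5 * chirp_rate * t ** 2))
    return x

fs = 16000
x = harmonic_chirp(f0=120.0, chirp_rate=40.0, amps=[1.0, 0.6, 0.3, 0.15],
                   dur=0.04, fs=fs)   # 40 ms segment; f0 sweeps 120 -> 121.6 Hz
print(x.shape)
```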

  14. Phoneme Class Based Adaptation for Mismatch Acoustic Modeling of Distant Noisy Speech (Preprint)

    Science.gov (United States)

    2012-03-01

    …microphone acoustic model and distant-data MFCCs. First, by investigating the close-talk microphone acoustic model, the English phonemes were divided…

  15. Dynamic HMM Model with Estimated Dynamic Property in Continuous Mandarin Speech Recognition

    Institute of Scientific and Technical Information of China (English)

    CHENFeili; ZHUJie

    2003-01-01

    A new dynamic HMM (hidden Markov model) is introduced in this paper, which describes the relationship between the dynamic property and the feature space. A method to estimate the dynamic property is discussed, which makes the dynamic HMM much more practical in real-time speech recognition. Experiments on a large vocabulary continuous Mandarin speech recognition task have shown that the dynamic HMM model can achieve about 10% error reduction for both tonal and toneless syllables. The estimated dynamic property achieves nearly the same (or even better) performance as the extracted dynamic property.

  16. Language modeling for automatic speech recognition of inflective languages an applications-oriented approach using lexical data

    CERN Document Server

    Donaj, Gregor

    2017-01-01

    This book covers language modeling and automatic speech recognition for inflective languages (e.g. Slavic languages), which represent roughly half of the languages spoken in Europe. These languages do not perform as well as English in speech recognition systems and it is therefore harder to develop an application with sufficient quality for the end user. The authors describe the most important language features for the development of a speech recognition system. This is then presented through the analysis of errors in the system and the development of language models and their inclusion in speech recognition systems, which specifically address the errors that are relevant for targeted applications. The error analysis is done with regard to morphological characteristics of the word in the recognized sentences. The book is oriented towards speech recognition with large vocabularies and continuous and even spontaneous speech. Today such applications work with a rather small number of languages compared to the nu...

  17. Speech recognition from spectral dynamics

    Indian Academy of Sciences (India)

    Hynek Hermansky

    2011-10-01

    Information is carried in changes of a signal. The paper starts with revisiting Dudley’s concept of the carrier nature of speech. It points to its close connection to modulation spectra of speech and argues against short-term spectral envelopes as dominant carriers of the linguistic information in speech. The history of spectral representations of speech is briefly discussed. Some of the history of gradual infusion of the modulation spectrum concept into Automatic recognition of speech (ASR) comes next, pointing to the relationship of modulation spectrum processing to well-accepted ASR techniques such as dynamic speech features or RelAtive SpecTrAl (RASTA) filtering. Next, the frequency domain perceptual linear prediction technique for deriving autoregressive models of temporal trajectories of spectral power in individual frequency bands is reviewed. Finally, posterior-based features, which allow for straightforward application of modulation frequency domain information, are described. The paper is tutorial in nature, aims at a historical global overview of attempts for using spectral dynamics in machine recognition of speech, and does not always provide enough detail of the described techniques. However, extensive references to earlier work are provided to compensate for the lack of detail in the paper.
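
As one concrete example of the modulation-domain processing mentioned above, the classic RASTA filter band-passes each band's log-energy trajectory with H(z) = 0.1*(2 + z^-1 - z^-3 - 2z^-4)/(1 - 0.98 z^-1), suppressing very slow (channel) and very fast modulations; a minimal sketch:

```python
# Apply the classic RASTA IIR band-pass filter along the time axis of a
# log-spectral feature matrix.
import numpy as np
from scipy.signal import lfilter

def rasta_filter(log_spectrum):
    """log_spectrum: array of shape (frames, bands)."""
    b = 0.1 * np.array([2.0, 1.0, 0.0, -1.0, -2.0])
    a = np.array([1.0, -0.98])
    return lfilter(b, a, log_spectrum, axis=0)   # filter each band over time

frames = np.abs(np.random.default_rng(7).standard_normal((200, 20))) + 1e-3
filtered = rasta_filter(np.log(frames))
print(filtered.shape)
```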

  18. A Novel DBN Feature Fusion Model for Cross-Corpus Speech Emotion Recognition

    Directory of Open Access Journals (Sweden)

    Zou Cairong

    2016-01-01

    Feature fusion from separate sources is a current technical difficulty in cross-corpus speech emotion recognition. The purpose of this paper is, based on Deep Belief Nets (DBN) in deep learning, to use the emotional information hidden in the speech spectrum diagram (spectrogram) as image features and then implement feature fusion with traditional emotion features. First, based on spectrogram analysis with the STB/Itti model, new spectrogram features are extracted from the color, the brightness, and the orientation channels, respectively; then two alternative DBN models fuse the traditional and the spectrogram features, which increases the scale of the feature subset and its ability to characterize emotion. In experiments on the ABC database and Chinese corpora, the new feature subset, compared with traditional speech emotion features, improves the cross-corpus recognition result distinctly, by 8.8%. The proposed method provides a new idea for the feature fusion of emotion recognition.

  19. Tone model integration based on discriminative weight training for Putonghua speech recognition

    Institute of Scientific and Technical Information of China (English)

    HUANG Hao; ZHU Jie

    2008-01-01

    A discriminative framework for tone model integration in continuous speech recognition is proposed. The method uses model-dependent weights to scale the probabilities of hidden Markov models based on spectral features and of tone models based on tonal features. The weights are discriminatively trained by the minimum phone error criterion. An update equation for the model weights based on the extended Baum-Welch algorithm is derived. Various schemes of model weight combination are evaluated, and a smoothing technique is introduced to make training robust to overfitting. The proposed method is evaluated on tonal syllable output and character output speech recognition tasks. The experimental results show the proposed method obtains 9.5% and 4.7% relative error reductions over the global weight on the two tasks, due to a better interpolation of the given models. This proves the effectiveness of discriminatively trained model weights for tone model integration.

  20. A Comparison of Speech Sound Intervention Delivered by Telepractice and Side-by-Side Service Delivery Models

    Science.gov (United States)

    Grogan-Johnson, Sue; Schmidt, Anna Marie; Schenker, Jason; Alvares, Robin; Rowan, Lynne E.; Taylor, Jacquelyn

    2013-01-01

    Telepractice has the potential to provide greater access to speech-language intervention services for children with communication impairments. Substantiation of this delivery model is necessary for telepractice to become an accepted alternative delivery model. This study investigated the progress made by school-age children with speech sound…

  1. Wavelet Packet Entropy in Speaker-Independent Emotional State Detection from Speech Signal

    Directory of Open Access Journals (Sweden)

    Mina Kadkhodaei Elyaderani

    2015-01-01

    In this paper, wavelet packet entropy is proposed for speaker-independent emotion detection from speech. After pre-processing, a wavelet packet decomposition using wavelet type db3 at level 4 is computed, and the Shannon entropy in its nodes is calculated for use as features. In addition, prosodic features such as the first four formants, jitter (pitch deviation amplitude), and shimmer (energy variation amplitude), besides MFCC features, are used to complete the feature vector. Then, a Support Vector Machine (SVM) is used to classify the vectors in multi-class (all emotions) or two-class (each emotion versus the normal state) format. Forty-six different utterances of a single sentence from the Berlin Emotional Speech Dataset were selected, uttered by 10 speakers in sadness, happiness, fear, boredom, anger, and the normal emotional state. Experimental results show that the proposed features can improve emotional state detection accuracy in the multi-class situation. Furthermore, adding wavelet entropy coefficients to the other features increases the accuracy of two-class detection for anger, fear, and happiness.
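
A sketch of the described feature extraction, assuming a plain energy-normalized Shannon entropy per terminal node: a db3 wavelet packet decomposition at level 4 gives 16 nodes, each contributing one entropy feature, and an SVM is trained on placeholder vectors just to show the pipeline shape.

```python
# Wavelet packet entropy features (db3, level 4) followed by an SVM.
import numpy as np
import pywt
from sklearn.svm import SVC

def wp_entropy_features(x, wavelet="db3", level=4):
    wp = pywt.WaveletPacket(data=x, wavelet=wavelet, maxlevel=level)
    feats = []
    for node in wp.get_level(level, order="natural"):   # 2**4 = 16 nodes
        e = node.data ** 2
        p = e / (e.sum() + 1e-12)                        # node energy distribution
        feats.append(-np.sum(p * np.log2(p + 1e-12)))    # Shannon entropy
    return np.array(feats)

rng = np.random.default_rng(8)
X = np.array([wp_entropy_features(rng.standard_normal(4000)) for _ in range(60)])
y = rng.integers(0, 2, 60)           # toy two-class labels (emotion vs. normal)
clf = SVC(kernel="rbf").fit(X[:40], y[:40])
print("toy accuracy:", clf.score(X[40:], y[40:]))
```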

  2. A predictive model for diagnosing stroke-related apraxia of speech.

    Science.gov (United States)

    Ballard, Kirrie J; Azizi, Lamiae; Duffy, Joseph R; McNeil, Malcolm R; Halaki, Mark; O'Dwyer, Nicholas; Layfield, Claire; Scholl, Dominique I; Vogel, Adam P; Robin, Donald A

    2016-01-29

    Diagnosis of the speech motor planning/programming disorder, apraxia of speech (AOS), has proven challenging, largely due to its common co-occurrence with the language-based impairment of aphasia. Currently, diagnosis is based on perceptually identifying and rating the severity of several speech features. It is not known whether all, or a subset of the features, are required for a positive diagnosis. The purpose of this study was to assess predictor variables for the presence of AOS after left-hemisphere stroke, with the goal of increasing diagnostic objectivity and efficiency. This population-based case-control study involved a sample of 72 cases, using the outcome measure of expert judgment on presence of AOS and including a large number of independently collected candidate predictors representing behavioral measures of linguistic, cognitive, nonspeech oral motor, and speech motor ability. We constructed a predictive model using multiple imputation to deal with missing data; the Least Absolute Shrinkage and Selection Operator (Lasso) technique for variable selection to define the most relevant predictors, and bootstrapping to check the model stability and quantify the optimism of the developed model. Two measures were sufficient to distinguish between participants with AOS plus aphasia and those with aphasia alone, (1) a measure of speech errors with words of increasing length and (2) a measure of relative vowel duration in three-syllable words with weak-strong stress pattern (e.g., banana, potato). The model has high discriminative ability to distinguish between cases with and without AOS (c-index=0.93) and good agreement between observed and predicted probabilities (calibration slope=0.94). Some caution is warranted, given the relatively small sample specific to left-hemisphere stroke, and the limitations of imputing missing data. These two speech measures are straightforward to collect and analyse, facilitating use in research and clinical settings.
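
A hedged sketch of the variable-selection step described above: an L1-penalized (Lasso-style) logistic regression selects a sparse set of predictors, and discrimination is summarized with the c-index (equivalent to ROC AUC for a binary outcome). The data are synthetic, and the study's multiple imputation and bootstrap steps are omitted.

```python
# L1-penalized logistic regression for sparse predictor selection plus c-index.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(9)
X = rng.standard_normal((72, 20))            # 72 cases x 20 candidate predictors
logit = 2.0 * X[:, 0] - 1.5 * X[:, 1]        # only two truly relevant measures
y = (rng.random(72) < 1 / (1 + np.exp(-logit))).astype(int)

lasso = LogisticRegression(penalty="l1", solver="liblinear", C=0.5).fit(X, y)
selected = np.flatnonzero(lasso.coef_[0])
print("selected predictors:", selected)
print("c-index:", round(roc_auc_score(y, lasso.decision_function(X)), 2))
```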

  3. Nonlinear neurodynamics in representation of a rhythm of speech.

    Science.gov (United States)

    Skljarov, O P

    1999-06-01

    A mathematical model is offered to describe the functioning of speech rhythm. The duration of a speech signal is divided into a numbered sequence of durations of voiced and voiceless segments. All elements of this sequence are considered as values normalized to the maximum element; we define this sequence of elements as the speech rhythm. (1) The model describes the speech rhythm via recurrent relations between the elements of the rhythm. (2) The model permits use of the concept of information entropy. (3) The model explains experimental findings obtained by our research group during a comparative investigation of rhythm in normal speech and stuttering. In particular, the model explains the existence of two classes of stutterers with different rhythms of speech.

  4. Mental imagery of speech and movement implicates the dynamics of internal forward models

    Directory of Open Access Journals (Sweden)

    Xing Tian

    2010-10-01

    The classical concept of efference copies in the context of internal forward models has stimulated productive research in cognitive science and neuroscience. There are compelling reasons to argue for such a mechanism, but finding direct evidence in the human brain remains difficult. Here we investigate the dynamics of internal forward models from an unconventional angle: mental imagery, assessed while recording high temporal resolution neuronal activity using magnetoencephalography (MEG). We compare overt and covert tasks; our covert, mental imagery tasks are unconfounded by overt input/output demands, but in turn necessitate the development of appropriate multi-dimensional topographic analyses. Finger tapping (studies 1-2) and speech experiments (studies 3-5) provide temporally constrained results that implicate the estimation of an efference copy. We suggest that one internal forward model over parietal cortex subserves the kinesthetic feeling in motor imagery. Second, observed auditory neural activity ~170 ms after motor estimation in the speech experiments (studies 3-5) demonstrates the anticipated auditory consequences of planned motor commands in a second internal forward model in imagery of speech production. Our results provide neurophysiological evidence from the human brain in favor of internal forward models deploying efference copies in somatosensory and auditory cortex, in finger tapping and speech production tasks, respectively, and also suggest the dynamics and sequential updating structure of internal forward models.

  5. Syllable language models for Mandarin speech recognition: exploiting character language models.

    Science.gov (United States)

    Liu, Xunying; Hieronymus, James L; Gales, Mark J F; Woodland, Philip C

    2013-01-01

    Mandarin Chinese is based on characters which are syllabic in nature and morphological in meaning. All spoken languages have syllabotactic rules which govern the construction of syllables and their allowed sequences. These constraints are not as restrictive as those learned from word sequences, but they can provide additional useful linguistic information. Hence, it is possible to improve speech recognition performance by appropriately combining these two types of constraints. For the Chinese language considered in this paper, character level language models (LMs) can be used as a first level approximation to allowed syllable sequences. To test this idea, word and character level n-gram LMs were trained on 2.8 billion words (equivalent to 4.3 billion characters) of texts from a wide collection of text sources. Both hypothesis and model based combination techniques were investigated to combine word and character level LMs. Significant character error rate reductions up to 7.3% relative were obtained on a state-of-the-art Mandarin Chinese broadcast audio recognition task using an adapted history dependent multi-level LM that performs a log-linear combination of character and word level LMs. This supports the hypothesis that character or syllable sequence models are useful for improving Mandarin speech recognition performance.
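
The core of such a combination is a weighted sum of log probabilities. A minimal sketch, with fixed interpolation weights standing in for the paper's adapted, history-dependent weights:

```python
# Log-linear combination of word-level and character-level LM scores.
import math

def combined_logprob(word_logp, char_logps, w_word=0.7, w_char=0.3):
    """word_logp: log P_word(w | h); char_logps: log probs of the word's
    character sequence under the character LM. Weights are placeholders."""
    return w_word * word_logp + w_char * sum(char_logps)

# Hypothetical scores for one two-character word:
score = combined_logprob(word_logp=math.log(0.01),
                         char_logps=[math.log(0.05), math.log(0.2)])
print("combined log score:", round(score, 3))
```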

  6. Science Signaling Podcast for 7 June 2016: Modeling signal integration.

    Science.gov (United States)

    Janes, Kevin A; VanHook, Annalisa M

    2016-06-07

    This Podcast features an interview with Kevin Janes, senior author of a Research Article that appears in the 7 June 2016 issue of Science Signaling, about a statistical modeling method that can extract useful information from complex data sets. Cells exist in very complex environments. They are constantly exposed to growth factors, hormones, nutrients, and many other factors that influence cellular behavior. When cells integrate information from multiple stimuli, the resulting output does not necessarily reflect a simple additive effect of the responses to each individual stimulus. Chitforoushzadeh et al. employed a statistical modeling approach that maintained the multidimensional nature of the data to analyze the responses of colonic epithelial cells to various combinations of the proinflammatory cytokine TNF, the growth factor EGF, and insulin. As the model predicted, experiments confirmed that insulin suppressed TNF-induced proinflammatory signaling through a mechanism that involved the transcription factor GATA6. Copyright © 2016, American Association for the Advancement of Science.

  7. A Hybrid Acoustic and Pronunciation Model Adaptation Approach for Non-native Speech Recognition

    Science.gov (United States)

    Oh, Yoo Rhee; Kim, Hong Kook

    In this paper, we propose a hybrid model adaptation approach in which pronunciation and acoustic models are adapted by incorporating the pronunciation and acoustic variabilities of non-native speech in order to improve the performance of non-native automatic speech recognition (ASR). Specifically, the proposed hybrid model adaptation can be performed at either the state-tying or triphone-modeling level, depending on the level at which the acoustic model adaptation is performed. In both methods, we first analyze the pronunciation variant rules of non-native speakers and then classify each rule as either a pronunciation variant or an acoustic variant. The state-tying level hybrid method then adapts pronunciation models and acoustic models by accommodating the pronunciation variants in the pronunciation dictionary and by clustering the states of triphone acoustic models using the acoustic variants, respectively. On the other hand, the triphone-modeling level hybrid method initially adapts pronunciation models in the same way as in the state-tying level hybrid method; however, for the acoustic model adaptation, the triphone acoustic models are then re-estimated based on the adapted pronunciation models and the states of the re-estimated triphone acoustic models are clustered using the acoustic variants. From the Korean-spoken English speech recognition experiments, it is shown that ASR systems employing the state-tying and triphone-modeling level adaptation methods can relatively reduce the average word error rates (WERs) by 17.1% and 22.1% for non-native speech, respectively, when compared to a baseline ASR system.

  8. Appropriacy Planning: Speech Acts Studies and Planning Appropriate Models for ESL Learners.

    Science.gov (United States)

    Kubota, Mitsuo

    1997-01-01

    Since the emergence of the concept of communicative competence, the language teaching field has focused on teaching appropriate language use in addition to general linguistic elements. Speech act studies have contributed to providing appropriate models for second and foreign language learners. In this paper, the effort toward the creation and use…

  9. Randomized Controlled Trial of Video Self-Modeling Following Speech Restructuring Treatment for Stuttering

    Science.gov (United States)

    Cream, Angela; O'Brian, Sue; Jones, Mark; Block, Susan; Harrison, Elisabeth; Lincoln, Michelle; Hewat, Sally; Packman, Ann; Menzies, Ross; Onslow, Mark

    2010-01-01

    Purpose: In this study, the authors investigated the efficacy of video self-modeling (VSM) following speech restructuring treatment to improve the maintenance of treatment effects. Method: The design was an open-plan, parallel-group, randomized controlled trial. Participants were 89 adults and adolescents who undertook intensive speech…

  10. Comparing Three Methods to Create Multilingual Phone Models for Vocabulary Independent Speech Recognition Tasks

    Science.gov (United States)

    2000-08-01

    Joachim Köhler, German National Research Center for Information Technology. [Only fragments of this DTIC record (compilation part notice ADP010392) are legible, including a reference to Glass et al., "Multilingual Spoken Language Understanding in the MIT VOYAGER System," and a comparison of multilingual clusters against 5280 monolingual clusters.]

  11. A discourse model of affect for text-to-speech synthesis

    CSIR Research Space (South Africa)

    Schlunz, GI

    2013-12-01

    This paper introduces a model of affect to improve prosody in text-to-speech synthesis. It operates on the discourse level of text to predict the underlying linguistic factors that contribute towards emotional appraisal, rather than any particular...

  12. Least 1-Norm Pole-Zero Modeling with Sparse Deconvolution for Speech Analysis

    DEFF Research Database (Denmark)

    Shi, Liming; Jensen, Jesper Rindom; Christensen, Mads Græsbøll

    2017-01-01

    . Moreover, to consider the spiky excitation form of the pulse train during voiced speech, the modeling parameters and sparse residuals are estimated in an iterative fashion using a least 1-norm pole-zero with sparse deconvolution algorithm. Compared with the conventional two-stage least squares pole...

  13. Speech Recognition of Non-Native Speech Using Native and Non-Native Acoustic Models

    Science.gov (United States)

    2000-08-01

    David A. van Leeuwen and Rosemary Orr, TNO Human Factors Research Institute. [Only fragments of this record are legible; the remainder is garbled header and reference-list residue, including a citation to Journal of Phonetics 25:437-470 (1997) and to D. B. Paul and J. M. Baker.]

  14. Improving Language Models in Speech-Based Human-Machine Interaction

    Directory of Open Access Journals (Sweden)

    Raquel Justo

    2013-02-01

    Full Text Available This work focuses on speech-based human-machine interaction. Specifically, a Spoken Dialogue System (SDS) that could be integrated into a robot is considered. Since Automatic Speech Recognition is one of the most sensitive tasks that must be confronted in such systems, the goal of this work is to improve the results obtained by this specific module. To do so, a hierarchical Language Model (LM) is considered. Several series of experiments were carried out using the proposed models over different corpora and tasks. The results obtained show that these models provide greater accuracy in the recognition task. Additionally, the influence of the Acoustic Modelling (AM) on the improvement achieved by the Language Models has also been explored. Finally, hierarchical Language Models were also successfully employed in a language understanding task, as shown in an additional series of experiments.

  15. Combined Acoustic and Pronunciation Modelling for Non-Native Speech Recognition

    CERN Document Server

    Bouselmi, Ghazi; Illina, Irina

    2007-01-01

    In this paper, we present several adaptation methods for non-native speech recognition. We have tested pronunciation modelling, MLLR and MAP non-native pronunciation adaptation, and HMM model retraining on the HIWIRE foreign-accented English speech database. The "phonetic confusion" scheme we have developed consists in associating to each spoken phone several sequences of confused phones. In our experiments, we have used different combinations of acoustic models representing the canonical and the foreign pronunciations: spoken and native models, and models adapted to the non-native accent with MAP and MLLR. The joint use of pronunciation modelling and acoustic adaptation led to further improvements in recognition accuracy. The best combination of the above-mentioned techniques resulted in a relative word error reduction ranging from 46% to 71%.

  16. Modeling Two-Channel Speech Processing With the EPIC Cognitive Architecture.

    Science.gov (United States)

    Kieras, David E; Wakefield, Gregory H; Thompson, Eric R; Iyer, Nandini; Simpson, Brian D

    2016-01-01

    An important application of cognitive architectures is to provide human performance models that capture psychological mechanisms in a form that can be "programmed" to predict task performance of human-machine system designs. Although many aspects of human performance have been successfully modeled in this approach, accounting for multitalker speech task performance is a novel problem. This article presents a model for performance in a two-talker task that incorporates concepts from psychoacoustics, in particular, masking effects and stream formation.

  17. Treatment Model in Children with Speech Disorders and Its Therapeutic Efficiency

    Directory of Open Access Journals (Sweden)

    Barberena, Luciana

    2014-05-01

    Full Text Available Introduction Speech articulation disorders affect the intelligibility of speech. Studies on therapeutic models show the effectiveness of the communication treatment. Objective To analyze the progress achieved by treatment with the ABAB—Withdrawal and Multiple Probes Model in children with different degrees of phonological disorders. Methods The diagnosis of speech articulation disorder was determined by speech and hearing evaluation and complementary tests. The subjects of this research were eight children, with the average age of 5:5. The children were distributed into four groups according to the degrees of the phonological disorders, based on the percentage of correct consonants, as follows: severe, moderate to severe, mild to moderate, and mild. The phonological treatment applied was the ABAB—Withdrawal and Multiple Probes Model. The development of the therapy by generalization was observed through the comparison between the two analyses: contrastive and distinctive features at the moment of evaluation and reevaluation. Results The following types of generalization were found: to the items not used in the treatment (other words, to another position in the word, within a sound class, to other classes of sounds, and to another syllable structure. Conclusion The different types of generalization studied showed the expansion of production and proper use of therapy-trained targets in other contexts or untrained environments. Therefore, the analysis of the generalizations proved to be an important criterion to measure the therapeutic efficacy.

  18. Speech Enhancement with Multivariate Laplace Speech Model

    Institute of Scientific and Technical Information of China (English)

    周彬; 邹霞; 张雄伟

    2012-01-01

    The spectral components of speech are usually assumed to be independent in traditional short-time spectrum estimation, which is not the case in practice. To address this, a new speech enhancement algorithm based on a multivariate Laplace speech model is proposed in this paper. First, the speech Discrete Cosine Transform (DCT) coefficients are modeled by a multivariate Laplace distribution, so that the correlations between speech spectral components can be exploited. Then a Minimum Mean-Square Error (MMSE) estimator based on the proposed model is derived using a Gaussian scale mixture representation of random vectors. Furthermore, the speech presence uncertainty under the new model is derived to modify the MMSE estimator. Experimental results show that the developed method achieves better noise suppression and lower speech distortion than traditional speech enhancement methods.
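
    As a much-simplified stand-in for the estimator described (univariate rather than multivariate, and MAP rather than MMSE), the following sketch applies the soft-threshold rule that a Laplace prior with scale b under additive Gaussian noise of variance nv yields for DCT coefficients; the frame length and parameter values are arbitrary.

```python
# Simplified stand-in, not the paper's multivariate MMSE rule: under an
# *independent* univariate Laplace prior (scale b) with additive Gaussian
# noise (variance nv), the MAP estimate of a DCT coefficient is a
# soft-threshold with threshold nv / b.
import numpy as np
from scipy.fft import dct, idct

def enhance_frame(noisy_frame, noise_var, laplace_scale):
    """MAP estimate under an independent Laplace prior: soft-thresholding."""
    c = dct(noisy_frame, norm="ortho")
    thr = noise_var / laplace_scale          # threshold = nv / b
    c_hat = np.sign(c) * np.maximum(np.abs(c) - thr, 0.0)
    return idct(c_hat, norm="ortho")

rng = np.random.default_rng(0)
clean = np.sin(2 * np.pi * 200 * np.arange(256) / 8000)   # toy "speech" frame
noisy = clean + rng.normal(scale=0.1, size=clean.shape)
enhanced = enhance_frame(noisy, noise_var=0.01, laplace_scale=0.1)
print("noisy MSE:   ", np.mean((noisy - clean) ** 2))
print("enhanced MSE:", np.mean((enhanced - clean) ** 2))
```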

  19. Multilevel Analysis in Analyzing Speech Data

    Science.gov (United States)

    Guddattu, Vasudeva; Krishna, Y.

    2011-01-01

    The speech produced by human vocal tract is a complex acoustic signal, with diverse applications in phonetics, speech synthesis, automatic speech recognition, speaker identification, communication aids, speech pathology, speech perception, machine translation, hearing research, rehabilitation and assessment of communication disorders and many…

  20. Proposal for Classifying the Severity of Speech Disorder Using a Fuzzy Model in Accordance with the Implicational Model of Feature Complexity

    Science.gov (United States)

    Brancalioni, Ana Rita; Magnago, Karine Faverzani; Keske-Soares, Marcia

    2012-01-01

    The objective of this study is to create a new proposal for classifying the severity of speech disorders using a fuzzy model in accordance with a linguistic model that represents the speech acquisition of Brazilian Portuguese. The fuzzy linguistic model was run in the MATLAB software fuzzy toolbox from a set of fuzzy rules, and it encompassed…

  1. Clinical and MRI models predicting amyloid deposition in progressive aphasia and apraxia of speech.

    Science.gov (United States)

    Whitwell, Jennifer L; Weigand, Stephen D; Duffy, Joseph R; Strand, Edythe A; Machulda, Mary M; Senjem, Matthew L; Gunter, Jeffrey L; Lowe, Val J; Jack, Clifford R; Josephs, Keith A

    2016-01-01

    Beta-amyloid (Aβ) deposition can be observed in primary progressive aphasia (PPA) and progressive apraxia of speech (PAOS). While it is typically associated with logopenic PPA, there are exceptions that make predicting Aβ status challenging based on clinical diagnosis alone. We aimed to determine whether MRI regional volumes or clinical data could help predict Aβ deposition. One hundred and thirty-nine PPA (n = 97; 15 agrammatic, 53 logopenic, 13 semantic and 16 unclassified) and PAOS (n = 42) subjects were prospectively recruited into a cross-sectional study and underwent speech/language assessments, 3.0 T MRI and C11-Pittsburgh Compound B PET. The presence of Aβ was determined using a 1.5 SUVR cut-point. Atlas-based parcellation was used to calculate gray matter volumes of 42 regions-of-interest across the brain. Penalized binary logistic regression was utilized to determine what combination of MRI regions, and what combination of speech and language tests, best predicts Aβ (+) status. The optimal MRI model and optimal clinical model both performed comparably in their ability to accurately classify subjects according to Aβ status. MRI accurately classified 81% of subjects using 14 regions. Small left superior temporal and inferior parietal volumes and large left Broca's area volumes were particularly predictive of Aβ (+) status. Clinical scores accurately classified 83% of subjects using 12 tests. Phonological errors and repetition deficits, and absence of agrammatism and motor speech deficits were particularly predictive of Aβ (+) status. In comparison, clinical diagnosis was able to accurately classify 89% of subjects. However, the MRI model performed well in predicting Aβ deposition in unclassified PPA. Clinical diagnosis provides optimum prediction of Aβ status at the group level, although regional MRI measurements and speech and language testing also performed well and could have advantages in predicting Aβ status in unclassified PPA subjects.
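
    A sketch of the kind of penalized (L1) binary logistic regression the study describes, on synthetic data; the 42 features stand in for the regional gray matter volumes, and the coefficients and labels are fabricated for illustration only.

```python
# Sketch of penalized logistic regression for amyloid-status prediction;
# all data are synthetic, the feature indices are hypothetical regions.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n = 139                        # cohort size quoted in the abstract
X = rng.normal(size=(n, 42))   # 42 regional volumes (standardized), synthetic
w = np.zeros(42)
w[[0, 1, 2]] = [1.5, -1.2, 1.0]                   # few regions carry signal
y = (X @ w + rng.normal(size=n) > 0).astype(int)  # 1 = amyloid positive

model = LogisticRegression(penalty="l1", solver="liblinear", C=0.5)
model.fit(X, y)
print("regions selected by the L1 penalty:", np.flatnonzero(model.coef_[0]))
print("training accuracy:", model.score(X, y))
```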

  2. Clinical and MRI models predicting amyloid deposition in progressive aphasia and apraxia of speech

    Directory of Open Access Journals (Sweden)

    Jennifer L. Whitwell

    2016-01-01

    Full Text Available Beta-amyloid (Aβ) deposition can be observed in primary progressive aphasia (PPA) and progressive apraxia of speech (PAOS). While it is typically associated with logopenic PPA, there are exceptions that make predicting Aβ status challenging based on clinical diagnosis alone. We aimed to determine whether MRI regional volumes or clinical data could help predict Aβ deposition. One hundred and thirty-nine PPA (n = 97; 15 agrammatic, 53 logopenic, 13 semantic and 16 unclassified) and PAOS (n = 42) subjects were prospectively recruited into a cross-sectional study and underwent speech/language assessments, 3.0 T MRI and C11-Pittsburgh Compound B PET. The presence of Aβ was determined using a 1.5 SUVR cut-point. Atlas-based parcellation was used to calculate gray matter volumes of 42 regions-of-interest across the brain. Penalized binary logistic regression was utilized to determine what combination of MRI regions, and what combination of speech and language tests, best predicts Aβ (+) status. The optimal MRI model and optimal clinical model both performed comparably in their ability to accurately classify subjects according to Aβ status. MRI accurately classified 81% of subjects using 14 regions. Small left superior temporal and inferior parietal volumes and large left Broca's area volumes were particularly predictive of Aβ (+) status. Clinical scores accurately classified 83% of subjects using 12 tests. Phonological errors and repetition deficits, and absence of agrammatism and motor speech deficits were particularly predictive of Aβ (+) status. In comparison, clinical diagnosis was able to accurately classify 89% of subjects. However, the MRI model performed well in predicting Aβ deposition in unclassified PPA. Clinical diagnosis provides optimum prediction of Aβ status at the group level, although regional MRI measurements and speech and language testing also performed well and could have advantages in predicting Aβ status in unclassified PPA subjects.

  3. Mapping Speech Spectra from Throat Microphone to Close-Speaking Microphone: A Neural Network Approach

    Directory of Open Access Journals (Sweden)

    B. Yegnanarayana

    2007-01-01

    Full Text Available Speech recorded from a throat microphone is robust to the surrounding noise, but sounds unnatural unlike the speech recorded from a close-speaking microphone. This paper addresses the issue of improving the perceptual quality of the throat microphone speech by mapping the speech spectra from the throat microphone to the close-speaking microphone. A neural network model is used to capture the speaker-dependent functional relationship between the feature vectors (cepstral coefficients of the two speech signals. A method is proposed to ensure the stability of the all-pole synthesis filter. Objective evaluations indicate the effectiveness of the proposed mapping scheme. The advantage of this method is that the model gives a smooth estimate of the spectra of the close-speaking microphone speech. No distortions are perceived in the reconstructed speech. This mapping technique is also used for bandwidth extension of telephone speech.
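
    A toy sketch of such a speaker-dependent cepstral mapping, assuming parallel feature vectors from the two microphones are available; the data here are synthetic and the network size is an arbitrary choice, not the authors' architecture.

```python
# Sketch: learn a mapping from throat-microphone cepstra to
# close-speaking-microphone cepstra. Synthetic features stand in for
# real parallel recordings; the hidden-layer size is arbitrary.
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(2)
throat = rng.normal(size=(2000, 13))        # 13 cepstral coefficients per frame
true_map = 0.3 * rng.normal(size=(13, 13))
close = np.tanh(throat @ true_map)          # stand-in for the unknown relation

net = MLPRegressor(hidden_layer_sizes=(32,), max_iter=2000, random_state=0)
net.fit(throat, close)                      # learn throat -> close-talk mapping
print("mapping MSE:", np.mean((net.predict(throat) - close) ** 2))
```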

  4. Noise Removing of Audio Speech Signals by Means of Kalman Filter

    Directory of Open Access Journals (Sweden)

    Soleyman Shirzadi

    2016-06-01

    Full Text Available Nowadays, multimedia (audio and video) processing is among the most important subjects in engineering sciences, and applying digital filters, especially adaptive filters, in such processing is of crucial importance. The theory of adaptive filters such as the Wiener or Kalman filter has been fully developed in the continuous-time as well as the discrete-time domain; given the availability of computers and digital processors, however, adaptive filters are far more practical in the discrete-time domain. An adaptive filter usually combines a digital filter with an adaptive algorithm, so that the filter coefficients can be determined by the adaptive algorithm. In the present article the Kalman filter, which counts as one of the best filters, is surveyed, and its appropriate coefficients are calculated to design an efficient filter. First, a sample signal is randomly selected, which can be an autoregressive signal. Then purely random Gaussian noise is applied to the autoregressive signal, and the resulting noisy signal is analyzed and the noise removed. The operation was simulated in MATLAB, and the results are presented.
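
    A minimal sketch of the experiment described: an autoregressive signal is corrupted by white Gaussian noise and denoised with a scalar Kalman filter. The AR coefficient and variances are assumed known here, which sidesteps the estimation step a real implementation would need.

```python
# Scalar Kalman filter denoising an AR(1) signal observed in white
# Gaussian noise; model parameters are assumed known for simplicity.
import numpy as np

rng = np.random.default_rng(3)
a, q, r, n = 0.95, 0.1, 1.0, 500        # AR coeff, process var, noise var
x = np.zeros(n)
for t in range(1, n):                   # autoregressive source signal
    x[t] = a * x[t - 1] + rng.normal(scale=np.sqrt(q))
y = x + rng.normal(scale=np.sqrt(r), size=n)   # noisy observations

x_hat, p = 0.0, 1.0
est = np.empty(n)
for t in range(n):
    x_pred, p_pred = a * x_hat, a * a * p + q  # predict
    k = p_pred / (p_pred + r)                  # Kalman gain
    x_hat = x_pred + k * (y[t] - x_pred)       # update with observation
    p = (1 - k) * p_pred
    est[t] = x_hat
print("noisy MSE:   ", np.mean((y - x) ** 2))
print("filtered MSE:", np.mean((est - x) ** 2))
```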

  5. Massively Parallel Network Architectures for Automatic Recognition of Visual Speech Signals

    Science.gov (United States)

    1990-01-01

    … (the jaw opening, the lips, teeth, tongue, velum and glottis) defines the shape of the vocal tract filter, which then determines the filter's frequency response … [parts of the speech production] system, such as the glottis and velum, are not [visible]. Those articulators that are visible tend to modify the acoustic signal in ways that are more …

  6. The Effects of Hearing Protectors on Speech Communication and the Perception of Warning Signals

    Science.gov (United States)

    1989-06-01

    [Fragmentary extraction; a figure caption survives: "Figure 1. Effect of earmuffs on the audibility of a signal in noise by a subject with normal hearing (N) and one with sensorineural hearing loss (S)." The remaining text notes a study that investigated the psychological and social effects of "sudden hearing loss" by occluding the ears of normal-hearing individuals; subjects wore earplugs …]

  7. Modeling and simulation of speech emotion recognition

    Institute of Scientific and Technical Information of China (English)

    黄晓峰; 彭远芳

    2012-01-01

    Speech emotion information is nonlinear, redundant and high-dimensional, and the data contain a large amount of noise; traditional recognition models cannot eliminate the redundant and noisy information, so speech emotion recognition accuracy is very low. To improve the accuracy, this paper proposes an intelligent speech emotion recognition model that combines the denoising capability of wavelet analysis with the strong nonlinear processing ability of process neural networks. Wavelet analysis is used to denoise the speech emotion signal, principal component analysis eliminates the redundant information in the emotion features, and a process neural network classifies the emotions. Simulation results show that the recognition rate of the process neural network model is 13% higher than K-nearest neighbors and 8.75% higher than a support vector machine, so the proposed model is an effective tool for intelligent speech emotion recognition.

  8. Models of phonology in the education of speech-language pathologists.

    Science.gov (United States)

    Nelson, Ryan; Ball, Martin J

    2003-01-01

    We discuss developments in theoretical phonology and, in particular, the divide between theories aiming to be adequate accounts of the data, as opposed to those claiming psycholinguistic validity. It would seem that the latter might have greater utility for the speech-language pathologist. However, we need to know the dominant models of clinical phonology, in both clinical education and practice, before we can promote other theoretical approaches. This article describes preliminary results from a questionnaire designed to discover what models of phonology are taught in institutions training speech-language pathologists in the United States. Results support anecdotal evidence that only a limited number of approaches (phonemic, distinctive features, and processes) are taught in many instances. They also demonstrate that some correspondents were unable to distinguish aspects of theoretical phonology from similar sounding (but radically different) models of intervention. This ties in with the results showing that some instructors of phonology courses have little or no background in the subject.

  9. Ear, Hearing and Speech

    DEFF Research Database (Denmark)

    Poulsen, Torben

    2000-01-01

    An introduction is given to the anatomy and function of the ear, basic psychoacoustic matters (hearing threshold, loudness, masking), the speech signal and speech intelligibility. The lecture note is written for the course: Fundamentals of Acoustics and Noise Control (51001).

  10. Bandwidth Extension of Telephone Speech Aided by Data Embedding

    Directory of Open Access Journals (Sweden)

    David Malah

    2007-01-01

    Full Text Available A system for bandwidth extension of telephone speech, aided by data embedding, is presented. The proposed system uses the transmitted analog narrowband speech signal as a carrier of the side information needed to carry out the bandwidth extension. The upper band of the wideband speech is reconstructed at the receiving end from two components: a synthetic wideband excitation signal, generated from the narrowband telephone speech, and a wideband spectral envelope, parametrically represented and transmitted as embedded data in the telephone speech. We propose a novel data embedding scheme, in which the scalar Costa scheme is combined with an auditory masking model allowing high rate transparent embedding, while maintaining a low bit error rate. The signal is transformed to the frequency domain via the discrete Hartley transform (DHT) and is partitioned into subbands. Data is embedded in an adaptively chosen subset of subbands by modifying the DHT coefficients. In our simulations, high quality wideband speech was obtained from speech transmitted over a telephone line (characterized by spectral magnitude distortion, dispersion, and noise), in which side information data is transparently embedded at the rate of 600 information bits/second and with a bit error rate of approximately 3·10⁻⁴. In a listening test, the reconstructed wideband speech was preferred (at different degrees) over conventional telephone speech in 92.5% of the test utterances.
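
    A sketch of the embedding idea under a strong simplification: with its parameter α set to 1, the scalar Costa scheme reduces to plain quantization index modulation, which is what is shown here on a single DHT coefficient. The coefficient index and step size are arbitrary, and no masking model is applied.

```python
# Sketch: one bit embedded in a discrete Hartley transform (DHT)
# coefficient by choosing between two interleaved quantizers (QIM, the
# alpha = 1 degenerate case of the scalar Costa scheme).
import numpy as np

def dht(x):   # DHT via FFT: cas-transform = Re(F) - Im(F)
    f = np.fft.fft(x)
    return f.real - f.imag

def idht(h):  # the DHT is (up to a factor 1/N) its own inverse
    return dht(h) / len(h)

def embed_bit(frame, bit, k=7, step=0.5):
    h = dht(frame)
    # Quantize coefficient k onto the lattice that carries `bit`.
    h[k] = step * (2 * np.round((h[k] - bit * step) / (2 * step)) + bit)
    return idht(h)

def extract_bit(frame, k=7, step=0.5):
    h = dht(frame)
    return int(np.round(h[k] / step)) % 2

rng = np.random.default_rng(4)
frame = rng.normal(size=64)
marked = embed_bit(frame, 1)
print("recovered bit:", extract_bit(marked))
```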

  11. Bandwidth Extension of Telephone Speech Aided by Data Embedding

    Directory of Open Access Journals (Sweden)

    Sagi Ariel

    2007-01-01

    Full Text Available A system for bandwidth extension of telephone speech, aided by data embedding, is presented. The proposed system uses the transmitted analog narrowband speech signal as a carrier of the side information needed to carry out the bandwidth extension. The upper band of the wideband speech is reconstructed at the receiving end from two components: a synthetic wideband excitation signal, generated from the narrowband telephone speech, and a wideband spectral envelope, parametrically represented and transmitted as embedded data in the telephone speech. We propose a novel data embedding scheme, in which the scalar Costa scheme is combined with an auditory masking model allowing high rate transparent embedding, while maintaining a low bit error rate. The signal is transformed to the frequency domain via the discrete Hartley transform (DHT) and is partitioned into subbands. Data is embedded in an adaptively chosen subset of subbands by modifying the DHT coefficients. In our simulations, high quality wideband speech was obtained from speech transmitted over a telephone line (characterized by spectral magnitude distortion, dispersion, and noise), in which side information data is transparently embedded at the rate of 600 information bits/second and with a bit error rate of approximately 3·10⁻⁴. In a listening test, the reconstructed wideband speech was preferred (at different degrees) over conventional telephone speech in 92.5% of the test utterances.

  12. Research on the Windowed Fourier Transform of Speech Signals

    Institute of Scientific and Technical Information of China (English)

    徐坤玉; 张彩珍; 药雪崧

    2011-01-01

    In this paper, the windowed Fourier transform is applied to speech signal spectrum analysis, building on a simulation analysis of the rectangular, Hanning, Hamming and Blackman window functions. Spectrum analysis of speech signals under different window functions shows that choosing the window function appropriately and applying the windowed Fourier transform to the speech signal is an effective method for frequency-domain analysis of speech, from which general principles for selecting a window function are obtained.
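
    A small illustration of the comparison the paper describes: the spectrum of a two-tone test signal under different windows, exposing the leakage/sidelobe trade-off. Sampling rate, frame length and frequencies are arbitrary.

```python
# Compare spectral leakage of common analysis windows on a two-tone signal.
import numpy as np
from scipy.signal import get_window

fs, n = 8000, 256
t = np.arange(n) / fs
x = np.sin(2 * np.pi * 220 * t) + 0.1 * np.sin(2 * np.pi * 1000 * t)

for name in ("boxcar", "hann", "hamming", "blackman"):
    w = get_window(name, n)
    spec = 20 * np.log10(np.abs(np.fft.rfft(x * w)) + 1e-12)
    # Leakage indicator: level far from both partials (the 2 kHz bin).
    print(f"{name:9s} level at 2 kHz: {spec[64]:7.1f} dB")
```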

  13. Speech enhancement method based on the signal subspace

    Institute of Scientific and Technical Information of China (English)

    曹玉萍

    2012-01-01

    The signal-subspace speech enhancement algorithm differs from classic speech enhancement algorithms based on signal processing and statistical estimation; its core idea is to map the noisy speech signal into a signal subspace and a noise subspace, and to estimate the original signal in the signal subspace. The algorithm proposed in this paper is based on linear algebra and matrix analysis: it exploits the simultaneous diagonalization of the speech and noise covariance matrices to enhance speech corrupted by additive white and pink noise. Experimental analysis and comparison with traditional speech enhancement algorithms show smaller speech distortion and a better enhancement effect: background noise is suppressed to a great extent while spectral distortion and residual noise are reduced.
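
    A white-noise-only sketch of the subspace idea (the paper additionally handles pink noise via the joint diagonalization of the speech and noise covariance matrices): eigendecompose the noisy covariance, estimate per-eigendirection signal energy, and apply a Wiener-type gain.

```python
# Simplified subspace enhancement for white noise only: keep energy in
# eigendirections above the noise floor, attenuate the rest.
import numpy as np

def subspace_enhance(frames, noise_var):
    """frames: (num_frames, frame_len) matrix of noisy speech frames."""
    cov = frames.T @ frames / len(frames)
    vals, vecs = np.linalg.eigh(cov)
    sig = np.maximum(vals - noise_var, 0.0)   # estimated signal energy
    gain = sig / (sig + noise_var)            # Wiener gain per eigenvector
    return frames @ vecs @ np.diag(gain) @ vecs.T

rng = np.random.default_rng(5)
clean = np.sin(2 * np.pi * 0.05 * np.arange(4000)).reshape(-1, 40)
noisy = clean + rng.normal(scale=0.5, size=clean.shape)
out = subspace_enhance(noisy, noise_var=0.25)
print("noisy MSE:   ", np.mean((noisy - clean) ** 2))
print("enhanced MSE:", np.mean((out - clean) ** 2))
```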

  14. Instantaneous Fundamental Frequency Estimation with Optimal Segmentation for Non-Stationary Voiced Speech

    DEFF Research Database (Denmark)

    Nørholm, Sidsel Marie; Jensen, Jesper Rindom; Christensen, Mads Græsbøll

    2016-01-01

    In speech processing, the speech is often considered stationary within segments of 20–30 ms even though it is well known not to be true. In this paper, we take the non-stationarity of voiced speech into account by using a linear chirp model to describe the speech signal. We propose a maximum...... this segmentation method, the segments are on average seen to be longer for the chirp model compared to the traditional harmonic model. For the signal under test, the average segment length is 24.4 ms and 17.1 ms for the chirp model and traditional harmonic model, respectively. This suggests a better fit...... of the chirp model than the harmonic model to the speech signal. The methods are based on an assumption of white Gaussian noise, and, therefore, two prewhitening filters are also proposed....

  15. Robust Transmission of Speech LSFs Using Hidden Markov Model-Based Multiple Description Index Assignments

    Directory of Open Access Journals (Sweden)

    Rondeau Paul

    2008-01-01

    Full Text Available Speech coding techniques capable of generating encoded representations which are robust against channel losses play an important role in enabling reliable voice communication over packet networks and mobile wireless systems. In this paper, we investigate the use of multiple description index assignments (MDIAs for loss-tolerant transmission of line spectral frequency (LSF coefficients, typically generated by state-of-the-art speech coders. We propose a simulated annealing-based approach for optimizing MDIAs for Markov-model-based decoders which exploit inter- and intraframe correlations in LSF coefficients to reconstruct the quantized LSFs from coded bit streams corrupted by channel losses. Experimental results are presented which compare the performance of a number of novel LSF transmission schemes. These results clearly demonstrate that Markov-model-based decoders, when used in conjunction with optimized MDIA, can yield average spectral distortion much lower than that produced by methods such as interleaving/interpolation, commonly used to combat the packet losses.
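
    A generic simulated-annealing skeleton of the kind used to optimize an index assignment; the toy cost below (bit flips between indices of adjacent codewords) merely stands in for the paper's Markov-model-based expected distortion.

```python
# Simulated-annealing skeleton for index assignment optimization.
# The cost function is a hypothetical stand-in for expected distortion.
import numpy as np

rng = np.random.default_rng(6)
n_codewords = 16
perm = rng.permutation(n_codewords)       # index assignment to optimize

def cost(p):
    # Toy objective: adjacent codewords should get indices differing in
    # few bits, so single-bit channel errors cause little distortion.
    return sum(bin(p[i] ^ p[i + 1]).count("1") for i in range(len(p) - 1))

temp, best = 5.0, perm.copy()
for step in range(20000):
    i, j = rng.integers(n_codewords, size=2)
    cand = perm.copy()
    cand[i], cand[j] = cand[j], cand[i]    # propose swapping two indices
    delta = cost(cand) - cost(perm)
    if delta < 0 or rng.random() < np.exp(-delta / temp):
        perm = cand
        if cost(perm) < cost(best):
            best = perm.copy()
    temp *= 0.9997                         # geometric cooling schedule
print("best cost:", cost(best))
```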

  16. An eighth-scale speech source for subjective assessments in acoustic models

    Science.gov (United States)

    Orlowski, R. J.

    1981-08-01

    The design of a source is described which is suitable for making speech recordings in eighth-scale acoustic models of auditoria. An attempt was made to match the directionality of the source with the directionality of the human voice using data reported in the literature. A narrow aperture was required for the design which was provided by mounting an inverted conical horn over the diaphragm of a high frequency loudspeaker. Resonance problems were encountered with the use of a horn and a description is given of the electronic techniques adopted to minimize the effect of these resonances. Subjective and objective assessments on the completed speech source have proved satisfactory. It has been used in a modelling exercise concerned with the acoustic design of a theatre with a thrust-type stage.

  17. Syntactic error modeling and scoring normalization in speech recognition: Error modeling and scoring normalization in the speech recognition task for adult literacy training

    Science.gov (United States)

    Olorenshaw, Lex; Trawick, David

    1991-01-01

    The purpose was to develop a speech recognition system able to detect incorrectly pronounced speech, given that the text of the spoken speech is known to the recognizer. This provides better mechanisms for using speech recognition in a literacy tutor application. Using a combination of scoring normalization techniques and cheater-mode decoding, a reasonable acceptance/rejection threshold was established. In continuous speech, the system achieved above 80 percent correct acceptance of words, while correctly rejecting over 80 percent of incorrectly pronounced words.

  18. Progress in animation of an EMA-controlled tongue model for acoustic-visual speech synthesis

    CERN Document Server

    Steiner, Ingmar

    2012-01-01

    We present a technique for the animation of a 3D kinematic tongue model, one component of the talking head of an acoustic-visual (AV) speech synthesizer. The skeletal animation approach is adapted to make use of a deformable rig controlled by tongue motion capture data obtained with electromagnetic articulography (EMA), while the tongue surface is extracted from volumetric magnetic resonance imaging (MRI) data. Initial results are shown and future work outlined.

  19. Use of Adaptive Digital Signal Processing to Improve Speech Communication for Normally Hearing aand Hearing-Impaired Subjects.

    Science.gov (United States)

    Harris, Richard W.; And Others

    1988-01-01

    A two-microphone adaptive digital noise cancellation technique improved word-recognition ability for 20 normal and 12 hearing-impaired adults by reducing multitalker speech babble and speech spectrum noise 18-22 dB. Word recognition improvements averaged 37-50 percent for normal and 27-40 percent for hearing-impaired subjects. Improvement was best…

  20. Speech Signal Encryption Based on Frequency Shift Function

    Institute of Scientific and Technical Information of China (English)

    闫应天

    2015-01-01

    Addressing the problem of speech signal encryption, and based on MATLAB programming, a frequency-shift function is used to encrypt the speech signal by shifting its frequency. On this basis, an improved method is presented in which the speech signal is divided into segments that are encrypted separately, improving the encryption effect.
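
    An illustrative single-sideband frequency shift of the kind used for scrambling, implemented with the Hilbert transform; shifting each segment by a different key-dependent amount would correspond to the improved segmented scheme described. Sampling rate and shift value are arbitrary.

```python
# Frequency-shift scrambling sketch: the analytic signal (via the
# Hilbert transform) is multiplied by a complex exponential, moving the
# whole spectrum by `shift` Hz; shifting back recovers the signal.
import numpy as np
from scipy.signal import hilbert

def freq_shift(x, shift, fs):
    analytic = hilbert(x)
    t = np.arange(len(x)) / fs
    return np.real(analytic * np.exp(2j * np.pi * shift * t))

fs = 8000
t = np.arange(fs) / fs
speech_like = np.sin(2 * np.pi * 300 * t)           # toy stand-in for speech
scrambled = freq_shift(speech_like, 1000.0, fs)     # encrypt: shift up 1 kHz
recovered = freq_shift(scrambled, -1000.0, fs)      # decrypt: shift back down
print("max recovery error:", np.max(np.abs(recovered - speech_like)))
```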

  1. Modelling of urban traffic networkof signalized intersections

    OpenAIRE

    2013-01-01

    This report presents how the traffic network of signalized intersections in a chosen urban area called Tema is synchronized. Using a modular approach, two different types of traffic intersection commonly found in an urban area were modelled, i.e., a simple intersection and a complex intersection. A direct road, even though not an intersection, was also included in the modelling because it is commonly found in urban areas and connects any two intersections. Each of these scenarios was modelled u...

  2. Transfer Rate Models for Gnutella Signaling Traffic

    OpenAIRE

    2006-01-01

    This paper reports on transfer rate models for the Gnutella signaling protocol. New results on message-level and IP-level rates are presented. The models are based on traffic captured at the Blekinge Institute of Technology (BTH) campus in Sweden and offer several levels of granularity: message type, application layer and network layer. The aim is to obtain parsimonious models suitable for analysis and simulation of P2P workload.

  3. Speech intelligibility for normal hearing and hearing-impaired listeners in simulated room acoustic conditions

    DEFF Research Database (Denmark)

    Arweiler, Iris; Dau, Torsten; Poulsen, Torben

    , a classroom and a church. The data from the study provide constraints for existing models of speech intelligibility prediction (based on the speech intelligibility index, SII, or the speech transmission index, STI) which have shortcomings when reverberation and/or fluctuating noise affect speech......Speech intelligibility depends on many factors such as room acoustics, the acoustical properties and location of the signal and the interferers, and the ability of the (normal and impaired) auditory system to process monaural and binaural sounds. In the present study, the effect of reverberation...... on spatial release from masking was investigated in normal hearing and hearing impaired listeners using three types of interferers: speech shaped noise, an interfering female talker and speech-modulated noise. Speech reception thresholds (SRT) were obtained in three simulated environments: a listening room...

  4. Acoustic Model Training Using Pseudo-Speaker Features Generated by MLLR Transformations for Robust Speaker-Independent Speech Recognition

    OpenAIRE

    Arata Itoh; Sunao Hara; Norihide Kitaoka; Kazuya Takeda

    2012-01-01

    A novel speech feature generation-based acoustic model training method for robust speaker-independent speech recognition is proposed. For decades, speaker adaptation methods have been widely used. All of these adaptation methods need adaptation data. However, our proposed method aims to create speaker-independent acoustic models that cover not only known but also unknown speakers. We achieve this by adopting inverse maximum likelihood linear regression (MLLR) transformation-based feature gene...

  5. Compressed speech signal sensing with K-L incoherent dictionary based on segment MP

    Institute of Scientific and Technical Information of China (English)

    曾理; 张雄伟; 陈亮; 杨吉斌; 黄建军

    2013-01-01

    Compressed Sensing (CS), an emerging theory, provides an alternative approach to signal compression. This study makes two contributions within the CS framework: an incoherent speech dictionary based on the Karhunen-Loève expansion is proposed to improve the matching property of previous dictionaries, and a Segment Matching Pursuit (SegMP) algorithm is designed to address the shortcomings of MP and OMP. A speech autocorrelation model is first built and its parameters estimated to construct the dictionary; SegMP is then employed to observe the sparse speech vector segment by segment, yielding several low-dimensional measurement vectors; finally, the dictionary is rebuilt from the model parameters and the signal reconstructed, implementing compressed sensing of speech. Speech tests demonstrate that the presented scheme gives a more precise sparse representation of vocal signals than the state-of-the-art method, with better signal adaptability, higher reconstruction quality and lower computational complexity.
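
    For reference, a standard orthogonal matching pursuit (OMP), the baseline that the proposed segment-wise variant modifies; the dictionary and sparsity level here are synthetic, not the K-L dictionary of the paper.

```python
# Standard OMP: greedily pick atoms, re-fit coefficients on the support.
import numpy as np

def omp(D, y, k):
    """Greedy recovery of a k-sparse x with y ≈ D @ x (unit-norm columns)."""
    residual, support, coef = y.copy(), [], None
    for _ in range(k):
        support.append(int(np.argmax(np.abs(D.T @ residual))))  # best atom
        sub = D[:, support]
        coef, *_ = np.linalg.lstsq(sub, y, rcond=None)  # re-fit on support
        residual = y - sub @ coef
    x = np.zeros(D.shape[1])
    x[support] = coef
    return x

rng = np.random.default_rng(7)
D = rng.normal(size=(64, 256))
D /= np.linalg.norm(D, axis=0)          # normalize dictionary atoms
x_true = np.zeros(256)
x_true[rng.choice(256, 5, replace=False)] = rng.normal(size=5)
y = D @ x_true
x_hat = omp(D, y, 5)
print("residual norm:", np.linalg.norm(D @ x_hat - y))
```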

  6. Patterns of flavor signals in supersymmetric models

    Energy Technology Data Exchange (ETDEWEB)

    Goto, T. [KEK National High Energy Physics, Tsukuba (Japan)]|[Kyoto Univ. (Japan). YITP; Okada, Y. [KEK National High Energy Physics, Tsukuba (Japan)]|[Graduate Univ. for Advanced Studies, Tsukuba (Japan). Dept. of Particle and Nucelar Physics; Shindou, T. [Deutsches Elektronen-Synchrotron (DESY), Hamburg (Germany)]|[International School for Advanced Studies, Trieste (Italy); Tanaka, M. [Osaka Univ., Toyonaka (Japan). Dept. of Physics

    2007-11-15

    Quark and lepton flavor signals are studied in four supersymmetric models, namely the minimal supergravity model, the minimal supersymmetric standard model with right-handed neutrinos, SU(5) supersymmetric grand unified theory with right-handed neutrinos and the minimal supersymmetric standard model with U(2) flavor symmetry. We calculate b → s(d) transition observables in B_d and B_s decays, taking the constraint from the B_s-anti-B_s mixing recently observed at Tevatron into account. We also calculate lepton flavor violating processes μ → eγ, τ → μγ and τ → eγ for the models with right-handed neutrinos. We investigate possibilities to distinguish the flavor structure of the supersymmetry breaking sector with use of patterns of various flavor signals which are expected to be measured in experiments such as MEG, LHCb and a future Super B Factory. (orig.)

  7. Cognitive Structures, Speech, and Social Situations: Two Integrative Models.

    Science.gov (United States)

    Giles, Howard; Hewstone, Miles

    1982-01-01

    Presents theoretical models of how language acts (1) as a dependent variable of how people subjectively construe situations and (2) as an independent variable creatively defining and redefining situations for those involved. Discusses the importance of developing an interdisciplinary model of language variation in its social context. (EKN)

  8. Logical modelling of Drosophila signalling pathways.

    Science.gov (United States)

    Mbodj, Abibatou; Junion, Guillaume; Brun, Christine; Furlong, Eileen E M; Thieffry, Denis

    2013-09-01

    A limited number of signalling pathways are involved in the specification of cell fate during the development of all animals. Several of these pathways were originally identified in Drosophila. To clarify their roles, and possible cross-talk, we have built a logical model for the nine key signalling pathways recurrently used in metazoan development. In each case, we considered the associated ligands, receptors, signal transducers, modulators, and transcription factors reported in the literature. Implemented using the logical modelling software GINsim, the resulting models qualitatively recapitulate the main characteristics of each pathway, in wild type as well as in various mutant situations (e.g. loss-of-function or gain-of-function). These models constitute pluggable modules that can be used to assemble comprehensive models of complex developmental processes. Moreover, these models of Drosophila pathways could serve as scaffolds for more complicated models of orthologous mammalian pathways. Comprehensive model annotations and GINsim files are provided for each of the nine considered pathways.

  9. Toward a Model of Pediatric Speech Sound Disorders (SSD) for Differential Diagnosis and Therapy Planning

    NARCIS (Netherlands)

    Terband, Hayo; Maassen, Bernardus; Maas, Edwin; van Lieshout, Pascal; Maassen, Ben; Terband, Hayo

    2016-01-01

    The classification and differentiation of pediatric speech sound disorders (SSD) is one of the main questions in the field of speech- and language pathology. Terms for classifying childhood and SSD and motor speech disorders (MSD) refer to speech production processes, and a variety of methods of

  10. Mathematical Models Light Up Plant Signaling

    NARCIS (Netherlands)

    Chew, Y.H.; Smith, R.W.; Jones, H.J.; Seaton, D.D.; Grima, R.; Halliday, K.J.

    2014-01-01

    Plants respond to changes in the environment by triggering a suite of regulatory networks that control and synchronize molecular signaling in different tissues, organs, and the whole plant. Molecular studies through genetic and environmental perturbations, particularly in the model plant Arabidopsis

  11. Stochastic models of intracellular calcium signals

    Energy Technology Data Exchange (ETDEWEB)

    Rüdiger, Sten, E-mail: sten.ruediger@physik.hu-berlin.de

    2014-01-10

    Cellular signaling operates in a noisy environment shaped by low molecular concentrations and cellular heterogeneity. For calcium release through intracellular channels–one of the most important cellular signaling mechanisms–feedback by liberated calcium endows fluctuations with critical functions in signal generation and formation. In this review it is first described, under which general conditions the environment makes stochasticity relevant, and which conditions allow approximating or deterministic equations. This analysis provides a framework, in which one can deduce an efficient hybrid description combining stochastic and deterministic evolution laws. Within the hybrid approach, Markov chains model gating of channels, while the concentrations of calcium and calcium binding molecules (buffers) are described by reaction–diffusion equations. The article further focuses on the spatial representation of subcellular calcium domains related to intracellular calcium channels. It presents analysis for single channels and clusters of channels and reviews the effects of buffers on the calcium release. For clustered channels, we discuss the application and validity of coarse-graining as well as approaches based on continuous gating variables (Fokker–Planck and chemical Langevin equations). Comparison with recent experiments substantiates the stochastic and spatial approach, identifies minimal requirements for a realistic modeling, and facilitates an understanding of collective channel behavior. At the end of the review, implications of stochastic and local modeling for the generation and properties of cell-wide release and the integration of calcium dynamics into cellular signaling models are discussed.

  12. Mathematical Models Light Up Plant Signaling

    NARCIS (Netherlands)

    Chew, Y.H.; Smith, R.W.; Jones, H.J.; Seaton, D.D.; Grima, R.; Halliday, K.J.

    2014-01-01

    Plants respond to changes in the environment by triggering a suite of regulatory networks that control and synchronize molecular signaling in different tissues, organs, and the whole plant. Molecular studies through genetic and environmental perturbations, particularly in the model plant Arabidopsis

  13. Content-based Compressive Sensing for Speech Signal

    Institute of Scientific and Technical Information of China (English)

    高畅; 李海峰; 马琳

    2012-01-01

    Compressive sensing theory performs compressive measurement according to the sparsity of a signal, elevating signal acquisition from sampling the signal to sensing the information, and is a revolution in signal processing. This paper proposes a method for the sparse representation of speech signals based on an Uncertainty Basis Dictionary (UBD), applies compressive sensing theory to compress this sparse representation, and proposes a speech reconstruction algorithm based on solving a linear programming problem. Through speech recognition, speaker recognition and emotion recognition experiments, we investigate, from the perspective of content-based analysis, whether this compressive-sensing-based information sensing method preserves the main content of the speech signal. Experimental results show that the accuracies of speech, speaker and emotion recognition are essentially consistent with results obtained by current methods in these fields, demonstrating that the information sensing method based on compressive sensing theory captures the semantic, speaker and emotional information of the speech signal well.

  14. Segment-Based Acoustic Models for Continuous Speech Recognition

    Science.gov (United States)

    1993-04-05

    … or the chi-squared-like measure used in [1] for Gaussian distributions. However, such similarity measures tend to be more useful for agglomerative … [Context Modeling] To implement this approach we use a simple "bakery" algorithm to assign tasks: as each machine becomes free, it …

  15. Matrix sentence intelligibility prediction using an automatic speech recognition system.

    Science.gov (United States)

    Schädler, Marc René; Warzybok, Anna; Hochmuth, Sabine; Kollmeier, Birger

    2015-01-01

    The feasibility of predicting the outcome of the German matrix sentence test for different types of stationary background noise using an automatic speech recognition (ASR) system was studied. Speech reception thresholds (SRT) of 50% intelligibility were predicted in seven noise conditions. The ASR system used Mel-frequency cepstral coefficients as a front-end and employed whole-word Hidden Markov models on the back-end side. The ASR system was trained and tested with noisy matrix sentences on a broad range of signal-to-noise ratios. The ASR-based predictions were compared to data from the literature (Hochmuth et al., 2015) obtained with 10 native German listeners with normal hearing and predictions of the speech intelligibility index (SII). The ASR-based predictions showed a high and significant correlation (R² = 0.95, p […]) […] speech and noise signals. Minimum assumptions were made about human speech processing beyond what is already incorporated in a reference-free ordinary ASR system.
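
    A sketch of the SRT readout step, assuming recognition scores have been measured at several SNRs: fit a logistic psychometric function and solve for its 50% point. The scores below are fabricated.

```python
# Fit a logistic psychometric function to intelligibility scores and
# read off the SRT (the SNR at 50% intelligibility). Data are synthetic.
import numpy as np
from scipy.optimize import curve_fit

def psychometric(snr, srt, slope):
    return 1.0 / (1.0 + np.exp(-slope * (snr - srt)))

snrs = np.array([-12.0, -9.0, -6.0, -3.0, 0.0])     # test SNRs in dB
scores = np.array([0.05, 0.20, 0.55, 0.85, 0.97])   # word accuracy per SNR
(srt, slope), _ = curve_fit(psychometric, snrs, scores, p0=(-6.0, 1.0))
print(f"estimated SRT (50% point): {srt:.1f} dB SNR")
```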

  16. Recognition of Emotions in Mexican Spanish Speech: An Approach Based on Acoustic Modelling of Emotion-Specific Vowels

    Directory of Open Access Journals (Sweden)

    Santiago-Omar Caballero-Morales

    2013-01-01

    Full Text Available An approach for the recognition of emotions in speech is presented. The target language is Mexican Spanish, and for this purpose a speech database was created. The approach consists in the phoneme acoustic modelling of emotion-specific vowels. For this, a standard phoneme-based Automatic Speech Recognition (ASR) system was built with Hidden Markov Models (HMMs), where different phoneme HMMs were built for the consonants and emotion-specific vowels associated with four emotional states (anger, happiness, neutral, sadness). Then, estimation of the emotional state from a spoken sentence is performed by counting the number of emotion-specific vowels found in the ASR’s output for the sentence. With this approach, accuracy of 87–100% was achieved for the recognition of emotional state of Mexican Spanish speech.

  17. Recognition of emotions in Mexican Spanish speech: an approach based on acoustic modelling of emotion-specific vowels.

    Science.gov (United States)

    Caballero-Morales, Santiago-Omar

    2013-01-01

    An approach for the recognition of emotions in speech is presented. The target language is Mexican Spanish, and for this purpose a speech database was created. The approach consists in the phoneme acoustic modelling of emotion-specific vowels. For this, a standard phoneme-based Automatic Speech Recognition (ASR) system was built with Hidden Markov Models (HMMs), where different phoneme HMMs were built for the consonants and emotion-specific vowels associated with four emotional states (anger, happiness, neutral, sadness). Then, estimation of the emotional state from a spoken sentence is performed by counting the number of emotion-specific vowels found in the ASR's output for the sentence. With this approach, accuracy of 87-100% was achieved for the recognition of emotional state of Mexican Spanish speech.

  18. Stochastic Modeling as a Means of Automatic Speech Recognition

    Science.gov (United States)

    1975-04-01

    … posterior probability Pr(X[1:T] = x[1:T] | Y[1:T] = y[1:T], A, L, P, S), where A, L, P, S represent the acoustic-phonetic, lexical, phonological, and syntactic models … phonetic sequence by using multiple dictionary entries, phonological rules embedded in the dictionary, and a "degarbling" procedure. The search is … 1) a statistical model of the language; 2) a phonemic dictionary and statistical phonological rules; 3) a phonetic matching algorithm; 4) word-level search …

  19. Soft context clustering for F0 modeling in HMM-based speech synthesis

    Science.gov (United States)

    Khorram, Soheil; Sameti, Hossein; King, Simon

    2015-12-01

    This paper proposes the use of a new binary decision tree, which we call a soft decision tree, to improve generalization performance compared to the conventional `hard' decision tree method that is used to cluster context-dependent model parameters in statistical parametric speech synthesis. We apply the method to improve the modeling of fundamental frequency, which is an important factor in synthesizing natural-sounding high-quality speech. Conventionally, hard decision tree-clustered hidden Markov models (HMMs) are used, in which each model parameter is assigned to a single leaf node. However, this `divide-and-conquer' approach leads to data sparsity, with the consequence that it suffers from poor generalization, meaning that it is unable to accurately predict parameters for models of unseen contexts: the hard decision tree is a weak function approximator. To alleviate this, we propose the soft decision tree, which is a binary decision tree with soft decisions at the internal nodes. In this soft clustering method, internal nodes select both their children with certain membership degrees; therefore, each node can be viewed as a fuzzy set with a context-dependent membership function. The soft decision tree improves model generalization and provides a superior function approximator because it is able to assign each context to several overlapped leaves. In order to use such a soft decision tree to predict the parameters of the HMM output probability distribution, we derive the smoothest (maximum entropy) distribution which captures all partial first-order moments and a global second-order moment of the training samples. Employing such a soft decision tree architecture with maximum entropy distributions, a novel speech synthesis system is trained using maximum likelihood (ML) parameter re-estimation and synthesis is achieved via maximum output probability parameter generation. In addition, a soft decision tree construction algorithm optimizing a log-likelihood measure
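
    A toy contrast between a hard and a soft decision node: the hard node routes a context to exactly one child, while the soft node assigns sigmoid membership degrees to both, so every leaf receives (weighted) data. Feature, threshold and sharpness are hypothetical.

```python
# Hard vs. soft decision node: the soft node returns fuzzy membership
# degrees for (left, right) instead of a 0/1 routing decision.
import numpy as np

def hard_split(x, threshold):
    return (0.0, 1.0) if x >= threshold else (1.0, 0.0)

def soft_split(x, threshold, sharpness=2.0):
    right = 1.0 / (1.0 + np.exp(-sharpness * (x - threshold)))
    return (1.0 - right, right)          # membership of (left, right) child

for x in (-1.0, 0.1, 2.0):               # a scalar context feature
    left, right = soft_split(x, 0.0)
    print(f"x={x:+.1f}  hard={hard_split(x, 0.0)}  "
          f"soft=({left:.2f}, {right:.2f})")
```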

  20. Prediction of signal peptides and signal anchors by a hidden Markov model

    DEFF Research Database (Denmark)

    Krogh, Anders Stærmose; Nielsen, Henrik

    1998-01-01

    A hidden Markov model of signal peptides has been developed. It contains submodels for the N-terminal part, the hydrophobic region, and the region around the cleavage site. For known signal peptides, the model can be used to assign objective boundaries between these three regions. Applied to our ...... is the poor discrimination between signal peptides and uncleaved signal anchors, but this is substantially improved by the hidden Markov model when expanding it with a very simple signal anchor model....

  1. Unifying Speech and Language in a Developmentally Sensitive Model of Production.

    Science.gov (United States)

    Redford, Melissa A

    2015-11-01

    Speaking is an intentional activity. It is also a complex motor skill; one that exhibits protracted development and the fully automatic character of an overlearned behavior. Together these observations suggest an analogy with skilled behavior in the non-language domain. This analogy is used here to argue for a model of production that is grounded in the activity of speaking and structured during language acquisition. The focus is on the plan that controls the execution of fluent speech; specifically, on the units that are activated during the production of an intonational phrase. These units are schemas: temporally structured sequences of remembered actions and their sensory outcomes. Schemas are activated and inhibited via associated goals, which are linked to specific meanings. Schemas may fuse together over developmental time with repeated use to form larger units, thereby affecting the relative timing of sequential action in participating schemas. In this way, the hierarchical structure of the speech plan and ensuing rhythm patterns of speech are a product of development. Individual schemas may also become differentiated during development, but only if subsequences are associated with meaning. The necessary association of action and meaning gives rise to assumptions about the primacy of certain linguistic forms in the production process. Overall, schema representations connect usage-based theories of language to the action of speaking.

  2. Principles of speech coding

    CERN Document Server

    Ogunfunmi, Tokunbo

    2010-01-01

    It is becoming increasingly apparent that all forms of communication-including voice-will be transmitted through packet-switched networks based on the Internet Protocol (IP). Therefore, the design of modern devices that rely on speech interfaces, such as cell phones and PDAs, requires a complete and up-to-date understanding of the basics of speech coding. Outlines key signal processing algorithms used to mitigate impairments to speech quality in VoIP networks. Offering a detailed yet easily accessible introduction to the field, Principles of Speech Coding provides an in-depth examination of the

  3. Speech intelligibility and recall of first and second language words heard at different signal-to-noise ratios.

    Science.gov (United States)

    Hygge, Staffan; Kjellberg, Anders; Nöstl, Anatole

    2015-01-01

    Free recall of spoken words in Swedish (native tongue) and English was assessed in two signal-to-noise ratio (SNR) conditions (+3 and +12 dB), with and without half of the heard words being repeated back orally directly after presentation [shadowing, speech intelligibility (SI)]. A total of 24 word lists with 12 words each were presented in English and in Swedish to Swedish speaking college students. Pre-experimental measures of working memory capacity (operation span, OSPAN) were taken. A basic hypothesis was that the recall of the words would be impaired when the encoding of the words required more processing resources, thereby depleting working memory resources. This would be the case when the SNR was low or when the language was English. A low SNR was also expected to impair SI, but we wanted to compare the sizes of the SNR-effects on SI and recall. A low score on working memory capacity was expected to further add to the negative effects of SNR and language on both SI and recall. The results indicated that SNR had strong effects on both SI and recall, but also that the effect size was larger for recall than for SI. Language had a main effect on recall, but not on SI. The shadowing procedure had different effects on recall of the early and late parts of the word lists. Working memory capacity was unimportant for the effect on SI and recall. Thus, recall appears to be a more sensitive indicator than SI for the acoustics of learning, which has implications for building codes and recommendations concerning classrooms and other workplaces, where both hearing and learning are important.

  4. Speech intelligibility and recall of first and second language words heard at different signal-to-noise ratios

    Directory of Open Access Journals (Sweden)

    Staffan eHygge

    2015-09-01

    Full Text Available Free recall of spoken words in Swedish (native tongue) and English was assessed in two signal-to-noise ratio (SNR) conditions (+3 and +12 dB), with and without half of the heard words being repeated back orally directly after presentation [shadowing, speech intelligibility (SI)]. A total of 24 wordlists with 12 words each were presented in English and in Swedish to Swedish speaking college students. Pre-experimental measures of working memory capacity (OSPAN) were taken. A basic hypothesis was that the recall of the words would be impaired when the encoding of the words required more processing resources, thereby depleting working memory resources. This would be the case when the SNR was low or when the language was English. A low SNR was also expected to impair SI, but we wanted to compare the sizes of the SNR-effects on SI and recall. A low score on working memory capacity was expected to further add to the negative effects of SNR and language on both SI and recall. The results indicated that SNR had strong effects on both SI and recall, but also that the effect size was larger for recall than for SI. Language had a main effect on recall, but not on SI. The shadowing procedure had different effects on recall of the early and late parts of the word lists. Working memory capacity was unimportant for the effect on SI and recall. Thus, recall appears to be a more sensitive indicator than SI for the acoustics of learning, which has implications for building codes and recommendations concerning classrooms and other workplaces where both hearing and learning are important.

  5. Under-resourced speech recognition based on the speech manifold

    CSIR Research Space (South Africa)

    Sahraeian, R

    2015-09-01

    Full Text Available Conventional acoustic modeling involves estimating many parameters to effectively model feature distributions. The sparseness of speech and text data, however, degrades the reliability of the estimation process and makes speech recognition a...

  6. Speech processing in mobile environments

    CERN Document Server

    Rao, K Sreenivasa

    2014-01-01

    This book focuses on speech processing in the presence of low-bit rate coding and varying background environments. The methods presented in the book exploit the speech events which are robust in noisy environments. Accurate estimation of these crucial events will be useful for carrying out various speech tasks such as speech recognition, speaker recognition and speech rate modification in mobile environments. The authors provide insights into designing and developing robust methods to process the speech in mobile environments. Covering temporal and spectral enhancement methods to minimize the effect of noise and examining methods and models on speech and speaker recognition applications in mobile environments.

  7. Automated modelling of signal transduction networks

    Directory of Open Access Journals (Sweden)

    Aach John

    2002-11-01

    Full Text Available Abstract Background Intracellular signal transduction is achieved by networks of proteins and small molecules that transmit information from the cell surface to the nucleus, where they ultimately effect transcriptional changes. Understanding the mechanisms cells use to accomplish this important process requires a detailed molecular description of the networks involved. Results We have developed a computational approach for generating static models of signal transduction networks which utilizes protein-interaction maps generated from large-scale two-hybrid screens and expression profiles from DNA microarrays. Networks are determined entirely by integrating protein-protein interaction data with microarray expression data, without prior knowledge of any pathway intermediates. In effect, this is equivalent to extracting subnetworks of the protein interaction dataset whose members have the most correlated expression profiles. Conclusion We show that our technique accurately reconstructs MAP Kinase signaling networks in Saccharomyces cerevisiae. This approach should enhance our ability to model signaling networks and to discover new components of known networks. More generally, it provides a method for synthesizing molecular data, either individual transcript abundance measurements or pairwise protein interactions, into higher level structures, such as pathways and networks.
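
    The edge-filtering idea at the core of this approach lends itself to a compact illustration. The Python sketch below keeps only those protein-interaction edges whose endpoints have correlated expression profiles; the protein names, expression values and correlation threshold are illustrative assumptions, not data or criteria from the paper.

        import numpy as np

        def correlated_subnetwork(ppi_edges, expr, min_corr=0.7):
            """Keep protein-interaction edges whose endpoints show correlated expression."""
            kept = []
            for a, b in ppi_edges:
                r = np.corrcoef(expr[a], expr[b])[0, 1]
                if r >= min_corr:
                    kept.append((a, b))
            return kept

        # expr maps each protein to its microarray expression profile (toy values)
        expr = {'STE7': [0.1, 0.9, 0.4], 'FUS3': [0.2, 1.0, 0.5], 'PGK1': [0.9, 0.1, 0.8]}
        edges = [('STE7', 'FUS3'), ('STE7', 'PGK1')]
        print(correlated_subnetwork(edges, expr))   # -> [('STE7', 'FUS3')]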

  8. A Robust Method for Speech Emotion Recognition Based on Infinite Student’s t-Mixture Model

    Directory of Open Access Journals (Sweden)

    Xinran Zhang

    2015-01-01

    Full Text Available The speech emotion classification method proposed in this paper is based on a Student's t-mixture model with an infinite component number (iSMM) and can directly and effectively recognize various kinds of speech emotion samples. Compared with the traditional GMM (Gaussian mixture model), a speech emotion model based on a Student's t-mixture can effectively handle speech sample outliers in the emotion feature space. Moreover, the t-mixture model remains robust to atypical emotion test data. To address the high data complexity caused by the high-dimensional space and the problem of insufficient training samples, a global latent space is added to the emotion model. This allows the number of components to grow without bound, forming an iSMM emotion model that can automatically determine the best number of components with lower complexity for classifying various kinds of emotion data. Evaluated on one spontaneous (FAU Aibo Emotion Corpus) and two acted (DES and EMO-DB) universal speech emotion databases, which have high-dimensional feature samples and diverse data distributions, the iSMM maintains better recognition performance than the comparison methods. Thus, its effectiveness and generalization to high-dimensional data and outliers are verified, and the iSMM emotion model is confirmed as a robust method with validity and generalization to outliers and high-dimensional emotion characteristics.

  9. Public Speech.

    Science.gov (United States)

    Green, Thomas F.

    1994-01-01

    Discusses the importance of public speech in society, noting the power of public speech to create a world and a public. The paper offers a theory of public speech and identifies types of public speech and types of public speech fallacies. Two ways of speaking of the public and of public life are distinguished. (SM)

  10. Perception drives production across sensory modalities: A network for sensorimotor integration of visual speech.

    Science.gov (United States)

    Venezia, Jonathan H; Fillmore, Paul; Matchin, William; Isenberg, A Lisette; Hickok, Gregory; Fridriksson, Julius

    2016-02-01

    Sensory information is critical for movement control, both for defining the targets of actions and providing feedback during planning or ongoing movements. This holds for speech motor control as well, where both auditory and somatosensory information have been shown to play a key role. Recent clinical research demonstrates that individuals with severe speech production deficits can show a dramatic improvement in fluency during online mimicking of an audiovisual speech signal, suggesting the existence of a visuomotor pathway for speech motor control. Here we used fMRI in healthy individuals to identify this new visuomotor circuit for speech production. Participants were asked to perceive and covertly rehearse nonsense syllable sequences presented auditorily, visually, or audiovisually. The motor act of rehearsal, which is prima facie the same whether or not it is cued with a visible talker, produced different patterns of sensorimotor activation when cued by visual or audiovisual speech (relative to auditory speech). In particular, a network of brain regions including the left posterior middle temporal gyrus and several frontoparietal sensorimotor areas activated more strongly during rehearsal cued by a visible talker versus rehearsal cued by auditory speech alone. Some of these brain regions responded exclusively to rehearsal cued by visual or audiovisual speech. This result has significant implications for models of speech motor control, for the treatment of speech output disorders, and for models of the role of speech gesture imitation in development.

  11. Compressed Sensing of Speech Signals Based on Difference Transformation

    Institute of Scientific and Technical Information of China (English)

    高悦; 王改梅; 陈砚圃; 闵刚; 杜佳

    2011-01-01

    That signals can be sparsely represented under some transformation is a precondition of compressed sensing, and the orthogonal Fourier transform is widely used as such a sparsifying transformation. However, the speech signal is quasi-periodic, so its Fourier transform suffers from spectral leakage, which degrades the performance of signal reconstruction. Exploiting the quasi-periodic character of speech, this paper presents a sparsifying transformation matrix based on the difference transform, and then uses the OMP (Orthogonal Matching Pursuit) algorithm to reconstruct the speech signal. Experiments show that, when OMP is used for reconstruction, the difference transform method performs better than sparsifying with the orthogonal Fourier transform: at equal reconstruction quality, the difference transform needs a smaller compression ratio (fewer measurements) than the orthogonal Fourier transform, which means that the difference transform method greatly improves the compression performance for speech. Quality evaluation of the reconstructed signals by PESQ shows that signals reconstructed with the difference transform method have higher MOS scores, which also shows that for this particular class of signal the difference transform method has great advantages.
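
    The pipeline the abstract describes (sparsifying transform, random measurement, OMP reconstruction) can be sketched compactly in Python. The sketch below uses a first-order difference matrix as a stand-in for the paper's difference transform and a dense Gaussian measurement matrix rather than the sparse random matrix the authors use; all dimensions and the toy test frame are illustrative.

        import numpy as np

        def omp(A, y, sparsity):
            """Orthogonal Matching Pursuit: greedily solve y ~ A x with few nonzeros."""
            residual, support = y.copy(), []
            for _ in range(sparsity):
                support.append(int(np.argmax(np.abs(A.T @ residual))))
                coef, *_ = np.linalg.lstsq(A[:, support], y, rcond=None)
                residual = y - A[:, support] @ coef
            x = np.zeros(A.shape[1])
            x[support] = coef
            return x

        rng = np.random.default_rng(0)
        n, m = 256, 96                                   # frame length, measurements
        t = np.arange(n)
        frame = np.sin(2*np.pi*t/40) + 0.4*np.sin(2*np.pi*t/20)  # toy quasi-periodic frame
        D = np.eye(n) - np.eye(n, k=-1)                  # first-order difference (stand-in)
        Psi = np.linalg.inv(D)                           # frame = Psi @ sparse coefficients
        Phi = rng.standard_normal((m, n)) / np.sqrt(m)   # dense Gaussian measurement matrix
        y = Phi @ frame                                  # compressed measurements
        frame_hat = Psi @ omp(Phi @ Psi, y, sparsity=12) # reconstructed frame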

  12. Emotion recognition from speech: tools and challenges

    Science.gov (United States)

    Al-Talabani, Abdulbasit; Sellahewa, Harin; Jassim, Sabah A.

    2015-05-01

    Human emotion recognition from speech is studied frequently for its importance in many applications, e.g. human-computer interaction. There is wide diversity and little agreement about the basic emotions or emotion-related states on the one hand, and about where the emotion-related information lies in the speech signal on the other. These diversities motivate our investigation into extracting meta-features using the PCA approach, or using a non-adaptive random projection (RP), which significantly reduce the large-dimensional speech feature vectors that may contain a wide range of emotion-related information. Subsets of meta-features are fused to increase the performance of the recognition model that adopts a score-based LDC classifier. We demonstrate that our scheme outperforms state-of-the-art results when tested on non-prompted databases or acted databases (i.e. when subjects act specific emotions while uttering a sentence). However, the huge gap between accuracy rates achieved on the different types of speech datasets raises questions about the way emotions modulate speech. In particular, we argue that emotion recognition from speech should not be treated as a simple classification problem. We demonstrate the presence of a spectrum of different emotions in the same speech portion, especially in the non-prompted datasets, which tend to be more "natural" than the acted datasets where the subjects attempt to suppress all but one emotion.
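
    As a rough illustration of the two dimensionality-reduction routes mentioned above, here is a minimal numpy sketch of PCA-based and random-projection-based meta-feature extraction; the target dimension and scaling are illustrative choices, not the authors' settings.

        import numpy as np

        def pca_meta_features(X, k):
            """Project rows of X (one feature vector per utterance) onto the top-k PCs."""
            Xc = X - X.mean(axis=0)
            _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
            return Xc @ Vt[:k].T

        def rp_meta_features(X, k, seed=0):
            """Non-adaptive random projection to k dimensions."""
            rng = np.random.default_rng(seed)
            R = rng.standard_normal((X.shape[1], k)) / np.sqrt(k)
            return X @ R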

  13. Improving Language Models in Speech-Based Human-Machine Interaction

    Directory of Open Access Journals (Sweden)

    Raquel Justo

    2013-02-01

    Full Text Available This work focuses on speech-based human-machine interaction. Specifically, a Spoken Dialogue System (SDS) that could be integrated into a robot is considered. Since Automatic Speech Recognition is one of the most sensitive tasks that must be confronted in such systems, the goal of this work is to improve the results obtained by this specific module. In order to do so, a hierarchical Language Model (LM) is considered. Different series of experiments were carried out using the proposed models over different corpora and tasks. The results obtained show that these models provide greater accuracy in the recognition task. Additionally, the influence of the Acoustic Modelling (AM) on the improvement percentage of the Language Models has also been explored. Finally, hierarchical Language Models have also been successfully employed in a language understanding task, as shown in an additional series of experiments.

  14. Kalman filter-based microphone array signal processing using the equivalent source model

    Science.gov (United States)

    Bai, Mingsian R.; Chen, Ching-Cheng

    2012-10-01

    This paper demonstrates that microphone array signal processing can be implemented by using adaptive model-based filtering approaches. Nearfield and farfield sound propagation models are formulated into state-space forms in light of the Equivalent Source Method (ESM). In the model, the unknown source amplitudes of the virtual sources are adaptively estimated by using Kalman filters (KFs). The nearfield array aimed at noise source identification is based on a Multiple-Input-Multiple-Output (MIMO) state-space model with minimal realization, whereas the farfield array technique aimed at speech quality enhancement is based on a Single-Input-Multiple-Output (SIMO) state-space model. Performance of the nearfield array is evaluated in terms of relative error of the velocity reconstructed on the actual source surface. Numerical simulations for the nearfield array were conducted with a baffled planar piston source. From the error metric, the proposed KF algorithm proved effective in identifying noise sources. Objective simulations and subjective experiments are undertaken to validate the proposed farfield arrays in comparison with two conventional methods. The results of objective tests indicated that the farfield arrays significantly enhanced the speech quality and word recognition rate. The results of subjective tests, post-processed with an analysis of variance (ANOVA) and a post-hoc Fisher's least significant difference (LSD) test, show great promise for KF-based microphone array signal processing techniques.
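
    The core recursion behind such model-based filtering is the standard Kalman predict/update cycle. A minimal sketch follows, assuming the state vector x holds the unknown virtual-source amplitudes and C is the ESM propagation matrix from virtual sources to microphones; the specific matrices are stand-ins, not the paper's formulation.

        import numpy as np

        def kalman_step(x, P, y, A, C, Q, R):
            """One predict/update cycle; y is the current microphone snapshot."""
            x_pred = A @ x                          # predicted source amplitudes
            P_pred = A @ P @ A.T + Q
            S = C @ P_pred @ C.T + R                # innovation covariance
            K = P_pred @ C.T @ np.linalg.inv(S)     # Kalman gain
            x_new = x_pred + K @ (y - C @ x_pred)   # correct with the measurement
            P_new = (np.eye(len(x_new)) - K @ C) @ P_pred
            return x_new, P_new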

  15. MODEL OF PSYCHO-PEDAGOGICAL SUPPORT TO PARENTS OF CHILDREN WITH SEVERE SPEECH DISORDERS

    Directory of Open Access Journals (Sweden)

    E. N. Azletskaya

    2016-01-01

    Full Text Available The aim of the article is to describe a model of psychological and pedagogical support for parents of children with severe speech disorders, in the setting of an art studio of a preschool educational organization. The article substantiates the relevance of developing and implementing the described model, states the goals and objectives of its implementation, and sets out the conditions for implementing the developed model. The innovative project is devoted to a topical direction of preschool pedagogy: psychological and educational support for parents within preschool organizations. It is established that the activities of a children's and parents' art studio make it possible to organize effective psychological and pedagogical support for parents of children with severe speech disorders. The studio creates conditions for improving their psychological and pedagogical competence and for activating parental involvement in the educational activities of the preschool organization as participants in the educational relationship. Analysis of the innovation showed that joint parent-child productive handwork not only contributes to the development of fine motor skills, which in turn has a positive effect on the development of the child's speech, but also solves a number of further problems: instilling in children love and respect for the traditions and way of life of their people; forming the moral qualities needed to establish positive interpersonal relationships and a favorable psychological climate in the group; creating conditions for children's creative expression; and optimizing the parent-child relationship.

  16. Animal Models of Speech and Vocal Communication Deficits Associated With Psychiatric Disorders.

    Science.gov (United States)

    Konopka, Genevieve; Roberts, Todd F

    2016-01-01

    Disruptions in speech, language, and vocal communication are hallmarks of several neuropsychiatric disorders, most notably autism spectrum disorders. Historically, the use of animal models to dissect molecular pathways and connect them to behavioral endophenotypes in cognitive disorders has proven to be an effective approach for developing and testing disease-relevant therapeutics. The unique aspects of human language compared with vocal behaviors in other animals make such an approach potentially more challenging. However, the study of vocal learning in species with analogous brain circuits to humans may provide entry points for understanding this human-specific phenotype and diseases. We review animal models of vocal learning and vocal communication and specifically link phenotypes of psychiatric disorders to relevant model systems. Evolutionary constraints in the organization of neural circuits and synaptic plasticity result in similarities in the brain mechanisms for vocal learning and vocal communication. Comparative approaches and careful consideration of the behavioral limitations among different animal models can provide critical avenues for dissecting the molecular pathways underlying cognitive disorders that disrupt speech, language, and vocal communication. Copyright © 2016 Society of Biological Psychiatry. Published by Elsevier Inc. All rights reserved.

  17. A dynamical model of hierarchical selection and coordination in speech planning.

    Directory of Open Access Journals (Sweden)

    Sam Tilsen

    Full Text Available Studies of the control of complex sequential movements have dissociated two aspects of movement planning: control over the sequential selection of movement plans, and control over the precise timing of movement execution. This distinction is particularly relevant in the production of speech: utterances contain sequentially ordered words and syllables, but articulatory movements are often executed in a non-sequential, overlapping manner with precisely coordinated relative timing. This study presents a hybrid dynamical model in which competitive activation controls selection of movement plans and coupled oscillatory systems govern coordination. The model departs from previous approaches by ascribing an important role to competitive selection of articulatory plans within a syllable. Numerical simulations show that the model reproduces a variety of speech production phenomena, such as effects of preparation and utterance composition on reaction time, and asymmetries in patterns of articulatory timing associated with onsets and codas. The model furthermore provides a unified understanding of a diverse group of phonetic and phonological phenomena which have not previously been related.

  18. A dynamical model of hierarchical selection and coordination in speech planning.

    Science.gov (United States)

    Tilsen, Sam

    2013-01-01

    Studies of the control of complex sequential movements have dissociated two aspects of movement planning: control over the sequential selection of movement plans, and control over the precise timing of movement execution. This distinction is particularly relevant in the production of speech: utterances contain sequentially ordered words and syllables, but articulatory movements are often executed in a non-sequential, overlapping manner with precisely coordinated relative timing. This study presents a hybrid dynamical model in which competitive activation controls selection of movement plans and coupled oscillatory systems govern coordination. The model departs from previous approaches by ascribing an important role to competitive selection of articulatory plans within a syllable. Numerical simulations show that the model reproduces a variety of speech production phenomena, such as effects of preparation and utterance composition on reaction time, and asymmetries in patterns of articulatory timing associated with onsets and codas. The model furthermore provides a unified understanding of a diverse group of phonetic and phonological phenomena which have not previously been related.

  19. Linguistic Units and Speech Production Theory.

    Science.gov (United States)

    MacNeilage, Peter F.

    This paper examines the validity of the concept of linguistic units in a theory of speech production. Substantiating data are drawn from the study of the speech production process itself. Secondarily, an attempt is made to reconcile the postulation of linguistic units in speech production theory with their apparent absence in the speech signal.…

  20. Small Signal Circuit Model of Double Photodiodes

    Institute of Scientific and Technical Information of China (English)

    HAN Jian-zhong; Ni Guo-qiang; MAO Lu-hong

    2004-01-01

    The transmission delay of photogenerated carriers in a CMOS-process-compatible double photodiode (DPD) is analyzed using device simulation. A DPD small-signal equivalent circuit model which includes the transmission delay of photogenerated carriers is given. Frequency-domain analysis of the circuit model shows that the device has two poles: one is related to the junction capacitance and the DPD's load, the other to the depth and doping concentration of the N-well in the DPD. DPDs with different N-well depths and different areas were compared in terms of bandwidth. The analysis results are important for designing high-speed DPDs.
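
    A two-pole small-signal model implies a simple magnitude response, sketched below; the pole frequencies used here are illustrative assumptions, not values from the paper.

        import numpy as np

        def dpd_magnitude(f, fp1, fp2):
            """|H(f)| of a two-pole small-signal model with pole frequencies in Hz."""
            s = 2j * np.pi * f
            return np.abs(1.0 / ((1 + s/(2*np.pi*fp1)) * (1 + s/(2*np.pi*fp2))))

        f = np.logspace(5, 10, 400)                 # 100 kHz .. 10 GHz sweep
        mag = dpd_magnitude(f, fp1=2e8, fp2=1e9)    # illustrative pole positions
        f3db = f[np.argmin(np.abs(mag - 1/np.sqrt(2)))]  # rough -3 dB bandwidth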

  1. Model identification for dose response signal detection

    OpenAIRE

    Bretz, Frank; Dette, Holger; Titoff, Stefanie; Volgushev, Stanislav

    2012-01-01

    We consider the problem of detecting a dose response signal if several competing regression models are available to describe the dose response relationship. In particular, we re-analyze the MCP-Mod approach from Bretz et al. (2005), which has become a very popular tool for this problem in recent years. We propose an improvement based on likelihood ratio tests and prove that in linear models this approach is always at least as powerful as the MCP-Mod method. This result remains ...

  2. The early maximum likelihood estimation model of audiovisual integration in speech perception

    DEFF Research Database (Denmark)

    Andersen, Tobias

    2015-01-01

    Speech perception is facilitated by seeing the articulatory mouth movements of the talker. This is due to perceptual audiovisual integration, which also causes the McGurk−MacDonald illusion, and for which a comprehensive computational account is still lacking. Decades of research have largely...... focused on the fuzzy logical model of perception (FLMP), which provides excellent fits to experimental observations but also has been criticized for being too flexible, post hoc and difficult to interpret. The current study introduces the early maximum likelihood estimation (MLE) model of audiovisual......-validation can evaluate models of audiovisual integration based on typical data sets taking both goodness-of-fit and model flexibility into account. All models were tested on a published data set previously used for testing the FLMP. Cross-validation favored the early MLE while more conventional error measures...

  3. Fully Automated Non-Native Speech Recognition Using Confusion-Based Acoustic Model Integration

    OpenAIRE

    Bouselmi, Ghazi; Fohr, Dominique; Illina, Irina; Haton, Jean-Paul

    2005-01-01

    This paper presents a fully automated approach for the recognition of non-native speech based on acoustic model modification. For a native language (L1) and a spoken language (L2), pronunciation variants of the phones of L2 are automatically extracted from an existing non-native database as a confusion matrix with sequences of phones of L1. This is done using L1's and L2's ASR systems. This confusion concept deals with the problem of non existence of match between some L2 and L1 phones. The c...

  4. Speech Problems

    Science.gov (United States)

    ... of your treatment plan may include seeing a speech therapist, a person who is trained to treat speech disorders. How often you have to see the speech therapist will vary — you'll probably start out seeing ...

  5. Neural mechanisms underlying auditory feedback control of speech.

    Science.gov (United States)

    Tourville, Jason A; Reilly, Kevin J; Guenther, Frank H

    2008-02-01

    The neural substrates underlying auditory feedback control of speech were investigated using a combination of functional magnetic resonance imaging (fMRI) and computational modeling. Neural responses were measured while subjects spoke monosyllabic words under two conditions: (i) normal auditory feedback of their speech and (ii) auditory feedback in which the first formant frequency of their speech was unexpectedly shifted in real time. Acoustic measurements showed compensation to the shift within approximately 136 ms of onset. Neuroimaging revealed increased activity in bilateral superior temporal cortex during shifted feedback, indicative of neurons coding mismatches between expected and actual auditory signals, as well as right prefrontal and Rolandic cortical activity. Structural equation modeling revealed increased influence of bilateral auditory cortical areas on right frontal areas during shifted speech, indicating that projections from auditory error cells in posterior superior temporal cortex to motor correction cells in right frontal cortex mediate auditory feedback control of speech.

  6. Logic integer programming models for signaling networks.

    Science.gov (United States)

    Haus, Utz-Uwe; Niermann, Kathrin; Truemper, Klaus; Weismantel, Robert

    2009-05-01

    We propose a static and a dynamic approach to model biological signaling networks, and show how each can be used to answer relevant biological questions. For this, we use the two different mathematical tools of Propositional Logic and Integer Programming. The power of discrete mathematics for handling qualitative as well as quantitative data has so far not been exploited in molecular biology, which is mostly driven by experimental research, relying on first-order or statistical models. The arising logic statements and integer programs are analyzed and can be solved with standard software. For a restricted class of problems the logic models reduce to a polynomial-time solvable satisfiability algorithm. Additionally, a more dynamic model enables enumeration of possible time resolutions in poly-logarithmic time. Computational experiments are included.
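
    The flavor of such logic-IP encodings can be shown with a toy example: the Boolean activation rule c = a AND b written as three linear constraints over 0/1 variables. This illustrates the general technique, not the paper's exact formulation.

        import itertools

        # 0/1 variables a, b, c; the constraints c <= a, c <= b, c >= a + b - 1
        # admit exactly one feasible c: the logical AND of a and b.
        for a, b in itertools.product((0, 1), repeat=2):
            feasible = [c for c in (0, 1) if c <= a and c <= b and c >= a + b - 1]
            assert feasible == [a & b]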

  7. Astroparticle physics signals beyond the Standard Model

    CERN Document Server

    Katz, S D

    2001-01-01

    In this thesis two signals pointing beyond the Standard Model are discussed. The first signal is the presence of baryonic matter around us. The possibility of baryogenesis in the Minimal Supersymmetric Standard Model (MSSM) was studied. The bosonic part of the MSSM Lagrangian was put on the lattice. Finite temperature and zero temperature simulations were performed to find the jump of the Higgs length and the mass spectrum. The phase diagram of MSSM in the m_U^2 - T plane was determined. The properties of the bubble wall during the phase transition were also analyzed. The second signal that is discussed in this thesis comes from the observed cosmic rays. According to the so-called "top-down" scenarios, the highest energy cosmic rays can be the decay products of some metastable superheavy particle. Based on the clustering features of the detected events, first the source density was determined. Then the mass of the decaying superheavy particle was found, which is in rough agreement with the grand unification ...

  8. How does language model size affect speech recognition accuracy for the Turkish language?

    Directory of Open Access Journals (Sweden)

    Behnam ASEFİSARAY

    2016-05-01

    Full Text Available In this paper we aimed at investigating the effect of Language Model (LM) size on Speech Recognition (SR) accuracy. We also provided details of our approach for obtaining the LM for Turkish. Since the LM is obtained by statistical processing of raw text, we expect that by increasing the size of the data available for training the LM, SR accuracy will improve. Since this study is based on recognition of Turkish, which is a highly agglutinative language, it is important to find out the appropriate size for the training data; the minimum required data size is expected to be much higher than the data needed to train a language model for a language with a low level of agglutination such as English. In the experiments we also tried to adjust the Language Model Weight (LMW) and Active Token Count (ATC) parameters of the LM, as these are expected to be different for a highly agglutinative language. We showed that by increasing the training data size to an appropriate level, the recognition accuracy improved; on the other hand, changes to LMW and ATC did not have a positive effect on Turkish speech recognition accuracy.

  9. Preschool educators’ attitudes of the preventive speech therapy programme in nursery schools as a model of stimulating environment for developing language competence capacities

    OpenAIRE

    Vidmar, Alenka

    2016-01-01

    The Master's thesis presents an innovative concept of a preventive speech therapy programme (PSP). As a model for creating an encouraging environment that supports the language development of kindergarten children, the PSP provides a range of possibilities for the speech therapist to carry out his/her mission. Unique in the Republic of Slovenia, the model opens new possibilities to further develop interdisciplinary teamwork between the speech therapist and teachers in a kindergarten setting. The th...

  10. Trainable prosodic model for standard Chinese Text-to-Speech system

    Institute of Scientific and Technical Information of China (English)

    TAO Jianhua; CAI Lianhong; ZHAO Shixia

    2001-01-01

    Putonghua prosody is characterized by a hierarchical structure shaped by linguistic environments. Based on this, a neural network with specially weighted factors and optimized outputs is described and applied to construct a Putonghua prosodic model in a Text-to-Speech (TTS) system. Extensive tests show that the structure of the neural network characterizes Putonghua prosody more exactly than traditional models; the learning rate is sped up and computational precision is improved, which makes the whole prosodic model more efficient. Furthermore, the paper stylizes Putonghua syllable pitch contours with SPiS parameters (Syllable Pitch Stylized Parameters) and analyzes their use in adjusting syllable pitch. It shows that the SPiS parameters effectively characterize Putonghua syllable pitch contours and facilitate the establishment of the network model and prosodic control.

  11. Auditory-motor interactions in pediatric motor speech disorders: Neurocomputational modeling of disordered development

    NARCIS (Netherlands)

    Terband, H.; Maassen, B.; Guenther, F. H.; Brumberg, J.

    2014-01-01

    BACKGROUND/PURPOSE: Differentiating the symptom complex due to phonological-level disorders, speech delay and pediatric motor speech disorders is a controversial issue in the field of pediatric speech and language pathology. The present study investigated the developmental interaction between

  12. Auditory-Motor Interactions in Pediatric Motor Speech Disorders: Neurocomputational Modeling of Disordered Development

    NARCIS (Netherlands)

    Terband, H.R.|info:eu-repo/dai/nl/296302066; Maassen, B.A.M.; Guenther, F.H.; Brumberg, J.

    2014-01-01

    Background/Purpose: Differentiating the symptom complex due to phonological-level disorders, speech delay and pediatric motor speech disorders is a controversial issue in the field of pediatric speech and language pathology. The present study investigated the developmental interaction between

  13. Auditory-Motor Interactions in Pediatric Motor Speech Disorders: Neurocomputational Modeling of Disordered Development

    NARCIS (Netherlands)

    Terband, H.R.; Maassen, B.A.M.; Guenther, F.H.; Brumberg, J.

    2014-01-01

    Background/Purpose: Differentiating the symptom complex due to phonological-level disorders, speech delay and pediatric motor speech disorders is a controversial issue in the field of pediatric speech and language pathology. The present study investigated the developmental interaction between neurol

  14. Auditory-motor interactions in pediatric motor speech disorders: Neurocomputational modeling of disordered development

    NARCIS (Netherlands)

    Terband, H.R.; Maassen, B.A.M.; Guenther, F.H.; Brumberg, J.

    2014-01-01

    BACKGROUND/PURPOSE: Differentiating the symptom complex due to phonological-level disorders, speech delay and pediatric motor speech disorders is a controversial issue in the field of pediatric speech and language pathology. The present study investigated the developmental interaction between neurol

  15. The role of the motor system in discriminating normal and degraded speech sounds.

    Science.gov (United States)

    D'Ausilio, Alessandro; Bufalari, Ilaria; Salmas, Paola; Fadiga, Luciano

    2012-07-01

    Listening to speech recruits a network of fronto-temporo-parietal cortical areas. Classical models consider the anterior (motor) sites to be involved in speech production, whereas posterior sites are involved in comprehension. This functional segregation is increasingly challenged by action-perception theories suggesting that the brain circuits for speech articulation and speech perception are functionally interdependent. Recent studies report that speech listening elicits motor activities analogous to production. However, the motor system may be crucially recruited only under certain conditions that make speech discrimination hard. Here, by using event-related double-pulse transcranial magnetic stimulation (TMS) over lips and tongue motor areas, we show data suggesting that the motor system may play a role in discriminating speech signals in noisy, but crucially not in noise-free, environments. Copyright © 2011 Elsevier Srl. All rights reserved.

  16. Speech Acquisition and Automatic Speech Recognition for Integrated Spacesuit Audio Systems

    Science.gov (United States)

    Huang, Yiteng; Chen, Jingdong; Chen, Shaoyan

    2010-01-01

    A voice-command human-machine interface system has been developed for spacesuit extravehicular activity (EVA) missions. A multichannel acoustic signal processing method has been created for distant speech acquisition in noisy and reverberant environments. This technology reduces noise by exploiting differences in the statistical nature of signal (i.e., speech) and noise that exist in the spatial and temporal domains. As a result, the automatic speech recognition (ASR) accuracy can be improved to the level at which crewmembers would find the speech interface useful. The developed speech human/machine interface will enable both crewmember usability and operational efficiency: it offers a fast rate of data/text entry, a small overall size, and light weight, and it frees the hands and eyes of a suited crewmember. The system components and steps include beamforming/multi-channel noise reduction, single-channel noise reduction, speech feature extraction, feature transformation and normalization, feature compression, model adaptation, ASR HMM (Hidden Markov Model) training, and ASR decoding. A state-of-the-art phoneme recognizer can obtain an accuracy rate of 65 percent when the training and testing data are free of noise. When it is used in spacesuits, the rate drops to about 33 percent. With the developed microphone-array speech-processing technologies, the performance is improved and the phoneme recognition accuracy rate rises to 44 percent. The recognizer can be further improved by combining the microphone array and HMM model adaptation techniques and using speech samples collected from inside spacesuits. In addition, arithmetic complexity models for the major HMM-based ASR components were developed; they can help real-time ASR system designers select proper tasks when facing constraints in computational resources.
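
    As a sketch of the multichannel front end, here is a minimal delay-and-sum beamformer; integer sample delays are assumed for simplicity (a practical system would use fractional delays or frequency-domain steering), and this illustrates the general technique rather than the developed method.

        import numpy as np

        def delay_and_sum(mics, delays):
            """Align each channel by its integer sample delay and average."""
            n = min(len(m) for m in mics)
            out = np.zeros(n)
            for sig, d in zip(mics, delays):
                # np.roll wraps at the edges; a real implementation would zero-pad
                out += np.roll(np.asarray(sig[:n], dtype=float), -int(d))
            return out / len(mics)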

  17. Towards Artificial Speech Therapy: A Neural System for Impaired Speech Segmentation.

    Science.gov (United States)

    Iliya, Sunday; Neri, Ferrante

    2016-09-01

    This paper presents a neural system-based technique for segmenting short impaired speech utterances into silent, unvoiced, and voiced sections. Moreover, the proposed technique identifies those points of the (voiced) speech where the spectrum becomes steady. The resulting technique thus aims at detecting that limited section of the speech which contains the information about the potential impairment of the speech. This section is of interest to the speech therapist as it corresponds to the possibly incorrect movements of speech organs (lower lip and tongue with respect to the vocal tract). Two segmentation models to detect and identify the various sections of the disordered (impaired) speech signals have been developed and compared. The first makes use of a combination of four artificial neural networks. The second is based on a support vector machine (SVM). The SVM has been trained by means of an ad hoc nested algorithm whose outer layer is a metaheuristic while the inner layer is a convex optimization algorithm. Several metaheuristics have been tested and compared, leading to the conclusion that some variants of the compact differential evolution (CDE) algorithm appear to be well-suited to address this problem. Numerical results show that the SVM model with a radial basis function is capable of effective detection of the portion of speech that is of interest to a therapist. The best performance has been achieved when the system is trained by the nested algorithm whose outer layer is hybrid-population-based/CDE. A population-based approach displays the best performance for the isolation of silence/noise sections and the detection of unvoiced sections. On the other hand, a compact approach appears to be clearly well-suited to detect the beginning of the steady state of the voiced signal. Both of the proposed segmentation models outperformed two modern segmentation techniques based on Gaussian mixture models and deep learning.
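
    For context, below is a minimal sketch of the classical energy/zero-crossing-rate baseline that silent/unvoiced/voiced segmentation is often measured against; all thresholds are illustrative, and the paper's own models are neural-network- and SVM-based rather than rule-based.

        import numpy as np

        def segment_frames(x, fs, frame_ms=25, e_thresh=0.01, zcr_thresh=0.25):
            """Label fixed-length frames as 'silent', 'unvoiced', or 'voiced'."""
            n = int(fs * frame_ms / 1000)
            labels = []
            for i in range(0, len(x) - n, n):
                frame = x[i:i + n]
                energy = np.mean(frame ** 2)
                zcr = np.mean(np.abs(np.diff(np.sign(frame)))) / 2  # crossings/sample
                if energy < e_thresh:
                    labels.append('silent')
                elif zcr > zcr_thresh:
                    labels.append('unvoiced')   # noise-like, many zero crossings
                else:
                    labels.append('voiced')     # quasi-periodic, low ZCR
            return labels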

  18. Fifty years of progress in speech understanding systems

    Science.gov (United States)

    Zue, Victor

    2004-10-01

    Researchers working on human-machine interfaces realized nearly 50 years ago that automatic speech recognition (ASR) alone is not sufficient; one needs to impart linguistic knowledge to the system such that the signal could ultimately be understood. A speech understanding system combines speech recognition (i.e., the speech to symbols conversion) with natural language processing (i.e., the symbol to meaning transformation) to achieve understanding. Speech understanding research dates back to the DARPA Speech Understanding Project in the early 1970s. However, large-scale efforts only began in earnest in the late 1980s, with government research programs in the U.S. and Europe providing the impetus. This has resulted in many innovations including novel approaches to natural language understanding (NLU) for speech input, and integration techniques for ASR and NLU. In the past decade, speech understanding systems have become major building blocks of conversational interfaces that enable users to access and manage information using spoken dialogue, incorporating language generation, discourse modeling, dialogue management, and speech synthesis. Today, we are at the threshold of developing multimodal interfaces, augmenting sound with sight and touch. This talk will highlight past work and speculate on the future. [Work supported by an industrial consortium of the MIT Oxygen Project.]

  19. A frequency-selective feedback model of auditory efferent suppression and its implications for the recognition of speech in noise.

    Science.gov (United States)

    Clark, Nicholas R; Brown, Guy J; Jürgens, Tim; Meddis, Ray

    2012-09-01

    The potential contribution of the peripheral auditory efferent system to our understanding of speech in a background of competing noise was studied using a computer model of the auditory periphery and assessed using an automatic speech recognition system. A previous study had shown that a fixed efferent attenuation applied to all channels of a multi-channel model could improve the recognition of connected digit triplets in noise [G. J. Brown, R. T. Ferry, and R. Meddis, J. Acoust. Soc. Am. 127, 943-954 (2010)]. In the current study an anatomically justified feedback loop was used to automatically regulate separate attenuation values for each auditory channel. This arrangement resulted in a further enhancement of speech recognition over fixed-attenuation conditions. Comparisons between multi-talker babble and pink noise interference conditions suggest that the benefit originates from the model's ability to modify the amount of suppression in each channel separately according to the spectral shape of the interfering sounds.

  20. The effects of mands and models on the speech of unresponsive language-delayed preschool children.

    Science.gov (United States)

    Warren, S F; McQuarter, R J; Rogers-Warren, A K

    1984-02-01

    The effects of the systematic use of mands (non-yes/no questions and instructions to verbalize), models (imitative prompts), and specific consequent events on the productive verbal behavior of three unresponsive, socially isolate, language-delayed preschool children were investigated in a multiple-baseline design within a classroom free play period. Following a lengthy intervention condition, experimental procedures were systematically faded out to check for maintenance effects. The treatment resulted in increases in total verbalizations and nonobligatory speech (initiations) by the subjects. Subjects also became more responsive in obligatory speech situations. In a second free play (generalization) setting, increased rates of total child verbalizations and nonobligatory verbalizations were observed for all three subjects, and two of the three subjects were more responsive compared to their baselines in the first free play setting. Rate of total teacher verbalizations and questions were also higher in this setting. Maintenance of the treatment effects was shown during the fading condition in the intervention setting. The subjects' MLUs (mean length of utterance) increased during the intervention condition when the teacher began prompting a minimum of two-word utterances in response to a mand or model.

  1. Tests of a Dual-systems Model of Speech Category Learning.

    Science.gov (United States)

    Maddox, W Todd; Chandrasekaran, Bharath

    2014-10-01

    In the visual domain, more than two decades of work posits the existence of dual category learning systems. The reflective system uses working memory to develop and test rules for classifying in an explicit fashion. The reflexive system operates by implicitly associating perception with actions that lead to reinforcement. Dual-systems models posit that in learning natural categories, learners initially use the reflective system and with practice, transfer control to the reflexive system. The role of reflective and reflexive systems in second language (L2) speech learning has not been systematically examined. Here monolingual, native speakers of American English were trained to categorize Mandarin tones produced by multiple talkers. Our computational modeling approach demonstrates that learners use reflective and reflexive strategies during tone category learning. Successful learners use talker-dependent, reflective analysis early in training and reflexive strategies by the end of training. Our results demonstrate that dual-learning systems are operative in L2 speech learning. Critically, learner strategies directly relate to individual differences in category learning success.

  2. Blind Separation of Acoustic Signals Combining SIMO-Model-Based Independent Component Analysis and Binary Masking

    Directory of Open Access Journals (Sweden)

    Hiekata Takashi

    2006-01-01

    Full Text Available A new two-stage blind source separation (BSS) method for convolutive mixtures of speech is proposed, in which a single-input multiple-output (SIMO) model-based independent component analysis (ICA) and a new SIMO-model-based binary masking are combined. SIMO-model-based ICA enables us to separate the mixed signals, not into monaural source signals, but into SIMO-model-based signals from independent sources in their original form at the microphones. Thus, the separated signals of SIMO-model-based ICA can maintain the spatial qualities of each sound source. Owing to this attractive property, our novel SIMO-model-based binary masking can be applied to efficiently remove the residual interference components after SIMO-model-based ICA. The experimental results reveal that the separation performance can be considerably improved by the proposed method compared with that achieved by conventional BSS methods. In addition, the real-time implementation of the proposed BSS is illustrated.
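
    The binary-masking stage can be sketched in a few lines: each time-frequency bin is kept in whichever SIMO-model-based output dominates there. This is a generic illustration of time-frequency binary masking, not the authors' exact post-processing.

        import numpy as np

        def simo_binary_mask(S1, S2):
            """S1, S2: STFTs of the two SIMO-model-based ICA outputs."""
            M = np.abs(S1) >= np.abs(S2)   # dominance decision per bin
            return S1 * M, S2 * ~M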

  3. Endpoint Detection Algorithm Based on Cepstral Distance of Compressed Sensing Measurements of Speech Signal

    Institute of Scientific and Technical Information of China (English)

    叶蕾; 孙林慧; 杨震

    2011-01-01

    Based on the approximate sparsity of the speech signal in the discrete cosine basis, Compressed Sensing theory is applied in this paper to compress and reconstruct the speech signal: the speech signal is projected onto a sparse random measurement matrix and reconstructed by Linear Programming. The features of the compressed sensing measurements of speech signals are investigated. Because the amplitude spectra of the compressed sensing measurement sequences are widely dispersed and differ considerably between speech frames and non-speech frames, an endpoint detection algorithm based on the cepstral distance of the compressed sensing measurements of the speech signal is proposed. Simulations with noisy speech signals at SNRs ranging from 4 dB to 20 dB were carried out. The results show that the endpoint detection algorithm based on the cepstral distance of compressed sensing measurements achieves the same noise robustness as the cepstral-distance endpoint detection algorithm applied to speech sampled at the Nyquist rate, while the compressed sampling reduces the amount of computation required for endpoint detection.
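
    A minimal sketch of the cepstral-distance decision rule follows, operating on whatever frame sequence it is given (in the paper, frames of compressed sensing measurements); the number of cepstral coefficients and the threshold are illustrative assumptions.

        import numpy as np

        def cepstrum(v, n_coef=13):
            """Low-order real cepstrum of a frame."""
            log_mag = np.log(np.abs(np.fft.rfft(v)) + 1e-12)
            return np.fft.irfft(log_mag)[:n_coef]

        def endpoint_flags(frames, noise_frames, threshold=1.0):
            """Flag frames whose cepstral distance to a noise template is large."""
            template = np.mean([cepstrum(f) for f in noise_frames], axis=0)
            return [np.linalg.norm(cepstrum(f) - template) > threshold for f in frames]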

  4. Cross-Modal Interactions during Perception of Audiovisual Speech and Nonspeech Signals: An fMRI Study

    Science.gov (United States)

    Hertrich, Ingo; Dietrich, Susanne; Ackermann, Hermann

    2011-01-01

    During speech communication, visual information may interact with the auditory system at various processing stages. Most noteworthy, recent magnetoencephalography (MEG) data provided first evidence for early and preattentive phonetic/phonological encoding of the visual data stream--prior to its fusion with auditory phonological features [Hertrich,…

  5. Cross-Modal Interactions during Perception of Audiovisual Speech and Nonspeech Signals: An fMRI Study

    Science.gov (United States)

    Hertrich, Ingo; Dietrich, Susanne; Ackermann, Hermann

    2011-01-01

    During speech communication, visual information may interact with the auditory system at various processing stages. Most noteworthy, recent magnetoencephalography (MEG) data provided first evidence for early and preattentive phonetic/phonological encoding of the visual data stream--prior to its fusion with auditory phonological features [Hertrich,…

  6. A Survey on Speech Enhancement Methodologies

    Directory of Open Access Journals (Sweden)

    Ravi Kumar. K

    2016-12-01

    Full Text Available Speech enhancement is a technique which processes the noisy speech signal. The aim of speech enhancement is to improve the perceived quality of speech and/or to improve its intelligibility. Due to its vast applications in mobile telephony, VoIP, hearing aids, Skype and speaker recognition, the challenges in speech enhancement have grown over the years. It is especially challenging to suppress background noise that affects human communication in noisy environments like airports, road works, traffic, and cars. The objective of this survey paper is to outline the single-channel speech enhancement methodologies used for enhancing a speech signal corrupted by additive background noise, and to discuss the challenges and opportunities of single-channel speech enhancement. The paper mainly focuses on transform-domain techniques and supervised (NMF, HMM) speech enhancement techniques, and gives a framework for developments in speech enhancement methodologies.
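
    As an example of the transform-domain class surveyed here, below is a minimal magnitude spectral-subtraction sketch with overlap-add resynthesis; the frame size, spectral floor, and the assumption that the first frames are speech-free are all illustrative choices.

        import numpy as np

        def spectral_subtraction(noisy, frame=256, hop=128, noise_frames=10, floor=0.1):
            """Basic magnitude spectral subtraction of additive background noise."""
            win = np.hanning(frame)
            spectra = [np.fft.rfft(noisy[i:i + frame] * win)
                       for i in range(0, len(noisy) - frame, hop)]
            noise_mag = np.mean([np.abs(S) for S in spectra[:noise_frames]], axis=0)
            out = np.zeros(len(noisy))
            for idx, S in enumerate(spectra):
                mag = np.maximum(np.abs(S) - noise_mag, floor * np.abs(S))  # floor
                out[idx*hop:idx*hop + frame] += np.fft.irfft(mag * np.exp(1j*np.angle(S))) * win
            return out  # output level is off by a constant overlap-add gain in this sketch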

  7. Steganalysis of recorded speech

    Science.gov (United States)

    Johnson, Micah K.; Lyu, Siwei; Farid, Hany

    2005-03-01

    Digital audio provides a suitable cover for high-throughput steganography. At 16 bits per sample and sampled at a rate of 44,100 Hz, digital audio has the bit-rate to support large messages. In addition, audio is often transient and unpredictable, facilitating the hiding of messages. Using an approach similar to our universal image steganalysis, we show that hidden messages alter the underlying statistics of audio signals. Our statistical model begins by building a linear basis that captures certain statistical properties of audio signals. A low-dimensional statistical feature vector is extracted from this basis representation and used by a non-linear support vector machine for classification. We show the efficacy of this approach on LSB embedding and Hide4PGP. While no explicit assumptions about the content of the audio are made, our technique has been developed and tested on high-quality recorded speech.
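
    For concreteness, here is a minimal sketch of the kind of LSB embedding such steganalysis targets; it hides one payload bit per 16-bit sample and illustrates the generic scheme, not Hide4PGP specifically.

        import numpy as np

        def lsb_embed(samples, bits):
            """Hide a bit string in the least-significant bits of 16-bit PCM samples."""
            out = samples.copy()
            payload = np.asarray(bits, dtype=out.dtype)
            out[:len(payload)] = (out[:len(payload)] & ~1) | payload
            return out

        def lsb_extract(samples, n_bits):
            """Read back the hidden bits."""
            return samples[:n_bits] & 1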

  8. Word pair classification during imagined speech using direct brain recordings

    Science.gov (United States)

    Martin, Stephanie; Brunner, Peter; Iturrate, Iñaki; Millán, José Del R.; Schalk, Gerwin; Knight, Robert T.; Pasley, Brian N.

    2016-05-01

    People who cannot communicate due to neurological disorders would benefit from an internal speech decoder. Here, we showed the ability to classify individual words during imagined speech from electrocorticographic signals. In a word imagery task, we used high gamma (70–150 Hz) time features with a support vector machine model to classify individual words from a pair of words. To account for temporal irregularities during speech production, we introduced a non-linear time alignment into the SVM kernel. Classification accuracy reached 88% in a two-class classification framework (50% chance level), and average classification accuracy across fifteen word pairs was significant across five subjects (mean = 58%, p < 0.05). … perception and production. These data represent a proof-of-concept study for basic decoding of speech imagery, and delineate a number of key challenges to the use of speech imagery neural representations for clinical applications.
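
    One common way to fold temporal alignment into an SVM kernel is to place a Gaussian kernel on a dynamic-time-warping distance, sketched below; whether this matches the authors' kernel is an assumption, and such DTW kernels are not guaranteed to be positive definite.

        import numpy as np

        def dtw(a, b):
            """DTW distance between two feature sequences (n_frames x n_dims)."""
            n, m = len(a), len(b)
            D = np.full((n + 1, m + 1), np.inf)
            D[0, 0] = 0.0
            for i in range(1, n + 1):
                for j in range(1, m + 1):
                    cost = np.linalg.norm(a[i - 1] - b[j - 1])
                    D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
            return D[n, m]

        def dtw_kernel(a, b, gamma=0.1):
            """Gaussian kernel on the DTW distance between two sequences."""
            return np.exp(-gamma * dtw(a, b))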

  9. MODELS OF THE SPEECH PORTRAIT AND THE LINGVOPSYCHOLOGICAL PORTRAIT OF DRAMATIC CHARACTERS IN A.P. CHEKHOV'S PLAY "UNCLE VANYA"

    Directory of Open Access Journals (Sweden)

    Jiang Zhiyan

    2016-01-01

    Full Text Available This article highlights the problem of models of the speech portrait and of portraying dramatic characters, by studying the lingvopsychological lexicon from the viewpoint of lingvopsychology. Although a large number of works on the speech portrait have been published, the speech portraits of the dramatic characters of Chekhov's plays remain poorly studied; moreover, researchers have not found an effective approach to describing and analyzing the speech portrait of the dramatic characters of a Chekhov play. Thus, in our study we have first created a model for studying the speech portrait in Chekhov's plays (based on the play "Uncle Vanya"), taking into account linguistic culture and lingvopsychology. In our view, the lingvopsychological portrait consists of lingvopsychological lexicons that reflect mental state and mental activity. Applying linguistic research methods to this psychological object, we carry out a comparative analysis of the semantic fields which form the lingvopsychological lexicons of mental state and mental activity. The psychosemantic fields manifest oppositions between the semantic fields of the lingvopsychological lexicons of mental state: passive versus active, pessimism versus optimism. In addition, the evaluative-semantic and moral-semantic fields of mental state can reveal the lingvopsychological portrait. Oppositions of the semantic fields of mental activity make up the lingvopsychological lexicons: love ‒ hate, envy; like ‒ bother.

  10. Model-based inverse estimation for active contraction stresses of tongue muscles using 3D surface shape in speech production.

    Science.gov (United States)

    Koike, Narihiko; Ii, Satoshi; Yoshinaga, Tsukasa; Nozaki, Kazunori; Wada, Shigeo

    2017-09-14

    This paper presents a novel inverse estimation approach for the active contraction stresses of tongue muscles during speech. The proposed method is based on variational data assimilation using a mechanical tongue model and 3D tongue surface shapes for speech production. The mechanical tongue model considers nonlinear hyperelasticity, finite deformation, actual geometry from computed tomography (CT) images, and anisotropic active contraction by muscle fibers, the orientations of which are ideally determined using anatomical drawings. The tongue deformation is obtained by solving a stationary force-equilibrium equation using a finite element method. An inverse problem is established to find the combination of muscle contraction stresses that minimizes the Euclidean distance between the tongue surfaces obtained from the mechanical analysis and from the CT results of speech production, where a signed-distance function represents the tongue surface. Our approach is validated through an ideal numerical example and extended to the real-world case of two Japanese vowels, /ʉ/ and /ɯ/. The results capture the target shape completely and provide an excellent estimation of the active contraction stresses in the ideal case, and exhibit tendencies similar to those seen in previous observations and simulations for the actual vowel cases. The present approach can reveal the relative relationship among the muscle contraction stresses in similar utterances with different tongue shapes, and enables the investigation of the coordination of tongue muscles during speech using only the deformed tongue shape obtained from medical images. This will enhance our understanding of speech motor control. Copyright © 2017. Published by Elsevier Ltd.

  11. Visual speech information: a help or hindrance in perceptual processing of dysarthric speech.

    Science.gov (United States)

    Borrie, Stephanie A

    2015-03-01

    This study investigated the influence of visual speech information on perceptual processing of neurologically degraded speech. Fifty listeners identified spastic dysarthric speech under both audio (A) and audiovisual (AV) conditions. Condition comparisons revealed that the addition of visual speech information enhanced processing of the neurologically degraded input in terms of (a) acuity (percent phonemes correct) of vowels and consonants and (b) recognition (percent words correct) of predictive and nonpredictive phrases. Listeners exploited stress-based segmentation strategies more readily in AV conditions, suggesting that the perceptual benefit associated with adding visual speech information to the auditory signal (the AV advantage) has both segmental and suprasegmental origins. Results also revealed that the magnitude of the AV advantage can be predicted, to some degree, by the extent to which an individual utilizes syllabic stress cues to inform word recognition in AV conditions. Findings inform the development of a listener-specific model of speech perception that applies to processing of dysarthric speech in everyday communication contexts.

  12. Music expertise shapes audiovisual temporal integration windows for speech, sinewave speech, and music.

    Science.gov (United States)

    Lee, Hweeling; Noppeney, Uta

    2014-01-01

    This psychophysics study used musicians as a model to investigate whether musical expertise shapes the temporal integration window for audiovisual speech, sinewave speech, or music. Musicians and non-musicians judged the audiovisual synchrony of speech, sinewave analogs of speech, and music stimuli at 13 audiovisual stimulus onset asynchronies (±360, ±300, ±240, ±180, ±120, ±60, and 0 ms). Further, we manipulated the duration of the stimuli by presenting sentences/melodies or syllables/tones. Critically, musicians relative to non-musicians exhibited significantly narrower temporal integration windows for both music and sinewave speech. Further, the temporal integration window for music decreased with the amount of music practice, but not with age of acquisition. In other words, the more musicians practiced piano in the past 3 years, the more sensitive they became to the temporal misalignment of visual and auditory signals. Collectively, our findings demonstrate that music practicing fine-tunes the audiovisual temporal integration window to various extents depending on the stimulus class. While the effect of piano practicing was most pronounced for music, it also generalized to other stimulus classes such as sinewave speech and to a marginally significant degree to natural speech.

  13. Audiovisual Asynchrony Detection in Human Speech

    Science.gov (United States)

    Maier, Joost X.; Di Luca, Massimiliano; Noppeney, Uta

    2011-01-01

    Combining information from the visual and auditory senses can greatly enhance intelligibility of natural speech. Integration of audiovisual speech signals is robust even when temporal offsets are present between the component signals. In the present study, we characterized the temporal integration window for speech and nonspeech stimuli with…

  15. Charisma in business speeches

    DEFF Research Database (Denmark)

    Niebuhr, Oliver; Brem, Alexander; Novák-Tót, Eszter

    2016-01-01

    Charisma is a key component of spoken language interaction; and it is probably for this reason that charismatic speech has been the subject of intensive research for centuries. However, what is still largely missing is a quantitative and objective line of research that, firstly, involves analyses of the acoustic-prosodic signal, secondly, focuses on business speeches like product presentations, and, thirdly, in doing so, advances the still fairly fragmentary evidence on the prosodic correlates of charismatic speech. We show that the prosodic features of charisma in political speeches also apply to business speeches. Consistent with the public opinion, our findings are indicative of Steve Jobs being a more charismatic speaker than Mark Zuckerberg. Beyond previous studies, our data suggest that rhythm and emphatic accentuation are also involved in conveying charisma. Furthermore, the differences…

  16. Speech enhancement for listeners with hearing loss based on a model for vowel coding in the auditory midbrain.

    Science.gov (United States)

    Rao, Akshay; Carney, Laurel H

    2014-07-01

    A novel signal-processing strategy is proposed to enhance speech for listeners with hearing loss. The strategy focuses on improving vowel perception based on a recent hypothesis for vowel coding in the auditory system. Traditionally, studies of neural vowel encoding have focused on the representation of formants (peaks in vowel spectra) in the discharge patterns of the population of auditory-nerve (AN) fibers. A recent hypothesis focuses instead on vowel encoding in the auditory midbrain, and suggests a robust representation of formants. AN fiber discharge rates are characterized by pitch-related fluctuations having frequency-dependent modulation depths. Fibers tuned to frequencies near formants exhibit weaker pitch-related fluctuations than those tuned to frequencies between formants. Many auditory midbrain neurons show tuning to amplitude modulation frequency in addition to audio frequency. According to the auditory midbrain vowel encoding hypothesis, the response map of a population of midbrain neurons tuned to modulations near voice pitch exhibits minima near formant frequencies, due to the lack of strong pitch-related fluctuations at their inputs. This representation is robust over the range of noise conditions in which speech intelligibility is also robust for normal-hearing listeners. Based on this hypothesis, a vowel-enhancement strategy has been proposed that aims to restore vowel encoding at the level of the auditory midbrain. The signal processing consists of pitch tracking, formant tracking, and formant enhancement. The novel formant-tracking method proposed here estimates the first two formant frequencies by modeling characteristics of the auditory periphery, such as saturated discharge rates of AN fibers and modulation tuning properties of auditory midbrain neurons. The formant enhancement stage aims to restore the representation of formants at the level of the midbrain by increasing the dominance of a single harmonic near each formant and saturating
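
    The record's full enhancement chain (pitch tracking, auditory-model-based formant tracking, harmonic boosting) is too involved for a short example, but the sketch below shows the classical LPC-root method for the core subtask of estimating the first two formant frequencies of a vowel frame. This is a generic textbook technique, not the auditory-periphery-based tracker the abstract proposes; the sampling rate, model order, and synthetic vowel are illustrative assumptions.

    import numpy as np

    def lpc_formants(frame, sr, order=10, n_formants=2):
        """Textbook LPC formant estimation: fit an all-pole model via the
        autocorrelation method, then read formants off the angles of the
        complex polynomial roots. A real tracker would also gate on root
        bandwidth; this sketch keeps only the lowest positive-angle roots."""
        frame = frame * np.hanning(len(frame))
        n = len(frame)
        r = np.correlate(frame, frame, mode="full")[n - 1:n + order]
        R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
        R += 1e-6 * r[0] * np.eye(order)          # diagonal loading for stability
        a = np.linalg.solve(R, r[1:order + 1])    # Yule-Walker equations
        roots = np.roots(np.concatenate(([1.0], -a)))
        roots = roots[np.imag(roots) > 0]         # keep one of each conjugate pair
        freqs = np.sort(np.angle(roots) * sr / (2 * np.pi))
        return freqs[:n_formants]

    sr = 8000
    t = np.arange(int(0.03 * sr)) / sr
    # Synthetic vowel-like frame with resonance bumps near 700 and 1200 Hz
    amp = lambda f: np.exp(-((f - 700) / 300) ** 2) + np.exp(-((f - 1200) / 400) ** 2)
    vowel = sum(amp(f) * np.sin(2 * np.pi * f * t) for f in range(100, 3000, 100))
    print("estimated F1, F2 (Hz):", lpc_formants(vowel, sr))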

  17. Design and realisation of an audiovisual speech activity detector

    NARCIS (Netherlands)

    Van Bree, K.C.

    2006-01-01

    For many speech telecommunication technologies a robust speech activity detector is important. An audio-only speech detector will give false positives when the interfering signal is speech or has speech characteristics. The modality video is suitable to solve this problem. In this report the approach

  18. Use of a biomechanical tongue model to predict the impact of tongue surgery on speech production

    CERN Document Server

    Buchaillard, Stéphanie; Perrier, Pascal; Payan, Yohan

    2008-01-01

    This paper presents predictions of the consequences of tongue surgery on speech production. For this purpose, a 3D finite element model of the tongue is used that represents this articulator as a deformable structure in which tongue muscle anatomy is realistically described. Two examples of tongue surgery, which are common in the treatment of cancers of the oral cavity, are modelled, namely a hemiglossectomy and a large resection of the mouth floor. In both cases, three kinds of possible reconstruction are simulated, assuming flaps with different stiffness. Predictions are computed for the cardinal vowels /i, a, u/ in the absence of any compensatory strategy, i.e. with the same motor commands as the ones associated with the production of these vowels in non-pathological conditions. The estimated vocal tract area functions and the corresponding formants are compared to the ones obtained under normal conditions.

  19. Speech production as state feedback control.

    Science.gov (United States)

    Houde, John F; Nagarajan, Srikantan S

    2011-01-01

    Spoken language exists because of a remarkable neural process. Inside a speaker's brain, an intended message gives rise to neural signals activating the muscles of the vocal tract. The process is remarkable because these muscles are activated in just the right way that the vocal tract produces sounds a listener understands as the intended message. What is the best approach to understanding the neural substrate of this crucial motor control process? One of the key recent modeling developments in neuroscience has been the use of state feedback control (SFC) theory to explain the role of the CNS in motor control. SFC postulates that the CNS controls motor output by (1) estimating the current dynamic state of the thing (e.g., arm) being controlled, and (2) generating controls based on this estimated state. SFC has successfully predicted a great range of non-speech motor phenomena, but as yet has not received attention in the speech motor control community. Here, we review some of the key characteristics of speech motor control and what they say about the role of the CNS in the process. We then discuss prior efforts to model the role of CNS in speech motor control, and argue that these models have inherent limitations - limitations that are overcome by an SFC model of speech motor control which we describe. We conclude by discussing a plausible neural substrate of our model.
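
    To make the SFC idea concrete, here is a minimal numerical sketch with purely illustrative matrices and gains, not drawn from the paper: a linear plant is controlled from a Kalman-style state estimate that combines an internal forward model with noisy sensory feedback, rather than from the raw sensory signal itself.

    import numpy as np

    # Minimal state feedback control (SFC) sketch: the controller acts on an
    # internal state ESTIMATE, updated from noisy sensory feedback, not on
    # the sensory signal directly. All matrices below are illustrative.
    A = np.array([[1.0, 0.1], [0.0, 1.0]])   # plant dynamics (e.g., articulator pos/vel)
    B = np.array([[0.0], [0.1]])             # control input matrix
    C = np.array([[1.0, 0.0]])               # only position is sensed, with noise
    K = np.array([[2.0, 3.0]])               # feedback gain (drives state to zero)
    L = np.array([[0.5], [0.5]])             # observer gain for the estimator

    rng = np.random.default_rng(0)
    x = np.array([[1.0], [0.0]])             # true state: starts away from target
    x_hat = np.zeros((2, 1))                 # internal state estimate

    for t in range(100):
        u = -K @ x_hat                       # control computed from the estimate
        y = C @ x + 0.05 * rng.standard_normal((1, 1))  # noisy sensory feedback
        # Observer: predict with the forward model, correct with the feedback
        x_hat = A @ x_hat + B @ u + L @ (y - C @ x_hat)
        x = A @ x + B @ u                    # true plant update

    print("final position:", float(x[0, 0]))  # should settle near zero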

  20. Speech production as state feedback control

    Directory of Open Access Journals (Sweden)

    John F Houde

    2011-10-01

    Full Text Available Spoken language exists because of a remarkable neural process. Inside a speaker's brain, an intended message gives rise to neural signals activating the muscles of the vocal tract. The process is remarkable because these muscles are activated in just the right way that the vocal tract produces sounds a listener understands as the intended message. What is the best approach to understanding the neural substrate of this crucial motor control process? One of the key recent modeling developments in neuroscience has been the use of state feedback control (SFC) theory to explain the role of the CNS in motor control. SFC postulates that the CNS controls motor output by (1) estimating the current dynamic state of the thing (e.g., arm) being controlled, and (2) generating controls based on this estimated state. SFC has successfully predicted a great range of non-speech motor phenomena, but as yet has not received attention in the speech motor control community. Here, we review some of the key characteristics of speech motor control and what they say about the role of the CNS in the process. We then discuss prior efforts to model the role of CNS in speech motor control, and argue that these models have inherent limitations – limitations that are overcome by an SFC model of speech motor control which we describe. We conclude by discussing a plausible neural substrate of our model.

  1. Clear Speech - Mere Speech? How segmental and prosodic speech reduction shape the impression that speakers create on listeners

    DEFF Research Database (Denmark)

    Niebuhr, Oliver

    2017-01-01

    whether variation in the degree of reduction also has a systematic effect on the attributes we ascribe to the speaker who produces the speech signal. A perception experiment was carried out for German in which 46 listeners judged whether or not speakers showing 3 different combinations of segmental...... of reduction levels and perceived speaker attributes in which moderate reduction can make a better impression on listeners than no reduction. In addition to its relevance in reduction models and theories, this interplay is instructive for various fields of speech application from social robotics to charisma...

  2. Compressed Speech Signal Sensing Based on Autocorrelative Measurement

    Institute of Scientific and Technical Information of China (English)

    季云云; 杨震

    2011-01-01

    A new measurement matrix, the truncated circulant autocorrelation matrix, is presented based on Compressed Sensing theory and the features of speech signals. From a practical point of view, an approximate truncated circulant autocorrelation matrix based on template matching is then proposed as the measurement matrix, and it is proved that the new measurement matrix satisfies the restricted isometry property (RIP). Using the truncated circulant autocorrelation matrix, the approximate truncated circulant autocorrelation matrix, and the Gaussian random matrix respectively to observe speech signals, the Basis Pursuit (BP) algorithm is used to reconstruct the original speech signal. Simulation results demonstrate that the reconstruction performance of the approximate truncated circulant autocorrelation matrix, created from a linear combination of two template elements, is almost the same as that of the truncated circulant autocorrelation matrix, and greatly better than that of the classic Gaussian random matrix. Moreover, for the same reconstruction performance, the compression ratio achieved by the approximate truncated circulant autocorrelation matrix is far higher than that of the Gaussian random matrix, which means that the new measurement matrix can significantly enhance the compression performance for speech signals.
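
    A schematic sketch of the pipeline this abstract describes, under assumed sizes and an assumed autocorrelation template (the paper's template-matching construction is not reproduced): build a truncated circulant measurement matrix, take compressed measurements of a sparse signal, and reconstruct with basis pursuit cast as a linear program. Recovery quality depends on the template; this only shows the shape of the pipeline.

    import numpy as np
    from scipy.linalg import circulant
    from scipy.optimize import linprog

    rng = np.random.default_rng(1)
    n, m, k = 128, 48, 5                     # signal length, measurements, sparsity

    # Sparse test signal standing in for a (transform-domain) speech frame
    x = np.zeros(n)
    x[rng.choice(n, k, replace=False)] = rng.standard_normal(k)

    # Truncated circulant matrix: keep the first m rows of a circulant
    # generated by a decaying, autocorrelation-like placeholder template.
    r = 0.9 ** np.arange(n)
    Phi = circulant(r)[:m, :]
    y = Phi @ x                              # compressed measurements

    # Basis pursuit:  min ||x||_1  s.t.  Phi x = y,  as an LP with x = u - v
    c = np.ones(2 * n)
    A_eq = np.hstack([Phi, -Phi])
    res = linprog(c, A_eq=A_eq, b_eq=y, bounds=[(0, None)] * (2 * n))
    x_rec = res.x[:n] - res.x[n:]

    print("reconstruction error:", np.linalg.norm(x - x_rec))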

  3. Speech Intelligibility Prediction Based on Mutual Information

    DEFF Research Database (Denmark)

    Jensen, Jesper; Taal, Cees H.

    2014-01-01

    This paper deals with the problem of predicting the average intelligibility of noisy and potentially processed speech signals, as observed by a group of normal hearing listeners. We propose a model which performs this prediction based on the hypothesis that intelligibility is monotonically related to the mutual information between critical-band amplitude envelopes of the clean signal and the corresponding noisy/processed signal. The resulting intelligibility predictor turns out to be a simple function of the mean-square error (mse) that arises when estimating a clean critical-band amplitude using … the intelligibility of speech signals contaminated by additive noise and potentially non-linearly processed using time-frequency weighting.
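
    A rough sketch of this style of predictor, with several assumptions of mine: equal-width STFT band grouping stands in for a true critical-band filterbank, and the Gaussian relation I = -0.5*log(1 - rho^2) maps per-band envelope correlation to a mutual-information score. This is not the published model, only its general shape.

    import numpy as np

    def band_envelopes(sig, n_bands=16, frame=256, hop=128):
        """Crude band amplitude envelopes: STFT magnitudes pooled into
        n_bands equal-width frequency groups (placeholder for a proper
        critical-band or gammatone filterbank)."""
        n_frames = 1 + (len(sig) - frame) // hop
        win = np.hanning(frame)
        spec = np.abs(np.array([
            np.fft.rfft(win * sig[i * hop:i * hop + frame])
            for i in range(n_frames)
        ]))                                   # (frames, bins)
        bands = np.array_split(spec, n_bands, axis=1)
        return np.stack([b.mean(axis=1) for b in bands], axis=0)  # (bands, frames)

    def mi_intelligibility(clean, degraded):
        """Average per-band Gaussian MI between clean and degraded envelopes."""
        mi = []
        for c, d in zip(band_envelopes(clean), band_envelopes(degraded)):
            rho = np.corrcoef(c, d)[0, 1]
            mi.append(-0.5 * np.log(1.0 - min(rho ** 2, 0.999)))
        return float(np.mean(mi))

    sr = 8000
    t = np.arange(sr) / sr
    clean = np.sin(2 * np.pi * 200 * t) * (1 + 0.5 * np.sin(2 * np.pi * 4 * t))
    noisy = clean + 0.5 * np.random.default_rng(2).standard_normal(len(clean))
    print("MI-based score:", mi_intelligibility(clean, noisy))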

  4. From where to what: a neuroanatomically based evolutionary model of the emergence of speech in humans [version 3; referees: 1 approved, 2 approved with reservations]

    Directory of Open Access Journals (Sweden)

    Oren Poliva

    2017-09-01

    Full Text Available In the brain of primates, the auditory cortex connects with the frontal lobe via the temporal pole (auditory ventral stream; AVS) and via the inferior parietal lobe (auditory dorsal stream; ADS). The AVS is responsible for sound recognition, and the ADS for sound-localization, voice detection and integration of calls with faces. I propose that the primary role of the ADS in non-human primates is the detection and response to contact calls. These calls are exchanged between tribe members (e.g., mother-offspring) and are used for monitoring location. Detection of contact calls occurs by the ADS identifying a voice, localizing it, and verifying that the corresponding face is out of sight. Once a contact call is detected, the primate produces a contact call in return via descending connections from the frontal lobe to a network of limbic and brainstem regions. Because the ADS of present day humans also performs speech production, I further propose an evolutionary course for the transition from contact call exchange to an early form of speech. In accordance with this model, structural changes to the ADS endowed early members of the genus Homo with partial vocal control. This development was beneficial as it enabled offspring to modify their contact calls with intonations for signaling high or low levels of distress to their mother. Eventually, individuals were capable of participating in yes-no question-answer conversations. In these conversations the offspring emitted a low-level distress call for inquiring about the safety of objects (e.g., food), and his/her mother responded with a high- or low-level distress call to signal approval or disapproval of the interaction. Gradually, the ADS and its connections with brainstem motor regions became more robust and vocal control became more volitional. Speech emerged once vocal control was sufficient for inventing novel calls.

  5. From where to what: a neuroanatomically based evolutionary model of the emergence of speech in humans [version 2; referees: 1 approved, 2 approved with reservations]

    Directory of Open Access Journals (Sweden)

    Oren Poliva

    2016-01-01

    Full Text Available In the brain of primates, the auditory cortex connects with the frontal lobe via the temporal pole (auditory ventral stream; AVS) and via the inferior parietal lobe (auditory dorsal stream; ADS). The AVS is responsible for sound recognition, and the ADS for sound-localization, voice detection and integration of calls with faces. I propose that the primary role of the ADS in non-human primates is the detection and response to contact calls. These calls are exchanged between tribe members (e.g., mother-offspring) and are used for monitoring location. Detection of contact calls occurs by the ADS identifying a voice, localizing it, and verifying that the corresponding face is out of sight. Once a contact call is detected, the primate produces a contact call in return via descending connections from the frontal lobe to a network of limbic and brainstem regions. Because the ADS of present day humans also performs speech production, I further propose an evolutionary course for the transition from contact call exchange to an early form of speech. In accordance with this model, structural changes to the ADS endowed early members of the genus Homo with partial vocal control. This development was beneficial as it enabled offspring to modify their contact calls with intonations for signaling high or low levels of distress to their mother. Eventually, individuals were capable of participating in yes-no question-answer conversations. In these conversations the offspring emitted a low-level distress call for inquiring about the safety of objects (e.g., food), and his/her mother responded with a high- or low-level distress call to signal approval or disapproval of the interaction. Gradually, the ADS and its connections with brainstem motor regions became more robust and vocal control became more volitional. Speech emerged once vocal control was sufficient for inventing novel calls.

  6. Spectral Psychoanalysis of Speech under Strain | Sharma ...

    African Journals Online (AJOL)

    Spectral Psychoanalysis of Speech under Strain. … Voice features from the speech signal that are influenced by strain include loudness, fundamental frequency, jitter, zero-crossing rate, …

  7. ClassTalk system for predicting and auralizing speech in noise with reverberation in classrooms

    Science.gov (United States)

    Hodgson, Murray; Graves, Daniel

    2005-04-01

    This paper discusses and demonstrates the ClassTalk system for predicting, visualizing and auralizing speech in noise with reverberation in classrooms. The classroom can contain a speech-reinforcement system (SRS). Male or female speech sources, SRS loudspeakers and overhead, slide or digital projectors, or ventilation-noise sources, can have four output levels. Empirical models are used to predict speech and noise levels, and Early Decay Times, from which Speech Transmission Index (STI) and Speech Intelligibility (SI) are calculated. ClassTalk visualizes the floor-plan, speech- and noise-source positions, and the receiver position. The user can walk through the room at will. In real time five quantities, background-noise level, speech level, signal-to-noise difference, STI and SI, are displayed along with occupied and unoccupied reverberation times. The sound module auralizes male or female speech mixed with the relevant noise signals, with predicted, frequency-varying reverberation superimposed using MaxxVerb. Technical issues related to the development of the sound module are discussed. The potential of the system's auralization module for demonstrating the effects of the acoustical environment and its control on speech is discussed and demonstrated.

  8. Why do I hear but not understand? Stochastic undersampling as a model of degraded neural encoding of speech

    Directory of Open Access Journals (Sweden)

    Enrique A Lopez-Poveda

    2014-10-01

    Full Text Available Hearing impairment is a serious disease with increasing prevalence. It is defined based on increased audiometric thresholds but increased thresholds are only partly responsible for the greater difficulty understanding speech in noisy environments experienced by some older listeners or by hearing-impaired listeners. Identifying the additional factors and mechanisms that impair intelligibility is fundamental to understanding hearing impairment but these factors remain uncertain. Traditionally, these additional factors have been sought in the way the speech spectrum is encoded in the pattern of impaired mechanical cochlear responses. Recent studies, however, are steering the focus toward impaired encoding of the speech waveform in the auditory nerve. In our recent work, we gave evidence that a significant factor might be the loss of afferent auditory nerve fibers, a pathology that comes with aging or noise overexposure. Our approach was based on a signal-processing analogy whereby the auditory nerve may be regarded as a stochastic sampler of the sound waveform and deafferentation may be described in terms of waveform undersampling. We showed that stochastic undersampling simultaneously degrades the encoding of soft and rapid waveform features, and that this degrades speech intelligibility in noise more than in quiet without significant increases in audiometric thresholds. Here, we review our recent work in a broader context and argue that the stochastic undersampling analogy may be extended to study the perceptual consequences of various different hearing pathologies and their treatment.
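
    The sampling analogy can be rendered as a toy simulation: treat each auditory nerve fiber as a Poisson sampler whose firing probability follows the (rectified) waveform, and observe how the pooled reconstruction degrades as fibers are removed. The rates and signal shapes below are arbitrary choices of mine, not the authors' model.

    import numpy as np

    rng = np.random.default_rng(8)
    fs = 8000
    t = np.arange(fs // 4) / fs
    wave = 0.5 * (1 + np.sin(2 * np.pi * 10 * t)) * np.sin(2 * np.pi * 300 * t)
    rate = np.clip(wave, 0, None)             # half-wave rectified driving rate

    def nerve_reconstruction(n_fibers, max_rate=200.0):
        """Each 'fiber' spikes as a Poisson process driven by the waveform;
        the reconstruction is the pooled spike count per sample."""
        p = max_rate / fs * rate               # per-sample spike probability
        spikes = rng.random((n_fibers, len(t))) < p
        return spikes.sum(axis=0) / n_fibers

    for n_fibers in (1000, 100, 10):          # deafferentation = fewer fibers
        rec = nerve_reconstruction(n_fibers)
        score = np.corrcoef(rec, rate)[0, 1]
        print(f"{n_fibers:5d} fibers: reconstruction/rate correlation = {score:.3f}")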

  9. Model human heart or brain signals

    CERN Document Server

    Tuncay, Caglar

    2008-01-01

    A new model is suggested and used to mimic various spatial or temporal designs in biological or non-biological formations where the focus is on the normal or irregular electrical signals coming from the human heart (ECG) or brain (EEG). The electrical activities in several muscles (EMG) or neurons or other organs of humans or various animals, such as the lobster pyloric neuron, guinea pig inferior olivary neuron, sepia giant axon and mouse neocortical pyramidal neuron, and some spatial formations are also considered (in Appendix). In the biological applications, several elements (cells or tissues) in an organ are taken as various entries in a representative lattice (mesh) where the entries are connected to each other in terms of some molecular diffusions or electrical potential differences. The biological elements evolve in time (with the given tissue or organ) in terms of the mentioned connections (interactions) besides some individual feedings. The anatomical diversity of the species (or organs) is handled in terms o...

  10. A Process Model of the Signal Duration Phenomenon of Vigilance

    Science.gov (United States)

    2014-10-01

    A Process Model of the Signal Duration Phenomenon of Vigilance. Daniel Gartenberg, Bella Veksler, Glenn Gunzelmann, J. Gregory Trafton. … with shorter signal durations (see Figure 1). There is currently no process model that explains the signal duration effect found in vigilance…

  11. Automated Gesturing for Virtual Characters: Speech-driven and Text-driven Approaches

    Directory of Open Access Journals (Sweden)

    Goranka Zoric

    2006-04-01

    Full Text Available We present two methods for automatic facial gesturing of graphically embodied animated agents. In one case, the conversational agent is driven by speech in an automatic lip-sync process. By analyzing the speech input, lip movements are determined from the speech signal. The other method provides a virtual speaker capable of reading plain English text and rendering it as speech accompanied by appropriate facial gestures. The proposed statistical model for generating the virtual speaker's facial gestures can also be applied in addition to the lip-synchronization process in order to obtain speech-driven facial gesturing. In this case the statistical model is triggered by the prosody of the input speech instead of lexical analysis of the input text.

  12. Modeling of surface myoelectric signals--Part II: Model-based signal interpretation.

    Science.gov (United States)

    Merletti, R; Roy, S H; Kupa, E; Roatta, S; Granata, A

    1999-07-01

    Experimental electromyogram (EMG) data from the human biceps brachii were simulated using the model described in [10] of this work. A multichannel linear electrode array, spanning the length of the biceps, was used to detect monopolar and bipolar signals, from which double differential signals were computed, during either voluntary or electrically elicited isometric contractions. For relatively low-level voluntary contractions (10%-30% of maximum force), individual firings of three to four different motor units were identified and their waveforms were closely approximated by the model. Motor unit parameters such as depth, size, fiber orientation and length, location of innervation and tendonous zones, propagation velocity, and source width were estimated using the model. Two applications of the model are described. The first analyzes the effects of electrode rotation with respect to the muscle fiber direction and shows the possibility of conduction velocity (CV) over- and under-estimation. The second focuses on the myoelectric manifestations of fatigue during a sustained electrically elicited contraction and the interrelationship between muscle fiber CV, spectral and amplitude variables, and the length of the depolarization zone. It is concluded that a) surface EMG detection using an electrode array, when combined with a model of signal propagation, provides a useful method for understanding the physiological and anatomical determinants of EMG waveform characteristics and b) the model provides a way for the interpretation of fatigue plots.

  13. Broadening the "Ports of Entry" for Speech-Language Pathologists: A Relational and Reflective Model for Clinical Supervision

    Science.gov (United States)

    Geller, Elaine; Foley, Gilbert M.

    2009-01-01

    Purpose: To offer a framework for clinical supervision in speech-language pathology that embeds a mental health perspective within the study of communication sciences and disorders. Method: Key mental health constructs are examined as to how they are applied in traditional versus relational and reflective supervision models. Comparisons between…

  14. Speech transmission index from running speech: A neural network approach

    Science.gov (United States)

    Li, F. F.; Cox, T. J.

    2003-04-01

    Speech transmission index (STI) is an important objective parameter concerning speech intelligibility for sound transmission channels. It is normally measured with specific test signals to ensure high accuracy and good repeatability. Measurement with running speech was previously proposed, but accuracy is compromised and hence applications limited. A new approach that uses artificial neural networks to accurately extract the STI from received running speech is developed in this paper. Neural networks are trained on a large set of transmitted speech examples with prior knowledge of the transmission channels' STIs. The networks perform complicated nonlinear function mappings and spectral feature memorization to enable accurate objective parameter extraction from transmitted speech. Validations via simulations demonstrate the feasibility of this new method on a one-net-one-speech extract basis. In this case, accuracy is comparable with normal measurement methods. This provides an alternative to standard measurement techniques, and it is intended that the neural network method can facilitate occupied room acoustic measurements.
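
    The regression setup can be sketched with a small multilayer perceptron mapping feature vectors extracted from received running speech to known channel STIs. Neither the paper's network architecture nor its spectral features are reproduced here; the data below are synthetic stand-ins purely to show the training shape.

    import numpy as np
    from sklearn.neural_network import MLPRegressor

    rng = np.random.default_rng(3)

    # Placeholder training set: rows stand in for spectral features of running
    # speech received through known channels; targets stand in for the STIs of
    # those channels measured conventionally.
    n_examples, n_features = 500, 24
    X = rng.standard_normal((n_examples, n_features))
    true_w = rng.standard_normal(n_features) / np.sqrt(n_features)
    sti = 1 / (1 + np.exp(-X @ true_w))       # fake STI values in (0, 1)

    net = MLPRegressor(hidden_layer_sizes=(32, 16), max_iter=2000, random_state=0)
    net.fit(X[:400], sti[:400])               # train on known-channel examples
    print("held-out MAE:", np.mean(np.abs(net.predict(X[400:]) - sti[400:])))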

  15. A Maximum Likelihood Estimation of Vocal-Tract-Related Filter Characteristics for Single Channel Speech Separation

    Directory of Open Access Journals (Sweden)

    Mohammad H. Radfar

    2006-11-01

    Full Text Available We present a new technique for separating two speech signals from a single recording. The proposed method bridges the gap between underdetermined blind source separation techniques and those techniques that model the human auditory system, that is, computational auditory scene analysis (CASA). For this purpose, we decompose the speech signal into the excitation signal and the vocal-tract-related filter and then estimate the components from the mixed speech using a hybrid model. We first express the probability density function (PDF) of the mixed speech's log spectral vectors in terms of the PDFs of the underlying speech signal's vocal-tract-related filters. Then, the mean vectors of PDFs of the vocal-tract-related filters are obtained using a maximum likelihood estimator given the mixed signal. Finally, the estimated vocal-tract-related filters along with the extracted fundamental frequencies are used to reconstruct estimates of the individual speech signals. The proposed technique effectively adds vocal-tract-related filter characteristics as a new cue to CASA models using a new grouping technique based on an underdetermined blind source separation. We compare our model with both an underdetermined blind source separation and a CASA method. The experimental results show that our model outperforms both techniques in terms of SNR improvement and the percentage of crosstalk suppression.

  16. A Maximum Likelihood Estimation of Vocal-Tract-Related Filter Characteristics for Single Channel Speech Separation

    Directory of Open Access Journals (Sweden)

    Dansereau Richard M

    2007-01-01

    Full Text Available We present a new technique for separating two speech signals from a single recording. The proposed method bridges the gap between underdetermined blind source separation techniques and those techniques that model the human auditory system, that is, computational auditory scene analysis (CASA). For this purpose, we decompose the speech signal into the excitation signal and the vocal-tract-related filter and then estimate the components from the mixed speech using a hybrid model. We first express the probability density function (PDF) of the mixed speech's log spectral vectors in terms of the PDFs of the underlying speech signal's vocal-tract-related filters. Then, the mean vectors of PDFs of the vocal-tract-related filters are obtained using a maximum likelihood estimator given the mixed signal. Finally, the estimated vocal-tract-related filters along with the extracted fundamental frequencies are used to reconstruct estimates of the individual speech signals. The proposed technique effectively adds vocal-tract-related filter characteristics as a new cue to CASA models using a new grouping technique based on an underdetermined blind source separation. We compare our model with both an underdetermined blind source separation and a CASA method. The experimental results show that our model outperforms both techniques in terms of SNR improvement and the percentage of crosstalk suppression.

  17. Research on emotion recognition of speech signal based on HMM and ANN

    Institute of Scientific and Technical Information of China (English)

    胡洋; 蒲南江; 吴黎慧; 高磊

    2011-01-01

    Speech emotion recognition is an important branch of speech recognition and a theoretical basis of harmonious human-computer interaction. Because a single classifier has limitations in speech emotion recognition, this paper proposes a method combining Hidden Markov Models (HMM) and Artificial Neural Networks (ANN). For each of the six emotions of happiness, surprise, anger, sadness, fear, and calm, an HMM model is designed, yielding the best matching sequence for each emotion. An ANN is then used as a posterior classifier on the test samples, and the fusion of the two classifiers improves the speech emotion recognition rate. Experiments on an emotional speech database built by induced recording show a considerable improvement in the recognition rate.
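
    A sketch of the hybrid architecture, assuming the hmmlearn and scikit-learn libraries and synthetic stand-in features: one Gaussian HMM per emotion supplies log-likelihood scores, and an ANN classifies the vector of scores. Model sizes and features are illustrative, not the paper's.

    import numpy as np
    from hmmlearn.hmm import GaussianHMM          # assumed available
    from sklearn.neural_network import MLPClassifier

    rng = np.random.default_rng(4)
    emotions = ["happy", "surprise", "anger", "sad", "fear", "calm"]

    def fake_utterances(mean, n=20, T=50, d=12):
        """Stand-in for per-emotion MFCC sequences (n utterances, T frames)."""
        return [mean + rng.standard_normal((T, d)) for _ in range(n)]

    data = {e: fake_utterances(i * 0.3) for i, e in enumerate(emotions)}

    # Stage 1: one HMM per emotion, trained on that emotion's utterances
    hmms = {}
    for e, utts in data.items():
        X = np.vstack(utts)
        hmms[e] = GaussianHMM(n_components=3, random_state=0).fit(
            X, lengths=[len(u) for u in utts])

    def hmm_scores(utt):
        """Per-emotion log-likelihoods: the feature vector passed to the ANN."""
        return np.array([hmms[e].score(utt) for e in emotions])

    # Stage 2: ANN posterior classifier on the vector of HMM log-likelihoods
    feats = np.array([hmm_scores(u) for e in emotions for u in data[e]])
    labels = np.array([e for e in emotions for _ in data[e]])
    ann = MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000,
                        random_state=0).fit(feats, labels)
    print("training accuracy:", ann.score(feats, labels))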

  18. Integrating Music Therapy Services and Speech-Language Therapy Services for Children with Severe Communication Impairments: A Co-Treatment Model

    Science.gov (United States)

    Geist, Kamile; McCarthy, John; Rodgers-Smith, Amy; Porter, Jessica

    2008-01-01

    Documenting how music therapy can be integrated with speech-language therapy services for children with communication delay is not evident in the literature. In this article, a collaborative model with procedures, experiences, and communication outcomes of integrating music therapy with the existing speech-language services is given. Using…

  20. A parametric framework for modelling of bioelectrical signals

    CERN Document Server

    Mughal, Yar Muhammad

    2016-01-01

    This book examines non-invasive, electrical-based methods for disease diagnosis and assessment of heart function. In particular, a formalized signal model is proposed since this offers several advantages over methods that rely on measured data alone. By using a formalized representation, the parameters of the signal model can be easily manipulated and/or modified, thus providing mechanisms that allow researchers to reproduce and control such signals. In addition, having such a formalized signal model makes it possible to develop computer tools that can be used for manipulating and understanding how signal changes result from various heart conditions, as well as for generating input signals for experimenting with and evaluating the performance of e.g. signal extraction methods. The work focuses on bioelectrical information, particularly electrical bio-impedance (EBI). Once the EBI has been measured, the corresponding signals have to be modelled for analysis. This requires a structured approach in order to move...

  1. Tactile Modulation of Emotional Speech Samples

    Directory of Open Access Journals (Sweden)

    Katri Salminen

    2012-01-01

    Full Text Available Traditionally, only speech communicates emotions via mobile phone. However, in daily communication the sense of touch mediates emotional information during conversation. The present aim was to study whether tactile stimulation affects emotional ratings of speech when measured with scales of pleasantness, arousal, approachability, and dominance. In Experiment 1, participants rated speech-only and speech-tactile stimuli. The tactile signal mimicked the amplitude changes of the speech. In Experiment 2, the aim was to study whether the way the tactile signal was produced affected the ratings. The tactile signal either mimicked the amplitude changes of the speech sample in question, or the amplitude changes of another speech sample. Concurrent static vibration was also included. The results showed that the speech-tactile stimuli were rated as more arousing and dominant than the speech-only stimuli. The speech-only stimuli were rated as more approachable than the speech-tactile stimuli, but only in Experiment 1. Variations in tactile stimulation also affected the ratings. When the tactile stimulation was static vibration, the speech-tactile stimuli were rated as more arousing than when the concurrent tactile stimulation mimicked speech samples. The results suggest that tactile stimulation offers new ways of modulating and enriching the interpretation of speech.

  2. Building a Model of Support for Preschool Children with Speech and Language Disorders

    Science.gov (United States)

    Robertson, Natalie; Ohi, Sarah

    2016-01-01

    Speech and language disorders impede young children's abilities to communicate and are often associated with a number of behavioural problems arising in the preschool classroom. This paper reports a small-scale study that investigated 23 Australian educators' and 7 Speech Pathologists' experiences in working with three to five year old children…

  3. Modeling words with subword units in an articulatorily constrained speech recognition algorithm

    Energy Technology Data Exchange (ETDEWEB)

    Hogden, J.

    1997-11-20

    The goal of speech recognition is to find the most probable word given the acoustic evidence, i.e. a string of VQ codes or acoustic features. Speech recognition algorithms typically take advantage of the fact that the probability of a word, given a sequence of VQ codes, can be calculated.

  4. Modelling the Architecture of Phonetic Plans: Evidence from Apraxia of Speech

    Science.gov (United States)

    Ziegler, Wolfram

    2009-01-01

    In theories of spoken language production, the gestural code prescribing the movements of the speech organs is usually viewed as a linear string of holistic, encapsulated, hard-wired, phonetic plans, e.g., of the size of phonemes or syllables. Interactions between phonetic units on the surface of overt speech are commonly attributed to either the…

  5. Binaural Model-Based Speech Intelligibility Enhancement and Assessment in Hearing Aids

    NARCIS (Netherlands)

    Schlesinger, A.

    2012-01-01

    The enhancement of speech intelligibility in noise is still the main subject in hearing aid research. Based on the advanced results obtained with the hearing glasses, in the present research the speech intelligibility is even further improved by the application of binaural post-filters. The function

  6. A Computational Model of Word Segmentation from Continuous Speech Using Transitional Probabilities of Atomic Acoustic Events

    Science.gov (United States)

    Rasanen, Okko

    2011-01-01

    Word segmentation from continuous speech is a difficult task that is faced by human infants when they start to learn their native language. Several studies indicate that infants might use several different cues to solve this problem, including intonation, linguistic stress, and transitional probabilities between subsequent speech sounds. In this…

  7. Malay Wordlist Modeling for Articulation Disorder Patient by Using Computerized Speech Diagnosis System

    Directory of Open Access Journals (Sweden)

    Mohd Nizam Mazenan

    2014-05-01

    Full Text Available The aim of this research was to assess the quality and the impact of using a computerized speech diagnosis system to overcome the problem of speech articulation disorder in the Malaysian context. A prototype of the system has already been developed to help Speech Therapists (ST) or Speech-Language Pathologists (SLP) in diagnosis, prevention, and treatment at an early stage. A few assessments are conducted by the ST on each patient, and the process mostly still uses manual techniques whereby the ST relies on hearing and years of experience. This study surveys the techniques and methods used by STs at Hospital Sultanah Aminah (HSA) (Speech Therapy Center) to help patients who suffer from speech disorders, especially articulation disorder cases. Experiments and results are also presented in this study, in which the computerized speech diagnosis system is tested using real patient voice samples collected from HSA and from students of Sekolah Kebangsaan Taman Universiti Satu.

  8. Neural processing of acoustic duration and phonological German vowel length: time courses of evoked fields in response to speech and nonspeech signals.

    Science.gov (United States)

    Tomaschek, Fabian; Truckenbrodt, Hubert; Hertrich, Ingo

    2013-01-01

    Recent experiments showed that the perception of vowel length by German listeners exhibits the characteristics of categorical perception. The present study sought to find the neural activity reflecting categorical vowel length and the short-long boundary by examining the processing of non-contrastive durations and categorical length using MEG. Using disyllabic words with varying /a/-durations and temporally-matched nonspeech stimuli, we found that each syllable elicited an M50/M100-complex. The M50-amplitude to the second syllable varied along the durational continuum, possibly reflecting the mapping of duration onto a rhythm representation. Categorical length was reflected by an additional response elicited when vowel duration exceeded the short-long boundary. This was interpreted to reflect the integration of an additional timing unit for long in contrast to short vowels. Unlike speech, responses to short nonspeech durations lacked an M100 to the first and an M50 to the second syllable, indicating different integration windows for speech and nonspeech signals.

  9. Speech Enhancement

    DEFF Research Database (Denmark)

    Benesty, Jacob; Jensen, Jesper Rindom; Christensen, Mads Græsbøll

    Speech enhancement is a classical problem in signal processing, yet still largely unsolved. Two of the conventional approaches for solving this problem are linear filtering, like the classical Wiener filter, and subspace methods. These approaches have traditionally been treated as different classes of methods and have been introduced in somewhat different contexts. Linear filtering methods originate in stochastic processes, while subspace methods have largely been based on developments in numerical linear algebra and matrix approximation theory. This book bridges the gap between these two classes of methods by showing how the ideas behind subspace methods can be incorporated into traditional linear filtering. In the context of subspace methods, the enhancement problem can then be seen as a classical linear filter design problem. This means that various solutions can more easily be compared…

  10. Automatic acquisition of speech sound-target cells based on the DIVA model

    Institute of Scientific and Technical Information of China (English)

    张少白; 刘欣

    2013-01-01

    To address a shortcoming of the Directions Into Velocities of Articulators (DIVA) model, namely that infants' perceptual abilities initially develop faster than their speech production skills, this paper presents a method for automatically acquiring speech sound-target cells. The method models the human ear as a parallel bank of band-pass filters with different bandwidths, each associated with the 21-dimensional auditory storage space of the DIVA model, so that for different auditory responses the masking effect of each frequency band and the relationship between loudness and frequency can be considered separately. While reading the speech input signal, the model acquires a good initial auditory representation, in a process consistent with an infant's babbling. Simulation results show that, through boundary definition, similarity comparison, and search-and-update steps, the method achieves good self-organized matching of initial input patterns and ultimately gives the DIVA model a more natural capacity for speech acquisition.

  11. Speech Matters

    DEFF Research Database (Denmark)

    Hasse Jørgensen, Stina

    2011-01-01

    About Speech Matters - Katarina Gregos, the Greek curator's exhibition at the Danish Pavilion, the Venice Biennale 2011.

  12. Evaluating quantitative and conceptual models of speech production: how does SLAM fare?

    Science.gov (United States)

    Walker, Grant M; Hickok, Gregory

    2016-04-01

    In a previous publication, we presented a new computational model called SLAM (Walker & Hickok, Psychonomic Bulletin & Review, doi: 10.3758/s13423-015-0903), based on the hierarchical state feedback control (HSFC) theory (Hickok, Nature Reviews Neuroscience, 13(2), 135-145, 2012). In his commentary, Goldrick (Psychonomic Bulletin & Review, doi: 10.3758/s13423-015-0946-9) claims that SLAM does not represent a theoretical advancement, because it cannot be distinguished from an alternative lexical + postlexical (LPL) theory proposed by Goldrick and Rapp (Cognition, 102(2), 219-260, 2007). First, we point out that SLAM implements a portion of a conceptual model (HSFC) that encompasses LPL. Second, we show that SLAM accounts for a lexical bias present in sound-related errors that LPL does not explain. Third, we show that SLAM's explanatory advantage is not a result of approximating the architectural or computational assumptions of LPL, since an implemented version of LPL fails to provide the same fit improvements as SLAM. Finally, we show that incorporating a mechanism that violates some core theoretical assumptions of LPL, making it more like SLAM in terms of interactivity, allows the model to capture some of the same effects as SLAM. SLAM therefore provides new modeling constraints regarding interactions among processing levels, while also elaborating on the structure of the phonological level. We view this as evidence that an integration of psycholinguistic, neuroscience, and motor control approaches to speech production is feasible and may lead to substantial new insights.

  13. IITKGP-SESC: Speech Database for Emotion Analysis

    Science.gov (United States)

    Koolagudi, Shashidhar G.; Maity, Sudhamay; Kumar, Vuppala Anil; Chakrabarti, Saswat; Rao, K. Sreenivasa

    In this paper, we introduce a speech database for analyzing the emotions present in speech signals. The proposed database is recorded in the Telugu language using professional artists from All India Radio (AIR), Vijayawada, India. The speech corpus is collected by simulating eight different emotions using neutral (emotion free) statements. The database is named the Indian Institute of Technology Kharagpur Simulated Emotion Speech Corpus (IITKGP-SESC). The proposed database will be useful for characterizing the emotions present in speech. Further, the emotion-specific knowledge present in speech at different levels can be acquired by developing emotion-specific models using features from the vocal tract system, excitation source, and prosody. This paper describes the design, acquisition, post-processing, and evaluation of the proposed speech database (IITKGP-SESC). The quality of the emotions present in the database is evaluated using subjective listening tests. Finally, statistical models are developed using prosodic features, and the discrimination of the emotions is carried out by classifying emotions with the developed statistical models.

  14. A model with nonzero rise time for AE signals

    Indian Academy of Sciences (India)

    M A Majeed; C R L Murthy

    2001-10-01

    Acoustic emission (AE) signals are conventionally modelled as damped or decaying sinusoidal functions. A major drawback of this model is its negligible or zero rise time. This paper proposes an alternative model, which provides for the rising part of the signal without sacrificing the analytical tractability and simplicity of the conventional model. Signals obtained from the proposed model through computer programs are illustrated for demonstrating their parity with actual AE signals. Analytic expressions for the time-domain parameters, viz., peak amplitude and rise time used in conventional AE signal analysis, are also derived. The model is believed to be also of use in modelling the output signal of any transducer that has finite rise time and fall time.

  15. Autonomous Traffic Signal Control Model with Neural Network Analogy

    CERN Document Server

    Ohira, T

    1997-01-01

    We propose here an autonomous traffic signal control model based on analogy with neural networks. In this model, the length of cycle time period of traffic lights at each signal is autonomously adapted. We find a self-organizing collective behavior of such a model through simulation on a one-dimensional lattice model road: traffic congestion is greatly diffused when traffic signals have such autonomous adaptability with suitably tuned parameters. We also find that effectiveness of the system emerges through interactions between units and shows a threshold transition as a function of proportion of adaptive signals in the model.

  16. Signal Processing Model for Radiation Transport

    Energy Technology Data Exchange (ETDEWEB)

    Chambers, D H

    2008-07-28

    This note describes the design of a simplified gamma ray transport model for use in designing a sequential Bayesian signal processor for low-count detection and classification. It uses a simple one-dimensional geometry to describe the emitting source, shield effects, and detector (see Fig. 1). At present, only Compton scattering and photoelectric absorption are implemented for the shield and the detector. Other effects may be incorporated in the future by revising the expressions for the probabilities of escape and absorption. Pair production would require a redesign of the simulator to incorporate photon correlation effects. The initial design incorporates the physical effects that were present in the previous event mode sequence simulator created by Alan Meyer. The main difference is that this simulator transports the rate distributions instead of single photons. Event mode sequences and other time-dependent photon flux sequences are assumed to be marked Poisson processes that are entirely described by their rate distributions. Individual realizations can be constructed from the rate distribution using a random Poisson point sequence generator.

  17. Speech Development

    Science.gov (United States)

    ... The speech-language pathologist should consistently assess your child’s speech and language development, as well as screen for hearing problems (with ... and caregivers play a vital role in a child’s speech and language development. It is important that you talk to your ...

  18. Emotional State Categorization from Speech: Machine vs. Human

    CERN Document Server

    Shaukat, Arslan

    2010-01-01

    This paper presents our investigations on emotional state categorization from speech signals with a psychologically inspired computational model against human performance under the same experimental setup. Based on psychological studies, we propose a multistage categorization strategy which allows establishing an automatic categorization model flexibly for a given emotional speech categorization task. We apply the strategy to the Serbian Emotional Speech Corpus (GEES) and the Danish Emotional Speech Corpus (DES), where human performance was reported in previous psychological studies. Our work is the first attempt to apply machine learning to the GEES corpus where the human recognition rates were only available prior to our study. Unlike the previous work on the DES corpus, our work focuses on a comparison to human performance under the same experimental settings. Our studies suggest that psychology-inspired systems yield behaviours that, to a great extent, resemble what humans perceived and their performance ...

  19. Speaker height estimation from speech: Fusing spectral regression and statistical acoustic models.

    Science.gov (United States)

    Hansen, John H L; Williams, Keri; Bořil, Hynek

    2015-08-01

    Estimating speaker height can assist in voice forensic analysis and provide additional side knowledge to benefit automatic speaker identification or acoustic model selection for automatic speech recognition. In this study, a statistical approach to height estimation that incorporates acoustic models within a non-uniform height bin width Gaussian mixture model structure as well as a formant analysis approach that employs linear regression on selected phones are presented. The accuracy and trade-offs of these systems are explored by examining the consistency of the results, location, and causes of error as well a combined fusion of the two systems using data from the TIMIT corpus. Open set testing is also presented using the Multi-session Audio Research Project corpus and publicly available YouTube audio to examine the effect of channel mismatch between training and testing data and provide a realistic open domain testing scenario. The proposed algorithms achieve a highly competitive performance to previously published literature. Although the different data partitioning in the literature and this study may prevent performance comparisons in absolute terms, the mean average error of 4.89 cm for males and 4.55 cm for females provided by the proposed algorithm on TIMIT utterances containing selected phones suggest a considerable estimation error decrease compared to past efforts.

  20. Channel noise enhances signal detectability in a model of acoustic neuron through the stochastic resonance paradigm.

    Science.gov (United States)

    Liberti, M; Paffi, A; Maggio, F; De Angelis, A; Apollonio, F; d'Inzeo, G

    2009-01-01

    A number of experimental investigations have evidenced the extraordinary sensitivity of neuronal cells to weak input stimulations, including electromagnetic (EM) fields. Moreover, it has been shown that biological noise, due to random channel gating, acts as a tuning factor in neuronal processing, according to the stochastic resonance (SR) paradigm. In this work the attention is focused on noise arising from the stochastic gating of ionic channels in a model of the Ranvier node of acoustic fibers. The small number of channels gives rise to a high noise level, which is able to cause spike train generation even in the absence of stimulation. An SR behavior has been observed in the model for the detection of sinusoidal signals at frequencies typical of speech.
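
    The SR effect itself is easy to reproduce with a toy threshold detector, much simpler than the Ranvier-node model used in the paper: a subthreshold sinusoid crosses threshold only with the help of noise, and the resulting spike train tracks the signal best at an intermediate noise level. All values below are illustrative.

    import numpy as np

    rng = np.random.default_rng(5)
    fs, f_sig = 10_000, 200                   # sample rate, 'speech-band' tone (Hz)
    t = np.arange(fs) / fs
    signal = 0.8 * np.sin(2 * np.pi * f_sig * t)   # subthreshold: never crosses 1.0
    threshold = 1.0

    for sigma in (0.1, 0.3, 0.7, 1.5):
        spikes = (signal + sigma * rng.standard_normal(len(t))) > threshold
        # Correlate the spike train with the driving signal: a crude stand-in
        # for output SNR at the stimulus frequency. Expect a rise-then-fall
        # (resonance) pattern as sigma increases.
        score = np.corrcoef(spikes.astype(float), signal)[0, 1]
        print(f"noise sigma={sigma:4.2f}  signal/spike correlation={score:5.3f}")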

  1. The Relative Weight of Statistical and Prosodic Cues in Speech Segmentation: A Matter of Language-(In)dependency and of Signal Quality

    Directory of Open Access Journals (Sweden)

    Tânia Fernandes

    2011-06-01

    Full Text Available In an artificial language setting, we investigated the relative weight of statistical cues (transitional probabilities, TPs) in comparison to two prosodic cues, Intonational Phrases (IPs), a language-independent cue, and lexical stress (a language-dependent cue). The signal quality was also manipulated through white-noise superimposition. Both IPs and TPs were highly resilient to physical degradation of the signal. An overall performance gain was found when these cues were congruent, but when they were incongruent IPs prevailed over TPs (Experiment 1). After ensuring that duration is treated by Portuguese listeners as a correlate of lexical stress (Experiment 2A), the role of lexical stress and TPs in segmentation was evaluated in Experiment 2B. Lexical stress effects only emerged with physically degraded signal, constraining the extraction of TP-words to the ones supported by both TPs and IPs. Speech segmentation does not seem to be the product of one preponderant cue acting as a filter of the outputs of another, lower-weighted cue. Instead, it mainly depends on the listening conditions, and the weighting of the cues according to their role in a particular language.
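
    The statistical-cue half of this design can be sketched directly: compute transitional probabilities between adjacent syllables of a synthesized stream and posit word boundaries at local TP minima. The three-word toy lexicon below is generic, not the study's materials.

    from collections import Counter
    import random

    random.seed(6)
    words = ["tupiro", "golabu", "bidaku"]          # toy artificial lexicon

    def syls(w):
        return [w[i:i + 2] for i in range(0, len(w), 2)]

    stream = sum((syls(random.choice(words)) for _ in range(300)), [])

    # Transitional probability TP(a -> b) = count(ab) / count(a)
    pair_counts = Counter(zip(stream, stream[1:]))
    syl_counts = Counter(stream[:-1])
    tp = {(a, b): c / syl_counts[a] for (a, b), c in pair_counts.items()}

    # Posit a boundary wherever TP dips below both neighbours (a local minimum)
    tps = [tp[(stream[i], stream[i + 1])] for i in range(len(stream) - 1)]
    boundaries = [i + 1 for i in range(1, len(tps) - 1)
                  if tps[i] < tps[i - 1] and tps[i] < tps[i + 1]]
    print("first segmented chunks:",
          ["".join(stream[a:b]) for a, b in zip([0] + boundaries, boundaries)][:5])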

  2. Speech recognition based on a combination of acoustic features with articulatory information

    Institute of Scientific and Technical Information of China (English)

    LU Xugang; DANG Jianwu

    2005-01-01

    The contributions of static and dynamic articulatory information to speech recognition were evaluated, and recognition approaches combining articulatory information with acoustic features were discussed. Articulatory movements were observed with an Electromagnetic Articulographic system during read speech, and the speech signals were recorded simultaneously. First, we conducted several speech recognition experiments using articulatory features alone, consisting of a number of specific articulatory channels, to evaluate the contribution of each observation point on the articulators. Then, the displacement information of the articulatory data was combined directly with acoustic features and adopted in speech recognition. The results show that articulatory information provides additional information for speech recognition that is not encoded in acoustic features. Furthermore, the contribution of the dynamic information of the articulatory data was evaluated by combining it in speech recognition. It is found that the second derivative of the articulatory information provided a considerably larger contribution to speech recognition compared with the second derivative of the acoustic information. Finally, combination methods for articulatory and acoustic features were investigated for speech recognition. The basic approach is that a Bayesian Network (BN) is added to each state of the HMM, where the articulatory information is represented by the BN as a factor of the observed signals during model training and is marginalized out as a hidden variable at the recognition stage. Results based on this HMM/BN framework show better performance than the traditional method.

  3. Separating Underdetermined Convolutive Speech Mixtures

    DEFF Research Database (Denmark)

    Pedersen, Michael Syskind; Wang, DeLiang; Larsen, Jan

    2006-01-01

    a method for underdetermined blind source separation of convolutive mixtures. The proposed framework is applicable for separation of instantaneous as well as convolutive speech mixtures. It is possible to iteratively extract each speech signal from the mixture by combining blind source separation...

  4. PESQ Based Speech Intelligibility Measurement

    NARCIS (Netherlands)

    Beerends, J.G.; Buuren, R.A. van; Vugt, J.M. van; Verhave, J.A.

    2009-01-01

    Several measurement techniques exist to quantify the intelligibility of a speech transmission chain. In the objective domain, the Articulation Index [1] and the Speech Transmission Index STI [2], [3], [4], [5] have been standardized for predicting intelligibility. The STI uses a signal that contains

  5. Adaptive Compensation Algorithm in Open Vocabulary Mandarin Speaker-Independent Speech Recognition

    Institute of Scientific and Technical Information of China (English)

    2002-01-01

    In speech recognition systems, the physiological characteristics of the speech production model cause the voiced sections of the speech signal to have an attenuation of approximately 20 dB per decade. Many speech recognition algorithms address this problem by filtering the input signal with a single-zero high-pass filter. Unfortunately, this technique increases the noise energy at high frequencies above 4 kHz, which in some cases degrades the recognition accuracy. This paper addresses the problem using a pre-emphasis filter in the front end of the recognizer. The aim is to develop a modified parameterization approach taking into account the whole energy zone in the spectrum to improve the performance of the existing baseline recognition system in the acoustic phase. The results show that a large-vocabulary speaker-independent continuous speech recognition system using this approach has a greatly improved recognition rate.
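
    For reference, a first-order pre-emphasis filter of the kind discussed above is a one-liner; the coefficient 0.97 below is a conventional choice, not necessarily the value used in the paper.

      import numpy as np

      def pre_emphasis(x, alpha=0.97):
          # y[n] = x[n] - alpha * x[n-1]: boosts high frequencies by roughly
          # +6 dB/octave, offsetting the ~20 dB/decade tilt of voiced speech.
          return np.append(x[0], x[1:] - alpha * x[:-1])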

  6. 精神分裂症患者语音信号的临床分析%Clinical investigation of speech signal features among patients with schizophrenia

    Institute of Scientific and Technical Information of China (English)

    张静; 潘忠德; 桂超; 朱杰; 崔东红

    2016-01-01

    ...larger, more detailed follow-up studies of more diverse patients are needed.%Background: A new area of interest in the search for biomarkers for schizophrenia is the study of the acoustic parameters of speech called 'speech signal features'. Several of these features have been shown to be related to emotional responsiveness, a characteristic that is notably restricted in patients with schizophrenia, particularly those with prominent negative symptoms. Aim: Assess the relationship of selected acoustic parameters of speech to the severity of clinical symptoms in patients with chronic schizophrenia and compare these characteristics between patients and matched healthy controls. Methods: Ten speech signal features (six prosody features, formant bandwidth and amplitude, and two spectral features) were assessed using 15-minute speech samples obtained by smartphone from 26 inpatients with chronic schizophrenia (at enrollment and 1 week later) and from 30 healthy controls (at enrollment only). Clinical symptoms of the patients were also assessed at baseline and 1 week later using the Positive and Negative Syndrome Scale, the Scale for the Assessment of Negative Symptoms, and the Clinical Global Impression-Schizophrenia scale. Results: In the patient group the symptoms were stable over the 1-week interval, and the 1-week test-retest reliability of the 10 speech features was good (intraclass correlation coefficients [ICC] ranging from 0.55 to 0.88). Comparison of the speech features between patients and controls found no significant differences in the six prosody features or in the formant bandwidth and amplitude features, but the two spectral features differed: Mel-frequency cepstral coefficient (MFCC) scores were significantly lower in the patient group than in the control group, and linear prediction coding (LPC) scores were significantly higher in the patient group than in the control group. Within the patient group, 10 of the 170 associations between the 10 speech

  7. Compressed Sensing of Noisy Speech Signal Based on Adaptive Basis Pursuit De-noising%基于自适应基追踪去噪的含噪语音压缩感知

    Institute of Scientific and Technical Information of China (English)

    孙林慧; 杨震

    2011-01-01

    针对含白噪语音信号压缩采样后采用基追踪方法重构性能差的问题,提出了自适应基追踪去噪方法,该方法根据原含噪信号的信噪比自适应选择重构最佳参数,从而在重构语音的同时提高原信号信噪比.把该方法运用到含噪语音压缩感知中,对重构语音进行了主客观评价,并分析了不同压缩比下的重构性能.仿真结果显示:本文方法既实现了压缩采样,又在重构信号时实现了语音增强,优于基追踪重构方法%An adaptive basis pursuit de-noising method is proposed for compressed sensing of speech degraded by white Gaussian noise. The method adaptively selects the optimal reconstruction parameter according to the SNR of the original noisy speech, so that speech enhancement is achieved during reconstruction. Applied to compressed sensing of noisy speech, the reconstructed speech is evaluated both objectively and subjectively, and the reconstruction performance is analyzed at different compression ratios. Simulation results show that the method achieves compressive sampling while also enhancing the speech at reconstruction, and that it is superior to basis pursuit reconstruction, especially for low-SNR original speech signals.

  8. A Speech Intelligibility Index-based approach to predict the speech reception threshold for sentences in fluctuating noise for normal-hearing listeners

    Science.gov (United States)

    Rhebergen, Koenraad S.; Versfeld, Niek J.

    2005-04-01

    The SII model in its present form (ANSI S3.5-1997, American National Standards Institute, New York) can accurately describe intelligibility for speech in stationary noise but fails to do so for nonstationary noise maskers. Here, an extension to the SII model is proposed with the aim of predicting speech intelligibility in both stationary and fluctuating noise. The basic principle of the present approach is that both the speech and the noise signal are partitioned into small time frames. Within each time frame the conventional SII is determined, yielding the speech information available to the listener at that time frame. Next, the SII values of these time frames are averaged, resulting in the SII for that particular condition. Using speech reception threshold (SRT) data from the literature, the extension to the present SII model can give a good account of SRTs in stationary noise, fluctuating speech noise, interrupted noise, and multiple-talker noise. The predictions for sinusoidally intensity-modulated (SIM) noise and real speech or speech-like maskers are better than with the original SII model, but are still not accurate. For the latter type of maskers, informational masking may play a role.
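
    The frame-based extension can be sketched as follows. This is a schematic stand-in: the real SII works on band-importance-weighted levels per ANSI S3.5, whereas the broadband SNR-to-audibility mapping below only mirrors the general shape of that calculation, and the frame length is a placeholder.

      import numpy as np

      def frame_levels(x, frame_len):
          # RMS level per non-overlapping frame, in dB
          frames = [x[i:i + frame_len]
                    for i in range(0, len(x) - frame_len + 1, frame_len)]
          return np.array([10 * np.log10(np.mean(f ** 2) + 1e-12) for f in frames])

      def sii_fluctuating(speech, noise, fs, frame_ms=12.5):
          # speech and noise are assumed time-aligned and of equal length.
          n = int(fs * frame_ms / 1000)
          snr = frame_levels(speech, n) - frame_levels(noise, n)
          # Crude audibility mapping in place of the full ANSI S3.5 band
          # calculation: clip frame SNR to [-15, +15] dB, rescale to [0, 1].
          per_frame = (np.clip(snr, -15.0, 15.0) + 15.0) / 30.0
          return per_frame.mean()   # average over frames = SII for the condition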

  9. A model of cell biological signaling predicts a phase transition of signaling and provides mathematical formulae.

    Science.gov (United States)

    Tsuruyama, Tatsuaki

    2014-01-01

    A biological signal is transmitted by interactions between signaling molecules in the cell. To date, there have been extensive studies of signaling pathways using numerical simulation of kinetic equations based on equations of continuity and Fick's law. To obtain a mathematical formulation of cell signaling, we propose a stability kinetic model of cell biological signaling, a simple two-parameter model based on the kinetics of the diffusion-limiting step. In the present model, the signaling is regulated by the binding of a cofactor, such as ATP. Non-linearity of the kinetics is given by the diffusion fluctuation in the interaction between signaling molecules, which differs from previous works that hypothesized autocatalytic reactions. Numerical simulations showed the presence of a critical concentration of the cofactor beyond which the concentration of the cell signaling molecule is altered in a chaos-like oscillation with a characteristic frequency, similar to a discontinuous phase transition in physics. Notably, we found that the frequency is given by a logarithmic function of the difference between the outside cofactor concentration and the critical concentration. This implies that an outside alteration of the cofactor concentration is transformed into an oscillatory alteration of the cell's inner signaling. Further, mathematical stability kinetic analysis predicted a discontinuous dynamic phase transition in the critical state, at which the cofactor concentration is equivalent to the critical concentration. In conclusion, the present model illustrates a unique feature of cell signaling, and the stability analysis may provide an analytical framework for the cell signaling system and a novel formulation of biological signaling.

  10. [Improving speech comprehension using a new cochlear implant speech processor].

    Science.gov (United States)

    Müller-Deile, J; Kortmann, T; Hoppe, U; Hessel, H; Morsnowski, A

    2009-06-01

    The aim of this multicenter clinical field study was to assess the benefits of the new Freedom 24 sound processor for cochlear implant (CI) users implanted with the Nucleus 24 cochlear implant system. The study included 48 postlingually profoundly deaf experienced CI users who demonstrated speech comprehension performance with their current speech processor on the Oldenburg sentence test (OLSA) in quiet conditions of at least 80% correct scores and who were able to perform adaptive speech threshold testing using the OLSA in noisy conditions. Following baseline measures of speech comprehension performance with their current speech processor, subjects were upgraded to the Freedom 24 speech processor. After a take-home trial period of at least 2 weeks, subject performance was evaluated by measuring the speech reception threshold with the Freiburg multisyllabic word test and speech intelligibility with the Freiburg monosyllabic word test at 50 dB and 70 dB in the sound field. The results demonstrated highly significant benefits for speech comprehension with the new speech processor. Significant benefits for speech comprehension were also demonstrated with the new speech processor when tested in competing background noise. In contrast, use of the Abbreviated Profile of Hearing Aid Benefit (APHAB) did not prove to be a suitably sensitive assessment tool for comparative subjective self-assessment of hearing benefits with each processor. Use of the preprocessing algorithm known as adaptive dynamic range optimization (ADRO) in the Freedom 24 led to additional improvements over the standard upgrade map for speech comprehension in quiet and showed equivalent performance in noise. Through use of the preprocessing beam-forming algorithm BEAM, subjects demonstrated a highly significantly improved signal-to-noise ratio for speech comprehension thresholds (i.e., signal-to-noise ratio for 50% speech comprehension scores) when tested with an adaptive procedure using the Oldenburg

  11. Speech-on-speech masking with variable access to the linguistic content of the masker speech for native and nonnative english speakers.

    Science.gov (United States)

    Calandruccio, Lauren; Bradlow, Ann R; Dhar, Sumitrajit

    2014-04-01

    Masking release for an English sentence-recognition task in the presence of foreign-accented English speech compared with native-accented English speech was reported in Calandruccio et al. (2010a). The masking release appeared to increase as the masker intelligibility decreased. However, it could not be ruled out that spectral differences between the speech maskers were influencing the significant differences observed. The purpose of the current experiment was to minimize spectral differences between speech maskers to determine how various amounts of linguistic information within competing speech affect masking release. A mixed-model design with within-subject (four two-talker speech maskers) and between-subject (listener group) factors was conducted. Speech maskers included native-accented English speech and high-intelligibility, moderate-intelligibility, and low-intelligibility Mandarin-accented English. Normalizing the long-term average speech spectra of the maskers to each other minimized spectral differences between the masker conditions. Three listener groups were tested, including monolingual English speakers with normal hearing, nonnative English speakers with normal hearing, and monolingual English speakers with hearing loss. The nonnative English speakers were from various native language backgrounds, not including Mandarin (or any other Chinese dialect). Listeners with hearing loss had symmetric mild sloping to moderate sensorineural hearing loss. Listeners were asked to repeat back sentences that were presented in the presence of four different two-talker speech maskers. Responses were scored based on the key words within the sentences (100 key words per masker condition). A mixed-model regression analysis was used to analyze the difference in performance scores between the masker conditions and listener groups. Monolingual English speakers with normal hearing benefited when the competing speech signal was foreign accented compared with native

  12. [The application of cybernetic modeling methods for the forensic medical personality identification based on the voice and sounding speech characteristics].

    Science.gov (United States)

    Kaganov, A Sh; Kir'yanov, P A

    2015-01-01

    The objective of the present publication was to discuss the possibility of applying cybernetic modeling methods to overcome the apparent discrepancy between two kinds of speech records, viz. the initial ones (e.g., obtained in the course of special investigative activities) and the voice samples obtained from the persons subjected to criminalistic examination. The paper is based on literature sources and on materials from original criminalistic examinations performed by the authors.

  13. Flexible, Robust and Dynamic Dialogue Modeling with a Speech Dialogue Interface for Controlling a Hi-Fi Audio System.

    OpenAIRE

    Fernández Martínez, Fernando; Ferreiros López, Javier; Lucas Cuesta, Juan Manuel; Echeverry Correa, Julian David; San Segundo Hernández, Rubén; Córdoba Herralde, Ricardo de

    2010-01-01

    This work is focused on the context of speech interfaces for controlling household electronic devices. In particular, we present an example of a spoken dialogue system for controlling a Hi-Fi audio system. This system demonstrates that a more natural, flexible and robust dialogue is possible. This is due both to the Bayesian-network-based solution we propose for dialogue modeling and to carefully designed contextual-information handling strategies.

  14. Speech endpoint detection in real noise environments

    Institute of Scientific and Technical Information of China (English)

    GUO Yanmeng; FU Qiang; YAN Yonghong

    2007-01-01

    A method of speech endpoint detection in environments of complicated additive noise is presented. Based on an analysis of the noise, an adaptive model of the stationary noise is proposed to detect sections where the signal is nonstationary. The voice is then detected within such a section by its harmonic structure, and the accurate endpoints are located using an energy search. Compared with typical algorithms, this algorithm operates reliably in most real noise environments.
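
    The final energy-based search could look roughly like the sketch below, which estimates a noise floor from the leading frames and marks speech where short-time energy exceeds it; the thresholds and frame sizes are illustrative assumptions, not the paper's values.

      import numpy as np

      def refine_endpoints(x, fs, frame_ms=10, noise_ms=100, margin_db=6.0):
          # Noise floor estimated from the leading (assumed speech-free) frames.
          n = int(fs * frame_ms / 1000)
          frames = x[:len(x) // n * n].reshape(-1, n)
          energy_db = 10 * np.log10(np.mean(frames ** 2, axis=1) + 1e-12)
          floor = energy_db[: int(noise_ms / frame_ms)].mean()
          active = np.where(energy_db > floor + margin_db)[0]
          if active.size == 0:
              return None                                 # no speech found
          return active[0] * n, (active[-1] + 1) * n      # start/end sample indices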

  15. A Dynamic Stimulus-Driven Model of Signal Detection

    Science.gov (United States)

    Turner, Brandon M.; Van Zandt, Trisha; Brown, Scott

    2011-01-01

    Signal detection theory forms the core of many current models of cognition, including memory, choice, and categorization. However, the classic signal detection model presumes the a priori existence of fixed stimulus representations--usually Gaussian distributions--even when the observer has no experience with the task. Furthermore, the classic…

  16. Automatic Emotion Recognition in Speech: Possibilities and Significance

    Directory of Open Access Journals (Sweden)

    Milana Bojanić

    2009-12-01

    Full Text Available Automatic speech recognition and spoken language understanding are crucial steps towards natural human-machine interaction. The main task of the speech communication process is the recognition of the word sequence, but the recognition of prosody, emotion and stress tags may be of particular importance as well. This paper discusses the possibilities of recognizing emotion from the speech signal in order to improve ASR, and provides an analysis of acoustic features that can be used for the detection of a speaker's emotion and stress. The paper also gives a short overview of emotion and stress classification techniques. The importance and place of emotional speech recognition are shown in the domain of human-computer interactive systems and the transaction communication model. Directions for future work are given at the end of the paper.

  17. Nonlinear spectro-temporal features based on a cochlear model for automatic speech recognition in a noisy situation.

    Science.gov (United States)

    Choi, Yong-Sun; Lee, Soo-Young

    2013-09-01

    A nonlinear speech feature extraction algorithm was developed by modeling human cochlear functions, and demonstrated as a noise-robust front-end for speech recognition systems. The algorithm was based on a model of the organ of Corti in the human cochlea with such features as the basilar membrane (BM), outer hair cells (OHCs), and inner hair cells (IHCs). The frequency-dependent nonlinear compression and amplification of the OHCs were modeled by lateral inhibition to enhance spectral contrasts. In particular, the compression coefficients were given a frequency dependency based on psychoacoustic evidence. Spectral subtraction and temporal adaptation were applied in the time-frame domain. With long-term and short-term adaptation characteristics, these factors remove stationary or slowly varying components and amplify temporal changes such as onsets or offsets. The proposed features were evaluated on a noisy speech database and showed better performance than baseline methods such as mel-frequency cepstral coefficients (MFCCs) and RASTA-PLP in unknown noisy conditions.
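
    Of the processing stages mentioned, spectral subtraction is the easiest to sketch. The code below implements a basic magnitude-domain version with overlap-add resynthesis; the oversubtraction factor and spectral floor are common textbook defaults, and the OHC-style lateral inhibition and temporal adaptation stages are not modeled here.

      import numpy as np

      def spectral_subtraction(x, frame_len=512, noise_frames=10, over=2.0, floor=0.02):
          # Noise magnitude spectrum estimated from the first `noise_frames`
          # frames, which are assumed to contain noise only.
          hop = frame_len // 2
          win = np.hanning(frame_len)
          frames = [x[i:i + frame_len] * win
                    for i in range(0, len(x) - frame_len, hop)]
          spectra = np.array([np.fft.rfft(f) for f in frames])
          noise_mag = np.abs(spectra[:noise_frames]).mean(axis=0)
          mag = np.abs(spectra)
          clean_mag = np.maximum(mag - over * noise_mag, floor * mag)  # spectral floor
          clean = clean_mag * np.exp(1j * np.angle(spectra))           # keep noisy phase
          out = np.zeros(len(frames) * hop + frame_len)
          for k, spec in enumerate(clean):                             # overlap-add
              out[k * hop:k * hop + frame_len] += np.fft.irfft(spec, frame_len)
          return out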

  18. Age-Related Differences in Lexical Access Relate to Speech Recognition in Noise

    Science.gov (United States)

    Carroll, Rebecca; Warzybok, Anna; Kollmeier, Birger; Ruigendijk, Esther

    2016-01-01

    Vocabulary size has been suggested as a useful measure of “verbal abilities” that correlates with speech recognition scores. Knowing more words is linked to better speech recognition. How vocabulary knowledge translates to general speech recognition mechanisms, how these mechanisms relate to offline speech recognition scores, and how they may be modulated by acoustical distortion or age, is less clear. Age-related differences in linguistic measures may predict age-related differences in speech recognition in noise performance. We hypothesized that speech recognition performance can be predicted by the efficiency of lexical access, which refers to the speed with which a given word can be searched and accessed relative to the size of the mental lexicon. We tested speech recognition in a clinical German sentence-in-noise test at two signal-to-noise ratios (SNRs), in 22 younger (18–35 years) and 22 older (60–78 years) listeners with normal hearing. We also assessed receptive vocabulary, lexical access time, verbal working memory, and hearing thresholds as measures of individual differences. Age group, SNR level, vocabulary size, and lexical access time were significant predictors of individual speech recognition scores, but working memory and hearing threshold were not. Interestingly, longer accessing times were correlated with better speech recognition scores. Hierarchical regression models for each subset of age group and SNR showed very similar patterns: the combination of vocabulary size and lexical access time contributed most to speech recognition performance; only for the younger group at the better SNR (yielding about 85% correct speech recognition) did vocabulary size alone predict performance. Our data suggest that successful speech recognition in noise is mainly modulated by the efficiency of lexical access. This suggests that older adults’ poorer performance in the speech recognition task may have arisen from reduced efficiency in lexical access

  19. Overreliance on auditory feedback may lead to sound/syllable repetitions: simulations of stuttering and fluency-inducing conditions with a neural model of speech production

    Science.gov (United States)

    Civier, Oren; Tasko, Stephen M.; Guenther, Frank H.

    2010-01-01

    This paper investigates the hypothesis that stuttering may result in part from impaired readout of feedforward control of speech, which forces persons who stutter (PWS) to produce speech with a motor strategy that is weighted too much toward auditory feedback control. Over-reliance on feedback control leads to production errors which, if they grow large enough, can cause the motor system to “reset” and repeat the current syllable. This hypothesis is investigated using computer simulations of a “neurally impaired” version of the DIVA model, a neural network model of speech acquisition and production. The model’s outputs are compared to published acoustic data from PWS’ fluent speech, and to combined acoustic and articulatory movement data collected from the dysfluent speech of one PWS. The simulations mimic the errors observed in the PWS subject’s speech, as well as the repairs of these errors. Additional simulations were able to account for enhancements of fluency gained by slowed/prolonged speech and masking noise. Together these results support the hypothesis that many dysfluencies in stuttering are due to a bias away from feedforward control and toward feedback control. PMID:20831971

  20. Optimal subband Kalman filter for normal and oesophageal speech enhancement.

    Science.gov (United States)

    Ishaq, Rizwan; García Zapirain, Begoña

    2014-01-01

    This paper presents a single-channel speech enhancement system using subband Kalman filtering, estimating optimal autoregressive (AR) coefficients and variances for speech and noise using weighted linear prediction (WLP) and a noise weighting function (NWF). The system is applied to normal and oesophageal speech signals. The method is evaluated by the Perceptual Evaluation of Speech Quality (PESQ) score and signal-to-noise ratio (SNR) improvement for normal speech, and by the harmonic-to-noise ratio (HNR) for oesophageal speech (OES). Compared with previous systems, normal speech shows a 30% increase in PESQ score and a 4 dB SNR improvement, and OES shows a 3 dB HNR improvement.

  1. Audiovisual integration in speech perception: a multi-stage process

    DEFF Research Database (Denmark)

    Eskelund, Kasper; Tuomainen, Jyrki; Andersen, Tobias

    2011-01-01

    investigate whether the integration of auditory and visual speech observed in these two audiovisual integration effects are specific traits of speech perception. We further ask whether audiovisual integration is undertaken in a single processing stage or multiple processing stages.......Integration of speech signals from ear and eye is a well-known feature of speech perception. This is evidenced by the McGurk illusion in which visual speech alters auditory speech perception and by the advantage observed in auditory speech detection when a visual signal is present. Here we...

  2. Signals and Systems in Biomedical Engineering Signal Processing and Physiological Systems Modeling

    CERN Document Server

    Devasahayam, Suresh R

    2013-01-01

    The use of digital signal processing is ubiquitous in the field of physiology and biomedical engineering. The application of such mathematical and computational tools requires a formal or explicit understanding of physiology. Formal models and analytical techniques are interlinked in physiology as in any other field. This book takes a unitary approach to physiological systems, beginning with signal measurement and acquisition, followed by signal processing, linear systems modelling, and computer simulations. The signal processing techniques range across filtering, spectral analysis and wavelet analysis. Emphasis is placed on fundamental understanding of the concepts as well as solving numerical problems. Graphs and analogies are used extensively to supplement the mathematics. Detailed models of nerve and muscle at the cellular and systemic levels provide examples for the mathematical methods and computer simulations. Several of the models are sufficiently sophisticated to be of value in understanding real wor...

  3. Modeling of Nonlinear Signal Distortion in Fiber-Optical Networks

    CERN Document Server

    Johannisson, Pontus

    2013-01-01

    A low-complexity model for signal quality prediction in a nonlinear fiber-optical network is developed. The model, which builds on the Gaussian noise model, takes into account the signal degradation caused by a combination of chromatic dispersion, nonlinear signal distortion, and amplifier noise. The center frequencies, bandwidths, and transmit powers can be chosen independently for each channel, which makes the model suitable for analysis and optimization of resource allocation, routing, and scheduling in large-scale optical networks applying flexible-grid wavelength-division multiplexing.

  4. Prediction of the influence of reverberation on binaural speech intelligibility in noise and in quiet.

    Science.gov (United States)

    Rennies, Jan; Brand, Thomas; Kollmeier, Birger

    2011-11-01

    Reverberation usually degrades speech intelligibility for spatially separated speech and noise sources since spatial unmasking is reduced and late reflections decrease the fidelity of the received speech signal. The latter effect could not satisfactorily be predicted by a recently presented binaural speech intelligibility model [Beutelmann et al. (2010). J. Acoust. Soc. Am. 127, 2479-2497]. This study therefore evaluated three extensions of the model to improve its predictions: (1) an extension of the speech intelligibility index based on modulation transfer functions, (2) a correction factor based on the room acoustical quantity "definition," and (3) a separation of the speech signal into useful and detrimental parts. The predictions were compared to results of two experiments in which speech reception thresholds were measured in a reverberant room in quiet and in the presence of a noise source for listeners with normal hearing. All extensions yielded better predictions than the original model when the influence of reverberation was strong, while predictions were similar for conditions with less reverberation. Although model (3) differed substantially in the assumed interaction of binaural processing and early reflections, its predictions were very similar to those of model (2), which achieved the best fit to the data.

  5. Implementation of a Tour Guide Robot System Using RFID Technology and Viterbi Algorithm-Based HMM for Speech Recognition

    Directory of Open Access Journals (Sweden)

    Neng-Sheng Pai

    2014-01-01

    Full Text Available This paper applied speech recognition and RFID technologies to develop an omni-directional mobile robot into a robot with voice control and guide introduction functions. For speech recognition, the speech signals were captured by short-time processing. The speaker first recorded the isolated words for the robot to create a speech database of specific speakers. After pre-processing of this speech database, the feature parameters of cepstrum and delta-cepstrum were obtained using linear predictive coding (LPC). Then, a Hidden Markov Model (HMM) was used for model training on the speech database, and the Viterbi algorithm was used to find an optimal state sequence as the reference sample for speech recognition. The trained reference models were put into the industrial computer on the robot platform, and the user uttered the isolated words to be tested. After the test utterance was processed in the same way and compared with the stored reference models, the model whose Viterbi path gave the maximum total probability was taken as the recognition result. Finally, the speech recognition and RFID systems were tested in an actual environment to prove their feasibility and stability, and were implemented in the omni-directional mobile robot.
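
    The Viterbi decoding step at the heart of this recognizer is compact enough to show in full. The sketch below finds the most likely state path through one HMM given per-frame log observation likelihoods; the matrix shapes are assumptions for illustration.

      import numpy as np

      def viterbi(log_pi, log_A, log_B):
          # log_pi: (S,) initial log probs; log_A: (S, S) log transitions;
          # log_B: (T, S) per-frame log observation likelihoods.
          T, S = log_B.shape
          delta = log_pi + log_B[0]
          back = np.zeros((T, S), dtype=int)
          for t in range(1, T):
              scores = delta[:, None] + log_A      # scores[i, j]: from state i to j
              back[t] = scores.argmax(axis=0)
              delta = scores.max(axis=0) + log_B[t]
          path = [int(delta.argmax())]
          for t in range(T - 1, 0, -1):            # backtrack the best path
              path.append(int(back[t, path[-1]]))
          return path[::-1], float(delta.max())

    In isolated-word recognition, a routine like this would be run once per trained word model, and the word whose model yields the highest total log probability is returned as the result.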

  6. A 3D biomechanical vocal tract model to study speech production control: How to take into account the gravity?

    CERN Document Server

    Buchaillard, Stéphanie; Perrier, Pascal; Payan, Yohan

    2007-01-01

    This paper presents a modeling study of the way speech motor control can deal with gravity to achieve steady-state tongue positions. It is based on simulations carried out with the 3D biomechanical tongue model developed at ICP, which is now controlled with the Lambda model (Equilibrium-Point Hypothesis). The influence of short-delay orosensory feedback on posture stability is assessed by testing different muscle force/muscle length relationships (Invariant Characteristics). Muscle activation patterns necessary to maintain the tongue in a schwa position are proposed, and the relations of head position, tongue shape and muscle activations are analyzed.

  7. MPD model for radar echo signal of hypersonic targets

    Directory of Open Access Journals (Sweden)

    Xu Xuefei

    2014-08-01

    Full Text Available The stop-and-go (SAG) model is typically used for the echo signal received by radar using linear frequency modulation pulse compression. In this study, the authors demonstrate that this model is not applicable to hypersonic targets. Instead of the SAG model, they present a more realistic echo signal model for hypersonic targets, the moving-in-pulse duration (MPD) model. Following that, they evaluate the performance of pulse compression under the SAG and MPD models by theoretical analysis and simulations. They found that the pulse compression gain increases by 3 dB when using the MPD model instead of the SAG model in typical cases.

  8. Predicting Speech Intelligibility with a Multiple Speech Subsystems Approach in Children with Cerebral Palsy

    Science.gov (United States)

    Lee, Jimin; Hustad, Katherine C.; Weismer, Gary

    2014-01-01

    Purpose: Speech acoustic characteristics of children with cerebral palsy (CP) were examined with a multiple speech subsystems approach; speech intelligibility was evaluated using a prediction model in which acoustic measures were selected to represent three speech subsystems. Method: Nine acoustic variables reflecting different subsystems, and…

  9. Modeling admissible behavior using event signals.

    Science.gov (United States)

    Pinzon, Luz; Jafari, Mohsen A; Hanisch, Hans-Michael; Zhao, Peng

    2004-06-01

    We describe here how to obtain a model for the admissible behavior of a discrete event system that is represented by a safe Petri net (PN) model. The transitions of this PN model may be controllable or uncontrollable. Also given is a sequential specification which is modeled with a special state machine. Then, using the condition and event arcs of net condition/event systems, a combined model of plant and specification is obtained. We use only the structure of this combined model to develop a method which gives the admissible behavior of the system. Thus, we avoid the complexity of a complete state enumeration.

  10. 语音信号稀疏分解的FOA实现%Speech signal sparse decomposition with FOA

    Institute of Scientific and Technical Information of China (English)

    肖正安

    2013-01-01

    Sparse representation of signals has many important applications in signal processing, but the computational burden of sparse decomposition is so large that industrial application has been impractical. In this paper, the fruit fly optimization algorithm (FOA) is used to rapidly find the optimal atom at each step of matching pursuit (MP), which greatly improves the speed of sparse decomposition of speech signals. The validity of the algorithm is confirmed by experimental results.%信号的稀疏表示在信号处理的许多方面有着重要的应用,但稀疏分解计算量十分巨大,难以产业化应用。利用果蝇优化算法实现快速寻找匹配追踪(MP)过程每一步的最优原子,大大提高了语音信号稀疏分解的速度,算法的有效性为实验结果所证实。
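
    For context, plain matching pursuit is shown below; the FOA variant described in the abstract replaces the exhaustive inner-product argmax with a fruit-fly-optimization search over the dictionary but is otherwise the same greedy loop. The dictionary layout (unit-L2-norm atoms as rows) is an assumption for illustration.

      import numpy as np

      def matching_pursuit(x, dictionary, n_iter=50):
          # dictionary: (n_atoms, len(x)) array of unit-L2-norm atoms (rows).
          residual = x.astype(float).copy()
          picks, coefs = [], []
          for _ in range(n_iter):
              corr = dictionary @ residual         # inner products with residual
              k = int(np.abs(corr).argmax())       # FOA replaces this exhaustive argmax
              picks.append(k)
              coefs.append(corr[k])
              residual -= corr[k] * dictionary[k]  # greedy residual update
          return picks, coefs, residual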

  11. Modelling coloured residual noise in gravitational-wave signal processing

    Energy Technology Data Exchange (ETDEWEB)

    Roever, Christian [Max-Planck-Institut fuer Gravitationsphysik (Albert-Einstein-Institut) and Leibniz Universitaet Hannover, Hannover (Germany); Meyer, Renate [Department of Statistics, University of Auckland, Auckland (New Zealand); Christensen, Nelson, E-mail: christian.roever@aei.mpg.de [Physics and Astronomy, Carleton College, Northfield, MN (United States)

    2011-01-07

    We introduce a signal processing model for signals in non-white noise, where the exact noise spectrum is a priori unknown. The model is based on a Student's t distribution and constitutes a natural generalization of the widely used normal (Gaussian) model. This way, it allows for uncertainty in the noise spectrum, or more generally is also able to accommodate outliers (heavy-tailed noise) in the data. Examples are given pertaining to data from gravitational-wave detectors.

  12. ClassTalk system for predicting and visualizing speech in noise in classrooms

    Science.gov (United States)

    Hodgson, Murray

    2002-11-01

    This paper discusses the ClassTalk system for modeling, predicting and visualizing speech in noise in classrooms. Modeling involves defining the classroom geometry, sources, sound-absorbing features, and receiver positions. Empirical models used to predict speech and noise levels and reverberation times are described. Male or female speech sources, and overhead-, slide-, or LCD-projector or ventilation-outlet noise sources, can have four output levels; values are assigned based on ranges of values found in published data and measurements. ClassTalk visualizes the floor plan, speech- and noise-source positions, and the receiver position. The user can walk through the room at will. In real time, six quantities (background-noise level, speech level, signal-to-noise level difference, useful-to-detrimental energy fraction (U50), Speech Transmission Index, and speech intelligibility) are displayed, along with occupied and unoccupied reverberation times. An example of a large classroom before and after treatment is presented. The future development of improved prediction models and of the sound module, which will auralize speech in noise with reverberation, is discussed.

  13. Modeling laser velocimeter signals as triply stochastic Poisson processes

    Science.gov (United States)

    Mayo, W. T., Jr.

    1976-01-01

    Previous models of laser Doppler velocimeter (LDV) systems have not adequately described dual-scatter signals in a manner useful for analysis and simulation of low-level photon-limited signals. At low photon rates, an LDV signal at the output of a photomultiplier tube is a compound nonhomogeneous filtered Poisson process, whose intensity function is another (slower) Poisson process with the nonstationary rate and frequency parameters controlled by a random flow (slowest) process. In the present paper, generalized Poisson shot noise models are developed for low-level LDV signals. Theoretical results useful in detection error analysis and simulation are presented, along with measurements of burst amplitude statistics. Computer generated simulations illustrate the difference between Gaussian and Poisson models of low-level signals.

  14. Towards an auditory account of speech rhythm: application of a model of the auditory 'primal sketch' to two multi-language corpora.

    Science.gov (United States)

    Lee, Christopher S; Todd, Neil P McAngus

    2004-10-01

    The world's languages display important differences in their rhythmic organization; most particularly, different languages seem to privilege different phonological units (mora, syllable, or stress foot) as their basic rhythmic unit. There is now considerable evidence that such differences have important consequences for crucial aspects of language acquisition and processing. Several questions remain, however, as to what exactly characterizes the rhythmic differences, how they are manifested at an auditory/acoustic level and how listeners, whether adult native speakers or young infants, process rhythmic information. In this paper it is proposed that the crucial determinant of rhythmic organization is the variability in the auditory prominence of phonetic events. In order to test this auditory prominence hypothesis, an auditory model is run on two multi-language data-sets, the first consisting of matched pairs of English and French sentences, and the second consisting of French, Italian, English and Dutch sentences. The model is based on a theory of the auditory primal sketch, and generates a primitive representation of an acoustic signal (the rhythmogram) which yields a crude segmentation of the speech signal and assigns prominence values to the obtained sequence of events. Its performance is compared with that of several recently proposed phonetic measures of vocalic and consonantal variability.

  15. Novel Techniques for Dialectal Arabic Speech Recognition

    CERN Document Server

    Elmahdy, Mohamed; Minker, Wolfgang

    2012-01-01

    Novel Techniques for Dialectal Arabic Speech describes approaches to improve automatic speech recognition for dialectal Arabic. Since speech resources for dialectal Arabic speech recognition are very sparse, the authors describe how existing Modern Standard Arabic (MSA) speech data can be applied to dialectal Arabic speech recognition, while assuming that MSA is always a second language for all Arabic speakers. In this book, Egyptian Colloquial Arabic (ECA) has been chosen as a typical Arabic dialect. ECA is the first ranked Arabic dialect in terms of number of speakers, and a high quality ECA speech corpus with accurate phonetic transcription has been collected. MSA acoustic models were trained using news broadcast speech. In order to cross-lingually use MSA in dialectal Arabic speech recognition, the authors have normalized the phoneme sets for MSA and ECA. After this normalization, they have applied state-of-the-art acoustic model adaptation techniques like Maximum Likelihood Linear Regression (MLLR) and M...

  16. Beyond production: Brain responses during speech perception in adults who stutter

    Directory of Open Access Journals (Sweden)

    Tali Halag-Milo

    2016-01-01

    Full Text Available Developmental stuttering is a speech disorder that disrupts the ability to produce speech fluently. While stuttering is typically diagnosed based on one's behavior during speech production, some models suggest that it involves more central representations of language, and thus may affect language perception as well. Here we tested the hypothesis that developmental stuttering implicates neural systems involved in language perception, in a task that manipulates comprehensibility without an overt speech production component. We used functional magnetic resonance imaging to measure blood oxygenation level dependent (BOLD signals in adults who do and do not stutter, while they were engaged in an incidental speech perception task. We found that speech perception evokes stronger activation in adults who stutter (AWS compared to controls, specifically in the right inferior frontal gyrus (RIFG and in left Heschl's gyrus (LHG. Significant differences were additionally found in the lateralization of response in the inferior frontal cortex: AWS showed bilateral inferior frontal activity, while controls showed a left lateralized pattern of activation. These findings suggest that developmental stuttering is associated with an imbalanced neural network for speech processing, which is not limited to speech production, but also affects cortical responses during speech perception.

  17. Speech Indexing

    NARCIS (Netherlands)

    Ordelman, R.J.F.; Jong, de F.M.G.; Leeuwen, van D.A.; Blanken, H.M.; de Vries, A.P.; Blok, H.E.; Feng, L.

    2007-01-01

    This chapter will focus on the automatic extraction of information from the speech in multimedia documents. This approach is often referred to as speech indexing and it can be regarded as a subfield of audio indexing that also incorporates for example the analysis of music and sounds. If the objecti

  18. Plowing Speech

    OpenAIRE

    Zla ba sgrol ma

    2009-01-01

    This file contains a plowing speech and a discussion about the speech. The collection presents forty-nine audio files including several folk song genres, folktales, and local history from the Sman shad Valley of Sde dge county. World Oral Literature Project.

  1. Quantitative modelling in cognitive ergonomics: predicting signals passed at danger.

    Science.gov (United States)

    Moray, Neville; Groeger, John; Stanton, Neville

    2017-02-01

    This paper shows how to combine field observations, experimental data and mathematical modelling to produce quantitative explanations and predictions of complex events in human-machine interaction. As an example, we consider a major railway accident. In 1999, a commuter train passed a red signal near Ladbroke Grove, UK, into the path of an express. We use the Public Inquiry Report, 'black box' data, and accident and engineering reports to construct a case history of the accident. We show how to combine field data with mathematical modelling to estimate the probability that the driver observed and identified the state of the signals, and checked their status. Our methodology can explain the SPAD ('Signal Passed At Danger'), generate recommendations about signal design and placement, and provide quantitative guidance for the design of safer railway systems, speed limits and the location of signals. Practitioner Summary: Detailed ergonomic analysis of railway signals and rail infrastructure reveals problems of signal identification at this location. A record of driver eye movements measures attention, from which a quantitative model for signal placement and permitted speeds can be derived. The paper is an example of how to combine field data, basic research and mathematical modelling to solve ergonomic design problems.

  2. STUDY ON PHASE PERCEPTION IN SPEECH

    Institute of Scientific and Technical Information of China (English)

    Tong Ming; Bian Zhengzhong; Li Xiaohui; Dai Qijun; Chen Yanpu

    2003-01-01

    The perceptual effect of the phase information in speech has been studied by auditory subjective tests. On the condition that the phase spectrum in speech is changed while the amplitude spectrum is unchanged, the tests show that: (1) if the envelope of the reconstructed speech signal is unchanged, there is no perceptible difference between the original speech and the reconstructed speech; (2) the auditory perceptual effect of the reconstructed speech mainly depends on the amplitude of the derivative of the additive phase; (3) with td denoting the maximum relative time shift between different frequency components of the reconstructed speech signal, the speech quality is excellent while td < 10 ms, good while 10 ms < td < 20 ms, fair while 20 ms < td < 35 ms, and poor while td > 35 ms.
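
    A manipulation of this kind, changing phase while preserving the magnitude spectrum, can be sketched with an FFT: delay each frequency component by a frequency-dependent amount whose maximum spread is td. The linear delay-versus-frequency profile below is an illustrative choice, not the profile used in the tests.

      import numpy as np

      def apply_dispersive_phase(x, fs, td_ms):
          # Delay grows linearly from 0 s at DC to td at the Nyquist frequency;
          # the magnitude spectrum is untouched.
          X = np.fft.rfft(x)
          f = np.fft.rfftfreq(len(x), 1.0 / fs)
          delay = (td_ms / 1000.0) * f / f[-1]
          return np.fft.irfft(X * np.exp(-2j * np.pi * f * delay), len(x))

    Per the findings above, outputs with td below roughly 10 ms should be indistinguishable from the input, with quality degrading progressively for larger maximum relative shifts.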

  3. Cumulative Semantic Interference Is Blind to Language: Implications for Models of Bilingual Speech Production

    Science.gov (United States)

    Runnqvist, Elin; Strijkers, Kristof; Alario, F.-Xavier; Costa, Albert

    2012-01-01

    Several studies have shown that concepts spread activation to words of both of a bilingual's languages. Therefore, a central issue that needs to be clarified is how a bilingual manages to restrict his speech production to a single language. One influential proposal is that when speaking in one language, the other language is inhibited. An…

  4. Facilitating Emergent Literacy: Efficacy of a Model that Partners Speech-Language Pathologists and Educators

    Science.gov (United States)

    Girolametto, Luigi; Weitzman, Elaine; Greenberg, Janice

    2012-01-01

    Purpose: This study examined the efficacy of a professional development program for early childhood educators that facilitated emergent literacy skills in preschoolers. The program, led by a speech-language pathologist, focused on teaching alphabet knowledge, print concepts, sound awareness, and decontextualized oral language within naturally…

  5. Modeling the Effects of the Local Environment on a Received GNSS Signal

    Science.gov (United States)

    2012-12-18

  6. Modelling of Signal - Level Crossing System

    Directory of Open Access Journals (Sweden)

    Daniel Novak

    2006-01-01

    Full Text Available The author presents an object-oriented model of a railway level-crossing system created for the purpose of functional requirements specification. Unified Modelling Language (UML, version 1.4), which enables specification, visualisation, construction and documentation of software system artefacts, was used. The main attention was paid to the analysis and design phases. The former phase resulted in the creation of use case diagrams and sequence diagrams, the latter in the creation of class/object diagrams and statechart diagrams.

  7. Modelling and Analysis of Biochemical Signalling Pathway Cross-talk

    CERN Document Server

    Donaldson, Robin; 10.4204/EPTCS.19.3

    2010-01-01

    Signalling pathways are abstractions that help life scientists structure the coordination of cellular activity. Cross-talk between pathways accounts for many of the complex behaviours exhibited by signalling pathways and is often critical in producing the correct signal-response relationship. Formal models of signalling pathways and cross-talk in particular can aid understanding and drive experimentation. We define an approach to modelling based on the concept that a pathway is the (synchronising) parallel composition of instances of generic modules (with internal and external labels). Pathways are then composed by (synchronising) parallel composition and renaming; different types of cross-talk result from different combinations of synchronisation and renaming. We define a number of generic modules in PRISM and five types of cross-talk: signal flow, substrate availability, receptor function, gene expression and intracellular communication. We show that Continuous Stochastic Logic properties can both detect an...

  8. Sensorimotor Interactions in Speech Learning

    Directory of Open Access Journals (Sweden)

    Douglas M Shiller

    2011-10-01

    Full Text Available Auditory input is essential for normal speech development and plays a key role in speech production throughout the life span. In traditional models, auditory input plays two critical roles: (1) establishing the acoustic correlates of speech sounds that serve, in part, as the targets of speech production, and (2) serving as a source of feedback about a talker's own speech outcomes. This talk will focus on both of these roles, describing a series of studies that examine the capacity of children and adults to adapt to real-time manipulations of auditory feedback during speech production. In one study, we examined sensory and motor adaptation to a manipulation of auditory feedback during production of the fricative "s". In contrast to prior accounts, adaptive changes were observed not only in speech motor output but also in subjects' perception of the sound. In a second study, speech adaptation was examined following a period of auditory-perceptual training targeting the perception of vowels. The perceptual training was found to systematically improve subjects' motor adaptation response to altered auditory feedback during speech production. The results of both studies support the idea that perceptual and motor processes are tightly coupled in speech production learning, and that the degree and nature of this coupling may change with development.

  9. An accurate and simple large signal model of HEMT

    DEFF Research Database (Denmark)

    Liu, Qing

    1989-01-01

    A large-signal model of discrete HEMTs (high-electron-mobility transistors) has been developed. It is simple and suitable for SPICE simulation of hybrid digital ICs. The model parameters are extracted by using computer programs and data provided by the manufacturer. Based on this model, a hybrid...

  10. Semiconductor Modeling For Simulating Signal, Power, and Electromagneticintegrity

    CERN Document Server

    Leventhal, Roy

    2006-01-01

    Assists engineers in designing high-speed circuits. The emphasis is on semiconductor modeling, with PCB transmission line effects, equipment enclosure effects, and other modeling issues discussed as needed. This text addresses practical considerations, including process variation, model accuracy, validation and verification, and signal integrity.

  11. Multiscale adaptive basis function modeling of spatiotemporal vectorcardiogram signals.

    Science.gov (United States)

    Gang Liu; Hui Yang

    2013-03-01

    Mathematical modeling of cardiac electrical signals facilitates the simulation of realistic cardiac electrical behaviors, the evaluation of algorithms, and the characterization of underlying space-time patterns. However, there are practical issues pertinent to model efficacy, robustness, and generality. This paper presents a multiscale adaptive basis function modeling approach to characterize not only temporal but also spatial behaviors of vectorcardiogram (VCG) signals. Model parameters are adaptively estimated by the "best matching" projections of VCG characteristic waves onto a dictionary of nonlinear basis functions. The model performance is experimentally evaluated with respect to the number of basis functions, different types of basis function (i.e., Gaussian, Mexican hat, customized wavelet, and Hermitian wavelets), and various cardiac conditions, including 80 healthy controls and different myocardial infarctions (i.e., 89 inferior, 77 anterior-septal, 56 inferior-lateral, 47 anterior, and 43 anterior-lateral). Multiway analysis of variance shows that the basis function and the model complexity have significant effects on model performance, while cardiac conditions do not. The customized wavelet is found to be an optimal basis function for the modeling of space-time VCG signals. The comparison of QT intervals shows small relative errors between model representations and real-world VCG signals when the model complexity is greater than 10. The proposed model shows great potential to model space-time cardiac pathological behaviors and can lead to benefits in feature extraction, data compression, algorithm evaluation, and disease prognostics.

  12. A new frequency scale of Chinese whispered speech in the application of speaker identification

    Institute of Scientific and Technical Information of China (English)

    LIN Wei; YANG Lili; XU Boling

    2006-01-01

    In this paper, the frequency characteristics of Chinese whispered speech were investigated by filter bank analysis. It was shown that the first and third formants are more important than the other formants for speaker identification in Chinese whispered speech. The experiment showed that the 800-1200 Hz and 2800-3200 Hz ranges were the most significant frequency ranges for discriminating the speaker. Based on this result, a new feature scale named the whisper sensitive scale (WSS) was proposed to replace the common Mel scale and to extract cepstral coefficients from the whispered speech signal. Furthermore, a speaker identification system for whispered speech was presented based on modified hidden Markov models integrating the advantages of the WSCC (whisper sensitive cepstral coefficient) and LPCC. The new system performed better on the problem of speaker identification in Chinese whispered speech than the traditional method.

  13. A simple statistical signal loss model for deep underground garage

    DEFF Research Database (Denmark)

    Nguyen, Huan Cong; Gimenez, Lucas Chavarria; Kovacs, Istvan

    2016-01-01

    In this paper we address the channel modeling aspects for a deep-indoor scenario with extreme coverage conditions in terms of signal losses, namely underground garage areas. We provide an in-depth analysis in terms of path loss (gain) and large-scale signal shadowing, and propose a simple propagation model which can be used to predict cellular signal levels in similar deep-indoor scenarios. The proposed frequency-independent floor attenuation factor (FAF) is shown to be in the range of 5.2 dB per meter of depth.
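
    A minimal sketch of such a model, assuming the 5.2 dB-per-meter figure quoted above is simply added on top of an outdoor reference path loss (the reference value below is invented for illustration):

      def deep_indoor_path_loss(pl_reference_db, depth_m, faf_db_per_m=5.2):
          # Reference (above-ground) path loss plus a frequency-independent
          # floor attenuation factor of ~5.2 dB per meter of depth (abstract).
          return pl_reference_db + faf_db_per_m * depth_m

      # e.g. a garage level 3 m below ground adds about 15.6 dB of extra loss
      print(deep_indoor_path_loss(95.0, 3.0))   # -> 110.6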

  14. Bayesian hierarchical modeling for detecting safety signals in clinical trials.

    Science.gov (United States)

    Xia, H Amy; Ma, Haijun; Carlin, Bradley P

    2011-09-01

    Detection of safety signals from clinical trial adverse event data is critical in drug development, but carries a challenging statistical multiplicity problem. Bayesian hierarchical mixture modeling is appealing for its ability to borrow strength across subgroups in the data, as well as to moderate extreme findings most likely due merely to chance. We implement such a model for subject incidence (Berry and Berry, 2004) using a binomial likelihood, and extend it to subject-year adjusted incidence rate estimation under a Poisson likelihood. We use simulation to choose a signal detection threshold, and illustrate some effective graphics for displaying the flagged signals.

  15. THE SIGNAL APPROACH TO MODELLING THE BALANCE OF PAYMENT CRISIS

    Directory of Open Access Journals (Sweden)

    O. Chernyak

    2016-12-01

    Full Text Available The paper considers and presents a synthesis of theoretical models of balance-of-payments crises and investigates the most effective ways to model such crises in Ukraine. For the mathematical formalization of balance-of-payments crises, a comparative analysis of the effectiveness of different calculation methods for the Exchange Market Pressure Index was performed. A set of indicators that signal a growing likelihood of a balance-of-payments crisis was defined using the signal approach. Threshold values of the indicators were selected with the help of a minimization function; crossing them signals an increased probability of a balance-of-payments crisis.
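
    As a sketch of the mechanics: the paper compares several weighting schemes, and the inverse-standard-deviation weighting and threshold rule below are one common variant of the signal approach, not necessarily the authors' final choice.

      import numpy as np

      def emp_index(d_fx, d_reserves, d_rate):
          # Precision-weighted EMP: currency depreciation and interest-rate
          # rises add pressure, reserve gains relieve it; each component is
          # weighted by the inverse of its standard deviation.
          comps = [np.asarray(d_fx), -np.asarray(d_reserves), np.asarray(d_rate)]
          return sum(c / c.std() for c in comps)

      def crisis_signals(emp, k=1.5):
          # Signal approach: flag periods where EMP exceeds mean + k * std;
          # k would be tuned by minimizing a noise-to-signal ratio.
          return emp > emp.mean() + k * emp.std()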

  16. Model-based Bayesian signal extraction algorithm for peripheral nerves

    Science.gov (United States)

    Eggers, Thomas E.; Dweiri, Yazan M.; McCallum, Grant A.; Durand, Dominique M.

    2017-10-01

    Objective. Multi-channel cuff electrodes have recently been investigated for extracting fascicular-level motor commands from mixed neural recordings. Such signals could provide volitional, intuitive control over a robotic prosthesis for amputee patients. Recent work has demonstrated success in extracting these signals in acute and chronic preparations using spatial filtering techniques. These extracted signals, however, had low signal-to-noise ratios and thus limited their utility to binary classification. In this work a new algorithm is proposed which combines previous source localization approaches to create a model based method which operates in real time. Approach. To validate this algorithm, a saline benchtop setup was created to allow the precise placement of artificial sources within a cuff and interference sources outside the cuff. The artificial source was taken from five seconds of chronic neural activity to replicate realistic recordings. The proposed algorithm, hybrid Bayesian signal extraction (HBSE), is then compared to previous algorithms, beamforming and a Bayesian spatial filtering method, on this test data. An example chronic neural recording is also analyzed with all three algorithms. Main results. The proposed algorithm improved the signal to noise and signal to interference ratio of extracted test signals two to three fold, as well as increased the correlation coefficient between the original and recovered signals by 10-20%. These improvements translated to the chronic recording example and increased the calculated bit rate between the recovered signals and the recorded motor activity. Significance. HBSE significantly outperforms previous algorithms in extracting realistic neural signals, even in the presence of external noise sources. These results demonstrate the feasibility of extracting dynamic motor signals from a multi-fascicled intact nerve trunk, which in turn could extract motor command signals from an amputee for the end goal of

  17. Acoustic Signal Processing

    Science.gov (United States)

    Hartmann, William M.; Candy, James V.

    Signal processing refers to the acquisition, storage, display, and generation of signals, as well as to the extraction of information from signals and the re-encoding of information. As such, signal processing in some form is an essential element in the practice of all aspects of acoustics. Signal processing algorithms enable acousticians to separate signals from noise, to perform automatic speech recognition, or to compress information for more efficient storage or transmission. Signal processing concepts are the building blocks used to construct models of speech and hearing. Now, in the 21st century, all signal processing is effectively digital signal processing. Widespread access to high-speed processing, massive memory, and inexpensive software makes signal processing procedures of enormous sophistication and power available to anyone who wants to use them. Because advanced signal processing is now accessible to everybody, there is a need for primers that introduce the basic mathematical concepts that underlie the digital algorithms. The present handbook chapter is intended to serve such a purpose.

  18. Objective speech quality measures for Internet telephony

    Science.gov (United States)

    Hall, Timothy A.

    2001-07-01

    Measuring voice quality for telephony is not a new problem. However, packet-switched, best-effort networks such as the Internet present significant new challenges for the delivery of real-time voice traffic. Unlike the circuit-switched PSTN, Internet protocol (IP) networks guarantee neither sufficient bandwidth for the voice traffic nor a constant, minimal delay. Dropped packets and varying delays introduce distortions not found in traditional telephony. In addition, if a low bitrate codec is used in voice over IP (VoIP) to achieve a high compression ratio, the original waveform can be significantly distorted. These new potential sources of signal distortion present significant challenges for objectively measuring speech quality. Measurement techniques designed for the PSTN may not perform well in VoIP environments. Our objective is to find a speech quality metric that accurately predicts subjective human perception under the conditions present in VoIP systems. To do this, we compared three types of measures: perceptually weighted distortion measures such as enhanced modified Bark spectral distance (EMBSD) and measuring normalizing blocks (MNB), word-error rates of continuous speech recognizers, and the ITU E-model. We tested the performance of these measures under conditions typical of a VoIP system. We found that the E-model had the highest correlation with mean opinion scores (MOS). The E-model is well-suited for online monitoring because it does not use the original (undistorted) signal to compute its quality metric and because it is computationally simple.
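
    To make the E-model's appeal concrete: it computes a scalar rating R from measurable impairments, with no reference signal required, and then maps R to an estimated MOS. The R-to-MOS mapping below follows the standard ITU-T G.107 conversion; the simplified R computation is an illustrative assumption, since the full E-model includes many more impairment terms.

    # Estimate a MOS from network/codec parameters alone (no original signal).
    def r_to_mos(r: float) -> float:
        """ITU-T G.107 mapping from R-factor to estimated MOS."""
        if r < 0:
            return 1.0
        if r > 100:
            return 4.5
        return 1.0 + 0.035 * r + 7e-6 * r * (r - 60.0) * (100.0 - r)

    def simple_r(delay_ms: float, codec_imp: float, loss_imp: float) -> float:
        # Toy simplification: start from the default R of about 93.2 and
        # subtract a delay penalty plus equipment-impairment (Ie-style) terms.
        id_delay = 0.024 * delay_ms + max(0.0, 0.11 * (delay_ms - 177.3))
        return 93.2 - id_delay - codec_imp - loss_imp

    print(r_to_mos(simple_r(delay_ms=150, codec_imp=10, loss_imp=5)))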

  19. Speech Compression Using Multecirculerletet Transform

    Directory of Open Access Journals (Sweden)

    Sulaiman Murtadha

    2012-01-01

    Compressing speech reduces data storage requirements and thus the time needed to transmit digitized speech over long-haul links such as the Internet. To obtain the best performance in speech compression, wavelet transforms require filters that combine a number of desirable properties, such as orthogonality and symmetry. The MCT basis functions are derived from the GHM basis functions using 2D linear convolution. The fast computation algorithms introduced here add desirable features to the current transform. We further assess the performance of the MCT in a speech compression application. This paper discusses the effect of using the DWT and the MCT (in one and two dimensions) on speech compression. DWT and MCT performance is assessed in terms of compression ratio (CR), mean square error (MSE), and peak signal-to-noise ratio (PSNR). Computer simulation results indicate that the two-dimensional MCT offers a better compression ratio, MSE, and PSNR than the DWT.
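
    A minimal sketch of the kind of wavelet-compression experiment the paper reports, using the standard 1-D DWT from PyWavelets: the MCT itself is not publicly packaged, so this illustrates only the DWT baseline and the CR/MSE/PSNR metrics. The wavelet choice, decomposition depth, and 90th-percentile threshold are assumptions.

    # Threshold DWT coefficients of a stand-in "speech" frame, reconstruct,
    # and report compression ratio, MSE, and PSNR.
    import numpy as np
    import pywt

    rng = np.random.default_rng(0)
    speech = np.cumsum(rng.normal(size=4096))        # stand-in speech frame

    coeffs = pywt.wavedec(speech, "db4", level=4)    # analysis
    flat = np.concatenate(coeffs)
    thresh = np.quantile(np.abs(flat), 0.90)         # keep largest 10%
    kept = [np.where(np.abs(c) >= thresh, c, 0.0) for c in coeffs]
    recon = pywt.waverec(kept, "db4")[: len(speech)] # synthesis

    cr = flat.size / np.count_nonzero(np.concatenate(kept))
    mse = np.mean((speech - recon) ** 2)
    psnr = 10 * np.log10(np.max(np.abs(speech)) ** 2 / mse)
    print(f"CR={cr:.1f}, MSE={mse:.4f}, PSNR={psnr:.1f} dB")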

  20. Discrete dynamic modeling of T cell survival signaling networks

    Science.gov (United States)

    Zhang, Ranran

    2009-03-01

    Biochemistry-based frameworks are often not applicable for the modeling of heterogeneous regulatory systems that are sparsely documented in terms of quantitative information. As an alternative, qualitative models assuming a small set of discrete states are gaining acceptance. This talk will present a discrete dynamic model of the signaling network responsible for the survival and long-term competence of cytotoxic T cells in the blood cancer T-LGL leukemia. We integrated the signaling pathways involved in normal T cell activation and the known deregulations of survival signaling in leukemic T-LGL, and formulated the regulation of each network element as a Boolean (logic) rule. Our model suggests that the persistence of two signals is sufficient to reproduce all known deregulations in leukemic T-LGL. It also indicates the nodes whose inactivity is necessary and sufficient for the reversal of the T-LGL state. We have experimentally validated several model predictions, including: (i) Inhibiting PDGF signaling induces apoptosis in leukemic T-LGL. (ii) Sphingosine kinase 1 and NFκB are essential for the long-term survival of T cells in T-LGL leukemia. (iii) T box expressed in T cells (T-bet) is constitutively activated in the T-LGL state. The model has identified potential therapeutic targets for T-LGL leukemia and can be used for generating the long-term competent CTL necessary for tumor and cancer vaccine development. The success of this model, and of other discrete dynamic models, suggests that the organization of signaling networks has a determining role in their dynamics. Reference: R. Zhang, M. V. Shah, J. Yang, S. B. Nyland, X. Liu, J. K. Yun, R. Albert, T. P. Loughran, Jr., Network Model of Survival Signaling in LGL Leukemia, PNAS 105, 16308-16313 (2008).
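
    A minimal sketch of the discrete (Boolean) dynamic modeling approach described here: each node is ON or OFF and is updated by a logic rule over its regulators. The four-node network and its rules below are toy assumptions chosen to show a self-maintaining survival state, not the actual T-LGL model.

    # Synchronous Boolean network update: all nodes updated at once from the
    # current state. A feedback loop keeps SurvivalSignal ON even after the
    # external stimulus is removed -- the kind of persistent deregulated
    # state the T-LGL model predicts and targets for reversal.
    def step(state):
        return {
            "Stimulus": state["Stimulus"],                  # external input
            "SurvivalSignal": state["Stimulus"] or state["FeedbackLoop"],
            "FeedbackLoop": state["SurvivalSignal"],        # self-sustaining
            "Apoptosis": not state["SurvivalSignal"],
        }

    state = {"Stimulus": True, "SurvivalSignal": False,
             "FeedbackLoop": False, "Apoptosis": False}
    for _ in range(3):
        state = step(state)
    state["Stimulus"] = False          # remove the external stimulus
    for _ in range(3):
        state = step(state)
    print(state)                       # SurvivalSignal stays ON via the loop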