WorldWideScience

Sample records for automated speech recognition

  1. Speech Recognition

    Directory of Open Access Journals (Sweden)

    Adrian Morariu

    2009-01-01

Full Text Available This paper presents a method of speech recognition by pattern recognition techniques. Learning consists in determining the unique characteristics of a word (cepstral coefficients) by eliminating those characteristics that are different from one word to another. For learning and recognition, the system builds a dictionary of words by determining the characteristics of each word to be used in recognition. Determining the characteristics of an audio signal consists of the following steps: noise removal, sampling, applying a Hamming window, switching to the frequency domain through the Fourier transform, calculating the magnitude spectrum, filtering the data, and determining the cepstral coefficients.
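
The per-frame processing chain described in this abstract (Hamming window, Fourier transform, magnitude spectrum, cepstral coefficients) can be illustrated roughly as follows. This is a minimal NumPy sketch, not the author's implementation; the 16 kHz sample rate, 25 ms frame length, and 13 retained coefficients are assumptions.

```python
import numpy as np

def cepstral_features(frame, n_ceps=13):
    """Rough sketch of the per-frame pipeline described above:
    Hamming window -> FFT -> magnitude spectrum -> log -> cepstrum."""
    windowed = frame * np.hamming(len(frame))   # Hamming window
    spectrum = np.abs(np.fft.rfft(windowed))    # magnitude spectrum
    log_spectrum = np.log(spectrum + 1e-10)     # avoid log(0)
    cepstrum = np.fft.irfft(log_spectrum)       # real cepstrum
    return cepstrum[:n_ceps]                    # keep low-order coefficients

# Example: one 25 ms frame of a stand-in signal (real use: recorded audio)
signal = np.random.randn(16000)                 # assumed 1 s at 16 kHz
frame = signal[:400]                            # 25 ms at 16 kHz
print(cepstral_features(frame).shape)           # (13,)
```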

  2. Speech activity detection for the automated speaker recognition system of critical use

    Directory of Open Access Journals (Sweden)

    M. M. Bykov

    2017-06-01

Full Text Available In this article, the authors develop a speech activity detection method for an automated speaker recognition system of critical use, with wavelet parameterization of the speech signal and classification of "speech"/"pause" intervals using a curvilinear neural network. The wavelet parameterization method proposed by the authors allows choosing the optimal parameters of the wavelet transformation in accordance with a user-specified error of representation of the speech signal. The method also allows estimating the loss of information depending on the selected parameters of the continuous wavelet transformation (CWT), which made it possible to reduce the number of scale coefficients of the CWT of the speech signal by an order of magnitude with an acceptable degree of distortion of the local CWT spectrum. An algorithm for detecting speech activity with a curvilinear neural network classifier is also proposed; it shows high-quality segmentation of speech signals into "speech"/"pause" intervals and is resistant to narrowband and technogenic noise in the speech signal due to the inherent properties of the curvilinear neural network.

  3. DEVELOPMENT OF AUTOMATED SPEECH RECOGNITION SYSTEM FOR EGYPTIAN ARABIC PHONE CONVERSATIONS

    Directory of Open Access Journals (Sweden)

    A. N. Romanenko

    2016-07-01

Full Text Available The paper deals with the description of several speech recognition systems for Egyptian Colloquial Arabic. The research is based on the CALLHOME Egyptian corpus. Both systems are described: a classic one, based on Hidden Markov and Gaussian Mixture Models, and a state-of-the-art one based on deep neural network acoustic models. We have demonstrated the contribution from the usage of speaker-dependent bottleneck features; for their extraction, three extractors based on neural networks were trained. For their training, three datasets in several languages were used: Russian, English and different Arabic dialects. We have studied the possibility of applying a small Modern Standard Arabic (MSA) corpus to derive phonetic transcriptions. The experiments have shown that application of the extractor obtained on the basis of the Russian dataset makes it possible to significantly increase the quality of Arabic speech recognition. We have also found that the usage of phonetic transcriptions based on Modern Standard Arabic decreases recognition quality. Nevertheless, system operation results remain applicable in practice. In addition, we have carried out a study of the application of the obtained models to the keyword search problem. The systems obtained demonstrate good results as compared to those published before. Some ways to improve speech recognition are offered.

  4. Development of an automated speech recognition interface for personal emergency response systems

    Directory of Open Access Journals (Sweden)

    Mihailidis Alex

    2009-07-01

Full Text Available Abstract Background Demands on long-term-care facilities are predicted to increase at an unprecedented rate as the baby boomer generation reaches retirement age. Aging-in-place (i.e. aging at home) is the desire of most seniors and is also a good option to reduce the burden on an over-stretched long-term-care system. Personal Emergency Response Systems (PERSs) help enable older adults to age-in-place by providing them with immediate access to emergency assistance. Traditionally they operate with push-button activators that connect the occupant via speaker-phone to a live emergency call-centre operator. If occupants do not wear the push button or cannot access the button, then the system is useless in the event of a fall or emergency. Additionally, a false alarm or failure to check in at a regular interval will trigger a connection to a live operator, which can be unwanted and intrusive to the occupant. This paper describes the development and testing of an automated, hands-free, dialogue-based PERS prototype. Methods The prototype system was built using a ceiling-mounted microphone array, an open-source automatic speech recognition engine, and a 'yes' and 'no' response dialogue modelled after an existing call-centre protocol. Testing compared a single microphone versus a microphone array with nine adults in both noisy and quiet conditions. Dialogue testing was completed with four adults. Results and discussion The microphone array demonstrated improvement over the single microphone. In all cases, dialogue testing resulted in the system reaching the correct decision about the kind of assistance the user was requesting. Further testing is required with elderly voices and under different noise conditions to ensure the appropriateness of the technology. Future developments include integration of the system with an emergency detection method as well as communication enhancement using features such as barge-in capability. Conclusion The use of an automated

  5. Automated recognition of helium speech. Phase I: Investigation of microprocessor based analysis/synthesis system

    Science.gov (United States)

    Jelinek, H. J.

    1986-01-01

This is the Final Report of Electronic Design Associates on its Phase I SBIR project. The purpose of this project is to develop a method for correcting helium speech, as experienced in diver-surface communication. The goal of the Phase I study was to design, prototype, and evaluate a real-time helium speech corrector system based upon digital signal processing techniques. The general approach was to develop hardware (an IBM PC board) to digitize helium speech and software (a LAMBDA computer based simulation) to translate the speech. As planned in the study proposal, this initial prototype may now be used to assess expected performance from a self-contained real-time system which uses an identical algorithm. The Final Report details the work carried out to produce the prototype system. The four major project tasks were as follows: a signal processing scheme for converting helium speech to normal-sounding speech was generated; the signal processing scheme was simulated on a general-purpose (LAMBDA) computer, actual helium speech was supplied to the simulation, and the converted speech was generated; an IBM PC based 14-bit data input/output board was designed and built; and a bibliography of references on speech processing was generated.

  6. Speech Recognition on Mobile Devices

    DEFF Research Database (Denmark)

    Tan, Zheng-Hua; Lindberg, Børge

    2010-01-01

The enthusiasm of deploying automatic speech recognition (ASR) on mobile devices is driven both by remarkable advances in ASR technology and by the demand for efficient user interfaces on such devices as mobile phones and personal digital assistants (PDAs). This chapter presents an overview of ASR in the mobile context covering motivations, challenges, fundamental techniques and applications. Three ASR architectures are introduced: embedded speech recognition, distributed speech recognition and network speech recognition. Their pros and cons and implementation issues are discussed. Applications within...

  7. Auditory Modeling for Noisy Speech Recognition

    National Research Council Canada - National Science Library

    2000-01-01

    ... digital filtering for noise cancellation which interfaces to speech recognition software. It uses auditory features in speech recognition training, and provides applications to multilingual spoken language translation...

  8. On speech recognition during anaesthesia

    DEFF Research Database (Denmark)

    Alapetite, Alexandre

    2007-01-01

    This PhD thesis in human-computer interfaces (informatics) studies the case of the anaesthesia record used during medical operations and the possibility to supplement it with speech recognition facilities. Problems and limitations have been identified with the traditional paper-based anaesthesia...... and inaccuracies in the anaesthesia record. Supplementing the electronic anaesthesia record interface with speech input facilities is proposed as one possible solution to a part of the problem. The testing of the various hypotheses has involved the development of a prototype of an electronic anaesthesia record...... interface with speech input facilities in Danish. The evaluation of the new interface was carried out in a full-scale anaesthesia simulator. This has been complemented by laboratory experiments on several aspects of speech recognition for this type of use, e.g. the effects of noise on speech recognition...

  9. Speech recognition from spectral dynamics

    Indian Academy of Sciences (India)

Some of the history of the gradual infusion of the modulation spectrum concept into automatic speech recognition (ASR) comes next, pointing to the relationship of modulation spectrum processing to well-accepted ASR techniques such as dynamic speech features or RelAtive SpecTrAl (RASTA) filtering. Next, the frequency ...

  10. Discriminative learning for speech recognition

    CERN Document Server

He, Xiaodong

    2008-01-01

    In this book, we introduce the background and mainstream methods of probabilistic modeling and discriminative parameter optimization for speech recognition. The specific models treated in depth include the widely used exponential-family distributions and the hidden Markov model. A detailed study is presented on unifying the common objective functions for discriminative learning in speech recognition, namely maximum mutual information (MMI), minimum classification error, and minimum phone/word error. The unification is presented, with rigorous mathematical analysis, in a common rational-functio

  11. Speech recognition implementation in radiology

    International Nuclear Information System (INIS)

    White, Keith S.

    2005-01-01

Continuous speech recognition (SR) is an emerging technology that allows direct digital transcription of dictated radiology reports. SR systems are being widely deployed in the radiology community. This is a review of technical and practical issues that should be considered when implementing an SR system. (orig.)

  12. Connected digit speech recognition system for Malayalam language

    Indian Academy of Sciences (India)

Connected digit speech recognition is important in many applications such as automated banking systems, catalogue dialing, automatic data entry, etc. This paper presents an optimum speaker-independent connected digit recognizer for the Malayalam language. The system employs Perceptual ...

  13. A methodology of error detection: Improving speech recognition in radiology

    OpenAIRE

    Voll, Kimberly Dawn

    2006-01-01

    Automated speech recognition (ASR) in radiology report dictation demands highly accurate and robust recognition software. Despite vendor claims, current implementations are suboptimal, leading to poor accuracy, and time and money wasted on proofreading. Thus, other methods must be considered for increasing the reliability and performance of ASR before it is a viable alternative to human transcription. One such method is post-ASR error detection, used to recover from the inaccuracy of speech r...

  14. Under-resourced speech recognition based on the speech manifold

    CSIR Research Space (South Africa)

    Sahraeian, R

    2015-09-01

    Full Text Available Conventional acoustic modeling involves estimating many parameters to effectively model feature distributions. The sparseness of speech and text data, however, degrades the reliability of the estimation process and makes speech recognition a...

  15. Novel Techniques for Dialectal Arabic Speech Recognition

    CERN Document Server

    Elmahdy, Mohamed; Minker, Wolfgang

    2012-01-01

Novel Techniques for Dialectal Arabic Speech Recognition describes approaches to improve automatic speech recognition for dialectal Arabic. Since speech resources for dialectal Arabic speech recognition are very sparse, the authors describe how existing Modern Standard Arabic (MSA) speech data can be applied to dialectal Arabic speech recognition, while assuming that MSA is always a second language for all Arabic speakers. In this book, Egyptian Colloquial Arabic (ECA) has been chosen as a typical Arabic dialect. ECA is the first-ranked Arabic dialect in terms of number of speakers, and a high-quality ECA speech corpus with accurate phonetic transcription has been collected. MSA acoustic models were trained using news broadcast speech. In order to cross-lingually use MSA in dialectal Arabic speech recognition, the authors have normalized the phoneme sets for MSA and ECA. After this normalization, they have applied state-of-the-art acoustic model adaptation techniques like Maximum Likelihood Linear Regression (MLLR) and M...

  16. Quadcopter Control Using Speech Recognition

    Science.gov (United States)

    Malik, H.; Darma, S.; Soekirno, S.

    2018-04-01

This research reports a comparison of the success rates of speech recognition systems using two types of databases, an existing database and a new database, implemented as motion control for a quadcopter. The speech recognition system used the Mel-frequency cepstral coefficient (MFCC) method for feature extraction and was trained using the recursive neural network (RNN) method. MFCC is one of the feature extraction methods most used for speech recognition; it has a success rate of 80% - 95%. The existing database was used to measure the success rate of the RNN method. The new database was created using the Indonesian language, and its success rate was compared with results from the existing database. Sound input from the microphone was processed on a DSP module with the MFCC method to obtain the characteristic values. These characteristic values were then fed to the trained RNN, whose output was a command. The command became a control input to the single-board computer (SBC), whose output was the movement of the quadcopter. On the SBC, we used the robot operating system (ROS) as the kernel (operating system).
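
A minimal sketch of the MFCC-plus-RNN pattern described above is given below, with the abstract's RNN rendered here as a recurrent (GRU) classifier in PyTorch and MFCCs computed with librosa; the seven-command label set, layer sizes, and synthetic audio are assumptions, not the authors' setup.

```python
import numpy as np
import librosa
import torch
import torch.nn as nn

N_COMMANDS = 7          # e.g. the seven spoken commands mentioned above (assumed labels)

class CommandRNN(nn.Module):
    """GRU over MFCC frames; the final hidden state is mapped to a command class."""
    def __init__(self, n_mfcc=13, hidden=64, n_classes=N_COMMANDS):
        super().__init__()
        self.rnn = nn.GRU(n_mfcc, hidden, batch_first=True)
        self.out = nn.Linear(hidden, n_classes)

    def forward(self, x):                  # x: (batch, time, n_mfcc)
        _, h = self.rnn(x)
        return self.out(h[-1])             # logits per command

def mfcc_sequence(wav, sr=16000, n_mfcc=13):
    feats = librosa.feature.mfcc(y=wav, sr=sr, n_mfcc=n_mfcc)   # (n_mfcc, frames)
    return torch.tensor(feats.T, dtype=torch.float32)           # (frames, n_mfcc)

# Toy usage with a synthetic waveform (real use: a recorded voice command)
wav = np.random.randn(16000).astype(np.float32)
x = mfcc_sequence(wav).unsqueeze(0)        # (1, frames, 13)
model = CommandRNN()
print(model(x).argmax(dim=-1).item())      # index of the predicted command
```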

  17. Hidden neural networks: application to speech recognition

    DEFF Research Database (Denmark)

    Riis, Søren Kamaric

    1998-01-01

    We evaluate the hidden neural network HMM/NN hybrid on two speech recognition benchmark tasks; (1) task independent isolated word recognition on the Phonebook database, and (2) recognition of broad phoneme classes in continuous speech from the TIMIT database. It is shown how hidden neural networks...

  18. Automated Speech Rate Measurement in Dysarthria

    Science.gov (United States)

    Martens, Heidi; Dekens, Tomas; Van Nuffelen, Gwen; Latacz, Lukas; Verhelst, Werner; De Bodt, Marc

    2015-01-01

    Purpose: In this study, a new algorithm for automated determination of speech rate (SR) in dysarthric speech is evaluated. We investigated how reliably the algorithm calculates the SR of dysarthric speech samples when compared with calculation performed by speech-language pathologists. Method: The new algorithm was trained and tested using Dutch…

  19. Efficient CEPSTRAL Normalization for Robust Speech Recognition

    National Research Council Canada - National Science Library

    Liu, Fu-Hua; Stern, Richard M; Huang, Xuedong; Acero, Alejandro

    1993-01-01

    In this paper we describe and compare the performance of a series of cepstrum-based procedures that enable the CMU SPHINX-II speech recognition system to maintain a high level of recognition accuracy...

  20. Predicting automatic speech recognition performance over communication channels from instrumental speech quality and intelligibility scores

    NARCIS (Netherlands)

    Gallardo, L.F.; Möller, S.; Beerends, J.

    2017-01-01

    The performance of automatic speech recognition based on coded-decoded speech heavily depends on the quality of the transmitted signals, determined by channel impairments. This paper examines relationships between speech recognition performance and measurements of speech quality and intelligibility

  1. Speech emotion recognition methods: A literature review

    Science.gov (United States)

    Basharirad, Babak; Moradhaseli, Mohammadreza

    2017-10-01

Recently, attention to research on emotional speech signals in human-machine interfaces has increased due to the availability of high computation capability. Many systems have been proposed in the literature to identify emotional states through speech. Selection of suitable feature sets, design of proper classification methods, and preparation of an appropriate dataset are the main issues in speech emotion recognition systems. This paper critically analyses the currently available approaches to speech emotion recognition based on three evaluating parameters (feature set, classification of features, and accuracy of use). In addition, this paper evaluates the performance and limitations of the available methods. Furthermore, it highlights current promising directions for the improvement of speech emotion recognition systems.

  2. Man machine interface based on speech recognition

    International Nuclear Information System (INIS)

    Jorge, Carlos A.F.; Aghina, Mauricio A.C.; Mol, Antonio C.A.; Pereira, Claudio M.N.A.

    2007-01-01

This work reports the development of a man-machine interface based on speech recognition. The system must recognize spoken commands and execute the desired tasks without manual intervention by operators. The range of applications goes from the execution of commands in an industrial plant's control room to navigation and interaction in virtual environments. Results are reported for isolated word recognition, the isolated words corresponding to the spoken commands. In the pre-processing stage, relevant parameters are extracted from the speech signals using the cepstral analysis technique; these parameters are used for isolated word recognition and serve as the inputs of an artificial neural network that performs the recognition tasks. (author)

  3. On-device mobile speech recognition

    OpenAIRE

    Mustafa, MK

    2016-01-01

Despite many years of research, speech recognition remains an active area of research in artificial intelligence. Currently, the most common commercial application of this technology on mobile devices uses a wireless client-server approach to meet the computational and memory demands of the speech recognition process. Unfortunately, such an approach is unlikely to remain viable when fully applied over the approximately 7.22 billion mobile phones currently in circulation. In this thesis we p...

  4. Automated smartphone audiometry: Validation of a word recognition test app.

    Science.gov (United States)

    Dewyer, Nicholas A; Jiradejvong, Patpong; Henderson Sabes, Jennifer; Limb, Charles J

    2018-03-01

    Develop and validate an automated smartphone word recognition test. Cross-sectional case-control diagnostic test comparison. An automated word recognition test was developed as an app for a smartphone with earphones. English-speaking adults with recent audiograms and various levels of hearing loss were recruited from an audiology clinic and were administered the smartphone word recognition test. Word recognition scores determined by the smartphone app and the gold standard speech audiometry test performed by an audiologist were compared. Test scores for 37 ears were analyzed. Word recognition scores determined by the smartphone app and audiologist testing were in agreement, with 86% of the data points within a clinically acceptable margin of error and a linear correlation value between test scores of 0.89. The WordRec automated smartphone app accurately determines word recognition scores. 3b. Laryngoscope, 128:707-712, 2018. © 2017 The American Laryngological, Rhinological and Otological Society, Inc.

  5. Dynamic Programming Algorithms in Speech Recognition

    Directory of Open Access Journals (Sweden)

    Titus Felix FURTUNA

    2008-01-01

Full Text Available In a speech recognition system containing words, recognition requires the comparison between the input signal of the word and the various words of the dictionary. The problem can be solved efficiently by a dynamic comparison algorithm whose goal is to put the temporal scales of the two words into optimal correspondence. An algorithm of this type is Dynamic Time Warping. This paper presents two alternative implementations of the algorithm designed for recognition of isolated words.
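
The dynamic comparison the abstract refers to can be illustrated with a minimal Dynamic Time Warping sketch; the Euclidean local distance, the absence of path constraints, and the toy templates are assumptions made for illustration.

```python
import numpy as np

def dtw_distance(a, b):
    """Classic DTW cost between two sequences of feature vectors a and b."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])       # local distance
            cost[i, j] = d + min(cost[i - 1, j],           # insertion
                                 cost[i, j - 1],           # deletion
                                 cost[i - 1, j - 1])       # match
    return cost[n, m]

# Isolated word recognition: compare an input utterance against each dictionary template
templates = {"yes": np.random.randn(40, 13), "no": np.random.randn(35, 13)}
utterance = np.random.randn(38, 13)
best = min(templates, key=lambda w: dtw_distance(utterance, templates[w]))
print(best)
```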

  6. Emotion recognition from speech: tools and challenges

    Science.gov (United States)

    Al-Talabani, Abdulbasit; Sellahewa, Harin; Jassim, Sabah A.

    2015-05-01

Human emotion recognition from speech is studied frequently for its importance in many applications, e.g. human-computer interaction. There is wide diversity and little agreement about the basic emotions or emotion-related states on the one hand, and about where the emotion-related information lies in the speech signal on the other. These diversities motivate our investigations into extracting meta-features using the PCA approach, or using a non-adaptive random projection (RP), which significantly reduce the large-dimensional speech feature vectors that may contain a wide range of emotion-related information. Subsets of meta-features are fused to increase the performance of the recognition model that adopts the score-based LDC classifier. We shall demonstrate that our scheme outperforms state-of-the-art results when tested on non-prompted databases or acted databases (i.e. when subjects act specific emotions while uttering a sentence). However, the huge gap between accuracy rates achieved on the different types of speech datasets raises questions about the way emotions modulate speech. In particular we shall argue that emotion recognition from speech should not be dealt with as a classification problem. We shall demonstrate the presence of a spectrum of different emotions in the same speech portion, especially in the non-prompted data sets, which tend to be more "natural" than the acted datasets where the subjects attempt to suppress all but one emotion.

  7. Performance Assessment of Dynaspeak Speech Recognition System on Inflight Databases

    National Research Council Canada - National Science Library

    Barry, Timothy

    2004-01-01

    .... To aid in the assessment of various commercially available speech recognition systems, several aircraft speech databases have been developed at the Air Force Research Laboratory's Human Effectiveness Directorate...

  8. Speech Clarity Index (Ψ): A Distance-Based Speech Quality Indicator and Recognition Rate Prediction for Dysarthric Speakers with Cerebral Palsy

    Science.gov (United States)

    Kayasith, Prakasith; Theeramunkong, Thanaruk

It is a tedious and subjective task to measure the severity of dysarthria by manually evaluating a speaker's speech using available standard assessment methods based on human perception. This paper presents an automated approach to assess the speech quality of a dysarthric speaker with cerebral palsy. With the consideration of two complementary factors, speech consistency and speech distinction, a speech quality indicator called the speech clarity index (Ψ) is proposed as a measure of the speaker's ability to produce consistent speech signals for a certain word and distinguishable speech signals for different words. As an application, it can be used to assess speech quality and forecast the speech recognition rate of an individual dysarthric speaker's speech before actual exhaustive implementation of an automatic speech recognition system for that speaker. The effectiveness of Ψ as a speech recognition rate predictor is evaluated by rank-order inconsistency, correlation coefficient, and root-mean-square of difference. The evaluations were done by comparing its predicted recognition rates with those predicted by the standard methods, the articulatory and intelligibility tests, based on two recognition systems (HMM and ANN). The results show that Ψ is a promising indicator for predicting the recognition rate of dysarthric speech. All experiments were done on a speech corpus composed of speech data from eight normal speakers and eight dysarthric speakers.

  9. Speech recognition from spectral dynamics

    Indian Academy of Sciences (India)

    Carrier nature of speech; modulation spectrum; spectral dynamics ... the relationships between phonetic values of sounds and their short-term spectral envelopes .... the number of free parameters that need to be estimated from training data.

  10. CASRA+: A Colloquial Arabic Speech Recognition Application

    OpenAIRE

    Ramzi A. Haraty; Omar El Ariss

    2007-01-01

The research proposed here was for an Arabic speech recognition application, concentrating on the Lebanese dialect. The system starts by sampling the speech, which is the process of transforming the sound from analog to digital, and then extracts the features using Mel-Frequency Cepstral Coefficients (MFCC). The extracted features are then compared with the system's stored model; in this case the stored model chosen was a phoneme-based model. This reference model differs from the direc...

  11. Automated road marking recognition system

    Science.gov (United States)

    Ziyatdinov, R. R.; Shigabiev, R. R.; Talipov, D. N.

    2017-09-01

Development of automated road marking recognition systems for existing and future vehicle control systems is an urgent task. One way to implement such systems is the use of neural networks. To test this possibility, neural network software based on a single-layer perceptron has been developed. The resulting neural network-based system successfully coped with the task both when driving in the daytime and at night.

  12. Hidden Markov models in automatic speech recognition

    Science.gov (United States)

    Wrzoskowicz, Adam

    1993-11-01

This article describes a method for constructing an automatic speech recognition system based on hidden Markov models (HMMs). The author discusses the basic concepts of HMM theory and the application of these models to the analysis and recognition of speech signals. The author provides algorithms which make it possible to train the ASR system and recognize signals on the basis of distinct stochastic models of selected speech sound classes. The author describes the specific components of the system and the procedures used to model and recognize speech. The author discusses problems associated with the choice of optimal signal detection and parameterization characteristics and their effect on the performance of the system. The author presents different options for the choice of speech signal segments and their consequences for the ASR process. The author gives special attention to the use of lexical, syntactic, and semantic information for the purpose of improving the quality and efficiency of the system. The author also describes an ASR system developed by the Speech Acoustics Laboratory of the IBPT PAS. The author discusses the results of experiments on the effect of noise on the performance of the ASR system and describes methods of constructing HMMs designed to operate in a noisy environment. The author also describes a language for human-robot communication, defined as a complex multilevel network built from an HMM model of speech sounds geared towards Polish inflections. The author also added mandatory lexical and syntactic rules to the system for its communications vocabulary.
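
Recognition with such stochastic models ultimately reduces to finding the most likely state sequence for an observation sequence; a compact Viterbi sketch over log probabilities is shown below, with toy model sizes and probabilities assumed for illustration.

```python
import numpy as np

def viterbi(log_pi, log_A, log_B, obs):
    """Most likely HMM state path for a discrete observation sequence.
    log_pi: (S,) initial, log_A: (S, S) transitions, log_B: (S, V) emissions."""
    S, T = len(log_pi), len(obs)
    delta = np.empty((T, S))                    # best log score ending in each state
    psi = np.zeros((T, S), dtype=int)           # backpointers
    delta[0] = log_pi + log_B[:, obs[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_A  # (prev state, current state)
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + log_B[:, obs[t]]
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):               # backtrack
        path.append(int(psi[t][path[-1]]))
    return path[::-1]

# Toy 3-state model over a 2-symbol observation alphabet
log_pi = np.log([0.6, 0.3, 0.1])
log_A = np.log([[0.7, 0.2, 0.1], [0.1, 0.7, 0.2], [0.1, 0.2, 0.7]])
log_B = np.log([[0.6, 0.4], [0.5, 0.5], [0.2, 0.8]])
print(viterbi(log_pi, log_A, log_B, obs=[0, 0, 1, 1]))
```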

  13. Novel acoustic features for speech emotion recognition

    Institute of Scientific and Technical Information of China (English)

    ROH Yong-Wan; KIM Dong-Ju; LEE Woo-Seok; HONG Kwang-Seok

    2009-01-01

This paper focuses on acoustic features that effectively improve the recognition of emotion in human speech. The novel features in this paper are based on spectral-based entropy parameters such as fast Fourier transform (FFT) spectral entropy, delta FFT spectral entropy, Mel-frequency filter bank (MFB) spectral entropy, and delta MFB spectral entropy. Spectral-based entropy features are simple. They reflect the frequency characteristics of speech and the changes of those characteristics over time. We implement an emotion rejection module using the probability distribution of recognized-scores and rejected-scores. This reduces the false recognition rate to improve overall performance. Recognized-scores and rejected-scores refer to probabilities of recognized and rejected emotion recognition results, respectively. These scores are first obtained from a pattern recognition procedure. The pattern recognition phase uses the Gaussian mixture model (GMM). We classify the four emotional states as anger, sadness, happiness and neutrality. The proposed method is evaluated using 45 sentences in each emotion for 30 subjects, 15 males and 15 females. Experimental results show that the proposed method is superior to the existing emotion recognition methods based on GMM using energy, Zero Crossing Rate (ZCR), linear prediction coefficient (LPC), and pitch parameters. We demonstrate the effectiveness of the proposed approach. One of the proposed features, combined MFB and delta MFB spectral entropy, improves performance by approximately 10% compared to the existing feature parameters for speech emotion recognition methods. We demonstrate a 4% performance improvement in the applied emotion rejection with low confidence score.
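
The FFT spectral entropy and its delta can be illustrated with the short sketch below: the magnitude spectrum of each frame is normalized to a probability distribution, its Shannon entropy is computed, and the delta feature is the frame-to-frame difference. The frame length and windowing are assumptions; the Mel filter bank (MFB) variants would apply the same entropy computation to filter-bank outputs instead of raw FFT bins.

```python
import numpy as np

def spectral_entropy(frame):
    """Shannon entropy of the normalized FFT magnitude spectrum of one frame."""
    spectrum = np.abs(np.fft.rfft(frame * np.hamming(len(frame))))
    p = spectrum / (spectrum.sum() + 1e-12)         # normalize to a distribution
    return -np.sum(p * np.log2(p + 1e-12))          # entropy in bits

def delta(values):
    """First-order (delta) difference of a per-frame feature track."""
    return np.diff(values, prepend=values[0])

# Frame a stand-in signal and compute the entropy track plus its delta
signal = np.random.randn(16000)
frames = signal[: (len(signal) // 400) * 400].reshape(-1, 400)   # 25 ms frames at 16 kHz
ent = np.array([spectral_entropy(f) for f in frames])
print(ent.shape, delta(ent).shape)
```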

  14. Voice Activity Detection. Fundamentals and Speech Recognition System Robustness

    OpenAIRE

    Ramirez, J.; Gorriz, J. M.; Segura, J. C.

    2007-01-01

This chapter has given an overview of the main challenges in robust speech detection and a review of the state of the art and applications. VADs are frequently used in a number of applications including speech coding, speech enhancement and speech recognition. A precise VAD extracts a set of discriminative speech features from the noisy speech and formulates the decision in terms of a well-defined rule. The chapter has summarized three robust VAD methods that yield high speech/non-speech discri...

  15. Post-editing through Speech Recognition

    DEFF Research Database (Denmark)

    Mesa-Lao, Bartolomé

    (i.e. typing, handwriting and speaking) to improve the efficiency and accuracy of the translation process. However, further studies need to be conducted to build up new knowledge about the way in which state-of-the-art speech recognition software can be applied to the post-editing process...

  16. Speech Recognition Technology for Disabilities Education

    Science.gov (United States)

    Tang, K. Wendy; Kamoua, Ridha; Sutan, Victor; Farooq, Omer; Eng, Gilbert; Chu, Wei Chern; Hou, Guofeng

    2005-01-01

Speech recognition is an alternative to traditional methods of interacting with a computer, such as textual input through a keyboard. An effective system can replace or reduce the reliance on standard keyboard and mouse input. This can especially assist dyslexic students who have problems with character or word use and manipulation in a textual…

  17. Automated Intelligibility Assessment of Pathological Speech Using Phonological Features

    Directory of Open Access Journals (Sweden)

    Catherine Middag

    2009-01-01

Full Text Available It is commonly acknowledged that word or phoneme intelligibility is an important criterion in the assessment of the communication efficiency of a pathological speaker. People have therefore put a lot of effort into the design of perceptual intelligibility rating tests. These tests usually have the drawback that they employ unnatural speech material (e.g., nonsense words) and that they cannot fully exclude errors due to listener bias. Therefore, there is a growing interest in the application of objective automatic speech recognition technology to automate the intelligibility assessment. Current research is headed towards the design of automated methods which can be shown to produce ratings that correspond well with those emerging from a well-designed and well-performed perceptual test. In this paper, a novel methodology that is built on previous work (Middag et al., 2008) is presented. It utilizes phonological features, automatic speech alignment based on acoustic models that were trained on normal speech, context-dependent speaker feature extraction, and intelligibility prediction based on a small model that can be trained on pathological speech samples. The experimental evaluation of the new system reveals that the root mean squared error of the discrepancies between perceived and computed intelligibilities can be as low as 8 on a scale of 0 to 100.

  18. Novel acoustic features for speech emotion recognition

    Institute of Scientific and Technical Information of China (English)

ROH Yong-Wan; KIM Dong-Ju; LEE Woo-Seok; HONG Kwang-Seok

    2009-01-01

This paper focuses on acoustic features that effectively improve the recognition of emotion in human speech. The novel features in this paper are based on spectral-based entropy parameters such as fast Fourier transform (FFT) spectral entropy, delta FFT spectral entropy, Mel-frequency filter bank (MFB) spectral entropy, and Delta MFB spectral entropy. Spectral-based entropy features are simple. They reflect frequency characteristic and changing characteristic in frequency of speech. We implement an emotion rejection module using the probability distribution of recognized-scores and rejected-scores. This reduces the false recognition rate to improve overall performance. Recognized-scores and rejected-scores refer to probabilities of recognized and rejected emotion recognition results, respectively. These scores are first obtained from a pattern recognition procedure. The pattern recognition phase uses the Gaussian mixture model (GMM). We classify the four emotional states as anger, sadness, happiness and neutrality. The proposed method is evaluated using 45 sentences in each emotion for 30 subjects, 15 males and 15 females. Experimental results show that the proposed method is superior to the existing emotion recognition methods based on GMM using energy, Zero Crossing Rate (ZCR), linear prediction coefficient (LPC), and pitch parameters. We demonstrate the effectiveness of the proposed approach. One of the proposed features, combined MFB and delta MFB spectral entropy improves performance approximately 10% compared to the existing feature parameters for speech emotion recognition methods. We demonstrate a 4% performance improvement in the applied emotion rejection with low confidence score.

  19. Speech coding, reconstruction and recognition using acoustics and electromagnetic waves

    International Nuclear Information System (INIS)

    Holzrichter, J.F.; Ng, L.C.

    1998-01-01

The use of EM radiation in conjunction with simultaneously recorded acoustic speech information enables a complete mathematical coding of acoustic speech. The methods include the forming of a feature vector for each pitch period of voiced speech and the forming of feature vectors for each time frame of unvoiced, as well as for combined voiced and unvoiced speech. The methods include how to deconvolve the speech excitation function from the acoustic speech output to describe the transfer function for each time frame. The formation of feature vectors defining all acoustic speech units over well-defined time frames can be used for purposes of speech coding, speech compression, speaker identification, language-of-speech identification, speech recognition, speech synthesis, speech translation, speech telephony, and speech teaching. 35 figs

  20. Speech coding, reconstruction and recognition using acoustics and electromagnetic waves

    Science.gov (United States)

    Holzrichter, John F.; Ng, Lawrence C.

    1998-01-01

The use of EM radiation in conjunction with simultaneously recorded acoustic speech information enables a complete mathematical coding of acoustic speech. The methods include the forming of a feature vector for each pitch period of voiced speech and the forming of feature vectors for each time frame of unvoiced, as well as for combined voiced and unvoiced speech. The methods include how to deconvolve the speech excitation function from the acoustic speech output to describe the transfer function for each time frame. The formation of feature vectors defining all acoustic speech units over well-defined time frames can be used for purposes of speech coding, speech compression, speaker identification, language-of-speech identification, speech recognition, speech synthesis, speech translation, speech telephony, and speech teaching.

  1. Development of a System for Automatic Recognition of Speech

    Directory of Open Access Journals (Sweden)

    Roman Jarina

    2003-01-01

Full Text Available The article gives a review of research on the processing and automatic recognition of speech signals (ASR) at the Department of Telecommunications of the Faculty of Electrical Engineering, University of Žilina. On-going research is oriented to speech parametrization using 2-dimensional cepstral analysis, and to the application of HMMs and neural networks for speech recognition in the Slovak language. The article summarizes the achieved results and outlines the future orientation of our research in automatic speech recognition.

  2. Indonesian Automatic Speech Recognition For Command Speech Controller Multimedia Player

    Directory of Open Access Journals (Sweden)

    Vivien Arief Wardhany

    2014-12-01

Full Text Available The purpose of multimedia device development is control through voice. Nowadays, voice can be recognized only in English. To overcome this issue, recognition uses an Indonesian language model, acoustic model, and dictionary. The automatic speech recognizer is built using the CMU Sphinx engine, with the English language database modified to an Indonesian language database, and XBMC is used as the multimedia player. The experiment uses 10 volunteers testing items based on 7 commands. The volunteers are classified by gender, 5 male and 5 female. 10 samples are taken for each command, and each volunteer performs 10 test commands. Each volunteer also has to try all 7 commands provided. Based on the classification percentage table, the word “Kanan” was recognized most often, with a percentage of 83%, while “pilih” was the lowest. The word with the most incorrect classifications was “kembali”, with a percentage of 67%, while “kanan” was the lowest. The recognition rate (RR) results for male speakers show that several commands such as “Kembali”, “Utama”, “Atas” and “Bawah” have a low recognition rate. In particular, “kembali” could not be recognized as a command in the female voices, while in the male voices that command has an RR of only 4%; this is because the command does not have a similar English word close to “kembali”, so the system does not recognize the command. Also, the command “Pilih” has an RR of 80% with the female voices but only 4% with the male voices. This problem is mostly caused by the different voice characteristics of adult males and females: males have lower voice frequencies (from 85 to 180 Hz) than women (165 to 255 Hz). The results of the experiment showed that each speaker had a different recognition rate caused by differences in tone, pronunciation, and speed of speech. Further work needs to be done in order to improve the accuracy of the Indonesian Automatic Speech Recognition system.

  3. Analysis of Feature Extraction Methods for Speaker Dependent Speech Recognition

    Directory of Open Access Journals (Sweden)

    Gurpreet Kaur

    2017-02-01

Full Text Available Speech recognition is about what is being said, irrespective of who is saying it. Speech recognition is a growing field. Major progress is taking place in the technology of automatic speech recognition (ASR). Still, there are many barriers in this field in terms of recognition rate, background noise, speaker variability, speaking rate, accent, etc. The speech recognition rate mainly depends on the selection of features and feature extraction methods. This paper outlines the feature extraction techniques for speaker-dependent speech recognition of isolated words. A brief survey of different feature extraction techniques, such as Mel-Frequency Cepstral Coefficients (MFCC), Linear Predictive Coding Coefficients (LPCC), Perceptual Linear Prediction (PLP), and Relative Spectral Perceptual Linear Predictive (RASTA-PLP) analysis, is presented and an evaluation is done. Speech recognition has various applications from daily use to commercial use. We have made a speaker-dependent system, and this system can be useful in many areas, such as controlling a patient's vehicle using simple commands.
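
As one concrete instance of the feature families surveyed above, LPC coefficients can be obtained from the frame autocorrelation via the Levinson-Durbin recursion; the sketch below is illustrative only, with the frame length and model order chosen arbitrarily rather than taken from the paper.

```python
import numpy as np

def lpc(frame, order=10):
    """LPC prediction polynomial via autocorrelation + Levinson-Durbin recursion."""
    w = frame * np.hamming(len(frame))
    # Autocorrelation at lags 0..order
    r = np.correlate(w, w, mode="full")[len(w) - 1:len(w) + order]
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        k = -(r[i] + np.dot(a[1:i], r[i - 1:0:-1])) / err   # reflection coefficient
        a[1:i + 1] += k * a[i - 1::-1]                       # update polynomial
        err *= (1.0 - k * k)                                 # residual energy
    return a    # A(z) = 1 + a1*z^-1 + ... + a_order*z^-order

frame = np.random.randn(400)        # stand-in 25 ms frame at 16 kHz
print(lpc(frame, order=10))
```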

  4. Compact Acoustic Models for Embedded Speech Recognition

    Directory of Open Access Journals (Sweden)

    Lévy Christophe

    2009-01-01

Full Text Available Speech recognition applications are known to require a significant amount of resources. However, embedded speech recognition allows only a few KB of memory, a few MIPS, and a small amount of training data. In order to fit the resource constraints of embedded applications, an approach based on a semicontinuous HMM system using state-independent acoustic modelling is proposed. A transformation is computed and applied to the global model in order to obtain each HMM state-dependent probability density function, so that only the transformation parameters need to be stored. This approach is evaluated on two tasks: digit and voice-command recognition. A fast adaptation technique for the acoustic models is also proposed. In order to significantly reduce computational costs, the adaptation is performed only on the global model (using related speaker recognition adaptation techniques) with no need for state-dependent data. The whole approach results in a relative gain of more than 20% compared to a basic HMM-based system fitting the constraints.
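
The semicontinuous idea above (state-independent acoustic modelling with compact per-state parameters) can be pictured as a shared Gaussian codebook whose per-state emission densities differ only in a small set of parameters. The sketch below shows the shared-codebook likelihood computation with per-state mixture weights; it is a generic illustration of semicontinuous modelling, not the paper's transformation-based system, and all sizes are assumed.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Shared codebook: K Gaussians reused by every HMM state (toy sizes)
K, D = 8, 13
means = np.random.randn(K, D)
covs = [np.eye(D) for _ in range(K)]

def codebook_likelihoods(x):
    """Per-Gaussian likelihoods of one feature vector under the shared codebook."""
    return np.array([multivariate_normal.pdf(x, means[k], covs[k]) for k in range(K)])

def state_likelihood(x, state_weights):
    """A state's emission likelihood is a weighted sum of the shared Gaussians,
    so only K weights (not full Gaussian parameters) are stored per state."""
    return np.dot(state_weights, codebook_likelihoods(x))

weights_state_a = np.random.dirichlet(np.ones(K))   # per-state mixture weights
x = np.random.randn(D)                                # one cepstral feature vector
print(state_likelihood(x, weights_state_a))
```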

  5. Error analysis to improve the speech recognition accuracy on ...

    Indian Academy of Sciences (India)

dictionary plays a key role in the speech recognition accuracy. .... A sophisticated microphone is used for recording the speech corpus in a noise-free environment. .... values, word error rate (WER) and error rate will be calculated as follows:
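
The formula itself is cut off in this snippet; for reference, the word error rate is conventionally computed from the substitutions S, deletions D, and insertions I of the alignment against a reference transcript of N words:

```latex
\mathrm{WER} = \frac{S + D + I}{N}
```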

  6. Speech recognition in natural background noise.

    Directory of Open Access Journals (Sweden)

    Julien Meyer

Full Text Available In the real world, human speech recognition nearly always involves listening in background noise. The impact of such noise on speech signals and on intelligibility performance increases with the separation of the listener from the speaker. The present behavioral experiment provides an overview of the effects of such acoustic disturbances on speech perception in conditions approaching ecologically valid contexts. We analysed the intelligibility loss in spoken word lists with increasing listener-to-speaker distance in a typical low-level natural background noise. The noise was combined with the simple spherical amplitude attenuation due to distance, basically changing the signal-to-noise ratio (SNR). Therefore, our study draws attention to some of the most basic environmental constraints that have pervaded spoken communication throughout human history. We evaluated the ability of native French participants to recognize French monosyllabic words (spoken at 65.3 dB(A), reference at 1 meter) at distances between 11 and 33 meters, which corresponded to the SNRs most revealing of the progressive effect of the selected natural noise (-8.8 dB to -18.4 dB). Our results showed that in such conditions, the identity of vowels is mostly preserved, with the striking peculiarity of the absence of confusion in vowels. The results also confirmed the functional role of consonants during lexical identification. The extensive analysis of recognition scores, confusion patterns and associated acoustic cues revealed that sonorant, sibilant and burst properties were the most important parameters influencing phoneme recognition. Altogether these analyses allowed us to extract a resistance scale from consonant recognition scores. We also identified specific perceptual consonant confusion groups depending on the place in the words (onset vs. coda). Finally our data suggested that listeners may access some acoustic cues of the CV transition, opening interesting perspectives for

  7. Speech recognition in natural background noise.

    Science.gov (United States)

    Meyer, Julien; Dentel, Laure; Meunier, Fanny

    2013-01-01

In the real world, human speech recognition nearly always involves listening in background noise. The impact of such noise on speech signals and on intelligibility performance increases with the separation of the listener from the speaker. The present behavioral experiment provides an overview of the effects of such acoustic disturbances on speech perception in conditions approaching ecologically valid contexts. We analysed the intelligibility loss in spoken word lists with increasing listener-to-speaker distance in a typical low-level natural background noise. The noise was combined with the simple spherical amplitude attenuation due to distance, basically changing the signal-to-noise ratio (SNR). Therefore, our study draws attention to some of the most basic environmental constraints that have pervaded spoken communication throughout human history. We evaluated the ability of native French participants to recognize French monosyllabic words (spoken at 65.3 dB(A), reference at 1 meter) at distances between 11 and 33 meters, which corresponded to the SNRs most revealing of the progressive effect of the selected natural noise (-8.8 dB to -18.4 dB). Our results showed that in such conditions, identity of vowels is mostly preserved, with the striking peculiarity of the absence of confusion in vowels. The results also confirmed the functional role of consonants during lexical identification. The extensive analysis of recognition scores, confusion patterns and associated acoustic cues revealed that sonorant, sibilant and burst properties were the most important parameters influencing phoneme recognition. Altogether these analyses allowed us to extract a resistance scale from consonant recognition scores. We also identified specific perceptual consonant confusion groups depending on the place in the words (onset vs. coda). Finally our data suggested that listeners may access some acoustic cues of the CV transition, opening interesting perspectives for future studies.
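
The distance-to-SNR mapping used in this study (spherical spreading of the speech level against a fixed noise floor) can be reproduced with a short calculation; the 65.3 dB(A) reference level at 1 meter is taken from the abstract, while the roughly 53 dB(A) noise floor below is an inferred value chosen so that the quoted SNR range is reproduced.

```python
import numpy as np

def snr_at_distance(distance_m, speech_db_at_1m=65.3, noise_db=53.3):
    """SNR under simple spherical (free-field) attenuation of the speech level.
    65.3 dB(A) at 1 m is the reference from the abstract; the ~53 dB(A) noise
    floor is an assumption inferred from the reported SNRs."""
    speech_db = speech_db_at_1m - 20.0 * np.log10(distance_m)   # -6 dB per doubling
    return speech_db - noise_db

for d in (11, 33):
    print(d, round(snr_at_distance(d), 1))   # roughly -8.8 dB and -18.4 dB, as reported
```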

  8. Multi-thread Parallel Speech Recognition for Mobile Applications

    Directory of Open Access Journals (Sweden)

    LOJKA Martin

    2014-05-01

Full Text Available In this paper, a server-based solution for a multi-thread large vocabulary automatic speech recognition engine is described, along with practical application examples for Android OS and HTML5. The basic idea was to make speech recognition available for the full variety of applications for computers and especially for mobile devices. The speech recognition engine should be independent of commercial products and services (where the dictionary cannot be modified). The use of third-party services can also be a security and privacy problem in specific applications, where unsecured audio data must not be sent to uncontrolled environments (voice data transferred to servers around the globe). Using our experience with speech recognition applications, we have been able to construct a multi-thread, server-based speech recognition solution with a simple application programming interface (API) to the speech recognition engine, adapted to the specific needs of particular applications.
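
A minimal picture of the multi-thread server idea (one worker thread per recognition request behind a simple API) is sketched below; the socket protocol, port number, and recognize() stub are assumptions for illustration, not the authors' implementation.

```python
import socket
import threading

def recognize(audio_bytes):
    """Stub standing in for the actual speech recognition engine (assumed interface)."""
    return "recognized text"

def handle_client(conn):
    """One worker thread per connection: read an audio blob, return the transcript."""
    with conn:
        audio = conn.recv(1 << 20)                 # toy protocol: one blob per request
        conn.sendall(recognize(audio).encode("utf-8"))

def serve(host="0.0.0.0", port=5050):
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as srv:
        srv.bind((host, port))
        srv.listen()
        while True:
            conn, _ = srv.accept()
            threading.Thread(target=handle_client, args=(conn,), daemon=True).start()

if __name__ == "__main__":
    serve()
```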

  9. Automatic Speech Recognition from Neural Signals: A Focused Review

    Directory of Open Access Journals (Sweden)

    Christian Herff

    2016-09-01

Full Text Available Speech interfaces have become widely accepted and are nowadays integrated in various real-life applications and devices. They have become a part of our daily life. However, speech interfaces presume the ability to produce intelligible speech, which might be impossible due to loud environments, the disturbance of bystanders, or the inability to produce speech (i.e., patients suffering from locked-in syndrome). For these reasons it would be highly desirable not to speak but to simply envision oneself saying words or sentences. Interfaces based on imagined speech would enable fast and natural communication without the need for audible speech and would give a voice to otherwise mute people. This focused review analyzes the potential of different brain imaging techniques to recognize speech from neural signals by applying Automatic Speech Recognition technology. We argue that modalities based on metabolic processes, such as functional Near Infrared Spectroscopy and functional Magnetic Resonance Imaging, are less suited for Automatic Speech Recognition from neural signals due to low temporal resolution but are very useful for the investigation of the underlying neural mechanisms involved in speech processes. In contrast, electrophysiologic activity is fast enough to capture speech processes and is therefore better suited for ASR. Our experimental results indicate the potential of these signals for speech recognition from neural data with a focus on invasively measured brain activity (electrocorticography). As a first example of Automatic Speech Recognition techniques used from neural signals, we discuss the Brain-to-text system.

  10. Speech and audio processing for coding, enhancement and recognition

    CERN Document Server

    Togneri, Roberto; Narasimha, Madihally

    2015-01-01

This book describes the basic principles underlying the generation, coding, transmission and enhancement of speech and audio signals, including advanced statistical and machine learning techniques for speech and speaker recognition with an overview of the key innovations in these areas. Key research undertaken in speech coding, speech enhancement, speech recognition, emotion recognition and speaker diarization are also presented, along with recent advances and new paradigms in these areas. · Offers readers a single-source reference on the significant applications of speech and audio processing to speech coding, speech enhancement and speech/speaker recognition. Enables readers involved in algorithm development and implementation issues for speech coding to understand the historical development and future challenges in speech coding research; · Discusses speech coding methods yielding bit-streams that are multi-rate and scalable for Voice-over-IP (VoIP) Networks; · ...

  11. Speech Recognition and Cognitive Skills in Bimodal Cochlear Implant Users

    Science.gov (United States)

    Hua, Håkan; Johansson, Björn; Magnusson, Lennart; Lyxell, Björn; Ellis, Rachel J.

    2017-01-01

    Purpose: To examine the relation between speech recognition and cognitive skills in bimodal cochlear implant (CI) and hearing aid users. Method: Seventeen bimodal CI users (28-74 years) were recruited to the study. Speech recognition tests were carried out in quiet and in noise. The cognitive tests employed included the Reading Span Test and the…

  12. Deep Complementary Bottleneck Features for Visual Speech Recognition

    NARCIS (Netherlands)

    Petridis, Stavros; Pantic, Maja

    Deep bottleneck features (DBNFs) have been used successfully in the past for acoustic speech recognition from audio. However, research on extracting DBNFs for visual speech recognition is very limited. In this work, we present an approach to extract deep bottleneck visual features based on deep
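
Deep bottleneck features are typically the activations of a deliberately narrow hidden layer in a network trained on an auxiliary frame-classification task; a minimal PyTorch sketch of such an extractor is shown below, with the layer sizes, input dimensionality, and target count assumed rather than taken from this work.

```python
import torch
import torch.nn as nn

class BottleneckNet(nn.Module):
    """Frame classifier with a narrow hidden layer; that layer's activations
    are reused as the bottleneck features."""
    def __init__(self, n_in=39, n_bottleneck=40, n_targets=120):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(n_in, 512), nn.ReLU(),
            nn.Linear(512, 512), nn.ReLU(),
            nn.Linear(512, n_bottleneck),          # the bottleneck layer
        )
        self.classifier = nn.Sequential(nn.ReLU(), nn.Linear(n_bottleneck, n_targets))

    def forward(self, x):
        return self.classifier(self.encoder(x))    # used during training

    def bottleneck_features(self, x):
        return self.encoder(x)                      # used at feature-extraction time

net = BottleneckNet()
frames = torch.randn(8, 39)                         # a batch of (assumed) acoustic or visual frames
print(net.bottleneck_features(frames).shape)        # torch.Size([8, 40])
```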

  13. Features Speech Signature Image Recognition on Mobile Devices

    Directory of Open Access Journals (Sweden)

    Alexander Mikhailovich Alyushin

    2015-12-01

Full Text Available Algorithms for the recognition and processing of dynamic spectrogram images and sound speech signatures (SS) were developed. Software for mobile phones that can recognize speech signatures was prepared. An investigation of SS recognition speed for different boundary types was conducted. Recommendations were given on the choice of boundary types for the optimal ratio of recognition speed to required space.

  14. An HMM-Like Dynamic Time Warping Scheme for Automatic Speech Recognition

    Directory of Open Access Journals (Sweden)

    Ing-Jr Ding

    2014-01-01

Full Text Available In the past, the kernel of automatic speech recognition (ASR) was dynamic time warping (DTW), which is feature-based template matching and belongs to the category of dynamic programming (DP) techniques. Although DTW is an early ASR technique, it has remained popular in many applications, and it now plays an important role in the well-known Kinect-based gesture recognition application. This paper proposes an intelligent speech recognition system using an improved DTW approach for multimedia and home automation services. The improved DTW presented in this work, called HMM-like DTW, is essentially a hidden Markov model (HMM)-like method in which the concept of the typical HMM statistical model is brought into the design of DTW. The developed HMM-like DTW method, transforming feature-based DTW recognition into model-based DTW recognition, is able to behave like the HMM recognition technique, and therefore the proposed HMM-like DTW with its HMM-like recognition model has the capability to further perform model adaptation (also known as speaker adaptation). A series of experimental results in home automation-based multimedia access service environments demonstrated the superiority and effectiveness of the developed smart speech recognition system based on HMM-like DTW.

  15. ACOUSTIC SPEECH RECOGNITION FOR MARATHI LANGUAGE USING SPHINX

    Directory of Open Access Journals (Sweden)

    Aman Ankit

    2016-09-01

Full Text Available Speech recognition, or speech-to-text processing, is the process of recognizing human speech by computer and converting it into text. In speech recognition, transcripts are created by taking recordings of speech as audio together with their text transcriptions. Speech-based applications that include Natural Language Processing (NLP) techniques are popular and an active area of research. Input to such applications is in natural language and output is obtained in natural language. Speech recognition mostly revolves around three approaches, namely the acoustic-phonetic approach, the pattern recognition approach and the artificial intelligence approach. Creation of an acoustic model requires a large database of speech and training algorithms. The output of an ASR system is the recognition and translation of spoken language into text by computers and computerized devices. ASR today finds enormous application in tasks that require human-machine interfaces, such as voice dialing. Our key contribution in this paper is to create corpora for the Marathi language and explore the use of the Sphinx engine for automatic speech recognition.

  16. Automated Discovery of Speech Act Categories in Educational Games

    Science.gov (United States)

    Rus, Vasile; Moldovan, Cristian; Niraula, Nobal; Graesser, Arthur C.

    2012-01-01

    In this paper we address the important task of automated discovery of speech act categories in dialogue-based, multi-party educational games. Speech acts are important in dialogue-based educational systems because they help infer the student speaker's intentions (the task of speech act classification) which in turn is crucial to providing adequate…

  17. Hybrid methodological approach to context-dependent speech recognition

    Directory of Open Access Journals (Sweden)

    Dragiša Mišković

    2017-01-01

    Full Text Available Although the importance of contextual information in speech recognition has been acknowledged for a long time now, it has remained clearly underutilized even in state-of-the-art speech recognition systems. This article introduces a novel, methodologically hybrid approach to the research question of context-dependent speech recognition in human–machine interaction. To the extent that it is hybrid, the approach integrates aspects of both statistical and representational paradigms. We extend the standard statistical pattern-matching approach with a cognitively inspired and analytically tractable model with explanatory power. This methodological extension allows for accounting for contextual information which is otherwise unavailable in speech recognition systems, and using it to improve post-processing of recognition hypotheses. The article introduces an algorithm for evaluation of recognition hypotheses, illustrates it for concrete interaction domains, and discusses its implementation within two prototype conversational agents.

  18. Speech recognition using articulatory and excitation source features

    CERN Document Server

    Rao, K Sreenivasa

    2017-01-01

    This book discusses the contribution of articulatory and excitation source information in discriminating sound units. The authors focus on excitation source component of speech -- and the dynamics of various articulators during speech production -- for enhancement of speech recognition (SR) performance. Speech recognition is analyzed for read, extempore, and conversation modes of speech. Five groups of articulatory features (AFs) are explored for speech recognition, in addition to conventional spectral features. Each chapter provides the motivation for exploring the specific feature for SR task, discusses the methods to extract those features, and finally suggests appropriate models to capture the sound unit specific knowledge from the proposed features. The authors close by discussing various combinations of spectral, articulatory and source features, and the desired models to enhance the performance of SR systems.

  19. Unvoiced Speech Recognition Using Tissue-Conductive Acoustic Sensor

    Directory of Open Access Journals (Sweden)

    Heracleous Panikos

    2007-01-01

    Full Text Available We present the use of stethoscope and silicon NAM (nonaudible murmur) microphones in automatic speech recognition. NAM microphones are special acoustic sensors, which are attached behind the talker's ear and can capture not only normal (audible) speech, but also very quietly uttered speech (nonaudible murmur). As a result, NAM microphones can be applied in automatic speech recognition systems when privacy is desired in human-machine communication. Moreover, NAM microphones show robustness against noise and they might be used in special systems (speech recognition, speech transform, etc.) for sound-impaired people. Using adaptation techniques and a small amount of training data, we achieved for a 20 k dictation task a 93.9% word accuracy for nonaudible murmur recognition in a clean environment. In this paper, we also investigate nonaudible murmur recognition in noisy environments and the effect of the Lombard reflex on nonaudible murmur recognition. We also propose three methods to integrate audible speech and nonaudible murmur recognition using a stethoscope NAM microphone with very promising results.

  20. Unvoiced Speech Recognition Using Tissue-Conductive Acoustic Sensor

    Directory of Open Access Journals (Sweden)

    Hiroshi Saruwatari

    2007-01-01

    Full Text Available We present the use of stethoscope and silicon NAM (nonaudible murmur) microphones in automatic speech recognition. NAM microphones are special acoustic sensors, which are attached behind the talker's ear and can capture not only normal (audible) speech, but also very quietly uttered speech (nonaudible murmur). As a result, NAM microphones can be applied in automatic speech recognition systems when privacy is desired in human-machine communication. Moreover, NAM microphones show robustness against noise and they might be used in special systems (speech recognition, speech transform, etc.) for sound-impaired people. Using adaptation techniques and a small amount of training data, we achieved for a 20 k dictation task a 93.9% word accuracy for nonaudible murmur recognition in a clean environment. In this paper, we also investigate nonaudible murmur recognition in noisy environments and the effect of the Lombard reflex on nonaudible murmur recognition. We also propose three methods to integrate audible speech and nonaudible murmur recognition using a stethoscope NAM microphone with very promising results.

  1. Automatic Emotion Recognition in Speech: Possibilities and Significance

    Directory of Open Access Journals (Sweden)

    Milana Bojanić

    2009-12-01

    Full Text Available Automatic speech recognition and spoken language understanding are crucial steps towards natural human-machine interaction. The main task of the speech communication process is the recognition of the word sequence, but the recognition of prosody, emotion and stress tags may be of particular importance as well. This paper discusses the possibilities of recognizing emotion from the speech signal in order to improve ASR, and also provides an analysis of acoustic features that can be used for the detection of the speaker's emotion and stress. The paper also provides a short overview of emotion and stress classification techniques. The importance and place of emotional speech recognition is shown in the domain of human-computer interactive systems and the transaction communication model. Directions for future work are given at the end of this work.

  2. Automatic speech recognition for radiological reporting

    International Nuclear Information System (INIS)

    Vidal, B.

    1991-01-01

    Large vocabulary speech recognition, its techniques and its software and hardware technology, are being developed, aimed at providing the office user with a tool that could significantly improve both the quantity and quality of his work: the dictation machine, which allows memos and documents to be input using voice and a microphone instead of fingers and a keyboard. The IBM Rome Science Center, together with the IBM Research Division, has built a prototype recognizer that accepts sentences in natural language from a 20,000-word Italian vocabulary. The unit runs on a personal computer equipped with special hardware capable of providing all the necessary computing power. The first laboratory experiments yielded very interesting results and revealed system characteristics that make its use possible in operational environments. To this purpose, the dictation of medical reports was considered a suitable application. In cooperation with the 2nd Radiology Department of S. Maria della Misericordia Hospital (Udine, Italy), the system was tried out by radiology department doctors during their everyday work. The doctors were able to directly dictate their reports to the unit. The text appeared immediately on the screen, and any errors could be corrected either by voice or by using the keyboard. At the end of report dictation, the doctors could both print and archive the text. The report could also be forwarded to the hospital information system, where the latter was available. Our results have been very encouraging: the system proved to be robust, simple to use, and accurate (over 95% average recognition rate). The experiment yielded valuable suggestions and comments, and its results are useful for system evolution towards improved system management and efficiency.

  3. End-to-end visual speech recognition with LSTMS

    NARCIS (Netherlands)

    Petridis, Stavros; Li, Zuwei; Pantic, Maja

    2017-01-01

    Traditional visual speech recognition systems consist of two stages, feature extraction and classification. Recently, several deep learning approaches have been presented which automatically extract features from the mouth images and aim to replace the feature extraction stage. However, research on

  4. SPEECH EMOTION RECOGNITION USING MODIFIED QUADRATIC DISCRIMINATION FUNCTION

    Institute of Scientific and Technical Information of China (English)

    2008-01-01

    Quadratic Discrimination Function (QDF) is commonly used in speech emotion recognition, which proceeds on the premise that the input data are normally distributed. In this paper, we propose a transformation to normalize the emotional features and then derive a Modified QDF (MQDF) for speech emotion recognition. Features based on prosody and voice quality are extracted, and a Principal Component Analysis Neural Network (PCANN) is used to reduce the dimension of the feature vectors. The results show that voice quality features are an effective supplement for recognition, and the method in this paper improves the recognition rate effectively.
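
    As a rough illustration of the processing chain described above (feature normalization, dimensionality reduction, quadratic discriminant classification), here is a sketch built from standard scikit-learn components; PowerTransformer, PCA and QuadraticDiscriminantAnalysis are stand-ins for the paper's normalization transform, PCANN and MQDF, and the data are placeholders.

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import PowerTransformer

    # Placeholder data: rows are utterances, columns are prosody/voice-quality features.
    X = np.random.randn(200, 30)
    y = np.random.randint(0, 4, size=200)    # four emotion classes, illustrative

    clf = make_pipeline(
        PowerTransformer(),                  # pushes each feature towards a normal distribution
        PCA(n_components=10),                # dimensionality reduction (stand-in for PCANN)
        QuadraticDiscriminantAnalysis(),     # standard QDF classifier (not the paper's MQDF)
    )
    clf.fit(X, y)
    print(clf.score(X, y))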

  5. Annotation of Heterogeneous Multimedia Content Using Automatic Speech Recognition

    NARCIS (Netherlands)

    Huijbregts, M.A.H.; Ordelman, Roeland J.F.; de Jong, Franciska M.G.

    2007-01-01

    This paper reports on the setup and evaluation of robust speech recognition system parts, geared towards transcript generation for heterogeneous, real-life media collections. The system is deployed for generating speech transcripts for the NIST/TRECVID-2007 test collection, part of a Dutch real-life

  6. Source Separation via Spectral Masking for Speech Recognition Systems

    Directory of Open Access Journals (Sweden)

    Gustavo Fernandes Rodrigues

    2012-12-01

    Full Text Available In this paper we present an insight into the use of spectral masking techniques in the time-frequency domain as a preprocessing step for speech signal recognition. Speech recognition systems have their performance negatively affected in noisy environments or in the presence of other speech signals. The limits of these masking techniques for different levels of the signal-to-noise ratio are discussed. We show the robustness of the spectral masking techniques against four types of noise: white, pink, brown and human speech noise (bubble noise). The main contribution of this work is to analyze the performance limits of recognition systems using spectral masking. We obtain an increase of 18% in the speech hit rate when the speech signals are corrupted by other speech signals or bubble noise, at signal-to-noise ratios of approximately 1, 10 and 20 dB. On the other hand, applying the ideal binary masks to mixtures corrupted by white, pink and brown noise results in an average growth of 9% in the speech hit rate, at the same signal-to-noise ratios. The experimental results suggest that the spectral masking techniques are more suitable for the case of bubble noise, which is produced by human speech, than for white, pink and brown noise.
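
    A minimal sketch of the ideal binary mask discussed above, assuming the clean speech and the noise are available separately (as in an oracle experiment); the STFT length and the 0 dB local-SNR threshold are illustrative choices, not the paper's settings.

    import numpy as np
    from scipy.signal import stft, istft

    def ideal_binary_mask(speech, noise, fs, local_snr_db=0.0):
        """Keep only the time-frequency cells where speech dominates noise.

        speech, noise: clean-speech and noise signals of equal length.
        Returns the masked mixture resynthesised to the time domain.
        """
        _, _, S = stft(speech, fs=fs, nperseg=512)
        _, _, N = stft(noise, fs=fs, nperseg=512)
        _, _, X = stft(speech + noise, fs=fs, nperseg=512)

        local_snr = 20 * np.log10(np.abs(S) + 1e-12) - 20 * np.log10(np.abs(N) + 1e-12)
        mask = (local_snr > local_snr_db).astype(float)   # the ideal binary mask

        _, enhanced = istft(X * mask, fs=fs, nperseg=512)
        return enhanced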

  7. Automatic speech recognition (ASR) based approach for speech therapy of aphasic patients: A review

    Science.gov (United States)

    Jamal, Norezmi; Shanta, Shahnoor; Mahmud, Farhanahani; Sha'abani, MNAH

    2017-09-01

    This paper reviews the state-of-the-art automatic speech recognition (ASR) based approach for the speech therapy of aphasic patients. Aphasia is a condition in which the affected person suffers from a speech and language disorder resulting from a stroke or brain injury. Since there is a growing body of evidence indicating the possibility of improving the symptoms at an early stage, ASR based solutions are increasingly being researched for speech and language therapy. ASR is a technology that transfers human speech into transcript text by matching it with the system's library. This is particularly useful in speech rehabilitation therapies as it provides accurate, real-time evaluation of speech input from an individual with a speech disorder. ASR based approaches for speech therapy recognize the speech input from the aphasic patient and provide real-time feedback on their mistakes. However, the accuracy of ASR is dependent on many factors such as phoneme recognition, speech continuity, speaker and environmental differences, as well as our depth of knowledge of human language understanding. Hence, the review examines recent developments in ASR technologies and their performance for individuals with speech and language disorders.

  8. Use of digital speech recognition in diagnostics radiology

    International Nuclear Information System (INIS)

    Arndt, H.; Stockheim, D.; Mutze, S.; Petersein, J.; Gregor, P.; Hamm, B.

    1999-01-01

    Purpose: Applicability and benefits of digital speech recognition in diagnostic radiology were tested using the speech recognition system SP 6000. Methods: The speech recognition system SP 6000 was integrated into the network of the institute and connected to the existing Radiological Information System (RIS). Three subjects used this system for writing 2305 findings from dictation. After the recognition process, the date, length of dictation, time required for checking/correction, kind of examination and error rate were recorded for every dictation. With the same subjects, a correlation was performed with 625 conventionally written findings. Results: After a 1-hour initial training, the average error rates were 8.4 to 13.3%. The first adaptation of the speech recognition system (after nine days) decreased the average error rates to 2.4 to 10.7% due to the ability of the program to learn. The 2nd and 3rd adaptations resulted in only small changes of the error rate. An individual comparison of the error rate development for the same kind of investigation showed that the error rate is relatively independent of the individual user. Conclusion: The results show that the speech recognition system SP 6000 can be regarded as an advantageous alternative for quickly recording radiological findings. A comparison between manually writing and dictating the findings confirms the individual differences in writing speed and shows the advantage of applying voice recognition compared with normal keyboard performance. (orig.) [de
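
    Error rates such as those reported above are typically computed as a word-level edit distance between the dictated reference and the recognized text; the sketch below shows that computation and is not taken from the study itself.

    def word_error_rate(reference, hypothesis):
        """Word error rate via Levenshtein distance over word tokens."""
        ref, hyp = reference.split(), hypothesis.split()
        # d[i][j] = edits needed to turn the first i reference words
        # into the first j hypothesis words.
        d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(len(ref) + 1):
            d[i][0] = i
        for j in range(len(hyp) + 1):
            d[0][j] = j
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                substitution = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
                d[i][j] = min(substitution, d[i - 1][j] + 1, d[i][j - 1] + 1)
        return d[-1][-1] / max(len(ref), 1)

    print(word_error_rate("no focal lesion seen", "no focal lesions seen"))  # 0.25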

  9. Histogram Equalization to Model Adaptation for Robust Speech Recognition

    Directory of Open Access Journals (Sweden)

    Suh Youngjoo

    2010-01-01

    Full Text Available We propose a new model adaptation method based on the histogram equalization technique for providing robustness in noisy environments. The trained acoustic mean models of a speech recognizer are adapted into environmentally matched conditions by using the histogram equalization algorithm on a single utterance basis. For more robust speech recognition in the heavily noisy conditions, trained acoustic covariance models are efficiently adapted by the signal-to-noise ratio-dependent linear interpolation between trained covariance models and utterance-level sample covariance models. Speech recognition experiments on both the digit-based Aurora2 task and the large vocabulary-based task showed that the proposed model adaptation approach provides significant performance improvements compared to the baseline speech recognizer trained on the clean speech data.
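
    The record above applies histogram equalization on the model side; the more common feature-space variant, sketched below under the assumption that clean reference features are available, maps the empirical quantiles of the noisy test features onto the reference distribution.

    import numpy as np

    def histogram_equalize(test_feats, ref_feats):
        """Map each feature dimension of the noisy test frames onto the
        empirical distribution of clean reference frames (quantile mapping)."""
        equalized = np.empty_like(test_feats)
        for d in range(test_feats.shape[1]):
            ranks = np.argsort(np.argsort(test_feats[:, d]))           # ranks 0..T-1
            quantiles = (ranks + 0.5) / len(test_feats)                # empirical CDF values
            equalized[:, d] = np.quantile(ref_feats[:, d], quantiles)  # inverse reference CDF
        return equalized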

  10. Garbage Modeling for On-device Speech Recognition

    NARCIS (Netherlands)

    Van Gysel, C.; Velikovich, L.; McGraw, I.; Beaufays, F.

    2015-01-01

    User interactions with mobile devices increasingly depend on voice as a primary input modality. Due to the disadvantages of sending audio across potentially spotty network connections for speech recognition, in recent years there has been growing attention to performing recognition on-device. The

  11. Lexicon Optimization for Dutch Speech Recognition in Spoken Document Retrieval

    NARCIS (Netherlands)

    Ordelman, Roeland J.F.; van Hessen, Adrianus J.; de Jong, Franciska M.G.

    In this paper, ongoing work concerning the language modelling and lexicon optimization of a Dutch speech recognition system for Spoken Document Retrieval is described: the collection and normalization of a training data set and the optimization of our recognition lexicon. Effects on lexical coverage

  12. Lexicon optimization for Dutch speech recognition in spoken document retrieval

    NARCIS (Netherlands)

    Ordelman, Roeland J.F.; van Hessen, Adrianus J.; de Jong, Franciska M.G.; Dalsgaard, P.; Lindberg, B.; Benner, H.

    2001-01-01

    In this paper, ongoing work concerning the language modelling and lexicon optimization of a Dutch speech recognition system for Spoken Document Retrieval is described: the collection and normalization of a training data set and the optimization of our recognition lexicon. Effects on lexical coverage

  13. Automatic speech recognition used for evaluation of text-to-speech systems

    Czech Academy of Sciences Publication Activity Database

    Vích, Robert; Nouza, J.; Vondra, Martin

    -, č. 5042 (2008), s. 136-148 ISSN 0302-9743 R&D Projects: GA AV ČR 1ET301710509; GA AV ČR 1QS108040569 Institutional research plan: CEZ:AV0Z20670512 Keywords : speech recognition * speech processing Subject RIV: JA - Electronics ; Optoelectronics, Electrical Engineering

  14. Speech recognition systems on the Cell Broadband Engine

    Energy Technology Data Exchange (ETDEWEB)

    Liu, Y; Jones, H; Vaidya, S; Perrone, M; Tydlitat, B; Nanda, A

    2007-04-20

    In this paper we describe our design, implementation, and first results of a prototype connected-phoneme-based speech recognition system on the Cell Broadband Engine™ (Cell/B.E.). Automatic speech recognition decodes speech samples into plain text (other representations are possible) and must process samples at real-time rates. Fortunately, the computational tasks involved in this pipeline are highly data-parallel and can receive significant hardware acceleration from vector-streaming architectures such as the Cell/B.E. Identifying and exploiting these parallelism opportunities is challenging, but also critical to improving system performance. We observed, from our initial performance timings, that a single Cell/B.E. processor can recognize speech from thousands of simultaneous voice channels in real time--a channel density that is orders-of-magnitude greater than the capacity of existing software speech recognizers based on CPUs (central processing units). This result emphasizes the potential for Cell/B.E.-based speech recognition and will likely lead to the future development of production speech systems using Cell/B.E. clusters.

  15. Experiments on Automatic Recognition of Nonnative Arabic Speech

    Directory of Open Access Journals (Sweden)

    Douglas O'Shaughnessy

    2008-05-01

    Full Text Available The automatic recognition of foreign-accented Arabic speech is a challenging task since it involves a large number of nonnative accents. As well, the nonnative speech data available for training are generally insufficient. Moreover, as compared to other languages, the Arabic language has sparked a relatively small number of research efforts. In this paper, we are concerned with the problem of nonnative speech in a speaker independent, large-vocabulary speech recognition system for modern standard Arabic (MSA). We analyze some major differences at the phonetic level in order to determine which phonemes have a significant part in the recognition performance for both native and nonnative speakers. Special attention is given to specific Arabic phonemes. The performance of an HMM-based Arabic speech recognition system is analyzed with respect to speaker gender and its native origin. The WestPoint modern standard Arabic database from the language data consortium (LDC) and the hidden Markov Model Toolkit (HTK) are used throughout all experiments. Our study shows that the best performance in the overall phoneme recognition is obtained when nonnative speakers are involved in both training and testing phases. This is not the case when a language model and phonetic lattice networks are incorporated in the system. At the phonetic level, the results show that female nonnative speakers perform better than nonnative male speakers, and that emphatic phonemes yield a significant decrease in performance when they are uttered by both male and female nonnative speakers.

  16. Experiments on Automatic Recognition of Nonnative Arabic Speech

    Directory of Open Access Journals (Sweden)

    Selouani Sid-Ahmed

    2008-01-01

    Full Text Available The automatic recognition of foreign-accented Arabic speech is a challenging task since it involves a large number of nonnative accents. As well, the nonnative speech data available for training are generally insufficient. Moreover, as compared to other languages, the Arabic language has sparked a relatively small number of research efforts. In this paper, we are concerned with the problem of nonnative speech in a speaker independent, large-vocabulary speech recognition system for modern standard Arabic (MSA). We analyze some major differences at the phonetic level in order to determine which phonemes have a significant part in the recognition performance for both native and nonnative speakers. Special attention is given to specific Arabic phonemes. The performance of an HMM-based Arabic speech recognition system is analyzed with respect to speaker gender and its native origin. The WestPoint modern standard Arabic database from the language data consortium (LDC) and the hidden Markov Model Toolkit (HTK) are used throughout all experiments. Our study shows that the best performance in the overall phoneme recognition is obtained when nonnative speakers are involved in both training and testing phases. This is not the case when a language model and phonetic lattice networks are incorporated in the system. At the phonetic level, the results show that female nonnative speakers perform better than nonnative male speakers, and that emphatic phonemes yield a significant decrease in performance when they are uttered by both male and female nonnative speakers.

  17. Leveraging Automatic Speech Recognition Errors to Detect Challenging Speech Segments in TED Talks

    Science.gov (United States)

    Mirzaei, Maryam Sadat; Meshgi, Kourosh; Kawahara, Tatsuya

    2016-01-01

    This study investigates the use of Automatic Speech Recognition (ASR) systems to epitomize second language (L2) listeners' problems in perception of TED talks. ASR-generated transcripts of videos often involve recognition errors, which may indicate difficult segments for L2 listeners. This paper aims to discover the root-causes of the ASR errors…

  18. Speech Recognition for the iCub Platform

    Directory of Open Access Journals (Sweden)

    Bertrand Higy

    2018-02-01

    Full Text Available This paper describes open source software (available at https://github.com/robotology/natural-speech) to build automatic speech recognition (ASR) systems and run them within the YARP platform. The toolkit is designed (i) to allow non-ASR experts to easily create their own ASR system and run it on iCub and (ii) to build deep learning-based models specifically addressing the main challenges an ASR system faces in the context of verbal human–iCub interactions. The toolkit mostly consists of Python, C++ code and shell scripts integrated in YARP. As additional contribution, a second codebase (written in Matlab) is provided for more expert ASR users who want to experiment with bio-inspired and developmental learning-inspired ASR systems. Specifically, we provide code for two distinct kinds of speech recognition: “articulatory” and “unsupervised” speech recognition. The first is largely inspired by influential neurobiological theories of speech perception which assume speech perception to be mediated by brain motor cortex activities. Our articulatory systems have been shown to outperform strong deep learning-based baselines. The second type of recognition systems, the “unsupervised” systems, do not use any supervised information (contrary to most ASR systems, including our articulatory systems). To some extent, they mimic an infant who has to discover the basic speech units of a language by herself. In addition, we provide resources consisting of pre-trained deep learning models for ASR, and a 2.5-h speech dataset of spoken commands, the VoCub dataset, which can be used to adapt an ASR system to the typical acoustic environments in which iCub operates.

  19. Deep Recurrent Convolutional Neural Network: Improving Performance For Speech Recognition

    OpenAIRE

    Zhang, Zewang; Sun, Zheng; Liu, Jiaqi; Chen, Jingwen; Huo, Zhao; Zhang, Xiao

    2016-01-01

    A deep learning approach has been widely applied in sequence modeling problems. In terms of automatic speech recognition (ASR), its performance has significantly been improved by increasing large speech corpus and deeper neural network. Especially, recurrent neural network and deep convolutional neural network have been applied in ASR successfully. Given the arising problem of training speed, we build a novel deep recurrent convolutional network for acoustic modeling and then apply deep resid...

  20. Analysis of Phonetic Transcriptions for Danish Automatic Speech Recognition

    DEFF Research Database (Denmark)

    Kirkedal, Andreas Søeborg

    2013-01-01

    Automatic speech recognition (ASR) relies on three resources: audio, orthographic transcriptions and a pronunciation dictionary. The dictionary or lexicon maps orthographic words to sequences of phones or phonemes that represent the pronunciation of the corresponding word. The quality of a speech… The analysis indicates that transcribing e.g. stress or vowel duration has a negative impact on performance. The best performance is obtained with coarse phonetic annotation and improves performance by 1% word error rate and 3.8% sentence error rate.
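
    A minimal sketch of the kind of lexicon coarsening suggested by this analysis: stripping stress and length marks from pronunciations before they are used for acoustic model training; the symbol set and the two lexicon entries are invented for illustration and are not the Danish lexicon used in the thesis.

    # Coarsening a pronunciation lexicon by dropping stress and length diacritics.
    # The marker symbols and entries below are illustrative assumptions.
    SUPRASEGMENTALS = {"'", ",", ":"}   # assumed stress marks and a length mark

    def coarsen(pronunciation):
        return "".join(ch for ch in pronunciation if ch not in SUPRASEGMENTALS)

    lexicon = {"tale": "'ta:l@", "tal": "'tal"}
    coarse_lexicon = {word: coarsen(pron) for word, pron in lexicon.items()}
    print(coarse_lexicon)   # {'tale': 'tal@', 'tal': 'tal'}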

  1. Recent advances in Automatic Speech Recognition for Vietnamese

    OpenAIRE

    Le , Viet-Bac; Besacier , Laurent; Seng , Sopheap; Bigi , Brigitte; Do , Thi-Ngoc-Diep

    2008-01-01

    This paper presents our recent activities for automatic speech recognition for Vietnamese. First, our text data collection and processing methods and tools are described. For language modeling, we investigate word, sub-word and also hybrid word/sub-word models. For acoustic modeling, when only limited speech data are available for Vietnamese, we propose some crosslingual acoustic modeling techniques. Furthermore, since the use of sub-word units can reduce the high out-...

  2. Evaluating deep learning architectures for Speech Emotion Recognition.

    Science.gov (United States)

    Fayek, Haytham M; Lech, Margaret; Cavedon, Lawrence

    2017-08-01

    Speech Emotion Recognition (SER) can be regarded as a static or dynamic classification problem, which makes SER an excellent test bed for investigating and comparing various deep learning architectures. We describe a frame-based formulation to SER that relies on minimal speech processing and end-to-end deep learning to model intra-utterance dynamics. We use the proposed SER system to empirically explore feed-forward and recurrent neural network architectures and their variants. Experiments conducted illuminate the advantages and limitations of these architectures in paralinguistic speech recognition and emotion recognition in particular. As a result of our exploration, we report state-of-the-art results on the IEMOCAP database for speaker-independent SER and present quantitative and qualitative assessments of the models' performances. Copyright © 2017 Elsevier Ltd. All rights reserved.
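
    A hedged PyTorch sketch of the frame-based formulation described above: a feed-forward network scores each frame and the utterance-level decision aggregates the frame posteriors; the layer sizes, 40-dimensional features and mean-pooling rule are assumptions, not the paper's exact architecture.

    import torch
    import torch.nn as nn

    class FrameSER(nn.Module):
        """Feed-forward classifier applied independently to each spectral frame."""
        def __init__(self, n_features=40, n_emotions=4):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(n_features, 256), nn.ReLU(),
                nn.Linear(256, 256), nn.ReLU(),
                nn.Linear(256, n_emotions),
            )

        def forward(self, frames):              # frames: (n_frames, n_features)
            return self.net(frames)             # per-frame emotion logits

    model = FrameSER()
    utterance = torch.randn(120, 40)            # 120 frames of 40 log-mel features
    frame_logits = model(utterance)
    # Utterance-level decision: average the frame posteriors (one simple choice).
    utterance_posterior = frame_logits.softmax(dim=1).mean(dim=0)
    print(utterance_posterior.argmax().item())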

  3. Speech recognition in individuals with sensorineural hearing loss.

    Science.gov (United States)

    de Andrade, Adriana Neves; Iorio, Maria Cecilia Martinelli; Gil, Daniela

    2016-01-01

    Hearing loss can negatively influence the communication performance of individuals, who should be evaluated with suitable material and in situations of listening close to those found in everyday life. To analyze and compare the performance of patients with mild-to-moderate sensorineural hearing loss in speech recognition tests carried out in silence and with noise, according to the variables ear (right and left) and type of stimulus presentation. The study included 19 right-handed individuals with mild-to-moderate symmetrical bilateral sensorineural hearing loss, submitted to the speech recognition test with words in different modalities and speech test with white noise and pictures. There was no significant difference between right and left ears in any of the tests. The mean number of correct responses in the speech recognition test with pictures, live voice, and recorded monosyllables was 97.1%, 85.9%, and 76.1%, respectively, whereas after the introduction of noise, the performance decreased to 72.6% accuracy. The best performances in the Speech Recognition Percentage Index were obtained using monosyllabic stimuli, represented by pictures presented in silence, with no significant differences between the right and left ears. After the introduction of competitive noise, there was a decrease in individuals' performance. Copyright © 2015 Associação Brasileira de Otorrinolaringologia e Cirurgia Cérvico-Facial. Published by Elsevier Editora Ltda. All rights reserved.

  4. Speech recognition in individuals with sensorineural hearing loss

    Directory of Open Access Journals (Sweden)

    Adriana Neves de Andrade

    Full Text Available ABSTRACT INTRODUCTION: Hearing loss can negatively influence the communication performance of individuals, who should be evaluated with suitable material and in situations of listening close to those found in everyday life. OBJECTIVE: To analyze and compare the performance of patients with mild-to-moderate sensorineural hearing loss in speech recognition tests carried out in silence and with noise, according to the variables ear (right and left) and type of stimulus presentation. METHODS: The study included 19 right-handed individuals with mild-to-moderate symmetrical bilateral sensorineural hearing loss, submitted to the speech recognition test with words in different modalities and speech test with white noise and pictures. RESULTS: There was no significant difference between right and left ears in any of the tests. The mean number of correct responses in the speech recognition test with pictures, live voice, and recorded monosyllables was 97.1%, 85.9%, and 76.1%, respectively, whereas after the introduction of noise, the performance decreased to 72.6% accuracy. CONCLUSIONS: The best performances in the Speech Recognition Percentage Index were obtained using monosyllabic stimuli, represented by pictures presented in silence, with no significant differences between the right and left ears. After the introduction of competitive noise, there was a decrease in individuals' performance.

  5. Speech recognition for the anaesthesia record during crisis scenarios

    DEFF Research Database (Denmark)

    Alapetite, Alexandre

    2008-01-01

    Introduction: This article describes the evaluation of a prototype speech-input interface to an anaesthesia patient record, conducted in a full-scale anaesthesia simulator involving six doctor-nurse anaesthetist teams. Objective: The aims of the experiment were, first, to assess the potential...... and observations almost simultaneously when they are given or made. The tested speech input strategies were successful, even with the ambient noise. Speaking to the system while working appeared feasible, although improvements in speech recognition rates are needed. Conclusion: A vocal interface leads to shorter...

  6. Human phoneme recognition depending on speech-intrinsic variability.

    Science.gov (United States)

    Meyer, Bernd T; Jürgens, Tim; Wesker, Thorsten; Brand, Thomas; Kollmeier, Birger

    2010-11-01

    The influence of different sources of speech-intrinsic variation (speaking rate, effort, style and dialect or accent) on human speech perception was investigated. In listening experiments with 16 listeners, confusions of consonant-vowel-consonant (CVC) and vowel-consonant-vowel (VCV) sounds in speech-weighted noise were analyzed. Experiments were based on the OLLO logatome speech database, which was designed for a man-machine comparison. It contains utterances spoken by 50 speakers from five dialect/accent regions and covers several intrinsic variations. By comparing results depending on intrinsic and extrinsic variations (i.e., different levels of masking noise), the degradation induced by variabilities can be expressed in terms of the SNR. The spectral level distance between the respective speech segment and the long-term spectrum of the masking noise was found to be a good predictor for recognition rates, while phoneme confusions were influenced by the distance to spectrally close phonemes. An analysis based on transmitted information of articulatory features showed that voicing and manner of articulation are comparatively robust cues in the presence of intrinsic variations, whereas the coding of place is more degraded. The database and detailed results have been made available for comparisons between human speech recognition (HSR) and automatic speech recognizers (ASR).

  7. Preliminary Analysis of Automatic Speech Recognition and Synthesis Technology.

    Science.gov (United States)

    1983-05-01

    …speech. Private industry, which sees a major market for improved speech recognition systems, is attempting to solve the problems involved in… manufacturer is able to market such a recognition system. A second requirement for the spotting of keywords in distress signals concerns the need for a

  8. Current trends in small vocabulary speech recognition for equipment control

    Science.gov (United States)

    Doukas, Nikolaos; Bardis, Nikolaos G.

    2017-09-01

    Speech recognition systems allow human - machine communication to acquire an intuitive nature that approaches the simplicity of inter - human communication. Small vocabulary speech recognition is a subset of the overall speech recognition problem, where only a small number of words need to be recognized. Speaker independent small vocabulary recognition can find significant applications in field equipment used by military personnel. Such equipment may typically be controlled by a small number of commands that need to be given quickly and accurately, under conditions where delicate manual operations are difficult to achieve. This type of application could hence significantly benefit by the use of robust voice operated control components, as they would facilitate the interaction with their users and render it much more reliable in times of crisis. This paper presents current challenges involved in attaining efficient and robust small vocabulary speech recognition. These challenges concern feature selection, classification techniques, speaker diversity and noise effects. A state machine approach is presented that facilitates the voice guidance of different equipment in a variety of situations.
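
    The state machine mentioned above can be sketched as a simple transition table keyed on the current state and the recognized command word; the states and commands below are invented for illustration and are not taken from the paper.

    # Sketch of a command state machine for small-vocabulary voice control.
    # States, command words and transitions are illustrative assumptions.
    TRANSITIONS = {
        ("idle", "radio"): "radio_menu",
        ("radio_menu", "on"): "radio_on",
        ("radio_menu", "off"): "idle",
        ("radio_on", "volume"): "volume_menu",
        ("volume_menu", "up"): "radio_on",
        ("volume_menu", "down"): "radio_on",
    }

    def step(state, recognized_word):
        """Advance the dialogue state; unknown words leave the state unchanged."""
        return TRANSITIONS.get((state, recognized_word), state)

    state = "idle"
    for word in ["radio", "on", "volume", "up"]:
        state = step(state, word)
        print(word, "->", state)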

  9. Collecting and evaluating speech recognition corpora for nine Southern Bantu languages

    CSIR Research Space (South Africa)

    Badenhorst, JAC

    2009-03-01

    Full Text Available The authors describe the Lwazi corpus for automatic speech recognition (ASR), a new telephone speech corpus which includes data from nine Southern Bantu languages. Because of practical constraints, the amount of speech per language is relatively...

  10. Pattern Recognition Methods and Features Selection for Speech Emotion Recognition System.

    Science.gov (United States)

    Partila, Pavol; Voznak, Miroslav; Tovarek, Jaromir

    2015-01-01

    The impact of the classification method and feature selection on speech emotion recognition accuracy is discussed in this paper. Selecting the correct parameters in combination with the classifier is an important part of reducing the computational complexity of the system. This step is necessary especially for systems that will be deployed in real-time applications. The reason for the development and improvement of speech emotion recognition systems is their wide usability in today's automatic voice-controlled systems. The Berlin database of emotional recordings was used in this experiment. The classification accuracy of artificial neural networks, k-nearest neighbours, and Gaussian mixture models is measured considering the selection of prosodic, spectral, and voice quality features. The purpose was to find an optimal combination of methods and group of features for stress detection in human speech. The research contribution lies in the design of the speech emotion recognition system with regard to its accuracy and efficiency.

  11. Pattern Recognition Methods and Features Selection for Speech Emotion Recognition System

    Directory of Open Access Journals (Sweden)

    Pavol Partila

    2015-01-01

    Full Text Available The impact of the classification method and feature selection on speech emotion recognition accuracy is discussed in this paper. Selecting the correct parameters in combination with the classifier is an important part of reducing the computational complexity of the system. This step is necessary especially for systems that will be deployed in real-time applications. The reason for the development and improvement of speech emotion recognition systems is their wide usability in today's automatic voice-controlled systems. The Berlin database of emotional recordings was used in this experiment. The classification accuracy of artificial neural networks, k-nearest neighbours, and Gaussian mixture models is measured considering the selection of prosodic, spectral, and voice quality features. The purpose was to find an optimal combination of methods and group of features for stress detection in human speech. The research contribution lies in the design of the speech emotion recognition system with regard to its accuracy and efficiency.
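
    A sketch of the kind of comparison described in the two records above, using scikit-learn stand-ins for the three classifiers (k-nearest neighbours, a small neural network, and per-class Gaussian mixture models); the feature matrix, labels and model settings are placeholders rather than the Berlin-database setup.

    import numpy as np
    from sklearn.mixture import GaussianMixture
    from sklearn.model_selection import cross_val_score
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.neural_network import MLPClassifier

    # Placeholder feature matrix: rows are utterances, columns are
    # prosodic/spectral/voice-quality features; labels are emotion classes.
    X = np.random.randn(300, 24)
    y = np.random.randint(0, 5, size=300)

    for name, clf in [("kNN", KNeighborsClassifier(5)),
                      ("ANN", MLPClassifier(hidden_layer_sizes=(64,), max_iter=500))]:
        print(name, cross_val_score(clf, X, y, cv=5).mean())

    # GMM classifier: one mixture per emotion, decide by maximum log-likelihood.
    gmms = {c: GaussianMixture(n_components=2).fit(X[y == c]) for c in np.unique(y)}
    scores = np.column_stack([gmms[c].score_samples(X) for c in sorted(gmms)])
    print("GMM", (scores.argmax(axis=1) == y).mean())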

  12. Temporal visual cues aid speech recognition

    DEFF Research Database (Denmark)

    Zhou, Xiang; Ross, Lars; Lehn-Schiøler, Tue

    2006-01-01

    BACKGROUND: It is well known that under noisy conditions, viewing a speaker's articulatory movement aids the recognition of spoken words. Conventionally it is thought that the visual input disambiguates otherwise confusing auditory input. HYPOTHESIS: In contrast we hypothesize… that it is the temporal synchronicity of the visual input that aids parsing of the auditory stream. More specifically, we expected that purely temporal information, which does not convey information such as place of articulation, may facilitate word recognition. METHODS: To test this prediction we used temporal features of audio to generate an artificial talking-face video and measured word recognition performance on simple monosyllabic words. RESULTS: When presenting words together with the artificial video we find that word recognition is improved over purely auditory presentation. The effect is significant (p…

  13. Histogram equalization with Bayesian estimation for noise robust speech recognition.

    Science.gov (United States)

    Suh, Youngjoo; Kim, Hoirin

    2018-02-01

    The histogram equalization approach is an efficient feature normalization technique for noise robust automatic speech recognition. However, it suffers from performance degradation when some fundamental conditions are not satisfied in the test environment. To remedy these limitations of the original histogram equalization methods, class-based histogram equalization approach has been proposed. Although this approach showed substantial performance improvement under noise environments, it still suffers from performance degradation due to the overfitting problem when test data are insufficient. To address this issue, the proposed histogram equalization technique employs the Bayesian estimation method in the test cumulative distribution function estimation. It was reported in a previous study conducted on the Aurora-4 task that the proposed approach provided substantial performance gains in speech recognition systems based on the acoustic modeling of the Gaussian mixture model-hidden Markov model. In this work, the proposed approach was examined in speech recognition systems with deep neural network-hidden Markov model (DNN-HMM), the current mainstream speech recognition approach where it also showed meaningful performance improvement over the conventional maximum likelihood estimation-based method. The fusion of the proposed features with the mel-frequency cepstral coefficients provided additional performance gains in DNN-HMM systems, which otherwise suffer from performance degradation in the clean test condition.

  14. Spoken Word Recognition of Chinese Words in Continuous Speech

    Science.gov (United States)

    Yip, Michael C. W.

    2015-01-01

    The present study examined the role that the positional probability of syllables plays in the recognition of spoken words in continuous Cantonese speech. Because some sounds occur more frequently at the beginning or ending position of Cantonese syllables than others, this kind of probabilistic information about syllables may cue the locations…

  15. High-performance speech recognition using consistency modeling

    Science.gov (United States)

    Digalakis, Vassilios; Murveit, Hy; Monaco, Peter; Neumeyer, Leo; Sankar, Ananth

    1994-12-01

    The goal of SRI's consistency modeling project is to improve the raw acoustic modeling component of SRI's DECIPHER speech recognition system and develop consistency modeling technology. Consistency modeling aims to reduce the number of improper independence assumptions used in traditional speech recognition algorithms so that the resulting speech recognition hypotheses are more self-consistent and, therefore, more accurate. At the initial stages of this effort, SRI focused on developing the appropriate base technologies for consistency modeling. We first developed the Progressive Search technology that allowed us to perform large-vocabulary continuous speech recognition (LVCSR) experiments. Since its conception and development at SRI, this technique has been adopted by most laboratories, including other ARPA contracting sites, doing research on LVCSR. Another goal of the consistency modeling project is to attack difficult modeling problems, when there is a mismatch between the training and testing phases. Such mismatches may include outlier speakers, different microphones and additive noise. We were able to either develop new, or transfer and evaluate existing, technologies that adapted our baseline genonic HMM recognizer to such difficult conditions.

  16. Multitasking During Degraded Speech Recognition in School-Age Children.

    Science.gov (United States)

    Grieco-Calub, Tina M; Ward, Kristina M; Brehm, Laurel

    2017-01-01

    Multitasking requires individuals to allocate their cognitive resources across different tasks. The purpose of the current study was to assess school-age children's multitasking abilities during degraded speech recognition. Children (8 to 12 years old) completed a dual-task paradigm including a sentence recognition (primary) task containing speech that was either unprocessed or noise-band vocoded with 8, 6, or 4 spectral channels and a visual monitoring (secondary) task. Children's accuracy and reaction time on the visual monitoring task was quantified during the dual-task paradigm in each condition of the primary task and compared with single-task performance. Children experienced dual-task costs in the 6- and 4-channel conditions of the primary speech recognition task with decreased accuracy on the visual monitoring task relative to baseline performance. In all conditions, children's dual-task performance on the visual monitoring task was strongly predicted by their single-task (baseline) performance on the task. Results suggest that children's proficiency with the secondary task contributes to the magnitude of dual-task costs while multitasking during degraded speech recognition.
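
    A hedged sketch of the noise-band vocoding used to degrade the sentences above: band-pass analysis, envelope extraction, and envelope-modulated noise carriers summed back together; the filter order, channel edges and envelope method are assumptions, not the study's exact signal processing.

    import numpy as np
    from scipy.signal import butter, filtfilt, hilbert

    def noise_vocode(signal, fs, n_channels=8, lo=100.0, hi=7000.0):
        """Noise-band vocoder: the signal is split into n_channels bands, the
        envelope of each band modulates band-limited noise, and the channels
        are summed to give the degraded output."""
        edges = np.geomspace(lo, hi, n_channels + 1)       # log-spaced channel edges
        noise = np.random.randn(len(signal))
        out = np.zeros(len(signal))
        for k in range(n_channels):
            b, a = butter(4, [edges[k], edges[k + 1]], btype="band", fs=fs)
            band = filtfilt(b, a, signal)
            envelope = np.abs(hilbert(band))               # Hilbert envelope
            carrier = filtfilt(b, a, noise)                # noise limited to the same band
            out += envelope * carrier
        return out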

  17. Channel normalization technique for speech recognition in mismatched conditions

    CSIR Research Space (South Africa)

    Kleynhans, N

    2008-11-01

    Full Text Available , where one wishes to use any available training data for a variety of purposes. Research into a new channel normalization (CN) technique for channel mismatched speech recognition is presented. A process of inverse linear filtering is used in order...

  18. Improving user-friendliness by using visually supported speech recognition

    NARCIS (Netherlands)

    Waals, J.A.J.S.; Kooi, F.L.; Kriekaard, J.J.

    2002-01-01

    While speech recognition may in principle be one of the most natural interfaces, in practice it is not, due to a lack of user-friendliness. Words are regularly interpreted incorrectly, and subjects tend to articulate in an exaggerated manner. We explored the potential of visually supported error

  19. Appropriate baseline values for HMM-based speech recognition

    CSIR Research Space (South Africa)

    Barnard, E

    2004-11-01

    Full Text Available A number of issues realted to the development of speech-recognition systems with Hidden Markov Models (HMM) are discussed. A set of systematic experiments using the HTK toolkit and the TMIT database are used to elucidate matters such as the number...

  20. Speech emotion recognition based on statistical pitch model

    Institute of Scientific and Technical Information of China (English)

    WANG Zhiping; ZHAO Li; ZOU Cairong

    2006-01-01

    A modified Parzen-window method, which keeps high resolution in low frequencies and smoothness in high frequencies, is proposed to obtain a statistical model. Then, a gender classification method utilizing the statistical model is proposed, which achieves 98% accuracy in gender classification when long sentences are dealt with. By separating the male voice and the female voice, the mean and standard deviation of speech training samples with different emotions are used to create the corresponding emotion models. Then the Bhattacharyya distance between the test sample and the statistical pitch models is utilized for emotion recognition in speech. The normalization of pitch for the male voice and the female voice is also considered, in order to map them into a uniform space. Finally, the speech emotion recognition experiment based on K Nearest Neighbor shows that a correct rate of 81% is achieved, whereas it is only 73.85% if the traditional parameters are utilized.
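
    The emotion decision above relies on the Bhattacharyya distance between pitch statistics; below is a sketch of that distance for univariate Gaussian pitch models, with placeholder means and standard deviations rather than values from the paper.

    import math

    def bhattacharyya_gaussian(mu1, sd1, mu2, sd2):
        """Bhattacharyya distance between two univariate Gaussians."""
        var1, var2 = sd1 ** 2, sd2 ** 2
        mean_term = (mu1 - mu2) ** 2 / (4 * (var1 + var2))
        variance_term = 0.5 * math.log((var1 + var2) / (2 * sd1 * sd2))
        return mean_term + variance_term

    # Placeholder pitch statistics (Hz) for a test utterance and two emotion models.
    test = (210.0, 35.0)
    models = {"neutral": (190.0, 25.0), "anger": (260.0, 55.0)}
    print(min(models, key=lambda emotion: bhattacharyya_gaussian(*test, *models[emotion])))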

  1. Automatic Speech Recognition Systems for the Evaluation of Voice and Speech Disorders in Head and Neck Cancer

    OpenAIRE

    Andreas Maier; Tino Haderlein; Florian Stelzle; Elmar Nöth; Emeka Nkenke; Frank Rosanowski; Anne Schützenberger; Maria Schuster

    2010-01-01

    In patients suffering from head and neck cancer, speech intelligibility is often restricted. For assessment and outcome measurements, automatic speech recognition systems have previously been shown to be appropriate for objective and quick evaluation of intelligibility. In this study we investigate the applicability of the method to speech disorders caused by head and neck cancer. Intelligibility was quantified by speech recognition on recordings of a standard text read by 41 German laryngect...

  2. The Suitability of Cloud-Based Speech Recognition Engines for Language Learning

    Science.gov (United States)

    Daniels, Paul; Iwago, Koji

    2017-01-01

    As online automatic speech recognition (ASR) engines become more accurate and more widely implemented with CALL software, it becomes important to evaluate the effectiveness and the accuracy of these recognition engines using authentic speech samples. This study investigates two of the most prominent cloud-based speech recognition engines--Apple's…

  3. Writing and Speech Recognition : Observing Error Correction Strategies of Professional Writers

    NARCIS (Netherlands)

    Leijten, M.A.J.C.

    2007-01-01

    In this thesis we describe the organization of speech recognition based writing processes. Writing can be seen as a visual representation of spoken language: a combination that speech recognition takes full advantage of. In the field of writing research, speech recognition is a new writing

  4. Incorporating Speech Recognition into a Natural User Interface

    Science.gov (United States)

    Chapa, Nicholas

    2017-01-01

    The Augmented/Virtual Reality (AVR) Lab has been working to study the applicability of recent virtual and augmented reality hardware and software to KSC operations. This includes the Oculus Rift, HTC Vive, Microsoft HoloLens, and Unity game engine. My project in this lab is to integrate voice recognition and voice commands into an easy-to-modify system that can be added to an existing portion of a Natural User Interface (NUI). A NUI is an intuitive and simple-to-use interface incorporating visual, touch, and speech recognition. The inclusion of speech recognition capability will allow users to perform actions or make inquiries using only their voice. The simplicity of needing only to speak to control an on-screen object or enact some digital action means that any user can quickly become accustomed to using this system. Multiple programs were tested for use in a speech command and recognition system. Sphinx4 translates speech to text using a Hidden Markov Model (HMM) based Language Model, an Acoustic Model, and a word Dictionary running on Java. PocketSphinx had similar functionality to Sphinx4 but instead ran on C. However, neither of these programs was ideal, as building a Java or C wrapper slowed performance. The most suitable speech recognition system tested was the Unity Engine Grammar Recognizer. A Context Free Grammar (CFG) structure is written in an XML file to specify the structure of phrases and words that will be recognized by the Unity Grammar Recognizer. Using Speech Recognition Grammar Specification (SRGS) 1.0 makes modifying the recognized combinations of words and phrases very simple and quick to do. With SRGS 1.0, semantic information can also be added to the XML file, which allows for even more control over how spoken words and phrases are interpreted by Unity. Additionally, using a CFG with SRGS 1.0 produces Finite State Machine (FSM) functionality, limiting the potential for incorrectly heard words or phrases. The purpose of my project was to

  5. Biologically inspired emotion recognition from speech

    Directory of Open Access Journals (Sweden)

    Buscicchio Cosimo

    2011-01-01

    Full Text Available Abstract Emotion recognition has become a fundamental task in human-computer interaction systems. In this article, we propose an emotion recognition approach based on biologically inspired methods. Specifically, emotion classification is performed using a long short-term memory (LSTM) recurrent neural network which is able to recognize long-range dependencies between successive temporal patterns. We propose to represent data using features derived from two different models: mel-frequency cepstral coefficients (MFCC) and the Lyon cochlear model. In the experimental phase, results obtained from the LSTM network and the two different feature sets are compared, showing that features derived from the Lyon cochlear model give better recognition results in comparison with those obtained with the traditional MFCC representation.

  6. Automatic speech recognition for report generation in computed tomography

    International Nuclear Information System (INIS)

    Teichgraeber, U.K.M.; Ehrenstein, T.; Lemke, M.; Liebig, T.; Stobbe, H.; Hosten, N.; Keske, U.; Felix, R.

    1999-01-01

    Purpose: A study was performed to compare the performance of automatic speech recognition (ASR) with conventional transcription. Materials and Methods: 100 CT reports were generated by using ASR and 100 CT reports were dictated and written by medical transcriptionists. The time for dictation and correction of errors by the radiologist was assessed and the type of mistakes was analysed. The text recognition rate was calculated in both groups and the average time between completion of the imaging study by the technologist and generation of the written report was assessed. A commercially available speech recognition technology (ASKA Software, IBM Via Voice) running on a personal computer was used. Results: The time for the dictation using digital voice recognition was 9.4±2.3 min compared to 4.5±3.6 min with an ordinary Dictaphone. The text recognition rate was 97% with digital voice recognition and 99% with medical transcriptionists. The average time from imaging completion to written report finalisation was reduced from 47.3 hours with medical transcriptionists to 12.7 hours with ASR. The analysis of misspellings demonstrated (ASR vs. medical transcriptionists): 3 vs. 4 for syntax errors, 0 vs. 37 orthographic mistakes, 16 vs. 22 mistakes in substance and 47 vs. erroneously applied terms. Conclusions: The use of digital voice recognition as a replacement for medical transcription is recommendable when an immediate availability of written reports is necessary. (orig.) [de

  7. A Research of Speech Emotion Recognition Based on Deep Belief Network and SVM

    Directory of Open Access Journals (Sweden)

    Chenchen Huang

    2014-01-01

    Full Text Available Feature extraction is a very important part of speech emotion recognition, and to address feature extraction in speech emotion recognition problems, this paper proposes a new method of feature extraction, using DBNs in a DNN to extract emotional features from the speech signal automatically. By training a 5-layer-deep DBN, speech emotion features are extracted and multiple consecutive frames are incorporated to form a high-dimensional feature. The features obtained after training in the DBNs were the input of a nonlinear SVM classifier, and finally a multiple-classifier speech emotion recognition system was achieved. The speech emotion recognition rate of the system reached 86.5%, which was 7% higher than with the original method.

  8. Hemispheric lateralization of linguistic prosody recognition in comparison to speech and speaker recognition.

    Science.gov (United States)

    Kreitewolf, Jens; Friederici, Angela D; von Kriegstein, Katharina

    2014-11-15

    Hemispheric specialization for linguistic prosody is a controversial issue. While it is commonly assumed that linguistic prosody and emotional prosody are preferentially processed in the right hemisphere, neuropsychological work directly comparing processes of linguistic prosody and emotional prosody suggests a predominant role of the left hemisphere for linguistic prosody processing. Here, we used two functional magnetic resonance imaging (fMRI) experiments to clarify the role of left and right hemispheres in the neural processing of linguistic prosody. In the first experiment, we sought to confirm previous findings showing that linguistic prosody processing compared to other speech-related processes predominantly involves the right hemisphere. Unlike previous studies, we controlled for stimulus influences by employing a prosody and speech task using the same speech material. The second experiment was designed to investigate whether a left-hemispheric involvement in linguistic prosody processing is specific to contrasts between linguistic prosody and emotional prosody or whether it also occurs when linguistic prosody is contrasted against other non-linguistic processes (i.e., speaker recognition). Prosody and speaker tasks were performed on the same stimulus material. In both experiments, linguistic prosody processing was associated with activity in temporal, frontal, parietal and cerebellar regions. Activation in temporo-frontal regions showed differential lateralization depending on whether the control task required recognition of speech or speaker: recognition of linguistic prosody predominantly involved right temporo-frontal areas when it was contrasted against speech recognition; when contrasted against speaker recognition, recognition of linguistic prosody predominantly involved left temporo-frontal areas. The results show that linguistic prosody processing involves functions of both hemispheres and suggest that recognition of linguistic prosody is based on

  9. Robust Speaker Authentication Based on Combined Speech and Voiceprint Recognition

    Science.gov (United States)

    Malcangi, Mario

    2009-08-01

    Personal authentication is becoming increasingly important in many applications that have to protect proprietary data. Passwords and personal identification numbers (PINs) prove not to be robust enough to ensure that unauthorized people do not use them. Biometric authentication technology may offer a secure, convenient, accurate solution but sometimes fails due to its intrinsically fuzzy nature. This research aims to demonstrate that combining two basic speech processing methods, voiceprint identification and speech recognition, can provide a very high degree of robustness, especially if fuzzy decision logic is used.

  10. Speech Emotion Recognition Based on Power Normalized Cepstral Coefficients in Noisy Conditions

    Directory of Open Access Journals (Sweden)

    M. Bashirpour

    2016-09-01

    Full Text Available Automatic recognition of speech emotional states in noisy conditions has become an important research topic in the emotional speech recognition area in recent years. This paper considers the recognition of emotional states via speech in real environments. For this task, we employ the power normalized cepstral coefficients (PNCC) in a speech emotion recognition system. We investigate its performance in emotion recognition using clean and noisy speech materials and compare it with the performances of the well-known MFCC, LPCC, RASTA-PLP, and also TEMFCC features. Speech samples are extracted from the Berlin emotional speech database (Emo DB) and the Persian emotional speech database (Persian ESD), which are corrupted with 4 different noise types under various SNR levels. The experiments are conducted in clean train/noisy test scenarios to simulate practical conditions with noise sources. Simulation results show that higher recognition rates are achieved for PNCC as compared with the conventional features under noisy conditions.
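
    As a sketch of the clean-train/noisy-test setup described above, the following Python snippet mixes noise into a signal at a chosen SNR and extracts cepstral features; MFCCs (via librosa, assumed to be installed) stand in for PNCC, which is not available in standard Python libraries, and the "speech" here is just a placeholder tone.

        import numpy as np
        import librosa

        def mix_at_snr(speech, noise, snr_db):
            """Scale `noise` so that the speech-to-noise power ratio equals `snr_db`."""
            noise = noise[:len(speech)]
            p_speech = np.mean(speech ** 2)
            p_noise = np.mean(noise ** 2) + 1e-12
            scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
            return speech + scale * noise

        sr = 16000
        t = np.arange(sr) / sr
        speech = 0.1 * np.sin(2 * np.pi * 220 * t)          # placeholder "speech"
        noise = np.random.default_rng(0).normal(size=sr)    # white noise
        noisy = mix_at_snr(speech, noise, snr_db=5)
        mfcc = librosa.feature.mfcc(y=noisy.astype(np.float32), sr=sr, n_mfcc=13)
        print(mfcc.shape)                                    # (13, n_frames)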

  11. Speech pattern recognition for forensic acoustic purposes

    OpenAIRE

    Herrera Martínez, Marcelo; Aldana Blanco, Andrea Lorena; Guzmán Palacios, Ana María

    2014-01-01

    The present paper describes the development of a software package for the analysis of acoustic voice parameters (APAVOIX), which can be used for forensic acoustic purposes based on speaker recognition and identification. This software makes it possible to observe clearly the parameters that are sufficient and necessary when comparing two voice signals, the suspect one and the original one. These parameters are used according to the classic method, generally used by state entit...

  12. Auditory Modeling for Noisy Speech Recognition.

    Science.gov (United States)

    2000-01-01

    multiple platforms including PCs, workstations, and DSPs. A prototype version of the SOS process was tested on the Japanese Hiragana language with good... judgment among linguists. American English has 48 phonetic sounds in the ARPABET representation. Hiragana, the Japanese phonetic language, has only 20... "Japanese Hiragana," H.L. Pfister, FL 95, 1995. "State Recognition for Noisy Dynamic Systems," H.L. Pfister, Tech 2005, Chicago, 1995. "Experiences

  13. Automatic Phonetic Transcription for Danish Speech Recognition

    DEFF Research Database (Denmark)

    Kirkedal, Andreas Søeborg

    , like Danish, the graphemic and phonetic representations are very dissimilar and more complex rewriting rules must be applied to create the correct phonetic representation. Automatic phonetic transcribers use different strategies, from deep analysis to shallow rewriting rules, to produce phonetic......, syllabication, stød and several other suprasegmental features (Kirkedal, 2013). Simplifying the transcriptions by filtering out the symbols for suprasegmental features in a post-processing step produces a format that is suitable for ASR purposes. eSpeak is an open source speech synthesizer originally created...... for particular words and word classes in addition. In comparison, English has 5,852 spelling-to-phoneme rules and 4,133 additional rules and 8,278 rules and 3,829 additional rules. Phonix applies deep morphological analysis as a preprocessing step. Should the analysis fail, several fallback strategies...

  14. Part-of-Speech Enhanced Context Recognition

    DEFF Research Database (Denmark)

    Madsen, Rasmus Elsborg; Larsen, Jan; Hansen, Lars Kai

    2004-01-01

    Language independent 'bag-of-words' representations are surprisingly effective for text classification. In this communication our aim is to elucidate the synergy between language independent features and simple language model features. We consider term tag features estimated by a so-called part...... and a probabilistic neural network classifier. Three medium size data-sets are analyzed and we find consistent synergy between the term and natural language features in all three sets for a range of training set sizes. The most significant enhancement is found for small text databases where high recognition...
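
    The synergy between term features and tag features can be sketched in Python as follows. The tiny tag lexicon is purely hypothetical and stands in for a real part-of-speech tagger, and a naive Bayes classifier stands in for the probabilistic neural network used in the paper.

        from sklearn.feature_extraction.text import CountVectorizer
        from sklearn.naive_bayes import MultinomialNB
        from sklearn.pipeline import FeatureUnion, Pipeline

        TOY_TAGS = {"the": "DET", "cat": "NOUN", "dog": "NOUN",
                    "runs": "VERB", "sleeps": "VERB", "fast": "ADV"}

        def to_tag_string(doc):
            # Replace each word with its (toy) part-of-speech tag.
            return " ".join(TOY_TAGS.get(w, "OTHER") for w in doc.lower().split())

        docs = ["the cat sleeps", "the dog runs fast", "the cat runs", "the dog sleeps"]
        labels = [0, 1, 1, 0]

        features = FeatureUnion([
            ("terms", CountVectorizer()),                           # bag-of-words
            ("tags", CountVectorizer(preprocessor=to_tag_string)),  # bag-of-POS-tags
        ])
        clf = Pipeline([("features", features), ("nb", MultinomialNB())])
        clf.fit(docs, labels)
        print(clf.predict(["the cat runs fast"]))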

  15. An automatic speech recognition system with speaker-independent identification support

    Science.gov (United States)

    Caranica, Alexandru; Burileanu, Corneliu

    2015-02-01

    The novelty of this work relies on the application of an open source research software toolkit (CMU Sphinx) to train, build and evaluate a speech recognition system, with speaker-independent support, for voice-controlled hardware applications. Moreover, we propose to use the trained acoustic model to successfully decode offline voice commands on embedded hardware, such as an ARMv6 low-cost SoC, Raspberry PI. This type of single-board computer, mainly used for educational and research activities, can serve as a proof-of-concept software and hardware stack for low cost voice automation systems.

  16. Speech Acquisition and Automatic Speech Recognition for Integrated Spacesuit Audio Systems

    Science.gov (United States)

    Huang, Yiteng; Chen, Jingdong; Chen, Shaoyan

    2010-01-01

    A voice-command human-machine interface system has been developed for spacesuit extravehicular activity (EVA) missions. A multichannel acoustic signal processing method has been created for distant speech acquisition in noisy and reverberant environments. This technology reduces noise by exploiting differences in the statistical nature of signal (i.e., speech) and noise that exists in the spatial and temporal domains. As a result, the automatic speech recognition (ASR) accuracy can be improved to the level at which crewmembers would find the speech interface useful. The developed speech human/machine interface will enable both crewmember usability and operational efficiency. It offers a fast rate of data/text entry, a small overall size, and light weight. In addition, this design will free the hands and eyes of a suited crewmember. The system components and steps include beam forming/multi-channel noise reduction, single-channel noise reduction, speech feature extraction, feature transformation and normalization, feature compression, model adaptation, ASR HMM (Hidden Markov Model) training, and ASR decoding. A state-of-the-art phoneme recognizer can obtain an accuracy rate of 65 percent when the training and testing data are free of noise. When it is used in spacesuits, the rate drops to about 33 percent. With the developed microphone array speech-processing technologies, the performance is improved and the phoneme recognition accuracy rate rises to 44 percent. The recognizer can be further improved by combining the microphone array and HMM model adaptation techniques and using speech samples collected from inside spacesuits. In addition, arithmetic complexity models for the major HMM-based ASR components were developed. They can help real-time ASR system designers select proper tasks when facing constraints in computational resources.
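
    The multichannel noise-reduction step can be illustrated, in a much simplified form, by a delay-and-sum beamformer: the microphone signals are time-aligned toward the assumed talker direction and averaged, which attenuates spatially incoherent noise. This is only an assumed, minimal stand-in for the actual processing; the delays and noise levels below are invented.

        import numpy as np

        def delay_and_sum(channels, delays):
            """channels: (n_mics, n_samples); delays: integer sample delay per mic."""
            n_mics = channels.shape[0]
            aligned = [np.roll(channels[m], -delays[m]) for m in range(n_mics)]
            return np.mean(aligned, axis=0)

        rng = np.random.default_rng(1)
        n = 16000
        target = np.sin(2 * np.pi * 300 * np.arange(n) / 16000)
        delays = [0, 2, 4, 6]                       # assumed propagation delays per mic
        mics = np.stack([np.roll(target, d) + 0.5 * rng.normal(size=n) for d in delays])
        enhanced = delay_and_sum(mics, delays)
        print("residual noise power:", np.mean((enhanced - target) ** 2))  # roughly 0.25 / 4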

  17. Auditory analysis for speech recognition based on physiological models

    Science.gov (United States)

    Jeon, Woojay; Juang, Biing-Hwang

    2004-05-01

    To address the limitations of traditional cepstrum or LPC based front-end processing methods for automatic speech recognition, more elaborate methods based on physiological models of the human auditory system may be used to achieve more robust speech recognition in adverse environments. For this purpose, a modified version of a model of the primary auditory cortex featuring a three dimensional mapping of auditory spectra [Wang and Shamma, IEEE Trans. Speech Audio Process. 3, 382-395 (1995)] is adopted and investigated for its use as an improved front-end processing method. The study is conducted in two ways: first, by relating the model's redundant representation to traditional spectral representations and showing that the former not only encompasses information provided by the latter, but also reveals more relevant information that makes it superior in describing the identifying features of speech signals; and second, by observing the statistical features of the representation for various classes of sound to show how different identifying features manifest themselves as specific patterns on the cortical map, thereby becoming a place-coded data set on which detection theory could be applied to simulate auditory perception and cognition.

  18. Phase effects in masking by harmonic complexes: speech recognition.

    Science.gov (United States)

    Deroche, Mickael L D; Culling, John F; Chatterjee, Monita

    2013-12-01

    Harmonic complexes that generate highly modulated temporal envelopes on the basilar membrane (BM) mask a tone less effectively than complexes that generate relatively flat temporal envelopes, because the non-linear active gain of the BM selectively amplifies a low-level tone in the dips of a modulated masker envelope. The present study examines a similar effect in speech recognition. Speech reception thresholds (SRTs) were measured for a voice masked by harmonic complexes with partials in sine phase (SP) or in random phase (RP). The masker's fundamental frequency (F0) was 50, 100 or 200 Hz. SRTs were considerably lower for SP than for RP maskers at 50-Hz F0, but the two converged at 100-Hz F0, while at 200-Hz F0, SRTs were a little higher for SP than RP maskers. The results were similar whether the target voice was male or female and whether the masker's spectral profile was flat or speech-shaped. Although listening in the masker dips has been shown to play a large role for artificial stimuli such as Schroeder-phase complexes at high levels, it contributes weakly to speech recognition in the presence of harmonic maskers with different crest factors at more moderate sound levels (65 dB SPL). Copyright © 2013 Elsevier B.V. All rights reserved.
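
    The two masker types can be generated with a few lines of Python; this is only an illustrative reconstruction (equal-amplitude partials up to 5 kHz are an assumption, not the study's exact stimuli). The crest factor printed at the end shows that the sine-phase (SP) complex has a much "peakier" envelope than the random-phase (RP) complex.

        import numpy as np

        def harmonic_complex(f0, fs=44100, dur=0.5, fmax=5000, random_phase=False):
            rng = np.random.default_rng(0)
            t = np.arange(int(fs * dur)) / fs
            x = np.zeros_like(t)
            for k in range(1, int(fmax // f0) + 1):
                phase = rng.uniform(0, 2 * np.pi) if random_phase else 0.0
                x += np.sin(2 * np.pi * k * f0 * t + phase)
            return x / np.max(np.abs(x))

        sp = harmonic_complex(50, random_phase=False)   # sine-phase masker
        rp = harmonic_complex(50, random_phase=True)    # random-phase masker
        for name, x in [("SP", sp), ("RP", rp)]:
            print(name, "crest factor:", np.max(np.abs(x)) / np.sqrt(np.mean(x ** 2)))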

  19. A New Fuzzy Cognitive Map Learning Algorithm for Speech Emotion Recognition

    OpenAIRE

    Zhang, Wei; Zhang, Xueying; Sun, Ying

    2017-01-01

    Selecting an appropriate recognition method is crucial in speech emotion recognition applications. However, the current methods do not consider the relationship between emotions. Thus, in this study, a speech emotion recognition system based on the fuzzy cognitive map (FCM) approach is constructed. Moreover, a new FCM learning algorithm for speech emotion recognition is proposed. This algorithm includes the use of the pleasure-arousal-dominance emotion scale to calculate the weights between e...

  20. Automatic Speech Recognition Systems for the Evaluation of Voice and Speech Disorders in Head and Neck Cancer

    Directory of Open Access Journals (Sweden)

    Andreas Maier

    2010-01-01

    Full Text Available In patients suffering from head and neck cancer, speech intelligibility is often restricted. For assessment and outcome measurements, automatic speech recognition systems have previously been shown to be appropriate for objective and quick evaluation of intelligibility. In this study we investigate the applicability of the method to speech disorders caused by head and neck cancer. Intelligibility was quantified by speech recognition on recordings of a standard text read by 41 German laryngectomized patients with cancer of the larynx or hypopharynx and 49 German patients who had suffered from oral cancer. The speech recognition provides the percentage of correctly recognized words of a sequence, that is, the word recognition rate. Automatic evaluation was compared to perceptual ratings by a panel of experts and to an age-matched control group. Both patient groups showed significantly lower word recognition rates than the control group. Automatic speech recognition yielded word recognition rates which complied with experts' evaluation of intelligibility on a significant level. Automatic speech recognition provides a low-effort means to objectify and quantify the most important aspect of pathologic speech: intelligibility. The system was successfully applied to voice and speech disorders.

  1. EXTENDED SPEECH EMOTION RECOGNITION AND PREDICTION

    Directory of Open Access Journals (Sweden)

    Theodoros Anagnostopoulos

    2014-11-01

    Full Text Available Humans are considered to reason and act rationally, and that is believed to be their fundamental difference from the rest of living entities. Furthermore, modern approaches in the science of psychology underline that humans, as thinking creatures, are also sentimental and emotional organisms. There are fifteen universal extended emotions plus the neutral emotion: hot anger, cold anger, panic, fear, anxiety, despair, sadness, elation, happiness, interest, boredom, shame, pride, disgust, contempt and the neutral position. The scope of the current research is to understand the emotional state of a human being by capturing the speech utterances that one uses during a common conversation. It is proved that, given enough acoustic evidence, the emotional state of a person can be classified by a set of majority-voting classifiers. The proposed set of classifiers is based on three main classifiers: kNN, C4.5 and SVM with an RBF kernel. This set achieves better performance than each basic classifier taken separately. It is compared with two other sets of classifiers: a one-against-all (OAA) multiclass SVM with hybrid kernels and a set consisting of the two basic classifiers C5.0 and a neural network. The proposed variant achieves better performance than the other two sets of classifiers. The paper deals with emotion classification by a set of majority-voting classifiers that combines three types of basic classifiers with low computational complexity. The basic classifiers stem from different theoretical backgrounds in order to avoid bias and redundancy, which gives the proposed set of classifiers the ability to generalize in the emotion domain space.
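
    A minimal scikit-learn sketch of such a majority-voting ensemble is shown below; a decision tree stands in for C4.5, and the data are random placeholders for acoustic features rather than anything from the paper.

        import numpy as np
        from sklearn.ensemble import VotingClassifier
        from sklearn.neighbors import KNeighborsClassifier
        from sklearn.svm import SVC
        from sklearn.tree import DecisionTreeClassifier

        rng = np.random.default_rng(0)
        X = rng.normal(size=(320, 20))      # placeholder acoustic feature vectors
        y = rng.integers(0, 16, size=320)   # 15 extended emotions + neutral

        ensemble = VotingClassifier(
            estimators=[
                ("knn", KNeighborsClassifier(n_neighbors=5)),
                ("tree", DecisionTreeClassifier(max_depth=10, random_state=0)),
                ("svm", SVC(kernel="rbf", gamma="scale")),
            ],
            voting="hard",                  # simple majority vote
        )
        ensemble.fit(X, y)
        print(ensemble.predict(X[:3]))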

  2. Automatic Speech Acquisition and Recognition for Spacesuit Audio Systems

    Science.gov (United States)

    Ye, Sherry

    2015-01-01

    NASA has a widely recognized but unmet need for novel human-machine interface technologies that can facilitate communication during astronaut extravehicular activities (EVAs), when loud noises and strong reverberations inside spacesuits make communication challenging. WeVoice, Inc., has developed a multichannel signal-processing method for speech acquisition in noisy and reverberant environments that enables automatic speech recognition (ASR) technology inside spacesuits. The technology reduces noise by exploiting differences between the statistical nature of signals (i.e., speech) and noise that exists in the spatial and temporal domains. As a result, ASR accuracy can be improved to the level at which crewmembers will find the speech interface useful. System components and features include beam forming/multichannel noise reduction, single-channel noise reduction, speech feature extraction, feature transformation and normalization, feature compression, and ASR decoding. Arithmetic complexity models were developed and will help designers of real-time ASR systems select proper tasks when confronted with constraints in computational resources. In Phase I of the project, WeVoice validated the technology. The company further refined the technology in Phase II and developed a prototype for testing and use by suited astronauts.

  3. Emerging technologies with potential for objectively evaluating speech recognition skills.

    Science.gov (United States)

    Rawool, Vishakha Waman

    2016-01-01

    Work-related exposure to noise and other ototoxins can cause damage to the cochlea, synapses between the inner hair cells, the auditory nerve fibers, and higher auditory pathways, leading to difficulties in recognizing speech. Procedures designed to determine speech recognition scores (SRS) in an objective manner can be helpful in disability compensation cases where the worker claims to have poor speech perception due to exposure to noise or ototoxins. Such measures can also be helpful in determining SRS in individuals who cannot provide reliable responses to speech stimuli, including patients with Alzheimer's disease, traumatic brain injuries, and infants with and without hearing loss. Cost-effective neural monitoring hardware and software is being rapidly refined due to the high demand for neurogaming (games involving the use of brain-computer interfaces), health, and other applications. More specifically, two related advances in neuro-technology include relative ease in recording neural activity and availability of sophisticated analysing techniques. These techniques are reviewed in the current article and their applications for developing objective SRS procedures are proposed. Issues related to neuroaudioethics (ethics related to collection of neural data evoked by auditory stimuli including speech) and neurosecurity (preservation of a person's neural mechanisms and free will) are also discussed.

  4. Syntactic error modeling and scoring normalization in speech recognition: Error modeling and scoring normalization in the speech recognition task for adult literacy training

    Science.gov (United States)

    Olorenshaw, Lex; Trawick, David

    1991-01-01

    The purpose was to develop a speech recognition system to be able to detect speech which is pronounced incorrectly, given that the text of the spoken speech is known to the recognizer. Better mechanisms are provided for using speech recognition in a literacy tutor application. Using a combination of scoring normalization techniques and cheater-mode decoding, a reasonable acceptance/rejection threshold was provided. In continuous speech, the system was tested to be able to provide above 80 pct. correct acceptance of words, while correctly rejecting over 80 pct. of incorrectly pronounced words.

  5. Speech recognition: impact on workflow and report availability

    International Nuclear Information System (INIS)

    Glaser, C.; Trumm, C.; Nissen-Meyer, S.; Francke, M.; Kuettner, B.; Reiser, M.

    2005-01-01

    With ongoing technical refinements speech recognition systems (SRS) are becoming an increasingly attractive alternative to traditional methods of preparing and transcribing medical reports. The two main components of any SRS are the acoustic model and the language model. Features of modern SRS with continuous speech recognition are macros with individually definable texts and report templates as well as the option to navigate in a text or to control SRS or RIS functions by speech recognition. The best benefit from SRS can be obtained if it is integrated into a RIS/RIS-PACS installation. Report availability and time efficiency of the reporting process (related to recognition rate, time expenditure for editing and correcting a report) are the principal determinants of the clinical performance of any SRS. For practical purposes the recognition rate is estimated by the error rate (unit "word"). Error rates range from 4 to 28%. Roughly 20% of them are errors in the vocabulary which may result in clinically relevant misinterpretation. It is thus mandatory to thoroughly correct any transcribed text as well as to continuously train and adapt the SRS vocabulary. The implementation of SRS dramatically improves report availability. This is most pronounced for CT and CR. However, the individual time expenditure for (SRS-based) reporting increased by 20-25% (CR) and according to literature data there is an increase by 30% for CT and MRI. The extent to which the transcription staff profits from SRS depends largely on its qualification. Online dictation implies a workload shift from the transcription staff to the reporting radiologist. (orig.)
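
    The word-level error rate referred to above is conventionally computed as the Levenshtein (edit) distance between the reference and the transcribed word sequences, divided by the number of reference words. A small self-contained sketch (not the evaluated software) follows.

        def word_error_rate(reference, hypothesis):
            ref, hyp = reference.split(), hypothesis.split()
            d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
            for i in range(len(ref) + 1):
                d[i][0] = i
            for j in range(len(hyp) + 1):
                d[0][j] = j
            for i in range(1, len(ref) + 1):
                for j in range(1, len(hyp) + 1):
                    cost = 0 if ref[i - 1] == hyp[j - 1] else 1
                    d[i][j] = min(d[i - 1][j] + 1,         # deletion
                                  d[i][j - 1] + 1,         # insertion
                                  d[i - 1][j - 1] + cost)  # substitution
            return d[len(ref)][len(hyp)] / max(len(ref), 1)

        print(word_error_rate("no focal lesion seen", "no fecal lesion is seen"))  # 0.5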

  6. Emotionally conditioning the target-speech voice enhances recognition of the target speech under "cocktail-party" listening conditions.

    Science.gov (United States)

    Lu, Lingxi; Bao, Xiaohan; Chen, Jing; Qu, Tianshu; Wu, Xihong; Li, Liang

    2018-05-01

    Under a noisy "cocktail-party" listening condition with multiple people talking, listeners can use various perceptual/cognitive unmasking cues to improve recognition of the target speech against informational speech-on-speech masking. One potential unmasking cue is the emotion expressed in a speech voice, by means of certain acoustical features. However, it was unclear whether emotionally conditioning a target-speech voice that has none of the typical acoustical features of emotions (i.e., an emotionally neutral voice) can be used by listeners for enhancing target-speech recognition under speech-on-speech masking conditions. In this study we examined the recognition of target speech against a two-talker speech masker both before and after the emotionally neutral target voice was paired with a loud female screaming sound that has a marked negative emotional valence. The results showed that recognition of the target speech (especially the first keyword in a target sentence) was significantly improved by emotionally conditioning the target speaker's voice. Moreover, the emotional unmasking effect was independent of the unmasking effect of the perceived spatial separation between the target speech and the masker. Also, (skin conductance) electrodermal responses became stronger after emotional learning when the target speech and masker were perceptually co-located, suggesting an increase of listening efforts when the target speech was informationally masked. These results indicate that emotionally conditioning the target speaker's voice does not change the acoustical parameters of the target-speech stimuli, but the emotionally conditioned vocal features can be used as cues for unmasking target speech.

  7. Speech Silicon: An FPGA Architecture for Real-Time Hidden Markov-Model-Based Speech Recognition

    Directory of Open Access Journals (Sweden)

    Schuster Jeffrey

    2006-01-01

    Full Text Available This paper examines the design of an FPGA-based system-on-a-chip capable of performing continuous speech recognition on medium sized vocabularies in real time. Through the creation of three dedicated pipelines, one for each of the major operations in the system, we were able to maximize the throughput of the system while simultaneously minimizing the number of pipeline stalls in the system. Further, by implementing a token-passing scheme between the later stages of the system, the complexity of the control was greatly reduced and the amount of active data present in the system at any time was minimized. Additionally, through in-depth analysis of the SPHINX 3 large vocabulary continuous speech recognition engine, we were able to design models that could be efficiently benchmarked against a known software platform. These results, combined with the ability to reprogram the system for different recognition tasks, serve to create a system capable of performing real-time speech recognition in a vast array of environments.
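
    The token-passing idea mentioned above can be sketched for a single left-to-right HMM as follows; this is an illustrative software analogue, not the FPGA design, and the model sizes and probabilities are invented.

        import numpy as np

        def viterbi_token_passing(log_obs, log_trans):
            """log_obs: (T, S) per-frame state log-likelihoods;
               log_trans: (S, S) transition log-probabilities."""
            T, S = log_obs.shape
            tokens = np.full(S, -np.inf)
            tokens[0] = log_obs[0, 0]                  # start in the first state
            for t in range(1, T):
                passed = tokens[:, None] + log_trans   # token in state i passed to state j
                tokens = passed.max(axis=0) + log_obs[t]
            return tokens.max()                        # best path score

        rng = np.random.default_rng(0)
        T, S = 20, 5
        log_obs = np.log(rng.uniform(0.1, 1.0, size=(T, S)))
        log_trans = np.full((S, S), -np.inf)
        for s in range(S):                             # left-to-right: stay or advance
            log_trans[s, s] = np.log(0.6)
            if s + 1 < S:
                log_trans[s, s + 1] = np.log(0.4)
        print("best log score:", viterbi_token_passing(log_obs, log_trans))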

  8. Speech Silicon: An FPGA Architecture for Real-Time Hidden Markov-Model-Based Speech Recognition

    Directory of Open Access Journals (Sweden)

    Alex K. Jones

    2006-11-01

    Full Text Available This paper examines the design of an FPGA-based system-on-a-chip capable of performing continuous speech recognition on medium sized vocabularies in real time. Through the creation of three dedicated pipelines, one for each of the major operations in the system, we were able to maximize the throughput of the system while simultaneously minimizing the number of pipeline stalls in the system. Further, by implementing a token-passing scheme between the later stages of the system, the complexity of the control was greatly reduced and the amount of active data present in the system at any time was minimized. Additionally, through in-depth analysis of the SPHINX 3 large vocabulary continuous speech recognition engine, we were able to design models that could be efficiently benchmarked against a known software platform. These results, combined with the ability to reprogram the system for different recognition tasks, serve to create a system capable of performing real-time speech recognition in a vast array of environments.

  9. Integration of asynchronous knowledge sources in a novel speech recognition framework

    OpenAIRE

    Van hamme, Hugo

    2008-01-01

    Van hamme H., ''Integration of asynchronous knowledge sources in a novel speech recognition framework'', Proceedings ITRW on speech analysis and processing for knowledge discovery, 4 pp., June 2008, Aalborg, Denmark.

  10. Recognition of In-Ear Microphone Speech Data Using Multi-Layer Neural Networks

    National Research Council Canada - National Science Library

    Bulbuller, Gokhan

    2006-01-01

    .... In this study, a speech recognition system is presented, specifically an isolated word recognizer which uses speech collected from the external auditory canals of the subjects via an in-ear microphone...

  11. Improving on hidden Markov models: An articulatorily constrained, maximum likelihood approach to speech recognition and speech coding

    Energy Technology Data Exchange (ETDEWEB)

    Hogden, J.

    1996-11-05

    The goal of the proposed research is to test a statistical model of speech recognition that incorporates the knowledge that speech is produced by relatively slow motions of the tongue, lips, and other speech articulators. This model is called Maximum Likelihood Continuity Mapping (Malcom). Many speech researchers believe that by using constraints imposed by articulator motions, we can improve or replace the current hidden Markov model based speech recognition algorithms. Unfortunately, previous efforts to incorporate information about articulation into speech recognition algorithms have suffered because (1) slight inaccuracies in our knowledge or the formulation of our knowledge about articulation may decrease recognition performance, (2) small changes in the assumptions underlying models of speech production can lead to large changes in the speech derived from the models, and (3) collecting measurements of human articulator positions in sufficient quantity for training a speech recognition algorithm is still impractical. The most interesting (and in fact, unique) quality of Malcom is that, even though Malcom makes use of a mapping between acoustics and articulation, Malcom can be trained to recognize speech using only acoustic data. By learning the mapping between acoustics and articulation using only acoustic data, Malcom avoids the difficulties involved in collecting articulator position measurements and does not require an articulatory synthesizer model to estimate the mapping between vocal tract shapes and speech acoustics. Preliminary experiments that demonstrate that Malcom can learn the mapping between acoustics and articulation are discussed. Potential applications of Malcom aside from speech recognition are also discussed. Finally, specific deliverables resulting from the proposed research are described.

  12. Variable Frame Rate and Length Analysis for Data Compression in Distributed Speech Recognition

    DEFF Research Database (Denmark)

    Kraljevski, Ivan; Tan, Zheng-Hua

    2014-01-01

    This paper addresses the issue of data compression in distributed speech recognition on the basis of a variable frame rate and length analysis method. The method first conducts frame selection by using a posteriori signal-to-noise ratio weighted energy distance to find the right time resolution...... length for steady regions. The method is applied to scalable source coding in distributed speech recognition where the target bitrate is met by adjusting the frame rate. Speech recognition results show that the proposed approach outperforms other compression methods in terms of recognition accuracy...... for noisy speech while achieving higher compression rates....
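
    A much simplified Python sketch of the frame-selection idea (without the a posteriori SNR weighting, which is specific to the paper) keeps a frame only when the accumulated log-energy distance since the last kept frame exceeds a threshold, so steady regions contribute few frames.

        import numpy as np

        def select_frames(frames, threshold=1.0):
            log_energy = np.log(np.sum(frames ** 2, axis=1) + 1e-10)
            kept, accumulated = [0], 0.0
            for i in range(1, len(frames)):
                accumulated += abs(log_energy[i] - log_energy[i - 1])
                if accumulated >= threshold:
                    kept.append(i)
                    accumulated = 0.0
            return kept

        rng = np.random.default_rng(0)
        steady = 0.01 * rng.normal(size=(50, 160))   # low-variation region
        burst = rng.normal(size=(10, 160))           # speech-like burst
        frames = np.concatenate([steady, burst, steady])
        print("kept", len(select_frames(frames)), "of", len(frames), "frames")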

  13. Automated recognition system for power quality disturbances

    Science.gov (United States)

    Abdelgalil, Tarek

    amplitudes and frequencies, an Artificial Neural Network is employed to identify the switched capacitor by using amplitudes and frequencies extracted from the transient signal. The new algorithms for detecting, tracking, and classifying power quality disturbances demonstrate the potential for further development of a fully automated recognition system for the assessment of power quality. This is possible because the implementation of the proposed algorithms in the power quality monitoring device becomes a straightforward process by modifying the device software.

  14. The effect of network degradation on speech recognition

    CSIR Research Space (South Africa)

    Joubert, G

    2005-11-01

    Full Text Available become increasingly popular, VoIP (Voice over Internet Protocol) is predicted to become the standard means of spoken telecommunication. As a consequence, a significant amount of research has been undertaken on the effect of various packet... to measure the effect of network traffic degradation during a VoIP transmission on speech-recognition accuracy. Sentences from the TIMIT database [2] were selected as a basis for comparison. The open-source toolkit SOX [3] was used to code the samples...

  15. Emotion Recognition of Speech Signals Based on Filter Methods

    Directory of Open Access Journals (Sweden)

    Narjes Yazdanian

    2016-10-01

    Full Text Available Speech is the basic means of communication among human beings. With the increasing interaction between humans and machines, the need for automatic dialogue without a human intermediary has been considered. The aim of this study was to determine a set of affective features of the speech signal that are related to emotions. The system designed in this study includes three main sections: feature extraction, feature selection and classification. After extraction of useful features such as mel frequency cepstral coefficients (MFCC), linear prediction cepstral coefficients (LPC), perceptual linear prediction coefficients (PLP), formant frequency, zero crossing rate, cepstral coefficients and pitch frequency, as well as mean, jitter, shimmer, energy, minimum, maximum, amplitude and standard deviation, filter methods such as the Pearson correlation coefficient, t-test, relief and information gain were used to rank and select the features that are effective for emotion recognition. The selected features are then given to the classification system as a subset of the input. In the classification stage, a multiclass support vector machine is used to classify seven types of emotion. According to the results, the relief method, together with the multiclass support vector machine, yields the highest classification accuracy, with an emotion recognition rate of 93.94%.
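
    The filter-then-classify pipeline can be sketched with scikit-learn as below; the ANOVA F-test stands in for the filter methods named above, the multiclass SVM uses a one-vs-rest scheme, and the data are random placeholders.

        import numpy as np
        from sklearn.feature_selection import SelectKBest, f_classif
        from sklearn.pipeline import Pipeline
        from sklearn.svm import SVC

        rng = np.random.default_rng(0)
        X = rng.normal(size=(350, 60))     # 60 prosodic/cepstral features per utterance
        y = rng.integers(0, 7, size=350)   # 7 emotion classes

        model = Pipeline([
            ("filter", SelectKBest(score_func=f_classif, k=20)),  # keep top-ranked features
            ("svm", SVC(kernel="rbf", decision_function_shape="ovr")),
        ])
        model.fit(X, y)
        print("selected features:", model.named_steps["filter"].k)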

  16. The influence of age, hearing, and working memory on the speech comprehension benefit derived from an automatic speech recognition system

    NARCIS (Netherlands)

    Zekveld, A.A.; Kramer, S.E.; Kessens, J.M.; Vlaming, M.S.M.G.; Houtgast, T.

    2009-01-01

    Objective: The aim of the current study was to examine whether partly incorrect subtitles that are automatically generated by an Automatic Speech Recognition (ASR) system, improve speech comprehension by listeners with hearing impairment. In an earlier study (Zekveld et al. 2008), we showed that

  17. Masked Speech Recognition and Reading Ability in School-Age Children: Is There a Relationship?

    Science.gov (United States)

    Miller, Gabrielle; Lewis, Barbara; Benchek, Penelope; Buss, Emily; Calandruccio, Lauren

    2018-01-01

    Purpose: The relationship between reading (decoding) skills, phonological processing abilities, and masked speech recognition in typically developing children was explored. This experiment was designed to evaluate the relationship between phonological processing and decoding abilities and 2 aspects of masked speech recognition in typically…

  18. Suprasegmental lexical stress cues in visual speech can guide spoken-word recognition

    NARCIS (Netherlands)

    Jesse, A.; McQueen, J.M.

    2014-01-01

    Visual cues to the individual segments of speech and to sentence prosody guide speech recognition. The present study tested whether visual suprasegmental cues to the stress patterns of words can also constrain recognition. Dutch listeners use acoustic suprasegmental cues to lexical stress (changes

  19. How does susceptibility to proactive interference relate to speech recognition in aided and unaided conditions?

    Science.gov (United States)

    Ellis, Rachel J; Rönnberg, Jerker

    2015-01-01

    Proactive interference (PI) is the capacity to resist interference to the acquisition of new memories from information stored in the long-term memory. Previous research has shown that PI correlates significantly with the speech-in-noise recognition scores of younger adults with normal hearing. In this study, we report the results of an experiment designed to investigate the extent to which tests of visual PI relate to the speech-in-noise recognition scores of older adults with hearing loss, in aided and unaided conditions. The results suggest that measures of PI correlate significantly with speech-in-noise recognition only in the unaided condition. Furthermore, the relation between PI and speech-in-noise recognition differs from that observed in younger listeners without hearing loss. The findings suggest that the relation between PI tests and the speech-in-noise recognition scores of older adults with hearing loss reflects the capability of the test to index cognitive flexibility.

  20. Segment-based acoustic models for continuous speech recognition

    Science.gov (United States)

    Ostendorf, Mari; Rohlicek, J. R.

    1993-07-01

    This research aims to develop new and more accurate stochastic models for speaker-independent continuous speech recognition, by extending previous work in segment-based modeling and by introducing a new hierarchical approach to representing intra-utterance statistical dependencies. These techniques, which are more costly than traditional approaches because of the large search space associated with higher order models, are made feasible through rescoring a set of HMM-generated N-best sentence hypotheses. We expect these different modeling techniques to result in improved recognition performance over that achieved by current systems, which handle only frame-based observations and assume that these observations are independent given an underlying state sequence. In the fourth quarter of the project, we have completed the following: (1) ported our recognition system to the Wall Street Journal task, a standard task in the ARPA community; (2) developed an initial dependency-tree model of intra-utterance observation correlation; and (3) implemented baseline language model estimation software. Our initial results on the Wall Street Journal task are quite good and represent significantly improved performance over most HMM systems reporting on the Nov. 1992 5k vocabulary test set.

  1. Automated Linguistic Personality Description and Recognition Methods

    Directory of Open Access Journals (Sweden)

    Danylyuk Illya

    2016-12-01

    Full Text Available Background: The relevance of our research is, above all, theoretically motivated by the extraordinary scientific and practical interest in language processing of the huge amounts of data generated by people in everyday professional and personal life in electronic forms of communication (e-mail, SMS, voice, audio and video blogs, social networks, etc.). Purpose: The purpose of the article is to describe the theoretical and practical framework of the project "Communicative-pragmatic and discourse-grammatical lingvopersonology: structuring linguistic identity and computer modeling". A description of key techniques is given, such as machine learning for language modeling, speech synthesis, and handwriting simulation. Results: Lingvopersonology has developed solid theoretical foundations; its methods, tools, and significant achievements suggest that a promising new trend is the modeling of linguistic identity by means of information technology, including language technology. We see three aspects of the modeling: (1) modeling the semantic level of linguistic identity, by means of corpus linguistics; (2) formal modeling of the sound level of linguistic identity, with the help of speech synthesis; (3) formal modeling of the graphic level of linguistic identity, with the help of image synthesis (handwriting). For the first case, we propose to use machine learning techniques and the vector-space word2vec algorithm for textual speech modeling. The hybrid CUTE method for personality speech modeling will be applied to the second case. Finally, a neural network trained on images of the person's handwriting can serve as an instrument for the last case. Discussion: The project "Communicative-pragmatic, discourse, and grammatical lingvopersonology: structuring linguistic identity and computer modeling", which is being implemented by the Department of General and Applied Linguistics and Slavonic Philology, selected the task of modeling Yuriy Shevelyov (Sherekh

  2. Speech recognition technology: an outlook for human-to-machine interaction.

    Science.gov (United States)

    Erdel, T; Crooks, S

    2000-01-01

    Speech recognition, as an enabling technology in healthcare-systems computing, is a topic that has been discussed for quite some time, but is just now coming to fruition. Traditionally, speech-recognition software has been constrained by hardware, but improved processors and increased memory capacities are starting to remove some of these limitations. With these barriers removed, companies that create software for the healthcare setting have the opportunity to write more successful applications. Among the criticisms of speech-recognition applications are the high rates of error and steep training curves. However, even in the face of such negative perceptions, there remain significant opportunities for speech recognition to allow healthcare providers and, more specifically, physicians, to work more efficiently and ultimately spend more time with their patients and less time completing necessary documentation. This article will identify opportunities for inclusion of speech-recognition technology in the healthcare setting and examine major categories of speech-recognition software--continuous speech recognition, command and control, and text-to-speech. We will discuss the advantages and disadvantages of each area, the limitations of the software today, and how future trends might affect them.

  3. Lexical decoder for continuous speech recognition: sequential neural network approach

    International Nuclear Information System (INIS)

    Iooss, Christine

    1991-01-01

    The work presented in this dissertation concerns the study of a connectionist architecture for treating sequential inputs. In this context, the model proposed by J.L. Elman, a recurrent multilayer network, is used. Its abilities and its limits are evaluated. Modifications are made in order to treat erroneous or noisy sequential inputs and to classify patterns. The application context of this study concerns the realisation of a lexical decoder for analytical multi-speaker continuous speech recognition. Lexical decoding is performed on lattices of phonemes which are obtained after an acoustic-phonetic decoding stage relying on a K Nearest Neighbours search technique. Tests are done on sentences formed from a lexicon of 20 words. The results obtained show the ability of the proposed connectionist model to take into account sequentiality at the input level, to memorize the context and to treat noisy or erroneous inputs. (author)

  4. Robust Digital Speech Watermarking For Online Speaker Recognition

    Directory of Open Access Journals (Sweden)

    Mohammad Ali Nematollahi

    2015-01-01

    Full Text Available A robust and blind digital speech watermarking technique has been proposed for online speaker recognition systems based on the Discrete Wavelet Packet Transform (DWPT) and multiplication to embed the watermark in the amplitudes of the wavelet subbands. In order to minimize the degradation effect of the watermark, these subbands are selected where less speaker-specific information is available (500 Hz–3500 Hz and 6000 Hz–7000 Hz). Experimental results on Texas Instruments Massachusetts Institute of Technology (TIMIT), Massachusetts Institute of Technology (MIT), and Mobile Biometry (MOBIO) show that the degradation for speaker verification and identification is 1.16% and 2.52%, respectively. Furthermore, the proposed watermark technique can provide enough robustness against different signal processing attacks.
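
    A simplified wavelet-domain embedding can be sketched with PyWavelets; note that an ordinary discrete wavelet transform is used here instead of the paper's wavelet packet transform, the subband choice and embedding strength are arbitrary, and the signal is a noise placeholder rather than speech.

        import numpy as np
        import pywt

        rng = np.random.default_rng(0)
        speech = 0.1 * rng.normal(size=16000)                    # placeholder for a speech signal
        coeffs = pywt.wavedec(speech, "db4", level=4)            # [cA4, cD4, cD3, cD2, cD1]

        bits = rng.integers(0, 2, size=len(coeffs[1]))           # watermark bits for one subband
        alpha = 0.01                                             # embedding strength
        coeffs[1] = coeffs[1] * (1.0 + alpha * (2 * bits - 1))   # multiplicative embedding

        watermarked = pywt.waverec(coeffs, "db4")[:len(speech)]
        snr = 10 * np.log10(np.sum(speech ** 2) / np.sum((speech - watermarked) ** 2))
        print(f"embedding distortion SNR: {snr:.1f} dB")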

  5. Individual differences in language and working memory affect children's speech recognition in noise.

    Science.gov (United States)

    McCreery, Ryan W; Spratford, Meredith; Kirby, Benjamin; Brennan, Marc

    2017-05-01

    We examined how cognitive and linguistic skills affect speech recognition in noise for children with normal hearing. Children with better working memory and language abilities were expected to have better speech recognition in noise than peers with poorer skills in these domains. As part of a prospective, cross-sectional study, children with normal hearing completed speech recognition in noise for three types of stimuli: (1) monosyllabic words, (2) syntactically correct but semantically anomalous sentences and (3) semantically and syntactically anomalous word sequences. Measures of vocabulary, syntax and working memory were used to predict individual differences in speech recognition in noise. Ninety-six children with normal hearing, who were between 5 and 12 years of age. Higher working memory was associated with better speech recognition in noise for all three stimulus types. Higher vocabulary abilities were associated with better recognition in noise for sentences and word sequences, but not for words. Working memory and language both influence children's speech recognition in noise, but the relationships vary across types of stimuli. These findings suggest that clinical assessment of speech recognition is likely to reflect underlying cognitive and linguistic abilities, in addition to a child's auditory skills, consistent with the Ease of Language Understanding model.

  6. Individual differences in language and working memory affect children’s speech recognition in noise

    Science.gov (United States)

    McCreery, Ryan W.; Spratford, Meredith; Kirby, Benjamin; Brennan, Marc

    2017-01-01

    Objective We examined how cognitive and linguistic skills affect speech recognition in noise for children with normal hearing. Children with better working memory and language abilities were expected to have better speech recognition in noise than peers with poorer skills in these domains. Design As part of a prospective, cross-sectional study, children with normal hearing completed speech recognition in noise for three types of stimuli: (1) monosyllabic words, (2) syntactically correct but semantically anomalous sentences and (3) semantically and syntactically anomalous word sequences. Measures of vocabulary, syntax and working memory were used to predict individual differences in speech recognition in noise. Study sample Ninety-six children with normal hearing, who were between 5 and 12 years of age. Results Higher working memory was associated with better speech recognition in noise for all three stimulus types. Higher vocabulary abilities were associated with better recognition in noise for sentences and word sequences, but not for words. Conclusions Working memory and language both influence children’s speech recognition in noise, but the relationships vary across types of stimuli. These findings suggest that clinical assessment of speech recognition is likely to reflect underlying cognitive and linguistic abilities, in addition to a child’s auditory skills, consistent with the Ease of Language Understanding model. PMID:27981855

  7. Suprasegmental lexical stress cues in visual speech can guide spoken-word recognition

    OpenAIRE

    Jesse, A.; McQueen, J.

    2014-01-01

    Visual cues to the individual segments of speech and to sentence prosody guide speech recognition. The present study tested whether visual suprasegmental cues to the stress patterns of words can also constrain recognition. Dutch listeners use acoustic suprasegmental cues to lexical stress (changes in duration, amplitude, and pitch) in spoken-word recognition. We asked here whether they can also use visual suprasegmental cues. In two categorization experiments, Dutch participants saw a speaker...

  8. Interfacing COTS Speech Recognition and Synthesis Software to a Lotus Notes Military Command and Control Database

    Science.gov (United States)

    Carr, Oliver

    2002-10-01

    Speech recognition and synthesis technologies have become commercially viable over recent years. Two current market leading products in speech recognition technology are Dragon NaturallySpeaking and IBM ViaVoice. This report describes the development of speech user interfaces incorporating these products with Lotus Notes and Java applications. These interfaces enable data entry using speech recognition and allow warnings and instructions to be issued via speech synthesis. The development of a military vocabulary to improve user interaction is discussed. The report also describes an evaluation in terms of speed of the various speech user interfaces developed using Dragon NaturallySpeaking and IBM ViaVoice with a Lotus Notes Command and Control Support System Log database.

  9. Automated pattern recognition system for noise analysis

    International Nuclear Information System (INIS)

    Sides, W.H. Jr.; Piety, K.R.

    1980-01-01

    A pattern recognition system was developed at ORNL for on-line monitoring of noise signals from sensors in a nuclear power plant. The system continuously measures the power spectral density (PSD) values of the signals and the statistical characteristics of the PSDs in unattended operation. Through statistical comparison of current with past PSDs (pattern recognition), the system detects changes in the noise signals. Because the noise signals contain information about the current operational condition of the plant, a change in these signals could indicate a change, either normal or abnormal, in the operational condition.
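
    The monitoring idea can be sketched as follows (this is an illustrative reconstruction, not the ORNL implementation): estimate the current power spectral density with Welch's method and flag frequency bins that deviate from a stored baseline by more than a few standard deviations.

        import numpy as np
        from scipy.signal import welch

        def psd_db(x, fs):
            f, p = welch(x, fs=fs, nperseg=1024)
            return f, 10 * np.log10(p + 1e-20)

        rng = np.random.default_rng(0)
        fs = 1000
        baseline = np.array([psd_db(rng.normal(size=fs * 10), fs)[1] for _ in range(20)])
        mean, std = baseline.mean(axis=0), baseline.std(axis=0)

        # Current signal with a new narrowband component at 120 Hz.
        t = np.arange(fs * 10) / fs
        current = rng.normal(size=fs * 10) + 0.5 * np.sin(2 * np.pi * 120 * t)
        f, p_now = psd_db(current, fs)
        print("anomalous bins (Hz):", f[np.abs(p_now - mean) > 4 * std])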

  10. Recognition of speaker-dependent continuous speech with KEAL

    Science.gov (United States)

    Mercier, G.; Bigorgne, D.; Miclet, L.; Le Guennec, L.; Querre, M.

    1989-04-01

    A description of the speaker-dependent continuous speech recognition system KEAL is given. An unknown utterance is recognized by means of the following procedures: acoustic analysis, phonetic segmentation and identification, word and sentence analysis. The combination of feature-based, speaker-independent coarse phonetic segmentation with speaker-dependent statistical classification techniques is one of the main design features of the acoustic-phonetic decoder. The lexical access component is essentially based on a statistical dynamic programming technique which aims at matching a phonemic lexical entry containing various phonological forms, against a phonetic lattice. Sentence recognition is achieved by use of a context-free grammar and a parsing algorithm derived from Earley's parser. A speaker adaptation module allows some of the system parameters to be adjusted by matching known utterances with their acoustical representation. The task to be performed, described by its vocabulary and its grammar, is given as a parameter of the system. Continuously spoken sentences extracted from a 'pseudo-Logo' language are analyzed and results are presented.

  11. Specific acoustic models for spontaneous and dictated style in indonesian speech recognition

    Science.gov (United States)

    Vista, C. B.; Satriawan, C. H.; Lestari, D. P.; Widyantoro, D. H.

    2018-03-01

    The performance of an automatic speech recognition system is affected by differences in speech style between the data the model is originally trained upon and incoming speech to be recognized. In this paper, the usage of GMM-HMM acoustic models for specific speech styles is investigated. We develop two systems for the experiments; the first employs a speech style classifier to predict the speech style of incoming speech, either spontaneous or dictated, then decodes this speech using an acoustic model specifically trained for that speech style. The second system uses both acoustic models to recognise incoming speech and decides upon a final result by calculating a confidence score of decoding. Results show that training specific acoustic models for spontaneous and dictated speech styles confers a slight recognition advantage as compared to a baseline model trained on a mixture of spontaneous and dictated training data. In addition, the speech style classifier approach of the first system produced slightly more accurate results than the confidence scoring employed in the second system.

  12. Silent Speech Recognition as an Alternative Communication Device for Persons with Laryngectomy.

    Science.gov (United States)

    Meltzner, Geoffrey S; Heaton, James T; Deng, Yunbin; De Luca, Gianluca; Roy, Serge H; Kline, Joshua C

    2017-12-01

    Each year thousands of individuals require surgical removal of their larynx (voice box) due to trauma or disease, and thereby require an alternative voice source or assistive device to verbally communicate. Although natural voice is lost after laryngectomy, most muscles controlling speech articulation remain intact. Surface electromyographic (sEMG) activity of speech musculature can be recorded from the neck and face, and used for automatic speech recognition to provide speech-to-text or synthesized speech as an alternative means of communication. This is true even when speech is mouthed or spoken in a silent (subvocal) manner, making it an appropriate communication platform after laryngectomy. In this study, 8 individuals at least 6 months after total laryngectomy were recorded using 8 sEMG sensors on their face (4) and neck (4) while reading phrases constructed from a 2,500-word vocabulary. A unique set of phrases were used for training phoneme-based recognition models for each of the 39 commonly used phonemes in English, and the remaining phrases were used for testing word recognition of the models based on phoneme identification from running speech. Word error rates were on average 10.3% for the full 8-sensor set (averaging 9.5% for the top 4 participants), and 13.6% when reducing the sensor set to 4 locations per individual (n=7). This study provides a compelling proof-of-concept for sEMG-based alaryngeal speech recognition, with the strong potential to further improve recognition performance.

  13. Composite Wavelet Filters for Enhanced Automated Target Recognition

    Science.gov (United States)

    Chiang, Jeffrey N.; Zhang, Yuhan; Lu, Thomas T.; Chao, Tien-Hsin

    2012-01-01

    Automated Target Recognition (ATR) systems aim to automate target detection, recognition, and tracking. The current project applies a JPL ATR system to low-resolution sonar and camera videos taken from unmanned vehicles. These sonar images are inherently noisy and difficult to interpret, and pictures taken underwater are unreliable due to murkiness and inconsistent lighting. The ATR system breaks target recognition into three stages: 1) Videos of both sonar and camera footage are broken into frames and preprocessed to enhance images and detect Regions of Interest (ROIs). 2) Features are extracted from these ROIs in preparation for classification. 3) ROIs are classified as true or false positives using a standard Neural Network based on the extracted features. Several preprocessing, feature extraction, and training methods are tested and discussed in this paper.

  14. Audio-Visual Speech Recognition Using MPEG-4 Compliant Visual Features

    Directory of Open Access Journals (Sweden)

    Petar S. Aleksic

    2002-11-01

    Full Text Available We describe an audio-visual automatic continuous speech recognition system, which significantly improves speech recognition performance over a wide range of acoustic noise levels, as well as under clean audio conditions. The system utilizes facial animation parameters (FAPs) supported by the MPEG-4 standard for the visual representation of speech. We also describe a robust and automatic algorithm we have developed to extract FAPs from visual data, which does not require hand labeling or extensive training procedures. Principal component analysis (PCA) was performed on the FAPs in order to decrease the dimensionality of the visual feature vectors, and the derived projection weights were used as visual features in the audio-visual automatic speech recognition (ASR) experiments. Both single-stream and multistream hidden Markov models (HMMs) were used to model the ASR system, integrate audio and visual information, and perform relatively large vocabulary (approximately 1000 words) speech recognition experiments. The experiments used clean audio data and audio data corrupted by stationary white Gaussian noise at various SNRs. The proposed system reduces the word error rate (WER) by 20% to 23% relative to audio-only speech recognition WERs at various SNRs (0–30 dB with additive white Gaussian noise), and by 19% relative to the audio-only speech recognition WER under clean audio conditions.
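
    The dimensionality-reduction step can be sketched in a few lines: project FAP-like visual feature vectors onto their principal components and use the projection weights as visual features. The data below are random placeholders, not MPEG-4 FAPs.

        import numpy as np
        from sklearn.decomposition import PCA

        rng = np.random.default_rng(0)
        faps = rng.normal(size=(1000, 10))          # e.g. 10 lip FAPs per video frame

        pca = PCA(n_components=3)
        visual_features = pca.fit_transform(faps)   # projection weights per frame
        print(visual_features.shape, pca.explained_variance_ratio_)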

  15. From birdsong to human speech recognition: bayesian inference on a hierarchy of nonlinear dynamical systems.

    Science.gov (United States)

    Yildiz, Izzet B; von Kriegstein, Katharina; Kiebel, Stefan J

    2013-01-01

    Our knowledge about the computational mechanisms underlying human learning and recognition of sound sequences, especially speech, is still very limited. One difficulty in deciphering the exact means by which humans recognize speech is that there are scarce experimental findings at a neuronal, microscopic level. Here, we show that our neuronal-computational understanding of speech learning and recognition may be vastly improved by looking at an animal model, i.e., the songbird, which faces the same challenge as humans: to learn and decode complex auditory input, in an online fashion. Motivated by striking similarities between the human and songbird neural recognition systems at the macroscopic level, we assumed that the human brain uses the same computational principles at a microscopic level and translated a birdsong model into a novel human sound learning and recognition model with an emphasis on speech. We show that the resulting Bayesian model with a hierarchy of nonlinear dynamical systems can learn speech samples such as words rapidly and recognize them robustly, even in adverse conditions. In addition, we show that recognition can be performed even when words are spoken by different speakers and with different accents-an everyday situation in which current state-of-the-art speech recognition models often fail. The model can also be used to qualitatively explain behavioral data on human speech learning and derive predictions for future experiments.

  16. From birdsong to human speech recognition: bayesian inference on a hierarchy of nonlinear dynamical systems.

    Directory of Open Access Journals (Sweden)

    Izzet B Yildiz

    Full Text Available Our knowledge about the computational mechanisms underlying human learning and recognition of sound sequences, especially speech, is still very limited. One difficulty in deciphering the exact means by which humans recognize speech is that there are scarce experimental findings at a neuronal, microscopic level. Here, we show that our neuronal-computational understanding of speech learning and recognition may be vastly improved by looking at an animal model, i.e., the songbird, which faces the same challenge as humans: to learn and decode complex auditory input, in an online fashion. Motivated by striking similarities between the human and songbird neural recognition systems at the macroscopic level, we assumed that the human brain uses the same computational principles at a microscopic level and translated a birdsong model into a novel human sound learning and recognition model with an emphasis on speech. We show that the resulting Bayesian model with a hierarchy of nonlinear dynamical systems can learn speech samples such as words rapidly and recognize them robustly, even in adverse conditions. In addition, we show that recognition can be performed even when words are spoken by different speakers and with different accents-an everyday situation in which current state-of-the-art speech recognition models often fail. The model can also be used to qualitatively explain behavioral data on human speech learning and derive predictions for future experiments.

  17. Lexical-Access Ability and Cognitive Predictors of Speech Recognition in Noise in Adult Cochlear Implant Users

    OpenAIRE

    Kaandorp, Marre W.; Smits, Cas; Merkus, Paul; Festen, Joost M.; Goverts, S. Theo

    2017-01-01

    Not all of the variance in speech-recognition performance of cochlear implant (CI) users can be explained by biographic and auditory factors. In normal-hearing listeners, linguistic and cognitive factors account for most of speech-in-noise performance. The current study specifically explored the influence of visually measured lexical-access ability, compared with other cognitive factors, on the speech recognition of 24 postlingually deafened CI users. Speech-recognition performance was measured with ...


  18. Semi-automated contour recognition using DICOMautomaton

    International Nuclear Information System (INIS)

    Clark, H; Duzenli, C; Wu, J; Moiseenko, V; Lee, R; Gill, B; Thomas, S

    2014-01-01

    Purpose: A system has been developed which recognizes and classifies Digital Imaging and Communication in Medicine contour data with minimal human intervention. It allows researchers to overcome obstacles which tax analysis and mining systems, including inconsistent naming conventions and differences in data age or resolution. Methods: Lexicographic and geometric analysis is used for recognition. Well-known lexicographic methods implemented include Levenshtein-Damerau, bag-of-characters, Double Metaphone, Soundex, and (word and character)-N-grams. Geometrical implementations include 3D Fourier Descriptors, probability spheres, boolean overlap, simple feature comparison (e.g. eccentricity, volume) and rule-based techniques. Both analyses implement custom, domain-specific modules (e.g. emphasis differentiating left/right organ variants). Contour labels from 60 head and neck patients are used for cross-validation. Results: Mixed-lexicographical methods show an effective improvement in more than 10% of recognition attempts compared with a pure Levenshtein-Damerau approach when withholding 70% of the lexicon. Domain-specific and geometrical techniques further boost performance. Conclusions: DICOMautomaton allows users to recognize contours semi-automatically. As usage increases and the lexicon is filled with additional structures, performance improves, increasing the overall utility of the system.
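
    A minimal sketch of the lexicographic matching idea (not DICOMautomaton's code): a transposition-aware edit distance is used to map a free-text contour label onto a small, made-up lexicon.

      # Match a free-text contour label to a known lexicon entry using a
      # Damerau-Levenshtein (transposition-aware) edit distance.
      # Lexicon entries are illustrative, not taken from DICOMautomaton.
      def damerau_levenshtein(a: str, b: str) -> int:
          d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
          for i in range(len(a) + 1):
              d[i][0] = i
          for j in range(len(b) + 1):
              d[0][j] = j
          for i in range(1, len(a) + 1):
              for j in range(1, len(b) + 1):
                  cost = 0 if a[i - 1] == b[j - 1] else 1
                  d[i][j] = min(d[i - 1][j] + 1,         # deletion
                                d[i][j - 1] + 1,         # insertion
                                d[i - 1][j - 1] + cost)  # substitution
                  if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                      d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
          return d[len(a)][len(b)]

      lexicon = ["parotid_left", "parotid_right", "spinal_cord", "brainstem"]
      query = "l parotid"
      normalized = query.lower().replace(" ", "_")
      best = min(lexicon, key=lambda entry: damerau_levenshtein(normalized, entry))
      print(best)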

  19. The effects of reverberant self- and overlap-masking on speech recognition in cochlear implant listeners.

    Science.gov (United States)

    Desmond, Jill M; Collins, Leslie M; Throckmorton, Chandra S

    2014-06-01

    Many cochlear implant (CI) listeners experience decreased speech recognition in reverberant environments [Kokkinakis et al., J. Acoust. Soc. Am. 129(5), 3221-3232 (2011)], which may be caused by a combination of self- and overlap-masking [Bolt and MacDonald, J. Acoust. Soc. Am. 21(6), 577-580 (1949)]. Determining the extent to which these effects decrease speech recognition for CI listeners may influence reverberation mitigation algorithms. This study compared speech recognition with ideal self-masking mitigation, with ideal overlap-masking mitigation, and with no mitigation. Under these conditions, mitigating either self- or overlap-masking resulted in significant improvements in speech recognition for both normal hearing subjects utilizing an acoustic model and for CI listeners using their own devices.

  20. Man-system interface based on automatic speech recognition: integration to a virtual control desk

    Energy Technology Data Exchange (ETDEWEB)

    Jorge, Carlos Alexandre F.; Mol, Antonio Carlos A.; Pereira, Claudio M.N.A.; Aghina, Mauricio Alves C., E-mail: calexandre@ien.gov.b, E-mail: mol@ien.gov.b, E-mail: cmnap@ien.gov.b, E-mail: mag@ien.gov.b [Instituto de Engenharia Nuclear (IEN/CNEN-RJ), Rio de Janeiro, RJ (Brazil); Nomiya, Diogo V., E-mail: diogonomiya@gmail.co [Universidade Federal do Rio de Janeiro (UFRJ), RJ (Brazil)

    2009-07-01

    This work reports the implementation of a man-system interface based on automatic speech recognition and its integration into a virtual nuclear power plant control desk. The latter is aimed at reproducing a real control desk using virtual reality technology for operator training and ergonomic evaluation purposes. An automatic speech recognition system was developed to serve as a new interface with users, replacing the computer keyboard and mouse. Users can operate this virtual control desk in front of a computer monitor or a projection screen through spoken commands. The automatic speech recognition interface developed is based on a well-known signal processing technique, cepstral analysis, and on artificial neural networks. The speech recognition interface is described, along with its integration with the virtual control desk, and results are presented. (author)
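
    A hedged illustration of the cepstral-analysis front end mentioned above (the IEN interface itself and its neural network classifier are not reproduced): one Hamming-windowed frame is turned into cepstral coefficients via the log-magnitude spectrum and a DCT, with a synthetic tone standing in for speech.

      # Cepstral coefficients for one 25 ms speech frame: Hamming window,
      # FFT magnitude, log, DCT. A classifier (e.g. a small neural network)
      # would then be trained on such vectors; that part is omitted here.
      import numpy as np
      from scipy.fftpack import dct

      fs = 16000
      t = np.arange(0, 0.025, 1 / fs)              # one 25 ms frame
      frame = np.sin(2 * np.pi * 440 * t)          # synthetic stand-in for speech

      windowed = frame * np.hamming(len(frame))
      spectrum = np.abs(np.fft.rfft(windowed))
      log_spectrum = np.log(spectrum + 1e-10)      # avoid log(0)
      cepstrum = dct(log_spectrum, norm='ortho')[:13]   # keep first 13 coefficients
      print(cepstrum)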

  1. Research Into the Use of Speech Recognition Enhanced Microworlds in an Authorable Language Tutor

    National Research Council Canada - National Science Library

    Plott, Beth

    1999-01-01

    .... Once the first microworld exercise was completed and integrated into MILT, ARI funded the investigation of the use of discrete speech recognition technology in language learning, using the microworld exercise as a basis...

  2. Hearing Handicap and Speech Recognition Correlate With Self-Reported Listening Effort and Fatigue.

    Science.gov (United States)

    Alhanbali, Sara; Dawes, Piers; Lloyd, Simon; Munro, Kevin J

    To investigate the correlations between hearing handicap, speech recognition, listening effort, and fatigue. Eighty-four adults with hearing loss (65 to 85 years) completed three self-report questionnaires: the Fatigue Assessment Scale, the Effort Assessment Scale, and the Hearing Handicap Inventory for the Elderly. Audiometric assessment included pure-tone audiometry and speech recognition in noise. There was a significant positive correlation between handicap and fatigue (r = 0.39) and between speech recognition and fatigue (r = 0.22). Hearing handicap and speech recognition both correlate with self-reported listening effort and fatigue, which is consistent with a model of listening effort and fatigue where perceived difficulty is related to sustained effort and fatigue for unrewarding tasks over which the listener has low control. A clinical implication is that encouraging clients to recognize and focus on the pleasure and positive experiences of listening may result in greater satisfaction and benefit from hearing aid use.

  3. Man-system interface based on automatic speech recognition: integration to a virtual control desk

    International Nuclear Information System (INIS)

    Jorge, Carlos Alexandre F.; Mol, Antonio Carlos A.; Pereira, Claudio M.N.A.; Aghina, Mauricio Alves C.; Nomiya, Diogo V.

    2009-01-01

    This work reports the implementation of a man-system interface based on automatic speech recognition and its integration into a virtual nuclear power plant control desk. The latter is aimed at reproducing a real control desk using virtual reality technology for operator training and ergonomic evaluation purposes. An automatic speech recognition system was developed to serve as a new interface with users, replacing the computer keyboard and mouse. Users can operate this virtual control desk in front of a computer monitor or a projection screen through spoken commands. The automatic speech recognition interface developed is based on a well-known signal processing technique, cepstral analysis, and on artificial neural networks. The speech recognition interface is described, along with its integration with the virtual control desk, and results are presented. (author)

  4. Adapting Speech Recognition in Augmented Reality for Mobile Devices in Outdoor Environments

    OpenAIRE

    Pascoal, Rui; Ribeiro, Ricardo; Batista, Fernando; de Almeida, Ana

    2017-01-01

    This paper describes the process of integrating automatic speech recognition (ASR) into a mobile application and explores the benefits and challenges of integrating speech with augmented reality (AR) in outdoor environments. Augmented reality allows end-users to interact with the displayed information and perform tasks, while increasing the user's perception of the real world by adding virtual information to it. Speech is the most natural way of communication: it allows hands-free inte...

  5. A New Bigram-PLSA Language Model for Speech Recognition

    Directory of Open Access Journals (Sweden)

    Bahrani Mohammad

    2010-01-01

    Full Text Available A novel method for combining a bigram model and Probabilistic Latent Semantic Analysis (PLSA) is introduced for language modeling. The motivation behind this idea is the relaxation of the "bag of words" assumption fundamentally present in latent topic models including the PLSA model. An EM-based parameter estimation technique for the proposed model is presented in this paper. Previous attempts to incorporate word order in the PLSA model are surveyed and compared with our new proposed model both in theory and by experimental evaluation. The perplexity measure is employed to compare the effectiveness of recently introduced models with the new proposed model. Furthermore, experiments are designed and carried out on continuous speech recognition (CSR) tasks using word error rate (WER) as the evaluation criterion. The superiority of the new bigram-PLSA model over Nie et al.'s bigram-PLSA and simple PLSA models is demonstrated in the results of our experiments. Experiments on the BLLIP WSJ corpus show about 12% reduction in perplexity and 2.8% WER improvement compared to Nie et al.'s bigram-PLSA model.
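
    As an aside on the evaluation criterion, the sketch below computes perplexity for a plain add-one-smoothed bigram model on a toy corpus; the bigram-PLSA combination and its EM training are not shown.

      # Perplexity of a simple add-one-smoothed bigram model on a toy corpus.
      import math
      from collections import Counter

      train = "the cat sat on the mat the cat ate".split()
      test = "the cat sat".split()

      unigrams = Counter(train)
      bigrams = Counter(zip(train, train[1:]))
      vocab = len(unigrams)

      def bigram_prob(w1, w2):
          return (bigrams[(w1, w2)] + 1) / (unigrams[w1] + vocab)  # add-one smoothing

      log_prob = sum(math.log2(bigram_prob(w1, w2)) for w1, w2 in zip(test, test[1:]))
      perplexity = 2 ** (-log_prob / (len(test) - 1))
      print(perplexity)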

  6. Adoption of Speech Recognition Technology in Community Healthcare Nursing.

    Science.gov (United States)

    Al-Masslawi, Dawood; Block, Lori; Ronquillo, Charlene

    2016-01-01

    Adoption of new health information technology is shown to be challenging. However, the degree to which new technology will be adopted can be predicted by measures of usefulness and ease of use. In this work these key determining factors are focused on for the design of a wound documentation tool. In the context of wound care at home, consistent with evidence in the literature from similar settings, use of Speech Recognition Technology (SRT) for patient documentation has shown promise. To achieve a user-centred design, the results of ethnographic fieldwork are used to inform SRT features; furthermore, exploratory prototyping is used to collect feedback about the wound documentation tool from home care nurses. During this study, measures developed for healthcare applications of the Technology Acceptance Model will be used to identify SRT features that improve usefulness (e.g. increased accuracy, saving time) or ease of use (e.g. lowering mental/physical effort, easy-to-remember tasks). The identified features will be used to create a low-fidelity prototype that will be evaluated in future experiments.

  7. Speech recognition by means of a three-integrated-circuit set

    Energy Technology Data Exchange (ETDEWEB)

    Zoicas, A.

    1983-11-03

    The author uses pattern recognition methods for detecting word boundaries, and monitors incoming speech at 12 millisecond intervals. Frequency is divided into eight bands and analysis is achieved in an analogue interface integrated circuit, a pipeline digital processor and a control integrated circuit. Applications are suggested, including speech input to personal computers. 3 references.

  8. Introduction and Overview of the Vicens-Reddy Speech Recognition System.

    Science.gov (United States)

    Kameny, Iris; Ritea, H.

    The Vicens-Reddy System is unique in the sense that it approaches the problem of speech recognition as a whole, rather than treating particular aspects of the problem as in previous attempts. For example, where earlier systems treated only segmentation of speech into phoneme groups, or detected phonemes in a given context, the Vicens-Reddy System…

  9. Influences of Infant-Directed Speech on Early Word Recognition

    Science.gov (United States)

    Singh, Leher; Nestor, Sarah; Parikh, Chandni; Yull, Ashley

    2009-01-01

    When addressing infants, many adults adopt a particular type of speech, known as infant-directed speech (IDS). IDS is characterized by exaggerated intonation, as well as reduced speech rate, shorter utterance duration, and grammatical simplification. It is commonly asserted that IDS serves in part to facilitate language learning. Although…

  10. A Study on Efficient Robust Speech Recognition with Stochastic Dynamic Time Warping

    OpenAIRE

    孫, 喜浩

    2014-01-01

    In recent years, great progress has been made in automatic speech recognition (ASR) systems. The hidden Markov model (HMM) and dynamic time warping (DTW) are the two main algorithms that have been widely applied to ASR systems. Although the HMM technique achieves higher recognition accuracy in both clean and noisy speech environments, it needs a large set of words and a more complex implementation of the algorithm. Thus, more and more researchers have focused on DTW-based ASR systems. Dynamic time warpin...
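
    A minimal sketch of the DTW matching at the heart of such systems (not the thesis' implementation): the classic cumulative-cost recursion between two feature sequences of different lengths, with random vectors standing in for per-frame features.

      # Classic dynamic time warping distance between two feature sequences
      # (e.g. per-frame MFCC vectors), as used in DTW-based ASR.
      import numpy as np

      def dtw_distance(x, y):
          n, m = len(x), len(y)
          cost = np.full((n + 1, m + 1), np.inf)
          cost[0, 0] = 0.0
          for i in range(1, n + 1):
              for j in range(1, m + 1):
                  d = np.linalg.norm(x[i - 1] - y[j - 1])       # local frame distance
                  cost[i, j] = d + min(cost[i - 1, j],          # insertion
                                       cost[i, j - 1],          # deletion
                                       cost[i - 1, j - 1])      # match
          return cost[n, m]

      rng = np.random.default_rng(1)
      template = rng.normal(size=(40, 13))     # stored reference utterance
      observed = rng.normal(size=(55, 13))     # incoming utterance, different length
      print(dtw_distance(template, observed))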

  11. Listeners Experience Linguistic Masking Release in Noise-Vocoded Speech-in-Speech Recognition

    Science.gov (United States)

    Viswanathan, Navin; Kokkinakis, Kostas; Williams, Brittany T.

    2018-01-01

    Purpose: The purpose of this study was to evaluate whether listeners with normal hearing perceiving noise-vocoded speech-in-speech demonstrate better intelligibility of target speech when the background speech was mismatched in language (linguistic release from masking [LRM]) and/or location (spatial release from masking [SRM]) relative to the…

  12. Combining Semantic and Acoustic Features for Valence and Arousal Recognition in Speech

    DEFF Research Database (Denmark)

    Karadogan, Seliz; Larsen, Jan

    2012-01-01

    The recognition of affect in speech has attracted a lot of interest recently, especially in the areas of cognitive and computer sciences. Most of the previous studies focused on the recognition of basic emotions (such as happiness, sadness and anger) using a categorical approach. Recently, the focus has been shifting towards dimensional affect recognition, based on the idea that emotional states are not independent from one another but related in a systematic manner. In this paper, we design a continuous dimensional speech affect recognition model that combines acoustic and semantic features. We show that combining semantic and acoustic information for dimensional speech recognition improves the results. Moreover, we show that valence is better estimated using semantic features while arousal is better estimated using acoustic features.

  13. New automated procedure to assess context recognition memory in mice.

    Science.gov (United States)

    Reiss, David; Walter, Ondine; Bourgoin, Lucie; Kieffer, Brigitte L; Ouagazzal, Abdel-Mouttalib

    2014-11-01

    Recognition memory is an important aspect of human declarative memory and is one of the routine memory abilities altered in patients with amnestic syndrome and Alzheimer's disease. In rodents, recognition memory has been most widely assessed using the novel object preference paradigm, which exploits the spontaneous preference that animals display for novel objects. Here, we used nose-poke units instead of objects to design a simple automated method for assessing context recognition memory in mice. In the acquisition trial, mice are exposed for the first time to an operant chamber with one blinking nose-poke unit. In the choice session, a novel nonblinking nose-poke unit is inserted into an empty spatial location, and the number of nose pokes directed at each nose-poke unit is used as an index of recognition memory. We report that recognition performance varies as a function of the length of the acquisition period and the retention delay and is sensitive to conventional amnestic treatments. By manipulating the features of the operant chamber during a brief retrieval episode (3-min long), we further demonstrate that reconsolidation of the original contextual memory depends on the magnitude and the type of environmental changes introduced into the familiar spatial environment. These results show that the nose-poke recognition task provides a rapid and reliable way for assessing context recognition memory in mice and offers new possibilities for deciphering the brain mechanisms governing the reconsolidation process.

  14. A Russian Keyword Spotting System Based on Large Vocabulary Continuous Speech Recognition and Linguistic Knowledge

    Directory of Open Access Journals (Sweden)

    Valentin Smirnov

    2016-01-01

    Full Text Available The paper describes the key concepts of a word spotting system for Russian based on large vocabulary continuous speech recognition. Key algorithms and system settings are described, including the pronunciation variation algorithm, and experimental results on real-life telecom data are provided. A description of the system architecture and the user interface is provided. The system is based on the CMU Sphinx open-source speech recognition platform and on the linguistic models and algorithms developed by Speech Drive LLC. The effective combination of baseline statistical methods, real-world training data, and the intensive use of linguistic knowledge led to a quality result applicable to industrial use.

  15. The influence of age, hearing, and working memory on the speech comprehension benefit derived from an automatic speech recognition system.

    Science.gov (United States)

    Zekveld, Adriana A; Kramer, Sophia E; Kessens, Judith M; Vlaming, Marcel S M G; Houtgast, Tammo

    2009-04-01

    The aim of the current study was to examine whether partly incorrect subtitles that are automatically generated by an Automatic Speech Recognition (ASR) system improve speech comprehension by listeners with hearing impairment. In an earlier study (Zekveld et al. 2008), we showed that speech comprehension in noise by young listeners with normal hearing improves when presenting partly incorrect, automatically generated subtitles. The current study focused on the effects of age, hearing loss, visual working memory capacity, and linguistic skills on the benefit obtained from automatically generated subtitles during listening to speech in noise. In order to investigate the effects of age and hearing loss, three groups of participants were included: 22 young persons with normal hearing (YNH, mean age = 21 years), 22 middle-aged adults with normal hearing (MA-NH, mean age = 55 years) and 30 middle-aged adults with hearing impairment (MA-HI, mean age = 57 years). The benefit from automatic subtitling was measured by Speech Reception Threshold (SRT) tests (Plomp & Mimpen, 1979). Both unimodal auditory and bimodal audiovisual SRT tests were performed. In the audiovisual tests, the subtitles were presented simultaneously with the speech, whereas in the auditory test, only speech was presented. The difference between the auditory and audiovisual SRT was defined as the audiovisual benefit. Participants additionally rated the listening effort. We examined the influences of ASR accuracy level and text delay on the audiovisual benefit and the listening effort using a repeated measures General Linear Model analysis. In a correlation analysis, we evaluated the relationships between age, auditory SRT, visual working memory capacity and the audiovisual benefit and listening effort. The automatically generated subtitles improved speech comprehension in noise for all ASR accuracies and delays covered by the current study. Higher ASR accuracy levels resulted in more benefit obtained

  16. Automated analysis of free speech predicts psychosis onset in high-risk youths

    Science.gov (United States)

    Bedi, Gillinder; Carrillo, Facundo; Cecchi, Guillermo A; Slezak, Diego Fernández; Sigman, Mariano; Mota, Natália B; Ribeiro, Sidarta; Javitt, Daniel C; Copelli, Mauro; Corcoran, Cheryl M

    2015-01-01

    Background/Objectives: Psychiatry lacks the objective clinical tests routinely used in other specializations. Novel computerized methods to characterize complex behaviors such as speech could be used to identify and predict psychiatric illness in individuals. AIMS: In this proof-of-principle study, our aim was to test automated speech analyses combined with Machine Learning to predict later psychosis onset in youths at clinical high-risk (CHR) for psychosis. Methods: Thirty-four CHR youths (11 females) had baseline interviews and were assessed quarterly for up to 2.5 years; five transitioned to psychosis. Using automated analysis, transcripts of interviews were evaluated for semantic and syntactic features predicting later psychosis onset. Speech features were fed into a convex hull classification algorithm with leave-one-subject-out cross-validation to assess their predictive value for psychosis outcome. The canonical correlation between the speech features and prodromal symptom ratings was computed. Results: Derived speech features included a Latent Semantic Analysis measure of semantic coherence and two syntactic markers of speech complexity: maximum phrase length and use of determiners (e.g., which). These speech features predicted later psychosis development with 100% accuracy, outperforming classification from clinical interviews. Speech features were significantly correlated with prodromal symptoms. Conclusions: Findings support the utility of automated speech analysis to measure subtle, clinically relevant mental state changes in emergent psychosis. Recent developments in computer science, including natural language processing, could provide the foundation for future development of objective clinical tests for psychiatry. PMID:27336038
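
    The semantic-coherence feature can be illustrated roughly as follows (a sketch only, not the study's pipeline or classifier): consecutive sentences of a transcript are embedded with LSA (TF-IDF plus truncated SVD) and their mean cosine similarity is taken as a coherence score. The example sentences are invented.

      # First-order semantic coherence as the mean cosine similarity between
      # LSA vectors of consecutive sentences in a transcript.
      import numpy as np
      from sklearn.feature_extraction.text import TfidfVectorizer
      from sklearn.decomposition import TruncatedSVD
      from sklearn.metrics.pairwise import cosine_similarity

      sentences = [
          "I went to the store yesterday",
          "The store was closed so I walked home",
          "Walking home I met a friend",
          "My friend asked about the store",
      ]

      tfidf = TfidfVectorizer().fit_transform(sentences)
      lsa = TruncatedSVD(n_components=2, random_state=0).fit_transform(tfidf)

      similarities = [cosine_similarity(lsa[i:i + 1], lsa[i + 1:i + 2])[0, 0]
                      for i in range(len(sentences) - 1)]
      coherence = float(np.mean(similarities))
      print(coherence)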

  17. Neural speech recognition: continuous phoneme decoding using spatiotemporal representations of human cortical activity

    Science.gov (United States)

    Moses, David A.; Mesgarani, Nima; Leonard, Matthew K.; Chang, Edward F.

    2016-10-01

    Objective. The superior temporal gyrus (STG) and neighboring brain regions play a key role in human language processing. Previous studies have attempted to reconstruct speech information from brain activity in the STG, but few of them incorporate the probabilistic framework and engineering methodology used in modern speech recognition systems. In this work, we describe the initial efforts toward the design of a neural speech recognition (NSR) system that performs continuous phoneme recognition on English stimuli with arbitrary vocabulary sizes using the high gamma band power of local field potentials in the STG and neighboring cortical areas obtained via electrocorticography. Approach. The system implements a Viterbi decoder that incorporates phoneme likelihood estimates from a linear discriminant analysis model and transition probabilities from an n-gram phonemic language model. Grid searches were used in an attempt to determine optimal parameterizations of the feature vectors and Viterbi decoder. Main results. The performance of the system was significantly improved by using spatiotemporal representations of the neural activity (as opposed to purely spatial representations) and by including language modeling and Viterbi decoding in the NSR system. Significance. These results emphasize the importance of modeling the temporal dynamics of neural responses when analyzing their variations with respect to varying stimuli and demonstrate that speech recognition techniques can be successfully leveraged when decoding speech from neural signals. Guided by the results detailed in this work, further development of the NSR system could have applications in the fields of automatic speech recognition and neural prosthetics.
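
    A toy sketch of the Viterbi decoding step (illustrative numbers, not the NSR system's models): per-frame phone log-likelihoods are combined with bigram transition log-probabilities to recover the best phone path.

      # Viterbi decoding of a phone sequence from per-frame likelihoods
      # combined with bigram transition probabilities. All values are toy
      # numbers, not estimates from neural recordings.
      import numpy as np

      phones = ["b", "a", "d"]
      # log-likelihood of each phone for each of 4 frames (rows = frames)
      log_lik = np.log(np.array([
          [0.7, 0.2, 0.1],
          [0.2, 0.6, 0.2],
          [0.1, 0.7, 0.2],
          [0.1, 0.2, 0.7],
      ]))
      # bigram log transition probabilities between phones
      log_trans = np.log(np.array([
          [0.1, 0.8, 0.1],
          [0.1, 0.3, 0.6],
          [0.4, 0.3, 0.3],
      ]))

      n_frames, n_phones = log_lik.shape
      score = np.full((n_frames, n_phones), -np.inf)
      back = np.zeros((n_frames, n_phones), dtype=int)
      score[0] = log_lik[0]                      # flat prior over initial phones
      for t in range(1, n_frames):
          for j in range(n_phones):
              candidates = score[t - 1] + log_trans[:, j]
              back[t, j] = int(np.argmax(candidates))
              score[t, j] = candidates[back[t, j]] + log_lik[t, j]

      path = [int(np.argmax(score[-1]))]
      for t in range(n_frames - 1, 0, -1):       # backtrack the best path
          path.append(back[t, path[-1]])
      print([phones[i] for i in reversed(path)])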

  18. Microscopic prediction of speech recognition for listeners with normal hearing in noise using an auditory model.

    Science.gov (United States)

    Jürgens, Tim; Brand, Thomas

    2009-11-01

    This study compares the phoneme recognition performance in speech-shaped noise of a microscopic model for speech recognition with the performance of normal-hearing listeners. "Microscopic" is defined in two ways with respect to this model. First, the speech recognition rate is predicted on a phoneme-by-phoneme basis. Second, microscopic modeling means that the signal waveforms to be recognized are processed by mimicking elementary parts of human auditory processing. The model is based on an approach by Holube and Kollmeier [J. Acoust. Soc. Am. 100, 1703-1716 (1996)] and consists of a psychoacoustically and physiologically motivated preprocessing and a simple dynamic-time-warp speech recognizer. The model is evaluated while presenting nonsense speech in a closed-set paradigm. Averaged phoneme recognition rates, specific phoneme recognition rates, and phoneme confusions are analyzed. The influence of different perceptual distance measures and of the model's a-priori knowledge is investigated. The results show that human performance can be predicted by this model using an optimal detector, i.e., identical speech waveforms for both training and testing of the recognizer. The best model performance is yielded by distance measures that focus mainly on small perceptual distances and neglect outliers.

  19. Joint variable frame rate and length analysis for speech recognition under adverse conditions

    DEFF Research Database (Denmark)

    Tan, Zheng-Hua; Kraljevski, Ivan

    2014-01-01

    This paper presents a method that combines variable frame length and rate analysis for speech recognition in noisy environments, together with an investigation of the effect of different frame lengths on speech recognition performance. The method adopts frame selection using an a posteriori signal-to-noise ratio (SNR) weighted energy distance and increases the length of the selected frames according to the number of non-selected preceding frames. It assigns a higher frame rate and a normal frame length to a rapidly changing and high-SNR region of a speech signal, and a lower frame rate and an increased frame length to a steady or low-SNR region. The speech recognition results show that the proposed variable frame rate and length method outperforms fixed frame rate and length analysis, as well as standalone variable frame rate analysis, in terms of noise-robustness.
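
    The frame-selection idea can be sketched roughly as follows (a simplification with toy numbers, not the published algorithm): a frame is kept once the accumulated SNR-weighted energy distance since the last kept frame exceeds a threshold.

      # Keep frames whose accumulated SNR-weighted energy distance from the
      # last selected frame exceeds a threshold, a simplified form of
      # variable frame rate analysis. Signal, SNR estimates, and threshold
      # are all toy values.
      import numpy as np

      rng = np.random.default_rng(2)
      frame_energies = rng.uniform(0.1, 1.0, size=50)   # per-frame energies
      snr_weights = rng.uniform(0.2, 1.0, size=50)      # a posteriori SNR estimate per frame
      threshold = 0.4

      selected = [0]
      accumulated = 0.0
      for i in range(1, len(frame_energies)):
          distance = abs(frame_energies[i] - frame_energies[selected[-1]])
          accumulated += snr_weights[i] * distance       # weight distance by SNR
          if accumulated >= threshold:                   # enough change: keep the frame
              selected.append(i)
              accumulated = 0.0

      print(f"kept {len(selected)} of {len(frame_energies)} frames:", selected)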

  20. Report generation using digital speech recognition in radiology

    International Nuclear Information System (INIS)

    Vorbeck, F.; Ba-Ssalamah, A.; Kettenbach, J.; Huebsch, P.

    2000-01-01

    The aim of this study was to evaluate whether the use of digital continuous speech recognition (CSR) in the field of radiology could lead to relevant time savings in generating a report. A CSR system (SP6000, Philips, Eindhoven, The Netherlands) for German was used to transform fluently spoken sentences into text. Two radiologists dictated a total of 450 reports on five radiological topics. Two typists edited those reports by conventional typing using a text editor (WinWord 6.0, Microsoft, Redmond, Wash.) installed on an IBM-compatible personal computer (PC). The same reports were generated using the CSR system, and the performance of both systems was then evaluated by comparing the time needed to generate the reports and the error rates of both approaches. The mean error rate for the CSR system was 5.5 %, and the mean error rate for conventional typing was 0.4 %. Reports edited with CSR were, on average, generated 19 % faster than with the conventional text-editing method. However, error rates and time savings varied with topic, speaker, and typist. Using CSR, the maximum time saving achieved was 28 %, for the topic sonography. The CSR system was never slower, under any circumstances, than conventional typing on a PC. When compared with a conventional manual typing method, the CSR system proved to be useful in a clinical setting and saved time in generating radiological reports. The amount of time saved, however, greatly depended on the performance of the typist, the speaker, and the stored vocabulary provided by the CSR system. (orig.)

  1. Use of Authentic-Speech Technique for Teaching Sound Recognition to EFL Students

    Science.gov (United States)

    Sersen, William J.

    2011-01-01

    The main objective of this research was to test an authentic-speech technique for improving the sound-recognition skills of EFL (English as a foreign language) students at Roi-Et Rajabhat University. The secondary objective was to determine the correlation, if any, between students' self-evaluation of sound-recognition progress and the actual…

  2. AUTOMATIC SPEECH RECOGNITION SYSTEM CONCERNING THE MOROCCAN DIALECTE (Darija and Tamazight)

    OpenAIRE

    A. EL GHAZI; C. DAOUI; N. IDRISSI

    2012-01-01

    In this work we present an automatic speech recognition system for Moroccan dialects, mainly Darija (an Arabic dialect) and Tamazight. Many approaches have been used to model Arabic and Tamazight phonetic units. In this paper, we propose to use hidden Markov models (HMM) for modeling these phonetic units. Experimental results show that the proposed approach further improves recognition.

  3. Comparing grapheme-based and phoneme-based speech recognition for Afrikaans

    CSIR Research Space (South Africa)

    Basson, WD

    2012-11-01

    Full Text Available This paper compares the recognition accuracy of a phoneme-based automatic speech recognition system with that of a grapheme-based system, using Afrikaans as a case study. The first system is developed using a conventional pronunciation dictionary...

  4. Conversation electrified: ERP correlates of speech act recognition in underspecified utterances.

    Directory of Open Access Journals (Sweden)

    Rosa S Gisladottir

    Full Text Available The ability to recognize speech acts (verbal actions) in conversation is critical for everyday interaction. However, utterances are often underspecified for the speech act they perform, requiring listeners to rely on the context to recognize the action. The goal of this study was to investigate the time-course of auditory speech act recognition in action-underspecified utterances and explore how sequential context (the prior action) impacts this process. We hypothesized that speech acts are recognized early in the utterance to allow for quick transitions between turns in conversation. Event-related potentials (ERPs) were recorded while participants listened to spoken dialogues and performed an action categorization task. The dialogues contained target utterances, each of which could deliver three distinct speech acts depending on the prior turn. The targets were identical across conditions but differed in the type of speech act performed and how it fit into the larger action sequence. The ERP results show an early effect of action type, reflected by frontal positivities as early as 200 ms after target utterance onset. This indicates that speech act recognition begins early in the turn, when the utterance has only been partially processed. Providing further support for early speech act recognition, actions in highly constraining contexts did not elicit an ERP effect at the utterance-final word. We take this to show that listeners can recognize the action before the final word through predictions at the speech act level. However, additional processing based on the complete utterance is required for more complex actions, as reflected by a posterior negativity at the final word when the speech act is in a less constraining context and a new action sequence is initiated. These findings demonstrate that sentence comprehension in conversational contexts crucially involves recognition of verbal action, which begins as soon as it can.

  5. Effect of speech-intrinsic variations on human and automatic recognition of spoken phonemes.

    Science.gov (United States)

    Meyer, Bernd T; Brand, Thomas; Kollmeier, Birger

    2011-01-01

    The aim of this study is to quantify the gap between the recognition performance of human listeners and an automatic speech recognition (ASR) system with special focus on intrinsic variations of speech, such as speaking rate and effort, altered pitch, and the presence of dialect and accent. Second, it is investigated if the most common ASR features contain all information required to recognize speech in noisy environments by using resynthesized ASR features in listening experiments. For the phoneme recognition task, the ASR system achieved the human performance level only when the signal-to-noise ratio (SNR) was increased by 15 dB, which is an estimate for the human-machine gap in terms of the SNR. The major part of this gap is attributed to the feature extraction stage, since human listeners achieve comparable recognition scores when the SNR difference between unaltered and resynthesized utterances is 10 dB. Intrinsic variabilities result in strong increases of error rates, both in human speech recognition (HSR) and ASR (with a relative increase of up to 120%). An analysis of phoneme duration and recognition rates indicates that human listeners are better able to identify temporal cues than the machine at low SNRs, which suggests incorporating information about the temporal dynamics of speech into ASR systems.

  6. Effects of Semantic Context and Fundamental Frequency Contours on Mandarin Speech Recognition by Second Language Learners.

    Science.gov (United States)

    Zhang, Linjun; Li, Yu; Wu, Han; Li, Xin; Shu, Hua; Zhang, Yang; Li, Ping

    2016-01-01

    Speech recognition by second language (L2) learners in optimal and suboptimal conditions has been examined extensively with English as the target language in most previous studies. This study extended existing experimental protocols (Wang et al., 2013) to investigate Mandarin speech recognition by Japanese learners of Mandarin at two different levels (elementary vs. intermediate) of proficiency. The overall results showed that in addition to L2 proficiency, semantic context, F0 contours, and listening condition all affected the recognition performance on the Mandarin sentences. However, the effects of semantic context and F0 contours on L2 speech recognition diverged to some extent. Specifically, there was significant modulation effect of listening condition on semantic context, indicating that L2 learners made use of semantic context less efficiently in the interfering background than in quiet. In contrast, no significant modulation effect of listening condition on F0 contours was found. Furthermore, there was significant interaction between semantic context and F0 contours, indicating that semantic context becomes more important for L2 speech recognition when F0 information is degraded. None of these effects were found to be modulated by L2 proficiency. The discrepancy in the effects of semantic context and F0 contours on L2 speech recognition in the interfering background might be related to differences in processing capacities required by the two types of information in adverse listening conditions.

  7. Optimal pattern synthesis for speech recognition based on principal component analysis

    Science.gov (United States)

    Korsun, O. N.; Poliyev, A. V.

    2018-02-01

    The algorithm for building an optimal pattern for automatic speech recognition, which increases the probability of correct recognition, is developed and presented in this work. The optimal pattern is formed by decomposing an initial pattern into principal components, which makes it possible to reduce the dimension of the multi-parameter optimization problem. In the next step, training samples are introduced and optimal estimates of the principal-component decomposition coefficients are obtained by a numerical parameter optimization algorithm. Finally, we consider experimental results that show the improvement in speech recognition introduced by the proposed optimization algorithm.

  8. Effect of speech rate variation on acoustic phone stability in Afrikaans speech recognition

    CSIR Research Space (South Africa)

    Badenhorst, JAC

    2007-11-01

    Full Text Available The authors analyse the effect of speech rate variation on Afrikaans phone stability from an acoustic perspective. Specifically they introduce two techniques for the acoustic analysis of speech rate variation, apply these techniques to an Afrikaans...

  9. Visual face-movement sensitive cortex is relevant for auditory-only speech recognition.

    Science.gov (United States)

    Riedel, Philipp; Ragert, Patrick; Schelinski, Stefanie; Kiebel, Stefan J; von Kriegstein, Katharina

    2015-07-01

    It is commonly assumed that the recruitment of visual areas during audition is not relevant for performing auditory tasks ('auditory-only view'). According to an alternative view, however, the recruitment of visual cortices is thought to optimize auditory-only task performance ('auditory-visual view'). This alternative view is based on functional magnetic resonance imaging (fMRI) studies. These studies have shown, for example, that even if there is only auditory input available, face-movement sensitive areas within the posterior superior temporal sulcus (pSTS) are involved in understanding what is said (auditory-only speech recognition). This is particularly the case when speakers are known audio-visually, that is, after brief voice-face learning. Here we tested whether the left pSTS involvement is causally related to performance in auditory-only speech recognition when speakers are known by face. To test this hypothesis, we applied cathodal transcranial direct current stimulation (tDCS) to the pSTS during (i) visual-only speech recognition of a speaker known only visually to participants and (ii) auditory-only speech recognition of speakers they learned by voice and face. We defined the cathode as active electrode to down-regulate cortical excitability by hyperpolarization of neurons. tDCS to the pSTS interfered with visual-only speech recognition performance compared to a control group without pSTS stimulation (tDCS to BA6/44 or sham). Critically, compared to controls, pSTS stimulation additionally decreased auditory-only speech recognition performance selectively for voice-face learned speakers. These results are important in two ways. First, they provide direct evidence that the pSTS is causally involved in visual-only speech recognition; this confirms a long-standing prediction of current face-processing models. Secondly, they show that visual face-sensitive pSTS is causally involved in optimizing auditory-only speech recognition. These results are in line

  10. Feature Fusion Algorithm for Multimodal Emotion Recognition from Speech and Facial Expression Signal

    Directory of Open Access Journals (Sweden)

    Han Zhiyan

    2016-01-01

    Full Text Available In order to overcome the limitations of single-mode emotion recognition, this paper describes a novel multimodal emotion recognition algorithm that takes speech and facial expression signals as its research subjects. First, the speech signal features and facial expression features are fused, sample sets are obtained by sampling with replacement, and classifiers are then trained with BP neural networks (BPNN). Second, the difference between two classifiers is measured by a double error difference selection strategy. Finally, the final recognition result is obtained by the majority voting rule. Experiments show that the method improves the accuracy of emotion recognition by giving full play to the advantages of decision-level fusion and feature-level fusion, bringing the whole fusion process closer to human emotion recognition, with a recognition rate of 90.4%.
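
    A minimal sketch of the final decision-level step (the BPNN training and the double error difference selection are omitted): the label chosen by the majority of the individual classifiers is returned. Predictions are made up for illustration.

      # Decision-level fusion by majority voting over the predictions of
      # several classifiers (e.g. networks trained on resampled feature sets).
      from collections import Counter

      def majority_vote(predictions):
          votes = Counter(predictions)
          return votes.most_common(1)[0][0]

      # per-classifier predictions for one test utterance
      classifier_outputs = ["happy", "happy", "neutral", "happy", "sad"]
      print(majority_vote(classifier_outputs))   # -> "happy"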

  11. Impact of noise and other factors on speech recognition in anaesthesia

    DEFF Research Database (Denmark)

    Alapetite, Alexandre

    2008-01-01

    Objective: The aim of the experiment is to evaluate the relative impact of several factors affecting speech recognition when used in operating rooms, such as the type or loudness of background noises, the type of microphone, the type of recognition mode (free speech versus command mode), and the type of training. Methods: Eight volunteers read aloud a total of about 3 600 typical short anaesthesia comments to be transcribed by a continuous speech recognition system. Background noises were collected in an operating room and reproduced. A regression analysis and descriptive statistics were done to evaluate the relative effect of the various factors. Results: Some factors have a major impact, such as the words to be recognised, the type of recognition, and participants. The type of microphone is especially significant when combined with the type of noise. While loud noises in the operating room can have a predominant...

  12. Emotion Recognition from Chinese Speech for Smart Affective Services Using a Combination of SVM and DBN

    Science.gov (United States)

    Zhu, Lianzhang; Chen, Leiming; Zhao, Dehai

    2017-01-01

    Accurate emotion recognition from speech is important for applications like smart health care, smart entertainment, and other smart services. High accuracy emotion recognition from Chinese speech is challenging due to the complexities of the Chinese language. In this paper, we explore how to improve the accuracy of speech emotion recognition, including speech signal feature extraction and emotion classification methods. Five types of features are extracted from a speech sample: mel frequency cepstrum coefficient (MFCC), pitch, formant, short-term zero-crossing rate and short-term energy. By comparing statistical features with deep features extracted by a Deep Belief Network (DBN), we attempt to find the best features to identify the emotion status for speech. We propose a novel classification method that combines DBN and SVM (support vector machine) instead of using only one of them. In addition, a conjugate gradient method is applied to train DBN in order to speed up the training process. Gender-dependent experiments are conducted using an emotional speech database created by the Chinese Academy of Sciences. The results show that DBN features can reflect emotion status better than artificial features, and our new classification approach achieves an accuracy of 95.8%, which is higher than using either DBN or SVM separately. Results also show that DBN can work very well for small training databases if it is properly designed. PMID:28737705
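
    Two of the hand-crafted features named above (short-term energy and short-term zero-crossing rate) can be sketched as follows with synthetic data; the MFCC, pitch and formant extraction, the DBN, and the DBN+SVM combination are not reproduced here.

      # Per-frame short-term energy and zero-crossing rate, followed by an SVM
      # on utterance-level statistics. Signals and labels are synthetic.
      import numpy as np
      from sklearn.svm import SVC

      def frame_features(signal, frame_len=400, hop=160):
          feats = []
          for start in range(0, len(signal) - frame_len, hop):
              frame = signal[start:start + frame_len]
              energy = float(np.sum(frame ** 2))                          # short-term energy
              zcr = float(np.mean(np.abs(np.diff(np.sign(frame))) > 0))   # zero-crossing rate
              feats.append((energy, zcr))
          return np.array(feats)

      rng = np.random.default_rng(3)
      X, y = [], []
      for label in (0, 1):                          # two toy emotion classes
          for _ in range(20):
              sig = rng.normal(scale=1.0 + label, size=8000)
              f = frame_features(sig)
              X.append([f[:, 0].mean(), f[:, 0].std(), f[:, 1].mean(), f[:, 1].std()])
              y.append(label)

      clf = SVC(kernel='rbf').fit(X, y)
      print(clf.score(X, y))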

  13. Eyes and ears: Using eye tracking and pupillometry to understand challenges to speech recognition.

    Science.gov (United States)

    Van Engen, Kristin J; McLaughlin, Drew J

    2018-05-04

    Although human speech recognition is often experienced as relatively effortless, a number of common challenges can render the task more difficult. Such challenges may originate in talkers (e.g., unfamiliar accents, varying speech styles), the environment (e.g. noise), or in listeners themselves (e.g., hearing loss, aging, different native language backgrounds). Each of these challenges can reduce the intelligibility of spoken language, but even when intelligibility remains high, they can place greater processing demands on listeners. Noisy conditions, for example, can lead to poorer recall for speech, even when it has been correctly understood. Speech intelligibility measures, memory tasks, and subjective reports of listener difficulty all provide critical information about the effects of such challenges on speech recognition. Eye tracking and pupillometry complement these methods by providing objective physiological measures of online cognitive processing during listening. Eye tracking records the moment-to-moment direction of listeners' visual attention, which is closely time-locked to unfolding speech signals, and pupillometry measures the moment-to-moment size of listeners' pupils, which dilate in response to increased cognitive load. In this paper, we review the uses of these two methods for studying challenges to speech recognition. Copyright © 2018. Published by Elsevier B.V.

  14. Emotion Recognition from Chinese Speech for Smart Affective Services Using a Combination of SVM and DBN.

    Science.gov (United States)

    Zhu, Lianzhang; Chen, Leiming; Zhao, Dehai; Zhou, Jiehan; Zhang, Weishan

    2017-07-24

    Accurate emotion recognition from speech is important for applications like smart health care, smart entertainment, and other smart services. High accuracy emotion recognition from Chinese speech is challenging due to the complexities of the Chinese language. In this paper, we explore how to improve the accuracy of speech emotion recognition, including speech signal feature extraction and emotion classification methods. Five types of features are extracted from a speech sample: mel frequency cepstrum coefficient (MFCC), pitch, formant, short-term zero-crossing rate and short-term energy. By comparing statistical features with deep features extracted by a Deep Belief Network (DBN), we attempt to find the best features to identify the emotion status for speech. We propose a novel classification method that combines DBN and SVM (support vector machine) instead of using only one of them. In addition, a conjugate gradient method is applied to train DBN in order to speed up the training process. Gender-dependent experiments are conducted using an emotional speech database created by the Chinese Academy of Sciences. The results show that DBN features can reflect emotion status better than artificial features, and our new classification approach achieves an accuracy of 95.8%, which is higher than using either DBN or SVM separately. Results also show that DBN can work very well for small training databases if it is properly designed.

  15. Noise-robust speech recognition through auditory feature detection and spike sequence decoding.

    Science.gov (United States)

    Schafer, Phillip B; Jin, Dezhe Z

    2014-03-01

    Speech recognition in noisy conditions is a major challenge for computer systems, but the human brain performs it routinely and accurately. Automatic speech recognition (ASR) systems that are inspired by neuroscience can potentially bridge the performance gap between humans and machines. We present a system for noise-robust isolated word recognition that works by decoding sequences of spikes from a population of simulated auditory feature-detecting neurons. Each neuron is trained to respond selectively to a brief spectrotemporal pattern, or feature, drawn from the simulated auditory nerve response to speech. The neural population conveys the time-dependent structure of a sound by its sequence of spikes. We compare two methods for decoding the spike sequences--one using a hidden Markov model-based recognizer, the other using a novel template-based recognition scheme. In the latter case, words are recognized by comparing their spike sequences to template sequences obtained from clean training data, using a similarity measure based on the length of the longest common sub-sequence. Using isolated spoken digits from the AURORA-2 database, we show that our combined system outperforms a state-of-the-art robust speech recognizer at low signal-to-noise ratios. Both the spike-based encoding scheme and the template-based decoding offer gains in noise robustness over traditional speech recognition methods. Our system highlights potential advantages of spike-based acoustic coding and provides a biologically motivated framework for robust ASR development.
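
    The template-based decoding can be illustrated with a toy longest-common-subsequence match (placeholder neuron indices, not the paper's feature detectors): the word whose template spike order shares the longest subsequence with the observed order wins.

      # Longest common subsequence (LCS) length between a template spike
      # sequence and an observed one, used as a similarity score for
      # template-based word recognition.
      def lcs_length(a, b):
          table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
          for i in range(1, len(a) + 1):
              for j in range(1, len(b) + 1):
                  if a[i - 1] == b[j - 1]:
                      table[i][j] = table[i - 1][j - 1] + 1
                  else:
                      table[i][j] = max(table[i - 1][j], table[i][j - 1])
          return table[len(a)][len(b)]

      templates = {"one": [3, 7, 2, 9, 4], "two": [5, 1, 8, 6, 0]}
      observed = [3, 2, 9, 5, 4]                     # spike order from a noisy utterance
      best = max(templates, key=lambda w: lcs_length(templates[w], observed))
      print(best)                                    # -> "one"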

  16. How does susceptibility to proactive interference relate to speech recognition in aided and unaided conditions?

    Directory of Open Access Journals (Sweden)

    Rachel Jane Ellis

    2015-08-01

    Full Text Available Proactive interference (PI is the capacity to resist interference to the acquisition of new memories from information stored in the long-term memory. Previous research has shown that PI correlates significantly with the speech-in-noise recognition scores of younger adults with normal hearing. In this study, we report the results of an experiment designed to investigate the extent to which tests of visual PI relate to the speech-in-noise recognition scores of older adults with hearing loss, in aided and unaided conditions. The results suggest that measures of PI correlate significantly with speech-in-noise recognition only in the unaided condition. Furthermore the relation between PI and speech-in-noise recognition differs to that observed in younger listeners without hearing loss. The findings suggest that the relation between PI tests and the speech-in-noise recognition scores of older adults with hearing loss relates to capability of the test to index cognitive flexibility.

  17. The Relationship Between Spectral Modulation Detection and Speech Recognition: Adult Versus Pediatric Cochlear Implant Recipients.

    Science.gov (United States)

    Gifford, René H; Noble, Jack H; Camarata, Stephen M; Sunderhaus, Linsey W; Dwyer, Robert T; Dawant, Benoit M; Dietrich, Mary S; Labadie, Robert F

    2018-01-01

    Adult cochlear implant (CI) recipients demonstrate a reliable relationship between spectral modulation detection and speech understanding. Prior studies documenting this relationship have focused on postlingually deafened adult CI recipients-leaving an open question regarding the relationship between spectral resolution and speech understanding for adults and children with prelingual onset of deafness. Here, we report CI performance on the measures of speech recognition and spectral modulation detection for 578 CI recipients including 477 postlingual adults, 65 prelingual adults, and 36 prelingual pediatric CI users. The results demonstrated a significant correlation between spectral modulation detection and various measures of speech understanding for 542 adult CI recipients. For 36 pediatric CI recipients, however, there was no significant correlation between spectral modulation detection and speech understanding in quiet or in noise nor was spectral modulation detection significantly correlated with listener age or age at implantation. These findings suggest that pediatric CI recipients might not depend upon spectral resolution for speech understanding in the same manner as adult CI recipients. It is possible that pediatric CI users are making use of different cues, such as those contained within the temporal envelope, to achieve high levels of speech understanding. Further investigation is warranted to investigate the relationship between spectral and temporal resolution and speech recognition to describe the underlying mechanisms driving peripheral auditory processing in pediatric CI users.

  18. Language modeling for automatic speech recognition of inflective languages an applications-oriented approach using lexical data

    CERN Document Server

    Donaj, Gregor

    2017-01-01

    This book covers language modeling and automatic speech recognition for inflective languages (e.g. Slavic languages), which represent roughly half of the languages spoken in Europe. These languages do not perform as well as English in speech recognition systems and it is therefore harder to develop an application with sufficient quality for the end user. The authors describe the most important language features for the development of a speech recognition system. This is then presented through the analysis of errors in the system and the development of language models and their inclusion in speech recognition systems, which specifically address the errors that are relevant for targeted applications. The error analysis is done with regard to morphological characteristics of the word in the recognized sentences. The book is oriented towards speech recognition with large vocabularies and continuous and even spontaneous speech. Today such applications work with a rather small number of languages compared to the nu...

  19. Towards Contactless Silent Speech Recognition Based on Detection of Active and Visible Articulators Using IR-UWB Radar.

    Science.gov (United States)

    Shin, Young Hoon; Seo, Jiwon

    2016-10-29

    People with hearing or speaking disabilities are deprived of the benefits of conventional speech recognition technology because it is based on acoustic signals. Recent research has focused on silent speech recognition systems that are based on the motions of a speaker's vocal tract and articulators. Because most silent speech recognition systems use contact sensors that are very inconvenient to users or optical systems that are susceptible to environmental interference, a contactless and robust solution is hence required. Toward this objective, this paper presents a series of signal processing algorithms for a contactless silent speech recognition system using an impulse radio ultra-wide band (IR-UWB) radar. The IR-UWB radar is used to remotely and wirelessly detect motions of the lips and jaw. In order to extract the necessary features of lip and jaw motions from the received radar signals, we propose a feature extraction algorithm. The proposed algorithm noticeably improved speech recognition performance compared to the existing algorithm during our word recognition test with five speakers. We also propose a speech activity detection algorithm to automatically select speech segments from continuous input signals. Thus, speech recognition processing is performed only when speech segments are detected. Our testbed consists of commercial off-the-shelf radar products, and the proposed algorithms are readily applicable without designing specialized radar hardware for silent speech processing.

  20. Automated speech quality monitoring tool based on perceptual evaluation

    OpenAIRE

    Vozňák, Miroslav; Rozhon, Jan

    2010-01-01

    The paper deals with a speech quality monitoring tool that we have developed in accordance with PESQ (Perceptual Evaluation of Speech Quality) and that automatically runs and calculates the MOS (Mean Opinion Score). The results are stored in a database and used in a research project investigating how meteorological conditions influence speech quality in a GSM network. The meteorological station, which is located on our university campus, provides information about a temperature,...
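
    A rough sketch of an automated PESQ check in Python, assuming the open-source `pesq` package rather than the authors' own tool; real use would feed reference and degraded call recordings instead of the synthetic signals below.

      import numpy as np
      from pesq import pesq   # assumed: the open-source "pesq" PyPI package

      fs = 16000
      t = np.arange(0, 1.0, 1 / fs)
      reference = np.sin(2 * np.pi * 440 * t).astype(np.float32)     # clean reference
      noise = np.random.default_rng(5).normal(scale=0.05, size=len(t)).astype(np.float32)
      degraded = reference + noise                                   # simulated degraded call

      mos_lqo = pesq(fs, reference, degraded, 'wb')   # wideband mode, MOS-LQO scale
      print(mos_lqo)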

  1. Pattern recognition

    CERN Document Server

    Theodoridis, Sergios

    2003-01-01

    Pattern recognition is a scientific discipline that is becoming increasingly important in the age of automation and information handling and retrieval. Pattern Recognition, 2e covers the entire spectrum of pattern recognition applications, from image analysis to speech recognition and communications. This book presents cutting-edge material on neural networks - a set of linked microprocessors that can form associations and use pattern recognition to "learn" - and enhances student motivation by approaching pattern recognition from the designer's point of view. A direct result of more than 10

  2. Speech Recognition in Adults With Cochlear Implants: The Effects of Working Memory, Phonological Sensitivity, and Aging.

    Science.gov (United States)

    Moberly, Aaron C; Harris, Michael S; Boyce, Lauren; Nittrouer, Susan

    2017-04-14

    Models of speech recognition suggest that "top-down" linguistic and cognitive functions, such as use of phonotactic constraints and working memory, facilitate recognition under conditions of degradation, such as in noise. The question addressed in this study was what happens to these functions when a listener who has experienced years of hearing loss obtains a cochlear implant. Thirty adults with cochlear implants and 30 age-matched controls with age-normal hearing underwent testing of verbal working memory using digit span and serial recall of words. Phonological capacities were assessed using a lexical decision task and nonword repetition. Recognition of words in sentences in speech-shaped noise was measured. Implant users had only slightly poorer working memory accuracy than did controls and only on serial recall of words; however, phonological sensitivity was highly impaired. Working memory did not facilitate speech recognition in noise for either group. Phonological sensitivity predicted sentence recognition for implant users but not for listeners with normal hearing. Clinical speech recognition outcomes for adult implant users relate to the ability of these users to process phonological information. Results suggest that phonological capacities may serve as potential clinical targets through rehabilitative training. Such novel interventions may be particularly helpful for older adult implant users.

  3. Self-organizing map classifier for stressed speech recognition

    Science.gov (United States)

    Partila, Pavol; Tovarek, Jaromir; Voznak, Miroslav

    2016-05-01

    This paper presents a method for detecting speech under stress using Self-Organizing Maps. Most people who are exposed to stressful situations cannot adequately respond to stimuli. The army, police, and fire services account for the largest share of occupations in which stressful situations are common. Personnel in action are directed by a control center, and control commands should be adapted to the psychological state of the person in action. It is known that psychological changes in the human body are also reflected physiologically, which consequently means stress-affected speech. Therefore, it is clear that a system for recognizing stress in speech is required in the security forces. One possible classifier, which is popular for its flexibility, is the self-organizing map, a type of artificial neural network. Flexibility here means that the classifier is independent of the character of the input data, a feature that is well suited to speech processing. Human stress can be seen as a kind of emotional state. Mel-frequency cepstral coefficients, LPC coefficients, and prosody features were selected as input data because of their sensitivity to emotional changes. The parameters were calculated on speech recordings that can be divided into two classes, namely stress-state recordings and normal-state recordings. The contribution of the experiment is a method using a SOM classifier for stressed speech detection. The results showed the advantage of this method, namely its flexibility with respect to the input data.
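
    A minimal sketch of such a SOM-based classifier, assuming per-utterance feature vectors (e.g. MFCC/LPC/prosody statistics) have already been computed, could use the third-party minisom package. The grid size and training parameters below are illustrative, not the paper's settings.

        # Train a SOM on feature vectors, then label each map node by majority vote.
        import numpy as np
        from minisom import MiniSom
        from collections import Counter, defaultdict

        def train_som_classifier(features, labels, grid=8, iters=5000):
            som = MiniSom(grid, grid, features.shape[1], sigma=1.0, learning_rate=0.5)
            som.train_random(features, iters)
            votes = defaultdict(Counter)
            for x, y in zip(features, labels):
                votes[som.winner(x)][y] += 1
            node_labels = {node: c.most_common(1)[0][0] for node, c in votes.items()}
            return som, node_labels

        def predict(som, node_labels, x, default="normal"):
            # Unseen nodes fall back to a default class.
            return node_labels.get(som.winner(x), default)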

  4. Speech Recognition of Aged Voices in the AAL Context: Detection of Distress Sentences

    OpenAIRE

    Aman , Frédéric; Vacher , Michel; Rossato , Solange; Portet , François

    2013-01-01

    By 2050, about a third of the French population will be over 65. In the context of developing technologies aimed at helping aged people live independently at home, the CIRDO project aims at implementing an ASR system in a social inclusion product designed for elderly people in order to detect distress situations. Speech recognition systems present a higher word error rate when speech is uttered by elderly speakers than when non-aged voices are considered. Two...

  5. Practising verbal maritime communication with computer dialogue systems using automatic speech recognition (My Practice session)

    OpenAIRE

    John, Peter; Wellmann, J.; Appell, J.E.

    2016-01-01

    This My Practice session presents a novel online tool for practising verbal communication in a maritime setting. It is based on low-fi ChatBot simulation exercises which employ computer-based dialogue systems. The ChatBot exercises are equipped with an automatic speech recognition engine specifically designed for maritime communication. The speech input and output functionality enables learners to communicate with the computer freely and spontaneously. The exercises replicate real communicati...

  6. Investigations on search methods for speech recognition using weighted finite state transducers

    OpenAIRE

    Rybach, David

    2014-01-01

    The search problem in the statistical approach to speech recognition is to find the most likely word sequence for an observed speech signal using a combination of knowledge sources, i.e. the language model, the pronunciation model, and the acoustic models of phones. The resulting search space is enormous. Therefore, an efficient search strategy is required to compute the result with a feasible amount of time and memory. The structured statistical models as well as their combination, the searc...

  7. Comparison of Forced-Alignment Speech Recognition and Humans for Generating Reference VAD

    DEFF Research Database (Denmark)

    Kraljevski, Ivan; Tan, Zheng-Hua; Paola Bissiri, Maria

    2015-01-01

    This paper aims to answer the question whether forced-alignment speech recognition can be used as an alternative to humans in generating reference Voice Activity Detection (VAD) transcriptions. An investigation of the level of agreement between automatic/manual VAD transcriptions and the reference ones produced by a human expert was carried out. Thereafter, statistical analysis was employed on the automatically produced and the collected manual transcriptions. Experimental results confirmed that forced-alignment speech recognition can provide accurate and consistent VAD labels.
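
    One way to quantify such agreement is to convert both sets of segment boundaries into frame-level labels and compute percent agreement and Cohen's kappa. The sketch below uses made-up segment times and is not the statistical analysis used in the paper.

        import numpy as np
        from sklearn.metrics import cohen_kappa_score

        def frame_labels(segments, n_frames, hop_s=0.01):
            """Convert (start_s, end_s) speech segments into 0/1 frame labels."""
            labels = np.zeros(n_frames, dtype=int)
            for start, end in segments:
                labels[int(start / hop_s):int(end / hop_s)] = 1
            return labels

        auto = frame_labels([(0.30, 1.25), (1.90, 3.10)], n_frames=400)
        manual = frame_labels([(0.32, 1.20), (1.95, 3.05)], n_frames=400)
        print("frame agreement:", (auto == manual).mean())
        print("Cohen's kappa:", cohen_kappa_score(auto, manual))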

  8. A Novel DBN Feature Fusion Model for Cross-Corpus Speech Emotion Recognition

    Directory of Open Access Journals (Sweden)

    Zou Cairong

    2016-01-01

    Full Text Available The fusion of features from separate sources is a current technical difficulty in cross-corpus speech emotion recognition. The purpose of this paper is to use the emotional information hidden in the speech spectrum diagram (spectrogram) as image features, based on Deep Belief Nets (DBN) in deep learning, and then to fuse them with traditional emotion features. First, based on spectrogram analysis with the STB/Itti model, new spectrogram features are extracted from the color, the brightness, and the orientation, respectively; then two alternative DBN models fuse the traditional and the spectrogram features, which increases the scale of the feature subset and its ability to characterize emotion. In experiments on the ABC database and Chinese corpora, the new feature subset improves cross-corpus recognition results by 8.8% compared with traditional speech emotion features. The proposed method provides a new idea for feature fusion in emotion recognition.

  9. Speech Recognition for Medical Dictation: Overview in Quebec and Systematic Review.

    Science.gov (United States)

    Poder, Thomas G; Fisette, Jean-François; Déry, Véronique

    2018-04-03

    Speech recognition is increasingly used in medical reporting. The aim of this article is to identify in the literature the strengths and weaknesses of this technology, as well as barriers to and facilitators of its implementation. A systematic review of systematic reviews was performed using PubMed, Scopus, the Cochrane Library and the Center for Reviews and Dissemination through August 2017. The gray literature has also been consulted. The quality of systematic reviews has been assessed with the AMSTAR checklist. The main inclusion criterion was use of speech recognition for medical reporting (front-end or back-end). A survey has also been conducted in Quebec, Canada, to identify the dissemination of this technology in this province, as well as the factors leading to the success or failure of its implementation. Five systematic reviews were identified. These reviews indicated a high level of heterogeneity across studies. The quality of the studies reported was generally poor. Speech recognition is not as accurate as human transcription, but it can dramatically reduce turnaround times for reporting. In front-end use, medical doctors need to spend more time on dictation and correction than required with human transcription. With speech recognition, major errors occur up to three times more frequently. In back-end use, a potential increase in productivity of transcriptionists was noted. In conclusion, speech recognition offers several advantages for medical reporting. However, these advantages are countered by an increased burden on medical doctors and by risks of additional errors in medical reports. It is also hard to identify for which medical specialties and which clinical activities the use of speech recognition will be the most beneficial.

  10. Predicting Intelligibility Gains in Dysarthria through Automated Speech Feature Analysis

    Science.gov (United States)

    Fletcher, Annalise R.; Wisler, Alan A.; McAuliffe, Megan J.; Lansford, Kaitlin L.; Liss, Julie M.

    2017-01-01

    Purpose: Behavioral speech modifications have variable effects on the intelligibility of speakers with dysarthria. In the companion article, a significant relationship was found between measures of speakers' baseline speech and their intelligibility gains following cues to speak louder and reduce rate (Fletcher, McAuliffe, Lansford, Sinex, &…

  11. Multi-Stage Recognition of Speech Emotion Using Sequential Forward Feature Selection

    Directory of Open Access Journals (Sweden)

    Liogienė Tatjana

    2016-07-01

    Full Text Available The intensive research of speech emotion recognition has introduced a huge collection of speech emotion features. Large feature sets complicate the speech emotion recognition task. Among various feature selection and transformation techniques for one-stage classification, multiple classifier systems were proposed. The main idea of multiple classifiers is to arrange the emotion classification process in stages. Besides parallel and serial cases, the hierarchical arrangement of multi-stage classification is most widely used for speech emotion recognition. In this paper, we present a sequential-forward-feature-selection-based multi-stage classification scheme. The Sequential Forward Selection (SFS) and Sequential Floating Forward Selection (SFFS) techniques were employed for every stage of the multi-stage classification scheme. Experimental testing of the proposed scheme was performed using the German and Lithuanian emotional speech datasets. Sequential-feature-selection-based multi-stage classification outperformed the single-stage scheme by 12–42 % for different emotion sets. The multi-stage scheme has shown higher robustness to the growth of the emotion set. The decrease in recognition rate with the increase in emotion set for the multi-stage scheme was lower by 10–20 % in comparison with the single-stage case. Differences in SFS and SFFS employment for feature selection were negligible.
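
    For reference, plain sequential forward selection (without the floating variant) is available in scikit-learn. The sketch below runs it on random stand-in data, so the feature matrix, labels, and classifier choice are all assumptions.

        import numpy as np
        from sklearn.feature_selection import SequentialFeatureSelector
        from sklearn.svm import SVC

        rng = np.random.default_rng(0)
        X = rng.normal(size=(200, 40))      # stand-in for acoustic feature vectors
        y = rng.integers(0, 4, size=200)    # stand-in for four emotion classes

        sfs = SequentialFeatureSelector(SVC(kernel="linear"),
                                        n_features_to_select=10,
                                        direction="forward", cv=3)
        sfs.fit(X, y)
        selected = np.flatnonzero(sfs.get_support())
        print("selected feature indices:", selected)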

  12. Automated target recognition and tracking using an optical pattern recognition neural network

    Science.gov (United States)

    Chao, Tien-Hsin

    1991-01-01

    The on-going development of an automatic target recognition and tracking system at the Jet Propulsion Laboratory is presented. This system is an optical pattern recognition neural network (OPRNN) that is an integration of an innovative optical parallel processor and a feature extraction based neural net training algorithm. The parallel optical processor provides high speed and vast parallelism as well as full shift invariance. The neural network algorithm enables simultaneous discrimination of multiple noisy targets in spite of their scales, rotations, perspectives, and various deformations. This fully developed OPRNN system can be effectively utilized for the automated spacecraft recognition and tracking that will lead to success in the Automated Rendezvous and Capture (AR&C) of the unmanned Cargo Transfer Vehicle (CTV). One of the most powerful optical parallel processors for automatic target recognition is the multichannel correlator. With the inherent advantages of parallel processing capability and shift invariance, multiple objects can be simultaneously recognized and tracked using this multichannel correlator. This target tracking capability can be greatly enhanced by utilizing a powerful feature extraction based neural network training algorithm such as the neocognitron. The OPRNN, currently under investigation at JPL, is constructed with an optical multichannel correlator where holographic filters have been prepared using the neocognitron training algorithm. The computation speed of the neocognitron-type OPRNN is up to 10^14 analog connections/sec, enabling the OPRNN to outperform its state-of-the-art electronic counterpart by at least two orders of magnitude.

  13. Collecting and evaluating speech recognition corpora for 11 South African languages

    CSIR Research Space (South Africa)

    Badenhorst, J

    2011-08-01

    Full Text Available . In addition, speech-based access to information may empower illiterate or semi-literate people, 98% of whom live in the developing world. SDSs can play a useful role in a wide range of applications. Of particular importance in Africa are applications... speech (i.e. appropriate for the recognition task in terms of the language used, the profile of the speakers, speaking style, etc.). This speech generally needs to be curated and transcribed prior to the development of ASR systems, and for most...

  14. Effects of noise on speech recognition: Challenges for communication by service members.

    Science.gov (United States)

    Le Prell, Colleen G; Clavier, Odile H

    2017-06-01

    Speech communication often takes place in noisy environments; this is an urgent issue for military personnel who must communicate in high-noise environments. The effects of noise on speech recognition vary significantly according to the sources of noise, the number and types of talkers, and the listener's hearing ability. In this review, speech communication is first described as it relates to current standards of hearing assessment for military and civilian populations. The next section categorizes types of noise (also called maskers) according to their temporal characteristics (steady or fluctuating) and perceptive effects (energetic or informational masking). Next, speech recognition difficulties experienced by listeners with hearing loss and by older listeners are summarized, and questions on the possible causes of speech-in-noise difficulty are discussed, including recent suggestions of "hidden hearing loss". The final section describes tests used by military and civilian researchers, audiologists, and hearing technicians to assess performance of an individual in recognizing speech in background noise, as well as metrics that predict performance based on a listener and background noise profile. This article provides readers with an overview of the challenges associated with speech communication in noisy backgrounds, as well as its assessment and potential impact on functional performance, and provides guidance for important new research directions relevant not only to military personnel, but also to employees who work in high noise environments. Copyright © 2016 Elsevier B.V. All rights reserved.

  15. [Intermodal timing cues for audio-visual speech recognition].

    Science.gov (United States)

    Hashimoto, Masahiro; Kumashiro, Masaharu

    2004-06-01

    The purpose of this study was to investigate the limitations of lip-reading advantages for Japanese young adults by desynchronizing visual and auditory information in speech. In the experiment, audio-visual speech stimuli were presented under the six test conditions: audio-alone, and audio-visually with either 0, 60, 120, 240 or 480 ms of audio delay. The stimuli were the video recordings of a face of a female Japanese speaking long and short Japanese sentences. The intelligibility of the audio-visual stimuli was measured as a function of audio delays in sixteen untrained young subjects. Speech intelligibility under the audio-delay condition of less than 120 ms was significantly better than that under the audio-alone condition. On the other hand, the delay of 120 ms corresponded to the mean mora duration measured for the audio stimuli. The results implied that audio delays of up to 120 ms would not disrupt lip-reading advantage, because visual and auditory information in speech seemed to be integrated on a syllabic time scale. Potential applications of this research include noisy workplace in which a worker must extract relevant speech from all the other competing noises.

  16. Fusing Eye-gaze and Speech Recognition for Tracking in an Automatic Reading Tutor

    DEFF Research Database (Denmark)

    Rasmussen, Morten Højfeldt; Tan, Zheng-Hua

    2013-01-01

    In this paper we present a novel approach for automatically tracking the reading progress using a combination of eye-gaze tracking and speech recognition. The two are fused by first generating word probabilities based on eye-gaze information and then using these probabilities to augment the language model...

  17. A New Fuzzy Cognitive Map Learning Algorithm for Speech Emotion Recognition

    Directory of Open Access Journals (Sweden)

    Wei Zhang

    2017-01-01

    Full Text Available Selecting an appropriate recognition method is crucial in speech emotion recognition applications. However, the current methods do not consider the relationship between emotions. Thus, in this study, a speech emotion recognition system based on the fuzzy cognitive map (FCM) approach is constructed. Moreover, a new FCM learning algorithm for speech emotion recognition is proposed. This algorithm includes the use of the pleasure-arousal-dominance emotion scale to calculate the weights between emotions and certain mathematical derivations to determine the network structure. The proposed algorithm can handle a large number of concepts, whereas a typical FCM can handle only relatively simple networks (maps). Different acoustic features, including fundamental speech features and a new spectral feature, are extracted to evaluate the performance of the proposed method. Three experiments are conducted in this paper, namely, a single feature experiment, a feature combination experiment, and a comparison between the proposed algorithm and typical networks. All experiments are performed on the TYUT2.0 and EMO-DB databases. Results of the feature combination experiments show that the recognition rates of the combination features are 10%–20% better than those of single features. The proposed FCM learning algorithm generates a 5%–20% performance improvement compared with traditional classification networks.
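
    The core of an FCM is a simple iterative update of concept activations through a weight matrix. The sketch below shows a generic synchronous update with a sigmoid squashing function; the example weight matrix is arbitrary and does not reproduce the pleasure-arousal-dominance-derived weights of the paper.

        # Minimal numpy sketch of an FCM update of the form A(t+1) = f(A(t) + W^T A(t)).
        import numpy as np

        def fcm_step(activations, weights, lam=1.0):
            """One synchronous FCM update with a sigmoid squashing function."""
            net = activations + weights.T @ activations
            return 1.0 / (1.0 + np.exp(-lam * net))

        concepts = ["anger", "happiness", "sadness", "neutral"]
        W = np.array([[ 0.0, -0.4,  0.2, -0.1],
                      [-0.4,  0.0, -0.3,  0.1],
                      [ 0.2, -0.3,  0.0,  0.2],
                      [-0.1,  0.1,  0.2,  0.0]])   # illustrative weights only
        state = np.array([0.7, 0.1, 0.3, 0.2])
        for _ in range(10):                        # iterate toward a fixed point
            state = fcm_step(state, W)
        print(dict(zip(concepts, state.round(3))))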

  18. Speech recognition in normal hearing and sensorineural hearing loss as a function of the number of spectral channels

    NARCIS (Netherlands)

    Baskent, Deniz

    Speech recognition by normal-hearing listeners improves as a function of the number of spectral channels when tested with a noiseband vocoder simulating cochlear implant signal processing. Speech recognition by the best cochlear implant users, however, saturates around eight channels and does not

  19. Memristive Computational Architecture of an Echo State Network for Real-Time Speech Emotion Recognition

    Science.gov (United States)

    2015-05-28

    Speech-based emotion recognition is simpler and requires less computational resources compared to other inputs such as facial expressions. The Berlin Database of Emotional Speech was used for evaluation.

  20. Divided attention disrupts perceptual encoding during speech recognition.

    Science.gov (United States)

    Mattys, Sven L; Palmer, Shekeila D

    2015-03-01

    Performing a secondary task while listening to speech has a detrimental effect on speech processing, but the locus of the disruption within the speech system is poorly understood. Recent research has shown that cognitive load imposed by a concurrent visual task increases dependency on lexical knowledge during speech processing, but it does not affect lexical activation per se. This suggests that "lexical drift" under cognitive load occurs either as a post-lexical bias at the decisional level or as a secondary consequence of reduced perceptual sensitivity. This study aimed to adjudicate between these alternatives using a forced-choice task that required listeners to identify noise-degraded spoken words with or without the addition of a concurrent visual task. Adding cognitive load increased the likelihood that listeners would select a word acoustically similar to the target even though its frequency was lower than that of the target. Thus, there was no evidence that cognitive load led to a high-frequency response bias. Rather, cognitive load seems to disrupt sublexical encoding, possibly by impairing perceptual acuity at the auditory periphery.

  1. Multilingual Techniques for Low Resource Automatic Speech Recognition

    Science.gov (United States)

    2016-05-20


  2. Effects of hearing loss on speech recognition under distracting conditions and working memory in the elderly

    Directory of Open Access Journals (Sweden)

    Na W

    2017-08-01

    Full Text Available Purpose: The current study aimed to evaluate hearing-related changes in terms of speech-in-noise processing, fast-rate speech processing, and working memory; and to identify which of these three factors is significantly affected by age-related hearing loss. Methods: One hundred subjects aged 65–84 years participated in the study. They were classified into four groups ranging from normal hearing to moderate-to-severe hearing loss. All the participants were tested for speech perception in quiet and noisy conditions and for speech perception with time alteration in quiet conditions. Forward- and backward-digit span tests were also conducted to measure the participants' working memory. Results: 1) As the level of background noise increased, speech perception scores systematically decreased in all the groups. This pattern was more noticeable in the three hearing-impaired groups than in the normal hearing group. 2) As the speech rate increased, speech perception scores decreased. A significant interaction was found between speed of speech and hearing loss. In particular, 30% of compressed sentences revealed a clear differentiation between moderate hearing loss and moderate-to-severe hearing loss. 3) Although all the groups showed a longer span on the forward-digit span test than the backward-digit span test, there was no significant difference as a function of hearing loss. Conclusion: The degree of hearing loss strongly affects the speech recognition of babble-masked and time-compressed speech in the elderly but does not affect working memory. We expect these results to be applied to appropriate rehabilitation strategies for hearing-impaired elderly who experience difficulty in communication.

  3. Automated recognition system for ELM classification in JET

    International Nuclear Information System (INIS)

    Duro, N.; Dormido, R.; Vega, J.; Dormido-Canto, S.; Farias, G.; Sanchez, J.; Vargas, H.; Murari, A.

    2009-01-01

    Edge localized modes (ELMs) are instabilities occurring in the edge of H-mode plasmas. Considerable efforts are being devoted to understanding the physics behind this non-linear phenomenon. A first characterization of ELMs is usually their identification as type I or type III. An automated pattern recognition system has been developed in JET for off-line ELM recognition and classification. The empirical method presented in this paper analyzes each individual ELM instead of starting from a temporal segment containing many ELM bursts. The ELM recognition and isolation is carried out using three signals: Dα, line integrated electron density and stored diamagnetic energy. A reduced set of characteristics (such as diamagnetic energy drop, ELM period or Dα shape) has been extracted to build supervised and unsupervised learning systems for classification purposes. The former are based on support vector machines (SVM). The latter have been developed with hierarchical and K-means clustering methods. The success rate of the classification systems is about 98% for a database of almost 300 ELMs.

  4. Implementation of a Tour Guide Robot System Using RFID Technology and Viterbi Algorithm-Based HMM for Speech Recognition

    Directory of Open Access Journals (Sweden)

    Neng-Sheng Pai

    2014-01-01

    Full Text Available This paper applied speech recognition and RFID technologies to develop an omni-directional mobile robot into a robot with voice control and guide introduction functions. For speech recognition, the speech signals were captured by short-time processing. The speaker first recorded the isolated words for the robot to create a speech database of specific speakers. After pre-processing of this speech database, the feature parameters of cepstrum and delta-cepstrum were obtained using linear predictive coefficients (LPC). Then, the Hidden Markov Model (HMM) was used for model training on the speech database, and the Viterbi algorithm was used to find an optimal state sequence as the reference sample for speech recognition. The trained reference models were put into the industrial computer on the robot platform, and the user entered the isolated words to be tested. Each test utterance was processed in the same way and compared with the reference models; the path with the maximum total probability among the models, found using the Viterbi algorithm, was taken as the recognition result. Finally, the speech recognition and RFID systems were evaluated in an actual environment to prove their feasibility and stability, and implemented on the omni-directional mobile robot.
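
    A comparable isolated-word pipeline can be sketched with the hmmlearn package: one Gaussian HMM per word, trained on feature frames, with the Viterbi path log-probability used to pick the best-matching word. Feature extraction is assumed to happen elsewhere, and the number of states is illustrative rather than the paper's setting.

        import numpy as np
        from hmmlearn import hmm

        def train_word_models(training_data, n_states=5):
            """training_data: dict mapping word -> list of (n_frames, n_dims) arrays."""
            models = {}
            for word, utterances in training_data.items():
                X = np.vstack(utterances)
                lengths = [len(u) for u in utterances]
                m = hmm.GaussianHMM(n_components=n_states, covariance_type="diag",
                                    n_iter=25)
                m.fit(X, lengths)
                models[word] = m
            return models

        def recognize(models, features):
            # Pick the word whose model gives the highest Viterbi path log-probability.
            return max(models, key=lambda w: models[w].decode(features)[0])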

  5. Temporal acuity and speech recognition score in noise in patients with multiple sclerosis

    Directory of Open Access Journals (Sweden)

    Mehri Maleki

    2014-04-01

    Full Text Available Background and Aim: Multiple sclerosis (MS) is a central nervous system disease that can be associated with a variety of symptoms such as hearing disorders. The main consequence of hearing loss is poor speech perception, and temporal acuity has an important role in speech perception. We evaluated speech perception in silence and in the presence of noise, as well as temporal acuity, in patients with multiple sclerosis. Methods: Eighteen adults with multiple sclerosis with a mean age of 37.28 years and 18 age- and sex-matched controls with a mean age of 38.00 years participated in this study. Temporal acuity and speech perception were evaluated by the random gap detection test (GDT) and the word recognition score (WRS) at three different signal-to-noise ratios. Results: Statistical analysis of the test results revealed significant differences between the two groups (p<0.05). Analysis of the gap detection test (at 4 sensation levels) and the word recognition score in both groups showed significant differences (p<0.001). Conclusion: According to this survey, the ability of patients with multiple sclerosis to process temporal features of a stimulus was impaired. It seems that this impairment is an important factor in their decreased word recognition score and speech perception.

  6. ANALYSIS OF MULTIMODAL FUSION TECHNIQUES FOR AUDIO-VISUAL SPEECH RECOGNITION

    Directory of Open Access Journals (Sweden)

    D.V. Ivanko

    2016-05-01

    Full Text Available The paper presents an analytical review covering the latest achievements in the field of audio-visual (AV) fusion (integration) of multimodal information. We discuss the main challenges and report on approaches to address them. One of the most important tasks of AV integration is to understand how the modalities interact and influence each other. The paper addresses this problem in the context of AV speech processing and speech recognition. In the first part of the review we set out the basic principles of AV speech recognition and give a classification of audio and visual features of speech. Special attention is paid to the systematization of the existing techniques and the AV data fusion methods. In the second part we provide a consolidated list of tasks and applications that use AV fusion, based on an analysis of the research area. We also indicate the methods, techniques, and audio and video features used. We propose a classification of AV integration and discuss the advantages and disadvantages of different approaches. We draw conclusions and offer our assessment of the future of the field of AV fusion. In further research we plan to implement a system of audio-visual Russian continuous speech recognition using advanced methods of multimodal fusion.

  7. Tone realisation in a Yoruba speech recognition corpus

    CSIR Research Space (South Africa)

    Van Niekerk, D

    2012-05-01

    Full Text Available ... development. Extracted contours are processed and analysed statistically to describe acoustic properties in different tonal contexts. The authors demonstrate how features useful for tone recognition or synthesis can be successfully extracted from a corpus...

  8. Progressive-Search Algorithms for Large-Vocabulary Speech Recognition

    National Research Council Canada - National Science Library

    Murveit, Hy; Butzberger, John; Digalakis, Vassilios; Weintraub, Mitch

    1993-01-01

    .... An algorithm, the "Forward-Backward Word-Life Algorithm," is described. It can generate a word lattice in a progressive search that would be used as a language model embedded in a succeeding recognition pass to reduce computation requirements...

  9. Audiovisual cues benefit recognition of accented speech in noise but not perceptual adaptation.

    Science.gov (United States)

    Banks, Briony; Gowen, Emma; Munro, Kevin J; Adank, Patti

    2015-01-01

    Perceptual adaptation allows humans to recognize different varieties of accented speech. We investigated whether perceptual adaptation to accented speech is facilitated if listeners can see a speaker's facial and mouth movements. In Study 1, participants listened to sentences in a novel accent and underwent a period of training with audiovisual or audio-only speech cues, presented in quiet or in background noise. A control group also underwent training with visual-only (speech-reading) cues. We observed no significant difference in perceptual adaptation between any of the groups. To address a number of remaining questions, we carried out a second study using a different accent, speaker and experimental design, in which participants listened to sentences in a non-native (Japanese) accent with audiovisual or audio-only cues, without separate training. Participants' eye gaze was recorded to verify that they looked at the speaker's face during audiovisual trials. Recognition accuracy was significantly better for audiovisual than for audio-only stimuli; however, no statistical difference in perceptual adaptation was observed between the two modalities. Furthermore, Bayesian analysis suggested that the data supported the null hypothesis. Our results suggest that although the availability of visual speech cues may be immediately beneficial for recognition of unfamiliar accented speech in noise, it does not improve perceptual adaptation.

  10. Sparse coding of the modulation spectrum for noise-robust automatic speech recognition

    NARCIS (Netherlands)

    Ahmadi, S.; Ahadi, S.M.; Cranen, B.; Boves, L.W.J.

    2014-01-01

    The full modulation spectrum is a high-dimensional representation of one-dimensional audio signals. Most previous research in automatic speech recognition converted this very rich representation into the equivalent of a sequence of short-time power spectra, mainly to simplify the computation of the

  11. Multistage Data Selection-based Unsupervised Speaker Adaptation for Personalized Speech Emotion Recognition

    NARCIS (Netherlands)

    Kim, Jaebok; Park, Jeong-Sik

    This paper proposes an efficient speech emotion recognition (SER) approach that utilizes personal voice data accumulated on personal devices. A representative weakness of conventional SER systems is the user-dependent performance induced by the speaker independent (SI) acoustic model framework. But,

  12. Speech-based recognition of self-reported and observed emotion in a dimensional space

    NARCIS (Netherlands)

    Truong, Khiet Phuong; van Leeuwen, David A.; de Jong, Franciska M.G.

    2012-01-01

    The differences between self-reported and observed emotion have only marginally been investigated in the context of speech-based automatic emotion recognition. We address this issue by comparing self-reported emotion ratings to observed emotion ratings and look at how differences between these two

  13. User Experience of a Mobile Speaking Application with Automatic Speech Recognition for EFL Learning

    Science.gov (United States)

    Ahn, Tae youn; Lee, Sangmin-Michelle

    2016-01-01

    With the spread of mobile devices, mobile phones have enormous potential regarding their pedagogical use in language education. The goal of this study is to analyse user experience of a mobile-based learning system that is enhanced by speech recognition technology for the improvement of EFL (English as a foreign language) learners' speaking…

  14. Enhancing Speech Recognition Using Improved Particle Swarm Optimization Based Hidden Markov Model

    Directory of Open Access Journals (Sweden)

    Lokesh Selvaraj

    2014-01-01

    Full Text Available Enhancing speech recognition is the primary intention of this work. In this paper a novel speech recognition method based on vector quantization and improved particle swarm optimization (IPSO) is suggested. The suggested methodology contains four stages, namely, (i) denoising, (ii) feature mining, (iii) vector quantization, and (iv) an IPSO-based hidden Markov model (HMM) technique (IP-HMM). At first, the speech signals are denoised using a median filter. Next, characteristics such as peak, pitch spectrum, Mel-frequency cepstral coefficients (MFCC), mean, standard deviation, and minimum and maximum of the signal are extracted from the denoised signal. Following that, to accomplish the training process, the extracted characteristics are given to genetic-algorithm-based codebook generation in vector quantization. The initial populations are created by selecting random code vectors from the training set for the codebooks of the genetic algorithm process, and IP-HMM performs the recognition. At this point, variation is introduced through the crossover genetic operation. The proposed speech recognition technique offers 97.14% accuracy.
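
    The vector-quantization stage can be illustrated with an ordinary k-means codebook, standing in for the paper's genetic-algorithm-based codebook search. The code below only shows the quantization idea, with illustrative sizes.

        import numpy as np
        from sklearn.cluster import KMeans

        def build_codebook(feature_frames, codebook_size=64):
            # feature_frames: (n_frames, n_dims) array of acoustic features.
            km = KMeans(n_clusters=codebook_size, n_init=10, random_state=0)
            km.fit(feature_frames)
            return km.cluster_centers_

        def quantize(frames, codebook):
            """Map each frame to the index of its nearest code vector."""
            dists = np.linalg.norm(frames[:, None, :] - codebook[None, :, :], axis=2)
            return dists.argmin(axis=1)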

  15. The Affordance of Speech Recognition Technology for EFL Learning in an Elementary School Setting

    Science.gov (United States)

    Liaw, Meei-Ling

    2014-01-01

    This study examined the use of speech recognition (SR) technology to support a group of elementary school children's learning of English as a foreign language (EFL). SR technology has been used in various language learning contexts. Its application to EFL teaching and learning is still relatively recent, but a solid understanding of its…

  16. Investigating an Innovative Computer Application to Improve L2 Word Recognition from Speech

    Science.gov (United States)

    Matthews, Joshua; O'Toole, John Mitchell

    2015-01-01

    The ability to recognise words from the aural modality is a critical aspect of successful second language (L2) listening comprehension. However, little research has been reported on computer-mediated development of L2 word recognition from speech in L2 learning contexts. This report describes the development of an innovative computer application…

  17. A Neuro-Linguistic Model for Speech Recognition in Tone Language

    African Journals Online (AJOL)

    The primary aim for this work is to develop a speech recognition system that exploits the computational paradigm with learning ability and the inherent robustness and parallelism in ANN coupled with the capability of fuzzy logic to model vagueness, handling uncertainness and support for human reasoning. This research ...

  18. ISOLATED SPEECH RECOGNITION SYSTEM FOR TAMIL LANGUAGE USING STATISTICAL PATTERN MATCHING AND MACHINE LEARNING TECHNIQUES

    Directory of Open Access Journals (Sweden)

    VIMALA C.

    2015-05-01

    Full Text Available In recent years, speech technology has become a vital part of our daily lives. Various techniques have been proposed for developing Automatic Speech Recognition (ASR) systems and have achieved great success in many applications. Among them, template matching techniques like Dynamic Time Warping (DTW), statistical pattern matching techniques such as the Hidden Markov Model (HMM) and Gaussian Mixture Models (GMM), and machine learning techniques such as Neural Networks (NN), Support Vector Machines (SVM), and Decision Trees (DT) are most popular. The main objective of this paper is to design and develop a speaker-independent isolated speech recognition system for the Tamil language using the above speech recognition techniques. The background of ASR systems, the steps involved in ASR, the merits and demerits of the conventional and machine learning algorithms, and the observations made from the experiments are presented in this paper. For the developed system, the highest word recognition accuracy is achieved with the HMM technique: it offered 100% accuracy during the training process and 97.92% during testing.
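
    Of the techniques listed, DTW-based template matching is the simplest to sketch. The function below is a textbook O(N·M) implementation over generic feature sequences, not the authors' implementation.

        # Dynamic time warping distance between two feature sequences.
        import numpy as np

        def dtw_distance(a, b):
            """a, b: (n_frames, n_dims) arrays of acoustic features."""
            n, m = len(a), len(b)
            D = np.full((n + 1, m + 1), np.inf)
            D[0, 0] = 0.0
            for i in range(1, n + 1):
                for j in range(1, m + 1):
                    cost = np.linalg.norm(a[i - 1] - b[j - 1])
                    D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
            return D[n, m]

        # Recognition picks the template with the smallest warped distance, e.g.:
        # predicted = min(templates, key=lambda w: dtw_distance(test, templates[w]))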

  19. Learning spectral-temporal features with 3D CNNs for speech emotion recognition

    NARCIS (Netherlands)

    Kim, Jaebok; Truong, Khiet; Englebienne, Gwenn; Evers, Vanessa

    2017-01-01

    In this paper, we propose to use deep 3-dimensional convolutional networks (3D CNNs) in order to address the challenge of modelling spectro-temporal dynamics for speech emotion recognition (SER). Compared to a hybrid of Convolutional Neural Network and Long-Short-Term-Memory (CNN-LSTM), our proposed

  20. Using word spotting to evaluate ROILA: a speech recognition friendly artificial language

    NARCIS (Netherlands)

    Mubin, O.; Bartneck, C.; Feijs, L.M.G.

    2010-01-01

    In our research we argue for the benefits that an artificial language could provide to improve the accuracy of speech recognition. We briefly present the design and implementation of a vocabulary of our intended artificial language (ROILA), the latter by means of a genetic algorithm that attempted

  1. Evaluation of missing data techniques for in-car automatic speech recognition

    OpenAIRE

    Wang, Y.; Vuerinckx, R.; Gemmeke, J.F.; Cranen, B.; Hamme, H. Van

    2009-01-01

    Wang Y., Vuerinckx R., Gemmeke J., Cranen B., Van hamme H., ''Evaluation of missing data techniques for in-car automatic speech recognition'', Proceedings NAG/DAGA 2009 - international conference on acoustics, 4 pp., March 23-26, 2009, Rotterdam, The Netherlands.

  2. Dealing with Phrase Level Co-Articulation (PLC) in speech recognition: a first approach

    NARCIS (Netherlands)

    Ordelman, Roeland J.F.; van Hessen, Adrianus J.; van Leeuwen, David A.; Robinson, Tony; Renals, Steve

    1999-01-01

    Whereas nowadays within-word co-articulation effects are usually sufficiently dealt with in automatic speech recognition, this is not always the case with phrase level co-articulation effects (PLC). This paper describes a first approach in dealing with phrase level co-articulation by applying these

  3. Phonotactics Constraints and the Spoken Word Recognition of Chinese Words in Speech

    Science.gov (United States)

    Yip, Michael C.

    2016-01-01

    Two word-spotting experiments were conducted to examine the question of whether native Cantonese listeners are constrained by phonotactics information in spoken word recognition of Chinese words in speech. Because no legal consonant clusters occurred within an individual Chinese word, this kind of categorical phonotactics information of Chinese…

  4. Speech Recognition: Acoustic-Phonetic Knowledge Acquisition and Representation.

    Science.gov (United States)

    1987-09-25

    ... The input speech signal is first transformed into a representation that takes into account differences in sociolinguistic background, dialect, and vocal tract ...

  5. Iconic Gestures for Robot Avatars, Recognition and Integration with Speech

    Science.gov (United States)

    Bremner, Paul; Leonards, Ute

    2016-01-01

    Co-verbal gestures are an important part of human communication, improving its efficiency and efficacy for information conveyance. One possible means by which such multi-modal communication might be realized remotely is through the use of a tele-operated humanoid robot avatar. Such avatars have been previously shown to enhance social presence and operator salience. We present a motion tracking based tele-operation system for the NAO robot platform that allows direct transmission of speech and gestures produced by the operator. To assess the capabilities of this system for transmitting multi-modal communication, we have conducted a user study that investigated if robot-produced iconic gestures are comprehensible, and are integrated with speech. Robot performed gesture outcomes were compared directly to those for gestures produced by a human actor, using a within participant experimental design. We show that iconic gestures produced by a tele-operated robot are understood by participants when presented alone, almost as well as when produced by a human. More importantly, we show that gestures are integrated with speech when presented as part of a multi-modal communication equally well for human and robot performances. PMID:26925010

  6. Iconic Gestures for Robot Avatars, Recognition and Integration with Speech

    Directory of Open Access Journals (Sweden)

    Paul Adam Bremner

    2016-02-01

    Full Text Available Co-verbal gestures are an important part of human communication, improving its efficiency and efficacy for information conveyance. One possible means by which such multi-modal communication might be realised remotely is through the use of a tele-operated humanoid robot avatar. Such avatars have been previously shown to enhance social presence and operator salience. We present a motion tracking based tele-operation system for the NAO robot platform that allows direct transmission of speech and gestures produced by the operator. To assess the capabilities of this system for transmitting multi-modal communication, we have conducted a user study that investigated if robot-produced iconic gestures are comprehensible, and are integrated with speech. Robot performed gesture outcomes were compared directly to those for gestures produced by a human actor, using a within participant experimental design. We show that iconic gestures produced by a tele-operated robot are understood by participants when presented alone, almost as well as when produced by a human. More importantly, we show that gestures are integrated with speech when presented as part of a multi-modal communication equally well for human and robot performances.

  7. Effects of hearing loss on speech recognition under distracting conditions and working memory in the elderly.

    Science.gov (United States)

    Na, Wondo; Kim, Gibbeum; Kim, Gungu; Han, Woojae; Kim, Jinsook

    2017-01-01

    The current study aimed to evaluate hearing-related changes in terms of speech-in-noise processing, fast-rate speech processing, and working memory; and to identify which of these three factors is significantly affected by age-related hearing loss. One hundred subjects aged 65-84 years participated in the study. They were classified into four groups ranging from normal hearing to moderate-to-severe hearing loss. All the participants were tested for speech perception in quiet and noisy conditions and for speech perception with time alteration in quiet conditions. Forward- and backward-digit span tests were also conducted to measure the participants' working memory. 1) As the level of background noise increased, speech perception scores systematically decreased in all the groups. This pattern was more noticeable in the three hearing-impaired groups than in the normal hearing group. 2) As the speech rate increased faster, speech perception scores decreased. A significant interaction was found between speed of speech and hearing loss. In particular, 30% of compressed sentences revealed a clear differentiation between moderate hearing loss and moderate-to-severe hearing loss. 3) Although all the groups showed a longer span on the forward-digit span test than the backward-digit span test, there was no significant difference as a function of hearing loss. The degree of hearing loss strongly affects the speech recognition of babble-masked and time-compressed speech in the elderly but does not affect the working memory. We expect these results to be applied to appropriate rehabilitation strategies for hearing-impaired elderly who experience difficulty in communication.
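
    Stimuli for speech-in-noise testing of this kind are typically prepared by scaling a masker so that the mixture reaches a target signal-to-noise ratio. The sketch below shows that calculation; the file names and SNR values are placeholders.

        import numpy as np
        import soundfile as sf

        def mix_at_snr(speech, noise, snr_db):
            noise = np.resize(noise, speech.shape)          # loop/trim noise to length
            p_speech = np.mean(speech ** 2)
            p_noise = np.mean(noise ** 2)
            gain = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10.0)))
            return speech + gain * noise

        speech, fs = sf.read("sentence.wav")                # mono speech signal
        babble, _ = sf.read("babble.wav")                   # mono babble masker
        for snr in (10, 5, 0):                              # example SNR conditions
            sf.write(f"sentence_snr{snr}.wav", mix_at_snr(speech, babble, snr), fs)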

  8. Development of a Low-Cost, Noninvasive, Portable Visual Speech Recognition Program.

    Science.gov (United States)

    Kohlberg, Gavriel D; Gal, Ya'akov Kobi; Lalwani, Anil K

    2016-09-01

    Loss of speech following tracheostomy and laryngectomy severely limits communication to simple gestures and facial expressions that are largely ineffective. To facilitate communication in these patients, we seek to develop a low-cost, noninvasive, portable, and simple visual speech recognition program (VSRP) to convert articulatory facial movements into speech. A Microsoft Kinect-based VSRP was developed to capture spatial coordinates of lip movements and translate them into speech. The articulatory speech movements associated with 12 sentences were used to train an artificial neural network classifier. The accuracy of the classifier was then evaluated on a separate, previously unseen set of articulatory speech movements. The VSRP was successfully implemented and tested in 5 subjects. It achieved an accuracy rate of 77.2% (65.0%-87.6% for the 5 speakers) on a 12-sentence data set. The mean time to classify an individual sentence was 2.03 milliseconds (1.91-2.16). We have demonstrated the feasibility of a low-cost, noninvasive, portable VSRP based on Kinect to accurately predict speech from articulation movements in clinically trivial time. This VSRP could be used as a novel communication device for aphonic patients. © The Author(s) 2016.

  9. Feasibility of automated speech sample collection with stuttering children using interactive voice response (IVR) technology.

    Science.gov (United States)

    Vogel, Adam P; Block, Susan; Kefalianos, Elaina; Onslow, Mark; Eadie, Patricia; Barth, Ben; Conway, Laura; Mundt, James C; Reilly, Sheena

    2015-04-01

    To investigate the feasibility of adopting automated interactive voice response (IVR) technology for remotely capturing standardized speech samples from stuttering children. Participants were 10 6-year-old stuttering children. Their parents called a toll-free number from their homes and were prompted to elicit speech from their children using a standard protocol involving conversation, picture description and games. The automated IVR system was implemented using an off-the-shelf telephony software program and delivered by a standard desktop computer. The software infrastructure utilizes voice over internet protocol. Speech samples were automatically recorded during the calls. Video recordings were simultaneously acquired in the home at the time of the call to evaluate the fidelity of the telephone collected samples. Key outcome measures included syllables spoken, percentage of syllables stuttered and an overall rating of stuttering severity using a 10-point scale. Data revealed a high level of relative reliability in terms of intra-class correlation between the video and telephone acquired samples on all outcome measures during the conversation task. Findings were less consistent for speech samples during picture description and games. Results suggest that IVR technology can be used successfully to automate remote capture of child speech samples.

  10. Speech variability effects on recognition accuracy associated with concurrent task performance by pilots

    Science.gov (United States)

    Simpson, C. A.

    1985-01-01

    In the present study of the responses of pairs of pilots to aircraft warning classification tasks using an isolated word, speaker-dependent speech recognition system, the induced stress was manipulated by means of different scoring procedures for the classification task and by the inclusion of a competitive manual control task. Both speech patterns and recognition accuracy were analyzed, and recognition errors were recorded by type for an isolated word speaker-dependent system and by an offline technique for a connected word speaker-dependent system. While errors increased with task loading for the isolated word system, there was no such effect for task loading in the case of the connected word system.

  11. Automated detection of unfilled pauses in speech of healthy and brain-damaged individuals

    NARCIS (Netherlands)

    Ossewaarde, Roelant; Jonkers, Roel; Jalvingh, Fedor; Bastiaanse, Yvonne

    Automated detection of unfilled pauses in speech of healthy and brain-damaged individuals. Roelant Ossewaarde, Roel Jonkers, Fedor Jalvingh, Roelien Bastiaanse. Center for Language and Cognition, University of Groningen; Institute for ICT, HU University of Applied Sciences, Utrecht; St. ...

  12. A Comparison of Two Scoring Methods for an Automated Speech Scoring System

    Science.gov (United States)

    Xi, Xiaoming; Higgins, Derrick; Zechner, Klaus; Williamson, David

    2012-01-01

    This paper compares two alternative scoring methods--multiple regression and classification trees--for an automated speech scoring system used in a practice environment. The two methods were evaluated on two criteria: construct representation and empirical performance in predicting human scores. The empirical performance of the two scoring models…

  13. Audio-Visual Tibetan Speech Recognition Based on a Deep Dynamic Bayesian Network for Natural Human Robot Interaction

    Directory of Open Access Journals (Sweden)

    Yue Zhao

    2012-12-01

    Full Text Available Audio-visual speech recognition is a natural and robust approach to improving human-robot interaction in noisy environments. Although the multi-stream Dynamic Bayesian Network and coupled HMM are widely used for audio-visual speech recognition, they fail to learn the shared features between modalities and ignore the dependency of features among the frames within each discrete state. In this paper, we propose a Deep Dynamic Bayesian Network (DDBN) to perform unsupervised extraction of spatial-temporal multimodal features from Tibetan audio-visual speech data and build an accurate audio-visual speech recognition model without a frame-independence assumption. The experimental results on Tibetan speech data from real-world environments showed that the proposed DDBN outperforms state-of-the-art methods in word recognition accuracy.

  14. How does real affect affect affect recognition in speech?

    NARCIS (Netherlands)

    Truong, Khiet Phuong

    2009-01-01

    The automatic analysis of affect is a relatively new and challenging multidisciplinary research area that has gained a lot of interest over the past few years. The research and development of affect recognition systems has opened many opportunities for improving the interaction between man and

  15. Spectro-Temporal Analysis of Speech for Spanish Phoneme Recognition

    DEFF Research Database (Denmark)

    Sharifzadeh, Sara; Serrano, Javier; Carrabina, Jordi

    2012-01-01

    ... are considered. This improved the recognition performance, especially in noisy situations and for phonemes with time-domain modulations such as stops. In this method, the 2D Discrete Cosine Transform (DCT) is applied to small overlapping 2D Hamming-windowed patches of the spectrograms of Spanish phonemes...
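
    A rough sketch of this kind of spectro-temporal feature extraction is shown below: a log-spectrogram is cut into overlapping patches, each patch is Hamming-windowed in both dimensions, and the low-order 2D DCT coefficients are kept. Patch sizes, strides, and the number of retained coefficients are assumptions, not the paper's settings.

        import numpy as np
        from scipy.signal import spectrogram
        from scipy.fft import dctn

        def patch_dct_features(x, fs, patch=(16, 16), stride=(8, 8), n_coef=3):
            _, _, S = spectrogram(x, fs=fs, nperseg=256, noverlap=192)
            logS = np.log(S + 1e-10)
            win = np.outer(np.hamming(patch[0]), np.hamming(patch[1]))
            feats = []
            for i in range(0, logS.shape[0] - patch[0] + 1, stride[0]):
                for j in range(0, logS.shape[1] - patch[1] + 1, stride[1]):
                    block = logS[i:i + patch[0], j:j + patch[1]] * win
                    coeffs = dctn(block, norm="ortho")
                    feats.append(coeffs[:n_coef, :n_coef].ravel())  # low-order terms
            return np.array(feats)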

  16. Automated Speech and Audio Analysis for Semantic Access to Multimedia

    NARCIS (Netherlands)

    Jong, F.M.G. de; Ordelman, R.; Huijbregts, M.

    2006-01-01

    The deployment and integration of audio processing tools can enhance the semantic annotation of multimedia content, and as a consequence, improve the effectiveness of conceptual access tools. This paper overviews the various ways in which automatic speech and audio analysis can contribute to

  17. Automated speech and audio analysis for semantic access to multimedia

    NARCIS (Netherlands)

    de Jong, Franciska M.G.; Ordelman, Roeland J.F.; Huijbregts, M.A.H.; Avrithis, Y.; Kompatsiaris, Y.; Staab, S.; O' Connor, N.E.

    2006-01-01

    The deployment and integration of audio processing tools can enhance the semantic annotation of multimedia content, and as a consequence, improve the effectiveness of conceptual access tools. This paper overviews the various ways in which automatic speech and audio analysis can contribute to

  18. A glimpsing account of the role of temporal fine structure information in speech recognition.

    Science.gov (United States)

    Apoux, Frédéric; Healy, Eric W

    2013-01-01

    Many behavioral studies have reported a significant decrease in intelligibility when the temporal fine structure (TFS) of a sound mixture is replaced with noise or tones (i.e., vocoder processing). This finding has led to the conclusion that TFS information is critical for speech recognition in noise. How the normal auditory system takes advantage of the original TFS, however, remains unclear. Three experiments on the role of TFS in noise are described. All three experiments measured speech recognition in various backgrounds while manipulating the envelope, TFS, or both. One experiment tested the hypothesis that vocoder processing may artificially increase the apparent importance of TFS cues. Another experiment evaluated the relative contribution of the target and masker TFS by disturbing only the TFS of the target or that of the masker. Finally, a last experiment evaluated the relative contribution of envelope and TFS information. In contrast to previous studies, however, the original envelope and TFS were both preserved - to some extent - in all conditions. Overall, the experiments indicate a limited influence of TFS and suggest that little speech information is extracted from the TFS. Concomitantly, these experiments confirm that most speech information is carried by the temporal envelope in real-world conditions. When interpreted within the framework of the glimpsing model, the results of these experiments suggest that TFS is primarily used as a grouping cue to select the time-frequency regions corresponding to the target speech signal.
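
    Vocoder processing of the kind referred to here can be sketched as follows: each analysis band's Hilbert envelope is retained and imposed on a noise carrier filtered to the same band, discarding the original TFS. Band edges, filter order, and channel count below are illustrative, not those of the cited experiments.

        import numpy as np
        from scipy.signal import butter, sosfiltfilt, hilbert

        def noise_vocode(x, fs, band_edges=(100, 300, 700, 1500, 3000, 6000)):
            rng = np.random.default_rng(0)
            out = np.zeros_like(x, dtype=float)
            for lo, hi in zip(band_edges[:-1], band_edges[1:]):
                sos = butter(4, [lo, hi], btype="bandpass", fs=fs, output="sos")
                band = sosfiltfilt(sos, x)
                envelope = np.abs(hilbert(band))                  # keep the envelope
                carrier = sosfiltfilt(sos, rng.standard_normal(len(x)))  # noise TFS
                out += envelope * carrier
            return out / (np.max(np.abs(out)) + 1e-12)            # normalize output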

  19. Exploring the link between cognitive abilities and speech recognition in the elderly under different listening conditions

    DEFF Research Database (Denmark)

    Nuesse, Theresa; Steenken, Rike; Neher, Tobias

    2018-01-01

    ..., which included measures of verbal working- and short-term memory, executive functioning, selective and divided attention, and lexical and semantic abilities. Age-matched groups of older adults with either age-appropriate hearing (ENH, N = 20) or aided hearing impairment (EHI, N = 21) participated ... for the ENH listeners. Whereas better lexical and semantic abilities were associated with lower (better) SRTs in this group, there was a negative association between attentional abilities and speech recognition in the presence of spatially separated speech-like maskers. For the EHI group, the pure...

  20. Mandarin-Speaking Children’s Speech Recognition: Developmental Changes in the Influences of Semantic Context and F0 Contours

    Directory of Open Access Journals (Sweden)

    Hong Zhou

    2017-06-01

    Full Text Available The goal of this developmental speech perception study was to assess whether and how age group modulated the influences of high-level semantic context and low-level fundamental frequency (F0) contours on the recognition of Mandarin speech by elementary and middle-school-aged children in quiet and interference backgrounds. The results revealed different patterns for semantic and F0 information. On the one hand, age group significantly modulated the use of F0 contours, indicating that elementary school children relied more on natural F0 contours than middle school children during Mandarin speech recognition. On the other hand, there was no significant modulation effect of age group on semantic context, indicating that children of both age groups used semantic context to assist speech recognition to a similar extent. Furthermore, the significant modulation effect of age group on the interaction between F0 contours and semantic context revealed that younger children could not make better use of semantic context in recognizing speech with flat F0 contours compared with natural F0 contours, while older children could benefit from semantic context even when natural F0 contours were altered, thus confirming the important role of F0 contours in Mandarin speech recognition by elementary school children. The developmental changes in the effects of high-level semantic and low-level F0 information on speech recognition might reflect the differences in auditory and cognitive resources associated with processing of the two types of information in speech perception.

  1. Non-native Listeners’ Recognition of High-Variability Speech Using PRESTO

    Science.gov (United States)

    Tamati, Terrin N.; Pisoni, David B.

    2015-01-01

    Background Natural variability in speech is a significant challenge to robust successful spoken word recognition. In everyday listening environments, listeners must quickly adapt and adjust to multiple sources of variability in both the signal and listening environments. High-variability speech may be particularly difficult to understand for non-native listeners, who have less experience with the second language (L2) phonological system and less detailed knowledge of sociolinguistic variation of the L2. Purpose The purpose of this study was to investigate the effects of high-variability sentences on non-native speech recognition and to explore the underlying sources of individual differences in speech recognition abilities of non-native listeners. Research Design Participants completed two sentence recognition tasks involving high-variability and low-variability sentences. They also completed a battery of behavioral tasks and self-report questionnaires designed to assess their indexical processing skills, vocabulary knowledge, and several core neurocognitive abilities. Study Sample Native speakers of Mandarin (n = 25) living in the United States recruited from the Indiana University community participated in the current study. A native comparison group consisted of scores obtained from native speakers of English (n = 21) in the Indiana University community taken from an earlier study. Data Collection and Analysis Speech recognition in high-variability listening conditions was assessed with a sentence recognition task using sentences from PRESTO (Perceptually Robust English Sentence Test Open-Set) mixed in 6-talker multitalker babble. Speech recognition in low-variability listening conditions was assessed using sentences from HINT (Hearing In Noise Test) mixed in 6-talker multitalker babble. Indexical processing skills were measured using a talker discrimination task, a gender discrimination task, and a forced-choice regional dialect categorization task. Vocabulary

  2. Effects of hearing loss and cognitive load on speech recognition with competing talkers

    Directory of Open Access Journals (Sweden)

    Hartmut eMeister

    2016-03-01

    Full Text Available Everyday communication frequently comprises situations with more than one talker speaking at a time. These situations are challenging since they pose high attentional and memory demands placing cognitive load on the listener. Hearing impairment additionally exacerbates communication problems under these circumstances. We examined the effects of hearing loss and attention tasks on speech recognition with competing talkers in older adults with and without hearing impairment. We hypothesized that hearing loss would affect word identification, talker separation and word recall and that the difficulties experienced by the hearing impaired listeners would be especially pronounced in a task with high attentional and memory demands. Two listener groups closely matched regarding their age and neuropsychological profile but differing in hearing acuity were examined regarding their speech recognition with competing talkers in two different tasks. One task required repeating back words from one target talker (1TT) while ignoring the competing talker whereas the other required repeating back words from both talkers (2TT). The competing talkers differed with respect to their voice characteristics. Moreover, sentences either with low or high context were used in order to consider linguistic properties. Compared to their normal hearing peers, listeners with hearing loss revealed limited speech recognition in both tasks. Their difficulties were especially pronounced in the more demanding 2TT task. In order to shed light on the underlying mechanisms, different error sources, namely having misunderstood, confused, or omitted words were investigated. Misunderstanding and omitting words were more frequently observed in the hearing impaired than in the normal hearing listeners. In line with common speech perception models it is suggested that these effects are related to impaired object formation and taxed working memory capacity (WMC). In a post hoc analysis the

  3. Exploring the Link Between Cognitive Abilities and Speech Recognition in the Elderly Under Different Listening Conditions

    Directory of Open Access Journals (Sweden)

    Theresa Nuesse

    2018-05-01

    Full Text Available Elderly listeners are known to differ considerably in their ability to understand speech in noise. Several studies have addressed the underlying factors that contribute to these differences. These factors include audibility and age-related changes in supra-threshold auditory processing abilities, and it has been suggested that differences in cognitive abilities may also be important. The objective of this study was to investigate associations between performance in cognitive tasks and speech recognition under different listening conditions in older adults with either age-appropriate hearing or hearing impairment. To that end, speech recognition threshold (SRT) measurements were performed under several masking conditions that varied along the perceptual dimensions of dip listening, spatial separation, and informational masking. In addition, a neuropsychological test battery was administered, which included measures of verbal working and short-term memory, executive functioning, selective and divided attention, and lexical and semantic abilities. Age-matched groups of older adults with either age-appropriate hearing (ENH, n = 20) or aided hearing impairment (EHI, n = 21) participated. In repeated linear regression analyses, composite scores of cognitive test outcomes (evaluated using PCA) were included to predict SRTs. These associations were different for the two groups. When hearing thresholds were controlled for, composed cognitive factors were significantly associated with the SRTs for the ENH listeners. Whereas better lexical and semantic abilities were associated with lower (better) SRTs in this group, there was a negative association between attentional abilities and speech recognition in the presence of spatially separated speech-like maskers. For the EHI group, the pure-tone thresholds (averaged across 0.5, 1, 2, and 4 kHz) were significantly associated with the SRTs, despite the fact that all signals were amplified and therefore in principle

  4. Audio-Visual Speech Recognition Using Lip Information Extracted from Side-Face Images

    Directory of Open Access Journals (Sweden)

    Koji Iwano

    2007-03-01

    Full Text Available This paper proposes an audio-visual speech recognition method using lip information extracted from side-face images as an attempt to increase noise robustness in mobile environments. Our proposed method assumes that lip images can be captured using a small camera installed in a handset. Two different kinds of lip features, lip-contour geometric features and lip-motion velocity features, are used individually or jointly, in combination with audio features. Phoneme HMMs modeling the audio and visual features are built based on the multistream HMM technique. Experiments conducted using Japanese connected digit speech contaminated with white noise in various SNR conditions show effectiveness of the proposed method. Recognition accuracy is improved by using the visual information in all SNR conditions. These visual features were confirmed to be effective even when the audio HMM was adapted to noise by the MLLR method.
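    The core scoring idea behind multistream HMMs such as those used above is a weighted combination of per-stream log-likelihoods. A minimal sketch (the stream weight, frame counts, and random scores are placeholders, not the paper's models):

```python
# Sketch: stream-weighted combination of audio and visual log-likelihoods,
# the scoring rule used by multistream HMMs (weights here are illustrative).
import numpy as np

def combined_loglik(audio_ll, visual_ll, audio_weight=0.7):
    """audio_ll, visual_ll: arrays of shape (n_frames, n_states)."""
    visual_weight = 1.0 - audio_weight
    return audio_weight * audio_ll + visual_weight * visual_ll

rng = np.random.default_rng(0)
audio_ll = rng.normal(size=(100, 5))    # stand-in per-state log-likelihoods
visual_ll = rng.normal(size=(100, 5))
best_state_per_frame = combined_loglik(audio_ll, visual_ll).argmax(axis=1)
```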

  5. Performance Evaluation of Speech Recognition Systems as a Next-Generation Pilot-Vehicle Interface Technology

    Science.gov (United States)

    Arthur, Jarvis J., III; Shelton, Kevin J.; Prinzel, Lawrence J., III; Bailey, Randall E.

    2016-01-01

    During the flight trials known as Gulfstream-V Synthetic Vision Systems Integrated Technology Evaluation (GV-SITE), a Speech Recognition System (SRS) was used by the evaluation pilots. The SRS system was intended to be an intuitive interface for display control (rather than knobs, buttons, etc.). This paper describes the performance of the current "state of the art" Speech Recognition System (SRS). The commercially available technology was evaluated as an application for possible inclusion in commercial aircraft flight decks as a crew-to-vehicle interface. Specifically, the technology is to be used as an interface from aircrew to the onboard displays, controls, and flight management tasks. A flight test of a SRS as well as a laboratory test was conducted.

  6. Integrating Automatic Speech Recognition and Machine Translation for Better Translation Outputs

    DEFF Research Database (Denmark)

    Liyanapathirana, Jeevanthi

    translations, combining machine translation with computer assisted translation has drawn attention in current research. This combines two prospects: the opportunity of ensuring high quality translation along with a significant performance gain. Automatic Speech Recognition (ASR) is another important area......, which caters important functionalities in language processing and natural language understanding tasks. In this work we integrate automatic speech recognition and machine translation in parallel. We aim to avoid manual typing of possible translations as dictating the translation would take less time...... to the n-best list rescoring, we also use word graphs with the expectation of arriving at a tighter integration of ASR and MT models. Integration methods include constraining ASR models using language and translation models of MT, and vice versa. We currently develop and experiment different methods...
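    One of the integration strategies mentioned above is rescoring the ASR n-best list with scores from the MT side. A minimal sketch of such rescoring (hypotheses, scores, and the interpolation weight are invented for illustration):

```python
# Sketch: rescoring an ASR n-best list with a translation/language-model score.
def rescore_nbest(nbest, mt_score, asr_weight=0.6):
    """nbest: list of (hypothesis, asr_log_score); mt_score: hypothesis -> log score."""
    rescored = [(hyp, asr_weight * s + (1 - asr_weight) * mt_score(hyp)) for hyp, s in nbest]
    return max(rescored, key=lambda item: item[1])[0]

nbest = [("the translation is ready", -12.3),
         ("the translation his ready", -11.9),
         ("a translation is ready", -13.1)]
toy_mt = lambda hyp: -2.0 * hyp.count("his")        # stand-in for an MT/LM score
print(rescore_nbest(nbest, toy_mt))                 # the MT score demotes the second hypothesis
```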

  7. Exploring the link between cognitive abilities and speech recognition in the elderly under different listening conditions

    DEFF Research Database (Denmark)

    Nuesse, Theresa; Steenken, Rike; Neher, Tobias

    2018-01-01

    , and it has been suggested that differences in cognitive abilities may also be important. The objective of this study was to investigate associations between performance in cognitive tasks and speech recognition under different listening conditions in older adults with either age appropriate hearing...... or hearing-impairment. To that end, speech recognition threshold (SRT) measurements were performed under several masking conditions that varied along the perceptual dimensions of dip listening, spatial separation, and informational masking. In addition, a neuropsychological test battery was administered......, which included measures of verbal working- and short-term memory, executive functioning, selective and divided attention, and lexical and semantic abilities. Age-matched groups of older adults with either age-appropriate hearing (ENH, N = 20) or aided hearing impairment (EHI, N = 21) participated...

  8. Syntactic and semantic errors in radiology reports associated with speech recognition software.

    Science.gov (United States)

    Ringler, Michael D; Goss, Brian C; Bartholmai, Brian J

    2017-03-01

    Speech recognition software can increase the frequency of errors in radiology reports, which may affect patient care. We retrieved 213,977 speech recognition software-generated reports from 147 different radiologists and proofread them for errors. Errors were classified as "material" if they were believed to alter interpretation of the report. "Immaterial" errors were subclassified as intrusion/omission or spelling errors. The proportion of errors and error type were compared among individual radiologists, imaging subspecialty, and time periods. In all, 20,759 reports (9.7%) contained errors, of which 3992 (1.9%) were material errors. Among immaterial errors, spelling errors were more common than intrusion/omission errors ( p reports, reports reinterpreting results of outside examinations, and procedural studies (all p < .001). Error rate decreased over time ( p < .001), which suggests that a quality control program with regular feedback may reduce errors.

  9. Using speech recognition to enhance the Tongue Drive System functionality in computer access.

    Science.gov (United States)

    Huo, Xueliang; Ghovanloo, Maysam

    2011-01-01

    Tongue Drive System (TDS) is a wireless tongue-operated assistive technology (AT), which can enable people with severe physical disabilities to access computers and drive powered wheelchairs using their volitional tongue movements. TDS offers six discrete commands, simultaneously available to the users, for pointing and typing as a substitute for mouse and keyboard in computer access, respectively. To enhance the TDS performance in typing, we have added a microphone, an audio codec, and a wireless audio link to its readily available 3-axial magnetic sensor array, and combined it with commercially available speech recognition software, Dragon NaturallySpeaking, which is regarded as one of the most efficient ways for text entry. Our preliminary evaluations indicate that the combined TDS and speech recognition technologies can provide end users with significantly higher performance than using each technology alone, particularly in completing tasks that require both pointing and text entry, such as web surfing.

  10. Independent Component Analysis and Time-Frequency Masking for Speech Recognition in Multitalker Conditions

    Directory of Open Access Journals (Sweden)

    Reinhold Orglmeister

    2010-01-01

    Full Text Available When a number of speakers are simultaneously active, for example in meetings or noisy public places, the sources of interest need to be separated from interfering speakers and from each other in order to be robustly recognized. Independent component analysis (ICA) has proven a valuable tool for this purpose. However, ICA outputs can still contain strong residual components of the interfering speakers whenever noise or reverberation is high. In such cases, nonlinear postprocessing can be applied to the ICA outputs, for the purpose of reducing remaining interferences. In order to improve robustness to the artefacts and loss of information caused by this process, recognition can be greatly enhanced by considering the processed speech feature vector as a random variable with time-varying uncertainty, rather than as deterministic. The aim of this paper is to show the potential to improve recognition of multiple overlapping speech signals through nonlinear postprocessing together with uncertainty-based decoding techniques.
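    A minimal sketch of the ICA-plus-masking idea described above, using instantaneous (non-reverberant) mixing and synthetic stand-in sources rather than real meeting recordings:

```python
# Sketch: ICA separation of a two-microphone mixture followed by a soft
# time-frequency mask on one output (instantaneous mixing only; real rooms
# need convolutive ICA). All signals here are synthetic stand-ins.
import numpy as np
from scipy.signal import stft
from sklearn.decomposition import FastICA

fs = 8000
t = np.arange(0, 1.0, 1.0 / fs)
s1 = np.sin(2 * np.pi * 300 * t)                 # stand-in for talker 1
s2 = np.sign(np.sin(2 * np.pi * 70 * t))         # stand-in for talker 2
mixing = np.array([[1.0, 0.6], [0.5, 1.0]])
mics = np.c_[s1, s2] @ mixing.T                  # shape (n_samples, 2)

ica = FastICA(n_components=2, random_state=0)
separated = ica.fit_transform(mics)              # ICA outputs, shape (n_samples, 2)

# Soft mask: emphasise time-frequency cells where output 0 dominates output 1.
_, _, S0 = stft(separated[:, 0], fs=fs, nperseg=256)
_, _, S1 = stft(separated[:, 1], fs=fs, nperseg=256)
mask = np.abs(S0) ** 2 / (np.abs(S0) ** 2 + np.abs(S1) ** 2 + 1e-12)
masked_target = mask * S0                        # `mask` could also be carried as uncertainty
```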

  11. Searching for sources of variance in speech recognition: Young adults with normal hearing

    Science.gov (United States)

    Watson, Charles S.; Kidd, Gary R.

    2005-04-01

    In the present investigation, sensory-perceptual abilities of one thousand young adults with normal hearing are being evaluated with a range of auditory, visual, and cognitive measures. Four auditory measures were derived from factor-analytic analyses of previous studies with 18-20 speech and non-speech variables [G. R. Kidd et al., J. Acoust. Soc. Am. 108, 2641 (2000)]. Two measures of visual acuity are obtained to determine whether variation in sensory skills tends to exist primarily within or across sensory modalities. A working memory test, grade point average, and Scholastic Aptitude Test scores (Verbal and Quantitative) are also included. Preliminary multivariate analyses support previous studies of individual differences in auditory abilities [e.g., A. M. Surprenant and C. S. Watson, J. Acoust. Soc. Am. 110, 2085-2095 (2001)], which found that spectral and temporal resolving power obtained with pure tones and more complex unfamiliar stimuli have little or no correlation with measures of speech recognition under difficult listening conditions. The current findings show that visual acuity, working memory, and intellectual measures are also very poor predictors of speech recognition ability, supporting the independence of this processing skill. Remarkable performance by some exceptional listeners will be described. [Work supported by the Office of Naval Research, Award No. N000140310644.]

  12. Robust Recognition of Loud and Lombard speech in the Fighter Cockpit Environment

    Science.gov (United States)

    1988-08-01

    the latter as inter-speaker variability. According to Zue [Z85], inter-speaker variabilities can be attributed to sociolinguistic background, dialect... Journal of the Acoustical Society of America, Vol 50, 1971. [At74] B. S. Atal, "Linear prediction for speaker identification," Journal of the Acoustical Society of America, Vol 55, 1974. [B77] B. Beek, E. P. Neuberg, and D. C. Hodge, "An Assessment of the Technology of Automatic Speech Recognition for

  13. Constructing Long Short-Term Memory based Deep Recurrent Neural Networks for Large Vocabulary Speech Recognition

    OpenAIRE

    Li, Xiangang; Wu, Xihong

    2014-01-01

    Long short-term memory (LSTM) based acoustic modeling methods have recently been shown to give state-of-the-art performance on some speech recognition tasks. To achieve a further performance improvement, in this research, deep extensions on LSTM are investigated considering that deep hierarchical model has turned out to be more efficient than a shallow one. Motivated by previous research on constructing deep recurrent neural networks (RNNs), alternative deep LSTM architectures are proposed an...
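    A minimal sketch of a stacked (deep) LSTM acoustic model of the kind discussed above, mapping frame-level features to per-frame state posteriors; the feature, hidden, and state dimensions are illustrative, not those of the paper:

```python
# Sketch: a deep (stacked) LSTM acoustic model producing per-frame state logits.
import torch
import torch.nn as nn

class DeepLSTMAcousticModel(nn.Module):
    def __init__(self, n_features=40, hidden=512, n_layers=4, n_states=3000):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, num_layers=n_layers, batch_first=True)
        self.output = nn.Linear(hidden, n_states)

    def forward(self, frames):                 # frames: (batch, time, n_features)
        hidden_states, _ = self.lstm(frames)
        return self.output(hidden_states)      # (batch, time, n_states) logits

model = DeepLSTMAcousticModel()
logits = model(torch.randn(2, 100, 40))        # two utterances of 100 frames each
log_posteriors = torch.log_softmax(logits, dim=-1)
```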

  14. Comparing auditory filter bandwidths, spectral ripple modulation detection, spectral ripple discrimination, and speech recognition: Normal and impaired hearing.

    Science.gov (United States)

    Davies-Venn, Evelyn; Nelson, Peggy; Souza, Pamela

    2015-07-01

    Some listeners with hearing loss show poor speech recognition scores in spite of using amplification that optimizes audibility. Beyond audibility, studies have suggested that suprathreshold abilities such as spectral and temporal processing may explain differences in amplified speech recognition scores. A variety of different methods has been used to measure spectral processing. However, the relationship between spectral processing and speech recognition is still inconclusive. This study evaluated the relationship between spectral processing and speech recognition in listeners with normal hearing and with hearing loss. Narrowband spectral resolution was assessed using auditory filter bandwidths estimated from simultaneous notched-noise masking. Broadband spectral processing was measured using the spectral ripple discrimination (SRD) task and the spectral ripple depth detection (SMD) task. Three different measures were used to assess unamplified and amplified speech recognition in quiet and noise. Stepwise multiple linear regression revealed that SMD at 2.0 cycles per octave (cpo) significantly predicted speech scores for amplified and unamplified speech in quiet and noise. Commonality analyses revealed that SMD at 2.0 cpo combined with SRD and equivalent rectangular bandwidth measures to explain most of the variance captured by the regression model. Results suggest that SMD and SRD may be promising clinical tools for diagnostic evaluation and predicting amplification outcomes.

  15. Comparing auditory filter bandwidths, spectral ripple modulation detection, spectral ripple discrimination, and speech recognition: Normal and impaired hearinga)

    Science.gov (United States)

    Davies-Venn, Evelyn; Nelson, Peggy; Souza, Pamela

    2015-01-01

    Some listeners with hearing loss show poor speech recognition scores in spite of using amplification that optimizes audibility. Beyond audibility, studies have suggested that suprathreshold abilities such as spectral and temporal processing may explain differences in amplified speech recognition scores. A variety of different methods has been used to measure spectral processing. However, the relationship between spectral processing and speech recognition is still inconclusive. This study evaluated the relationship between spectral processing and speech recognition in listeners with normal hearing and with hearing loss. Narrowband spectral resolution was assessed using auditory filter bandwidths estimated from simultaneous notched-noise masking. Broadband spectral processing was measured using the spectral ripple discrimination (SRD) task and the spectral ripple depth detection (SMD) task. Three different measures were used to assess unamplified and amplified speech recognition in quiet and noise. Stepwise multiple linear regression revealed that SMD at 2.0 cycles per octave (cpo) significantly predicted speech scores for amplified and unamplified speech in quiet and noise. Commonality analyses revealed that SMD at 2.0 cpo combined with SRD and equivalent rectangular bandwidth measures to explain most of the variance captured by the regression model. Results suggest that SMD and SRD may be promising clinical tools for diagnostic evaluation and predicting amplification outcomes. PMID:26233047

  16. Speech recognition training for enhancing written language generation by a traumatic brain injury survivor.

    Science.gov (United States)

    Manasse, N J; Hux, K; Rankin-Erickson, J L

    2000-11-01

    Impairments in motor functioning, language processing, and cognitive status may impact the written language performance of traumatic brain injury (TBI) survivors. One strategy to minimize the impact of these impairments is to use a speech recognition system. The purpose of this study was to explore the effect of mild dysarthria and mild cognitive-communication deficits secondary to TBI on a 19-year-old survivor's mastery and use of such a system, specifically Dragon NaturallySpeaking. Data included the percentage of the participant's words accurately perceived by the system over time, the participant's accuracy over time in using commands for navigation and error correction, and quantitative and qualitative changes in the participant's written texts generated with and without the use of the speech recognition system. Results showed that Dragon NaturallySpeaking was approximately 80% accurate in perceiving words spoken by the participant, and the participant quickly and easily mastered all navigation and error correction commands presented. Quantitatively, the participant produced a greater amount of text using traditional word processing and a standard keyboard than using the speech recognition system. Minimal qualitative differences appeared between writing samples. Discussion of factors that may have contributed to the obtained results and that may affect the generalization of the findings to other TBI survivors is provided.

  17. Evaluation of Speech Recognition of Cochlear Implant Recipients Using Adaptive, Digital Remote Microphone Technology and a Speech Enhancement Sound Processing Algorithm.

    Science.gov (United States)

    Wolfe, Jace; Morais, Mila; Schafer, Erin; Agrawal, Smita; Koch, Dawn

    2015-05-01

    Cochlear implant recipients often experience difficulty with understanding speech in the presence of noise. Cochlear implant manufacturers have developed sound processing algorithms designed to improve speech recognition in noise, and research has shown these technologies to be effective. Remote microphone technology utilizing adaptive, digital wireless radio transmission has also been shown to provide significant improvement in speech recognition in noise. There are no studies examining the potential improvement in speech recognition in noise when these two technologies are used simultaneously. The goal of this study was to evaluate the potential benefits and limitations associated with the simultaneous use of a sound processing algorithm designed to improve performance in noise (Advanced Bionics ClearVoice) and a remote microphone system that incorporates adaptive, digital wireless radio transmission (Phonak Roger). A two-by-two way repeated measures design was used to examine performance differences obtained without these technologies compared to the use of each technology separately as well as the simultaneous use of both technologies. Eleven Advanced Bionics (AB) cochlear implant recipients, ages 11 to 68 yr. AzBio sentence recognition was measured in quiet and in the presence of classroom noise ranging in level from 50 to 80 dBA in 5-dB steps. Performance was evaluated in four conditions: (1) No ClearVoice and no Roger, (2) ClearVoice enabled without the use of Roger, (3) ClearVoice disabled with Roger enabled, and (4) simultaneous use of ClearVoice and Roger. Speech recognition in quiet was better than speech recognition in noise for all conditions. Use of ClearVoice and Roger each provided significant improvement in speech recognition in noise. The best performance in noise was obtained with the simultaneous use of ClearVoice and Roger. ClearVoice and Roger technology each improves speech recognition in noise, particularly when used at the same time

  18. Contribution to automatic speech recognition. Analysis of the direct acoustical signal. Recognition of isolated words and phoneme identification

    International Nuclear Information System (INIS)

    Dupeyrat, Benoit

    1981-01-01

    This report deals with the acoustic-phonetic step of automatic speech recognition. The parameters used are the extrema of the acoustic signal (coded in amplitude and duration). This coding method, whose properties are described, is simple and well suited to digital processing. The quality and intelligibility of the coded signal after reconstruction are particularly satisfactory. An experiment on the automatic recognition of isolated words was carried out using this coding system. We designed a filtering algorithm operating on the coding parameters, from which the characteristics of the formants can be derived under certain conditions, which are discussed. Using these characteristics, the identification of a large proportion of the phonemes for a given speaker was achieved. Pursuing these studies required the development of a particular real-time processing methodology that allowed immediate evaluation of program improvements. Such processing of the temporal coding of the acoustic signal is extremely powerful and, used in connection with other methods, could represent an efficient tool for the automatic processing of speech. (author) [fr]

  19. Suprasegmental lexical stress cues in visual speech can guide spoken-word recognition.

    Science.gov (United States)

    Jesse, Alexandra; McQueen, James M

    2014-01-01

    Visual cues to the individual segments of speech and to sentence prosody guide speech recognition. The present study tested whether visual suprasegmental cues to the stress patterns of words can also constrain recognition. Dutch listeners use acoustic suprasegmental cues to lexical stress (changes in duration, amplitude, and pitch) in spoken-word recognition. We asked here whether they can also use visual suprasegmental cues. In two categorization experiments, Dutch participants saw a speaker say fragments of word pairs that were segmentally identical but differed in their stress realization (e.g., 'ca-vi from cavia "guinea pig" vs. 'ka-vi from kaviaar "caviar"). Participants were able to distinguish between these pairs from seeing a speaker alone. Only the presence of primary stress in the fragment, not its absence, was informative. Participants were able to distinguish visually primary from secondary stress on first syllables, but only when the fragment-bearing target word carried phrase-level emphasis. Furthermore, participants distinguished fragments with primary stress on their second syllable from those with secondary stress on their first syllable (e.g., pro-'jec from projector "projector" vs. 'pro-jec from projectiel "projectile"), independently of phrase-level emphasis. Seeing a speaker thus contributes to spoken-word recognition by providing suprasegmental information about the presence of primary lexical stress.

  20. How does language model size affect speech recognition accuracy for the Turkish language?

    Directory of Open Access Journals (Sweden)

    Behnam ASEFİSARAY

    2016-05-01

    Full Text Available In this paper we aimed at investigating the effect of Language Model (LM) size on Speech Recognition (SR) accuracy. We also provided details of our approach for obtaining the LM for Turkish. Since the LM is obtained by statistical processing of raw text, we expect that by increasing the size of the data available for training the LM, SR accuracy will improve. Since this study is based on recognition of Turkish, which is a highly agglutinative language, it is important to find out the appropriate size for the training data. The minimum required data size is expected to be much higher than the data needed to train a language model for a language with a low level of agglutination such as English. In the experiments we also tried to adjust the Language Model Weight (LMW) and Active Token Count (ATC) parameters of the LM, as these are expected to be different for a highly agglutinative language. We showed that by increasing the training data size to an appropriate level, the recognition accuracy improved; on the other hand, changes to LMW and ATC did not have a positive effect on Turkish speech recognition accuracy.
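    The effect of LM training-data size can be illustrated with a toy add-one-smoothed bigram model; the miniature corpora below are invented stand-ins (not the Turkish corpora of the paper), and perplexity is used here as a proxy for the recognition-accuracy gains reported above:

```python
# Sketch: a toy add-one-smoothed bigram language model trained on corpora of
# different sizes; more training text generally lowers test perplexity.
from collections import Counter
import math

def train_bigram(sentences):
    unigrams, bigrams = Counter(), Counter()
    for s in sentences:
        toks = ["<s>"] + s.split() + ["</s>"]
        unigrams.update(toks[:-1])
        bigrams.update(zip(toks[:-1], toks[1:]))
    vocab = {w for s in sentences for w in s.split()} | {"<s>", "</s>"}
    return unigrams, bigrams, len(vocab)

def perplexity(sentence, unigrams, bigrams, v):
    toks = ["<s>"] + sentence.split() + ["</s>"]
    logp = 0.0
    for prev, cur in zip(toks[:-1], toks[1:]):
        p = (bigrams[(prev, cur)] + 1) / (unigrams[prev] + v)   # add-one smoothing
        logp += math.log(p)
    return math.exp(-logp / (len(toks) - 1))

small = ["ben okula gittim", "o eve gitti"]                       # invented toy sentences
large = small + ["ben eve gittim", "o okula gitti", "ben okula gitti"]
test = "ben eve gitti"
for name, corpus in [("small", small), ("large", large)]:
    print(name, perplexity(test, *train_bigram(corpus)))          # larger corpus -> lower perplexity
```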

  1. Speaker diarization and speech recognition in the semi-automatization of audio description: An exploratory study on future possibilities?

    Directory of Open Access Journals (Sweden)

    Héctor Delgado

    2015-12-01

    Full Text Available This article presents an overview of the technological components used in the process of audio description, and suggests a new scenario in which speech recognition, machine translation, and text-to-speech, with the corresponding human revision, could be used to increase audio description provision. The article focuses on a process in which both speaker diarization and speech recognition are used in order to obtain a semi-automatic transcription of the audio description track. The technical process is presented and experimental results are summarized.

  2. Speaker diarization and speech recognition in the semi-automatization of audio description: An exploratory study on future possibilities?

    Directory of Open Access Journals (Sweden)

    Héctor Delgado

    2015-06-01

    This article presents an overview of the technological components used in the process of audio description, and suggests a new scenario in which speech recognition, machine translation, and text-to-speech, with the corresponding human revision, could be used to increase audio description provision. The article focuses on a process in which both speaker diarization and speech recognition are used in order to obtain a semi-automatic transcription of the audio description track. The technical process is presented and experimental results are summarized.

  3. Relating dynamic brain states to dynamic machine states: Human and machine solutions to the speech recognition problem.

    Directory of Open Access Journals (Sweden)

    Cai Wingfield

    2017-09-01

    Full Text Available There is widespread interest in the relationship between the neurobiological systems supporting human cognition and emerging computational systems capable of emulating these capacities. Human speech comprehension, poorly understood as a neurobiological process, is an important case in point. Automatic Speech Recognition (ASR) systems with near-human levels of performance are now available, which provide a computationally explicit solution for the recognition of words in continuous speech. This research aims to bridge the gap between speech recognition processes in humans and machines, using novel multivariate techniques to compare incremental 'machine states', generated as the ASR analysis progresses over time, to the incremental 'brain states', measured using combined electro- and magneto-encephalography (EMEG), generated as the same inputs are heard by human listeners. This direct comparison of dynamic human and machine internal states, as they respond to the same incrementally delivered sensory input, revealed a significant correspondence between neural response patterns in human superior temporal cortex and the structural properties of ASR-derived phonetic models. Spatially coherent patches in human temporal cortex responded selectively to individual phonetic features defined on the basis of machine-extracted regularities in the speech to lexicon mapping process. These results demonstrate the feasibility of relating human and ASR solutions to the problem of speech recognition, and suggest the potential for further studies relating complex neural computations in human speech comprehension to the rapidly evolving ASR systems that address the same problem domain.
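    The multivariate comparison described above is in the spirit of representational similarity analysis: dissimilarity structures computed from machine states and from brain responses are correlated. A minimal sketch with random stand-in data (not the EMEG recordings or ASR features of the study):

```python
# Sketch: representational similarity analysis (RSA) comparing "machine states"
# to "brain states"; both inputs here are random stand-ins.
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
n_items = 30
machine_states = rng.normal(size=(n_items, 64))   # one feature vector per stimulus
brain_states = rng.normal(size=(n_items, 200))    # one response pattern per stimulus

machine_rdm = pdist(machine_states, metric="correlation")  # condensed dissimilarity matrix
brain_rdm = pdist(brain_states, metric="correlation")

rho, p = spearmanr(machine_rdm, brain_rdm)
print(f"RSA correlation: rho={rho:.3f}, p={p:.3f}")
```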

  4. The effect of sensorineural hearing loss and tinnitus on speech recognition over air and bone conduction military communications headsets.

    Science.gov (United States)

    Manning, Candice; Mermagen, Timothy; Scharine, Angelique

    2017-06-01

    Military personnel are at risk for hearing loss due to noise exposure during deployment (USACHPPM, 2008). Despite mandated use of hearing protection, hearing loss and tinnitus are prevalent due to reluctance to use hearing protection. Bone conduction headsets can offer good speech intelligibility for normal hearing (NH) listeners while allowing the ears to remain open in quiet environments and the use of hearing protection when needed. Those who suffer from tinnitus, the experience of perceiving a sound not produced by an external source, often show degraded speech recognition; however, it is unclear whether this is a result of decreased hearing sensitivity or increased distractibility (Moon et al., 2015). It has been suggested that the vibratory stimulation of a bone conduction headset might ameliorate the effects of tinnitus on speech perception; however, there is currently no research to support or refute this claim (Hoare et al., 2014). Speech recognition of words presented over air conduction and bone conduction headsets was measured for three groups of listeners: NH, sensorineural hearing impaired, and/or tinnitus sufferers. Three levels of speech-to-noise (SNR = 0, -6, -12 dB) were created by embedding speech items in pink noise. Better speech recognition performance was observed with the bone conduction headset regardless of hearing profile, and speech intelligibility was a function of SNR. Discussion will include study limitations and the implications of these findings for those serving in the military. Published by Elsevier B.V.
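    Creating the masking conditions described above amounts to scaling pink noise so that the speech-to-noise ratio hits a target value. A minimal sketch (the "speech" signal and pink-noise generator are simple stand-ins):

```python
# Sketch: embedding a speech item in pink noise at a target SNR (0, -6, or -12 dB).
import numpy as np

def pink_noise(n, rng):
    # Shape white noise with a 1/f amplitude spectrum (simple spectral method).
    spectrum = np.fft.rfft(rng.standard_normal(n))
    freqs = np.fft.rfftfreq(n)
    freqs[0] = freqs[1]                      # avoid division by zero at DC
    return np.fft.irfft(spectrum / np.sqrt(freqs), n)

def mix_at_snr(speech, noise, snr_db):
    speech_rms = np.sqrt(np.mean(speech ** 2))
    noise_rms = np.sqrt(np.mean(noise ** 2))
    target_noise_rms = speech_rms / (10 ** (snr_db / 20))
    return speech + noise * (target_noise_rms / noise_rms)

rng = np.random.default_rng(0)
fs = 16000
speech = np.sin(2 * np.pi * 220 * np.arange(fs) / fs)    # 1 s stand-in "speech"
for snr in (0, -6, -12):
    mixture = mix_at_snr(speech, pink_noise(fs, rng), snr)
```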

  5. Recognition of Emotions in Mexican Spanish Speech: An Approach Based on Acoustic Modelling of Emotion-Specific Vowels

    Directory of Open Access Journals (Sweden)

    Santiago-Omar Caballero-Morales

    2013-01-01

    Full Text Available An approach for the recognition of emotions in speech is presented. The target language is Mexican Spanish, and for this purpose a speech database was created. The approach consists in the phoneme acoustic modelling of emotion-specific vowels. For this, a standard phoneme-based Automatic Speech Recognition (ASR) system was built with Hidden Markov Models (HMMs), where different phoneme HMMs were built for the consonants and emotion-specific vowels associated with four emotional states (anger, happiness, neutral, sadness). Then, estimation of the emotional state from a spoken sentence is performed by counting the number of emotion-specific vowels found in the ASR’s output for the sentence. With this approach, accuracy of 87–100% was achieved for the recognition of emotional state of Mexican Spanish speech.
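    The decision rule described above, counting emotion-specific vowels in the ASR output, can be sketched in a few lines; the label format used here (e.g., "a_anger") is an assumption for illustration:

```python
# Sketch: estimating the emotional state of an utterance by counting
# emotion-specific vowel labels in a decoded phoneme sequence.
from collections import Counter

EMOTIONS = ("anger", "happiness", "neutral", "sadness")

def estimate_emotion(asr_phoneme_output):
    counts = Counter()
    for label in asr_phoneme_output:
        if "_" in label:                      # emotion-specific vowel, e.g. "a_anger"
            _, emotion = label.split("_", 1)
            if emotion in EMOTIONS:
                counts[emotion] += 1
    return counts.most_common(1)[0][0] if counts else "neutral"

decoded = ["k", "a_anger", "s", "a_anger", "m", "e_neutral", "o_anger"]
print(estimate_emotion(decoded))              # -> "anger"
```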

  6. Speech recognition and parent-ratings from auditory development questionnaires in children who are hard of hearing

    Science.gov (United States)

    McCreery, Ryan W.; Walker, Elizabeth A.; Spratford, Meredith; Oleson, Jacob; Bentler, Ruth; Holte, Lenore; Roush, Patricia

    2015-01-01

    Objectives Progress has been made in recent years in the provision of amplification and early intervention for children who are hard of hearing. However, children who use hearing aids (HA) may have inconsistent access to their auditory environment due to limitations in speech audibility through their HAs or limited HA use. The effects of variability in children’s auditory experience on parent-report auditory skills questionnaires and on speech recognition in quiet and in noise were examined for a large group of children who were followed as part of the Outcomes of Children with Hearing Loss study. Design Parent ratings on auditory development questionnaires and children’s speech recognition were assessed for 306 children who are hard of hearing. Children ranged in age from 12 months to 9 years of age. Three questionnaires involving parent ratings of auditory skill development and behavior were used, including the LittlEARS Auditory Questionnaire, Parents Evaluation of Oral/Aural Performance in Children Rating Scale, and an adaptation of the Speech, Spatial and Qualities of Hearing scale. Speech recognition in quiet was assessed using the Open and Closed set task, Early Speech Perception Test, Lexical Neighborhood Test, and Phonetically-balanced Kindergarten word lists. Speech recognition in noise was assessed using the Computer-Assisted Speech Perception Assessment. Children who are hard of hearing were compared to peers with normal hearing matched for age, maternal educational level and nonverbal intelligence. The effects of aided audibility, HA use and language ability on parent responses to auditory development questionnaires and on children’s speech recognition were also examined. Results Children who are hard of hearing had poorer performance than peers with normal hearing on parent ratings of auditory skills and had poorer speech recognition. Significant individual variability among children who are hard of hearing was observed. Children with greater

  7. Recognition of Speech of Normal-hearing Individuals with Tinnitus and Hyperacusis

    Directory of Open Access Journals (Sweden)

    Hennig, Tais Regina

    2011-01-01

    Full Text Available Introduction: Tinnitus and hyperacusis are increasingly frequent audiological symptoms that may occur in the absence of hearing impairment, yet they are no less impactful or bothersome to the affected individuals. The medial olivocochlear system assists speech recognition in noise and may be related to the presence of tinnitus and hyperacusis. Objective: To evaluate the speech recognition of normal-hearing individuals with and without complaints of tinnitus and hyperacusis, and to compare their results. Method: Descriptive, prospective, cross-sectional study in which 19 normal-hearing individuals with complaints of tinnitus and hyperacusis formed the Study Group (SG) and 23 normal-hearing individuals without audiological complaints formed the Control Group (CG). Individuals in both groups were administered the Lists of Sentences in Portuguese test, developed by Costa (1998), to determine the Sentence Recognition Threshold in Silence (LRSS) and the signal-to-noise ratio (S/N). The SG also answered the Tinnitus Handicap Inventory for tinnitus analysis, and discomfort thresholds were measured to characterize hyperacusis. Results: The CG and SG presented average LRSS and S/N ratios of 7.34 dB HL and -6.77 dB, and of 7.20 dB HL and -4.89 dB, respectively. Conclusion: Normal-hearing individuals with or without audiological complaints of tinnitus and hyperacusis performed similarly in speech recognition in silence, but not in the presence of competing noise, where the SG showed significantly poorer performance.

  8. Prefixes versus suffixes: a search for a word-beginning superiority effect in word recognition from degraded speech

    NARCIS (Netherlands)

    Nooteboom, S.G.; Vlugt, van der M.J.

    1985-01-01

    This paper reports on a word recognition experiment in search of evidence for a word-beginning superiority effect in recognition from low-quality speech. In the experiment, lexical redundancy was controlled by combining monosyllable word stems with strongly constraining or weakly constraining

  9. The software for automatic creation of the formal grammars used by speech recognition, computer vision, editable text conversion systems, and some new functions

    Science.gov (United States)

    Kardava, Irakli; Tadyszak, Krzysztof; Gulua, Nana; Jurga, Stefan

    2017-02-01

    For more flexible environmental perception by artificial intelligence, supporting software modules are needed that can automate the creation of a specific language syntax and perform further analysis for relevant decisions based on semantic functions. With the proposed approach, pairs of formal rules for given sentences (in the case of natural languages) or statements (in the case of special languages) can be created with the help of computer vision, speech recognition, or an editable text conversion system, and then automatically improved. In other words, we have developed an approach that can significantly improve the automation of artificial intelligence training, giving the system a higher level of self-development independent of its users. Based on this approach, we have developed a demo version of the software, which includes the algorithm and code implementing all of the above-mentioned components (computer vision, speech recognition, and editable text conversion). The program can work in multi-stream mode and simultaneously build a syntax from information received from several sources.

  10. Automatic speech recognition (zero crossing method). Automatic recognition of isolated vowels

    International Nuclear Information System (INIS)

    Dupeyrat, Benoit

    1975-01-01

    This note describes a method for recognizing isolated vowels using preprocessing of the vocal signal. The preprocessing extracts the extrema of the vocal signal and the time intervals separating them (zero-crossing distances of the first derivative of the signal). The recognition of vowels uses normalized histograms of the values of these intervals. The program determines a distance between the histogram of the sound to be recognized and histogram models built during a learning phase. The results, processed in real time by a minicomputer, are relatively independent of the speaker, provided the fundamental frequency does not vary too much (i.e., speakers of the same sex). (author) [fr]
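    A minimal sketch of this kind of recognizer, coding a signal by the intervals between successive extrema, building normalized interval histograms, and matching against stored vowel templates (the synthetic "vowels" and all parameters are illustrative):

```python
# Sketch: extrema-interval histograms as a simple vowel template matcher.
import numpy as np

def extrema_interval_histogram(signal, n_bins=32, max_interval=64):
    d = np.diff(signal)
    extrema = np.where(np.diff(np.sign(d)) != 0)[0] + 1   # indices of local extrema
    intervals = np.diff(extrema)                           # samples between extrema
    hist, _ = np.histogram(intervals, bins=n_bins, range=(1, max_interval))
    return hist / max(hist.sum(), 1)                       # normalized histogram

def classify(signal, templates):
    h = extrema_interval_histogram(signal)
    distances = {v: np.abs(h - t).sum() for v, t in templates.items()}
    return min(distances, key=distances.get)               # nearest template

fs = 8000
t = np.arange(0, 0.1, 1.0 / fs)
templates = {"a": extrema_interval_histogram(np.sin(2*np.pi*120*t) + 0.5*np.sin(2*np.pi*700*t)),
             "i": extrema_interval_histogram(np.sin(2*np.pi*120*t) + 0.5*np.sin(2*np.pi*2500*t))}
test = np.sin(2*np.pi*120*t) + 0.5*np.sin(2*np.pi*2300*t)
print(classify(test, templates))        # expected to be closest to the "i" template
```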

  11. Event-related potential evidence of form and meaning coding during online speech recognition.

    Science.gov (United States)

    Friedrich, Claudia K; Kotz, Sonja A

    2007-04-01

    It is still a matter of debate whether initial analysis of speech is independent of contextual influences or whether meaning can modulate word activation directly. Utilizing event-related brain potentials (ERPs), we tested the neural correlates of speech recognition by presenting sentences that ended with incomplete words, such as To light up the dark she needed her can-. Immediately following the incomplete words, subjects saw visual words that (i) matched form and meaning, such as candle; (ii) matched meaning but not form, such as lantern; (iii) matched form but not meaning, such as candy; or (iv) mismatched form and meaning, such as number. We report ERP evidence for two distinct cohorts of lexical tokens: (a) a left-lateralized effect, the P250, differentiates form-matching words (i, iii) and form-mismatching words (ii, iv); (b) a right-lateralized effect, the P220, differentiates words that match in form and/or meaning (i, ii, iii) from mismatching words (iv). Lastly, fully matching words (i) reduce the amplitude of the N400. These results accommodate bottom-up and top-down accounts of human speech recognition. They suggest that neural representations of form and meaning are activated independently early on and are integrated at a later stage during sentence comprehension.

  12. Deep Visual Attributes vs. Hand-Crafted Audio Features on Multidomain Speech Emotion Recognition

    Directory of Open Access Journals (Sweden)

    Michalis Papakostas

    2017-06-01

    Full Text Available Emotion recognition from speech may play a crucial role in many applications related to human–computer interaction or understanding the affective state of users in certain tasks, where other modalities such as video or physiological parameters are unavailable. In general, a human’s emotions may be recognized using several modalities such as analyzing facial expressions, speech, or physiological parameters (e.g., electroencephalograms, electrocardiograms, etc.). However, measuring these modalities may be difficult, obtrusive or require expensive hardware. In that context, speech may be the best alternative modality in many practical applications. In this work we present an approach that uses a Convolutional Neural Network (CNN) functioning as a visual feature extractor and trained using raw speech information. In contrast to traditional machine learning approaches, CNNs are responsible for identifying the important features of the input, thus making hand-crafted feature engineering optional in many tasks. In this paper no extra features are required other than the spectrogram representations, and hand-crafted features were only extracted for validation purposes of our method. Moreover, it does not require any linguistic model and is not specific to any particular language. We compare the proposed approach using cross-language datasets and demonstrate that it is able to provide superior results vs. traditional ones that use hand-crafted features.
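    A minimal sketch of a CNN that treats spectrogram patches as images for emotion classification, in the spirit of the approach above; the architecture, input size, and four emotion classes are illustrative, not the authors' network:

```python
# Sketch: a small CNN classifying fixed-size spectrogram patches into emotions.
import torch
import torch.nn as nn

class SpectrogramEmotionCNN(nn.Module):
    def __init__(self, n_classes=4):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.AdaptiveAvgPool2d((4, 4)),
        )
        self.classifier = nn.Linear(32 * 4 * 4, n_classes)

    def forward(self, spectrograms):            # (batch, 1, freq_bins, time_frames)
        x = self.features(spectrograms)
        return self.classifier(x.flatten(1))

model = SpectrogramEmotionCNN()
scores = model(torch.randn(8, 1, 128, 128))     # 8 spectrogram patches
```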

  13. Analysis of Documentation Speed Using Web-Based Medical Speech Recognition Technology: Randomized Controlled Trial.

    Science.gov (United States)

    Vogel, Markus; Kaisers, Wolfgang; Wassmuth, Ralf; Mayatepek, Ertan

    2015-11-03

    Clinical documentation has undergone a change due to the usage of electronic health records. The core element is to capture clinical findings and document therapy electronically. Health care personnel spend a significant portion of their time on the computer. Alternatives to self-typing, such as speech recognition, are currently believed to increase documentation efficiency and quality, as well as satisfaction of health professionals while accomplishing clinical documentation, but few studies in this area have been published to date. This study describes the effects of using a Web-based medical speech recognition system for clinical documentation in a university hospital on (1) documentation speed, (2) document length, and (3) physician satisfaction. Reports of 28 physicians were randomized to be created with (intervention) or without (control) the assistance of a Web-based system of medical automatic speech recognition (ASR) in the German language. The documentation was entered into a browser's text area and the time to complete the documentation including all necessary corrections, correction effort, number of characters, and mood of participant were stored in a database. The underlying time comprised text entering, text correction, and finalization of the documentation event. Participants self-assessed their moods on a scale of 1-3 (1=good, 2=moderate, 3=bad). Statistical analysis was done using permutation tests. The number of clinical reports eligible for further analysis stood at 1455. Out of 1455 reports, 718 (49.35%) were assisted by ASR and 737 (50.65%) were not assisted by ASR. Average documentation speed without ASR was 173 (SD 101) characters per minute, while it was 217 (SD 120) characters per minute using ASR. The overall increase in documentation speed through Web-based ASR assistance was 26% (P=.04). Participants documented an average of 356 (SD 388) characters per report when not assisted by ASR and 649 (SD 561) characters per report when assisted
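    The group comparison above can be analyzed with a permutation test on per-report documentation speed. A minimal sketch using synthetic data drawn from the group means and standard deviations quoted in the abstract:

```python
# Sketch: a one-sided permutation test on documentation speed (characters per minute)
# for ASR-assisted vs. unassisted reports; the samples below are synthetic.
import numpy as np

rng = np.random.default_rng(1)
with_asr = rng.normal(217, 120, size=718)      # group means/SDs taken from the abstract
without_asr = rng.normal(173, 101, size=737)

observed = with_asr.mean() - without_asr.mean()
pooled = np.concatenate([with_asr, without_asr])
n_perm, count = 5000, 0
for _ in range(n_perm):
    rng.shuffle(pooled)
    diff = pooled[:len(with_asr)].mean() - pooled[len(with_asr):].mean()
    count += diff >= observed
print(f"observed difference = {observed:.1f} cpm, one-sided p = {count / n_perm:.4f}")
```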

  14. Neuroscience-inspired computational systems for speech recognition under noisy conditions

    Science.gov (United States)

    Schafer, Phillip B.

    Humans routinely recognize speech in challenging acoustic environments with background music, engine sounds, competing talkers, and other acoustic noise. However, today's automatic speech recognition (ASR) systems perform poorly in such environments. In this dissertation, I present novel methods for ASR designed to approach human-level performance by emulating the brain's processing of sounds. I exploit recent advances in auditory neuroscience to compute neuron-based representations of speech, and design novel methods for decoding these representations to produce word transcriptions. I begin by considering speech representations modeled on the spectrotemporal receptive fields of auditory neurons. These representations can be tuned to optimize a variety of objective functions, which characterize the response properties of a neural population. I propose an objective function that explicitly optimizes the noise invariance of the neural responses, and find that it gives improved performance on an ASR task in noise compared to other objectives. The method as a whole, however, fails to significantly close the performance gap with humans. I next consider speech representations that make use of spiking model neurons. The neurons in this method are feature detectors that selectively respond to spectrotemporal patterns within short time windows in speech. I consider a number of methods for training the response properties of the neurons. In particular, I present a method using linear support vector machines (SVMs) and show that this method produces spikes that are robust to additive noise. I compute the spectrotemporal receptive fields of the neurons for comparison with previous physiological results. To decode the spike-based speech representations, I propose two methods designed to work on isolated word recordings. The first method uses a classical ASR technique based on the hidden Markov model. The second method is a novel template-based recognition scheme that takes

  15. A Hybrid Acoustic and Pronunciation Model Adaptation Approach for Non-native Speech Recognition

    Science.gov (United States)

    Oh, Yoo Rhee; Kim, Hong Kook

    In this paper, we propose a hybrid model adaptation approach in which pronunciation and acoustic models are adapted by incorporating the pronunciation and acoustic variabilities of non-native speech in order to improve the performance of non-native automatic speech recognition (ASR). Specifically, the proposed hybrid model adaptation can be performed at either the state-tying or triphone-modeling level, depending on the level at which acoustic model adaptation is performed. In both methods, we first analyze the pronunciation variant rules of non-native speakers and then classify each rule as either a pronunciation variant or an acoustic variant. The state-tying level hybrid method then adapts pronunciation models and acoustic models by accommodating the pronunciation variants in the pronunciation dictionary and by clustering the states of triphone acoustic models using the acoustic variants, respectively. On the other hand, the triphone-modeling level hybrid method initially adapts pronunciation models in the same way as in the state-tying level hybrid method; however, for the acoustic model adaptation, the triphone acoustic models are then re-estimated based on the adapted pronunciation models and the states of the re-estimated triphone acoustic models are clustered using the acoustic variants. From the Korean-spoken English speech recognition experiments, it is shown that ASR systems employing the state-tying and triphone-modeling level adaptation methods can reduce the average word error rates (WERs) by a relative 17.1% and 22.1% for non-native speech, respectively, when compared to a baseline ASR system.

  16. Dynamic relation between working memory capacity and speech recognition in noise during the first 6 months of hearing aid use.

    Science.gov (United States)

    Ng, Elaine H N; Classon, Elisabet; Larsby, Birgitta; Arlinger, Stig; Lunner, Thomas; Rudner, Mary; Rönnberg, Jerker

    2014-11-23

    The present study aimed to investigate the changing relationship between aided speech recognition and cognitive function during the first 6 months of hearing aid use. Twenty-seven first-time hearing aid users with symmetrical mild to moderate sensorineural hearing loss were recruited. Aided speech recognition thresholds in noise were obtained in the hearing aid fitting session as well as at 3 and 6 months postfitting. Cognitive abilities were assessed using a reading span test, which is a measure of working memory capacity, and a cognitive test battery. Results showed a significant correlation between reading span and speech reception threshold during the hearing aid fitting session. This relation was significantly weakened over the first 6 months of hearing aid use. Multiple regression analysis showed that reading span was the main predictor of speech recognition thresholds in noise when hearing aids were first fitted, but that the pure-tone average hearing threshold was the main predictor 6 months later. One way of explaining the results is that working memory capacity plays a more important role in speech recognition in noise initially rather than after 6 months of use. We propose that new hearing aid users engage working memory capacity to recognize unfamiliar processed speech signals because the phonological form of these signals cannot be automatically matched to phonological representations in long-term memory. As familiarization proceeds, the mismatch effect is alleviated, and the engagement of working memory capacity is reduced. © The Author(s) 2014.

  17. The Relationship between Binaural Benefit and Difference in Unilateral Speech Recognition Performance for Bilateral Cochlear Implant Users

    Science.gov (United States)

    Yoon, Yang-soo; Li, Yongxin; Kang, Hou-Yong; Fu, Qian-Jie

    2011-01-01

    Objective The full benefit of bilateral cochlear implants may depend on the unilateral performance with each device, the speech materials, processing ability of the user, and/or the listening environment. In this study, bilateral and unilateral speech performances were evaluated in terms of recognition of phonemes and sentences presented in quiet or in noise. Design Speech recognition was measured for unilateral left, unilateral right, and bilateral listening conditions; speech and noise were presented at 0° azimuth. The “binaural benefit” was defined as the difference between bilateral performance and unilateral performance with the better ear. Study Sample 9 adults with bilateral cochlear implants participated. Results On average, results showed a greater binaural benefit in noise than in quiet for all speech tests. More importantly, the binaural benefit was greater when unilateral performance was similar across ears. As the difference in unilateral performance between ears increased, the binaural advantage decreased; this functional relationship was observed across the different speech materials and noise levels even though there was substantial intra- and inter-subject variability. Conclusions The results indicate that subjects who show symmetry in speech recognition performance between implanted ears in general show a large binaural benefit. PMID:21696329

  18. A Digital Liquid State Machine With Biologically Inspired Learning and Its Application to Speech Recognition.

    Science.gov (United States)

    Zhang, Yong; Li, Peng; Jin, Yingyezhe; Choe, Yoonsuck

    2015-11-01

    This paper presents a bioinspired digital liquid-state machine (LSM) for low-power very-large-scale-integration (VLSI)-based machine learning applications. To the best of the authors' knowledge, this is the first work that employs a bioinspired spike-based learning algorithm for the LSM. With the proposed online learning, the LSM extracts information from input patterns on the fly without needing intermediate data storage as required in offline learning methods such as ridge regression. The proposed learning rule is local such that each synaptic weight update is based only upon the firing activities of the corresponding presynaptic and postsynaptic neurons without incurring global communications across the neural network. Compared with the backpropagation-based learning, the locality of computation in the proposed approach lends itself to efficient parallel VLSI implementation. We use subsets of the TI46 speech corpus to benchmark the bioinspired digital LSM. To reduce the complexity of the spiking neural network model without performance degradation for speech recognition, we study the impacts of synaptic models on the fading memory of the reservoir and hence the network performance. Moreover, we examine the tradeoffs between synaptic weight resolution, reservoir size, and recognition performance and present techniques to further reduce the overhead of hardware implementation. Our simulation results show that in terms of isolated word recognition evaluated using the TI46 speech corpus, the proposed digital LSM rivals the state-of-the-art hidden Markov-model-based recognizer Sphinx-4 and outperforms all other reported recognizers including the ones that are based upon the LSM or neural networks.
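
    The abstract's key claim about locality can be illustrated with a simplified stand-in rule: each readout weight is updated using only the activity of its own presynaptic and postsynaptic neurons plus a per-neuron teaching signal, with no error propagated across the network. This is not the paper's exact spike-based rule, only a hedged sketch of the locality property it describes.

```python
import numpy as np

rng = np.random.default_rng(0)

n_pre, n_post = 100, 10
w = rng.normal(0.0, 0.1, size=(n_post, n_pre))   # readout synaptic weights
eta = 0.01                                        # learning rate

def local_update(w, pre_spikes, post_spikes, target_spikes):
    """Each weight w[j, i] changes using only presynaptic neuron i's activity
    and a teaching signal local to postsynaptic neuron j (no global error
    signal propagated across the network)."""
    err = target_spikes - post_spikes
    return w + eta * np.outer(err, pre_spikes)

# One illustrative update with random binary spike vectors.
pre = (rng.random(n_pre) < 0.2).astype(float)
post = (w @ pre > 0.5).astype(float)
target = np.zeros(n_post)
target[3] = 1.0                                   # desired class neuron fires
w = local_update(w, pre, post, target)
```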

  19. Expert system application for ISPA diagnosis based on speech recognition using the naive Bayes classifier method; Aplikasi sistem pakar diagnosis penyakit ISPA berbasis speech recognition menggunakan metode naive bayes classifier

    Directory of Open Access Journals (Sweden)

    Mariam Marlina

    2017-05-01

    Full Text Available ISPA (acute respiratory tract infection) is a respiratory disorder that can produce a broad spectrum of illness, ranging from asymptomatic disease and mild infection to severe and deadly disease, due to environmental factors. Limited public knowledge of the symptoms of ISPA and of how to manage it is one factor behind the high mortality rate from ISPA. An expert system provided in the form of an application is therefore needed to help a person diagnose ISPA easily and quickly. By adopting human knowledge into a computer, an expert system can solve problems in the way an expert would. Accordingly, the expert system application for ISPA diagnosis based on speech recognition using the naive Bayes classifier method can be used to diagnose ISPA in a person based on the converted result of the user's detected speech. With this application, the user in effect consults a doctor or expert who treats ISPA. The application was built for Android using the Java programming language and a MySQL database. Keywords: expert system, speech recognition, ISPA, naive Bayes classifier method, Android.
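
    A minimal sketch of the classification step is given below, assuming the recognized speech is mapped to a binary symptom vector and scored with a Bernoulli naive Bayes classifier. The symptom names, training cases, and labels are invented for illustration and do not come from the application described above.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

symptoms = ["cough", "fever", "sore throat", "shortness of breath"]

# Hypothetical training set: rows are past cases, columns follow `symptoms`.
X = np.array([
    [1, 1, 1, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 0, 0],
])
y = np.array(["ISPA", "ISPA", "ISPA", "healthy"])

model = BernoulliNB().fit(X, y)

def diagnose(recognized_text):
    """Map recognized speech to a symptom vector and classify it."""
    vec = np.array([[1 if s in recognized_text else 0 for s in symptoms]])
    label = model.predict(vec)[0]
    prob = model.predict_proba(vec).max()
    return label, prob

print(diagnose("i have a cough and a fever"))
```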

  20. Parametric Representation of the Speaker's Lips for Multimodal Sign Language and Speech Recognition

    Science.gov (United States)

    Ryumin, D.; Karpov, A. A.

    2017-05-01

    In this article, we propose a new method for the parametric representation of a speaker's lip region. The functional diagram of the method is described, and implementation details with an explanation of its key stages and features are given. The results of automatic detection of the regions of interest are illustrated. The processing speed of the method on several computers with different performance is reported. This universal method allows applying the parametric representation of the speaker's lips to the tasks of biometrics, computer vision, machine learning, and automatic recognition of faces, elements of sign languages, and audio-visual speech, including lip-reading.

  1. Effects of Age and Working Memory Capacity on Speech Recognition Performance in Noise Among Listeners With Normal Hearing.

    Science.gov (United States)

    Gordon-Salant, Sandra; Cole, Stacey Samuels

    2016-01-01

    This study aimed to determine if younger and older listeners with normal hearing who differ on working memory span perform differently on speech recognition tests in noise. Older adults typically exhibit poorer speech recognition scores in noise than younger adults, which is attributed primarily to poorer hearing sensitivity and more limited working memory capacity in older than younger adults. Previous studies typically tested older listeners with poorer hearing sensitivity and shorter working memory spans than younger listeners, making it difficult to discern the importance of working memory capacity on speech recognition. This investigation controlled for hearing sensitivity and compared speech recognition performance in noise by younger and older listeners who were subdivided into high and low working memory groups. Performance patterns were compared for different speech materials to assess whether or not the effect of working memory capacity varies with the demands of the specific speech test. The authors hypothesized that (1) normal-hearing listeners with low working memory span would exhibit poorer speech recognition performance in noise than those with high working memory span; (2) older listeners with normal hearing would show poorer speech recognition scores than younger listeners with normal hearing, when the two age groups were matched for working memory span; and (3) an interaction between age and working memory would be observed for speech materials that provide contextual cues. Twenty-eight older (61 to 75 years) and 25 younger (18 to 25 years) normal-hearing listeners were assigned to groups based on age and working memory status. Northwestern University Auditory Test No. 6 words and Institute of Electrical and Electronics Engineers sentences were presented in noise using an adaptive procedure to measure the signal-to-noise ratio corresponding to 50% correct performance. Cognitive ability was evaluated with two tests of working memory (Listening

  2. Automated Degradation Diagnosis in Character Recognition System Subject to Camera Vibration

    Directory of Open Access Journals (Sweden)

    Chunmei Liu

    2014-01-01

    Full Text Available Degradation diagnosis plays an important role in degraded character processing, as it can indicate the recognition difficulty of a given degraded character. In this paper, we present a framework for an automated degraded character recognition system based on a statistical syntactic approach using 3D primitive symbols, integrated with degradation diagnosis to provide accurate and reliable recognition results. Our contribution is to design the framework to build character recognition submodels corresponding to degradation caused by camera vibration or defocus. In each character recognition submodel, a statistical syntactic approach using 3D primitive symbols is proposed to improve degraded character recognition performance. In the experiments, we present results highlighting the system's efficiency and the recognition performance of the statistical syntactic approach using 3D primitive symbols on the degraded character dataset.

  3. Robust audio-visual speech recognition under noisy audio-video conditions.

    Science.gov (United States)

    Stewart, Darryl; Seymour, Rowan; Pass, Adrian; Ming, Ji

    2014-02-01

    This paper presents the maximum weighted stream posterior (MWSP) model as a robust and efficient stream integration method for audio-visual speech recognition in environments where the audio or video streams may be subjected to unknown and time-varying corruption. A significant advantage of MWSP is that it does not require any specific measurements of the signal in either stream to calculate appropriate stream weights during recognition, and as such it is modality-independent. This also means that MWSP complements and can be used alongside many of the other approaches that have been proposed in the literature for this problem. For evaluation we used the large XM2VTS database for speaker-independent audio-visual speech recognition. The extensive tests include both clean and corrupted utterances with corruption added in either or both the video and audio streams using a variety of types (e.g., MPEG-4 video compression) and levels of noise. The experiments show that this approach gives excellent performance in comparison to another well-known dynamic stream weighting approach and also compared to any fixed-weighted integration approach, in both clean conditions and when noise is added to either stream. Furthermore, our experiments show that the MWSP approach dynamically selects suitable integration weights on a frame-by-frame basis according to the level of noise in the streams and also according to the naturally fluctuating relative reliability of the modalities even in clean conditions. The MWSP approach is shown to maintain robust recognition performance in all tested conditions, while requiring no prior knowledge about the type or level of noise.
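
    A hedged sketch of the idea behind maximum weighted stream posterior integration follows: for each frame, candidate stream weights are applied to the audio and video log-likelihoods, and the weight whose combined score for the best state is largest is selected. The weight grid and the scoring shortcut are simplifications, not the paper's exact formulation.

```python
import numpy as np

def mwsp_frame_score(log_p_audio, log_p_video,
                     weights=np.linspace(0.0, 1.0, 11)):
    """log_p_audio, log_p_video: per-state log-likelihoods for one frame.
    Returns (best score, chosen audio weight, combined per-state scores)."""
    best = None
    for lam in weights:
        combined = lam * log_p_audio + (1.0 - lam) * log_p_video
        score = combined.max()               # score of the best state
        if best is None or score > best[0]:
            best = (score, lam, combined)
    return best

# Example with random per-state log-likelihoods for a 5-state model.
rng = np.random.default_rng(1)
score, lam, _ = mwsp_frame_score(rng.normal(size=5), rng.normal(size=5))
print(f"chosen audio weight = {lam:.1f}")
```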

  4. Speech Recognition in Real-Life Background Noise by Young and Middle-Aged Adults with Normal Hearing

    OpenAIRE

    Lee, Ji Young; Lee, Jin Tae; Heo, Hye Jeong; Choi, Chul-Hee; Choi, Seong Hee; Lee, Kyungjae

    2015-01-01

    Background and Objectives People usually converse in real-life background noise. They experience more difficulty understanding speech in noise than in a quiet environment. The present study investigated how speech recognition in real-life background noise is affected by the type of noise, signal-to-noise ratio (SNR), and age. Subjects and Methods Eighteen young adults and fifteen middle-aged adults with normal hearing participated in the present study. Three types of noise [subway noise, vacu...

  5. Cognition and speech-in-noise recognition: the role of proactive interference.

    Science.gov (United States)

    Ellis, Rachel J; Rönnberg, Jerker

    2014-01-01

    Complex working memory (WM) span tasks have been shown to predict speech-in-noise (SIN) recognition. Studies of complex WM span tasks suggest that, rather than indexing a single cognitive process, performance on such tasks may be governed by separate cognitive subprocesses embedded within WM. Previous research has suggested that one such subprocess indexed by WM tasks is proactive interference (PI), which refers to difficulties memorizing current information because of interference from previously stored long-term memory representations for similar information. The aim of the present study was to investigate phonological PI and to examine the relationship between PI (semantic and phonological) and SIN perception. A within-subjects experimental design was used. An opportunity sample of 24 young listeners with normal hearing was recruited. Measures of resistance to, and release from, semantic and phonological PI were calculated alongside the signal-to-noise ratio required to identify 50% of keywords correctly in a SIN recognition task. The data were analyzed using t-tests and correlations. Evidence of release from and resistance to semantic interference was observed. These measures correlated significantly with SIN recognition. Limited evidence of phonological PI was observed. The results show that capacity to resist semantic PI can be used to predict SIN recognition scores in young listeners with normal hearing. On the basis of these findings, future research will focus on investigating whether tests of PI can be used in the treatment and/or rehabilitation of hearing loss. American Academy of Audiology.

  6. Comparing models of the combined-stimulation advantage for speech recognition.

    Science.gov (United States)

    Micheyl, Christophe; Oxenham, Andrew J

    2012-05-01

    The "combined-stimulation advantage" refers to an improvement in speech recognition when cochlear-implant or vocoded stimulation is supplemented by low-frequency acoustic information. Previous studies have been interpreted as evidence for "super-additive" or "synergistic" effects in the combination of low-frequency and electric or vocoded speech information by human listeners. However, this conclusion was based on predictions of performance obtained using a suboptimal high-threshold model of information combination. The present study shows that a different model, based on Gaussian signal detection theory, can predict surprisingly large combined-stimulation advantages, even when performance with either information source alone is close to chance, without involving any synergistic interaction. A reanalysis of published data using this model reveals that previous results, which have been interpreted as evidence for super-additive effects in perception of combined speech stimuli, are actually consistent with a more parsimonious explanation, according to which the combined-stimulation advantage reflects an optimal combination of two independent sources of information. The present results do not rule out the possible existence of synergistic effects in combined stimulation; however, they emphasize the possibility that the combined-stimulation advantages observed in some studies can be explained simply by non-interactive combination of two information sources.

  7. Comparison of middle latency responses in presbycusis patients with two different speech recognition scores.

    Science.gov (United States)

    Kirkim, Gunay; Madanoglu, Nevma; Akdas, Ferda; Serbetcioglu, M Bulent

    2007-12-01

    The purpose of this study is to evaluate whether middle latency responses (MLR) can be used for an objective differentiation of patients with presbycusis having relatively good (Group I) or relatively poor (Group II) speech recognition scores. All participants in these groups had high-frequency down-sloping hearing loss with averages of 26-60 dB HL. Data were collected from the two study groups and a control group using pure-tone audiometry, monosyllabic phonetically balanced word and synthetic sentence identification tests, and MLR. The study groups were compared with the control group. When patients in Group I were compared with the control group, only the ipsilateral Na latency of the middle latency evoked response in the right ear differed significantly, whereas in Group II the ipsilateral Na latency in the right ear and both the ipsilateral and contralateral Na latencies in the left ear differed significantly. Thus, as an objective complementary tool for the evaluation of the speech perception ability of patients with presbycusis, the Na latency of the MLR may be used in combination with speech discrimination tests.

  8. Deficits in audiovisual speech perception in normal aging emerge at the level of whole-word recognition.

    Science.gov (United States)

    Stevenson, Ryan A; Nelms, Caitlin E; Baum, Sarah H; Zurkovsky, Lilia; Barense, Morgan D; Newhouse, Paul A; Wallace, Mark T

    2015-01-01

    Over the next 2 decades, a dramatic shift in the demographics of society will take place, with a rapid growth in the population of older adults. One of the most common complaints with healthy aging is a decreased ability to successfully perceive speech, particularly in noisy environments. In such noisy environments, the presence of visual speech cues (i.e., lip movements) provide striking benefits for speech perception and comprehension, but previous research suggests that older adults gain less from such audiovisual integration than their younger peers. To determine at what processing level these behavioral differences arise in healthy-aging populations, we administered a speech-in-noise task to younger and older adults. We compared the perceptual benefits of having speech information available in both the auditory and visual modalities and examined both phoneme and whole-word recognition across varying levels of signal-to-noise ratio. For whole-word recognition, older adults relative to younger adults showed greater multisensory gains at intermediate SNRs but reduced benefit at low SNRs. By contrast, at the phoneme level both younger and older adults showed approximately equivalent increases in multisensory gain as signal-to-noise ratio decreased. Collectively, the results provide important insights into both the similarities and differences in how older and younger adults integrate auditory and visual speech cues in noisy environments and help explain some of the conflicting findings in previous studies of multisensory speech perception in healthy aging. These novel findings suggest that audiovisual processing is intact at more elementary levels of speech perception in healthy-aging populations and that deficits begin to emerge only at the more complex word-recognition level of speech signals. Copyright © 2015 Elsevier Inc. All rights reserved.

  9. Towards an Intelligent Acoustic Front End for Automatic Speech Recognition: Built-in Speaker Normalization

    Directory of Open Access Journals (Sweden)

    Umit H. Yapanel

    2008-08-01

    Full Text Available A proven method for achieving effective automatic speech recognition (ASR) despite speaker differences is to perform acoustic feature speaker normalization. More effective speaker normalization methods are needed which require limited computing resources for real-time performance. The most popular speaker normalization technique is vocal-tract length normalization (VTLN), despite the fact that it is computationally expensive. In this study, we propose a novel online VTLN algorithm entitled built-in speaker normalization (BISN), where normalization is performed on-the-fly within a newly proposed PMVDR acoustic front end. The novel aspect of the algorithm is that conventional front-end processing with PMVDR and VTLN requires two separate warping phases, while the proposed BISN method uses only a single speaker-dependent warp to achieve both the PMVDR perceptual warp and the VTLN warp simultaneously. This improved integration unifies the nonlinear warping performed in the front end and reduces computational requirements, thereby offering advantages for real-time ASR systems. Evaluations are performed for (i) an in-car extended digit recognition task, where an on-the-fly BISN implementation reduces the relative word error rate (WER) by 24%, and (ii) a diverse noisy speech task (SPINE 2), where the relative WER improvement was 9%, both relative to the baseline speaker normalization method.
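
    The single speaker-dependent warp at the heart of VTLN-style normalization can be sketched as a piecewise-linear rescaling of the filterbank frequency axis. The cut-off frequency, Nyquist frequency, and filterbank centres below are illustrative; BISN, as described above, additionally folds this warp into the PMVDR perceptual warp, which is not reproduced here.

```python
import numpy as np

def warp_frequencies(freqs_hz, alpha, f_cut=4800.0, f_nyq=8000.0):
    """Piecewise-linear VTLN warp: slope alpha up to f_cut, then a linear
    segment chosen so that f_nyq maps onto itself."""
    freqs_hz = np.asarray(freqs_hz, dtype=float)
    upper = alpha * f_cut + (f_nyq - alpha * f_cut) * (freqs_hz - f_cut) / (f_nyq - f_cut)
    return np.where(freqs_hz <= f_cut, alpha * freqs_hz, upper)

centres = np.linspace(100, 7900, 24)   # illustrative filterbank centre frequencies
print(warp_frequencies(centres, alpha=0.92)[:5])
```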

  10. DEVELOPING VISUAL NOVEL GAME WITH SPEECH-RECOGNITION INTERACTIVITY TO ENHANCE STUDENTS’ MASTERY ON ENGLISH EXPRESSIONS

    Directory of Open Access Journals (Sweden)

    Elizabeth Anggraeni Amalo

    2017-11-01

    Full Text Available The teaching of English expressions has traditionally been done through conversation samples in the form of written texts, audio recordings, and videos. Meanwhile, the development of computer-aided learning technology has made autonomous language learning possible. Games, as computer-aided learning technology products, can serve as a medium for educational content such as language teaching and learning material. A visual novel is a conversational game genre that is well suited to English-expressions material. Unlike other visual novel games with click-based interaction, the visual novel game in this research implements speech recognition as the interaction trigger. Hence, this paper elaborates how visual novel games are utilized to deliver English expressions with speech-recognition commands for the interaction. This research used the Research and Development (R&D) method with an experimental design, using control and experimental groups to measure the game's effectiveness in enhancing students' mastery of English expressions. ANOVA was used to test for significant differences between the control and experimental groups. It is expected that the results of this development and experiment will benefit English teaching and learning, especially of English expressions.

  11. Towards an Intelligent Acoustic Front End for Automatic Speech Recognition: Built-in Speaker Normalization

    Directory of Open Access Journals (Sweden)

    Yapanel UmitH

    2008-01-01

    Full Text Available A proven method for achieving effective automatic speech recognition (ASR) despite speaker differences is to perform acoustic feature speaker normalization. More effective speaker normalization methods are needed which require limited computing resources for real-time performance. The most popular speaker normalization technique is vocal-tract length normalization (VTLN), despite the fact that it is computationally expensive. In this study, we propose a novel online VTLN algorithm entitled built-in speaker normalization (BISN), where normalization is performed on-the-fly within a newly proposed PMVDR acoustic front end. The novel aspect of the algorithm is that conventional front-end processing with PMVDR and VTLN requires two separate warping phases, while the proposed BISN method uses only a single speaker-dependent warp to achieve both the PMVDR perceptual warp and the VTLN warp simultaneously. This improved integration unifies the nonlinear warping performed in the front end and reduces computational requirements, thereby offering advantages for real-time ASR systems. Evaluations are performed for (i) an in-car extended digit recognition task, where an on-the-fly BISN implementation reduces the relative word error rate (WER) by 24%, and (ii) a diverse noisy speech task (SPINE 2), where the relative WER improvement was 9%, both relative to the baseline speaker normalization method.

  12. An exploration of the potential of Automatic Speech Recognition to assist and enable receptive communication in higher education

    Directory of Open Access Journals (Sweden)

    Mike Wald

    2006-12-01

    Full Text Available The potential use of Automatic Speech Recognition to assist receptive communication is explored. The opportunities and challenges that this technology presents to students and staff are discussed and evaluated: providing captions of speech online or in classrooms for deaf or hard-of-hearing students, and helping blind, visually impaired or dyslexic learners to read and search learning material more readily by augmenting synthetic speech with naturally recorded real speech. The automatic provision of online lecture notes, synchronised with speech, enables staff and students to focus on learning and teaching issues, while also benefiting learners unable to attend the lecture or who find it difficult or impossible to take notes at the same time as listening, watching and thinking.

  13. Automated analysis of connected speech reveals early biomarkers of Parkinson's disease in patients with rapid eye movement sleep behaviour disorder.

    Science.gov (United States)

    Hlavnička, Jan; Čmejla, Roman; Tykalová, Tereza; Šonka, Karel; Růžička, Evžen; Rusz, Jan

    2017-02-02

    For generations, the evaluation of speech abnormalities in neurodegenerative disorders such as Parkinson's disease (PD) has been limited to perceptual tests or user-controlled laboratory analysis based upon rather small samples of human vocalizations. Our study introduces a fully automated method that yields significant features related to respiratory deficits, dysphonia, imprecise articulation and dysrhythmia from acoustic microphone data of natural connected speech for predicting early and distinctive patterns of neurodegeneration. We compared speech recordings of 50 subjects with rapid eye movement sleep behaviour disorder (RBD), 30 newly diagnosed, untreated PD patients and 50 healthy controls, and showed that subliminal parkinsonian speech deficits can be reliably captured even in RBD patients, who are at high risk of developing PD or other synucleinopathies. Thus, automated vocal analysis should soon be able to contribute to screening and diagnostic procedures for prodromal parkinsonian neurodegeneration in natural environments.

  14. Application of Business Process Management to drive the deployment of a speech recognition system in a healthcare organization.

    Science.gov (United States)

    González Sánchez, María José; Framiñán Torres, José Manuel; Parra Calderón, Carlos Luis; Del Río Ortega, Juan Antonio; Vigil Martín, Eduardo; Nieto Cervera, Jaime

    2008-01-01

    We present a methodology based on Business Process Management to guide the development of a speech recognition system in a hospital in Spain. The methodology eases the deployment of the system by 1) involving the clinical staff in the process, 2) providing the IT professionals with a description of the process and its requirements, 3) assessing the advantages and disadvantages of the speech recognition system, as well as its impact on the organisation, and 4) helping to reorganise the healthcare process before implementing the new technology, in order to identify how it can best contribute to the overall objective of the organisation.

  15. Usability Assessment of Text-to-Speech Synthesis for Additional Detail in an Automated Telephone Banking System

    OpenAIRE

    Morton, Hazel; Gunson, Nancie; Marshall, Diarmid; McInnes, Fergus; Ayres, Andrea; Jack, Mervyn

    2010-01-01

    This paper describes a comprehensive usability evaluation of an automated telephone banking system which employs text-to-speech (TTS) synthesis in offering additional detail on customers' account transactions. The paper describes a series of four experiments in which TTS was employed to offer an extra level of detail to recent transactions listings within an established banking service which otherwise uses recorded speech from a professional recording artist. Results from ...

  16. CAR2 - Czech Database of Car Speech

    Directory of Open Access Journals (Sweden)

    P. Sovka

    1999-12-01

    Full Text Available This paper presents a new Czech-language two-channel (stereo) speech database recorded in a car environment. The created database was designed for experiments with speech enhancement for communication purposes and for the study and design of robust speech recognition systems. Tools for automated phoneme labelling based on Baum-Welch re-estimation were implemented, and a noise analysis of the car background environment was carried out.

  17. CAR2 - Czech Database of Car Speech

    OpenAIRE

    Pollak, P.; Vopicka, J.; Hanzl, V.; Sovka, Pavel

    1999-01-01

    This paper presents a new Czech-language two-channel (stereo) speech database recorded in a car environment. The created database was designed for experiments with speech enhancement for communication purposes and for the study and design of robust speech recognition systems. Tools for automated phoneme labelling based on Baum-Welch re-estimation were implemented, and a noise analysis of the car background environment was carried out.

  18. Use of Automated Scoring in Spoken Language Assessments for Test Takers with Speech Impairments. Research Report. ETS RR-17-42

    Science.gov (United States)

    Loukina, Anastassia; Buzick, Heather

    2017-01-01

    This study is an evaluation of the performance of automated speech scoring for speakers with documented or suspected speech impairments. Given that the use of automated scoring of open-ended spoken responses is relatively nascent and there is little research to date that includes test takers with disabilities, this small exploratory study focuses…

  19. Some factors underlying individual differences in speech recognition on PRESTO: a first report.

    Science.gov (United States)

    Tamati, Terrin N; Gilbert, Jaimie L; Pisoni, David B

    2013-01-01

    Previous studies investigating speech recognition in adverse listening conditions have found extensive variability among individual listeners. However, little is currently known about the core underlying factors that influence speech recognition abilities. To investigate sensory, perceptual, and neurocognitive differences between good and poor listeners on the Perceptually Robust English Sentence Test Open-set (PRESTO), a new high-variability sentence recognition test under adverse listening conditions. Participants who fell in the upper quartile (HiPRESTO listeners) or lower quartile (LoPRESTO listeners) on key word recognition on sentences from PRESTO in multitalker babble completed a battery of behavioral tasks and self-report questionnaires designed to investigate real-world hearing difficulties, indexical processing skills, and neurocognitive abilities. Young, normal-hearing adults (N = 40) from the Indiana University community participated in the current study. Participants' assessment of their own real-world hearing difficulties was measured with a self-report questionnaire on situational hearing and hearing health history. Indexical processing skills were assessed using a talker discrimination task, a gender discrimination task, and a forced-choice regional dialect categorization task. Neurocognitive abilities were measured with the Auditory Digit Span Forward (verbal short-term memory) and Digit Span Backward (verbal working memory) tests, the Stroop Color and Word Test (attention/inhibition), the WordFam word familiarity test (vocabulary size), the Behavioral Rating Inventory of Executive Function-Adult Version (BRIEF-A) self-report questionnaire on executive function, and two performance subtests of the Wechsler Abbreviated Scale of Intelligence (WASI) Performance Intelligence Quotient (IQ; nonverbal intelligence). Scores on self-report questionnaires and behavioral tasks were tallied and analyzed by listener group (HiPRESTO and LoPRESTO). The extreme

  20. Development of equally intelligible Telugu sentence-lists to test speech recognition in noise.

    Science.gov (United States)

    Tanniru, Kishore; Narne, Vijaya Kumar; Jain, Chandni; Konadath, Sreeraj; Singh, Niraj Kumar; Sreenivas, K J Ramadevi; K, Anusha

    2017-09-01

    To develop sentence lists in the Telugu language for the assessment of the speech recognition threshold (SRT) in the presence of background noise, through identification of the mean signal-to-noise ratio required to attain a 50% sentence recognition score (SRTn). This study was conducted in three phases. The first phase involved the selection and recording of Telugu sentences. In the second phase, 20 lists, each consisting of 10 sentences with equal intelligibility, were formulated using a numerical optimisation procedure. In the third phase, the SRTn of the developed lists was estimated using adaptive procedures on individuals with normal hearing. A total of 68 native Telugu speakers with normal hearing participated in the study. Of these, 18 (including the speakers) performed on various subjective measures in the first phase, 20 performed on sentence/word recognition in noise in the second phase, and 30 participated in the list equivalency procedures in the third phase. In all, 15 lists of comparable difficulty were formulated as test material. The mean SRTn across these lists corresponded to -2.74 dB (SD = 0.21). The developed sentence lists provide a valid and reliable tool to measure SRTn in native Telugu speakers.
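
    The kind of adaptive procedure used to estimate SRTn (the SNR yielding 50% sentence recognition) can be sketched with a simple one-up/one-down staircase driven by a simulated listener. The step size, starting SNR, logistic listener model, and averaging rule are illustrative assumptions; the study's exact adaptive rule may differ.

```python
import random
from math import exp

def simulate_listener(snr_db, srt_true=-2.7, slope=2.0):
    """Probability of repeating a sentence correctly at a given SNR (logistic)."""
    p = 1.0 / (1.0 + exp(-slope * (snr_db - srt_true)))
    return random.random() < p

def run_staircase(n_sentences=30, start_snr=0.0, step=2.0):
    snr = start_snr
    track = []
    for _ in range(n_sentences):
        correct = simulate_listener(snr)
        track.append(snr)
        snr += -step if correct else step   # harder after a hit, easier after a miss
    return sum(track[10:]) / len(track[10:])  # average SNR after the initial run-in

random.seed(3)
print(f"estimated SRTn ~ {run_staircase():.1f} dB SNR")
```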

  1. AUTOMATION PROGRAM FOR RECOGNITION OF ALGORITHM SOLUTION OF MATHEMATIC TASK

    Directory of Open Access Journals (Sweden)

    Denis N. Butorin

    2014-01-01

    Full Text Available The article describes a technology for managing testing tasks in a computer program, developed for the recognition of the algorithm used to solve a mathematical task. The use of a hierarchical structure for a special set of testing questions is justified. The implementation of the described tasks in the computer program openSEE is also presented.

  2. AUTOMATION PROGRAM FOR RECOGNITION OF ALGORITHM SOLUTION OF MATHEMATIC TASK

    OpenAIRE

    Denis N. Butorin

    2014-01-01

    The article describes a technology for managing testing tasks in a computer program, developed for the recognition of the algorithm used to solve a mathematical task. The use of a hierarchical structure for a special set of testing questions is justified. The implementation of the described tasks in the computer program openSEE is also presented.

  3. The Compensatory Effectiveness of Optical Character Recognition/Speech Synthesis on Reading Comprehension of Postsecondary Students with Learning Disabilities.

    Science.gov (United States)

    Higgins, Eleanor L.; Raskind, Marshall H.

    1997-01-01

    Thirty-seven college students with learning disabilities were given a reading comprehension task under the following conditions: (1) using an optical character recognition/speech synthesis system; (2) having the text read aloud by a human reader; or (3) reading silently without assistance. Findings indicated that the greater the disability, the…

  4. Investigating an Application of Speech-to-Text Recognition: A Study on Visual Attention and Learning Behaviour

    Science.gov (United States)

    Huang, Y-M.; Liu, C-J.; Shadiev, Rustam; Shen, M-H.; Hwang, W-Y.

    2015-01-01

    One major drawback of previous research on speech-to-text recognition (STR) is that most findings showing the effectiveness of STR for learning were based upon subjective evidence. Very few studies have used eye-tracking techniques to investigate visual attention of students on STR-generated text. Furthermore, not much attention was paid to…

  5. Financial and workflow analysis of radiology reporting processes in the planning phase of implementation of a speech recognition system

    Science.gov (United States)

    Whang, Tom; Ratib, Osman M.; Umamoto, Kathleen; Grant, Edward G.; McCoy, Michael J.

    2002-05-01

    The goal of this study is to determine the financial value and workflow improvements achievable by replacing traditional transcription services with a speech recognition system in a large, university hospital setting. Workflow metrics were measured at two hospitals, one of which exclusively uses a transcription service (UCLA Medical Center), and the other which exclusively uses speech recognition (West Los Angeles VA Hospital). Workflow metrics include time spent per report (the sum of time spent interpreting, dictating, reviewing, and editing), transcription turnaround, and total report turnaround. Compared to traditional transcription, speech recognition resulted in radiologists spending 13-32% more time per report, but it also resulted in reduction of report turnaround time by 22-62% and reduction of marginal cost per report by 94%. The model developed here helps justify the introduction of a speech recognition system by showing that the benefits of reduced operating costs and decreased turnaround time outweigh the cost of increased time spent per report. Whether the ultimate goal is to achieve a financial objective or to improve operational efficiency, it is important to conduct a thorough analysis of workflow before implementation.

  6. Computer-Mediated Input, Output and Feedback in the Development of L2 Word Recognition from Speech

    Science.gov (United States)

    Matthews, Joshua; Cheng, Junyu; O'Toole, John Mitchell

    2015-01-01

    This paper reports on the impact of computer-mediated input, output and feedback on the development of second language (L2) word recognition from speech (WRS). A quasi-experimental pre-test/treatment/post-test research design was used involving three intact tertiary level English as a Second Language (ESL) classes. Classes were either assigned to…

  7. The classification problem in machine learning: an overview with study cases in emotion recognition and music-speech differentiation

    OpenAIRE

    Rodríguez Cadavid, Santiago

    2015-01-01

    This work addresses the well-known classification problem in machine learning. The goal of this study is to introduce the reader to the methodological aspects of feature extraction, feature selection and classifier performance through simple and understandable theoretical aspects and two study cases. Finally, a very good classification performance was obtained for emotion recognition from speech.
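
    The generic pipeline the overview discusses (feature extraction, feature selection, and classifier evaluation) can be sketched with standard tooling. The random arrays below are placeholders standing in for acoustic features and labels from an emotion-recognition or music/speech task; the particular selector and classifier are arbitrary choices, not the study's.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 40))       # 120 clips x 40 acoustic features (placeholder)
y = rng.integers(0, 2, size=120)     # two classes, e.g. music vs speech (placeholder)

pipe = make_pipeline(
    StandardScaler(),
    SelectKBest(f_classif, k=10),    # feature selection step
    SVC(kernel="rbf", C=1.0),        # classifier
)
scores = cross_val_score(pipe, X, y, cv=5)
print(f"cross-validated accuracy: {scores.mean():.2f}")
```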

  8. Data Collection in Zooarchaeology: Incorporating Touch-Screen, Speech-Recognition, Barcodes, and GIS

    Directory of Open Access Journals (Sweden)

    W. Flint Dibble

    2015-12-01

    Full Text Available When recording observations on specimens, zooarchaeologists typically use a pen and paper or a keyboard. However, the use of awkward terms and identification codes when recording thousands of specimens makes such data entry prone to human transcription errors. Improving the quantity and quality of the zooarchaeological data we collect can lead to more robust results and new research avenues. This paper presents design tools for building a customized zooarchaeological database that leverages accessible and affordable 21st century technologies. Scholars interested in investing time in designing a custom database in common software (here, Microsoft Access) can take advantage of the affordable touch-screen, speech-recognition, and geographic information system (GIS) technologies described here. The efficiency that these approaches offer a research project far exceeds the time commitment a scholar must invest to deploy them.

  9. Comparison of Image Transform-Based Features for Visual Speech Recognition in Clean and Corrupted Videos

    Directory of Open Access Journals (Sweden)

    Seymour Rowan

    2008-01-01

    Full Text Available We present results of a study into the performance of a variety of different image transform-based feature types for speaker-independent visual speech recognition of isolated digits. This includes the first reported use of features extracted using a discrete curvelet transform. The study will show a comparison of some methods for selecting features of each feature type and show the relative benefits of both static and dynamic visual features. The performance of the features will be tested on both clean video data and also video data corrupted in a variety of ways to assess each feature type's robustness to potential real-world conditions. One of the test conditions involves a novel form of video corruption we call jitter which simulates camera and/or head movement during recording.

  10. Comparison of Image Transform-Based Features for Visual Speech Recognition in Clean and Corrupted Videos

    Directory of Open Access Journals (Sweden)

    Ji Ming

    2008-03-01

    Full Text Available We present results of a study into the performance of a variety of different image transform-based feature types for speaker-independent visual speech recognition of isolated digits. This includes the first reported use of features extracted using a discrete curvelet transform. The study will show a comparison of some methods for selecting features of each feature type and show the relative benefits of both static and dynamic visual features. The performance of the features will be tested on both clean video data and also video data corrupted in a variety of ways to assess each feature type's robustness to potential real-world conditions. One of the test conditions involves a novel form of video corruption we call jitter which simulates camera and/or head movement during recording.
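
    One image-transform feature type of the kind compared in the two records above can be sketched as a 2-D DCT of the mouth region of interest, keeping a small block of low-order coefficients as static features and frame differences as dynamic features. The ROI size, number of retained coefficients, and random frames are illustrative; the curvelet-based features reported in the study are not reproduced here.

```python
import numpy as np
from scipy.fftpack import dct

def dct_features(roi, keep=6):
    """roi: 2-D grayscale mouth image; returns keep*keep low-order DCT coeffs."""
    coeffs = dct(dct(roi, axis=0, norm="ortho"), axis=1, norm="ortho")
    return coeffs[:keep, :keep].ravel()

def add_dynamic(static_seq):
    """Append first-order frame differences (delta features)."""
    static_seq = np.asarray(static_seq)
    deltas = np.diff(static_seq, axis=0, prepend=static_seq[:1])
    return np.hstack([static_seq, deltas])

# Example on random frames standing in for a cropped mouth ROI sequence.
rng = np.random.default_rng(0)
frames = rng.random((20, 32, 32))
features = add_dynamic([dct_features(f) for f in frames])
print(features.shape)   # (20, 72): 36 static + 36 delta coefficients per frame
```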

  11. A freely-available authoring system for browser-based CALL with speech recognition

    Directory of Open Access Journals (Sweden)

    Myles O'Brien

    2017-06-01

    Full Text Available A system for authoring browser-based CALL material incorporating Google speech recognition has been developed and made freely available for download. The system provides a teacher with a simple way to set up CALL material, including an optional image, sound or video, which will elicit spoken (and/or typed) answers from the user and check them against a list of specified permitted answers, giving feedback with hints when necessary. The teacher needs no HTML or Javascript expertise, just the facilities and ability to edit text files and upload to the Internet. The structure and functioning of the system are explained in detail, and some suggestions are given for practical use. Finally, some of its limitations are described.

  12. Speech recognition: impact on workflow and report availability; Spracherkennung: Auswirkung auf Workflow und Befundverfuegbarkeit

    Energy Technology Data Exchange (ETDEWEB)

    Glaser, C.; Trumm, C.; Nissen-Meyer, S.; Francke, M.; Kuettner, B.; Reiser, M. [Klinikum Grosshadern der Ludwig-Maximilians-Universitaet Muenchen (Germany). Institut fuer Klinische Radiologie]

    2005-08-01

    With ongoing technical refinements, speech recognition systems (SRS) are becoming an increasingly attractive alternative to traditional methods of preparing and transcribing medical reports. The two main components of any SRS are the acoustic model and the language model. Features of modern SRS with continuous speech recognition are macros with individually definable texts and report templates, as well as the option to navigate in a text or to control SRS or RIS functions by speech recognition. The best benefit from SRS is obtained if it is integrated into a RIS/RIS-PACS installation. Report availability and the time efficiency of the reporting process (related to recognition rate and the time spent editing and correcting a report) are the principal determinants of the clinical performance of any SRS. For practical purposes the recognition rate is estimated by the error rate (unit "word"). Error rates range from 4 to 28%. Roughly 20% of them are errors in the vocabulary, which may result in clinically relevant misinterpretation. It is thus mandatory to thoroughly correct any transcribed text as well as to continuously train and adapt the SRS vocabulary. The implementation of SRS dramatically improves report availability; this is most pronounced for CT and CR. However, the individual time expenditure for (SRS-based) reporting increased by 20-25% (CR), and according to literature data there is an increase of 30% for CT and MRI. The extent to which the transcription staff profits from SRS depends largely on its qualification. Online dictation implies a workload shift from the transcription staff to the reporting radiologist. (orig.)

  13. Extraction of prostatic lumina and automated recognition for prostatic calculus image using PCA-SVM.

    Science.gov (United States)

    Wang, Zhuocai; Xu, Xiangmin; Ding, Xiaojun; Xiao, Hui; Huang, Yusheng; Liu, Jian; Xing, Xiaofen; Wang, Hua; Liao, D Joshua

    2011-01-01

    Identification of prostatic calculi is an important basis for determining the tissue origin. Computation-assisted diagnosis of prostatic calculi may have promising potential but is currently still little studied. We studied the extraction of prostatic lumina and automated recognition of calculus images. Extraction of lumina from prostate histology images was based on local entropy and Otsu thresholding, and recognition used PCA-SVM based on the texture features of prostatic calculi. The SVM classifier showed an average classification time of 0.1432 seconds, an average training accuracy of 100%, an average test accuracy of 93.12%, a sensitivity of 87.74%, and a specificity of 94.82%. We concluded that the algorithm, based on texture features and PCA-SVM, can recognize the concentric structure and visual features easily. Therefore, this method is effective for the automated recognition of prostatic calculi.

  14. Extraction of Prostatic Lumina and Automated Recognition for Prostatic Calculus Image Using PCA-SVM

    Science.gov (United States)

    Wang, Zhuocai; Xu, Xiangmin; Ding, Xiaojun; Xiao, Hui; Huang, Yusheng; Liu, Jian; Xing, Xiaofen; Wang, Hua; Liao, D. Joshua

    2011-01-01

    Identification of prostatic calculi is an important basis for determining the tissue origin. Computation-assisted diagnosis of prostatic calculi may have promising potential but is currently still little studied. We studied the extraction of prostatic lumina and automated recognition of calculus images. Extraction of lumina from prostate histology images was based on local entropy and Otsu thresholding, and recognition used PCA-SVM based on the texture features of prostatic calculi. The SVM classifier showed an average classification time of 0.1432 seconds, an average training accuracy of 100%, an average test accuracy of 93.12%, a sensitivity of 87.74%, and a specificity of 94.82%. We concluded that the algorithm, based on texture features and PCA-SVM, can recognize the concentric structure and visual features easily. Therefore, this method is effective for the automated recognition of prostatic calculi. PMID:21461364
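
    The PCA-SVM recognition step described in the two records above can be sketched as a standard pipeline: texture features from candidate regions are standardized, projected onto principal components, and classified with an SVM. The feature values, labels, and component count below are placeholders, not the paper's data or settings.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 64))        # 64 texture features per region (placeholder)
y = rng.integers(0, 2, size=200)      # 1 = calculus, 0 = other tissue (placeholder)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
model = make_pipeline(StandardScaler(), PCA(n_components=10), SVC(kernel="rbf"))
model.fit(X_tr, y_tr)
print(f"test accuracy: {model.score(X_te, y_te):.2f}")
```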

  15. Robust Automatic Speech Recognition Features using Complex Wavelet Packet Transform Coefficients

    Directory of Open Access Journals (Sweden)

    TjongWan Sen

    2009-11-01

    Full Text Available To improve the performance of phoneme-based Automatic Speech Recognition (ASR) in noisy environments, we developed a new technique that adds robustness to clean phoneme features. These robust features are obtained from Complex Wavelet Packet Transform (CWPT) coefficients. Since the CWPT coefficients represent all the different frequency bands of the input signal, decomposing the input signal into a complete CWPT tree covers all frequencies involved in the recognition process. For time-overlapping signals with different frequency contents, e.g., a phoneme signal with noise, the CWPT coefficients are the combination of the CWPT coefficients of the phoneme signal and the CWPT coefficients of the noise. The CWPT coefficients of the phoneme signal change according to the frequency components contained in the noise. Since the number of phonemes in every language is relatively small (limited) and already well known, one can easily derive principal component vectors from a clean training dataset using Principal Component Analysis (PCA). These principal component vectors can then be used to add robustness and minimize noise effects in the testing phase. Simulation results, using Alpha Numeric 4 (AN4) from Carnegie Mellon University and NOISEX-92 examples from Rice University, showed that this new technique can be used as a feature extractor that improves the robustness of phoneme-based ASR systems in various adverse noisy conditions while preserving performance in clean environments.
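
    A hedged sketch of the general approach follows: decompose a signal frame into a full wavelet packet tree, take subband energies as features, and project them onto principal axes learned from clean training data. A real-valued wavelet packet (PyWavelets) is used here as a stand-in for the complex wavelet packet transform, and the frames, wavelet choice, and decomposition depth are illustrative assumptions.

```python
import numpy as np
import pywt
from sklearn.decomposition import PCA

def wp_energies(frame, wavelet="db4", level=4):
    """Subband energies from a full wavelet packet decomposition of one frame."""
    wp = pywt.WaveletPacket(data=frame, wavelet=wavelet, maxlevel=level)
    nodes = wp.get_level(level, order="freq")
    return np.array([np.sum(np.square(n.data)) for n in nodes])

rng = np.random.default_rng(0)
clean_frames = rng.normal(size=(100, 256))        # placeholder clean training frames
X_clean = np.array([wp_energies(f) for f in clean_frames])

pca = PCA(n_components=8).fit(X_clean)            # principal axes of clean features

noisy = rng.normal(size=256) + 0.5 * rng.normal(size=256)
robust_feat = pca.transform(wp_energies(noisy)[None, :])   # noise-suppressed features
print(robust_feat.shape)
```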

  16. Assessment of hearing aid algorithms using a master hearing aid: the influence of hearing aid experience on the relationship between speech recognition and cognitive capacity.

    Science.gov (United States)

    Rählmann, Sebastian; Meis, Markus; Schulte, Michael; Kießling, Jürgen; Walger, Martin; Meister, Hartmut

    2017-04-27

    Model-based hearing aid development considers the assessment of speech recognition using a master hearing aid (MHA). It is known that aided speech recognition in noise is related to cognitive factors such as working memory capacity (WMC). This relationship might be mediated by hearing aid experience (HAE). The aim of this study was to examine the relationship of WMC and speech recognition with an MHA for listeners with different HAE. Using the MHA, unaided and aided 80% speech recognition thresholds in noise were determined. Individual WMC was assessed using the Verbal Learning and Memory Test (VLMT) and the Reading Span Test (RST). Forty-nine hearing aid users with mild to moderate sensorineural hearing loss participated, divided into three groups differing in HAE. Whereas unaided speech recognition did not show a significant relationship with WMC, a significant correlation could be observed between WMC and aided speech recognition. However, this only applied to listeners with HAE of up to approximately three years, and a consistent weakening of the correlation could be observed with more experience. Speech recognition scores obtained in acute experiments with an MHA are less influenced by individual cognitive capacity when experienced HA users are taken into account.

  17. Particle swarm optimization based feature enhancement and feature selection for improved emotion recognition in speech and glottal signals.

    Science.gov (United States)

    Muthusamy, Hariharan; Polat, Kemal; Yaacob, Sazali

    2015-01-01

    In recent years, many research works have been published using speech-related features for speech emotion recognition; however, recent studies show that there is a strong correlation between emotional states and glottal features. In this work, Mel-frequency cepstral coefficients (MFCCs), linear predictive cepstral coefficients (LPCCs), perceptual linear predictive (PLP) features, gammatone filter outputs, timbral texture features, stationary wavelet transform based timbral texture features, and relative wavelet packet energy and entropy features were extracted from the emotional speech (ES) signals and their glottal waveforms (GW). Particle swarm optimization based clustering (PSOC) and wrapper-based particle swarm optimization (WPSO) were proposed to enhance the discerning ability of the features and to select the discriminating features, respectively. Three different emotional speech databases were used to evaluate the proposed method. An extreme learning machine (ELM) was employed to classify the different types of emotions. Different experiments were conducted, and the results show that the proposed method significantly improves speech emotion recognition performance compared to previous works published in the literature.
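
    The wrapper-based particle swarm optimization (WPSO) feature selection idea can be sketched with a small binary PSO in which each particle is a feature mask and fitness is the cross-validated accuracy of a classifier on the selected features. The swarm parameters, classifier, and random placeholder data below are assumptions for illustration; the paper's fitness definition and settings may differ.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(150, 30))      # placeholder emotion features
y = rng.integers(0, 3, size=150)    # placeholder emotion labels

def fitness(mask):
    """Wrapper fitness: cross-validated accuracy on the selected features."""
    if mask.sum() == 0:
        return 0.0
    return cross_val_score(SVC(), X[:, mask.astype(bool)], y, cv=3).mean()

n_particles, n_feats, n_iter = 10, X.shape[1], 15
pos = (rng.random((n_particles, n_feats)) < 0.5).astype(float)
vel = np.zeros_like(pos)
pbest, pbest_fit = pos.copy(), np.array([fitness(p) for p in pos])
gbest = pbest[pbest_fit.argmax()].copy()

for _ in range(n_iter):
    r1, r2 = rng.random(pos.shape), rng.random(pos.shape)
    vel = 0.7 * vel + 1.5 * r1 * (pbest - pos) + 1.5 * r2 * (gbest - pos)
    prob = 1.0 / (1.0 + np.exp(-vel))          # sigmoid transfer for binary PSO
    pos = (rng.random(pos.shape) < prob).astype(float)
    fits = np.array([fitness(p) for p in pos])
    better = fits > pbest_fit
    pbest[better], pbest_fit[better] = pos[better], fits[better]
    gbest = pbest[pbest_fit.argmax()].copy()

print("selected features:", np.flatnonzero(gbest))
```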

  18. Semantic and phonetic enhancements for speech-in-noise recognition by native and non-native listeners.

    Science.gov (United States)

    Bradlow, Ann R; Alexander, Jennifer A

    2007-04-01

    Previous research has shown that speech recognition differences between native and proficient non-native listeners emerge under suboptimal conditions. Current evidence has suggested that the key deficit that underlies this disproportionate effect of unfavorable listening conditions for non-native listeners is their less effective use of compensatory information at higher levels of processing to recover from information loss at the phoneme identification level. The present study investigated whether this non-native disadvantage could be overcome if enhancements at various levels of processing were presented in combination. Native and non-native listeners were presented with English sentences in which the final word varied in predictability and which were produced in either plain or clear speech. Results showed that, relative to the low-predictability-plain-speech baseline condition, non-native listener final word recognition improved only when both semantic and acoustic enhancements were available (high-predictability-clear-speech). In contrast, the native listeners benefited from each source of enhancement separately and in combination. These results suggest that native and non-native listeners apply similar strategies for speech-in-noise perception: The crucial difference is in the signal clarity required for contextual information to be effective, rather than in an inability of non-native listeners to take advantage of this contextual information per se.

  19. On the Use of Evolutionary Algorithms to Improve the Robustness of Continuous Speech Recognition Systems in Adverse Conditions

    Directory of Open Access Journals (Sweden)

    Sid-Ahmed Selouani

    2003-07-01

    Full Text Available Limiting the decrease in performance due to acoustic environment changes remains a major challenge for continuous speech recognition (CSR) systems. We propose a novel approach which combines the Karhunen-Loève transform (KLT) in the mel-frequency domain with a genetic algorithm (GA) to enhance the data representing corrupted speech. The idea consists of projecting noisy speech parameters onto the space generated by the genetically optimized principal axes issued from the KLT. The enhanced parameters increase the recognition rate for highly interfering noise environments. The proposed hybrid technique, when included in the front-end of an HTK-based CSR system, outperforms the conventional recognition process in severe interfering car noise environments for a wide range of signal-to-noise ratios (SNRs) varying from 16 dB to −4 dB. We also show the effectiveness of the KLT-GA method in recognizing speech subject to telephone channel degradations.

  20. Automated track recognition and event reconstruction in nuclear emulsion

    International Nuclear Information System (INIS)

    Deines-Jones, P.; Aranas, A.; Cherry, M.L.; Dugas, J.; Kudzia, D.; Nilsen, B.S.; Sengupta, K.; Waddington, C.J.; Wefel, J.P.; Wilczynska, B.; Wilczynski, H.; Wosiek, B.

    1997-01-01

    The major advantages of nuclear emulsion for detecting charged particles are its submicron position resolution and sensitivity to minimum ionizing particles. These must be balanced, however, against the difficult manual microscope measurement by skilled observers required for the analysis. We have developed an automated system to acquire and analyze the microscope images from emulsion chambers. Each emulsion plate is analyzed independently, allowing coincidence techniques to be used in order to reject background and estimate error rates. The system has been used to analyze a sample of high-multiplicity Pb-Pb interactions (charged particle multiplicities ≈ 1100) produced by the 158 GeV/c per nucleon 208Pb beam at CERN. Automatically measured events agree with our best manual measurements on 97% of all the tracks. We describe the image analysis and track reconstruction techniques, and discuss the measurement and reconstruction uncertainties. (orig.)

  1. Experiments and Pilot Study Evaluating the Performance of Reading Miscue Detector and Automated Reading Tutor for Filipino: A Children's Speech Technology for Improving Literacy

    Directory of Open Access Journals (Sweden)

    Ronald M. Pascual

    2017-06-01

    Full Text Available The latest advances in speech processing technology have allowed the development of automated reading tutors (ART) for improving children's literacy. An ART is a computer-assisted learning system based on oral reading fluency (ORF) instruction and automated speech recognition (ASR) technology. However, the design of an ART system is language-specific and thus requires developing a system specifically for the Filipino language. In a previous work, the authors presented the development of the children's Filipino speech corpus (CFSC) for the purpose of designing an ART in Filipino. In this paper, the authors present the evaluation of the ART in Filipino, which integrates a reference verification (RV)- and word-duration-analysis-based reading miscue detector (RMD), a user interface, and a feedback and instruction set. The authors also present the performance evaluation of the RMD in offline tests, and the effectiveness of the ART as shown by the results of the intervention program, a month-long pilot study that involved the use of the ART by a small group of students. Offline test results show that the RMD's performance (i.e., FA rate ≈ 3% and MDerr rate ≈ 5%) is on par with that of state-of-the-art RMDs reported in the literature. The results of the ART intervention experiment showed that the students, on average, improved their words correct per minute (WCPM) rate by 4.66 times, their ORF-16 scores by 6.0 times, and their reading comprehension exam scores by 4.4 times after using the ART.

  2. Lexical-Access Ability and Cognitive Predictors of Speech Recognition in Noise in Adult Cochlear Implant Users.

    Science.gov (United States)

    Kaandorp, Marre W; Smits, Cas; Merkus, Paul; Festen, Joost M; Goverts, S Theo

    2017-01-01

    Not all of the variance in speech-recognition performance of cochlear implant (CI) users can be explained by biographic and auditory factors. In normal-hearing listeners, linguistic and cognitive factors determine most of speech-in-noise performance. The current study explored specifically the influence of visually measured lexical-access ability compared with other cognitive factors on speech recognition of 24 postlingually deafened CI users. Speech-recognition performance was measured with monosyllables in quiet (consonant-vowel-consonant [CVC]), sentences-in-noise (SIN), and digit-triplets in noise (DIN). In addition to a composite variable of lexical-access ability (LA), measured with a lexical-decision test (LDT) and word-naming task, vocabulary size, working-memory capacity (Reading Span test [RSpan]), and a visual analogue of the SIN test (text reception threshold test) were measured. The DIN test was used to correct for auditory factors in SIN thresholds by taking the difference between SIN and DIN: SRTdiff. Correlation analyses revealed that duration of hearing loss (dHL) was related to SIN thresholds. Better working-memory capacity was related to SIN and SRTdiff scores. LDT reaction time was positively correlated with SRTdiff scores. No significant relationships were found for CVC or DIN scores with the predictor variables. Regression analyses showed that together with dHL, RSpan explained 55% of the variance in SIN thresholds. When controlling for auditory performance, LA, LDT, and RSpan separately explained, together with dHL, respectively 37%, 36%, and 46% of the variance in SRTdiff outcome. The results suggest that poor verbal working-memory capacity and to a lesser extent poor lexical-access ability limit speech-recognition ability in listeners with a CI.

  3. Capitalising on North American speech resources for the development of a South African English large vocabulary speech recognition system

    CSIR Research Space (South Africa)

    Kamper, H

    2014-11-01

    Full Text Available The NCHLT speech...

  4. Feature Extraction and Selection Strategies for Automated Target Recognition

    Science.gov (United States)

    Greene, W. Nicholas; Zhang, Yuhan; Lu, Thomas T.; Chao, Tien-Hsin

    2010-01-01

    Several feature extraction and selection methods for an existing automatic target recognition (ATR) system using JPL's Grayscale Optical Correlator (GOC) and Optimal Trade-Off Maximum Average Correlation Height (OT-MACH) filter were tested using MATLAB. The ATR system is composed of three stages: a cursory region-of-interest (ROI) search using the GOC and OT-MACH filter, a feature extraction and selection stage, and a final classification stage. Feature extraction and selection concerns transforming potential target data into more useful forms as well as selecting important subsets of that data which may aid in detection and classification. The strategies tested were built around two popular extraction methods: Principal Component Analysis (PCA) and Independent Component Analysis (ICA). Performance was measured based on the classification accuracy and free-response receiver operating characteristic (FROC) output of a support vector machine (SVM) and a neural net (NN) classifier.
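
    The pipeline described here, dimensionality reduction of ROI chips followed by SVM classification, can be sketched in a few lines. The snippet below uses scikit-learn with random placeholder data purely for illustration; the original work used MATLAB, JPL's GOC/OT-MACH front end, and real imagery, none of which is reproduced here.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split

# Hypothetical data: ROI image chips flattened into feature vectors,
# with binary labels (1 = target, 0 = clutter).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 32 * 32))
y = rng.integers(0, 2, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# PCA reduces each chip to its leading principal components,
# and the SVM classifies in that reduced space.
clf = make_pipeline(PCA(n_components=20), SVC(kernel="rbf"))
clf.fit(X_train, y_train)
print("accuracy:", clf.score(X_test, y_test))
```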

  5. Modern prescription theory and application: realistic expectations for speech recognition with hearing aids.

    Science.gov (United States)

    Johnson, Earl E

    2013-01-01

    A major decision at the time of hearing aid fitting and dispensing is the amount of amplification to provide listeners (both adult and pediatric populations) for the appropriate compensation of sensorineural hearing impairment across a range of frequencies (e.g., 160-10000 Hz) and input levels (e.g., 50-75 dB sound pressure level). This article describes modern prescription theory for hearing aids within the context of a risk versus return trade-off and efficient frontier analyses. The expected return of amplification recommendations (i.e., generic prescriptions such as National Acoustic Laboratories-Non-Linear 2, NAL-NL2, and Desired Sensation Level Multiple Input/Output, DSL m[i/o]) for the Speech Intelligibility Index (SII) and high-frequency audibility were traded against a potential risk (i.e., loudness). The modeled performance of each prescription was compared one with another and with the efficient frontier of normal hearing sensitivity (i.e., a reference point for the most return with the least risk). For the pediatric population, NAL-NL2 was more efficient for SII, while DSL m[i/o] was more efficient for high-frequency audibility. For the adult population, NAL-NL2 was more efficient for SII, while the two prescriptions were similar with regard to high-frequency audibility. In terms of absolute return (i.e., not considering the risk of loudness), however, DSL m[i/o] prescribed more outright high-frequency audibility than NAL-NL2 for either aged population, particularly, as hearing loss increased. Given the principles and demonstrated accuracy of desensitization (reduced utility of audibility with increasing hearing loss) observed at the group level, additional high-frequency audibility beyond that of NAL-NL2 is not expected to make further contributions to speech intelligibility (recognition) for the average listener.

  6. Development and Evaluation of a Speech Recognition Test for Persian Speaking Adults

    Directory of Open Access Journals (Sweden)

    Mohammad Mosleh

    2001-05-01

    Full Text Available Method and Materials: This research was carried out for the development and evaluation of phonemically balanced 25-word lists for Persian-speaking adults in two separate stages: development and evaluation. In the first stage, in order to balance the lists phonemically, the frequency of occurrence of each of the 29 phonemes (6 vowels and 23 consonants) of the Persian language in adult speech was determined. This analysis showed significant differences between the frequencies of some phonemes. Then, all Persian monosyllabic words were extracted from the Mo'in Persian dictionary. Semantically difficult words were excluded and appropriate words were chosen according to the judgment of 5 adult native speakers of Persian with a high school diploma. Twelve open-set 25-word lists were prepared. The lists were recorded on magnetic tape in an audio studio by a professional speaker of IRIB. In the second stage, in order to evaluate the test's validity and reliability, 60 normal-hearing adults (30 male, 30 female) were randomly selected and evaluated in test and retest sessions. Findings: 1. Normal-hearing adults obtained scores of 92-100 for each list at their MCL in test-retest. 2. No significant difference was observed (a) in test-retest scores for each list (P>0.05), (b) between the lists at test or retest (P>0.05), or (c) between sexes (P>0.05). Conclusion: The test is reliable and valid; the lists are phonemically balanced, equal in difficulty, and valuable for evaluating speech recognition in Persian-speaking adults.

  7. Automated recognition of forest patterns using aerial photographs

    Science.gov (United States)

    Barbezat, Vincent; Kreiss, Philippe; Sulzmann, Armin; Jacot, Jacques

    1996-12-01

    In Switzerland, aerial photos are indispensable tools for research into ecosystems and their management. Every six years since 1950, the whole of Switzerland has been systematically surveyed by aerial photos. In the forestry field, these documents not only provide invaluable information but also give support to field activities such as the drawing up of tree population maps, intervention planning, precise positioning of the upper forest limit, evaluation of forest damage and rates of tree growth. Up to now, the analysis of aerial photos has been carried out by specialists who painstakingly examine every photograph, which makes it a very long, exacting and expensive job. The IMT-DMT of the EPFL and Antenne romande of FNP, aware of the special interest involved and the necessity of automated classification of aerial photos, have pooled their resources to develop a software program capable of differentiating between single trees, copses and dense forests. The developed algorithms detect the crowns of the trees and the surface of the orthogonal projection. From the shadow of each tree, they calculate its height. They also determine the position of the tree in the Swiss national coordinate system thanks to the implementation of a numerical altitude model. For the future, many new and better uses of aerial photos will become available, particularly where isolated stands are concerned and also when evolutions based on a diachronic series of photos have to be assessed: from timberline monitoring in research on global change to the exploitation of wooded pastures on small surface areas.
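
    The abstract mentions deriving tree height from the shadow cast by each crown. A minimal sketch of that geometric idea, assuming flat terrain and a known sun elevation angle (which in practice would come from the photograph's date, time and location), is:

```python
import math

def tree_height_from_shadow(shadow_length_m, sun_elevation_deg):
    """Estimate tree height from its shadow length on (near-)flat terrain.

    With the sun at elevation angle e above the horizon, a tree of height h
    casts a shadow of length h / tan(e), so h = shadow_length * tan(e).
    Terrain slope and crown shape are ignored in this simplification.
    """
    return shadow_length_m * math.tan(math.radians(sun_elevation_deg))

# Example: a 12 m shadow with the sun 40 degrees above the horizon.
print(round(tree_height_from_shadow(12.0, 40.0), 1))  # ~10.1 m
```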

  8. An analysis of machine translation and speech synthesis in speech-to-speech translation system

    OpenAIRE

    Hashimoto, K.; Yamagishi, J.; Byrne, W.; King, S.; Tokuda, K.

    2011-01-01

    This paper provides an analysis of the impacts of machine translation and speech synthesis on speech-to-speech translation systems. The speech-to-speech translation system consists of three components: speech recognition, machine translation and speech synthesis. Many techniques for integration of speech recognition and machine translation have been proposed. However, speech synthesis has not yet been considered. Therefore, in this paper, we focus on machine translation and speech synthesis, ...

  9. Review of Design of Speech Recognition and Text Analytics based Digital Banking Customer Interface and Future Directions of Technology Adoption

    OpenAIRE

    Saha, Amal K

    2017-01-01

    Banking is one of the most significant adopters of cutting-edge information technologies. Since its modern era began with paper-based accounting maintained in the branch, the adoption of computerized systems has made it possible to centralize processing in data centres and improve the customer experience with a more available and efficient system. The latest twist in this evolution is the adoption of natural language processing and speech recognition in the user interface between the hum...

  10. The Usefulness of Automatic Speech Recognition (ASR) Eyespeak Software in Improving Iraqi EFL Students’ Pronunciation

    Directory of Open Access Journals (Sweden)

    Lina Fathi Sidig Sidgi

    2017-02-01

    Full Text Available The present study focuses on determining whether automatic speech recognition (ASR) technology is reliable for improving the English pronunciation of Iraqi EFL students. Non-native learners of English are generally concerned about improving their pronunciation skills, and Iraqi students face difficulties in pronouncing English sounds that are not found in their native language (Arabic). This study is concerned with ASR and its effectiveness in overcoming this difficulty. The data were obtained from twenty participants randomly selected from first-year college students in the Department of English at Al-Turath University College in Baghdad, Iraq. The students participated in a two-month pronunciation instruction course using the ASR Eyespeak software. At the end of the course, the students completed a questionnaire to obtain their opinions about the usefulness of ASR Eyespeak in improving their pronunciation. The findings of the study revealed that the students found the ASR Eyespeak software very useful in improving their pronunciation and helping them realise their pronunciation mistakes. They also reported that learning pronunciation with ASR Eyespeak was enjoyable.

  11. Psychometrically equivalent bisyllabic words for speech recognition threshold testing in Vietnamese.

    Science.gov (United States)

    Harris, Richard W; McPherson, David L; Hanson, Claire M; Eggett, Dennis L

    2017-08-01

    This study identified, digitally recorded, edited and evaluated 89 bisyllabic Vietnamese words with the goal of identifying homogeneous words that could be used to measure the speech recognition threshold (SRT) in native talkers of Vietnamese. Native male and female talker productions of 89 Vietnamese bisyllabic words were recorded, edited and then presented at intensities ranging from -10 to 20 dBHL. Logistic regression was used to identify the best words for measuring the SRT. Forty-eight words were selected and digitally edited to have 50% intelligibility at a level equal to the mean pure-tone average (PTA) for normally hearing participants (5.2 dBHL). Twenty normally hearing native Vietnamese participants listened to and repeated bisyllabic Vietnamese words at intensities ranging from -10 to 20 dBHL. A total of 48 male and female talker recordings of bisyllabic words with steep psychometric functions (>9.0%/dB) were chosen for the final bisyllabic SRT list. Only words homogeneous with respect to threshold audibility with steep psychometric function slopes were chosen for the final list. Digital recordings of bisyllabic Vietnamese words are now available for use in measuring the SRT for patients whose native language is Vietnamese.

  12. Effects of Familiarity and Feeding on Newborn Speech-Voice Recognition

    Science.gov (United States)

    Valiante, A. Grace; Barr, Ronald G.; Zelazo, Philip R.; Brant, Rollin; Young, Simon N.

    2013-01-01

    Newborn infants preferentially orient to familiar over unfamiliar speech sounds. They are also better at remembering unfamiliar speech sounds for short periods of time if learning and retention occur after a feed than before. It is unknown whether short-term memory for speech is enhanced when the sound is familiar (versus unfamiliar) and, if so,…

  13. Right-Ear Advantage for Speech-in-Noise Recognition in Patients with Nonlateralized Tinnitus and Normal Hearing Sensitivity.

    Science.gov (United States)

    Tai, Yihsin; Husain, Fatima T

    2018-04-01

    Despite having normal hearing sensitivity, patients with chronic tinnitus may experience more difficulty recognizing speech in adverse listening conditions as compared to controls. However, the association between the characteristics of tinnitus (severity and loudness) and speech recognition remains unclear. In this study, the Quick Speech-in-Noise test (QuickSIN) was conducted monaurally on 14 patients with bilateral tinnitus and 14 age- and hearing-matched adults to determine the relation between tinnitus characteristics and speech understanding. Further, Tinnitus Handicap Inventory (THI), tinnitus loudness magnitude estimation, and loudness matching were obtained to better characterize the perceptual and psychological aspects of tinnitus. The patients reported low THI scores, with most participants in the slight handicap category. Significant between-group differences in speech-in-noise performance were only found at the 5-dB signal-to-noise ratio (SNR) condition. The tinnitus group performed significantly worse in the left ear than in the right ear, even though bilateral tinnitus percept and symmetrical thresholds were reported in all patients. This between-ear difference is likely influenced by a right-ear advantage for speech sounds, as factors related to testing order and fatigue were ruled out. Additionally, significant correlations found between SNR loss in the left ear and tinnitus loudness matching suggest that perceptual factors related to tinnitus had an effect on speech-in-noise performance, pointing to a possible interaction between peripheral and cognitive factors in chronic tinnitus. Further studies, that take into account both hearing and cognitive abilities of patients, are needed to better parse out the effect of tinnitus in the absence of hearing impairment.

  14. Long-term temporal tracking of speech rate affects spoken-word recognition.

    Science.gov (United States)

    Baese-Berk, Melissa M; Heffner, Christopher C; Dilley, Laura C; Pitt, Mark A; Morrill, Tuuli H; McAuley, J Devin

    2014-08-01

    Humans unconsciously track a wide array of distributional characteristics in their sensory environment. Recent research in spoken-language processing has demonstrated that the speech rate surrounding a target region within an utterance influences which words, and how many words, listeners hear later in that utterance. On the basis of hypotheses that listeners track timing information in speech over long timescales, we investigated the possibility that the perception of words is sensitive to speech rate over such a timescale (e.g., an extended conversation). Results demonstrated that listeners tracked variation in the overall pace of speech over an extended duration (analogous to that of a conversation that listeners might have outside the lab) and that this global speech rate influenced which words listeners reported hearing. The effects of speech rate became stronger over time. Our findings are consistent with the hypothesis that neural entrainment by speech occurs on multiple timescales, some lasting more than an hour. © The Author(s) 2014.

  15. TreeRipper web application: towards a fully automated optical tree recognition software

    Directory of Open Access Journals (Sweden)

    Hughes Joseph

    2011-05-01

    Full Text Available Abstract Background Relationships between species, genes and genomes have been printed as trees for over a century. Whilst this may have been the best format for exchanging and sharing phylogenetic hypotheses during the 20th century, the worldwide web now provides faster and automated ways of transferring and sharing phylogenetic knowledge. However, novel software is needed to defrost these published phylogenies for the 21st century. Results TreeRipper is a simple website for the fully-automated recognition of multifurcating phylogenetic trees (http://linnaeus.zoology.gla.ac.uk/~jhughes/treeripper/). The program accepts a range of input image formats (PNG, JPG/JPEG or GIF). The underlying command-line C++ program follows a number of cleaning steps to detect lines, remove node labels, patch up broken lines and corners, and detect line edges. The edge contour is then determined to detect the branch length, tip label positions and the topology of the tree. Optical Character Recognition (OCR) is used to convert the tip labels into text with the freely available tesseract-ocr software. 32% of images meeting the prerequisites for TreeRipper were successfully recognised; the largest tree had 115 leaves. Conclusions Despite the diversity of ways phylogenies have been illustrated, which makes the design of fully automated tree recognition software difficult, TreeRipper is a step towards automating the digitization of past phylogenies. We also provide a dataset of 100 tree images and associated tree files for training and/or benchmarking future software. TreeRipper is an open source project licensed under the GNU General Public Licence v3.
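
    The final labelling step relies on the freely available tesseract-ocr engine. A minimal sketch of calling it from Python via pytesseract is shown below; TreeRipper itself is a C++ pipeline that first detects and removes lines before OCR, so passing the raw image, as done here, is a simplification, and the file name is hypothetical.

```python
# Requires: pip install pytesseract pillow, plus the tesseract-ocr binary.
import pytesseract
from PIL import Image

def read_tip_labels(image_path):
    """Run OCR on a phylogenetic-tree image and return the recognized text.

    TreeRipper strips out branch lines before OCR; here the raw image is
    passed directly to tesseract purely to illustrate the labelling step.
    """
    img = Image.open(image_path).convert("L")   # convert to greyscale
    return pytesseract.image_to_string(img)

# print(read_tip_labels("tree_figure.png"))  # hypothetical input file
```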

  16. Integration of multispectral face recognition and multi-PTZ camera automated surveillance for security applications

    Science.gov (United States)

    Chen, Chung-Hao; Yao, Yi; Chang, Hong; Koschan, Andreas; Abidi, Mongi

    2013-06-01

    Due to increasing security concerns, a complete security system should consist of two major components, a computer-based face-recognition system and a real-time automated video surveillance system. A computer-based face-recognition system can be used in gate access control for identity authentication. In recent studies, multispectral imaging and fusion of multispectral narrow-band images in the visible spectrum have been employed and proven to enhance the recognition performance over conventional broad-band images, especially when the illumination changes. Thus, we present an automated method that specifies the optimal spectral ranges under the given illumination. Experimental results verify the consistent performance of our algorithm via the observation that an identical set of spectral band images is selected under all tested conditions. Our discovery can be practically used for a new customized sensor design associated with given illuminations for an improved face recognition performance over conventional broad-band images. In addition, once a person is authorized to enter a restricted area, we still need to continuously monitor his/her activities for the sake of security. Because pan-tilt-zoom (PTZ) cameras are capable of covering a panoramic area and maintaining high resolution imagery for real-time behavior understanding, research in automated surveillance systems with multiple PTZ cameras has become increasingly important. Most existing algorithms require the prior knowledge of intrinsic parameters of the PTZ camera to infer the relative positioning and orientation among multiple PTZ cameras. To overcome this limitation, we propose a novel mapping algorithm that derives the relative positioning and orientation between two PTZ cameras based on a unified polynomial model. This reduces the dependence on the knowledge of intrinsic parameters of PTZ camera and relative positions. Experimental results demonstrate that our proposed algorithm presents substantially

  17. Speech recognition software and electronic psychiatric progress notes: physicians' ratings and preferences

    Directory of Open Access Journals (Sweden)

    Derman Yaron D

    2010-08-01

    Full Text Available Abstract Background The context of the current study was mandatory adoption of electronic clinical documentation within a large mental health care organization. Psychiatric electronic documentation has unique needs owing to the nature of its dense narrative content. Our goal was to determine if speech recognition (SR) would ease the creation of electronic progress note (ePN) documents by physicians at our institution. Methods Subjects: Twelve physicians had access to SR software on their computers for a period of four weeks to create ePN. Measurements: We examined SR software in relation to its perceived usability, data entry time savings, impact on the quality of care and quality of documentation, and the impact on clinical and administrative workflow, as compared to existing methods for data entry. Data analysis: A series of Wilcoxon signed rank tests were used to compare pre- and post-SR measures. A qualitative study design was used. Results Six of twelve participants completing the study favoured the use of SR (five with SR alone plus one with SR via hand-held digital recorder) for creating electronic progress notes over their existing mode of data entry. There was no clear perceived benefit from SR in terms of data entry time savings, quality of care, quality of documentation, or impact on clinical and administrative workflow. Conclusions Although our findings are mixed, SR may be a technology with some promise for mental health documentation. Future investigations of this nature should use more participants, a broader range of document types, and compare front- and back-end SR methods.

  18. Automated facial recognition of manually generated clay facial approximations: Potential application in unidentified persons data repositories.

    Science.gov (United States)

    Parks, Connie L; Monson, Keith L

    2018-01-01

    This research examined how accurately 2D images (i.e., photographs) of 3D clay facial approximations were matched to corresponding photographs of the approximated individuals using an objective automated facial recognition system. Irrespective of search filter (i.e., blind, sex, or ancestry) or rank class (R1, R10, R25, and R50) employed, few operationally informative results were observed. In only a single instance of 48 potential match opportunities was a clay approximation matched to a corresponding life photograph within the top 50 images (R50) of a candidate list, even with relatively small gallery sizes created from the application of search filters (e.g., sex or ancestry search restrictions). Increasing the candidate lists to include the top 100 images (R100) resulted in only two additional instances of correct match. Although other untested variables (e.g., approximation method, 2D photographic process, and practitioner skill level) may have impacted the observed results, this study suggests that 2D images of manually generated clay approximations are not readily matched to life photos by automated facial recognition systems. Further investigation is necessary in order to identify the underlying cause(s), if any, of the poor recognition results observed in this study (e.g., potential inferior facial feature detection and extraction). Additional inquiry exploring prospective remedial measures (e.g., stronger feature differentiation) is also warranted, particularly given the prominent use of clay approximations in unidentified persons casework. Copyright © 2017. Published by Elsevier B.V.

  19. The Army word recognition system

    Science.gov (United States)

    Hadden, David R.; Haratz, David

    1977-01-01

    The application of speech recognition technology in the Army command and control area is presented. The problems associated with this program are described as well as its relevance in terms of the man/machine interactions, voice inflexions, and the amount of training needed to interact with and utilize the automated system.

  20. Speech perception for adult cochlear implant recipients in a realistic background noise: effectiveness of preprocessing strategies and external options for improving speech recognition in noise.

    Science.gov (United States)

    Gifford, René H; Revit, Lawrence J

    2010-01-01

    Although cochlear implant patients are achieving increasingly higher levels of performance, speech perception in noise continues to be problematic. The newest generations of implant speech processors are equipped with preprocessing and/or external accessories that are purported to improve listening in noise. Most speech perception measures in the clinical setting, however, do not provide a close approximation to real-world listening environments. To assess speech perception for adult cochlear implant recipients in the presence of a realistic restaurant simulation generated by an eight-loudspeaker (R-SPACE) array in order to determine whether commercially available preprocessing strategies and/or external accessories yield improved sentence recognition in noise. Single-subject, repeated-measures design with two groups of participants: Advanced Bionics and Cochlear Corporation recipients. Thirty-four subjects, ranging in age from 18 to 90 yr (mean 54.5 yr), participated in this prospective study. Fourteen subjects were Advanced Bionics recipients, and 20 subjects were Cochlear Corporation recipients. Speech reception thresholds (SRTs) in semidiffuse restaurant noise originating from an eight-loudspeaker array were assessed with the subjects' preferred listening programs as well as with the addition of either Beam preprocessing (Cochlear Corporation) or the T-Mic accessory option (Advanced Bionics). In Experiment 1, adaptive SRTs with the Hearing in Noise Test sentences were obtained for all 34 subjects. For Cochlear Corporation recipients, SRTs were obtained with their preferred everyday listening program as well as with the addition of Focus preprocessing. For Advanced Bionics recipients, SRTs were obtained with the integrated behind-the-ear (BTE) mic as well as with the T-Mic. Statistical analysis using a repeated-measures analysis of variance (ANOVA) evaluated the effects of the preprocessing strategy or external accessory in reducing the SRT in noise. In addition

  1. Automatic recognition of spontaneous emotions in speech using acoustic and lexical features

    NARCIS (Netherlands)

    Raaijmakers, S.; Truong, K.P.

    2008-01-01

    We developed acoustic and lexical classifiers, based on a boosting algorithm, to assess the separability on arousal and valence dimensions in spontaneous emotional speech. The spontaneous emotional speech data was acquired by inviting subjects to play a first-person shooter video game. Our acoustic

  2. Acceptance of speech recognition by physicians: A survey of expectations, experiences, and social influence

    DEFF Research Database (Denmark)

    Alapetite, Alexandre; Andersen, Henning Boje; Hertzum, Morten

    2009-01-01

    The present study has surveyed physician views and attitudes before and after the introduction of speech technology as a front end to an electronic medical record. At the hospital where the survey was made, speech technology recently (2006–2007) replaced traditional dictation and subsequent secre...

  3. Predicting the effect of spectral subtraction on the speech recognition threshold based on the signal-to-noise ratio in the envelope domain

    DEFF Research Database (Denmark)

    Jørgensen, Søren; Dau, Torsten

    2011-01-01

    rarely been evaluated perceptually in terms of speech intelligibility. This study analyzed the effects of the spectral subtraction strategy proposed by Berouti et al. [ICASSP 4 (1979), 208-211] on the speech recognition threshold (SRT) obtained with sentences presented in stationary speech-shaped noise. The SRT was measured in five normal-hearing listeners in six conditions of spectral subtraction. The results showed an increase of the SRT after processing, i.e. a decreased speech intelligibility, in contrast to what is predicted by the Speech Transmission Index (STI). Here, another approach is proposed, denoted the speech-based envelope power spectrum model (sEPSM), which predicts the intelligibility based on the signal-to-noise ratio in the envelope domain. In contrast to the STI, the sEPSM is sensitive to the increased amount of noise envelope power as a consequence of the spectral subtraction...
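
    For readers unfamiliar with the processing being evaluated, the sketch below shows the general form of power spectral subtraction with over-subtraction and a spectral floor, in the spirit of Berouti et al.; the exact parameter values and the six processing conditions of the study are not reproduced, and alpha/beta here are illustrative assumptions.

```python
import numpy as np

def spectral_subtraction(noisy_mag, noise_mag, alpha=2.0, beta=0.01):
    """Power spectral subtraction with over-subtraction and a noise floor.

    noisy_mag, noise_mag: magnitude spectra (n_frames, n_bins) of the noisy
    speech and of a noise estimate (e.g., taken from speech pauses).
    alpha: over-subtraction factor; beta: spectral floor factor.
    """
    clean_power = noisy_mag**2 - alpha * noise_mag**2
    floor = beta * noise_mag**2
    clean_power = np.maximum(clean_power, floor)   # avoid negative power
    return np.sqrt(clean_power)                    # enhanced magnitude spectrum
```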

  4. An investigation and comparison of speech recognition software for determining if bird song recordings contain legible human voices

    Directory of Open Access Journals (Sweden)

    Tim D. Hunt

    Full Text Available The purpose of this work was to test the effectiveness of using readily available speech recognition API services to determine if recordings of bird song had inadvertently recorded human voices. A mobile phone was used to record a human speaking at increasing distances from the phone in an outside setting with bird song occurring in the background. One of the services was trained with sample recordings and each service was compared on its ability to return recognized words. The services from Google and IBM performed similarly and the Microsoft service, which allowed training, performed slightly better. However, all three services failed to perform at a level that would enable recordings with recognizable human speech to be deleted in order to maintain full privacy protection.
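
    A minimal sketch of this kind of privacy screen, using the open-source SpeechRecognition Python package and its free Google Web Speech endpoint, is given below; the choice of package, the endpoint, and the file path are assumptions for illustration, and the study itself compared Google, IBM, and Microsoft cloud services.

```python
# Requires: pip install SpeechRecognition
import speech_recognition as sr

def contains_recognizable_speech(wav_path):
    """Return any words a cloud ASR service recognizes in the recording.

    A non-empty result suggests the bird-song recording may contain a
    legible human voice and could be withheld for privacy reasons.
    """
    recognizer = sr.Recognizer()
    with sr.AudioFile(wav_path) as source:
        audio = recognizer.record(source)
    try:
        return recognizer.recognize_google(audio)  # free web endpoint
    except sr.UnknownValueError:
        return ""   # nothing intelligible was found
```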

  5. Text recognition and correction for automated data collection by mobile devices

    Science.gov (United States)

    Ozarslan, Suleyman; Eren, P. Erhan

    2014-03-01

    Participatory sensing is an approach which allows mobile devices such as mobile phones to be used for data collection, analysis and sharing processes by individuals. Data collection is the first and most important part of a participatory sensing system, but it is time consuming for the participants. In this paper, we discuss automatic data collection approaches for reducing the time required for collection, and increasing the amount of collected data. In this context, we explore automated text recognition on images of store receipts which are captured by mobile phone cameras, and the correction of the recognized text. Accordingly, our first goal is to evaluate the performance of the Optical Character Recognition (OCR) method with respect to data collection from store receipt images. Images captured by mobile phones exhibit some typical problems, and common image processing methods cannot handle some of them. Consequently, the second goal is to address these types of problems through our proposed Knowledge Based Correction (KBC) method used in support of the OCR, and also to evaluate the KBC method with respect to the improvement on the accurate recognition rate. Results of the experiments show that the KBC method improves the accurate data recognition rate noticeably.
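
    A toy illustration of the correction idea, matching noisy OCR tokens against a domain lexicon, is given below; the lexicon, the matching rule, and the use of difflib are simplifying assumptions and do not reproduce the paper's KBC method.

```python
import difflib

# A small, hypothetical lexicon of words expected on store receipts.
RECEIPT_LEXICON = ["total", "subtotal", "cash", "change", "tax", "item"]

def knowledge_based_correction(ocr_tokens, lexicon=RECEIPT_LEXICON, cutoff=0.7):
    """Replace each OCR token with its closest lexicon entry, if any.

    Tokens without a sufficiently close lexicon match are left unchanged,
    so numbers and unknown words pass through untouched.
    """
    corrected = []
    for tok in ocr_tokens:
        match = difflib.get_close_matches(tok.lower(), lexicon, n=1, cutoff=cutoff)
        corrected.append(match[0] if match else tok)
    return corrected

print(knowledge_based_correction(["T0tal", "chanqe", "1.99"]))
# ['total', 'change', '1.99']
```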

  6. OLIVE: Speech-Based Video Retrieval

    NARCIS (Netherlands)

    de Jong, Franciska M.G.; Gauvain, Jean-Luc; den Hartog, Jurgen; den Hartog, Jeremy; Netter, Klaus

    1999-01-01

    This paper describes the Olive project which aims to support automated indexing of video material by use of human language technologies. Olive is making use of speech recognition to automatically derive transcriptions of the sound tracks, generating time-coded linguistic elements which serve as the

  7. Auditory and Non-Auditory Contributions for Unaided Speech Recognition in Noise as a Function of Hearing Aid Use.

    Science.gov (United States)

    Gieseler, Anja; Tahden, Maike A S; Thiel, Christiane M; Wagener, Kirsten C; Meis, Markus; Colonius, Hans

    2017-01-01

    Differences in understanding speech in noise among hearing-impaired individuals cannot be explained entirely by hearing thresholds alone, suggesting the contribution of other factors beyond standard auditory ones as derived from the audiogram. This paper reports two analyses addressing individual differences in the explanation of unaided speech-in-noise performance among n = 438 elderly hearing-impaired listeners (mean = 71.1 ± 5.8 years). The main analysis was designed to identify clinically relevant auditory and non-auditory measures for speech-in-noise prediction using auditory (audiogram, categorical loudness scaling) and cognitive tests (verbal-intelligence test, screening test of dementia), as well as questionnaires assessing various self-reported measures (health status, socio-economic status, and subjective hearing problems). Using stepwise linear regression analysis, 62% of the variance in unaided speech-in-noise performance was explained, with measures Pure-tone average (PTA), Age, and Verbal intelligence emerging as the three most important predictors. In the complementary analysis, those individuals with the same hearing loss profile were separated into hearing aid users (HAU) and non-users (NU), and were then compared regarding potential differences in the test measures and in explaining unaided speech-in-noise recognition. The groupwise comparisons revealed significant differences in auditory measures and self-reported subjective hearing problems, while no differences in the cognitive domain were found. Furthermore, groupwise regression analyses revealed that Verbal intelligence had a predictive value in both groups, whereas Age and PTA only emerged significant in the group of hearing aid NU.

  8. Improving speech-in-noise recognition for children with hearing loss: potential effects of language abilities, binaural summation, and head shadow.

    Science.gov (United States)

    Nittrouer, Susan; Caldwell-Tarr, Amanda; Tarr, Eric; Lowenstein, Joanna H; Rice, Caitlin; Moberly, Aaron C

    2013-08-01

    This study examined speech recognition in noise for children with hearing loss, compared it to recognition for children with normal hearing, and examined mechanisms that might explain variance in children's abilities to recognize speech in noise. Word recognition was measured in two levels of noise, both when the speech and noise were co-located in front and when the noise came separately from one side. Four mechanisms were examined as factors possibly explaining variance: vocabulary knowledge, sensitivity to phonological structure, binaural summation, and head shadow. Participants were 113 eight-year-old children. Forty-eight had normal hearing (NH) and 65 had hearing loss: 18 with hearing aids (HAs), 19 with one cochlear implant (CI), and 28 with two CIs. Phonological sensitivity explained a significant amount of between-groups variance in speech-in-noise recognition. Little evidence of binaural summation was found. Head shadow was similar in magnitude for children with NH and with CIs, regardless of whether they wore one or two CIs. Children with HAs showed reduced head shadow effects. These outcomes suggest that in order to improve speech-in-noise recognition for children with hearing loss, intervention needs to be comprehensive, focusing on both language abilities and auditory mechanisms.

  9. The accuracy of radiology speech recognition reports in a multilingual South African teaching hospital

    International Nuclear Information System (INIS)

    Toit, Jacqueline du; Hattingh, Retha; Pitcher, Richard

    2015-01-01

    Speech recognition (SR) technology, the process whereby spoken words are converted to digital text, has been used in radiology reporting since 1981. It was initially anticipated that SR would dominate radiology reporting, with claims of up to 99% accuracy, reduced turnaround times and significant cost savings. However, expectations have not yet been realised. The limited data available suggest SR reports have significantly higher levels of inaccuracy than traditional dictation transcription (DT) reports, as well as incurring greater aggregate costs. There has been little work on the clinical significance of such errors, however, and little is known of the impact of reporter seniority on the generation of errors, or the influence of system familiarity on reducing error rates. Furthermore, there have been conflicting findings on the accuracy of SR amongst users with English as first- and second-language respectively. The aim of the study was to compare the accuracy of SR and DT reports in a resource-limited setting. The first 300 SR and the first 300 DT reports generated during March 2010 were retrieved from the hospital’s PACS, and reviewed by a single observer. Text errors were identified, and then classified as either clinically significant or insignificant based on their potential impact on patient management. In addition, a follow-up analysis was conducted exactly 4 years later. Of the original 300 SR reports analysed, 25.6% contained errors, with 9.6% being clinically significant. Only 9.3% of the DT reports contained errors, 2.3% having potential clinical impact. Both the overall difference in SR and DT error rates, and the difference in ‘clinically significant’ error rates (9.6% vs. 2.3%) were statistically significant. In the follow-up study, the overall SR error rate was strikingly similar at 24.3%, 6% being clinically significant. Radiologists with second-language English were more likely to generate reports containing errors, but level of seniority

  10. Amplitude Modulation Detection and Speech Recognition in Late-Implanted Prelingually and Postlingually Deafened Cochlear Implant Users.

    Science.gov (United States)

    De Ruiter, Anke M; Debruyne, Joke A; Chenault, Michelene N; Francart, Tom; Brokx, Jan P L

    2015-01-01

    Many late-implanted prelingually deafened cochlear implant (CI) patients struggle to obtain open-set speech understanding. Because it is known that low-frequency temporal-envelope information contains important cues for speech understanding, the goal of this study was to compare the temporal-envelope processing abilities of late-implanted prelingually and postlingually deafened CI users. Furthermore, the possible relation between temporal processing abilities and speech recognition performances was investigated. Amplitude modulation detection thresholds were obtained in eight prelingually and 18 postlingually deafened CI users, by means of a sinusoidally modulated broadband noise carrier, presented through a loudspeaker to the CI user's clinical device. Thresholds were determined with a two-down-one-up three-interval oddity adaptive procedure, at seven modulation frequencies. Phoneme recognition (consonant-nucleus-consonant [CNC]) scores (percentage correct at 65 dB SPL) were gathered for all CI users. For the prelingually deafened group, scores on two additional speech tests were obtained: (1) a closed-set monosyllable-trochee-spondee test (percentage correct scores at 65 dB SPL on word recognition and categorization of the suprasegmental word patterns), and (2) a speech tracking test (number of correctly repeated words per minute) with texts specifically designed for this population. The prelingually deafened CI users had a significantly lower sensitivity to amplitude modulations than the postlingually deafened CI users, and the attenuation rate of their temporal modulation transfer function (TMTF) was greater. None of the prelingually deafened CI users were able to detect modulations at 150 and 200 Hz. High and significant correlations were found between the results on the amplitude modulation detection test and CNC phoneme scores, for the entire group of CI users. In the prelingually deafened group, CNC phoneme scores, word scores on the monosyllable
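
    The thresholds were tracked with a two-down-one-up adaptive rule. As background, a bare-bones version of such a staircase is sketched below; it ignores the three-interval oddity presentation and averages the reversal levels, which are assumptions about common practice rather than this study's exact procedure.

```python
def two_down_one_up(run_trial, start_level, step, n_reversals=8, max_trials=200):
    """Bare-bones 2-down/1-up adaptive staircase (targets ~70.7% correct).

    run_trial(level) must return True for a correct response at the given
    level (e.g., modulation depth in dB re 100%); lower levels are harder.
    The threshold is estimated as the mean of the reversal levels.
    """
    level, n_correct, direction, reversals = start_level, 0, None, []
    for _ in range(max_trials):
        if len(reversals) >= n_reversals:
            break
        if run_trial(level):
            n_correct += 1
            if n_correct == 2:                 # two correct in a row -> harder
                n_correct = 0
                if direction == "up":
                    reversals.append(level)
                direction = "down"
                level -= step
        else:                                  # one wrong response -> easier
            n_correct = 0
            if direction == "down":
                reversals.append(level)
            direction = "up"
            level += step
    return sum(reversals) / len(reversals) if reversals else level
```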

  11. Impact of a PACS/RIS-integrated speech recognition system on radiology reporting time and report availability

    International Nuclear Information System (INIS)

    Trumm, C.G.; Glaser, C.; Paasche, V.; Kuettner, B.; Francke, M.; Nissen-Meyer, S.; Reiser, M.; Crispin, A.; Popp, P.

    2006-01-01

    Purpose: Quantification of the impact of a PACS/RIS-integrated speech recognition system (SRS) on the time expenditure for radiology reporting and on hospital-wide report availability (RA) in a university institution. Material and Methods: In a prospective pilot study, the following parameters were assessed for 669 radiographic examinations (CR): 1. time requirement per report dictation (TED: dictation time (s)/number of images [examination] x number of words [report]) with either a combination of PACS/tape-based dictation (TD: analog dictation device/minicassette/transcription) or PACS/RIS/speech recognition system (RR: remote recognition/transcription and OR: online recognition/self-correction by radiologist), respectively, and 2. the Report Turnaround Time (RTT) as the time interval from the entry of the first image into the PACS to the available RIS/HIS report. Two equal time periods were chosen retrospectively from the RIS database: 11/2002-2/2003 (only TD) and 11/2003-2/2004 (only RR or OR with speech recognition system [SRS]). The midterm (≥24 h, 24 h intervals) and short-term (< 24 h, 1 h intervals) RA after examination completion were calculated for all modalities and for CR, CT, MR and XA/DS separately. The relative increase in the mid-term RA (RIMRA: related to total number of examinations in each time period) and increase in the short-term RA (ISRA: ratio of available reports during the 1st to 24th hour) were calculated. Results: Prospectively, there was a significant difference between TD/RR/OR (n=151/257/261) regarding mean TED (0.44/0.54/0.62 s [per word and image]) and mean RTT (10.47/6.65/1.27 h), respectively. Retrospectively, 37 898/39 680 reports were computed from the RIS database for the time periods of 11/2002-2/2003 and 11/2003-2/2004. For CR/CT there was a shift of the short-term RA to the first 6 hours after examination completion (mean cumulative RA 20% higher) with a more than three-fold increase in the total number of available

  12. Metaphor and framing in political speeches: Effects of conceptual metaphor on recognition and recall

    NARCIS (Netherlands)

    Lagerwerf, L.; Yu, L.; Baicchi, Annalisa; Pinelli, Erica

    2017-01-01

    Cognitive linguists suggest that metaphorical framing has strong cognitive effects. However, experimental research only showed small or contradictory effects. In this chapter, an experiment is reported in which metaphor and framing were manipulated independently. Audible political speeches were

  13. Vision-based obstacle recognition system for automated lawn mower robot development

    Science.gov (United States)

    Mohd Zin, Zalhan; Ibrahim, Ratnawati

    2011-06-01

    Digital image processing techniques (DIP) have been widely used in various types of applications recently. Classification and recognition of a specific object using a vision system involve some challenging tasks in the field of image processing and artificial intelligence. The ability and efficiency of a vision system to capture and process images is very important for any intelligent system such as an autonomous robot. This paper gives attention to the development of a vision system that could contribute to the development of an automated vision-based lawn mower robot. The work involves the implementation of DIP techniques to detect and recognize three different types of obstacles that usually exist on a football field. The focus was given to the study of different types and sizes of obstacles, the development of a vision-based obstacle recognition system, and the evaluation of the system's performance. Image processing techniques such as image filtering, segmentation, enhancement and edge detection have been applied in the system. The results have shown that the developed system is able to detect and recognize various types of obstacles on a football field with a recognition rate of more than 80%.

  14. Robust Speech Processing & Recognition: Speaker ID, Language ID, Speech Recognition/Keyword Spotting, Diarization/Co-Channel/Environmental Characterization, Speaker State Assessment

    Science.gov (United States)

    2015-10-01

  15. Computer-based auditory phoneme discrimination training improves speech recognition in noise in experienced adult cochlear implant listeners.

    Science.gov (United States)

    Schumann, Annette; Serman, Maja; Gefeller, Olaf; Hoppe, Ulrich

    2015-03-01

    Specific computer-based auditory training may be a useful complement to the rehabilitation process for cochlear implant (CI) listeners to achieve sufficient speech intelligibility. This study evaluated the effectiveness of a computerized, phoneme-discrimination training programme. The study employed a pretest-post-test design; participants were randomly assigned to the training or control group. Over a period of three weeks, the training group was instructed to train in phoneme discrimination via computer, twice a week. Sentence recognition in different noise conditions (moderate to difficult) was tested pre- and post-training, and six months after the training was completed. The control group was tested and retested within one month. Twenty-seven adult CI listeners who had been using cochlear implants for more than two years participated in the programme; 15 adults in the training group, 12 adults in the control group. Besides significant improvements for the trained phoneme-identification task, a generalized training effect was noted via significantly improved sentence recognition in moderate noise. No significant changes were noted in the difficult noise conditions. Improved performance was maintained over an extended period. Phoneme-discrimination training improves experienced CI listeners' speech perception in noise. Additional research is needed to optimize auditory training for individual benefit.

  16. Effects of Active and Passive Hearing Protection Devices on Sound Source Localization, Speech Recognition, and Tone Detection.

    Directory of Open Access Journals (Sweden)

    Andrew D Brown

    Full Text Available Hearing protection devices (HPDs) such as earplugs offer to mitigate noise exposure and reduce the incidence of hearing loss among persons frequently exposed to intense sound. However, distortions of spatial acoustic information and reduced audibility of low-intensity sounds caused by many existing HPDs can make their use untenable in high-risk (e.g., military or law enforcement) environments where auditory situational awareness is imperative. Here we assessed (1) sound source localization accuracy using a head-turning paradigm, (2) speech-in-noise recognition using a modified version of the QuickSIN test, and (3) tone detection thresholds using a two-alternative forced-choice task. Subjects were 10 young normal-hearing males. Four different HPDs were tested (two active, two passive), including two new and previously untested devices. Relative to unoccluded (control) performance, all tested HPDs significantly degraded performance across tasks, although one active HPD slightly improved high-frequency tone detection thresholds and did not degrade speech recognition. Behavioral data were examined with respect to head-related transfer functions measured using a binaural manikin with and without tested HPDs in place. Data reinforce previous reports that HPDs significantly compromise a variety of auditory perceptual facilities, particularly sound localization due to distortions of high-frequency spectral cues that are important for the avoidance of front-back confusions.

  17. Differences in Speech Recognition Between Children with Attention Deficits and Typically Developed Children Disappear When Exposed to 65 dB of Auditory Noise.

    Science.gov (United States)

    Söderlund, Göran B W; Jobs, Elisabeth Nilsson

    2016-01-01

    The most common neuropsychiatric condition in children is attention deficit hyperactivity disorder (ADHD), affecting ∼6-9% of the population. ADHD is distinguished by inattention and hyperactive, impulsive behaviors as well as poor performance in various cognitive tasks often leading to failures at school. Sensory and perceptual dysfunctions have also been noticed. Prior research has mainly focused on limitations in executive functioning where differences are often explained by deficits in pre-frontal cortex activation. Less notice has been given to sensory perception and subcortical functioning in ADHD. Recent research has shown that children with ADHD diagnosis have a deviant auditory brain stem response compared to healthy controls. The aim of the present study was to investigate if the speech recognition threshold differs between attentive children and children with ADHD symptoms in two environmental sound conditions, with and without external noise. Previous research has namely shown that children with attention deficits can benefit from white noise exposure during cognitive tasks and here we investigate if noise benefit is present during an auditory perceptual task. For this purpose we used a modified Hagerman's speech recognition test where children with and without attention deficits performed a binaural speech recognition task to assess the speech recognition threshold in no noise and noise conditions (65 dB). Results showed that the inattentive group displayed a higher speech recognition threshold than typically developed children and that the difference in speech recognition threshold disappeared when exposed to noise at supra threshold level. From this we conclude that inattention can partly be explained by sensory perceptual limitations that can possibly be ameliorated through noise exposure.

  18. Differences in Speech Recognition Between Children with Attention Deficits and Typically Developed Children Disappear when Exposed to 65 dB of Auditory Noise

    Directory of Open Access Journals (Sweden)

    Göran B W Söderlund

    2016-01-01

    Full Text Available The most common neuropsychiatric condition in children is attention deficit hyperactivity disorder (ADHD), affecting approximately 6-9% of the population. ADHD is distinguished by inattention and hyperactive, impulsive behaviors as well as poor performance in various cognitive tasks often leading to failures at school. Sensory and perceptual dysfunctions have also been noticed. Prior research has mainly focused on limitations in executive functioning where differences are often explained by deficits in pre-frontal cortex activation. Less notice has been given to sensory perception and subcortical functioning in ADHD. Recent research has shown that children with ADHD diagnosis have a deviant auditory brain stem response compared to healthy controls. The aim of the present study was to investigate if the speech recognition threshold differs between attentive children and children with ADHD symptoms in two environmental sound conditions, with and without external noise. Previous research has namely shown that children with attention deficits can benefit from white noise exposure during cognitive tasks and here we investigate if noise benefit is present during an auditory perceptual task. For this purpose we used a modified Hagerman’s speech recognition test where children with and without attention deficits performed a binaural speech recognition task to assess the speech recognition threshold in no noise and noise conditions (65 dB). Results showed that the inattentive group displayed a higher speech recognition threshold than typically developed children (TDC) and that the difference in speech recognition threshold disappeared when exposed to noise at supra threshold level. From this we conclude that inattention can partly be explained by sensory perceptual limitations that can possibly be ameliorated through noise exposure.

  19. Low-Complexity Variable Frame Rate Analysis for Speech Recognition and Voice Activity Detection

    DEFF Research Database (Denmark)

    Tan, Zheng-Hua; Lindberg, Børge

    2010-01-01

    Frame based speech processing inherently assumes a stationary behavior of speech signals in a short period of time. Over a long time, the characteristics of the signals can change significantly and frames are not equally important, underscoring the need for frame selection. In this paper, we present a low-complexity and effective frame selection approach based on a posteriori signal-to-noise ratio (SNR) weighted energy distance: The use of an energy distance, instead of e.g. a standard cepstral distance, makes the approach computationally efficient and enables fine granularity search, and the use of a posteriori SNR weighting emphasizes the reliable regions in noisy speech signals. It is experimentally found that the approach is able to assign a higher frame rate to fast changing events such as consonants, a lower frame rate to steady regions like vowels and no frames to silence, even...
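
    A toy version of the selection rule, keeping a frame only when its a posteriori SNR-weighted energy distance to the last selected frame exceeds a threshold, is sketched below; the exact distance measure, weighting and threshold used in the paper differ in detail, so the formula and numbers here are illustrative assumptions.

```python
import numpy as np

def select_frames(frame_energies, noise_energy, threshold):
    """Frame selection by an a-posteriori-SNR-weighted energy distance.

    A frame is kept when the energy distance to the last *selected* frame,
    weighted by its a posteriori SNR, exceeds the threshold, giving a high
    frame rate in fast-changing regions and a low one in steady or silent
    regions.
    """
    kept = [0]                                   # always keep the first frame
    for t in range(1, len(frame_energies)):
        snr = frame_energies[t] / max(noise_energy, 1e-10)   # a posteriori SNR
        dist = abs(frame_energies[t] - frame_energies[kept[-1]])
        if snr * dist > threshold:
            kept.append(t)
    return kept

energies = np.array([1.0, 1.1, 5.0, 5.1, 0.2, 6.0])
print(select_frames(energies, noise_energy=0.5, threshold=5.0))  # [0, 2, 5]
```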

  20. Towards Real-Time Speech Emotion Recognition for Affective E-Learning

    Science.gov (United States)

    Bahreini, Kiavash; Nadolski, Rob; Westera, Wim

    2016-01-01

    This paper presents the voice emotion recognition part of the FILTWAM framework for real-time emotion recognition in affective e-learning settings. FILTWAM (Framework for Improving Learning Through Webcams And Microphones) intends to offer timely and appropriate online feedback based upon learner's vocal intonations and facial expressions in order…

  1. An Intelligent Systems Approach to Automated Object Recognition: A Preliminary Study

    Science.gov (United States)

    Maddox, Brian G.; Swadley, Casey L.

    2002-01-01

    Attempts at fully automated object recognition systems have met with varying levels of success over the years. However, none of the systems have achieved high enough accuracy rates to be run unattended. One of the reasons for this may be that they are designed from the computer's point of view and rely mainly on image-processing methods. A better solution to this problem may be to make use of modern advances in computational intelligence and distributed processing to try to mimic how the human brain is thought to recognize objects. As humans combine cognitive processes with detection techniques, such a system would combine traditional image-processing techniques with computer-based intelligence to determine the identity of various objects in a scene.

  2. Improving Mobile Phone Speech Recognition by Personalized Amplification: Application in People with Normal Hearing and Mild-to-Moderate Hearing Loss.

    Science.gov (United States)

    Kam, Anna Chi Shan; Sung, John Ka Keung; Lee, Tan; Wong, Terence Ka Cheong; van Hasselt, Andrew

    In this study, the authors evaluated the effect of personalized amplification on mobile phone speech recognition in people with and without hearing loss. This prospective study used double-blind, within-subjects, repeated measures, controlled trials to evaluate the effectiveness of applying personalized amplification based on the hearing level captured on the mobile device. The personalized amplification settings were created using modified one-third gain targets. The participants in this study included 100 adults of age between 20 and 78 years (60 with age-adjusted normal hearing and 40 with hearing loss). The performance of the participants with personalized amplification and standard settings was compared using both subjective and speech-perception measures. Speech recognition was measured in quiet and in noise using Cantonese disyllabic words. Subjective ratings on the quality, clarity, and comfortableness of the mobile signals were measured with an 11-point visual analog scale. Subjective preferences of the settings were also obtained by a paired-comparison procedure. The personalized amplification application provided better speech recognition via the mobile phone both in quiet and in noise for people with hearing impairment (improved 8 to 10%) and people with normal hearing (improved 1 to 4%). The improvement in speech recognition was significantly better for people with hearing impairment. When the average device output level was matched, more participants preferred to have the individualized gain than not to have it. The personalized amplification application has the potential to improve speech recognition for people with mild-to-moderate hearing loss, as well as people with normal hearing, in particular when listening in noisy environments.
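
    The record does not give the exact form of the "modified one-third gain targets", so the sketch below applies the plain one-third gain rule (insertion gain roughly one third of the hearing level per band) as a stand-in; the audiogram values and band interpolation are illustrative assumptions only.

```python
import numpy as np

# Plain one-third gain rule: insertion gain (dB) ~= hearing level (dB HL) / 3.
# The study used *modified* targets; the modification is not reproduced here.
AUDIOGRAM = {250: 30, 500: 35, 1000: 40, 2000: 50, 4000: 60}   # example dB HL values

def gain_targets(audiogram, fraction=1 / 3):
    return {f: hl * fraction for f, hl in audiogram.items()}

def apply_gain(signal, fs, targets):
    """Apply per-band gains (in dB) to a mono signal via simple FFT weighting."""
    spec = np.fft.rfft(signal)
    freqs = np.fft.rfftfreq(len(signal), 1.0 / fs)
    band_edges = sorted(targets)
    gains_db = np.interp(freqs, band_edges, [targets[f] for f in band_edges])
    spec *= 10 ** (gains_db / 20.0)                  # dB -> linear amplitude
    return np.fft.irfft(spec, n=len(signal))

if __name__ == "__main__":
    fs = 8000
    t = np.arange(fs) / fs
    tone = np.sin(2 * np.pi * 2000 * t)
    boosted = apply_gain(tone, fs, gain_targets(AUDIOGRAM))
    print(round(20 * np.log10(np.abs(boosted).max() / np.abs(tone).max()), 1), "dB boost at 2 kHz")
```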

  3. Automated, high accuracy classification of Parkinsonian disorders: a pattern recognition approach.

    Directory of Open Access Journals (Sweden)

    Andre F Marquand

    Full Text Available Progressive supranuclear palsy (PSP), multiple system atrophy (MSA) and idiopathic Parkinson's disease (IPD) can be clinically indistinguishable, especially in the early stages, despite distinct patterns of molecular pathology. Structural neuroimaging holds promise for providing objective biomarkers for discriminating these diseases at the single subject level but all studies to date have reported incomplete separation of disease groups. In this study, we employed multi-class pattern recognition to assess the value of anatomical patterns derived from a widely available structural neuroimaging sequence for automated classification of these disorders. To achieve this, 17 patients with PSP, 14 with IPD and 19 with MSA were scanned using structural MRI along with 19 healthy controls (HCs). An advanced probabilistic pattern recognition approach was employed to evaluate the diagnostic value of several pre-defined anatomical patterns for discriminating the disorders, including: (i) a subcortical motor network; (ii) each of its component regions and (iii) the whole brain. All disease groups could be discriminated simultaneously with high accuracy using the subcortical motor network. The region providing the most accurate predictions overall was the midbrain/brainstem, which discriminated all disease groups from one another and from HCs. The subcortical network also produced more accurate predictions than the whole brain and all of its constituent regions. PSP was accurately predicted from the midbrain/brainstem, cerebellum and all basal ganglia compartments; MSA from the midbrain/brainstem and cerebellum and IPD from the midbrain/brainstem only. This study demonstrates that automated analysis of structural MRI can accurately predict diagnosis in individual patients with Parkinsonian disorders, and identifies distinct patterns of regional atrophy particularly useful for this process.

  4. The Role of Variety Recognition in Japanese University Students' Attitudes towards English Speech Varieties

    Science.gov (United States)

    McKenzie, Robert M.

    2008-01-01

    Language attitude studies have tended to assume that informants who listen to and evaluate speech stimuli are able to identify with consistent accuracy the varieties of English in question. However, misidentification could reduce the validity of any results obtained, particularly when it involves the evaluations of non-native English-speaking…

  5. Dead regions in the cochlea: Implications for speech recognition and applicability of articulation index theory

    DEFF Research Database (Denmark)

    Vestergaard, Martin David

    2003-01-01

    Dead regions in the cochlea have been suggested to be responsible for failure by hearing aid users to benefit from apparently increased audibility in terms of speech intelligibility. As an alternative to the more cumbersome psychoacoustic tuning curve measurement, threshold-equalizing noise (TEN...

  6. Hints About Some Baseful but Indispensable Elements in Speech Recognition and Reconstruction

    Directory of Open Access Journals (Sweden)

    Mihaela Costin

    2002-07-01

    Full Text Available The cochlear implant (CI) is a device used to reconstruct the hearing capabilities of a person diagnosed with total cophosis. This impairment may occur after some accidents, chemotherapy etc., the person still having an intact hearing nerve. The cochlear implant has two parts: a programmable external part, the Digital Signal Processing (DSP) device, which processes and transforms the speech signal, and a surgically implanted part with a certain number of electrodes (depending on brand) used to stimulate the hearing nerve. The speech signal is fully processed in the external DSP device, resulting in the "coded" information on speech. This is modulated with the support of the fundamental frequency F0 and the energy impulses are inductively sent to the hearing nerve. The correct detection of this frequency is very important, determining the manner of hearing and making the difference between a "computer" voice and a natural one. The results are applicable not only in the medical domain, but also in Romanian speech synthesis.

  7. External Validity of Electronic Sniffers for Automated Recognition of Acute Respiratory Distress Syndrome.

    Science.gov (United States)

    McKown, Andrew C; Brown, Ryan M; Ware, Lorraine B; Wanderer, Jonathan P

    2017-01-01

    Automated electronic sniffers may be useful for early detection of acute respiratory distress syndrome (ARDS) for institution of treatment or clinical trial screening. In a prospective cohort of 2929 critically ill patients, we retrospectively applied published sniffer algorithms for automated detection of acute lung injury to assess their utility in diagnosis of ARDS in the first 4 ICU days. Radiographic full-text reports were searched for "edema" OR ("bilateral" AND "infiltrate") and a more detailed algorithm for descriptions consistent with ARDS. Patients were flagged as possible ARDS if a radiograph met search criteria and had a PaO2/FiO2 or SpO2/FiO2 of 300 or 315, respectively. Test characteristics of the electronic sniffers and clinical suspicion of ARDS were compared to a gold standard of 2-physician adjudicated ARDS. Thirty percent of 2841 patients included in the analysis had a gold standard diagnosis of ARDS. The simpler algorithm had sensitivity for ARDS of 78.9%, specificity of 52%, positive predictive value (PPV) of 41%, and negative predictive value (NPV) of 85.3% over the 4-day study period. The more detailed algorithm had sensitivity of 88.2%, specificity of 55.4%, PPV of 45.6%, and NPV of 91.7%. Both algorithms were more sensitive but less specific than clinician suspicion, which had sensitivity of 40.7%, specificity of 94.8%, PPV of 78.2%, and NPV of 77.7%. Published electronic sniffer algorithms for ARDS may be useful automated screening tools for ARDS and improve on clinical recognition, but they are limited to screening rather than diagnosis because their specificity is poor.
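
    The simpler rule described above can be sketched directly. The abstract states the text criteria and the 300/315 cut-offs but not the comparison direction, so the sketch assumes "at or below"; function names and the example reports are hypothetical.

```python
def radiograph_flags(report_text):
    """Simpler published text rule: 'edema' OR ('bilateral' AND 'infiltrate')."""
    text = report_text.lower()
    return ("edema" in text) or ("bilateral" in text and "infiltrate" in text)

def oxygenation_flags(pf_ratio=None, sf_ratio=None):
    """Assumes the 300 / 315 values are 'at or below' cut-offs (not stated in the abstract)."""
    if pf_ratio is not None and pf_ratio <= 300:
        return True
    if sf_ratio is not None and sf_ratio <= 315:
        return True
    return False

def possible_ards(report_text, pf_ratio=None, sf_ratio=None):
    """Flag a patient as possible ARDS when both the radiograph and oxygenation criteria fire."""
    return radiograph_flags(report_text) and oxygenation_flags(pf_ratio, sf_ratio)

if __name__ == "__main__":
    print(possible_ards("Bilateral patchy infiltrates, no effusion.", pf_ratio=180))   # True
    print(possible_ards("Clear lungs.", sf_ratio=400))                                 # False
```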

  8. Digital speech processing using Matlab

    CERN Document Server

    Gopi, E S

    2014-01-01

    Digital Speech Processing Using Matlab deals with digital speech pattern recognition, speech production model, speech feature extraction, and speech compression. The book is written in a manner that is suitable for beginners pursuing basic research in digital speech processing. Matlab illustrations are provided for most topics to enable better understanding of concepts. This book also deals with the basic pattern recognition techniques (illustrated with speech signals using Matlab) such as PCA, LDA, ICA, SVM, HMM, GMM, BPN, and KSOM.

  9. Let Me Relax: Toward Automated Sedentary State Recognition and Ubiquitous Mental Wellness Solutions

    Directory of Open Access Journals (Sweden)

    Vijay Rajanna

    2018-12-01

    Full Text Available Advances in ubiquitous computing technology improve workplace productivity and reduce physical exertion, but ultimately result in a sedentary work style. Sedentary behavior is associated with an increased risk of stress, obesity, and other health complications. Let Me Relax is a fully automated sedentary-state recognition framework using a smartwatch and smartphone, which encourages mental wellness through interventions in the form of simple relaxation techniques. The system was evaluated through a comparative user study of 22 participants split into a test and a control group. An analysis of NASA Task Load Index pre- and post-study surveys revealed that test subjects who followed relaxation methods showed a trend of both increased activity and reduced mental stress. Reduced mental stress was found even in those test subjects who had increased inactivity. These results suggest that repeated interventions, driven by an intelligent activity recognition system, are an effective strategy for promoting healthy habits, which reduce stress, anxiety, and other health risks associated with sedentary workplaces.

  10. How Accurately Can the Google Web Speech API Recognize and Transcribe Japanese L2 English Learners' Oral Production?

    Science.gov (United States)

    Ashwell, Tim; Elam, Jesse R.

    2017-01-01

    The ultimate aim of our research project was to use the Google Web Speech API to automate scoring of elicited imitation (EI) tests. However, in order to achieve this goal, we had to take a number of preparatory steps. We needed to assess how accurate this speech recognition tool is in recognizing native speakers' production of the test items; we…

  11. Working Memory and Speech Recognition in Noise under Ecologically Relevant Listening Conditions: Effects of Visual Cues and Noise Type among Adults with Hearing Loss

    Science.gov (United States)

    Miller, Christi W.; Stewart, Erin K.; Wu, Yu-Hsiang; Bishop, Christopher; Bentler, Ruth A.; Tremblay, Kelly

    2017-01-01

    Purpose: This study evaluated the relationship between working memory (WM) and speech recognition in noise with different noise types as well as in the presence of visual cues. Method: Seventy-six adults with bilateral, mild to moderately severe sensorineural hearing loss (mean age: 69 years) participated. Using a cross-sectional design, 2…

  12. The Use of an Autonomous Pedagogical Agent and Automatic Speech Recognition for Teaching Sight Words to Students with Autism Spectrum Disorder

    Science.gov (United States)

    Saadatzi, Mohammad Nasser; Pennington, Robert C.; Welch, Karla C.; Graham, James H.; Scott, Renee E.

    2017-01-01

    In the current study, we examined the effects of an instructional package comprised of an autonomous pedagogical agent, automatic speech recognition, and constant time delay during the instruction of reading sight words aloud to young adults with autism spectrum disorder. We used a concurrent multiple baseline across participants design to…

  13. Applications of Speech-to-Text Recognition and Computer-Aided Translation for Facilitating Cross-Cultural Learning through a Learning Activity: Issues and Their Solutions

    Science.gov (United States)

    Shadiev, Rustam; Wu, Ting-Ting; Sun, Ai; Huang, Yueh-Min

    2018-01-01

    In this study, 21 university students, who represented thirteen nationalities, participated in an online cross-cultural learning activity. The participants were engaged in interactions and exchanges carried out on Facebook® and Skype® platforms, and their multilingual communications were supported by speech-to-text recognition (STR) and…

  14. Effects of Noise on Speech Recognition and Listening Effort in Children with Normal Hearing and Children with Mild Bilateral or Unilateral Hearing Loss

    Science.gov (United States)

    Lewis, Dawna; Schmid, Kendra; O'Leary, Samantha; Spalding, Jody; Heinrichs-Graham, Elizabeth; High, Robin

    2016-01-01

    Purpose: This study examined the effects of stimulus type and hearing status on speech recognition and listening effort in children with normal hearing (NH) and children with mild bilateral hearing loss (MBHL) or unilateral hearing loss (UHL). Method: Children (5-12 years of age) with NH (Experiment 1) and children (8-12 years of age) with MBHL,…

  15. Assessing the Performance of Automatic Speech Recognition Systems When Used by Native and Non-Native Speakers of Three Major Languages in Dictation Workflows

    DEFF Research Database (Denmark)

    Zapata, Julián; Kirkedal, Andreas Søeborg

    2015-01-01

    In this paper, we report on a two-part experiment aiming to assess and compare the performance of two types of automatic speech recognition (ASR) systems on two different computational platforms when used to augment dictation workflows. The experiment was performed with a sample of speakers...

  16. SOFTWARE EFFORT ESTIMATION FRAMEWORK TO IMPROVE ORGANIZATION PRODUCTIVITY USING EMOTION RECOGNITION OF SOFTWARE ENGINEERS IN SPONTANEOUS SPEECH

    Directory of Open Access Journals (Sweden)

    B.V.A.N.S.S. Prabhakar Rao

    2015-10-01

    Full Text Available Productivity is a very important part of any organisation in general and the software industry in particular. Nowadays, software effort estimation is a challenging task, and effort and productivity are inter-related. Productivity ultimately comes from the employees of the organisation, and every organisation requires emotionally stable employees for seamless and progressive working. In other industries this may be achieved without manpower, but software project development is a labour-intensive activity: each line of code has to be delivered by a software engineer, with tools and techniques acting only as aids or supplements. Whatever the reason, the software industry has been struggling with its success rate, facing many problems in delivering projects on time and within the estimated budget. If we want to estimate the required effort of a project, it is important to know the emotional state of the team members. The responsibility of ensuring emotional contentment falls on the human resource department, and the department can deploy a series of systems to carry out its survey. This analysis can be done using a variety of tools, one of which is the study of emotion recognition. The data needed for this is readily available and collectable and can be an excellent source for feedback systems. The challenge of recognising emotion in speech is complicated primarily by noisy recording conditions, the variation of sentiment across the sample space and the presence of multiple emotions in a single sentence. The ambiguity in the labels of the training set also increases the complexity of the problem addressed. Existing probabilistic models have dominated the study but present a flaw in scalability due to statistical inefficiency. The problem of sentiment prediction in spontaneous speech can thus be addressed using a hybrid system comprising a Convolutional Neural Network and

  17. Relating hearing loss and executive functions to hearing aid users’ preference for, and speech recognition with, different combinations of binaural noise reduction and microphone directionality

    Directory of Open Access Journals (Sweden)

    Tobias eNeher

    2014-12-01

    Full Text Available Knowledge of how executive functions relate to preferred hearing aid (HA) processing is sparse and seemingly inconsistent with related knowledge for speech recognition outcomes. This study thus aimed to find out if (1) performance on a measure of reading span (RS) is related to preferred binaural noise reduction (NR) strength, (2) similar relations exist for two different, nonverbal measures of executive function, (3) pure-tone average hearing loss (PTA), signal-to-noise ratio (SNR), and microphone directionality (DIR) also influence preferred NR strength, and (4) preference and speech recognition outcomes are similar. Sixty elderly HA users took part. Six HA conditions consisting of omnidirectional or cardioid microphones followed by inactive, moderate, or strong binaural NR as well as linear amplification were tested. Outcome was assessed at fixed SNRs using headphone simulations of a frontal target talker in a busy cafeteria. Analyses showed positive effects of active NR and DIR on preference, and negative and positive effects of, respectively, strong NR and DIR on speech recognition. Also, while moderate NR was the most preferred NR setting overall, preference for strong NR increased with SNR. No relation between RS and preference was found. However, larger PTA was related to weaker preference for inactive NR and stronger preference for strong NR for both microphone modes. Equivalent (but weaker) relations between worse performance on one nonverbal measure of executive function and the HA conditions without DIR were found. For speech recognition, there were relations between HA condition, PTA, and RS, but their pattern differed from that for preference. Altogether, these results indicate that, while moderate NR works well in general, a notable proportion of HA users prefer stronger NR. Furthermore, PTA and executive functions can account for some of the variability in preference for, and speech recognition with, different binaural NR and DIR settings.

  18. The role of automated speech and audio analysis in semantic multimedia annotation

    NARCIS (Netherlands)

    de Jong, Franciska M.G.; Ordelman, Roeland J.F.; van Hessen, Adrianus J.

    This paper overviews the various ways in which automatic speech and audio analysis can be deployed to enhance the semantic annotation of multimedia content, and as a consequence to improve the effectiveness of conceptual access tools. A number of techniques will be presented, including the alignment

  19. Emotion recognition from speech by combining databases and fusion of classifiers

    NARCIS (Netherlands)

    Lefter, I.; Rothkrantz, L.J.M.; Wiggers, P.; Leeuwen, D.A. van

    2010-01-01

    We explore possibilities for enhancing the generality, portability and robustness of emotion recognition systems by combining data-bases and by fusion of classifiers. In a first experiment, we investigate the performance of an emotion detection system tested on a certain database given that it is

  20. Spoken Word Recognition Errors in Speech Audiometry: A Measure of Hearing Performance?

    Directory of Open Access Journals (Sweden)

    Martine Coene

    2015-01-01

    Full Text Available This report provides a detailed analysis of incorrect responses from an open-set spoken word-repetition task which is part of a Dutch speech audiometric test battery. Single-consonant confusions were analyzed from 230 normal hearing participants in terms of the probability of choice of a particular response on the basis of acoustic-phonetic, lexical, and frequency variables. The results indicate that consonant confusions are better predicted by lexical knowledge than by acoustic properties of the stimulus word. A detailed analysis of the transmission of phonetic features indicates that “voicing” is best preserved whereas “manner of articulation” yields most perception errors. As consonant confusion matrices are often used to determine the degree and type of a patient’s hearing impairment, to predict a patient’s gain in hearing performance with hearing devices and to optimize the device settings in view of maximum output, the observed findings are highly relevant for the audiological practice. Based on our findings, speech audiometric outcomes provide a combined auditory-linguistic profile of the patient. The use of confusion matrices might therefore not be the method best suited to measure hearing performance. Ideally, they should be complemented by other listening task types that are known to have less linguistic bias, such as phonemic discrimination.
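
    The feature-transmission analysis mentioned above can be illustrated with a toy computation: given stimulus/response consonant pairs, count how often voicing and manner of articulation are preserved. The feature table and response data below are invented for illustration and are not the study's Dutch materials.

```python
from collections import Counter

# Toy (voicing, manner) table for a few consonants; illustrative only.
FEATURES = {
    "p": ("voiceless", "plosive"),   "b": ("voiced", "plosive"),
    "t": ("voiceless", "plosive"),   "d": ("voiced", "plosive"),
    "s": ("voiceless", "fricative"), "z": ("voiced", "fricative"),
}

def feature_preservation(pairs):
    """pairs: iterable of (stimulus_consonant, response_consonant).

    Returns the proportion of responses preserving voicing and manner,
    a crude stand-in for a feature-transmission analysis."""
    kept = Counter()
    for stim, resp in pairs:
        v_s, m_s = FEATURES[stim]
        v_r, m_r = FEATURES[resp]
        kept["total"] += 1
        kept["voicing"] += v_s == v_r
        kept["manner"] += m_s == m_r
    n = kept["total"]
    return {"voicing": kept["voicing"] / n, "manner": kept["manner"] / n}

if __name__ == "__main__":
    responses = [("p", "t"), ("b", "d"), ("s", "t"), ("z", "d"), ("d", "d")]
    print(feature_preservation(responses))   # voicing preserved more often than manner here
```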

  1. Comparative Study on Feature Selection and Fusion Schemes for Emotion Recognition from Speech

    Directory of Open Access Journals (Sweden)

    Santiago Planet

    2012-09-01

    Full Text Available The automatic analysis of speech to detect affective states may improve the way users interact with electronic devices. However, analysis at the acoustic level alone may not be enough to determine the emotion of a user in a realistic scenario. In this paper we analyzed the spontaneous speech recordings of the FAU Aibo Corpus at the acoustic and linguistic levels to extract two sets of features. The acoustic set was reduced by a greedy procedure selecting the most relevant features to optimize the learning stage. We compared two versions of this greedy selection algorithm by performing the search for the relevant features forwards and backwards. We experimented with three classification approaches: Naïve-Bayes, a support vector machine and a logistic model tree, and two fusion schemes: decision-level fusion, merging the hard decisions of the acoustic and linguistic classifiers by means of a decision tree; and feature-level fusion, concatenating both sets of features before the learning stage. Despite the low performance achieved by the linguistic data alone, combining it with the acoustic information yielded a dramatic improvement over the results achieved by the acoustic modality on its own. The results achieved by the classifiers using the parameters merged at feature level outperformed the classification results of the decision-level fusion scheme, despite the simplicity of the scheme. Moreover, the extremely reduced set of acoustic features obtained by the greedy forward search selection algorithm improved the results provided by the full set.
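
    A generic greedy forward search, as referred to above, is sketched below. It is not the paper's selection procedure: the scorer is a stand-in least-squares fit on synthetic data rather than the emotion classifiers and acoustic features used in the study, and `max_feats` is an illustrative parameter.

```python
import numpy as np

def greedy_forward_selection(X, y, score_fn, max_feats=None):
    """Greedy forward search: repeatedly add the feature that most improves the score."""
    n_feats = X.shape[1]
    max_feats = max_feats or n_feats
    selected, best_score = [], -np.inf
    while len(selected) < max_feats:
        candidates = [f for f in range(n_feats) if f not in selected]
        scores = [(score_fn(X[:, selected + [f]], y), f) for f in candidates]
        score, feat = max(scores)
        if score <= best_score:          # no further improvement -> stop
            break
        selected.append(feat)
        best_score = score
    return selected, best_score

def r2_of_least_squares(Xs, y):
    """Stand-in scorer: R^2 of an ordinary least-squares fit on the chosen columns."""
    Xd = np.column_stack([Xs, np.ones(len(y))])
    coef, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    resid = y - Xd @ coef
    return 1.0 - resid.var() / y.var()

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = rng.normal(size=(200, 10))
    y = 2 * X[:, 3] - X[:, 7] + 0.1 * rng.normal(size=200)   # only features 3 and 7 matter
    print(greedy_forward_selection(X, y, r2_of_least_squares, max_feats=3))  # 3 and 7 picked first
```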

  2. One-against-all weighted dynamic time warping for language-independent and speaker-dependent speech recognition in adverse conditions.

    Directory of Open Access Journals (Sweden)

    Xianglilan Zhang

    Full Text Available Considering personal privacy and the difficulty of obtaining training material for many seldom-used English words and (often non-English) names, language-independent (LI) with lightweight speaker-dependent (SD) automatic speech recognition (ASR) is a promising option to solve the problem. The dynamic time warping (DTW) algorithm is the state-of-the-art algorithm for small-footprint SD ASR applications with limited storage space and small vocabulary, such as voice dialing on mobile devices, menu-driven recognition, and voice control on vehicles and robotics. Even though we have successfully developed two fast and accurate DTW variations for clean speech data, speech recognition for adverse conditions is still a big challenge. In order to improve recognition accuracy in noisy environments and bad recording conditions such as too high or low volume, we introduce a novel one-against-all weighted DTW (OAWDTW). This method defines a one-against-all index (OAI) for each time frame of training data and applies the OAIs to the core DTW process. Given two speech signals, OAWDTW tunes their final alignment score by using OAI in the DTW process. Our method achieves better accuracies than DTW and merge-weighted DTW (MWDTW), as 6.97% relative reduction of error rate (RRER) compared with DTW and 15.91% RRER compared with MWDTW are observed in our extensive experiments on one representative SD dataset of four speakers' recordings. To the best of our knowledge, the OAWDTW approach is the first weighted DTW specially designed for speech data in adverse conditions.
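
    A sketch of frame-weighted DTW, the mechanism on which OAWDTW builds, is given below. The paper's one-against-all index (OAI) is not reproduced; here the per-frame weights `w_a` are left as a generic input, and the feature sequences in the demo are random stand-ins for MFCC frames.

```python
import numpy as np

def weighted_dtw(a, b, w_a=None):
    """DTW alignment cost between feature sequences a (n x d) and b (m x d).

    w_a optionally weights each frame of the template `a`; with all-ones weights
    this reduces to plain DTW. The OAI weights of the paper would be plugged in
    here, but their exact computation is not reproduced."""
    n, m = len(a), len(b)
    w_a = np.ones(n) if w_a is None else np.asarray(w_a)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = w_a[i - 1] * np.linalg.norm(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m] / (n + m)    # length-normalized alignment score

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    template = rng.normal(size=(40, 13))                    # e.g., 13-dim MFCC frames
    query = template[::2] + 0.05 * rng.normal(size=(20, 13))
    other = rng.normal(size=(30, 13))
    print(weighted_dtw(template, query) < weighted_dtw(template, other))  # True: query matches template
```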

  3. A vision-based automated guided vehicle system with marker recognition for indoor use.

    Science.gov (United States)

    Lee, Jeisung; Hyun, Chang-Ho; Park, Mignon

    2013-08-07

    We propose an intelligent vision-based Automated Guided Vehicle (AGV) system using fiduciary markers. In this paper, we explore a low-cost, efficient vehicle guiding method using a consumer grade web camera and fiduciary markers. In the proposed method, the system uses fiduciary markers with a capital letter or triangle indicating direction in it. The markers are very easy to produce, manipulate, and maintain. The marker information is used to guide a vehicle. We use hue and saturation values in the image to extract marker candidates. When the known size fiduciary marker is detected by using a bird's eye view and Hough transform, the positional relation between the marker and the vehicle can be calculated. To recognize the character in the marker, a distance transform is used. The probability of feature matching was calculated by using a distance transform, and a feature having high probability is selected as a captured marker. Four directional signals and 10 alphabet features are defined and used as markers. A 98.87% recognition rate was achieved in the testing phase. The experimental results with the fiduciary marker show that the proposed method is a solution for an indoor AGV system.
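
    The hue/saturation candidate-extraction step described above could look like the sketch below (OpenCV 4.x assumed). It covers only the colour-based candidate search; the bird's-eye-view transform, Hough detection, and distance-transform character matching are not shown, and the colour thresholds and area cut-off are invented for illustration.

```python
import cv2
import numpy as np

# Illustrative hue/saturation limits for a saturated red marker; the actual
# marker colours and thresholds used in the paper are not given in the abstract.
LOWER_HSV = np.array([0, 120, 80])
UPPER_HSV = np.array([12, 255, 255])

def marker_candidates(bgr_image):
    """Return bounding boxes of regions whose hue/saturation match the marker colour."""
    hsv = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2HSV)
    mask = cv2.inRange(hsv, LOWER_HSV, UPPER_HSV)
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)  # OpenCV >= 4
    return [cv2.boundingRect(c) for c in contours if cv2.contourArea(c) > 100]

if __name__ == "__main__":
    img = np.zeros((240, 320, 3), dtype=np.uint8)
    img[60:140, 100:180] = (0, 0, 255)          # synthetic red square as a stand-in marker
    print(marker_candidates(img))               # roughly [(100, 60, 80, 80)]
```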

  4. A large-scale dataset of solar event reports from automated feature recognition modules

    Science.gov (United States)

    Schuh, Michael A.; Angryk, Rafal A.; Martens, Petrus C.

    2016-05-01

    The massive repository of images of the Sun captured by the Solar Dynamics Observatory (SDO) mission has ushered in the era of Big Data for Solar Physics. In this work, we investigate the entire public collection of events reported to the Heliophysics Event Knowledgebase (HEK) from automated solar feature recognition modules operated by the SDO Feature Finding Team (FFT). With the SDO mission recently surpassing five years of operations, and over 280,000 event reports for seven types of solar phenomena, we present the broadest and most comprehensive large-scale dataset of the SDO FFT modules to date. We also present numerous statistics on these modules, providing valuable contextual information for better understanding and validating of the individual event reports and the entire dataset as a whole. After extensive data cleaning through exploratory data analysis, we highlight several opportunities for knowledge discovery from data (KDD). Through these important prerequisite analyses presented here, the results of KDD from Solar Big Data will be overall more reliable and better understood. As the SDO mission remains operational over the coming years, these datasets will continue to grow in size and value. Future versions of this dataset will be analyzed in the general framework established in this work and maintained publicly online for easy access by the community.

  5. Automatic Query Generation and Query Relevance Measurement for Unsupervised Language Model Adaptation of Speech Recognition

    Directory of Open Access Journals (Sweden)

    Suzuki Motoyuki

    2009-01-01

    Full Text Available We are developing a method of Web-based unsupervised language model adaptation for recognition of spoken documents. The proposed method chooses keywords from the preliminary recognition result and retrieves Web documents using the chosen keywords. A problem is that the selected keywords tend to contain misrecognized words. The proposed method introduces two new ideas for avoiding the effects of keywords derived from misrecognized words. The first idea is to compose multiple queries from selected keyword candidates so that the misrecognized words and correct words do not fall into one query. The second idea is that the number of Web documents downloaded for each query is determined according to the "query relevance." Combining these two ideas, we can alleviate the bad effect of misrecognized keywords by decreasing the number of downloaded Web documents from queries that contain misrecognized keywords. Finally, we examine a method of determining the number of iterative adaptations based on the recognition likelihood. Experiments have shown that the proposed stopping criterion can determine almost the optimum number of iterations. In the final experiment, the word accuracy without adaptation (55.29%) was improved to 60.38%, which was 1.13 points better than the result of the conventional unsupervised adaptation method (59.25%).

  6. Automatic Query Generation and Query Relevance Measurement for Unsupervised Language Model Adaptation of Speech Recognition

    Directory of Open Access Journals (Sweden)

    Akinori Ito

    2009-01-01

    Full Text Available We are developing a method of Web-based unsupervised language model adaptation for recognition of spoken documents. The proposed method chooses keywords from the preliminary recognition result and retrieves Web documents using the chosen keywords. A problem is that the selected keywords tend to contain misrecognized words. The proposed method introduces two new ideas for avoiding the effects of keywords derived from misrecognized words. The first idea is to compose multiple queries from selected keyword candidates so that the misrecognized words and correct words do not fall into one query. The second idea is that the number of Web documents downloaded for each query is determined according to the “query relevance.” Combining these two ideas, we can alleviate the bad effect of misrecognized keywords by decreasing the number of downloaded Web documents from queries that contain misrecognized keywords. Finally, we examine a method of determining the number of iterative adaptations based on the recognition likelihood. Experiments have shown that the proposed stopping criterion can determine almost the optimum number of iterations. In the final experiment, the word accuracy without adaptation (55.29%) was improved to 60.38%, which was 1.13 points better than the result of the conventional unsupervised adaptation method (59.25%).
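
    The two ideas above, composing several small queries and allocating the download budget by query relevance, can be sketched as follows. The pairing strategy, the relevance scores, and the document budget are illustrative assumptions; the paper's actual query-relevance measure is not reproduced.

```python
from itertools import combinations

def compose_queries(keywords, per_query=2):
    """Split keyword candidates into several small queries so that a possibly
    misrecognized keyword cannot contaminate every query."""
    return [" ".join(pair) for pair in combinations(keywords, per_query)]

def allocate_downloads(query_relevance, total_docs=100):
    """Distribute the Web-document download budget proportionally to query relevance."""
    total = sum(query_relevance.values()) or 1.0
    return {q: round(total_docs * rel / total) for q, rel in query_relevance.items()}

if __name__ == "__main__":
    keywords = ["earthquake", "magnitude", "taxi"]      # "taxi" plays the misrecognized word
    queries = compose_queries(keywords)
    relevance = {q: (0.1 if "taxi" in q else 1.0) for q in queries}   # stand-in relevance scores
    print(allocate_downloads(relevance))    # queries containing "taxi" get few documents
```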

  7. It doesn't matter what you say: FMRI correlates of voice learning and recognition independent of speech content.

    Science.gov (United States)

    Zäske, Romi; Awwad Shiekh Hasan, Bashar; Belin, Pascal

    2017-09-01

    Listeners can recognize newly learned voices from previously unheard utterances, suggesting the acquisition of high-level speech-invariant voice representations during learning. Using functional magnetic resonance imaging (fMRI) we investigated the anatomical basis underlying the acquisition of voice representations for unfamiliar speakers independent of speech, and their subsequent recognition among novel voices. Specifically, listeners studied voices of unfamiliar speakers uttering short sentences and subsequently classified studied and novel voices as "old" or "new" in a recognition test. To investigate "pure" voice learning, i.e., independent of sentence meaning, we presented German sentence stimuli to non-German speaking listeners. To disentangle stimulus-invariant and stimulus-dependent learning, during the test phase we contrasted a "same sentence" condition in which listeners heard speakers repeating the sentences from the preceding study phase, with a "different sentence" condition. Voice recognition performance was above chance in both conditions although, as expected, performance was higher for same than for different sentences. During study phases activity in the left inferior frontal gyrus (IFG) was related to subsequent voice recognition performance and same versus different sentence condition, suggesting an involvement of the left IFG in the interactive processing of speaker and speech information during learning. Importantly, at test reduced activation for voices correctly classified as "old" compared to "new" emerged in a network of brain areas including temporal voice areas (TVAs) of the right posterior superior temporal gyrus (pSTG), as well as the right inferior/middle frontal gyrus (IFG/MFG), the right medial frontal gyrus, and the left caudate. This effect of voice novelty did not interact with sentence condition, suggesting a role of temporal voice-selective areas and extra-temporal areas in the explicit recognition of learned voice identity

  8. Biologically-Inspired Spike-Based Automatic Speech Recognition of Isolated Digits Over a Reproducing Kernel Hilbert Space

    Directory of Open Access Journals (Sweden)

    Kan Li

    2018-04-01

    Full Text Available This paper presents a novel real-time dynamic framework for quantifying time-series structure in spoken words using spikes. Audio signals are converted into multi-channel spike trains using a biologically-inspired leaky integrate-and-fire (LIF) spike generator. These spike trains are mapped into a function space of infinite dimension, i.e., a Reproducing Kernel Hilbert Space (RKHS) using point-process kernels, where a state-space model learns the dynamics of the multidimensional spike input using gradient descent learning. This kernelized recurrent system is very parsimonious and achieves the necessary memory depth via feedback of its internal states when trained discriminatively, utilizing the full context of the phoneme sequence. A main advantage of modeling nonlinear dynamics using state-space trajectories in the RKHS is that it imposes no restriction on the relationship between the exogenous input and its internal state. We are free to choose the input representation with an appropriate kernel, and changing the kernel does not impact the system nor the learning algorithm. Moreover, we show that this novel framework can outperform both traditional hidden Markov model (HMM) speech processing as well as neuromorphic implementations based on spiking neural network (SNN), yielding accurate and ultra-low power word spotters. As a proof of concept, we demonstrate its capabilities using the benchmark TI-46 digit corpus for isolated-word automatic speech recognition (ASR) or keyword spotting. Compared to HMM using Mel-frequency cepstral coefficient (MFCC) front-end without time-derivatives, our MFCC-KAARMA offered improved performance. For spike-train front-end, spike-KAARMA also outperformed state-of-the-art SNN solutions. Furthermore, compared to MFCCs, spike trains provided enhanced noise robustness in certain low signal-to-noise ratio (SNR) regime.

  9. Biologically-Inspired Spike-Based Automatic Speech Recognition of Isolated Digits Over a Reproducing Kernel Hilbert Space.

    Science.gov (United States)

    Li, Kan; Príncipe, José C

    2018-01-01

    This paper presents a novel real-time dynamic framework for quantifying time-series structure in spoken words using spikes. Audio signals are converted into multi-channel spike trains using a biologically-inspired leaky integrate-and-fire (LIF) spike generator. These spike trains are mapped into a function space of infinite dimension, i.e., a Reproducing Kernel Hilbert Space (RKHS) using point-process kernels, where a state-space model learns the dynamics of the multidimensional spike input using gradient descent learning. This kernelized recurrent system is very parsimonious and achieves the necessary memory depth via feedback of its internal states when trained discriminatively, utilizing the full context of the phoneme sequence. A main advantage of modeling nonlinear dynamics using state-space trajectories in the RKHS is that it imposes no restriction on the relationship between the exogenous input and its internal state. We are free to choose the input representation with an appropriate kernel, and changing the kernel does not impact the system nor the learning algorithm. Moreover, we show that this novel framework can outperform both traditional hidden Markov model (HMM) speech processing as well as neuromorphic implementations based on spiking neural network (SNN), yielding accurate and ultra-low power word spotters. As a proof of concept, we demonstrate its capabilities using the benchmark TI-46 digit corpus for isolated-word automatic speech recognition (ASR) or keyword spotting. Compared to HMM using Mel-frequency cepstral coefficient (MFCC) front-end without time-derivatives, our MFCC-KAARMA offered improved performance. For spike-train front-end, spike-KAARMA also outperformed state-of-the-art SNN solutions. Furthermore, compared to MFCCs, spike trains provided enhanced noise robustness in certain low signal-to-noise ratio (SNR) regime.
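
    The LIF front-end mentioned in both records above converts a continuous drive (for example, a band-passed energy envelope) into spikes. A minimal sketch of one such unit follows; the time constant, threshold, and drive signal are illustrative assumptions, not the paper's settings.

```python
import numpy as np

def lif_spike_train(drive, tau=0.01, threshold=1.0, dt=0.001):
    """Convert a non-negative drive signal into a spike train with one
    leaky integrate-and-fire unit (illustrative parameter values)."""
    v = 0.0
    spikes = np.zeros(len(drive))
    for t, x in enumerate(drive):
        v += dt * (-v / tau + x)        # leaky integration of the input drive
        if v >= threshold:              # fire and reset
            spikes[t] = 1.0
            v = 0.0
    return spikes

if __name__ == "__main__":
    t = np.arange(0, 1, 0.001)
    envelope = 200 * (np.sin(2 * np.pi * 2 * t) > 0)        # bursts of drive
    train = lif_spike_train(envelope)
    print(int(train.sum()), "spikes; none outside the bursts:",
          bool(train[envelope == 0].sum() == 0))
```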

  10. Optimizing Automatic Speech Recognition for Low-Proficient Non-Native Speakers

    Directory of Open Access Journals (Sweden)

    Catia Cucchiarini

    2010-01-01

    Full Text Available Computer-Assisted Language Learning (CALL) applications for improving the oral skills of low-proficient learners have to cope with non-native speech that is particularly challenging. Since unconstrained non-native ASR is still problematic, a possible solution is to elicit constrained responses from the learners. In this paper, we describe experiments aimed at selecting utterances from lists of responses. The first experiment on utterance selection indicates that the decoding process can be improved by optimizing the language model and the acoustic models, thus reducing the utterance error rate from 29–26% to 10–8%. Since giving feedback on incorrectly recognized utterances is confusing, we verify the correctness of the utterance before providing feedback. The results of the second experiment on utterance verification indicate that combining duration-related features with a likelihood ratio (LR) yields an equal error rate (EER) of 10.3%, which is significantly better than the EER for the other measures in isolation.
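
    The equal error rate (EER) used above is the operating point where false-rejection and false-acceptance rates cross. The sketch below computes it from verification scores; the score distributions are stand-ins, not the study's combined likelihood-ratio and duration features.

```python
import numpy as np

def equal_error_rate(target_scores, impostor_scores):
    """EER: the point where the false-rejection and false-acceptance rates are (nearly) equal.
    Scores: higher means the utterance is more likely correct."""
    thresholds = np.sort(np.concatenate([target_scores, impostor_scores]))
    best = (1.0, None)
    for th in thresholds:
        frr = np.mean(target_scores < th)        # correct utterances rejected
        far = np.mean(impostor_scores >= th)     # incorrect utterances accepted
        gap = abs(frr - far)
        if gap < best[0]:
            best = (gap, (frr + far) / 2)
    return best[1]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    correct = rng.normal(2.0, 1.0, 1000)      # stand-in scores for correctly spoken utterances
    incorrect = rng.normal(0.0, 1.0, 1000)    # stand-in scores for incorrect utterances
    print(f"EER ~= {equal_error_rate(correct, incorrect):.3f}")   # roughly 0.16 here
```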

  11. The Sweet-Home speech and multimodal corpus for home automation interaction

    OpenAIRE

    Vacher , Michel; Lecouteux , Benjamin; Chahuara , Pedro; Portet , François; Meillon , Brigitte; Bonnefond , Nicolas

    2014-01-01

    Ambient Assisted Living aims at enhancing the quality of life of older and disabled people at home thanks to Smart Homes and Home Automation. However, many studies do not include tests in real settings, because data collection in this domain is very expensive and challenging and because of the few available data sets. The SWEET-HOME multimodal corpus is a dataset recorded in realistic conditions in DOMUS, a fully equipped Smart Home with microphones and home automati...

  12. Automated Detection of Selective Logging in Amazon Forests Using Airborne Lidar Data and Pattern Recognition Algorithms

    Science.gov (United States)

    Keller, M. M.; d'Oliveira, M. N.; Takemura, C. M.; Vitoria, D.; Araujo, L. S.; Morton, D. C.

    2012-12-01

    Selective logging, the removal of several valuable timber trees per hectare, is an important land use in the Brazilian Amazon and may degrade forests through long term changes in structure, loss of forest carbon and species diversity. Similar to deforestation, the annual area affected by selective logging has declined significantly in the past decade. Nonetheless, this land use affects several thousand km2 per year in Brazil. We studied a 1000 ha area of the Antimary State Forest (FEA) in the State of Acre, Brazil (9.304°S, 68.281°W) that has a basal area of 22.5 m2 ha-1 and an above-ground biomass of 231 Mg ha-1. Logging intensity was low, approximately 10 to 15 m3 ha-1. We collected small-footprint airborne lidar data using an Optech ALTM 3100EA over the study area once each in 2010 and 2011. The study area contained both recent and older logging that used both conventional and technologically advanced logging techniques. Lidar return density averaged over 20 m-2 for both collection periods with estimated horizontal and vertical precision of 0.30 and 0.15 m. A relative density model comparing returns from 0 to 1 m elevation to returns in the 1-5 m elevation range revealed the pattern of roads and skid trails. These patterns were confirmed by ground-based GPS survey. A GIS model of the road and skid network was built using lidar and ground data. We tested and compared two pattern recognition approaches used to automate logging detection. Both commercial eCognition segmentation and a Frangi filter algorithm identified the road and skid trail network when compared to the GIS model. We report on the effectiveness of these two techniques.
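
    The relative density model above compares near-ground returns (0-1 m) with understory returns (1-5 m) on a grid. A minimal sketch follows; the exact ratio definition and the 5 m cell size are assumptions, and the point clouds in the demo are synthetic.

```python
import numpy as np

def relative_density(points, cell=5.0):
    """Grid lidar returns and compute, per cell, the share of 0-1 m returns among
    all 0-5 m returns; high values flag cleared roads and skid trails.

    points: array of (x, y, height_above_ground) in metres."""
    x, y, h = points.T
    ix = (x // cell).astype(int)
    iy = (y // cell).astype(int)
    ratios = {}
    for key in set(zip(ix, iy)):
        in_cell = (ix == key[0]) & (iy == key[1])
        low = np.sum((h >= 0) & (h < 1) & in_cell)
        mid = np.sum((h >= 1) & (h <= 5) & in_cell)
        ratios[key] = low / (low + mid) if (low + mid) else np.nan
    return ratios

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    forest = np.column_stack([rng.uniform(0, 10, 500), rng.uniform(0, 10, 500), rng.uniform(0, 5, 500)])
    trail = np.column_stack([rng.uniform(10, 20, 500), rng.uniform(0, 10, 500), rng.uniform(0, 0.5, 500)])
    dens = relative_density(np.vstack([forest, trail]))
    print({(int(a), int(b)): round(float(v), 2) for (a, b), v in sorted(dens.items())})
    # cells over the synthetic trail approach 1.0, forest cells stay near 0.2
```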

  13. Speech recognition and communication outcomes with cochlear implantation in Usher syndrome type 3.

    Science.gov (United States)

    Pietola, Laura; Aarnisalo, Antti A; Abdel-Rahman, Akram; Västinsalo, Hanna; Isosomppi, Juha; Löppönen, Heikki; Kentala, Erna; Johansson, Reijo; Valtonen, Hannu; Vasama, Juha-Pekka; Sankila, Eeva-Marja; Jero, Jussi

    2012-01-01

    Usher syndrome Type 3 (USH3) is an autosomal recessive disorder characterized by variable type and degree of progressive sensorineural hearing loss and retinitis pigmentosa. Cochlear implants are widely used among these patients. To evaluate the results and benefits of cochlear implantation in patients with USH3. A nationwide multicenter retrospective review. During the years 1995-2005, in 5 Finnish university hospitals, 19 patients with USH3 received a cochlear implant. Saliva samples were collected to verify the USH3 genotype. Patients answered to 3 questionnaires: Glasgow Benefit Inventory, Glasgow Health Status Inventory, and a self-made questionnaire. Audiological data were collected from patient records. All the patients with USH3 in the study were homozygous for the Finnish major mutation (p.Y176X). Either they had severe sensorineural hearing loss or they were profoundly deaf. The mean preoperative hearing level (pure-tone average, 0.5-4 kHz) was 110 ± 8 dB hearing loss (HL) and the mean aided hearing level was 58 ± 11 dB HL. The postoperative hearing level (34 ± 9 dB HL) and word recognition scores were significantly better than before surgery. According to the Glasgow Benefit Inventory scores and Glasgow Health Status Inventory data related to hearing, the cochlear implantation was beneficial to patients with USH3. Cochlear implantation is beneficial to patients with USH3, and patients learn to use the implant without assistance.

  14. Multiresolution analysis (discrete wavelet transform) through Daubechies family for emotion recognition in speech.

    Science.gov (United States)

    Campo, D.; Quintero, O. L.; Bastidas, M.

    2016-04-01

    We propose a study of the mathematical properties of voice as an audio signal. This work includes signals in which the channel conditions are not ideal for emotion recognition. Multiresolution analysis (discrete wavelet transform) was performed using the Daubechies wavelet family (Db1-Haar, Db6, Db8, Db10), allowing the decomposition of the initial audio signal into sets of coefficients on which a set of features was extracted and analyzed statistically in order to differentiate emotional states. ANNs proved to be a system that allows an appropriate classification of such states. This study shows that the features extracted using wavelet decomposition are enough to analyze and extract emotional content in audio signals, presenting a high accuracy rate in classification of emotional states without the need to use other kinds of classical frequency-time features. Accordingly, this paper seeks to characterize mathematically the six basic emotions in humans: boredom, disgust, happiness, anxiety, anger and sadness, plus neutrality, for a total of seven states to identify.
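
    A minimal sketch of Daubechies-based feature extraction, assuming the PyWavelets package is available. The per-band statistics (energy and standard deviation) and the toy signal are illustrative; they are not the paper's feature set or ANN classifier.

```python
import numpy as np
import pywt   # PyWavelets

def wavelet_features(signal, wavelet="db6", level=4):
    """Decompose the signal with a Daubechies wavelet and return simple statistics
    of each coefficient set (approximation plus detail bands)."""
    feats = {}
    for i, coeffs in enumerate(pywt.wavedec(signal, wavelet, level=level)):
        feats[f"{wavelet}_band{i}_energy"] = float(np.sum(coeffs ** 2))
        feats[f"{wavelet}_band{i}_std"] = float(np.std(coeffs))
    return feats

if __name__ == "__main__":
    fs = 16000
    t = np.arange(fs) / fs
    utterance = np.sin(2 * np.pi * 220 * t) * np.exp(-3 * t)   # toy decaying tone as a stand-in
    for name, value in list(wavelet_features(utterance).items())[:4]:
        print(name, round(value, 3))
```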

  15. Study of the vocal signal in the amplitude-time representation. Speech segmentation and recognition algorithms

    International Nuclear Information System (INIS)

    Baudry, Marc

    1978-01-01

    This dissertation presents an acoustic and phonetic study of the vocal signal. The complex pattern of the signal is segmented into simple sub-patterns, and each of these sub-patterns may in turn be segmented into simpler, lower-level patterns. Applying pattern recognition techniques facilitates, on one hand, this segmentation and, on the other hand, the definition of the structural relations between the sub-patterns. In particular, we have developed syntactic techniques in which the context-sensitive rewriting rules are controlled by predicates using parameters evaluated on the sub-patterns themselves. This allows a purely syntactic analysis to be generalized by adding semantic information. The system we present performs pre-classification and partial identification of the phonemes as well as accurate detection of each pitch period. The voice signal is analysed directly using the amplitude-time representation. This system has been implemented on a mini-computer and works in real time. (author) [fr

  16. Objective Prediction of Hearing Aid Benefit Across Listener Groups Using Machine Learning: Speech Recognition Performance With Binaural Noise-Reduction Algorithms

    Science.gov (United States)

    Schädler, Marc R.; Warzybok, Anna; Kollmeier, Birger

    2018-01-01

    The simulation framework for auditory discrimination experiments (FADE) was adopted and validated to predict the individual speech-in-noise recognition performance of listeners with normal and impaired hearing with and without a given hearing-aid algorithm. FADE uses a simple automatic speech recognizer (ASR) to estimate the lowest achievable speech reception thresholds (SRTs) from simulated speech recognition experiments in an objective way, independent from any empirical reference data. Empirical data from the literature were used to evaluate the model in terms of predicted SRTs and benefits in SRT with the German matrix sentence recognition test when using eight single- and multichannel binaural noise-reduction algorithms. To allow individual predictions of SRTs in binaural conditions, the model was extended with a simple better ear approach and individualized by taking audiograms into account. In a realistic binaural cafeteria condition, FADE explained about 90% of the variance of the empirical SRTs for a group of normal-hearing listeners and predicted the corresponding benefits with a root-mean-square prediction error of 0.6 dB. This highlights the potential of the approach for the objective assessment of benefits in SRT without prior knowledge about the empirical data. The predictions for the group of listeners with impaired hearing explained 75% of the empirical variance, while the individual predictions explained less than 25%. Possibly, additional individual factors should be considered for more accurate predictions with impaired hearing. A competing talker condition clearly showed one limitation of current ASR technology, as the empirical performance with SRTs lower than −20 dB could not be predicted. PMID:29692200

  17. Objective Prediction of Hearing Aid Benefit Across Listener Groups Using Machine Learning: Speech Recognition Performance With Binaural Noise-Reduction Algorithms.

    Science.gov (United States)

    Schädler, Marc R; Warzybok, Anna; Kollmeier, Birger

    2018-01-01

    The simulation framework for auditory discrimination experiments (FADE) was adopted and validated to predict the individual speech-in-noise recognition performance of listeners with normal and impaired hearing with and without a given hearing-aid algorithm. FADE uses a simple automatic speech recognizer (ASR) to estimate the lowest achievable speech reception thresholds (SRTs) from simulated speech recognition experiments in an objective way, independent from any empirical reference data. Empirical data from the literature were used to evaluate the model in terms of predicted SRTs and benefits in SRT with the German matrix sentence recognition test when using eight single- and multichannel binaural noise-reduction algorithms. To allow individual predictions of SRTs in binaural conditions, the model was extended with a simple better ear approach and individualized by taking audiograms into account. In a realistic binaural cafeteria condition, FADE explained about 90% of the variance of the empirical SRTs for a group of normal-hearing listeners and predicted the corresponding benefits with a root-mean-square prediction error of 0.6 dB. This highlights the potential of the approach for the objective assessment of benefits in SRT without prior knowledge about the empirical data. The predictions for the group of listeners with impaired hearing explained 75% of the empirical variance, while the individual predictions explained less than 25%. Possibly, additional individual factors should be considered for more accurate predictions with impaired hearing. A competing talker condition clearly showed one limitation of current ASR technology, as the empirical performance with SRTs lower than -20 dB could not be predicted.

  18. Automated Recognition of Vegetation and Water Bodies on the Territory of Megacities in Satellite Images of Visible and IR Bands

    Science.gov (United States)

    Mozgovoy, Dmitry k.; Hnatushenko, Volodymyr V.; Vasyliev, Volodymyr V.

    2018-04-01

    Vegetation and water bodies are a fundamental element of urban ecosystems, and water mapping is critical for urban and landscape planning and management. A methodology of automated recognition of vegetation and water bodies on the territory of megacities in satellite images of sub-meter spatial resolution of the visible and IR bands is proposed. By processing multispectral images from the satellite SuperView-1A, vector layers of recognized plant and water objects were obtained. Analysis of the results of image processing showed a sufficiently high accuracy of the delineation of the boundaries of recognized objects and a good separation of classes. The developed methodology provides a significant increase of the efficiency and reliability of updating maps of large cities while reducing financial costs. Due to the high degree of automation, the proposed methodology can be implemented in the form of a geo-information web service functioning in the interests of a wide range of public services and commercial institutions.
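
    The record does not name the indices or thresholds used, so the sketch below falls back on conventional choices for visible plus IR imagery: NDVI for vegetation and NDWI for water, with illustrative thresholds and a tiny synthetic scene.

```python
import numpy as np

def classify_vegetation_water(red, green, nir, veg_thresh=0.3, water_thresh=0.2):
    """Per-pixel vegetation and water masks from standard spectral indices.

    NDVI = (NIR - Red) / (NIR + Red); NDWI = (Green - NIR) / (Green + NIR).
    Thresholds are conventional illustrative values, not the paper's."""
    ndvi = (nir - red) / (nir + red + 1e-9)
    ndwi = (green - nir) / (green + nir + 1e-9)
    return ndvi > veg_thresh, ndwi > water_thresh   # vegetation mask, water mask

if __name__ == "__main__":
    # Tiny synthetic 2x2 scene: [vegetation, water; bare soil, vegetation]
    red   = np.array([[0.05, 0.03], [0.20, 0.06]])
    green = np.array([[0.08, 0.10], [0.22, 0.09]])
    nir   = np.array([[0.50, 0.02], [0.25, 0.45]])
    veg, water = classify_vegetation_water(red, green, nir)
    print(veg)
    print(water)
```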

  19. Comparison of bimodal and bilateral cochlear implant users on speech recognition with competing talker, music perception, affective prosody discrimination, and talker identification.

    Science.gov (United States)

    Cullington, Helen E; Zeng, Fan-Gang

    2011-02-01

    Despite excellent performance in speech recognition in quiet, most cochlear implant users have great difficulty with speech recognition in noise, music perception, identifying tone of voice, and discriminating different talkers. This may be partly due to the pitch coding in cochlear implant speech processing. Most current speech processing strategies use only the envelope information; the temporal fine structure is discarded. One way to improve electric pitch perception is to use residual acoustic hearing via a hearing aid on the nonimplanted ear (bimodal hearing). This study aimed to test the hypothesis that bimodal users would perform better than bilateral cochlear implant users on tasks requiring good pitch perception. Four pitch-related tasks were used. 1. Hearing in Noise Test (HINT) sentences spoken by a male talker with a competing female, male, or child talker. 2. Montreal Battery of Evaluation of Amusia. This is a music test with six subtests examining pitch, rhythm and timing perception, and musical memory. 3. Aprosodia Battery. This has five subtests evaluating aspects of affective prosody and recognition of sarcasm. 4. Talker identification using vowels spoken by 10 different talkers (three men, three women, two boys, and two girls). Bilateral cochlear implant users were chosen as the comparison group. Thirteen bimodal and 13 bilateral adult cochlear implant users were recruited; all had good speech perception in quiet. There were no significant differences between the mean scores of the bimodal and bilateral groups on any of the tests, although the bimodal group did perform better than the bilateral group on almost all tests. Performance on the different pitch-related tasks was not correlated, meaning that if a subject performed one task well they would not necessarily perform well on another. The correlation between the bimodal users' hearing threshold levels in the aided ear and their performance on these tasks was weak. Although the bimodal cochlear

  20. Time-Frequency Feature Representation Using Multi-Resolution Texture Analysis and Acoustic Activity Detector for Real-Life Speech Emotion Recognition

    Directory of Open Access Journals (Sweden)

    Kun-Ching Wang

    2015-01-01

    Full Text Available The classification of emotional speech is mostly considered in speech-related research on human-computer interaction (HCI). In this paper, the purpose is to present a novel feature extraction based on multi-resolution texture image information (MRTII). The MRTII feature set is derived from multi-resolution texture analysis for characterization and classification of different emotions in a speech signal. The motivation is that emotions have different intensity values in different frequency bands. In terms of human visual perception, the multi-resolution texture properties of the emotional speech spectrogram should be a good feature set for emotion classification in speech. Furthermore, multi-resolution texture analysis can give a clearer discrimination between emotions than uniform-resolution texture analysis. In order to provide high accuracy of emotional discrimination, especially in real life, an acoustic activity detection (AAD) algorithm must be applied in the MRTII-based feature extraction. Considering the presence of many blended emotions in real life, this paper makes use of two corpora of naturally-occurring dialogs recorded in real-life call centers. Compared with the traditional Mel-scale Frequency Cepstral Coefficients (MFCC) and state-of-the-art features, the MRTII features can also improve the correct classification rates of the proposed systems among different language databases. Experimental results show that the proposed MRTII-based feature information, inspired by human visual perception of the spectrogram image, can provide significant classification for real-life emotional recognition in speech.

  1. An automated approach for annual layer counting in ice cores

    DEFF Research Database (Denmark)

    Winstrup, Mai; Svensson, A. M.; Rasmussen, S. O.

    2012-01-01

    A novel method for automated annual layer counting in seasonally-resolved paleoclimate records has been developed. It relies on algorithms from the statistical framework of Hidden Markov Models (HMMs), which originally was developed for use in machine speech-recognition. The strength of the layer...

  2. An automated approach for annual layer counting in ice cores

    DEFF Research Database (Denmark)

    Winstrup, Mai; Svensson, A. M.; Rasmussen, S. O.

    2012-01-01

    A novel method for automated annual layer counting in seasonally-resolved paleoclimate records has been developed. It relies on algorithms from the statistical framework of hidden Markov models (HMMs), which originally was developed for use in machine speech recognition. The strength of the layer...

  3. Speech Recognition with the Advanced Combination Encoder and Transient Emphasis Spectral Maxima Strategies in Nucleus 24 Recipients

    Science.gov (United States)

    Holden, Laura K.; Vandali, Andrew E.; Skinner, Margaret W.; Fourakis, Marios S.; Holden, Timothy A.

    2005-01-01

    One of the difficulties faced by cochlear implant (CI) recipients is perception of low-intensity speech cues. A. E. Vandali (2001) has developed the transient emphasis spectral maxima (TESM) strategy to amplify short-duration, low-level sounds. The aim of the present study was to determine whether speech scores would be significantly higher with…

  4. Recognition

    DEFF Research Database (Denmark)

    Gimmler, Antje

    2017-01-01

    In this article, I shall examine the cognitive, heuristic and theoretical functions of the concept of recognition. To evaluate both the explanatory power and the limitations of a sociological concept, the theory construction must be analysed and its actual productivity for sociological theory mus...

  5. Study of the Ability of Articulation Index (AI) for Predicting the Unaided and Aided Speech Recognition Performance of 25 to 65 Years Old Hearing-Impaired Adults

    Directory of Open Access Journals (Sweden)

    Ghasem Mohammad Khani

    2001-05-01

    Full Text Available Background: In recent years there has been increased interest in the use of the AI for assessing hearing handicap and for measuring the potential effectiveness of amplification systems. The AI is an expression of the proportion of the average speech signal that is audible to a given patient, and it can vary between 0.0 and 1.0. Method and Materials: This cross-sectional analytical study was carried out in the department of audiology, rehabilitation faculty, IUMS, from 31 Oct 1998 to 7 March 1999, on 40 normal-hearing persons (80 ears; 19 males and 21 females) and 40 hearing-impaired persons (61 ears; 36 males and 25 females), 25-65 years old, with moderate to moderately severe SNHL. The Pavlovic procedure (1988) for calculating the AI, open-set taped standard monosyllabic word lists, and a real-ear probe-tube microphone system to measure insertion gain were used, with test-retest. Results: 1) A significant correlation was shown between the AI scores and the speech recognition scores of the normal-hearing and hearing-impaired groups with and without the hearing aid (P<0.05); 2) there were no significant differences across age groups or sex; 3) nor in test-retest measures of the insertion gain in each test; and 4) no significant difference in test-retest speech recognition scores. Conclusion: According to these results the AI can predict the unaided and aided monosyllabic recognition test scores very well, and age and sex variables have no effect on its ability. Therefore, with respect to the high reliability of the AI results and its simplicity, ease of use, cost effectiveness, and the little time needed for its calculation, wide use of the AI is recommended, especially in clinical situations.
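
    As a rough numeric illustration of what an articulation index represents (a band-importance-weighted proportion of audible speech), the sketch below uses invented octave-band importance weights and a simplified audibility rule rather than the actual Pavlovic (1988) procedure.

        import numpy as np

        # Octave bands and illustrative importance weights (not the Pavlovic values); weights sum to 1.0.
        BANDS_HZ   = [250, 500, 1000, 2000, 4000]
        IMPORTANCE = [0.10, 0.20, 0.25, 0.25, 0.20]

        def articulation_index(thresholds_db_hl, speech_level_db_hl=50.0, dynamic_range=30.0):
            """Weighted audible fraction of a 30 dB speech range; result lies between 0.0 and 1.0."""
            ai = 0.0
            for weight, threshold in zip(IMPORTANCE, thresholds_db_hl):
                audible = (speech_level_db_hl + dynamic_range / 2 - threshold) / dynamic_range
                ai += weight * float(np.clip(audible, 0.0, 1.0))
            return ai

        # Example: a sloping moderate sensorineural loss across the five bands above
        print(round(articulation_index([30, 35, 45, 55, 60]), 2))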

  6. The Use of Lexical Neighborhood Test (LNT) in the Assessment of Speech Recognition Performance of Cochlear Implantees with Normal and Malformed Cochlea.

    Science.gov (United States)

    Kant, Anjali R; Banik, Arun A

    2017-09-01

    The present study aims to use the model-based Lexical Neighborhood Test (LNT) to assess speech recognition performance in early- and late-implanted hearing-impaired children with normal and malformed cochleae. The LNT was administered to 46 children with congenital (prelingual) bilateral severe-profound sensorineural hearing loss using the Nucleus 24 cochlear implant. The children were grouped into Group 1 (early implantees with normal cochlea-EI): n = 15; 3½-6½ years of age; mean age at implantation 3½ years. Group 2 (late implantees with normal cochlea-LI): n = 15; 6-12 years of age; mean age at implantation 5 years. Group 3 (early implantees with malformed cochlea-EIMC): n = 9; 4.9-10.6 years of age; mean age at implantation 3.10 years. Group 4 (late implantees with malformed cochlea-LIMC): n = 7; 7-12.6 years of age; mean age at implantation 6.3 years. The malformations were: dysplastic cochlea, common cavity, Mondini's, incomplete partition-1 and 2 (IP-1 and 2), and enlarged IAC. The children were instructed to repeat the words on hearing them. Means of the word and phoneme scores were computed. The LNT can also be used to assess the speech recognition performance of hearing-impaired children with malformed cochleae. When both the easy and hard lists of the LNT are considered, although late implantees (with or without a normal cochlea) achieved higher word scores than early implantees, the differences are not statistically significant. Using the LNT for assessing speech recognition enables a quantitative as well as descriptive report of the phonological processes used by the children.

  7. Probabilistic Neural Networks for Chemical Sensor Array Pattern Recognition: Comparison Studies, Improvements and Automated Outlier Rejection

    National Research Council Canada - National Science Library

    Shaffer, Ronald E

    1998-01-01

    For application to chemical sensor arrays, the ideal pattern recognition method is accurate, fast, simple to train, robust to outliers, has low memory requirements, and has the ability to produce a measure...

  8. Towards Automation 2.0: A Neurocognitive Model for Environment Recognition, Decision-Making, and Action Execution

    Directory of Open Access Journals (Sweden)

    Zucker Gerhard

    2011-01-01

    Full Text Available The ongoing penetration of building automation by information technology is by far not saturated. Today's systems must not only be reliable and fault tolerant; they also have to take account of energy efficiency and flexibility in overall consumption. Meeting the quality and comfort goals in building automation while at the same time optimizing for energy, carbon footprint and cost-efficiency requires systems that are able to handle large amounts of information and negotiate system behaviour that resolves conflicting demands: a decision-making process. In recent years, research has started to focus on bionic principles for designing new concepts in this area. The information processing principles of the human mind have turned out to be of particular interest, as the mind is capable of processing huge amounts of sensory data and taking adequate decisions for (re-)actions based on these analysed data. In this paper, we discuss how a bionic approach can solve the upcoming problems of energy-optimal systems. A recently developed model for environment recognition and decision-making processes, which is based on research findings from different disciplines of brain research, is introduced. This model is the foundation for applications in intelligent building automation that have to deal with information from home and office environments. All of these applications have in common that they consist of a combination of communicating nodes and have many, partly contradicting goals.

  9. Thai Automatic Speech Recognition

    National Research Council Canada - National Science Library

    Suebvisai, Sinaporn; Charoenpornsawat, Paisarn; Black, Alan; Woszczyna, Monika; Schultz, Tanja

    2005-01-01

    .... We focus on the discussion of the rapid deployment of ASR for Thai under limited time and data resources, including rapid data collection issues, acoustic model bootstrap, and automatic generation of pronunciations...

  10. Current trends in multilingual speech processing

    Indian Academy of Sciences (India)

    2016-08-26

    speech-to-speech translation; language identification. ... interest owing to two strong driving forces. Firstly, technical advances in speech recognition and synthesis are posing new challenges and opportunities to researchers.

  11. Detection of target phonemes in spontaneous and read speech

    NARCIS (Netherlands)

    Mehta, G.; Cutler, A.

    1988-01-01

    Although spontaneous speech occurs more frequently in most listeners' experience than read speech, laboratory studies of human speech recognition typically use carefully controlled materials read from a script. The phonological and prosodic characteristics of spontaneous and read speech differ

  12. Invention and validation of an automated camera system that uses optical character recognition to identify patient name mislabeled samples.

    Science.gov (United States)

    Hawker, Charles D; McCarthy, William; Cleveland, David; Messinger, Bonnie L

    2014-03-01

    Mislabeled samples are a serious problem in most clinical laboratories. Published error rates range from 0.39/1000 to as high as 1.12%. Standardization of bar codes and label formats has not yet achieved the needed improvement. The mislabel rate in our laboratory, although low compared with published rates, prompted us to seek a solution to achieve zero errors. To reduce or eliminate our mislabeled samples, we invented an automated device using 4 cameras to photograph the outside of a sample tube. The system uses optical character recognition (OCR) to look for discrepancies between the patient name in our laboratory information system (LIS) vs the patient name on the customer label. All discrepancies detected by the system's software then require human inspection. The system was installed on our automated track and validated with production samples. We obtained 1 009 830 images during the validation period, and every image was reviewed. OCR passed approximately 75% of the samples, and no mislabeled samples were passed. The 25% failed by the system included 121 samples actually mislabeled by patient name and 148 samples with spelling discrepancies between the patient name on the customer label and the patient name in our LIS. Only 71 of the 121 mislabeled samples detected by OCR were found through our normal quality assurance process. We have invented an automated camera system that uses OCR technology to identify potential mislabeled samples. We have validated this system using samples transported on our automated track. Full implementation of this technology offers the possibility of zero mislabeled samples in the preanalytic stage.
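
    A minimal sketch of the kind of comparison such a system performs once an OCR result is available: normalise the two name strings and flag any pair whose similarity falls below a threshold for human review. The helper names, the difflib-based similarity measure and the threshold are illustrative assumptions, not the validated system's logic.

        import re
        from difflib import SequenceMatcher

        def normalize(name):
            """Upper-case, strip punctuation and collapse whitespace."""
            return re.sub(r"\s+", " ", re.sub(r"[^A-Za-z ]", " ", name)).strip().upper()

        def review_required(lis_name, ocr_name, threshold=0.9):
            """Flag the sample for human inspection unless the names match closely."""
            ratio = SequenceMatcher(None, normalize(lis_name), normalize(ocr_name)).ratio()
            return ratio < threshold

        print(review_required("Doe, Jane A.", "DOE JANE A"))    # False: passes automatically
        print(review_required("Doe, John A.", "DOE JANE A"))    # True: routed to a human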

  13. Auditory and Non-Auditory Contributions for Unaided Speech Recognition in Noise as a Function of Hearing Aid Use

    OpenAIRE

    Gieseler, Anja; Tahden, Maike A. S.; Thiel, Christiane M.; Wagener, Kirsten C.; Meis, Markus; Colonius, Hans

    2017-01-01

    Differences in understanding speech in noise among hearing-impaired individuals cannot be explained entirely by hearing thresholds alone, suggesting the contribution of other factors beyond standard auditory ones as derived from the audiogram. This paper reports two analyses addressing individual differences in the explanation of unaided speech-in-noise performance among n = 438 elderly hearing-impaired listeners (mean = 71.1 ± 5.8 years). The main analysis was designed to identify clinically...

  14. Is talking to an automated teller machine natural and fun?

    Science.gov (United States)

    Chan, F Y; Khalid, H M

    Usability and affective issues of using automatic speech recognition technology to interact with an automated teller machine (ATM) are investigated in two experiments. The first uncovered dialogue patterns of ATM users for the purpose of designing the user interface for a simulated speech ATM system. Applying the Wizard-of-Oz methodology, multiple mapping and word spotting techniques, the speech-driven ATM accommodates bilingual users of Bahasa Melayu and English. The second experiment evaluates the usability of a hybrid speech ATM, comparing it with a simulated manual ATM. The aim is to investigate how natural and fun talking to a speech ATM can be for these first-time users. Subjects performed withdrawal and balance enquiry tasks. ANOVA was performed on the usability and affective data. The results showed significant differences between systems in the ability to complete the tasks as well as in transaction errors. Performance was measured by the time taken by subjects to complete the task and the number of speech recognition errors that occurred. On the basis of user emotions, it can be said that the hybrid speech system enabled pleasurable interaction. Despite the limitations of speech recognition technology, users are set to talk to the ATM when it becomes available for public use.

  15. Forensic speaker recognition

    NARCIS (Netherlands)

    Meuwly, Didier

    2013-01-01

    The aim of forensic speaker recognition is to establish links between individuals and criminal activities, through audio speech recordings. This field is multidisciplinary, combining predominantly phonetics, linguistics, speech signal processing, and forensic statistics. On these bases, expert-based

  16. Expanding the functionality of speech recognition in radiology: creating a real-time methodology for measurement and analysis of occupational stress and fatigue.

    Science.gov (United States)

    Reiner, Bruce I

    2013-02-01

    While occupational stress and fatigue have been well described throughout medicine, the radiology community is particularly susceptible due to declining reimbursements, heightened demands for service deliverables, and increasing exam volume and complexity. The resulting occupational stress can be variable in nature and dependent upon a number of intrinsic and extrinsic stressors. Intrinsic stressors largely account for inter-radiologist stress variability and relate to unique attributes of the radiologist such as personality, emotional state, education/training, and experience. Extrinsic stressors may account for intra-radiologist stress variability and include cumulative workload and task complexity. The creation of personalized stress profiles creates a mechanism for accounting for both inter- and intra-radiologist stress variability, which is essential in creating customizable stress intervention strategies. One viable option for real-time occupational stress measurement is voice stress analysis, which can be directly implemented through existing speech recognition technology and has been proven to be effective in stress measurement and analysis outside of medicine. This technology operates by detecting stress in the acoustic properties of speech through a number of different variables including duration, glottis source factors, pitch distribution, spectral structure, and intensity. The correlation of these speech derived stress measures with outcomes data can be used to determine the user-specific inflection point at which stress becomes detrimental to clinical performance.
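
    A hedged sketch of the sort of acoustic measurements such an approach relies on: per-utterance summaries of duration, intensity and a crude autocorrelation-based pitch estimate, which could then be correlated with workload or outcome data. This is not an implementation of any specific voice-stress-analysis product; names and parameter values are illustrative.

        import numpy as np

        def frame_signal(x, fs, frame_ms=30, hop_ms=10):
            n, h = int(fs * frame_ms / 1000), int(fs * hop_ms / 1000)
            return np.array([x[i:i + n] for i in range(0, len(x) - n, h)])

        def pitch_autocorr(frame, fs, fmin=75, fmax=300):
            """Very rough fundamental-frequency estimate from the autocorrelation peak."""
            frame = frame - frame.mean()
            ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
            lo, hi = int(fs / fmax), min(int(fs / fmin), len(ac) - 1)
            return fs / (lo + np.argmax(ac[lo:hi]))

        def stress_features(x, fs):
            frames = frame_signal(x, fs)
            intensity = np.sqrt((frames ** 2).mean(axis=1))          # RMS energy per frame
            pitch = np.array([pitch_autocorr(f, fs) for f in frames])
            return {"duration_s": len(x) / fs,
                    "intensity_mean": float(intensity.mean()),
                    "intensity_sd": float(intensity.std()),
                    "pitch_median": float(np.median(pitch)),
                    "pitch_iqr": float(np.subtract(*np.percentile(pitch, [75, 25])))}

        fs = 16000
        x = np.random.randn(2 * fs)        # stand-in for one dictated report segment
        print(stress_features(x, fs))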

  17. PLAYER POSITION DETECTION AND MOVEMENT PATTERN RECOGNITION FOR AUTOMATED TACTICAL ANALYSIS IN BADMINTON

    OpenAIRE

    KOKUM GAYANATH WEERATUNGA

    2018-01-01

    This thesis documents the development of a comprehensive approach to automate badminton tactical analysis. First, a computer algorithm was developed to automatically track badminton players moving on a court. Next, a machine learning algorithm was developed to analyse these movements and understand their underlying tactical implications. Both algorithms were tested and validated using video footage recorded at International badminton tournaments. The results demonstrate that the combination o...

  18. Automated recognition of cell phenotypes in histology images based on membrane- and nuclei-targeting biomarkers

    International Nuclear Information System (INIS)

    Karaçalı, Bilge; Vamvakidou, Alexandra P; Tözeren, Aydın

    2007-01-01

    Three-dimensional in vitro cultures of cancer cells are used to predict the effects of prospective anti-cancer drugs in vivo. In this study, we present an automated image analysis protocol for detailed morphological protein-marker profiling of tumoroid cross-section images. Histologic cross sections of breast tumoroids developed in co-culture suspensions of breast cancer cell lines, stained for E-cadherin and progesterone receptor, were digitized, and pixels in these images were classified into five categories using k-means clustering. Automated segmentation was used to identify image regions composed of cells expressing a given biomarker. Synthesized images were created to check the accuracy of the image processing system. The accuracy of automated segmentation was over 95% in identifying regions of interest in synthesized images. Image analysis of adjacent histology slides stained, respectively, for Ecad and PR accurately predicted regions of different cell phenotypes. Image analysis of tumoroid cross sections from different tumoroids obtained under the same co-culture conditions indicated the variation of cellular composition from one tumoroid to another. Variations in the compositions of cross sections obtained from the same tumoroid were established by parallel analysis of Ecad- and PR-stained cross-section images. The proposed image analysis methods offer standardized high-throughput profiling of the molecular anatomy of tumoroids based on both membrane and nuclei markers that is suitable for rapid large-scale investigations of anti-cancer compounds for drug development.

  19. Multilevel Analysis in Analyzing Speech Data

    Science.gov (United States)

    Guddattu, Vasudeva; Krishna, Y.

    2011-01-01

    The speech produced by human vocal tract is a complex acoustic signal, with diverse applications in phonetics, speech synthesis, automatic speech recognition, speaker identification, communication aids, speech pathology, speech perception, machine translation, hearing research, rehabilitation and assessment of communication disorders and many…

  20. Self-Assessed Hearing Handicap in Older Adults with Poorer-than-Predicted Speech Recognition in Noise

    Science.gov (United States)

    Eckert, Mark A.; Matthews, Lois J.; Dubno, Judy R.

    2017-01-01

    Purpose: Even older adults with relatively mild hearing loss report hearing handicap, suggesting that hearing handicap is not completely explained by reduced speech audibility. Method: We examined the extent to which self-assessed ratings of hearing handicap using the Hearing Handicap Inventory for the Elderly (HHIE; Ventry & Weinstein, 1982)…

  1. Automated recognition and tracking of aerosol threat plumes with an IR camera pod

    Science.gov (United States)

    Fauth, Ryan; Powell, Christopher; Gruber, Thomas; Clapp, Dan

    2012-06-01

    Protection of fixed sites from chemical, biological, or radiological aerosol plume attacks depends on early warning so that there is time to take mitigating actions. Early warning requires continuous, autonomous, and rapid coverage of large surrounding areas; however, this must be done at an affordable cost. Once a potential threat plume is detected though, a different type of sensor (e.g., a more expensive, slower sensor) may be cued for identification purposes, but the problem is to quickly identify all of the potential threats around the fixed site of interest. To address this problem of low cost, persistent, wide area surveillance, an IR camera pod and multi-image stitching and processing algorithms have been developed for automatic recognition and tracking of aerosol plumes. A rugged, modular, static pod design, which accommodates as many as four micro-bolometer IR cameras for 45° to 180° of azimuth coverage, is presented. Various OpenCV-based image-processing algorithms, including stitching of multiple adjacent FOVs, recognition of aerosol plume objects, and the tracking of aerosol plumes, are presented using process block diagrams and sample field test results, including chemical and biological simulant plumes. Methods for dealing with background removal, brightness equalization between images, and focus quality for optimal plume tracking are also discussed.
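
    A minimal OpenCV sketch of the kind of processing chain described (brightness equalization, background subtraction on IR frames, and extraction of candidate plume regions). It is a hedged illustration, not the deployed pod's algorithm; the video source name, threshold and minimum-area values are assumptions, and OpenCV 4.x is assumed for the two-value findContours return.

        import cv2

        # Background model for slowly changing IR scenes; parameter values are illustrative.
        subtractor = cv2.createBackgroundSubtractorMOG2(history=500, varThreshold=16, detectShadows=False)
        kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (3, 3))

        cap = cv2.VideoCapture("ir_camera_stream.avi")     # hypothetical stitched IR footage
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            gray = cv2.equalizeHist(gray)                   # crude brightness equalization
            mask = subtractor.apply(gray)                   # foreground = recently changed pixels
            mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel, iterations=2)
            contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
            for c in contours:
                if cv2.contourArea(c) > 200:                # ignore small noise blobs
                    x, y, w, h = cv2.boundingRect(c)
                    cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 0, 255), 2)
            cv2.imshow("candidate plumes", frame)
            if cv2.waitKey(1) == 27:                        # Esc to quit
                break
        cap.release()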

  2. Automated Field-of-View, Illumination, and Recognition Algorithm Design of a Vision System for Pick-and-Place Considering Colour Information in Illumination and Images.

    Science.gov (United States)

    Chen, Yibing; Ogata, Taiki; Ueyama, Tsuyoshi; Takada, Toshiyuki; Ota, Jun

    2018-05-22

    Machine vision is playing an increasingly important role in industrial applications, and the automated design of image recognition systems has been a subject of intense research. This study has proposed a system for automatically designing the field-of-view (FOV) of a camera, the illumination strength and the parameters in a recognition algorithm. We formulated the design problem as an optimisation problem and used an experiment based on a hierarchical algorithm to solve it. The evaluation experiments using translucent plastics objects showed that the use of the proposed system resulted in an effective solution with a wide FOV, recognition of all objects and 0.32 mm and 0.4° maximal positional and angular errors when all the RGB (red, green and blue) for illumination and R channel image for recognition were used. Though all the RGB illumination and grey scale images also provided recognition of all the objects, only a narrow FOV was selected. Moreover, full recognition was not achieved by using only G illumination and a grey-scale image. The results showed that the proposed method can automatically design the FOV, illumination and parameters in the recognition algorithm and that tuning all the RGB illumination is desirable even when single-channel or grey-scale images are used for recognition.

  3. Sensor data fusion for automated threat recognition in manned-unmanned infantry platoons

    Science.gov (United States)

    Wildt, J.; Varela, M.; Ulmke, M.; Brüggermann, B.

    2017-05-01

    To support a dismounted infantry platoon during deployment we team it with several unmanned aerial and ground vehicles (UAV and UGV, respectively). The unmanned systems integrate seamlessly into the infantry platoon, providing automated reconnaissance during movement while keeping formation as well as conducting close range reconnaissance during halt. The sensor data each unmanned system provides is continuously analyzed in real time by specialized algorithms, detecting humans in live videos of UAV mounted infrared cameras as well as gunshot detection and bearing by acoustic sensors. All recognized threats are fused into a consistent situational picture in real time, available to platoon and squad leaders as well as higher level command and control (C2) systems. This gives friendly forces local information superiority and increased situational awareness without the need to constantly monitor the unmanned systems and sensor data.

  4. Application of neural network and pattern recognition software to the automated analysis of continuous nuclear monitoring of on-load reactors

    Energy Technology Data Exchange (ETDEWEB)

    Howell, J.A.; Eccleston, G.W.; Halbig, J.K.; Klosterbuer, S.F. [Los Alamos National Lab., NM (United States); Larson, T.W. [California Polytechnic State Univ., San Luis Obispo, CA (US)

    1993-08-01

    Automated analysis using pattern recognition and neural network software can help interpret data, call attention to potential anomalies, and improve safeguards effectiveness. Automated software analysis, based on pattern recognition and neural networks, was applied to data collected from a radiation core discharge monitor system located adjacent to an on-load reactor core. Unattended radiation sensors continuously collect data to monitor on-line refueling operations in the reactor. The huge volume of data collected from a number of radiation channels makes it difficult for a safeguards inspector to review it all, check for consistency among the measurement channels, and find anomalies. Pattern recognition and neural network software can analyze large volumes of data from continuous, unattended measurements, thereby improving and automating the detection of anomalies. The authors developed a prototype pattern recognition program that determines the reactor power level and identifies the times when fuel bundles are pushed through the core during on-line refueling. Neural network models were also developed to predict fuel bundle burnup to calculate the region on the on-load reactor face from which fuel bundles were discharged based on the radiation signals. In the preliminary data set, which was limited and consisted of four distinct burnup regions, the neural network model correctly predicted the burnup region with an accuracy of 92%.

  5. Does computer-synthesized speech manifest personality? Experimental tests of recognition, similarity-attraction, and consistency-attraction.

    Science.gov (United States)

    Nass, C; Lee, K M

    2001-09-01

    Would people exhibit similarity-attraction and consistency-attraction toward unambiguously computer-generated speech even when personality is clearly not relevant? In Experiment 1, participants (extrovert or introvert) heard a synthesized voice (extrovert or introvert) on a book-buying Web site. Participants accurately recognized personality cues in text to speech and showed similarity-attraction in their evaluation of the computer voice, the book reviews, and the reviewer. Experiment 2, in a Web auction context, added personality of the text to the previous design. The results replicated Experiment 1 and demonstrated consistency (voice and text personality)-attraction. To maximize liking and trust, designers should set parameters, for example, words per minute or frequency range, that create a personality that is consistent with the user and the content being presented.

  6. Automated alignment system for optical wireless communication systems using image recognition.

    Science.gov (United States)

    Brandl, Paul; Weiss, Alexander; Zimmermann, Horst

    2014-07-01

    In this Letter, we describe the realization of a tracked line-of-sight optical wireless communication system for indoor data distribution. We built a laser-based transmitter with adaptive focus and ray steering by a microelectromechanical systems mirror. To execute the alignment procedure, we used a CMOS image sensor at the transmitter side and developed an algorithm for image recognition to localize the receiver's position. The receiver is based on a self-developed optoelectronic integrated chip with low requirements on the receiver optics to make the system economically attractive. With this system, we were able to set up the communication link automatically without any back channel and to perform error-free (bit error rate <10⁻⁹) data transmission over a distance of 3.5 m with a data rate of 3 Gbit/s.

  7. The Pandora multi-algorithm approach to automated pattern recognition of cosmic-ray muon and neutrino events in the MicroBooNE detector

    Energy Technology Data Exchange (ETDEWEB)

    Acciarri, R.; Bagby, L.; Baller, B.; Carls, B.; Castillo Fernandez, R.; Cavanna, F.; Greenlee, H.; James, C.; Jostlein, H.; Ketchum, W.; Kirby, M.; Kobilarcik, T.; Lockwitz, S.; Lundberg, B.; Marchionni, A.; Moore, C.D.; Palamara, O.; Pavlovic, Z.; Raaf, J.L.; Schukraft, A.; Snider, E.L.; Spentzouris, P.; Strauss, T.; Toups, M.; Wolbers, S.; Yang, T.; Zeller, G.P. [Fermi National Accelerator Laboratory (FNAL), Batavia, IL (United States); Adams, C. [Harvard University, Cambridge, MA (United States); Yale University, New Haven, CT (United States); An, R.; Littlejohn, B.R.; Martinez Caicedo, D.A. [Illinois Institute of Technology (IIT), Chicago, IL (United States); Anthony, J.; Escudero Sanchez, L.; De Vries, J.J.; Marshall, J.; Smith, A.; Thomson, M. [University of Cambridge, Cambridge (United Kingdom); Asaadi, J. [University of Texas, Arlington, TX (United States); Auger, M.; Ereditato, A.; Goeldi, D.; Kreslo, I.; Lorca, D.; Luethi, M.; Rudolf von Rohr, C.; Sinclair, J.; Weber, M. [Universitaet Bern, Bern (Switzerland); Balasubramanian, S.; Fleming, B.T.; Gramellini, E.; Hackenburg, A.; Luo, X.; Russell, B.; Tufanli, S. [Yale University, New Haven, CT (United States); Barnes, C.; Mousseau, J.; Spitz, J. [University of Michigan, Ann Arbor, MI (United States); Barr, G.; Bass, M.; Del Tutto, M.; Laube, A.; Soleti, S.R.; De Pontseele, W.V. [University of Oxford, Oxford (United Kingdom); Bay, F. [TUBITAK Space Technologies Research Institute, Ankara (Turkey); Bishai, M.; Chen, H.; Joshi, J.; Kirby, B.; Li, Y.; Mooney, M.; Qian, X.; Viren, B.; Zhang, C. [Brookhaven National Laboratory (BNL), Upton, NY (United States); Blake, A.; Devitt, D.; Lister, A.; Nowak, J. [Lancaster University, Lancaster (United Kingdom); Bolton, T.; Horton-Smith, G.; Meddage, V.; Rafique, A. [Kansas State University (KSU), Manhattan, KS (United States); Camilleri, L.; Caratelli, D.; Crespo-Anadon, J.I.; Fadeeva, A.A.; Genty, V.; Kaleko, D.; Seligman, W.; Shaevitz, M.H. [Columbia University, New York, NY (United States); Church, E. [Pacific Northwest National Laboratory (PNNL), Richland, WA (United States); Cianci, D.; Karagiorgi, G. [Columbia University, New York, NY (United States); The University of Manchester (United Kingdom); Cohen, E.; Piasetzky, E. [Tel Aviv University, Tel Aviv (Israel); Collin, G.H.; Conrad, J.M.; Hen, O.; Hourlier, A.; Moon, J.; Wongjirad, T.; Yates, L. [Massachusetts Institute of Technology (MIT), Cambridge, MA (United States); Convery, M.; Eberly, B.; Rochester, L.; Tsai, Y.T.; Usher, T. [SLAC National Accelerator Laboratory, Menlo Park, CA (United States); Dytman, S.; Graf, N.; Jiang, L.; Naples, D.; Paolone, V.; Wickremasinghe, D.A. [University of Pittsburgh, Pittsburgh, PA (United States); Esquivel, J.; Hamilton, P.; Pulliam, G.; Soderberg, M. [Syracuse University, Syracuse, NY (United States); Foreman, W.; Ho, J.; Schmitz, D.W.; Zennamo, J. [University of Chicago, IL (United States); Furmanski, A.P.; Garcia-Gamez, D.; Hewes, J.; Hill, C.; Murrells, R.; Porzio, D.; Soeldner-Rembold, S.; Szelc, A.M. [The University of Manchester (United Kingdom); Garvey, G.T.; Huang, E.C.; Louis, W.C.; Mills, G.B.; De Water, R.G.V. [Los Alamos National Laboratory (LANL), Los Alamos, NM (United States); Gollapinni, S. [Kansas State University (KSU), Manhattan, KS (United States); University of Tennessee, Knoxville, TN (United States); and others

    2018-01-15

    The development and operation of liquid-argon time-projection chambers for neutrino physics has created a need for new approaches to pattern recognition in order to fully exploit the imaging capabilities offered by this technology. Whereas the human brain can excel at identifying features in the recorded events, it is a significant challenge to develop an automated, algorithmic solution. The Pandora Software Development Kit provides functionality to aid the design and implementation of pattern-recognition algorithms. It promotes the use of a multi-algorithm approach to pattern recognition, in which individual algorithms each address a specific task in a particular topology. Many tens of algorithms then carefully build up a picture of the event and, together, provide a robust automated pattern-recognition solution. This paper describes details of the chain of over one hundred Pandora algorithms and tools used to reconstruct cosmic-ray muon and neutrino events in the MicroBooNE detector. Metrics that assess the current pattern-recognition performance are presented for simulated MicroBooNE events, using a selection of final-state event topologies. (orig.)

  8. The Pandora multi-algorithm approach to automated pattern recognition of cosmic-ray muon and neutrino events in the MicroBooNE detector

    Science.gov (United States)

    Acciarri, R.; Adams, C.; An, R.; Anthony, J.; Asaadi, J.; Auger, M.; Bagby, L.; Balasubramanian, S.; Baller, B.; Barnes, C.; Barr, G.; Bass, M.; Bay, F.; Bishai, M.; Blake, A.; Bolton, T.; Camilleri, L.; Caratelli, D.; Carls, B.; Castillo Fernandez, R.; Cavanna, F.; Chen, H.; Church, E.; Cianci, D.; Cohen, E.; Collin, G. H.; Conrad, J. M.; Convery, M.; Crespo-Anadón, J. I.; Del Tutto, M.; Devitt, D.; Dytman, S.; Eberly, B.; Ereditato, A.; Escudero Sanchez, L.; Esquivel, J.; Fadeeva, A. A.; Fleming, B. T.; Foreman, W.; Furmanski, A. P.; Garcia-Gamez, D.; Garvey, G. T.; Genty, V.; Goeldi, D.; Gollapinni, S.; Graf, N.; Gramellini, E.; Greenlee, H.; Grosso, R.; Guenette, R.; Hackenburg, A.; Hamilton, P.; Hen, O.; Hewes, J.; Hill, C.; Ho, J.; Horton-Smith, G.; Hourlier, A.; Huang, E.-C.; James, C.; Jan de Vries, J.; Jen, C.-M.; Jiang, L.; Johnson, R. A.; Joshi, J.; Jostlein, H.; Kaleko, D.; Karagiorgi, G.; Ketchum, W.; Kirby, B.; Kirby, M.; Kobilarcik, T.; Kreslo, I.; Laube, A.; Li, Y.; Lister, A.; Littlejohn, B. R.; Lockwitz, S.; Lorca, D.; Louis, W. C.; Luethi, M.; Lundberg, B.; Luo, X.; Marchionni, A.; Mariani, C.; Marshall, J.; Martinez Caicedo, D. A.; Meddage, V.; Miceli, T.; Mills, G. B.; Moon, J.; Mooney, M.; Moore, C. D.; Mousseau, J.; Murrells, R.; Naples, D.; Nienaber, P.; Nowak, J.; Palamara, O.; Paolone, V.; Papavassiliou, V.; Pate, S. F.; Pavlovic, Z.; Piasetzky, E.; Porzio, D.; Pulliam, G.; Qian, X.; Raaf, J. L.; Rafique, A.; Rochester, L.; Rudolf von Rohr, C.; Russell, B.; Schmitz, D. W.; Schukraft, A.; Seligman, W.; Shaevitz, M. H.; Sinclair, J.; Smith, A.; Snider, E. L.; Soderberg, M.; Söldner-Rembold, S.; Soleti, S. R.; Spentzouris, P.; Spitz, J.; St. John, J.; Strauss, T.; Szelc, A. M.; Tagg, N.; Terao, K.; Thomson, M.; Toups, M.; Tsai, Y.-T.; Tufanli, S.; Usher, T.; Van De Pontseele, W.; Van de Water, R. G.; Viren, B.; Weber, M.; Wickremasinghe, D. A.; Wolbers, S.; Wongjirad, T.; Woodruff, K.; Yang, T.; Yates, L.; Zeller, G. P.; Zennamo, J.; Zhang, C.

    2018-01-01

    The development and operation of liquid-argon time-projection chambers for neutrino physics has created a need for new approaches to pattern recognition in order to fully exploit the imaging capabilities offered by this technology. Whereas the human brain can excel at identifying features in the recorded events, it is a significant challenge to develop an automated, algorithmic solution. The Pandora Software Development Kit provides functionality to aid the design and implementation of pattern-recognition algorithms. It promotes the use of a multi-algorithm approach to pattern recognition, in which individual algorithms each address a specific task in a particular topology. Many tens of algorithms then carefully build up a picture of the event and, together, provide a robust automated pattern-recognition solution. This paper describes details of the chain of over one hundred Pandora algorithms and tools used to reconstruct cosmic-ray muon and neutrino events in the MicroBooNE detector. Metrics that assess the current pattern-recognition performance are presented for simulated MicroBooNE events, using a selection of final-state event topologies.

  9. Reliability of an Automated High-Resolution Manometry Analysis Program across Expert Users, Novice Users, and Speech-Language Pathologists

    Science.gov (United States)

    Jones, Corinne A.; Hoffman, Matthew R.; Geng, Zhixian; Abdelhalim, Suzan M.; Jiang, Jack J.; McCulloch, Timothy M.

    2014-01-01

    Purpose: The purpose of this study was to investigate inter- and intrarater reliability among expert users, novice users, and speech-language pathologists with a semiautomated high-resolution manometry analysis program. We hypothesized that all users would have high intrarater reliability and high interrater reliability. Method: Three expert…

  10. Automated recognition of rear seat occupants' head position using Kinect™ 3D point cloud.

    Science.gov (United States)

    Loeb, Helen; Kim, Jinyong; Arbogast, Kristy; Kuo, Jonny; Koppel, Sjaan; Cross, Suzanne; Charlton, Judith

    2017-12-01

    Child occupant safety in motor-vehicle crashes is evaluated using Anthropomorphic Test Devices (ATD) seated in optimal positions. However, child occupants often assume suboptimal positions during real-world driving trips. Head impact to the seat back has been identified as one important injury causation scenario for seat belt restrained, head-injured children (Bohman et al., 2011). There is therefore a need to understand the interaction of children with the Child Restraint System to optimize protection. Naturalistic driving studies (NDS) will improve understanding of out-of-position (OOP) trends. To quantify OOP positions, an NDS was conducted. Families used a study vehicle for two weeks during their everyday driving trips. The positions of rear-seated child occupants, representing 22 families, were evaluated. The study vehicle - instrumented with data acquisition systems, including Microsoft Kinect™ V1 - recorded rear seat occupants in 1120 driving trips. Three novel analytical methods were used to analyze the data. To assess skeletal tracking accuracy, analysts recorded occurrences where Kinect™ exhibited invalid head recognition among a randomly-selected subset (81 trips). Errors included incorrect target detection (e.g., vehicle headrest) or environmental interference (e.g., sunlight). When head data was present, Kinect™ was correct 41% of the time; two other algorithms, filtering for extreme motion and background subtraction/head-based depth detection, are described in this paper and preliminary results are presented. Accuracy estimates were not possible because of their experimental nature and the difficulty of establishing ground truth for this large database. This NDS tested methods to quantify the frequency and magnitude of head positions for rear-seated child occupants utilizing Kinect™ motion-tracking. This study's results informed recent ATD sled tests that replicated observed positions (most common and most extreme), and assessed the validity of child
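
    One of the post-processing steps mentioned, filtering for extreme motion, can be illustrated with a short sketch that rejects head-position samples whose frame-to-frame displacement implies a physically implausible speed. The velocity threshold and data layout are assumptions for illustration only.

        import numpy as np

        def filter_extreme_motion(positions, timestamps, max_speed_m_s=2.0):
            """Mask out head-position samples that imply implausibly fast movement.

            positions:  (N, 3) array of head coordinates in metres (e.g. Kinect skeleton joints)
            timestamps: (N,) array of capture times in seconds
            """
            positions = np.asarray(positions, dtype=float)
            keep = np.ones(len(positions), dtype=bool)
            last_good = 0
            for i in range(1, len(positions)):
                dt = timestamps[i] - timestamps[last_good]
                speed = np.linalg.norm(positions[i] - positions[last_good]) / max(dt, 1e-6)
                if speed > max_speed_m_s:
                    keep[i] = False          # likely a tracking glitch, not real motion
                else:
                    last_good = i
            return keep

        t = np.arange(0, 1.0, 1 / 30)                           # 30 fps for one second
        pos = np.zeros((len(t), 3)); pos[15] = [0.0, 0.0, 1.5]  # a single implausible jump
        print(filter_extreme_motion(pos, t).sum(), "of", len(t), "samples kept")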

  11. Investigation into diagnostic agreement using automated computer-assisted histopathology pattern recognition image analysis

    Directory of Open Access Journals (Sweden)

    Joshua D Webster

    2012-01-01

    Full Text Available The extent to which histopathology pattern recognition image analysis (PRIA agrees with microscopic assessment has not been established. Thus, a commercial PRIA platform was evaluated in two applications using whole-slide images. Substantial agreement, lacking significant constant or proportional errors, between PRIA and manual morphometric image segmentation was obtained for pulmonary metastatic cancer areas (Passing/Bablok regression. Bland-Altman analysis indicated heteroscedastic measurements and tendency toward increasing variance with increasing tumor burden, but no significant trend in mean bias. The average between-methods percent tumor content difference was -0.64. Analysis of between-methods measurement differences relative to the percent tumor magnitude revealed that method disagreement had an impact primarily in the smallest measurements (tumor burden 0.988, indicating high reproducibility for both methods, yet PRIA reproducibility was superior (C.V.: PRIA = 7.4, manual = 17.1. Evaluation of PRIA on morphologically complex teratomas led to diagnostic agreement with pathologist assessments of pluripotency on subsets of teratomas. Accommodation of the diversity of teratoma histologic features frequently resulted in detrimental trade-offs, increasing PRIA error elsewhere in images. PRIA error was nonrandom and influenced by variations in histomorphology. File-size limitations encountered while training algorithms and consequences of spectral image processing dominance contributed to diagnostic inaccuracies experienced for some teratomas. PRIA appeared better suited for tissues with limited phenotypic diversity. Technical improvements may enhance diagnostic agreement, and consistent pathologist input will benefit further development and application of PRIA.

  12. Investigation into diagnostic agreement using automated computer-assisted histopathology pattern recognition image analysis.

    Science.gov (United States)

    Webster, Joshua D; Michalowski, Aleksandra M; Dwyer, Jennifer E; Corps, Kara N; Wei, Bih-Rong; Juopperi, Tarja; Hoover, Shelley B; Simpson, R Mark

    2012-01-01

    The extent to which histopathology pattern recognition image analysis (PRIA) agrees with microscopic assessment has not been established. Thus, a commercial PRIA platform was evaluated in two applications using whole-slide images. Substantial agreement, lacking significant constant or proportional errors, between PRIA and manual morphometric image segmentation was obtained for pulmonary metastatic cancer areas (Passing/Bablok regression). Bland-Altman analysis indicated heteroscedastic measurements and tendency toward increasing variance with increasing tumor burden, but no significant trend in mean bias. The average between-methods percent tumor content difference was -0.64. Analysis of between-methods measurement differences relative to the percent tumor magnitude revealed that method disagreement had an impact primarily in the smallest measurements (tumor burden 0.988, indicating high reproducibility for both methods, yet PRIA reproducibility was superior (C.V.: PRIA = 7.4, manual = 17.1). Evaluation of PRIA on morphologically complex teratomas led to diagnostic agreement with pathologist assessments of pluripotency on subsets of teratomas. Accommodation of the diversity of teratoma histologic features frequently resulted in detrimental trade-offs, increasing PRIA error elsewhere in images. PRIA error was nonrandom and influenced by variations in histomorphology. File-size limitations encountered while training algorithms and consequences of spectral image processing dominance contributed to diagnostic inaccuracies experienced for some teratomas. PRIA appeared better suited for tissues with limited phenotypic diversity. Technical improvements may enhance diagnostic agreement, and consistent pathologist input will benefit further development and application of PRIA.
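
    The agreement statistics referred to here (mean bias and limits of agreement from a Bland-Altman analysis) can be computed in a few lines; this is a generic sketch rather than the study's analysis code, and the example measurements are invented.

        import numpy as np

        def bland_altman(method_a, method_b):
            """Mean bias and 95% limits of agreement between two paired measurement methods."""
            a, b = np.asarray(method_a, float), np.asarray(method_b, float)
            diff = a - b
            bias = diff.mean()
            loa = 1.96 * diff.std(ddof=1)
            return bias, (bias - loa, bias + loa)

        # Invented percent-tumour-content measurements: PRIA vs manual segmentation
        pria   = [12.1, 30.5, 4.2, 55.0, 18.9]
        manual = [12.9, 31.0, 4.0, 56.2, 19.5]
        bias, limits = bland_altman(pria, manual)
        print(f"bias = {bias:.2f}, limits of agreement = ({limits[0]:.2f}, {limits[1]:.2f})")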

  13. Mathematical Modelling for the Evaluation of Automated Speech Recognition Systems--Research Area 3.3.1 (c)

    Science.gov (United States)

    2016-01-07

    news. Both of these resemble typical activities of intelligence analysts in OSINT processing and production applications. We assessed two task...intelligence analysts in a number of OSINT processing and production applications. (5) Summary of the most important results In both settings

  14. Evaluation of the effects of nonlinear frequency compression on speech recognition and sound quality for adults with mild to moderate hearing loss.

    Science.gov (United States)

    Picou, Erin M; Marcrum, Steven C; Ricketts, Todd A

    2015-03-01

    While potentially improving audibility for listeners with considerable high frequency hearing loss, the effects of implementing nonlinear frequency compression (NFC) for listeners with moderate high frequency hearing loss are unclear. The purpose of this study was to investigate the effects of activating NFC for listeners who are not traditionally considered candidates for this technology. Participants wore study hearing aids with NFC activated for a 3-4 week trial period. After the trial period, they were tested with NFC and with conventional processing on measures of consonant discrimination threshold in quiet, consonant recognition in quiet, sentence recognition in noise, and acceptableness of sound quality of speech and music. Seventeen adult listeners with symmetrical, mild to moderate sensorineural hearing loss participated. Better ear, high frequency pure-tone averages (4, 6, and 8 kHz) were 60 dB HL or better. Activating NFC resulted in lower (better) thresholds for discrimination of /s/, whose spectral center was 9 kHz. There were no other significant effects of NFC compared to conventional processing. These data suggest that the benefits, and detriments, of activating NFC may be limited for this population.

  15. Children's speech recognition and loudness perception with the Desired Sensation Level v5 Quiet and Noise prescriptions.

    Science.gov (United States)

    Crukley, Jeffery; Scollie, Susan D

    2012-12-01

    To determine whether Desired Sensation Level (DSL) v5 Noise is a viable hearing instrument prescriptive algorithm for children, in comparison with DSL v5 Quiet. In particular, the authors compared children's performance on measures of consonant recognition in quiet, sentence recognition in noise, and loudness perception when fitted with DSL v5 Quiet and Noise. Eleven children (ages 8 to 17 years) with stable, congenital sensorineural hearing losses participated in the study. Participants were fitted bilaterally to DSL v5 prescriptions with behind-the-ear hearing instruments. The order of prescription was counterbalanced across participants. Repeated measures analysis of variance was used to compare performance between prescriptions. Use of the Noise prescription resulted in a significant decrease in consonant perception in Quiet with low-level input, but no difference with average-level input. There was no significant difference in sentence-in-noise recognition between the two prescriptions. Loudness ratings for input levels above 72 dB SPL were significantly lower with the noise prescription. Average-level consonant recognition in quiet was preserved and aversive loudness was alleviated by the Noise prescription relative to the quiet prescription, which suggests that the DSL v5 Noise prescription may be an effective approach to managing the nonquiet listening needs of children with hearing loss.

  16. Automatic speech recognition (zero crossing method). Automatic recognition of isolated vowels; Reconnaissance automatique de la parole (methode des passages par zero). Reconnaissance automatique de voyelles isolees

    Energy Technology Data Exchange (ETDEWEB)

    Dupeyrat, Benoit

    1975-06-10

    This note describes a method for the recognition of isolated vowels, based on a preprocessing of the vocal signal. The preprocessing extracts the extrema of the vocal signal and the time intervals separating them (zero-crossing distances of the first derivative of the signal). The recognition of vowels uses normalized histograms of the values of these intervals. The program determines a distance between the histogram of the sound to be recognized and histogram models built during a learning phase. The results, processed in real time on a minicomputer, are relatively independent of the speaker, provided the fundamental frequency does not vary too much (i.e. speakers of the same sex). (author)
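
    A rough modern re-creation of the idea, as a hedged sketch: take the intervals between zero crossings of the signal's first derivative (i.e. between extrema), histogram them, and classify by comparing against per-vowel template histograms learned beforehand. The sampling rate, bin edges, distance measure and synthetic examples are assumptions.

        import numpy as np

        FS = 8000
        BINS = np.arange(1, 40)          # interval lengths in samples (assumed bin edges)

        def interval_histogram(x):
            """Normalized histogram of intervals between extrema of the signal."""
            d = np.diff(x)
            crossings = np.where(np.diff(np.signbit(d).astype(int)) != 0)[0]   # derivative zero crossings
            intervals = np.diff(crossings)
            hist, _ = np.histogram(intervals, bins=BINS)
            return hist / max(hist.sum(), 1)

        def learn_templates(examples):
            """examples: dict vowel -> list of waveforms; returns averaged template histograms."""
            return {v: np.mean([interval_histogram(x) for x in xs], axis=0)
                    for v, xs in examples.items()}

        def recognize(x, templates):
            h = interval_histogram(x)
            return min(templates, key=lambda v: np.abs(h - templates[v]).sum())   # L1 distance

        # Synthetic stand-ins for two vowels with different dominant frequencies
        t = np.arange(FS) / FS
        examples = {"a": [np.sin(2 * np.pi * 700 * t)], "i": [np.sin(2 * np.pi * 300 * t)]}
        templates = learn_templates(examples)
        print(recognize(np.sin(2 * np.pi * 690 * t), templates))     # expected: "a"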

  17. Automated oestrus detection using multimetric behaviour recognition in seasonal-calving dairy cattle on pasture.

    Science.gov (United States)

    Brassel, J; Rohrssen, F; Failing, K; Wehrend, A

    2018-06-11

    To evaluate the performance of a novel accelerometer-based oestrus detection system (ODS) for dairy cows on pasture, in comparison with measurement of concentrations of progesterone in milk, ultrasonographic examination of ovaries and farmer observations. Mixed-breed lactating dairy cows (n=109) in a commercial, seasonal-calving herd managed at pasture under typical farming conditions in Ireland, were fitted with oestrus detection collars 3 weeks prior to mating start date. The ODS performed multimetric analysis of eight different motion patterns to generate oestrus alerts. Data were collected during the artificial insemination period of 66 days, commencing on 16 April 2015. Transrectal ultrasonographic examinations of the reproductive tract and measurements of concentrations of progesterone in milk were used to confirm oestrus events. Visual observations by the farmer and the number of theoretically expected oestrus events were used to evaluate the number of false negative ODS alerts. The percentage of eligible cows that were detected in oestrus at least once (and were confirmed true positives) was calculated for the first 21, 42 and 63 days of the insemination period. During the insemination period, the ODS generated 194 oestrus alerts and 140 (72.2%) were confirmed as true positives. Six confirmed oestrus events recognised by the farmer did not generate ODS alerts. The positive predictive value of the ODS was 72.2 (95% CI=65.3-78.4)%. To account for oestrus events not identified by the ODS or the farmer, four theoretical missed oestrus events were added to the false negatives. Estimated sensitivity of the automated ODS was 93.3 (95% CI=88.1-96.8)%. The proportion of eligible cows that were detected in oestrus during the first 21 days of the insemination period was 92/106 (86.8%), and during the first 42 and 63 days of the insemination period was 103/106 (97.2%) and 105/106 (99.1%), respectively. The ODS under investigation was suitable for oestrus detection in

  18. Clinical and audiological features of a syndrome with deterioration in speech recognition out of proportion to pure hearing loss

    Directory of Open Access Journals (Sweden)

    Abdi S

    2007-04-01

    Full Text Available Background: The objective of this study was to describe the audiologic and related characteristics of a group of patients with speech perception affected out of proportion to pure tone hearing loss, in a case series of patients referred for evaluation and management to the Hearing Research Center. The aim was to describe the clinical picture of patients whose key clinical feature was hearing loss for pure tones and a reduction in speech discrimination out of proportion to the pure tone loss, meeting some of the criteria of auditory neuropathy (i.e. normal otoacoustic emissions, OAE, and abnormal auditory brainstem evoked potentials, ABR) while lacking others (e.g. present auditory reflexes). Methods: Hearing abilities were measured by Pure Tone Audiometry (PTA) and Speech Discrimination Scores (SDS), measured in all patients using a standardized list of 25 monosyllabic Farsi words at MCL in quiet. Auditory pathway integrity was measured using Auditory Brainstem Response (ABR) and Otoacoustic Emission (OAE), and anatomical lesions by Computed Tomography (CT) and Magnetic Resonance Imaging (MRI) of the brain and retrocochlea. The patients included in the series were 35 patients who had SDS disproportionately low with regard to PTA, absent ABR waves and normal OAE. Results: All patients reported the beginning of their problem around adolescence. None of them had an anatomical lesion on imaging studies, and none had any finding suggestive of a conductive hearing lesion. Although in most of the cases the hearing loss was more apparent in the lower frequencies (i.e. 1000 Hz and below), a stronger correlation was found between SDS and the hearing threshold at higher frequencies. These patients may not benefit from hearing aids, as the outer hair cells are functional and amplification does not seem to help; though, it was tried for all. Conclusion: These patients share a pattern of sensorineural loss with no detectable lesion. The age of onset and the gradual

  19. The Pandora multi-algorithm approach to automated pattern recognition of cosmic-ray muon and neutrino events in the MicroBooNE detector

    CERN Document Server

    Acciarri, R.; An, R.; Anthony, J.; Asaadi, J.; Auger, M.; Bagby, L.; Balasubramanian, S.; Baller, B.; Barnes, C.; Barr, G.; Bass, M.; Bay, F.; Bishai, M.; Blake, A.; Bolton, T.; Camilleri, L.; Caratelli, D.; Carls, B.; Castillo Fernandez, R.; Cavanna, F.; Chen, H.; Church, E.; Cianci, D.; Cohen, E.; Collin, G. H.; Conrad, J. M.; Convery, M.; Crespo-Anadón, J. I.; Del Tutto, M.; Devitt, D.; Dytman, S.; Eberly, B.; Ereditato, A.; Escudero Sanchez, L.; Esquivel, J.; Fadeeva, A. A.; Fleming, B. T.; Foreman, W.; Furmanski, A. P.; Garcia-Gamez, D.; Garvey, G. T.; Genty, V.; Goeldi, D.; Gollapinni, S.; Graf, N.; Gramellini, E.; Greenlee, H.; Grosso, R.; Guenette, R.; Hackenburg, A.; Hamilton, P.; Hen, O.; Hewes, J.; Hill, C.; Ho, J.; Horton-Smith, G.; Hourlier, A.; Huang, E.-C.; James, C.; Jan de Vries, J.; Jen, C.-M.; Jiang, L.; Johnson, R. A.; Joshi, J.; Jostlein, H.; Kaleko, D.; Karagiorgi, G.; Ketchum, W.; Kirby, B.; Kirby, M.; Kobilarcik, T.; Kreslo, I.; Laube, A.; Li, Y.; Lister, A.; Littlejohn, B. R.; Lockwitz, S.; Lorca, D.; Louis, W. C.; Luethi, M.; Lundberg, B.; Luo, X.; Marchionni, A.; Mariani, C.; Marshall, J.; Martinez Caicedo, D. A.; Meddage, V.; Miceli, T.; Mills, G. B.; Moon, J.; Mooney, M.; Moore, C. D.; Mousseau, J.; Murrells, R.; Naples, D.; Nienaber, P.; Nowak, J.; Palamara, O.; Paolone, V.; Papavassiliou, V.; Pate, S. F.; Pavlovic, Z.; Piasetzky, E.; Porzio, D.; Pulliam, G.; Qian, X.; Raaf, J. L.; Rafique, A.; Rochester, L.; Rudolf von Rohr, C.; Russell, B.; Schmitz, D. W.; Schukraft, A.; Seligman, W.; Shaevitz, M. H.; Sinclair, J.; Smith, A.; Snider, E. L.; Soderberg, M.; Söldner-Rembold, S.; Soleti, S. R.; Spentzouris, P.; Spitz, J.; St. John, J.; Strauss, T.; Szelc, A. M.; Tagg, N.; Terao, K.; Thomson, M.; Toups, M.; Tsai, Y.-T.; Tufanli, S.; Usher, T.; Van De Pontseele, W.; Van de Water, R. G.; Viren, B.; Weber, M.; Wickremasinghe, D. A.; Wolbers, S.; Wongjirad, T.; Woodruff, K.; Yang, T.; Yates, L.; Zeller, G. P.; Zennamo, J.; Zhang, C.

    2017-01-01

    The development and operation of Liquid-Argon Time-Projection Chambers for neutrino physics has created a need for new approaches to pattern recognition in order to fully exploit the imaging capabilities offered by this technology. Whereas the human brain can excel at identifying features in the recorded events, it is a significant challenge to develop an automated, algorithmic solution. The Pandora Software Development Kit provides functionality to aid the design and implementation of pattern-recognition algorithms. It promotes the use of a multi-algorithm approach to pattern recognition, in which individual algorithms each address a specific task in a particular topology. Many tens of algorithms then carefully build up a picture of the event and, together, provide a robust automated pattern-recognition solution. This paper describes details of the chain of over one hundred Pandora algorithms and tools used to reconstruct cosmic-ray muon and neutrino events in the MicroBooNE detector. Metrics that assess the...

  20. APPRECIATING SPEECH THROUGH GAMING

    Directory of Open Access Journals (Sweden)

    Mario T Carreon

    2014-06-01

    Full Text Available This paper discusses the Speech and Phoneme Recognition as an Educational Aid for the Deaf and Hearing Impaired (SPREAD) application and the ongoing research on its deployment as a tool for motivating deaf and hearing-impaired students to learn and appreciate speech. This application uses the Sphinx-4 voice recognition system to analyze the vocalization of the student and provide prompt feedback on their pronunciation. The packaging of the application as an interactive game aims to provide additional motivation for deaf and hearing-impaired students through visual means, encouraging them to learn and appreciate speech.

  1. Commercial speech and off-label drug uses: what role for wide acceptance, general recognition and research incentives?

    Science.gov (United States)

    Gilhooley, Margaret

    2011-01-01

    This article provides an overview of how the constitutional protections for commercial speech affect the Food and Drug Administration's (FDA) regulation of drugs, and the emerging issues about the scope of these protections. A federal district court has already found that commercial speech allows manufacturers to distribute reprints of medical articles about a new off-label use of a drug as long as it contains disclosures to prevent deception and to inform readers about the lack of FDA review. This paper summarizes the current agency guidance that accepts the manufacturer's distribution of reprints with disclosures. Allergan, the maker of Botox, recently maintained in a lawsuit that the First Amendment permits drug companies to provide "truthful information" to doctors about "widely accepted" off-label uses of a drug. While the case was settled as part of a fraud and abuse case on other grounds, extending constitutional protections generally to "widely accepted" uses is not warranted, especially if it covers the use of a drug for a new purpose that needs more proof of efficacy, and that can involve substantial risks. A health law academic pointed out in an article examining a fraud and abuse case that off-label use of drugs is common, and that practitioners may lack adequate dosage information about the off-label uses. Drug companies may obtain approval of a drug for a narrow use, such as for a specific type of pain, but practitioners use the drug for similar uses based on their experience. The writer maintained that a controlled study may not be necessary to establish efficacy for an expanded use of a drug for pain. Even if this is the case, as discussed below in this paper, added safety risks may exist if the expansion covers a longer period of time and use by a wider number of patients. The protections for commercial speech should not be extended to allow manufacturers to distribute information about practitioner use with a disclosure about the lack of FDA

  2. Words from spontaneous conversational speech can be recognized with human-like accuracy by an error-driven learning algorithm that discriminates between meanings straight from smart acoustic features, bypassing the phoneme as recognition unit.

    Science.gov (United States)

    Arnold, Denis; Tomaschek, Fabian; Sering, Konstantin; Lopez, Florence; Baayen, R Harald

    2017-01-01

    Sound units play a pivotal role in cognitive models of auditory comprehension. The general consensus is that during perception listeners break down speech into auditory words and subsequently phones. Indeed, cognitive speech recognition is typically taken to be computationally intractable without phones. Here we present a computational model trained on 20 hours of conversational speech that recognizes word meanings within the range of human performance (model 25%, native speakers 20-44%), without making use of phone or word form representations. Our model also successfully generates predictions about the speed and accuracy of human auditory comprehension. At the heart of the model is a 'wide' yet sparse two-layer artificial neural network with some hundred thousand input units representing summaries of changes in acoustic frequency bands, and proxies for lexical meanings as output units. We believe that our model holds promise for resolving longstanding theoretical problems surrounding the notion of the phone in linguistic theory.

  3. Individual differences in language ability are related to variation in word recognition, not speech perception: evidence from eye movements.

    Science.gov (United States)

    McMurray, Bob; Munson, Cheyenne; Tomblin, J Bruce

    2014-08-01

    The authors examined speech perception deficits associated with individual differences in language ability, contrasting auditory, phonological, or lexical accounts by asking whether lexical competition is differentially sensitive to fine-grained acoustic variation. Adolescents with a range of language abilities (N = 74, including 35 impaired) participated in an experiment based on McMurray, Tanenhaus, and Aslin (2002). Participants heard tokens from six 9-step voice onset time (VOT) continua spanning 2 words (beach/peach, beak/peak, etc.) while viewing a screen containing pictures of those words and 2 unrelated objects. Participants selected the referent while eye movements to each picture were monitored as a measure of lexical activation. Fixations were examined as a function of both VOT and language ability. Eye movements were sensitive to within-category VOT differences: As VOT approached the boundary, listeners made more fixations to the competing word. This did not interact with language ability, suggesting that language impairment is not associated with differential auditory sensitivity or phonetic categorization. Listeners with poorer language skills showed heightened competitor fixations overall, suggesting a deficit in lexical processes. Language impairment may be better characterized by a deficit in lexical competition (inability to suppress competing words), rather than differences in phonological categorization or auditory abilities.

  4. Individual differences in language ability are related to variation in word recognition, not speech perception: Evidence from eye-movements

    Science.gov (United States)

    McMurray, Bob; Munson, Cheyenne; Tomblin, J. Bruce

    2013-01-01

    Purpose This study examined speech perception deficits associated with individual differences in language ability, contrasting auditory, phonological or lexical accounts by asking if lexical competition is differentially sensitive to fine-grained acoustic variation. Methods 74 adolescents with a range of language abilities (including 35 impaired) participated in an experiment based on McMurray, Tanenhaus and Aslin (2002). Participants heard tokens from six 9-step Voice Onset Time (VOT) continua spanning two words (beach/peach, beak/peak, etc.), while viewing a screen containing pictures of those words and two unrelated objects. Participants selected the referent while eye-movements to each picture were monitored as a measure of lexical activation. Fixations were examined as a function of both VOT and language ability. Results Eye-movements were sensitive to within-category VOT differences: as VOT approached the boundary, listeners made more fixations to the competing word. This did not interact with language ability, suggesting that language impairment is not associated with differential auditory sensitivity or phonetic categorization. Listeners with poorer language skills showed heightened competitor fixations overall, suggesting a deficit in lexical processes. Conclusions Language impairment may be better characterized by a deficit in lexical competition (inability to suppress competing words), rather than differences in phonological categorization or auditory abilities. PMID:24687026

  5. Building Searchable Collections of Enterprise Speech Data.

    Science.gov (United States)

    Cooper, James W.; Viswanathan, Mahesh; Byron, Donna; Chan, Margaret

    The study has applied speech recognition and text-mining technologies to a set of recorded outbound marketing calls and analyzed the results. Since speaker-independent speech recognition technology results in a significantly lower recognition rate than that found when the recognizer is trained for a particular speaker, a number of post-processing…

  6. A Kinect-Based Sign Language Hand Gesture Recognition System for Hearing- and Speech-Impaired: A Pilot Study of Pakistani Sign Language.

    Science.gov (United States)

    Halim, Zahid; Abbas, Ghulam

    2015-01-01

    Sign language provides hearing- and speech-impaired individuals with an interface to communicate with other members of the society. Unfortunately, sign language is not understood by most people. For this, a gadget based on image processing and pattern recognition can provide a vital aid for detecting and translating sign language into a vocal language. This work presents a system for detecting and understanding sign language gestures with a custom-built software tool and then translating each gesture into a vocal language. To recognize a particular gesture, the system employs a Dynamic Time Warping (DTW) algorithm, and an off-the-shelf software tool is used for vocal language generation. Microsoft® Kinect is the primary tool used to capture the video stream of a user. The proposed method is capable of successfully detecting gestures stored in the dictionary with an accuracy of 91%. The proposed system has the ability to define and add custom-made gestures. Based on an experiment in which 10 individuals with impairments used the system to communicate with 5 people with no disability, 87% agreed that the system was useful.
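
    The record names Dynamic Time Warping as the matching algorithm but gives no implementation detail. As a minimal illustration only (not the authors' code), the sketch below shows the standard DTW dynamic-programming recurrence over two gesture feature sequences; the assumption that each frame is a vector of skeleton-joint coordinates is ours.

        import numpy as np

        def dtw_distance(seq_a, seq_b):
            # seq_a, seq_b: arrays of shape (n_frames, n_dims), e.g. per-frame
            # skeleton-joint coordinates (an assumption for illustration).
            n, m = len(seq_a), len(seq_b)
            cost = np.full((n + 1, m + 1), np.inf)
            cost[0, 0] = 0.0
            for i in range(1, n + 1):
                for j in range(1, m + 1):
                    d = np.linalg.norm(seq_a[i - 1] - seq_b[j - 1])  # local frame distance
                    cost[i, j] = d + min(cost[i - 1, j],      # insertion
                                         cost[i, j - 1],      # deletion
                                         cost[i - 1, j - 1])  # match
            return float(cost[n, m])

        # A query gesture would be labelled with the dictionary entry giving the
        # smallest DTW distance.
        query = np.random.randn(40, 60)
        template = np.random.randn(55, 60)
        score = dtw_distance(query, template)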

  7. Speech Processing.

    Science.gov (United States)

    1983-05-01

    The VDE system developed had the capability of recognizing up to 248 separate words in syntactic structures. The two systems described are isolated... Report contents include: ...and Speaker Recognition, by M.J. Hunt; Assessment of Speech Systems, by R.K. Moore; A Survey of Current Equipment and Research, by J.S. Bridle; ...Technology in Navy Training Systems, by R. Breaux, M. Blind and R. Lynchard; and General Review of Military Applications of Voice Processing, by Dr. Bruno...

  8. Tackling the complexity in speech

    DEFF Research Database (Denmark)

    section includes four carefully selected chapters. They deal with facets of speech production, speech acoustics, and/or speech perception or recognition, place them in an integrated phonetic-phonological perspective, and relate them in more or less explicit ways to aspects of speech technology. Therefore......, we hope that this volume can help speech scientists with traditional training in phonetics and phonology to keep up with the latest developments in speech technology. In the opposite direction, speech researchers starting from a technological perspective will hopefully get inspired by reading about...... the questions, phenomena, and communicative functions that are currently addressed in phonetics and phonology. Either way, the future of speech research lies in international, interdisciplinary collaborations, and our volume is meant to reflect and facilitate such collaborations...

  9. Automated pattern recognition to support geological mapping and exploration target generation: a case study from southern Namibia

    CSIR Research Space (South Africa)

    Eberle, D

    2015-06-01

    Full Text Available This paper demonstrates a methodology for the automatic joint interpretation of high resolution airborne geophysical and space-borne remote sensing data to support geological mapping in a largely automated, fast and objective manner. At the request...

  10. Biometric correspondence between reface computerized facial approximations and CT-derived ground truth skin surface models objectively examined using an automated facial recognition system.

    Science.gov (United States)

    Parks, Connie L; Monson, Keith L

    2018-05-01

    This study employed an automated facial recognition system as a means of objectively evaluating biometric correspondence between a ReFace facial approximation and the computed tomography (CT) derived ground truth skin surface of the same individual. High rates of biometric correspondence were observed, irrespective of rank class (Rk) or demographic cohort examined. Overall, 48% of the test subjects' ReFace approximation probes (n=96) were matched to his or her corresponding ground truth skin surface image at R1, a rank indicating a high degree of biometric correspondence and a potential positive identification. Identification rates improved with each successively broader rank class (R10 = 85%, R25 = 96%, and R50 = 99%), with 100% identification by R57. A sharp increase (39% mean increase) in identification rates was observed between R1 and R10 across most rank classes and demographic cohorts. In contrast, significantly lower (p0.05) performance differences were observed across demographic cohorts or CT scan protocols. Performance measures observed in this research suggest that ReFace approximations are biometrically similar to the actual faces of the approximated individuals and, therefore, may have potential operational utility in contexts in which computerized approximations are utilized as probes in automated facial recognition systems. Copyright © 2018. Published by Elsevier B.V.
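
    The rank-class results above (R1, R10, R25, ...) are cumulative identification rates. As a generic illustration, and not the study's actual pipeline, the following sketch computes such rank-k rates from a probe-versus-gallery similarity matrix, assuming probe i's true mate sits at gallery index i.

        import numpy as np

        def rank_k_identification_rates(similarity, ks=(1, 10, 25, 50)):
            # similarity[i, j]: score of probe i (e.g. a ReFace approximation)
            # against gallery face j; probe i's true mate is assumed to be j = i.
            n = similarity.shape[0]
            order = np.argsort(-similarity, axis=1)   # best match first
            ranks = np.array([int(np.where(order[i] == i)[0][0]) + 1 for i in range(n)])
            return {k: float((ranks <= k).mean()) for k in ks}

        # Example with random scores for 96 probes (illustrative only).
        scores = np.random.rand(96, 96)
        rates = rank_k_identification_rates(scores)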

  11. Music and Speech Perception in Children Using Sung Speech.

    Science.gov (United States)

    Nie, Yingjiu; Galvin, John J; Morikawa, Michael; André, Victoria; Wheeler, Harley; Fu, Qian-Jie

    2018-01-01

    This study examined music and speech perception in normal-hearing children with some or no musical training. Thirty children (mean age = 11.3 years), 15 with and 15 without formal music training, participated in the study. Music perception was measured using a melodic contour identification (MCI) task; stimuli were a piano sample or sung speech with a fixed timbre (same word for each note) or a mixed timbre (different words for each note). Speech perception was measured in quiet and in steady noise using a matrix-styled sentence recognition task; stimuli were naturally intonated speech or sung speech with a fixed pitch (same note for each word) or a mixed pitch (different notes for each word). Significant musician advantages were observed for MCI and speech in noise but not for speech in quiet. MCI performance was significantly poorer with the mixed timbre stimuli. Speech performance in noise was significantly poorer with the fixed or mixed pitch stimuli than with spoken speech. Across all subjects, age at testing and MCI performance were significantly correlated with speech performance in noise. MCI and speech performance in quiet were significantly poorer for children than for adults from a related study using the same stimuli and tasks; speech performance in noise was significantly poorer for young than for older children. Long-term music training appeared to benefit melodic pitch perception and speech understanding in noise in these pediatric listeners.

  12. Detection of target phonemes in spontaneous and read speech

    OpenAIRE

    Mehta, G.; Cutler, A.

    1988-01-01

    Although spontaneous speech occurs more frequently in most listeners’ experience than read speech, laboratory studies of human speech recognition typically use carefully controlled materials read from a script. The phonological and prosodic characteristics of spontaneous and read speech differ considerably, however, which suggests that laboratory results may not generalize to the recognition of spontaneous and read speech materials, and their response time to detect word-initial target phonem...

  13. Automatic Recognition of Improperly Pronounced Initial 'r' Consonant in Romanian

    Directory of Open Access Journals (Sweden)

    VELICAN, V.

    2012-08-01

    Full Text Available Correctly assessing the degree of mispronunciation and deciding upon the necessary treatment are fundamental activities for all speech disorder specialists. Obviously, the experience and the availability of the specialists are essential to assure efficient therapy for the speech impaired. To overcome this deficiency, a more objective approach would rely on a tool that, independently of the specialist's abilities, could be used to establish the diagnosis. A completely automated system based on speech processing algorithms capable of performing the recognition task is therefore thoroughly justified and can be viewed as a goal that will bring many benefits to the field of speech pronunciation correction. This paper presents further results of the authors' work on developing speech processing algorithms able to identify mispronunciations in the Romanian language; more precisely, we propose the use of the Walsh-Hadamard Transform (WHT) as a feature selection tool for identifying rhotacism. The results are encouraging, with a best recognition rate of 92.55%.
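
    The record proposes the Walsh-Hadamard Transform as a feature selection tool but does not give the transform itself. Below is a minimal, generic sketch of a fast Walsh-Hadamard transform applied to a speech frame; the frame length (a power of two) and the number of retained coefficients are illustrative assumptions, not values from the paper.

        import numpy as np

        def fwht(x):
            # Fast Walsh-Hadamard transform; len(x) must be a power of two.
            x = np.asarray(x, dtype=float).copy()
            n = len(x)
            h = 1
            while h < n:
                for i in range(0, n, h * 2):
                    for j in range(i, i + h):
                        a, b = x[j], x[j + h]
                        x[j], x[j + h] = a + b, a - b
                h *= 2
            return x / np.sqrt(n)  # orthonormal scaling

        def wht_features(frame, n_coeffs=32):
            # Keep the first n_coeffs WHT coefficients of a speech frame as features
            # (the count 32 is an arbitrary illustration).
            return fwht(frame)[:n_coeffs]

        frame = np.random.randn(256)   # stand-in for a 256-sample /r/ segment
        features = wht_features(frame)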

  14. Interpreting complex data by methods of recognition and classification in an automated system of aerogeophysical material processing

    Energy Technology Data Exchange (ETDEWEB)

    Koval' , L.A.; Dolgov, S.V.; Liokumovich, G.B.; Ovcharenko, A.V.; Priyezzhev, I.I.

    1984-01-01

    The ASOM-AGS/YeS system for automated processing of aerogeophysical data supports complex interpretation of multichannel measurements. It uses algorithms for factor analysis and automatic classification together with an apparatus of a priori specified (selected) decision rules. The areas over which these procedures act can be initially restricted using the specified geological information. The capabilities of the method are demonstrated by the results of automated processing of airborne gamma-spectrometric measurements in the region of a known porphyry copper occurrence in Kazakhstan. This ore deposit was clearly delineated, after processing by the principal-component method, as a composite anomaly of independent factors: U (strong increase), Th (noticeable increase), K (decrease).

  15. INTEGRATING MACHINE TRANSLATION AND SPEECH SYNTHESIS COMPONENT FOR ENGLISH TO DRAVIDIAN LANGUAGE SPEECH TO SPEECH TRANSLATION SYSTEM

    Directory of Open Access Journals (Sweden)

    J. SANGEETHA

    2015-02-01

    Full Text Available This paper describes an interface between the machine translation and speech synthesis components for converting English speech to Tamil in an English-to-Tamil speech-to-speech translation system. The speech translation system consists of three modules: automatic speech recognition, machine translation and text-to-speech synthesis. Many procedures for integrating speech recognition and machine translation have been proposed, but the speech synthesis component has not yet been considered. In this paper, we focus on the integration of machine translation and speech synthesis, and report a subjective evaluation to investigate the impact of speech synthesis, machine translation and the integration of machine translation and speech synthesis components. Here we implement a hybrid machine translation approach (a combination of rule-based and statistical machine translation) and a concatenative syllable-based speech synthesis technique. In order to retain the naturalness and intelligibility of the synthesized speech, Auto Associative Neural Network (AANN) prosody prediction is used in this work. The results of this system investigation demonstrate that the naturalness and intelligibility of the synthesized speech are strongly influenced by the fluency and correctness of the translated text.
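
    The three-module architecture described above (speech recognition, machine translation, speech synthesis) composes naturally as a pipeline. The skeleton below is only an illustration of that composition; the function names are placeholders, not the authors' modules.

        def speech_to_speech(english_audio, recognize, translate, synthesize):
            # recognize:  ASR module, audio -> English text
            # translate:  hybrid (rule-based + statistical) MT, English text -> Tamil text
            # synthesize: syllable-based concatenative TTS, Tamil text -> audio
            english_text = recognize(english_audio)
            tamil_text = translate(english_text)
            return synthesize(tamil_text)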

  16. Automated segmentation and recognition of abdominal wall muscles in X-ray torso CT images and its application in abdominal CAD

    International Nuclear Information System (INIS)

    Zhou, X.; Kamiya, N.; Hara, T.; Fujita, H.; Chen, H.; Yokoyama, R.; Hoshi, H.

    2007-01-01

    The information of the abdominal wall is very important for the planning of surgical operations and abdominal organ recognition. In the research fields of computer-assisted radiology and surgery and computer-aided diagnosis, the segmentation and recognition of the abdominal wall muscles in CT images is a necessary pre-processing step. Due to the complexity of the abdominal wall structure and its indistinct appearance in CT images, the automated segmentation of abdominal wall muscles is a difficult issue and has not been solved completely. We propose an approach that automatically segments the abdominal wall muscles and divides them into three categories (front abdominal muscles, including the rectus abdominis; left and right side abdominal muscles, including the external oblique, internal oblique and transversus abdominis muscles). The approach first makes an initial classification of bone, fat, and muscles and organs based on CT number. Then a layer structure is generated to describe the 3-D anatomical structures of the human torso by stretching the torso region onto a thin plate for easy recognition. The abdominal wall muscles are recognized on the layer structures using the spatial relations to the skeletal structure and CT numbers. Finally, the recognized regions are mapped back to the 3-D CT images using an inverse transformation of the stretching process. This method is applied to 20 cases of torso CT images and evaluations are based on visual comparison of the recognition results and the original CT images by an expert in anatomy. The results show that our approach can segment and recognize abdominal wall muscle regions effectively. (orig.)

  17. Speech Problems

    Science.gov (United States)

    ... a person's ability to speak clearly. Some Common Speech and Language Disorders: Stuttering is a problem that ...

  18. Byblos Speech Recognition Benchmark Results

    National Research Council Canada - National Science Library

    Kubala, F; Austin, S; Barry, C; Makhoul, J; Placeway, P; Schwartz, R

    1991-01-01

    .... Surprisingly, the 12-speaker model performs as well as the one made from 109 speakers. Also within the RM domain, we demonstrate that state-of-the-art SI models perform poorly for speakers with strong dialects...

  19. Improved Methods for Pitch Synchronous Linear Prediction Analysis of Speech

    OpenAIRE

    劉, 麗清

    2015-01-01

    Linear prediction (LP) analysis has been applied to speech systems over the last few decades. The LP technique is well-suited for speech analysis due to its ability to model the speech production process approximately. Hence LP analysis has been widely used for speech enhancement, low-bit-rate speech coding in cellular telephony, speech recognition, characteristic parameter extraction (vocal tract resonance frequencies, and the fundamental frequency, called pitch), and so on. However, the performance of the co...
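
    For readers unfamiliar with LP analysis, the sketch below shows the standard autocorrelation method with the Levinson-Durbin recursion for estimating LP coefficients from a single (optionally pitch-synchronous) speech frame. This is a textbook illustration, not the method proposed in the thesis; the prediction order of 12 is an arbitrary example.

        import numpy as np

        def lpc_autocorrelation(frame, order=12):
            # Autocorrelation method + Levinson-Durbin recursion for one speech frame;
            # assumes the frame is non-degenerate (autocorrelation r[0] > 0).
            n = len(frame)
            r = [float(np.dot(frame[:n - k], frame[k:])) for k in range(order + 1)]
            a = [1.0]                      # coefficients of the prediction polynomial A(z)
            err = r[0]                     # prediction error energy
            for i in range(1, order + 1):
                acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
                k = -acc / err             # reflection coefficient
                a = [1.0] + [a[j] + k * a[i - j] for j in range(1, i)] + [k]
                err *= (1.0 - k * k)
            return np.array(a), err

        # Example: LP coefficients of a random 240-sample frame (illustrative only).
        coeffs, residual_energy = lpc_autocorrelation(np.random.randn(240))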

  20. An automated patient recognition method based on an image-matching technique using previous chest radiographs in the picture archiving and communication system environment

    International Nuclear Information System (INIS)

    Morishita, Junji; Katsuragawa, Shigehiko; Kondo, Keisuke; Doi, Kunio

    2001-01-01

    An automated patient recognition method for correcting 'wrong' chest radiographs being stored in a picture archiving and communication system (PACS) environment has been developed. The method is based on an image-matching technique that uses previous chest radiographs. For identification of a 'wrong' patient, the correlation value was determined for a previous image of a patient and a new, current image of the presumed corresponding patient. The current image was shifted horizontally and vertically and rotated, so that we could determine the best match between the two images. The results indicated that the correlation values between the current and previous images for the same, 'correct' patients were generally greater than those for different, 'wrong' patients. Although the two histograms for the same patient and for different patients overlapped at correlation values greater than 0.80, most parts of the histograms were separated. The correlation value was compared with a threshold value that was determined based on an analysis of the histograms of correlation values obtained for the same patient and for different patients. If the current image is considered potentially to belong to a 'wrong' patient, a warning sign with the probability of a 'wrong' patient is provided to alert radiology personnel. Our results indicate that at least half of the 'wrong' images in our database can be identified correctly with the method described in this study. The receiver operating characteristic curve showed high overall system performance. The results also indicate that some readings of 'wrong' images for a given patient in the PACS environment can be prevented by use of the method we developed. Therefore an automated warning system for patient recognition would be useful in correcting 'wrong' images being stored in the PACS environment.
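
    The core operation described above is a normalized correlation between a previous and a current radiograph, maximized over small shifts (and rotations). The sketch below illustrates that idea in a generic way; the shift range is an assumption, the rotation search is omitted, and the wrap-around of np.roll stands in for proper border handling.

        import numpy as np

        def normalized_correlation(img_a, img_b):
            # Zero-mean normalized cross-correlation between two equally sized images.
            a = img_a - img_a.mean()
            b = img_b - img_b.mean()
            denom = np.sqrt((a * a).sum() * (b * b).sum())
            return float((a * b).sum() / denom) if denom > 0 else 0.0

        def best_match_score(previous, current, max_shift=8):
            # Search integer x/y shifts of the current image for the best correlation
            # with the previous image; a real implementation would pad instead of
            # wrapping and would also search small rotations.
            best = -1.0
            for dy in range(-max_shift, max_shift + 1):
                for dx in range(-max_shift, max_shift + 1):
                    shifted = np.roll(np.roll(current, dy, axis=0), dx, axis=1)
                    best = max(best, normalized_correlation(previous, shifted))
            return best

        # A best score below a threshold (the study derives one near 0.80 from the
        # same- vs different-patient histograms) would flag a possible 'wrong' patient.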

  1. The Effectiveness of Clear Speech as a Masker

    Science.gov (United States)

    Calandruccio, Lauren; Van Engen, Kristin; Dhar, Sumitrajit; Bradlow, Ann R.

    2010-01-01

    Purpose: It is established that speaking clearly is an effective means of enhancing intelligibility. Because any signal-processing scheme modeled after known acoustic-phonetic features of clear speech will likely affect both target and competing speech, it is important to understand how speech recognition is affected when a competing speech signal…

  2. Cleft Audit Protocol for Speech (CAPS-A): A Comprehensive Training Package for Speech Analysis

    Science.gov (United States)

    Sell, D.; John, A.; Harding-Bell, A.; Sweeney, T.; Hegarty, F.; Freeman, J.

    2009-01-01

    Background: The previous literature has largely focused on speech analysis systems and ignored process issues, such as the nature of adequate speech samples, data acquisition, recording and playback. Although there has been recognition of the need for training on tools used in speech analysis associated with cleft palate, little attention has been…

  3. Speech Compression

    Directory of Open Access Journals (Sweden)

    Jerry D. Gibson

    2016-06-01

    Full Text Available Speech compression is a key technology underlying digital cellular communications, VoIP, voicemail, and voice response systems. We trace the evolution of speech coding based on the linear prediction model, highlight the key milestones in speech coding, and outline the structures of the most important speech coding standards. Current challenges, future research directions, fundamental limits on performance, and the critical open problem of speech coding for emergency first responders are all discussed.

  4. Should visual speech cues (speechreading) be considered when fitting hearing aids?

    Science.gov (United States)

    Grant, Ken

    2002-05-01

    When talker and listener are face-to-face, visual speech cues become an important part of the communication environment, and yet, these cues are seldom considered when designing hearing aids. Models of auditory-visual speech recognition highlight the importance of complementary versus redundant speech information for predicting auditory-visual recognition performance. Thus, for hearing aids to work optimally when visual speech cues are present, it is important to know whether the cues provided by amplification and the cues provided by speechreading complement each other. In this talk, data will be reviewed that show nonmonotonicity between auditory-alone speech recognition and auditory-visual speech recognition, suggesting that efforts designed solely to improve auditory-alone recognition may not always result in improved auditory-visual recognition. Data will also be presented showing that one of the most important speech cues for enhancing auditory-visual speech recognition performance, voicing, is often the cue that benefits least from amplification.

  5. Head movements encode emotions during speech and song.

    Science.gov (United States)

    Livingstone, Steven R; Palmer, Caroline

    2016-04-01

    When speaking or singing, vocalists often move their heads in an expressive fashion, yet the influence of emotion on vocalists' head motion is unknown. Using a comparative speech/song task, we examined whether vocalists' intended emotions influence head movements and whether those movements influence the perceived emotion. In Experiment 1, vocalists were recorded with motion capture while speaking and singing each statement with different emotional intentions (very happy, happy, neutral, sad, very sad). Functional data analyses showed that head movements differed in translational and rotational displacement across emotional intentions, yet were similar across speech and song, transcending differences in F0 (varied freely in speech, fixed in song) and lexical variability. Head motion specific to emotional state occurred before and after vocalizations, as well as during sound production, confirming that some aspects of movement were not simply a by-product of sound production. In Experiment 2, observers accurately identified vocalists' intended emotion on the basis of silent, face-occluded videos of head movements during speech and song. These results provide the first evidence that head movements encode a vocalist's emotional intent and that observers decode emotional information from these movements. We discuss implications for models of head motion during vocalizations and applied outcomes in social robotics and automated emotion recognition. (c) 2016 APA, all rights reserved).

  6. Mixed deep learning and natural language processing method for fake-food image recognition and standardization to help automated dietary assessment.

    Science.gov (United States)

    Mezgec, Simon; Eftimov, Tome; Bucher, Tamara; Koroušić Seljak, Barbara

    2018-04-06

    The present study tested the combination of an established and a validated food-choice research method (the 'fake food buffet') with a new food-matching technology to automate the data collection and analysis. The methodology combines fake-food image recognition using deep learning and food matching and standardization based on natural language processing. The former is specific because it uses a single deep learning network to perform both the segmentation and the classification at the pixel level of the image. To assess its performance, measures based on the standard pixel accuracy and Intersection over Union were applied. Food matching firstly describes each of the recognized food items in the image and then matches the food items with their compositional data, considering both their food names and their descriptors. The final accuracy of the deep learning model trained on fake-food images acquired by 124 study participants and providing fifty-five food classes was 92·18 %, while the food matching was performed with a classification accuracy of 93 %. The present findings are a step towards automating dietary assessment and food-choice research. The methodology outperforms other approaches in pixel accuracy, and since it is the first automatic solution for recognizing the images of fake foods, the results could be used as a baseline for possible future studies. As the approach enables a semi-automatic description of recognized food items (e.g. with respect to FoodEx2), these can be linked to any food composition database that applies the same classification and description system.
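
    The segmentation measures named above, pixel accuracy and Intersection over Union, can be computed from per-pixel class maps as in the generic sketch below; this is not the authors' code, and the class encoding is an assumption.

        import numpy as np

        def pixel_accuracy(pred, truth):
            # Fraction of pixels whose predicted class equals the ground-truth class.
            return float((pred == truth).mean())

        def mean_iou(pred, truth, n_classes):
            # Mean Intersection over Union over the classes present in either map.
            ious = []
            for c in range(n_classes):
                inter = np.logical_and(pred == c, truth == c).sum()
                union = np.logical_or(pred == c, truth == c).sum()
                if union > 0:
                    ious.append(inter / union)
            return float(np.mean(ious))

        # Example with two tiny 4x4 label maps and 3 classes (illustrative only).
        pred = np.random.randint(0, 3, (4, 4))
        truth = np.random.randint(0, 3, (4, 4))
        print(pixel_accuracy(pred, truth), mean_iou(pred, truth, 3))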

  7. Speech endpoint detection with non-language speech sounds for generic speech processing applications

    Science.gov (United States)

    McClain, Matthew; Romanowski, Brian

    2009-05-01

    Non-language speech sounds (NLSS) are sounds produced by humans that do not carry linguistic information. Examples of these sounds are coughs, clicks, breaths, and filled pauses such as "uh" and "um" in English. NLSS are prominent in conversational speech, but can be a significant source of errors in speech processing applications. Traditionally, these sounds are ignored by speech endpoint detection algorithms, where speech regions are identified in the audio signal prior to processing. The ability to filter NLSS as a pre-processing step can significantly enhance the performance of many speech processing applications, such as speaker identification, language identification, and automatic speech recognition. In order to be used in all such applications, NLSS detection must be performed without the use of language models that provide knowledge of the phonology and lexical structure of speech. This is especially relevant to situations where the languages used in the audio are not known a priori. We present the results of preliminary experiments using data from American and British English speakers, in which segments of audio are classified as language speech sounds (LSS) or NLSS using a set of acoustic features designed for language-agnostic NLSS detection and a hidden Markov model (HMM) to model speech generation. The results of these experiments indicate that the features and model used are capable of detecting certain types of NLSS, such as breaths and clicks, while detection of other types of NLSS such as filled pauses will require future research.

  8. AUTOMATIC RECOGNITION OF CORONAL TYPE II RADIO BURSTS: THE AUTOMATED RADIO BURST IDENTIFICATION SYSTEM METHOD AND FIRST OBSERVATIONS

    International Nuclear Information System (INIS)

    Lobzin, Vasili V.; Cairns, Iver H.; Robinson, Peter A.; Steward, Graham; Patterson, Garth

    2010-01-01

    Major space weather events such as solar flares and coronal mass ejections are usually accompanied by solar radio bursts, which can potentially be used for real-time space weather forecasts. Type II radio bursts are produced near the local plasma frequency and its harmonic by fast electrons accelerated by a shock wave moving through the corona and solar wind with a typical speed of ∼1000 km s⁻¹. The coronal bursts have dynamic spectra with frequency gradually falling with time and durations of several minutes. This Letter presents a new method developed to detect type II coronal radio bursts automatically and describes its implementation in an extended Automated Radio Burst Identification System (ARBIS 2). Preliminary tests of the method with spectra obtained in 2002 show that the performance of the current implementation is quite high, ∼80%, while the probability of false positives is reasonably low, with one false positive per 100-200 hr for high solar activity and less than one false event per 10000 hr for low solar activity periods. The first automatically detected coronal type II radio burst is also presented.

  9. Speech to Text Software Evaluation Report

    CERN Document Server

    Martins Santo, Ana Luisa

    2017-01-01

    This document compares the out-of-box performance of three commercially available speech recognition software packages: Vocapia VoxSigma™, Google Cloud Speech, and Limecraft Transcriber. A set of evaluation criteria and test methods for speech recognition software is defined. The evaluation of these packages in noisy environments is also included for testing purposes. Recognition accuracy was compared across noisy environments and languages. Testing in the "ideal", non-noisy environment of a quiet room has also been performed for comparison.
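
    The record does not state which accuracy metric was used; word error rate (WER) is the customary one for comparing recognizers, so a minimal edit-distance sketch of it is given below purely as an illustration.

        def word_error_rate(reference, hypothesis):
            # WER = (substitutions + deletions + insertions) / number of reference words,
            # computed as a word-level Levenshtein distance.
            ref, hyp = reference.split(), hypothesis.split()
            d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
            for i in range(len(ref) + 1):
                d[i][0] = i
            for j in range(len(hyp) + 1):
                d[0][j] = j
            for i in range(1, len(ref) + 1):
                for j in range(1, len(hyp) + 1):
                    cost = 0 if ref[i - 1] == hyp[j - 1] else 1
                    d[i][j] = min(d[i - 1][j] + 1,         # deletion
                                  d[i][j - 1] + 1,         # insertion
                                  d[i - 1][j - 1] + cost)  # substitution or match
            return d[len(ref)][len(hyp)] / max(len(ref), 1)

        # Example: word_error_rate("the cat sat", "the cat sat down") == 1/3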

  10. Schizophrenia alters intra-network functional connectivity in the caudate for detecting speech under informational speech masking conditions.

    Science.gov (United States)

    Zheng, Yingjun; Wu, Chao; Li, Juanhua; Li, Ruikeng; Peng, Hongjun; She, Shenglin; Ning, Yuping; Li, Liang

    2018-04-04

    Speech recognition under noisy "cocktail-party" environments involves multiple perceptual/cognitive processes, including target detection, selective attention, irrelevant signal inhibition, sensory/working memory, and speech production. Compared to healthy listeners, people with schizophrenia are more vulnerable to masking stimuli and perform worse in speech recognition under speech-on-speech masking conditions. Although the schizophrenia-related speech-recognition impairment under "cocktail-party" conditions is associated with deficits of various perceptual/cognitive processes, it is crucial to know whether the brain substrates critically underlying speech detection against informational speech masking are impaired in people with schizophrenia. Using functional magnetic resonance imaging (fMRI), this study investigated differences between people with schizophrenia (n = 19, mean age = 33 ± 10 years) and their matched healthy controls (n = 15, mean age = 30 ± 9 years) in intra-network functional connectivity (FC) specifically associated with target-speech detection under speech-on-speech-masking conditions. The target-speech detection performance under the speech-on-speech-masking condition in participants with schizophrenia was significantly worse than that in matched healthy participants (healthy controls). Moreover, in healthy controls, but not participants with schizophrenia, the strength of intra-network FC within the bilateral caudate was positively correlated with the speech-detection performance under the speech-masking conditions. Compared to controls, patients showed an altered spatial activity pattern and decreased intra-network FC in the caudate. In people with schizophrenia, the declined speech-detection performance under speech-on-speech masking conditions is associated with reduced intra-caudate functional connectivity, which normally contributes to detecting target speech against speech masking via its function of suppressing masking-speech signals.

  11. Speech Matters

    DEFF Research Database (Denmark)

    Hasse Jørgensen, Stina

    2011-01-01

    About Speech Matters - Katarina Gregos, the Greek curator's exhibition at the Danish Pavilion, the Venice Biennale 2011.

  12. Optimal reduced-rank quadratic classifiers using the Fukunaga-Koontz transform with applications to automated target recognition

    Science.gov (United States)

    Huo, Xiaoming; Elad, Michael; Flesia, Ana G.; Muise, Robert R.; Stanfill, S. Robert; Friedman, Jerome; Popescu, Bogdan; Chen, Jihong; Mahalanobis, Abhijit; Donoho, David L.

    2003-09-01

    In target recognition applications of discriminant or classification analysis, each 'feature' is the result of a convolution of an image with a filter, which may be derived from a feature vector. It is important to use relatively few features. We analyze an optimal reduced-rank classifier under the two-class situation, assuming each population is Gaussian with zero mean, so that the classes differ only through their covariance matrices Σ1 and Σ2. The following matrix is considered: Λ = (Σ1 + Σ2)^(-1/2) Σ1 (Σ1 + Σ2)^(-1/2). We show that the k eigenvectors of this matrix whose eigenvalues are most different from 1/2 offer the best rank-k approximation to the maximum likelihood classifier. The matrix Λ and its eigenvectors were introduced by Fukunaga and Koontz; hence this analysis gives a new interpretation of the well-known Fukunaga-Koontz transform. The optimality promised by this method holds if the two populations are exactly Gaussian with the same means. To check the applicability of this approach to real data, an experiment was performed in which several 'modern' classifiers were used on infrared ATR data. In these experiments, a reduced-rank classifier, Tuned Basis Functions, outperforms the others. The competitive performance of the optimal reduced-rank quadratic classifier suggests that, at least for classification purposes, the imagery data behaves in a nearly Gaussian fashion.
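
    As a numerical illustration of the construction described above (not code from the paper), the sketch below forms Λ from two class covariance matrices and keeps the eigenvectors whose eigenvalues lie farthest from 1/2; the matrix size and random data are placeholders.

        import numpy as np

        def fukunaga_koontz_basis(cov1, cov2, k):
            # Lambda = S^(-1/2) cov1 S^(-1/2) with S = cov1 + cov2; keep the k
            # eigenvectors whose eigenvalues are farthest from 1/2.
            s = cov1 + cov2
            w, v = np.linalg.eigh(s)                     # S assumed positive definite
            s_inv_sqrt = v @ np.diag(1.0 / np.sqrt(w)) @ v.T
            lam = s_inv_sqrt @ cov1 @ s_inv_sqrt
            evals, evecs = np.linalg.eigh(lam)
            order = np.argsort(np.abs(evals - 0.5))[::-1]
            return evecs[:, order[:k]], evals[order[:k]]

        # Illustrative use with random zero-mean data of dimension 16.
        rng = np.random.default_rng(0)
        x1 = rng.standard_normal((200, 16))
        x2 = 2.0 * rng.standard_normal((200, 16))
        basis, values = fukunaga_koontz_basis(np.cov(x1.T), np.cov(x2.T), k=4)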

  13. Advocate: A Distributed Architecture for Speech-to-Speech Translation

    Science.gov (United States)

    2009-01-01

    tecture, are either wrapped natural-language processing (NLP) components or objects developed from scratch using the architecture's API. GATE is...framework, we put together a demonstration Arabic-to-English speech translation system using both internally developed (Arabic speech recognition and MT...conditions of our Arabic S2S demonstration system described earlier. Once again, the data size was varied and eighty identical requests were

  14. Speech-to-Speech Relay Service

    Science.gov (United States)

    Consumer Guide Speech to Speech Relay Service Speech-to-Speech (STS) is one form of Telecommunications Relay Service (TRS). TRS is a service that allows persons with hearing and speech disabilities ...

  15. Testing of Haar-Like Feature in Region of Interest Detection for Automated Target Recognition (ATR) System

    Science.gov (United States)

    Zhang, Yuhan; Lu, Dr. Thomas

    2010-01-01

    The objectives of this project were to develop a ROI (Region of Interest) detector using Haar-like features similar to the face detection in Intel's OpenCV library, implement it in Matlab code, and test the performance of the new ROI detector against the existing ROI detector that uses an Optimal Trade-off Maximum Average Correlation Height (OTMACH) filter. The ROI detector included three parts: (1) automated Haar-like feature selection, to find a small set of the most relevant Haar-like features for detecting ROIs that contained a target; (2) given the small set of Haar-like features from the previous step, training a neural network to recognize ROIs with targets by taking the Haar-like features as inputs; (3) using the trained neural network, developing a filtering method to process the neural network responses into a small set of regions of interest. All three parts needed to be coded in Matlab. The parameters in the detector needed to be trained by machine learning and tested with specific datasets. Since the OpenCV library and its Haar-like features were not available in Matlab, the Haar-like feature calculation needed to be implemented in Matlab. Matlab code for Adaptive Boosting and max/min filters could be found on the Internet but needed to be integrated to serve the purpose of this project. The performance of the new detector was tested by comparing the accuracy and the speed of the new detector against the existing OTMACH detector. The speed was defined as the average speed of finding the regions of interest in an image. The accuracy was measured by the number of false positives (false alarms) at the same detection rate between the two detectors.
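
    The Haar-like features mentioned above are rectangle-sum differences that become cheap to evaluate once an integral image is available. The Python sketch below (illustrative only; the project itself was implemented in Matlab) shows an integral image, a rectangle sum, and one two-rectangle feature.

        import numpy as np

        def integral_image(img):
            # Summed-area table: any rectangle sum then costs four lookups.
            return img.cumsum(axis=0).cumsum(axis=1)

        def rect_sum(ii, top, left, height, width):
            total = ii[top + height - 1, left + width - 1]
            if top > 0:
                total -= ii[top - 1, left + width - 1]
            if left > 0:
                total -= ii[top + height - 1, left - 1]
            if top > 0 and left > 0:
                total += ii[top - 1, left - 1]
            return total

        def haar_two_rect_vertical(ii, top, left, height, width):
            # Basic two-rectangle Haar-like feature: left half minus right half.
            half = width // 2
            return (rect_sum(ii, top, left, height, half)
                    - rect_sum(ii, top, left + half, height, half))

        # Such feature values would be fed to the neural network that scores
        # candidate regions of interest.
        ii = integral_image(np.random.rand(64, 64))
        value = haar_two_rect_vertical(ii, top=8, left=8, height=24, width=24)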

  16. FOLD-EM: automated fold recognition in medium- and low-resolution (4-15 Å) electron density maps.

    Science.gov (United States)

    Saha, Mitul; Morais, Marc C

    2012-12-15

    Owing to the size and complexity of large multi-component biological assemblies, the most tractable approach to determining their atomic structure is often to fit high-resolution radiographic or nuclear magnetic resonance structures of isolated components into lower resolution electron density maps of the larger assembly obtained using cryo-electron microscopy (cryo-EM). This hybrid approach to structure determination requires that an atomic resolution structure of each component, or a suitable homolog, is available. If neither is available, then the amount of structural information regarding that component is limited by the resolution of the cryo-EM map. However, even if a suitable homolog cannot be identified using sequence analysis, a search for structural homologs should still be performed, because structural homology often persists throughout evolution even when sequence homology is undetectable. As macromolecules can often be described as a collection of independently folded domains, one way of searching for structural homologs would be to systematically fit representative domain structures from a protein domain database into the medium/low resolution cryo-EM map and return the best fits. Taken together, the best-fitting non-overlapping structures would constitute a 'mosaic' backbone model of the assembly that could aid map interpretation and illuminate biological function. Using the computational principles of the Scale-Invariant Feature Transform (SIFT), we have developed FOLD-EM, a computational tool that can identify folded macromolecular domains in medium- to low-resolution (4-15 Å) electron density maps and return a model of the constituent polypeptides in a fully automated fashion. As a by-product, FOLD-EM can also perform flexible multi-domain fitting that may provide insight into conformational changes that occur in macromolecular assemblies.

  17. The motor theory of speech perception revisited.

    Science.gov (United States)

    Massaro, Dominic W; Chen, Trevor H

    2008-04-01

    Galantucci, Fowler, and Turvey (2006) have claimed that perceiving speech is perceiving gestures and that the motor system is recruited for perceiving speech. We make the counter-argument that perceiving speech is not perceiving gestures, that the motor system is not recruited for perceiving speech, and that speech perception can be adequately described by a prototypical pattern recognition model, the fuzzy logical model of perception (FLMP). Empirical evidence taken as support for gesture and motor theory is reconsidered in more detail and in the framework of the FLMP. Additional theoretical and logical arguments are made to challenge gesture and motor theory.

  18. Censored: Whistleblowers and impossible speech

    OpenAIRE

    Kenny, Kate

    2017-01-01

    What happens to a person who speaks out about corruption in their organization, and finds themselves excluded from their profession? In this article, I argue that whistleblowers experience exclusions because they have engaged in ‘impossible speech’, that is, a speech act considered to be unacceptable or illegitimate. Drawing on Butler’s theories of recognition and censorship, I show how norms of acceptable speech working through recruitment practices, alongside the actions of colleagues, can ...

  19. Apraxia of Speech

    Science.gov (United States)

    ... What is apraxia of speech? Apraxia of speech (AOS), also known as acquired ...

  20. Introductory speeches

    International Nuclear Information System (INIS)

    2001-01-01

    This CD is a multimedia presentation of the safety upgrading programme of the Bohunice V1 NPP. This chapter consists of an introductory commentary and 4 introductory speeches (video records): (1) Introductory speech of Vincent Pillar, Board chairman and director general of Slovak electric, Plc. (SE); (2) Introductory speech of Stefan Schmidt, director of SE - Bohunice Nuclear power plants; (3) Introductory speech of Jan Korec, Board chairman and director general of VUJE Trnava, Inc. - Engineering, Design and Research Organisation, Trnava; (4) Introductory speech of Dietrich Kuschel, Senior vice-president of FRAMATOME ANP Project and Engineering.