WorldWideScience

Sample records for automatic speech recognition

  1. Automatic speech recognition a deep learning approach

    CERN Document Server

    Yu, Dong

    2015-01-01

This book summarizes recent advances in the field of automatic speech recognition with a focus on discriminative and hierarchical models. It is the first automatic speech recognition book to include comprehensive coverage of recent developments such as conditional random fields and deep learning techniques. It presents insights into, and the theoretical foundations of, a series of recent models such as the conditional random field, the semi-Markov and hidden conditional random fields, deep neural networks, deep belief networks, and deep stacking models for sequential learning. It also discusses practical considerations of using these models in both acoustic and language modeling for continuous speech recognition.

  2. Automatic Phonetic Transcription for Danish Speech Recognition

    DEFF Research Database (Denmark)

    Kirkedal, Andreas Søeborg

Automatic speech recognition (ASR) uses dictionaries that map orthographic words to their phonetic representation. To minimize the occurrence of out-of-vocabulary words, ASR requires large phonetic dictionaries to model pronunciation. Hand-crafted high-quality phonetic dictionaries are difficult...... for particular words and word classes in addition. In comparison, English has 5,852 spelling-to-phoneme rules and 4,133 additional rules and 8,278 rules and 3,829 additional rules. Phonix applies deep morphological analysis as a preprocessing step. Should the analysis fail, several fallback strategies...... to dictionary lookup, compound splitting and letter-to-sound rules. Phonix and eSpeak will be compared in an ASR scenario using the Kaldi speech recognition toolkit (Povey et al., 2011) and compared to a graphemic baseline. Also a mapping between the phonetic alphabets, which are both based on X-Sampa or IPA...

  3. Predicting automatic speech recognition performance over communication channels from instrumental speech quality and intelligibility scores

    NARCIS (Netherlands)

    Gallardo, L.F.; Möller, S.; Beerends, J.

    2017-01-01

    The performance of automatic speech recognition based on coded-decoded speech heavily depends on the quality of the transmitted signals, determined by channel impairments. This paper examines relationships between speech recognition performance and measurements of speech quality and intelligibility

  4. Development of a System for Automatic Recognition of Speech

    Directory of Open Access Journals (Sweden)

    Roman Jarina

    2003-01-01

Full Text Available The article gives a review of research on the processing and automatic recognition of speech signals (ASR) at the Department of Telecommunications of the Faculty of Electrical Engineering, University of Žilina. On-going research is oriented towards speech parametrization using 2-dimensional cepstral analysis, and towards the application of HMMs and neural networks for speech recognition in the Slovak language. The article summarizes the achieved results and outlines the future orientation of our research in automatic speech recognition.

  5. Indonesian Automatic Speech Recognition For Command Speech Controller Multimedia Player

    Directory of Open Access Journals (Sweden)

    Vivien Arief Wardhany

    2014-12-01

Full Text Available The purpose of this work is to enable multimedia devices to be controlled through voice. Nowadays, voice control is typically available only in English; to overcome this, recognition is performed using an Indonesian language model, acoustic model and dictionary. The automatic speech recognizer is built on the CMU Sphinx engine with its English database adapted to Indonesian, and XBMC is used as the multimedia player. The experiment involved 10 volunteers testing 7 commands, classified by gender (5 male and 5 female); 10 samples were taken per command, with each volunteer performing 10 test utterances and trying all 7 commands provided. Based on the classification table, the word "kanan" was recognized most often (83%), while "pilih" was recognized least often. The word most often misclassified was "kembali" (67%), while "kanan" was misclassified least often. In the male recognition-rate (RR) results, the commands "kembali", "utama", "atas" and "bawah" had low recognition rates. In particular, "kembali" could not be recognized at all in the female voices and reached only 4% RR in the male voices, because the command has no similar-sounding English word close to "kembali", so the system failed to recognize it. The command "pilih" reached 80% RR with female voices but only 4% with male voices. These problems are mostly due to the different voice characteristics of adult males and females: male voices have lower fundamental frequencies (about 85 to 180 Hz) than female voices (about 165 to 255 Hz). The results of the experiment also showed that recognition rates differed between speakers because of differences in tone, pronunciation, and speaking rate. Further work is needed to improve the accuracy of the Indonesian automatic speech recognition system.

  6. Automatic Speech Recognition from Neural Signals: A Focused Review

    Directory of Open Access Journals (Sweden)

    Christian Herff

    2016-09-01

Full Text Available Speech interfaces have become widely accepted and are nowadays integrated in various real-life applications and devices. They have become a part of our daily life. However, speech interfaces presume the ability to produce intelligible speech, which might be impossible due to loud environments, the need not to disturb bystanders, or the inability to produce speech (i.e., patients suffering from locked-in syndrome). For these reasons it would be highly desirable not to speak, but simply to envision oneself saying words or sentences. Interfaces based on imagined speech would enable fast and natural communication without the need for audible speech and would give a voice to otherwise mute people. This focused review analyzes the potential of different brain imaging techniques to recognize speech from neural signals by applying Automatic Speech Recognition technology. We argue that modalities based on metabolic processes, such as functional Near Infrared Spectroscopy and functional Magnetic Resonance Imaging, are less suited for Automatic Speech Recognition from neural signals due to their low temporal resolution, but are very useful for investigating the underlying neural mechanisms involved in speech processing. In contrast, electrophysiological activity is fast enough to capture speech processes and is therefore better suited for ASR. Our experimental results indicate the potential of these signals for speech recognition from neural data, with a focus on invasively measured brain activity (electrocorticography). As a first example of Automatic Speech Recognition techniques applied to neural signals, we discuss the Brain-to-text system.

  7. Bridging Automatic Speech Recognition and Psycholinguistics: Extending Shortlist to an End-to-End Model of Human Speech Recognition

    NARCIS (Netherlands)

    Scharenborg, O.E.; Bosch, L.F.M. ten; Boves, L.W.J.; Norris, D.

    2003-01-01

    This letter evaluates potential benefits of combining human speech recognition (HSR) and automatic speech recognition by building a joint model of an automatic phone recognizer (APR) and a computational model of HSR, viz. Shortlist (Norris, 1994). Experiments based on 'real-life' speech highlight

  8. Automatic Emotion Recognition in Speech: Possibilities and Significance

    Directory of Open Access Journals (Sweden)

    Milana Bojanić

    2009-12-01

Full Text Available Automatic speech recognition and spoken language understanding are crucial steps towards natural human-machine interaction. The main task of the speech communication process is the recognition of the word sequence, but the recognition of prosody, emotion and stress tags may be of particular importance as well. This paper discusses the possibilities of recognizing emotion from the speech signal in order to improve ASR, and also provides an analysis of acoustic features that can be used for the detection of the speaker's emotion and stress. The paper also provides a short overview of emotion and stress classification techniques. The importance and place of emotional speech recognition is shown in the domain of human-computer interactive systems and the transaction communication model. Directions for future work are given at the end of this work.

  9. Brain-inspired speech segmentation for automatic speech recognition using the speech envelope as a temporal reference

    OpenAIRE

    Byeongwook Lee; Kwang-Hyun Cho

    2016-01-01

    Speech segmentation is a crucial step in automatic speech recognition because additional speech analyses are performed for each framed speech segment. Conventional segmentation techniques primarily segment speech using a fixed frame size for computational simplicity. However, this approach is insufficient for capturing the quasi-regular structure of speech, which causes substantial recognition failure in noisy environments. How does the brain handle quasi-regular structured speech and maintai...

  10. Automatic speech recognition for radiological reporting

    International Nuclear Information System (INIS)

    Vidal, B.

    1991-01-01

Large vocabulary speech recognition, its techniques and its software and hardware technology, are being developed, aimed at providing the office user with a tool that could significantly improve both the quantity and quality of his work: the dictation machine, which allows memos and documents to be input using voice and a microphone instead of fingers and a keyboard. The IBM Rome Science Center, together with the IBM Research Division, has built a prototype recognizer that accepts sentences in natural language from a 20,000-word Italian vocabulary. The unit runs on a personal computer equipped with special hardware capable of providing all the necessary computing power. The first laboratory experiments yielded very interesting results and revealed system characteristics that make its use possible in operational environments. To this purpose, the dictation of medical reports was considered a suitable application. In cooperation with the 2nd Radiology Department of S. Maria della Misericordia Hospital (Udine, Italy), the system was tested by radiology department doctors during their everyday work. The doctors were able to dictate their reports directly to the unit. The text appeared immediately on the screen, and any errors could be corrected either by voice or by using the keyboard. At the end of report dictation, the doctors could both print and archive the text. The report could also be forwarded to the hospital information system, when the latter was available. Our results have been very encouraging: the system proved to be robust, simple to use, and accurate (over 95% average recognition rate). The experiment was valuable for suggestions and comments, and its results are useful for system evolution towards improved system management and efficiency.

  11. Using Automatic Speech Recognition Technology with Elicited Oral Response Testing

    Science.gov (United States)

    Cox, Troy L.; Davies, Randall S.

    2012-01-01

    This study examined the use of automatic speech recognition (ASR) scored elicited oral response (EOR) tests to assess the speaking ability of English language learners. It also examined the relationship between ASR-scored EOR and other language proficiency measures and the ability of the ASR to rate speakers without bias to gender or native…

  12. Automatic Speech Recognition: Reliability and Pedagogical Implications for Teaching Pronunciation

    Science.gov (United States)

    Kim, In-Seok

    2006-01-01

This study examines the reliability of automatic speech recognition (ASR) software used to teach English pronunciation, focusing on one particular piece of software, "FluSpeak," as a typical example. Thirty-six Korean English as a Foreign Language (EFL) college students participated in an experiment in which they listened to 15 sentences…

  13. Analysis of Phonetic Transcriptions for Danish Automatic Speech Recognition

    DEFF Research Database (Denmark)

    Kirkedal, Andreas Søeborg

    2013-01-01

    Automatic speech recognition (ASR) relies on three resources: audio, orthographic transcriptions and a pronunciation dictionary. The dictionary or lexicon maps orthographic words to sequences of phones or phonemes that represent the pronunciation of the corresponding word. The quality of a speech...... recognition system depends heavily on the dictionary and the transcriptions therein. This paper presents an analysis of phonetic/phonemic features that are salient for current Danish ASR systems. This preliminary study consists of a series of experiments using an ASR system trained on the DK-PAROLE corpus...

  14. Modelling context in automatic speech recognition

    NARCIS (Netherlands)

    Wiggers, P.

    2008-01-01

Speech is at the core of human communication. Speaking and listening come so naturally to us that we do not have to think about them at all. The underlying cognitive processes are very rapid and almost completely subconscious. It is hard, if not impossible, not to understand speech. For computers on the

  15. Matrix sentence intelligibility prediction using an automatic speech recognition system.

    Science.gov (United States)

    Schädler, Marc René; Warzybok, Anna; Hochmuth, Sabine; Kollmeier, Birger

    2015-01-01

The feasibility of predicting the outcome of the German matrix sentence test for different types of stationary background noise using an automatic speech recognition (ASR) system was studied. Speech reception thresholds (SRT) of 50% intelligibility were predicted in seven noise conditions. The ASR system used Mel-frequency cepstral coefficients as a front-end and employed whole-word Hidden Markov models on the back-end side. The ASR system was trained and tested with noisy matrix sentences on a broad range of signal-to-noise ratios. The ASR-based predictions were compared to data from the literature (Hochmuth et al., 2015) obtained with 10 native German listeners with normal hearing and predictions of the speech intelligibility index (SII). The ASR-based predictions showed a high and significant correlation (R² = 0.95, p speech and noise signals. Minimum assumptions were made about human speech processing already incorporated in a reference-free ordinary ASR system.
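
    The front-end named in this abstract (Mel-frequency cepstral coefficients) can be reproduced with standard tooling. Below is a minimal sketch, assuming librosa is available; the frame settings (25 ms windows, 10 ms hop, 13 coefficients) are common ASR defaults and not necessarily the study's exact parameters.

    ```python
    # Minimal MFCC front-end sketch (assumed parameters, not the study's exact setup).
    import numpy as np
    import librosa

    def mfcc_frontend(wav_path, sr=16000, n_mfcc=13):
        """Load audio and return a (frames x coefficients) MFCC matrix."""
        signal, sr = librosa.load(wav_path, sr=sr)
        mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=n_mfcc,
                                    n_fft=int(0.025 * sr),      # 25 ms analysis window
                                    hop_length=int(0.010 * sr))  # 10 ms frame shift
        return mfcc.T  # one row per frame

    # Example: features = mfcc_frontend("matrix_sentence.wav")
    ```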

  16. Bridging automatic speech recognition and psycholinguistics: Extending Shortlist to an end-to-end model of human speech recognition (L)

    Science.gov (United States)

    Scharenborg, Odette; ten Bosch, Louis; Boves, Lou; Norris, Dennis

    2003-12-01

    This letter evaluates potential benefits of combining human speech recognition (HSR) and automatic speech recognition by building a joint model of an automatic phone recognizer (APR) and a computational model of HSR, viz., Shortlist [Norris, Cognition 52, 189-234 (1994)]. Experiments based on ``real-life'' speech highlight critical limitations posed by some of the simplifying assumptions made in models of human speech recognition. These limitations could be overcome by avoiding hard phone decisions at the output side of the APR, and by using a match between the input and the internal lexicon that flexibly copes with deviations from canonical phonemic representations.

  17. Automatic Phonetic Transcription for Danish Speech Recognition

    DEFF Research Database (Denmark)

    Kirkedal, Andreas Søeborg

for English and now extended to cover 50 languages. Due to the nature of open source software, the quality of language support depends greatly on who encoded them. The Danish version was created by a Danish native speaker and contains more than 8,600 spelling-to-phoneme rules and more than 11,000 rules...... for particular words and word classes in addition. In comparison, English has 5,852 spelling-to-phoneme rules and 4,133 additional rules and 8,278 rules and 3,829 additional rules. Phonix applies deep morphological analysis as a preprocessing step. Should the analysis fail, several fallback strategies...... to acquire and expensive to create. For languages with productive compounding or agglutinative languages like German and Finnish, respectively, phonetic dictionaries are also hard to maintain. For this reason, automatic phonetic transcription tools have been produced for many languages. The quality...

  18. Automatic speech recognition for report generation in computed tomography

    International Nuclear Information System (INIS)

    Teichgraeber, U.K.M.; Ehrenstein, T.; Lemke, M.; Liebig, T.; Stobbe, H.; Hosten, N.; Keske, U.; Felix, R.

    1999-01-01

Purpose: A study was performed to compare the performance of automatic speech recognition (ASR) with conventional transcription. Materials and Methods: 100 CT reports were generated using ASR and 100 CT reports were dictated and written by medical transcriptionists. The time for dictation and correction of errors by the radiologist was assessed and the types of mistakes were analysed. The text recognition rate was calculated in both groups and the average time between completion of the imaging study by the technologist and generation of the written report was assessed. A commercially available speech recognition technology (ASKA Software, IBM Via Voice) running on a personal computer was used. Results: The time for dictation using digital voice recognition was 9.4±2.3 min compared to 4.5±3.6 min with an ordinary Dictaphone. The text recognition rate was 97% with digital voice recognition and 99% with medical transcriptionists. The average time from imaging completion to written report finalisation was reduced from 47.3 hours with medical transcriptionists to 12.7 hours with ASR. The analysis of misspellings demonstrated (ASR vs. medical transcriptionists): 3 vs. 4 syntax errors, 0 vs. 37 orthographic mistakes, 16 vs. 22 mistakes in substance and 47 vs. erroneously applied terms. Conclusions: The use of digital voice recognition as a replacement for medical transcription is recommended when immediate availability of written reports is necessary. (orig.)

  19. Leveraging Automatic Speech Recognition Errors to Detect Challenging Speech Segments in TED Talks

    Science.gov (United States)

    Mirzaei, Maryam Sadat; Meshgi, Kourosh; Kawahara, Tatsuya

    2016-01-01

    This study investigates the use of Automatic Speech Recognition (ASR) systems to epitomize second language (L2) listeners' problems in perception of TED talks. ASR-generated transcripts of videos often involve recognition errors, which may indicate difficult segments for L2 listeners. This paper aims to discover the root-causes of the ASR errors…

  20. Automatic Speech Acquisition and Recognition for Spacesuit Audio Systems

    Science.gov (United States)

    Ye, Sherry

    2015-01-01

    NASA has a widely recognized but unmet need for novel human-machine interface technologies that can facilitate communication during astronaut extravehicular activities (EVAs), when loud noises and strong reverberations inside spacesuits make communication challenging. WeVoice, Inc., has developed a multichannel signal-processing method for speech acquisition in noisy and reverberant environments that enables automatic speech recognition (ASR) technology inside spacesuits. The technology reduces noise by exploiting differences between the statistical nature of signals (i.e., speech) and noise that exists in the spatial and temporal domains. As a result, ASR accuracy can be improved to the level at which crewmembers will find the speech interface useful. System components and features include beam forming/multichannel noise reduction, single-channel noise reduction, speech feature extraction, feature transformation and normalization, feature compression, and ASR decoding. Arithmetic complexity models were developed and will help designers of real-time ASR systems select proper tasks when confronted with constraints in computational resources. In Phase I of the project, WeVoice validated the technology. The company further refined the technology in Phase II and developed a prototype for testing and use by suited astronauts.

  1. Automatic speech recognition (ASR) based approach for speech therapy of aphasic patients: A review

    Science.gov (United States)

    Jamal, Norezmi; Shanta, Shahnoor; Mahmud, Farhanahani; Sha'abani, MNAH

    2017-09-01

This paper reviews state-of-the-art automatic speech recognition (ASR) based approaches for the speech therapy of aphasic patients. Aphasia is a condition in which the affected person suffers from a speech and language disorder resulting from a stroke or brain injury. Since there is a growing body of evidence indicating the possibility of improving the symptoms at an early stage, ASR-based solutions are increasingly being researched for speech and language therapy. ASR is a technology that converts human speech into transcript text by matching it against the system's library. This is particularly useful in speech rehabilitation therapies as it provides accurate, real-time evaluation of speech input from an individual with a speech disorder. ASR-based approaches for speech therapy recognize the speech input from the aphasic patient and provide real-time feedback on their mistakes. However, the accuracy of ASR depends on many factors such as phoneme recognition, speech continuity, speaker and environmental differences, as well as our depth of knowledge of human language understanding. Hence, the review examines recent developments in ASR technologies and their performance for individuals with speech and language disorders.

  2. Application of automatic speech recognition to quantitative assessment of tracheoesophageal speech with different signal quality.

    Science.gov (United States)

    Haderlein, Tino; Riedhammer, Korbinian; Nöth, Elmar; Toy, Hikmet; Schuster, Maria; Eysholdt, Ulrich; Hornegger, Joachim; Rosanowski, Frank

    2009-01-01

    Tracheoesophageal voice is state-of-the-art in voice rehabilitation after laryngectomy. Intelligibility on a telephone is an important evaluation criterion as it is a crucial part of social life. An objective measure of intelligibility when talking on a telephone is desirable in the field of postlaryngectomy speech therapy and its evaluation. Based upon successful earlier studies with broadband speech, an automatic speech recognition (ASR) system was applied to 41 recordings of postlaryngectomy patients. Recordings were available in different signal qualities; quality was the crucial criterion for this study. Compared to the intelligibility rating of 5 human experts, the ASR system had a correlation coefficient of r = -0.87 and Krippendorff's alpha of 0.65 when broadband speech was processed. The rater group alone achieved alpha = 0.66. With the test recordings in telephone quality, the system reached r = -0.79 and alpha = 0.67. For medical purposes, a comprehensive diagnostic approach to (substitute) voice has to cover both subjective and objective tests. An automatic recognition system such as the one proposed in this study can be used for objective intelligibility rating with results comparable to those of human experts. This holds for broadband speech as well as for automatic evaluation via telephone. Copyright 2008 S. Karger AG, Basel.

  3. Annotation of Heterogeneous Multimedia Content Using Automatic Speech Recognition

    NARCIS (Netherlands)

    Huijbregts, M.A.H.; Ordelman, Roeland J.F.; de Jong, Franciska M.G.

    2007-01-01

    This paper reports on the setup and evaluation of robust speech recognition system parts, geared towards transcript generation for heterogeneous, real-life media collections. The system is deployed for generating speech transcripts for the NIST/TRECVID-2007 test collection, part of a Dutch real-life

  4. Automatic Speech Recognition Systems for the Evaluation of Voice and Speech Disorders in Head and Neck Cancer

    OpenAIRE

    Andreas Maier; Tino Haderlein; Florian Stelzle; Elmar Nöth; Emeka Nkenke; Frank Rosanowski; Anne Schützenberger; Maria Schuster

    2010-01-01

    In patients suffering from head and neck cancer, speech intelligibility is often restricted. For assessment and outcome measurements, automatic speech recognition systems have previously been shown to be appropriate for objective and quick evaluation of intelligibility. In this study we investigate the applicability of the method to speech disorders caused by head and neck cancer. Intelligibility was quantified by speech recognition on recordings of a standard text read by 41 German laryngect...

  5. Automatic Speech Recognition Systems for the Evaluation of Voice and Speech Disorders in Head and Neck Cancer

    Directory of Open Access Journals (Sweden)

    Andreas Maier

    2010-01-01

    Full Text Available In patients suffering from head and neck cancer, speech intelligibility is often restricted. For assessment and outcome measurements, automatic speech recognition systems have previously been shown to be appropriate for objective and quick evaluation of intelligibility. In this study we investigate the applicability of the method to speech disorders caused by head and neck cancer. Intelligibility was quantified by speech recognition on recordings of a standard text read by 41 German laryngectomized patients with cancer of the larynx or hypopharynx and 49 German patients who had suffered from oral cancer. The speech recognition provides the percentage of correctly recognized words of a sequence, that is, the word recognition rate. Automatic evaluation was compared to perceptual ratings by a panel of experts and to an age-matched control group. Both patient groups showed significantly lower word recognition rates than the control group. Automatic speech recognition yielded word recognition rates which complied with experts' evaluation of intelligibility on a significant level. Automatic speech recognition serves as a good means with low effort to objectify and quantify the most important aspect of pathologic speech—the intelligibility. The system was successfully applied to voice and speech disorders.

  6. Multilingual Techniques for Low Resource Automatic Speech Recognition

    Science.gov (United States)

    2016-05-20

  7. Speech Acquisition and Automatic Speech Recognition for Integrated Spacesuit Audio Systems

    Science.gov (United States)

    Huang, Yiteng; Chen, Jingdong; Chen, Shaoyan

    2010-01-01

A voice-command human-machine interface system has been developed for spacesuit extravehicular activity (EVA) missions. A multichannel acoustic signal processing method has been created for distant speech acquisition in noisy and reverberant environments. This technology reduces noise by exploiting differences in the statistical nature of signal (i.e., speech) and noise that exist in the spatial and temporal domains. As a result, the automatic speech recognition (ASR) accuracy can be improved to the level at which crewmembers would find the speech interface useful. The developed speech human/machine interface will enable both crewmember usability and operational efficiency. It offers a fast rate of data/text entry, a small overall size, and low weight. In addition, this design will free the hands and eyes of a suited crewmember. The system components and steps include beam forming/multi-channel noise reduction, single-channel noise reduction, speech feature extraction, feature transformation and normalization, feature compression, model adaptation, ASR HMM (Hidden Markov Model) training, and ASR decoding. A state-of-the-art phoneme recognizer can obtain an accuracy rate of 65 percent when the training and testing data are free of noise. When it is used in spacesuits, the rate drops to about 33 percent. With the developed microphone array speech-processing technologies, the performance is improved and the phoneme recognition accuracy rate rises to 44 percent. The recognizer can be further improved by combining the microphone array and HMM model adaptation techniques and using speech samples collected from inside spacesuits. In addition, arithmetic complexity models for the major HMM-based ASR components were developed. They can help real-time ASR system designers select proper tasks when confronted with constraints in computational resources.
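
    The first stage of the pipeline described above is multichannel noise reduction. As an illustration of the idea (not the actual algorithm developed in this work), a toy delay-and-sum beamformer shows how aligned microphone channels reinforce the target speech while diffuse noise partially cancels.

    ```python
    # Toy delay-and-sum beamformer: a simple stand-in for the multichannel
    # noise-reduction stage described above, not the project's actual method.
    import numpy as np

    def delay_and_sum(channels, delays_samples):
        """channels: list of 1-D arrays (one per microphone);
        delays_samples: integer delay per channel that aligns the target speech."""
        length = min(len(c) for c in channels)
        out = np.zeros(length)
        for ch, d in zip(channels, delays_samples):
            out += np.roll(ch[:length], -d)  # align each channel, then sum
        return out / len(channels)

    # Coherent speech adds constructively across channels; uncorrelated noise does not.
    ```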

  8. Fusing Eye-gaze and Speech Recognition for Tracking in an Automatic Reading Tutor

    DEFF Research Database (Denmark)

    Rasmussen, Morten Højfeldt; Tan, Zheng-Hua

    2013-01-01

    In this paper we present a novel approach for automatically tracking the reading progress using a combination of eye-gaze tracking and speech recognition. The two are fused by first generating word probabilities based on eye-gaze information and then using these probabilities to augment...... the language model probabilities during speech recognition. Experimental results on a small dataset show that the tracking error rate of the system using only speech recognition is 34.9% whereas the tracking error rate for the system that incorporates eye-gaze tracking into the speech recognizer is 31...
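
    The fusion step described above (gaze-derived word probabilities augmenting the language-model probabilities) can be illustrated with a simple log-linear combination. This is a sketch of one plausible weighting scheme, not the paper's exact formulation; the weight value is an assumption.

    ```python
    # Sketch of fusing gaze-based word probabilities with language-model
    # probabilities (illustrative weighting, not the paper's exact formulation).
    import math

    def fused_log_prob(lm_prob, gaze_prob, gaze_weight=0.5):
        """Log-linear combination of LM and gaze evidence for one word."""
        return (1 - gaze_weight) * math.log(lm_prob) + gaze_weight * math.log(gaze_prob)

    # A word currently under the reader's fixation gets a boost relative to one that is not.
    print(fused_log_prob(lm_prob=0.01, gaze_prob=0.30))
    print(fused_log_prob(lm_prob=0.01, gaze_prob=0.01))
    ```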

  9. Man-system interface based on automatic speech recognition: integration to a virtual control desk

    International Nuclear Information System (INIS)

    Jorge, Carlos Alexandre F.; Mol, Antonio Carlos A.; Pereira, Claudio M.N.A.; Aghina, Mauricio Alves C.; Nomiya, Diogo V.

    2009-01-01

This work reports the implementation of a man-system interface based on automatic speech recognition, and its integration into a virtual nuclear power plant control desk. The latter aims to reproduce a real control desk using virtual reality technology, for operator training and ergonomic evaluation purposes. An automatic speech recognition system was developed to serve as a new interface with users, substituting for the computer keyboard and mouse. Users can operate this virtual control desk in front of a computer monitor or a projection screen through spoken commands. The automatic speech recognition interface developed is based on a well-known signal processing technique named cepstral analysis, and on artificial neural networks. The speech recognition interface is described, along with its integration with the virtual control desk, and results are presented. (author)
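
    The cepstral analysis named in this record is a standard signal-processing step. As a generic illustration (not the authors' code), the real cepstrum of one speech frame can be computed as the inverse FFT of the log magnitude spectrum; its low-order coefficients would then feed the neural-network classifier.

    ```python
    # Real-cepstrum computation for one speech frame (generic sketch).
    import numpy as np

    def real_cepstrum(frame):
        """Cepstrum = inverse FFT of the log magnitude spectrum of a windowed frame."""
        spectrum = np.fft.fft(frame * np.hamming(len(frame)))
        log_mag = np.log(np.abs(spectrum) + 1e-10)  # small offset avoids log(0)
        return np.real(np.fft.ifft(log_mag))

    # The low-order cepstral coefficients describe the spectral envelope and serve
    # as input features for a classifier mapping utterances to spoken commands.
    ```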

  10. Man-system interface based on automatic speech recognition: integration to a virtual control desk

    Energy Technology Data Exchange (ETDEWEB)

    Jorge, Carlos Alexandre F.; Mol, Antonio Carlos A.; Pereira, Claudio M.N.A.; Aghina, Mauricio Alves C., E-mail: calexandre@ien.gov.b, E-mail: mol@ien.gov.b, E-mail: cmnap@ien.gov.b, E-mail: mag@ien.gov.b [Instituto de Engenharia Nuclear (IEN/CNEN-RJ), Rio de Janeiro, RJ (Brazil); Nomiya, Diogo V., E-mail: diogonomiya@gmail.co [Universidade Federal do Rio de Janeiro (UFRJ), RJ (Brazil)

    2009-07-01

This work reports the implementation of a man-system interface based on automatic speech recognition, and its integration into a virtual nuclear power plant control desk. The latter aims to reproduce a real control desk using virtual reality technology, for operator training and ergonomic evaluation purposes. An automatic speech recognition system was developed to serve as a new interface with users, substituting for the computer keyboard and mouse. Users can operate this virtual control desk in front of a computer monitor or a projection screen through spoken commands. The automatic speech recognition interface developed is based on a well-known signal processing technique named cepstral analysis, and on artificial neural networks. The speech recognition interface is described, along with its integration with the virtual control desk, and results are presented. (author)

  11. Assessing the efficacy of benchmarks for automatic speech accent recognition

    OpenAIRE

    Benjamin Bock; Lior Shamir

    2015-01-01

    Speech accents can possess valuable information about the speaker, and can be used in intelligent multimedia-based human-computer interfaces. The performance of algorithms for automatic classification of accents is often evaluated using audio datasets that include recording samples of different people, representing different accents. Here we describe a method that can detect bias in accent datasets, and apply the method to two accent identification datasets to reveal the existence of dataset ...

  12. Automatic Speech Recognition Predicts Speech Intelligibility and Comprehension for Listeners with Simulated Age-Related Hearing Loss

    Science.gov (United States)

    Fontan, Lionel; Ferrané, Isabelle; Farinas, Jérôme; Pinquier, Julien; Tardieu, Julien; Magnen, Cynthia; Gaillard, Pascal; Aumont, Xavier; Füllgrabe, Christian

    2017-01-01

    Purpose: The purpose of this article is to assess speech processing for listeners with simulated age-related hearing loss (ARHL) and to investigate whether the observed performance can be replicated using an automatic speech recognition (ASR) system. The long-term goal of this research is to develop a system that will assist…

  13. Developing and Evaluating an Oral Skills Training Website Supported by Automatic Speech Recognition Technology

    Science.gov (United States)

    Chen, Howard Hao-Jan

    2011-01-01

    Oral communication ability has become increasingly important to many EFL students. Several commercial software programs based on automatic speech recognition (ASR) technologies are available but their prices are not affordable for many students. This paper will demonstrate how the Microsoft Speech Application Software Development Kit (SASDK), a…

  14. Difficulties in Automatic Speech Recognition of Dysarthric Speakers and Implications for Speech-Based Applications Used by the Elderly: A Literature Review

    Science.gov (United States)

    Young, Victoria; Mihailidis, Alex

    2010-01-01

    Despite their growing presence in home computer applications and various telephony services, commercial automatic speech recognition technologies are still not easily employed by everyone; especially individuals with speech disorders. In addition, relatively little research has been conducted on automatic speech recognition performance with older…

  15. Automatic speech recognition research at NASA-Ames Research Center

    Science.gov (United States)

    Coler, Clayton R.; Plummer, Robert P.; Huff, Edward M.; Hitchcock, Myron H.

    1977-01-01

    A trainable acoustic pattern recognizer manufactured by Scope Electronics is presented. The voice command system VCS encodes speech by sampling 16 bandpass filters with center frequencies in the range from 200 to 5000 Hz. Variations in speaking rate are compensated for by a compression algorithm that subdivides each utterance into eight subintervals in such a way that the amount of spectral change within each subinterval is the same. The recorded filter values within each subinterval are then reduced to a 15-bit representation, giving a 120-bit encoding for each utterance. The VCS incorporates a simple recognition algorithm that utilizes five training samples of each word in a vocabulary of up to 24 words. The recognition rate of approximately 85 percent correct for untrained speakers and 94 percent correct for trained speakers was not considered adequate for flight systems use. Therefore, the built-in recognition algorithm was disabled, and the VCS was modified to transmit 120-bit encodings to an external computer for recognition.
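
    The rate-compensation idea described here (dividing each utterance into eight subintervals containing equal amounts of spectral change) can be sketched directly from the abstract. The code below is illustrative only; the VCS hardware details and 15-bit coding are not reproduced.

    ```python
    # Sketch of the rate-compensation step: split an utterance into eight
    # subintervals of equal cumulative spectral change (illustrative only).
    import numpy as np

    def equal_change_boundaries(filter_frames, n_segments=8):
        """filter_frames: (time x 16) matrix of band-pass filter outputs."""
        change = np.abs(np.diff(filter_frames, axis=0)).sum(axis=1)  # per-frame spectral change
        cumulative = np.cumsum(change)
        targets = np.linspace(0, cumulative[-1], n_segments + 1)[1:-1]
        return np.searchsorted(cumulative, targets)  # frame indices of segment boundaries

    # Each of the eight segments is then reduced to a compact code (15 bits in the
    # original system), giving a fixed-length 120-bit representation per utterance.
    ```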

  16. Noise robust automatic speech recognition with adaptive quantile based noise estimation and speech band emphasizing filter bank

    DEFF Research Database (Denmark)

    Bonde, Casper Stork; Graversen, Carina; Gregersen, Andreas Gregers

    2005-01-01

An important topic in Automatic Speech Recognition (ASR) is to reduce the effect of noise, in particular when mismatch exists between the training and application conditions. Many noise robustness schemes within the feature processing domain use as a prerequisite a noise estimate prior to the appe...... Furthermore the paper investigates an alternative to the standard mel frequency cepstral coefficient filter bank (MFCC), an empirically chosen Speech Band Emphasizing filter bank (SBE), which improves the resolution in the speech band. The combinations of AQBNE and SBE are tested on the Danish SpeechDat-Car...
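
    Quantile-based noise estimation, the family of methods this record builds on, estimates the noise level in each frequency band as a low quantile of that band's energy over time. The following is a generic sketch of that idea, not the paper's adaptive variant; the quantile and floor values are assumptions.

    ```python
    # Quantile-based noise estimation sketch: per-band noise level taken as a low
    # quantile of the band energy over time, then used for spectral subtraction.
    import numpy as np

    def quantile_noise_estimate(power_spectrogram, q=0.1):
        """power_spectrogram: (frames x bands); returns one noise estimate per band."""
        return np.quantile(power_spectrogram, q, axis=0)

    def spectral_subtraction(power_spectrogram, noise, floor=1e-3):
        """Subtract the noise estimate and clamp to a small positive floor."""
        return np.maximum(power_spectrogram - noise, floor * power_spectrogram)
    ```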

  17. Automatic speech recognition used for evaluation of text-to-speech systems

    Czech Academy of Sciences Publication Activity Database

    Vích, Robert; Nouza, J.; Vondra, Martin

-, no. 5042 (2008), pp. 136-148, ISSN 0302-9743. R&D Projects: GA AV ČR 1ET301710509; GA AV ČR 1QS108040569. Institutional research plan: CEZ:AV0Z20670512. Keywords: speech recognition; speech processing. Subject RIV: JA - Electronics; Optoelectronics, Electrical Engineering

  18. Speech-based Annotation of Heterogeneous Multimedia Content Using Automatic Speech Recognition

    NARCIS (Netherlands)

    Huijbregts, M.A.H.; Ordelman, Roeland J.F.; de Jong, Franciska M.G.

    2007-01-01

    This paper reports on the setup and evaluation of robust speech recognition system parts, geared towards transcript generation for heterogeneous, real-life media collections. The system is deployed for generating speech transcripts for the NIST/TRECVID-2007 test collection, part of a Dutch real-life

  19. The Usefulness of Automatic Speech Recognition (ASR) Eyespeak Software in Improving Iraqi EFL Students' Pronunciation

    Science.gov (United States)

    Sidgi, Lina Fathi Sidig; Shaari, Ahmad Jelani

    2017-01-01

    The present study focuses on determining whether automatic speech recognition (ASR) technology is reliable for improving English pronunciation to Iraqi EFL students. Non-native learners of English are generally concerned about improving their pronunciation skills, and Iraqi students face difficulties in pronouncing English sounds that are not…

  20. Errors in Automatic Speech Recognition versus Difficulties in Second Language Listening

    Science.gov (United States)

    Mirzaei, Maryam Sadat; Meshgi, Kourosh; Akita, Yuya; Kawahara, Tatsuya

    2015-01-01

Automatic Speech Recognition (ASR) technology has become a part of contemporary Computer-Assisted Language Learning (CALL) systems. ASR systems, however, are criticized for their erroneous performance, especially when utilized as a means to develop skills in a Second Language (L2), where errors are not tolerated. Nevertheless, these errors can…

  1. Automatic Speech Recognition Technology as an Effective Means for Teaching Pronunciation

    Science.gov (United States)

    Elimat, Amal Khalil; AbuSeileek, Ali Farhan

    2014-01-01

    This study aimed to explore the effect of using automatic speech recognition technology (ASR) on the third grade EFL students' performance in pronunciation, whether teaching pronunciation through ASR is better than regular instruction, and the most effective teaching technique (individual work, pair work, or group work) in teaching pronunciation…

  2. The Effect of Automatic Speech Recognition Eyespeak Software on Iraqi Students' English Pronunciation: A Pilot Study

    Science.gov (United States)

    Sidgi, Lina Fathi Sidig; Shaari, Ahmad Jelani

    2017-01-01

Technology such as computer-assisted language learning (CALL) is used in teaching and learning in the foreign language classrooms where it is most needed. One promising emerging technology that supports language learning is automatic speech recognition (ASR). Integrating such technology, especially in the instruction of pronunciation…

  3. Evaluation of missing data techniques for in-car automatic speech recognition

    OpenAIRE

    Wang, Y.; Vuerinckx, R.; Gemmeke, J.F.; Cranen, B.; Hamme, H. Van

    2009-01-01

    Wang Y., Vuerinckx R., Gemmeke J., Cranen B., Van hamme H., ''Evaluation of missing data techniques for in-car automatic speech recognition'', Proceedings NAG/DAGA 2009 - international conference on acoustics, 4 pp., March 23-26, 2009, Rotterdam, The Netherlands.

  4. Evaluating Automatic Speech Recognition-Based Language Learning Systems: A Case Study

    Science.gov (United States)

    van Doremalen, Joost; Boves, Lou; Colpaert, Jozef; Cucchiarini, Catia; Strik, Helmer

    2016-01-01

    The purpose of this research was to evaluate a prototype of an automatic speech recognition (ASR)-based language learning system that provides feedback on different aspects of speaking performance (pronunciation, morphology and syntax) to students of Dutch as a second language. We carried out usability reviews, expert reviews and user tests to…

  5. The influence of age, hearing, and working memory on the speech comprehension benefit derived from an automatic speech recognition system.

    Science.gov (United States)

    Zekveld, Adriana A; Kramer, Sophia E; Kessens, Judith M; Vlaming, Marcel S M G; Houtgast, Tammo

    2009-04-01

    The aim of the current study was to examine whether partly incorrect subtitles that are automatically generated by an Automatic Speech Recognition (ASR) system, improve speech comprehension by listeners with hearing impairment. In an earlier study (Zekveld et al. 2008), we showed that speech comprehension in noise by young listeners with normal hearing improves when presenting partly incorrect, automatically generated subtitles. The current study focused on the effects of age, hearing loss, visual working memory capacity, and linguistic skills on the benefit obtained from automatically generated subtitles during listening to speech in noise. In order to investigate the effects of age and hearing loss, three groups of participants were included: 22 young persons with normal hearing (YNH, mean age = 21 years), 22 middle-aged adults with normal hearing (MA-NH, mean age = 55 years) and 30 middle-aged adults with hearing impairment (MA-HI, mean age = 57 years). The benefit from automatic subtitling was measured by Speech Reception Threshold (SRT) tests (Plomp & Mimpen, 1979). Both unimodal auditory and bimodal audiovisual SRT tests were performed. In the audiovisual tests, the subtitles were presented simultaneously with the speech, whereas in the auditory test, only speech was presented. The difference between the auditory and audiovisual SRT was defined as the audiovisual benefit. Participants additionally rated the listening effort. We examined the influences of ASR accuracy level and text delay on the audiovisual benefit and the listening effort using a repeated measures General Linear Model analysis. In a correlation analysis, we evaluated the relationships between age, auditory SRT, visual working memory capacity and the audiovisual benefit and listening effort. The automatically generated subtitles improved speech comprehension in noise for all ASR accuracies and delays covered by the current study. Higher ASR accuracy levels resulted in more benefit obtained

  6. Machine learning based sample extraction for automatic speech recognition using dialectal Assamese speech.

    Science.gov (United States)

    Agarwalla, Swapna; Sarma, Kandarpa Kumar

    2016-06-01

Automatic Speaker Recognition (ASR) and related issues are continuously evolving as inseparable elements of Human Computer Interaction (HCI). With the assimilation of emerging concepts like big data and the Internet of Things (IoT) as extended elements of HCI, ASR techniques are found to be passing through a paradigm shift. Of late, learning-based techniques have started to receive greater attention from research communities related to ASR, owing to the fact that the former possess a natural ability to mimic biological behavior and thereby aid ASR modeling and processing. The current learning-based ASR techniques are found to be evolving further with the incorporation of big data and IoT-like concepts. Here, in this paper, we report certain approaches based on machine learning (ML) used for the extraction of relevant samples from a big data space, and apply them to ASR using certain soft computing techniques for Assamese speech with dialectal variations. A class of ML techniques comprising the basic Artificial Neural Network (ANN) in feedforward (FF) and Deep Neural Network (DNN) forms, using raw speech, extracted features and frequency domain forms, is considered. The Multi Layer Perceptron (MLP) is configured with inputs in several forms to learn class information obtained using clustering and manual labeling. DNNs are also used to extract specific sentence types. Initially, from a large storage, relevant samples are selected and assimilated. Next, a few conventional methods are used for feature extraction of a few selected types. The features comprise both spectral and prosodic types. These are applied to Recurrent Neural Network (RNN) and Fully Focused Time Delay Neural Network (FFTDNN) structures to evaluate their performance in recognizing mood, dialect, speaker and gender variations in dialectal Assamese speech. The system is tested under several background noise conditions by considering the recognition rates (obtained using confusion matrices and manually) and computation time

  7. Assessment of Severe Apnoea through Voice Analysis, Automatic Speech, and Speaker Recognition Techniques

    Directory of Open Access Journals (Sweden)

    Rubén Fernández Pozo

    2009-01-01

Full Text Available This study is part of an ongoing collaborative effort between the medical and the signal processing communities to promote research on applying standard Automatic Speech Recognition (ASR) techniques for the automatic diagnosis of patients with severe obstructive sleep apnoea (OSA). Early detection of severe apnoea cases is important so that patients can receive early treatment. Effective ASR-based detection could dramatically cut medical testing time. Working with a carefully designed speech database of healthy and apnoea subjects, we describe an acoustic search for distinctive apnoea voice characteristics. We also study abnormal nasalization in OSA patients by modelling vowels in nasal and non-nasal phonetic contexts using Gaussian Mixture Model (GMM) pattern recognition on speech spectra. Finally, we present experimental findings regarding the discriminative power of GMMs applied to severe apnoea detection. We have achieved an 81% correct classification rate, which is very promising and underpins the interest in this line of inquiry.
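
    The GMM pattern recognition mentioned here is a standard two-class likelihood comparison. A minimal sketch using scikit-learn follows; the feature layout, number of mixture components, and covariance type are assumptions rather than the study's settings.

    ```python
    # GMM-based two-class decision of the kind described above (illustrative).
    import numpy as np
    from sklearn.mixture import GaussianMixture

    def train_gmm(features, n_components=8):
        """features: (frames x dims) matrix pooled over one class's recordings."""
        return GaussianMixture(n_components=n_components,
                               covariance_type="diag").fit(features)

    def classify(utterance_features, gmm_apnoea, gmm_control):
        """Compare average frame log-likelihoods under the two class models."""
        return ("apnoea" if gmm_apnoea.score(utterance_features) >
                gmm_control.score(utterance_features) else "control")
    ```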

  8. Effect of speech-intrinsic variations on human and automatic recognition of spoken phonemes.

    Science.gov (United States)

    Meyer, Bernd T; Brand, Thomas; Kollmeier, Birger

    2011-01-01

    The aim of this study is to quantify the gap between the recognition performance of human listeners and an automatic speech recognition (ASR) system with special focus on intrinsic variations of speech, such as speaking rate and effort, altered pitch, and the presence of dialect and accent. Second, it is investigated if the most common ASR features contain all information required to recognize speech in noisy environments by using resynthesized ASR features in listening experiments. For the phoneme recognition task, the ASR system achieved the human performance level only when the signal-to-noise ratio (SNR) was increased by 15 dB, which is an estimate for the human-machine gap in terms of the SNR. The major part of this gap is attributed to the feature extraction stage, since human listeners achieve comparable recognition scores when the SNR difference between unaltered and resynthesized utterances is 10 dB. Intrinsic variabilities result in strong increases of error rates, both in human speech recognition (HSR) and ASR (with a relative increase of up to 120%). An analysis of phoneme duration and recognition rates indicates that human listeners are better able to identify temporal cues than the machine at low SNRs, which suggests incorporating information about the temporal dynamics of speech into ASR systems.

  9. Integrating Automatic Speech Recognition and Machine Translation for Better Translation Outputs

    DEFF Research Database (Denmark)

    Liyanapathirana, Jeevanthi

    Machine Translation (MT) and Computer-Assisted Translation (CAT) are considered complementary: the first one taking care of the translation process automatically and the latter getting the aid of human translators in order to get better translation outputs. With the demand for high quality...... translations, combining machine translation with computer assisted translation has drawn attention in current research. This combines two prospects: the opportunity of ensuring high quality translation along with a significant performance gain. Automatic Speech Recognition (ASR) is another important area......, which caters important functionalities in language processing and natural language understanding tasks. In this work we integrate automatic speech recognition and machine translation in parallel. We aim to avoid manual typing of possible translations as dictating the translation would take less time...

  10. Automatic speech recognition (zero crossing method). Automatic recognition of isolated vowels

    International Nuclear Information System (INIS)

    Dupeyrat, Benoit

    1975-01-01

This note describes a method for the recognition of isolated vowels, using preprocessing of the vocal signal. The processing extracts the extrema of the vocal signal and the time intervals separating them (zero-crossing distances of the first derivative of the signal). The recognition of vowels uses normalized histograms of the values of these intervals. The program determines a distance between the histogram of the sound to be recognized and model histograms built during a learning phase. The results, processed in real time by a minicomputer, are relatively independent of the speaker, provided the fundamental frequency does not vary too much (i.e., speakers of the same sex). (author)
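
    The procedure in this abstract maps directly onto a short program: find the extrema of the waveform, histogram the intervals between them, and pick the closest model histogram. The sketch below follows that description; the bin count, interval range, and L1 distance are assumptions.

    ```python
    # Sketch of the described vowel recognizer: histogram the intervals between
    # successive extrema and compare against per-vowel model histograms.
    import numpy as np

    def interval_histogram(signal, bins=32, max_interval=200):
        extrema = np.where(np.diff(np.sign(np.diff(signal))) != 0)[0]  # local extrema
        intervals = np.diff(extrema)                                   # samples between extrema
        hist, _ = np.histogram(intervals, bins=bins, range=(0, max_interval))
        return hist / max(hist.sum(), 1)                               # normalized histogram

    def recognize(signal, models):
        """models: dict vowel -> reference histogram; the nearest histogram wins."""
        h = interval_histogram(signal)
        return min(models, key=lambda v: np.abs(models[v] - h).sum())
    ```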

  11. Developing a broadband automatic speech recognition system for Afrikaans

    CSIR Research Space (South Africa)

    De Wet, Febe

    2011-08-01

    Full Text Available for Afrikaans, specifically a broadband speech corpus and an extended pronunciation dictionary. Baseline results for an ASR system that was built using these resources are also presented. In addition, the article suggests different strategies to exploit...

  12. An HMM-Like Dynamic Time Warping Scheme for Automatic Speech Recognition

    Directory of Open Access Journals (Sweden)

    Ing-Jr Ding

    2014-01-01

Full Text Available In the past, the kernel of automatic speech recognition (ASR) was dynamic time warping (DTW), which is feature-based template matching and belongs to the category of dynamic programming (DP) techniques. Although DTW is an early ASR technique, it has remained popular in many applications, and it now plays an important role in the well-known Kinect-based gesture recognition application. This paper proposes an intelligent speech recognition system using an improved DTW approach for multimedia and home automation services. The improved DTW presented in this work, called HMM-like DTW, is essentially a hidden Markov model (HMM)-like method in which the concept of the typical HMM statistical model is brought into the design of DTW. The developed HMM-like DTW method, transforming feature-based DTW recognition into model-based DTW recognition, is able to behave like the HMM recognition technique, and therefore the proposed HMM-like DTW with its HMM-like recognition model has the capability to further perform model adaptation (also known as speaker adaptation). A series of experimental results in home automation-based multimedia access service environments demonstrated the superiority and effectiveness of the developed smart speech recognition system based on HMM-like DTW.
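
    For reference, the classic DTW matcher that the HMM-like variant above extends can be written in a few lines. This sketch is plain DTW over feature sequences, not the proposed HMM-like method.

    ```python
    # Classic dynamic time warping between two feature sequences (plain DTW,
    # not the HMM-like extension proposed in the record above).
    import numpy as np

    def dtw_distance(a, b):
        """a, b: (frames x dims) feature sequences; returns the alignment cost."""
        n, m = len(a), len(b)
        cost = np.full((n + 1, m + 1), np.inf)
        cost[0, 0] = 0.0
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                d = np.linalg.norm(a[i - 1] - b[j - 1])
                cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
        return cost[n, m]

    # A command is recognized as the template with the lowest alignment cost.
    ```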

  13. Integrating Automatic Speech Recognition and Machine Translation for Better Translation Outputs

    DEFF Research Database (Denmark)

    Liyanapathirana, Jeevanthi

    translations, combining machine translation with computer assisted translation has drawn attention in current research. This combines two prospects: the opportunity of ensuring high quality translation along with a significant performance gain. Automatic Speech Recognition (ASR) is another important area......, which caters important functionalities in language processing and natural language understanding tasks. In this work we integrate automatic speech recognition and machine translation in parallel. We aim to avoid manual typing of possible translations as dictating the translation would take less time...... to the n-best list rescoring, we also use word graphs with the expectation of arriving at a tighter integration of ASR and MT models. Integration methods include constraining ASR models using language and translation models of MT, and vice versa. We currently develop and experiment different methods...

  14. The role of binary mask patterns in automatic speech recognition in background noise.

    Science.gov (United States)

    Narayanan, Arun; Wang, DeLiang

    2013-05-01

    Processing noisy signals using the ideal binary mask improves automatic speech recognition (ASR) performance. This paper presents the first study that investigates the role of binary mask patterns in ASR under various noises, signal-to-noise ratios (SNRs), and vocabulary sizes. Binary masks are computed either by comparing the SNR within a time-frequency unit of a mixture signal with a local criterion (LC), or by comparing the local target energy with the long-term average spectral energy of speech. ASR results show that (1) akin to human speech recognition, binary masking significantly improves ASR performance even when the SNR is as low as -60 dB; (2) the ASR performance profiles are qualitatively similar to those obtained in human intelligibility experiments; (3) the difference between the LC and mixture SNR is more correlated to the recognition accuracy than LC; (4) LC at which the performance peaks is lower than 0 dB, which is the threshold that maximizes the SNR gain of processed signals. This broad agreement with human performance is rather surprising. The results also indicate that maximizing the SNR gain is probably not an appropriate goal for improving either human or machine recognition of noisy speech.
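
    The binary mask definition used in this record (keep a time-frequency unit when its local SNR exceeds the local criterion LC) translates directly into code. The sketch below follows that definition; array shapes and the default LC are assumptions.

    ```python
    # Ideal binary mask: keep a time-frequency unit when its local SNR exceeds LC.
    import numpy as np

    def ideal_binary_mask(speech_power, noise_power, lc_db=0.0):
        """speech_power, noise_power: (frames x channels) energies of the premixed
        signals; returns a 0/1 mask of the same shape."""
        local_snr_db = 10.0 * np.log10(speech_power / (noise_power + 1e-12) + 1e-12)
        return (local_snr_db > lc_db).astype(float)

    # The masked mixture (mixture * mask) is then passed to the recognizer; the
    # paper studies how shifting LC relative to the mixture SNR affects accuracy.
    ```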

  15. Language modeling for automatic speech recognition of inflective languages an applications-oriented approach using lexical data

    CERN Document Server

    Donaj, Gregor

    2017-01-01

    This book covers language modeling and automatic speech recognition for inflective languages (e.g. Slavic languages), which represent roughly half of the languages spoken in Europe. These languages do not perform as well as English in speech recognition systems and it is therefore harder to develop an application with sufficient quality for the end user. The authors describe the most important language features for the development of a speech recognition system. This is then presented through the analysis of errors in the system and the development of language models and their inclusion in speech recognition systems, which specifically address the errors that are relevant for targeted applications. The error analysis is done with regard to morphological characteristics of the word in the recognized sentences. The book is oriented towards speech recognition with large vocabularies and continuous and even spontaneous speech. Today such applications work with a rather small number of languages compared to the nu...

  16. AUTOMATIC SPEECH RECOGNITION – THE MAIN STAGES OVER LAST 50 YEARS

    Directory of Open Access Journals (Sweden)

    I. B. Tampel

    2015-11-01

    Full Text Available The main stages in the development of automatic speech recognition systems over the last 50 years are reviewed. An attempt is made to evaluate different methods in the context of how closely they approach the functioning of biological systems. A method implemented in 1968 and based on a dynamic programming algorithm is taken as a benchmark. Shortcomings of the method, which make it possible to use it only for command recognition, are considered. The next method considered is based on the formalism of Markov chains. Based on the notion of coarticulation, the necessity of applying context-dependent triphones and biphones instead of context-independent phonemes is shown. The problem of insufficient speech data for triphone training, which leads to state-tying methods, is explained. The importance of model adaptation and feature normalization methods, providing better invariance to speakers, communication channels and additive noise, is shown. Deep Neural Networks and Recurrent Networks are considered as the most up-to-date methods. The similarity of deep (multilayer) neural networks and biological systems is noted. In conclusion, the problems and drawbacks of modern automatic speech recognition systems are described and a prognosis of their development is given.

  17. A Speech Recognition-based Solution for the Automatic Detection of Mild Cognitive Impairment from Spontaneous Speech.

    Science.gov (United States)

    Toth, Laszlo; Hoffmann, Ildiko; Gosztolya, Gabor; Vincze, Veronika; Szatloczki, Greta; Banreti, Zoltan; Pakaski, Magdolna; Kalman, Janos

    2018-01-01

    Even today the reliable diagnosis of the prodromal stages of Alzheimer's disease (AD) remains a great challenge. Our research focuses on the earliest detectable indicators of cognitive decline in mild cognitive impairment (MCI). Since the presence of language impairment has been reported even in the mild stage of AD, the aim of this study is to develop a sensitive neuropsychological screening method based on the analysis of spontaneous speech production while performing a memory task. In the future, this can form the basis of Internet-based interactive screening software for the recognition of MCI. Participants were 38 healthy controls and 48 clinically diagnosed MCI patients. Spontaneous speech was provoked by asking the patients to recall the content of 2 short black-and-white films (one immediate, one delayed), and by answering one question. Acoustic parameters (hesitation ratio, speech tempo, length and number of silent and filled pauses, length of utterance) were extracted from the recorded speech signals, first manually (using the Praat software), and then automatically, with an automatic speech recognition (ASR) based tool. First, the extracted parameters were statistically analyzed. Then we applied machine learning algorithms to see whether the MCI and the control group can be discriminated automatically based on the acoustic features. The statistical analysis showed significant differences for most of the acoustic parameters (speech tempo, articulation rate, silent pause, hesitation ratio, length of utterance, pause-per-utterance ratio). The most significant differences between the two groups were found in the speech tempo in the delayed recall task, and in the number of pauses for the question-answering task. The fully automated version of the analysis process - that is, using the ASR-based features in combination with machine learning - was able to separate the two classes with an F1-score of 78.8%. The temporal analysis of spontaneous speech

  18. Speech Recognition

    Directory of Open Access Journals (Sweden)

    Adrian Morariu

    2009-01-01

    Full Text Available This paper presents a method of speech recognition by pattern recognition techniques. Learning consists in determining the unique characteristics of a word (cepstral coefficients) by eliminating those characteristics that differ from one word to another. For learning and recognition, the system builds a dictionary of words by determining the characteristics of each word to be used in the recognition. Determining the characteristics of an audio signal consists of the following steps: noise removal, sampling, applying a Hamming window, switching to the frequency domain through the Fourier transform, calculating the magnitude spectrum, filtering the data, and determining the cepstral coefficients.
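
    The feature pipeline listed above (framing, Hamming window, magnitude spectrum, cepstral coefficients) can be sketched roughly as follows; this is a minimal illustration, not the paper's implementation, and the noise-removal and filtering steps are omitted:

```python
import numpy as np
from scipy.fft import dct

def cepstral_features(signal, sample_rate=16000, frame_len=0.025,
                      frame_step=0.010, n_ceps=13):
    """Frame the signal, apply a Hamming window, take the magnitude spectrum,
    and derive cepstral coefficients via a log + DCT step."""
    n = int(frame_len * sample_rate)
    step = int(frame_step * sample_rate)
    window = np.hamming(n)
    frames = [signal[i:i + n] * window
              for i in range(0, len(signal) - n + 1, step)]
    mag = np.abs(np.fft.rfft(frames, axis=1))          # magnitude spectrum
    log_mag = np.log(mag + 1e-10)                      # compress dynamic range
    ceps = dct(log_mag, type=2, axis=1, norm='ortho')  # cepstral coefficients
    return ceps[:, :n_ceps]
```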

  19. An Exploration of the Potential of Automatic Speech Recognition to Assist and Enable Receptive Communication in Higher Education

    Science.gov (United States)

    Wald, Mike

    2006-01-01

    The potential use of Automatic Speech Recognition to assist receptive communication is explored. The opportunities and challenges that this technology presents students and staff to provide captioning of speech online or in classrooms for deaf or hard of hearing students and assist blind, visually impaired or dyslexic learners to read and search…

  20. Automatic speech recognition (ASR) and its use as a tool for assessment or therapy of voice, speech, and language disorders.

    Science.gov (United States)

    Kitzing, Peter; Maier, Andreas; Ahlander, Viveka Lyberg

    2009-01-01

    In general opinion, computerized automatic speech recognition (ASR) seems to be regarded merely as a method for transcribing spoken language into written text, and as such as quite unreliable and rather cumbersome. However, due to great advances in computer technology and informatics methodology, ASR has nowadays become quite dependable and easier to handle, and the number of applications has increased considerably. After some introductory background information on ASR, a number of applications of great interest to professionals in voice, speech, and language therapy are pointed out. In the foreseeable future, the keyboard and mouse will, by means of ASR technology, be replaced in many functions by a microphone as the human-computer interface, and the computer will talk back via its loudspeaker. It seems important that professionals engaged in the care of oral communication disorders take part in this development so their clients may get the optimal benefit from this new technology.

  1. Contribution to automatic speech recognition. Analysis of the direct acoustical signal. Recognition of isolated words and phoneme identification

    International Nuclear Information System (INIS)

    Dupeyrat, Benoit

    1981-01-01

    This report deals with the acoustical-phonetic step of automatic speech recognition. The parameters used are the extrema of the acoustical signal (coded in amplitude and duration). This coding method, the properties of which are described, is simple and well adapted to digital processing. The quality and the intelligibility of the coded signal after reconstruction are particularly satisfactory. An experiment on the automatic recognition of isolated words has been carried out using this coding system. We have designed a filtering algorithm operating on the parameters of the coding. Thus the characteristics of the formants can be derived under certain conditions, which are discussed. Using these characteristics, the identification of a large part of the phonemes for a given speaker was achieved. Carrying out these studies required the development of a particular real-time processing methodology which allowed immediate evaluation of improvements to the programs. Such processing on temporal coding of the acoustical signal is extremely powerful and could, used in connection with other methods, represent an efficient tool for the automatic processing of speech. (author) [fr
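
    A minimal sketch of coding a signal by its extrema (amplitude and duration), in the spirit of the report but not its exact parameterization:

```python
import numpy as np

def extrema_code(signal):
    """Keep only the local maxima/minima of the signal and, for each one,
    its amplitude and the duration (in samples) since the previous extremum."""
    s = np.asarray(signal, dtype=float)
    d = np.diff(s)
    # an extremum is where the slope changes sign
    idx = np.where(d[:-1] * d[1:] < 0)[0] + 1
    amplitudes = s[idx]
    durations = np.diff(np.concatenate(([0], idx)))
    return np.column_stack([amplitudes, durations])
```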

  2. Speaker diarization and speech recognition in the semi-automatization of audio description: An exploratory study on future possibilities?

    Directory of Open Access Journals (Sweden)

    Héctor Delgado

    2015-12-01

    Full Text Available This article presents an overview of the technological components used in the process of audio description, and suggests a new scenario in which speech recognition, machine translation, and text-to-speech, with the corresponding human revision, could be used to increase audio description provision. The article focuses on a process in which both speaker diarization and speech recognition are used in order to obtain a semi-automatic transcription of the audio description track. The technical process is presented and experimental results are summarized.

  3. Speaker diarization and speech recognition in the semi-automatization of audio description: An exploratory study on future possibilities?

    Directory of Open Access Journals (Sweden)

    Héctor Delgado

    2015-06-01

    This article presents an overview of the technological components used in the process of audio description, and suggests a new scenario in which speech recognition, machine translation, and text-to-speech, with the corresponding human revision, could be used to increase audio description provision. The article focuses on a process in which both speaker diarization and speech recognition are used in order to obtain a semi-automatic transcription of the audio description track. The technical process is presented and experimental results are summarized.

  4. Towards an Intelligent Acoustic Front End for Automatic Speech Recognition: Built-in Speaker Normalization

    Directory of Open Access Journals (Sweden)

    Umit H. Yapanel

    2008-08-01

    Full Text Available A proven method for achieving effective automatic speech recognition (ASR) despite speaker differences is to perform acoustic feature speaker normalization. More effective speaker normalization methods are needed which require limited computing resources for real-time performance. The most popular speaker normalization technique is vocal-tract length normalization (VTLN), despite the fact that it is computationally expensive. In this study, we propose a novel online VTLN algorithm entitled built-in speaker normalization (BISN), where normalization is performed on-the-fly within a newly proposed PMVDR acoustic front end. The novel algorithmic aspect is that in conventional front-end processing with PMVDR and VTLN, two separate warping phases are needed, while in the proposed BISN method only a single speaker-dependent warp is used to achieve both the PMVDR perceptual warp and the VTLN warp simultaneously. This improved integration unifies the nonlinear warping performed in the front end and reduces computational requirements, thereby offering advantages for real-time ASR systems. Evaluations are performed for (i) an in-car extended digit recognition task, where an on-the-fly BISN implementation reduces the relative word error rate (WER) by 24%, and (ii) a diverse noisy speech task (SPINE 2), where the relative WER improvement was 9%, both relative to the baseline speaker normalization method.
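
    A hedged sketch of the kind of piecewise-linear VTLN frequency warp this record refers to (the exact BISN/PMVDR warp is not reproduced here):

```python
import numpy as np

def vtln_warp(freqs_hz, alpha, f_max=8000.0, f_cut=0.85):
    """Piecewise-linear VTLN warp: frequencies below a cutoff are scaled by the
    speaker-dependent factor alpha, the remainder is mapped linearly so the
    warped axis still ends at f_max."""
    freqs_hz = np.asarray(freqs_hz, dtype=float)
    cut = f_cut * f_max
    return np.where(
        freqs_hz <= cut,
        alpha * freqs_hz,
        alpha * cut + (f_max - alpha * cut) * (freqs_hz - cut) / (f_max - cut),
    )

# A warp factor alpha would typically be picked per speaker by maximizing the
# likelihood of the warped features under the acoustic model, e.g. searching
# alphas = np.arange(0.88, 1.13, 0.02)
```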

  5. Towards an Intelligent Acoustic Front End for Automatic Speech Recognition: Built-in Speaker Normalization

    Directory of Open Access Journals (Sweden)

    Yapanel UmitH

    2008-01-01

    Full Text Available A proven method for achieving effective automatic speech recognition (ASR) despite speaker differences is to perform acoustic feature speaker normalization. More effective speaker normalization methods are needed which require limited computing resources for real-time performance. The most popular speaker normalization technique is vocal-tract length normalization (VTLN), despite the fact that it is computationally expensive. In this study, we propose a novel online VTLN algorithm entitled built-in speaker normalization (BISN), where normalization is performed on-the-fly within a newly proposed PMVDR acoustic front end. The novel algorithmic aspect is that in conventional front-end processing with PMVDR and VTLN, two separate warping phases are needed, while in the proposed BISN method only a single speaker-dependent warp is used to achieve both the PMVDR perceptual warp and the VTLN warp simultaneously. This improved integration unifies the nonlinear warping performed in the front end and reduces computational requirements, thereby offering advantages for real-time ASR systems. Evaluations are performed for (i) an in-car extended digit recognition task, where an on-the-fly BISN implementation reduces the relative word error rate (WER) by 24%, and (ii) a diverse noisy speech task (SPINE 2), where the relative WER improvement was 9%, both relative to the baseline speaker normalization method.

  6. Automatic Speech Recognition Using Template Model for Man-Machine Interface

    OpenAIRE

    Mishra, Neema; Shrawankar, Urmila; Thakare, V M

    2013-01-01

    Speech is a natural form of communication for human beings, and computers with the ability to understand speech and speak with a human voice are expected to contribute to the development of more natural man-machine interfaces. Computers with this kind of ability are gradually becoming a reality, through the evolution of speech recognition technologies. Speech is an important mode of interaction with computers. In this paper, feature extraction is implemented using the well-known Mel-Frequenc...

  7. An exploration of the potential of Automatic Speech Recognition to assist and enable receptive communication in higher education

    Directory of Open Access Journals (Sweden)

    Mike Wald

    2006-12-01

    Full Text Available The potential use of Automatic Speech Recognition to assist receptive communication is explored. The opportunities and challenges that this technology presents to students and staff are also discussed and evaluated: providing captioning of speech online or in classrooms for deaf or hard of hearing students, and assisting blind, visually impaired or dyslexic learners to read and search learning material more readily by augmenting synthetic speech with natural recorded real speech. The automatic provision of online lecture notes, synchronised with speech, enables staff and students to focus on learning and teaching issues, while also benefiting learners unable to attend the lecture or who find it difficult or impossible to take notes at the same time as listening, watching and thinking.

  8. GAUSSIAN MIXTURE MODELS FOR ADAPTATION OF DEEP NEURAL NETWORK ACOUSTIC MODELS IN AUTOMATIC SPEECH RECOGNITION SYSTEMS

    Directory of Open Access Journals (Sweden)

    Natalia A. Tomashenko

    2016-11-01

    Full Text Available Subject of Research. We study speaker adaptation of deep neural network (DNN) acoustic models in automatic speech recognition systems. The aim of speaker adaptation techniques is to improve the accuracy of the speech recognition system for a particular speaker. Method. A novel method for training and adaptation of deep neural network acoustic models has been developed. It is based on using an auxiliary GMM (Gaussian Mixture Model) and GMMD (GMM-derived) features. The principal advantage of the proposed GMMD features is the possibility of performing the adaptation of a DNN through the adaptation of the auxiliary GMM. In the proposed approach any method for the adaptation of the auxiliary GMM can be used; hence, it provides a universal method for transferring adaptation algorithms developed for GMMs to DNN adaptation. Main Results. The effectiveness of the proposed approach was shown by means of one of the most common adaptation algorithms for GMM models – MAP (Maximum A Posteriori) adaptation. Different ways of integrating the proposed approach into a state-of-the-art DNN architecture have been proposed and explored. An analysis of the choice of the type of auxiliary GMM model is given. Experimental results on the TED-LIUM corpus demonstrate that, in an unsupervised adaptation mode, the proposed adaptation technique can provide, approximately, an 11–18% relative word error rate (WER) reduction on different adaptation sets, compared to the speaker-independent DNN system built on conventional features, and a 3–6% relative WER reduction compared to the SAT-DNN trained on fMLLR-adapted features.
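
    A minimal sketch of GMM-derived (GMMD) input features, using scikit-learn as a stand-in for the auxiliary GMM described above; this only illustrates the idea, not the paper's exact feature definition:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def gmmd_features(frames, gmm: GaussianMixture):
    """Per-frame log posteriors of the auxiliary GMM, appended to (or used
    instead of) the conventional features that feed the DNN. Adapting the
    auxiliary GMM (e.g. with MAP) changes these features and thereby adapts
    the DNN indirectly."""
    post = gmm.predict_proba(frames)   # frame-by-component posteriors
    return np.log(post + 1e-10)        # log-domain GMM-derived features

# Usage sketch: train the auxiliary GMM on speaker-independent features,
# then recompute gmmd_features() after adapting the GMM to a new speaker.
# gmm = GaussianMixture(n_components=128, covariance_type='diag').fit(train_frames)
# dnn_input = np.hstack([train_frames, gmmd_features(train_frames, gmm)])
```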

  9. Robust Automatic Speech Recognition Features using Complex Wavelet Packet Transform Coefficients

    Directory of Open Access Journals (Sweden)

    Tjong Wan Sen

    2013-09-01

    Full Text Available To improve the performance of phoneme-based Automatic Speech Recognition (ASR) in noisy environments, we developed a new technique that adds robustness to clean phoneme features. These robust features are obtained from Complex Wavelet Packet Transform (CWPT) coefficients. Since the CWPT coefficients represent all the different frequency bands of the input signal, decomposing the input signal into a complete CWPT tree also covers all frequencies involved in the recognition process. For time-overlapping signals with different frequency contents, e.g. a phoneme signal with noise, the CWPT coefficients are the combination of the CWPT coefficients of the phoneme signal and the CWPT coefficients of the noise. The CWPT coefficients of the phoneme signal change according to the frequency components contained in the noise. Since the number of phonemes in every language is relatively small (limited and already well known), one can easily derive principal component vectors from a clean training dataset using Principal Component Analysis (PCA). These principal component vectors can then be used to add robustness and minimize noise effects in the testing phase. Simulation results, using Alpha Numeric 4 (AN4) from Carnegie Mellon University and NOISEX-92 examples from Rice University, showed that this new technique can be used as a feature extractor that improves the robustness of phoneme-based ASR systems in various adverse noisy conditions while still preserving performance in clean environments.
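
    The PCA step described above can be sketched as follows, with generic feature vectors standing in for the CWPT coefficients; this is an illustrative projection-and-reconstruction, not the authors' exact procedure:

```python
import numpy as np

def pca_basis(clean_features, n_components=20):
    """Learn principal component vectors from clean training features."""
    mean = clean_features.mean(axis=0)
    centered = clean_features - mean
    # SVD of the centered data gives the principal directions in Vt
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return mean, vt[:n_components]

def denoise(features, mean, components):
    """Project noisy test features onto the clean subspace and reconstruct,
    discarding energy outside the span of the clean principal components."""
    coded = (features - mean) @ components.T
    return coded @ components + mean
```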

  10. Robust Automatic Speech Recognition Features using Complex Wavelet Packet Transform Coefficients

    Directory of Open Access Journals (Sweden)

    TjongWan Sen

    2009-11-01

    Full Text Available To improve the performance of phoneme-based Automatic Speech Recognition (ASR) in noisy environments, we developed a new technique that adds robustness to clean phoneme features. These robust features are obtained from Complex Wavelet Packet Transform (CWPT) coefficients. Since the CWPT coefficients represent all the different frequency bands of the input signal, decomposing the input signal into a complete CWPT tree also covers all frequencies involved in the recognition process. For time-overlapping signals with different frequency contents, e.g. a phoneme signal with noise, the CWPT coefficients are the combination of the CWPT coefficients of the phoneme signal and the CWPT coefficients of the noise. The CWPT coefficients of the phoneme signal change according to the frequency components contained in the noise. Since the number of phonemes in every language is relatively small (limited and already well known), one can easily derive principal component vectors from a clean training dataset using Principal Component Analysis (PCA). These principal component vectors can then be used to add robustness and minimize noise effects in the testing phase. Simulation results, using Alpha Numeric 4 (AN4) from Carnegie Mellon University and NOISEX-92 examples from Rice University, showed that this new technique can be used as a feature extractor that improves the robustness of phoneme-based ASR systems in various adverse noisy conditions while still preserving performance in clean environments.

  11. Speech Recognition on Mobile Devices

    DEFF Research Database (Denmark)

    Tan, Zheng-Hua; Lindberg, Børge

    2010-01-01

    The enthusiasm of deploying automatic speech recognition (ASR) on mobile devices is driven both by remarkable advances in ASR technology and by the demand for efficient user interfaces on such devices as mobile phones and personal digital assistants (PDAs). This chapter presents an overview of ASR...... in the mobile context covering motivations, challenges, fundamental techniques and applications. Three ASR architectures are introduced: embedded speech recognition, distributed speech recognition and network speech recognition. Their pros and cons and implementation issues are discussed. Applications within...... command and control, text entry and search are presented with an emphasis on mobile text entry....

  12. The Usefulness of Automatic Speech Recognition (ASR Eyespeak Software in Improving Iraqi EFL Students’ Pronunciation

    Directory of Open Access Journals (Sweden)

    Lina Fathi Sidig Sidgi

    2017-02-01

    Full Text Available The present study focuses on determining whether automatic speech recognition (ASR) technology is reliable for improving the English pronunciation of Iraqi EFL students. Non-native learners of English are generally concerned about improving their pronunciation skills, and Iraqi students face difficulties in pronouncing English sounds that are not found in their native language (Arabic). This study is concerned with ASR and its effectiveness in overcoming this difficulty. The data were obtained from twenty participants randomly selected from first-year college students in the Department of English at Al-Turath University College in Baghdad, Iraq. The students had participated in a two-month pronunciation instruction course using the ASR Eyespeak software. At the end of the course, the students completed a questionnaire to give their opinions about the usefulness of ASR Eyespeak in improving their pronunciation. The findings of the study revealed that the students found the ASR Eyespeak software very useful in improving their pronunciation and helping them realise their pronunciation mistakes. They also reported that learning pronunciation with ASR Eyespeak was enjoyable.

  13. Pattern recognition in speech and language processing

    CERN Document Server

    Chou, Wu

    2003-01-01

    Minimum Classification Error (MCE) Approach in Pattern Recognition, Wu Chou; Minimum Bayes-Risk Methods in Automatic Speech Recognition, Vaibhava Goel and William Byrne; A Decision Theoretic Formulation for Adaptive and Robust Automatic Speech Recognition, Qiang Huo; Speech Pattern Recognition Using Neural Networks, Shigeru Katagiri; Large Vocabulary Speech Recognition Based on Statistical Methods, Jean-Luc Gauvain; Toward Spontaneous Speech Recognition and Understanding, Sadaoki Furui; Speaker Authentication, Qi Li and Biing-Hwang Juang; HMMs for Language Processing Problems, Ri

  14. User Experience of a Mobile Speaking Application with Automatic Speech Recognition for EFL Learning

    Science.gov (United States)

    Ahn, Tae youn; Lee, Sangmin-Michelle

    2016-01-01

    With the spread of mobile devices, mobile phones have enormous potential regarding their pedagogical use in language education. The goal of this study is to analyse user experience of a mobile-based learning system that is enhanced by speech recognition technology for the improvement of EFL (English as a foreign language) learners' speaking…

  15. Speech recognition from spectral dynamics

    Indian Academy of Sciences (India)

    2016-08-26

    Some of the history of the gradual infusion of the modulation spectrum concept into automatic recognition of speech (ASR) comes next, pointing to the relationship of modulation spectrum processing to well-accepted ASR techniques such as dynamic speech features or RelAtive SpecTrAl (RASTA) filtering. Next ...

  16. Speech recognition from spectral dynamics

    Indian Academy of Sciences (India)

    Some of the history of the gradual infusion of the modulation spectrum concept into automatic recognition of speech (ASR) comes next, pointing to the relationship of modulation spectrum processing to well-accepted ASR techniques such as dynamic speech features or RelAtive SpecTrAl (RASTA) filtering. Next, the frequency ...

  17. Estimation of Phoneme-Specific HMM Topologies for the Automatic Recognition of Dysarthric Speech

    Directory of Open Access Journals (Sweden)

    Santiago-Omar Caballero-Morales

    2013-01-01

    Full Text Available Dysarthria is a frequently occurring motor speech disorder which can be caused by neurological trauma, cerebral palsy, or degenerative neurological diseases. Because dysarthria affects phonation, articulation, and prosody, the spoken communication of dysarthric speakers gets seriously restricted, affecting their quality of life and confidence. Assistive technology has led to the development of speech applications to improve the spoken communication of dysarthric speakers. In this field, this paper presents an approach to improve the accuracy of HMM-based speech recognition systems. Because phonatory dysfunction is a main characteristic of dysarthric speech, the phonemes of a dysarthric speaker are affected at different levels. Thus, the approach consists in finding the most suitable type of HMM topology (Bakis, Ergodic) for each phoneme in the speaker's phonetic repertoire. The topology is further refined with a suitable number of states and Gaussian mixture components for acoustic modelling. This represents a difference when compared with studies where a single topology is assumed for all phonemes. Finding the suitable parameters (topology and mixture components) is performed with a Genetic Algorithm (GA). Experiments with a well-known dysarthric speech database showed statistically significant improvements of the proposed approach when compared with the single topology approach, even for speakers with severe dysarthria.
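
    The search over per-phoneme HMM configurations can be sketched as a small genetic-algorithm loop; the fitness function, parameter ranges, and GA operators below are placeholders, not the paper's settings:

```python
import random

TOPOLOGIES = ("bakis", "ergodic")

def random_config():
    # (topology, number of HMM states, number of Gaussian mixture components)
    return (random.choice(TOPOLOGIES), random.randint(3, 7), random.choice((1, 2, 4, 8)))

def ga_search(fitness, generations=20, pop_size=12):
    """fitness(config) -> recognition accuracy for one phoneme; in the paper
    this would involve training and testing an HMM with that configuration."""
    population = [random_config() for _ in range(pop_size)]
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        parents = ranked[: pop_size // 2]                      # selection
        children = [tuple(random.choice(parents)[i] for i in range(3))
                    for _ in range(pop_size - len(parents))]   # field-wise crossover
        children = [random_config() if random.random() < 0.1 else c
                    for c in children]                         # light mutation
        population = parents + children
    return max(population, key=fitness)
```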

  18. An automatic speech recognition system with speaker-independent identification support

    Science.gov (United States)

    Caranica, Alexandru; Burileanu, Corneliu

    2015-02-01

    The novelty of this work relies on the application of an open source research software toolkit (CMU Sphinx) to train, build and evaluate a speech recognition system, with speaker-independent support, for voice-controlled hardware applications. Moreover, we propose to use the trained acoustic model to successfully decode offline voice commands on embedded hardware, such as an ARMv6 low-cost SoC, Raspberry PI. This type of single-board computer, mainly used for educational and research activities, can serve as a proof-of-concept software and hardware stack for low cost voice automation systems.

  19. The Effect of Automatic Speech Recognition EyeSpeak Software on Iraqi Students’ English Pronunciation: A Pilot Study

    Directory of Open Access Journals (Sweden)

    Lina Fathi Sidig Sidgi

    2017-04-01

    Full Text Available Technology such as computer-assisted language learning (CALL) is used in teaching and learning in foreign language classrooms, where it is most needed. One promising emerging technology that supports language learning is automatic speech recognition (ASR). Integrating such technology, especially in the instruction of pronunciation in the classroom, is important in helping students to achieve correct pronunciation. In Iraq, English is a foreign language, and it is not surprising that learners commit many pronunciation mistakes. One factor contributing to these mistakes is the difference between the Arabic and English phonetic systems. Thus, the sound transformation from the mother tongue (Arabic) to the target language (English) is one barrier for Arab learners. The purpose of this study is to investigate the effectiveness of using automatic speech recognition (ASR) EyeSpeak software in improving the pronunciation of Iraqi learners of English. An experimental research project with a pretest-posttest design is conducted over a one-month period in the Department of English at Al-Turath University College in Baghdad, Iraq. The ten participants are randomly selected first-year college students enrolled in a pronunciation class that uses traditional teaching methods and ASR EyeSpeak software. The findings show that using EyeSpeak software leads to a significant improvement in the students' English pronunciation, evident from the test scores they achieve after using EyeSpeak software.

  20. Speech recognition from spectral dynamics

    Indian Academy of Sciences (India)

    automatic recognition of speech (ASR). Instead, likely for historical reasons, envelopes of power spectrum were adopted as main carrier of linguistic information in ASR. However, the relationships between phonetic values of sounds and their short-term spectral envelopes are not straightforward. Consequently, this asks for ...

  1. Novel Techniques for Dialectal Arabic Speech Recognition

    CERN Document Server

    Elmahdy, Mohamed; Minker, Wolfgang

    2012-01-01

    Novel Techniques for Dialectal Arabic Speech Recognition describes approaches to improve automatic speech recognition for dialectal Arabic. Since speech resources for dialectal Arabic speech recognition are very sparse, the authors describe how existing Modern Standard Arabic (MSA) speech data can be applied to dialectal Arabic speech recognition, while assuming that MSA is always a second language for all Arabic speakers. In this book, Egyptian Colloquial Arabic (ECA) has been chosen as a typical Arabic dialect. ECA is the first-ranked Arabic dialect in terms of number of speakers, and a high-quality ECA speech corpus with accurate phonetic transcription has been collected. MSA acoustic models were trained using news broadcast speech. In order to cross-lingually use MSA in dialectal Arabic speech recognition, the authors have normalized the phoneme sets for MSA and ECA. After this normalization, they have applied state-of-the-art acoustic model adaptation techniques like Maximum Likelihood Linear Regression (MLLR) and M...

  2. The Use of an Autonomous Pedagogical Agent and Automatic Speech Recognition for Teaching Sight Words to Students with Autism Spectrum Disorder

    Science.gov (United States)

    Saadatzi, Mohammad Nasser; Pennington, Robert C.; Welch, Karla C.; Graham, James H.; Scott, Renee E.

    2017-01-01

    In the current study, we examined the effects of an instructional package comprised of an autonomous pedagogical agent, automatic speech recognition, and constant time delay during the instruction of reading sight words aloud to young adults with autism spectrum disorder. We used a concurrent multiple baseline across participants design to…

  3. Automatic recognition of spontaneous emotions in speech using acoustic and lexical features

    NARCIS (Netherlands)

    Raaijmakers, S.; Truong, K.P.

    2008-01-01

    We developed acoustic and lexical classifiers, based on a boosting algorithm, to assess the separability on arousal and valence dimensions in spontaneous emotional speech. The spontaneous emotional speech data was acquired by inviting subjects to play a first-person shooter video game. Our acoustic

  4. Noise robust automatic speech recognition with adaptive quantile based noise estimation and speech band emphasizing filter bank

    DEFF Research Database (Denmark)

    Bonde, Casper Stork; Graversen, Carina; Gregersen, Andreas Gregers

    2005-01-01

    to the appearance of the speech signal, which require noise-robust voice activity detection and assumptions of stationary noise. However, both of these requirements are often not met, and it is therefore of particular interest to investigate methods like the Quantile Based Noise Estimation (QBNE) method which...... estimates the noise during speech and non-speech sections without the use of a voice activity detector. While the standard QBNE method uses a fixed pre-defined quantile across all frequency bands, this paper suggests adaptive QBNE (AQBNE), which adapts the quantile individually to each frequency band...
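
    A minimal sketch of quantile-based noise estimation over a power spectrogram, plus a purely illustrative adaptive variant (the paper's actual adaptation rule may differ):

```python
import numpy as np

def quantile_noise_estimate(power_spec, quantile=0.5):
    """QBNE idea: for each frequency band, take a low quantile of the power
    values over time as the noise estimate, so no voice activity detector is
    needed. power_spec: array of shape (frames, freq_bins)."""
    return np.quantile(power_spec, quantile, axis=0)

def adaptive_quantile_noise_estimate(power_spec, q_low=0.1, q_high=0.6):
    """Illustrative adaptive variant in the spirit of AQBNE: pick the quantile
    per frequency band, here heuristically from the band's dynamic range."""
    spread = power_spec.max(axis=0) - power_spec.min(axis=0)
    weight = spread / (spread.max() + 1e-12)
    quantiles = q_low + (q_high - q_low) * (1.0 - weight)
    return np.array([np.quantile(power_spec[:, k], q)
                     for k, q in enumerate(quantiles)])
```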

  5. A simple error classification system for understanding sources of error in automatic speech recognition and human transcription.

    Science.gov (United States)

    Zafar, Atif; Mamlin, Burke; Perkins, Susan; Belsito, Anne M; Overhage, J Marc; McDonald, Clement J

    2004-09-01

    To (1) discover the types of errors most commonly found in clinical notes that are generated either using automatic speech recognition (ASR) or via human transcription and (2) develop efficient rules for classifying these errors based on the categories found in (1). The purpose of classifying errors into categories is to understand the underlying processes that generate these errors, so that measures can be taken to improve these processes. We integrated the Dragon NaturallySpeaking v4.0 speech recognition engine into the Regenstrief Medical Record System. We captured the text output of the speech engine prior to error correction by the speaker. We also acquired a set of human transcribed but uncorrected notes for comparison. We then attempted to error correct these notes based on looking at the context alone. Initially, three domain experts independently examined 104 ASR notes (containing 29,144 words) generated by a single speaker and 44 human transcribed notes (containing 14,199 words) generated by multiple speakers for errors. Collaborative group sessions were subsequently held where error categories were determined and rules developed and incrementally refined for systematically examining the notes and classifying errors. We found that the errors could be classified into nine categories: (1) enunciation errors occurring due to speaker mispronunciation, (2) dictionary errors resulting from missing terms, (3) suffix errors caused by misrecognition of appropriate tenses of a word, (4) added words, (5) deleted words, (6) homonym errors resulting from substitution of a phonetically identical word, (7) spelling errors, (8) nonsense errors, words/phrases whose meaning could not be appreciated by examining just the context, and (9) critical errors, words/phrases where a reader of a note could potentially misunderstand the concept that was related by the speaker. A simple method is presented for examining errors in transcribed documents and classifying these

  6. Biologically-Inspired Spike-Based Automatic Speech Recognition of Isolated Digits Over a Reproducing Kernel Hilbert Space

    Directory of Open Access Journals (Sweden)

    Kan Li

    2018-04-01

    Full Text Available This paper presents a novel real-time dynamic framework for quantifying time-series structure in spoken words using spikes. Audio signals are converted into multi-channel spike trains using a biologically-inspired leaky integrate-and-fire (LIF spike generator. These spike trains are mapped into a function space of infinite dimension, i.e., a Reproducing Kernel Hilbert Space (RKHS using point-process kernels, where a state-space model learns the dynamics of the multidimensional spike input using gradient descent learning. This kernelized recurrent system is very parsimonious and achieves the necessary memory depth via feedback of its internal states when trained discriminatively, utilizing the full context of the phoneme sequence. A main advantage of modeling nonlinear dynamics using state-space trajectories in the RKHS is that it imposes no restriction on the relationship between the exogenous input and its internal state. We are free to choose the input representation with an appropriate kernel, and changing the kernel does not impact the system nor the learning algorithm. Moreover, we show that this novel framework can outperform both traditional hidden Markov model (HMM speech processing as well as neuromorphic implementations based on spiking neural network (SNN, yielding accurate and ultra-low power word spotters. As a proof of concept, we demonstrate its capabilities using the benchmark TI-46 digit corpus for isolated-word automatic speech recognition (ASR or keyword spotting. Compared to HMM using Mel-frequency cepstral coefficient (MFCC front-end without time-derivatives, our MFCC-KAARMA offered improved performance. For spike-train front-end, spike-KAARMA also outperformed state-of-the-art SNN solutions. Furthermore, compared to MFCCs, spike trains provided enhanced noise robustness in certain low signal-to-noise ratio (SNR regime.
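
    A hedged sketch of a leaky integrate-and-fire (LIF) unit of the kind used in this front end to convert band energies into spike trains; parameter values are illustrative only:

```python
import numpy as np

def lif_spike_train(drive, threshold=1.0, leak=0.95, reset=0.0):
    """Turn one channel of non-negative audio drive (e.g. band energy per
    frame) into a binary spike train via leaky integration and thresholding."""
    v = 0.0
    spikes = np.zeros(len(drive), dtype=np.int8)
    for t, x in enumerate(drive):
        v = leak * v + x          # leaky integration of the input drive
        if v >= threshold:        # fire and reset once the threshold is crossed
            spikes[t] = 1
            v = reset
    return spikes

# Multi-channel front end: one LIF unit per filter-bank channel yields the
# multi-channel spike trains that the kernel state-space model consumes.
# spike_trains = np.stack([lif_spike_train(band) for band in band_energies])
```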

  7. Induction of the morphology of natural language : unsupervised morpheme segmentation with application to automatic speech recognition

    OpenAIRE

    Creutz, Mathias

    2006-01-01

    In order to develop computer applications that successfully process natural language data (text and speech), one needs good models of the vocabulary and grammar of as many languages as possible. According to standard linguistic theory, words consist of morphemes, which are the smallest individually meaningful elements in a language. Since an immense number of word forms can be constructed by combining a limited set of morphemes, the capability of understanding and producing new word forms dep...

  8. Pronunciation Modeling for Large Vocabulary Speech Recognition

    Science.gov (United States)

    Kantor, Arthur

    2010-01-01

    The large pronunciation variability of words in conversational speech is one of the major causes of low accuracy in automatic speech recognition (ASR). Many pronunciation modeling approaches have been developed to address this problem. Some explicitly manipulate the pronunciation dictionary as well as the set of the units used to define the…

  9. Automatic Query Generation and Query Relevance Measurement for Unsupervised Language Model Adaptation of Speech Recognition

    Directory of Open Access Journals (Sweden)

    Suzuki Motoyuki

    2009-01-01

    Full Text Available We are developing a method of Web-based unsupervised language model adaptation for recognition of spoken documents. The proposed method chooses keywords from the preliminary recognition result and retrieves Web documents using the chosen keywords. A problem is that the selected keywords tend to contain misrecognized words. The proposed method introduces two new ideas for avoiding the effects of keywords derived from misrecognized words. The first idea is to compose multiple queries from selected keyword candidates so that misrecognized words and correct words do not fall into one query. The second idea is that the number of Web documents downloaded for each query is determined according to the "query relevance." Combining these two ideas, we can alleviate the bad effect of misrecognized keywords by decreasing the number of Web documents downloaded for queries that contain misrecognized keywords. Finally, we examine a method of determining the number of iterative adaptations based on the recognition likelihood. Experiments have shown that the proposed stopping criterion can determine almost the optimum number of iterations. In the final experiment, the word accuracy without adaptation (55.29%) was improved to 60.38%, which was 1.13 points better than the result of the conventional unsupervised adaptation method (59.25%).
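
    The two ideas (spreading keyword candidates over several small queries, and allocating the download budget by query relevance) can be sketched as follows; the relevance measure itself is left abstract here, and the keyword examples are hypothetical:

```python
from itertools import combinations

def build_queries(keywords, words_per_query=2):
    # Idea 1: spread keyword candidates over several small queries, so a
    # misrecognized keyword can spoil only the few queries it appears in.
    return [list(q) for q in combinations(keywords, words_per_query)]

def docs_per_query(relevances, total_docs=200):
    # Idea 2: allocate the Web-document download budget in proportion to each
    # query's relevance score (the paper defines its own relevance measure;
    # here it is just a non-negative number per query).
    total = sum(relevances) or 1.0
    return [int(total_docs * r / total) for r in relevances]

# queries = build_queries(["election", "parliament", "recogniser", "budget"])
# counts  = docs_per_query([0.9, 0.7, 0.1, 0.8, 0.6, 0.2], total_docs=120)
```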

  10. Automatic Query Generation and Query Relevance Measurement for Unsupervised Language Model Adaptation of Speech Recognition

    Directory of Open Access Journals (Sweden)

    Akinori Ito

    2009-01-01

    Full Text Available We are developing a method of Web-based unsupervised language model adaptation for recognition of spoken documents. The proposed method chooses keywords from the preliminary recognition result and retrieves Web documents using the chosen keywords. A problem is that the selected keywords tend to contain misrecognized words. The proposed method introduces two new ideas for avoiding the effects of keywords derived from misrecognized words. The first idea is to compose multiple queries from selected keyword candidates so that misrecognized words and correct words do not fall into one query. The second idea is that the number of Web documents downloaded for each query is determined according to the “query relevance.” Combining these two ideas, we can alleviate the bad effect of misrecognized keywords by decreasing the number of Web documents downloaded for queries that contain misrecognized keywords. Finally, we examine a method of determining the number of iterative adaptations based on the recognition likelihood. Experiments have shown that the proposed stopping criterion can determine almost the optimum number of iterations. In the final experiment, the word accuracy without adaptation (55.29%) was improved to 60.38%, which was 1.13 points better than the result of the conventional unsupervised adaptation method (59.25%).

  11. Optimizing Automatic Speech Recognition for Low-Proficient Non-Native Speakers

    Directory of Open Access Journals (Sweden)

    Catia Cucchiarini

    2010-01-01

    Full Text Available Computer-Assisted Language Learning (CALL) applications for improving the oral skills of low-proficient learners have to cope with non-native speech that is particularly challenging. Since unconstrained non-native ASR is still problematic, a possible solution is to elicit constrained responses from the learners. In this paper, we describe experiments aimed at selecting utterances from lists of responses. The first experiment on utterance selection indicates that the decoding process can be improved by optimizing the language model and the acoustic models, thus reducing the utterance error rate from 29–26% to 10–8%. Since giving feedback on incorrectly recognized utterances is confusing, we verify the correctness of the utterance before providing feedback. The results of the second experiment on utterance verification indicate that combining duration-related features with a likelihood ratio (LR) yields an equal error rate (EER) of 10.3%, which is significantly better than the EER for the other measures in isolation.
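
    A rough sketch of utterance verification by combining a likelihood ratio with duration features, and of estimating the equal error rate by a threshold sweep; the weighting and feature set are illustrative assumptions, not the paper's:

```python
import numpy as np

def verification_score(loglik_claim, loglik_background, duration_feats, weights):
    # Likelihood ratio (LR) combined linearly with duration-related features.
    llr = loglik_claim - loglik_background
    return weights[0] * llr + float(np.dot(weights[1:], duration_feats))

def equal_error_rate(scores_correct, scores_incorrect):
    # Sweep a decision threshold and take the point where the false-accept
    # rate (accepting an incorrect utterance) and the false-reject rate
    # (rejecting a correct one) are closest.
    scores_correct = np.asarray(scores_correct, dtype=float)
    scores_incorrect = np.asarray(scores_incorrect, dtype=float)
    thresholds = np.sort(np.concatenate([scores_correct, scores_incorrect]))
    fars = np.array([np.mean(scores_incorrect >= t) for t in thresholds])
    frrs = np.array([np.mean(scores_correct < t) for t in thresholds])
    i = int(np.argmin(np.abs(fars - frrs)))
    return (fars[i] + frrs[i]) / 2.0
```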

  12. Tone realisation in a Yoruba speech recognition corpus

    CSIR Research Space (South Africa)

    Van Niekerk, D

    2012-05-01

    Full Text Available The authors investigate the acoustic realisation of tone in short continuous utterances in Yoruba. Fundamental frequency contours are extracted for automatically aligned syllables from a speech corpus of 33 speakers collected for speech recognition...

  13. Machines a Comprendre la Parole: Methodologie et Bilan de Recherche (Automatic Speech Recognition: Methodology and the State of the Research)

    Science.gov (United States)

    Haton, Jean-Pierre

    1974-01-01

    Still no decisive result has been achieved in the automatic machine recognition of sentences of a natural language. Current research concentrates on developing algorithms for syntactic and semantic analysis. It is obvious that clues from all levels of perception have to be taken into account if a long term solution is ever to be found. (Author/MSE)

  14. Assessing the Performance of Automatic Speech Recognition Systems When Used by Native and Non-Native Speakers of Three Major Languages in Dictation Workflows

    DEFF Research Database (Denmark)

    Zapata, Julián; Kirkedal, Andreas Søeborg

    2015-01-01

    In this paper, we report on a two-part experiment aiming to assess and compare the performance of two types of automatic speech recognition (ASR) systems on two different computational platforms when used to augment dictation workflows. The experiment was performed with a sample of speakers...... of three major languages and with different linguistic profiles: non-native English speakers; non-native French speakers; and native Spanish speakers. The main objective of this experiment is to examine ASR performance in translation dictation (TD) and medical dictation (MD) workflows without manual...

  15. Connected digit speech recognition system for Malayalam

    Indian Academy of Sciences (India)

    Connected digit speech recognition is important in many applications such as automated banking systems, catalogue dialing, automatic data entry, etc. This paper presents an optimum speaker-independent connected digit recognizer for the Malayalam language. The system employs Perceptual ...

  16. Speech Recognition: How Do We Teach It?

    Science.gov (United States)

    Barksdale, Karl

    2002-01-01

    States that growing use of speech recognition software has made voice writing an essential computer skill. Describes how to present the topic, develop basic speech recognition skills, and teach speech recognition outlining, writing, proofreading, and editing. (Contains 14 references.) (SK)

  17. Visual speech influences speech perception immediately but not automatically.

    Science.gov (United States)

    Mitterer, Holger; Reinisch, Eva

    2017-02-01

    Two experiments examined the time course of the use of auditory and visual speech cues to spoken word recognition using an eye-tracking paradigm. Results of the first experiment showed that the use of visual speech cues from lipreading is reduced if concurrently presented pictures require a division of attentional resources. This reduction was evident even when listeners' eye gaze was on the speaker rather than the (static) pictures. Experiment 2 used a deictic hand gesture to foster attention to the speaker. At the same time, the visual processing load was reduced by keeping the visual display constant over a fixed number of successive trials. Under these conditions, the visual speech cues from lipreading were used. Moreover, the eye-tracking data indicated that visual information was used immediately and even earlier than auditory information. In combination, these data indicate that visual speech cues are not used automatically, but if they are used, they are used immediately.

  18. Methods of Teaching Speech Recognition

    Science.gov (United States)

    Rader, Martha H.; Bailey, Glenn A.

    2010-01-01

    Objective: This article introduces the history and development of speech recognition, addresses its role in the business curriculum, outlines related national and state standards, describes instructional strategies, and discusses the assessment of student achievement in speech recognition classes. Methods: Research methods included a synthesis of…

  19. Collecting and evaluating speech recognition corpora for nine Southern Bantu languages

    CSIR Research Space (South Africa)

    Badenhorst, JAC

    2009-03-01

    Full Text Available The authors describe the Lwazi corpus for automatic speech recognition (ASR), a new telephone speech corpus which includes data from nine Southern Bantu languages. Because of practical constraints, the amount of speech per language is relatively...

  20. End-to-end visual speech recognition with LSTMS

    NARCIS (Netherlands)

    Petridis, Stavros; Li, Zuwei; Pantic, Maja

    2017-01-01

    Traditional visual speech recognition systems consist of two stages, feature extraction and classification. Recently, several deep learning approaches have been presented which automatically extract features from the mouth images and aim to replace the feature extraction stage. However, research on

  1. Multi-thread Parallel Speech Recognition for Mobile Applications

    Directory of Open Access Journals (Sweden)

    LOJKA Martin

    2014-05-01

    Full Text Available In this paper, the server-based solution of a multi-thread large-vocabulary automatic speech recognition engine is described, along with practical application examples for Android OS and HTML5. The basic idea was to make speech recognition available for a full variety of applications for computers and especially for mobile devices. The speech recognition engine should be independent of commercial products and services (where the dictionary could not be modified). Using third-party services could also be a security and privacy problem in specific applications, where unsecured audio data must not be sent to uncontrolled environments (voice data transferred to servers around the globe). Using our experience with speech recognition applications, we have been able to construct a multi-thread, server-based speech recognition solution designed with a simple application programming interface (API) to the speech recognition engine, modified to the specific needs of a particular application.
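
    As an illustration of the architecture only (not the authors' engine or API), a threaded recognition server can be skeletoned with Python's standard library, with one handler thread per client connection; the recognize() stub is a placeholder:

```python
import socketserver

def recognize(audio_bytes: bytes) -> str:
    """Placeholder for the actual large-vocabulary recognizer; a real engine
    would decode the audio here."""
    return "<recognized text>"

class ASRHandler(socketserver.StreamRequestHandler):
    # Each client connection is served by its own thread, so several
    # recognition requests can be decoded in parallel.
    def handle(self):
        audio = self.rfile.read()
        self.wfile.write(recognize(audio).encode("utf-8"))

if __name__ == "__main__":
    # ThreadingTCPServer spawns one handler thread per connection.
    with socketserver.ThreadingTCPServer(("0.0.0.0", 5050), ASRHandler) as srv:
        srv.serve_forever()
```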

  2. Post-editing through Speech Recognition

    DEFF Research Database (Denmark)

    Mesa-Lao, Bartolomé

    In the past couple of years automatic speech recognition (ASR) software has quietly created a niche for itself in many situations of our lives. Nowadays it can be found at the other end of customer-support hotlines, it is built into operating systems and it is offered as an alternative text...... computer-aided translation workbenches in the market (i.e. MemoQ) together with one of the most well-known ASR packages (i.e. Dragon Naturally Speaking from Nuance). Two data correction modes will be considered: a) keyboard vs. b) keyboard and speech combined. These two different ways of verifying...

  3. Auditory Modeling for Noisy Speech Recognition

    National Research Council Canada - National Science Library

    2000-01-01

    ...) has used its existing technology in phonetic speech recognition, audio signal processing, and multilingual language translation to design and demonstrate an advanced audio interface for speech...

  4. An automatic image recognition approach

    Directory of Open Access Journals (Sweden)

    Tudor Barbu

    2007-07-01

    Full Text Available Our paper focuses on the graphical analysis domain. We propose an automatic image recognition technique. This approach consists of two main pattern recognition steps. First, it performs an image feature extraction operation on an input image set, using statistical dispersion features. Then, an unsupervised classification process is performed on the previously obtained graphical feature vectors. An automatic region-growing based clustering procedure is proposed and utilized in the classification stage.

  5. The Suitability of Cloud-Based Speech Recognition Engines for Language Learning

    Science.gov (United States)

    Daniels, Paul; Iwago, Koji

    2017-01-01

    As online automatic speech recognition (ASR) engines become more accurate and more widely implemented with call software, it becomes important to evaluate the effectiveness and the accuracy of these recognition engines using authentic speech samples. This study investigates two of the most prominent cloud-based speech recognition engines--Apple's…

  6. Automatic Smoker Detection from Telephone Speech Signals

    DEFF Research Database (Denmark)

    Alavijeh, Amir Hossein Poorjam; Hesaraki, Soheila; Safavi, Saeid

    2017-01-01

    This paper proposes automatic smoking-habit detection from spontaneous telephone speech signals. In this method, each utterance is modeled using the i-vector and non-negative factor analysis (NFA) frameworks, which yield low-dimensional representations of utterances by applying factor analysis...... on Gaussian mixture model means and weights, respectively. Each framework is evaluated using different classification algorithms to detect smoker speakers. Finally, score-level fusion of the i-vector-based and the NFA-based recognizers is considered to improve the classification accuracy. The proposed...... method is evaluated on telephone speech signals of speakers whose smoking habits are known, drawn from the National Institute of Standards and Technology (NIST) 2008 and 2010 Speaker Recognition Evaluation databases. Experimental results over 1194 utterances show the effectiveness of the proposed approach...
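
    Score-level fusion of the two recognizers can be sketched as a weighted combination of normalized scores; the normalization and weight below are assumptions, not the paper's settings:

```python
import numpy as np

def fuse_scores(scores_ivector, scores_nfa, weight=0.5):
    """Z-normalize each recognizer's scores and combine them with a convex
    weight before thresholding into smoker / non-smoker decisions."""
    def znorm(s):
        s = np.asarray(s, dtype=float)
        return (s - s.mean()) / (s.std() + 1e-12)
    return weight * znorm(scores_ivector) + (1.0 - weight) * znorm(scores_nfa)

# predicted_smoker = fuse_scores(s_iv, s_nfa, weight=0.6) > decision_threshold
```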

  7. Discriminative learning for speech recognition

    CERN Document Server

    He, Xiadong

    2008-01-01

    In this book, we introduce the background and mainstream methods of probabilistic modeling and discriminative parameter optimization for speech recognition. The specific models treated in depth include the widely used exponential-family distributions and the hidden Markov model. A detailed study is presented on unifying the common objective functions for discriminative learning in speech recognition, namely maximum mutual information (MMI), minimum classification error, and minimum phone/word error. The unification is presented, with rigorous mathematical analysis, in a common rational-functio

  8. Speech Recognition: Its Place in Business Education.

    Science.gov (United States)

    Szul, Linda F.; Bouder, Michele

    2003-01-01

    Suggests uses of speech recognition devices in the classroom for students with disabilities. Compares speech recognition software packages and provides guidelines for selection and teaching. (Contains 14 references.) (SK)

  9. Speech recognition employing biologically plausible receptive fields

    DEFF Research Database (Denmark)

    Fereczkowski, Michal; Bothe, Hans-Heinrich

    2011-01-01

    The main idea of the project is to build a largely speaker-independent, biologically motivated automatic speech recognition (ASR) system. The two main differences between our approach and current state-of-the-art ASRs are that i) the features used here are based on the responses of neuronlike...... Model-based adaptation procedures. Two databases are used: TI46 for discrete speech and a subset of the TIMIT database collected from speakers belonging to the New York dialect region. Each of the 10 selected sentences is uttered once by each of 35 speakers. The major differences between the two data...... sets initiate the development and comparison of two distinct ASRs within the project, which will be presented in the following. Employing a reduced sampling frequency and bandwidth of the signals, the ASR algorithm reaches and goes beyond recognition results that are known from humans.

  10. IBM MASTOR SYSTEM: Multilingual Automatic Speech-to-speech Translator

    National Research Council Canada - National Science Library

    Gao, Yuqing; Gu, Liang; Zhou, Bowen; Sarikaya, Ruhi; Afify, Mohamed; Kuo, Hong-Kwang; Zhu, Wei-zhong; Deng, Yonggang; Prosser, Charles; Zhang, Wei

    2006-01-01

    .... Challenges include speech recognition and machine translation in adverse environments, lack of training data and linguistic resources for under-studied languages, and the need to rapidly develop...

  11. Speech recognition implementation in radiology

    International Nuclear Information System (INIS)

    White, Keith S.

    2005-01-01

    Continuous speech recognition (SR) is an emerging technology that allows direct digital transcription of dictated radiology reports. The SR systems are being widely deployed in the radiology community. This is a review of technical and practical issues that should be considered when implementing an SR system. (orig.)

  12. Unvoiced Speech Recognition Using Tissue-Conductive Acoustic Sensor

    Directory of Open Access Journals (Sweden)

    Heracleous Panikos

    2007-01-01

    Full Text Available We present the use of stethoscope and silicon NAM (nonaudible murmur) microphones in automatic speech recognition. NAM microphones are special acoustic sensors, which are attached behind the talker's ear and can capture not only normal (audible) speech, but also very quietly uttered speech (nonaudible murmur). As a result, NAM microphones can be applied in automatic speech recognition systems when privacy is desired in human-machine communication. Moreover, NAM microphones show robustness against noise and they might be used in special systems (speech recognition, speech transform, etc.) for sound-impaired people. Using adaptation techniques and a small amount of training data, we achieved for a 20 k dictation task a word accuracy for nonaudible murmur recognition in a clean environment. In this paper, we also investigate nonaudible murmur recognition in noisy environments and the effect of the Lombard reflex on nonaudible murmur recognition. We also propose three methods to integrate audible speech and nonaudible murmur recognition using a stethoscope NAM microphone, with very promising results.

  13. The benefit obtained from visually displayed text from an automatic speech recognizer during listening to speech presented in noise

    NARCIS (Netherlands)

    Zekveld, A.A.; Kramer, S.E.; Kessens, J.M.; Vlaming, M.S.M.G.; Houtgast, T.

    2008-01-01

    OBJECTIVES: The aim of this study was to evaluate the benefit that listeners obtain from visually presented output from an automatic speech recognition (ASR) system during listening to speech in noise. DESIGN: Auditory-alone and audiovisual speech reception thresholds (SRTs) were measured. The SRT

  14. AUTOMATIC ARCHITECTURAL STYLE RECOGNITION

    Directory of Open Access Journals (Sweden)

    M. Mathias

    2012-09-01

    Full Text Available Procedural modeling has proven to be a very valuable tool in the field of architecture. In the last few years, research has soared to automatically create procedural models from images. However, current algorithms for this process of inverse procedural modeling rely on the assumption that the building style is known. So far, the determination of the building style has remained a manual task. In this paper, we propose an algorithm which automates this process through classification of architectural styles from facade images. Our classifier first identifies the images containing buildings, then separates individual facades within an image and determines the building style. This information could then be used to initialize the building reconstruction process. We have trained our classifier to distinguish between several distinct architectural styles, namely Flemish Renaissance, Haussmannian and Neoclassical. Finally, we demonstrate our approach on various street-side images.

  15. A Research of Speech Emotion Recognition Based on Deep Belief Network and SVM

    Directory of Open Access Journals (Sweden)

    Chenchen Huang

    2014-01-01

    Full Text Available Feature extraction is a very important part of speech emotion recognition. Addressing the feature extraction problem, this paper proposes a new method that uses deep belief networks (DBNs) to extract emotional features from the speech signal automatically. A 5-layer DBN is trained to extract speech emotion features, and multiple consecutive frames are incorporated to form a high-dimensional feature vector. The features produced by the trained DBN are the input of a nonlinear SVM classifier, yielding a multi-classifier speech emotion recognition system. The speech emotion recognition rate of the system reached 86.5%, which was 7% higher than the original method.
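
    As a rough, hedged sketch of the pipeline described above (deep feature learning followed by a nonlinear SVM), the following Python fragment uses a single scikit-learn BernoulliRBM layer as a stand-in for the 5-layer DBN. The feature matrix X_train and emotion labels y_train are assumed to be prepared elsewhere (e.g., stacked consecutive spectral frames per utterance), and all hyperparameters are illustrative rather than the authors' settings.

        from sklearn.pipeline import Pipeline
        from sklearn.preprocessing import MinMaxScaler
        from sklearn.neural_network import BernoulliRBM
        from sklearn.svm import SVC

        # Single RBM layer as a simplified stand-in for the paper's 5-layer DBN (assumption).
        model = Pipeline([
            ("scale", MinMaxScaler()),                                   # RBM expects inputs in [0, 1]
            ("rbm", BernoulliRBM(n_components=256, learning_rate=0.05, n_iter=20)),
            ("svm", SVC(kernel="rbf", C=10.0)),                          # nonlinear SVM classifier
        ])
        # model.fit(X_train, y_train); accuracy = model.score(X_test, y_test)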

  16. Hidden neural networks: application to speech recognition

    DEFF Research Database (Denmark)

    Riis, Søren Kamaric

    1998-01-01

    We evaluate the hidden neural network HMM/NN hybrid on two speech recognition benchmark tasks; (1) task independent isolated word recognition on the Phonebook database, and (2) recognition of broad phoneme classes in continuous speech from the TIMIT database. It is shown how hidden neural networks...

  17. Physics of Automatic Target Recognition

    CERN Document Server

    Sadjadi, Firooz

    2007-01-01

    Physics of Automatic Target Recognition addresses the fundamental physical bases of sensing, and information extraction in the state-of-the art automatic target recognition field. It explores both passive and active multispectral sensing, polarimetric diversity, complex signature exploitation, sensor and processing adaptation, transformation of electromagnetic and acoustic waves in their interactions with targets, background clutter, transmission media, and sensing elements. The general inverse scattering, and advanced signal processing techniques and scientific evaluation methodologies being used in this multi disciplinary field will be part of this exposition. The issues of modeling of target signatures in various spectral modalities, LADAR, IR, SAR, high resolution radar, acoustic, seismic, visible, hyperspectral, in diverse geometric aspects will be addressed. The methods for signal processing and classification will cover concepts such as sensor adaptive and artificial neural networks, time reversal filt...

  18. Intonation and Dialog Context as Constraints for Speech Recognition.

    Science.gov (United States)

    Taylor, Paul; King, Simon; Isard, Stephen; Wright, Helen

    1998-01-01

    Describes how to use intonation and dialog context to improve the performance of an automatic speech-recognition system. Experiments utilized the DCIEM Maptask corpus, using a separate bigram language model for each type of move and showing that, with the correct move-specific language model for each utterance in the test set, the recognizer's…

  19. Automatic speech signal segmentation based on the innovation adaptive filter

    Directory of Open Access Journals (Sweden)

    Makowski Ryszard

    2014-06-01

    Full Text Available Speech segmentation is an essential stage in designing automatic speech recognition systems and one can find several algorithms proposed in the literature. It is a difficult problem, as speech is immensely variable. The aim of the authors’ studies was to design an algorithm that could be employed at the stage of automatic speech recognition. This would make it possible to avoid some problems related to speech signal parametrization. Posing the problem in such a way requires the algorithm to be capable of working in real time. The only such algorithm was proposed by Tyagi et al. (2006), and it is a modified version of Brandt’s algorithm. The article presents a new algorithm for unsupervised automatic speech signal segmentation. It performs segmentation without access to information about the phonetic content of the utterances, relying exclusively on second-order statistics of a speech signal. The starting point for the proposed method is the time-varying Schur coefficients of an innovation adaptive filter. The Schur algorithm is known to be fast, precise, stable and capable of rapidly tracking changes in second order signal statistics. A transfer from one phoneme to another in the speech signal always indicates a change in signal statistics caused by vocal tract changes. In order to allow for the properties of human hearing, detection of inter-phoneme boundaries is performed based on statistics defined on the mel spectrum determined from the reflection coefficients. The paper presents the structure of the algorithm, defines its properties, lists parameter values, describes detection efficiency results, and compares them with those for another algorithm. The obtained segmentation results are satisfactory.
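
    As an illustrative sketch (not the authors' Schur-based implementation), the Python fragment below computes per-frame reflection coefficients with a standard Levinson-Durbin recursion and flags candidate phoneme boundaries where consecutive coefficient vectors change abruptly; the frame length, model order and detection threshold are assumptions chosen only for demonstration.

        import numpy as np

        def reflection_coeffs(frame, order=10):
            # Autocorrelation method + Levinson-Durbin recursion -> reflection (PARCOR) coefficients.
            r = np.array([np.dot(frame[:len(frame) - lag], frame[lag:]) for lag in range(order + 1)])
            r = r / (r[0] + 1e-12)                 # normalise to guard against silent frames
            a = np.zeros(order + 1); a[0] = 1.0
            err = r[0]
            k = np.zeros(order)
            for i in range(1, order + 1):
                acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
                ki = -acc / (err + 1e-12)
                k[i - 1] = ki
                a_prev = a.copy()
                a[1:i] = a_prev[1:i] + ki * a_prev[1:i][::-1]
                a[i] = ki
                err *= (1.0 - ki * ki)
            return k

        def boundaries(signal, fs, frame_ms=20, threshold=0.6):
            # Candidate boundaries where second-order statistics (reflection coefficients) jump.
            n = int(fs * frame_ms / 1000)
            frames = [signal[i:i + n] for i in range(0, len(signal) - n + 1, n)]
            ks = np.array([reflection_coeffs(f) for f in frames])
            change = np.linalg.norm(np.diff(ks, axis=0), axis=1)
            return [(i + 1) * n for i, c in enumerate(change) if c > threshold]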

  20. Stimulated Deep Neural Network for Speech Recognition

    Science.gov (United States)

    2016-09-08

    approaches yield state-of-the-art performance in a range of tasks, including speech recognition. However, the parameters of the network are hard to analyze...advantage of the smoothness constraints that stimulated training offers. The approaches are evaluated on two large vocabulary speech recognition

  1. Unsupervised modulation filter learning for noise-robust speech recognition.

    Science.gov (United States)

    Agrawal, Purvi; Ganapathy, Sriram

    2017-09-01

    The modulation filtering approach to robust automatic speech recognition (ASR) is based on enhancing perceptually relevant regions of the modulation spectrum while suppressing the regions susceptible to noise. In this paper, a data-driven unsupervised modulation filter learning scheme is proposed using convolutional restricted Boltzmann machine. The initial filter is learned using the speech spectrogram while subsequent filters are learned using residual spectrograms. The modulation filtered spectrograms are used for ASR experiments on noisy and reverberant speech where these features provide significant improvements over other robust features. Furthermore, the application of the proposed method for semi-supervised learning is investigated.

  2. Personality in speech assessment and automatic classification

    CERN Document Server

    Polzehl, Tim

    2015-01-01

    This work combines interdisciplinary knowledge and experience from research fields of psychology, linguistics, audio-processing, machine learning, and computer science. The work systematically explores a novel research topic devoted to automated modeling of personality expression from speech. For this aim, it introduces a novel personality assessment questionnaire and presents the results of extensive labeling sessions to annotate the speech data with personality assessments. It provides estimates of the Big 5 personality traits, i.e. openness, conscientiousness, extroversion, agreeableness, and neuroticism. Based on a database built on the questionnaire, the book presents models to tell apart different personality types or classes from speech automatically.

  3. Automatic Recognition of Road Signs

    Science.gov (United States)

    Inoue, Yasuo; Kohashi, Yuuichirou; Ishikawa, Naoto; Nakajima, Masato

    2002-11-01

    The increase in traffic accidents is becoming a serious social problem with the recent rapid increase in traffic. In many cases, the driver's carelessness is the primary factor in traffic accidents, and driver assistance systems are demanded for supporting driver safety. In this research, we propose a new method of automatic detection and recognition of road signs by image processing. The purpose of this research is to prevent accidents caused by driver carelessness and to call the driver's attention when a traffic regulation is violated. High accuracy and an efficient sign detection method are achieved by removing from the image all unnecessary information except the road sign, and then detecting the road sign using shape features. At first, color information that is not used in road signs is removed from the image. Next, edges other than circular and triangular ones are removed to select the sign shape. In the recognition process, a normalized cross-correlation operation is carried out on the two-dimensional differentiation pattern of a sign, yielding an accurate and efficient method for detecting the road sign. Moreover, real-time operation in software was realized by holding down the calculation cost while maintaining highly precise sign detection and recognition. Specifically, it becomes possible to process at 0.1 s/frame using a general-purpose PC (CPU: Pentium4 1.7GHz). In-vehicle experiments confirmed that our system runs in real time and that detection and recognition of a sign are performed correctly.
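
    A minimal sketch of the matching step, assuming OpenCV is available: the normalized cross-correlation is computed on gradient-magnitude ("differentiation") images of the scene and the sign template. The image paths are placeholders, and this is only an illustration of the idea, not the authors' implementation.

        import cv2

        # Placeholder file names; grayscale scene image and sign template (assumptions).
        scene = cv2.imread("road_scene.png", cv2.IMREAD_GRAYSCALE)
        sign = cv2.imread("sign_template.png", cv2.IMREAD_GRAYSCALE)

        def grad_mag(img):
            # Two-dimensional differentiation pattern: Sobel gradient magnitude.
            gx = cv2.Sobel(img, cv2.CV_32F, 1, 0)
            gy = cv2.Sobel(img, cv2.CV_32F, 0, 1)
            return cv2.magnitude(gx, gy)

        # Normalized cross-correlation between the differentiation patterns.
        res = cv2.matchTemplate(grad_mag(scene), grad_mag(sign), cv2.TM_CCOEFF_NORMED)
        _, score, _, loc = cv2.minMaxLoc(res)
        print("best match score %.2f at top-left corner %s" % (score, loc))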

  4. Pattern Recognition Methods and Features Selection for Speech Emotion Recognition System.

    Science.gov (United States)

    Partila, Pavol; Voznak, Miroslav; Tovarek, Jaromir

    2015-01-01

    The impact of the classification method and feature selection on speech emotion recognition accuracy is discussed in this paper. Selecting the correct parameters in combination with the classifier is an important part of reducing the complexity of system computing. This step is necessary especially for systems that will be deployed in real-time applications. The reason for the development and improvement of speech emotion recognition systems is their wide usability in today's automatic voice-controlled systems. The Berlin database of emotional recordings was used in this experiment. The classification accuracy of artificial neural networks, k-nearest neighbours, and Gaussian mixture models is measured considering the selection of prosodic, spectral, and voice quality features. The purpose was to find an optimal combination of methods and group of features for stress detection in human speech. The research contribution lies in the design of the speech emotion recognition system due to its accuracy and efficiency.

  5. Pattern Recognition Methods and Features Selection for Speech Emotion Recognition System

    Directory of Open Access Journals (Sweden)

    Pavol Partila

    2015-01-01

    Full Text Available The impact of the classification method and feature selection on speech emotion recognition accuracy is discussed in this paper. Selecting the correct parameters in combination with the classifier is an important part of reducing the complexity of system computing. This step is necessary especially for systems that will be deployed in real-time applications. The reason for the development and improvement of speech emotion recognition systems is their wide usability in today's automatic voice-controlled systems. The Berlin database of emotional recordings was used in this experiment. The classification accuracy of artificial neural networks, k-nearest neighbours, and Gaussian mixture models is measured considering the selection of prosodic, spectral, and voice quality features. The purpose was to find an optimal combination of methods and group of features for stress detection in human speech. The research contribution lies in the design of the speech emotion recognition system due to its accuracy and efficiency.

  6. Automatic Dialect Detection in Arabic Broadcast Speech

    OpenAIRE

    Ali, Ahmed; Bell, Peter; Renals, Steve

    2015-01-01

    We investigate different approaches for dialect identification in Arabic broadcast speech, using phonetic and lexical features obtained from a speech recognition system, and acoustic features using the i-vector framework. We studied both generative and discriminative classifiers, and we combined these features using a multi-class Support Vector Machine (SVM). We validated our results on an Arabic/English language identification task, with an accuracy of 100%. We used these features in a binary class...

  7. Automatic Dialect Detection in Arabic Broadcast Speech

    OpenAIRE

    Ali, Ahmed; Dehak, Najim; Cardinal, Patrick; Khurana, Sameer; Yella, Sree Harsha; Glass, James; Bell, Peter; Renals, Steve

    2016-01-01

    In this paper, we investigate different approaches for dialect identification in Arabic broadcast speech. These methods are based on phonetic and lexical features obtained from a speech recognition system, and bottleneck features using the i-vector framework. We studied both generative and discriminative classifiers, and we combined these features using a multi-class Support Vector Machine (SVM). We validated our results on an Arabic/English language identification task, with an accuracy of 1...

  8. Speech recognition systems on the Cell Broadband Engine

    Energy Technology Data Exchange (ETDEWEB)

    Liu, Y; Jones, H; Vaidya, S; Perrone, M; Tydlitat, B; Nanda, A

    2007-04-20

    In this paper we describe our design, implementation, and first results of a prototype connected-phoneme-based speech recognition system on the Cell Broadband Engine™ (Cell/B.E.). Automatic speech recognition decodes speech samples into plain text (other representations are possible) and must process samples at real-time rates. Fortunately, the computational tasks involved in this pipeline are highly data-parallel and can receive significant hardware acceleration from vector-streaming architectures such as the Cell/B.E. Identifying and exploiting these parallelism opportunities is challenging, but also critical to improving system performance. We observed, from our initial performance timings, that a single Cell/B.E. processor can recognize speech from thousands of simultaneous voice channels in real time--a channel density that is orders-of-magnitude greater than the capacity of existing software speech recognizers based on CPUs (central processing units). This result emphasizes the potential for Cell/B.E.-based speech recognition and will likely lead to the future development of production speech systems using Cell/B.E. clusters.

  9. Speech emotion recognition methods: A literature review

    Science.gov (United States)

    Basharirad, Babak; Moradhaseli, Mohammadreza

    2017-10-01

    Recently, attention to emotional speech signal research in human-machine interfaces has been boosted by the availability of high computation capability. There are many systems proposed in the literature to identify the emotional state through speech. Selection of suitable feature sets, design of proper classification methods, and preparation of an appropriate dataset are the main key issues of speech emotion recognition systems. This paper critically analyzes the currently available speech emotion recognition methods based on three evaluation parameters (feature set, classification of features, and accuracy). In addition, this paper also evaluates the performance and limitations of available methods. Furthermore, it highlights the current promising direction for improvement of speech emotion recognition systems.

  10. Automatic Smoker Detection from Telephone Speech Signals

    DEFF Research Database (Denmark)

    Alavijeh, Amir Hossein Poorjam; Hesaraki, Soheila; Safavi, Saeid

    2017-01-01

    This paper proposes an automatic smoking habit detection from spontaneous telephone speech signals. In this method, each utterance is modeled using i-vector and non-negative factor analysis (NFA) frameworks, which yield low-dimensional representation of utterances by applying factor analysis on G...

  11. Man machine interface based on speech recognition

    International Nuclear Information System (INIS)

    Jorge, Carlos A.F.; Aghina, Mauricio A.C.; Mol, Antonio C.A.; Pereira, Claudio M.N.A.

    2007-01-01

    This work reports the development of a Man Machine Interface based on speech recognition. The system must recognize spoken commands and execute the desired tasks without manual intervention by operators. The range of applications goes from the execution of commands in an industrial plant's control room to navigation and interaction in virtual environments. Results are reported for isolated word recognition, the isolated words corresponding to the spoken commands. For the pre-processing stage, relevant parameters are extracted from the speech signals using the cepstral analysis technique; these parameters are used for isolated word recognition and correspond to the inputs of an artificial neural network that performs the recognition task. (author)

  12. Speech-in-Speech Recognition: A Training Study

    Science.gov (United States)

    Van Engen, Kristin J.

    2012-01-01

    This study aims to identify aspects of speech-in-noise recognition that are susceptible to training, focusing on whether listeners can learn to adapt to target talkers ("tune in") and learn to better cope with various maskers ("tune out") after short-term training. Listeners received training on English sentence recognition in…

  13. Speech recognition from spectral dynamics

    Indian Academy of Sciences (India)

    Abstract. Information is carried in changes of a signal. The paper starts with revisiting Dudley's concept of the carrier nature of speech. It points to its close connection to modulation spectra of speech and argues against short-term spectral envelopes as dominant carriers of the linguistic information in speech. The history of ...

  14. Adding user-friendliness and ease of implementation to continuous speech recognition technology with speech macros: case studies.

    Science.gov (United States)

    Green, Harrison D

    2004-01-01

    Continuous Speech Recognition Technology implementation is expensive, and the failure of leading companies in this niche can hamper usefulness. C-SRT, if deployed and used with speech macros, experiences vastly improved implementations and drastically reduces medical transcription costs. A speech macro is a short phrase that is automatically translated into a block of text or a graphic display. A more powerful form of speech macro can bring up predefined templates and insert spoken text into the proper position automatically, based on its interpretation. Cases from the author's consulting experiences and from medical journals emphasize the need for speech macros from a cost-benefit standpoint. A prototype program is introduced that facilitates the process of creating macros. The need for macro management software is reconciled with current research on speech recognition and technology adoption. A planned experiment is discussed.

  15. Speech Recognition for the iCub Platform

    Directory of Open Access Journals (Sweden)

    Bertrand Higy

    2018-02-01

    Full Text Available This paper describes open source software (available at https://github.com/robotology/natural-speech) to build automatic speech recognition (ASR) systems and run them within the YARP platform. The toolkit is designed (i) to allow non-ASR experts to easily create their own ASR system and run it on iCub and (ii) to build deep learning-based models specifically addressing the main challenges an ASR system faces in the context of verbal human–iCub interactions. The toolkit mostly consists of Python, C++ code and shell scripts integrated in YARP. As additional contribution, a second codebase (written in Matlab) is provided for more expert ASR users who want to experiment with bio-inspired and developmental learning-inspired ASR systems. Specifically, we provide code for two distinct kinds of speech recognition: “articulatory” and “unsupervised” speech recognition. The first is largely inspired by influential neurobiological theories of speech perception which assume speech perception to be mediated by brain motor cortex activities. Our articulatory systems have been shown to outperform strong deep learning-based baselines. The second type of recognition systems, the “unsupervised” systems, do not use any supervised information (contrary to most ASR systems, including our articulatory systems). To some extent, they mimic an infant who has to discover the basic speech units of a language by herself. In addition, we provide resources consisting of pre-trained deep learning models for ASR, and a 2.5-h speech dataset of spoken commands, the VoCub dataset, which can be used to adapt an ASR system to the typical acoustic environments in which iCub operates.

  16. Dynamic Programming Algorithms in Speech Recognition

    Directory of Open Access Journals (Sweden)

    Titus Felix FURTUNA

    2008-01-01

    Full Text Available In a word-based speech recognition system, recognition requires comparing the input signal of a word with the various words of the dictionary. The problem can be solved efficiently by a dynamic comparison algorithm whose goal is to bring the temporal scales of the two words into optimal correspondence. An algorithm of this type is Dynamic Time Warping. This paper presents two alternatives for implementation of the algorithm designed for recognition of isolated words.
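
    A minimal sketch of the underlying idea, assuming each word has already been converted to a sequence of feature frames (e.g., MFCC vectors): the DTW distance aligns the two temporal scales, and the dictionary entry with the smallest warped distance is returned. The template dictionary and feature extraction are assumptions supplied by the caller; this is not either of the paper's two implementations.

        import numpy as np

        def dtw_distance(a, b):
            # Dynamic Time Warping distance between two feature sequences (frames x dims).
            n, m = len(a), len(b)
            D = np.full((n + 1, m + 1), np.inf)
            D[0, 0] = 0.0
            for i in range(1, n + 1):
                for j in range(1, m + 1):
                    cost = np.linalg.norm(np.asarray(a[i - 1]) - np.asarray(b[j - 1]))
                    D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
            return D[n, m]

        def recognise(utterance, templates):
            # templates: dict mapping word -> reference feature sequence (assumed given).
            return min(templates, key=lambda w: dtw_distance(utterance, templates[w]))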

  17. Contextual variability during speech-in-speech recognition.

    Science.gov (United States)

    Brouwer, Susanne; Bradlow, Ann R

    2014-07-01

    This study examined the influence of background language variation on speech recognition. English listeners performed an English sentence recognition task in either "pure" background conditions in which all trials had either English or Dutch background babble or in mixed background conditions in which the background language varied across trials (i.e., a mix of English and Dutch or one of these background languages mixed with quiet trials). This design allowed the authors to compare performance on identical trials across pure and mixed conditions. The data reveal that speech-in-speech recognition is sensitive to contextual variation in terms of the target-background language (mis)match depending on the relative ease/difficulty of the test trials in relation to the surrounding trials.

  18. Speech Clarity Index (Ψ): A Distance-Based Speech Quality Indicator and Recognition Rate Prediction for Dysarthric Speakers with Cerebral Palsy

    Science.gov (United States)

    Kayasith, Prakasith; Theeramunkong, Thanaruk

    It is a tedious and subjective task to measure the severity of a speaker's dysarthria by manually evaluating his/her speech using available standard assessment methods based on human perception. This paper presents an automated approach to assess the speech quality of a dysarthric speaker with cerebral palsy. With the consideration of two complementary factors, speech consistency and speech distinction, a speech quality indicator called speech clarity index (Ψ) is proposed as a measure of the speaker's ability to produce consistent speech signals for a certain word and distinguishable speech signals for different words. As an application, it can be used to assess speech quality and forecast the speech recognition rate of speech made by an individual dysarthric speaker before actual exhaustive implementation of an automatic speech recognition system for the speaker. The effectiveness of Ψ as a speech recognition rate predictor is evaluated by rank-order inconsistency, correlation coefficient, and root-mean-square of difference. The evaluations had been done by comparing its predicted recognition rates with ones predicted by the standard methods called the articulatory and intelligibility tests based on the two recognition systems (HMM and ANN). The results show that Ψ is a promising indicator for predicting recognition rate of dysarthric speech. All experiments had been done on a speech corpus composed of speech data from eight normal speakers and eight dysarthric speakers.

  19. Human and automatic speaker recognition over telecommunication channels

    CERN Document Server

    Fernández Gallardo, Laura

    2016-01-01

    This work addresses the evaluation of the human and the automatic speaker recognition performances under different channel distortions caused by bandwidth limitation, codecs, and electro-acoustic user interfaces, among other impairments. Its main contribution is the demonstration of the benefits of communication channels of extended bandwidth, together with an insight into how speaker-specific characteristics of speech are preserved through different transmissions. It provides sufficient motivation for considering speaker recognition as a criterion for the migration from narrowband to enhanced bandwidths, such as wideband and super-wideband.

  20. Speech Recognition Software for Language Learning: Toward an Evaluation of Validity and Student Perceptions

    Science.gov (United States)

    Cordier, Deborah

    2009-01-01

    A renewed focus on foreign language (FL) learning and speech for communication has resulted in computer-assisted language learning (CALL) software developed with Automatic Speech Recognition (ASR). ASR features for FL pronunciation (Lafford, 2004) are functional components of CALL designs used for FL teaching and learning. The ASR features…

  1. Voice Activity Detection. Fundamentals and Speech Recognition System Robustness

    OpenAIRE

    Ramirez, J.; Gorriz, J. M.; Segura, J. C.

    2007-01-01

    This chapter has given an overview of the main challenges in robust speech detection and a review of the state of the art and applications. VADs are frequently used in a number of applications including speech coding, speech enhancement and speech recognition. A precise VAD extracts a set of discriminative speech features from the noisy speech and formulates the decision in terms of a well-defined rule. The chapter has summarized three robust VAD methods that yield high speech/non-speech discri...

  2. A Review of Signal Subspace Speech Enhancement and Its Application to Noise Robust Speech Recognition

    Directory of Open Access Journals (Sweden)

    Hermus Kris

    2007-01-01

    Full Text Available The objective of this paper is threefold: (1) to provide an extensive review of signal subspace speech enhancement, (2) to derive an upper bound for the performance of these techniques, and (3) to present a comprehensive study of the potential of subspace filtering to increase the robustness of automatic speech recognisers against stationary additive noise distortions. Subspace filtering methods are based on the orthogonal decomposition of the noisy speech observation space into a signal subspace and a noise subspace. This decomposition is possible under the assumption of a low-rank model for speech, and on the availability of an estimate of the noise correlation matrix. We present an extensive overview of the available estimators, and derive a theoretical estimator to experimentally assess an upper bound to the performance that can be achieved by any subspace-based method. Automatic speech recognition experiments with noisy data demonstrate that subspace-based speech enhancement can significantly increase the robustness of these systems in additive coloured noise environments. Optimal performance is obtained only if no explicit rank reduction of the noisy Hankel matrix is performed. Although this strategy might increase the level of the residual noise, it reduces the risk of removing essential signal information for the recogniser's back end. Finally, it is also shown that subspace filtering compares favourably to the well-known spectral subtraction technique.
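
    To make the decomposition concrete, the sketch below builds a Hankel matrix from one noisy frame, takes its SVD, and re-synthesises the frame by averaging along anti-diagonals. It is only an illustration of the mechanics under assumed frame lengths; the plain rank truncation shown here is exactly the step the paper reports is best avoided in favour of gain-based weighting of the singular components.

        import numpy as np
        from scipy.linalg import hankel

        def subspace_filter(frame, keep=None):
            # Hankel matrix of the noisy frame (rows are delayed copies of the signal).
            n = len(frame)
            rows = n // 2
            H = hankel(frame[:rows], frame[rows - 1:])
            U, s, Vt = np.linalg.svd(H, full_matrices=False)
            if keep is not None:                  # optional explicit rank reduction (see caveat above)
                s[keep:] = 0.0
            H_hat = (U * s) @ Vt
            # Re-synthesise a 1-D signal by averaging the anti-diagonals of the filtered matrix.
            out = np.zeros(n)
            counts = np.zeros(n)
            for i in range(H_hat.shape[0]):
                for j in range(H_hat.shape[1]):
                    out[i + j] += H_hat[i, j]
                    counts[i + j] += 1
            return out / counts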

  3. Support vector machine for automatic pain recognition

    Science.gov (United States)

    Monwar, Md Maruf; Rezaei, Siamak

    2009-02-01

    Facial expressions are a key index of emotion and the interpretation of such expressions of emotion is critical to everyday social functioning. In this paper, we present an efficient video analysis technique for recognition of a specific expression, pain, from human faces. We employ an automatic face detector which detects face from the stored video frame using skin color modeling technique. For pain recognition, location and shape features of the detected faces are computed. These features are then used as inputs to a support vector machine (SVM) for classification. We compare the results with neural network based and eigenimage based automatic pain recognition systems. The experiment results indicate that using support vector machine as classifier can certainly improve the performance of automatic pain recognition system.

  4. Toddlers' recognition of noise-vocoded speech.

    Science.gov (United States)

    Newman, Rochelle; Chatterjee, Monita

    2013-01-01

    Despite their remarkable clinical success, cochlear-implant listeners today still receive spectrally degraded information. Much research has examined normally hearing adult listeners' ability to interpret spectrally degraded signals, primarily using noise-vocoded speech to simulate cochlear implant processing. Far less research has explored infants' and toddlers' ability to interpret spectrally degraded signals, despite the fact that children in this age range are frequently implanted. This study examines 27-month-old typically developing toddlers' recognition of noise-vocoded speech in a language-guided looking study. Children saw two images on each trial and heard a voice instructing them to look at one item ("Find the cat!"). Full-spectrum sentences or their noise-vocoded versions were presented with varying numbers of spectral channels. Toddlers showed equivalent proportions of looking to the target object with full-speech and 24- or 8-channel noise-vocoded speech; they failed to look appropriately with 2-channel noise-vocoded speech and showed variable performance with 4-channel noise-vocoded speech. Despite accurate looking performance for speech with at least eight channels, children were slower to respond appropriately as the number of channels decreased. These results indicate that 2-yr-olds have developed the ability to interpret vocoded speech, even without practice, but that doing so requires additional processing. These findings have important implications for pediatric cochlear implantation.
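
    For readers unfamiliar with the stimulus manipulation, a simple noise vocoder can be sketched as follows (an assumption-laden illustration, not the processing used in the study): the speech is split into a few frequency channels, each channel's envelope is extracted, and the envelopes modulate band-limited noise. The channel count, band edges and filter orders below are illustrative.

        import numpy as np
        from scipy.signal import butter, sosfiltfilt

        def noise_vocode(x, fs, n_channels=8, lo=100.0, hi=6000.0):
            # Log-spaced analysis bands between lo and hi (assumed values).
            edges = np.geomspace(lo, hi, n_channels + 1)
            noise = np.random.default_rng(0).standard_normal(len(x))
            env_lp = butter(4, 50.0, btype="low", fs=fs, output="sos")   # envelope smoother
            out = np.zeros(len(x))
            for f1, f2 in zip(edges[:-1], edges[1:]):
                band = butter(4, [f1, f2], btype="band", fs=fs, output="sos")
                env = sosfiltfilt(env_lp, np.abs(sosfiltfilt(band, x)))  # channel envelope
                out += env * sosfiltfilt(band, noise)                    # modulate band-limited noise
            return out / (np.max(np.abs(out)) + 1e-12)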

  5. Continuous speech recognition with sparse coding

    CSIR Research Space (South Africa)

    Smit, WJ

    2009-04-01

    Full Text Available ...we show how sparse codes can be used to do continuous speech recognition. We use the TIDIGITS dataset to illustrate the process. First a waveform is transformed into a spectrogram, and a sparse code for the spectrogram is found by means of a linear...
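
    A hedged sketch of the first two steps (spectrogram, then sparse coding of its frames) using scikit-learn dictionary learning; the waveform x and sampling rate fs are assumed to be loaded elsewhere, and the dictionary size and sparsity penalty are illustrative, not values from the paper.

        import numpy as np
        from scipy.signal import spectrogram
        from sklearn.decomposition import DictionaryLearning

        def sparse_code_spectrogram(x, fs, n_atoms=64):
            # Log-magnitude spectrogram, one spectral vector per time frame.
            f, t, S = spectrogram(x, fs=fs, nperseg=256, noverlap=128)
            frames = np.log1p(S).T
            # Learn a dictionary and a sparse code for every frame (illustrative hyperparameters).
            dl = DictionaryLearning(n_components=n_atoms, transform_algorithm="lasso_lars",
                                    transform_alpha=0.5, max_iter=20)
            codes = dl.fit_transform(frames)
            return codes, dl.components_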

  6. Speech Recognition Technology for Disabilities Education

    Science.gov (United States)

    Tang, K. Wendy; Kamoua, Ridha; Sutan, Victor; Farooq, Omer; Eng, Gilbert; Chu, Wei Chern; Hou, Guofeng

    2005-01-01

    Speech recognition is an alternative to traditional methods of interacting with a computer, such as textual input through a keyboard. An effective system can replace or reduce the reliance on a standard keyboard and mouse for input. This can especially assist dyslexic students who have problems with character or word use and manipulation in a textual…

  7. Speech Recognition for A Digital Video Library.

    Science.gov (United States)

    Witbrock, Michael J.; Hauptmann, Alexander G.

    1998-01-01

    Production of the meta-data supporting the Informedia Digital Video Library interface is automated using techniques derived from artificial intelligence research. Speech recognition and natural-language processing, information retrieval, and image analysis are applied to produce an interface that helps users locate information and navigate more…

  8. Speech Recognition, Disability, and College Composition

    Science.gov (United States)

    Nelson, Lorna M.; Reynolds, Thomas W., Jr.

    2015-01-01

    This study examined the composing processes of five postsecondary students who used or were learning to use speech recognition software (SR) for college-level writing. The study analyzed their composing processes through observation, interviews, and analysis of written products over a series of composing sessions. This investigation was prompted…

  9. Speech Recognition: A World of Opportunities

    Science.gov (United States)

    PACER Center, 2004

    2004-01-01

    Speech recognition technology helps people with disabilities interact with computers more easily. People with motor limitations, who cannot use a standard keyboard and mouse, can use their voices to navigate the computer and create documents. The technology is also useful to people with learning disabilities who experience difficulty with spelling…

  10. Effects of Cognitive Load on Speech Recognition

    Science.gov (United States)

    Mattys, Sven L.; Wiget, Lukas

    2011-01-01

    The effect of cognitive load (CL) on speech recognition has received little attention despite the prevalence of CL in everyday life, e.g., dual-tasking. To assess the effect of CL on the interaction between lexically-mediated and acoustically-mediated processes, we measured the magnitude of the "Ganong effect" (i.e., lexical bias on phoneme…

  11. SPECTRAL METHODS IN POLISH EMOTIONAL SPEECH RECOGNITION

    Directory of Open Access Journals (Sweden)

    Paweł Powroźnik

    2016-12-01

    Full Text Available In this article the issue of emotion recognition based on Polish emotional speech signal analysis was presented. The Polish database of emotional speech, prepared and shared by the Medical Electronics Division of the Lodz University of Technology, has been used for research. The speech signal has been processed by Artificial Neural Networks (ANN). The inputs to the ANN were information obtained from the signal spectrogram. Experiments were conducted for three different spectrogram divisions. The ANN consists of four layers, but the number of neurons in each layer depends on the spectrogram division. The experiments focused on six emotional states: a neutral state, sadness, joy, anger, fear and boredom. The average effectiveness of emotion recognition was about 80%.

  12. Error analysis to improve the speech recognition accuracy on ...

    Indian Academy of Sciences (India)

    Telugu language is one of the most widely spoken south Indian languages. In the proposed Telugu speech recognition system, errors obtained from decoder are analysed to improve the performance of the speech recognition system. Static pronunciation dictionary plays a key role in the speech recognition accuracy.

  13. Multilingual Data Selection for Low Resource Speech Recognition

    Science.gov (United States)

    2016-09-12

    Multilingual Data Selection For Low Resource Speech Recognition Samuel Thomas, Kartik Audhkhasi, Jia Cui, Brian Kingsbury and Bhuvana Ramabhadran IBM...deep neural network-based multilingual frontends provide significant improvements to speech recognition systems in low resource settings. To ef...present speech recognition results on 7 very limited language pack (VLLP) languages from the second option period of the IARPA Babel program using

  14. Comparing grapheme-based and phoneme-based speech recognition for Afrikaans

    CSIR Research Space (South Africa)

    Basson, WD

    2012-11-01

    Full Text Available This paper compares the recognition accuracy of a phoneme-based automatic speech recognition system with that of a grapheme-based system, using Afrikaans as case study. The first system is developed using a conventional pronunciation dictionary...

  15. Towards automatic forensic face recognition

    NARCIS (Netherlands)

    Ali, Tauseef; Spreeuwers, Lieuwe Jan; Veldhuis, Raymond N.J.

    2011-01-01

    In this paper we present a methodology and experimental results for evidence evaluation in the context of forensic face recognition. In forensic applications, the matching score (hereafter referred to as similarity score) from a biometric system must be represented as a Likelihood Ratio (LR). In our

  16. Automatic TLI recognition system, general description

    Energy Technology Data Exchange (ETDEWEB)

    Lassahn, G.D.

    1997-02-01

    This report is a general description of an automatic target recognition system developed at the Idaho National Engineering Laboratory for the Department of Energy. A user's manual is a separate volume, Automatic TLI Recognition System, User's Guide, and a programmer's manual is Automatic TLI Recognition System, Programmer's Guide. This system was designed as an automatic target recognition system for fast screening of large amounts of multi-sensor image data, based on low-cost parallel processors. This system naturally incorporates image data fusion, and it gives uncertainty estimates. It is relatively low cost, compact, and transportable. The software is easily enhanced to expand the system's capabilities, and the hardware is easily expandable to increase the system's speed. In addition to its primary function as a trainable target recognition system, this is also a versatile, general-purpose tool for image manipulation and analysis, which can be either keyboard-driven or script-driven. This report includes descriptions of three variants of the computer hardware, a description of the mathematical basis of the training process, and a description with examples of the system capabilities.

  17. A rough set approach to speech recognition

    Science.gov (United States)

    Zhao, Zhigang

    1992-09-01

    Speech recognition is a very difficult classification problem due to the variations in loudness, speed, and tone of voice. In the last 40 years, many methodologies have been developed to solve this problem, but most lack learning ability and depend fully on the knowledge of human experts. Systems of this kind are hard to develop and difficult to maintain and upgrade. A study was conducted to investigate the feasibility of using a machine learning approach in solving speech recognition problems. The system is based on rough set theory. It first generates a set of decision rules using a set of reference words called training samples, and then uses the decision rules to recognize new words. The main feature of this system is that, under the supervision of human experts, the machine learns and applies knowledge on its own to the designated tasks. The main advantages of this system over a traditional system are its simplicity and adaptiveness, which suggest that it may have significant potential in practical applications of computer speech recognition. Furthermore, the studies presented demonstrate the potential application of rough-set based learning systems in solving other important pattern classification problems, such as character recognition, system fault detection, and trainable robotic control.

  18. Speech coding, reconstruction and recognition using acoustics and electromagnetic waves

    Science.gov (United States)

    Holzrichter, J.F.; Ng, L.C.

    1998-03-17

    The use of EM radiation in conjunction with simultaneously recorded acoustic speech information enables a complete mathematical coding of acoustic speech. The methods include the forming of a feature vector for each pitch period of voiced speech and the forming of feature vectors for each time frame of unvoiced, as well as for combined voiced and unvoiced speech. The methods include how to deconvolve the speech excitation function from the acoustic speech output to describe the transfer function each time frame. The formation of feature vectors defining all acoustic speech units over well defined time frames can be used for purposes of speech coding, speech compression, speaker identification, language-of-speech identification, speech recognition, speech synthesis, speech translation, speech telephony, and speech teaching. 35 figs.

  19. Speech coding, reconstruction and recognition using acoustics and electromagnetic waves

    International Nuclear Information System (INIS)

    Holzrichter, J.F.; Ng, L.C.

    1998-01-01

    The use of EM radiation in conjunction with simultaneously recorded acoustic speech information enables a complete mathematical coding of acoustic speech. The methods include the forming of a feature vector for each pitch period of voiced speech and the forming of feature vectors for each time frame of unvoiced, as well as for combined voiced and unvoiced speech. The methods include how to deconvolve the speech excitation function from the acoustic speech output to describe the transfer function each time frame. The formation of feature vectors defining all acoustic speech units over well defined time frames can be used for purposes of speech coding, speech compression, speaker identification, language-of-speech identification, speech recognition, speech synthesis, speech translation, speech telephony, and speech teaching. 35 figs

  20. Speech Processing and Recognition (SPaRe)

    Science.gov (United States)

    2011-01-01

    parameters such as duration, audio/video bitrates, audio/video codecs, audio channels, and sample rates. These parameters are automatically populated in the...used to segment each conversation into utterance-level audio and transcript files. First, all speech data from the English interviewers and all...News Corpus [12]. The TDT4 corpus includes approximately 200 hours of Mandarin audio with closed-captions, or approximate transcripts. These

  1. Automatic transcription of continuous speech into syllable-like units ...

    Indian Academy of Sciences (India)

    Abstract. The focus of this paper is to automatically segment and label the continuous speech signal into syllable-like units for Indian languages. In this approach, the continuous speech signal is first automatically segmented into syllable-like units using a group delay based algorithm. Similar syllable segments are then grouped.

  2. Relationship between speech recognition in noise and sparseness.

    Science.gov (United States)

    Li, Guoping; Lutman, Mark E; Wang, Shouyan; Bleeck, Stefan

    2012-02-01

    Established methods for predicting speech recognition in noise require knowledge of clean speech signals, placing limitations on their application. The study evaluates an alternative approach based on characteristics of noisy speech, specifically its sparseness as represented by the statistic kurtosis. Experiments 1 and 2 involved acoustic analysis of vowel-consonant-vowel (VCV) syllables in babble noise, comparing kurtosis, glimpsing areas, and extended speech intelligibility index (ESII) of noisy speech signals with one another and with pre-existing speech recognition scores. Experiment 3 manipulated kurtosis of VCV syllables and investigated effects on speech recognition scores in normal-hearing listeners. Pre-existing speech recognition data for Experiments 1 and 2; seven normal-hearing participants for Experiment 3. Experiments 1 and 2 demonstrated that kurtosis calculated in the time-domain from noisy speech is highly correlated (r > 0.98) with established prediction models: glimpsing and ESII. All three measures predicted speech recognition scores well. The final experiment showed a clear monotonic relationship between speech recognition scores and kurtosis. Speech recognition performance in noise is closely related to the sparseness (kurtosis) of the noisy speech signal, at least for the types of speech and noise used here and for listeners with normal hearing.
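
    As a small illustration of the statistic involved, the fragment below computes the mean excess kurtosis of short frames of a noisy speech signal with SciPy; whether the original study computed kurtosis per frame or over the whole signal, and the exact frame length, are assumptions made here for demonstration only.

        import numpy as np
        from scipy.stats import kurtosis

        def mean_frame_kurtosis(x, fs, frame_ms=20.0):
            # Excess (Fisher) kurtosis of non-overlapping short frames, averaged over the utterance.
            n = int(fs * frame_ms / 1000)
            frames = [x[i:i + n] for i in range(0, len(x) - n + 1, n)]
            return float(np.mean([kurtosis(f, fisher=True) for f in frames]))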

  3. Automatic Door Access System Using Face Recognition

    Directory of Open Access Journals (Sweden)

    Hteik Htar Lwin

    2015-06-01

    Full Text Available Abstract Most doors are controlled by persons with the use of keys, security cards, passwords or patterns to open the door. The aim of this paper is to help users improve the door security of sensitive locations by using face detection and recognition. The face is a complex multidimensional structure and needs good computing techniques for detection and recognition. This paper is comprised mainly of three subsystems, namely face detection, face recognition and automatic door access control. Face detection is the process of detecting the region of a face in an image. The face is detected by using the Viola-Jones method, and face recognition is implemented by using Principal Component Analysis (PCA). Face recognition based on PCA is generally referred to as the use of Eigenfaces. If a face is recognized, it is known; otherwise it is unknown. The door will open automatically for the known person due to the command of the microcontroller. On the other hand, an alarm will ring for the unknown person. Since PCA reduces the dimensions of face images without losing important features, facial images for many persons can be stored in the database. Even though many training images are used, computational efficiency does not decrease significantly. Therefore, face recognition using PCA can be more useful for a door security system than other face recognition schemes.
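
    A minimal sketch of the eigenface idea with scikit-learn, assuming flattened grayscale training faces X_train with identity labels are available; the number of components and the distance threshold separating "known" from "unknown" are illustrative tuning values, not those of the paper.

        import numpy as np
        from sklearn.decomposition import PCA

        def build_eigenface_model(X_train, n_components=50):
            # X_train: (n_images, n_pixels) flattened grayscale faces of known persons.
            pca = PCA(n_components=n_components, whiten=True).fit(X_train)
            return pca, pca.transform(X_train)

        def recognise(face, pca, train_proj, labels, threshold=10.0):
            # Project the probe face into eigenface space and find the nearest training face.
            p = pca.transform(face.reshape(1, -1))
            d = np.linalg.norm(train_proj - p, axis=1)
            i = int(np.argmin(d))
            return labels[i] if d[i] < threshold else "unknown"   # unknown -> sound the alarm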

  4. Comparison of Forced-Alignment Speech Recognition and Humans for Generating Reference VAD

    DEFF Research Database (Denmark)

    Kraljevski, Ivan; Tan, Zheng-Hua; Paola Bissiri, Maria

    2015-01-01

    This present paper aims to answer the question whether forced-alignment speech recognition can be used as an alternative to humans in generating reference Voice Activity Detection (VAD) transcriptions. An investigation of the level of agreement between automatic/manual VAD transcriptions...... and the reference ones produced by a human expert was carried out. Thereafter, statistical analysis was employed on the automatically produced and the collected manual transcriptions. Experimental results confirmed that forced-alignment speech recognition can provide accurate and consistent VAD labels....

  5. Development an Automatic Speech to Facial Animation Conversion for Improve Deaf Lives

    Directory of Open Access Journals (Sweden)

    S. Hamidreza Kasaei

    2011-05-01

    Full Text Available In this paper, we propose the design and initial implementation of a robust system which can automatically translate voice into text and text into sign language animations. Sign language translation systems could significantly improve deaf lives, especially in communications, exchange of information, and the employment of machines for translating conversations from one language to another. Therefore, considering these points, it seems necessary to study speech recognition. Usually, voice recognition algorithms address three major challenges. The first is extracting features from speech, the second is recognition when only a limited sound gallery is available, and the final challenge is to move from speaker-dependent to speaker-independent voice recognition. Extracting features from speech is an important stage in our method. Different procedures are available for extracting features from speech; one of the most common ones used in speech recognition systems is Mel-Frequency Cepstral Coefficients (MFCCs). The algorithm starts with preprocessing and signal conditioning. Next, feature extraction from speech using cepstral coefficients is done. Then the result of this process is sent to the segmentation part. Finally, the recognition part recognizes the words and then converts the recognized words to facial animation. The project is still in progress and some new interesting methods are described in the current report.
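
    As a brief illustration of the MFCC step mentioned above, the snippet below extracts 13 cepstral coefficients per frame with librosa; the file name is a placeholder and the sampling rate and coefficient count are common choices rather than the report's actual settings.

        import librosa

        # "speech.wav" is a placeholder path; 16 kHz and 13 coefficients are common assumptions.
        y, sr = librosa.load("speech.wav", sr=16000)
        mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)   # shape: (13, n_frames)
        delta = librosa.feature.delta(mfcc)                  # optional first-order dynamics
        print(mfcc.shape, delta.shape)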

  6. Compact Acoustic Models for Embedded Speech Recognition

    Directory of Open Access Journals (Sweden)

    Christophe Lévy

    2009-01-01

    Full Text Available Speech recognition applications are known to require a significant amount of resources. However, embedded speech recognition only allows a few KB of memory, a few MIPS, and a small amount of training data. In order to fit the resource constraints of embedded applications, an approach based on a semicontinuous HMM system using state-independent acoustic modelling is proposed. A transformation is computed and applied to the global model in order to obtain each HMM state-dependent probability density function, so that only the transformation parameters need to be stored. This approach is evaluated on two tasks: digit and voice-command recognition. A fast adaptation technique of acoustic models is also proposed. In order to significantly reduce computational costs, the adaptation is performed only on the global model (using related speaker recognition adaptation techniques) with no need for state-dependent data. The whole approach results in a relative gain of more than 20% compared to a basic HMM-based system fitting the constraints.

  7. Improved Open-Microphone Speech Recognition

    Science.gov (United States)

    Abrash, Victor

    2002-12-01

    Many current and future NASA missions make extreme demands on mission personnel both in terms of work load and in performing under difficult environmental conditions. In situations where hands are impeded or needed for other tasks, eyes are busy attending to the environment, or tasks are sufficiently complex that ease of use of the interface becomes critical, spoken natural language dialog systems offer unique input and output modalities that can improve efficiency and safety. They also offer new capabilities that would not otherwise be available. For example, many NASA applications require astronauts to use computers in micro-gravity or while wearing space suits. Under these circumstances, command and control systems that allow users to issue commands or enter data in hands-and eyes-busy situations become critical. Speech recognition technology designed for current commercial applications limits the performance of the open-ended state-of-the-art dialog systems being developed at NASA. For example, today's recognition systems typically listen to user input only during short segments of the dialog, and user input outside of these short time windows is lost. Mistakes detecting the start and end times of user utterances can lead to mistakes in the recognition output, and the dialog system as a whole has no way to recover from this, or any other, recognition error. Systems also often require the user to signal when that user is going to speak, which is impractical in a hands-free environment, or only allow a system-initiated dialog requiring the user to speak immediately following a system prompt. In this project, SRI has developed software to enable speech recognition in a hands-free, open-microphone environment, eliminating the need for a push-to-talk button or other signaling mechanism. The software continuously captures a user's speech and makes it available to one or more recognizers. By constantly monitoring and storing the audio stream, it provides the spoken

  8. Image-based automatic recognition of larvae

    Science.gov (United States)

    Sang, Ru; Yu, Guiying; Fan, Weijun; Guo, Tiantai

    2010-08-01

    To date, imagoes have been the main objects of research in quarantine pest recognition. However, pests in their larval stage are latent, and larvae spread abroad more easily with the circulation of agricultural and forest products. In this paper, larvae are taken as new research objects and recognized by means of machine vision, image processing and pattern recognition. More visual information is preserved and the recognition rate is improved when color image segmentation is applied to images of larvae. Owing to its affine, perspective and brightness invariance, the scale invariant feature transform (SIFT) is adopted for feature extraction. A neural network algorithm is utilized for pattern recognition, and the automatic identification of larvae images is successfully achieved with satisfactory results.

  9. Speech recognition in natural background noise.

    Directory of Open Access Journals (Sweden)

    Julien Meyer

    Full Text Available In the real world, human speech recognition nearly always involves listening in background noise. The impact of such noise on speech signals and on intelligibility performance increases with the separation of the listener from the speaker. The present behavioral experiment provides an overview of the effects of such acoustic disturbances on speech perception in conditions approaching ecologically valid contexts. We analysed the intelligibility loss in spoken word lists with increasing listener-to-speaker distance in a typical low-level natural background noise. The noise was combined with the simple spherical amplitude attenuation due to distance, basically changing the signal-to-noise ratio (SNR). Therefore, our study draws attention to some of the most basic environmental constraints that have pervaded spoken communication throughout human history. We evaluated the ability of native French participants to recognize French monosyllabic words (spoken at 65.3 dB(A), reference at 1 meter) at distances between 11 and 33 meters, which corresponded to the SNRs most revealing of the progressive effect of the selected natural noise (-8.8 dB to -18.4 dB). Our results showed that in such conditions, the identity of vowels is mostly preserved, with the striking peculiarity of the absence of confusion in vowels. The results also confirmed the functional role of consonants during lexical identification. The extensive analysis of recognition scores, confusion patterns and associated acoustic cues revealed that sonorant, sibilant and burst properties were the most important parameters influencing phoneme recognition. Altogether these analyses allowed us to extract a resistance scale from consonant recognition scores. We also identified specific perceptual consonant confusion groups depending on the place in the words (onset vs. coda). Finally our data suggested that listeners may access some acoustic cues of the CV transition, opening interesting perspectives for
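
    The reported numbers can be reproduced with the simple spherical-spreading model mentioned in the abstract. The background level of roughly 53.3 dB(A) used below is an assumption inferred from the stated SNRs, not a value given explicitly in the record.

        import math

        ref_level = 65.3     # speech level in dB(A) at the 1 m reference distance (from the abstract)
        noise_level = 53.3   # approximate background level implied by the reported SNRs (assumption)

        for d in (11, 33):
            speech_at_d = ref_level - 20 * math.log10(d)          # spherical spreading loss
            print(f"{d:>2} m: speech {speech_at_d:5.1f} dB(A), SNR {speech_at_d - noise_level:6.1f} dB")
        # Prints roughly -8.8 dB at 11 m and -18.4 dB at 33 m, matching the reported SNR range.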

  10. Specific acoustic models for spontaneous and dictated style in indonesian speech recognition

    Science.gov (United States)

    Vista, C. B.; Satriawan, C. H.; Lestari, D. P.; Widyantoro, D. H.

    2018-03-01

    The performance of an automatic speech recognition system is affected by differences in speech style between the data the model is originally trained upon and incoming speech to be recognized. In this paper, the usage of GMM-HMM acoustic models for specific speech styles is investigated. We develop two systems for the experiments; the first employs a speech style classifier to predict the speech style of incoming speech, either spontaneous or dictated, then decodes this speech using an acoustic model specifically trained for that speech style. The second system uses both acoustic models to recognise incoming speech and decides upon a final result by calculating a confidence score of decoding. Results show that training specific acoustic models for spontaneous and dictated speech styles confers a slight recognition advantage as compared to a baseline model trained on a mixture of spontaneous and dictated training data. In addition, the speech style classifier approach of the first system produced slightly more accurate results than the confidence scoring employed in the second system.

  11. Error analysis to improve the speech recognition accuracy on ...

    Indian Academy of Sciences (India)

    measures, error-rate and Word Error Rate (WER) by application of the proposed method. Keywords. Speech recognition; pronunciation dictionary modification method; error analysis; F-measure. 1. Introduction. Speech is one of the easiest modes of ...

  12. Optimal pattern synthesis for speech recognition based on principal component analysis

    Science.gov (United States)

    Korsun, O. N.; Poliyev, A. V.

    2018-02-01

    The algorithm for building an optimal pattern for the purpose of automatic speech recognition, which increases the probability of correct recognition, is developed and presented in this work. The optimal pattern forming is based on the decomposition of an initial pattern to principal components, which enables to reduce the dimension of multi-parameter optimization problem. At the next step the training samples are introduced and the optimal estimates for principal components decomposition coefficients are obtained by a numeric parameter optimization algorithm. Finally, we consider the experiment results that show the improvement in speech recognition introduced by the proposed optimization algorithm.
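
    As a rough illustration of this kind of pattern optimization, the sketch below decomposes the training data into principal components and searches for decomposition coefficients that minimize a simple distance-based criterion; the objective function, names and optimizer choice are assumptions rather than the authors' exact formulation.

```python
# Hypothetical sketch of PCA-based pattern optimization: optimize a reference
# pattern in a low-dimensional principal-component subspace.
import numpy as np
from scipy.optimize import minimize

def build_optimal_pattern(initial_pattern, training_samples, n_components=5):
    # Principal directions of the training samples (rows = samples).
    X = training_samples - training_samples.mean(axis=0)
    _, _, vt = np.linalg.svd(X, full_matrices=False)
    basis = vt[:n_components]                      # (k, dim) principal directions

    def objective(coeffs):
        # Candidate pattern = initial pattern shifted along the principal directions.
        candidate = initial_pattern + coeffs @ basis
        # Average squared distance to the training samples, a stand-in criterion
        # for "probability of correct recognition" used here only for illustration.
        return np.mean(np.sum((training_samples - candidate) ** 2, axis=1))

    result = minimize(objective, x0=np.zeros(n_components), method="Nelder-Mead")
    return initial_pattern + result.x @ basis
```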

  13. Automatic stereoscopic system for person recognition

    Science.gov (United States)

    Murynin, Alexander B.; Matveev, Ivan A.; Kuznetsov, Victor D.

    1999-06-01

    A biometric access control system based on identification of the human face is presented. The system performs remote measurements of the necessary face features. Two different scenarios of system behavior are implemented. The first assumes verification of personal data entered by the visitor from a console using a keyboard or card reader; the system functions as an automatic checkpoint that strictly controls the access of different visitors. The other scenario makes it possible to identify visitors without any personal identifier or pass: only the person's biometrics are used to identify the visitor, and the recognition system automatically finds the necessary identification information previously stored in the database. Two laboratory models of the recognition system were developed. The models are designed to use different information types and sources. In addition to stereoscopic images input to the computer from cameras, the models can use voice data and certain physical characteristics, such as the person's height, measured by the imaging system.

  14. Factors influencing recognition of interrupted speech.

    Science.gov (United States)

    Wang, Xin; Humes, Larry E

    2010-10-01

    This study examined the effect of interruption parameters (e.g., interruption rate, on-duration and proportion), linguistic factors, and other general factors on the recognition of interrupted consonant-vowel-consonant (CVC) words in quiet. Sixty-two young adults with normal hearing were randomly assigned to one of three test groups, "male65," "female65" and "male85," that differed in talker (male/female) and presentation level (65/85 dB SPL), with about 20 subjects per group. A total of 13 stimulus conditions, representing different interruption patterns within the words (i.e., various combinations of three interruption parameters), in combination with two values (easy and hard) of lexical difficulty were examined (i.e., 13×2=26 test conditions) within each group. Results showed that, overall, the proportion of speech and lexical difficulty had major effects on the integration and recognition of interrupted CVC words, while the other variables had small effects. Interactions between interruption parameters and linguistic factors were observed: to reach the same degree of word-recognition performance, less acoustic information was required for lexically easy words than for hard words. Implications of the findings of the current study for models of the temporal integration of speech are discussed.

  15. Unification of automatic target tracking and automatic target recognition

    Science.gov (United States)

    Schachter, Bruce J.

    2014-06-01

    The subject being addressed is how an automatic target tracker (ATT) and an automatic target recognizer (ATR) can be fused together so tightly and so well that their distinctiveness becomes lost in the merger. This has historically not been the case outside of biology and a few academic papers. The biological model of ATT∪ATR arises from dynamic patterns of activity distributed across many neural circuits and structures (including retina). The information that the brain receives from the eyes is "old news" at the time that it receives it. The eyes and brain forecast a tracked object's future position, rather than relying on received retinal position. Anticipation of the next moment - building up a consistent perception - is accomplished under difficult conditions: motion (eyes, head, body, scene background, target) and processing limitations (neural noise, delays, eye jitter, distractions). Not only does the human vision system surmount these problems, but it has innate mechanisms to exploit motion in support of target detection and classification. Biological vision doesn't normally operate on snapshots. Feature extraction, detection and recognition are spatiotemporal. When vision is viewed as a spatiotemporal process, target detection, recognition, tracking, event detection and activity recognition do not seem as distinct as they are in current ATT and ATR designs. They appear as similar mechanisms taking place at varying time scales. A framework is provided for unifying ATT and ATR.

  16. Dealing with Phrase Level Co-Articulation (PLC) in speech recognition: a first approach

    NARCIS (Netherlands)

    Ordelman, Roeland J.F.; van Hessen, Adrianus J.; van Leeuwen, David A.; Robinson, Tony; Renals, Steve

    1999-01-01

    Whereas nowadays within-word co-articulation effects are usually sufficiently dealt with in automatic speech recognition, this is not always the case with phrase level co-articulation effects (PLC). This paper describes a first approach in dealing with phrase level co-articulation by applying these

  17. Transcribe Your Class: Using Speech Recognition to Improve Access for At-Risk Students

    Science.gov (United States)

    Bain, Keith; Lund-Lucas, Eunice; Stevens, Janice

    2012-01-01

    Through a project supported by Canada's Social Development Partnerships Program, a team of leading National Disability Organizations, universities, and industry partners are piloting a prototype Hosted Transcription Service that uses speech recognition to automatically create multimedia transcripts that can be used by students for study purposes.…

  18. Speech-based recognition of self-reported and observed emotion in a dimensional space

    NARCIS (Netherlands)

    Truong, Khiet Phuong; van Leeuwen, David A.; de Jong, Franciska M.G.

    2012-01-01

    The differences between self-reported and observed emotion have only marginally been investigated in the context of speech-based automatic emotion recognition. We address this issue by comparing self-reported emotion ratings to observed emotion ratings and look at how differences between these two

  19. Recognition of time-compressed speech does not predict recognition of natural fast-rate speech by older listeners.

    Science.gov (United States)

    Gordon-Salant, Sandra; Zion, Danielle J; Espy-Wilson, Carol

    2014-10-01

    This study investigated whether recognition of time-compressed speech predicts recognition of natural fast-rate speech, and whether this relationship is influenced by listener age. High and low context sentences were presented to younger and older normal-hearing adults at a normal speech rate, naturally fast speech rate, and fast rate implemented by time compressing the normal-rate sentences. Recognition of time-compressed sentences over-estimated recognition of natural fast sentences for both groups, especially for older listeners. The findings suggest that older listeners are at a much greater disadvantage when listening to natural fast speech than would be predicted by recognition performance for time-compressed speech.

  20. Speech and audio processing for coding, enhancement and recognition

    CERN Document Server

    Togneri, Roberto; Narasimha, Madihally

    2015-01-01

    This book describes the basic principles underlying the generation, coding, transmission and enhancement of speech and audio signals, including advanced statistical and machine learning techniques for speech and speaker recognition, with an overview of the key innovations in these areas. Key research undertaken in speech coding, speech enhancement, speech recognition, emotion recognition and speaker diarization is also presented, along with recent advances and new paradigms in these areas. The book offers readers a single-source reference on the significant applications of speech and audio processing to speech coding, speech enhancement and speech/speaker recognition; enables readers involved in algorithm development and implementation issues for speech coding to understand the historical development and future challenges in speech coding research; and discusses speech coding methods yielding bit-streams that are multi-rate and scalable for Voice-over-IP (VoIP) networks ...

  1. Mispronunciation Detection for Language Learning and Speech Recognition Adaptation

    Science.gov (United States)

    Ge, Zhenhao

    2013-01-01

    The areas of "mispronunciation detection" (or "accent detection" more specifically) within the speech recognition community are receiving increased attention now. Two application areas, namely language learning and speech recognition adaptation, are largely driving this research interest and are the focal points of this work.…

  2. Speech Recognition and Cognitive Skills in Bimodal Cochlear Implant Users

    Science.gov (United States)

    Hua, Håkan; Johansson, Björn; Magnusson, Lennart; Lyxell, Björn; Ellis, Rachel J.

    2017-01-01

    Purpose: To examine the relation between speech recognition and cognitive skills in bimodal cochlear implant (CI) and hearing aid users. Method: Seventeen bimodal CI users (28-74 years) were recruited to the study. Speech recognition tests were carried out in quiet and in noise. The cognitive tests employed included the Reading Span Test and the…

  3. Deep Complementary Bottleneck Features for Visual Speech Recognition

    NARCIS (Netherlands)

    Petridis, Stavros; Pantic, Maja

    Deep bottleneck features (DBNFs) have been used successfully in the past for acoustic speech recognition from audio. However, research on extracting DBNFs for visual speech recognition is very limited. In this work, we present an approach to extract deep bottleneck visual features based on deep

  4. Features Speech Signature Image Recognition on Mobile Devices

    Directory of Open Access Journals (Sweden)

    Alexander Mikhailovich Alyushin

    2015-12-01

    Full Text Available Algorithms for the recognition and processing of dynamic spectrogram images and sound speech signatures (SS) were developed. Software for mobile phones that can recognize speech signatures was prepared. The SS recognition speed was investigated for different boundary types. Recommendations are given on the choice of boundary type for an optimal ratio of recognition speed to required space.

  5. Emotion Recognition of Speech Signals Based on Filter Methods

    Directory of Open Access Journals (Sweden)

    Narjes Yazdanian

    2016-10-01

    Full Text Available Speech is the basic means of communication among human beings. With the increase of interaction between humans and machines, the need for automatic dialogue without a human factor has been considered. The aim of this study was to determine a set of affective features of the speech signal that are associated with emotions. A system was designed comprising three main sections: feature extraction, feature selection and classification. After extraction of useful features such as mel frequency cepstral coefficients (MFCC), linear prediction cepstral coefficients (LPC), perceptual linear prediction coefficients (PLP), formant frequency, zero crossing rate, cepstral coefficients and pitch frequency, together with mean, jitter, shimmer, energy, minimum, maximum, amplitude and standard deviation, filter methods such as the Pearson correlation coefficient, t-test, relief and information gain were used to rank and select the features most effective for emotion recognition. The selected subset was then given to the classification stage as input, where a multi-class support vector machine was used to classify seven types of emotion. According to the results, the relief method together with the multi-class support vector machine achieved the highest classification accuracy, with an emotion recognition rate of 93.94%.
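
    A minimal sketch of such a filter-then-classify pipeline, using scikit-learn stand-ins (a univariate filter for the feature-ranking step and a multi-class SVM for classification); the score function, parameter values and function names are illustrative assumptions, not the study's implementation.

```python
# Hypothetical filter-then-classify pipeline for speech emotion recognition.
# Feature extraction (MFCC, pitch, jitter, ...) is assumed to have been done already.
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def make_emotion_classifier(k_best=20):
    return Pipeline([
        ("scale", StandardScaler()),
        ("filter", SelectKBest(mutual_info_classif, k=k_best)),     # rank and keep features
        ("svm", SVC(kernel="rbf", decision_function_shape="ovr")),  # multi-class SVM
    ])

# Usage: features is an (n_utterances, n_features) array, labels the emotion ids.
# clf = make_emotion_classifier().fit(features, labels)
# predictions = clf.predict(test_features)
```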

  6. Method and apparatus for obtaining complete speech signals for speech recognition applications

    Science.gov (United States)

    Abrash, Victor (Inventor); Cesari, Federico (Inventor); Franco, Horacio (Inventor); George, Christopher (Inventor); Zheng, Jing (Inventor)

    2009-01-01

    The present invention relates to a method and apparatus for obtaining complete speech signals for speech recognition applications. In one embodiment, the method continuously records an audio stream comprising a sequence of frames to a circular buffer. When a user command to commence or terminate speech recognition is received, the method obtains a number of frames of the audio stream occurring before or after the user command in order to identify an augmented audio signal for speech recognition processing. In further embodiments, the method analyzes the augmented audio signal in order to locate starting and ending speech endpoints that bound at least a portion of speech to be processed for recognition. At least one of the speech endpoints is located using a Hidden Markov Model.
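
    The circular-buffer idea can be illustrated with a brief sketch (not the patented implementation): a fixed-length buffer keeps the most recent audio frames so that speech occurring just before a user command can still be included in the augmented signal. Frame counts are illustrative assumptions.

```python
# Illustrative circular buffer of audio frames for recovering speech that
# precedes a "start recognition" command.
from collections import deque

class FrameBuffer:
    def __init__(self, max_frames=300):            # e.g. ~3 s at 10 ms frames
        self.frames = deque(maxlen=max_frames)     # oldest frames drop automatically

    def push(self, frame):
        self.frames.append(frame)                  # called for every incoming frame

    def augmented_signal(self, lookback=50):
        # On a user command, prepend the most recent `lookback` frames so that
        # speech onset occurring before the command is not lost.
        return list(self.frames)[-lookback:]
```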

  7. Recognition memory in noise for speech of varying intelligibility.

    Science.gov (United States)

    Gilbert, Rachael C; Chandrasekaran, Bharath; Smiljanic, Rajka

    2014-01-01

    This study investigated the extent to which noise impacts normal-hearing young adults' speech processing of sentences that vary in intelligibility. Intelligibility and recognition memory in noise were examined for conversational and clear speech sentences recorded in quiet (quiet speech, QS) and in response to the environmental noise (noise-adapted speech, NAS). Results showed that (1) increased intelligibility through conversational-to-clear speech modifications led to improved recognition memory and (2) NAS presented a more naturalistic speech adaptation to noise compared to QS, leading to more accurate word recognition and enhanced sentence recognition memory. These results demonstrate that acoustic-phonetic modifications implemented in listener-oriented speech enhance speech-in-noise processing beyond word recognition. Effortful speech processing in challenging listening environments can thus be improved by speaking style adaptations on the part of the talker. In addition to enhanced intelligibility, a substantial improvement in recognition memory can be achieved through speaker adaptations to the environment and to the listener when in adverse conditions.

  8. Neural speech recognition: continuous phoneme decoding using spatiotemporal representations of human cortical activity

    Science.gov (United States)

    Moses, David A.; Mesgarani, Nima; Leonard, Matthew K.; Chang, Edward F.

    2016-10-01

    Objective. The superior temporal gyrus (STG) and neighboring brain regions play a key role in human language processing. Previous studies have attempted to reconstruct speech information from brain activity in the STG, but few of them incorporate the probabilistic framework and engineering methodology used in modern speech recognition systems. In this work, we describe the initial efforts toward the design of a neural speech recognition (NSR) system that performs continuous phoneme recognition on English stimuli with arbitrary vocabulary sizes using the high gamma band power of local field potentials in the STG and neighboring cortical areas obtained via electrocorticography. Approach. The system implements a Viterbi decoder that incorporates phoneme likelihood estimates from a linear discriminant analysis model and transition probabilities from an n-gram phonemic language model. Grid searches were used in an attempt to determine optimal parameterizations of the feature vectors and Viterbi decoder. Main results. The performance of the system was significantly improved by using spatiotemporal representations of the neural activity (as opposed to purely spatial representations) and by including language modeling and Viterbi decoding in the NSR system. Significance. These results emphasize the importance of modeling the temporal dynamics of neural responses when analyzing their variations with respect to varying stimuli and demonstrate that speech recognition techniques can be successfully leveraged when decoding speech from neural signals. Guided by the results detailed in this work, further development of the NSR system could have applications in the fields of automatic speech recognition and neural prosthetics.
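
    The decoding step described here follows the standard Viterbi recipe: per-frame phoneme log-likelihoods (from any frame classifier, such as the LDA model mentioned above) are combined with phoneme transition log-probabilities. The generic sketch below is illustrative and not the authors' implementation.

```python
# Generic Viterbi decoding over phoneme states.
import numpy as np

def viterbi_decode(loglik, log_trans, log_prior):
    """loglik: (T, P) frame log-likelihoods; log_trans: (P, P); log_prior: (P,)."""
    T, P = loglik.shape
    delta = log_prior + loglik[0]
    backptr = np.zeros((T, P), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + log_trans          # (P, P): previous -> current state
        backptr[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + loglik[t]
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(backptr[t, path[-1]]))
    return path[::-1]                                # best phoneme state sequence
```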

  9. Hybrid methodological approach to context-dependent speech recognition

    Directory of Open Access Journals (Sweden)

    Dragiša Mišković

    2017-01-01

    Full Text Available Although the importance of contextual information in speech recognition has been acknowledged for a long time now, it has remained clearly underutilized even in state-of-the-art speech recognition systems. This article introduces a novel, methodologically hybrid approach to the research question of context-dependent speech recognition in human–machine interaction. To the extent that it is hybrid, the approach integrates aspects of both statistical and representational paradigms. We extend the standard statistical pattern-matching approach with a cognitively inspired and analytically tractable model with explanatory power. This methodological extension allows for accounting for contextual information which is otherwise unavailable in speech recognition systems, and using it to improve post-processing of recognition hypotheses. The article introduces an algorithm for evaluation of recognition hypotheses, illustrates it for concrete interaction domains, and discusses its implementation within two prototype conversational agents.

  10. Speech Recognition for Environmental Control: Effect of Microphone Type, Dysarthria, and Severity on Recognition Results.

    Science.gov (United States)

    Fager, Susan Koch; Burnfield, Judith M

    2015-01-01

    This study examines the use of commercially available automatic speech recognition (ASR) across microphone options as access to environmental control for individuals with and without dysarthria. A study of two groups of speakers (typical speech and dysarthria) was conducted to understand their performance using ASR and various microphones for environmental control. Specifically, dependent variables examined included attempts per command, recognition accuracy, frequency of error type, and perceived workload. A further sub-analysis of the group of participants with dysarthria examined the impact of severity. Results indicated that a significantly larger number of attempts was required (P = 0.007), and significantly lower recognition accuracies were achieved, by the dysarthric participants (P = 0.010). A sub-analysis examining severity demonstrated no significant differences between the typical speakers and participants with mild dysarthria. However, significant differences were evident (P = 0.007, P = 0.008) between mild and moderate-severe dysarthric participants. No significant differences existed across microphones. A higher frequency of threshold errors occurred for typical participants, and of no-response errors for moderate-severe dysarthric participants. There were no significant differences on the NASA Task Load Index.

  11. Speech recognition using articulatory and excitation source features

    CERN Document Server

    Rao, K Sreenivasa

    2017-01-01

    This book discusses the contribution of articulatory and excitation source information in discriminating sound units. The authors focus on excitation source component of speech -- and the dynamics of various articulators during speech production -- for enhancement of speech recognition (SR) performance. Speech recognition is analyzed for read, extempore, and conversation modes of speech. Five groups of articulatory features (AFs) are explored for speech recognition, in addition to conventional spectral features. Each chapter provides the motivation for exploring the specific feature for SR task, discusses the methods to extract those features, and finally suggests appropriate models to capture the sound unit specific knowledge from the proposed features. The authors close by discussing various combinations of spectral, articulatory and source features, and the desired models to enhance the performance of SR systems.

  12. Robot Command Interface Using an Audio-Visual Speech Recognition System

    Science.gov (United States)

    Ceballos, Alexánder; Gómez, Juan; Prieto, Flavio; Redarce, Tanneguy

    In recent years audio-visual speech recognition has emerged as an active field of research thanks to advances in pattern recognition, signal processing and machine vision. Its ultimate goal is to allow human-computer communication using voice, taking into account the visual information contained in the audio-visual speech signal. This document presents a system for automatic recognition of commands using audio-visual information. The system is expected to control the da Vinci laparoscopic robot. The audio signal is processed using the Mel Frequency Cepstral Coefficients parametrization method. In addition, features based on the points that define the mouth's outer contour according to the MPEG-4 standard are used to extract the visual speech information.

  13. WORD BASED TAMIL SPEECH RECOGNITION USING TEMPORAL FEATURE BASED SEGMENTATION

    Directory of Open Access Journals (Sweden)

    A. Akila

    2015-05-01

    Full Text Available A speech recognition system requires segmentation of the speech waveform into fundamental acoustic units. Segmentation is the process of decomposing the speech signal into smaller units. Speech segmentation can be performed using wavelets, fuzzy methods, Artificial Neural Networks and Hidden Markov Models. Speech segmentation breaks a continuous stream of sound into basic units such as words, phonemes or syllables that can be recognized. Segmentation can also be used to distinguish different types of audio signals in large amounts of audio data, often referred to as audio classification. Speech segmentation algorithms can be divided into two categories, blind segmentation and aided segmentation, depending on whether the algorithm uses prior knowledge of the data to process the speech. The major issues with connected speech recognition algorithms are that the vocabulary grows with the variation in word combinations in connected speech, and that the complexity of finding the best match for a given test pattern increases accordingly. To overcome these issues, the connected speech has to be segmented into words using the attributes of speech. A methodology using the temporal feature Short Term Energy was proposed and compared with an existing algorithm called the Dynamic Thresholding segmentation algorithm, which uses a spectrogram image of the connected speech for segmentation.
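
    A simple sketch of segmentation driven by Short Term Energy, the temporal feature proposed above: frames whose energy stays below a threshold are treated as pauses, and runs of high-energy frames become word candidates. Frame sizes and the relative threshold are illustrative assumptions.

```python
# Illustrative word segmentation by Short Term Energy.
import numpy as np

def segment_by_energy(signal, sr, frame_ms=25, hop_ms=10, rel_threshold=0.05):
    frame, hop = int(sr * frame_ms / 1000), int(sr * hop_ms / 1000)
    energies = np.array([
        np.sum(signal[i:i + frame] ** 2)
        for i in range(0, len(signal) - frame, hop)
    ])
    speech = energies > rel_threshold * energies.max()   # boolean speech/pause mask
    # Convert runs of speech frames into (start_sample, end_sample) word candidates.
    segments, start = [], None
    for i, is_speech in enumerate(speech):
        if is_speech and start is None:
            start = i * hop
        elif not is_speech and start is not None:
            segments.append((start, i * hop + frame))
            start = None
    if start is not None:
        segments.append((start, len(signal)))
    return segments
```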

  14. An articulatorily constrained, maximum entropy approach to speech recognition and speech coding

    Energy Technology Data Exchange (ETDEWEB)

    Hogden, J.

    1996-12-31

    Hidden Markov models (HMMs) are among the most popular tools for performing computer speech recognition. One of the primary reasons that HMMs typically outperform other speech recognition techniques is that the parameters used for recognition are determined by the data, not by preconceived notions of what the parameters should be. This makes HMMs better able to deal with intra- and inter-speaker variability despite the limited knowledge of how speech signals vary and despite the often limited ability to correctly formulate rules describing variability and invariance in speech. In fact, it is often the case that when HMM parameter values are constrained using the limited knowledge of speech, recognition performance decreases. However, the structure of an HMM has little in common with the mechanisms underlying speech production. Here, the author argues that by using probabilistic models that more accurately embody the process of speech production, he can create models that have all the advantages of HMMs, but that should more accurately capture the statistical properties of real speech samples--presumably leading to more accurate speech recognition. The model he will discuss uses the fact that speech articulators move smoothly and continuously. Before discussing how to use articulatory constraints, he will give a brief description of HMMs. This will allow him to highlight the similarities and differences between HMMs and the proposed technique.

  15. Image simulation for automatic license plate recognition

    Science.gov (United States)

    Bala, Raja; Zhao, Yonghui; Burry, Aaron; Kozitsky, Vladimir; Fillion, Claude; Saunders, Craig; Rodríguez-Serrano, José

    2012-01-01

    Automatic license plate recognition (ALPR) is an important capability for traffic surveillance applications, including toll monitoring and detection of different types of traffic violations. ALPR is a multi-stage process comprising plate localization, character segmentation, optical character recognition (OCR), and identification of originating jurisdiction (i.e. state or province). Training of an ALPR system for a new jurisdiction typically involves gathering vast amounts of license plate images and associated ground truth data, followed by iterative tuning and optimization of the ALPR algorithms. The substantial time and effort required to train and optimize the ALPR system can result in excessive operational cost and overhead. In this paper we propose a framework to create an artificial set of license plate images for accelerated training and optimization of ALPR algorithms. The framework comprises two steps: the synthesis of license plate images according to the design and layout for a jurisdiction of interest; and the modeling of imaging transformations and distortions typically encountered in the image capture process. Distortion parameters are estimated by measurements of real plate images. The simulation methodology is successfully demonstrated for training of OCR.

  16. Posteriori Probabilities and Likelihoods Combination for Speech and Speaker Recognition

    OpenAIRE

    BenZeghiba, Mohamed Faouzi; Bourlard, Hervé

    2004-01-01

    This paper investigates a new approach to perform simultaneous speech and speaker recognition. The likelihood estimated by a speaker identification system is combined with the posterior probability estimated by the speech recognizer. So, the joint posterior probability of the pronounced word and the speaker identity is maximized. A comparison study with other standard techniques is carried out in three different applications, (1) closed set speech and speaker identification, (2) open set spee...

  17. A methodology of error detection: Improving speech recognition in radiology

    OpenAIRE

    Voll, Kimberly Dawn

    2006-01-01

    Automated speech recognition (ASR) in radiology report dictation demands highly accurate and robust recognition software. Despite vendor claims, current implementations are suboptimal, leading to poor accuracy, and time and money wasted on proofreading. Thus, other methods must be considered for increasing the reliability and performance of ASR before it is a viable alternative to human transcription. One such method is post-ASR error detection, used to recover from the inaccuracy of speech r...

  18. Vocal Tract Representation in the Recognition of Cerebral Palsied Speech

    Science.gov (United States)

    Rudzicz, Frank; Hirst, Graeme; van Lieshout, Pascal

    2012-01-01

    Purpose: In this study, the authors explored articulatory information as a means of improving the recognition of dysarthric speech by machine. Method: Data were derived chiefly from the TORGO database of dysarthric articulation (Rudzicz, Namasivayam, & Wolff, 2011) in which motions of various points in the vocal tract are measured during speech.…

  19. Say anything : oilpatch to help pilot speech recognition technology

    Energy Technology Data Exchange (ETDEWEB)

    Marsters, S.

    2001-05-01

    Halifax-based OKAMLogic Inc. is developing a more effective way of interacting with automated systems through the use of a newly developed technology called natural language automatic speech recognition (ASR). ASR combines wireless and voice-recognition technologies making it possible to use voice commands through any wireless or conventional telephone to access data. It is particularly well suited for the oil and gas industry where highly-mobile field workers working on rigs or installing pipelines, often do not have an office with a PC. ASR makes it possible for these workers to remotely access networks and databases safely and securely using only their voice and a cell phone. Voice prints may be used to validate if a caller is an authorized user of the system. Once the check is done, the user can transmit information, such as reports on daily pipeline construction activity, to the company's network. OKAMLogic was recognized by the Branham Group as one of the 25 up and coming Canadian companies to watch for and was granted a research contract by the Canadian Centre for Marine Communications and Aliant Telecom to help in the development of ASR. 1 fig.

  20. Source Separation via Spectral Masking for Speech Recognition Systems

    Directory of Open Access Journals (Sweden)

    Gustavo Fernandes Rodrigues

    2012-12-01

    Full Text Available In this paper we present an insight into the use of spectral masking techniques in the time-frequency domain as a preprocessing step for speech signal recognition. Speech recognition systems have their performance negatively affected in noisy environments or in the presence of other speech signals. The limits of these masking techniques for different levels of the signal-to-noise ratio are discussed. We show the robustness of the spectral masking techniques against four types of noise: white, pink, brown and human speech noise (babble noise). The main contribution of this work is to analyze the performance limits of recognition systems using spectral masking. We obtain an increase of 18% in the speech hit rate when the speech signals are corrupted by other speech signals or babble noise, at signal-to-noise ratios of approximately 1, 10 and 20 dB. On the other hand, applying the ideal binary masks to mixtures corrupted by white, pink and brown noise results in an average increase of 9% in the speech hit rate at the same signal-to-noise ratios. The experimental results suggest that the spectral masking techniques are more suitable for babble noise, which is produced by human speech, than for white, pink and brown noise.
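
    The ideal binary masking step discussed above can be sketched roughly as follows: compute a time-frequency representation, keep only the cells whose local SNR exceeds a criterion, and resynthesize. The STFT parameters and the 0 dB criterion are assumptions for illustration, not the paper's exact configuration.

```python
# Illustrative ideal binary mask applied to a clean/noise pair in the STFT domain.
import numpy as np
from scipy.signal import stft, istft

def apply_ideal_binary_mask(clean, noise, fs, lc_db=0.0):
    f, t, S = stft(clean, fs=fs, nperseg=512)
    _, _, N = stft(noise, fs=fs, nperseg=512)
    local_snr_db = 20 * np.log10(np.abs(S) / (np.abs(N) + 1e-12) + 1e-12)
    mask = (local_snr_db > lc_db).astype(float)        # 1 = keep cell, 0 = discard cell
    _, masked = istft((S + N) * mask, fs=fs, nperseg=512)
    return masked                                       # mixture with the mask applied
```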

  1. A computer model of auditory efferent suppression: implications for the recognition of speech in noise.

    Science.gov (United States)

    Brown, Guy J; Ferry, Robert T; Meddis, Ray

    2010-02-01

    The neural mechanisms underlying the ability of human listeners to recognize speech in the presence of background noise are still imperfectly understood. However, there is mounting evidence that the medial olivocochlear system plays an important role, via efferents that exert a suppressive effect on the response of the basilar membrane. The current paper presents a computer modeling study that investigates the possible role of this activity on speech intelligibility in noise. A model of auditory efferent processing [Ferry, R. T., and Meddis, R. (2007). J. Acoust. Soc. Am. 122, 3519-3526] is used to provide acoustic features for a statistical automatic speech recognition system, thus allowing the effects of efferent activity on speech intelligibility to be quantified. Performance of the "basic" model (without efferent activity) on a connected digit recognition task is good when the speech is uncorrupted by noise but falls when noise is present. However, recognition performance is much improved when efferent activity is applied. Furthermore, optimal performance is obtained when the amount of efferent activity is proportional to the noise level. The results obtained are consistent with the suggestion that efferent suppression causes a "release from adaptation" in the auditory-nerve response to noisy speech, which enhances its intelligibility.

  2. Jackson's Parrot: Samuel Beckett, Aphasic Speech Automatisms, and Psychosomatic Language.

    Science.gov (United States)

    Salisbury, Laura; Code, Chris

    2016-06-01

    This article explores the relationship between automatic and involuntary language in the work of Samuel Beckett and late nineteenth-century neurological conceptions of language that emerged from aphasiology. Using the work of John Hughlings Jackson alongside contemporary neuroscientific research, we explore the significance of the lexical and affective symmetries between Beckett's compulsive and profoundly embodied language and aphasic speech automatisms. The interdisciplinary work in this article explores the paradox of how and why Beckett was able to search out a longed-for language of feeling that might disarticulate the classical bond between the language, intention, rationality and the human, in forms of expression that seem automatic and "readymade".

  3. Speech-recognition interfaces for music information retrieval

    Science.gov (United States)

    Goto, Masataka

    2005-09-01

    This paper describes two hands-free music information retrieval (MIR) systems that enable a user to retrieve and play back a musical piece by saying its title or the artist's name. Although various interfaces for MIR have been proposed, speech-recognition interfaces suitable for retrieving musical pieces have not been studied. Our MIR-based jukebox systems employ two different speech-recognition interfaces for MIR, speech completion and speech spotter, which exploit intentionally controlled nonverbal speech information in original ways. The first is a music retrieval system with the speech-completion interface that is suitable for music stores and car-driving situations. When a user only remembers part of the name of a musical piece or an artist and utters only a remembered fragment, the system helps the user recall and enter the name by completing the fragment. The second is a background-music playback system with the speech-spotter interface that can enrich human-human conversation. When a user is talking to another person, the system allows the user to enter voice commands for music playback control by spotting a special voice-command utterance in face-to-face or telephone conversations. Experimental results from use of these systems have demonstrated the effectiveness of the speech-completion and speech-spotter interfaces. (Video clips: http://staff.aist.go.jp/m.goto/MIR/speech-if.html)

  4. Four-Channel Biosignal Analysis and Feature Extraction for Automatic Emotion Recognition

    Science.gov (United States)

    Kim, Jonghwa; André, Elisabeth

    This paper investigates the potential of physiological signals as a reliable channel for automatic recognition of a user's emotional state. For emotion recognition, little attention has been paid so far to physiological signals compared to audio-visual emotion channels such as facial expression or speech. All essential stages of an automatic recognition system using biosignals are discussed, from recording the physiological dataset up to feature-based multiclass classification. Four-channel biosensors are used to measure the electromyogram, electrocardiogram, skin conductivity and respiration changes. A wide range of physiological features from various analysis domains, including time/frequency, entropy, geometric analysis, subband spectra, multiscale entropy, etc., is proposed in order to find the best emotion-relevant features and to correlate them with emotional states. The best extracted features are specified in detail and their effectiveness is proven by emotion recognition results.

  5. Use of digital speech recognition in diagnostics radiology

    International Nuclear Information System (INIS)

    Arndt, H.; Stockheim, D.; Mutze, S.; Petersein, J.; Gregor, P.; Hamm, B.

    1999-01-01

    Purpose: The applicability and benefits of digital speech recognition in diagnostic radiology were tested using the speech recognition system SP 6000. Methods: The speech recognition system SP 6000 was integrated into the network of the institute and connected to the existing Radiological Information System (RIS). Three subjects used this system to write 2305 findings from dictation. After the recognition process, the date, length of dictation, time required for checking/correction, kind of examination and error rate were recorded for every dictation. With the same subjects, a correlation was performed with 625 conventionally written findings. Results: After a 1-hour initial training, the average error rates were 8.4 to 13.3%. The first adaptation of the speech recognition system (after nine days) decreased the average error rates to 2.4 to 10.7% owing to the ability of the program to learn. The 2nd and 3rd adaptations resulted in only small changes of the error rate. An individual comparison of the error rate development in the same kind of investigation showed that the error rate is relatively independent of the individual user. Conclusion: The results show that the speech recognition system SP 6000 can be regarded as an advantageous alternative for quickly recording radiological findings. A comparison between manually writing and dictating the findings verifies the individual differences in writing speed and shows the advantage of voice recognition compared with normal keyboard input. (orig.) [de

  6. Effects of speech clarity on recognition memory for spoken sentences.

    Science.gov (United States)

    Van Engen, Kristin J; Chandrasekaran, Bharath; Smiljanic, Rajka

    2012-01-01

    Extensive research shows that inter-talker variability (i.e., changing the talker) affects recognition memory for speech signals. However, relatively little is known about the consequences of intra-talker variability (i.e. changes in speaking style within a talker) on the encoding of speech signals in memory. It is well established that speakers can modulate the characteristics of their own speech and produce a listener-oriented, intelligibility-enhancing speaking style in response to communication demands (e.g., when speaking to listeners with hearing impairment or non-native speakers of the language). Here we conducted two experiments to examine the role of speaking style variation in spoken language processing. First, we examined the extent to which clear speech provided benefits in challenging listening environments (i.e. speech-in-noise). Second, we compared recognition memory for sentences produced in conversational and clear speaking styles. In both experiments, semantically normal and anomalous sentences were included to investigate the role of higher-level linguistic information in the processing of speaking style variability. The results show that acoustic-phonetic modifications implemented in listener-oriented speech lead to improved speech recognition in challenging listening conditions and, crucially, to a substantial enhancement in recognition memory for sentences.

  7. Automatic transcription of continuous speech into syllable-like units ...

    Indian Academy of Sciences (India)

    present. In addition, it requires a dictionary with the phonemic/sub-word unit transcription of the words and extensive language models to perform large vocabulary continuous speech recognition. The recogniser outputs words that exist in the dictionary. When the recogniser is to be adapted to a new task or new language, ...

  8. Relationship between multipulse integration and speech recognition with cochlear implants.

    Science.gov (United States)

    Zhou, Ning; Pfingst, Bryan E

    2014-09-01

    Comparisons of performance with cochlear implants and postmortem conditions in the cochlea in humans have shown mixed results. The limitations in those studies favor the use of within-subject designs and non-invasive measures to estimate cochlear conditions. One non-invasive correlate of cochlear health is multipulse integration, established in an animal model. The present study used this measure to relate neural health in human cochlear implant users to their speech recognition performance. The multipulse-integration slopes were derived based on psychophysical detection thresholds measured for two pulse rates (80 and 640 pulses per second). A within-subject design was used in eight subjects with bilateral implants where the direction and magnitude of ear differences in the multipulse-integration slopes were compared with those of the speech-recognition results. The speech measures included speech reception threshold for sentences and phoneme recognition in noise. The magnitude of ear difference in the integration slopes was significantly correlated with the magnitude of ear difference in speech reception thresholds, consonant recognition in noise, and transmission of place of articulation of consonants. These results suggest that multipulse integration predicts speech recognition in noise and perception of features that use dynamic spectral cues.

  9. [Japanese radiological report creation with continuous speech recognition].

    Science.gov (United States)

    Takahara, Taro; Nakajima, Mika; Nitatori, Toshiaki; Hachiya, Junichi

    2002-01-01

    Ten Japanese radiological reports consisting of 1381 characters (681 words) were created by two board-certified radiologists who used conventional typing and a continuous speech-recognition system called AmiVoice (Advanced Media, Inc., Tokyo, Japan). The two radiologists had not had any special training prior to their use of the continuous speech-recognition system. The model of speech-to-text analysis was generated from 22,589 radiological reports (5.7 MB). Dedicated pronunciations for loan words (i.e., English words) were registered by a board-certified radiologist in consideration of variations in Japanese pronunciation. Misrecognition occurred in 40 of 1362 words, corresponding to a 97.1% rate of accuracy of recognition. The average speech recognition time per report was 31.3 sec, and the additional time required for corrections was 25.0 sec. The total speech input time of 56.2 sec was much less than the conventional input time of 142.8 sec for typing. Continuous speech recognition is faster than typing, even considering the additional time required for corrections, and is acceptable in view of the overall reduction in report turn-around time.

  10. Histogram Equalization to Model Adaptation for Robust Speech Recognition

    Directory of Open Access Journals (Sweden)

    Suh Youngjoo

    2010-01-01

    Full Text Available We propose a new model adaptation method based on the histogram equalization technique for providing robustness in noisy environments. The trained acoustic mean models of a speech recognizer are adapted into environmentally matched conditions by using the histogram equalization algorithm on a single utterance basis. For more robust speech recognition in the heavily noisy conditions, trained acoustic covariance models are efficiently adapted by the signal-to-noise ratio-dependent linear interpolation between trained covariance models and utterance-level sample covariance models. Speech recognition experiments on both the digit-based Aurora2 task and the large vocabulary-based task showed that the proposed model adaptation approach provides significant performance improvements compared to the baseline speech recognizer trained on the clean speech data.

  11. Histogram Equalization to Model Adaptation for Robust Speech Recognition

    Science.gov (United States)

    Suh, Youngjoo; Kim, Hoirin

    2010-12-01

    We propose a new model adaptation method based on the histogram equalization technique for providing robustness in noisy environments. The trained acoustic mean models of a speech recognizer are adapted into environmentally matched conditions by using the histogram equalization algorithm on a single utterance basis. For more robust speech recognition in the heavily noisy conditions, trained acoustic covariance models are efficiently adapted by the signal-to-noise ratio-dependent linear interpolation between trained covariance models and utterance-level sample covariance models. Speech recognition experiments on both the digit-based Aurora2 task and the large vocabulary-based task showed that the proposed model adaptation approach provides significant performance improvements compared to the baseline speech recognizer trained on the clean speech data.
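
    The histogram equalization idea at the core of these records can be sketched in its more familiar feature-space form: map each feature dimension of a test utterance so that its empirical distribution matches a reference distribution estimated from training data. The model-space adaptation proposed in the paper applies the same matching principle to the acoustic mean models; the code below is only an illustration of the equalization step itself, with assumed array shapes.

```python
# Illustrative per-dimension histogram equalization of test features against a
# reference (training) feature distribution.
import numpy as np

def histogram_equalize(test_feats, ref_feats):
    """test_feats, ref_feats: (n_frames, n_dims) arrays; returns equalized test features."""
    equalized = np.empty_like(test_feats, dtype=float)
    for d in range(test_feats.shape[1]):
        order = np.argsort(test_feats[:, d])
        ranks = np.empty_like(order)
        ranks[order] = np.arange(len(order))
        # Empirical CDF of the test utterance mapped onto reference quantiles.
        quantiles = (ranks + 0.5) / len(ranks)
        equalized[:, d] = np.quantile(ref_feats[:, d], quantiles)
    return equalized
```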

  12. Speed and automaticity of word recognition - inseparable twins?

    DEFF Research Database (Denmark)

    Poulsen, Mads; Asmussen, Vibeke; Elbro, Carsten

    'Speed and automaticity' of word recognition is a standard collocation. However, it is not clear whether speed and automaticity (i.e., effortlessness) make independent contributions to reading comprehension. In theory, both speed and automaticity may save cognitive resources for comprehension processes. Hence, the aim of the present study was to assess the unique contributions of word recognition speed and automaticity to reading comprehension while controlling for decoding speed and accuracy. Method: 139 Grade 5 students completed tests of reading comprehension and computer-based tests of speed of decoding and word recognition together with a test of effortlessness (automaticity) of word recognition. Effortlessness was measured in a dual task in which participants were presented with a word enclosed in an unrelated figure. The task was to read the word and decide whether the figure was a triangle

  13. Cost-Efficient Development of Acoustic Models for Speech Recognition of Related Languages

    Directory of Open Access Journals (Sweden)

    J. Nouza

    2013-09-01

    Full Text Available When adapting an existing speech recognition system to a new language, major development costs are associated with the creation of an appropriate acoustic model (AM). For its training, a certain amount of recorded and annotated speech is required. In this paper, we show that not only the annotation process but also the process of speech acquisition can be automated to minimize the need for human and expert work. We demonstrate the proposed methodology on the Croatian language, for which the target AM has been built via cross-lingual adaptation of a Czech AM in two ways: (a) using the commercially available GlobalPhone database, and (b) by automatic speech data mining from the HRT radio archive. The latter approach is cost-free, yet it yields comparable or better results in LVCSR experiments conducted on three Croatian test sets.

  14. An effective cluster-based model for robust speech detection and speech recognition in noisy environments.

    Science.gov (United States)

    Górriz, J M; Ramírez, J; Segura, J C; Puntonet, C G

    2006-07-01

    This paper shows an accurate speech detection algorithm for improving the performance of speech recognition systems working in noisy environments. The proposed method is based on a hard decision clustering approach where a set of prototypes is used to characterize the noisy channel. Detecting the presence of speech is enabled by a decision rule formulated in terms of an averaged distance between the observation vector and a cluster-based noise model. The algorithm benefits from using contextual information, a strategy that considers not only a single speech frame but also a neighborhood of data in order to smooth the decision function and improve speech detection robustness. The proposed scheme exhibits reduced computational cost making it adequate for real time applications, i.e., automated speech recognition systems. An exhaustive analysis is conducted on the AURORA 2 and AURORA 3 databases in order to assess the performance of the algorithm and to compare it to existing standard voice activity detection (VAD) methods. The results show significant improvements in detection accuracy and speech recognition rate over standard VADs such as ITU-T G.729, ETSI GSM AMR, and ETSI AFE for distributed speech recognition and a representative set of recently reported VAD algorithms.
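
    A rough sketch of the cluster-based decision rule: a frame (optionally smoothed over its neighbourhood, reflecting the contextual strategy described above) is declared speech when its averaged distance to a set of noise prototypes exceeds a threshold. Prototype estimation and the threshold value are assumptions for illustration.

```python
# Illustrative cluster-based voice activity decision.
import numpy as np

def is_speech(frame_features, noise_prototypes, threshold, context=None):
    """frame_features: (D,); noise_prototypes: (K, D); context: optional (C, D) neighbourhood."""
    # Contextual smoothing: average the current frame with its neighbourhood.
    obs = frame_features if context is None else np.vstack([context, frame_features]).mean(axis=0)
    avg_dist = np.mean(np.linalg.norm(noise_prototypes - obs, axis=1))
    return avg_dist > threshold        # far from the noise model -> likely speech
```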

  15. A frequency-selective feedback model of auditory efferent suppression and its implications for the recognition of speech in noise.

    Science.gov (United States)

    Clark, Nicholas R; Brown, Guy J; Jürgens, Tim; Meddis, Ray

    2012-09-01

    The potential contribution of the peripheral auditory efferent system to our understanding of speech in a background of competing noise was studied using a computer model of the auditory periphery and assessed using an automatic speech recognition system. A previous study had shown that a fixed efferent attenuation applied to all channels of a multi-channel model could improve the recognition of connected digit triplets in noise [G. J. Brown, R. T. Ferry, and R. Meddis, J. Acoust. Soc. Am. 127, 943-954 (2010)]. In the current study an anatomically justified feedback loop was used to automatically regulate separate attenuation values for each auditory channel. This arrangement resulted in a further enhancement of speech recognition over fixed-attenuation conditions. Comparisons between multi-talker babble and pink noise interference conditions suggest that the benefit originates from the model's ability to modify the amount of suppression in each channel separately according to the spectral shape of the interfering sounds.

  16. Parametric Representation of the Speaker's Lips for Multimodal Sign Language and Speech Recognition

    Science.gov (United States)

    Ryumin, D.; Karpov, A. A.

    2017-05-01

    In this article, we propose a new method for parametric representation of the human lip region. The functional diagram of the method is described, and implementation details with an explanation of its key stages and features are given. The results of automatic detection of the regions of interest are illustrated. The speed of the method on several computers with different performance levels is reported. This universal method allows applying the parametric representation of the speaker's lips to the tasks of biometrics, computer vision, machine learning, and automatic recognition of faces, elements of sign languages, and audio-visual speech, including lip-reading.

  17. ISOLATED SPEECH RECOGNITION SYSTEM FOR TAMIL LANGUAGE USING STATISTICAL PATTERN MATCHING AND MACHINE LEARNING TECHNIQUES

    Directory of Open Access Journals (Sweden)

    VIMALA C.

    2015-05-01

    Full Text Available In recent years, speech technology has become a vital part of our daily lives. Various techniques have been proposed for developing Automatic Speech Recognition (ASR) systems and have achieved great success in many applications. Among them, template matching techniques like Dynamic Time Warping (DTW), statistical pattern matching techniques such as the Hidden Markov Model (HMM) and Gaussian Mixture Models (GMM), and machine learning techniques such as Neural Networks (NN), Support Vector Machines (SVM), and Decision Trees (DT) are the most popular. The main objective of this paper is to design and develop a speaker-independent isolated speech recognition system for the Tamil language using the above speech recognition techniques. The background of ASR systems, the steps involved in ASR, the merits and demerits of the conventional and machine learning algorithms, and the observations made based on the experiments are presented in this paper. For the developed system, the highest word recognition accuracy is achieved with the HMM technique, which offered 100% accuracy during training and 97.92% during testing.
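
    As an example of the template-matching family mentioned above, the following sketch implements plain Dynamic Time Warping over feature sequences and picks the stored word template with the smallest alignment cost; it is illustrative only and not the system developed in the paper.

```python
# Illustrative DTW-based isolated word recognition.
import numpy as np

def dtw_distance(a, b):
    """a: (Ta, D), b: (Tb, D) feature sequences; returns the alignment cost."""
    Ta, Tb = len(a), len(b)
    cost = np.full((Ta + 1, Tb + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, Ta + 1):
        for j in range(1, Tb + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return cost[Ta, Tb]

def recognize(test_features, templates):
    """templates: dict mapping word -> feature sequence; returns the best-matching word."""
    return min(templates, key=lambda w: dtw_distance(test_features, templates[w]))
```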

  18. Investigation on Mandarin Broadcast News Speech Recognition

    National Research Council Canada - National Science Library

    Hwang, Mei-Yuh; Lei, Xin; Wang, Wen; Shinozaki, Takahiro

    2006-01-01

    .... They have successfully incorporated the most popular speech technologies into their system. More importantly, they present two novel algorithms for smoothing pitch features and segmenting Chinese characters into word units...

  19. Robust Speech Recognition from Binary Masks

    Science.gov (United States)

    2010-01-01

    invariance to translation and size of the input pattern. Since the binary patterns of IBM are, in a way, similar to handwritten digits, we used a CNN...classification for each pattern. This also adds to the translational invariance of the CNN. To be consistent, we use the same strategy while testing IBMs...the cases, the noisy speech was enhanced using the MMSE algorithm, which is a widely used speech enhancement algorithm (Ephraim and Malah, 1985), as

  20. Robust Models and Features for Speech Recognition.

    Science.gov (United States)

    1998-03-13

    and relevant Spokes of the Speaker Independent Wall Street Journal database in 1994, the Marketplace database in 1995, and the Broadcast news...also built a 64000 word vocabulary. Language models for this vocabulary were built from a combination of Wall Street Journal data available from...was made from transcribing clean read speech (Wall Street Journal task in 1994) to real world speech (transcription of radio and TV broadcast news

  1. Recognition of Emotions in Mexican Spanish Speech: An Approach Based on Acoustic Modelling of Emotion-Specific Vowels

    Directory of Open Access Journals (Sweden)

    Santiago-Omar Caballero-Morales

    2013-01-01

    Full Text Available An approach for the recognition of emotions in speech is presented. The target language is Mexican Spanish, and for this purpose a speech database was created. The approach consists of phoneme acoustic modelling of emotion-specific vowels. For this, a standard phoneme-based Automatic Speech Recognition (ASR) system was built with Hidden Markov Models (HMMs), where different phoneme HMMs were built for the consonants and for the emotion-specific vowels associated with four emotional states (anger, happiness, neutral, sadness). The emotional state of a spoken sentence is then estimated by counting the number of emotion-specific vowels found in the ASR's output for the sentence. With this approach, accuracy of 87–100% was achieved for the recognition of the emotional state of Mexican Spanish speech.
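
    The counting decision described above can be illustrated with a toy sketch, assuming the ASR output is a sequence of phoneme symbols in which emotion-specific vowels carry an emotion tag (e.g. "a_anger"); the tag convention and default label are assumptions for illustration.

```python
# Illustrative estimation of the emotional state by counting emotion-tagged vowels
# in a recognized phoneme sequence.
from collections import Counter

def estimate_emotion(phoneme_sequence, emotions=("anger", "happiness", "neutral", "sadness")):
    counts = Counter()
    for ph in phoneme_sequence:
        for emo in emotions:
            if ph.endswith("_" + emo):        # emotion-specific vowel symbol
                counts[emo] += 1
    return counts.most_common(1)[0][0] if counts else "neutral"
```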

  2. Audio signal recognition for speech, music, and environmental sounds

    Science.gov (United States)

    Ellis, Daniel P. W.

    2003-10-01

    Human listeners are very good at all kinds of sound detection and identification tasks, from understanding heavily accented speech to noticing a ringing phone underneath music playing at full blast. Efforts to duplicate these abilities on computer have been particularly intense in the area of speech recognition, and it is instructive to review which approaches have proved most powerful, and which major problems still remain. The features and models developed for speech have found applications in other audio recognition tasks, including musical signal analysis, and the problems of analyzing the general "ambient" audio that might be encountered by an auditorily endowed robot. This talk will briefly review statistical pattern recognition for audio signals, giving examples in several of these domains. Particular emphasis will be given to common aspects and lessons learned.

  3. Evaluating deep learning architectures for Speech Emotion Recognition.

    Science.gov (United States)

    Fayek, Haytham M; Lech, Margaret; Cavedon, Lawrence

    2017-08-01

    Speech Emotion Recognition (SER) can be regarded as a static or dynamic classification problem, which makes SER an excellent test bed for investigating and comparing various deep learning architectures. We describe a frame-based formulation to SER that relies on minimal speech processing and end-to-end deep learning to model intra-utterance dynamics. We use the proposed SER system to empirically explore feed-forward and recurrent neural network architectures and their variants. Experiments conducted illuminate the advantages and limitations of these architectures in paralinguistic speech recognition and emotion recognition in particular. As a result of our exploration, we report state-of-the-art results on the IEMOCAP database for speaker-independent SER and present quantitative and qualitative assessments of the models' performances. Copyright © 2017 Elsevier Ltd. All rights reserved.

  4. Speech Rate Control for Improving Elderly Speech Recognition of Smart Devices

    Directory of Open Access Journals (Sweden)

    SON, G.

    2017-05-01

    Full Text Available Although smart devices have become a widely adopted communication tool in modern society, they still present a steep learning curve for the elderly. By introducing a voice-based interface for smart devices using voice recognition technology, smart devices can become more user-friendly and useful to the elderly. However, the voice recognition technology used in current devices is attuned to the voice patterns of the young. Therefore, speech recognition falters when an elderly user speaks into the device. This paper identifies the improper per-syllable speech rate of elderly speakers as a contributor to voice recognition failure. Upon modifying the speech rate of each syllable, the voice recognition rate increased by 12.3%. This paper demonstrates that by simply modifying the per-syllable speech rate, which is one of the factors that causes errors in voice recognition, the recognition rate can be substantially increased. Such improvements in voice recognition technology can make it easier for the elderly to operate smart devices, allowing them to be more socially connected in a mobile world and to access information at their fingertips. It may also help bridge the communication divide between generations.
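
    A minimal sketch of per-syllable rate modification is given below, assuming syllable boundaries (in seconds) are already known; librosa's phase-vocoder time_stretch stands in for whatever rate-control method the paper actually applies, and the file name and target duration are placeholders.

        import numpy as np
        import librosa

        def normalize_syllable_rate(y, sr, syllable_bounds, target_dur=0.20):
            """Stretch or compress each syllable towards a target duration."""
            pieces = []
            for start, end in syllable_bounds:
                seg = y[int(start * sr):int(end * sr)]
                rate = max(len(seg) / sr, 1e-3) / target_dur   # >1 speeds up, <1 slows down
                pieces.append(librosa.effects.time_stretch(seg, rate=rate))
            return np.concatenate(pieces)

        y, sr = librosa.load("elderly_utterance.wav", sr=16000)           # hypothetical file
        out = normalize_syllable_rate(y, sr, [(0.0, 0.35), (0.35, 0.62), (0.62, 1.0)])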

  5. Statistical pattern recognition for automatic writer identification and verification

    NARCIS (Netherlands)

    Bulacu, Marius Lucian

    2007-01-01

    The thesis addresses the problem of automatic person identification using scanned images of handwriting. Identifying the author of a handwritten sample using automatic image-based methods is an interesting pattern recognition problem with direct applicability in the forensic and historic document

  6. Automatic sign language recognition inspired by human sign perception

    NARCIS (Netherlands)

    Ten Holt, G.A.

    2010-01-01

    Automatic sign language recognition is a relatively new field of research (since ca. 1990). Its objectives are to automatically analyze sign language utterances. There are several issues within the research area that merit investigation: how to capture the utterances (cameras, magnetic sensors,

  7. A model based method for automatic facial expression recognition

    NARCIS (Netherlands)

    Kuilenburg, H. van; Wiering, M.A.; Uyl, M. den

    2006-01-01

    Automatic facial expression recognition is a research topic with interesting applications in the field of human-computer interaction, psychology and product marketing. The classification accuracy for an automatic system which uses static images as input is however largely limited by the image

  8. Robust relationship between reading span and speech recognition in noise.

    Science.gov (United States)

    Souza, Pamela; Arehart, Kathryn

    2015-01-01

    Working memory refers to a cognitive system that manages information processing and temporary storage. Recent work has demonstrated that individual differences in working memory capacity measured using a reading span task are related to the ability to recognize speech in noise. In this project, we investigated whether the specific implementation of the reading span task influenced the strength of the relationship between working memory capacity and speech recognition. The relationship between speech recognition and working memory capacity was examined for two different working memory tests that varied in approach, using a within-subject design. Data consisted of audiometric results along with the two different working memory tests, one speech-in-noise test, and a reading comprehension test. The test group included 94 older adults with varying hearing loss and 30 younger adults with normal hearing. Listeners with poorer working memory capacity had more difficulty understanding speech in noise after accounting for age and degree of hearing loss. That relationship did not differ significantly between the two different implementations of reading span. Our findings suggest that different implementations of a verbal reading span task do not affect the strength of the relationship between working memory capacity and speech recognition.

  9. Sign language perception research for improving automatic sign language recognition

    NARCIS (Netherlands)

    Ten Holt, G.A.; Arendsen, J.; De Ridder, H.; Van Doorn, A.J.; Reinders, M.J.T.; Hendriks, E.A.

    2009-01-01

    Current automatic sign language recognition (ASLR) seldom uses perceptual knowledge about the recognition of sign language. Using such knowledge can improve ASLR because it can give an indication which elements or phases of a sign are important for its meaning. Also, the current generation of

  10. Spectro-Temporal Analysis of Speech for Spanish Phoneme Recognition

    DEFF Research Database (Denmark)

    Sharifzadeh, Sara; Serrano, Javier; Carrabina, Jordi

    2012-01-01

    State of the art speech recognition systems (ASR) mostly use Mel-Frequency Cepstral Coefficients (MFCC) as acoustic features. In this paper, we propose a new discriminative analysis of acoustic features based on spectrogram analysis. Both spectral and temporal variations of the speech signal are considered. This has improved the recognition performance, especially in the case of noisy conditions and phonemes with time-domain modulations such as stops. In this method, the 2D Discrete Cosine Transform (DCT) is applied to small overlapped 2D Hamming-windowed patches of the spectrogram of Spanish phonemes...
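
    The sketch below illustrates the patch-based 2D-DCT analysis in a simplified form; the STFT settings, patch size, hop, and number of retained coefficients are assumptions, since the abstract does not give the exact configuration.

        import numpy as np
        from scipy import signal
        from scipy.fft import dctn

        def spectro_temporal_features(x, fs, patch=(16, 16), hop=(8, 8)):
            """2D-DCT coefficients of overlapped, 2D Hamming-windowed spectrogram patches."""
            _, _, S = signal.stft(x, fs=fs, nperseg=256, noverlap=192)
            log_spec = np.log(np.abs(S) + 1e-10)
            win2d = np.outer(np.hamming(patch[0]), np.hamming(patch[1]))
            feats = []
            for f0 in range(0, log_spec.shape[0] - patch[0] + 1, hop[0]):
                for t0 in range(0, log_spec.shape[1] - patch[1] + 1, hop[1]):
                    p = log_spec[f0:f0 + patch[0], t0:t0 + patch[1]] * win2d
                    feats.append(dctn(p, norm="ortho")[:4, :4].ravel())   # keep low-order coefficients
            return np.array(feats)

        feats = spectro_temporal_features(np.random.randn(16000), 16000)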

  11. Automatic Speech Signal Analysis for Clinical Diagnosis and Assessment of Speech Disorders

    CERN Document Server

    Baghai-Ravary, Ladan

    2013-01-01

    Automatic Speech Signal Analysis for Clinical Diagnosis and Assessment of Speech Disorders provides a survey of methods designed to aid clinicians in the diagnosis and monitoring of speech disorders such as dysarthria and dyspraxia, with an emphasis on the signal processing techniques, statistical validity of the results presented in the literature, and the appropriateness of methods that do not require specialized equipment, rigorously controlled recording procedures or highly skilled personnel to interpret results. Such techniques offer the promise of a simple and cost-effective, yet objective, assessment of a range of medical conditions, which would be of great value to clinicians. The ideal scenario would begin with the collection of examples of the clients’ speech, either over the phone or using portable recording devices operated by non-specialist nursing staff. The recordings could then be analyzed initially to aid diagnosis of conditions, and subsequently to monitor the clients’ progress and res...

  12. Current trends in small vocabulary speech recognition for equipment control

    Science.gov (United States)

    Doukas, Nikolaos; Bardis, Nikolaos G.

    2017-09-01

    Speech recognition systems allow human - machine communication to acquire an intuitive nature that approaches the simplicity of inter - human communication. Small vocabulary speech recognition is a subset of the overall speech recognition problem, where only a small number of words need to be recognized. Speaker independent small vocabulary recognition can find significant applications in field equipment used by military personnel. Such equipment may typically be controlled by a small number of commands that need to be given quickly and accurately, under conditions where delicate manual operations are difficult to achieve. This type of application could hence significantly benefit by the use of robust voice operated control components, as they would facilitate the interaction with their users and render it much more reliable in times of crisis. This paper presents current challenges involved in attaining efficient and robust small vocabulary speech recognition. These challenges concern feature selection, classification techniques, speaker diversity and noise effects. A state machine approach is presented that facilitates the voice guidance of different equipment in a variety of situations.
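
    A toy version of the state-machine idea for voice-guided equipment control is sketched below; the states, commands, and transitions are invented for illustration and are not taken from the paper.

        # Commands outside the current state's grammar are ignored, which mirrors
        # how a finite-state command structure limits recognition errors.
        TRANSITIONS = {
            ("idle", "start"): "armed",
            ("armed", "engage"): "active",
            ("armed", "stop"): "idle",
            ("active", "stop"): "idle",
        }

        def run_controller(recognized_commands, state="idle"):
            for cmd in recognized_commands:
                nxt = TRANSITIONS.get((state, cmd))
                if nxt is None:
                    print(f"'{cmd}' ignored in state '{state}'")
                else:
                    print(f"{state} --{cmd}--> {nxt}")
                    state = nxt
            return state

        run_controller(["start", "engage", "reload", "stop"])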

  13. Quality Assessment of Compressed Video for Automatic License Plate Recognition

    DEFF Research Database (Denmark)

    Ukhanova, Ann; Støttrup-Andersen, Jesper; Forchhammer, Søren

    2014-01-01

    Definition of video quality requirements for video surveillance poses new questions in the area of quality assessment. This paper presents a quality assessment experiment for an automatic license plate recognition scenario. We explore the influence of the compression by H.264/AVC and H.265/HEVC standards on the recognition performance. We compare logarithmic and logistic functions for quality modeling. Our results show that a logistic function can better describe the dependence of recognition performance on the quality for both compression standards. We observe that automatic license plate...

  14. Basic speech recognition for spoken dialogues

    CSIR Research Space (South Africa)

    Van Heerden, C

    2009-09-01

    Full Text Available speech recognisers for a diverse multitude of languages. The paper investigates the feasibility of developing small-vocabulary speaker-independent ASR systems designed for use in a telephone-based information system, using ten resource-scarce languages...

  15. Relating dynamic brain states to dynamic machine states: Human and machine solutions to the speech recognition problem.

    Directory of Open Access Journals (Sweden)

    Cai Wingfield

    2017-09-01

    Full Text Available There is widespread interest in the relationship between the neurobiological systems supporting human cognition and emerging computational systems capable of emulating these capacities. Human speech comprehension, poorly understood as a neurobiological process, is an important case in point. Automatic Speech Recognition (ASR) systems with near-human levels of performance are now available, which provide a computationally explicit solution for the recognition of words in continuous speech. This research aims to bridge the gap between speech recognition processes in humans and machines, using novel multivariate techniques to compare incremental 'machine states', generated as the ASR analysis progresses over time, to the incremental 'brain states', measured using combined electro- and magneto-encephalography (EMEG), generated as the same inputs are heard by human listeners. This direct comparison of dynamic human and machine internal states, as they respond to the same incrementally delivered sensory input, revealed a significant correspondence between neural response patterns in human superior temporal cortex and the structural properties of ASR-derived phonetic models. Spatially coherent patches in human temporal cortex responded selectively to individual phonetic features defined on the basis of machine-extracted regularities in the speech to lexicon mapping process. These results demonstrate the feasibility of relating human and ASR solutions to the problem of speech recognition, and suggest the potential for further studies relating complex neural computations in human speech comprehension to the rapidly evolving ASR systems that address the same problem domain.

  16. Toward Automatic Recognition of Children's Affective State Using Physiological Parameters and Fuzzy Model of Emotions

    Directory of Open Access Journals (Sweden)

    SCHIPOR, O.-A.

    2012-05-01

    Full Text Available Affective computing, the ability of a system to recognize, understand and simulate human emotional intelligence, is one of the most dynamic fields of Human-Computer Interaction (HCI). These characteristics find their applicability in those areas where it is necessary to extend traditional cognitive communication with emotional features. That is why Computer Based Speech Therapy Systems (CBST), and especially those involving children with speech disorders, require this qualitative shift. So in this paper we propose an original emotion recognition framework as an extension of our previously developed system, Logomon. A fuzzy model is used in order to interpret the values of specific physiological parameters and to obtain the emotional state of the subject. Moreover, an experiment that indicates the emotion pattern (average fuzzy sets) for each therapeutic sequence is also presented. The obtained results encourage us to continue working on automatic emotion recognition and provide important clues regarding the future development of our CBST.

  17. Speech recognition: Acoustic-phonetic knowledge acquisition and representation

    Science.gov (United States)

    Zue, Victor W.

    1988-09-01

    The long-term research goal is to develop and implement speaker-independent continuous speech recognition systems. It is believed that the proper utilization of speech-specific knowledge is essential for such advanced systems. This research is thus directed toward the acquisition, quantification, and representation of acoustic-phonetic and lexical knowledge, and the application of this knowledge to speech recognition algorithms. In addition, we are exploring new speech recognition alternatives based on artificial intelligence and connectionist techniques. We developed a statistical model for predicting the acoustic realization of stop consonants in various positions in the syllable template. A unification-based grammatical formalism was developed for incorporating this model into the lexical access algorithm. We provided an information-theoretic justification for the hierarchical structure of the syllable template. We analyzed segmented duration for vowels and fricatives in continuous speech. Based on contextual information, we developed durational models for vowels and fricatives that account for over 70 percent of the variance, using data from multiple, unknown speakers. We rigorously evaluated the ability of human spectrogram readers to identify stop consonants spoken by many talkers and in a variety of phonetic contexts. Incorporating the declarative knowledge used by the readers, we developed a knowledge-based system for stop identification. We achieved system performance comparable to that of the readers.

  18. Writing and Speech Recognition : Observing Error Correction Strategies of Professional Writers

    OpenAIRE

    Leijten, M.A.J.C.

    2007-01-01

    In this thesis we describe the organization of speech recognition based writing processes. Writing can be seen as a visual representation of spoken language: a combination that speech recognition takes full advantage of. In the field of writing research, speech recognition is a new writing instrument that may cause a shift in writing process research because the underlying processes are changing. In addition to this, we take advantage of one of the weak points of speech recognition, namely the...

  19. The effect of network degradation on speech recognition

    CSIR Research Space (South Africa)

    Joubert, G

    2005-11-01

    Full Text Available The authors describe a system, based on open-source tools, that was developed in order to study the effect of network degenerations in Voice-over-Internet-Protocol applications on speech-recognition accuracy. Sophisticated play-out algorithms...

  20. Spoken Word Recognition of Chinese Words in Continuous Speech

    Science.gov (United States)

    Yip, Michael C. W.

    2015-01-01

    The present study examined the role that the positional probability of syllables plays in the recognition of spoken words in continuous Cantonese speech. Because some sounds occur more frequently at the beginning or ending position of Cantonese syllables than others, these kinds of probabilistic information about syllables may cue the locations…

  1. How Aging Affects the Recognition of Emotional Speech

    Science.gov (United States)

    Paulmann, Silke; Pell, Marc D.; Kotz, Sonja A.

    2008-01-01

    To successfully infer a speaker's emotional state, diverse sources of emotional information need to be decoded. The present study explored to what extent emotional speech recognition of "basic" emotions (anger, disgust, fear, happiness, pleasant surprise, sadness) differs between different sex (male/female) and age (young/middle-aged) groups in a…

  2. Speech Recognition for Students with Disabilities in Writing

    Science.gov (United States)

    Gardner, Teresa J.

    2008-01-01

    The role of technology in education is ever increasing. This article looks at students with disabilities and the problem of writing independently. Speech recognition technology offers an option, or solution, for students who have physical and/or learning disabilities and for students who cannot access and use computer keyboards or switches.…

  3. Speech Recognition Issues for Dutch Spoken Document Retrieval

    NARCIS (Netherlands)

    Ordelman, Roeland J.F.; van Hessen, Adrianus J.; de Jong, Franciska M.G.; Matousek, Vaclav; Mautner, Pavel; Moucek, Roman; Tauser, Karel

    2001-01-01

    In this paper, ongoing work on the development of the speech recognition modules of a multimedia retrieval environment for Dutch is described. The work on the generation of acoustic models and language models along with their current performance is presented. Some characteristics of the Dutch

  4. Science 101: How Does Speech-Recognition Software Work?

    Science.gov (United States)

    Robertson, Bill

    2016-01-01

    This column provides background science information for elementary teachers. Many innovations with computer software begin with analysis of how humans do a task. This article takes a look at how humans recognize spoken words and explains the origins of speech-recognition software.

  5. Development of a speech recognition system for Spanish broadcast news

    NARCIS (Netherlands)

    Niculescu, A.I.; de Jong, Franciska M.G.

    2008-01-01

    This paper reports on the development process of a speech recognition system for Spanish broadcast news within the MESH FP6 project. The system uses the SONIC recognizer developed at the Center for Spoken Language Research (CSLR), University of Colorado. Acoustic and language models were trained

  6. Shortlist B: A Bayesian model of continuous speech recognition

    NARCIS (Netherlands)

    Norris, D.; McQueen, J.M.

    2008-01-01

    A Bayesian model of continuous speech recognition is presented. It is based on Shortlist (D. Norris, 1994; D. Norris, J. M. McQueen, A. Cutler, & S. Butterfield, 1997) and shares many of its key assumptions: parallel competitive evaluation of multiple lexical hypotheses, phonologically abstract

  7. Shortlist B: A Bayesian Model of Continuous Speech Recognition

    Science.gov (United States)

    Norris, Dennis; McQueen, James M.

    2008-01-01

    A Bayesian model of continuous speech recognition is presented. It is based on Shortlist (D. Norris, 1994; D. Norris, J. M. McQueen, A. Cutler, & S. Butterfield, 1997) and shares many of its key assumptions: parallel competitive evaluation of multiple lexical hypotheses, phonologically abstract prelexical and lexical representations, a feedforward…

  8. Multitasking During Degraded Speech Recognition in School-Age Children.

    Science.gov (United States)

    Grieco-Calub, Tina M; Ward, Kristina M; Brehm, Laurel

    2017-01-01

    Multitasking requires individuals to allocate their cognitive resources across different tasks. The purpose of the current study was to assess school-age children's multitasking abilities during degraded speech recognition. Children (8 to 12 years old) completed a dual-task paradigm including a sentence recognition (primary) task containing speech that was either unprocessed or noise-band vocoded with 8, 6, or 4 spectral channels and a visual monitoring (secondary) task. Children's accuracy and reaction time on the visual monitoring task were quantified during the dual-task paradigm in each condition of the primary task and compared with single-task performance. Children experienced dual-task costs in the 6- and 4-channel conditions of the primary speech recognition task with decreased accuracy on the visual monitoring task relative to baseline performance. In all conditions, children's dual-task performance on the visual monitoring task was strongly predicted by their single-task (baseline) performance on the task. Results suggest that children's proficiency with the secondary task contributes to the magnitude of dual-task costs while multitasking during degraded speech recognition.

  9. Writing and Speech Recognition : Observing Error Correction Strategies of Professional Writers

    NARCIS (Netherlands)

    Leijten, M.A.J.C.

    2007-01-01

    In this thesis we describe the organization of speech recognition based writing processes. Writing can be seen as a visual representation of spoken language: a combination that speech recognition takes full advantage of. In the field of writing research, speech recognition is a new writing

  10. Effect of speech rate variation on acoustic phone stability in Afrikaans speech recognition

    CSIR Research Space (South Africa)

    Badenhorst, JAC

    2007-11-01

    Full Text Available space correlation matrix has been shown to be useful in normalising speech (performing speaker normalisation) prior to speech recognition system training [5]. Speaker space correlation matrices as defined in [5] are constructed by extracting... is constructed for the m observed phones and d-dimensional feature vector. The vectors of phones being investigated are then concatenated in the same sequence for each speaker, which results in a matrix of speaker vectors. If the correlation values...

  11. POLISH EMOTIONAL SPEECH RECOGNITION USING ARTIFICIAL NEURAL NETWORK

    Directory of Open Access Journals (Sweden)

    Paweł Powroźnik

    2014-11-01

    Full Text Available The article presents the issue of emotion recognition based on Polish emotional speech analysis. The Polish database of emotional speech, prepared and shared by the Medical Electronics Division of the Lodz University of Technology, has been used for the research. The following parameters extracted from the sampled and normalised speech signal have been used for the analysis: energy of the signal, speaker's sex, average value of the speech signal, and both the minimum and maximum sample value for a given signal. A four-layer artificial neural network has been used as the emotional state classifier. The achieved results reach 50% accuracy. The conducted research focused on six emotional states: a neutral state, sadness, joy, anger, fear and boredom.
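
    The sketch below extracts a subset of the listed parameters (energy, mean, minimum and maximum sample value) and trains a small feed-forward network; sklearn's MLPClassifier stands in for the four-layer network used in the article, and the training data are synthetic placeholders.

        import numpy as np
        from sklearn.neural_network import MLPClassifier

        def utterance_features(x):
            x = x / (np.max(np.abs(x)) + 1e-12)                 # amplitude normalisation
            return [np.sum(x ** 2), np.mean(x), np.min(x), np.max(x)]

        rng = np.random.default_rng(0)
        X = np.array([utterance_features(rng.standard_normal(8000)) for _ in range(60)])
        y = rng.integers(0, 6, size=60)                         # six emotional states

        clf = MLPClassifier(hidden_layer_sizes=(32, 16, 8), max_iter=2000, random_state=0)
        clf.fit(X, y)
        print(clf.predict(X[:3]))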

  12. Temporal visual cues aid speech recognition

    DEFF Research Database (Denmark)

    Zhou, Xiang; Ross, Lars; Lehn-Schiøler, Tue

    2006-01-01

    BACKGROUND: It is well known that under noisy conditions, viewing a speaker's articulatory movement aids the recognition of spoken words. Conventionally it is thought that the visual input disambiguates otherwise confusing auditory input. HYPOTHESIS: In contrast we hypothesize that it is the temporal synchronicity of the visual input that aids parsing of the auditory stream. More specifically, we expected that purely temporal information, which does not convey information such as place of articulation, may facilitate word recognition. METHODS: To test this prediction we used temporal features of audio to generate an artificial talking-face video and measured word recognition performance on simple monosyllabic words. RESULTS: When presenting words together with the artificial video we find that word recognition is improved over purely auditory presentation. The effect is significant (p...

  13. Automatic Target Recognition for Hyperspectral Imagery

    Science.gov (United States)

    2012-03-01

    D., Rossacci, M., Lockwood, R., Cooley, T., & Jacobson, J. Maintaining CFAR Operation in Hyperspectral Target Detection Using Extreme Value...Multiple-Band CFAR Detection of an Optical Pattern with Unknown Spectral Distribution. IEEE Transactions on Acoustics, Speech, and Signal Processing...F. C., Fuhrmann, D. R., Kelly, E. J., & Nitzburg, R. (1992). A CFAR adaptive matched filter detector. IEEE Transactions on Aerospace and Electronic

  14. Biologically inspired emotion recognition from speech

    Directory of Open Access Journals (Sweden)

    Buscicchio Cosimo

    2011-01-01

    Full Text Available Emotion recognition has become a fundamental task in human-computer interaction systems. In this article, we propose an emotion recognition approach based on biologically inspired methods. Specifically, emotion classification is performed using a long short-term memory (LSTM) recurrent neural network which is able to recognize long-range dependencies between successive temporal patterns. We propose to represent data using features derived from two different models: mel-frequency cepstral coefficients (MFCC) and the Lyon cochlear model. In the experimental phase, results obtained from the LSTM network and the two different feature sets are compared, showing that features derived from the Lyon cochlear model give better recognition results in comparison with those obtained with the traditional MFCC representation.

  15. An automatic system for Turkish word recognition using Discrete Wavelet Neural Network based on adaptive entropy

    International Nuclear Information System (INIS)

    Avci, E.

    2007-01-01

    In this paper, an automatic system is presented for word recognition using real Turkish word signals. This paper especially deals with combination of the feature extraction and classification from real Turkish word signals. A Discrete Wavelet Neural Network (DWNN) model is used, which consists of two layers: discrete wavelet layer and multi-layer perceptron. The discrete wavelet layer is used for adaptive feature extraction in the time-frequency domain and is composed of Discrete Wavelet Transform (DWT) and wavelet entropy. The multi-layer perceptron used for classification is a feed-forward neural network. The performance of the used system is evaluated by using noisy Turkish word signals. Test results showing the effectiveness of the proposed automatic system are presented in this paper. The rate of correct recognition is about 92.5% for the sample speech signals. (author)
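
    A rough sketch of a DWT-plus-entropy front end in the spirit of the DWNN is shown below; the wavelet family, decomposition level, and the use of sklearn's MLPClassifier in place of the paper's multi-layer perceptron are assumptions made for illustration.

        import numpy as np
        import pywt
        from sklearn.neural_network import MLPClassifier

        def wavelet_entropy_features(x, wavelet="db4", level=5):
            coeffs = pywt.wavedec(x, wavelet, level=level)
            feats = []
            for c in coeffs:
                e = c ** 2
                p = e / (np.sum(e) + 1e-12)                      # normalised sub-band energies
                feats.append(-np.sum(p * np.log2(p + 1e-12)))    # Shannon (wavelet) entropy
            return feats

        rng = np.random.default_rng(1)
        X = np.array([wavelet_entropy_features(rng.standard_normal(16000)) for _ in range(40)])
        y = rng.integers(0, 4, size=40)                          # four hypothetical word classes
        MLPClassifier(hidden_layer_sizes=(20,), max_iter=2000).fit(X, y)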

  16. The Impact of Telephone Channels on the Accuracy of Automatic Speaker Recognition

    Directory of Open Access Journals (Sweden)

    V. Delić

    2011-11-01

    Full Text Available This paper presents an experimental study on the impact of telephone channels on the accuracy of automatic speaker recognition. Speaker models and the design of the recognizer used in this study are based on Hidden Markov models. In order to simulate telephone-quality speech signals, several experimental conditions were introduced taking two control factors into consideration: the type of the applied codec and the probability of transmission errors. In addition, the impact of echo signals – that are often present in Internet telephony – on the accuracy of automatic speaker recognition systems is considered. Finally, the paper provides a brief overview of several methodologies for the adaptation of the recognizer to the expected environmental conditions that may enhance the robustness of the speaker recognizer.

  17. Incorporating Speech Recognition into a Natural User Interface

    Science.gov (United States)

    Chapa, Nicholas

    2017-01-01

    The Augmented/Virtual Reality (AVR) Lab has been working to study the applicability of recent virtual and augmented reality hardware and software to KSC operations. This includes the Oculus Rift, HTC Vive, Microsoft HoloLens, and Unity game engine. My project in this lab is to integrate voice recognition and voice commands into an easy-to-modify system that can be added to an existing portion of a Natural User Interface (NUI). A NUI is an intuitive and simple to use interface incorporating visual, touch, and speech recognition. The inclusion of speech recognition capability will allow users to perform actions or make inquiries using only their voice. The simplicity of needing only to speak to control an on-screen object or enact some digital action means that any user can quickly become accustomed to using this system. Multiple programs were tested for use in a speech command and recognition system. Sphinx4 translates speech to text using a Hidden Markov Model (HMM) based Language Model, an Acoustic Model, and a word Dictionary running on Java. PocketSphinx had similar functionality to Sphinx4 but instead ran on C. However, neither of these programs was ideal, as building a Java or C wrapper slowed performance. The most ideal speech recognition system tested was the Unity Engine Grammar Recognizer. A Context Free Grammar (CFG) structure is written in an XML file to specify the structure of phrases and words that will be recognized by Unity Grammar Recognizer. Using Speech Recognition Grammar Specification (SRGS) 1.0 makes modifying the recognized combinations of words and phrases very simple and quick to do. With SRGS 1.0, semantic information can also be added to the XML file, which allows for even more control over how spoken words and phrases are interpreted by Unity. Additionally, using a CFG with SRGS 1.0 produces a Finite State Machine (FSM) functionality limiting the potential for incorrectly heard words or phrases. The purpose of my project was to

  18. Mechanisms of enhancing visual-speech recognition by prior auditory information.

    Science.gov (United States)

    Blank, Helen; von Kriegstein, Katharina

    2013-01-15

    Speech recognition from visual-only faces is difficult, but can be improved by prior information about what is said. Here, we investigated how the human brain uses prior information from auditory speech to improve visual-speech recognition. In a functional magnetic resonance imaging study, participants performed a visual-speech recognition task, indicating whether the word spoken in visual-only videos matched the preceding auditory-only speech, and a control task (face-identity recognition) containing exactly the same stimuli. We localized a visual-speech processing network by contrasting activity during visual-speech recognition with the control task. Within this network, the left posterior superior temporal sulcus (STS) showed increased activity and interacted with auditory-speech areas if prior information from auditory speech did not match the visual speech. This mismatch-related activity and the functional connectivity to auditory-speech areas were specific for speech, i.e., they were not present in the control task. The mismatch-related activity correlated positively with performance, indicating that posterior STS was behaviorally relevant for visual-speech recognition. In line with predictive coding frameworks, these findings suggest that prediction error signals are produced if visually presented speech does not match the prediction from preceding auditory speech, and that this mechanism plays a role in optimizing visual-speech recognition by prior information. Copyright © 2012 Elsevier Inc. All rights reserved.

  19. How noise and language proficiency influence speech recognition by individual non-native listeners.

    Science.gov (United States)

    Zhang, Jin; Xie, Lingli; Li, Yongjun; Chatterjee, Monita; Ding, Nai

    2014-01-01

    This study investigated how speech recognition in noise is affected by language proficiency for individual non-native speakers. The recognition of English and Chinese sentences was measured as a function of the signal-to-noise ratio (SNR) in sixty native Chinese speakers who had never lived in an English-speaking environment. The recognition score for speech in quiet (which varied from 15% to 92%) was found to be uncorrelated with speech recognition threshold (SRTQ/2), i.e. the SNR at which the recognition score drops to 50% of the recognition score in quiet. This result demonstrates separable contributions of language proficiency and auditory processing to speech recognition in noise.
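
    The threshold definition used above (the SNR at which the score falls to half of the score in quiet) can be estimated by interpolating over measured (SNR, score) points, as in the sketch below; the psychometric data are made up for illustration.

        import numpy as np

        def srt_half_quiet(snrs_db, scores, quiet_score):
            """Interpolate the SNR where the score equals quiet_score / 2."""
            target = quiet_score / 2.0
            order = np.argsort(scores)                           # np.interp needs increasing x
            return float(np.interp(target, np.array(scores)[order], np.array(snrs_db)[order]))

        snrs = [-9, -6, -3, 0, 3, 6]
        scores = [5, 18, 37, 55, 70, 78]                         # percent correct at each SNR
        print(srt_half_quiet(snrs, scores, quiet_score=84))      # SNR giving 42% correct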

  20. Sine-wave speech recognition in a tonal language.

    Science.gov (United States)

    Feng, Yan-Mei; Xu, Li; Zhou, Ning; Yang, Guang; Yin, Shan-Kai

    2012-02-01

    It is hypothesized that in sine-wave replicas of natural speech, lexical tone recognition would be severely impaired due to the loss of F0 information, but the linguistic information at the sentence level could be retrieved even with limited tone information. Forty-one native Mandarin-Chinese-speaking listeners participated in the experiments. Results showed that sine-wave tone-recognition performance was on average only 32.7% correct. However, sine-wave sentence-recognition performance was very accurate, approximately 92% correct on average. Therefore the functional load of lexical tones on sentence recognition is limited, and the high-level recognition of sine-wave sentences is likely attributed to the perceptual organization that is influenced by top-down processes. © 2012 Acoustical Society of America

  1. Part-of-Speech Enhanced Context Recognition

    DEFF Research Database (Denmark)

    Madsen, Rasmus Elsborg; Larsen, Jan; Hansen, Lars Kai

    2004-01-01

    and a probabilistic neural network classifier. Three medium size data-sets are analyzed and we find consistent synergy between the term and natural language features in all three sets for a range of training set sizes. The most significant enhancement is found for small text databases where high recognition...

  2. Hemispheric lateralization of linguistic prosody recognition in comparison to speech and speaker recognition.

    Science.gov (United States)

    Kreitewolf, Jens; Friederici, Angela D; von Kriegstein, Katharina

    2014-11-15

    Hemispheric specialization for linguistic prosody is a controversial issue. While it is commonly assumed that linguistic prosody and emotional prosody are preferentially processed in the right hemisphere, neuropsychological work directly comparing processes of linguistic prosody and emotional prosody suggests a predominant role of the left hemisphere for linguistic prosody processing. Here, we used two functional magnetic resonance imaging (fMRI) experiments to clarify the role of left and right hemispheres in the neural processing of linguistic prosody. In the first experiment, we sought to confirm previous findings showing that linguistic prosody processing compared to other speech-related processes predominantly involves the right hemisphere. Unlike previous studies, we controlled for stimulus influences by employing a prosody and speech task using the same speech material. The second experiment was designed to investigate whether a left-hemispheric involvement in linguistic prosody processing is specific to contrasts between linguistic prosody and emotional prosody or whether it also occurs when linguistic prosody is contrasted against other non-linguistic processes (i.e., speaker recognition). Prosody and speaker tasks were performed on the same stimulus material. In both experiments, linguistic prosody processing was associated with activity in temporal, frontal, parietal and cerebellar regions. Activation in temporo-frontal regions showed differential lateralization depending on whether the control task required recognition of speech or speaker: recognition of linguistic prosody predominantly involved right temporo-frontal areas when it was contrasted against speech recognition; when contrasted against speaker recognition, recognition of linguistic prosody predominantly involved left temporo-frontal areas. The results show that linguistic prosody processing involves functions of both hemispheres and suggest that recognition of linguistic prosody is based on

  3. Automatic recognition of printed Oriya script

    Indian Academy of Sciences (India)


    Abstract. This paper deals with an Optical Character Recognition (OCR) system for printed Oriya script. The development of OCR for this script is difficult because a large number of character shapes in the script have to be recognized. In the proposed system, the document image is first captured using a flat-bed scanner.

  4. Automatic Number Plate Recognition System for IPhone Devices

    Directory of Open Access Journals (Sweden)

    Călin Enăchescu

    2013-06-01

    Full Text Available This paper presents a system for automatic number plate recognition, implemented for devices running the iOS operating system. The methods used for number plate recognition are based on existing methods, but optimized for devices with low hardware resources. To solve the task of automatic number plate recognition we have divided it into the following subtasks: image acquisition, localization of the number plate position on the image and character detection. The first subtask is performed by the camera of an iPhone, the second one is done using image pre-processing methods and template matching. For the character recognition we are using a feed-forward artificial neural network. Each of these methods is presented along with its results.
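
    As an illustration of the template-matching step used for plate localisation, the sketch below applies OpenCV's matchTemplate to a grayscale frame; the file names and the matching threshold are placeholders, and the original app implements this on iOS rather than with OpenCV in Python.

        import cv2

        frame = cv2.imread("frame.jpg", cv2.IMREAD_GRAYSCALE)            # camera image
        template = cv2.imread("plate_template.jpg", cv2.IMREAD_GRAYSCALE)

        result = cv2.matchTemplate(frame, template, cv2.TM_CCOEFF_NORMED)
        _, max_val, _, max_loc = cv2.minMaxLoc(result)

        if max_val > 0.6:                                                 # assumed threshold
            h, w = template.shape
            x, y = max_loc
            plate_region = frame[y:y + h, x:x + w]                        # crop for character recognition
            print("candidate plate at", max_loc, "score", round(max_val, 2))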

  5. Automatic Recognition of Improperly Pronounced Initial 'r' Consonant in Romanian

    Directory of Open Access Journals (Sweden)

    VELICAN, V.

    2012-08-01

    Full Text Available Correctly assessing the degree of mispronunciation and deciding upon the necessary treatment are fundamental activities for all speech disorder specialists. Obviously, the experience and the availability of the specialists are essential in order to assure an efficient therapy for the speech impaired. To overcome this deficiency, a more objective approach would include the existence of a tool that, independent of the specialist's abilities, could be used to establish the diagnostics. A complete automated system based on speech processing algorithms capable of performing the recognition task is therefore thoroughly justified and can be viewed as a goal that will bring many benefits to the field of speech pronunciation correction. This paper presents further results of the authors' work on developing speech processing algorithms able to identify mispronunciations in the Romanian language; more specifically, we propose the use of the Walsh-Hadamard Transform (WHT) as a feature selection tool in the case of identifying rhotacism. The results are encouraging, with a best recognition rate of 92.55%.
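
    The sketch below shows one plausible way to use the Walsh-Hadamard Transform as a frame-level feature extractor; the frame length and number of retained coefficients are assumptions, and scipy's Hadamard matrix (natural order) is used rather than a sequency-ordered fast WHT.

        import numpy as np
        from scipy.linalg import hadamard

        def wht_features(frame, keep=32):
            n = len(frame)
            assert n & (n - 1) == 0, "frame length must be a power of two"
            coeffs = hadamard(n) @ frame / n                     # Walsh-Hadamard spectrum
            return np.abs(coeffs[:keep])                         # low-order coefficients as features

        frame = np.random.randn(256)                             # one 256-sample speech frame
        print(wht_features(frame).shape)                         # (32,)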

  6. Dynamic Relation Between Working Memory Capacity and Speech Recognition in Noise During the First 6 Months of Hearing Aid Use

    Directory of Open Access Journals (Sweden)

    Elaine H. N. Ng

    2014-11-01

    Full Text Available The present study aimed to investigate the changing relationship between aided speech recognition and cognitive function during the first 6 months of hearing aid use. Twenty-seven first-time hearing aid users with symmetrical mild to moderate sensorineural hearing loss were recruited. Aided speech recognition thresholds in noise were obtained in the hearing aid fitting session as well as at 3 and 6 months postfitting. Cognitive abilities were assessed using a reading span test, which is a measure of working memory capacity, and a cognitive test battery. Results showed a significant correlation between reading span and speech reception threshold during the hearing aid fitting session. This relation was significantly weakened over the first 6 months of hearing aid use. Multiple regression analysis showed that reading span was the main predictor of speech recognition thresholds in noise when hearing aids were first fitted, but that the pure-tone average hearing threshold was the main predictor 6 months later. One way of explaining the results is that working memory capacity plays a more important role in speech recognition in noise initially rather than after 6 months of use. We propose that new hearing aid users engage working memory capacity to recognize unfamiliar processed speech signals because the phonological form of these signals cannot be automatically matched to phonological representations in long-term memory. As familiarization proceeds, the mismatch effect is alleviated, and the engagement of working memory capacity is reduced.

  7. Dynamic relation between working memory capacity and speech recognition in noise during the first 6 months of hearing aid use.

    Science.gov (United States)

    Ng, Elaine H N; Classon, Elisabet; Larsby, Birgitta; Arlinger, Stig; Lunner, Thomas; Rudner, Mary; Rönnberg, Jerker

    2014-11-23

    The present study aimed to investigate the changing relationship between aided speech recognition and cognitive function during the first 6 months of hearing aid use. Twenty-seven first-time hearing aid users with symmetrical mild to moderate sensorineural hearing loss were recruited. Aided speech recognition thresholds in noise were obtained in the hearing aid fitting session as well as at 3 and 6 months postfitting. Cognitive abilities were assessed using a reading span test, which is a measure of working memory capacity, and a cognitive test battery. Results showed a significant correlation between reading span and speech reception threshold during the hearing aid fitting session. This relation was significantly weakened over the first 6 months of hearing aid use. Multiple regression analysis showed that reading span was the main predictor of speech recognition thresholds in noise when hearing aids were first fitted, but that the pure-tone average hearing threshold was the main predictor 6 months later. One way of explaining the results is that working memory capacity plays a more important role in speech recognition in noise initially rather than after 6 months of use. We propose that new hearing aid users engage working memory capacity to recognize unfamiliar processed speech signals because the phonological form of these signals cannot be automatically matched to phonological representations in long-term memory. As familiarization proceeds, the mismatch effect is alleviated, and the engagement of working memory capacity is reduced. © The Author(s) 2014.

  8. Efficient CEPSTRAL Normalization for Robust Speech Recognition

    Science.gov (United States)

    1993-01-01

    CMN curve depends on the duration of the utterance, and is plotted in Figure 2 for the average duration in the DARPA Wall Street Journal task, 7...task consisting of dictation of sentences from the Wall Street Journal. A component of that evaluation involved utterances from a set of unknown... Wall Street Journal domain. The version of SPHINX-II used for this evaluation was configured to maximize the robustness of the recognition pro

  9. Finding Acoustic Regularities in Speech: Applications to Phonetic Recognition

    Science.gov (United States)

    1988-12-01

    Chow, R.M. Schwartz, S. Roucos, O.A. Kimball, P.J. Price, G.F. Kubala, M.O. Dunham, M.A. Krasner, and J. Makhoul, "The role of word...Krasner, G.F. Kubala, J. Makhoul, P.J. Price, S. Roucos, and R.M. Schwartz, "BYBLOS: The BBN continuous speech recognition system," Proc. ICASSP, pp...correlated signals," IRE Trans. Inform. Theory, vol. 2, pp. 41-46, 1956. [64] G.F. Kubala, "A feature space for acoustic-phonetic decoding of speech

  10. Post-editing through Speech Recognition

    DEFF Research Database (Denmark)

    Mesa-Lao, Bartolomé

    As a continuation of the pioneering work done in the SEECAT project, our presentation will report on a feasibility study where post-editor trainees will be asked to post-edit raw MT using voice and keyboard as an input method. This feasibility study will explore the potential of combining one of the most popular computer-aided translation workbenches in the market (i.e. MemoQ) together with one of the most well-known ASR packages (i.e. Dragon Naturally Speaking from Nuance). Two data correction modes will be considered: a) keyboard vs. b) keyboard and speech combined. These two different ways of verifying and correcting raw MT output will be examined making comparisons in terms of: i) overall time to complete the task, ii) final quality of the target text, and iii) user satisfaction.

  11. A Context Dependent Automatic Target Recognition System

    Science.gov (United States)

    Kim, J. H.; Payton, D. W.; Olin, K. E.; Tseng, D. Y.

    1984-06-01

    This paper describes a new approach to automatic target recognizer (ATR) development utilizing artificial intelligence techniques. The ATR system exploits contextual information in its detection and classification processes to provide a high degree of robustness and adaptability. In the system, knowledge about domain objects and their contextual relationships is encoded in frames, separating it from low level image processing algorithms. This knowledge-based system demonstrates an improvement over the conventional statistical approach through the exploitation of diverse forms of knowledge in its decision-making process.

  12. Two Systems for Automatic Music Genre Recognition

    DEFF Research Database (Denmark)

    Sturm, Bob L.

    2012-01-01

    We re-implement and test two state-of-the-art systems for automatic music genre classification; but unlike past works in this area, we look closer than ever before at their behavior. First, we look at specific instances where each system consistently applies the same wrong label across multiple trials of cross-validation. Second, we test the robustness of each system to spectral equalization. Finally, we test how well human subjects recognize the genres of music excerpts composed by each system to be highly genre representative. Our results suggest that neither high-performing system has a capacity to recognize music genre.

  13. Exploiting temporal correlation of speech for error robust and bandwidth flexible distributed speech recognition

    DEFF Research Database (Denmark)

    Tan, Zheng-Hua; Dalsgaard, Paul; Lindberg, Børge

    2007-01-01

    In this paper the temporal correlation of speech is exploited in front-end feature extraction, client based error recovery and server based error concealment (EC) for distributed speech recognition. First, the paper investigates a half frame rate (HFR) front-end that uses double frame shifting...... and concealment is conducted at the sub-vector level as opposed to conventional techniques where an entire vector is replaced even though only a single bit error occurs. The sub-vector EC is further combined with weighted Viterbi decoding. Encouraging recognition results are observed for the proposed techniques....... Lastly, to understand the effects of applying various EC techniques, this paper introduces three approaches consisting of speech feature, dynamic programming distance and hidden Markov model state duration comparison....

  14. How does real affect affect affect recognition in speech?

    NARCIS (Netherlands)

    Truong, Khiet Phuong

    2009-01-01

    The automatic analysis of affect is a relatively new and challenging multidisciplinary research area that has gained a lot of interest over the past few years. The research and development of affect recognition systems has opened many opportunities for improving the interaction between man and

  15. Regularized Speaker Adaptation of KL-HMM for Dysarthric Speech Recognition

    Science.gov (United States)

    Kim, Myungjong; Kim, Younggwan; Yoo, Joohong; Wang, Jun; Kim, Hoirin

    2017-01-01

    This paper addresses the problem of recognizing the speech uttered by patients with dysarthria, which is a motor speech disorder impeding the physical production of speech. Patients with dysarthria have articulatory limitation, and therefore, they often have trouble in pronouncing certain sounds, resulting in undesirable phonetic variation. Modern automatic speech recognition systems designed for regular speakers are ineffective for dysarthric sufferers due to the phonetic variation. To capture the phonetic variation, Kullback-Leibler divergence based hidden Markov model (KL-HMM) is adopted, where the emission probability of state is parametrized by a categorical distribution using phoneme posterior probabilities obtained from a deep neural network-based acoustic model. To further reflect speaker-specific phonetic variation patterns, a speaker adaptation method based on a combination of L2 regularization and confusion-reducing regularization which can enhance discriminability between categorical distributions of KL-HMM states while preserving speaker-specific information is proposed. Evaluation of the proposed speaker adaptation method on a database of several hundred words for 30 speakers consisting of 12 mildly dysarthric, 8 moderately dysarthric, and 10 non-dysarthric control speakers showed that the proposed approach significantly outperformed the conventional deep neural network based speaker adapted system on dysarthric as well as non-dysarthric speech. PMID:28320669

  16. Enhancing the magnitude spectrum of speech features for robust speech recognition

    Science.gov (United States)

    Hung, Jeih-weih; Fan, Hao-teng; Tu, Wen-hsiang

    2012-12-01

    In this article, we present an effective compensation scheme to improve noise robustness for the spectra of speech signals. In this compensation scheme, called magnitude spectrum enhancement (MSE), a voice activity detection (VAD) process is performed on the frame sequence of the utterance. The magnitude spectra of non-speech frames are then reduced while those of speech frames are amplified. In experiments conducted on the Aurora-2 noisy digits database, MSE achieves an error reduction rate of nearly 42% relative to baseline processing. This method outperforms well-known spectral-domain speech enhancement techniques, including spectral subtraction (SS) and Wiener filtering (WF). In addition, the proposed MSE can be integrated with cepstral-domain robustness methods, such as mean and variance normalization (MVN) and histogram normalization (HEQ), to achieve further improvements in recognition accuracy under noise-corrupted environments.
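
    A rough sketch of the MSE idea follows: a crude energy-based VAD labels frames, non-speech magnitudes are attenuated and speech magnitudes amplified, and the signal is resynthesised. The VAD rule and the gain factors are simplifying assumptions; the paper's exact procedure may differ.

        import numpy as np
        from scipy import signal

        def magnitude_spectrum_enhancement(x, fs, boost=1.2, cut=0.5):
            _, _, S = signal.stft(x, fs=fs, nperseg=256)
            mag, phase = np.abs(S), np.angle(S)
            frame_energy = mag.sum(axis=0)
            is_speech = frame_energy > 1.5 * np.median(frame_energy)   # crude energy VAD
            gain = np.where(is_speech, boost, cut)                     # per-frame gain
            _, y = signal.istft(mag * gain * np.exp(1j * phase), fs=fs, nperseg=256)
            return y

        enhanced = magnitude_spectrum_enhancement(np.random.randn(16000), 16000)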

  17. Dynamic Bayesian Networks for Audio-Visual Speech Recognition

    Directory of Open Access Journals (Sweden)

    Liang Luhong

    2002-01-01

    Full Text Available The use of visual features in audio-visual speech recognition (AVSR) is justified by both the speech generation mechanism, which is essentially bimodal in audio and visual representation, and by the need for features that are invariant to acoustic noise perturbation. As a result, current AVSR systems demonstrate significant accuracy improvements in environments affected by acoustic noise. In this paper, we describe the use of two statistical models for audio-visual integration, the coupled HMM (CHMM) and the factorial HMM (FHMM), and compare the performance of these models with the existing models used in speaker dependent audio-visual isolated word recognition. The statistical properties of both the CHMM and FHMM allow modelling of the state asynchrony of the audio and visual observation sequences while preserving their natural correlation over time. In our experiments, the CHMM performs best overall, outperforming all the existing models and the FHMM.

  18. Automatization and Orthographic Development in Second Language Visual Word Recognition

    Science.gov (United States)

    Kida, Shusaku

    2016-01-01

    The present study investigated second language (L2) learners' acquisition of automatic word recognition and the development of L2 orthographic representation in the mental lexicon. Participants in the study were Japanese university students enrolled in a compulsory course involving a weekly 30-minute sustained silent reading (SSR) activity with…

  19. Auditory signal design for automatic number plate recognition system

    NARCIS (Netherlands)

    Heydra, C.G.; Jansen, R.J.; Van Egmond, R.

    2014-01-01

    This paper focuses on the design of an auditory signal for the Automatic Number Plate Recognition system of Dutch national police. The auditory signal is designed to alert police officers of suspicious cars in their proximity, communicating priority level and location of the suspicious car and

  20. Quality Assessment of Compressed Video for Automatic License Plate Recognition

    DEFF Research Database (Denmark)

    Ukhanova, Ann; Støttrup-Andersen, Jesper; Forchhammer, Søren

    2014-01-01

    Definition of video quality requirements for video surveillance poses new questions in the area of quality assessment. This paper presents a quality assessment experiment for an automatic license plate recognition scenario. We explore the influence of the compression by H.264/AVC and H.265/HEVC...

  1. Subauditory Speech Recognition based on EMG/EPG Signals

    Science.gov (United States)

    Jorgensen, Charles; Lee, Diana Dee; Agabon, Shane; Lau, Sonie (Technical Monitor)

    2003-01-01

    Sub-vocal electromyogram/electropalatogram (EMG/EPG) signal classification is demonstrated as a method for silent speech recognition. Recorded electrode signals from the larynx and sublingual areas below the jaw are noise filtered and transformed into features using complex dual quad tree wavelet transforms. Feature sets for six sub-vocally pronounced words are trained using a trust region scaled conjugate gradient neural network. Real time signals for previously unseen patterns are classified into categories suitable for primitive control of graphic objects. Feature construction, recognition accuracy and an approach for extension of the technique to a variety of real world application areas are presented.

  2. Speech Recognition Challenge in the Wild: Arabic MGB-3

    OpenAIRE

    Ali, Ahmed; Vogel, Stephan; Renals, Steve

    2017-01-01

    This paper describes the Arabic MGB-3 Challenge - Arabic Speech Recognition in the Wild. Unlike last year's Arabic MGB-2 Challenge, for which the recognition task was based on more than 1,200 hours broadcast TV news recordings from Aljazeera Arabic TV programs, MGB-3 emphasises dialectal Arabic using a multi-genre collection of Egyptian YouTube videos. Seven genres were used for the data collection: comedy, cooking, family/kids, fashion, drama, sports, and science (TEDx). A total of 16 hours ...

  3. Radar automatic target recognition (ATR) and non-cooperative target recognition (NCTR)

    CERN Document Server

    Blacknell, David

    2013-01-01

    The ability to detect and locate targets by day or night, over wide areas, regardless of weather conditions has long made radar a key sensor in many military and civil applications. However, the ability to automatically and reliably distinguish different targets represents a difficult challenge. Radar Automatic Target Recognition (ATR) and Non-Cooperative Target Recognition (NCTR) captures material presented in the NATO SET-172 lecture series to provide an overview of the state-of-the-art and continuing challenges of radar target recognition. Topics covered include the problem as applied to th

  4. Automatic gang graffiti recognition and interpretation

    Science.gov (United States)

    Parra, Albert; Boutin, Mireille; Delp, Edward J.

    2017-09-01

    One of the roles of emergency first responders (e.g., police and fire departments) is to prevent and protect against events that can jeopardize the safety and well-being of a community. In the case of criminal gang activity, tools are needed for finding, documenting, and taking the necessary actions to mitigate the problem or issue. We describe an integrated mobile-based system capable of using location-based services, combined with image analysis, to track and analyze gang activity through the acquisition, indexing, and recognition of gang graffiti images. This approach uses image analysis methods for color recognition, image segmentation, and image retrieval and classification. A database of gang graffiti images is described that includes not only the images but also metadata related to the images, such as date and time, geoposition, gang, gang member, colors, and symbols. The user can then query the data in a useful manner. We have implemented these features both as applications for Android and iOS hand-held devices and as a web-based interface.

  5. Shortlist B: A Bayesian model of continuous speech recognition

    OpenAIRE

    Norris, D.; McQueen, J.

    2008-01-01

    A Bayesian model of continuous speech recognition is presented. It is based on Shortlist ( D. Norris, 1994; D. Norris, J. M. McQueen, A. Cutler, & S. Butterfield, 1997) and shares many of its key assumptions: parallel competitive evaluation of multiple lexical hypotheses, phonologically abstract prelexical and lexical representations, a feedforward architecture with no online feedback, and a lexical segmentation algorithm based on the viability of chunks of the input as possible words. Shortl...

  6. Appropriate baseline values for HMM-based speech recognition

    CSIR Research Space (South Africa)

    Barnard, E

    2004-11-01

    Full Text Available. Authors: Etienne Gouws, Kobus Wolvaardt, Neil Kleynhans, Etienne Barnard (Department of Electrical, Electronic and Computer Engineering, University of Pretoria, Pretoria, South Africa; ebarnard@up.ac.za). Keywords: Hidden Markov Models (HMM), feature sets, mixture models, pronunciation dictionaries, monophones, triphones. The introduction notes a growing awareness that Human Language Technologies can play a significant role in bridging the digital divide.

  7. Speech Signal Analysis and Pattern Recognition in Diagnosis of Dysarthria

    OpenAIRE

    Thoppil, Minu George; Kumar, C. Santhosh; Kumar, Anand; Amose, John

    2017-01-01

    Background: Dysarthria refers to a group of disorders resulting from disturbances in muscular control over the speech mechanism due to damage of central or peripheral nervous system. There is wide subjective variability in assessment of dysarthria between different clinicians. In our study, we tried to identify a pattern among types of dysarthria by acoustic analysis and to prevent intersubject variability. Objectives: (1) Pattern recognition among types of dysarthria with software tool and t...

  8. A New Fuzzy Cognitive Map Learning Algorithm for Speech Emotion Recognition

    OpenAIRE

    Zhang, Wei; Zhang, Xueying; Sun, Ying

    2017-01-01

    Selecting an appropriate recognition method is crucial in speech emotion recognition applications. However, the current methods do not consider the relationship between emotions. Thus, in this study, a speech emotion recognition system based on the fuzzy cognitive map (FCM) approach is constructed. Moreover, a new FCM learning algorithm for speech emotion recognition is proposed. This algorithm includes the use of the pleasure-arousal-dominance emotion scale to calculate the weights between e...

  9. Automatic Recognition of Object Names in Literature

    Science.gov (United States)

    Bonnin, C.; Lesteven, S.; Derriere, S.; Oberto, A.

    2008-08-01

    SIMBAD is a database of astronomical objects that provides (among other things) their bibliographic references in a large number of journals. Currently, these references have to be entered manually by librarians who read each paper. To cope with the increasing number of papers, CDS is developing a tool to assist the librarians in their work, taking advantage of the Dictionary of Nomenclature of Celestial Objects, which keeps track of object acronyms and of their origin. The program searches for object names directly in PDF documents by comparing the words with all the formats stored in the Dictionary of Nomenclature. It also searches for variable star names based on constellation names and for a large list of usual names such as Aldebaran or the Crab. Object names found in the documents often correspond to several astronomical objects. The system retrieves all possible matches, displays them with their object type given by SIMBAD, and lets the librarian make the final choice. The bibliographic reference can then be automatically added to the object identifiers in the database. Besides, the systematic usage of the Dictionary of Nomenclature, which is updated manually, has made it possible to check it automatically and to detect errors and inconsistencies. Last but not least, the program collects some additional information such as the position of the object names in the document (in the title, subtitle, abstract, table, figure caption...) and their number of occurrences. In the future, this will make it possible to calculate the 'weight' of an object in a reference and to provide SIMBAD users with important new information, which will help them find the most relevant papers in the object reference list.

  10. Increase in Speech Recognition Due to Linguistic Mismatch between Target and Masker Speech: Monolingual and Simultaneous Bilingual Performance

    Science.gov (United States)

    Calandruccio, Lauren; Zhou, Haibo

    2014-01-01

    Purpose: To examine whether improved speech recognition during linguistically mismatched target-masker experiments is due to linguistic unfamiliarity of the masker speech or linguistic dissimilarity between the target and masker speech. Method: Monolingual English speakers (n = 20) and English-Greek simultaneous bilinguals (n = 20) listened to…

  11. EXTENDED SPEECH EMOTION RECOGNITION AND PREDICTION

    Directory of Open Access Journals (Sweden)

    Theodoros Anagnostopoulos

    2014-11-01

    Full Text Available Humans are considered to reason and act rationally, and this is believed to be what fundamentally distinguishes them from other living entities. Furthermore, modern approaches in psychology underline that humans, as thinking creatures, are also sentimental and emotional organisms. There are fifteen universal extended emotions plus a neutral emotion: hot anger, cold anger, panic, fear, anxiety, despair, sadness, elation, happiness, interest, boredom, shame, pride, disgust, contempt and the neutral position. The scope of the current research is to understand the emotional state of a human being by capturing the speech utterances used during ordinary conversation. It is shown that, given sufficient acoustic evidence, the emotional state of a person can be classified by a set of majority-voting classifiers. The proposed set of classifiers is based on three main classifiers: kNN, C4.5 and SVM with an RBF kernel. This set achieves better performance than each basic classifier taken separately. It is compared with two other sets of classifiers: a one-against-all (OAA) multiclass SVM with hybrid kernels, and a set consisting of two basic classifiers, C5.0 and a neural network. The proposed variant achieves better performance than the other two sets of classifiers. The paper thus deals with emotion classification by a set of majority-voting classifiers that combines three particular types of basic classifiers with low computational complexity. The basic classifiers stem from different theoretical backgrounds in order to avoid bias and redundancy, which gives the proposed set of classifiers the ability to generalize in the emotion domain space.
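
    A minimal sketch of such a majority-voting ensemble is shown below, using scikit-learn and assuming precomputed acoustic feature vectors; the CART decision tree stands in for C4.5, and the data are random placeholders rather than real emotional speech.

        # Illustrative majority-voting ensemble: kNN + decision tree (C4.5 stand-in) + RBF SVM.
        import numpy as np
        from sklearn.ensemble import VotingClassifier
        from sklearn.neighbors import KNeighborsClassifier
        from sklearn.pipeline import make_pipeline
        from sklearn.preprocessing import StandardScaler
        from sklearn.svm import SVC
        from sklearn.tree import DecisionTreeClassifier

        rng = np.random.default_rng(1)
        X = rng.standard_normal((160, 24))          # placeholder acoustic features
        y = rng.integers(0, 16, size=160)           # 15 extended emotions + neutral

        ensemble = VotingClassifier(
            estimators=[
                ("knn", KNeighborsClassifier(n_neighbors=5)),
                ("tree", DecisionTreeClassifier(max_depth=8, random_state=0)),
                ("svm", make_pipeline(StandardScaler(), SVC(kernel="rbf", gamma="scale"))),
            ],
            voting="hard",                          # simple majority vote across the three classifiers
        )
        ensemble.fit(X, y)
        print(ensemble.predict(X[:3]))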

  12. Cascaded automatic target recognition (Cascaded ATR)

    Science.gov (United States)

    Walls, Bradley

    2010-04-01

    The global war on terror has plunged US and coalition forces into a battle space requiring the continuous adaptation of tactics and technologies to cope with an elusive enemy. As a result, technologies that enhance the intelligence, surveillance, and reconnaissance (ISR) mission making the warfighter more effective are experiencing increased interest. In this paper we show how a new generation of smart cameras built around foveated sensing makes possible a powerful ISR technique termed Cascaded ATR. Foveated sensing is an innovative optical concept in which a single aperture captures two distinct fields of view. In Cascaded ATR, foveated sensing is used to provide a coarse resolution, persistent surveillance, wide field of view (WFOV) detector to accomplish detection level perception. At the same time, within the foveated sensor, these detection locations are passed as a cue to a steerable, high fidelity, narrow field of view (NFOV) detector to perform recognition level perception. Two new ISR mission scenarios, utilizing Cascaded ATR, are proposed.

  13. Northeast Artificial Intelligence Consortium Annual Report. 1988 Artificial Intelligence Applications to Speech Recognition. Volume 8

    Science.gov (United States)

    1989-10-01

    Northeast Artificial Intelligence Consortium Annual Report 1988, Volume 8: Artificial Intelligence Applications to Speech Recognition. Syracuse University. Authors: Harvey E. Rhody, Thomas R. Ridley, John A. Biles.

  14. Robust Speaker Recognition with Combined Use of Acoustic and Throat Microphone Speech

    DEFF Research Database (Denmark)

    Sahidullah, Md; Gonzalez Hautamäki, Rosa; Thomsen, Dennis Alexander Lehmann

    2016-01-01

    Accuracy of automatic speaker recognition (ASV) systems degrades severely in the presence of background noise. In this paper, we study the use of additional side information provided by a body-conducted sensor, throat microphone. Throat microphone signal is much less affected by background noise...... in comparison to acoustic microphone signal. This makes throat microphones potentially useful for feature extraction or speech activity detection. This paper, firstly, proposes a new prototype system for simultaneous data-acquisition of acoustic and throat microphone signals. Secondly, we study the use...... of this additional information for both speech activity detection, feature extraction and fusion of the acoustic and throat microphone signals. We collect a pilot database consisting of 38 subjects including both clean and noisy sessions. We carry out speaker verification experiments using Gaussian mixture model...

  15. Speech recognition: impact on workflow and report availability

    International Nuclear Information System (INIS)

    Glaser, C.; Trumm, C.; Nissen-Meyer, S.; Francke, M.; Kuettner, B.; Reiser, M.

    2005-01-01

    With ongoing technical refinements speech recognition systems (SRS) are becoming an increasingly attractive alternative to traditional methods of preparing and transcribing medical reports. The two main components of any SRS are the acoustic model and the language model. Features of modern SRS with continuous speech recognition are macros with individually definable texts and report templates as well as the option to navigate in a text or to control SRS or RIS functions by speech recognition. The best benefit from SRS can be obtained if it is integrated into a RIS/RIS-PACS installation. Report availability and time efficiency of the reporting process (related to recognition rate, time expenditure for editing and correcting a report) are the principal determinants of the clinical performance of any SRS. For practical purposes the recognition rate is estimated by the error rate (unit ''word''). Error rates range from 4 to 28%. Roughly 20% of them are errors in the vocabulary which may result in clinically relevant misinterpretation. It is thus mandatory to thoroughly correct any transcribed text as well as to continuously train and adapt the SRS vocabulary. The implementation of SRS dramatically improves report availability. This is most pronounced for CT and CR. However, the individual time expenditure for (SRS-based) reporting increased by 20-25% (CR) and according to literature data there is an increase by 30% for CT and MRI. The extent to which the transcription staff profits from SRS depends largely on its qualification. Online dictation implies a workload shift from the transcription staff to the reporting radiologist. (orig.) [de

  16. Spike Pattern Recognition for Automatic Collimation Alignment

    CERN Document Server

    Azzopardi, Gabriella; Salvachua Ferrando, Belen Maria; Mereghetti, Alessio; Redaelli, Stefano; CERN. Geneva. ATS Department

    2017-01-01

    The LHC makes use of a collimation system to protect its sensitive equipment by intercepting potentially dangerous beam halo particles. The appropriate collimator settings to protect the machine against beam losses rely on a very precise alignment of all the collimators with respect to the beam. The beam center at each collimator is then found by touching the beam halo using an alignment procedure. Until now, in order to determine whether a collimator is aligned with the beam or not, a user is required to follow the collimator’s BLM loss data and detect spikes. A machine learning (ML) model was trained in order to automatically recognize spikes when a collimator is aligned. The model was loosely integrated with the alignment implementation to determine the classification performance and reliability, without affecting the alignment process itself. The model was tested on a number of collimators during this MD, and the ML model was able to output the classifications in real time.

  17. The self-advantage in visual speech processing enhances audiovisual speech recognition in noise.

    Science.gov (United States)

    Tye-Murray, Nancy; Spehar, Brent P; Myerson, Joel; Hale, Sandra; Sommers, Mitchell S

    2015-08-01

    Individuals lip read themselves more accurately than they lip read others when only the visual speech signal is available (Tye-Murray et al., Psychonomic Bulletin & Review, 20, 115-119, 2013). This self-advantage for vision-only speech recognition is consistent with the common-coding hypothesis (Prinz, European Journal of Cognitive Psychology, 9, 129-154, 1997), which posits (1) that observing an action activates the same motor plan representation as actually performing that action and (2) that observing one's own actions activates motor plan representations more than the others' actions because of greater congruity between percepts and corresponding motor plans. The present study extends this line of research to audiovisual speech recognition by examining whether there is a self-advantage when the visual signal is added to the auditory signal under poor listening conditions. Participants were assigned to sub-groups for round-robin testing in which each participant was paired with every member of their subgroup, including themselves, serving as both talker and listener/observer. On average, the benefit participants obtained from the visual signal when they were the talker was greater than when the talker was someone else and also was greater than the benefit others obtained from observing as well as listening to them. Moreover, the self-advantage in audiovisual speech recognition was significant after statistically controlling for individual differences in both participants' ability to benefit from a visual speech signal and the extent to which their own visual speech signal benefited others. These findings are consistent with our previous finding of a self-advantage in lip reading and with the hypothesis of a common code for action perception and motor plan representation.

  18. Age-Related Differences in Lexical Access Relate to Speech Recognition in Noise

    Science.gov (United States)

    Carroll, Rebecca; Warzybok, Anna; Kollmeier, Birger; Ruigendijk, Esther

    2016-01-01

    Vocabulary size has been suggested as a useful measure of “verbal abilities” that correlates with speech recognition scores. Knowing more words is linked to better speech recognition. How vocabulary knowledge translates to general speech recognition mechanisms, how these mechanisms relate to offline speech recognition scores, and how they may be modulated by acoustical distortion or age, is less clear. Age-related differences in linguistic measures may predict age-related differences in speech recognition in noise performance. We hypothesized that speech recognition performance can be predicted by the efficiency of lexical access, which refers to the speed with which a given word can be searched and accessed relative to the size of the mental lexicon. We tested speech recognition in a clinical German sentence-in-noise test at two signal-to-noise ratios (SNRs), in 22 younger (18–35 years) and 22 older (60–78 years) listeners with normal hearing. We also assessed receptive vocabulary, lexical access time, verbal working memory, and hearing thresholds as measures of individual differences. Age group, SNR level, vocabulary size, and lexical access time were significant predictors of individual speech recognition scores, but working memory and hearing threshold were not. Interestingly, longer accessing times were correlated with better speech recognition scores. Hierarchical regression models for each subset of age group and SNR showed very similar patterns: the combination of vocabulary size and lexical access time contributed most to speech recognition performance; only for the younger group at the better SNR (yielding about 85% correct speech recognition) did vocabulary size alone predict performance. Our data suggest that successful speech recognition in noise is mainly modulated by the efficiency of lexical access. This suggests that older adults’ poorer performance in the speech recognition task may have arisen from reduced efficiency in lexical access

  19. Exploiting temporal correlation of speech for error robust and bandwidth flexible distributed speech recognition

    DEFF Research Database (Denmark)

    Tan, Zheng-Hua; Dalsgaard, Paul; Lindberg, Børge

    2007-01-01

    In this paper the temporal correlation of speech is exploited in front-end feature extraction, client based error recovery and server based error concealment (EC) for distributed speech recognition. First, the paper investigates a half frame rate (HFR) front-end that uses double frame shifting...... at the client side. At the server side, each HFR feature vector is duplicated to construct a full frame rate (FFR) feature sequence. This HFR front-end gives comparable performance to the FFR front-end but contains only half the FFR features. Secondly, different arrangements of the other half of the FFR...

  20. Automatic anatomy recognition via multiobject oriented active shape models.

    Science.gov (United States)

    Chen, Xinjian; Udupa, Jayaram K; Alavi, Abass; Torigian, Drew A

    2010-12-01

    This paper studies the feasibility of developing an automatic anatomy recognition (AAR) system in clinical radiology and demonstrates its operation on clinical 2D images. The anatomy recognition method described here consists of two main components: (a) multiobject generalization of OASM and (b) object recognition strategies. The OASM algorithm is generalized to multiple objects by including a model for each object and assigning a cost structure specific to each object in the spirit of live wire. The delineation of multiobject boundaries is done in MOASM via a three level dynamic programming algorithm, wherein the first level is at pixel level which aims to find optimal oriented boundary segments between successive landmarks, the second level is at landmark level which aims to find optimal location for the landmarks, and the third level is at the object level which aims to find optimal arrangement of object boundaries over all objects. The object recognition strategy attempts to find that pose vector (consisting of translation, rotation, and scale component) for the multiobject model that yields the smallest total boundary cost for all objects. The delineation and recognition accuracies were evaluated separately utilizing routine clinical chest CT, abdominal CT, and foot MRI data sets. The delineation accuracy was evaluated in terms of true and false positive volume fractions (TPVF and FPVF). The recognition accuracy was assessed (1) in terms of the size of the space of the pose vectors for the model assembly that yielded high delineation accuracy, (2) as a function of the number of objects and objects' distribution and size in the model, (3) in terms of the interdependence between delineation and recognition, and (4) in terms of the closeness of the optimum recognition result to the global optimum. When multiple objects are included in the model, the delineation accuracy in terms of TPVF can be improved to 97%-98% with a low FPVF of 0.1%-0.2%. Typically, a

  1. Recognition of In-Ear Microphone Speech Data Using Multi-Layer Neural Networks

    National Research Council Canada - National Science Library

    Bulbuller, Gokhan

    2006-01-01

    .... In this study, a speech recognition system is presented, specifically an isolated word recognizer which uses speech collected from the external auditory canals of the subjects via an in-ear microphone...

  2. Advanced Audio Interface for Phonetic Speech Recognition in a High Noise Environment

    National Research Council Canada - National Science Library

    2000-01-01

    Standard Object Systems, Inc. (SOS) has used its existing technology in phonetic speech recognition, audio signal processing, and multilingual language translation to design and demonstrate an advanced audio interface for speech...

  3. Robust Speech Recognition Using Factorial HMMs for Home Environments

    Directory of Open Access Journals (Sweden)

    Betkowska Agnieszka

    2007-01-01

    Full Text Available We focus on the problem of speech recognition in the presence of nonstationary sudden noise, which is very likely to happen in home environments. As a model compensation method for this problem, we investigated the use of factorial hidden Markov model (FHMM) architecture developed from a clean-speech hidden Markov model (HMM) and a sudden-noise HMM. While in conventional studies this architecture is defined only for static features of the observation vector, we extended it to dynamic features. In addition, we performed home-environment adaptation of FHMMs to the characteristics of a given house. A database recorded by a personal robot called PaPeRo in home environments was used for the evaluation of the proposed method. Isolated word recognition experiments demonstrated the effectiveness of the proposed method under noisy conditions. Home-dependent word FHMMs (HD-FHMMs) reduced the word error rate by 20.5 compared to that of the clean-speech word HMMs.

  4. Robust Speech Recognition Using Factorial HMMs for Home Environments

    Science.gov (United States)

    Betkowska, Agnieszka; Shinoda, Koichi; Furui, Sadaoki

    2007-12-01

    We focus on the problem of speech recognition in the presence of nonstationary sudden noise, which is very likely to happen in home environments. As a model compensation method for this problem, we investigated the use of factorial hidden Markov model (FHMM) architecture developed from a clean-speech hidden Markov model (HMM) and a sudden-noise HMM. While in conventional studies this architecture is defined only for static features of the observation vector, we extended it to dynamic features. In addition, we performed home-environment adaptation of FHMMs to the characteristics of a given house. A database recorded by a personal robot called PaPeRo in home environments was used for the evaluation of the proposed method. Isolated word recognition experiments demonstrated the effectiveness of the proposed method under noisy conditions. Home-dependent word FHMMs (HD-FHMMs) reduced the word error rate by 20.5 compared to that of the clean-speech word HMMs.

  5. Automatic Facial Expression Recognition and Operator Functional State

    Science.gov (United States)

    Blanson, Nina

    2012-01-01

    The prevalence of human error in safety-critical occupations remains a major challenge to mission success despite increasing automation in control processes. Although various methods have been proposed to prevent incidences of human error, none of these have been developed to employ the detection and regulation of Operator Functional State (OFS), or the optimal condition of the operator while performing a task, in work environments due to drawbacks such as obtrusiveness and impracticality. A video-based system with the ability to infer an individual's emotional state from facial feature patterning mitigates some of the problems associated with other methods of detecting OFS, like obtrusiveness and impracticality in integration with the mission environment. This paper explores the utility of facial expression recognition as a technology for inferring OFS by first expounding on the intricacies of OFS and the scientific background behind emotion and its relationship with an individual's state. Then, descriptions of the feedback loop and the emotion protocols proposed for the facial recognition program are explained. A basic version of the facial expression recognition program uses Haar classifiers and OpenCV libraries to automatically locate key facial landmarks during a live video stream. Various methods of creating facial expression recognition software are reviewed to guide future extensions of the program. The paper concludes with an examination of the steps necessary in the research of emotion and recommendations for the creation of an automatic facial expression recognition program for use in real-time, safety-critical missions

  6. Production ready feature recognition based automatic group technology part coding

    Energy Technology Data Exchange (ETDEWEB)

    Ames, A.L.

    1990-01-01

    During the past four years, a feature recognition based expert system for automatically performing group technology part coding from solid model data has been under development. The system has become a production quality tool, capable of quickly generating the geometry-based portions of a part code with no human intervention. It has been tested on over 200 solid models, half of which are models of production Sandia designs. Its performance rivals that of humans performing the same task, often surpassing them in speed and uniformity. The feature recognition capability developed for part coding is being extended to support other applications, such as manufacturability analysis, automatic decomposition (for finite element meshing and machining), and assembly planning. Initial surveys of these applications indicate that the current capability will provide a strong basis for other applications and that extensions toward more global geometric reasoning and tighter coupling with solid modeler functionality will be necessary.

  7. Improving on hidden Markov models: An articulatorily constrained, maximum likelihood approach to speech recognition and speech coding

    Energy Technology Data Exchange (ETDEWEB)

    Hogden, J.

    1996-11-05

    The goal of the proposed research is to test a statistical model of speech recognition that incorporates the knowledge that speech is produced by relatively slow motions of the tongue, lips, and other speech articulators. This model is called Maximum Likelihood Continuity Mapping (Malcom). Many speech researchers believe that by using constraints imposed by articulator motions, we can improve or replace the current hidden Markov model based speech recognition algorithms. Unfortunately, previous efforts to incorporate information about articulation into speech recognition algorithms have suffered because (1) slight inaccuracies in our knowledge or the formulation of our knowledge about articulation may decrease recognition performance, (2) small changes in the assumptions underlying models of speech production can lead to large changes in the speech derived from the models, and (3) collecting measurements of human articulator positions in sufficient quantity for training a speech recognition algorithm is still impractical. The most interesting (and in fact, unique) quality of Malcom is that, even though Malcom makes use of a mapping between acoustics and articulation, Malcom can be trained to recognize speech using only acoustic data. By learning the mapping between acoustics and articulation using only acoustic data, Malcom avoids the difficulties involved in collecting articulator position measurements and does not require an articulatory synthesizer model to estimate the mapping between vocal tract shapes and speech acoustics. Preliminary experiments that demonstrate that Malcom can learn the mapping between acoustics and articulation are discussed. Potential applications of Malcom aside from speech recognition are also discussed. Finally, specific deliverables resulting from the proposed research are described.

  8. Multimodal Approach for Automatic Emotion Recognition Applied to the Tension Levels Study in TV Newscasts

    Directory of Open Access Journals (Sweden)

    Moisés Henrique Ramos Pereira

    2015-12-01

    Full Text Available This article addresses a multimodal approach to automatic emotion recognition in participants of TV newscasts (presenters, reporters, commentators and others), capable of assisting the study of tension levels in narratives of events in this television genre. The methodology applies state-of-the-art computational methods to process and analyze facial expressions, as well as speech signals. The proposed approach contributes to the semiodiscoursive study of TV newscasts and their enunciative praxis, assisting, for example, the identification of the communication strategy of these programs. To evaluate its effectiveness, the proposed approach was applied to a video of a report broadcast on a highly popular Brazilian TV newscast in the state of Minas Gerais. The experimental results are promising for the recognition of emotions from the facial expressions of telejournalists and are in accordance with the distribution of audiovisual indicators extracted over the newscast, demonstrating the potential of the approach to support TV journalistic discourse analysis.

  9. Model-based Sparse Component Analysis for Multiparty Distant Speech Recognition

    OpenAIRE

    Asaei, Afsaneh

    2013-01-01

    This research takes place in the general context of improving the performance of the Distant Speech Recognition (DSR) systems, tackling the reverberation and recognition of overlap speech. Perceptual modeling indicates that sparse representation exists in the auditory cortex. The present project thus builds upon the hypothesis that incorporating this information in DSR front-end processing could improve the speech recognition performance in realistic conditions including overlap and reverbera...

  10. Semi-automatic recognition of marine debris on beaches

    OpenAIRE

    Ge, Zhenpeng; Shi, Huahong; Mei, Xuefei; Dai, Zhijun; Li, Daoji

    2016-01-01

    An increasing amount of anthropogenic marine debris is pervading the earth's environmental systems, resulting in an enormous threat to living organisms. Additionally, the large amount of marine debris around the world has been investigated mostly through tedious manual methods. Therefore, we propose the use of a new technique, light detection and ranging (LIDAR), for the semi-automatic recognition of marine debris on a beach because of its substantially more efficient role in comparison with ...

  11. Automatic TLI recognition system. Part 1: System description

    Energy Technology Data Exchange (ETDEWEB)

    Partin, J.K.; Lassahn, G.D.; Davidson, J.R.

    1994-05-01

    This report describes an automatic target recognition system for fast screening of large amounts of multi-sensor image data, based on low-cost parallel processors. This system uses image data fusion and gives uncertainty estimates. It is relatively low cost, compact, and transportable. The software is easily enhanced to expand the system's capabilities, and the hardware is easily expandable to increase the system's speed. This volume gives a general description of the ATR system.

  12. Automatic dialogue act recognition using a dynamic Bayesian network

    OpenAIRE

    Dielmann, Alfred; Renals, Steve

    2007-01-01

    We propose a joint segmentation and classification approach for the dialogue act recognition task on natural multi-party meetings (ICSI Meeting Corpus). Five broad DA categories are automatically recognised using a generative Dynamic Bayesian Network based infrastructure. Prosodic features and a switching graphical model are used to estimate DA boundaries, in conjunction with a factored language model which is used to relate words and DA categories. This easily generalizable and extensible sy...

  13. Speech Signal Analysis and Pattern Recognition in Diagnosis of Dysarthria.

    Science.gov (United States)

    Thoppil, Minu George; Kumar, C Santhosh; Kumar, Anand; Amose, John

    2017-01-01

    Dysarthria refers to a group of disorders resulting from disturbances in muscular control over the speech mechanism due to damage to the central or peripheral nervous system. There is wide subjective variability in assessment of dysarthria between different clinicians. In our study, we tried to identify a pattern among types of dysarthria by acoustic analysis and to prevent intersubject variability. (1) Pattern recognition among types of dysarthria with a software tool, with comparison to normal subjects. (2) To assess the severity of dysarthria with a software tool. Speech of seventy subjects was recorded, both normal subjects and dysarthric patients who attended the outpatient department or were admitted at AIMS. Speech waveforms were analyzed using Praat and a MATLAB toolkit. The pitch contour, formant variation, and speech duration of the extracted graphs were analyzed. The study population included 25 normal subjects and 45 dysarthric patients. Dysarthric subjects included 24 patients with extrapyramidal dysarthria, 14 cases of spastic dysarthria, and 7 cases of ataxic dysarthria. Analysis of pitch of the study population showed a specific pattern in each type. F0 jitter was found in spastic dysarthria, pitch break with ataxic dysarthria, and pitch monotonicity with extrapyramidal dysarthria. By pattern recognition, we identified 19 cases in which one or more recognized patterns coexisted. There was a significant correlation between the severity of dysarthria and formant range. Specific patterns were identified for types of dysarthria, so this software tool will help clinicians identify the types of dysarthria in a better way and could prevent intersubject variability. We also assessed the severity of dysarthria by formant range. Mixed dysarthria can be more common than clinically expected.
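
    For illustration only, the following sketch computes two of the kinds of measures mentioned above (local jitter and a simple pitch-variability index) from an F0 contour; the contour here is synthetic, and the formulas are generic signal-processing definitions rather than the exact procedure used in the study.

        # Generic pitch measurements from an F0 contour (e.g. one extracted with Praat).
        import numpy as np

        def pitch_measures(f0):
            """f0: per-frame fundamental-frequency estimates in Hz (NaN = unvoiced)."""
            voiced = f0[~np.isnan(f0)]
            periods = 1.0 / voiced
            jitter = np.mean(np.abs(np.diff(periods))) / np.mean(periods)   # relative local jitter
            variability = np.std(voiced) / np.mean(voiced)                  # low value -> monotone pitch
            return jitter, variability

        rng = np.random.default_rng(9)
        f0 = 120 + 5 * np.sin(np.linspace(0, 6 * np.pi, 200)) + rng.normal(0, 1.0, 200)
        print(pitch_measures(f0))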

  14. Speech Signal Analysis and Pattern Recognition in Diagnosis of Dysarthria

    Science.gov (United States)

    Thoppil, Minu George; Kumar, C. Santhosh; Kumar, Anand; Amose, John

    2017-01-01

    Background: Dysarthria refers to a group of disorders resulting from disturbances in muscular control over the speech mechanism due to damage to the central or peripheral nervous system. There is wide subjective variability in assessment of dysarthria between different clinicians. In our study, we tried to identify a pattern among types of dysarthria by acoustic analysis and to prevent intersubject variability. Objectives: (1) Pattern recognition among types of dysarthria with a software tool, with comparison to normal subjects. (2) To assess the severity of dysarthria with a software tool. Materials and Methods: Speech of seventy subjects was recorded, both normal subjects and dysarthric patients who attended the outpatient department or were admitted at AIMS. Speech waveforms were analyzed using Praat and a MATLAB toolkit. The pitch contour, formant variation, and speech duration of the extracted graphs were analyzed. Results: The study population included 25 normal subjects and 45 dysarthric patients. Dysarthric subjects included 24 patients with extrapyramidal dysarthria, 14 cases of spastic dysarthria, and 7 cases of ataxic dysarthria. Analysis of pitch of the study population showed a specific pattern in each type. F0 jitter was found in spastic dysarthria, pitch break with ataxic dysarthria, and pitch monotonicity with extrapyramidal dysarthria. By pattern recognition, we identified 19 cases in which one or more recognized patterns coexisted. There was a significant correlation between the severity of dysarthria and formant range. Conclusions: Specific patterns were identified for types of dysarthria, so this software tool will help clinicians identify the types of dysarthria in a better way and could prevent intersubject variability. We also assessed the severity of dysarthria by formant range. Mixed dysarthria can be more common than clinically expected. PMID:29184336

  15. Automatic Blastomere Recognition from a Single Embryo Image

    Directory of Open Access Journals (Sweden)

    Yun Tian

    2014-01-01

    Full Text Available The number of blastomeres of human day 3 embryos is one of the most important criteria for evaluating embryo viability. However, due to the transparency and overlap of blastomeres, it is a challenge to recognize blastomeres automatically using a single embryo image. This study proposes an approach based on least square curve fitting (LSCF) for automatic blastomere recognition from a single image. First, combining edge detection, deletion of multiple connected points, and dilation and erosion, an effective preprocessing method was designed to obtain the subset of blastomere edges that were singly connected. Next, an automatic recognition method for blastomeres was proposed using least square circle fitting. This algorithm was tested on 381 embryo microscopic images obtained from the eight-cell period, and the results were compared with those provided by experts. The proportion of embryos recognized with zero errors was 21.59%, and the proportion of embryos with at most two falsely recognized blastomeres was 83.16%. This experiment demonstrated that our method could efficiently and rapidly recognize the number of blastomeres from a single embryo image without the need to first reconstruct a three-dimensional model of the blastomeres; the method is simple and efficient.
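
    The least-squares circle-fitting step can be illustrated with a short algebraic (Kasa-style) fit on edge points; the sketch below is a generic implementation on synthetic points, not the authors' code, and the preprocessing (edge detection, dilation and erosion) is omitted.

        # Algebraic least-squares circle fit: x^2 + y^2 + D*x + E*y + F = 0.
        import numpy as np

        def fit_circle(points):
            """Fit a circle to Nx2 edge points; return center (cx, cy) and radius r."""
            x, y = points[:, 0], points[:, 1]
            A = np.column_stack([x, y, np.ones_like(x)])
            b = -(x ** 2 + y ** 2)
            D, E, F = np.linalg.lstsq(A, b, rcond=None)[0]
            cx, cy = -D / 2.0, -E / 2.0
            r = np.sqrt(cx ** 2 + cy ** 2 - F)
            return (cx, cy), r

        # Synthetic check: noisy points on a circle of radius 5 centered at (10, -3).
        theta = np.linspace(0, 2 * np.pi, 200)
        pts = np.column_stack([10 + 5 * np.cos(theta), -3 + 5 * np.sin(theta)])
        pts += np.random.default_rng(2).normal(scale=0.05, size=pts.shape)
        print(fit_circle(pts))   # approximately ((10, -3), 5)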

  16. Advanced automatic target recognition for police helicopter missions

    Science.gov (United States)

    Stahl, Christoph; Schoppmann, Paul

    2000-08-01

    The results of a case study about the application of an advanced method for automatic target recognition to infrared imagery taken from police helicopter missions are presented. The method consists of the following steps: preprocessing, classification, fusion, postprocessing and tracking, and combines the three paradigms image pyramids, neural networks and bayesian nets. The technology has been developed using a variety of different scenes typical for military aircraft missions. Infrared cameras have been in use for several years at the Bavarian police helicopter forces and are highly valuable for night missions. Several object classes like 'persons' or 'vehicles' are tested and the possible discrimination between persons and animals is shown. The analysis of complex scenes with hidden objects and clutter shows the potentials and limitations of automatic target recognition for real-world tasks. Several display concepts illustrate the achievable improvement of the situation awareness. The similarities and differences between various mission types concerning object variability, time constraints, consequences of false alarms, etc. are discussed. Typical police actions like searching for missing persons or runaway criminals illustrate the advantages of automatic target recognition. The results demonstrate the possible operational benefits for the helicopter crew. Future work will include performance evaluation issues and a system integration concept for the target platform.

  17. Variable Frame Rate and Length Analysis for Data Compression in Distributed Speech Recognition

    DEFF Research Database (Denmark)

    Kraljevski, Ivan; Tan, Zheng-Hua

    2014-01-01

    This paper addresses the issue of data compression in distributed speech recognition on the basis of a variable frame rate and length analysis method. The method first conducts frame selection by using a posteriori signal-to-noise ratio weighted energy distance to find the right time resolution...... length for steady regions. The method is applied to scalable source coding in distributed speech recognition where the target bitrate is met by adjusting the frame rate. Speech recognition results show that the proposed approach outperforms other compression methods in terms of recognition accuracy...... for noisy speech while achieving higher compression rates....
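
    The following sketch illustrates the general idea of variable-frame-rate selection under stated assumptions: frames are retained only when the accumulated, SNR-weighted energy distance since the last retained frame exceeds a threshold, so fewer frames are kept in steady regions. The weighting and threshold are simplified placeholders, not the paper's exact formulation.

        # Simplified variable-frame-rate frame selection driven by energy change.
        import numpy as np

        def select_frames(log_energy, snr_db, threshold=1.0):
            """log_energy, snr_db: per-frame arrays; returns indices of retained frames."""
            weights = np.clip(snr_db, 0.0, 30.0) / 30.0        # crude a posteriori SNR weighting
            kept, acc = [0], 0.0
            for t in range(1, len(log_energy)):
                acc += weights[t] * abs(log_energy[t] - log_energy[t - 1])
                if acc >= threshold:
                    kept.append(t)
                    acc = 0.0
            return np.array(kept)

        rng = np.random.default_rng(8)
        log_e = np.concatenate([np.full(50, -2.0),               # steady region
                                np.cumsum(rng.normal(0, 0.4, 50))])   # dynamic region
        snr = np.full(100, 20.0)
        idx = select_frames(log_e, snr)
        print(len(idx), "of 100 frames retained")   # far fewer frames come from the steady region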

  18. A system of automatic speaker recognition on a minicomputer

    International Nuclear Information System (INIS)

    El Chafei, Cherif

    1978-01-01

    This study describes a system of automatic speaker recognition using the pitch of the voice. The pre-processing consists of extracting the speakers' discriminating characteristics from the pitch. The recognition programme first performs a preselection and then calculates the distance between the characteristics of the speaker to be recognized and those of the speakers already recorded. A recognition experiment was carried out with 15 speakers, comprising 566 tests spread over an intermittent period of four months. The discriminating characteristics used offer several interesting qualities. The algorithms for measuring the characteristics on the one hand, and for classifying the speakers on the other, are simple. The results obtained in real time with a minicomputer are satisfactory. Furthermore, they could probably be improved by considering additional discriminating speaker characteristics, but this was not possible within this work. (author) [fr

  19. Auditory-model based robust feature selection for speech recognition.

    Science.gov (United States)

    Koniaris, Christos; Kuropatwinski, Marcin; Kleijn, W Bastiaan

    2010-02-01

    It is shown that robust dimension-reduction of a feature set for speech recognition can be based on a model of the human auditory system. Whereas conventional methods optimize classification performance, the proposed method exploits knowledge implicit in the auditory periphery, inheriting its robustness. Features are selected to maximize the similarity of the Euclidean geometry of the feature domain and the perceptual domain. Recognition experiments using mel-frequency cepstral coefficients (MFCCs) confirm the effectiveness of the approach, which does not require labeled training data. For noisy data the method outperforms commonly used discriminant-analysis based dimension-reduction methods that rely on labeling. The results indicate that selecting MFCCs in their natural order results in subsets with good performance.
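
    As a loose illustration of geometry-preserving feature selection, the sketch below greedily picks MFCC dimensions so that pairwise Euclidean distances in the selected subspace correlate best with distances in a reference domain; the reference here is a random placeholder, whereas the paper derives it from an auditory-periphery model.

        # Greedy, unlabeled feature selection that tries to preserve pairwise geometry.
        import numpy as np
        from scipy.spatial.distance import pdist

        rng = np.random.default_rng(3)
        mfcc = rng.standard_normal((300, 13))        # 300 frames x 13 MFCCs (placeholder)
        perceptual = rng.standard_normal((300, 8))   # placeholder "perceptual domain" vectors

        d_ref = pdist(perceptual)
        selected, remaining = [], list(range(mfcc.shape[1]))
        for _ in range(6):                           # pick a 6-dimensional subset
            scores = []
            for j in remaining:
                d_sub = pdist(mfcc[:, selected + [j]])
                scores.append((np.corrcoef(d_sub, d_ref)[0, 1], j))
            best = max(scores)[1]                    # dimension whose inclusion best matches the reference geometry
            selected.append(best)
            remaining.remove(best)
        print("selected MFCC indices:", selected)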

  20. Multilevel Analysis in Analyzing Speech Data

    Science.gov (United States)

    Guddattu, Vasudeva; Krishna, Y.

    2011-01-01

    The speech produced by human vocal tract is a complex acoustic signal, with diverse applications in phonetics, speech synthesis, automatic speech recognition, speaker identification, communication aids, speech pathology, speech perception, machine translation, hearing research, rehabilitation and assessment of communication disorders and many…

  1. A Hybrid Acoustic and Pronunciation Model Adaptation Approach for Non-native Speech Recognition

    Science.gov (United States)

    Oh, Yoo Rhee; Kim, Hong Kook

    In this paper, we propose a hybrid model adaptation approach in which pronunciation and acoustic models are adapted by incorporating the pronunciation and acoustic variabilities of non-native speech in order to improve the performance of non-native automatic speech recognition (ASR). Specifically, the proposed hybrid model adaptation can be performed at either the state-tying or triphone-modeling level, depending at which acoustic model adaptation is performed. In both methods, we first analyze the pronunciation variant rules of non-native speakers and then classify each rule as either a pronunciation variant or an acoustic variant. The state-tying level hybrid method then adapts pronunciation models and acoustic models by accommodating the pronunciation variants in the pronunciation dictionary and by clustering the states of triphone acoustic models using the acoustic variants, respectively. On the other hand, the triphone-modeling level hybrid method initially adapts pronunciation models in the same way as in the state-tying level hybrid method; however, for the acoustic model adaptation, the triphone acoustic models are then re-estimated based on the adapted pronunciation models and the states of the re-estimated triphone acoustic models are clustered using the acoustic variants. From the Korean-spoken English speech recognition experiments, it is shown that ASR systems employing the state-tying and triphone-modeling level adaptation methods can relatively reduce the average word error rates (WERs) by 17.1% and 22.1% for non-native speech, respectively, when compared to a baseline ASR system.

  2. Comparison of fluctuating maskers for speech recognition tests.

    Science.gov (United States)

    Francart, Tom; van Wieringen, Astrid; Wouters, Jan

    2011-01-01

    To investigate the extent to which temporal gaps, temporal fine structure, and comprehensibility of the masker affect masking strength in speech recognition experiments. Seven different masker types with Dutch speech materials were evaluated. Amongst these maskers were the ICRA-5 fluctuating noise, the international speech test signal (ISTS), and competing talkers in Dutch and Swedish. Normal-hearing and hearing-impaired subjects. The normal-hearing subjects benefited from both temporal gaps and temporal fine structure in the fluctuating maskers. When the competing talker was comprehensible, performance decreased. The ISTS masker appeared to cause a large informational masking component. The stationary maskers yielded the steepest slopes of the psychometric function, followed by the modulated noises, followed by the competing talkers. Although the hearing-impaired group was heterogeneous, their data showed similar tendencies, but sometimes to a lesser extent, depending on individuals' hearing impairment. If measurement time is of primary concern non-modulated maskers are advised. If it is useful to assess release of masking by the use of temporal gaps, a fluctuating noise is advised. If perception of temporal fine structure is being investigated, a foreign-language competing talker is advised.

  3. A systematic review of speech recognition technology in health care.

    Science.gov (United States)

    Johnson, Maree; Lapkin, Samuel; Long, Vanessa; Sanchez, Paula; Suominen, Hanna; Basilakis, Jim; Dawson, Linda

    2014-10-28

    To undertake a systematic review of existing literature relating to speech recognition technology and its application within health care. A systematic review of existing literature from 2000 was undertaken. Inclusion criteria were: all papers that referred to speech recognition (SR) in health care settings, used by health professionals (allied health, medicine, nursing, technical or support staff), with an evaluation of patient or staff outcomes. Experimental and non-experimental designs were considered. Six databases (Ebscohost including CINAHL, EMBASE, MEDLINE including the Cochrane Database of Systematic Reviews, OVID Technologies, PreMED-LINE, PsycINFO) were searched by a qualified health librarian trained in systematic review searches, initially capturing 1,730 references. Fourteen studies met the inclusion criteria and were retained. The heterogeneity of the studies made comparative analysis and synthesis of the data challenging, resulting in a narrative presentation of the results. SR, although not as accurate as human transcription, does deliver reduced turnaround times and cost-effective reporting, although evidence of improved workflow processes is equivocal. SR systems have substantial benefits and should be considered in light of the cost and selection of the SR system, training requirements, length of the transcription task, potential use of macros and templates, the presence of accented voices or experienced and inexperienced typists, and workflow patterns.

  4. Northeast Artificial Intelligence Consortium (NAIC). Volume 8. Artificial Intelligence Applications to Speech Recognition

    Science.gov (United States)

    1990-12-01

    AD-A234 887. RADC-TR-90-404, Vol VIII (of 18). Final Technical Report, December 1990: Artificial Intelligence Applications to Speech Recognition. Contract F30602-85-C-0008; Project 5581, Task 27.

  5. Northeast Artificial Intelligence Consortium Annual Report 1987. Volume 6. Artificial Intelligence Applications to Speech Recognition

    Science.gov (United States)

    1989-03-01

    Northeast Artificial Intelligence Consortium Annual Report 1987, Volume 6: Artificial Intelligence Applications to Speech Recognition. Report submitted by Harvey E. Rhody and J. A. Biles.

  6. Supporting Dictation Speech Recognition Error Correction: The Impact of External Information

    Science.gov (United States)

    Shi, Yongmei; Zhou, Lina

    2011-01-01

    Although speech recognition technology has made remarkable progress, its wide adoption is still restricted by notable effort made and frustration experienced by users while correcting speech recognition errors. One of the promising ways to improve error correction is by providing user support. Although support mechanisms have been proposed for…

  7. Masked Speech Recognition and Reading Ability in School-Age Children: Is There a Relationship?

    Science.gov (United States)

    Miller, Gabrielle; Lewis, Barbara; Benchek, Penelope; Buss, Emily; Calandruccio, Lauren

    2018-01-01

    Purpose: The relationship between reading (decoding) skills, phonological processing abilities, and masked speech recognition in typically developing children was explored. This experiment was designed to evaluate the relationship between phonological processing and decoding abilities and 2 aspects of masked speech recognition in typically…

  8. Northeast Artificial Intelligence Consortium Annual Report 1986: Artificial Intelligence Applications to Speech Recognition. Volume 7

    Science.gov (United States)

    1988-06-01

    Northeast Artificial Intelligence Consortium Annual Report 1986, Volume 7: Artificial Intelligence Applications to Speech Recognition. Syracuse University. Authors: H. E. Rhody, J. Hillenbrand, J. A. Biles. This effort was funded partially by the Laboratory Directors' Fund.

  9. Effective Prediction of Errors by Non-native Speakers Using Decision Tree for Speech Recognition-Based CALL System

    Science.gov (United States)

    Wang, Hongcui; Kawahara, Tatsuya

    CALL (Computer Assisted Language Learning) systems using ASR (Automatic Speech Recognition) for second language learning have received increasing interest recently. However, it still remains a challenge to achieve high speech recognition performance, including accurate detection of erroneous utterances by non-native speakers. Conventionally, possible error patterns, based on linguistic knowledge, are added to the lexicon and language model, or the ASR grammar network. However, this approach quickly runs into a trade-off between error coverage and increased perplexity. To solve the problem, we propose a method based on a decision tree to learn effective prediction of errors made by non-native speakers. An experimental evaluation with a number of foreign students learning Japanese shows that the proposed method can effectively generate an ASR grammar network, given a target sentence, to achieve both better coverage of errors and smaller perplexity, resulting in significant improvement in ASR accuracy.
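
    A hedged sketch of the idea follows: a decision tree estimates, from hypothetical word- and learner-level features, the probability that a non-native speaker will produce an error, so that only high-risk error variants would be added to the grammar network. The feature names, data and threshold are invented placeholders, not the paper's actual features.

        # Decision-tree prediction of likely pronunciation errors (synthetic illustration).
        import numpy as np
        from sklearn.tree import DecisionTreeClassifier

        rng = np.random.default_rng(4)
        # Hypothetical per-word features: [word length, # of phones absent from the L1,
        #  learner proficiency score, phone-sequence frequency in the L2]
        X = rng.random((500, 4))
        y = (X[:, 1] * 1.5 + (1.0 - X[:, 2]) > 1.2).astype(int)   # synthetic "error made" label

        tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X, y)
        p_error = tree.predict_proba(X[:10])[:, 1]
        # Only words whose predicted error probability exceeds a threshold would get extra
        # error-variant paths in the grammar network, limiting the growth in perplexity.
        print(np.round(p_error, 2))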

  10. Neuroscience-inspired computational systems for speech recognition under noisy conditions

    Science.gov (United States)

    Schafer, Phillip B.

    Humans routinely recognize speech in challenging acoustic environments with background music, engine sounds, competing talkers, and other acoustic noise. However, today's automatic speech recognition (ASR) systems perform poorly in such environments. In this dissertation, I present novel methods for ASR designed to approach human-level performance by emulating the brain's processing of sounds. I exploit recent advances in auditory neuroscience to compute neuron-based representations of speech, and design novel methods for decoding these representations to produce word transcriptions. I begin by considering speech representations modeled on the spectrotemporal receptive fields of auditory neurons. These representations can be tuned to optimize a variety of objective functions, which characterize the response properties of a neural population. I propose an objective function that explicitly optimizes the noise invariance of the neural responses, and find that it gives improved performance on an ASR task in noise compared to other objectives. The method as a whole, however, fails to significantly close the performance gap with humans. I next consider speech representations that make use of spiking model neurons. The neurons in this method are feature detectors that selectively respond to spectrotemporal patterns within short time windows in speech. I consider a number of methods for training the response properties of the neurons. In particular, I present a method using linear support vector machines (SVMs) and show that this method produces spikes that are robust to additive noise. I compute the spectrotemporal receptive fields of the neurons for comparison with previous physiological results. To decode the spike-based speech representations, I propose two methods designed to work on isolated word recordings. The first method uses a classical ASR technique based on the hidden Markov model. The second method is a novel template-based recognition scheme that takes
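
    One way to picture the SVM-trained feature detectors is the sketch below: each detector is a linear SVM over short spectrotemporal windows that "spikes" whenever its decision function crosses zero. The data are random placeholders and the word-level decoding stages (HMM or template based) are omitted.

        # A single spectrotemporal feature detector trained as a linear SVM.
        import numpy as np
        from sklearn.svm import LinearSVC

        rng = np.random.default_rng(5)
        n_mels, win = 40, 8                                    # 40 mel bands, 8-frame window
        patches = rng.standard_normal((1000, n_mels * win))    # placeholder spectrotemporal windows
        targets = rng.integers(0, 2, size=1000)                # 1 = window contains the target pattern

        detector = LinearSVC(C=0.1, max_iter=5000).fit(patches, targets)

        def spike_train(spectrogram):
            """Slide the detector over a (n_mels x T) spectrogram and emit spike times."""
            T = spectrogram.shape[1]
            frames = np.stack([spectrogram[:, t:t + win].ravel() for t in range(T - win)])
            scores = detector.decision_function(frames)
            return np.flatnonzero(scores > 0)                  # frame indices where the detector fires

        print(spike_train(rng.standard_normal((n_mels, 100)))[:10])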

  11. A pattern recognition approach based on DTW for automatic transient identification in nuclear power plants

    International Nuclear Information System (INIS)

    Galbally, Javier; Galbally, David

    2015-01-01

    Highlights: • Novel transient identification method for NPPs. • Low-complexity. • Low training data requirements. • High accuracy. • Fully reproducible protocol carried out on a real benchmark. - Abstract: Automatic identification of transients in nuclear power plants (NPPs) allows monitoring the fatigue damage accumulated by critical components during plant operation, and is therefore of great importance for ensuring that usage factors remain within the original design bases postulated by the plant designer. Although several schemes to address this important issue have been explored in the literature, there is still no definitive solution available. In the present work, a new method for automatic transient identification is proposed, based on the Dynamic Time Warping (DTW) algorithm, largely used in other related areas such as signature or speech recognition. The novel transient identification system is evaluated on real operational data following a rigorous pattern recognition protocol. Results show the high accuracy of the proposed approach, which is combined with other interesting features such as its low complexity and its very limited requirements of training data
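
    A compact sketch of DTW-based template matching of the kind used here for transient identification is given below: an unknown one-dimensional transient signature is assigned to the class of its nearest reference template under the DTW distance. The templates and test signal are synthetic placeholders.

        # Classic dynamic time warping plus nearest-template classification.
        import numpy as np

        def dtw_distance(a, b):
            """O(len(a)*len(b)) dynamic time warping distance between two 1-D series."""
            n, m = len(a), len(b)
            D = np.full((n + 1, m + 1), np.inf)
            D[0, 0] = 0.0
            for i in range(1, n + 1):
                for j in range(1, m + 1):
                    cost = abs(a[i - 1] - b[j - 1])
                    D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
            return D[n, m]

        def classify(signal, templates):
            """templates: dict mapping transient label -> reference series."""
            return min(templates, key=lambda label: dtw_distance(signal, templates[label]))

        t = np.linspace(0, 1, 80)
        templates = {"step": np.where(t < 0.5, 0.0, 1.0),
                     "ramp": t,
                     "spike": np.exp(-50 * (t - 0.5) ** 2)}
        unknown = t + 0.05 * np.random.default_rng(6).standard_normal(80)   # a noisy ramp
        print(classify(unknown, templates))   # expected to print "ramp", the nearest template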

  12. Lexical decoder for continuous speech recognition: sequential neural network approach

    International Nuclear Information System (INIS)

    Iooss, Christine

    1991-01-01

    The work presented in this dissertation concerns the study of a connectionist architecture to treat sequential inputs. In this context, the model proposed by J.L. Elman, a recurrent multilayer network, is used. Its abilities and its limits are evaluated. Modifications are made in order to treat erroneous or noisy sequential inputs and to classify patterns. The application context of this study is the realisation of a lexical decoder for analytical multi-speaker continuous speech recognition. Lexical decoding is performed from lattices of phonemes which are obtained after an acoustic-phonetic decoding stage relying on a K Nearest Neighbors search technique. Tests are done on sentences formed from a lexicon of 20 words. The results obtained show the ability of the proposed connectionist model to take into account the sequentiality at the input level, to memorize the context and to treat noisy or erroneous inputs. (author) [fr

  13. Entrance C - New Automatic Number Plate Recognition System

    CERN Multimedia

    2013-01-01

    Entrance C (Satigny) is now equipped with a latest-generation Automatic Number Plate Recognition (ANPR) system and a fast-action road gate. During the month of August, Entrance C will be continuously open from 7.00 a.m. to 7.00 p.m. (working days only). The security guards will open the gate as usual from 7.00 a.m. to 9.00 a.m. and from 5.00 p.m. to 7.00 p.m. For the rest of the working day (9.00 a.m. to 5.00 p.m.) the gate will operate automatically. Please observe the following points:
    • Stop at the STOP sign on the ground
    • Position yourself next to the card reader for optimal recognition
    • Motorcyclists must use their CERN card
    • Cyclists may not activate the gate and should use the bicycle turnstile
    • Keep a safe distance from the vehicle in front of you
    If access is denied, please check that your vehicle regist...

  14. Individual differences in language and working memory affect children's speech recognition in noise.

    Science.gov (United States)

    McCreery, Ryan W; Spratford, Meredith; Kirby, Benjamin; Brennan, Marc

    2017-05-01

    We examined how cognitive and linguistic skills affect speech recognition in noise for children with normal hearing. Children with better working memory and language abilities were expected to have better speech recognition in noise than peers with poorer skills in these domains. As part of a prospective, cross-sectional study, children with normal hearing completed speech recognition in noise for three types of stimuli: (1) monosyllabic words, (2) syntactically correct but semantically anomalous sentences and (3) semantically and syntactically anomalous word sequences. Measures of vocabulary, syntax and working memory were used to predict individual differences in speech recognition in noise. Ninety-six children with normal hearing, who were between 5 and 12 years of age. Higher working memory was associated with better speech recognition in noise for all three stimulus types. Higher vocabulary abilities were associated with better recognition in noise for sentences and word sequences, but not for words. Working memory and language both influence children's speech recognition in noise, but the relationships vary across types of stimuli. These findings suggest that clinical assessment of speech recognition is likely to reflect underlying cognitive and linguistic abilities, in addition to a child's auditory skills, consistent with the Ease of Language Understanding model.

  15. I Hear You Eat and Speak: Automatic Recognition of Eating Condition and Food Type, Use-Cases, and Impact on ASR Performance.

    Directory of Open Access Journals (Sweden)

    Simone Hantke

    Full Text Available We propose a new recognition task in the area of computational paralinguistics: automatic recognition of eating conditions in speech, i. e., whether people are eating while speaking, and what they are eating. To this end, we introduce the audio-visual iHEARu-EAT database featuring 1.6 k utterances of 30 subjects (mean age: 26.1 years, standard deviation: 2.66 years, gender balanced, German speakers), six types of food (Apple, Nectarine, Banana, Haribo Smurfs, Biscuit, and Crisps), and read as well as spontaneous speech, which is made publicly available for research purposes. We start with demonstrating that for automatic speech recognition (ASR), it pays off to know whether speakers are eating or not. We also propose automatic classification both by brute-forcing of low-level acoustic features as well as higher-level features related to intelligibility, obtained from an Automatic Speech Recogniser. Prediction of the eating condition was performed with a Support Vector Machine (SVM) classifier employed in a leave-one-speaker-out evaluation framework. Results show that the binary prediction of eating condition (i. e., eating or not eating) can be easily solved independently of the speaking condition; the obtained average recalls are all above 90%. Low-level acoustic features provide the best performance on spontaneous speech, which reaches up to 62.3% average recall for multi-way classification of the eating condition, i. e., discriminating the six types of food, as well as not eating. The early fusion of features related to intelligibility with the brute-forced acoustic feature set improves the performance on read speech, reaching a 66.4% average recall for the multi-way classification task. Analysing features and classifier errors leads to a suitable ordinal scale for eating conditions, on which automatic regression can be performed with up to 56.2% determination coefficient.

  16. I Hear You Eat and Speak: Automatic Recognition of Eating Condition and Food Type, Use-Cases, and Impact on ASR Performance.

    Science.gov (United States)

    Hantke, Simone; Weninger, Felix; Kurle, Richard; Ringeval, Fabien; Batliner, Anton; Mousa, Amr El-Desoky; Schuller, Björn

    2016-01-01

    We propose a new recognition task in the area of computational paralinguistics: automatic recognition of eating conditions in speech, i. e., whether people are eating while speaking, and what they are eating. To this end, we introduce the audio-visual iHEARu-EAT database featuring 1.6 k utterances of 30 subjects (mean age: 26.1 years, standard deviation: 2.66 years, gender balanced, German speakers), six types of food (Apple, Nectarine, Banana, Haribo Smurfs, Biscuit, and Crisps), and read as well as spontaneous speech, which is made publicly available for research purposes. We start with demonstrating that for automatic speech recognition (ASR), it pays off to know whether speakers are eating or not. We also propose automatic classification both by brute-forcing of low-level acoustic features as well as higher-level features related to intelligibility, obtained from an Automatic Speech Recogniser. Prediction of the eating condition was performed with a Support Vector Machine (SVM) classifier employed in a leave-one-speaker-out evaluation framework. Results show that the binary prediction of eating condition (i. e., eating or not eating) can be easily solved independently of the speaking condition; the obtained average recalls are all above 90%. Low-level acoustic features provide the best performance on spontaneous speech, which reaches up to 62.3% average recall for multi-way classification of the eating condition, i. e., discriminating the six types of food, as well as not eating. The early fusion of features related to intelligibility with the brute-forced acoustic feature set improves the performance on read speech, reaching a 66.4% average recall for the multi-way classification task. Analysing features and classifier errors leads to a suitable ordinal scale for eating conditions, on which automatic regression can be performed with up to 56.2% determination coefficient.
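
    The evaluation protocol described above, an SVM in a leave-one-speaker-out framework, can be sketched with scikit-learn as below. The features, labels and speaker assignments are simulated placeholders rather than iHEARu-EAT data, and balanced accuracy stands in for unweighted average recall.

```python
# Sketch of SVM classification of eating condition with leave-one-speaker-out CV.
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score

rng = np.random.default_rng(0)
n_utterances, n_features, n_speakers = 300, 40, 30

X = rng.normal(size=(n_utterances, n_features))     # stand-in for acoustic features
y = rng.integers(0, 2, size=n_utterances)           # 0 = not eating, 1 = eating
speakers = rng.integers(0, n_speakers, size=n_utterances)

scores = cross_val_score(SVC(kernel="linear"), X, y,
                         groups=speakers, cv=LeaveOneGroupOut(),
                         scoring="balanced_accuracy")
print(f"Mean leave-one-speaker-out balanced accuracy: {scores.mean():.2f}")
```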

  17. Automatic Pavement Crack Recognition Based on BP Neural Network

    Directory of Open Access Journals (Sweden)

    Li Li

    2014-02-01

    Full Text Available A feasible pavement crack detection system plays an important role in evaluating the road condition and providing the necessary road maintenance. In this paper, a back propagation neural network (BPNN) is used to recognize pavement cracks from images. To improve the recognition accuracy of the BPNN, a complete framework of image processing is proposed including image preprocessing and crack information extraction. In this framework, the redundant image information is reduced as much as possible and two sets of feature parameters are constructed to classify the crack images. Then a BPNN is adopted to distinguish pavement images between linear and alligator cracks to acquire high recognition accuracy. Besides, the linear cracks can be further classified into transversal and longitudinal cracks according to the direction angle. Finally, the proposed method is evaluated on the data of 400 pavement images obtained by the Automatic Road Analyzer (ARAN) in Northern China and the results show that the proposed method seems to be a powerful tool for pavement crack recognition. The rates of correct classification for alligator, transversal and longitudinal cracks are 97.5%, 100% and 88.0%, respectively. Compared to some previous studies, the method proposed in this paper is effective for all three kinds of cracks and the results are also acceptable for engineering application.
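
    A minimal back-propagation classifier over hand-crafted crack features could look like the sketch below; the feature vectors, labels and network size are synthetic assumptions rather than the configuration reported in the paper.

```python
# Toy BPNN (multilayer perceptron) separating linear from alligator cracks.
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_images, n_features = 400, 12                  # e.g. projection and texture statistics

X = rng.normal(size=(n_images, n_features))
y = rng.integers(0, 2, size=n_images)           # 0 = linear crack, 1 = alligator crack

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
bpnn = MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000, random_state=0)
bpnn.fit(X_tr, y_tr)
print("Held-out accuracy:", bpnn.score(X_te, y_te))
```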

  18. COMPARISON BETWEEN GMM-SVM SEQUENCE KERNEL AND GMM: APPLICATION TO SPEECH EMOTION RECOGNITION

    Directory of Open Access Journals (Sweden)

    I. TRABELSI

    2016-09-01

    Full Text Available Speech emotion recognition aims at automatically identifying the emotional or physical state of a human being from his or her voice. The emotional state is an important factor in human communication, because it provides feedback information in many applications. This paper makes a comparison of two standard methods used for speaker recognition and verification, Gaussian Mixture Models (GMM) and Support Vector Machines (SVM), applied here to emotion recognition. An extensive comparison of two methods, GMM and the GMM/SVM sequence kernel, is conducted. The main goal is to analyze and compare the influence of the initial setting of parameters such as the number of mixture components, the number of iterations used and the volume of training data for these two methods. Experimental studies are performed on the Berlin Emotional Database, which expresses different emotions in the German language. The emotions used in this study are anger, fear, joy, boredom, neutral, disgust, and sadness. Experimental results show the effectiveness of the combination of GMM and SVM for classifying sound data sequences when compared to systems based on GMM.
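
    The two back-ends under comparison can be caricatured as per-class GMMs scored by log-likelihood versus per-utterance GMM mean supervectors fed to an SVM. Real GMM-SVM sequence kernels normally MAP-adapt a universal background model; fitting a small GMM per utterance on synthetic data, as below, only sketches the idea.

```python
# Toy comparison of a GMM back-end and a GMM-supervector/SVM back-end.
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.svm import SVC

rng = np.random.default_rng(0)
classes = ["anger", "neutral", "sadness"]

def utterance(label):
    """One utterance = a sequence of 13-dimensional frame features (simulated)."""
    shift = {"anger": 1.0, "neutral": 0.0, "sadness": -1.0}[label]
    return rng.normal(loc=shift, size=(50, 13))

train = [(utterance(c), c) for c in classes for _ in range(20)]
test = [(utterance(c), c) for c in classes for _ in range(5)]

# 1) GMM back-end: one GMM per emotion, pick the class with the highest log-likelihood.
gmms = {c: GaussianMixture(n_components=4, random_state=0).fit(
            np.vstack([x for x, lab in train if lab == c])) for c in classes}
gmm_pred = [max(classes, key=lambda c: gmms[c].score(x)) for x, _ in test]

# 2) GMM-SVM back-end: stack per-utterance GMM means into a supervector for a linear SVM.
def supervector(x):
    return GaussianMixture(n_components=4, random_state=0).fit(x).means_.ravel()

svm = SVC(kernel="linear").fit([supervector(x) for x, _ in train],
                               [lab for _, lab in train])
svm_pred = svm.predict([supervector(x) for x, _ in test])

truth = [lab for _, lab in test]
print("GMM accuracy:    ", np.mean([p == t for p, t in zip(gmm_pred, truth)]))
print("GMM-SVM accuracy:", np.mean([p == t for p, t in zip(svm_pred, truth)]))
```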

  19. Automatic error detection in alignments for speech synthesis

    CSIR Research Space (South Africa)

    Barnard, E

    2006-11-01

    Full Text Available The phonetic segmentation of recorded speech is a crucial factor in the quality of concatenative systems for speech synthesis. The authors describe a likelihood-based error detection process that can be used to flag possible errors in such a...

  20. Speech recognition acceptance by physicians: A temporal replication of a survey of expectations and experiences.

    Science.gov (United States)

    Lyons, Joseph P; Sanders, Salvatore A; Fredrick Cesene, Daniel; Palmer, Christopher; Mihalik, Valerie L; Weigel, Tracy

    2016-09-01

    A replication survey of physicians' expectations and experience with speech recognition technology was conducted before and after its implementation. The expectations survey was administered to emergency medicine physicians prior to training with the speech recognition system. The experience survey consisting of similar items was administered after physicians gained speech recognition technology experience. In this study, 82 percent of the physicians were initially optimistic that the use of speech recognition technology with the electronic medical record was a good idea. After using the technology for 6 months, 87 percent of the physicians agreed that speech recognition technology was a good idea. In addition, 72 percent of the physicians in this study had an expectation that the use of speech recognition technology would save time. After use in the clinical environment, 51 percent of the participants reported time savings. The increased acceptance of speech recognition technology by physicians in this study was attributed to improvements in the technology and the electronic medical record. © The Author(s) 2015.

  1. Age-Related and Gender-Related Changes in Monaural Speech Recognition.

    Science.gov (United States)

    Dubno, Judy R.; And Others

    1997-01-01

    A study of 129 older adults (ages 55-84) with sensorineural hearing loss examined the effects of age, gender, and auditory thresholds on several measures of speech recognition. Results found significant declines with age for males in maximum word recognition, maximum synthetic sentence identification, and keyword recognition in high-context…

  2. The Effects of Background Noise on the Performance of an Automatic Speech Recogniser

    Science.gov (United States)

    Littlefield, Jason; HashemiSakhtsari, Ahmad

    2002-11-01

    Ambient or environmental noise is a major factor that affects the performance of an automatic speech recognizer. Large-vocabulary, speaker-dependent, continuous speech recognizers are commercially available. Speech recognizers perform well in a quiet environment, but poorly in a noisy environment. Speaker-dependent speech recognizers require training prior to being tested, and the level of background noise in both phases affects the performance of the recognizer. This study aims to determine whether the best performance of a speech recognizer occurs when the levels of background noise during the training and test phases are the same, and how the performance is affected when the levels of background noise during the training and test phases differ. The relationship between the performance of the speech recognizer and upgrading the computer speed, the amount of memory and the software version was also investigated.

  3. Automatic initial and final segmentation in cleft palate speech of Mandarin speakers.

    Directory of Open Access Journals (Sweden)

    Ling He

    Full Text Available The speech unit segmentation is an important pre-processing step in the analysis of cleft palate speech. In Mandarin, one syllable is composed of two parts: initial and final. In cleft palate speech, the resonance disorders occur at the finals and the voiced initials, while the articulation disorders occur at the unvoiced initials. Thus, the initials and finals are the minimum speech units, which could reflect the characteristics of cleft palate speech disorders. In this work, an automatic initial/final segmentation method is proposed. It is an important preprocessing step in cleft palate speech signal processing. The tested cleft palate speech utterances are collected from the Cleft Palate Speech Treatment Center in the Hospital of Stomatology, Sichuan University, which treats the largest number of cleft palate patients in China. The cleft palate speech data includes 824 speech segments, and the control samples contain 228 speech segments. The syllables are first extracted from the speech utterances. The proposed syllable extraction method avoids the training stage, and achieves a good performance for both voiced and unvoiced speech. Then, the syllables are classified as having "quasi-unvoiced" or "quasi-voiced" initials. Respective initial/final segmentation methods are proposed for these two types of syllables. Moreover, a two-step segmentation method is proposed. The rough locations of syllable and initial/final boundaries are refined in the second segmentation step, in order to improve the robustness of segmentation accuracy. The experiments show that the initial/final segmentation accuracies for syllables with quasi-unvoiced initials are higher than for syllables with quasi-voiced initials. For the cleft palate speech, the mean time error is 4.4ms for syllables with quasi-unvoiced initials, and 25.7ms for syllables with quasi-voiced initials, and the correct segmentation accuracy P30 for all the

  4. Automatic initial and final segmentation in cleft palate speech of Mandarin speakers.

    Science.gov (United States)

    He, Ling; Liu, Yin; Yin, Heng; Zhang, Junpeng; Zhang, Jing; Zhang, Jiang

    2017-01-01

    The speech unit segmentation is an important pre-processing step in the analysis of cleft palate speech. In Mandarin, one syllable is composed of two parts: initial and final. In cleft palate speech, the resonance disorders occur at the finals and the voiced initials, while the articulation disorders occur at the unvoiced initials. Thus, the initials and finals are the minimum speech units, which could reflect the characteristics of cleft palate speech disorders. In this work, an automatic initial/final segmentation method is proposed. It is an important preprocessing step in cleft palate speech signal processing. The tested cleft palate speech utterances are collected from the Cleft Palate Speech Treatment Center in the Hospital of Stomatology, Sichuan University, which treats the largest number of cleft palate patients in China. The cleft palate speech data includes 824 speech segments, and the control samples contain 228 speech segments. The syllables are first extracted from the speech utterances. The proposed syllable extraction method avoids the training stage, and achieves a good performance for both voiced and unvoiced speech. Then, the syllables are classified as having "quasi-unvoiced" or "quasi-voiced" initials. Respective initial/final segmentation methods are proposed for these two types of syllables. Moreover, a two-step segmentation method is proposed. The rough locations of syllable and initial/final boundaries are refined in the second segmentation step, in order to improve the robustness of segmentation accuracy. The experiments show that the initial/final segmentation accuracies for syllables with quasi-unvoiced initials are higher than for syllables with quasi-voiced initials. For the cleft palate speech, the mean time error is 4.4ms for syllables with quasi-unvoiced initials, and 25.7ms for syllables with quasi-voiced initials, and the correct segmentation accuracy P30 for all the syllables is 91.69%. For the control samples, P30 for all the syllables is 91.24%.

  5. An objective auditory measure to assess speech recognition in adult cochlear implant users.

    Science.gov (United States)

    Turgeon, C; Lazzouni, L; Lepore, F; Ellemberg, D

    2014-04-01

    To verify if a mismatch negativity (MMN) paradigm based on speech syllables can differentiate between good and poorer cochlear implant (CI) users on a speech recognition task. Twenty adults with a CI and 11 normal hearing adults participated in the study. Based on a speech recognition test, ten CI users were classified as good performers and ten as poor performers. We measured the MMN with /da/ as the standard stimulus and /ba/ and /ga/ as the deviants. Separate analyses were conducted on the amplitude and latency of the MMN. A MMN was evoked by both deviant stimuli in all normal hearing participants and in well performing CI users, with similar amplitudes for both groups. However, the amplitude of the MMN was significantly reduced for the poorer CI users compared to the normal hearing group and the good CI users. The latency was longer for both groups of cochlear implant users. A bivariate correlation showed a significant positive correlation between the speech recognition score and the amplitude of the MMN. The MMN can distinguish between CI users who have good versus poor speech recognition as assessed with conventional tasks. Our findings suggest that the MMN can be used to assess speech recognition proficiency in CI users who cannot be tested with regular speech recognition tasks, such as infants and other non-verbal populations. Copyright © 2013 International Federation of Clinical Neurophysiology. Published by Elsevier Ireland Ltd. All rights reserved.
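
    As a concrete illustration of how the MMN is quantified, the sketch below averages simulated standard and deviant epochs and takes the most negative point of the deviant-minus-standard difference wave in a 100-300 ms window. The sampling rate, amplitudes and latencies are invented values, not the study's recordings.

```python
# Simplified MMN computation on synthetic ERP epochs.
import numpy as np

rng = np.random.default_rng(0)
fs, n_samples = 500, 300                      # 500 Hz sampling, 600 ms epochs
t = np.arange(n_samples) / fs

def epochs(n, mmn_amplitude):
    """Simulate n single-trial epochs with a negative deflection around 200 ms."""
    bump = -mmn_amplitude * np.exp(-((t - 0.2) ** 2) / (2 * 0.03 ** 2))
    return bump + rng.normal(scale=2.0, size=(n, n_samples))

standard = epochs(400, mmn_amplitude=0.0)     # responses to the standard /da/
deviant = epochs(80, mmn_amplitude=3.0)       # responses to a deviant such as /ba/

difference = deviant.mean(axis=0) - standard.mean(axis=0)
window = (t >= 0.1) & (t <= 0.3)
peak = difference[window].min()
latency_ms = t[window][difference[window].argmin()] * 1000
print(f"MMN peak amplitude: {peak:.2f} (arbitrary units) at {latency_ms:.0f} ms")
```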

  6. From Birdsong to Human Speech Recognition: Bayesian Inference on a Hierarchy of Nonlinear Dynamical Systems

    Science.gov (United States)

    Yildiz, Izzet B.; von Kriegstein, Katharina; Kiebel, Stefan J.

    2013-01-01

    Our knowledge about the computational mechanisms underlying human learning and recognition of sound sequences, especially speech, is still very limited. One difficulty in deciphering the exact means by which humans recognize speech is that there are scarce experimental findings at a neuronal, microscopic level. Here, we show that our neuronal-computational understanding of speech learning and recognition may be vastly improved by looking at an animal model, i.e., the songbird, which faces the same challenge as humans: to learn and decode complex auditory input, in an online fashion. Motivated by striking similarities between the human and songbird neural recognition systems at the macroscopic level, we assumed that the human brain uses the same computational principles at a microscopic level and translated a birdsong model into a novel human sound learning and recognition model with an emphasis on speech. We show that the resulting Bayesian model with a hierarchy of nonlinear dynamical systems can learn speech samples such as words rapidly and recognize them robustly, even in adverse conditions. In addition, we show that recognition can be performed even when words are spoken by different speakers and with different accents—an everyday situation in which current state-of-the-art speech recognition models often fail. The model can also be used to qualitatively explain behavioral data on human speech learning and derive predictions for future experiments. PMID:24068902

  7. From birdsong to human speech recognition: bayesian inference on a hierarchy of nonlinear dynamical systems.

    Directory of Open Access Journals (Sweden)

    Izzet B Yildiz

    Full Text Available Our knowledge about the computational mechanisms underlying human learning and recognition of sound sequences, especially speech, is still very limited. One difficulty in deciphering the exact means by which humans recognize speech is that there are scarce experimental findings at a neuronal, microscopic level. Here, we show that our neuronal-computational understanding of speech learning and recognition may be vastly improved by looking at an animal model, i.e., the songbird, which faces the same challenge as humans: to learn and decode complex auditory input, in an online fashion. Motivated by striking similarities between the human and songbird neural recognition systems at the macroscopic level, we assumed that the human brain uses the same computational principles at a microscopic level and translated a birdsong model into a novel human sound learning and recognition model with an emphasis on speech. We show that the resulting Bayesian model with a hierarchy of nonlinear dynamical systems can learn speech samples such as words rapidly and recognize them robustly, even in adverse conditions. In addition, we show that recognition can be performed even when words are spoken by different speakers and with different accents-an everyday situation in which current state-of-the-art speech recognition models often fail. The model can also be used to qualitatively explain behavioral data on human speech learning and derive predictions for future experiments.

  8. Lexical-Access Ability and Cognitive Predictors of Speech Recognition in Noise in Adult Cochlear Implant Users

    OpenAIRE

    Kaandorp, Marre W.; Smits, Cas; Merkus, Paul; Festen, Joost M.; Goverts, S. Theo

    2017-01-01

    Not all of the variance in speech-recognition performance of cochlear implant (CI) users can be explained by biographic and auditory factors. In normal-hearing listeners, linguistic and cognitive factors determine most of speech-in-noise performance. The current study explored specifically the influence of visually measured lexical-access ability compared with other cognitive factors on speech recognition of 24 postlingually deafened CI users. Speech-recognition performance was measured with ...

  9. Automatic recognition of offensive team formation in american football plays

    KAUST Repository

    Atmosukarto, Indriyati

    2013-06-01

    Compared to security surveillance and military applications, where automated action analysis is prevalent, the sports domain is extremely under-served. Most existing software packages for sports video analysis require manual annotation of important events in the video. American football is the most popular sport in the United States, however most game analysis is still done manually. Line of scrimmage and offensive team formation recognition are two statistics that must be tagged by American Football coaches when watching and evaluating past play video clips, a process which takes many man hours per week. These two statistics are also the building blocks for more high-level analysis such as play strategy inference and automatic statistic generation. In this paper, we propose a novel framework where given an American football play clip, we automatically identify the video frame in which the offensive team lines up in formation (formation frame), the line of scrimmage for that play, and the type of player formation the offensive team takes on. The proposed framework achieves 95% accuracy in detecting the formation frame, 98% accuracy in detecting the line of scrimmage, and up to 67% accuracy in classifying the offensive team's formation. To validate our framework, we compiled a large dataset comprising more than 800 play-clips of standard and high definition resolution from real-world football games. This dataset will be made publicly available for future comparison. © 2013 IEEE.

  10. Open Dataset for the Automatic Recognition of Sedentary Behaviors.

    Science.gov (United States)

    Possos, William; Cruz, Robinson; Cerón, Jesús D; López, Diego M; Sierra-Torres, Carlos H

    2017-01-01

    Sedentarism is associated with the development of noncommunicable diseases (NCD) such as cardiovascular diseases (CVD), type 2 diabetes, and cancer. Therefore, the identification of specific sedentary behaviors (TV viewing, sitting at work, driving, relaxing, etc.) is especially relevant for planning personalized prevention programs. To build and evaluate a public dataset for the automatic recognition (classification) of sedentary behaviors. The dataset included data from 30 subjects, who performed 23 sedentary behaviors while wearing a commercial wearable on the wrist, a smartphone on the hip and another on the thigh. Bluetooth Low Energy (BLE) beacons were used in order to improve the automatic classification of different sedentary behaviors. The study also compared six well-known data mining classification techniques in order to identify the most precise method for solving the classification problem of the 23 defined behaviors. A better classification accuracy was obtained using the Random Forest algorithm and when data were collected from the phone on the hip. Furthermore, the use of beacons as a reference for obtaining the symbolic location of the individual improved the precision of the classification.
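
    The configuration that performed best in this study, a random forest on hip-phone features with a beacon-derived symbolic location as an extra input, can be sketched roughly as below; all feature values and the number of beacon zones are simulated assumptions.

```python
# Sketch of random forest classification of sedentary behaviors with a beacon zone feature.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_windows, n_motion_features, n_behaviors = 1500, 20, 23

motion = rng.normal(size=(n_windows, n_motion_features))   # accelerometer statistics
beacon_zone = rng.integers(0, 5, size=(n_windows, 1))      # nearest-beacon room id
X = np.hstack([motion, beacon_zone])
y = rng.integers(0, n_behaviors, size=n_windows)

clf = RandomForestClassifier(n_estimators=200, random_state=0)
print("Cross-validated accuracy:", cross_val_score(clf, X, y, cv=5).mean())
```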

  11. AUTOMATIC RECOGNITION OF INDOOR NAVIGATION ELEMENTS FROM KINECT POINT CLOUDS

    Directory of Open Access Journals (Sweden)

    L. Zeng

    2017-09-01

    Full Text Available This paper automatically recognizes the navigation elements defined by the indoorGML data standard (door, stairway and wall). The data used is an indoor 3D point cloud collected with a Kinect v2 by means of ORB-SLAM. Compared with lidar, it is cheaper and more convenient, but the point clouds also have the problems of noise, registration error and large data volume. Hence, we adopt a shape descriptor (a histogram of distances between two randomly chosen points, proposed by Osada), merged with other descriptors and used in conjunction with a random forest classifier, to recognize the navigation elements (door, stairway and wall) from Kinect point clouds. This research acquires navigation elements and their 3-d location information from each single data frame through segmentation of point clouds, boundary extraction, feature calculation and classification. Finally, this paper utilizes the acquired navigation elements and their information to generate the state data of the indoor navigation module automatically. The experimental results demonstrate a high recognition accuracy of the proposed method.

  12. Automatic Recognition of Indoor Navigation Elements from Kinect Point Clouds

    Science.gov (United States)

    Zeng, L.; Kang, Z.

    2017-09-01

    This paper automatically recognizes the navigation elements defined by the indoorGML data standard (door, stairway and wall). The data used is an indoor 3D point cloud collected with a Kinect v2 by means of ORB-SLAM. Compared with lidar, it is cheaper and more convenient, but the point clouds also have the problems of noise, registration error and large data volume. Hence, we adopt a shape descriptor (a histogram of distances between two randomly chosen points, proposed by Osada), merged with other descriptors and used in conjunction with a random forest classifier, to recognize the navigation elements (door, stairway and wall) from Kinect point clouds. This research acquires navigation elements and their 3-d location information from each single data frame through segmentation of point clouds, boundary extraction, feature calculation and classification. Finally, this paper utilizes the acquired navigation elements and their information to generate the state data of the indoor navigation module automatically. The experimental results demonstrate a high recognition accuracy of the proposed method.
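
    The descriptor at the heart of this method, Osada's D2 shape distribution, is a histogram of distances between randomly sampled point pairs. The sketch below computes it on crude synthetic point clouds and trains a random forest; the class geometries and parameters are stand-ins, not the Kinect data used in the paper.

```python
# D2 shape descriptor plus random forest for door / stairway / wall recognition (toy data).
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

def d2_descriptor(points, n_pairs=2000, n_bins=32, max_dist=2.0):
    """Osada's D2: histogram of distances between randomly chosen point pairs."""
    i = rng.integers(0, len(points), size=n_pairs)
    j = rng.integers(0, len(points), size=n_pairs)
    dists = np.linalg.norm(points[i] - points[j], axis=1)
    hist, _ = np.histogram(dists, bins=n_bins, range=(0, max_dist), density=True)
    return hist

def fake_cloud(kind):
    """Crude geometric stand-ins for segmented Kinect clusters."""
    pts = rng.uniform(size=(500, 3))
    if kind == "wall":
        pts[:, 2] *= 0.05                              # thin, wide slab
    elif kind == "door":
        pts[:, 0] *= 0.4                               # narrower panel
    elif kind == "stairway":
        pts[:, 2] = np.floor(pts[:, 2] * 5) / 5        # stepped surface
    return pts

labels = ["door", "stairway", "wall"]
X, y = [], []
for label in labels:
    for _ in range(40):
        X.append(d2_descriptor(fake_cloud(label)))
        y.append(label)

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print(clf.predict([d2_descriptor(fake_cloud("wall"))]))
```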

  13. A speech recognition system for data collection in precision agriculture

    Science.gov (United States)

    Dux, David Lee

    Agricultural producers have shown interest in collecting detailed, accurate, and meaningful field data through field scouting, but scouting is labor intensive. They use yield monitor attachments to collect weed and other field data while driving equipment. However, distractions from using a keyboard or buttons while driving can lead to driving errors or missed data points. At Purdue University, researchers have developed an ASR system to allow equipment operators to collect georeferenced data while keeping hands and eyes on the machine during harvesting and to ease georeferencing of data collected during scouting. A notebook computer retrieved locations from a GPS unit and displayed and stored data in Excel. A headset microphone with a single earphone collected spoken input while allowing the operator to hear outside sounds. One-, two-, or three-word commands activated appropriate VBA macros. Four speech recognition products were chosen based on hardware requirements and ability to add new terms. After training, speech recognition accuracy was 100% for Kurzweil VoicePlus and Verbex Listen for the 132 vocabulary words tested, during tests walking outdoors or driving an ATV. Scouting tests were performed by carrying the system in a backpack while walking in soybean fields. The system recorded a point or a series of points with each utterance. Boundaries of points showed problem areas in the field and single points marked rocks and field corners. Data were displayed as an Excel chart to show a real-time map as data were collected. The information was later displayed in a GIS over remote sensed field images. Field corners and areas of poor stand matched, with voice data explaining anomalies in the image. The system was tested during soybean harvest by using voice to locate weed patches. A harvester operator with little computer experience marked points by voice when the harvester entered and exited weed patches or areas with poor crop stand. The operator found the

  14. Stochastic Modeling as a Means of Automatic Speech Recognition

    Science.gov (United States)

    1975-04-01


  15. A hybrid model for automatic emotion recognition in suicide notes.

    Science.gov (United States)

    Yang, Hui; Willis, Alistair; de Roeck, Anne; Nuseibeh, Bashar

    2012-01-01

    We describe the Open University team's submission to the 2011 i2b2/VA/Cincinnati Medical Natural Language Processing Challenge, Track 2 Shared Task for sentiment analysis in suicide notes. This Shared Task focused on the development of automatic systems that identify, at the sentence level, affective text of 15 specific emotions from suicide notes. We propose a hybrid model that incorporates a number of natural language processing techniques, including lexicon-based keyword spotting, CRF-based emotion cue identification, and machine learning-based emotion classification. The results generated by different techniques are integrated using different vote-based merging strategies. The automated system performed well against the manually-annotated gold standard, and achieved encouraging results with a micro-averaged F-measure score of 61.39% in textual emotion recognition, which was ranked 1st place out of 24 participant teams in this challenge. The results demonstrate that effective emotion recognition by an automated system is possible when a large annotated corpus is available.
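
    The vote-based merging of subsystem outputs can be illustrated with a plain majority vote over per-sentence labels, as below; the label names and the tie-handling fallback are invented, and the paper's actual merging strategies are more varied than this.

```python
# Generic majority-vote merging of labels produced by several subsystems.
from collections import Counter

def merge_votes(predictions, fallback="no_emotion"):
    """Return the label predicted most often; if no subsystem fired, use the fallback."""
    counts = Counter(p for p in predictions if p is not None)
    return counts.most_common(1)[0][0] if counts else fallback

# One sentence, three subsystems (keyword spotting, CRF cues, statistical classifier):
print(merge_votes(["hopelessness", "hopelessness", "guilt"]))   # -> hopelessness
print(merge_votes([None, None, None]))                          # -> no_emotion
```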

  16. Using Face Recognition in the Automatic Door Access Control in a Secured Room

    Directory of Open Access Journals (Sweden)

    Gheorghe Gilca

    2017-06-01

    Full Text Available The aim of this paper is to help users improve the door security of sensitive locations by using face detection and recognition. The system described comprises three main subsystems: face detection, face recognition and automatic door access control. The door opens automatically for a known person on command from the microcontroller.

  17. Visual abilities are important for auditory-only speech recognition: evidence from autism spectrum disorder.

    Science.gov (United States)

    Schelinski, Stefanie; Riedel, Philipp; von Kriegstein, Katharina

    2014-12-01

    In auditory-only conditions, for example when we listen to someone on the phone, it is essential to recognize quickly and accurately what is said (speech recognition). Previous studies have shown that speech recognition performance in auditory-only conditions is better if the speaker is known not only by voice, but also by face. Here, we tested the hypothesis that such an improvement in auditory-only speech recognition depends on the ability to lip-read. To test this we recruited a group of adults with autism spectrum disorder (ASD), a condition associated with difficulties in lip-reading, and typically developed controls. All participants were trained to identify six speakers by name and voice. Three speakers were learned by a video showing their face and three others were learned in a matched control condition without face. After training, participants performed an auditory-only speech recognition test that consisted of sentences spoken by the trained speakers. As a control condition, the test also included speaker identity recognition on the same auditory material. The results showed that, in the control group, performance in speech recognition was improved for speakers known by face in comparison to speakers learned in the matched control condition without face. The ASD group lacked such a performance benefit. For the ASD group, auditory-only speech recognition was even worse for speakers known by face compared to speakers not known by face. In speaker identity recognition, the ASD group performed worse than the control group independent of whether the speakers were learned with or without face. Two additional visual experiments showed that the ASD group performed worse in lip-reading whereas face identity recognition was within the normal range. The findings support the view that auditory-only communication involves specific visual mechanisms. Further, they indicate that in ASD, speaker-specific dynamic visual information is not available to optimize auditory

  18. Low-Complexity Variable Frame Rate Analysis for Speech Recognition and Voice Activity Detection

    DEFF Research Database (Denmark)

    Tan, Zheng-Hua; Lindberg, Børge

    2010-01-01

    Frame based speech processing inherently assumes a stationary behavior of speech signals in a short period of time. Over a long time, the characteristics of the signals can change significantly and frames are not equally important, underscoring the need for frame selection. In this paper, we ... for very low SNR signals. The resulting variable frame rate analysis method is applied to three speech processing tasks that are essential to natural interaction with intelligent environments. First, it is used for improving speech recognition performance in noisy environments. Secondly, the method is used for scalable source coding schemes in distributed speech recognition where the target bit rate is met by adjusting the frame rate. Thirdly, it is applied to voice activity detection. Very encouraging results are obtained for all three speech processing tasks.
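
    A minimal frame-selection sketch in this spirit keeps a frame only when an energy-weighted distance from the last retained frame exceeds a threshold, so fast-changing regions receive a higher frame rate than steady or silent ones. The distance measure, weighting and threshold below are simplified assumptions, not the paper's exact criterion.

```python
# Toy variable frame rate analysis: keep frames where weighted spectral change is large.
import numpy as np

def select_frames(features, log_energy, threshold=1.5):
    """features: (n_frames, n_coeffs) array; returns indices of retained frames."""
    kept = [0]
    for t in range(1, len(features)):
        dist = np.linalg.norm(features[t] - features[kept[-1]])
        if dist * max(log_energy[t], 0.1) > threshold:
            kept.append(t)
    return np.array(kept)

rng = np.random.default_rng(0)
mfcc = rng.normal(size=(200, 13))               # stand-in cepstral features
energy = rng.uniform(0.0, 2.0, size=200)        # stand-in log-energy per frame
kept = select_frames(mfcc, energy)
print(f"Retained {len(kept)} of {len(mfcc)} frames")
```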

  19. Research Into the Use of Speech Recognition Enhanced Microworlds in an Authorable Language Tutor

    National Research Council Canada - National Science Library

    Plott, Beth

    1999-01-01

    .... Once the first microworld exercise was completed and integrated into MILT, ARI funded the investigation of the use of discrete speech recognition technology in language learning using the microworld exercise as a basis...

  20. Impact of noise and other factors on speech recognition in anaesthesia

    DEFF Research Database (Denmark)

    Alapetite, Alexandre

    2008-01-01

    Introduction: Speech recognition is currently being deployed in medical and anaesthesia applications. This article is part of a project to investigate and further develop a prototype of a speech-input interface in Danish for an electronic anaesthesia patient record, to be used in real time during operations. Objective: The aim of the experiment is to evaluate the relative impact of several factors affecting speech recognition when used in operating rooms, such as the type or loudness of background noises, type of microphone, type of recognition mode (free speech versus command mode), and type ... the relative effect of various factors. Results: Some factors have a major impact, such as the words to be recognised, the type of recognition, and participants. The type of microphone is especially significant when combined with the type of noise. While loud noises in the operating room can have a predominant ...

  1. Robust In-Car Speech Recognition Based on Nonlinear Multiple Regressions

    Directory of Open Access Journals (Sweden)

    Itakura Fumitada

    2007-01-01

    Full Text Available We address issues for improving hands-free speech recognition performance in different car environments using a single distant microphone. In this paper, we propose a nonlinear multiple-regression-based enhancement method for in-car speech recognition. In order to develop a data-driven in-car recognition system, we develop an effective algorithm for adapting the regression parameters to different driving conditions. We also devise a model compensation scheme by synthesizing the training data using the optimal regression parameters and by selecting the optimal HMM for the test speech. Based on isolated word recognition experiments conducted in 15 real car environments, the proposed adaptive regression approach shows an advantage in average relative word error rate (WER) reductions of 52.5% and 14.8%, compared to the original noisy speech and the ETSI advanced front end, respectively.
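
    The enhancement idea can be sketched as learning a mapping from noisy, distant-microphone features to their clean counterparts on parallel data. Below, a small MLP regressor on simulated feature pairs stands in for the paper's nonlinear multiple-regression formulation and its condition-dependent parameter adaptation.

```python
# Sketch of regression-based feature enhancement on simulated parallel clean/noisy data.
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
n_frames, n_dims = 2000, 24

clean = rng.normal(size=(n_frames, n_dims))                 # clean log-mel features
noisy = clean + rng.normal(scale=0.5, size=clean.shape)     # distant-microphone version

enhancer = MLPRegressor(hidden_layer_sizes=(64,), max_iter=1000, random_state=0)
enhancer.fit(noisy, clean)                                   # train on parallel data

test_noisy = clean[:10] + rng.normal(scale=0.5, size=(10, n_dims))
enhanced = enhancer.predict(test_noisy)
print("Mean squared error before:", np.mean((test_noisy - clean[:10]) ** 2))
print("Mean squared error after: ", np.mean((enhanced - clean[:10]) ** 2))
```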

  2. The effects of reverberant self- and overlap-masking on speech recognition in cochlear implant listeners.

    Science.gov (United States)

    Desmond, Jill M; Collins, Leslie M; Throckmorton, Chandra S

    2014-06-01

    Many cochlear implant (CI) listeners experience decreased speech recognition in reverberant environments [Kokkinakis et al., J. Acoust. Soc. Am. 129(5), 3221-3232 (2011)], which may be caused by a combination of self- and overlap-masking [Bolt and MacDonald, J. Acoust. Soc. Am. 21(6), 577-580 (1949)]. Determining the extent to which these effects decrease speech recognition for CI listeners may influence reverberation mitigation algorithms. This study compared speech recognition with ideal self-masking mitigation, with ideal overlap-masking mitigation, and with no mitigation. Under these conditions, mitigating either self- or overlap-masking resulted in significant improvements in speech recognition for both normal hearing subjects utilizing an acoustic model and for CI listeners using their own devices.

  3. Channel normalization technique for speech recognition in mismatched conditions

    CSIR Research Space (South Africa)

    Kleynhans, N

    2008-11-01

    Full Text Available The performance of trainable speech-processing systems deteriorates significantly when there is a mismatch between the training and testing data. The data mismatch becomes a dominant factor when collecting speech data for resource scarce languages...

  4. Location and acoustic scale cues in concurrent speech recognition.

    Science.gov (United States)

    Ives, D Timothy; Vestergaard, Martin D; Kistler, Doris J; Patterson, Roy D

    2010-06-01

    Location and acoustic scale cues have both been shown to have an effect on the recognition of speech in multi-speaker environments. This study examines the interaction of these variables. Subjects were presented with concurrent triplets of syllables from a target voice and a distracting voice, and asked to recognize a specific target syllable. The task was made more or less difficult by changing (a) the location of the distracting speaker, (b) the scale difference between the two speakers, and/or (c) the relative level of the two speakers. Scale differences were produced by changing the vocal tract length and glottal pulse rate during syllable synthesis: 32 acoustic scale differences were used. Location cues were produced by convolving head-related transfer functions with the stimulus. The angle between the target speaker and the distracter was 0 degrees, 4 degrees, 8 degrees, 16 degrees, or 32 degrees on the 0 degrees horizontal plane. The relative level of the target to the distracter was 0 or -6 dB. The results show that location and scale difference interact, and the interaction is greatest when one of these cues is small. Increasing either the acoustic scale or the angle between target and distracter speakers quickly elevates performance to ceiling levels.

  5. End-to-End Neural Segmental Models for Speech Recognition

    Science.gov (United States)

    Tang, Hao; Lu, Liang; Kong, Lingpeng; Gimpel, Kevin; Livescu, Karen; Dyer, Chris; Smith, Noah A.; Renals, Steve

    2017-12-01

    Segmental models are an alternative to frame-based models for sequence prediction, where hypothesized path weights are based on entire segment scores rather than a single frame at a time. Neural segmental models are segmental models that use neural network-based weight functions. Neural segmental models have achieved competitive results for speech recognition, and their end-to-end training has been explored in several studies. In this work, we review neural segmental models, which can be viewed as consisting of a neural network-based acoustic encoder and a finite-state transducer decoder. We study end-to-end segmental models with different weight functions, including ones based on frame-level neural classifiers and on segmental recurrent neural networks. We study how reducing the search space size impacts performance under different weight functions. We also compare several loss functions for end-to-end training. Finally, we explore training approaches, including multi-stage vs. end-to-end training and multitask training that combines segmental and frame-level losses.

  6. Adoption of Speech Recognition Technology in Community Healthcare Nursing.

    Science.gov (United States)

    Al-Masslawi, Dawood; Block, Lori; Ronquillo, Charlene

    2016-01-01

    Adoption of new health information technology is shown to be challenging. However, the degree to which new technology will be adopted can be predicted by measures of usefulness and ease of use. In this work these key determining factors are focused on for design of a wound documentation tool. In the context of wound care at home, consistent with evidence in the literature from similar settings, use of Speech Recognition Technology (SRT) for patient documentation has shown promise. To achieve a user-centred design, the results from a conducted ethnographic fieldwork are used to inform SRT features; furthermore, exploratory prototyping is used to collect feedback about the wound documentation tool from home care nurses. During this study, measures developed for healthcare applications of the Technology Acceptance Model will be used, to identify SRT features that improve usefulness (e.g. increased accuracy, saving time) or ease of use (e.g. lowering mental/physical effort, easy to remember tasks). The identified features will be used to create a low fidelity prototype that will be evaluated in future experiments.

  7. Two Stage Data Augmentation for Low Resourced Speech Recognition (Author’s Manuscript)

    Science.gov (United States)

    2016-09-12

    Keywords: speech recognition, deep neural networks, data augmentation. When training data is limited, whether it be audio or text, the obvious ... channel variation. We test whether this translates to improved performance with data augmentation. Table 2 presents results on Amharic using ...

  8. Speech Recognition and Acoustic Features in Combined Electric and Acoustic Stimulation

    Science.gov (United States)

    Yoon, Yang-soo; Li, Yongxin; Fu, Qian-Jie

    2012-01-01

    Purpose: In this study, the authors aimed to identify speech information processed by a hearing aid (HA) that is additive to information processed by a cochlear implant (CI) as a function of signal-to-noise ratio (SNR). Method: Speech recognition was measured with CI alone, HA alone, and CI + HA. Ten participants were separated into 2 groups; good…

  9. Influence of native and non-native multitalker babble on speech recognition in noise

    Directory of Open Access Journals (Sweden)

    Chandni Jain

    2014-03-01

    Full Text Available The aim of the study was to assess speech recognition in noise using multitalker babble of a native and a non-native language at two different signal to noise ratios. Speech recognition in noise was assessed in 60 participants (18 to 30 years) with normal hearing sensitivity, having Malayalam and Kannada as their native languages. For this purpose, 6-talker and 10-talker babble were generated in the Kannada and Malayalam languages. Speech recognition was assessed for native listeners of both languages in the presence of native and non-native multitalker babble. Results showed that speech recognition in noise was significantly higher at 0 dB signal to noise ratio (SNR) compared to -3 dB SNR for both languages. Performance of Kannada listeners was significantly higher in the presence of native (Kannada) babble compared to non-native babble (Malayalam). However, this was not the case for the Malayalam listeners, who performed equally well with native (Malayalam) as well as non-native babble (Kannada). The results of the present study highlight the importance of using native multitalker babble for Kannada listeners in lieu of non-native babble, and of considering the importance of each SNR for estimating speech recognition in noise scores. Further research is needed to assess speech recognition in Malayalam listeners in the presence of other non-native backgrounds of various types.

  10. Influence of Native and Non-Native Multitalker Babble on Speech Recognition in Noise.

    Science.gov (United States)

    Jain, Chandni; Konadath, Sreeraj; Vimal, Bharathi M; Suresh, Vidhya

    2014-03-06

    The aim of the study was to assess speech recognition in noise using multitalker babble of a native and a non-native language at two different signal to noise ratios. Speech recognition in noise was assessed in 60 participants (18 to 30 years) with normal hearing sensitivity, having Malayalam and Kannada as their native languages. For this purpose, 6-talker and 10-talker babble were generated in the Kannada and Malayalam languages. Speech recognition was assessed for native listeners of both languages in the presence of native and non-native multitalker babble. Results showed that speech recognition in noise was significantly higher at 0 dB signal to noise ratio (SNR) compared to -3 dB SNR for both languages. Performance of Kannada listeners was significantly higher in the presence of native (Kannada) babble compared to non-native babble (Malayalam). However, this was not the case for the Malayalam listeners, who performed equally well with native (Malayalam) as well as non-native babble (Kannada). The results of the present study highlight the importance of using native multitalker babble for Kannada listeners in lieu of non-native babble, and of considering the importance of each SNR for estimating speech recognition in noise scores. Further research is needed to assess speech recognition in Malayalam listeners in the presence of other non-native backgrounds of various types.

  11. Listeners Experience Linguistic Masking Release in Noise-Vocoded Speech-in-Speech Recognition

    Science.gov (United States)

    Viswanathan, Navin; Kokkinakis, Kostas; Williams, Brittany T.

    2018-01-01

    Purpose: The purpose of this study was to evaluate whether listeners with normal hearing perceiving noise-vocoded speech-in-speech demonstrate better intelligibility of target speech when the background speech was mismatched in language (linguistic release from masking [LRM]) and/or location (spatial release from masking [SRM]) relative to the…

  12. Combining Semantic and Acoustic Features for Valence and Arousal Recognition in Speech

    DEFF Research Database (Denmark)

    Karadogan, Seliz; Larsen, Jan

    2012-01-01

    The recognition of affect in speech has attracted a lot of interest recently; especially in the area of cognitive and computer sciences. Most of the previous studies focused on the recognition of basic emotions (such as happiness, sadness and anger) using categorical approach. Recently, the focus...... has been shifting towards dimensional affect recognition based on the idea that emotional states are not independent from one another but related in a systematic manner. In this paper, we design a continuous dimensional speech affect recognition model that combines acoustic and semantic features. We...... show that combining semantic and acoustic information for dimensional speech recognition improves the results. Moreover, we show that valence is better estimated using semantic features while arousal is better estimated using acoustic features....
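
    An early-fusion version of this idea concatenates acoustic and semantic features and fits one regressor per affect dimension. In the toy sketch below, the targets are constructed so that arousal depends mainly on the acoustic block and valence on the semantic block, mirroring the quoted finding only by construction; all values are simulated.

```python
# Toy early fusion of acoustic and semantic features for valence/arousal regression.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 500
acoustic = rng.normal(size=(n, 30))          # e.g. prosodic/spectral statistics
semantic = rng.normal(size=(n, 10))          # e.g. affective word-rating features

arousal = acoustic[:, 0] + 0.1 * rng.normal(size=n)     # mostly acoustic by design
valence = semantic[:, 0] + 0.1 * rng.normal(size=n)     # mostly semantic by design

X = np.hstack([acoustic, semantic])
for name, target in [("arousal", arousal), ("valence", valence)]:
    X_tr, X_te, y_tr, y_te = train_test_split(X, target, random_state=0)
    model = Ridge().fit(X_tr, y_tr)
    print(name, "R^2 with fused features:", round(model.score(X_te, y_te), 2))
```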

  13. Automatic transcription of continuous speech into syllable-like units ...

    Indian Academy of Sciences (India)

    During testing, the speech signal is segmented into syllable-like units which are then tested against the HMMs obtained during training. ... DON Laboratory, Computer Science Department, Indian Institute of Technology Madras, Chennai 600 036; Department of Information Technology, SSN College of Engineering, Chennai ...

  14. Effect of Speaker Age on Speech Recognition and Perceived Listening Effort in Older Adults with Hearing Loss

    Science.gov (United States)

    McAuliffe, Megan J.; Wilding, Phillipa J.; Rickard, Natalie A.; O'Beirne, Greg A.

    2012-01-01

    Purpose: Older adults exhibit difficulty understanding speech that has been experimentally degraded. Age-related changes to the speech mechanism lead to natural degradations in signal quality. We tested the hypothesis that older adults with hearing loss would exhibit declines in speech recognition when listening to the speech of older adults,…

  15. Prediction of Speech Recognition in Cochlear Implant Users by Adapting Auditory Models to Psychophysical Data

    Directory of Open Access Journals (Sweden)

    Svante Stadler

    2009-01-01

    Full Text Available Users of cochlear implants (CIs) vary widely in their ability to recognize speech in noisy conditions. There are many factors that may influence their performance. We have investigated to what degree it can be explained by the users' ability to discriminate spectral shapes. A speech recognition task has been simulated using both a simple and a complex model of CI hearing. The models were individualized by adapting their parameters to fit the results of a spectral discrimination test. The predicted speech recognition performance was compared to experimental results, and they were significantly correlated. The presented framework may be used to simulate the effects of changing the CI encoding strategy.

  16. How Linguistic Closure and Verbal Working Memory Relate to Speech Recognition in Noise—A Review

    Science.gov (United States)

    Koelewijn, Thomas; Zekveld, Adriana A.; Kramer, Sophia E.; Festen, Joost M.

    2013-01-01

    The ability to recognize masked speech, commonly measured with a speech reception threshold (SRT) test, is associated with cognitive processing abilities. Two cognitive factors frequently assessed in speech recognition research are the capacity of working memory (WM), measured by means of a reading span (Rspan) or listening span (Lspan) test, and the ability to read masked text (linguistic closure), measured by the text reception threshold (TRT). The current article provides a review of recent hearing research that examined the relationship of TRT and WM span to SRTs in various maskers. Furthermore, modality differences in WM capacity assessed with the Rspan compared to the Lspan test were examined and related to speech recognition abilities in an experimental study with young adults with normal hearing (NH). Span scores were strongly associated with each other, but were higher in the auditory modality. The results of the reviewed studies suggest that TRT and WM span are related to each other, but differ in their relationships with SRT performance. In NH adults of middle age or older, both TRT and Rspan were associated with SRTs in speech maskers, whereas TRT better predicted speech recognition in fluctuating nonspeech maskers. The associations with SRTs in steady-state noise were inconclusive for both measures. WM span was positively related to benefit from contextual information in speech recognition, but better TRTs related to less interference from unrelated cues. Data for individuals with impaired hearing are limited, but larger WM span seems to give a general advantage in various listening situations. PMID:23945955

  17. How linguistic closure and verbal working memory relate to speech recognition in noise--a review.

    Science.gov (United States)

    Besser, Jana; Koelewijn, Thomas; Zekveld, Adriana A; Kramer, Sophia E; Festen, Joost M

    2013-06-01

    The ability to recognize masked speech, commonly measured with a speech reception threshold (SRT) test, is associated with cognitive processing abilities. Two cognitive factors frequently assessed in speech recognition research are the capacity of working memory (WM), measured by means of a reading span (Rspan) or listening span (Lspan) test, and the ability to read masked text (linguistic closure), measured by the text reception threshold (TRT). The current article provides a review of recent hearing research that examined the relationship of TRT and WM span to SRTs in various maskers. Furthermore, modality differences in WM capacity assessed with the Rspan compared to the Lspan test were examined and related to speech recognition abilities in an experimental study with young adults with normal hearing (NH). Span scores were strongly associated with each other, but were higher in the auditory modality. The results of the reviewed studies suggest that TRT and WM span are related to each other, but differ in their relationships with SRT performance. In NH adults of middle age or older, both TRT and Rspan were associated with SRTs in speech maskers, whereas TRT better predicted speech recognition in fluctuating nonspeech maskers. The associations with SRTs in steady-state noise were inconclusive for both measures. WM span was positively related to benefit from contextual information in speech recognition, but better TRTs related to less interference from unrelated cues. Data for individuals with impaired hearing are limited, but larger WM span seems to give a general advantage in various listening situations.

  18. Lipreading and audiovisual speech recognition across the adult lifespan: Implications for audiovisual integration.

    Science.gov (United States)

    Tye-Murray, Nancy; Spehar, Brent; Myerson, Joel; Hale, Sandra; Sommers, Mitchell

    2016-06-01

    In this study of visual (V-only) and audiovisual (AV) speech recognition in adults aged 22-92 years, the rate of age-related decrease in V-only performance was more than twice that in AV performance. Both auditory-only (A-only) and V-only performance were significant predictors of AV speech recognition, but age did not account for additional (unique) variance. Blurring the visual speech signal decreased speech recognition, and in AV conditions involving stimuli associated with equivalent unimodal performance for each participant, speech recognition remained constant from 22 to 92 years of age. Finally, principal components analysis revealed separate visual and auditory factors, but no evidence of an AV integration factor. Taken together, these results suggest that the benefit that comes from being able to see as well as hear a talker remains constant throughout adulthood and that changes in this AV advantage are entirely driven by age-related changes in unimodal visual and auditory speech recognition. (PsycINFO Database Record (c) 2016 APA, all rights reserved).

  19. "Product" Versus "Process" Measures in Assessing Speech Recognition Outcomes in Adults With Cochlear Implants.

    Science.gov (United States)

    Moberly, Aaron C; Castellanos, Irina; Vasil, Kara J; Adunka, Oliver F; Pisoni, David B

    2018-03-01

    1) When controlling for age in postlingual adult cochlear implant (CI) users, information-processing functions, as assessed using "process" measures of working memory capacity, inhibitory control, information-processing speed, and fluid reasoning, will predict traditional "product" outcome measures of speech recognition. 2) Demographic/audiologic factors, particularly duration of deafness, duration of CI use, degree of residual hearing, and socioeconomic status, will impact performance on underlying information-processing functions, as assessed using process measures. Clinicians and researchers rely heavily on endpoint product measures of accuracy in speech recognition to gauge patient outcomes postoperatively. However, these measures are primarily descriptive and were not designed to assess the underlying core information-processing operations that are used during speech recognition. In contrast, process measures reflect the integrity of elementary core subprocesses that are operative during behavioral tests using complex speech signals. Forty-two experienced adult CI users were tested using three product measures of speech recognition, along with four process measures of working memory capacity, inhibitory control, speed of lexical/phonological access, and nonverbal fluid reasoning. Demographic and audiologic factors were also assessed. Scores on product measures were associated with core process measures of speed of lexical/phonological access and nonverbal fluid reasoning. After controlling for participant age, demographic and audiologic factors did not correlate with process measure scores. Findings provide support for the important foundational roles of information processing operations in speech recognition outcomes of postlingually deaf patients who have received CIs.

  20. Report generation using digital speech recognition in radiology

    International Nuclear Information System (INIS)

    Vorbeck, F.; Ba-Ssalamah, A.; Kettenbach, J.; Huebsch, P.

    2000-01-01

    The aim of this study was to evaluate whether the use of a digital continuous speech recognition (CSR) system in the field of radiology could lead to relevant time savings in generating a report. A CSR system (SP6000, Philips, Eindhoven, The Netherlands) for German was used to transform fluently spoken sentences into text. Two radiologists dictated a total of 450 reports on five radiological topics. Two typists edited those reports by means of conventional typing using a text editor (WinWord 6.0, Microsoft, Redmond, Wash.) installed on an IBM-compatible personal computer (PC). The same reports were generated using the CSR system, and the performance of both systems was then evaluated by comparing the time needed to generate the reports and the error rates of both systems. In addition, the error rate of the CSR system and the time needed to create the reports were evaluated. The mean error rate for the CSR system was 5.5 %, and the mean error rate for conventional typing was 0.4 %. Reports edited with the CSR were, on average, generated 19 % faster than with the conventional text-editing method. However, error rates and time savings varied with topic, speaker, and typist. Using CSR, the maximum time saving achieved was 28 % for the topic of sonography. The CSR system was never slower, under any circumstances, than conventional typing on a PC. When compared with a conventional manual typing method, the CSR system proved to be useful in a clinical setting and saved time in generating radiological reports. The amount of time saved, however, greatly depended on the performance of the typist, the speaker, and on the stored vocabulary provided by the CSR system. (orig.)

  1. Use of intonation contours for speech recognition in noise by cochlear implant recipients.

    Science.gov (United States)

    Meister, Hartmut; Landwehr, Markus; Pyschny, Verena; Grugel, Linda; Walger, Martin

    2011-05-01

    The corruption of intonation contours has detrimental effects on sentence-based speech recognition in normal-hearing listeners [Binns and Culling (2007). J. Acoust. Soc. Am. 122, 1765-1776]. This paper examines whether this finding also applies to cochlear implant (CI) recipients. The subjects' F0-discrimination and speech perception in the presence of noise were measured, using sentences with regular and inverted F0-contours. The results revealed that speech recognition for regular contours was significantly better than for inverted contours. This difference was related to the subjects' F0-discrimination, providing further evidence that the perception of intonation patterns is important for CI-mediated speech recognition in noise.

  2. Joint variable frame rate and length analysis for speech recognition under adverse conditions

    DEFF Research Database (Denmark)

    Tan, Zheng-Hua; Kraljevski, Ivan

    2014-01-01

    This paper presents a method that combines variable frame length and rate analysis for speech recognition in noisy environments, together with an investigation of the effect of different frame lengths on speech recognition performance. The method adopts frame selection using an a posteriori signal-to-noise (SNR) ratio weighted energy distance and increases the length of the selected frames according to the number of non-selected preceding frames. It assigns a higher frame rate and a normal frame length to a rapidly changing and high SNR region of a speech signal, and a lower frame rate and an increased frame length to a steady or low SNR region. The speech recognition results show that the proposed variable frame rate and length method outperforms fixed frame rate and length analysis, as well as standalone variable frame rate analysis, in terms of noise-robustness.
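
    As a rough illustration of the selection idea described above (a sketch of the general mechanism, not the authors' algorithm), the snippet below weights the frame-to-frame energy distance by an a posteriori SNR estimate and emits a frame whenever the accumulated weighted distance crosses a threshold; the noise-power estimate and the threshold value are placeholders.

```python
import numpy as np

def variable_frame_rate_select(frames, noise_power, threshold=2.0):
    """Sketch of a posteriori SNR-weighted energy-distance frame selection.

    frames: (num_frames, frame_len) array of windowed speech samples.
    noise_power: scalar noise-floor estimate (assumed given here).
    Returns indices of the selected frames; `threshold` is a placeholder.
    """
    energy = np.sum(frames ** 2, axis=1) + 1e-10
    post_snr = energy / noise_power                 # a posteriori SNR per frame
    selected = [0]
    accumulated = 0.0
    for t in range(1, len(frames)):
        # Energy distance to the previous frame, weighted by the frame's SNR:
        # high-SNR, rapidly changing regions accumulate distance quickly and
        # therefore receive a higher effective frame rate.
        dist = abs(np.log(energy[t]) - np.log(energy[t - 1]))
        accumulated += dist * np.log1p(post_snr[t])
        if accumulated >= threshold:
            selected.append(t)
            accumulated = 0.0
    return np.array(selected)
```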

  3. Semi-automatic recognition of marine debris on beaches.

    Science.gov (United States)

    Ge, Zhenpeng; Shi, Huahong; Mei, Xuefei; Dai, Zhijun; Li, Daoji

    2016-05-09

    An increasing amount of anthropogenic marine debris is pervading the earth's environmental systems, resulting in an enormous threat to living organisms. Additionally, the large amount of marine debris around the world has been investigated mostly through tedious manual methods. Therefore, we propose the use of a new technique, light detection and ranging (LIDAR), for the semi-automatic recognition of marine debris on a beach, because it is substantially more efficient than other, more laborious methods. Our results revealed that LIDAR can be used for the classification of marine debris into plastic, paper, cloth and metal. Additionally, we reconstructed a 3-dimensional model of different types of debris on a beach, with a high validity of debris revivification, using LIDAR-based individual separation. These findings demonstrate that the availability of this new technique enables detailed observations to be made of debris on a large beach that were previously not possible. It is strongly suggested that LIDAR could be implemented by global researchers and governments as an appropriate monitoring tool for marine debris.

  4. The role of situational awareness in automatic target recognition

    Science.gov (United States)

    Snorrason, Magnus S.; Goodsell, Thom R.; Monnier, Camille R.; Stevens, Mark R.

    2004-09-01

    Model-based Automatic Target Recognition (ATR) algorithms are adept at recognizing targets in high fidelity 3D LADAR imagery. Most current approaches involve a matching component where a hypothesized target and target pose are iteratively aligned to pre-segmented range data. Once the model-to-data alignment has converged, a match score is generated indicating the quality of match. This score is then used to rank one model hypothesis over another. This approach has two main drawbacks. First, to ensure the correct target is recognized, a large number of model hypotheses must be considered. Even with a highly accurate indexing algorithm, the number of target types and variants that need to be explored is prohibitive for real-time operation. Second, the iterative matching step must consider a variety of target poses to ensure that the correct alignment is recovered. Inaccurate alignments produce erroneous match scores and thus errors when ranking one target hypothesis over another. To compensate for such drawbacks, we explore the use of situational awareness information already available to an image analyst. Examples of such information include knowledge of the surrounding terrain (to assess potential occlusion levels) and targets of interest (to account for target variants).

  5. Feature extractor and search engine for automatic object recognition

    Science.gov (United States)

    Freire, Joao C.; Correia, Bento A. B.

    1998-10-01

    Automatic object recognition methodologies are being used increasingly in the automation of many industrial processes. However, the specific nature of each system usually implies an intensive development effort. Aiming to control and constrain such effort, this paper presents a processing architecture that promotes the reutilization of common system modules. The approach described covers the whole processing chain by connecting low-level pre-processing to the final classification stage, specially emphasizing the segmentation and feature extraction levels. The resulting core is a search engine based on the data structure of a graph. This structure is initially fed with raw edge information extracted from the original raster image representation. The edge segments used at this preliminary version of the graph are the result of a multistage process of edge following and edge linking. The use of image properties at high levels of abstraction allows the dimensions of the search space to be significantly reduced. Modularity and reutilization allows the refinement of an initial low-level representation of information into successive higher levels of description, until a final set of features that may directly feed a classifier is achieved. Parallel contours, constant curvature edge segments and closed contours of a given shape are a few examples of the features easily defined within this architecture and that can be extracted by the search engine implemented.

  6. Automatic sociophonetics: Exploring corpora with a forensic accent recognition system.

    Science.gov (United States)

    Brown, Georgina; Wormald, Jessica

    2017-07-01

    This paper demonstrates how the Y-ACCDIST system, the York ACCDIST-based automatic accent recognition system [Brown (2015). Proceedings of the International Congress of Phonetic Sciences, Glasgow, UK], can be used to inspect sociophonetic corpora as a preliminary "screening" tool. Although Y-ACCDIST's intended application is to assist with forensic casework, the system can also be exploited in sociophonetic research to begin unpacking variation. Using a subset of the PEBL (Panjabi-English in Bradford and Leicester) corpus, the outputs of Y-ACCDIST are explored, which, it is argued, efficiently and objectively assess speaker similarities across different linguistic varieties. The ways these outputs corroborate with a phonetic analysis of the data are also discovered. First, Y-ACCDIST is used to classify speakers from the corpus based on language background and region. A Y-ACCDIST cluster analysis is then implemented, which groups speakers in ways consistent with more localised networks, providing a means of identifying potential communities of practice. Additionally, the results of a Y-ACCDIST feature selection task that indicates which specific phonemes are most valuable in distinguishing between speaker groups are presented. How Y-ACCDIST outputs can be used to reinforce more traditional sociophonetic analyses and support qualitative interpretations of the data is demonstrated.

  7. Speech Recognition in Adults with Cochlear Implants: The Effects of Working Memory, Phonological Sensitivity, and Aging

    Science.gov (United States)

    Moberly, Aaron C.; Harris, Michael S.; Boyce, Lauren; Nittrouer, Susan

    2017-01-01

    Purpose: Models of speech recognition suggest that "top-down" linguistic and cognitive functions, such as use of phonotactic constraints and working memory, facilitate recognition under conditions of degradation, such as in noise. The question addressed in this study was what happens to these functions when a listener who has experienced…

  8. Native language shapes automatic neural processing of speech.

    Science.gov (United States)

    Intartaglia, Bastien; White-Schwoch, Travis; Meunier, Christine; Roman, Stéphane; Kraus, Nina; Schön, Daniele

    2016-08-01

    The development of the phoneme inventory is driven by the acoustic-phonetic properties of one's native language. Neural representation of speech is known to be shaped by language experience, as indexed by cortical responses, and recent studies suggest that subcortical processing also exhibits this attunement to native language. However, most work to date has focused on the differences between tonal and non-tonal languages that use pitch variations to convey phonemic categories. The aim of this cross-language study is to determine whether subcortical encoding of speech sounds is sensitive to language experience by comparing native speakers of two non-tonal languages (French and English). We hypothesized that neural representations would be more robust and fine-grained for speech sounds that belong to the native phonemic inventory of the listener, and especially for the dimensions that are phonetically relevant to the listener such as high frequency components. We recorded neural responses of American English and French native speakers, listening to natural syllables of both languages. Results showed that, independently of the stimulus, American participants exhibited greater neural representation of the fundamental frequency compared to French participants, consistent with the importance of the fundamental frequency to convey stress patterns in English. Furthermore, participants showed more robust encoding and more precise spectral representations of the first formant when listening to the syllable of their native language as compared to non-native language. These results align with the hypothesis that language experience shapes sensory processing of speech and that this plasticity occurs as a function of what is meaningful to a listener. Copyright © 2016 Elsevier Ltd. All rights reserved.

  9. Melodic contour identification and sentence recognition using sung speech.

    Science.gov (United States)

    Crew, Joseph D; Galvin, John J; Fu, Qian-Jie

    2015-09-01

    For bimodal cochlear implant users, acoustic and electric hearing has been shown to contribute differently to speech and music perception. However, differences in test paradigms and stimuli in speech and music testing can make it difficult to assess the relative contributions of each device. To address these concerns, the Sung Speech Corpus (SSC) was created. The SSC contains 50 monosyllable words sung over an octave range and can be used to test both speech and music perception using the same stimuli. Here SSC data are presented with normal hearing listeners and any advantage of musicianship is examined.

  10. Conversation electrified: ERP correlates of speech act recognition in underspecified utterances.

    Directory of Open Access Journals (Sweden)

    Rosa S Gisladottir

    Full Text Available The ability to recognize speech acts (verbal actions) in conversation is critical for everyday interaction. However, utterances are often underspecified for the speech act they perform, requiring listeners to rely on the context to recognize the action. The goal of this study was to investigate the time-course of auditory speech act recognition in action-underspecified utterances and explore how sequential context (the prior action) impacts this process. We hypothesized that speech acts are recognized early in the utterance to allow for quick transitions between turns in conversation. Event-related potentials (ERPs) were recorded while participants listened to spoken dialogues and performed an action categorization task. The dialogues contained target utterances, each of which could deliver three distinct speech acts depending on the prior turn. The targets were identical across conditions, but differed in the type of speech act performed and how it fit into the larger action sequence. The ERP results show an early effect of action type, reflected by frontal positivities as early as 200 ms after target utterance onset. This indicates that speech act recognition begins early in the turn when the utterance has only been partially processed. Providing further support for early speech act recognition, actions in highly constraining contexts did not elicit an ERP effect to the utterance-final word. We take this to show that listeners can recognize the action before the final word through predictions at the speech act level. However, additional processing based on the complete utterance is required in more complex actions, as reflected by a posterior negativity at the final word when the speech act is in a less constraining context and a new action sequence is initiated. These findings demonstrate that sentence comprehension in conversational contexts crucially involves recognition of verbal action, which begins as soon as it can.

  11. Conversation electrified: ERP correlates of speech act recognition in underspecified utterances.

    Science.gov (United States)

    Gisladottir, Rosa S; Chwilla, Dorothee J; Levinson, Stephen C

    2015-01-01

    The ability to recognize speech acts (verbal actions) in conversation is critical for everyday interaction. However, utterances are often underspecified for the speech act they perform, requiring listeners to rely on the context to recognize the action. The goal of this study was to investigate the time-course of auditory speech act recognition in action-underspecified utterances and explore how sequential context (the prior action) impacts this process. We hypothesized that speech acts are recognized early in the utterance to allow for quick transitions between turns in conversation. Event-related potentials (ERPs) were recorded while participants listened to spoken dialogues and performed an action categorization task. The dialogues contained target utterances, each of which could deliver three distinct speech acts depending on the prior turn. The targets were identical across conditions, but differed in the type of speech act performed and how it fit into the larger action sequence. The ERP results show an early effect of action type, reflected by frontal positivities as early as 200 ms after target utterance onset. This indicates that speech act recognition begins early in the turn when the utterance has only been partially processed. Providing further support for early speech act recognition, actions in highly constraining contexts did not elicit an ERP effect to the utterance-final word. We take this to show that listeners can recognize the action before the final word through predictions at the speech act level. However, additional processing based on the complete utterance is required in more complex actions, as reflected by a posterior negativity at the final word when the speech act is in a less constraining context and a new action sequence is initiated. These findings demonstrate that sentence comprehension in conversational contexts crucially involves recognition of verbal action, which begins as soon as it can.

  12. Effects of Semantic Context and Fundamental Frequency Contours on Mandarin Speech Recognition by Second Language Learners.

    Science.gov (United States)

    Zhang, Linjun; Li, Yu; Wu, Han; Li, Xin; Shu, Hua; Zhang, Yang; Li, Ping

    2016-01-01

    Speech recognition by second language (L2) learners in optimal and suboptimal conditions has been examined extensively with English as the target language in most previous studies. This study extended existing experimental protocols (Wang et al., 2013) to investigate Mandarin speech recognition by Japanese learners of Mandarin at two different levels (elementary vs. intermediate) of proficiency. The overall results showed that in addition to L2 proficiency, semantic context, F0 contours, and listening condition all affected the recognition performance on the Mandarin sentences. However, the effects of semantic context and F0 contours on L2 speech recognition diverged to some extent. Specifically, there was a significant modulation effect of listening condition on semantic context, indicating that L2 learners made use of semantic context less efficiently in the interfering background than in quiet. In contrast, no significant modulation effect of listening condition on F0 contours was found. Furthermore, there was a significant interaction between semantic context and F0 contours, indicating that semantic context becomes more important for L2 speech recognition when F0 information is degraded. None of these effects were found to be modulated by L2 proficiency. The discrepancy in the effects of semantic context and F0 contours on L2 speech recognition in the interfering background might be related to differences in the processing capacities required by the two types of information in adverse listening conditions.

  13. Feature Fusion Algorithm for Multimodal Emotion Recognition from Speech and Facial Expression Signal

    Directory of Open Access Journals (Sweden)

    Han Zhiyan

    2016-01-01

    Full Text Available In order to overcome the limitations of single-mode emotion recognition, this paper describes a novel multimodal emotion recognition algorithm that takes the speech signal and the facial expression signal as its research subjects. First, the speech and facial expression features are fused, sample sets are obtained by sampling with replacement, and classifiers are trained with a back-propagation neural network (BPNN). Second, the difference between two classifiers is measured by a double error difference selection strategy. Finally, the final recognition result is obtained by the majority voting rule. Experiments show that the method improves the accuracy of emotion recognition by exploiting the advantages of both decision-level and feature-level fusion, bringing the whole fusion process closer to human emotion recognition, with a recognition rate of 90.4%.
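
    A minimal sketch of the pipeline just described, assuming integer-coded emotion labels and using scikit-learn's MLPClassifier as a stand-in for the BP neural network; features are fused at the feature level, classifiers are trained on bootstrap ("putting back") samples, and predictions are combined by majority vote (the double error difference selection step is omitted).

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

def train_fusion_ensemble(speech_feats, face_feats, labels, n_models=5, seed=0):
    """Feature-level fusion followed by bootstrap-trained neural classifiers."""
    X = np.hstack([speech_feats, face_feats])        # feature-level fusion
    rng = np.random.default_rng(seed)
    models = []
    for _ in range(n_models):
        idx = rng.integers(0, len(X), size=len(X))   # sampling with replacement
        clf = MLPClassifier(hidden_layer_sizes=(64,), max_iter=500)
        clf.fit(X[idx], labels[idx])
        models.append(clf)
    return models

def predict_majority(models, speech_feats, face_feats):
    """Decision-level fusion: majority vote over the individual classifiers."""
    X = np.hstack([speech_feats, face_feats])
    votes = np.stack([m.predict(X) for m in models])   # (n_models, n_samples)
    return np.array([np.bincount(votes[:, i].astype(int)).argmax()
                     for i in range(votes.shape[1])])
```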

  14. Visual face-movement sensitive cortex is relevant for auditory-only speech recognition.

    Science.gov (United States)

    Riedel, Philipp; Ragert, Patrick; Schelinski, Stefanie; Kiebel, Stefan J; von Kriegstein, Katharina

    2015-07-01

    It is commonly assumed that the recruitment of visual areas during audition is not relevant for performing auditory tasks ('auditory-only view'). According to an alternative view, however, the recruitment of visual cortices is thought to optimize auditory-only task performance ('auditory-visual view'). This alternative view is based on functional magnetic resonance imaging (fMRI) studies. These studies have shown, for example, that even if there is only auditory input available, face-movement sensitive areas within the posterior superior temporal sulcus (pSTS) are involved in understanding what is said (auditory-only speech recognition). This is particularly the case when speakers are known audio-visually, that is, after brief voice-face learning. Here we tested whether the left pSTS involvement is causally related to performance in auditory-only speech recognition when speakers are known by face. To test this hypothesis, we applied cathodal transcranial direct current stimulation (tDCS) to the pSTS during (i) visual-only speech recognition of a speaker known only visually to participants and (ii) auditory-only speech recognition of speakers they learned by voice and face. We defined the cathode as active electrode to down-regulate cortical excitability by hyperpolarization of neurons. tDCS to the pSTS interfered with visual-only speech recognition performance compared to a control group without pSTS stimulation (tDCS to BA6/44 or sham). Critically, compared to controls, pSTS stimulation additionally decreased auditory-only speech recognition performance selectively for voice-face learned speakers. These results are important in two ways. First, they provide direct evidence that the pSTS is causally involved in visual-only speech recognition; this confirms a long-standing prediction of current face-processing models. Secondly, they show that visual face-sensitive pSTS is causally involved in optimizing auditory-only speech recognition. These results are in line

  15. Does quality of life depend on speech recognition performance for adult cochlear implant users?

    Science.gov (United States)

    Capretta, Natalie R; Moberly, Aaron C

    2016-03-01

    Current postoperative clinical outcome measures for adults receiving cochlear implants (CIs) consist of testing speech recognition, primarily under quiet conditions. However, it is strongly suspected that results on these measures may not adequately reflect patients' quality of life (QOL) using their implants. This study aimed to evaluate whether QOL for CI users depends on speech recognition performance. Twenty-three postlingually deafened adults with CIs were assessed. Participants were tested for speech recognition (Central Institute for the Deaf word and AzBio sentence recognition in quiet) and completed three QOL measures-the Nijmegen Cochlear Implant Questionnaire; either the Hearing Handicap Inventory for Adults or the Hearing Handicap Inventory for the Elderly; and the Speech, Spatial and Qualities of Hearing Scale questionnaires-to assess a variety of QOL factors. Correlations were sought between speech recognition and QOL scores. Demographics, audiologic history, language, and cognitive skills were also examined as potential predictors of QOL. Only a few QOL scores significantly correlated with postoperative sentence or word recognition in quiet, and correlations were primarily isolated to speech-related subscales on QOL measures. Poorer pre- and postoperative unaided hearing predicted better QOL. Socioeconomic status, duration of deafness, age at implantation, duration of CI use, reading ability, vocabulary size, and cognitive status did not consistently predict QOL scores. For adult, postlingually deafened CI users, clinical speech recognition measures in quiet do not correlate broadly with QOL. Results suggest the need for additional outcome measures of the benefits and limitations of cochlear implantation. Level of evidence: 4. Laryngoscope, 126:699-706, 2016. © 2015 The American Laryngological, Rhinological and Otological Society, Inc.

  16. Recognition of voice commands using adaptation of foreign language speech recognizer via selection of phonetic transcriptions

    Science.gov (United States)

    Maskeliunas, Rytis; Rudzionis, Vytautas

    2011-06-01

    In recent years various commercial speech recognizers have become available. These recognizers provide the possibility to develop applications incorporating various speech recognition techniques easily and quickly. All of these commercial recognizers are typically targeted to widely spoken languages having large market potential; however, it may be possible to adapt available commercial recognizers for use in environments where less widely spoken languages are used. Since most commercial recognition engines are closed systems, the single avenue for adaptation is to find suitable ways of selecting proper phonetic transcriptions between the two languages. This paper deals with methods to find phonetic transcriptions for Lithuanian voice commands to be recognized using English speech engines. The experimental evaluation showed that it is possible to find phonetic transcriptions that enable the recognition of Lithuanian voice commands with a recognition accuracy of over 90%.

  17. Speech, speaker and speaker's gender identification in automatically processed broadcast stream

    Czech Academy of Sciences Publication Activity Database

    Silovský, J.; Nouza, Jan

    2006-01-01

    Vol. 15, No. 3 (2006), pp. 42-48 ISSN 1210-2512 R&D Projects: GA AV ČR(CZ) 1QS108040569 Institutional research plan: CEZ:AV0Z20670512 Keywords: speech processing * speaker recognition Subject RIV: JD - Computer Applications, Robotics

  18. Automatic analysis of multiparty meetings

    Indian Academy of Sciences (India)

    The record covers the Augmented Multi-party Interaction (AMI) meeting corpus, the development of a meeting speech recognition system, and systems for the automatic segmentation, summarization and social processing of meetings, together with some example applications based on these systems.

  19. AUTOMATIC RECOGNITION OF BOTH INTER AND INTRA CLASSES OF DIGITAL MODULATED SIGNALS USING ARTIFICIAL NEURAL NETWORK

    Directory of Open Access Journals (Sweden)

    JIDE JULIUS POPOOLA

    2014-04-01

    Full Text Available In radio communication systems, signal modulation format recognition is a significant characteristic used in radio signal monitoring and identification. Over the past few decades, modulation formats have become increasingly complex, which has led to the problem of how to accurately and promptly recognize a modulation format. In addressing these challenges, the development of automatic modulation recognition systems that can classify a radio signal's modulation format has received worldwide attention. Decision-theoretic methods and pattern recognition solutions are the two typical automatic modulation recognition approaches. While decision-theoretic approaches use probabilistic or likelihood functions, pattern recognition uses feature-based methods. This study applies the pattern recognition approach based on statistical parameters, using an artificial neural network to classify five different digital modulation formats. The paper deals with automatic recognition of both inter- and intra-class digitally modulated signals, in contrast to most of the existing algorithms in the literature, which deal with either inter-class or intra-class modulation format recognition. The results of this study show that accurate and prompt modulation recognition is possible beyond the 5 dB lower bound commonly cited in the literature. The other significant contribution of this paper is the use of the Python programming language, which reduces the computational complexity that characterizes other automatic modulation recognition classifiers developed using the conventional MATLAB neural network toolbox.
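
    Since the abstract highlights a Python implementation, a toy sketch in Python follows; the statistical parameters computed here (spread and higher-order central moments of the normalized amplitude and the instantaneous frequency) are illustrative stand-ins for the paper's feature set, and the network size is arbitrary.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

def modulation_features(iq):
    """Toy statistical-parameter features for a complex baseband signal `iq`."""
    amp = np.abs(iq)
    phase = np.unwrap(np.angle(iq))
    inst_freq = np.diff(phase)                     # instantaneous frequency proxy
    feats = []
    for x in (amp / (amp.mean() + 1e-12), inst_freq):
        centered = x - x.mean()
        feats += [x.std(),                         # spread
                  (centered ** 3).mean(),          # third central moment
                  (centered ** 4).mean()]          # fourth central moment
    return np.array(feats)

# Usage sketch (signals and labels are assumed to exist):
# X = np.array([modulation_features(sig) for sig in signals])
# clf = MLPClassifier(hidden_layer_sizes=(32,), max_iter=1000).fit(X, labels)
```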

  20. How does susceptibility to proactive interference relate to speech recognition in aided and unaided conditions?

    Directory of Open Access Journals (Sweden)

    Rachel Jane Ellis

    2015-08-01

    Full Text Available Proactive interference (PI) is the capacity to resist interference to the acquisition of new memories from information stored in the long-term memory. Previous research has shown that PI correlates significantly with the speech-in-noise recognition scores of younger adults with normal hearing. In this study, we report the results of an experiment designed to investigate the extent to which tests of visual PI relate to the speech-in-noise recognition scores of older adults with hearing loss, in aided and unaided conditions. The results suggest that measures of PI correlate significantly with speech-in-noise recognition only in the unaided condition. Furthermore, the relation between PI and speech-in-noise recognition differs from that observed in younger listeners without hearing loss. The findings suggest that the relation between PI tests and the speech-in-noise recognition scores of older adults with hearing loss reflects the capability of the test to index cognitive flexibility.

  1. Does cognitive function predict frequency compressed speech recognition in listeners with normal hearing and normal cognition?

    Science.gov (United States)

    Ellis, Rachel J; Munro, Kevin J

    2013-01-01

    The aim was to investigate the relationship between cognitive ability and frequency compressed speech recognition in listeners with normal hearing and normal cognition. Speech-in-noise recognition was measured using Institute of Electrical and Electronic Engineers sentences presented over earphones at 65 dB SPL and a range of signal-to-noise ratios. There were three conditions: unprocessed, and at frequency compression ratios of 2:1 and 3:1 (cut-off frequency, 1.6 kHz). Working memory and cognitive ability were measured using the reading span test and the trail making test, respectively. Participants were 15 young normally-hearing adults with normal cognition. There was a statistically significant reduction in mean speech recognition from around 80% when unprocessed to 40% for 2:1 compression and 30% for 3:1 compression. There was a statistically significant relationship between speech recognition and cognition for the unprocessed condition but not for the frequency-compressed conditions. The relationship between cognitive functioning and recognition of frequency compressed speech-in-noise was not statistically significant. The findings may have been different if the participants had been provided with training and/or time to 'acclimatize' to the frequency-compressed conditions.

  2. Auditory training of speech recognition with interrupted and continuous noise maskers by children with hearing impairment.

    Science.gov (United States)

    Sullivan, Jessica R; Thibodeau, Linda M; Assmann, Peter F

    2013-01-01

    Previous studies have indicated that individuals with normal hearing (NH) experience a perceptual advantage for speech recognition in interrupted noise compared to continuous noise. In contrast, adults with hearing impairment (HI) and younger children with NH receive a minimal benefit. The objective of this investigation was to assess whether auditory training in interrupted noise would improve speech recognition in noise for children with HI and perhaps enhance their utilization of glimpsing skills. A partially-repeated measures design was used to evaluate the effectiveness of seven 1-h sessions of auditory training in interrupted and continuous noise. Speech recognition scores in interrupted and continuous noise were obtained from pre-, post-, and 3 months post-training from 24 children with moderate-to-severe hearing loss. Children who participated in auditory training in interrupted noise demonstrated a significantly greater improvement in speech recognition compared to those who trained in continuous noise. Those who trained in interrupted noise demonstrated similar improvements in both noise conditions while those who trained in continuous noise only showed modest improvements in the interrupted noise condition. This study presents direct evidence that auditory training in interrupted noise can be beneficial in improving speech recognition in noise for children with HI.

  3. The Relationship Between Spectral Modulation Detection and Speech Recognition: Adult Versus Pediatric Cochlear Implant Recipients.

    Science.gov (United States)

    Gifford, René H; Noble, Jack H; Camarata, Stephen M; Sunderhaus, Linsey W; Dwyer, Robert T; Dawant, Benoit M; Dietrich, Mary S; Labadie, Robert F

    2018-01-01

    Adult cochlear implant (CI) recipients demonstrate a reliable relationship between spectral modulation detection and speech understanding. Prior studies documenting this relationship have focused on postlingually deafened adult CI recipients-leaving an open question regarding the relationship between spectral resolution and speech understanding for adults and children with prelingual onset of deafness. Here, we report CI performance on the measures of speech recognition and spectral modulation detection for 578 CI recipients including 477 postlingual adults, 65 prelingual adults, and 36 prelingual pediatric CI users. The results demonstrated a significant correlation between spectral modulation detection and various measures of speech understanding for 542 adult CI recipients. For 36 pediatric CI recipients, however, there was no significant correlation between spectral modulation detection and speech understanding in quiet or in noise nor was spectral modulation detection significantly correlated with listener age or age at implantation. These findings suggest that pediatric CI recipients might not depend upon spectral resolution for speech understanding in the same manner as adult CI recipients. It is possible that pediatric CI users are making use of different cues, such as those contained within the temporal envelope, to achieve high levels of speech understanding. Further investigation is warranted to investigate the relationship between spectral and temporal resolution and speech recognition to describe the underlying mechanisms driving peripheral auditory processing in pediatric CI users.

  4. Phonological feature-based speech recognition system for pronunciation training in non-native language learning.

    Science.gov (United States)

    Arora, Vipul; Lahiri, Aditi; Reetz, Henning

    2018-01-01

    The authors address the question whether phonological features can be used effectively in an automatic speech recognition (ASR) system for pronunciation training in non-native language (L2) learning. Computer-aided pronunciation training consists of two essential tasks-detecting mispronunciations and providing corrective feedback, usually on the basis of either full words or phonemes. Phonemes, however, can be further disassembled into phonological features, which in turn define groups of phonemes. A phonological feature-based ASR system allows the authors to perform a sub-phonemic analysis at the feature level, providing more effective feedback to reach the acoustic goal and perceptual constancy. Furthermore, phonological features provide a structured way of analysing the types of errors a learner makes, and can readily convey which pronunciations need improvement. This paper presents the authors' implementation of such an ASR system using deep neural networks as an acoustic model, and its use for detecting mispronunciations, analysing errors, and rendering corrective feedback. Quantitative as well as qualitative evaluations are carried out for German and Italian learners of English. In addition to achieving high accuracy of mispronunciation detection, the system also provides accurate diagnosis of errors.
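
    To illustrate feature-level feedback in general terms (the phoneme set and feature values below are invented for illustration and are not the authors' inventory), a mispronunciation can be reported as the set of phonological features on which the produced phoneme differs from the target:

```python
# Hypothetical phoneme-to-feature table; a real system would use a full inventory.
FEATURES = {
    "t": {"voice": 0, "place": "coronal", "manner": "stop"},
    "d": {"voice": 1, "place": "coronal", "manner": "stop"},
    "s": {"voice": 0, "place": "coronal", "manner": "fricative"},
}

def feature_feedback(target, produced):
    """Return the phonological features that differ between target and production."""
    return {k: (FEATURES[target][k], FEATURES[produced][k])
            for k in FEATURES[target] if FEATURES[target][k] != FEATURES[produced][k]}

# feature_feedback("t", "d") -> {"voice": (0, 1)}: the learner voiced a voiceless stop.
```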

  5. Fast Automatic Configuration of Artificial Neural Networks Used for Binary Patterns Recognition

    OpenAIRE

    Adrian Horzyk

    2001-01-01

    This paper presents a powerful method for automatically generating the architecture of neural networks used for binary pattern recognition, which can quickly and automatically reduce synapses in a way that minimally reduces the quality of recognition and the quality of generalization. Moreover, this method computes all weights in two runs over a learning sequence, which makes it very fast. First, the method calculates all binary features for each pattern and then the weights are computed. Furthermo...

  6. Hierarchical singleton-type recurrent neural fuzzy networks for noisy speech recognition.

    Science.gov (United States)

    Juang, Chia-Feng; Chiou, Chyi-Tian; Lai, Chun-Lung

    2007-05-01

    This paper proposes noisy speech recognition using hierarchical singleton-type recurrent neural fuzzy networks (HSRNFNs). The proposed HSRNFN is a hierarchical connection of two singleton-type recurrent neural fuzzy networks (SRNFNs), where one is used for noise filtering and the other for recognition. The SRNFN is constructed from recurrent fuzzy if-then rules with fuzzy singletons in the consequences, and their recurrent properties make them suitable for processing speech patterns with temporal characteristics. In n-word recognition, n SRNFNs are created to model the n words, where each SRNFN receives the current frame feature and predicts the next frame of the word it models. The prediction error of each SRNFN is used as the recognition criterion. In filtering, one SRNFN is created, and each SRNFN recognizer is connected to the same SRNFN filter, which filters noisy speech patterns in the feature domain before feeding them to the SRNFN recognizer. Experiments with Mandarin word recognition under different types of noise are performed. Other recognizers, including multilayer perceptrons (MLPs), time-delay neural networks (TDNNs), and hidden Markov models (HMMs), are also tested and compared. These experiments and comparisons demonstrate good results with HSRNFNs for noisy speech recognition tasks.
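
    The decision rule described above (one next-frame predictor per word, recognition by minimum prediction error) can be sketched as follows; a plain MLP regressor stands in for the recurrent neural fuzzy network, and the noise-filtering SRNFN is omitted.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

class PredictionErrorRecognizer:
    """Sketch of the decision rule only, not of the SRNFN architecture."""

    def __init__(self, words):
        self.models = {w: MLPRegressor(hidden_layer_sizes=(32,), max_iter=500)
                       for w in words}

    def fit(self, training_frames):
        # training_frames: dict word -> list of (num_frames, dim) feature matrices.
        for word, utterances in training_frames.items():
            X = np.vstack([u[:-1] for u in utterances])   # current frames
            Y = np.vstack([u[1:] for u in utterances])    # next frames to predict
            self.models[word].fit(X, Y)

    def recognize(self, frames):
        errors = {}
        for word, model in self.models.items():
            pred = model.predict(frames[:-1])
            errors[word] = np.mean((pred - frames[1:]) ** 2)
        return min(errors, key=errors.get)                # lowest prediction error wins
```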

  7. Self-organizing map classifier for stressed speech recognition

    Science.gov (United States)

    Partila, Pavol; Tovarek, Jaromir; Voznak, Miroslav

    2016-05-01

    This paper presents a method for detecting speech under stress using self-organizing maps. Most people who are exposed to stressful situations cannot respond adequately to stimuli. The army, police, and fire services make up the largest part of the environments that typically involve an increased number of stressful situations. Personnel in action are directed by a control center, and control commands should be adapted to the psychological state of the person in action. It is known that psychological changes in the human body are also reflected physiologically, which consequently means that stress affects speech. Therefore, it is clear that a system for recognizing stress in speech is required by the security forces. One possible classifier, popular for its flexibility, is the self-organizing map, a type of artificial neural network. Flexibility here means that the classifier is independent of the character of the input data, a feature that is suitable for speech processing. Human stress can be seen as a kind of emotional state. Mel-frequency cepstral coefficients, LPC coefficients, and prosody features were selected as input data because of their sensitivity to emotional changes. The parameters were calculated on speech recordings, which can be divided into two classes, namely stress-state recordings and normal-state recordings. The contribution of the experiment is a method using a SOM classifier for detecting stress in speech. The results showed the advantage of this method, namely its flexibility with respect to the input data.
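
    A minimal self-organizing map sketch follows to make the classifier type concrete; the grid size, learning-rate schedule, and integer-coded class labels (stress vs. normal) are assumptions, and the feature extraction (MFCC, LPC, prosody statistics) is taken as given.

```python
import numpy as np

def train_som(X, grid=(6, 6), epochs=20, lr0=0.5, sigma0=2.0, seed=0):
    """Minimal SOM training loop; X is (n_samples, dim) feature vectors."""
    rng = np.random.default_rng(seed)
    weights = rng.normal(size=(grid[0] * grid[1], X.shape[1]))
    coords = np.array([(i, j) for i in range(grid[0]) for j in range(grid[1])], float)
    step, n_steps = 0, epochs * len(X)
    for _ in range(epochs):
        for x in X[rng.permutation(len(X))]:
            frac = step / n_steps
            lr, sigma = lr0 * (1 - frac), sigma0 * (1 - frac) + 0.5
            bmu = np.argmin(((weights - x) ** 2).sum(axis=1))    # best matching unit
            d2 = ((coords - coords[bmu]) ** 2).sum(axis=1)
            h = np.exp(-d2 / (2 * sigma ** 2))[:, None]          # neighborhood function
            weights += lr * h * (x - weights)
            step += 1
    return weights

def label_units(weights, X, y):
    """Give each map unit the majority class of the training vectors it wins;
    new vectors are then classified by the label of their best matching unit."""
    bmus = np.argmin(((X[:, None, :] - weights[None, :, :]) ** 2).sum(-1), axis=1)
    labels = np.zeros(len(weights), dtype=int)
    for u in range(len(weights)):
        hits = y[bmus == u]
        labels[u] = np.bincount(hits).argmax() if len(hits) else 0
    return labels
```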

  8. Hearing Handicap and Speech Recognition Correlate With Self-Reported Listening Effort and Fatigue.

    Science.gov (United States)

    Alhanbali, Sara; Dawes, Piers; Lloyd, Simon; Munro, Kevin J

    2017-10-31

    To investigate the correlations between hearing handicap, speech recognition, listening effort, and fatigue. Eighty-four adults with hearing loss (65 to 85 years) completed three self-report questionnaires: the Fatigue Assessment Scale, the Effort Assessment Scale, and the Hearing Handicap Inventory for the Elderly. Audiometric assessment included pure-tone audiometry and speech recognition in noise. There was a significant positive correlation between handicap and fatigue (r = 0.39) and between handicap and effort (r = 0.73). Hearing handicap and speech recognition both correlate with self-reported listening effort and fatigue, which is consistent with a model of listening effort and fatigue where perceived difficulty is related to sustained effort and fatigue for unrewarding tasks over which the listener has low control. A clinical implication is that encouraging clients to recognize and focus on the pleasure and positive experiences of listening may result in greater satisfaction and benefit from hearing aid use.

  9. A Novel DBN Feature Fusion Model for Cross-Corpus Speech Emotion Recognition

    Directory of Open Access Journals (Sweden)

    Zou Cairong

    2016-01-01

    Full Text Available Feature fusion from separate sources is a current technical difficulty in cross-corpus speech emotion recognition. The purpose of this paper is to use the emotional information hidden in the speech spectrum diagram (spectrogram) as image features, based on Deep Belief Nets (DBN) from deep learning, and then to fuse them with the traditional emotion features. First, based on spectrogram analysis with the STB/Itti model, new spectrogram features are extracted from the color, the brightness, and the orientation, respectively; then two alternative DBN models fuse the traditional and the spectrogram features, which increases the scale of the feature subset and its ability to characterize emotion. In experiments on the ABC database and Chinese corpora, the new feature subset improves the cross-corpus recognition result by 8.8% compared with traditional speech emotion features. The proposed method provides a new idea for feature fusion in emotion recognition.

  10. Speech recognition outcomes following bilateral cochlear implantation in adults aged over 50 years old.

    Science.gov (United States)

    Boisvert, Isabelle; McMahon, Catherine M; Dowell, Richard C

    2016-01-01

    To examine the speech recognition benefit of bilateral cochlear implantation over unilateral implantation in adults aged over 50 years, and to identify potential predictors of successful bilateral implantation in this group. Retrospective cohort study using data collected during standard clinical practice. Bilateral performance was compared to the unilateral performance with the first and second implanted ear and examined in relation to potential predictive variables. Sixty-seven cochlear implant users who received a second implant after the age of 50 years were included. Participants obtained significantly greater speech recognition scores with the use of bilateral cochlear implants compared to the use of each individual implant. The score obtained with the first implanted ear was the most reliable predictor of the score obtained with the second and with bilateral implants. Older adults can obtain speech recognition benefits from sequential bilateral cochlear implantation.

  11. Modeling the temporal dynamics of distinctive feature landmark detectors for speech recognition.

    Science.gov (United States)

    Jansen, Aren; Niyogi, Partha

    2008-09-01

    This paper elaborates on a computational model for speech recognition that is inspired by several interrelated strands of research in phonology, acoustic phonetics, speech perception, and neuroscience. The goals are twofold: (i) to explore frameworks for recognition that may provide a viable alternative to the current hidden Markov model (HMM) based speech recognition systems and (ii) to provide a computational platform that will facilitate engaging, quantifying, and testing various theories in the scientific traditions in phonetics, psychology, and neuroscience. This motivation leads to an approach that constructs a hierarchically structured point process representation based on distinctive feature landmark detectors and probabilistically integrates the firing patterns of these detectors to decode a phonological sequence. The accuracy of a broad class recognizer based on this framework is competitive with equivalent HMM-based systems. Various avenues for future development of the presented methodology are outlined.

  12. Speech Recognition of Aged Voices in the AAL Context: Detection of Distress Sentences

    OpenAIRE

    Aman , Frédéric; Vacher , Michel; Rossato , Solange; Portet , François

    2013-01-01

    International audience; By 2050, about a third of the French population will be over 65. In the context of developing technologies that help aged people live independently at home, the CIRDO project aims at implementing an ASR system in a social inclusion product designed for elderly people in order to detect distress situations. Speech recognition systems show a higher word error rate when speech is uttered by elderly speakers than when non-aged voices are considered. Two...

  13. Towards social touch intelligence: developing a robust system for automatic touch recognition

    NARCIS (Netherlands)

    Jung, Merel Madeleine

    2014-01-01

    Touch behavior is of great importance during social interaction. Automatic recognition of social touch is necessary to transfer the touch modality from interpersonal interaction to other areas such as Human-Robot Interaction (HRI). This paper describes a PhD research program on the automatic

  14. Can prosody aid the automatic classification of dialog acts in conversational speech?

    Science.gov (United States)

    Shriberg, E; Bates, R; Stolcke, A; Taylor, P; Jurafsky, D; Ries, K; Coccaro, N; Martin, R; Meteer, M; van Ess-Dykema, C

    1998-01-01

    Identifying whether an utterance is a statement, question, greeting, and so forth is integral to effective automatic understanding of natural dialog. Little is known, however, about how such dialog acts (DAs) can be automatically classified in truly natural conversation. This study asks whether current approaches, which use mainly word information, could be improved by adding prosodic information. The study is based on more than 1000 conversations from the Switchboard corpus. DAs were hand-annotated, and prosodic features (duration, pause, F0, energy, and speaking rate) were automatically extracted for each DA. In training, decision trees based on these features were inferred; trees were then applied to unseen test data to evaluate performance. Performance was evaluated for prosody models alone, and after combining the prosody models with word information--either from true words or from the output of an automatic speech recognizer. For an overall classification task, as well as three subtasks, prosody made significant contributions to classification. Feature-specific analyses further revealed that although canonical features (such as F0 for questions) were important, less obvious features could compensate if canonical features were removed. Finally, in each task, integrating the prosodic model with a DA-specific statistical language model improved performance over that of the language model alone, especially for the case of recognized words. Results suggest that DAs are redundantly marked in natural conversation, and that a variety of automatically extractable prosodic features could aid dialog processing in speech applications.
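
    A toy sketch of the prosody-only classification step, assuming a hypothetical feature table of duration, pause, mean F0, F0 slope, energy, and speaking rate; the study's actual trees were inferred from hand-annotated Switchboard data and later combined with a DA-specific statistical language model.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Hypothetical prosodic feature rows: [duration_s, pause_s, mean_F0, F0_slope,
# energy, speaking_rate]; values are invented for illustration only.
X = np.array([
    [1.2, 0.30, 180.0,  12.0, 0.60, 4.1],   # rising F0, question-like contour
    [2.5, 0.10, 140.0,  -8.0, 0.55, 4.8],   # falling F0, statement-like contour
])
y = np.array(["question", "statement"])

prosody_tree = DecisionTreeClassifier(max_depth=4).fit(X, y)
# In a full system the tree's class posteriors would be combined with the
# word-based dialog-act language model rather than used alone.
```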

  15. Across-site patterns of modulation detection: relation to speech recognition.

    Science.gov (United States)

    Garadat, Soha N; Zwolan, Teresa A; Pfingst, Bryan E

    2012-05-01

    The aim of this study was to identify across-site patterns of modulation detection thresholds (MDTs) in subjects with cochlear implants and to determine if removal of sites with the poorest MDTs from speech processor programs would result in improved speech recognition. Five hundred millisecond trains of symmetric-biphasic pulses were modulated sinusoidally at 10 Hz and presented at a rate of 900 pps using monopolar stimulation. Subjects were asked to discriminate a modulated pulse train from an unmodulated pulse train for all electrodes in quiet and in the presence of an interleaved unmodulated masker presented on the adjacent site. Across-site patterns of masked MDTs were then used to construct two 10-channel MAPs such that one MAP consisted of sites with the best masked MDTs and the other MAP consisted of sites with the worst masked MDTs. Subjects' speech recognition skills were compared when they used these two different MAPs. Results showed that MDTs were variable across sites and were elevated in the presence of a masker by various amounts across sites. Better speech recognition was observed when the processor MAP consisted of sites with best masked MDTs, suggesting that temporal modulation sensitivity has important contributions to speech recognition with a cochlear implant.

  16. Multi-Stage Recognition of Speech Emotion Using Sequential Forward Feature Selection

    Directory of Open Access Journals (Sweden)

    Liogienė Tatjana

    2016-07-01

    Full Text Available Intensive research on speech emotion recognition has introduced a huge collection of speech emotion features. Large feature sets complicate the speech emotion recognition task. Among various feature selection and transformation techniques for one-stage classification, multiple classifier systems have been proposed. The main idea of multiple classifiers is to arrange the emotion classification process in stages. Besides parallel and serial cases, the hierarchical arrangement of multi-stage classification is most widely used for speech emotion recognition. In this paper, we present a sequential-forward-feature-selection-based multi-stage classification scheme. The Sequential Forward Selection (SFS) and Sequential Floating Forward Selection (SFFS) techniques were employed for every stage of the multi-stage classification scheme. Experimental testing of the proposed scheme was performed using the German and Lithuanian emotional speech datasets. Sequential-feature-selection-based multi-stage classification outperformed the single-stage scheme by 12–42 % for different emotion sets. The multi-stage scheme has shown higher robustness to the growth of the emotion set. The decrease in recognition rate with the increase in emotion set was 10–20 % lower for the multi-stage scheme than for the single-stage case. Differences between SFS and SFFS employment for feature selection were negligible.
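
    One stage of such a scheme can be sketched with scikit-learn's SequentialFeatureSelector (its forward direction corresponds to SFS); the classifier choice, feature count, and the two-branch arrangement below are assumptions made for illustration.

```python
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

def make_stage_classifier(n_features=10):
    """One stage: forward feature selection wrapped around the stage's classifier."""
    return make_pipeline(
        SequentialFeatureSelector(SVC(), n_features_to_select=n_features,
                                  direction="forward"),
        SVC(),
    )

# Hypothetical two-stage hierarchy: stage 1 separates two coarse emotion groups,
# stage 2 holds one classifier per branch to resolve the final emotion label.
stage1 = make_stage_classifier()
stage2 = {"group_a": make_stage_classifier(), "group_b": make_stage_classifier()}
# stage1.fit(X_train, coarse_labels)
# stage2["group_a"].fit(X_group_a, fine_labels_a)  # and likewise for group_b
```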

  17. Studying the Speech Recognition Scores of Hearing-Impaired Children by Using Nonsense Syllables

    Directory of Open Access Journals (Sweden)

    Mohammad Reza Keyhani

    1998-09-01

    Full Text Available Background: The current article is aimed at evaluating speech recognition scores in hearing aid wearers to determine whether nonsense syllables are suitable speech materials to evaluate the effectiveness of their hearing aids. Method: Subjects were 60 children (15 males and 15 females with bilateral moderate and moderately severe sensorineural hearing impairment who were aged between 7.7-14 years old. Gain prescription was fitted by NAL method. Then speech evaluation was performed in a quiet place with and without hearing aid by using a list of 25 monosyllable words recorded on a tape. A list was prepared for the subjects to check in the correct response. The same method was used to obtain results for normal subjects. Results: The results revealed that the subjects using hearing aids achieved significantly higher SRS in comparison of not wearing it. Although the speech recognition ability was not compensated completely (the maximum score obtained was 60% it was also revealed that the syllable recognition ability in the less amplified frequencies were decreased. the SRS was very higher in normal subjects (with an average of 88%. Conclusion: It seems that Speech recognition score can prepare Audiologist with a more comprehensive method to evaluate the hearing aid benefits.

  18. Automatic switching between noise classification and speech enhancement for hearing aid devices.

    Science.gov (United States)

    Saki, Fatemeh; Kehtarnavaz, Nasser

    2016-08-01

    This paper presents a voice activity detector (VAD) for automatic switching between a noise classifier and a speech enhancer as part of the signal processing pipeline of hearing aid devices. The developed VAD consists of a computationally efficient feature extractor and a random forest classifier. Previously used signal features as well as two newly introduced signal features are extracted and fed into the classifier to perform automatic switching. This switching approach is compared to two popular VADs. The results obtained indicate the introduced approach outperforms these existing approaches in terms of both detection rate and processing time.
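
    A minimal sketch of the switching idea is shown below: frame-level features feed a random forest that decides, per frame, whether to run the speech enhancer or the noise classifier. The three features (log-energy, zero-crossing rate, spectral centroid) and the random training data are placeholders; the paper's exact feature set, including its two new features, is not reproduced.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def frame_features(frame, fs=16000):
    """Illustrative frame features (log-energy, zero-crossing rate, spectral
    centroid); not the feature set of the cited paper."""
    energy = np.log(np.sum(frame ** 2) + 1e-10)
    zcr = np.mean(np.abs(np.diff(np.sign(frame)))) / 2
    spec = np.abs(np.fft.rfft(frame))
    freqs = np.fft.rfftfreq(len(frame), 1.0 / fs)
    centroid = np.sum(freqs * spec) / (np.sum(spec) + 1e-10)
    return [energy, zcr, centroid]

# Train the VAD on labelled frames (1 = speech, 0 = noise); random data and
# labels stand in for a real labelled corpus.
rng = np.random.default_rng(1)
train_frames = rng.normal(size=(200, 400))
labels = rng.integers(0, 2, size=200)
X = np.array([frame_features(f) for f in train_frames])
vad = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, labels)

def process(frame):
    """Switch between the two downstream modules based on the VAD decision."""
    if vad.predict([frame_features(frame)])[0] == 1:
        return "run speech enhancer"
    return "run noise classifier"

print(process(rng.normal(size=400)))
```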

  19. Improved Emotion Recognition Using Gaussian Mixture Model and Extreme Learning Machine in Speech and Glottal Signals

    Directory of Open Access Journals (Sweden)

    Hariharan Muthusamy

    2015-01-01

    Full Text Available Recently, researchers have paid escalating attention to studying the emotional state of an individual from his/her speech signals, as the speech signal is the fastest and most natural method of communication between individuals. In this work, a new feature enhancement using a Gaussian mixture model (GMM) was proposed to enhance the discriminatory power of the features extracted from speech and glottal signals. Three different emotional speech databases were utilized to gauge the proposed methods. Extreme learning machine (ELM) and k-nearest neighbor (kNN) classifiers were employed to classify the different types of emotions. Several experiments were conducted, and the results show that the proposed methods significantly improved speech emotion recognition performance compared to work published in the literature.

  20. Automatic landmark detection and face recognition for side-view face images

    NARCIS (Netherlands)

    Santemiz, P.; Spreeuwers, Lieuwe Jan; Veldhuis, Raymond N.J.; Broemme, Arslan; Busch, Christoph

    2013-01-01

    In real-life scenarios where pose variation is up to side-view positions, face recognition becomes a challenging task. In this paper we propose an automatic side-view face recognition system designed for home-safety applications. Our goal is to recognize people as they pass through doors in order to

  1. EEG frequency-amplitude characteristics of the successful recognition of emotional speech.

    Science.gov (United States)

    Kislova, O O; Rusalova, M N

    2010-07-01

    EEG frequency-amplitude characteristics were studied in two groups of subjects, with high and low "emotional hearing" measures. Comparison of power over the whole EEG range between the two groups of subjects led to the conclusion that the EEG activation level was significantly higher in subjects with low "emotional hearing" measures than in those with high levels. This group also showed a higher level of activation in the posterior temporal areas of the cortex of the right hemisphere on recognition of emotions in speech. Thus, high initial levels of cortical activation and greater EEG reactivity on hearing emotional phrases are factors hindering the recognition of emotional expression in speech.

  2. Emotional recognition from the speech signal for a virtual education agent

    Science.gov (United States)

    Tickle, A.; Raghu, S.; Elshaw, M.

    2013-06-01

    This paper explores the extraction of features from the speech wave to perform intelligent emotion recognition. A feature extraction tool (openSMILE) was used to obtain a baseline set of 998 acoustic features from a set of emotional speech recordings made with a microphone. The initial features were reduced to the most important ones so that recognition of emotions using a supervised neural network could be performed. Given that the future use of virtual education agents lies with making the agents more interactive, developing agents with the capability to recognise and adapt to the emotional state of humans is an important step.
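
    The general pipeline (reduce a large acoustic feature set, then classify with a supervised neural network) can be sketched as follows. The openSMILE extraction itself is not reproduced, random vectors stand in for the 998-dimensional features, and the univariate selection and MLP below are illustrative choices rather than the method used in the cited work.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(2)
X = rng.normal(size=(120, 998))       # stand-in for openSMILE feature vectors
y = rng.integers(0, 6, size=120)      # placeholder emotion labels

model = make_pipeline(
    SelectKBest(f_classif, k=50),      # keep the 50 most discriminative features
    MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=0),
)
model.fit(X, y)
print(model.predict(X[:5]))
```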

  3. Simultaneous Blind Separation and Recognition of Speech Mixtures Using Two Microphones to Control a Robot Cleaner

    OpenAIRE

    Lee, Heungkyu

    2013-01-01

    This paper proposes a method for the simultaneous separation and recognition of speech mixtures in noisy environments using two-channel based independent vector analysis (IVA) on a home-robot cleaner. The issues to be considered in our target application are speech recognition at a distance and noise removal to cope with a variety of noises, including TV sounds, air conditioners, babble, and so on, that can occur in a house, where people can utter a voice command to control a robot cleaner at...

  4. Effects of noise on speech recognition: Challenges for communication by service members.

    Science.gov (United States)

    Le Prell, Colleen G; Clavier, Odile H

    2017-06-01

    Speech communication often takes place in noisy environments; this is an urgent issue for military personnel who must communicate in high-noise environments. The effects of noise on speech recognition vary significantly according to the sources of noise, the number and types of talkers, and the listener's hearing ability. In this review, speech communication is first described as it relates to current standards of hearing assessment for military and civilian populations. The next section categorizes types of noise (also called maskers) according to their temporal characteristics (steady or fluctuating) and perceptive effects (energetic or informational masking). Next, speech recognition difficulties experienced by listeners with hearing loss and by older listeners are summarized, and questions on the possible causes of speech-in-noise difficulty are discussed, including recent suggestions of "hidden hearing loss". The final section describes tests used by military and civilian researchers, audiologists, and hearing technicians to assess performance of an individual in recognizing speech in background noise, as well as metrics that predict performance based on a listener and background noise profile. This article provides readers with an overview of the challenges associated with speech communication in noisy backgrounds, as well as its assessment and potential impact on functional performance, and provides guidance for important new research directions relevant not only to military personnel, but also to employees who work in high noise environments. Copyright © 2016 Elsevier B.V. All rights reserved.

  5. Support subspaces method for synthetic aperture radar automatic target recognition

    Directory of Open Access Journals (Sweden)

    Vladimir Fursov

    2016-09-01

    Full Text Available This article offers a new object recognition approach that achieves high-quality recognition on synthetic aperture radar images. The approach includes image preprocessing, clustering and recognition stages. At the image preprocessing stage, we compute the mass centre of object images for better image matching. A conjugation index of a recognition vector is used as the distance function at the clustering and recognition stages. We suggest the construction of so-called support subspaces, which provide high recognition quality with a significant dimension reduction. The results of the experiments demonstrate that the proposed method provides higher recognition quality (97.8%) than methods such as the support vector machine (95.9%), deep learning based on a multilayer auto-encoder (96.6%) and adaptive boosting (96.1%). The proposed method is stable for objects observed from different angles.

  6. Speech recognition for the anaesthesia record during crisis scenarios

    DEFF Research Database (Denmark)

    Alapetite, Alexandre

    2008-01-01

    … advantages and disadvantages of a vocal interface compared to the traditional touch-screen and keyboard interface to an electronic anaesthesia record during crisis situations; second, to assess the usability in a realistic work environment of some speech input strategies (hands-free vocal interface activated … team would fill in the anaesthesia record, in one session using only the traditional touch-screen and keyboard interface while in the other session they could also use the speech input interface. Audio-video recordings of the sessions were subsequently analysed and additional subjective data were … on the one hand that the traditional touch-screen and keyboard interface imposes a steadily increasing mental workload in terms of items to keep in memory until there is time to update the anaesthesia record, and on the other hand that the speech input interface will allow anaesthetists to enter medications …

  7. The interaction of hearing loss and level-dependent hearing protection on speech recognition in noise.

    Science.gov (United States)

    Giguère, Christian; Laroche, Chantal; Vaillancourt, Véronique

    2015-02-01

    To determine the effects of different control settings of level-dependent hearing protectors on speech recognition performance in interaction with hearing loss. Controlled laboratory experiment with two level-dependent devices (Peltor® PowerCom Plus™ and Nacre QuietPro®) in two military noises. Word recognition scores were collected in protected and unprotected conditions for 45 participants grouped into four hearing profile categories ranging from within normal limits to moderate-to-severe hearing loss. When the level-dependent mode was switched off to simulate conventional hearing protection, there were large differences across hearing profile categories regarding the effects of wearing the devices on speech recognition in noise; participants with normal hearing showed little effect while participants in the most hearing-impaired category showed large decrements in scores compared to unprotected listening. Activating the level-dependent mode of the devices produced large speech recognition benefits over the passive mode at both low and high gain pass-through settings. The category of participants with the most impaired hearing benefitted the most from the level-dependent mode. The findings indicate that level-dependent hearing protection circuitry can provide substantial benefits in speech recognition performance in noise, compared to conventional passive protection, for individuals covering a wide range of hearing losses.

  8. A New Fuzzy Cognitive Map Learning Algorithm for Speech Emotion Recognition

    Directory of Open Access Journals (Sweden)

    Wei Zhang

    2017-01-01

    Full Text Available Selecting an appropriate recognition method is crucial in speech emotion recognition applications. However, the current methods do not consider the relationship between emotions. Thus, in this study, a speech emotion recognition system based on the fuzzy cognitive map (FCM) approach is constructed. Moreover, a new FCM learning algorithm for speech emotion recognition is proposed. This algorithm includes the use of the pleasure-arousal-dominance emotion scale to calculate the weights between emotions and certain mathematical derivations to determine the network structure. The proposed algorithm can handle a large number of concepts, whereas a typical FCM can handle only relatively simple networks (maps). Different acoustic features, including fundamental speech features and a new spectral feature, are extracted to evaluate the performance of the proposed method. Three experiments are conducted in this paper, namely, single feature experiment, feature combination experiment, and comparison between the proposed algorithm and typical networks. All experiments are performed on TYUT2.0 and EMO-DB databases. Results of the feature combination experiments show that the recognition rates of the combination features are 10%–20% better than those of single features. The proposed FCM learning algorithm generates 5%–20% performance improvement compared with traditional classification networks.

  9. Automatic Speech-to-Background Ratio Selection to Maintain Speech Intelligibility in Broadcasts Using an Objective Intelligibility Metric

    Directory of Open Access Journals (Sweden)

    Yan Tang

    2018-01-01

    Full Text Available While mixing, sound producers and audio professionals empirically set the speech-to-background ratio (SBR) based on rules of thumb and their own perception of sounds. There is no guarantee, however, that the speech content will be intelligible to the general population consuming the content over a wide variety of devices. In this study, an approach to automatically determine the appropriate SBR for a scene using an objective intelligibility metric is introduced. The model-estimated SBR needed for a preset minimum intelligibility level was compared to the listener-preferred SBR for a range of background sounds. It was found that an extra gain added to the model estimation is needed even for listeners with normal hearing. This gain is needed so that an audio scene can be auditioned with comfort and without compromising the sound effects contributed by the background. When the background introduces little informational masking, the extra gain holds almost constant across the various background sounds. However, a larger gain is required for a background that induces informational masking, such as competing speech. The results from a final subjective rating study show that the model-estimated SBR with the additional gain yields the same listening experience as the SBR preferred by listeners.
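
    A hedged sketch of the core loop: raise the SBR in small steps until an objective intelligibility prediction reaches a preset minimum, then add the extra gain that listeners were found to prefer. The toy band-SNR metric, the 0.7 target and the 4-dB extra gain below are illustrative assumptions, not the metric or values used in the study.

```python
import numpy as np

def band_levels(x, n_bands=16):
    """Very rough per-band power levels (dB); a stand-in front end, not the
    objective intelligibility metric used in the study."""
    spec = np.abs(np.fft.rfft(x)) ** 2
    return 10 * np.log10([b.mean() + 1e-12 for b in np.array_split(spec, n_bands)])

def predicted_intelligibility(speech, background):
    """Toy prediction: mean clipped band SNR mapped to the range 0..1."""
    snr = np.clip(band_levels(speech) - band_levels(background), -15, 15)
    return float(np.mean((snr + 15) / 30))

def choose_sbr(speech, background, target=0.7, extra_gain_db=4.0):
    """Raise the SBR in 1-dB steps until the predicted intelligibility reaches
    `target`, then add the extra gain listeners were found to prefer."""
    for sbr_db in np.arange(-10.0, 30.0, 1.0):
        gain = 10 ** (sbr_db / 20)
        if predicted_intelligibility(gain * speech, background) >= target:
            return sbr_db + extra_gain_db
    return 30.0 + extra_gain_db

rng = np.random.default_rng(3)
speech, background = rng.normal(size=16000), rng.normal(size=16000)
print(choose_sbr(speech, background), "dB")
```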

  10. Speech recognition in normal hearing and sensorineural hearing loss as a function of the number of spectral channels

    NARCIS (Netherlands)

    Baskent, Deniz

    2006-01-01

    Speech recognition by normal-hearing listeners improves as a function of the number of spectral channels when tested with a noiseband vocoder simulating cochlear implant signal processing. Speech recognition by the best cochlear implant users, however, saturates around eight channels and does not

  11. Design and implementation of a user-oriented speech recognition interface: the synergy of technology and human factors

    NARCIS (Netherlands)

    Kloosterman, Sietse H.

    1994-01-01

    The design and implementation of a user-oriented speech recognition interface are described. The interface enables the use of speech recognition in so-called interactive voice response systems which can be accessed via a telephone connection. In the design of the interface a synergy of technology

  12. DESIGN AND IMPLEMENTATION OF A USER-ORIENTED SPEECH RECOGNITION INTERFACE - THE SYNERGY OF TECHNOLOGY AND HUMAN-FACTORS

    NARCIS (Netherlands)

    KLOOSTERMAN, SH

    The design and implementation of a user-oriented speech recognition interface are described. The interface enables the use of speech recognition in so-called interactive voice response systems which can be accessed via a telephone connection. In the design of the interface a synergy of technology

  13. Fast Automatic Configuration of Artificial Neural Networks Used for Binary Patterns Recognition

    Directory of Open Access Journals (Sweden)

    Adrian Horzyk

    2001-01-01

    Full Text Available This paper presents a powerful method for automatically generating the architecture of neural networks used for binary pattern recognition, which can quickly and automatically reduce synapses in a way that minimally reduces the quality of recognition and the quality of generalization. Moreover, this method computes all weights in two runs over the learning sequence, which makes it very fast. First, the method calculates all binary features for each pattern, and then the weights are computed. Furthermore, the quality of generalization is taken into account because it is one of the most important factors in recognition with neural networks.

  14. Effects of pose and image resolution on automatic face recognition

    NARCIS (Netherlands)

    Mahmood, Zahid; Ali, Tauseef; Khan, Samee U.

    The popularity of face recognition systems has increased due to their use in widespread applications. Driven by the enormous number of potential application domains, several algorithms have been proposed for face recognition. Face pose and image resolution are two important factors that

  15. Speaker-Adaptation for Hybrid HMM-ANN Continuous Speech Recognition System

    OpenAIRE

    Neto, Joao; Almeida, Luis; Hochberg, Mike; Martins, Ciro; Nunes, Luis; Renals, Steve; Robinson, Tony

    1995-01-01

    It is well known that recognition performance degrades significantly when moving from a speaker-dependent to a speaker-independent system. Traditional hidden Markov model (HMM) systems have successfully applied speaker-adaptation approaches to reduce this degradation. In this paper we present and evaluate some techniques for speaker-adaptation of a hybrid HMM-artificial neural network (ANN) continuous speech recognition system. These techniques are applied to a well trained, speaker-independe...

  16. Segment-Based Acoustic Models for Continuous Speech Recognition

    Science.gov (United States)

    1993-04-05


  17. Divided attention disrupts perceptual encoding during speech recognition.

    Science.gov (United States)

    Mattys, Sven L; Palmer, Shekeila D

    2015-03-01

    Performing a secondary task while listening to speech has a detrimental effect on speech processing, but the locus of the disruption within the speech system is poorly understood. Recent research has shown that cognitive load imposed by a concurrent visual task increases dependency on lexical knowledge during speech processing, but it does not affect lexical activation per se. This suggests that "lexical drift" under cognitive load occurs either as a post-lexical bias at the decisional level or as a secondary consequence of reduced perceptual sensitivity. This study aimed to adjudicate between these alternatives using a forced-choice task that required listeners to identify noise-degraded spoken words with or without the addition of a concurrent visual task. Adding cognitive load increased the likelihood that listeners would select a word acoustically similar to the target even though its frequency was lower than that of the target. Thus, there was no evidence that cognitive load led to a high-frequency response bias. Rather, cognitive load seems to disrupt sublexical encoding, possibly by impairing perceptual acuity at the auditory periphery.

  18. Connected digit speech recognition system for Malayalam language

    Indian Academy of Sciences (India)

    … the Fourier transform process, and then the power spectrum of the signal is computed. Then, during critical band analysis, … sensitive to the middle frequency range of the audible spectrum. PLP incorporates the effect … compression of the modified speech spectrum is carried out according to the power law of hearing (Stevens 1957), which …

  19. Connected digit speech recognition system for Malayalam language

    Indian Academy of Sciences (India)

    … Perceptual Linear Predictive (PLP) cepstral coefficients for speech parameterization and continuous-density Hidden Markov Models (HMM) in the … where each speaker is asked to read 20 sets of continuous digits. The system obtained … 2.1a Critical band integration (Bark frequency weighting): experiments in human perception have shown that …

  20. The Role of Somatosensory Information in Speech Perception: Imitation Improves Recognition of Disordered Speech.

    Science.gov (United States)

    Borrie, Stephanie A; Schäfer, Martina C M

    2015-12-01

    Perceptual learning paradigms involving written feedback appear to be a viable clinical tool to reduce the intelligibility burden of dysarthria. The underlying theoretical assumption is that pairing the degraded acoustics with the intended lexical targets facilitates a remapping of existing mental representations in the lexicon. This study investigated whether ties to mental representations can be strengthened by way of a somatosensory motor trace. Following an intelligibility pretest, 100 participants were assigned to 1 of 5 experimental groups. The control group received no training, but the other 4 groups received training with dysarthric speech under conditions involving a unique combination of auditory targets, written feedback, and/or a vocal imitation task. All participants then completed an intelligibility posttest. Training improved intelligibility of dysarthric speech, with the largest improvements observed when the auditory targets were accompanied by both written feedback and an imitation task. Further, a significant relationship between intelligibility improvement and imitation accuracy was identified. This study suggests that somatosensory information can strengthen the activation of speech sound maps of dysarthric speech. The findings, therefore, implicate a bidirectional relationship between speech perception and speech production as well as advance our understanding of the mechanisms that underlie perceptual learning of degraded speech.

  1. Effects of hearing loss on speech recognition under distracting conditions and working memory in the elderly

    Directory of Open Access Journals (Sweden)

    Na W

    2017-08-01

    Full Text Available Purpose: The current study aimed to evaluate hearing-related changes in terms of speech-in-noise processing, fast-rate speech processing, and working memory, and to identify which of these three factors is significantly affected by age-related hearing loss. Methods: One hundred subjects aged 65–84 years participated in the study. They were classified into four groups ranging from normal hearing to moderate-to-severe hearing loss. All the participants were tested for speech perception in quiet and noisy conditions and for speech perception with time alteration in quiet conditions. Forward- and backward-digit span tests were also conducted to measure the participants' working memory. Results: 1) As the level of background noise increased, speech perception scores systematically decreased in all the groups. This pattern was more noticeable in the three hearing-impaired groups than in the normal hearing group. 2) As the speech rate increased, speech perception scores decreased. A significant interaction was found between speed of speech and hearing loss. In particular, the 30%-compressed sentences revealed a clear differentiation between moderate hearing loss and moderate-to-severe hearing loss. 3) Although all the groups showed a longer span on the forward-digit span test than the backward-digit span test, there was no significant difference as a function of hearing loss. Conclusion: The degree of hearing loss strongly affects the speech recognition of babble-masked and time-compressed speech in the elderly but does not affect working memory. We expect these results to be applied to appropriate rehabilitation strategies for hearing-impaired elderly.

  2. Implementation of a Tour Guide Robot System Using RFID Technology and Viterbi Algorithm-Based HMM for Speech Recognition

    Directory of Open Access Journals (Sweden)

    Neng-Sheng Pai

    2014-01-01

    Full Text Available This paper applied speech recognition and RFID technologies to develop an omni-directional mobile robot into a robot with voice control and guide introduction functions. For speech recognition, the speech signals were captured by short-time processing. The speaker first recorded the isolated words for the robot to create a speech database of specific speakers. After pre-processing of this speech database, the feature parameters of cepstrum and delta-cepstrum were obtained using linear predictive coefficients (LPC). Then, the Hidden Markov Model (HMM) was used for model training on the speech database, and the Viterbi algorithm was used to find an optimal state sequence as the reference sample for speech recognition. The trained reference models were put into the industrial computer on the robot platform, and the user entered the isolated words to be tested. After processing with the same parameterization and comparison against the reference models, the model path with the maximum total probability found by the Viterbi algorithm was taken as the recognition result. Finally, the speech recognition and RFID systems were deployed in an actual environment to prove their feasibility and stability, and were implemented on the omni-directional mobile robot.
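
    The Viterbi decoding step mentioned above can be sketched in a few lines of Python. The 3-state, 4-symbol model below is a toy placeholder rather than a model trained on the robot's speech database; in the cited system each isolated word has its own trained HMM and the word whose model gives the highest total probability is returned.

```python
import numpy as np

def viterbi(obs, log_A, log_B, log_pi):
    """Most likely HMM state sequence for a discrete observation sequence.

    log_A: (N, N) log transition probabilities, log_B: (N, M) log emission
    probabilities, log_pi: (N,) log initial probabilities.
    """
    N, T = log_A.shape[0], len(obs)
    delta = np.full((T, N), -np.inf)
    psi = np.zeros((T, N), dtype=int)
    delta[0] = log_pi + log_B[:, obs[0]]
    for t in range(1, T):
        for j in range(N):
            scores = delta[t - 1] + log_A[:, j]
            psi[t, j] = np.argmax(scores)
            delta[t, j] = scores[psi[t, j]] + log_B[j, obs[t]]
    path = [int(np.argmax(delta[-1]))]
    for t in range(T - 1, 0, -1):     # backtrack the best path
        path.append(int(psi[t, path[-1]]))
    return list(reversed(path)), float(np.max(delta[-1]))

# Toy left-to-right model with 3 states and 4 quantized observation symbols.
A = np.log(np.array([[0.6, 0.4, 0.0],
                     [0.0, 0.7, 0.3],
                     [0.0, 0.0, 1.0]]) + 1e-12)
B = np.log(np.full((3, 4), 0.25))
pi = np.log(np.array([1.0, 0.0, 0.0]) + 1e-12)
print(viterbi([0, 2, 3, 1], A, B, pi))
```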

  3. ANALYSIS OF MULTIMODAL FUSION TECHNIQUES FOR AUDIO-VISUAL SPEECH RECOGNITION

    Directory of Open Access Journals (Sweden)

    D.V. Ivanko

    2016-05-01

    Full Text Available The paper provides an analytical review covering the latest achievements in the field of audio-visual (AV) fusion (integration of multimodal information). We discuss the main challenges and report on approaches to address them. One of the most important tasks of AV integration is to understand how the modalities interact and influence each other. The paper addresses this problem in the context of AV speech processing and speech recognition. In the first part of the review we set out the basic principles of AV speech recognition and give a classification of audio and visual speech features. Special attention is paid to the systematization of existing techniques and AV data fusion methods. In the second part, based on an analysis of the research area, we provide a consolidated list of tasks and applications that use AV fusion. We also indicate the methods, techniques, and audio and video features used. We propose a classification of AV integration and discuss the advantages and disadvantages of different approaches. We draw conclusions and offer our assessment of the future of the field of AV fusion. In further research, we plan to implement a system for audio-visual Russian continuous speech recognition using advanced methods of multimodal fusion.

  4. Temporal acuity and speech recognition score in noise in patients with multiple sclerosis

    Directory of Open Access Journals (Sweden)

    Mehri Maleki

    2014-04-01

    Full Text Available Background and Aim: Multiple sclerosis (MS) is a central nervous system disease that can be associated with a variety of symptoms, including hearing disorders. The main consequence of hearing loss is poor speech perception, and temporal acuity plays an important role in speech perception. We evaluated speech perception in silence and in the presence of noise, as well as temporal acuity, in patients with multiple sclerosis. Methods: Eighteen adults with multiple sclerosis with a mean age of 37.28 years and 18 age- and sex-matched controls with a mean age of 38.00 years participated in this study. Temporal acuity and speech perception were evaluated by the random gap detection test (GDT) and the word recognition score (WRS) at three different signal-to-noise ratios. Results: Statistical analysis of the test results revealed significant differences between the two groups (p<0.05). Analysis of the gap detection test (at 4 sensation levels) and the word recognition score in both groups showed significant differences (p<0.001). Conclusion: According to this survey, the ability of patients with multiple sclerosis to process temporal features of a stimulus was impaired. It seems that this impairment is an important factor in their decreased word recognition score and speech perception.

  5. Feature Compensation Employing Multiple Environmental Models for Robust In-Vehicle Speech Recognition

    Science.gov (United States)

    Kim, Wooil; Hansen, John H. L.

    An effective feature compensation method is developed for reliable speech recognition in real-life in-vehicle environments. The CU-Move corpus, used for evaluation, contains a range of speech and noise signals collected for a number of speakers under actual driving conditions. PCGMM-based feature compensation, considered in this paper, utilizes parallel model combination to generate a noise-corrupted speech model by combining the clean speech and noise models. In order to address unknown time-varying background noise, an interpolation method over multiple environmental models is employed. To alleviate computational expenses due to multiple models, an Environment Transition Model is employed, which is motivated by the Noise Language Model used in Environmental Sniffing. An environment-dependent mixture sharing scheme is proposed and shown to be more effective in reducing the computational complexity. A smaller environmental model set is determined by the environment transition model for mixture sharing. The proposed scheme is evaluated on the connected single digits portion of the CU-Move database using the Aurora2 evaluation toolkit. Experimental results indicate that our feature compensation method is effective for improving speech recognition in real-life in-vehicle conditions. A reduction of 73.10% of the computational requirements was obtained by employing the environment-dependent mixture sharing scheme with only a slight change in recognition performance. This demonstrates that the proposed method is effective in maintaining the distinctive characteristics among the different environmental models, even when selecting a large number of Gaussian components for mixture sharing.

  6. Effect of a Bluetooth-implemented hearing aid on speech recognition performance: subjective and objective measurement.

    Science.gov (United States)

    Kim, Min-Beom; Chung, Won-Ho; Choi, Jeesun; Hong, Sung Hwa; Cho, Yang-Sun; Park, Gyuseok; Lee, Sangmin

    2014-06-01

    The objective was to evaluate speech perception improvement through Bluetooth-implemented hearing aids in hearing-impaired adults. Thirty subjects with bilateral symmetric moderate sensorineural hearing loss participated in this study. A Bluetooth-implemented hearing aid was fitted unilaterally in all study subjects. Objective speech recognition scores and subjective satisfaction were measured with a Bluetooth-implemented hearing aid replacing the acoustic connection from either a cellular phone or a loudspeaker system. In each system, participants were assigned to 4 conditions: wireless speech signal transmission into the hearing aid (wireless mode) in a quiet or noisy environment, and conventional speech signal transmission using the external microphone of the hearing aid (conventional mode) in a quiet or noisy environment. Participants also completed questionnaires to investigate subjective satisfaction. In both the cellular phone and loudspeaker situations, participants showed improvements in sentence and word recognition scores with the wireless mode compared to the conventional mode in both quiet and noise conditions. Bluetooth-implemented hearing aids helped to improve subjective and objective speech recognition performance in quiet and noisy environments during the use of electronic audio devices.

  7. Progressive-Search Algorithms for Large-Vocabulary Speech Recognition

    National Research Council Canada - National Science Library

    Murveit, Hy; Butzberger, John; Digalakis, Vassilios; Weintraub, Mitch

    1993-01-01

    .... An algorithm, the "Forward-Backward Word-Life Algorithm," is described. It can generate a word lattice in a progressive search that would be used as a language model embedded in a succeeding recognition pass to reduce computation requirements...

  8. The digits-in-noise test: assessing auditory speech recognition abilities in noise.

    Science.gov (United States)

    Smits, Cas; Theo Goverts, S; Festen, Joost M

    2013-03-01

    A speech-in-noise test which uses digit triplets in steady-state speech noise was developed. The test measures primarily the auditory, or bottom-up, speech recognition abilities in noise. Digit triplets were formed by concatenating single digits spoken by a male speaker. Level corrections were made to individual digits to create a set of homogeneous digit triplets with steep speech recognition functions. The test measures the speech reception threshold (SRT) in long-term average speech-spectrum noise via a 1-up, 1-down adaptive procedure with a measurement error of 0.7 dB. One training list is needed for naive listeners. No further learning effects were observed in 24 subsequent SRT measurements. The test was validated by comparing results on the test with results on the standard sentences-in-noise test. To avoid the confounding of hearing loss, age, and linguistic skills, these measurements were performed in normal-hearing subjects with simulated hearing loss. The signals were spectrally smeared and/or low-pass filtered at varying cutoff frequencies. After correction for measurement error the correlation coefficient between SRTs measured with both tests equaled 0.96. Finally, the feasibility of the test was approved in a study where reference SRT values were gathered in a representative set of 1386 listeners over 60 years of age.
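
    The 1-up/1-down adaptive procedure at the heart of such a test can be sketched as below. The presentation interface, trial count, step size and the way the SRT is averaged from the track are simplifying assumptions for illustration and differ in detail from the published test.

```python
import random

def measure_srt(present_triplet, n_trials=24, start_snr=0.0, step_db=2.0):
    """1-up/1-down adaptive track: after a correct response the SNR is lowered
    by one step, after an incorrect response it is raised.

    `present_triplet(snr)` must play a digit triplet at that SNR and return
    True if all three digits were repeated correctly (placeholder interface).
    The SRT is taken here as the mean SNR of the track excluding the first
    few trials, a common convention that differs in detail per test."""
    snr, track = start_snr, []
    for _ in range(n_trials):
        correct = present_triplet(snr)
        track.append(snr)
        snr += -step_db if correct else step_db
    return sum(track[4:]) / len(track[4:])

# Toy listener simulation: correct responses become likely above -8 dB SNR.
def simulated_listener(snr):
    return random.random() < 1 / (1 + 10 ** (-(snr + 8) / 2))

print(round(measure_srt(simulated_listener), 1))
```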

  9. Investigating an Innovative Computer Application to Improve L2 Word Recognition from Speech

    Science.gov (United States)

    Matthews, Joshua; O'Toole, John Mitchell

    2015-01-01

    The ability to recognise words from the aural modality is a critical aspect of successful second language (L2) listening comprehension. However, little research has been reported on computer-mediated development of L2 word recognition from speech in L2 learning contexts. This report describes the development of an innovative computer application…

  10. Phonotactics Constraints and the Spoken Word Recognition of Chinese Words in Speech

    Science.gov (United States)

    Yip, Michael C.

    2016-01-01

    Two word-spotting experiments were conducted to examine the question of whether native Cantonese listeners are constrained by phonotactics information in spoken word recognition of Chinese words in speech. Because no legal consonant clusters occurred within an individual Chinese word, this kind of categorical phonotactics information of Chinese…

  11. Review of Speech-to-Text Recognition Technology for Enhancing Learning

    Science.gov (United States)

    Shadiev, Rustam; Hwang, Wu-Yuin; Chen, Nian-Shing; Huang, Yueh-Min

    2014-01-01

    This paper reviewed literature from 1999 to 2014 inclusively on how Speech-to-Text Recognition (STR) technology has been applied to enhance learning. The first aim of this review is to understand how STR technology has been used to support learning over the past fifteen years, and the second is to analyze all research evidence to understand how…

  12. The Affordance of Speech Recognition Technology for EFL Learning in an Elementary School Setting

    Science.gov (United States)

    Liaw, Meei-Ling

    2014-01-01

    This study examined the use of speech recognition (SR) technology to support a group of elementary school children's learning of English as a foreign language (EFL). SR technology has been used in various language learning contexts. Its application to EFL teaching and learning is still relatively recent, but a solid understanding of its…

  13. Speech Recognition, Working Memory and Conversation in Children with Cochlear Implants

    Science.gov (United States)

    Ibertsson, Tina; Hansson, Kristina; Asker-Arnason, Lena; Sahlen, Birgitta

    2009-01-01

    This study examined the relationship between speech recognition, working memory and conversational skills in a group of 13 children/adolescents with cochlear implants (CIs) between 11 and 19 years of age. Conversational skills were assessed in a referential communication task where the participants interacted with a hearing peer of the same age…

  14. Enhancing Speech Recognition Using Improved Particle Swarm Optimization Based Hidden Markov Model

    Directory of Open Access Journals (Sweden)

    Lokesh Selvaraj

    2014-01-01

    Full Text Available Enhancing speech recognition is the primary intention of this work. In this paper, a novel speech recognition method based on vector quantization and improved particle swarm optimization (IPSO) is suggested. The suggested methodology contains four stages, namely, (i) denoising, (ii) feature mining, (iii) vector quantization, and (iv) an IPSO-based hidden Markov model (HMM) technique (IP-HMM). At first, the speech signals are denoised using a median filter. Next, characteristics such as peak, pitch spectrum, Mel frequency cepstral coefficients (MFCC), mean, standard deviation, and minimum and maximum of the signal are extracted from the denoised signal. Following that, to accomplish the training process, the extracted characteristics are passed to genetic-algorithm-based codebook generation in vector quantization. The initial populations for the genetic algorithm are created by selecting random code vectors from the training set for the codebooks, and the IP-HMM performs the recognition. Novelty is introduced at this point through the crossover genetic operation. The proposed speech recognition technique offers 97.14% accuracy.
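
    The vector-quantization stage can be illustrated with a plain k-means codebook; the cited work instead evolves the codebook with a genetic algorithm and feeds the resulting discrete symbols to an IPSO-tuned HMM, neither of which is reproduced here. The stand-in MFCC-like data and codebook size are arbitrary.

```python
import numpy as np

def train_codebook(features, codebook_size=16, n_iter=20, seed=0):
    """Simple k-means codebook: initial code vectors are random training
    vectors (the cited work evolves this initial population genetically)."""
    rng = np.random.default_rng(seed)
    codebook = features[rng.choice(len(features), codebook_size, replace=False)]
    for _ in range(n_iter):
        # assign each feature vector to its nearest code vector
        d = np.linalg.norm(features[:, None, :] - codebook[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for k in range(codebook_size):
            if np.any(labels == k):
                codebook[k] = features[labels == k].mean(axis=0)
    return codebook

def quantize(features, codebook):
    d = np.linalg.norm(features[:, None, :] - codebook[None, :, :], axis=2)
    return d.argmin(axis=1)          # discrete symbols for an HMM front end

rng = np.random.default_rng(4)
mfcc_like = rng.normal(size=(500, 13))   # stand-in for MFCC frames
cb = train_codebook(mfcc_like)
print(quantize(mfcc_like[:10], cb))
```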

  15. N-best: The Northern- and Southern-Dutch Benchmark Evaluation of Speech recognition Technology

    NARCIS (Netherlands)

    Kessens, J.M.; Leeuwen, D.A. van

    2007-01-01

    In this paper, we describe N-best 2008, the first Large Vocabulary Continuous Speech Recognition (LVCSR) benchmark evaluation held for the Dutch language. Both the accent as spoken in the Netherlands (Northern-Dutch) and in Belgium (Southern-Dutch or Flemish) will be evaluated. The evaluation tasks are

  16. A Freely-Available Authoring System for Browser-Based CALL with Speech Recognition

    Science.gov (United States)

    O'Brien, Myles

    2017-01-01

    A system for authoring browser-based CALL material incorporating Google speech recognition has been developed and made freely available for download. The system provides a teacher with a simple way to set up CALL material, including an optional image, sound or video, which will elicit spoken (and/or typed) answers from the user and check them…

  17. A Neuro-Linguistic Model for Speech Recognition in Tone Language

    African Journals Online (AJOL)

    The primary aim of this work is to develop a speech recognition system that exploits a computational paradigm with learning ability and the inherent robustness and parallelism of ANNs, coupled with the capability of fuzzy logic to model vagueness, handle uncertainty, and support human reasoning. This research ...

  18. Using Speech Recognition Software to Increase Writing Fluency for Individuals with Physical Disabilities

    Science.gov (United States)

    Garrett, Jennifer Tumlin; Heller, Kathryn Wolff; Fowler, Linda P.; Alberto, Paul A.; Fredrick, Laura D.; O'Rourke, Colleen M.

    2011-01-01

    Students with physical disabilities often have difficulty with writing fluency, despite the use of various strategies, adaptations, and assistive technology (AT). One possible intervention is the use of speech recognition software, although there is little research on its impact on students with physical disabilities. This study used an…

  19. Learning spectral-temporal features with 3D CNNs for speech emotion recognition

    NARCIS (Netherlands)

    Kim, Jaebok; Truong, Khiet; Englebienne, Gwenn; Evers, Vanessa

    2017-01-01

    In this paper, we propose to use deep 3-dimensional convolutional networks (3D CNNs) in order to address the challenge of modelling spectro-temporal dynamics for speech emotion recognition (SER). Compared to a hybrid of Convolutional Neural Network and Long-Short-Term-Memory (CNN-LSTM), our proposed

  20. Simultaneous Blind Separation and Recognition of Speech Mixtures Using Two Microphones to Control a Robot Cleaner

    Directory of Open Access Journals (Sweden)

    Heungkyu Lee

    2013-02-01

    Full Text Available This paper proposes a method for the simultaneous separation and recognition of speech mixtures in noisy environments using two-channel based independent vector analysis (IVA) on a home-robot cleaner. The issues to be considered in our target application are speech recognition at a distance and noise removal to cope with a variety of noises, including TV sounds, air conditioners, babble, and so on, that can occur in a house, where people can utter a voice command to control a robot cleaner at any time and at any location, even while the robot cleaner is moving. Thus, the system should always be in a recognition-ready state to promptly recognize a spoken word at any time, and the false acceptance rate should be low. To cope with these issues, the keyword spotting technique is applied. In addition, a microphone alignment method and a model-based real-time IVA approach are proposed to effectively and simultaneously process the speech and noise sources, as well as to cover 360-degree directions irrespective of distance. From the experimental evaluations, we show that the proposed method is robust in terms of speech recognition accuracy, even when the speaker location is not fixed and changes over time. In addition, the proposed method shows good performance in severely noisy environments.

  1. Assessing speech recognition abilities with digits in noise in cochlear implant and hearing aid users

    NARCIS (Netherlands)

    Kaandorp, M.W.; Smits, J.C.M.; Merkus, P.; Goverts, S.T.; Festen, J.M.

    2015-01-01

    Objective: The primary objective of the study was to investigate the feasibility, reliability, and validity of the Dutch digits in noise (DIN) test for measuring speech recognition in hearing aid and cochlear implant users and compare results to the standard sentences-in-noise (SIN) test. Design:

  2. The effect of speech recognition on working postures, productivity and the perception of user friendliness

    NARCIS (Netherlands)

    Korte, E.M. de; Lingen, P. van

    2006-01-01

    A comparative, experimental study with repeated measures has been conducted to evaluate the effect of the use of speech recognition on working postures, productivity and the perception of user friendliness. Fifteen subjects performed a standardised task, first with keyboard and mouse and, after a

  3. Automatic evaluation of speech rhythm instability and acceleration in dysarthrias associated with basal ganglia dysfunction

    Directory of Open Access Journals (Sweden)

    Jan eRusz

    2015-07-01

    Full Text Available Speech rhythm abnormalities are commonly present in patients with different neurodegenerative disorders. These alterations are hypothesized to be a consequence of disruption to the basal ganglia circuitry involving dysfunction of motor planning, programming and execution, which can be detected by a syllable repetition paradigm. Therefore, the aim of the present study was to design a robust signal processing technique that allows the automatic detection of spectrally-distinctive nuclei of syllable vocalizations and to determine speech features that represent rhythm instability and acceleration. A further aim was to elucidate specific patterns of dysrhythmia across various neurodegenerative disorders that share disruption of basal ganglia function. Speech samples based on repetition of the syllable /pa/ at a self-determined steady pace were acquired from 109 subjects, including 22 with Parkinson's disease (PD), 11 with progressive supranuclear palsy (PSP), 9 with multiple system atrophy (MSA), 24 with ephedrone-induced parkinsonism (EP), 20 with Huntington's disease (HD), and 23 healthy controls. Subsequently, an algorithm for the automatic detection of syllables, as well as features representing rhythm instability and rhythm acceleration, were designed. The proposed detection algorithm was able to correctly identify syllables and remove erroneous detections due to excessive inspiration and nonspeech sounds with a very high accuracy of 99.6%. Instability of vocal pace performance was observed in the PSP, MSA, EP and HD groups. Significantly increased pace acceleration was observed only in the PD group. Although not significant, a tendency for pace acceleration was also observed in the PSP and MSA groups. Our findings underline the crucial role of the basal ganglia in the execution and maintenance of automatic speech motor sequences. We envisage the current approach to become the first step towards the development of acoustic technologies allowing automated assessment of rhythm
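
    A simplified sketch of the two ingredients, nucleus detection and rhythm features, is given below. Energy-peak picking stands in for the spectrally based nucleus detector of the paper, and the instability and acceleration measures (coefficient of variation of inter-syllable intervals and the slope of those intervals over time) are plausible illustrations rather than the paper's exact definitions.

```python
import numpy as np

def syllable_nuclei(signal, fs, frame_ms=10, min_gap_ms=150):
    """Crude nucleus detector: short-time energy peaks above a threshold,
    at least `min_gap_ms` apart (stands in for the spectral detector)."""
    frame = int(fs * frame_ms / 1000)
    n_frames = len(signal) // frame
    energy = np.array([np.sum(signal[i*frame:(i+1)*frame] ** 2)
                       for i in range(n_frames)])
    thr = 0.5 * energy.max()
    times, last = [], -1e9
    for i, e in enumerate(energy):
        t = i * frame_ms / 1000
        if e > thr and t - last > min_gap_ms / 1000:
            times.append(t)
            last = t
    return np.array(times)

def rhythm_features(nucleus_times):
    """Illustrative rhythm measures from inter-syllable intervals (ISIs):
    instability = coefficient of variation of ISIs; acceleration = slope of a
    line fitted to ISIs over time (negative slope = speeding up)."""
    isi = np.diff(nucleus_times)
    instability = float(np.std(isi) / np.mean(isi))
    accel = float(np.polyfit(nucleus_times[1:], isi, 1)[0])
    return instability, accel

fs = 16000
t = np.arange(0, 5, 1 / fs)
toy = np.sin(2 * np.pi * 120 * t) * (np.sin(2 * np.pi * 2 * t) > 0.5)  # /pa/-like bursts
print(rhythm_features(syllable_nuclei(toy, fs)))
```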

  4. Automatic recognition of smoke-plume signatures in lidar signal

    Science.gov (United States)

    Utkin, Andrei B.; Lavrov, Alexander; Vilar, Rui

    2008-10-01

    A simple and robust algorithm for lidar-signal classification based on the fast extraction of sufficiently pronounced peaks and their recognition with a perceptron, whose efficiency is enhanced by a fast nonlinear preprocessing that increases the signal dimension, is reported. The method allows smoke-plume recognition with an error rate as small as 0.31% (19 misdetections and 4 false alarms in analyzing a test set of 7409 peaks).

  5. Speaker Recognition on Lossy Compressed Speech Using the Speex Codec

    Science.gov (United States)

    2009-09-01

    … Predictor) based codec. Like Speex, it was designed for speech and compresses audio based on prediction and correlations in the signal. The standard … GSM 6.10 codec compresses audio at full rate to 12.2 kbit/s. GSM is the standard for the vast majority of cellular communications in the world and … primarily designed for music compression and is in the same family as the mp3 and Advanced Audio Codec (AAC) compression schemes. This family of …

  6. Iconic Gestures for Robot Avatars, Recognition and Integration with Speech

    Directory of Open Access Journals (Sweden)

    Paul Adam Bremner

    2016-02-01

    Full Text Available Co-verbal gestures are an important part of human communication, improving its efficiency and efficacy for information conveyance. One possible means by which such multi-modal communication might be realised remotely is through the use of a tele-operated humanoid robot avatar. Such avatars have been previously shown to enhance social presence and operator salience. We present a motion tracking based tele-operation system for the NAO robot platform that allows direct transmission of speech and gestures produced by the operator. To assess the capabilities of this system for transmitting multi-modal communication, we have conducted a user study that investigated if robot-produced iconic gestures are comprehensible, and are integrated with speech. Robot performed gesture outcomes were compared directly to those for gestures produced by a human actor, using a within participant experimental design. We show that iconic gestures produced by a tele-operated robot are understood by participants when presented alone, almost as well as when produced by a human. More importantly, we show that gestures are integrated with speech when presented as part of a multi-modal communication equally well for human and robot performances.

  7. Iconic Gestures for Robot Avatars, Recognition and Integration with Speech

    Science.gov (United States)

    Bremner, Paul; Leonards, Ute

    2016-01-01

    Co-verbal gestures are an important part of human communication, improving its efficiency and efficacy for information conveyance. One possible means by which such multi-modal communication might be realized remotely is through the use of a tele-operated humanoid robot avatar. Such avatars have been previously shown to enhance social presence and operator salience. We present a motion tracking based tele-operation system for the NAO robot platform that allows direct transmission of speech and gestures produced by the operator. To assess the capabilities of this system for transmitting multi-modal communication, we have conducted a user study that investigated if robot-produced iconic gestures are comprehensible, and are integrated with speech. Robot performed gesture outcomes were compared directly to those for gestures produced by a human actor, using a within participant experimental design. We show that iconic gestures produced by a tele-operated robot are understood by participants when presented alone, almost as well as when produced by a human. More importantly, we show that gestures are integrated with speech when presented as part of a multi-modal communication equally well for human and robot performances. PMID:26925010

  8. Effects of hearing loss on speech recognition under distracting conditions and working memory in the elderly.

    Science.gov (United States)

    Na, Wondo; Kim, Gibbeum; Kim, Gungu; Han, Woojae; Kim, Jinsook

    2017-01-01

    The current study aimed to evaluate hearing-related changes in terms of speech-in-noise processing, fast-rate speech processing, and working memory; and to identify which of these three factors is significantly affected by age-related hearing loss. One hundred subjects aged 65-84 years participated in the study. They were classified into four groups ranging from normal hearing to moderate-to-severe hearing loss. All the participants were tested for speech perception in quiet and noisy conditions and for speech perception with time alteration in quiet conditions. Forward- and backward-digit span tests were also conducted to measure the participants' working memory. 1) As the level of background noise increased, speech perception scores systematically decreased in all the groups. This pattern was more noticeable in the three hearing-impaired groups than in the normal hearing group. 2) As the speech rate increased, speech perception scores decreased. A significant interaction was found between speed of speech and hearing loss. In particular, the 30%-compressed sentences revealed a clear differentiation between moderate hearing loss and moderate-to-severe hearing loss. 3) Although all the groups showed a longer span on the forward-digit span test than the backward-digit span test, there was no significant difference as a function of hearing loss. The degree of hearing loss strongly affects the speech recognition of babble-masked and time-compressed speech in the elderly but does not affect working memory. We expect these results to be applied to appropriate rehabilitation strategies for hearing-impaired elderly who experience difficulty in communication.

  9. Effects of electrode deactivation on speech recognition in multichannel cochlear implant recipients.

    Science.gov (United States)

    Schvartz-Leyzac, Kara C; Zwolan, Teresa A; Pfingst, Bryan E

    2017-11-01

    The objective of the current study is to evaluate how speech recognition performance is affected when active electrodes are turned off in multichannel cochlear implants. Several recent studies have demonstrated positive effects of deactivating stimulation sites, selected on the basis of an objective measure, in cochlear implant processing strategies. Previous studies using an analysis of variance have shown that, on average, cochlear implant listeners' performance does not improve beyond eight active electrodes. We hypothesized that using a generalized linear mixed model would allow for better examination of this question. Seven peri- and post-lingual adult cochlear implant users (eight ears) were tested on speech recognition tasks using experimental MAPs which contained either 8, 12, 16 or 20 active electrodes. Speech recognition tests included CUNY sentences in speech-shaped noise, TIMIT sentences in quiet as well as vowel (CVC) and consonant (CV) stimuli presented in quiet and at signal-to-noise ratios of 0 and +10 dB. The speech recognition threshold in noise (dB SNR) significantly worsened by approximately 2 dB on average as the number of active electrodes was decreased from 20 to 8. Likewise, sentence recognition scores in quiet significantly decreased by an average of approximately 12%. Cochlear implant recipients can utilize and benefit from more than eight spectral channels when listening to complex sentences or sentences in background noise. The results of the current study suggest that a conservative approach to turning off stimulation sites is best when using site-selection procedures.

  10. Recognition of temporally interrupted and spectrally degraded sentences with additional unprocessed low-frequency speech

    Science.gov (United States)

    Başkent, Deniz; Chatterjee, Monita

    2010-01-01

    Recognition of periodically interrupted sentences (with an interruption rate of 1.5 Hz, 50% duty cycle) was investigated under conditions of spectral degradation, implemented with a noiseband vocoder, with and without additional unprocessed low-pass filtered speech (cutoff frequency 500 Hz). Intelligibility of interrupted speech decreased with increasing spectral degradation. For all spectral-degradation conditions, however, adding the unprocessed low-pass filtered speech enhanced the intelligibility. The improvement at 4 and 8 channels was higher than the improvement at 16 and 32 channels: 19% and 8%, on average, respectively. The Articulation Index predicted an improvement of 0.09, in a scale from 0 to 1. Thus, the improvement at poorest spectral-degradation conditions was larger than what would be expected from additional speech information. Therefore, the results implied that the fine temporal cues from the unprocessed low-frequency speech, such as the additional voice pitch cues, helped perceptual integration of temporally interrupted and spectrally degraded speech, especially when the spectral degradations were severe. Considering the vocoder processing as a cochlear-implant simulation, where implant users’ performance is closest to 4 and 8-channel vocoder performance, the results support additional benefit of low-frequency acoustic input in combined electric-acoustic stimulation for perception of temporally degraded speech. PMID:20817081
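
    The noiseband-vocoder processing and periodic interruption described above can be approximated with the short sketch below. The channel count, filter design and logarithmic band spacing are simplifying assumptions; only the 1.5-Hz, 50%-duty-cycle interruption comes from the record.

```python
import numpy as np
from scipy.signal import butter, sosfilt, hilbert

def noiseband_vocode(x, fs, n_channels=8, lo=100.0, hi=7000.0):
    """Minimal noiseband vocoder: band-pass analysis, envelope extraction,
    envelope-modulated noise carriers, summation."""
    edges = np.geomspace(lo, hi, n_channels + 1)
    rng = np.random.default_rng(0)
    out = np.zeros_like(x)
    for k in range(n_channels):
        sos = butter(4, [edges[k], edges[k + 1]], btype="band", fs=fs, output="sos")
        band = sosfilt(sos, x)
        env = np.abs(hilbert(band))                       # channel envelope
        carrier = sosfilt(sos, rng.normal(size=len(x)))   # band-limited noise
        out += env * carrier
    return out

def interrupt(x, fs, rate_hz=1.5, duty=0.5):
    """Periodic interruption: keep `duty` of each cycle, silence the rest."""
    t = np.arange(len(x)) / fs
    gate = ((t * rate_hz) % 1.0) < duty
    return x * gate

fs = 16000
speech = np.random.default_rng(1).normal(size=fs)   # stand-in for a sentence
processed = interrupt(noiseband_vocode(speech, fs), fs)
```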

  11. A Posterior Union Model with Applications to Robust Speech and Speaker Recognition

    Directory of Open Access Journals (Sweden)

    Lin Jie

    2006-01-01

    Full Text Available This paper investigates speech and speaker recognition involving partial feature corruption, assuming unknown, time-varying noise characteristics. The probabilistic union model is extended from a conditional-probability formulation to a posterior-probability formulation as an improved solution to the problem. The new formulation allows the order of the model to be optimized for every single frame, thereby enhancing the capability of the model for dealing with nonstationary noise corruption. The new formulation also allows the model to be readily incorporated into a Gaussian mixture model (GMM) for speaker recognition. Experiments have been conducted on two databases: TIDIGITS and SPIDRE, for speech recognition and speaker identification. Both databases are subject to unknown, time-varying band-selective corruption. The results have demonstrated the improved robustness for the new model.

  12. Behavioral and electrophysiological evidence for early and automatic detection of phonological equivalence in variable speech inputs.

    Science.gov (United States)

    Kharlamov, Viktor; Campbell, Kenneth; Kazanina, Nina

    2011-11-01

    Speech sounds are not always perceived in accordance with their acoustic-phonetic content. For example, an early and automatic process of perceptual repair, which ensures conformity of speech inputs to the listener's native language phonology, applies to individual input segments that do not exist in the native inventory or to sound sequences that are illicit according to the native phonotactic restrictions on sound co-occurrences. The present study with Russian and Canadian English speakers shows that listeners may perceive phonetically distinct and licit sound sequences as equivalent when the native language system provides robust evidence for mapping multiple phonetic forms onto a single phonological representation. In Russian, due to an optional but productive t-deletion process that affects /stn/ clusters, the surface forms [sn] and [stn] may be phonologically equivalent and map to a single phonological form /stn/. In contrast, [sn] and [stn] clusters are usually phonologically distinct in (Canadian) English. Behavioral data from identification and discrimination tasks indicated that [sn] and [stn] clusters were more confusable for Russian than for English speakers. The EEG experiment employed an oddball paradigm with nonwords [asna] and [astna] used as the standard and deviant stimuli. A reliable mismatch negativity response was elicited approximately 100 msec postchange in the English group but not in the Russian group. These findings point to a perceptual repair mechanism that is engaged automatically at a prelexical level to ensure immediate encoding of speech inputs in phonological terms, which in turn enables efficient access to the meaning of a spoken utterance.

  13. One-against-all weighted dynamic time warping for language-independent and speaker-dependent speech recognition in adverse conditions.

    Science.gov (United States)

    Zhang, Xianglilan; Sun, Jiping; Luo, Zhigang

    2014-01-01

    Considering personal privacy and difficulty of obtaining training material for many seldom used English words and (often non-English) names, language-independent (LI) with lightweight speaker-dependent (SD) automatic speech recognition (ASR) is a promising option to solve the problem. The dynamic time warping (DTW) algorithm is the state-of-the-art algorithm for small foot-print SD ASR applications with limited storage space and small vocabulary, such as voice dialing on mobile devices, menu-driven recognition, and voice control on vehicles and robotics. Even though we have successfully developed two fast and accurate DTW variations for clean speech data, speech recognition for adverse conditions is still a big challenge. In order to improve recognition accuracy in noisy environment and bad recording conditions such as too high or low volume, we introduce a novel one-against-all weighted DTW (OAWDTW). This method defines a one-against-all index (OAI) for each time frame of training data and applies the OAIs to the core DTW process. Given two speech signals, OAWDTW tunes their final alignment score by using OAI in the DTW process. Our method achieves better accuracies than DTW and merge-weighted DTW (MWDTW), as 6.97% relative reduction of error rate (RRER) compared with DTW and 15.91% RRER compared with MWDTW are observed in our extensive experiments on one representative SD dataset of four speakers' recordings. To the best of our knowledge, OAWDTW approach is the first weighted DTW specially designed for speech data in adverse conditions.

  14. One-against-all weighted dynamic time warping for language-independent and speaker-dependent speech recognition in adverse conditions.

    Directory of Open Access Journals (Sweden)

    Xianglilan Zhang

    Full Text Available Considering personal privacy and difficulty of obtaining training material for many seldom used English words and (often non-English) names, language-independent (LI) with lightweight speaker-dependent (SD) automatic speech recognition (ASR) is a promising option to solve the problem. The dynamic time warping (DTW) algorithm is the state-of-the-art algorithm for small foot-print SD ASR applications with limited storage space and small vocabulary, such as voice dialing on mobile devices, menu-driven recognition, and voice control on vehicles and robotics. Even though we have successfully developed two fast and accurate DTW variations for clean speech data, speech recognition for adverse conditions is still a big challenge. In order to improve recognition accuracy in noisy environment and bad recording conditions such as too high or low volume, we introduce a novel one-against-all weighted DTW (OAWDTW). This method defines a one-against-all index (OAI) for each time frame of training data and applies the OAIs to the core DTW process. Given two speech signals, OAWDTW tunes their final alignment score by using OAI in the DTW process. Our method achieves better accuracies than DTW and merge-weighted DTW (MWDTW), as 6.97% relative reduction of error rate (RRER) compared with DTW and 15.91% RRER compared with MWDTW are observed in our extensive experiments on one representative SD dataset of four speakers' recordings. To the best of our knowledge, OAWDTW approach is the first weighted DTW specially designed for speech data in adverse conditions.
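
    A minimal sketch of the weighted-DTW idea (my own simplification, not the published OAWDTW code): each template frame carries a weight, standing in for the paper's one-against-all index (OAI), that scales its local distance during alignment, so frames that discriminate the target word from all other words dominate the final score. In the paper the weights are derived from the training data; here `frame_weights` is simply an input.

    ```python
    # Hedged sketch: dynamic time warping with per-frame weights on the template.
    import numpy as np

    def weighted_dtw(template, query, frame_weights):
        """template: (T, D), query: (Q, D), frame_weights: (T,) — lower score = better match."""
        T, Q = len(template), len(query)
        cost = np.full((T + 1, Q + 1), np.inf)
        cost[0, 0] = 0.0
        for i in range(1, T + 1):
            for j in range(1, Q + 1):
                # local distance scaled by the weight of the i-th template frame
                d = frame_weights[i - 1] * np.linalg.norm(template[i - 1] - query[j - 1])
                cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
        return cost[T, Q] / (T + Q)   # length-normalised alignment score
    ```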

  15. Fully Automatic Recognition of the Temporal Phases of Facial Actions

    NARCIS (Netherlands)

    Valstar, M.F.; Pantic, Maja

    Past work on automatic analysis of facial expressions has focused mostly on detecting prototypic expressions of basic emotions like happiness and anger. The method proposed here enables the detection of a much larger range of facial behavior by recognizing facial muscle actions [action units (AUs)

  16. Automatic recognition of lactating sow behaviors through depth image processing

    Science.gov (United States)

    Manual observation and classification of animal behaviors is laborious, time-consuming, and of limited ability to process large amount of data. A computer vision-based system was developed that automatically recognizes sow behaviors (lying, sitting, standing, kneeling, feeding, drinking, and shiftin...

  17. Wavelet-based automatic cry recognition system for detecting infants with hearing-loss from normal infants

    Directory of Open Access Journals (Sweden)

    Mahmoud Mansouri Jam

    2013-11-01

    Full Text Available Infant cry is a multimodal and dynamic behaviour that contains a lot of information. The goal of this investigation is to distinguish two groups of infants using a new acoustic feature that has not previously been used in infant cry classification. The cries of deaf infants and of normal-hearing infants are studied. Mel filter-bank discrete wavelet coefficients (MFDWCs) were extracted as the feature vector. Infant cry classification is a pattern recognition problem similar to automatic speech recognition; in the signal processing stage the authors performed pre-processing comprising silence elimination, filtering, pre-emphasis and segmentation. After applying the discrete wavelet transform to the Mel-scaled log filter-bank energies of the cry signal frames, the MFDWC feature vector was extracted. Because the MFDWC feature vector of each cry sample is long, principal component analysis was used to reduce the feature-space dimension; after training a neural network as the classifier, a 93.2% correct-recognition rate was achieved on the test data set. This result compares favourably with previously reported approaches.
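
    A hedged sketch of MFDWC-style feature extraction (assuming a librosa Mel filter bank and a Daubechies wavelet; the exact filter bank, wavelet and decomposition depth used in the paper are not stated here): the discrete wavelet transform is applied across the Mel-scaled log filter-bank energies of each frame, and the resulting coefficients are reduced with principal component analysis before classification.

    ```python
    # Hedged sketch: Mel filter-bank discrete wavelet coefficients per frame.
    import numpy as np
    import librosa
    import pywt
    from sklearn.decomposition import PCA

    def mfdwc(signal, sr, n_mels=40, wavelet="db4"):
        mel = librosa.feature.melspectrogram(y=signal, sr=sr, n_mels=n_mels)
        log_mel = np.log(mel + 1e-10)                      # (n_mels, n_frames)
        feats = []
        for frame in log_mel.T:                            # DWT across the Mel axis, per frame
            coeffs = pywt.wavedec(frame, wavelet, level=2)
            feats.append(np.concatenate(coeffs))
        return np.array(feats)                             # (n_frames, n_coeffs)

    # Dimensionality reduction before the neural-network classifier, as the paper
    # does with principal component analysis (illustrative usage):
    # pca = PCA(n_components=20).fit(training_features)
    # reduced = pca.transform(mfdwc(signal, sr))
    ```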

  18. Spectro-Temporal Analysis of Speech for Spanish Phoneme Recognition

    DEFF Research Database (Denmark)

    Sharifzadeh, Sara; Serrano, Javier; Carrabina, Jordi

    2012-01-01

    are considered. This has improved the recognition performance especially in case of noisy situation and phonemes with time domain modulations such as stops. In this method, the 2D Discrete Cosine Transform (DCT) is applied on small overlapped 2D Hamming windowed patches of spectrogram of Spanish phonemes...

  19. Shortlist: A Connectionist Model of Continuous Speech Recognition.

    Science.gov (United States)

    Norris, Dennis

    1994-01-01

    The Shortlist model is presented, which incorporates the desirable properties of earlier models of back-propagation networks with recurrent connections that successfully model many aspects of human spoken word recognition. The new model is entirely bottom-up and can readily perform simulations with vocabularies of tens of thousands of words. (DR)

  20. Purging Musical Instrument Sample Databases Using Automatic Musical Instrument Recognition Methods

    OpenAIRE

    Livshin , Arie; Rodet , Xavier

    2009-01-01

    Compilation of musical instrument sample databases requires careful elimination of badly recorded samples and validation of sample classification into correct categories. This paper introduces algorithms for automatic removal of bad instrument samples using Automatic Musical Instrument Recognition and Outlier Detection techniques. Best evaluation results on a methodically contaminated sound database are achieved using the introdu...
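
    One plausible realisation of the purging idea (a sketch with my own feature and detector choices, not the authors' algorithm): summarise each sample of a nominal instrument class with simple spectral features and flag suspected bad or misfiled samples with a generic outlier detector.

    ```python
    # Hedged sketch: outlier-based purging of an instrument sample class.
    import librosa
    from sklearn.ensemble import IsolationForest

    def flag_outliers(file_paths, contamination=0.05):
        feats = []
        for path in file_paths:                      # all files nominally of one instrument
            y, sr = librosa.load(path, sr=None)
            mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
            feats.append(mfcc.mean(axis=1))          # one summary vector per sample
        labels = IsolationForest(contamination=contamination,
                                 random_state=0).fit_predict(feats)
        return [p for p, l in zip(file_paths, labels) if l == -1]   # suspected bad samples
    ```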

  1. DEVELOPMENT OF AUTOMATED SPEECH RECOGNITION SYSTEM FOR EGYPTIAN ARABIC PHONE CONVERSATIONS

    Directory of Open Access Journals (Sweden)

    A. N. Romanenko

    2016-07-01

    Full Text Available The paper describes several speech recognition systems for Egyptian Colloquial Arabic. The research is based on the CALLHOME Egyptian corpus. Both a classic system, based on Hidden Markov and Gaussian Mixture Models, and a state-of-the-art system with deep neural network acoustic models are described. We have demonstrated the contribution from the usage of speaker-dependent bottleneck features; for their extraction, three extractors based on neural networks were trained. For their training, datasets in several languages were used: Russian, English and different Arabic dialects. We have studied the possibility of applying a small Modern Standard Arabic (MSA) corpus to derive phonetic transcriptions. The experiments have shown that applying the extractor trained on the Russian dataset significantly increases the quality of the Arabic speech recognition. We have also found that the usage of phonetic transcriptions based on Modern Standard Arabic decreases recognition quality; nevertheless, system operation results remain applicable in practice. In addition, we have studied the application of the obtained models to the keyword search problem. The systems obtained demonstrate good results as compared to those published before. Some ways to improve speech recognition are offered.
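
    A minimal sketch of the bottleneck-feature idea (PyTorch, with illustrative layer sizes and names of my own): a network with a deliberately narrow hidden layer is trained on a donor language (for example the Russian data), and the activations of that layer are then reused as input features for the target-language acoustic models.

    ```python
    # Hedged sketch: a narrow "bottleneck" layer trained on a donor language,
    # whose activations serve as features for the target-language system.
    import torch
    import torch.nn as nn

    class BottleneckNet(nn.Module):
        def __init__(self, n_in, n_hidden, n_bottleneck, n_targets):
            super().__init__()
            self.front = nn.Sequential(nn.Linear(n_in, n_hidden), nn.ReLU(),
                                       nn.Linear(n_hidden, n_hidden), nn.ReLU())
            self.bottleneck = nn.Linear(n_hidden, n_bottleneck)     # narrow layer
            self.head = nn.Sequential(nn.ReLU(), nn.Linear(n_bottleneck, n_targets))

        def forward(self, x):
            # trained with a senone/phone classification loss on the donor language
            return self.head(self.bottleneck(self.front(x)))

        def extract(self, x):
            # bottleneck activations, reused as input features elsewhere
            with torch.no_grad():
                return self.bottleneck(self.front(x))
    ```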

  2. Hybrid model decomposition of speech and noise in a radial basis function neural model framework

    DEFF Research Database (Denmark)

    Sørensen, Helge Bjarup Dissing; Hartmann, Uwe

    1994-01-01

    The paper focuses on a new approach to automatic speech recognition in noisy environments where the noise has either stationary or non-stationary statistical characteristics. The aim is to perform automatic recognition of speech in the presence of additive car noise. The technique...

  3. Automatic identification of otological drilling faults: an intelligent recognition algorithm.

    Science.gov (United States)

    Cao, Tianyang; Li, Xisheng; Gao, Zhiqiang; Feng, Guodong; Shen, Peng

    2010-06-01

    This article presents an intelligent recognition algorithm that can recognize milling states of the otological drill by fusing multi-sensor information. An otological drill was modified by the addition of sensors. The algorithm was designed according to features of the milling process and is composed of a characteristic curve, an adaptive filter and a rule base. The characteristic curve can weaken the impact of the unstable normal milling process and reserve the features of drilling faults. The adaptive filter is capable of suppressing interference in the characteristic curve by fusing multi-sensor information. The rule base can identify drilling faults through the filtering result data. The experiments were repeated on fresh porcine scapulas, including normal milling and two drilling faults. The algorithm has high rates of identification. This study shows that the intelligent recognition algorithm can identify drilling faults under interference conditions. (c) 2010 John Wiley & Sons, Ltd.

  4. Automatic Pavement Crack Recognition Based on BP Neural Network

    OpenAIRE

    Li, Li; Sun, Lijun; Ning, Guobao; Tan, Shengguang

    2014-01-01

    A feasible pavement crack detection system plays an important role in evaluating the road condition and providing the necessary road maintenance. In this paper, a back propagation neural network (BPNN) is used to recognize pavement cracks from images. To improve the recognition accuracy of the BPNN, a complete framework of image processing is proposed including image preprocessing and crack information extraction. In this framework, the redundant image information is reduced as much as possib...

  5. Image analysis in automatic system of pollen recognition

    Directory of Open Access Journals (Sweden)

    Piotr Rapiejko

    2012-12-01

    Full Text Available In allergology practice and research, it would be convenient to receive pollen identification and monitoring results in a much shorter time than human identification provides. Image-based analysis is one approach to an automated identification scheme for pollen grains, and pattern recognition on such images is widely used as a powerful tool. The goal of such an attempt is to provide accurate and fast recognition, classification and counting of pollen grains by a computer-based monitoring system. Isolated pollen grains are objects extracted from the microscopic image, acquired by a CCD camera and a PC under proper conditions, for further analysis. The algorithms are based on feature-vector analysis of parameters estimated from grain characteristics, including morphological features, surface features and other applicable characteristics. Segmentation algorithms specially tailored to pollen object characteristics provide exact descriptions of pollen characteristics (border and internal features) already used by human experts. The specific characteristics and their measures are statistically estimated for each object. Low-level statistics of the estimated local and global measures of the features establish the feature space. Special care should be taken in choosing these features and in constructing the feature space to optimize the number of subspaces for higher recognition rates in low-level classification for type differentiation of pollen grains. The results of the estimated feature-vector parameters in a low-dimensional space are presented for some typical pollen types, as well as some effective and fast recognition results from experiments on different pollens. The findings show evidence that properly chosen estimators of central and invariant moments (M21, NM2, NM3, NM8, NM9) of the tailored characteristics give good classification measures (efficiency > 95%), even for low-dimensional classifiers.
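
    As a concrete example of moment-based shape features (a sketch using OpenCV's standard central, normalised and Hu invariant moments; the paper's M/NM estimators are its own notation and are not reproduced exactly here):

    ```python
    # Hedged sketch: moment features of a segmented pollen-grain mask.
    import cv2
    import numpy as np

    def grain_moment_features(binary_mask):
        m = cv2.moments(binary_mask, binaryImage=True)
        central = [m[k] for k in ("mu20", "mu11", "mu02", "mu30", "mu21", "mu12", "mu03")]
        normalised = [m[k] for k in ("nu20", "nu11", "nu02", "nu30", "nu21", "nu12", "nu03")]
        hu = cv2.HuMoments(m).ravel()        # translation/scale/rotation invariant set
        return np.concatenate([central, normalised, hu])
    ```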

  6. Mandarin-Speaking Children’s Speech Recognition: Developmental Changes in the Influences of Semantic Context and F0 Contours

    Directory of Open Access Journals (Sweden)

    Hong Zhou

    2017-06-01

    Full Text Available The goal of this developmental speech perception study was to assess whether and how age group modulated the influences of high-level semantic context and low-level fundamental frequency (F0) contours on the recognition of Mandarin speech by elementary and middle-school-aged children in quiet and interference backgrounds. The results revealed different patterns for semantic and F0 information. On the one hand, age group significantly modulated the use of F0 contours, indicating that elementary school children relied more on natural F0 contours than middle school children during Mandarin speech recognition. On the other hand, there was no significant modulation effect of age group on semantic context, indicating that children of both age groups used semantic context to assist speech recognition to a similar extent. Furthermore, the significant modulation effect of age group on the interaction between F0 contours and semantic context revealed that younger children could not make better use of semantic context in recognizing speech with flat F0 contours compared with natural F0 contours, while older children could benefit from semantic context even when natural F0 contours were altered, thus confirming the important role of F0 contours in Mandarin speech recognition by elementary school children. The developmental changes in the effects of high-level semantic and low-level F0 information on speech recognition might reflect the differences in auditory and cognitive resources associated with processing of the two types of information in speech perception.

  7. Clustered Multi-Task Learning for Automatic Radar Target Recognition.

    Science.gov (United States)

    Li, Cong; Bao, Weimin; Xu, Luping; Zhang, Hua

    2017-09-27

    Model training is a key technique for radar target recognition. Traditional model training algorithms in the framework of single-task learning ignore the relationships among multiple tasks, which degrades the recognition performance. In this paper, we propose a clustered multi-task learning method, which can reveal and share the multi-task relationships for radar target recognition. To further make full use of these relationships, the latent multi-task relationships in the projection space are taken into consideration. Specifically, a constraint term in the projection space is proposed, the main idea of which is that multiple tasks within a close cluster should be close to each other in the projection space. In the proposed method, the cluster structures and multi-task relationships can be autonomously learned and utilized in both the original and projected spaces. In view of the nonlinear characteristics of radar targets, the proposed method is extended to a non-linear kernel version and the corresponding non-linear multi-task solving method is proposed. Comprehensive experimental studies on a simulated high-resolution range profile dataset and the MSTAR SAR public database verify the superiority of the proposed method to some related algorithms.

  8. Effects of contextual cues on speech recognition in simulated electric-acoustic stimulation.

    Science.gov (United States)

    Kong, Ying-Yee; Donaldson, Gail; Somarowthu, Ala

    2015-05-01

    Low-frequency acoustic cues have been shown to improve speech perception in cochlear-implant listeners. However, the mechanisms underlying this benefit are still not well understood. This study investigated the extent to which low-frequency cues can facilitate listeners' use of linguistic knowledge in simulated electric-acoustic stimulation (EAS). Experiment 1 examined differences in the magnitude of EAS benefit at the phoneme, word, and sentence levels. Speech materials were processed via noise-channel vocoding and lowpass (LP) filtering. The amount of spectral degradation in the vocoder speech was varied by applying different numbers of vocoder channels. Normal-hearing listeners were tested on vocoder-alone, LP-alone, and vocoder + LP conditions. Experiment 2 further examined factors that underlie the context effect on EAS benefit at the sentence level by limiting the low-frequency cues to temporal envelope and periodicity (AM + FM). Results showed that EAS benefit was greater for higher-context than for lower-context speech materials even when the LP ear received only low-frequency AM + FM cues. Possible explanations for the greater EAS benefit observed with higher-context materials may lie in the interplay between perceptual and expectation-driven processes for EAS speech recognition, and/or the band-importance functions for different types of speech materials.
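
    A hedged sketch of the kind of processing described (noise-channel vocoding plus a low-pass acoustic band; the channel edges, filter orders and 500-Hz cutoff below are illustrative values, not necessarily those of the study):

    ```python
    # Hedged sketch: noise-band vocoder plus low-pass speech (EAS simulation).
    import numpy as np
    from scipy.signal import butter, filtfilt, hilbert

    def bandpass(x, lo, hi, fs, order=4):
        b, a = butter(order, [lo / (fs / 2), hi / (fs / 2)], btype="band")
        return filtfilt(b, a, x)

    def noise_vocode(x, fs, n_channels=8, f_lo=100.0, f_hi=7000.0):
        # fs must exceed 2 * f_hi for the top channel to be valid
        edges = np.geomspace(f_lo, f_hi, n_channels + 1)       # log-spaced channel edges
        out = np.zeros_like(x)
        for lo, hi in zip(edges[:-1], edges[1:]):
            band = bandpass(x, lo, hi, fs)
            env = np.abs(hilbert(band))                         # channel envelope
            carrier = bandpass(np.random.randn(len(x)), lo, hi, fs)
            out += env * carrier                                # envelope-modulated noise
        return out

    def simulate_eas(x, fs, n_channels=8, lp_cutoff=500.0):
        b, a = butter(4, lp_cutoff / (fs / 2), btype="low")
        return noise_vocode(x, fs, n_channels) + filtfilt(b, a, x)   # vocoder + acoustic low-pass
    ```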

  9. Superior Speech Acquisition and Robust Automatic Speech Recognition for Integrated Spacesuit Audio Systems, Phase II

    Data.gov (United States)

    National Aeronautics and Space Administration — Astronauts suffer from poor dexterity of their hands due to the clumsy spacesuit gloves during Extravehicular Activity (EVA) operations and NASA has had a widely...

  10. Superior Speech Acquisition and Robust Automatic Speech Recognition for Integrated Spacesuit Audio Systems, Phase I

    Data.gov (United States)

    National Aeronautics and Space Administration — Astronauts suffer from poor dexterity of their hands due to the clumsy spacesuit gloves during Extravehicular Activity (EVA) operations and NASA has had a widely...

  11. High-Performance Speech Recognition Using Consistency Modeling

    Science.gov (United States)

    1994-03-01

    other techniques, SRI has reduced its word error rate on ARPA’s November 1992 baseline 5,000 word Wall Street Journal bigram evaluation set from 13.0% to...standard cross-word tied-mixture 5K bigram HMM recognizer for ARPA’s Wall Street Journal dictation task. It improved recognition development time by an order...on one of our Wall Street Journal development test sets; the results are summarized in Table 2. Table 2: Word Error for WSJ Male 5K Closed Verbalized

  12. Non-native Listeners’ Recognition of High-Variability Speech Using PRESTO

    Science.gov (United States)

    Tamati, Terrin N.; Pisoni, David B.

    2015-01-01

    Background Natural variability in speech is a significant challenge to robust successful spoken word recognition. In everyday listening environments, listeners must quickly adapt and adjust to multiple sources of variability in both the signal and listening environments. High-variability speech may be particularly difficult to understand for non-native listeners, who have less experience with the second language (L2) phonological system and less detailed knowledge of sociolinguistic variation of the L2. Purpose The purpose of this study was to investigate the effects of high-variability sentences on non-native speech recognition and to explore the underlying sources of individual differences in speech recognition abilities of non-native listeners. Research Design Participants completed two sentence recognition tasks involving high-variability and low-variability sentences. They also completed a battery of behavioral tasks and self-report questionnaires designed to assess their indexical processing skills, vocabulary knowledge, and several core neurocognitive abilities. Study Sample Native speakers of Mandarin (n = 25) living in the United States recruited from the Indiana University community participated in the current study. A native comparison group consisted of scores obtained from native speakers of English (n = 21) in the Indiana University community taken from an earlier study. Data Collection and Analysis Speech recognition in high-variability listening conditions was assessed with a sentence recognition task using sentences from PRESTO (Perceptually Robust English Sentence Test Open-Set) mixed in 6-talker multitalker babble. Speech recognition in low-variability listening conditions was assessed using sentences from HINT (Hearing In Noise Test) mixed in 6-talker multitalker babble. Indexical processing skills were measured using a talker discrimination task, a gender discrimination task, and a forced-choice regional dialect categorization task. Vocabulary

  13. Non-native listeners' recognition of high-variability speech using PRESTO.

    Science.gov (United States)

    Tamati, Terrin N; Pisoni, David B

    2014-10-01

    Natural variability in speech is a significant challenge to robust successful spoken word recognition. In everyday listening environments, listeners must quickly adapt and adjust to multiple sources of variability in both the signal and listening environments. High-variability speech may be particularly difficult to understand for non-native listeners, who have less experience with the second language (L2) phonological system and less detailed knowledge of sociolinguistic variation of the L2. The purpose of this study was to investigate the effects of high-variability sentences on non-native speech recognition and to explore the underlying sources of individual differences in speech recognition abilities of non-native listeners. Participants completed two sentence recognition tasks involving high-variability and low-variability sentences. They also completed a battery of behavioral tasks and self-report questionnaires designed to assess their indexical processing skills, vocabulary knowledge, and several core neurocognitive abilities. Native speakers of Mandarin (n = 25) living in the United States recruited from the Indiana University community participated in the current study. A native comparison group consisted of scores obtained from native speakers of English (n = 21) in the Indiana University community taken from an earlier study. Speech recognition in high-variability listening conditions was assessed with a sentence recognition task using sentences from PRESTO (Perceptually Robust English Sentence Test Open-Set) mixed in 6-talker multitalker babble. Speech recognition in low-variability listening conditions was assessed using sentences from HINT (Hearing In Noise Test) mixed in 6-talker multitalker babble. Indexical processing skills were measured using a talker discrimination task, a gender discrimination task, and a forced-choice regional dialect categorization task. Vocabulary knowledge was assessed with the WordFam word familiarity test, and executive

  14. Automaticity of phonological and semantic processing during visual word recognition.

    Science.gov (United States)

    Pattamadilok, Chotiga; Chanoine, Valérie; Pallier, Christophe; Anton, Jean-Luc; Nazarian, Bruno; Belin, Pascal; Ziegler, Johannes C

    2017-04-01

    Reading involves activation of phonological and semantic knowledge. Yet, the automaticity of the activation of these representations remains subject to debate. The present study addressed this issue by examining how different brain areas involved in language processing responded to a manipulation of bottom-up (level of visibility) and top-down information (task demands) applied to written words. The analyses showed that the same brain areas were activated in response to written words whether the task was symbol detection, rime detection, or semantic judgment. This network included posterior, temporal and prefrontal regions, which clearly suggests the involvement of orthographic, semantic and phonological/articulatory processing in all tasks. However, we also found interactions between task and stimulus visibility, which reflected the fact that the strength of the neural responses to written words in several high-level language areas varied across tasks. Together, our findings suggest that the involvement of phonological and semantic processing in reading is supported by two complementary mechanisms. First, an automatic mechanism that results from a task-independent spread of activation throughout a network in which orthography is linked to phonology and semantics. Second, a mechanism that further fine-tunes the sensitivity of high-level language areas to the sensory input in a task-dependent manner. Copyright © 2017 Elsevier Inc. All rights reserved.

  15. Speech Recognition System For Robotic Control And Movement

    Directory of Open Access Journals (Sweden)

    Biraja Nalini Rout

    2015-08-01

    Full Text Available In the current scenario, voice and data recognition is one of the most sought-after fields in artificial intelligence and robotics engineering. The idea centres on a voice-to-voice intelligent system that operates purely on audio/voice instructions, using a specialized voice recognition (VR) module, a microcontroller, a set of wheels and a movable arm. In operation, real-time voice inputs are fed to the VR module, which processes the audio signals and produces the output in audio format. It includes an IDE for both Windows and UNIX-based operating systems for manipulating and processing instructions at both the software and hardware levels. The system can also perform a basic set of manual operations decided through the expert system. The VR module processes the data using a multilayer perceptron to generate the required result. The movable arm picks and places objects according to the given voice instructions. Its usability involves substituting manual work at both personal and professional levels.

  16. Automatic recognition of the electrometer status from picture

    OpenAIRE

    HANZLÍK, Ondřej

    2015-01-01

    This thesis deals with the problem of recognizing an electrometer's state from a captured image, specifically by scanning the meter with a mobile phone camera. The surface carrying the electrometer's dial is detected first, and the individual digits on that surface are then detected. The digits are recognized with a neural network. To extract further information from the image, image segmentation techniques are used to check the status. For the classification of the segment...

  17. Speech Recognition System and Formant Based Analysis of Spoken Arabic Vowels

    Science.gov (United States)

    Alotaibi, Yousef Ajami; Hussain, Amir

    Arabic is one of the world's oldest languages and is currently the second most spoken language in terms of number of speakers. However, it has not received much attention from the traditional speech processing research community. This study is specifically concerned with the analysis of vowels in modern standard Arabic dialect. The first and second formant values in these vowels are investigated and the differences and similarities between the vowels are explored using consonant-vowels-consonant (CVC) utterances. For this purpose, an HMM based recognizer was built to classify the vowels and the performance of the recognizer analyzed to help understand the similarities and dissimilarities between the phonetic features of vowels. The vowels are also analyzed in both time and frequency domains, and the consistent findings of the analysis are expected to facilitate future Arabic speech processing tasks such as vowel and speech recognition and classification.
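
    For illustration, first and second formants are commonly estimated from the roots of a linear-prediction (LPC) polynomial; the sketch below is my own and not necessarily the procedure used in the study, and it returns only rough F1/F2 values for a short voiced frame.

    ```python
    # Hedged sketch: LPC-based formant estimation for a short voiced frame.
    import numpy as np
    import librosa

    def estimate_formants(frame, fs, lpc_order=12):
        frame = np.append(frame[0], frame[1:] - 0.97 * frame[:-1])      # pre-emphasis
        a = librosa.lpc(frame * np.hamming(len(frame)), order=lpc_order)
        roots = [r for r in np.roots(a) if np.imag(r) > 0]              # one of each conjugate pair
        freqs = sorted(np.angle(roots) * fs / (2 * np.pi))
        formants = [f for f in freqs if f > 90]                         # drop near-DC roots
        return formants[:2]                                             # rough (F1, F2)
    ```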

  18. Design and Development of the Spaceship Counting Game with Voice Control Using a Speech Recognition Plugin

    Directory of Open Access Journals (Sweden)

    Hans Alfon Ericksoon

    2017-01-01

    Full Text Available Arithmetic is part of the basic foundation for the educational process. The fundamentals of arithmetic are generally taught to elementary school students in grades 1 and 2. However, most elementary school children prefer playing after school to practising their arithmetic skills. With current technology, arithmetic practice can be made more engaging and interactive and can be done anywhere. In this final project, an Android-based counting game called Spaceship was developed with voice control using a Speech Recognition Plugin. The aim of this final project is to create a game that can be used as an alternative means of sharpening children's arithmetic skills. The material used in the game follows several mathematics textbooks for elementary school children published by the Departemen Pendidikan Nasional as references, so that children can practise arithmetic in line with the school curriculum. The game is made interactive through the speech recognition feature, making it more appealing to children.

  19. Speech-perception training for older adults with hearing loss impacts word recognition and effort.

    Science.gov (United States)

    Kuchinsky, Stefanie E; Ahlstrom, Jayne B; Cute, Stephanie L; Humes, Larry E; Dubno, Judy R; Eckert, Mark A

    2014-10-01

    The current pupillometry study examined the impact of speech-perception training on word recognition and cognitive effort in older adults with hearing loss. Trainees identified more words at the follow-up than at the baseline session. Training also resulted in an overall larger and faster peaking pupillary response, even when controlling for performance and reaction time. Perceptual and cognitive capacities affected the peak amplitude of the pupil response across participants but did not diminish the impact of training on the other pupil metrics. Thus, we demonstrated that pupillometry can be used to characterize training-related and individual differences in effort during a challenging listening task. Importantly, the results indicate that speech-perception training not only affects overall word recognition, but also a physiological metric of cognitive effort, which has the potential to be a biomarker of hearing loss intervention outcome. Copyright © 2014 Society for Psychophysiological Research.

  20. Exploring the link between cognitive abilities and speech recognition in the elderly under different listening conditions

    DEFF Research Database (Denmark)

    Nuesse, Theresa; Steenken, Rike; Neher, Tobias

    2018-01-01

    , and it has been suggested that differences in cognitive abilities may also be important. The objective of this study was to investigate associations between performance in cognitive tasks and speech recognition under different listening conditions in older adults with either age appropriate hearing...... or hearing-impairment. To that end, speech recognition threshold (SRT) measurements were performed under several masking conditions that varied along the perceptual dimensions of dip listening, spatial separation, and informational masking. In addition, a neuropsychological test battery was administered....... In repeated linear regression analyses, composite scores of cognitive test outcomes (evaluated using PCA) were included to predict SRTs. These associations were different for the two groups. When hearing thresholds were controlled for, composed cognitive factors were significantly associated with the SRTs...
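
    A hedged sketch of this kind of analysis pipeline (scikit-learn, with illustrative variable names; not the authors' exact model specification): the cognitive test battery is compressed with PCA and the speech recognition thresholds (SRTs) are regressed on the component scores while controlling for hearing thresholds.

    ```python
    # Hedged sketch: PCA of cognitive scores, then linear regression on SRTs.
    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.linear_model import LinearRegression
    from sklearn.preprocessing import StandardScaler

    def predict_srt(cognitive_scores, hearing_thresholds, srt, n_components=2):
        z = StandardScaler().fit_transform(cognitive_scores)       # (n_subjects, n_tests)
        components = PCA(n_components=n_components).fit_transform(z)
        X = np.column_stack([components, hearing_thresholds])      # cognition + audibility
        model = LinearRegression().fit(X, srt)
        return model, model.score(X, srt)                          # fitted model and R^2
    ```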