WorldWideScience

Sample records for automatic speech recognition

  1. Thai Automatic Speech Recognition

    National Research Council Canada - National Science Library

    Suebvisai, Sinaporn; Charoenpornsawat, Paisarn; Black, Alan; Woszczyna, Monika; Schultz, Tanja

    2005-01-01

    .... We focus on the discussion of the rapid deployment of ASR for Thai under limited time and data resources, including rapid data collection issues, acoustic model bootstrap, and automatic generation of pronunciations...

  2. Hidden Markov models in automatic speech recognition

    Science.gov (United States)

    Wrzoskowicz, Adam

    1993-11-01

    This article describes a method for constructing an automatic speech recognition system based on hidden Markov models (HMMs). The author discusses the basic concepts of HMM theory and the application of these models to the analysis and recognition of speech signals. The author provides algorithms which make it possible to train the ASR system and recognize signals on the basis of distinct stochastic models of selected speech sound classes. The author describes the specific components of the system and the procedures used to model and recognize speech. The author discusses problems associated with the choice of optimal signal detection and parameterization characteristics and their effect on the performance of the system. The author presents different options for the choice of speech signal segments and their consequences for the ASR process. The author gives special attention to the use of lexical, syntactic, and semantic information for the purpose of improving the quality and efficiency of the system. The author also describes an ASR system developed by the Speech Acoustics Laboratory of the IBPT PAS. The author discusses the results of experiments on the effect of noise on the performance of the ASR system and describes methods of constructing HMMs designed to operate in a noisy environment. The author also describes a language for human-robot communications which was defined as a complex multilevel network from an HMM model of speech sounds geared towards Polish inflections. The author also added mandatory lexical and syntactic rules to the system for its communications vocabulary.
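
    As a rough illustration of recognition with HMMs, the sketch below decodes the most likely state sequence with the Viterbi algorithm. It is not taken from the article: the two states, the "low"/"high" observation symbols, and all probabilities are hypothetical values chosen only to make the example runnable.

```python
import math


def viterbi(observations, states, log_start, log_trans, log_emit):
    """Return the most likely state path and its log-probability."""
    # best[t][s]: best log-probability of any path that ends in state s at time t
    best = [{s: log_start[s] + log_emit[s][observations[0]] for s in states}]
    back = [{}]
    for t in range(1, len(observations)):
        best.append({})
        back.append({})
        for s in states:
            prev, score = max(
                ((p, best[t - 1][p] + log_trans[p][s]) for p in states),
                key=lambda pair: pair[1],
            )
            best[t][s] = score + log_emit[s][observations[t]]
            back[t][s] = prev
    # Trace the best path backwards from the best final state.
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for t in range(len(observations) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path, best[-1][last]


def log_table(table):
    """Convert a nested dict of probabilities to log-probabilities."""
    return {k: {j: math.log(p) for j, p in row.items()} for k, row in table.items()}


# Hypothetical toy model (assumed values, not from the article).
STATES = ["silence", "speech"]
LOG_START = {"silence": math.log(0.8), "speech": math.log(0.2)}
LOG_TRANS = log_table({
    "silence": {"silence": 0.7, "speech": 0.3},
    "speech": {"silence": 0.2, "speech": 0.8},
})
LOG_EMIT = log_table({
    "silence": {"low": 0.9, "high": 0.1},
    "speech": {"low": 0.3, "high": 0.7},
})

path, score = viterbi(["low", "high", "high", "low"], STATES, LOG_START, LOG_TRANS, LOG_EMIT)
print(path, math.exp(score))
```

    Working in log-probabilities avoids numerical underflow on long observation sequences, which is why the tables are converted before decoding.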

  3. Predicting automatic speech recognition performance over communication channels from instrumental speech quality and intelligibility scores

    NARCIS (Netherlands)

    Gallardo, L.F.; Möller, S.; Beerends, J.

    2017-01-01

    The performance of automatic speech recognition based on coded-decoded speech heavily depends on the quality of the transmitted signals, determined by channel impairments. This paper examines relationships between speech recognition performance and measurements of speech quality and intelligibility

  4. Development of a System for Automatic Recognition of Speech

    Directory of Open Access Journals (Sweden)

    Roman Jarina

    2003-01-01

    The article reviews research on the processing and automatic recognition of speech signals (ARR) at the Department of Telecommunications of the Faculty of Electrical Engineering, University of Žilina. Ongoing research is oriented towards speech parametrization using 2-dimensional cepstral analysis, and towards the application of HMMs and neural networks to speech recognition in the Slovak language. The article summarizes the results achieved and outlines the future orientation of our research in automatic speech recognition.

  5. Indonesian Automatic Speech Recognition For Command Speech Controller Multimedia Player

    Directory of Open Access Journals (Sweden)

    Vivien Arief Wardhany

    2014-12-01

    The purpose of multimedia device development is control through voice. Nowadays, voice can be recognized only in English. To overcome this issue, recognition is performed using an Indonesian language model, acoustic model and dictionary. The automatic speech recognizer is built using the CMU Sphinx engine, with the English language database modified to an Indonesian language database, and XBMC is used as the multimedia player. The experiment uses 10 volunteers testing items based on 7 commands. The volunteers are classified by gender: 5 male and 5 female. Ten samples are taken for each command, after which each volunteer performs 10 test commands. Each volunteer also has to try all 7 commands provided. Based on the percentage classification table, the word “kanan” was recognized most often, with a percentage of 83%, while “pilih” was the lowest. The word with the most wrong classifications is “kembali”, with a percentage of 67%, while the word “kanan” is the lowest. The recognition rate (RR) results for male speakers show that several commands, such as “Kembali”, “Utama”, “Atas” and “Bawah”, have a low RR. In particular, “kembali” cannot be recognized as a command in the female voices, whereas in the male voices that command has an RR of 4%; this is because there is no English word similar to “kembali”, so the system fails to recognize the command. The command “Pilih” has an RR of 80% with the female voices but only 4% with the male voices. This problem arises mostly from the different voice characteristics of adult males and females: men have lower voice frequencies (from 85 to 180 Hz) than women (165 to 255 Hz). The results of the experiment showed that each man had a different recognition rate, caused by differences in tone, pronunciation, and speed of speech. Further work is needed to improve the accuracy of the Indonesian Automatic Speech Recognition system.
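
    The per-command, per-gender recognition rates discussed in this abstract can be tallied along the following lines. This is only a sketch: the function name `recognition_rates` is an assumption, and the trial list is made up for illustration and does not reproduce the paper's data.

```python
from collections import defaultdict


def recognition_rates(trials):
    """Map (command, gender) to the percentage of trials recognized correctly."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for command, gender, recognized in trials:
        key = (command, gender)
        total[key] += 1
        if recognized == command:
            correct[key] += 1
    return {key: 100.0 * correct[key] / total[key] for key in total}


# Made-up trials (spoken command, speaker gender, recognizer output).
# These are NOT the paper's measurements.
trials = [
    ("kanan", "female", "kanan"),
    ("kanan", "female", "kanan"),
    ("kanan", "male", "kanan"),
    ("kanan", "male", "atas"),
    ("pilih", "female", "pilih"),
    ("pilih", "male", "atas"),
]

rates = recognition_rates(trials)
for (command, gender), rate in sorted(rates.items()):
    print(f"{command:6s} {gender:6s} {rate:5.1f}%")
```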

  6. Automatic Speech Recognition from Neural Signals: A Focused Review

    Directory of Open Access Journals (Sweden)

    Christian Herff

    2016-09-01

    Speech interfaces have become widely accepted and are nowadays integrated in various real-life applications and devices. They have become a part of our daily life. However, speech interfaces presume the ability to produce intelligible speech, which might be impossible due to loud environments, the risk of bothering bystanders, or an inability to produce speech (i.e., patients suffering from locked-in syndrome). For these reasons it would be highly desirable not to speak but simply to imagine oneself saying words or sentences. Interfaces based on imagined speech would enable fast and natural communication without the need for audible speech and would give a voice to otherwise mute people. This focused review analyzes the potential of different brain imaging techniques to recognize speech from neural signals by applying Automatic Speech Recognition technology. We argue that modalities based on metabolic processes, such as functional Near Infrared Spectroscopy and functional Magnetic Resonance Imaging, are less suited for Automatic Speech Recognition from neural signals due to low temporal resolution but are very useful for the investigation of the underlying neural mechanisms involved in speech processes. In contrast, electrophysiologic activity is fast enough to capture speech processes and is therefore better suited for ASR. Our experimental results indicate the potential of these signals for speech recognition from neural data with a focus on invasively measured brain activity (electrocorticography). As a first example of Automatic Speech Recognition techniques applied to neural signals, we discuss the Brain-to-text system.

  7. Automatic Emotion Recognition in Speech: Possibilities and Significance

    Directory of Open Access Journals (Sweden)

    Milana Bojanić

    2009-12-01

    Automatic speech recognition and spoken language understanding are crucial steps towards natural human-machine interaction. The main task of the speech communication process is the recognition of the word sequence, but the recognition of prosody, emotion and stress tags may be of particular importance as well. This paper discusses the possibilities of recognizing emotion from the speech signal in order to improve ASR, and also provides an analysis of acoustic features that can be used for the detection of a speaker's emotion and stress. The paper also provides a short overview of emotion and stress classification techniques. The importance and place of emotional speech recognition is shown in the domain of human-computer interactive systems and the transaction communication model. Directions for future work are given at the end of the paper.

  8. Automatic Phonetic Transcription for Danish Speech Recognition

    DEFF Research Database (Denmark)

    Kirkedal, Andreas Søeborg

    , like Danish, the graphemic and phonetic representations are very dissimilar and more complex rewriting rules must be applied to create the correct phonetic representation. Automatic phonetic transcribers use different strategies, from deep analysis to shallow rewriting rules, to produce phonetic......, syllabication, stød and several other suprasegmental features (Kirkedal, 2013). Simplifying the transcriptions by filtering out the symbols for suprasegmental features in a post-processing step produces a format that is suitable for ASR purposes. eSpeak is an open source speech synthesizer originally created...... for particular words and word classes in addition. In comparison, English has 5,852 spelling-to-phoneme rules and 4,133 additional rules and 8,278 rules and 3,829 additional rules. Phonix applies deep morphological analysis as a preprocessing step. Should the analysis fail, several fallback strategies...

  9. Analysis of Phonetic Transcriptions for Danish Automatic Speech Recognition

    DEFF Research Database (Denmark)

    Kirkedal, Andreas Søeborg

    2013-01-01

    Automatic speech recognition (ASR) relies on three resources: audio, orthographic transcriptions and a pronunciation dictionary. The dictionary or lexicon maps orthographic words to sequences of phones or phonemes that represent the pronunciation of the corresponding word. The quality of a speech....... The analysis indicates that transcribing e.g. stress or vowel duration has a negative impact on performance. The best performance is obtained with coarse phonetic annotation and improves performance 1% word error rate and 3.8% sentence error rate....
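
    The two metrics named in this abstract, word error rate (WER) and sentence error rate (SER), are commonly computed from a word-level edit distance. The sketch below shows one conventional way to do so; the function names and the two example sentence pairs are assumptions for illustration and are unrelated to the Danish data used in the paper.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion of a reference word
                curr[j - 1] + 1,         # insertion of a hypothesis word
                prev[j - 1] + (r != h),  # match (cost 0) or substitution (cost 1)
            ))
        prev = curr
    return prev[-1]


def error_rates(pairs):
    """Return (WER, SER) in percent for (reference, hypothesis) sentence pairs."""
    word_errors = ref_words = wrong_sentences = 0
    for ref, hyp in pairs:
        r, h = ref.split(), hyp.split()
        word_errors += edit_distance(r, h)
        ref_words += len(r)
        wrong_sentences += r != h
    return 100.0 * word_errors / ref_words, 100.0 * wrong_sentences / len(pairs)


# Hypothetical reference/hypothesis pairs.
pairs = [
    ("the cat sat on the mat", "the cat sat on mat"),
    ("good morning", "good morning"),
]
wer, ser = error_rates(pairs)
print(f"WER = {wer:.1f}%, SER = {ser:.1f}%")
```

    WER is normalized by the number of reference words, so it can exceed 100% when the hypothesis contains many insertions.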

  10. Recent advances in Automatic Speech Recognition for Vietnamese

    OpenAIRE

    Le, Viet-Bac; Besacier, Laurent; Seng, Sopheap; Bigi, Brigitte; Do, Thi-Ngoc-Diep

    2008-01-01

    This paper presents our recent activities for automatic speech recognition for Vietnamese. First, our text data collection and processing methods and tools are described. For language modeling, we investigate word, sub-word and also hybrid word/sub-word models. For acoustic modeling, when only limited speech data are available for Vietnamese, we propose some crosslingual acoustic modeling techniques. Furthermore, since the use of sub-word units can reduce the high out-...

  11. Automatic speech recognition for radiological reporting

    International Nuclear Information System (INIS)

    Vidal, B.

    1991-01-01

    Large vocabulary speech recognition, its techniques and its software and hardware technology, are being developed, aimed at providing the office user with a tool that could significantly improve both the quantity and quality of his work: the dictation machine, which allows memos and documents to be input using voice and a microphone instead of fingers and a keyboard. The IBM Rome Science Center, together with the IBM Research Division, has built a prototype recognizer that accepts sentences in natural language from a 20,000-word Italian vocabulary. The unit runs on a personal computer equipped with special hardware capable of providing all the necessary computing power. The first laboratory experiments yielded very interesting results and pointed out system characteristics that make its use possible in operational environments. To this end, the dictation of medical reports was considered a suitable application. In cooperation with the 2nd Radiology Department of S. Maria della Misericordia Hospital (Udine, Italy), the system was tested by radiology department doctors during their everyday work. The doctors were able to dictate their reports directly to the unit. The text appeared immediately on the screen, and any errors could be corrected either by voice or by using the keyboard. At the end of report dictation, the doctors could both print and archive the text. The report could also be forwarded to the hospital information system, when the latter was available. Our results have been very encouraging: the system proved to be robust, simple to use, and accurate (over 95% average recognition rate). The experiment was valuable for the suggestions and comments it produced, and its results are useful for system evolution towards improved system management and efficiency.

  12. Experiments on Automatic Recognition of Nonnative Arabic Speech

    Directory of Open Access Journals (Sweden)

    Douglas O'Shaughnessy

    2008-05-01

    The automatic recognition of foreign-accented Arabic speech is a challenging task since it involves a large number of nonnative accents. In addition, the nonnative speech data available for training are generally insufficient. Moreover, as compared to other languages, the Arabic language has sparked a relatively small number of research efforts. In this paper, we are concerned with the problem of nonnative speech in a speaker-independent, large-vocabulary speech recognition system for modern standard Arabic (MSA). We analyze some major differences at the phonetic level in order to determine which phonemes have a significant part in the recognition performance for both native and nonnative speakers. Special attention is given to specific Arabic phonemes. The performance of an HMM-based Arabic speech recognition system is analyzed with respect to speaker gender and native origin. The WestPoint modern standard Arabic database from the Linguistic Data Consortium (LDC) and the Hidden Markov Model Toolkit (HTK) are used throughout all experiments. Our study shows that the best performance in the overall phoneme recognition is obtained when nonnative speakers are involved in both training and testing phases. This is not the case when a language model and phonetic lattice networks are incorporated in the system. At the phonetic level, the results show that female nonnative speakers perform better than nonnative male speakers, and that emphatic phonemes yield a significant decrease in performance when they are uttered by both male and female nonnative speakers.

  13. Experiments on Automatic Recognition of Nonnative Arabic Speech

    Directory of Open Access Journals (Sweden)

    Selouani Sid-Ahmed

    2008-01-01

    The automatic recognition of foreign-accented Arabic speech is a challenging task since it involves a large number of nonnative accents. In addition, the nonnative speech data available for training are generally insufficient. Moreover, as compared to other languages, the Arabic language has sparked a relatively small number of research efforts. In this paper, we are concerned with the problem of nonnative speech in a speaker-independent, large-vocabulary speech recognition system for modern standard Arabic (MSA). We analyze some major differences at the phonetic level in order to determine which phonemes have a significant part in the recognition performance for both native and nonnative speakers. Special attention is given to specific Arabic phonemes. The performance of an HMM-based Arabic speech recognition system is analyzed with respect to speaker gender and native origin. The WestPoint modern standard Arabic database from the Linguistic Data Consortium (LDC) and the Hidden Markov Model Toolkit (HTK) are used throughout all experiments. Our study shows that the best performance in the overall phoneme recognition is obtained when nonnative speakers are involved in both training and testing phases. This is not the case when a language model and phonetic lattice networks are incorporated in the system. At the phonetic level, the results show that female nonnative speakers perform better than nonnative male speakers, and that emphatic phonemes yield a significant decrease in performance when they are uttered by both male and female nonnative speakers.

  14. Automatic speech recognition for report generation in computed tomography

    International Nuclear Information System (INIS)

    Teichgraeber, U.K.M.; Ehrenstein, T.; Lemke, M.; Liebig, T.; Stobbe, H.; Hosten, N.; Keske, U.; Felix, R.

    1999-01-01

    Purpose: A study was performed to compare the performance of automatic speech recognition (ASR) with conventional transcription. Materials and Methods: 100 CT reports were generated by using ASR and 100 CT reports were dictated and written by medical transcriptionists. The time for dictation and correction of errors by the radiologist was assessed and the types of mistakes were analysed. The text recognition rate was calculated in both groups and the average time between completion of the imaging study by the technologist and generation of the written report was assessed. A commercially available speech recognition technology (ASKA Software, IBM Via Voice) running on a personal computer was used. Results: The time for the dictation using digital voice recognition was 9.4±2.3 min compared to 4.5±3.6 min with an ordinary Dictaphone. The text recognition rate was 97% with digital voice recognition and 99% with medical transcriptionists. The average time from imaging completion to written report finalisation was reduced from 47.3 hours with medical transcriptionists to 12.7 hours with ASR. The analysis of misspellings demonstrated (ASR vs. medical transcriptionists): 3 vs. 4 for syntax errors, 0 vs. 37 orthographic mistakes, 16 vs. 22 mistakes in substance and 47 vs. erroneously applied terms. Conclusions: The use of digital voice recognition as a replacement for medical transcription can be recommended when immediate availability of written reports is necessary. (orig.)

  15. Leveraging Automatic Speech Recognition Errors to Detect Challenging Speech Segments in TED Talks

    Science.gov (United States)

    Mirzaei, Maryam Sadat; Meshgi, Kourosh; Kawahara, Tatsuya

    2016-01-01

    This study investigates the use of Automatic Speech Recognition (ASR) systems to epitomize second language (L2) listeners' problems in perception of TED talks. ASR-generated transcripts of videos often involve recognition errors, which may indicate difficult segments for L2 listeners. This paper aims to discover the root-causes of the ASR errors…

  16. Automatic Speech Acquisition and Recognition for Spacesuit Audio Systems

    Science.gov (United States)

    Ye, Sherry

    2015-01-01

    NASA has a widely recognized but unmet need for novel human-machine interface technologies that can facilitate communication during astronaut extravehicular activities (EVAs), when loud noises and strong reverberations inside spacesuits make communication challenging. WeVoice, Inc., has developed a multichannel signal-processing method for speech acquisition in noisy and reverberant environments that enables automatic speech recognition (ASR) technology inside spacesuits. The technology reduces noise by exploiting differences between the statistical nature of signals (i.e., speech) and noise that exists in the spatial and temporal domains. As a result, ASR accuracy can be improved to the level at which crewmembers will find the speech interface useful. System components and features include beam forming/multichannel noise reduction, single-channel noise reduction, speech feature extraction, feature transformation and normalization, feature compression, and ASR decoding. Arithmetic complexity models were developed and will help designers of real-time ASR systems select proper tasks when confronted with constraints in computational resources. In Phase I of the project, WeVoice validated the technology. The company further refined the technology in Phase II and developed a prototype for testing and use by suited astronauts.

  17. Automatic speech recognition (ASR) based approach for speech therapy of aphasic patients: A review

    Science.gov (United States)

    Jamal, Norezmi; Shanta, Shahnoor; Mahmud, Farhanahani; Sha'abani, MNAH

    2017-09-01

    This paper reviews the state of the art of automatic speech recognition (ASR) based approaches for speech therapy of aphasic patients. Aphasia is a condition in which the affected person suffers from a speech and language disorder resulting from a stroke or brain injury. Since there is a growing body of evidence indicating the possibility of improving the symptoms at an early stage, ASR based solutions are increasingly being researched for speech and language therapy. ASR is a technology that converts human speech into transcript text by matching it against the system's library. This is particularly useful in speech rehabilitation therapies, as it provides accurate, real-time evaluation of speech input from an individual with a speech disorder. ASR based approaches for speech therapy recognize the speech input from the aphasic patient and provide real-time feedback on their mistakes. However, the accuracy of ASR depends on many factors, such as phoneme recognition, speech continuity, speaker and environmental differences, as well as our depth of knowledge of human language understanding. Hence, the review examines recent developments in ASR technologies and their performance for individuals with speech and language disorders.

  18. Annotation of Heterogeneous Multimedia Content Using Automatic Speech Recognition

    NARCIS (Netherlands)

    Huijbregts, M.A.H.; Ordelman, Roeland J.F.; de Jong, Franciska M.G.

    2007-01-01

    This paper reports on the setup and evaluation of robust speech recognition system parts, geared towards transcript generation for heterogeneous, real-life media collections. The system is deployed for generating speech transcripts for the NIST/TRECVID-2007 test collection, part of a Dutch real-life

  19. Automatic Speech Recognition Systems for the Evaluation of Voice and Speech Disorders in Head and Neck Cancer

    OpenAIRE

    Andreas Maier; Tino Haderlein; Florian Stelzle; Elmar Nöth; Emeka Nkenke; Frank Rosanowski; Anne Schützenberger; Maria Schuster

    2010-01-01

    In patients suffering from head and neck cancer, speech intelligibility is often restricted. For assessment and outcome measurements, automatic speech recognition systems have previously been shown to be appropriate for objective and quick evaluation of intelligibility. In this study we investigate the applicability of the method to speech disorders caused by head and neck cancer. Intelligibility was quantified by speech recognition on recordings of a standard text read by 41 German laryngect...

  20. The influence of age, hearing, and working memory on the speech comprehension benefit derived from an automatic speech recognition system

    NARCIS (Netherlands)

    Zekveld, A.A.; Kramer, S.E.; Kessens, J.M.; Vlaming, M.S.M.G.; Houtgast, T.

    2009-01-01

    Objective: The aim of the current study was to examine whether partly incorrect subtitles that are automatically generated by an Automatic Speech Recognition (ASR) system, improve speech comprehension by listeners with hearing impairment. In an earlier study (Zekveld et al. 2008), we showed that

  1. Automatic Speech Recognition Systems for the Evaluation of Voice and Speech Disorders in Head and Neck Cancer

    Directory of Open Access Journals (Sweden)

    Andreas Maier

    2010-01-01

    In patients suffering from head and neck cancer, speech intelligibility is often restricted. For assessment and outcome measurements, automatic speech recognition systems have previously been shown to be appropriate for objective and quick evaluation of intelligibility. In this study we investigate the applicability of the method to speech disorders caused by head and neck cancer. Intelligibility was quantified by speech recognition on recordings of a standard text read by 41 German laryngectomized patients with cancer of the larynx or hypopharynx and 49 German patients who had suffered from oral cancer. The speech recognition provides the percentage of correctly recognized words of a sequence, that is, the word recognition rate. Automatic evaluation was compared to perceptual ratings by a panel of experts and to an age-matched control group. Both patient groups showed significantly lower word recognition rates than the control group. Automatic speech recognition yielded word recognition rates which complied with experts' evaluation of intelligibility on a significant level. Automatic speech recognition serves as a good means with low effort to objectify and quantify the most important aspect of pathologic speech: its intelligibility. The system was successfully applied to voice and speech disorders.

  2. Speech Acquisition and Automatic Speech Recognition for Integrated Spacesuit Audio Systems

    Science.gov (United States)

    Huang, Yiteng; Chen, Jingdong; Chen, Shaoyan

    2010-01-01

    A voice-command human-machine interface system has been developed for spacesuit extravehicular activity (EVA) missions. A multichannel acoustic signal processing method has been created for distant speech acquisition in noisy and reverberant environments. This technology reduces noise by exploiting differences in the statistical nature of signal (i.e., speech) and noise that exist in the spatial and temporal domains. As a result, the automatic speech recognition (ASR) accuracy can be improved to the level at which crewmembers would find the speech interface useful. The developed speech human/machine interface will enable both crewmember usability and operational efficiency. It offers a fast rate of data/text entry and a small overall size, and it can be lightweight. In addition, this design will free the hands and eyes of a suited crewmember. The system components and steps include beam forming/multi-channel noise reduction, single-channel noise reduction, speech feature extraction, feature transformation and normalization, feature compression, model adaptation, ASR HMM (Hidden Markov Model) training, and ASR decoding. A state-of-the-art phoneme recognizer can obtain an accuracy rate of 65 percent when the training and testing data are free of noise. When it is used in spacesuits, the rate drops to about 33 percent. With the developed microphone array speech-processing technologies, the performance is improved and the phoneme recognition accuracy rate rises to 44 percent. The recognizer can be further improved by combining the microphone array and HMM model adaptation techniques and using speech samples collected from inside spacesuits. In addition, arithmetic complexity models for the major HMM-based ASR components were developed. They can help real-time ASR system designers select proper tasks when faced with constraints in computational resources.
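
    To give a feel for the beam forming step listed among the system components, the sketch below implements delay-and-sum beamforming, the simplest textbook multichannel technique. It is a stand-in for illustration and not the method developed in this project: the integer sample delays, the noise level, and the sine-wave "speech" signal are all hypothetical.

```python
import math
import random


def delay_and_sum(channels, delays):
    """Average the channels after advancing each one by its integer sample delay."""
    length = min(len(ch) - d for ch, d in zip(channels, delays))
    return [
        sum(ch[d + n] for ch, d in zip(channels, delays)) / len(channels)
        for n in range(length)
    ]


def mse(a, b):
    """Mean squared error over the common length of two sequences."""
    n = min(len(a), len(b))
    return sum((x - y) ** 2 for x, y in zip(a, b)) / n


rng = random.Random(0)  # fixed seed so the example is reproducible
clean = [math.sin(2 * math.pi * 0.01 * n) for n in range(2000)]  # stand-in for speech
delays = [0, 3, 5, 8]  # hypothetical arrival delays (in samples) at four microphones

# Each microphone hears the delayed clean signal plus independent Gaussian noise.
channels = [
    [0.0] * d + [x + rng.gauss(0.0, 0.5) for x in clean]
    for d in delays
]

enhanced = delay_and_sum(channels, delays)
print("single-channel MSE:", mse(channels[0], clean))
print("beamformed MSE:    ", mse(enhanced, clean))
```

    Because the noise is independent across microphones while the aligned speech is identical, averaging four channels lowers the noise power by roughly a factor of four in this idealized setting.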

  3. Preliminary Analysis of Automatic Speech Recognition and Synthesis Technology.

    Science.gov (United States)

    1983-05-01

    ...speech. Private industry, which sees a major market for improved speech recognition systems, is attempting to solve the problems involved in...manufacturer is able to market such a recognition system. A second requirement for the spotting of keywords in distress signals concerns the need for a

  4. Man-system interface based on automatic speech recognition: integration to a virtual control desk

    Energy Technology Data Exchange (ETDEWEB)

    Jorge, Carlos Alexandre F.; Mol, Antonio Carlos A.; Pereira, Claudio M.N.A.; Aghina, Mauricio Alves C., E-mail: calexandre@ien.gov.b, E-mail: mol@ien.gov.b, E-mail: cmnap@ien.gov.b, E-mail: mag@ien.gov.b [Instituto de Engenharia Nuclear (IEN/CNEN-RJ), Rio de Janeiro, RJ (Brazil); Nomiya, Diogo V., E-mail: diogonomiya@gmail.co [Universidade Federal do Rio de Janeiro (UFRJ), RJ (Brazil)

    2009-07-01

    This work reports the implementation of a man-system interface based on automatic speech recognition, and its integration into a virtual nuclear power plant control desk. The latter aims to reproduce a real control desk using virtual reality technology, for operator training and ergonomic evaluation purposes. An automatic speech recognition system was developed to serve as a new interface with users, replacing the computer keyboard and mouse. Users can operate this virtual control desk in front of a computer monitor or a projection screen through spoken commands. The automatic speech recognition interface developed is based on a well-known signal processing technique named cepstral analysis, and on artificial neural networks. The speech recognition interface is described, along with its integration with the virtual control desk, and results are presented. (author)
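
    Cepstral analysis, the signal processing technique mentioned in this abstract, can be sketched as the inverse Fourier transform of the log magnitude spectrum. The code below is a minimal standard-library illustration and not the authors' implementation: the helper names `dft` and `real_cepstrum`, the 64-sample frame, and the synthetic echo are assumptions, and a real system would use an FFT library and windowed speech frames.

```python
import cmath
import math


def dft(x, inverse=False):
    """Naive O(N^2) discrete Fourier transform (forward or inverse)."""
    n = len(x)
    sign = 1 if inverse else -1
    out = [
        sum(x[k] * cmath.exp(sign * 2j * math.pi * k * m / n) for k in range(n))
        for m in range(n)
    ]
    return [v / n for v in out] if inverse else out


def real_cepstrum(frame, floor=1e-12):
    """Real cepstrum: inverse DFT of the log magnitude spectrum of the frame."""
    # The floor guards against log(0) for spectral bins with zero magnitude.
    log_mag = [math.log(max(abs(v), floor)) for v in dft(frame)]
    return [v.real for v in dft(log_mag, inverse=True)]


# Hypothetical test frame: an impulse plus a half-amplitude echo 8 samples later.
frame = [0.0] * 64
frame[0] = 1.0
frame[8] = 0.5

cepstrum = real_cepstrum(frame)
peak = max(range(1, 33), key=lambda n: cepstrum[n])
print("strongest quefrency:", peak, "value:", round(cepstrum[peak], 3))
```

    The echo shows up as a peak at quefrency 8, which illustrates why the cepstrum is useful for separating periodic structure (such as pitch) from the spectral envelope.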

  5. Man-system interface based on automatic speech recognition: integration to a virtual control desk

    International Nuclear Information System (INIS)

    Jorge, Carlos Alexandre F.; Mol, Antonio Carlos A.; Pereira, Claudio M.N.A.; Aghina, Mauricio Alves C.; Nomiya, Diogo V.

    2009-01-01

    This work reports the implementation of a man-system interface based on automatic speech recognition, and its integration into a virtual nuclear power plant control desk. The latter aims to reproduce a real control desk using virtual reality technology, for operator training and ergonomic evaluation purposes. An automatic speech recognition system was developed to serve as a new interface with users, replacing the computer keyboard and mouse. Users can operate this virtual control desk in front of a computer monitor or a projection screen through spoken commands. The automatic speech recognition interface developed is based on a well-known signal processing technique named cepstral analysis, and on artificial neural networks. The speech recognition interface is described, along with its integration with the virtual control desk, and results are presented. (author)

  6. AUTOMATIC SPEECH RECOGNITION SYSTEM CONCERNING THE MOROCCAN DIALECTE (Darija and Tamazight)

    OpenAIRE

    A. EL GHAZI; C. DAOUI; N. IDRISSI

    2012-01-01

    In this work we present an automatic speech recognition system for Moroccan dialects, mainly Darija (an Arabic dialect) and Tamazight. Many approaches have been used to model the Arabic and Tamazight phonetic units. In this paper, we propose to use the hidden Markov model (HMM) for modeling these phonetic units. Experimental results show that the proposed approach further improves the recognition.

  7. Fusing Eye-gaze and Speech Recognition for Tracking in an Automatic Reading Tutor

    DEFF Research Database (Denmark)

    Rasmussen, Morten Højfeldt; Tan, Zheng-Hua

    2013-01-01

    In this paper we present a novel approach for automatically tracking the reading progress using a combination of eye-gaze tracking and speech recognition. The two are fused by first generating word probabilities based on eye-gaze information and then using these probabilities to augment the langu...

  8. Practising verbal maritime communication with computer dialogue systems using automatic speech recognition (My Practice session)

    OpenAIRE

    John, Peter; Wellmann, J.; Appell, J.E.

    2016-01-01

    This My Practice session presents a novel online tool for practising verbal communication in a maritime setting. It is based on low-fi ChatBot simulation exercises which employ computer-based dialogue systems. The ChatBot exercises are equipped with an automatic speech recognition engine specifically designed for maritime communication. The speech input and output functionality enables learners to communicate with the computer freely and spontaneously. The exercises replicate real communicati...

  9. Automatic speech recognition used for evaluation of text-to-speech systems

    Czech Academy of Sciences Publication Activity Database

    Vích, Robert; Nouza, J.; Vondra, Martin

    No. 5042 (2008), pp. 136-148. ISSN 0302-9743. R&D Projects: GA AV ČR 1ET301710509; GA AV ČR 1QS108040569. Institutional research plan: CEZ:AV0Z20670512. Keywords: speech recognition; speech processing. Subject RIV: JA - Electronics; Optoelectronics, Electrical Engineering

  10. The influence of age, hearing, and working memory on the speech comprehension benefit derived from an automatic speech recognition system.

    Science.gov (United States)

    Zekveld, Adriana A; Kramer, Sophia E; Kessens, Judith M; Vlaming, Marcel S M G; Houtgast, Tammo

    2009-04-01

    The aim of the current study was to examine whether partly incorrect subtitles that are automatically generated by an Automatic Speech Recognition (ASR) system, improve speech comprehension by listeners with hearing impairment. In an earlier study (Zekveld et al. 2008), we showed that speech comprehension in noise by young listeners with normal hearing improves when presenting partly incorrect, automatically generated subtitles. The current study focused on the effects of age, hearing loss, visual working memory capacity, and linguistic skills on the benefit obtained from automatically generated subtitles during listening to speech in noise. In order to investigate the effects of age and hearing loss, three groups of participants were included: 22 young persons with normal hearing (YNH, mean age = 21 years), 22 middle-aged adults with normal hearing (MA-NH, mean age = 55 years) and 30 middle-aged adults with hearing impairment (MA-HI, mean age = 57 years). The benefit from automatic subtitling was measured by Speech Reception Threshold (SRT) tests (Plomp & Mimpen, 1979). Both unimodal auditory and bimodal audiovisual SRT tests were performed. In the audiovisual tests, the subtitles were presented simultaneously with the speech, whereas in the auditory test, only speech was presented. The difference between the auditory and audiovisual SRT was defined as the audiovisual benefit. Participants additionally rated the listening effort. We examined the influences of ASR accuracy level and text delay on the audiovisual benefit and the listening effort using a repeated measures General Linear Model analysis. In a correlation analysis, we evaluated the relationships between age, auditory SRT, visual working memory capacity and the audiovisual benefit and listening effort. The automatically generated subtitles improved speech comprehension in noise for all ASR accuracies and delays covered by the current study. Higher ASR accuracy levels resulted in more benefit obtained

  11. Sparse coding of the modulation spectrum for noise-robust automatic speech recognition

    NARCIS (Netherlands)

    Ahmadi, S.; Ahadi, S.M.; Cranen, B.; Boves, L.W.J.

    2014-01-01

    The full modulation spectrum is a high-dimensional representation of one-dimensional audio signals. Most previous research in automatic speech recognition converted this very rich representation into the equivalent of a sequence of short-time power spectra, mainly to simplify the computation of the

  12. Evaluation of missing data techniques for in-car automatic speech recognition

    OpenAIRE

    Wang, Y.; Vuerinckx, R.; Gemmeke, J.F.; Cranen, B.; Hamme, H. Van

    2009-01-01

    Wang Y., Vuerinckx R., Gemmeke J., Cranen B., Van hamme H., ''Evaluation of missing data techniques for in-car automatic speech recognition'', Proceedings NAG/DAGA 2009 - international conference on acoustics, 4 pp., March 23-26, 2009, Rotterdam, The Netherlands.

  13. Effect of speech-intrinsic variations on human and automatic recognition of spoken phonemes.

    Science.gov (United States)

    Meyer, Bernd T; Brand, Thomas; Kollmeier, Birger

    2011-01-01

    The aim of this study is to quantify the gap between the recognition performance of human listeners and an automatic speech recognition (ASR) system with special focus on intrinsic variations of speech, such as speaking rate and effort, altered pitch, and the presence of dialect and accent. Second, it is investigated if the most common ASR features contain all information required to recognize speech in noisy environments by using resynthesized ASR features in listening experiments. For the phoneme recognition task, the ASR system achieved the human performance level only when the signal-to-noise ratio (SNR) was increased by 15 dB, which is an estimate for the human-machine gap in terms of the SNR. The major part of this gap is attributed to the feature extraction stage, since human listeners achieve comparable recognition scores when the SNR difference between unaltered and resynthesized utterances is 10 dB. Intrinsic variabilities result in strong increases of error rates, both in human speech recognition (HSR) and ASR (with a relative increase of up to 120%). An analysis of phoneme duration and recognition rates indicates that human listeners are better able to identify temporal cues than the machine at low SNRs, which suggests incorporating information about the temporal dynamics of speech into ASR systems.

  14. Automatic speech recognition (zero crossing method). Automatic recognition of isolated vowels

    International Nuclear Information System (INIS)

    Dupeyrat, Benoit

    1975-01-01

    This note describes a method for recognizing isolated vowels that relies on preprocessing of the vocal signal. The preprocessing extracts the extrema of the vocal signal and the time intervals separating them (the distances between zero crossings of the first derivative of the signal). Vowel recognition uses normalized histograms of these interval values. The program computes a distance between the histogram of the sound to be recognized and model histograms built during a learning phase. The results, processed in real time by a minicomputer, are relatively independent of the speaker, provided that the fundamental frequency does not vary too much (i.e. speakers of the same sex). (author)
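
    The interval-histogram idea in this abstract can be sketched as follows. This is a minimal illustration, not the author's program: the function names, the bin count, the cumulative histogram distance and the synthetic test tones are all assumptions, since the abstract does not specify them.

```python
import math

def extrema_intervals(signal):
    """Distances (in samples) between successive local extrema, i.e. between
    sign changes of the first difference of the signal."""
    intervals = []
    last_extremum = None
    prev_diff = 0.0
    for i in range(1, len(signal)):
        diff = signal[i] - signal[i - 1]
        if diff == 0.0:
            continue  # flat step: wait for the next non-zero slope
        if prev_diff != 0.0 and (diff > 0) != (prev_diff > 0):
            # Slope changed sign, so sample i-1 is (approximately) an extremum.
            if last_extremum is not None:
                intervals.append((i - 1) - last_extremum)
            last_extremum = i - 1
        prev_diff = diff
    return intervals

def normalized_histogram(intervals, n_bins=32):
    """Histogram of interval lengths, normalized to sum to 1 (n_bins is assumed)."""
    hist = [0.0] * n_bins
    for d in intervals:
        hist[min(d, n_bins - 1)] += 1.0
    total = sum(hist)
    return [h / total for h in hist] if total else hist

def histogram_distance(h1, h2):
    """Distance between cumulative histograms (an assumed metric; the abstract
    only says that 'a distance' is computed)."""
    distance = 0.0
    cumulative = 0.0
    for a, b in zip(h1, h2):
        cumulative += a - b
        distance += abs(cumulative)
    return distance

def classify(signal, models):
    """Return the label of the model histogram closest to the signal's histogram."""
    hist = normalized_histogram(extrema_intervals(signal))
    return min(models, key=lambda label: histogram_distance(hist, models[label]))

def synthetic_tone(freq_hz, rate_hz=8000, n=800):
    """Hypothetical stand-in for a recorded vowel: a pure sine tone."""
    return [math.sin(2 * math.pi * freq_hz * t / rate_hz) for t in range(n)]

# "Learning phase": build one model histogram per class from example sounds.
models = {
    "low": normalized_histogram(extrema_intervals(synthetic_tone(200))),
    "high": normalized_histogram(extrema_intervals(synthetic_tone(800))),
}
print(classify(synthetic_tone(220), models))
```

    Real vowels have several formants, so their interval histograms are multi-modal; the pure tones here only show the mechanics of extraction, normalization and nearest-model matching.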

  15. An HMM-Like Dynamic Time Warping Scheme for Automatic Speech Recognition

    Directory of Open Access Journals (Sweden)

    Ing-Jr Ding

    2014-01-01

    In the past, the kernel of automatic speech recognition (ASR) was dynamic time warping (DTW), a feature-based template matching technique that belongs to the category of dynamic programming (DP). Although DTW is an early ASR technique, it has remained popular in many applications and now plays an important role in the well-known Kinect-based gesture recognition application. This paper proposes an intelligent speech recognition system using an improved DTW approach for multimedia and home automation services. The improved DTW presented in this work, called HMM-like DTW, is essentially a hidden Markov model (HMM)-like method in which the concept of the typical HMM statistical model is brought into the design of DTW. The HMM-like DTW method transforms feature-based DTW recognition into model-based DTW recognition, so it can behave like the HMM recognition technique; with its HMM-like recognition model, it can therefore also perform model adaptation (also known as speaker adaptation). A series of experimental results in home automation-based multimedia access service environments demonstrated the superiority and effectiveness of the developed smart speech recognition system based on HMM-like DTW.
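
    For readers unfamiliar with the baseline technique, classic feature-based DTW template matching can be sketched as below. This is the textbook algorithm, not the HMM-like DTW proposed in the paper; the scalar features, the word labels and the template values are hypothetical.

```python
def dtw_distance(a, b):
    """Classic dynamic time warping cost between two feature sequences.
    Features are scalars here for brevity; real systems compare feature vectors."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = minimal accumulated cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(cost[i - 1][j],      # a advances, b repeats
                                     cost[i][j - 1],      # b advances, a repeats
                                     cost[i - 1][j - 1])  # both advance
    return cost[n][m]

def recognize(utterance, templates):
    """Return the label of the stored template with the smallest DTW cost."""
    return min(templates, key=lambda word: dtw_distance(utterance, templates[word]))

# Hypothetical one-dimensional templates for two command words.
templates = {"on": [1, 3, 4, 3, 1], "off": [4, 2, 1, 2, 4]}
# A slower rendition of "on": some frames are repeated.
utterance = [1, 1, 3, 4, 4, 3, 1]
print(recognize(utterance, templates))
```

    Because the warping path may repeat frames, the stretched utterance aligns with the "on" template at zero cost, which is the time-normalization property that makes DTW useful for template matching.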

  16. Multilingual Techniques for Low Resource Automatic Speech Recognition

    Science.gov (United States)

    2016-05-20

    linguistic and ASR expertise, and Regina, bringing in another point of view from the NLP side, really help shape the direction of some of the work in...improved keyword spotting. In Proc. ASRU, 2013. 41 [43] K. Kirchhoff and D. Vergyri. Cross-dialectal data sharing for acoustic modeling in Arabic speech

  17. Integrating Automatic Speech Recognition and Machine Translation for Better Translation Outputs

    DEFF Research Database (Denmark)

    Liyanapathirana, Jeevanthi

    translations, combining machine translation with computer assisted translation has drawn attention in current research. This combines two prospects: the opportunity of ensuring high quality translation along with a significant performance gain. Automatic Speech Recognition (ASR) is another important area......, which caters important functionalities in language processing and natural language understanding tasks. In this work we integrate automatic speech recognition and machine translation in parallel. We aim to avoid manual typing of possible translations as dictating the translation would take less time...... to the n-best list rescoring, we also use word graphs with the expectation of arriving at a tighter integration of ASR and MT models. Integration methods include constraining ASR models using language and translation models of MT, and vice versa. We currently develop and experiment different methods...

  18. Developing a broadband automatic speech recognition system for Afrikaans

    CSIR Research Space (South Africa)

    De Wet, Febe

    2011-08-01

    baseline transcription for the news data. The match between a baseline transcription and its corresponding audio can be evaluated automatically using an ASR system in forced alignment mode. Only those bulletins for which a bad match is indicated... Component Index for data [3]. occurrence of Afrikaans words. Other text corpora that are currently under construction include daily downloads of the scripts of news bulletins that are read on an Afrikaans radio station as well as transcripts of par...

  19. Language modeling for automatic speech recognition of inflective languages an applications-oriented approach using lexical data

    CERN Document Server

    Donaj, Gregor

    2017-01-01

    This book covers language modeling and automatic speech recognition for inflective languages (e.g. Slavic languages), which represent roughly half of the languages spoken in Europe. These languages do not perform as well as English in speech recognition systems and it is therefore harder to develop an application with sufficient quality for the end user. The authors describe the most important language features for the development of a speech recognition system. This is then presented through the analysis of errors in the system and the development of language models and their inclusion in speech recognition systems, which specifically address the errors that are relevant for targeted applications. The error analysis is done with regard to morphological characteristics of the word in the recognized sentences. The book is oriented towards speech recognition with large vocabularies and continuous and even spontaneous speech. Today such applications work with a rather small number of languages compared to the nu...

  20. Speech Recognition

    Directory of Open Access Journals (Sweden)

    Adrian Morariu

    2009-01-01

    This paper presents a method of speech recognition by pattern recognition techniques. Learning consists in determining the unique characteristics of a word (cepstral coefficients) by eliminating those characteristics that are different from one word to another. For learning and recognition, the system builds a dictionary of words by determining the characteristics of each word to be used in the recognition. Determining the characteristics of an audio signal consists of the following steps: noise removal, sampling, applying a Hamming window, switching to the frequency domain through the Fourier transform, calculating the magnitude spectrum, filtering the data, and determining the cepstral coefficients.
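
    The core of the feature pipeline listed above (Hamming window, Fourier transform, magnitude spectrum, cepstral coefficients) can be sketched as follows. This is a minimal real-cepstrum illustration under stated assumptions: it omits the noise removal and filtering steps, uses a naive DFT to stay dependency-free, and the frame length, coefficient count and test signal are hypothetical rather than taken from the paper.

```python
import cmath
import math

def hamming(n):
    """Hamming window of length n."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * k / (n - 1)) for k in range(n)]

def dft(x):
    """Naive O(n^2) discrete Fourier transform (fine for one short frame)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

def real_cepstrum(frame, n_coeffs=12):
    """Window -> DFT -> magnitude -> log -> inverse DFT, keeping the first
    n_coeffs coefficients (n_coeffs is an assumed value)."""
    n = len(frame)
    windowed = [s * w for s, w in zip(frame, hamming(n))]
    magnitude = [abs(c) for c in dft(windowed)]
    log_magnitude = [math.log(m + 1e-10) for m in magnitude]  # avoid log(0)
    # The log magnitude of a real signal is real and symmetric, so its inverse
    # DFT reduces to a cosine sum and the cepstrum is real.
    cepstrum = [sum(log_magnitude[k] * math.cos(2 * math.pi * k * q / n)
                    for k in range(n)) / n
                for q in range(n)]
    return cepstrum[:n_coeffs]

# Hypothetical 64-sample frame: two sinusoids standing in for a voiced sound.
frame = [math.sin(2 * math.pi * 5.3 * t / 64) + 0.5 * math.sin(2 * math.pi * 12.7 * t / 64)
         for t in range(64)]
features = real_cepstrum(frame)
print(len(features))
```

    A dictionary-based recognizer in the spirit of the abstract would store one such feature vector (or a sequence of them, one per frame) for each word and compare new input against the stored entries.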

  1. Contribution to automatic speech recognition. Analysis of the direct acoustical signal. Recognition of isolated words and phoneme identification

    International Nuclear Information System (INIS)

    Dupeyrat, Benoit

    1981-01-01

    This report deals with the acoustic-phonetic step of automatic speech recognition. The parameters used are the extrema of the acoustic signal (coded in amplitude and duration). This coding method, the properties of which are described, is simple and well suited to digital processing. The quality and intelligibility of the coded signal after reconstruction are particularly satisfactory. An experiment on the automatic recognition of isolated words was carried out using this coding system. We designed a filtering algorithm operating on the coding parameters, so that the characteristics of the formants can be derived under certain conditions, which are discussed. Using these characteristics, a large part of the phonemes for a given speaker could be identified. Pursuing these studies required the development of a specific real-time processing methodology that allowed immediate evaluation of improvements to the programs. Such processing of the temporal coding of the acoustic signal is extremely powerful and, used in conjunction with other methods, could be an efficient tool for automatic speech processing. (author)

  2. Speaker diarization and speech recognition in the semi-automatization of audio description: An exploratory study on future possibilities?

    Directory of Open Access Journals (Sweden)

    Héctor Delgado

    2015-12-01

    This article presents an overview of the technological components used in the process of audio description, and suggests a new scenario in which speech recognition, machine translation, and text-to-speech, with the corresponding human revision, could be used to increase audio description provision. The article focuses on a process in which both speaker diarization and speech recognition are used in order to obtain a semi-automatic transcription of the audio description track. The technical process is presented and experimental results are summarized.

  3. Speaker diarization and speech recognition in the semi-automatization of audio description: An exploratory study on future possibilities?

    Directory of Open Access Journals (Sweden)

    Héctor Delgado

    2015-06-01

    This article presents an overview of the technological components used in the process of audio description, and suggests a new scenario in which speech recognition, machine translation, and text-to-speech, with the corresponding human revision, could be used to increase audio description provision. The article focuses on a process in which both speaker diarization and speech recognition are used in order to obtain a semi-automatic transcription of the audio description track. The technical process is presented and experimental results are summarized.

  4. Towards an Intelligent Acoustic Front End for Automatic Speech Recognition: Built-in Speaker Normalization

    Directory of Open Access Journals (Sweden)

    Umit H. Yapanel

    2008-08-01

    A proven method for achieving effective automatic speech recognition (ASR) despite speaker differences is to perform acoustic feature speaker normalization. More effective speaker normalization methods are needed that require limited computing resources for real-time performance. The most popular speaker normalization technique is vocal-tract length normalization (VTLN), despite the fact that it is computationally expensive. In this study, we propose a novel online VTLN algorithm entitled built-in speaker normalization (BISN), where normalization is performed on-the-fly within a newly proposed PMVDR acoustic front end. The novel aspect of the algorithm is that conventional front-end processing with PMVDR and VTLN needs two separate warping phases, while the proposed BISN method uses only a single speaker-dependent warp to achieve both the PMVDR perceptual warp and the VTLN warp simultaneously. This improved integration unifies the nonlinear warping performed in the front end and reduces computational requirements, thereby offering advantages for real-time ASR systems. Evaluations are performed for (i) an in-car extended digit recognition task, where an on-the-fly BISN implementation reduces the relative word error rate (WER) by 24%, and (ii) a diverse noisy speech task (SPINE 2), where the relative WER improvement was 9%, both relative to the baseline speaker normalization method.

  5. Towards an Intelligent Acoustic Front End for Automatic Speech Recognition: Built-in Speaker Normalization

    Directory of Open Access Journals (Sweden)

    Yapanel Umit H.

    2008-01-01

    A proven method for achieving effective automatic speech recognition (ASR) despite speaker differences is to perform acoustic feature speaker normalization. More effective speaker normalization methods are needed that require limited computing resources for real-time performance. The most popular speaker normalization technique is vocal-tract length normalization (VTLN), despite the fact that it is computationally expensive. In this study, we propose a novel online VTLN algorithm entitled built-in speaker normalization (BISN), where normalization is performed on-the-fly within a newly proposed PMVDR acoustic front end. The novel aspect of the algorithm is that conventional front-end processing with PMVDR and VTLN needs two separate warping phases, while the proposed BISN method uses only a single speaker-dependent warp to achieve both the PMVDR perceptual warp and the VTLN warp simultaneously. This improved integration unifies the nonlinear warping performed in the front end and reduces computational requirements, thereby offering advantages for real-time ASR systems. Evaluations are performed for (i) an in-car extended digit recognition task, where an on-the-fly BISN implementation reduces the relative word error rate (WER) by 24%, and (ii) a diverse noisy speech task (SPINE 2), where the relative WER improvement was 9%, both relative to the baseline speaker normalization method.

  6. An exploration of the potential of Automatic Speech Recognition to assist and enable receptive communication in higher education

    Directory of Open Access Journals (Sweden)

    Mike Wald

    2006-12-01

    The potential use of Automatic Speech Recognition to assist receptive communication is explored. The opportunities and challenges that this technology presents to students and staff are also discussed and evaluated: providing captioning of speech online or in classrooms for deaf or hard of hearing students, and helping blind, visually impaired or dyslexic learners to read and search learning material more readily by augmenting synthetic speech with natural recorded real speech. The automatic provision of online lecture notes, synchronised with speech, enables staff and students to focus on learning and teaching issues, while also benefiting learners who are unable to attend the lecture or who find it difficult or impossible to take notes while listening, watching and thinking.

  7. Speech Recognition on Mobile Devices

    DEFF Research Database (Denmark)

    Tan, Zheng-Hua; Lindberg, Børge

    2010-01-01

    The enthusiasm of deploying automatic speech recognition (ASR) on mobile devices is driven both by remarkable advances in ASR technology and by the demand for efficient user interfaces on such devices as mobile phones and personal digital assistants (PDAs). This chapter presents an overview of ASR in the mobile context covering motivations, challenges, fundamental techniques and applications. Three ASR architectures are introduced: embedded speech recognition, distributed speech recognition and network speech recognition. Their pros and cons and implementation issues are discussed. Applications within...

  8. Robust Automatic Speech Recognition Features using Complex Wavelet Packet Transform Coefficients

    Directory of Open Access Journals (Sweden)

    TjongWan Sen

    2009-11-01

    To improve the performance of phoneme-based Automatic Speech Recognition (ASR) in noisy environments, we developed a new technique that adds robustness to clean phoneme features. These robust features are obtained from Complex Wavelet Packet Transform (CWPT) coefficients. Since the CWPT coefficients represent all the different frequency bands of the input signal, decomposing the input signal into a complete CWPT tree also covers all frequencies involved in the recognition process. For time-overlapping signals with different frequency contents, e.g. a phoneme signal with noise, the CWPT coefficients are the combination of the CWPT coefficients of the phoneme signal and the CWPT coefficients of the noise. The CWPT coefficients of the phoneme signal are changed according to the frequency components contained in the noise. Since the number of phonemes in every language is relatively small (limited) and already well known, one can easily derive principal component vectors from a clean training dataset using Principal Component Analysis (PCA). These principal component vectors can then be used to add robustness and minimize noise effects in the testing phase. Simulation results, using Alpha Numeric 4 (AN4) from Carnegie Mellon University and NOISEX-92 examples from Rice University, showed that this new technique can be used as a feature extractor that improves the robustness of phoneme-based ASR systems in various adverse noisy conditions while preserving performance in clean environments.

  9. The Usefulness of Automatic Speech Recognition (ASR) Eyespeak Software in Improving Iraqi EFL Students’ Pronunciation

    Directory of Open Access Journals (Sweden)

    Lina Fathi Sidig Sidgi

    2017-02-01

    The present study focuses on determining whether automatic speech recognition (ASR) technology is reliable for improving the English pronunciation of Iraqi EFL students. Non-native learners of English are generally concerned about improving their pronunciation skills, and Iraqi students face difficulties in pronouncing English sounds that are not found in their native language (Arabic). This study is concerned with ASR and its effectiveness in overcoming this difficulty. The data were obtained from twenty participants randomly selected from first-year college students in the Department of English at Al-Turath University College in Baghdad, Iraq. The students had participated in a two-month pronunciation instruction course using ASR Eyespeak software. At the end of the course, the students completed a questionnaire about the usefulness of ASR Eyespeak in improving their pronunciation. The findings of the study revealed that the students found ASR Eyespeak software very useful in improving their pronunciation and helping them realise their pronunciation mistakes. They also reported that learning pronunciation with ASR Eyespeak was enjoyable.

  10. Speech recognition from spectral dynamics

    Indian Academy of Sciences (India)

    Some of the history of gradual infusion of the modulation spectrum concept into automatic recognition of speech (ASR) comes next, pointing to the relationship of modulation spectrum processing to well-accepted ASR techniques such as dynamic speech features or RelAtive SpecTrAl (RASTA) filtering. Next, the frequency ...

  11. User Experience of a Mobile Speaking Application with Automatic Speech Recognition for EFL Learning

    Science.gov (United States)

    Ahn, Tae youn; Lee, Sangmin-Michelle

    2016-01-01

    With the spread of mobile devices, mobile phones have enormous potential regarding their pedagogical use in language education. The goal of this study is to analyse user experience of a mobile-based learning system that is enhanced by speech recognition technology for the improvement of EFL (English as a foreign language) learners' speaking…

  12. An automatic speech recognition system with speaker-independent identification support

    Science.gov (United States)

    Caranica, Alexandru; Burileanu, Corneliu

    2015-02-01

    The novelty of this work lies in the application of an open source research software toolkit (CMU Sphinx) to train, build and evaluate a speech recognition system, with speaker-independent support, for voice-controlled hardware applications. Moreover, we propose to use the trained acoustic model to successfully decode offline voice commands on embedded hardware, such as the Raspberry Pi, a low-cost ARMv6 SoC. This type of single-board computer, mainly used for educational and research activities, can serve as a proof-of-concept software and hardware stack for low cost voice automation systems.

  13. Novel Techniques for Dialectal Arabic Speech Recognition

    CERN Document Server

    Elmahdy, Mohamed; Minker, Wolfgang

    2012-01-01

    Novel Techniques for Dialectal Arabic Speech describes approaches to improve automatic speech recognition for dialectal Arabic. Since speech resources for dialectal Arabic speech recognition are very sparse, the authors describe how existing Modern Standard Arabic (MSA) speech data can be applied to dialectal Arabic speech recognition, while assuming that MSA is always a second language for all Arabic speakers. In this book, Egyptian Colloquial Arabic (ECA) has been chosen as a typical Arabic dialect. ECA is the first ranked Arabic dialect in terms of number of speakers, and a high quality ECA speech corpus with accurate phonetic transcription has been collected. MSA acoustic models were trained using news broadcast speech. In order to cross-lingually use MSA in dialectal Arabic speech recognition, the authors have normalized the phoneme sets for MSA and ECA. After this normalization, they have applied state-of-the-art acoustic model adaptation techniques like Maximum Likelihood Linear Regression (MLLR) and M...

  14. The Use of an Autonomous Pedagogical Agent and Automatic Speech Recognition for Teaching Sight Words to Students with Autism Spectrum Disorder

    Science.gov (United States)

    Saadatzi, Mohammad Nasser; Pennington, Robert C.; Welch, Karla C.; Graham, James H.; Scott, Renee E.

    2017-01-01

    In the current study, we examined the effects of an instructional package comprised of an autonomous pedagogical agent, automatic speech recognition, and constant time delay during the instruction of reading sight words aloud to young adults with autism spectrum disorder. We used a concurrent multiple baseline across participants design to…

  15. Assessing the Performance of Automatic Speech Recognition Systems When Used by Native and Non-Native Speakers of Three Major Languages in Dictation Workflows

    DEFF Research Database (Denmark)

    Zapata, Julián; Kirkedal, Andreas Søeborg

    2015-01-01

    In this paper, we report on a two-part experiment aiming to assess and compare the performance of two types of automatic speech recognition (ASR) systems on two different computational platforms when used to augment dictation workflows. The experiment was performed with a sample of speakers...

  16. Automatic recognition of spontaneous emotions in speech using acoustic and lexical features

    NARCIS (Netherlands)

    Raaijmakers, S.; Truong, K.P.

    2008-01-01

    We developed acoustic and lexical classifiers, based on a boosting algorithm, to assess the separability on arousal and valence dimensions in spontaneous emotional speech. The spontaneous emotional speech data was acquired by inviting subjects to play a first-person shooter video game. Our acoustic

  17. Biologically-Inspired Spike-Based Automatic Speech Recognition of Isolated Digits Over a Reproducing Kernel Hilbert Space

    Directory of Open Access Journals (Sweden)

    Kan Li

    2018-04-01

    This paper presents a novel real-time dynamic framework for quantifying time-series structure in spoken words using spikes. Audio signals are converted into multi-channel spike trains using a biologically-inspired leaky integrate-and-fire (LIF) spike generator. These spike trains are mapped into a function space of infinite dimension, i.e., a Reproducing Kernel Hilbert Space (RKHS) using point-process kernels, where a state-space model learns the dynamics of the multidimensional spike input using gradient descent learning. This kernelized recurrent system is very parsimonious and achieves the necessary memory depth via feedback of its internal states when trained discriminatively, utilizing the full context of the phoneme sequence. A main advantage of modeling nonlinear dynamics using state-space trajectories in the RKHS is that it imposes no restriction on the relationship between the exogenous input and its internal state. We are free to choose the input representation with an appropriate kernel, and changing the kernel does not impact the system nor the learning algorithm. Moreover, we show that this novel framework can outperform both traditional hidden Markov model (HMM) speech processing as well as neuromorphic implementations based on spiking neural network (SNN), yielding accurate and ultra-low power word spotters. As a proof of concept, we demonstrate its capabilities using the benchmark TI-46 digit corpus for isolated-word automatic speech recognition (ASR) or keyword spotting. Compared to HMM using Mel-frequency cepstral coefficient (MFCC) front-end without time-derivatives, our MFCC-KAARMA offered improved performance. For spike-train front-end, spike-KAARMA also outperformed state-of-the-art SNN solutions. Furthermore, compared to MFCCs, spike trains provided enhanced noise robustness in certain low signal-to-noise ratio (SNR) regime.
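
    The leaky integrate-and-fire (LIF) conversion mentioned in this abstract can be illustrated with a single-channel sketch. This is a generic textbook LIF neuron with forward-Euler integration, not the authors' front end: the time step, time constant, threshold and constant test inputs are assumed values, and a multi-channel version would first split the audio into frequency bands and run one such neuron per band.

```python
def lif_spike_train(signal, dt=1e-3, tau=0.02, threshold=1.0, v_reset=0.0):
    """Convert an input sequence into spike times with a leaky integrate-and-fire
    neuron: dv/dt = (-v + input) / tau, spike and reset when v >= threshold.
    All parameter values are illustrative assumptions."""
    v = v_reset
    spike_times = []
    for i, x in enumerate(signal):
        v += dt * (-v + x) / tau          # forward-Euler update of the membrane state
        if v >= threshold:
            spike_times.append(i * dt)    # record the spike time in seconds
            v = v_reset                   # reset after firing
    return spike_times

# Hypothetical constant-amplitude inputs lasting one second each.
weak = lif_spike_train([0.5] * 1000)    # never reaches the threshold
medium = lif_spike_train([2.0] * 1000)
strong = lif_spike_train([4.0] * 1000)  # charges faster, so it fires more often
print(len(weak), len(medium), len(strong))
```

    The sketch shows the rate-coding property such a generator relies on: stronger input drives the state to threshold sooner, so signal amplitude is encoded in spike timing and spike count.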

  18. Biologically-Inspired Spike-Based Automatic Speech Recognition of Isolated Digits Over a Reproducing Kernel Hilbert Space.

    Science.gov (United States)

    Li, Kan; Príncipe, José C

    2018-01-01

    This paper presents a novel real-time dynamic framework for quantifying time-series structure in spoken words using spikes. Audio signals are converted into multi-channel spike trains using a biologically-inspired leaky integrate-and-fire (LIF) spike generator. These spike trains are mapped into a function space of infinite dimension, i.e., a Reproducing Kernel Hilbert Space (RKHS) using point-process kernels, where a state-space model learns the dynamics of the multidimensional spike input using gradient descent learning. This kernelized recurrent system is very parsimonious and achieves the necessary memory depth via feedback of its internal states when trained discriminatively, utilizing the full context of the phoneme sequence. A main advantage of modeling nonlinear dynamics using state-space trajectories in the RKHS is that it imposes no restriction on the relationship between the exogenous input and its internal state. We are free to choose the input representation with an appropriate kernel, and changing the kernel does not impact the system nor the learning algorithm. Moreover, we show that this novel framework can outperform both traditional hidden Markov model (HMM) speech processing as well as neuromorphic implementations based on spiking neural network (SNN), yielding accurate and ultra-low power word spotters. As a proof of concept, we demonstrate its capabilities using the benchmark TI-46 digit corpus for isolated-word automatic speech recognition (ASR) or keyword spotting. Compared to HMM using Mel-frequency cepstral coefficient (MFCC) front-end without time-derivatives, our MFCC-KAARMA offered improved performance. For spike-train front-end, spike-KAARMA also outperformed state-of-the-art SNN solutions. Furthermore, compared to MFCCs, spike trains provided enhanced noise robustness in certain low signal-to-noise ratio (SNR) regime.

  19. Automatic Query Generation and Query Relevance Measurement for Unsupervised Language Model Adaptation of Speech Recognition

    Directory of Open Access Journals (Sweden)

    Suzuki Motoyuki

    2009-01-01

    We are developing a method of Web-based unsupervised language model adaptation for recognition of spoken documents. The proposed method chooses keywords from the preliminary recognition result and retrieves Web documents using the chosen keywords. A problem is that the selected keywords tend to contain misrecognized words. The proposed method introduces two new ideas for avoiding the effects of keywords derived from misrecognized words. The first idea is to compose multiple queries from selected keyword candidates so that the misrecognized words and correct words do not fall into one query. The second idea is that the number of Web documents downloaded for each query is determined according to the "query relevance." Combining these two ideas, we can alleviate the bad effect of misrecognized keywords by decreasing the number of downloaded Web documents from queries that contain misrecognized keywords. Finally, we examine a method of determining the number of iterative adaptations based on the recognition likelihood. Experiments have shown that the proposed stopping criterion can determine almost the optimum number of iterations. In the final experiment, the word accuracy without adaptation (55.29%) was improved to 60.38%, which was 1.13 points better than the result of the conventional unsupervised adaptation method (59.25%).

  20. Automatic Query Generation and Query Relevance Measurement for Unsupervised Language Model Adaptation of Speech Recognition

    Directory of Open Access Journals (Sweden)

    Akinori Ito

    2009-01-01

    We are developing a method of Web-based unsupervised language model adaptation for recognition of spoken documents. The proposed method chooses keywords from the preliminary recognition result and retrieves Web documents using the chosen keywords. A problem is that the selected keywords tend to contain misrecognized words. The proposed method introduces two new ideas for avoiding the effects of keywords derived from misrecognized words. The first idea is to compose multiple queries from selected keyword candidates so that the misrecognized words and correct words do not fall into one query. The second idea is that the number of Web documents downloaded for each query is determined according to the “query relevance.” Combining these two ideas, we can alleviate the bad effect of misrecognized keywords by decreasing the number of downloaded Web documents from queries that contain misrecognized keywords. Finally, we examine a method of determining the number of iterative adaptations based on the recognition likelihood. Experiments have shown that the proposed stopping criterion can determine almost the optimum number of iterations. In the final experiment, the word accuracy without adaptation (55.29%) was improved to 60.38%, which was 1.13 points better than the result of the conventional unsupervised adaptation method (59.25%).

  1. Optimizing Automatic Speech Recognition for Low-Proficient Non-Native Speakers

    Directory of Open Access Journals (Sweden)

    Catia Cucchiarini

    2010-01-01

    Computer-Assisted Language Learning (CALL) applications for improving the oral skills of low-proficient learners have to cope with non-native speech that is particularly challenging. Since unconstrained non-native ASR is still problematic, a possible solution is to elicit constrained responses from the learners. In this paper, we describe experiments aimed at selecting utterances from lists of responses. The first experiment on utterance selection indicates that the decoding process can be improved by optimizing the language model and the acoustic models, thus reducing the utterance error rate from 29–26% to 10–8%. Since giving feedback on incorrectly recognized utterances is confusing, we verify the correctness of the utterance before providing feedback. The results of the second experiment on utterance verification indicate that combining duration-related features with a likelihood ratio (LR) yields an equal error rate (EER) of 10.3%, which is significantly better than the EER for the other measures in isolation.
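
    The equal error rate reported for utterance verification is the operating point at which the false acceptance rate and the false rejection rate coincide. The sketch below shows one simple way to estimate it from two lists of verification scores. The function name, the threshold sweep over observed scores, and the example scores are all assumptions for illustration and are not taken from the paper.

```python
def equal_error_rate(correct_scores, incorrect_scores):
    """Estimate the EER by sweeping a threshold over all observed scores.

    An utterance is accepted when its score is >= the threshold.
    Returns (eer_estimate, threshold) at the smallest |FAR - FRR| gap.
    """
    best = None
    for t in sorted(set(correct_scores) | set(incorrect_scores)):
        # False rejection: a correctly recognized utterance scored below t.
        frr = sum(s < t for s in correct_scores) / len(correct_scores)
        # False acceptance: an incorrectly recognized utterance scored at or above t.
        far = sum(s >= t for s in incorrect_scores) / len(incorrect_scores)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2.0, t)
    return best[1], best[2]


if __name__ == "__main__":
    # Made-up scores, purely illustrative.
    correct = [0.9, 0.8, 0.7, 0.4]
    incorrect = [0.6, 0.3, 0.2, 0.1]
    print(equal_error_rate(correct, incorrect))
```

    With few scores the two error rates may never be exactly equal, so the sketch reports the average of the two at the closest crossing.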

  2. Automatic Smoker Detection from Telephone Speech Signals

    DEFF Research Database (Denmark)

    Poorjam, Amir Hossein; Hesaraki, Soheila; Safavi, Saeid

    2017-01-01

    This paper proposes an automatic smoking habit detection from spontaneous telephone speech signals. In this method, each utterance is modeled using i-vector and non-negative factor analysis (NFA) frameworks, which yield low-dimensional representation of utterances by applying factor analysis...... method is evaluated on telephone speech signals of speakers whose smoking habits are known drawn from the National Institute of Standards and Technology (NIST) 2008 and 2010 Speaker Recognition Evaluation databases. Experimental results over 1194 utterances show the effectiveness of the proposed approach...... for the automatic smoking habit detection task....

  3. Auditory Modeling for Noisy Speech Recognition

    National Research Council Canada - National Science Library

    2000-01-01

    ... digital filtering for noise cancellation which interfaces to speech recognition software. It uses auditory features in speech recognition training, and provides applications to multilingual spoken language translation...

  4. Automatic speech recognition (zero crossing method). Automatic recognition of isolated vowels; Reconnaissance automatique de la parole (methode des passages par zero). Reconnaissance automatique de voyelles isolees

    Energy Technology Data Exchange (ETDEWEB)

    Dupeyrat, Benoit

    1975-06-10

    This note describes a method for recognizing isolated vowels using a preprocessing of the vocal signal. The preprocessing extracts the extrema of the vocal signal and the time intervals separating them (distances between zero crossings of the first derivative of the signal). Vowels are recognized using normalized histograms of the values of these intervals. The program computes a distance between the histogram of the sound to be recognized and model histograms built during a learning phase. The results, obtained in real time on a minicomputer, are relatively independent of the speaker, provided that the fundamental frequency does not vary too much (i.e., speakers of the same sex). (author)
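
    The pipeline described in this note (extrema intervals, normalized histograms, nearest model histogram) can be sketched as follows. This is an illustrative reconstruction under stated assumptions: the abstract does not say which distance is used, so an L1 distance is assumed here, and the bin count, function names, and labels are hypothetical.

```python
def extrema_intervals(signal):
    """Distances, in samples, between successive local extrema of the signal,
    i.e. between sign changes of its first difference."""
    extrema = []
    prev_slope = 0
    for n in range(1, len(signal)):
        slope = signal[n] - signal[n - 1]
        if slope == 0:
            continue  # flat stretch: keep the last non-zero slope
        if prev_slope != 0 and (slope > 0) != (prev_slope > 0):
            extrema.append(n - 1)  # the slope changed sign at sample n - 1
        prev_slope = slope
    return [b - a for a, b in zip(extrema, extrema[1:])]


def normalized_histogram(intervals, n_bins=32):
    """Histogram of interval lengths (bin k holds length k + 1), summing to 1."""
    hist = [0.0] * n_bins
    for d in intervals:
        hist[min(d, n_bins) - 1] += 1.0  # overly long intervals go to the last bin
    total = sum(hist)
    return [h / total for h in hist] if total else hist


def l1_distance(h1, h2):
    # Assumed distance; the source only says "a distance".
    return sum(abs(a - b) for a, b in zip(h1, h2))


def classify(signal, templates, n_bins=32):
    """Return the label of the model histogram closest to the signal's histogram."""
    hist = normalized_histogram(extrema_intervals(signal), n_bins)
    return min(templates, key=lambda label: l1_distance(hist, templates[label]))
```

    Here `templates` would map each vowel label to a model histogram built from training recordings during the learning phase.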

  5. Analysis of Feature Extraction Methods for Speaker Dependent Speech Recognition

    Directory of Open Access Journals (Sweden)

    Gurpreet Kaur

    2017-02-01

    Speech recognition is about what is being said, irrespective of who is saying it. Speech recognition is a growing field. Major progress is taking place in the technology of automatic speech recognition (ASR). Still, there are many barriers in this field in terms of recognition rate, background noise, speaker variability, speaking rate, accent, etc. The speech recognition rate mainly depends on the selection of features and feature extraction methods. This paper outlines the feature extraction techniques for speaker-dependent speech recognition of isolated words. A brief survey of different feature extraction techniques such as Mel-Frequency Cepstral Coefficients (MFCC), Linear Predictive Coding Coefficients (LPCC), Perceptual Linear Prediction (PLP), and Relative Spectra Perceptual Linear Predictive (RASTA-PLP) analysis is presented and an evaluation is done. Speech recognition has various applications from daily use to commercial use. We have made a speaker-dependent system, and this system can be useful in many areas such as controlling a patient vehicle using simple commands.
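
    As background to the MFCC features named in this survey, the sketch below shows the mel-scale warping that MFCC front-ends commonly use to place their filterbank channels. It uses one widely used formula for the mel scale, mel = 2595 log10(1 + f/700). The abstract does not state which variant the paper uses, so the formula choice and the function names are assumptions.

```python
import math


def hz_to_mel(f_hz):
    """Convert frequency in Hz to mels (a common variant of the mel scale)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_filter_centers(n_filters, f_min_hz, f_max_hz):
    """Center frequencies (Hz) of n_filters channels equally spaced on the mel scale."""
    lo, hi = hz_to_mel(f_min_hz), hz_to_mel(f_max_hz)
    step = (hi - lo) / (n_filters + 1)
    return [mel_to_hz(lo + step * (k + 1)) for k in range(n_filters)]


if __name__ == "__main__":
    # Channels are dense at low frequencies and sparse at high frequencies.
    for fc in mel_filter_centers(10, 0.0, 8000.0):
        print(round(fc, 1))
```

    Equal spacing in mels produces channels that widen with frequency, which is how the front-end approximates the frequency resolution of human hearing.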

  6. Collecting and evaluating speech recognition corpora for nine Southern Bantu languages

    CSIR Research Space (South Africa)

    Badenhorst, JAC

    2009-03-01

    The authors describe the Lwazi corpus for automatic speech recognition (ASR), a new telephone speech corpus which includes data from nine Southern Bantu languages. Because of practical constraints, the amount of speech per language is relatively...

  7. On speech recognition during anaesthesia

    DEFF Research Database (Denmark)

    Alapetite, Alexandre

    2007-01-01

    This PhD thesis in human-computer interfaces (informatics) studies the case of the anaesthesia record used during medical operations and the possibility to supplement it with speech recognition facilities. Problems and limitations have been identified with the traditional paper-based anaesthesia...... and inaccuracies in the anaesthesia record. Supplementing the electronic anaesthesia record interface with speech input facilities is proposed as one possible solution to a part of the problem. The testing of the various hypotheses has involved the development of a prototype of an electronic anaesthesia record...... interface with speech input facilities in Danish. The evaluation of the new interface was carried out in a full-scale anaesthesia simulator. This has been complemented by laboratory experiments on several aspects of speech recognition for this type of use, e.g. the effects of noise on speech recognition...

  8. The software for automatic creation of the formal grammars used by speech recognition, computer vision, editable text conversion systems, and some new functions

    Science.gov (United States)

    Kardava, Irakli; Tadyszak, Krzysztof; Gulua, Nana; Jurga, Stefan

    2017-02-01

    To make environmental perception by artificial intelligence more flexible, supporting software modules are needed that can automate the creation of language-specific syntax and carry out further analysis for relevant decisions based on semantic functions. With our proposed approach, pairs of formal rules can be created for given sentences (in the case of natural languages) or statements (in the case of special languages) with the help of computer vision, speech recognition, or an editable text conversion system, for further automatic improvement. In other words, we have developed an approach that can significantly improve the automation of the training process of artificial intelligence, which in turn gives a higher level of self-developing skills independent of the users. Based on our approach, we have developed a demo version of the software, which includes the algorithm and software code implementing all of the above-mentioned components (computer vision, speech recognition, and editable text conversion system). The program can work in multi-stream mode and simultaneously create a syntax based on information received from several sources.

  9. End-to-end visual speech recognition with LSTMS

    NARCIS (Netherlands)

    Petridis, Stavros; Li, Zuwei; Pantic, Maja

    2017-01-01

    Traditional visual speech recognition systems consist of two stages, feature extraction and classification. Recently, several deep learning approaches have been presented which automatically extract features from the mouth images and aim to replace the feature extraction stage. However, research on

  10. Multi-thread Parallel Speech Recognition for Mobile Applications

    Directory of Open Access Journals (Sweden)

    LOJKA Martin

    2014-05-01

    In this paper, a server-based solution of a multi-thread large-vocabulary automatic speech recognition engine is described, along with practical application examples for Android OS and HTML5. The basic idea was to make speech recognition available to a full variety of applications for computers and especially for mobile devices. The speech recognition engine should be independent of commercial products and services (where the dictionary cannot be modified). Using third-party services can also be a security and privacy problem in specific applications, where unsecured audio data must not be sent to uncontrolled environments (voice data transferred to servers around the globe). Using our experience with speech recognition applications, we have been able to construct a multi-thread, server-based speech recognition solution with a simple application programming interface (API) to a speech recognition engine modified to the specific needs of a particular application.

  11. Discriminative learning for speech recognition

    CERN Document Server

    He, Xiadong

    2008-01-01

    In this book, we introduce the background and mainstream methods of probabilistic modeling and discriminative parameter optimization for speech recognition. The specific models treated in depth include the widely used exponential-family distributions and the hidden Markov model. A detailed study is presented on unifying the common objective functions for discriminative learning in speech recognition, namely maximum mutual information (MMI), minimum classification error, and minimum phone/word error. The unification is presented, with rigorous mathematical analysis, in a common rational-functio

  12. The Suitability of Cloud-Based Speech Recognition Engines for Language Learning

    Science.gov (United States)

    Daniels, Paul; Iwago, Koji

    2017-01-01

    As online automatic speech recognition (ASR) engines become more accurate and more widely implemented with CALL software, it becomes important to evaluate the effectiveness and the accuracy of these recognition engines using authentic speech samples. This study investigates two of the most prominent cloud-based speech recognition engines--Apple's…

  13. Speech recognition implementation in radiology

    International Nuclear Information System (INIS)

    White, Keith S.

    2005-01-01

    Continuous speech recognition (SR) is an emerging technology that allows direct digital transcription of dictated radiology reports. The SR systems are being widely deployed in the radiology community. This is a review of technical and practical issues that should be considered when implementing an SR system. (orig.)

  14. Unvoiced Speech Recognition Using Tissue-Conductive Acoustic Sensor

    Directory of Open Access Journals (Sweden)

    Heracleous Panikos

    2007-01-01

    We present the use of stethoscope and silicon NAM (nonaudible murmur) microphones in automatic speech recognition. NAM microphones are special acoustic sensors, which are attached behind the talker's ear and can capture not only normal (audible) speech, but also very quietly uttered speech (nonaudible murmur). As a result, NAM microphones can be applied in automatic speech recognition systems when privacy is desired in human-machine communication. Moreover, NAM microphones show robustness against noise and they might be used in special systems (speech recognition, speech transform, etc.) for sound-impaired people. Using adaptation techniques and a small amount of training data, we achieved for a 20 k dictation task a 93.9% word accuracy for nonaudible murmur recognition in a clean environment. In this paper, we also investigate nonaudible murmur recognition in noisy environments and the effect of the Lombard reflex on nonaudible murmur recognition. We also propose three methods to integrate audible speech and nonaudible murmur recognition using a stethoscope NAM microphone with very promising results.

  15. Unvoiced Speech Recognition Using Tissue-Conductive Acoustic Sensor

    Directory of Open Access Journals (Sweden)

    Hiroshi Saruwatari

    2007-01-01

    We present the use of stethoscope and silicon NAM (nonaudible murmur) microphones in automatic speech recognition. NAM microphones are special acoustic sensors, which are attached behind the talker's ear and can capture not only normal (audible) speech, but also very quietly uttered speech (nonaudible murmur). As a result, NAM microphones can be applied in automatic speech recognition systems when privacy is desired in human-machine communication. Moreover, NAM microphones show robustness against noise and they might be used in special systems (speech recognition, speech transform, etc.) for sound-impaired people. Using adaptation techniques and a small amount of training data, we achieved for a 20 k dictation task a 93.9% word accuracy for nonaudible murmur recognition in a clean environment. In this paper, we also investigate nonaudible murmur recognition in noisy environments and the effect of the Lombard reflex on nonaudible murmur recognition. We also propose three methods to integrate audible speech and nonaudible murmur recognition using a stethoscope NAM microphone with very promising results.

  16. ACOUSTIC SPEECH RECOGNITION FOR MARATHI LANGUAGE USING SPHINX

    Directory of Open Access Journals (Sweden)

    Aman Ankit

    2016-09-01

    Speech recognition, or speech-to-text processing, is the process of recognizing human speech by computer and converting it into text. In speech recognition, transcripts are created by taking recordings of speech as audio and their text transcriptions. Speech-based applications which include Natural Language Processing (NLP) techniques are popular and an active area of research. Input to such applications is in natural language and output is obtained in natural language. Speech recognition mostly revolves around three approaches, namely the acoustic phonetic approach, the pattern recognition approach, and the artificial intelligence approach. Creation of an acoustic model requires a large database of speech and training algorithms. The output of an ASR system is the recognition and translation of spoken language into text by computers and computerized devices. ASR today finds enormous application in tasks that require human-machine interfaces, such as voice dialing. Our key contribution in this paper is to create corpora for the Marathi language and explore the use of the Sphinx engine for automatic speech recognition.

  17. The benefit obtained from visually displayed text from an automatic speech recognizer during listening to speech presented in noise

    NARCIS (Netherlands)

    Zekveld, A.A.; Kramer, S.E.; Kessens, J.M.; Vlaming, M.S.M.G.; Houtgast, T.

    2008-01-01

    OBJECTIVES: The aim of this study was to evaluate the benefit that listeners obtain from visually presented output from an automatic speech recognition (ASR) system during listening to speech in noise. DESIGN: Auditory-alone and audiovisual speech reception thresholds (SRTs) were measured. The SRT

  18. Under-resourced speech recognition based on the speech manifold

    CSIR Research Space (South Africa)

    Sahraeian, R

    2015-09-01

    Conventional acoustic modeling involves estimating many parameters to effectively model feature distributions. The sparseness of speech and text data, however, degrades the reliability of the estimation process and makes speech recognition a...

  19. Quadcopter Control Using Speech Recognition

    Science.gov (United States)

    Malik, H.; Darma, S.; Soekirno, S.

    2018-04-01

    This research reports a comparison of the success rates of speech recognition systems using two types of databases, an existing database and a new database, implemented on a quadcopter for motion control. The speech recognition system used the Mel frequency cepstral coefficient (MFCC) method for feature extraction and was trained using the recursive neural network (RNN) method. MFCC is one of the most widely used feature extraction methods for speech recognition, with a success rate of 80% to 95%. The existing database was used to measure the success rate of the RNN method. The new database was created using the Indonesian language, and its success rate was then compared with the results from the existing database. Sound input from the microphone was processed on a DSP module with the MFCC method to obtain the characteristic values. The characteristic values were then fed to the RNN, which produced a command. The command became a control input to the single board computer (SBC), which produced the movement of the quadcopter. On the SBC, we used the Robot Operating System (ROS) as the kernel (operating system).

  20. A Research of Speech Emotion Recognition Based on Deep Belief Network and SVM

    Directory of Open Access Journals (Sweden)

    Chenchen Huang

    2014-01-01

    Feature extraction is a very important part of speech emotion recognition. To address the feature extraction problem in speech emotion recognition, this paper proposes a new method that uses DBNs in a DNN to extract emotional features from the speech signal automatically. A 5-layer-deep DBN is trained to extract speech emotion features, and multiple consecutive frames are incorporated to form a high-dimensional feature. The features output by the trained DBNs were the input of a nonlinear SVM classifier, yielding a multiple-classifier speech emotion recognition system. The speech emotion recognition rate of the system reached 86.5%, which was 7% higher than the original method.

  1. Hidden neural networks: application to speech recognition

    DEFF Research Database (Denmark)

    Riis, Søren Kamaric

    1998-01-01

    We evaluate the hidden neural network HMM/NN hybrid on two speech recognition benchmark tasks; (1) task independent isolated word recognition on the Phonebook database, and (2) recognition of broad phoneme classes in continuous speech from the TIMIT database. It is shown how hidden neural networks...

  2. AUTOMATIC ARCHITECTURAL STYLE RECOGNITION

    Directory of Open Access Journals (Sweden)

    M. Mathias

    2012-09-01

    Procedural modeling has proven to be a very valuable tool in the field of architecture. In the last few years, research has soared to automatically create procedural models from images. However, current algorithms for this process of inverse procedural modeling rely on the assumption that the building style is known. So far, the determination of the building style has remained a manual task. In this paper, we propose an algorithm which automates this process through classification of architectural styles from facade images. Our classifier first identifies the images containing buildings, then separates individual facades within an image and determines the building style. This information could then be used to initialize the building reconstruction process. We have trained our classifier to distinguish between several distinct architectural styles, namely Flemish Renaissance, Haussmannian and Neoclassical. Finally, we demonstrate our approach on various street-side images.

  3. Deep Recurrent Convolutional Neural Network: Improving Performance For Speech Recognition

    OpenAIRE

    Zhang, Zewang; Sun, Zheng; Liu, Jiaqi; Chen, Jingwen; Huo, Zhao; Zhang, Xiao

    2016-01-01

    Deep learning approaches have been widely applied to sequence modeling problems. In automatic speech recognition (ASR), performance has been significantly improved by larger speech corpora and deeper neural networks. In particular, recurrent neural networks and deep convolutional neural networks have been applied to ASR successfully. Given the arising problem of training speed, we build a novel deep recurrent convolutional network for acoustic modeling and then apply deep resid...

  4. Efficient CEPSTRAL Normalization for Robust Speech Recognition

    National Research Council Canada - National Science Library

    Liu, Fu-Hua; Stern, Richard M; Huang, Xuedong; Acero, Alejandro

    1993-01-01

    In this paper we describe and compare the performance of a series of cepstrum-based procedures that enable the CMU SPHINX-II speech recognition system to maintain a high level of recognition accuracy...
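
    The truncated abstract does not list the procedures it compares. As a hedged illustration of what a cepstrum-based normalization can look like, the sketch below implements plain cepstral mean normalization, a well-known technique in this family. Whether it matches any procedure evaluated in this record is an assumption, and the function name and data layout (one list of coefficients per frame) are hypothetical.

```python
def cepstral_mean_normalize(frames):
    """Subtract the per-coefficient mean over all frames of one utterance.

    frames: non-empty list of equal-length lists of cepstral coefficients.
    A fixed linear channel adds a constant vector in the cepstral domain,
    so removing the utterance mean cancels that constant offset.
    """
    n_frames = len(frames)
    dim = len(frames[0])
    mean = [sum(f[k] for f in frames) / n_frames for k in range(dim)]
    return [[f[k] - mean[k] for k in range(dim)] for f in frames]


if __name__ == "__main__":
    clean = [[1.0, 2.0], [3.0, 6.0]]
    # The same utterance seen through a channel that adds a constant offset.
    shifted = [[1.5, 2.25], [3.5, 6.25]]
    print(cepstral_mean_normalize(clean))
    print(cepstral_mean_normalize(shifted))
```

    Both calls print the same normalized frames, which is the sense in which the normalization is insensitive to a fixed channel offset.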

  5. Speech Emotion Recognition Based on Power Normalized Cepstral Coefficients in Noisy Conditions

    Directory of Open Access Journals (Sweden)

    M. Bashirpour

    2016-09-01

    Automatic recognition of speech emotional states in noisy conditions has become an important research topic in the emotional speech recognition area in recent years. This paper considers the recognition of emotional states via speech in real environments. For this task, we employ the power normalized cepstral coefficients (PNCC) in a speech emotion recognition system. We investigate its performance in emotion recognition using clean and noisy speech materials and compare it with the performances of the well-known MFCC, LPCC, RASTA-PLP, and also TEMFCC features. Speech samples are extracted from the Berlin emotional speech database (Emo DB) and the Persian emotional speech database (Persian ESD), which are corrupted with 4 different noise types under various SNR levels. The experiments are conducted in clean train/noisy test scenarios to simulate practical conditions with noise sources. Simulation results show that higher recognition rates are achieved for PNCC as compared with the conventional features under noisy conditions.

  6. Connected digit speech recognition system for Malayalam language

    Indian Academy of Sciences (India)

    Connected digit speech recognition is important in many applications such as automated banking systems, catalogue-dialing, automatic data entry, etc. This paper presents an optimum speaker-independent connected digit recognizer for the Malayalam language. The system employs Perceptual ...

  7. Physics of Automatic Target Recognition

    CERN Document Server

    Sadjadi, Firooz

    2007-01-01

    Physics of Automatic Target Recognition addresses the fundamental physical bases of sensing and information extraction in the state-of-the-art automatic target recognition field. It explores both passive and active multispectral sensing, polarimetric diversity, complex signature exploitation, sensor and processing adaptation, transformation of electromagnetic and acoustic waves in their interactions with targets, background clutter, transmission media, and sensing elements. The general inverse scattering and advanced signal processing techniques and scientific evaluation methodologies being used in this multidisciplinary field will be part of this exposition. The issues of modeling of target signatures in various spectral modalities, LADAR, IR, SAR, high resolution radar, acoustic, seismic, visible, hyperspectral, in diverse geometric aspects will be addressed. The methods for signal processing and classification will cover concepts such as sensor adaptive and artificial neural networks, time reversal filt...

  8. Automatic speech signal segmentation based on the innovation adaptive filter

    Directory of Open Access Journals (Sweden)

    Makowski Ryszard

    2014-06-01

    Speech segmentation is an essential stage in designing automatic speech recognition systems and one can find several algorithms proposed in the literature. It is a difficult problem, as speech is immensely variable. The aim of the authors’ studies was to design an algorithm that could be employed at the stage of automatic speech recognition. This would make it possible to avoid some problems related to speech signal parametrization. Posing the problem in such a way requires the algorithm to be capable of working in real time. The only such algorithm was proposed by Tyagi et al. (2006), and it is a modified version of Brandt’s algorithm. The article presents a new algorithm for unsupervised automatic speech signal segmentation. It performs segmentation without access to information about the phonetic content of the utterances, relying exclusively on second-order statistics of a speech signal. The starting point for the proposed method is time-varying Schur coefficients of an innovation adaptive filter. The Schur algorithm is known to be fast, precise, stable and capable of rapidly tracking changes in second-order signal statistics. A transition from one phoneme to another in the speech signal always indicates a change in signal statistics caused by vocal tract changes. In order to allow for the properties of human hearing, detection of inter-phoneme boundaries is performed based on statistics defined on the mel spectrum determined from the reflection coefficients. The paper presents the structure of the algorithm, defines its properties, lists parameter values, describes detection efficiency results, and compares them with those for another algorithm. The obtained segmentation results are satisfactory.

  9. Speech recognition systems on the Cell Broadband Engine

    Energy Technology Data Exchange (ETDEWEB)

    Liu, Y; Jones, H; Vaidya, S; Perrone, M; Tydlitat, B; Nanda, A

    2007-04-20

    In this paper we describe our design, implementation, and first results of a prototype connected-phoneme-based speech recognition system on the Cell Broadband Engine™ (Cell/B.E.). Automatic speech recognition decodes speech samples into plain text (other representations are possible) and must process samples at real-time rates. Fortunately, the computational tasks involved in this pipeline are highly data-parallel and can receive significant hardware acceleration from vector-streaming architectures such as the Cell/B.E. Identifying and exploiting these parallelism opportunities is challenging, but also critical to improving system performance. We observed, from our initial performance timings, that a single Cell/B.E. processor can recognize speech from thousands of simultaneous voice channels in real time--a channel density that is orders-of-magnitude greater than the capacity of existing software speech recognizers based on CPUs (central processing units). This result emphasizes the potential for Cell/B.E.-based speech recognition and will likely lead to the future development of production speech systems using Cell/B.E. clusters.

  10. Pattern Recognition Methods and Features Selection for Speech Emotion Recognition System.

    Science.gov (United States)

    Partila, Pavol; Voznak, Miroslav; Tovarek, Jaromir

    2015-01-01

    The impact of the classification method and feature selection on speech emotion recognition accuracy is discussed in this paper. Selecting the correct parameters in combination with the classifier is an important part of reducing the complexity of system computing. This step is necessary especially for systems that will be deployed in real-time applications. The reason for the development and improvement of speech emotion recognition systems is their wide usability in today's automatic voice-controlled systems. The Berlin database of emotional recordings was used in this experiment. Classification accuracy of artificial neural networks, k-nearest neighbours, and Gaussian mixture models is measured considering the selection of prosodic, spectral, and voice quality features. The purpose was to find an optimal combination of methods and group of features for stress detection in human speech. The research contribution lies in the design of the speech emotion recognition system with respect to its accuracy and efficiency.

  11. Pattern Recognition Methods and Features Selection for Speech Emotion Recognition System

    Directory of Open Access Journals (Sweden)

    Pavol Partila

    2015-01-01

    This paper discusses the impact of the classification method and feature selection on speech emotion recognition accuracy. Selecting the correct parameters in combination with the classifier is an important part of reducing the computational complexity of the system. This step is especially necessary for systems that will be deployed in real-time applications. Speech emotion recognition systems are being developed and improved because they are widely usable in today's automatic voice-controlled systems. The Berlin database of emotional recordings was used in this experiment. The classification accuracy of artificial neural networks, k-nearest neighbours, and Gaussian mixture models is measured for selections of prosodic, spectral, and voice quality features. The purpose was to find an optimal combination of methods and feature groups for stress detection in human speech. The research contribution lies in the design of a speech emotion recognition system with regard to its accuracy and efficiency.

  12. Personality in speech assessment and automatic classification

    CERN Document Server

    Polzehl, Tim

    2015-01-01

    This work combines interdisciplinary knowledge and experience from research fields of psychology, linguistics, audio-processing, machine learning, and computer science. The work systematically explores a novel research topic devoted to automated modeling of personality expression from speech. For this aim, it introduces a novel personality assessment questionnaire and presents the results of extensive labeling sessions to annotate the speech data with personality assessments. It provides estimates of the Big 5 personality traits, i.e. openness, conscientiousness, extroversion, agreeableness, and neuroticism. Based on a database built on the questionnaire, the book presents models to tell apart different personality types or classes from speech automatically.

  13. Speech emotion recognition methods: A literature review

    Science.gov (United States)

    Basharirad, Babak; Moradhaseli, Mohammadreza

    2017-10-01

    Recently, research attention on emotional speech signals in human-machine interfaces has grown owing to the availability of high computational capability. Many systems have been proposed in the literature to identify the emotional state through speech. Selecting suitable feature sets, designing proper classification methods, and preparing an appropriate dataset are the key issues in speech emotion recognition systems. This paper critically analyzes the currently available approaches to speech emotion recognition based on three evaluation parameters (feature set, classification of features, accurate usage). In addition, this paper evaluates the performance and limitations of the available methods. Furthermore, it highlights the current promising directions for improving speech emotion recognition systems.

  14. Man machine interface based on speech recognition

    International Nuclear Information System (INIS)

    Jorge, Carlos A.F.; Aghina, Mauricio A.C.; Mol, Antonio C.A.; Pereira, Claudio M.N.A.

    2007-01-01

    This work reports the development of a Man Machine Interface based on speech recognition. The system must recognize spoken commands and execute the desired tasks without manual intervention by operators. The range of applications goes from the execution of commands in an industrial plant's control room to navigation and interaction in virtual environments. Results are reported for isolated word recognition, the isolated words corresponding to the spoken commands. In the pre-processing stage, relevant parameters are extracted from the speech signals using the cepstral analysis technique; these parameters are the inputs of an artificial neural network that performs the recognition task. (author)

  15. On-device mobile speech recognition

    OpenAIRE

    Mustafa, MK

    2016-01-01

    Despite many years of research, Speech Recognition remains an active area of research in Artificial Intelligence. Currently, the most common commercial application of this technology on mobile devices uses a wireless client-server approach to meet the computational and memory demands of the speech recognition process. Unfortunately, such an approach is unlikely to remain viable when fully applied over the approximately 7.22 billion mobile phones currently in circulation. In this thesis we p...

  16. Human phoneme recognition depending on speech-intrinsic variability.

    Science.gov (United States)

    Meyer, Bernd T; Jürgens, Tim; Wesker, Thorsten; Brand, Thomas; Kollmeier, Birger

    2010-11-01

    The influence of different sources of speech-intrinsic variation (speaking rate, effort, style and dialect or accent) on human speech perception was investigated. In listening experiments with 16 listeners, confusions of consonant-vowel-consonant (CVC) and vowel-consonant-vowel (VCV) sounds in speech-weighted noise were analyzed. Experiments were based on the OLLO logatome speech database, which was designed for a man-machine comparison. It contains utterances spoken by 50 speakers from five dialect/accent regions and covers several intrinsic variations. By comparing results depending on intrinsic and extrinsic variations (i.e., different levels of masking noise), the degradation induced by variabilities can be expressed in terms of the SNR. The spectral level distance between the respective speech segment and the long-term spectrum of the masking noise was found to be a good predictor for recognition rates, while phoneme confusions were influenced by the distance to spectrally close phonemes. An analysis based on transmitted information of articulatory features showed that voicing and manner of articulation are comparatively robust cues in the presence of intrinsic variations, whereas the coding of place is more degraded. The database and detailed results have been made available for comparisons between human speech recognition (HSR) and automatic speech recognizers (ASR).

  17. Speech Recognition for the iCub Platform

    Directory of Open Access Journals (Sweden)

    Bertrand Higy

    2018-02-01

    This paper describes open source software (available at https://github.com/robotology/natural-speech) to build automatic speech recognition (ASR) systems and run them within the YARP platform. The toolkit is designed (i) to allow non-ASR experts to easily create their own ASR system and run it on iCub and (ii) to build deep learning-based models specifically addressing the main challenges an ASR system faces in the context of verbal human–iCub interactions. The toolkit mostly consists of Python, C++ code and shell scripts integrated in YARP. As an additional contribution, a second codebase (written in Matlab) is provided for more expert ASR users who want to experiment with bio-inspired and developmental learning-inspired ASR systems. Specifically, we provide code for two distinct kinds of speech recognition: “articulatory” and “unsupervised” speech recognition. The first is largely inspired by influential neurobiological theories of speech perception which assume speech perception to be mediated by brain motor cortex activities. Our articulatory systems have been shown to outperform strong deep learning-based baselines. The second type of recognition systems, the “unsupervised” systems, do not use any supervised information (contrary to most ASR systems, including our articulatory systems). To some extent, they mimic an infant who has to discover the basic speech units of a language by herself. In addition, we provide resources consisting of pre-trained deep learning models for ASR, and a 2.5-h speech dataset of spoken commands, the VoCub dataset, which can be used to adapt an ASR system to the typical acoustic environments in which iCub operates.

  18. Dynamic Programming Algorithms in Speech Recognition

    Directory of Open Access Journals (Sweden)

    Titus Felix FURTUNA

    2008-01-01

    In a word-based speech recognition system, recognition requires comparing the input signal of a word with the various words in the dictionary. The problem can be solved efficiently by a dynamic comparison algorithm whose goal is to bring the time scales of the two words into optimal correspondence. One algorithm of this type is Dynamic Time Warping. This paper presents two alternative implementations of the algorithm designed for the recognition of isolated words.
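
    To illustrate the idea of matching an input word against dictionary words by aligning their time scales, the following is a minimal sketch of textbook Dynamic Time Warping. It is not either of the two implementations presented in the paper. The function names `dtw_distance` and `recognize`, the use of 1-D feature sequences, and the absolute-difference local cost are all assumptions made for illustration.

```python
def dtw_distance(a, b):
    """Accumulated cost of the best DTW alignment of two 1-D sequences.

    Assumption: the local cost is the absolute difference of samples;
    a real recognizer would compare feature vectors per frame.
    """
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = minimal accumulated cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            # extend the cheapest of: step in a only, step in b only, step in both
            cost[i][j] = local + min(cost[i - 1][j],
                                     cost[i][j - 1],
                                     cost[i - 1][j - 1])
    return cost[n][m]


def recognize(word, dictionary):
    """Return the label of the dictionary template closest to `word` under DTW."""
    return min(dictionary, key=lambda label: dtw_distance(word, dictionary[label]))


# Hypothetical toy templates: the input is a time-stretched version of "yes".
templates = {"yes": [1, 2, 3], "no": [3, 2, 1]}
print(recognize([1, 2, 2, 3, 3], templates))
```

    The full table costs O(n·m) time and memory per comparison; restricting the warping path to a band around the diagonal is a common way to reduce this, at the price of forbidding extreme time distortions.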

  19. Emotion recognition from speech: tools and challenges

    Science.gov (United States)

    Al-Talabani, Abdulbasit; Sellahewa, Harin; Jassim, Sabah A.

    2015-05-01

    Human emotion recognition from speech is studied frequently for its importance in many applications, e.g. human-computer interaction. There is wide diversity and disagreement about the basic emotions or emotion-related states on the one hand, and about where the emotion-related information lies in the speech signal on the other. These diversities motivate our investigations into extracting Meta-features using the PCA approach, or using a non-adaptive random projection (RP), which significantly reduce the large-dimensional speech feature vectors that may contain a wide range of emotion-related information. Subsets of Meta-features are fused to increase the performance of the recognition model that adopts the score-based LDC classifier. We shall demonstrate that our scheme outperforms the state-of-the-art results when tested on non-prompted databases or acted databases (i.e. when subjects act specific emotions while uttering a sentence). However, the huge gap between accuracy rates achieved on the different types of speech datasets raises questions about the way emotions modulate the speech. In particular, we shall argue that emotion recognition from speech should not be dealt with as a classification problem. We shall demonstrate the presence of a spectrum of different emotions in the same speech portion, especially in the non-prompted datasets, which tend to be more "natural" than the acted datasets where the subjects attempt to suppress all but one emotion.

  20. Performance Assessment of Dynaspeak Speech Recognition System on Inflight Databases

    National Research Council Canada - National Science Library

    Barry, Timothy

    2004-01-01

    .... To aid in the assessment of various commercially available speech recognition systems, several aircraft speech databases have been developed at the Air Force Research Laboratory's Human Effectiveness Directorate...

  1. Speech Clarity Index (Ψ): A Distance-Based Speech Quality Indicator and Recognition Rate Prediction for Dysarthric Speakers with Cerebral Palsy

    Science.gov (United States)

    Kayasith, Prakasith; Theeramunkong, Thanaruk

    It is a tedious and subjective task to measure the severity of dysarthria by manually evaluating a speaker's speech using available standard assessment methods based on human perception. This paper presents an automated approach to assess the speech quality of a dysarthric speaker with cerebral palsy. Considering two complementary factors, speech consistency and speech distinction, a speech quality indicator called the speech clarity index (Ψ) is proposed as a measure of the speaker's ability to produce a consistent speech signal for a given word and distinguishable speech signals for different words. As an application, it can be used to assess speech quality and forecast the recognition rate of speech from an individual dysarthric speaker before the actual exhaustive implementation of an automatic speech recognition system for that speaker. The effectiveness of Ψ as a speech recognition rate predictor is evaluated by rank-order inconsistency, correlation coefficient, and root-mean-square of difference. The evaluations were done by comparing its predicted recognition rates with those predicted by the standard methods, the articulatory and intelligibility tests, based on two recognition systems (HMM and ANN). The results show that Ψ is a promising indicator for predicting the recognition rate of dysarthric speech. All experiments were done on a speech corpus composed of speech data from eight normal speakers and eight dysarthric speakers.

  2. Histogram equalization with Bayesian estimation for noise robust speech recognition.

    Science.gov (United States)

    Suh, Youngjoo; Kim, Hoirin

    2018-02-01

    The histogram equalization approach is an efficient feature normalization technique for noise robust automatic speech recognition. However, it suffers from performance degradation when some fundamental conditions are not satisfied in the test environment. To remedy these limitations of the original histogram equalization methods, a class-based histogram equalization approach has been proposed. Although this approach showed substantial performance improvement under noise environments, it still suffers from performance degradation due to the overfitting problem when test data are insufficient. To address this issue, the proposed histogram equalization technique employs the Bayesian estimation method in the test cumulative distribution function estimation. It was reported in a previous study conducted on the Aurora-4 task that the proposed approach provided substantial performance gains in speech recognition systems based on the acoustic modeling of the Gaussian mixture model-hidden Markov model. In this work, the proposed approach was examined in speech recognition systems with the deep neural network-hidden Markov model (DNN-HMM), the current mainstream speech recognition approach, where it also showed meaningful performance improvement over the conventional maximum likelihood estimation-based method. The fusion of the proposed features with the mel-frequency cepstral coefficients provided additional performance gains in DNN-HMM systems, which otherwise suffer from performance degradation in the clean test condition.
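
    As background for the abstract above, the following is a minimal sketch of plain histogram equalization for one feature dimension: each test value is mapped to the reference value that has the same cumulative probability. It uses a simple empirical (rank-based) estimate of the test CDF and does not implement the class-based or Bayesian estimation variants the paper proposes. The function name `histogram_equalize` and the mid-rank CDF estimate are assumptions made for illustration.

```python
from bisect import bisect_right


def histogram_equalize(test_values, reference_values):
    """Map each test value to the reference value with the same CDF position.

    Assumption: both CDFs are estimated empirically from the samples given;
    with few test samples this estimate is noisy, which is the situation
    the abstract describes as the motivation for a better CDF estimator.
    """
    ref_sorted = sorted(reference_values)
    test_sorted = sorted(test_values)
    n_test = len(test_sorted)
    n_ref = len(ref_sorted)
    equalized = []
    for x in test_values:
        # mid-rank empirical CDF keeps the value strictly inside (0, 1)
        rank = bisect_right(test_sorted, x)
        cdf = (rank - 0.5) / n_test
        # inverse reference CDF: pick the reference sample at that quantile
        index = min(int(cdf * n_ref), n_ref - 1)
        equalized.append(ref_sorted[index])
    return equalized


# Hypothetical toy data: a shifted and scaled test feature is pulled back
# onto the range of the reference (training) feature.
print(histogram_equalize([10, 20, 30], [0, 1, 2]))
```

    The mapping is monotonic, so the ordering of the test values is preserved while their distribution is reshaped to resemble the reference distribution.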

  3. Speech recognition from spectral dynamics

    Indian Academy of Sciences (India)

    Carrier nature of speech; modulation spectrum; spectral dynamics ... the relationships between phonetic values of sounds and their short-term spectral envelopes .... the number of free parameters that need to be estimated from training data.

  4. CASRA+: A Colloquial Arabic Speech Recognition Application

    OpenAIRE

    Ramzi A. Haraty; Omar El Ariss

    2007-01-01

    The research proposed here was for an Arabic speech recognition application, concentrating on the Lebanese dialect. The system starts by sampling the speech, which is the process of transforming the sound from analog to digital, and then extracts the features by using the Mel-Frequency Cepstral Coefficients (MFCC). The extracted features are then compared with the system's stored model; in this case the stored model chosen was a phoneme-based model. This reference model differs from the direc...

  5. Novel acoustic features for speech emotion recognition

    Institute of Scientific and Technical Information of China (English)

    ROH Yong-Wan; KIM Dong-Ju; LEE Woo-Seok; HONG Kwang-Seok

    2009-01-01

    This paper focuses on acoustic features that effectively improve the recognition of emotion in human speech. The novel features in this paper are based on spectral-based entropy parameters such as fast Fourier transform (FFT) spectral entropy, delta FFT spectral entropy, Mel-frequency filter bank (MFB) spectral entropy, and Delta MFB spectral entropy. Spectral-based entropy features are simple. They reflect frequency characteristic and changing characteristic in frequency of speech. We implement an emotion rejection module using the probability distribution of recognized-scores and rejected-scores. This reduces the false recognition rate to improve overall performance. Recognized-scores and rejected-scores refer to probabilities of recognized and rejected emotion recognition results, respectively. These scores are first obtained from a pattern recognition procedure. The pattern recognition phase uses the Gaussian mixture model (GMM). We classify the four emotional states as anger, sadness, happiness and neutrality. The proposed method is evaluated using 45 sentences in each emotion for 30 subjects, 15 males and 15 females. Experimental results show that the proposed method is superior to the existing emotion recognition methods based on GMM using energy, Zero Crossing Rate (ZCR), linear prediction coefficient (LPC), and pitch parameters. We demonstrate the effectiveness of the proposed approach. One of the proposed features, combined MFB and delta MFB spectral entropy improves performance approximately 10% compared to the existing feature parameters for speech emotion recognition methods. We demonstrate a 4% performance improvement in the applied emotion rejection with low confidence score.
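
    The abstract does not give the exact definition of its entropy features, so the following is only a minimal sketch of one common formulation: the power spectrum of a frame is normalized to a probability distribution and its Shannon entropy is taken, and the "delta" feature is assumed here to be the frame-to-frame difference of that entropy. The function names, the naive DFT (used to avoid external libraries), and the one-sided spectrum are all assumptions made for illustration.

```python
import cmath
import math


def power_spectrum(frame):
    """One-sided power spectrum via a naive DFT (O(n^2), fine for short frames)."""
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        s = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                for t, x in enumerate(frame))
        spectrum.append(abs(s) ** 2)
    return spectrum


def spectral_entropy(frame):
    """Shannon entropy (bits) of the normalized power spectrum of one frame."""
    spectrum = power_spectrum(frame)
    total = sum(spectrum)
    if total == 0:
        return 0.0  # silent frame: define the entropy as zero
    probs = [p / total for p in spectrum if p > 0]
    return -sum(p * math.log2(p) for p in probs)


def delta_entropies(frames):
    """Assumed delta feature: difference between consecutive frame entropies."""
    values = [spectral_entropy(f) for f in frames]
    return [b - a for a, b in zip(values, values[1:])]


# A pure tone concentrates energy in one bin (low entropy); an impulse
# spreads energy evenly over all bins (high entropy).
tone = [math.cos(2 * math.pi * 2 * t / 16) for t in range(16)]
impulse = [1.0] + [0.0] * 15
print(spectral_entropy(tone), spectral_entropy(impulse))
```

    A filter-bank variant would apply the same entropy computation to the energies of mel-spaced filter outputs instead of the raw DFT bins.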

  6. Voice Activity Detection. Fundamentals and Speech Recognition System Robustness

    OpenAIRE

    Ramirez, J.; Gorriz, J. M.; Segura, J. C.

    2007-01-01

    This chapter has given an overview of the main challenges in robust speech detection and a review of the state of the art and applications. VADs are frequently used in a number of applications including speech coding, speech enhancement and speech recognition. A precise VAD extracts a set of discriminative speech features from the noisy speech and formulates the decision in terms of a well-defined rule. The chapter has summarized three robust VAD methods that yield high speech/non-speech discri...

  7. Audio-Visual Speech Recognition Using MPEG-4 Compliant Visual Features

    Directory of Open Access Journals (Sweden)

    Petar S. Aleksic

    2002-11-01

    We describe an audio-visual automatic continuous speech recognition system, which significantly improves speech recognition performance over a wide range of acoustic noise levels, as well as under clean audio conditions. The system utilizes facial animation parameters (FAPs) supported by the MPEG-4 standard for the visual representation of speech. We also describe a robust and automatic algorithm we have developed to extract FAPs from visual data, which does not require hand labeling or extensive training procedures. Principal component analysis (PCA) was performed on the FAPs in order to decrease the dimensionality of the visual feature vectors, and the derived projection weights were used as visual features in the audio-visual automatic speech recognition (ASR) experiments. Both single-stream and multistream hidden Markov models (HMMs) were used to model the ASR system, integrate audio and visual information, and perform relatively large vocabulary (approximately 1000 words) speech recognition experiments. The experiments performed use clean audio data and audio data corrupted by stationary white Gaussian noise at various SNRs. The proposed system reduces the word error rate (WER) by 20% to 23% relative to audio-only speech recognition WERs at various SNRs (0–30 dB) with additive white Gaussian noise, and by 19% relative to the audio-only speech recognition WER under clean audio conditions.
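
    The improvements above are relative, not absolute, WER reductions. A short sketch of the arithmetic follows; the function name and the WER values are hypothetical and are not taken from the paper.

```python
def relative_wer_reduction(baseline_wer, new_wer):
    """Fraction of the baseline word error rate that the new system removes."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return (baseline_wer - new_wer) / baseline_wer


# Hypothetical numbers: going from 50% WER (audio-only) to 40% WER
# (audio-visual) is a 10-point absolute but a 20% relative reduction.
print(round(relative_wer_reduction(0.50, 0.40), 3))
```

    Relative figures allow comparison across noise levels where the audio-only baseline differs widely, but they should be read together with the baseline WER itself.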

  8. Human and automatic speaker recognition over telecommunication channels

    CERN Document Server

    Fernández Gallardo, Laura

    2016-01-01

    This work addresses the evaluation of the human and the automatic speaker recognition performances under different channel distortions caused by bandwidth limitation, codecs, and electro-acoustic user interfaces, among other impairments. Its main contribution is the demonstration of the benefits of communication channels of extended bandwidth, together with an insight into how speaker-specific characteristics of speech are preserved through different transmissions. It provides sufficient motivation for considering speaker recognition as a criterion for the migration from narrowband to enhanced bandwidths, such as wideband and super-wideband.

  9. Post-editing through Speech Recognition

    DEFF Research Database (Denmark)

    Mesa-Lao, Bartolomé

    (i.e. typing, handwriting and speaking) to improve the efficiency and accuracy of the translation process. However, further studies need to be conducted to build up new knowledge about the way in which state-of-the-art speech recognition software can be applied to the post-editing process...

  10. Speech Recognition Technology for Disabilities Education

    Science.gov (United States)

    Tang, K. Wendy; Kamoua, Ridha; Sutan, Victor; Farooq, Omer; Eng, Gilbert; Chu, Wei Chern; Hou, Guofeng

    2005-01-01

    Speech recognition is an alternative to traditional methods of interacting with a computer, such as textual input through a keyboard. An effective system can replace or reduce the reliance on standard keyboard and mouse input. This can especially assist dyslexic students who have problems with character or word use and manipulation in a textual…

  11. Adapting Speech Recognition in Augmented Reality for Mobile Devices in Outdoor Environments

    OpenAIRE

    Pascoal, Rui; Ribeiro, Ricardo; Batista, Fernando; de Almeida, Ana

    2017-01-01

    This paper describes the process of integrating automatic speech recognition (ASR) into a mobile application and explores the benefits and challenges of integrating speech with augmented reality (AR) in outdoor environments. The augmented reality allows end-users to interact with the information displayed and perform tasks, while increasing the user’s perception about the real world by adding virtual information to it. Speech is the most natural way of communication: it allows hands-free inte...

  12. A Study on Efficient Robust Speech Recognition with Stochastic Dynamic Time Warping

    OpenAIRE

    孫, 喜浩

    2014-01-01

    In recent years, great progress has been made in automatic speech recognition (ASR) systems. The hidden Markov model (HMM) and dynamic time warping (DTW) are the two main algorithms that have been widely applied to ASR systems. Although the HMM technique achieves higher recognition accuracy in both clear and noisy speech environments, it needs a large set of words and is more complex to implement. Thus, more and more researchers have focused on DTW-based ASR systems. Dynamic time warpin...

  13. Comparing grapheme-based and phoneme-based speech recognition for Afrikaans

    CSIR Research Space (South Africa)

    Basson, WD

    2012-11-01

    This paper compares the recognition accuracy of a phoneme-based automatic speech recognition system with that of a grapheme-based system, using Afrikaans as case study. The first system is developed using a conventional pronunciation dictionary...

  14. Novel acoustic features for speech emotion recognition

    Institute of Scientific and Technical Information of China (English)

    ROH Yong-Wan; KIM Dong-Ju; LEE Woo-Seok; HONG Kwang-Seok

    2009-01-01

    This paper focuses on acoustic features that effectively improve the recognition of emotion in human speech. The novel features in this paper are based on spectral-based entropy parameters such as fast Fourier transform (FFT) spectral entropy, delta FFT spectral entropy, Mel-frequency filter bank (MFB) spectral entropy, and Delta MFB spectral entropy. Spectral-based entropy features are simple. They reflect frequency characteristic and changing characteristic in frequency of speech. We implement an emotion rejection module using the probability distribution of recognized-scores and rejected-scores. This reduces the false recognition rate to improve overall performance. Recognized-scores and rejected-scores refer to probabilities of recognized and rejected emotion recognition results, respectively. These scores are first obtained from a pattern recognition procedure. The pattern recognition phase uses the Gaussian mixture model (GMM). We classify the four emotional states as anger, sadness, happiness and neutrality. The proposed method is evaluated using 45 sentences in each emotion for 30 subjects, 15 males and 15 females. Experimental results show that the proposed method is superior to the existing emotion recognition methods based on GMM using energy, Zero Crossing Rate (ZCR), linear prediction coefficient (LPC), and pitch parameters. We demonstrate the effectiveness of the proposed approach. One of the proposed features, combined MFB and delta MFB spectral entropy improves performance approximately 10% compared to the existing feature parameters for speech emotion recognition methods. We demonstrate a 4% performance improvement in the applied emotion rejection with low confidence score.

  15. Support vector machine for automatic pain recognition

    Science.gov (United States)

    Monwar, Md Maruf; Rezaei, Siamak

    2009-02-01

    Facial expressions are a key index of emotion, and the interpretation of such expressions is critical to everyday social functioning. In this paper, we present an efficient video analysis technique for recognizing a specific expression, pain, from human faces. We employ an automatic face detector that detects faces in stored video frames using a skin color modeling technique. For pain recognition, location and shape features of the detected faces are computed. These features are then used as inputs to a support vector machine (SVM) for classification. We compare the results with neural network based and eigenimage based automatic pain recognition systems. The experimental results indicate that using a support vector machine as the classifier can certainly improve the performance of an automatic pain recognition system.

  16. Speech coding, reconstruction and recognition using acoustics and electromagnetic waves

    International Nuclear Information System (INIS)

    Holzrichter, J.F.; Ng, L.C.

    1998-01-01

    The use of EM radiation in conjunction with simultaneously recorded acoustic speech information enables a complete mathematical coding of acoustic speech. The methods include the forming of a feature vector for each pitch period of voiced speech and the forming of feature vectors for each time frame of unvoiced, as well as for combined voiced and unvoiced speech. The methods include how to deconvolve the speech excitation function from the acoustic speech output to describe the transfer function for each time frame. The formation of feature vectors defining all acoustic speech units over well defined time frames can be used for purposes of speech coding, speech compression, speaker identification, language-of-speech identification, speech recognition, speech synthesis, speech translation, speech telephony, and speech teaching.

  17. Speech coding, reconstruction and recognition using acoustics and electromagnetic waves

    Science.gov (United States)

    Holzrichter, John F.; Ng, Lawrence C.

    1998-01-01

    The use of EM radiation in conjunction with simultaneously recorded acoustic speech information enables a complete mathematical coding of acoustic speech. The methods include the forming of a feature vector for each pitch period of voiced speech and the forming of feature vectors for each time frame of unvoiced, as well as for combined voiced and unvoiced speech. The methods include how to deconvolve the speech excitation function from the acoustic speech output to describe the transfer function for each time frame. The formation of feature vectors defining all acoustic speech units over well defined time frames can be used for purposes of speech coding, speech compression, speaker identification, language-of-speech identification, speech recognition, speech synthesis, speech translation, speech telephony, and speech teaching.

  18. Auditory analysis for speech recognition based on physiological models

    Science.gov (United States)

    Jeon, Woojay; Juang, Biing-Hwang

    2004-05-01

    To address the limitations of traditional cepstrum or LPC based front-end processing methods for automatic speech recognition, more elaborate methods based on physiological models of the human auditory system may be used to achieve more robust speech recognition in adverse environments. For this purpose, a modified version of a model of the primary auditory cortex featuring a three dimensional mapping of auditory spectra [Wang and Shamma, IEEE Trans. Speech Audio Process. 3, 382-395 (1995)] is adopted and investigated for its use as an improved front-end processing method. The study is conducted in two ways: first, by relating the model's redundant representation to traditional spectral representations and showing that the former not only encompasses information provided by the latter, but also reveals more relevant information that makes it superior in describing the identifying features of speech signals; and second, by observing the statistical features of the representation for various classes of sound to show how different identifying features manifest themselves as specific patterns on the cortical map, thereby becoming a place-coded data set on which detection theory could be applied to simulate auditory perception and cognition.

  19. Comparison of Forced-Alignment Speech Recognition and Humans for Generating Reference VAD

    DEFF Research Database (Denmark)

    Kraljevski, Ivan; Tan, Zheng-Hua; Paola Bissiri, Maria

    2015-01-01

    This paper aims to answer the question whether forced-alignment speech recognition can be used as an alternative to humans in generating reference Voice Activity Detection (VAD) transcriptions. An investigation of the level of agreement between automatic/manual VAD transcriptions and the reference ones produced by a human expert was carried out. Thereafter, statistical analysis was employed on the automatically produced and the collected manual transcriptions. Experimental results confirmed that forced-alignment speech recognition can provide accurate and consistent VAD labels.

  20. Towards automatic forensic face recognition

    NARCIS (Netherlands)

    Ali, Tauseef; Spreeuwers, Lieuwe Jan; Veldhuis, Raymond N.J.

    2011-01-01

    In this paper we present a methodology and experimental results for evidence evaluation in the context of forensic face recognition. In forensic applications, the matching score (hereafter referred to as similarity score) from a biometric system must be represented as a Likelihood Ratio (LR). In our

  1. Automatic modulation recognition of communication signals

    CERN Document Server

    Azzouz, Elsayed Elsayed

    1996-01-01

    Automatic modulation recognition is a rapidly evolving area of signal analysis. In recent years, interest from the academic and military research institutes has focused around the research and development of modulation recognition algorithms. Any communication intelligence (COMINT) system comprises three main blocks: receiver front-end, modulation recogniser and output stage. Considerable work has been done in the area of receiver front-ends. The work at the output stage is concerned with information extraction, recording and exploitation and begins with signal demodulation, that requires accurate knowledge about the signal modulation type. There are, however, two main reasons for knowing the current modulation type of a signal; to preserve the signal information content and to decide upon the suitable counter action, such as jamming. Automatic Modulation Recognition of Communications Signals describes in depth this modulation recognition process. Drawing on several years of research, the authors provide a cr...

  2. Compact Acoustic Models for Embedded Speech Recognition

    Directory of Open Access Journals (Sweden)

    Lévy Christophe

    2009-01-01

    Full Text Available Speech recognition applications are known to require a significant amount of resources. However, embedded speech recognition allows only a few KB of memory, a few MIPS, and a small amount of training data. In order to fit the resource constraints of embedded applications, an approach based on a semicontinuous HMM system using state-independent acoustic modelling is proposed. A transformation is computed and applied to the global model in order to obtain each HMM state-dependent probability density function, so that only the transformation parameters need to be stored. This approach is evaluated on two tasks: digit and voice-command recognition. A fast adaptation technique for acoustic models is also proposed. In order to significantly reduce computational costs, the adaptation is performed only on the global model (using related speaker recognition adaptation techniques) with no need for state-dependent data. The whole approach results in a relative gain of more than 20% compared to a basic HMM-based system fitting the constraints.

  3. Development an Automatic Speech to Facial Animation Conversion for Improve Deaf Lives

    Directory of Open Access Journals (Sweden)

    S. Hamidreza Kasaei

    2011-05-01

    Full Text Available In this paper, we propose the design and initial implementation of a robust system that automatically translates voice into text and text into sign language animations. Sign language translation systems could significantly improve deaf lives, especially in communication, exchange of information, and the use of machines to translate conversations from one language to another. Considering these points, it seems necessary to study speech recognition. Voice recognition algorithms usually address three major challenges: the first is extracting features from speech, the second is recognition when only a limited sound gallery is available, and the third is moving from speaker-dependent to speaker-independent voice recognition. Extracting features from speech is an important stage in our method, and different procedures are available for it. One of the most common in speech recognition systems is Mel-Frequency Cepstral Coefficients (MFCCs). The algorithm starts with preprocessing and signal conditioning. Next, features are extracted from the speech using cepstral coefficients. The result of this process is then sent to the segmentation part. Finally, the recognition part recognizes the words and converts the recognized words to facial animation. The project is still in progress, and some new interesting methods are described in the current report.
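
    The preprocessing and cepstral feature extraction steps described in this abstract can be illustrated with a minimal NumPy sketch of a textbook MFCC pipeline. It is not the authors' implementation: the function names, the 0.97 pre-emphasis coefficient, the 25 ms / 10 ms framing at 16 kHz, and the 26-filter / 13-coefficient sizes are all assumptions chosen for illustration.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sample_rate):
    # Triangular filters whose centres are evenly spaced on the mel scale.
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):
            fbank[i - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fbank[i - 1, k] = (right - k) / max(right - center, 1)
    return fbank

def mfcc(signal, sample_rate=16000, frame_len=400, hop=160,
         n_fft=512, n_filters=26, n_ceps=13):
    # Assumed parameter values; the signal must hold at least frame_len samples.
    # 1. Pre-emphasis boosts high frequencies.
    emphasized = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
    # 2. Slice into overlapping frames and apply a Hamming window.
    n_frames = 1 + max(0, (len(emphasized) - frame_len) // hop)
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = emphasized[idx] * np.hamming(frame_len)
    # 3. Power spectrum of each frame.
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    # 4. Log mel filterbank energies.
    energies = np.log(power @ mel_filterbank(n_filters, n_fft, sample_rate).T + 1e-10)
    # 5. DCT-II decorrelates the energies; keep the first n_ceps coefficients.
    n = np.arange(n_filters)
    basis = np.cos(np.pi * np.outer(np.arange(n_ceps), 2 * n + 1) / (2 * n_filters))
    return energies @ basis.T
```

    Each row of the returned array is the feature vector for one frame; a segmentation or recognition stage such as the one the abstract mentions would consume these rows.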

  4. The automaticity of emotion recognition.

    Science.gov (United States)

    Tracy, Jessica L; Robins, Richard W

    2008-02-01

    Evolutionary accounts of emotion typically assume that humans evolved to quickly and efficiently recognize emotion expressions because these expressions convey fitness-enhancing messages. The present research tested this assumption in 2 studies. Specifically, the authors examined (a) how quickly perceivers could recognize expressions of anger, contempt, disgust, embarrassment, fear, happiness, pride, sadness, shame, and surprise; (b) whether accuracy is improved when perceivers deliberate about each expression's meaning (vs. respond as quickly as possible); and (c) whether accurate recognition can occur under cognitive load. Across both studies, perceivers quickly and efficiently (i.e., under cognitive load) recognized most emotion expressions, including the self-conscious emotions of pride, embarrassment, and shame. Deliberation improved accuracy in some cases, but these improvements were relatively small. Discussion focuses on the implications of these findings for the cognitive processes underlying emotion recognition.

  5. Error analysis to improve the speech recognition accuracy on ...

    Indian Academy of Sciences (India)

    dictionary plays a key role in the speech recognition accuracy. .... Sophisticated microphone is used for the recording speech corpus in a noise free environment. .... values, word error rate (WER) and error-rate will be calculated as follows:.
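
    The snippet above elides the formula it announces. As a hedged illustration only, the conventional definition of word error rate is WER = (S + D + I) / N, where S, D and I are the substitutions, deletions and insertions needed to turn the reference into the hypothesis and N is the number of reference words. The sketch below computes it with a word-level edit distance; the function name is hypothetical and the paper's exact formula may differ.

```python
def word_error_rate(reference, hypothesis):
    """Conventional WER: word-level edit distance divided by reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # dist[i][j] = minimum edits turning the first i reference words
    # into the first j hypothesis words.
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i  # i deletions
    for j in range(len(hyp) + 1):
        dist[0][j] = j  # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,         # deletion
                             dist[i][j - 1] + 1,         # insertion
                             dist[i - 1][j - 1] + cost)  # match or substitution
    return dist[-1][-1] / len(ref)
```

    For example, scoring the hypothesis "the cat sit on mat" against the reference "the cat sat on the mat" counts one substitution and one deletion over six reference words.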

  6. Speech recognition in natural background noise.

    Directory of Open Access Journals (Sweden)

    Julien Meyer

    Full Text Available In the real world, human speech recognition nearly always involves listening in background noise. The impact of such noise on speech signals and on intelligibility performance increases with the separation of the listener from the speaker. The present behavioral experiment provides an overview of the effects of such acoustic disturbances on speech perception in conditions approaching ecologically valid contexts. We analysed the intelligibility loss in spoken word lists with increasing listener-to-speaker distance in a typical low-level natural background noise. The noise was combined with the simple spherical amplitude attenuation due to distance, basically changing the signal-to-noise ratio (SNR). Therefore, our study draws attention to some of the most basic environmental constraints that have pervaded spoken communication throughout human history. We evaluated the ability of native French participants to recognize French monosyllabic words (spoken at 65.3 dB(A), reference at 1 meter) at distances between 11 and 33 meters, which corresponded to the SNRs most revealing of the progressive effect of the selected natural noise (-8.8 dB to -18.4 dB). Our results showed that in such conditions, identity of vowels is mostly preserved, with the striking peculiarity of the absence of confusion in vowels. The results also confirmed the functional role of consonants during lexical identification. The extensive analysis of recognition scores, confusion patterns and associated acoustic cues revealed that sonorant, sibilant and burst properties were the most important parameters influencing phoneme recognition. Altogether these analyses allowed us to extract a resistance scale from consonant recognition scores. We also identified specific perceptual consonant confusion groups depending on the place in the words (onset vs. coda). Finally our data suggested that listeners may access some acoustic cues of the CV transition, opening interesting perspectives for

  7. Speech recognition in natural background noise.

    Science.gov (United States)

    Meyer, Julien; Dentel, Laure; Meunier, Fanny

    2013-01-01

    In the real world, human speech recognition nearly always involves listening in background noise. The impact of such noise on speech signals and on intelligibility performance increases with the separation of the listener from the speaker. The present behavioral experiment provides an overview of the effects of such acoustic disturbances on speech perception in conditions approaching ecologically valid contexts. We analysed the intelligibility loss in spoken word lists with increasing listener-to-speaker distance in a typical low-level natural background noise. The noise was combined with the simple spherical amplitude attenuation due to distance, basically changing the signal-to-noise ratio (SNR). Therefore, our study draws attention to some of the most basic environmental constraints that have pervaded spoken communication throughout human history. We evaluated the ability of native French participants to recognize French monosyllabic words (spoken at 65.3 dB(A), reference at 1 meter) at distances between 11 and 33 meters, which corresponded to the SNRs most revealing of the progressive effect of the selected natural noise (-8.8 dB to -18.4 dB). Our results showed that in such conditions, identity of vowels is mostly preserved, with the striking peculiarity of the absence of confusion in vowels. The results also confirmed the functional role of consonants during lexical identification. The extensive analysis of recognition scores, confusion patterns and associated acoustic cues revealed that sonorant, sibilant and burst properties were the most important parameters influencing phoneme recognition. Altogether these analyses allowed us to extract a resistance scale from consonant recognition scores. We also identified specific perceptual consonant confusion groups depending on the place in the words (onset vs. coda). Finally our data suggested that listeners may access some acoustic cues of the CV transition, opening interesting perspectives for future studies.
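
    The relation between distance and SNR quoted in this abstract can be checked with a short calculation, assuming plain spherical spreading (a 20 log10(d) dB drop relative to the 1 m reference) and a constant noise level. The noise level below is not stated in the abstract; it is inferred from the reported SNR at 11 m, and the function names are hypothetical.

```python
import math

SOURCE_LEVEL_DB = 65.3   # speech level at the 1 m reference (from the abstract)
SNR_AT_11_M_DB = -8.8    # SNR reported at the nearest distance (from the abstract)

def spherical_attenuation_db(distance_m, reference_m=1.0):
    # Level drop caused by spherical spreading between the reference and distance_m.
    return 20.0 * math.log10(distance_m / reference_m)

# Assumption: the noise level is constant and is inferred from the 11 m SNR.
NOISE_LEVEL_DB = SOURCE_LEVEL_DB - spherical_attenuation_db(11.0) - SNR_AT_11_M_DB

def snr_db(distance_m):
    speech_level = SOURCE_LEVEL_DB - spherical_attenuation_db(distance_m)
    return speech_level - NOISE_LEVEL_DB

for d in (11, 22, 33):
    print(f"{d:2d} m: SNR = {snr_db(d):6.2f} dB")
```

    Under these assumptions, tripling the distance from 11 m to 33 m lowers the SNR by 20 log10(3), about 9.5 dB, which lands within roughly 0.1 dB of the -18.4 dB figure reported for the farthest distance.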

  8. Specific acoustic models for spontaneous and dictated style in indonesian speech recognition

    Science.gov (United States)

    Vista, C. B.; Satriawan, C. H.; Lestari, D. P.; Widyantoro, D. H.

    2018-03-01

    The performance of an automatic speech recognition system is affected by differences in speech style between the data the model is originally trained upon and incoming speech to be recognized. In this paper, the usage of GMM-HMM acoustic models for specific speech styles is investigated. We develop two systems for the experiments; the first employs a speech style classifier to predict the speech style of incoming speech, either spontaneous or dictated, then decodes this speech using an acoustic model specifically trained for that speech style. The second system uses both acoustic models to recognise incoming speech and decides upon a final result by calculating a confidence score of decoding. Results show that training specific acoustic models for spontaneous and dictated speech styles confers a slight recognition advantage as compared to a baseline model trained on a mixture of spontaneous and dictated training data. In addition, the speech style classifier approach of the first system produced slightly more accurate results than the confidence scoring employed in the second system.

  9. Silent Speech Recognition as an Alternative Communication Device for Persons with Laryngectomy.

    Science.gov (United States)

    Meltzner, Geoffrey S; Heaton, James T; Deng, Yunbin; De Luca, Gianluca; Roy, Serge H; Kline, Joshua C

    2017-12-01

    Each year thousands of individuals require surgical removal of their larynx (voice box) due to trauma or disease, and thereby require an alternative voice source or assistive device to verbally communicate. Although natural voice is lost after laryngectomy, most muscles controlling speech articulation remain intact. Surface electromyographic (sEMG) activity of speech musculature can be recorded from the neck and face, and used for automatic speech recognition to provide speech-to-text or synthesized speech as an alternative means of communication. This is true even when speech is mouthed or spoken in a silent (subvocal) manner, making it an appropriate communication platform after laryngectomy. In this study, 8 individuals at least 6 months after total laryngectomy were recorded using 8 sEMG sensors on their face (4) and neck (4) while reading phrases constructed from a 2,500-word vocabulary. A unique set of phrases were used for training phoneme-based recognition models for each of the 39 commonly used phonemes in English, and the remaining phrases were used for testing word recognition of the models based on phoneme identification from running speech. Word error rates were on average 10.3% for the full 8-sensor set (averaging 9.5% for the top 4 participants), and 13.6% when reducing the sensor set to 4 locations per individual (n=7). This study provides a compelling proof-of-concept for sEMG-based alaryngeal speech recognition, with the strong potential to further improve recognition performance.

  10. Optimal pattern synthesis for speech recognition based on principal component analysis

    Science.gov (United States)

    Korsun, O. N.; Poliyev, A. V.

    2018-02-01

    An algorithm for building an optimal pattern for automatic speech recognition, which increases the probability of correct recognition, is developed and presented in this work. The optimal pattern is formed by decomposing an initial pattern into principal components, which makes it possible to reduce the dimension of the multi-parameter optimization problem. At the next step, the training samples are introduced and the optimal estimates of the principal component decomposition coefficients are obtained by a numerical parameter optimization algorithm. Finally, we consider experimental results that show the improvement in speech recognition introduced by the proposed optimization algorithm.

  11. System for automatic crate recognition

    Directory of Open Access Journals (Sweden)

    Radovan Kukla

    2012-01-01

    Full Text Available This contribution describes the use of computer vision and artificial intelligence methods to address abuse of reverse vending machines. The topic was addressed as an innovation voucher for the South Moravian Region. It was developed by Mendel University in Brno (Department of Informatics, Faculty of Business and Economics, and Department of Agricultural, Food and Environmental Engineering, Faculty of Agronomy) together with the Czech subsidiary of Tomra. The project focuses on the possibility of integrating industrial cameras and computers to recognize crates in the reverse vending machine. The aim was an effective security system able to prevent financial losses in the hundreds of thousands. The products ControlWeb and VisionLab, developed by Moravian Instruments Inc., were chosen as a suitable development and runtime platform.

  12. Automatic Speaker Recognition for Mobile Forensic Applications

    Directory of Open Access Journals (Sweden)

    Mohammed Algabri

    2017-01-01

    Full Text Available Presently, lawyers, law enforcement agencies, and judges in courts use speech and other biometric features to recognize suspects. In general, speaker recognition is used for discriminating people based on their voices. The process of determining whether a suspected speaker is the source of a trace is called forensic speaker recognition. In such applications, the voice samples are most probably noisy, the recording sessions might mismatch each other, the sessions might not contain sufficient recording for recognition purposes, and the suspect voices are recorded through a mobile channel. The identification of a person through his voice within a forensic quality context is challenging. In this paper, we propose a method for forensic speaker recognition for the Arabic language; the King Saud University Arabic Speech Database is used for obtaining experimental results. The advantage of this database is that each speaker's voice is recorded in both clean and noisy environments, through a microphone and a mobile channel. This diversity facilitates its usage in forensic experimentation. Mel-Frequency Cepstral Coefficients are used for feature extraction and the Gaussian mixture model-universal background model is used for speaker modeling. Our approach has shown low equal error rates (EER) within noisy environments and with very short test samples.
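
    The equal error rate reported in this abstract is the operating point where the false acceptance rate equals the false rejection rate. The following is a minimal sketch of how an EER can be estimated from verification scores; the function name and the simple threshold sweep are assumptions for illustration, not the evaluation code used in the paper.

```python
import numpy as np

def equal_error_rate(genuine_scores, impostor_scores):
    """Estimate the EER by sweeping the decision threshold over all observed scores."""
    genuine = np.asarray(genuine_scores, dtype=float)
    impostor = np.asarray(impostor_scores, dtype=float)
    best_gap, best_eer = None, None
    for threshold in np.sort(np.concatenate([genuine, impostor])):
        far = np.mean(impostor >= threshold)  # impostors wrongly accepted
        frr = np.mean(genuine < threshold)    # genuine speakers wrongly rejected
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, best_eer = gap, (far + frr) / 2.0
    return float(best_eer)
```

    With a finite score set the two rates rarely cross exactly, so the sketch returns their average at the threshold where they are closest.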

  13. Image-based automatic recognition of larvae

    Science.gov (United States)

    Sang, Ru; Yu, Guiying; Fan, Weijun; Guo, Tiantai

    2010-08-01

    Research on quarantine pest recognition has so far focused mainly on imagoes. However, pests in their larval stage are latent, and larvae spread much more easily with the circulation of agricultural and forest products. This paper presents larvae as new research objects, recognized by means of machine vision, image processing and pattern recognition. Applying color image segmentation to the images of larvae preserves more visual information and improves the recognition rate. Because of its affine invariance, perspective invariance and brightness invariance, the scale-invariant feature transform (SIFT) is adopted for feature extraction. A neural network algorithm is used for pattern recognition, and the automatic identification of larvae images is achieved with satisfactory results.

  14. Speech-based recognition of self-reported and observed emotion in a dimensional space

    NARCIS (Netherlands)

    Truong, Khiet Phuong; van Leeuwen, David A.; de Jong, Franciska M.G.

    2012-01-01

    The differences between self-reported and observed emotion have only marginally been investigated in the context of speech-based automatic emotion recognition. We address this issue by comparing self-reported emotion ratings to observed emotion ratings and look at how differences between these two

  15. Dealing with Phrase Level Co-Articulation (PLC) in speech recognition: a first approach

    NARCIS (Netherlands)

    Ordelman, Roeland J.F.; van Hessen, Adrianus J.; van Leeuwen, David A.; Robinson, Tony; Renals, Steve

    1999-01-01

    Whereas nowadays within-word co-articulation effects are usually sufficiently dealt with in automatic speech recognition, this is not always the case with phrase level co-articulation effects (PLC). This paper describes a first approach in dealing with phrase level co-articulation by applying these

  16. Towards Contactless Silent Speech Recognition Based on Detection of Active and Visible Articulators Using IR-UWB Radar.

    Science.gov (United States)

    Shin, Young Hoon; Seo, Jiwon

    2016-10-29

    People with hearing or speaking disabilities are deprived of the benefits of conventional speech recognition technology because it is based on acoustic signals. Recent research has focused on silent speech recognition systems that are based on the motions of a speaker's vocal tract and articulators. Because most silent speech recognition systems use contact sensors that are very inconvenient to users or optical systems that are susceptible to environmental interference, a contactless and robust solution is hence required. Toward this objective, this paper presents a series of signal processing algorithms for a contactless silent speech recognition system using an impulse radio ultra-wide band (IR-UWB) radar. The IR-UWB radar is used to remotely and wirelessly detect motions of the lips and jaw. In order to extract the necessary features of lip and jaw motions from the received radar signals, we propose a feature extraction algorithm. The proposed algorithm noticeably improved speech recognition performance compared to the existing algorithm during our word recognition test with five speakers. We also propose a speech activity detection algorithm to automatically select speech segments from continuous input signals. Thus, speech recognition processing is performed only when speech segments are detected. Our testbed consists of commercial off-the-shelf radar products, and the proposed algorithms are readily applicable without designing specialized radar hardware for silent speech processing.

  17. Speech and audio processing for coding, enhancement and recognition

    CERN Document Server

    Togneri, Roberto; Narasimha, Madihally

    2015-01-01

    This book describes the basic principles underlying the generation, coding, transmission and enhancement of speech and audio signals, including advanced statistical and machine learning techniques for speech and speaker recognition with an overview of the key innovations in these areas. Key research undertaken in speech coding, speech enhancement, speech recognition, emotion recognition and speaker diarization are also presented, along with recent advances and new paradigms in these areas. Offers readers a single-source reference on the significant applications of speech and audio processing to speech coding, speech enhancement and speech/speaker recognition. Enables readers involved in algorithm development and implementation issues for speech coding to understand the historical development and future challenges in speech coding research. Discusses speech coding methods yielding bit-streams that are multi-rate and scalable for Voice-over-IP (VoIP) Networks. ...

  18. Speech Recognition and Cognitive Skills in Bimodal Cochlear Implant Users

    Science.gov (United States)

    Hua, Håkan; Johansson, Björn; Magnusson, Lennart; Lyxell, Björn; Ellis, Rachel J.

    2017-01-01

    Purpose: To examine the relation between speech recognition and cognitive skills in bimodal cochlear implant (CI) and hearing aid users. Method: Seventeen bimodal CI users (28-74 years) were recruited to the study. Speech recognition tests were carried out in quiet and in noise. The cognitive tests employed included the Reading Span Test and the…

  19. Deep Complementary Bottleneck Features for Visual Speech Recognition

    NARCIS (Netherlands)

    Petridis, Stavros; Pantic, Maja

    Deep bottleneck features (DBNFs) have been used successfully in the past for acoustic speech recognition from audio. However, research on extracting DBNFs for visual speech recognition is very limited. In this work, we present an approach to extract deep bottleneck visual features based on deep

  20. Features Speech Signature Image Recognition on Mobile Devices

    Directory of Open Access Journals (Sweden)

    Alexander Mikhailovich Alyushin

    2015-12-01

    Full Text Available Algorithms for the recognition and processing of dynamic spectrogram images and sound speech signatures (SS) were developed. Software for mobile phones that can recognize speech signatures was prepared. The SS recognition speed was investigated for different boundary types. Recommendations were given on choosing the boundary type to balance recognition speed against required space.

  1. Automatic transcription of continuous speech into syllable-like units ...

    Indian Academy of Sciences (India)

    style HMM models are generated for each of the clusters during training. During testing .... manual segmentation at syllable-like units followed by isolated style recognition of continuous speech ..... obtaining demisyllabic reference patterns.

  2. Emotion Recognition of Speech Signals Based on Filter Methods

    Directory of Open Access Journals (Sweden)

    Narjes Yazdanian

    2016-10-01

    Full Text Available Speech is the basic means of communication among human beings. With the increase in interaction between humans and machines, the need for automatic dialogue without a human in the loop has been considered. The aim of this study was to determine a set of affective features of the speech signal for emotion recognition. A system was designed that includes three main sections: feature extraction, feature selection, and classification. After extracting useful features such as mel-frequency cepstral coefficients (MFCC), linear prediction cepstral coefficients (LPC), perceptual linear prediction coefficients (PLP), formant frequency, zero crossing rate, cepstral coefficients, pitch frequency, mean, jitter, shimmer, energy, minimum, maximum, amplitude, and standard deviation, filter methods such as the Pearson correlation coefficient, t-test, Relief, and information gain were used to rank and select the features that are effective in emotion recognition. The selected subset is then given to the classification system as input. In the classification stage, a multi-class support vector machine is used to classify seven types of emotion. According to the results, the Relief method, together with the multi-class support vector machine, gives the highest classification accuracy, with an emotion recognition rate of 93.94%.
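
    A filter method scores each feature independently of the classifier and keeps the highest-ranked ones. The sketch below illustrates the idea with the Pearson correlation coefficient, one of the filters named in the abstract. It assumes a numeric target (for instance a binary or ordinal label), which is a simplification; the function name is hypothetical and this is not the authors' code.

```python
import numpy as np

def rank_features_by_pearson(X, y):
    """Rank feature columns of X by |Pearson correlation| with the target y."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    numerator = (Xc * yc[:, None]).sum(axis=0)
    denominator = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())
    # Constant features have zero variance; give them a correlation of 0.
    r = numerator / np.where(denominator == 0, 1.0, denominator)
    order = np.argsort(-np.abs(r))  # most strongly correlated feature first
    return order, r
```

    A selection step would then keep the first k indices of `order` and pass only those columns to the classifier.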

  3. Neural speech recognition: continuous phoneme decoding using spatiotemporal representations of human cortical activity

    Science.gov (United States)

    Moses, David A.; Mesgarani, Nima; Leonard, Matthew K.; Chang, Edward F.

    2016-10-01

    Objective. The superior temporal gyrus (STG) and neighboring brain regions play a key role in human language processing. Previous studies have attempted to reconstruct speech information from brain activity in the STG, but few of them incorporate the probabilistic framework and engineering methodology used in modern speech recognition systems. In this work, we describe the initial efforts toward the design of a neural speech recognition (NSR) system that performs continuous phoneme recognition on English stimuli with arbitrary vocabulary sizes using the high gamma band power of local field potentials in the STG and neighboring cortical areas obtained via electrocorticography. Approach. The system implements a Viterbi decoder that incorporates phoneme likelihood estimates from a linear discriminant analysis model and transition probabilities from an n-gram phonemic language model. Grid searches were used in an attempt to determine optimal parameterizations of the feature vectors and Viterbi decoder. Main results. The performance of the system was significantly improved by using spatiotemporal representations of the neural activity (as opposed to purely spatial representations) and by including language modeling and Viterbi decoding in the NSR system. Significance. These results emphasize the importance of modeling the temporal dynamics of neural responses when analyzing their variations with respect to varying stimuli and demonstrate that speech recognition techniques can be successfully leveraged when decoding speech from neural signals. Guided by the results detailed in this work, further development of the NSR system could have applications in the fields of automatic speech recognition and neural prosthetics.
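
    The decoding step described in this abstract combines per-frame phoneme likelihoods with language-model transition probabilities. A minimal log-domain Viterbi sketch is given below, assuming a bigram transition matrix and hypothetical array names; the paper's actual decoder, n-gram order, and parameterization are not reproduced here.

```python
import numpy as np

def viterbi(log_likelihoods, log_transitions, log_initial):
    """Most likely state sequence.

    log_likelihoods: (T, N) log p(observation_t | state)
    log_transitions: (N, N) log p(next state j | current state i), indexed [i, j]
    log_initial:     (N,)   log p(first state)
    """
    T, N = log_likelihoods.shape
    score = np.full((T, N), -np.inf)
    back = np.zeros((T, N), dtype=int)
    score[0] = log_initial + log_likelihoods[0]
    for t in range(1, T):
        # candidates[i, j]: best score ending in state i at t-1, then moving to j.
        candidates = score[t - 1][:, None] + log_transitions
        back[t] = np.argmax(candidates, axis=0)
        score[t] = np.max(candidates, axis=0) + log_likelihoods[t]
    # Trace the best path backwards from the best final state.
    path = [int(np.argmax(score[-1]))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```

    The transition term is what lets the decoder override a weak frame-level decision: a frame whose likelihood slightly favours another state can still be assigned to the state its neighbours support.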

  4. Hybrid methodological approach to context-dependent speech recognition

    Directory of Open Access Journals (Sweden)

    Dragiša Mišković

    2017-01-01

    Full Text Available Although the importance of contextual information in speech recognition has been acknowledged for a long time now, it has remained clearly underutilized even in state-of-the-art speech recognition systems. This article introduces a novel, methodologically hybrid approach to the research question of context-dependent speech recognition in human–machine interaction. To the extent that it is hybrid, the approach integrates aspects of both statistical and representational paradigms. We extend the standard statistical pattern-matching approach with a cognitively inspired and analytically tractable model with explanatory power. This methodological extension allows for accounting for contextual information which is otherwise unavailable in speech recognition systems, and using it to improve post-processing of recognition hypotheses. The article introduces an algorithm for evaluation of recognition hypotheses, illustrates it for concrete interaction domains, and discusses its implementation within two prototype conversational agents.

  5. Speech recognition using articulatory and excitation source features

    CERN Document Server

    Rao, K Sreenivasa

    2017-01-01

    This book discusses the contribution of articulatory and excitation source information in discriminating sound units. The authors focus on excitation source component of speech -- and the dynamics of various articulators during speech production -- for enhancement of speech recognition (SR) performance. Speech recognition is analyzed for read, extempore, and conversation modes of speech. Five groups of articulatory features (AFs) are explored for speech recognition, in addition to conventional spectral features. Each chapter provides the motivation for exploring the specific feature for SR task, discusses the methods to extract those features, and finally suggests appropriate models to capture the sound unit specific knowledge from the proposed features. The authors close by discussing various combinations of spectral, articulatory and source features, and the desired models to enhance the performance of SR systems.

  6. Noise-robust speech recognition through auditory feature detection and spike sequence decoding.

    Science.gov (United States)

    Schafer, Phillip B; Jin, Dezhe Z

    2014-03-01

    Speech recognition in noisy conditions is a major challenge for computer systems, but the human brain performs it routinely and accurately. Automatic speech recognition (ASR) systems that are inspired by neuroscience can potentially bridge the performance gap between humans and machines. We present a system for noise-robust isolated word recognition that works by decoding sequences of spikes from a population of simulated auditory feature-detecting neurons. Each neuron is trained to respond selectively to a brief spectrotemporal pattern, or feature, drawn from the simulated auditory nerve response to speech. The neural population conveys the time-dependent structure of a sound by its sequence of spikes. We compare two methods for decoding the spike sequences--one using a hidden Markov model-based recognizer, the other using a novel template-based recognition scheme. In the latter case, words are recognized by comparing their spike sequences to template sequences obtained from clean training data, using a similarity measure based on the length of the longest common sub-sequence. Using isolated spoken digits from the AURORA-2 database, we show that our combined system outperforms a state-of-the-art robust speech recognizer at low signal-to-noise ratios. Both the spike-based encoding scheme and the template-based decoding offer gains in noise robustness over traditional speech recognition methods. Our system highlights potential advantages of spike-based acoustic coding and provides a biologically motivated framework for robust ASR development.
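
    The template-based scheme in this abstract compares spike sequences using the length of their longest common subsequence (LCS). A minimal sketch of that comparison follows. The normalization by the longer sequence length and the nearest-template decision rule are assumptions for illustration, since the abstract does not specify them, and the names are hypothetical.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of sequences a and b."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def recognize(spike_sequence, templates):
    """Return the label of the template most similar to spike_sequence.

    templates maps a word label to a sequence of neuron indices (assumed format).
    """
    def similarity(a, b):
        # Assumed normalization: divide by the longer sequence length.
        return lcs_length(a, b) / max(len(a), len(b), 1)
    return max(templates, key=lambda word: similarity(spike_sequence, templates[word]))
```

    Because a subsequence tolerates inserted and missing elements, spurious or dropped spikes lower the score only gradually, which is one plausible reading of why such a measure helps in noise.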

  7. SPEECH EMOTION RECOGNITION USING MODIFIED QUADRATIC DISCRIMINATION FUNCTION

    Institute of Scientific and Technical Information of China (English)

    2008-01-01

    Quadratic Discrimination Function (QDF) is commonly used in speech emotion recognition, which proceeds on the premise that the input data is normally distributed. In this paper, we propose a transformation to normalize the emotional features, then derive a Modified QDF (MQDF) for speech emotion recognition. Features based on prosody and voice quality are extracted, and a Principal Component Analysis Neural Network (PCANN) is used to reduce the dimension of the feature vectors. The results show that voice quality features are an effective supplement for recognition, and the method in this paper could improve the recognition ratio effectively.

  8. A methodology of error detection: Improving speech recognition in radiology

    OpenAIRE

    Voll, Kimberly Dawn

    2006-01-01

    Automated speech recognition (ASR) in radiology report dictation demands highly accurate and robust recognition software. Despite vendor claims, current implementations are suboptimal, leading to poor accuracy, and time and money wasted on proofreading. Thus, other methods must be considered for increasing the reliability and performance of ASR before it is a viable alternative to human transcription. One such method is post-ASR error detection, used to recover from the inaccuracy of speech r...

  9. Source Separation via Spectral Masking for Speech Recognition Systems

    Directory of Open Access Journals (Sweden)

    Gustavo Fernandes Rodrigues

    2012-12-01

    Full Text Available In this paper we present an insight into the use of spectral masking techniques in the time-frequency domain as a preprocessing step for speech signal recognition. Speech recognition systems have their performance negatively affected in noisy environments or in the presence of other speech signals. The limits of these masking techniques for different levels of the signal-to-noise ratio are discussed. We show the robustness of the spectral masking techniques against four types of noise: white, pink, brown and human speech noise (bubble noise). The main contribution of this work is to analyze the performance limits of recognition systems using spectral masking. We obtain an increase of 18% in the speech hit rate when the speech signals were corrupted by other speech signals or bubble noise, with signal-to-noise ratios of approximately 1, 10 and 20 dB. On the other hand, applying the ideal binary masks to mixtures corrupted by white, pink and brown noise results in an average increase of 9% in the speech hit rate, with the same signal-to-noise ratios. The experimental results suggest that the spectral masking techniques are more suitable for bubble noise, which is produced by human speech, than for white, pink and brown noise.
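
    An ideal binary mask keeps the time-frequency cells where the target speech dominates the interference and discards the rest. It is "ideal" because it needs the clean speech and noise spectrograms separately, which are only available in controlled experiments. The sketch below is a minimal illustration with a 0 dB local SNR criterion; the function name, the threshold, and the array layout are assumptions, not details taken from the paper.

```python
import numpy as np

def ideal_binary_mask(speech_mag, noise_mag, threshold_db=0.0):
    """1.0 where the local SNR exceeds threshold_db, else 0.0.

    speech_mag, noise_mag: magnitude spectrograms of identical shape
    (assumed layout: frequency bins x frames).
    """
    eps = 1e-12  # avoids division by zero and log of zero
    local_snr_db = 20.0 * np.log10((speech_mag + eps) / (noise_mag + eps))
    return (local_snr_db > threshold_db).astype(float)

def apply_mask(mixture_mag, mask):
    # Masked magnitudes would be combined with the mixture phase before resynthesis.
    return mixture_mag * mask
```

    The masked spectrogram, or the signal resynthesized from it, is what a preprocessing stage of this kind would hand to the recognizer.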

  10. Automatic discrimination between laughter and speech

    NARCIS (Netherlands)

    Truong, K.; Leeuwen, D. van

    2007-01-01

    Emotions can be recognized by audible paralinguistic cues in speech. By detecting these paralinguistic cues that can consist of laughter, a trembling voice, coughs, changes in the intonation contour etc., information about the speaker’s state and emotion can be revealed. This paper describes the

  11. Towards automatic musical instrument timbre recognition

    Science.gov (United States)

    Park, Tae Hong

    This dissertation comprises two parts, focusing on the research and development of an artificial system for automatic musical instrument timbre recognition and on musical compositions. The technical part of the essay includes a detailed record of developed and implemented algorithms for feature extraction and pattern recognition. A review of existing literature introducing historical aspects surrounding timbre research, problems associated with a number of timbre definitions, and highlights of selected research activities that have had significant impact in this field is also included. The developed timbre recognition system follows a bottom-up, data-driven model that includes a pre-processing module, feature extraction module, and a RBF/EBF (Radial/Elliptical Basis Function) neural network-based pattern recognition module. 829 monophonic samples from 12 instruments have been chosen from the Peter Siedlaczek library (Best Service) and other samples from the Internet and personal collections. Significant emphasis has been put on feature extraction development and testing to achieve robust and consistent feature vectors that are eventually passed to the neural network module. In order to avoid a garbage-in-garbage-out (GIGO) trap and improve generality, extra care was taken in designing and testing the developed algorithms using various dynamics, different playing techniques, and a variety of pitches for each instrument with inclusion of attack and steady-state portions of a signal. Most of the research and development was conducted in Matlab. The compositional part of the essay includes brief introductions to "A d'Ess Are," "Aboji," "48 13 N, 16 20 O," and "pH-SQ." A general outline of the ideas and concepts behind the architectural designs of the pieces, including formal structures, time structures, orchestration methods, and pitch structures, is also presented.

  12. Use of digital speech recognition in diagnostics radiology

    International Nuclear Information System (INIS)

    Arndt, H.; Stockheim, D.; Mutze, S.; Petersein, J.; Gregor, P.; Hamm, B.

    1999-01-01

    Purpose: Applicability and benefits of digital speech recognition in diagnostic radiology were tested using the speech recognition system SP 6000. Methods: The speech recognition system SP 6000 was integrated into the network of the institute and connected to the existing Radiological Information System (RIS). Three subjects used this system for writing 2305 findings from dictation. After the recognition process the date, length of dictation, time required for checking/correction, kind of examination and error rate were recorded for every dictation. With the same subjects, a correlation was performed with 625 conventionally written findings. Results: After a 1-hour initial training the average error rates were 8.4 to 13.3%. The first adaptation of the speech recognition system (after nine days) decreased the average error rates to 2.4 to 10.7% due to the ability of the program to learn. The 2nd and 3rd adaptations resulted only in small changes of the error rate. An individual comparison of the error rate developments in the same kind of investigation showed the relative independence of the error rate from the individual user. Conclusion: The results show that the speech recognition system SP 6000 can be evaluated as an advantageous alternative for quickly recording radiological findings. A comparison between manually writing and dictating the findings verifies the individual differences of the writing speeds and shows the advantage of the application of voice recognition when faced with normal keyboard performance. (orig.)

  13. Histogram Equalization to Model Adaptation for Robust Speech Recognition

    Directory of Open Access Journals (Sweden)

    Suh Youngjoo

    2010-01-01

    We propose a new model adaptation method based on the histogram equalization technique for providing robustness in noisy environments. The trained acoustic mean models of a speech recognizer are adapted into environmentally matched conditions by using the histogram equalization algorithm on a single utterance basis. For more robust speech recognition in heavily noisy conditions, trained acoustic covariance models are efficiently adapted by signal-to-noise ratio-dependent linear interpolation between trained covariance models and utterance-level sample covariance models. Speech recognition experiments on both the digit-based Aurora2 task and the large vocabulary-based task showed that the proposed model adaptation approach provides significant performance improvements compared to the baseline speech recognizer trained on the clean speech data.

  14. Garbage Modeling for On-device Speech Recognition

    NARCIS (Netherlands)

    Van Gysel, C.; Velikovich, L.; McGraw, I.; Beaufays, F.

    2015-01-01

    User interactions with mobile devices increasingly depend on voice as a primary input modality. Due to the disadvantages of sending audio across potentially spotty network connections for speech recognition, in recent years there has been growing attention to performing recognition on-device. The

  15. Lexicon Optimization for Dutch Speech Recognition in Spoken Document Retrieval

    NARCIS (Netherlands)

    Ordelman, Roeland J.F.; van Hessen, Adrianus J.; de Jong, Franciska M.G.

    In this paper, ongoing work concerning the language modelling and lexicon optimization of a Dutch speech recognition system for Spoken Document Retrieval is described: the collection and normalization of a training data set and the optimization of our recognition lexicon. Effects on lexical coverage

  16. Lexicon optimization for Dutch speech recognition in spoken document retrieval

    NARCIS (Netherlands)

    Ordelman, Roeland J.F.; van Hessen, Adrianus J.; de Jong, Franciska M.G.; Dalsgaard, P.; Lindberg, B.; Benner, H.

    2001-01-01

    In this paper, ongoing work concerning the language modelling and lexicon optimization of a Dutch speech recognition system for Spoken Document Retrieval is described: the collection and normalization of a training data set and the optimization of our recognition lexicon. Effects on lexical coverage

  17. Image simulation for automatic license plate recognition

    Science.gov (United States)

    Bala, Raja; Zhao, Yonghui; Burry, Aaron; Kozitsky, Vladimir; Fillion, Claude; Saunders, Craig; Rodríguez-Serrano, José

    2012-01-01

    Automatic license plate recognition (ALPR) is an important capability for traffic surveillance applications, including toll monitoring and detection of different types of traffic violations. ALPR is a multi-stage process comprising plate localization, character segmentation, optical character recognition (OCR), and identification of originating jurisdiction (i.e. state or province). Training of an ALPR system for a new jurisdiction typically involves gathering vast amounts of license plate images and associated ground truth data, followed by iterative tuning and optimization of the ALPR algorithms. The substantial time and effort required to train and optimize the ALPR system can result in excessive operational cost and overhead. In this paper we propose a framework to create an artificial set of license plate images for accelerated training and optimization of ALPR algorithms. The framework comprises two steps: the synthesis of license plate images according to the design and layout for a jurisdiction of interest; and the modeling of imaging transformations and distortions typically encountered in the image capture process. Distortion parameters are estimated by measurements of real plate images. The simulation methodology is successfully demonstrated for training of OCR.

  18. Robust Recognition of Loud and Lombard speech in the Fighter Cockpit Environment

    Science.gov (United States)

    1988-08-01

    the latter as inter-speaker variability. According to Zue [Z85], inter-speaker variabilities can be attributed to sociolinguistic background, dialect..." Journal of the Acoustical Society of America, Vol 50, 1971. [At74] B. S. Atal, "Linear prediction for speaker identification," Journal of the Acoustical...Society of America, Vol 55, 1974. [B77] B. Beek, E. P. Neuberg, and D. C. Hodge, "An Assessment of the Technology of Automatic Speech Recognition for

  19. Narrowing the gap between automatic and human word recognition

    NARCIS (Netherlands)

    Scharenborg, O.E.

    2005-01-01

    In everyday life, speech is all around us, on the radio, television, and in human-human interaction. We are continually confronted with novel utterances, and usually we have no problem recognising and understanding them. Several research fields investigate the speech recognition process. This thesis

  20. Four-Channel Biosignal Analysis and Feature Extraction for Automatic Emotion Recognition

    Science.gov (United States)

    Kim, Jonghwa; André, Elisabeth

    This paper investigates the potential of physiological signals as a reliable channel for automatic recognition of a user's emotional state. For emotion recognition, little attention has been paid so far to physiological signals compared to audio-visual emotion channels such as facial expression or speech. All essential stages of an automatic recognition system using biosignals are discussed, from recording a physiological dataset up to feature-based multiclass classification. Four-channel biosensors are used to measure electromyogram, electrocardiogram, skin conductivity and respiration changes. A wide range of physiological features from various analysis domains, including time/frequency, entropy, geometric analysis, subband spectra, multiscale entropy, etc., is proposed in order to search for the best emotion-relevant features and to correlate them with emotional states. The best features extracted are specified in detail and their effectiveness is proven by emotion recognition results.

  1. Parametric Representation of the Speaker's Lips for Multimodal Sign Language and Speech Recognition

    Science.gov (United States)

    Ryumin, D.; Karpov, A. A.

    2017-05-01

    In this article, we propose a new method for parametric representation of the human lip region. The functional diagram of the method is described, and implementation details are given with an explanation of its key stages and features. The results of automatic detection of the regions of interest are illustrated. The processing speed of the method on several computers with different performance levels is reported. This universal method allows the parametric representation of the speaker's lips to be applied to tasks in biometrics, computer vision, machine learning, and automatic recognition of faces, elements of sign languages, and audio-visual speech, including lip-reading.

  2. Comparison of HMM and DTW methods in automatic recognition of pathological phoneme pronunciation

    OpenAIRE

    Wielgat, Robert; Zielinski, Tomasz P.; Swietojanski, Pawel; Zoladz, Piotr; Król, Daniel; Wozniak, Tomasz; Grabias, Stanislaw

    2007-01-01

    In this paper, the recently proposed Human Factor Cepstral Coefficients (HFCC) are used for automatic recognition of pathological phoneme pronunciation in the speech of impaired children, and the efficiency of this approach is compared to the use of the standard Mel-Frequency Cepstral Coefficients (MFCC) as a feature vector. Both dynamic time warping (DTW), working on whole words or embedded phoneme patterns, and hidden Markov models (HMM) are used as classifiers in the presented research. Obtained resul...
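    As a rough illustration of how DTW can act as a template-matching classifier, the sketch below aligns two sequences with the classic dynamic-programming recurrence and assigns a sample to the nearest template. It is a simplified assumption, not the setup of the paper: the frames here are scalars with an absolute-difference local cost, whereas a real system would compare HFCC or MFCC vectors, and the names `dtw_distance` and `classify` are hypothetical.

```python
def dtw_distance(a, b):
    """Dynamic time warping distance between two scalar sequences.

    cost[i][j] holds the minimal accumulated cost of aligning the
    first i frames of a with the first j frames of b.
    """
    inf = float("inf")
    n, m = len(a), len(b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])   # assumed local frame cost
            cost[i][j] = local + min(cost[i - 1][j],      # a advances alone
                                     cost[i][j - 1],      # b advances alone
                                     cost[i - 1][j - 1])  # both advance
    return cost[n][m]


def classify(sample, templates):
    """Return the label of the template with the smallest DTW distance."""
    return min(templates, key=lambda label: dtw_distance(sample, templates[label]))
```

    Because the warping path may repeat frames, a slowly spoken version of a template still aligns with zero cost, which is the property that makes DTW attractive for whole-word matching.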

  3. ISOLATED SPEECH RECOGNITION SYSTEM FOR TAMIL LANGUAGE USING STATISTICAL PATTERN MATCHING AND MACHINE LEARNING TECHNIQUES

    Directory of Open Access Journals (Sweden)

    VIMALA C.

    2015-05-01

    In recent years, speech technology has become a vital part of our daily lives. Various techniques have been proposed for developing Automatic Speech Recognition (ASR) systems and have achieved great success in many applications. Among them, Template Matching techniques like Dynamic Time Warping (DTW), Statistical Pattern Matching techniques such as Hidden Markov Model (HMM) and Gaussian Mixture Models (GMM), and Machine Learning techniques such as Neural Networks (NN), Support Vector Machine (SVM), and Decision Trees (DT) are most popular. The main objective of this paper is to design and develop a speaker-independent isolated speech recognition system for Tamil language using the above speech recognition techniques. The background of ASR systems, the steps involved in ASR, merits and demerits of the conventional and machine learning algorithms and the observations made based on the experiments are presented in this paper. For the above developed system, highest word recognition accuracy is achieved with HMM technique. It offered 100% accuracy during training process and 97.92% for testing process.

  4. Speed and automaticity of word recognition - inseparable twins?

    DEFF Research Database (Denmark)

    Poulsen, Mads; Asmussen, Vibeke; Elbro, Carsten

    'Speed and automaticity' of word recognition is a standard collocation. However, it is not clear whether speed and automaticity (i.e., effortlessness) make independent contributions to reading comprehension. In theory, both speed and automaticity may save cognitive resources for comprehension...... processes. Hence, the aim of the present study was to assess the unique contributions of word recognition speed and automaticity to reading comprehension while controlling for decoding speed and accuracy. Method: 139 Grade 5 students completed tests of reading comprehension and computer-based tests of speed...... of decoding and word recognition together with a test of effortlessness (automaticity) of word recognition. Effortlessness was measured in a dual task in which participants were presented with a word enclosed in an unrelated figure. The task was to read the word and decide whether the figure was a triangle...

  5. Recognition of Emotions in Mexican Spanish Speech: An Approach Based on Acoustic Modelling of Emotion-Specific Vowels

    Directory of Open Access Journals (Sweden)

    Santiago-Omar Caballero-Morales

    2013-01-01

    An approach for the recognition of emotions in speech is presented. The target language is Mexican Spanish, and for this purpose a speech database was created. The approach consists in the phoneme acoustic modelling of emotion-specific vowels. For this, a standard phoneme-based Automatic Speech Recognition (ASR) system was built with Hidden Markov Models (HMMs), where different phoneme HMMs were built for the consonants and emotion-specific vowels associated with four emotional states (anger, happiness, neutral, sadness). Then, estimation of the emotional state from a spoken sentence is performed by counting the number of emotion-specific vowels found in the ASR's output for the sentence. With this approach, accuracy of 87–100% was achieved for the recognition of emotional state of Mexican Spanish speech.
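    The final counting step described in this abstract can be sketched as follows. The phoneme labelling is purely hypothetical: it assumes the recognizer emits emotion-specific vowels as labels such as "a_anger" and consonants as plain labels such as "k", and it breaks ties by the order of the `EMOTIONS` tuple. Neither convention comes from the paper.

```python
from collections import Counter

# The four emotional states listed in the abstract.
EMOTIONS = ("anger", "happiness", "neutral", "sadness")


def estimate_emotion(phonemes):
    """Pick the emotion whose tagged vowels occur most often.

    phonemes is the recognizer output as a list of labels; labels of
    the assumed form "<vowel>_<emotion>" are counted, all others are
    ignored. Returns None when no emotion-specific vowel is present.
    """
    counts = Counter()
    for label in phonemes:
        _, sep, tag = label.partition("_")
        if sep and tag in EMOTIONS:
            counts[tag] += 1
    if not counts:
        return None
    # Ties are resolved in favour of the earlier entry in EMOTIONS.
    return max(EMOTIONS, key=lambda emotion: counts[emotion])
```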

  6. Evaluating deep learning architectures for Speech Emotion Recognition.

    Science.gov (United States)

    Fayek, Haytham M; Lech, Margaret; Cavedon, Lawrence

    2017-08-01

    Speech Emotion Recognition (SER) can be regarded as a static or dynamic classification problem, which makes SER an excellent test bed for investigating and comparing various deep learning architectures. We describe a frame-based formulation to SER that relies on minimal speech processing and end-to-end deep learning to model intra-utterance dynamics. We use the proposed SER system to empirically explore feed-forward and recurrent neural network architectures and their variants. Experiments conducted illuminate the advantages and limitations of these architectures in paralinguistic speech recognition and emotion recognition in particular. As a result of our exploration, we report state-of-the-art results on the IEMOCAP database for speaker-independent SER and present quantitative and qualitative assessments of the models' performances. Copyright © 2017 Elsevier Ltd. All rights reserved.

  7. Speech recognition in individuals with sensorineural hearing loss.

    Science.gov (United States)

    de Andrade, Adriana Neves; Iorio, Maria Cecilia Martinelli; Gil, Daniela

    2016-01-01

    Hearing loss can negatively influence the communication performance of individuals, who should be evaluated with suitable material and in situations of listening close to those found in everyday life. To analyze and compare the performance of patients with mild-to-moderate sensorineural hearing loss in speech recognition tests carried out in silence and with noise, according to the variables ear (right and left) and type of stimulus presentation. The study included 19 right-handed individuals with mild-to-moderate symmetrical bilateral sensorineural hearing loss, submitted to the speech recognition test with words in different modalities and speech test with white noise and pictures. There was no significant difference between right and left ears in any of the tests. The mean number of correct responses in the speech recognition test with pictures, live voice, and recorded monosyllables was 97.1%, 85.9%, and 76.1%, respectively, whereas after the introduction of noise, the performance decreased to 72.6% accuracy. The best performances in the Speech Recognition Percentage Index were obtained using monosyllabic stimuli, represented by pictures presented in silence, with no significant differences between the right and left ears. After the introduction of competitive noise, there was a decrease in individuals' performance. Copyright © 2015 Associação Brasileira de Otorrinolaringologia e Cirurgia Cérvico-Facial. Published by Elsevier Editora Ltda. All rights reserved.

  8. Speech recognition in individuals with sensorineural hearing loss

    Directory of Open Access Journals (Sweden)

    Adriana Neves de Andrade

    INTRODUCTION: Hearing loss can negatively influence the communication performance of individuals, who should be evaluated with suitable material and in situations of listening close to those found in everyday life. OBJECTIVE: To analyze and compare the performance of patients with mild-to-moderate sensorineural hearing loss in speech recognition tests carried out in silence and with noise, according to the variables ear (right and left) and type of stimulus presentation. METHODS: The study included 19 right-handed individuals with mild-to-moderate symmetrical bilateral sensorineural hearing loss, submitted to the speech recognition test with words in different modalities and speech test with white noise and pictures. RESULTS: There was no significant difference between right and left ears in any of the tests. The mean number of correct responses in the speech recognition test with pictures, live voice, and recorded monosyllables was 97.1%, 85.9%, and 76.1%, respectively, whereas after the introduction of noise, the performance decreased to 72.6% accuracy. CONCLUSIONS: The best performances in the Speech Recognition Percentage Index were obtained using monosyllabic stimuli, represented by pictures presented in silence, with no significant differences between the right and left ears. After the introduction of competitive noise, there was a decrease in individuals' performance.

  9. Speech recognition for the anaesthesia record during crisis scenarios

    DEFF Research Database (Denmark)

    Alapetite, Alexandre

    2008-01-01

    Introduction: This article describes the evaluation of a prototype speech-input interface to an anaesthesia patient record, conducted in a full-scale anaesthesia simulator involving six doctor-nurse anaesthetist teams. Objective: The aims of the experiment were, first, to assess the potential...... and observations almost simultaneously when they are given or made. The tested speech input strategies were successful, even with the ambient noise. Speaking to the system while working appeared feasible, although improvements in speech recognition rates are needed. Conclusion: A vocal interface leads to shorter...

  10. Current trends in small vocabulary speech recognition for equipment control

    Science.gov (United States)

    Doukas, Nikolaos; Bardis, Nikolaos G.

    2017-09-01

    Speech recognition systems allow human-machine communication to acquire an intuitive nature that approaches the simplicity of inter-human communication. Small vocabulary speech recognition is a subset of the overall speech recognition problem, where only a small number of words need to be recognized. Speaker independent small vocabulary recognition can find significant applications in field equipment used by military personnel. Such equipment may typically be controlled by a small number of commands that need to be given quickly and accurately, under conditions where delicate manual operations are difficult to achieve. This type of application could hence significantly benefit from the use of robust voice operated control components, as they would facilitate the interaction with their users and render it much more reliable in times of crisis. This paper presents current challenges involved in attaining efficient and robust small vocabulary speech recognition. These challenges concern feature selection, classification techniques, speaker diversity and noise effects. A state machine approach is presented that facilitates the voice guidance of different equipment in a variety of situations.

  11. Automatic Speech Signal Analysis for Clinical Diagnosis and Assessment of Speech Disorders

    CERN Document Server

    Baghai-Ravary, Ladan

    2013-01-01

    Automatic Speech Signal Analysis for Clinical Diagnosis and Assessment of Speech Disorders provides a survey of methods designed to aid clinicians in the diagnosis and monitoring of speech disorders such as dysarthria and dyspraxia, with an emphasis on the signal processing techniques, statistical validity of the results presented in the literature, and the appropriateness of methods that do not require specialized equipment, rigorously controlled recording procedures or highly skilled personnel to interpret results. Such techniques offer the promise of a simple and cost-effective, yet objective, assessment of a range of medical conditions, which would be of great value to clinicians. The ideal scenario would begin with the collection of examples of the clients’ speech, either over the phone or using portable recording devices operated by non-specialist nursing staff. The recordings could then be analyzed initially to aid diagnosis of conditions, and subsequently to monitor the clients’ progress and res...

  12. Relating dynamic brain states to dynamic machine states: Human and machine solutions to the speech recognition problem.

    Directory of Open Access Journals (Sweden)

    Cai Wingfield

    2017-09-01

    There is widespread interest in the relationship between the neurobiological systems supporting human cognition and emerging computational systems capable of emulating these capacities. Human speech comprehension, poorly understood as a neurobiological process, is an important case in point. Automatic Speech Recognition (ASR) systems with near-human levels of performance are now available, which provide a computationally explicit solution for the recognition of words in continuous speech. This research aims to bridge the gap between speech recognition processes in humans and machines, using novel multivariate techniques to compare incremental 'machine states', generated as the ASR analysis progresses over time, to the incremental 'brain states', measured using combined electro- and magneto-encephalography (EMEG), generated as the same inputs are heard by human listeners. This direct comparison of dynamic human and machine internal states, as they respond to the same incrementally delivered sensory input, revealed a significant correspondence between neural response patterns in human superior temporal cortex and the structural properties of ASR-derived phonetic models. Spatially coherent patches in human temporal cortex responded selectively to individual phonetic features defined on the basis of machine-extracted regularities in the speech to lexicon mapping process. These results demonstrate the feasibility of relating human and ASR solutions to the problem of speech recognition, and suggest the potential for further studies relating complex neural computations in human speech comprehension to the rapidly evolving ASR systems that address the same problem domain.

  13. Temporal visual cues aid speech recognition

    DEFF Research Database (Denmark)

    Zhou, Xiang; Ross, Lars; Lehn-Schiøler, Tue

    2006-01-01

    BACKGROUND: It is well known that under noisy conditions, viewing a speaker's articulatory movement aids the recognition of spoken words. Conventionally it is thought that the visual input disambiguates otherwise confusing auditory input. HYPOTHESIS: In contrast we hypothesize...... that it is the temporal synchronicity of the visual input that aids parsing of the auditory stream. More specifically, we expected that purely temporal information, which does not convey information such as place of articulation, may facilitate word recognition. METHODS: To test this prediction we used temporal features...... of audio to generate an artificial talking-face video and measured word recognition performance on simple monosyllabic words. RESULTS: When presenting words together with the artificial video we find that word recognition is improved over purely auditory presentation. The effect is significant (p......

  14. Spoken Word Recognition of Chinese Words in Continuous Speech

    Science.gov (United States)

    Yip, Michael C. W.

    2015-01-01

    The present study examined the role that the positional probability of syllables plays in the recognition of spoken words in continuous Cantonese speech. Because some sounds occur more frequently at the beginning or ending position of Cantonese syllables than others, this kind of probabilistic information about syllables may cue the locations…

  15. High-performance speech recognition using consistency modeling

    Science.gov (United States)

    Digalakis, Vassilios; Murveit, Hy; Monaco, Peter; Neumeyer, Leo; Sankar, Ananth

    1994-12-01

    The goal of SRI's consistency modeling project is to improve the raw acoustic modeling component of SRI's DECIPHER speech recognition system and develop consistency modeling technology. Consistency modeling aims to reduce the number of improper independence assumptions used in traditional speech recognition algorithms so that the resulting speech recognition hypotheses are more self-consistent and, therefore, more accurate. At the initial stages of this effort, SRI focused on developing the appropriate base technologies for consistency modeling. We first developed the Progressive Search technology that allowed us to perform large-vocabulary continuous speech recognition (LVCSR) experiments. Since its conception and development at SRI, this technique has been adopted by most laboratories, including other ARPA contracting sites, doing research on LVCSR. Another goal of the consistency modeling project is to attack difficult modeling problems in which there is a mismatch between the training and testing phases. Such mismatches may include outlier speakers, different microphones and additive noise. We were able to either develop new, or transfer and evaluate existing, technologies that adapted our baseline genonic HMM recognizer to such difficult conditions.

  16. Multitasking During Degraded Speech Recognition in School-Age Children.

    Science.gov (United States)

    Grieco-Calub, Tina M; Ward, Kristina M; Brehm, Laurel

    2017-01-01

    Multitasking requires individuals to allocate their cognitive resources across different tasks. The purpose of the current study was to assess school-age children's multitasking abilities during degraded speech recognition. Children (8 to 12 years old) completed a dual-task paradigm including a sentence recognition (primary) task containing speech that was either unprocessed or noise-band vocoded with 8, 6, or 4 spectral channels and a visual monitoring (secondary) task. Children's accuracy and reaction time on the visual monitoring task were quantified during the dual-task paradigm in each condition of the primary task and compared with single-task performance. Children experienced dual-task costs in the 6- and 4-channel conditions of the primary speech recognition task with decreased accuracy on the visual monitoring task relative to baseline performance. In all conditions, children's dual-task performance on the visual monitoring task was strongly predicted by their single-task (baseline) performance on the task. Results suggest that children's proficiency with the secondary task contributes to the magnitude of dual-task costs while multitasking during degraded speech recognition.

  17. Channel normalization technique for speech recognition in mismatched conditions

    CSIR Research Space (South Africa)

    Kleynhans, N

    2008-11-01

    ..., where one wishes to use any available training data for a variety of purposes. Research into a new channel normalization (CN) technique for channel mismatched speech recognition is presented. A process of inverse linear filtering is used in order...

  18. Improving user-friendliness by using visually supported speech recognition

    NARCIS (Netherlands)

    Waals, J.A.J.S.; Kooi, F.L.; Kriekaard, J.J.

    2002-01-01

    While speech recognition in principle may be one of the most natural interfaces, in practice it is not, due to its lack of user-friendliness. Words are regularly interpreted incorrectly, and subjects tend to articulate in an exaggerated manner. We explored the potential of visually supported error

  19. Appropriate baseline values for HMM-based speech recognition

    CSIR Research Space (South Africa)

    Barnard, E

    2004-11-01

    A number of issues related to the development of speech-recognition systems with Hidden Markov Models (HMM) are discussed. A set of systematic experiments using the HTK toolkit and the TIMIT database is used to elucidate matters such as the number...

  20. Speech emotion recognition based on statistical pitch model

    Institute of Scientific and Technical Information of China (English)

    WANG Zhiping; ZHAO Li; ZOU Cairong

    2006-01-01

    A modified Parzen-window method, which keeps high resolution at low frequencies and smoothness at high frequencies, is proposed to obtain a statistical model of pitch. A gender classification method utilizing the statistical model is then proposed, which achieves 98% gender classification accuracy on long sentences. After separating male and female voices, the mean and standard deviation of speech training samples with different emotions are used to create the corresponding emotion models. The Bhattacharyya distance between the test sample and the statistical pitch models is then utilized for emotion recognition in speech. The normalization of pitch for male and female voices is also considered, in order to map them into a uniform space. Finally, a speech emotion recognition experiment based on K Nearest Neighbor shows that a correct rate of 81% is achieved, compared with only 73.85% if the traditional parameters are utilized.
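    To make the distance step concrete, the sketch below computes the Bhattacharyya distance between two univariate Gaussians, each summarized by a mean and a standard deviation, and selects the closest emotion model. Treating the pitch statistics as Gaussian is an assumption made here for illustration; the abstract does not state the distributional form the authors used, and the names `bhattacharyya_gaussian` and `nearest_emotion` are hypothetical.

```python
import math


def bhattacharyya_gaussian(mean1, std1, mean2, std2):
    """Bhattacharyya distance between two univariate Gaussians.

    D = (mean1 - mean2)^2 / (4 (var1 + var2))
        + 0.5 * ln((var1 + var2) / (2 * std1 * std2))
    Both standard deviations must be positive.
    """
    var1, var2 = std1 ** 2, std2 ** 2
    mean_term = 0.25 * (mean1 - mean2) ** 2 / (var1 + var2)
    spread_term = 0.5 * math.log((var1 + var2) / (2.0 * std1 * std2))
    return mean_term + spread_term


def nearest_emotion(test_mean, test_std, models):
    """Return the emotion whose (mean, std) model is closest to the test sample."""
    return min(models,
               key=lambda emotion: bhattacharyya_gaussian(test_mean, test_std,
                                                          *models[emotion]))
```

    The first term grows with the gap between the means and the second with the mismatch between the spreads, so two models with equal mean pitch can still be told apart by their variability.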

  1. Writing and Speech Recognition : Observing Error Correction Strategies of Professional Writers

    NARCIS (Netherlands)

    Leijten, M.A.J.C.

    2007-01-01

    In this thesis we describe the organization of speech recognition based writing processes. Writing can be seen as a visual representation of spoken language: a combination that speech recognition takes full advantage of. In the field of writing research, speech recognition is a new writing

  2. Automatic sign language recognition inspired by human sign perception

    NARCIS (Netherlands)

    Ten Holt, G.A.

    2010-01-01

    Automatic sign language recognition is a relatively new field of research (since ca. 1990). Its objectives are to automatically analyze sign language utterances. There are several issues within the research area that merit investigation: how to capture the utterances (cameras, magnetic sensors,

  3. A model based method for automatic facial expression recognition

    NARCIS (Netherlands)

    Kuilenburg, H. van; Wiering, M.A.; Uyl, M. den

    2006-01-01

    Automatic facial expression recognition is a research topic with interesting applications in the field of human-computer interaction, psychology and product marketing. The classification accuracy for an automatic system which uses static images as input is however largely limited by the image

  4. Statistical pattern recognition for automatic writer identification and verification

    NARCIS (Netherlands)

    Bulacu, Marius Lucian

    2007-01-01

    The thesis addresses the problem of automatic person identification using scanned images of handwriting. Identifying the author of a handwritten sample using automatic image-based methods is an interesting pattern recognition problem with direct applicability in the forensic and historic document

  5. Sign language perception research for improving automatic sign language recognition

    NARCIS (Netherlands)

    Ten Holt, G.A.; Arendsen, J.; De Ridder, H.; Van Doorn, A.J.; Reinders, M.J.T.; Hendriks, E.A.

    2009-01-01

    Current automatic sign language recognition (ASLR) seldom uses perceptual knowledge about the recognition of sign language. Using such knowledge can improve ASLR because it can give an indication which elements or phases of a sign are important for its meaning. Also, the current generation of

  6. Incorporating Speech Recognition into a Natural User Interface

    Science.gov (United States)

    Chapa, Nicholas

    2017-01-01

    The Augmented/ Virtual Reality (AVR) Lab has been working to study the applicability of recent virtual and augmented reality hardware and software to KSC operations. This includes the Oculus Rift, HTC Vive, Microsoft HoloLens, and Unity game engine. My project in this lab is to integrate voice recognition and voice commands into an easy to modify system that can be added to an existing portion of a Natural User Interface (NUI). A NUI is an intuitive and simple to use interface incorporating visual, touch, and speech recognition. The inclusion of speech recognition capability will allow users to perform actions or make inquiries using only their voice. The simplicity of needing only to speak to control an on-screen object or enact some digital action means that any user can quickly become accustomed to using this system. Multiple programs were tested for use in a speech command and recognition system. Sphinx4 translates speech to text using a Hidden Markov Model (HMM) based Language Model, an Acoustic Model, and a word Dictionary running on Java. PocketSphinx had similar functionality to Sphinx4 but instead ran on C. However, neither of these programs were ideal as building a Java or C wrapper slowed performance. The most ideal speech recognition system tested was the Unity Engine Grammar Recognizer. A Context Free Grammar (CFG) structure is written in an XML file to specify the structure of phrases and words that will be recognized by Unity Grammar Recognizer. Using Speech Recognition Grammar Specification (SRGS) 1.0 makes modifying the recognized combinations of words and phrases very simple and quick to do. With SRGS 1.0, semantic information can also be added to the XML file, which allows for even more control over how spoken words and phrases are interpreted by Unity. Additionally, using a CFG with SRGS 1.0 produces a Finite State Machine (FSM) functionality limiting the potential for incorrectly heard words or phrases. The purpose of my project was to

  7. Biologically inspired emotion recognition from speech

    Directory of Open Access Journals (Sweden)

    Buscicchio Cosimo

    2011-01-01

    Emotion recognition has become a fundamental task in human-computer interaction systems. In this article, we propose an emotion recognition approach based on biologically inspired methods. Specifically, emotion classification is performed using a long short-term memory (LSTM) recurrent neural network, which is able to recognize long-range dependencies between successive temporal patterns. We propose to represent data using features derived from two different models: mel-frequency cepstral coefficients (MFCC) and the Lyon cochlear model. In the experimental phase, results obtained from the LSTM network and the two different feature sets are compared, showing that features derived from the Lyon cochlear model give better recognition results than those obtained with the traditional MFCC representation.

  8. Hemispheric lateralization of linguistic prosody recognition in comparison to speech and speaker recognition.

    Science.gov (United States)

    Kreitewolf, Jens; Friederici, Angela D; von Kriegstein, Katharina

    2014-11-15

    Hemispheric specialization for linguistic prosody is a controversial issue. While it is commonly assumed that linguistic prosody and emotional prosody are preferentially processed in the right hemisphere, neuropsychological work directly comparing processes of linguistic prosody and emotional prosody suggests a predominant role of the left hemisphere for linguistic prosody processing. Here, we used two functional magnetic resonance imaging (fMRI) experiments to clarify the role of left and right hemispheres in the neural processing of linguistic prosody. In the first experiment, we sought to confirm previous findings showing that linguistic prosody processing compared to other speech-related processes predominantly involves the right hemisphere. Unlike previous studies, we controlled for stimulus influences by employing a prosody and speech task using the same speech material. The second experiment was designed to investigate whether a left-hemispheric involvement in linguistic prosody processing is specific to contrasts between linguistic prosody and emotional prosody or whether it also occurs when linguistic prosody is contrasted against other non-linguistic processes (i.e., speaker recognition). Prosody and speaker tasks were performed on the same stimulus material. In both experiments, linguistic prosody processing was associated with activity in temporal, frontal, parietal and cerebellar regions. Activation in temporo-frontal regions showed differential lateralization depending on whether the control task required recognition of speech or speaker: recognition of linguistic prosody predominantly involved right temporo-frontal areas when it was contrasted against speech recognition; when contrasted against speaker recognition, recognition of linguistic prosody predominantly involved left temporo-frontal areas. 
    The results show that linguistic prosody processing involves functions of both hemispheres and suggest that recognition of linguistic prosody is based on

  9. Robust Speaker Authentication Based on Combined Speech and Voiceprint Recognition

    Science.gov (United States)

    Malcangi, Mario

    2009-08-01

    Personal authentication is becoming increasingly important in many applications that have to protect proprietary data. Passwords and personal identification numbers (PINs) prove not to be robust enough to ensure that unauthorized people do not use them. Biometric authentication technology may offer a secure, convenient, accurate solution but sometimes fails due to its intrinsically fuzzy nature. This research aims to demonstrate that combining two basic speech processing methods, voiceprint identification and speech recognition, can provide a very high degree of robustness, especially if fuzzy decision logic is used.

  10. Speech pattern recognition for forensic acoustic purposes

    OpenAIRE

    Herrera Martínez, Marcelo; Aldana Blanco, Andrea Lorena; Guzmán Palacios, Ana María

    2014-01-01

    The present paper describes the development of software for the analysis of acoustic voice parameters (APAVOIX), which can be used for forensic acoustic purposes based on speaker recognition and identification. This software makes it possible to observe clearly the parameters that are necessary and sufficient when comparing two voice signals, the suspect one and the original one. These parameters are used according to the classic method, generally used by state entit...

  11. Auditory Modeling for Noisy Speech Recognition.

    Science.gov (United States)

    2000-01-01

    multiple platforms including PCs, workstations, and DSPs. A prototype version of the SOS process was tested on the Japanese Hiragana language with good...judgment among linguists. American English has 48 phonetic sounds in the ARPABET representation. Hiragana, the Japanese phonetic language, has only 20... Japanese Hiragana," H.L. Pfister, FL 95, 1995. "State Recognition for Noisy Dynamic Systems," H.L. Pfister, Tech 2005, Chicago, 1995. "Experiences

  12. Dynamic relation between working memory capacity and speech recognition in noise during the first 6 months of hearing aid use.

    Science.gov (United States)

    Ng, Elaine H N; Classon, Elisabet; Larsby, Birgitta; Arlinger, Stig; Lunner, Thomas; Rudner, Mary; Rönnberg, Jerker

    2014-11-23

    The present study aimed to investigate the changing relationship between aided speech recognition and cognitive function during the first 6 months of hearing aid use. Twenty-seven first-time hearing aid users with symmetrical mild to moderate sensorineural hearing loss were recruited. Aided speech recognition thresholds in noise were obtained in the hearing aid fitting session as well as at 3 and 6 months postfitting. Cognitive abilities were assessed using a reading span test, which is a measure of working memory capacity, and a cognitive test battery. Results showed a significant correlation between reading span and speech reception threshold during the hearing aid fitting session. This relation was significantly weakened over the first 6 months of hearing aid use. Multiple regression analysis showed that reading span was the main predictor of speech recognition thresholds in noise when hearing aids were first fitted, but that the pure-tone average hearing threshold was the main predictor 6 months later. One way of explaining the results is that working memory capacity plays a more important role in speech recognition in noise initially rather than after 6 months of use. We propose that new hearing aid users engage working memory capacity to recognize unfamiliar processed speech signals because the phonological form of these signals cannot be automatically matched to phonological representations in long-term memory. As familiarization proceeds, the mismatch effect is alleviated, and the engagement of working memory capacity is reduced. © The Author(s) 2014.

  13. Automatic recognition of printed Oriya script

    Indian Academy of Sciences (India)

    R. Narasimhan (Krishtel eMaging) 1461 1996 Oct 15 13:05:22

    Some studies have been reported on Tamil, Telugu and Gurmukhi scripts ... leaf node, giving rise to recognition errors. The contribution of these ... As a native speaker of Oriya, Anil Chand gave us useful advice about the script. P. Sashank.

  14. An automatic system for Turkish word recognition using Discrete Wavelet Neural Network based on adaptive entropy

    International Nuclear Information System (INIS)

    Avci, E.

    2007-01-01

    In this paper, an automatic system is presented for word recognition using real Turkish word signals. This paper especially deals with combination of the feature extraction and classification from real Turkish word signals. A Discrete Wavelet Neural Network (DWNN) model is used, which consists of two layers: discrete wavelet layer and multi-layer perceptron. The discrete wavelet layer is used for adaptive feature extraction in the time-frequency domain and is composed of Discrete Wavelet Transform (DWT) and wavelet entropy. The multi-layer perceptron used for classification is a feed-forward neural network. The performance of the used system is evaluated by using noisy Turkish word signals. Test results showing the effectiveness of the proposed automatic system are presented in this paper. The rate of correct recognition is about 92.5% for the sample speech signals. (author)
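
    The feature-extraction layer described above combines a Discrete Wavelet Transform with wavelet entropy. The sketch below illustrates one common reading of that idea: decompose the signal, measure the energy in each sub-band, and take the Shannon entropy of the relative energies. The Haar wavelet, the number of levels and the exact entropy definition are assumptions made for the example; the abstract does not state which ones the author used.

```python
import math


def haar_dwt_level(signal):
    """One level of the orthonormal Haar DWT (len(signal) must be even)."""
    scale = 1.0 / math.sqrt(2.0)
    approx = [(signal[i] + signal[i + 1]) * scale for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) * scale for i in range(0, len(signal), 2)]
    return approx, detail


def wavelet_entropy(signal, levels):
    """Shannon entropy of the relative energy in each wavelet sub-band.

    len(signal) must be divisible by 2 ** levels.
    """
    if len(signal) % (2 ** levels) != 0:
        raise ValueError("signal length must be divisible by 2 ** levels")
    energies = []
    approx = list(signal)
    for _ in range(levels):
        approx, detail = haar_dwt_level(approx)
        energies.append(sum(d * d for d in detail))
    # The final approximation band is treated as one more sub-band.
    energies.append(sum(a * a for a in approx))
    total = sum(energies)
    if total == 0.0:
        return 0.0
    probabilities = [e / total for e in energies if e > 0.0]
    return -sum(p * math.log(p) for p in probabilities)


if __name__ == "__main__":
    # An impulse spreads its energy over several bands; a constant does not.
    print(wavelet_entropy([1.0, 0.0, 0.0, 0.0], levels=2))
    print(wavelet_entropy([1.0] * 8, levels=3))
```

    A low value means the energy sits in one band, a high value means it is spread across scales, so a single number per frame can serve as a compact input to a classifier such as the multi-layer perceptron mentioned above.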

  15. Part-of-Speech Enhanced Context Recognition

    DEFF Research Database (Denmark)

    Madsen, Rasmus Elsborg; Larsen, Jan; Hansen, Lars Kai

    2004-01-01

    Language independent 'bag-of-words' representations are surprisingly effective for text classification. In this communication our aim is to elucidate the synergy between language independent features and simple language model features. We consider term tag features estimated by a so-called part...... and a probabilistic neural network classifier. Three medium size data-sets are analyzed and we find consistent synergy between the term and natural language features in all three sets for a range of training set sizes. The most significant enhancement is found for small text databases where high recognition...

  16. How does real affect affect affect recognition in speech?

    NARCIS (Netherlands)

    Truong, Khiet Phuong

    2009-01-01

    The automatic analysis of affect is a relatively new and challenging multidisciplinary research area that has gained a lot of interest over the past few years. The research and development of affect recognition systems has opened many opportunities for improving the interaction between man and

  17. Phase effects in masking by harmonic complexes: speech recognition.

    Science.gov (United States)

    Deroche, Mickael L D; Culling, John F; Chatterjee, Monita

    2013-12-01

    Harmonic complexes that generate highly modulated temporal envelopes on the basilar membrane (BM) mask a tone less effectively than complexes that generate relatively flat temporal envelopes, because the non-linear active gain of the BM selectively amplifies a low-level tone in the dips of a modulated masker envelope. The present study examines a similar effect in speech recognition. Speech reception thresholds (SRTs) were measured for a voice masked by harmonic complexes with partials in sine phase (SP) or in random phase (RP). The masker's fundamental frequency (F0) was 50, 100 or 200 Hz. SRTs were considerably lower for SP than for RP maskers at 50-Hz F0, but the two converged at 100-Hz F0, while at 200-Hz F0, SRTs were a little higher for SP than RP maskers. The results were similar whether the target voice was male or female and whether the masker's spectral profile was flat or speech-shaped. Although listening in the masker dips has been shown to play a large role for artificial stimuli such as Schroeder-phase complexes at high levels, it contributes weakly to speech recognition in the presence of harmonic maskers with different crest factors at more moderate sound levels (65 dB SPL). Copyright © 2013 Elsevier B.V. All rights reserved.

  18. A New Fuzzy Cognitive Map Learning Algorithm for Speech Emotion Recognition

    OpenAIRE

    Zhang, Wei; Zhang, Xueying; Sun, Ying

    2017-01-01

    Selecting an appropriate recognition method is crucial in speech emotion recognition applications. However, the current methods do not consider the relationship between emotions. Thus, in this study, a speech emotion recognition system based on the fuzzy cognitive map (FCM) approach is constructed. Moreover, a new FCM learning algorithm for speech emotion recognition is proposed. This algorithm includes the use of the pleasure-arousal-dominance emotion scale to calculate the weights between e...

  19. EXTENDED SPEECH EMOTION RECOGNITION AND PREDICTION

    Directory of Open Access Journals (Sweden)

    Theodoros Anagnostopoulos

    2014-11-01

    Humans are considered to reason and act rationally, and that is believed to be their fundamental difference from the rest of the living entities. Furthermore, modern approaches in the science of psychology underline that humans, as thinking creatures, are also sentimental and emotional organisms. There are fifteen universal extended emotions plus a neutral emotion: hot anger, cold anger, panic, fear, anxiety, despair, sadness, elation, happiness, interest, boredom, shame, pride, disgust, contempt and the neutral position. The scope of the current research is to understand the emotional state of a human being by capturing the speech utterances that one uses during a common conversation. It is shown that, given enough acoustic evidence, the emotional state of a person can be classified by a set of majority voting classifiers. The proposed set of classifiers is based on three main classifiers: kNN, C4.5 and SVM with an RBF kernel. This set achieves better performance than each basic classifier taken separately. It is compared with two other sets of classifiers: one-against-all (OAA) multiclass SVM with hybrid kernels, and a set of classifiers consisting of the following two basic classifiers: C5.0 and Neural Network. The proposed variant achieves better performance than the other two sets of classifiers. The paper deals with emotion classification by a set of majority voting classifiers that combines three types of basic classifiers with low computational complexity. The basic classifiers stem from different theoretical backgrounds in order to avoid bias and redundancy, which gives the proposed set of classifiers the ability to generalize in the emotion domain space.
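
    The combination step in the abstract above, majority voting over several base classifiers, can be sketched in a few lines. The sketch assumes each base classifier has already produced one label per utterance; the function name and the tie-breaking rule (the earliest classifier in the list wins) are assumptions made for the example, since the abstract does not say how ties are resolved.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the most frequent label among the base-classifier outputs.

    predictions is a non-empty sequence with one label per base classifier.
    Ties are broken in favour of the label that appears first in the list.
    """
    if not predictions:
        raise ValueError("at least one prediction is required")
    counts = Counter(predictions)
    best = max(counts.values())
    for label in predictions:
        if counts[label] == best:
            return label


if __name__ == "__main__":
    # Hypothetical outputs of three base classifiers for one utterance.
    print(majority_vote(["sadness", "hot anger", "sadness"]))
```

    With three voters and more than two classes a three-way tie is possible, so an explicit tie rule is needed; ordering the list by the classifiers' individual accuracy is one simple option.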

  20. Automatic Recognition of Improperly Pronounced Initial 'r' Consonant in Romanian

    Directory of Open Access Journals (Sweden)

    VELICAN, V.

    2012-08-01

    Correctly assessing the degree of mispronunciation and deciding upon the necessary treatment are fundamental activities for all speech disorder specialists. Obviously, the experience and the availability of the specialists are essential to ensure efficient therapy for the speech impaired. To overcome this deficiency, a more objective approach would include a tool that, independent of the specialist's abilities, could be used to establish the diagnosis. A completely automated system based on speech processing algorithms capable of performing the recognition task is therefore thoroughly justified and can be viewed as a goal that will bring many benefits to the field of speech pronunciation correction. This paper presents further results of the authors' work on developing speech processing algorithms able to identify mispronunciations in the Romanian language; more exactly, we propose the use of the Walsh-Hadamard Transform (WHT) as a feature selection tool for identifying rhotacism. The results are encouraging, with a best recognition rate of 92.55%.
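
    For readers unfamiliar with the Walsh-Hadamard Transform named above, the sketch below shows the standard fast, unnormalized, in-place butterfly version in natural (Hadamard) order. It illustrates the transform only; how the authors turn the coefficients into selected features is not described in the abstract and is not reproduced here.

```python
def fwht(values):
    """Fast Walsh-Hadamard transform (unnormalized, Hadamard order).

    len(values) must be a power of two. Returns a new list.
    """
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    out = list(values)
    half = 1
    while half < n:
        # Combine blocks of size `half` pairwise with sum/difference butterflies.
        for start in range(0, n, 2 * half):
            for i in range(start, start + half):
                a, b = out[i], out[i + half]
                out[i], out[i + half] = a + b, a - b
        half *= 2
    return out


if __name__ == "__main__":
    frame = [1, 0, 1, 0, 0, 1, 1, 0]
    coefficients = fwht(frame)
    print(coefficients)
    # Applying the transform twice returns the input scaled by its length.
    print(fwht(coefficients))
```

    The transform uses only additions and subtractions, which makes it cheap compared with Fourier-based features, and applying it twice recovers the input up to a factor equal to the frame length.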

  1. Emerging technologies with potential for objectively evaluating speech recognition skills.

    Science.gov (United States)

    Rawool, Vishakha Waman

    2016-01-01

    Work-related exposure to noise and other ototoxins can cause damage to the cochlea, synapses between the inner hair cells, the auditory nerve fibers, and higher auditory pathways, leading to difficulties in recognizing speech. Procedures designed to determine speech recognition scores (SRS) in an objective manner can be helpful in disability compensation cases where the worker claims to have poor speech perception due to exposure to noise or ototoxins. Such measures can also be helpful in determining SRS in individuals who cannot provide reliable responses to speech stimuli, including patients with Alzheimer's disease, traumatic brain injuries, and infants with and without hearing loss. Cost-effective neural monitoring hardware and software is being rapidly refined due to the high demand for neurogaming (games involving the use of brain-computer interfaces), health, and other applications. More specifically, two related advances in neuro-technology include relative ease in recording neural activity and availability of sophisticated analysing techniques. These techniques are reviewed in the current article and their applications for developing objective SRS procedures are proposed. Issues related to neuroaudioethics (ethics related to collection of neural data evoked by auditory stimuli including speech) and neurosecurity (preservation of a person's neural mechanisms and free will) are also discussed.

  2. Syntactic error modeling and scoring normalization in speech recognition: Error modeling and scoring normalization in the speech recognition task for adult literacy training

    Science.gov (United States)

    Olorenshaw, Lex; Trawick, David

    1991-01-01

    The purpose was to develop a speech recognition system to be able to detect speech which is pronounced incorrectly, given that the text of the spoken speech is known to the recognizer. Better mechanisms are provided for using speech recognition in a literacy tutor application. Using a combination of scoring normalization techniques and cheater-mode decoding, a reasonable acceptance/rejection threshold was provided. In continuous speech, the system was tested to be able to provide above 80 pct. correct acceptance of words, while correctly rejecting over 80 pct. of incorrectly pronounced words.

  3. Automatic Number Plate Recognition System for IPhone Devices

    Directory of Open Access Journals (Sweden)

    Călin Enăchescu

    2013-06-01

    This paper presents a system for automatic number plate recognition, implemented for devices running the iOS operating system. The methods used for number plate recognition are based on existing methods, but optimized for devices with low hardware resources. To solve the task of automatic number plate recognition we have divided it into the following subtasks: image acquisition, localization of the number plate position on the image and character detection. The first subtask is performed by the camera of an iPhone, the second one is done using image pre-processing methods and template matching. For the character recognition we are using a feed-forward artificial neural network. Each of these methods is presented along with its results.

  4. Speech recognition: impact on workflow and report availability

    International Nuclear Information System (INIS)

    Glaser, C.; Trumm, C.; Nissen-Meyer, S.; Francke, M.; Kuettner, B.; Reiser, M.

    2005-01-01

    With ongoing technical refinements speech recognition systems (SRS) are becoming an increasingly attractive alternative to traditional methods of preparing and transcribing medical reports. The two main components of any SRS are the acoustic model and the language model. Features of modern SRS with continuous speech recognition are macros with individually definable texts and report templates as well as the option to navigate in a text or to control SRS or RIS functions by speech recognition. The best benefit from SRS can be obtained if it is integrated into a RIS/RIS-PACS installation. Report availability and time efficiency of the reporting process (related to recognition rate, time expenditure for editing and correcting a report) are the principal determinants of the clinical performance of any SRS. For practical purposes the recognition rate is estimated by the error rate (unit "word"). Error rates range from 4 to 28%. Roughly 20% of them are errors in the vocabulary which may result in clinically relevant misinterpretation. It is thus mandatory to thoroughly correct any transcribed text as well as to continuously train and adapt the SRS vocabulary. The implementation of SRS dramatically improves report availability. This is most pronounced for CT and CR. However, the individual time expenditure for (SRS-based) reporting increased by 20-25% (CR) and according to literature data there is an increase by 30% for CT and MRI. The extent to which the transcription staff profits from SRS depends largely on its qualification. Online dictation implies a workload shift from the transcription staff to the reporting radiologist.
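
    The abstract above estimates recognition rate through a word-level error rate. A minimal sketch of the usual word error rate calculation follows: the word-level edit distance (substitutions, deletions and insertions) between the reference text and the recognized text, divided by the number of reference words. Whitespace tokenization and the example sentences are assumptions made for illustration; the study does not describe its exact counting procedure.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    # previous[j] holds the distance between the reference words consumed
    # so far and the first j hypothesis words.
    previous = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            current.append(
                min(
                    previous[j] + 1,  # deletion of a reference word
                    current[j - 1] + 1,  # insertion of a hypothesis word
                    previous[j - 1] + (ref_word != hyp_word),  # substitution or match
                )
            )
        previous = current
    return previous[-1] / len(ref_words)


if __name__ == "__main__":
    # Made-up dictation example: one substitution and one deletion.
    dictated = "no acute fracture is seen"
    recognized = "no acute fraction seen"
    print(word_error_rate(dictated, recognized))
```

    Because insertions count as errors, the rate can exceed 100% when the recognizer outputs many spurious words, which is worth keeping in mind when comparing reported ranges.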

  5. Emotionally conditioning the target-speech voice enhances recognition of the target speech under "cocktail-party" listening conditions.

    Science.gov (United States)

    Lu, Lingxi; Bao, Xiaohan; Chen, Jing; Qu, Tianshu; Wu, Xihong; Li, Liang

    2018-05-01

    Under a noisy "cocktail-party" listening condition with multiple people talking, listeners can use various perceptual/cognitive unmasking cues to improve recognition of the target speech against informational speech-on-speech masking. One potential unmasking cue is the emotion expressed in a speech voice, by means of certain acoustical features. However, it was unclear whether emotionally conditioning a target-speech voice that has none of the typical acoustical features of emotions (i.e., an emotionally neutral voice) can be used by listeners for enhancing target-speech recognition under speech-on-speech masking conditions. In this study we examined the recognition of target speech against a two-talker speech masker both before and after the emotionally neutral target voice was paired with a loud female screaming sound that has a marked negative emotional valence. The results showed that recognition of the target speech (especially the first keyword in a target sentence) was significantly improved by emotionally conditioning the target speaker's voice. Moreover, the emotional unmasking effect was independent of the unmasking effect of the perceived spatial separation between the target speech and the masker. Also, (skin conductance) electrodermal responses became stronger after emotional learning when the target speech and masker were perceptually co-located, suggesting an increase of listening efforts when the target speech was informationally masked. These results indicate that emotionally conditioning the target speaker's voice does not change the acoustical parameters of the target-speech stimuli, but the emotionally conditioned vocal features can be used as cues for unmasking target speech.

  6. Speech Silicon: An FPGA Architecture for Real-Time Hidden Markov-Model-Based Speech Recognition

    Directory of Open Access Journals (Sweden)

    Schuster Jeffrey

    2006-01-01

    This paper examines the design of an FPGA-based system-on-a-chip capable of performing continuous speech recognition on medium sized vocabularies in real time. Through the creation of three dedicated pipelines, one for each of the major operations in the system, we were able to maximize the throughput of the system while simultaneously minimizing the number of pipeline stalls in the system. Further, by implementing a token-passing scheme between the later stages of the system, the complexity of the control was greatly reduced and the amount of active data present in the system at any time was minimized. Additionally, through in-depth analysis of the SPHINX 3 large vocabulary continuous speech recognition engine, we were able to design models that could be efficiently benchmarked against a known software platform. These results, combined with the ability to reprogram the system for different recognition tasks, serve to create a system capable of performing real-time speech recognition in a vast array of environments.

  7. Speech Silicon: An FPGA Architecture for Real-Time Hidden Markov-Model-Based Speech Recognition

    Directory of Open Access Journals (Sweden)

    Alex K. Jones

    2006-11-01

    This paper examines the design of an FPGA-based system-on-a-chip capable of performing continuous speech recognition on medium sized vocabularies in real time. Through the creation of three dedicated pipelines, one for each of the major operations in the system, we were able to maximize the throughput of the system while simultaneously minimizing the number of pipeline stalls in the system. Further, by implementing a token-passing scheme between the later stages of the system, the complexity of the control was greatly reduced and the amount of active data present in the system at any time was minimized. Additionally, through in-depth analysis of the SPHINX 3 large vocabulary continuous speech recognition engine, we were able to design models that could be efficiently benchmarked against a known software platform. These results, combined with the ability to reprogram the system for different recognition tasks, serve to create a system capable of performing real-time speech recognition in a vast array of environments.

  8. Quality Assessment of Compressed Video for Automatic License Plate Recognition

    DEFF Research Database (Denmark)

    Ukhanova, Ann; Støttrup-Andersen, Jesper; Forchhammer, Søren

    2014-01-01

    Definition of video quality requirements for video surveillance poses new questions in the area of quality assessment. This paper presents a quality assessment experiment for an automatic license plate recognition scenario. We explore the influence of the compression by H.264/AVC and H.265/HEVC standards on the recognition performance. We compare logarithmic and logistic functions for quality modeling. Our results show that a logistic function can better describe the dependence of recognition performance on the quality for both compression standards. We observe that automatic license plate recognition in our study has a behavior similar to human recognition, allowing the use of the same mathematical models. We furthermore propose an application of one of the models for video surveillance systems.

  9. Two Systems for Automatic Music Genre Recognition

    DEFF Research Database (Denmark)

    Sturm, Bob L.

    2012-01-01

    We re-implement and test two state-of-the-art systems for automatic music genre classification; but unlike past works in this area, we look closer than ever before at their behavior. First, we look at specific instances where each system consistently applies the same wrong label across multiple trials of cross-validation. Second, we test the robustness of each system to spectral equalization. Finally, we test how well human subjects recognize the genres of music excerpts composed by each system to be highly genre representative. Our results suggest that neither high-performing system has a capacity to recognize music genre.

  10. A Context Dependent Automatic Target Recognition System

    Science.gov (United States)

    Kim, J. H.; Payton, D. W.; Olin, K. E.; Tseng, D. Y.

    1984-06-01

    This paper describes a new approach to automatic target recognizer (ATR) development utilizing artificial intelligence techniques. The ATR system exploits contextual information in its detection and classification processes to provide a high degree of robustness and adaptability. In the system, knowledge about domain objects and their contextual relationships is encoded in frames, separating it from low level image processing algorithms. This knowledge-based system demonstrates an improvement over the conventional statistical approach through the exploitation of diverse forms of knowledge in its decision-making process.

  11. Integration of asynchronous knowledge sources in a novel speech recognition framework

    OpenAIRE

    Van hamme, Hugo

    2008-01-01

    Van hamme H., ''Integration of asynchronous knowledge sources in a novel speech recognition framework'', Proceedings ITRW on speech analysis and processing for knowledge discovery, 4 pp., June 2008, Aalborg, Denmark.

  12. Recognition of In-Ear Microphone Speech Data Using Multi-Layer Neural Networks

    National Research Council Canada - National Science Library

    Bulbuller, Gokhan

    2006-01-01

    .... In this study, a speech recognition system is presented, specifically an isolated word recognizer which uses speech collected from the external auditory canals of the subjects via an in-ear microphone...

  13. Radar automatic target recognition (ATR) and non-cooperative target recognition (NCTR)

    CERN Document Server

    Blacknell, David

    2013-01-01

    The ability to detect and locate targets by day or night, over wide areas, regardless of weather conditions has long made radar a key sensor in many military and civil applications. However, the ability to automatically and reliably distinguish different targets represents a difficult challenge. Radar Automatic Target Recognition (ATR) and Non-Cooperative Target Recognition (NCTR) captures material presented in the NATO SET-172 lecture series to provide an overview of the state-of-the-art and continuing challenges of radar target recognition. Topics covered include the problem as applied to th

  14. Automatization and Orthographic Development in Second Language Visual Word Recognition

    Science.gov (United States)

    Kida, Shusaku

    2016-01-01

    The present study investigated second language (L2) learners' acquisition of automatic word recognition and the development of L2 orthographic representation in the mental lexicon. Participants in the study were Japanese university students enrolled in a compulsory course involving a weekly 30-minute sustained silent reading (SSR) activity with…

  15. Laser gated viewing : An enabler for Automatic Target Recognition?

    NARCIS (Netherlands)

    Bovenkamp, E.G.P.; Schutte, K.

    2010-01-01

    For many decades attempts to accomplish Automatic Target Recognition have been made using both visual and FLIR camera systems. A recurring problem in these approaches is the segmentation problem, which is the separation between the target and its background. This paper describes an approach to

  16. Auditory signal design for automatic number plate recognition system

    NARCIS (Netherlands)

    Heydra, C.G.; Jansen, R.J.; Van Egmond, R.

    2014-01-01

    This paper focuses on the design of an auditory signal for the Automatic Number Plate Recognition system of Dutch national police. The auditory signal is designed to alert police officers of suspicious cars in their proximity, communicating priority level and location of the suspicious car and

  17. Automatic gang graffiti recognition and interpretation

    Science.gov (United States)

    Parra, Albert; Boutin, Mireille; Delp, Edward J.

    2017-09-01

    One of the roles of emergency first responders (e.g., police and fire departments) is to prevent and protect against events that can jeopardize the safety and well-being of a community. In the case of criminal gang activity, tools are needed for finding, documenting, and taking the necessary actions to mitigate the problem or issue. We describe an integrated mobile-based system capable of using location-based services, combined with image analysis, to track and analyze gang activity through the acquisition, indexing, and recognition of gang graffiti images. This approach uses image analysis methods for color recognition, image segmentation, and image retrieval and classification. A database of gang graffiti images is described that includes not only the images but also metadata related to the images, such as date and time, geoposition, gang, gang member, colors, and symbols. The user can then query the data in a useful manner. We have implemented these features both as applications for Android and iOS hand-held devices and as a web-based interface.

  18. Improving on hidden Markov models: An articulatorily constrained, maximum likelihood approach to speech recognition and speech coding

    Energy Technology Data Exchange (ETDEWEB)

    Hogden, J.

    1996-11-05

    The goal of the proposed research is to test a statistical model of speech recognition that incorporates the knowledge that speech is produced by relatively slow motions of the tongue, lips, and other speech articulators. This model is called Maximum Likelihood Continuity Mapping (Malcom). Many speech researchers believe that by using constraints imposed by articulator motions, we can improve or replace the current hidden Markov model based speech recognition algorithms. Unfortunately, previous efforts to incorporate information about articulation into speech recognition algorithms have suffered because (1) slight inaccuracies in our knowledge or the formulation of our knowledge about articulation may decrease recognition performance, (2) small changes in the assumptions underlying models of speech production can lead to large changes in the speech derived from the models, and (3) collecting measurements of human articulator positions in sufficient quantity for training a speech recognition algorithm is still impractical. The most interesting (and in fact, unique) quality of Malcom is that, even though Malcom makes use of a mapping between acoustics and articulation, Malcom can be trained to recognize speech using only acoustic data. By learning the mapping between acoustics and articulation using only acoustic data, Malcom avoids the difficulties involved in collecting articulator position measurements and does not require an articulatory synthesizer model to estimate the mapping between vocal tract shapes and speech acoustics. Preliminary experiments that demonstrate that Malcom can learn the mapping between acoustics and articulation are discussed. Potential applications of Malcom aside from speech recognition are also discussed. Finally, specific deliverables resulting from the proposed research are described.

  19. Automatic Recognition of Object Names in Literature

    Science.gov (United States)

    Bonnin, C.; Lesteven, S.; Derriere, S.; Oberto, A.

    2008-08-01

    SIMBAD is a database of astronomical objects that provides (among other things) their bibliographic references in a large number of journals. Currently, these references have to be entered manually by librarians who read each paper. To cope with the increasing number of papers, CDS develops a tool to assist the librarians in their work, taking advantage of the Dictionary of Nomenclature of Celestial Objects, which keeps track of object acronyms and of their origin. The program searches for object names directly in PDF documents by comparing the words with all the formats stored in the Dictionary of Nomenclature. It also searches for variable star names based on constellation names and for a large list of usual names such as Aldebaran or the Crab. Object names found in the documents often correspond to several astronomical objects. The system retrieves all possible matches, displays them with their object type given by SIMBAD, and lets the librarian make the final choice. The bibliographic reference can then be automatically added to the object identifiers in the database. Besides, the systematic usage of the Dictionary of Nomenclature, which is updated manually, permitted to automatically check it and to detect errors and inconsistencies. Last but not least, the program collects some additional information such as the position of the object names in the document (in the title, subtitle, abstract, table, figure caption...) and their number of occurrences. In the future, this will permit to calculate the 'weight' of an object in a reference and to provide SIMBAD users with an important new information, which will help them to find the most relevant papers in the object reference list.

  20. Automatic Modulation Recognition by Support Vector Machines Using Wavelet Kernel

    Energy Technology Data Exchange (ETDEWEB)

    Feng, X Z; Yang, J; Luo, F L; Chen, J Y; Zhong, X P [College of Mechatronic Engineering and Automation, National University of Defense Technology, Changsha (China)

    2006-10-15

    Automatic modulation identification plays a significant role in electronic warfare, electronic surveillance systems and electronic countermeasures. The task of modulation recognition of communication signals is to determine the modulation type and signal parameters. In fact, automatic modulation identification can be regarded as an application of pattern recognition in the communication field. The support vector machine (SVM) is a new universal learning machine which is widely used in the fields of pattern recognition, regression estimation and probability density estimation. In this paper, a new method using a wavelet kernel function is proposed, which maps the input vector xi into a high-dimensional feature space F. In this feature space F, we can construct the optimal hyperplane that realizes the maximal margin. That is to say, we can use the SVM to classify the communication signals into two groups, namely analogue modulated signals and digitally modulated signals. Finally, computer simulation results are given, which show good performance of the method.
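
    As a rough illustration of the wavelet kernel idea in the abstract above, the sketch below builds a translation-invariant kernel as a product of one-dimensional mother wavelets. The Morlet-style wavelet cos(1.75u)·exp(-u²/2), the `dilation` parameter and the sample vectors are assumptions; the abstract does not state which wavelet or parameters the authors used.

```python
import math

def mother_wavelet(u):
    """Morlet-style mother wavelet h(u) = cos(1.75 u) * exp(-u^2 / 2) (assumed choice)."""
    return math.cos(1.75 * u) * math.exp(-u * u / 2.0)

def wavelet_kernel(x, y, dilation=1.0):
    """Translation-invariant wavelet kernel: product over i of h((x_i - y_i) / dilation)."""
    if len(x) != len(y):
        raise ValueError("feature vectors must have the same length")
    value = 1.0
    for xi, yi in zip(x, y):
        value *= mother_wavelet((xi - yi) / dilation)
    return value

def gram_matrix(samples, dilation=1.0):
    """Kernel (Gram) matrix of all pairwise kernel values for a list of samples."""
    return [[wavelet_kernel(a, b, dilation) for b in samples] for a in samples]

# Hypothetical two-dimensional signal features, for illustration only.
features = [[0.1, 0.9], [0.2, 0.8], [1.5, 0.1]]
for row in gram_matrix(features, dilation=1.0):
    print([round(v, 4) for v in row])
```

    A Gram matrix of this kind could be handed to any SVM solver that accepts precomputed kernels; the classifier itself is left out to keep the sketch dependency-free.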

  1. Automatic Modulation Recognition by Support Vector Machines Using Wavelet Kernel

    International Nuclear Information System (INIS)

    Feng, X Z; Yang, J; Luo, F L; Chen, J Y; Zhong, X P

    2006-01-01

    Automatic modulation identification plays a significant role in electronic warfare, electronic surveillance systems and electronic countermeasures. The task of modulation recognition of communication signals is to determine the modulation type and signal parameters. In fact, automatic modulation identification can be regarded as an application of pattern recognition in the communication field. The support vector machine (SVM) is a new universal learning machine which is widely used in the fields of pattern recognition, regression estimation and probability density estimation. In this paper, a new method using a wavelet kernel function is proposed, which maps the input vector xi into a high-dimensional feature space F. In this feature space F, we can construct the optimal hyperplane that realizes the maximal margin. That is to say, we can use the SVM to classify the communication signals into two groups, namely analogue modulated signals and digitally modulated signals. Finally, computer simulation results are given, which show good performance of the method.

  2. Variable Frame Rate and Length Analysis for Data Compression in Distributed Speech Recognition

    DEFF Research Database (Denmark)

    Kraljevski, Ivan; Tan, Zheng-Hua

    2014-01-01

    This paper addresses the issue of data compression in distributed speech recognition on the basis of a variable frame rate and length analysis method. The method first conducts frame selection by using a posteriori signal-to-noise ratio weighted energy distance to find the right time resolution ... length for steady regions. The method is applied to scalable source coding in distributed speech recognition where the target bitrate is met by adjusting the frame rate. Speech recognition results show that the proposed approach outperforms other compression methods in terms of recognition accuracy for noisy speech while achieving higher compression rates.

  3. Spike Pattern Recognition for Automatic Collimation Alignment

    CERN Document Server

    Azzopardi, Gabriella; Salvachua Ferrando, Belen Maria; Mereghetti, Alessio; Redaelli, Stefano; CERN. Geneva. ATS Department

    2017-01-01

    The LHC makes use of a collimation system to protect its sensitive equipment by intercepting potentially dangerous beam halo particles. The appropriate collimator settings to protect the machine against beam losses rely on a very precise alignment of all the collimators with respect to the beam. The beam center at each collimator is then found by touching the beam halo using an alignment procedure. Until now, in order to determine whether a collimator is aligned with the beam or not, a user has been required to follow the collimator's BLM loss data and detect spikes. A machine learning (ML) model was trained in order to automatically recognize spikes when a collimator is aligned. The model was loosely integrated with the alignment implementation to determine the classification performance and reliability, without affecting the alignment process itself. The model was tested on a number of collimators during this MD and the machine learning model was able to output the classifications in real time.

  4. The effect of network degradation on speech recognition

    CSIR Research Space (South Africa)

    Joubert, G

    2005-11-01

    ... become increasingly popular, VoIP (Voice over Internet Protocol) is predicted to become the standard means of spoken telecommunication. As a consequence, a significant amount of research has been undertaken on the effect of various packet... to measure the effect of network traffic degeneration during a VoIP transmission, on speech-recognition accuracy. Sentences from the TIMIT database [2] were selected as basis for comparison. The open-source toolkit SOX [3] was used to code the samples...

  5. Automatic anatomy recognition via multiobject oriented active shape models.

    Science.gov (United States)

    Chen, Xinjian; Udupa, Jayaram K; Alavi, Abass; Torigian, Drew A

    2010-12-01

    This paper studies the feasibility of developing an automatic anatomy recognition (AAR) system in clinical radiology and demonstrates its operation on clinical 2D images. The anatomy recognition method described here consists of two main components: (a) multiobject generalization of OASM and (b) object recognition strategies. The OASM algorithm is generalized to multiple objects by including a model for each object and assigning a cost structure specific to each object in the spirit of live wire. The delineation of multiobject boundaries is done in MOASM via a three level dynamic programming algorithm, wherein the first level is at pixel level which aims to find optimal oriented boundary segments between successive landmarks, the second level is at landmark level which aims to find optimal location for the landmarks, and the third level is at the object level which aims to find optimal arrangement of object boundaries over all objects. The object recognition strategy attempts to find that pose vector (consisting of translation, rotation, and scale component) for the multiobject model that yields the smallest total boundary cost for all objects. The delineation and recognition accuracies were evaluated separately utilizing routine clinical chest CT, abdominal CT, and foot MRI data sets. The delineation accuracy was evaluated in terms of true and false positive volume fractions (TPVF and FPVF). The recognition accuracy was assessed (1) in terms of the size of the space of the pose vectors for the model assembly that yielded high delineation accuracy, (2) as a function of the number of objects and objects' distribution and size in the model, (3) in terms of the interdependence between delineation and recognition, and (4) in terms of the closeness of the optimum recognition result to the global optimum. When multiple objects are included in the model, the delineation accuracy in terms of TPVF can be improved to 97%-98% with a low FPVF of 0.1%-0.2%. Typically, a

  6. Multilevel Analysis in Analyzing Speech Data

    Science.gov (United States)

    Guddattu, Vasudeva; Krishna, Y.

    2011-01-01

    The speech produced by human vocal tract is a complex acoustic signal, with diverse applications in phonetics, speech synthesis, automatic speech recognition, speaker identification, communication aids, speech pathology, speech perception, machine translation, hearing research, rehabilitation and assessment of communication disorders and many…

  7. Automatic Facial Expression Recognition and Operator Functional State

    Science.gov (United States)

    Blanson, Nina

    2012-01-01

    The prevalence of human error in safety-critical occupations remains a major challenge to mission success despite increasing automation in control processes. Although various methods have been proposed to prevent incidences of human error, none of these have been developed to employ the detection and regulation of Operator Functional State (OFS), or the optimal condition of the operator while performing a task, in work environments due to drawbacks such as obtrusiveness and impracticality. A video-based system with the ability to infer an individual's emotional state from facial feature patterning mitigates some of the problems associated with other methods of detecting OFS, like obtrusiveness and impracticality in integration with the mission environment. This paper explores the utility of facial expression recognition as a technology for inferring OFS by first expounding on the intricacies of OFS and the scientific background behind emotion and its relationship with an individual's state. Then, descriptions of the feedback loop and the emotion protocols proposed for the facial recognition program are explained. A basic version of the facial expression recognition program uses Haar classifiers and OpenCV libraries to automatically locate key facial landmarks during a live video stream. Various methods of creating facial expression recognition software are reviewed to guide future extensions of the program. The paper concludes with an examination of the steps necessary in the research of emotion and recommendations for the creation of an automatic facial expression recognition program for use in real-time, safety-critical missions

  8. Automatic Facial Expression Recognition and Operator Functional State

    Science.gov (United States)

    Blanson, Nina

    2011-01-01

    The prevalence of human error in safety-critical occupations remains a major challenge to mission success despite increasing automation in control processes. Although various methods have been proposed to prevent incidences of human error, none of these have been developed to employ the detection and regulation of Operator Functional State (OFS), or the optimal condition of the operator while performing a task, in work environments due to drawbacks such as obtrusiveness and impracticality. A video-based system with the ability to infer an individual's emotional state from facial feature patterning mitigates some of the problems associated with other methods of detecting OFS, like obtrusiveness and impracticality in integration with the mission environment. This paper explores the utility of facial expression recognition as a technology for inferring OFS by first expounding on the intricacies of OFS and the scientific background behind emotion and its relationship with an individual's state. Then, descriptions of the feedback loop and the emotion protocols proposed for the facial recognition program are explained. A basic version of the facial expression recognition program uses Haar classifiers and OpenCV libraries to automatically locate key facial landmarks during a live video stream. Various methods of creating facial expression recognition software are reviewed to guide future extensions of the program. The paper concludes with an examination of the steps necessary in the research of emotion and recommendations for the creation of an automatic facial expression recognition program for use in real-time, safety-critical missions.

  9. Multimodal Approach for Automatic Emotion Recognition Applied to the Tension Levels Study in TV Newscasts

    Directory of Open Access Journals (Sweden)

    Moisés Henrique Ramos Pereira

    2015-12-01

    This article addresses a multimodal approach to automatic emotion recognition in participants of TV newscasts (presenters, reporters, commentators and others) that can support the study of tension levels in narratives of events in this television genre. The methodology applies state-of-the-art computational methods to process and analyze facial expressions, as well as speech signals. The proposed approach contributes to the semiodiscursive study of TV newscasts and their enunciative praxis, assisting, for example, the identification of the communication strategy of these programs. To evaluate the effectiveness of the proposed approach, it was applied to a video of a report shown on a highly popular Brazilian TV newscast in the state of Minas Gerais. The experimental results are promising on the recognition of emotions in the facial expressions of telejournalists and are in accordance with the distribution of audiovisual indicators extracted over a TV newscast, demonstrating the potential of the approach to support TV journalistic discourse analysis.

  10. A Hybrid Acoustic and Pronunciation Model Adaptation Approach for Non-native Speech Recognition

    Science.gov (United States)

    Oh, Yoo Rhee; Kim, Hong Kook

    In this paper, we propose a hybrid model adaptation approach in which pronunciation and acoustic models are adapted by incorporating the pronunciation and acoustic variabilities of non-native speech in order to improve the performance of non-native automatic speech recognition (ASR). Specifically, the proposed hybrid model adaptation can be performed at either the state-tying or triphone-modeling level, depending on the level at which acoustic model adaptation is performed. In both methods, we first analyze the pronunciation variant rules of non-native speakers and then classify each rule as either a pronunciation variant or an acoustic variant. The state-tying level hybrid method then adapts pronunciation models and acoustic models by accommodating the pronunciation variants in the pronunciation dictionary and by clustering the states of triphone acoustic models using the acoustic variants, respectively. On the other hand, the triphone-modeling level hybrid method initially adapts pronunciation models in the same way as in the state-tying level hybrid method; however, for the acoustic model adaptation, the triphone acoustic models are then re-estimated based on the adapted pronunciation models and the states of the re-estimated triphone acoustic models are clustered using the acoustic variants. Korean-spoken English speech recognition experiments show that ASR systems employing the state-tying and triphone-modeling level adaptation methods reduce the average word error rate (WER) for non-native speech by a relative 17.1% and 22.1%, respectively, compared to a baseline ASR system.
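
    The relative WER reductions quoted above are ratios against the baseline, not absolute percentage-point drops. A small sketch of that arithmetic follows; the baseline WER of 20% is a hypothetical value, since the abstract does not report the absolute error rates.

```python
def relative_wer_reduction(baseline_wer, adapted_wer):
    """Relative reduction: the fraction of the baseline error rate that was removed."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return (baseline_wer - adapted_wer) / baseline_wer

def adapted_wer_from_reduction(baseline_wer, relative_reduction):
    """WER that remains after applying a given relative reduction to the baseline."""
    return baseline_wer * (1.0 - relative_reduction)

baseline = 20.0  # hypothetical baseline WER in percent, not taken from the abstract
for method, reduction in (("state-tying", 0.171), ("triphone-modeling", 0.221)):
    adapted = adapted_wer_from_reduction(baseline, reduction)
    print(method, round(adapted, 2), round(baseline - adapted, 2))
```

    Under this assumed baseline, a 17.1% relative reduction corresponds to an absolute drop of only 3.42 percentage points, which is why the two kinds of figure should not be compared directly.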

  11. Analysis of Documentation Speed Using Web-Based Medical Speech Recognition Technology: Randomized Controlled Trial.

    Science.gov (United States)

    Vogel, Markus; Kaisers, Wolfgang; Wassmuth, Ralf; Mayatepek, Ertan

    2015-11-03

    Clinical documentation has undergone a change due to the usage of electronic health records. The core element is to capture clinical findings and document therapy electronically. Health care personnel spend a significant portion of their time on the computer. Alternatives to self-typing, such as speech recognition, are currently believed to increase documentation efficiency and quality, as well as satisfaction of health professionals while accomplishing clinical documentation, but few studies in this area have been published to date. This study describes the effects of using a Web-based medical speech recognition system for clinical documentation in a university hospital on (1) documentation speed, (2) document length, and (3) physician satisfaction. Reports of 28 physicians were randomized to be created with (intervention) or without (control) the assistance of a Web-based system of medical automatic speech recognition (ASR) in the German language. The documentation was entered into a browser's text area and the time to complete the documentation including all necessary corrections, correction effort, number of characters, and mood of participant were stored in a database. The underlying time comprised text entering, text correction, and finalization of the documentation event. Participants self-assessed their moods on a scale of 1-3 (1=good, 2=moderate, 3=bad). Statistical analysis was done using permutation tests. The number of clinical reports eligible for further analysis stood at 1455. Out of 1455 reports, 718 (49.35%) were assisted by ASR and 737 (50.65%) were not assisted by ASR. Average documentation speed without ASR was 173 (SD 101) characters per minute, while it was 217 (SD 120) characters per minute using ASR. The overall increase in documentation speed through Web-based ASR assistance was 26% (P=.04). Participants documented an average of 356 (SD 388) characters per report when not assisted by ASR and 649 (SD 561) characters per report when assisted

  12. Measuring the accuracy of automatic shoeprint recognition methods.

    Science.gov (United States)

    Luostarinen, Tapio; Lehmussola, Antti

    2014-11-01

    Shoeprints are an important source of information for criminal investigation. Therefore, an increasing number of automatic shoeprint recognition methods have been proposed for detecting the corresponding shoe models. However, comprehensive comparisons among the methods have not previously been made. In this study, an extensive set of methods proposed in the literature was implemented, and their performance was studied in varying conditions. Three datasets of different quality shoeprints were used, and the methods were evaluated also with partial and rotated prints. The results show clear differences between the algorithms: while the best performing method, based on local image descriptors and RANSAC, provides rather good results with most of the experiments, some methods are almost completely unrobust against any unidealities in the images. Finally, the results demonstrate that there is still a need for extensive research to improve the accuracy of automatic recognition of crime scene prints. © 2014 American Academy of Forensic Sciences.

  13. Automatic TLI recognition system. Part 1: System description

    Energy Technology Data Exchange (ETDEWEB)

    Partin, J.K.; Lassahn, G.D.; Davidson, J.R.

    1994-05-01

    This report describes an automatic target recognition system for fast screening of large amounts of multi-sensor image data, based on low-cost parallel processors. This system uses image data fusion and gives uncertainty estimates. It is relatively low cost, compact, and transportable. The software is easily enhanced to expand the system's capabilities, and the hardware is easily expandable to increase the system's speed. This volume gives a general description of the ATR system.

  14. Forensic Automatic Speaker Recognition Based on Likelihood Ratio Using Acoustic-phonetic Features Measured Automatically

    Directory of Open Access Journals (Sweden)

    Huapeng Wang

    2015-01-01

    Forensic speaker recognition is experiencing a remarkable paradigm shift in terms of the evaluation framework and presentation of voice evidence. This paper proposes a new method of forensic automatic speaker recognition using the likelihood ratio framework to quantify the strength of voice evidence. The proposed method uses a reference database to calculate the within- and between-speaker variability. Some acoustic-phonetic features are extracted automatically using the software VoiceSauce. The effectiveness of the approach was tested using two Mandarin databases: a mobile telephone database and a landline database. The experimental results indicate that these acoustic-phonetic features do have some discriminating potential and are worth trying in discrimination. The automatic acoustic-phonetic features have acceptable discriminative performance and can provide more reliable results in evidence analysis when fused with other kinds of voice features.
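
    The likelihood ratio framework mentioned above compares how probable the evidence is under the same-speaker hypothesis versus the different-speaker hypothesis. The sketch below is a simplified score-based version that assumes Gaussian models of a single feature difference; the distributions, their parameters and the example values are hypothetical and are not taken from the paper.

```python
from statistics import NormalDist

def likelihood_ratio(observed_difference, same_speaker_model, different_speaker_model):
    """LR = p(evidence | same speaker) / p(evidence | different speakers)."""
    numerator = same_speaker_model.pdf(observed_difference)
    denominator = different_speaker_model.pdf(observed_difference)
    return numerator / denominator

# Hypothetical models of a feature difference (e.g. in Hz) between two recordings.
# In practice both spreads would be estimated from a reference database.
within_speaker = NormalDist(mu=0.0, sigma=10.0)   # same-speaker variability (assumed)
between_speaker = NormalDist(mu=0.0, sigma=40.0)  # different-speaker variability (assumed)

for difference in (0.0, 15.0, 50.0):
    lr = likelihood_ratio(difference, within_speaker, between_speaker)
    print(difference, lr)
```

    Values above 1 support the same-speaker hypothesis and values below 1 support the different-speaker hypothesis; the further from 1, the stronger the evidence.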

  15. Automatic Blastomere Recognition from a Single Embryo Image

    Directory of Open Access Journals (Sweden)

    Yun Tian

    2014-01-01

    The number of blastomeres of human day 3 embryos is one of the most important criteria for evaluating embryo viability. However, due to the transparency and overlap of blastomeres, it is a challenge to recognize blastomeres automatically using a single embryo image. This study proposes an approach based on least square curve fitting (LSCF) for automatic blastomere recognition from a single image. First, combining edge detection, deletion of multiple connected points, and dilation and erosion, an effective preprocessing method was designed to obtain part of blastomere edges that were singly connected. Next, an automatic recognition method for blastomeres was proposed using least square circle fitting. This algorithm was tested on 381 embryo microscopic images obtained from the eight-cell period, and the results were compared with those provided by experts. Embryos were recognized with a 0 error rate occupancy of 21.59%, and the ratio of embryos in which the false recognition number was less than or equal to 2 was 83.16%. This experiment demonstrated that our method could efficiently and rapidly recognize the number of blastomeres from a single embryo image without the need to reconstruct the three-dimensional model of the blastomeres first; this method is simple and efficient.
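
    Least square circle fitting, as named in the abstract above, can be sketched with the standard algebraic formulation: fit x² + y² + Dx + Ey + F = 0 to edge points by linear least squares, then recover the center and radius. This is a generic textbook variant and may differ from the authors' exact LSCF procedure; the edge points are synthetic.

```python
import math

def solve_3x3(matrix, rhs):
    """Solve a 3x3 linear system by Gaussian elimination with partial pivoting."""
    a = [row[:] + [b] for row, b in zip(matrix, rhs)]
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-12:
            raise ValueError("points are collinear or too few to define a circle")
        a[col], a[pivot] = a[pivot], a[col]
        for row in range(col + 1, 3):
            factor = a[row][col] / a[col][col]
            for k in range(col, 4):
                a[row][k] -= factor * a[col][k]
    solution = [0.0, 0.0, 0.0]
    for row in (2, 1, 0):
        tail = sum(a[row][k] * solution[k] for k in range(row + 1, 3))
        solution[row] = (a[row][3] - tail) / a[row][row]
    return solution

def fit_circle(points):
    """Algebraic least-squares circle fit; returns (center_x, center_y, radius)."""
    n = len(points)
    if n < 3:
        raise ValueError("at least three points are required")
    sx = sum(x for x, _ in points)
    sy = sum(y for _, y in points)
    sxx = sum(x * x for x, _ in points)
    syy = sum(y * y for _, y in points)
    sxy = sum(x * y for x, y in points)
    sz = sum(x * x + y * y for x, y in points)
    sxz = sum(x * (x * x + y * y) for x, y in points)
    syz = sum(y * (x * x + y * y) for x, y in points)
    # Normal equations of min sum (x^2 + y^2 + D x + E y + F)^2 over D, E, F.
    d, e, f = solve_3x3([[sxx, sxy, sx], [sxy, syy, sy], [sx, sy, n]],
                        [-sxz, -syz, -sz])
    cx, cy = -d / 2.0, -e / 2.0
    return cx, cy, math.sqrt(cx * cx + cy * cy - f)

# Synthetic partial edge: points on an arc of a circle centered at (3, -2), radius 5.
arc = [(3 + 5 * math.cos(t), -2 + 5 * math.sin(t)) for t in (0.1, 0.7, 1.3, 2.0, 2.9)]
print(fit_circle(arc))
```

    Because the fit needs only a partial arc, it suits the situation the abstract describes, where overlapping blastomeres leave only part of each edge visible.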

  16. iFER: facial expression recognition using automatically selected geometric eye and eyebrow features

    Science.gov (United States)

    Oztel, Ismail; Yolcu, Gozde; Oz, Cemil; Kazan, Serap; Bunyak, Filiz

    2018-03-01

    Facial expressions have an important role in interpersonal communications and estimation of emotional states or intentions. Automatic recognition of facial expressions has led to many practical applications and has become one of the important topics in computer vision. We present a facial expression recognition system that relies on geometry-based features extracted from eye and eyebrow regions of the face. The proposed system detects keypoints on frontal face images and forms a feature set using geometric relationships among groups of detected keypoints. The obtained feature set is refined and reduced using the sequential forward selection (SFS) algorithm and fed to a support vector machine classifier to recognize five facial expression classes. The proposed system, iFER (eye-eyebrow only facial expression recognition), is robust to lower face occlusions that may be caused by beards, mustaches, scarves, etc. and lower face motion during speech production. Preliminary experiments on benchmark datasets produced promising results, outperforming previous facial expression recognition studies using partial face features and giving results comparable to studies using whole face information: only ~2.5% lower than the best whole face facial recognition system while using only ~1/3 of the facial region.

  17. Masked Speech Recognition and Reading Ability in School-Age Children: Is There a Relationship?

    Science.gov (United States)

    Miller, Gabrielle; Lewis, Barbara; Benchek, Penelope; Buss, Emily; Calandruccio, Lauren

    2018-01-01

    Purpose: The relationship between reading (decoding) skills, phonological processing abilities, and masked speech recognition in typically developing children was explored. This experiment was designed to evaluate the relationship between phonological processing and decoding abilities and 2 aspects of masked speech recognition in typically…

  18. Suprasegmental lexical stress cues in visual speech can guide spoken-word recognition

    NARCIS (Netherlands)

    Jesse, A.; McQueen, J.M.

    2014-01-01

    Visual cues to the individual segments of speech and to sentence prosody guide speech recognition. The present study tested whether visual suprasegmental cues to the stress patterns of words can also constrain recognition. Dutch listeners use acoustic suprasegmental cues to lexical stress (changes

  19. Automatic anatomy recognition on CT images with pathology

    Science.gov (United States)

    Huang, Lidong; Udupa, Jayaram K.; Tong, Yubing; Odhner, Dewey; Torigian, Drew A.

    2016-03-01

    Body-wide anatomy recognition on CT images with pathology becomes crucial for quantifying body-wide disease burden. This, however, is a challenging problem because various diseases result in various abnormalities of objects such as shape and intensity patterns. We previously developed an automatic anatomy recognition (AAR) system [1] whose applicability was demonstrated on near normal diagnostic CT images in different body regions on 35 organs. The aim of this paper is to investigate strategies for adapting the previous AAR system to diagnostic CT images of patients with various pathologies as a first step toward automated body-wide disease quantification. The AAR approach consists of three main steps - model building, object recognition, and object delineation. In this paper, within the broader AAR framework, we describe a new strategy for object recognition to handle abnormal images. In the model building stage an optimal threshold interval is learned from near-normal training images for each object. This threshold is optimally tuned to the pathological manifestation of the object in the test image. Recognition is performed following a hierarchical representation of the objects. Experimental results for the abdominal body region based on 50 near-normal images used for model building and 20 abnormal images used for object recognition show that object localization accuracy within 2 voxels for liver and spleen and 3 voxels for kidney can be achieved with the new strategy.

  20. A system of automatic speaker recognition on a minicomputer

    International Nuclear Information System (INIS)

    El Chafei, Cherif

    1978-01-01

    This study describes a system of automatic speaker recognition using the pitch of the voice. Preprocessing consists of extracting the speakers' discriminating characteristics from the pitch. The recognition program first performs a preselection and then calculates the distance between the characteristics of the speaker to be recognized and those of the speakers already recorded. A recognition experiment was carried out with 15 speakers and included 566 tests spread over an intermittent period of four months. The discriminating characteristics used offer several interesting qualities. The algorithms for measuring the characteristics, on the one hand, and for classifying the speakers, on the other, are simple. The results obtained in real time with a minicomputer are satisfactory. Furthermore, they could probably be improved by considering other discriminating speaker characteristics, but this was not possible in this study. (author)

  1. How does susceptibility to proactive interference relate to speech recognition in aided and unaided conditions?

    Science.gov (United States)

    Ellis, Rachel J; Rönnberg, Jerker

    2015-01-01

    Proactive interference (PI) is the capacity to resist interference to the acquisition of new memories from information stored in the long-term memory. Previous research has shown that PI correlates significantly with the speech-in-noise recognition scores of younger adults with normal hearing. In this study, we report the results of an experiment designed to investigate the extent to which tests of visual PI relate to the speech-in-noise recognition scores of older adults with hearing loss, in aided and unaided conditions. The results suggest that measures of PI correlate significantly with speech-in-noise recognition only in the unaided condition. Furthermore, the relation between PI and speech-in-noise recognition differs from that observed in younger listeners without hearing loss. The findings suggest that the relation between PI tests and the speech-in-noise recognition scores of older adults with hearing loss relates to the capability of the test to index cognitive flexibility.

  2. Segment-based acoustic models for continuous speech recognition

    Science.gov (United States)

    Ostendorf, Mari; Rohlicek, J. R.

    1993-07-01

    This research aims to develop new and more accurate stochastic models for speaker-independent continuous speech recognition, by extending previous work in segment-based modeling and by introducing a new hierarchical approach to representing intra-utterance statistical dependencies. These techniques, which are more costly than traditional approaches because of the large search space associated with higher order models, are made feasible through rescoring a set of HMM-generated N-best sentence hypotheses. We expect these different modeling techniques to result in improved recognition performance over that achieved by current systems, which handle only frame-based observations and assume that these observations are independent given an underlying state sequence. In the fourth quarter of the project, we have completed the following: (1) ported our recognition system to the Wall Street Journal task, a standard task in the ARPA community; (2) developed an initial dependency-tree model of intra-utterance observation correlation; and (3) implemented baseline language model estimation software. Our initial results on the Wall Street Journal task are quite good and represent significantly improved performance over most HMM systems reporting on the Nov. 1992 5k vocabulary test set.
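    The N-best rescoring idea described in this abstract can be sketched as follows. The score names (`hmm`, `segment`), the linear weighting, and the example hypotheses are illustrative assumptions; the abstract does not specify how the scores are combined.

```python
def rescore_nbest(hypotheses, weights):
    """Return the hypothesis with the highest weighted sum of log scores.

    Each hypothesis is a dict with a "text" and a "scores" mapping from
    a knowledge-source name to a log-domain score (hypothetical layout).
    """
    def combined(hyp):
        return sum(w * hyp["scores"][name] for name, w in weights.items())

    return max(hypotheses, key=combined)


# A first-pass recognizer proposes a short list; a costlier model re-ranks it.
nbest = [
    {"text": "recognize speech", "scores": {"hmm": -10.0, "segment": -8.0}},
    {"text": "wreck a nice beach", "scores": {"hmm": -9.0, "segment": -12.0}},
]
first_pass_only = rescore_nbest(nbest, {"hmm": 1.0, "segment": 0.0})
rescored = rescore_nbest(nbest, {"hmm": 1.0, "segment": 1.0})
```

    The expensive model is evaluated only on the N listed hypotheses rather than over the full search space, which is what makes higher-order models affordable.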

  3. Speech recognition technology: an outlook for human-to-machine interaction.

    Science.gov (United States)

    Erdel, T; Crooks, S

    2000-01-01

    Speech recognition, as an enabling technology in healthcare-systems computing, is a topic that has been discussed for quite some time, but is just now coming to fruition. Traditionally, speech-recognition software has been constrained by hardware, but improved processors and increased memory capacities are starting to remove some of these limitations. With these barriers removed, companies that create software for the healthcare setting have the opportunity to write more successful applications. Among the criticisms of speech-recognition applications are the high rates of error and steep training curves. However, even in the face of such negative perceptions, there remain significant opportunities for speech recognition to allow healthcare providers and, more specifically, physicians, to work more efficiently and ultimately spend more time with their patients and less time completing necessary documentation. This article will identify opportunities for inclusion of speech-recognition technology in the healthcare setting and examine major categories of speech-recognition software--continuous speech recognition, command and control, and text-to-speech. We will discuss the advantages and disadvantages of each area, the limitations of the software today, and how future trends might affect them.

  4. Neuroscience-inspired computational systems for speech recognition under noisy conditions

    Science.gov (United States)

    Schafer, Phillip B.

    Humans routinely recognize speech in challenging acoustic environments with background music, engine sounds, competing talkers, and other acoustic noise. However, today's automatic speech recognition (ASR) systems perform poorly in such environments. In this dissertation, I present novel methods for ASR designed to approach human-level performance by emulating the brain's processing of sounds. I exploit recent advances in auditory neuroscience to compute neuron-based representations of speech, and design novel methods for decoding these representations to produce word transcriptions. I begin by considering speech representations modeled on the spectrotemporal receptive fields of auditory neurons. These representations can be tuned to optimize a variety of objective functions, which characterize the response properties of a neural population. I propose an objective function that explicitly optimizes the noise invariance of the neural responses, and find that it gives improved performance on an ASR task in noise compared to other objectives. The method as a whole, however, fails to significantly close the performance gap with humans. I next consider speech representations that make use of spiking model neurons. The neurons in this method are feature detectors that selectively respond to spectrotemporal patterns within short time windows in speech. I consider a number of methods for training the response properties of the neurons. In particular, I present a method using linear support vector machines (SVMs) and show that this method produces spikes that are robust to additive noise. I compute the spectrotemporal receptive fields of the neurons for comparison with previous physiological results. To decode the spike-based speech representations, I propose two methods designed to work on isolated word recordings. The first method uses a classical ASR technique based on the hidden Markov model. The second method is a novel template-based recognition scheme that takes

  5. Lexical decoder for continuous speech recognition: sequential neural network approach

    International Nuclear Information System (INIS)

    Iooss, Christine

    1991-01-01

    The work presented in this dissertation concerns the study of a connectionist architecture to treat sequential inputs. In this context, the model proposed by J.L. Elman, a recurrent multilayer network, is used. Its abilities and its limits are evaluated. Modifications are made in order to treat erroneous or noisy sequential inputs and to classify patterns. The application context of this study concerns the realisation of a lexical decoder for analytical multi-speaker continuous speech recognition. Lexical decoding is completed from lattices of phonemes which are obtained after an acoustic-phonetic decoding stage relying on a K Nearest Neighbors search technique. Tests are done on sentences formed from a lexicon of 20 words. The results obtained show the ability of the proposed connectionist model to take into account the sequentiality at the input level, to memorize the context and to treat noisy or erroneous inputs. (author)
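    The recurrent architecture referred to in this abstract can be illustrated with a minimal forward pass of an Elman-style network, in which the previous hidden state is fed back as a context layer. The tanh activation, the absence of biases, and the tiny weight matrices are assumptions chosen for brevity and are not taken from the dissertation.

```python
import math


def elman_step(x, context, w_in, w_ctx, w_out):
    """One time step: hidden = tanh(W_in x + W_ctx context); output = W_out hidden."""
    hidden = [
        math.tanh(sum(w * v for w, v in zip(row_in, x))
                  + sum(w * c for w, c in zip(row_ctx, context)))
        for row_in, row_ctx in zip(w_in, w_ctx)
    ]
    output = [sum(w * h for w, h in zip(row, hidden)) for row in w_out]
    return output, hidden


def run_sequence(inputs, w_in, w_ctx, w_out):
    """Feed a sequence through the network, carrying the context forward."""
    context = [0.0] * len(w_in)  # context layer starts empty
    outputs = []
    for x in inputs:
        y, context = elman_step(x, context, w_in, w_ctx, w_out)
        outputs.append(y)
    return outputs


# One input unit, one hidden unit, one output unit (illustrative weights).
outputs = run_sequence([[1.0], [1.0]], w_in=[[1.0]], w_ctx=[[0.5]], w_out=[[1.0]])
```

    The same input value produces a different output at the second step because the context layer remembers the first step, which is the property that lets such a network account for sequential order.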

  6. Robust Digital Speech Watermarking For Online Speaker Recognition

    Directory of Open Access Journals (Sweden)

    Mohammad Ali Nematollahi

    2015-01-01

    A robust and blind digital speech watermarking technique has been proposed for online speaker recognition systems based on the Discrete Wavelet Packet Transform (DWPT) and multiplication to embed the watermark in the amplitudes of the wavelet’s subbands. In order to minimize the degradation effect of the watermark, the subbands are selected where less speaker-specific information is available (500 Hz–3500 Hz and 6000 Hz–7000 Hz). Experimental results on Texas Instruments Massachusetts Institute of Technology (TIMIT), Massachusetts Institute of Technology (MIT), and Mobile Biometry (MOBIO) show that the degradation for speaker verification and identification is 1.16% and 2.52%, respectively. Furthermore, the proposed watermark technique can provide enough robustness against different signal processing attacks.

  7. Evaluation of automatic face recognition for automatic border control on actual data recorded of travellers at Schiphol Airport

    NARCIS (Netherlands)

    Spreeuwers, Lieuwe Jan; Hendrikse, A.J.; Gerritsen, K.J.; Brömme, A.; Busch, C.

    2012-01-01

    Automatic border control at airports using automated facial recognition for checking the passport is becoming more and more common. A problem is that it is not clear how reliable these automatic gates are. Very few independent studies exist that assess the reliability of automated facial recognition

  8. Individual differences in language and working memory affect children's speech recognition in noise.

    Science.gov (United States)

    McCreery, Ryan W; Spratford, Meredith; Kirby, Benjamin; Brennan, Marc

    2017-05-01

    We examined how cognitive and linguistic skills affect speech recognition in noise for children with normal hearing. Children with better working memory and language abilities were expected to have better speech recognition in noise than peers with poorer skills in these domains. As part of a prospective, cross-sectional study, children with normal hearing completed speech recognition in noise for three types of stimuli: (1) monosyllabic words, (2) syntactically correct but semantically anomalous sentences and (3) semantically and syntactically anomalous word sequences. Measures of vocabulary, syntax and working memory were used to predict individual differences in speech recognition in noise. Ninety-six children with normal hearing, who were between 5 and 12 years of age. Higher working memory was associated with better speech recognition in noise for all three stimulus types. Higher vocabulary abilities were associated with better recognition in noise for sentences and word sequences, but not for words. Working memory and language both influence children's speech recognition in noise, but the relationships vary across types of stimuli. These findings suggest that clinical assessment of speech recognition is likely to reflect underlying cognitive and linguistic abilities, in addition to a child's auditory skills, consistent with the Ease of Language Understanding model.

  9. Individual differences in language and working memory affect children’s speech recognition in noise

    Science.gov (United States)

    McCreery, Ryan W.; Spratford, Meredith; Kirby, Benjamin; Brennan, Marc

    2017-01-01

    Objective: We examined how cognitive and linguistic skills affect speech recognition in noise for children with normal hearing. Children with better working memory and language abilities were expected to have better speech recognition in noise than peers with poorer skills in these domains. Design: As part of a prospective, cross-sectional study, children with normal hearing completed speech recognition in noise for three types of stimuli: (1) monosyllabic words, (2) syntactically correct but semantically anomalous sentences and (3) semantically and syntactically anomalous word sequences. Measures of vocabulary, syntax and working memory were used to predict individual differences in speech recognition in noise. Study sample: Ninety-six children with normal hearing, who were between 5 and 12 years of age. Results: Higher working memory was associated with better speech recognition in noise for all three stimulus types. Higher vocabulary abilities were associated with better recognition in noise for sentences and word sequences, but not for words. Conclusions: Working memory and language both influence children’s speech recognition in noise, but the relationships vary across types of stimuli. These findings suggest that clinical assessment of speech recognition is likely to reflect underlying cognitive and linguistic abilities, in addition to a child’s auditory skills, consistent with the Ease of Language Understanding model. PMID:27981855

  10. Suprasegmental lexical stress cues in visual speech can guide spoken-word recognition

    OpenAIRE

    Jesse, A.; McQueen, J.

    2014-01-01

    Visual cues to the individual segments of speech and to sentence prosody guide speech recognition. The present study tested whether visual suprasegmental cues to the stress patterns of words can also constrain recognition. Dutch listeners use acoustic suprasegmental cues to lexical stress (changes in duration, amplitude, and pitch) in spoken-word recognition. We asked here whether they can also use visual suprasegmental cues. In two categorization experiments, Dutch participants saw a speaker...

  11. Interfacing COTS Speech Recognition and Synthesis Software to a Lotus Notes Military Command and Control Database

    Science.gov (United States)

    Carr, Oliver

    2002-10-01

    Speech recognition and synthesis technologies have become commercially viable over recent years. Two current market leading products in speech recognition technology are Dragon NaturallySpeaking and IBM ViaVoice. This report describes the development of speech user interfaces incorporating these products with Lotus Notes and Java applications. These interfaces enable data entry using speech recognition and allow warnings and instructions to be issued via speech synthesis. The development of a military vocabulary to improve user interaction is discussed. The report also describes an evaluation in terms of speed of the various speech user interfaces developed using Dragon NaturallySpeaking and IBM ViaVoice with a Lotus Notes Command and Control Support System Log database.

  12. Recognition of speaker-dependent continuous speech with KEAL

    Science.gov (United States)

    Mercier, G.; Bigorgne, D.; Miclet, L.; Le Guennec, L.; Querre, M.

    1989-04-01

    A description of the speaker-dependent continuous speech recognition system KEAL is given. An unknown utterance is recognized by means of the following procedures: acoustic analysis, phonetic segmentation and identification, word and sentence analysis. The combination of feature-based, speaker-independent coarse phonetic segmentation with speaker-dependent statistical classification techniques is one of the main design features of the acoustic-phonetic decoder. The lexical access component is essentially based on a statistical dynamic programming technique which aims at matching a phonemic lexical entry containing various phonological forms, against a phonetic lattice. Sentence recognition is achieved by use of a context-free grammar and a parsing algorithm derived from Earley's parser. A speaker adaptation module allows some of the system parameters to be adjusted by matching known utterances with their acoustical representation. The task to be performed, described by its vocabulary and its grammar, is given as a parameter of the system. Continuously spoken sentences extracted from a 'pseudo-Logo' language are analyzed and results are presented.

  13. Automatic radar target recognition of objects falling on railway tracks

    International Nuclear Information System (INIS)

    Mroué, A; Heddebaut, M; Elbahhar, F; Rivenq, A; Rouvaen, J-M

    2012-01-01

    This paper presents an automatic radar target recognition procedure based on complex resonances using the signals provided by ultra-wideband radar. This procedure is dedicated to detection and identification of objects lying on railway tracks. For an efficient complex resonance extraction, a comparison between several pole extraction methods is illustrated. Therefore, preprocessing methods are presented aiming to remove most of the erroneous poles interfering with the discrimination scheme. Once physical poles are determined, a specific discrimination technique is introduced based on the Euclidean distances. Both simulation and experimental results are depicted showing an efficient discrimination of different targets including guided transport passengers

  14. Perception of synthetic speech produced automatically by rule: Intelligibility of eight text-to-speech systems.

    Science.gov (United States)

    Greene, Beth G; Logan, John S; Pisoni, David B

    1986-03-01

    We present the results of studies designed to measure the segmental intelligibility of eight text-to-speech systems and a natural speech control, using the Modified Rhyme Test (MRT). Results indicated that the voices tested could be grouped into four categories: natural speech, high-quality synthetic speech, moderate-quality synthetic speech, and low-quality synthetic speech. The overall performance of the best synthesis system, DECtalk-Paul, was equivalent to natural speech only in terms of performance on initial consonants. The findings are discussed in terms of recent work investigating the perception of synthetic speech under more severe conditions. Suggestions for future research on improving the quality of synthetic speech are also considered.

  15. Perception of synthetic speech produced automatically by rule: Intelligibility of eight text-to-speech systems

    Science.gov (United States)

    GREENE, BETH G.; LOGAN, JOHN S.; PISONI, DAVID B.

    2012-01-01

    We present the results of studies designed to measure the segmental intelligibility of eight text-to-speech systems and a natural speech control, using the Modified Rhyme Test (MRT). Results indicated that the voices tested could be grouped into four categories: natural speech, high-quality synthetic speech, moderate-quality synthetic speech, and low-quality synthetic speech. The overall performance of the best synthesis system, DECtalk-Paul, was equivalent to natural speech only in terms of performance on initial consonants. The findings are discussed in terms of recent work investigating the perception of synthetic speech under more severe conditions. Suggestions for future research on improving the quality of synthetic speech are also considered. PMID:23225916

  16. A pattern recognition approach based on DTW for automatic transient identification in nuclear power plants

    International Nuclear Information System (INIS)

    Galbally, Javier; Galbally, David

    2015-01-01

    Highlights: • Novel transient identification method for NPPs. • Low-complexity. • Low training data requirements. • High accuracy. • Fully reproducible protocol carried out on a real benchmark. - Abstract: Automatic identification of transients in nuclear power plants (NPPs) allows monitoring the fatigue damage accumulated by critical components during plant operation, and is therefore of great importance for ensuring that usage factors remain within the original design bases postulated by the plant designer. Although several schemes to address this important issue have been explored in the literature, there is still no definitive solution available. In the present work, a new method for automatic transient identification is proposed, based on the Dynamic Time Warping (DTW) algorithm, largely used in other related areas such as signature or speech recognition. The novel transient identification system is evaluated on real operational data following a rigorous pattern recognition protocol. Results show the high accuracy of the proposed approach, which is combined with other interesting features such as its low complexity and its very limited requirements of training data
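    The Dynamic Time Warping (DTW) matching that this abstract builds on can be sketched with the classic dynamic-programming recurrence. The absolute-difference local cost, the nearest-template decision rule, and the template names are illustrative assumptions, not details of the authors' protocol.

```python
def dtw_distance(a, b):
    """Classic DTW cost between two 1-D sequences with |x - y| local cost."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = best cumulative cost aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(cost[i - 1][j],      # a advances
                                     cost[i][j - 1],      # b advances
                                     cost[i - 1][j - 1])  # both advance
    return cost[n][m]


def identify(signal, templates):
    """Return the name of the reference template closest to the signal."""
    return min(templates, key=lambda name: dtw_distance(signal, templates[name]))


# Hypothetical reference transients and a time-stretched observation.
templates = {"ramp": [0, 1, 2, 3], "step": [0, 0, 3, 3]}
label = identify([0, 1, 1, 2, 3], templates)
```

    Because DTW tolerates local stretching of the time axis, a single reference recording per transient type can match observations of different durations, which is consistent with the low training-data requirement highlighted above.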

  17. The Effects of Background Noise on the Performance of an Automatic Speech Recogniser

    Science.gov (United States)

    Littlefield, Jason; HashemiSakhtsari, Ahmad

    2002-11-01

    Ambient or environmental noise is a major factor that affects the performance of an automatic speech recognizer. Large vocabulary, speaker-dependent, continuous speech recognizers are commercially available. Speech recognizers perform well in a quiet environment, but poorly in a noisy environment. Speaker-dependent speech recognizers require training prior to being tested, where the level of background noise in both phases affects the performance of the recognizer. This study aims to determine whether the best performance of a speech recognizer occurs when the levels of background noise during the training and test phases are the same, and how the performance is affected when the levels of background noise during the training and test phases are different. The relationship between the performance of the speech recognizer and upgrading the computer speed and amount of memory as well as software version was also investigated.

  18. I Hear You Eat and Speak: Automatic Recognition of Eating Condition and Food Type, Use-Cases, and Impact on ASR Performance.

    Science.gov (United States)

    Hantke, Simone; Weninger, Felix; Kurle, Richard; Ringeval, Fabien; Batliner, Anton; Mousa, Amr El-Desoky; Schuller, Björn

    2016-01-01

    We propose a new recognition task in the area of computational paralinguistics: automatic recognition of eating conditions in speech, i.e., whether people are eating while speaking, and what they are eating. To this end, we introduce the audio-visual iHEARu-EAT database featuring 1.6k utterances of 30 subjects (mean age: 26.1 years, standard deviation: 2.66 years, gender balanced, German speakers), six types of food (Apple, Nectarine, Banana, Haribo Smurfs, Biscuit, and Crisps), and read as well as spontaneous speech, which is made publicly available for research purposes. We start with demonstrating that for automatic speech recognition (ASR), it pays off to know whether speakers are eating or not. We also propose automatic classification both by brute-forcing of low-level acoustic features as well as higher-level features related to intelligibility, obtained from an Automatic Speech Recogniser. Prediction of the eating condition was performed with a Support Vector Machine (SVM) classifier employed in a leave-one-speaker-out evaluation framework. Results show that the binary prediction of eating condition (i.e., eating or not eating) can be easily solved independently of the speaking condition; the obtained average recalls are all above 90%. Low-level acoustic features provide the best performance on spontaneous speech, which reaches up to 62.3% average recall for multi-way classification of the eating condition, i.e., discriminating the six types of food, as well as not eating. The early fusion of features related to intelligibility with the brute-forced acoustic feature set improves the performance on read speech, reaching a 66.4% average recall for the multi-way classification task. Analysing features and classifier errors leads to a suitable ordinal scale for eating conditions, on which automatic regression can be performed with up to 56.2% determination coefficient.
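    The evaluation scheme described in this abstract can be sketched in two small helpers: one that generates leave-one-speaker-out folds and one that computes average recall. The sketch assumes that "average recall" means the unweighted mean of per-class recalls; the helper names and toy labels are hypothetical, and the classifier itself is omitted.

```python
def leave_one_speaker_out(speakers):
    """Yield (held_out, train_indices, test_indices) for each speaker."""
    for held_out in sorted(set(speakers)):
        train = [i for i, s in enumerate(speakers) if s != held_out]
        test = [i for i, s in enumerate(speakers) if s == held_out]
        yield held_out, train, test


def unweighted_average_recall(y_true, y_pred):
    """Mean of per-class recalls, so every class counts equally."""
    recalls = []
    for cls in sorted(set(y_true)):
        idx = [i for i, t in enumerate(y_true) if t == cls]
        hits = sum(1 for i in idx if y_pred[i] == cls)
        recalls.append(hits / len(idx))
    return sum(recalls) / len(recalls)


# Toy data: three utterances from two speakers, and four scored predictions.
folds = list(leave_one_speaker_out(["a", "a", "b"]))
uar = unweighted_average_recall(["eat", "eat", "eat", "no"],
                                ["eat", "eat", "no", "no"])
```

    Holding out all utterances of one speaker at a time keeps the test speaker unseen during training, so the reported recall reflects speaker-independent performance.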

  19. From birdsong to human speech recognition: bayesian inference on a hierarchy of nonlinear dynamical systems.

    Science.gov (United States)

    Yildiz, Izzet B; von Kriegstein, Katharina; Kiebel, Stefan J

    2013-01-01

    Our knowledge about the computational mechanisms underlying human learning and recognition of sound sequences, especially speech, is still very limited. One difficulty in deciphering the exact means by which humans recognize speech is that there are scarce experimental findings at a neuronal, microscopic level. Here, we show that our neuronal-computational understanding of speech learning and recognition may be vastly improved by looking at an animal model, i.e., the songbird, which faces the same challenge as humans: to learn and decode complex auditory input, in an online fashion. Motivated by striking similarities between the human and songbird neural recognition systems at the macroscopic level, we assumed that the human brain uses the same computational principles at a microscopic level and translated a birdsong model into a novel human sound learning and recognition model with an emphasis on speech. We show that the resulting Bayesian model with a hierarchy of nonlinear dynamical systems can learn speech samples such as words rapidly and recognize them robustly, even in adverse conditions. In addition, we show that recognition can be performed even when words are spoken by different speakers and with different accents, an everyday situation in which current state-of-the-art speech recognition models often fail. The model can also be used to qualitatively explain behavioral data on human speech learning and derive predictions for future experiments.

  20. From birdsong to human speech recognition: bayesian inference on a hierarchy of nonlinear dynamical systems.

    Directory of Open Access Journals (Sweden)

    Izzet B Yildiz

    Our knowledge about the computational mechanisms underlying human learning and recognition of sound sequences, especially speech, is still very limited. One difficulty in deciphering the exact means by which humans recognize speech is that there are scarce experimental findings at a neuronal, microscopic level. Here, we show that our neuronal-computational understanding of speech learning and recognition may be vastly improved by looking at an animal model, i.e., the songbird, which faces the same challenge as humans: to learn and decode complex auditory input, in an online fashion. Motivated by striking similarities between the human and songbird neural recognition systems at the macroscopic level, we assumed that the human brain uses the same computational principles at a microscopic level and translated a birdsong model into a novel human sound learning and recognition model with an emphasis on speech. We show that the resulting Bayesian model with a hierarchy of nonlinear dynamical systems can learn speech samples such as words rapidly and recognize them robustly, even in adverse conditions. In addition, we show that recognition can be performed even when words are spoken by different speakers and with different accents, an everyday situation in which current state-of-the-art speech recognition models often fail. The model can also be used to qualitatively explain behavioral data on human speech learning and derive predictions for future experiments.

  1. Entrance C - New Automatic Number Plate Recognition System

    CERN Multimedia

    2013-01-01

    Entrance C (Satigny) is now equipped with a latest-generation Automatic Number Plate Recognition (ANPR) system and a fast-action road gate. During the month of August, Entrance C will be continuously open from 7.00 a.m. to 7.00 p.m. (working days only). The security guards will open the gate as usual from 7.00 a.m. to 9.00 a.m. and from 5.00 p.m. to 7.00 p.m. For the rest of the working day (9.00 a.m. to 5.00 p.m.) the gate will operate automatically. Please observe the following points: stop at the STOP sign on the ground; position yourself next to the card reader for optimal recognition; motorcyclists must use their CERN card; cyclists may not activate the gate and should use the bicycle turnstile; keep a safe distance from the vehicle in front of you. If access is denied, please check that your vehicle regist...

  2. Lexical-Access Ability and Cognitive Predictors of Speech Recognition in Noise in Adult Cochlear Implant Users

    OpenAIRE

    Kaandorp, Marre W.; Smits, Cas; Merkus, Paul; Festen, Joost M.; Goverts, S. Theo

    2017-01-01

    Not all of the variance in speech-recognition performance of cochlear implant (CI) users can be explained by biographic and auditory factors. In normal-hearing listeners, linguistic and cognitive factors determine most of speech-in-noise performance. The current study explored specifically the influence of visually measured lexical-access ability compared with other cognitive factors on speech recognition of 24 postlingually deafened CI users. Speech-recognition performance was measured with ...

  3. Automatic initial and final segmentation in cleft palate speech of Mandarin speakers.

    Directory of Open Access Journals (Sweden)

    Ling He

    Full Text Available Speech unit segmentation is an important pre-processing step in the analysis of cleft palate speech. In Mandarin, one syllable is composed of two parts: initial and final. In cleft palate speech, the resonance disorders occur at the finals and the voiced initials, while the articulation disorders occur at the unvoiced initials. Thus, the initials and finals are the minimum speech units that can reflect the characteristics of cleft palate speech disorders. In this work, an automatic initial/final segmentation method is proposed. It is an important preprocessing step in cleft palate speech signal processing. The tested cleft palate speech utterances were collected from the Cleft Palate Speech Treatment Center in the Hospital of Stomatology, Sichuan University, which has the largest number of cleft palate patients in China. The cleft palate speech data include 824 speech segments, and the control samples contain 228 speech segments. First, the syllables are extracted from the speech utterances. The proposed syllable extraction method avoids the training stage and achieves good performance for both voiced and unvoiced speech. Then, the syllables are classified into those with "quasi-unvoiced" initials and those with "quasi-voiced" initials. Respective initial/final segmentation methods are proposed for these two types of syllables. Moreover, a two-step segmentation method is proposed: the rough locations of syllable and initial/final boundaries are refined in the second segmentation step in order to improve the robustness of segmentation accuracy. The experiments show that the initial/final segmentation accuracies for syllables with quasi-unvoiced initials are higher than for those with quasi-voiced initials. For the cleft palate speech, the mean time error is 4.4 ms for syllables with quasi-unvoiced initials and 25.7 ms for syllables with quasi-voiced initials, and the correct segmentation accuracy P30 for all the syllables is 91.69%. For the control samples, P30 for all the

  4. Automatic Pavement Crack Recognition Based on BP Neural Network

    Directory of Open Access Journals (Sweden)

    Li Li

    2014-02-01

    Full Text Available A feasible pavement crack detection system plays an important role in evaluating the road condition and providing the necessary road maintenance. In this paper, a back propagation neural network (BPNN) is used to recognize pavement cracks from images. To improve the recognition accuracy of the BPNN, a complete image processing framework is proposed, including image preprocessing and crack information extraction. In this framework, the redundant image information is reduced as much as possible and two sets of feature parameters are constructed to classify the crack images. Then a BPNN is adopted to distinguish between linear and alligator cracks in pavement images with high recognition accuracy. In addition, the linear cracks can be further classified into transversal and longitudinal cracks according to the direction angle. Finally, the proposed method is evaluated on 400 pavement images obtained by the Automatic Road Analyzer (ARAN) in Northern China, and the results show that the proposed method seems to be a powerful tool for pavement crack recognition. The rates of correct classification for alligator, transversal and longitudinal cracks are 97.5%, 100% and 88.0%, respectively. Compared to some previous studies, the method proposed in this paper is effective for all three kinds of cracks and the results are also acceptable for engineering application.

  5. The effects of reverberant self- and overlap-masking on speech recognition in cochlear implant listeners.

    Science.gov (United States)

    Desmond, Jill M; Collins, Leslie M; Throckmorton, Chandra S

    2014-06-01

    Many cochlear implant (CI) listeners experience decreased speech recognition in reverberant environments [Kokkinakis et al., J. Acoust. Soc. Am. 129(5), 3221-3232 (2011)], which may be caused by a combination of self- and overlap-masking [Bolt and MacDonald, J. Acoust. Soc. Am. 21(6), 577-580 (1949)]. Determining the extent to which these effects decrease speech recognition for CI listeners may influence reverberation mitigation algorithms. This study compared speech recognition with ideal self-masking mitigation, with ideal overlap-masking mitigation, and with no mitigation. Under these conditions, mitigating either self- or overlap-masking resulted in significant improvements in speech recognition for both normal hearing subjects utilizing an acoustic model and for CI listeners using their own devices.

  6. Research Into the Use of Speech Recognition Enhanced Microworlds in an Authorable Language Tutor

    National Research Council Canada - National Science Library

    Plott, Beth

    1999-01-01

    .... Once the first microworld exercise was completed and integrated into MILT, ARI funded the investigation of the use of discreet speech recognition technology in language learning using the microworld exercise as a basis...

  7. Hearing Handicap and Speech Recognition Correlate With Self-Reported Listening Effort and Fatigue.

    Science.gov (United States)

    Alhanbali, Sara; Dawes, Piers; Lloyd, Simon; Munro, Kevin J

    To investigate the correlations between hearing handicap, speech recognition, listening effort, and fatigue. Eighty-four adults with hearing loss (65 to 85 years) completed three self-report questionnaires: the Fatigue Assessment Scale, the Effort Assessment Scale, and the Hearing Handicap Inventory for Elderly. Audiometric assessment included pure-tone audiometry and speech recognition in noise. There was a significant positive correlation between handicap and fatigue (r = 0.39, p speech recognition and fatigue (r = 0.22, p speech recognition both correlate with self-reported listening effort and fatigue, which is consistent with a model of listening effort and fatigue where perceived difficulty is related to sustained effort and fatigue for unrewarding tasks over which the listener has low control. A clinical implication is that encouraging clients to recognize and focus on the pleasure and positive experiences of listening may result in greater satisfaction and benefit from hearing aid use.

  8. A New Bigram-PLSA Language Model for Speech Recognition

    Directory of Open Access Journals (Sweden)

    Bahrani Mohammad

    2010-01-01

    Full Text Available A novel method for combining a bigram model and Probabilistic Latent Semantic Analysis (PLSA) is introduced for language modeling. The motivation behind this idea is the relaxation of the "bag of words" assumption fundamentally present in latent topic models including the PLSA model. An EM-based parameter estimation technique for the proposed model is presented in this paper. Previous attempts to incorporate word order in the PLSA model are surveyed and compared with our new proposed model both in theory and by experimental evaluation. The perplexity measure is employed to compare the effectiveness of recently introduced models with the new proposed model. Furthermore, experiments are designed and carried out on continuous speech recognition (CSR) tasks using word error rate (WER) as the evaluation criterion. The superiority of the new bigram-PLSA model over Nie et al.'s bigram-PLSA and simple PLSA models is demonstrated in the results of our experiments. Experiments on the BLLIP WSJ corpus show about 12% reduction in perplexity and 2.8% WER improvement compared to Nie et al.'s bigram-PLSA model.
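    The two evaluation measures named in this abstract, perplexity and word error rate, have standard definitions that can be sketched briefly. The following is a minimal illustration only, not the authors' evaluation code; the function names `perplexity` and `word_error_rate` and the whitespace tokenization are assumptions made for this example.

```python
import math


def perplexity(word_probs):
    """Perplexity of a word sequence given the model probability of each word.

    Computed as exp of the negative mean log-probability.
    """
    if not word_probs:
        raise ValueError("need at least one word probability")
    mean_log_prob = sum(math.log(p) for p in word_probs) / len(word_probs)
    return math.exp(-mean_log_prob)


def word_error_rate(reference, hypothesis):
    """WER: word-level edit distance divided by the reference length."""
    ref = reference.split()  # assumption: whitespace tokenization
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference prefix seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            )
        prev = curr
    return prev[-1] / len(ref)
```

    A lower value is better for both measures; a language model that assigns probability 0.25 to every word has perplexity 4, and a hypothesis that drops one word of a six-word reference has a WER of 1/6.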

  9. Adoption of Speech Recognition Technology in Community Healthcare Nursing.

    Science.gov (United States)

    Al-Masslawi, Dawood; Block, Lori; Ronquillo, Charlene

    2016-01-01

    Adoption of new health information technology is shown to be challenging. However, the degree to which new technology will be adopted can be predicted by measures of usefulness and ease of use. In this work these key determining factors are focused on for design of a wound documentation tool. In the context of wound care at home, consistent with evidence in the literature from similar settings, use of Speech Recognition Technology (SRT) for patient documentation has shown promise. To achieve a user-centred design, the results from a conducted ethnographic fieldwork are used to inform SRT features; furthermore, exploratory prototyping is used to collect feedback about the wound documentation tool from home care nurses. During this study, measures developed for healthcare applications of the Technology Acceptance Model will be used, to identify SRT features that improve usefulness (e.g. increased accuracy, saving time) or ease of use (e.g. lowering mental/physical effort, easy to remember tasks). The identified features will be used to create a low fidelity prototype that will be evaluated in future experiments.

  10. Speech recognition by means of a three-integrated-circuit set

    Energy Technology Data Exchange (ETDEWEB)

    Zoicas, A.

    1983-11-03

    The author uses pattern recognition methods for detecting word boundaries, and monitors incoming speech at 12 millisecond intervals. Frequency is divided into eight bands and analysis is achieved in an analogue interface integrated circuit, a pipeline digital processor and a control integrated circuit. Applications are suggested, including speech input to personal computers. 3 references.

  11. Introduction and Overview of the Vicens-Reddy Speech Recognition System.

    Science.gov (United States)

    Kameny, Iris; Ritea, H.

    The Vicens-Reddy System is unique in the sense that it approaches the problem of speech recognition as a whole, rather than treating particular aspects of the problems as in previous attempts. For example, where earlier systems treated only segmentation of speech into phoneme groups, or detected phonemes in a given context, the Vicens-Reddy System…

  12. Influences of Infant-Directed Speech on Early Word Recognition

    Science.gov (United States)

    Singh, Leher; Nestor, Sarah; Parikh, Chandni; Yull, Ashley

    2009-01-01

    When addressing infants, many adults adopt a particular type of speech, known as infant-directed speech (IDS). IDS is characterized by exaggerated intonation, as well as reduced speech rate, shorter utterance duration, and grammatical simplification. It is commonly asserted that IDS serves in part to facilitate language learning. Although…

  13. Listeners Experience Linguistic Masking Release in Noise-Vocoded Speech-in-Speech Recognition

    Science.gov (United States)

    Viswanathan, Navin; Kokkinakis, Kostas; Williams, Brittany T.

    2018-01-01

    Purpose: The purpose of this study was to evaluate whether listeners with normal hearing perceiving noise-vocoded speech-in-speech demonstrate better intelligibility of target speech when the background speech was mismatched in language (linguistic release from masking [LRM]) and/or location (spatial release from masking [SRM]) relative to the…

  14. Combining Semantic and Acoustic Features for Valence and Arousal Recognition in Speech

    DEFF Research Database (Denmark)

    Karadogan, Seliz; Larsen, Jan

    2012-01-01

    The recognition of affect in speech has attracted a lot of interest recently; especially in the area of cognitive and computer sciences. Most of the previous studies focused on the recognition of basic emotions (such as happiness, sadness and anger) using categorical approach. Recently, the focus...... has been shifting towards dimensional affect recognition based on the idea that emotional states are not independent from one another but related in a systematic manner. In this paper, we design a continuous dimensional speech affect recognition model that combines acoustic and semantic features. We...... show that combining semantic and acoustic information for dimensional speech recognition improves the results. Moreover, we show that valence is better estimated using semantic features while arousal is better estimated using acoustic features....

  15. A Russian Keyword Spotting System Based on Large Vocabulary Continuous Speech Recognition and Linguistic Knowledge

    Directory of Open Access Journals (Sweden)

    Valentin Smirnov

    2016-01-01

    Full Text Available The paper describes the key concepts of a word spotting system for Russian based on large vocabulary continuous speech recognition. Key algorithms and system settings are described, including the pronunciation variation algorithm, and the experimental results on the real-life telecom data are provided. The description of system architecture and the user interface is provided. The system is based on CMU Sphinx open-source speech recognition platform and on the linguistic models and algorithms developed by Speech Drive LLC. The effective combination of baseline statistic methods, real-world training data, and the intensive use of linguistic knowledge led to a quality result applicable to industrial use.

  16. AUTOMATIC RECOGNITION OF INDOOR NAVIGATION ELEMENTS FROM KINECT POINT CLOUDS

    Directory of Open Access Journals (Sweden)

    L. Zeng

    2017-09-01

    Full Text Available This paper automatically recognizes the navigation elements defined by the indoorGML data standard – door, stairway and wall. The data used is an indoor 3D point cloud collected by Kinect v2 launched in 2011 through the means of ORB-SLAM. By contrast, it is cheaper and more convenient than lidar, but the point clouds also have the problems of noise, registration error and large data volume. Hence, we adopt a shape descriptor – a histogram of distances between two randomly chosen points, proposed by Osada and merged with other descriptors – in conjunction with a random forest classifier to recognize the navigation elements (door, stairway and wall) from Kinect point clouds. This research acquires navigation elements and their 3-d location information from each single data frame through segmentation of point clouds, boundary extraction, feature calculation and classification. Finally, this paper utilizes the acquired navigation elements and their information to generate the state data of the indoor navigation module automatically. The experimental results demonstrate a high recognition accuracy of the proposed method.

  17. Automatic Recognition of Indoor Navigation Elements from Kinect Point Clouds

    Science.gov (United States)

    Zeng, L.; Kang, Z.

    2017-09-01

    This paper automatically recognizes the navigation elements defined by the indoorGML data standard - door, stairway and wall. The data used is an indoor 3D point cloud collected by Kinect v2 launched in 2011 through the means of ORB-SLAM. By contrast, it is cheaper and more convenient than lidar, but the point clouds also have the problems of noise, registration error and large data volume. Hence, we adopt a shape descriptor - a histogram of distances between two randomly chosen points, proposed by Osada and merged with other descriptors - in conjunction with a random forest classifier to recognize the navigation elements (door, stairway and wall) from Kinect point clouds. This research acquires navigation elements and their 3-d location information from each single data frame through segmentation of point clouds, boundary extraction, feature calculation and classification. Finally, this paper utilizes the acquired navigation elements and their information to generate the state data of the indoor navigation module automatically. The experimental results demonstrate a high recognition accuracy of the proposed method.
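    The shape descriptor described here, a histogram of distances between two randomly chosen points, can be sketched in a few lines. This is a rough illustration under stated assumptions, not the paper's implementation: the function name `d2_descriptor`, the number of sampled pairs, the bin count and the normalization by the largest sampled distance are all choices made for this example, and the merging with other descriptors and the random forest classifier are omitted.

```python
import math
import random


def d2_descriptor(points, n_pairs=5000, n_bins=16, seed=0):
    """Normalized histogram of distances between randomly chosen point pairs.

    points: sequence of (x, y, z) tuples from one point-cloud segment.
    Returns n_bins values that sum to 1.
    """
    if not points:
        raise ValueError("points must not be empty")
    rng = random.Random(seed)
    # Sample point pairs with replacement and record their Euclidean distance.
    dists = [
        math.dist(rng.choice(points), rng.choice(points))
        for _ in range(n_pairs)
    ]
    max_d = max(dists)
    hist = [0] * n_bins
    if max_d == 0.0:
        # Degenerate segment: all sampled points coincide.
        hist[0] = n_pairs
    else:
        for d in dists:
            # Dividing by the largest distance makes the histogram
            # insensitive to the overall scale of the segment.
            bin_index = min(int(d / max_d * n_bins), n_bins - 1)
            hist[bin_index] += 1
    return [count / n_pairs for count in hist]
```

    The resulting fixed-length vector could then be passed, together with other features, to any classifier; the abstract names a random forest for that step.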

  18. Automatic recognition of offensive team formation in american football plays

    KAUST Repository

    Atmosukarto, Indriyati

    2013-06-01

    Compared to security surveillance and military applications, where automated action analysis is prevalent, the sports domain is extremely under-served. Most existing software packages for sports video analysis require manual annotation of important events in the video. American football is the most popular sport in the United States; however, most game analysis is still done manually. Line of scrimmage and offensive team formation recognition are two statistics that must be tagged by American football coaches when watching and evaluating past play video clips, a process which takes many man hours per week. These two statistics are also the building blocks for more high-level analysis such as play strategy inference and automatic statistic generation. In this paper, we propose a novel framework where, given an American football play clip, we automatically identify the video frame in which the offensive team lines up in formation (formation frame), the line of scrimmage for that play, and the type of player formation the offensive team takes on. The proposed framework achieves 95% accuracy in detecting the formation frame, 98% accuracy in detecting the line of scrimmage, and up to 67% accuracy in classifying the offensive team's formation. To validate our framework, we compiled a large dataset comprising more than 800 play-clips of standard and high definition resolution from real-world football games. This dataset will be made publicly available for future comparison. © 2013 IEEE.

  19. Microscopic prediction of speech recognition for listeners with normal hearing in noise using an auditory model.

    Science.gov (United States)

    Jürgens, Tim; Brand, Thomas

    2009-11-01

    This study compares the phoneme recognition performance in speech-shaped noise of a microscopic model for speech recognition with the performance of normal-hearing listeners. "Microscopic" is defined twofold in terms of this model. First, the speech recognition rate is predicted on a phoneme-by-phoneme basis. Second, microscopic modeling means that the signal waveforms to be recognized are processed by mimicking elementary parts of human auditory processing. The model is based on an approach by Holube and Kollmeier [J. Acoust. Soc. Am. 100, 1703-1716 (1996)] and consists of a psychoacoustically and physiologically motivated preprocessing and a simple dynamic-time-warp speech recognizer. The model is evaluated while presenting nonsense speech in a closed-set paradigm. Averaged phoneme recognition rates, specific phoneme recognition rates, and phoneme confusions are analyzed. The influence of different perceptual distance measures and of the model's a-priori knowledge is investigated. The results show that human performance can be predicted by this model using an optimal detector, i.e., identical speech waveforms for both training of the recognizer and testing. The best model performance is yielded by distance measures which focus mainly on small perceptual distances and neglect outliers.
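    A dynamic-time-warp recognizer of the general kind mentioned in this abstract can be sketched as follows. This is a generic textbook version under stated assumptions, not the model from the study: the names `dtw_distance` and `recognize`, the one-dimensional feature sequences and the absolute-difference local distance are all illustrative; the study uses perceptually motivated features and distance measures that are not reproduced here.

```python
def dtw_distance(a, b, dist=lambda x, y: abs(x - y)):
    """Dynamic time warping distance between two feature sequences.

    dist is the local distance between single frames; swapping it is how
    different distance measures would be compared.
    """
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j]: minimal accumulated distance aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = dist(a[i - 1], b[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # stretch b: repeat frame b[j-1]
                cost[i][j - 1],      # stretch a: repeat frame a[i-1]
                cost[i - 1][j - 1],  # advance both sequences
            )
    return cost[n][m]


def recognize(sample, templates):
    """Return the label of the stored template closest to the sample."""
    return min(templates, key=lambda label: dtw_distance(sample, templates[label]))
```

    In a closed-set setup, `templates` would hold one reference sequence per response alternative, and the recognizer picks the alternative with the smallest warped distance.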

  20. Joint variable frame rate and length analysis for speech recognition under adverse conditions

    DEFF Research Database (Denmark)

    Tan, Zheng-Hua; Kraljevski, Ivan

    2014-01-01

    This paper presents a method that combines variable frame length and rate analysis for speech recognition in noisy environments, together with an investigation of the effect of different frame lengths on speech recognition performance. The method adopts frame selection using an a posteriori signal......-to-noise (SNR) ratio weighted energy distance and increases the length of the selected frames, according to the number of non-selected preceding frames. It assigns a higher frame rate and a normal frame length to a rapidly changing and high SNR region of a speech signal, and a lower frame rate and an increased...... frame length to a steady or low SNR region. The speech recognition results show that the proposed variable frame rate and length method outperforms fixed frame rate and length analysis, as well as standalone variable frame rate analysis in terms of noise-robustness....

  1. Using Face Recognition in the Automatic Door Access Control in a Secured Room

    Directory of Open Access Journals (Sweden)

    Gheorghe Gilca

    2017-06-01

    Full Text Available The aim of this paper is to help users improve the door security of sensitive locations by using face detection and recognition. This paper is comprised mainly of three subsystems: face detection, face recognition and automatic door access control. The door will open automatically for the known person due to the command of the microcontroller.

  2. Report generation using digital speech recognition in radiology

    International Nuclear Information System (INIS)

    Vorbeck, F.; Ba-Ssalamah, A.; Kettenbach, J.; Huebsch, P.

    2000-01-01

    The aim of this study was to evaluate whether the use of digital continuous speech recognition (CSR) in the field of radiology could lead to relevant time savings in generating a report. A CSR system (SP6000, Philips, Eindhoven, The Netherlands) for German was used to transform fluently spoken sentences into text. Two radiologists dictated a total of 450 reports on five radiological topics. Two typists edited those reports by means of conventional typing using a text editor (WinWord 6.0, Microsoft, Redmond, Wash.) installed on an IBM-compatible personal computer (PC). The same reports were generated using the CSR system and the performance of both systems was then evaluated by comparing the time needed to generate the reports and the error rates of both systems. In addition, the error rate of the CSR system and the time needed to create the reports were evaluated. The mean error rate for the CSR system was 5.5%, and the mean error rate for conventional typing was 0.4%. Reports edited with the CSR system were, on average, generated 19% faster than with the conventional text-editing method. However, the error rates and time savings varied and depended on topics, speakers, and typists. Using CSR, the maximum time saving achieved was 28% for the topic sonography. The CSR system was never slower, under any circumstances, than conventional typing on a PC. When compared with a conventional manual typing method, the CSR system proved to be useful in a clinical setting and saved time in generating radiological reports. The amount of time saved, however, greatly depended on the performance of the typist, the speaker, and on the stored vocabulary provided by the CSR system. (orig.)

  3. Use of Authentic-Speech Technique for Teaching Sound Recognition to EFL Students

    Science.gov (United States)

    Sersen, William J.

    2011-01-01

    The main objective of this research was to test an authentic-speech technique for improving the sound-recognition skills of EFL (English as a foreign language) students at Roi-Et Rajabhat University. The secondary objective was to determine the correlation, if any, between students' self-evaluation of sound-recognition progress and the actual…

  4. Conversation electrified: ERP correlates of speech act recognition in underspecified utterances.

    Directory of Open Access Journals (Sweden)

    Rosa S Gisladottir

    Full Text Available The ability to recognize speech acts (verbal actions) in conversation is critical for everyday interaction. However, utterances are often underspecified for the speech act they perform, requiring listeners to rely on the context to recognize the action. The goal of this study was to investigate the time-course of auditory speech act recognition in action-underspecified utterances and explore how sequential context (the prior action) impacts this process. We hypothesized that speech acts are recognized early in the utterance to allow for quick transitions between turns in conversation. Event-related potentials (ERPs) were recorded while participants listened to spoken dialogues and performed an action categorization task. The dialogues contained target utterances, each of which could deliver three distinct speech acts depending on the prior turn. The targets were identical across conditions, but differed in the type of speech act performed and how it fit into the larger action sequence. The ERP results show an early effect of action type, reflected by frontal positivities as early as 200 ms after target utterance onset. This indicates that speech act recognition begins early in the turn when the utterance has only been partially processed. Providing further support for early speech act recognition, actions in highly constraining contexts did not elicit an ERP effect to the utterance-final word. We take this to show that listeners can recognize the action before the final word through predictions at the speech act level. However, additional processing based on the complete utterance is required in more complex actions, as reflected by a posterior negativity at the final word when the speech act is in a less constraining context and a new action sequence is initiated. These findings demonstrate that sentence comprehension in conversational contexts crucially involves recognition of verbal action which begins as soon as it can.

  5. Effects of Semantic Context and Fundamental Frequency Contours on Mandarin Speech Recognition by Second Language Learners.

    Science.gov (United States)

    Zhang, Linjun; Li, Yu; Wu, Han; Li, Xin; Shu, Hua; Zhang, Yang; Li, Ping

    2016-01-01

    Speech recognition by second language (L2) learners in optimal and suboptimal conditions has been examined extensively with English as the target language in most previous studies. This study extended existing experimental protocols (Wang et al., 2013) to investigate Mandarin speech recognition by Japanese learners of Mandarin at two different levels (elementary vs. intermediate) of proficiency. The overall results showed that in addition to L2 proficiency, semantic context, F0 contours, and listening condition all affected the recognition performance on the Mandarin sentences. However, the effects of semantic context and F0 contours on L2 speech recognition diverged to some extent. Specifically, there was significant modulation effect of listening condition on semantic context, indicating that L2 learners made use of semantic context less efficiently in the interfering background than in quiet. In contrast, no significant modulation effect of listening condition on F0 contours was found. Furthermore, there was significant interaction between semantic context and F0 contours, indicating that semantic context becomes more important for L2 speech recognition when F0 information is degraded. None of these effects were found to be modulated by L2 proficiency. The discrepancy in the effects of semantic context and F0 contours on L2 speech recognition in the interfering background might be related to differences in processing capacities required by the two types of information in adverse listening conditions.

  6. Effect of speech rate variation on acoustic phone stability in Afrikaans speech recognition

    CSIR Research Space (South Africa)

    Badenhorst, JAC

    2007-11-01

    Full Text Available The authors analyse the effect of speech rate variation on Afrikaans phone stability from an acoustic perspective. Specifically they introduce two techniques for the acoustic analysis of speech rate variation, apply these techniques to an Afrikaans...

  7. Visual face-movement sensitive cortex is relevant for auditory-only speech recognition.

    Science.gov (United States)

    Riedel, Philipp; Ragert, Patrick; Schelinski, Stefanie; Kiebel, Stefan J; von Kriegstein, Katharina

    2015-07-01

    It is commonly assumed that the recruitment of visual areas during audition is not relevant for performing auditory tasks ('auditory-only view'). According to an alternative view, however, the recruitment of visual cortices is thought to optimize auditory-only task performance ('auditory-visual view'). This alternative view is based on functional magnetic resonance imaging (fMRI) studies. These studies have shown, for example, that even if there is only auditory input available, face-movement sensitive areas within the posterior superior temporal sulcus (pSTS) are involved in understanding what is said (auditory-only speech recognition). This is particularly the case when speakers are known audio-visually, that is, after brief voice-face learning. Here we tested whether the left pSTS involvement is causally related to performance in auditory-only speech recognition when speakers are known by face. To test this hypothesis, we applied cathodal transcranial direct current stimulation (tDCS) to the pSTS during (i) visual-only speech recognition of a speaker known only visually to participants and (ii) auditory-only speech recognition of speakers they learned by voice and face. We defined the cathode as active electrode to down-regulate cortical excitability by hyperpolarization of neurons. tDCS to the pSTS interfered with visual-only speech recognition performance compared to a control group without pSTS stimulation (tDCS to BA6/44 or sham). Critically, compared to controls, pSTS stimulation additionally decreased auditory-only speech recognition performance selectively for voice-face learned speakers. These results are important in two ways. First, they provide direct evidence that the pSTS is causally involved in visual-only speech recognition; this confirms a long-standing prediction of current face-processing models. Secondly, they show that visual face-sensitive pSTS is causally involved in optimizing auditory-only speech recognition. These results are in line

  8. Feature Fusion Algorithm for Multimodal Emotion Recognition from Speech and Facial Expression Signal

    Directory of Open Access Journals (Sweden)

    Han Zhiyan

    2016-01-01

    Full Text Available To overcome the limitations of single-mode emotion recognition, this paper describes a novel multimodal emotion recognition algorithm that takes the speech signal and the facial expression signal as its research subjects. First, the speech signal features and facial expression signal features are fused, sample sets are obtained by sampling with replacement, and classifiers are then trained with a BP neural network (BPNN). Second, the difference between two classifiers is measured by a double error difference selection strategy. Finally, the final recognition result is obtained by the majority voting rule. Experiments show that the method improves the accuracy of emotion recognition by exploiting the advantages of both decision-level fusion and feature-level fusion, and brings the whole fusion process closer to human emotion recognition, with a recognition rate of 90.4%.
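    Three of the steps named in this abstract (fusing the two feature sets, drawing sample sets with replacement, and combining classifier outputs by majority vote) can be sketched generically. This is a simplified illustration under stated assumptions: the helper names are hypothetical, feature-level fusion is assumed to be plain concatenation, and the BPNN training and the double error difference selection strategy are not shown because the abstract does not specify them.

```python
import random
from collections import Counter


def fuse_features(speech_features, face_features):
    """Feature-level fusion, assumed here to be simple concatenation."""
    return list(speech_features) + list(face_features)


def bootstrap_sample(samples, rng):
    """Draw a sample set of the same size by sampling with replacement."""
    return [rng.choice(samples) for _ in samples]


def majority_vote(labels):
    """Decision-level fusion: return the most frequent predicted label.

    Ties are resolved in favour of the label that appears first.
    """
    if not labels:
        raise ValueError("need at least one prediction")
    return Counter(labels).most_common(1)[0][0]
```

    In such a scheme each bootstrap sample set would train one classifier, and `majority_vote` would be applied to the labels those classifiers predict for a new fused feature vector.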

  9. The Automatic Annotation of the Semiotic Type of Hand Gestures in Obama’s Humorous Speeches

    DEFF Research Database (Denmark)

    Navarretta, Costanza

    2018-01-01

    is expressed by speech or by adding new information to what is uttered. The automatic classification of the semiotic type of gestures from their shape description can contribute to their interpretation in human-human communication and in advanced multimodal interactive systems. We annotated and analysed hand...

  10. Impact of noise and other factors on speech recognition in anaesthesia

    DEFF Research Database (Denmark)

    Alapetite, Alexandre

    2008-01-01

    of training. Methods: Eight volunteers read aloud a total of about 3 600 typical short anaesthesia comments to be transcribed by a continuous speech recognition system. Background noises were collected in an operating room and reproduced. A regression analysis and descriptive statistics were done to evaluate...... operations. Objective: The aim of the experiment is to evaluate the relative impact of several factors affecting speech recognition when used in operating rooms, such as the type or loudness of background noises, type of microphone, type of recognition mode (free speech versus command mode), and type...... the relative effect of various factors. Results: Some factors have a major impact, such as the words to be recognised, the type of recognition, and participants. The type of microphone is especially significant when combined with the type of noise. While loud noises in the operating room can have a predominant...

  11. Accelerometer-based automatic voice onset detection in speech mapping with navigated repetitive transcranial magnetic stimulation.

    Science.gov (United States)

    Vitikainen, Anne-Mari; Mäkelä, Elina; Lioumis, Pantelis; Jousmäki, Veikko; Mäkelä, Jyrki P

    2015-09-30

    The use of navigated repetitive transcranial magnetic stimulation (rTMS) in mapping of speech-related brain areas has recently been shown to be useful in the preoperative workflow of epilepsy and tumor patients. However, substantial inter- and intraobserver variability and non-optimal replicability of the rTMS results have been reported, and a need for additional development of the methodology is recognized. In TMS motor cortex mappings the evoked responses can be quantitatively monitored by electromyographic recordings; however, no such easily available setup exists for speech mappings. We present an accelerometer-based setup for detection of vocalization-related larynx vibrations combined with an automatic routine for voice onset detection for rTMS speech mapping applying naming. The results produced by the automatic routine were compared with the manually reviewed video-recordings. The new method was applied in the routine navigated rTMS speech mapping for 12 consecutive patients during preoperative workup for epilepsy or tumor surgery. The automatic routine correctly detected 96% of the voice onsets, resulting in 96% sensitivity and 71% specificity. The majority (63%) of the misdetections were related to visible throat movements, extra voices before the response, or delayed naming of the previous stimuli. The no-response errors were correctly detected in 88% of events. The proposed setup for automatic detection of voice onsets provides quantitative additional data for analysis of the rTMS-induced speech response modifications. The objectively defined speech response latencies increase the repeatability, reliability and stratification of the rTMS results. Copyright © 2015 Elsevier B.V. All rights reserved.
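
    For readers unfamiliar with the two metrics reported above, the following sketch shows how sensitivity and specificity are computed from detection counts. The counts are hypothetical and were chosen only so that the arithmetic reproduces the reported percentages; they are not the study's data.

```python
def sensitivity(true_pos, false_neg):
    """Fraction of real events that the detector found: TP / (TP + FN)."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg, false_pos):
    """Fraction of non-events correctly rejected: TN / (TN + FP)."""
    return true_neg / (true_neg + false_pos)


# Hypothetical counts, for illustration only (not taken from the paper).
sens = sensitivity(true_pos=96, false_neg=4)
spec = specificity(true_neg=71, false_pos=29)
```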

  12. Emotion Recognition from Chinese Speech for Smart Affective Services Using a Combination of SVM and DBN

    Science.gov (United States)

    Zhu, Lianzhang; Chen, Leiming; Zhao, Dehai

    2017-01-01

    Accurate emotion recognition from speech is important for applications like smart health care, smart entertainment, and other smart services. High accuracy emotion recognition from Chinese speech is challenging due to the complexities of the Chinese language. In this paper, we explore how to improve the accuracy of speech emotion recognition, including speech signal feature extraction and emotion classification methods. Five types of features are extracted from a speech sample: mel frequency cepstrum coefficient (MFCC), pitch, formant, short-term zero-crossing rate and short-term energy. By comparing statistical features with deep features extracted by a Deep Belief Network (DBN), we attempt to find the best features to identify the emotion status for speech. We propose a novel classification method that combines DBN and SVM (support vector machine) instead of using only one of them. In addition, a conjugate gradient method is applied to train DBN in order to speed up the training process. Gender-dependent experiments are conducted using an emotional speech database created by the Chinese Academy of Sciences. The results show that DBN features can reflect emotion status better than artificial features, and our new classification approach achieves an accuracy of 95.8%, which is higher than using either DBN or SVM separately. Results also show that DBN can work very well for small training databases if it is properly designed. PMID:28737705

  13. Eyes and ears: Using eye tracking and pupillometry to understand challenges to speech recognition.

    Science.gov (United States)

    Van Engen, Kristin J; McLaughlin, Drew J

    2018-05-04

    Although human speech recognition is often experienced as relatively effortless, a number of common challenges can render the task more difficult. Such challenges may originate in talkers (e.g., unfamiliar accents, varying speech styles), the environment (e.g. noise), or in listeners themselves (e.g., hearing loss, aging, different native language backgrounds). Each of these challenges can reduce the intelligibility of spoken language, but even when intelligibility remains high, they can place greater processing demands on listeners. Noisy conditions, for example, can lead to poorer recall for speech, even when it has been correctly understood. Speech intelligibility measures, memory tasks, and subjective reports of listener difficulty all provide critical information about the effects of such challenges on speech recognition. Eye tracking and pupillometry complement these methods by providing objective physiological measures of online cognitive processing during listening. Eye tracking records the moment-to-moment direction of listeners' visual attention, which is closely time-locked to unfolding speech signals, and pupillometry measures the moment-to-moment size of listeners' pupils, which dilate in response to increased cognitive load. In this paper, we review the uses of these two methods for studying challenges to speech recognition. Copyright © 2018. Published by Elsevier B.V.

  14. Emotion Recognition from Chinese Speech for Smart Affective Services Using a Combination of SVM and DBN.

    Science.gov (United States)

    Zhu, Lianzhang; Chen, Leiming; Zhao, Dehai; Zhou, Jiehan; Zhang, Weishan

    2017-07-24

    Accurate emotion recognition from speech is important for applications like smart health care, smart entertainment, and other smart services. High accuracy emotion recognition from Chinese speech is challenging due to the complexities of the Chinese language. In this paper, we explore how to improve the accuracy of speech emotion recognition, including speech signal feature extraction and emotion classification methods. Five types of features are extracted from a speech sample: mel frequency cepstrum coefficient (MFCC), pitch, formant, short-term zero-crossing rate and short-term energy. By comparing statistical features with deep features extracted by a Deep Belief Network (DBN), we attempt to find the best features to identify the emotion status for speech. We propose a novel classification method that combines DBN and SVM (support vector machine) instead of using only one of them. In addition, a conjugate gradient method is applied to train DBN in order to speed up the training process. Gender-dependent experiments are conducted using an emotional speech database created by the Chinese Academy of Sciences. The results show that DBN features can reflect emotion status better than artificial features, and our new classification approach achieves an accuracy of 95.8%, which is higher than using either DBN or SVM separately. Results also show that DBN can work very well for small training databases if it is properly designed.
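
    Two of the five feature types named above, short-term energy and short-term zero-crossing rate, are simple enough to sketch directly. The frame length, hop size, and toy signal below are assumptions made for illustration; the paper's actual framing parameters are not given in the abstract, and MFCC, pitch, and formant extraction are not shown.

```python
def split_frames(signal, frame_len, hop):
    """Cut a sample sequence into overlapping frames of frame_len samples."""
    last_start = len(signal) - frame_len
    return [signal[i:i + frame_len] for i in range(0, last_start + 1, hop)]


def short_term_energy(frame):
    """Sum of squared sample values within one frame."""
    return sum(x * x for x in frame)


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(
        1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0)
    )
    return crossings / (len(frame) - 1)


# Toy alternating signal (hypothetical), framed with length 4 and hop 2.
toy_signal = [1, -1, 1, -1, 1, -1]
toy_frames = split_frames(toy_signal, frame_len=4, hop=2)
energies = [short_term_energy(f) for f in toy_frames]
zcrs = [zero_crossing_rate(f) for f in toy_frames]
```

    Per-frame values like these are typically summarized (for example by mean and variance over an utterance) before being passed to a classifier.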

  15. Mathematical algorithm for the automatic recognition of intestinal parasites.

    Directory of Open Access Journals (Sweden)

    Alicia Alva

    Full Text Available Parasitic infections are generally diagnosed by professionals trained to recognize the morphological characteristics of the eggs in microscopic images of fecal smears. However, this laboratory diagnosis requires medical specialists, who are lacking in many of the areas where these infections are most prevalent. In response to this public health issue, we developed software based on pattern recognition analysis of microscopic digital images of fecal smears, capable of automatically recognizing and diagnosing common human intestinal parasites. To this end, we selected 229, 124, 217, and 229 objects from microscopic images of fecal smears positive for Taenia sp., Trichuris trichiura, Diphyllobothrium latum, and Fasciola hepatica, respectively. Representative photographs were selected by a parasitologist. We then implemented our algorithm in the open source program SCILAB. The algorithm processes the image by first converting to gray-scale, then applies a fourteen step filtering process, and produces a skeletonized and tri-colored image. The features extracted fall into two general categories: geometric characteristics and brightness descriptions. Individual characteristics were quantified and evaluated with a logistic regression to model their ability to correctly identify each parasite separately. Subsequently, all algorithms were evaluated for false positive cross reactivity with the other parasites studied, excepting Taenia sp. which shares very few morphological characteristics with the others. The principal result showed that our algorithm reached sensitivities between 99.10%-100% and specificities between 98.13%-98.38% to detect each parasite separately. We did not find any cross-positivity in the algorithms for the three parasites evaluated. In conclusion, the results demonstrated the capacity of our computer algorithm to automatically recognize and diagnose Taenia sp., Trichuris trichiura, Diphyllobothrium latum, and Fasciola hepatica

  16. Mathematical algorithm for the automatic recognition of intestinal parasites.

    Science.gov (United States)

    Alva, Alicia; Cangalaya, Carla; Quiliano, Miguel; Krebs, Casey; Gilman, Robert H; Sheen, Patricia; Zimic, Mirko

    2017-01-01

    Parasitic infections are generally diagnosed by professionals trained to recognize the morphological characteristics of the eggs in microscopic images of fecal smears. However, this laboratory diagnosis requires medical specialists, who are lacking in many of the areas where these infections are most prevalent. In response to this public health issue, we developed software based on pattern recognition analysis of microscopic digital images of fecal smears, capable of automatically recognizing and diagnosing common human intestinal parasites. To this end, we selected 229, 124, 217, and 229 objects from microscopic images of fecal smears positive for Taenia sp., Trichuris trichiura, Diphyllobothrium latum, and Fasciola hepatica, respectively. Representative photographs were selected by a parasitologist. We then implemented our algorithm in the open source program SCILAB. The algorithm processes the image by first converting to gray-scale, then applies a fourteen step filtering process, and produces a skeletonized and tri-colored image. The features extracted fall into two general categories: geometric characteristics and brightness descriptions. Individual characteristics were quantified and evaluated with a logistic regression to model their ability to correctly identify each parasite separately. Subsequently, all algorithms were evaluated for false positive cross reactivity with the other parasites studied, excepting Taenia sp. which shares very few morphological characteristics with the others. The principal result showed that our algorithm reached sensitivities between 99.10%-100% and specificities between 98.13%-98.38% to detect each parasite separately. We did not find any cross-positivity in the algorithms for the three parasites evaluated. In conclusion, the results demonstrated the capacity of our computer algorithm to automatically recognize and diagnose Taenia sp., Trichuris trichiura, Diphyllobothrium latum, and Fasciola hepatica with a high

  17. How does susceptibility to proactive interference relate to speech recognition in aided and unaided conditions?

    Directory of Open Access Journals (Sweden)

    Rachel Jane Ellis

    2015-08-01

    Full Text Available Proactive interference (PI) is the capacity to resist interference to the acquisition of new memories from information stored in the long-term memory. Previous research has shown that PI correlates significantly with the speech-in-noise recognition scores of younger adults with normal hearing. In this study, we report the results of an experiment designed to investigate the extent to which tests of visual PI relate to the speech-in-noise recognition scores of older adults with hearing loss, in aided and unaided conditions. The results suggest that measures of PI correlate significantly with speech-in-noise recognition only in the unaided condition. Furthermore, the relation between PI and speech-in-noise recognition differs from that observed in younger listeners without hearing loss. The findings suggest that the relation between PI tests and the speech-in-noise recognition scores of older adults with hearing loss relates to the capability of the test to index cognitive flexibility.

  18. The Relationship Between Spectral Modulation Detection and Speech Recognition: Adult Versus Pediatric Cochlear Implant Recipients.

    Science.gov (United States)

    Gifford, René H; Noble, Jack H; Camarata, Stephen M; Sunderhaus, Linsey W; Dwyer, Robert T; Dawant, Benoit M; Dietrich, Mary S; Labadie, Robert F

    2018-01-01

    Adult cochlear implant (CI) recipients demonstrate a reliable relationship between spectral modulation detection and speech understanding. Prior studies documenting this relationship have focused on postlingually deafened adult CI recipients, leaving an open question regarding the relationship between spectral resolution and speech understanding for adults and children with prelingual onset of deafness. Here, we report CI performance on the measures of speech recognition and spectral modulation detection for 578 CI recipients including 477 postlingual adults, 65 prelingual adults, and 36 prelingual pediatric CI users. The results demonstrated a significant correlation between spectral modulation detection and various measures of speech understanding for 542 adult CI recipients. For 36 pediatric CI recipients, however, there was no significant correlation between spectral modulation detection and speech understanding in quiet or in noise, nor was spectral modulation detection significantly correlated with listener age or age at implantation. These findings suggest that pediatric CI recipients might not depend upon spectral resolution for speech understanding in the same manner as adult CI recipients. It is possible that pediatric CI users are making use of different cues, such as those contained within the temporal envelope, to achieve high levels of speech understanding. Further work is warranted to investigate the relationship between spectral and temporal resolution and speech recognition to describe the underlying mechanisms driving peripheral auditory processing in pediatric CI users.

  19. Speech Recognition in Adults With Cochlear Implants: The Effects of Working Memory, Phonological Sensitivity, and Aging.

    Science.gov (United States)

    Moberly, Aaron C; Harris, Michael S; Boyce, Lauren; Nittrouer, Susan

    2017-04-14

    Models of speech recognition suggest that "top-down" linguistic and cognitive functions, such as use of phonotactic constraints and working memory, facilitate recognition under conditions of degradation, such as in noise. The question addressed in this study was what happens to these functions when a listener who has experienced years of hearing loss obtains a cochlear implant. Thirty adults with cochlear implants and 30 age-matched controls with age-normal hearing underwent testing of verbal working memory using digit span and serial recall of words. Phonological capacities were assessed using a lexical decision task and nonword repetition. Recognition of words in sentences in speech-shaped noise was measured. Implant users had only slightly poorer working memory accuracy than did controls and only on serial recall of words; however, phonological sensitivity was highly impaired. Working memory did not facilitate speech recognition in noise for either group. Phonological sensitivity predicted sentence recognition for implant users but not for listeners with normal hearing. Clinical speech recognition outcomes for adult implant users relate to the ability of these users to process phonological information. Results suggest that phonological capacities may serve as potential clinical targets through rehabilitative training. Such novel interventions may be particularly helpful for older adult implant users.

  20. Self-organizing map classifier for stressed speech recognition

    Science.gov (United States)

    Partila, Pavol; Tovarek, Jaromir; Voznak, Miroslav

    2016-05-01

    This paper presents a method for detecting speech under stress using Self-Organizing Maps. Most people who are exposed to stressful situations cannot respond adequately to stimuli. The army, police, and fire departments account for the largest share of environments with an increased number of stressful situations. Personnel in action are directed by a control center, and control commands should be adapted to the psychological state of the person in action. It is known that psychological changes in the human body are also reflected physiologically, which means that stress affects speech. A system for recognizing stress in speech is therefore needed in the security forces. One possible classifier, popular for its flexibility, is the self-organizing map (SOM), a type of artificial neural network. Flexibility here means that the classifier is independent of the character of the input data, a property well suited to speech processing. Human stress can be seen as a kind of emotional state. Mel-frequency cepstral coefficients, LPC coefficients, and prosody features were selected as input data for their sensitivity to emotional changes. The parameters were calculated on speech recordings divided into two classes: stress-state recordings and normal-state recordings. The contribution of the experiment is a method that uses an SOM classifier for stressed speech detection. Results showed the advantage of this method, namely its flexibility with respect to input data.

  1. Speech Recognition of Aged Voices in the AAL Context: Detection of Distress Sentences

    OpenAIRE

    Aman, Frédéric; Vacher, Michel; Rossato, Solange; Portet, François

    2013-01-01

    International audience; By 2050, about a third of the French population will be over 65. In the context of technologies development aiming at helping aged people to live independently at home, the CIRDO project aims at implementing an ASR system into a social inclusion product designed for elderly people in order to detect distress situations. Speech recognition systems present higher word error rate when speech is uttered by elderly speakers compared to when non-aged voice is considered. Two...

  2. Investigations on search methods for speech recognition using weighted finite state transducers

    OpenAIRE

    Rybach, David

    2014-01-01

    The search problem in the statistical approach to speech recognition is to find the most likely word sequence for an observed speech signal using a combination of knowledge sources, i.e. the language model, the pronunciation model, and the acoustic models of phones. The resulting search space is enormous. Therefore, an efficient search strategy is required to compute the result with a feasible amount of time and memory. The structured statistical models as well as their combination, the searc...

  3. A Novel DBN Feature Fusion Model for Cross-Corpus Speech Emotion Recognition

    Directory of Open Access Journals (Sweden)

    Zou Cairong

    2016-01-01

    Full Text Available Fusing features from separate sources is a current technical difficulty in cross-corpus speech emotion recognition. The purpose of this paper is to use Deep Belief Nets (DBN) from deep learning to exploit the emotional information hidden in the speech spectrogram as image features, and then to fuse these with traditional emotion features. First, based on spectrogram analysis with the STB/Itti model, new spectrogram features are extracted from the color, brightness, and orientation, respectively; then two alternative DBN models fuse the traditional and spectrogram features, which increases the size of the feature subset and its ability to characterize emotion. In experiments on the ABC database and Chinese corpora, the new feature subset improves the cross-corpus recognition result by 8.8% compared with traditional speech emotion features. The proposed method provides a new idea for feature fusion in emotion recognition.

  4. Speech Recognition for Medical Dictation: Overview in Quebec and Systematic Review.

    Science.gov (United States)

    Poder, Thomas G; Fisette, Jean-François; Déry, Véronique

    2018-04-03

    Speech recognition is increasingly used in medical reporting. The aim of this article is to identify in the literature the strengths and weaknesses of this technology, as well as barriers to and facilitators of its implementation. A systematic review of systematic reviews was performed using PubMed, Scopus, the Cochrane Library and the Center for Reviews and Dissemination through August 2017. The gray literature has also been consulted. The quality of systematic reviews has been assessed with the AMSTAR checklist. The main inclusion criterion was use of speech recognition for medical reporting (front-end or back-end). A survey has also been conducted in Quebec, Canada, to identify the dissemination of this technology in this province, as well as the factors leading to the success or failure of its implementation. Five systematic reviews were identified. These reviews indicated a high level of heterogeneity across studies. The quality of the studies reported was generally poor. Speech recognition is not as accurate as human transcription, but it can dramatically reduce turnaround times for reporting. In front-end use, medical doctors need to spend more time on dictation and correction than required with human transcription. With speech recognition, major errors occur up to three times more frequently. In back-end use, a potential increase in productivity of transcriptionists was noted. In conclusion, speech recognition offers several advantages for medical reporting. However, these advantages are countered by an increased burden on medical doctors and by risks of additional errors in medical reports. It is also hard to identify for which medical specialties and which clinical activities the use of speech recognition will be the most beneficial.

  5. Multi-Stage Recognition of Speech Emotion Using Sequential Forward Feature Selection

    Directory of Open Access Journals (Sweden)

    Liogienė Tatjana

    2016-07-01

    Full Text Available Intensive research on speech emotion recognition has introduced a huge collection of speech emotion features. Large feature sets complicate the speech emotion recognition task. Among various feature selection and transformation techniques for one-stage classification, multiple classifier systems were proposed. The main idea of multiple classifiers is to arrange the emotion classification process in stages. Besides parallel and serial cases, the hierarchical arrangement of multi-stage classification is most widely used for speech emotion recognition. In this paper, we present a sequential-forward-feature-selection-based multi-stage classification scheme. The Sequential Forward Selection (SFS) and Sequential Floating Forward Selection (SFFS) techniques were employed for every stage of the multi-stage classification scheme. Experimental testing of the proposed scheme was performed using the German and Lithuanian emotional speech datasets. Sequential-feature-selection-based multi-stage classification outperformed the single-stage scheme by 12–42 % for different emotion sets. The multi-stage scheme has shown higher robustness to the growth of the emotion set. The decrease in recognition rate with the increase in emotion set for the multi-stage scheme was lower by 10–20 % in comparison with the single-stage case. Differences in SFS and SFFS employment for feature selection were negligible.
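
    As a rough illustration of the Sequential Forward Selection technique named above, the sketch below greedily adds, at each step, the feature that most improves a scoring function. Everything specific here is an assumption: the feature names, the toy scoring function with its redundancy penalty, and the fixed target size k stand in for the classifier accuracy and stopping rule that a real system (including the one in the paper) would use. The floating variant (SFFS), which can also remove features, is not shown.

```python
def sequential_forward_selection(features, score, k):
    """Greedy SFS: grow the subset one feature at a time until it has k items."""
    selected = []
    remaining = list(features)
    while remaining and len(selected) < k:
        # Pick the candidate whose addition gives the best subset score.
        best = max(remaining, key=lambda f: score(selected + [f]))
        selected.append(best)
        remaining.remove(best)
    return selected


# Hypothetical per-feature usefulness values, for illustration only.
WEIGHTS = {"mfcc": 0.5, "pitch": 0.3, "energy": 0.2, "zcr": 0.1}


def toy_score(subset):
    """Toy criterion: summed weights, penalizing a redundant feature pair."""
    total = sum(WEIGHTS[f] for f in subset)
    if "pitch" in subset and "energy" in subset:
        total -= 0.25  # assumed redundancy between these two features
    return total


chosen = sequential_forward_selection(list(WEIGHTS), toy_score, k=3)
```

    In this toy setup the third pick is "zcr" rather than the individually stronger "energy", because the score is evaluated on the whole subset rather than on single features.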

  6. Collecting and evaluating speech recognition corpora for 11 South African languages

    CSIR Research Space (South Africa)

    Badenhorst, J

    2011-08-01

    Full Text Available . In addition, speech-based access to information may empower illiterate or semi-literate people, 98% of whom live in the developing world. SDSs can play a useful role in a wide range of applications. Of particular importance in Africa are applications... speech (i.e. appropriate for the recognition task in terms of the language used, the profile of the speakers, speaking style, etc.) This speech generally needs to be curated and transcribed prior to the development of ASR systems, and for most...

  7. Effects of noise on speech recognition: Challenges for communication by service members.

    Science.gov (United States)

    Le Prell, Colleen G; Clavier, Odile H

    2017-06-01

    Speech communication often takes place in noisy environments; this is an urgent issue for military personnel who must communicate in high-noise environments. The effects of noise on speech recognition vary significantly according to the sources of noise, the number and types of talkers, and the listener's hearing ability. In this review, speech communication is first described as it relates to current standards of hearing assessment for military and civilian populations. The next section categorizes types of noise (also called maskers) according to their temporal characteristics (steady or fluctuating) and perceptive effects (energetic or informational masking). Next, speech recognition difficulties experienced by listeners with hearing loss and by older listeners are summarized, and questions on the possible causes of speech-in-noise difficulty are discussed, including recent suggestions of "hidden hearing loss". The final section describes tests used by military and civilian researchers, audiologists, and hearing technicians to assess performance of an individual in recognizing speech in background noise, as well as metrics that predict performance based on a listener and background noise profile. This article provides readers with an overview of the challenges associated with speech communication in noisy backgrounds, as well as its assessment and potential impact on functional performance, and provides guidance for important new research directions relevant not only to military personnel, but also to employees who work in high noise environments. Copyright © 2016 Elsevier B.V. All rights reserved.

  8. [Intermodal timing cues for audio-visual speech recognition].

    Science.gov (United States)

    Hashimoto, Masahiro; Kumashiro, Masaharu

    2004-06-01

    The purpose of this study was to investigate the limitations of lip-reading advantages for Japanese young adults by desynchronizing visual and auditory information in speech. In the experiment, audio-visual speech stimuli were presented under the six test conditions: audio-alone, and audio-visually with either 0, 60, 120, 240 or 480 ms of audio delay. The stimuli were the video recordings of a face of a female Japanese speaking long and short Japanese sentences. The intelligibility of the audio-visual stimuli was measured as a function of audio delays in sixteen untrained young subjects. Speech intelligibility under the audio-delay condition of less than 120 ms was significantly better than that under the audio-alone condition. On the other hand, the delay of 120 ms corresponded to the mean mora duration measured for the audio stimuli. The results implied that audio delays of up to 120 ms would not disrupt lip-reading advantage, because visual and auditory information in speech seemed to be integrated on a syllabic time scale. Potential applications of this research include noisy workplace in which a worker must extract relevant speech from all the other competing noises.

  9. A New Fuzzy Cognitive Map Learning Algorithm for Speech Emotion Recognition

    Directory of Open Access Journals (Sweden)

    Wei Zhang

    2017-01-01

    Full Text Available Selecting an appropriate recognition method is crucial in speech emotion recognition applications. However, the current methods do not consider the relationship between emotions. Thus, in this study, a speech emotion recognition system based on the fuzzy cognitive map (FCM) approach is constructed. Moreover, a new FCM learning algorithm for speech emotion recognition is proposed. This algorithm includes the use of the pleasure-arousal-dominance emotion scale to calculate the weights between emotions and certain mathematical derivations to determine the network structure. The proposed algorithm can handle a large number of concepts, whereas a typical FCM can handle only relatively simple networks (maps). Different acoustic features, including fundamental speech features and a new spectral feature, are extracted to evaluate the performance of the proposed method. Three experiments are conducted in this paper, namely, a single feature experiment, a feature combination experiment, and a comparison between the proposed algorithm and typical networks. All experiments are performed on the TYUT2.0 and EMO-DB databases. Results of the feature combination experiments show that the recognition rates of the combination features are 10%–20% better than those of single features. The proposed FCM learning algorithm generates a 5%–20% performance improvement compared with traditional classification networks.

  10. Speech recognition in normal hearing and sensorineural hearing loss as a function of the number of spectral channels

    NARCIS (Netherlands)

    Baskent, Deniz

    Speech recognition by normal-hearing listeners improves as a function of the number of spectral channels when tested with a noiseband vocoder simulating cochlear implant signal processing. Speech recognition by the best cochlear implant users, however, saturates around eight channels and does not

  11. Automatic analysis of slips of the tongue: Insights into the cognitive architecture of speech production.

    Science.gov (United States)

    Goldrick, Matthew; Keshet, Joseph; Gustafson, Erin; Heller, Jordana; Needle, Jeremy

    2016-04-01

    Traces of the cognitive mechanisms underlying speaking can be found within subtle variations in how we pronounce sounds. While speech errors have traditionally been seen as categorical substitutions of one sound for another, acoustic/articulatory analyses show they partially reflect the intended sound. When "pig" is mispronounced as "big," the resulting /b/ sound differs from correct productions of "big," moving towards intended "pig", revealing the role of graded sound representations in speech production. Investigating the origins of such phenomena requires detailed estimation of speech sound distributions; this has been hampered by reliance on subjective, labor-intensive manual annotation. Computational methods can address these issues by providing for objective, automatic measurements. We develop a novel high-precision computational approach, based on a set of machine learning algorithms, for measurement of elicited speech. The algorithms are trained on existing manually labeled data to detect and locate linguistically relevant acoustic properties with high accuracy. Our approach is robust, is designed to handle mis-productions, and overall matches the performance of expert coders. It allows us to analyze a very large dataset of speech errors (containing far more errors than the total in the existing literature), illuminating properties of speech sound distributions previously impossible to reliably observe. We argue that this provides novel evidence that two sources both contribute to deviations in speech errors: planning processes specifying the targets of articulation and articulatory processes specifying the motor movements that execute this plan. These findings illustrate how a much richer picture of speech provides an opportunity to gain novel insights into language processing. Copyright © 2016 Elsevier B.V. All rights reserved.

  12. Memristive Computational Architecture of an Echo State Network for Real-Time Speech Emotion Recognition

    Science.gov (United States)

    2015-05-28

    recognition is simpler and requires less computational resources compared to other inputs such as facial expressions. The Berlin database of Emotional ...Processing Magazine, IEEE, vol. 18, no. 1, pp. 32–80, 2001. [15] K. R. Scherer, T. Johnstone, and G. Klasmeyer, “Vocal expression of emotion ...Network for Real-Time Speech-Emotion Recognition 5a. CONTRACT NUMBER IN-HOUSE 5b. GRANT NUMBER 5c. PROGRAM ELEMENT NUMBER 62788F 6. AUTHOR(S) Q

  13. Divided attention disrupts perceptual encoding during speech recognition.

    Science.gov (United States)

    Mattys, Sven L; Palmer, Shekeila D

    2015-03-01

    Performing a secondary task while listening to speech has a detrimental effect on speech processing, but the locus of the disruption within the speech system is poorly understood. Recent research has shown that cognitive load imposed by a concurrent visual task increases dependency on lexical knowledge during speech processing, but it does not affect lexical activation per se. This suggests that "lexical drift" under cognitive load occurs either as a post-lexical bias at the decisional level or as a secondary consequence of reduced perceptual sensitivity. This study aimed to adjudicate between these alternatives using a forced-choice task that required listeners to identify noise-degraded spoken words with or without the addition of a concurrent visual task. Adding cognitive load increased the likelihood that listeners would select a word acoustically similar to the target even though its frequency was lower than that of the target. Thus, there was no evidence that cognitive load led to a high-frequency response bias. Rather, cognitive load seems to disrupt sublexical encoding, possibly by impairing perceptual acuity at the auditory periphery.

  14. Automatic feedback to promote safe walking and speech loudness control in persons with multiple disabilities: two single-case studies.

    Science.gov (United States)

    Lancioni, Giulio E; Singh, Nirbhay N; O'Reilly, Mark F; Green, Vanessa A; Alberti, Gloria; Boccasini, Adele; Smaldone, Angela; Oliva, Doretta; Bosco, Andrea

    2014-08-01

    This work assessed automatic feedback technologies to promote safe travel and speech loudness control, respectively, in two men with multiple disabilities. The men were involved in two single-case studies. In Study I, the technology involved a microprocessor, two photocells, and a verbal feedback device. The man received verbal alerting/feedback when the photocells spotted an obstacle in front of him. In Study II, the technology involved a sound-detecting unit connected to a throat and an airborne microphone, and to a vibration device. Vibration occurred when the man's speech loudness exceeded a preset level. The man included in Study I succeeded in using the automatic feedback in substitution of caregivers' alerting/feedback for safe travel. The man of Study II used the automatic feedback to successfully reduce his speech loudness. Automatic feedback can be highly effective in helping persons with multiple disabilities improve their travel and speech performance.

  15. [Repetitive phenomena in the spontaneous speech of aphasic patients: perseveration, stereotypy, echolalia, automatism and recurring utterance].

    Science.gov (United States)

    Wallesch, C W; Brunner, R J; Seemüller, E

    1983-12-01

    Repetitive phenomena in spontaneous speech were investigated in 30 patients with chronic infarctions of the left hemisphere which included Broca's and/or Wernicke's area and/or the basal ganglia. Perseverations, stereotypies, and echolalias occurred with all types of brain lesions, whereas automatisms and recurring utterances occurred only in patients whose infarctions involved Wernicke's area and the basal ganglia. These patients also showed more echolalic responses. The results are discussed in view of the role of the basal ganglia as motor program generators.

  16. Effects of hearing loss on speech recognition under distracting conditions and working memory in the elderly

    Directory of Open Access Journals (Sweden)

    Na W

    2017-08-01

    Wondo Na,1 Gibbeum Kim,1 Gungu Kim,1 Woojae Han,2 Jinsook Kim2 1Department of Speech Pathology and Audiology, Graduate School, 2Division of Speech Pathology and Audiology, Research Institute of Audiology and Speech Pathology, College of Natural Sciences, Hallym University, Chuncheon, Republic of Korea. Purpose: The current study aimed to evaluate hearing-related changes in terms of speech-in-noise processing, fast-rate speech processing, and working memory; and to identify which of these three factors is significantly affected by age-related hearing loss. Methods: One hundred subjects aged 65–84 years participated in the study. They were classified into four groups ranging from normal hearing to moderate-to-severe hearing loss. All the participants were tested for speech perception in quiet and noisy conditions and for speech perception with time alteration in quiet conditions. Forward- and backward-digit span tests were also conducted to measure the participants’ working memory. Results: (1) As the level of background noise increased, speech perception scores systematically decreased in all the groups. This pattern was more noticeable in the three hearing-impaired groups than in the normal hearing group. (2) As the speech rate increased, speech perception scores decreased. A significant interaction was found between speed of speech and hearing loss. In particular, 30% of compressed sentences revealed a clear differentiation between moderate hearing loss and moderate-to-severe hearing loss. (3) Although all the groups showed a longer span on the forward-digit span test than the backward-digit span test, there was no significant difference as a function of hearing loss. Conclusion: The degree of hearing loss strongly affects the speech recognition of babble-masked and time-compressed speech in the elderly but does not affect the working memory. We expect these results to be applied to appropriate rehabilitation strategies for hearing

  17. Implementation of a Tour Guide Robot System Using RFID Technology and Viterbi Algorithm-Based HMM for Speech Recognition

    Directory of Open Access Journals (Sweden)

    Neng-Sheng Pai

    2014-01-01

    This paper applied speech recognition and RFID technologies to develop an omni-directional mobile robot into a robot with voice control and guide introduction functions. For speech recognition, the speech signals were captured by short-time processing. The speaker first recorded the isolated words for the robot to create a speech database of specific speakers. After pre-processing of this speech database, the cepstrum and delta-cepstrum feature parameters were obtained using linear predictive coefficients (LPC). Then, the Hidden Markov Model (HMM) was used for model training on the speech database, and the Viterbi algorithm was used to find an optimal state sequence as the reference sample for speech recognition. The trained reference model was put into the industrial computer on the robot platform, and the user entered the isolated words to be tested. After processing by the same reference model and comparison with the previous reference model, the path with the maximum total probability among the various models, found using the Viterbi algorithm, was taken as the recognition result. Finally, the speech recognition and RFID systems were run in an actual environment to prove their feasibility and stability, and implemented in the omni-directional mobile robot.
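
    The recognition step described in this abstract (score each word's HMM with the Viterbi algorithm and keep the word whose best path has the highest probability) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the two-state toy models, the symbol alphabet "a"/"b" and all probabilities below are hypothetical, and discrete observation symbols stand in for the LPC-derived cepstral features.

```python
import math

def viterbi(obs, states, log_start, log_trans, log_emit):
    """Return (best_log_prob, best_state_path) for a discrete-observation HMM."""
    # delta[s]: best log-probability of any state path ending in s at the current frame
    delta = {s: log_start[s] + log_emit[s][obs[0]] for s in states}
    backptrs = []
    for o in obs[1:]:
        new_delta, ptr = {}, {}
        for s in states:
            prev = max(states, key=lambda p: delta[p] + log_trans[p][s])
            new_delta[s] = delta[prev] + log_trans[prev][s] + log_emit[s][o]
            ptr[s] = prev
        delta = new_delta
        backptrs.append(ptr)
    # Trace the best path backwards from the best final state
    last = max(states, key=lambda s: delta[s])
    path = [last]
    for ptr in reversed(backptrs):
        path.append(ptr[path[-1]])
    path.reverse()
    return delta[last], path

def recognize(obs, word_models):
    """Pick the word whose HMM yields the highest Viterbi path score."""
    return max(word_models, key=lambda w: viterbi(obs, *word_models[w])[0])

def logs(table):
    """Convert a (possibly nested) dict of probabilities to log-probabilities."""
    return {k: (logs(v) if isinstance(v, dict) else math.log(v)) for k, v in table.items()}

# Hypothetical toy models for two isolated words; every number is made up.
STATES = ["s0", "s1"]
START = logs({"s0": 0.9, "s1": 0.1})
TRANS = logs({"s0": {"s0": 0.6, "s1": 0.4}, "s1": {"s0": 0.1, "s1": 0.9}})
WORD_MODELS = {
    "go": (STATES, START, TRANS, logs({"s0": {"a": 0.8, "b": 0.2}, "s1": {"a": 0.7, "b": 0.3}})),
    "stop": (STATES, START, TRANS, logs({"s0": {"a": 0.2, "b": 0.8}, "s1": {"a": 0.3, "b": 0.7}})),
}
```

    Calling recognize(["a", "a", "b", "a"], WORD_MODELS) selects the toy "go" model. Log-probabilities are used so that long utterances do not underflow.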

  18. Temporal acuity and speech recognition score in noise in patients with multiple sclerosis

    Directory of Open Access Journals (Sweden)

    Mehri Maleki

    2014-04-01

    Background and Aim: Multiple sclerosis (MS) is a central nervous system disease that can be associated with a variety of symptoms, such as hearing disorders. The main consequence of hearing loss is poor speech perception, and temporal acuity plays an important role in speech perception. We evaluated speech perception in silence and in the presence of noise, and temporal acuity, in patients with multiple sclerosis. Methods: Eighteen adults with multiple sclerosis with a mean age of 37.28 years and 18 age- and sex-matched controls with a mean age of 38.00 years participated in this study. Temporal acuity and speech perception were evaluated by the random gap detection test (GDT) and word recognition score (WRS) in three different signal-to-noise ratios. Results: Statistical analysis of test results revealed significant differences between the two groups (p<0.05). Analysis of the gap detection test (in 4 sensation levels) and word recognition score in both groups showed significant differences (p<0.001). Conclusion: According to this survey, the ability of patients with multiple sclerosis to process temporal features of stimuli was impaired. This impairment appears to be an important factor in decreased word recognition score and speech perception.

  19. ANALYSIS OF MULTIMODAL FUSION TECHNIQUES FOR AUDIO-VISUAL SPEECH RECOGNITION

    Directory of Open Access Journals (Sweden)

    D.V. Ivanko

    2016-05-01

    The paper is an analytical review covering the latest achievements in the field of audio-visual (AV) fusion (integration) of multimodal information. We discuss the main challenges and report on approaches to address them. One of the most important tasks of AV integration is to understand how the modalities interact and influence each other. The paper addresses this problem in the context of AV speech processing and speech recognition. In the first part of the review we set out the basic principles of AV speech recognition and give a classification of audio and visual features of speech. Special attention is paid to the systematization of the existing techniques and AV data fusion methods. In the second part we provide a consolidated list of tasks and applications that use AV fusion, based on our analysis of the research area. We also indicate the methods, techniques, and audio and video features used. We propose a classification of AV integration, and discuss the advantages and disadvantages of different approaches. We draw conclusions and offer our assessment of the future of the field of AV fusion. In further research we plan to implement a system for audio-visual recognition of continuous Russian speech using advanced methods of multimodal fusion.

  20. Automatic Type Recognition and Mapping of Global Tropical Cyclone Disaster Chains (TDC)

    Directory of Open Access Journals (Sweden)

    Ran Wang

    2016-10-01

    The catastrophic events caused by meteorological disasters are becoming more severe in the context of global warming. The disaster chains triggered by tropical cyclones cause serious losses of population and economy. Effective regional type recognition of the Tropical Cyclone Disaster Chain (TDC) is necessary for targeted prevention. This study mainly explores a method for the automatic recognition and mapping of TDC and designs a software system. We constructed an automatic recognition system in terms of the characteristics of a hazard-formative environment based on the theory of a natural disaster system. The ArcEngine components enable an intelligent software system to present results by the automatic mapping approach. The study data come from global metadata such as the Digital Elevation Model (DEM), terrain slope, population density and Gross Domestic Product (GDP). The results show that: (1) according to the characteristics of geomorphology type, we establish a type recognition system for global TDC; (2) based on the recognition principle, we design a software system with the functions of automatic recognition and mapping; and (3) we validate the type distribution in terms of real cases of TDC. The results show that the automatic recognition function has good reliability. The study can provide the basis for a targeted regional disaster prevention strategy, as well as regional sustainable development.

  1. Tone realisation in a Yoruba speech recognition corpus

    CSIR Research Space (South Africa)

    Van Niekerk, D

    2012-05-01

    Full Text Available development. Extracted contours are processed and analysed statistically to describe acoustic properties in different tonal contexts. The authors demonstrate how features useful for tone recognition or synthesis can be successfully extracted from a corpus...

  2. Progressive-Search Algorithms for Large-Vocabulary Speech Recognition

    National Research Council Canada - National Science Library

    Murveit, Hy; Butzberger, John; Digalakis, Vassilios; Weintraub, Mitch

    1993-01-01

    .... An algorithm, the "Forward-Backward Word-Life Algorithm," is described. It can generate a word lattice in a progressive search that would be used as a language model embedded in a succeeding recognition pass to reduce computation requirements...

  3. Audiovisual cues benefit recognition of accented speech in noise but not perceptual adaptation.

    Science.gov (United States)

    Banks, Briony; Gowen, Emma; Munro, Kevin J; Adank, Patti

    2015-01-01

    Perceptual adaptation allows humans to recognize different varieties of accented speech. We investigated whether perceptual adaptation to accented speech is facilitated if listeners can see a speaker's facial and mouth movements. In Study 1, participants listened to sentences in a novel accent and underwent a period of training with audiovisual or audio-only speech cues, presented in quiet or in background noise. A control group also underwent training with visual-only (speech-reading) cues. We observed no significant difference in perceptual adaptation between any of the groups. To address a number of remaining questions, we carried out a second study using a different accent, speaker and experimental design, in which participants listened to sentences in a non-native (Japanese) accent with audiovisual or audio-only cues, without separate training. Participants' eye gaze was recorded to verify that they looked at the speaker's face during audiovisual trials. Recognition accuracy was significantly better for audiovisual than for audio-only stimuli; however, no statistical difference in perceptual adaptation was observed between the two modalities. Furthermore, Bayesian analysis suggested that the data supported the null hypothesis. Our results suggest that although the availability of visual speech cues may be immediately beneficial for recognition of unfamiliar accented speech in noise, it does not improve perceptual adaptation.

  4. Towards social touch intelligence: developing a robust system for automatic touch recognition

    NARCIS (Netherlands)

    Jung, Merel Madeleine

    2014-01-01

    Touch behavior is of great importance during social interaction. Automatic recognition of social touch is necessary to transfer the touch modality from interpersonal interaction to other areas such as Human-Robot Interaction (HRI). This paper describes a PhD research program on the automatic

  5. Multistage Data Selection-based Unsupervised Speaker Adaptation for Personalized Speech Emotion Recognition

    NARCIS (Netherlands)

    Kim, Jaebok; Park, Jeong-Sik

    This paper proposes an efficient speech emotion recognition (SER) approach that utilizes personal voice data accumulated on personal devices. A representative weakness of conventional SER systems is the user-dependent performance induced by the speaker independent (SI) acoustic model framework. But,

  6. Enhancing Speech Recognition Using Improved Particle Swarm Optimization Based Hidden Markov Model

    Directory of Open Access Journals (Sweden)

    Lokesh Selvaraj

    2014-01-01

    Enhancing speech recognition is the primary intention of this work. In this paper a novel speech recognition method based on vector quantization and improved particle swarm optimization (IPSO) is suggested. The suggested methodology contains four stages, namely, (i) denoising, (ii) feature mining, (iii) vector quantization, and (iv) an IPSO-based hidden Markov model (HMM) technique (IP-HMM). At first, the speech signals are denoised using a median filter. Next, characteristics such as peak, pitch spectrum, Mel frequency cepstral coefficients (MFCC), mean, standard deviation, and minimum and maximum of the signal are extracted from the denoised signal. Following that, to accomplish the training process, the extracted characteristics are given to genetic algorithm based codebook generation in vector quantization. The initial populations for the genetic algorithm process are created by selecting random code vectors from the training set for the codebooks, and IP-HMM performs the recognition. The novelty is introduced in one of the genetic operations, crossover. The proposed speech recognition technique offers 97.14% accuracy.
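
    The vector quantization stage mentioned in this abstract can be sketched as follows. This is only an illustration under stated assumptions: the codebook is given as a fixed, hypothetical list rather than generated by a genetic algorithm, the function name quantize is invented here, and two-dimensional vectors stand in for the extracted speech features.

```python
def quantize(vectors, codebook):
    """Map each feature vector to the index of its nearest codeword.

    Returns (indices, mean squared-Euclidean distortion).
    """
    def dist2(a, b):
        # Squared Euclidean distance between two equal-length vectors
        return sum((x - y) ** 2 for x, y in zip(a, b))

    indices = [
        min(range(len(codebook)), key=lambda k: dist2(v, codebook[k]))
        for v in vectors
    ]
    distortion = sum(dist2(v, codebook[i]) for v, i in zip(vectors, indices)) / len(vectors)
    return indices, distortion

# Hypothetical codebook and feature vectors, for illustration only.
CODEBOOK = [(0.0, 0.0), (10.0, 10.0)]
FEATURES = [(1.0, 0.0), (9.0, 10.0), (0.0, 1.0)]
INDICES, DISTORTION = quantize(FEATURES, CODEBOOK)
```

    The resulting index sequence is the kind of discrete observation stream a discrete HMM can consume. Whether the mean distortion is what the paper's genetic algorithm optimizes is an assumption, not something the abstract states.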

  7. The Affordance of Speech Recognition Technology for EFL Learning in an Elementary School Setting

    Science.gov (United States)

    Liaw, Meei-Ling

    2014-01-01

    This study examined the use of speech recognition (SR) technology to support a group of elementary school children's learning of English as a foreign language (EFL). SR technology has been used in various language learning contexts. Its application to EFL teaching and learning is still relatively recent, but a solid understanding of its…

  8. Investigating an Innovative Computer Application to Improve L2 Word Recognition from Speech

    Science.gov (United States)

    Matthews, Joshua; O'Toole, John Mitchell

    2015-01-01

    The ability to recognise words from the aural modality is a critical aspect of successful second language (L2) listening comprehension. However, little research has been reported on computer-mediated development of L2 word recognition from speech in L2 learning contexts. This report describes the development of an innovative computer application…

  9. A Neuro-Linguistic Model for Speech Recognition in Tone Language

    African Journals Online (AJOL)

    The primary aim for this work is to develop a speech recognition system that exploits the computational paradigm with learning ability and the inherent robustness and parallelism in ANN coupled with the capability of fuzzy logic to model vagueness, handling uncertainness and support for human reasoning. This research ...

  10. Learning spectral-temporal features with 3D CNNs for speech emotion recognition

    NARCIS (Netherlands)

    Kim, Jaebok; Truong, Khiet; Englebienne, Gwenn; Evers, Vanessa

    2017-01-01

    In this paper, we propose to use deep 3-dimensional convolutional networks (3D CNNs) in order to address the challenge of modelling spectro-temporal dynamics for speech emotion recognition (SER). Compared to a hybrid of Convolutional Neural Network and Long-Short-Term-Memory (CNN-LSTM), our proposed

  11. Using word spotting to evaluate ROILA: a speech recognition friendly artificial language

    NARCIS (Netherlands)

    Mubin, O.; Bartneck, C.; Feijs, L.M.G.

    2010-01-01

    In our research we argue for the benefits that an artificial language could provide to improve the accuracy of speech recognition. We briefly present the design and implementation of a vocabulary of our intended artificial language (ROILA), the latter by means of a genetic algorithm that attempted

  12. Phonotactics Constraints and the Spoken Word Recognition of Chinese Words in Speech

    Science.gov (United States)

    Yip, Michael C.

    2016-01-01

    Two word-spotting experiments were conducted to examine the question of whether native Cantonese listeners are constrained by phonotactics information in spoken word recognition of Chinese words in speech. Because no legal consonant clusters occurred within an individual Chinese word, this kind of categorical phonotactics information of Chinese…

  13. Speech Recognition: Acoustic-Phonetic Knowledge Acquisition and Representation.

    Science.gov (United States)

    1987-09-25

    Society of America, Anaheim, CA, Dec. 1986. Randolph, M. A., and V. W. Zue, "The Role of Syllable Structure in the Acoustic Realizations of Stops...input speech signal is first transformed into a represen- ences in sociolinguistic background, dialect, and vocal tract tation that takes into account...Perceptual Evidence," Journal of the Acoustical Society of America, vol. 59, no. 5, pp. 1208-1221, May 1976. G. E. Kupec and M. A. Bush, "Network

  14. Automatic landmark detection and face recognition for side-view face images

    NARCIS (Netherlands)

    Santemiz, P.; Spreeuwers, Lieuwe Jan; Veldhuis, Raymond N.J.; Broemme, Arslan; Busch, Christoph

    2013-01-01

    In real-life scenarios where pose variation is up to side-view positions, face recognition becomes a challenging task. In this paper we propose an automatic side-view face recognition system designed for home-safety applications. Our goal is to recognize people as they pass through doors in order to

  15. Iconic Gestures for Robot Avatars, Recognition and Integration with Speech

    Science.gov (United States)

    Bremner, Paul; Leonards, Ute

    2016-01-01

    Co-verbal gestures are an important part of human communication, improving its efficiency and efficacy for information conveyance. One possible means by which such multi-modal communication might be realized remotely is through the use of a tele-operated humanoid robot avatar. Such avatars have been previously shown to enhance social presence and operator salience. We present a motion tracking based tele-operation system for the NAO robot platform that allows direct transmission of speech and gestures produced by the operator. To assess the capabilities of this system for transmitting multi-modal communication, we have conducted a user study that investigated if robot-produced iconic gestures are comprehensible, and are integrated with speech. Robot performed gesture outcomes were compared directly to those for gestures produced by a human actor, using a within participant experimental design. We show that iconic gestures produced by a tele-operated robot are understood by participants when presented alone, almost as well as when produced by a human. More importantly, we show that gestures are integrated with speech when presented as part of a multi-modal communication equally well for human and robot performances. PMID:26925010

  16. Iconic Gestures for Robot Avatars, Recognition and Integration with Speech

    Directory of Open Access Journals (Sweden)

    Paul Adam Bremner

    2016-02-01

    Co-verbal gestures are an important part of human communication, improving its efficiency and efficacy for information conveyance. One possible means by which such multi-modal communication might be realised remotely is through the use of a tele-operated humanoid robot avatar. Such avatars have been previously shown to enhance social presence and operator salience. We present a motion tracking based tele-operation system for the NAO robot platform that allows direct transmission of speech and gestures produced by the operator. To assess the capabilities of this system for transmitting multi-modal communication, we have conducted a user study that investigated if robot-produced iconic gestures are comprehensible, and are integrated with speech. Robot performed gesture outcomes were compared directly to those for gestures produced by a human actor, using a within participant experimental design. We show that iconic gestures produced by a tele-operated robot are understood by participants when presented alone, almost as well as when produced by a human. More importantly, we show that gestures are integrated with speech when presented as part of a multi-modal communication equally well for human and robot performances.

  17. Effects of hearing loss on speech recognition under distracting conditions and working memory in the elderly.

    Science.gov (United States)

    Na, Wondo; Kim, Gibbeum; Kim, Gungu; Han, Woojae; Kim, Jinsook

    2017-01-01

    The current study aimed to evaluate hearing-related changes in terms of speech-in-noise processing, fast-rate speech processing, and working memory; and to identify which of these three factors is significantly affected by age-related hearing loss. One hundred subjects aged 65-84 years participated in the study. They were classified into four groups ranging from normal hearing to moderate-to-severe hearing loss. All the participants were tested for speech perception in quiet and noisy conditions and for speech perception with time alteration in quiet conditions. Forward- and backward-digit span tests were also conducted to measure the participants' working memory. 1) As the level of background noise increased, speech perception scores systematically decreased in all the groups. This pattern was more noticeable in the three hearing-impaired groups than in the normal hearing group. 2) As the speech rate increased faster, speech perception scores decreased. A significant interaction was found between speed of speech and hearing loss. In particular, 30% of compressed sentences revealed a clear differentiation between moderate hearing loss and moderate-to-severe hearing loss. 3) Although all the groups showed a longer span on the forward-digit span test than the backward-digit span test, there was no significant difference as a function of hearing loss. The degree of hearing loss strongly affects the speech recognition of babble-masked and time-compressed speech in the elderly but does not affect the working memory. We expect these results to be applied to appropriate rehabilitation strategies for hearing-impaired elderly who experience difficulty in communication.

  18. Determination of ocular torsion by means of automatic pattern recognition

    NARCIS (Netherlands)

    Groen, E.L.; Bos, J.E.; Nacken, P.F.M.; Graaf, B. de

    1996-01-01

    A new, automatic method for determination of human ocular torsion (OT) was developed based on the tracking of iris patterns in digitized video images. Instead of quantifying OT by means of cross-correlation of circular iris samples, a procedure commonly applied, this new method automatically

  19. Determination of ocular torsion by means of automatic pattern recognition

    NARCIS (Netherlands)

    Groen, Eric; Bos, Jelte E.; Nacken, Peter F M; De Graaf, Bernd

    A new, automatic method for determination of human ocular torsion (OT) was developed based on the tracking of iris patterns in digitized video images. Instead of quantifying OT by means of cross-correlation of circular iris samples, a procedure commonly applied, this new method automatically selects

  20. Development of a Low-Cost, Noninvasive, Portable Visual Speech Recognition Program.

    Science.gov (United States)

    Kohlberg, Gavriel D; Gal, Ya'akov Kobi; Lalwani, Anil K

    2016-09-01

    Loss of speech following tracheostomy and laryngectomy severely limits communication to simple gestures and facial expressions that are largely ineffective. To facilitate communication in these patients, we seek to develop a low-cost, noninvasive, portable, and simple visual speech recognition program (VSRP) to convert articulatory facial movements into speech. A Microsoft Kinect-based VSRP was developed to capture spatial coordinates of lip movements and translate them into speech. The articulatory speech movements associated with 12 sentences were used to train an artificial neural network classifier. The accuracy of the classifier was then evaluated on a separate, previously unseen set of articulatory speech movements. The VSRP was successfully implemented and tested in 5 subjects. It achieved an accuracy rate of 77.2% (65.0%-87.6% for the 5 speakers) on a 12-sentence data set. The mean time to classify an individual sentence was 2.03 milliseconds (1.91-2.16). We have demonstrated the feasibility of a low-cost, noninvasive, portable VSRP based on Kinect to accurately predict speech from articulation movements in clinically trivial time. This VSRP could be used as a novel communication device for aphonic patients. © The Author(s) 2016.

  1. Automatic evaluation of speech rhythm instability and acceleration in dysarthrias associated with basal ganglia dysfunction

    Directory of Open Access Journals (Sweden)

    Jan Rusz

    2015-07-01

    Speech rhythm abnormalities are commonly present in patients with different neurodegenerative disorders. These alterations are hypothesized to be a consequence of disruption to the basal ganglia circuitry involving dysfunction of motor planning, programming and execution, which can be detected by a syllable repetition paradigm. Therefore, the aim of the present study was to design a robust signal processing technique that allows the automatic detection of spectrally-distinctive nuclei of syllable vocalizations and to determine speech features that represent rhythm instability and acceleration. A further aim was to elucidate specific patterns of dysrhythmia across various neurodegenerative disorders that share disruption of basal ganglia function. Speech samples based on repetition of the syllable /pa/ at a self-determined steady pace were acquired from 109 subjects, including 22 with Parkinson's disease (PD), 11 progressive supranuclear palsy (PSP), 9 multiple system atrophy (MSA), 24 ephedrone-induced parkinsonism (EP), 20 Huntington's disease (HD), and 23 healthy controls. Subsequently, an algorithm for the automatic detection of syllables as well as features representing rhythm instability and rhythm acceleration were designed. The proposed detection algorithm was able to correctly identify syllables and remove erroneous detections due to excessive inspiration and nonspeech sounds with a very high accuracy of 99.6%. Instability of vocal pace performance was observed in PSP, MSA, EP and HD groups. Significantly increased pace acceleration was observed only in the PD group. Although not significant, a tendency for pace acceleration was observed also in the PSP and MSA groups. Our findings underline the crucial role of the basal ganglia in the execution and maintenance of automatic speech motor sequences. We envisage the current approach to become the first step towards the development of acoustic technologies allowing automated assessment of rhythm
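
    The abstract above does not give formulas for its rhythm instability and acceleration features. As a rough sketch of how such measures could be computed from detected syllable onset times, the code below uses two hypothetical definitions that are assumptions of this example, not the authors': acceleration is the least-squares slope of inter-syllable interval against syllable index (negative means the pace speeds up), and instability is the residual scatter around that trend divided by the mean interval.

```python
def rhythm_features(onsets):
    """Return (instability, slope) from syllable onset times in seconds.

    Both definitions are illustrative assumptions; needs at least 3 onsets.
    """
    intervals = [b - a for a, b in zip(onsets, onsets[1:])]
    n = len(intervals)
    mean = sum(intervals) / n
    x_mean = (n - 1) / 2
    # Least-squares slope of interval duration against interval index
    sxx = sum((i - x_mean) ** 2 for i in range(n))
    slope = sum((i - x_mean) * (d - mean) for i, d in enumerate(intervals)) / sxx
    # Residual scatter around the fitted trend, relative to the mean interval
    residuals = [d - (mean + slope * (i - x_mean)) for i, d in enumerate(intervals)]
    instability = (sum(r * r for r in residuals) / n) ** 0.5 / mean
    return instability, slope

# Hypothetical onset times: a steady pace and a gradually accelerating one.
STEADY = [0.0, 0.5, 1.0, 1.5, 2.0]
ACCELERATING = [0.0, 0.5, 0.9, 1.2, 1.4]
```

    For the steady sequence both values are zero; for the accelerating one the slope is about -0.1 s per syllable while the instability stays near zero, because the shortening is perfectly regular.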

  2. Speech variability effects on recognition accuracy associated with concurrent task performance by pilots

    Science.gov (United States)

    Simpson, C. A.

    1985-01-01

    In the present study of the responses of pairs of pilots to aircraft warning classification tasks using an isolated word, speaker-dependent speech recognition system, the induced stress was manipulated by means of different scoring procedures for the classification task and by the inclusion of a competitive manual control task. Both speech patterns and recognition accuracy were analyzed, and recognition errors were recorded by type for an isolated word speaker-dependent system and by an offline technique for a connected word speaker-dependent system. While errors increased with task loading for the isolated word system, there was no such effect for task loading in the case of the connected word system.

  3. Audio-Visual Tibetan Speech Recognition Based on a Deep Dynamic Bayesian Network for Natural Human Robot Interaction

    Directory of Open Access Journals (Sweden)

    Yue Zhao

    2012-12-01

    Audio-visual speech recognition is a natural and robust approach to improving human-robot interaction in noisy environments. Although multi-stream Dynamic Bayesian Networks and coupled HMMs are widely used for audio-visual speech recognition, they fail to learn the shared features between modalities and ignore the dependency of features among the frames within each discrete state. In this paper, we propose a Deep Dynamic Bayesian Network (DDBN) to perform unsupervised extraction of spatial-temporal multimodal features from Tibetan audio-visual speech data and build an accurate audio-visual speech recognition model without a frame-independence assumption. The experimental results on Tibetan speech data from some real-world environments showed that the proposed DDBN outperforms the state-of-the-art methods in word recognition accuracy.

  4. Effects of pose and image resolution on automatic face recognition

    NARCIS (Netherlands)

    Mahmood, Zahid; Ali, Tauseef; Khan, Samee U.

    The popularity of face recognition systems has increased due to their use in widespread applications. Driven by the enormous number of potential application domains, several algorithms have been proposed for face recognition. Face pose and image resolutions are among the two important factors that

  5. One-against-all weighted dynamic time warping for language-independent and speaker-dependent speech recognition in adverse conditions.

    Directory of Open Access Journals (Sweden)

    Xianglilan Zhang

    Considering personal privacy and the difficulty of obtaining training material for many seldom-used English words and (often non-English) names, language-independent (LI) with lightweight speaker-dependent (SD) automatic speech recognition (ASR) is a promising option to solve the problem. The dynamic time warping (DTW) algorithm is the state-of-the-art algorithm for small-footprint SD ASR applications with limited storage space and small vocabulary, such as voice dialing on mobile devices, menu-driven recognition, and voice control on vehicles and robotics. Even though we have successfully developed two fast and accurate DTW variations for clean speech data, speech recognition in adverse conditions is still a big challenge. In order to improve recognition accuracy in noisy environments and bad recording conditions, such as too high or too low volume, we introduce a novel one-against-all weighted DTW (OAWDTW). This method defines a one-against-all index (OAI) for each time frame of training data and applies the OAIs to the core DTW process. Given two speech signals, OAWDTW tunes their final alignment score by using OAI in the DTW process. Our method achieves better accuracies than DTW and merge-weighted DTW (MWDTW): a 6.97% relative reduction of error rate (RRER) compared with DTW and a 15.91% RRER compared with MWDTW are observed in our extensive experiments on one representative SD dataset of four speakers' recordings. To the best of our knowledge, the OAWDTW approach is the first weighted DTW specially designed for speech data in adverse conditions.
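
    The baseline that this record builds on, classic DTW template matching, can be sketched as follows. This is a minimal illustration only: it does not implement the one-against-all weighting, whose details the abstract does not give. The function names `dtw_distance` and `recognize`, the Euclidean frame distance, and the toy one-dimensional features are all assumptions made for the example.

```python
import math


def dtw_distance(a, b):
    """Classic DTW alignment cost between two sequences of feature vectors."""
    n, m = len(a), len(b)
    # cost[i][j] = best cumulative cost aligning a[:i] with b[:j]
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            frame_dist = math.dist(a[i - 1], b[j - 1])  # Euclidean (assumed)
            cost[i][j] = frame_dist + min(
                cost[i - 1][j],      # a advances, b frame is repeated
                cost[i][j - 1],      # b advances, a frame is repeated
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]


def recognize(query, templates):
    """Return the label of the stored template closest to the query."""
    return min(templates, key=lambda label: dtw_distance(query, templates[label]))


# Hypothetical speaker-dependent templates with 1-D features per frame.
templates = {
    "yes": [(0.0,), (1.0,), (2.0,)],
    "no": [(5.0,), (5.0,)],
}
query = [(0.0,), (0.0,), (1.0,), (2.0,)]  # "yes" spoken more slowly
best = recognize(query, templates)
```

    Because the warping path may repeat frames, the slower query still aligns with the "yes" template at zero cost, which is the property that makes DTW attractive for small-vocabulary speaker-dependent tasks.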

  6. Spectro-Temporal Analysis of Speech for Spanish Phoneme Recognition

    DEFF Research Database (Denmark)

    Sharifzadeh, Sara; Serrano, Javier; Carrabina, Jordi

    2012-01-01

    are considered. This has improved the recognition performance especially in case of noisy situation and phonemes with time domain modulations such as stops. In this method, the 2D Discrete Cosine Transform (DCT) is applied on small overlapped 2D Hamming windowed patches of spectrogram of Spanish phonemes...
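
    The patch-based feature extraction mentioned in this fragment can be sketched roughly as below. This is an illustrative sketch only: the patch size, the hop, the unnormalized DCT-II, and the names `hamming`, `dct2_1d`, `dct2_2d`, and `patch_features` are assumptions, since the record does not state the actual settings.

```python
import math


def hamming(n):
    """Symmetric Hamming window of length n (n >= 2)."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * k / (n - 1)) for k in range(n)]


def dct2_1d(x):
    """Unnormalized 1-D DCT-II."""
    n = len(x)
    return [
        sum(x[t] * math.cos(math.pi * (t + 0.5) * k / n) for t in range(n))
        for k in range(n)
    ]


def dct2_2d(patch):
    """Separable 2-D DCT-II: transform rows, then columns."""
    rows = [dct2_1d(row) for row in patch]
    cols = [dct2_1d(list(col)) for col in zip(*rows)]
    return [list(row) for row in zip(*cols)]


def patch_features(spectrogram, size=4, hop=2):
    """2-D DCT of overlapping, 2-D Hamming-windowed spectrogram patches."""
    w = hamming(size)
    feats = []
    for r in range(0, len(spectrogram) - size + 1, hop):
        for c in range(0, len(spectrogram[0]) - size + 1, hop):
            patch = [
                [spectrogram[r + i][c + j] * w[i] * w[j] for j in range(size)]
                for i in range(size)
            ]
            feats.append(dct2_2d(patch))
    return feats


# Hypothetical 6x6 "spectrogram" (frequency bins x time frames).
spec = [[float(r + c) for c in range(6)] for r in range(6)]
features = patch_features(spec, size=4, hop=2)
```

    With a hop smaller than the patch size, neighbouring patches overlap, so a localized time-domain modulation contributes to several feature blocks.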

  7. Behavioral and electrophysiological evidence for early and automatic detection of phonological equivalence in variable speech inputs.

    Science.gov (United States)

    Kharlamov, Viktor; Campbell, Kenneth; Kazanina, Nina

    2011-11-01

    Speech sounds are not always perceived in accordance with their acoustic-phonetic content. For example, an early and automatic process of perceptual repair, which ensures conformity of speech inputs to the listener's native language phonology, applies to individual input segments that do not exist in the native inventory or to sound sequences that are illicit according to the native phonotactic restrictions on sound co-occurrences. The present study with Russian and Canadian English speakers shows that listeners may perceive phonetically distinct and licit sound sequences as equivalent when the native language system provides robust evidence for mapping multiple phonetic forms onto a single phonological representation. In Russian, due to an optional but productive t-deletion process that affects /stn/ clusters, the surface forms [sn] and [stn] may be phonologically equivalent and map to a single phonological form /stn/. In contrast, [sn] and [stn] clusters are usually phonologically distinct in (Canadian) English. Behavioral data from identification and discrimination tasks indicated that [sn] and [stn] clusters were more confusable for Russian than for English speakers. The EEG experiment employed an oddball paradigm with nonwords [asna] and [astna] used as the standard and deviant stimuli. A reliable mismatch negativity response was elicited approximately 100 msec postchange in the English group but not in the Russian group. These findings point to a perceptual repair mechanism that is engaged automatically at a prelexical level to ensure immediate encoding of speech inputs in phonological terms, which in turn enables efficient access to the meaning of a spoken utterance.

  8. A glimpsing account of the role of temporal fine structure information in speech recognition.

    Science.gov (United States)

    Apoux, Frédéric; Healy, Eric W

    2013-01-01

    Many behavioral studies have reported a significant decrease in intelligibility when the temporal fine structure (TFS) of a sound mixture is replaced with noise or tones (i.e., vocoder processing). This finding has led to the conclusion that TFS information is critical for speech recognition in noise. How the normal auditory system takes advantage of the original TFS, however, remains unclear. Three experiments on the role of TFS in noise are described. All three experiments measured speech recognition in various backgrounds while manipulating the envelope, TFS, or both. One experiment tested the hypothesis that vocoder processing may artificially increase the apparent importance of TFS cues. Another experiment evaluated the relative contribution of the target and masker TFS by disturbing only the TFS of the target or that of the masker. Finally, a last experiment evaluated the relative contribution of envelope and TFS information. In contrast to previous studies, however, the original envelope and TFS were both preserved - to some extent - in all conditions. Overall, the experiments indicate a limited influence of TFS and suggest that little speech information is extracted from the TFS. Concomitantly, these experiments confirm that most speech information is carried by the temporal envelope in real-world conditions. When interpreted within the framework of the glimpsing model, the results of these experiments suggest that TFS is primarily used as a grouping cue to select the time-frequency regions corresponding to the target speech signal.

  9. Exploring the link between cognitive abilities and speech recognition in the elderly under different listening conditions

    DEFF Research Database (Denmark)

    Nuesse, Theresa; Steenken, Rike; Neher, Tobias

    2018-01-01

    , which included measures of verbal working- and short-term memory, executive functioning, selective and divided attention, and lexical and semantic abilities. Age-matched groups of older adults with either age-appropriate hearing (ENH, N = 20) or aided hearing impairment (EHI, N = 21) participated...... for the ENH listeners. Whereas better lexical and semantic abilities were associated with lower (better) SRTs in this group, there was a negative association between attentional abilities and speech recognition in the presence of spatially separated speech-like maskers. For the EHI group, the pure...

  10. Fidelity of Automatic Speech Processing for Adult and Child Talker Classifications.

    Directory of Open Access Journals (Sweden)

    Mark VanDam

    Automatic speech processing (ASP) has recently been applied to very large datasets of naturalistically collected, daylong recordings of child speech via an audio recorder worn by young children. The system developed by the LENA Research Foundation analyzes children's speech for research and clinical purposes, with special focus on identifying and tagging family speech dynamics and the at-home acoustic environment from the auditory perspective of the child. A primary issue for researchers, clinicians, and families using the Language ENvironment Analysis (LENA) system is to what degree the segment labels are valid. This classification study evaluates the performance of the computer ASP output against 23 trained human judges who made about 53,000 judgements of classification of segments tagged by the LENA ASP. Results indicate performance consistent with modern ASP such as those using HMM methods, with acoustic characteristics of fundamental frequency and segment duration most important for both human and machine classifications. Results are likely to be important for interpreting and improving ASP output.

  11. DEVELOPMENT OF AUTOMATED SPEECH RECOGNITION SYSTEM FOR EGYPTIAN ARABIC PHONE CONVERSATIONS

    Directory of Open Access Journals (Sweden)

    A. N. Romanenko

    2016-07-01

    The paper describes several speech recognition systems for Egyptian Colloquial Arabic. The research is based on the CALLHOME Egyptian corpus. Both kinds of systems are described: the classic one, based on Hidden Markov and Gaussian Mixture Models, and the state-of-the-art one, based on deep neural network acoustic models. We have demonstrated the contribution of speaker-dependent bottleneck features; for their extraction, three extractors based on neural networks were trained. For their training, three datasets in several languages were used: Russian, English and different Arabic dialects. We have studied the possibility of applying a small Modern Standard Arabic (MSA) corpus to derive phonetic transcriptions. The experiments have shown that the extractor trained on the Russian dataset significantly improves the quality of Arabic speech recognition. We have also found that using phonetic transcriptions based on Modern Standard Arabic decreases recognition quality. Nevertheless, the system results remain applicable in practice. In addition, we have studied the application of the obtained models to the keyword search problem. The systems obtained demonstrate good results compared to those published before. Some ways to improve speech recognition are offered.

  12. Automatic recognition of smoke-plume signatures in lidar signal

    Science.gov (United States)

    Utkin, Andrei B.; Lavrov, Alexander; Vilar, Rui

    2008-10-01

    A simple and robust algorithm for lidar-signal classification based on the fast extraction of sufficiently pronounced peaks and their recognition with a perceptron, whose efficiency is enhanced by a fast nonlinear preprocessing that increases the signal dimension, is reported. The method allows smoke-plume recognition with an error rate as small as 0.31% (19 misdetections and 4 false alarms in analyzing a test set of 7409 peaks).
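
    The reported error rate follows directly from the counts given in the abstract, assuming (as the wording suggests) that misdetections and false alarms are pooled over the whole test set:

```latex
\[
\text{error rate} = \frac{19 + 4}{7409} = \frac{23}{7409} \approx 0.0031 = 0.31\%
\]
```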

  13. Automatic system for localization and recognition of vehicle plate numbers

    OpenAIRE

    Vázquez, N.; Nakano, M.; Pérez-Meana, H.

    2003-01-01

    This paper proposes a vehicle number plate identification system, which extracts the character features of a plate from an image captured by a digital camera. It then identifies the symbols of the number plate using a multilayer neural network. The proposed recognition system consists of two processes: the training process and the recognition process. During the training process, a database is created using 310 vehicular plate images. Then using this database a multilayer neural network is traine...

  14. Mandarin-Speaking Children’s Speech Recognition: Developmental Changes in the Influences of Semantic Context and F0 Contours

    Directory of Open Access Journals (Sweden)

    Hong Zhou

    2017-06-01

    The goal of this developmental speech perception study was to assess whether and how age group modulated the influences of high-level semantic context and low-level fundamental frequency (F0) contours on the recognition of Mandarin speech by elementary and middle-school-aged children in quiet and interference backgrounds. The results revealed different patterns for semantic and F0 information. On the one hand, age group modulated significantly the use of F0 contours, indicating that elementary school children relied more on natural F0 contours than middle school children during Mandarin speech recognition. On the other hand, there was no significant modulation effect of age group on semantic context, indicating that children of both age groups used semantic context to assist speech recognition to a similar extent. Furthermore, the significant modulation effect of age group on the interaction between F0 contours and semantic context revealed that younger children could not make better use of semantic context in recognizing speech with flat F0 contours compared with natural F0 contours, while older children could benefit from semantic context even when natural F0 contours were altered, thus confirming the important role of F0 contours in Mandarin speech recognition by elementary school children. The developmental changes in the effects of high-level semantic and low-level F0 information on speech recognition might reflect the differences in auditory and cognitive resources associated with processing of the two types of information in speech perception.

  15. Hybrid model decomposition of speech and noise in a radial basis function neural model framework

    DEFF Research Database (Denmark)

    Sørensen, Helge Bjarup Dissing; Hartmann, Uwe

    1994-01-01

    The aim of the paper is to focus on a new approach to automatic speech recognition in noisy environments where the noise has either stationary or non-stationary statistical characteristics. The aim is to perform automatic recognition of speech in the presence of additive car noise. The technique...

  16. Non-native Listeners’ Recognition of High-Variability Speech Using PRESTO

    Science.gov (United States)

    Tamati, Terrin N.; Pisoni, David B.

    2015-01-01

    Background: Natural variability in speech is a significant challenge to robust successful spoken word recognition. In everyday listening environments, listeners must quickly adapt and adjust to multiple sources of variability in both the signal and listening environments. High-variability speech may be particularly difficult to understand for non-native listeners, who have less experience with the second language (L2) phonological system and less detailed knowledge of sociolinguistic variation of the L2. Purpose: The purpose of this study was to investigate the effects of high-variability sentences on non-native speech recognition and to explore the underlying sources of individual differences in speech recognition abilities of non-native listeners. Research Design: Participants completed two sentence recognition tasks involving high-variability and low-variability sentences. They also completed a battery of behavioral tasks and self-report questionnaires designed to assess their indexical processing skills, vocabulary knowledge, and several core neurocognitive abilities. Study Sample: Native speakers of Mandarin (n = 25) living in the United States recruited from the Indiana University community participated in the current study. A native comparison group consisted of scores obtained from native speakers of English (n = 21) in the Indiana University community taken from an earlier study. Data Collection and Analysis: Speech recognition in high-variability listening conditions was assessed with a sentence recognition task using sentences from PRESTO (Perceptually Robust English Sentence Test Open-Set) mixed in 6-talker multitalker babble. Speech recognition in low-variability listening conditions was assessed using sentences from HINT (Hearing In Noise Test) mixed in 6-talker multitalker babble. Indexical processing skills were measured using a talker discrimination task, a gender discrimination task, and a forced-choice regional dialect categorization task. Vocabulary

  17. Automatic SIMD parallelization of embedded applications based on pattern recognition

    NARCIS (Netherlands)

    Manniesing, R.; Karkowski, I.P.; Corporaal, H.

    2000-01-01

    This paper investigates the potential for automatic mapping of typical embedded applications to architectures with multimedia instruction set extensions. For this purpose a (pattern matching based) code transformation engine is used, which involves a three-step process of matching, condition

  18. Improved Techniques for Automatic Chord Recognition from Music Audio Signals

    Science.gov (United States)

    Cho, Taemin

    2014-01-01

    This thesis is concerned with the development of techniques that facilitate the effective implementation of capable automatic chord transcription from music audio signals. Since chord transcriptions can capture many important aspects of music, they are useful for a wide variety of music applications and also useful for people who learn and perform…

  19. Fully Automatic Recognition of the Temporal Phases of Facial Actions

    NARCIS (Netherlands)

    Valstar, M.F.; Pantic, Maja

    Past work on automatic analysis of facial expressions has focused mostly on detecting prototypic expressions of basic emotions like happiness and anger. The method proposed here enables the detection of a much larger range of facial behavior by recognizing facial muscle actions [action units (AUs)

  20. Effects of hearing loss and cognitive load on speech recognition with competing talkers

    Directory of Open Access Journals (Sweden)

    Hartmut Meister

    2016-03-01

    Everyday communication frequently comprises situations with more than one talker speaking at a time. These situations are challenging since they pose high attentional and memory demands placing cognitive load on the listener. Hearing impairment additionally exacerbates communication problems under these circumstances. We examined the effects of hearing loss and attention tasks on speech recognition with competing talkers in older adults with and without hearing impairment. We hypothesized that hearing loss would affect word identification, talker separation and word recall and that the difficulties experienced by the hearing impaired listeners would be especially pronounced in a task with high attentional and memory demands. Two listener groups closely matched regarding their age and neuropsychological profile but differing in hearing acuity were examined regarding their speech recognition with competing talkers in two different tasks. One task required repeating back words from one target talker (1TT) while ignoring the competing talker whereas the other required repeating back words from both talkers (2TT). The competing talkers differed with respect to their voice characteristics. Moreover, sentences either with low or high context were used in order to consider linguistic properties. Compared to their normal hearing peers, listeners with hearing loss revealed limited speech recognition in both tasks. Their difficulties were especially pronounced in the more demanding 2TT task. In order to shed light on the underlying mechanisms, different error sources, namely having misunderstood, confused, or omitted words were investigated. Misunderstanding and omitting words were more frequently observed in the hearing impaired than in the normal hearing listeners. In line with common speech perception models it is suggested that these effects are related to impaired object formation and taxed working memory capacity (WMC). In a post hoc analysis the

  1. Exploring the Link Between Cognitive Abilities and Speech Recognition in the Elderly Under Different Listening Conditions

    Directory of Open Access Journals (Sweden)

    Theresa Nuesse

    2018-05-01

    Elderly listeners are known to differ considerably in their ability to understand speech in noise. Several studies have addressed the underlying factors that contribute to these differences. These factors include audibility, and age-related changes in supra-threshold auditory processing abilities, and it has been suggested that differences in cognitive abilities may also be important. The objective of this study was to investigate associations between performance in cognitive tasks and speech recognition under different listening conditions in older adults with either age appropriate hearing or hearing-impairment. To that end, speech recognition threshold (SRT) measurements were performed under several masking conditions that varied along the perceptual dimensions of dip listening, spatial separation, and informational masking. In addition, a neuropsychological test battery was administered, which included measures of verbal working and short-term memory, executive functioning, selective and divided attention, and lexical and semantic abilities. Age-matched groups of older adults with either age-appropriate hearing (ENH, n = 20) or aided hearing impairment (EHI, n = 21) participated. In repeated linear regression analyses, composite scores of cognitive test outcomes (evaluated using PCA) were included to predict SRTs. These associations were different for the two groups. When hearing thresholds were controlled for, composed cognitive factors were significantly associated with the SRTs for the ENH listeners. Whereas better lexical and semantic abilities were associated with lower (better) SRTs in this group, there was a negative association between attentional abilities and speech recognition in the presence of spatially separated speech-like maskers. For the EHI group, the pure-tone thresholds (averaged across 0.5, 1, 2, and 4 kHz) were significantly associated with the SRTs, despite the fact that all signals were amplified and therefore in principle
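
    The four-frequency pure-tone average referred to in this abstract is a plain arithmetic mean of audiometric thresholds. A minimal sketch follows; the function name `pure_tone_average` and the audiogram values are hypothetical and are not data from the study.

```python
def pure_tone_average(thresholds_db_hl, freqs_khz=(0.5, 1, 2, 4)):
    """Mean hearing threshold (dB HL) across the given audiometric frequencies."""
    return sum(thresholds_db_hl[f] for f in freqs_khz) / len(freqs_khz)


# Hypothetical audiogram: frequency in kHz -> threshold in dB HL.
audiogram = {0.25: 20, 0.5: 30, 1: 40, 2: 50, 4: 60, 8: 70}
pta = pure_tone_average(audiogram)
```

    Only the four listed frequencies enter the average, so the thresholds at 0.25 and 8 kHz in the example are ignored.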

  2. Audio-Visual Speech Recognition Using Lip Information Extracted from Side-Face Images

    Directory of Open Access Journals (Sweden)

    Koji Iwano

    2007-03-01

    This paper proposes an audio-visual speech recognition method using lip information extracted from side-face images as an attempt to increase noise robustness in mobile environments. Our proposed method assumes that lip images can be captured using a small camera installed in a handset. Two different kinds of lip features, lip-contour geometric features and lip-motion velocity features, are used individually or jointly, in combination with audio features. Phoneme HMMs modeling the audio and visual features are built based on the multistream HMM technique. Experiments conducted using Japanese connected digit speech contaminated with white noise in various SNR conditions show the effectiveness of the proposed method. Recognition accuracy is improved by using the visual information in all SNR conditions. These visual features were confirmed to be effective even when the audio HMM was adapted to noise by the MLLR method.
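
    In the usual multistream HMM formulation, a state's observation log-likelihood is a weighted sum of the per-stream log-likelihoods. The sketch below shows that combination only; the function name, the two-stream restriction, and the constraint that the weights sum to one are assumptions for illustration, and the abstract does not say how the weights were chosen.

```python
def multistream_log_likelihood(log_b_audio, log_b_visual, audio_weight):
    """Combine audio and visual state log-likelihoods with stream weights.

    The visual weight is taken as 1 - audio_weight (assumed convention).
    """
    if not 0.0 <= audio_weight <= 1.0:
        raise ValueError("audio_weight must lie in [0, 1]")
    return audio_weight * log_b_audio + (1.0 - audio_weight) * log_b_visual


# Hypothetical per-state log-likelihoods for one frame.
clean_score = multistream_log_likelihood(-10.0, -4.0, audio_weight=0.5)
noisy_score = multistream_log_likelihood(-10.0, -4.0, audio_weight=0.25)
```

    Lowering the audio weight shifts the decision towards the visual stream, which is the intuition behind relying more on lip features as the SNR drops.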

  3. Performance Evaluation of Speech Recognition Systems as a Next-Generation Pilot-Vehicle Interface Technology

    Science.gov (United States)

    Arthur, Jarvis J., III; Shelton, Kevin J.; Prinzel, Lawrence J., III; Bailey, Randall E.

    2016-01-01

    During the flight trials known as Gulfstream-V Synthetic Vision Systems Integrated Technology Evaluation (GV-SITE), a Speech Recognition System (SRS) was used by the evaluation pilots. The SRS was intended to be an intuitive interface for display control (rather than knobs, buttons, etc.). This paper describes the performance of the current "state of the art" SRS. The commercially available technology was evaluated as an application for possible inclusion in commercial aircraft flight decks as a crew-to-vehicle interface. Specifically, the technology is to be used as an interface from aircrew to the onboard displays, controls, and flight management tasks. Both a flight test and a laboratory test of an SRS were conducted.

  4. Exploring the link between cognitive abilities and speech recognition in the elderly under different listening conditions

    DEFF Research Database (Denmark)

    Nuesse, Theresa; Steenken, Rike; Neher, Tobias

    2018-01-01

    , and it has been suggested that differences in cognitive abilities may also be important. The objective of this study was to investigate associations between performance in cognitive tasks and speech recognition under different listening conditions in older adults with either age appropriate hearing...... or hearing-impairment. To that end, speech recognition threshold (SRT) measurements were performed under several masking conditions that varied along the perceptual dimensions of dip listening, spatial separation, and informational masking. In addition, a neuropsychological test battery was administered......, which included measures of verbal working- and short-term memory, executive functioning, selective and divided attention, and lexical and semantic abilities. Age-matched groups of older adults with either age-appropriate hearing (ENH, N = 20) or aided hearing impairment (EHI, N = 21) participated...

  5. Syntactic and semantic errors in radiology reports associated with speech recognition software.

    Science.gov (United States)

    Ringler, Michael D; Goss, Brian C; Bartholmai, Brian J

    2017-03-01

    Speech recognition software can increase the frequency of errors in radiology reports, which may affect patient care. We retrieved 213,977 speech recognition software-generated reports from 147 different radiologists and proofread them for errors. Errors were classified as "material" if they were believed to alter interpretation of the report. "Immaterial" errors were subclassified as intrusion/omission or spelling errors. The proportion of errors and error type were compared among individual radiologists, imaging subspecialty, and time periods. In all, 20,759 reports (9.7%) contained errors, of which 3992 (1.9%) were material errors. Among immaterial errors, spelling errors were more common than intrusion/omission errors (p [...] reports, reports reinterpreting results of outside examinations, and procedural studies (all p < .001). Error rate decreased over time (p < .001), which suggests that a quality control program with regular feedback may reduce errors.
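
    The two percentages in this abstract share the same denominator, which is easy to miss on first reading. The short check below recomputes them from the stated counts; it assumes that 3992 counts reports with material errors, which the abstract does not spell out.

```python
total_reports = 213_977
reports_with_errors = 20_759
material = 3_992  # assumed to count reports containing a material error

# Both published percentages are relative to all retrieved reports.
pct_with_errors = 100 * reports_with_errors / total_reports
pct_material_of_all = 100 * material / total_reports

# Share of the erroneous reports that were material (not stated in the abstract).
pct_material_of_errors = 100 * material / reports_with_errors

print(f"{pct_with_errors:.1f}% of reports contained errors")
print(f"{pct_material_of_all:.1f}% of reports contained material errors")
print(f"{pct_material_of_errors:.1f}% of erroneous reports were material")
```

    Read this way, roughly one in five erroneous reports carried an error judged capable of altering interpretation.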

  6. Using speech recognition to enhance the Tongue Drive System functionality in computer access.

    Science.gov (United States)

    Huo, Xueliang; Ghovanloo, Maysam

    2011-01-01

    Tongue Drive System (TDS) is a wireless tongue-operated assistive technology (AT), which can enable people with severe physical disabilities to access computers and drive powered wheelchairs using their volitional tongue movements. TDS offers six discrete commands, simultaneously available to the users, for pointing and typing as a substitute for mouse and keyboard in computer access, respectively. To enhance the TDS performance in typing, we have added a microphone, an audio codec, and a wireless audio link to its readily available 3-axial magnetic sensor array, and combined it with commercially available speech recognition software, Dragon NaturallySpeaking, which is regarded as one of the most efficient ways for text entry. Our preliminary evaluations indicate that the combined TDS and speech recognition technologies can provide end users with significantly higher performance than using each technology alone, particularly in completing tasks that require both pointing and text entry, such as web surfing.

  7. Independent Component Analysis and Time-Frequency Masking for Speech Recognition in Multitalker Conditions

    Directory of Open Access Journals (Sweden)

    Reinhold Orglmeister

    2010-01-01

    When a number of speakers are simultaneously active, for example in meetings or noisy public places, the sources of interest need to be separated from interfering speakers and from each other in order to be robustly recognized. Independent component analysis (ICA) has proven a valuable tool for this purpose. However, ICA outputs can still contain strong residual components of the interfering speakers whenever noise or reverberation is high. In such cases, nonlinear postprocessing can be applied to the ICA outputs, for the purpose of reducing remaining interferences. In order to improve robustness to the artefacts and loss of information caused by this process, recognition can be greatly enhanced by considering the processed speech feature vector as a random variable with time-varying uncertainty, rather than as deterministic. The aim of this paper is to show the potential to improve recognition of multiple overlapping speech signals through nonlinear postprocessing together with uncertainty-based decoding techniques.
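
    One common form of nonlinear postprocessing for separated signals is a binary time-frequency mask that keeps only the bins where the target output dominates. The sketch below illustrates that idea on magnitude spectrograms stored as nested lists; the function names, the 0 dB threshold, and the toy values are assumptions, and the record does not state which masking rule the authors used.

```python
import math


def binary_mask(target_mag, interferer_mag, threshold_db=0.0, eps=1e-12):
    """1.0 where the target exceeds the interferer by more than threshold_db."""
    mask = []
    for t_row, i_row in zip(target_mag, interferer_mag):
        mask.append([
            1.0 if 20 * math.log10((t + eps) / (i + eps)) > threshold_db else 0.0
            for t, i in zip(t_row, i_row)
        ])
    return mask


def apply_mask(mag, mask):
    """Element-wise product of a magnitude spectrogram and a mask."""
    return [[m * k for m, k in zip(m_row, k_row)] for m_row, k_row in zip(mag, mask)]


# Hypothetical magnitudes of two ICA outputs (rows: frequency, cols: time).
ica_out_1 = [[1.0, 0.1], [0.5, 2.0]]
ica_out_2 = [[0.5, 1.0], [0.5, 0.1]]
mask = binary_mask(ica_out_1, ica_out_2)
cleaned = apply_mask(ica_out_1, mask)
```

    The zeroed bins are exactly the information loss the abstract refers to, which is why the masked features are better treated as uncertain rather than exact at decoding time.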

  8. Superior Speech Acquisition and Robust Automatic Speech Recognition for Integrated Spacesuit Audio Systems, Phase I

    Data.gov (United States)

    National Aeronautics and Space Administration — Astronauts suffer from poor dexterity of their hands due to the clumsy spacesuit gloves during Extravehicular Activity (EVA) operations and NASA has had a widely...

  9. Superior Speech Acquisition and Robust Automatic Speech Recognition for Integrated Spacesuit Audio Systems, Phase II

    Data.gov (United States)

    National Aeronautics and Space Administration — Astronauts suffer from poor dexterity of their hands due to the clumsy spacesuit gloves during Extravehicular Activity (EVA) operations and NASA has had a widely...

  10. Searching for sources of variance in speech recognition: Young adults with normal hearing

    Science.gov (United States)

    Watson, Charles S.; Kidd, Gary R.

    2005-04-01

    In the present investigation, sensory-perceptual abilities of one thousand young adults with normal hearing are being evaluated with a range of auditory, visual, and cognitive measures. Four auditory measures were derived from factor-analytic analyses of previous studies with 18-20 speech and non-speech variables [G. R. Kidd et al., J. Acoust. Soc. Am. 108, 2641 (2000)]. Two measures of visual acuity are obtained to determine whether variation in sensory skills tends to exist primarily within or across sensory modalities. A working memory test, grade point average, and Scholastic Aptitude Test scores (Verbal and Quantitative) are also included. Preliminary multivariate analyses support previous studies of individual differences in auditory abilities [e.g., A. M. Surprenant and C. S. Watson, J. Acoust. Soc. Am. 110, 2085-2095 (2001)] which found that spectral and temporal resolving power obtained with pure tones and more complex unfamiliar stimuli have little or no correlation with measures of speech recognition under difficult listening conditions. The current findings show that visual acuity, working memory, and intellectual measures are also very poor predictors of speech recognition ability, supporting the independence of this processing skill. Remarkable performance by some exceptional listeners will be described. [Work supported by the Office of Naval Research, Award No. N000140310644.]

  11. Constructing Long Short-Term Memory based Deep Recurrent Neural Networks for Large Vocabulary Speech Recognition

    OpenAIRE

    Li, Xiangang; Wu, Xihong

    2014-01-01

    Long short-term memory (LSTM) based acoustic modeling methods have recently been shown to give state-of-the-art performance on some speech recognition tasks. To achieve a further performance improvement, in this research, deep extensions on LSTM are investigated considering that deep hierarchical model has turned out to be more efficient than a shallow one. Motivated by previous research on constructing deep recurrent neural networks (RNNs), alternative deep LSTM architectures are proposed an...
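
    The simplest way to make an LSTM "deep" is to stack layers so that each layer's hidden output becomes the next layer's input. The sketch below shows only that baseline, in plain Python with biases omitted for brevity; it is not one of the alternative architectures proposed in the paper, which the truncated abstract does not describe. The class and function names, layer sizes, and random initialization are assumptions.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(weights, vec):
    return [sum(w * v for w, v in zip(row, vec)) for row in weights]


class LSTMCell:
    """Single LSTM layer without peepholes or biases (simplified)."""

    def __init__(self, n_in, n_hidden, rng):
        def init():
            return [[rng.uniform(-0.1, 0.1) for _ in range(n_in + n_hidden)]
                    for _ in range(n_hidden)]
        self.n_hidden = n_hidden
        self.w_in, self.w_forget, self.w_out, self.w_cand = init(), init(), init(), init()

    def step(self, x, h, c):
        z = list(x) + list(h)  # concatenate input and previous hidden state
        i = [sigmoid(v) for v in matvec(self.w_in, z)]      # input gate
        f = [sigmoid(v) for v in matvec(self.w_forget, z)]  # forget gate
        o = [sigmoid(v) for v in matvec(self.w_out, z)]     # output gate
        g = [math.tanh(v) for v in matvec(self.w_cand, z)]  # candidate cell value
        c_new = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c, i, g)]
        h_new = [ok * math.tanh(ck) for ok, ck in zip(o, c_new)]
        return h_new, c_new


def run_stacked_lstm(cells, frames):
    """Feed each frame through the layers in order; return top-layer outputs."""
    states = [([0.0] * cell.n_hidden, [0.0] * cell.n_hidden) for cell in cells]
    outputs = []
    for frame in frames:
        layer_input = frame
        for k, cell in enumerate(cells):
            h, c = cell.step(layer_input, *states[k])
            states[k] = (h, c)
            layer_input = h  # hidden output feeds the next layer
        outputs.append(layer_input)
    return outputs


rng = random.Random(0)
stack = [LSTMCell(3, 4, rng), LSTMCell(4, 2, rng)]  # two stacked layers
frames = [[0.1, 0.2, 0.3], [0.0, -0.1, 0.4], [0.5, 0.5, 0.5]]  # toy acoustic frames
outputs = run_stacked_lstm(stack, frames)
```

    In an acoustic model the top-layer outputs would typically feed a softmax over phonetic states; that part, and training, are left out here.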

  12. Automatic recognition of the unconscious reactions from physiological signals

    NARCIS (Netherlands)

    Ivonin, L.; Chang, H.M.; Chen, W.; Rauterberg, G.W.M.; Holzinger, A.; Ziefle, M.; Hitz, M.; Debevc, M.

    2013-01-01

    While the research in affective computing has been exclusively dealing with the recognition of explicit affective and cognitive states, carefully designed psychological and neuroimaging studies indicated that a considerable part of human experiences is tied to a deeper level of a psyche and not

  13. Automatic recognition of context and stress to support knowledge workers

    NARCIS (Netherlands)

    Koldijk, S.J.

    2012-01-01

    Motivation – Developing a computer tool that improves well-being at work. Research approach – We collect unobtrusive sensor data and apply pattern recognition approaches to infer the context and stress level of the user. We will develop a coaching tool based upon this information and evaluate its

  14. Comparing auditory filter bandwidths, spectral ripple modulation detection, spectral ripple discrimination, and speech recognition: Normal and impaired hearing.

    Science.gov (United States)

    Davies-Venn, Evelyn; Nelson, Peggy; Souza, Pamela

    2015-07-01

    Some listeners with hearing loss show poor speech recognition scores in spite of using amplification that optimizes audibility. Beyond audibility, studies have suggested that suprathreshold abilities such as spectral and temporal processing may explain differences in amplified speech recognition scores. A variety of different methods has been used to measure spectral processing. However, the relationship between spectral processing and speech recognition is still inconclusive. This study evaluated the relationship between spectral processing and speech recognition in listeners with normal hearing and with hearing loss. Narrowband spectral resolution was assessed using auditory filter bandwidths estimated from simultaneous notched-noise masking. Broadband spectral processing was measured using the spectral ripple discrimination (SRD) task and the spectral ripple depth detection (SMD) task. Three different measures were used to assess unamplified and amplified speech recognition in quiet and noise. Stepwise multiple linear regression revealed that SMD at 2.0 cycles per octave (cpo) significantly predicted speech scores for amplified and unamplified speech in quiet and noise. Commonality analyses revealed that SMD at 2.0 cpo combined with SRD and equivalent rectangular bandwidth measures to explain most of the variance captured by the regression model. Results suggest that SMD and SRD may be promising clinical tools for diagnostic evaluation and predicting amplification outcomes.

  15. Comparing auditory filter bandwidths, spectral ripple modulation detection, spectral ripple discrimination, and speech recognition: Normal and impaired hearing

    Science.gov (United States)

    Davies-Venn, Evelyn; Nelson, Peggy; Souza, Pamela

    2015-01-01

    Some listeners with hearing loss show poor speech recognition scores in spite of using amplification that optimizes audibility. Beyond audibility, studies have suggested that suprathreshold abilities such as spectral and temporal processing may explain differences in amplified speech recognition scores. A variety of different methods has been used to measure spectral processing. However, the relationship between spectral processing and speech recognition is still inconclusive. This study evaluated the relationship between spectral processing and speech recognition in listeners with normal hearing and with hearing loss. Narrowband spectral resolution was assessed using auditory filter bandwidths estimated from simultaneous notched-noise masking. Broadband spectral processing was measured using the spectral ripple discrimination (SRD) task and the spectral ripple depth detection (SMD) task. Three different measures were used to assess unamplified and amplified speech recognition in quiet and noise. Stepwise multiple linear regression revealed that SMD at 2.0 cycles per octave (cpo) significantly predicted speech scores for amplified and unamplified speech in quiet and noise. Commonality analyses revealed that SMD at 2.0 cpo combined with SRD and equivalent rectangular bandwidth measures to explain most of the variance captured by the regression model. Results suggest that SMD and SRD may be promising clinical tools for diagnostic evaluation and predicting amplification outcomes. PMID:26233047

  16. Speech recognition training for enhancing written language generation by a traumatic brain injury survivor.

    Science.gov (United States)

    Manasse, N J; Hux, K; Rankin-Erickson, J L

    2000-11-01

    Impairments in motor functioning, language processing, and cognitive status may impact the written language performance of traumatic brain injury (TBI) survivors. One strategy to minimize the impact of these impairments is to use a speech recognition system. The purpose of this study was to explore the effect of mild dysarthria and mild cognitive-communication deficits secondary to TBI on a 19-year-old survivor's mastery and use of such a system, specifically Dragon NaturallySpeaking. Data included the percentage of the participant's words accurately perceived by the system over time, the participant's accuracy over time in using commands for navigation and error correction, and quantitative and qualitative changes in the participant's written texts generated with and without the use of the speech recognition system. Results showed that Dragon NaturallySpeaking was approximately 80% accurate in perceiving words spoken by the participant, and the participant quickly and easily mastered all navigation and error correction commands presented. Quantitatively, the participant produced a greater amount of text using traditional word processing and a standard keyboard than using the speech recognition system. Minimal qualitative differences appeared between writing samples. Discussion of factors that may have contributed to the obtained results and that may affect the generalization of the findings to other TBI survivors is provided.

  17. Automatic identification of otological drilling faults: an intelligent recognition algorithm.

    Science.gov (United States)

    Cao, Tianyang; Li, Xisheng; Gao, Zhiqiang; Feng, Guodong; Shen, Peng

    2010-06-01

    This article presents an intelligent recognition algorithm that can recognize milling states of the otological drill by fusing multi-sensor information. An otological drill was modified by the addition of sensors. The algorithm was designed according to features of the milling process and is composed of a characteristic curve, an adaptive filter and a rule base. The characteristic curve can weaken the impact of the unstable normal milling process and reserve the features of drilling faults. The adaptive filter is capable of suppressing interference in the characteristic curve by fusing multi-sensor information. The rule base can identify drilling faults through the filtering result data. The experiments were repeated on fresh porcine scapulas, including normal milling and two drilling faults. The algorithm has high rates of identification. This study shows that the intelligent recognition algorithm can identify drilling faults under interference conditions. (c) 2010 John Wiley & Sons, Ltd.

  18. Evaluation of Speech Recognition of Cochlear Implant Recipients Using Adaptive, Digital Remote Microphone Technology and a Speech Enhancement Sound Processing Algorithm.

    Science.gov (United States)

    Wolfe, Jace; Morais, Mila; Schafer, Erin; Agrawal, Smita; Koch, Dawn

    2015-05-01

    Cochlear implant recipients often experience difficulty with understanding speech in the presence of noise. Cochlear implant manufacturers have developed sound processing algorithms designed to improve speech recognition in noise, and research has shown these technologies to be effective. Remote microphone technology utilizing adaptive, digital wireless radio transmission has also been shown to provide significant improvement in speech recognition in noise. There are no studies examining the potential improvement in speech recognition in noise when these two technologies are used simultaneously. The goal of this study was to evaluate the potential benefits and limitations associated with the simultaneous use of a sound processing algorithm designed to improve performance in noise (Advanced Bionics ClearVoice) and a remote microphone system that incorporates adaptive, digital wireless radio transmission (Phonak Roger). A two-by-two repeated measures design was used to examine performance differences obtained without these technologies compared to the use of each technology separately as well as the simultaneous use of both technologies. Participants were eleven Advanced Bionics (AB) cochlear implant recipients, aged 11 to 68 years. AzBio sentence recognition was measured in quiet and in the presence of classroom noise ranging in level from 50 to 80 dBA in 5-dB steps. Performance was evaluated in four conditions: (1) No ClearVoice and no Roger, (2) ClearVoice enabled without the use of Roger, (3) ClearVoice disabled with Roger enabled, and (4) simultaneous use of ClearVoice and Roger. Speech recognition in quiet was better than speech recognition in noise for all conditions. Use of ClearVoice and Roger each provided significant improvement in speech recognition in noise. The best performance in noise was obtained with the simultaneous use of ClearVoice and Roger. ClearVoice and Roger technology each improves speech recognition in noise, particularly when used at the same time.

  19. Bi-channel Sensor Fusion for Automatic Sign Language Recognition

    DEFF Research Database (Denmark)

    Kim, Jonghwa; Wagner, Johannes; Rehm, Matthias

    2008-01-01

    In this paper, we investigate the mutual-complementary functionality of accelerometer (ACC) and electromyogram (EMG) for recognizing seven word-level sign vocabularies in German sign language (GSL). Results are discussed for the single channels and for feature-level fusion for the bichannel senso......-independent condition, where subjective differences do not allow for high recognition rates. Finally we discuss a problem of feature-level fusion caused by high disparity between accuracies of each single channel classification....

  20. HMM adaptation for child speech synthesis using ASR data

    CSIR Research Space (South Africa)

    Govender, N

    2015-11-01

    This paper reports on a feasibility study that was conducted to determine whether it is possible to synthesize good quality child voices using child speech data that was recorded for automatic speech recognition (ASR) purposes. A text-to-speech system...

  1. Recognizing Stress Using Semantics and Modulation of Speech and Gestures

    NARCIS (Netherlands)

    Lefter, I.; Burghouts, G.J.; Rothkrantz, L.J.M.

    2016-01-01

    This paper investigates how speech and gestures convey stress, and how they can be used for automatic stress recognition. As a first step, we look into how humans use speech and gestures to convey stress. In particular, for both speech and gestures, we distinguish between stress conveyed by the

  2. Application of image recognition-based automatic hyphae detection in fungal keratitis.

    Science.gov (United States)

    Wu, Xuelian; Tao, Yuan; Qiu, Qingchen; Wu, Xinyi

    2018-03-01

    The purpose of this study is to evaluate the accuracy of two methods in the diagnosis of fungal keratitis: automatic hyphae detection based on image recognition, and corneal smear. We evaluate the sensitivity and specificity of automatic hyphae detection based on image recognition in the diagnosis of fungal keratitis. We analyze the consistency of clinical symptoms and the density of hyphae, and perform quantification using automatic hyphae detection based on image recognition. In our study, 56 cases with fungal keratitis (single eye only) and 23 cases with bacterial keratitis were included. All cases underwent routine slit lamp biomicroscopy, corneal smear examination, microorganism culture and assessment of in vivo confocal microscopy images before starting medical treatment. We then recognized the hyphae in the in vivo confocal microscopy images using automatic hyphae detection based on image recognition, to evaluate its sensitivity and specificity and compare it with corneal smear. The next step was to use the density index to assess the severity of infection, then find its correlation with the patients' clinical symptoms and evaluate the consistency between them. The accuracy of this technology was superior to corneal smear examination (p ...). The sensitivity of automatic hyphae detection based on image recognition was 89.29%, and the specificity was 95.65%. The area under the ROC curve was 0.946. The correlation coefficient between the severity grading of fungal keratitis by automatic hyphae detection based on image recognition and the clinical grading is 0.87. Automatic hyphae detection based on image recognition showed high sensitivity and specificity and was able to identify fungal keratitis better than corneal smear examination. This technology has the advantages when compared with the conventional artificial identification of confocal
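
    The reported sensitivity and specificity can be related back to the case counts with a short calculation. The sketch below is illustrative only: the abstract gives 56 fungal and 23 bacterial cases but not the underlying detection counts, so the values 50 detected fungal cases and 22 correctly rejected bacterial cases are an inference (an assumption) chosen because they reproduce the stated percentages.

```python
def sensitivity(true_pos: int, false_neg: int) -> float:
    """Fraction of diseased cases that the test flags as positive."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg: int, false_pos: int) -> float:
    """Fraction of non-diseased cases that the test flags as negative."""
    return true_neg / (true_neg + false_pos)


# Assumed counts (not stated in the abstract): 50 of 56 fungal keratitis
# cases detected, 22 of 23 bacterial keratitis cases correctly rejected.
sens = sensitivity(true_pos=50, false_neg=6)
spec = specificity(true_neg=22, false_pos=1)

print(f"sensitivity = {sens:.2%}")
print(f"specificity = {spec:.2%}")
```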

  3. Suprasegmental lexical stress cues in visual speech can guide spoken-word recognition.

    Science.gov (United States)

    Jesse, Alexandra; McQueen, James M

    2014-01-01

    Visual cues to the individual segments of speech and to sentence prosody guide speech recognition. The present study tested whether visual suprasegmental cues to the stress patterns of words can also constrain recognition. Dutch listeners use acoustic suprasegmental cues to lexical stress (changes in duration, amplitude, and pitch) in spoken-word recognition. We asked here whether they can also use visual suprasegmental cues. In two categorization experiments, Dutch participants saw a speaker say fragments of word pairs that were segmentally identical but differed in their stress realization (e.g., 'ca-vi from cavia "guinea pig" vs. 'ka-vi from kaviaar "caviar"). Participants were able to distinguish between these pairs from seeing a speaker alone. Only the presence of primary stress in the fragment, not its absence, was informative. Participants were able to distinguish visually primary from secondary stress on first syllables, but only when the fragment-bearing target word carried phrase-level emphasis. Furthermore, participants distinguished fragments with primary stress on their second syllable from those with secondary stress on their first syllable (e.g., pro-'jec from projector "projector" vs. 'pro-jec from projectiel "projectile"), independently of phrase-level emphasis. Seeing a speaker thus contributes to spoken-word recognition by providing suprasegmental information about the presence of primary lexical stress.

  4. How does language model size effects speech recognition accuracy for the Turkish language?

    Directory of Open Access Journals (Sweden)

    Behnam ASEFİSARAY

    2016-05-01

    In this paper we aimed at investigating the effect of Language Model (LM) size on Speech Recognition (SR) accuracy. We also provided details of our approach for obtaining the LM for Turkish. Since the LM is obtained by statistical processing of raw text, we expect that by increasing the size of available data for training the LM, SR accuracy will improve. Since this study is based on recognition of Turkish, which is a highly agglutinative language, it is important to find out the appropriate size for the training data. The minimum required data size is expected to be much higher than the data needed to train a language model for a language with a low level of agglutination such as English. In the experiments we also tried to adjust the Language Model Weight (LMW) and Active Token Count (ATC) parameters of the LM, as these are expected to be different for a highly agglutinative language. We showed that by increasing the training data size to an appropriate level, the recognition accuracy improved; on the other hand, changes in LMW and ATC did not have a positive effect on Turkish speech recognition accuracy.
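
    To make "statistical processing of raw text" and the role of a language model weight concrete, here is a minimal sketch. It is not the authors' system: the bigram model, the add-one smoothing, the toy corpus, and the names train_bigram, bigram_logprob and combined_score are all assumptions for illustration. The log-linear combination of acoustic and LM scores is the conventional way a decoder applies such a weight.

```python
import math
from collections import Counter


def train_bigram(sentences):
    """Count contexts and bigrams from whitespace-tokenised raw text."""
    contexts, bigrams, vocab = Counter(), Counter(), {"</s>"}
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        vocab.update(tokens[1:])
        contexts.update(tokens[:-1])
        bigrams.update(zip(tokens[:-1], tokens[1:]))
    return contexts, bigrams, vocab


def bigram_logprob(sentence, model):
    """Log-probability of a sentence under add-one smoothing (assumed choice)."""
    contexts, bigrams, vocab = model
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    total = 0.0
    for prev, cur in zip(tokens[:-1], tokens[1:]):
        total += math.log((bigrams[(prev, cur)] + 1) / (contexts[prev] + len(vocab)))
    return total


def combined_score(acoustic_logprob, lm_logprob, lm_weight):
    """Log-linear decoder score: the LM weight scales the LM contribution."""
    return acoustic_logprob + lm_weight * lm_logprob


corpus = ["ev geldi", "ev gitti", "okul geldi"]  # toy text, not from the paper
model = train_bigram(corpus)
seen = bigram_logprob("ev geldi", model)     # word order observed in training
unseen = bigram_logprob("geldi ev", model)   # word order never observed
print(seen > unseen)
```

    More training text gives the counts more coverage, which is the mechanism by which a larger corpus is expected to help recognition.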

  5. Automatic Recognition of Chinese Personal Name Using Conditional Random Fields and Knowledge Base

    Directory of Open Access Journals (Sweden)

    Chuan Gu

    2015-01-01

    According to the features of Chinese personal names, we present in this paper an approach for Chinese personal name recognition based on conditional random fields (CRF) and a knowledge base. The method builds multiple features of the CRF model by adopting the Chinese character as the processing unit, selects useful features based on a selection algorithm of the knowledge base and an incremental feature template, and finally implements the automatic recognition of Chinese personal names from Chinese documents. The experimental results on an open real corpus demonstrated the effectiveness of our method, which obtained high accuracy and recall rates.

  6. Segmentation, Diarization and Speech Transcription: Surprise Data Unraveled

    NARCIS (Netherlands)

    Huijbregts, M.A.H.

    2008-01-01

    In this thesis, research on large vocabulary continuous speech recognition for unknown audio conditions is presented. For automatic speech recognition systems based on statistical methods, it is important that the conditions of the audio used for training the statistical models match the conditions

  7. Development of an automated speech recognition interface for personal emergency response systems

    Directory of Open Access Journals (Sweden)

    Mihailidis Alex

    2009-07-01

    Background: Demands on long-term-care facilities are predicted to increase at an unprecedented rate as the baby boomer generation reaches retirement age. Aging-in-place (i.e., aging at home) is the desire of most seniors and is also a good option to reduce the burden on an over-stretched long-term-care system. Personal Emergency Response Systems (PERSs) help enable older adults to age-in-place by providing them with immediate access to emergency assistance. Traditionally they operate with push-button activators that connect the occupant via speaker-phone to a live emergency call-centre operator. If occupants do not wear the push button or cannot access the button, then the system is useless in the event of a fall or emergency. Additionally, a false alarm or failure to check in at a regular interval will trigger a connection to a live operator, which can be unwanted and intrusive to the occupant. This paper describes the development and testing of an automated, hands-free, dialogue-based PERS prototype. Methods: The prototype system was built using a ceiling-mounted microphone array, an open-source automatic speech recognition engine, and a 'yes' and 'no' response dialog modelled after an existing call-centre protocol. Testing compared a single microphone versus a microphone array with nine adults in both noisy and quiet conditions. Dialogue testing was completed with four adults. Results and discussion: The microphone array demonstrated improvement over the single microphone. In all cases, dialog testing resulted in the system reaching the correct decision about the kind of assistance the user was requesting. Further testing is required with elderly voices and under different noise conditions to ensure the appropriateness of the technology. Future developments include integration of the system with an emergency detection method as well as communication enhancement using features such as barge-in capability. Conclusion: The use of an automated

  8. Automatic stimulation of experiments and learning based on prediction failure recognition

    NARCIS (Netherlands)

    Juarez Cordova, A.G.; Kahl, B.; Henne, T.; Prassler, E.

    2009-01-01

    In this paper we focus on the task of automatically and autonomously initiating experimentation and learning based on the recognition of prediction failure. We present a mechanism that utilizes conceptual knowledge to predict the outcome of robot actions, observes their execution and indicates when

  9. Modular Algorithm Testbed Suite (MATS): A Software Framework for Automatic Target Recognition

    Science.gov (United States)

    2017-01-01

    NAVAL SURFACE WARFARE CENTER PANAMA CITY DIVISION PANAMA CITY, FL 32407-7001 TECHNICAL REPORT NSWC PCD TR-2017-004 MODULAR ...31-01-2017 Technical Modular Algorithm Testbed Suite (MATS): A Software Framework for Automatic Target Recognition DR...flexible platform to facilitate the development and testing of ATR algorithms. To that end, NSWC PCD has created the Modular Algorithm Testbed Suite

  10. The effect of sensorineural hearing loss and tinnitus on speech recognition over air and bone conduction military communications headsets.

    Science.gov (United States)

    Manning, Candice; Mermagen, Timothy; Scharine, Angelique

    2017-06-01

    Military personnel are at risk for hearing loss due to noise exposure during deployment (USACHPPM, 2008). Despite mandated use of hearing protection, hearing loss and tinnitus are prevalent due to reluctance to use hearing protection. Bone conduction headsets can offer good speech intelligibility for normal hearing (NH) listeners while allowing the ears to remain open in quiet environments and the use of hearing protection when needed. Those who suffer from tinnitus, the experience of perceiving a sound not produced by an external source, often show degraded speech recognition; however, it is unclear whether this is a result of decreased hearing sensitivity or increased distractibility (Moon et al., 2015). It has been suggested that the vibratory stimulation of a bone conduction headset might ameliorate the effects of tinnitus on speech perception; however, there is currently no research to support or refute this claim (Hoare et al., 2014). Speech recognition of words presented over air conduction and bone conduction headsets was measured for three groups of listeners: NH, sensorineural hearing impaired, and/or tinnitus sufferers. Three levels of speech-to-noise (SNR = 0, -6, -12 dB) were created by embedding speech items in pink noise. Better speech recognition performance was observed with the bone conduction headset regardless of hearing profile, and speech intelligibility was a function of SNR. Discussion will include study limitations and the implications of these findings for those serving in the military. Published by Elsevier B.V.
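
    Embedding speech items in noise at a fixed SNR amounts to scaling the noise so that the ratio of speech RMS to noise RMS matches the target in decibels. The sketch below illustrates that scaling under stated assumptions: a sine wave stands in for a speech item, white Gaussian noise stands in for the pink noise used in the study, and the function names are hypothetical.

```python
import math
import random


def rms(signal):
    """Root-mean-square amplitude of a sequence of samples."""
    return math.sqrt(sum(v * v for v in signal) / len(signal))


def mix_at_snr(speech, noise, snr_db):
    """Scale the noise so that speech-to-noise RMS ratio equals snr_db, then add."""
    gain = rms(speech) / (rms(noise) * 10 ** (snr_db / 20))
    scaled_noise = [gain * n for n in noise]
    mixture = [s + n for s, n in zip(speech, scaled_noise)]
    return mixture, scaled_noise


def snr_db(speech, noise):
    """Measured SNR in dB from the two component signals."""
    return 20 * math.log10(rms(speech) / rms(noise))


random.seed(0)
# Stand-in signals (assumptions): a 440 Hz tone at 16 kHz and white noise.
speech = [math.sin(2 * math.pi * 440 * t / 16000) for t in range(16000)]
noise = [random.gauss(0.0, 1.0) for _ in range(16000)]

for target in (0, -6, -12):  # the three SNR levels named in the abstract
    _, scaled = mix_at_snr(speech, noise, target)
    print(target, round(snr_db(speech, scaled), 6))
```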

  11. Speech recognition and parent-ratings from auditory development questionnaires in children who are hard of hearing

    Science.gov (United States)

    McCreery, Ryan W.; Walker, Elizabeth A.; Spratford, Meredith; Oleson, Jacob; Bentler, Ruth; Holte, Lenore; Roush, Patricia

    2015-01-01

    Objectives Progress has been made in recent years in the provision of amplification and early intervention for children who are hard of hearing. However, children who use hearing aids (HA) may have inconsistent access to their auditory environment due to limitations in speech audibility through their HAs or limited HA use. The effects of variability in children’s auditory experience on parent-report auditory skills questionnaires and on speech recognition in quiet and in noise were examined for a large group of children who were followed as part of the Outcomes of Children with Hearing Loss study. Design Parent ratings on auditory development questionnaires and children’s speech recognition were assessed for 306 children who are hard of hearing. Children ranged in age from 12 months to 9 years of age. Three questionnaires involving parent ratings of auditory skill development and behavior were used, including the LittlEARS Auditory Questionnaire, Parents Evaluation of Oral/Aural Performance in Children Rating Scale, and an adaptation of the Speech, Spatial and Qualities of Hearing scale. Speech recognition in quiet was assessed using the Open and Closed set task, Early Speech Perception Test, Lexical Neighborhood Test, and Phonetically-balanced Kindergarten word lists. Speech recognition in noise was assessed using the Computer-Assisted Speech Perception Assessment. Children who are hard of hearing were compared to peers with normal hearing matched for age, maternal educational level and nonverbal intelligence. The effects of aided audibility, HA use and language ability on parent responses to auditory development questionnaires and on children’s speech recognition were also examined. Results Children who are hard of hearing had poorer performance than peers with normal hearing on parent ratings of auditory skills and had poorer speech recognition. Significant individual variability among children who are hard of hearing was observed. Children with greater

  12. Recognition of Speech of Normal-hearing Individuals with Tinnitus and Hyperacusis

    Directory of Open Access Journals (Sweden)

    Hennig, Tais Regina

    2011-01-01

    Introduction: Tinnitus and hyperacusis are increasingly frequent audiological symptoms that may occur in the absence of hearing involvement, but this does not lessen their impact on, or the bother they cause to, affected individuals. The Medial Olivocochlear System helps in speech recognition in noise and may be connected to the presence of tinnitus and hyperacusis. Objective: To evaluate the speech recognition of normal-hearing individuals with and without complaints of tinnitus and hyperacusis, and to compare their results. Method: Descriptive, prospective and cross-study in which 19 normal-hearing individuals with complaints of tinnitus and hyperacusis were evaluated as the Study Group (SG), and 23 normal-hearing individuals without audiological complaints as the Control Group (CG). The individuals of both groups were submitted to the test List of Sentences in Portuguese, prepared by Costa (1998), to determine the Sentences Recognition Threshold in Silence (LRSS) and the signal-to-noise ratio (S/N). The SG also answered the Tinnitus Handicap Inventory for tinnitus analysis, and discomfort thresholds were set to characterize hyperacusis. Results: The CG and SG presented with average LRSS and S/N ratio of 7.34 dB HL and -6.77 dB, and of 7.20 dB HL and -4.89 dB, respectively. Conclusion: The normal-hearing individuals with or without audiological complaints of tinnitus and hyperacusis had a similar performance in speech recognition in silence, which was not the case when evaluated in the presence of competitive noise, since the SG had a lower performance in this communication scenario, with a statistically significant difference.

  13. Prefixes versus suffixes: a search for a word-beginning superiority effect in word recognition from degraded speech

    NARCIS (Netherlands)

    Nooteboom, S.G.; Vlugt, van der M.J.

    1985-01-01

    This paper reports on a word recognition experiment in search of evidence for a word-beginning superiority effect in recognition from low-quality speech. In the experiment, lexical redundancy was controlled by combining monosyllable word stems with strongly constraining or weakly constraining

  14. Automatic music genres classification as a pattern recognition problem

    Science.gov (United States)

    Ul Haq, Ihtisham; Khan, Fauzia; Sharif, Sana; Shaukat, Arsalan

    2013-12-01

    Music genres are the simplest and most effective descriptors for searching music libraries, stores, or catalogues. The paper compares the results of two automatic music genre classification systems implemented using two different yet simple classifiers (K-Nearest Neighbor and Naïve Bayes). First a 10-12 second sample is selected and features are extracted from it; then, based on those features, the results of both classifiers are presented in the form of an accuracy table and a confusion matrix. An experiment carried out on 60 test samples indicates that a sample taken from the middle of a song represents the true essence of its genre better than samples taken from the beginning or ending of the song. The novel techniques have achieved an accuracy of 91% and 78% using the Naïve Bayes and KNN classifiers, respectively.
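
    Accuracy tables and confusion matrices like those mentioned here are both derived from paired true and predicted labels. The following sketch shows one way to compute them; the genre labels and predictions are made-up toy data, not results from the paper, and the function names are hypothetical.

```python
from collections import Counter


def confusion_matrix(true_labels, predicted_labels):
    """Map each (true, predicted) pair to the number of times it occurs."""
    return Counter(zip(true_labels, predicted_labels))


def accuracy(true_labels, predicted_labels):
    """Fraction of samples whose predicted label matches the true label."""
    hits = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return hits / len(true_labels)


# Toy data (assumption): six clips, three genres.
truth = ["rock", "rock", "jazz", "jazz", "pop", "pop"]
predicted = ["rock", "jazz", "jazz", "jazz", "pop", "rock"]

cm = confusion_matrix(truth, predicted)
acc = accuracy(truth, predicted)

genres = sorted(set(truth))
for t in genres:  # rows are true genres, columns are predicted genres
    print(t, [cm[(t, p)] for p in genres])
print(f"accuracy = {acc:.3f}")
```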

  15. Event-related potential evidence of form and meaning coding during online speech recognition.

    Science.gov (United States)

    Friedrich, Claudia K; Kotz, Sonja A

    2007-04-01

    It is still a matter of debate whether initial analysis of speech is independent of contextual influences or whether meaning can modulate word activation directly. Utilizing event-related brain potentials (ERPs), we tested the neural correlates of speech recognition by presenting sentences that ended with incomplete words, such as To light up the dark she needed her can-. Immediately following the incomplete words, subjects saw visual words that (i) matched form and meaning, such as candle; (ii) matched meaning but not form, such as lantern; (iii) matched form but not meaning, such as candy; or (iv) mismatched form and meaning, such as number. We report ERP evidence for two distinct cohorts of lexical tokens: (a) a left-lateralized effect, the P250, differentiates form-matching words (i, iii) and form-mismatching words (ii, iv); (b) a right-lateralized effect, the P220, differentiates words that match in form and/or meaning (i, ii, iii) from mismatching words (iv). Lastly, fully matching words (i) reduce the amplitude of the N400. These results accommodate bottom-up and top-down accounts of human speech recognition. They suggest that neural representations of form and meaning are activated independently early on and are integrated at a later stage during sentence comprehension.

  16. Objective Prediction of Hearing Aid Benefit Across Listener Groups Using Machine Learning: Speech Recognition Performance With Binaural Noise-Reduction Algorithms

    Science.gov (United States)

    Schädler, Marc R.; Warzybok, Anna; Kollmeier, Birger

    2018-01-01

    The simulation framework for auditory discrimination experiments (FADE) was adopted and validated to predict the individual speech-in-noise recognition performance of listeners with normal and impaired hearing with and without a given hearing-aid algorithm. FADE uses a simple automatic speech recognizer (ASR) to estimate the lowest achievable speech reception thresholds (SRTs) from simulated speech recognition experiments in an objective way, independent from any empirical reference data. Empirical data from the literature were used to evaluate the model in terms of predicted SRTs and benefits in SRT with the German matrix sentence recognition test when using eight single- and multichannel binaural noise-reduction algorithms. To allow individual predictions of SRTs in binaural conditions, the model was extended with a simple better ear approach and individualized by taking audiograms into account. In a realistic binaural cafeteria condition, FADE explained about 90% of the variance of the empirical SRTs for a group of normal-hearing listeners and predicted the corresponding benefits with a root-mean-square prediction error of 0.6 dB. This highlights the potential of the approach for the objective assessment of benefits in SRT without prior knowledge about the empirical data. The predictions for the group of listeners with impaired hearing explained 75% of the empirical variance, while the individual predictions explained less than 25%. Possibly, additional individual factors should be considered for more accurate predictions with impaired hearing. A competing talker condition clearly showed one limitation of current ASR technology, as the empirical performance with SRTs lower than −20 dB could not be predicted. PMID:29692200
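
    The two evaluation figures quoted in this abstract, explained variance and root-mean-square prediction error, can be computed from paired predicted and empirical SRTs. The sketch below is illustrative: the SRT values are invented toy numbers, and explained variance is taken as the squared Pearson correlation, which is one common definition and an assumption here since the abstract does not say which was used.

```python
import math


def rms_error(predicted, observed):
    """Root-mean-square difference between predictions and observations."""
    squared = [(p - o) ** 2 for p, o in zip(predicted, observed)]
    return math.sqrt(sum(squared) / len(squared))


def explained_variance(predicted, observed):
    """Squared Pearson correlation between predictions and observations."""
    mean_p = sum(predicted) / len(predicted)
    mean_o = sum(observed) / len(observed)
    cov = sum((p - mean_p) * (o - mean_o) for p, o in zip(predicted, observed))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_o = sum((o - mean_o) ** 2 for o in observed)
    return cov * cov / (var_p * var_o)


# Toy SRTs in dB (assumption, not data from the study).
observed_srt = [-7.0, -5.5, -3.0, 0.5]
predicted_srt = [-6.5, -5.0, -3.5, 1.0]

error = rms_error(predicted_srt, observed_srt)
r2 = explained_variance(predicted_srt, observed_srt)
print(f"RMS error = {error:.2f} dB, explained variance = {r2:.1%}")
```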

  18. Visual object recognition for automatic micropropagation of plants

    Science.gov (United States)

    Brendel, Thorsten; Schwanke, Joerg; Jensch, Peter F.

    1994-11-01

    Micropropagation of plants is done by cutting juvenile plants and placing the pieces into special container boxes with nutrient solution, where they can grow and be cut again several times. Producing large amounts of biomass requires performing plant micropropagation with a robotic system. In this paper we describe parts of the vision system that recognizes plants and their particular cutting points. To do so, it is necessary to extract elements of the plants (for example root, stem, leaf) and the relations between these elements. Different species vary in their morphological appearance, and variation is also inherent in plants of the same species. Therefore, we introduce several morphological classes of plants, for each of which we expect the same recognition methods to apply.

  19. Uav Visual Autolocalizaton Based on Automatic Landmark Recognition

    Science.gov (United States)

    Silva Filho, P.; Shiguemori, E. H.; Saotome, O.

    2017-08-01

    Deploying an autonomous unmanned aerial vehicle in GPS-denied areas is a highly discussed problem in the scientific community. There are several approaches being developed, but the main strategies considered so far are computer-vision-based navigation systems. This work presents a new real-time computer-vision position estimator for UAV navigation. The estimator uses images captured during flight to recognize specific, well-known landmarks in order to estimate the latitude and longitude of the aircraft. The method was tested in a simulated environment, using a dataset of real aerial images obtained in previous flights, with synchronized images, GPS and IMU data. The estimated position in each landmark recognition was compatible with the GPS data, indicating that the developed method can be used as an alternative navigation system.

  1. Deep Visual Attributes vs. Hand-Crafted Audio Features on Multidomain Speech Emotion Recognition

    Directory of Open Access Journals (Sweden)

    Michalis Papakostas

    2017-06-01

    Emotion recognition from speech may play a crucial role in many applications related to human–computer interaction or understanding the affective state of users in certain tasks, where other modalities such as video or physiological parameters are unavailable. In general, a human’s emotions may be recognized using several modalities, such as analyzing facial expressions, speech, or physiological parameters (e.g., electroencephalograms, electrocardiograms, etc.). However, measuring these modalities may be difficult, obtrusive or require expensive hardware. In that context, speech may be the best alternative modality in many practical applications. In this work we present an approach that uses a Convolutional Neural Network (CNN) functioning as a visual feature extractor and trained using raw speech information. In contrast to traditional machine learning approaches, CNNs are responsible for identifying the important features of the input, thus making hand-crafted feature engineering optional in many tasks. In this paper no extra features are required other than the spectrogram representations, and hand-crafted features were only extracted to validate our method. Moreover, the approach does not require any linguistic model and is not specific to any particular language. We compare the proposed approach using cross-language datasets and demonstrate that it is able to provide superior results vs. traditional ones that use hand-crafted features.

  2. Automatic shape recognition of a fast transient signal

    International Nuclear Information System (INIS)

    Charles, Gilbert.

    1976-01-01

    A system was developed to recognize whether the shape of a signal x(t) is similar (or identical) to that of an element yi(t) of an ensemble S composed of N known signals held in memory. x(t) is a time limited T 2 ) give the similarity measure of two signals. To solve the problem of digitally recording the signals x(t), two devices were built: a digital-to-analog converter which permits the recording of fast transient signals (band pass>1GHz, sampling-frequency approximately 100GHz, resolution: 9 bits, 576 samples); and an automatic attenuator which scales the signal x(t) before digitization (the band pass is 70MHz at -1dB). A theoretical analysis determines the required resolution of the digital-to-analog converter as a function of the signal characteristics and of the desired precision in the calculation of rho 2.

  3. Automatic target recognition performance losses in the presence of atmospheric and camera effects

    Science.gov (United States)

    Chen, Xiaohan; Schmid, Natalia A.

    2010-04-01

    The importance of networked automatic target recognition systems for surveillance applications is continuously increasing. Because of the requirement of a low cost and limited payload, these networks are traditionally equipped with lightweight, low-cost sensors such as electro-optical (EO) or infrared sensors. The quality of imagery acquired by these sensors critically depends on the environmental conditions, type and characteristics of sensors, and absence of occluding or concealing objects. In the past, a large number of efficient detection, tracking, and recognition algorithms have been designed to operate on imagery of good quality. However, detection and recognition limits under nonideal environmental and/or sensor-based distortions have not been carefully evaluated. We introduce a fully automatic target recognition system that involves a Haar-based detector to select potential regions of interest within images, performs adjustment of detected regions, segments potential targets using a region-based approach, identifies targets using Bessel K form-based encoding, and performs clutter rejection. We investigate the effects of environmental and camera conditions on target detection and recognition performance. Two databases are involved. One is a simulated database generated using a 3-D tool. The other database is formed by imaging 10 die-cast models of military vehicles from different elevation and orientation angles. The database contains imagery acquired both indoors and outdoors. The indoors data set is composed of clear and distorted images. The distortions include defocus blur, sided illumination, low contrast, shadows, and occlusions. All images in this database, however, have a uniform (blue) background. The indoors database is applied to evaluate the degradations of recognition performance due to camera and illumination effects. The database collected outdoors includes a real background and is much more complex to process. The numerical results

  4. Contribution to automatic image recognition applied to robot technology

    International Nuclear Information System (INIS)

    Juvin, Didier

    1983-01-01

    This paper describes a method for the analysis and interpretation of the images of objects located in a plain scene which is the environment of a robot. The first part covers the recovery of the contour of objects present in the image, and discusses a novel contour-following technique based on the line arborescence concept in combination with a 'cost function' giving a quantitative assessment of contour quality. We present heuristics for moderate-cost, minimum-time arborescence coverage, which is equivalent to following probable contour lines in the image. A contour segmentation technique, invariant in the translational and rotational modes, is presented next. The second part describes a recognition method based on the above invariant encoding: the algorithm performs a preliminary screening based on coarse data derived from segmentation, followed by a comparison of forms with probable identity through application of a distance specified in terms of the invariant encoding. The last part covers the outcome of the above investigations, which have found an industrial application in the vision system of a range of robots. The system is set up in a 16-bit microprocessor and operates in real time. (author)

  5. Long term Suboxone™ emotional reactivity as measured by automatic detection in speech.

    Directory of Open Access Journals (Sweden)

    Edward Hill

    Addictions to illicit drugs are among the nation's most critical public health and societal problems. The current opioid prescription epidemic and the need for buprenorphine/naloxone (Suboxone®; SUBX) as an opioid maintenance substance, and its growing street diversion, provided impetus to determine affective states ("true ground emotionality") in long-term SUBX patients. Toward the goal of effective monitoring, we utilized emotion-detection in speech as a measure of "true" emotionality in 36 SUBX patients compared to 44 individuals from the general population (GP) and 33 members of Alcoholics Anonymous (AA). Other less objective studies have investigated emotional reactivity of heroin, methadone and opioid abstinent patients. These studies indicate that current opioid users have abnormal emotional experience, characterized by heightened response to unpleasant stimuli and blunted response to pleasant stimuli. However, this is the first study to our knowledge to evaluate "true ground" emotionality in long-term buprenorphine/naloxone combination (Suboxone™). We found in long-term SUBX patients a significantly flat affect (p<0.01), and they had less self-awareness of being happy, sad, and anxious compared to both the GP and AA groups. We caution against definitive interpretation of these seemingly important results until we compare the emotional reactivity of an opioid abstinent control using automatic detection in speech. These findings encourage continued research strategies in SUBX patients to target the specific brain regions responsible for relapse prevention of opioid addiction.

  6. Exploiting Speech for Automatic TV Delinearization: From Streams to Cross-Media Semantic Navigation

    Directory of Open Access Journals (Sweden)

    Guinaudeau Camille

    2011-01-01

    The gradual migration of television from broadcast diffusion to Internet diffusion offers countless possibilities for the generation of rich navigable contents. However, it also raises numerous scientific issues regarding delinearization of TV streams and content enrichment. In this paper, we study how speech can be used at different levels of the delinearization process, using automatic speech transcription and natural language processing (NLP) for the segmentation and characterization of TV programs and for the generation of semantic hyperlinks in videos. Transcript-based video delinearization requires natural language processing techniques robust to transcription peculiarities, such as transcription errors, and to domain and genre differences. We therefore propose to modify classical NLP techniques, initially designed for regular texts, to improve their robustness in the context of TV delinearization. We demonstrate that the modified NLP techniques can efficiently handle various types of TV material and be exploited for program description, for topic segmentation, and for the generation of semantic hyperlinks between multimedia contents. We illustrate the concept of cross-media semantic navigation with a description of our news navigation demonstrator presented during the NEM Summit 2009.

  7. The Relationship between Binaural Benefit and Difference in Unilateral Speech Recognition Performance for Bilateral Cochlear Implant Users

    Science.gov (United States)

    Yoon, Yang-soo; Li, Yongxin; Kang, Hou-Yong; Fu, Qian-Jie

    2011-01-01

    Objective The full benefit of bilateral cochlear implants may depend on the unilateral performance with each device, the speech materials, processing ability of the user, and/or the listening environment. In this study, bilateral and unilateral speech performances were evaluated in terms of recognition of phonemes and sentences presented in quiet or in noise. Design Speech recognition was measured for unilateral left, unilateral right, and bilateral listening conditions; speech and noise were presented at 0° azimuth. The “binaural benefit” was defined as the difference between bilateral performance and unilateral performance with the better ear. Study Sample 9 adults with bilateral cochlear implants participated. Results On average, results showed a greater binaural benefit in noise than in quiet for all speech tests. More importantly, the binaural benefit was greater when unilateral performance was similar across ears. As the difference in unilateral performance between ears increased, the binaural advantage decreased; this functional relationship was observed across the different speech materials and noise levels even though there was substantial intra- and inter-subject variability. Conclusions The results indicate that subjects who show symmetry in speech recognition performance between implanted ears in general show a large binaural benefit. PMID:21696329
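    The definition of "binaural benefit" given in this abstract can be written out as a short sketch. It assumes scores where higher is better (for example percent correct); the function names and the asymmetry helper are hypothetical and are not taken from the study.

```python
def binaural_benefit(bilateral, left, right):
    """Bilateral score minus the better unilateral score.

    Assumes all three scores share one unit in which higher is better
    (e.g. percent correct); this convention is an assumption of the sketch.
    """
    return bilateral - max(left, right)


def ear_asymmetry(left, right):
    """Absolute difference in unilateral performance between the two ears."""
    return abs(left - right)
```

    With these two quantities per listener, the relationship reported above corresponds to the benefit shrinking as the asymmetry grows.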

  8. A Digital Liquid State Machine With Biologically Inspired Learning and Its Application to Speech Recognition.

    Science.gov (United States)

    Zhang, Yong; Li, Peng; Jin, Yingyezhe; Choe, Yoonsuck

    2015-11-01

    This paper presents a bioinspired digital liquid-state machine (LSM) for low-power very-large-scale-integration (VLSI)-based machine learning applications. To the best of the authors' knowledge, this is the first work that employs a bioinspired spike-based learning algorithm for the LSM. With the proposed online learning, the LSM extracts information from input patterns on the fly without needing intermediate data storage as required in offline learning methods such as ridge regression. The proposed learning rule is local such that each synaptic weight update is based only upon the firing activities of the corresponding presynaptic and postsynaptic neurons without incurring global communications across the neural network. Compared with the backpropagation-based learning, the locality of computation in the proposed approach lends itself to efficient parallel VLSI implementation. We use subsets of the TI46 speech corpus to benchmark the bioinspired digital LSM. To reduce the complexity of the spiking neural network model without performance degradation for speech recognition, we study the impacts of synaptic models on the fading memory of the reservoir and hence the network performance. Moreover, we examine the tradeoffs between synaptic weight resolution, reservoir size, and recognition performance and present techniques to further reduce the overhead of hardware implementation. Our simulation results show that in terms of isolated word recognition evaluated using the TI46 speech corpus, the proposed digital LSM rivals the state-of-the-art hidden Markov-model-based recognizer Sphinx-4 and outperforms all other reported recognizers including the ones that are based upon the LSM or neural networks.

  9. Automatic Recognition Method for Optical Measuring Instruments Based on Machine Vision

    Institute of Scientific and Technical Information of China (English)

    SONG Le; LIN Yuchi; HAO Liguo

    2008-01-01

    Based on a comprehensive study of various algorithms, the automatic recognition of traditional ocular optical measuring instruments is realized. Taking a universal tools microscope (UTM) lens view image as an example, a 2-layer automatic recognition model for data reading is established after adopting a series of pre-processing algorithms. This model is an optimal combination of the correlation-based template matching method and a concurrent back propagation (BP) neural network. Multiple complementary feature extraction is used in generating the eigenvectors of the concurrent network. In order to improve fault-tolerance capacity, rotation invariant features based on Zernike moments are extracted from digit characters and a 4-dimensional group of the outline features is also obtained. Moreover, the operating time and reading accuracy can be adjusted dynamically by setting the threshold value. The experimental result indicates that the newly developed algorithm has optimal recognition precision and working speed. The average reading ratio can achieve 97.23%. The recognition method can automatically obtain the results of optical measuring instruments rapidly and stably without modifying their original structure, which meets the application requirements.

  10. Aplikasi sistem pakar diagnosis penyakit ispa berbasis speech recognition menggunakan metode naive bayes classifier

    Directory of Open Access Journals (Sweden)

    Mariam Marlina

    2017-05-01

    Abstract: ISPA (Infeksi Saluran Pernafasan Akut; acute respiratory tract infection, ARI) is a respiratory disorder that can lead to a wide spectrum of illness, ranging from asymptomatic disease and mild infection to severe and deadly disease, due to environmental factors. Limited public knowledge of the symptoms and treatment of ARI is one of the factors behind the high death rate from ARI. Someone who complains of respiratory problems may therefore not simply have an ordinary respiratory complaint, because the person could have ARI. An expert system provided in the form of an application is needed to help a person diagnose ARI easily and quickly. By attempting to transfer human knowledge to a computer, an expert system can solve problems as an expert would. Therefore, the Application of Expert System for Diagnosing ARI Based on Speech Recognition Using the Naive Bayes Classifier Method can be used to diagnose ARI in a person based on the conversion of the user's detected voice input. With this application, the user in effect consults a doctor or expert who treats ARI. The application is built for Android using the Java programming language and a MySQL database. Keywords: expert system, speech recognition, ISPA, naive Bayes classifier method, Android.

  11. Effects of Age and Working Memory Capacity on Speech Recognition Performance in Noise Among Listeners With Normal Hearing.

    Science.gov (United States)

    Gordon-Salant, Sandra; Cole, Stacey Samuels

    2016-01-01

    This study aimed to determine if younger and older listeners with normal hearing who differ on working memory span perform differently on speech recognition tests in noise. Older adults typically exhibit poorer speech recognition scores in noise than younger adults, which is attributed primarily to poorer hearing sensitivity and more limited working memory capacity in older than younger adults. Previous studies typically tested older listeners with poorer hearing sensitivity and shorter working memory spans than younger listeners, making it difficult to discern the importance of working memory capacity on speech recognition. This investigation controlled for hearing sensitivity and compared speech recognition performance in noise by younger and older listeners who were subdivided into high and low working memory groups. Performance patterns were compared for different speech materials to assess whether or not the effect of working memory capacity varies with the demands of the specific speech test. The authors hypothesized that (1) normal-hearing listeners with low working memory span would exhibit poorer speech recognition performance in noise than those with high working memory span; (2) older listeners with normal hearing would show poorer speech recognition scores than younger listeners with normal hearing, when the two age groups were matched for working memory span; and (3) an interaction between age and working memory would be observed for speech materials that provide contextual cues. Twenty-eight older (61 to 75 years) and 25 younger (18 to 25 years) normal-hearing listeners were assigned to groups based on age and working memory status. Northwestern University Auditory Test No. 6 words and Institute of Electrical and Electronics Engineers sentences were presented in noise using an adaptive procedure to measure the signal-to-noise ratio corresponding to 50% correct performance. Cognitive ability was evaluated with two tests of working memory (Listening

  12. Multi-Stage System for Automatic Target Recognition

    Science.gov (United States)

    Chao, Tien-Hsin; Lu, Thomas T.; Ye, David; Edens, Weston; Johnson, Oliver

    2010-01-01

    A multi-stage automated target recognition (ATR) system has been designed to perform computer vision tasks with adequate proficiency in mimicking human vision. The system is able to detect, identify, and track targets of interest. Potential regions of interest (ROIs) are first identified by the detection stage using an Optimum Trade-off Maximum Average Correlation Height (OT-MACH) filter combined with a wavelet transform. False positives are then eliminated by the verification stage using feature extraction methods in conjunction with neural networks. Feature extraction transforms the ROIs using filtering and binning algorithms to create feature vectors. A feedforward back-propagation neural network (NN) is then trained to classify each feature vector and to remove false positives. A system parameter optimization process has been developed to adapt to various targets and datasets. The objective was to design an efficient computer vision system that can learn to detect multiple targets in large images with unknown backgrounds. Because the target size is small relative to the image size in this problem, there are many regions of the image that could potentially contain the target. A cursory analysis of every region can be computationally efficient, but may yield too many false positives. On the other hand, a detailed analysis of every region can yield better results, but may be computationally inefficient. The multi-stage ATR system was designed to achieve an optimal balance between accuracy and computational efficiency by incorporating both models. The detection stage first identifies potential ROIs where the target may be present by performing a fast Fourier domain OT-MACH filter-based correlation. Because the threshold for this stage is chosen with the goal of detecting all true positives, a number of false positives are also detected as ROIs. The verification stage then transforms the regions of interest into feature space, and eliminates false positives using an

  13. A Full-Body Layered Deformable Model for Automatic Model-Based Gait Recognition

    Science.gov (United States)

    Lu, Haiping; Plataniotis, Konstantinos N.; Venetsanopoulos, Anastasios N.

    2007-12-01

    This paper proposes a full-body layered deformable model (LDM) inspired by manually labeled silhouettes for automatic model-based gait recognition from part-level gait dynamics in monocular video sequences. The LDM is defined for the fronto-parallel gait with 22 parameters describing the human body part shapes (widths and lengths) and dynamics (positions and orientations). There are four layers in the LDM and the limbs are deformable. Algorithms for LDM-based human body pose recovery are then developed to estimate the LDM parameters from both manually labeled and automatically extracted silhouettes, where the automatic silhouette extraction is through a coarse-to-fine localization and extraction procedure. The estimated LDM parameters are used for model-based gait recognition by employing the dynamic time warping for matching and adopting the combination scheme in AdaBoost.M2. While the existing model-based gait recognition approaches focus primarily on the lower limbs, the estimated LDM parameters enable us to study full-body model-based gait recognition by utilizing the dynamics of the upper limbs, the shoulders and the head as well. In the experiments, the LDM-based gait recognition is tested on gait sequences with differences in shoe-type, surface, carrying condition and time. The results demonstrate that the recognition performance benefits from not only the lower limb dynamics, but also the dynamics of the upper limbs, the shoulders and the head. In addition, the LDM can serve as an analysis tool for studying factors affecting the gait under various conditions.
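    The dynamic time warping step mentioned above can be illustrated with the textbook algorithm. The sketch below is a generic DTW distance between two one-dimensional sequences, not the authors' implementation; in the paper each frame is described by LDM parameters rather than a single number, so the per-frame distance used here (absolute difference) is a simplifying assumption.

```python
def dtw_distance(a, b):
    """Classic dynamic time warping distance between two numeric sequences.

    cost[i][j] holds the minimal accumulated distance for aligning the
    first i items of `a` with the first j items of `b`.
    """
    n, m = len(a), len(b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])  # per-frame distance (assumed)
            cost[i][j] = local + min(
                cost[i - 1][j],      # a advances, b repeats its frame
                cost[i][j - 1],      # b advances, a repeats its frame
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]
```

    Because frames may be repeated during alignment, two gait cycles that differ only in walking speed can still obtain a small distance, which is the usual reason for choosing DTW over frame-by-frame comparison.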

  14. Robust audio-visual speech recognition under noisy audio-video conditions.

    Science.gov (United States)

    Stewart, Darryl; Seymour, Rowan; Pass, Adrian; Ming, Ji

    2014-02-01

    This paper presents the maximum weighted stream posterior (MWSP) model as a robust and efficient stream integration method for audio-visual speech recognition in environments, where the audio or video streams may be subjected to unknown and time-varying corruption. A significant advantage of MWSP is that it does not require any specific measurements of the signal in either stream to calculate appropriate stream weights during recognition, and as such it is modality-independent. This also means that MWSP complements and can be used alongside many of the other approaches that have been proposed in the literature for this problem. For evaluation we used the large XM2VTS database for speaker-independent audio-visual speech recognition. The extensive tests include both clean and corrupted utterances with corruption added in either/both the video and audio streams using a variety of types (e.g., MPEG-4 video compression) and levels of noise. The experiments show that this approach gives excellent performance in comparison to another well-known dynamic stream weighting approach and also compared to any fixed-weighted integration approach in both clean conditions or when noise is added to either stream. Furthermore, our experiments show that the MWSP approach dynamically selects suitable integration weights on a frame-by-frame basis according to the level of noise in the streams and also according to the naturally fluctuating relative reliability of the modalities even in clean conditions. The MWSP approach is shown to maintain robust recognition performance in all tested conditions, while requiring no prior knowledge about the type or level of noise.

  15. Morphological self-organizing feature map neural network with applications to automatic target recognition

    Science.gov (United States)

    Zhang, Shijun; Jing, Zhongliang; Li, Jianxun

    2005-01-01

    The rotation-invariant feature of the target is obtained using the multi-direction feature extraction property of the steerable filter. Combining the morphological top-hat transform with the self-organizing feature map neural network, the adaptive topological region is selected. Using the erosion operation, the topological region shrinkage is achieved. The steerable filter based morphological self-organizing feature map neural network is applied to automatic target recognition of binary standard patterns and real-world infrared sequence images. Compared with the Hamming network and morphological shared-weight networks, the proposed method achieves a higher correct recognition rate, more robust adaptability, quicker training, and better generalization.

  16. Neural Network Based Recognition of Signal Patterns in Application to Automatic Testing of Rails

    Directory of Open Access Journals (Sweden)

    Tomasz Ciszewski

    2006-01-01

    The paper describes the application of a neural network to the recognition of signal patterns in measurement data gathered by a railroad ultrasound testing car. Digital conversion of the measuring signal allows large quantities of data to be stored and processed. Smart, effective and automatic procedures for recognizing the obtained patterns on the basis of the measured signal amplitude are presented. The test covers only two pattern classes. In the authors’ opinion, given a sufficiently large quantity of training data, the presented method is applicable to a system that recognizes many classes.

  17. Speech Recognition in Real-Life Background Noise by Young and Middle-Aged Adults with Normal Hearing

    OpenAIRE

    Lee, Ji Young; Lee, Jin Tae; Heo, Hye Jeong; Choi, Chul-Hee; Choi, Seong Hee; Lee, Kyungjae

    2015-01-01

    Background and Objectives People usually converse in real-life background noise. They experience more difficulty understanding speech in noise than in a quiet environment. The present study investigated how speech recognition in real-life background noise is affected by the type of noise, signal-to-noise ratio (SNR), and age. Subjects and Methods Eighteen young adults and fifteen middle-aged adults with normal hearing participated in the present study. Three types of noise [subway noise, vacu...

  18. Cognition and speech-in-noise recognition: the role of proactive interference.

    Science.gov (United States)

    Ellis, Rachel J; Rönnberg, Jerker

    2014-01-01

    Complex working memory (WM) span tasks have been shown to predict speech-in-noise (SIN) recognition. Studies of complex WM span tasks suggest that, rather than indexing a single cognitive process, performance on such tasks may be governed by separate cognitive subprocesses embedded within WM. Previous research has suggested that one such subprocess indexed by WM tasks is proactive interference (PI), which refers to difficulties memorizing current information because of interference from previously stored long-term memory representations for similar information. The aim of the present study was to investigate phonological PI and to examine the relationship between PI (semantic and phonological) and SIN perception. A within-subjects experimental design was used. An opportunity sample of 24 young listeners with normal hearing was recruited. Measures of resistance to, and release from, semantic and phonological PI were calculated alongside the signal-to-noise ratio required to identify 50% of keywords correctly in a SIN recognition task. The data were analyzed using t-tests and correlations. Evidence of release from and resistance to semantic interference was observed. These measures correlated significantly with SIN recognition. Limited evidence of phonological PI was observed. The results show that capacity to resist semantic PI can be used to predict SIN recognition scores in young listeners with normal hearing. On the basis of these findings, future research will focus on investigating whether tests of PI can be used in the treatment and/or rehabilitation of hearing loss.

  19. Comparing models of the combined-stimulation advantage for speech recognition.

    Science.gov (United States)

    Micheyl, Christophe; Oxenham, Andrew J

    2012-05-01

    The "combined-stimulation advantage" refers to an improvement in speech recognition when cochlear-implant or vocoded stimulation is supplemented by low-frequency acoustic information. Previous studies have been interpreted as evidence for "super-additive" or "synergistic" effects in the combination of low-frequency and electric or vocoded speech information by human listeners. However, this conclusion was based on predictions of performance obtained using a suboptimal high-threshold model of information combination. The present study shows that a different model, based on Gaussian signal detection theory, can predict surprisingly large combined-stimulation advantages, even when performance with either information source alone is close to chance, without involving any synergistic interaction. A reanalysis of published data using this model reveals that previous results, which have been interpreted as evidence for super-additive effects in perception of combined speech stimuli, are actually consistent with a more parsimonious explanation, according to which the combined-stimulation advantage reflects an optimal combination of two independent sources of information. The present results do not rule out the possible existence of synergistic effects in combined stimulation; however, they emphasize the possibility that the combined-stimulation advantages observed in some studies can be explained simply by non-interactive combination of two information sources.
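
    The non-interactive combination described in this abstract can be illustrated with a small sketch. It is not the authors' model: it assumes the textbook equal-variance Gaussian case, in which an ideal observer combining two independent cues has sensitivity equal to the root sum of squares of the individual sensitivities, and it uses a two-alternative forced-choice task for simplicity. The function names and the example d' values are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def combined_dprime(d_a: float, d_b: float) -> float:
    """Sensitivity of an ideal observer combining two independent Gaussian cues."""
    return sqrt(d_a ** 2 + d_b ** 2)


def pc_2afc(d_prime: float) -> float:
    """Proportion correct in a two-alternative forced-choice task (assumed task)."""
    return NormalDist().cdf(d_prime / sqrt(2))


if __name__ == "__main__":
    # Hypothetical sensitivities for the low-frequency and vocoded cues.
    d_low, d_vocoded = 0.5, 0.5
    print("low-frequency alone:", pc_2afc(d_low))
    print("vocoded alone:      ", pc_2afc(d_vocoded))
    print("combined:           ", pc_2afc(combined_dprime(d_low, d_vocoded)))
```

    Under these assumptions the combined score exceeds either single-cue score without any synergistic term, which is the qualitative point the abstract makes.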

  20. Comparison of middle latency responses in presbycusis patients with two different speech recognition scores.

    Science.gov (United States)

    Kirkim, Gunay; Madanoglu, Nevma; Akdas, Ferda; Serbetcioglu, M Bulent

    2007-12-01

    The purpose of this study is to evaluate whether the middle latency responses (MLR) can be used for an objective differentiation of patients with presbycusis having relatively good (Group I) and relatively poor speech recognition scores (Group II). All the participants of these groups had high frequency down-sloping hearing loss with an average of 26-60 dB HL. Data were collected from two described study groups and a control group, using pure tone audiometry, monosyllabic phonetically balanced word and synthetic sentence identification, as well as MLR. The study groups were compared with the control group. When patients in Group I were compared with the control group, only ipsilateral Na latency of middle latency evoked response was statistically significant in the right ear whereas ipsilateral Na latency in the right ear, ipsilateral and contralateral Na latency in the left ear of the patients in Group II were statistically significant. Thus, as an objective complementary tool for the evaluation of the speech perception ability of the patients with presbycusis, Na latency of MLR may be used in combination with the speech discrimination tests.

  1. Deficits in audiovisual speech perception in normal aging emerge at the level of whole-word recognition.

    Science.gov (United States)

    Stevenson, Ryan A; Nelms, Caitlin E; Baum, Sarah H; Zurkovsky, Lilia; Barense, Morgan D; Newhouse, Paul A; Wallace, Mark T

    2015-01-01

    Over the next 2 decades, a dramatic shift in the demographics of society will take place, with a rapid growth in the population of older adults. One of the most common complaints with healthy aging is a decreased ability to successfully perceive speech, particularly in noisy environments. In such noisy environments, the presence of visual speech cues (i.e., lip movements) provides striking benefits for speech perception and comprehension, but previous research suggests that older adults gain less from such audiovisual integration than their younger peers. To determine at what processing level these behavioral differences arise in healthy-aging populations, we administered a speech-in-noise task to younger and older adults. We compared the perceptual benefits of having speech information available in both the auditory and visual modalities and examined both phoneme and whole-word recognition across varying levels of signal-to-noise ratio. For whole-word recognition, older adults relative to younger adults showed greater multisensory gains at intermediate SNRs but reduced benefit at low SNRs. By contrast, at the phoneme level both younger and older adults showed approximately equivalent increases in multisensory gain as signal-to-noise ratio decreased. Collectively, the results provide important insights into both the similarities and differences in how older and younger adults integrate auditory and visual speech cues in noisy environments and help explain some of the conflicting findings in previous studies of multisensory speech perception in healthy aging. These novel findings suggest that audiovisual processing is intact at more elementary levels of speech perception in healthy-aging populations and that deficits begin to emerge only at the more complex word-recognition level of speech signals.

  2. DEVELOPING VISUAL NOVEL GAME WITH SPEECH-RECOGNITION INTERACTIVITY TO ENHANCE STUDENTS’ MASTERY ON ENGLISH EXPRESSIONS

    Directory of Open Access Journals (Sweden)

    Elizabeth Anggraeni Amalo

    2017-11-01

    The teaching of English expressions has always been done through conversation samples in the form of written texts, audio recordings, and videos. In the meantime, the development of computer-aided learning technology has made autonomous language learning possible. Games, as one product of computer-aided learning technology, can serve as a medium for educational content such as language teaching and learning. The visual novel is considered a conversational game that is suitable to be combined with English-expressions material. Unlike other, click-based visual novel games, the visual novel game in this research implements speech recognition as the interaction trigger. Hence, this paper aims at elaborating how visual novel games are utilized to deliver English expressions with speech recognition commands for the interaction. This research used the Research and Development (R&D) method with an experimental design, using control and experimental groups to measure its effectiveness in enhancing students' mastery of English expressions. ANOVA was utilized to test for significant differences between the control and experimental groups. It is expected that the result of this development and experiment can benefit English teaching and learning, especially of English expressions.

  3. Health smart home for elders - a tool for automatic recognition of activities of daily living.

    Science.gov (United States)

    Le, Xuan Hoa Binh; Di Mascolo, Maria; Gouin, Alexia; Noury, Norbert

    2008-01-01

    Elders prefer to live in their own homes, but with aging comes the loss of autonomy and associated risks. In order to help them live longer in safe conditions, we need a tool to automatically detect their loss of autonomy by assessing the degree of performance of activities of daily living. This article presents an approach enabling the recognition of the activities of an elder living alone in a home equipped with noninvasive sensors.

  4. Face Prediction Model for an Automatic Age-invariant Face Recognition System

    OpenAIRE

    Yadav, Poonam

    2015-01-01

    Automated face recognition and identification software is becoming part of our daily life; it finds its abode not only with Facebook's auto photo tagging, Apple's iPhoto, Google's Picasa, Microsoft's Kinect, but also in the Homeland Security Department's dedicated biometric face detection systems. Most of these automatic face identification systems fail where the effects of aging come into...

  5. AUTOMATIC RECOGNITION OF FALLS IN GAIT-SLIP: A HARNESS LOAD CELL BASED CRITERION

    OpenAIRE

    Yang, Feng; Pai, Yi-Chung

    2011-01-01

    Over-head-harness systems, equipped with load cell sensors, are essential to the participants’ safety and to the outcome assessment in perturbation training. The purpose of this study was to first develop an automatic outcome recognition criterion among young adults for gait-slip training and then verify such criterion among older adults. Each of 39 young and 71 older subjects, all protected by safety harness, experienced 8 unannounced, repeated slips, while walking on a 7-m walkway. Each tri...

  6. Contribution to automatic handwritten characters recognition. Application to optical moving characters recognition

    International Nuclear Information System (INIS)

    Gokana, Denis

    1986-01-01

    This paper describes research on computer-aided vision relating to the design of a vision system which can recognize isolated handwritten characters written on a mobile support. We use a technique which consists in analyzing information contained in the contours of the polygon circumscribed to the character's shape. These contours are segmented and labelled to give a new set of features constituted by: right and left 'profiles', and topological and algebraic invariant properties. A new method of character recognition derived from this representation, based on a multilevel hierarchical technique, is then described. In the primary level, we use a fuzzy classification with a dynamic programming technique using 'profiles'. The other levels adjust the recognition by using topological and algebraic invariant properties. Several results are presented, and an accuracy of 99% was reached for handwritten numeral characters, thereby attesting the robustness of our algorithm.

  7. INTEGRATING MACHINE TRANSLATION AND SPEECH SYNTHESIS COMPONENT FOR ENGLISH TO DRAVIDIAN LANGUAGE SPEECH TO SPEECH TRANSLATION SYSTEM

    Directory of Open Access Journals (Sweden)

    J. SANGEETHA

    2015-02-01

    This paper provides an interface between the machine translation and speech synthesis systems for converting English speech to Tamil text in an English-to-Tamil speech-to-speech translation system. The speech translation system consists of three modules: automatic speech recognition, machine translation and text-to-speech synthesis. Many procedures for integrating speech recognition and machine translation have been proposed. However, the speech synthesis component has not yet been evaluated. In this paper, we focus on integration of machine translation and speech synthesis, and report a subjective evaluation to investigate the impact of speech synthesis, machine translation and the integration of machine translation and speech synthesis components. Here we implement a hybrid machine translation (a combination of rule-based and statistical machine translation) and a concatenative syllable-based speech synthesis technique. In order to retain the naturalness and intelligibility of synthesized speech, Auto Associative Neural Network (AANN) prosody prediction is used in this work. The results of this system investigation demonstrate that the naturalness and intelligibility of the synthesized speech are strongly influenced by the fluency and correctness of the translated text.

  8. Application of Business Process Management to drive the deployment of a speech recognition system in a healthcare organization.

    Science.gov (United States)

    González Sánchez, María José; Framiñán Torres, José Manuel; Parra Calderón, Carlos Luis; Del Río Ortega, Juan Antonio; Vigil Martín, Eduardo; Nieto Cervera, Jaime

    2008-01-01

    We present a methodology based on Business Process Management to guide the development of a speech recognition system in a hospital in Spain. The methodology eases the deployment of the system by 1) involving the clinical staff in the process, 2) providing the IT professionals with a description of the process and its requirements, 3) assessing advantages and disadvantages of the speech recognition system, as well as its impact on the organisation, and 4) helping to reorganise the healthcare process before implementing the new technology in order to identify how it can better contribute to the overall objective of the organisation.

  9. Two Methods of Automatic Evaluation of Speech Signal Enhancement Recorded in the Open-Air MRI Environment

    Science.gov (United States)

    Přibil, Jiří; Přibilová, Anna; Frollo, Ivan

    2017-12-01

    The paper focuses on two methods of evaluating the success of enhancement of speech signals recorded in an open-air magnetic resonance imager during phonation for 3D human vocal tract modeling. The first approach provides a comparison based on statistical analysis by ANOVA and hypothesis tests. The second method is based on classification by Gaussian mixture models (GMM). The performed experiments have confirmed that the proposed ANOVA and GMM classifiers for automatic evaluation of the speech quality are functional and produce results fully comparable with the standard evaluation based on the listening test method.

  10. Methods and Application of Phonetic Label Alignment in Speech Processing Tasks

    Directory of Open Access Journals (Sweden)

    M. Myslivec

    2000-12-01

    The paper deals with the problem of automatic phonetic segmentation of speech signals, namely for speech analysis and recognition purposes. Several methods and approaches are described and evaluated from the point of view of their accuracy. A complete instruction for creating an annotated database for training a Czech speech recognition system is provided together with the authors' own experience. The results of the work have found practical applications, for example, in developing a tool for semi-automatic speech segmentation, building a large-vocabulary phoneme-based speech recognition system and designing an aid for learning and practicing pronunciation of words or phrases in the native or a foreign language.

  11. Some factors underlying individual differences in speech recognition on PRESTO: a first report.

    Science.gov (United States)

    Tamati, Terrin N; Gilbert, Jaimie L; Pisoni, David B

    2013-01-01

    Previous studies investigating speech recognition in adverse listening conditions have found extensive variability among individual listeners. However, little is currently known about the core underlying factors that influence speech recognition abilities. To investigate sensory, perceptual, and neurocognitive differences between good and poor listeners on the Perceptually Robust English Sentence Test Open-set (PRESTO), a new high-variability sentence recognition test under adverse listening conditions. Participants who fell in the upper quartile (HiPRESTO listeners) or lower quartile (LoPRESTO listeners) on key word recognition on sentences from PRESTO in multitalker babble completed a battery of behavioral tasks and self-report questionnaires designed to investigate real-world hearing difficulties, indexical processing skills, and neurocognitive abilities. Young, normal-hearing adults (N = 40) from the Indiana University community participated in the current study. Participants' assessment of their own real-world hearing difficulties was measured with a self-report questionnaire on situational hearing and hearing health history. Indexical processing skills were assessed using a talker discrimination task, a gender discrimination task, and a forced-choice regional dialect categorization task. Neurocognitive abilities were measured with the Auditory Digit Span Forward (verbal short-term memory) and Digit Span Backward (verbal working memory) tests, the Stroop Color and Word Test (attention/inhibition), the WordFam word familiarity test (vocabulary size), the Behavioral Rating Inventory of Executive Function-Adult Version (BRIEF-A) self-report questionnaire on executive function, and two performance subtests of the Wechsler Abbreviated Scale of Intelligence (WASI) Performance Intelligence Quotient (IQ; nonverbal intelligence). Scores on self-report questionnaires and behavioral tasks were tallied and analyzed by listener group (HiPRESTO and LoPRESTO). The extreme

  12. Development of equally intelligible Telugu sentence-lists to test speech recognition in noise.

    Science.gov (United States)

    Tanniru, Kishore; Narne, Vijaya Kumar; Jain, Chandni; Konadath, Sreeraj; Singh, Niraj Kumar; Sreenivas, K J Ramadevi; K, Anusha

    2017-09-01

    To develop sentence lists in the Telugu language for the assessment of speech recognition threshold (SRT) in the presence of background noise through identification of the mean signal-to-noise ratio required to attain a 50% sentence recognition score (SRTn). This study was conducted in three phases. The first phase involved the selection and recording of Telugu sentences. In the second phase, 20 lists, each consisting of 10 sentences with equal intelligibility, were formulated using a numerical optimisation procedure. In the third phase, the SRTn of the developed lists was estimated using adaptive procedures on individuals with normal hearing. A total of 68 native Telugu speakers with normal hearing participated in the study. Of these, 18 (including the speakers) performed on various subjective measures in first phase, 20 performed on sentence/word recognition in noise for second phase and 30 participated in the list equivalency procedures in third phase. In all, 15 lists of comparable difficulty were formulated as test material. The mean SRTn across these lists corresponded to -2.74 (SD = 0.21). The developed sentence lists provided a valid and reliable tool to measure SRTn in Telugu native speakers.

  13. Automatic recognition of ship types from infrared images using superstructure moment invariants

    Science.gov (United States)

    Li, Heng; Wang, Xinyu

    2007-11-01

    Automatic object recognition is an active area of interest for military and commercial applications. In this paper, a system addressing autonomous recognition of ship types in infrared images is proposed. Firstly, a segmentation approach based on detection of salient features of the target with subsequent shadow removal is proposed, which is the basis of the subsequent object recognition. Considering that the differences between the shapes of various ships mainly lie in their superstructures, we then use superstructure moment functions invariant to translation, rotation and scale differences in input patterns and develop a robust algorithm for obtaining the ship superstructure. Subsequently a back-propagation neural network is used as a classifier in the recognition stage, and projection images of simulated three-dimensional ship models are used as the training sets. Our recognition model was implemented and experimentally validated using both simulated three-dimensional ship model images and real images derived from video of an AN/AAS-44V Forward Looking Infrared (FLIR) sensor.

  14. Comparative Study on Feature Selection and Fusion Schemes for Emotion Recognition from Speech

    Directory of Open Access Journals (Sweden)

    Santiago Planet

    2012-09-01

    The automatic analysis of speech to detect affective states may improve the way users interact with electronic devices. However, analysis at the acoustic level alone may not be enough to determine the emotion of a user in a realistic scenario. In this paper we analyzed the spontaneous speech recordings of the FAU Aibo Corpus at the acoustic and linguistic levels to extract two sets of features. The acoustic set was reduced by a greedy procedure selecting the most relevant features to optimize the learning stage. We compared two versions of this greedy selection algorithm by performing the search of the relevant features forwards and backwards. We experimented with three classification approaches: Naïve-Bayes, a support vector machine and a logistic model tree, and two fusion schemes: decision-level fusion, merging the hard-decisions of the acoustic and linguistic classifiers by means of a decision tree; and feature-level fusion, concatenating both sets of features before the learning stage. Despite the low performance achieved by the linguistic data, a dramatic improvement was achieved after its combination with the acoustic information, improving the results achieved by this second modality on its own. The results achieved by the classifiers using the parameters merged at feature level outperformed the classification results of the decision-level fusion scheme, despite the simplicity of the scheme. Moreover, the extremely reduced set of acoustic features obtained by the greedy forward search selection algorithm improved the results provided by the full set.
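
    A greedy forward search of the kind this abstract mentions can be sketched generically as follows. This is an illustration only, not the authors' procedure: the stopping rule (stop when no candidate improves the score) and the `score` callback, which stands in for something like cross-validated classifier accuracy, are assumptions, and the feature names in the usage note are hypothetical.

```python
from typing import Callable, FrozenSet, List, Sequence


def greedy_forward_selection(
    features: Sequence[str],
    score: Callable[[FrozenSet[str]], float],
) -> List[str]:
    """Add one feature at a time, keeping the addition that raises the score most."""
    selected: List[str] = []
    remaining = list(features)
    best = score(frozenset())  # baseline score with no features selected
    while remaining:
        # Score every one-feature extension of the current subset.
        scored = [(score(frozenset(selected + [f])), f) for f in remaining]
        top_score, top_feature = max(scored, key=lambda pair: pair[0])
        if top_score <= best:
            break  # no candidate improves the score, so stop
        selected.append(top_feature)
        remaining.remove(top_feature)
        best = top_score
    return selected
```

    A call such as `greedy_forward_selection(["pitch", "energy", "noise"], score)` returns the chosen features in the order they were added. The backward variant would start from the full set and remove one feature per step instead.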

  15. The Compensatory Effectiveness of Optical Character Recognition/Speech Synthesis on Reading Comprehension of Postsecondary Students with Learning Disabilities.

    Science.gov (United States)

    Higgins, Eleanor L.; Raskind, Marshall H.

    1997-01-01

    Thirty-seven college students with learning disabilities were given a reading comprehension task under the following conditions: (1) using an optical character recognition/speech synthesis system; (2) having the text read aloud by a human reader; or (3) reading silently without assistance. Findings indicated that the greater the disability, the…

  16. Investigating an Application of Speech-to-Text Recognition: A Study on Visual Attention and Learning Behaviour

    Science.gov (United States)

    Huang, Y-M.; Liu, C-J.; Shadiev, Rustam; Shen, M-H.; Hwang, W-Y.

    2015-01-01

    One major drawback of previous research on speech-to-text recognition (STR) is that most findings showing the effectiveness of STR for learning were based upon subjective evidence. Very few studies have used eye-tracking techniques to investigate visual attention of students on STR-generated text. Furthermore, not much attention was paid to…

  17. Financial and workflow analysis of radiology reporting processes in the planning phase of implementation of a speech recognition system

    Science.gov (United States)

    Whang, Tom; Ratib, Osman M.; Umamoto, Kathleen; Grant, Edward G.; McCoy, Michael J.

    2002-05-01

    The goal of this study is to determine the financial value and workflow improvements achievable by replacing traditional transcription services with a speech recognition system in a large, university hospital setting. Workflow metrics were measured at two hospitals, one of which exclusively uses a transcription service (UCLA Medical Center), and the other which exclusively uses speech recognition (West Los Angeles VA Hospital). Workflow metrics include time spent per report (the sum of time spent interpreting, dictating, reviewing, and editing), transcription turnaround, and total report turnaround. Compared to traditional transcription, speech recognition resulted in radiologists spending 13-32% more time per report, but it also resulted in reduction of report turnaround time by 22-62% and reduction of marginal cost per report by 94%. The model developed here helps justify the introduction of a speech recognition system by showing that the benefits of reduced operating costs and decreased turnaround time outweigh the cost of increased time spent per report. Whether the ultimate goal is to achieve a financial objective or to improve operational efficiency, it is important to conduct a thorough analysis of workflow before implementation.

  18. Computer-Mediated Input, Output and Feedback in the Development of L2 Word Recognition from Speech

    Science.gov (United States)

    Matthews, Joshua; Cheng, Junyu; O'Toole, John Mitchell

    2015-01-01

    This paper reports on the impact of computer-mediated input, output and feedback on the development of second language (L2) word recognition from speech (WRS). A quasi-experimental pre-test/treatment/post-test research design was used involving three intact tertiary level English as a Second Language (ESL) classes. Classes were either assigned to…

  19. The classification problem in machine learning: an overview with study cases in emotion recognition and music-speech differentiation

    OpenAIRE

    Rodríguez Cadavid, Santiago

    2015-01-01

    This work addresses the well-known classification problem in machine learning. The goal of this study is to introduce the reader to the methodological aspects of feature extraction, feature selection and classifier performance through simple and understandable theoretical aspects and two study cases. Finally, a very good classification performance was obtained for the emotion recognition from speech

  20. Unobtrusive multimodal emotion detection in adaptive interfaces: speech and facial expressions

    NARCIS (Netherlands)

    Truong, K.P.; Leeuwen, D.A. van; Neerincx, M.A.

    2007-01-01

    Two unobtrusive modalities for automatic emotion recognition are discussed: speech and facial expressions. First, an overview is given of emotion recognition studies based on a combination of speech and facial expressions. We will identify difficulties concerning data collection, data fusion, system

  1. Data Collection in Zooarchaeology: Incorporating Touch-Screen, Speech-Recognition, Barcodes, and GIS

    Directory of Open Access Journals (Sweden)

    W. Flint Dibble

    2015-12-01

    When recording observations on specimens, zooarchaeologists typically use a pen and paper or a keyboard. However, the use of awkward terms and identification codes when recording thousands of specimens makes such data entry prone to human transcription errors. Improving the quantity and quality of the zooarchaeological data we collect can lead to more robust results and new research avenues. This paper presents design tools for building a customized zooarchaeological database that leverages accessible and affordable 21st century technologies. Scholars interested in investing time in designing a custom database in common software (here, Microsoft Access) can take advantage of the affordable touch-screen, speech-recognition, and geographic information system (GIS) technologies described here. The efficiency that these approaches offer a research project far exceeds the time commitment a scholar must invest to deploy them.

  2. Comparison of Image Transform-Based Features for Visual Speech Recognition in Clean and Corrupted Videos

    Directory of Open Access Journals (Sweden)

    Seymour Rowan

    2008-01-01

    We present results of a study into the performance of a variety of different image transform-based feature types for speaker-independent visual speech recognition of isolated digits. This includes the first reported use of features extracted using a discrete curvelet transform. The study will show a comparison of some methods for selecting features of each feature type and show the relative benefits of both static and dynamic visual features. The performance of the features will be tested on both clean video data and also video data corrupted in a variety of ways to assess each feature type's robustness to potential real-world conditions. One of the test conditions involves a novel form of video corruption we call jitter which simulates camera and/or head movement during recording.

  3. Comparison of Image Transform-Based Features for Visual Speech Recognition in Clean and Corrupted Videos

    Directory of Open Access Journals (Sweden)

    Ji Ming

    2008-03-01

    We present results of a study into the performance of a variety of different image transform-based feature types for speaker-independent visual speech recognition of isolated digits. This includes the first reported use of features extracted using a discrete curvelet transform. The study will show a comparison of some methods for selecting features of each feature type and show the relative benefits of both static and dynamic visual features. The performance of the features will be tested on both clean video data and also video data corrupted in a variety of ways to assess each feature type's robustness to potential real-world conditions. One of the test conditions involves a novel form of video corruption we call jitter which simulates camera and/or head movement during recording.

  4. A freely-available authoring system for browser-based CALL with speech recognition

    Directory of Open Access Journals (Sweden)

    Myles O'Brien

    2017-06-01

    A system for authoring browser-based CALL material incorporating Google speech recognition has been developed and made freely available for download. The system provides a teacher with a simple way to set up CALL material, including an optional image, sound or video, which will elicit spoken (and/or typed) answers from the user and check them against a list of specified permitted answers, giving feedback with hints when necessary. The teacher needs no HTML or JavaScript expertise, just the facilities and ability to edit text files and upload to the Internet. The structure and functioning of the system are explained in detail, and some suggestions are given for practical use. Finally, some of its limitations are described.

  5. Speech recognition: impact on workflow and report availability; Spracherkennung: Auswirkung auf Workflow und Befundverfuegbarkeit

    Energy Technology Data Exchange (ETDEWEB)

    Glaser, C.; Trumm, C.; Nissen-Meyer, S.; Francke, M.; Kuettner, B.; Reiser, M. [Klinikum Grosshadern der Ludwig-Maximilians-Universitaet Muenchen (Germany). Institut fuer Klinische Radiologie

    2005-08-01

    With ongoing technical refinements speech recognition systems (SRS) are becoming an increasingly attractive alternative to traditional methods of preparing and transcribing medical reports. The two main components of any SRS are the acoustic model and the language model. Features of modern SRS with continuous speech recognition are macros with individually definable texts and report templates as well as the option to navigate in a text or to control SRS or RIS functions by speech recognition. The best benefit from SRS can be obtained if it is integrated into a RIS/RIS-PACS installation. Report availability and time efficiency of the reporting process (related to recognition rate, time expenditure for editing and correcting a report) are the principal determinants of the clinical performance of any SRS. For practical purposes the recognition rate is estimated by the error rate (unit "word"). Error rates range from 4 to 28%. Roughly 20% of them are errors in the vocabulary which may result in clinically relevant misinterpretation. It is thus mandatory to thoroughly correct any transcribed text as well as to continuously train and adapt the SRS vocabulary. The implementation of SRS dramatically improves report availability. This is most pronounced for CT and CR. However, the individual time expenditure for (SRS-based) reporting increased by 20-25% (CR) and according to literature data there is an increase by 30% for CT and MRI. The extent to which the transcription staff profits from SRS depends largely on its qualification. Online dictation implies a workload shift from the transcription staff to the reporting radiologist. [Translated from the German abstract] With advancing technical development, speech recognition systems (SRS) are becoming, especially against the background of the currently unavoidable cost reduction at unchanged quality of patient care, an increasingly attractive alternative to traditional report preparation. The 2...
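
    The word-level error rate this abstract refers to is commonly computed as the word edit distance between a reference text and the recognized text, divided by the number of reference words. The sketch below shows that common definition. The abstract does not say how the study computed its figures, so this is an assumption, and the function name is hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance (substitutions, deletions, insertions) per reference word."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words consumed so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from the hypothesis
            insertion = curr[j - 1] + 1  # extra word in the hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)
```

    Because insertions are counted, this rate can exceed 1.0 when the recognized text is much longer than the reference.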

  6. Thoracic lymph node station recognition on CT images based on automatic anatomy recognition with an optimal parent strategy

    Science.gov (United States)

    Xu, Guoping; Udupa, Jayaram K.; Tong, Yubing; Cao, Hanqiang; Odhner, Dewey; Torigian, Drew A.; Wu, Xingyu

    2018-03-01

    Many papers have been published on the detection and segmentation of lymph nodes from medical images. However, it is still a challenging problem owing to low contrast with surrounding soft tissues and the variations of lymph node size and shape on computed tomography (CT) images. This is particularly difficult on the low-dose CT of PET/CT acquisitions. In this study, we utilize our previous automatic anatomy recognition (AAR) framework to recognize the thoracic lymph node stations defined by the International Association for the Study of Lung Cancer (IASLC) lymph node map. The lymph node stations themselves are viewed as anatomic objects and are localized by using a one-shot method in the AAR framework. Two strategies have been taken in this paper for integration into the AAR framework. The first is to combine some lymph node stations into composite lymph node stations according to their geometrical nearness. The other is to find the optimal parent (organ or union of organs) as an anchor for each lymph node station based on the recognition error and thereby find an overall optimal hierarchy to arrange anchor organs and lymph node stations. Based on 28 contrast-enhanced thoracic CT image data sets for model building and 12 independent data sets for testing, our results show that thoracic lymph node stations can be localized within 2-3 voxels compared to the ground truth.

  7. Automatic modulation format recognition for the next generation optical communication networks using artificial neural networks

    Science.gov (United States)

    Guesmi, Latifa; Hraghi, Abir; Menif, Mourad

    2015-03-01

    A new technique for Automatic Modulation Format Recognition (AMFR) in next generation optical communication networks is presented. This technique uses an Artificial Neural Network (ANN) in conjunction with the features of Linear Optical Sampling (LOS) of the detected signal at high bit rates using direct detection or coherent detection. The use of the LOS method for this purpose is mainly driven by the increase of bit rates, which enables the measurement of eye diagrams. The efficiency of this technique is demonstrated under different transmission impairments such as chromatic dispersion (CD) in the range of -500 to 500 ps/nm, differential group delay (DGD) in the range of 0-15 ps and the optical signal-to-noise ratio (OSNR) in the range of 10-30 dB. The results of numerical simulation for various modulation formats demonstrate successful recognition from known bit rates with a high estimation accuracy, which exceeds 99.8%.

  8. Monitoring caustic injuries from emergency department databases using automatic keyword recognition software.

    Science.gov (United States)

    Vignally, P; Fondi, G; Taggi, F; Pitidis, A

    2011-03-31

    In Italy the European Union Injury Database reports the involvement of chemical products in 0.9% of home and leisure accidents. The Emergency Department registry on domestic accidents in Italy and the Poison Control Centres record that 90% of cases of exposure to toxic substances occur in the home. It is not rare for the effects of chemical agents to be observed in hospitals, with a high potential risk of damage: the rate of this cause of hospital admission is double the domestic injury average. The aim of this study was to monitor the effects of injuries caused by caustic agents in Italy using automatic free-text recognition in Emergency Department medical databases. We created a Stata software program to automatically identify caustic or corrosive injury cases using an agent-specific list of keywords. We focused attention on the procedure's sensitivity and specificity. Ten hospitals in six regions of Italy participated in the study. The program identified 112 cases of injury by caustic or corrosive agents. Checking the cases by quality controls (based on manual reading of ED reports), we assessed 99 cases as true positive, i.e. 88.4% of the patients were automatically recognized by the software as being affected by caustic substances (99% CI: 80.6%-96.2%), that is to say 0.59% (99% CI: 0.45%-0.76%) of the whole sample of home injuries, a value almost three times as high as that expected (p < 0.0001) from European codified information. False positives were 11.6% of the recognized cases (99% CI: 5.1%-21.5%). Our automatic procedure for caustic agent identification proved to have excellent product recognition capacity with an acceptable level of excess sensitivity. Contrary to our a priori hypothesis, the automatic recognition system provided a level of identification of agents possessing caustic effects that was significantly greater than was predictable on the basis of the values from current codifications reported in the European Database.
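
    The keyword-based screening and the reported proportions can be illustrated with a short sketch. The study used a Stata program with an agent-specific keyword list that the abstract does not reproduce; the Python code below swaps in a hypothetical English keyword list and hypothetical function names purely for illustration, and then rechecks the arithmetic behind the 88.4% and 11.6% figures from the counts given in the abstract (99 true positives among 112 flagged cases).

```python
import re

# Hypothetical keyword list; the study's agent-specific list is not
# given in the abstract (and the ED reports were presumably in Italian).
CAUSTIC_KEYWORDS = ("caustic", "corrosive", "bleach", "lye")


def mentions_caustic_agent(report_text, keywords=CAUSTIC_KEYWORDS):
    """Return True if any keyword occurs as a whole word in the free text."""
    text = report_text.lower()
    return any(re.search(r"\b" + re.escape(word) + r"\b", text)
               for word in keywords)


def percent(part, whole):
    """Share of `part` in `whole`, as a percentage rounded to one decimal."""
    return round(100 * part / whole, 1)


if __name__ == "__main__":
    print(mentions_caustic_agent("Accidental ingestion of bleach at home"))
    flagged, confirmed = 112, 99  # counts reported in the abstract
    print(percent(confirmed, flagged))            # confirmed share of flagged cases
    print(percent(flagged - confirmed, flagged))  # false-positive share
```

    Whole-word matching is a design choice of this sketch: it avoids flagging words that merely contain a keyword, at the cost of missing inflected or misspelled forms that a real keyword list would need to cover.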

  9. Assessment of hearing aid algorithms using a master hearing aid: the influence of hearing aid experience on the relationship between speech recognition and cognitive capacity.

    Science.gov (United States)

    Rählmann, Sebastian; Meis, Markus; Schulte, Michael; Kießling, Jürgen; Walger, Martin; Meister, Hartmut

    2017-04-27

    Model-based hearing aid development considers the assessment of speech recognition using a master hearing aid (MHA). It is known that aided speech recognition in noise is related to cognitive factors such as working memory capacity (WMC). This relationship might be mediated by hearing aid experience (HAE). The aim of this study was to examine the relationship of WMC and speech recognition with an MHA for listeners with different HAE. Using the MHA, unaided and aided 80% speech recognition thresholds in noise were determined. Individual WMC was assessed using the Verbal Learning and Memory Test (VLMT) and the Reading Span Test (RST). Forty-nine hearing aid users with mild to moderate sensorineural hearing loss were divided into three groups differing in HAE. Whereas unaided speech recognition did not show a significant relationship with WMC, a significant correlation could be observed between WMC and aided speech recognition. However, this only applied to listeners with HAE of up to approximately three years, and a consistent weakening of the correlation could be observed with more experience. Speech recognition scores obtained in acute experiments with an MHA are less influenced by individual cognitive capacity when experienced HA users are taken into account.

  10. Automatic recognition of conceptualization zones in scientific articles and two life science applications.

    Science.gov (United States)

    Liakata, Maria; Saha, Shyamasree; Dobnik, Simon; Batchelor, Colin; Rebholz-Schuhmann, Dietrich

    2012-04-01

    Scholarly biomedical publications report on the findings of a research investigation. Scientists use a well-established discourse structure to relate their work to the state of the art, express their own motivation and hypotheses and report on their methods, results and conclusions. In previous work, we have proposed ways to explicitly annotate the structure of scientific investigations in scholarly publications. Here we present the means to facilitate automatic access to the scientific discourse of articles by automating the recognition of 11 categories at the sentence level, which we call Core Scientific Concepts (CoreSCs). These include: Hypothesis, Motivation, Goal, Object, Background, Method, Experiment, Model, Observation, Result and Conclusion. CoreSCs provide the structure and context to all statements and relations within an article and their automatic recognition can greatly facilitate biomedical information extraction by characterizing the different types of facts, hypotheses and evidence available in a scientific publication. We have trained and compared machine learning classifiers (support vector machines and conditional random fields) on a corpus of 265 full articles in biochemistry and chemistry to automatically recognize CoreSCs. We have evaluated our automatic classifications against a manually annotated gold standard, and have achieved promising accuracies with 'Experiment', 'Background' and 'Model' being the categories with the highest F1-scores (76%, 62% and 53%, respectively). We have analysed the task of CoreSC annotation both from a sentence classification as well as sequence labelling perspective and we present a detailed feature evaluation. The most discriminative features are local sentence features such as unigrams, bigrams and grammatical dependencies while features encoding the document structure, such as section headings, also play an important role for some of the categories. 
    We discuss the usefulness of automatically generated Core...

  11. Particle swarm optimization based feature enhancement and feature selection for improved emotion recognition in speech and glottal signals.

    Science.gov (United States)

    Muthusamy, Hariharan; Polat, Kemal; Yaacob, Sazali

    2015-01-01

    In recent years, many research works have been published using speech-related features for speech emotion recognition; however, recent studies show that there is a strong correlation between emotional states and glottal features. In this work, Mel-frequency cepstral coefficients (MFCCs), linear predictive cepstral coefficients (LPCCs), perceptual linear predictive (PLP) features, gammatone filter outputs, timbral texture features, stationary wavelet transform based timbral texture features and relative wavelet packet energy and entropy features were extracted from the emotional speech (ES) signals and their glottal waveforms (GW). Particle swarm optimization based clustering (PSOC) and wrapper based particle swarm optimization (WPSO) were proposed to enhance the discerning ability of the features and to select the discriminating features, respectively. Three different emotional speech databases were utilized to gauge the proposed method. An extreme learning machine (ELM) was employed to classify the different types of emotions. Different experiments were conducted and the results show that the proposed method significantly improves the speech emotion recognition performance compared to previous works published in the literature.

  12. Semantic and phonetic enhancements for speech-in-noise recognition by native and non-native listeners.

    Science.gov (United States)

    Bradlow, Ann R; Alexander, Jennifer A

    2007-04-01

    Previous research has shown that speech recognition differences between native and proficient non-native listeners emerge under suboptimal conditions. Current evidence has suggested that the key deficit that underlies this disproportionate effect of unfavorable listening conditions for non-native listeners is their less effective use of compensatory information at higher levels of processing to recover from information loss at the phoneme identification level. The present study investigated whether this non-native disadvantage could be overcome if enhancements at various levels of processing were presented in combination. Native and non-native listeners were presented with English sentences in which the final word varied in predictability and which were produced in either plain or clear speech. Results showed that, relative to the low-predictability-plain-speech baseline condition, non-native listener final word recognition improved only when both semantic and acoustic enhancements were available (high-predictability-clear-speech). In contrast, the native listeners benefited from each source of enhancement separately and in combination. These results suggest that native and non-native listeners apply similar strategies for speech-in-noise perception: The crucial difference is in the signal clarity required for contextual information to be effective, rather than in an inability of non-native listeners to take advantage of this contextual information per se.

  13. On the Use of Evolutionary Algorithms to Improve the Robustness of Continuous Speech Recognition Systems in Adverse Conditions

    Directory of Open Access Journals (Sweden)

    Sid-Ahmed Selouani

    2003-07-01

    Limiting the decrease in performance due to acoustic environment changes remains a major challenge for continuous speech recognition (CSR) systems. We propose a novel approach which combines the Karhunen-Loève transform (KLT) in the mel-frequency domain with a genetic algorithm (GA) to enhance the data representing corrupted speech. The idea consists of projecting noisy speech parameters onto the space generated by the genetically optimized principal axis issued from the KLT. The enhanced parameters increase the recognition rate for highly interfering noise environments. The proposed hybrid technique, when included in the front-end of an HTK-based CSR system, outperforms the conventional recognition process in severe interfering car noise environments for a wide range of signal-to-noise ratios (SNRs) varying from 16 dB to −4 dB. We also showed the effectiveness of the KLT-GA method in recognizing speech subject to telephone channel degradations.

  14. Random Deep Belief Networks for Recognizing Emotions from Speech Signals.

    Science.gov (United States)

    Wen, Guihua; Li, Huihui; Huang, Jubing; Li, Danyang; Xun, Eryang

    2017-01-01

    Human emotions can now be recognized from speech signals using machine learning methods; however, these methods suffer from low recognition accuracy in real applications because they lack rich representation ability. Deep belief networks (DBN) can automatically discover multiple levels of representations in speech signals. To make full use of this advantage, this paper presents an ensemble of random deep belief networks (RDBN) method for speech emotion recognition. It first extracts the low-level features of the input speech signal and then uses them to construct many random subspaces. Each random subspace is then fed to a DBN to yield higher-level features, which serve as the input of a classifier that outputs an emotion label. All output emotion labels are then fused through majority voting to decide the final emotion label for the input speech signal. Experimental results on benchmark speech emotion databases show that RDBN has better accuracy than the compared methods for speech emotion recognition.
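
    The random-subspace and majority-voting steps described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, and a caller-supplied `classify` function stands in for the DBN-plus-classifier stage, which the abstract does not specify in detail.

```python
import random
from collections import Counter


def random_subspaces(n_features, n_subspaces, subspace_size, seed=0):
    """Draw index sets, each selecting a random subset of the feature vector."""
    rng = random.Random(seed)
    return [sorted(rng.sample(range(n_features), subspace_size))
            for _ in range(n_subspaces)]


def majority_vote(labels):
    """Return the most frequent label; ties go to the label seen first."""
    return Counter(labels).most_common(1)[0][0]


def ensemble_predict(features, subspaces, classify):
    """Classify each feature subset separately and fuse the labels by voting.

    `classify` is a stand-in (assumption) for the DBN and classifier
    applied to one random subspace in the paper.
    """
    votes = [classify([features[i] for i in indices]) for indices in subspaces]
    return majority_vote(votes)


if __name__ == "__main__":
    # Toy stand-in model: label by the mean of the selected features.
    def toy_classifier(subset):
        return "angry" if sum(subset) / len(subset) > 0.5 else "neutral"

    low_level_features = [0.9, 0.8, 0.1, 0.7, 0.95, 0.6]
    spaces = random_subspaces(len(low_level_features), n_subspaces=5,
                              subspace_size=3)
    print(ensemble_predict(low_level_features, spaces, toy_classifier))
```

    Using an odd number of subspaces reduces the chance of tied votes in two-class settings; how the paper resolves ties is not stated in the abstract.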

  15. Lexical-Access Ability and Cognitive Predictors of Speech Recognition in Noise in Adult Cochlear Implant Users.

    Science.gov (United States)

    Kaandorp, Marre W; Smits, Cas; Merkus, Paul; Festen, Joost M; Goverts, S Theo

    2017-01-01

    Not all of the variance in speech-recognition performance of cochlear implant (CI) users can be explained by biographic and auditory factors. In normal-hearing listeners, linguistic and cognitive factors determine most of speech-in-noise performance. The current study explored specifically the influence of visually measured lexical-access ability compared with other cognitive factors on speech recognition of 24 postlingually deafened CI users. Speech-recognition performance was measured with monosyllables in quiet (consonant-vowel-consonant [CVC]), sentences-in-noise (SIN), and digit-triplets in noise (DIN). In addition to a composite variable of lexical-access ability (LA), measured with a lexical-decision test (LDT) and word-naming task, vocabulary size, working-memory capacity (Reading Span test [RSpan]), and a visual analogue of the SIN test (text reception threshold test) were measured. The DIN test was used to correct for auditory factors in SIN thresholds by taking the difference between SIN and DIN: SRTdiff. Correlation analyses revealed that duration of hearing loss (dHL) was related to SIN thresholds. Better working-memory capacity was related to SIN and SRTdiff scores. LDT reaction time was positively correlated with SRTdiff scores. No significant relationships were found for CVC or DIN scores with the predictor variables. Regression analyses showed that together with dHL, RSpan explained 55% of the variance in SIN thresholds. When controlling for auditory performance, LA, LDT, and RSpan separately explained, together with dHL, respectively 37%, 36%, and 46% of the variance in SRTdiff outcome. The results suggest that poor verbal working-memory capacity and to a lesser extent poor lexical-access ability limit speech-recognition ability in listeners with a CI.

  16. Capitalising on North American speech resources for the development of a South African English large vocabulary speech recognition system

    CSIR Research Space (South Africa)

    Kamper, H

    2014-11-01

    Full Text Available -West University, Vanderbijlpark, South Africa 2Human Language Technologies Research Group, Meraka Institute, CSIR, Pretoria, South Africa {etienne.barnard, marelie.davel, cvheerden}@gmail.com, {fdwet, jbadenhorst}@csir.co.za Abstract The NCHLT speech...

  17. Neural-network classifiers for automatic real-world aerial image recognition

    Science.gov (United States)

    Greenberg, Shlomo; Guterman, Hugo

    1996-08-01

    We describe the application of the multilayer perceptron (MLP) network and a version of the adaptive resonance theory version 2-A (ART 2-A) network to the problem of automatic aerial image recognition (AAIR). The classification of aerial images, independent of their positions and orientations, is required for automatic tracking and target recognition. Invariance is achieved by the use of different invariant feature spaces in combination with supervised and unsupervised neural networks. The performance of neural-network-based classifiers in conjunction with several types of invariant AAIR global features, such as the Fourier-transform space, Zernike moments, central moments, and polar transforms, is examined. The advantages of this approach are discussed. The performance of the MLP network is compared with that of a classical correlator. The MLP neural-network correlator outperformed the binary phase-only filter (BPOF) correlator. It was found that the ART 2-A distinguished itself with its speed and its low number of required training vectors. However, only the MLP classifier was able to deal with a combination of shift and rotation geometric distortions.

  18. A Knowledge Base for Automatic Feature Recognition from Point Clouds in an Urban Scene

    Directory of Open Access Journals (Sweden)

    Xu-Feng Xing

    2018-01-01

    LiDAR technology can provide very detailed and highly accurate geospatial information on an urban scene for the creation of Virtual Geographic Environments (VGEs) for different applications. However, automatic 3D modeling and feature recognition from LiDAR point clouds are very complex tasks. This becomes even more complex when the data is incomplete (occlusion problem) or uncertain. In this paper, we propose to build a knowledge base comprising ontology and semantic rules aiming at automatic feature recognition from point clouds in support of 3D modeling. First, several modules for ontology are defined from different perspectives to describe an urban scene. For instance, the spatial relations module allows the formalized representation of possible topological relations extracted from point clouds. Then, a knowledge base is proposed that contains different concepts, their properties and their relations, together with constraints and semantic rules. Then, instances and their specific relations form an urban scene and are added to the knowledge base as facts. Based on the knowledge and semantic rules, a reasoning process is carried out to extract semantic features of the objects and their components in the urban scene. Finally, several experiments are presented to show the validity of our approach to recognize different semantic features of buildings from LiDAR point clouds.

  19. Modern prescription theory and application: realistic expectations for speech recognition with hearing aids.

    Science.gov (United States)

    Johnson, Earl E

    2013-01-01

    A major decision at the time of hearing aid fitting and dispensing is the amount of amplification to provide listeners (both adult and pediatric populations) for the appropriate compensation of sensorineural hearing impairment across a range of frequencies (e.g., 160-10000 Hz) and input levels (e.g., 50-75 dB sound pressure level). This article describes modern prescription theory for hearing aids within the context of a risk versus return trade-off and efficient frontier analyses. The expected return of amplification recommendations (i.e., generic prescriptions such as National Acoustic Laboratories-Non-Linear 2, NAL-NL2, and Desired Sensation Level Multiple Input/Output, DSL m[i/o]) for the Speech Intelligibility Index (SII) and high-frequency audibility were traded against a potential risk (i.e., loudness). The modeled performance of each prescription was compared one with another and with the efficient frontier of normal hearing sensitivity (i.e., a reference point for the most return with the least risk). For the pediatric population, NAL-NL2 was more efficient for SII, while DSL m[i/o] was more efficient for high-frequency audibility. For the adult population, NAL-NL2 was more efficient for SII, while the two prescriptions were similar with regard to high-frequency audibility. In terms of absolute return (i.e., not considering the risk of loudness), however, DSL m[i/o] prescribed more outright high-frequency audibility than NAL-NL2 for either aged population, particularly, as hearing loss increased. Given the principles and demonstrated accuracy of desensitization (reduced utility of audibility with increasing hearing loss) observed at the group level, additional high-frequency audibility beyond that of NAL-NL2 is not expected to make further contributions to speech intelligibility (recognition) for the average listener.

  20. Development and Evaluation of a Speech Recognition Test for Persian Speaking Adults

    Directory of Open Access Journals (Sweden)

    Mohammad Mosleh

    2001-05-01

    Method and Materials: This research was carried out for the development and evaluation of 25 phonemically balanced word lists for Persian-speaking adults in two separate stages: development and evaluation. In the first stage, in order to balance the lists phonemically, the frequency of occurrence of each of the 29 phonemes (6 vowels and 23 consonants) of the Persian language in adult speech was determined. This stage showed significant differences between the frequencies of some phonemes. Then, all Persian monosyllabic words were extracted from the Mo'in Persian dictionary. The semantically difficult words were rejected and the appropriate words were chosen according to the judgment of 5 adult native speakers of Persian with a high school diploma. Twelve open-set 25-word lists were prepared. The lists were recorded on magnetic tape in an audio studio by a professional speaker of IRIB. In the second stage, in order to evaluate the test's validity and reliability, 60 normal-hearing adults (30 male, 30 female) were randomly selected and evaluated by test and retest. Findings: 1- Normal-hearing adults obtained scores of 92-100 for each list at their MCL through test-retest. 2- No significant difference was observed a/ in test-retest scores in each list (P>0.05), b/ between the lists at test or retest scores (P>0.05), c/ between sexes (P>0.05). Conclusion: The test is reliable and valid; the lists are phonemically balanced, equal in difficulty, and valuable for evaluating the speech recognition of Persian-speaking adults.

  1. Automatic recognition of cardiac arrhythmias based on the geometric patterns of Poincaré plots

    International Nuclear Information System (INIS)

    Zhang, Lijuan; Guo, Tianci; Xi, Bin; Fan, Yang; Wang, Kun; Bi, Jiacheng; Wang, Ying

    2015-01-01

    The Poincaré plot has emerged as an effective tool for assessing cardiovascular autonomic regulation. It displays nonlinear characteristics of heart rate variability (HRV) from electrocardiographic (ECG) recordings and gives a global view of long ECG signals. In a telemedicine or computer-aided diagnosis system, it would offer significant auxiliary information for diagnosis if the patterns of the Poincaré plots could be automatically classified. Therefore, we developed an automatic classification system to distinguish five geometric patterns of the Poincaré plots from four types of cardiac arrhythmias. Statistical features are designed on measurements and an ensemble classifier of three types of neural networks is proposed. To address the difficulty of setting a proper threshold for classifying multiple categories, the threshold selection strategy is analyzed. 24 h ECG monitoring recordings from 674 patients, who have four types of cardiac arrhythmias, are adopted for recognition. For comparison, Support Vector Machine (SVM) classifiers with linear and Gaussian kernels are also applied. The experimental results demonstrate the effectiveness of the extracted features and the better performance of the designed classifier. Our study can be applied to diagnose the corresponding sinus rhythm and arrhythmia substrates disease automatically in telemedicine and computer-aided diagnosis systems.
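
    For readers unfamiliar with the Poincaré plot, the sketch below shows how its points are formed: each RR interval is plotted against the next one. The SD1 and SD2 descriptors computed here are standard summary measures of such a plot and are included as an assumption for illustration; the abstract does not say which statistical features the authors designed. Function names and the example intervals are hypothetical.

```python
import math
import statistics


def poincare_points(rr_intervals_ms):
    """Pair each RR interval with its successor: (RR_n, RR_{n+1})."""
    return list(zip(rr_intervals_ms[:-1], rr_intervals_ms[1:]))


def sd1_sd2(rr_intervals_ms):
    """Standard deviations across (SD1) and along (SD2) the identity line."""
    points = poincare_points(rr_intervals_ms)
    if len(points) < 2:
        raise ValueError("need at least three RR intervals")
    # Rotate each point by 45 degrees: one axis is perpendicular to the
    # line RR_{n+1} = RR_n, the other runs along it.
    across = [(nxt - cur) / math.sqrt(2) for cur, nxt in points]
    along = [(nxt + cur) / math.sqrt(2) for cur, nxt in points]
    return statistics.stdev(across), statistics.stdev(along)


if __name__ == "__main__":
    rr = [812, 790, 845, 801, 830, 795, 860]  # made-up RR intervals in ms
    sd1, sd2 = sd1_sd2(rr)
    print(f"SD1 = {sd1:.1f} ms, SD2 = {sd2:.1f} ms")
```

    SD1 reflects beat-to-beat variability and SD2 longer-term variability, so the two together give a coarse numerical summary of the shape that a classifier such as the one described above would work from.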

  2. Automatic recognition of falls in gait-slip training: Harness load cell based criteria.

    Science.gov (United States)

    Yang, Feng; Pai, Yi-Chung

    2011-08-11

    Over-head-harness systems, equipped with load cell sensors, are essential to the participants' safety and to the outcome assessment in perturbation training. The purpose of this study was to first develop an automatic outcome recognition criterion among young adults for gait-slip training and then verify such criterion among older adults. Each of 39 young and 71 older subjects, all protected by safety harness, experienced 8 unannounced, repeated slips, while walking on a 7 m walkway. Each trial was monitored with a motion capture system, bilateral ground reaction force (GRF), harness force, and video recording. The fall trials were first unambiguously identified with careful visual inspection of all video records. The recoveries without balance loss (in which subjects' trailing foot landed anteriorly to the slipping foot) were also first fully recognized from motion and GRF analyses. These analyses then set the gold standard for the outcome recognition with load cell measurements. Logistic regression analyses based on young subjects' data revealed that the peak load cell force was the best predictor of falls (with 100% accuracy) at the threshold of 30% body weight. On the other hand, the peak moving average force of the load cell across a 1 s period was the best predictor (with 100% accuracy) separating recoveries with backward balance loss (in which the recovery step landed posterior to the slipping foot) from harness assistance at the threshold of 4.5% body weight. These threshold values were fully verified using the data from older adults (100% accuracy in recognizing falls). Because of the increasing popularity of perturbation training coupled with the protective over-head-harness system, this new criterion could have far-reaching implications in automatic outcome recognition during movement therapy. Copyright © 2011 Elsevier Ltd. All rights reserved.
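
    The two thresholds reported above lend themselves to a simple rule-based classifier. The following is a minimal sketch under stated assumptions rather than the authors' procedure: it assumes that a trial counts as a fall when the peak harness force exceeds 30% of body weight, and that among the remaining trials a peak 1 s moving average above 4.5% of body weight indicates harness assistance. The function name, the sampling rate, and the example force traces are hypothetical.

```python
def classify_trial(harness_force_n, body_weight_n, sample_rate_hz,
                   fall_threshold=0.30, assist_threshold=0.045,
                   window_s=1.0):
    """Label one slip trial from its harness load cell trace (in newtons).

    Thresholds are fractions of body weight taken from the abstract; the
    direction of each comparison is an assumption of this sketch.
    """
    if not harness_force_n:
        raise ValueError("empty load cell trace")
    if max(harness_force_n) / body_weight_n > fall_threshold:
        return "fall"
    # Peak of the moving average over a window of about `window_s` seconds,
    # computed with a running sum.
    window = max(1, round(window_s * sample_rate_hz))
    window = min(window, len(harness_force_n))
    running = sum(harness_force_n[:window])
    best = running
    for k in range(window, len(harness_force_n)):
        running += harness_force_n[k] - harness_force_n[k - window]
        best = max(best, running)
    peak_average = best / window / body_weight_n
    return "harness-assisted" if peak_average > assist_threshold else "recovery"


if __name__ == "__main__":
    weight = 700.0  # body weight in newtons (made-up subject)
    rate = 10       # samples per second (assumed)
    print(classify_trial([0.0] * 10 + [300.0] * 2 + [0.0] * 10, weight, rate))
    print(classify_trial([50.0] * 20, weight, rate))
    print(classify_trial([10.0] * 20, weight, rate))
```

    Checking the peak force first mirrors the order in which the abstract presents the two criteria; a brief, sharp load identifies a fall, while a sustained low load identifies reliance on the harness.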

  3. AUTOMATIC RECOGNITION OF FALLS IN GAIT-SLIP: A HARNESS LOAD CELL BASED CRITERION

    Science.gov (United States)

    Yang, Feng; Pai, Yi-Chung

    2012-01-01

    Over-head-harness systems, equipped with load cell sensors, are essential to the participants’ safety and to the outcome assessment in perturbation training. The purpose of this study was to first develop an automatic outcome recognition criterion among young adults for gait-slip training and then verify such criterion among older adults. Each of 39 young and 71 older subjects, all protected by safety harness, experienced 8 unannounced, repeated slips, while walking on a 7-m walkway. Each trial was monitored with a motion capture system, bilateral ground reaction force (GRF), harness force and video recording. The fall trials were first unambiguously identified with careful visual inspection of all video records. The recoveries without balance loss (in which subjects’ trailing foot landed anteriorly to the slipping foot) were also first fully recognized from motion and GRF analyses. These analyses then set the gold standard for the outcome recognition with load cell measurements. Logistic regression analyses based on young subjects’ data revealed that peak load cell force was the best predictor of falls (with 100% accuracy) at the threshold of 30% body weight. On the other hand, the peak moving average force of the load cell across a 1-s period was the best predictor (with 100% accuracy) separating recoveries with backward balance loss (in which the recovery step landed posterior to the slipping foot) from harness assistance at the threshold of 4.5% body weight. These threshold values were fully verified using the data from older adults (100% accuracy in recognizing falls). Because of the increasing popularity of perturbation training coupled with the protective over-head-harness system, this new criterion could have far-reaching implications in automatic outcome recognition during movement therapy. PMID:21696744

  4. Morphological characterization of Mycobacterium tuberculosis in a MODS culture for an automatic diagnostics through pattern recognition.

    Directory of Open Access Journals (Sweden)

    Alicia Alva

    Tuberculosis control efforts are hampered by a mismatch in diagnostic technology: modern optimal diagnostic tests are least available in poor areas where they are needed most. Lack of adequate early diagnostics and MDR detection is a critical problem in control efforts. The Microscopic Observation Drug Susceptibility (MODS) assay uses visual recognition of cording patterns from Mycobacterium tuberculosis (MTB) to diagnose tuberculosis infection and drug susceptibility directly from a sputum sample in 7-10 days at low cost. An important limitation that laboratories in the developing world face in MODS implementation is the need for permanent technical staff with expertise in reading MODS. We developed a pattern recognition algorithm to automatically interpret MODS results from digital images. The algorithm, using image processing, feature extraction and pattern recognition, determined geometrical and illumination features used in an object model and a photo model to classify TB-positive images. 765 MODS digital photos were processed. The single-object model identified MTB (96.9% sensitivity and 96.3% specificity) and was able to discriminate non-tuberculous mycobacteria with a high specificity (97.1% M. avium, 99.1% M. chelonae, and 93.8% M. kansasii). The photo model identified TB-positive samples with 99.1% sensitivity and 99.7% specificity. This algorithm is a valuable tool that will enable automatic remote diagnosis using Internet or cellphone telephony. The use of this algorithm and its further implementation in a telediagnostics platform will contribute to both faster TB detection and MDR TB determination, leading to an earlier initiation of appropriate treatment.

  5. Automatic anatomy recognition in whole-body PET/CT images

    International Nuclear Information System (INIS)

    Wang, Huiqian; Udupa, Jayaram K.; Odhner, Dewey; Tong, Yubing; Torigian, Drew A.; Zhao, Liming

    2016-01-01

    Purpose: Whole-body positron emission tomography/computed tomography (PET/CT) has become a standard method of imaging patients with various disease conditions, especially cancer. Body-wide accurate quantification of disease burden in PET/CT images is important for characterizing lesions, staging disease, prognosticating patient outcome, planning treatment, and evaluating disease response to therapeutic interventions. However, body-wide anatomy recognition in PET/CT is a critical first step for accurately and automatically quantifying disease body-wide, body-region-wise, and organwise. This latter process, however, has remained a challenge due to the lower quality of the anatomic information portrayed in the CT component of this imaging modality and the paucity of anatomic details in the PET component. In this paper, the authors demonstrate the adaptation of a recently developed automatic anatomy recognition (AAR) methodology [Udupa et al., “Body-wide hierarchical fuzzy modeling, recognition, and delineation of anatomy in medical images,” Med. Image Anal. 18, 752–771 (2014)] to PET/CT images. Their goal was to test what level of object localization accuracy can be achieved on PET/CT compared to that achieved on diagnostic CT images. Methods: The authors advance the AAR approach in this work in three fronts: (i) from body-region-wise treatment in the work of Udupa et al. to whole body; (ii) from the use of image intensity in optimal object recognition in the work of Udupa et al. to intensity plus object-specific texture properties, and (iii) from the intramodality model-building-recognition strategy to the intermodality approach. The whole-body approach allows consideration of relationships among objects in different body regions, which was previously not possible. Consideration of object texture allows generalizing the previous optimal threshold-based fuzzy model recognition method from intensity images to any derived fuzzy membership image, and in the process

  6. Automatic anatomy recognition in whole-body PET/CT images

    Energy Technology Data Exchange (ETDEWEB)

    Wang, Huiqian [College of Optoelectronic Engineering, Chongqing University, Chongqing 400044, China and Medical Image Processing Group Department of Radiology, University of Pennsylvania, Philadelphia, Pennsylvania 19104 (United States); Udupa, Jayaram K., E-mail: jay@mail.med.upenn.edu; Odhner, Dewey; Tong, Yubing; Torigian, Drew A. [Medical Image Processing Group Department of Radiology, University of Pennsylvania, Philadelphia, Pennsylvania 19104 (United States); Zhao, Liming [Medical Image Processing Group Department of Radiology, University of Pennsylvania, Philadelphia, Pennsylvania 19104 and Research Center of Intelligent System and Robotics, Chongqing University of Posts and Telecommunications, Chongqing 400065 (China)

    2016-01-15

    Purpose: Whole-body positron emission tomography/computed tomography (PET/CT) has become a standard method of imaging patients with various disease conditions, especially cancer. Body-wide accurate quantification of disease burden in PET/CT images is important for characterizing lesions, staging disease, prognosticating patient outcome, planning treatment, and evaluating disease response to therapeutic interventions. However, body-wide anatomy recognition in PET/CT is a critical first step for accurately and automatically quantifying disease body-wide, body-region-wise, and organwise. This latter process, however, has remained a challenge due to the lower quality of the anatomic information portrayed in the CT component of this imaging modality and the paucity of anatomic details in the PET component. In this paper, the authors demonstrate the adaptation of a recently developed automatic anatomy recognition (AAR) methodology [Udupa et al., “Body-wide hierarchical fuzzy modeling, recognition, and delineation of anatomy in medical images,” Med. Image Anal. 18, 752–771 (2014)] to PET/CT images. Their goal was to test what level of object localization accuracy can be achieved on PET/CT compared to that achieved on diagnostic CT images. Methods: The authors advance the AAR approach in this work in three fronts: (i) from body-region-wise treatment in the work of Udupa et al. to whole body; (ii) from the use of image intensity in optimal object recognition in the work of Udupa et al. to intensity plus object-specific texture properties, and (iii) from the intramodality model-building-recognition strategy to the intermodality approach. The whole-body approach allows consideration of relationships among objects in different body regions, which was previously not possible. Consideration of object texture allows generalizing the previous optimal threshold-based fuzzy model recognition method from intensity images to any derived fuzzy membership image, and in the process

  7. Automatic Human Facial Expression Recognition Based on Integrated Classifier From Monocular Video with Uncalibrated Camera

    Directory of Open Access Journals (Sweden)

    Yu Tao

    2017-01-01

    An automatic recognition framework for human facial expressions from a monocular video with an uncalibrated camera is proposed. The expression characteristics are first acquired from a kind of deformable template, similar to a facial muscle distribution. After associated regularization, the time sequences from the trait changes in space-time under complete expressional production are then arranged line by line in a matrix. Next, the matrix dimensionality is reduced by neighborhood-preserving embedding, a manifold learning method. Finally, the refined matrix containing the expression trait information is recognized by a classifier that integrates the hidden conditional random field (HCRF) and support vector machine (SVM). In an experiment using the Cohn–Kanade database, the proposed method showed a comparatively higher recognition rate than the individual HCRF or SVM methods in direct recognition from two-dimensional human face traits. Moreover, the proposed method was shown to be more robust than the typical Kotsia method because the former contains more structural characteristics of the data to be classified in space-time.

  8. An analysis of machine translation and speech synthesis in speech-to-speech translation system

    OpenAIRE

    Hashimoto, K.; Yamagishi, J.; Byrne, W.; King, S.; Tokuda, K.

    2011-01-01

    This paper provides an analysis of the impacts of machine translation and speech synthesis on speech-to-speech translation systems. The speech-to-speech translation system consists of three components: speech recognition, machine translation and speech synthesis. Many techniques for integration of speech recognition and machine translation have been proposed. However, speech synthesis has not yet been considered. Therefore, in this paper, we focus on machine translation and speech synthesis, ...

  9. Development of Portable Automatic Number Plate Recognition System on Android Mobile Phone

    Science.gov (United States)

    Mutholib, Abdul; Gunawan, Teddy S.; Chebil, Jalel; Kartiwi, Mira

    2013-12-01

    The Automatic Number Plate Recognition (ANPR) system plays a main role in various access control and security applications, such as tracking of stolen vehicles, traffic violations (speed traps) and parking management systems. In this paper, a portable ANPR implemented on an Android mobile phone is presented. The main challenges in mobile applications include higher coding efficiency, reduced computational complexity, and improved flexibility. Significant efforts are being made to find suitable and adaptive algorithms for implementing ANPR on mobile phones. An ANPR system for a mobile phone needs to be optimized because of its limited CPU and memory resources, its ability to geo-tag captured images using GPS coordinates, and its ability to access an online database to store the vehicle's information. The design of the portable ANPR on an Android mobile phone is described as follows. First, a graphical user interface (GUI) for capturing images with the built-in camera was developed to acquire vehicle plate numbers in Malaysia. Second, the raw image was preprocessed using contrast enhancement. Next, character segmentation using fixed pitch and optical character recognition (OCR) using a neural network were used to extract text and numbers. Both character segmentation and OCR used the Tesseract library from Google Inc. The proposed portable ANPR algorithm was implemented and simulated using the Android SDK on a computer. Based on the experimental results, the proposed system can effectively recognize license plate numbers at a rate of 90.86%. The processing time required to recognize a license plate is only 2 seconds on average. The result is considered good in comparison with the results obtained from previous systems that ran on a desktop PC, which ranged from 91.59% to 98% recognition rate and from 0.284 seconds to 1.5 seconds recognition time.
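
    The abstract names contrast enhancement as the preprocessing step but does not say which method was used. As a minimal sketch, assuming simple linear (min-max) contrast stretching on 8-bit gray values, the idea can be illustrated as follows. The function name `stretch_contrast` and the sample values are hypothetical and are not taken from the paper or from the Tesseract library.

```python
def stretch_contrast(pixels):
    """Linearly rescale 8-bit gray values so they span the full 0..255 range.

    Assumption: min-max stretching stands in for the unspecified
    contrast enhancement step mentioned in the abstract.
    """
    lo, hi = min(pixels), max(pixels)
    if hi == lo:
        # A flat image has no contrast to stretch; return it unchanged.
        return list(pixels)
    # Integer arithmetic keeps every output inside 0..255.
    return [(p - lo) * 255 // (hi - lo) for p in pixels]


# A dull image row whose gray values only span 100..200 (made-up data).
row = [100, 150, 200]
stretched = stretch_contrast(row)
```

    After stretching, the darkest value maps to 0 and the brightest to 255, which is one plausible way to make plate characters easier to separate from the background before segmentation.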

  10. Review of Design of Speech Recognition and Text Analytics based Digital Banking Customer Interface and Future Directions of Technology Adoption

    OpenAIRE

    Saha, Amal K

    2017-01-01

    Banking is one of the most significant adopters of cutting-edge information technologies. Since its modern era beginning in the form of paper based accounting maintained in the branch, adoption of computerized system made it possible to centralize the processing in data centre and improve customer experience by making a more available and efficient system. The latest twist in this evolution is adoption of natural language processing and speech recognition in the user interface between the hum...

  11. Psychometrically equivalent bisyllabic words for speech recognition threshold testing in Vietnamese.

    Science.gov (United States)

    Harris, Richard W; McPherson, David L; Hanson, Claire M; Eggett, Dennis L

    2017-08-01

    This study identified, digitally recorded, edited and evaluated 89 bisyllabic Vietnamese words with the goal of identifying homogeneous words that could be used to measure the speech recognition threshold (SRT) in native talkers of Vietnamese. Native male and female talker productions of 89 Vietnamese bisyllabic words were recorded, edited and then presented at intensities ranging from -10 to 20 dBHL. Logistic regression was used to identify the best words for measuring the SRT. Forty-eight words were selected and digitally edited to have 50% intelligibility at a level equal to the mean pure-tone average (PTA) for normally hearing participants (5.2 dBHL). Twenty normally hearing native Vietnamese participants listened to and repeated bisyllabic Vietnamese words at intensities ranging from -10 to 20 dBHL. A total of 48 male and female talker recordings of bisyllabic words with steep psychometric functions (>9.0%/dB) were chosen for the final bisyllabic SRT list. Only words homogeneous with respect to threshold audibility with steep psychometric function slopes were chosen for the final list. Digital recordings of bisyllabic Vietnamese words are now available for use in measuring the SRT for patients whose native language is Vietnamese.
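
    The selection criteria above (50% intelligibility level and slope in %/dB) follow directly from a fitted logistic curve. The sketch below assumes the common two-parameter logistic form p(x) = 1 / (1 + exp(-(a + b x))), which the abstract does not spell out; the coefficients are invented for illustration and are not results from the study.

```python
import math


def psychometric(level_db, intercept, slope):
    """Proportion of correct responses at a presentation level (logistic model)."""
    return 1.0 / (1.0 + math.exp(-(intercept + slope * level_db)))


def threshold_50(intercept, slope):
    """Level at which the model predicts 50% intelligibility (a + b x = 0)."""
    return -intercept / slope


def slope_at_50_percent_per_db(slope):
    """Steepness of the curve at its midpoint, in percent per dB.

    The derivative of the logistic at p = 0.5 is b * 0.5 * 0.5 = b / 4.
    """
    return 100.0 * slope / 4.0


# Hypothetical fitted coefficients for one word.
a, b = -1.92, 0.40
word_threshold = threshold_50(a, b)
word_slope = slope_at_50_percent_per_db(b)
# Keep the word only if its slope exceeds the 9.0 %/dB criterion quoted above.
keep_word = word_slope > 9.0
```

    With these made-up coefficients the midpoint lies near 4.8 dB and the midpoint slope is about 10 %/dB, so the word would pass the steepness criterion.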

  12. Effects of Familiarity and Feeding on Newborn Speech-Voice Recognition

    Science.gov (United States)

    Valiante, A. Grace; Barr, Ronald G.; Zelazo, Philip R.; Brant, Rollin; Young, Simon N.

    2013-01-01

    Newborn infants preferentially orient to familiar over unfamiliar speech sounds. They are also better at remembering unfamiliar speech sounds for short periods of time if learning and retention occur after a feed than before. It is unknown whether short-term memory for speech is enhanced when the sound is familiar (versus unfamiliar) and, if so,…

  13. Semi-automatic parking slot marking recognition for intelligent parking assist systems

    Directory of Open Access Journals (Sweden)

    Ho Gi Jung

    2014-01-01

    This paper proposes a semi-automatic parking slot marking-based target position designation method for parking assist systems in cases where the parking slot markings are of a rectangular type, and its efficient implementation for real-time operation. After the driver observes a rearview image captured by a rearward camera installed at the rear of the vehicle through a touchscreen-based human machine interface, a target parking position is designated by touching the inside of a parking slot. To ensure the proposed method operates in real-time in an embedded environment, access of the bird's-eye view image is made efficient: image-wise batch transformation is replaced with pixel-wise instantaneous transformation. The proposed method showed a 95.5% recognition rate in 378 test cases with 63 test images. Additionally, experiments confirmed that the pixel-wise instantaneous transformation reduced execution time by 92%.
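
    The difference between image-wise and pixel-wise transformation can be sketched as below. This assumes the bird's-eye view is defined by a 3x3 planar homography, which is a common choice but is not stated in the abstract; the matrices and the function name `warp_point` are hypothetical.

```python
def warp_point(h, u, v):
    """Map one pixel (u, v) through a 3x3 homography h given as nested lists.

    Only the requested pixel is transformed, instead of warping the
    whole image up front.
    """
    w = h[2][0] * u + h[2][1] * v + h[2][2]
    if w == 0:
        raise ValueError("point maps to infinity under this homography")
    x = (h[0][0] * u + h[0][1] * v + h[0][2]) / w
    y = (h[1][0] * u + h[1][1] * v + h[1][2]) / w
    return x, y


# Made-up homographies: an affine one and one with a perspective term.
H_AFFINE = [[2, 0, 1], [0, 2, 3], [0, 0, 1]]
H_PERSPECTIVE = [[1, 0, 0], [0, 1, 0], [0, 0.5, 1]]

touch_affine = warp_point(H_AFFINE, 4, 5)
touch_perspective = warp_point(H_PERSPECTIVE, 2, 2)
```

    An algorithm that inspects only a few hundred pixels around the touched point then pays for a few hundred such mappings rather than for one mapping per pixel of the full frame, which is consistent with the kind of saving the abstract reports.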

  14. Right-Ear Advantage for Speech-in-Noise Recognition in Patients with Nonlateralized Tinnitus and Normal Hearing Sensitivity.

    Science.gov (United States)

    Tai, Yihsin; Husain, Fatima T

    2018-04-01

    Despite having normal hearing sensitivity, patients with chronic tinnitus may experience more difficulty recognizing speech in adverse listening conditions as compared to controls. However, the association between the characteristics of tinnitus (severity and loudness) and speech recognition remains unclear. In this study, the Quick Speech-in-Noise test (QuickSIN) was conducted monaurally on 14 patients with bilateral tinnitus and 14 age- and hearing-matched adults to determine the relation between tinnitus characteristics and speech understanding. Further, Tinnitus Handicap Inventory (THI), tinnitus loudness magnitude estimation, and loudness matching were obtained to better characterize the perceptual and psychological aspects of tinnitus. The patients reported low THI scores, with most participants in the slight handicap category. Significant between-group differences in speech-in-noise performance were only found at the 5-dB signal-to-noise ratio (SNR) condition. The tinnitus group performed significantly worse in the left ear than in the right ear, even though bilateral tinnitus percept and symmetrical thresholds were reported in all patients. This between-ear difference is likely influenced by a right-ear advantage for speech sounds, as factors related to testing order and fatigue were ruled out. Additionally, significant correlations found between SNR loss in the left ear and tinnitus loudness matching suggest that perceptual factors related to tinnitus had an effect on speech-in-noise performance, pointing to a possible interaction between peripheral and cognitive factors in chronic tinnitus. Further studies, that take into account both hearing and cognitive abilities of patients, are needed to better parse out the effect of tinnitus in the absence of hearing impairment.

  15. Long-term temporal tracking of speech rate affects spoken-word recognition.

    Science.gov (United States)

    Baese-Berk, Melissa M; Heffner, Christopher C; Dilley, Laura C; Pitt, Mark A; Morrill, Tuuli H; McAuley, J Devin

    2014-08-01

    Humans unconsciously track a wide array of distributional characteristics in their sensory environment. Recent research in spoken-language processing has demonstrated that the speech rate surrounding a target region within an utterance influences which words, and how many words, listeners hear later in that utterance. On the basis of hypotheses that listeners track timing information in speech over long timescales, we investigated the possibility that the perception of words is sensitive to speech rate over such a timescale (e.g., an extended conversation). Results demonstrated that listeners tracked variation in the overall pace of speech over an extended duration (analogous to that of a conversation that listeners might have outside the lab) and that this global speech rate influenced which words listeners reported hearing. The effects of speech rate became stronger over time. Our findings are consistent with the hypothesis that neural entrainment by speech occurs on multiple timescales, some lasting more than an hour. © The Author(s) 2014.

  16. Model-based vision system for automatic recognition of structures in dental radiographs

    Science.gov (United States)

    Acharya, Raj S.; Samarabandu, Jagath K.; Hausmann, E.; Allen, K. A.

    1991-07-01

    X-ray diagnosis of destructive periodontal disease requires assessing serial radiographs by an expert to determine the change in the distance between cemento-enamel junction (CEJ) and the bone crest. To achieve this without the subjectivity of a human expert, a knowledge based system is proposed to automatically locate the two landmarks which are the CEJ and the level of alveolar crest at its junction with the periodontal ligament space. This work is a part of an ongoing project to automatically measure the distance between CEJ and the bone crest along a line parallel to the axis of the tooth. The approach presented in this paper is based on identifying a prominent feature such as the tooth boundary using local edge detection and edge thresholding to establish a reference and then using model knowledge to process sub-regions in locating the landmarks. Segmentation techniques invoked around these regions consist of a neural-network-like hierarchical refinement scheme together with local gradient extraction, multilevel thresholding and ridge tracking. Recognition accuracy is further improved by first locating the easily identifiable parts of the bone surface and the interface between the enamel and the dentine and then extending these boundaries towards the periodontal ligament space and the tooth boundary respectively. The system is realized as a collection of tools (or knowledge sources) for pre-processing, segmentation, primary and secondary feature detection and a control structure based on the blackboard model to coordinate the activities of these tools.

  17. A rapid automatic analyzer and its methodology for effective bentonite content based on image recognition technology

    Directory of Open Access Journals (Sweden)

    Wei Long

    2016-09-01

    Fast and accurate determination of effective bentonite content in used clay bonded sand is very important for selecting the correct mixing ratio and mixing process to obtain high-performance molding sand. Currently, the effective bentonite content is determined by testing the methylene blue absorbed in used clay bonded sand, which is usually a manual operation with disadvantages including a complicated process, long testing time and low accuracy. A rapid automatic analyzer of the effective bentonite content in used clay bonded sand was developed based on image recognition technology. The instrument consists of auto stirring, auto liquid removal, auto titration, step-rotation and image acquisition components, and a processor. The principle of the image recognition method is first to decompose the color images into three-channel gray images based on the difference in photosensitivity of light blue and dark blue in the red, green and blue channels, then to apply gray value subtraction and gray level transformation to the gray images, and finally to extract the outer light blue halo and the inner blue spot and calculate their area ratio. The titration is judged to have reached the end-point when the area ratio exceeds the set value.
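
    The end-point test described above can be sketched on a toy image. This is only an illustration under stated assumptions: the thresholds, the color palette, the end-point value and the function names are all hypothetical, and a per-pixel rule stands in for the gray level transformation used by the actual instrument.

```python
def classify_pixel(rgb, blue_margin=40, dark_limit=140):
    """Label one RGB pixel as 'spot' (dark blue), 'halo' (light blue) or 'background'.

    Both thresholds are hypothetical values chosen for this toy example.
    """
    r, g, b = rgb
    # Channel subtraction: blue-tinted pixels have a blue value well above red.
    if b - r < blue_margin:
        return "background"
    # In this toy palette, dark blue has a lower green value than light blue.
    return "spot" if g < dark_limit else "halo"


def halo_to_spot_ratio(image):
    """Area of the light blue halo divided by the area of the dark blue spot."""
    counts = {"spot": 0, "halo": 0, "background": 0}
    for row in image:
        for rgb in row:
            counts[classify_pixel(rgb)] += 1
    if counts["spot"] == 0:
        raise ValueError("no dark blue spot found")
    return counts["halo"] / counts["spot"]


W = (240, 240, 240)  # filter paper background
L = (150, 200, 255)  # light blue halo
D = (20, 40, 160)    # dark blue spot
image = [
    [W, L, W],
    [L, D, L],
    [W, L, W],
]
ratio = halo_to_spot_ratio(image)
END_POINT_RATIO = 3.0  # hypothetical set value
end_point_reached = ratio > END_POINT_RATIO
```

    Counting pixels per class gives the two areas directly, so the end-point decision reduces to comparing one ratio against the configured set value.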

  18. Robust Automatic Target Recognition via HRRP Sequence Based on Scatterer Matching

    Directory of Open Access Journals (Sweden)

    Yuan Jiang

    2018-02-01

    High resolution range profile (HRRP) plays an important role in wideband radar automatic target recognition (ATR). In order to alleviate the sensitivity to clutter and target aspect, employing a sequence of HRRP is a promising approach to enhance the ATR performance. In this paper, a novel HRRP sequence-matching method based on singular value decomposition (SVD) is proposed. First, the HRRP sequence is decoupled into the angle space and the range space via SVD, which correspond to the span of the left and the right singular vectors, respectively. Second, atomic norm minimization (ANM) is utilized to estimate dominant scatterers in the range space and the Hausdorff distance is employed to measure the scatter similarity between the test and training data. Next, the angle space similarity between the test and training data is evaluated based on the left singular vector correlations. Finally, the range space matching result and the angle space correlation are fused with the singular values as weights. Simulation and outfield experimental results demonstrate that the proposed matching metric is a robust similarity measure for HRRP sequence recognition.
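
    The Hausdorff distance used in the second step can be illustrated on its own. The sketch below treats each target as a list of estimated scatterer range positions and uses the absolute difference as the point distance; the positions are made-up numbers, and the SVD, ANM and fusion steps of the paper are not reproduced here.

```python
def directed_hausdorff(a, b):
    """Largest distance from a point in a to its nearest point in b."""
    return max(min(abs(x - y) for y in b) for x in a)


def hausdorff(a, b):
    """Symmetric Hausdorff distance between two non-empty point sets."""
    return max(directed_hausdorff(a, b), directed_hausdorff(b, a))


# Hypothetical scatterer range positions (in meters) for a test and a
# training profile.
test_scatterers = [1.0, 4.0, 9.0]
train_scatterers = [1.5, 4.0]

forward = directed_hausdorff(test_scatterers, train_scatterers)
backward = directed_hausdorff(train_scatterers, test_scatterers)
distance = hausdorff(test_scatterers, train_scatterers)
```

    Because the measure takes the worst-case nearest-neighbor gap in both directions, a scatterer present in one set but missing from the other (9.0 m here) dominates the result, and the two sets do not need to contain the same number of scatterers.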

  19. A Compact Methodology to Understand, Evaluate, and Predict the Performance of Automatic Target Recognition

    Science.gov (United States)

    Li, Yanpeng; Li, Xiang; Wang, Hongqiang; Chen, Yiping; Zhuang, Zhaowen; Cheng, Yongqiang; Deng, Bin; Wang, Liandong; Zeng, Yonghu; Gao, Lei

    2014-01-01

    This paper offers a compacted mechanism to carry out the performance evaluation work for an automatic target recognition (ATR) system: (a) a standard description of the ATR system's output is suggested, a quantity to indicate the operating condition is presented based on the principle of feature extraction in pattern recognition, and a series of indexes to assess the output in different aspects are developed with the application of statistics; (b) performance of the ATR system is interpreted by a quality factor based on knowledge of engineering mathematics; (c) through a novel utility called “context-probability” estimation proposed based on probability, performance prediction for an ATR system is realized. The simulation result shows that the performance of an ATR system can be accounted for and forecasted by the above-mentioned measures. Compared to existing technologies, the novel method can offer more objective performance conclusions for an ATR system. These conclusions may be helpful in knowing the practical capability of the tested ATR system. At the same time, the generalization performance of the proposed method is good. PMID:24967605

  20. Automatic recognition of severity level for diagnosis of diabetic retinopathy using deep visual features.

    Science.gov (United States)

    Abbas, Qaisar; Fondon, Irene; Sarmiento, Auxiliadora; Jiménez, Soledad; Alemany, Pedro

    2017-11-01

    Diabetic retinopathy (DR) is a leading cause of blindness among diabetic patients. Recognition of the severity level is required by ophthalmologists to detect and diagnose DR early. However, it is a challenging task for both medical experts and computer-aided diagnosis systems because it requires extensive domain expert knowledge. In this article, a novel automatic recognition system for the five severity levels of diabetic retinopathy (SLDR) is developed without performing any pre- and post-processing steps on retinal fundus images through learning of deep visual features (DVFs). These DVFs are extracted from each image by using color dense in scale-invariant and gradient location-orientation histogram techniques. To learn these DVFs, a semi-supervised multilayer deep-learning algorithm is utilized along with a new compressed layer and fine-tuning steps. This SLDR system was evaluated and compared with state-of-the-art techniques using the measures of sensitivity (SE), specificity (SP) and area under the receiver operating characteristic curve (AUC). On 750 fundus images (150 per category), an SE of 92.18%, an SP of 94.50% and an AUC of 0.924 were obtained on average. These results demonstrate that the SLDR system is appropriate for early detection of DR and provide an effective treatment for prediction type of diabetes.
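
    For readers unfamiliar with the reported measures, sensitivity and specificity are computed from confusion counts as sketched below. The counts are invented round numbers for one class treated one-versus-rest; they are not data from the study.

```python
def sensitivity(tp, fn):
    """Fraction of actual positives that were detected: TP / (TP + FN)."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """Fraction of actual negatives that were correctly rejected: TN / (TN + FP)."""
    return tn / (tn + fp)


# Hypothetical counts for a single severity class versus all others.
tp, fn = 90, 10    # images of this class: found vs. missed
tn, fp = 180, 20   # images of other classes: rejected vs. false alarms

se = sensitivity(tp, fn)
sp = specificity(tn, fp)
```

    In a five-class setting such as the one described, these values would typically be computed per class and then averaged, which matches the abstract's "on average" wording, although the exact averaging scheme is not stated there.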

  1. Automatic data-driven real-time segmentation and recognition of surgical workflow.

    Science.gov (United States)

    Dergachyova, Olga; Bouget, David; Huaulmé, Arnaud; Morandi, Xavier; Jannin, Pierre

    2016-06-01

    With the intention of extending the perception and action of surgical staff inside the operating room, the medical community has expressed a growing interest towards context-aware systems. Requiring an accurate identification of the surgical workflow, such systems make use of data from a diverse set of available sensors. In this paper, we propose a fully data-driven and real-time method for segmentation and recognition of surgical phases using a combination of video data and instrument usage signals, exploiting no prior knowledge. We also introduce new validation metrics for assessment of workflow detection. The segmentation and recognition are based on a four-stage process. Firstly, during the learning time, a Surgical Process Model is automatically constructed from data annotations to guide the following process. Secondly, data samples are described using a combination of low-level visual cues and instrument information. Then, in the third stage, these descriptions are employed to train a set of AdaBoost classifiers capable of distinguishing one surgical phase from others. Finally, AdaBoost responses are used as input to a Hidden semi-Markov Model in order to obtain a final decision. On the MICCAI EndoVis challenge laparoscopic dataset we achieved a precision and a recall of 91 % in classification of 7 phases. Compared to the analysis based on one data type only, a combination of visual features and instrument signals allows better segmentation, reduction of the detection delay and discovery of the correct phase order.

  2. The speech signal segmentation algorithm using pitch synchronous analysis

    Directory of Open Access Journals (Sweden)

    Amirgaliyev Yedilkhan

    2017-03-01

    Parameterization of the speech signal using analysis algorithms synchronized with the pitch frequency is discussed. Speech parameterization is performed by the average number of zero transitions function and the signal energy function. Parameterization results are used to segment the speech signal and to isolate the segments with stable spectral characteristics. Segmentation results can be used to generate a digital voice pattern of a person or be applied in automatic speech recognition. Stages needed for continuous speech segmentation are described.
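
    The two parameterization functions named above can be sketched as follows. As a simplification, the example uses fixed-length frames rather than frames synchronized with the pitch period as in the paper; the function names and the sample signal are hypothetical.

```python
def zero_crossing_rate(frame):
    """Average number of sign changes per pair of neighboring samples."""
    if len(frame) < 2:
        raise ValueError("frame needs at least two samples")
    crossings = sum(
        1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0)
    )
    return crossings / (len(frame) - 1)


def energy(frame):
    """Short-time energy: sum of squared sample values."""
    return sum(x * x for x in frame)


def split_frames(signal, size):
    """Cut the signal into consecutive non-overlapping frames of equal size."""
    return [signal[i:i + size] for i in range(0, len(signal) - size + 1, size)]


# Made-up signal: a flat stretch followed by a rapidly alternating stretch.
signal = [0.5, 0.5, 0.5, 0.5, 1.0, -1.0, 1.0, -1.0]
features = [(zero_crossing_rate(f), energy(f)) for f in split_frames(signal, 4)]
```

    A jump in either value between neighboring frames marks a candidate segment boundary, while runs of frames with similar values correspond to the stable segments the abstract refers to.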

  3. An Efficient Multimodal 2D + 3D Feature-based Approach to Automatic Facial Expression Recognition

    KAUST Repository

    Li, Huibin

    2015-07-29

    We present a fully automatic multimodal 2D + 3D feature-based facial expression recognition approach and demonstrate its performance on the BU-3DFE database. Our approach combines multi-order gradient-based local texture and shape descriptors in order to achieve efficiency and robustness. First, a large set of fiducial facial landmarks of 2D face images along with their 3D face scans are localized using a novel algorithm namely incremental Parallel Cascade of Linear Regression (iPar-CLR). Then, a novel Histogram of Second Order Gradients (HSOG) based local image descriptor in conjunction with the widely used first-order gradient based SIFT descriptor are used to describe the local texture around each 2D landmark. Similarly, the local geometry around each 3D landmark is described by two novel local shape descriptors constructed using the first-order and the second-order surface differential geometry quantities, i.e., Histogram of mesh Gradients (meshHOG) and Histogram of mesh Shape index (curvature quantization, meshHOS). Finally, the Support Vector Machine (SVM) based recognition results of all 2D and 3D descriptors are fused at both feature-level and score-level to further improve the accuracy. Comprehensive experimental results demonstrate that there exist impressive complementary characteristics between the 2D and 3D descriptors. We use the BU-3DFE benchmark to compare our approach to the state-of-the-art ones. Our multimodal feature-based approach outperforms the others by achieving an average recognition accuracy of 86.32%. Moreover, a good generalization ability is shown on the Bosphorus database.

  4. An Efficient Multimodal 2D + 3D Feature-based Approach to Automatic Facial Expression Recognition

    KAUST Repository

    Li, Huibin; Ding, Huaxiong; Huang, Di; Wang, Yunhong; Zhao, Xi; Morvan, Jean-Marie; Chen, Liming

    2015-01-01

    We present a fully automatic multimodal 2D + 3D feature-based facial expression recognition approach and demonstrate its performance on the BU-3DFE database. Our approach combines multi-order gradient-based local texture and shape descriptors in order to achieve efficiency and robustness. First, a large set of fiducial facial landmarks of 2D face images along with their 3D face scans are localized using a novel algorithm namely incremental Parallel Cascade of Linear Regression (iPar-CLR). Then, a novel Histogram of Second Order Gradients (HSOG) based local image descriptor in conjunction with the widely used first-order gradient based SIFT descriptor are used to describe the local texture around each 2D landmark. Similarly, the local geometry around each 3D landmark is described by two novel local shape descriptors constructed using the first-order and the second-order surface differential geometry quantities, i.e., Histogram of mesh Gradients (meshHOG) and Histogram of mesh Shape index (curvature quantization, meshHOS). Finally, the Support Vector Machine (SVM) based recognition results of all 2D and 3D descriptors are fused at both feature-level and score-level to further improve the accuracy. Comprehensive experimental results demonstrate that there exist impressive complementary characteristics between the 2D and 3D descriptors. We use the BU-3DFE benchmark to compare our approach to the state-of-the-art ones. Our multimodal feature-based approach outperforms the others by achieving an average recognition accuracy of 86.32%. Moreover, a good generalization ability is shown on the Bosphorus database.

  5. Intelligent Automatic Right-Left Sign Lamp Based on Brain Signal Recognition System

    Science.gov (United States)

    Winda, A.; Sofyan; Sthevany; Vincent, R. S.

    2017-12-01

    Comfort, as part of the human factor, plays an important role in today's advanced automotive technology. Many current technologies move in the direction of automotive driver assistance features. However, many driver assistance features still require physical movement by a human to enable them. In this work, the proposed method allows a certain feature to function without any physical movement; instead, the human just needs to think about it. A brain signal is recorded and processed to serve as input to the recognition system. A Right-Left sign lamp based on the brain signal recognition system can potentially replace the button or switch of the specific device that makes the lamp work. The system then decides whether the signal is ‘Right’ or ‘Left’. The Right-Left decision of the brain signal recognition is sent to a processing board to activate the automotive relay, which in turn activates the sign lamp. Furthermore, an intelligent system approach is used to develop an authorized model based on the brain signal. In particular, a Support Vector Machine (SVM)-based classification system is used in the proposed system to recognize the Left-Right of the brain signal. Experimental results confirm the effectiveness of the proposed intelligent automatic brain signal-based Right-Left sign lamp access control system. The signal is processed by Linear Prediction Coefficients (LPC) and SVMs, and the experiment shows training and testing accuracies of 100% and 80%, respectively.

  6. Automatic content linking: Speech-based just-in-time retrieval for multimedia archives

    NARCIS (Netherlands)

    Popescu-Belis, A.; Kilgour, J.; Poller, P.; Nanchen, A.; Boertjes, E.; Wit, J. de

    2010-01-01

    The Automatic Content Linking Device monitors a conversation and uses automatically recognized words to retrieve documents that are of potential use to the participants. The document set includes project related reports or emails, transcribed snippets of past meetings, and websites. Retrieval

  7. Speech recognition software and electronic psychiatric progress notes: physicians' ratings and preferences

    Directory of Open Access Journals (Sweden)

    Derman Yaron D

    2010-08-01

    Background: The context of the current study was mandatory adoption of electronic clinical documentation within a large mental health care organization. Psychiatric electronic documentation has unique needs by the nature of dense narrative content. Our goal was to determine if speech recognition (SR) would ease the creation of electronic progress note (ePN) documents by physicians at our institution. Methods: Subjects: Twelve physicians had access to SR software on their computers for a period of four weeks to create ePN. Measurements: We examined SR software in relation to its perceived usability, data entry time savings, impact on the quality of care and quality of documentation, and the impact on clinical and administrative workflow, as compared to existing methods for data entry. Data analysis: A series of Wilcoxon signed rank tests were used to compare pre- and post-SR measures. A qualitative study design was used. Results: Six of twelve participants completing the study favoured the use of SR (five with SR alone plus one with SR via hand-held digital recorder) for creating electronic progress notes over their existing mode of data entry. There was no clear perceived benefit from SR in terms of data entry time savings, quality of care, quality of documentation, or impact on clinical and administrative workflow. Conclusions: Although our findings are mixed, SR may be a technology with some promise for mental health documentation. Future investigations of this nature should use more participants, a broader range of document types, and compare front- and back-end SR methods.

  8. Automated recognition of helium speech. Phase I: Investigation of microprocessor based analysis/synthesis system

    Science.gov (United States)

    Jelinek, H. J.

    1986-01-01

    This is the Final Report of Electronic Design Associates on its Phase I SBIR project. The purpose of this project is to develop a method for correcting helium speech, as experienced in diver-surface communication. The goal of the Phase I study was to design, prototype, and evaluate a real-time helium speech corrector system based upon digital signal processing techniques. The general approach was to develop hardware (an IBM PC board) to digitize helium speech and software (a LAMBDA computer based simulation) to translate the speech. As planned in the study proposal, this initial prototype may now be used to assess expected performance from a self-contained real-time system which uses an identical algorithm. The Final Report details the work carried out to produce the prototype system. Four major project tasks were completed. A signal processing scheme for converting helium speech to normal sounding speech was generated. The signal processing scheme was simulated on a general purpose (LAMBDA) computer. Actual helium speech was supplied to the simulation and the converted speech was generated. An IBM-PC based 14 bit data Input/Output board was designed and built. A bibliography of references on speech processing was generated.

  9. Speech endpoint detection with non-language speech sounds for generic speech processing applications

    Science.gov (United States)

    McClain, Matthew; Romanowski, Brian

    2009-05-01

    Non-language speech sounds (NLSS) are sounds produced by humans that do not carry linguistic information. Examples of these sounds are coughs, clicks, breaths, and filled pauses such as "uh" and "um" in English. NLSS are prominent in conversational speech, but can be a significant source of errors in speech processing applications. Traditionally, these sounds are ignored by speech endpoint detection algorithms, where speech regions are identified in the audio signal prior to processing. The ability to filter NLSS as a pre-processing step can significantly enhance the performance of many speech processing applications, such as speaker identification, language identification, and automatic speech recognition. In order to be used in all such applications, NLSS detection must be performed without the use of language models that provide knowledge of the phonology and lexical structure of speech. This is especially relevant to situations where the languages used in the audio are not known a priori. We present the results of preliminary experiments using data from American and British English speakers, in which segments of audio are classified as language speech sounds (LSS) or NLSS using a set of acoustic features designed for language-agnostic NLSS detection and a hidden Markov model (HMM) to model speech generation. The results of these experiments indicate that the features and model used are capable of detecting certain types of NLSS, such as breaths and clicks, while detection of other types of NLSS such as filled pauses will require future research.

  10. Speech perception for adult cochlear implant recipients in a realistic background noise: effectiveness of preprocessing strategies and external options for improving speech recognition in noise.

    Science.gov (United States)

    Gifford, René H; Revit, Lawrence J

    2010-01-01

    Although cochlear implant patients are achieving increasingly higher levels of performance, speech perception in noise continues to be problematic. The newest generations of implant speech processors are equipped with preprocessing and/or external accessories that are purported to improve listening in noise. Most speech perception measures in the clinical setting, however, do not provide a close approximation to real-world listening environments. The purpose of this study was to assess speech perception for adult cochlear implant recipients in the presence of a realistic restaurant simulation generated by an eight-loudspeaker (R-SPACE) array in order to determine whether commercially available preprocessing strategies and/or external accessories yield improved sentence recognition in noise. The study used a single-subject, repeated-measures design with two groups of participants: Advanced Bionics and Cochlear Corporation recipients. Thirty-four subjects, ranging in age from 18 to 90 yr (mean 54.5 yr), participated in this prospective study. Fourteen subjects were Advanced Bionics recipients, and 20 subjects were Cochlear Corporation recipients. Speech reception thresholds (SRTs) in semidiffuse restaurant noise originating from an eight-loudspeaker array were assessed with the subjects' preferred listening programs as well as with the addition of either Beam preprocessing (Cochlear Corporation) or the T-Mic accessory option (Advanced Bionics). In Experiment 1, adaptive SRTs with the Hearing in Noise Test sentences were obtained for all 34 subjects. For Cochlear Corporation recipients, SRTs were obtained with their preferred everyday listening program as well as with the addition of Focus preprocessing. For Advanced Bionics recipients, SRTs were obtained with the integrated behind-the-ear (BTE) mic as well as with the T-Mic. Statistical analysis using a repeated-measures analysis of variance (ANOVA) evaluated the effects of the preprocessing strategy or external accessory in reducing the SRT in noise. In addition

  11. Acceptance of speech recognition by physicians: A survey of expectations, experiences, and social influence

    DEFF Research Database (Denmark)

    Alapetite, Alexandre; Andersen, Henning Boje; Hertzum, Morten

    2009-01-01

    The present study has surveyed physician views and attitudes before and after the introduction of speech technology as a front end to an electronic medical record. At the hospital where the survey was made, speech technology recently (2006–2007) replaced traditional dictation and subsequent secre...

  12. Speech activity detection for the automated speaker recognition system of critical use

    Directory of Open Access Journals (Sweden)

    M. M. Bykov

    2017-06-01

    Full Text Available In the article, the authors develop a method for detecting speech activity for an automated speaker recognition system of critical use, with wavelet parameterization of the speech signal and classification into “language”/“pause” intervals using a curvilinear neural network. The proposed wavelet parameterization method allows choosing the optimal parameters of the wavelet transformation in accordance with a user-specified error of representation of the speech signal. The method also allows estimating the loss of information depending on the selected parameters of the continuous wavelet transformation (NPP), which made it possible to reduce the number of scalable coefficients of the LVP of the speech signal by an order of magnitude with an allowable degree of distortion of the local spectrum of the LVP. An algorithm for detecting speech activity with a curvilinear neural network classifier is also proposed; it shows high-quality segmentation of speech signals into “language”/“pause” intervals and is resistant to narrowband noise and technogenic noise in the speech signal, owing to the inherent properties of the curvilinear neural network.

  13. Predicting the effect of spectral subtraction on the speech recognition threshold based on the signal-to-noise ratio in the envelope domain

    DEFF Research Database (Denmark)

    Jørgensen, Søren; Dau, Torsten

    2011-01-01

    rarely been evaluated perceptually in terms of speech intelligibility. This study analyzed the effects of the spectral subtraction strategy proposed by Berouti et al. [ICASSP 4 (1979), 208-211] on the speech recognition threshold (SRT) obtained with sentences presented in stationary speech-shaped noise....... The SRT was measured in five normal-hearing listeners in six conditions of spectral subtraction. The results showed an increase of the SRT after processing, i.e. a decreased speech intelligibility, in contrast to what is predicted by the Speech Transmission Index (STI). Here, another approach is proposed......, denoted the speech-based envelope power spectrum model (sEPSM) which predicts the intelligibility based on the signal-to-noise ratio in the envelope domain. In contrast to the STI, the sEPSM is sensitive to the increased amount of the noise envelope power as a consequence of the spectral subtraction...
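
    The spectral subtraction strategy referred to in the abstract above is commonly described as power-spectrum subtraction with an over-subtraction factor and a spectral floor. The sketch below applies that rule to one frame of per-bin power values. The function name, the parameter names `alpha` and `beta`, and their default values are assumptions for illustration and are not the six conditions tested in the study.

```python
def spectral_subtract(noisy_power, noise_power, alpha=4.0, beta=0.01):
    """Subtract alpha times the noise power estimate from each bin of
    the noisy power spectrum, never going below a floor of beta times
    the noise power. Inputs are per-bin power values for one frame."""
    if len(noisy_power) != len(noise_power):
        raise ValueError("spectra must have the same number of bins")
    cleaned = []
    for y, n in zip(noisy_power, noise_power):
        difference = y - alpha * n   # over-subtraction
        floor = beta * n             # spectral floor
        cleaned.append(difference if difference > floor else floor)
    return cleaned


# Hypothetical two-bin frame: the first bin is dominated by speech,
# the second by noise, so the second bin falls back to the floor.
print(spectral_subtract([10.0, 1.0], [1.0, 1.0], alpha=2.0, beta=0.1))
```

    Larger values of `alpha` remove more noise power but also more of the speech, which is one way to read the reported increase of the SRT after processing.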

  14. An investigation and comparison of speech recognition software for determining if bird song recordings contain legible human voices

    Directory of Open Access Journals (Sweden)

    Tim D. Hunt

    Full Text Available The purpose of this work was to test the effectiveness of using readily available speech recognition API services to determine if recordings of bird song had inadvertently recorded human voices. A mobile phone was used to record a human speaking at increasing distances from the phone in an outside setting with bird song occurring in the background. One of the services was trained with sample recordings, and the services were compared on their ability to return recognized words. The services from Google and IBM performed similarly, and the Microsoft service, which allowed training, performed slightly better. However, all three services failed to perform at a level that would enable recordings with recognizable human speech to be deleted in order to maintain full privacy protection.

  15. OLIVE: Speech-Based Video Retrieval

    NARCIS (Netherlands)

    de Jong, Franciska M.G.; Gauvain, Jean-Luc; den Hartog, Jurgen; den Hartog, Jeremy; Netter, Klaus

    1999-01-01

    This paper describes the Olive project which aims to support automated indexing of video material by use of human language technologies. Olive is making use of speech recognition to automatically derive transcriptions of the sound tracks, generating time-coded linguistic elements which serve as the

  16. A distributed approach to speech resource collection

    CSIR Research Space (South Africa)

    Molapo, R

    2013-12-01

    Full Text Available The authors describe the integration of several tools to enable the end-to-end development of an Automatic Speech Recognition system in a typical under-resourced language. The authors analyse the data acquired by each of the tools and develop an ASR...

  17. Noise robust automatic speech recognition with adaptive quantile based noise estimation and speech band emphasizing filter bank

    DEFF Research Database (Denmark)

    Bonde, Casper Stork; Graversen, Carina; Gregersen, Andreas Gregers

    2005-01-01

    and standard MFCC. AQBNE also outperforms the Aurora Baseline for the Medium Mismatch (MM) and Well Matched (WM) conditions. Although the Aurora Advanced Frontend achieves superior performance for all three conditions, AQBNE is still a relevant method to consider for small-footprint applications....
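
    The record above is truncated and does not describe how AQBNE works. As general background only, plain (non-adaptive) quantile-based noise estimation takes, for each frequency bin, a fixed quantile of the magnitudes observed over a window of frames, on the assumption that each bin is free of speech for part of the time. The sketch below shows that basic idea; the function name, the default `q`, and the sample frames are hypothetical, and the adaptive part of the authors' method is not reproduced.

```python
def quantile_noise_estimate(frames, q=0.5):
    """Estimate the noise magnitude in each frequency bin as the q-th
    quantile of that bin's magnitudes over all frames.
    `frames` is a list of frames, each a list of per-bin magnitudes."""
    if not frames:
        raise ValueError("at least one frame is required")
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must lie in [0, 1]")
    estimate = []
    for b in range(len(frames[0])):
        values = sorted(frame[b] for frame in frames)
        index = int(q * (len(values) - 1))  # lower-neighbour quantile
        estimate.append(values[index])
    return estimate


# Hypothetical magnitudes: five frames, two bins. The burst of 100 in
# the first bin stands in for speech and barely moves the median.
frames = [[1, 10], [2, 20], [3, 30], [100, 40], [2, 25]]
print(quantile_noise_estimate(frames, q=0.5))
```

    Because it only sorts values per bin, the estimate needs no explicit speech/pause decision, which fits the small-footprint use case mentioned in the abstract.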

  18. Advancing Noise Robust Automatic Speech Recognition for Command and Control Applications

    National Research Council Canada - National Science Library

    Bass, James D

    2006-01-01

    .... The reliable elimination of the keyboard and mouse in mounted and un-mounted C2 systems has been a desire of systems developers and requirements writers since the development of PC-based ASR systems in the early 1990...

  19. Towards Robust Visual Speech Recognition : Automatic Systems for Lip Reading of Dutch

    NARCIS (Netherlands)

    Chitu, A.G.

    2010-01-01

    In the last two decades we witnessed a rapid increase in computational power governed by Moore's Law. As a side effect, cheaper and faster CPUs became more affordable as well. Therefore, many new “smart” devices flooded the market and made information systems widespread. The number

  20. Advances in image compression and automatic target recognition; Proceedings of the Meeting, Orlando, FL, Mar. 30, 31, 1989

    Science.gov (United States)

    Tescher, Andrew G. (Editor)

    1989-01-01

    Various papers on image compression and automatic target recognition are presented. Individual topics addressed include: target cluster detection in cluttered SAR imagery, model-based target recognition using laser radar imagery, Smart Sensor front-end processor for feature extraction of images, object attitude estimation and tracking from a single video sensor, symmetry detection in human vision, analysis of high resolution aerial images for object detection, obscured object recognition for an ATR application, neural networks for adaptive shape tracking, statistical mechanics and pattern recognition, detection of cylinders in aerial range images, moving object tracking using local windows, new transform method for image data compression, quad-tree product vector quantization of images, predictive trellis encoding of imagery, reduced generalized chain code for contour description, compact architecture for a real-time vision system, use of human visibility functions in segmentation coding, color texture analysis and synthesis using Gibbs random fields.